METHODS IN ENZYMOLOGY
VOLUME FOUR HUNDRED AND SIXTY-SEVEN
Computer Methods, Part B

EDITED BY

MICHAEL L. JOHNSON
University of Virginia Health Sciences Center, Department of Pharmacology, Charlottesville, Virginia, USA

LUDWIG BRAND
Department of Biology, Johns Hopkins University, Baltimore, Maryland, USA

EDITORS-IN-CHIEF

JOHN N. ABELSON AND MELVIN I. SIMON
Division of Biology, California Institute of Technology, Pasadena, California, USA

FOUNDING EDITORS

SIDNEY P. COLOWICK AND NATHAN O. KAPLAN

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS
SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Academic Press is an imprint of Elsevier
Academic Press is an imprint of Elsevier
525 B Street, Suite 1900, San Diego, CA 92101-4495, USA
30 Corporate Drive, Suite 400, Burlington, MA 01803, USA
32 Jamestown Road, London NW1 7BY, UK

First edition 2009

Copyright © 2009, Elsevier Inc. All Rights Reserved.

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means electronic, mechanical, photocopying, recording or otherwise without the prior written permission of the publisher.

Permissions may be sought directly from Elsevier's Science & Technology Rights Department in Oxford, UK: phone (+44) (0) 1865 843830; fax (+44) (0) 1865 853333; email: permissions@elsevier.com. Alternatively you can submit your request online by visiting the Elsevier web site at http://elsevier.com/locate/permissions, and selecting Obtaining permission to use Elsevier material.

Notice
No responsibility is assumed by the publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.

For information on all Academic Press publications visit our website at elsevierdirect.com

ISBN: 978-0-12-375023-5
ISSN: 0076-6879

Printed and bound in the United States of America
09 10 11 12    10 9 8 7 6 5 4 3 2 1
CONTENTS

Contributors
Preface
Volumes in Series

1. Correlation Analysis: A Tool for Comparing Relaxation-Type Models to Experimental Data
Maurizio Tomaiuolo, Joel Tabak, and Richard Bertram
   1. Introduction
   2. Scatter Plots and Correlation Analysis
   3. Example 1: Relaxation Oscillations
   4. Example 2: Square Wave Bursting
   5. Example 3: Elliptic Bursting
   6. Example 4: Using Correlation Analysis on Experimental Data
   7. Summary
   Acknowledgment
   References

2. Trait Variability of Cancer Cells Quantified by High-Content Automated Microscopy of Single Cells
Vito Quaranta, Darren R. Tyson, Shawn P. Garbett, Brandy Weidow, Mark P. Harris, and Walter Georgescu
   1. Introduction
   2. Background
   3. Experimental and Computational Workflow
   4. Application to Traits Relevant to Cancer Progression
   5. Conclusions
   Acknowledgments
   References

3. Matrix Factorization for Recovery of Biological Processes from Microarray Data
Andrew V. Kossenkov and Michael F. Ochs
   1. Introduction
   2. Overview of Methods
   3. Application to the Rosetta Compendium
   4. Results of Analyses
   5. Discussion
   References

4. Modeling and Simulation of the Immune System as a Self-Regulating Network
Peter S. Kim, Doron Levy, and Peter P. Lee
   1. Introduction
   2. Mathematical Modeling of the Immune Network
   3. Two Examples of Models to Understand T Cell Regulation
   4. How to Implement Mathematical Models in Computer Simulations
   5. Concluding Remarks
   Acknowledgments
   References

5. Entropy Demystified: The "Thermo"-dynamics of Stochastically Fluctuating Systems
Hong Qian
   1. Introduction
   2. Energy
   3. Entropy and "Thermo"-dynamics of Markov Processes
   4. A Three-State Two-Cycle Motor Protein
   5. Phosphorylation–Dephosphorylation Cycle Kinetics
   6. Summary and Challenges
   References

6. Effect of Kinetics on Sedimentation Velocity Profiles and the Role of Intermediates
John J. Correia, P. Holland Alday, Peter Sherwood, and Walter F. Stafford
   1. Introduction
   2. Methods
   3. ABCD Systems
   4. Monomer–Tetramer Model
   5. Summary
   Acknowledgments
   References

7. Algebraic Models of Biochemical Networks
Reinhard Laubenbacher and Abdul Salam Jarrah
   1. Introduction
   2. Computational Systems Biology
   3. Network Inference
   4. Reverse-Engineering of Discrete Models: An Example
   5. Discussion
   References

8. High-Throughput Computing in the Sciences
Mark Morgan and Andrew Grimshaw
   1. What is an HTC Application?
   2. HTC Technologies
   3. High-Throughput Computing Examples
   4. Advanced Topics
   5. Summary
   References

9. Large Scale Transcriptome Data Integration Across Multiple Tissues to Decipher Stem Cell Signatures
Ghislain Bidaut and Christian J. Stoeckert
   1. Introduction
   2. Systems and Data Sources
   3. Data Integration
   4. Artificial Neural Network Training and Validation
   5. Future Development and Enhancement Plans
   Acknowledgments
   References

10. DynaFit—A Software Package for Enzymology
Petr Kuzmič
   1. Introduction
   2. Equilibrium Binding Studies
   3. Initial Rates of Enzyme Reactions
   4. Time Course of Enzyme Reactions
   5. General Methods and Algorithms
   6. Concluding Remarks
   Acknowledgments
   References

11. Discrete Dynamic Modeling of Cellular Signaling Networks
Réka Albert and Rui-Sheng Wang
   1. Introduction
   2. Cellular Signaling Networks
   3. Boolean Dynamic Modeling
   4. Variants of Boolean Network Models
   5. Application Examples
   6. Conclusion and Discussion
   Acknowledgments
   References

12. The Basic Concepts of Molecular Modeling
Akansha Saxena, Diana Wong, Karthikeyan Diraviyam, and David Sept
   1. Introduction
   2. Homology Modeling
   3. Molecular Dynamics
   4. Molecular Docking
   References

13. Deterministic and Stochastic Models of Genetic Regulatory Networks
Ilya Shmulevich and John D. Aitchison
   1. Introduction
   2. Boolean Networks
   3. Differential Equation Models
   4. Probabilistic Boolean Networks
   5. Stochastic Differential Equation Models
   References

14. Bayesian Probability Approach to ADHD Appraisal
Raina Robeva and Jennifer Kim Penberthy
   1. Introduction
   2. Bayesian Probability Algorithm
   3. The Value of Bayesian Probability Approach as a Meta-Analysis Tool
   4. Discussion and Future Directions
   Acknowledgment
   References

15. Simple Stochastic Simulation
Maria J. Schilstra and Stephen R. Martin
   1. Introduction
   2. Understanding Reaction Dynamics
   3. Graphical Notation
   4. Reactions
   5. Reaction Kinetics
   6. Transition Firing Rules
   7. Summary
   8. Notes
   References

16. Monte Carlo Simulation in Establishing Analytical Quality Requirements for Clinical Laboratory Tests: Meeting Clinical Needs
James C. Boyd and David E. Bruns
   1. Introduction
   2. Modeling Approach
   3. Methods for Simulation Study
   4. Results
   5. Discussion
   References

17. Nonlinear Dynamical Analysis and Optimization for Biological/Biomedical Systems
Amos Ben-Zvi and Jong Min Lee
   1. Introduction
   2. Hypothalamic–Pituitary–Adrenal Axis System
   3. Development of Clinically Relevant Performance-Assessment Tools
   4. Dynamic Programming
   5. Computation of Optimal Treatments for HPA Axis System
   6. Conclusions
   Acknowledgments
   References

18. Modeling of Growth Factor-Receptor Systems: From Molecular-Level Protein Interaction Networks to Whole-Body Compartment Models
Florence T. H. Wu, Marianne O. Stefanini, Feilim Mac Gabhann, and Aleksander S. Popel
   1. Background
   2. Molecular-Level Kinetics Models: Simulation of In Vitro Experiments
   3. Mesoscale Single-Tissue 3D Models: Simulation of In Vivo Tissue Regions
   4. Single-Tissue Compartmental Models: Simulation of In Vivo Tissue
   5. Multitissue Compartmental Models: Simulation of Whole Body
   6. Conclusions
   Acknowledgments
   References

19. The Least-Squares Analysis of Data from Binding and Enzyme Kinetics Studies: Weights, Bias, and Confidence Intervals in Usual and Unusual Situations
Joel Tellinghuisen
   1. Introduction
   2. Least Squares Review
   3. Statistics of Reciprocals
   4. Weights When y is a True Dependent Variable
   5. Unusual Weighting: When x is the Dependent Variable
   6. Assessing Data Uncertainty: Variance Function Estimation
   7. Conclusion
   References

20. Nonparametric Entropy Estimation Using Kernel Densities
Douglas E. Lake
   1. Introduction
   2. Motivating Application: Classifying Cardiac Rhythms
   3. Renyi Entropy and the Friedman–Tukey Index
   4. Kernel Density Estimation
   5. Mean-Integrated Square Error
   6. Estimating the FT Index
   7. Connection Between Template Matches and Kernel Densities
   8. Summary and Future Work
   Acknowledgments
   References

21. Pancreatic Network Control of Glucagon Secretion and Counterregulation
Leon S. Farhy and Anthony L. McCall
   1. Introduction
   2. Mechanisms of Glucagon Counterregulation (GCR) Dysregulation in Diabetes
   3. Interdisciplinary Approach to Investigating the Defects in the GCR
   4. Initial Qualitative Analysis of the GCR Control Axis
   5. Mathematical Models of the GCR Control Mechanisms in STZ-Treated Rats
   6. Approximation of the Normal Endocrine Pancreas by a Minimal Control Network (MCN) and Analysis of the GCR Abnormalities in the Insulin-Deficient Pancreas
   7. Advantages and Limitations of the Interdisciplinary Approach
   8. Conclusions
   Acknowledgment
   References

22. Enzyme Kinetics and Computational Modeling for Systems Biology
Pedro Mendes, Hanan Messiha, Naglis Malys, and Stefan Hoops
   1. Introduction
   2. Computational Modeling and Enzyme Kinetics
   3. Yeast Triosephosphate Isomerase (EC 5.3.1.1)
   4. Initial Rate Analysis
   5. Progress Curve Analysis
   6. Concluding Remarks
   Acknowledgments
   References

23. Fitting Enzyme Kinetic Data with KinTek Global Kinetic Explorer
Kenneth A. Johnson
   1. Background
   2. Challenges of Fitting by Simulation
   3. Methods
   4. Progress Curve Kinetics
   5. Fitting Full Progress Curves
   6. Slow Onset Inhibition Kinetics
   7. Summary
   Acknowledgments
   References

Author Index
Subject Index
CONTRIBUTORS
John D. Aitchison
Institute for Systems Biology, Seattle, Washington, USA

Réka Albert
Department of Physics, Pennsylvania State University, University Park, Pennsylvania, USA

P. Holland Alday
Department of Biochemistry, University of Mississippi Medical Center, Jackson, Mississippi, USA

Amos Ben-Zvi
Chemical and Materials Engineering, University of Alberta, Edmonton, Alberta, Canada

Richard Bertram
Department of Mathematics and Programs in Neuroscience and Molecular Biophysics, Florida State University, Tallahassee, Florida, USA

Ghislain Bidaut
Inserm, UMR891, CRCM, Integrative Bioinformatics, and Institut Paoli-Calmettes; Univ Méditerranée, Marseille, France

James C. Boyd
Department of Pathology, University of Virginia Health System, Charlottesville, Virginia, USA

David E. Bruns
Department of Pathology, University of Virginia Health System, Charlottesville, Virginia, USA

John J. Correia
Department of Biochemistry, University of Mississippi Medical Center, Jackson, Mississippi, USA

Karthikeyan Diraviyam
Biomedical Engineering and Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA

Leon S. Farhy
Department of Medicine, Center for Biomathematical Technology, University of Virginia, Charlottesville, Virginia, USA

Feilim Mac Gabhann
Department of Biomedical Engineering, Institute for Computational Medicine, Johns Hopkins University, Baltimore, Maryland, USA

Shawn P. Garbett
Vanderbilt Integrative Cancer Biology Center, Vanderbilt University Medical Center, Nashville, Tennessee, USA

Walter Georgescu
Vanderbilt Integrative Cancer Biology Center, and Department of Biomedical Engineering, Vanderbilt University Medical Center, Nashville, Tennessee, USA

Andrew Grimshaw
Department of Computer Science, University of Virginia, Charlottesville, Virginia, USA

Mark P. Harris
Department of Cancer Biology, Vanderbilt University Medical Center, Nashville, Tennessee, USA

Stefan Hoops
Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, USA

Abdul Salam Jarrah
Virginia Bioinformatics Institute at Virginia Tech, Blacksburg, Virginia, USA

Kenneth A. Johnson
Department of Chemistry and Biochemistry, Institute for Cell and Molecular Biology, University of Texas, Austin, Texas, USA

Peter S. Kim
Department of Mathematics, University of Utah, Salt Lake City, Utah, USA

Andrew V. Kossenkov
The Wistar Institute, Philadelphia, Pennsylvania, USA

Petr Kuzmič
BioKin Ltd., Watertown, Massachusetts, USA

Douglas E. Lake
Departments of Internal Medicine (Cardiovascular Division) and Statistics, University of Virginia, Charlottesville, Virginia, USA

Reinhard Laubenbacher
Virginia Bioinformatics Institute at Virginia Tech, Blacksburg, Virginia, USA

Jong Min Lee
Chemical and Materials Engineering, University of Alberta, Edmonton, Alberta, Canada

Peter P. Lee
Division of Hematology, Department of Medicine, Stanford University, Stanford, California, USA

Doron Levy
Department of Mathematics and Center for Scientific Computation and Mathematical Modeling (CSCAMM), University of Maryland, College Park, Maryland, USA

Naglis Malys
Manchester Centre for Integrative Systems Biology, and Faculty of Life Sciences, The University of Manchester, Manchester, United Kingdom

Stephen R. Martin
Division of Physical Biochemistry, MRC National Institute for Medical Research, London, United Kingdom

Anthony L. McCall
Department of Medicine, Center for Biomathematical Technology, University of Virginia, Charlottesville, Virginia, USA

Pedro Mendes
Manchester Centre for Integrative Systems Biology, and School of Computer Science, The University of Manchester, Manchester, United Kingdom; Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, USA

Hanan Messiha
Manchester Centre for Integrative Systems Biology, and School of Chemistry, The University of Manchester, Manchester, United Kingdom

Mark Morgan
Department of Computer Science, University of Virginia, Charlottesville, Virginia, USA

Michael F. Ochs
The Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, Maryland, USA

Jennifer Kim Penberthy
Department of Psychiatry and Neurobehavioral Sciences, University of Virginia Health System, Charlottesville, Virginia, USA

Aleksander S. Popel
Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA

Hong Qian
Department of Applied Mathematics, University of Washington, Seattle, Washington, USA

Vito Quaranta
Department of Cancer Biology, and Vanderbilt Integrative Cancer Biology Center, Vanderbilt University Medical Center, Nashville, Tennessee, USA

Raina Robeva
Department of Mathematical Sciences, Sweet Briar College, Sweet Briar, Virginia, USA

Akansha Saxena
Biomedical Engineering, Washington University, St. Louis, Missouri, USA

Maria J. Schilstra
Biological and Neural Computation Group, Science and Technology Research Institute, University of Hertfordshire, Hatfield, United Kingdom

David Sept
Biomedical Engineering and Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA

Peter Sherwood
Boston Biomedical Research Institute, Watertown, Massachusetts, USA

Ilya Shmulevich
Institute for Systems Biology, Seattle, Washington, USA

Walter F. Stafford
Boston Biomedical Research Institute, Watertown, Massachusetts, USA

Marianne O. Stefanini
Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA

Christian J. Stoeckert
Center for Bioinformatics, Department of Genetics, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, USA

Joel Tabak
Department of Biological Science and Program in Neuroscience, Florida State University, Tallahassee, Florida, USA

Joel Tellinghuisen
Department of Chemistry, Vanderbilt University, Nashville, Tennessee, USA

Maurizio Tomaiuolo
Department of Biological Science and Program in Neuroscience, Florida State University, Tallahassee, Florida, USA

Darren R. Tyson
Department of Cancer Biology, and Vanderbilt Integrative Cancer Biology Center, Vanderbilt University Medical Center, Nashville, Tennessee, USA

Rui-Sheng Wang
Department of Physics, Pennsylvania State University, University Park, Pennsylvania, USA

Brandy Weidow
Department of Cancer Biology, and Vanderbilt Integrative Cancer Biology Center, Vanderbilt University Medical Center, Nashville, Tennessee, USA

Diana Wong
Biomedical Engineering, Washington University, St. Louis, Missouri, USA

Florence T. H. Wu
Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
PREFACE
A general perception exists that the only applications of computers and computer methods in biological and biomedical research are basic statistical analysis and the searching of DNA sequence databases. While these are important applications, they only scratch the surface of the current and potential applications of computers and computer methods in biomedical research. The chapters in this volume present a wide variety of applications that extend well beyond this limited perception.

The use of computers and computational methods has become ubiquitous in biological and biomedical research, driven by numerous factors. One primary reason is the emphasis placed on computers and computational methods within the National Institutes of Health (NIH) Roadmap. Another is the increased level of mathematical and computational sophistication among researchers, particularly among junior scientists, students, journal reviewers, and NIH Study Section members. A third is the rapid advance of computer hardware and software, which has made these methods far more accessible to the rank-and-file research community.

The training of most senior M.D.s and Ph.D.s in clinical or basic disciplines at academic research and medical centers does not commonly include advanced coursework in mathematics, numerical analysis, statistics, or computer science. The chapters in this volume have been written to be accessible to this audience.

MICHAEL L. JOHNSON
LUDWIG BRAND
METHODS IN ENZYMOLOGY
VOLUME I. Preparation and Assay of Enzymes
Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN
VOLUME II. Preparation and Assay of Enzymes
Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN
VOLUME III. Preparation and Assay of Substrates
Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN
VOLUME IV. Special Techniques for the Enzymologist
Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN
VOLUME V. Preparation and Assay of Enzymes
Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN
VOLUME VI. Preparation and Assay of Enzymes (Continued), Preparation and Assay of Substrates, Special Techniques
Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN
VOLUME VII. Cumulative Subject Index
Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN
VOLUME VIII. Complex Carbohydrates
Edited by ELIZABETH F. NEUFELD AND VICTOR GINSBURG
VOLUME IX. Carbohydrate Metabolism
Edited by WILLIS A. WOOD
VOLUME X. Oxidation and Phosphorylation
Edited by RONALD W. ESTABROOK AND MAYNARD E. PULLMAN
VOLUME XI. Enzyme Structure
Edited by C. H. W. HIRS
VOLUME XII. Nucleic Acids (Parts A and B)
Edited by LAWRENCE GROSSMAN AND KIVIE MOLDAVE
VOLUME XIII. Citric Acid Cycle
Edited by J. M. LOWENSTEIN
VOLUME XIV. Lipids
Edited by J. M. LOWENSTEIN
VOLUME XV. Steroids and Terpenoids
Edited by RAYMOND B. CLAYTON
VOLUME XVI. Fast Reactions
Edited by KENNETH KUSTIN
VOLUME XVII. Metabolism of Amino Acids and Amines (Parts A and B)
Edited by HERBERT TABOR AND CELIA WHITE TABOR
VOLUME XVIII. Vitamins and Coenzymes (Parts A, B, and C)
Edited by DONALD B. MCCORMICK AND LEMUEL D. WRIGHT
VOLUME XIX. Proteolytic Enzymes
Edited by GERTRUDE E. PERLMANN AND LASZLO LORAND
VOLUME XX. Nucleic Acids and Protein Synthesis (Part C)
Edited by KIVIE MOLDAVE AND LAWRENCE GROSSMAN
VOLUME XXI. Nucleic Acids (Part D)
Edited by LAWRENCE GROSSMAN AND KIVIE MOLDAVE
VOLUME XXII. Enzyme Purification and Related Techniques
Edited by WILLIAM B. JAKOBY
VOLUME XXIII. Photosynthesis (Part A)
Edited by ANTHONY SAN PIETRO
VOLUME XXIV. Photosynthesis and Nitrogen Fixation (Part B)
Edited by ANTHONY SAN PIETRO
VOLUME XXV. Enzyme Structure (Part B)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF
VOLUME XXVI. Enzyme Structure (Part C)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF
VOLUME XXVII. Enzyme Structure (Part D)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF
VOLUME XXVIII. Complex Carbohydrates (Part B)
Edited by VICTOR GINSBURG
VOLUME XXIX. Nucleic Acids and Protein Synthesis (Part E)
Edited by LAWRENCE GROSSMAN AND KIVIE MOLDAVE
VOLUME XXX. Nucleic Acids and Protein Synthesis (Part F)
Edited by KIVIE MOLDAVE AND LAWRENCE GROSSMAN
VOLUME XXXI. Biomembranes (Part A)
Edited by SIDNEY FLEISCHER AND LESTER PACKER
VOLUME XXXII. Biomembranes (Part B)
Edited by SIDNEY FLEISCHER AND LESTER PACKER
VOLUME XXXIII. Cumulative Subject Index Volumes I–XXX
Edited by MARTHA G. DENNIS AND EDWARD A. DENNIS
VOLUME XXXIV. Affinity Techniques (Enzyme Purification: Part B)
Edited by WILLIAM B. JAKOBY AND MEIR WILCHEK
VOLUME XXXV. Lipids (Part B)
Edited by JOHN M. LOWENSTEIN
VOLUME XXXVI. Hormone Action (Part A: Steroid Hormones)
Edited by BERT W. O'MALLEY AND JOEL G. HARDMAN
VOLUME XXXVII. Hormone Action (Part B: Peptide Hormones)
Edited by BERT W. O'MALLEY AND JOEL G. HARDMAN
VOLUME XXXVIII. Hormone Action (Part C: Cyclic Nucleotides)
Edited by JOEL G. HARDMAN AND BERT W. O'MALLEY
VOLUME XXXIX. Hormone Action (Part D: Isolated Cells, Tissues, and Organ Systems)
Edited by JOEL G. HARDMAN AND BERT W. O'MALLEY
VOLUME XL. Hormone Action (Part E: Nuclear Structure and Function)
Edited by BERT W. O'MALLEY AND JOEL G. HARDMAN
VOLUME XLI. Carbohydrate Metabolism (Part B)
Edited by W. A. WOOD
VOLUME XLII. Carbohydrate Metabolism (Part C)
Edited by W. A. WOOD
VOLUME XLIII. Antibiotics
Edited by JOHN H. HASH
VOLUME XLIV. Immobilized Enzymes
Edited by KLAUS MOSBACH
VOLUME XLV. Proteolytic Enzymes (Part B)
Edited by LASZLO LORAND
VOLUME XLVI. Affinity Labeling
Edited by WILLIAM B. JAKOBY AND MEIR WILCHEK
VOLUME XLVII. Enzyme Structure (Part E)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF
VOLUME XLVIII. Enzyme Structure (Part F)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF
VOLUME XLIX. Enzyme Structure (Part G)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF
VOLUME L. Complex Carbohydrates (Part C)
Edited by VICTOR GINSBURG
VOLUME LI. Purine and Pyrimidine Nucleotide Metabolism
Edited by PATRICIA A. HOFFEE AND MARY ELLEN JONES
VOLUME LII. Biomembranes (Part C: Biological Oxidations)
Edited by SIDNEY FLEISCHER AND LESTER PACKER
VOLUME LIII. Biomembranes (Part D: Biological Oxidations)
Edited by SIDNEY FLEISCHER AND LESTER PACKER
VOLUME LIV. Biomembranes (Part E: Biological Oxidations)
Edited by SIDNEY FLEISCHER AND LESTER PACKER
VOLUME LV. Biomembranes (Part F: Bioenergetics)
Edited by SIDNEY FLEISCHER AND LESTER PACKER
VOLUME LVI. Biomembranes (Part G: Bioenergetics)
Edited by SIDNEY FLEISCHER AND LESTER PACKER
VOLUME LVII. Bioluminescence and Chemiluminescence
Edited by MARLENE A. DELUCA
VOLUME LVIII. Cell Culture
Edited by WILLIAM B. JAKOBY AND IRA PASTAN
VOLUME LIX. Nucleic Acids and Protein Synthesis (Part G)
Edited by KIVIE MOLDAVE AND LAWRENCE GROSSMAN
VOLUME LX. Nucleic Acids and Protein Synthesis (Part H)
Edited by KIVIE MOLDAVE AND LAWRENCE GROSSMAN
VOLUME 61. Enzyme Structure (Part H)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF
VOLUME 62. Vitamins and Coenzymes (Part D)
Edited by DONALD B. MCCORMICK AND LEMUEL D. WRIGHT
VOLUME 63. Enzyme Kinetics and Mechanism (Part A: Initial Rate and Inhibitor Methods)
Edited by DANIEL L. PURICH
VOLUME 64. Enzyme Kinetics and Mechanism (Part B: Isotopic Probes and Complex Enzyme Systems)
Edited by DANIEL L. PURICH
VOLUME 65. Nucleic Acids (Part I)
Edited by LAWRENCE GROSSMAN AND KIVIE MOLDAVE
VOLUME 66. Vitamins and Coenzymes (Part E)
Edited by DONALD B. MCCORMICK AND LEMUEL D. WRIGHT
VOLUME 67. Vitamins and Coenzymes (Part F)
Edited by DONALD B. MCCORMICK AND LEMUEL D. WRIGHT
VOLUME 68. Recombinant DNA
Edited by RAY WU
VOLUME 69. Photosynthesis and Nitrogen Fixation (Part C)
Edited by ANTHONY SAN PIETRO
VOLUME 70. Immunochemical Techniques (Part A)
Edited by HELEN VAN VUNAKIS AND JOHN J. LANGONE
VOLUME 71. Lipids (Part C)
Edited by JOHN M. LOWENSTEIN
VOLUME 72. Lipids (Part D)
Edited by JOHN M. LOWENSTEIN
VOLUME 73. Immunochemical Techniques (Part B)
Edited by JOHN J. LANGONE AND HELEN VAN VUNAKIS
VOLUME 74. Immunochemical Techniques (Part C)
Edited by JOHN J. LANGONE AND HELEN VAN VUNAKIS
VOLUME 75. Cumulative Subject Index Volumes XXXI, XXXII, XXXIV–LX
Edited by EDWARD A. DENNIS AND MARTHA G. DENNIS
VOLUME 76. Hemoglobins
Edited by ERALDO ANTONINI, LUIGI ROSSI-BERNARDI, AND EMILIA CHIANCONE
VOLUME 77. Detoxication and Drug Metabolism
Edited by WILLIAM B. JAKOBY
VOLUME 78. Interferons (Part A)
Edited by SIDNEY PESTKA
VOLUME 79. Interferons (Part B)
Edited by SIDNEY PESTKA
VOLUME 80. Proteolytic Enzymes (Part C)
Edited by LASZLO LORAND
VOLUME 81. Biomembranes (Part H: Visual Pigments and Purple Membranes, I)
Edited by LESTER PACKER
VOLUME 82. Structural and Contractile Proteins (Part A: Extracellular Matrix)
Edited by LEON W. CUNNINGHAM AND DIXIE W. FREDERIKSEN
VOLUME 83. Complex Carbohydrates (Part D)
Edited by VICTOR GINSBURG
VOLUME 84. Immunochemical Techniques (Part D: Selected Immunoassays)
Edited by JOHN J. LANGONE AND HELEN VAN VUNAKIS
VOLUME 85. Structural and Contractile Proteins (Part B: The Contractile Apparatus and the Cytoskeleton)
Edited by DIXIE W. FREDERIKSEN AND LEON W. CUNNINGHAM
VOLUME 86. Prostaglandins and Arachidonate Metabolites
Edited by WILLIAM E. M. LANDS AND WILLIAM L. SMITH
VOLUME 87. Enzyme Kinetics and Mechanism (Part C: Intermediates, Stereochemistry, and Rate Studies)
Edited by DANIEL L. PURICH
VOLUME 88. Biomembranes (Part I: Visual Pigments and Purple Membranes, II)
Edited by LESTER PACKER
VOLUME 89. Carbohydrate Metabolism (Part D)
Edited by WILLIS A. WOOD
VOLUME 90. Carbohydrate Metabolism (Part E)
Edited by WILLIS A. WOOD
VOLUME 91. Enzyme Structure (Part I)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF
VOLUME 92. Immunochemical Techniques (Part E: Monoclonal Antibodies and General Immunoassay Methods)
Edited by JOHN J. LANGONE AND HELEN VAN VUNAKIS
VOLUME 93. Immunochemical Techniques (Part F: Conventional Antibodies, Fc Receptors, and Cytotoxicity)
Edited by JOHN J. LANGONE AND HELEN VAN VUNAKIS
VOLUME 94. Polyamines
Edited by HERBERT TABOR AND CELIA WHITE TABOR
VOLUME 95. Cumulative Subject Index Volumes 61–74, 76–80
Edited by EDWARD A. DENNIS AND MARTHA G. DENNIS
VOLUME 96. Biomembranes [Part J: Membrane Biogenesis: Assembly and Targeting (General Methods; Eukaryotes)]
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 97. Biomembranes [Part K: Membrane Biogenesis: Assembly and Targeting (Prokaryotes, Mitochondria, and Chloroplasts)]
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 98. Biomembranes (Part L: Membrane Biogenesis: Processing and Recycling)
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 99. Hormone Action (Part F: Protein Kinases)
Edited by JACKIE D. CORBIN AND JOEL G. HARDMAN
VOLUME 100. Recombinant DNA (Part B)
Edited by RAY WU, LAWRENCE GROSSMAN, AND KIVIE MOLDAVE
VOLUME 101. Recombinant DNA (Part C)
Edited by RAY WU, LAWRENCE GROSSMAN, AND KIVIE MOLDAVE
VOLUME 102. Hormone Action (Part G: Calmodulin and Calcium-Binding Proteins)
Edited by ANTHONY R. MEANS AND BERT W. O'MALLEY
VOLUME 103. Hormone Action (Part H: Neuroendocrine Peptides)
Edited by P. MICHAEL CONN
VOLUME 104. Enzyme Purification and Related Techniques (Part C)
Edited by WILLIAM B. JAKOBY
VOLUME 105. Oxygen Radicals in Biological Systems
Edited by LESTER PACKER
VOLUME 106. Posttranslational Modifications (Part A)
Edited by FINN WOLD AND KIVIE MOLDAVE
VOLUME 107. Posttranslational Modifications (Part B)
Edited by FINN WOLD AND KIVIE MOLDAVE
VOLUME 108. Immunochemical Techniques (Part G: Separation and Characterization of Lymphoid Cells)
Edited by GIOVANNI DI SABATO, JOHN J. LANGONE, AND HELEN VAN VUNAKIS
VOLUME 109. Hormone Action (Part I: Peptide Hormones)
Edited by LUTZ BIRNBAUMER AND BERT W. O'MALLEY
VOLUME 110. Steroids and Isoprenoids (Part A)
Edited by JOHN H. LAW AND HANS C. RILLING
VOLUME 111. Steroids and Isoprenoids (Part B)
Edited by JOHN H. LAW AND HANS C. RILLING
VOLUME 112. Drug and Enzyme Targeting (Part A)
Edited by KENNETH J. WIDDER AND RALPH GREEN
VOLUME 113. Glutamate, Glutamine, Glutathione, and Related Compounds
Edited by ALTON MEISTER
VOLUME 114. Diffraction Methods for Biological Macromolecules (Part A)
Edited by HAROLD W. WYCKOFF, C. H. W. HIRS, AND SERGE N. TIMASHEFF
VOLUME 115. Diffraction Methods for Biological Macromolecules (Part B)
Edited by HAROLD W. WYCKOFF, C. H. W. HIRS, AND SERGE N. TIMASHEFF
VOLUME 116. Immunochemical Techniques (Part H: Effectors and Mediators of Lymphoid Cell Functions)
Edited by GIOVANNI DI SABATO, JOHN J. LANGONE, AND HELEN VAN VUNAKIS
VOLUME 117. Enzyme Structure (Part J)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF
VOLUME 118. Plant Molecular Biology
Edited by ARTHUR WEISSBACH AND HERBERT WEISSBACH
VOLUME 119. Interferons (Part C)
Edited by SIDNEY PESTKA
VOLUME 120. Cumulative Subject Index Volumes 81–94, 96–101
VOLUME 121. Immunochemical Techniques (Part I: Hybridoma Technology and Monoclonal Antibodies)
Edited by JOHN J. LANGONE AND HELEN VAN VUNAKIS
VOLUME 122. Vitamins and Coenzymes (Part G)
Edited by FRANK CHYTIL AND DONALD B. MCCORMICK
VOLUME 123. Vitamins and Coenzymes (Part H)
Edited by FRANK CHYTIL AND DONALD B. MCCORMICK
VOLUME 124. Hormone Action (Part J: Neuroendocrine Peptides)
Edited by P. MICHAEL CONN
VOLUME 125. Biomembranes (Part M: Transport in Bacteria, Mitochondria, and Chloroplasts: General Approaches and Transport Systems)
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 126. Biomembranes (Part N: Transport in Bacteria, Mitochondria, and Chloroplasts: Protonmotive Force)
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 127. Biomembranes (Part O: Protons and Water: Structure and Translocation)
Edited by LESTER PACKER
VOLUME 128. Plasma Lipoproteins (Part A: Preparation, Structure, and Molecular Biology)
Edited by JERE P. SEGREST AND JOHN J. ALBERS
VOLUME 129. Plasma Lipoproteins (Part B: Characterization, Cell Biology, and Metabolism)
Edited by JOHN J. ALBERS AND JERE P. SEGREST
VOLUME 130. Enzyme Structure (Part K)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF
VOLUME 131. Enzyme Structure (Part L)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF
VOLUME 132. Immunochemical Techniques (Part J: Phagocytosis and Cell-Mediated Cytotoxicity)
Edited by GIOVANNI DI SABATO AND JOHANNES EVERSE
VOLUME 133. Bioluminescence and Chemiluminescence (Part B)
Edited by MARLENE DELUCA AND WILLIAM D. MCELROY
VOLUME 134. Structural and Contractile Proteins (Part C: The Contractile Apparatus and the Cytoskeleton)
Edited by RICHARD B. VALLEE
VOLUME 135. Immobilized Enzymes and Cells (Part B)
Edited by KLAUS MOSBACH
VOLUME 136. Immobilized Enzymes and Cells (Part C)
Edited by KLAUS MOSBACH
VOLUME 137. Immobilized Enzymes and Cells (Part D)
Edited by KLAUS MOSBACH
VOLUME 138. Complex Carbohydrates (Part E)
Edited by VICTOR GINSBURG
VOLUME 139. Cellular Regulators (Part A: Calcium- and Calmodulin-Binding Proteins)
Edited by ANTHONY R. MEANS AND P. MICHAEL CONN
VOLUME 140. Cumulative Subject Index Volumes 102–119, 121–134
VOLUME 141. Cellular Regulators (Part B: Calcium and Lipids)
Edited by P. MICHAEL CONN AND ANTHONY R. MEANS
VOLUME 142. Metabolism of Aromatic Amino Acids and Amines
Edited by SEYMOUR KAUFMAN
VOLUME 143. Sulfur and Sulfur Amino Acids
Edited by WILLIAM B. JAKOBY AND OWEN GRIFFITH
VOLUME 144. Structural and Contractile Proteins (Part D: Extracellular Matrix)
Edited by LEON W. CUNNINGHAM
VOLUME 145. Structural and Contractile Proteins (Part E: Extracellular Matrix)
Edited by LEON W. CUNNINGHAM
VOLUME 146. Peptide Growth Factors (Part A)
Edited by DAVID BARNES AND DAVID A. SIRBASKU
VOLUME 147. Peptide Growth Factors (Part B)
Edited by DAVID BARNES AND DAVID A. SIRBASKU
VOLUME 148. Plant Cell Membranes
Edited by LESTER PACKER AND ROLAND DOUCE
VOLUME 149. Drug and Enzyme Targeting (Part B)
Edited by RALPH GREEN AND KENNETH J. WIDDER
VOLUME 150. Immunochemical Techniques (Part K: In Vitro Models of B and T Cell Functions and Lymphoid Cell Receptors)
Edited by GIOVANNI DI SABATO
VOLUME 151. Molecular Genetics of Mammalian Cells
Edited by MICHAEL M. GOTTESMAN
VOLUME 152. Guide to Molecular Cloning Techniques
Edited by SHELBY L. BERGER AND ALAN R. KIMMEL
VOLUME 153. Recombinant DNA (Part D)
Edited by RAY WU AND LAWRENCE GROSSMAN
VOLUME 154. Recombinant DNA (Part E)
Edited by RAY WU AND LAWRENCE GROSSMAN
VOLUME 155. Recombinant DNA (Part F)
Edited by RAY WU
VOLUME 156. Biomembranes (Part P: ATP-Driven Pumps and Related Transport: The Na, K-Pump)
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 157. Biomembranes (Part Q: ATP-Driven Pumps and Related Transport: Calcium, Proton, and Potassium Pumps)
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 158. Metalloproteins (Part A)
Edited by JAMES F. RIORDAN AND BERT L. VALLEE
VOLUME 159. Initiation and Termination of Cyclic Nucleotide Action
Edited by JACKIE D. CORBIN AND ROGER A. JOHNSON
VOLUME 160. Biomass (Part A: Cellulose and Hemicellulose)
Edited by WILLIS A. WOOD AND SCOTT T. KELLOGG
VOLUME 161. Biomass (Part B: Lignin, Pectin, and Chitin)
Edited by WILLIS A. WOOD AND SCOTT T. KELLOGG
VOLUME 162. Immunochemical Techniques (Part L: Chemotaxis and Inflammation)
Edited by GIOVANNI DI SABATO
VOLUME 163. Immunochemical Techniques (Part M: Chemotaxis and Inflammation)
Edited by GIOVANNI DI SABATO
VOLUME 164. Ribosomes
Edited by HARRY F. NOLLER, JR., AND KIVIE MOLDAVE
VOLUME 165. Microbial Toxins: Tools for Enzymology
Edited by SIDNEY HARSHMAN
VOLUME 166. Branched-Chain Amino Acids
Edited by ROBERT HARRIS AND JOHN R. SOKATCH
VOLUME 167. Cyanobacteria
Edited by LESTER PACKER AND ALEXANDER N. GLAZER
VOLUME 168. Hormone Action (Part K: Neuroendocrine Peptides)
Edited by P. MICHAEL CONN
VOLUME 169. Platelets: Receptors, Adhesion, Secretion (Part A)
Edited by JACEK HAWIGER
VOLUME 170. Nucleosomes
Edited by PAUL M. WASSARMAN AND ROGER D. KORNBERG
VOLUME 171. Biomembranes (Part R: Transport Theory: Cells and Model Membranes)
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 172. Biomembranes (Part S: Transport: Membrane Isolation and Characterization)
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 173. Biomembranes [Part T: Cellular and Subcellular Transport: Eukaryotic (Nonepithelial) Cells]
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 174. Biomembranes [Part U: Cellular and Subcellular Transport: Eukaryotic (Nonepithelial) Cells]
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 175. Cumulative Subject Index Volumes 135–139, 141–167
VOLUME 176. Nuclear Magnetic Resonance (Part A: Spectral Techniques and Dynamics)
Edited by NORMAN J. OPPENHEIMER AND THOMAS L. JAMES
VOLUME 177. Nuclear Magnetic Resonance (Part B: Structure and Mechanism)
Edited by NORMAN J. OPPENHEIMER AND THOMAS L. JAMES
VOLUME 178. Antibodies, Antigens, and Molecular Mimicry
Edited by JOHN J. LANGONE
VOLUME 179. Complex Carbohydrates (Part F)
Edited by VICTOR GINSBURG
VOLUME 180. RNA Processing (Part A: General Methods)
Edited by JAMES E. DAHLBERG AND JOHN N. ABELSON
VOLUME 181. RNA Processing (Part B: Specific Methods)
Edited by JAMES E. DAHLBERG AND JOHN N. ABELSON
VOLUME 182. Guide to Protein Purification
Edited by MURRAY P. DEUTSCHER
VOLUME 183. Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences
Edited by RUSSELL F. DOOLITTLE
VOLUME 184. Avidin-Biotin Technology
Edited by MEIR WILCHEK AND EDWARD A. BAYER
VOLUME 185. Gene Expression Technology
Edited by DAVID V. GOEDDEL
VOLUME 186. Oxygen Radicals in Biological Systems (Part B: Oxygen Radicals and Antioxidants)
Edited by LESTER PACKER AND ALEXANDER N. GLAZER
VOLUME 187. Arachidonate Related Lipid Mediators
Edited by ROBERT C. MURPHY AND FRANK A. FITZPATRICK
VOLUME 188. Hydrocarbons and Methylotrophy
Edited by MARY E. LIDSTROM
VOLUME 189. Retinoids (Part A: Molecular and Metabolic Aspects)
Edited by LESTER PACKER
VOLUME 190. Retinoids (Part B: Cell Differentiation and Clinical Applications)
Edited by LESTER PACKER
VOLUME 191. Biomembranes (Part V: Cellular and Subcellular Transport: Epithelial Cells)
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 192. Biomembranes (Part W: Cellular and Subcellular Transport: Epithelial Cells)
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 193. Mass Spectrometry
Edited by JAMES A. MCCLOSKEY
VOLUME 194. Guide to Yeast Genetics and Molecular Biology
Edited by CHRISTINE GUTHRIE AND GERALD R. FINK
VOLUME 195. Adenylyl Cyclase, G Proteins, and Guanylyl Cyclase
Edited by ROGER A. JOHNSON AND JACKIE D. CORBIN
VOLUME 196. Molecular Motors and the Cytoskeleton
Edited by RICHARD B. VALLEE
VOLUME 197. Phospholipases
Edited by EDWARD A. DENNIS
VOLUME 198. Peptide Growth Factors (Part C)
Edited by DAVID BARNES, J. P. MATHER, AND GORDON H. SATO
VOLUME 199. Cumulative Subject Index Volumes 168–174, 176–194
VOLUME 200. Protein Phosphorylation (Part A: Protein Kinases: Assays, Purification, Antibodies, Functional Analysis, Cloning, and Expression)
Edited by TONY HUNTER AND BARTHOLOMEW M. SEFTON
VOLUME 201. Protein Phosphorylation (Part B: Analysis of Protein Phosphorylation, Protein Kinase Inhibitors, and Protein Phosphatases)
Edited by TONY HUNTER AND BARTHOLOMEW M. SEFTON
VOLUME 202. Molecular Design and Modeling: Concepts and Applications (Part A: Proteins, Peptides, and Enzymes)
Edited by JOHN J. LANGONE
VOLUME 203. Molecular Design and Modeling: Concepts and Applications (Part B: Antibodies and Antigens, Nucleic Acids, Polysaccharides, and Drugs)
Edited by JOHN J. LANGONE
VOLUME 204. Bacterial Genetic Systems
Edited by JEFFREY H. MILLER
VOLUME 205. Metallobiochemistry (Part B: Metallothionein and Related Molecules)
Edited by JAMES F. RIORDAN AND BERT L. VALLEE
VOLUME 206. Cytochrome P450
Edited by MICHAEL R. WATERMAN AND ERIC F. JOHNSON
VOLUME 207. Ion Channels
Edited by BERNARDO RUDY AND LINDA E. IVERSON
VOLUME 208. Protein–DNA Interactions
Edited by ROBERT T. SAUER
VOLUME 209. Phospholipid Biosynthesis
Edited by EDWARD A. DENNIS AND DENNIS E. VANCE
VOLUME 210. Numerical Computer Methods
Edited by LUDWIG BRAND AND MICHAEL L. JOHNSON
VOLUME 211. DNA Structures (Part A: Synthesis and Physical Analysis of DNA)
Edited by DAVID M. J. LILLEY AND JAMES E. DAHLBERG
VOLUME 212. DNA Structures (Part B: Chemical and Electrophoretic Analysis of DNA)
Edited by DAVID M. J. LILLEY AND JAMES E. DAHLBERG
VOLUME 213. Carotenoids (Part A: Chemistry, Separation, Quantitation, and Antioxidation)
Edited by LESTER PACKER
VOLUME 214. Carotenoids (Part B: Metabolism, Genetics, and Biosynthesis)
Edited by LESTER PACKER
VOLUME 215. Platelets: Receptors, Adhesion, Secretion (Part B)
Edited by JACEK J. HAWIGER
VOLUME 216. Recombinant DNA (Part G)
Edited by RAY WU
VOLUME 217. Recombinant DNA (Part H)
Edited by RAY WU
VOLUME 218. Recombinant DNA (Part I)
Edited by RAY WU
VOLUME 219. Reconstitution of Intracellular Transport
Edited by JAMES E. ROTHMAN
VOLUME 220. Membrane Fusion Techniques (Part A)
Edited by NEJAT DÜZGÜNEŞ
VOLUME 221. Membrane Fusion Techniques (Part B)
Edited by NEJAT DÜZGÜNEŞ
VOLUME 222. Proteolytic Enzymes in Coagulation, Fibrinolysis, and Complement Activation (Part A: Mammalian Blood Coagulation Factors and Inhibitors)
Edited by LASZLO LORAND AND KENNETH G. MANN
VOLUME 223. Proteolytic Enzymes in Coagulation, Fibrinolysis, and Complement Activation (Part B: Complement Activation, Fibrinolysis, and Nonmammalian Blood Coagulation Factors)
Edited by LASZLO LORAND AND KENNETH G. MANN
VOLUME 224. Molecular Evolution: Producing the Biochemical Data
Edited by ELIZABETH ANNE ZIMMER, THOMAS J. WHITE, REBECCA L. CANN, AND ALLAN C. WILSON
VOLUME 225. Guide to Techniques in Mouse Development
Edited by PAUL M. WASSARMAN AND MELVIN L. DEPAMPHILIS
VOLUME 226. Metallobiochemistry (Part C: Spectroscopic and Physical Methods for Probing Metal Ion Environments in Metalloenzymes and Metalloproteins)
Edited by JAMES F. RIORDAN AND BERT L. VALLEE
VOLUME 227. Metallobiochemistry (Part D: Physical and Spectroscopic Methods for Probing Metal Ion Environments in Metalloproteins)
Edited by JAMES F. RIORDAN AND BERT L. VALLEE
VOLUME 228. Aqueous Two-Phase Systems
Edited by HARRY WALTER AND GÖTE JOHANSSON
VOLUME 229. Cumulative Subject Index Volumes 195–198, 200–227
VOLUME 230. Guide to Techniques in Glycobiology
Edited by WILLIAM J. LENNARZ AND GERALD W. HART
VOLUME 231. Hemoglobins (Part B: Biochemical and Analytical Methods)
Edited by JOHANNES EVERSE, KIM D. VANDEGRIFF, AND ROBERT M. WINSLOW
VOLUME 232. Hemoglobins (Part C: Biophysical Methods)
Edited by JOHANNES EVERSE, KIM D. VANDEGRIFF, AND ROBERT M. WINSLOW
VOLUME 233. Oxygen Radicals in Biological Systems (Part C)
Edited by LESTER PACKER
VOLUME 234. Oxygen Radicals in Biological Systems (Part D)
Edited by LESTER PACKER
VOLUME 235. Bacterial Pathogenesis (Part A: Identification and Regulation of Virulence Factors)
Edited by VIRGINIA L. CLARK AND PATRIK M. BAVOIL
VOLUME 236. Bacterial Pathogenesis (Part B: Integration of Pathogenic Bacteria with Host Cells)
Edited by VIRGINIA L. CLARK AND PATRIK M. BAVOIL
VOLUME 237. Heterotrimeric G Proteins
Edited by RAVI IYENGAR
VOLUME 238. Heterotrimeric G-Protein Effectors
Edited by RAVI IYENGAR
VOLUME 239. Nuclear Magnetic Resonance (Part C)
Edited by THOMAS L. JAMES AND NORMAN J. OPPENHEIMER
VOLUME 240. Numerical Computer Methods (Part B)
Edited by MICHAEL L. JOHNSON AND LUDWIG BRAND
VOLUME 241. Retroviral Proteases
Edited by LAWRENCE C. KUO AND JULES A. SHAFER
VOLUME 242. Neoglycoconjugates (Part A)
Edited by Y. C. LEE AND REIKO T. LEE
VOLUME 243. Inorganic Microbial Sulfur Metabolism
Edited by HARRY D. PECK, JR., AND JEAN LEGALL
VOLUME 244. Proteolytic Enzymes: Serine and Cysteine Peptidases
Edited by ALAN J. BARRETT
VOLUME 245. Extracellular Matrix Components
Edited by E. RUOSLAHTI AND E. ENGVALL
VOLUME 246. Biochemical Spectroscopy
Edited by KENNETH SAUER
VOLUME 247. Neoglycoconjugates (Part B: Biomedical Applications)
Edited by Y. C. LEE AND REIKO T. LEE
VOLUME 248. Proteolytic Enzymes: Aspartic and Metallo Peptidases
Edited by ALAN J. BARRETT
VOLUME 249. Enzyme Kinetics and Mechanism (Part D: Developments in Enzyme Dynamics)
Edited by DANIEL L. PURICH
VOLUME 250. Lipid Modifications of Proteins
Edited by PATRICK J. CASEY AND JANICE E. BUSS
VOLUME 251. Biothiols (Part A: Monothiols and Dithiols, Protein Thiols, and Thiyl Radicals)
Edited by LESTER PACKER
VOLUME 252. Biothiols (Part B: Glutathione and Thioredoxin; Thiols in Signal Transduction and Gene Regulation)
Edited by LESTER PACKER
VOLUME 253. Adhesion of Microbial Pathogens
Edited by RON J. DOYLE AND ITZHAK OFEK
VOLUME 254. Oncogene Techniques
Edited by PETER K. VOGT AND INDER M. VERMA
VOLUME 255. Small GTPases and Their Regulators (Part A: Ras Family)
Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 256. Small GTPases and Their Regulators (Part B: Rho Family)
Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 257. Small GTPases and Their Regulators (Part C: Proteins Involved in Transport)
Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 258. Redox-Active Amino Acids in Biology
Edited by JUDITH P. KLINMAN
VOLUME 259. Energetics of Biological Macromolecules
Edited by MICHAEL L. JOHNSON AND GARY K. ACKERS
VOLUME 260. Mitochondrial Biogenesis and Genetics (Part A)
Edited by GIUSEPPE M. ATTARDI AND ANNE CHOMYN
VOLUME 261. Nuclear Magnetic Resonance and Nucleic Acids
Edited by THOMAS L. JAMES
VOLUME 262. DNA Replication
Edited by JUDITH L. CAMPBELL
VOLUME 263. Plasma Lipoproteins (Part C: Quantitation)
Edited by WILLIAM A. BRADLEY, SANDRA H. GIANTURCO, AND JERE P. SEGREST
VOLUME 264. Mitochondrial Biogenesis and Genetics (Part B)
Edited by GIUSEPPE M. ATTARDI AND ANNE CHOMYN
VOLUME 265. Cumulative Subject Index Volumes 228, 230–262
VOLUME 266. Computer Methods for Macromolecular Sequence Analysis
Edited by RUSSELL F. DOOLITTLE
VOLUME 267. Combinatorial Chemistry
Edited by JOHN N. ABELSON
VOLUME 268. Nitric Oxide (Part A: Sources and Detection of NO; NO Synthase)
Edited by LESTER PACKER
VOLUME 269. Nitric Oxide (Part B: Physiological and Pathological Processes)
Edited by LESTER PACKER
VOLUME 270. High Resolution Separation and Analysis of Biological Macromolecules (Part A: Fundamentals)
Edited by BARRY L. KARGER AND WILLIAM S. HANCOCK
VOLUME 271. High Resolution Separation and Analysis of Biological Macromolecules (Part B: Applications)
Edited by BARRY L. KARGER AND WILLIAM S. HANCOCK
VOLUME 272. Cytochrome P450 (Part B)
Edited by ERIC F. JOHNSON AND MICHAEL R. WATERMAN
VOLUME 273. RNA Polymerase and Associated Factors (Part A)
Edited by SANKAR ADHYA
VOLUME 274. RNA Polymerase and Associated Factors (Part B)
Edited by SANKAR ADHYA
VOLUME 275. Viral Polymerases and Related Proteins
Edited by LAWRENCE C. KUO, DAVID B. OLSEN, AND STEVEN S. CARROLL
VOLUME 276. Macromolecular Crystallography (Part A)
Edited by CHARLES W. CARTER, JR., AND ROBERT M. SWEET
VOLUME 277. Macromolecular Crystallography (Part B)
Edited by CHARLES W. CARTER, JR., AND ROBERT M. SWEET
VOLUME 278. Fluorescence Spectroscopy
Edited by LUDWIG BRAND AND MICHAEL L. JOHNSON
VOLUME 279. Vitamins and Coenzymes (Part I)
Edited by DONALD B. MCCORMICK, JOHN W. SUTTIE, AND CONRAD WAGNER
VOLUME 280. Vitamins and Coenzymes (Part J)
Edited by DONALD B. MCCORMICK, JOHN W. SUTTIE, AND CONRAD WAGNER
VOLUME 281. Vitamins and Coenzymes (Part K)
Edited by DONALD B. MCCORMICK, JOHN W. SUTTIE, AND CONRAD WAGNER
VOLUME 282. Vitamins and Coenzymes (Part L)
Edited by DONALD B. MCCORMICK, JOHN W. SUTTIE, AND CONRAD WAGNER
VOLUME 283. Cell Cycle Control
Edited by WILLIAM G. DUNPHY
VOLUME 284. Lipases (Part A: Biotechnology)
Edited by BYRON RUBIN AND EDWARD A. DENNIS
VOLUME 285. Cumulative Subject Index Volumes 263, 264, 266–284, 286–289
VOLUME 286. Lipases (Part B: Enzyme Characterization and Utilization)
Edited by BYRON RUBIN AND EDWARD A. DENNIS
VOLUME 287. Chemokines
Edited by RICHARD HORUK
VOLUME 288. Chemokine Receptors
Edited by RICHARD HORUK
VOLUME 289. Solid Phase Peptide Synthesis
Edited by GREGG B. FIELDS
VOLUME 290. Molecular Chaperones
Edited by GEORGE H. LORIMER AND THOMAS BALDWIN
VOLUME 291. Caged Compounds
Edited by GERARD MARRIOTT
VOLUME 292. ABC Transporters: Biochemical, Cellular, and Molecular Aspects
Edited by SURESH V. AMBUDKAR AND MICHAEL M. GOTTESMAN
VOLUME 293. Ion Channels (Part B)
Edited by P. MICHAEL CONN
VOLUME 294. Ion Channels (Part C)
Edited by P. MICHAEL CONN
VOLUME 295. Energetics of Biological Macromolecules (Part B)
Edited by GARY K. ACKERS AND MICHAEL L. JOHNSON
VOLUME 296. Neurotransmitter Transporters
Edited by SUSAN G. AMARA
VOLUME 297. Photosynthesis: Molecular Biology of Energy Capture
Edited by LEE MCINTOSH
VOLUME 298. Molecular Motors and the Cytoskeleton (Part B)
Edited by RICHARD B. VALLEE
VOLUME 299. Oxidants and Antioxidants (Part A)
Edited by LESTER PACKER
VOLUME 300. Oxidants and Antioxidants (Part B)
Edited by LESTER PACKER
VOLUME 301. Nitric Oxide: Biological and Antioxidant Activities (Part C)
Edited by LESTER PACKER
VOLUME 302. Green Fluorescent Protein
Edited by P. MICHAEL CONN
VOLUME 303. cDNA Preparation and Display
Edited by SHERMAN M. WEISSMAN
VOLUME 304. Chromatin
Edited by PAUL M. WASSARMAN AND ALAN P. WOLFFE
VOLUME 305. Bioluminescence and Chemiluminescence (Part C)
Edited by THOMAS O. BALDWIN AND MIRIAM M. ZIEGLER
VOLUME 306. Expression of Recombinant Genes in Eukaryotic Systems
Edited by JOSEPH C. GLORIOSO AND MARTIN C. SCHMIDT
VOLUME 307. Confocal Microscopy
Edited by P. MICHAEL CONN
VOLUME 308. Enzyme Kinetics and Mechanism (Part E: Energetics of Enzyme Catalysis)
Edited by DANIEL L. PURICH AND VERN L. SCHRAMM
VOLUME 309. Amyloid, Prions, and Other Protein Aggregates
Edited by RONALD WETZEL
VOLUME 310. Biofilms
Edited by RON J. DOYLE
VOLUME 311. Sphingolipid Metabolism and Cell Signaling (Part A)
Edited by ALFRED H. MERRILL, JR., AND YUSUF A. HANNUN
VOLUME 312. Sphingolipid Metabolism and Cell Signaling (Part B)
Edited by ALFRED H. MERRILL, JR., AND YUSUF A. HANNUN
VOLUME 313. Antisense Technology (Part A: General Methods, Methods of Delivery, and RNA Studies)
Edited by M. IAN PHILLIPS
VOLUME 314. Antisense Technology (Part B: Applications)
Edited by M. IAN PHILLIPS
VOLUME 315. Vertebrate Phototransduction and the Visual Cycle (Part A)
Edited by KRZYSZTOF PALCZEWSKI
VOLUME 316. Vertebrate Phototransduction and the Visual Cycle (Part B)
Edited by KRZYSZTOF PALCZEWSKI
VOLUME 317. RNA–Ligand Interactions (Part A: Structural Biology Methods)
Edited by DANIEL W. CELANDER AND JOHN N. ABELSON
VOLUME 318. RNA–Ligand Interactions (Part B: Molecular Biology Methods)
Edited by DANIEL W. CELANDER AND JOHN N. ABELSON
VOLUME 319. Singlet Oxygen, UV-A, and Ozone
Edited by LESTER PACKER AND HELMUT SIES
VOLUME 320. Cumulative Subject Index Volumes 290–319
VOLUME 321. Numerical Computer Methods (Part C)
Edited by MICHAEL L. JOHNSON AND LUDWIG BRAND
VOLUME 322. Apoptosis
Edited by JOHN C. REED
VOLUME 323. Energetics of Biological Macromolecules (Part C)
Edited by MICHAEL L. JOHNSON AND GARY K. ACKERS
VOLUME 324. Branched-Chain Amino Acids (Part B)
Edited by ROBERT A. HARRIS AND JOHN R. SOKATCH
VOLUME 325. Regulators and Effectors of Small GTPases (Part D: Rho Family)
Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 326. Applications of Chimeric Genes and Hybrid Proteins (Part A: Gene Expression and Protein Purification)
Edited by JEREMY THORNER, SCOTT D. EMR, AND JOHN N. ABELSON
VOLUME 327. Applications of Chimeric Genes and Hybrid Proteins (Part B: Cell Biology and Physiology)
Edited by JEREMY THORNER, SCOTT D. EMR, AND JOHN N. ABELSON
VOLUME 328. Applications of Chimeric Genes and Hybrid Proteins (Part C: Protein–Protein Interactions and Genomics)
Edited by JEREMY THORNER, SCOTT D. EMR, AND JOHN N. ABELSON
VOLUME 329. Regulators and Effectors of Small GTPases (Part E: GTPases Involved in Vesicular Traffic)
Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 330. Hyperthermophilic Enzymes (Part A)
Edited by MICHAEL W. W. ADAMS AND ROBERT M. KELLY
VOLUME 331. Hyperthermophilic Enzymes (Part B)
Edited by MICHAEL W. W. ADAMS AND ROBERT M. KELLY
VOLUME 332. Regulators and Effectors of Small GTPases (Part F: Ras Family I)
Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 333. Regulators and Effectors of Small GTPases (Part G: Ras Family II)
Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 334. Hyperthermophilic Enzymes (Part C)
Edited by MICHAEL W. W. ADAMS AND ROBERT M. KELLY
VOLUME 335. Flavonoids and Other Polyphenols
Edited by LESTER PACKER
VOLUME 336. Microbial Growth in Biofilms (Part A: Developmental and Molecular Biological Aspects)
Edited by RON J. DOYLE
VOLUME 337. Microbial Growth in Biofilms (Part B: Special Environments and Physicochemical Aspects)
Edited by RON J. DOYLE
VOLUME 338. Nuclear Magnetic Resonance of Biological Macromolecules (Part A)
Edited by THOMAS L. JAMES, VOLKER DÖTSCH, AND ULI SCHMITZ
VOLUME 339. Nuclear Magnetic Resonance of Biological Macromolecules (Part B)
Edited by THOMAS L. JAMES, VOLKER DÖTSCH, AND ULI SCHMITZ
VOLUME 340. Drug–Nucleic Acid Interactions
Edited by JONATHAN B. CHAIRES AND MICHAEL J. WARING
VOLUME 341. Ribonucleases (Part A)
Edited by ALLEN W. NICHOLSON
VOLUME 342. Ribonucleases (Part B)
Edited by ALLEN W. NICHOLSON
VOLUME 343. G Protein Pathways (Part A: Receptors)
Edited by RAVI IYENGAR AND JOHN D. HILDEBRANDT
VOLUME 344. G Protein Pathways (Part B: G Proteins and Their Regulators)
Edited by RAVI IYENGAR AND JOHN D. HILDEBRANDT
VOLUME 345. G Protein Pathways (Part C: Effector Mechanisms)
Edited by RAVI IYENGAR AND JOHN D. HILDEBRANDT
VOLUME 346. Gene Therapy Methods
Edited by M. IAN PHILLIPS
VOLUME 347. Protein Sensors and Reactive Oxygen Species (Part A: Selenoproteins and Thioredoxin)
Edited by HELMUT SIES AND LESTER PACKER
VOLUME 348. Protein Sensors and Reactive Oxygen Species (Part B: Thiol Enzymes and Proteins)
Edited by HELMUT SIES AND LESTER PACKER
VOLUME 349. Superoxide Dismutase
Edited by LESTER PACKER
VOLUME 350. Guide to Yeast Genetics and Molecular and Cell Biology (Part B)
Edited by CHRISTINE GUTHRIE AND GERALD R. FINK
VOLUME 351. Guide to Yeast Genetics and Molecular and Cell Biology (Part C)
Edited by CHRISTINE GUTHRIE AND GERALD R. FINK
VOLUME 352. Redox Cell Biology and Genetics (Part A)
Edited by CHANDAN K. SEN AND LESTER PACKER
VOLUME 353. Redox Cell Biology and Genetics (Part B)
Edited by CHANDAN K. SEN AND LESTER PACKER
VOLUME 354. Enzyme Kinetics and Mechanisms (Part F: Detection and Characterization of Enzyme Reaction Intermediates)
Edited by DANIEL L. PURICH
VOLUME 355. Cumulative Subject Index Volumes 321–354
VOLUME 356. Laser Capture Microscopy and Microdissection
Edited by P. MICHAEL CONN
VOLUME 357. Cytochrome P450 (Part C)
Edited by ERIC F. JOHNSON AND MICHAEL R. WATERMAN
VOLUME 358. Bacterial Pathogenesis (Part C: Identification, Regulation, and Function of Virulence Factors)
Edited by VIRGINIA L. CLARK AND PATRIK M. BAVOIL
VOLUME 359. Nitric Oxide (Part D)
Edited by ENRIQUE CADENAS AND LESTER PACKER
VOLUME 360. Biophotonics (Part A)
Edited by GERARD MARRIOTT AND IAN PARKER
VOLUME 361. Biophotonics (Part B)
Edited by GERARD MARRIOTT AND IAN PARKER
VOLUME 362. Recognition of Carbohydrates in Biological Systems (Part A)
Edited by YUAN C. LEE AND REIKO T. LEE
VOLUME 363. Recognition of Carbohydrates in Biological Systems (Part B)
Edited by YUAN C. LEE AND REIKO T. LEE
VOLUME 364. Nuclear Receptors
Edited by DAVID W. RUSSELL AND DAVID J. MANGELSDORF
VOLUME 365. Differentiation of Embryonic Stem Cells
Edited by PAUL M. WASSARMAN AND GORDON M. KELLER
VOLUME 366. Protein Phosphatases
Edited by SUSANNE KLUMPP AND JOSEF KRIEGLSTEIN
VOLUME 367. Liposomes (Part A)
Edited by NEJAT DÜZGÜNEŞ
VOLUME 368. Macromolecular Crystallography (Part C)
Edited by CHARLES W. CARTER, JR., AND ROBERT M. SWEET
VOLUME 369. Combinatorial Chemistry (Part B)
Edited by GUILLERMO A. MORALES AND BARRY A. BUNIN
VOLUME 370. RNA Polymerases and Associated Factors (Part C)
Edited by SANKAR L. ADHYA AND SUSAN GARGES
VOLUME 371. RNA Polymerases and Associated Factors (Part D)
Edited by SANKAR L. ADHYA AND SUSAN GARGES
VOLUME 372. Liposomes (Part B)
Edited by NEJAT DÜZGÜNEŞ
VOLUME 373. Liposomes (Part C)
Edited by NEJAT DÜZGÜNEŞ
VOLUME 374. Macromolecular Crystallography (Part D)
Edited by CHARLES W. CARTER, JR., AND ROBERT M. SWEET
VOLUME 375. Chromatin and Chromatin Remodeling Enzymes (Part A)
Edited by C. DAVID ALLIS AND CARL WU
VOLUME 376. Chromatin and Chromatin Remodeling Enzymes (Part B)
Edited by C. DAVID ALLIS AND CARL WU
VOLUME 377. Chromatin and Chromatin Remodeling Enzymes (Part C)
Edited by C. DAVID ALLIS AND CARL WU
VOLUME 378. Quinones and Quinone Enzymes (Part A)
Edited by HELMUT SIES AND LESTER PACKER
VOLUME 379. Energetics of Biological Macromolecules (Part D)
Edited by JO M. HOLT, MICHAEL L. JOHNSON, AND GARY K. ACKERS
VOLUME 380. Energetics of Biological Macromolecules (Part E)
Edited by JO M. HOLT, MICHAEL L. JOHNSON, AND GARY K. ACKERS
VOLUME 381. Oxygen Sensing. Edited by CHANDAN K. SEN AND GREGG L. SEMENZA
VOLUME 382. Quinones and Quinone Enzymes (Part B). Edited by HELMUT SIES AND LESTER PACKER
VOLUME 383. Numerical Computer Methods (Part D). Edited by LUDWIG BRAND AND MICHAEL L. JOHNSON
VOLUME 384. Numerical Computer Methods (Part E). Edited by LUDWIG BRAND AND MICHAEL L. JOHNSON
VOLUME 385. Imaging in Biological Research (Part A). Edited by P. MICHAEL CONN
VOLUME 386. Imaging in Biological Research (Part B). Edited by P. MICHAEL CONN
VOLUME 387. Liposomes (Part D). Edited by NEJAT DÜZGÜNEŞ
VOLUME 388. Protein Engineering. Edited by DAN E. ROBERTSON AND JOSEPH P. NOEL
VOLUME 389. Regulators of G-Protein Signaling (Part A). Edited by DAVID P. SIDEROVSKI
VOLUME 390. Regulators of G-Protein Signaling (Part B). Edited by DAVID P. SIDEROVSKI
VOLUME 391. Liposomes (Part E). Edited by NEJAT DÜZGÜNEŞ
VOLUME 392. RNA Interference. Edited by DAVID R. ENGELKE AND JOHN J. ROSSI
VOLUME 393. Circadian Rhythms. Edited by MICHAEL W. YOUNG
VOLUME 394. Nuclear Magnetic Resonance of Biological Macromolecules (Part C). Edited by THOMAS L. JAMES
VOLUME 395. Producing the Biochemical Data (Part B). Edited by ELIZABETH A. ZIMMER AND ERIC H. ROALSON
VOLUME 396. Nitric Oxide (Part E). Edited by LESTER PACKER AND ENRIQUE CADENAS
VOLUME 397. Environmental Microbiology. Edited by JARED R. LEADBETTER
VOLUME 398. Ubiquitin and Protein Degradation (Part A). Edited by RAYMOND J. DESHAIES
VOLUME 399. Ubiquitin and Protein Degradation (Part B). Edited by RAYMOND J. DESHAIES
VOLUME 400. Phase II Conjugation Enzymes and Transport Systems. Edited by HELMUT SIES AND LESTER PACKER
VOLUME 401. Glutathione Transferases and Gamma Glutamyl Transpeptidases. Edited by HELMUT SIES AND LESTER PACKER
VOLUME 402. Biological Mass Spectrometry. Edited by A. L. BURLINGAME
VOLUME 403. GTPases Regulating Membrane Targeting and Fusion. Edited by WILLIAM E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 404. GTPases Regulating Membrane Dynamics. Edited by WILLIAM E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 405. Mass Spectrometry: Modified Proteins and Glycoconjugates. Edited by A. L. BURLINGAME
VOLUME 406. Regulators and Effectors of Small GTPases: Rho Family. Edited by WILLIAM E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 407. Regulators and Effectors of Small GTPases: Ras Family. Edited by WILLIAM E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 408. DNA Repair (Part A). Edited by JUDITH L. CAMPBELL AND PAUL MODRICH
VOLUME 409. DNA Repair (Part B). Edited by JUDITH L. CAMPBELL AND PAUL MODRICH
VOLUME 410. DNA Microarrays (Part A: Array Platforms and Web-Bench Protocols). Edited by ALAN KIMMEL AND BRIAN OLIVER
VOLUME 411. DNA Microarrays (Part B: Databases and Statistics). Edited by ALAN KIMMEL AND BRIAN OLIVER
VOLUME 412. Amyloid, Prions, and Other Protein Aggregates (Part B). Edited by INDU KHETERPAL AND RONALD WETZEL
VOLUME 413. Amyloid, Prions, and Other Protein Aggregates (Part C). Edited by INDU KHETERPAL AND RONALD WETZEL
VOLUME 414. Measuring Biological Responses with Automated Microscopy. Edited by JAMES INGLESE
VOLUME 415. Glycobiology. Edited by MINORU FUKUDA
VOLUME 416. Glycomics. Edited by MINORU FUKUDA
VOLUME 417. Functional Glycomics. Edited by MINORU FUKUDA
VOLUME 418. Embryonic Stem Cells. Edited by IRINA KLIMANSKAYA AND ROBERT LANZA
VOLUME 419. Adult Stem Cells. Edited by IRINA KLIMANSKAYA AND ROBERT LANZA
VOLUME 420. Stem Cell Tools and Other Experimental Protocols. Edited by IRINA KLIMANSKAYA AND ROBERT LANZA
VOLUME 421. Advanced Bacterial Genetics: Use of Transposons and Phage for Genomic Engineering. Edited by KELLY T. HUGHES
VOLUME 422. Two-Component Signaling Systems, Part A. Edited by MELVIN I. SIMON, BRIAN R. CRANE, AND ALEXANDRINE CRANE
VOLUME 423. Two-Component Signaling Systems, Part B. Edited by MELVIN I. SIMON, BRIAN R. CRANE, AND ALEXANDRINE CRANE
VOLUME 424. RNA Editing. Edited by JONATHA M. GOTT
VOLUME 425. RNA Modification. Edited by JONATHA M. GOTT
VOLUME 426. Integrins. Edited by DAVID CHERESH
VOLUME 427. MicroRNA Methods. Edited by JOHN J. ROSSI
VOLUME 428. Osmosensing and Osmosignaling. Edited by HELMUT SIES AND DIETER HÄUSSINGER
VOLUME 429. Translation Initiation: Extract Systems and Molecular Genetics. Edited by JON LORSCH
VOLUME 430. Translation Initiation: Reconstituted Systems and Biophysical Methods. Edited by JON LORSCH
VOLUME 431. Translation Initiation: Cell Biology, High-Throughput and Chemical-Based Approaches. Edited by JON LORSCH
VOLUME 432. Lipidomics and Bioactive Lipids: Mass-Spectrometry–Based Lipid Analysis. Edited by H. ALEX BROWN
VOLUME 433. Lipidomics and Bioactive Lipids: Specialized Analytical Methods and Lipids in Disease. Edited by H. ALEX BROWN
VOLUME 434. Lipidomics and Bioactive Lipids: Lipids and Cell Signaling. Edited by H. ALEX BROWN
VOLUME 435. Oxygen Biology and Hypoxia. Edited by HELMUT SIES AND BERNHARD BRÜNE
VOLUME 436. Globins and Other Nitric Oxide-Reactive Proteins (Part A). Edited by ROBERT K. POOLE
VOLUME 437. Globins and Other Nitric Oxide-Reactive Proteins (Part B). Edited by ROBERT K. POOLE
VOLUME 438. Small GTPases in Disease (Part A). Edited by WILLIAM E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 439. Small GTPases in Disease (Part B). Edited by WILLIAM E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 440. Nitric Oxide, Part F: Oxidative and Nitrosative Stress in Redox Regulation of Cell Signaling. Edited by ENRIQUE CADENAS AND LESTER PACKER
VOLUME 441. Nitric Oxide, Part G: Oxidative and Nitrosative Stress in Redox Regulation of Cell Signaling. Edited by ENRIQUE CADENAS AND LESTER PACKER
VOLUME 442. Programmed Cell Death, General Principles for Studying Cell Death (Part A). Edited by ROYA KHOSRAVI-FAR, ZAHRA ZAKERI, RICHARD A. LOCKSHIN, AND MAURO PIACENTINI
VOLUME 443. Angiogenesis: In Vitro Systems. Edited by DAVID A. CHERESH
VOLUME 444. Angiogenesis: In Vivo Systems (Part A). Edited by DAVID A. CHERESH
VOLUME 445. Angiogenesis: In Vivo Systems (Part B). Edited by DAVID A. CHERESH
VOLUME 446. Programmed Cell Death, The Biology and Therapeutic Implications of Cell Death (Part B). Edited by ROYA KHOSRAVI-FAR, ZAHRA ZAKERI, RICHARD A. LOCKSHIN, AND MAURO PIACENTINI
VOLUME 447. RNA Turnover in Bacteria, Archaea and Organelles. Edited by LYNNE E. MAQUAT AND CECILIA M. ARRAIANO
VOLUME 448. RNA Turnover in Eukaryotes: Nucleases, Pathways and Analysis of mRNA Decay. Edited by LYNNE E. MAQUAT AND MEGERDITCH KILEDJIAN
VOLUME 449. RNA Turnover in Eukaryotes: Analysis of Specialized and Quality Control RNA Decay Pathways. Edited by LYNNE E. MAQUAT AND MEGERDITCH KILEDJIAN
VOLUME 450. Fluorescence Spectroscopy. Edited by LUDWIG BRAND AND MICHAEL L. JOHNSON
VOLUME 451. Autophagy: Lower Eukaryotes and Non-Mammalian Systems (Part A). Edited by DANIEL J. KLIONSKY
VOLUME 452. Autophagy in Mammalian Systems (Part B). Edited by DANIEL J. KLIONSKY
VOLUME 453. Autophagy in Disease and Clinical Applications (Part C). Edited by DANIEL J. KLIONSKY
VOLUME 454. Computer Methods (Part A). Edited by MICHAEL L. JOHNSON AND LUDWIG BRAND
VOLUME 455. Biothermodynamics (Part A). Edited by MICHAEL L. JOHNSON, JO M. HOLT, AND GARY K. ACKERS (RETIRED)
VOLUME 456. Mitochondrial Function, Part A: Mitochondrial Electron Transport Complexes and Reactive Oxygen Species. Edited by WILLIAM S. ALLISON AND IMMO E. SCHEFFLER
VOLUME 457. Mitochondrial Function, Part B: Mitochondrial Protein Kinases, Protein Phosphatases and Mitochondrial Diseases. Edited by WILLIAM S. ALLISON AND ANNE N. MURPHY
VOLUME 458. Complex Enzymes in Microbial Natural Product Biosynthesis, Part A: Overview Articles and Peptides. Edited by DAVID A. HOPWOOD
VOLUME 459. Complex Enzymes in Microbial Natural Product Biosynthesis, Part B: Polyketides, Aminocoumarins and Carbohydrates. Edited by DAVID A. HOPWOOD
VOLUME 460. Chemokines, Part A. Edited by TRACY M. HANDEL AND DAMON J. HAMEL
VOLUME 461. Chemokines, Part B. Edited by TRACY M. HANDEL AND DAMON J. HAMEL
VOLUME 462. Non-Natural Amino Acids. Edited by TOM W. MUIR AND JOHN N. ABELSON
VOLUME 463. Guide to Protein Purification, 2nd Edition. Edited by RICHARD R. BURGESS AND MURRAY P. DEUTSCHER
VOLUME 464. Liposomes, Part F. Edited by NEJAT DÜZGÜNEŞ
VOLUME 465. Liposomes, Part G. Edited by NEJAT DÜZGÜNEŞ
VOLUME 466. Biothermodynamics, Part B. Edited by MICHAEL L. JOHNSON, GARY K. ACKERS, AND JO M. HOLT
VOLUME 467. Computer Methods, Part B. Edited by MICHAEL L. JOHNSON AND LUDWIG BRAND
CHAPTER ONE
Correlation Analysis: A Tool for Comparing Relaxation-Type Models to Experimental Data

Maurizio Tomaiuolo,* Joel Tabak,* and Richard Bertram†

Contents
1. Introduction
2. Scatter Plots and Correlation Analysis
3. Example 1: Relaxation Oscillations
4. Example 2: Square Wave Bursting
5. Example 3: Elliptic Bursting
6. Example 4: Using Correlation Analysis on Experimental Data
7. Summary
Acknowledgment
References
Abstract

We describe a new technique for comparing mathematical models to the biological systems that they describe. This technique is appropriate for systems that produce relaxation oscillations or bursting oscillations, and it takes advantage of the noise that is inherent to all biological systems. Both types of oscillation are composed of active phases followed by silent phases, repeating periodically. The presence of noise adds variability to the durations of the different phases. The central idea of the technique is that the active phase duration may be correlated with the preceding silent phase duration, the following silent phase duration, or both, and the resulting correlation pattern provides information about the dynamic structure of the system. Correlation patterns can easily be determined by making scatter plots and applying correlation analysis to the cluster of data points. This can be done both with experimental data and with model simulation data. If the model correlation pattern is in general agreement with the experimental data, then this adds support for the validity of the model.
* Department of Biological Science and Program in Neuroscience, Florida State University, Tallahassee, Florida, USA
† Department of Mathematics and Programs in Neuroscience and Molecular Biophysics, Florida State University, Tallahassee, Florida, USA
Methods in Enzymology, Volume 467
ISSN 0076-6879, DOI: 10.1016/S0076-6879(09)67001-4
© 2009 Elsevier Inc. All rights reserved.
Otherwise, the model must be corrected. While this tool is only one test of many required to validate a mathematical model, it is easy to implement and is noninvasive.
1. Introduction

Multivariable systems in which one or more of the variables change slowly compared with the others have the potential to produce relaxation oscillations. These oscillations are characterized by a ‘‘silent state’’ in which the fast variables are at a low value, and an ‘‘active state’’ in which the fast variables are at a high or stimulated value. The fast variables jump back and forth between these states as the slow variables slowly increase and decrease. The fast variable time course thus resembles a square wave, while the slow variable time course has a saw-tooth pattern. The van der Pol oscillator is a classic example of such a system (van der Pol and van der Mark, 1928). Several important biological and biochemical systems have the features of relaxation oscillators, including cardiac and neuronal action potentials (Bertram and Sherman, 2005; van der Pol and van der Mark, 1928), population bursts in neuronal networks (Tabak et al., 2001), the cell cycle (Tyson, 1991), glycolytic oscillations (Goldbeter and Lefever, 1972), and the Belousov–Zhabotinskii chemical reaction (see Murray, 1989, for discussion).

Bursting oscillations are a generalization of relaxation oscillations in which the active state is itself oscillatory (Bertram and Sherman, 2005; Rinzel and Ermentrout, 1998). Thus, bursting consists of fast oscillations clustered into slower episodes. These oscillations are common in nerve cells (see Coombes and Bressloff, 2005, for many examples) and hormone-secreting endocrine cells (Bertram and Sherman, 2005; Dean and Mathews, 1970; Li et al., 1997; Tsaneva-Atanasova et al., 2007; Van Goor et al., 2001).

Analysis techniques for models of relaxation-type oscillations are well developed. For pure relaxation oscillations a phase-plane analysis is typically used (Strogatz, 1994). For bursting oscillations, a geometric singular perturbation analysis, often called fast/slow analysis, is the standard analytical tool (Bertram et al., 1995; Rinzel, 1987; Rinzel and Ermentrout, 1998). From these analyses one can understand features such as threshold behaviors, the effects of perturbations, the conversion of the system from an oscillatory to a stationary state or vice versa, the slowdown of the fast oscillations near the end of the active state that is often observed during bursting, or the subthreshold oscillations that are sometimes observed during the silent phase of a burst. Thus, the analysis is useful for understanding the dynamic behaviors observed experimentally.

While most of the analysis described above assumes that the system is deterministic, in reality all of the biological and biochemical systems on
which the models are based contain noise. The noise could be due to intrinsic factors such as a small number of substrate molecules or ion channels of a certain type. It could also be due to extrinsic factors such as stochastic synaptic input to a neuron, stochastic activation of G-protein-coupled receptors by extracellular ligands, or measurement error. Whatever the origin, noise can make it more difficult to detect some subtle features of the oscillation. This makes it harder to know how well the mathematical model reproduces the behavior of the system under investigation, since key model predictions may depend on the detection of these subtle features in the experimental record (Bertram et al., 1995). In this chapter, we describe a tool based on statistical correlation analysis that can be used to compare the behavior of a mathematical model against experimental data and thus help to determine the validity of the model. This method is designed for relaxation-type models and makes use of intrinsic noise in the system. Subtle features such as spike frequency slowdown or subthreshold oscillations are not utilized. Instead, we look at correlation patterns between the durations of active and silent phases in the experimental data, and in simulation data generated by stochastic implementations of the corresponding model. We demonstrate the use of the tool through four examples. First, we show how it can be (and has been) used to make a powerful (and testable) prediction that can distinguish the type of slow negative feedback underlying a relaxation oscillation. Second, we demonstrate how the tool can be used to study bursting oscillations, focusing on the ‘‘square wave’’ class of bursters. The third example focuses on ‘‘elliptic bursters,’’ and demonstrates that the correlation pattern can distinguish one type of bursting from another. Finally, we apply the correlation analysis to a model of the pituitary lactotroph, a cell in the pituitary gland that secretes the hormone prolactin. We contrast the correlation patterns obtained with this model with experimental electrical data from a pituitary lactotroph cell line.
2. Scatter Plots and Correlation Analysis

In a deterministic system, the duration of each active phase of a relaxation-type oscillator is the same, and the duration of each silent phase is the same. However, when the system exhibits random fluctuations, or noise, durations will vary since the noise can perturb the system prematurely from one state to another. We measure the duration of each silent phase and each active phase (see Appendix for the algorithm used for bursting oscillations), and then make a scatter plot of the active phase durations versus the previous silent phase durations. A separate scatter plot is made of the active phase durations versus the following silent phase durations. We then use
these scatter plots to look for correlations between the active phase and silent phase durations. This can be done using simulation data from a model, or using actual data for the corresponding experimental system. As we demonstrate in the examples below, one expects certain correlation patterns to exist, based on the dynamic structure of the model. The validity of the model is supported (but not established) if the expected correlation patterns match those from the experimental data. If there is no match, then it is likely that the model should be modified or the parameters adjusted. This approach can be used for relaxation oscillations or bursting oscillations, and is most useful when there are enough experimental data to establish statistical confidence in the correlation patterns.
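As a concrete illustration of this bookkeeping, the short Python sketch below (Python and SciPy are our choices here; the chapter's own programs are distributed separately, as noted in Example 2) computes both correlation patterns from a hypothetical recording in which silent and active phases alternate, beginning and ending with a silent phase:

import numpy as np
from scipy.stats import pearsonr

def phase_correlations(apd, spd):
    # apd: durations of the N active phases, in chronological order.
    # spd: durations of the N + 1 silent phases, spd[k] preceding apd[k].
    apd = np.asarray(apd, dtype=float)
    spd = np.asarray(spd, dtype=float)
    r_prev, p_prev = pearsonr(spd[:-1], apd)  # active vs. previous silent
    r_next, p_next = pearsonr(spd[1:], apd)   # active vs. next silent
    return (r_prev, p_prev), (r_next, p_next)

Scatter plots of apd against spd[:-1] and against spd[1:] then correspond to the two panels used in the figures throughout this chapter.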
3. Example 1: Relaxation Oscillations

We consider a system whose activity a is controlled by a fast positive feedback and a slow negative feedback process. This forms the basis for many biological oscillators (Ermentrout and Chow, 2002; Friesen and Block, 1984; Tsai et al., 2008). The activity varies according to:

τa da/dt = −a + a∞(wa − y0) + i.   (1.1)

This equation means that a tends to reach a steady state determined by the steady-state input/output function (or activation function) a∞, with a time constant τa (which is set to 1, so time is relative to τa). The function a∞ is a sigmoid function of its input (Fig. 1.1), which is proportional to the system's output, a. This injection of the activity back into the system's input represents positive feedback, and the gain of the feedback loop is set by the parameter w. The other parameter, y0, represents the half-activation input: if the input is below y0, the output (activity) will be low, and if the input is above y0, then the output will be high. Finally, the term i provides for a possible external input to the system, such as a brief resetting perturbation. This activity equation can describe, for example, the mean firing rate within a network of neurons connected by excitatory synapses (Tabak et al., 2000, 2006; Wilson and Cowan, 1972). In this mean field framework, the network steady-state input/output function a∞ depends on the input/output properties of the single cells, the degree of heterogeneity in the network, as well as the synaptic dynamics. The parameter w represents the degree of connectivity in the network, while y0 sets the amount of excitation that neurons need to receive to become activated. Such a system will always evolve to a steady state (defined by da/dt = 0). For some parameter values, the system may have two stable steady states, one at a high- and one at a low-activity level.
Figure 1.1 System with (fast) positive feedback and two types of (slow) negative feedback. The positive feedback loop is shown in black; the negative feedback loop is in gray. (A) System with divisive feedback, which decreases the gain of the positive feedback loop by a factor s (upper panel). The effect of this feedback is a decrease of the slope of the system steady-state activation. (B) System with subtractive feedback, which decreases the effective input by y (upper panel). The effect of this feedback is a shift of the steady-state activation function of the system to the right. In both cases, the steady-state output function is given by a∞(x) = 1/(1 + exp(−x/ka)).
This is a direct consequence of the positive feedback. To create relaxation oscillations, we add a slow negative feedback process. This will allow the system to switch repetitively between the high and low steady states. We consider two types of slow negative feedback. The first type is divisive feedback. This feedback reduces the amount of positive feedback and is implemented using a slow variable, s, according to:

τa da/dt = −a + a∞(wsa − y0) + i,   (1.2)
τs ds/dt = −s + s∞(a),   (1.3)

where s∞ is a decreasing function of a, so that s decreases during high activity episodes and recovers when the activity is low. Figure 1.1A illustrates that such divisive feedback decreases the slope of the input/output relationship of the system. In a mean field neuronal network model, for example, synaptic depression would be implemented as divisive feedback (Shpiro et al., 2007; Tabak et al., 2006).
The second type is subtractive feedback. In this case, the half-activation point of the system is shifted by a slow variable, y, according to:

τa da/dt = −a + a∞(wa − y0 − y) + i,   (1.4)
τy dy/dt = −y + y∞(a),   (1.5)

where y∞ is an increasing function of a, so that y increases during high activity and decreases during low activity. Figure 1.1B shows how subtractive feedback shifts the activation function to the right, so more input is necessary to achieve a given output. In a mean field neuronal model, adaptation of cell excitability by outward ionic currents would be implemented as subtractive feedback (Shpiro et al., 2007; Tabak et al., 2006). Both the models defined by Eqs. (1.2) and (1.3) (s-model) and by Eqs. (1.4) and (1.5) (y-model) generate relaxation oscillations.

We first examine the oscillations generated by the s-model (Fig. 1.2A). The upper panel shows the time courses of a and s. Activity oscillates between active (high a) and silent (low a) phases. During the silent phase (1), s increases, increasing the level of positive feedback, until activity jumps to a high level (2). This starts the active phase (3), during which s decreases, decreasing the level of positive feedback. When s is low enough, there is not enough positive feedback to sustain the high activity, a falls to the low level (4), and the cycle repeats.

We can gain qualitative understanding of this cyclic activity by using a ‘‘phase-plane’’ representation. Instead of plotting time courses, we plot a(t) versus s(t) in the (a, s) plane (Fig. 1.2A, lower panel). First, we use the fact that s is much slower than a, and, for each value of s, now treated as a parameter, plot the steady states of Eq. (1.2) (the points for which da/dt = 0). We obtain an S-shaped curve, called the a-nullcline. For some values of s there are three possible steady states, one low (stable), one high (stable), and one intermediate that is unstable. Thus, within that range of s values the system is bistable, as mentioned above, with the middle state acting as a threshold: at any given time, if a is below this threshold it will fall to the low steady state; if it is above threshold it will rise to the high steady state. We now allow s to vary and plot the state of the system, represented by a trajectory in the (a, s) plane. Assume a is low initially, so we start on the lower branch of the S-curve. In this case s increases slowly according to Eq. (1.3) and the trajectory follows the lower branch (1). This continues until s passes the value where the low and middle steady states meet (the low ‘‘knee’’ of the a-nullcline, LK), so there is no steady state of Eq. (1.2) other than the upper steady state. Thus, the system jumps to the high activity state (2). Once in the high activity state, s decreases and the system slowly tracks the high branch of the S-curve, moving to the left; this is the active phase (3).
Figure 1.2 The s-model and the y-model produce relaxation oscillations with similar properties. (A) The s-model. Upper panel, oscillatory time courses of s and a. A brief stimulation (‘‘stim,’’ arrow) before full recovery triggers an active phase of shorter duration than the unstimulated ones. Lower panel, the oscillations are represented by a trajectory in the (a, s) plane that slowly tracks the lower and upper branches of the a-nullcline (S-shaped curve) and quickly jumps between the branches at the transition points, or knees of the S-curve. LK, low knee; HK, high knee; stim, stimulation that provokes a premature active phase. (B) The y-model. Upper panel, time courses of y and a. Lower panel, phase-plane trajectory superimposed on the Z-shaped a-nullcline.
Eventually, the trajectory passes the high knee of the S-curve (HK), where the upper steady state meets the middle steady state. Activity then falls abruptly to the low level (4) and the cycle repeats. In the phase plane, the effect of a brief stimulation (arrow in Fig. 1.2A) is apparent: if the stimulus (i) is large enough to bring a above the middle branch of the S-curve (i.e., the threshold), a will immediately jump up to the high state. The resulting premature active phase will be shorter than an unstimulated active phase because it starts at a lower value of s, so less time will be needed to reach the HK. Note that an active phase can also be prematurely terminated by a stimulation that brings a below the threshold. In that case, the following silent phase will also be correspondingly shorter (not shown).
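The oscillations just described are straightforward to reproduce numerically. The Python sketch below integrates the s-model, Eqs. (1.2) and (1.3), by the Euler method, with the noise term that is introduced in the next paragraph added to the activity equation (Euler–Maruyama). All parameter values are illustrative choices of ours, not the chapter's (the published parameter sets accompany the software cited in Example 2); with these values the knees lie near s ≈ 0.66 and s ≈ 0.31 and the deterministic system oscillates:

import numpy as np

def a_inf(x, ka=0.1):
    # steady-state activation, a_inf(x) = 1/(1 + exp(-x/ka))
    return 1.0 / (1.0 + np.exp(-x / ka))

def simulate_s_model(T=5000.0, dt=0.05, w=2.0, y0=0.35,
                     tau_a=1.0, tau_s=100.0, m=0.02, seed=0):
    # Euler-Maruyama integration of Eqs. (1.2)-(1.3); the noise m*eta
    # enters the activity equation only. Parameters are illustrative.
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    a = np.empty(n)
    s = np.empty(n)
    a[0], s[0] = 0.05, 0.4
    for k in range(n - 1):
        eta = rng.standard_normal() / np.sqrt(dt)  # discretized white noise
        da = -a[k] + a_inf(w * s[k] * a[k] - y0) + m * eta
        # s_inf is a decreasing sigmoid of a: s recovers at low activity
        s_inf = 1.0 / (1.0 + np.exp((a[k] - 0.5) / 0.05))
        ds = -s[k] + s_inf
        a[k + 1] = a[k] + dt * da / tau_a
        s[k + 1] = s[k] + dt * ds / tau_s
    return a, s

For the y-model, one replaces the argument of a_inf by w*a - y0 - y and lets the slow variable relax toward an increasing sigmoid of a. Thresholding a at, say, 0.5 then yields the active and silent phase durations needed for the analysis of Section 2.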
Figure 1.2B shows that the y-model also generates relaxation oscillations and responds to brief perturbations in a very similar way. The only visible difference is that the a-nullcline is a Z-shaped curve instead of S-shaped. This is because y increases with high activity and decreases during periods of low activity, in an opposite fashion to s. In both models, the oscillations of the slow variable allow the system to switch between the active and silent phases. The system tracks the stable branches of the a-nullcline until it reaches a knee, where it transitions from one branch to the other. In many cases, only the activity variable a, but not the feedback variables s or y, would be readily measurable in experiments. The two models generate oscillations in a with the same properties, so how can one tell whether experimentally observed relaxation oscillations are controlled by a divisive or a subtractive feedback mechanism? In the following, we show that noise that is intrinsic to the biological system has different effects on the two models, so one would only need to record spontaneous oscillations to distinguish between the two possible models. We include noise by replacing i in Eq. (1.4) with mη, where m is the magnitude of the noise and η is a normally distributed random variable. Results presented in the following are not overly sensitive to the way noise is added to the activity. The most important assumption is that noise perturbs the system's activity, not the slow feedback process. The simulations with noise produce episodic activity as shown in Fig. 1.2, but with variable durations of the active and silent phases. The activity time course is shown in Fig. 1.3A for the s-model with noise. We expect that noise induces early (or delayed) transitions between the silent and active phases, leading to shortened (or lengthened) active and silent phases. To evaluate these effects, we plot active phase duration as a function of the preceding silent phase (Fig. 1.3B) or following silent phase (Fig. 1.3C). We observe a strong positive correlation between the length of the active phase and the preceding, but not the following, silent phase. This correlation pattern (Fig. 1.3B and C) is the signature of relaxation oscillations that rely on slow divisive feedback. The cause for this pattern can be deduced from Fig. 1.3A. Transitions from silent to active phases can occur at very different levels of the slow variable s, but the transitions from active to silent phases seem to occur around the same value of s, with very little change from period to period. This implies that a shorter silent phase corresponds to a lower value of s at active phase onset, and thus to a correspondingly shorter active phase duration (cf. Fig. 1.2A). The correlation shown in Fig. 1.3B therefore illustrates that all the variability in active phase duration is due to variability in the preceding silent phase duration. On the other hand, regardless of active phase duration, the following silent phase starts at the same s value as all other silent phases and therefore is not influenced by active phase duration. Thus, there is no correlation between active phase and following silent phase duration
Figure 1.3 Activity patterns generated by the two types of relaxation oscillators. (A) Time courses of a and s generated by the s-model with noise. There is visibly more variability of the s value at the on transition than at the off transition (mean values at the transitions are indicated by the horizontal dashed lines). A strong correlation between active phase and preceding (B), but not following (C), silent phase duration corresponds to a wide distribution of s at the on transition (D) and a narrow distribution of s at the off transition (E). (F) Time courses of a and y generated by the y-model with noise. There is a weak correlation between active phase duration and both the preceding (G) and following (H) silent phase duration. They correspond to equal amounts of variability in the distributions of y at the on (I) and off (J) transitions. APD, active phase duration; SPD, silent phase duration.
(Fig. 1.3C), that is, the variability in active phase duration does not cause any of the variability in the following silent phase duration. To illustrate this discrepancy between the ‘‘on’’ (silent to active) and the ‘‘off’’ (active to silent) transitions, we plot histograms of the value of s at the transitions. Figure 1.3D shows the wide distribution of s values at the ‘‘on’’
transition, while Fig. 1.3E reveals a very narrow distribution of s values at the off transition. Thus, the correlation pattern shown in Fig. 1.3B and C is due to the large variations of s at the on transition relative to the off transition. Note that if the variability of s at the off transition were greater, then the correlation between active and preceding silent phase duration would be reduced, since there would be some variability in active phase duration not caused by variability in the length of the silent phase. Also, with less variability at the on transition, the small amount of variability at the off transition would have a larger impact and there would be a tendency for a longer active phase to be followed by a longer silent phase. If the variability of s values at both the on and off transitions were equal, we would observe a weak (but significant) correlation between active phase duration and both preceding and following silent phase durations. This situation occurs with the y-model. Figure 1.3F shows time courses generated by the y-model with noise. For the same amount of noise as used in the s-model, there is less variability in the length of active and silent phases. More importantly, the variability of y is similar at the on and off transitions (Fig. 1.3I and J). This results in weak but significant correlation between the duration of the active phase and both preceding (Fig. 1.3G) and following (Fig. 1.3H) silent phases. Thus, if the correlation pattern is similar to Fig. 1.3B and C, then divisive feedback is likely involved, while if the correlation pattern is similar to Fig. 1.3G and H, subtractive feedback is involved. We now give a qualitative explanation for the differences in the amount of variability of the slow negative feedback variables at the on and off transitions, since these differences cause the differences in the correlation patterns. It is possible to predict the correlation patterns using survival analysis of particles in a two-well potential (Lim and Rinzel, submitted). Here, we give an equivalent but more intuitive explanation based on geometrical arguments. Again, we use the difference of time scales between the fast activity and the slow negative feedback processes, and we begin with the s-model (divisive feedback). There are two contributing factors to the observed correlation pattern. The first concerns the shape of the a-nullcline in the phase plane. Figure 1.4A shows that a perturbation which transiently changes the activity level can induce a phase transition if it brings the activity across threshold (the middle branch of the S-curve). Because the nullcline is much flatter on the bottom than on the top, it is easier to induce an on transition (at the sharp low knee (LK) of the S-curve) than an off transition (at the round HK). Thus, positive perturbations will be able to induce on transitions for a much larger range of s values than the range of values for which negative perturbations of the same amplitude can induce off transitions. This effect, however, contributes only a small fraction of the variability induced by noise, because noise does not just act to create a series of quick perturbations in the activity level.
Figure 1.4 Qualitative explanation for the differences in the variability at the on and off transitions. (A) The a-nullcline and superimposed trajectory of the relaxation oscillation for the s-model. Vertical arrows show a perturbation of amplitude 0.2 that can induce a premature transition. The on transition can occur further away from the knee than the off transition. (B) The effect of a change in external input, Δi = 0.01, on the a-nullcline. (C) Time courses of the low knee (LK), high knee (HK), and s (gray, magnified from Fig. 1.3A). The upward arrow indicates a premature on transition due to noise moving the LK to a small s value. The downward arrow indicates a late on transition due to the LK maintaining its position. (D) Wide distribution of LK and (E) narrow distribution of HK obtained during the simulation. Compare with Fig. 1.3D and E. (F) Symmetrical a-nullcline and superimposed oscillation trajectory for the y-model, with vertical arrows depicting perturbations that can induce a transition. (G) Changes in input affect the LK and HK similarly. (H) Time courses of LK, HK, and y (gray, magnified from Fig. 1.3F). (I) Distribution of LK and (J) distribution of HK; compare with Fig. 1.3I and J.
The effects of noise are integrated over time since noise is included in the activity equation (Eq. (1.4)). Thus, if we see noise as a rapidly varying external input to the system, we also realize that it affects the a-nullcline, perturbing it. To quantify this contribution of noise, we first consider how a change Δi to a constant input to the system affects the a-nullcline. As illustrated in Fig. 1.4B, Δi shifts the LK horizontally to a greater extent than the HK. In fact, we have shown that for a small change in external input, the ratio of the resulting changes in the position of the LK and HK, Δs_LK/Δs_HK, is proportional to the ratio of the activity at the HK and LK, a_HK/a_LK (Tabak et al., 2009). Since activity is close to 0 at the LK, this ratio can be high, around 17.4 for the parameters used here.
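This knee analysis can be reproduced directly from Eq. (1.2). Setting da/dt = 0 and inverting the sigmoid of Fig. 1.1 gives the nullcline in the form s(a) = (y0 + ka ln((a − i)/(1 − a + i)))/(wa), whose extrema are the two knees. The sketch below locates them numerically; the parameters are the same illustrative values used in the simulation sketch above, not the chapter's:

import numpy as np

def knees(w=2.0, y0=0.35, ka=0.1, i=0.0, n=100000):
    # Extrema of s(a) along the a-nullcline of Eq. (1.2); returns
    # [(s_LK, a_LK), (s_HK, a_HK)], low knee first.
    x = np.linspace(1e-4, 1.0 - 1e-4, n)  # x = a - i, kept inside (0, 1)
    a = x + i
    s = (y0 + ka * np.log(x / (1.0 - x))) / (w * a)
    ds = np.diff(s) / np.diff(a)
    turns = np.where(np.sign(ds[1:]) != np.sign(ds[:-1]))[0] + 1
    return [(s[j], a[j]) for j in turns]

Comparing knees(i=0.0) with knees(i=0.01) reproduces the asymmetry described in the text: with these illustrative parameters the low knee shifts by roughly an order of magnitude more than the high knee.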
The prediction from this analysis with constant input is that noise will ‘‘shake’’ the a-nullcline during the simulation, moving the knees horizontally. Figure 1.4C shows the resulting time course of both LK and HK positions. LK varies much more than HK, as predicted, and the ratio of their standard deviations is close to 17.4. This panel also shows the time course of s (magnified from Fig. 1.3A), which increases during the silent phase and decreases during the active phase. When LK moves downward, s can cross over (upward arrow) and produce an on transition at an unusually low value of s. On the other hand, when LK remains high it can delay a transition (downward arrow). Thus, the large variations in the positions of the LK create the variability of s at the on transition. On the other hand, there is little variability of HK and therefore little variability of s at the off transition. Therefore, the wide and narrow distributions of LK (Fig. 1.4D) and HK (Fig. 1.4E) explain the wide and narrow distributions of s at the on (Fig. 1.3D) and off transitions (Fig. 1.3E). These differences between LK and HK are absent in the y-model. First, Fig. 1.4F shows that for subtractive negative feedback, the knees of the anullcline are symmetrical and therefore it is equally easy for a perturbation to induce an on or off transition (compare with Fig. 1.4A). Second, input variation affects both knees’ position similarly (Fig. 1.4G). Thus, noise creates equal variations in the LK and HK (Fig 1.4H), and the variability of y is similar at the on and off transitions. The distributions of HK and LK (Fig. 1.4I and J) are comparable to the distributions of y at the on and off transitions (Fig. 1.3I and J). In these examples, we used a smooth sigmoid function for a1, the steady-state activation of the system (Fig. 1.1). If instead, a1 had an abrupt onset and smooth saturation, then the Z-shaped nullcline of Fig. 1.4F would have a sharper LK, and the same pattern of correlation as the s-model could be observed. On the other hand, if the activation function was steep at higher a and smooth at lower a then the opposite correlation pattern could be observed (i.e., correlation between length of active and following—but not preceding—silent phase). Thus, the correlation pattern obtained from the y-model depends on the exact shape of the activation function. In contrast, for the s-model we obtain the correlation pattern shown on Fig. 1.3B and C regardless of the shape of a1 because the large noiseinduced variations of the LK are dominant. One exception would be when y0 is very large, in which case the deterministic system would exhibit a stable equilibrium rather than a relaxation oscillation. That is, the oscillation is driven entirely by the noise. In this case, there should be no correlations between the active and either the preceding or the following silent phases. With this exception, the correlation pattern produced by the s-model is very robust to parameter changes.
In many systems, the active phase is not steady but oscillatory—this defines bursting. The slow negative feedback variable controls the transitions between active and silent phases of bursting as described above for the relaxation oscillations. However, the fast oscillations that occur during the active phase of bursting can greatly increase the sensitivity of the off transition to noise. This, in turn, can change the correlation pattern. In the following examples, we present several cases of bursting oscillations in excitable cells that exhibit different correlation patterns.
4. Example 2: Square Wave Bursting

Square wave bursting has been described in a number of cell types (Butera et al., 1999; Chay and Keizer, 1983; Cornelisse et al., 2001) and belongs to the class of integrator-like neurons (Izhikevich, 2001). It has two primary characteristics. One is that the spikes often ride on a depolarized plateau, as in Fig. 1.5A. However, this is not always the case, since the spikes may undershoot the plateau (Bertram et al., 1995). The second characteristic is that the time between spikes progressively increases during the active phase of a burst. To investigate the correlation pattern of square wave bursting, we use a simplified version of a biophysically derived pancreatic β-cell model (Sherman and Rinzel, 1992). Equations for this and other bursting models used herein can be found in the primary references. Parameter values used were those described therein. Additionally, equations, parameter values, and computer programs for all models are available at http://www.math.fsu.edu/bertram/software/neuron. For the bursting models discussed, the primary observable variable is the membrane potential or voltage (V), which evolves in time according to

C dV/dt = −Σᵢ Iᵢ + I_noise.   (1.6)

The ionic currents, Iᵢ, vary from model to model, as do the number and identity of other variables. Random noise is introduced through the term I_noise = mη, where η is a normally distributed random variable and m is the noise magnitude. In addition to this voltage equation, there are equations for current activation and inactivation variables. One of these variables changes slowly compared with V, and for each model is similar to y discussed earlier, providing subtractive negative feedback. Figure 1.5A shows the voltage time course of the model with added noise of magnitude 1 pA, with the corresponding slow variable, s, superimposed. This is a slow negative feedback variable that activates an inhibitory current.
Figure 1.5 (A) Voltage trace and slow variable of a noisy square wave burster (Sherman and Rinzel, 1992). To facilitate superposition, the slow variable (s) time course has been rescaled. (B) Scatter plot obtained by plotting the duration of each active phase with the duration of the preceding silent phase. In this case, no correlation is observed (r = 0.12, p = 0.15). (C) The plot of active phase duration versus duration of the next silent phase shows a positive correlation (r = 0.72, p < 10⁻²⁰). Thus, on average, a short (long) active phase will be followed by a short (long) silent phase. (D) Distribution of the slow variable at the beginning of an active phase. (E) Distribution of the slow variable at the end of an active phase. The width of the slow variable distribution is greater at the active phase termination than at the active phase onset. That is, active phase termination is more sensitive to noise than active phase initiation.
When s is sufficiently large, the voltage cannot reach the spike threshold, so spiking stops and the cell enters a silent phase. In the absence of spiking the inhibitory s variable declines, eventually reaching a level that is low enough to allow spiking to resume.
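The chapter does not reproduce the model equations, so the sketch below shows only the generic handling of the stochastic term in Eq. (1.6) within a fixed-step integration; the 1/sqrt(dt) scaling keeps the variance of the integrated noise independent of the step size. Here total_ionic_current is a placeholder for whichever model is being simulated, and the gating variables would be advanced alongside V by ordinary Euler steps:

import numpy as np

def euler_maruyama_step(V, dt, total_ionic_current, rng, C=1.0, m=1.0):
    # One update of C dV/dt = -sum_i I_i + I_noise (Eq. 1.6), with
    # I_noise = m*eta and eta a standard normal sample.
    eta = rng.standard_normal() / np.sqrt(dt)  # discretized white noise
    dV = (-total_ionic_current(V) + m * eta) / C
    return V + dt * dV

Only the voltage equation carries the noise term, matching the assumption stated in Example 1 that noise perturbs the activity but not the slow feedback process.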
Scatter plots of the active phase duration versus the previous and the next silent phase durations are constructed as described in the previous section. The scatter plot of the active phase versus the following silent phase (Fig. 1.5C) shows a positive correlation, indicating that short (long) active phases lead to short (long) silent phases. In panel B, however, there is no correlation between the durations of the active phase and the previous silent phase. That is, the length of the silent phase does not provide information regarding the duration of the next active phase of bursting. To explain the correlation pattern, we plot the distributions of the values of the slow variable at the beginning and the end of an active phase. Variation of this slow variable is responsible for starting and stopping the spiking during a burst. For square wave bursting, the width of the slow variable distribution is greater at the active phase termination (Fig. 1.5E) than at the active phase onset (Fig. 1.5D). The reason for this is that the spiking slows down near the end of the active phase, and the voltage spends most of its time near the spike threshold (i.e., the trajectory is approaching a homoclinic orbit), and so is sensitive to small perturbations. Thus, the termination of the burst is more subject to noise than its initiation. This example illustrates that correlation analysis of the active and silent phase durations, when applied to a model of square wave bursting, produces a pattern with positive correlation for active versus next silent phase duration, but no correlation for active versus previous silent phase duration. This result holds for the other three models of square wave bursting tested, and the rationale for this is that all models of square wave bursting have similar dynamic structures. This correlation analysis technique can be used on actual voltage data from a nerve or endocrine cell, for example, to determine if the cell is a square wave burster. It would only require that active and silent phase durations be determined and plotted as in Fig. 1.5B and C. If the patterns match those of a square wave burster, then this tells the modeler a great deal about the form that the model should take. That is, it greatly limits the range of possible models that describe the cell's electrical behavior.
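The distributions in Fig. 1.5D and E are easy to extract once the transition times are known. A minimal sketch, assuming the on and off transition times have already been obtained from a burst-detection step such as the Appendix algorithm:

import numpy as np

def transition_values(t, slow, on_times, off_times):
    # Sample the slow variable at the on (silent-to-active) and
    # off (active-to-silent) transitions, as in Fig. 1.5D and E.
    s_on = np.interp(on_times, t, slow)
    s_off = np.interp(off_times, t, slow)
    return s_on, s_off

For a square wave burster, np.std(s_off) exceeds np.std(s_on); the elliptic burster of the next example shows the reverse.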
5. Example 3: Elliptic Bursting

Next, we consider elliptic bursting, which is observed in several types of neurons (Del Negro et al., 1998; Destexhe et al., 1993; Llinas et al., 1991) and belongs to the class of resonator-like neurons (Izhikevich, 2000; Llinas, 1988). Elliptic bursting is characterized by large spikes that do not ride on a depolarized plateau and small subthreshold oscillations that are present immediately before and after the active phase of a burst (Fig. 1.6A).
Figure 1.6 (A) Voltage trace and slow variable of an elliptic burster with noise (magnitude 0.2 pA). (B) Scatter plot of active phase duration against the duration of the previous silent phase. A correlation exists (r = 0.53, p < 10⁻¹⁰). (C) There is a weak correlation between the durations of active phases and the following silent phases (r = 0.22, p = 0.006). (D) The width of the slow variable distribution at episode onset is greater than the distribution at active phase termination (E); thus active phase termination is less sensitive to noise than active phase initiation.
The large spikes alone, however, do not uniquely identify this burst type, since square wave bursters can have large spikes (Bertram et al., 1995), as can type two or parabolic bursters (Rinzel and Lee, 1987) and other burst types. Moreover, the small subthreshold oscillations are largely obscured when noise is added to the system. That is, the subthreshold oscillations present in the deterministic system may not readily be distinguished from those introduced by the random noise.
Figure 1.7 Correlation analysis applied to experimental data, and compared with a corresponding model. (A) Sample voltage trace of GH4 cell bursting. (B) Scatter plot obtained using GH4 cell data showing the active phase duration versus previous silent phase duration (r = 0.10, p = 0.21). (C) Scatter plot of active phase duration versus next silent phase duration (r = 0.68, p < 10⁻²⁰). (D)–(E) Scatter plots obtained from computer simulations of a model of the pituitary lactotroph (Tabak et al., 2007) with noise added (4 pA magnitude): (D) r = 0.07, p = 0.43; (E) r = 0.67, p < 10⁻¹⁵.
We use the reduced version of the Hodgkin–Huxley giant squid axon model (Rinzel, 1985), with an added slow outward current, to analyze the correlation patterns for elliptic bursting. Figure 1.6A shows the voltage time course of the model with noise magnitude of 0.2 pA, with the corresponding slow variable superimposed.
The scatter plot of the active phase versus the duration of the previous silent phase (Fig. 1.6B) shows a positive correlation, indicating that short (long) active phase durations are preceded by short (long) silent phase durations. In contrast, there is only a weak correlation between the active phase duration and the duration of the next silent phase (Fig. 1.6C). Therefore, the duration of the previous silent phase predicts the active phase duration, but the active phase duration does not accurately predict the next silent phase duration. As in the previous sections, we plot the distributions of the slow variable at the onset and at the termination of an active phase. The slow variable in elliptic bursting exhibits a wider distribution at burst onset (Fig. 1.6D) than at burst termination (Fig. 1.6E). The reason for the wide onset distribution is that the subthreshold oscillations bring the voltage near the spike threshold, and once this threshold is crossed an active phase is initiated. Thus, the active phase initiation is very sensitive to noise, as has been described previously for this type of bursting (Kuske and Baer, 2002; Su et al., 2004). During the active phase, only a precise voltage perturbation at the right time can lead to spike termination (Rowat, 2007). Thus, active phase termination is relatively insensitive to the effects of noise. This example demonstrates that application of correlation analysis can distinguish model elliptic bursting from model square wave bursting. The analysis could also be applied experimentally, taking advantage of the noise that is inherent in the system. The outcome of the analysis could help with the choice of model used to describe the biological system.
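Taken together, Examples 1 through 3 suggest a coarse decision rule that can be applied once the two correlations have been computed (e.g., with the phase_correlations sketch of Section 2). The mapping below merely restates the chapter's qualitative findings under our assumptions; it is a screening aid, not a classifier:

def interpret_pattern(r_prev, p_prev, r_next, p_next, alpha=0.05):
    # Map a correlation pattern onto the cases discussed in the text.
    prev_sig = (p_prev < alpha) and (r_prev > 0)
    next_sig = (p_next < alpha) and (r_next > 0)
    if prev_sig and not next_sig:
        return "previous only: elliptic-like bursting or divisive (s-model) feedback"
    if next_sig and not prev_sig:
        return "next only: square-wave-like bursting (see also Example 4)"
    if prev_sig and next_sig:
        return "both (weak): consistent with subtractive (y-model) feedback"
    return "neither: possibly noise-driven; no conclusion"

As the text emphasizes, a matching pattern supports, but does not establish, a candidate mechanism.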
6. Example 4: Using Correlation Analysis on Experimental Data

In this example, we illustrate how correlation analysis can be used as a test for the validity of a model by applying it to both the model and the experimental system. The model describes fast bursting electrical activity in prolactin-secreting pituitary lactotrophs (Tabak et al., 2007). The experimental preparation is the GH4 pituitary lactotroph cell line. Like primary lactotrophs, cells from this cell line often exhibit fast bursting electrical oscillations. A sample trace of GH4 bursting activity is shown in Fig. 1.7A. Correlation analysis was applied to a voltage trace approximately 5 min long, consisting of 150 bursts. The scatter plots show that there is no correlation between active phase duration and the previous silent phase (Fig. 1.7B), but a strong positive correlation between active phase duration and the next silent phase duration (Fig. 1.7C). We next compare these scatter plots with those from computer simulations of the model with added noise of magnitude 4 pA. The bursting
produced by the model is neither square wave nor elliptic (Tabak et al., 2007), but instead is of the type referred to as pseudo-plateau (Stern et al., 2008). The model scatter plots show that, as with the experimental data, there is no correlation between active and previous silent phase durations (Fig. 1.7D), and a strong positive correlation between the active and the next silent phase durations (Fig. 1.7E). Thus, the correlation analysis provides some support for the validity of the mechanism for bursting in the mathematical model.
7. Summary

We have demonstrated that correlation analysis can be a useful tool for comparing mathematical models with experimental data as a first check for the validity of the model. This type of analysis is appropriate for systems that produce relaxation oscillations or bursting oscillations. While it does not validate the model, it is a first test that is simple to apply to both the model and the biological system. Furthermore, it is noninvasive: all that is required is that one measure the activity of the biological system and make scatter plots of active and silent phase durations. Because the correlation analysis is a statistical test, confidence in the results increases with the number of data points. In this case, the data points are bursts or relaxation oscillations. For our example with the GH4 lactotroph cell line, 5 min of continuous recording was sufficient to give reliable results. While our examples focused on neural oscillations, the method is equally applicable to other types of biological systems that generate relaxation-type oscillations.
Appendix: Algorithm for the Determination of Phase Durations During Bursting

Here, we describe the method used to determine silent and active phase durations for a noisy burst time course, where V is the observable. Upon visual inspection, we first set a threshold, VS, such that if V > VS a spike is recorded. Denote the times at which two spikes occur by ti and tj; then two spikes are considered to lie within a single burst if |ti − tj| < d, where d is a positive parameter chosen by examination of interspike intervals. Conversely, if |ti − tj| > d, then the two spikes are not considered part of the same burst. In a similar way, we obtain the duration of each silent phase by computing the difference between the last spike of a burst and the first spike of the following burst. We then create three vectors of equal size. One vector, b, contains all the active phase durations in chronological order (i.e., [b1, b2, ..., bN]). The other two vectors contain the silent phase durations.
The preceding silent phase vector is s_prec = [s1, s2, ..., sN], while the following silent phase vector is s_next = [s2, s3, ..., s(N+1)]. We then plot the elements of b versus those of s_prec or versus s_next to make scatter plots. Computer codes for the computation of active and silent phase durations can be downloaded from http://www.math.fsu.edu/bertram/software/neuron. In the case of experimental data, the data may have to be detrended if any slow trends in active and silent phase durations are present.
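The procedure above translates directly into code. A sketch in Python (our language choice; the authors' codes at the URL above are the reference implementation):

import numpy as np

def phase_durations(t, V, VS, d):
    # t, V: sampled time and voltage; VS: spike threshold;
    # d: maximum interspike interval within a burst.
    # Spike times are taken as upward crossings of VS.
    up = np.where((V[1:] > VS) & (V[:-1] <= VS))[0] + 1
    spikes = t[up]
    # A gap longer than d separates two bursts.
    gaps = np.where(np.diff(spikes) > d)[0]
    bursts = np.split(spikes, gaps + 1)
    # Active phase duration: first to last spike of each burst.
    apd = np.array([burst[-1] - burst[0] for burst in bursts])
    # Silent phase k: last spike of burst k to first spike of burst k+1.
    spd = np.array([bursts[k + 1][0] - bursts[k][-1]
                    for k in range(len(bursts) - 1)])
    # Keep interior bursts, which have both flanking silent phases,
    # so that b, s_prec, and s_next have equal length.
    return apd[1:-1], spd[:-1], spd[1:]

For experimental records, scipy.signal.detrend (or a low-order polynomial fit) can be applied to the duration sequences before the correlation analysis.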
ACKNOWLEDGMENT

This work was supported by NIH grant DA-19356.
REFERENCES

Bertram, R., and Sherman, A. (2005). Negative calcium feedback: The road from Chay-Keizer. In ‘‘Bursting: The Genesis of Rhythm in the Nervous System,’’ (S. Coombes and P. C. Bressloff, eds.), World Scientific, Singapore.
Bertram, R., Butte, M., Kiemel, T., and Sherman, A. (1995). Topological and phenomenological classification of bursting oscillations. Bull. Math. Biol. 57, 413–439.
Butera, R. J., Rinzel, J., and Smith, J. C. (1999). Models of respiratory rhythm generation in the pre-Bötzinger complex I. Bursting pacemaker neurons. J. Neurophysiol. 82, 382–397.
Chay, T. R., and Keizer, J. (1983). Minimal model for membrane oscillations in the pancreatic β-cell. Biophys. J. 42, 181–190.
Coombes, S., and Bressloff, P. C. (2005). Bursting: The Genesis of Rhythm in the Nervous System. World Scientific Publishing Co., Singapore.
Cornelisse, L. N., Scheenen, W. J. J. M., Koopman, W. J. H., Roubos, E. W., and Gielen, S. C. A. M. (2001). Minimal model for intracellular calcium oscillations and electrical bursting in melanotrope cells of Xenopus laevis. Neural Comput. 13, 113–137.
Dean, P. M., and Mathews, E. K. (1970). Glucose-induced electrical activity in pancreatic islet cells. J. Physiol. 210, 255–264.
Del Negro, C. A., Hsiao, C.-F., Chandler, S. H., and Garfinkel, A. (1998). Evidence for a novel bursting mechanism in rodent trigeminal neurons. Biophys. J. 75, 174–182.
Destexhe, A., Babloyantz, A., and Sejnowski, T. J. (1993). Ionic mechanisms for intrinsic slow oscillations in thalamic relay neurons. Biophys. J. 65, 1538–1552.
Ermentrout, G. B., and Chow, C. C. (2002). Modeling neural oscillations. Physiol. Behav. 77, 629–633.
Friesen, W. O., and Block, G. D. (1984). What is a biological oscillator? Am. J. Physiol. 246, R847–R853.
Goldbeter, A., and Lefever, R. (1972). Dissipative structures for an allosteric model; application to glycolytic oscillations. Biophys. J. 12, 1302–1315.
Izhikevich, E. M. (2000). Neural excitability, spiking and bursting. Int. J. Bifur. Chaos 10, 1171–1266.
Izhikevich, E. M. (2001). Resonate-and-fire neurons. Neural Netw. 14, 883–894.
Kuske, R., and Baer, S. M. (2002). Asymptotic analysis of noise sensitivity in a neuronal burster. Bull. Math. Biol. 64, 447–481.
Li, Y.-X., Stojilkovic, S. S., Keizer, J., and Rinzel, J. (1997). Sensing and refilling calcium stores in an excitable cell. Biophys. J. 72, 1080–1091.
Correlation Analysis of Oscillations
21
Lim, S., and Rinzel, J. Noise-induced transitions in slow wave neuronal dynamics. J. Comput. Neurosci. (in press). Llinas, R. R. (1988). The intrinsic electrophysiological properties of mammalian neurons: Insights into central nervous system function. Science 242, 1654–1664. Llinas, R. R., Grace, T., and Yarom, Y. (1991). In vitro neurons in mammalian cortical layer 4 exhibit intrinsic oscillatory activity in the 10- to 50-Hz frequency range. Proc. Natl. Acad. Sci. USA 88, 897–901. Murray, J. D. (1989). Mathematical Biology. Springer-Verlag, Berlin. Rinzel, J. (1985). Excitation dynamics: Insights from simplified membrane models. Fed. Proc. 44, 2944–2946. Rinzel, J. (1987). A formal classification of bursting mechanisms in excitable systems. In ‘‘Mathematical Topics in Population Biology, Morphogenesis and Neurosciences,’’ (E. Teramoto and M. Yamaguti, eds.), Vol. 71. Springer-Verlag, Berlin. Rinzel, J., and Ermentrout, G. B. (1998). Analysis of neural excitability and oscillations. In ‘‘Methods in Neuronal Modeling: From Ions to Networks,’’ (C. Koch and I. Segev, eds.), pp. 251–291. MIT Press, Cambridge. Rinzel, J., and Lee, Y. S. (1987). Dissection of a model for neuronal parabolic bursting. J. Math. Biol. 25, 653–675. Rowat, P. (2007). Interspike interval statistics in the stochastic Hodgkin–Huxley model: Coexistence of gamma frequency bursts and highly irregular firing. Neural Comput. 19, 1215–1250. Sherman, A., and Rinzel, J. (1992). Rhythmogenic effects of weak electrotonic coupling in neuronal models. Proc. Natl. Acad. Sci. USA 89, 2471–2474. Shpiro, A., Curtu, R., Rinzel, J., and Rubin, N. (2007). Dynamical characteristics common to neuronal competition models. J. Neurophysiol. 97, 462–473. Stern, J. V., Osinga, H. M., LeBeau, A., and Sherman, A. (2008). Resetting behavior in a model of bursting in secretory pituitary cells: Distinguishing plateaus from pseudoplateaus. Bull. Math. Biol. 70, 68–88. Strogatz, S. H. (1994). Nonlinear dynamics and chaos. Addison-Wesley, Reading, MA. Su, J., Rubin, J., and Terman, D. (2004). Effects of noise on elliptic bursters. Nonlinearity 17, 1–25. Tabak, J., Rinzel, J., and O’Donovan, M. J. (2001). The role of activity-dependent network depression in the expression and self-regulation of spontaneous activity in the developing spinal cord. J. Neurosci. 21, 8966–8978. Tabak, J., O’Donavan, M. J., and Rinzel, J. (2006). Differential control of active and silent phases in relaxation models of neuronal rhythms. J. Comput. Neurosci. 21, 307–328. Tabak, J., Toporikova, N., Freeman, M. E., and Bertram, R. (2007). Low dose of dopamine may stimulate prolactin secretion by increasing fast potassium currents. J. Comput. Neurosci. 22, 211–222. Tabak, J., Senn, W., O’Donovan, M. J., and Rinzel, J. (2000). Modeling of spontaneous activity in developing spinal cord using activity-dependent depression in an excitatory network. J. Neurosci. 20, 3041–3056. Tabak, J., Mascagni, M., and Bertram, R. (2009). Mechanism for the universal pattern of activity in developing neuronal networks, submitted. Tsai, T. Y.-C., Choi, Y. S., Ma, W., Pomerening, J. R., Tang, C., and Ferrell, J. E. Jr. (2008). Robust, tunable biological oscillations from interlinked positive and negative feedback loops. Science 321, 126–129. Tsaneva-Atanasova, K., Sherman, A., Van Goor, F., and Stojilkovic, S. S. (2007). Mechanism of spontaneous and receptor-controlled electrical activity in pituitary somatotrophs: Experiments and theory. J. Neurophysiol. 98, 131–144. Tyson, J. J. (1991). 
Modeling the cell division cycle: cdc2 and cyclin interactions. Proc. Natl. Acad. Sci. USA 88, 7328–7332.
22
Maurizio Tomaiuolo et al.
van der Pol, B., and van der Mark, J. (1928). The heartbeat considered as a relaxation oscillation, and an electrical model of the heart. Phil. Mag. 6, 763–775. Van Goor, F., Zivadinovic, D., Martinez-Fuentes, A. J., and Stojilkovic, S. S. (2001). Dependence of pituitary hormone secretion on the pattern of spontaneous voltagegated calcium influx. J. Biol. Chem. 276, 33840–33846. Wilson, H. R., and Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophys. J. 12, 1–24.
CHAPTER TWO

Trait Variability of Cancer Cells Quantified by High-Content Automated Microscopy of Single Cells

Vito Quaranta,*,† Darren R. Tyson,*,† Shawn P. Garbett,† Brandy Weidow,*,† Mark P. Harris,* and Walter Georgescu†,‡

Contents
1. Introduction
2. Background
3. Experimental and Computational Workflow
   3.1. Time-lapse image acquisition
   3.2. Data management
   3.3. Image processing
   3.4. Cellular parameter extraction
   3.5. Statistical analysis
   3.6. Data categories
4. Application to Traits Relevant to Cancer Progression
   4.1. Cell motility
   4.2. Cell proliferation
5. Conclusions
Acknowledgments
References
Abstract

Mapping quantitative cell traits (QCT) to underlying molecular defects is a central challenge in cancer research because heterogeneity at all biological scales, from genes to cells to populations, is recognized as the main driver of cancer progression and treatment resistance. A major roadblock to a multiscale framework linking cell to signaling to genetic cancer heterogeneity is the dearth of large-scale, single-cell data on QCT, such as proliferation, death sensitivity, motility, metabolism, and other hallmarks of cancer. High-volume single-cell
* Department of Cancer Biology, Vanderbilt University Medical Center, Nashville, Tennessee, USA
† Vanderbilt Integrative Cancer Biology Center, Vanderbilt University Medical Center, Nashville, Tennessee, USA
‡ Department of Biomedical Engineering, Vanderbilt University Medical Center, Nashville, Tennessee, USA
data can be used to represent cell-to-cell genetic and nongenetic QCT variability in cancer cell populations as averages, distributions, and statistical subpopulations. By matching the abundance of available data on cancer genetic and molecular variability, QCT data should enable quantitative mapping of phenotype to genotype in cancer. This challenge is being met by high-content automated microscopy (HCAM), based on the convergence of several technologies, including computerized microscopy, image processing, computation, and heterogeneity science. In this chapter, we describe an HCAM workflow that can be set up in a medium-sized interdisciplinary laboratory, and its application to produce high-throughput QCT data for cancer cell motility and proliferation. This type of data is ideally suited to populate cell-scale computational and mathematical models of cancer progression, for quantitative and predictive evaluation of cancer drug discovery and treatment.
1. Introduction

Cancer cells both across and within individual patients are heterogeneous with respect to genetic (and epigenetic) makeup (Heng et al., 2009). Furthermore, it is increasingly appreciated that even within genetically identical clonal cell populations, individual cells may differ from each other in phenotypic traits (Brock et al., 2009; Stockholm et al., 2007). Genetic and nongenetic heterogeneity remain a formidable challenge for cancer treatment, especially in the case of molecularly targeted drugs. Large-scale genetic and epigenetic analyses of cancer variability have begun to extract common patterns at the molecular scale within this morass of heterogeneity. More recently, powerful cell phenotype analytical methods are coming on line, mostly due to the convergence of image processing, computer-driven automation, and high-throughput microscopes (Dove, 2003; Evans and Matsudaira, 2007; Perlman et al., 2004; Starkuviene and Pepperkok, 2007). These methods, termed for convenience high-content automated microscopy (HCAM), enable large-scale analyses of cancer cell phenotype variability that will eventually match the scope of genetic variability analyses. In this chapter, we describe our implementation of HCAM methods to quantify cell traits such as proliferation and motility, and their variability within a cell population such as a cancer cell line. To be clear, we refer to a quantitative cell trait (QCT) as a cell-scale functional property (e.g., proliferation, motility, metabolism, death sensitivity) that displays cell-to-cell variability in a cell population, with respect to some quantitative metric. It is highly desirable that a QCT be defined in numeric terms for machine compatibility, since it is virtually impossible to intuitively deal with, follow in time, or predict the consequences of QCT combinations, for example, in
cancer progression or drug response. In these early days, these metrics are not firmly established and undoubtedly at some point will have to be agreed upon, particularly for comparing data from different sources in an automated fashion. In this chapter, our primary goal is to describe methods that define QCT heterogeneity quantitatively, regardless of the source of heterogeneity (e.g., genetic, epigenetic, nongenetic). However, we recognize that, once metrics are established, investigating the source of QCT variability becomes a tantalizing priority. Such investigations may span from identification of the molecular mechanisms responsible for generating or dampening QCT variability, to mathematical or statistical modeling for a best guess at the type of heterogeneity source (e.g., genetic or nongenetic). The consequences are very practical: a genetic source implies permanent inheritance of the heterogeneity, whereas a nongenetic source produces only temporary inheritance.
2. Background

Heterogeneity is a central feature of cancer that occurs at all biological scales, from genes to cells to populations. For decades, it has been suspected to be largely responsible for cancer progression and resistance to treatment, spawning intense study especially at the genetic and molecular level. For example, panels of cancer cell lines have been subjected to genomic, gene expression, or protein array analyses, revealing a large number of genetic mutations and signaling network alterations associated with malignant transformation. While these studies have been enormously informative and have taken our understanding of cancer to incredible depth, they have suffered from at least two limitations: (i) high-throughput genetic and biochemical studies are generally impractical at the single-cell level and are mostly based on average measurements of a test cell population; and (ii) genotype-to-phenotype mapping in a single cell, that is, linking genetic or molecular changes with phenotypic trait output (like motility or proliferation), remains challenging. These limitations are especially frustrating in the context of cancer progression, which is paced by the appearance of cell clones abnormal with respect to "hallmark" traits, such that cancer may be referred to as a disease of outliers. Methods to map QCT to underlying molecular defects would effectively produce a multiscale bridge from cancer genetics to cancer cell biology. In this multiscale framework, treatment and drug discovery could be approached with predictive methods. Analysis at the single-cell level of QCT variability, regardless of source, is commanding increasing attention due to convergent advances in several disciplines. As a whole, the science of heterogeneity has reached maturity in
many fields, such as face recognition, machine learning, and signal processing, producing theory that is applicable to cancer cell biology, as well as a wealth of mathematical and computational tools. Computer-driven microscopes are rapidly being refined and promise to deliver for adherent cells the spectacular advances flow cytometry has produced for cells in suspension. Image processing software and automation have the potential to create automated workflows to capture and analyze the behavior of thousands of single cells under tens of conditions in relatively short experimental times. This ensemble of technology is commonly referred to as HCAM. Its application to cancer cell biology promises to revolutionize our understanding of cell-to-cell variability with respect to a phenotypic trait, referred to above as QCT. It is perhaps worth noting that QCT studies are gaining traction in fields other than cancer. An emerging view is that phenotypic trait variability is inherent to living systems, in large part as an inevitable consequence of biological ‘‘noise’’ at several steps of intracellular molecular operation (e.g., gene transcription, mRNA translation, protein folding). Furthermore, local microenvironmental conditions may extinguish or amplify this noise and, to an extent, even nongenetic variability is inherited from mother to daughter cells on a temporary basis. Normal cells apply considerable resources to constrain or dampen both genetic and nongenetic variability of their traits, particularly as they become functional components of differentiated tissues. In this sense, variability is a negative factor with respect to homeostasis. However, trait variability may provide options to cells for responding to microenvironmental changes, for example, by pushing operation of that trait to the extremes of a range in order to survive under extreme microenvironmental stress, and perhaps for evolving new strategies. In summary, variability of a phenotypic cell trait can be considered a measure of cell adaptability, as well as of evolvability of underlying biochemical circuitry. From this broader perspective, an intriguing view is that during cancer progression the adaptability of cancer cells to microenvironmental changes is ever-increasing. QCT analysis is a key step to breaking down adaptability into numerical parameters that can be evaluated spatially and temporally by higher scale computational modeling.
3. Experimental and Computational Workflow

Advances in experimental and microscopic technologies have made it possible to gather high-quality, high-content images of cells and cellular components at an ever-increasing rate. Development of such state-of-the-art equipment and tools allows investigators to gather spatially resolved (i.e., in x-, y-, and z-axes, covering many fields of view, using multiple wavelengths) and time-resolved (i.e., at rapid intervals, over several days) quantifiers that
describe various cell traits (i.e., motility, proliferation) in vitro. Such methodology can also be used to explore the heterogeneity of traits of individual cells within cell populations. HCAM is rapidly becoming the most efficient methodology to measure phenotypic traits of cancer cell populations at the single-cell level. In this section, we describe a streamlined workflow for acquiring and processing HCAM data that we have established for our own group (Fig. 2.1). This process can be divided into the key steps highlighted in Fig. 2.1. For some of these steps, recent comprehensive reviews have appeared and the reader is referred to those after a brief discussion. Other steps are dealt with in detail.

[Fig. 2.1 diagram: time-lapse image acquisition (raw data, 96- or 384-well plates) → data management (OME/OMERO, >100 GB per experiment) → image processing (ACCRE cluster, ~96 wells/h; segmentation, tracking) → cellular parameter extraction (intensity, morphology, velocity, etc.) → statistical analysis ((non)parametrics, BIC, MLE, clustering) → data categories (average, distribution, variability/subpopulations) → input into mathematical/computational models]

Figure 2.1 Workflow for computer-assisted analysis of quantitative cell traits (QCT). For simplification, the general workflow associated with measuring QCT is broken into the following steps: (i) time-lapse image acquisition (e.g., assay setup, microscopy); (ii) data management (OME, ACCRE); (iii) image processing (e.g., cell identification, segmentation, tracking); (iv) cellular parameter extraction (e.g., cell speed, doubling time); (v) statistical analysis ((non)parametric tests); and (vi) data categories (i.e., averages, distributions, and "statistical subpopulations").

In brief, the workflow is enabled by development of an informatics pipeline for images and image-associated metadata. Image features are derived using a suite of existing and newly developed image analysis and computer
software tools. Ultimately, the goal of this workflow is to establish an efficient pipeline to store, disseminate, and analyze single-cell data, streamlining the use of data categories, for example, as input into mathematical/computational models (Fig. 2.1). For simplification, the workflow has been broken down into the following steps (all of which are expanded upon in the following sections):

1. Time-lapse image acquisition
2. Data management
3. Image processing (segmentation and tracking)
4. Cellular parameter extraction
5. Statistical analyses
6. Data categories (average, distribution, statistical subpopulations)
3.1. Time-lapse image acquisition

Measurement of spatially- and time-resolved phenotypic traits involves large-scale data acquisition. In order to examine traits efficiently in many cell lines, in many relevant conditions (e.g., hypoxia, drug treatment), and over time, inclusion of a high-throughput methodology such as HCAM is vital. We have primarily utilized a temperature- and CO2-controlled, automated, spinning-disk confocal microscope, the BD Pathway 855 (BD Biosciences, Rockville, MD), for single-cell phenotypic studies, although many other systems exist that provide similar functions. The Bioimager is capable of imaging an entire 96-well microplate in a single channel and focal plane in 10 min, but also has the flexibility to accommodate many other sample formats. Imaging can also be performed repeatedly with multiple images per well, multiple focal planes (z-sections), and multiple fluorescent channels (using two light sources and various filters for a variety of fluorophores), making the setup ideal for high-content time-lapse studies. In addition, the machine has a capacity for automated liquid handling, allowing precise control of the duration and volume of compound treatments. Lastly, this machine is versatile, with adaptable hardware that is directly integrated into our data management system (described below in detail). Ultimately, increasing the efficiency of data acquisition via such methodology makes it possible to maximize the amount (and potentially the value) of quantitative data extracted from image-based single-cell studies.

There are several trade-offs that require careful consideration during single-cell studies. First, the speed at which images can be acquired limits the number of wells, surface area covered, number of channels, z-sections, etc. that can be imaged prior to returning to the starting position for the next sequential round of image acquisition. For instance, the frequency of image acquisition is of critical importance when examining motile cells. In order to automate the identification and tracking of
individual cells over time (repeated images), the distance the cell has moved between frames must be kept below a minimal threshold that is dependent on the density of cells imaged. It is computationally more challenging to identify a cell between two sequential images as the number of cells and the distance a cell moves increase. A list of imaging trade-offs is shown in Table 2.1.

Table 2.1 Imaging trade-offs for dynamic high-content automated microscopy

Frequency of image acquisition
- Automatic tracking algorithms become more error-prone as cell speed or time between successive frames increases
- Increasing imaging frequency facilitates automatic tracking but increases total light exposure, which increases risk of phototoxicity and photobleaching

Duration of light exposure
- Increasing exposure time increases signal-to-noise ratio, but also increases total light exposure, which increases risk of phototoxicity and photobleaching
- The minimum exposure time that provides sufficient signal-to-noise ratio over the entire experiment should be employed
- Camera binning may be used to increase signal at the cost of some spatial resolution

Area to be imaged
- Directly determines the maximum number of cells to be imaged
- Decreasing objective magnification (e.g., from 20x to 10x) increases area but reduces resolution
- Digital stitching of adjacent frames (montaging) can be used to increase imaged area at the cost of time, file size, and increased light exposure at overlapping frame borders

Number of channels and z-sections
- Increasing the number of parameters and z-sections increases light exposure and time required per well

Number of conditions, cell types, and technical replicates
- Limited by the frequency of imaging required in each well to address the biological question and the time required to image each well
- Increasing technical replicates allows a sufficient number of cells to be imaged if low-density culture is required initially

Duration of experiment
- Automatic tracking algorithms become more error-prone as cell density increases, which occurs exponentially under optimal culture conditions
- Longer experiment times may affect maintenance of microenvironmental conditions (e.g., depletion of nutrients, medium evaporation, etc.)

Another important consideration is photobleaching and phototoxicity. This is generally not a problem for phase-contrast imaging, but can be
a substantial limitation for fluorescence imaging. Nipkow spinning disk confocal imaging is particularly well suited for reducing phototoxicity and photobleaching and has become the method of choice for live-cell imaging (Gräf et al., 2005). However, a limitation of imaging through spinning disks is that z-axis resolution is reduced compared to that of laser scanning confocal imaging. Regardless of whether spinning disks are used, the potential effects of imaging on cellular phenotypes must be considered.
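As a back-of-the-envelope illustration of the frame-interval trade-off discussed above (the numbers and the half-spacing rule of thumb here are our assumptions, not values from a protocol): a common tracking heuristic is to keep the expected per-frame displacement well below half the typical nearest-neighbor spacing between cells.

```r
# Illustrative only; values are assumptions, not recommended settings.
mean_speed <- 1.5                            # um/min, typical single-cell speed
nn_spacing <- 60                             # um, nearest-neighbor distance at seeding
max_dt     <- 0.5 * nn_spacing / mean_speed  # longest safe frame interval (min)
max_dt                                       # 20 min; 5-6 min (Table 2.2) is well below
```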
3.2. Data management

Individual HCAM experiments generate large datasets, commonly exceeding 50 GB in size. Therefore, data management (including storage, retrieval, backup, and processing) is facilitated by incorporation of the data into the Open Microscopy Environment (OME; http://www.openmicroscopy.org; Swedlow et al., 2003). This open-source software has been designed specifically to address the challenges of HCAM data and provides a standardized management platform by developing software and data format standards for the storage and manipulation of biological microscopy data (Goldberg et al., 2005; Swedlow et al., 2009). OME has previously been used in a number of biological studies to examine many aspects of cellular behavior (Dikovskaya et al., 2007; Porter et al., 2007). An OME Remote Objects (OMERO) server is established, which provides access to image data (the binary pixel data) and metadata (i.e., associated information about instrument settings, configurations, and annotations). Access to data is enabled through client applications that run on a user's computer. These include lightweight web-based interfaces, which can be accessed from any computer with a standard web browser; Java-based client applications, which provide more functionality than the web interface but must be installed separately on each client computer; and a full cross-platform API, which provides data accessibility from third-party applications like ImageJ and VisBio. In addition, incorporation of data into OME provides MATLAB bindings to facilitate sophisticated image processing and analysis directly through the OMERO server.
3.3. Image processing

Formatted images classified into datasets are then processed using various image analysis tools/algorithms (e.g., cell tracking, segmentation); we use a combination of existing tools (some freely available from open sources), such as MetaMorph™ (Molecular Devices, Sunnyvale, CA), ImageJ (http://rsbweb.nih.gov/ij/; Rasband, 1997–2006), CellProfiler (http://www.cellprofiler.org; Carpenter et al., 2006), and OpenLab (Improvision, Waltham, MA), and others
custom-developed in-house, using the Vanderbilt Advanced Computing Center for Research & Education (ACCRE) cluster for rapid processing of individual wells in parallel. MATLAB and Unix shell scripts, which are designed to run in a high-throughput mode, facilitate this effort. We will present three specific processing modules that were custom-designed for processing of cell motility (Section 4.1) and proliferation (Section 4.2).
3.4. Cellular parameter extraction

The output from the image processing pipeline is a set of cell parameters and images for visual inspection. Information from each tracked cell can be extracted from raw or processed images, or from aggregate data, for further analysis. Typical parameters obtained from each image include cell perimeter, mean pixel intensity, and measures of shape such as eccentricity and solidity. Other measurements first require the identification of individual cells across multiple frames. These QCT include parameters of cell motility (e.g., speed, direction), intermitotic times (IMT), and progeny trees. Once all data are extracted from the images, they are saved as a set of CSV files for statistical analysis and a set of images for visual inspection (i.e., quality control).
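As a concrete illustration, parameters such as speed can be computed from the exported CSV files with a few lines of R; the file name and column names (cell, frame, x, y) here are hypothetical, and coordinates are assumed to be in µm at a fixed frame interval.

```r
# Sketch: per-cell parameter extraction from a hypothetical tracking CSV
# with columns cell, frame, x, y (coordinates in um; frames dt minutes apart).
dt     <- 5                                        # min between frames (assumed)
tracks <- read.csv("tracks_well_A01.csv")          # one row per cell per frame
per_cell <- lapply(split(tracks, tracks$cell), function(tr) {
  tr   <- tr[order(tr$frame), ]
  step <- sqrt(diff(tr$x)^2 + diff(tr$y)^2)        # frame-to-frame displacements
  c(speed = mean(step) / dt,                       # average speed (um/min)
    net_displacement = sqrt((tr$x[nrow(tr)] - tr$x[1])^2 +
                            (tr$y[nrow(tr)] - tr$y[1])^2))
})
params <- do.call(rbind, per_cell)                 # one row per tracked cell
write.csv(params, "cell_parameters.csv")
```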
3.5. Statistical analysis

A variety of analytical and statistical tools are applied to further analyze single-cell data, using a few statistical/mathematical packages including R (http://www.r-project.org/; free software environment), SPSS (SPSS, Inc., Chicago, IL), and Mathematica (Wolfram Research, Inc., Champaign, IL). Averages and distributions of data are tested using a combination of parametric and nonparametric statistical tests, as needed. First, normality of the data is tested using various statistical tests (e.g., D'Agostino's K-squared, Shapiro-Wilk W), chosen depending on sample size, prior to all further analyses. Given a dataset that classically fits normality, parametric statistics (e.g., Student's t-test, ANOVA) can be applied to detect significant relationships, and presentation of averages, standard deviation (SD), and standard error (SE) is sufficient to describe the population. However, given nonnormality, somewhat more laborious nonparametric tests must be employed to accurately capture population characteristics (e.g., Wilcoxon signed-rank and Kolmogorov-Smirnov tests). Of note, failure to verify assumptions about the data (particularly in the study of population heterogeneity) can lead to unfortunate misinterpretations and wrong conclusions. Statistical subpopulation analysis also employs a number of other classic and adapted methods, which are described at length in Section 3.6.3.
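The decision rule described above can be encoded directly. A minimal R sketch follows; the variable names are illustrative (e.g., per-cell speeds for two conditions taken from a parameter matrix like the one sketched in Section 3.4).

```r
# Sketch: normality gate, then parametric or nonparametric comparison.
speed_ctrl <- params_ctrl[, "speed"]    # hypothetical control measurements
speed_drug <- params_drug[, "speed"]    # hypothetical treated measurements
normal <- shapiro.test(speed_ctrl)$p.value > 0.05 &&
          shapiro.test(speed_drug)$p.value > 0.05   # note: requires 3 <= n <= 5000
if (normal) {
  print(t.test(speed_ctrl, speed_drug))             # parametric
} else {
  print(wilcox.test(speed_ctrl, speed_drug))        # nonparametric (rank-based)
  print(ks.test(speed_ctrl, speed_drug))            # compares full distributions
}
```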
3.6. Data categories

Raw or processed data can be presented in various ways, of which we discuss three: (i) averages, (ii) distributions, or (iii) "statistical subpopulations" (i.e., the variability distribution discretized by statistical techniques such as clustering). Each of these categories can then be incorporated into corresponding mathematical models.

3.6.1. Averages
We have previously incorporated average data for various phenotypic traits of a panel of genetically related breast cancer cells into the hybrid discrete-continuum (HDC) mathematical model for parameterization (Anderson et al., 2009). However, with the realization of the heterogeneity of cell populations, presentation of a single value (average) is often inadequate for an accurate description. Although SD or SE can sometimes be used to effectively describe the variability of normal (Gaussian) populations, skewed nonnormal datasets rich with outliers and possible subpopulations should not be reduced to these summary measures. Instead, analysis of population probability distributions and subpopulations via various approaches is preferable in these instances, as described below.

3.6.2. Distributions
Obtaining single-cell measures using HCAM, in combination with rigorous statistical treatment, allows examination and analysis of large populations of cells (N > 1000) in a fairly efficient manner. Using this type of data acquisition for phenotypic traits, in lieu of population-based metrics, allows presentation of a probability distribution, which describes both the range of possible values that a random variable can attain and the probability that the value lies within any subset of that range. This category of measurement is particularly useful for representing the spread or variability (i.e., heterogeneity) of a cell population by depicting the nuances of its data (e.g., nonnormality, skewness, kurtosis, outliers), which are lost in a simple presentation of averages. By providing such data for parameterization of mathematical and computational models where appropriate, one can model the heterogeneity of populations more realistically (in line with experimentation), which may ultimately lead to important insights otherwise overlooked. A specific example of applying these techniques to single-cell motility data is detailed below in Section 4.1.

3.6.3. Statistical subpopulations
Using other statistical approaches, raw data for various phenotypic cell traits can also be processed to reveal "subpopulations" present within the greater population being examined. In order to quantify intra-cell-line variability, we discretize the continuous distribution measurements described in the
above section into "functional subpopulations," as previously described (Loo et al., 2007; Perlman et al., 2004; Slack et al., 2008). The advantage of identifying discrete subpopulations is that they can be compared across cell lines and used to identify common trends in response to perturbations of interest. Specifically, methods can be employed to estimate trait subpopulations using model-fit criteria, such as the Bayesian information criterion (BIC) or gap statistics (Fraley and Raftery, 2002). BIC is an approximation of the integrated likelihood, according to Eq. (2.1):

$$ 2 \log p(D \mid M_k) \approx 2 \log p(D \mid \hat{\theta}_k, M_k) - v_k \log n = \mathrm{BIC}_k \qquad (2.1) $$

Here, $v_k$ is the number of independent parameters to be estimated in model $M_k$, $\hat{\theta}_k$ is the parameter estimate, and $n$ is the number of points in the dataset $D$. This approximation has been shown to be a consistent estimator of density, even when dealing with nonparametric (Roeder and Wasserman, 1995) or noisy data. Expectation maximization (EM) is also used in statistics for finding maximum likelihood estimates (MLE) of parameters with a known number of clusters (Eliason, 1993). Using this method, model-based hierarchical agglomerative clustering is used to compute an approximate maximum for the classification likelihood, following Eq. (2.2), as previously described (Fraley and Raftery, 2002):

$$ L_{\mathrm{cl}}(\theta_1, \ldots, \theta_G; \ell_1, \ldots, \ell_n \mid y) = \prod_{i=1}^{n} f_{\ell_i}(y_i \mid \theta_{\ell_i}) \qquad (2.2) $$
Here, $\ell_i$ labels a unique classification of each observation, and $\theta_g$ is the parameter estimate for each cluster. By combining hierarchical agglomerative clustering with both EM and BIC, a robust strategy is developed. The algorithm in brief: (1) choose the maximum number of clusters; (2) perform hierarchical agglomerative clustering to estimate a classification of the data under each model, up to the selected maximum number; (3) run EM to determine the parameters under each model; and (4) use BIC to select the most likely model of the data. Additional statistical techniques (e.g., principal components analysis (PCA) or Gaussian mixture models (GMM)) can then be applied as needed to reduce the dimension of a dataset and to find clusters of cells or subpopulations. Ultimately, these subpopulations are represented as probabilistic mixtures of stereotypes (i.e., phenotypes). As presented previously (Loo et al., 2007; Perlman et al., 2004; Slack et al., 2008), we can summarize the percentages of states within a cancer population as a "subpopulation trait profile": a simple probability vector whose entries sum to one. This analysis allows us to approximate the subpopulations (i.e., heterogeneity) that exist within a cell line population with respect to a specific cell trait. Further, we can also use this approach to examine whether specific microenvironmental perturbations (e.g., hypoxia, drug treatment) influence or induce
apparent patterns of heterogeneity of cancer cell populations. This particular approach can be invaluable for explaining shifts in cell populations or, more interestingly, in cell subpopulations, which is quickly becoming a major field of study in cancer research. These analyses have all been previously used in different combinations for teasing apart cell subtypes based on various parameters (Loo et al., 2007; Perlman et al., 2004; Slack et al., 2008). In summary, this is just one approach for numerically describing the heterogeneity of cell populations, particularly highlighting outliers, based on any number of relevant traits of interest. Much like flow cytometry separates a cell population in suspension into various subpopulations based on certain assignments (e.g., fluorescent marker, cell size), the coupling of HCAM and rigorous statistical tests can provide a means for separating or grouping live cells dynamically over time in image-based studies. A specific example of applying these techniques to single-cell proliferation data is detailed below in Section 4.2.
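A minimal sketch of this strategy in R uses the mclust package, which implements the hierarchical-initialization, EM, and BIC model-selection approach of Fraley and Raftery (2002); the trait vector here is illustrative.

```r
# Sketch: statistical subpopulations for one QCT via model-based clustering.
library(mclust)
trait <- params[, "speed"]            # any per-cell QCT measurement (assumed)
fit   <- Mclust(trait, G = 1:9)       # EM fits for 1-9 clusters; BIC selects G
summary(fit)                          # chosen model and number of clusters
profile <- table(fit$classification) / length(trait)
round(profile, 3)                     # "subpopulation trait profile" (sums to 1)
```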
4. Application to Traits Relevant to Cancer Progression

As described above, various experimental and computational tools are being developed to investigate a number of applications relevant to the study of QCT in cancer progression. In this chapter, we have focused on measurement and analysis of two specific phenotypic traits (QCT) of single cells, motility and proliferation, both of which are hallmarks of cancer (Hanahan and Weinberg, 2000). It is well established that both traits are aberrantly regulated during disease progression, and probable that intervention strategies targeting these processes may be useful in clinical treatment. The following sections briefly expand upon the clinical importance of each trait, our chosen methodology for various analyses of each, and the implications of conducting such studies.
4.1. Cell motility

Cell motility plays an essential role in many biological systems, but precise quantitative knowledge of the biophysical processes involved in cell migration is limited. It is well established that migration of both epithelial and transformed cancer cells is a complex and dynamic process, which involves changes in cell size, shape, and overall movement (Friedl and Wolf, 2003). Therefore, one can characterize cell motility by quantifying several metrics. This provides opportunities to improve the predictive accuracy of computational and mathematical models by incorporating more numerical parameters. Herein, we present a method for assessing single-cell motility,
combining experimental, statistical, and computational tools, and apply it to the analysis of the dynamics of "unbiased" single-cell migration in vitro (i.e., undirected, or without addition of chemoattractant). This pipeline for analysis was designed with the intention of examining large numbers of heterogeneous cancer cell populations (i.e., cell lines in vitro). This method improves upon classic methods for studying migration (e.g., Boyden chamber) because it captures the single-cell dynamics underlying the heterogeneity of cancer cell populations.

4.1.1. Single-cell motility analysis: Image acquisition and validation
We established protocols for both manual (Harris et al., 2008) and automated (custom-written algorithms) cell tracking of single-cell motility. Manual cell tracking is standard practice (Harris et al., 2008) and not covered in this chapter. Although facilitated by several software packages and image analysis tools (Section 3.3), it is laborious and time-consuming (Harris et al., 2008) and limits throughput. In the context of HCAM and high-throughput studies, it still has a critical function in validating automated analyses. Automated high-throughput cell tracking (thousands of cells) presents significant challenges, discussed in the following. Due to the low signal-to-noise ratio (low contrast between cell and background), automated cell tracking of digital bright-field or phase-contrast microscopic images is often impractical and error prone. Fluorescence-based imaging has far superior signal-to-noise ratios, and the resultant images significantly simplify the process of automated tracking. Therefore, for high-throughput studies in our laboratory (using the BD Pathway 855), cells are labeled with a nuclear protein (histone H2B) conjugated to monomeric red fluorescent protein (H2BmRFP; Addgene Plasmid 18982) to enable identification of the nuclei of individual cells. This protein has been used by many groups for imaging purposes, and to date no significant alterations in cellular function due to its expression have been described. The most efficient method for obtaining pooled populations of cells with stable expression of a transgene is retroviral-mediated transduction, although any method that produces similar results may be employed. Cells should be flow-sorted to minimize the number of nonexpressing cells within the populations. Numerous protocols exist for these procedures and will not be covered here further. Once a stable cell line is established in which the fluorescent protein is expressed, various parameters must be compared to the parental strain to ensure no obvious clonal selection has occurred. HCAM assays can then be carried out as follows: (1) Cells are seeded into 96-well microplates (~2000 cells per well), allowed to adhere for 1 h in the temperature-controlled (37 °C), CO2-controlled chamber of the BD Pathway 855 machine, and washed to remove nonadherent cells from wells prior to
tracking. (2) Fluorescent images are then automatically obtained at predetermined intervals for a given period of time (5 min intervals, for 4 h), controlled by BD Attovision software. Based on the information presented in Table 2.1, imaging parameters are set to enable the automatic tracking of as many cell types and conditions as possible. For the BD Pathway 855, the optimized settings for automated tracking of H2B-labeled MCF10A cells are listed in Table 2.2. Using these image settings, 240 TIFF images (1.3 MB each) are generated per well per experiment, approximately 35,000 images comprising 50 GB of storage space per experiment. These images are exported and stored using the data management strategies previously described in Section 3.2 (OME, ACCRE).

4.1.2. Single-cell motility: Image processing
We are developing custom-written algorithms for automated assessment of single-cell motility. These tools are designed to integrate with a number of programs/applications, including MATLAB, CellProfiler, and ImageJ. They also interface with OME and cluster computing (e.g., ACCRE at Vanderbilt). A motility software module we designed (named WG1) imports raw images, thresholds them to obtain binary images, segments the binary blobs into objects (i.e., cells, nuclei), calculates centroid values, and assembles them into a matrix that is sent to an external tracking algorithm (for bright-field images, an external tensor voting algorithm can also be used to infer missing edges prior to segmentation). Tracks obtained from the external algorithm are then saved and can be used for processing by other modules (described in other sections). Optionally, WG1 can also overlay the detected single-cell outlines and tracks on the original cell images and save the new images to disk. The resulting image stacks can be visually inspected for quality control.
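As a highly simplified stand-in for the external tracking step used by WG1, the following R sketch links nuclear centroids between two consecutive frames by greedy nearest-neighbor assignment; the distance cap d_max is an assumed parameter tied to the frame-interval considerations of Table 2.1.

```r
# Sketch: greedy nearest-neighbor linking of centroids between frames.
# prev, curr: matrices with columns "x", "y" (one row per detected nucleus).
link_frames <- function(prev, curr, d_max = 20) {   # d_max in um (assumed)
  link <- rep(NA_integer_, nrow(curr))
  for (j in seq_len(nrow(curr))) {
    d2 <- (prev[, "x"] - curr[j, "x"])^2 + (prev[, "y"] - curr[j, "y"])^2
    i  <- which.min(d2)
    if (d2[i] <= d_max^2 && !(i %in% link)) link[j] <- i
  }
  link  # link[j]: row of prev continued by row j of curr; NA starts a new track
}
```

Chaining such links across all frames yields the per-cell trajectories from which the motility parameters below are computed; production trackers additionally handle divisions, merges, and gaps.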
Table 2.2 BD Pathway 855 settings for H2BmRFP fluorescence imaging

- Approximate time interval between images of the same well: 5–6 min
- 0.25 s exposure, 2x2 camera binning
- 20x objective, 1x2 montaging
- Single-channel illumination (555/28 nm excitation, 600/30 nm emission) in a single focal plane through the spinning disk
- 36–40 wells (e.g., 6 technical replicates, 2 microenvironments, 3 cell types), back-and-forth well scanning, no delays (after the last well, immediately return to the first)
- 48–96 h total imaging duration
4.1.3. Single-cell motility: Cellular parameter extraction
Once individual cells (nuclei) have been identified and tracked by either a manual or computer-driven method, a number of both classic and novel motility-related parameters for each cell and/or population can be extracted (Table 2.3). Some metrics apply at the population level (P), some at the single-cell level (S), and others at both levels. Some of these parameters are described in the following sections.

Table 2.3 List of cell motility measurements

Speed (S, P): Describes average single-cell or population-based movement according to x, y (and z) coordinates from cell tracking (µm/min)
Persistence time (P): Combination of persistence in direction and speed (min)
Motion fraction (P): Percentage of motile cells across a time-lapse movie (image stack) within a population (%)
Turn-angle distribution (S, P): Tracking x, y (and z) coordinates are used to calculate cell trajectories
Surface area (S, P): Measurement of cell size (in pixels) (Harris et al., 2008)
Speed fluctuation (S): 95% confidence interval of the standard deviation of speed for a single cell over time (image stack)
Step-length (S, P): The distance a cell moves between pauses divided by the number of total steps (µm/min)
Instantaneous motion fraction (P): Percentage of cells motile at any given time (image) within a total population (%)
Dynamic expansion and contraction cell activity (S, P): Measurement that represents the overall change in cell area and motion over time (Harris et al., 2008)

4.1.3.1. Classic single-cell and population metrics
Speed (S, P): Cell speed is thought to correlate with cancer invasion (Wells, 2006). There are several previous investigations of single-cell speed (undirected) or velocity (directed) for cancer cell lines, in various microenvironments (Anderson et al., 2009; Hofmann-Wellenhof et al., 1995; Jiao et al., 2008).
Single-cell speed obtained from time-lapse image stacks is automatically calculated using the x, y (and z, in three-dimensional studies) coordinates obtained from tracking centroids (calculated from cell nuclei outlines) using MATLAB algorithms. We have previously examined cells for time periods ranging from just a few minutes to 24 h, at various time resolutions (30 s to 10 min intervals). It should be noted that experimental conditions (e.g., cell type, matrix, surface) should be optimized, as they can contribute to metric accuracy. We have examined single-cell speeds using frequency histograms and scatter plots overlaid with box-and-whisker plots containing statistics (Fig. 2.2A), particularly to highlight heterogeneity of a dataset and other trends in the data (e.g., skewness, kurtosis).

Persistence time (P): Persistence time (min) is one of the most common measures of cell motility (Dunn and Brown, 1987). This measure assumes cell motion is a persistent random walk (PRW) and combines persistence in direction and speed in its calculation. The PRW model can be derived from the Langevin equation (Eq. (2.3)):

$$ m\mathbf{a} = m\,\frac{d\mathbf{v}}{dt} = \underbrace{F(\mathbf{x})}_{\text{force}} - \underbrace{\beta \mathbf{v}}_{\text{drag}} + \underbrace{\boldsymbol{\eta}(t)}_{\text{noise}} \qquad (2.3) $$
This is a stochastic differential equation describing Brownian motion in a potential, resulting in the Ornstein–Uhlenbeck process (Uhlenbeck and Ornstein, 1930), where m is the mass of the particle, v is the velocity vector, x is the position, t is time, β is the coefficient of friction, and η(t) represents noise of mean zero. An expectation of the model is described by the Furth equation (Eq. (2.4)) (Furth, 1920):

$$ \langle d^2 \rangle = n_d \langle S^2 \rangle P t \left[ 1 - \frac{P}{t}\left(1 - e^{-t/P}\right) \right] \qquad (2.4) $$

This equation describes the expected mean-squared displacement over time; d represents displacement, n_d is the number of dimensions, S is speed, P is persistence time, and t is time. Motion is initially ballistic (directed), transitioning in time to super-diffusive, and finally to diffusive. The persistence time is the descriptive parameter of the break point in this transition (Codling et al., 2008). Thus, to accurately calculate persistence time, one must observe cells for a long enough time interval for them to transition to the diffusive regime (roughly 3 h for a 10 min persistence time). We have previously calculated persistence times by both the traditional Dunn method (Dunn and Brown, 1987) and the updated Kipper method (Kipper et al., 2007), which reduces the standard error of the data fit by approximately 50%, and which is shown in Eq. (2.5), where ⟨x⟩ is an estimate of the normalized mean-squared displacement.
$$ \langle x \rangle(t) = P t \left[ 1 - \frac{P}{t}\left(1 - e^{-t/P}\right) \right] \qquad (2.5) $$

Figure 2.2 Classic motility-based metrics. (A) Single-cell measurements for speed (µm/min) can be effectively presented in frequency histograms (left), whereby the raw average speed calculated for each cell over time (here 4 h, with 5 min intervals) is represented in columns divided by bins (gray). P values represent whether data are distributed normally (P ≥ 0.05) or nonnormally (P < 0.05) according to a Shapiro-Wilk test (the black curve indicates the theoretical "normal" fit for each data range shown). Alternatively, data can be presented in scatter plots (right; representing individual cells) overlaid with box-and-whiskers (representing statistics of the population). Both of these graphical methods are particularly useful for presentation of datasets that are skewed and rich with variability or outliers (i.e., heterogeneity). Here, we show MCF10A, AT1, and CA1d cell lines in normal culture conditions. (B) Persistence time (min) represents the combination of a cell population's persistence in both direction and speed. Plots include analysis of persistence time according to the Kipper method, whereby a cell population's breaking point between the ballistic and diffusive regimes is quantified (Pt shown for each). Here, we again show MCF10A, AT1, and CA1d cell lines in normal culture conditions. (C) Motile cell fraction captures the percentage of cells moving out of an entire tracked population of cells. Again, MCF10A, AT1, and CA1d cell lines in normal culture conditions are presented.
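In practice, P can be estimated by nonlinear least-squares fitting of Eq. (2.5) to the measured curve. A minimal R sketch follows, assuming vectors tt (time lags, min) and msd_norm (normalized mean-squared displacement) have already been computed from the tracks; both names are hypothetical.

```r
# Sketch: persistence time from the normalized mean-squared displacement.
fit <- nls(msd_norm ~ P * tt * (1 - (P / tt) * (1 - exp(-tt / P))),
           start = list(P = 5))   # starting guess for P (min), illustrative
coef(fit)                         # estimated persistence time P
```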
Kipper also provides a full treatment of systematic errors in the measurement of persistence time. Examples of graphs of mean-squared displacement versus time are shown for three cell lines in Fig. 2.2B. We have yet to determine a steadfast trend for persistence time in the cell lines we have examined (data not published), as no obvious correlations have emerged (possibly due to the heterogeneity of the populations); however, we have determined that this metric can shift dramatically upon changing the cells' microenvironment, which is consistent with previous literature (Kim et al., 2008).

Motion fraction (P): The motile cell fraction is the percentage of motile cells within a given population, as previously described (Kim et al., 2008). In a number of previous studies, we have determined that for many cell line populations, the majority of cells are nonmotile throughout an entire assay. Interestingly, it seems that, as a trend, a small subpopulation of cells is highly motile, up to an order of magnitude greater in measurement (Fig. 2.2C).

Turn-angle distribution (S, P): This metric has classically been applied to analysis of bacterial motility (Berg and Brown, 1972; Duffy and Ford, 1997). Recently, we analyzed turn-angle distributions of epithelial and cancer cell lines (Potdar et al., 2009). Individual cell trajectories are tracked and turn-angle values taken from each. This method is subject to systematic measurement error unless appropriate sampling intervals and high-resolution images are selected. Consider a model system where speed is chosen from an exponential distribution and turn angle is chosen from a Von Mises (circular-normal) distribution (Eq. (2.6)), where r and θ are polar coordinates, λ and κ are shape parameters, and I_0 is the modified Bessel function of the first kind:

$$ f(r, \theta \mid \kappa, \lambda) = \underbrace{\frac{e^{\kappa \cos\theta}}{2\pi I_0(\kappa)}}_{\text{Von Mises}} \; \underbrace{\lambda^2 e^{-\lambda r}}_{\text{Exponential}} \qquad (2.6) $$

(The extra λ normalization term is due to the polar form of the Jacobian.)
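Turn angles themselves are easily extracted from tracked centroids. A minimal R sketch follows; x and y are one cell's coordinates in µm, with pauses below the tracking error threshold assumed to have been removed already.

```r
# Sketch: turn angles as changes in heading between successive steps.
turn_angles <- function(x, y) {
  heading <- atan2(diff(y), diff(x))    # direction of each step (radians)
  turn    <- diff(heading)
  (turn + pi) %% (2 * pi) - pi          # wrap into [-pi, pi)
}
# hist(turn_angles(x, y), breaks = 37)  # e.g., 37 bins, as in Fig. 2.3C
```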
Figure 2.3 Turn-angle (distribution) analysis. Turn-angle represents the trajectory of single cells during a time-lapse movie. (A) Von Mises/exponential polar distribution (λ = κ = 1), where the peak is the location of a cell and the height represents the probability of the cell's location in the next observation frame, the x-axis represents the turn-angle, and the grid represents the observable pixels. (B) This plot is an example of measurement error calculated for pure Brownian motion. The dotted line is the flat turn-angle distribution and the solid line is the measured distribution. (C) This plot shows the resulting error from the Von Mises/exponential model with λ = 0.5 and with 37 bins observed. The difference between observed and actual is the shaded region between the two curves. (D) Effects of total measurement error by λ on the x-axis and bin size by the three curves. Note that this does not include the potential loss for a sample interval around or above the persistence time. Total measurement error is quantified using an equation presented in the text.

Figure 2.3A shows the resulting distribution (λ = κ = 1), where the peak is the location of a cell, the positive x-axis represents the turn angle (equal to 0), and the grid represents the observable pixels. λ is a factor of mean cell speed, observation interval, and pixel width, and is the principal factor in experimental configuration. Aliasing occurs in the measurement because each x, y pair on the grid represents the observable angles (see
Fig. 2.3B, for an example of measurement error for Brownian motion). Increasing pixel resolution reduces this error. The best time sampling interval is a trade-off between being too short, whereby a cell does not move far enough along the grid (increasing aliasing), and being too long, greater than the persistence time, whereby the cell's observable motion is diffusive. Figure 2.3C shows the resulting error from the Von Mises/exponential model with λ = 0.5 and 37 bins. Computation of this is done by integrating the density in each pixel and sum-binning the density of
the measurable angle of the coordinate. This is a correctable error, and the observed bins can be corrected by this ratio. Total measurement error (TME) is quantified by Eq. (2.7), where θ_m is the measured angle, θ_a is the true angle, and n is the number of bins:

$$ \mathrm{TME} = \sqrt{\frac{\sum (\theta_m - \theta_a)^2}{n}} \qquad (2.7) $$

Figure 2.3D is a graph showing the effect of TME by λ on the x-axis and bin size by the three curves. Note this does not include the potential loss for a sample interval around or above the persistence time. Further, it is important to note that quantifying these metrics based on cell centroids, as opposed to by pixel, improves the accuracy of the data significantly.

Surface area (S, P): Surface area is commonly used in image processing (Alexopoulos et al., 2002; Carpenter et al., 2006), often as an indicator of differentiation, apoptosis, and other biological processes (Mukherjee et al., 2004; Ray and Acton, 2005). This metric simply quantifies overall cell size (in pixels). In previous studies (Harris et al., 2008), we have designed custom-written MATLAB algorithms to obtain single-cell surface area measurements of cancer cells. As with single-cell speed, this metric can be represented at both the individual and population (average) levels. Overall cell size can be assessed from bright-field or fluorescence images, and subcellular compartments (e.g., nuclei, mitochondria) can also be measured given appropriate use of markers.

4.1.3.2. Novel single-cell and population metrics
One of the main assumptions of the PRW model is that cells are always in motion. However, we have determined that cells do not necessarily meet this criterion, and instead typically pause frequently as they migrate. In order to refine the model to incorporate this idea, we have developed a few novel metrics, each described below in detail, that quantitate this phenomenon in various ways.

Speed fluctuation (S): Individual cells do not typically maintain constant speed during the course of a time-lapse movie. Instead, their activity is often composed of frames of fast movement, slower movement, and no movement. We have implemented a metric to capture this behavior, termed speed fluctuation (Fig. 2.4A). For non-Gaussian datasets, this metric is calculated using bootstrapping to obtain the range of 95% confidence intervals (CI) of the SD of cell speed for each individual cell in a population. In summary, a number of our previous studies have determined that single-cell speed over time is largely variable and that cells within a population exhibit large amounts of heterogeneity in terms of fluctuation (unpublished data). Further, we have also found that distinct cell lines exhibit contrasting trends in fluctuation (with some remaining fairly constant and others fluctuating dramatically) and that introduction of various
Speed fluctuation
B
Step identification
MCF10A
2.0 1.0
Single-cell steps
Speed (mm/min)
4.0
3.0
Dynamic expansion and contraction of cell activity (DECCA)
CA1d
4.0
Speed (mm/min)
D
Threshold for error= 1 mm
3.0 2.0 1.0 0.0
0.0 4 8 12 16 20 24 28 32 36 40 44 48
4 8 12 16 20 24 28 32 36 40 44 48
Frame
Frame
Phase contrast
A
2.0 90 1.0
N = 50
N = 50
N = 50
80
Differential
100
20 40 60 80 100120140160 180200 220 20 40 60 80 100 120 140 160 180 200 220
20 40 60 80 100 120 140 160 180 200 220
20 40 60 80 100120140160 180200 220
20 40 60 80 100120140160 180200 220
Frame
CA1d 4.0 3.0 2.0
70 60
DECCA
4 8 12 16 20 24 28 32 36 40 44 48
Percent motile cells
Speed (mm/min)
Threshold for movement = 1 pixel
3.0
0.0
Speed (mm/min)
Instantaneous motion fraction
C
20 40 60 80 100 120 140 160 180 200 220 20 40 60 80 100 120140 160 180200 220
AT1 4.0
20 40 60 80 100 120 140 160 180 200 220
50 40 30 20
1.0
0 0 0 0 0 0 0 0 0 0 0 20 40 60 80 100120140160180200220
10
0.0
20 40 60 80 100 120 140 160 180 200 220
2500 2000 1500 1000 500 0 −500 −1000 −1500 −2000 −2500
20 40 60 80 100120140160180200220
Time
0 4 8 12 16 20 24 28 32 36 40 44 48
Frame
MCF10A
AT1
CA1d
Cell line
Figure 2.4 Novel motility-based metrics. (A) Plots show speed fluctuation of randomly chosen single-cells (here, MCF10A, AT, and CA1d in normal culture conditions). As cells do not typically maintain constant speed across time, this metric is an effective way to capture fluctuations in speed. For nonnormal datasets, this metric is calculated by bootstrapping to obtain the range of 95% confidence intervals of the standard deviation of a population. (B) Plot shows the steps taken by an randomly chosen individual CA1d cell (cell steps are represented by red dashes). Cell steplength is the sum of the displacement in a step. Cell step-lengths can also be analyzed at the population-level to obtain the best-fit distribution. (C) Instantaneous motion fraction represents the percentage of cells moving (threshold for movement >1 mm) at any given time during a timelapse movie (here, 4 h, with 5 min intervals). (D) Dynamic expansion and contraction of cell activity (DECCA) values can be calculated for single cells by thresholding phase-contrast images to generate differential images that capture different types of cell movement (expansion vs. contraction) using a heat-scale (red/yellow, positive change; blue, negative change; green, no change), which are further converted to DECCA-specific images that are used for direct quantification of this metric, as previously described (Harris et al., 2008).
microenvironmental conditions can cause dampening or increases in fluctuation for cells (unpublished data). For normally distributed data, presentation of the SD or the interquartile range can convey a similar metric.

Step-length (S, P): To accurately add cell pausing into migration models, it is necessary to experimentally determine the distance a cell travels between consecutive pauses. Step-length, flight length, and flight time are three metrics that are used in ecology to study foraging behavior of birds, bees, and mammals (Gautestad and Mysterud, 2005; Viswanathan et al., 1999). The term step-length has also been used to describe the movement of molecular motors on polymers (Wallin et al., 2007). All three terms are used to quantitate distance or time between pauses in motion, but to our knowledge, this metric has not been used previously to quantify the motion of epithelial cells. To obtain step-length, we measured the overall distance traveled between cell pauses in a time-lapse movie using x, y coordinates obtained from cell tracking (a pause being defined by two consecutive frames at the same coordinate) and discarded all step-lengths below our tracking error threshold (lengths < 1 μm). Sample step-lengths are shown in Fig. 2.4B. Interestingly, we observe that, just as single-cell speed fluctuates across time and within a population, cell step-length is also highly variable both within and across cell lines.

Instantaneous motion fraction (IMF; P): Persistence and diffusion coefficients are often used to describe cellular motion. However, both of these representations make a number of assumptions about cellular behavior. In particular, they assume all cells are in motion at all times. The IMF was developed to test this assumption, and to provide an additional metric to monitor differences in migration characteristics between cell lines and in various conditions. It measures the percentage of motile cells (those moving more than 1 pixel, our measurement error threshold) within a given population at any given time (frame) of a time-lapse movie. In contrast to the motile cell fraction metric, which shows the percentage of cells that are "successful" in their migration, this metric represents the ratio of cells "attempting" to move. Figure 2.4C shows an example of applying this metric to MCF10A, AT1, and CA1d cell lines in normal tissue culture conditions; quite clearly these cell lines exhibit heterogeneous expression of motility at any given moment (at 5-min intervals).

Dynamic expansion and contraction of cell activity (DECCA) (S, P): Kymography is one method used to gain insight into the specific mechanisms of cell movement by studying morphological changes in shape and size (Bryce et al., 2005; Cai et al., 2007). However, kymography is used for relatively small sample sizes (due to the highly magnified images required) and during relatively short periods of time (Bear et al., 2002; Cai et al., 2007). We have developed a novel metric, termed DECCA, which represents the overall change in cell area and motion over time (Harris et al., 2008). We previously developed this novel metric to quantify the difference between a completely
nonmotile cell (velocity = 0) and a nonmotile cell (also with a velocity = 0, and of the exact same size) that ruffles its lamellipodia, a classic behavior of cancer cells during migration. Figure 2.4D includes a sample of how this metric captures dynamic behavior, adapted from our previous work (Harris et al., 2008).

Time-lapse microscopy images of cell motility can be used to extract all or some of the metrics described above, which can subsequently be used to generate computational simulations (Windrose plots) that combine the various parameters into a single visual depiction of motility. Sample simulations for each of the cell lines presented above in normal tissue culture conditions can be viewed at http://vicbc.vanderbilt.edu/itumor/cell.

4.1.3.3. Statistical subpopulations of motile cells
Each motility metric demonstrates heterogeneity in a cell population and can be used to investigate relevant differences between normal and cancer cells. However, the reason for using many motility metrics is that each metric by itself is insufficient for defining statistical subpopulations. Defining statistical subpopulations facilitates examining relationships between distinct QCT (e.g., defining how proliferation subpopulations relate to motility subpopulations within a cell population). In the case of motility, the cluster analysis methods of BIC and EM (as described in Section 3.5, and applied to proliferation QCT in Section 4.2.4) are applicable, as long as multiple parameters are combined.
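Before moving to proliferation, here is a minimal hedged sketch of how two of the motility metrics above (step-length and IMF) might be computed from tracked centroids; the array layout, pause tolerance, and thresholds are illustrative assumptions following the 1-μm error and 1-pixel motion cutoffs quoted in the text, not the authors' published code.

import numpy as np

def step_lengths(xy, pause_tol=0.0, error_threshold=1.0):
    # Distance traveled between consecutive pauses, where a pause is
    # no displacement between two consecutive frames (within pause_tol);
    # steps shorter than the tracking-error threshold are discarded.
    disp = np.linalg.norm(np.diff(xy, axis=0), axis=1)  # per-frame displacement
    steps, current = [], 0.0
    for d in disp:
        if d <= pause_tol:                 # cell paused: close out this step
            if current > error_threshold:
                steps.append(current)
            current = 0.0
        else:
            current += d
    if current > error_threshold:
        steps.append(current)
    return steps

def instantaneous_motion_fraction(tracks, threshold=1.0):
    # Percent of cells displacing more than `threshold` at each frame;
    # `tracks` has shape (n_cells, n_frames, 2).
    disp = np.linalg.norm(np.diff(tracks, axis=1), axis=2)
    return 100.0 * (disp > threshold).mean(axis=0)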
4.2. Cell proliferation
Typical studies of proliferation in cultured cell lines involve counting cells (either directly or indirectly) in a population over time. These results are usually presented as a population doubling time (DT) calculated from the number of cells identified at various intervals, or as a percentage of the population in each phase of the cell cycle (G1, S, or G2/M) at a given point in time (usually using flow cytometry). These population-level assays are generally limited by the fact that, as endpoint assays, they require large numbers of samples to provide accurate information. This limitation is alleviated by continual monitoring/sampling of cells within a population over time. Nonadherent cells can be sampled with relative ease without disrupting their normal culture; for adherent epithelial cell lines, however, this requires microscopic visualization. Time-lapse transmitted-light microscopy has been used for decades for continual visualization of cells over many days. However, due to the low signal-to-noise ratio (low contrast) between cells and background, described previously in the motility application above (Section 4.1.1), automated cell counting of digital light microscopic images remains a challenge. Therefore, we have moved to fluorescence-based imaging to facilitate automated tracking.
4.2.1. Validation of H2BmRFP-labeled cells
As for motility studies, we utilize flow-sorted cells with stable expression of histone H2BmRFP for proliferation studies. Prior to examination of cells at the single-cell level, it is important to ensure no obvious clonal selection has occurred during the generation of the modified cells. To do this, the resultant population must be compared to the parental cell line. This procedure is easily accomplished using HCAM and comparing to other population-level assays, manual counting being the gold standard. By imaging the cells every 1–4 h and using automatic segmentation algorithms to quantify cell numbers, population doubling times can be calculated by simple linear regression of the natural log of the number of cells in each image. An example of the verification of the similarity of H2B-labeled cells with parental cells is shown in Fig. 2.5A.

4.2.2. Single-cell proliferation rates: Image acquisition
Once the population-level proliferation rates have been validated for a particular fluorescent protein-labeled cell line, further investigation of proliferation metrics at the single-cell level can proceed. Based on the information presented in Table 2.1, imaging parameters are set to enable the automatic tracking of as many cell types and conditions as possible. The optimized settings for automatic tracking of H2B-labeled MCF10A cells with the BD Pathway 855 imager are listed in Table 2.2. Using these imaging settings, approximately 240 TIFF images (1.3 MB each) per well per day are generated; approximately 35,000 images comprising 50 GB of storage space per 96-h experiment.

4.2.3. Single-cell proliferation rates: Image processing and parameter extraction
The automated analysis of HCAM-generated images can be used to determine the IMT (time between mitotic events) of individual cells within a cell population if image acquisition is sufficiently frequent to allow for automatic tracking of cells over time (6–12 frames/h). In addition, the tracking algorithm described for motility has been modified to include the ability to detect mitotic events and associate resultant progeny with their parental cell. The first software module is the same as used for motility (WG1). The output of this module is a set of MATLAB label matrices and a list of cell centroids at each time step, which can be used for processing by two other modules for obtaining additional proliferation metrics.

The second module (WG2) uses the track ID and shape parameters from the label matrices to extract parameters. To determine cell division events, this algorithm identifies tracked cell IDs that were not present in a previous frame of a time-lapse movie.
Figure 2.5 Representative graphs of proliferation data. (A) Validation of cell lines for HCAM studies. Population doubling times of AT1 cells or AT1 cells modified to stably express H2BmRFP were determined by manual counting (AT1, left) or automated cell counting (AT1-H2BmRFP, right). The population DT is calculated by dividing the natural log of 2 by the slope of the curve fit by linear regression and is indicated within each graph (18.3 h, manual; 17.7 h, automated). (B) Distributions of single-cell IMT and GR. IMT and GR of individual AT1-H2BmRFP cells cultured under standard conditions were determined using time-lapse HCAM as described in the text. The distribution of IMT has a long rightward tail (left). When the data are transformed to GR, the resultant distribution demonstrates a more normal shape (middle). When only a single generation is plotted, the bias toward larger GR (shorter IMT) is reduced, thereby increasing the relative abundance of the smaller GR (longer IMT) (right).
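The population-DT calculation used in Fig. 2.5A reduces to a short computation. Here is a minimal hedged sketch with made-up counts (the values are illustrative, not the experimental data):

import numpy as np

times = np.array([0, 4, 8, 12, 16, 20, 24], dtype=float)        # h
counts = np.array([500, 590, 680, 790, 930, 1080, 1260], dtype=float)

slope, intercept = np.polyfit(times, np.log(counts), 1)         # ln-linear fit
doubling_time = np.log(2) / slope                               # population DT (h)
print(f"DT = {doubling_time:.1f} h, GR = {slope:.3f} h^-1")     # GR is the slope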
In order to separate true mitotic events from cells entering the frame, the algorithm must exclude cells that moved too fast and were lost by the tracking toolbox, as well as cells whose fluorescence intensity fluctuates above and below the foreground intensity threshold and which therefore disappear from some frames and suddenly reappear in others. We use the collapse in size of the cell nuclei and the proximity of the nuclei in anaphase as markers of a true mitotic event. Filters in the algorithm reject new cells that are too far from other cells or have too great an area as possible mitotic events. An additional filter checks the size of possible parents and compares it with the size of the presumptive daughter cells. If the ratio of the parent area to the areas of possible daughter cells is too small, the event is rejected. Finally, if the sizes of the two possible daughter nuclei are too dissimilar, the event is also rejected. After the mitotic events are detected, new IDs are assigned to the daughter cells and each cell receives a parent ID. Cells that have entered the frame and cells that were present at the beginning of the movie receive a parent ID equal to zero.

In the last module (WG3), proliferation information, together with centroid position and shape parameters (e.g., area, eccentricity), is saved to a set of comma-separated text files. In addition, images are generated with the detected nuclei boundaries (or cytoplasm in bright-field images) color-coded based on generation number and cell ID overlaid onto the original image, and saved as JPEG files to facilitate manual validation of the automatic segmentation and tracking.

4.2.4. Single-cell proliferation rates: Statistical analyses
4.2.4.1. Single-cell IMT and generation rate
Single-cell IMT defines the duration of each individual cell's lifetime, or cell cycle. The generation rate (GR) is calculated as ln(2)/IMT and is used instead of IMT, since its distribution has been shown to be normal (Gaussian) in several noncancerous cell lines. However, the distribution of GR of all cells in a population is overrepresented by the faster-dividing cells, which generates leptokurtic (tall and narrow) distributions (Sisken and Morasca, 1965). To reduce this bias, only a single generation is analyzed. An example of the distribution of IMT and GR from multiple generations or a single generation is demonstrated in Fig. 2.5B.

It is important to compare the single-cell GR with population-level metrics (i.e., population DT), since the population-level data are composed of the single-cell metrics. For example, under conditions where the population proliferation rate is nonlinear, calculation of a population DT is inappropriate as it is changing over time (Fig. 2.6A, Condition 2), whereas the population DT is calculated as 16.91 h under normal culture conditions (Fig. 2.6A, Condition 1), corresponding to a GR of 0.041 h^-1 (the slope of the line). The population-level proliferation curve in Condition 2 suggests an increasing IMT of the cells over time. Linear regression of data plotted with cell birth time on the x-axis and single-cell IMT on the y-axis provides a tool to examine whether the IMT is time dependent.
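To make this concrete, the following is a minimal hedged sketch (hypothetical arrays, not the authors' pipeline) that converts IMT to GR and regresses IMT on birth time; a slope indistinguishable from zero corresponds to the horizontal fit discussed next.

import numpy as np
from scipy import stats

imt = np.array([17.2, 18.9, 16.5, 21.0, 19.4, 18.1])        # h, illustrative
birth_time = np.array([2.0, 5.5, 8.0, 12.5, 15.0, 19.5])    # h, illustrative

gr = np.log(2) / imt                       # generation rate, h^-1
fit = stats.linregress(birth_time, imt)    # IMT vs birth time
print(f"slope = {fit.slope:.3f}, p = {fit.pvalue:.2f}")
# A slope near zero indicates IMT is stable over the experiment.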
The horizontal line in Fig. 2.6C, Condition 1, indicates no correlation between birth time and IMT, whereas there is a clear positive correlation of IMT with birth time in Condition 2, indicating that cell cycle times are increasing over the course of the experiment. This type of analysis is not limited to birth time and, therefore, provides a useful general approach for detecting parameter interdependencies.

4.2.4.2. Progeny tree (clonal subpopulation) generation rates
The image processing algorithms described above in Section 4.2.3 provide a method to link individual cell data to its parent and progeny to generate a family (progeny) tree of dependent data. Each progeny tree represents a clonal population with unknown dependence on other progeny trees, such that progeny trees may be related to varying degrees or unrelated.
Figure 2.6 Graphical representation of proliferation metrics. AT1 cells were cultured in standard culture conditions (Condition 1, left column) or under growth-factor-restricted conditions (Condition 2, right column) and subjected to time-lapse HCAM. (A) Population DT was determined as described in Fig. 2.5, using a larger number of cells and more frequent image acquisition (every 1 h). Proliferation in Condition 1 demonstrates the typical exponential (log-linear) division rate, whereas proliferation in Condition 2 is clearly not log-linear. (B) BIC analysis of the distribution plots of individual cell GR from generation 1 in Condition 1 indicates the presence of two subpopulations with mean values of 0.025 and 0.05 h^-1 (indicated by vertical dashed lines). The estimated density of a mixed Gaussian using the EM method (described in Section 3.6) is indicated by the curve overlaying the histogram. In Condition 2, BIC analysis indicates two subpopulations with different densities and mean values than in Condition 1. (C) To examine whether the IMT of cells is similar throughout the experiment, the IMT of cells are plotted according to their birth time during the experiment. The nearly horizontal linear regression indicates that the IMT of cells in Condition 1 is not increasing significantly over the course of the experiment, whereas the IMT of cells in Condition 2 is increasing. (D) The density histograms of progeny tree GR comprise a single population (by BIC analysis) in Condition 1 but are clearly distinguished into two subpopulations in Condition 2.
One metric that can be obtained using data pulled from entire progeny trees is a maximum likelihood estimate of GR for each tree, using Eq. (2.8):

$$\mathrm{GR} = \frac{B_t - D_t}{S_t} \qquad (2.8)$$

where B_t and D_t are the number of mitotic events and the number of deaths, respectively, and S_t is the total lifetime of the population. S_t is obtained by summing the lifespan of each cell within a progeny tree (Keiding and Lauritzen, 1978). In the absence of detectable death, the equation reduces simply to (Eq. (2.9)):

$$\mathrm{GR} = \frac{B_t}{S_t} \qquad (2.9)$$
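A minimal sketch of Eqs. (2.8) and (2.9), with hypothetical inputs (the lifespans and counts are illustrative, not experimental values):

def progeny_tree_gr(lifespans, n_mitoses, n_deaths=0):
    # Eq. (2.8): GR = (Bt - Dt)/St; Eq. (2.9) is the n_deaths = 0 case.
    # St sums the observed lifespan (h) of every cell in the tree.
    st = sum(lifespans)
    return (n_mitoses - n_deaths) / st

# E.g., one parent observed for 18 h dividing once into two daughters
# observed for 10 h and 12 h: GR = 1 / (18 + 10 + 12) = 0.025 h^-1.
print(progeny_tree_gr([18, 10, 12], n_mitoses=1))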
Since the estimate of GR for the progeny tree is based on the population lifetime (S_t) and the number of mitotic events (B_t) occurring within a progeny tree, these values can be calculated even for progeny trees containing a single mitotic event (one parent and two offspring). In addition, S_t is calculated using all cells in each tree, regardless of whether a cell leaves the frame or persists to the end of the experiment. Thus, deriving GR from progeny trees provides a system with which to compare the proliferation rates of clonal subpopulations within the context of a potentially heterogeneous population, without requiring individual clones to be isolated. This analysis therefore introduces the potential for high-throughput comparison of multiple genetically stable clonal populations and should be able to detect preexisting or frequently occurring stable genetic alterations that alter the proliferative capacity of the cells within the population as a whole. A representative plot of progeny tree GR, and its relationship to the other metrics, is shown in Fig. 2.6.

4.2.4.3. Analysis of sibling pairs
Another proliferation metric that can be used to detect variability within each cell line is the similarity of IMT or GR between sibling pairs (or other members within a progeny tree). Since each sibling pair is presumably genetically identical, differences between them can be considered nongenetic. Metrics of this similarity or difference between siblings are obtained either by determining the correlation between sibling GR (Fig. 2.7A and B) or by plotting the difference between the IMT of sibling pairs (Fig. 2.7C). Although not yet applied to our datasets, a very promising approach to quantify the variance of proliferation metrics within cell lines is the bifurcating autoregression model (Staude et al., 1997). The model accounts for cells progressing through a standard cell cycle and can be used to quantify heterogeneity in the population using bifurcating data structures such as progeny trees. The model provides quantitative values of mean and variance
Figure 2.7 Sibling pair analysis. (A) Scatter plots of sibling-pair GR demonstrate significant correlation, as indicated by the high correlation coefficients (r) and low P-values (AT1: r = 0.74, P = 1.6e-10 in Condition 1; r = 0.561, P = 3.4e-08 in Condition 2). (B) Residual plots similarly demonstrate the stronger correlation in Condition 1. (C) The differences between sibling pair IMT can also be represented using cumulative density distributions with a log scale on the x-axis (time).
in the population and can quantify the variance of metrics between related members of a progeny tree (e.g., mothers and daughters or sibling pairs).

4.2.5. Other proliferation-related metrics
Other standard assays of DNA synthesis (e.g., bromodeoxyuridine (BrdU) incorporation) and DNA content (e.g., incorporation of fluorescent DNA-binding dyes such as 4',6-diamidino-2-phenylindole (DAPI) or Hoechst 33342) can easily be incorporated into the HCAM experiments. These assays can be performed in situ to produce results similar to those obtainable using flow cytometry. However, a live-cell, fluorescent, ubiquitination-based cell cycle indicator, the "Fucci" system (Sakaue-Sawano et al., 2008), now makes it possible to track the cell cycle of individual cells over time. The Fucci system uses two fluorescent protein-conjugated protein fragments that are rapidly degraded upon ubiquitylation, with different fluorescent properties for each phase (G1/S and G2/M) of the cell cycle (Sakaue-Sawano et al., 2008). Data generated by these approaches can easily be integrated with the other proliferation metrics to provide a more complete picture of the cell cycle times of individual cells in the population over time. A list of proliferation metrics, such as IMT, obtainable from H2B-labeled cells is shown in Table 2.4.

4.2.6. Quality control
For verification of automated tracking results, random wells (fields of view) are selected for manual verification. The manually derived results of these fields are subjected to the same analysis, and the results are compared for accuracy with the automated results to determine the error rate of the automated process (e.g., histograms of the mitotic times are compared with the two-sample Kolmogorov–Smirnov test for significant differences; a minimal sketch of this comparison follows Table 2.4).

Table 2.4 Proliferation metrics obtainable from H2B-labeled cells

Time-based features: Population DT/GR; single-cell IMT/GR; differences between sibling IMT; clonal population GR (progeny trees); mitotic events per unit time; G1/S–G2/M conversion rate; DNA synthesis rate.
Morphologic features: Nuclear size; nuclear shape; nuclear area; nuclei per cell; bi- or multipolar mitotic events.
Other features: Nuclei per frame; distance between nuclei centroids; cell death; DNA content; % in cell cycle phase (G1/S/G2).
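As referenced in Section 4.2.6, the following is a minimal hedged sketch of the KS comparison in Python; the IMT arrays are hypothetical examples, not experimental data.

import numpy as np
from scipy import stats

manual_imt = np.array([17.5, 18.2, 19.0, 16.8, 21.3, 18.9])   # h, illustrative
auto_imt = np.array([17.9, 18.0, 19.4, 17.1, 20.8, 18.5, 19.9])

statistic, p_value = stats.ks_2samp(manual_imt, auto_imt)      # two-sample KS test
if p_value < 0.05:
    print("Distributions differ: inspect the automated tracking.")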
5. Conclusions
From this chapter, it is hopefully evident that QCT studies by HCAM can address fundamental questions in cancer, including: (i) defining the relation between progression of cancer cell aggressiveness and QCT variability in a tumor; (ii) determining whether the range of QCT variability tracks with tumor response to drugs and drug combinations; and (iii) relating QCT variability to the rise of cancer resistance to treatment. It is also expected that these quantitative analyses will have a profound impact on computational and mathematical modeling of cancer progression and treatment, by complementing the plethora of molecular data with an abundance of much-needed cellular data.
ACKNOWLEDGMENTS
We thank Dr. Jerome Jourquin for incorporating motility movies into http://vicbc.vanderbilt.edu/itumor/cell. Support for this work was provided by NCI grant U54CA113007.
REFERENCES
Alexopoulos, L. G., Erickson, G. R., and Guilak, F. (2002). A method for quantifying cell size from differential interference contrast images: Validation and application to osmotically stressed chondrocytes. J. Microsc. 205(Pt 2), 125–135.
Anderson, A. R. A., Hassanein, M., Branch, K. M., Lu, J., Lobdell, N. A., Maier, J., Basanta, D., Weidow, B., Reynolds, A. B., Quaranta, V., Estrada, L., and Weaver, A. M. (2009). Microenvironmental independence associated with tumor progression. Cancer Res. (in press).
Bear, J. E., Svitkina, T. M., Krause, M., Schafer, D. A., Loureiro, J. J., Strasser, G. A., Maly, I. V., Chaga, O. Y., Cooper, J. A., Borisy, G. G., and Gertler, F. B. (2002). Antagonism between Ena/VASP proteins and actin filament capping regulates fibroblast motility. Cell 109(4), 509–521.
Berg, H. C., and Brown, D. A. (1972). Chemotaxis in Escherichia coli analysed by three-dimensional tracking. Nature 239, 500–504.
Brock, A., Chang, H., and Huang, S. (2009). Non-genetic heterogeneity – a mutation-independent driving force for the somatic evolution of tumours. Nat. Rev. Genet. 10(5), 336–342.
Bryce, N. S., Clark, E. S., Leysath, J. L., Currie, J. D., Webb, D. J., and Weaver, A. M. (2005). Cortactin promotes cell motility by enhancing lamellipodial persistence. Curr. Biol. 15(14), 1276–1285.
Cai, L., Marshall, T. W., Uetrecht, A. C., Schafer, D. A., and Bear, J. E. (2007). Coronin 1B coordinates Arp2/3 complex and cofilin activities at the leading edge. Cell 128(5), 915–929.
Carpenter, A. E., Jones, T. R., Lamprecht, M. R., Clarke, C., Kang, I. H., Friman, O., Guertin, D. A., Chang, J. H., Lindquist, R. A., Moffat, J., Golland, P., and Sabatini, D. M. (2006). CellProfiler: Image analysis software for identifying and quantifying cellular phenotypes. Genome Biol. 7(10), R100.
Codling, E. A., Plank, M. J., and Benhamou, S. (2008). Random walk models in biology. J. R. Soc. Interface 5(25), 813–834.
Dikovskaya, D., Schiffmann, D., Newton, I. P., Oakley, A., Kroboth, K., Sansom, O., Jamieson, T. J., Meniel, V., Clarke, A., and Näthke, I. S. (2007). Loss of APC induces polyploidy as a result of a combination of defects in mitosis and apoptosis. J. Cell Biol. 176(2), 183–195.
Dove, A. (2003). Screening for content – the evolution of high throughput. Nat. Biotechnol. 21, 859–864.
Duffy, K. J., and Ford, R. M. (1997). Turn angle and run time distributions characterize swimming behavior for Pseudomonas putida. J. Bacteriol. 179(4), 1428–1430.
Dunn, G. A., and Brown, A. F. (1987). A unified approach to analyzing cell motility. J. Cell Sci. Suppl. 8, 81–102.
Eliason, S. R. (1993). Maximum Likelihood Estimation: Logic and Practice, Vol. 96. SAGE Publications, Thousand Oaks, CA.
Evans, J. G., and Matsudaira, P. (2007). Linking microscopy and high content screening in large-scale biomedical research. Methods Mol. Biol. 356, 33–38.
Fraley, C., and Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. 97, 611–631.
Friedl, P., and Wolf, K. (2003). Tumour-cell invasion and migration: Diversity and escape mechanisms. Nat. Rev. Cancer 3, 362–374.
Fürth, R. (1920). Die Brownsche Bewegung bei Berücksichtigung einer Persistenz der Bewegungsrichtung. Mit Anwendungen auf die Bewegung lebender Infusorien. Z. Phys. 2, 244–256.
Gautestad, A. O., and Mysterud, I. (2005). Intrinsic scaling complexity in animal dispersion and abundance. Am. Nat. 165(1), 44–55.
Goldberg, I. G., Allan, C., Burel, J. M., Creager, D., Falconi, A., Hochheiser, H., Johnston, J., Mellen, J., Sorger, P. K., and Swedlow, J. R. (2005). The Open Microscopy Environment (OME) data model and XML file: Open tools for informatics and quantitative analysis in biological imaging. Genome Biol. 6, R47.
Gräf, R., Rietdorf, J., and Zimmermann, T. (2005). Live cell spinning disk microscopy. Adv. Biochem. Eng. Biotechnol. 95, 57–75.
Hanahan, D., and Weinberg, R. A. (2000). The hallmarks of cancer. Cell 100(1), 57–70.
Harris, M. P., Kim, E., Weidow, B., Wikswo, J. P., and Quaranta, V. (2008). Migration of isogenic cell lines quantified by dynamic multivariate analysis of single-cell motility. Cell Adh. Migr. 2(2), 127–136.
Heng, H. H., Bremer, S. W., Stevens, J. B., Ye, K. J., Liu, G., and Ye, C. J. (2009). Genetic and epigenetic heterogeneity in cancer: A genome-centric perspective. J. Cell. Physiol. 220(3), 538–547.
Hofmann-Wellenhof, R., Fink-Puches, R., Smolle, J., Helige, C., Tritthart, H. A., and Kerl, H. (1995). Correlation of melanoma cell motility and invasion in vitro. Melanoma Res. 5(5), 311–319.
Jiao, X., Katiyar, S., Liu, M., Mueller, S. C., Lisanti, M. P., Li, A., Pestell, T. G., Wu, K., Ju, X., Li, Z., Wagner, E. F., Takeya, T., Wang, C., and Pestell, R. G. (2008). Disruption of c-Jun reduces cellular migration and invasion through inhibition of c-Src and hyperactivation of ROCK II kinase. Mol. Biol. Cell 19(4), 1378–1390.
Keiding, N., and Lauritzen, S. L. (1978). Marginal maximal likelihood estimates and estimation of the offspring mean in a branching process. Scand. J. Stat. 5, 106–110.
Kim, H. D., Guo, T. W., Wu, A. P., Wells, A., Gertler, F. B., and Lauffenburger, D. A. (2008). Epidermal growth factor-induced enhancement of glioblastoma cell migration in 3D arises from an intrinsic increase in speed but an extrinsic matrix- and proteolysis-dependent increase in persistence. Mol. Biol. Cell 19, 4249–4259.
Kipper, M. J., Kleinman, H. K., and Wang, F. W. (2007). New method for modeling connective-tissue cell migration: Improved accuracy on motility parameters. Biophys. J. 93(5), 1797–1808.
Loo, L. H., Wu, L. F., and Altschuler, S. J. (2007). Image-based multivariate profiling of drug responses from single cells. Nat. Methods 4(5), 445–453.
Mukherjee, D. P., Ray, N., and Acton, S. T. (2004). Level set analysis for leukocyte detection and tracking. IEEE Trans. Image Process. 13(4), 562–572.
Perlman, Z. E., Slack, M. D., Feng, Y., Mitchison, T. J., Wu, L. F., and Altschuler, S. J. (2004). Multidimensional drug profiling by automated microscopy. Science 306(5699), 1194–1198.
Porter, I. M., McClelland, S. E., Khoudoli, G. A., Hunter, C. J., Andersen, J. S., McAinsh, A. D., Blow, J. J., and Swedlow, J. R. (2007). Bod1, a novel kinetochore protein required for chromosome biorientation. J. Cell Biol. 179(2), 187–197.
Potdar, A. A., Lu, J., Jeon, J., Weaver, A. M., and Cummings, P. T. (2009). Bimodal analysis of mammary epithelial cell migration in two dimensions. Ann. Biomed. Eng. 37(1), 230–245.
Rasband, W. S. (1997–2006). ImageJ. U.S. National Institutes of Health, Bethesda, MD, USA. http://rsbweb.nih.gov/ij/.
Ray, N., and Acton, S. T. (2005). Data acceptance for automated leukocyte tracking through segmentation of spatiotemporal images. IEEE Trans. Biomed. Eng. 52(10), 1702–1712.
Roeder, K., and Wasserman, L. (1995). Practical Bayesian density estimation using mixtures of normals. J. Am. Stat. Assoc. 92.
Sakaue-Sawano, A., Kurokawa, H., Morimura, T., Hanyu, A., Hama, H., Osawa, H., Kashiwagi, S., Fukami, K., Miyata, T., Miyoshi, H., Imamura, T., Ogawa, M., et al. (2008). Visualizing spatiotemporal dynamics of multicellular cell-cycle progression. Cell 132(3), 487–498.
Sisken, J. E., and Morasca, L. (1965). Intrapopulation kinetics of the mitotic cycle. J. Cell Biol. 25, 179–189.
Slack, M. D., Martinez, E. D., Wu, L. F., and Altschuler, S. J. (2008). Characterizing heterogeneous cellular responses to perturbations. Proc. Natl. Acad. Sci. USA 105(49), 19306–19311.
Starkuviene, V., and Pepperkok, R. (2007). The potential of high-content high-throughput microscopy in drug discovery. Br. J. Pharmacol. 152, 62–71.
Staude, R. G., Huggins, R. M., Zhang, J., Axelrod, D. E., and Kimmel, M. (1997). Estimating clonal heterogeneity and interexperiment variability with the bifurcating autoregressive model for cell lineage data. Math. Biosci. 143, 103–121.
Stockholm, D., Benchaouir, R., Picot, J., Rameau, P., Neildez, T. M. A., Landini, G., Laplace-Builhe, C., and Paldi, A. (2007). The origin of phenotypic heterogeneity in a clonal cell population in vitro. PLoS ONE 2(4), e394.
Swedlow, J. R., Goldberg, I., Brauner, E., and Sorger, P. K. (2003). Informatics and quantitative analysis in biological imaging. Science 300, 100–102.
Swedlow, J. R., Goldberg, I. G., and Eliceiri, K. W. (2009). Bioimage informatics for experimental biology. Annu. Rev. Biophys. 38, 327–346.
Uhlenbeck, G. E., and Ornstein, L. S. (1930). On the theory of the Brownian motion. Phys. Rev. 36, 823–841.
Viswanathan, G. M., Buldyrev, S. V., Havlin, S., da Luz, M. G., Raposo, E. P., and Stanley, H. E. (1999). Optimizing the success of random searches. Nature 401(6756), 911–914.
Wallin, A. E., Salmi, A., and Tuma, R. (2007). Step length measurement – theory and simulation for tethered bead constant-force single molecule assay. Biophys. J. 93(3), 795–805.
Wells, A. (2006). Cell Motility in Cancer Invasion and Metastasis. In "Cancer Metastasis – Biology and Treatment Series." Springer.
CHAPTER THREE

Matrix Factorization for Recovery of Biological Processes from Microarray Data

Andrew V. Kossenkov* and Michael F. Ochs†

Contents
1. Introduction
2. Overview of Methods
   2.1. Clustering techniques
   2.2. Traditional statistical approaches
   2.3. Matrix factorization techniques
   2.4. Extensions to nonnegative matrix factorization
3. Application to the Rosetta Compendium
4. Results of Analyses
5. Discussion
References
Abstract
We explore a number of matrix factorization methods in terms of their ability to identify signatures of biological processes in a large gene expression study. We focus on the ability of these methods to find signatures in terms of gene ontology enhancement and on the interpretation of these signatures in the samples. Two Bayesian approaches, Bayesian Decomposition (BD) and Bayesian Factor Regression Modeling (BFRM), perform best. Differences in the strength of the signatures between the samples suggest that BD will be most useful for systems modeling and BFRM for biomarker discovery.
1. Introduction
Microarray technology introduced a new complexity into biological studies through the simultaneous measurement of thousands of variables, replacing a technique (the Northern blot) that typically measured at most tens
* The Wistar Institute, Philadelphia, Pennsylvania, USA
† The Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, Maryland, USA
Methods in Enzymology, Volume 467, ISSN 0076-6879, DOI: 10.1016/S0076-6879(09)67003-8
© 2009 Elsevier Inc. All rights reserved.
of variables. Traditional analysis focused on measurements with minimal statistical complexity, but direct application of such tests (e.g., the t-test) to microarrays resulted in massive numbers of "significant" differentially regulated genes, when reality suggested far fewer. There were a number of reasons for the failure of these tests, including the small number of replicates leading to chance detection when tens of thousands of variables were measured (Tusher et al., 2001), the unmodeled covariance arising from coordinated expression (Kerr et al., 2002), and non-gene-specific error models (Hughes et al., 2000). While a number of statistical issues have now been successfully addressed (Allison et al., 2006), two aspects of the biology of gene expression raise difficulties for many analyses.

The issues can be noted in a simple model of signaling in the yeast S. cerevisiae. In Fig. 3.1, the three overlapping MAPK pathways are shown. The pathways share a number of upstream regulatory components (e.g., Ste11), and regulate sets of genes divided here into five groups (A–E), with a few of the many known targets shown. The Fus3 mating response MAPK protein activates the Ste12 transcription factor, leading to expression of groups A and B. The Kss1 filamentation response MAPK protein activates the Ste12–Tec1 regulatory complex, leading to expression of groups B, C, and D. The Hog1 high-osmolarity response MAPK protein activates the Sko1 transcription factor, leading to expression of groups D and E.
Figure 3.1 The tightly coupled MAPK pathways in S. cerevisiae. Activation of the pathways leads to transcriptional responses, which produce overlapping sets of transcripts that would be measured in a gene expression experiment. This multiple regulation, which is ubiquitous in eukaryotic biology, motivates the use of matrix factorization methods in high-throughput biological data analysis.
The standard methods used in microarray analysis will look for genes that are differentially expressed between two states. If we imagine those two states as mating activation and filamentation activation, we identify genes associated with each process, but we do not identify all genes associated with either process. Alternatively, clustering in an experiment where each process is independently active will lead to identification of five clusters (one for each group A–E) even though only three processes are active. Naturally, the complexity is substantially greater as there is no true isolation of a single biological process, as any system with only a single process active would be dead, and any measurement is convolved with measurements of ongoing biological behavior required for survival, homeostasis, or growth. These processes use many of the same genes, due to borrowing of gene function that has occurred throughout evolution. [Note: for S. cerevisiae, plain text Ste12 indicates the protein, while italic text ste12 indicates the gene.]

Essentially, this example shows the two underlying biological principles that need to be addressed in many analyses of high-throughput data: multiple regulation of genes due to gene reuse in different biological processes and nonorthogonality of biological process activity arising from the natural simultaneity of biological behaviors. Mathematically, we can state the problem as a matrix factorization problem:

$$D_{ij} = \sum_{k=1}^{P} A_{ik} P_{kj} + \epsilon_{ij} \qquad (3.1)$$
where D is the data matrix comprising measurements on N genes (or other entities) indexed by i across M conditions indexed by j, P is the pattern matrix for P patterns indexed by k, A is the amplitude or weighting matrix that determines how much of each gene's behavior can be attributed to each pattern, and ε is the error matrix. P is essentially a collection of basis vectors for the factorization into P dimensions, and as such it is often useful to normalize the rows of P to sum to 1. This makes the A matrix similar to loading or score matrices, such as in principal component analysis (PCA). It is useful to note here that the nonindependence of biological processes is equivalent to nonorthogonality of the rows of P, indicating the factorization is ideally into a basis space that reflects underlying biological behaviors but is not orthonormal.
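A minimal numpy sketch of the generative model in Eq. (3.1), with illustrative dimensions (not drawn from any real dataset): data arise as the product of an amplitude matrix and a pattern matrix plus noise, with rows of P normalized to sum to 1.

import numpy as np

rng = np.random.default_rng(0)
N, M, P = 100, 20, 3                               # genes, conditions, patterns

patterns = rng.random((P, M))
patterns /= patterns.sum(axis=1, keepdims=True)    # rows of P sum to 1
amplitudes = rng.random((N, P))                    # per-gene pattern weights
noise = 0.01 * rng.standard_normal((N, M))

data = amplitudes @ patterns + noise               # D = A P + epsilon
print(data.shape)                                  # (100, 20)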
We introduced Bayesian Decomposition (BD), a Markov chain Monte Carlo algorithm, to address these fundamental biological issues in microarray studies (Moloshok et al., 2002), extending our original work in spectroscopy (Ochs et al., 1999). Kim and Tidor introduced nonnegative matrix factorization (NMF), created by Lee and Seung (1999), into microarray analysis for the same reason (Brunet et al., 2004; Kim and Tidor, 2003). Subsequently, it was realized that sparseness aids in identifying biologically meaningful processes, and sparse NMF was introduced (Gao and Church, 2005). Fortuitously, due to its original use in spectroscopy, sparseness was already a feature of BD through its atomic prior (Sibisi and Skilling, 1997). More recently, Carvalho and colleagues introduced Bayesian factor regression modeling (BFRM), an additional Markov chain Monte Carlo method, for microarray data analysis (Carvalho et al., 2008).

Targeted methods that directly model multiple sources of biological information have been introduced as well. Liao and Roychowdhury introduced network component analysis (NCA), which relied on information about the binding of transcriptional regulators to help isolate the signatures of biological processes (Liao et al., 2003). The use of information on transcriptional regulation can also aid in sparseness, as shown by its inclusion in BD as prior information (Kossenkov et al., 2007).

These methods have been developed and applied primarily to microarray data, as it was the first high-throughput biological data that included dynamic behavior, in contrast to sequence data. Microarrays were developed independently by a number of groups in the 1990s (Lockhart et al., 1996; Schena et al., 1995), and their use is now widespread. A number of technical issues plagued early arrays, and error rates were high. The development of normalization and other preprocessing procedures improved data reproducibility and robustness (Bolstad et al., 2003; Cheng and Wong, 2001; Irizarry et al., 2003), leading to studies that demonstrated the ability to produce meaningful datasets from arrays run in different laboratories at different times (English and Butte, 2007). Data can be accessed, though not always with useful metadata, in the GEO and ArrayExpress repositories (Edgar et al., 2002; Parkinson et al., 2005). However, the methods discussed here are also suitable for other high-throughput data where the fundamental assumptions of multiple overlapping sets within the data and nonorthogonality of these sets across the samples hold. In the near future, these data are likely to include large-scale proteomics measurements and metabolite measurements.

We have previously undertaken a study of some of these methods to determine their ability to solve Eq. (3.1) using simulations of the cell cycle (Kossenkov and Ochs, 2009). This study did not address the recovery of biologically meaningful patterns from real data, where numerous unknowns exist. Most of these relate to the fundamental issue that separates biological studies from those in physics and chemistry: in biology we are unable to isolate variables of interest away from other unknowns, as to do so is to kill the organism under study. Instead, we must perform studies in a background of incomplete knowledge of the activities a cell is undertaking and incomplete knowledge of the entities (e.g., genes, proteins) associated with these processes. In addition, sampling is difficult and therefore tends to be limited (i.e., large N, small P), and the data remain prone to substantial variance, perhaps due to true biological variation instead of technical issues.
We have undertaken a new analysis of the Rosetta compendium, a dataset of quadruplicate measurements of 300 yeast gene knockouts and chemical treatments (Hughes et al., 2000), to determine how well various matrix factorization methods recover signatures of biological processes. The Rosetta study included 63 control replicates of wild-type yeast grown in rich media, allowing a gene-specific error model. One interesting result to emerge from this work is that roughly 10% of yeast genes appear to be under limited transcriptional regulation, so that their transcript levels vary by orders of magnitude without a corresponding variation in protein levels or phenotype. This has obvious implications for studies where whole genome transcript levels are measured on limited numbers of replicates. Using known biological behaviors that are affected by specific gene knockouts, we compared a number of methods from clustering through the matrix factorization methods discussed above to determine how well such methods recover biological information from microarray measurements. We first give a brief description of each method, then we present the dataset and results of our analyses.
2. Overview of Methods

2.1. Clustering techniques
To provide a baseline for comparison, we applied two widely used clustering techniques to the dataset, as well as an approach where genes were assigned to groups at random. Hierarchical clustering (HC) was introduced for microarray work by Eisen et al. (1998), and because of easy-to-use software and its lead as the first technique, it has seen significant use and is available in desktop tools (Saeed et al., 2006). HC, as performed by most users, is done in an agglomerative fashion, using a metric to determine intergene and intercluster distances. Metrics used in microarray studies include Pearson correlation, which captures the shape of changes across the samples, and Euclidean distance, which captures the magnitude of changes. HC creates a tree of distances (a dendrogram) and groups the genes based on the nodes of this tree. As such, different numbers of clusters can be created by cutting at different levels on the tree; however, each specific set of clusters is the most parsimonious for that level and that metric.

K-means (or K-medians) clustering has also been widely used in microarray studies, and it relies on an initial random assignment of genes to P clusters. Genes are then moved between clusters based on gene–cluster distances in an iterative fashion. The same metrics are typically used as in HC, and since the number of clusters is defined a priori, there is no necessity of choosing a tree level as in HC. However, a tree can be created after clustering is complete if desired.
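A minimal hedged sketch of agglomerative clustering with a Pearson-correlation metric and average linkage, as described above; the random matrix stands in for a genes-by-conditions dataset and the cluster count is an illustrative choice.

import numpy as np
from scipy.cluster.hierarchy import average, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
data = rng.standard_normal((50, 12))            # 50 genes, 12 conditions

dist = pdist(data, metric="correlation")        # 1 - Pearson correlation
tree = average(dist)                            # average-linkage dendrogram
labels = fcluster(tree, t=5, criterion="maxclust")  # cut tree into 5 clusters
print(np.bincount(labels)[1:])                  # cluster sizes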
2.2. Traditional statistical approaches
The factorization implied by Eq. (3.1) can be accomplished in a number of ways. One of the most widely used is singular value decomposition (SVD) or its relative, PCA. These methods create M new basis vectors from the data in D, and these new basis vectors are orthonormal. SVD is an analytic procedure that decomposes D into the product of a left singular matrix U, a diagonal matrix of ordered values S referred to as the singular values, and a right singular matrix V^T, that is,

$$D = U S V^{T} \qquad (3.2)$$
Alter and colleagues introduced SVD to microarray studies, and defined the rows of V^T as eigengenes, and the columns of U as eigenarrays (Alter et al., 2000). The eigengenes are similar to the concept of patterns for Eq. (3.1). PCA performs a similar decomposition; however, the analysis proceeds from the covariance matrix, so that the principal components (PCs) follow the variance in the data. The first PC is aligned with the axis of maximum variance in the M-dimensional space of the data matrix, with each additional PC chosen to be orthogonal to the previous PCs and in the direction that maximizes variance among all orthogonal directions. This creates a new orthonormal basis space in which the PCs represent directions of maximum variance. The singular values are now referred to as scores, and the value of the score provides the amount of variance explained by the corresponding PC. In most applications of PCA and SVD to microarray data, the matrices are truncated so that only the strongest eigengenes or PCs are retained. This is a form of dimensionality reduction, which, in the case of PCA, retains the maximum amount of variance across the data at each possible dimension.

The orthogonality conditions of SVD and PCA were realized to be overly constraining for microarray data. Lin and colleagues and Liebermeister independently introduced independent component analysis (ICA) to microarray analysis to address this issue (Liebermeister, 2002; Lin et al., 2002). As with typical applications of PCA, ICA projects the data onto a lower dimensional space. In linear ICA, the goal is to solve Eq. (3.1) by finding P, such that

$$Y = W D \qquad (3.3)$$
through the identification of the unmixing matrix, W. The unmixing matrix is designed to make the rows of Y, and therefore P, as statistically independent as possible. A number of measures of independence can be used, such as maximizing negentropy or nongaussianity (Hyvärinen et al., 2001). Because ICA is not strictly constrained like PCA or SVD, it is possible to obtain multiple solutions for Y from the same data. As such, sometimes multiple applications must be performed and a rule applied to pick the best Y (Frigyesi et al., 2006).
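A minimal sketch of the SVD in Eq. (3.2) applied to an expression matrix, in which the rows of V^T play the role of the eigengenes of Alter et al. (2000); the random matrix and retained rank are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
D = rng.standard_normal((50, 12))              # genes x conditions

U, s, Vt = np.linalg.svd(D, full_matrices=False)
k = 3                                          # retain the strongest components
D_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k reconstruction of D
print(Vt[:k].shape)                            # 3 eigengenes of length 12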
2.3. Matrix factorization techniques
The desire to escape both the exclusivity of gene assignment to a single cluster occurring in clustering and the independence criteria of statistical methods such as PCA led to the introduction of two techniques from other fields that addressed these issues. Naturally, these methods require constraints, as Eq. (3.1) is degenerate, allowing an infinite number of equally good solutions in the absence of a constraint, such as the one provided by an orthonormal basis in PCA. The methods are distinguished by the methods of constraint and the search algorithm for finding an optimal solution to Eq. (3.1) within these constraints. All these methods also rely on dimensionality reduction, so that the number of elements in the matrices A and P is less than the number in D.

BD applies a positivity constraint within an atomic prior to limit the possible A and P matrices. The atomic prior relies on implementation of an additional domain, an atomic domain modeling an infinite one-dimensional space upon which atoms are placed, and mappings between it and the A and P matrices. This provides great flexibility, as the mappings, in the form of convolution functions, can distribute an atom to a complex distribution encoding additional prior knowledge (e.g., the form of a response curve, a coordinated change in multiple genes). The atomic domain comprises a positive additive distribution (Sibisi and Skilling, 1997), and an Occam's razor argument (i.e., parsimony) penalizes excessive structure through the prior distribution on the atoms. The resulting posterior distribution that combines this prior with the likelihood determined from the fit to the data is sampled by a Markov chain Monte Carlo Gibbs sampler (Geman and Geman, 1984). This approach allows patterns to be constrained in multiple ways, permitting the rows of P to be nonorthogonal, while still identifying unique solutions. Even with unique directions defined by the rows of P, there is still flexibility in the equation that allows amplitude in rows of P to be transferred to columns in A without changing D. As such, the rows of P are normalized to sum to 1. For the work presented here, a simple convolution function that maps each atom to a single matrix element is used, as this only enforces positivity on A and P, similar to NMF.

The posterior distribution sampled by BD is generated from the prior and the likelihood through Bayes' equation:

$$p(A, P \mid D) = \frac{p(D \mid A, P)\,p(A, P)}{p(D)} \qquad (3.4)$$

where p(A, P | D) is the posterior distribution, p(D | A, P) is the likelihood, p(A, P) is the prior, and p(D) is the marginal likelihood of the data, which is also known as the evidence. The likelihood is the probability distribution associated with a χ² distribution, and BD therefore uses the estimates of error during modeling, which can be very powerful given the large
variation in uncertainty across different genes in a microarray experiment. This also permits seamless treatment of missing values, as they can be estimated at a typical value (background level) with a large uncertainty, thus not affecting the likelihood. The evidence is not used by BD, as Gibbs sampling requires only relative estimates of the posterior distribution; however, it has been proposed that it can be used for model selection, which in this case would be determining the correct number of dimensions, P, in Eq. (3.1) (Skilling, 2006). Presently, BD requires a choice of P.

NMF applies positivity and dimensionality reduction to find the patterns of P, each of which is defined as a positive linear combination of rows of D. Each row of D is therefore a linear combination of patterns, with the weight given by the corresponding element in A. As with BD, the choice of P must be made before applying the algorithm. In an NMF simulation, random matrices A and P are initialized according to some scheme, such as from a uniform distribution. The two matrices are then iteratively updated with

$$P_{am} = P_{am} \frac{\sum_i A_{ia} D_{im}}{\sum_i A_{ia} M_{im}}, \qquad A_{da} = A_{da} \frac{\sum_j D_{dj} P_{aj}}{\sum_j M_{dj} P_{aj}} \qquad (3.5)$$

where M = AP is the current reconstruction of the data, which guarantees reaching a local maximum in the likelihood. The updating rules climb a gradient in likelihood, which does lead to the problem of becoming trapped in a local maximum in the probability space. In general, application of NMF therefore is done multiple times from different initial random points, and the best fit to the data is used. The fits obtained from repeated runs on complex microarray data can vary significantly in some cases, due to the complex probability structure that appears typical for biological data. MCMC techniques tend to be more resistant to this problem, as they are designed specifically to escape local maxima, although they are prone to miss sharp local maxima in relatively flat spaces; however, this has not yet appeared to be a problem in biological data. The absence of constraints beyond positivity in NMF does lead to a tendency for the recovery of signal-invariant metagenes that carry little or no information, and the failure to include error estimates can lead to genes with large variance being overweighted during fitting. These issues have been addressed in the extensions to NMF discussed below.
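A minimal sketch of the multiplicative updates in Eq. (3.5), written in matrix form with M = AP as the current model; the data, rank, iteration count, and zero-division guard are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)
D = rng.random((50, 12))                 # nonnegative data matrix
P_dim = 3
A = rng.random((50, P_dim))              # random nonnegative start
P = rng.random((P_dim, 12))

eps = 1e-9                               # guard against division by zero
for _ in range(1000):
    M = A @ P
    P *= (A.T @ D) / (A.T @ M + eps)     # pattern update of Eq. (3.5)
    M = A @ P
    A *= (D @ P.T) / (M @ P.T + eps)     # amplitude update of Eq. (3.5)

print(np.linalg.norm(D - A @ P))         # residual after fitting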
NCA uses information on the binding of transcriptional regulators to DNA and dimensionality reduction to reduce the possible A and P matrices. The concept is to create a two-layer network with one layer populated by transcriptional regulators and the other by the genes they regulate, with edges connecting regulators to target genes. NCA addresses the degeneracy of Eq. (3.1) through

$$D = A X X^{-1} P + \epsilon \qquad (3.6)$$

where AX includes all possible A matrices and X^{-1}P all possible P matrices. By demanding that X be diagonal, A and P are uniquely determined up to a scaling factor (i.e., the rows of P require normalization just as in BD). The diagonality of X requires that the transcriptional regulators be independent. The solution of Eq. (3.6) is found by minimizing

$$\lVert D - AP \rVert^{2} \qquad (3.7)$$
which is equivalent to maximizing the likelihood with an assumption of uniform Gaussian errors. For the application of NCA, the relative strength of the transcription of a gene by a regulator must be determined. This is done by measuring the binding affinity of a transcription factor to the promoter for a gene. Since each gene can be regulated by multiple regulators, the expression of a gene in a given condition must be estimated as a combination of the regulation from different factors. A log-linear model is used, so each additional binding of a regulator leads to a multiplicative increase in expression. However, it is not clear that the affinity of binding of a transcription factor is the dominant issue in determining transcript abundance, especially in eukaryotes.

BFRM is a Markov chain Monte Carlo technique that solves

$$D_{ij} = \mu_i + \sum_{k=1}^{r} \beta_{ik} h_{kj} + \sum_{p=1}^{P} A_{ip} P_{pj} + \epsilon_{ij} \qquad (3.8)$$
where A can be viewed as factor loadings for latent factors P (Carvalho et al., 2008). The h matrix provides a series of known covariates in the data, which are then treated using linear regression with coefficients β. The mean vector, μ, provides a gene-specific term that adjusts all genes to the same level, while the ε matrix provides for noise, treated as normally distributed with a pattern-specific variance. The latent factors here are then those that remain after accounting for covariates. This model has also been extended by inclusion in D of response variables as additional columns. This extends the second summation in Eq. (3.8) to P + Q, where Q is the number of latent factors tied to response variables. In both cases, the model also aims for sparse solutions, equivalent to the Occam's razor approach of BD.

BFRM also attempts to address the issue of the number of patterns or latent factors. This is done through an evolutionary stochastic search. Essentially, the algorithm attempts to change P to P + 1 by thresholding the probability of inclusion of a new factor. The model is refit with the additional factor, and the factor is retained if it improves the model by some criterion. In actuality, the algorithm can suggest multiple additional latent factors at each step and choose to keep multiple factors. Evolution ceases
when no additional factors are accepted. The BFRM software allows the evolution to be turned off, which we have done here to allow direct comparison with other methods at the same P.
2.4. Extensions to nonnegative matrix factorization
NMF has become widely used in a number of fields, including analysis of high-throughput biological data. Unlike BD and BFRM, there is no inherent sparseness criterion applied in NMF. This is not surprising, as the original application to imaging argues against sparseness (Lee and Seung, 1999), since images tend to have continuous elements. Sparseness is added to NMF in sparse NMF (sNMF), which penalizes solutions based on the number of nonzero components in A and P (Gao and Church, 2005). A similar approach is presented in nonsmooth NMF (nsNMF), which creates a sparse representation of the patterns by introducing a smoothness matrix into the factorization (Carmona-Saez et al., 2006). Addressing the lack of error modeling, least-squares NMF (lsNMF) converts Eq. (3.5) to a normalized form, adjusting the D_ij and M_ij terms by the specific uncertainty estimates at each matrix element (Wang et al., 2006). It also introduces stochastic updating of the matrix elements, in an attempt to limit the problem of trapping in local maxima.
3. Application to the Rosetta Compendium

The sample dataset for this study is generated from experiments on the yeast S. cerevisiae, which has been studied in depth for a number of biological processes, including the eukaryotic cell cycle, transcriptional and translational control, cell wall construction, mating, filamentous growth, and response to high osmolarity. There is substantial existing biological data on gene function, providing a large set of annotations for analysis (Guldener et al., 2005; Mewes et al., 2004). In addition, there is a rich resource, the Saccharomyces Genome Database, maintained by the community, that includes sequence and expression data, protein structure, pathway information, and functional annotations (Christie et al., 2004). The Rosetta compendium provides a large set of measurements of expression in S. cerevisiae, including 300 deletion mutants or chemical treatments targeted at disrupting specific biological functions (Hughes et al., 2000). The 300 experimental conditions were all probed by microarray four times, with dye flips (technical replicates) of two biological replicates. Control experiments involved 63 cultures of wild-type yeast grown in rich medium and then analyzed by microarrays. The gene-specific variation seen in these ‘‘identical’’ cultures was combined with variance measured from quadruplicate measurements of each mutant or chemical treatment to produce a gene-specific error model.
This error model provided the estimate of the uncertainty for those algorithms utilizing such an estimate. The data were downloaded from Rosetta Inpharmatics and filtered to remove experiments in which fewer than two genes underwent threefold changes and to remove genes that did not change by threefold across the remaining experiments. The resulting dataset comprised 764 genes and 228 experiments with associated error estimates. All algorithms were applied to the data using default settings, with a Pearson correlation metric and average linkage used for clustering procedures and a maximum iterations parameter of 1000 for NMF, sNMF, lsNMF, ICA, and NCA. Patterns for clusters in clustering methods were calculated as the average of the sum-1 normalized expression profiles of the genes in a cluster. BD and lsNMF were run using the PattTools Java interface (available from the authors). NMF and sNMF were run using the same code base, with sNMF sparseness set to 0.8 (Hoyer, 2004). BFRM was run using version 2 of the BFRM software (Carvalho et al., 2008), and BD and BFRM both sampled 5000 points from the posterior distribution using default settings on hyperparameters. Clustering methods (HC, KMC) naturally assigned a gene to a single cluster. For methods that provided uncertainty estimates for values in the A matrix (BD, lsNMF), we used a threshold of 3σ to decide if a gene belonged to a pattern. Note that this permitted a gene to be assigned to multiple patterns, each of which explained part of the overall expression at a significant level. An additional conversion step was done for methods that provide continuous values for elements in the A matrix without uncertainty measurements (NMF, sNMF, ICA, NCA, BFRM). For these methods, we assigned a gene to a group if the absolute value of the corresponding element in matrix A was above the average of the absolute values for the gene, as in Kossenkov et al. (2007). The original publication applied biclustering to the data and reported a number of clusters tied to specific biological processes at varying levels of significance (Hughes et al., 2000). Clusters were found for mitochondrial function, cell wall construction, protein synthesis, ergosterol biosynthesis, mating, MAPK signaling, rnr1/HU genes, histone deacetylase, isw genes, vacuolar APase/iron regulation, sir genes, and the tup1/ssn6 global repressor. We converted these strong signatures to Munich Information Center for Protein Sequences (MIPS) categories from the Comprehensive Yeast Genome Database (Guldener et al., 2005; Mewes et al., 2004). These categories are detailed in Table 3.1. We added MIPS class 38, transposable elements, to the list to look for methods that could distinguish the mating response from the filamentation response (Bidaut et al., 2006). We searched for signatures of these processes in the results of the analyses. To keep the analysis simple and less biased, we looked only for these specific processes.
Table 3.1 The mapping of originally reported processes identified by two-dimensional clustering to MIPS categories, together with the number of proteins in each category in MIPS and in the analyzed dataset

Original report          MIPS number      MIPS name                            Proteins   In data
Mitochondrial function   02.45            Energy conversion and regeneration   44         4
Cell wall                42.01            Biogenesis of cell wall              214        33
Protein synthesis        12               Protein synthesis                    480        16
Protein synthesis        12.01.01         Ribosomal proteins                   246        9
Mating                   41.01.01         Mating                               69         15
MAPK activation          30.01.05.01.03   MAPKKK cascade                       27         5
Histone deacetylase      10.01.09.05      DNA conformation modification        187        5
–                        38               Transposable elements                120        14

Transposable elements have been added to track the difference between mating and filamentation, as filamentation requires transposable element activation.
However, it is important to remember that real biological interpretation often relies on identification of coordinated changes in sets of related biological processes (e.g., mating, meiosis, cell fate). For all techniques, we focused on 15 patterns or clusters, as we have previously identified this as the best estimate of the dimensionality of this dataset (Bidaut et al., 2006). Analysis was performed using ClutrFree, which calculates enrichment and hypergeometric test values for all patterns for each MIPS term (Bidaut and Ochs, 2004).
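The enrichment calculation is a standard one-sided hypergeometric test; a minimal version using SciPy is sketched below (the numbers shown are hypothetical, and ClutrFree's exact conventions may differ).

```python
from scipy.stats import hypergeom

def enrichment_p(n_total, n_category, n_pattern, n_overlap):
    """P(X >= n_overlap) when n_pattern genes are drawn from n_total,
    of which n_category belong to the MIPS term."""
    return hypergeom.sf(n_overlap - 1, n_total, n_category, n_pattern)

# Hypothetical numbers: 764 genes, 15 in the mating term,
# 40 genes assigned to a pattern, 8 of those in the term.
print(enrichment_p(764, 15, 40, 8))
```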
4. Results of Analyses

Although the fundamental goal of the nonclustering methods is the optimal solution to Eq. (3.1), albeit potentially with covariates as in Eq. (3.8), the methods differ substantially in their treatment of the data. BD, as applied here, and the NMF methods require positivity in A and P, while NCA, ICA, PCA, and BFRM allow negative values. The A matrix is still easily interpreted in terms of enrichment of the gene ontology terms from Table 3.1; however, the P matrix can vary greatly in its information content.
Therefore, we focused on recovery of the signatures identified in the original study in terms of a hypergeometric p-value when the genes were assigned to patterns as described above. In addition, we focused on the P matrix for the strong mating pattern recovered with good p-values by all methods to determine what type of information can be recovered. Table 3.2 provides p-values determined using the gene ontology categories in Table 3.1. The table does not include any NMF methods, as these produced no p-values under 0.50. This may reflect the known problem that NMF tends to spread signal over many elements, and in this case even the sparse method failed to isolate a signature, though this may reflect a conservative sparseness parameter. We identified all patterns with an uncorrected p-value under 0.05 or, if no p-values reached the 0.05 threshold, the strongest p-value present under 0.5. In addition, we show the number of the eight terms in which each method found at least one significant pattern. The Bayesian Markov chain Monte Carlo methods performed best in this regard, and it appears that the other matrix factorization methods captured fewer of these signatures than the clustering methods, although these specific signatures were chosen based on their inclusion in the original paper, which relied on clustering. The original paper reported on deletion mutants that were associated with these patterns; however, it is difficult to use this information with matrix factorization methods. For instance, in BD, while the pattern associated with protein synthesis does include all mutants mentioned in the paper, it also includes all other deletion mutants. This was taken to indicate that protein synthesis is vital to all growing yeast (Bidaut et al., 2006), a true though not overly useful insight. On the other hand, BFRM shows two patterns associated with this term: one with only the bub2 deletion mutant (indicated by bub2Δ) and the other with only ste4Δ showing any strength in the pattern matrix. This may reflect the strong sparseness that BFRM enforces on the data, indicating that, in terms of differences between deletion mutants, these two are the most significant for protein synthesis. No other matrix factorization methods had a significant p-value for this term. The mating term was deemed significant by all methods. Mating and filamentation are strongly coupled in yeast, with the main difference in transcriptional response to pathway activation being the use of the Tec1 cofactor. Tec1 is the driver of transposon activity, so we expect the filamentation signature to include the ‘‘transposable elements’’ category, even though it may also include the Mating category due to sharing of genes between these two processes as indicated in Fig. 3.1. We use the ‘‘transposable elements’’ term to choose a mating pattern and a filamentation pattern for BD and NCA, where two patterns appear associated with mating. For BD, we assign pattern E to mating and pattern D to filamentation.
Table 3.2 Hypergeometric p-values for enrichment in gene ontology terms for different methods

MIPS name                         | BD                | BFRM              | NCA               | ICA                        | PCA      | HC                          | KMC
Energy generation (ATP synthase)  | A 0.029           | A 0.17            | A 0.39            | A 0.19                     | A 0.47   | A 0.28                      | A 0.16
Biogenesis of cell wall           | B 0.015           | B 0.050           | B 0.14            | A 0.083                    | B 0.18   | B 0.069                     | A 0.15
Protein synthesis                 | A 0.0076          | C 0.016, D 0.021  | B 0.37            | B 0.12                     | C 0.084  | B 0.009                     | C 0.04
Ribosomal proteins                | C 0.017, A 0.016  | C 0.038           | C 0.33            | A 0.074                    | C 0.020  | B 0.0008                    | D <10⁻⁵
Mating                            | D 0.0001, E <10⁻⁵ | E <10⁻⁵           | D 0.0018, E 0.015 | C 0.0041, D 0.14, E 0.0026 | D 0.0018 | C 0.0028, D 0.045, E 0.0016 | D 0.015, E 0.038, F <10⁻⁵
MAPKKK cascade                    | F 0.04            | D 0.19            | E 0.43            | B 0.18                     | E 0.28   | E 0.17                      | D 0.17
DNA conformation modification     | –                 | F 0.069, E 0.0004 | F 0.092           | F 0.27                     | F 0.073  | F 0.051                     | E 0.19
Transposable elements             | D <10⁻⁵, G <10⁻⁵  | G <10⁻⁵           | D <10⁻⁵, E 0.0002 | –                          | D 0.025  | G <10⁻⁵                     | F <10⁻⁵
Terms with significant enrichment | 7                 | 5                 | 2                 | 1                          | 3        | 4                           | 4

BD, Bayesian Decomposition; BFRM, Bayesian Factor Regression Modeling; NCA, network component analysis; ICA, independent component analysis; PCA, principal component analysis; HC, hierarchical clustering; KMC, K-means clustering. The random clustering and pattern recognition methods, as well as all nonnegative matrix factorization methods, are not included as they generated no p-values under 0.50. Letters are assigned to patterns or clusters as they appear within a column to allow the reader to identify repeated uses of the same pattern.
Looking at the associated rows of the P matrix for the deletion mutants associated with this pattern in the original paper, we find that most deletion mutants show signal for pattern E; however, ste11Δ, ste7Δ, ste18Δ, ste12Δ, fus3Δ, and kss1Δ/fus3Δ show no signal, and ste4Δ and ste5Δ show low signal. In addition, tec1Δ shows high signal, as expected for mating. We see almost identical signals in pattern D; however, here there is low signal for tec1Δ, which is expected, as knocking out the key regulator of filamentation should eliminate the transcriptional pattern from the data. For BFRM, we assign pattern E to mating and pattern G to filamentation. Pattern E shows no signal except in the dig1Δ/dig2Δ double deletion mutants and sst2Δ. This makes sense again in terms of strong sparseness, as the Dig1 and Dig2 repressors suppress the mating response. With their absence, there should be a strong mating response signal, although it is also expected that there will be a strong filamentation response signal. Pattern G is more complicated, including a number of deletion mutants, such as anp1Δ, gas1Δ, and swi4Δ. As it does not include the dig1Δ/dig2Δ double mutants, it is likely that the filamentation and mating signatures are combined in pattern E, which is not unusual in analyses of this dataset. For NCA, we assign pattern E to mating and D to filamentation, although both terms are enriched in each and show very similar patterns in P. For pattern E, ste4Δ shows low signal and kss1Δ/fus3Δ gives a negative signal, while all other ste deletion mutants appear similar to other mutants. In pattern D, there is a sharp negative peak for ssn6Δ, and no strong variation for the ste mutants. This is somewhat difficult to interpret in terms of the mating pathway. For ICA, three patterns are associated with the Mating term and none with the ‘‘transposable elements’’ term. In all patterns, the ste deletion mutants and other mutants of key pathway members are low; however, this is true in almost all patterns. ICA generally recovers patterns that are very sparse, so there are few mutants strongly associated with these or other patterns. In all three patterns, the tec1Δ mutant also has no signal. PCA is equally difficult to interpret for the one pattern, D, associated with the Mating and ‘‘transposable elements’’ terms. This arises from the orthogonality required in PCA. For pattern D, the dig1Δ/dig2Δ mutants show a small positive signal, and the ste deletion mutants show very low signal, as does the tec1Δ mutant. However, many mutants show significant signal, and many are zero. The clustering methods essentially report a few genes associated with the clusters found for mating and filamentation. These include the dig1Δ/dig2Δ mutants and sst2Δ, also detected by BFRM. For the filamentation patterns, HC produces a fairly even pattern across all deletion mutants, while K-means clustering (KMC) shows a great deal of variation. There is no clear signature related to filamentation in either case.
5. Discussion

The matrix factorization methods discussed in this work have significantly different designs, and this affects their value for different types of analysis. PCA is very fast and decomposes the data into a series of PCs that capture maximum variance at each potential dimensionality. This can be very powerful for denoising data if only the strongest PCs are retained, although this has not been successful for high-throughput biological data. In addition, PCA can provide insight into the strongest patterns in the data. The orthogonality demanded of additional patterns can mix biological behaviors, however, so it is not capable of isolating strongly overlapping signatures in data. ICA provides a higher-order statistical measure of independence compared with PCA; however, it performs very poorly in terms of capturing signatures in the data. This may reflect the tight coupling of biological processes in living systems, making it difficult to identify true statistical independence in the data. The NMF methods appear to have difficulty isolating signatures due to this strong overlap, and their bias toward smooth solutions results in a spreading of signal across the patterns. While this is a strength in many applications, it can limit their value for high-throughput biological data. Sparse methods are likely to overcome some of this difficulty, but they require tuning and may lead to overfitting. NCA relies heavily on affinity data for the binding of transcriptional regulators to their targets. Unfortunately, such an approach assumes that the targets of regulators are known and that affinities have been measured or can be reliably computed. This remains rare, especially in the mammalian systems that are the focus of most studies. As such, NCA is at a disadvantage in comparisons such as these, so its failure to find many patterns is not surprising. In cases where such data are available, NCA can be a very useful technique to apply. The Bayesian methods, BD and BFRM, performed best among the matrix factorization methods. In one sense, this is not surprising, as these methods were created to address biological data and include sparseness in their design. BD appears to require less sparseness, leading to its ability to identify continuous distributions in the patterns (e.g., all mutants except those in the mating pathway). However, it therefore does not identify samples that strongly distinguish the gene sets, as BFRM does. In contrast, BFRM is unable to recover cases where a continuous distribution is desirable. This suggests that BD is most useful when trying to find solutions to Eq. (3.1) that reproduce the full system without an emphasis on strong differences, while BFRM is most useful when the goal is to find those conditions that are the most dissimilar in terms of signatures. This suggests the use of BFRM for biomarker discovery, including the ability to handle covariates as in Eq. (3.8), and the use of BD for systems modeling.
REFERENCES

Allison, D. B., Cui, X., Page, G. P., and Sabripour, M. (2006). Microarray data analysis: From disarray to consolidation and consensus. Nat. Rev. Genet. 7, 55–65.
Alter, O., Brown, P. O., and Botstein, D. (2000). Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl. Acad. Sci. USA 97, 10101–10106.
Bidaut, G., and Ochs, M. F. (2004). ClutrFree: Cluster tree visualization and interpretation. Bioinformatics 20, 2869–2871.
Bidaut, G., Suhre, K., Claverie, J. M., and Ochs, M. F. (2006). Determination of strongly overlapping signaling activity from microarray data. BMC Bioinform. 7, 99.
Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185–193.
Brunet, J. P., Tamayo, P., Golub, T. R., and Mesirov, J. P. (2004). Metagenes and molecular pattern discovery using matrix factorization. Proc. Natl. Acad. Sci. USA 101, 4164–4169.
Carmona-Saez, P., Pascual-Marqui, R. D., Tirado, F., Carazo, J. M., and Pascual-Montano, A. (2006). Biclustering of gene expression data by non-smooth non-negative matrix factorization. BMC Bioinform. 7, 78.
Carvalho, C. M., Chang, J., Lucas, J., Nevins, J. R., Wang, Q., and West, M. (2008). High-dimensional sparse factor modelling: Applications in gene expression genomics. J. Am. Stat. Assoc. 103, 1438–1456.
Cheng, L., and Wong, W. H. (2001). Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proc. Natl. Acad. Sci. USA 98, 31–36.
Christie, K. R., Weng, S., Balakrishnan, R., Costanzo, M. C., Dolinski, K., Dwight, S. S., Engel, S. R., Feierbach, B., Fisk, D. G., Hirschman, J. E., Hong, E. L., Issel-Tarver, L., et al. (2004). Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms. Nucleic Acids Res. 32, D311–D314.
Edgar, R., Domrachev, M., and Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 30, 207–210.
Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95, 14863–14868.
English, S. B., and Butte, A. J. (2007). Evaluation and integration of 49 genome-wide experiments and the prediction of previously unknown obesity-related genes. Bioinformatics 23, 2910–2917.
Frigyesi, A., Veerla, S., Lindgren, D., and Hoglund, M. (2006). Independent component analysis reveals new and biologically significant structures in micro array data. BMC Bioinform. 7, 290.
Gao, Y., and Church, G. (2005). Improving molecular cancer class discovery through sparse non-negative matrix factorization. Bioinformatics 21, 3970–3975.
Geman, S., and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721–741.
Guldener, U., Munsterkotter, M., Kastenmuller, G., Strack, N., van Helden, J., Lemer, C., Richelles, J., Wodak, S. J., Garcia-Martinez, J., Perez-Ortin, J. E., Michael, H., Kaps, A., et al. (2005). CYGD: The Comprehensive Yeast Genome Database. Nucleic Acids Res. 33, D364–D368.
Hoyer, P. (2004). Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res. 5, 1457–1469.
Hughes, T. R., Marton, M. J., Jones, A. R., Roberts, C. J., Stoughton, R., Armour, C. D., Bennett, H. A., Coffey, E., Dai, H., He, Y. D., Kidd, M. J., King, A. M., et al. (2000). Functional discovery via a compendium of expression profiles. Cell 102, 109–126.
Hyvärinen, A., Karhunen, J., and Oja, E. (2001). Independent Component Analysis. John Wiley & Sons, New York.
Irizarry, R. A., Bolstad, B. M., Collin, F., Cope, L. M., Hobbs, B., and Speed, T. P. (2003). Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 31, e15.
Kerr, M. K., Afshari, C. A., Bennett, L., Bushel, P., Martinez, J., Walker, N. J., and Churchill, G. A. (2002). Statistical analysis of a gene expression microarray experiment with replication. Stat. Sin. 12, 203–218.
Kim, P. M., and Tidor, B. (2003). Subsystem identification through dimensionality reduction of large-scale gene expression data. Genome Res. 13, 1706–1718.
Kossenkov, A. V., and Ochs, M. F. (2009). Matrix factorization methods applied in microarray analysis. Int. J. Data Min. Bioinform. (in press).
Kossenkov, A. V., Peterson, A. J., and Ochs, M. F. (2007). Determining transcription factor activity from microarray data using Bayesian Markov chain Monte Carlo sampling. Stud. Health Technol. Inform. 129, 1250–1254.
Lee, D. D., and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791.
Liao, J. C., Boscolo, R., Yang, Y. L., Tran, L. M., Sabatti, C., and Roychowdhury, V. P. (2003). Network component analysis: Reconstruction of regulatory signals in biological systems. Proc. Natl. Acad. Sci. USA 100, 15522–15527.
Liebermeister, W. (2002). Linear modes of gene expression determined by independent component analysis. Bioinformatics 18, 51–60.
Lin, S. M., Liao, X., McConnell, P., Vata, K., Carin, L., and Goldschmidt, P. (2002). Using functional genomic units to corroborate user experiments with the Rosetta compendium. In ‘‘Methods of Microarray Data Analysis II,’’ (S. M. Lin and K. E. Johnson, eds.), Kluwer Academic Publishers, Boston.
Lockhart, D. J., Dong, H., Byrne, M. C., Follettie, M. T., Gallo, M. V., Chee, M. S., Mittmann, M., Wang, C., Kobayashi, M., Horton, H., and Brown, E. L. (1996). Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol. 14, 1675–1680.
Mewes, H. W., Amid, C., Arnold, R., Frishman, D., Guldener, U., Mannhaupt, G., Munsterkotter, M., Pagel, P., Strack, N., Stumpflen, V., Warfsmann, J., and Ruepp, A. (2004). MIPS: Analysis and annotation of proteins from whole genomes. Nucleic Acids Res. 32, D41–D44.
Moloshok, T. D., Klevecz, R. R., Grant, J. D., Manion, F. J., Speier, W. F. t., and Ochs, M. F. (2002). Application of Bayesian Decomposition for analysing microarray data. Bioinformatics 18, 566–575.
Ochs, M. F., Stoyanova, R. S., Arias-Mendoza, F., and Brown, T. R. (1999). A new method for spectral decomposition using a bilinear Bayesian approach. J. Magn. Reson. 137, 161–176.
Parkinson, H., Sarkans, U., Shojatalab, M., Abeygunawardena, N., Contrino, S., Coulson, R., Farne, A., Lara, G. G., Holloway, E., Kapushesky, M., Lilja, P., Mukherjee, G., et al. (2005). ArrayExpress–a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 33, D553–D555.
Saeed, A. I., Bhagabati, N. K., Braisted, J. C., Liang, W., Sharov, V., Howe, E. A., Li, J., Thiagarajan, M., White, J. A., and Quackenbush, J. (2006). TM4 microarray software suite. Methods Enzymol. 411, 134–193.
Schena, M., Shalon, D., Davis, R. W., and Brown, P. O. (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–470.
Sibisi, S., and Skilling, J. (1997). Prior distributions on measure space. J. R. Stat. Soc. B 59, 217–235.
Skilling, J. (2006). Nested sampling for Bayesian computations. Proc. Valencia/ISBA 8th World Meeting on Bayesian Statistics.
Tusher, V. G., Tibshirani, R., and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA 98, 5116–5121.
Wang, G., Kossenkov, A. V., and Ochs, M. F. (2006). LS-NMF: A modified non-negative matrix factorization algorithm utilizing uncertainty estimates. BMC Bioinform. 7, 175.
CHAPTER FOUR

Modeling and Simulation of the Immune System as a Self-Regulating Network

Peter S. Kim,* Doron Levy,† and Peter P. Lee‡

* Department of Mathematics, University of Utah, Salt Lake City, Utah, USA
† Department of Mathematics and Center for Scientific Computation and Mathematical Modeling (CSCAMM), University of Maryland, College Park, Maryland, USA
‡ Division of Hematology, Department of Medicine, Stanford University, Stanford, California, USA

Contents
1. Introduction
   1.1. Complexity of immune regulation
   1.2. Self/nonself discrimination as a regulatory phenomenon
2. Mathematical Modeling of the Immune Network
   2.1. Ordinary differential equations
   2.2. Delay differential equations
   2.3. Partial differential equations
   2.4. Agent-based models
   2.5. Stochastic differential equations
   2.6. Which modeling approach is appropriate?
3. Two Examples of Models to Understand T Cell Regulation
   3.1. Intracellular regulation: The T cell program
   3.2. Intercellular regulation: iTreg-based negative feedback
4. How to Implement Mathematical Models in Computer Simulations
   4.1. Simulation of the T cell program
   4.2. Simulation of the iTreg model
5. Concluding Remarks
Acknowledgments
References
Abstract

Numerous aspects of the immune system operate on the basis of complex regulatory networks that are amenable to mathematical and computational modeling. Several modeling frameworks have recently been applied to simulating the immune system, including systems of ordinary differential equations, delay differential equations (DDEs), partial differential equations, agent-based models,
and stochastic differential equations. In this chapter, we summarize several recent examples of work that has been done in immune modeling and discuss two specific examples of models based on DDEs that can be used to understand the dynamics of T cell regulation.
1. Introduction

The immune system plays a vital role in human health, with more than 15% of genes in the human genome being linked to immune function (Hackett et al., 2007). The immune system is generally thought to protect against external invaders, such as bacteria, viruses, and other pathogens, while ignoring self. The mechanisms by which the immune system discriminates between self and nonself are being elucidated, but are far from completely understood. Lymphocytes (T and B cells) express antigen receptors generated via novel combinations of gene (V, D, J) segments. This creates an extraordinarily diverse repertoire of unique antigen receptors (>10⁷ T cell receptors, TCR, in T cells (Arstila et al., 1999); >10⁸ immunoglobulins, Ig, in B cells (Rajewsky, 1996)) that can respond to potentially all pathogens. However, self-reactive lymphocytes that are also generated in the process could cause autoimmunity if left unchecked. Newly generated T cells mature within the thymus: 95% die during this process, due to strong binding to self antigens (negative selection) or lack of sufficient signaling (positive selection). Thymic selection is a powerful force that shapes the mature T cell repertoire; this process is referred to as central tolerance. It is now known that potentially autoreactive T cells still persist after thymic selection, so other mechanisms must be operative to keep these in check and maintain peripheral tolerance. A major area of focus in immunology in recent years is regulatory T cells (Tregs), which suppress other immune cells and play an important role in peripheral self tolerance. While there is no organ equivalent to the thymus for B cells, two early tolerance checkpoints regulate developing autoreactive human B cells: the first at the immature B cell stage in the bone marrow, and the second at the transition from new emigrant to mature naïve B cells in the periphery (Meffre and Wardemann, 2008). As the thymus involutes by young adulthood, how potentially autoreactive T cells are deleted from then on is unclear. New experimental and modeling work suggests that the thymus may play a less prominent role than generally thought in the development of the peripheral T cell pool, even in persons below age 20 (Bains et al., 2009). While the self/nonself view of immunology makes sense and holds true by and large, exceptions exist upon closer inspection. Since cancer cells are of self origin, it was assumed for decades that the immune system ignores cancer. Yet, approximately 80% of human tumors are infiltrated by T cells,
which appear to have beneficial effects (Galon et al., 2006; Nelson, 2008). Tumor-infiltrating lymphocytes (TILs) have been expanded in vitro and their targets have been identified. Contrary to initial expectations, most tumor-infiltrating T cells were found to be directed against self, nonmutated antigens. Such antigens are commonly referred to as tumor-associated antigens or TAAs. Many of the TAAs identified thus far have been in the setting of melanoma (Kawakami and Rosenberg, 1997; Rosenberg, 2001): the most common ones include MART (melanoma antigen recognized by T cells), gp100, and tyrosinase; others include MAGE, BAGE, GAGE, and NY-ESO. TAAs have also been identified for breast cancer (e.g., HER-2/neu (Sotiropoulou et al., 2003), MUC (Böhm et al., 1998)), leukemia (e.g., proteinase 3 (Molldrem et al., 1999), WT1 (Oka et al., 2000)), and colon cancer (e.g., CEA (Fong et al., 2001)). Hence, tumor immunity is a form of autoimmunity (Pardoll, 1999). How TAAs, which are self, nonmutated proteins, break tolerance in the setting of cancer remains poorly understood. This adds complexity to the puzzle of immune regulation. During a typical infection, the immune response unfolds in multiple waves. The cascade begins with almost immediate responses by innate immune cells, such as neutrophils, which create an inflammatory microenvironment that subsequently attracts dendritic cells and lymphocytes to initiate the adaptive immune response. Perhaps one reason the immune system operates in a series of successive waves rather than in one continuous, concentrated surge is that each burst of immune cells has to be tightly regulated, since some primed cells could potentially give rise to an uncontrolled autoimmune response. Most immune cells exist in different states (resting/active, immature/mature, naïve/effector/memory), which provide additional regulatory mechanisms. What controls the magnitude and duration of each individual response? How does one response give way to another and induce cellular state changes? More generally, how does the immune system work as such a multifaceted, yet robustly controlled network? In this chapter, we show how principles from mathematical modeling can shed light on how the immune system functions as a self-regulating network. The complexity of the immune system, with emergent properties and nonlinear dynamics, makes it amenable to computational methods for analysis.
1.1. Complexity of immune regulation

As knowledge of the immune system grows, it becomes increasingly clear that most immune ‘‘decisions’’ (e.g., whether to attack or tolerate a certain target, or whether to magnify or suppress an immune response) are not made autonomously by individual cells or even by a few isolated cells. Instead, most immune responses result from a multitude of interactions among various types of cells, continually signaling to one another via cell contact and cytokine-mediated mechanisms.
For example, various types of T cells interact to drive cytotoxic T cell expansion and produce an overall immune response. To begin, T cells must be activated in the lymph node by antigen-presenting cells (APCs), primarily dendritic cells, that present stimulatory or suppressive signals depending on what signals they received while interacting with other cells and cytokines in the surrounding tissue. In the event of infection, APCs usually start by stimulating CD4+ T cells, which begin to multiply and secrete IL-2 and other growth signals that lead to increased activity in the lymph node. Shortly afterward, the cytotoxic (CD8+) T cells get stimulated and begin to proliferate rapidly. Cytotoxic T cells also produce a small amount of IL-2, but mostly direct their energy to extensive proliferation. Even then, T cell activation follows a more multifaceted route than that already described, for upon stimulation, helper (CD4+) T cells commit to one of two maturation pathways, Th1 and Th2, depending on the type of stimulation by APCs and cytokine signals. These pathways direct the adaptive immune response toward cellular or humoral immunity, the former mediated by T cells and macrophages and the latter by B cells and antibodies. Furthermore, in a coregulating network, these separate responses serve to promote their own advancement while suppressing the other. Specifically, activated Th1 cells release IFN-γ, which promotes Th1 differentiation while hindering Th2 production, and conversely, activated Th2 cells release IL-4 and IL-10, which promote Th2 production while hindering Th1 cells. Even from this simplified perspective, CD4+ T cell differentiation is governed by a regulatory network composed of two negatively coupled positive feedback loops. Another type of T cell associated with the CD4+ family is the regulatory T cell (Treg). As far as is currently known, these cells function as a global, negative feedback mechanism that suppresses all activated T cells, downregulates the stimulatory capacity of APCs, and secretes immunosuppressive cytokines. These cells either emerge directly from the thymus with regulatory capability and are called naturally occurring regulatory T cells (nTregs), or differentiate from nonregulatory T cells after activation and are called antigen-induced regulatory T cells (iTregs). The precise mechanisms governing Treg-mediated regulation are not well understood, although clear evidence shows that Tregs play an essential role in maintaining self tolerance and immune homeostasis (Sakaguchi et al., 1995, 2008). For example, Tregs influence the extent of memory T cell expansion following an immune response via an IL-2-dependent mechanism (Murakami et al., 1998). Furthermore, Tregs may control the extent of effector T cell proliferation during an acute immune response via an IL-2-dependent feedback mechanism (Sakaguchi et al., 2008).
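To illustrate the kind of dynamics such negatively coupled loops produce, the toy ODE system below casts Th1/Th2 mutual inhibition as Lotka-Volterra-style competition; it is a generic sketch with hypothetical constants, not a published model of these pathways.

```python
import numpy as np
from scipy.integrate import solve_ivp

def th1_th2(t, y, c=2.0):
    """Mutual inhibition between Th1 and Th2 subsets, written as
    Lotka-Volterra-style competition; cross-inhibition c > 1 makes
    the balanced state unstable, so one subset excludes the other."""
    th1, th2 = y
    return [th1 * (1.0 - th1 - c * th2),
            th2 * (1.0 - th2 - c * th1)]

sol = solve_ivp(th1_th2, (0.0, 60.0), [0.6, 0.5],
                t_eval=np.linspace(0.0, 60.0, 121))
print(sol.y[:, -1])   # approximately [1, 0]: a Th1-dominated outcome
```

Starting with a slight Th1 advantage, the system commits to a Th1-dominated state, mirroring how two negatively coupled positive feedback loops can act as a switch.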
Shifting to another aspect of the adaptive immune response, B cells can be activated by Th2 cells as discussed above, but they can also respond to antigen without T cell intervention. Many antigens, especially those with repeating carbohydrate epitopes such as those that come from bacteria, can stimulate B cells directly. Furthermore, macrophages can also display repeated patterns of the same antigen in a way that instigates B cell activation. Yet, most antigens are T cell-dependent, and B cells usually require T cell interaction to achieve maximum stimulation. Nonetheless, not only do T cell-independent mechanisms for B cell activation exist, but experimental evidence shows that B cells also play a role in regulating T cell responses. In particular, the balance between IgG and IgM antibodies secreted by B cells directs the immune response either toward monocytic cells, which favor Th1 production, or toward further B cell activity, which favors Th2 production (Bheekha Escura et al., 1995). In later work, Casadevall and Pirofski propose that IgM and IgG may even direct the course of the T cell response by playing proinflammatory and anti-inflammatory roles (Casadevall and Pirofski, 2003, 2006). Hence, T cell/B cell interactions are not unequivocally unidirectional, since stimulatory and suppressive mechanisms operate in a feedback loop through which each cell subpopulation reciprocally influences the other. Furthermore, B cells also exhibit a high level of self-regulation, since antigen-specific antibody responses can be amplified or reduced by several hundredfold via an antibody-mediated feedback mechanism (Heyman, 2000, 2003). Although this summary only touches a small part of possible immune behavior, it is clear from the myriad interactions among diverse immune cells that nearly all responses are regulated by a huge network of positive and negative feedback loops that consistently keep the global system in check.
1.2. Self/nonself discrimination as a regulatory phenomenon

Another critical aspect of the immune system is self/nonself discrimination. This term refers to the capacity of the immune system to decide whether a particular target is a virulent pathogen, a harmless foreign body, or a normal and healthy self cell. The ensuing immune response must adjust drastically based on the verdict of this decision. Discrimination between target types is largely antigen-based. Certain pathogens, such as bacteria, microbes, and parasites, present protein and carbohydrate sequences that never appear on normal tissue, conspicuously marking them as nonself. The immune system must continually learn over time to recognize certain peptide sequences as normal, while continuing to recognize other sequences as foreign. Until recently, the prevailing view was that antigen recognition worked by a lock-and-key mechanism in which adaptive immune cells expressed specific antigen receptors that only responded to one or a few peptide sequences, making it straightforward to see how the immune system
could avoid autoimmunity by removing any immune cells that had a chance of reacting with self antigen. The distinction between self and nonself antigen became blurred, however, when experimental studies revealed that the mature repertoire still maintains self-reactive immune cells; mice depleted of naturally occurring Tregs invariably develop autoimmune disease (Sakaguchi et al., 1995). Furthermore, experimental and quantitative results showed that a high level of cross-reactivity is a central feature of the T cell repertoire (Mason, 1998). These results indicated that T cells react to a range of peptide sequences and that a T cell that primarily reacts to foreign antigen could also potentially cross-react with some self-antigen, thus giving rise to an autoimmune, bystander response against healthy cells. Due to the intrinsic cross-reactivity of antigen receptors and the inevitable presence of self-reactive immune cells, successful self/nonself discrimination cannot occur as the result of a simple black and white mechanism operating at the individual immune cell level. Instead, this process must emerge from a self-regulatory immune network. Along this line, a novel view of self/nonself discrimination is emerging as a group phenomenon resulting from interactions among several immune agents, including APCs, effector T cells, Tregs, and their molecular signals (Kim et al., 2007).
2. Mathematical Modeling of the Immune Network

As mentioned above, the immune system operates according to a diverse, interconnected network of interactions, and the complexity of the network makes it difficult to understand experimentally. On one hand, in vitro experiments that examine a few or several cell types at a time often provide useful information about isolated immune interactions. However, these experiments also separate immune cells from the natural context of a larger biological network, potentially leading to nonphysiological behavior. On the other hand, in vivo experiments observe phenomena in a physiological context, but are usually incapable of resolving the contributions of individual regulatory components. To provide a particular example of this shortcoming, our understanding of Treg-mediated regulation and its effect on the immune response is still very poor, even though the majority of individual Treg interactions have already been thoroughly described. This problem of connecting complex, global phenomena to basic interactions extends over a wide range of immunological questions. How, then, can we take individual-scale results that have been established and connect them to large-scale phenomena? This gap in immunological knowledge provides a fruitful ground for mathematical modeling and computational science. In the following sections, we provide specific examples of the modeling approaches that have been applied and the insights that have been gained.
Table 4.1 summarizes the advantages, disadvantages, and examples of each modeling approach.
2.1. Ordinary differential equations

Mathematical models based on systems of ordinary differential equations (ODEs) are the most common, as these types of models have been used for cancer immunology (de Pillis et al., 2005; Moore and Li, 2004), natural killer cell responses (Merrill, 1981), B cell responses (Lee et al., 2009; Shahaf et al., 2005), B cell memory (De Boer and Perelson, 1990; De Boer et al., 1990; Varela and Stewart, 1990; Weisbuch et al., 1990), Treg dynamics (Burroughs et al., 2006; Carneiro et al., 2005; Fouchet and Regoes, 2008; León et al., 2003, 2004, 2007a,b), and T cell responses (Antia et al., 2003; Wodarz and Thomsen, 2005), to name a few examples. The primary advantage of ODE modeling is that this model structure has already been extensively applied in the study of reaction kinetics and other physical phenomena. In addition, the mathematical analysis of these systems is relatively simple compared to other types of models, and their solutions can be simulated with great computational efficiency. That is to say, these models can be made extremely complex before becoming computationally infeasible. For example, Merrill constructs an ODE model of NK cell dynamics (Merrill, 1981). In his model, NK cells represent an immune surveillance population that responds immediately to stimulation without the need for prior activation or proliferation. Using this model, he discusses how the NK population could trigger a subsequent T cell response, if necessary, by releasing stimulatory cytokines, such as IFN-γ. In a model focusing on a different aspect of the immune network, Fouchet and Regoes consider interactions between T cells and APCs to explain self/nonself discrimination (Fouchet and Regoes, 2008). In their model, precursor T cells differentiate into either effector or regulatory T cells depending on whether the stimulation from the APC is immunogenic or tolerogenic. The differentiated effector and regulatory T cells then turn around and drive other APCs to become immunogenic or tolerogenic in two competing positive feedback loops. Furthermore, Tregs also suppress effector T cells. Using this model, Fouchet and Regoes demonstrate how this feedback network causes the immune response to commit to either a fully immunogenic or a fully tolerogenic response, depending on the initial concentration, growth rate, and strength of antigenic stimulus of the target. They also consider how perturbations in the target population may lead to switches between the two network states. Modeling the adaptive immune system as a whole, Lee et al. construct a comprehensive ODE model incorporating APCs, CD4+ T cells, CD8+ T cells, B cells, antibodies, and two immune environments, the lungs and lymph nodes (Lee et al., 2009).
Table 4.1 Advantages, disadvantages, and examples of each modeling approach: ODEs, DDEs, PDEs, SDEs, and ABMs

ODE
  Advantages: Computationally efficient; describes complex systems elegantly; simple mathematical analysis; easy to formulate
  Disadvantages: Does not capture spatial dynamics or stochastic effects
  Examples: de Pillis et al. (2005), Moore and Li (2004), Merrill (1981), Lee et al. (2009), Shahaf et al. (2005), De Boer et al. (1990), De Boer and Perelson (1990), Varela and Stewart (1990), Weisbuch et al. (1990), Burroughs et al. (2006), Carneiro et al. (2005), Fouchet and Regoes (2008), León et al. (2003), León et al. (2004), León et al. (2007a,b)

DDE
  Advantages: Captures delayed feedback; computationally efficient
  Disadvantages: Does not capture spatial dynamics or stochastic effects
  Examples: Kim et al. (2007), Colijn and Mackey (2005)

PDE
  Advantages: Captures spatial dynamics and age-based behavior
  Disadvantages: Computationally demanding; complex mathematical analysis
  Examples: Antia et al. (2003), Onsum and Rao (2007)

SDE
  Advantages: Captures stochastic effects
  Disadvantages: Computationally demanding; difficult to analyze mathematically
  Examples: Figge (2009)

ABM
  Advantages: Captures spatial dynamics and individual diversity; captures stochastic effects; easy to formulate
  Disadvantages: Highly computationally demanding; difficult to analyze mathematically
  Examples: Catron et al. (2004), Scherer et al. (2006), Figge et al. (2008), Casal et al. (2005)
Using their model, Lee et al. investigate multiple scenarios of infection by influenza A virus, and study the effects of immune population levels, functionality of immune cells, and the duration of infection on the overall immune response. They propose that antiviral therapy reduces viral spread most effectively when administered within two days of exposure. Their highly intricate, multifaceted model demonstrates the ability of mathematical techniques to capture a multitude of dynamic interactions over a broad spectrum of cell types.
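For readers who want a starting point, a standard target-cell-limited viral dynamics skeleton of the kind such within-host models build on can be integrated in a few lines; the parameter values here are hypothetical, and this is not the Lee et al. model itself.

```python
import numpy as np
from scipy.integrate import solve_ivp

def viral_dynamics(t, y, beta=2e-5, delta=2.0, p=100.0, c=5.0):
    """Target-cell-limited infection model:
    T: uninfected target cells, I: infected cells, V: free virus."""
    T, I, V = y
    return [-beta * T * V,              # target cells become infected
            beta * T * V - delta * I,   # infected cells die at rate delta
            p * I - c * V]              # virus produced and cleared

sol = solve_ivp(viral_dynamics, (0.0, 10.0), [4e8, 0.0, 10.0],
                t_eval=np.linspace(0.0, 10.0, 101))
print(sol.y[2].max())   # peak viral load over the first ten days
```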
2.2. Delay differential equations

Systems of ODEs are finite-dimensional dynamical systems, while delay differential equations (DDEs) and partial differential equations (PDEs) are infinite-dimensional dynamical systems. As a result, DDEs and PDEs require more computational and analytical machinery than their finite-dimensional counterparts. However, infinite-dimensional systems come with unique modeling advantages. In general, DDEs are simpler than PDEs. DDE models are also similar in structure to ODE models, except that they explicitly include time delays. Many biological processes exhibit delayed responses to stimuli, and DDE models allow us to understand the effects of these delays on a feedback network. An example of a model that makes use of DDEs is the work by Colijn and Mackey (2005), in which they model the development of neutrophils from stem cells (i.e., neutrophil hematopoiesis). Neutrophils that have attained maturity release a molecular signal that causes cells earlier in development to stop differentiating. Ideally, this signaling gives rise to a delayed negative feedback that ultimately stabilizes the neutrophil population at an equilibrium. However, the long delay in the signal permits a situation in which the neutrophil population never stabilizes, but continues to oscillate from unusually high to unusually low levels. Colijn and Mackey connect this oscillatory dynamic to cyclical neutropenia, a disease that causes patients to have periodically low levels of neutrophils. Another example is our recent work in which we devise a mathematical model to study the regulation of the T cell response by naturally occurring Tregs (Kim et al., 2007). In this model, we consider a variety of immune agents, including APCs, CD4+ T cells, CD8+ T cells, Tregs, target cells, antigen, and positive and negative growth signals. Furthermore, each immune population can migrate between two distinct environments: the lymph node and the tissue. We also consider various time scales, such as a long delay between initial CD8+ T cell stimulation and full activation and a much shorter delay for each T cell division. The delays cause the CD8+ response to initiate with a time lag after the CD4+ response.
In addition, the delay due to cell division ensures that the Treg response develops more slowly than the other two T cell responses, allowing a small time window of unrestricted T cell expansion. The delays produce another unexpected phenomenon: a two-phase cycle of T cell maturation. In the first phase, CD4+ T cells expand and secrete positive growth signal, allowing CD8+ T cells to proliferate rapidly, whereas in the second phase, the Treg population catches up to the original effector T cell population and begins suppressing T cell activity, causing a sudden shift from proliferation to emigration from the lymph node into the peripheral tissue, where CD8+ T cells can more effectively eliminate the target population. From a practical point of view, DDE models are only slightly more complex than ODE models to simulate numerically. Evaluating a DDE system reduces to recording the past history of all populations throughout the simulation. Hence, with only a slight increase in computational complexity, DDE models widely expand the repertoire of phenomena that can be captured.
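The remark about recording past history can be made concrete: the fixed-step Euler scheme below integrates a delayed logistic equation, a generic delayed negative feedback (not the Kim et al. model), keeping the whole trajectory in an array so that x(t − τ) is a simple lookup.

```python
import numpy as np

def delayed_logistic(r=1.0, K=1.0, tau=2.0, x0=0.1, t_end=60.0, dt=0.01):
    """Euler integration of x'(t) = r x(t) (1 - x(t - tau) / K).
    The full trajectory is stored so x(t - tau) is an array lookup."""
    n_hist = int(round(tau / dt))        # steps spanning one delay interval
    n = int(round(t_end / dt))
    x = np.empty(n_hist + n + 1)
    x[: n_hist + 1] = x0                 # constant history on [-tau, 0]
    for i in range(n_hist, n_hist + n):
        x_delayed = x[i - n_hist]        # look up x(t - tau)
        x[i + 1] = x[i] + dt * r * x[i] * (1.0 - x_delayed / K)
    return x[n_hist:]                    # solution on [0, t_end]

traj = delayed_logistic()
print(traj.min(), traj.max())   # sustained oscillation, since r * tau > pi / 2
```

With the delay large enough (here r·τ = 2), the population overshoots and undershoots its equilibrium indefinitely, the same qualitative behavior that Colijn and Mackey link to cyclical neutropenia.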
2.3. Partial differential equations

PDE models capture more complexity than DDE and ODE models. In biological modeling, PDEs are often applied in two ways: age-structured models and spatio-temporal models. Age-structured models account for the progression of individual cells or members through a scheduled development process. As many organisms exhibit behaviors that depend on their maturity and developmental level, age-structured models provide a useful framework for modeling the internal development of an organism over time. For example, Antia et al. formulate an age-structured model to simulate the progression of cytotoxic T cells through an autonomous T cell proliferation program (Antia et al., 2003). According to the program, activated T cells enter into a scheduled period of expansion, then relative stabilization, followed by a period of contraction, and then restabilization at a lower level. These four stages of scheduled development comprise the T cell proliferation program. Using this model, they study the effect of variations in the T cell program on the level and duration of cytotoxic T cell responses. Furthermore, they conclude that T cell responses that are governed by autonomous, intracellular programs will execute similarly despite a wide range of antigen stimulation levels. This latter phenomenon has also been observed experimentally (Kaech and Ahmed, 2001; Mercado et al., 2000; van Stipdonk et al., 2003). Returning to the notion of regulation, a T cell program such as the one modeled by Antia et al. (2003), or any other scheduled developmental process, implies a system of internal self-regulation that may be invisible
to the external network, but that results from diverse interactions within the cell. Due to the inherent difficulty of simultaneously modeling feedback networks on intracellular and extracellular levels, age-structured models provide an efficient tool for investigating the interactions between internal and external regulatory mechanisms. Another classical and highly useful application of PDE models is modeling spatio-temporal dynamics. Using this approach, Onsum and Rao develop a PDE model of neutrophil migration toward a site of infection, in which cells move toward higher chemical concentrations (Onsum and Rao, 2007). They simulate how two chemical signals interacting in an antagonizing manner allow neutrophils to orient themselves within the chemical gradient. Their PDE model is composed of a system of diffusion and chemotaxis equations in one space dimension. From the viewpoint of deterministic differential equations, PDEs provide the most powerful mathematical modeling tool that captures the broadest range of biological phenomena. These models have, however, the potential to be significantly more computationally demanding than ODE and DDE systems.
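As a building block for such spatio-temporal models, the sketch below solves one-dimensional diffusion of a chemical signal with an explicit finite-difference scheme; the grid, coefficients, and boundary treatment are hypothetical choices for illustration, not those of Onsum and Rao.

```python
import numpy as np

def diffuse_1d(c0, D=1.0, dx=0.1, dt=0.004, n_steps=500):
    """Explicit finite differences for c_t = D c_xx with no-flux boundaries.
    The scheme is stable only when D * dt / dx**2 <= 0.5."""
    c = c0.copy()
    lam = D * dt / dx ** 2
    assert lam <= 0.5, "explicit scheme unstable for this step size"
    for _ in range(n_steps):
        c[1:-1] = c[1:-1] + lam * (c[2:] - 2.0 * c[1:-1] + c[:-2])
        c[0], c[-1] = c[1], c[-2]        # zero-gradient (no-flux) boundaries
    return c

c0 = np.zeros(101)
c0[50] = 1.0                             # point source of chemoattractant
profile = diffuse_1d(c0)                 # smoothed concentration gradient
```

The resulting concentration profile is the kind of gradient a chemotaxis term would then couple to the cell density equation.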
2.4. Agent-based models

The concept of an agent-based model (ABM) refers to a different modeling philosophy than that used in differential equation systems. First of all, ABMs deal with discrete and distinguishable agents, such as individual cells or isolated molecules, unlike differential equations, which deal with collective populations, such as densities of cells. In addition, ABMs easily allow us to account for probabilistic uncertainty, or stochasticity, in biological interactions. For example, in a stochastic ABM, an individual agent only changes state or location with a certain probability, rather than by following a deterministic process. Finally, as with PDEs, most ABMs consider the motion of agents through space. A powerful application of ABMs is demonstrated by Catron et al. (2004). They devise a sophisticated ABM to simulate the interaction between a T cell and a dendritic cell in the lymph node. By observing repeated simulations of T cell–DC interactions, they obtain estimates of the frequency of T cell–DC interactions and the expected time for T cells to become fully stimulated. In another ABM, Scherer et al. simulate T cell competition for access to binding sites on mature antigen-bearing APCs (Scherer et al., 2006). As in standard first-order reaction kinetics (i.e., the law of mass action), T cells interact with APCs with a probability proportional to the product of their two populations. Furthermore, each APC possesses a finite number of antigen binding sites that can each present either of the two types of antigen simulated in the model. Using their model, Scherer et al. determine that the nature of T cell competition changes depending on the level of antigen expressed by the APCs.
More specifically, under low antigen expression, T cells of the same antigen-specificity are more likely to compete, allowing for the coexistence of multiple T cell responses against different target epitopes. On the other hand, under high antigen expression, T cell competition becomes more indiscriminate, ultimately allowing highly reactive T cell populations to competitively exclude T cell populations that are specific for different epitopes. Using an agent-based approach, this model demonstrates how intercellular competition can indirectly provide a means of T cell regulation. At the cellular level, Figge et al. simulate B cell migration in the germinal center of a lymph node (Figge et al., 2008). In their model, they assume that individual B cells move according to a random walk while attempting to follow a chemoattractant. They apply their model to resolve the paradox obtained from two-photon imaging data that B cell migration initially appears to follow a chemotactic gradient but then devolves into what resembles more of an undirected random walk. Using their simulations, they hypothesize that chemotaxis must remain active throughout the entire B cell migration process to maintain a sense of the germinal center. At the same time, individual B cells downregulate chemokine receptors, causing them to lose sensitivity to the chemical gradient. On the molecular level, Casal et al. construct an ABM for T cell scanning of the surface of an APC. In this model, the agents are individual T cell receptors and peptide sequences that populate the surfaces of interacting cells (Casal et al., 2005). The main advantage of agent-based modeling is the ability to account for probabilistic uncertainty and individual diversity within a large population. The main difficulty is, on the other hand, the huge computational complexity that accompanies such sophisticated models. Roughly speaking, most ABMs take on the order of several hours to even days to run even once, whereas most deterministic models can be evaluated much faster. Furthermore, stochastic ABMs usually have to be simulated numerous times to obtain the overall average behavior of the system. Thus, despite their advantages, ABMs often present great challenges in terms of computational implementation.
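A toy stochastic ABM along these lines fits in a few lines of code: each simulated cell below takes a chemotactically biased random step toward a source and gradually loses receptor sensitivity, loosely echoing the Figge et al. hypothesis; all rates and scales are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

n_cells, n_steps = 200, 400
source = np.array([0.0, 0.0])               # location of the chemoattractant source
pos = rng.normal(0.0, 50.0, (n_cells, 2))   # initial positions of the cell agents
sens = np.ones(n_cells)                     # chemotactic sensitivity of each cell

for _ in range(n_steps):
    noise = rng.normal(0.0, 1.0, (n_cells, 2))       # undirected random walk
    to_source = source - pos
    dist = np.linalg.norm(to_source, axis=1, keepdims=True) + 1e-9
    drift = sens[:, None] * to_source / dist         # biased step toward source
    pos += drift + noise
    sens *= 0.995                           # receptor downregulation each step

print(np.linalg.norm(pos, axis=1).mean())   # mean distance of cells from the source
```

Early in the run the drift dominates and cells converge on the source; as sensitivity decays, motion reverts toward an undirected random walk.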
2.5. Stochastic differential equations

Perhaps the least explored path in immunological modeling is the use of stochastic differential equations (SDEs). From the point of view of complexity, SDEs lie somewhere in between deterministic differential equation models and ABMs. SDEs are written and formulated much like ODEs, except that they allow their variables to take random values. Traditionally, SDEs provide an effective means of accounting for noise, random walks,
and sporadic events (modeled as Poisson processes), and they have been applied extensively in financial mathematics, chemistry, and physics. However, they have not yet fully entered mainstream immunological modeling. Nonetheless, we can provide one example of an SDE model that has been applied to X-linked agammaglobulinemia, a genetic disorder of B cell maturation that prevents the production of immunoglobulin. In his model, Figge formulates a system of SDEs to simulate the depletion of immunoglobulin by natural degradation and antigenic consumption and its periodic replenishment by immunoglobulin substitution therapy (Figge, 2009). The stochastic model captures the tendency of the immunoglobulin repertoire to shift toward certain antigen specificities at the expense of others. In addition, the regulatory network clarifies how immunoglobulin substitution therapy may affect other aspects of the overall immune response in ways that were not clear before. Figge's assessment of the current treatment strategy is that lower treatment frequencies, separated by a period of one to several weeks, may actually improve the prevention of chronic infection. The computational complexity of SDEs generally falls between that of deterministic models and ABMs. SDE models are one step above ODE models in terms of complexity, because they incorporate stochastic effects. Nonetheless, like ODEs, SDEs still consider populations as collective groups rather than as individual agents.
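For readers unfamiliar with how an SDE is actually simulated, the standard workhorse is the Euler–Maruyama method. The sketch below integrates a generic supply–degradation equation with multiplicative noise, dX = (s − dX)dt + σX dW, which is only loosely inspired by the immunoglobulin dynamics discussed above; it is not Figge's model, and all parameter values are arbitrary assumptions.

% Euler-Maruyama integration of the toy SDE dX = (s - d*X)dt + sigma*X dW.
s = 1.0; d = 0.2; sigma = 0.1;     % assumed supply, decay, noise strength
T = 50; dt = 1e-2; n = round(T/dt);
X = zeros(n+1, 1);
X(1) = s/d;                        % start at the deterministic steady state
for k = 1:n
    dW = sqrt(dt)*randn;           % Brownian increment over one step
    X(k+1) = X(k) + (s - d*X(k))*dt + sigma*X(k)*dW;
end
plot(0:dt:T, X); xlabel('t'); ylabel('X(t)');

As with ABMs, one typically averages many such sample paths, but each path costs little more than an ODE solve, which is why SDEs sit between the two paradigms in computational cost.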
2.6. Which modeling approach is appropriate?

Such a diverse selection of available models, not to mention possible hybrid formulations, raises the question, ''Which modeling approach is most appropriate?'' The answer depends on the nature of the regulatory interactions involved, among other issues. As discussed, ODEs are the most efficient method for modeling high levels of biological complexity without a substantial increase in computational work. For any regulatory network that does not rely significantly on delayed feedback, the spatial distribution of cells and molecules, or probabilistic events, ODE models are the most effective approach. For networks that seem to depend on delayed feedback, DDEs provide a good paradigm and also remain relatively simple from a computational point of view. Networks of cells and molecules that do not mix well or efficiently, but remain localized over a long period of time, may correspond most appropriately to PDE models that account for space. Similarly, networks of cells that change behavior gradually over time also lend themselves naturally to age- or maturity-structured formulations using PDEs. Moving beyond deterministic models, SDEs provide one means of adding stochasticity to differential equations, but they come with a higher level of computational complexity.
In recent years, there has been a growth in the use of ABMs. The ABM paradigm currently provides the most complex and versatile framework for mathematical modeling by incorporating all elements of spatial and temporal dynamics, probabilistic events, and individual diversity within populations. However, ABMs demand by far the most intensive computational algorithms and are often impractical for statistical analyses such as parameter sensitivity or data fitting, which usually require numerous simulations. As a paradigm, ABMs seek to closely replicate the complexity of biological systems, so that ''experiments'' can be done in silico much more economically, and even more ethically, than they could be performed in vivo. On one hand, ABMs provide a practical means of transporting experimental studies from the wet lab to the computer lab. However, ABMs do not replace other forms of modeling because they do not substantially reduce the inherent complexity of biological systems. Furthermore, when applying ABMs to an immunological network, one should confirm that the dynamic behavior of the ABM cannot be sufficiently recreated by a simpler, differential equation formulation. For example, two papers (Doumic-Jauffret et al., 2009; Kim et al., 2008) succeed in replicating the dynamics of a highly complex ABM with almost no deviation using two deterministic models: difference equations and PDEs. In addition, these deterministic models require only 4 min and 30 s of computation time as opposed to the approximately 50 × 7 = 350 h required by the ABM. This result demonstrates the efficacy of hybrid methods that merge both ABM and differential equation frameworks to capture the underlying characteristics of a biological network without adding any superfluous detail. In practice, it is difficult to predict which mathematical and computational paradigm is most suitable for a given situation. Ideally, to thoroughly understand a system from a modeling perspective, one should devise mathematical models of all types for each immunological network in question. This line of thinking is, needless to say, unrealistic. Instead, the rational and most informed approach to mathematical modeling is to recognize the capacities and limitations of each type of model and to apply the paradigm that most accurately quantifies the essential dynamics of the system without introducing any unnecessary complexity.
3. Two Examples of Models to Understand T Cell Regulation

In the following section, we provide two examples of DDE models based on immune regulatory networks that were proposed by Antia et al. (2003) and Kim et al. (2009). Each of the models describes a distinct network that could regulate T cell development during an acute infection.
3.1. Intracellular regulation: The T cell program

The first regulatory network is based on the notion of a T cell proliferation program. According to this concept, T cells follow a fixed program of development that initiates after stimulation and then proceeds to unfold without any further feedback from the environment. As mentioned in Section 2.3, a programmed cellular response implies an intracellular regulatory mechanism that may still be highly complex, although it no longer interacts with the rest of the external network. Our example comes from Kim et al. (2009), and it stems from the original T cell program model formulated by Antia et al. (2003) and further developed by Wodarz and Thomsen (2005). The T cell program (illustrated in Fig. 4.1) can be summarized as follows:

1. APCs mature, present relevant target antigen, and migrate from the site of infection to the draining lymph node.
2. In the lymph node, APCs activate naïve T cells, which enter a minimal developmental program of m cell divisions.
3. T cells that have completed the minimal developmental program become effector cells that can divide in an antigen-dependent manner (i.e., upon further interaction with APCs) up to n additional times.
4. Effector cells that have divided the maximum number of times stop dividing.
Figure 4.1 The T cell program. (1) Immature APCs pick up antigen at the site of infection at a time-dependent rate a(t). These APCs mature and migrate to the lymph node. (2) Mature antigen-bearing APCs present antigen to naïve T cells, causing them to activate and enter the minimal developmental program of m divisions. (3) Activated T cells that have completed the minimal program continue to divide upon further interaction with mature APCs for up to n additional divisions. (4) T cells that have completed the maximal number of divisions stop dividing and wait for apoptosis. Although not indicated, each cell in the diagram has a natural death rate according to its kind.
This process can be translated into a system of DDEs in which each equation corresponds to one of the cell populations shown in Fig. 4.1. The system of DDEs is as follows:

$$A_0'(t) = s_A - d_0 A_0(t) - a(t)A_0(t), \tag{4.1}$$

$$A_1'(t) = a(t)A_0(t) - d_1 A_1(t), \tag{4.2}$$

$$T_0'(t) = s_T - d_0 T_0(t) - kA_1(t)T_0(t), \tag{4.3}$$

$$T_1'(t) = 2^m kA_1(t-s)T_0(t-s) - kA_1(t)T_1(t) - d_1 T_1(t), \tag{4.4}$$

$$T_i'(t) = 2kA_1(t-\rho)T_{i-1}(t-\rho) - kA_1(t)T_i(t) - d_1 T_i(t), \tag{4.5}$$

$$T_{n+1}'(t) = 2kA_1(t-\rho)T_n(t-\rho) - d_1 T_{n+1}(t). \tag{4.6}$$
The variables in the equations have the following definitions:
A0 is the concentration of APCs at the site of infection.
A1 is the concentration of APCs that have matured, started to present target antigen, and migrated to the lymph node.
T0 is the concentration of antigen-specific naïve T cells in the lymph node.
Ti+1 is the concentration of effector cells that have undergone i antigen-dependent divisions after the minimal developmental program.
Tn+1 denotes T cells that have undergone n divisions after the minimal developmental program. These cells have terminated the proliferation program and can no longer divide.

Cell concentration is measured in units of k/μL (thousands of cells per microliter). Figure 4.2 shows an expanded diagram of how the first two equations, Eqs. (4.1) and (4.2), are derived from step 1.

Figure 4.2 Expanded diagram of how Eqs. (4.1) and (4.2) are derived from step 1 of the T cell program.

Equation (4.1) pertains to the
population of immature APCs waiting at the site of infection. These cells are supplied into the system at a constant rate, sA, and die at a proportional rate, d0. Without stimulation, this population always remains at its equilibrium, given by sA/d0. The time-dependent coefficient a(t) denotes the rate of stimulation of APCs as a function of time. The function a(t) can be seen as being proportional to the antigen concentration at the site of infection. Equation (4.2) pertains to the population of APCs that have matured, started to present relevant antigen, and migrated to the lymph node. For simplicity, the model accounts for the maturation, presentation of antigen, and migration of APCs as one event. The first term of the equation corresponds to the rate at which these APCs enter the lymph node as APCs at the site of infection are stimulated. The second term is the natural death rate of this population. Figure 4.3 shows an expanded diagram of how Eqs. (4.3) and (4.4) are derived from step 2. Equation (4.3) pertains to naïve T cells. This population is replenished at a constant rate, sT, and dies at a proportional rate, d0. Without stimulation, the population remains at its equilibrium, sT/d0. The third term in this equation is the rate of stimulation of naïve T cells by mature APCs. The bilinear form of this term follows the law of mass action, where k is the kinetic coefficient. Equation (4.4) pertains to newly differentiated effector cells that have just finished the minimal developmental program of m divisions. The first term gives the rate at which activated naïve T cells enter the first effector state, T1. This term corresponds to the final term of the previous equation for T0'(t), except that it has an additional coefficient of 2^m and it depends on cell concentrations at time t − s. The coefficient 2^m accounts for the increase in the population of naïve T cells after m divisions, and the time delay s is the duration of the minimal developmental program.

Figure 4.3 Expanded diagram of how Eqs. (4.3) and (4.4) are derived from step 2 of the T cell program.
This term accounts for newly proliferated effector cells that appear in the T1 population s time units after activation from T0. The second term is the rate at which T1 cells are stimulated by mature APCs for further division. It is based on the law of mass action and is of the same form as the final term of the equation for T0'(t). This term exists in the equation only if the number of possible antigen-dependent divisions, n, is not 0. Finally, as shown by the last term, T1 cells continuously die at rate d1. Figure 4.4 shows an expanded diagram of how Eq. (4.5) is derived from step 3 and how Eq. (4.6) is derived from step 4. For i = 2, ..., n, Eq. (4.5) for Ti'(t) is analogous to the equation for T1'(t), except that these cells only divide once after stimulation. Hence, the coefficient of the first term is 2, and the time delay is ρ, the duration of a single division. As before, the second term is the rate at which these cells become stimulated for further division, and the final term is the death rate. Note that we use the same death rate, d1, for all effector cells. The final equation, Eq. (4.6), pertains to cells that have undergone the maximum number of possible antigen-dependent divisions. These cells do not divide anymore and can only die at rate d1.

The parameter estimates used for this model come from Kim et al. (2009) and are summarized in Table 4.2. The function a(t), representing the rate of antigen stimulation, is defined by

$$a(t) = c\,\frac{\phi(t)\,\phi(b-t)}{\phi(b)^2}, \tag{4.7}$$

where

$$\phi(x) = \begin{cases} e^{-1/x^2} & \text{if } x \geq 0, \\ 0 & \text{if } x < 0, \end{cases}$$

Figure 4.4 Expanded diagram of how Eq. (4.5) is derived from step 3 and how Eq. (4.6) is derived from step 4 of the T cell program.
Table 4.2 Estimates for model parameters

Parameter | Description | Estimate
A0(0) | Initial concentration of immature APCs | sA/d0 = 10
T0(0) | Initial concentration of naïve T cells | sT/d0 = 0.04
sA | Supply rate of immature APCs | 0.3
sT | Supply rate of naïve T cells | 0.0012
d0 | Death/turnover rate of immature APCs | 0.03
d0 | Death/turnover rate of naïve T cells | 0.03
d1 | Death/turnover rate of mature APCs | 0.8
d1 | Death/turnover rate of effector T cells | 0.4
k | Kinetic coefficient | 20
m | Number of divisions in minimal developmental program | 7
n | Maximum number of antigen-dependent divisions | 3–10
ρ | Duration of one T cell division | 1/3
s | Duration of minimal developmental program | 3
a(t) | Rate of APC stimulation | Eq. (4.7)
b | Duration of antigen availability | 10
c | Level of APC stimulation | 1
r | Rate of differentiation of effector cells into iTregs | 0.01

Concentrations are in units of k/μL, and time is measured in days.
and b, c > 0. This function starts at 0, increases to a positive value for some time, and returns to 0. Following Kim et al. (2009), the duration of antigen availability, b, is estimated to be 10 days, and the level of APC stimulation, c, is estimated to be 1. (See Fig. 4.5 for graphs of a(t) for b = 3 and b = 10 when c = 1.)
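Equation (4.7) translates directly into MATLAB; the sketch below also reproduces the two curves of Fig. 4.5. Only the plotting range is an arbitrary choice.

% The antigen stimulation rate a(t) of Eq. (4.7). The bump function phi
% is smooth and vanishes for x < 0, so a(t) is zero outside 0 <= t <= b.
phi = @(x) exp(-1./x.^2).*(x >= 0);
a = @(t, b, c) c*phi(t).*phi(b - t)/phi(b)^2;
t = linspace(0, 12, 600);
plot(t, a(t, 3, 1), t, a(t, 10, 1));     % the b = 3 and b = 10 curves, c = 1
legend('b = 3', 'b = 10'); xlabel('t (days)'); ylabel('a(t)');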
3.2. Intercellular regulation: iTreg-based negative feedback

The second regulatory network is based on a negative feedback loop mediated by iTregs that differentiate from effector T cells during the course of the immune response. Several mathematical models considering Treg-mediated feedback have been developed for naturally occurring regulatory T cells (nTregs) (Burroughs et al., 2006; León et al., 2003) and iTregs (Fouchet and Regoes, 2008), but these models focus on the function of Tregs in maintaining immune tolerance. In contrast, the following model
focuses on the primary response against acute infection rather than long-term behavior. The model comes from the study of Kim et al. (2009). In this feedback network, T cell responses begin the same way as in the T cell program. However, T cell contraction initiates differently, since it is mediated by external suppression by iTregs. This process (illustrated in Fig. 4.6) can be described in five steps:

1. APCs mature, present relevant target antigen, and migrate from the site of infection to the draining lymph node.
Figure 4.5 Graphs of the antigen function a(t) given by Eq. (4.7) for b = 3 and b = 10 when c = 1. The function a(t) represents the rate at which immature APCs pick up antigen and are stimulated.
Figure 4.6 Diagram of the iTreg model. The first three steps are identical to those in the cell division-based model that is shown in Fig. 4.1. In the fourth step, effector cells differentiate into iTregs at rate r. In the fifth step, iTregs suppress effector cells. Although not indicated, each cell in the diagram has a natural death rate according to its kind.
2. In the lymph node, APCs activate naïve T cells, which enter a minimal developmental program of m cell divisions.
3. T cells that have completed the minimal developmental program become effector cells that keep dividing in an antigen-dependent manner as long as they are not suppressed by iTregs.
4. Effector cells differentiate into iTregs at a constant rate.
5. The iTregs suppress effector cells upon interaction.

The model can be formulated as a system of five DDEs shown below:

$$A_0'(t) = s_A - d_0 A_0(t) - a(t)A_0(t),$$

$$A_1'(t) = a(t)A_0(t) - d_1 A_1(t),$$

$$T_0'(t) = s_T - d_0 T_0(t) - kA_1(t)T_0(t),$$

$$T_E'(t) = 2^m kA_1(t-s)T_0(t-s) - kA_1(t)T_E(t) + 2kA_1(t-\rho)T_E(t-\rho) - (d_1 + r)T_E(t) - kT_R(t)T_E(t), \tag{4.8}$$

$$T_R'(t) = rT_E(t) - d_1 T_R(t). \tag{4.9}$$
As in the previous model, A0 is the concentration of APCs at the site of infection, A1 is the concentration of APCs that have matured, started to present target antigen, and migrated to the lymph node, and T0 is the concentration of naïve T cells in the lymph node. In addition, TE is the concentration of effector cells, and TR is the concentration of iTregs. The first three equations, for APCs and naïve T cells, are identical to those in the T cell program model from Section 3.1. The first two terms of Eq. (4.8) for TE'(t) are identical to the first two terms of Eq. (4.4) for the T cell program. The third term in this equation is the rate at which cells that have just finished dividing reenter the effector cell population. In this model, cells do not have a programmed maximum number of divisions, so it is not necessary to count the number of divisions a cell has undertaken. The only regulatory mechanism is suppression by iTregs. The fourth term is the rate at which effector cells exit the population, either through death at rate d1 or by differentiating into iTregs at rate r. The final term is the rate at which effector cells are suppressed by iTregs. As before, the rate of iTreg–effector interactions follows the same mass action law as APC–T cell interactions. Equation (4.9) pertains to iTregs. The first term is the rate at which effector cells differentiate into iTregs, and the second term is the rate at which iTregs die. The iTregs have the same death rate as effector cells. All parameters in this model are identical to those used for the T cell program, except for r, the rate of differentiation of effector cells into iTregs. Following Kim et al. (2009), we estimate that r = 0.01/day, meaning that 1% of effector cells differentiate into iTregs per day. The parameters used in the iTreg model are listed in Table 4.2.
4. How to Implement Mathematical Models in Computer Simulations

Once a mathematical model has been developed, the next step is to implement it computationally. A common approach is to write the relevant computational software for each problem, since this method has the advantage of allowing the programmer to optimize the computer algorithms for his or her particular needs. However, various software packages already exist for most of the modeling paradigms. For ABM simulations, the immune system simulator (IMMSIM) (Celada and Seiden, 1992; Seiden and Celada, 1992), the synthetic immune system (SIS) (Mata and Cohn, 2007), and the basic immune simulator (BIS) (Folcik et al., 2007) provide platforms for generating virtual immune systems populated by a variety of cell types. For deterministic, differential equation models, the most frequently used programs are MATLAB, Maple, and XPPAUT. In general, DDE models are relatively simple to evaluate on any of the software platforms for differential equations mentioned above, and we numerically simulate the DDE models from Sections 3.1 and 3.2 with the ''dde23'' function of MATLAB R2008a. Currently, no widely used computational tools exist for evaluating SDE models of biological systems, since this direction of research has yet to be developed. Although numerous programs are available for pricing financial derivatives, these approaches are usually not ideal for models pertaining to immunological networks.
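As a concrete illustration, the sketch below passes the T cell program system, Eqs. (4.1)–(4.6), to dde23 with the Table 4.2 estimates. The state vector stacks [A0; A1; T0; T1; ...; Tn+1]; the choice n = 3 and the helper name tcellprog are ours, and the plotted quantity is the instantaneous sum of effector populations (it omits the in-transit integral correction discussed in Section 4.1).

% Sketch: the T cell program model, Eqs. (4.1)-(4.6), solved with dde23.
% Table 4.2 estimates; d1A and d1T are the two d1 values (APCs, T cells).
sA = 0.3; sT = 0.0012; d0 = 0.03; d1A = 0.8; d1T = 0.4;
k = 20; m = 7; n = 3; s = 3; rho = 1/3; b = 10; c = 1;
phi = @(x) exp(-1./x.^2).*(x >= 0);
a = @(t) c*phi(t).*phi(b - t)/phi(b)^2;            % Eq. (4.7)
yhist = [sA/d0; 0; sT/d0; zeros(n+1, 1)];          % pre-infection equilibrium
f = @(t, y, Z) tcellprog(t, y, Z, sA, sT, d0, d1A, d1T, k, m, n, a);
sol = dde23(f, [s rho], yhist, [0 20]);
plot(sol.x, sum(sol.y(4:end, :), 1));              % effector cells vs time
xlabel('time (days)'); ylabel('effector T cells (k/\muL)');

function dy = tcellprog(t, y, Z, sA, sT, d0, d1A, d1T, k, m, n, a)
% Z(:,1) is the state delayed by s; Z(:,2) is the state delayed by rho.
dy = zeros(n + 4, 1);
dy(1) = sA - d0*y(1) - a(t)*y(1);                          % Eq. (4.1)
dy(2) = a(t)*y(1) - d1A*y(2);                              % Eq. (4.2)
dy(3) = sT - d0*y(3) - k*y(2)*y(3);                        % Eq. (4.3)
dy(4) = 2^m*k*Z(2,1)*Z(3,1) - k*y(2)*y(4) - d1T*y(4);      % Eq. (4.4)
for i = 5:n+3                                              % Eq. (4.5)
    dy(i) = 2*k*Z(2,2)*Z(i-1,2) - k*y(2)*y(i) - d1T*y(i);
end
dy(n+4) = 2*k*Z(2,2)*Z(n+3,2) - d1T*y(n+4);                % Eq. (4.6)
end

The iTreg model of Section 3.2 can be handled the same way, with the state reduced to [A0; A1; T0; TE; TR] and Eqs. (4.8) and (4.9) in place of Eqs. (4.4)–(4.6).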
4.1. Simulation of the T cell program

The following simulations are drawn from Kim et al. (2009) and primarily focus on the effect of antigen stimulation levels and precursor concentrations on the magnitude of the T cell response. Before proceeding to numerical simulations, the first crucial point to notice is that the T cell dynamics given by Eqs. (4.1)–(4.6) directly scale with respect to the precursor concentration, T0(0). In other words, a T cell response that begins with x times as many precursors as another automatically has a peak that is x times higher. This scaling property holds because the T cell program model is linear with respect to the T cell populations, Ti(t). As a result, simulations pertaining to the T cell program only consider relative T cell expansion levels, given by Ttotal(t)/T0(0), rather than total T cell populations, given by

$$T_{\mathrm{total}}(t) = \sum_{i=1}^{n}\left(T_i(t) + \int_{t-\rho}^{t} kA_1(u)\,T_i(u)\,du\right) + T_{n+1}(t).$$

Note that Eqs. (4.4)–(4.6) imply that stimulated T cells leave the system during the division process and return ρ time units later. Hence, the total T cell concentration is not only the sum of the T cell populations Ti(t),
but also the populations that are undergoing division, which are given by the integrals in the above expression. The first set of simulations examines the dependence of the T cell peak on the two antigen-related parameters, c and b, corresponding to the level and duration of antigen presentation, respectively. We use the parameters listed in Table 4.2, and c and b vary from 0.1 to 3 and from 1 to 15 days, respectively. The maximum T cell expansion level versus c is plotted in Fig. 4.7A, and the maximum T cell expansion level versus b is plotted in Fig. 4.7B. To understand the T cell behavior under a variety of possible T cell programs, we use the lowest and highest estimated values (3 and 10) of n, which denotes the maximum possible number of antigen-dependent divisions after the minimal developmental program. We draw the two corresponding curves for each value of n in each plot. As can be seen in Fig. 4.7A, T cell dynamics saturate very quickly in relation to c, so much so that the number of population doublings is almost constant
Figure 4.7 Dependence of T cell dynamics on c, a parameter corresponding to the level of antigen presentation, and on b, the duration of antigen availability. (A) Maximum T cell expansion level versus c. Expansion level is measured in population doublings, which is defined by log2(max(Ttotal/T0(0))). Data is shown for the two possible values of n, the maximum number of possible antigen-dependent divisions after the minimal developmental program. (B) Maximum T cell expansion level versus b.
from as low as c = 0.1 to as high as c = 3. By continuity, the size of the T cell peak must go to 0 as c decreases, but the drop is very steep. The two extra points shown in the curves for n = 3 correspond to c = 0.001 and c = 0.01. These values correspond to roughly 0.1% and 1% of APCs getting stimulated per day. Hence, even very low stimulation levels result in nearly saturated T cell dynamics. The plots of the maximum expansion level versus b are shown in Fig. 4.7B. The figure shows that T cell dynamics also saturate as b increases, but not as quickly as for c. Hence, the simulations show that the duration of antigen availability is more important than the level. The plots in Fig. 4.7B show that for both n = 3 and n = 10, T cell expansion levels begin to saturate around b = 4 or 5 days, indicating that the immune response behaves similarly as long as antigen remains available long enough to elicit a fully developed T cell response. As a final simulation for the T cell program model, Fig. 4.8 shows the time evolution of various cell populations when n = 10 and the rest of the
Figure 4.8 Time evolution of immune cell populations over time. (A) The dynamics of naïve and effector cells over 20 days. (B) The dynamics of immature and mature APCs.
parameters are taken from Table 4.2. In Fig. 4.8A, the T cell peak is 94,605 times higher than the precursor concentration, T0(0) = 0.04 k/μL. As mentioned at the beginning of this section, the ratio between the T cell peak and the precursor concentration remains constant for all values of T0(0). Since this relation is exactly linear, we do not give a plot of the maximum height of T cell expansion versus precursor concentration.
4.2. Simulation of the iTreg model

The simulations in this section are also drawn from Kim et al. (2009) and follow a similar pattern to those in Section 4.1. Due to the negative feedback from iTregs in Eq. (4.8), T cell dynamics do not directly scale with respect to precursor frequencies as in the T cell program model. In this case, it is informative to look at the total effector cell population, given by

$$T_{\mathrm{total}}(t) = T_E(t) + \int_{t-\rho}^{t} kA_1(u)\,T_E(u)\,du.$$
Figure 4.9 displays a log–log plot of the maximum expansion level versus the initial naïve T cell concentration, T0(0), which is varied from 4 × 10⁻⁴ to 4 k/μL, a range 100 times lower and 100 times higher than the estimated value in Table 4.2. As shown in Fig. 4.9, the simulated data fit a power law of exponent 0.3004, meaning that the expansion roughly scales with the cube
Figure 4.9 A log–log plot of the dependence of the T cell peak on T0(0), the initial concentration of naïve T cells. The fitted power law is max(Ttotal) = p1·T0(0)^p2 with p1 = 1327 and p2 = 0.3004. The linear regression shows that the maximum T cell expansion level is roughly proportional to T0(0)^(1/3); the linear correlation is rcorr = 0.9987.
root of the initial naïve cell concentration. For example, to obtain a T cell response that is 10 times higher (or lower) than normal, the system would need to start with a reactive precursor concentration that is 1000 times higher (or lower) than normal. Following the same sensitivity analysis as in Section 4.1, we see in Fig. 4.10 that the dynamics of the iTreg model exhibit similar saturating behavior with respect to the level and the duration of antigen stimulation, given by c and b, respectively. Like the program-based model, the feedback model generates dynamics that are insensitive to the level and duration of antigen stimulation. Figure 4.11 shows the time evolution of the effector and iTreg populations when all other parameters are taken from Table 4.2. The figure
Figure 4.10 Dependence of T cell dynamics on c, a parameter corresponding to the level of antigen presentation, and on b, the duration of antigen availability. (A) Maximum T cell expansion level versus c. (B) Maximum T cell expansion level versus b.
Figure 4.11 Time evolution of effector and iTreg populations over time. The peak of the iTreg response roughly coincides with the peak of the T cell response, but the iTreg response decays slower.
indicates that the iTreg concentration peaks around the same time as the T cell response, but lingers a while longer, ensuring a full contraction of the T cell population. In this example, the naïve T cell population begins at 0.04 k/μL, and the response peaks at 482 k/μL, corresponding to an expansion level of 13.6 divisions on average. The numerical simulations show that both the T cell program and feedback regulation models exhibit similar insensitivity to the nature of antigen stimulation. However, the feedback model behaves differently from the program-based model with respect to variations in precursor frequency. Specifically, the feedback model significantly reduces variance in precursor concentration (by a cube-root power law), whereas the T cell program model directly translates variance in precursor concentration into variance in peak T cell levels (by a linear scaling law).
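The cube-root scaling reported above is obtained by ordinary linear regression in log–log coordinates. In the sketch below, each entry of maxT would in practice be the peak of Ttotal from one iTreg-model run started at precursor concentration T00(i); to keep the sketch self-contained, we generate stand-in values from the reported fit of Fig. 4.9.

% Linear regression in log-log coordinates recovers the power law
% max(Ttotal) = p1*T0(0)^p2 of Fig. 4.9.
T00 = logspace(log10(4e-4), log10(4), 9);  % 100x below/above 0.04 k/muL
maxT = 1327*T00.^0.3004;                   % stand-in for simulated peaks
p = polyfit(log10(T00), log10(maxT), 1);   % p(1) = exponent, p(2) = offset
fprintf('max(Ttotal) = %.0f * T0(0)^%.4f\n', 10^p(2), p(1));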
5. Concluding Remarks

By constructing mathematical models based on DDEs, we show how we can investigate two structurally distinct regulatory networks for T cell dynamics. Using modeling, we readily determine the similarities and differences between the two models. In particular, we find that both networks have very low sensitivity to changes in the nature of antigen stimulation, but differ greatly in how they respond to variations in T cell precursor
frequency. Hence, our example demonstrates how mathematical and computational analysis can immediately provide a testable hypothesis to help validate or invalidate these two proposed regulatory networks. Moving to a broader perspective, the entire immune response operates as a system of self-regulating networks, and many of these networks have the potential to be elucidated by mathematical modeling and computational simulation. Several modeling frameworks already exist, and up to now, ODE models have been the most widely used due to their versatility across a wide range of problems and their ability to handle complex systems efficiently. DDEs and PDEs, which are both infinite-dimensional systems, also frequently appear in the repertoire of deterministic models. DDEs possess one advantage over ODEs in that they explicitly account for delayed feedback without adding substantial computational complexity. PDE models provide an even more complex framework and can incorporate a wide range of spatial and temporal phenomena, such as molecular diffusion and cell motion and maturation. Among probabilistic models, stochastic ABMs are the most widely used, since they are typically easy to formulate, directly model individual diversity within populations, and recreate phenomena resulting from random events. An unavoidable disadvantage of ABMs is, however, that they are computationally demanding, especially in comparison to deterministic, differential equation models that can often approximate the same phenomena with good accuracy. Thus, a promising compromise between the deterministic, differential equation paradigm and the stochastic agent-based paradigm comes from SDEs, a type of differential equation system that incorporates stochastic behavior. Nonetheless, this domain of mathematical modeling remains largely unexplored, at least for immunological networks, and offers a strong possibility for future research. The immune regulatory network is intricate, made up of myriad intracellular and intercellular interactions that evade complete understanding; it provides fertile ground for unraveling these interwoven mysteries with the help of insight gained from mathematical and computational modeling.
ACKNOWLEDGMENTS

The work of PSK was supported in part by the NSF Research Training Grant and the Department of Mathematics at the University of Utah. The work of DL was supported in part by the joint NSF/NIGMS program under Grant Number DMS-0758374. This work was supported by a Department of Defense Era of Hope grant to PPL. The work of DL and of PPL was supported in part by Grant Number R01CA130817 from the National Cancer Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Cancer Institute or the National Institutes of Health.
REFERENCES

Antia, R., Bergstrom, C. T., Pilyugin, S. S., Kaech, S. M., and Ahmed, R. (2003). Models of CD8+ responses: 1. What is the antigen-independent proliferation program. J. Theor. Biol. 221(4), 585–598.
Arstila, T. P., Casrouge, A., Baron, V., Even, J., Kanellopoulos, J., and Kourilsky, P. (1999). A direct estimate of the human alphabeta T cell receptor diversity. Science 286(5441), 958–961.
Bains, I., Antia, R., Callard, R., and Yates, A. J. (2009). Quantifying the development of the peripheral naive CD4+ T-cell pool in humans. Blood 113(22), 5480–5487.
Bheekha Escura, R., Wasserbauer, E., Hammerschmid, F., Pearce, A., Kidd, P., and Mudde, G. C. (1995). Regulation and targeting of T-cell immune responses by IgE and IgG antibodies. Immunology 86(3), 343–350.
Böhm, C. M., Hanski, M. L., Stefanović, S., Rammensee, H. G., Stein, H., Taylor-Papadimitriou, J., Riecken, E. O., and Hanski, C. (1998). Identification of HLA-A2-restricted epitopes of the tumor-associated antigen MUC2 recognized by human cytotoxic T cells. Int. J. Cancer 75(5), 688–693.
Burroughs, N. J., de Oliveira, B. M. P. M., and Pinto, A. A. (2006). Regulatory T cell adjustment of quorum growth thresholds and the control of local immune responses. J. Theor. Biol. 241, 134–141.
Carneiro, J., Paixão, T., Milutinovic, D., Sousa, J., Leon, K., Gardner, R., and Faro, J. (2005). Immunological self-tolerance: Lessons from mathematical modeling. J. Comput. Appl. Math. 184, 77–100.
Casadevall, A., and Pirofski, L. A. (2003). The damage-response framework of microbial pathogenesis. Nat. Rev. Microbiol. 1, 17–24.
Casadevall, A., and Pirofski, L. A. (2006). A reappraisal of humoral immunity based on mechanisms of antibody-mediated protection against intracellular pathogens. Adv. Immunol. 91, 1–44.
Casal, A., Sumen, C., Reddy, T. E., Alber, M. S., and Lee, P. P. (2005). Agent-based modeling of the context dependency in T cell recognition. J. Theor. Biol. 236(4), 376–391.
Catron, D. M., Itano, A. A., Pape, K. A., Mueller, D. L., and Jenkins, M. K. (2004). Visualizing the first 50 hr of the primary immune response to a soluble antigen. Immunity 21(3), 341–347.
Celada, F., and Seiden, P. E. (1992). A computer model of cellular interactions in the immune system. Immunol. Today 13(2), 56–62.
Colijn, C., and Mackey, M. C. (2005). A mathematical model of hematopoiesis—II. Cyclical neutropenia. J. Theor. Biol. 237(2), 133–146.
De Boer, R. J., and Perelson, A. S. (1990). Size and connectivity as emergent properties of a developing immune network. J. Theor. Biol. 149(3), 381–424.
De Boer, R. J., Kevrekidis, I. G., and Perelson, A. S. (1990). A simple idiotypic network with complex dynamics. Chem. Eng. Sci. 45, 2375–2382.
de Pillis, L. G., Radunskaya, A. E., and Wiseman, C. L. (2005). A validated mathematical model of cell-mediated immune response to tumor growth. Cancer Res. 65(17), 7950–7958.
Doumic-Jauffret, M., Kim, P. S., and Perthame, B. (2009). Stability analysis of a simplified yet complete model for chronic myelogenous leukemia. Submitted for publication.
Figge, M. T. (2009). Optimization of immunoglobulin substitution therapy by a stochastic immune response model. PLoS ONE 4(5), e5685.
Figge, M. T., Garin, A., Gunzer, M., Kosco-Vilbois, M., Toellner, K. M., and Meyer-Hermann, M. (2008). Deriving a germinal center lymphocyte migration model from two-photon data. J. Exp. Med. 205(13), 3019–3029.
Folcik, V. A., An, G. C., and Orosz, C. G. (2007). The Basic Immune Simulator: An agent-based model to study the interactions between innate and adaptive immunity. Theor. Biol. Med. Model. 4, 39.
Fong, L., Hou, Y., Rivas, A., Benike, C., Yuen, A., Fisher, G. A., Davis, M. M., and Engleman, E. G. (2001). Altered peptide ligand vaccination with Flt3 ligand expanded dendritic cells for tumor immunotherapy. Proc. Natl. Acad. Sci. USA 98(15), 8809–8814.
Fouchet, D., and Regoes, R. (2008). A population dynamics analysis of the interaction between adaptive regulatory T cells and antigen presenting cells. PLoS ONE 3(5), e2306.
Galon, J., Costes, A., Sanchez-Cabo, F., Kirilovsky, A., Mlecnik, B., Lagorce-Pagès, C., Tosolini, M., Camus, M., Berger, A., Wind, P., Zinzindohoué, F., Bruneval, P., et al. (2006). Type, density, and location of immune cells within human colorectal tumors predict clinical outcome. Science 313(5795), 1960–1964.
Hackett, C. J., Rotrosen, D., Auchincloss, H., and Fauci, A. S. (2007). Immunology research: Challenges and opportunities in a time of budgetary constraint. Nat. Immunol. 8(2), 114–117.
Heyman, B. (2000). Regulation of antibody responses via antibodies, complement, and Fc receptors. Annu. Rev. Immunol. 18, 709–737.
Heyman, B. (2003). Feedback regulation by IgG antibodies. Immunol. Lett. 88(2), 157–161.
Kaech, S. M., and Ahmed, R. (2001). Memory CD8+ T cell differentiation: Initial antigen encounter triggers a developmental program in naïve cells. Nat. Immunol. 2(5), 415–422.
Kawakami, Y., and Rosenberg, S. A. (1997). Human tumor antigens recognized by T-cells. Immunol. Res. 16(4), 313–339.
Kim, P. S., Lee, P. P., and Levy, D. (2009). Emergent group dynamics governed by regulatory cells produce a robust primary T cell response. Accepted by Bull. Math. Biol.
Kim, P. S., Lee, P. P., and Levy, D. (2007). Modeling regulation mechanisms of the immune system. J. Theor. Biol. 246, 33–69.
Kim, P. S., Lee, P. P., and Levy, D. (2008). Modeling imatinib-treated chronic myelogenous leukemia: Reducing the complexity of agent-based models. Bull. Math. Biol. 70(3), 728–744.
Lee, H. Y., Topham, D. J., Park, S. Y., Hollenbaugh, J., Treanor, J., Mosmann, T. R., Jin, X., Ward, B. M., Miao, H., Holden-Wiltse, J., Perelson, A. S., Zand, M., et al. (2009). Simulation and prediction of the adaptive immune response to influenza A virus infection. J. Virol. 83(14), 7151–7165.
León, K., Lage, A., and Carneiro, J. (2003). Tolerance and immunity in a mathematical model of T-cell mediated suppression. J. Theor. Biol. 225, 107–126.
León, K., Faro, J., Lage, A., and Carneiro, J. (2004). Inverse correlation between the incidences of autoimmune disease and infection predicted by a model of T cell mediated tolerance. J. Autoimmun. 22, 31–42.
León, K., Lage, A., and Carneiro, J. (2007a). How regulatory CD25+CD4+ T cells impinge on tumor immunobiology? On the existence of two alternative dynamical classes of tumors. J. Theor. Biol. 247, 122–137.
León, K., Lage, A., and Carneiro, J. (2007b). How regulatory CD25+CD4+ T cells impinge on tumor immunobiology: The differential response of tumors to therapies. J. Immunol. 179(9), 5659–5668.
Mason, D. (1998). A very high level of crossreactivity is an essential feature of the T-cell receptor. Immunol. Today 19(9), 395–404.
Mata, J., and Cohn, M. (2007). Cellular automata-based modeling program: Synthetic immune system. Immunol. Rev. 216, 198–212.
Meffre, E., and Wardemann, H. (2008). B-cell tolerance checkpoints in health and autoimmunity. Curr. Opin. Immunol. 20(6), 632–638.
Mercado, R., Vijh, S., Allen, S. E., Kerksiek, K., Pilip, I. M., and Pamer, E. G. (2000). Early programming of T cell populations responding to bacterial infection. J. Immunol. 165(12), 6833–6839.
Merrill, S. J. (1981). A model of the role of natural killer cells in immune surveillance—I. J. Math. Biol. 12, 363–373.
Molldrem, J. J., Lee, P. P., Wang, C., Champlin, R. E., and Davis, M. M. (1999). A PR1-human leukocyte antigen-A2 tetramer can be used to isolate low-frequency cytotoxic T lymphocytes from healthy donors that selectively lyse chronic myelogenous leukemia. Cancer Res. 59(11), 2675–2681.
Moore, H., and Li, N. K. (2004). A mathematical model for chronic myelogenous leukemia (CML) and T cell interaction. J. Theor. Biol. 225(4), 513–523.
Murakami, M., Sakamoto, A., Bender, J., Kappler, J., and Marrack, P. (1998). CD25+CD4+ T cells contribute to the control of memory CD8+ T cells. Proc. Natl. Acad. Sci. USA 99(13), 8832–8837.
Nelson, B. H. (2008). The impact of T-cell immunity on ovarian cancer outcomes. Immunol. Rev. 222, 101–116.
Oka, Y., Elisseeva, O. A., Tsuboi, A., Ogawa, H., Tamaki, H., Li, H., Oji, Y., Kim, E. H., Soma, T., Asada, M., Ueda, K., Maruya, E., et al. (2000). Human cytotoxic T-lymphocyte responses specific for peptides of the wild-type Wilms' tumor gene (WT1) product. Immunogenetics 51(2), 99–107.
Onsum, M., and Rao, C. V. (2007). A mathematical model for neutrophil gradient sensing and polarization. PLoS Comput. Biol. 3(3), e36.
Pardoll, D. M. (1999). Inducing autoimmune disease to treat cancer. Proc. Natl. Acad. Sci. USA 96(10), 5340–5342.
Rajewsky, K. (1996). Clonal selection and learning in the antibody system. Nature 381(6585), 751–758.
Rosenberg, S. A. (2001). Progress in human tumour immunology and immunotherapy. Nature 411(6835), 380–384.
Sakaguchi, S., Sakaguchi, N., Asano, M., Itoh, M., and Toda, M. (1995). Immunologic self-tolerance maintained by activated T cells expressing IL-2 receptor α-chains (CD25). Breakdown of a single mechanism of self-tolerance causes various autoimmune diseases. J. Immunol. 155(3), 1151–1164.
Sakaguchi, S., Yamaguchi, T., Nomura, T., and Ono, M. (2008). Regulatory T cells and immune tolerance. Cell 133(5), 775–787.
Scherer, A., Salathé, M., and Bonhoeffer, S. (2006). High epitope expression levels increase competition between T cells. PLoS Comput. Biol. 2(8), e109.
Seiden, P. E., and Celada, F. (1992). A model for simulating cognate recognition and response in the immune system. J. Theor. Biol. 158(3), 329–357.
Shahaf, G., Johnson, K., and Mehr, R. (2005). B cell development in aging mice: Lessons from mathematical modeling. Int. Immunol. 18, 31–39.
Sotiropoulou, P. A., Perez, S. A., Voelter, V., Echner, H., Missitzis, I., Tsavaris, N. B., Papamichail, M., and Baxevanis, C. N. (2003). Natural CD8+ T-cell responses against MHC class I epitopes of the HER-2/neu oncoprotein in patients with epithelial tumors. Cancer Immunol. Immunother. 52(12), 771–779.
van Stipdonk, M. J., Hardenberg, G., Bijker, M. S., Lemmens, E. E., Droin, N. M., Green, D. R., and Schoenberger, S. P. (2003). Dynamic programming of CD8+ T lymphocyte responses. Nat. Immunol. 4(4), 361–365.
Varela, F. J., and Stewart, J. (1990). Dynamics of a class of immune networks: Global stability of idiotype interactions. J. Theor. Biol. 144, 93–101.
Weisbuch, G., DeBoer, R. J., and Perelson, A. S. (1990). Localized memories in idiotypic networks. J. Theor. Biol. 146(4), 483–499.
Wodarz, D., and Thomsen, A. R. (2005). Effect of the CTL proliferation program on virus dynamics. Int. Immunol. 17(9), 1269–1276.
C H A P T E R
F I V E

Entropy Demystified: The ''Thermo''dynamics of Stochastically Fluctuating Systems

Hong Qian

Contents
1. Introduction 112
2. Energy 113
 2.1. Equilibrium and nonequilibrium steady state 113
 2.2. Cycle kinetics, thermodynamic box and detailed balance 115
3. Entropy and ''Thermo''-dynamics of Markov Processes 117
 3.1. Entropy and entropy balance equation 118
 3.2. ''Equilibrium'' and time reversibility 119
 3.3. ''Free energy'' and relative entropy 121
4. A Three-State Two-Cycle Motor Protein 122
5. Phosphorylation–Dephosphorylation Cycle Kinetics 125
 5.1. PdPC signaling switch and phosphorylation energy 125
 5.2. PdPC with Michaelis–Menten kinetics 128
 5.3. Substrate specificity amplification 130
6. Summary and Challenges 131
 6.1. A little historical reflection 131
 6.2. Entropy: A mathematical concept? 132
References 132

Abstract

In fluctuating enzyme reaction systems represented in terms of Markov processes, we show that entropy and the Second Law of Thermodynamics are mathematical consequences of the stochastic dynamics. In this kinetic approach to entropy, the Second Law is quantified with a positive entropy production rate. We argue that the concept of entropy is really a mathematical one which arises from any stochastic dynamics. Two examples from molecular biophysics, the efficiency of a motor protein ATPase and the substrate specificity of a phosphorylation–dephosphorylation cycle, are discussed.

Department of Applied Mathematics, University of Washington, Seattle, Washington, USA
Methods in Enzymology, Volume 467, ISSN 0076-6879, DOI: 10.1016/S0076-6879(09)67005-1
1. Introduction

One of the most difficult concepts in molecular biophysics and enzymology is entropy. One becomes comfortable with it only after spending a great deal of time studying equilibrium thermodynamics and statistical physics. Entropy is associated with the change of Gibbs free energy with respect to temperature when the pressure is held constant: S = −(∂G/∂T)_P. But what is Gibbs free energy? On the other hand, most people are comfortable with the concept of energy, another key term in molecular biophysics and enzymology. One talks about it and one feels about it, even in everyday life. But if we trace a little history of the term, we discover that energy was as elusive a term as entropy at its inception, when Gottfried Leibniz, the seventeenth-century German mathematician and philosopher (Antognazza, 2008; Bardi, 2007), first introduced it. More importantly, the concept of energy was really made concrete through mathematical constructions! It was via Newton's second law of motion that energy conservation was first clearly demonstrated (Goldstein, 1950). In many different areas of science, there are differential equations other than Newton's second law of motion that characterize the dynamics of systems that have nothing to do with energy per se; one example is the epidemiological dynamics of viral infection. The mathematical concept of ''energy,'' it turns out, is equally valid in these dynamical systems. It is now widely understood that a large class of differential equations has the ''Hamiltonian structure.'' One example is a mathematical model for the interacting dynamics between populations of a predator and a prey: there is a conserved quantity even though both the predator and prey populations oscillate (Murray, 2003). The mathematical concept of energy also played an important role in John Hopfield's theory of neural networks (Hopfield, 1982). The purpose of this chapter is to show, via some simple mathematical manipulations, that the concept of entropy arises from any stochastic dynamical system. The term ''stochastic'' is the key here. In thermal physics, the stochasticity comes from molecular collisions at finite temperature. But in economics, it comes from the millions of individuals trading in the market; and in evolution, it comes from random mutations in the genome. These are completely different physical and biological phenomena. However, when their dynamics can be cast in terms of stochastic equations, there will be an ''entropy'' and even a ''Second Law'' which might have nothing to do with thermal physics. They are purely mathematical constructions. At the present time, in other areas of biology, such as evolutionary dynamics and neural coding, information theory has been applied as a useful tool. Central to the mathematical theory of communication, originally developed by Claude Shannon, is a quantity called Shannon entropy (Shannon and Weaver, 1949). Shannon's information entropy has an identical mathematical
expression as that given by Gibbs, except for a trivial unit in terms of kBT, where kB is Boltzmann's constant and T is the temperature in Kelvin. The situation has been very confusing. Whether the Gibbs entropy and Shannon entropy are related or not has become a highly controversial subject in ''entropy research.'' Adding to the confusion are many other theories and thoughts based on entropy, for example, the entropy theory of social values (Chen, 2005). But there is indeed a common thread running through all the abovementioned systems: They are all stochastic with fluctuations; all should and could be understood in terms of appropriate stochastic dynamical models called Markov processes (Taylor and Karlin, 1998). After having decided on the title Entropy Demystified for the present chapter, a search of the literature led me to Arieh Ben-Naim's wonderful book with a similar title. The subtitle of his book is The Second Law Reduced to Plain Common Sense. In a way, my objective is very different: the analysis presented in the present chapter will take the concept of entropy out of the realm of thermal physics. I claim that it is a much more general concept associated with a wide class of dynamical systems with stochastic motion, be it the stock market or evolutionary dynamics, communications or atomic physics. It just happened that the term first arrived in atomic physics in connection with thermodynamics. Another book deserves a special mention. Michael Mackey has written a book on Time's Arrow: The Origins of Thermodynamic Behaviour (Mackey, 2003). His thesis shares very much the spirit of the present chapter. However, it turns out that the detailed mathematical structures of his and ours are quite different. Interestingly, when applied to isothermal closed molecular systems, the two structures merge. How to reconcile these two mathematical structures in an abstract mathematical theory remains an open question. Perhaps the most nontrivial aspect of thermodynamic entropy is that it can do work (Dill and Bromberg, 2003). So the classical mechanical energy is not conserved after all! It is this realization that led to the new physics of matter in the nineteenth century (Shachtman, 2000).
2. Energy

2.1. Equilibrium and nonequilibrium steady state

Let us first consider the simple conformational transition between two states of an enzyme:

$$A \underset{b}{\overset{a}{\rightleftharpoons}} B. \tag{5.1}$$
The molecule is left alone in a test tube at a constant temperature and pressure. Then one has in equilibrium

$$\frac{c_A^{eq}}{c_B^{eq}} = \frac{b}{a} = \frac{e^{-G_A^o}}{e^{-G_B^o}}, \tag{5.2}$$

where $c_A^{eq}$ and $c_B^{eq}$ are the equilibrium concentrations of A and B. The standard-state Gibbs free energies $G_A^o$ and $G_B^o$, as well as all the energies below, are measured in units of $k_BT$; Gibbs entropy will be in units of $k_B$. Therefore,

$$\Delta G^o = G_B^o - G_A^o = \ln\frac{b}{a}. \tag{5.3}$$

The ΔG° between states A and B is a property of the molecule (in an appropriate solvent). It is more or less determined by the molecular structure. The most important aspect of an equilibrium state of a biochemical reaction system is that it does not, on average, absorb or dissipate energy (the First Law), and it does not, on average, convert energy from one form to another (the Second Law). But many biochemical reactions in a living cell are not at their equilibria. Rather, the concentrations of A and B are sustained at relatively constant levels in homeostasis. This is because the reaction in Eq. (5.1) is not in isolation in a cell; the production of a protein by biosynthesis and its degradation are regulated and balanced. So in this case, one has a constant, steady-state flux in the reaction

$$J^{ss} = a c_A - b c_B, \tag{5.4}$$

in which $J_+^{ss} = a c_A$ is the amount of A → B per unit time, and $J_-^{ss} = b c_B$ is the amount of B → A per unit time. In energetic terms, from elementary physical chemistry, we have the Gibbs free energy of A at concentration $c_A$, and of B at concentration $c_B$:

$$G_A = G_A^o + \ln c_A, \qquad G_B = G_B^o + \ln c_B. \tag{5.5}$$
We see that when A and B are at their equilibrium, that is, $\ln(c_A^{eq}/c_B^{eq}) = \Delta G^o$,

$$\Delta G = G_B - G_A = \Delta G^o + \ln\frac{c_B^{eq}}{c_A^{eq}} = 0. \tag{5.6}$$

This is precisely the meaning of chemical equilibrium, according to Gibbs: the left-hand side and right-hand side (rhs) of a reaction have equal Gibbs free energy. However, when A and B are not at their chemical equilibrium, ln(cA/cB) ≠ ΔG°, then ΔG ≠ 0. Furthermore, if ΔG > 0, then

$$G_B^o + \ln c_B - G_A^o - \ln c_A > 0 \;\Rightarrow\; \frac{e^{G_B^o} c_B}{e^{G_A^o} c_A} > 1.$$

Using the relation in Eq. (5.3), we have

$$\frac{b\,c_B}{a\,c_A} > 1 \;\Rightarrow\; J^{ss} < 0.$$

Similarly, if ΔG < 0, then J^ss > 0. One should think of this result as ''electrical current runs from high voltage to low voltage'': the net flux runs from the state with the higher free energy to the state with the lower free energy. In fact, the product −J^ss ΔG is precisely the heat dissipated from the chemical reaction sustained in a nonequilibrium steady state (NESS). We note that even though ΔG can be either positive or negative, this product is always nonnegative. The inequality −J^ss ΔG ≥ 0 is in fact the Second Law of Thermodynamics according to Lord Kelvin: one cannot convert 100% of heat into work in an isothermal reaction, and the heat dissipation is zero if and only if the reaction is in equilibrium, J^ss = ΔG = 0. This is also known as the De Donder inequality (De Donder and Rysselberghe, 1936).
2.2. Cycle kinetics, thermodynamic box and detailed balance

In many cell biology applications, when one takes a more careful look at the conformational transitions between states A and B of an enzyme, one will find that the transition is accompanied by turnover of some cofactors (regulators, ligands, substrates). Furthermore, from a biological functional perspective, the A state of the enzyme is itself inactive while the B state is active. Let us now continue the above analysis by considering this more complex, but clearly realistic, scenario in terms of the following kinetic scheme:

$$A + S \underset{k_{-1}}{\overset{k_1^o}{\rightleftharpoons}} AS, \qquad AS \underset{k_2^o}{\overset{k_2}{\rightleftharpoons}} B + P_1, \qquad B \underset{k_3^o}{\overset{k_3}{\rightleftharpoons}} A + P_2. \tag{5.7}$$

The rate constants $k_i^o$ (i = 1, 2, 3) are second order, while $k_{-1}$, $k_2$, and $k_3$ are first order. We note that if we combine the three reactions in Eq. (5.7), we have the net reaction

$$S \rightleftharpoons P_1 + P_2. \tag{5.8}$$
Let us consider the case in which the concentrations of S, P₁, and P₂ are all kept constant in the reaction system, say at $c_S$, $c_{P_1}$, and $c_{P_2}$. Then the NESS has

$$\Delta G = G_{P_1} + G_{P_2} - G_S = -\ln K_{eq} + \ln\frac{c_{P_1} c_{P_2}}{c_S}, \tag{5.9}$$

where the equilibrium constant for the reaction in Eq. (5.8) is $K_{eq} = k_1^o k_2 k_3 / (k_{-1} k_2^o k_3^o)$. This last equality is widely known in chemical kinetics as the thermodynamic box: $k_1^o/k_{-1}$, $k_2/k_2^o$, and $k_3/k_3^o$ are the equilibrium constants for the three reactions in Eq. (5.7), respectively, and the equilibrium constant for the overall reaction is the product of those of the individual reaction steps. Substituting this relation into Eq. (5.9), we have

$$\Delta G = \ln\frac{k_{-1}\,(k_2^o c_{P_1})\,(k_3^o c_{P_2})}{(k_1^o c_S)\,k_2\,k_3}. \tag{5.10}$$
ΔG = 0 if and only if the concentrations of S, P₁, and P₂ are at their chemical equilibrium: $c_{P_1}^{eq} c_{P_2}^{eq}/c_S^{eq} = K_{eq}$. We now introduce a very important concept: the pseudo-first-order rate constant. Since the concentrations of S, P₁, and P₂ are kept constant in the reaction system in Eq. (5.7), and since one is mainly interested in the conformational transitions of the enzyme among the three states A, AS, and B, we can simplify the kinetics in Eq. (5.7) into pseudo-unimolecular cycle kinetics:

$$A \underset{k_{-1}}{\overset{k_1}{\rightleftharpoons}} AS, \qquad AS \underset{k_{-2}}{\overset{k_2}{\rightleftharpoons}} B, \qquad B \underset{k_{-3}}{\overset{k_3}{\rightleftharpoons}} A, \tag{5.11}$$

in which $k_1 = k_1^o c_S$, $k_{-2} = k_2^o c_{P_1}$, and $k_{-3} = k_3^o c_{P_2}$ are pseudo-first-order rate constants. The cyclic reaction system, in terms of the first-order and pseudo-first-order rate constants $k_{\pm i}$ (i = 1, 2, 3), has the following important properties. (1) When the rate constants satisfy

$$\frac{k_1 k_2 k_3}{k_{-1} k_{-2} k_{-3}} = 1, \tag{5.12}$$
all the cofactors are in chemical equilibrium, and vice versa. The enzyme kinetics in Eq. (5.11) then eventually reaches an equilibrium, in which

$$k_1 c_A^{eq} = k_{-1} c_{AS}^{eq}, \quad k_2 c_{AS}^{eq} = k_{-2} c_B^{eq}, \quad k_3 c_B^{eq} = k_{-3} c_A^{eq}. \tag{5.13}$$

The relation in Eq. (5.13) is known as detailed balance. The relation in Eq. (5.12) is called the Wegscheider, or Kolmogorov, cycle condition (Beard and Qian, 2008). The equations in Eq. (5.13) also indicate that the net fluxes in the reactions are zero. (2) When the left-hand side of Eq. (5.12), denoted by γ, is greater than 1,

$$\gamma \equiv \frac{k_1 k_2 k_3}{k_{-1} k_{-2} k_{-3}} > 1, \tag{5.14}$$

the cofactor concentrations are sustained at a nonequilibrium level. The free energy difference in Eq. (5.10) is in fact

$$\Delta G = -\ln\gamma. \tag{5.15}$$
117
Thermodynamics of Stochastically Fluctuating Systems
To the cycle kinetics of enzyme in Eq. (5.11) the ln g is the driving force which keeps the reaction out of equilibrium. In the steady state, there will be a nonzero cycle flux going through A ! AS ! B ! A: ss ss J ss ¼k1 cAss k1 cAS ¼k2 cAS k2 cBss ¼k3 cBss k3 cAss
¼
ðk1 k2 k3 k1 k2 k3 ÞcT ; k2 k3 þk3 k1 þk1 k2 þk3 k1 þk1 k2 þk2 k3 þk1 k2 þk2 k3 þk3 k1 ð5:16Þ
in which cT ¼ cA þ cAS þ cB are the total concentration of the enzyme. We again note that when DG ¼ ln g ¼ 0, Jss ¼ 0, and vice versa. Equation (5.16) indicates that for a biochemical reaction system under a chemical driving force reaches a NESS. Detailed balance is not preserved in such ‘‘open chemical systems’’ (Qian, 2007). If one substitutes all the second-order rate constants in Eq. (5.7) back to the Eq. (5.16), one will recover the well-known Michaelis–Menten– Briggs–Haldane equation for reversible, ordered-uni-bi enzyme kinetics (Segel, 1975): J ¼
½S f r ½P1 ½P2 Vmax Km Vmax Km
ss
P
S
1þ
½S KmS
þ
½S½P1 Km0 S
þ
½P1 KmP1
þ
½P2 KmP2
2 þ ½PK1 ½P m
;
ð5:17Þ
P
where k2 k3 cT ðk1 þ k2 Þk3 r ; ; Vmax ¼ k1 cT ; KmS ¼ o k2 þ k3 k1 ðk2 þ k3 Þ ðk1 þ k2 Þk3 ðk1 þ k2 Þk3 0 KmS ¼ ; KmP1 ¼ ; o o k1 k2 k1 ko2 k3 ðk1 þ k2 Þk3 KmP2 ¼ o ; KmP ¼ : k3 ko2 ko3
f Vmax ¼
3. Entropy and ‘‘Thermo’’-dynamics of Markov Processes We are now in the position to make a conceptual leap, an mathematical abstraction. In the theory of probability, Markov processes are widely used to model stochastic dynamics, just as differential equations are widely used to model deterministic dynamics. A discrete state Markov model is in generally depicted exactly like that in Eq. (5.11), which has three states. A four-state Markov process in general has 12 possible transitions:
118
Hong Qian
q12
A Ð B; q21
q23
B Ð C; q32
q34
q13
C Ð D;
A Ð C;
q43
q31
q24
B Ð D: q42
ð5:18Þ
Of course, some of the transitions could have rate constants being zero, effectively nonexistent. For a stochastic dynamical system, one no longer asks in which state the system is at time t, rather one asks what the probability is the system in state i at time t, pi(t). Then the pi(t) satisfies a set of equations known as the master equation: dpj ðtÞ X ¼ ðpj qji þ pi qij Þ: dt i
ð5:19Þ
where i can be A, B, C, D, or 1, 2, 3, 4. The conceptual leap is this: While the states in Eq. (5.18) can be the conformational states of a single enzyme, they could also be the different states of a cell in which gene expressions are now known to be stochastic, or different state of a genome in evolution in which randomness comes from genetic random mating and mutation. In the latter two cases, there is no connection to Gibbs free energy, nor molecular thermodynamics, per se. Nevertheless, as we shall show, the concepts of energy, and entropy, are abstract mathematical concepts intimately associated with any Markov process.
3.1. Entropy and entropy balance equation The mathematics in this section is not difficult, but the steps are rather abstract. We shall ‘‘define’’ the entropy S associated with a Markov process, and then find out how entropy changes with time if the system’s probability distribution follows Eq. (5.19). We consider the master equation in Eq. (5.19), with 1 i, j N. The system has N states. Let X S½fpj ðtÞg ¼ pj ðtÞlnpj ðtÞ: ð5:20Þ j
Then we have the time derivative of S according to the chain rule: dS X dpj ln pj ¼ dt dt j X ¼ ln pj ðpj qji pi qij Þ i;j6¼i
1X ¼ ½ln pj ðpj qji pi qij Þ þ ln pi ðpi qij pj qji Þ 2 i;j6¼i ! ! X pi qij qij 1X ¼ : ðpi qij pj qji Þln ðpi qij pj qji Þln pj qji qji 2 i;j6¼i i>j
ð5:21Þ
Thermodynamics of Stochastically Fluctuating Systems
We shall now name a few things: pi qij 1X epr ¼ ðpi qij pj qji Þln pj qji 2 i;j6¼i
119
ð5:22Þ
is called entropy production rate; and qij edr ¼ ðpi qij pj qji Þln qji i>j X
ð5:23Þ
is called free energy dissipation rate. Then Eq. (5.21) becomes dS ¼ epr edr: ð5:24Þ dt We call this the entropy balance equation. We would like to provide a more precise meaning for the term edr: For an isothermal chemical reaction system with constant chemical energy input, the edr is the free energy dissipation which include enthalpy and entropy parts; it contains the work done by the system against its environment, such as by a molecular motor.1
3.2. ‘‘Equilibrium’’ and time reversibility The reason we use quotation marks for ‘‘equilibrium’’ and ‘‘thermo’’dynamics is because these concepts only existed in molecular physics in the past. But now we consider them as abstract mathematical concepts associated with any stochastic dynamics in terms of Markov processes. For any Markovian stochastic dynamics, there is an entropy of the system. The entropy production is always positive. This is easy to verify from Eq. (5.22). Note that each term in the summation is the product J DG for the transitions (forward and reverse reactions) between states i and j. A part of the entropy of the system is dissipated as ‘‘heat.’’ When a stochastic dynamics reaches its stationary state, the S no longer change with time and dS/dt ¼ 0. Hence epr ¼ edr. Let us denote the stationary probability by pssi . If a system satisfies pssi qij ¼ pssj qji
ð5:25Þ
in its steady state, then it has zero entropy production, as well as zero free energy dissipation. The steady state is in fact an equilibrium. In this case, the stochastic, stationary dynamics is said to be detail balanced. In fact, one can 1
In fact, one also has the change of enthalpy and free energy of an isothermal, driven chemical system (Qian and Beard, 2005) dH/dt ¼ ceir edr and dG/dt ¼ ceir epr, where ceir is the chemical energy input rate. Therefore, in an NESS, the edr ¼ ceir means energy conservation and ecir ¼ epr is the Clausius equality for isothermal processes. For abstract stochastic dynamics, however, one usually can not define the enthalpy H.
120
Hong Qian
show that the stochastic dynamics has identical statistics with respect to time reversal. What kind of stochastic systems will reach an equilibrium? Motivated by the discussions on enzyme reactions above, we have the following result: If each and every state in the system has a ‘‘standard state free energy,’’ say Eio for state i, and the transition rate constants qij o o ¼ eEj þEi ; qji
ð5:26Þ
then the stationary probability distribution pssi is proportional to eEi . Furthermore, we can introduce another quantity, called ‘‘free energy’’ of the whole stochastic system: X F½fpj ðtÞg ¼ pj ðtÞEjo SðtÞ: ð5:27Þ o
j
Then, the Eq. (5.24) can be rewritten as epr ¼ edr
dS dt
! X qij dS ¼ ðpi qij pj qji Þln qji dt i>j X dS ðpi qij pj qji ÞðEjo Eio Þ ¼ dt i>j X dS ðpi qij pj qji ÞEio ¼ dt i;j6¼i X dpi dS Eio ; ¼ dt dt i
ð5:28Þ
that is dF ¼ epr 0: dt
ð5:29Þ
Equation (5.29) is well known in isothermal statistical physics. But more importantly to our discussion, it is a mathematical result for any Markov dynamics with standard state energy. This class of Markov processes is called reversible, or symmetric, Markov process.
Thermodynamics of Stochastically Fluctuating Systems
121
3.3. ‘‘Free energy’’ and relative entropy The inequality in Eq. (5.29) is a very important property of any reversible, symmetric Markov process. It corresponds to a nondriven molecular system. We now generalize this result to any Markov system (Mackey, 2003). Combining Eqs. (5.25) and (5.26), we have Eio ¼ lnpssi þ C where C is a simple constant independent of i. Therefore, one can rewrite the F in Eq. (5.27) ! X pj ðtÞ F½fpj ðtÞg ¼ : ð5:30Þ pj ðtÞln pssj j The constant C is set to zero with no consequences. The quantity in Eq. (5.30) is called relative entropy. For most Markov models (technically called irreducible Markov processes), irrespective of whether symmetric (reversible) or not, the stationary probability distribution fpssj g is unique. Then we have for a general Markov dynamics: ! pj dF X dpj ln ss ¼ dt pj dt j ! X pj ¼ ðpi qij pj qji Þln ss pj i; j " ! !# X pj pi ¼ pi qij ln ss pi qij ln ss pj pi i; j ! X pj pss ¼ pi qij ln ss i ð5:31Þ pj pi i; j ! X pj pss pi qij ss i 1 pj pi i; j ! X pj pss qij i pi qij ¼ pssj i; j ! X pj pssj qji ¼ pi qij pssj i; j ¼ 0: In the above derivation, we have used the inequality ln x x 1. Furthermore, we have
122
Hong Qian
F½fpj ðtÞg ¼
X
pj ðtÞ pj ðtÞln ss pj ðtÞ
!
! X pssj ðtÞ pj ðtÞln ¼ pj ðtÞ j j
X
pssj ðtÞ
!
ð5:32Þ
1 pj ðtÞ pj ðtÞ j X X pssj ðtÞ þ pj ðtÞ ¼ 0:
j
j
Therefore, the generalization of free energy F, called relatively entropy, is a nonnegative quantity, and is never increasing in a Markovian stochastic dynamics. It continuous decreases until it reaches zero, when the stochastic system reaches its steady state. The entropy balance equation (5.24) and the monotonic decreasing relative entropy, Eq. (5.31), are two important equations for the ‘‘thermo’’-dynamics of any stochastically fluctuating systems modeled in terms of general Markov processes. This ‘‘thermo’’-dynamic structure is not unique to molecular systems and biological macromolecules, but also to other biological systems such as stochastic cell dynamics, infectious disease dynamics, Darwinian evolutionary process, and maybe even economics. It is tempting to suggest that if one can find the entropy analogues in evolutionary process (Ao, 2008) and economic theory (Chen, 2005), it would be a significant progress in the respective field.2 We shall now return to enzyme systems and give two examples of how the above theory is applied to modeling enzyme kinetics and functions.
4. A Three-State Two-Cycle Motor Protein A motor protein is an enzyme, usually an ATPase, that converts chemical energy from ATP hydrolysis to mechanical movement. In fact, it can move against an external load (Howard, 2001; Qian, 2005, 2008). Figure 5.1A shows a very simple kinetic scheme of a motor protein. For simplicity, we assume that the motor can move a step on its track, from n to n þ 1, when it hydrolyzes the bound ATP. However, we do not assume 2
Based on the present discussion of entropy, a hypothesis on market value of a commodity could be such: market value ¼ intrinsic value þ speculative value, where the speculative value ¼ ‘‘temperature’’ entropy. The ‘‘temperature’’ in a market theory measures the randomness of the speculative market (Marx, 1992).
123
Thermodynamics of Stochastically Fluctuating Systems
A
B
ATP k1
M
ADP, Pi
k3
k–1
k–3
A
M.ATP
k–2
k2
M.ADP.Pi
n q3
q1 q–1 q–3 q–2
B
C
q2
n+1
Figure 5.1 (A) The biochemical kinetic scheme of a simple motor protein model. The step M.ATP ! M.ADP.Pi is loosely coupled to the motor translocation on its track moving a step forward. When it ‘‘slips,’’ the ATP hydrolysis is ‘‘futile.’’ We assume the probability of slippage to be 1/(1 þ s). The biochemical kinetic scheme in (A) can be mapped into the Markov model in (B) using the pseudo-rate constants: q1 ¼ k1[ATP], q2 ðf Þ ¼ ko2 ð1 þ seð1rÞf d Þ, q2 ðf Þ ¼ ko2 ð1 þ serf d Þ, q 3 ¼ k 3[ADP][Pi]. The parameter r, known as a slitting parameter, represents the position of the transition state (Hill, 1981; Qian, 2006).
that the stepping necessarily occurs. There is a finite rate, with rate constant ko2 , of hydrolysis without the motor translocation. Therefore, with the presence of a resistant force f, then the rate ð5:33Þ k2 ¼ ko2 1 þ seð1rÞf d ; where d is the step size of the motor, r characterizes the location of the transition state (Hill, 1981; Qian, 2006), and s is the ratio of the mechanical to futile steps. Such a motor has its chemical and chemomechanical cycles not being tightly coupled. When there is no external load, f ¼ 0, and the ATP, ADP, and Pi are at chemical equilibrium, one has ½ATP eq k1 ko2 k3 ¼ KATP ¼ : ð5:34Þ k1 ko2 k3 ½ADP½Pi In a living cell, however, the [ATP]/[ADP][Pi] is very large, on the order of 1010 times the equilibrium ratio: g ¼ eDGATP ¼
½ATP q1 q2 ð0Þq3 ¼ : KATP ½ADP½Pi q1 q2 ð0Þq3
ð5:35Þ
The steady-state biochemical flux of the motor enzyme is, thus J ss ¼
q1 q2 q3 q1 q2 q3 : q2 q3 þq3 q1 þq1 q2 þq3 q1 þq1 q2 þq2 q3 þq1 q2 þq2 q3 þq3 q1 ð5:36Þ
124
Hong Qian
The Jss consists of two parts: A chemomechanical cycle A Ð B Ð C Ð A with rate constants sk02 eð1rÞf d and sko2 erf d for the second step, which is coupled to motor stepping of a distance d against a force f. This gives the motor velocity: V ss ¼
q1 ko2 seð1rÞf d q3 q1 ko2 serf d q3 d: q2 q3 þq3 q1 þq1 q2 þq3 q1 þq1 q2 þq2 q3 þq1 q2 þq2 q3 þq3 q1 ð5:37Þ
Another is a futile cycle A Ð B Ð C Ð A with rate constants ko2 and ko2 for the second step. Hydrolysis of an ATP does not lead to motor movement. The efficiency of the motor, therefore, is ¼
V ss f fd : ¼ ss ðg1Þerf d J DGATP 1 þ sðge f d 1Þ ln g
ð5:38Þ
The term ðg 1Þerf d sðgef d 1Þ
ð5:39Þ
is the ratio between the flux of the futile cycle and that of the chemomechanical cycle. The futile cycle flux is zero if and only if g ¼ 1, that is, there is no energy in ATP hydrolysis; the chemomechanical cycle flux is zero if f ¼ fmax ¼ (1/d) ln g, when the motor stop moving. fmax is known as the stalling force. Figure 5.2 shows the as a function of resistant force f with various values of the parameters r and s. When the motor moves, the enzyme kinetics is in a NESS. There is a balance equation for the free energy difference of each and every reactions in the system, similar to the Kirchoff’s loop law of electrical circuit (Beard and Qian, 2008): DGAB þ DGBC ðf Þ þ DGCA ¼ DGATP :
ð5:40Þ
The rhs should be thought as a battery. Each time the enzyme transits from state A to B, it dissipates the amount of free energy DGAB. Same can be said for the transition C ! A. For the step B ! C, part of the DGBC( f ) is dissipated, but another part is used to do work against the force f in the amount of f d. In an isothermal biochemical reaction system, the amount of entropy production is simply the amount of chemical energy being used up outside of the system. Now recall the Eqs. (5.22) and (5.23) from the previous section. The rhs of Eq. (5.22) contains the terms in the form of ð J þ JÞlnð J þ =JÞ, where i stands for each and every reaction. For the motor model in Fig. 5.1B, all three
125
Thermodynamics of Stochastically Fluctuating Systems
r = 0.9, s = 100
1
r = 0.75, s = 100 r = 0.75, s = 50
0.8 Efficiency h
r = 0.75, s = 10 r = 0.5, s = 100
0.6 0.4 0.2 0
0
0.25 0.5 0.75 External resistant force f/fmax
1
Figure 5.2 Motor efficiency Z as a function of external load f according to Eq. (5.38). The maximal force fmax ¼ (1/d) ln g. In the calculations, g ¼ 1010 corresponding to the fmaxd ¼ 21. For both small and large f, the efficiency is low. The maximal efficiency increases with r and s.
reactions have the same Jss. Hence, the epr in Eq. (5.22) is simply epr ¼ Jss DGATPf Vss, the first term is the denominator in the Eq. (5.38).
5. Phosphorylation–Dephosphorylation Cycle Kinetics From the previous discussion, we see that an absolutely irreversible chemical reaction is incompatible with thermodynamics: If the reaction A ! B is irreversible, then the free energy difference between A and B must be infinite. This is unrealistic. In kinetic study of biochemical reactions, one often assumes irreversible reactions. This is acceptable if one is only interested in kinetics but not thermodynamics. Almost all the kinetic studies on the phosphorylation–dephosphorylation cycle (PdPC) kinetics in cellular signaling assumes irreversible kinase and phosphatase (as we shall do ourselves in Sections 5.2 and 5.3).
5.1. PdPC signaling switch and phosphorylation energy In this section, we introduce a thermodynamic and kinetic combined analysis of reversible PdPC kinetics:
126
Hong Qian
k1
k2
k1
k2
E þ K Ð EK Ð E∗ þ K;
k3
k4
k3
k4
E∗ þ P Ð E∗ P Ð E þ P;
ð5:41Þ
in which E is a substrate enzyme, K and P are protein kinase and phosphatase, respectively. The rate constant k1 ¼ ko1 ½ATP, k2 ¼ ko2 ½ADP, k4 ¼ ko4 ½Pi. We assume that the cellular concentrations of ATP, ADP, and Pi are constant. It is clear that every cycle of phosphorylation and dephosphorylation of an E turns an amount of chemical free energy into heat. From the traditional physical perspective, there is no work done. Hence the PdPC are widely called futile cycle (Qian and Beard, 2006). However, as has been recently suggested, signal transduction are information processing and delivering processes, and information processing requires free energy dissipation, accompanied with entropy production (Qian, 2007). Following the standard Michaelis–Menten kinetics, but let us assume that the amount of kinase and phosphatase are large in comparison with their respective Michaelis constants, then the kinetics in Eq. (5.41) can be represented by a Markov model q1
q2
q1
q2
E Ð E∗ Ð E;
ð5:42Þ
with ko1 k2 ½ATP½K ; k1 þ k2 k3 k4 ½P q2 ¼ ; k3 þ k4
q1 ¼
ko2 k1 ½ADP½K ; k1 þ k2 ko k3 ½Pi½P ¼ 4 : k3 þ k4
q1 ¼ q2
Same as in Eq. (5.35), we have g¼
q1 q2 ko k2 k3 k4 ½ATP ¼ eDGATP : ¼ o 1 o q1 q2 k4 k3 k2 k1 ½ADP½Pi
ð5:43Þ
Then the fraction of substrate being phosphorylated: f ¼
½E∗ q1 þ q2 yþm ¼ ¼ ; ∗ ½E þ ½E q1 þ q2 þ q1 þ q2 y þ m þ 1 þ y=ðgmÞ
where y¼
ko1 k2 ðk3 þ k4 Þ½ATP ½K ; k3 k4 ðk1 þ k2 Þ ½P
m¼
k3 ko4 ½Pi : k3 k4
ð5:44Þ
127
Thermodynamics of Stochastically Fluctuating Systems
y represents the activation strength, that is, the level of kinase [K] to that of phosphatase [P] as an upstream signal. m represents the basal level of activation in the absence of the kinase. In a living cellular environment, the m is very small but g 1010 is very large. In fact the product gm is very large. Hence we have f y/(1 þ y ), a hyperbolic curve. This is widely expected when one increases the kinase activity to increase the level of phosphorylation. However, Eq. (5.44) also shows that if g ¼ 1, then f ¼ m=ð1 þ mÞ, which is independent of the level of kinase and phosphatase whatsoever. This is precisely what one expects from a test tube experiment of PdPC in a chemical equilibrium: An enzyme is a catalyst that speeds up a biochemical reaction without changing its equilibrium. Figure 5.3 shows the fraction of phosphorylation f, according to Eq. (5.44), as a function of y and other parameters. We note that for large g, that is, high level of phosphorylation potential, the PdPC as a signaling switch behaves as expected. However, if the energy level is low, then the signal activation can be significantly compromised. More interestingly, we note that the level of irreversibility of the dephosphorylation reaction, m, plays a critical role. The level of energy, g, has to be sufficiently greater than the m 1 in order for the biological switch to function properly.
Fraction of phosphorylation f
1
0.8
0.6
0.4
0.2
0 0.001
0.1
10
1000
Activation signal q
Figure 5.3 Fraction of phosphorylation according to the Eq. (5.44) based on simple reversible kinetics. From top to bottom: g ¼ 1010, 107, 106, and 0.5 106, respectively. m ¼ 10 6 for all the four curves.
128
Hong Qian
5.2. PdPC with Michaelis–Menten kinetics We now consider the PdPC in Eq. (5.41) following the Michaelis–Menten kinetics (Murray, 2003; Segel, 1975). We assume both the kinase and the phosphatase catalyzed reactions are irreversible. This problem was first worked out by Goldbeter and Koshland (1981); see Qian (2003) for reversible PdPC with Michaelis–Menten kinetics and the consequence of nonequilibrium thermodynamics. The analyses presented below, thus, should be considered as the limiting situation of g ¼ 1. Even though it is not realistic, it gives the upper bound on the various phenomena we analyze. Treating reversible kinase and phosphatase is not difficult, as shown in Section 5.1. But the algebra are often more involved. With the above assumptions, the concentration change of E* follows a single differential equation d½E∗ V1 ðEt ½E∗ Þ V2 ½E∗ ; ¼ dt K1 þ Et ½E∗ K2 þ ½E∗
ð5:45Þ
in which V1 ¼ k2Kt and V2 ¼ k4Pt are the maximal velocities of the kinase and phosphatase, Et, Kt, and Pt are the total concentrations of the substrate enzyme, the kinase and the phosphatase; K 1 ¼ (k 1 þ k2)/k1 and K2 ¼ (k 3 þ k4)/k3 are the Michaelis constants of the kinase for the E and the phosphatase for the E*. The steady-state fraction of phosphorylation, f ¼ [E*]/Et, satisfies the equation d[E*]/dt ¼ 0. Rearranging some terms, we have V1 ð1 f Þ V2 f ¼ 0; K1 Et þ 1 f K2 Et þ f which yields (Goldbeter and Koshland, 1981): y¼
f ðK1 Et þ 1 f Þ ; ðK2 Et þ f Þð1 f Þ
ð5:46Þ
in which we let y ¼ V1/V2 to represent again the activation strength. If both kinase and phosphatase are operating in their linear, nonsaturating regimes, that is, K1Et and K2 Et 1, then Eq. (5.46) is reduced to K2 K1 y ; ð5:47Þ f ¼ 1 þ KK21 y which is a hyperbolic activation curve as shown in Fig. 5.4D. This is exactly the top curve in Fig. 5.3 with logarithmic abscissa. If, however, both enzymes are highly saturated and operating in the zeroth-order regime,
129
Thermodynamics of Stochastically Fluctuating Systems
B
A
C
K E
E*
E
E*
E P
P
Activation signal
Phosphorylated
F Phosphorylated
E Phosphorylated
D
E*
Activation signal
Activation signal
Figure 5.4 PdPC kinetics, E ! E* ! E catalyzed by kinase K and phosphatase P, can operate in different regimes: (A) Both enzymes are in their linear region if the amount of each enzyme is greater than the total substrate Et ¼ [E] þ [E*], or Et is less than the Michaelis constants of the enzymes. (B) The kinase is operating in the linear regime but the phosphatase is highly saturated, that is, [E*] the Michaelis constant of the phosphatase. The latter reaction then is zeroth order. (C) Both kinase and phosphatase are zeroth order. The corresponding steady-state levels of phosphorylation f ¼ [E*]/Et, as function of the signal, defined as the ratio of the kinase activity to that of phosphatase, are shown in (D), (E), and (F), respectively. (D) Hyperbolic activation curve; (E) Delayed onset of activation; (F) Ultrasensitivity.
then K1Et, K2 Et 1 and Eq. (5.45) becomes df =dt ¼ V1 V2 , where 0 f 1. Therefore, its steady state is simply 0 y < 1; ð5:48Þ f ¼ 1 y > 1; as shown in Fig. 5.4F. One can also verify this result from Eq. (5.46) by solving the quadratic equation for f. This result is known as zeroth-order ultrasensitivity (Goldbeter and Koshland, 1981). What has not been widely discussed is when one of the enzymes is operating in the linear regime while the other in the zeroth-order regime. This gives rise to the delayed onset, shown in Fig. 5.4B and E (Qian and Cooper, 2008). Let us consider that the phosphatase is the one operating in zeroth-order, then K2 Et 1. In this case, we have the Eq. (5.46) becoming
130
Hong Qian
8 <0 f ¼ 1 K1 Et : y1
y K1 Et þ 1; ð5:49Þ
y K1 Et þ 1:
We note that the activation curve in Eq. (5.49), that is, Fig. 5.4E, rises from f ¼ 0 at y ¼ K1Et þ 1 to f ¼ 0.5 at y ¼ 2K1Et þ 1. Hence, smaller the K1Et, sharper the rising. If K1 Et 1, Eq. (5.49) is reduced to Eq. (5.48). In the stochastic analysis of PdPC in Qian and Cooper (2008), the three cases in Fig. 5.4A–C are called independent, semisequential, and sequential, respectively.
5.3. Substrate specificity amplification We now focus on the PdPC with zeroth-order phosphatase and first-order kinase, as shown in Fig. 5.4B and E. Let us consider two substrates with different K1 and K10 , for the proper and improper substrates: K10 > K1. Then because of the delayed onset characteristics, the above result suggests the ratio of the phosphorylation levels that can be very large in the appropriate range of y (Fig. 5.5A). Quantitatively, we have 8 0 > K > > 1 > > K > > > 1 > 0 < ðy 1 K1 Et ÞðK1 Et þ 1 yÞ f
K2 Et yðy 1Þ f0 > > > > > y 1 K1 Et > > > > : y 1 K10 Et
y < K1 E t ; 0
1 þ K1 Et < y < 1 þ K1 Et ; 0
y > K1 Et : ð5:50Þ
It is clear that the ratio 0has a maximum between 1 þ K1Et and 1 þ K10 Et. In fact, if 1 K1 Et K1 Et , the maximum is located very near y ¼ K1Et. We thus have obtained an interesting biochemical result that is not at all intuitive: The specificity in signaling process can be regulated by the magnitude of the Michaelis constant of the phosphatase, which has no direct interaction with the two substrates. This could be a yet to be discovered biological function of ‘‘zeroth-order’’ phosphatase. We also note that the ratio f/f 0 can be much greater than K10 /K1 (Fig. 5.5B). The selectively in a living cell needs not to be limited by the equilibrium affinity. In open chemical systems, the energy from phosphorylation reaction can amplify the specificity (Qian, 2007).
131
Thermodynamics of Stochastically Fluctuating Systems
6. Summary and Challenges 6.1. A little historical reflection When Newton originally proposed the equation of motion, F ¼ ma, and combined with Hook’s law for elasticity force F ¼ k(x x0), there was no concept of energy. However, from analyzing the Newton’s equation of motion: d2 x ¼ kx; dt2 Newton, and Leibniz, discovered that 2 1 dx 1 þ kx2 ¼ constant m 2 dt 2 m
ð5:51Þ
ð5:52Þ
in the motion. The concept of kinetic energy emerged out of the idea of vis viva, which Leibniz defined as the product of the mass of an object and its velocity squared. Energy is a conserved quantity in mechanics. Later on, by introducing the frictional force into the equation of motion and to account for the loss of mechanical energy: m
B
1 0.1 0.01 0.001 0.0001 0.00001
1 10 100 1000 10,000 q : kinase to phosphatase activity
ð5:53Þ
10,000
f/f ⬘ : specificity amplification
f : fraction of activation
A
d2 x dx ¼ kx ; 2 dt dt
1000 100 10 1
0
40 80 120 160 200 q : kinase to phosphatase activity
Figure 5.5 Specificity amplification in PdPC with zeroth-order phosphatase and firstorder kinase. In (A) the K1Et ¼ 10 for the upper curve and K10 Et ¼ 100 for the lower curve following Eq. (5.46), both with K2Et ¼ 0.001. Hence the conventional affinity difference is 10-fold. For small y, there is a linear regime. (B) The ratio of f/f 0 from (A), which is about 10 for small y, can increase up to 2000 for the optimal y 20. Eventually for very large y, f/f 0 1. The filled squares are according to approximated, analytical formula given in Eq. (5.50).
132
Hong Qian
! 2 d 1 dx 1 2 1 dx 2 þ kx ¼ < 0; m dt 2 dt 2 2 dt
ð5:54Þ
Leibniz claimed that heat consisted of the random motion of the constituent parts of matter—a view shared by Newton, although it would be more than a century until this was generally accepted. So, we see that the concept of ‘‘energy,’’ a term now every person walking on the street uses, is a pure mathematical construction. Energy is clearly related to physical reality, but it is also a concept meaningful to any conservative dynamics which can be modeled by differential equations.
6.2. Entropy: A mathematical concept? The situation is not different for the concept of entropy: It is clearly related to the thermal molecular systems, but it is also a concept meaningful to any stochastic dynamics. Entropy is a mathematical concept. It is a quantity intimately associated with random dynamical systems, as we have shown in this chapter for master equations. But in fact it is much more general. For some recent accounts see Gaspard (2004), Jiang et al. (2004), Mackey (2003), and Qian et al. (2002). Just as Hamiltonian to Hamiltonian systems, and energy to conservative systems (Strogatz, 2001), when applying the stochastic dynamics theory to molecular systems at constant temperature, it is the Gibbs entropy. When applied to communication system, it is Shanon’s entropy. But it could also be applied to evolutionary theory (Ao, 2008) and to economical dynamics (Chen, 2005). At the end of the excellent text (Ben-Naim, 2007), Professor Arieh BenNaim discussed the nature of the Second Law and whether it can be derived. On this, I shall respectfully disagree with the author: One indeed can derive the Second Law of Thermodynamics, provided that one believes that the molecular physics can be represented by a mathematical theory of Markov processes. Along this line, Gibbs has already started the endeavor a century ago. One simply needs to carry on his tradition by developing a time-dependent ensemble theory. This might be indeed the theory of complexity one is looking for (Laughlin et al., 2000; Mitchell, 2009).
REFERENCES Antognazza, M. R. (2008). Leibniz: An Intellectual Biography. Cambridge University Press, New York. Ao, P. (2008). Emerging of stochastic dynamical equalities and steady state thermodynamics from Darwinian dynamics. Commun. Theor. Phys. 49, 1073–1090.
Thermodynamics of Stochastically Fluctuating Systems
133
Bardi, J. S. (2007). The Calculus Wars: Newton, Leibniz, and the Greatest Mathematical Clash of All Time. Basic Books, New York. Beard, D. A., and Qian, H. (2008). Chemical Biophysics: Quantitative Analysis of Cellular Systems. Cambridge Texts in Biomedical Engineering, Cambridge University Press, Cambridge. Ben-Naim, A. (2007). Entropy Demystified: The Second Law Reduced to Plain Common Sence. Workd Scientific, New Jersy. Chen, J. (2005). The Physical Foundation of Economics: An Analytical Thermodynamic Theory. World Scientific, Singapore. De Donder, T., and Rysselberghe, P. (1936). Theory of Affinity. Stanford University Press, Palo Alto, CA. Dill, K. A., and Bromberg, S. (2003). Molecular Driving Forces: Statistical Thermodynamics in Chemistry and Biology. Garland Science, New York. Gaspard, P. (2004). Fluctuation theorem for nonequilibrium reactions. J. Chem. Phys. 120, 8898–8905. Goldbeter, A., and Koshland, D. E. (1981). An amplified sensitivity arising from covalent modification in biological systems. Proc. Natl. Acad. Sci. USA 78, 6840–6844. Goldstein, H. (1950). Classical Mechanics. Addison-Wesley, Reading, MA. Hill, T. L. (1981). Proc. Natl. Acad. Sci. USA 78, 5613–5617. Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA 79, 2554–2558. Howard, J. (2001). Mechanics of Motor Proteins and the Cytoskeleton. Sinauer Associates, Sunderland, MA. Jiang, D.-Q, Qian, M., and Qian, M.-P. (2004). Mathematical Theory of Nonequilibrium Steady States: On the Frontier of Probability and Dynamical Systems. Springer Lecture Notes in Mathematics Vol. 1833. Laughlin, R. B., Pines, D., Schmalian, J., Stojkovi, B. P., and Wolynes, P. G. (2000). The middle way. Proc. Natl. Acad. Sci. USA 97, 32–37. Mackey, M. C. (2003). Time’s Arrow: The Origins of Thermodynamic Behaviour. Dover Publisher, New York. Marx, K. (1992). Capital I: A Critique of Political Economy. Penguin, New York. Mitchell, M. (2009). Complexity: A Guided Tour. Oxford University Press, New York. Murray, J. D. (2003). Mathematical Biology I: An Introduction. 3rd ed. Springer, New York. Qian, H. (2003). Thermodynamic and kinetic analysis of sensitivity amplification in biological signal transduction. Biophys. Chem. 105, 585–593. Qian, H. (2005). Cycle kinetics, steady-state thermodynamics and motors—A paradigm for living matter physics. J. Phys. Cond. Matt. 17, S3783–S3794. Qian, H. (2006). J. Phys. Chem. B. 110, 15063–15074. Qian, H. (2007). Phosphorylation energy hypothesis: open chemical systems and their biological functions. Ann. Rev. Phys. Chem. 58, 113–142. Qian, H. (2008). Viscoelasticity of living materials: Mechanics and chemistry of muscle as an active macromolecular system. Mol. Cell. Biomech. 5, 107–117. Qian, H., and Beard, D. A. (2005). Thermodynamics of stoichiometric biochemical networks in living systems far from equilibrium. Biophys. Chem. 115, 213–220. Qian, H., and Beard, D. A. (2006). Metabolic futile cycles and their functions: A systems analysis of energy and control. IEE Proc. Sys. Biol. 153, 192–200. Qian, H., and Cooper, J. A. (2008). Temporal cooperativity and sensitivity amplification in biological signal transduction. Biochemistry 47, 2211–2220. Qian, H., Qian, M., and Tang, X. (2002). Thermodynamics of the general diffusion process: Time-reversibility and entropy production. J. Stat. Phys. 107, 1129–1141. Segel, I. H. (1975). Enzyme Kinetics. Wiley, New York.
134
Hong Qian
Shachtman, T. (2000). Absolute Zero and the Conquest of Cold. Mariner Books, Boston, MA. Shannon, C., and Weaver, W. (1949). The Mathematical Theory of Communication. University of Illinois Press, Illinois. Strogatz, S. H. (2001). Nonlinear Dynamics and Chaos. Perseus Books Group, New York. Taylor, H. M., and Karlin, S. (1998). An Introduction to Stochastic Modeling. 3rd Ed. Academic Press, San Diego.
C H A P T E R
S I X
Effect of Kinetics on Sedimentation Velocity Profiles and the Role of Intermediates John J. Correia,* P. Holland Alday,*,1 Peter Sherwood,† and Walter F. Stafford† Contents 136 138 141 151 158 159 159
1. Introduction 2. Methods 3. ABCD Systems 4. Monomer–Tetramer Model 5. Summary Acknowledgments References
Abstract We have previously presented a tutorial on direct boundary fitting of sedimentation velocity data for kinetically mediated monomer–dimer systems [Correia and Stafford, 2009]. We emphasized the ability of Sedanal to fit for the koff values and measure their uncertainty at the 95% confidence interval. We concluded for a monomer–dimer system the range of well-determined koff values is limited to 0.005–10 5 s 1 corresponding to relaxation times of 70 to 33,000 s. More complicated reaction schemes introduce the potential complexity of low concentrations of an intermediate that may also influence the kinetic behavior during sedimentation. This can be seen in a cooperative ABCD system (A þ B ! C; B þ C ! D) where C, the 1:1 complex, is sparsely populated (K1 ¼ 104 M 1, K2 ¼ 108 M 1). Under these conditions a k1,off < 0.01 s 1 produces slow kinetic features. The low concentration of species C contributes to this effect while still allowing the accurate estimation of k1,off (although k2,off can readily compensate and contribute to the kinetics). More complex reactions involving concerted assembly or cooperative ring
* Department of Biochemistry, University of Mississippi Medical Center, Jackson, Mississippi, USA Boston Biomedical Research Institute, Watertown, Massachusetts, USA Current address: University of Utah Health Sciences Center, University Health Care, Salt Lake City, Utah, USA
{ 1
Methods in Enzymology, Volume 467 ISSN 0076-6879, DOI: 10.1016/S0076-6879(09)67006-3
#
2009 Elsevier Inc. All rights reserved.
135
136
John J. Correia et al.
formation with low concentrations of intermediate species also display kinetic effects due to a slow flux of material through the sparsely populated intermediate states. This produces a kinetically limited reaction boundary that produces partial resolution of individual species during sedimentation. Cooperativity of ring formation drives the reaction and thus separation of these two effects, kinetics and energetics, can be challenging. This situation is experimentally exhibited by systems that form large oligomers or rings and may especially contribute to formation of micelles and various protein aggregation diseases including formation of b-amyloid and tau aggregates. Simulations, quantitative parameter estimation by direct boundary fitting and diagnostic features for these systems are presented with an emphasis on the features available in Sedanal to simulate and analyze kinetically mediated systems.
1. Introduction We previously presented a tutorial on the analysis of direct boundary fitting of sedimentation velocity data focusing on a weak kinetically mediated monomer–dimer system (Correia and Stafford, 2009). The emphasis was the effects of slow kinetics on the shape of a reversible reaction boundary and the extraction of K2 and koff values and their confidence intervals. While analyzing a rapidly reversible cooperative ABCD system, it was noticed that the fits, but not the data, exhibited an inflection in the boundary that suggested separation of calculated species into a kinetically mediated boundary. The cause of this turned out to be that the default value of koff for fitting in Sedanal was internally set to 0.01 s 1, a value we assumed implied rapid kinetics. An inquiry into the relationship between koff values and the ABCD reaction mechanism leads us to an investigation of the role of intermediates. There are many large biophysical systems that exhibit unusually slow kinetics. The assembly of microtubules (Desai and Mitchison, 1997), F-actin (Wegner, 1976), and TMV capsid protein (Scheele and Schuster, 1974, 1975) all proceed by an endwise mechanism that is known to impact and limit both the rate of growth and the rate of disassembly. Both tubulin and actin hydrolyze a high-energy phosphate bond (GTP or ATP, respectively) to convert the polymer to a less stable form and thus unlink the kinetics of their disassembly from the initial equilibrium process. Thus, their assembly is a nonequilibrium process (Desai and Mitchison, 1997). Alternatively, TMV capsid protein displays an overshoot behavior (in the absence of RNA), described as metastable polymerization, where helical rod-like polymers form high molecular weight, long length distributions that would relax to an equilibrium length distribution over the course of
Effect of Kinetics on Sedimentation Velocity Profiles
137
days and weeks (Scheele and Schuster, 1974, 1975). The system is strongly pH and temperature dependent as described in a subsequent study and summarized as a time-resolved phase diagram (Schuster et al., 1979). Because of the size distributions and heterogeneity of the system, AUC played a prominent role in these 1970s TMV studies. These systems also display formation of smaller oligomers, often structures that nucleate filament polymer formation. They are often rings and spirals (tubulin and TMV) or short linear or helical polymers (tubulin and actin) that either form cooperatively or exhibit the unstable features of a nucleus. There have been detailed AUC studies on the stability and the hydrodynamic structure of many of these systems (Correia et al., 1985; Frigon and Timasheff, 1975a,b; Shire et al., 1979). The implicit assumptions and often observations have been that while intermediates exist (especially see Frigon and Timasheff, 1975a,b for diagnostic procedures) they play a minor role in the energetics of the reaction with monomers and Nmers predominating. Contrary to these studies there have been few detailed studies into the kinetics of oligomer and Nmer formation. Tai and Kegeles (1984) used rapid mixing techniques to study micelle formation and developed analysis methods (Huang et al., 1984; Kegeles, 1979, 1980, 1984) that explicitly incorporate the flux of material through weakly populated intermediates. Thusius has investigated the relaxation kinetics of glutamate dehydrogenase indefinite self-association and focused on the details of the molecular mechanisms, including the critical role of polymer annealing reactions instead of exclusive addition and dissociation of monomers (Thusius, 1975; Thusius et al., 1975). Most of the early work on simulating and understanding kinetically mediated transport in definite assembly and ligand-mediated assembly systems has come from John Cann and Gerson Kegeles (Cann, 1970, 1978a,b; Cann and Kegeles, 1974; Kegeles, 1979; Kegeles and Cann, 1978; Oberhauser et al., 1965). In contrast, there has been little recent attention paid to the role of kinetics in AUC studies of definite polymer formation (Correia and Stafford, 2009; Dam et al., 2005; Gelinas et al., 2004; Zhao and Beckett, 2008). This has been in part due to the relatively recent availability of AUC fitting software that can deal with direct boundary fitting of the kinetics of any definite reaction mechanism (Brown, 2009; Dam and Schuck, 2005; Dam et al., 2005; Stafford, 2000; Stafford and Sherwood, 2004). Here, we present examples of how to investigate kinetically mediated hetero- and self-association reactions in the analytical ultracentrifuge. Two systems are investigated, a cooperative ABCD system and a concerted monomer–tetramer system. Simulations, quantitative parameter estimation by direct boundary fitting and diagnostic features for these systems are presented with an emphasis on the features available in Sedanal to simulate and analyze kinetically mediated systems.
138
John J. Correia et al.
2. Methods All sedimentation velocity absorbance data simulations and curve fitting were done with Sedanal (Stafford and Sherwood, 2004; version 5.32) at 50,000 rpm and 100 s intervals with 0.005 abs units of Gaussian random noise added to each data set where noted. The meniscus position rm is chosen to be between 5.85 to 5.90 cm consistent with taking proper advantage of a long sedimentation column to improve resolution of species. Velocity data were initially analyzed with DCDTþ 2 (Philo, 2006) or DCDT within Sedanal (Stafford, 1992) to produce g(s*) distributions. All Sedanal fitting models were constructed with the ModelEditor (v1.74). Data were preprocessed to select for meniscus, base and fit regions and stored as binary cell data files with an abr extension. An even number of scans (typically 20) were chosen for direct boundary fitting so that the edge of the plateau in the last few scans was near the base of the fit region. Fitting is done using either the Simplex (Nelder and Mead, 1965) or the Levenberg–Marquart algorithm. Confidence intervals were estimated using F-statistics at the 95% confidence interval using the methods of Bates and Watts (1988), first used in sedimentation analysis by Johnson et al. (1981) and reviewed in Johnson (1992). As described previously (Correia and Stafford, 2009), an F-statistic for a given parameter is constructed by starting at the best fit value and then stepping the value of that parameter along the parameter axis, and while holding that parameter value constant at each value, repeating the fit allowing all the other parameters to float. The fits are repeated at each step of the parameter value until the rms of the fit hits a target rms value corresponding to the target value of F (expressed as the (rms/rmso)2 where rmso is of the best fit of the data). The target value of F is determined by the degrees of freedom and is typically around 1.0231 for this kind of data at the 95% confidence limit, which corresponds to the two standard deviation level. To establish the target value, the total number-of-points less degrees of freedom are plugged into a formula (Algorithm 724, Collected Algorithms from ACM; also see 6.4.10 in the third edition of ‘‘Numerical Recipes’’; code available upon request). Current versions of the software allow you choose to search either the () or the (þ) side of the best fit parameter value to determine the F-statistics. (For complex models the full process per parameter can be exceedingly slow involving up to as many as 32 successive fits, each of which might take many hours on a fast processor. We are currently implementing the use of quad duo servers to spread the multiple fitting tasks around.) For comparison, we also did selective Monte Carlo (MC) analysis (Error estimation control under Advanced parameters) to establish the relative range of confidence intervals both methods provide (see Fig. 6.6).
Effect of Kinetics on Sedimentation Velocity Profiles
139
The ABCD model is based upon an experimental system involving the binding of stathmin-eGFP to tubulin to make a 1:2 stable complex (Alday, 2009; Alday and Correia, 2009). The ABCD model data involves formation of a weak 1:1 AB complex (104 M 1) and a tight 1:2 AB2 complex (1.05 108 M 1) where AþB$C
ðK1 Þ
CþB$D
ðK2 Þ
For comparison, an ABCD model with equivalent sites (i.e., equal intrinsic binding energies, in which the ratio of macroscopic binding constants, K2/K1, is 0.25) was also simulated corresponding to K1 ¼ 2.049 106 M 1 and K2 ¼ 5.1235 105 M 1. These values of K were chosen so that both cases have K1 K2 ¼ 1.05 1012 M 2 and thus have the same midpoint at 1 mM. (The relationship K1 ¼ 4 K2 is derived from statistical factors and reflect two ways for B to bind to A to make C, 2 kon, and two ways for B to dissociate from D, 2 koff.) For ABCD simulations, SA ¼ 2.3 s, SB ¼ 5.4 s, SC ¼ 6.5 s, SD ¼ 7.8 s, MA ¼ 44,945, MB ¼ 100,000, ext278 nm are 1.2 and 0.487 ml mg 1 cm 1 for A and B, respectively, and the density increments are 0.2563 and 0.2557 for A and B, respectively. In other words, we turned the control files (with extension ‘‘abc’’) from experimental fitting files into simulation files.1 The monomer–tetramer systems are based upon an assembly model that uses M1 ¼ 100,000 and S1 ¼ 5.1 s, with S2, S3, and S4 defined by an n2/3 (where n ¼ 1, 2, 3, and 4) rule corresponding to 9.096 s, 10.608 s, and 12.8 s, respectively. All values of ext278 nm are 1.2 ml mg 1 cm 1 and density increments are 0.264. For a monomer–tetramer model the reaction involves an equilibrium constant K4 ¼ 8.53 1016 M 3 corresponding to 4A $ A4 with the absence of intermediates. For a sequential model with the same overall value of K, there are three successive reactions where, K2 ¼ K3 ¼ K4 all with units M 1 and the product K2 K3 K4 ¼ 8.53 1016 M 3.
1
A þ A $ A2
ðK2 Þ
A þ A2 $ A3
ðK3 Þ
A þ A3 $ A4
ðK4 Þ
In Sedanal abc files are text files that save and store all the information used in the control panel window including the model which is constructed in the ModelEditor (currently ver 1.76), the parameter values, user defined equations that interrelate parameters, the data sets and the scans to be used in the fitting including weighting factors, and the method of fitting to be used including numbers of points, convergence criteria and F-statistics. See Sedanal-User-Manual for more details (available on the RASMB software download Web site).
140
John J. Correia et al.
For the concerted tetramer model K2 ¼ K3 ¼ 4.4 103 M 1 and K4 ¼ 4.4 109 M 1 and the product K2 K3 K4 ¼ 8.53 1016 M 3. As described above, this means all the three models have the same midpoint at 2.27 mM. (It is standard practice to interpret a Kn (M(n 1)) in terms of a succession of n 1 interaction interfaces, or taking the (n 1)-th root of Kn (n 1 Kn), the overall Kn, to estimate the midpoint of the reaction and the contribution of each interface to stability (discussed in Van Holde et al., 1998, Table 15.2). Ring closure provides an additional interface and one source of cooperativity.) Kinetic parameters are varied and implemented in Sedanal as described elsewhere in the text. They can be entered into either a global box under Advanced parameters and Kinetics/Equilibrium or individually under Kinetic parameters as kr values (referred to here as koff values) for each Keq value. For slow processes, the maximum number of integration steps allowed should be reset to a large value, at least 10,000,000, to allow slow kinetic systems to come to equilibrium before the start of the simulated run if you are using kinetic equilibration, but primarily for slow kinetic systems using kinetics during the run (see text for further discussion). The use of kinetics to either solve for initial equilibrium conditions or to solve the distributions during the run is also implemented under Advanced parameters and Kinetics/Equilibrium (see screen dump in Fig. 6.1). Choices appear as boxes to be clicked to turn on different features for initial equilibration or during the run. The choices include analytical solutions, Newton–Raphson (which solves the equilibrium distribution based upon the reaction mechanism and the choice of equilibrium constants), and kinetic integrators. There are only a few cases that use analytical solutions, isodesmic, monomer–dimer (2A ¼ A2), and bimolecular complex formation (A þ B ¼ C). It is advisable to always use Newton–Raphson for initial equilibration. One seldom wants to test if the system is at equilibrium at the start of the fitting although it can be useful for Kinetic time course simulations, an option under the Main window (implemented in Fig. 6.10). For kinetic integrator one can choose BulSt (Bulirsch–Stoer; Press et al., 1992) or SEulEx which refers to (Semiimplicit EULer EXtrapolation; Hairer and Wanner, 1996). We find SEulEx runs about 4–6 faster than BulSt for ABCD models but this may also be case dependent and should be tested for each data set. To speed up Fstat analysis, we also recommend using 400 points in combination with SEulEx. For the ABCD model fewer than 400 points gives GRID Checks typically meaning too few points to perform the Claverie. This is often fixed by increasing the number of points and/or varying the grid spacing (Claverie control under Advanced parameters) with more points in the boundary and at the base to handle steep gradients. t
Effect of Kinetics on Sedimentation Velocity Profiles
141
Figure 6.1 A screen dump of the advanced parameters kinetics equilibrium control window where both initial equilibration prior to the run and the equilibration during the simulation are chosen. N–R solves the equilibrium distributions by solving the equilibrium relationships. The other kinetic options use a kinetic integrator to simulate the distribution. If the time of equilibration is not long enough prior to the run (Max integration steps allowed) or the equilibration time during sedimentation is too short kinetic effects will be observed. It is recommended to use N–R prior to the run. The need to use a kinetic integrator during the run is a function of kr and the reaction mechanism as described in the text. If you use N–R during the run, you will necessarily simulate an equilibrium distribution, even if the parameters chosen would generate kinetic effects with a kinetic integrator.
3. ABCD Systems To explore the role of intermediates on kinetic effects in sedimentation studies, we simulated velocity data for two types of ABCD systems; one model that has equivalent binding sites (K2/K1 ¼ 0.25) and the other model that has concerted formation of species D, the 1:2 complex (K2/ K1 ¼ 104). Concentration distributions for all species half way through the simulation (5000 s) are plotted in Fig. 6.2. The main differences between these two cases are that in the equivalent binding model (panel A) all species are highly populated with the 1:1 species being predominant, while in the
142
John J. Correia et al.
A 3 ⫻ 10–6
Conc (M)
K2 /K1 = 0.25 A B C D
2 ⫻ 10–6
1 ⫻ 10–6
0 B 3 ⫻ 10–6 K2 /K1 = 104 A
Conc (M)
2 ⫻ 10–6
B C D
1 ⫻ 10–6
0 5.8
6.0
6.2
6.4 6.6 Radius (cm)
6.8
7.0
7.2
Figure 6.2 Simulation of the c(r) distributions for ABCD systems where (A) the reaction involves equivalent binding sites with K2/K1 ¼ 0.25 and (B) a cooperative reaction in which the final complex formation is concerted and K2/K1 ¼ 104. Simulations are done at a 1:1 ratio with 5 mM A and B. These figures show that the intermediate C is weakly populated in the cooperative case while both C and D are populated in the equivalent case with a preference for C. One should also note the slight gradient in A in the regions of the complexes as required by mass action, that is, there is no constant bath.
concerted model (panel B) species C is weakly populated due to the cooperative formation of species D. It is the impact of this minor species that we wish to explore. Also note the trailing zone of excess species A in both panels due to its low sedimentation coefficient. In these systems, this trailing zone simplifies fitting for both SA and the concentration of A during
Effect of Kinetics on Sedimentation Velocity Profiles
143
the direct boundary analysis. Especially, note the gradient in A (d[A]/dr) in the regions corresponding to both free B and hetero-complexes, 6.3–6.6 cm in this set of scans. This gradient in A arises from the effects of mass action in the boundary. We describe the transport in this region as a reaction boundary where the species distributions, in the absence of kinetic effects, obey the equilibrium relationships (as defined above). To establish reliability of the direct boundary analysis for these systems under rapidly reversible conditions, three noise perturbed data sets were simulated with koff ¼ 0.1 s 1 for each ABCD model where A:B ratios were 1:2, 1:1, and 2:1, and B ¼ 5 mM. These data are plotted as g(s) distributions in Fig. 6.3. Note in the cooperative case (panel B) the reaction is driven to form species D by raising the concentration of A, and thus the reaction boundary approaches the sedimentation coefficient of species D. Alternatively, the equivalent case (panel A) actually displays a drift toward species C reflecting the concentration distributions in the reaction boundary seen in Fig. 6.2A. In both cases, excess A runs as a separate zone that increases in area with increasing total concentration of A. Sedanal fitting of these data reveal best fit values of K1 ¼ 2.067 106 M 1 h1.706,2.501i and K2 ¼ 5.149 105 M 1 h4.833,5.501i with rms ¼ 0.005049 for the equivalent sites data. The best fit values for the concerted model data are K1 ¼ 0.996 104 M 1 h0.660,2.005i and K2 ¼ 1.058 108 M 1 h0.612,1.491i with rms ¼ 0.005012 (also see summary of ABCD concerted models in Table 6.1). All fitted parameters are accurately and precisely determined. (Fits are not shown because the rms values are consistent with random residuals.) In both cases, the range of OD280 nm values are 0.65–0.82, ideal for absorbance optics, while the range of % saturation of B (fraction of component B in a complex) varies from 64.5–94.0% in the equivalent sites case to 70.3–89.6% in the cooperative case. (These numbers were determined with the Equilibrium calculations window under the Main Menu, which calculates the equilibrium concentrations for all species given the K’s and the initial total concentrations of all components.) The main difference in these fits is the 1:1 complex is most populated in the equivalent case (Fig. 6.2A) while the 1:2 complex is exclusively populated in the cooperative case (Fig. 6.2B). These are simulated data. For analysis of experimental data one must determine the S values for all species which in principle is easy for the components A and B, if they are well behaved when run alone. However, the sedimentation coefficient for species C and D present the challenge. For the equivalent sites case species D does not fully populate unless you drive the reaction with more A, 1:4 or 1:8 ratios, while species C is favored by increasing the concentration of B, 4:1 or 8:1 ratios (Stafford, 2000). Alternatively, for the cooperative case, the S value for species D is easily determined by extrapolation (Fig. 6.3B), although of course it depends upon an optical signal, absorbance in this case, and what the midpoint of the reaction is. A weaker
144
John J. Correia et al.
A
0.5 Equivalent model A:B ratio
0.4
1:2 1:1
0.3 g(s)
2:1
0.2
0.1
0.0 B
0.5 Cooperative model A:B ratio
0.4
1:2 1:1
0.3 g(s)
2:1
0.2
0.1
0.0 0
2
4
6
8
10
S
Figure 6.3 Simulation of noise perturbed data for the (A) equivalent and (B) cooperative ABCD cases in Fig. 6.1 at an A:B ratio of 1:2, 1:1, and 2:1, and B ¼ 5 mM. Sedimentation coefficient g(s) distributions are plotted for each data set. Consistent with the K2/K1 ratio and the concentration distributions in Fig. 6.1A, the g(s) distribution for the fast peak in the equivalent model drifts to the sedimentation coefficient of species C. For the cooperative model the fast peak approaches the sedimentation coefficient of species D. These data are simulated and presented to display the differences between an ABCD equivalent and a concerted model. The span of the concentration ranges is sufficient in both cases for Sedanal direct boundary fitting [see text and prior discussion of antibody ABCD systems by Stafford (2000)].
K2 reaction would require a longer extrapolation. However, the S value for species C in the concerted case is indeterminate. It is not populated enough to ever see as a separate zone in the velocity pattern. That is the bad news. The good news is it does not matter for direct boundary fitting because it does not contribute significantly to the boundary optical signal.
Table 6.1 Kinetically mediated concerted ABCD model

Model                        | K1 (M⁻¹)                     | k1,off (s⁻¹)                  | K2 (M⁻¹)                     | k2,off (s⁻¹)                  | rms
k1,off = 0.1, k2,off = 0.1   | 0.996 × 10⁴ ⟨0.660, 2.005⟩   | 0.1*                          | 1.058 × 10⁸ ⟨0.6116, 1.491⟩  | 0.1*                          | 0.005012
k1,off = 0.1, k2,off = 10⁻⁴  | 1.199 × 10⁴ ⟨0.225, 8.228⟩   | 0.1*                          | 0.888 × 10⁸ ⟨0.207, 4.587⟩   | 0.978 × 10⁻⁴ ⟨0.913, 1.401⟩   | 0.004981
                             | 1.006 × 10⁴ ⟨UB, 8.3321⟩     | 0.130 ⟨0.009, UB⟩             | 1.040 × 10⁸ ⟨0.212, UB⟩      | 0.972 × 10⁻⁴ ⟨0.810, 1.465⟩   | 0.004981
                             | 0.745 × 10⁴ ⟨UB, 2.000⟩      | 0.015 ⟨0.006, UB⟩             | 2.070 × 10⁸ ⟨0.867, UB⟩      | 0.1*                          | 0.005737
k1,off = 0.005, k2,off = 0.1 | 0.976 × 10⁴ ⟨0.263, 6.694⟩   | 5.157 × 10⁻³ ⟨2.756, 9.756⟩   | 1.081 × 10⁸ ⟨0.474, 5.584⟩   | 0.1*                          | 0.004960
                             | 0.985 × 10⁴ ⟨0.125, 7.780⟩   | 5.111 × 10⁻³ ⟨2.231, 12.540⟩  | 1.071 × 10⁸ ⟨0.195, 7.758⟩   | 0.0426 ⟨0.017, UB⟩            | 0.004960
                             | 0.053 × 10⁴ ⟨UB, 0.067⟩      | 0.1*                          | 1.926 × 10⁹ ⟨1.749, 2.130⟩   | 0.707 × 10⁻⁴ ⟨0.071, 0.923⟩   | 0.004978

Values in ⟨ , ⟩ are 95% confidence intervals determined by F-statistics. UB means unbounded in that limit. Values marked with an asterisk were held constant during the fit. For each kinetically mediated model the three rows are, in order, the three parameter fit, the four parameter fit, and the inverse fit described in the text.
The simplest way to deal with it is to guess a reasonable value based upon the S values for the other species. (These data are based upon an experimental system involving stathmin and tubulin, and that is exactly what we did: guess 6.5 S, partway between tubulin at 5.1 S and the 2:1 complex at 7.8 S. In practice, small errors in the S value of species D will not have a big effect on K2 either, especially on a ΔG scale, but this is something that an investigator should test during analysis of their data; see Correia and Stafford, 2009.)

To investigate kinetic effects in the concerted model, we simulated velocity data and produced g(s) distributions at a ratio of 1:1, while fixing one koff value at 0.1 s⁻¹ and varying the other koff value from 10⁻¹ to 10⁻⁶ s⁻¹. The results can be seen in Fig. 6.4. Panel A presents fixed k2,off while varying k1,off. Panel B presents fixed k1,off while varying k2,off. Both panels show the impact of kinetics on the transport boundary through the shape of the g(s) distributions. As kinetics slow down, the fast peak shifts and approaches the rate of species D, while a new peak resolves corresponding to free species B that now trails behind the complex zone. The resolution of these species indicates the loss of the reaction boundary and the loss of equilibrium concentrations at each radial position due to the slow kinetics.

The difference between the two panels is the magnitude of koff at which these kinetic effects become evident. For k2,off these effects occur between 10⁻³ and 10⁻⁵ s⁻¹ (see the dark blue line corresponding to 10⁻⁴ s⁻¹ in the free B region of panel B), similar to what is observed for a monomer–dimer system (Correia and Stafford, 2009). For k1,off the boundary shape becomes kinetically mediated between 10⁻² and 10⁻³ s⁻¹ (see the red and green lines in the transition regions of panel A). This difference in the kinetic impact of k1,off versus k2,off apparently is caused by the small concentration of the 1:1 species [C]. For example, at 6.8 cm the concentration of C in the scans in Fig. 6.1 equals 2.1 μM in panel A and 0.021 μM in panel B, thus accounting for the two order of magnitude shift in the effect of koff. This situation creates a bottleneck, a rate-limiting step, where the flux of material from A and B to D and back is limited by k1,off and the low concentration of C governed by the equilibrium relationships (Kegeles, 1979, 1980, 1984; Tai and Kegeles, 1984).

Do systems like this allow you to extract both equilibrium and kinetic constants by direct boundary fitting? To test this, we simulated noise-perturbed concerted ABCD model data sets at 1:2, 1:1, and 2:1 A:B ratios, B = 5 μM. We simulated one set with k1,off = 0.005 s⁻¹ and k2,off = 0.1 s⁻¹ (Fig. 6.5A). We simulated a second set with k1,off = 0.1 s⁻¹ and k2,off = 10⁻⁴ s⁻¹ (Fig. 6.5B). The g(s) distributions are plotted in Fig. 6.5. It is obvious that both data sets exhibit the characteristics of a kinetically mediated reaction boundary, with the fast peak approaching the S value of species D, and any free B trailing behind and running as a separate peak. As you drive the reaction with excess A there is no free B, so this peak disappears at a 2:1 ratio.
Figure 6.4 Simulation of an ABCD concerted system where (A) k2,off is fixed at 0.1 s⁻¹ and k1,off is varied from 10⁻¹ to 10⁻⁶ s⁻¹ and (B) k1,off is fixed at 0.1 s⁻¹ and k2,off is varied from 10⁻¹ to 10⁻⁶ s⁻¹. In an ABCD concerted model k2,off impacts kinetics between 10⁻³ and 10⁻⁵ s⁻¹ (see the dark blue line in the free B region) while k1,off impacts boundary shape between 10⁻² and 10⁻³ s⁻¹ (see the red and green lines in the transition regions). The shift in the kinetic impact of koff for k1 versus k2 is apparently caused by the product of the small concentration of the 1:1 species [C] times the k1,off value. For example, at 6.8 cm the [C] in the scans in Fig. 6.1 equals 2.08 μM in panel 1A and 20.8 nM in panel 1B, thus accounting for the two order of magnitude shift in the koff effect. This is not seen in the equivalent ABCD case because both species C and D are populated (they both display kinetic effects between 10⁻³ and 10⁻⁵ s⁻¹ and thus look like panel 3B; data not shown).
The two sets of patterns are very similar, indicating either k1,off or k2,off can cause similar kinetically mediated reaction boundaries, although at different ranges of values. Can we fit these data to extract the equilibrium and kinetic constants, and how do they cross correlate?
Figure 6.5 Simulation of noise-perturbed data for the cooperative model at A:B ratios of 1:2, 1:1, and 2:1, and B = 5 μM, presented as g(s) distributions. (A) k1,off is kinetically slow at 0.005 s⁻¹ and k2,off is fast at 0.1 s⁻¹. (B) k1,off is fast at 0.1 s⁻¹ and k2,off is kinetically slow at 0.0001 s⁻¹. Consistent with the K2/K1 ratio and slow kinetics, the g(s) distribution for the fast peak runs near the expected value for species D. What is noticeable, however, is that the boundary shapes are similar in both panels regardless of which koff value causes the kinetic effect. These data are fit with Sedanal to extract K and koff values in various combinations (Table 6.1).
Table 6.1 presents a summary of the Sedanal fitting of the fast kinetic case (k1,off = k2,off = 0.1 s⁻¹) in the top line and of these two kinetically mediated data sets. For each kinetically mediated set of data there are three attempts at fitting. The upper line contains results of fits for three parameters, K1, K2, and the slow kinetic rate constant, while holding the fast rate constant at 0.1 s⁻¹. The next line contains results of fits for all four parameters, exploring the likely possibility of coupling between k1,off and k2,off. The third line shows an inverse fit, holding the slow kinetic rate constant to a fast value (0.1 s⁻¹) and fitting the fast kinetic rate constant to establish whether it can substitute for the other slow rate constant.
Consistent with the fitting of the fast kinetic data in Fig. 6.3 above (top line in Table 6.1), the K1 and K2 values are well determined for these kinetically mediated cases as well, whether we fit for three or four parameters, starting with the correct values as guesses. As discussed before (Correia and Stafford, 2009; Stafford, 2000), as long as the system is initially at equilibrium (and you span a reasonable saturation range) there is no problem in extracting equilibrium constants from kinetically mediated velocity data.

This conclusion is also verified by MC (or bootstrap with replacement) analysis, although in general the confidence intervals are smaller than with F-stat (Fig. 6.6). For example, a 20 data set MC on the three parameter k2,off analysis (with k1,off fixed at 0.1 s⁻¹) returns k2,off = 0.978 × 10⁻⁴ (±0.010) s⁻¹, K1 = 1.224 × 10⁴ (±0.187) M⁻¹, and K2 = 0.888 × 10⁸ (±0.105) M⁻¹, where the values are ±1 sd. Correlation plots of these data show a K2–K1 R = 0.99338, a K1–k2,off R = 0.71192, and a K2–k2,off R = 0.79979, so most of the coupling actually comes from K2–K1, where their product from these 20 simulations equals 1.069 × 10¹² (±0.018) M⁻². This was repeated for a four parameter MC for the case where k2,off is slow. Analysis returns k1,off = 0.149 (±0.054) s⁻¹, k2,off = 0.974 × 10⁻⁵ (±0.015) s⁻¹, K1 = 0.996 × 10⁴ (±0.347) M⁻¹, and K2 = 1.181 × 10⁸ (±0.444) M⁻¹, where the values are ±1 sd. Once again correlation plots reveal most of the coupling between parameters is between K2–K1, where their product from 30 simulations equals 1.049 × 10¹² (±0.025) M⁻².

To make this a visual comparison, these results are plotted as correlation plots in Fig. 6.6 and compared with a two parameter (K1 and K2) MC analysis of the fast kinetic case described above. Panel A shows the result of plotting K1 versus K2 from the two, three, and four parameter MC analyses. The width of each distribution increases with each additional parameter added to the analysis. For example, the sd about K1 increases from 0.026 for two parameters, to 0.187 for three parameters, to 0.253 for four parameters. This clearly must reflect cross correlation between K and koff values. How does this compare with F-statistics? Panel B (notice the changes in scale) plots the K1 versus K2 correlation plot from an F-stat analysis of the three parameter fit (K1, K2, and k2,off) superimposed upon the data from panel A (red symbols), while panel C plots the F-statistic for K1. These results show the 95% F-stat interval is at least 10-fold larger and more highly asymmetric than the corresponding MC interval. The F-stat explores a wider range of values (compare scales in panels A and B), exposing significant changes in the correlation coefficient between K1 and K2. We can thus conclude that F-statistics are a far more robust way of investigating uncertainty in and cross correlation between parameters.

What about the koff values? The slow rate constants are well determined in the three parameter fits, with slow k1,off (5.157 × 10⁻³ s⁻¹) and slow k2,off (0.978 × 10⁻⁴ s⁻¹) both accurate and precise.
Figure 6.6 Correlation plots of K2 versus K1 values derived from Monte Carlo and F-statistic analysis of the case where k1,off is fast at 0.1 s⁻¹ and k2,off is kinetically slow at 0.0001 s⁻¹. (A) plots K2 versus K1 results from MC analysis where two, three, and four parameters are fit as indicated in the legend. The width of the K1 and K2 MC distributions clearly gets wider with the addition of kinetic parameters, as described in the text. (B) plots K2 versus K1 results from F-statistic analysis where three parameters are fit as indicated in the legend. The MC distributions from (A) are also plotted (red symbols) for comparison. The F-stat distribution is more asymmetric and 10-fold wider than the corresponding MC distribution (o). (C) plots the 95% confidence interval for K1 obtained from F-stat analysis. The limits of this confidence interval for both K1 and K2 are also plotted in (B) as dotted lines.
This verifies the information content of the boundary shapes, the inflections due to the presence of free species B, shown in Fig. 6.5. The four parameter fits reveal that the value of the slow koff stays near the three parameter value (5.111 × 10⁻³ or 0.972 × 10⁻⁴ s⁻¹) and the fast koff values retain a fast value (0.130 or 0.043 s⁻¹). Thus, there is no significant coupling or switching between the off rates. This is a bit surprising, since the inverse fits show that k2,off can substitute for a slow k1,off, and a slow k1,off can substitute for a slow k2,off (although the rms of this fit, 0.005737, is not very good). We anticipated that the confidence interval for the fast koff would have extended into the slow range of off rates. Since this did not happen in the four parameter fits, the error space searched obviously stayed close to the three parameter minima and never extended into a kinetically slow regime for the other off rate.

An investigator's tendency may be to constrain K1 during this fitting since it is so weakly determined. For fits of just K2 and slow k2,off values (the k1,off = 0.1 and k2,off = 10⁻⁴ s⁻¹ case), we find K2 and k2,off are highly correlated (R = 0.9999). This K2–k2,off correlation is significantly reduced in the three parameter fit (R = 0.7998), and dramatically reduced in the four parameter fit (R = 0.3710). This poor correlation may also in part explain the slow convergence properties of this system. This is further complicated by the fact that K1 can strongly couple to k1,off, as seen in the case in which k1,off = 0.005 and k2,off = 0.1 s⁻¹ (Table 6.1). In this inverse fit, constraining k1,off to 0.1 s⁻¹ also forces K1 to compensate and reduce its value from 1 × 10⁴ to 0.053 × 10⁴ M⁻¹, with a surprisingly well determined upper confidence limit. This reduction in K1 reduces the concentration of species C, resulting in a slower flux by constraining koff·[C]. Thus, the kinetic effects in this inverse fit are due to both a reduced K1 and a slow k2,off value. This suggests that interpreting which step is the unique cause of the kinetic effect is problematic, due to coupling between the K and koff values as well as between the K1 and K2 values. While we have outlined the fitting procedures available in Sedanal to attack this kind of problem, an expanded approach involving measurements of relaxation kinetics (Bernasconi, 1976; Eccleston et al., 2008) may be advised to complement this AUC analysis.
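The correlation coefficients quoted above are ordinary Pearson coefficients computed across the Monte Carlo replicates. A minimal sketch of that bookkeeping follows (Python; our illustration, not Sedanal output). The `fits` array is filled with hypothetical, deliberately coupled stand-in values (the product K1·K2 is forced to be nearly constant); in practice each row would hold the best-fit (K1, K2, k2,off) from one noise-perturbed data set.

```python
# Sketch: pairwise correlations among Monte Carlo best-fit parameters.
# The arrays below are hypothetical stand-in data, generated so that the
# product K1*K2 is much better determined than either factor alone.
import numpy as np

rng = np.random.default_rng(0)
n = 20                                   # e.g., a 20 data set MC
K1 = 1.2e4 * (1 + 0.15 * rng.standard_normal(n))
K2 = 1.07e12 / K1 * (1 + 0.01 * rng.standard_normal(n))
k2off = 1e-4 * (1 + 0.10 * rng.standard_normal(n))
fits = np.column_stack([K1, K2, k2off])  # one row per MC replicate

R = np.corrcoef(fits, rowvar=False)      # 3 x 3 Pearson correlation matrix
print("R(K1, K2)     =", R[0, 1])        # |R| near 1: strong K1-K2 coupling
print("R(K1, k2off)  =", R[0, 2])
print("R(K2, k2off)  =", R[1, 2])

prod = K1 * K2                           # the well-determined combination
print("K1*K2 = %.3e +/- %.3e M^-2" % (prod.mean(), prod.std(ddof=1)))
```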
4. Monomer–Tetramer Model

One of the earliest sets of systems simulated for mass transport of macromolecules in the limit of no diffusion was an equilibrium monomer–Nmer system (Gilbert, 1955, 1959, 1960; Gilbert and Gilbert, 1978; Gilbert and Jenkins, 1959). Gilbert theory states that for a monomer–Nmer system where N > 2, the transport pattern will resolve into a bimodal distribution, where the slow peak migrates at the rate of the monomer and the fast peak approaches the rate of the Nmer as the loading concentration is increased.
Figure 6.7 Sedimentation distribution analysis g(s) comparing three kinds of monomer–tetramer systems where Koverall equals 8.32 × 10¹⁶ M⁻³. (A) shows an overlay of three simulated curves generated at 5 μM total protein for monomer–tetramer (solid black line), a monomer–dimer–trimer–tetramer system where all the Ks are identical (dashed black line), and a concerted model system where K2 = K3 = 4.4 × 10³ M⁻¹ < K4 = 4.4 × 10⁹ M⁻¹ and koff = 0.1 s⁻¹ (dotted line). Note the classic bimodal Gilbert-type pattern for the monomer–Nmer model and the skewed trailing zone for the sequential model. The concerted model exhibits a complete separation of monomer and tetramer species, suggesting a significant kinetic effect. (B) presents an overlay of three concerted cases to investigate the origin of the kinetic effects: one curve (dotted line), replotted from (A), where koff = 0.1 s⁻¹ and the run is simulated with kinetics; a second curve (solid dashed line) where koff = 1 s⁻¹ and the run is simulated with kinetics; and a third curve (short dashed line) where the run is simulated with N–R and thus equilibrium conditions. The perfect superposition of the blue and green lines means these simulations are at equilibrium and no kinetic effects are evident at koff = 1 s⁻¹. Note data scans were chosen so that the peak broadening limits were in a comparable range (215–244 kD) so as to maintain similar height-to-width characteristics of the distributions. The vertical dotted lines represent the monomer and tetramer S values.
This is seen in Fig. 6.7A, where the monomer–tetramer system (black line) sediments as a bimodal boundary with a slow monomer peak and a faster reaction boundary. This simulation is conducted at 5 μM, just above the midpoint of the reaction, 2.27 μM. The concentration dependence for this system is shown in Fig. 6.8, spanning a 100-fold concentration range over which the system changes by mass action from monomer at 0.5 μM, to a bimodal pattern at 5 μM, to mostly tetramer at 50 μM.

What happens if there are intermediates? To show this, we simulated a monomer–dimer–trimer–tetramer pattern where all the successive Ks are equal (no statistical factors for this general case; Fig. 6.7A, dashed black line).
Figure 6.8 Normalized sedimentation distribution analysis g(s) comparing the concentration dependence of (A) a monomer–tetramer system and (B) a monomer–dimer–trimer–tetramer system where all the Ks are identical. In both cases, Koverall equals 8.32 × 10¹⁶ M⁻³. Simulations are done with N–R at 0.5, 5.0, and 50 μM. Data scans were chosen so that the peak broadening limits were 215 kD. The vertical dotted lines represent the monomer and tetramer S values. (See Stafford (2009) for more details on the use of normalized g(s) analysis.)
These patterns run as a skewed reaction boundary, similar to a monomer–dimer system (see Correia and Stafford, 2009). The concentration dependence of the sequential system is shown in Fig. 6.8B over a 100-fold concentration range. The patterns change from a broad zone near the monomer region at 0.5 μM, to a skewed pattern in the central distribution region at 5.0 μM, to a more uniform although still skewed boundary approaching the tetramer region at 50 μM. The difference between the two systems in their extent of reaction at the extremes is due to the fact that the monomer–tetramer system is cooperative and undergoes a greater extent of dissociation or association over this concentration range. For example, over the concentration range of 0.5, 5, and 50 μM, the sequential model goes from 35.2% to 74.08% to 94.15% reaction (fraction of monomer reacted), while over the same concentration range the monomer–tetramer model goes from 3.77% to 64.87% to 93.17% (again using the Equilibrium calculations window). Since the associated material is uniformly tetramer in the monomer–tetramer system, while distributed over multiple species in the sequential model, the weight average position of the boundary is closer to the endpoint for the monomer–tetramer system at the higher concentration. These features thus account for the different shapes of the normalized g(s) distributions.

To ask what role low concentrations of intermediates might play, we simulated a concerted monomer–tetramer case where K2 was equal to K3, both were much smaller than K4, and all koff values were 0.1 s⁻¹. This simulation generates the g(s) pattern shown in Fig. 6.7A (dotted line). The distribution is fully separated into a kinetically mediated noninteracting boundary composed of monomer and tetramer. Raising koff to 1 s⁻¹ eliminates the kinetic effects, as shown in Fig. 6.7B. One curve (solid dashed line) represents a simulation done with kinetics during the run (Fig. 6.1) while the overlapping curve (short dashed line) was done under equilibrium assumptions (N–R in Fig. 6.1), proving 1 s⁻¹ gives an equilibrium result (see Fig. 6.10 below).

To investigate these g(s) distributions further, we plotted two concentration distributions for these concerted models in Fig. 6.9. Panel A shows the equilibrium distributions at later stages of the simulation (scans at 4000 s). In the concerted case, the gradient in the region between the monomer and tetramer zones represents monomer that dissociates from the tetramer and trails behind it. Even at equilibrium, low concentrations of dimer and trimer (barely visible in the baseline) prevent the dissociated monomer in the fast zone from keeping up with the tetramer. This is a very different pattern than the sequential data sets, which exhibit a skewed g(s) distribution. In Fig. 6.9B we plot concentration distributions (again scans at 4000 s) for the kinetically limited concerted reaction, that is, koff equal to 0.1 s⁻¹. The monomer and tetramer zones essentially run independently, with only a shallow gradient of A in the mid regions. (At koff equal to 0.01 s⁻¹ this gradient in species A completely disappears.) These comparisons show two features.
Figure 6.9 A plot of the monomer–dimer–trimer–tetramer species distributions simulated by Sedanal for the concerted equilibrium cases. Data from scan 40 (4000 s) are plotted. Monomer (black lines), dimer (short dashed lines), trimer (dotted lines), and tetramer (dashed lines) are plotted in each panel. (A) shows results for koff = 1 s⁻¹, where no kinetic effects are observed and the monomer concentration exhibits a broad gradient in the reaction boundary between monomer and tetramer. Note that the relative absence of dimer and trimer is predicted by the ratio of equilibrium constants. (B) shows results for koff = 0.1 s⁻¹, where kinetic effects are evident. The monomer and tetramer species sediment as independent zones with only a slight gradient in the leading monomer zone, consistent with the elevated green line in Fig. 6.5B. This shows the influence of a low concentration of intermediates plus kinetics to cause the monomer and tetramer zones to run independently.
First, monomer–Nmer, sequential, and concerted models with intermediates all display very different yet characteristic features during sedimentation at equilibrium. Visual inspection alone almost allows one to assign the mechanism. Second, concerted models with intermediates are extremely sensitive to kinetic effects, and at koff values equal to 0.1 s⁻¹ kinetic effects are dominant features of the patterns.

The appearance of kinetically mediated reaction boundaries has been previously described in terms of the relaxation time of the reaction (Correia and Stafford, 2009; Kegeles and Cann, 1978). Above a 100 s relaxation time, in a speed dependent manner, kinetic features will begin to appear in the sedimentation pattern for dimerization. What we can now assert is that this is true for any reaction mechanism. This is shown in Fig. 6.10, where kinetic traces are simulated (Kinetics box in the Main Menu of Sedanal) for the three monomer–tetramer systems simulated above. Panel A shows reequilibration from 100% tetramer for a monomer–tetramer (-o-) and a sequential monomer–tetramer model (-△-). Relaxation time is defined as the time it takes to get to 1/e of the original signal in tetramer concentration units. The times of 26 and 79 s are consistent with the rapid equilibrium behavior displayed by these systems. Panel B shows similar kinetic simulations for the concerted model where koff is varied from 0.01 (-□-), to 0.1 (-▽-), to 1 (-◊-) s⁻¹. The relaxation times drop from 69,000 to 6900 to 690 s for these three cases. The presence of kinetic effects in sedimentation velocity simulations is thus supported by these slow reequilibration times. (The 1 s⁻¹ koff simulation appears slow, 690 s, relative to the results shown in Fig. 6.7B. This is due to the initial conditions, 100% tetramer, used for these kinetic simulations. Starting a sedimentation run at equilibrium and then reequilibrating during the run due to radial dilution and gradients established in the boundary obviously generates much shorter relaxation times; that is, the data are initially closer to equilibrium, and thus the g(s) pattern for these data superimposes with an equilibrium simulation (Fig. 6.7B).)

What these data show is that intermediates at low concentration in a reaction scheme can have a large impact on the kinetics and sedimentation or transport patterns for mechanisms that involve more than two or three intermediate reactants. Intrinsic koff values alone do not necessarily explain the behavior. What is important to stress here is that limited flux through those species present in low concentrations creates a bottleneck in the reaction scheme. Here, we have outlined the various tools available within Sedanal to simulate and investigate models that involve low concentrations of intermediate species. This applies to monomer–Nmer or discrete assembly systems in general, and systems that form rings in particular. (A relevant difference in previous models for large N was to introduce a nucleation event at the dimer stage to create a cascade, a deep valley, of very small amounts of intermediates leading up to the formation of a cooperative Nmer (Kegeles, 1979). This may also apply to aggregation diseases mediated by a slow folding step to explain the nucleation requirement.)
Figure 6.10 Simulation of the kinetics of reequilibration for the three tetramer models, monomer–Nmer, sequential, and concerted, utilizing the Kinetics feature on the main Sedanal page. (A) presents the relaxation of an initial concentration of 1.25 μM pure tetramer (5 μM worth of monomers) to equilibrium for a monomer–Nmer model (-o-) and a sequential model (-△-). Overall K's are as described in Fig. 6.5A and the koff values are 0.01 s⁻¹. Relaxation times are estimated as the time it takes to get to 1/e of the change in concentration of tetramer, Δ[A4]. (B) presents the relaxation of an initial concentration of 1.25 μM pure tetramer (5 μM worth of monomers) to equilibrium for a concerted model with koff values of 0.01 (-□-), 0.1 (-▽-), and 1 (-◊-) s⁻¹. Overall K's are as described in Fig. 6.7B. Relaxation times are estimated as above.
It also applies to systems involving indefinite polymerization and, as discussed in the introduction, many large assembly systems do exhibit kinetic effects due to bottlenecks created by a cascade of reactions during reequilibration (Kegeles, 1979).
Intermediate-sized systems, micelles, small oligomers, and rings have been shown experimentally to exhibit slow kinetic effects (Lobert et al., 1996; Tai and Kegeles, 1984; Thusius, 1975; Thusius et al., 1975). In AUC analysis this is often demonstrated by the speed dependence of the simulated or g(s) patterns (Kegeles and Cann, 1978; Lobert et al., 1996). As described in great detail by Kegeles and Thusius, a combination of kinetic and equilibrium approaches is clearly required to investigate these complex systems and interpret their behavior.
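The relaxation behavior in Fig. 6.10 can be approximated outside Sedanal by integrating the mass-action rate equations directly. A minimal sketch (Python) follows, assuming the concerted scheme of Fig. 6.7, A + A ⇌ A2, A2 + A ⇌ A3, A3 + A ⇌ A4, with K2 = K3 = 4.4 × 10³ M⁻¹, K4 = 4.4 × 10⁹ M⁻¹, a common koff, and kon = K × koff at each step; the integrator settings and the 1/e bookkeeping are our own choices, not Sedanal's.

```python
# Sketch: reequilibration of a concerted monomer-tetramer system from 100% tetramer.
import numpy as np
from scipy.integrate import solve_ivp

K = {2: 4.4e3, 3: 4.4e3, 4: 4.4e9}      # stepwise constants, M^-1 (product ~ 8.5e16 M^-3)
koff = 0.1                              # s^-1, same for all steps
kon = {i: K[i] * koff for i in K}       # detailed balance: kon = K * koff

def rates(t, y):
    a, a2, a3, a4 = y
    f2 = kon[2] * a * a  - koff * a2    # net flux through A + A  -> A2
    f3 = kon[3] * a * a2 - koff * a3    # net flux through A2 + A -> A3
    f4 = kon[4] * a * a3 - koff * a4    # net flux through A3 + A -> A4
    return [-2 * f2 - f3 - f4, f2 - f3, f3 - f4, f4]

y0 = [0.0, 0.0, 0.0, 1.25e-6]           # 1.25 uM tetramer = 5 uM worth of monomer
sol = solve_ivp(rates, [0, 1e5], y0, method="LSODA", rtol=1e-8)

a4 = sol.y[3]
d_total = y0[3] - a4[-1]                # total change in [A4] on reequilibration
target = y0[3] - (1 - 1 / np.e) * d_total
tau = sol.t[np.argmax(a4 <= target)]    # first time the remaining change is 1/e of total
print(f"relaxation time ~ {tau:.0f} s") # the chapter reports ~6900 s for koff = 0.1 s^-1
```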
5. Summary

Here, we have reviewed approaches available in Sedanal for studying kinetic effects in sedimentation velocity data where the origin of the effect is the presence of low concentrations of intermediates. We have emphasized two types of systems, ABCD and monomer–tetramer. A combination of approaches available in Sedanal (sedimentation velocity simulations, g(s) distribution analysis, direct boundary fitting, F-statistics, Monte Carlo analysis, and kinetic simulations) has been applied to the problem. Insights into the concentration dependence and the kinetic features of these association models are revealed, and details about parameter determination and the complexity of parameter cross correlation have been established. Additional improvements in these methods, especially speeding up the F-statistic analysis through the use of multiprocessor computers and parallel processing, are in progress as a direct result of writing this chapter.

What additional issues must be addressed in future studies? One feature missing from our simulations here, which probably plays a role in indefinite polymer systems, is that our simulations assume that all events involve the successive addition of monomers. An equilibrium model for indefinite polymerization can be written in numerous ways, and we have chosen a monomer pathway. For example, a sequential system must add or dissociate one monomer to or from each intermediate at a time (Kegeles, 1979, 1980, 1984). In reality, indefinite polymers often have identical bonds at every position and thus any bond may break. This is the difference between the two following schemes, as described by Thusius (1975):

P + Pi ⇌ Pi+1
Pi + Pj ⇌ Pk (with i + j = k)

The scheme that allows annealing and breaks in the middle can clearly reequilibrate faster because there are multiple pathways to dissociation, and we anticipate that real systems are likely to behave in this manner as well; that is, the kinetic pathways are not uniquely defined as a cascade of monomer dissociations.
(This is the major difference between the Kegeles shell model (1979) and the Thusius indefinite polymerization model (Thusius, 1975; Thusius et al., 1975).) However, the models we currently build in Sedanal for analysis of equilibrium systems must consist of independent reactions. Thus, we cannot simulate or fit a sequential case where we add a fourth reaction scheme involving dimers (A2 + A2 ⇌ A4, with constant K24), since this is not an independent reaction (K2K3K4 = K24K2²). Future software efforts will investigate developing kinetic and stochastic simulation approaches that allow competing pathways to exist and thus influence relaxation times (Schilstra et al., 2008; also see chapter 15 in this volume).
ACKNOWLEDGMENTS

We thank Mike Johnson for his patience and his continuing ability to organize these volumes. We thank our collaborators at the BBRI and UMMC AUC facilities for stimulating our continued interest in these computational challenges.
REFERENCES

Alday, P. H. (2009). Use of fluorescently labeled proteins in quantitative sedimentation velocity studies of heterogeneous biomolecular interactions. Ph.D. thesis, University of Mississippi Medical Center.
Alday, P. H., and Correia, J. J. (2009). Macromolecular interaction of Halichondrin B analogs Eribulin (E7389) and ER-076349 with tubulin by analytical ultracentrifugation. Biochemistry 48, 7927–7938.
Bates, D. M., and Watts, D. G. (1988). Nonlinear Regression Analysis and Its Applications. Wiley, New York.
Bernasconi, C. F. (1976). Relaxation Kinetics. Academic Press, New York, pp. 14–15.
Brown, P. J. (2009). SEDPHAT—An analysis platform for the biophysical analysis of reversibly assembled multi-protein complexes in solution. Biophys. J. 96(Suppl. 1), 74A.
Cann, J. R. (1970). Interacting Macromolecules: The Theory and Practice of Their Electrophoresis, Ultracentrifugation, and Chromatography. Academic Press, New York.
Cann, J. R., and Kegeles, G. (1974). Theory of sedimentation for kinetically controlled dimerization reactions. Biochemistry 13, 1868–1874.
Cann, J. R. (1978a). Measurement of protein interactions mediated by small molecules using sedimentation velocity. Methods Enzymol. 48, 242–248.
Cann, J. R. (1978b). Ligand binding by associating systems. Methods Enzymol. 48, 299–307.
Correia, J. J., and Stafford, W. F. (2009). Extracting equilibrium constants from kinetically limited reacting systems. Methods Enzymol. 455, 419–446.
Correia, J. J., Shire, S. J., Yphantis, D. A., and Schuster, T. M. (1985). Sedimentation equilibrium measurements of intermediate size tobacco mosaic virus (TMV) protein polymers. Biochemistry 24, 3292–3297.
Dam, J., Velikovsky, C. A., Mariuzza, R. A., Urbanke, C., and Schuck, P. (2005). Sedimentation velocity analysis of heterogeneous protein–protein interactions: Lamm equation modeling and sedimentation coefficient distributions c(s). Biophys. J. 89, 619–634.
Dam, J., and Schuck, P. (2005). Sedimentation velocity analysis of heterogeneous protein–protein interactions: Sedimentation coefficient distributions c(s) and asymptotic boundary profiles from Gilbert–Jenkins theory. Biophys. J. 89, 651–666.
Desai, A., and Mitchison, T. J. (1997). Microtubule polymerization dynamics. Annu. Rev. Cell Dev. Biol. 13, 83–117.
Eccleston, J. F., Martin, S. R., and Schilstra, M. J. (2008). Rapid kinetic techniques. In "Methods in Cell Biology, Vol. 84, Biophysical Tools for Biologists: Vol. 1, In Vitro Techniques" (H. W. Detrich, ed.), pp. 445–477. Elsevier, Amsterdam.
Frigon, R. P., and Timasheff, S. N. (1975a). Magnesium-induced self-association of calf brain tubulin. I. Stoichiometry. Biochemistry 14, 4559–4566.
Frigon, R. P., and Timasheff, S. N. (1975b). Magnesium-induced self-association of calf brain tubulin. II. Thermodynamics. Biochemistry 14, 4567–4573.
Gelinas, A. D., Toth, J., Bethoney, K. A., Stafford, W. F., and Harrison, C. J. (2004). Mutational analysis of the energetics of the GrpE–DnaK binding interface: Equilibrium association constants by sedimentation velocity analytical ultracentrifugation. J. Mol. Biol. 339, 447–458.
Gilbert, G. A. (1955). General discussion. Discuss. Faraday Soc. 20, 65–77.
Gilbert, G. A. (1959). Sedimentation and electrophoresis of interacting substances. I. Idealized boundary shape for a single substance aggregating reversibly. Proc. Roy. Soc. (London) A250, 377–388.
Gilbert, G. A., and Jenkins, R. C. Ll. (1959). Sedimentation and electrophoresis of interacting substances. II. Asymptotic boundary shape for two substances interacting reversibly. Proc. Roy. Soc. Lond. A: Math. Phys. Eng. Sci. 253, 420–437.
Gilbert, G. A. (1960). Concentration-dependent sedimentation of aggregating proteins in the ultracentrifuge. Nature 186, 882–883.
Gilbert, L. M., and Gilbert, G. A. (1978). Molecular transport of reversibly reacting systems: Asymptotic boundary profiles in sedimentation, electrophoresis, and chromatography. Methods Enzymol. 48, 195–213.
Hairer, E., and Wanner, G. (1991). Solving Ordinary Differential Equations II: Stiff and Differential-Algebraic Problems. Springer Series in Computational Mathematics 14. Springer-Verlag, Berlin (2nd edn. 1996).
Huang, C.-H. C., Tai, M.-T., and Kegeles, G. (1984). Pressure-jump kinetics of bovine β-casein micellization. Biophys. Chem. 20, 89–94.
Johnson, M. L., Correia, J. J., Halvorson, H., and Yphantis, D. A. (1981). Analysis of data from the analytical ultracentrifuge by nonlinear least-squares techniques. Biophys. J. 36, 575–588.
Johnson, M. L. (1992). Why, when, and how biochemists should use least-squares. Anal. Biochem. 206, 215–225.
Kegeles, G., and Cann, J. (1978). Kinetically controlled mass transport of associating–dissociating macromolecules. Methods Enzymol. 48, 248–270.
Kegeles, G. (1979). A shell model for size distribution in micelles. J. Phys. Chem. 83, 1728–1732.
Kegeles, G. (1980). The dissolution of micelles in relaxation kinetics. J. Colloid Interface Sci. 73, 274–275.
Kegeles, G. (1984). Time dependence of micelle dissolution in relaxation kinetics. J. Colloid Interface Sci. 99, 153–163.
Lobert, S., Vulevic, B., and Correia, J. J. (1996). Interaction of vinca alkaloids with tubulin: A comparison of vinblastine, vincristine and vinorelbine. Biochemistry 35, 6806–6814.
Nelder, J. A., and Mead, R. (1965). A simplex method for function minimization. Comput. J. 7, 308–313.
Oberhauser, D. F., Bethune, J. L., and Kegeles, G. (1965). Countercurrent distributions of chemically reacting systems: IV. Kinetically controlled dimerization in a boundary. Biochemistry 4, 1878–1884.
Philo, J. S. (2006). Improved methods for fitting sedimentation coefficient distributions derived by time-derivative techniques. Anal. Biochem. 354, 238–246.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. (1992). Extrapolation and the Bulirsch–Stoer method. In Numerical Recipes in FORTRAN: The Art of Scientific Computing, 2nd edn., pp. 718–725. Cambridge University Press, Cambridge, England.
Scheele, R. B., and Schuster, T. M. (1974). Kinetics of protein subunit interactions: Simulation of a polymerized overshoot. Biopolymers 13, 275–288.
Scheele, R. B., and Schuster, T. M. (1975). Hysteresis of proton binding to tobacco mosaic virus protein associated with metastable polymerization. J. Mol. Biol. 94, 519–525.
Schilstra, M. J., Martin, S. R., and Keating, S. (2008). Methods for simulating the dynamics of complex biological processes. In "Methods in Cell Biology, Vol. 84, Biophysical Tools for Biologists: Vol. 1, In Vitro Techniques" (H. W. Detrich, ed.), pp. 807–842.
Schuster, T. M., Scheele, R. B., and Khairallah, L. H. (1979). Mechanism of self-association of tobacco mosaic virus protein. I. Nucleation-controlled kinetics of polymerization. J. Mol. Biol. 127, 461–485.
Shire, S. J., Steckert, J. J., and Schuster, T. M. (1979). Mechanism of self-association of tobacco mosaic virus protein. II. Characterization of the metastable polymerization nucleus and the initial stages of helix formation. J. Mol. Biol. 127, 487–506.
Stafford, W. F. (1992). Boundary analysis in sedimentation transport experiments: A procedure for obtaining sedimentation coefficient distributions using the time derivative of the concentration profile. Anal. Biochem. 203, 295–301.
Stafford, W. F. (2000). Analysis of reversibly interacting macromolecular systems by time derivative sedimentation velocity. Methods Enzymol. 323, 302–325.
Stafford, W. F. (2009). Protein–protein and ligand–protein interactions studied by analytical ultracentrifugation. In "Protein Structure, Stability, and Interactions" (J. W. Schriver, ed.), Vol. 490, pp. 83–113. Humana Press, New York.
Stafford, W. F., and Sherwood, P. J. (2004). Analysis of heterologous interacting systems by sedimentation velocity: Curve fitting algorithms for estimation of sedimentation coefficients, equilibrium and kinetic constants. Biophys. Chem. 108, 231–243.
Tai, M., and Kegeles, G. (1984). A micelle model for the sedimentation behavior of bovine β-casein. Biophys. Chem. 20, 81–87.
Thusius, D., Dessen, P., and Jallon, J.-M. (1975). Mechanism of bovine liver glutamate dehydrogenase self-assembly: I. Kinetic evidence for a random association of polymer chains. J. Mol. Biol. 92, 413–432.
Thusius, D. (1975). Mechanism of bovine liver glutamate dehydrogenase self-assembly: II. Simulation of relaxation spectra for an open linear polymerization proceeding via a sequential addition of monomer units. J. Mol. Biol. 94, 367–383.
Van Holde, K. E., Johnson, W. C., and Ho, P. S. (1998). Principles of Physical Biochemistry. Prentice Hall, New York, p. 593.
Wegner, A. (1976). Head to tail polymerization of actin. J. Mol. Biol. 108, 139–150.
Zhao, H., and Beckett, D. (2008). Kinetic partitioning between alternative protein–protein interactions controls a transcriptional switch. J. Mol. Biol. 380, 223–236.
CHAPTER SEVEN

Algebraic Models of Biochemical Networks

Reinhard Laubenbacher and Abdul Salam Jarrah

Contents
1. Introduction 164
2. Computational Systems Biology 165
3. Network Inference 176
4. Reverse-Engineering of Discrete Models: An Example 181
   4.1. Boolean networks: Deterministic and stochastic 181
   4.2. Inferring Boolean networks 184
   4.3. Inferring stochastic Boolean networks 184
   4.4. Polynome: Parameter estimation for Boolean models of biological networks 185
   4.5. Example: Inferring the lac operon 189
5. Discussion 190
References 193
Abstract

With the rise of systems biology as an important paradigm in the life sciences and the availability and increasingly good quality of high-throughput molecular data, the role of mathematical models has become central in the understanding of the relationship between structure and function of organisms. This chapter focuses on a particular type of models, so-called algebraic models, which are generalizations of Boolean networks. It provides examples of such models and discusses several available methods to construct such models from high-throughput time course data. One specific such method, Polynome, is discussed in detail.
Virginia Bioinformatics Institute at Virginia Tech, Blacksburg, Virginia, USA
1. Introduction

"The advent of functional genomics has enabled the molecular biosciences to come a long way towards characterizing the molecular constituents of life. Yet, the challenge for biology overall is to understand how organisms function. By discovering how function arises in dynamic interactions, systems biology addresses the missing links between molecules and physiology." (Bruggemann and Westerhoff, 2006)
With the rise of systems biology as an important paradigm in the life sciences and the availability and increasingly good quality of high-throughput molecular data, the role of mathematical models has become central in the understanding of the relationship between structure and function of organisms. It is by now well understood that fundamental intracellular processes such as metabolism, signaling, or various stress responses can be conceptualized as complex dynamic networks of interlinked molecular species that interact in nonlinear ways with each other and with the extracellular environment. Available data, such as transcriptional data obtained from DNA microarrays or high-throughput sequencing machines, complemented by single-cell measurements, are approaching a quality and quantity that makes it feasible to construct detailed mechanistic models of these networks.

There are two fundamental approaches one can take to the construction of mathematical or statistical network models. For each approach the first step is to choose an appropriate model type, which might be, for example, a dynamic Bayesian network (DBN) or a system of ordinary differential equations (ODE). The traditional bottom-up approach begins with a "parts list," that is, a list of the molecular species to be included in the model, together with information from the literature about how the species interact. For each model type, there will then likely be some model parameters that are unknown. These will then either be estimated or, if data are available, fitted to experimental data. The result is typically a detailed mechanistic model of the network. That is, the selected parts are assembled into a larger system.

Systems biology has provided another, so-called top-down approach, which attempts to obtain an unbiased view of the underlying network from high-throughput experimental data alone, using statistical or mathematical network inference tools. The advantage of this approach is that the results are not biased by a perception or presumed knowledge of which parts are important and how they work together. The disadvantage is that the resulting model is most likely phenomenological, without a detailed mechanistic structure. It is also the case that at this time the available data sets for this approach are rather small, compared to the number of network nodes, so that the resulting network inference problem is underdetermined.
The most beneficial approach, therefore, is to combine both methods by using the available information about the network as prior information for an appropriate network inference method. Thus, network inference becomes an extreme case of parameter estimation, in which no parameters are specified. The development of appropriate network inference methods has become an important research field within computational systems biology.

The goal of this chapter is to illustrate parameter estimation and network inference using a particular type of model, which we will call algebraic model. Boolean networks, which are being used increasingly as models for biological networks, represent the simplest instance. First, in Section 2, we will provide an introduction to computational systems biology and give some detailed examples of algebraic models. In the following section, we provide an introduction to network inference within several different model paradigms and provide details of several methods. Finally, in the last section, we talk about network inference as it pertains specifically to algebraic models. Few inference methods provide readily available software. We describe one of those, Polynome, in enough detail so that the reader can explore the software via the available Web interface.
2. Computational Systems Biology

To understand how molecular networks function it is imperative that we understand the dynamic interactions between their parts. Gene regulatory networks give an important example. They are commonly represented mathematically as so-called directed graphs, whose nodes are genes and sometimes proteins. There is an arrow from gene A to gene B if A contributes to the regulation of B in some way. The arrow typically indicates whether this contribution is an activation or an inhibition. It is important to understand that such a network provides an abstract representation of gene regulation, in the sense that the actual regulation is not direct but could involve a fairly long chain of biochemical reactions, involving mRNA from gene A and/or a protein A codes for.

As an example, we consider the lac operon in E. coli, one of the earliest and best understood examples of gene regulation. We will use this gene network as a running example for the entire chapter. E. coli prefers glucose as a growth medium, and the operon genes allow E. coli to metabolize lactose in the absence of glucose. When glucose is present it is observed that the enzymes involved in lactose metabolism have very low activity, even if intracellular lactose is present. In the absence of glucose, lactose metabolism is induced through expression of the lac operon genes.
Figure 7.1 shows a representation of the basic biological mechanisms of the network, commonly referred to as a "cartoon" representation. While this representation is very intuitive, to understand the biology, a network representation of the system is a first step toward the construction of a dynamic mathematical model. Such a representation is given in Fig. 7.2.
Figure 7.1 The lac operon (from http://www.uic.edu/classes/bios/bios100/lecturesf04am/lect15.htm).
Figure 7.2 The lac operon network (from Stigler and Veliz-Cuba, 2009).
The dynamic properties of this network are determined by two control mechanisms, one positive, leading to induction of the operon, and the other negative, leading to repression of the operon. The negative control is initiated by glucose, through two modes of action. In the absence of intracellular glucose, the catabolite activator protein CAP forms a complex with the signaling molecule cAMP, which binds to a site upstream of the lac promoter region, enhancing transcription of the lac genes. Intracellular glucose inhibits cAMP synthesis and thereby gene transcription. Furthermore, extracellular glucose inhibits the uptake of lactose into the cell via lactose permease, one of the proteins in the lac operon. The positive control operates via the action of lactose permease, increasing intracellular lactose, and by disabling of the lac repressor protein via increased production of allolactose. To understand how these two feedback loops, one positive and the other negative, work together to determine the dynamic properties of this network it is necessary to construct a mathematical model that captures this synergistic interplay of positive and negative feedback. Several different mathematical modeling frameworks are available for this purpose.

The most commonly used type of model for molecular networks is a system of ODE, one equation for each of the nodes in the network. Each equation describes the rate of change of the concentration of the corresponding molecular species over time, as a function of other network nodes involved in its regulation. As an example, we present a very simplified differential equations model of the lac operon taken from Section 5.2 of deBoer. This is an example of a model which was referred to in the introduction as "bottom-up." The model includes only the repressor R, the lac operon mRNA M, and allolactose A. The three equations are given below:

R = 1/(1 + Aⁿ),
dM/dt = c0 + c(1 − R) − gM,
dA/dt = ML − dA − vMA/(h + A).

Here, c0, c, g, v, d, h, and L are certain model parameters, n is a fixed positive integer, and the concentrations R, M, and A are functions of time t. The model does not distinguish between intracellular and extracellular lactose, both denoted by L. It is assumed further that the enzyme β-galactosidase is proportional to the operon activity M and is not represented explicitly. The repressor concentration R is represented by a so-called Hill function, which has a sigmoid-shaped graph: the larger the Hill coefficient n, the steeper the shape of the sigmoid function. The constant c0 represents the baseline activity of the operon transcript M, and the term gM represents degradation. The concentration of allolactose A grows with M, assuming that lactose L is present. Its degradation is composed of two terms, the first-order decay dA and a Michaelis–Menten type enzyme–substrate term vMA/(h + A). The various model parameters can be estimated to fit experimental data, using parameter estimation algorithms.
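A minimal simulation of this toy model is shown below (Python). The parameter values are arbitrary illustrative placeholders, since the text does not fix them; the point is only the mechanics of integrating M(t) and A(t) for a given lactose level L.

```python
# Sketch of the simplified lac operon ODE model described above.
# Parameter values are illustrative placeholders, not fitted values.
from scipy.integrate import solve_ivp

c0, c, g = 0.05, 1.0, 1.0     # basal transcription, max transcription, mRNA decay
v, d, h = 2.0, 1.0, 2.0       # allolactose consumption and decay parameters
n, L = 4, 2.0                 # Hill coefficient and (fixed) lactose level

def lac(t, y):
    M, A = y
    R = 1.0 / (1.0 + A**n)                    # repressor activity, a Hill function of A
    dM = c0 + c * (1.0 - R) - g * M           # operon mRNA
    dA = M * L - d * A - v * M * A / (h + A)  # allolactose
    return [dM, dA]

sol = solve_ivp(lac, [0, 50], [0.0, 0.0])
print("long-time (M, A) for this parameter choice:", sol.y[:, -1])
```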
This model is not detailed enough to incorporate the two different feedback loops discussed earlier, but will serve as an illustration of the kind of information a dynamic model can provide. The most important information typically obtained from a model is about the steady states of the network, that is, network states at which all derivatives are equal to 0, so that the system remains in the steady state once it reaches it. A detailed analysis of this model can be found in deBoer and also in Laubenbacher and Sturmfels (2009). Such an analysis shows that the model has three steady states, two stable and one unstable. This is what one would expect from the biology. Depending on the lactose concentration, the operon is either "on" or "off," resulting in two stable steady states. For bacteria growing in an environment with an intermediate lactose concentration, it has been shown that the operon can be in either one of these two states, and which one is attained depends on the environmental history a particular bacterium has experienced, a form of hysteresis. This behavior corresponds to the unstable steady state. Thus, the model dynamics agrees with experimental observations.

Another modeling framework that has gained increasing prominence is that of Boolean networks, initially introduced to biology by Kauffman (1969) as a model for genetic control of cell differentiation. Since then, a wide array of such models has been published, as discussed in more detail below. They represent the simplest examples of what is referred to in the title of this chapter as algebraic models. They are particularly useful in cases when the quantity or quality of available experimental data is not sufficient to build a meaningful differential equations model. Furthermore, algebraic models are just one step removed from the way a biologist would describe the mechanisms of a molecular network, so that they are quite intuitive and accessible to researchers without mathematical background. In particular, this makes them a useful teaching tool for students in the life sciences (Robeva and Laubenbacher, 2009).

To illustrate the concept, we present here a Boolean model of the lac operon, taken from Stigler and Veliz-Cuba (2009). A Boolean network consists of a collection of nodes or variables, each of which can take on two states, commonly represented as ON/OFF or 1/0. Each node has attached to it a Boolean function that describes how the node depends on some or all of the other nodes. Time progresses in discrete steps. For a given state of the network at time t = 0, the state at time t = 1 is determined by evaluating all the Boolean functions at this state.

Example 7.1 We provide a simple example to illustrate the concept. Consider a network with four nodes, x1, . . ., x4. Let the corresponding Boolean functions be

f1 = x1 AND x3,
f2 = x2 OR (NOT x4),
f3 = x1,
f4 = x1 OR x2.
Then this Boolean network can be described by the function f = (f1, . . ., f4): {0, 1}⁴ → {0, 1}⁴, where

f(x1, . . ., x4) = (x1 AND x3, x2 OR (NOT x4), x1, x1 OR x2).   (7.1)
As a concrete example, f(0, 1, 1, 0) = (0, 1, 0, 1). Here, {0, 1}⁴ represents the set of all binary 4-tuples, of which there are 16. The dependencies among the variables can be represented by the directed graph in Fig. 7.3, representing the wiring diagram of the network. The dynamics of the network is represented by another directed graph, the discrete analog of the phase space of a system of differential equations, given in Fig. 7.4. The nodes of the graph represent the 16 possible states of the network. There is a directed arrow from state a to state b if f(a) = b. The network has three steady states, (0, 1, 0, 1), (1, 1, 1, 1), and (1, 0, 1, 1). (In general, there might be periodic points as well.)

A Boolean model of the lac operon. The following model is presented in Stigler and Veliz-Cuba (2009). In addition to the three variables M (mRNA for the three lac genes), R (the repressor protein), and A (allolactose) in the previous ODE model, we need to include these additional variables:
- Lac permease (P)
- β-Galactosidase (B)
- Catabolite activator protein CAP (C)
- Lactose (L)
- Low concentrations of lactose (Llow) and allolactose (Alow)
Figure 7.3 Wiring diagram for Boolean network in Example 7.1 (constructed using the software tool DVD; http://dvd.vbi.vt.edu).
Figure 7.4 Dynamics of the Boolean network in Example 7.1 (constructed using the software tool DVD; http://dvd.vbi.vt.edu).
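The phase space in Fig. 7.4 is small enough to recompute by brute force. The following sketch (Python; the DVD tool cited in the captions does this, and much more, through a Web interface) evaluates the network of Example 7.1 and recovers its three steady states:

```python
# Sketch: brute-force dynamics of the 4-node Boolean network of Example 7.1.
from itertools import product

def f(x):
    x1, x2, x3, x4 = x
    return (x1 & x3,            # f1 = x1 AND x3
            x2 | (1 - x4),      # f2 = x2 OR (NOT x4)
            x1,                 # f3 = x1
            x1 | x2)            # f4 = x1 OR x2

# One transition, as in the text: f(0,1,1,0) = (0,1,0,1)
print(f((0, 1, 1, 0)))

# All 16 states; the fixed points are the three steady states of Fig. 7.4
for a in product((0, 1), repeat=4):
    if f(a) == a:
        print("steady state:", a)
```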
The last two variables are needed for the model to be accurate, since we need to allow for three, rather than two, possible concentration levels of lactose and allolactose: absent, low, and high. Introducing additional binary variables to account for the three states avoids the introduction of variables with more than two possible states. (Models with multistate variables will be discussed below.) The model depends on two external parameters: a, representing the concentration of external lactose, and g, representing the concentration of external glucose. They can both be set to the values 0 and 1, providing four different choices. The interactions between these different molecular species are described in Stigler and Veliz-Cuba (2009) by the following Boolean functions:

fM = (NOT R) AND C,
fP = M,
fB = M,
fC = NOT g,
fR = (NOT A) AND (NOT Alow),
fA = L AND B,
fAlow = A OR L OR Llow,
fL = (NOT g) AND P AND a,
fLlow = (NOT g) AND (L OR a).   (7.2)
To understand these Boolean statements and how they assemble to a mathematical network model, consider the first one. It represents a rule that describes how the concentration of lac mRNA evolves over time. To make the time dependence explicit, one could write fM(t + 1) = (NOT R(t)) AND C(t) (similarly for the other functions).
This is to be interpreted as saying that the lac genes are transcribed at time t + 1 if the repressor protein R is absent at time t and the catabolite activator protein C is present at time t. The interpretation of the other functions is similar. One of the model assumptions is that transcription and translation of a gene happen in one time step, and so does degradation of mRNA and proteins.

Thus, choosing the above ordering of these nine variables, a state of the lac operon network is given by a binary 9-tuple, such as (0, 1, 1, 0, 0, 1, 0, 1, 1). The above Boolean functions can be used as the coordinate functions of a time-discrete dynamical system on {0, 1}⁹, that is, a function

f = (fM, fP, fB, fC, fR, fA, fAlow, fL, fLlow): {0, 1}⁹ → {0, 1}⁹.

For a given network state a in {0, 1}⁹, the function f is evaluated as

f(a) = (fM(a), fP(a), fB(a), fC(a), fR(a), fA(a), fAlow(a), fL(a), fLlow(a))

to give the next network state. For the exemplary network state above, and the parameter setting a = 1, g = 0, we obtain

f(0, 1, 1, 0, 0, 1, 0, 1, 1) = (0, 0, 0, 1, 0, 1, 1, 1, 1).

It is shown in Stigler and Veliz-Cuba (2009) that this model captures all essential features of the lac operon, demonstrated through several other published ODE models. One of these is its bistability, which was already discussed in the simple model above. It is shown that for each of the four possible parameter settings the model attains a steady state, corresponding to the switch-like nature of the network. It is in principle possible to compute the entire phase space of the model, using a software package such as DVD (http://dvd.vbi.vt.edu). The space, which contains 2⁹ states, is too large to be visualized, however.

We have discussed this particular model in some detail, for three reasons. It represents a model of an interesting network that continues to be the object of ongoing research, yet is simple enough to fit the space constraints of this chapter. For the reader unfamiliar with algebraic models of this type, it provides a detailed realistic example to explore. Finally, we will use this particular model subsequently to demonstrate a particular network inference method.

Boolean network models of biological systems are the most common type of discrete model used, including gene regulatory networks such as the cell cycle in mammalian cells (Faure et al., 2006), in budding yeast (Li et al., 2004) and fission yeast (Davidich and Bornholdt, 2007), and the metabolic networks in E. coli (Samal and Jain, 2008) and in Saccharomyces cerevisiae (Herrgard et al., 2006). Also, Boolean network models of signaling networks have recently been used to provide insights into different mechanisms such as the molecular neurotransmitter signaling pathway (Gupta et al., 2007), the T cell receptor signaling pathways (Saez-Rodriguez et al., 2007), the signaling network for the long-term survival of cytotoxic T lymphocytes in humans (Zhang et al., 2008), and the abscisic acid signaling pathway (Li et al., 2006a).
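Returning to the lac operon model of Eq. (7.2): although the full phase space on {0, 1}⁹ is impractical to draw, its fixed points under each of the four (a, g) settings can be enumerated directly, in the same brute-force style as for Example 7.1. A minimal sketch (our own code, not the DVD package):

```python
# Sketch: fixed points of the Boolean lac operon model of Eq. (7.2)
# for each setting of the external parameters a (lactose) and g (glucose).
from itertools import product

def step(state, a, g):
    M, P, B, C, R, A, Alow, L, Llow = state
    return (int((not R) and C),            # fM
            M,                             # fP
            M,                             # fB
            int(not g),                    # fC
            int((not A) and (not Alow)),   # fR
            int(L and B),                  # fA
            int(A or L or Llow),           # fAlow
            int((not g) and P and a),      # fL
            int((not g) and (L or a)))     # fLlow

for a, g in product((0, 1), repeat=2):
    fixed = [s for s in product((0, 1), repeat=9) if step(s, a, g) == s]
    print(f"a={a}, g={g}: steady states {fixed}")
```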
A more general type of discrete model, so-called logical models, was introduced by the geneticist René Thomas for the study of gene regulatory networks (Thomas and D'Ari, 1989). Since then, logical models have been developed further, with published models of cell-fate determination in Arabidopsis thaliana (Espinosa-Soto et al., 2004), the root hair regulatory network (Mendoza and Alvarez-Buylla, 2000), the Hh signaling pathway (Gonzalez et al., 2008), the gap gene network in Drosophila (Sanchez and Thieffry, 2001), and the differentiation process in T helper cells (Mendoza, 2006), to name a few. A logical model consists of a collection of variables x1, . . ., xn, representing molecular species such as mRNA, where variable xj takes values in a finite set Sj. The number of elements in Sj corresponds to the number of different concentrations of xj that trigger different modes of action. For instance, when transcribed at a low level, a gene might perform a different role than when it is expressed at a high level, resulting in three states: absent, low, and high. The variables are linked in a graph, the dependency graph or wiring diagram of the model, as in the Boolean case, by directed edges indicating a regulatory action of the source of the edge on the target. Each edge is equipped with a sign +/− (indicating activation or inhibition) and a weight. The weight indicates the level of transcription of the source node required to activate the regulatory action. The state transitions of each node are given by a table involving a list of so-called logical parameters. The dynamics of the system is encoded by the state space graph, again as in the Boolean case, whose edges indicate state transitions of the system. An additional structural feature of this model type is that the variables can be updated sequentially, rather than in parallel as in the previously discussed Boolean model. This feature allows the inclusion of different time scales and stochastic effects that lead to asynchronous updating of variables. The choice of different update orders at any given update step can result in different transitions.

To illustrate the logical model framework, we briefly describe the T cell differentiation model in Mendoza (2006). Two distinct functional subsets of T helper cells, T1 and T2, which differentiate from a common precursor called T0, were identified in the late 1980s. T1 cells secrete IFN-γ, which promotes further T1 differentiation while inhibiting differentiation into T2. On the other hand, T2 cells secrete IL-4, a cytokine that promotes T2 differentiation and inhibits that of T1. A general logical model of the gene regulatory network of T1/T2 differentiation, synthesized from published experimental data, is presented in Mendoza (2006). The multilevel network includes 19 genes and four stimuli, with interactions at the inter- and intracellular levels. Some of the nodes are assumed to be Boolean, while a few others are multistate (low, medium, or high).
An edge is considered active when the value at its source is above the corresponding threshold, in the sense that it then contributes to changing the value at the target of that edge. When there is more than one incoming edge, the combinations of different active incoming edges and their thresholds are assembled into a logical function.

It is worth mentioning briefly that the algebraic model framework is well suited for the study of the important relationship between the structure and the dynamics of a biological network. In systems biology, the work of Uri Alon has drawn a lot of attention to this topic, summarized in his book (Alon, 2006). An important focus of Alon's work has been the effect of so-called network motifs, such as feed-forward loops, on dynamics. Another topic of study in systems biology has been the logical structure of gene regulatory and other networks. In the context of Boolean networks, significant work has been devoted to identifying features of Boolean functions that make them particularly suitable for the modeling of gene regulation and metabolism. As an example, we present here a summary of the work on so-called nested canalyzing functions (NCFs), a particular class of Boolean functions that appears frequently in systems biology models.

Biological systems in general, and biochemical networks in particular, are robust against noise and perturbations and, at the same time, can evolve and adapt to different environments (Balleza et al., 2008). Therefore, a realistic model of any such system must possess these properties, and hence the update functions for the network nodes cannot be arbitrary. Different classes of Boolean functions have been suggested as biologically relevant models of regulatory mechanisms: biologically meaningful functions (Raeymaekers, 2002), Post functions (Shmulevich et al., 2003b), and chain functions (Gat-Viks and Shamir, 2003). However, the class that has received the most attention is that of NCFs, introduced in Kauffman et al. (2004) for gene regulatory networks. We first give some precise definitions.

Definition 7.1 A Boolean function g(x1, . . ., xn): {0, 1}^n → {0, 1} is called canalyzing in the variable xi with the input value ai and output value bi if xi appears in g and

    g(x1, . . ., xi−1, ai, xi+1, . . ., xn) = bi

for all values of the remaining variables xj, j ≠ i.

The definition is reminiscent of the concept of "canalization" introduced by the geneticist C. H. Waddington (Waddington, 1942) to represent the ability of a genotype to produce the same phenotype regardless of environmental variability.
Definition 7.2 Let f be a Boolean function in n variables, and let σ be a permutation of the set {1, . . ., n}. The function f is a nested canalyzing function (NCF) in the variable order xσ(1), . . ., xσ(n) with canalyzing input values a1, . . ., an and canalyzed output values b1, . . ., bn, if

    f(x1, . . ., xn) = b1      if xσ(1) = a1;
                     = b2      if xσ(1) ≠ a1 and xσ(2) = a2;
                       . . .
                     = bn      if xσ(1) ≠ a1, . . ., xσ(n−1) ≠ an−1, and xσ(n) = an;
                     = NOT bn  if xσ(1) ≠ a1, . . ., xσ(n−1) ≠ an−1, and xσ(n) ≠ an.

The function f is nested canalyzing if it is nested canalyzing for some variable ordering σ. As an example, the function f(x, y, z) = x AND (NOT y) AND z is nested canalyzing in the variable order x, y, z with canalyzing values 0, 1, 0 and canalyzed values 0, 0, 0, respectively. However, the function f(x, y, z, w) = x AND y AND (z XOR w) is not nested canalyzing because, once x = 1 and y = 1, the remaining function z XOR w is not constant for either fixed value of z or of w.

One important characteristic of NCFs is that they exhibit a stabilizing effect on the dynamics of a system. That is, small perturbations of an initial state should not grow in time and must eventually end up in the same attractor as the initial state. Stability is typically measured using so-called Derrida plots, which monitor the Hamming distance between a random initial state and a perturbed copy of it as both evolve over time. If the Hamming distance decreases over time, the system is considered stable. The slope of the Derrida curve is used as a numerical measure of stability. Roughly speaking, the phase space of a stable system has few components, and the limit cycle of each component is short.

Example 7.2 Consider the Boolean networks

    f = (x4, x3 XOR x4, x2 XOR x4, x1 XOR x2 XOR x3): {0, 1}^4 → {0, 1}^4,
    g = (x4, x3 AND x4, x2 AND x4, x1 AND x2 AND x3): {0, 1}^4 → {0, 1}^4.
Notice that f and g have the same dependency graph, and that g is a Boolean network constructed with NCFs while f is not. It is clear that the phase space of g in Fig. 7.6 has fewer components and much shorter limit cycles than the phase space of f in Fig. 7.5, and therefore g should be considered more stable than f. In Kauffman et al. (2004), the authors studied the dynamics of nested canalyzing Boolean networks over a variety of dependency graphs.
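Definition 7.2 can be checked mechanically by brute force for small functions. The following Python sketch (our own illustration; the helper name is invented) searches over all variable orders and canalyzing input values, reading the outputs bi off the truth table, and confirms the two examples discussed above.

    # Brute-force test of Definition 7.2; feasible only for small n.
    from itertools import permutations, product

    def is_nested_canalyzing(f, n):
        for order in permutations(range(n)):
            for a in product((0, 1), repeat=n):
                ok, b = True, []
                for i in range(n):
                    # outputs on states with x_order[j] != a[j] for j < i
                    # and x_order[i] == a[i]; must be a single constant b[i]
                    outs = {int(bool(f(x))) for x in product((0, 1), repeat=n)
                            if all(x[order[j]] != a[j] for j in range(i))
                            and x[order[i]] == a[i]}
                    if len(outs) != 1:
                        ok = False
                        break
                    b.append(outs.pop())
                if not ok:
                    continue
                # remaining case: all variables off their canalyzing values
                # must give NOT b[n-1], so that the last variable matters
                rest = {int(bool(f(x))) for x in product((0, 1), repeat=n)
                        if all(x[order[j]] != a[j] for j in range(n))}
                if rest == {1 - b[-1]}:
                    return True, order, a, tuple(b)
        return False, None, None, None

    f1 = lambda x: x[0] and (not x[1]) and x[2]     # x AND (NOT y) AND z
    f2 = lambda x: x[0] and x[1] and (x[2] ^ x[3])  # x AND y AND (z XOR w)
    print(is_nested_canalyzing(f1, 3))  # (True, (0, 1, 2), (0, 1, 0), (0, 0, 0))
    print(is_nested_canalyzing(f2, 4))  # (False, None, None, None)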
Figure 7.5 Phase space of the network f in Example 7.2.
Figure 7.6 Phase space of network g in Example 7.2.
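The stabilizing effect of nested canalyzing rules can also be probed numerically. The short sketch below (a crude one-step proxy for a Derrida plot, under the synchronous update assumption; it is our own construction, not the analysis of Kauffman et al.) flips one bit of every state of the two networks in Example 7.2 and measures the average Hamming distance of the images after one update step. A value below 1 means perturbations tend to shrink; the XOR network f gives 2.0, whereas the nested canalyzing network g gives about 0.94.

    # One-step divergence: average Hamming distance between the images of a
    # state and of a one-bit perturbation of it, over all states and bits.
    from itertools import product

    def f(x):  # XOR network of Example 7.2
        return (x[3], x[2] ^ x[3], x[1] ^ x[3], x[0] ^ x[1] ^ x[2])

    def g(x):  # nested canalyzing (AND) network of Example 7.2
        return (x[3], x[2] & x[3], x[1] & x[3], x[0] & x[1] & x[2])

    def one_step_divergence(net, n=4):
        total = count = 0
        for x in product((0, 1), repeat=n):
            for i in range(n):
                y = list(x)
                y[i] ^= 1
                total += sum(u != v for u, v in zip(net(x), net(tuple(y))))
                count += 1
        return total / count

    print(one_step_divergence(f), one_step_divergence(g))  # 2.0 0.9375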
In that study, for a given random graph on n nodes, where the in-degree of each node is chosen at random between 0 and k, with k ≤ n, an NCF in the input variables of each node is assigned to that node. The dynamics of these networks were then analyzed and their stability measured using Derrida plots. It is shown that nested canalyzing networks are remarkably stable regardless of the in-degree distribution, and that the stability increases as the average number of inputs per node increases. An extensive analysis of available biological data on gene regulation (about 150 genes) showed that 139 of them are regulated by canalyzing
functions (Harris et al., 2002; Nikolayewaa et al., 2007). In Kauffman et al. (2004) and Nikolayewaa et al. (2007), it was shown that 133 of the 139 are, in fact, nested canalyzing.

Most published molecular networks are given in the form of a wiring diagram, or dependency graph, constructed from experiments and prior published knowledge. However, for most of the molecular species in the network, little knowledge, if any, can be deduced about their regulatory mechanisms, for instance in the gene transcription networks in yeast (Herrgard et al., 2006) and E. coli (Barrett et al., 2005); each of these networks contains more than 1000 genes. Kauffman et al. (2003) investigated the effect of the topology of a subnetwork of the yeast transcriptional network for which many of the transcriptional rules are not known. They generated ensembles of different models, all with the same dependency graph. Their heuristic results imply that the dynamics of those models that used only NCFs were far more stable than those of the randomly generated models. Since it is already established that the yeast transcriptional network is stable, this suggests that the unknown interaction rules are very likely NCFs. In Balleza et al. (2008), the whole transcriptional network of yeast, which has 3459 genes, as well as the transcriptional networks of E. coli (1481 genes) and B. subtilis (840 genes), were analyzed in a similar fashion, with similar findings.

These heuristic and statistical results show that the class of NCFs is very important in systems biology. We showed in Jarrah et al. (2007) that this class is identical to the class of so-called unate cascade Boolean functions, which has been studied extensively in engineering and computer science. It was shown in Butler et al. (2005) that this class produces the binary decision diagrams with the shortest average path length. Thus, a more detailed mathematical study of this class of functions has applications to problems in engineering as well.

In this section, we have shown that algebraic models, in particular Boolean network models, play an important role in systems biology as models for a variety of molecular networks. They are also very useful in studying more theoretical questions, such as design principles for molecular networks. In Section 3, we will show how to construct such models from experimental data, in the top-down fashion discussed earlier.
3. Network Inference

In 2006, the first "Dialogue on Reverse-Engineering Assessment and Methods (DREAM)" workshop was held, supported in part by the NIH Roadmap Initiative. The rationale for this workshop is captured on the DREAM Web site (http://wiki.c2b2.columbia.edu/dream/index.php/The_DREAM_Project):
"The endless complexities of biological systems are orchestrated by intricate networks comprising thousands of interacting molecular species, including DNA, RNA, proteins, and smaller molecules. The goal of systems biology is to map these networks in ways that provide both fundamental understanding and new possibilities for therapy. However, although modern tools can provide rich data sets by simultaneously monitoring thousands of different types of molecules, discerning the nature of the underlying network from these observations—reverse engineering—remains a daunting challenge."

Traditionally, models of molecular regulatory systems in cells have been created bottom-up: the model is constructed piece by piece by adding new components and characterizing their interactions with other molecules in the model. This process requires that the molecular interactions have been well characterized, usually through quantitative numerical values for kinetic parameters. Note that the construction of such models is biased toward molecular components that have already been associated with the phenomenon. Still, modeling can be of great help in this bottom-up process, by revealing whether the current knowledge about the system is able to replicate its in vivo behavior. This modeling approach is well suited to complement experimental approaches in biochemistry and molecular biology, since models thus created can serve to validate the mechanisms determined in vitro by attempting to simulate the behaviors of intact cells. While this approach has been dominant in cellular modeling, it does not scale very well to genome-wide studies, since it requires that proteins be purified and studied in isolation. This is not a practical endeavor, due to its large scale, but especially because a large number of proteins act on small molecules that are not available in purified form, as would be required for in vitro studies.

With the completion of the human genome sequence and the accumulation of other fully sequenced genomes, research is moving away from the molecular biology paradigm to an approach characterized by large-scale molecular profiling and in vivo experiments (or, if not truly in vivo, at least in situ, where experiments are carried out with intact cells). Technologies such as transcript profiling with microarrays, protein profiling with 2D gels and mass spectrometry, and metabolite profiling with chromatography and mass spectrometry produce measurements that are large-scale characterizations of the state of the biological material probed. Other new large-scale technologies are also able to uncover groups of interacting molecules, delineating interaction networks. All these experimental methods are data rich, and it has been recognized (e.g., Brenner, 1997; Kell, 2004; Loomis and Sternberg, 1995) that modeling is necessary to transform such data into knowledge.

A new modeling approach is needed for large-scale profiling experiments. Such a top-down approach starts with little knowledge about the system, capturing at first only a coarse-grained image of the system with only a few variables. Then, through iterations of simulation and experiment,
the number of variables in the model is increased. At each iteration, novel experiments will be suggested by simulations of the model that provide data to improve it further, leading to a higher resolution in terms of mechanisms. While the processes of bottom-up and top-down modeling are distinct, both have as an objective the identification of molecular mechanisms responsible for cell behavior. Their main difference is that the construction of top-down models is biased by the data of the large-scale profiles, while bottom-up models are biased by preexisting knowledge of particular molecules and mechanisms.

While top-down modeling makes use of genome-wide profiling data, it is conceptually very different from other genome-wide data analysis approaches. Top-down modeling needs data produced by experiments suitable for the approach. One should not expect that a random combination of arbitrary molecular snapshots would be of much use for the top-down modeling process. Such snapshots may sometimes serve a purpose (e.g., variable selection), but overall, top-down modeling requires perturbation experiments carried out with appropriate controls. In the face of modern experimental methods, the development of an effective top-down modeling strategy is crucial. Furthermore, we believe that a combination of top-down and bottom-up approaches will eventually have to be used. An example of a first step in this direction is the apoptosis model in Bentele et al. (2004).

A variety of different network inference methods have been proposed in recent years, using different modeling frameworks, requiring different types and quantities of input data, and providing varying amounts of information about the system to be modeled. There are fundamentally three pieces of information one wants to know about a molecular network: (i) its wiring diagram, that is, the causal dependencies among the network nodes, for example, gene activation or repression; (ii) the "logical" structure of the interactions between the nodes, for example, multiplicative or additive interaction of transcription factors; and (iii) the dynamics of the network, for example, the number of steady states. At one end of the model spectrum are statistical models that capture correlations among network variables; these models might be called high level (Ideker and Lauffenburger, 2003). The output of methods at the other end of the spectrum is a system of ODEs, which models network dynamics and provides a wiring diagram of variable dependencies as well as a mechanistic description of node interactions. In between is a range of model types, such as information-theory-based models, difference equations, Boolean networks, and multistate discrete models.

The literature on top-down modeling, or network inference, has grown considerably in the last few years, and we provide here a brief and necessarily incomplete review. The majority of new methods that have appeared utilize statistical tools. At the high-level end of the spectrum, recent work has focused on the inference of relevance networks, first introduced in
Butte et al. (2000). Using pairwise correlations of gene expression profiles and appropriate threshold choices, an undirected network of connections is inferred. Partial correlations are considered in de la Fuente et al. (2004) for the same purpose. In Rice et al. (2005), conditional correlations based on gene perturbations are used to assign directionality and functionality to the edges in the network and to reduce the number of indirect connections. Another modification is the use of time-delayed correlations in Li et al. (2006b) to improve inference. Using mutual information instead of correlation, together with information-theoretic tools, the ARACNE algorithm of Margolin et al. (2006) reports an improvement in eliminating indirect edges in the network.

Probably the largest part of the recent literature on statistical models is focused on the use of dynamic Bayesian networks (DBNs) for reverse-engineering. Originally proposed in Friedman et al. (2000), the use of causal Bayesian network methods has evolved to focus on DBNs, to avoid the limitation that static Bayesian networks cannot capture feedback loops in the network. DBNs can be thought of as a sequence in time of Bayesian networks, which can represent feedback loops over time even though each of the individual Bayesian networks is an acyclic directed graph. A variety of DBN algorithms and software packages have been published; see, for example, Beal et al. (2005), Dojer et al. (2006), Friedman (2004), Nariai et al. (2005), Pournara and Wernisch (2004), Yu et al. (2004), and Zou and Conzen (2005). Probably the largest challenge to DBN methods, as to all other methods, is the typically small sample size available for microarray data. One proposed way to meet this challenge is bootstrapping, that is, the generation of synthetic data with a distribution similar to that of the experimental data; see, for example, Pe'er et al. (2001). These methods all provide as output a wiring diagram, in which each edge represents a statistically significant relationship between the two network nodes it connects. Other approaches that result in a wiring diagram include that of Tringe et al. (2004), which builds on prior work by Wagner (2001, 2004).

If time course data are available, it is useful to obtain a dynamic model of the network. There have been some recent results in this direction using Boolean network models, first introduced in Kauffman (1969). Each of the methods in Mehra et al. (2004), Kim et al. (2007), and Martin et al. (2007) either modifies or provides an alternative to the original Boolean inference methods in Liang et al. (1998), Akutsu et al. (1999, 2000a,b), and Ideker et al. (2000). Moving farther toward mechanistic models, an interesting network inference method resulting in a Markov chain model can be found in Ernst et al. (2007). Finally, methods using systems of differential equations include those of Andrec et al. (2005), Bansal et al. (2006, 2007), Chang et al. (2005), Deng et al. (2005), Gadkar et al. (2005), Gardner et al. (2003), Kim et al. (2007), and Yeung et al. (2002). Reverse-engineering methods have also been developed for the S-system formalism of Savageau (1991), which is a special case
of ODE-based modeling; see, for example, Kimura et al. (2005), Marino and Voit (2006), and Thomas et al. (2004).

Validating reverse-engineering methods and comparing their performance is very difficult at this time. Typically, each method is validated using data from simulated networks of different sizes and more or less realistic architecture. This is a crucial first step for any method, since it is important to measure its accuracy against a known network. The most common organism used for validation is yeast, with a wide collection of published data sets (very few of which are time course data). One of the key problems in comparing different methods is that they typically have different data requirements, ranging from time course data to steady-state measurements. Some methods require very specific perturbation experiments, whereas others need only a collection of single measurements. Some methods use continuous data; others, for instance most Bayesian network methods, require discretized data. Some methods take into account additional information, such as binding site motifs. At present, there is no agreed-upon suite of test networks that might provide a more objective comparison. Nonetheless, comparisons are beginning to be made (Bansal et al., 2007; Kremling et al., 2004; Werhli et al., 2006), but there, too, it is hard to interpret the results correctly. Most methods show some success with specialized data sets and particular organisms, but many theoretical and practical challenges remain in all cases. The stated goal of the DREAM effort mentioned at the beginning of this section is to develop precisely such a set of benchmark data sets, which can serve as a guide for method developers and a way to carry out more systematic comparisons.

We briefly describe two such methods here, one using parameter estimation for systems of differential equations as the principal tool, the other using statistical methods. A comparison of different methods, including these two, was carried out in Camacho et al. (2007). In Section 4, we describe in detail a method that has as output either a wiring diagram or a Boolean network, using the interpolation of data by a Boolean network as the main tool.

First, we describe a reverse-engineering method that uses multiple regression, proposed by Gardner et al. (2003), which is similar to the methods of de la Fuente and Mendes (2002) and Yeung et al. (2002). The method uses linear regression and requires data that are obtained by perturbing the variables of the network around a reference steady state. A crucial assumption of the method is that molecular networks are sparse, that is, each variable is regulated by only a few others; this method assumes no more than three regulatory inputs per node. The network is then recovered using multiple regression of the data. It estimates the coefficients in the Jacobian matrix of a generic system of linear differential equations representing the rates of change of the different variables. (Recall that the Jacobian matrix of a linear system of differential equations has as its entry in position (i, j) the coefficient of the variable xj in the equation for variable xi.) The assumption that the
wiring diagram of the network is sparse translates into the assumption that only a few of the entries of this matrix are different from 0.

The second method, originally published in Hartemink et al. (2002), uses the framework of so-called DBNs, a type of statistical model that gives as output a directed graph depicting causal dependency relations between the variables. These dependency relations are computed in terms of a time evolution of joint probability distributions on the variables, viewed as discrete random variables. The data required for this reverse-engineering method are time courses representing temporal responses to perturbations of the system from a steady state. The method has been implemented in the software package BANJO, described in Bernard and Hartemink (2005).
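To make the regression idea of the first method concrete, the following Python sketch (a toy illustration under simplifying assumptions: the three-gene Jacobian is invented, every node is perturbed once, and the sparsity constraint of Gardner et al. is replaced by plain least squares) recovers a Jacobian from simulated steady-state perturbation data. Near a steady state, dx/dt ≈ Jx + u, so after a sustained perturbation u the new steady state satisfies Jx = −u.

    import numpy as np

    rng = np.random.default_rng(0)
    J = np.array([[-1.0,  0.0,  0.5],
                  [ 0.8, -1.0,  0.0],
                  [ 0.0, -0.6, -1.0]])          # hypothetical sparse Jacobian

    U = np.eye(3) * 0.1                         # one perturbation per gene
    X = np.linalg.solve(J, -U)                  # columns: perturbed steady states
    X += 0.001 * rng.standard_normal(X.shape)   # measurement noise

    # Recover J from J X = -U, i.e., solve X^T J^T = -U^T by least squares.
    J_hat, *_ = np.linalg.lstsq(X.T, -U.T, rcond=None)
    print(np.round(J_hat.T, 2))                 # close to J; zeros reveal sparsity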
4. Reverse-Engineering of Discrete Models: An Example

4.1. Boolean networks: Deterministic and stochastic

As mentioned in the introduction, Boolean networks were first proposed as models for gene regulatory networks by Kauffman (1969), where a gene is considered either expressed (1) or not expressed (0), and the state of a gene is determined by the states of its immediate neighbors in the network. One interpretation of the different attractors (steady states, in particular) could be as the different phenotypes into which a cell will differentiate, starting from an arbitrary initialization. As there is evidence that gene regulation as well as metabolism can be stochastic, Boolean networks, which are deterministic, have been generalized to account for stochasticity. Boolean networks with perturbations (BNps) and probabilistic Boolean networks (PBNs) have been developed for modeling noisy gene regulatory networks; see, for example, Akutsu et al. (2000a,b), Shmulevich et al. (2002, p. 225), and Yu et al. (2004).

Definition 7.3 A Boolean network with perturbations (BNp) is a Boolean network where, at each time step, the state of a randomly chosen node is flipped with some probability p. That is, if the current state of the network is x = (x1, . . ., xn), the next state y is determined as follows. A unit vector e = (e1, . . ., en) is chosen at random, where ei is zero for all coordinates except one coordinate j, for which ej = 1. Then y = f(x) + e with probability p, and y = f(x) with probability 1 − p.

In particular, in a BNp, one and only one node can be perturbed, with probability p, at each time step. The phase space of a BNp is thus a labeled directed graph, where the vertices are all possible states of the network, and the label of an edge (x, y) is the probability that y = f(x) + e
for some unit vector e. It is clear that the out-degree of each node x is n, where the edge (x, y) is labeled with 1 − p if y = f(x), and with p if y = f(x) + e for some unit vector e.

Definition 7.4 A PBN is a Boolean network in which each node may have more than one update function, in the form of a family of Boolean functions together with a probability distribution. When it is time to update the state of a node, a function is chosen at random from the family of that node and is used to decide its new state. Namely, for each node xi in the network, let {f, g, . . .} be the set of local functions, where the probability of choosing f is p, that of choosing g is q, etc., and p + q + · · · = 1. Then the phase space of the network is a labeled directed graph with a directed edge (x, y) if, for all i, we have yi = f(x) for some local update function f for node i. The label on the edge is the product over the coordinates of the probability of each coordinate, where the probability of coordinate i is the sum of the probabilities of all local functions f for node i such that yi = f(x).

Example 7.3 Consider the network on three nodes in Fig. 7.7 below, and let f11 = x1 OR x2 and f12 = x2 AND x3 be the local functions of node 1, with probabilities 0.7 and 0.3, respectively. Suppose node 2 has only one local function, f2 = x2 OR x3, and suppose node 3 has two functions: f31 = x1 AND x2 with probability 0.4, and f32 = x2 with probability 0.6. The phase space of this network is depicted in Fig. 7.8.

Another way to introduce stochasticity into a deterministic model is by updating the network nodes asynchronously, where, at each time step, the order in which they are updated is chosen at random.
Figure 7.7 The wiring diagram of the probabilistic Boolean network in Example 7.3.
Figure 7.8 The phase space of the probabilistic Boolean network in Example 7.3. Notice that there are three fixed points: 000 and 111 with probability 1, while the state 010 is a fixed point with probability 0.12. Furthermore, the two states 011 and 110 form a limit cycle with probability 0.12.
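The transition probabilities in Fig. 7.8 can be recomputed directly from Definition 7.4. The following Python sketch (our own encoding of Example 7.3, not the software used to draw the figure) evaluates a few edge labels.

    # Transition probabilities of the PBN of Example 7.3. The probability of
    # the edge x -> y is the product over nodes i of the total probability of
    # choosing a local function that sends x to y_i.
    pbn = [
        [(lambda x: x[0] or x[1], 0.7), (lambda x: x[1] and x[2], 0.3)],  # node 1
        [(lambda x: x[1] or x[2], 1.0)],                                  # node 2
        [(lambda x: x[0] and x[1], 0.4), (lambda x: x[1], 0.6)],          # node 3
    ]

    def transition_prob(x, y):
        prob = 1.0
        for funcs, yi in zip(pbn, y):
            prob *= sum(p for fn, p in funcs if int(bool(fn(x))) == yi)
        return prob

    for s in [(0, 0, 0), (1, 1, 1), (0, 1, 0)]:
        print(s, '->', s, ':', round(transition_prob(s, s), 2))
    # 000 and 111 are fixed points with probability 1; 010 maps to itself
    # with probability 0.3 * 1.0 * 0.4 = 0.12, as in the caption of Fig. 7.8.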
Such asynchronously updated networks are clearly biologically relevant, since the order in which different events and processes take place can change the outcome. In Chaves et al. (2005), the authors present a stochastic update-order model of the segment polarity network in Drosophila. The model captures aspects of this biological process that the original model in Albert and Othmer (2003) did not account for. One way to accomplish update-stochastic simulation of deterministic Boolean networks is to represent them as PBNs in which each node has two local update functions, the original one and the identity function, with an appropriate choice of probabilities. This approach is implemented in the parameter estimation package Polynome, which we discuss below.

Next, we briefly review some of the known network inference methods for BNps and PBNs. It goes without saying that, to infer a Boolean network from experimental data sets, one has to start by assuming that each node in the network can only be in one of two states at any given time. In particular, the data used for inferring the network must also be binary, and hence the
experimental data first have to be discretized into two qualitative states. There are several different methods for discretizing continuous data; see, for example, Dimitrova et al. (2008). However, for the Boolean case, all methods come down to deciding the proper threshold that should be used to decide whether a given molecular species is present (1) or absent (0). For DNA microarray data, this may be done, for example, by choosing a fold change above which a gene is considered upregulated compared to a control value, or by inspection of a time course.
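As a toy illustration of such a threshold rule (the numbers and the fold-change cutoff here are invented), a fold-change criterion can be applied directly:

    # A gene is called present (1) when its expression exceeds a chosen
    # fold change over the control value; otherwise it is absent (0).
    def booleanize(values, control, fold=2.0):
        return [1 if v >= fold * control else 0 for v in values]

    print(booleanize([0.8, 2.5, 5.1, 1.2], control=1.0))  # [0, 1, 1, 0]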
4.2. Inferring Boolean networks

As described above, Boolean networks have emerged as a powerful framework for modeling and simulating gene regulatory networks. Therefore, it is natural to infer these networks from experimental data, and different methods have been proposed. Liang et al. (1998) pioneered this approach with the algorithm REVEAL, in which information-theoretic principles are applied to reduce the search space. In Akutsu et al. (1999), the authors proposed a simple algorithm for identifying a Boolean network from a data set, assuming that the in-degree of each node is relatively small, and discussed requirements on the data for such an identification to be possible. Recently, Martin et al. (2007) presented an algorithm for identifying all activation–inhibition Boolean networks (networks in which each edge is either an activator or a strong inhibitor) that fit a given data set. Here, too, a small upper bound on the in-degree of each node is assumed. The dynamics of the identified Boolean networks are then used to shed light on the biological system. Using the Boolean framework, Ideker et al. presented a method that identifies a minimal wiring diagram of the network from time course data (Ideker et al., 2002). The network is minimal in the sense that each edge in the network is essential to reproduce the time course data.
4.3. Inferring stochastic Boolean networks

Deterministic Boolean network models seem inadequate for modeling some biological systems, as uncertainty is a prominent feature of many known systems, due either to hidden variables, intrinsic or extrinsic noise, or measurement noise. Different algorithms have been proposed for inferring stochastic Boolean networks within the framework of PBNs (Ching et al., 2005; Shmulevich et al., 2002, 2003a) or BNps (Yu et al., 2004; Akutsu et al., 2000a,b). Shmulevich and his collaborators developed an inference method that identifies a set of local functions for a given node using either time course data or a set of steady states (Shmulevich et al., 2002). For each node xi in the network, a set of local Boolean functions Xi = {f, g, . . .} is assigned with
probabilities Pi = {p, q, . . .}. The sets Xi and Pi correspond to the highest coefficients of determination (CoD) of the node xi relative to randomly chosen subsets of variables that could be possible input sets for node xi. On the other hand, Yu et al. (2004) presented an algorithm for inferring a Boolean network with perturbations from steady-state data. Based on certain assumptions about the size of the basin of attraction of each observed state and the lengths of transients, a matrix describing the transitions between different attractors is computed.

Some of the algorithms mentioned above have been implemented as either C++ code, such as the algorithm of Akutsu et al., or within other software packages, such as the algorithms of Shmulevich et al. (2002), which require the commercial software package Matlab. Furthermore, experimental data need to be Booleanized before applying these algorithms. In Section 4.4, we describe the software package Polynome (Dimitrova et al., 2009), which incorporates several different algorithms using tools from computational algebra and algebraic geometry. The software is capable of inferring wiring diagrams as well as deterministic and probabilistic Boolean networks. Furthermore, the software can be used to simulate and explore the dynamics of the inferred network.
4.4. Polynome: Parameter estimation for Boolean models of biological networks

As described earlier, the goal of parameter estimation is to use experimental time course data to determine missing information in the description of a Boolean network model for the biological system from which the data were generated. This can be done with either partial or no prior information about the wiring diagram and dynamics of the system. Polynome will infer either a static wiring diagram alone or a dynamical model, with both deterministic and stochastic model choices. The software is available via a Web interface at http://polymath.vbi.vt.edu/polynome. Figure 7.9 shows a screenshot of the interface of Polynome.

The main idea behind the algebraic approach underlying Polynome is that any Boolean function can be written uniquely as a polynomial in which the exponent of any variable is either 0 or 1 (hence the name). The dictionary is constructed from the basic correspondence:

    x AND y = xy;
    x OR y  = x + y + xy;
    NOT x   = x + 1.
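This dictionary is easy to verify exhaustively; the following short Python check (an illustration only) confirms the correspondence on all 0/1 inputs, with arithmetic taken mod 2.

    from itertools import product

    for x, y in product((0, 1), repeat=2):
        assert int(bool(x and y)) == (x * y) % 2
        assert int(bool(x or y)) == (x + y + x * y) % 2
        assert int(bool(not x)) == (x + 1) % 2
    print('Boolean-to-polynomial dictionary verified on all inputs')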
Therefore, any Boolean network f can be written as a polynomial dynamical system

    f(x1, . . ., xn) = f(x) = (f1(x), . . ., fn(x)): {0, 1}^n → {0, 1}^n,

where the polynomial function fi is used to compute the next state of node i in the network.
Figure 7.9 A screenshot of POLYNOME at http://polymath.vbi.vt.edu/polynome.
For example, the Boolean network in Eq. (7.1) has the following polynomial form:

    f(x1, . . ., x4) = (x1x3, 1 + x4 + x2x4, x1, x1 + x2 + x1x2).

The Boolean network from Eq. (7.2) has the polynomial form

    f(x1, . . ., x9) = (x4(x5 + 1), x1, x1, 1, (x6 + 1)(x7 + 1), x3x8,
                       x6 + x8 + x9 + x6x8 + x6x9 + x8x9 + x6x8x9, x2, 1),

where g = 0 and a = 1, and (x1, . . ., x9) = (M, P, B, C, R, A, Alow, L, Llow). Studying Boolean networks as polynomial dynamical systems has many advantages, primarily that within the polynomial framework a wide variety of algorithmic and theoretical tools from computer algebra and algebraic geometry can be applied. The remainder of the section will describe the parameter estimation algorithms implemented in Polynome and illustrate the software using the lac operon example described in detail above. This example is also used in Dimitrova et al. (2009) to validate the software.

Input. The user can input two kinds of information. The first kind consists of one or more time courses of experimental data. While several different data types are possible, we will focus here on DNA microarray data, for the sake of simplicity. If the network model to be estimated has nodes x1, . . ., xn, then a data point consists of a vector (a1, . . ., an) of measurements, one for each gene in the network. Since the model is Boolean, the first step is to discretize the input data into two states, 0 and 1. The user can provide a threshold to discriminate between the two states. As the default, Polynome uses the algorithm in Dimitrova et al. (2008), which incorporates an information-theoretic criterion and is designed to preserve dynamic features of the continuous time series and to be robust to noise in the data.

The second type of input consists of biological information. This can take the form of known edges in the wiring diagram or known Boolean functions for some of the network nodes. Recall that an edge from node xi to node xj in the wiring diagram indicates that node xi exerts causal influence on the regulation of node xj; in other words, the variable xi appears in the local update function fj for xj. In the absence of this type of information, the problem is equivalent to what is often called reverse-engineering of the network, that is, network inference using exclusively system-level data about the network.

To understand the algorithms, it is necessary to clarify the relationship between the input data and the networks produced by the software. The software produces networks that fit the given experimental data in the following sense. Suppose that the input consists of a time course s1, . . ., st,
where each si ∈ {0, 1}^n. Then we say that a Boolean network f fits the given time course if f(si) = si+1 for all i.

Software output. There are five types of output the user can request. We briefly describe these and the algorithms used to obtain them.

A static wiring diagram of the network, that is, a directed graph whose vertices are the nodes of the network and whose edges indicate causal regulatory relationships. Since there is generally more than one such diagram for the given information (unless a complete wiring diagram is already provided as input), the user can request either a diagram with weights on the edges, indicating the probability of a particular edge being present, or a collection of top-scoring diagrams. The algorithm used for this purpose has been published in Jarrah et al. (2009). It computes all possible wiring diagrams of Boolean networks that fit the given data and outputs only the minimal ones. Here, a wiring diagram is minimal if it is not possible to remove an edge and still obtain a wiring diagram of a model that fits the given data. In this sense, the output is similar to that in Ideker et al. (2002). However, the approach in Jarrah et al. (2009) is to encode the family of all wiring diagrams as an algebraic object, a certain monomial ideal, which has the advantage that ALL minimal wiring diagrams can be calculated, in contrast to a diagram produced by a heuristic search.

A deterministic dynamic model in the form of a Boolean network that fits the given data exactly and satisfies the constraints imposed by the input on the wiring diagram and the Boolean functions, using the algorithm described in Laubenbacher and Stigler (2004). This is done by first computing the set of all Boolean networks that fit the given data and the constraints. Using tools from computational algebra, this can be done by describing the entire set of models, that is, the entire parameter space, in a way similar to the description of the set of all solutions to a system of nonhomogeneous linear equations. As in that case, if f and g are two Boolean networks that fit the given data set, that is, f(st) = st+1 = g(st) for all t, then (f − g)(st) = 0 for all t. Hence, all networks that fit the data can be found by finding one particular model f and adding to it any Boolean network g such that g(st) = 0 for all t. The space of all such g can be described by a type of basis that is similar to a vector space basis for the null space of a homogeneous system of linear equations. (A small code sketch of the particular-solution idea is given after this list of output types.)

A PBN that fits the given data. That is, the network has a family of update functions for each node, together with a probability distribution on the functions, as described earlier. This network has the property that, for any choice of function at any update, the network fits the given data exactly. The network is constructed using an algorithm that builds on the one described in Dimitrova et al. (2007).
A Boolean network that optimizes data fit and model complexity. In contrast to the previous two choices of output, this network does not necessarily fit the given data exactly but is instead optimized with respect to both data fit and model complexity. This is a good model choice if the data are assumed to contain significant noise, since it reduces the tendency to overfit the data with a complex model. This option uses an evolutionary algorithm (Vera-Licona et al., 2009) that is computationally intensive and is only feasible for small networks at this time.

A deterministic model that is simulated stochastically. This model is constructed by estimating Boolean functions that fit the data exactly when simulated with synchronous update. The network is then simulated using a stochastic update order; that is, the simulated network may not fit the given data exactly, but it will have the same steady states as the synchronous model. The stochastic update order is obtained by representing the deterministic system as a PBN by adding the identity function to each node. At a given update, if the identity function is chosen, this represents a delay of the corresponding variable. By choosing an appropriate probability distribution, one can in this way simulate a stochastic sequential update order. The resulting phase space is a complete graph, with transition probabilities on the edges. This approach is also computationally very intensive, so this option is only feasible for small networks.
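The particular-solution idea mentioned above can be illustrated in a few lines of Python (a sketch under simplifying assumptions: a toy two-node data set, distinct observed states, and GF(2) indicator polynomials; this is our own illustration, not the Polynome implementation). One interpolating Boolean polynomial for node i is fi(x) = Σt st+1,i Πj (xj + st,j + 1) (mod 2), and any other network fitting the data differs from it by a network vanishing on all observed states.

    # Indicator polynomial of a state s over GF(2): equals 1 exactly at x = s.
    def indicator(s):
        return lambda x: all((xi + si + 1) % 2 for xi, si in zip(x, s))

    # Particular Boolean function for coordinate i fitting all transitions.
    def interpolate(transitions, i):
        return lambda x: sum(nxt[i] for s, nxt in transitions
                             if indicator(s)(x)) % 2

    transitions = [((0, 0), (1, 0)), ((1, 0), (1, 1)), ((1, 1), (0, 1))]
    f0 = interpolate(transitions, 0)
    print([f0(s) for s, _ in transitions])  # [1, 1, 0]: fits the data exactly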
4.5. Example: Inferring the lac operon

In this section, we demonstrate some of the features of Polynome by applying it to data generated from the Boolean lac operon model in Eq. (7.2) above. That is, we take the approach that this model represents the biological system we want to construct a model of, based on "experimental" data generated directly from the model. This approach has the advantage that it is straightforward to estimate the performance of the estimation algorithm in this case. The data in Table 7.1 include four time courses: all molecules are high, only R is high, only M is high, and only L and Llow are high.

Table 7.1 A set of time courses from the lac operon model, generated using Eq. (7.2)

    All are high:        111111111, 011101111, 100101111, 111100101, 111100111,
                         111101111, 111101111
    R is high:           000010000, 000110001, 000110101, 000100101, 100100101,
                         111100101, 111100111, 111101111, 111101111
    M is high:           100000000, 011110001, 000110111, 000100101, 100100101,
                         111100101, 111100111, 111101111, 111101111
    L and Llow are high: 000000011, 000110101, 000100101, 100100101, 111100101,
                         111100111, 111101111, 111101111
Table 7.2 shows a PBN (in polynomial form) inferred from the data in Table 7.1 using Polynome. Here, for each node, a list of update functions and their probabilities is given; the functions marked with an asterisk are those with probability higher than 0.1 (a threshold provided by the user). Notice that the true function x4(x5 + 1) = x4x5 + x4 for x1 appears in the list of inferred functions for x1 with the second highest probability, as does the true function x3x8 for x6. The inferred functions with the highest probability for nodes 2, 3, and 4 are the correct ones. In the case of node 7, the highest-probability inferred polynomial, x9, is clearly not the "true" function, which is x6 OR x8 OR x9. However, it is important to remember that we are using four time courses involving only 26 states out of a phase space of 512 states. Parameter estimation methods cannot recover information about the network that is missing from the data.

The phase space of this system has 512 states and many edges connecting them, so a visual inspection of the phase space graph is not possible. Polynome in this case provides a summary of the dynamics that includes the number of components and the number of limit cycles of each possible length, as well as the stability of these cycles. Here, the stability of a cycle is the probability of remaining in that cycle. Table 7.3 shows that the inferred system in Table 7.2 has a single component with the steady state (111101111), whose stability is 0.33. Note that the original Boolean lac operon model in Eq. (7.2) has only one component and the same steady state as the inferred model! The wiring diagram of the inferred network is shown in Fig. 7.10.
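As a quick consistency check on Tables 7.2 and 7.3, one can verify in a few lines of Python (an illustration; the encoding is ours) that the state (1 1 1 1 0 1 1 1 1) is mapped to itself by every choice among the starred local functions, so it is indeed a fixed point of the inferred system.

    # Starred (probability > 0.1) local functions of Table 7.2, over GF(2);
    # the state tuple is padded so that x[i] is the value of node i.
    starred = {
        1: [lambda x: x[5] + x[4] + x[1]*x[9] + x[9] + x[1] + 1,
            lambda x: x[4]*x[5] + x[4],
            lambda x: x[5]*x[7] + x[7]],
        2: [lambda x: x[1]],
        3: [lambda x: x[1]],
        4: [lambda x: 1],
        5: [lambda x: x[7] + 1],
        6: [lambda x: x[2]*x[8], lambda x: x[3]*x[8]],
        7: [lambda x: x[4]*x[8] + x[4] + x[8], lambda x: x[9]],
        8: [lambda x: x[2], lambda x: x[3]],
        9: [lambda x: 1],
    }

    s = (None, 1, 1, 1, 1, 0, 1, 1, 1, 1)  # pad index 0; nodes are 1..9
    print(all(f(s) % 2 == s[i] for i in range(1, 10) for f in starred[i]))
    # True: every starred choice fixes the state (1 1 1 1 0 1 1 1 1)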
5. Discussion

Mathematical models have become an important tool in the repertoire of systems biologists wanting to understand the structure and dynamics of complex biological networks. Our focus has been on algebraic models and on methods for constructing them from experimental time course data. For differential equations-based models, the standard approach to dealing with unknown model parameters is to estimate them by fitting the model to experimental data. The same approach is taken here to the estimation of unknown model parameters in an algebraic model. We have described several approaches to this problem in the literature. For one of these approaches, implemented in the software package Polynome, we have provided a detailed guide to how the software can be used with experimental data via a Web interface.

The extreme case of parameter estimation is the lack of any prior biological information, so that the network is to be inferred from experimental data alone. This is typically referred to as reverse-engineering or network inference.
Table 7.2 A probabilistic Boolean model inferred from the data in Table 7.1 using Polynome. For each node, the list of local update functions is given, each followed by its probability; an asterisk marks functions with probability higher than 0.1

    f1 = {
      x5*x8 + x1*x5 + x5 + x2*x6 + x2*x8 + x6 + x1*x7 + x8 + x1 + 1   (0.0222222)
      x5 + x7*x8 + x1*x9 + x8 + x1 + 1                                (0.0222222)
      x5 + x4 + x1*x9 + x9 + x1 + 1                                   (0.133333) *
      x5 + x1*x4 + x4 + x9 + x1 + 1                                   (0.0666667)
      x4*x5 + x4                                                      (0.2) *
      x5 + x7*x8 + x1*x7 + x8 + x1 + 1                                (0.0666667)
      x5*x7 + x7                                                      (0.244444) *
      x5*x9 + x4                                                      (0.0444444)
      x2*x5 + x5*x8 + x5 + x1*x4 + x1*x2 + x2 + x1*x8 + x8 + x1 + 1   (0.0222222)
      x5 + x4*x8 + x1*x4 + x8 + x1 + 1                                (0.0666667)
      x5*x9 + x7*x8 + x8 + x9                                         (0.0222222)
      x5 + x4 + x1*x7 + x9 + x1 + 1                                   (0.0444444)
      x5*x6 + x5*x8 + x5*x9 + x2*x6 + x2*x8 + x6 + x8 + x9            (0.0222222)
      x5*x6 + x5*x8 + x5 + x1*x4 + x1*x6 + x6 + x1*x8 + x8 + x1 + 1   (0.0222222)
    }
    f2 = x1
    f3 = x1
    f4 = 1
    f5 = x7 + 1
    f6 = {
      x2*x5 + x1*x5 + x2*x6 + x1*x2 + x6 + x2 + x1*x8                 (0.0222222)
      x5*x6 + x5*x8 + x3*x6 + x4 + x6 + x8 + x9                       (0.0222222)
      x3*x6 + x1*x6 + x1*x8                                           (0.0444444)
      x5*x8 + x5*x7 + x1*x5 + x5 + x3*x6 + x6 + x1*x7 + x8 + x7 + x1 + 1  (0.0444444)
      x2*x8                                                           (0.377778) *
      x3*x8                                                           (0.355556) *
      x2*x6 + x1*x6 + x1*x8                                           (0.0888889)
      x3*x6 + x6 + x1*x8 + x3*x7 + x1*x3                              (0.0222222)
      x5*x6 + x5*x8 + x5*x7 + x5 + x2*x6 + x6 + x1*x7 + x8 + x7 + x1 + 1  (0.0222222)
    }
    f7 = {
      x5*x8 + x1*x5 + x2*x6 + x2*x8 + x4 + x6 + x8                    (0.0222222)
      x5*x6 + x5*x8 + x4 + x1*x6 + x6 + x1*x8 + x8                    (0.0222222)
      x4*x8 + x4 + x8                                                 (0.111111) *
      x5*x7 + x5 + x4 + x1*x7 + x7 + x1 + 1                           (0.0444444)
      x9                                                              (0.666667) *
      x4*x5 + x5 + x1*x4 + x1 + 1                                     (0.0666667)
      x4 + x7*x8 + x8                                                 (0.0444444)
      x2*x5 + x5*x8 + x4 + x1*x2 + x2 + x1*x8 + x8                    (0.0222222)
    }
    f8 = {
      x2                                                              (0.511111) *
      x3                                                              (0.488889) *
    }
    f9 = 1
Table 7.3 The analysis of the phase space of the probabilistic Boolean network using the local functions with probability more than 0.1 (the starred functions in Table 7.2). This summary is provided by Polynome (and by the simulation software package DVD, http://dvd.vbi.vt.edu) in case the phase space is too large to visualize

    Analysis of the phase space [m = 2, n = 9]
    Number of components: 1
    Number of fixed points: 1
    Fixed point, component size, stability: (1 1 1 1 0 1 1 1 1), 512, 0.33
Figure 7.10 The wiring diagram of the inferred network in Table 7.2.
Many different methods have been published for this "top-down" approach to modeling. There are still significant challenges ahead, arising primarily from the lack of sufficiently large, appropriately collected time course data sets. Nonetheless, the field has advanced to the point where there are some first successes. It is our hope that this chapter has encouraged the reader to try this approach to data modeling, whether using algebraic models or others based on differential equations or statistics.
References

Akutsu, T., Miyano, S., et al. (1999). Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. Pac. Symp. Biocomput. 17–28.
Akutsu, T., Miyano, S., et al. (2000a). Algorithms for inferring qualitative models of biological networks. Pac. Symp. Biocomput. 293–304.
Akutsu, T., Miyano, S., et al. (2000b). Inferring qualitative relations in genetic networks and metabolic pathways. Bioinformatics 16(8), 727–734.
Albert, R., and Othmer, H. G. (2003). The topology of the regulatory interactions predicts the expression pattern of the segment polarity genes in Drosophila melanogaster. J. Theor. Biol. 223(1), 1–18.
Alon, U. (2006). An Introduction to Systems Biology: Design Principles of Biological Circuits. CRC Press, Boca Raton, FL.
Andrec, M., Kholodenko, B. N., et al. (2005). Inference of signaling and gene regulatory networks by steady-state perturbation experiments: Structure and accuracy. J. Theor. Biol. 232(3), 427.
Balleza, E., Alvarez-Buylla, E. R., et al. (2008). Critical dynamics in gene regulatory networks: Examples from four kingdoms. PLoS One 3(6), e2456.
Bansal, M., Gatta, G. D., et al. (2006). Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics 22(7), 815–822.
Bansal, M., Belcastro, V., et al. (2007). How to infer gene networks from expression profiles. Mol. Syst. Biol. 3, 78. doi:10.1038/msb4100120.
Barrett, C. B., Herring, C. D., et al. (2005). The global transcriptional regulatory network for metabolism in Escherichia coli exhibits few dominant functional states. Proc. Natl. Acad. Sci. USA 102(52), 19103–19108.
Beal, M. J., Falciani, F., et al. (2005). A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics 21(3), 349–356.
Bentele, M., Lavrik, I., et al. (2004). Mathematical modeling reveals threshold mechanism in CD95-induced apoptosis. J. Cell Biol. 166(6), 839–851.
Bernard, A., and Hartemink, A. (2005). Informative structure priors: Joint learning of dynamic regulatory networks from multiple types of data. Pac. Symp. Biocomput. 459–470.
Brenner, S. (1997). Loose ends. Curr. Biol. 73.
Bruggemann, F. J., and Westerhoff, H. (2006). The nature of systems biology. Trends Microbiol. 15(1), 45–50.
Butler, J. T., Tsutomu, S., et al. (2005). Average path length of binary decision diagrams. IEEE Trans. Comput. 54(9), 1041–1053.
Butte, A. J., Tamayo, P., et al. (2000). Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. PNAS 97(22), 12182–12186.
Camacho, D., Vera-Licona, P., et al. (2007). Comparison of reverse-engineering methods using an in silico network. Ann. NY Acad. Sci. 1115, 73–89.
Chang, W.-C., Li, C.-W., et al. (2005). Quantitative inference of dynamic regulatory pathways via microarray data. BMC Bioinform. 6(1), 44.
Chaves, M., Albert, R., et al. (2005). Robustness and fragility of Boolean models for genetic regulatory networks. J. Theor. Biol. 235, 431–449.
Ching, W. K., Ng, M. M., Fung, E. S., and Akutsu, T. (2005). On construction of stochastic genetic networks based on gene expression sequences. Int. J. Neural Syst. 15(4), 297–310.
Davidich, M. I., and Bornholdt, S. (2007). Boolean network model predicts cell cycle sequence of fission yeast. PLoS One 3(2), e1672.
deBoer, R. J. (2008). Theoretical biology. Undergraduate course at Utrecht University, available at http://theory.bio.uu.nl/rdb/books/.
de la Fuente, A., and Mendes, P. (2002). Quantifying gene networks with regulatory strengths. Mol. Biol. Rep. 29(1–2), 73–77.
de la Fuente, A., Bing, N., et al. (2004). Discovery of meaningful associations in genomic data using partial correlation coefficients. Bioinformatics 20(18), 3565–3574.
Deng, X., Geng, H., et al. (2005). EXAMINE: A computational approach to reconstructing gene regulatory networks. Biosystems 81(2), 125.
Dimitrova, E., Jarrah, A., et al. (2007). A Groebner-fan-based method for biochemical network modeling. In Proceedings of the International Symposium on Symbolic and Algebraic Computation. ACM, Waterloo, Canada.
Dimitrova, E., Vera-Licona, P., et al. (2008). Data discretization for reverse-engineering: A comparative study (under review).
Dimitrova, E., Garcia-Puente, L., et al. (2009). Parameter estimation for Boolean models of biological networks. Theor. Comp. Sci. (in press).
Dojer, N., Gambin, A., et al. (2006). Applying dynamic Bayesian networks to perturbed gene expression data. BMC Bioinform. 7(1), 249.
Ernst, J., Vainas, O., et al. (2007). Reconstructing dynamic regulatory maps. Mol. Syst. Biol. 3, 74.
Espinosa-Soto, C., Padilla-Longoria, P., et al. (2004). A gene regulatory network model for cell-fate determination during Arabidopsis thaliana flower development that is robust and recovers experimental gene expression profiles. Plant Cell 16(11), 1923–1939.
Faure, A., Naldi, A., et al. (2006). Dynamical analysis of a generic Boolean model for the control of the mammalian cell cycle. Bioinformatics 22(14), 124–131.
Friedman, N. (2004). Inferring cellular networks using probabilistic graphical models. Science 303(5659), 799–805.
Friedman, N., Linial, M., et al. (2000). Using Bayesian networks to analyze expression data. J. Comput. Biol. 7(3–4), 601–620.
Gadkar, K., Gunawan, R., et al. (2005). Iterative approach to model identification of biological networks. BMC Bioinform. 6(1), 155.
Gardner, T. S., di Bernardo, D., et al. (2003). Inferring genetic networks and identifying compound mode of action via expression profiling. Science 301(5629), 102–105.
Gat-Viks, I., and Shamir, R. (2003). Chain functions and scoring functions in genetic networks. Bioinformatics 19, 108–117.
Gonzalez, A., Chaouiya, C., et al. (2008). Logical modelling of the role of the Hh pathway in the patterning of the Drosophila wing disc. Bioinformatics 24(16), 234–240.
Gupta, S., Bisht, S. S., et al. (2007). Boolean network analysis of a neurotransmitter signaling pathway. J. Theor. Biol. 244(3), 463–469.
Harris, S. E., Sawhill, B. K., et al. (2002). A model of transcriptional regulatory networks based on biases in the observed regulation rules. Complex Syst. 7(4), 23–40.
Hartemink, A., Gifford, D., et al. (2002). Bayesian methods for elucidating genetic regulatory networks. IEEE Intell. Syst. 17, 37–43.
Herrgard, M. J., Lee, B. S., et al. (2006). Integrated analysis of regulatory and metabolic networks reveals novel regulatory mechanisms in Saccharomyces cerevisiae. Genome Res. 16, 627–635.
Ideker, T. E., and Lauffenburger, D. (2003). Building with a scaffold: Emerging strategies for high- to low-level cellular modeling. Trends Biotechnol. 21(6), 256–262.
Ideker, T. E., Thorsson, V., et al. (2000). Discovery of regulatory interactions through perturbation: Inference and experimental design. Pac. Symp. Biocomput. 5, 305–316.
Ideker, T. E., Ozier, O., et al. (2002). Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 18(Suppl. 1), S233–S240.
Jarrah, A., Raposa, B., et al. (2007). Nested canalyzing, unate cascade, and polynomial functions. Physica D 233(2), 167–174.
Jarrah, A., Laubenbacher, R., Stigler, B., and Stillman, M. (2007). Reverse-engineering polynomial dynamical systems. Adv. Appl. Math. 39, 477–489.
Kauffman, S. A. (1969). Metabolic stability and epigenesis in randomly constructed genetic nets. J. Theor. Biol. 22(3), 437–467.
Kauffman, S. A., Peterson, C., et al. (2003). Random Boolean network models and the yeast transcriptional network. Proc. Natl. Acad. Sci. USA 100(25), 14796–14799.
Kauffman, S. A., Peterson, C., et al. (2004). Genetic networks with canalyzing Boolean rules are always stable. Proc. Natl. Acad. Sci. USA 101(49), 17102–17107.
Kell, D. B. (2004). Metabolomics and systems biology: Making sense of the soup. Curr. Opin. Microbiol. 7(3), 296–307.
Kim, J., Bates, D., et al. (2007). Least-squares methods for identifying biochemical regulatory networks from noisy measurements. BMC Bioinform. 8(1), 8.
Kimura, S., Ide, K., et al. (2005). Inference of S-system models of genetic networks using a cooperative coevolutionary algorithm. Bioinformatics 21(7), 1154–1163.
Kremling, A., Fischer, S., et al. (2004). A benchmark for methods in reverse engineering and model discrimination: Problem formulation and solutions. Genome Res. 14(9), 1773–1785.
Laubenbacher, R., and Stigler, B. (2004). A computational algebra approach to the reverse engineering of gene regulatory networks. J. Theor. Biol. 229, 523–537.
Laubenbacher, R., and Sturmfels, B. (2009). Computer algebra in systems biology. Am. Math. Mon. (in press).
Li, F., Long, T., et al. (2004). The yeast cell-cycle network is robustly designed. Proc. Natl. Acad. Sci. USA 101(14), 4781–4786.
Li, S., Assman, S. M., et al. (2006a). Predicting essential components of signal transduction networks: A dynamic model of guard cell abscisic acid signaling. PLoS Biol. 4(10), e312.
Li, X., Rao, S., et al. (2006b). Discovery of time-delayed gene regulatory networks based on temporal gene expression profiling. BMC Bioinform. 7(1), 26.
Liang, S., Fuhrman, S., et al. (1998). REVEAL, a general reverse engineering algorithm for inference of genetic network architectures. Pac. Symp. Biocomput. 3, 18–29.
Loomis, W. F., and Sternberg, P. W. (1995). Genetic networks. Science 269(5224), 649.
Margolin, A. A., Nemenman, I., et al. (2006). ARACNE: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinform. 7(Suppl. 1), S7.
Marino, S., and Voit, E. (2006). An automated procedure for the extraction of metabolic network information from time series data. J. Bioinform. Comp. Biol. 4(3), 665–691.
Martin, S., Zhang, Z., et al. (2007). Boolean dynamics of genetic regulatory networks inferred from microarray time series data. Bioinformatics 23(7), 866–874.
Mehra, S., Hu, W.-S., et al. (2004). A Boolean algorithm for reconstructing the structure of regulatory networks. Metab. Eng. 6(4), 326.
Mendoza, L. (2006). A network model for the control of the differentiation process in Th cells. Biosystems 84, 101–114.
Mendoza, L., and Alvarez-Buylla, E. R. (2000). Genetic regulation of root hair development in Arabidopsis thaliana: A network model. J. Theor. Biol. 204, 311–326.
Nariai, N., Tamada, Y., et al. (2005). Estimating gene regulatory networks and protein–protein interactions of Saccharomyces cerevisiae from multiple genome-wide data. Bioinformatics 21(Suppl. 2), ii206–ii212.
Nikolayewaa, S., Friedela, M., et al. (2007). Boolean networks with biologically relevant rules show ordered behavior. Biosystems 90(1), 40–47.
Pe'er, D., Regev, A., et al. (2001). Inferring subnetworks from perturbed expression profiles. Bioinformatics 17(Suppl. 1), S215–S224.
196
Reinhard Laubenbacher and Abdul Salam Jarrah
Pournara, I., and Wernisch, L. (2004). Reconstruction of gene networks using Bayesian learning and manipulation experiments. Bioinformatics 20(17), 2934–2942. Raeymaekers, L. (2002). Dynamics of Boolean networks controlled by biologically meaningful functions. J. Theor. Biol. 218(3), 331–341. Rice, J. J., Tu, Y., et al. (2005). Reconstructing biological networks using conditional correlation analysis. Bioinformatics 21(6), 765–773. Robeva, R., and Laubenbacher, R. (2009). Mathematical biology education: Beyond calculus. Science 325(5940), 542–543. Saez-Rodriguez, J., Simeoni, L., et al. (2007). A logical model provides insights into T cell receptor signaling. PLoS Comp. Biol. 3(8), e163. Samal, A., and Jain, S. (2008). The regulatory network of E. coli metabolism as a Boolean dynamical system exhibits both homeostasis and flexibility of response. BMC Syst. Biol. 2, 21. Sanchez, L., and Thieffry, D. (2001). A logical analysis of the Drosophila gap-gene system. J. Theor. Biol. 211, 115–141. Savageau, M. A. (1991). Biochemical systems theory: Operational differences among variant representations and their significance. J. Theor. Biol. 151(4), 509. Shmulevich, I., Dougherty, E. R., et al. (2002). Probabilistic Boolean networks: A rulebased uncertainty model for gene regulatory networks. Bioinformatics 18(2), 261–274. Shmulevich, I., Gluhovsky, I., et al. (2003a). Steady-state analysis of genetic regulatory networks modelled by probabilistic Boolean networks. Comp. Funct. Genomics 4(6), 601–608. Shmulevich, I., Lahdesmaki, H., et al. (2003b). The role of certain Post classes of Boolean network models of genetic networks. Proc. Natl. Acad. Sci. USA 100(19), 10734–10739. Stigler, B., and Veliz-Cuba, A. (2009). Network topology as a driver of bistability in the lac operon http://arxiv.org/abs/0807.3995. Thomas, R., and D’Ari, R. (1989). Biological Feedback. CRC Press. Thomas, R., Mehrotra, S., et al. (2004). A model-based optimization framework for the inference on gene regulatory networks from DNA array data. Bioinformatics 20(17), 3221–3235. Tringe, S., Wagner, A., et al. (2004). Enriching for direct regulatory targets in perturbed gene-expression profiles. Genome Biol. 5(4), R29. Vera-Licona, P., Jarrah, A., et al. (2009). An optimization algorithm for the inference of biological networks (in preparation). Waddington, C. H. (1942). Canalisation of development and the inheritance of acquired characters. Nature 150, 563–564. Wagner, A. (2001). How to reconstruct a large genetic network from n gene perturbations in fewer than n(2) easy steps. Bioinformatics 17(12), 1183–1197. Wagner, A. (2004). Reconstructing pathways in large genetic networks from genetic perturbations. J. Comput. Biol. 11(1), 53–60. Werhli, A. V., Grzegorczyk, M., et al. (2006). Comparative evaluation of reverse engineering gene regulatory networks with relevance networks, graphical Gaussian models and Bayesian networks. Bioinformatics 22(20), 2523–2531. Yeung, M. K., Tegner, J., et al. (2002). Reverse engineering gene networks using singular value decomposition and robust regression. Proc. Natl. Acad. Sci. USA 99(9), 6163–6168. Yu, J., Smith, V. A., et al. (2004). Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics 20(18), 3594–3603. Zhang, R., Shah, M. V., et al. (2008). Network model of survival signaling in large granular lymphocyte leukemia. Proc. Natl. Acad. Sci. USA 105(42), 16308–16313. Zou, M., and Conzen, S. D. (2005). 
A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics 21(1), 71–79.
CHAPTER EIGHT

High-Throughput Computing in the Sciences

Mark Morgan and Andrew Grimshaw
Department of Computer Science, University of Virginia, Charlottesville, Virginia, USA

Contents
1. What is an HTC Application?
2. HTC Technologies
2.1. Scripting languages
2.2. Batch queuing systems
2.3. Portable batch system
3. High-Throughput Computing Examples
3.1. Data transformation
3.2. Parameter space studies
3.3. Monte Carlo simulations
3.4. Problem decomposition
3.5. Iterative refinement
4. Advanced Topics
4.1. Resource restrictions
4.2. Checkpointing
4.3. File staging
5. Summary
References
Abstract

While it is true that the modern computer is many orders of magnitude faster than that of yesteryear, this tremendous growth in CPU clock rates is now over. Unfortunately, however, the growth in demand for computational power has not abated; whereas researchers a decade ago could simply wait for computers to get faster, today the only solution to the growing need for more powerful computational resources lies in the exploitation of parallelism. Software parallelization falls generally into two broad categories—‘‘true parallel’’ and high-throughput computing. This chapter focuses on the latter of these two types of parallelism. With high-throughput computing, users can run many copies of their software at the same time across many different computers. This technique for achieving parallelism is powerful in its ability to provide high degrees of parallelism, yet simple in its conceptual implementation. This chapter covers various patterns of high-throughput computing usage and the skills and techniques necessary to take full advantage of them. By utilizing numerous examples and sample codes and scripts, we hope to provide the reader not only with a deeper understanding of the principles behind high-throughput computing, but also with a set of tools and references that will prove invaluable as she explores software parallelism with her own software applications and research.
While it is true that the modern computer is many orders of magnitude faster than that of yesteryear, this tremendous growth in CPU clock rates is now over. Unfortunately, however, the growth in demand for computational power has not abated; whereas researchers a decade ago could simply wait for computers to get faster, today the only solution to the growing need for more powerful computational resources lies in the exploitation of parallelism.

Parallel computing can be broken down into two broad categories: capability parallelism and capacity parallelism. Capability computation, or what we sometimes refer to as ‘‘true parallel,’’ refers to a single large application running on many computers at the same time and with the various parallel components communicating among themselves. In contrast, capacity parallelism involves many copies of an application all running simultaneously but in isolation from the other parallel components. This type of parallelism, sometimes called high-throughput computing (or HTC), is the subject of this chapter.

Unlike true parallel applications that must often communicate among all participating components, HTC relies only on an initial setup and a final communication of results. During execution, each component of an HTC application works independently without any regard to the state or progress of its sibling tasks. This means that HTC applications are both easy to create and largely agnostic of computational setup—they run equally well on a cluster of machines front-ended by a batch system like PBS (http://www.pbsgridworks.com), LSF (http://www.platform.com), Condor (http://www.cs.wisc.edu/condor; Thain et al., 2005), or SGE (http://www.sun.com/software/sge) as they do on a compute grid or cloud. Most importantly, HTC is applicable to an incredibly large and diverse range of applications.

Consider the example of a single application that can analyze a satellite picture to determine if there are any geographic features of interest (perhaps indications of oil or valuable minerals). The thorough explorer wants to examine thousands of such pictures to determine the next best location to drill or mine. However, it takes far too long to run the program 1000 times in a row. Fortunately, there is no need to do so. Instead, our researcher creates a simple BASH shell script to submit and monitor jobs in a batch system.
This system, in turn, runs many copies of the program at the same time, each with different pictures to analyze. What could have taken weeks before on one computer now takes mere hours on his or her company's cluster.

We start this chapter with a description of what makes an application an HTC application. Then, we examine some of the technologies that exist in support of high-throughput computation. From there, we go through a number of examples to illustrate the various ways in which HTC applications are organized and managed. We then examine a few of the more advanced topics as they relate to high-throughput applications, and we finish with a summary of what we have learned.
1. What is an HTC Application?

There is no strict definition of an HTC application. Computer scientists tend to define HTC in terms of how it differs from high-performance or parallel computing. Wikipedia suggests that the main differences have to do with execution times and coupling. Most parallel applications are tightly coupled,1 while HTC tends to be very loosely coupled (http://en.wikipedia.org/wiki/High_Throughput_Computing). More generally, we tend to say that a true parallel application is a collection of computational components all running at the same time and cooperatively working to solve a single problem—in essence it is a single large application split among a number of computational resources. In contrast, an HTC application is really a number of identical programs each running simultaneously on a group of computers and each working on a different set of input data.

Sometimes called ‘‘bag-of-tasks’’ or ‘‘parameter sweep’’ applications, HTC jobs can more formally be described in terms of sets of inputs and associated results. Consider the set of inputs X = {x1, x2, . . ., xn}. Any given input xi represents some arbitrary collection of files and/or command-line parameters used by a sequential program which implements the function f such that it produces the result ri. Therefore, the HTC application is defined to fill in the result set R = {r1, r2, . . ., rn} such that ri = f(xi) ∀ xi ∈ X. Another way of saying this is that for all inputs of interest, a program is run for each input that produces a corresponding resultant output. Thus, the HTC application is the conglomeration of these mappings.

Many HTC applications begin as a single sequential program that someone writes to solve a problem or answer a question. For example, what is the lift produced by this wing configuration? How similar is this protein sequence to that of a frog's? What does the computer-generated scene look like for this frame of this movie?
1 The term ‘‘tightly coupled’’ refers to the tendency of these applications to require frequent communications between the constituent parallel components.
Each of these programs can run as a single job on a single machine. They become HTC applications when someone turns around and decides to run the program 1000 times with different inputs. The questions change accordingly. What does the space of solutions for wing configurations look like? Which protein sequence is closest to my sample? What does the entire scene from the movie look like?
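To make this mapping concrete, the minimal sketch below runs a hypothetical sequential program once per input file, producing one result per input; the program name analyze and the inputs/results directory layout are assumptions made purely for illustration.

#!/bin/bash

# Sketch of the HTC mapping ri = f(xi): run the same sequential
# program once for every input file, capturing one result each.
mkdir -p results
for INPUTPATH in inputs/*
do
    analyze "$INPUTPATH" > "results/$(basename "$INPUTPATH").out"
done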
2. HTC Technologies

A number of technologies exist that are an integral part of HTC. These include things like scripting languages, which provide an invaluable tool for managing and manipulating high-throughput jobs, as well as programs like batch systems and grids that specifically enable the tasks necessary to run the various instances of the job. In this section, we examine a set of these tools and describe what role they play in the HTC application.
2.1. Scripting languages

Though not specifically designed to support HTC, one of the simplest yet most valuable tools in its support is the scripting language. Because the applications in question are often sequential applications that were never designed for the use to which HTC puts them, they often lack the management capabilities necessary to perform the large tasks needed. Scripting languages provide a convenient and quick way to solve this problem by giving the user a medium in which he or she may quickly write tools to organize, launch, monitor, and collect results from the various parallel job instances. Without scripting languages, users would have to control HTC applications by hand, typing in thousands of inputs and launching each job individually from the keyboard—an unscalable and intractable solution as the problem becomes increasingly large.

A number of scripting languages are commonly used in scientific computing. These include the standard UNIX shell languages like BASH, CSH, and KSH as well as more advanced languages like Python (http://www.python.org), Tcl (Ousterhout, 1994), and Perl (http://www.perl.com). Often the language chosen has more to do with user/programmer familiarity than with language features and power. However, no matter what language you choose to use, the goals are the same: to write a program in an environment where one can easily and rapidly interact with external tools or programs (often programs as simple as ls,2 cat, grep, sed, awk, etc.).
2 While HTC jobs can be launched and managed from any type of computer, they are most frequently run on UNIX machines due to the rich set of commands and scripting languages available on those platforms. For that reason, I will tend to refer to UNIX-based tools and languages.
Despite the fact that any scripting language can be used, one of the UNIX shell languages is often the candidate of choice because of their ubiquity, their familiarity, and the fact that many HTC systems (e.g., the batch systems) use shell scripts as a means of communicating job descriptions (we will see this in action when we talk about portable batch system (PBS) scripts).
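Even one-line compositions of these simple tools go a long way when managing thousands of jobs. As a sketch, with hypothetical input/ and output/ directories:

$ ls input | wc -l        # how many inputs are there?
$ ls output | wc -l       # how many results exist so far?
$ ls input | sed -e "s/input/output/g" | head -3   # preview output names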
2.2. Batch queuing systems

In order to successfully run an HTC application, a single sequential application needs to be launched or run on some number of back-end machines or resources. The inputs for these sequential jobs need to be made available to each instance of the program, and once the program has finished executing, the outputs need to be collected back together. While some users in the past have accomplished all of this using custom-made solutions of varying complexity and effectiveness,3 by far the most common means of controlling HTC is by way of the batch queuing system or job management system.

Batch systems, also known as queuing systems, are large pieces of software that monitor and manage large clusters of machines for the purposes of ‘‘doling’’ out those resources to jobs requesting run time on them. Systems such as PBS, LSF, SGE, and Condor work by maintaining a list of jobs that users want to run and assigning computers to those jobs as the resources become available. Further, these systems often keep track of the resources used by various jobs and individual users for the purpose of accounting and billing. Batch systems generally guarantee that when a job is given a resource, that resource is devoted to the job for the duration allotted (which may be fixed or dynamic based on job execution time). The batch system is usually responsible for getting the job started on the target resource and ultimately has the ability to stop or kill the job at its discretion.

Batch systems all differ in the various details, but generally work in the same way despite those details. A job description file or submission script describes the job that a user wants to run and, when submitted by a submission tool, tells the queuing system how to run the job and what resources are required for that execution. Jobs submitted in this way result in a job token or key that can then be used by other tools to refer to that specific job. This key, which is nothing more than a unique string created by the queuing system, refers to that job for the lifetime of the job in the batch system's list. Users monitor and manipulate the jobs that the batch system is managing using other tools provided by the batch system implementation. While different batch queuing systems sometimes have different tools for submitting, monitoring, and managing jobs, a POSIX standard exists which suggests the use of qsub, qstat, and qdel as the basic tools necessary to accomplish this task.
3 A very common home-grown solution is merely the clever use of password-less ssh and simple BASH or Perl process control.
PBS, in particular, supports this standard, and given its ubiquity as a job management system, we will use these tools throughout this chapter as an exemplar of a queuing system implementation.

Generally speaking, batch systems support two kinds of jobs: sequential and parallel. For the parallel case, what we usually mean is a tightly coupled, true parallel job such as an MPI (Gropp et al., 1994) or OpenMP (Chandra et al., 2000) job. Sequential refers to a single job that a user wants to run exactly once on a single computer. Given these definitions, how then is it that batch systems support HTC applications? The answer lies in the fact that while batch systems support sequential jobs in only the singleton case, they nevertheless have access to a relatively large number of resources on which they can launch those jobs. Thus, by submitting a number of copies of a single job (each presumably with slightly different inputs), one can use the batch system as a mechanism for controlling and managing large numbers of independent jobs. It is in this regard that the value of the scripting languages described previously becomes evident. Batch systems are merely the mechanism by which jobs are run and managed, but more often than not a script is responsible for orchestrating the application as a whole.
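A typical session with these tools, sketched below, submits a job script, polls its status, and finally removes it from the queue. The job identifier printed by qsub is just an opaque key; the format shown here is hypothetical and varies between installations.

$ qsub render-frame.pbs
1234.cluster.example.org
$ qstat 1234.cluster.example.org
$ qdel 1234.cluster.example.org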
2.3. Portable batch system

PBS is one such queuing system and is one of the more common batch systems available. A number of implementations of this software exist, ranging from free, open-source versions to supported, commercially produced ones. We use PBS throughout this chapter as an exemplar batch system because of its common use and the similarities between it and other batch systems.

PBS submission scripts are nothing more than shell scripts with PBS-specific instructions and restrictions embedded in the comments of the script. This technique of embedding additional information into the comments of a scripting language (or, for that matter, any language) is a common and frequently used means of extending a language beyond its original design. In fact, the ‘‘job’’ that the user is submitting to the PBS queue is really the shell script itself. This shell script almost always calls on another program to execute (the sequential application we talked about earlier), but it is important to realize that the ‘‘program’’ that PBS runs directly is the script that was submitted to the batch system and that this script can contain arbitrarily complex and intricate code. The example PBS submission script in Fig. 8.1 describes a request to run the render-frame program for frame 1 of scene 1 of a movie.
#!/bin/bash

#PBS -q largeQueue
#PBS -o /home/jdoe/movie/stdout.txt
#PBS -e /home/jdoe/movie/stderr.txt

echo $HOSTNAME
cd /home/jdoe/movie
render-frame scene-1-frame-1.input scene-1-frame-1.tiff

Figure 8.1 Example simple PBS submission script.
There are a couple of interesting details to note about this script. First of all, the script is a standard BASH shell script, as implied by its first line. Despite this fact, a number of PBS directives follow, embedded in standard BASH script comments. These directives indicate, respectively, the name of the queue where the job should be submitted,4 the location to which standard output should be redirected as the job runs, and the location to which standard error should be directed. Next, illustrating the point that the submission script can be arbitrarily complex, we see a couple of lines of BASH script which set up the job. Finally, the binary program is run, given the name of the input frame to render and the name of the output file to generate.

This last line is particularly important for a couple of reasons. First of all, the fact that the exact frame to render (and the exact output to generate) is given explicitly implies that this sequential job is only one of potentially many sequential jobs that make up the HTC application needed to render an entire scene or movie. One can imagine the collection of such PBS submission scripts that would be required to render a corresponding collection of frames, thus generating a movie sequence. Further, notice that no description is given for how the frame input was generated nor how the output tiffs are to be ‘‘glued’’ together into a resultant movie. This is typical for an HTC application. Most of the time, the inputs are generated through some external mechanism (perhaps another program, perhaps by hand, sometimes even by the user's HTC management script). Additionally, the resultant outputs are usually collected together using yet another piece of software. Finally, note that the submission script assumes that the data are available in the same place regardless of which machine the job ends up running on. This last expectation, namely, that all machines controlled by that queue share some portion of the file system (usually using NFS
(Sun Microsystems, Inc., 1989), CIFS (Leach and Naik, 1997), Lustre (http://wiki.lustre.org/index.php/Main_Page), or some other network file system software), is a typical constraint of most batch systems.

4 We often refer to an instance of a PBS system (or any other batch system) as a queue, but in fact these systems often have more than one virtual queue embedded.
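Note that most PBS implementations also accept the same directives as command-line options to qsub, so the #PBS lines embedded in Fig. 8.1 could equivalently be supplied at submission time. As a sketch, reusing the hypothetical paths from that figure:

qsub -q largeQueue -o /home/jdoe/movie/stdout.txt \
    -e /home/jdoe/movie/stderr.txt render-frame.pbs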
3. High-Throughput Computing Examples

In this section, we examine more closely a number of HTC examples that illustrate patterns of computation common to scientific applications. While these patterns are separated into categories, it is worth noting that the partitioning is largely arbitrary and significant overlap between the examples may be evident. In fact, it is generally assumed that clever data decomposition and organization can transform one type of HTC application into another (e.g., a Monte Carlo simulation can be viewed as a parameter sweep where the random number seed is the parameter, etc.). Furthermore, because the mechanism by which the individual single jobs are submitted or monitored is similar enough between the various back-end systems (be they queues or grids), we use PBS as a single unifying medium in which to demonstrate the techniques in question. These techniques should translate equally well into whatever back-end technology is appropriate for the reader.
3.1. Data transformation

The first example of an HTC pattern is what we refer to as the data transformation pattern. In this pattern, we assume that the user has a program that reads data from a set of input files and generates a transformed version of that data as a set of output files. The exact transformation is irrelevant and may constitute more of an analysis than an actual transforming of the data. What is important is that the number of input data files available determines the number of output files to be generated, and therefore also the number of times that the sequential application needs to be run. The movie frame example that we gave earlier (Fig. 8.1) is a perfect example of this.

For this example, assume that we have a binary called lineDetector that reads an input image file and generates a new image file resulting from performing a horizontal and vertical line detection algorithm. In other words, the resultant image file contains a new image that shows the locations of the horizontal and vertical lines detected in the original image. We further assume that our input images are all located in the directory /home/jdoe/images/input and have the names input-image-1.tiff, input-image-2.tiff, etc.

Our first step in generating this HTC application is to create by hand a PBS submission script for one job. We often start this way because it provides a convenient template from which to develop the rest of the HTC application. This example PBS submission script is given in Fig. 8.2.
#!/bin/bash

#PBS -q largeQueue

cd /home/jdoe/images
lineDetector input/input-image-1.tiff output/output-image-1.tiff

Figure 8.2 Example PBS submission script.
Notice in the example that we have chosen to write our template submission script using image 1 and that we have identified our resultant output image with the same number. This pattern is typical and reflects the user's desire to be able to associate the resultant images with the inputs from which they are derived. At this point, we should submit the script to our PBS system to verify that we have not made any mistakes with respect to running this job.

Determining the correctness of your single job submission script is not always easy. Does the queuing system (in this case PBS) accept the submission script? Once submitted, does the job actually run, or does it stay queued forever? After running, does your job produce the output that you expected? Figuring out why any of these problems occur is sometimes a black art and often requires a working familiarity with the batch system in question.

If the queuing system does not accept the job, often it will tell you why. Maybe the queue you specified does not exist, or perhaps the format you gave for a resource restriction is not correct. If your job gets submitted to the queue but never runs, this can sometimes be caused by specifying resource restrictions that can never be satisfied, such as asking for 100 nodes from a queue that only has 50 nodes available. If your job seems to run but does not produce the output you expect, your program can sometimes be in error, but sometimes some aspect of the program's environment has not been set up correctly (e.g., missing libraries, library paths not set correctly, input files not made available, etc.). To solve these problems, you will often need to add statements to the submission script which indicate to the queuing system that you would like to get back the standard output and standard error streams from your job (these streams will usually contain error messages indicating what went wrong). Finally, keeping in mind that the submission script that you use to submit the job to the queue is itself a shell script that will be run on the target node, you can sometimes put appropriate debugging statements into the script itself to help you determine what is happening.
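As a sketch of this advice applied to Fig. 8.2 (the capture paths and the extra echo statement are illustrative assumptions), a debugging variant of the submission script might read:

#!/bin/bash

#PBS -q largeQueue
#PBS -o /home/jdoe/images/debug-stdout.txt
#PBS -e /home/jdoe/images/debug-stderr.txt

# Debugging statements: record where the job actually ran and
# in which directory it started.
echo "Running on $HOSTNAME in directory $PWD"
cd /home/jdoe/images
lineDetector input/input-image-1.tiff output/output-image-1.tiff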
Once we are sure that the submission template script is correct, we need to write a submission manager or control script that creates and submits a single job to the PBS system for each individual sequential run that we need to perform. Because we have a directory full of input files over which we need to run the application, we will write our script to iterate through those files and submit a new job to the PBS system for each one. First, we modify our submission script template so that it contains ‘‘variables’’ in place of the names of the input and output files. These variables are nothing more than standard operating system environment variables, just like PATH and HOME, and are set by the batch queuing system from values given to the qsub program on the command line when you submit the job. The modified submission script is given in Fig. 8.3.
#!/bin/bash

#PBS -q largeQueue

cd /home/jdoe/images
lineDetector input/${INPUT} output/${OUTPUT}

Figure 8.3 Example simple PBS template submission script.
Notice that in this example two variables, INPUT and OUTPUT, are defined to indicate, respectively, the input and output file names to use for this job. Also notice that we have used the ${VARIABLE} notation for environment variables rather than the more common $VARIABLE syntax. This deviation from the typical is not required by the queuing system, but rather is the author's preference to enhance readability of the script. In all respects, the variables are true environment variables and may be referenced using any valid environment variable syntax.

Strictly speaking, it is not necessary to write PBS submission templates like the one above. Instead, one can relatively easily generate the submission scripts on the fly directly from the text of the control script we are about to write, using echo statements or BASH here-documents. However, because the submission script is also a BASH shell script, generating it inside of another BASH shell script is prone to errors having to do with variable substitution. Thus, for readability and clarity, we prefer to generate submission templates such as this one throughout this chapter.

The final step in creating our HTC application is to create a shell script capable of iterating through the input files and, for each one, generating and submitting a PBS submission script to run the job. As an optimization, notice that the shell script first checks to see if the output file already exists. This optimization is recommended because oftentimes an HTC run will need to be repeated to fill in missing results. These missing results can be the product of imperfect batch systems or grid systems prone to failures or job loss, or simply the result of a desire to run the data translation program over additional data not available at the time the first run was initiated. Figure 8.4 shows this BASH shell script.
#!/bin/bash

# We make a directory to keep the submission scripts in just to
# keep our working directory from getting cluttered.
mkdir -p scripts

# Iterate over all the files in the input directory.
for INPUTPATH in input/*
do
    # For each file, determine its name (without the path)
    # as well as the name of the desired output file.
    INPUTFILE=`basename $INPUTPATH`
    OUTPUTFILE=`echo $INPUTFILE | sed -e "s/input/output/g"`

    # If the output file does not exist, create and submit
    # a PBS job.
    if [ ! -e output/$OUTPUTFILE ]
    then
        echo "Submitting job for input/$INPUTFILE"
        qsub -v "INPUT=$INPUTFILE, OUTPUT=$OUTPUTFILE" \
            submission-script.pbs
    fi
done

Figure 8.4 Line detector submission control script.
Several features about the above shell script warrant explanation. First of all, for space and readability reasons, throughout the script we have ignored the possibility that file names exist with spaces in them. One can of course prevent such occurrences by design, but it is generally better to create your scripts from the outset so that they are capable of accommodating such anomalies. Finally, the script could have been simplified by choosing a simpler naming scheme for our input and output file names; specifically, had the input and output file names been the same rather than different, the script would have needed only to indicate the directory for outputs rather than translating the input file names into output names. However, once again, this would tend to reduce readability, which we prefer to preserve, given that the more complex version is more typical of real-world examples.
3.2. Parameter space studies

Another common pattern that we often come across in HTC applications is that of the parameter space study. A parameter space study is an HTC application where we wish to run a sequential application once for each of a number of points within the input space, thus generating a matching output function. Users then analyze the output space to produce some summary of results, to pick some optimal solution, or to use as inputs to a different application.

Consider the example where we have a sequential program that determines the lift generated by a fixed wing in air for wings of various lengths and angles of attack. We want to determine what effect changing those two parameters has on the wing's lift so that we can pick an optimal solution. This example is similar to the data translation example in that it requires us to write a shell script to submit jobs to the queue. However, for this example the number of sequential jobs to be run is determined not by a set of input files, but rather by a set of input parameter values. We start the same way, by generating the example template PBS submission script.5
#PBS –q largeQueue
calculateLift ${WINGLENGTH} ${WINGANGLE} ${OUTPUTFILE}
Figure 8.5 Airflow over wing submission script. 5
5 In this example, we skip straight to the PBS submission script with variable tokens rather than giving an exemplar script with actual values to test. In the general case, however, you should always generate an exemplar for testing purposes, as it helps to debug problems that may develop further down the road.
Notice that this new submission script (Fig. 8.5) has three variables. We have two parameters over which we wish to iterate the parameter space, namely WINGLENGTH and WINGANGLE. We also need to indicate the name of the output file that we want to generate, thus requiring the third variable, OUTPUTFILE. Once again, we will want to be able to match the output files with the input data. In this case, we will do so by naming the output file so that the name tells us what length and angle were used.

Now that we have the submission template ready, we generate a shell script that can manage and submit the jobs to the PBS queue. The BASH shell script in Fig. 8.6 can submit a number of jobs corresponding to an input range of wing angles and lengths. The wing lengths are given as integers representing the number of inches in the wing's length, while the angle is given as a floating point number representing the angle in degrees. This latter decision to represent the angle as a floating point number was made to illustrate a technique for iterating over floating point numbers in a BASH shell script, despite the fact that the BASH scripting language cannot natively handle floating point numbers. Generally speaking, if you are designing your own scripts, you can create your sequential program and script in such a way as to avoid this necessity. Also, once again, note that the script first checks for the desired output file before submitting the job. This, as before, allows us to repeatedly run the example, generating only the output files that are missing (Fig. 8.6).
#!/bin/bash

# Check to make sure that the arguments are correct
if [ $# -ne 6 ]
then
    echo "USAGE: $0 <min-angle> <max-angle> <angle-incr> <min-length> <max-length> <length-incr>"
    exit 1
fi

# Set variables for easier readability
MINANGLE=$1; MAXANGLE=$2; ANGLEINCR=$3
MINLEN=$4; MAXLEN=$5; LENINCR=$6

# Loop through the angles requested. We have to use the
# bc program here to do the loop because BASH cannot handle
# floating point numbers natively.
while [ `echo "scale=1; $MINANGLE <= $MAXANGLE" | bc -l` -ne 0 ]
do
    # Inside the angle loop, we are going to loop through the
    # wing lengths as well. We assume that length is given as an
    # integral number of inches.
    LENGTH=$MINLEN
    while [ $LENGTH -le $MAXLEN ]
    do
        # We create a file name that reflects the angle/length.
        OUTPUT=winglift-$MINANGLE-$LENGTH.dat
        if [ ! -e $OUTPUT ]
        then
            echo "Submitting job for $OUTPUT"
            qsub -v "WINGANGLE=$MINANGLE, WINGLENGTH=$LENGTH, \
                OUTPUTFILE=$OUTPUT" submission-script.pbs
        fi

        LENGTH=$(( $LENGTH + $LENINCR ))
    done

    MINANGLE=`echo "scale=1; $MINANGLE + $ANGLEINCR" | bc -l`
done

Figure 8.6 Airflow over wing control script.
A common alternative approach is a parameter space study in which the parameters themselves come from files rather than being entered as actual numbers that you iterate through within the control script. Figure 8.7 gives the control script of Fig. 8.6 modified to take the wing angles from an input file. In practice, this file could contain any textual data, not just numbers.
3.3. Monte Carlo simulations

A Monte Carlo simulation is a program that generates results based on a large number of random samples. Monte Carlo simulations generally produce nondeterministic results and are most useful when a large number of degrees of freedom exist in the space being sampled. Monte Carlo applications are classic examples of both true parallel and HTC applications, differing from one another only in the application used to produce the results and the length of time necessary to run the simulations.

In this example, we will try to estimate the value of π using a Monte Carlo simulation that works by using the knowledge that the area of a circle is equal to π multiplied by the radius of the circle squared. If you had the exact area of the circle and its exact radius, you could calculate the value of π simply by dividing the area by the square of the radius. In our simulation, we will estimate the area of a circle with a radius of one by throwing imaginary darts at a dartboard with that circle inscribed inside of it. Because we know the exact area of a square with a unit circle inscribed inside of it, we can estimate the area of the circle—and thus the value of π—by multiplying the area of the square by the ratio of darts that randomly land inside the circle to the total number thrown. However, in order to get a reasonable approximation for π, we have to throw a large number of darts. This is where our Monte Carlo simulation comes in.

Assume that we have a sequential binary which, given a random6 number seed, throws 1,000,000 imaginary darts at the dartboard described earlier (it does this by generating random x and y coordinates in the range [-1.0, 1.0]). The program then prints out the number of darts that ‘‘hit’’ within the unit circle. To turn this application into an HTC application, we need to generate a large number of PBS jobs that each run their own 1,000,000-dart simulation. Once all of the results come back, we can then sum the results together to get our circle-to-square area ratio and thus estimate a value for π. The PBS submission template and BASH control script are given in Figs. 8.8 and 8.9.

Notice that the name of the output file is once again related to the specific run. In this case, the output file name has a number indicating which ‘‘millions’’ of dart-throws the output represents (i.e., the first million darts, the second million, etc.).
6 Random number generation is a complex topic both in sequential applications and in parallel applications. Those details, however, are beyond the scope of this chapter and as such are left to more thorough treatments available in other texts.
#!/bin/bash

# Check to make sure that the arguments are correct
if [ $# -ne 4 ]
then
    echo "USAGE: $0 <angle-file> <min-length> <max-length> <length-incr>"
    exit 1
fi

# Set variables for easier readability
ANGLEFILE=$1
MINLEN=$2; MAXLEN=$3; LENINCR=$4

# Loop through the angles requested.
for ANGLE in `cat $ANGLEFILE`
do
    # Inside the angle loop, we are going to loop through the
    # wing lengths as well. We assume that length is given as
    # an integral number of inches.
    LENGTH=$MINLEN
    while [ $LENGTH -le $MAXLEN ]
    do
        # We create a file name that reflects angle/length.
        OUTPUT=winglift-$ANGLE-$LENGTH.dat
        if [ ! -e $OUTPUT ]
        then
            echo "Submitting job for $OUTPUT"
            qsub -v "WINGANGLE=$ANGLE, WINGLENGTH=$LENGTH, \
                OUTPUTFILE=$OUTPUT" submission-script.pbs
        fi

        LENGTH=$(( $LENGTH + $LENINCR ))
    done
done

Figure 8.7 Airflow over wing control script redux.
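For concreteness, the script of Fig. 8.7 might be driven with a small angle file like the following; the file name and the name under which the script is saved are hypothetical:

$ cat angles.txt
0.0
2.5
5.0
7.5
$ ./wing-control-file.sh angles.txt 24 48 6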
#!/bin/bash

#PBS -q largeQueue

throwDarts ${SEED} > dart-results.${NUMBER}

Figure 8.8 Monte Carlo submission script.
This is because there really is no distinguishing characteristic between the various results (though we could, if we wanted, store the seed number given). Rather, we are merely using the file name as a convenient means of determining whether or not the output was generated for a given sequential run. Also, in this example, the throwDarts program does not generate an output file. Instead, it prints results to the standard output stream. This is not an uncommon occurrence, and while it can be selectively used or avoided when the user has control over the source code for the sequential binary, oftentimes the binary is a piece of legacy code which cannot, for various reasons, be modified (Fig. 8.9).
#!/bin/bash

# Check the arguments
if [ $# -ne 1 ]
then
    echo "USAGE: $0 <num-iterations>"
    exit 1
fi

NUMITERS=$1

# Loop through the iterations
while [ $NUMITERS -gt 0 ]
do
    # Use BASH's built-in RANDOM variable to generate a seed
    SEED=$RANDOM

    # If the result hasn't yet been generated, submit a job
    # to create it.
    RESULTFILE=dart-results.$NUMITERS
    if [ ! -e $RESULTFILE ]
    then
        qsub -v "SEED=$SEED, NUMBER=$NUMITERS" \
            submission-script.pbs
    fi
    NUMITERS=$(( $NUMITERS - 1 ))
done

Figure 8.9 Monte Carlo control script.
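The final summation step can itself be a short script. As a sketch, assuming each dart-results.N file holds a single integer hit count and that every run threw 1,000,000 darts:

#!/bin/bash

# Sum the per-run hit counts and estimate pi. The square has
# area 4 and the inscribed unit circle area pi, so
# pi is approximately 4 * (total hits) / (total darts).
TOTALHITS=$(cat dart-results.* | awk '{ sum += $1 } END { print sum }')
NUMFILES=$(ls dart-results.* | wc -l)
TOTALDARTS=$(( NUMFILES * 1000000 ))
echo "scale=6; 4 * $TOTALHITS / $TOTALDARTS" | bc -l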
3.4. Problem decomposition

So far in this chapter, we have ignored the issue of problem decomposition. Sometimes, the decomposition is either obvious or determined by the sequential program that we are using. Often, however, the user can choose how he or she wants to decompose the larger problem into a collection of smaller ones. Doing so can have a dramatic impact on the total time it takes to execute a high-throughput application as well as the overall probability that the application will finish successfully. In the advanced section of this chapter, we examine the latter of these concerns, but for now we are going to see how problem decomposition can affect the overall runtime of an HTC application.

Consider the dart-throwing Monte Carlo example we looked at in the previous section. A naïve implementation might have written the throwDarts sequential program such that it threw exactly one dart instead of 1,000,000. In this way, the output files would have been numbered by dart rather than by millions of darts. However, this scheme lacks scalability and efficiency. To get a reasonably good estimate for π, we have to throw lots of darts. Assume that we needed to throw as many as 1 billion darts. If we generated a PBS job for each dart, we would have submitted 1 billion jobs into the PBS queue, and it would in turn have created 1 billion result files. From a scale point of view, neither PBS nor the file system into which the results will land is capable of handling that many items. From an efficiency perspective, running a sequential job for each dart-throw would take too long. While it might take a long time to throw 1 billion darts, it takes an incredibly short amount of time to throw one dart. On the other hand, submitting a job to PBS, having PBS run that job, and then getting the results back is a relatively hefty operation requiring multiple seconds at best to complete. We simply cannot afford to spend seconds setting up and submitting jobs that only need a few microseconds to run.

Instead, we batched together a large number of dart-throws into a single run of the throwDarts program so that the time that it takes to generate, submit, and run the job through PBS is amortized by the relatively long execution time of the throwDarts program. As a general rule of thumb, you want the execution time of a single unit of your HTC application to be on the order of 100 times (or more) greater than the amount of time it takes PBS to process the job. In this way, the cost of using PBS has little impact on the overall performance of your application as a whole. At the same time, you want the number of individual submissions to the PBS queue to be large enough to get some benefit. Simply submitting a single throwDarts program to the queue that throws all 1 billion darts alone produces no benefit over simply running the program on your desktop.7
7 Unless of course there is some benefit to running on a machine that the PBS batch system has access to that you do not. It is not uncommon for a user to submit a single job to a batch or grid system when that user simply cannot run the program on any other machine to which he or she has access. However, as this chapter is about HTC applications, we ignore such possibilities.
The choice of how many jobs to break the HTC application into depends on a number of factors, including how many resources the queue has access to, how many users are trying to submit jobs to the queue at any given time, and how many slots on the queue a single user is allowed to consume at any one time.8 As a general rule of thumb, if your sequential application runs in a relatively consistent amount of time regardless of the input data, then coordinating the total number of job submissions with the number of slots available to you makes sense (e.g., if you have 10 slots available to you, having somewhere around 10 jobs might make sense, assuming that the runtime of each job is reasonable). However, if the runtime of your sequential program is highly variable depending on input, or the resources on which you are running have a high chance of failure, it makes more sense to decompose the problem into many more pieces. As jobs run through the queue, longer running jobs will consume a slot for a correspondingly large period of time, while short running jobs finish early and vacate their respective slots, leaving them available for other jobs to run. This is an optimization mechanism known as load balancing, whereby longer running jobs have a decreased effect on the overall runtime of the batch because many shorter jobs have an opportunity to execute one after another at the same time. This principle is similar to how having multiple lanes of traffic improves the overall efficiency of cars moving along the road, as opposed to having a single line of traffic whose speed is ultimately determined by the slowest driver.

With the darts program, the method of decomposition, if not the number, was obvious; throwing many darts is identical in concept to throwing a single dart. However, this is not always the case. Sometimes, a sequential program does not naturally decompose into obvious pieces. The example first mentioned in this chapter (in which a scene from a movie was rendered using a batch of sequential jobs, each of which rendered a single frame from the movie) might not in fact be the best decomposition of the problem. If the time required to render a single frame is relatively small, then we would want to render many frames in a single job. Similarly, if the time required to render a single frame is large, we would probably want to decompose the problem such that only portions of a frame were rendered by any given job. In both cases, the sequential program (and in fact the output files generated by the hypothetical decomposition) are not necessarily available.
8 To prevent one user's jobs from preventing another user's jobs from running, batch system administrators will often limit how many slots or nodes a user can simultaneously hold at any given time. Furthermore, the administrator will sometimes additionally limit how many jobs a user can have in the queue at any given time, regardless of how many are running.
After all, the program generates pictures representing an entire frame, not pieces of a picture or snippets of a movie. To make these decompositions work, some amount of programming on the user's part is required. For the multiframe example, the answer is to submit jobs that are themselves scripts, each script executing the render program multiple times and generating multiple images. These images can then be tarred or zipped by the script and returned as the single result representing the snippet of the movie rendered. In the case of rendering pieces of a frame, the answer is not as simple. Unless the render program had the option to render a piece of a frame and generate a partial image file, the user would have to come up with a way of modifying the render program to perform these partial operations. He or she would also need a way of representing partial frames and later gluing them together.
3.5. Iterative refinement

The last example that we will look at in this chapter with respect to typical HTC use cases is that of iterative refinement. So far, all of the examples we have examined assume that we want to run a large number of sequential jobs to generate a large number of resultant outputs. These outputs would then, presumably, be combined together at the end to produce a single result. However, sometimes the parameter space is too large and the nature of the result space too unknown for the researcher to provide a sufficient set of boundary conditions for his or her application. Maybe he or she wants to see what wing angle to the nearest 10th of a degree and nearest 16th of an inch provides the best lift, but it takes too long to run all possible combinations.

Using iterative refinement, the researcher first submits a portion of an HTC job examining a relatively broad spectrum of the target parameter space. As the results from those jobs arrive, he or she can analyze them to determine which ranges of the broad parameter study show promise and can thus narrow the space and submit new jobs as a refinement of his or her parameter-sweep study. The first study might analyze wing angles in increments of 5 degrees and lengths in increments of 6 in. Based on the results of that study, new parameter spaces defined in terms of single-degree and single-inch increments are then launched for areas of interest to the researcher.

Iterative refinement need not involve refinement of the parameter space. Sometimes, the refinement takes place instead in the sequential program or algorithm. A researcher comparing a protein sequence against a database of other sequences might first run an HTC application using a ‘‘quick and dirty’’ algorithm to determine which database sequences show promise. Then, based on these results, interesting database sequences could be compared against the test sample using a much slower but more accurate comparison algorithm.

Regardless of the reason for the iterative refinement, the methods for controlling them are largely the same and, generally speaking, consist of building on what we have already seen in this chapter. Most of the differences lie in how to analyze the results of one run to generate the inputs for
the next run. Usually, a user will submit the first run using a control script like the one we have described earlier, wait for that HTC run to fully complete, analyze the results, select parameters for the next run, and then submit the new run using either the same or a different control script. However, sometimes it makes more sense to have a single control script analyzing the results of one run as they are returned from the batch system and then deciding on the fly whether or not to generate and submit a new run based on those results. Doing so, however, is a much more complicated task involving error checking with the queue (to make sure that the job was not lost or failed), making sure that the full results are available (i.e., that the job is completely finished and not simply in the process of generating results), and (potentially) simultaneously managing runs from multiple different refinements.
4. Advanced Topics

So far, we have covered the relatively straightforward aspects of using batch systems to submit and control HTC jobs. In this next section, we will take a deeper look at some of the more advanced topics having to do with HTC applications, including restricting resources for scheduling purposes, checkpointing results, and staging data to and from local file systems. This is by no means an exhaustive list, but rather an introduction to a few topics of interest and importance.
4.1. Resource restrictions

It used to be the case that organizations would set up a number of queues on batch systems, each one representing a certain job type for which a particular set of resources was intended. For example, an IT department might create one queue for long-running jobs and another one for short jobs; one queue might be intended for Linux machines and the other for Solaris. Increasingly, however, it is becoming more common for an IT department to have only one queue and to rely instead on the user submitting jobs to the queue with certain restrictions indicated. For example, a user can indicate in a PBS batch script how many processors he or she wants per node, how much memory the job needs, how long the job will take to run, or even what kind of operating system is preferred. Generally speaking, in order to get the most out of your batch system, you need to describe your job's requirements in the appropriate amount of detail for your IT department's resources. The following PBS submission script shows an example in which the user has requested that his or her job be put on a machine with two CPUs per node, that it will use 10 GB of memory when executing, and that it needs to be a Linux machine of some type (Fig. 8.10).
#!/bin/bash

#PBS -q largeQueue
#PBS -l ncpus=2:mem=10GB:arch=linux
#PBS -o /home/jdoe/movie/stdout.txt
#PBS -e /home/jdoe/movie/stderr.txt

echo $HOSTNAME
cd /home/jdoe/movie
render-frame scene-1-frame-1.input scene-1-frame-1.tiff

Figure 8.10 Example resource restrictions submission script.
4.2. Checkpointing

Probably the most important advanced topic—and one that is frequently overlooked when it comes to HTC—is that of checkpointing. While it would be wonderful if all jobs took 15 min to run to completion, the truth is that there are many applications that run for days, weeks, or even months. Unfortunately, it is unrealistic to assume that an application can run uninterrupted for long periods of time. Perhaps the program leaks memory, or perhaps the user is sharing the machine with another program that is leaking some operating system resource, thus making the machine unstable. Labs sometimes lose power for long periods of time, causing machines to fail while jobs are running. For that matter, it is often the case that the batch system itself is configured to kill jobs that take too long to execute.9 In the end, regardless of the cause, the result is the same: the loss of all in-memory data and progress made on your long-running job.

When you start talking about HTC applications, the odds of a long-running program failing to complete increase. By utilizing lots of machines at the same time, you inadvertently increase the chances that one of the machines on which your job is running is going to fail before it finishes. There are essentially two ways to deal with the problem of long-running jobs. One is simply to shorten your job so that it does not take as long but instead requires more runs to complete.
9 Configuring a batch queuing system to limit jobs to a certain duration is often a bone of contention between users and administrators, but is generally necessary to ensure fairness amongst all of the cluster's users.
For example, maybe each instance of your program rendered 10 frames from a movie scene. Instead of running 1000 jobs, each rendering 10 frames of the movie, you could perhaps submit 10,000 jobs where each job renders only one frame. The other solution is to employ something called checkpointing. Checkpointing is the act of periodically recording data about the progress of your program so that, if the program should fail for whatever reason, you can simply restart the program from the last known checkpoint and continue from there.

Unfortunately, checkpointing is an activity that many researchers ignore because it requires them to implement extra code that would not otherwise be necessary in a perfect world where nothing failed. Also, while a few projects have tried to make the process of checkpointing easier or automatic for applications, the truth is that none of these is perfect and the likelihood is that you will not have access to such a system. Further, it is not generally possible to describe a solution that works for all applications. Each application is different, and the nature of your application's checkpointing needs depends on how your program is structured. Furthermore, if you do not have access to the original program's source code, you may not have the ability to checkpoint at all.

Checkpointing in HTC applications often involves storing intermediate state information about your running application into a shared directory (recall that most batch systems use a shared file system to ease the transfer of applications and data between resources behind the batch system). Your management script then needs to be able to detect when an application has failed and restart that program using the stored checkpoint.

Imagine an application with the command line given in Fig. 8.11. Each run of this application takes an input file as a parameter describing the data to be analyzed. It also takes two additional parameters describing, respectively, the name of an output file to generate when the program is complete and the name of a checkpoint file to generate periodically as intermediate results become available.10 Finally, the application takes an optional set of parameters that instruct the program to restart from an intermediate checkpoint file already available from a previous run. Given this application, we now revisit a job management script that we saw earlier in this chapter and modify it to work with our new application.
Figure 8.11 Example checkpointing command-line.
10 Implicit in this example is the assumption that the application binary removes checkpoint files as new checkpoints become available or as the program finishes successfully. If this is not the case, the control script needs to differentiate between checkpoint files that are still in use and those that are no longer needed.
If you compare this script (Fig. 8.12) with the first job management script given in this chapter, you can see that the two are very similar. The only difference is that this script checks for the existence of a checkpoint file before submitting the job to the batch system. If the checkpoint file exists, then we use a different PBS submission template file (one that presumably uses the --restart version of the command). In this way, whenever we run the script, we submit one job for each input file that does not yet have a corresponding output file, and the restart option is given whenever an appropriately named checkpoint file exists. As with the previous case, it is important to understand the difference
    #!/bin/bash

    # We make a directory to keep the submission scripts in just to
    # keep our working directory from getting cluttered.
    mkdir -p scripts

    # Iterate over all the files in the input directory.
    for INPUTPATH in input/*
    do
        # For each file, determine its name (without the path) as
        # well as the names of the desired output file and
        # checkpoint file.
        INPUTFILE=`basename $INPUTPATH`
        OUTPUTFILE=`echo $INPUTFILE | sed -e "s/input/output/g"`
        CPFILE=`echo $INPUTFILE | sed -e "s/input/checkpoint/g"`

        # If the output file does not exist, create and submit
        # a PBS job.
        if [ ! -e output/$OUTPUTFILE ]
        then
            # Before submitting a job, we first check to see if
            # there is an intermediate checkpoint to restart from.
            if [ -e checkpoints/$CPFILE ]
            then
                echo "Re-submitting job for input/$INPUTFILE"
                TEMPLATE=resubmission-template.pbs
            else
                echo "Submitting job for input/$INPUTFILE"
                TEMPLATE=submission-template.pbs
            fi

            qsub -v "INPUT=$INPUTFILE, CHECKPOINT=$CPFILE, \
                OUTPUT=$OUTPUTFILE" $TEMPLATE
        fi
    done

Figure 8.12 Checkpointing example control script.
between a job that failed because of some transient problem and one that fails consistently because of bad inputs or bad data. If your program crashes every time it tries to work with a given data file, no amount of checkpointing and restarting will fix that issue.
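One practical way to make that distinction is to have the control script keep a per-input retry count and give up after a few attempts. The following Perl sketch is purely illustrative; the attempts.txt bookkeeping file, the retry limit, and the template name are assumptions, not part of the chapter's examples, and the output-file test used by the earlier scripts is omitted for brevity.

    #!/usr/bin/perl
    # retry-guard.pl -- refuse to resubmit inputs that keep failing.
    # Minimal sketch; the attempts.txt format and the retry limit are
    # illustrative assumptions.
    use strict;
    use warnings;

    my $MAX_RETRIES = 3;
    my %attempts;

    # Load previous attempt counts ("name count" pairs, one per line).
    if (open(my $in, '<', 'attempts.txt')) {
        while (my $line = <$in>) {
            chomp $line;
            my ($name, $count) = split ' ', $line;
            $attempts{$name} = $count;
        }
        close $in;
    }

    foreach my $path (glob('input/*')) {
        (my $name = $path) =~ s{^input/}{};
        my $count = $attempts{$name} || 0;
        if ($count >= $MAX_RETRIES) {
            print "giving up on $name after $count submissions; check its data\n";
            next;
        }
        $attempts{$name} = $count + 1;
        system("qsub -v \"INPUT=$name\" submission-template.pbs");
    }

    # Persist the updated counts for the next sweep of the control script.
    open(my $out, '>', 'attempts.txt') or die "cannot write attempts.txt: $!";
    print $out "$_ $attempts{$_}\n" for sort keys %attempts;
    close $out;

Any input that keeps failing is thereby surfaced for manual inspection instead of being resubmitted forever.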
4.3. File staging

File staging is another advanced topic that is sometimes useful (and in fact sometimes required) for HTC applications. File staging is the act of copying a file in from a source to the compute node where the computation is taking place or, equivalently, copying a data file out from a compute node to a target location. There are many different ways to copy this data, including downloading it from the Web, copying it using ftp/sftp or rcp/scp, or even mailing a result file to an email address. In some cases, you may have no choice but to copy the data for an HTC job. Although most batch systems (PBS included) tend to rely on shared file systems being available, sometimes the data that you need is not available on those file systems, and sometimes it might be too large. Maybe
you have 1000 input files of 100 MB each and only 1 GB of disk space available (i.e., you have room to store a couple of input files at a time, but not enough to store all of them on the shared file systems). Performance is the other main reason why people sometimes stage files in and out. When staging a file for performance, you are essentially paying an upfront cost to copy the file from a slow storage system (such as NFS) to a faster one (such as the local disk), so that repeated reads later can use the faster storage medium. Sometimes these repeated reads happen during the lifespan of a single program (e.g., the program may need to read a given file over and over again during its execution rather than read it once and store the information internally in memory). Other times, the file is reused as many different instances of a program are run for a given HTC application. Recall that having 1000 or 10,000 jobs to run does not mean you will automatically have access to an equivalent number of resources. Generally, a batch system will run a few of your programs at a time and queue the rest until a resource becomes available. In this case, if you have a file that does not change between runs (often called a constant file), that file can be copied to local disk once and then repeatedly reused as other copies of the program are run. The following example illustrates a PBS submission script for a movie CG-rendering program that takes not only an input frame to render and the output image to which to render it, but also a texture input indicating a database of scene textures to use for the frame. Since this texture database can be reused for other frames that may later be rendered on this node, we copy it to local disk once and reuse the local copy from there on out.11 File staging as it relates to creating local copies for performance reasons requires that you be aware of how, and when, the local disk space is cleaned up (Fig. 8.13). If you are sharing the local disk space with other users and the compute setup does not somehow automatically clean it up, then computing etiquette suggests that you have a way of cleaning up the local copies when your HTC run is complete (a sketch of such a cleanup helper follows Fig. 8.13). Conversely, if the nodes in the cluster have a mechanism in place for automatically cleaning up local disk space (e.g., every time they reboot), that event must also be anticipated.

    #!/bin/bash

    #PBS -q largeQueue

    if [ ! -e /local-disk/jdoe/scene-textures.dat ]
    then
        cp /shared-disk/jdoe/scene-textures.dat \
            /local-disk/jdoe/scene-textures.dat
    fi

    render-frame scene-1-frame-1.dat \
        --textures /local-disk/jdoe/scene-textures.dat \
        scene-1-frame-1.tiff

Figure 8.13 Submission script for file staging example.

11 Note that /local-disk and /shared-disk are used only as examples. Every organization has its own setup for its compute clusters, and each user will need to determine the geography of his or her compute environment.
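As a concrete illustration of that etiquette, the sketch below removes the constant files a campaign has staged to local disk. It is an assumption-laden example rather than part of the chapter's tooling: it reuses the /local-disk/jdoe layout of Fig. 8.13, and it would have to be run once on every compute node the campaign touched (e.g., via a trivial PBS job per node).

    #!/usr/bin/perl
    # cleanup-staged.pl -- remove constant files staged to a node's local
    # disk once an HTC campaign is finished. Sketch only; assumes the
    # /local-disk/jdoe layout used in Fig. 8.13.
    use strict;
    use warnings;

    foreach my $file (glob('/local-disk/jdoe/*.dat')) {
        print "removing staged file $file\n";
        unlink $file or warn "could not remove $file: $!\n";
    }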
4.4. Grid systems

For the most part, the information given in this chapter is independent (except in the specific syntax) of the system providing the cluster management. Whether you are talking about PBS, SGE, Condor, or a grid such as Globus (http://www.globus.org/toolkit/) or Genesis II (http://www.cs.virginia.edu/vcgr/wiki/index.php/The_Genesis_II_Project; Morgan, 2007), generally speaking there will be a way to execute jobs using a qsub-like mechanism, a way of querying information about running or completed jobs, and a way of killing or cleaning up jobs. However, there are a few differences between traditional batch systems and grids that are worth pointing out. Grid systems, like batch queuing systems, give users the ability to start, monitor, and manage jobs on remote back-end resources. They differ from batch systems in the flexibility that they offer users, both in the types and numbers of resources and in the availability of tools for job management and control. Batch systems usually restrict users to clusters of similarly configured machines (generally, though not always, of the same operating system and make). They also typically back-end to resources under a single administrative domain, inevitably limiting the number of resources available for use. Grids, on the other hand, are designed to support greatly varying resource types from numerous administrative domains. It is not at all uncommon for a grid system to include resources from multiple universities, companies, or national labs, ranging in type from large supercomputers or workstations running variations of UNIX, to small desktop computers running Mac OS X or Windows, to clusters of computers sitting in racks in a machine room somewhere. In fact, a grid system will often contain among its resources other batch queuing systems.
While many batch systems can front-end for heterogeneous compute nodes (i.e., compute nodes of differing architectures and operating systems), this capability is not generally put to use in most organizations. Usually, a given queue will submit jobs to only one type of compute node (sometimes identical in every regard, sometimes differing in insignificant ways such as hard drive size or clock speed). Grids, however, by their very nature tend to be quite diverse, supporting large numbers and types of resources ranging from Windows to Linux, desktop to rack-mount, and fast to slow. Sometimes the machines in grids will have policies in place that prohibit execution when someone is logged in to the machine, and sometimes they will not. This diversity means that when you submit a job to a grid, you will often need to specify the resource constraints applicable to your job, such as what operating system it needs and how much memory it requires. Given that grids support heterogeneous sets of machines, these machines are highly unlikely to share a common file system (which, you will recall, was an outright assumption for most batch systems). Some grids do support shared namespaces and file systems through the use of grid-specific device drivers, such as Genesis II's FUSE file system for Linux or its G-ICING Installable File System for Windows, but this is by no means guaranteed. Given this restriction, HTC applications running on a grid will often have no choice but to stage data in and out. Another difference between grids and batch systems is that grids often support machines in wildly differing administrative domains and situations. When an HTC job runs on a cluster of machines in a controlled environment, such as a power-conditioned machine room, a user can be reasonably confident that the application will run for hours, or even a day or more, without interruption. However, when you start including machines in public computer labs at a university, or even those sitting in students' dorm rooms, the chances of a machine being powered off or rebooted skyrocket. For this reason, when working with grids you will often need to be even more vigilant about picking appropriate job lengths and checkpointing. Finally, in a grid system the chances of your application being installed on any given machine, or installed with the correct plug-ins, modules, or libraries that you need, become vanishingly small. For this reason, grids often include some sort of mechanism for application configuration and deployment. While it may seem that using a grid instead of a compute cluster only complicates an already complex problem, it is important to realize that the benefits of grids can often outweigh these drawbacks. Grids are usually orders of magnitude larger than clusters in terms of numbers of resources. They tend to be undersubscribed, whereas compute clusters are frequently oversubscribed. Also, they provide many other features and functions that clusters simply cannot, such as data
sharing and collaboration, fault-tolerance, Quality of Service (QoS) guarantees, etc. For many people, these benefits make the added complications worthwhile.
5. Summary

In this chapter, we have provided a brief introduction to HTC techniques as they relate to the sciences. We have tried to describe some of the more common patterns in the hope that the examples are both illustrative and potentially useful to users. However, no single example can ever be a one-size-fits-all solution. Every application has its own nuances and requirements, and each solution will by necessity tend to be unique to that application. We have shown that a good working knowledge of scripting can be invaluable to an HTC user, and that familiarity with basic tools such as grep, sed, and awk tremendously enhances the ways in which a user can manage and control his or her jobs. Finally, we have tried to provide enough of an introduction to more advanced HTC topics, such as staging and checkpointing, to give the reader an idea of other areas of computation to explore if they seem relevant or important to his or her application space. HTC has been and remains one of the more effective means of parallelization available to the researcher. A good understanding of these techniques and mechanisms will aid you as you produce not only future applications but also the data that you will one day analyze with those applications. And while it is an unfortunate fact of life that you sometimes must work with existing software over which you have little or no control, a working understanding of HTC techniques, combined with a little upfront planning, will help you keep your HTC control and submission scripts simple.
REFERENCES

Chandra, R., Menon, R., Dagum, L., Kohr, D., Maydan, D., and McDonald, J. (2000). Parallel Programming in OpenMP. Morgan Kaufmann. ISBN 1-55860-671-8.
Gropp, W., Lusk, E., and Skjellum, A. (1994). Using MPI: Portable Parallel Programming with the Message-Passing Interface. Scientific and Engineering Computation Series. MIT Press, Cambridge, MA, 307 pp. ISBN 0-262-57104-8.
Leach, P. J., and Naik, D. C. (1997). A Common Internet File System (CIFS/1.0) Protocol. Internet Draft, 19 December 1997. http://tools.ietf.org/html/draft-leach-cifs-v1-spec-01.txt.
Ousterhout, J. K. (1994). Tcl and the Tk Toolkit. Addison-Wesley, Reading, MA. ISBN 0-201-63337-X.
Morgan, M. M. (2007). Genesis II: Motivation, architecture, and experiences using emerging Web and OGF standards. In Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid, May 2007. ISBN 0-7695-2833-3.
Sun Microsystems, Inc. (1989). NFS: Network File System Protocol Specification. IETF RFC 1094. March 1989.
Thain, D., Tannenbaum, T., and Livny, M. (2005). Distributed computing in practice: The Condor experience. Concurrency Comput. Pract. Exp. 17(2–4), 323–356. doi:10.1002/cpe.938.
C H A P T E R
N I N E
Large Scale Transcriptome Data Integration Across Multiple Tissues to Decipher Stem Cell Signatures

Ghislain Bidaut*,†,‡ and Christian J. Stoeckert Jr.§

Contents
1. Introduction
2. Systems and Data Sources
   2.1. Computing environment
   2.2. Data source
   2.3. Normalization
   2.4. Databases
   2.5. Stem cells generalized hierarchy
3. Data Integration
   3.1. Integrating data to a final compendium indexed by common gene identifier
   3.2. Vector projection
   3.3. Variation filtering
4. Artificial Neural Network Training and Validation
   4.1. Leave-one-out validation—generation of 31 ANN models
   4.2. Minimal error data set
   4.3. Independence testing
   4.4. Applying the whole algorithm
   4.5. Results interpretation
5. Future Development and Enhancement Plans
Acknowledgments
References

* Inserm, UMR891, CRCM, Integrative Bioinformatics, Marseille, France
† Institut Paoli-Calmettes, Marseille, France
‡ Univ Méditerranée, Marseille, France
§ Center for Bioinformatics, Department of Genetics, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, USA
Methods in Enzymology, Volume 467 ISSN 0076-6879, DOI: 10.1016/S0076-6879(09)67009-9
© 2009 Elsevier Inc. All rights reserved.
Abstract

A wide variety of stem cells has been reported to exist and renew several adult tissues, raising the question of the existence of a stemness signature, that is, a common molecular program of differentiation. To detect such a signature, we applied a data integration algorithm to several DNA microarray datasets generated by the Stem Cell Genome Anatomy Project (SCGAP) Consortium on several mouse and human tissues, generating a cross-organism compendium that we submitted to a single-layer artificial neural network (ANN) trained to attribute differentiation labels, from totipotent stem cells to differentiated cells (five labels in total were used). The inherent architecture of the system allowed us to study the biology behind stem cell differentiation stages, and the ANN isolated a 63-gene stemness signature. This chapter presents technological details on DNA microarray integration, ANN training through leave-one-out cross-validation, and independent testing on uncharacterized adult tissues by automated detection of the differentiation capabilities of human prostate and mouse stomach progenitors. All scripts of the Stem Cell Analysis and characterization by Neural Networks (SCANN) project are available on the SourceForge Web site: http://scann.sourceforge.net
1. Introduction

In recent years, hundreds of data sets have been deposited in public repositories such as the Gene Expression Omnibus (Barrett et al., 2007) and ArrayExpress (Parkinson et al., 2009), including experiments on cancer studies, developmental biology, and others, in multiple organisms. These databases have the potential to shed light on unresolved biological questions, such as the determination of a common cell signature among the different stem and progenitor cell types reported to exist (stemness signature). Several projects have given researchers the ability to perform global queries on such databases, such as the SPELL Web interface (meta-analysis in Saccharomyces cerevisiae; see Hibbs et al., 2007) or the GeneSapiens system (Homo sapiens; see Kilpinen et al., 2008), which queries an integrated database generated from public transcriptome repositories. These systems are highly nonspecific, as they are based on techniques to integrate hundreds of datasets. On a smaller scale, several groups have developed data integration algorithms for cancer tumor classification on multiple datasets. The goal of these is to improve classification robustness when applying a classifier trained on a given dataset to an independent dataset. Chuang et al. (2007) pursued a protein-network-based classification by superimposing a breast cancer gene expression dataset on a large-scale protein network to detect subnetworks (subregions of the full interactome)
whose expression is highly correlated to distant metastasis. Results showed a significant increase in classifier accuracy when applied to independent data. Numerous rank-based methods have also been proposed, such as the work of Xu et al. (2005), where top-scoring pairs of genes are identified to form a marker list for prostate cancer. The last class of methods is based on combining inter-study data into a final dataset by specific data transformation methods, including data renormalization (Shen et al., 2004). We propose the extension of a method previously applied (Scearce et al., 2002), the vector projection, to integrate multiple microarray datasets and answer the question of stemness existence, that is, to discover a shared transcriptional signature between multiple tissues. A classifier is trained on an integrated stem cell dataset (compendium) generated from a set of individual stem cell DNA microarray datasets, each of these experiments being consistently labeled on a generalized stem cell hierarchy. After extracting the signature on a training set, we applied it to characterize two tissues, mouse stomach progenitors and human prostate progenitors, to identify the type of stem or progenitor cell represented. Sohal et al. (2008) integrated data across multiple platforms to study hematopoietic stem and progenitor cells. Gerrits et al. (2008) integrated DNA microarray profiles and genetic linkage data from two genetically distinct mouse types to find a stem cell signature in hematopoietic stem cells. This chapter describes the study from an experimental point of view, covering the data sources, scripts, and algorithms that constitute the Stem Cell Analysis by Neural Network (SCANN) system; detailed biological results are available from a previous study (Bidaut and Stoeckert, 2009).
2. Systems and Data Sources

2.1. Computing environment

To manipulate such data structures, we assume proficiency in programming, preferably in languages such as Perl, Python, or Java, which allow for quick development of data file manipulations and mathematical transformations. The Perl language is used by the authors throughout this chapter for all scripting tasks. In addition, R and Bioconductor1 (Gentleman et al., 2004) must be installed in order to normalize the Affymetrix® DNA arrays. These languages can be run in most environments, but a Unix-like system (Solaris, CentOS, or Ubuntu Linux) on a server-class machine is preferred (a multicore server with 8 GB+ RAM and 10 GB+ of disk space is recommended, especially if the algorithms presented here are to be applied to larger datasets). The software described in this chapter has been made available on SourceForge as the SCANN package, version 1.0. The complete archive can be retrieved at http://scann.sourceforge.net

1 Available from http://www.bioconductor.org.
2.2. Data source

Data sources are multiple and heterogeneous. We first trained the system with data generated by the Stem Cell Genome Anatomy Project (SCGAP) Consortium.2 The consortium generated data in multiple tissues, each member being in charge of a particular stem cell type. Most data are available for download from individual laboratories, with links accessible from the main SCGAP Web site; other data are accessible through the respective authors' Web sites. Although multiple data types were generated, including immunohistochemistry experiments for localized gene expression measurements, only the DNA array data were kept in our integration study. Table 9.1 summarizes data sources, types, platforms, and locations on the Web.
2.3. Normalization

Normalization was done with the Bioconductor package, an R-based bioinformatics library that allows for the analysis of most biological data types, including DNA/protein sequences, microarray data, and others. We use the affy package, which allows for normalization of multiple Affymetrix chip types. Detailed documentation is available from the affy vignette (command vignette("affy") at the R prompt). The following procedure was followed on an example dataset measured on the Affymetrix® HG-U133A platform:

    $ mkdir data_tmp
    $ tar xf Data.tar -C data_tmp
    $ cd data_tmp
Note that if the archive contains data from multiple platforms (for instance, MG-U430 A and B), the data must be uncompressed into separate directories. The following commands are then applied from the R environment:

    > library(affy)
    > RawData = ReadAffy()
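The per-platform separation can itself be scripted. The following Perl sketch is hypothetical (it assumes a naming convention of one tar archive per platform, e.g., Data-MG-U430A.tar, which is not prescribed by the SCGAP data); it unpacks each archive into its own directory so that ReadAffy() never sees a mixture of chip types.

    #!/usr/bin/perl
    # unpack-platforms.pl -- unpack each platform archive into its own
    # directory. Sketch; assumes archives named Data-<platform>.tar.
    use strict;
    use warnings;

    foreach my $archive (glob('Data-*.tar')) {
        (my $dir = $archive) =~ s/^Data-(.*)\.tar$/data_tmp_$1/;
        mkdir $dir unless -d $dir;
        system('tar', 'xf', $archive, '-C', $dir) == 0
            or warn "failed to unpack $archive\n";
    }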
The show() command gives details on loaded data.
2 http://www.scgap.org
Table 9.1 Summary of data sources, platforms, and locations

Author/lab | Tissue | Platform | Source
Ochsner et al. (2007) | Mouse embryonic liver | Affymetrix MG-U430A | http://liver-hsc.scgap.org/data.html
Rowe et al. (unpublished) | Mouse bone | Affymetrix MG-U74Av2, B, C | http://skeletalbiology.uchc.edu/30_ResearchProgram/304_gap/index.htm
Ivanova et al. (2002) | Human fetal liver (HSCs) | Affymetrix HG-U95Av2, B, C | http://www.cbil.upenn.edu/SCGAP/related_data.html
Ivanova et al. (2002) | Mouse fetal liver (HSCs) | Affymetrix MG-U74Av2, B, C | http://www.cbil.upenn.edu/SCGAP/related_data.html
Ivanova et al. (2002) | Mouse adult bone marrow | Affymetrix MG-U74Av2, B, C | http://www.cbil.upenn.edu/SCGAP/related_data.html
Ivanova et al. (2002) | Mouse embryonic stem cells | Affymetrix MG-U74Av2, B, C | http://www.cbil.upenn.edu/SCGAP/related_data.html
Ivanova et al. (2002) | Mouse neural stem cells | Affymetrix MG-U74Av2, B, C | http://www.cbil.upenn.edu/SCGAP/related_data.html
Ivanova et al. (unpublished) | Human cord blood (HSCs) | Affymetrix HG-U133A, B | http://www.cbil.upenn.edu/SCGAP/data.html
Ivanova et al. (unpublished) | Human adult bone marrow | Affymetrix HG-U133A, B | http://www.cbil.upenn.edu/SCGAP/data.html
Ivanova et al. (unpublished) | Mouse adult bone marrow | Affymetrix MG-U430A, B | http://www.cbil.upenn.edu/SCGAP/data.html
Oudes et al. (2006) | Human prostate progenitors | Affymetrix HG-U133A, B | Available from the authors
Mills et al. (2002) | Mouse stomach progenitors | Affymetrix Mu11K A, B | Available from the authors
Total: five distinct groups | 12 distinct tissues | Six distinct platforms |

The human prostate and mouse stomach datasets (printed in italic in the original table) are test datasets characterized independently by the system.
    > show(RawData)
    AffyBatch object
    size of arrays = 712 x 712 features (63 kb)
    cdf = HG-U133A (22283 affyids)
    number of samples = 20
    number of genes = 22283
    annotation = hgu133a
    notes =
Data are then normalized with the expresso function (affy package), using the following options for loess normalization:

    > expr = expresso(RawData, bgcorrect.method = "mas",
    +     normalize.method = "loess", pmcorrect.method = "mas",
    +     summary.method = "medianpolish")
After normalization, the expression matrix held in the data object expr is exported to disk:

    > write.table(exprs(expr), "NormalizedData.txt")
    > q()
After normalizing all data, we have in hand several individual datasets grouped by tissue. Please note that data normalization was done with 2007 versions of Linux/Perl and R/Bioconductor; results obtained with more recent versions may vary slightly from what we obtained at that time.
2.4. Databases

To integrate data on a common framework (i.e., a common identification system), several databases must be downloaded and used.

1. The DFCI Resourcerer database: a compendium of continuously maintained annotation files for standard DNA microarray platforms (Tsai et al., 2001). The platform files for our datasets are available through FTP download.3 These files are used to generate probe ID-to-gene ID correspondence tables.
2. The NCBI HomoloGene database (Wheeler et al., 2008): an in silico generated database of homologs across fully sequenced genomes.4 Homologs are represented as lists of NCBI gene IDs/symbols indexed by taxon IDs, and each homolog group is indexed by a unique HomoloGene ID linking several gene IDs from different taxa. This database is used to build a gene ID-to-HomoloGene ID correspondence table.
3. The NCBI gene_info database: a flat-file version of the NCBI Entrez Gene database (Wheeler et al., 2008).5 It provides information on genes (Gene Ontology terms, symbol, and synonyms). We use this file to build a gene ID-to-symbol conversion table that allows mapping of gene IDs from heterogeneous species to our final integrated compendium (see Section 3.2).

3 ftp://occams.dfci.harvard.edu/pub/bio/tgi/data/Resourcerer
4 ftp://ftp.ncbi.nih.gov/pub/HomoloGene/current/homologene.data
5 ftp://ftp.ncbi.nih.gov/gene/DATA/gene_info.gz
2.5. Stem cells generalized hierarchy

To train a classifier on heterogeneous datasets, a coordinated, homogeneous description that can be applied to all of these data was created with the SCGAP consortium members. These descriptions give the state of differentiation of a given tissue, upon which we wish to perform training and classification. To include prior knowledge in the system, we used the well-established stem cell hierarchy from the hematopoiesis system and generalized it to all tissues in our compendium by including other stem cell types (totipotent stem cells). The full hierarchy is shown in Fig. 9.1A and includes the following types:
A: Totipotent stem cells: capable of self-renewal and able to generate all cell types
B: Multipotent stem cells: capable of self-renewal and able to generate most cell types
C: Progenitor cells: capable of generating several cell types
D: Lineage-committed progenitor (LCP) cells: capable of generating a single or a restricted number of cell types
E: Differentiated cells: cells displaying a final phenotype
These cell categories were used to label our data consistently for tissue training and classification (Table 9.2). Some tissues do not contain all the categories; for instance, mouse bone is represented only in categories C, D, and E, and the classifier has to cope with such missing categories.
Figure 9.1 (A) The generalized stem cells hierarchy. Arrows denote the hierarchical order of cell differentiation as well as self-renewal capability of totipotent and multipotent stem cells. (B) The stem cell model vectors are represented (highlighted is the multipotent stem cells model vector that peaks for the multipotent category).
Table 9.2 Stem cell tissues and their category coverage

Tissue | Stem cell categories
Mouse embryonic liver | C, E
Mouse bone | C, D, E
Human fetal liver (HSCs) | B, D, E
Mouse fetal liver (HSCs) | B, D, E
Mouse adult bone marrow (HSCs) | B, C, D, E
Mouse embryonic stem cells (ESCs) | A
Mouse neural stem cells (NSCs) | B
Human cord blood (HSCs) | B, D
Human adult bone marrow (HSCs) | B, D
Mouse adult bone marrow (HSCs) | B, C
Human prostate progenitors | X, E
Mouse stomach progenitors | X, E
12 distinct tissues | Five distinct categories

The two uncharacterized tissues (human prostate and mouse stomach progenitors, printed in italic in the original table) are noted X.
3. Data Integration

3.1. Integrating data to a final compendium indexed by common gene identifier

The databases described in Section 2.4 were parsed, and hash tables were generated in order to integrate all the datasets into a final compendium using a common identifier. The following steps were followed. Once normalized, platform-specific probe IDs were collapsed to common gene profiles by expression averaging, leaving a set of gene expression profiles indexed by gene IDs. We simultaneously collapsed multiple microarrays when necessary (for instance, mouse bone tissue samples are profiled on Affymetrix U74Av2, Bv2, and Cv2). Then, these sets of expression profiles were aligned separately for each organism, and two separate tables were generated for mouse and human. Finally, the two organisms' data were merged through the HomoloGene ID-to-gene ID table. HomoloGene IDs were kept as long as at least one corresponding gene ID was present in either mouse or human. This resulted in a final data matrix of 18,720 homologs and 82 tissue samples (see file all_expression_Princeton_UConn_Baylor_WashU_ISB_separated_samples.txt, available from the SCANN Web site) describing gene expression across developmental stages for different tissues and organisms. Multiple IDs were kept within the file (mouse gene ID, human gene ID) for data verification purposes.
The following scripts from the SCANN distribution were used for the analysis:

match_resourcerer.pl: converts probe IDs to gene IDs
collapse_chips.pl: combines several probe-level profiles mapping to the same gene ID into a single profile identified by that gene ID, and simultaneously combines several chips into a single data file
align_human_mouse.pl: aligns gene profiles from mouse and human into a single profile indexed by HomoloGene ID

Note that missing values were propagated in the final file as "NaN."
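The heart of the collapsing step performed by collapse_chips.pl is a grouped average: all probe-level profiles mapping to the same gene ID contribute, and the collapsed profile is their element-wise mean. The following stripped-down Perl illustration is not the actual SCANN script; the two-column probe2gene.txt map, the expression.txt file name, and the absence of a header line or NaN handling are simplifying assumptions.

    #!/usr/bin/perl
    # Collapse probe-level profiles to gene-level profiles by averaging.
    # Sketch; expects probe2gene.txt ("probeID geneID" pairs) and a
    # tab-delimited, headerless expression matrix with probe IDs in the
    # first column. NaN handling is omitted for brevity.
    use strict;
    use warnings;

    my %gene_of;
    open(my $map, '<', 'probe2gene.txt') or die "probe2gene.txt: $!";
    while (<$map>) {
        chomp;
        my ($probe, $gene) = split;
        $gene_of{$probe} = $gene;
    }
    close $map;

    my (%sum, %n);
    open(my $expr, '<', 'expression.txt') or die "expression.txt: $!";
    while (<$expr>) {
        chomp;
        my ($probe, @values) = split /\t/;
        my $gene = $gene_of{$probe} or next;   # skip unannotated probes
        $sum{$gene}[$_] += $values[$_] for 0 .. $#values;
        $n{$gene}++;
    }
    close $expr;

    foreach my $gene (sort keys %sum) {
        my @avg = map { $_ / $n{$gene} } @{ $sum{$gene} };
        print join("\t", $gene, @avg), "\n";
    }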
3.2. Vector projection

Samples were selected from every tissue to form groups of progenitor/differentiated cells. We populated the hierarchy as much as possible, depending on the available samples (Table 9.2). To quickly capture gene expression variation across categories, we projected each gene expression profile across categories for a given tissue onto a set of five vectors modeling the gene expression profiles we wished to detect. The vectors are detailed in Fig. 9.1B; the arrow shows the example of the multipotent stem cell model vector (B), which peaks over category B and has a lower expression over the other categories. Each vector has been designed to extract genes with a higher expression over a given category. The projection itself is a dot product:

    p = <gene_profile, model_vector>    (9.1)
The two vectors gene_profile and model_vector are normalized to 1.0 before projection, to remove variation effects inherent to the nature of the data. Many genes are characterized by missing expression values, and several tissues are not profiled on all five vector types. To cope with missing values, we devised a strategy in which the dot product is computed without the missing points; to compensate, both vectors, restricted to the nonmissing categories, are renormalized. A projection matrix of 18 K genes, 12 tissues, and 5 projection values per tissue was obtained. This is the dataset we submit to the neural network for training and classification.
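In code, the missing-value-aware projection amounts to dropping the NaN positions from both vectors, renormalizing what remains to unit length, and taking the dot product. The Perl sketch below is illustrative only; the real logic lives in the SCANN scripts, and the example numbers are invented.

    #!/usr/bin/perl
    # Projection of a gene profile onto a model vector (Eq. 9.1), skipping
    # missing ("NaN") categories and renormalizing the surviving entries.
    use strict;
    use warnings;

    sub project {
        my ($profile, $model) = @_;
        my (@p, @m);
        for my $i (0 .. $#$profile) {        # keep only complete positions
            next if $profile->[$i] eq 'NaN' || $model->[$i] eq 'NaN';
            push @p, $profile->[$i];
            push @m, $model->[$i];
        }
        return 'NaN' unless @p;
        for my $v (\@p, \@m) {               # renormalize to unit length
            my $norm = 0;
            $norm += $_ ** 2 for @$v;
            $norm = sqrt($norm);
            return 'NaN' if $norm == 0;
            $_ /= $norm for @$v;
        }
        my $dot = 0;
        $dot += $p[$_] * $m[$_] for 0 .. $#p;
        return $dot;
    }

    # Example: profile with one missing category vs. a vector peaking on B.
    my @profile = (0.2, 0.9, 'NaN', 0.1, 0.05);   # made-up numbers
    my @model   = (0.2, 1.0, 0.2, 0.2, 0.2);
    printf "projection = %.3f\n", project(\@profile, \@model);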
3.3. Variation filtering

The final dataset was variation filtered: we kept only the genes characterized by a projection value of at least th_stage over n_stage tissues for a given developmental stage. Table 9.3 recapitulates these settings. Thresholds were parameterized separately for each stage, to compensate for the variation in expression overlap between stem cell stages. (There is a
Table 9.3 Parameter settings for the vector projection threshold (th_stage) over a minimum set of tissues (n_stage)

Stem cell stage | th_stage | n_stage
A: Totipotent stem cells | 0.62 | 20
B: Multipotent stem cells | 0.64 | 20
C: Progenitor cells | 0.8 | 20
D: Lineage-committed progenitors | 0.8 | 18
E: Differentiated cells | 0.94 | 3
higher gene expression overlap for early progenitors than for differentiated cells, which are more heterogeneous.) This leads to a final input dataset of 3939 genes.
4. Artificial Neural Network Training and Validation

To extract the common molecular program from the set of tissues, we trained a multiclass, single-layer artificial neural network on the final combined dataset. The ANN presented in Fig. 9.2 is an extension of the single-layer associative memory of Greer and Khan (2007). It is built around five neurons corresponding to the five stem cell developmental stages A, B, C, D, and E, each of them characterized by a set of n weights, n being the input data size. The output of each neuron is defined as the following dot product:

    y_j = Σ_{i=0}^{N} w_ij x_i    (9.2)

where x is the input data (the expression profile projections defined in Section 3.2), w is the set of weights for the current neuron, and y is the neuron activity. The ANN is trained with a subset of the data, the training set, this being the projected data for hematopoietic stem cells, mouse neuronal stem cells, mouse embryonic stem cells, and mouse bone progenitors, representing a total of 31 tissues. Human prostate and mouse stomach progenitor data (comprising nine samples in total) are left out for independent testing.
4.1. Leave-one-out validation—generation of 31 ANN models

The training is organized through several cross-validation steps performed on a reduced set of genes, to find the set of genes minimizing the training error. At each cross-validation step, one tissue is left out (leave-one-out cross-validation, LOO), and the network is trained for 200 epochs with the 30 remaining tissues presented in a random order at each epoch.
[Figure 9.2 flowchart. Top: data compendium for different stem/progenitor tissues generated from large-scale DNA microarray integration (82 experiments in 40 tissues, 18 K genes). Steps: (1) data integration by controlled experiment labeling and homology alignment (HomoloGene database); (2) vector projection on five stem cell gene expression profile basis vectors; (3) variation-based filtering (3939 genes kept). Separation into a training set (mouse and human HSC, mouse ESC, mouse neural SC, mouse bone; 31 tissues) and a testing set (human prostate, mouse stomach epithelium; 9 tissues). The training set goes through leave-one-out cross-validation of ANN training, with five neurons (totipotent stem cell, multipotent stem cell, progenitor, lineage-committed progenitor, differentiated cell) weighted over genes 1..n, yielding 31 ANN models. Weights are averaged over all models and ranked, and the top 16 genes per neuron are conserved to optimize classification. The combined models classify independent data (unknown tissues).]

Figure 9.2 The full analysis procedure. From the top are basic normalization and annotation steps, data integration, model vector projections, and variation filtering. After this step, the 3939-gene dataset is subdivided into training and testing subsets. The training dataset goes into a double loop of leave-one-out validation by the artificial neural network (ANN) and size reduction (only the significant genes are kept, according to a schedule defined in Section 4.2; not represented here). Finally, 31 ANN models are kept for a minimal error rate obtained with 63 genes, and combined for testing on the independent dataset by majority voting.
At each epoch, the tissues are presented, and the weights are updated proportionally to the difference between the obtained output and the desired output:

    Δw_j = α(n) [y_j − y_dj] x    (9.3)

where w_j is the jth weight vector, α(n) is a monotonically decreasing function of n (the number of epochs), y_j is the obtained output, y_dj is the desired output, and x is the input vector. α(n) is defined as follows:

    α(n) = 1 / (n/30 + 1)    (9.4)
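A compact Perl rendering of one training update makes the bookkeeping of Eqs. (9.2)-(9.4) explicit. This is a sketch, not an excerpt from classtbynnshuffl.pl, and the weight change is written in the error-correcting form (the sign convention is chosen so that the output moves toward the desired output).

    use strict;
    use warnings;

    sub alpha {                        # learning-rate schedule, Eq. (9.4)
        my ($n) = @_;
        return 1.0 / ($n / 30.0 + 1.0);
    }

    sub update_weights {
        my ($w, $x, $y_desired, $epoch) = @_;
        my $y = 0;                     # neuron output, Eq. (9.2)
        $y += $w->[$_] * $x->[$_] for 0 .. $#$w;
        # Weight change of Eq. (9.3), in error-correcting form.
        my $a = alpha($epoch);
        $w->[$_] += $a * ($y_desired - $y) * $x->[$_] for 0 .. $#$w;
        return $y;
    }

    # One update of a 3-weight neuron toward a desired output of 1.
    my @w = (0.1, 0.2, 0.3);
    my @x = (1.0, 0.5, 0.0);
    update_weights(\@w, \@x, 1.0, 1);
    print "updated weights: @w\n";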
After 200 epochs, the LOO cross-validation yields 31 ANN models. For classification purposes, these 31 models are combined through majority vote, for instance, for the characterization of unknown tissues/samples (script classifytissue.pl; see also Section 4.3). We plotted the quadratic error during the ANN training (Fig. 9.3A) and testing (Fig. 9.3B) phases to ensure that no signs of overtraining were present. Overtraining is typically revealed by increasing squared-error curves, showing that the network has overspecialized on some training samples and is not able to generalize to others. This is a necessary test to ensure both the quality of the training procedure and the absence of biases in the training dataset.
4.2. Minimal error data set

During training, we progressively reduced the training input size. The first training set was the 3939 genes retained after variation filtering (Section 3.3). Subsequently, the set was reduced by keeping the m most significant weights for each network (weights were sorted in decreasing order of absolute value, and the top m genes were kept). We plotted the systematic error for m taken from this list of values (Fig. 9.4A): m = [400, 300, 200, 175, 150, 125, 100, 75, 60, 65, 50, 45, 40, 35, 30, 25, 23, 22, 20, 18, 16, 14, 12, 10, 8, 7, 6, 5]. The minimal error was obtained for m = 16, corresponding to a set of 63 genes, as shown in Fig. 9.4A. Five errors were reported on individual ANNs (majority voting set this error to 0).
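The pruning step itself is simple in spirit: sort the genes by the absolute value of their weight and keep the top m. A hypothetical Perl fragment (not taken from the SCANN scripts):

    use strict;
    use warnings;

    # Keep the indices of the m largest-magnitude weights (sketch).
    sub top_m_genes {
        my ($weights, $m) = @_;
        my @order = sort { abs($weights->[$b]) <=> abs($weights->[$a]) }
                    0 .. $#$weights;
        return @order[0 .. $m - 1];
    }

    # e.g. with m = 2: weights (0.1, -0.9, 0.4) keep indices 1 and 2.
    my @kept = top_m_genes([0.1, -0.9, 0.4], 2);
    print "kept genes: @kept\n";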
4.3. Independence testing

The 31 ANN models obtained for m = 16 (63 genes) are combined in a majority vote for testing and classification on independent data. We used these for the classification of human prostate progenitor and mouse stomach epithelium samples. These tissues potentially contain adult stem cell progenitors, but with unknown differentiation capabilities. To classify them on the generalized stem cell hierarchy shown in Fig. 9.1A, we presented them to the network and obtained the following results:
Figure 9.3 Training error for training (A) and testing (B) sets during leave-one-out procedure. The traces are strictly decreasing and show no signs of overtraining.
Mouse stomach progenitors: classified as category C (progenitors)
Human prostate progenitors: classified as category B (multipotent stem cells)
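The voting logic used to combine the 31 models is equally plain: each model casts one of the five labels, and the most frequent label wins. The fragment below is a minimal sketch of that logic, not the actual classifytissue.pl.

    use strict;
    use warnings;

    # Majority vote over per-model labels 'A'..'E' (sketch).
    sub majority_vote {
        my @predictions = @_;
        my %votes;
        $votes{$_}++ for @predictions;
        my ($winner) = sort { $votes{$b} <=> $votes{$a} } keys %votes;
        return $winner;
    }

    # 29 models say 'C', one says 'B', one says 'D' => 'C' wins.
    print majority_vote(('C') x 29, 'B', 'D'), "\n";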
4.4. Applying the whole algorithm

The whole data analysis pipeline (variation filtering, vector projection, ANN training with leave-one-out testing) is implemented within the classtbynnshuffl.pl script from the SCANN package. Here is the command-line syntax for a 200-epoch run with independence testing on the mouse stomach and human prostate tissues:
Figure 9.4 Systematic errors. In (A), the graph shows the sum of errors across all 31 ANN models for a progressively reduced input dataset. The minimum is obtained for 63 genes; please note that the error is plotted for individual ANN models and that it decreases to 0 when taking majority voting into account. In (B), we plotted the quadratic error rate of the 31 individual ANN models, for an input set of 63 genes, corresponding to the 31 held-out tissues. ANN models committing errors are indicated by arrows.

    $ /home/bidaut/bin/scann/classtbynnshuffl.pl \
        -i /data/stem-cell-project-data/final_dataset/all_expression_Princeton_UConn_Baylor_WashU_ISB_separated_samples.txt \
        -h /home/bidaut/data/annotations/list_gene_id_human.txt \
        -m /home/bidaut/data/annotations/list_gene_id_mouse.txt \
        -g /home/bidaut/data/annotations/gene2go \
        -mp 1 \
        -t 'mouse gut 1' -t 'mouse gut 2' \
        -t 'human prostate 1' -t 'human prostate 2' -t 'human prostate 3' \
        -t 'human prostate 4' -t 'human prostate 5' -t 'human prostate 6' \
        -t 'human prostate 7' \
        -nepoch 200

list_gene_id_human.txt and list_gene_id_mouse.txt are
tab-delimited files containing NCBI geneIDs, gene symbols, and textual description for human and mouse, respectively. These are generated from the NCBI gene_info file (Section 2.4).
4.5. Results interpretation

The proposed neural network architecture (single layer) allows for two kinds of detailed exploration that are usually not possible with most classification systems, which operate as "black boxes" and do not allow insight for further understanding of the classification process and eventual biases in the training data:
- Ranking the weights in increasing order for each of the five stages allows extraction of the genes reported by the classifier to be of utmost importance in the biology of differentiation at that particular stem cell stage.
- Ranking genes on the vector y obtained by applying an input vector x (corresponding to a given tissue to characterize) to the neural network (y = x.w) makes it possible to isolate the gene profiles critical for the proper association of that tissue.

Although a hidden layer might lower the classification error, the complexity of the data might drive the network to overtraining, artificially fitting the weights to the training data without being able to generalize to unknown tissues. Also, this type of multilayer perceptron does not allow for weight interpretation, and the ability to correlate lists of markers with stem cell developmental stages would have been lost. The list of 63 genes linked with every stem cell stage is available from the supporting Web site and is detailed in Bidaut and Stoeckert (2009). Briefly, we found genes involved in development (Hopx), other genes involved in cancer (Letmd1), and some stem cell markers (CD109 is a cell surface antigen found on a subset of hematopoietic stem cells, FIAT is a transcriptional regulator of osteoblastic functions, and Sfrp4 is a Wnt pathway inhibitor that plays a central role in cell fate decisions).
5. Future Development and Enhancement Plans

Our analysis identified genes not previously linked to stem cell differentiation or cancer; these genes are regulated by stem cell genes, which are downstream receptors of development pathways. Although these genes are
good discriminators, they are poor descriptors of the biology linked to differentiation. A possible improvement would be to sort out these genes and keep only the upstream regulators of development/differentiation for every stem cell differentiation stage, by interactome–transcriptome integration (Chuang et al., 2007). We are also planning improvements to our technological base, through (i) implementation and parallelization of the algorithm on a Linux Beowulf cluster, and (ii) direct use of stem cell data stored in public repositories, to extend our data compendium. Also, a recently published package could allow us to perform annotations and data integration under R (Kuhn et al., 2008). Improvement in classification is envisioned through boosting, and implementation on a public server is planned.
ACKNOWLEDGMENTS

Ghislain Bidaut is funded by the Institut National de la Santé et de la Recherche Médicale, the Fondation pour la Recherche Médicale, and the Institut National du Cancer (Grant 08/3D1616/Inserm-03-01/NG-NC). This work was initially funded by grant U01 DK63481 to Chris Stoeckert. Thanks to Wahiba Gherraby for reading the manuscript and to all members of the SCGAP consortium for sharing data and insights for this project.
REFERENCES

Barrett, T., Troup, D. B., Wilhite, S. E., Ledoux, P., Rudnev, D., Evangelista, C., Kim, I. F., Soboleva, A., Tomashevsky, M., and Edgar, R. (2007). NCBI GEO: Mining tens of millions of expression profiles—Database and tools update. Nucleic Acids Res. 35(Database issue), D760–D765.
Bidaut, G., and Stoeckert, C. J. Jr. (2009). Characterization of unknown adult stem cell samples by large scale data integration and artificial neural networks. Pac. Symp. Biocomput. 356–367.
Chuang, H. Y., Lee, E., Liu, Y. T., Lee, D., and Ideker, T. (2007). Network-based classification of breast cancer metastasis. Mol. Syst. Biol. 3, 140.
Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., et al. (2004). Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol. 5(10), R80.
Gerrits, A., Dykstra, B., Otten, M., Bystrykh, L., and de Haan, G. (2008). Combining transcriptional profiling and genetic linkage analysis to uncover gene networks operating in hematopoietic stem cells and their progeny. Immunogenetics 60(8), 411–422.
Greer, B., and Khan, J. (2007). Online analysis of microarray data using artificial neural networks. Methods Mol. Biol. 377, 61–74.
Hibbs, M. A., Hess, D. C., Myers, C. L., Huttenhower, C., Li, K., and Troyanskaya, O. G. (2007). Exploring the functional landscape of gene expression: Directed search of large microarray compendia. Bioinformatics 23(20), 2692–2699.
Ivanova, N. B., Dimos, J. T., Schaniel, C., Hackney, J. A., Moore, K. A., and Lemischka, I. R. (2002). A stem cell molecular signature. Science 298(5593), 601–604.
Kilpinen, S., Autio, R., Ojala, K., Iljin, K., Bucher, E., Sara, H., Pisto, T., Saarela, M., Skotheim, R. I., Björkman, M., Mpindi, J. P., Haapa-Paananen, S., et al. (2008). Systematic bioinformatic analysis of expression levels of 17,330 human genes across 9,783 samples from 175 types of healthy and pathological tissues. Genome Biol. 9(9), R139.
Kuhn, A., Luthi-Carter, R., and Delorenzi, M. (2008). Cross-species and cross-platform gene expression studies with the Bioconductor-compliant R package 'annotationTools'. BMC Bioinformatics 9, 26.
Mills, J. C., Andersson, N., Hong, C. V., Stappenbeck, T. S., and Gordon, J. I. (2002). Molecular characterization of mouse gastric epithelial progenitor cells. Proc. Natl. Acad. Sci. USA 99(23), 14819–14824.
Ochsner, S. A., Strick-Marchand, H., Qiu, Q., Venable, S., Dean, A., Wilde, M., Weiss, M. C., and Darlington, G. J. (2007). Transcriptional profiling of bipotential embryonic liver cells to identify liver progenitor cell surface markers. Stem Cells 25(10), 2476–2487.
Oudes, A. J., Campbell, D. S., Sorensen, C. M., Walashek, L. S., True, L. D., and Liu, A. Y. (2006). Transcriptomes of human prostate cells. BMC Genomics 7, 92.
Parkinson, H., Kapushesky, M., Kolesnikov, N., Rustici, G., Shojatalab, M., Abeygunawardena, N., Berube, H., Dylag, M., Emam, I., Farne, A., Holloway, E., Lukk, M., et al. (2009). ArrayExpress update—From an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res. 37(Database issue), D868–D872.
Scearce, L. M., Brestelli, J. E., McWeeney, S. K., Lee, C. S., Mazzarelli, J., Pinney, D. F., Pizarro, A., Stoeckert, C. J. Jr., Clifton, S. W., Permutt, M. A., Brown, J., Melton, D. A., et al. (2002). Functional genomics of the endocrine pancreas: The pancreas clone set and PancChip, new resources for diabetes research. Diabetes 51(7), 1997–2004.
Shen, R., Ghosh, D., and Chinnaiyan, A. M. (2004). Prognostic meta-signature of breast cancer developed by two-stage mixture modeling of microarray data. BMC Genomics 5(1), 94.
Sohal, D., Yeatts, A., Ye, K., Pellagatti, A., Zhou, L., Pahanish, P., Mo, Y., Bhagat, T., Mariadason, J., Boultwood, J., Melnick, A., Greally, J., et al. (2008). Meta-analysis of microarray studies reveals a novel hematopoietic progenitor cell signature and demonstrates feasibility of inter-platform data integration. PLoS ONE 3(8), e2965.
Tsai, J., Sultana, R., Lee, Y., Pertea, G., Karamycheva, S., Antonescu, V., Cho, J., Parvizi, B., Cheung, F., and Quackenbush, J. (2001). Resourcerer: A database for annotating and linking microarray resources within and across species. Genome Biol. 2(11), SOFTWARE0002.
Wheeler, D. L., Barrett, T., Benson, D. A., Bryant, S. H., Canese, K., Chetvernin, V., Church, D. M., Dicuccio, M., Edgar, R., Federhen, S., Feolo, M., Geer, L. Y., et al. (2008). Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 36(Database issue), D13–D21.
Xu, L., Tan, A. C., Naiman, D. Q., Geman, D., and Winslow, R. L. (2005). Robust prostate cancer marker genes emerge from direct integration of inter-study microarray data. Bioinformatics 21(20), 3905–3911.
C H A P T E R
T E N
DynaFit—A Software Package for Enzymology

Petr Kuzmič

Contents
1. Introduction
2. Equilibrium Binding Studies
   2.1. Experiments involving intensive physical quantities
   2.2. Independent binding sites and statistical factors
3. Initial Rates of Enzyme Reactions
   3.1. Thermodynamic cycles in initial rate models
4. Time Course of Enzyme Reactions
   4.1. Invariant concentrations of reactants
5. General Methods and Algorithms
   5.1. Initial estimates of model parameters
   5.2. Uncertainty of model parameters
   5.3. Model-discrimination analysis
6. Concluding Remarks
   6.1. Model discrimination analysis
   6.2. Optimal design of experiments
Acknowledgments
References
Abstract

Since its original publication, the DynaFit software package [Kuzmič, P. (1996). Program DYNAFIT for the analysis of enzyme kinetic data: Application to HIV proteinase. Anal. Biochem. 237, 260–273] has been used in more than 500 published studies. Most applications have been in biochemistry, especially in enzyme kinetics. This paper describes a number of recently added features and capabilities, in the hope that the tool will continue to be useful to the enzymological community. Fully functional DynaFit continues to be freely available to all academic researchers from http://www.biokin.com.
BioKin Ltd., Watertown, Massachusetts, USA

Methods in Enzymology, Volume 467 ISSN 0076-6879, DOI: 10.1016/S0076-6879(09)67010-5
© 2009 Elsevier Inc. All rights reserved.
1. Introduction

DynaFit (Kuzmič, 1996) is a software package for the statistical analysis of experimental data that arise in biochemistry (e.g., enzyme kinetics; Leskovar et al., 2008), biophysics (protein folding; Bosco et al., 2009), organic chemistry (organic reaction mechanisms; Storme et al., 2009), physical chemistry (guest–host complexation equilibria; Gasa et al., 2009), food chemistry (fermentation dynamics; Van Boekel, 2000), chemical engineering (bio-reactor design; Von Weymarn et al., 2002), environmental science (bio-sensors for heavy metals; Le Clainche and Vita, 2006), and related areas. The common features of these diverse systems are that (a) the underlying theoretical model is based on the mass action law (Guldberg and Waage, 1879); (b) the model can be formulated in terms of stoichiometric equations; and (c) the experimentally observable quantity is a linear function of concentrations or, more generally, populations of reactive species. The main use of DynaFit is in establishing the detailed molecular mechanisms of the physical, chemical, or biological processes under investigation. Once the molecular mechanism has been identified, DynaFit can be used for routine quantitative determination of either microscopic rate constants or thermodynamic equilibrium constants that characterize individual reaction steps. DynaFit can be used for the statistical analysis of three different classes of experiments: (1) the progress of chemical or biochemical reactions over time; (2) the initial rates of enzyme reactions, under either the rapid-equilibrium or the steady-state approximations (Segel, 1975); and (3) equilibrium ligand-binding studies. Regardless of the type of experiment, the main benefit of using the DynaFit package is that it allows the investigator to specify the fitting model in the biochemical notation (e.g., E + S <==> E.S --> E + P) instead of mathematical notation (e.g., v = kcat[E]0[S]0/([S]0 + Km)). For example, to fit a set of initial rates of an enzyme reaction to a steady-state kinetic model for the "Bi Bi Random" mechanism (Segel, 1975, p. 647) (Scheme 10.1), the investigator can specify the following text in the DynaFit input file:

    [data]
       data = rates
       approximation = steady-state
    [mechanism]
       E + A <==> E.A      : k1 k2
       E.A + B <==> E.A.B  : k3 k4
       E.A.B <==> E.B + A  : k5 k6
       E.B <==> E + B      : k7 k8
       E.A.B --> E + P + Q : k9
    [constants]
       k8 = (k1 k3 k5 k7) / (k2 k4 k6)
    ...
[Scheme 10.1: the Bi Bi Random mechanism. E + A <==> E.A (k1, k2); E.A + B <==> E.A.B (k3, k4); E.A.B <==> E.B + A (k5, k6); E.B <==> E + B (k7, k8); E.A.B --> E + P + Q (k9).]

Scheme 10.1
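The [constants] line in the input file above is not an arbitrary choice: because the four binding steps of Scheme 10.1 form a closed cycle, detailed balance (microscopic reversibility) fixes one rate constant in terms of the other seven. Going around the cycle,

    k1 k3 k5 k7 = k2 k4 k6 k8,   hence   k8 = (k1 k3 k5 k7) / (k2 k4 k6)

which is exactly the constraint supplied to DynaFit.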
The program will internally derive the initial rate law corresponding to this steady-state reaction mechanism (or any arbitrary mechanism), and perform the least-squares fit of the experimental data. This allows the investigator to focus exclusively on the biochemistry, rather than on the mathematics. Using exactly equivalent notation, one can analyze equilibrium binding data, such as those arising in competitive ligand displacement assays, or time-course data from continuous assays. Importantly, the DynaFit algorithm does not make any assumptions regarding the relative concentrations of reactants. Specifically, it is no longer necessary to assume that the enzyme concentration is negligibly small compared to the concentrations of reactants (substrates and products) and modifiers (inhibitors and activators). This feature is especially valuable for the kinetic analysis of ‘‘slow, tight’’ enzyme inhibitors (Morrison and Walsh, 1988; Szedlacsek and Duggleby, 1995; Williams and Morrison, 1979). Since its original publication (Kuzmicˇ, 1996), DynaFit has been utilized in more than 500 journal articles. In the intervening time, many new features have been added. The main purpose of this report is to give a brief sampling of several newly added capabilities, which might be of interest specifically to the enzymological community. The survey of DynaFit updates is by no means comprehensive; the full program documentation is available online (http://www.biokin.com/dynafit). This article has been divided into four parts. The first three parts touch on the three main types of experiments: (1) equilibrium ligand binding studies; (2) initial rates of enzyme reactions; and (3) the time course of enzyme reactions. The fourth and last part contains a brief overview of selected data-analytical approaches, which are common to all three major experiment types.
2. Equilibrium Binding Studies

DynaFit can be used to fit, or to simulate, equilibrium binding data. The main purpose is to determine the number of distinct noncovalent molecular complexes, the stoichiometry of these complexes in terms of component molecular species, and the requisite equilibrium constants. The most recent version of the software includes features and capabilities that go beyond the original publication (Kuzmič, 1996). For example, DynaFit can now be used to analyze equilibrium binding data involving (at least in principle) an unlimited number of simultaneously varied components. A practically useful four-component mixture might include (1) a protein kinase; (2) a Eu-labeled antibody (a FRET donor) raised against the kinase; (3) a kinase inhibitor, whose dissociation constant is being measured; and (4) a fluorogenic FRET-acceptor molecule competing with the inhibitor for binding. Investigations are currently ongoing into the optimal design of such multicomponent equilibrium binding studies.
2.1. Experiments involving intensive physical quantities

DynaFit can analyze equilibrium binding experiments involving intensive physical quantities. Unlike their counterparts, the extensive physical quantities, intensive quantities do not depend on the total amount of material present in the system. Instead, intensive quantities are proportional to the mole fractions of the chemical or biochemical species. A prime example of an intensive physical quantity is the NMR chemical shift (assuming that fast-exchange conditions apply, where the observed chemical shift is a weighted average of the chemical shifts of all microscopic states of the given nucleus).

We have recently used this technique to investigate the guest–host complexation mechanism in a system involving three different ionic species of a guest molecule (paraquat, acting as the "ligand") binding to a crown-ether molecule (acting as the "receptor"), with either 1:1 or 1:2 stoichiometry (Gasa et al., 2009). This guest–host system involved four components forming up to nine noncovalent molecular complexes, and a correspondingly large number of microscopic equilibrium constants. DynaFit has also been used in the NMR context to determine the binding affinity between the RIZ1 tumor suppressor protein and a model peptide representing histone H3 (Briknarová et al., 2008). The following illustrative example involves the use of DynaFit for the highly precise determination of a protein–ligand equilibrium binding constant.
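To make the notion of an intensive observable concrete, the following minimal sketch (plain Python, not DynaFit code; the function name and numerical values are hypothetical illustrations) computes the predicted fast-exchange chemical shift for the simplest 1:1 binding equilibrium R + L <==> R.L as a mole-fraction-weighted average of the free and bound shifts:

import numpy as np

def predicted_shift(R_total, L_total, Kd, delta_free, delta_bound):
    # Equilibrium complex concentration from the quadratic mass-balance
    # equation [RL]^2 - ([R]0 + [L]0 + Kd)[RL] + [R]0[L]0 = 0.
    b = R_total + L_total + Kd
    RL = (b - np.sqrt(b * b - 4.0 * R_total * L_total)) / 2.0
    x_bound = RL / R_total  # mole fraction of receptor present as complex
    # Intensive observable: weighted average of the two limiting shifts.
    return (1.0 - x_bound) * delta_free + x_bound * delta_bound

# Example: 0.125 mM receptor titrated with 0-2 mM ligand, Kd = 0.087 mM.
shifts = predicted_shift(0.125, np.linspace(0.0, 2.0, 21), 0.087, 0.0, 0.2)

Note that the predicted shift depends on the bound mole fraction only, not on the absolute amount of material, which is precisely what the intensive keyword expresses in the script below.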
2.1.1. NMR study of protein–protein interactions
Figure 10.1 (unpublished data courtesy of K. Briknarová and J. Bouchard, University of Montana) displays the changes in NMR chemical shifts for six different protons and six different nitrogen nuclei in the PR domain from the transcription factor PRDM5 (Deng and Huang, 2004), depending on the concentration of a model peptide ligand. The NMR chemical shift data for all 12 nuclei were analyzed in the global mode (Beechem, 1992). The main purpose of this experiment was to determine the strength of the binding interaction. It was assumed that the binding occurs with the simplest 1:1 stoichiometry. A DynaFit code fragment corresponding to Scheme 10.2 is shown as follows:

[mechanism]
R + L <==> R.L : Kd1 dissociation
[responses]
intensive
[data]
plot titration
...
Figure 10.1 NMR chemical shift titration of the PRDM5 protein (total concentration varied between 0.125 and 0.1172 mM) with a model peptide ligand, plotted against the ligand concentration (0-2 mM). Left: 1H chemical shifts of six selected protons. Right: 15N chemical shifts of six selected nitrogen nuclei. The chemical shifts for all 12 nuclei were fit globally (Beechem, 1992) to the binding model shown in Scheme 10.2.

[Scheme 10.2: R + L <==> R.L, with dissociation equilibrium constant Kd1.]

Note the use of the keyword intensive in the [responses] section of the script, which means that the observed physical quantity
(chemical shift) is proportional not to the quantities of the various molecular species present in the sample, but rather to the corresponding mole fractions. Also note the keyword titration, which is used to produce a simple Cartesian plot, with the ligand concentration [L] formally acting as the only independent variable, even though the experiment was performed by gradual addition of ligand to the same initial protein sample. This means that both the protein (titrand) and the model peptide (titrant) concentrations were changing with each added aliquot. It is very important to recognize that, in this case, the experimental data points are not statistically independent, as is implicitly assumed by the theory of nonlinear least-squares regression (Johnson, 1992, 1994; Johnson and Frasier, 1985). However, the practice of incrementally adding titrant to the same base solution of the titrand has been firmly established in protein–protein and protein–ligand NMR titration studies.

The best-fit value of the dissociation equilibrium constant, determined from the data shown in Fig. 10.1, was Kd1 = (0.087 ± 0.007) [0.073 ... 0.108] mM. The values in square brackets are approximate confidence intervals determined by the profile-t method of Bates and Watts (Brooks et al., 1994). Please note that, unlike the formal standard error shown in the parentheses, the confidence intervals are not symmetrical about the best-fit value. Using the global fit method (Beechem, 1992), the strength of the protein–ligand binding interaction was determined for a number of different nuclei, and the results were highly consistent; the coefficient of variation for the equilibrium constant was approximately 10%, regardless of which chemical shift was monitored.
2.2. Independent binding sites and statistical factors

The most recent version of DynaFit (Kuzmič, 1996) allows the investigator to properly define the relationship between (a) intrinsic rate constants or equilibrium constants and (b) macroscopic rate constants or equilibrium constants. This distinction is necessary in the analysis of multiple identical binding sites. As the simplest possible example, consider the binding of L, a ligand molecule, to R, a receptor molecule that contains two identical and independent binding sites (Scheme 10.3).

[Scheme 10.3: R + L <==> RL (association rate constant 2 ka, dissociation rate constant kd) and RL + L <==> RL2 (association rate constant ka, dissociation rate constant 2 kd).]
In Scheme 10.3, ka and kd are intrinsic rate constants. The statistical factors ("2") shown in Scheme 10.3 express the fact that there are two identical pathways for L to associate with R, but only one way for L to associate with RL. Similarly, RL2 can yield RL in two equivalent ways, whereas RL can dissociate into R + L in only one way. Thus, if we define the dissociation equilibrium constants for the two consecutive binding steps as K1 = [R]eq[L]eq/[RL]eq and K2 = [RL]eq[L]eq/[RL2]eq, then for independent equivalent sites we must have K2 = 4K1. In the DynaFit notation, the difference between independent and interacting binding sites can be expressed by using the following syntax:

[task]
data = equilibria
model = interacting ?
[mechanism]
R + L <==> R.L : K1 dissociation
R.L + L <==> R.L.L : K2 dissociation
[constants]
K1 = ...
K2 = ...
...

[task]
data = equilibria
model = independent ?
[mechanism]
R + L <==> R.L : K1 dissociation
R.L + L <==> R.L.L : K2 dissociation
[constants]
K2 = 4 * K1 ; <== STATISTICAL FACTOR
K1 = ...
...
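The same bookkeeping generalizes to any number of identical, independent sites. As a quick illustration (plain Python, not DynaFit code), the macroscopic stepwise association constant for the i-th binding event to a receptor with n identical sites carries the statistical factor (n - i + 1)/i relative to the intrinsic site constant:

from fractions import Fraction

def statistical_factors(n):
    # Factor multiplying the intrinsic association constant at each step.
    return [Fraction(n - i + 1, i) for i in range(1, n + 1)]

print([str(f) for f in statistical_factors(2)])  # ['2', '1/2'] -> ratio 4:1
print([str(f) for f in statistical_factors(3)])  # ['3', '1', '1/3'] -> 9:3:1

The 9:3:1 ratio for three sites is exactly the constraint used in the trimeric PNP example that follows.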
2.2.1. Interacting versus independent sites on a trimeric enzyme
Błachut-Okrasińska et al. (2007) utilized DynaFit for a comprehensive kinetic investigation of mRNA cap analogues binding to the eIF4E regulatory protein (see also Niedzwiecka et al., 2007). From the same laboratory comes a study of the trimeric purine nucleoside phosphorylase (PNP) interacting with nucleoside multisubstrate inhibitors (Wielgus-Kutrowska et al., 2007). A representative equilibrium binding experiment is shown in Fig. 10.2 (raw experimental data from Wielgus-Kutrowska and Bzowska (2006), courtesy of B. Wielgus-Kutrowska, Warsaw University). The object of the experiment was to determine whether the inhibitor binding sites on the PNP trimer are independent or interacting. Under the given assay conditions, the PNP enzyme is a nondissociative homotrimer. The presence of three separate inhibitor binding sites is
Figure 10.2 Equilibrium titration of the trimeric PNP from Cellulomonas sp. (0.47 μM as monomer) with the nucleoside analog inhibitor 2-amino-9-[2-(phosphonomethoxy)ethyl]-6-sulfanylpurine, plotted against the ligand concentration (0-10 μM); F/F0 represents the relative fluorescence intensity (PNP plus ligand divided by PNP only). See Wielgus-Kutrowska and Bzowska (2006) for details. Solid curve: least-squares fit to the interacting-sites model (Scheme 10.4). Dashed curve: least-squares fit to the independent-sites model, in which the equilibrium constants were linked via statistical factors such that K1:K2:K3 = 9:3:1.
[Scheme 10.4: P <==> PL <==> PL2 <==> PL3, with association equilibrium constants K1, K2, and K3 for the three consecutive ligand-binding steps.]
represented in Scheme 10.4 by three association equilibrium constants, K1, K2, and K3. If the inhibitor sites were genuinely independent, the titration data would fit sufficiently well to an equilibrium binding model in which the ratios K1:K2:K3 = 9:3:1 are strictly maintained. In contrast, if the binding sites are interacting, it would be necessary to relax the fitting model such that the equilibrium constants could attain arbitrary values. To perform the model discrimination analysis (Myung and Pitt, 2004) using the Akaike Information Criterion (AICc) (Burnham and Anderson, 2002), the requisite DynaFit script contains the following text:

[task]
data = equilibria
model = interacting ?
[mechanism]
P + L <===> P.L : K1 equilibrium
P.L + L <===> P.L.L : K2 equilibrium
P.L.L + L <===> P.L.L.L : K3 equilibrium
[constants] ; vary independently
K3 = 1 ?
K2 = 3 ?
K1 = 9 ?
...

[task]
data = equilibria
model = independent ?
[constants] ; link via statistical factors
K3 = 1 ?
K2 = 3 * K3
K1 = 9 * K3
...
As can be seen from Fig. 10.2 (dashed curve), the independent-sites model provides a poor description of the available data. The interacting-sites model (solid curve) produces a much better fit. This result is in agreement with previously published investigations of the same system (Bzowska, 2002; Bzowska et al., 2004; Wielgus-Kutrowska et al., 2002, 2007).
3. Initial Rates of Enzyme Reactions

The study of the initial rates of enzyme-catalyzed reactions defines the traditional approach to mechanistic enzymology (Segel, 1975). Earlier versions of the DynaFit software package (Kuzmič, 1996) were suitable for the analysis of initial-rate data under the rapid-equilibrium approximation (Kuzmič, 2006), where it is assumed that the chemical steps in an enzyme mechanism are negligibly slow in comparison with all association and dissociation steps. The current version of DynaFit extends the initial-rate analysis to the more general steady-state approximation (Kuzmič, 2009a). This section introduces the important topic of thermodynamic cycles, which are relevant in steady-state enzyme mechanisms, especially those involving multiple substrates (e.g., kinases or reductases). A simulation study, involving dihydrofolate reductase (DHFR) as a model system, provides an illustrative example.
3.1. Thermodynamic cycles in initial rate models

It is a fundamental fact of thermodynamics that the Gibbs free energy change is independent of any particular path between thermodynamic states. This leads to the idea of a thermodynamic box in enzyme kinetic mechanisms (Gilbert, 1999, p. 271).
There are numerous logically equivalent ways to express the idea of a thermodynamic box. When expressed specifically in terms of microscopic rate constants, the product of the rate constants associated with a set of arrows starting and ending at a given reactant must be the same in both directions (clockwise and counterclockwise; Scheme 10.5). This is equivalent to saying that the overall equilibrium constant associated with any cyclic path through the mechanism must be unity.

We do not usually have advance knowledge of the Gibbs free energy change associated with the uncatalyzed reaction. However, for all nonchemical steps in the mechanism (i.e., noncovalent binding and dissociation of ligands), the overall equilibrium constant for each thermodynamic cycle must be unity. We can use this fact to check the consistency of a postulated set of rate constant values. In the latest version of DynaFit, we can also use this fact to constrain the values of particular microscopic rate constants.

[Scheme 10.5: a thermodynamic box connecting E, E•A, E•B, and E•A•B through four reversible binding steps (rate constants k1-k8); going around the cycle in either direction must give the same product of rate constants, k1 × k3 × k5 × k7 = k2 × k4 × k6 × k8.]

3.1.1. Steady-state initial rate equation for DHFR
The catalytic mechanism of Escherichia coli DHFR is shown in Scheme 10.6 (Benkovic et al., 1988; Fierke et al., 1987). The abbreviations used in Scheme 10.6 are as follows: E is the DHFR enzyme; F and FH are dihydrofolate and tetrahydrofolate, respectively; N and NH are NADP+ and NADPH, respectively; compound symbols such as E·FH·N stand for molecular complexes (here, the ternary complex of E, FH, and N); and the numbers above each arrow represent microscopic rate constants (e.g., "1" stands for k1). All 22 microscopic rate constants appearing in Scheme 10.6 have been determined in a large number of independent experiments (Table 10.1).

The reaction mechanism in Scheme 10.6 contains six thermodynamic boxes that do not involve the reversible chemical step (rate constants k21 and k22). For example, moving clockwise or counterclockwise along the lower right box in Scheme 10.6, we expect the product k8 k9 k11 k13 = 27,200 to be numerically equal to k7 k10 k12 k14 = 28,000. The corresponding equilibrium constant K = k8 k9 k11 k13 / (k7 k10 k12 k14) = 0.97
Table 10.1 Microscopic rate constants in the catalytic mechanism of E. coli DHFR (Benkovic et al., 1988)

k1  = 25   μM⁻¹ s⁻¹      k2  = 1.4  s⁻¹
k3  = 8    μM⁻¹ s⁻¹      k4  = 85   s⁻¹
k5  = 12.5 s⁻¹           k6  = 2    μM⁻¹ s⁻¹
k7  = 3.5  s⁻¹           k8  = 20   μM⁻¹ s⁻¹
k9  = 40   μM⁻¹ s⁻¹      k10 = 40   s⁻¹
k11 = 1.7  s⁻¹           k12 = 5    μM⁻¹ s⁻¹
k13 = 20   s⁻¹           k14 = 40   μM⁻¹ s⁻¹
k15 = 13   μM⁻¹ s⁻¹      k16 = 300  s⁻¹
k17 = 25   μM⁻¹ s⁻¹      k18 = 2.4  s⁻¹
k19 = 200  s⁻¹           k20 = 5    μM⁻¹ s⁻¹
k21 = 950  s⁻¹           k22 = 0.6  s⁻¹

[Scheme 10.6: the complete catalytic cycle of E. coli DHFR, connecting the enzyme species E, E·FH, E·FH·NH, E·NH, E·F·NH, E·F, E·N, and E·FH·N through the reversible binding steps numbered 1-20 and the reversible chemical step (rate constants k21 and k22) between E·F·NH and E·FH·N; the same steps are spelled out in the [mechanism] script below.]
is indeed very nearly equal to unity. The same is true for all five remaining thermodynamic boxes, including the largest box (not counting the chemical step), defined by the path E·N → E·FH·N → E·FH → E·FH·NH → E·NH → E·F·NH → E·F → E → E·N.

The steady-state initial rate equation for DHFR, based on the comprehensive mechanism in Scheme 10.6 and derived by using the King–Altman method (King and Altman, 1956), contains 33 algebraic terms in the numerator, 65 algebraic terms in the denominator, and up to cubic exponents for the concentrations. When printed in the page layout required by this volume, the single algebraic rate equation for DHFR would occupy approximately 20 printed pages (results not shown). A quote from Segel's seminal text, discussing the "Bi Bi Random Steady-State" mechanism, is also applicable to DHFR.
The [initial rate] equation does not describe a hyperbola and, theoretically, the reciprocal plots are not linear, unless one substrate is saturating. [...] The groups of rate constants cannot be combined into convenient kinetic constants [Michaelis constants and inhibition constants]. (Segel, 1975, p. 647)
In DynaFit, we can now represent the same initial rate law, under the steady-state approximation, by entering the following text:

[task]
data = rates
approximation = steady-state
[reaction] | F + NH <==> FH + N
[enzyme] | E
[mechanism]
E + FH <==> E.FH : k1 k2
E.FH + NH <==> E.FH.NH : k3 k4
E.FH.NH <==> E.NH + FH : k5 k6
E.NH <==> E + NH : k7 k8
E.NH + F <==> E.F.NH : k9 k10
E.F.NH <==> E.F + NH : k11 k12
E.F <==> E + F : k13 k14
E + N <==> E.N : k15 k16
E.N + FH <==> E.FH.N : k17 k18
E.FH.N <==> E.FH + N : k19 k20
E.F.NH <==> E.FH.N : k21 k22
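As the next paragraph explains, DynaFit evaluates the steady-state rate law for such a mechanism numerically rather than algebraically. The following minimal sketch (plain Python with SciPy, not DynaFit code; the rate constants and concentrations are hypothetical, and a general-purpose root finder stands in for DynaFit's Newton–Raphson iteration) illustrates the numerical idea on the simplest one-substrate mechanism E + S <==> E.S --> E + P, without assuming that the substrate is in large excess over the enzyme:

from scipy.optimize import fsolve

k1, k2, k3 = 10.0, 5.0, 2.0   # hypothetical rate constants
E0, S0 = 1.0, 3.0             # enzyme and substrate of comparable magnitude

def steady_state(es):
    e = E0 - es               # enzyme mass balance
    s = S0 - es               # substrate depleted by complex formation
    return k1 * e * s - (k2 + k3) * es   # d[E.S]/dt = 0 at steady state

es = float(fsolve(steady_state, 0.0)[0])
v = k3 * es                   # initial rate, valid even when S0 is not >> E0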
The steady-state initial rate law derived internally by DynaFit consists of a system of simultaneous nonlinear algebraic equations evaluated numerically (Kuzmič, 2009a), by using the multidimensional Newton–Raphson method (Press et al., 1992, p. 379). Unlike the traditional algebraic formalism (Segel, 1975), the numerical formalism utilized by DynaFit does not make the simplifying assumption that all reactant and modifier concentrations are vastly larger than the enzyme concentration.

Given the values of the rate constants associated with the mechanism in Scheme 10.6, DynaFit was used to simulate initial reaction rates while varying the concentrations of dihydrofolate and NADPH (Fig. 10.3). The substrate saturation curves at relatively low dihydrofolate concentrations are expected to display a local maximum, followed by a decrease to an asymptotically saturating value. Importantly, DynaFit can now properly take into account the presence of thermodynamic boxes in the DHFR mechanism, in order to constrain certain rate constants based on the values of other rate constants. For example, to express the constraint k7 = k8 k9 k11 k13 / (k10 k12 k14), and a similar constraint for the rate constant k2, we would use the following DynaFit input:
[constants]
k2 = (k1 k3 k5 k7) / (k4 k6 k8)
k7 = (k8 k9 k11 k13) / (k10 k12 k14)

Figure 10.3 Simulated initial reaction rates for DHFR, based on the mechanism in Scheme 10.6 (Benkovic et al., 1988; Fierke et al., 1987) and the rate constant values listed in Table 10.1. Initial rate (μM/s) is plotted against the dihydrofolate concentration (0.01-100 μM, logarithmic scale) at [NADPH] = 0.1, 0.2, 0.4, 0.8, and 1.6 μM.
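The thermodynamic box rule is also easy to verify directly from Table 10.1. A minimal sketch (plain Python, not DynaFit code) checking the lower-right box of Scheme 10.6:

# Rate constants of the lower-right thermodynamic box (Table 10.1).
k7, k10, k12, k14 = 3.5, 40.0, 5.0, 40.0
k8, k9, k11, k13 = 20.0, 40.0, 1.7, 20.0

clockwise = k8 * k9 * k11 * k13            # 27,200
counterclockwise = k7 * k10 * k12 * k14    # 28,000
print(clockwise / counterclockwise)        # 0.97, very nearly unity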
Given that the DHFR mechanism contains six thermodynamic boxes for the noncovalent binding and dissociation steps, and also given that each thermodynamic box involves between 8 and 16 microscopic rate constants, many logically equivalent ways are available to place the overall constraints on the kinetic model. Which particular rate constants should be constrained in DynaFit models needs to be carefully evaluated on a case-by-case basis.

A casual survey of the published biochemical literature reveals occasional violations of the thermodynamic box rule. For example, Digits and Hedstrom (1999) presented a kinetic model for inosine monophosphate (IMP) dehydrogenase interacting with IMP and NAD+ (Scheme 10.7). In Scheme 10.7 (Digits and Hedstrom, 1999), the numerical values of all monomolecular rate constants are in s⁻¹ units, and the bimolecular association rate constants are shown in μM⁻¹ s⁻¹ units. The salient feature of the mechanism in Scheme 10.7 is that the enzyme–IMP complex undergoes an isomerization before cofactor binding. Clearly, the overall equilibrium constant for the noncovalent interactions is significantly different from unity, which means that at least one rate constant in
[Scheme 10.7: the kinetic model of Digits and Hedstrom (1999) for IMP dehydrogenase, in which the E·IMP complex isomerizes to E·IMP* before cofactor binding to form the ternary complex; with the rate constant values shown in the original scheme, the overall equilibrium constant around the noncovalent cycle is Keq = 28 rather than unity.]
the postulated kinetic model is in error. This error was corrected in a later report (Schlippe et al., 2004), where the kinetic mechanism was further developed using DynaFit. The explanation for the inconsistency in Scheme 10.7 (L. Hedstrom, personal communication) is that the equilibria for the formation of the ternary complexes were determined by measuring binding to the binary complexes of an inactive mutant. The relevant mutation (Cys to Ala) perturbs IMP binding in the binary complex, so IMP binding to the E·NAD complex is probably also perturbed and therefore unlikely to mimic the wild-type enzyme. Nevertheless, the measured values were utilized in the postulated reaction scheme. As a general warning, when using inactive mutants to infer rate constants in a similar fashion, special attention must be paid to the consistency of the thermodynamic boxes. A similar inconsistency in a noncovalent binding mechanism is present in a DynaFit study of the activation of the plasma membrane calcium pump isoform 4b by calmodulin (Penheiter et al., 2003). The recent addition of a thermodynamic box checking feature to DynaFit should prevent similar inconsistencies from occasionally cropping up in the published literature.
4. Time Course of Enzyme Reactions

DynaFit (Kuzmič, 1996) was initially developed to process the time course of "slow, tight" (Morrison and Walsh, 1988; Szedlacsek and Duggleby, 1995; Williams and Morrison, 1979) enzyme inhibition assays. In the intervening period, a number of features and capabilities have been added to further facilitate the analysis of reaction dynamics. For example, DynaFit can now be used to analyze "double-mixing" stopped-flow experiments (Williams et al., 2004). Microscopic rate constants can be constrained with respect to statistical factors (see Section 2.2) or thermodynamic boxes (Section 3.1), or defined as fixed ratios where the equilibrium constants are
known from independent experiments. This section describes another representative example of such recently added capabilities.
4.1. Invariant concentrations of reactants

Under highly specialized experimental circumstances, or for the purpose of modeling an in vivo biochemical system, DynaFit can now be used to simulate or fit experimental data under the assumption that the concentrations of certain reactants remain invariant, even as these reactants participate in the underlying reaction mechanism. The corresponding DynaFit notation is to use the exclamation mark:

[concentrations]
Substrate = 1.2345 !
4.1.1. SPR on-chip enzyme kinetics
The invariant-concentration technique has been utilized in building a preliminary mathematical model for the on-chip kinetics of the transglucosidase alternansucrase (E.C. 2.4.1.140) from Leuconostoc mesenteroides NRRL B-1355 (Clé et al., 2008, 2010). This enzyme catalyzes the transfer of glucose from sucrose to acceptors at their nonreducing ends. In this particular case, the acceptor was a carboxymethyl dextran surface on a surface plasmon resonance (SPR) chip. When a sucrose solution mixed with the transglucosidase enzyme is flowed over the SPR chip, the dextran oligomer chains on the chip's surface are extended with additional glucose moieties, and this process can be monitored by SPR. Importantly, the bulk sucrose concentration does not change over time, because it is being replenished by the continuous flow. A typical SPR sensorgram of the enzyme-catalyzed extension of a dextran surface is shown in Fig. 10.4 (see Clé et al., 2010 for details).

The important portion of the DynaFit script used in this analysis is shown below. Note that the enzyme–sucrose ("S") association is made irreversible in the postulated mechanism. The reasons for choosing this simplified Van Slyke–Cullen kinetic model (Slyke and Cullen, 1914) are explained in a separate report (Kuzmič, 2009b).

[task]
data = progress
task = fit
[mechanism]
E + dextran <==> E.dextran : k1 k2
E.dextran + S ---> E.dextran.S : k3
E.dextran.S ---> E.dextran + P : k4
[concentrations]
E = 0.18 !      ; invariant
S = 11700 !     ; invariant
dextran = 0.00002
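To show what the exclamation-mark notation accomplishes, the following minimal sketch (plain Python with SciPy, not DynaFit code; the rate constant values are hypothetical placeholders) integrates the mass-action rate equations for the mechanism above while holding E and S fixed at their nominal values, exactly as the "!" flag requests:

import numpy as np
from scipy.integrate import solve_ivp

k1, k2, k3, k4 = 1.0, 0.1, 1e-4, 5.0   # hypothetical rate constants
E, S = 0.18, 11700.0                   # invariant concentrations ("!")

def rhs(t, y):
    D, ED, EDS, P = y                  # dextran, complexes, product
    v1 = k1 * E * D - k2 * ED          # E + dextran <==> E.dextran
    v2 = k3 * ED * S                   # E.dextran + S ---> E.dextran.S
    v3 = k4 * EDS                      # E.dextran.S ---> E.dextran + P
    return [-v1, v1 - v2 + v3, v2 - v3, v3]

sol = solve_ivp(rhs, (0.0, 60.0), [2e-5, 0.0, 0.0, 0.0], max_step=0.5)

Because E and S never appear among the state variables, their concentrations cannot drift, which mimics the continuous replenishment by the flow.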
Figure 10.4 SPR sensorgram of the enzyme-catalyzed extension of a dextran surface (signal versus time, 0-60 s). Transglucosidase alternansucrase at various concentrations was coinjected with sucrose (11.7 mM) over the surface of the SPR chip. Curves A-E: enzyme concentrations [E]0 = 0.018, 0.022, 0.03, 0.044, and 0.09 μM, respectively.
The surface catalysis phenomena involved, for example, in starch biosynthesis and in cellulose degradation are still relatively poorly understood. The significance of the on-chip enzyme kinetics experiment is that it can potentially shed light on biologically relevant heterogeneous phase processes. At this preliminary phase of the investigation, the best-fit values of microscopic rate constants (not shown) were obtained separately for each recorded progress curve. The goal of the ongoing research is to produce a global (Beechem, 1992) mathematical model for the on-chip kinetics.
5. General Methods and Algorithms

This section briefly summarizes selected features and capabilities added to the DynaFit software package since its original publication (Kuzmič, 1996). These general algorithms are applicable to all types of experimental data (progress curves, initial rates, and complex equilibria) being analyzed. This selection of added features is not exhaustive, but it emphasizes some of the most difficult tasks in the analysis of biochemical data:
(1) how do we know where to start (the initial estimate problem); (2) how do we know whether the best-fit parameters are good enough (the confidence interval problem); and (3) how do we know which fitting model to choose among several alternatives (the model discrimination problem).
5.1. Initial estimates of model parameters

One of the most difficult tasks for a data analyst performing nonlinear least-squares regression is to come up with initial estimates of the model parameters that are sufficiently close to the true values. If the initial estimates of the rate or equilibrium constants are not sufficiently accurate, the data-fitting algorithm might converge to a local minimum, or not converge at all. This is the nature of the Levenberg–Marquardt algorithm (Marquardt, 1963; Reich, 1992), which is the main least-squares minimization algorithm used by DynaFit. The updated DynaFit software offers two different methods to avoid local minima on the least-squares hypersurface, that is, to avoid incorrect "best-fit" values of rate constants and other model parameters. The first method relies on a brute-force systematic parameter scan, and the second method uses ideas from evolutionary computing.

5.1.1. Systematic parameter scan
To increase the probability that a true global minimum is found for all rate and equilibrium constants, DynaFit allows the investigator to specify a set of alternate initial estimates. The software then generates all possible combinations of the starting values and performs the corresponding number of independent least-squares regressions. The results are ranked by the residual sum of squares. For example, let us assume that the postulated mechanism includes four adjustable rate constants, k1-k4, and that we wish to examine four different starting values (spaced by a factor of 10) for each of them. The requisite DynaFit code would read as follows:

[constants]
k1 = { 0.01, 0.1, 1, 10} ?
k2 = {0.001, 0.01, 0.1, 1} ?
k3 = {0.001, 0.01, 0.1, 1} ?
k4 = {0.001, 1, 1000, 1000000} ?
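The combinatorics behind such a scan is a plain Cartesian product of the alternate starting values. A minimal sketch (plain Python, not DynaFit code; the commented-out 'fit' routine is a hypothetical stand-in for one least-squares minimization):

from itertools import product

starts = {
    "k1": [0.01, 0.1, 1, 10],
    "k2": [0.001, 0.01, 0.1, 1],
    "k3": [0.001, 0.01, 0.1, 1],
    "k4": [0.001, 1, 1000, 1000000],
}

combos = list(product(*starts.values()))   # 4**4 = 256 starting points
print(len(combos))                         # 256 independent regressions
# results = sorted(fit(dict(zip(starts, c))) for c in combos)  # rank by SSQ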
In this case, the program would perform 4⁴ = 256 separate least-squares minimizations, starting from 256 different combinations of initial estimates. In extreme cases, the execution time required for such systematic parameter scans might reach many minutes or even hours by using the currently
available computing technology. However, for critically important data analyses, avoiding local minima, and therefore incorrect mechanistic conclusions, should be worth the wait.

5.1.2. Global minimization by differential evolution
As an alternate solution to the problem of local minima in least-squares regression analysis, DynaFit now uses the differential evolution (DE) algorithm (Price et al., 2005). DE belongs to the family of stochastic evolutionary strategy (ES) algorithms, which attempt to find the global sum-of-squares minimum by using ideas from evolutionary biology. The essential feature of any ES data-fitting algorithm is that it starts from a large number of simultaneous, randomly chosen initial estimates for all adjustable model parameters. The algorithm then evolves this population of "organisms" by allowing only the sufficiently fit population members to "sexually reproduce." In this case, by fitness we mean the sum of squares associated with each particular combination of rate constants and other model parameters (the genotype). By sexual reproduction, we mean that selected population members have their genome (i.e., model parameters) carried over into the next generation by using Nature's usual tricks: chromosomal crossover accompanied by random mutations. There are many variations on the ES computational scheme, and also a growing number of variants of the DE algorithm itself. The interested reader is encouraged to examine several recently published books and monographs (Chakraborty, 2008; Feoktistov, 2008; Onwubolu and Davendra, 2009; Price et al., 2005) for details.

Typically, the number of population members does not change through the evolutionary process, meaning that if we start with 1000 different initial estimates for each rate constant, we also have 1000 different estimates at the end, after a large number of generations have reproduced. Importantly, while we might start with a population of 1000 estimates spanning 12 or 18 orders of magnitude for each rate constant, the hope is that we end with 1000 estimates all of which are close to the best possible value.

The performance of the DE algorithm (Price et al., 2005), as implemented in DynaFit, is illustrated by using an example involving the irreversible inhibition kinetics of the HIV protease. This particular test problem was first presented in the original DynaFit publication (Kuzmič, 1996), and was subsequently reused by Mendes and Kell (1998) to test the performance of the popular software package Gepasi. The simulation software package COPASI (Hoops et al., 2006), a direct descendant of Gepasi, is also being profiled in this volume.

Figure 10.5 displays the fluorescence changes during a fluorogenic assay (Kuzmič et al., 1996; Peranteau et al., 1995) of the HIV protease. The nominal enzyme concentration was 4 nM in each of the five kinetic experiments; the nominal substrate concentration was 25 μM; the inhibitor
Figure 10.5 Least-squares fit of progress curves from the HIV protease assay in the presence of an irreversible inhibitor (signal and residuals versus time, 0-3000 s). The best-fit results were obtained by using the differential evolution algorithm (Price et al., 2005).
concentrations (curves from top to bottom in Fig. 10.5) were 0, 1.5, 3, and 4 nM (two experiments). As is discussed elsewhere (Kuzmič, 1996), each initial enzyme and substrate concentration was treated as an adjustable parameter. The vertical offset on the signal axis was also treated as an adjustable parameter for each experiment separately. The mechanistic model is shown in Scheme 10.8, where M is the monomer subunit of the HIV protease. The numbering of the rate constants in Scheme 10.8 was chosen to match a previous report (Mendes and Kell, 1998). The dimensions used throughout the analysis (see also the final results in Table 10.2) were μM for all concentrations, μM⁻¹ s⁻¹ for all second-order rate constants, and s⁻¹ for all first-order rate constants. The rate constants k11 = 0.1, k12 = 0.0001, and k21 = k41 = k51 = 100 were treated as fixed parameters in the model, whereas the rate constants k22, k3, k42, k52, and k6
[Scheme 10.8: M + M <==> E (k11, k12); E + S <==> E•S (k21, k22); E•S --> E + P (k3); E + P <==> E•P (k41, k42); E + I <==> E•I (k51, k52); E•I --> E-I (k6), where E-I is the irreversibly formed enzyme–inhibitor complex.]
were treated as adjustable parameters. To match the Gepasi test (Mendes and Kell, 1998) using the same example problem, each rate constant was constrained to remain less than 10⁵ in absolute value. In the course of the DE optimization, the rate constants were allowed to span 12 orders of magnitude (between 10⁻⁷ and 10⁵). Each adjustable concentration was allowed to vary within ±50% of its nominal value. An excerpt from the requisite DynaFit script input file is shown as follows:

[task]
data = progress
task = fit
algorithm = differential-evolution
[mechanism]
M + M <==> E : k11 k12
E + S <==> ES : k21 k22
ES ---> E + P : k3
E + P <==> EP : k41 k42
E + I <==> EI : k51 k52
EI --> EJ : k6
[constants]
k11 = 0.1
k12 = 0.0001
k21 = 100
k22 = 300 ? (0.0000001 .. 100000)
k3 = 10 ? (0.0000001 .. 100000)
k41 = 100
k42 = 500 ? (0.0000001 .. 100000)
k51 = 100
k52 = 0.1 ? (0.0000001 .. 100000)
k6 = 0.1 ? (0.0000001 .. 100000)
Table 10.2 Least-squares fit of the HIV protease inhibition data shown in Fig. 10.5: Comparison of the simulated annealing (SA) algorithm as implemented in Gepasi (Mendes and Kell, 1998) and COPASI (Hoops et al., 2006) with the differential evolution (DE) algorithm as implemented in DynaFit (Kuzmič, 1996)

Parameter        SA (Mendes and Kell, 1998)   SA (this work)(a)   DE            SA/DE
k22              201.1                        273.1               23.67         11.54
k3               7.352                        6.517               3.922         1.66
k42              1171                         1989                128.2         15.51
k52              13,140                       11,120              0.00008562    130,000,000
k6               30,000                       4453                0.0004599     9,700,000
[S]1             24.79                        24.74               24.65         1.00
[S]2             23.43                        23.46               23.37         1.00
[S]3             26.79                        26.99               26.99         1.00
[S]4             32.10                        20.92               14.39         1.45
[S]5             26.81                        17.59               16.04         1.10
[E]1             0.004389                     0.005029            0.007484      0.67
[E]2             0.004537                     0.004965            0.006568      0.76
[E]3             0.005470                     0.005796            0.007116      0.81
[E]4             0.004175                     0.004238            0.004221      1.00
[E]5             0.003971                     0.003980            0.003396      1.17
D1               0.00801                      0.00712             0.00508       1.40
D2               0.00391                      0.00490             0.00289       1.69
D3               0.00896                      0.01395             0.01354       1.03
D4               0.01600                      0.01192             0.00337       3.54
D5               0.00379                      0.00005             0.00777       0.01
Iterations       630,000                      1,025,242           -(b)          -(b)
Sum of squares   0.0211024                    0.0201911           0.0194526     1.04
Run time (h)     -(c)                         16.5(d,e)           1.1(e)        15

(a) Software Gepasi (Mendes and Kell, 1998) ver. 3.30.
(b) Iteration counts in SA and DE are not compatible.
(c) Running time not given in the original publication.
(d) Interrupted.
(e) Intel Core 2 Duo T7400 microprocessor (2.16 GHz, 667 MHz bus, 4 MB cache).
DynaFit automatically chooses the population size, based on the number of adjustable model parameters, and on the range of values they are allowed to span. In this case, the DE algorithm started with 259 separate estimates for each of the 15 adjustable model parameters (five rate constants, five locally adjusted substrate and enzyme concentrations, and five offsets on the signal axis). A representative histogram of distribution for one of the 15 adjustable
model parameters (the rate constant k52) is shown in the upper panel of Fig. 10.6.

Figure 10.6 The initial and final distribution of the rate constant k52 in the differential evolution (Price et al., 2005) fit of the HIV protease inhibition data shown in Fig. 10.5. The population contained 259 members.

Note that the 259 initial estimates of the rate constant k52 span 12 orders of magnitude. The initial random distribution of the parameter values is uniform (as opposed to Gaussian or similarly bell-shaped) on the logarithmic scale. The swarm of 259 "organisms," each carrying a unique combination of the 15 adjustable model parameters (the genotype), was allowed to evolve according to the Darwinian evolutionary principles (selection by fitness; chromosomal crossover during the "mating" of population members; random genetic mutations). After 793 generations, each of the 15 model parameters converged to a relatively narrow range of values, as shown in the bottom panel of Fig. 10.6 for the rate constant k52.

The simulated best-fit model is shown as the smooth curves in Fig. 10.5. The best-fit values of the adjustable model parameters are shown in Table 10.2, where Di is the offset on the signal axis for the individual data sets. The simulated annealing (SA) algorithm (Corana et al., 1987; Kirkpatrick et al., 1983) was chosen for comparison with DE because it appears to be the best performing global optimization method currently reported in the biochemical literature (Mendes and Kell, 1998).
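Readers who wish to experiment with this strategy outside DynaFit can reproduce its outline with an off-the-shelf DE implementation. The following minimal sketch (plain Python with SciPy, not DynaFit's own DE code) searches in log10-parameter space over the same 12 orders of magnitude; the stand-in objective function is hypothetical and merely mimics a residual sum of squares with a minimum at the Table 10.2 DE values:

import numpy as np
from scipy.optimize import differential_evolution

# Stand-in objective: a real analysis would integrate the rate equations of
# Scheme 10.8 and return the residual sum of squares for the progress curves.
target_log_k = np.log10([23.67, 3.922, 128.2, 0.00008562, 0.0004599])

def sum_of_squares(log_params):
    return float(np.sum((log_params - target_log_k) ** 2))

bounds = [(-7.0, 5.0)] * 5   # each rate constant spans 10**-7 .. 10**5
result = differential_evolution(sum_of_squares, bounds, seed=1, tol=1e-10)
best_k = 10.0 ** result.x    # decode k22, k3, k42, k52, k6 from log10 scale

Searching on the logarithmic scale mirrors the uniform-in-log10 initial population described above.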
The results listed in Table 10.2 show that the DE algorithm found a combination of model parameters that leads to a significantly lower sum of squares (i.e., a better fit) compared to the SA algorithm. Some model parameters, such as the adjustable substrate concentrations, were very nearly identical for both data-fitting methods. Other model parameters, such as the rate constants k52 and k6 that characterize the inhibitor, differed by 6-8 orders of magnitude. The SA algorithm had to be terminated manually after approximately 17 h of continued execution and more than one million iterations. The DE algorithm terminated automatically after 66 min, when the defined convergence criteria were satisfied.

We can conclude that, in the specific case of the HIV protease irreversible inhibition kinetics, the DE global optimization algorithm clearly performs significantly better than the SA algorithm. However, this does not mean that the best-fit DE parameter values listed in Table 10.2 are any closer to the true values, when compared with the SA parameters. In fact, it appears that neither set of parameter values should be regarded with much confidence (see Section 5.2). Probably the only conclusion we can safely make is that much more research is needed into the relative merits of global optimization algorithms such as DE and SA, specifically as they are applied to the analysis of biochemical kinetic data.
5.2. Uncertainty of model parameters

Most biochemists are likely to see the uncertainty of kinetic model parameters expressed only as formal standard errors. Formal standard errors are the plus-or-minus values standing next to the best-fit values of nonlinear parameters, as reported by all popular software packages for nonlinear least-squares regression, including DynaFit. However, it should be strongly emphasized that formal standard errors can (and usually do) grossly underestimate the statistical uncertainty. For a rigorous theoretical treatment of statistical inference regions for nonlinear parameters, see Bates and Watts (1988).

Johnson et al. (2009) recently stated that DynaFit (Kuzmič, 1996) users are provided only with the "standard errors [...] without additional aids to evaluate the extent to which the fitted parameters are actually constrained by the data." This statement is factually false and needs to be corrected for the record. Since version 2.23, released in January 1997 and extensively documented in the freely distributed user manual, DynaFit has always implemented the profile-t search method of Bates and Watts (Bates and Watts, 1988; Brooks et al., 1994; Watts, 1994) to compute approximate inference regions for nonlinear model parameters. The most recent update to DynaFit adds an additional aid to evaluate the extent to which the fitted parameters are constrained by the data. This aid is a particular modification of the well-established Monte-Carlo method (Straume and Johnson, 1992).
5.2.1. Monte-Carlo confidence intervals
The Monte-Carlo method (Straume and Johnson, 1992) for the determination of confidence intervals is based on the following idea. After an initial least-squares fit using the usual procedure, the best-fit values of the nonlinear parameters are used to simulate many (typically, at least 1000) artificial data sets. The idealized theoretical model curves (e.g., the smooth curves in Fig. 10.5) are always the same, but the superimposed pseudo-random noise is different every time. The 1000 slightly different sets of pseudo-experimental data are again subjected to nonlinear least-squares regression. In the end, the 1000 different sets of best-fit values for the model parameters are tallied up to construct a histogram of each parameter distribution. The range of values spanned by each histogram is the Monte-Carlo confidence interval for the given model parameter.

"Shuffle" and "shift" Monte-Carlo methods
A crucially important part of the above Monte-Carlo procedure is the simulation of the pseudo-random noise to be superimposed on the idealized data. How should we choose the statistical distribution from which the pseudo-random noise is drawn? Usually, it is assumed that the pseudo-random experimental noise has a Normal (Gaussian) distribution (Straume and Johnson, 1992) and that the individual data points are statistically independent or uncorrelated. If so, the standard deviation of this Gaussian distribution (the half-width of the requisite bell curve) can be taken as the standard error of fit from the first-pass regression analysis of the original data. However, we have recently demonstrated (Kuzmič et al., 2009) that the experimental data points recorded in at least one particular enzyme assay are not statistically independent. Instead, we see a strong neighborhood correlation among adjacent data points, spanning up to six nearest neighbors.

To reflect this possible serial correlation among nearby data points, DynaFit (Kuzmič, 1996) now allows two variants of the Monte-Carlo method, which could be called the "shift" Monte-Carlo and "shuffle" Monte-Carlo algorithms. In both cases, instead of generating presumably Gaussian errors to be superimposed on the idealized data, we merely rearrange the order of the actual residuals generated by the first-pass least-squares fit. In the shuffle variant, the residuals are reused in truly randomized order. In the shift variant of the Monte-Carlo algorithm, the order of the residuals is preserved, but the starting position changes. For example, let us assume that a particular reaction progress curve (such as one of those shown in Fig. 10.5) contains 300 experimental data points. After the first-pass least-squares fit, we could simulate up to 300 synthetic progress curves by superimposing the ordered sequence of residuals. In one such simulated curve, the first synthetic data point would be assigned
residual No. 17, the second data point residual No. 18, and so on. At the end of the ordered sequence of residuals, we wrap around to the beginning, so that after residual No. 300 has been used, the next data point receives residual No. 1. In another simulated curve, the first data point would be generated from residual No. 213, the second data point from residual No. 214, and so on.

The practical usefulness of the shift and shuffle variants of the Monte-Carlo method (Straume and Johnson, 1992) is that they avoid having to make assumptions about the statistical distribution (Gaussian, Lorentzian, etc.) of the random noise that is inevitably present in the experimental data. Interestingly, the original conception of the Monte-Carlo method (Dwass, 1957; Nichols and Holmes, 2001) was, in fact, based on permuting existing population members rather than on making distributional assumptions.

Two-dimensional histograms
The "shift" Monte-Carlo confidence intervals for the rate constants k22, k3, and k42 from the least-squares fit of the HIV protease inhibition data are shown in Fig. 10.7. The best-fit values of each model parameter are marked with a filled triangle. The rate constant k3 is characterized by a relatively narrow confidence interval (spanning from approximately 3 to 9 s⁻¹). In contrast, the Monte-Carlo confidence intervals for the rate constants k22 and k42 not only are much wider (approximately 4 orders of magnitude for k42) but also are clearly bimodal. The appearance of such a double-hump histogram for any parameter is a strong indication that (a) the model is probably severely over-parameterized, and (b) the data could very likely be fit to at least two alternate mechanisms.

Figure 10.7 Monte-Carlo confidence intervals for model parameters: Distribution histograms for the rate constants k22, k3, and k42 from the least-squares fit of the HIV protease inhibition data shown in Fig. 10.5.

In order to better diagnose possible statistical coupling between pairs of rate constants, beyond what conventional Monte-Carlo histograms can provide, DynaFit now produces two-dimensional histograms such as those shown in Fig. 10.8. The thin solid path enclosing each histogram in Fig. 10.8 is the convex hull, the shortest path entirely enclosing a set of points in a plane. The approximate area occupied by the convex hull is a useful empirical measure of parameter redundancy. If any two rate constants were truly statistically independent, the corresponding two-dimensional Monte-Carlo histogram would resemble a circular area with the highest population density appearing in the center. We can see in Fig. 10.8 that the rate constants k22 and k42 are clearly correlated, as is indicated by the elongated crescent shape of the two-dimensional histogram.

Figure 10.8 Monte-Carlo confidence intervals for model parameters: Two-dimensional correlation histograms for the rate constants k22 versus k3 (left; uncorrelated) and k22 versus k42 (right; strong correlation) from the least-squares fit of the HIV protease inhibition data shown in Fig. 10.5.

In summary, with regard to assessing the statistical uncertainty of nonlinear model parameters, DynaFit (Kuzmič, 1996) has always allowed the investigator to perform a full search in parameter space, using the profile-t method (Bates and Watts, 1988; Brooks et al., 1994; Watts, 1994). As a result of such detailed analysis, the investigator often must face the unpleasant fact that the confidence regions for rate constants, equilibrium constants, or derived kinetic parameters (e.g., Michaelis constants) not only are much larger than the formal standard errors would suggest, but perhaps also larger than would appear "publishable." However, it must be strongly emphasized that the formal standard errors for nonlinear parameters reported by DynaFit should never be given much credence. The program reports them mostly for compatibility with other software packages typically used by biochemists. In order to obtain a more realistic interpretation of the experimental data, DynaFit users are encouraged to go beyond the formal standard errors and utilize both the previously available profile-t method (Brooks et al., 1994) and the newly added modified Monte-Carlo method (Straume and Johnson, 1992).
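The residual-reuse idea behind the two variants is compact enough to sketch directly. The following minimal example (plain Python, not DynaFit code) builds synthetic data sets from the first-pass fitted curve and its residuals; each synthetic set would then be refit, and the resulting best-fit parameters tallied into histograms:

import numpy as np

rng = np.random.default_rng(0)

def synthetic_curves(fitted, residuals, n_sets, variant="shift"):
    # Superimpose reordered first-pass residuals onto the fitted curve.
    n = len(residuals)
    out = []
    for _ in range(n_sets):
        if variant == "shuffle":
            r = rng.permutation(residuals)            # fully randomized order
        else:                                         # "shift"
            r = np.roll(residuals, rng.integers(n))   # cyclic rotation only
        out.append(fitted + r)
    return np.array(out)

Because the shift variant preserves the original ordering of the residuals, any serial correlation among neighboring data points is carried over into the synthetic data.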
5.3. Model-discrimination analysis

The problem of selecting the most plausible theoretical model among several candidates (e.g., deciding whether a given enzyme inhibitor is competitive, noncompetitive, or mixed-type) represents one of the most challenging tasks facing the data analyst. Myung and Pitt (2004) and Myung et al. (2009) reviewed the recent developments in earlier volumes of this series. This section contains only a very brief summary of the model-discrimination features available in DynaFit (Kuzmič, 1996). The reader is referred to the full program documentation available online (http://www.biokin.com/dynafit/).

DynaFit (Kuzmič, 1996) currently offers two distinct methods for statistical model discrimination. First, for nested fitting models, the updated version of DynaFit continues to offer the F-statistic method previously discussed by Mannervik (1981, 1982) and many others. Second, for any group of alternate models, whether nested or nonnested, DynaFit uses the second-order Akaike Information Criterion (AICc) (Burnham and Anderson, 2002) to perform model discrimination. Briefly, the AICc criterion is defined by Eq. (10.1), where S is the residual sum of squares; nP is the number of adjustable model parameters; and nD is the number of experimental data points. For each candidate model in a collection of alternate models, DynaFit computes ΔAICc as the difference between the AICc for the particular model and the AICc for the best model (the one with the lowest value of AICc). Thus, the best model is by definition assigned ΔAICc = 0. The Akaike weight, wi, for the i-th model in a collection of m alternatives, is defined by Eq. (10.2):
AIC_c = \log S + 2 n_P + \frac{2 n_P (n_P + 1)}{n_D - n_P - 1}    (10.1)

w_i = \frac{\exp\left(-\frac{1}{2}\,\Delta AIC_c^{(i)}\right)}{\sum_{i=1}^{m} \exp\left(-\frac{1}{2}\,\Delta AIC_c^{(i)}\right)}    (10.2)
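For readers who want to reproduce the calculation outside DynaFit, here is a minimal sketch (plain Python, not DynaFit code; the example sums of squares and parameter counts are hypothetical) implementing Eqs. (10.1) and (10.2):

import numpy as np

def aicc(ssq, n_params, n_data):
    # Eq. (10.1): second-order Akaike Information Criterion.
    return (np.log(ssq) + 2 * n_params
            + 2 * n_params * (n_params + 1) / (n_data - n_params - 1))

def akaike_weights(aicc_values):
    # Eq. (10.2): weights from the differences against the best model.
    delta = np.asarray(aicc_values) - np.min(aicc_values)
    w = np.exp(-0.5 * delta)
    return w / w.sum()

# e.g., three candidate inhibition models fit to the same 50-point data set:
scores = [aicc(s, p, 50) for s, p in [(0.031, 3), (0.030, 4), (0.029, 5)]]
print(akaike_weights(scores))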
Burnham and Anderson (2002) formulated a series of empirical rules for interpreting the observed ΔAICc values for each alternate fitting model, stating that ΔAICc > 10 might be considered sufficiently strong evidence against the given model. Practical experience with the Burnham and Anderson rule suggests that it is applicable only when the number of experimental data points is a reasonably small multiple of the number of adjustable model parameters (e.g., nD < 20 nP). In some cases, the number of data points is very much larger. For example, in certain continuous assays or stopped-flow measurements, it is not unusual to collect thousands of experimental data points in order to determine two or three kinetic constants. In such cases, the ΔAICc > 10 rule has been found unreliable. In general, a candidate model should probably be rejected only if its Akaike weight, wi, is smaller than approximately 0.001.

The DynaFit notation needed to compare a series of alternate models, and to select the most plausible model if a selection is possible, is illustrated in the following input file fragment. Please note the use of question marks after each (arbitrarily chosen) model name. This notation instructs DynaFit to evaluate the plausibility of the given model in comparison with the other models that are marked identically.

[task]
model = Competitive ?
[mechanism]
E + S <===> E.S : Ks dissoc
E.S ---> E + P : kcat
E + I <===> E.I : Ki dissoc
...

[task]
model = Uncompetitive ?
[mechanism]
E + S <===> E.S : Ks dissoc
E.S ---> E + P : kcat
E.S + I <===> E.S.I : Kis dissoc
...

[task]
model = Mixed-type noncompetitive ?
[mechanism]
E + S <===> E.S : Ks dissoc
E.S ---> E + P : kcat
E + I <===> E.I : Ki dissoc
E.S + I <===> E.S.I : Kis dissoc
...

[task]
model = Partial mixed-type ?
[mechanism]
E + S <===> E.S : Ks dissoc
E.S ---> E + P : kcat
E + I <===> E.I : Ki dissoc
E.S + I <===> E.S.I : Kis dissoc
E.S.I ---> E.I + P : kcat'
...
When DynaFit is presented with a series of alternate models in this way, it will fit the available experimental data to each postulated model in turn. After the last model in the series has been fit to the data, the program presents to the user a summary table listing the values of ΔAICc. The AICc-based model discrimination feature available in DynaFit has been utilized in a number of reports (Błachut-Okrasińska et al., 2007; Collom et al., 2008; Gasa et al., 2009; Jamakhandi et al., 2007; Kuzmič et al., 2006).
6. Concluding Remarks

DynaFit (Kuzmič, 1996) has proved quite useful in a number of projects, as is evidenced by the number of journal publications that cite the program. It is hoped that the software will continue to enable innovative research. This section offers a few closing comments on DynaFit enhancements currently in development.
6.1. Model discrimination analysis

The AICc criterion is based solely on the number of optimized parameters and the corresponding sum of squares. The degree of uncertainty associated with each particular set of model parameters is completely ignored. However, if two candidate models with exactly identical numbers of adjustable parameters hypothetically produced exactly identical sums of squares, but one of these models was associated with significantly narrower confidence regions, then that model should be preferred (Myung and Pitt, 2004). The minimum description length (MDL), also known as the stochastic complexity (SC) measure (Myung and Pitt, 2004), would clearly be a more appropriate
model-discrimination criterion. Unfortunately, for technical reasons, the MDL criterion is extremely difficult to compute (Myung et al., 2009). Investigations are currently ongoing into at least an approximate computation of the MDL/SC test.
6.2. Optimal design of experiments

Most biochemists, probably like most experimentalists, prefer to do the experiment first, then proceed to data analysis, and finally to publication. However, to paraphrase the eminent statistician G. E. P. Box (Box et al., 1978), no amount of the most ingenious data analysis can salvage a poorly designed experiment. When examining the extant enzymological literature, one often wonders exactly how the concentrations were chosen. Why was an exponential series (1, 2, 4, 8, 16) used for the substrate concentrations, instead of a linear series (3, 6, 9, 12, 15) (Kuzmič et al., 2006)? Was it by design, or was it because "that's how we always did it"? Similar choices profoundly affect how much, if anything, can be learned from any given experiment. A well-established statistical theory of optimal experiment design (Atkinson and Donev, 1992; Fedorov, 1972) has been used by biochemical researchers in the past (Duggleby, 1981; Endrényi, 1981; Franco et al., 1986). At the present time, DynaFit is being modified to implement these ideas and deploy them for the computer-assisted rational design of experiments.
ACKNOWLEDGMENTS

Klára Briknarová and Jill Bouchard (University of Montana) are gratefully acknowledged for sharing their as yet unpublished NMR titration data. Jan Antosiewicz (Warsaw University) provided stimulating discussions and procured the PNP inhibition data for testing the statistical-factors feature in DynaFit; the raw experimental data were made available by Beata Wielgus-Kutrowska, Agnieszka Bzowska, and Katarzyna Breer (Warsaw University). Stephen Bornemann and his colleagues (John Innes Center, Norwich) graciously invited me to peek into the mysteries of their unique SPR on-chip kinetic system, and inspired the development of the invariant-concentration algorithm. Liz Hedstrom (Brandeis University) made helpful comments and suggestions. I am grateful to Andrei Ruckenstein (formerly of the BioMaPS Institute for Quantitative Biology, Rutgers University; currently at Boston University) for illuminating discussions regarding thermodynamic boxes in biochemical mechanisms. Sarah McCord (Massachusetts College of Pharmacy and Health Sciences) provided expert assistance in editing this manuscript.
REFERENCES

Atkinson, A., and Donev, A. (1992). Optimum Experimental Designs. Oxford University Press, Oxford.
Bates, D. M., and Watts, D. G. (1988). Nonlinear Regression Analysis and its Applications. Wiley, New York.
Beechem, J. M. (1992). Global analysis of biochemical and biophysical data. Methods Enzymol. 210, 37–54.
Benkovic, S. J., Fierke, C. A., and Naylor, A. M. (1988). Insights into enzyme function from studies on mutants of dihydrofolate reductase. Science 239, 1105–1110.
Błachut-Okrasińska, E., Bojarska, E., Stepiński, J., and Antosiewicz, J. (2007). Kinetics of binding the mRNA cap analogues to the translation initiation factor eIF4E under second-order reaction conditions. Biophys. Chem. 129, 289–297.
Bosco, G., Baxa, M., and Sosnick, T. (2009). Metal binding kinetics of bi-histidine sites used in ψ analysis: Evidence of high-energy protein folding intermediates. Biochemistry 48, 2950–2959.
Box, G. E. P., Hunter, W. G., and Hunter, J. S. (1978). Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. John Wiley, New York.
Briknarová, K., Zhou, X., Satterthwait, A., Hoyt, D., Ely, K., and Huang, S. (2008). Structural studies of the SET domain from RIZ1 tumor suppressor. Biochem. Biophys. Res. Commun. 366, 807–813.
Brooks, I., Watts, D., Soneson, K., and Hensley, P. (1994). Determining confidence intervals for parameters derived from analysis of equilibrium analytical ultracentrifugation data. Methods Enzymol. 240, 459–478.
Burnham, K. P., and Anderson, D. R. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer-Verlag, New York.
Bzowska, A. (2002). Calf spleen purine nucleoside phosphorylase: Complex kinetic mechanism, hydrolysis of 7-methylguanosine, and oligomeric state in solution. Biochim. Biophys. Acta 1596, 293–317.
Bzowska, A., Koellner, G., Wielgus-Kutrowska, B., Stroh, A., Raszewski, G., Holý, A., Steiner, T., and Frank, J. (2004). Crystal structure of calf spleen purine nucleoside phosphorylase with two full trimers in the asymmetric unit: Important implications for the mechanism of catalysis. J. Mol. Biol. 342, 1015–1032.
Chakraborty, U. K. (2008). Advances in Differential Evolution. Springer-Verlag, New York.
Clé, C., Gunning, A. P., Syson, K., Bowater, L., Field, R. A., and Bornemann, S. (2008). Detection of transglucosidase-catalyzed polysaccharide synthesis on a surface in real-time using surface plasmon resonance spectroscopy. J. Am. Chem. Soc. 130, 15234–15235.
Clé, C., Martin, C., Field, R. A., Kuzmič, P., and Bornemann, S. (2010). Detection of enzyme-catalyzed polysaccharide synthesis on surfaces. Biocatal. Biotransform., in press.
Collom, S. L., Laddusaw, R. M., Burch, A. M., Kuzmič, P., Perry, M. D., and Miller, G. P. (2008). CYP2E1 substrate inhibition: Mechanistic interpretation through an effector site for monocyclic compounds. J. Biol. Chem. 283, 3487–3496.
Corana, A., Marchesi, M., Martini, C., and Ridella, S. (1987). Minimizing multimodal functions of continuous variables with the "simulated annealing" algorithm. ACM Trans. Math. Softw. 13, 262–280.
Deng, Q., and Huang, S. (2004). PRDM5 is silenced in human cancers and has growth suppressive activities. Oncogene 23, 4903–4910.
Digits, J. A., and Hedstrom, L. (1999). Kinetic mechanism of Tritrichomonas foetus inosine 5′-monophosphate dehydrogenase. Biochemistry 38, 2295–2306.
Duggleby, R. (1981). Experimental designs for the distribution-free analysis of enzyme kinetic data. In "Kinetic Data Analysis" (L. Endrényi, ed.), pp. 169–181. Plenum Press, New York.
Dwass, M. (1957). Modified randomization tests for nonparametric hypotheses. Ann. Math. Stat. 28, 181–187.
Endrényi, L. (1981). Design of experiments for estimating enzyme and pharmacokinetic parameters. In "Kinetic Data Analysis" (L. Endrényi, ed.), pp. 137–169. Plenum Press, New York.
278
Petr Kuzmicˇ
Fedorov, V. (1972). Theory of Optimal Experiments. Academic Press, New York. Feoktistov, V. (2008). Differential Evolution: In Search of Solutions. Springer-Verlag, New York. Fierke, C. A., Johnson, K. A., and Benkovic, S. J. (1987). Construction and evaluation of the kinetic scheme associated with dihydrofolate reductase from Escherichia coli. Biochemistry 26, 4085–4092. Franco, R., Gavalda, M. T., and Canela, E. I. (1986). A computer program for enzyme kinetics that combines model discrimination, parameter refinement and sequential experimental design. Biochem. J. 238, 855–862. Gasa, T., Spruell, J., Dichtel, W., Srensen, T., Philp, D., Stoddart, J., and Kuzmicˇ, P. (2009). Complexation between methyl viologen (paraquat) bis(hexafluorophosphate) and dibenzo[24]crown-8 revisited. Chem. Eur. J. 15, 106–116. Gilbert, H. F. (1999). Basic Concepts in Biochemistry. McGraw-Hill, New York. ¨ ber die chemische Affinita¨t. J. Prakt. Chem. 127, Guldberg, C. M., and Waage, P. (1879). U 69–114. Hoops, S., Sahle, S., Gauges, R., Lee, C., Pahle, J., Simus, N., Singhal, M., Xu, L., Mendes, P., and Kummer, U. (2006). COPASI—A COmplex PAthway SImulator. Bioinformatics 22, 3067–3074. Jamakhandi, A. P., Kuzmicˇ, P., Sanders, D. E., and Miller, G. P. (2007). Global analysis of protein–protein interactions reveals multiple cytochrome P450 2E1reductase complexes. Biochemistry 46, 10192–10201. Johnson, M. L. (1992). Why, when, and how biochemists should use least squares. Anal. Biochem. 206, 215–225. Johnson, M. L. (1994). Use of least-squares techniques in biochemistry. Methods Enzymol. 240, 1–22. Johnson, M. L., and Frasier, S. G. (1985). Nonlinear least-squares analysis. Methods Enzymol. 117, 301–342. Johnson, K. A., Simpson, Z. B., and Blom, T. (2009). Global Kinetic Explorer: A new computer program for dynamic simulation and fitting of kinetic data. Anal. Biochem. 387, 20–29. King, E. L., and Altman, C. (1956). A schematic method of deriving the rate laws for enzyme-catalyzed reactions. J. Phys. Chem. 60, 1375–1378. Kirkpatrick, S., Gelatt, C., and Vecchi, M. P. (1983). Optimization by simulated annealing. Science 220, 671–680. Kuzmicˇ, P. (1996). Program DYNAFIT for the analysis of enzyme kinetic data: Application to HIV proteinase. Anal. Biochem. 237, 260–273. Kuzmicˇ, P. (2006). A generalized numerical approach to rapid-equilibrium enzyme kinetics: Application to 17b-HSD. Mol. Cell. Endocrinol. 248, 172–181. Kuzmicˇ, P. (2009a). A generalized numerical approach to steady-state enzyme kinetics: Applications to protein kinase inhibition. Biochim. Biophys. Acta—Prot. Proteom. in press, doi:10.1016/j.bbapap.2009.07.028. Kuzmicˇ, P. (2009b). Application of the Van Slyke–Cullen irreversible mechanism in the analysis of enzymatic progress curves. Anal. Biochem. 394, 287–289. Kuzmicˇ, P., Peranteau, A. G., Garcı´a-Echeverrı´a, C., and Rich, D. H. (1996). Mechanical effects on the kinetics of the HIV proteinase deactivations. Biochem. Biophys. Res. Commun. 221, 313–317. Kuzmicˇ, P., Cregar, L., Millis, S. Z., and Goldman, M. (2006). Mixed-type noncompetitive inhibition of anthrax lethal factor protease by aminoglycosides. FEBS J. 273, 3054–3062. Kuzmicˇ, P., Lorenz, T., and Reinstein, J. (2009). Analysis of residuals from enzyme kinetic and protein folding experiments in the presence of correlated experimental noise. Anal. Biochem. 395, 1–7.
DynaFit—A Software Package for Enzymology
279
Le Clainche, L., and Vita, C. (2006). Selective binding of uranyl cation by a novel calmodulin peptide. Environ. Chem. Lett. 4, 45–49. Leskovar, A., Wegele, H., Werbeck, N., Buchner, J., and Reinstein, J. (2008). The ATPase cycle of the mitochondrial Hsp90 analog trap1. J. Biol. Chem. 283, 11677–11688. Mannervik, B. (1981). Design and analysis of kinetic experiments for discrimination between rival models. In ‘‘Kinetic Data Analysis’’ (L. Endre´nyi, ed.), pp. 235–270. Plenum Press, New York. Mannervik, B. (1982). Regression analysis, experimental error, and statistical criteria in the design and analysis of experiments for discrimination between rival kinetic models. Methods Enzymol. 87, 370–390. Marquardt, D. W. (1963). An algorithm for least-squares estimation of nonlinear parameters. J. Soc. Ind. Appl. Math. 11, 431–441. Mendes, P., and Kell, D. (1998). Non-linear optimization of biochemical pathways: Applications to metabolic engineering and parameter estimation. Bioinformatics 14, 869–883. Morrison, J. F., and Walsh, C. T. (1988). The behavior and significance of slow-binding enzyme inhibitors. Adv. Enzymol. Relat. Areas Mol. Biol. 61, 201–301. Myung, J. I., and Pitt, M. A. (2004). Model comparison methods. Methods Enzymol. 383, 351–366. Myung, J. I., Tang, Y., and Pitt, M. A. (2009). Evaluation and comparison of computational models. Methods Enzymol. 454, 287–304. Nichols, T. E., and Holmes, A. P. (2001). Nonparametric permutation tests for functional neuroimaging: A primer with examples. Human Brain Map. 15, 1–25. Niedzwiecka, A., Stepin´ski, J., Antosiewicz, J., Darzynkiewicz, E., and Stolarski, R. (2007). Biophysical approach to studies of Cap-eIF4E interaction by synthetic Cap analogs. Methods Enzymol. 430, 209–245. Onwubolu, G. C., and Davendra, D. (2009). Differential Evolution: A Handbook for Global Permutation-Based Combinatorial Optimization. Springer-Verlag, New York. Penheiter, A. R., Bajzer, Zˇ., Filoteo, A. G., Thorogate, R., To¨ro¨k, K., and Caride, A. J. (2003). A model for the activation of plasma membrane calcium pump isoform 4b by Calmodulin. Biochemistry 42, 12115–12124. Peranteau, A. G., Kuzmicˇ, P., Angell, Y., Garcı´a-Echeverrı´a, C., and Rich, D. H. (1995). Increase in fluorescence upon the hydrolysis of tyrosine peptides: Application to proteinase assays. Anal. Biochem. 227, 242–245. Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1992). Numerical Recipes in C. Cambridge University Press, Cambridge. Price, K. V., Storn, R. M., and Lampinen, J. A. (2005). Differential Evolution: A Practical Approach to Global Optimization. Springer-Verlag, New York. Reich, J. G. (1992). Curve Fitting and Modelling for Scientists and Engineers. McGraw-Hill, New York. Schlippe, Y. V. G., Riera, T. V., Seyedsayamdost, M. R., and Hedstrom, L. (2004). Substitution of the conserved Arg-Tyr dyad selectively disrupts the hydrolysis phase of the IMP dehydrogenase reaction. Biochemistry 43, 4511–4521. Segel, I. H. (1975). Enzyme Kinetics. Wiley, New York. Slyke, D. D. V., and Cullen, G. E. (1914). The mode of action of urease and of enzymes in general. J. Biol. Chem. 19, 141–180. Storme, T., Deroussent, A., Mercier, L., Prost, E., Re, M., Munier, F., Martens, T., Bourget, P., Vassal, G., Royer, J., and Paci, A. (2009). New ifosfamide analogs designed for lower associated neurotoxicity and nephrotoxicity with modified alkylating kinetics leading to enhanced in vitro anticancer activity. J. Pharmacol. Exp. Ther. 328, 598–609. Straume, M., and Johnson, M. L. (1992). 
Monte Carlo method for determining complete confidence probability distributions of estimated model parameters. Methods Enzymol. 210, 117–129.
280
Petr Kuzmicˇ
Szedlacsek, S., and Duggleby, R. G. (1995). Kinetics of slow and tight-binding inhibitors. Methods Enzymol. 249, 144–180. Van Boekel, M. (2000). Kinetic modelling in food science: A case study on chlorophyll degradation in olives. J. Sci. Food Agric. 80, 3–9. Von Weymarn, N., Kiviharju, K., and Leisola, M. (2002). High-level production of Dmannitol with membrane cell-recycle bioreactor. J. Ind. Microbiol. Biotechnol. 29, 44–49. Watts, D. G. (1994). Parameter estimates from nonlinear models. Methods Enzymol. 240, 23–36. Wielgus-Kutrowska, B., and Bzowska, A. (2006). Probing the mechanism of purine nucleoside phosphorylase by steady-state kinetic studies and ligand binding characterization determined by fluorimetric titrations. Biochim. Biophys. Acta 1764, 887–902. Wielgus-Kutrowska, B., Bzowska, A., Tebbe, J., Koellner, G., and Shugar, D. (2002). Purine nucleoside phosphorylase from cellulomonas sp.: Physicochemical properties and binding of substrates determined by ligand-dependent enhancement of enzyme intrinsic fluorescence, and by protective effects of ligands on thermal inactivation of the enzyme. Biochem. Biophys. Acta 1597, 320–334. Wielgus-Kutrowska, B., Antosiewicz, J., Dlugosz, M., Holy´, A., and Bzowska, A. (2007). Towards the mechanism of trimeric purine nucleoside phosphorylases: Stopped-flow studies of binding of multisubstrate analogue inhibitor—2-amino-9-[2-(phosphonomethoxy)ethyl]-6-sulfanylpurine. Biophys. Chem. 125, 260–268. Williams, J. W., and Morrison, J. F. (1979). The kinetics of reversible tight-binding inhibition. Methods Enzymol. 63, 437–467. Williams, C. R., Snyder, A. K., Kuzmicˇ, P., O’Donnell, M., and Bloom, L. B. (2004). Mechanism of loading the Escherichia coli DNA polymerase III sliding clamp. I. Two distinct activities for individual ATP sites in the g complex. J. Biol. Chem. 279, 4376–4385.
CHAPTER ELEVEN

Discrete Dynamic Modeling of Cellular Signaling Networks

Réka Albert and Rui-Sheng Wang
Department of Physics, Pennsylvania State University, University Park, Pennsylvania, USA

Contents
1. Introduction
2. Cellular Signaling Networks
3. Boolean Dynamic Modeling
   3.1. Constructing the network backbone
   3.2. Determining transfer functions
   3.3. Selecting models for state transitions
   3.4. Analyzing steady states of the system
   3.5. Testing the robustness of the dynamic model
   3.6. Making biological implications and predictions
4. Variants of Boolean Network Models
   4.1. Threshold Boolean networks
   4.2. Piecewise linear systems
   4.3. From Boolean switches to dose–response curves
5. Application Examples
   5.1. Abscisic acid-induced stomatal closure
   5.2. T-LGL survival signaling network
6. Conclusion and Discussion
Acknowledgments
References

Abstract

Understanding signal transduction in cellular systems is a central issue in systems biology. Numerous experiments from different laboratories have identified an abundance of individual components and causal interactions mediating environmental and developmental signals. However, for many signal transduction systems there is insufficient information on the overall structure and the molecular mechanisms involved in the signaling network. Moreover, the lack of kinetic and temporal information makes it difficult to construct quantitative models of signal transduction pathways. Discrete dynamic modeling, combined with network analysis, provides an effective way to integrate fragmentary
knowledge of regulatory interactions into a predictive mathematical model which is able to describe the time evolution of the system without the requirement for kinetic parameters. This chapter introduces the fundamental concepts of discrete dynamic modeling, particularly focusing on Boolean dynamic models. We describe this method step-by-step in the context of cellular signaling networks. Several variants of Boolean dynamic models including threshold Boolean networks and piecewise linear systems are also covered, followed by two examples of successful application of discrete dynamic modeling in cell biology.
1. Introduction

With the increasing availability of high-throughput techniques, it is now possible to collect large datasets on the abundance and activity of biological components such as genes, proteins, RNAs, and metabolites. The diverse interactions between these components coordinate cellular systems and are responsible for cellular functions. Networks of interaction and regulation can be discerned throughout the process in which a coding sequence of DNA is expressed as active protein (Barabasi and Oltvai, 2004). At the genomic/transcriptomic level, transcription factors can activate or inhibit the transcription of genes into mRNAs, thereby regulating the activity of genes and contributing to a transcriptional (gene) regulatory network (Buck and Lieb, 2004; Lee et al., 2002). At the proteomic level, proteins participate in diverse posttranslational modifications of other proteins or form protein complexes to exert additional functional roles. Such associations between proteins are achieved by protein–protein interactions (Figeys et al., 2001; Walhout and Vidal, 2001). Biochemical reactions of cellular metabolism can likewise be integrated into metabolic networks (Hatzimanikatis et al., 2004; Reed et al., 2003). A variety of interactions integrate into signaling networks. For example, signals from the exterior of a cell are first transferred to its inside by a cascade of protein–protein interactions of signaling molecules (Albert, 2005; Li et al., 2006). Then, a combination of biochemical reactions and transcriptional regulation triggers the expression of genes to respond to the signals. Instead of focusing on individual components, an important aim of systems biology is to study how cellular systems accomplish their diverse functions through such interacting components.

Cellular systems are by no means static. Instead, most cellular responses are transient, and biological components interact dynamically with each other. Therefore, to understand the mechanisms by which interacting components achieve dynamic behaviors, topological analysis of cellular networks is insufficient. Dynamic modeling serves as a standard tool for system-level elucidation of the dynamics of cellular processes. It can link
fundamental physicochemical principles, prior knowledge about regulatory pathways, and experimental data of various types to create a powerful predictive model (Aldridge et al., 2006). Such a model is able to decipher how diverse interactions account for phenotypic traits and to make novel predictions that lead to further experimental explorations (Li et al., 2006; Thakar et al., 2007; Zhang et al., 2008).

Quantitative dynamic models based on differential equations have been widely used to model various biological systems (Aldridge et al., 2006). However, this modeling approach requires many kinetic parameters which are generally not or insufficiently known. This poses an obstacle for quantitative modeling of large-scale systems, especially when temporal data are insufficient for parameter estimation (Conzelmann and Gilles, 2008). At the same time, much knowledge of individual components and causal interactions in a biological process can be inferred from the experimental literature as qualitative data. Therefore, qualitative techniques such as Boolean networks and Petri nets have been used for modeling signal transduction networks (Chaouiya, 2007; Gilbert et al., 2006; Sackmann et al., 2006). Such discrete dynamic modeling approaches, combined with network analysis tools (Kachalo et al., 2008), are able to integrate fragmentary knowledge of regulatory interactions into a predictive and informative model in systems where kinetic parameters are not sufficiently known to allow a continuous model (Bornholdt, 2008).

Discrete dynamic modeling, particularly Boolean dynamic modeling, has been successfully applied to many gene regulatory networks and signaling networks, especially in systems where the organization of the network is more important than the kinetic details of the individual interactions. The Drosophila segment polarity regulatory network was shown to be a robust developmental module that can function despite variations in kinetic parameters (von Dassow et al., 2000). Albert and Othmer (2003) developed a Boolean dynamic model which accurately predicts the gene expression outcomes of this developmental module. The cell cycle control of the budding yeast Saccharomyces cerevisiae is a widely studied robust biological process. Threshold Boolean network models accurately reproduce the yeast cell cycle dynamics and predict the critical events of the cell cycle (Davidich and Bornholdt, 2008b; Li et al., 2004). Generalized logical models have been used for modeling segmentation in Drosophila embryos (Sanchez and Thieffry, 2001, 2003). Discrete dynamic modeling has been equally successfully applied in systems as different as plants and mammals, for example, in Arabidopsis flower morphogenesis (Espinosa-Soto et al., 2004; Mendoza et al., 1999), root hair development (Mendoza and Alvarez-Buylla, 2000), abscisic acid (ABA)-induced stomatal closure (Li et al., 2006), the human cholesterol regulatory pathway (Kervizic and Corcos, 2008), mammalian immune response to bacteria (Thakar et al., 2007, 2009), T cell signaling networks (Kaufman et al., 1999; Saez-Rodriguez et al., 2007), as well as T-LGL leukemia survival signaling (Zhang et al., 2008).
In this chapter, we introduce the fundamental concepts of Boolean dynamic modeling in the context of signaling networks. We also illustrate several network analysis and Boolean network simulation tools. Finally, we discuss variants of Boolean models, including threshold Boolean models and piecewise linear systems, and give two examples of successful application of discrete dynamic modeling in cell biology.
2. Cellular Signaling Networks

Living cells constantly receive various external stimuli and developmental signals and convert them into intracellular responses. Such processes are collectively known as signal transduction, which involves a collection of interacting chemicals and molecules such as enzymes, proteins, and second messengers (Gomperts et al., 2003). Signal transduction is an important part of cell communication that governs basic cellular activities, coordinates cell actions, and maintains the equilibrium between the cell and its surroundings. Many cellular decisions, such as proliferation, differentiation, and apoptosis, are achieved by signal transduction. Therefore, understanding cell signaling is essential for studying the underlying mechanisms of cellular systems.

Figure 11.1 gives an abstract view of a signal transduction process and demonstrates its three main steps. First, signal transduction is activated by extracellular signaling molecules binding to cell-surface receptors. Then the signals are transferred inside the cell and trigger a sequence of biochemical reactions, such as the activation of signal adaptors and the phosphorylation of enzymes, which occur in the cytoplasm. The signals are amplified and passed to the nucleus, and further to genes, through a series of biochemical reactions that link submembrane events. Finally, cells respond to the signals by changing cellular function through the expression of genes. At every step of the signal transduction process, feedback is possible and important.

Signal transduction pathways interact and crosstalk with one another to form signal transduction networks (also called signaling networks) (Gomperts et al., 2003). Signaling networks can be represented as directed graphs where the orientation of the edges reflects the direction of signal propagation. In a signaling network, there exist one or more clear starting nodes representing the binding of the initial signal(s) to receptor(s) and one or more output nodes representing the cellular responses to the signal(s). Besides these nodes, there are a number of intermediate nodes consisting of second messengers, enzymes, kinases, proteins, genes, ions, metabolites, and/or other compounds involved in transferring the signals. The edges in a signaling network represent diverse interactions between signaling components such as protein binding, complex formation, transcription
regulation, phosphorylation of a protein, and enzymatic catalysis. As shown in Fig. 11.1, signal propagation follows the paths from the starting node(s) via a succession of intermediate components to the final output node(s).

Figure 11.1 Scheme of a hypothetical signal transduction process involving diverse interactions of cellular components.

Signaling networks are usually very complex in their organization and too complicated to be analyzed by the human mind alone. Analysis of signaling networks requires an iterative combination of experimental and theoretical approaches, including developing appropriate models and generating quantitative data. Traditional work in biology has focused on studying individual parts of cell signaling pathways. Mathematical modeling of cellular networks from a system-level perspective helps us understand the underlying structure of cell signaling networks and how changes in these networks may affect the transmission of information (Cho and Wolkenhauer, 2003).

Modeling cellular signaling networks is often challenging. Unlike other biological networks, cellular signaling networks involve the interactions of components from different levels, such as the transcriptome, metabolome, and proteome. In most
cases, the dynamic activity of signaling molecules in cellular systems is not experimentally accessible. Even when temporal data are available at the transcriptional or proteomic level, they may not reflect the real activity of signaling molecules because of various posttranscriptional regulations and posttranslational modifications whose effects are still difficult to assess quantitatively with current biological technologies (Foth et al., 2008). Despite these obstacles, by using discrete dynamic models and network analysis methods, cellular signaling networks can be assembled and qualitatively modeled in a predictive manner, and these models provide hypotheses about the underlying mechanisms, as we shall describe in the following section.
3. Boolean Dynamic Modeling

Dynamic models describe the behavior of a system over time (Ellner and Guckenheimer, 2006). In dynamic models, each node has a state (or status) that varies over time due to interactions with other nodes. For a continuous dynamic system, the states of the nodes are described by quantitative variables and the changes in the nodes' states are usually modeled by a set of differential equations, in which the time variable runs over a continuous interval. Although the concentrations of components in cellular signaling networks are continuous quantities, various conditions can lead to a saturation regime or a regime of low concentrations, which enables a binary or qualitative simplification of component states (Bornholdt, 2008). In discrete dynamic models, the state of a node is qualitative and the time variable is often also discrete. Discrete dynamic models include Boolean networks (Kauffman, 1969), finite dynamical systems (Jarrah and Laubenbacher, 2007), difference equations (May, 1976), and Petri nets (Chaouiya, 2007). The most popular discrete dynamic models applied to biological networks are Boolean networks, which constitute the focus of this chapter.

Boolean networks are a representation of a dynamic system introduced to model gene regulatory networks (Kauffman, 1969; Thomas, 1973). A Boolean network model can be represented by a directed graph G = (V, E) with a set of nodes V = {v1, v2, ..., vn} denoting the elements of the network and a list of Boolean transfer functions F = {F1, F2, ..., Fn} implicitly defining the edges E between the nodes. A network node vi stands for a gene, a protein, or a stimulus with an associated expression level Xi, representing the concentration of a gene product (protein) or the amount of the stimulus present in the cell. This level is approximated by two qualitative states, where Xi = 1 represents the fact that node vi is expressed or ON (a high concentration) and Xi = 0 means that it is not expressed or OFF (a baseline/subthreshold concentration). F is a set of logical functions, one assigned to each node. It represents the regulatory rules between network
components and determines the evolution of the system from the current state to the next state. The future state of each node is determined by the current states of other nodes through its Boolean transfer function:

Xi* = Fi(X1, X2, ..., Xn),

where the * denotes a future state and i = 1, 2, ..., n. The Boolean transfer functions can be expressed using the logic operators "not," "and," and "or." Since the state space of a Boolean network is finite, the system will eventually reach a stationary state (also called a fixed point) or a set of recurring states. These stationary or recurring states are collectively referred to as dynamic attractors. Boolean networks provide a straightforward formalism to describe the dynamics of biological networks without the involvement of kinetic details and thus are suited for modeling large-scale networks. They can be used to analyze the qualitative behavior of a system, such as qualitative gene expression patterns, or the stability, or lack thereof, of a response to a signal.

We will illustrate the concepts of Boolean dynamic modeling on the simple signaling network given in Fig. 11.2. In this network, I is the input node representing the signal and the output node O stands for the ultimate cellular response. There are six intermediate nodes, A, B, ..., F, denoting proteins, metabolites, or other signaling components. The interactions between the nodes are represented by directed edges. Formulating a Boolean dynamic model for signaling networks entails three main steps: constructing the network, determining the Boolean transfer functions, and selecting update modes for state transitions. In the following, we will discuss these steps one by one.
Figure 11.2 An example of a simple signaling network and its Boolean network model. (A) In this network example, node I is the input and node O is the output. Nodes A, B, ..., F are intermediate nodes. Positive interactions are represented by directed edges with sharp arrows; negative interactions are represented by directed edges with blunt arrows. (B) The Boolean transfer functions for each signaling component (node) in (A): A* = I; B* = not I; C* = not I and D; D* = (A or not B) and E; E* = A or not D; F* = not C and D; O* = (D and E) or F. The state of the nodes is indicated by the node labels and * denotes the state of the node at a future time instant.
A recently developed software package for modeling biological systems by using Boolean network models is BooleanNet, available at http://www.code.google.com/p/booleannet/ (Albert et al., 2008). The input to this software is a set of Boolean rules in a text file. Users can select among several state transition modes, and between purely Boolean and continuous-Boolean hybrid modeling. The software requires minimal programming expertise and can be run via a web interface or as a Python library accessed through an application programming interface. All the illustrative computations in this chapter on the example in Fig. 11.2 are done with BooleanNet.
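To make the update scheme concrete, below is a minimal pure-Python sketch of the kind of computation BooleanNet automates; it deliberately does not use BooleanNet's actual interface (all names here are illustrative), and simply encodes the transfer functions of Fig. 11.2B and iterates them synchronously.

```python
# Synchronous Boolean simulation of the network in Fig. 11.2.
NODES = ["I", "A", "B", "C", "D", "E", "F", "O"]

def transfer(s):
    """Boolean transfer functions of Fig. 11.2B; s maps node name -> 0/1."""
    return {
        "I": s["I"],  # the input node is held fixed
        "A": s["I"],
        "B": int(not s["I"]),
        "C": int(not s["I"] and s["D"]),
        "D": int((s["A"] or not s["B"]) and s["E"]),
        "E": int(s["A"] or not s["D"]),
        "F": int(not s["C"] and s["D"]),
        "O": int((s["D"] and s["E"]) or s["F"]),
    }

# Initial condition used in Section 3.3: I = C = 1 (ON), all other nodes OFF.
state = dict.fromkeys(NODES, 0)
state["I"] = state["C"] = 1

for t in range(6):
    print(t, [state[n] for n in NODES])
    nxt = transfer(state)   # synchronous update: all nodes change at once
    if nxt == state:        # a fixed point has been reached
        break
    state = nxt
```

Starting from this initial condition, the sketch prints the synchronous trajectory and stops at the fixed point [1 1 0 0 1 1 1 1] that is analyzed in the following sections.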
3.1. Constructing the network backbone

In modeling signaling networks by a Boolean dynamic model, the first step is to synthesize the network by reading the relevant literature concerning the signaling system to be modeled. Published literature provides a valuable source of information about individual signaling components and the cause–effect relationships between different components. Although such information is not always explicit and complete, many regulatory relationships can be inferred from experimental observations.

Experimental information about the involvement of a component in a signaling network comes in several types. For example, experiments showing that the activity, subcellular location, or concentration of a protein changes after the input signal is given, or after a known component of the signaling network is perturbed, indicate that this protein might be a component of the signaling network of interest. Different responses to a stimulus after mutating or overexpressing a gene provide genetic evidence for the involvement of the gene's product in the signal transduction process. Enzymatic activity, protein–protein interactions (Walhout and Vidal, 2001), and transcription factor–gene interactions (Buck and Lieb, 2004) provide biochemical evidence of direct relationships between two components. Chemical or exogenous treatments of a component provide pharmacological evidence, which implies indirect relationships between two components. Both biochemical and pharmacological evidence can be represented as component-to-component relationships such as A promotes B (denoted by A → B) or A inhibits B (denoted by A —| B), which correspond to directed arcs from A to B in a graph representing the signaling network. The arcs can be classified as inhibitory (negative) or activating (positive). In many situations, genetic evidence from multiple experiments leads to double causal inferences like "C promotes the process through which A promotes B" (Albert et al., 2007). In some cases, these inferences can be broken down into two separate component-to-component relationships. For example, if the interaction between A and B is direct and C is not a catalyst of the A–B interaction, C can be assumed to activate A. Many experimental observations are indirect causal relationships or even double causal relationships as mentioned above, which do not lend themselves to
easy representation. Fortunately, software applications exist that make network synthesis easier. Based on the ideas and method framework of Albert et al. (2007), the software NET-SYNTHESIS for synthesizing signaling networks from collected literature observations was developed; it is available at http://www.cs.uic.edu/dasgupta/network-synthesis (Kachalo et al., 2008). The main idea of the network synthesis method is to find a most parsimonious network that incorporates all known components and processes and is consistent with all reachability relationships between known components (Albert et al., 2007). The input to NET-SYNTHESIS is a list of positive or negative relationships among biological components, and its output is a simplified network diagram and a file with the edges of the inferred signaling network (Kachalo et al., 2008).
3.2. Determining transfer functions

The constructed network is a static backbone of the signal transduction process. In order to understand the dynamic behavior of the system and beyond, the next step is to determine the dependence relationships among the node states, which eventually define the functional state of the signal transduction system. The state change of each node in a Boolean network is described by a Boolean transfer function, which is determined by the knowledge of the nodes directly upstream of this node and the sign (inhibition or activation) of the edges between the upstream nodes (regulators) and the target node.

The state of a node having a single activator and no inhibitors, represented as A → B in the network and obtained from the observation that a high concentration of A activates B, simply follows the state of the activator with a time delay. For example, in Fig. 11.2, the transfer function for the state of node A has the form

A* = I,

where for simplicity the state of the nodes is indicated by the node labels and * denotes the state of node A at a future time instant. The rule indicates that the next state of node A equals the current state of node I. A node having a single inhibitor and no activators, represented by A —| B in the network and indicating that the activation of the target node B requires a low concentration or inactivity of the inhibitor A, has the state opposite to that of the inhibitor with a time delay. For example, the transfer function for the state of node B in Fig. 11.2 has the form

B* = not I,

where "not" denotes logical negation such that not ON = OFF, and not OFF = ON. In most cases, the activation of a component requires
multiple regulators. The "and" operator can be used to denote conditional regulation, that is, that the coexpression of two (or more) regulators is absolutely necessary to activate the target node. In the example of Fig. 11.2, the component C is regulated by both I and D. We assume that the absence of I and the presence of D are required for the activation of node C, so the transfer function for the state of node C has the form

C* = not I and D.

Sometimes a component is regulated by multiple pathways, any of which can activate the component independently. Such relationships between multiple pathways can be represented by the "or" operator, representing independent activation. In the example of Fig. 11.2, we assume that the regulation of node E by A and D is independent, and either the presence of A or the absence of D is sufficient to activate E, thus leading to the transfer function

E* = A or not D.

When a component is regulated by more than two nodes, the transfer function can be a complicated combination of "and," "or," and "not" operations, depending on the relationships between the multiple pathways, as exemplified by the transfer functions of the output node O and the intermediate node D in Fig. 11.2. Each transfer function in a Boolean network model can also be represented by a logical truth table, listing the state of a node resulting from each combination of the states of its regulators (see the enumeration sketched below).

The transfer functions are usually determined according to prior knowledge of components or pathways. For example, if the proteins A and B bind to each other to form a complex C, both A and B must be present in order for node C to be active; this can be described by C* = A and B. Similarly, if a component C is expressed only in A and B double mutant organisms, and is not expressed in B mutants or A mutants, the expression of C requires the inhibition of both A and B, so C* = not A and not B. Published biological literature is an important source for capturing such dependencies between signaling components. If the structure of the Boolean network or a Boolean transfer function is not fully known, several variants of the Boolean network can be created, and comparing their dynamic sequences or output responses with observations of the real system can guide the completion of the Boolean dynamic model (Bornholdt, 2008). For signaling pathways that are poorly characterized, a complementary approach is to learn the topology of Boolean networks and the Boolean functions from available high-throughput temporal data, which has been a key machine learning problem (Lähdesmäki et al., 2003).
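To illustrate the truth-table representation mentioned above, the following short sketch enumerates the table of the transfer function D* = (A or not B) and E from Fig. 11.2 over all eight combinations of its regulators' states.

```python
from itertools import product

# Truth table of D* = (A or not B) and E (Fig. 11.2B):
# one row per combination of regulator states.
print("A B E | D*")
for a, b, e in product((0, 1), repeat=3):
    d_next = int((a or not b) and e)
    print(a, b, e, "|", d_next)
```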
3.3. Selecting models for state transitions

Given the topological structure and transfer functions of Boolean networks, the transition from one state of the signal transduction system to the next can be implemented in multiple ways, which have a considerable effect on the dynamics of the system. Synchronous models use the simplest update method, in which the states of all nodes are updated simultaneously (in one step) according to the last state of the system:

Xi(t + 1) = Fi(X1(t), X2(t), ..., Xn(t)).

This update method implicitly assumes that the time scales of all biological processes in which signaling components are involved are similar. The benefit of synchronous models is that the intermediate dynamic state sequence of the system is deterministic, so a given initial condition leads to the same attractor in different replicate simulations. For example, let us assume that the initial condition of the signaling network in Fig. 11.2 is I0 = C0 = 1 (ON), A0 = B0 = D0 = E0 = F0 = O0 = 0 (OFF). The state of the system at the first step is calculated by plugging the node states of the initial condition into the transfer functions in Fig. 11.2B, leading to I1 = 1 (since nothing is affecting I), A1 = 1, B1 = 0, C1 = 0 (since not I0 = 0), D1 = 0 (since E0 = 0), E1 = 1 (since not D0 = 1), F1 = 0, and O1 = 0. Representing the state of the system as an array in the order I-A-B-C-D-E-F-O, after one step X(t = 1) = [1 1 0 0 0 1 0 0] in a synchronous model. In the second step, the state of the system becomes X(t = 2) = [1 1 0 0 1 1 0 0] (D turns ON since A1 = 1 and E1 = 1, while F still requires the ON state of D), and in the third step X(t = 3) = [1 1 0 0 1 1 1 1], a state which remains unchanged upon further update. Thus, this state is a fixed point dynamical attractor.

Biological processes in cellular systems are complicated, and most often the time scales of these processes are different and can vary widely, from fractions of seconds to hours. For example, protein phosphorylation and other posttranslational mechanisms are much faster than protein synthesis or transcriptional regulation. Synchronous models assume the existence of a perfect synchronization among the states of signaling components and cannot properly account for the different time scales over which diverse biological processes take place in a cellular system. Thus, the update method of the system state needs to be extended to account for different time scales. In asynchronous models, the nodes are updated in a nonsynchronous order, depending on the timing information, or lack thereof, of individual biological events. In a random asynchronous model, the next updating time of each component may be randomly chosen at each time instant (Chaves et al., 2005). In a more frequently used asynchronous model (Chaves et al., 2005), the update order is selected randomly from all possible permutations of the nodes:

Xi(t + 1) = Fi(X1(t1), X2(t2), ..., Xn(tn)),
where ti ∈ {t, t + 1}, i = 1, 2, ..., n, denotes the most recent time step at which node vi was updated, which depends on the position of node vi in the update order of all nodes. This update method guarantees that each node is updated exactly once during each unit time interval. For example, in Fig. 11.2, let us set the same initial condition as before, I0 = C0 = 1 (ON), A0 = B0 = D0 = E0 = F0 = O0 = 0 (OFF), and use an asynchronous update with the update order I–A–E–D–B–C–F–O. The next state of the system is calculated by plugging the most recent node states into the transfer functions in Fig. 11.2B, leading to I1 = 1 (since nothing is affecting I), A1 = 1, E1 = 1 (since A1 = 1 and also not D0 = 1), D1 = 1 (since A1 = E1 = 1), B1 = 0, C1 = 0 (since not I1 = 0), F1 = 1, and O1 = 1. Thus, for this update order X(t = 1) = [1 1 0 0 1 1 1 1], a state that was only reached after three synchronous updates. If, however, the nodes in Fig. 11.2 were updated in the order I–A–B–C–D–E–F–O, the same state would have been obtained as in a synchronous update. In asynchronous models involving stochasticity, the same initial condition can lead to different state sequences, and by extension to different attractors, due to the random choices involved in the state transitions. The uncertainty in these update methods can reflect population-level differences or stochasticity in signal transduction processes. If prior knowledge about the time scales of some components is available, the asynchronous update can be refined by restricting the update order of these components. For example, if it is known that in a signal transduction process component A is always activated before component B, the permutations used as update orders can be restricted to those in which A precedes B.

There also exist deterministic asynchronous models in which each node vi is associated with an intrinsic time unit γi and is updated at multiples of that unit, ti^(k+1) = ti^k + γi, that is, ti^k = kγi (Chaves et al., 2006). At any given time t, the node vi whose scheduled update instant is closest to t, that is, ti^k = min_{j,l} {tj^l > t}, is updated in the following way:

Xi(ti^k) = Fi(X1(τ1), X2(τ2), ..., Xn(τn)),

where τj is the most recent instant at which node vj was updated: τj = max_l {tj^l < ti^k}. While in the random-order asynchronous model presented earlier each node is updated once in every time step (also called a round of update), in this intrinsic-time-unit asynchronous model nodes with longer time units undergo fewer updates than nodes with shorter time units. This update mode is more intuitive and realistic if the time units of biological events such as translation, transcription, and phosphorylation involved in a signal transduction process can be estimated from biological knowledge; otherwise, the time unit of each node can be sampled randomly from an interval.
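A random-order asynchronous round can be sketched in a few lines of Python; the rule encoding below is illustrative rather than BooleanNet syntax. Within each round, every node is updated exactly once, always reading the most recent values of its regulators.

```python
import random

# Random-order asynchronous updating of the Fig. 11.2 network.
RULES = {
    "I": lambda s: s["I"],
    "A": lambda s: s["I"],
    "B": lambda s: int(not s["I"]),
    "C": lambda s: int(not s["I"] and s["D"]),
    "D": lambda s: int((s["A"] or not s["B"]) and s["E"]),
    "E": lambda s: int(s["A"] or not s["D"]),
    "F": lambda s: int(not s["C"] and s["D"]),
    "O": lambda s: int((s["D"] and s["E"]) or s["F"]),
}

def async_round(state, rng):
    order = list(RULES)
    rng.shuffle(order)                    # a new random update order each round
    for node in order:
        state[node] = RULES[node](state)  # in-place: freshest values are used

rng = random.Random(0)
state = dict.fromkeys(RULES, 0)
state["I"] = state["C"] = 1               # same initial condition as in the text
for _ in range(10):
    async_round(state, rng)
print([state[n] for n in "IABCDEFO"])     # settles at [1, 1, 0, 0, 1, 1, 1, 1]
```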
3.4. Analyzing steady states of the system

For a Boolean network model with n nodes, there are 2^n possible initial conditions, from which the system will eventually converge to a limited set of attractors. In synchronous models and deterministic asynchronous models, these attractors are fixed points (steady states) or k-cycles (in which k states are repeated regularly). In stochastic asynchronous models, the attractors are fixed points or so-called loose attractors, sets of states that are repeated irregularly (Harvey and Bossomaier, 1997). Asynchronous models have the same fixed points as synchronous models, since in a fixed point (steady state) the order in which the nodes are updated is irrelevant. For example, the state X = [1 1 0 0 1 1 1 1] is a fixed point of the network in Fig. 11.2, and all states that include the ON state of the input node I will ultimately lead to this fixed point, irrespective of the (a)synchronicity of the model. The only other attractor possible for this network is X = [0 0 1 0 0 1 0 0], and all states that include the OFF state of the input node I will ultimately lead to this fixed point. Note that in both fixed points the state of the output node is the same as the state of the input node.

In modeling signaling networks, the initial condition can usually be set according to prior expert knowledge. For example, the input node in a signaling network represents signals or stimuli, so the initial condition should have the input node ON. Only after the stimuli are present and transferred to intracellular components can the cell generate the final response; thus, in the initial condition, the output node can be set to OFF. The initial states of other intermediate components can be similarly set according to prior knowledge. If no sufficient information is available for setting a realistic initial condition, one can sample a large number of random initial conditions and examine the fraction of realizations of a given state of a node (e.g., ON) at a certain time step. This fraction for the output node represents the probability that the system attains the response and reflects a dynamic behavior that is weakly dependent on the details of the initial conditions (Li et al., 2006). For the output node O in Fig. 11.2, let us set the input node I to ON and the output node O to OFF, randomly sample the initial states of all other nodes, and use BooleanNet with a random-order asynchronous model (Albert et al., 2008). The fraction of O = ON in 50 replicate simulations as a function of time steps is shown in Fig. 11.3. We can see that the fraction of O = ON stabilizes at 1 after three time steps, indicating that in all simulations with I = ON, O = OFF as initial conditions the output stabilizes at ON, irrespective of the initial states of the other nodes. Repeating the analysis with I = OFF, O = OFF leads to the fraction of O = ON stabilizing at 0, indicating that in all simulations the output stabilizes at OFF. By setting different initial conditions, differential modes of input/output behavior can be identified.
Figure 11.3 The fraction of O = ON in 50 asynchronous simulations of the Boolean network given in Fig. 11.2 as a function of time steps (rounds of update).

Attractors represent combinations of the activation
states of components that trigger the cell responses in signaling networks, or that specify the phenotypic behaviors in gene regulatory networks. In modeling signaling networks, the state of the output node is more interesting than those of intermediate nodes, so observing the long-term behavior of the output node is most relevant. In modeling gene regulatory networks, the attractors and the dynamic sequence of the whole system usually correspond to known biological events such as certain phases of the cell cycle (Li et al., 2004) or apoptosis and cell differentiation (Huang et al., 2009), and thus identifying attractors is most relevant.

The fixed points of a Boolean network model can be determined analytically by finding all possible solutions X of the equations

Xi = Fi(X1, X2, ..., Xn),

meaning that the next state of each node equals its current state. For example, the fixed points of the network in Fig. 11.2 can be determined by solving the set of equations that results when the stars are removed from the left-hand sides of the transfer functions in Fig. 11.2B. Expressing the state of each node as a function of the state of node I, and simplifying the resulting expressions, we find A = D = F = O = I, B = not I, C = 0, E = 1, which yields the two solutions we found earlier. The set of initial conditions that leads the system to a specific attractor is referred to as its basin of attraction, which can be determined by running repeated simulations from each initial condition. While asynchronous models have the same fixed points as synchronous models, the basins of attraction of these fixed points are generally different, and the basins of attraction of different fixed
points can be overlapping, since the final state(s) reachable from a specific initial condition depend on the update mode. BooleanNet provides function modules to detect steady states or cyclic attractors. For example, in Fig. 11.2, if the transfer function for E is changed to E* = A and not D, the components D and E form a negative feedback loop. In a synchronous model the initial condition [1 0 0 1 0 0 0 0] then leads, after three steps, to a cyclic attractor with period 4: [1 1 0 0 1 1 0 0] → [1 1 0 0 1 0 1 1] → [1 1 0 0 0 0 1 1] → [1 1 0 0 0 1 0 1] → [1 1 0 0 1 1 0 0]. There are other initial conditions that lead to the same cyclic attractor in this small network.
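For a network of this size, the attractors can also be found by brute force; the sketch below (an illustrative stand-in for BooleanNet's attractor-detection modules) iterates each of the 2^8 initial states under synchronous updating and collects the recurring part of every trajectory.

```python
from itertools import product

# Brute-force attractor search for the Fig. 11.2 model (synchronous update).
def step(s):
    i, a, b, c, d, e, f, o = s
    return (i, i, int(not i), int(not i and d),
            int((a or not b) and e), int(a or not d),
            int(not c and d), int((d and e) or f))

attractors = set()
for s in product((0, 1), repeat=8):   # all 2^8 states, in the order I-A-B-C-D-E-F-O
    seen = []
    while s not in seen:
        seen.append(s)
        s = step(s)
    cycle = seen[seen.index(s):]      # the recurring part of the trajectory
    attractors.add(frozenset(cycle))

for cyc in attractors:
    print(sorted(cyc))                # the two fixed points discussed in the text
```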
3.5. Testing the robustness of the dynamic model

With all these preparations done, an important step is to assess whether the constructed dynamic model is able to reproduce known dynamic behaviors or cellular responses, and whether the model is robust with respect to changes in interactions or Boolean transfer functions. Comparing the model's intermediate dynamic sequence and output responses with experimentally observed dynamic events can suggest whether further changes to the model are needed. For example, assume that a biological initial condition is expected to lead to the ON state of the output node after some time. If this biologically plausible initial condition always leads to an OFF state of the output node in the constructed dynamic model, regardless of the update order of the nodes, it indicates that one or more Boolean transfer functions may be wrong (e.g., using "and" instead of "or" or vice versa) or incomplete (e.g., some important components are not included). After several rounds of comparison of the model with experimental observations, a dynamic model consistent with all important prior knowledge is obtained. However, a model is not good if small perturbations lead to drastically different results, because this suggests that the model cannot reflect the adaptability of the real system under diverse circumstances. A robust model should be able to maintain the original output response under most small perturbations. Systematic assessment of the robustness of the model can be done in multiple ways, for example, by interchanging "or" and "and" rules, switching an inhibitory edge to an activating edge or vice versa, rewiring a pair of edges, or adding or deleting an edge. The fraction of cases in which the output is altered over a certain number of perturbations can then be used to quantify the robustness of the model.
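A single such perturbation test can be scripted directly. In this sketch (illustrative names, synchronous updating), the "and" in the rule for D is interchanged with "or," and the state reached from a signal-ON initial condition is compared with that of the original model.

```python
# Robustness check: interchange "and" and "or" in D* = (A or not B) and E.
def make_step(d_rule):
    def step(s):
        i, a, b, c, d, e, f, o = s
        return (i, i, int(not i), int(not i and d), d_rule(a, b, e),
                int(a or not d), int(not c and d), int((d and e) or f))
    return step

variants = {
    "original  D* = (A or not B) and E": lambda a, b, e: int((a or not b) and e),
    "perturbed D* = (A or not B) or E ": lambda a, b, e: int((a or not b) or e),
}

for name, rule in variants.items():
    s = (1, 0, 0, 0, 0, 0, 0, 0)   # I = ON, all other nodes OFF
    advance = make_step(rule)
    for _ in range(20):            # long enough for this model to settle
        s = advance(s)
    print(name, "->", s)           # both settle at (1, 1, 0, 0, 1, 1, 1, 1) here
```

For this particular perturbation the output response is preserved; repeating the test over all rules and perturbation types yields the fraction of response-altering changes that quantifies the robustness of the model.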
3.6. Making biological implications and predictions

Discrete dynamic modeling allows us to integrate fragmentary knowledge about a system into a logical representation, which further helps us understand the system at a global level. A great advantage of discrete dynamic modeling is its ability to predict the outcomes of system perturbations and
direct future wet-bench experiments. There are several efficient ways to analyze the effects of system perturbations. For example, knockout mutants can be simulated by keeping the state of the corresponding components OFF. Overexpression or constitutive activation of certain components can be simulated by keeping the state of these components ON. Chemical or exogenous treatments can be simulated by activating or inhibiting certain nodes. By studying the effects of such system perturbations, we can assess the importance of certain components, predict the phenotypic traits of system perturbations, and gain other valuable insights into the underlying mechanisms of the signal transduction system.

For example, in Fig. 11.2, we randomly sample the initial conditions in which I = ON and O = OFF and examine the effects of perturbing the model by knocking out or overexpressing each node. The fractions of O = ON in the wild type (no perturbation) and the perturbed models are shown in Fig. 11.4. Knockout of A leads to a reduced fraction of O = ON, suggesting hyposensitivity to the signal in A mutants. Blocking B, C, or F leads to a response very similar to that of the wild type. Knockouts of D or E completely eliminate the response (i.e., the fraction of O = ON is zero). This insensitivity to the signal in D or E mutants indicates that the components D and E are essential for the output response O. In addition, overexpression of A, B, or C has no effect on the response to the signal I. However, overexpression of D, E, or F makes the model reach the ON state of the output node O faster, indicating hypersensitivity to the signal in these perturbed models.
Figure 11.4 The fraction of O = ON in 50 asynchronous simulations of the Boolean network given in Fig. 11.2 as a function of time steps in wild type and perturbed models.
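Clamping a node is all that is needed to reproduce such in silico perturbations. The sketch below (synchronous updating for brevity; node and function names are illustrative) re-imposes the knockout after every step and reads off the final output state.

```python
# In silico perturbations of the Fig. 11.2 model: a knocked-out node is
# clamped OFF after every update; overexpression would clamp it ON instead.
NODES = "IABCDEFO"

def step(s):
    i, a, b, c, d, e, f, o = s
    return [i, i, int(not i), int(not i and d), int((a or not b) and e),
            int(a or not d), int(not c and d), int((d and e) or f)]

def response(clamp=None, value=0, steps=20):
    s = [0] * 8
    s[NODES.index("I")] = 1                  # the signal is present
    for _ in range(steps):
        s = step(s)
        if clamp is not None:
            s[NODES.index(clamp)] = value    # enforce the perturbation
    return s[NODES.index("O")]

print("wild type      O =", response())                    # -> 1
print("knockout of B  O =", response(clamp="B", value=0))  # -> 1 (response kept)
print("knockout of D  O =", response(clamp="D", value=0))  # -> 0 (response lost)
print("knockout of E  O =", response(clamp="E", value=0))  # -> 0 (response lost)
```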
4. Variants of Boolean Network Models

The nodes in a Boolean network have only two states, ON and OFF, which sometimes are not sufficient to characterize the activity or concentration level of signaling components. Logical models with more than two states have been used to model several biological systems, such as root hair development (Mendoza and Alvarez-Buylla, 2000), segmentation in Drosophila embryos (Sanchez and Thieffry, 2003), and Arabidopsis floral morphogenesis (Espinosa-Soto et al., 2004). Such models are still qualitative but allow more levels of activity or concentration for some specific signaling components. Generally, in these models the transfer functions are given in the form of truth tables and the nodes are updated synchronously. In addition, threshold Boolean networks (Kürten, 1988), the simplest Boolean network models, have been used successfully as well (Davidich and Bornholdt, 2008b; Li et al., 2004). A hybrid of Boolean transfer functions and differential equations called piecewise linear differential equations, developed by Glass (1975), has also been fruitfully applied due to its attractive combination of continuous time, quantitative information, and few kinetic parameters (Chaves et al., 2006; De Jong et al., 2004; Thakar et al., 2009).
4.1. Threshold Boolean networks

In threshold networks (Kürten, 1988), each node takes the binary values 0 or 1. Instead of using the logic operators "and," "or," and "not," the transfer functions of threshold networks use weights of +1 or −1 to represent activation and inhibition: Jij = +1 denotes that node vj activates node vi, Jij = −1 means that node vj inhibits node vi, and Jij = 0 denotes that there is no regulatory signal from node vj to node vi. The dynamics of the model is determined by the following transfer functions:

Xi(t + 1) = 1 if Σj Jij Xj(t) + θi > 0,
Xi(t + 1) = 0 if Σj Jij Xj(t) + θi ≤ 0,

where the sum runs over j = 1, ..., n. These are simple sum rules for each node; θi is a threshold parameter that controls how many signals are needed for the activation of node vi. Though mostly used with a synchronous update method, threshold Boolean networks can also be updated in an asynchronous mode. A slightly modified synchronous threshold Boolean network was successfully applied to model the yeast cell cycle control network (Davidich and Bornholdt, 2008b; Li et al., 2004). These two studies found that a large percentage of initial states of the systems lead to a specific fixed point attractor which exactly corresponds to the G1 phase of the cell cycle.
The dynamic sequence leading to this attractor corresponds to the biological pathway encoding the cell cycle events, which shows the power and generality of Boolean dynamic modeling. The dynamic models were demonstrated to be highly robust, suggesting evolutionary constraints on the variables of the systems (Davidich and Bornholdt, 2008b; Li et al., 2004).
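The sum-rule update of a threshold network takes only a few lines; the three-node wiring below is a hypothetical toy example (not the yeast cell cycle model), with all thresholds set to zero.

```python
# Synchronous threshold Boolean network update: J[i][j] = +1, -1, or 0 encodes
# activation, inhibition, or no edge from node j to node i; theta[i] is the
# threshold of node i (notation of Section 4.1).
def threshold_step(x, J, theta):
    n = len(x)
    return [1 if sum(J[i][j] * x[j] for j in range(n)) + theta[i] > 0 else 0
            for i in range(n)]

# Toy wiring: node 0 activates node 1, node 1 activates node 2,
# and node 2 inhibits node 0.
J = [[0, 0, -1],
     [1, 0, 0],
     [0, 1, 0]]
theta = [0, 0, 0]

x = [1, 0, 0]
for t in range(5):
    print(t, x)                       # the pulse travels 0 -> 1 -> 2, then dies out
    x = threshold_step(x, J, theta)
```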
4.2. Piecewise linear systems

Boolean dynamic models focus on the topological structure of a network and simplify its dynamics, which enables efficient analysis of large networks. However, Boolean models ignore intermediate levels of expression and kinetic details, and may miss dynamic behaviors. Continuous models such as ordinary differential equations provide a more detailed description of a system, but involve many kinetic parameters which are largely unknown. Leon Glass introduced a variant of Boolean network models called piecewise linear differential equations which provides a bridge between discrete and continuous modeling approaches (Glass, 1975). In this model, each node of the network is represented by both a continuous variable X̂i, denoting the concentration of the component vi, and a discrete variable Xi, denoting its activity. The continuous variables are determined by the ordinary differential equations

dX̂i/dt = ki Fi(X1, X2, ..., Xn) − di X̂i,

which state that the rate of change of the concentration of component vi is a combination of synthesis (governed by the Boolean transfer function Fi) and free degradation (the second term on the right side); ki is the synthesis rate constant and di is the degradation rate constant. At time instant t, the discrete variable Xi is defined as a step function of its continuous concentration:

Xi(t) = 0 if X̂i(t) ≤ θi ki/di,
Xi(t) = 1 if X̂i(t) > θi ki/di,

where θi ∈ (0, 1) is a threshold for the component vi and represents the fraction of the maximal concentration necessary for vi to become active. Note that each fixed point of a Boolean network yields a steady state of its piecewise linear system. The dynamic trajectory of a piecewise linear system from a given initial condition can be obtained by solving the ordinary differential equations between the time points at which a discrete variable changes its
value. Piecewise linear models have been developed to model the Drosophila segmentation network (Chaves et al., 2006), the pathogen–immune interaction network (Thakar et al., 2009), and the Escherichia coli carbon starvation response network (Ropers et al., 2006). BooleanNet has a module for implementing such piecewise linear systems (Albert et al., 2008).
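For a single node with one constitutively active regulator, the piecewise linear equation can be integrated with a forward-Euler scheme in a few lines; the parameter values below (k = d = 1, θ = 0.5) are assumptions chosen for illustration only.

```python
# Glass-type piecewise linear dynamics of one node: the continuous
# concentration follows dX_hat/dt = k*F - d*X_hat, and the discrete state
# X is 1 whenever X_hat exceeds the threshold theta*k/d.
k, d, theta, dt = 1.0, 1.0, 0.5, 0.01

def discrete(x_hat):
    return 1 if x_hat > theta * k / d else 0

x_hat, t = 0.0, 0.0
F = 1                                    # the regulator is ON, so F evaluates to 1
while t < 5.0:
    x_hat += dt * (k * F - d * x_hat)    # forward-Euler integration step
    t += dt
print(round(x_hat, 3), discrete(x_hat))  # X_hat -> k/d = 1; the discrete state is ON
```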
4.3. From Boolean switches to dose–response curves

In Boolean dynamic modeling, the state of each node is like a Boolean switch, either ON or OFF, depending on the states of its regulators. If step functions are adopted as regulatory functions in ordinary differential equations, the continuous dose–response curve of a dynamic variable can become a Boolean-like switch. Thus, as shown by Davidich and Bornholdt (2008a), a Boolean network model can be formulated as a specific coarse-grained limit of a more detailed differential equation model. Let X(t) denote the mRNA level of a gene at time t, and Y(t) denote the active concentration of the gene's transcriptional activator at time t. Assume Y(t) increases linearly from 0 to 1 over the time span t ∈ [0, 10]. Define a discrete step function over [0, 1] as the regulatory function for X(t):

B(Y) = 0 if Y ≤ 0.5, and B(Y) = 1 if Y > 0.5,

shown as a solid line in Fig. 11.5A. In the Boolean model, the relation between X and Y can be described as

X = B(Y),

where for convenient comparison we assume that the state of Y is transferred to X immediately. In the piecewise linear system, the relation between X and Y can be described as

dX/dt = B(Y) − X,

where we assume unit synthesis and degradation rate constants for convenience. The states of X according to the Boolean model and the piecewise linear model can be seen in Fig. 11.5B, denoted "Boolean" and "PieceWL," respectively. We can see that the curve of X(t) in the piecewise linear system is just the continuous version of that in the Boolean model. Now, instead of using the Boolean switch B(Y), we assume that the regulatory function is the widely used Hill function, shown as dashed lines in Fig. 11.5A (denoted "F(Y), n = 3, 6, 10"):

F(Y) = Y^n / (Y^n + K^n),

where we set K = 0.5.
Figure 11.5 Illustration of step functions and the behavior of the corresponding dynamic systems. (A) A Boolean function and a Hill function with different parameters used as transfer functions. (B) The dynamic behavior of the Boolean model, the piecewise linear system and the ordinary differential equations with Hill functions.
where we set $K = 0.5$. In the ordinary differential equation system, the relation between $X$ and $Y$ can be described as

$$\frac{dX}{dt} = F(Y) - X.$$

Again, we set the rate constants equal to 1. The state of $X$ according to the ordinary differential equation can be seen in Fig. 11.5B, denoted as "ODE, $n = 3, 6, 10$." We can see that, as $n \to \infty$, the Hill function approaches the step function: when $Y > K$, $dX/dt > 0$, therefore the mRNA synthesis of the gene is dominant and
$X(t)$ is increasing; when $Y < K$, $dX/dt \le 0$, the mRNA degradation of the gene is dominant and $X(t)$ stays at zero. The simplification of this phenomenon is just a Boolean switch: If $Y$ is ON, the transcription factor activates the gene and $X$ is ON. If $Y$ is OFF, the mRNA degrades and $X$ is OFF. Thus, the continuous dose–response curve becomes a Boolean switch. Note that any sufficiently steep, step-like function such as the Hill function can generate such a correspondence. Here, there is only one regulator to activate the gene. When multiple components regulate the gene, the number of parameters in the ordinary differential equation will increase, but the piecewise linear system still has only two kinetic parameters per node and the Boolean network has no kinetic parameters. This advantage of Boolean network models and piecewise linear models makes it feasible to model large signaling networks.
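The correspondence is easy to verify numerically. The following sketch integrates dX/dt = B(Y) - X and dX/dt = F(Y) - X for a linearly increasing Y(t), mirroring the construction behind Fig. 11.5; the step size and the choice of Hill coefficients are illustrative.

    import numpy as np

    K = 0.5
    def B(y):                        # Boolean switch as the regulatory function
        return 1.0 if y > 0.5 else 0.0

    def F(y, n):                     # Hill function with threshold K = 0.5
        return y**n / (y**n + K**n)

    dt, T = 0.01, 10.0
    times = np.arange(0.0, T, dt)
    Y = times / T                    # Y(t) rises linearly from 0 to 1

    x_pwl = 0.0                      # piecewise linear system
    x_hill = {n: 0.0 for n in (3, 6, 10)}   # Hill-function ODEs
    for i in range(len(times)):
        x_pwl += dt * (B(Y[i]) - x_pwl)
        for n in x_hill:
            x_hill[n] += dt * (F(Y[i], n) - x_hill[n])

    print("X(10):", round(x_pwl, 3),
          {n: round(v, 3) for n, v in x_hill.items()})

As n grows, the Hill curves produced by this loop collapse onto the piecewise linear solution, which is the numerical content of the limit discussed above.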
5. Application Examples

Boolean networks have been successfully applied in modeling many biological processes. Here, we describe two application examples, based on our research on Boolean dynamic modeling of signaling networks.
5.1. Abscisic acid-induced stomatal closure

Plants take up carbon dioxide for photosynthesis and lose water by transpiration through pores called stomata. The guard cells, specialized cells that flank the stomata and determine their size, have developed into a favorite model system for understanding plant signal transduction. For example, under drought stress conditions, plants synthesize the phytohormone ABA, which triggers cellular responses in guard cells, resulting in stomatal closure to reduce plant water loss. ABA-induced stomatal closure has been studied by many different labs, but the information about this signal transduction process had been quite fragmentary. In Li et al. (2006), an ABA-induced stomatal closure signaling network with over 40 components was assembled by an extensive curation of the experimental literature. In this network, the input node is the ABA signal and the output node is the response of guard cells to the ABA signal, that is, the closure of the stomata. The intermediate nodes include important proteins such as the G protein α subunit GPA1, the G protein β subunit AGB1, and the protein kinase OST1, second messengers such as cytosolic Ca2+ and phosphatidic acid, and ion flows. Integrating a large number of experimental observations, an asynchronous Boolean dynamic model was developed to simulate the ABA signaling process. The node Closure has a fixed point for each state of the input node, corresponding to Closure = OFF for ABA = OFF and Closure = ON for ABA = ON. Randomly selected initial conditions
were extensively sampled, and the fraction of "Closure = ON" states across all simulations was used as the output of the model, representing the percentage of the stomata in a population that have closed due to ABA signaling (Li et al., 2006). Simulating the knockout of signaling components and comparing the percentage of closed stomata with that in the wild type indicates that the assembled network is robust against a significant fraction of perturbations, and also identifies essential components, such as membrane depolarizability, anion efflux, and actin cytoskeleton reorganization, whose disruption leads to insensitivity to ABA. In addition, the dynamic model is able to classify nodes by determining whether their disruption would lead to hyposensitivity, hypersensitivity, or insensitivity to ABA. Several of these predictions have been validated by wet-bench experiments, demonstrating the power of discrete dynamic models (Li et al., 2006).
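The actual 40-plus-node network is too large to reproduce here, but the sampling protocol itself (random initial states evolved by random-order asynchronous updates, with the fraction of runs ending in Closure = ON taken as the model output) is easy to sketch. The three-node chain and its rules below are hypothetical placeholders, not part of the published model.

    import random

    # Toy stand-in for the ABA network: ABA -> kinase -> closure.
    rules = {
        "kinase":  lambda s: s["ABA"],
        "closure": lambda s: s["kinase"],
    }

    def simulate(aba_on, steps=50):
        state = {"ABA": aba_on,
                 "kinase": random.randint(0, 1),
                 "closure": random.randint(0, 1)}
        for _ in range(steps):
            node = random.choice(list(rules))   # asynchronous update order
            state[node] = rules[node](state)
        return state["closure"]

    runs = 1000
    frac = sum(simulate(aba_on=1) for _ in range(runs)) / runs
    print("fraction of runs with Closure = ON:", frac)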
5.2. T-LGL survival signaling network

T cell large granular lymphocyte (T-LGL) leukemia represents a class of lymphoproliferative diseases characterized by an abnormal clonal proliferation of cytotoxic T cells. Unlike normal cytotoxic T lymphocytes (CTL), which are eliminated by activation-induced cell death, leukemic T-LGL are not sensitive to Fas-induced apoptosis (which is crucial to normal activation-induced cell death) and thus remain long-term competent. In Zhang et al. (2008), a T-LGL survival signaling network with over 50 components was created by using NET-SYNTHESIS (Kachalo et al., 2008) to integrate the signaling relationships collected from databases and the literature. The main input node in this network is "Stimuli," representing antigen stimulation, and the main output node is "Apoptosis," summarizing the biological effect in normal activation-induced cell death. This network describes how stimuli like chronic virus infection activate the T cell receptor and a subsequent signaling cascade and induce the depletion of reactive CTL through activation-induced cell death. Certain nodes in this network are activated only in leukemic T-LGL and affect the normal activation-induced cell death. Based on the assembled signaling network, a predictive Boolean dynamic model was constructed. Simulating the overexpression of proteins indicates that all known signaling abnormalities in leukemic T-LGL can be reproduced by keeping only two proteins, IL-15 and PDGF, constitutively expressed (ON). The study also identified key mediators of the disease, such as NF-κB, SPHK1, and S1P, which stabilize into an ON or OFF state in T-LGL leukemia and whose reversal leads to effective cell death. Several predictions of the model were validated by corresponding wet-bench experiments and provide important insights for the possible treatment of this disease (Zhang et al., 2008). This example again demonstrates that discrete dynamic modeling can help to generate important testable hypotheses without the requirement of kinetic details and quantitative
information. Such a global view of this complicated biological process would not have been possible without network assembly and discrete dynamic modeling, since patient samples are scarce for this disease.
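In studies of this kind, knockout and overexpression are implemented by clamping a node OFF or ON before and after every update and then comparing the output distribution with the wild type. A minimal sketch of the clamping logic, again with hypothetical rules rather than the actual T-LGL network:

    import random

    rules = {
        "IL15":      lambda s: s["IL15"],       # input node keeps its value
        "NFkB":      lambda s: s["IL15"],
        "apoptosis": lambda s: 1 - s["NFkB"],   # hypothetical inhibitory rule
    }

    def simulate(clamp=None, steps=100):
        # clamp maps node names to fixed values, e.g. {"IL15": 1}
        state = {n: random.randint(0, 1) for n in rules}
        clamp = clamp or {}
        state.update(clamp)
        for _ in range(steps):
            node = random.choice(list(rules))
            state[node] = rules[node](state)
            state.update(clamp)                 # re-impose the perturbation
        return state["apoptosis"]

    on = sum(simulate({"IL15": 1}) for _ in range(500)) / 500
    off = sum(simulate({"IL15": 0}) for _ in range(500)) / 500
    print("apoptosis frequency, IL-15 ON vs OFF:", on, off)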
6. Conclusion and Discussion

This chapter introduced discrete dynamic modeling and network analysis approaches. Network-based discrete dynamic modeling allows the logical organization of disparate information from biological experiments into a coherent framework. Based on these models, predictive and testable hypotheses can be obtained; for example, simulating the knockout or overexpression of components allows us to predict phenotypic responses and find new targets or intervention points. It is a powerful tool to help us understand the system-level behavior of cellular signaling pathways, and it can save much work that would otherwise have to be done in vivo and in vitro. Importantly, discrete dynamic modeling is conceptually simple and fits biologists' intuitive thinking without a requirement for sophisticated quantitative knowledge. It is worth noting that there is no permanent model for a biological system. The efficacy and accuracy of dynamic models depend heavily on the current knowledge that was used as input to the model. While guiding experimental design and helping to generate testable hypotheses, models may become outdated as more biological observations accumulate. At that point, the models need to be modified and refined. Such interplay between theoretical modeling and biological experimentation plays an essential role in the advancement of systems biology.
ACKNOWLEDGMENTS

This work and the original research reported here were partially supported by NSF grants MCB-0618402 and CCF-0643529 (CAREER), NIH grant R01 GM083113-01, and USDA grant 2006-35100-17254.
REFERENCES

Albert, R. (2005). Scale-free networks in cell biology. J. Cell Sci. 118, 4947–4957.
Albert, R., and Othmer, H. G. (2003). The topology of the regulatory interactions predicts the expression pattern of the segment polarity genes in Drosophila melanogaster. J. Theor. Biol. 223, 1–18.
Albert, R., DasGupta, B., Dondi, R., Kachalo, S., Sontag, E., Zelikovsky, A., and Westbrooks, K. (2007). A novel method for signal transduction network inference from indirect experimental evidence. J. Comput. Biol. 14, 927–949.
Albert, I., Thakar, J., Li, S., Zhang, R., and Albert, R. (2008). Boolean network simulations for life scientists. Source Code Biol. Med. 3, 16.
Aldridge, B. B., Burke, J. M., Lauffenburger, D. A., and Sorger, P. K. (2006). Physicochemical modelling of cell signalling pathways. Nat. Cell Biol. 8, 1195–1203.
Barabasi, A. L., and Oltvai, Z. N. (2004). Network biology: Understanding the cell's functional organization. Nat. Rev. Genet. 5, 101–113.
Bornholdt, S. (2008). Boolean network models of cellular regulation: Prospects and limitations. J. R. Soc. Interface 5(Suppl. 1), S85–S94.
Buck, M. J., and Lieb, J. D. (2004). ChIP-chip: Considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics 83, 349–360.
Chaouiya, C. (2007). Petri net modelling of biological networks. Brief Bioinform. 8, 210–219.
Chaves, M., Albert, R., and Sontag, E. D. (2005). Robustness and fragility of Boolean models for genetic regulatory networks. J. Theor. Biol. 235, 431–449.
Chaves, M., Sontag, E. D., and Albert, R. (2006). Methods of robustness analysis for Boolean models of gene control networks. Syst. Biol. (Stevenage) 153, 154–167.
Cho, K. H., and Wolkenhauer, O. (2003). Analysis and modelling of signal transduction pathways in systems biology. Biochem. Soc. Trans. 31, 1503–1509.
Conzelmann, H., and Gilles, E. D. (2008). Dynamic pathway modeling of signal transduction networks: A domain-oriented approach. Methods Mol. Biol. 484, 559–578.
Davidich, M., and Bornholdt, S. (2008a). The transition from differential equations to Boolean networks: A case study in simplifying a regulatory network model. J. Theor. Biol. 255, 269–277.
Davidich, M. I., and Bornholdt, S. (2008b). Boolean network model predicts cell cycle sequence of fission yeast. PLoS ONE 3, e1672.
De Jong, H., Gouze, J. L., Hernandez, C., Page, M., Sari, T., and Geiselmann, J. (2004). Qualitative simulation of genetic regulatory networks using piecewise-linear models. Bull. Math. Biol. 66, 301–340.
Ellner, S. P., and Guckenheimer, J. (2006). Dynamic Models in Biology. Princeton University Press, Princeton, NJ.
Espinosa-Soto, C., Padilla-Longoria, P., and Alvarez-Buylla, E. R. (2004). A gene regulatory network model for cell-fate determination during Arabidopsis thaliana flower development that is robust and recovers experimental gene expression profiles. Plant Cell 16, 2923–2939.
Figeys, D., McBroom, L. D., and Moran, M. F. (2001). Mass spectrometry for the study of protein–protein interactions. Methods 24, 230–239.
Foth, B. J., Zhang, N., Mok, S., Preiser, P. R., and Bozdech, Z. (2008). Quantitative protein expression profiling reveals extensive post-transcriptional regulation and post-translational modifications in schizont-stage malaria parasites. Genome Biol. 9, R177.
Gilbert, D., Fuss, H., Gu, X., Orton, R., Robinson, S., Vyshemirsky, V., Kurth, M. J., Downes, C. S., and Dubitzky, W. (2006). Computational methodologies for modelling, analysis and simulation of signalling networks. Brief Bioinform. 7, 339–353.
Glass, L. (1975). Classification of biological networks by their qualitative dynamics. J. Theor. Biol. 54, 85–107.
Gomperts, B. D., Kramer, I. M., and Tatham, P. E. R. (2003). Signal Transduction. Academic Press, San Diego, California.
Harvey, I., and Bossomaier, T. (1997). Time out of joint: Attractors in asynchronous random Boolean networks. In "Proceedings of the Fourth European Conference on Artificial Life (ECAL97)," (P. Husbands and I. Harvey, eds.), pp. 67–75. MIT Press, Cambridge, MA.
Hatzimanikatis, V., Li, C., Ionita, J. A., and Broadbelt, L. J. (2004). Metabolic networks: Enzyme function and metabolite structure. Curr. Opin. Struct. Biol. 14, 300–306.
Huang, A. C., Hu, L., Kauffman, S. A., Zhang, W., and Shmulevich, I. (2009). Using cell fate attractors to uncover transcriptional regulation of HL60 neutrophil differentiation. BMC Syst. Biol. 3, 20.
Jarrah, A. S., and Laubenbacher, R. (2007). Finite dynamical systems: A mathematical framework for computer simulation. In "Mathematical Modeling, Simulation, Visualization and e-Learning," (D. Konaté, ed.), pp. 343–358. Springer, Berlin Heidelberg.
Kachalo, S., Zhang, R., Sontag, E., Albert, R., and DasGupta, B. (2008). NET-SYNTHESIS: A software for synthesis, inference and simplification of signal transduction networks. Bioinformatics 24, 293–295.
Kauffman, S. A. (1969). Metabolic stability and epigenesis in randomly constructed genetic nets. J. Theor. Biol. 22, 437–467.
Kaufman, M., Andris, F., and Leo, O. (1999). A logical analysis of T cell activation and anergy. Proc. Natl. Acad. Sci. USA 96, 3894–3899.
Kervizic, G., and Corcos, L. (2008). Dynamical modeling of the cholesterol regulatory pathway with Boolean networks. BMC Syst. Biol. 2, 99.
Kürten, K. E. (1988). Correspondence between neural threshold networks and Kauffman Boolean cellular automata. J. Phys. A 21, L615–L619.
Lähdesmäki, H., Shmulevich, I., and Yli-Harja, O. (2003). On learning gene regulatory networks under the Boolean network model. Machine Learn. 52, 147–167.
Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber, G. K., Hannett, N. M., Harbison, C. T., Thompson, C. M., Simon, I., Zeitlinger, J., Jennings, E. G., et al. (2002). Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298, 799–804.
Li, F., Long, T., Lu, Y., Ouyang, Q., and Tang, C. (2004). The yeast cell-cycle network is robustly designed. Proc. Natl. Acad. Sci. USA 101, 4781–4786.
Li, S., Assmann, S. M., and Albert, R. (2006). Predicting essential components of signal transduction networks: A dynamic model of guard cell abscisic acid signaling. PLoS Biol. 4, e312.
May, R. M. (1976). Simple mathematical models with very complicated dynamics. Nature 261, 459–467.
Mendoza, L., and Alvarez-Buylla, E. R. (2000). Genetic regulation of root hair development in Arabidopsis thaliana: A network model. J. Theor. Biol. 204, 311–326.
Mendoza, L., Thieffry, D., and Alvarez-Buylla, E. R. (1999). Genetic control of flower morphogenesis in Arabidopsis thaliana: A logical analysis. Bioinformatics 15, 593–606.
Reed, J. L., Vo, T. D., Schilling, C. H., and Palsson, B. O. (2003). An expanded genome-scale model of Escherichia coli K-12 (iJR904 GSM/GPR). Genome Biol. 4, R54.
Ropers, D., de Jong, H., Page, M., Schneider, D., and Geiselmann, J. (2006). Qualitative simulation of the carbon starvation response in Escherichia coli. Biosystems 84, 124–152.
Sackmann, A., Heiner, M., and Koch, I. (2006). Application of Petri net based analysis techniques to signal transduction pathways. BMC Bioinform. 7, 482.
Saez-Rodriguez, J., Simeoni, L., Lindquist, J. A., Hemenway, R., Bommhardt, U., Arndt, B., Haus, U. U., Weismantel, R., Gilles, E. D., Klamt, S., and Schraven, B. (2007). A logical model provides insights into T cell receptor signaling. PLoS Comput. Biol. 3, e163.
Sanchez, L., and Thieffry, D. (2001). A logical analysis of the Drosophila gap-gene system. J. Theor. Biol. 211, 115–141.
Sanchez, L., and Thieffry, D. (2003). Segmenting the fly embryo: A logical analysis of the pair-rule cross-regulatory module. J. Theor. Biol. 224, 517–537.
Thakar, J., Pilione, M., Kirimanjeswara, G., Harvill, E. T., and Albert, R. (2007). Modeling systems-level regulation of host immune responses. PLoS Comput. Biol. 3, e109.
Thakar, J., Saadatpour-Moghaddam, A., Harvill, E. T., and Albert, R. (2009). Constraint-based network model of pathogen–immune system interactions. J. R. Soc. Interface 6, 599–612.
Thomas, R. (1973). Boolean formalization of genetic control circuits. J. Theor. Biol. 42, 563–585.
von Dassow, G., Meir, E., Munro, E. M., and Odell, G. M. (2000). The segment polarity network is a robust developmental module. Nature 406, 188–192.
Walhout, A. J., and Vidal, M. (2001). High-throughput yeast two-hybrid assays for large-scale protein interaction mapping. Methods 24, 297–306.
Zhang, R., Shah, M. V., Yang, J., Nyland, S. B., Liu, X., Yun, J. K., Albert, R., and Loughran, T. P. Jr. (2008). Network model of survival signaling in large granular lymphocyte leukemia. Proc. Natl. Acad. Sci. USA 105, 16308–16313.
CHAPTER TWELVE

The Basic Concepts of Molecular Modeling

Akansha Saxena,* Diana Wong,* Karthikeyan Diraviyam,† and David Sept†

* Biomedical Engineering, Washington University, St. Louis, Missouri, USA
† Biomedical Engineering and Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA

Contents
1. Introduction
2. Homology Modeling
   2.1. Sequence analysis
   2.2. Secondary structure prediction
   2.3. Tertiary structure prediction
   2.4. Structure validation
   2.5. Conclusions
3. Molecular Dynamics
   3.1. Molecular mechanics
   3.2. Setting up and running simulations
   3.3. Simulation analysis
4. Molecular Docking
   4.1. Basic components
   4.2. Choosing the correct tool
   4.3. Preparing the molecules
   4.4. Iterative docking and analysis
   4.5. Post analysis
   4.6. Virtual screening
   4.7. Conclusions
References

Abstract
Molecular modeling techniques have made significant advances in recent years and are becoming essential components of many chemical, physical, and biological studies. Here we present three widely used techniques for the simulation of biomolecular systems: structural and homology modeling, molecular dynamics, and molecular docking. For each of these topics we present a brief discussion of the underlying scientific basis of the technique, some simple examples of how the method is commonly applied, and some discussion of the limitations and caveats of which the user should be aware. References for further reading as well as an extensive list of software resources are provided.
1. Introduction

Molecular modeling techniques have made significant advances in recent years and are becoming essential components of many chemical, physical, and biological studies. There are two primary reasons for this evolution: first, there has been an explosive growth in the available structural data for proteins, not only from X-ray crystallography, but also from NMR and electron microscopy studies. Second, accompanying this growth in the area of structural biology have come significant advances in both computational techniques and hardware. As computers continue to increase in speed and capability, we are able to tackle larger and more complex systems. The purpose of this chapter is to outline the basic methodology behind three commonly used techniques. The first section features a discussion of biomolecular structure and homology modeling techniques; we then discuss molecular dynamics (MD) and the sampling of protein conformational space; and finally, we cover molecular docking applications. In each of these sections, we will present the basic approach, discuss some of the details, caveats, and limitations, and provide additional references for the reader who requires more information.
2. Homology Modeling

As each genome is sequenced, we are faced with the daunting task of digging for useful information in this growing ocean of letters. At the time of writing this article, GenBank (Benson et al., 2009) reports nearly 100 million sequences and the Protein Data Bank contains almost 55,000 protein structures (Berman et al., 2000). Given this amount of information, it is next to impossible to manually sort through, order, and correlate all of the data. Thanks to major improvements in computing power and algorithms, we can easily handle such large-scale data and derive useful information. Computational modeling has become an essential tool in guiding and enabling one to make rational decisions in hypothesis-driven biological research. In parallel, the wide availability of Web-based applications has produced several computational tools as online servers, which are for the most part user friendly and have thus made these methods more accessible to researchers. Indeed, most of the tools and protocols that
are discussed in this chapter can be accessed and utilized using just a modern-day laptop with an Internet connection. Since the purpose of this chapter is to give the readers a flavor of the different computational methods that involve protein modeling, we are going to skip any discussion of genomics and dive directly into proteomics. There are numerous tools, each with its unique protocol to predict a desired property, and it is beyond the scope of this chapter to discuss each tool in detail. For an exhaustive list of other tools, readers are referred to sites such as Expasy (Gasteiger et al., 2003), NCBI, EMBL-EBI, BiologyWorkBench (Subramaniam, 1998), and references (Kretsinger et al., 2004; Madhusudhan et al., 2005) (see Table 12.1).
2.1. Sequence analysis

It is possible to get general information about a protein's function by identifying certain motifs or domains in its sequence. For example, one can calculate the hydrophobicity of the amino acid at each position, thereby creating a hydropathy plot for a given sequence. Using such information, one can get an idea of whether a sequence segment of a protein lies in the protein interior, on the protein surface, or perhaps in a transmembrane segment. One such tool, TMHMM (Krogh et al., 2001), predicts transmembrane segments by applying a hidden Markov model (HMM) (for more details, see, e.g., Punta et al., 2007). If we are interested in predicting possible posttranslational modification sites, DNA-binding motifs, or signaling sites, one can scan the query sequence against protein databases such as Prosite (Hulo et al., 2008), Prints (Attwood et al., 2003), or InterPro (Hunter et al., 2009). PredictProtein (Rost et al., 2004) and SCRATCH (Cheng et al., 2005) are online servers that analyze the sequence by submitting it to several prediction tools, each of which examines the sequence for unique features. Since all these tools are based on statistics, it is always recommended to try multiple software packages to look for consensus and inconsistencies in the individual predictions. Many meta-servers like InterProScan (Zdobnov and Apweiler, 2001) or PredictProtein (Rost et al., 2004) enable the user to simultaneously submit the query sequence to multiple online tools from one central Web site. Web sites like EMBL-EBI, Expasy, and NCBI, as mentioned above, maintain lists of tools available for various forms of analysis. At the end of the day, one has to keep in mind the limitations of these tools and that the accuracy of any tool is never 100%. As mentioned above, all these prediction algorithms are based on statistical analysis of available data across multiple organisms. If any of these tools lacks specificity for the organism of interest, the confidence in its predictions may be questionable. One should always cross-check the predictions against the available knowledge of the system and see if they are compatible with the system that is being studied.
Table 12.1 Molecular modeling software and resources

GenBank: http://www.ncbi.nlm.nih.gov/Genbank/
Protein Data Bank: http://www.rcsb.org
Expasy: http://ca.expasy.org/
NCBI: http://www.ncbi.nlm.nih.gov
EMBL-EBI: http://www.ebi.ac.uk/
InterPro: http://www.ebi.ac.uk/interpro/
Swiss-Prot: http://ca.expasy.org/sprot/
UniProt: http://www.uniprot.org/
SMART: http://smart.embl-heidelberg.de/
Pfam: http://pfam.sanger.ac.uk/
PROSITE: http://ca.expasy.org/prosite/
PRINTS: http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php
Biology Workbench: http://workbench.sdsc.edu/
PredictProtein: http://www.predictprotein.org/
SCRATCH: http://www.igb.uci.edu/tools/scratch/
InterProScan: http://www.ebi.ac.uk/Tools/InterProScan/
BLAST: http://blast.ncbi.nlm.nih.gov/Blast.cgi
FASTA: http://www.ebi.ac.uk/Tools/fasta/index.html
PSI-BLAST: http://www.ebi.ac.uk/Tools/psiblast/
PHI-BLAST: http://www.ebi.ac.uk/Tools/blastpgp/
HMMER: http://hmmer.janelia.org/
ClustalW: http://www.ebi.ac.uk/Tools/clustalw2/index.html
TMHMM: http://www.cbs.dtu.dk/services/TMHMM/
Jpred3: http://www.compbio.dundee.ac.uk/www-jpred/
PSIPRED: http://bioinf.cs.ucl.ac.uk/psipred/
PHD: http://www.predictprotein.org/
SSPro: http://scratch.proteomics.ics.uci.edu/
FUGUE: http://tardis.nibio.go.jp/fugue/
MODELLER: http://salilab.org/modeller/
SWISS-MODEL: http://swissmodel.expasy.org//SWISS-MODEL.html
3D-JIGSAW: http://bmm.cancerresearchuk.org/3djigsaw/
PLOP: http://www.jacobsonlab.org/plop_manual/plop_overview.htm
123D+: http://123d.ncifcrf.gov/123D+.html
pGenTHREADER: http://bioinf.cs.ucl.ac.uk/psipred/
3D-PSSM: http://www.sbg.bio.ic.ac.uk/3dpssm/index2.html
Rosetta: http://robetta.bakerlab.org/
I-TASSER: http://zhang.bioinformatics.ku.edu/I-TASSER/
PROCHECK: http://www.biochem.ucl.ac.uk/roman/procheck/procheck.html
WHAT IF: http://swift.cmbi.kun.nl/whatif/
ProSA: https://prosa.services.came.sbg.ac.at/prosa.php
Verify3D: http://nihserver.mbi.ucla.edu/Verify_3D/
Jalview: http://www.jalview.org/
T-Coffee: http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi
Chimera: http://www.cgl.ucsf.edu/chimera/
AMBER: http://ambermd.org/
GROMACS: http://www.gromacs.org/
TINKER: http://dasher.wustl.edu/tinker/
NAMD: http://www.ks.uiuc.edu/Research/namd/
CHARMM: http://www.charmm.org/
DESMOND/IMPACT: http://www.schrodinger.com/
AutoDock: http://autodock.scripps.edu/
ClusPro: http://cluspro.bu.edu/login.php
DOCK: http://dock.compbio.ucsf.edu/
FITTED: http://www.fitted.ca/
FlexX: http://www.biosolveit.de/flexx/
FRED: http://www.eyesopen.com/products/applications/fred.html
ICM: http://www.molsoft.com/docking.html
GLIDE: http://www.schrodinger.com/ProductDescription.php?mID=6&sID=6
GOLD: http://www.ccdc.cam.ac.uk/products/life_sciences/gold/
HADDOCK: http://www.nmr.chem.uu.nl/haddock/
HEX: http://www.loria.fr/~ritchied/hex/
PatchDock: http://bioinfo3d.cs.tau.ac.il/
SymmDock: http://bioinfo3d.cs.tau.ac.il/
FireDock: http://bioinfo3d.cs.tau.ac.il/
RosettaDock: http://rosettadock.graylab.jhu.edu/
Surflex: http://www.optive.com/
A more common and direct approach toward prediction of protein function from sequence is through searching the sequence databases such as NCBI or Swiss-Prot (Bairoch et al., 2004), using tools such as BLAST (Altschul et al., 1990; McGinnis and Madden, 2004) or FASTA (Pearson, 1990), for closely
related sequences to your query. The best-case scenario is when the query picks out sequences (hits) from the database that are both functionally well characterized and share a high sequence identity. Based on high sequence similarity to the hit, it can generally be assumed that the query protein will have a similar fold, and hence might belong to the same structural family as a given hit and, depending on the extent of sequence identity, possibly have a function similar to the hit. This generalization of the sequence–structure–function relationship has its root in the process of divergent evolution. In divergent evolution, two proteins that share sequence similarity diverged from a common ancestor, and since structure diverges more slowly than sequence, they should also have a similar fold. In general, two sequences sharing more than 40% sequence identity share a similar structural fold (Davidson, 2008). Yet this simple sequence–structure–function relationship is not always true (Roessler et al., 2008). A famous example is myoglobin and hemoglobin, proteins that are similar in structure and function yet have only 20% sequence similarity. On the same note, convergent evolution can result in proteins with very different structural folds but similar function (e.g., subtilisin). These exceptions to the standard rule are to be borne in mind when analyzing two sequences and inferring function or fold on the basis of sequence similarity. In a less than ideal scenario, a query fails to generate any significant hits, or the generated hits share low sequence similarity (<25%) with the query and form a collection of diverse proteins that cannot be used for any further inference of function or fold. The reasons behind this could be many. The protein of interest and its supposed homologs may have acquired major insertions and deletions that make the alignment between the two problematic. Another possibility is that the query protein is a multidomain protein and the domains are not properly aligned with sequences in the database. For such difficult queries, instead of doing a pairwise search as discussed above, the search is done based on a pattern or a profile generated from a multiple sequence alignment (Pei, 2008) based on the query protein sequence. ClustalW (Larkin et al., 2007) is one of the most widely used tools for generating multiple sequence alignments. Software tools such as PSI-BLAST (Altschul et al., 1997), PHI-BLAST (Zhang et al., 1998), and HMMER (Eddy, 1998) are packages that are used to perform and enable pattern/profile/blocks-based searching. The patterns are generated using methods like dynamic programming, HMMs, or genetic algorithms. Using these multiple sequence alignment-based patterns or profiles increases the search space of the query in a database, as it increases the combinations of sequence space that can be matched. These patterns/profiles are searched against protein databases that incorporate such profiles/patterns in their own creation and classification of protein sequences. These include databases such as Prosite (Hulo et al., 2008), Pfam (Finn et al., 2008),
SMART (Letunic et al., 2009), and InterPro (Hunter et al., 2009). It is to be noted that, since these databases were built based on different protocols, their contents do differ. Hence, it is important that the query is searched against multiple databases and that the user applies a consensus approach in interpreting the results. Finally, it is crucial to know whether the alignment predicted between the query and the hit is statistically significant and not some random match. The statistical significance of an alignment is represented as an expectation value, or E-value. Basically, the E-value of an alignment having score X is the number of times one expects to find alignments having score equal to or greater than X in a comparison of random sequences similar to the query. The lower the E-value, the better the predicted alignment. Tools such as Jalview (Waterhouse et al., 2009), T-Coffee (Notredame et al., 2000), and Chimera (Pettersen et al., 2004) can be used to visualize and edit multiple sequence alignments. Choosing the right homolog for your query is important because it forms the first step for homology modeling, discussed below.
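As one concrete route to such a search, the Biopython wrappers around NCBI BLAST can submit a remote query and filter the hits by E-value. The sketch below assumes Biopython is installed, a query sequence in a file named query.fasta, and network access to NCBI; the E-value cutoff is an arbitrary illustrative choice.

    from Bio import SeqIO
    from Bio.Blast import NCBIWWW, NCBIXML

    # Read the query and submit a remote BLASTP search against nr.
    record = SeqIO.read("query.fasta", "fasta")   # assumed input file
    handle = NCBIWWW.qblast("blastp", "nr", str(record.seq))
    result = NCBIXML.read(handle)

    # Keep only statistically significant hits.
    E_CUTOFF = 1e-5
    for alignment in result.alignments:
        for hsp in alignment.hsps:
            if hsp.expect < E_CUTOFF:
                identity = 100.0 * hsp.identities / hsp.align_length
                print(alignment.title[:60], hsp.expect, round(identity), "%")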
2.2. Secondary structure prediction

The secondary structure prediction of your query sequence can also aid in determining the structural fold of the protein. For example, if the secondary structure profile of the query matches that of a family with a protein of known structure, then the query protein probably has a similar structure. Use of secondary structure information can also improve the sequence alignment of the query with its homolog. To aid secondary structure prediction analysis, only primary elements like alpha-helices and beta-strands are examined, while other structural features are considered to be coils. Methods such as nearest neighbor, neural networks, and HMMs are used to generate these predictions. The most commonly used secondary structure prediction software includes JPRED (Cole et al., 2008), PSIPRED (Bryson et al., 2005), PHD (Rost and Sander, 1993), and SSPro (Cheng et al., 2005), and some methods can also provide information on the solvent accessibility of the amino acids in the query. It should be noted that current prediction protocols have an accuracy rate of 65–75%, with slightly higher accuracy in the prediction of alpha-helices. Hence, it is again important to test the predictions using multiple software packages and take a consensus approach in interpreting the results.
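The consensus approach recommended above can be as simple as a per-residue majority vote over the outputs of several predictors. The three prediction strings below are made-up placeholders (H = helix, E = strand, C = coil), not real program output.

    from collections import Counter

    predictions = [
        "CCHHHHHHCCEEEECC",
        "CCHHHHHCCCEEEECC",
        "CCCHHHHHCCEEEECC",
    ]

    def consensus(preds):
        # Majority vote at each position; ties fall back to coil.
        result = []
        for column in zip(*preds):
            label, count = Counter(column).most_common(1)[0]
            result.append(label if count > len(preds) / 2 else "C")
        return "".join(result)

    print(consensus(predictions))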
2.3. Tertiary structure prediction

Predicting the 3D-structure of a protein from its amino acid sequence is the "holy grail" for structural modelers. Having a 3D-structure of your query protein can greatly enhance hypothesis-driven biological research.
Depending on the quality of the model produced, it can be used to study electrostatic profiles under different conditions (Diraviyam et al., 2003; Sali et al., 1993), prediction of possible ligand or small molecule binding sites, determination of residues for site-directed mutagenesis, refining of X-ray and NMR structures, and a host of other possibilities. The methods to predict the 3D-structure of a protein broadly fall into three categories: (1) homology or comparative modeling, (2) threading, and (3) ab initio or de novo modeling. The best method to choose for building a model depends on how evolutionarily diverged the query protein is from the families of proteins in the database that have a representative structure. The most straightforward case is when the query protein produces significant hits of high sequence similarity (generally >40%) and one of those hits also has a representative X-ray or NMR structure. In this case, opting for homology modeling can produce reliable models that are within 2 Å RMSD of the experimental structure. For instances of low sequence identity, for example, where a multiple sequence alignment approach was required to produce reliable hits, homology modeling based on a single structure may not be reliable. However, current homology modeling protocols can produce good structural models even in these low sequence similarity cases by using multiple templates to model different regions. For the most difficult cases of low sequence similarity, alternate model building methods such as threading and ab initio modeling can be used to produce structural models of the query.

2.3.1. Homology modeling
The aim of homology modeling is to model or predict the structural coordinates of a query protein based on the known structure of a sequence homolog (generally referred to as the template). Since the model produced can be only as good as the template, it is imperative that rigorous analysis is done before choosing the template. This is achieved by searching the query sequence on various databases and using different search protocols, as discussed above in sequence analysis. In cases where multiple templates are available, it is usually best to use the template with the highest resolution and fewest gaps in sequence/structure. One should also be aware of the environmental conditions under which the template structure was solved. After selecting an appropriate template, the next step involves determining the best alignment between the template and query. The default alignment that is produced by sequence search tools may not be optimal, and the use of profiles/patterns derived from multiple sequence alignments could again enhance the quality of alignment for less similar sequences. As with choosing the template, threading can also help in improving the sequence alignment. Threading adds structural information to sequence-based alignment. Programs such as FUGUE (Shi et al., 2001) and SALIGN (implemented in MODELLER; Marti-Renom et al., 2000)
use an intermediate protocol where the structural information is used in a profile-based alignment. Once an optimal alignment is constructed, this information is fed into homology model building software. There are online model building servers such as SWISS-MODEL (Schwede et al., 2003) or 3D-JIGSAW (Bates et al., 2001), or stand-alone software such as MODELLER, that can be used. In MODELLER, the model is built based on spatial restraints resulting from the query-template alignment. It can also perform de novo loop prediction and use multiple templates to construct a model. The final model is generated using conjugate gradient minimization and MD simulation (simulated annealing). In analyzing the final model, one has to keep in mind that in addition to errors from an incorrect alignment or a nonideal template, there are also errors inherent to the model building method that can result in backbone distortions even in conserved regions, as well as side-chain packing or conformation errors. In such instances, protein optimization programs such as PLOP (Jacobson et al., 2002), which can perform multiple cycles of side-chain optimization and minimization, can be used to improve the model.

2.3.2. Threading
Threading is typically used in cases where there is no significant homology predicted (<20%) between your query and any sequence in the database, but it can also be employed to improve the alignment of low homology query-template sequences. Briefly, the query sequence is threaded through all available folds in the structural database, a score for each fold is calculated using a suitable scoring scheme, and the fold that gives the best score is assumed to be the fold of the query sequence. For more details on the methods used in threading the sequence and the scoring algorithms, the reader is referred to Mount (2001). Some of the popular online servers for threading include 123D (Alexandrov et al., 1996), pGenTHREADER (Lobley et al., 2009), and 3D-PSSM (Kelley et al., 2000). Since this method is most often used when the sequence similarity is very low, interpreting details beyond the fold of the protein, like side-chain interactions, is not reliable. In this case, protein optimization software can again be useful in improving the structure.

2.3.3. Ab initio modeling
Ab initio modeling uses a combination of statistical analysis and physics-based energy functions to predict the native fold of a given sequence. Ab initio modeling is preferred for predicting the structure of a sequence when no suitable template is found, or if it is known that the query adopts a different fold than the predicted template in spite of the sequence similarity. The various ab initio algorithms use statistical information, secondary structure prediction, and fragment assembly for fold prediction.
Also common to all algorithms is a simplified representation of the protein to keep the prediction problem tractable. For more details on the different ab initio prediction methods, the reader is referred to Hardin et al. (2002). Some of the software currently used for ab initio structure prediction includes Rosetta (Rohl et al., 2004) and I-TASSER (Zhang, 2008). Rosetta is one of the more widely used packages. It uses a fragment-based assembly protocol to predict the full structure. Here the fragments form a reduced representation of the protein. A key assumption in Rosetta is that the distribution of structures sampled by a particular sequence fragment is reasonably represented by the distributions of conformations adopted by that fragment and closely related fragments in the protein structure database. Fragment libraries are then built based on the protein structure database. The reduced conformational search space of the query is searched using a Monte Carlo algorithm with an energy function that is defined as the Bayesian probability of sequence/structure matches (Simons et al., 1997). This produces compact structures that have optimized local and nonlocal interactions. Based on the observation that the folding of small proteins predominantly follows a single exponential process, the conformational search is achieved by running short simulations. In general, 1000 short simulations are performed independently (Shortle et al., 1998). The resulting structures are clustered, and generally the central model representing the largest cluster is chosen as the best predicted structure for the query sequence. There are important limitations of this fragment-based ab initio prediction: the conformational search space of fragments is predetermined, and long-range interactions within the protein are not included in the structure prediction of fragments.
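Returning to the comparative route of Section 2.3.1, a minimal MODELLER run follows the pattern below. It assumes MODELLER is installed, a query-template alignment in aln.ali, and the template PDB file in the working directory; the alignment codes and the number of models are placeholders, and this is a sketch of the standard automodel usage rather than a tuned protocol.

    from modeller import environ
    from modeller.automodel import automodel

    env = environ()
    env.io.atom_files_directory = ["."]   # where the template PDB lives

    a = automodel(env,
                  alnfile="aln.ali",      # query-template alignment (assumed)
                  knowns="template",      # template code in the alignment
                  sequence="query")       # query code in the alignment
    a.starting_model = 1
    a.ending_model = 5                    # build five candidate models
    a.make()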
2.4. Structure validation

It is critically important that predicted models, produced using any of the methods discussed above, be further examined to increase confidence in the structural model. Tools such as PROCHECK (Laskowski et al., 1993) and WHATIF (Vriend, 1990) can check multiple structural variables in the model against expected reference values. PROSA (Wiederstein and Sippl, 2007) uses knowledge-based potentials of mean force to evaluate model accuracy. It also gives a Z-score for the model that indicates overall model quality; an anomalous Z-score indicates a problem with the structural model. Verify3D (Luthy et al., 1992) is another tool that utilizes the location and environmental profile of each residue in relation to a set of reference structures to predict the quality of the structure. It should be noted that performing the validation analysis on both the predicted model and the template and comparing the results can be useful, since the model can only be as good as the template.
2.5. Conclusions

As the number of experimentally solved structures and sequences deposited into their respective databases increases, the prediction accuracy of sequence analysis and structure prediction tools will continue to advance. Increased availability of cheap and fast computational resources, as well as improvements in prediction algorithms, will enable more exhaustive and accurate prediction in less time. In this regard, the Critical Assessment of techniques for protein Structure Prediction (CASP), a large-scale meeting that evaluates the current status of prediction algorithms, is an excellent resource for evaluating the current state of the field. At the end of this biennial meeting, the performance-based ranking results are published on the official Web site (http://predictioncenter.org/), and this resource is perhaps the best source for up-to-date information on current methods and algorithms.
3. Molecular Dynamics

We have seen how we can obtain the 3D-structure of a protein by predicting the positions of the nuclear centers of all the atoms that make up the molecule. This structure is helpful in understanding the relative positions of the atoms, but it is only a static picture and does not tell us how proteins move and function. More than four decades ago, scientists demonstrated that if we are able to calculate the relative positions of the protein atoms at small intervals of time, we can then predict the behavior of the atoms over a longer time scale. The principal tool for such calculations is the method of MD simulations that was first introduced in the late 1950s by Alder and Wainwright (1959, 1960). In the 1960s, Rahman carried out the first realistic MD simulations with liquid argon and liquid water (Rahman, 1964; Rahman and Stillinger, 1974), and later in 1976, the first protein simulation was performed by McCammon et al. (1976, 1977). Since then, researchers have used MD to investigate the structural, dynamic, and thermodynamic properties of biological molecules, including such items as the characterization of protein folding kinetics and pathways, protein structure refinement, and protein–protein interactions. The goal of this section is to give an overview of the basic methodology required for studying protein dynamics using MD simulations. We will first talk about the basics of MD and the considerations that go into a simulation. We will then provide a step-by-step protocol on how to prepare a system for MD, and finally, we will discuss analysis tools that can be used on the resulting trajectory to study the dynamics and flexibility of the protein.
3.1. Molecular mechanics

The basis of the MD technique is little more than the integration of Newton's equation of motion, F = ma, to calculate the positions of the atoms over time. The behavior of the atoms in such a calculation can be likened to the movement of billiard balls (Leach, 2001). Of course, billiard balls move in straight lines until they collide with each other and the collisions change their direction, but in our case, the atoms experience a varying force in between collisions. To account for the effects of these changing forces and to get more realistic dynamics, the equation of motion is integrated over short time steps (typically 1–2 fs) such that the forces can be regarded as constant. The integration yields a series of conformations in time that reveal the realistic movement of the atoms and the protein as a whole. Newton's equation is a second-order differential equation (the acceleration a is the second derivative of position), and this means that we are required to provide the positions of each atom, from a crystal or model structure, and their individual velocities, calculated from a thermal distribution. The force that acts on each atom, F, is determined from a molecular mechanics potential by taking the negative gradient (i.e., $F = -\nabla U$). A discussion of molecular mechanics potentials could alone fill multiple chapters; however, it is sufficient to say that these parameter sets are based on first-principles physics but are parameterized empirically through detailed comparisons with many experimental measurements. All molecular mechanics potentials deal with two primary classes of interactions: bonded interactions and nonbonded interactions. The bonded interactions are composed of a bond stretching term for two covalently bonded atoms (see Fig. 12.1), an angle-bending term for three
consecutively bonded atoms, and a torsional term for four consecutive atoms. Since the integration time step is small and the perturbations about the equilibrium point are typically very small in proteins, these potentials are typically modeled as harmonic springs ($E = \frac{1}{2}kx^2$). The use of such effective potentials makes calculations much easier and faster. The nonbonded interactions capture longer range interactions within the protein. The major nonbonded forces include the electrostatic interactions, based on Coulomb's law, and van der Waals interactions, usually based on a Lennard–Jones potential. The following equation sums up these interactions, with the first three terms representing the bonded interactions and the last two the nonbonded interactions:

$$U = \sum_{\mathrm{bonds}} k_b (b - b_o)^2 + \sum_{\mathrm{angles}} k_\theta (\theta - \theta_o)^2 + \sum_{\mathrm{dihedrals}} A\,(1 - \cos(n\phi - \phi_0)) + \sum_{\mathrm{charges}} \frac{q_i q_j}{\epsilon\, r_{ij}} + \sum_{\mathrm{atoms}} \left( \frac{C_{12}}{r_{ij}^{12}} - \frac{C_6}{r_{ij}^6} \right)$$

Figure 12.1 Representation of the bonded and nonbonded interactions (bonds, angles, torsions, and electrostatics and van der Waals) used in molecular mechanics force fields.
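In code, each term of this potential amounts to only a few lines. The sketch below evaluates a harmonic bond term and the nonbonded pair terms for a single pair of atoms; all parameter values are illustrative and are not taken from any particular force field.

    def bond_energy(b, b0=1.53, kb=300.0):
        # Harmonic bond stretching, E = k_b (b - b_o)^2 (units illustrative)
        return kb * (b - b0) ** 2

    def nonbonded_energy(rij, qi, qj, C12=1.0e-3, C6=1.0e-1, eps=1.0):
        # Coulomb term plus Lennard-Jones term for one atom pair
        coulomb = qi * qj / (eps * rij)
        lj = C12 / rij**12 - C6 / rij**6
        return coulomb + lj

    print(bond_energy(1.60))
    print(nonbonded_energy(rij=3.5, qi=-0.5, qj=0.3))

A production force field evaluates sums of such terms over all bonds, angles, dihedrals, and atom pairs, typically with cutoffs and lattice-sum methods for the long-range electrostatics.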
The most commonly used force fields for MD include AMBER, OPLS-AA, CHARMM, AMOEBA, and GROMOS (Brooks et al., 2009; Duan et al., 2003; Hess et al., 2008; Ren and Ponder, 2003). A nice review of the usage of different types of force fields and their development can be found in Ponder and Case (2003), and Cheatham and Young (2000) discuss force fields that can be used for nucleic acids. An important component of MD simulations is the definition of a suitable environment for the protein. For cytoplasmic proteins, this means immersing the protein in water with suitable concentrations of salt or other ions, and maintaining the proper temperature and pressure. There are several water models available in the literature, including SPC, TIP3P, TIP4P, POL3, and AMOEBA (Berendsen et al., 1987; Caldwell and Kollman, 1995; Jorgensen and Madura, 1985; Jorgensen et al., 1983; Mahoney and Jorgensen, 2000). SPC and TIP3P are the simplest models, representing the water molecule with three interaction sites. The other models are more sophisticated, as they include dummy atoms at the lone pair positions, which improve the dipole and quadrupole moments of the water molecule. POL3, SPC/E, and AMOEBA contain additional terms to account for polarizability. Membrane proteins are treated differently, since they have to be immersed in a lipid membrane first, before placing the whole system in a water bath. Several models for lipid membranes are available, such as DPPC, DMPC, POPC, DLPE, DOPC, or DOPE (de Vries et al., 2005; Heller et al., 1993; Tieleman and Berendsen, 1996; Tieleman et al., 1997).
3.2. Setting up and running simulations

Many MD packages are available for no cost or with modest academic fees. Popular packages include AMBER, CHARMM, GROMACS, NAMD, and TINKER (see Table 12.1). Although there are small differences in the implementation depending on the software package being used, the simulation procedure can be divided into four basic steps: preparation, minimization, heating, and production.

3.2.1. Preparation
The first task of the simulation is to obtain a structure of the protein of interest. This is typically a pdb file that contains a list of all atoms in a protein and their 3D coordinates. The protein structure needs to be complete (all atoms), and missing atoms such as hydrogens can be added by using the preparatory tools of the MD software. It is also important to check the protonation states of the histidines, the existence of disulfide bonds, and any potential posttranslational modifications. Next, the protein is immersed in a water bath using any of the water models mentioned above. If it is a membrane protein, then it is first embedded in a lipid bilayer and then the whole system is submerged in water. Salt ions can be added to more closely mimic physiological conditions, and additional ions are typically added to neutralize the overall charge of the system.

3.2.2. Minimization
The starting structure almost assuredly has small atomic clashes, strained bonds and angles, or other potential problems. We need to resolve these issues before starting our simulation, and we do that by minimizing the potential energy of the system, effectively moving all bonds, angles, and so on to their equilibrium values. There is variable quality in the minimization routines of the various MD packages, but as long as major atomic clashes are removed, we should be able to proceed to the next step.

3.2.3. Equilibration
Since we are normally trying to connect simulation results with wet-lab experiments, we need to match the experimental conditions as closely as possible. The minimized protein structure can be viewed as being at 0 K, but we need to heat the system to a "normal" temperature of perhaps 300–310 K. Since the MD protocol is an equilibrium method, we need to slowly perturb the system, usually heating the system in 50 K steps for short periods of time (20–50 ps) until we reach our desired simulation temperature. Once we reach our production temperature, we need to allow the system to equilibrate to again remove any artifacts. The time required for equilibration is a point of debate in the simulation community and depending on the
system, equilibration times may range from 100 ps to 50 ns, or more. When in doubt, more equilibration is certainly the safest route to follow.

3.2.4. Production
Now the system is ready to start the production run. Depending upon the available computer resources, the simulation can be distributed across parallel processors to increase its speed. Just as with the equilibration step, the simulation time will depend on the size of the system and what the ultimate goal of the simulation is. In Section 3.3, we will discuss some of the analysis tools that can be used to evaluate your simulation results.
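As a concrete illustration of the four steps, the sketch below uses the open-source OpenMM package (any of the packages in Table 12.1 could serve equally well). File names, force-field choices, temperatures, and run lengths are placeholders for a real protocol.

    from openmm.app import PDBFile, ForceField, Modeller, Simulation, PME, HBonds
    from openmm import LangevinMiddleIntegrator
    from openmm.unit import kelvin, picosecond, picoseconds, nanometer, molar

    # Preparation: read the structure, add hydrogens, solvate, and add ions
    pdb = PDBFile("protein.pdb")                      # assumed input structure
    ff = ForceField("amber14-all.xml", "amber14/tip3p.xml")
    model = Modeller(pdb.topology, pdb.positions)
    model.addHydrogens(ff)
    model.addSolvent(ff, padding=1.0 * nanometer, ionicStrength=0.15 * molar)

    system = ff.createSystem(model.topology, nonbondedMethod=PME,
                             nonbondedCutoff=1.0 * nanometer, constraints=HBonds)
    integrator = LangevinMiddleIntegrator(300 * kelvin, 1 / picosecond,
                                          0.002 * picoseconds)
    sim = Simulation(model.topology, system, integrator)
    sim.context.setPositions(model.positions)

    sim.minimizeEnergy()                              # minimization
    for temp in (50, 100, 150, 200, 250, 300):        # stepwise heating
        integrator.setTemperature(temp * kelvin)
        sim.step(10000)                               # 20 ps per 50 K step
    sim.step(500000)                                  # 1 ns of production (toy length)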
3.3. Simulation analysis

3.3.1. Equilibration measures
As the simulation progresses, the protein evolves from the minimized state, attains a state of equilibrium, and then begins to fluctuate around this point. One measure used by researchers to study the evolution of the dynamics of the system is the root mean square deviation, or RMSD, of the protein relative to the starting structure. The RMSD is defined as the average deviation of all atoms from their starting positions. The formula used for this analysis is

$$\mathrm{RMSD}(t_1, t_2) = \left[ \frac{1}{M} \sum_{i=1}^{N} m_i \left\| \mathbf{r}_i(t_1) - \mathbf{r}_i(t_2) \right\|^2 \right]^{1/2}$$

where M is the total mass and $\mathbf{r}_i(t)$ is the position of atom i at time t. The calculation can be performed using any set of atoms; however, the backbone atoms or just the alpha carbons are the most common choices. A typical RMSD plot is shown in Fig. 12.2, in this case starting from the equilibration phase of the simulation for a protein of 140 amino acids. During the initial equilibration phase, the protein fluctuates significantly, but after about 50 ns it settles down at a steady value and could be considered to have reached equilibrium. One issue with respect to RMSD is that it depends on the reference state (in the case of Fig. 12.2, the structure at the start of the production phase). For this reason, it can be a nonideal measure of equilibrium, and many researchers use methods such as principal component analysis (discussed later).

Figure 12.2 RMSD profile of a molecular dynamics trajectory using the initial structure as a reference. The inset shows the changes during the first 5 ns of the trajectory.

3.3.2. RMSD fluctuations
RMSD fluctuation, commonly known as RMSF, is a tool which quantifies the dynamics of the polypeptide backbone by finding the extent of movement of each residue around its mean position throughout the length of the simulation. The formula used for this analysis is

$$\mathrm{RMSF}(\mathbf{r}_i) = \left[ \frac{1}{n} \sum_{t=1}^{n} \left\| \mathbf{r}_i(t) - \hat{\mathbf{r}}_i \right\|^2 \right]^{1/2}$$
where, just as before, $\mathbf{r}_i(t)$ is the position of atom i at time t and $\hat{\mathbf{r}}_i$ denotes the average position of atom i. For this analysis, the backbone or alpha carbon atoms are typically selected. The calculation yields large RMSF values for parts of the protein that are highly flexible, while portions that are constrained give lower values. A comparison of these values between wild-type and mutant simulations can give insight into the effects of mutation or ligand binding. An example of an RMSF plot with the calculations performed on the Cα atoms can be seen in Fig. 12.3. As seen from the corresponding protein structure, large RMSF values result for the loop regions of the protein, but other regions (such as helix C) also show large-scale motion.

3.3.3. Principal component analysis
Principal component analysis, or PCA, is a type of eigenvalue analysis in which the complicated dynamics of a system are decomposed into simpler, orthogonal degrees of freedom. PCA is similar to the RMSF calculation discussed above except that the full cross-correlation matrix of all atom pairs is calculated. The eigenmodes of this matrix are determined, and in this way the high frequency, small amplitude fluctuations (small eigenvalues) can be
323
The Basic Concepts of Molecular Modeling
0.5
E
RMSF (nm)
0.4 0.3
C A
0.2
D
B
F
F
B C
E
0.1 0
D 25
50 75 100 Residue numbers
A
150
Figure 12.3 RMSF plot for an MD trajectory. The peak RMSF values are labeled and are seen to correspond with the most flexible regions of the protein.
3.3.3. Principal component analysis
Principal component analysis (PCA) is a type of eigenvalue analysis in which the complicated dynamics of a system are decomposed into simpler, orthogonal degrees of freedom. PCA is similar to the RMSF calculation discussed above except that the full cross-correlation matrix of all atom pairs is calculated. The eigenmodes of this matrix are determined, and in this way the high-frequency, small-amplitude fluctuations (small eigenvalues) can be filtered out of the dynamics trajectory, while the larger, slower motions (large eigenvalues) can be extracted.
Figure 12.4 shows a projection of a dynamics trajectory onto the first two principal modes. The trajectory starts in the upper right corner of the space spanned by these two modes, migrates to the lower left quadrant over the first 100 ns, and then remains in this region for the remaining 100 ns.
Figure 12.4 Projection of an MD trajectory onto the space spanned by the first two principal component vectors. The labeled 30-ns segments trace the path from the start (0–30 ns, upper right) to the end (180–210 ns, lower left).

This is the same MD trajectory used in creating the RMSD plot shown in Fig. 12.2, but it now suggests that the system does not reach equilibrium for almost 100 ns, not the 50 ns suggested by the RMSD analysis. This underlines the challenges and considerations that one faces in performing these types of simulations and emphasizes how carefully the results of an MD simulation should be analyzed. MD simulations are now being used successfully in a wide variety of chemical, physical, and biological systems. As the field progresses, other simulation techniques such as Brownian dynamics, Monte Carlo simulations, and a host of multiscale and coarse-graining methods are emerging. These techniques have their advantages and disadvantages when compared to MD, and one needs to select the correct tool for the problem at hand.
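The following sketch illustrates, under the same assumptions as the earlier examples, the PCA procedure just described: the covariance matrix of the atomic fluctuations is diagonalized and the trajectory is projected onto the largest-eigenvalue modes, yielding the kind of two-dimensional map shown in Fig. 12.4. Production analyses typically also mass-weight the coordinates; that step is omitted here for brevity.

```python
import numpy as np

def pca_project(traj, n_modes=2):
    """Project an aligned (n_frames, N, 3) trajectory onto its top principal modes."""
    X = traj.reshape(len(traj), -1)      # flatten each frame to a 3N-vector
    X = X - X.mean(axis=0)               # fluctuations about the average structure
    cov = np.cov(X, rowvar=False)        # (3N, 3N) covariance matrix of fluctuations
    evals, evecs = np.linalg.eigh(cov)   # eigenvalues returned in ascending order
    top = evecs[:, np.argsort(evals)[::-1][:n_modes]]
    return X @ top                        # (n_frames, n_modes), as plotted in Fig. 12.4
```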
4. Molecular Docking

As it is difficult to obtain structures for every protein we could possibly want, obtaining the cocrystal structure of two bound proteins is often an even greater challenge. Since solving such bound structures may not always be experimentally possible, there has been significant effort in the simulation community to predict them. The previous sections covered protocols for obtaining structures, whether from X-ray crystallography or through ab initio prediction, and for sampling the dynamics of these structures. This section will show how to take these results and use them to predict bound complexes with drugs, ligands, or other proteins. As discussed before, the confidence in binding models obtained from docking methods is only as good as the experimental information we have a priori. The more information that is available (mutagenesis data, sequence or structure conservation, etc.), the more reliable the docking simulations will be.
4.1. Basic components

While an end-user of the various docking tools available does not have to understand the minutiae of all the algorithms under the hood, it is still important to appreciate some details so as to know which software is suitable for which situations. Every docking program has two essential components: a search algorithm and an energy scoring function (Leach et al., 2006). The details and interdependence of these two components vary greatly among the different pieces of software, and some of these details are discussed below.
The issue of search space deals with sampling the different possible orientations, or poses, in which the macromolecule and ligand can bind. In rigid docking, where the internal coordinates of the macromolecule and ligand are held static, there are six relative degrees of freedom for the two molecules: three translational and three rotational. This can lead to hundreds of thousands or millions of possibilities, depending on the size of the molecules, and the problem is further compounded in the case of flexible docking. Once bonds are allowed to rotate and side chain or backbone conformations are explored, the size of the search space increases exponentially. There are many different search algorithms, ranging from brute-force conformational searches to more effective and efficient stochastic algorithms. In general, better sampling of the different possible poses will lead to a higher probability of finding the correct binding structure. With so many poses generated from the search step, we need a system to rank them according to their likelihood of being the correct binding answer(s), whether by energetics, binding affinity, or some other metric. Scoring functions (Jain, 2006) used to evaluate these structures must not only be accurate in calculating the energy of a pose, but also efficient enough to rank a large number of structures in a timely manner. The binding score or energy resulting from various scoring functions can be based on first principles (like molecular mechanics force fields), empirical data (functions fitted to experimental data), semiempirical approaches (a combination of the two), or knowledge-based terms (statistics and heuristics). There are programs that show good performance, although they may be optimized to the selected benchmarks and may only prove to make good predictions for systems for which they are parameterized (Huang et al., 2006). Since the difficulties in predicting absolute binding energies are great, it is often more desirable and effective to predict the correct relative affinities of a group of compounds. In reality, the entire procedure to generate a single docked complex needs to be repeated tens, hundreds, or thousands of times. To analyze this large ensemble of predictions, many protocols use cluster analysis, looking for structures that are repeatedly predicted as a measure of confidence. Of course, any result needs to be compared to known experimental data as a sanity check. Additional analysis is recommended, as well as further experimental validation in an iterative cycle to improve any docked model.
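To make the search/score decomposition concrete, here is a deliberately oversimplified rigid-docking sketch: a brute-force search over a user-supplied translation grid and rotations about a single axis, scored with a generic 12-6 potential. Every parameter here is an illustrative assumption; real docking programs sample all six rigid-body degrees of freedom (plus flexibility) and use carefully calibrated scoring functions.

```python
import numpy as np
from itertools import product

def lj_score(receptor, ligand, sigma=3.5, eps=0.1):
    """Crude 12-6 score summed over all receptor-ligand atom pairs (lower is better)."""
    d = np.linalg.norm(receptor[:, None, :] - ligand[None, :, :], axis=2)
    d = np.clip(d, 1.0, None)            # cap tiny distances to avoid singularities
    sr6 = (sigma / d) ** 6
    return float(np.sum(4.0 * eps * (sr6 ** 2 - sr6)))

def rot_z(angle):
    """Rotation matrix about the z axis (the only axis sampled in this toy)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rigid_search(receptor, ligand, translations, angles, keep=10):
    """Brute-force rigid search: every translation paired with every rotation."""
    poses = []
    for shift, ang in product(translations, angles):
        moved = ligand @ rot_z(ang).T + np.asarray(shift)
        poses.append((lj_score(receptor, moved), tuple(shift), ang))
    poses.sort(key=lambda p: p[0])       # rank the generated poses by score
    return poses[:keep]
```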
4.2. Choosing the correct tool The first step is to select the appropriate docking software for the system of interest. Some programs are parameterized for specific kinds of protein structures or particular ligands (such as only small molecules), while others are more widely applicable. At the same time, some programs are free for academic use, while others charge a nominal or substantial fee, even for academic use. Unfortunately, it is not possible in this limited space to
provide details on each software package, and the user will need to investigate each package individually. Table 12.1 contains a list of software and Web addresses, but some of the more widely used packages are AutoDock (Goodsell and Olson, 1990), FlexX (Rarey et al., 1995), GLIDE (Friesner et al., 2004; Halgren et al., 2004), GOLD (Jones et al., 1995), HADDOCK (Dominguez et al., 2003), and RosettaDock (Schueler-Furman et al., 2005a,b). Although every software package will claim certain advantages, the best way to assess their quality is through head-to-head comparisons, ideally conducted by an independent third party. For small molecule docking, several comparisons have been published in recent years (Cross et al., 2009; Cummings et al., 2005; Leach et al., 2006; Sousa et al., 2006). In the case of protein–protein docking (Bonvin, 2006; Leach et al., 2006; Ritchie, 2008; Vajda and Kozakov, 2009), the best resource for evaluating the most current docking methods is the results gathered from the biannual critical assessment of predicted interactions (CAPRI) (Lensink et al., 2007; Mendez et al., 2005). Like the CASP competition for structure prediction, CAPRI evaluates blind predictions of protein–protein interactions. In the latest round (the results of which were published in 2007), there was an additional component for assessing scoring functions. Additionally, there are evaluation tests (Schulz-Gasch and Stahl, 2003; Tiwari et al., 2009; Warren et al., 2006) with decoy benchmarks (Huang et al., 2006; Irwin, 2008) and evaluative reviews (Moitessier et al., 2008) that are released whenever a new tool has been developed.
4.3. Preparing the molecules

The ligand of choice impacts how the search step should be carried out. In general, docking protocols can be divided into protein–small molecule docking and protein–protein docking. Each category will be covered in this section, where the smaller molecule (e.g., a drug) is defined as the ligand and the larger protein as the macromolecule.

4.3.1. Macromolecule
Regardless of the ligand, the macromolecule is usually dealt with in the same way. The sheer size of a protein and its potential degrees of freedom usually mean that exhaustive sampling is not possible. Many programs have some limited sampling capability, especially for the side chains through the use of rotamer libraries, but they tend not to sample backbone conformations adequately. Whichever program the reader uses, it can be very beneficial to supply an ensemble of structures sampled from an MD or other type of simulation. Alternatively, if NMR structural data are available, the ensemble of models can be used in independent docking runs.
In essence, a series of snapshots gives the docking search algorithm different starting structures from which to sample, and this may aid in a more holistic representation of the conformational space. Once a series of macromolecule structures is chosen, they need to be prepared for docking. The details are specific to the program being used; however, there are several general considerations to keep in mind. Just as with MD simulations, these include the protonation state of the protein and of particular residues, the proper treatment of any nonstandard amino acids and posttranslational modifications, and the inclusion of any required ligands, nucleotides, ions, and so on. If the sites of these modifications are known or thought to be close to the binding site, they may be critical for success; if they are more distal from the binding site, they may be safely ignored. Less a formatting issue and more a technical one, the charge state of the macromolecule is very important and needs to be considered thoroughly, as it is often the driving force of many intermolecular interactions. Experimental information, such as pH or salt dependency, can help in deciding what the charge state of particular groups should be. Some programs require an active hand in making this determination. There may be other steps for preparing the macromolecule, such as defining flexible regions and rotatable bonds, which are important to consider.

4.3.2. Small molecule ligands
Drugs and small peptides tend to have more limited degrees of freedom and can therefore be treated more systematically. In short, the fewer the rotatable bonds, the easier it is to sample completely. Depending on the software, docking packages that allow the user to define fixed or rotatable bonds are usually sufficient, although caution is still advised to make sure that enough conformations are used in docking and that they are not sterically hindered. Just as for the macromolecule, the electrostatics of the ligand are very important in driving interactions, and if not correctly represented, the results could be drastically affected.

4.3.3. Protein ligands
Peptides with secondary structure and proteins being docked as ligands receive a slightly different treatment than drugs or small molecules. While certain parts may be locked into helices or sheets and thus have somewhat restricted motions, there may be unstructured loops or more dynamic regions. Just as we saw before, it is not possible to fully capture these degrees of freedom, and such protein ligands are treated in the same fashion as the macromolecule. Again, if there is an NMR structure, those models can be used, or the ligand can be subjected to simulation studies. Ultimately, we would ideally generate an ensemble of structures for both the macromolecule and the ligand and perform docking for every combination of the two. As will be
covered in the virtual screening section, some have found that combining methods to converge on a docked model may be necessary for protein–protein docking (Vajda and Kozakov, 2009).
4.4. Iterative docking and analysis

With the ligand and macromolecule prepared, we can begin the process of generating docked models (see Fig. 12.5). To find the best model, an iterative method is typically the most successful approach. A general, first-pass docking may help to find a region on the macromolecule where the ligand is most likely to interact. This is a blind docking run, meaning that the macromolecule and ligand are allowed to pair randomly with no bias toward any region. To save time in this step, it is usually best to allow limited or no flexibility. The second step is to cluster the results from the blind docking, grouping the ligand poses based on location and examining their energetic, or ranking, scores. Ideally, this will identify one particular area on the macromolecule, but if there are several locations, a highly ranked representative structure of each cluster (or binding site) can be used for finer docking runs. As always, the use of experimental data here is crucial in determining probable sites as well as in corroborating the selection of the best model.
Figure 12.5 The standard docking protocol. Structures for the macromolecule and the ligand (obtained via NMR, MD, MC, etc.) enter a general docking step with randomized starting orientations; clustering and analysis filter the results, guided by experimental data (e.g., biochemical or mutagenesis data); refinement docking with imposed constraints then yields the docked model.

Using these filtered results, we can perform refinement docking where the
ligand is restricted to a specified region based on the blind docking results. How this restriction is imposed depends on the software being used, but most programs possess this capability. There are often finer-grained tuning options for more exact exploration, including side-chain sampling or repacking, that allow for flexible docking (Bonvin, 2006). The user should make use of these capabilities as appropriate.
4.5. Post analysis

If an iterative docking methodology is used, analysis needs to take place intermittently to maximize docking success. Clustering coupled with score ranking is the most basic analysis for finding potentially good poses. Particularly in the case of protein–protein docking, the use of experimental data, such as mutagenesis data, may be required. Such data can be used to filter out false positives and improve the overall results. Also, although the search and scoring steps of most docking protocols are highly intertwined, one can easily rescore a set of poses using one or more different scoring functions. Some programs offer multiple, easily interchangeable scoring functions. Although it usually requires extra effort, rescoring helps enrich the results and converge on a consensus result with which several methods agree.
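A minimal sketch of the clustering-plus-ranking analysis described above: poses are visited from best score to worst, and each either joins the first cluster whose representative it matches within an RMSD cutoff or seeds a new cluster. The 2 Å cutoff is an illustrative choice, not a universal setting.

```python
import numpy as np

def cluster_poses(poses, scores, cutoff=2.0):
    """Greedy clustering of docked poses, best score first.

    poses: (n_poses, N, 3) ligand coordinates; scores: lower is better.
    Returns clusters as lists of pose indices; the first index in each
    cluster is its best-scoring representative.
    """
    clusters = []
    for i in np.argsort(scores):
        for members in clusters:
            rep = poses[members[0]]
            rms = np.sqrt(np.mean(np.sum((poses[i] - rep) ** 2, axis=1)))
            if rms < cutoff:
                members.append(int(i))
                break
        else:                              # no cluster matched: start a new one
            clusters.append([int(i)])
    return clusters
```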
4.6. Virtual screening

So far, the methods described here assume that the ligand to be docked has already been identified. In the case of drug discovery, the small molecule of interest that binds a given target may be exactly what we are trying to determine. In such cases, a whole gamut of small molecule databases (such as the Available Chemicals Directory, ChemACX, the Maybridge Database, ZINC, and the NCI Diversity Set (Voigt et al., 2001)) can be docked against the target protein for screening purposes. This kind of virtual screening can aid in narrowing down potential inhibitor candidates before vast resources are devoted to testing them at the bench. High-throughput virtual screening is understandably popular with pharmaceutical companies, since it not only saves money on actual testing but also helps guide lead discovery. With 1000–100,000 compounds in each database, high-throughput methods are required to accomplish such screening in a timely manner. Usually, this simply means using multiple computer processors with a reasonably fast docking tool and performing docking against a database of molecules. It is often advantageous to use a combination of several docking packages, for both the search and scoring algorithms, in order to find a consensus subset of the best-docking small molecules (Vajda and Kozakov, 2009). While the details of this protocol are beyond the scope of this chapter, we list
some literature to further elucidate virtual screening (Cross et al., 2009; Irwin, 2008; Jain, 2004; Kitchen et al., 2004; Kontoyianni et al., 2008; Shoichet, 2004; Zoete et al., 2009).
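As a sketch of the consensus idea, the toy function below ranks a compound library by each compound's average rank across several docking tools; the dock_fns callables are hypothetical stand-ins for whatever docking backends and scoring functions are actually employed.

```python
def screen_library(library, dock_fns, top_n=100):
    """Rank compounds by average rank across several (hypothetical) docking tools.

    library:  compound identifiers (e.g., SMILES strings)
    dock_fns: name -> callable returning a score (lower is better) for a compound
    """
    ranks = {compound: [] for compound in library}
    for dock in dock_fns.values():
        ordered = sorted(library, key=dock)           # best-to-worst for this tool
        for rank, compound in enumerate(ordered):
            ranks[compound].append(rank)
    consensus = sorted(library, key=lambda c: sum(ranks[c]) / len(ranks[c]))
    return consensus[:top_n]
```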
4.7. Conclusions

There is one caveat that users of docking software must continually keep in mind: analyze the docked models with a skeptical eye. It is very easy to accept poses that fit a mechanistic model we want to prove and thus be biased toward what we wish to see. On the other hand, the models we create are just that: models. It is still fair to include some heuristic filtering, provided it is supported by good reasoning. In the current state of available methods, there is a general acknowledgment that the accurate representation of electrostatics still needs significant improvement. A typical molecular representation boils complex electrostatic surfaces down to simple point charges on single atoms. While this speeds up calculations in a first-pass docking, more exact electrostatics, namely higher order moments and the effects of polarization, may be needed to improve the capabilities of all programs (Illingworth et al., 2008).
REFERENCES

Alder, B. J., and Wainwright, T. E. (1959). Studies in molecular dynamics. I. General method. J. Chem. Phys. 31(2), 459–466.
Alder, B. J., and Wainwright, T. E. (1960). Studies in molecular dynamics. II. Behavior of a small number of elastic spheres. J. Chem. Phys. 33(5), 1439–1451.
Alexandrov, N. N., Nussinov, R., and Zimmer, R. M. (1996). Fast protein fold recognition via sequence to structure alignment and contact capacity potentials. In ‘‘Pacific Symposium on Biocomputing ‘96,’’ (L. Hunter and T. Klein, eds.). World Scientific Publishing Co., Singapore.
Altschul, S. F., et al. (1990). Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410.
Altschul, S. F., et al. (1997). Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389–3402.
Attwood, T. K., et al. (2003). PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Res. 31(1), 400–402.
Bairoch, A., et al. (2004). Swiss-Prot: Juggling between evolution and stability. Brief Bioinform. 5(1), 39–55.
Bates, P. A. (2001). Enhancement of protein modeling by human intervention in applying the automatic programs 3D-JIGSAW and 3D-PSSM. Proteins (Suppl. 5), 39–46.
Benson, D. A., et al. (2009). GenBank. Nucleic Acids Res. 37(Database issue), D26–D31.
Berendsen, H. J. C., Grigera, J. R., and Straatsma, T. P. (1987). The missing term in effective pair potentials. J. Phys. Chem. 91(24), 6269–6271. doi:10.1021/j100308a038.
Berman, H. M., et al. (2000). The Protein Data Bank. Nucleic Acids Res. 28(1), 235–242.
Bonvin, A. M. (2006). Flexible protein–protein docking. Curr. Opin. Struct. Biol. 16(2), 194–200.
Brooks, B. R., et al. (2009). CHARMM: The biomolecular simulation program. J. Comput. Chem. 30(10), 1545–1614.
Bryson, K., et al. (2005). Protein structure prediction servers at University College London. Nucleic Acids Res. 33, W36–W38. Web server issue.
Caldwell, J. W., and Kollman, P. A. (1995). Structure and properties of neat liquids using nonadditive molecular dynamics: Water, methanol, and N-methylacetamide. J. Phys. Chem. 99(16), 6208–6219. doi:10.1021/j100016a067.
Cheatham, T. E., and Young, M. A. (2000). Molecular dynamics simulation of nucleic acids: Successes, limitations, and promise. Biopolymers 56(4), 232–256.
Cheng, J., et al. (2005). SCRATCH: A protein structure and structural feature prediction server. Nucleic Acids Res. 33, W72–W76. Web server issue.
Cole, C., Barber, J. D., and Barton, G. J. (2008). The Jpred 3 secondary structure prediction server. Nucleic Acids Res. 36, W197–W201. Web server issue.
Cross, J. B., et al. (2009). Comparison of several molecular docking programs: Pose prediction and virtual screening accuracy. J. Chem. Inf. Model. 49(6), 1455–1474.
Cummings, M. D., et al. (2005). Comparison of automated docking programs as virtual screening tools. J. Med. Chem. 48(4), 962–976.
Davidson, A. R. (2008). A folding space odyssey. Proc. Natl. Acad. Sci. USA 105(8), 2759–2760.
de Vries, A. H., et al. (2005). Molecular dynamics simulations of phospholipid bilayers: Influence of artificial periodicity, system size, and simulation time. J. Phys. Chem. B 109(23), 11643–11652. doi:10.1021/jp0507952.
Diraviyam, K., et al. (2003). Computer modeling of the membrane interaction of FYVE domains. J. Mol. Biol. 328(3), 721–736.
Dominguez, C., Boelens, R., and Bonvin, A. M. (2003). HADDOCK: A protein–protein docking approach based on biochemical or biophysical information. J. Am. Chem. Soc. 125(7), 1731–1737.
Duan, Y., et al. (2003). A point-charge force field for molecular mechanics simulations of proteins based on condensed-phase quantum mechanical calculations. J. Comput. Chem. 24(16), 1999–2012.
Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics 14(9), 755–763.
Finn, R. D., et al. (2008). The Pfam protein families database. Nucleic Acids Res. 36(Database issue), D281–D288.
Friesner, R. A. (2004). Glide: A new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J. Med. Chem. 47(7), 1739–1749.
Gasteiger, E., et al. (2003). ExPASy: The proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res. 31(13), 3784–3788.
Goodsell, D. S., and Olson, A. J. (1990). Automated docking of substrates to proteins by simulated annealing. Proteins 8(3), 195–202.
Halgren, T. A. (2004). Glide: A new approach for rapid, accurate docking and scoring. 2. Enrichment factors in database screening. J. Med. Chem. 47(7), 1750–1759.
Hardin, C., Pogorelov, T. V., and Luthey-Schulten, Z. (2002). Ab initio protein structure prediction. Curr. Opin. Struct. Biol. 12(2), 176–181.
Heller, H., Schaefer, M., and Schulten, K. (1993). Molecular dynamics simulation of a bilayer of 200 lipids in the gel and in the liquid crystal phase. J. Phys. Chem. 97(31), 8343–8360. doi:10.1021/j100133a034.
Hess, B., et al. (2008). GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable molecular simulation. J. Chem. Theory Comput. 4(3), 435–447. doi:10.1021/ct700301q.
Huang, N., Shoichet, B. K., and Irwin, J. J. (2006). Benchmarking sets for molecular docking. J. Med. Chem. 49(23), 6789–6801.
Hulo, N., et al. (2008). The 20 years of PROSITE. Nucleic Acids Res. 36(Database issue), D245–D249.
Hunter, S. (2009). InterPro: The integrative protein signature database. Nucleic Acids Res. 37(Database issue), D211–D215.
Illingworth, C. J., et al. (2008). Assessing the role of polarization in docking. J. Phys. Chem. A 112(47), 12157–12163.
Irwin, J. J. (2008). Community benchmarks for virtual screening. J. Comput. Aided Mol. Des. 22(3–4), 193–199.
Jacobson, M. P., et al. (2002). On the role of the crystal environment in determining protein side-chain conformations. J. Mol. Biol. 320(3), 597–608.
Jain, A. N. (2004). Virtual screening in lead discovery and optimization. Curr. Opin. Drug Discov. Devel. 7(4), 396–403.
Jain, A. N. (2006). Scoring functions for protein–ligand docking. Curr. Protein Pept. Sci. 7(5), 407–420.
Jones, G., Willett, P., and Glen, R. C. (1995). A genetic algorithm for flexible molecular overlay and pharmacophore elucidation. J. Comput. Aided. Mol. Des. 9(6), 532–549.
Jorgensen, W. L., and Madura, J. D. (1985). Temperature and size dependence for Monte Carlo simulations of TIP4P water. Mol. Phys. Int. J. Interface Chem. Phys. 56(6), 1381–1392.
Jorgensen, W. L., et al. (1983). Comparison of simple potential functions for simulating liquid water. J. Chem. Phys. 79(2), 926–935.
Kelley, L. A., MacCallum, R. M., and Sternberg, M. J. (2000). Enhanced genome annotation using structural profiles in the program 3D-PSSM. J. Mol. Biol. 299(2), 499–520.
Kitchen, D. B., et al. (2004). Docking and scoring in virtual screening for drug discovery: Methods and applications. Nat. Rev. Drug Discov. 3(11), 935–949.
Kontoyianni, M., et al. (2008). Theoretical and practical considerations in virtual screening: A beaten field? Curr. Med. Chem. 15(2), 107–116.
Kretsinger, R. H., Ison, R. E., and Hovmoller, S. (2004). Prediction of protein structure. Methods Enzymol. 383, 1–27.
Krogh, A., et al. (2001). Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J. Mol. Biol. 305(3), 567–580.
Larkin, M. A., et al. (2007). Clustal W and Clustal X version 2.0. Bioinformatics 23(21), 2947–2948.
Laskowski, R. A., MacArthur, M. W., Moss, D. S., and Thornton, J. M. (1993). PROCHECK: A program to check the stereochemical quality of protein structures. J. Appl. Cryst. 26, 283–291.
Leach, A. (2001). Molecular Modelling: Principles and Applications. 2nd edn. Prentice Hall, Harlow, England.
Leach, A. R., Shoichet, B. K., and Peishoff, C. E. (2006). Prediction of protein–ligand interactions docking and scoring: Successes and gaps. J. Med. Chem. 49(20), 5851–5855.
Lensink, M. F., Mendez, R., and Wodak, S. J. (2007). Docking and scoring protein complexes: CAPRI 3rd edn. Proteins 69(4), 704–718.
Letunic, I., Doerks, T., and Bork, P. (2009). SMART 6: Recent updates and new developments. Nucleic Acids Res. 37(Database issue), D229–D232.
Lobley, A., Sadowski, M. I., and Jones, D. T. (2009). pGenTHREADER and pDomTHREADER: New methods for improved protein fold recognition and superfamily discrimination. Bioinformatics 25(14), 1761–1767.
Luthy, R., Bowie, J. U., and Eisenberg, D. (1992). Assessment of protein models with three-dimensional profiles. Nature 356(6364), 83–85.
Madhusudhan, M. S., Narayanan Eswar, M. A. M.-R., Bino, J., Ursula, P., Rachel, K., Min-Yi, S., and Andrej, S. (2005). Comparative protein structure modeling. In ‘‘The Proteomics Protocols Handbook,’’ (J. M. Walker, ed.), pp. 831–860. Humana Press, Totowa, New Jersey.
Mahoney, M. W., and Jorgensen, W. L. (2000). A five-site model for liquid water and the reproduction of the density anomaly by rigid, nonpolarizable potential functions. J. Chem. Phys. 112(20), 8910.
Marti-Renom, M. A., et al. (2000). Comparative protein structure modeling of genes and genomes. Annu. Rev. Biophys. Biomol. Struct. 29, 291–325.
McCammon, J. A., et al. (1976). The hinge-bending mode in lysozyme. Nature 262(5566), 325–326.
McCammon, J. A., Gelin, B. R., and Karplus, M. (1977). Dynamics of folded proteins. Nature 267(5612), 585–590.
McGinnis, S., and Madden, T. L. (2004). BLAST: At the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res. 32, W20–W25. Web server issue.
Mendez, R., et al. (2005). Assessment of CAPRI predictions in rounds 3–5 shows progress in docking procedures. Proteins 60(2), 150–169.
Moitessier, N., et al. (2008). Towards the development of universal, fast and highly accurate docking/scoring methods: A long way to go. Br. J. Pharmacol. 153(Suppl. 1), S7–S26.
Mount, W. D. (2001). Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, New York.
Notredame, C., Higgins, D. G., and Heringa, J. (2000). T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302(1), 205–217.
Pearson, W. R. (1990). Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 183, 63–98.
Pei, J. (2008). Multiple protein sequence alignment. Curr. Opin. Struct. Biol. 18(3), 382–386.
Pettersen, E. F., et al. (2004). UCSF Chimera—A visualization system for exploratory research and analysis. J. Comput. Chem. 25(13), 1605–1612.
Ponder, J. W., and Case, D. A. (2003). Force fields for protein simulations. Adv. Protein Chem. 66, 27–85.
Punta, M., et al. (2007). Membrane protein prediction methods. Methods 41(4), 460–474.
Rahman, A. (1964). Correlations in the motion of atoms in liquid argon. Phys. Rev. 136(2A), A405.
Rahman, A., and Stillinger, F. H. (1974). Propagation of sound in water. A molecular-dynamics study. Phys. Rev. A 10(1), 368.
Rarey, M., Kramer, B., and Lengauer, T. (1995). Time-efficient docking of flexible ligands into active sites of proteins. Proc. Int. Conf. Intell. Syst. Mol. Biol. 3, 300–308.
Ren, P., and Ponder, J. W. (2003). Polarizable atomic multipole water model for molecular mechanics simulation. J. Phys. Chem. B 107(24), 5933–5947. doi:10.1021/jp027815+.
Ritchie, D. W. (2008). Recent progress and future directions in protein–protein docking. Curr. Protein Pept. Sci. 9(1), 1–15.
Roessler, C. G., et al. (2008). Transitive homology-guided structural studies lead to discovery of Cro proteins with 40% sequence identity but different folds. Proc. Natl. Acad. Sci. USA 105(7), 2343–2348.
Rohl, C. A., et al. (2004). Protein structure prediction using Rosetta. Methods Enzymol. 383, 66–93.
Rost, B., and Sander, C. (1993). Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 232(2), 584–599.
Rost, B., Yachdav, G., and Liu, J. (2004). The predict protein server. Nucleic Acids Res. 32, W321–W326. Web server issue.
Sali, A., et al. (1993). Three-dimensional models of four mouse mast cell chymases. Identification of proteoglycan binding regions and protease-specific antigenic epitopes. J. Biol. Chem. 268(12), 9023–9034.
Schueler-Furman, O., Wang, C., and Baker, D. (2005a). Progress in protein–protein docking: Atomic resolution predictions in the CAPRI experiment using RosettaDock with an improved treatment of side-chain flexibility. Proteins 60(2), 187–194.
Schueler-Furman, O., et al. (2005b). Progress in modeling of protein structures and interactions. Science 310(5748), 638–642.
Schulz-Gasch, T., and Stahl, M. (2003). Binding site characteristics in structure-based virtual screening: Evaluation of current docking tools. J. Mol. Model. 9(1), 47–57.
Schwede, T., et al. (2003). SWISS-MODEL: An automated protein homology-modeling server. Nucleic Acids Res. 31(13), 3381–3385.
Shi, J., Blundell, T. L., and Mizuguchi, K. (2001). FUGUE: Sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol. 310(1), 243–257.
Shoichet, B. K. (2004). Virtual screening of chemical libraries. Nature 432(7019), 862–865.
Shortle, D., Simons, K. T., and Baker, D. (1998). Clustering of low-energy conformations near the native structures of small proteins. Proc. Natl. Acad. Sci. USA 95(19), 11158–11162.
Simons, K. T., et al. (1997). Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J. Mol. Biol. 268(1), 209–225.
Sousa, S. F., Fernandes, P. A., and Ramos, M. J. (2006). Protein–ligand docking: Current status and future challenges. Proteins 65(1), 15–26.
Subramaniam, S. (1998). The biology workbench—A seamless database and analysis environment for the biologist. Proteins 32(1), 1–2.
Tieleman, D. P., and Berendsen, H. J. C. (1996). Molecular dynamics simulations of a fully hydrated dipalmitoylphosphatidylcholine bilayer with different macroscopic boundary conditions and parameters. J. Chem. Phys. 105(11), 4871.
Tieleman, D. P., Marrink, S. J., and Berendsen, H. J. C. (1997). A computer perspective of membranes: Molecular dynamics studies of lipid bilayer systems. Biochim. Biophys. Acta 1331(3), 235.
Tiwari, R., et al. (2009). Carborane clusters in computational drug design: A comparative docking evaluation using AutoDock, FlexX, Glide, and Surflex. J. Chem. Inf. Model. 49(6), 1581–1589.
Vajda, S., and Kozakov, D. (2009). Convergence and combination of methods in protein–protein docking. Curr. Opin. Struct. Biol. 19(2), 164–170.
Voigt, J. H., et al. (2001). Comparison of the NCI open database with seven large chemical structural databases. J. Chem. Inf. Comput. Sci. 41(3), 702–712.
Vriend, G. (1990). WHAT IF: A molecular modeling and drug design program. J. Mol. Graph. 8(1), 52–56 (see also p. 29).
Warren, G. L., et al. (2006). A critical assessment of docking programs and scoring functions. J. Med. Chem. 49(20), 5912–5931.
Waterhouse, A. M., et al. (2009). Jalview Version 2—A multiple sequence alignment editor and analysis workbench. Bioinformatics 25(9), 1189–1191.
Wiederstein, M., and Sippl, M. J. (2007). ProSA-web: Interactive web service for the recognition of errors in three-dimensional structures of proteins. Nucleic Acids Res. 35, W407–W410. Web server issue.
Zdobnov, E. M., and Apweiler, R. (2001). InterProScan—An integration platform for the signature-recognition methods in InterPro. Bioinformatics 17(9), 847–848.
Zhang, Y. (2008). I-TASSER server for protein 3D structure prediction. BMC Bioinform. 9, 40.
Zhang, Z., et al. (1998). Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res. 26(17), 3986–3990.
Zoete, V., Grosdidier, A., and Michielin, O. (2009). Docking, virtual high throughput screening and in silico fragment-based drug design. J. Cell Mol. Med. 13(2), 238–248.
CHAPTER THIRTEEN
Deterministic and Stochastic Models of Genetic Regulatory Networks

Ilya Shmulevich and John D. Aitchison

Contents
1. Introduction 336
2. Boolean Networks 337
   2.1. Attractors as cell types and cellular functional states 341
3. Differential Equation Models 343
   3.1. Accurate description of cellular growth and division and prediction of mutant phenotypes 346
4. Probabilistic Boolean Networks 347
   4.1. Steady-state analysis and stability under stochastic fluctuations 350
5. Stochastic Differential Equation Models 351
   5.1. The influence of noise on system behavior 352
References 353
Abstract
Traditionally, molecular biology research has tended to reduce biological pathways to composite units studied as isolated parts of the cellular system. With the advent of high-throughput methodologies that can capture thousands of data points, and powerful computational approaches, the reality of studying cellular processes at a systems level is upon us. As these approaches yield massive datasets, systems-level analyses have drawn upon other fields such as engineering and mathematics, adapting computational and statistical approaches to decipher relationships between molecules. Guided by high-quality datasets and analyses, one can begin the process of predictive modeling. The findings from such approaches are often surprising and beyond normal intuition. We discuss four classes of dynamical systems used to model genetic regulatory networks. The discussion is divided into continuous and discrete models, as well as deterministic and stochastic model classes. For each combination of these categories, a model is presented and discussed in the context of the yeast cell cycle, illustrating how different types of questions can be addressed by different model classes.

Institute for Systems Biology, Seattle, Washington, USA

Methods in Enzymology, Volume 467, ISSN 0076-6879, DOI: 10.1016/S0076-6879(09)67013-0
© 2009 Elsevier Inc. All rights reserved.
1. Introduction

Modern molecular biology technologies and the proliferation of Web-based resources containing information on various aspects of biomolecular networks in living cells have made it possible to mathematically model dynamical systems of molecular interactions that control various cellular functions and processes. Such models can then be used to predict the behavior of the system in response to different perturbations or stimuli and ultimately for developing rational control strategies intended to drive the cellular system toward a desired state or away from an undesired state that may be associated with disease. To this end, various dynamical models have been studied, most commonly in the context of genetic regulatory networks, for a variety of biological systems. Although there are a number of natural ways to categorize and classify dynamical models of genetic networks, this chapter presents a model class with an accompanying example in each combination of deterministic versus stochastic and continuous versus discrete model categories. The example used in each of the model classes is that of the yeast cell cycle, as this system has been extensively studied from a variety of different perspectives and with different model classes. It is not the intention of this chapter to go into an in-depth investigation of the cell cycle, but rather to use it as a running example to illustrate the kinds of questions that can be addressed by the different model classes considered. A deterministic model of a genetic regulatory network may involve a number of different mechanisms that capture the collective behavior of the elements constituting the network. The models can differ in numerous ways, such as in the nature of the physical elements that are represented in the model (i.e., genes, proteins, and other factors); the resolution or scale at which the behavior of the network elements is captured (e.g., are genes discretized, such as being either on or off, or do they take on continuous values?); and how the network elements interact (e.g., interactions can either be present or absent or they may have a quantitative nature). The common aspect of deterministic models is the inherent lack of randomness or stochasticity in the model. This chapter presents Boolean networks and systems of differential equations as examples of discrete and continuous deterministic models of genetic networks, respectively. Stochastic models of genetic regulatory networks differ from their deterministic counterparts by incorporating randomness or uncertainty. Most deterministic models can be generalized such that one associates probabilities with particular components or aspects of the model. Thus, stochastic models can also be categorized into discrete and continuous categories. The stochastic or probabilistic components in such models can
either be associated with the model structure, so that the interactions or rules of interaction are described by probability distributions, or arise through the incorporation of noise terms that capture intrinsic biological stochasticity or measurement uncertainty. Probabilistic Boolean networks (PBNs) and stochastic differential equations are presented as examples of discrete and continuous stochastic models of genetic networks, respectively.
2. Boolean Networks

Boolean networks are a class of discrete dynamical systems that can be characterized by interactions over a set of Boolean variables. Random Boolean networks (RBNs), which are ensembles of random network structures, were first introduced by Kauffman (1969a,b) as a simple model class for studying dynamical properties of gene regulatory networks at a time when the structure of such networks was largely unknown. The idea behind such an approach is to define an ensemble of Boolean networks such that it fulfills certain known features of biological networks and then study random instances of these networks to learn more about the general properties of such networks (Kauffman, 1974, 1993, 2004). Boolean network modeling of genetic networks was further developed by Thomas (1973) and others. The ensemble approach has been extraordinarily successful in shedding light on fundamental principles of complex living systems at all scales of organization, including adaptability and evolvability, robustness, coordination of complex behaviors, storage of information, and the relationships between the structure of such complex systems and their dynamical behavior. The reader is referred to several excellent review articles that cover the ensemble properties of Boolean networks (Aldana et al., 2002; Drossel, 2007). However, our focus here is on Boolean network models that can be used to capture the behavior of a specific gene regulatory network. Consider a directed graph where the vertices represent genes and the directed edges represent the actions of genes, or rather their products, on other genes. For example, directed edges from genes A and B into gene C indicate that A and B jointly act on C. The specific mechanism of action is not represented in the graph structure itself, so an additional representation is necessary. One of the simplest representational frameworks assumes that genes are binary-valued entities, meaning that they can be in one of two possible states of activity (e.g., ON or OFF) at any given point in time, and that they act on each other by means of rules represented by Boolean functions. For example, gene C may be determined by the output of a Boolean function whose inputs are A and B. The underlying directed graph merely represents the input–output relationships. We now present this idea more formally.
A Boolean network is defined by a set of nodes (genes) {x1, . . ., xn} and a list of Boolean functions {f1, f2, . . ., fn}. Each gene xi ∈ {0, 1} (i = 1, . . ., n) is a binary variable whose value at time t + 1 is completely determined by the values of genes xj1, xj2, . . ., xjki at time t by means of a Boolean function fi : {0, 1}^ki → {0, 1}. That is, there are ki regulatory genes assigned to gene xi that determine the ‘‘wiring’’ of that gene. Thus, one can write

\[
x_i(t+1) = f_i\left( x_{j_1}(t), x_{j_2}(t), \ldots, x_{j_{k_i}}(t) \right) \qquad (13.1)
\]
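A synchronous update of Eq. (13.1) is straightforward to implement; the following minimal Python sketch stores each truth table as a dictionary keyed by the tuple of input values, using the 1-based gene indices of the text.

```python
def step(state, functions, inputs):
    """One synchronous update of a Boolean network, Eq. (13.1).

    state        tuple of 0/1 values (x1, ..., xn)
    functions[i] truth table of f_{i+1}: maps a tuple of input values to 0/1
    inputs[i]    tuple of regulator indices for gene i+1 (1-based, as in the text)
    """
    return tuple(
        functions[i][tuple(state[j - 1] for j in inputs[i])]
        for i in range(len(state))
    )

# Example: the autoregulated gene x4 of Table 13.1 would be entered as
#   functions[3] = {(0,): 0, (1,): 1}   and   inputs[3] = (4,)
```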
In an RBN, the functions fi are selected randomly, as are the genes that are used as their inputs. This is the basis of the ensemble approach mentioned above. Each xi represents the state (expression) of gene i, where xi = 1 represents the fact that gene i is expressed and xi = 0 means it is not expressed. Such a seemingly crude simplification of gene expression has ample justification in the experimental literature (Bornholdt, 2008). Indeed, consider the fact that many organisms exhibit an amazing determinism of gene activity under specific experimental contexts or conditions, such as Escherichia coli under temperature change (Richmond et al., 1999). The determinism is apparent despite the prevalent molecular stochasticity and experimental noise inherent to measurement technologies such as microarrays. Furthermore, accurate mathematical models of gene regulation that capture kinetic-level details of molecular reactions frequently operate with expressed molecular concentrations spanning several orders of magnitude, either in a saturation regime or in a regime of insignificantly small concentrations, with rapid switch-like transitions between such regimes (Davidich and Bornholdt, 2008a). Further, even higher organisms, which are necessarily more complex in terms of genetic regulation and heterogeneity, exhibit remarkable consistency when gene expression is quantized into two levels; for example, different subtypes of human tumors can be reliably discriminated in the binary domain (Shmulevich and Zhang, 2002). In a Boolean network, a given gene transforms its inputs (regulatory factors that bind to it) into an output, which is the state or expression of the gene itself at the next time-point. All genes are assumed to update synchronously in accordance with the functions assigned to them, and this process is then repeated. It is clear that the dynamics of a synchronous Boolean network are completely determined by Eq. (13.1). The artificial synchrony simplifies computation while preserving the qualitative, generic properties of global network dynamics. Synchronous updating has been applied in most analytical studies so far, as it is the only one that yields deterministic state transitions. Although the introduction of asynchronous updating, which typically involves a random update schedule, renders the system stochastic, asynchronous updating is not per se biologically more realistic and has to be motivated carefully in every case not to fall victim to artifacts (Chaves et al., 2005). Additionally, recent research indicates that some
molecular control networks are so robustly designed that timing is not a critical factor (Braunewell and Bornholdt, 2006), that time ordering in the emergence of cell-fate patterns is not an artifact of synchronous updating in the Boolean model (Alvarez-Buylla et al., 2008), and that simplified synchronous models are able to reliably reproduce the sequence of states in biological systems. Nonetheless, PBNs, presented in Section 4, are able to model asynchronous updating as well as other stochastic generalizations of Boolean networks. Let us start with a simple example to illustrate the dynamics of Boolean networks and present the key idea of attractors. Consider a Boolean network consisting of five genes {x1, . . ., x5} with the corresponding Boolean functions given by the truth tables shown in Table 13.1. Note that x4(t + 1) = f4(x4(t)) is a function of only one variable and is an example of autoregulation. The maximum connectivity (i.e., maximal number of regulators) K = max_i k_i is equal to 3 in this case. The dynamics of this Boolean network are shown in Fig. 13.1. Since there are five genes, there are 2^5 = 32 possible states that the network can be in. Each state is represented by a circle, and the arrows between states show the transitions of the network according to the functions in Table 13.1. It is easy to see that because of the inherent deterministic directionality in Boolean networks, as well as the finite number of possible states, certain states will be revisited infinitely often if, depending on the initial starting state, the network happens to transition into them. Such states are called attractors, and the states that lead into them, including the attractors themselves, comprise their basins of attraction.

Table 13.1 Truth tables of the functions in a Boolean network with five genes
        f1   f2   f3   f4   f5
        0    0    0    0    0
        1    1    1    1    0
        1    1    1    –    0
        1    0    0    –    0
        0    0    1    –    0
        1    1    1    –    0
        1    1    0    –    0
        1    1    1    –    1
j1      5    3    3    4    5
j2      2    5    1    –    4
j3      4    4    5    –    1

The indices j1, j2, and j3 indicate the input connections for each of the functions.
Figure 13.1 The state-transition diagram for the Boolean network defined in Table 13.1 (Shmulevich et al., 2002c).
For example, in Fig. 13.1, the state (00000) is an attractor and, together with the seven other (transient) states that eventually lead into it, comprises its basin of attraction. The attractors represent the fixed points of the dynamical system, thus capturing the system’s long-term behavior. The attractors are always cyclical and may consist of more than one state. Starting from any state on an attractor, the number of transitions necessary for the system to return to it is called the cycle length. For example, the attractor (00000) has cycle length 1, while the states (11010) and (11110) comprise an attractor of length 2. Real genetic regulatory networks are highly stable in the presence of perturbations, since the cell must be able to maintain homeostasis in metabolism or its developmental program in the face of such external perturbations and a variety of stimuli. Within the Boolean network formalism, this means that when a minimal number of genes transiently change value (say, by means of some external stimulus), the system typically transitions into states that reside in the same basin of attraction, and the network eventually ‘‘flows’’ back to the same attractor. Generally speaking, large basins of attraction correspond to higher stability. Such stability of networks in living organisms allows the cells to maintain their functional state within their environment. Although in developmental biology, epigenetic, heritable changes in cell determination have been well established, it is now becoming evident that the same type of mechanisms may also play a role in carcinogenesis and that gene expression patterns can be inherited without the need for mutational changes in DNA (MacLeod, 1996). In the Boolean network framework, this can be explained by so-called hysteresis; that is, a change in the system’s state caused by a stimulus that does not change back when the stimulus is withdrawn (Huang, 1999). Thus, if the change of some particular gene does in fact cause a transition to a different attractor, the network will often remain in the new attractor even if that gene is switched off. Thus, the
structure of the state space of a Boolean network, in which every state in a basin of attraction is associated with the corresponding attractor to which the system will ultimately flow, represents a type of associative memory.
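Because the state space is finite and the dynamics deterministic, the attractors and their basin sizes can be found by exhaustive enumeration, as the following sketch does for any network expressed as a step function like the one above. Applied to the network of Table 13.1, it should recover, for example, the fixed point (00000) with its basin of eight states.

```python
from itertools import product

def find_attractors(step_fn, n):
    """Map each attractor of an n-gene Boolean network to its basin size."""
    basins = {}
    for start in product((0, 1), repeat=n):
        seen, state = [], start
        while state not in seen:                          # follow the trajectory
            seen.append(state)
            state = step_fn(state)
        attractor = frozenset(seen[seen.index(state):])   # the repeating cycle
        basins[attractor] = basins.get(attractor, 0) + 1  # each start adds one state
    return basins
```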
2.1. Attractors as cell types and cellular functional states

Real gene regulatory networks exhibit spontaneous emergence of ordered collective behavior of gene activity, captured by the attractors. Indeed, recent findings provide experimental evidence for the existence of attractors in real regulatory networks (Chang et al., 2008; Huang and Ingber, 2000; Huang et al., 2005). At the same time, many studies have shown (e.g., Wolf and Eeckman, 1998) that dynamical system behavior and stability of equilibria can be largely determined from regulatory element organization. This suggests that there must exist certain generic features of regulatory networks that are responsible for their inherent robustness and stability. Since in multicellular organisms the cellular ‘‘fate’’ is determined by which genes and proteins are expressed, the attractors in the Boolean networks should correspond to cell types, an idea originally due to Kauffman (2004). This interpretation is quite reasonable if cell types are characterized by stable recurrent patterns of gene expression (Jacob and Monod, 1961). Another interpretation of attractors in Boolean networks is that they correspond to cellular states, such as proliferation (cell cycle), apoptosis (programmed cell death), and differentiation (execution of tissue-specific tasks) (Huang, 1999). Such an interpretation can provide new insights into cellular homeostasis and cancer progression, the latter being characterized by an imbalance between these cellular states. For instance, the occurrence of a structural mutation can result in a reduction of the probability of the network entering the apoptosis attractor(s), making the cells less likely to undergo apoptosis and more likely to exhibit uncontrolled growth. Similarly, an enlargement of the basins of attraction for the proliferation attractor would hyperstabilize it, resulting in hyperproliferation, typical of tumorigenesis. Such an interpretation need not be at odds with the interpretation that attractors represent cellular types. To the contrary, these views are complementary to each other, since for a given cell type, different cellular functional states must exist and be determined by the collective behavior of gene activity. Thus, one cell type can comprise several ‘‘neighboring’’ attractors, each corresponding to a different cellular functional state. Biological networks can often be modeled as logical circuits from well-known local interaction data in a straightforward way. This is clearly one of the advantages of the Boolean network approach. Though logical models may sometimes appear obvious and simplistic, compared to detailed kinetic models of biomolecular reactions, they may help to understand the key dynamic properties of a regulatory process. Further, a Boolean network model can be formulated as a coarse-grained limit of the more detailed differential
equations model for a system (Davidich and Bornholdt, 2008a), discussed in Section 3. They may also lead the experimentalist to ask new questions and to test them first in silico. Let us consider a Boolean network model of the cell cycle control network in the budding yeast Saccharomyces cerevisiae proposed in Li et al. (2004). The core regulatory network involving activations and inhibitions among cyclins, transcription factors, and checkpoints, such as cell size, consists of 11 binary variables. The Boolean functions, Eq. (13.1), assigned to each variable are chosen from the subclass of threshold Boolean functions (Muroga, 1971), which sum up their inputs with weights; if the sum exceeds a threshold, the output of the function is equal to 1, else it is equal to 0. This is equivalent to a perceptron and represents a hyperplane that cuts the Boolean hypercube into two halves, zeros on one side and ones on the other. The model, shown in Fig. 1 in Li et al. (2004), also has self-degradation loops such that nodes that are not negatively regulated by others are degraded at the next time point. The dynamics of the model are described by

\[
x_i(t+1) =
\begin{cases}
1, & \text{if } \sum_{j=1}^{n} a_{ij}\, x_j(t) > 0 \\
0, & \text{if } \sum_{j=1}^{n} a_{ij}\, x_j(t) < 0 \\
x_i(t), & \text{if } \sum_{j=1}^{n} a_{ij}\, x_j(t) = 0
\end{cases}
\qquad (13.2)
\]

and the weights were all set to +1 or −1, depending on activation or inhibition, respectively (Li et al., 2004). Since there are 11 nodes in the network, there are 2^11 = 2048 states in total, and all the state transitions can be computed directly through Eq. (13.2). One of the seven attractors is the most stable and attracts approximately 86% of all states. This stable (fixed point) attractor, in which the molecules Cdh1 and Sic1 are equal to 1 and all others (Cln3, MBF, SBF, Cln1/2, Swi5, Cdc20, Clb5/6, Clb1/2, Mcm1) are equal to 0, represents the biological G1 stationary state (one of the four phases of the cell cycle process, in which the cell grows and can commit to division), guaranteeing cellular stability in this state. It is further demonstrated in Li et al. (2004) that the dynamic state trajectories starting from each of the states in the basin of attraction of the G1 stationary state converge rapidly onto an attracting state trajectory that is highly stable, ensuring that starting from any point in the cell cycle process, the system does not deviate from this trajectory. It is also shown, by comparison with random networks, that the highly stable attractor is unlikely to arise by chance (Li et al., 2004). Additionally, the results were fairly insensitive to the values of the weights, justifying setting them all to ±1. Other similar studies have been carried out with the cell cycle of the fission yeast Schizosaccharomyces pombe (Davidich and Bornholdt, 2008b)
and the mammalian cell cycle (Fauré et al., 2006). Recently, a new, more accurate Boolean network model, which can incorporate time delays, has been proposed for the budding yeast cell cycle (Irons, 2009).
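The threshold rule of Eq. (13.2) can be written compactly in vectorized form; in the sketch below, A is the signed connection matrix of the model (rows indexing target genes), with +1 entries for activation and −1 for inhibition. How the self-degradation loops are encoded (e.g., as negative diagonal entries) is a modeling choice and is left to the user.

```python
import numpy as np

def threshold_step(state, A):
    """One synchronous update of the threshold network of Eq. (13.2).

    state: length-n 0/1 vector; A[i, j] is +1 if gene j activates gene i,
    -1 if it inhibits it, and 0 otherwise (illustrative convention).
    """
    s = A @ state                                 # summed, signed regulatory input
    return np.where(s > 0, 1, np.where(s < 0, 0, state)).astype(int)
```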
3. Differential Equation Models

A model of a genetic network based on a system of differential equations expresses the rates of change of an element, such as a gene product, in terms of the levels of other elements of the network and possibly external inputs. In general, a nonlinear time-dependent differential equation has the form

\[
\dot{x} = f(x, u, t) \qquad (13.3)
\]
where x is a state vector denoting the values of the physical variables in the system, ẋ = dx/dt is the elementwise derivative of x, u is a vector of external inputs, and t is time. If time is discretized and the functional dependency specified by f does not depend on time, then the system is said to be time-invariant. If f is linear and time-invariant, then it can be expressed as

\[
\dot{x} = Ax + Bu \qquad (13.4)
\]
where A and B are constant matrices (Weaver et al., 1999). When ẋ = 0, the variables no longer change with time and thus define the steady state of the system, which is analogous to a fixed point attractor in a Boolean network. Consider the simple case of a gene product x (a scalar) whose rate of synthesis is proportional, with kinetic constant k1, to the abundance of another protein a that is sufficiently abundant such that the overall concentration of a is not significantly changed by the reaction. However, x is also subject to degradation, the rate of which is proportional, with constant k2, to the concentration of x itself. This situation can be expressed as

\[
\dot{x} = k_1 a - k_2 x, \qquad a, x > 0 \qquad (13.5)
\]
Let us analyze the behavior of this simple system. If initially x = 0, then the decay term is also 0 and ẋ = k1a. However, as x is produced, the decay term k2x will also increase, thereby decreasing the rate ẋ toward 0 and stabilizing x at some steady-state value x̄. It is easy to determine this value, since setting ẋ = 0 and solving for x yields

\[
\bar{x} = \frac{k_1 a}{k_2} \qquad (13.6)
\]
Figure 13.2 The behavior of the solution to $\dot{x} = k_1 a - k_2 x$, x(0) = 0, where $k_1 = 2$, $k_2 = 1$, and a = 1. As can be seen, the gene product x, shown as a solid curve, tends toward its steady-state value given in Eq. (13.6). The time derivative $\dot{x}$, which starts at the initial value $k_1 a$ and tends toward 0, is shown as a dashed curve.
This behavior is shown in Fig. 13.2, where x starts off at x = 0 and approaches the value in Eq. (13.6). The exact form of the kinetics is

$x(t) = \frac{k_1 a}{k_2}\,(1 - e^{-k_2 t}).$   (13.7)

Similarly, the derivative $\dot{x}$, also shown in Fig. 13.2, starts off at the initial value $k_1 a$ and thereafter tends toward zero. Now suppose that a is suddenly removed after the steady-state value $x^*$ is reached. Since a = 0, we have $\dot{x} = -k_2 x$ and, since the initial condition is $x = k_1 a / k_2$, $\dot{x} = -k_1 a$ initially. The solution of this equation is

$x(t) = \frac{k_1 a}{k_2}\,e^{-k_2 t},$   (13.8)

and it can be seen that x(t) eventually approaches zero.
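Before turning to nonlinear kinetics, Eqs. (13.7) and (13.8) are easy to verify numerically. The sketch below integrates the model with the parameters of Fig. 13.2 using a forward Euler scheme, chosen purely for transparency; the step size is our own choice, not prescribed by the text.

```python
# Numerical check of Eqs. (13.7)/(13.8) for the model x' = k1*a - k2*x.
# Parameters match Fig. 13.2: k1 = 2, k2 = 1, a = 1, x(0) = 0.
import math

k1, k2, a = 2.0, 1.0, 1.0
dt, t_end = 0.001, 5.0

x, t = 0.0, 0.0
while t < t_end:
    x += dt * (k1 * a - k2 * x)   # forward Euler step
    t += dt

exact = (k1 * a / k2) * (1.0 - math.exp(-k2 * t))   # Eq. (13.7)
print(f"numerical x({t:.1f}) = {x:.4f},  analytic = {exact:.4f}")  # both near k1*a/k2 = 2

# Phase 2: remove a; decay from the steady state follows Eq. (13.8).
a = 0.0
x0, t2 = x, 0.0
while t2 < t_end:
    x += dt * (k1 * a - k2 * x)
    t2 += dt
print(f"after decay: numerical = {x:.4f},  analytic = {x0 * math.exp(-k2 * t2):.4f}")
```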
This example describes a linear relationship between a and $\dot{x}$. However, most gene interactions are highly nonlinear: when the regulator is below some critical value, it has very little effect on the regulated gene, and when it is above the critical value, it has virtually full effect that cannot be significantly amplified by increased concentrations of the regulator. This nonlinear behavior is typically described by sigmoid functions, which can be either monotonically increasing or decreasing. A common form is the so-called Hill functions, given by
$F^{+}(x, \theta) = \frac{x^n}{\theta^n + x^n}, \qquad F^{-}(x, \theta) = \frac{\theta^n}{\theta^n + x^n} = 1 - F^{+}(x, \theta).$   (13.9)
The function $F^{+}(x, 1)$ is illustrated in Fig. 13.3 for n = 1, 2, 5, 10, 20, 50, and 100. It can be seen that it approaches an ideal step function with increasing n, thus approximating a Boolean switch. In fact, the parameter θ essentially plays the role of the threshold value. Glass (1975) used step functions in place of sigmoidal functions in differential equation models, resulting in so-called piecewise linear differential equations. Glass and Kauffman (1973) also showed that many systems exhibit the same qualitative behavior for a wide range of sigmoidal steepnesses, parameterized by n.

Figure 13.3 The function $F^{+}(x, \theta)$ for θ = 1 and n = 1, 2, 5, 10, 20, 50, and 100. As n gets large, $F^{+}(x, \theta)$ approaches an ideal step function and thus functions as a Boolean switch.

Given that gene regulation is nonlinear, the differential equation models can incorporate the Hill functions into their synthesis and decay terms. There are many available computer tools for simulating and analyzing such dynamical systems using a variety of methods and algorithms (Lambert, 1991), including DBsolve (Goryanin et al., 1999), GEPASI (Mendes, 1993), and Dizzy (Ramsey et al., 2005). Additionally, there are toolboxes available for MATLAB® that can be used for modeling, simulating, and analyzing biological systems with ordinary differential equations (Schmidt and Jirstrand, 2006). MathWorks' SimBiology® toolbox (http://www.mathworks.com/products/simbiology) also provides a graphical user interface for constructing models and entering reactions, parameters, and kinetic laws, which can be simulated deterministically or stochastically. A useful review of nonlinear ordinary differential equation modeling of the cell cycle is available in Sible and Tyson (2007).
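The switch-like limit of the Hill function is equally easy to check. The short sketch below evaluates $F^{+}(x, \theta)$ from Eq. (13.9) around the threshold θ = 1 for the Hill coefficients shown in Fig. 13.3; it assumes nothing beyond the formula itself.

```python
# Evaluate the Hill function F+(x, theta) = x^n / (theta^n + x^n) of Eq. (13.9)
# around the threshold theta = 1 for increasing Hill coefficients n.
def hill_plus(x, theta=1.0, n=1):
    return x**n / (theta**n + x**n)

for n in (1, 2, 5, 10, 20, 50, 100):
    below, at, above = hill_plus(0.8, n=n), hill_plus(1.0, n=n), hill_plus(1.25, n=n)
    print(f"n={n:3d}:  F+(0.8)={below:.3f}  F+(1.0)={at:.3f}  F+(1.25)={above:.3f}")

# As n grows, F+ tends to 0 below theta and to 1 above it, with F+(theta) = 0.5:
# the sigmoid sharpens into the step function visible in Fig. 13.3.
```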
3.1. Accurate description of cellular growth and division and prediction of mutant phenotypes

Let us return to the regulatory network controlling the cell cycle in budding yeast. If the goal of the modeling is to predict detailed quantitative phenomena, such as cell cycle duration in parent and daughter cells, the lengths of the different phases of the cell cycle, or ratios between certain regulatory proteins, then logical models such as Boolean networks are not appropriate, and systems of ordinary differential equations with detailed kinetic parameters must be used. Chen et al. (2004) constructed such a detailed model of the cell cycle regulatory network containing 36 equations with 148 constants, in addition to algebraic equations (available in Table 1 of that paper, with Table 2 containing parameter values). The model incorporates protein concentrations, cell mass, DNA mass, and the states of the emerging bud and the mitotic spindle. After manual fitting of some of the parameters, the dynamics generated by the model were able to accurately describe the growth and division of wild-type cells. Remarkably, the model also conformed to the phenotypes of more than 100 mutant strains, in terms of experimentally observed properties such as size at bud emergence or at onset of DNA synthesis, viability, or growth rate, relative to these properties in the wild type. It should be pointed out that parameter estimation for such a model must be approached with care. First, the objective function, for example, the mean-squared error between the model predictions and the experimental data, may have multiple local optima in the parameter space. Thus, an apparently good model fit may nonetheless contain unrealistic sets of parameters that will ultimately fail to generalize. For example, as was found in Chen et al. (2004), changing parameters to "rescue" a model with respect to a mutant (i.e., make it agree with experimental observations) often has unintended and unanticipated effects on other mutants. Second, model selection must be carefully considered, since a model that is overly complex, meaning that it has many degrees of freedom, is likely to "overfit" the data and thereby sacrifice predictive accuracy. In other words, the model may appear to predict very well when tested against data on which it was trained,
but when tested against data under new conditions, the model will predict very poorly. There are powerful tools, such as minimum description length, and indeed, entire frameworks based on algorithmic information theory and Bayesian inference, devoted to these fundamental issues (Rissanen, 2007).
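The local-optima pitfall can be reproduced on even a toy problem. The sketch below, which is in no way the Chen et al. (2004) model, fits a two-parameter Hill response to synthetic noisy data from several starting points using SciPy; because the sum-of-squares surface is generally not convex, different starts may settle into different minima, which is why multi-start strategies and validation against held-out data are standard practice.

```python
# Toy illustration of local optima in dose-response / ODE parameter fitting.
# This is NOT the Chen et al. (2004) cell cycle model; the "data" are synthetic.
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
x = np.linspace(0.1, 3.0, 40)
true_theta, true_n = 1.2, 6.0
y = x**true_n / (true_theta**true_n + x**true_n) + 0.05 * rng.standard_normal(x.size)

def residuals(params):
    theta, n = params
    return x**n / (theta**n + x**n) - y

# Multi-start: the residual surface in (theta, n) need not be convex,
# so different initial guesses may converge to different local minima.
for start in ([0.5, 1.0], [2.5, 2.0], [1.0, 10.0]):
    fit = least_squares(residuals, start, bounds=([0.01, 0.5], [5.0, 50.0]))
    print(f"start {start} -> theta={fit.x[0]:.3f}, n={fit.x[1]:.2f}, cost={fit.cost:.4f}")
```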
4. Probabilistic Boolean Networks

PBNs are probabilistic or stochastic generalizations of Boolean networks. Essentially, the deterministic dynamics are replaced by probabilistic dynamics, which can be framed within the mature and well-established theory of Markov chains, for which many analytical and numerical tools have been developed. Recall that Markov chains are stochastic processes having the property that future states depend only on the present state, and not on the past states. The transitions from one state to another (possibly itself) are specified by state transition probabilities. Boolean networks are special cases of PBNs in which state transition probabilities are either 1 or 0, depending on whether Eq. (13.1) is satisfied for all i = 1, ..., n. The probabilistic nature of this model class affords flexibility and power in terms of making inferences from data, which necessarily contain uncertainty, as well as in terms of understanding the dynamical behavior of biological networks, particularly in relation to their structure. Once the state transition probabilities for a Markov chain corresponding to a PBN are determined, it becomes possible to study the steady-state (long-run) behavior of the stochastic system. This long-run behavior is analogous to attractors in Boolean networks or fixed points in systems of differential equations. Kim et al. (2002) investigated the Markov chain corresponding to a small network based on microarray data observations of human melanoma samples. The steady-state behavior (distribution) of the constructed Markov chain was then compared to the initial observations. If the Markov chain is ergodic, meaning that it is possible to reach any state from any other state after an arbitrary number of steps, then the steady-state probability corresponds to the fraction of the time that the system will spend in that particular state. The remarkable finding was that only a small number of all possible states had significant steady-state probabilities, and most of those states with high probability were observed in the data. Furthermore, it was found that more than 85% of those states with high steady-state probability that were not observed in the data were very close to the observed data in terms of Hamming distance, which is equal to the number of genes that "disagree" in their binary values. Based on the transition rules inferred from the data, the model produced localized stability, meaning that the system tended to flow back to the states with high steady-state probability mass if placed in
their vicinity. Thus, the stochastic dynamics of the Markov chain were able to mimic biological regulation. It should be noted that Markov chains are commonly used to model gene expression dynamics using so-called dynamic Bayesian networks (Murphy and Mian, 1999; Yu et al., 2004; Zou and Conzen, 2005). Indeed, PBNs and dynamic Bayesian networks are able to represent the same joint probability distribution over their common variables (i.e., genes) (Lähdesmäki et al., 2006). Except in very restricted circumstances, gene expression data refute the determinism inherent to the Boolean network model, there typically being a number of possible successor states to any given state. Consequently, if one continues to assume that the state at time t + 1 is independent of the state values prior to time t, then, as stated above, the network dynamics are described by a Markov chain whose state transition matrix reflects the observed stochasticity. In terms of gene regulation, this stochasticity can be interpreted to mean that several regulator gene sets are associated with each gene and that, at any time point, one of these "predictor" sets, along with a corresponding Boolean function, is randomly chosen to provide the value of the gene as a function of the values within the chosen predictor set. It is this reasoning that motivated the original definition of a PBN, in which the definition of a Boolean network was adapted in such a way that, for each gene, at each time point, a Boolean function (and predictor gene set) is randomly chosen to determine the network transition (Shmulevich et al., 2002a,c). Rather than simply randomly assigning Boolean functions at each time point, one can take the perspective that the data come from distinct sources, each representing a "context" of the cell. From this perspective, the data derive from a family of deterministic networks and, in principle, the data could be separated into separate samples according to the contexts from which they have been derived. Given the context, the overall network would function as a Boolean network, its transition matrix reflecting determinism (i.e., each row contains one 1, in the column that corresponds to the successor state, and the rest are 0s). If defined in this manner, a PBN is a collection of Boolean networks in which a constituent network governs gene activity for a random period of time before another randomly chosen constituent network takes over, possibly in response to some random event, such as an external stimulus or the action of a (latent) regulator that is outside the scope of the network. Since the latter is not part of the model, network switching is random. This model defines a "context-sensitive" PBN (Brun et al., 2005; Shmulevich et al., 2002c). The probabilistic nature of the constituent choice reflects the fact that the system is open, not closed, the idea being that changes between the constituent networks result from the genes responding to latent variables external to the model network. We now formally define PBNs. Although we retain the terminology "Boolean" in the definition, this does not refer to the binary quantization assumed in standard Boolean networks, but rather to the logical character of
the gene predictor functions. In the case of PBNs, quantization is assumed to be finite, but not necessarily binary. However, we restrict ourselves to the binary domain here for simplicity. Formally, a PBN consists of a sequence $V = \{x_i\}_{i=1}^{n}$ of n nodes, where $x_i \in \{0, 1\}$, and a sequence $\{f_l\}_{l=1}^{m}$ of vector-valued functions, defining constituent networks. In the framework of gene regulation, each element $x_i$ represents the expression value of a gene. Each vector-valued function $f_l = (f_l^{(1)}, f_l^{(2)}, \ldots, f_l^{(n)})$ determines a constituent network, or context, of the PBN. The function $f_l^{(i)}: \{0,1\}^n \to \{0,1\}$ is a predictor of gene i, whenever network l is selected. At each updating epoch, a decision is made whether to switch the constituent network. This decision depends on a binary random variable ξ: if ξ = 0, then the current context is maintained; if ξ = 1, then a constituent network is randomly selected from among all constituent networks according to the selection probability distribution $\{c_l\}_{l=1}^{m}$,

$\sum_{l=1}^{m} c_l = 1.$   (13.10)
The switching probability q = P(ξ = 1) is a system parameter. If the current network is maintained, then the PBN behaves like a fixed network and synchronously updates the values of all the genes according to the current context. Note that, even if ξ = 1, a different constituent network is not necessarily selected, because the "new" network is selected from among all contexts. In other words, the decision to switch is not equivalent to the decision to change the current network. If a switch is called for (ξ = 1), then, after selecting the predictor function $f_l$, the values of the genes are updated accordingly; that is, according to the network determined by $f_l$. If q < 1, the PBN is said to be context-sensitive; if q = 1, the PBN is said to be instantaneously random, which corresponds to the original definition in Shmulevich et al. (2002a). Whereas a network switch corresponds to a change in a latent variable causing a structural change in the functions governing the network, a random perturbation corresponds to a transient value change that leaves the network wiring unchanged, as in the case of activation or inactivation owing to external stimuli such as stress conditions, small molecule inhibitors, etc. In a PBN with perturbation, there is a small probability p that a gene may change its value at each epoch. Perturbation is characterized by a random perturbation vector $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_n)$, $\gamma_i \in \{0, 1\}$, and $P(\gamma_i = 1) = p$, the perturbation probability; $\gamma_i$ is also known as a Bernoulli(p) random variable. If x(t) is the current state of the network and $\gamma(t+1) = 0$, then the next state of the network is given by $x(t+1) = f_l(x(t))$, as in Eq. (13.1); otherwise, $x(t+1) = x(t) \oplus \gamma(t+1)$, where $\oplus$ is componentwise exclusive OR. The probability of no perturbation, in which
case the next state is determined according to the current network function $f_l$, is $(1-p)^n$, and the probability of a perturbation is $1 - (1-p)^n$. The perturbation model captures the realistic situation where the activity of a gene undergoes a random alteration (Shmulevich et al., 2002b). As with Boolean networks, attractors play a major role in the study of PBNs. By definition, the attractor cycles of a PBN consist of the attractor cycles of the constituent networks, and their basins are likewise defined. Whereas in a Boolean network two attractor cycles cannot intersect, attractor cycles from different contexts can intersect in a PBN. The presentation of the state transition probabilities of the Markov chain corresponding to the (context-sensitive) PBN is beyond the scope of this chapter, and the reader is referred to Brun et al. (2005). Suffice it to say that from the state transition matrix of the Markov chain, which is guaranteed to be ergodic under a gene perturbation model as described above, even for very small p, one can compute the steady-state distribution. A Markov chain is said to possess a steady-state distribution if there exists a probability distribution $\pi = (\pi_1, \pi_2, \ldots, \pi_M)$ such that for all states $i, j \in \{1, 2, \ldots, M\}$,

$\lim_{r \to \infty} P_{ij}^{r} = \pi_j,$   (13.11)
where $P_{ij}^{r}$ is the r-step transition probability between states i and j. If there exists a steady-state distribution, then, regardless of the initial state, the probability of the Markov chain being in state i in the long run can be estimated by sampling the observed states in a simulation (by simply counting the percentage of time the chain spends in that state). Such an approach was used to analyze the joint steady-state probabilities of several key molecules (NFκB, Tie-2, and TGFB3) in a 15-gene network derived from human glioma gene expression data (Shmulevich et al., 2003).
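For small networks, the limit in Eq. (13.11) can also be computed directly from the transition matrix rather than by simulation. The sketch below constructs the transition matrix of a toy instantaneously random PBN with gene perturbation and extracts the steady-state distribution as the leading left eigenvector; the two 2-gene contexts and their selection probabilities are invented purely for illustration.

```python
# Steady-state distribution of a toy instantaneously random PBN (q = 1)
# with gene perturbation. The two 2-gene contexts below are hypothetical.
import numpy as np
from itertools import product

n, p = 2, 0.01                      # genes, perturbation probability
c = [0.7, 0.3]                      # context selection probabilities, sum to 1

# Each context maps a state (x1, x2) to a successor state.
contexts = [
    lambda s: (s[1], s[0] and s[1]),        # context 1 (illustrative)
    lambda s: (1 - s[0], s[0] or s[1]),     # context 2 (illustrative)
]

states = list(product((0, 1), repeat=n))
idx = {s: k for k, s in enumerate(states)}
P = np.zeros((2**n, 2**n))

for s in states:
    for cl, f in zip(c, contexts):
        target = idx[f(s)]
        for t in states:
            flips = sum(a != b for a, b in zip(s, t))
            if flips == 0:
                # no perturbation: apply the chosen context's function
                P[idx[s], target] += cl * (1 - p)**n
            else:
                # perturbation: flip exactly the differing genes
                P[idx[s], idx[t]] += cl * p**flips * (1 - p)**(n - flips)

# pi is the left eigenvector of P for eigenvalue 1 (chain is ergodic for p > 0)
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi /= pi.sum()
for s, prob in zip(states, pi):
    print(s, f"{prob:.4f}")
```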
4.1. Steady-state analysis and stability under stochastic fluctuations

The Boolean network model of the cell cycle, discussed in Section 2.1, was generalized in Zhang et al. (2006) such that the network dynamics are described by a Markov chain with transition probabilities

$P(x_i(t+1) = 1 \mid x(t)) = \frac{e^{2\beta T}}{e^{2\beta T} + 1}, \quad \text{if } T = \sum_{j=1}^{n} a_{ij} x_j(t) \neq 0,$   (13.12)

and

$P(x_i(t+1) = x_i(t) \mid x(t)) = \frac{1}{1 + e^{-\alpha}}, \quad \text{if } T = \sum_{j=1}^{n} a_{ij} x_j(t) = 0.$   (13.13)
The term T is the same weighted input that appears in Eq. (13.2). Note that this is essentially a way of introducing noise, thereby making the Markov chain ergodic so that a steady-state distribution exists. The positive number β plays the role of a temperature that characterizes the strength of the noise introduced into the system dynamics. The parameter α characterizes the stochasticity when the input to a node is zero and determines the probability that a protein maintains its state when there is no input to it. It should be noted that as α, β → ∞, the stochastic model converges to the deterministic Boolean network model of Li et al. (2004). The state transition probabilities allow the computation of the steady-state distribution in Eq. (13.11). In addition, the so-called net probability flux $\pi_i P_{ij} - \pi_j P_{ji}$ from state i to state j can be determined, where $P_{ij}$ is the state transition probability. The steady-state probability of the stationary G1 phase of the cell cycle was studied relative to the noise level determined by β. It was found that this state is indeed the most probable state of the system and that its probability decreases with increasing noise strength, as expected, since random perturbations tend to move the system away from the attractor (Zhang et al., 2006). Interestingly, a type of phase transition was found whereby, at a critical value of the parameter β, the steady-state probability of the stationary G1 state virtually vanishes and the system becomes dominated by noise and cannot carry out coordinated behavior. Nonetheless, this critical temperature is quite high and the system is able to tolerate approximately 10% of its rules misbehaving, implying that the cell cycle network is robust against stochastic fluctuations (Zhang et al., 2006). Additionally, the probability flux from states other than those on the cell cycle trajectory from the excited G1 state is convergent onto this trajectory, implying homeostatic stability.
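A Monte Carlo version of this analysis is straightforward: simulate the stochastic update rule of Eqs. (13.12) and (13.13) and record how often the chain occupies its most probable state as the noise level varies. The sketch below again uses a hypothetical 3-node weight matrix rather than the published yeast network; the qualitative trend, sharper concentration of probability mass as β grows, is the point of interest.

```python
# Monte Carlo estimate of state occupancy under the stochastic update rule of
# Eqs. (13.12)/(13.13). The 3-node weight matrix is hypothetical, not the
# yeast cell cycle network of Li et al. (2004) / Zhang et al. (2006).
import math, random

W = [[0, 1, -1], [1, 0, 1], [-1, 1, 0]]
n, alpha = len(W), 5.0
random.seed(1)

def stochastic_step(state, beta):
    nxt = []
    for i in range(n):
        T = sum(W[i][j] * state[j] for j in range(n))
        if T != 0:
            p_on = 1.0 / (1.0 + math.exp(-2.0 * beta * T))   # Eq. (13.12)
            nxt.append(1 if random.random() < p_on else 0)
        else:
            p_keep = 1.0 / (1.0 + math.exp(-alpha))          # Eq. (13.13)
            nxt.append(state[i] if random.random() < p_keep else 1 - state[i])
    return tuple(nxt)

for beta in (0.1, 0.5, 1.0, 2.0, 5.0):
    counts, s = {}, (0, 0, 0)
    for _ in range(20000):                   # no burn-in; adequate for a sketch
        s = stochastic_step(s, beta)
        counts[s] = counts.get(s, 0) + 1
    top, hits = max(counts.items(), key=lambda kv: kv[1])
    print(f"beta={beta:4.1f}: most visited state {top} ({hits/20000:.2%} of steps)")
```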
5. Stochastic Differential Equation Models

The stochastic generalization of Boolean networks, leading to Markovian dynamics, is intended to capture uncertainty in the data, whether due to measurement noise or to biological variability, intrinsic or extrinsic, the latter being caused by latent variables external to the model. On the other hand, if the intention of the modeling is to capture quantitative molecular or physical details, as in the systems of ordinary differential equations discussed in Section 3, then stochastic fluctuations on the molecular level can be incorporated explicitly into the model using stochastic differential equations. For example, as most regulatory molecules are produced at very low intracellular concentrations, the resulting reaction rates exhibit large variability. Such intrinsic molecular noise has been found to be important for many biological functions and processes (Ozbudak et al., 2002; Raser and O'Shea, 2005).
There exist powerful stochastic simulation methods for accurately simulating the dynamics of a system of chemically reacting molecules that can reflect the discrete and stochastic nature of such systems on a cellular scale. A recent review of such methods is available in Cao and Samuels (2009). However, there are undoubtedly other intrinsic and extrinsic contributions to variability in gene and protein expression, for example, due to spatial heterogeneity or fluctuations in cellular components (Swain et al., 2002). Stochastic differential equations allow for a very general incorporation of stochasticity into a model without the need to assume specific knowledge about the nature of such stochasticity. Manninen et al. (2006) developed several approaches to incorporate stochasticity into deterministic differential equation models, obtaining so-called Itô stochastic differential equations, and applied them to modeling the neuronal protein kinase C signal transduction pathway. A comparative analysis showed that such approaches are preferable to stochastic simulation algorithm methods, as the latter are slower by several orders of magnitude when simulating systems with a large number of chemical species (Manninen et al., 2006). The stochastic differential equation framework additionally allows the incorporation of stochasticity into the reaction rates, rate constants, and concentrations. The basic model can be written as a Langevin equation with multiplicative noise (Rao et al., 2002), so that for a single species $x_i$,

$\dot{x}_i = f_i(x, u, t) + g(x_i)\,\xi_i(t),$   (13.14)
where $f_i(x, u, t)$ is the deterministic model and $\xi_i(t)$ is zero-mean, unit-variance Gaussian white noise. The function $g(x_i)$ represents the contribution of the fluctuations and is commonly assumed to be proportional to the square root of the concentration, that is, $g(x_i) \propto \sqrt{x_i}$. The solution to such stochastic differential equations can be obtained by numerical integration using standard techniques.
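One standard technique is the Euler–Maruyama scheme. The sketch below applies it to the synthesis–degradation example of Eq. (13.5), with the multiplicative noise term chosen as g(x) = σ√x; both σ and the clipping at zero are our own illustrative choices, not prescriptions from the literature cited above.

```python
# Euler-Maruyama integration of the Langevin equation (13.14) for the toy
# model dx = (k1*a - k2*x) dt + sigma*sqrt(x) dW (illustrative choice of g).
import math, random

k1, k2, a, sigma = 2.0, 1.0, 1.0, 0.2
dt, steps = 0.001, 5000
random.seed(42)

x = 0.0
for _ in range(steps):
    drift = k1 * a - k2 * x
    noise = sigma * math.sqrt(max(x, 0.0))          # g(x) ~ sqrt(x), clipped at 0
    x += drift * dt + noise * math.sqrt(dt) * random.gauss(0.0, 1.0)
    x = max(x, 0.0)                                 # concentrations stay nonnegative

print(f"x(5.0) ≈ {x:.3f} (deterministic steady state is k1*a/k2 = {k1*a/k2})")
```

Averaging many such trajectories approximately recovers the deterministic kinetics of Eq. (13.7), while individual trajectories fluctuate about it.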
5.1. The influence of noise on system behavior

Let us turn to the cell cycle control network of the fission yeast S. pombe, for which a system of ordinary differential equations was proposed (Novak et al., 2001), consisting of eight deterministic differential equations and three algebraic equations. We mention in passing that a Boolean network model for this network is available in Davidich and Bornholdt (2008b). The differential equation model in Novak et al. (2001) was found to be in good agreement with wild-type cells as well as with several mutants. Steuer (2004) converted this model to a system of stochastic differential equations and compared the simulations with experimental data. It was found that the cycle time and division size distributions within a cell population were predicted well by the model; for example, the model predicted a negative
correlation between cycle time and mass at birth, meaning that cells that are large at birth have shorter cycle times, which ensures homeostasis in successive generations (Steuer, 2004). The stochastic model also accounted for a characteristic ratio of the coefficients of variation for the cycle time and division length. The stochastic differential equation model was also applied to study a certain double mutant (wee1− cdc25Δ) that exhibits quantized cycle lengths. A deterministic model of the mutants can be obtained by removing the corresponding parameters from the system of differential equations. However, the simulation of the deterministic differential equation model of the double mutant results in periodically alternating long and short cycle times, which are determined exclusively by cell mass at birth, meaning that small cells have long cycles and give rise to large daughters, and large cells have short cycles and give rise to small daughters. The simulation of the stochastic differential equation model produces very different results: cell mass at birth no longer determines the length of the next cycle, and the (nonintuitive) characteristic clusters (i.e., "quantization") in a plot of cycle time versus mass at birth are in good agreement with experimental observations (Steuer, 2004). Additionally, in the stochastic model, the oscillation between long and short cycles disappears, which is consistent with experimental observations. Thus, the inclusion of stochastic fluctuations in the model was able to account for several features not accounted for by the deterministic model. The fact that noise is able to qualitatively alter macroscopic system behavior suggests that stochastic fluctuations play a key role in modulating cellular regulation. Stochastic differential equation models provide a powerful framework for gaining an understanding of these phenomena.
REFERENCES

Aldana, M., Coppersmith, S., and Kadanoff, L. P. (2002). Boolean dynamics with random couplings. In "Perspectives and Problems in Nonlinear Science," (E. Kaplan, J. E. Marsden, and K. R. Sreenivasan, eds.), pp. 23–89. Springer, New York.
Alvarez-Buylla, E. R., Chaos, A., Aldana, M., Benítez, M., Cortes-Poza, Y., Espinosa-Soto, C., Hartasánchez, D. A., Lotto, R. B., Malkin, D., Escalera Santos, G. J., and Padilla-Longoria, P. (2008). Floral morphogenesis: Stochastic explorations of a gene network epigenetic landscape. PLoS ONE 3(11), e3626.
Bornholdt, S. (2008). Boolean network models of cellular regulation: Prospects and limitations. J. R. Soc. Interface 5(Suppl. 1), S85–S94.
Braunewell, S., and Bornholdt, S. (2006). Superstability of the yeast cell-cycle dynamics: Ensuring causality in the presence of biochemical stochasticity. J. Theor. Biol. 245(4), 638–643.
Brun, M., Dougherty, E. R., and Shmulevich, I. (2005). Steady-state probabilities for attractors in probabilistic Boolean networks. Signal Process. 85(4), 1993–2013.
Cao, Y., and Samuels, D. C. (2009). Discrete stochastic simulation methods for chemically reacting systems. Methods Enzymol. 454, 115–140.
Chang, H. H., Hemberg, M., Barahona, M., Ingber, D. E., and Huang, S. (2008). Transcriptome-wide noise controls lineage choice in mammalian progenitor cells. Nature 453(7194), 544–547.
Chaves, M., Albert, R., and Sontag, E. D. (2005). Robustness and fragility of Boolean models for genetic regulatory networks. J. Theor. Biol. 235, 431–449.
Chen, K. C., Calzone, L., Csikasz-Nagy, A., Cross, F. R., Novak, B., and Tyson, J. J. (2004). Integrative analysis of cell cycle control in budding yeast. Mol. Biol. Cell 15, 3841–3862.
Davidich, M., and Bornholdt, S. (2008a). The transition from differential equations to Boolean networks: A case study in simplifying a regulatory network model. J. Theor. Biol. 255(3), 269–277.
Davidich, M. I., and Bornholdt, S. (2008b). Boolean network model predicts cell cycle sequence of fission yeast. PLoS ONE 3(2), e1672.
Drossel, B. (2007). Random Boolean networks. In "Annual Review of Nonlinear Dynamics and Complexity, Vol. 1," (H. G. Schuster, ed.), Wiley.
Fauré, A., Naldi, A., Chaouiya, C., and Thieffry, D. (2006). Dynamical analysis of a generic Boolean model for the control of the mammalian cell cycle. Bioinformatics 22(14), e124–e131.
Glass, L. (1975). Classification of biological networks by their qualitative dynamics. J. Theor. Biol. 54, 85–107.
Glass, L., and Kauffman, S. A. (1973). The logical analysis of continuous, nonlinear biochemical control networks. J. Theor. Biol. 39, 103–129.
Goryanin, I., Hodgman, T. C., and Selkov, E. (1999). Mathematical simulation and analysis of cellular metabolism and regulation. Bioinformatics 15(9), 749–758.
Huang, S. (1999). Gene expression profiling, genetic networks, and cellular states: An integrating concept for tumorigenesis and drug discovery. J. Mol. Med. 77(6), 469–480.
Huang, S., and Ingber, D. E. (2000). Shape-dependent control of cell growth, differentiation, and apoptosis: Switching between attractors in cell regulatory networks. Exp. Cell Res. 261(1), 91–103.
Huang, S., Eichler, G., Bar-Yam, Y., and Ingber, D. E. (2005). Cell fates as high-dimensional attractor states of a complex gene regulatory network. Phys. Rev. Lett. 94(12), 128701–128704.
Irons, D. J. (2009). Logical analysis of the budding yeast cell cycle. J. Theor. Biol. 257(4), 543–559.
Jacob, F., and Monod, J. (1961). On the regulation of gene activity. Cold Spring Harb. Symp. Quant. Biol. 26, 193–211.
Kauffman, S. A. (1969a). Metabolic stability and epigenesis in randomly constructed genetic nets. J. Theor. Biol. 22, 437–467.
Kauffman, S. A. (1969b). Homeostasis and differentiation in random genetic control networks. Nature 224, 177–178.
Kauffman, S. A. (1974). The large scale structure and dynamics of genetic control circuits: An ensemble approach. J. Theor. Biol. 44, 167–190.
Kauffman, S. A. (1993). The Origins of Order: Self-Organization and Selection in Evolution. Oxford University Press, New York.
Kauffman, S. (2004). A proposal for using the ensemble approach to understand genetic regulatory networks. J. Theor. Biol. 230(4), 581–590.
Kim, S., Li, H., Dougherty, E. R., Cao, N., Chen, Y., Bittner, M. L., and Suh, E. B. (2002). Can Markov chain models mimic biological regulation? J. Biol. Syst. 10(4), 431–445.
Lähdesmäki, H., Hautaniemi, S., Shmulevich, I., and Yli-Harja, O. (2006). Relationships between probabilistic Boolean networks and dynamic Bayesian networks as models of gene regulatory networks. Signal Process. 86(4), 814–834.
Lambert, J. D. (1991). Numerical Methods for Ordinary Differential Equations. Wiley, Chichester.
Li, F., Long, T., Lu, Y., Ouyang, Q., and Tang, C. (2004). The yeast cell-cycle network is robustly designed. Proc. Natl. Acad. Sci. USA 101(14), 4781–4786.
MacLeod, M. C. (1996). A possible role in chemical carcinogenesis for epigenetic, heritable changes in gene expression. Mol. Carcinog. 15(4), 241–250.
Manninen, T., Linne, M. L., and Ruohonen, K. (2006). Developing Itô stochastic differential equation models for neuronal signal transduction pathways. Comput. Biol. Chem. 30(4), 280–291.
Mendes, P. (1993). GEPASI: A software package for modelling the dynamics, steady states and control of biochemical and other systems. Comput. Appl. Biosci. 9(5), 563–571.
Muroga, S. (1971). Threshold Logic and its Applications. Wiley-Interscience.
Murphy, K., and Mian, S. (1999). Modelling Gene Expression Data using Dynamic Bayesian Networks. Technical Report, University of California, Berkeley.
Novak, B., Pataki, Z., Ciliberto, A., and Tyson, J. J. (2001). Mathematical model of the cell division cycle of fission yeast. Chaos 11(1), 277–286.
Ozbudak, E. M., Thattai, M., Kurtser, I., Grossman, A. D., and van Oudenaarden, A. (2002). Regulation of noise in the expression of a single gene. Nat. Genet. 31(1), 69–73.
Ramsey, S., Orrell, D., and Bolouri, H. (2005). Dizzy: Stochastic simulations of large-scale genetic regulatory networks. J. Bioinform. Comput. Biol. 3(2), 1–21.
Rao, C. V., Wolf, D. M., and Arkin, A. P. (2002). Control, exploitation and tolerance of intracellular noise. Nature 420(6912), 231–237.
Raser, J. M., and O'Shea, E. K. (2005). Noise in gene expression: Origins, consequences, and control. Science 309(5743), 2010–2013.
Richmond, C. S., Glasner, J. D., Mau, R., Jin, H., and Blattner, F. R. (1999). Genome-wide expression profiling in Escherichia coli K-12. Nucleic Acids Res. 27, 3821–3835.
Rissanen, J. (2007). Information and Complexity in Statistical Modeling. Springer.
Schmidt, H., and Jirstrand, M. (2006). Systems Biology Toolbox for MATLAB: A computational platform for research in systems biology. Bioinformatics 22(4), 514–515.
Shmulevich, I., and Zhang, W. (2002). Binary analysis and optimization-based normalization of gene expression data. Bioinformatics 18(4), 555–565.
Shmulevich, I., Dougherty, E. R., Kim, S., and Zhang, W. (2002a). Probabilistic Boolean networks: A rule-based uncertainty model for gene regulatory networks. Bioinformatics 18(2), 261–274.
Shmulevich, I., Dougherty, E. R., and Zhang, W. (2002b). Gene perturbation and intervention in probabilistic Boolean networks. Bioinformatics 18(10), 1319–1331.
Shmulevich, I., Dougherty, E. R., and Zhang, W. (2002c). From Boolean to probabilistic Boolean networks as models of genetic regulatory networks. Proc. IEEE 90(11), 1778–1792.
Shmulevich, I., Gluhovsky, I., Hashimoto, R., Dougherty, E. R., and Zhang, W. (2003). Steady-state analysis of probabilistic Boolean networks. Comp. Funct. Genom. 4(6), 601–608.
Sible, J. C., and Tyson, J. J. (2007). Mathematical modeling as a tool for investigating cell cycle control networks. Methods 41(2), 238–247.
SimBiology 3.0 Toolbox. http://www.mathworks.com/products/simbiology.
Steuer, R. (2004). Effects of stochasticity in models of the cell cycle: From quantized cycle times to noise-induced oscillations. J. Theor. Biol. 228(3), 293–301.
Swain, P. S., Elowitz, M. B., and Siggia, E. D. (2002). Intrinsic and extrinsic contributions to stochasticity in gene expression. Proc. Natl. Acad. Sci. USA 99(20), 12795–12800.
Thomas, R. (1973). Boolean formalization of genetic control circuits. J. Theor. Biol. 42, 563–585.
Weaver, D. C., Workman, C. T., and Stormo, G. D. (1999). Modeling regulatory networks with weight matrices. Pac. Symp. Biocomput. 4, 112–123.
Wolf, D. M., and Eeckman, F. H. (1998). On the relationship between genomic regulatory element organization and gene regulatory dynamics. J. Theor. Biol. 195(2), 167–186.
Yu, J., Smith, V. A., Wang, P. P., Hartemink, A. J., and Jarvis, E. D. (2004). Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics 20(18), 3594–3603.
Zhang, Y., Qian, M., Ouyang, Q., Deng, M., Li, F., and Tang, C. (2006). Stochastic model of yeast cell-cycle network. Physica D 219(1), 35–39.
Zou, M., and Conzen, S. D. (2005). A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics 21(1), 71–79.
C H A P T E R
F O U R T E E N
Bayesian Probability Approach to ADHD Appraisal

Raina Robeva* and Jennifer Kim Penberthy†

* Department of Mathematical Sciences, Sweet Briar College, Sweet Briar, Virginia, USA
† Department of Psychiatry and Neurobehavioral Sciences, University of Virginia Health System, Charlottesville, Virginia, USA

Methods in Enzymology, Volume 467, ISSN 0076-6879, DOI: 10.1016/S0076-6879(09)67014-2. © 2009 Elsevier Inc. All rights reserved.

Contents
1. Introduction 358
1.1. Prevalence 359
1.2. Etiology of ADHD 360
1.3. Summary of problem 361
1.4. Comprehensive psychophysiological assessment 362
2. Bayesian Probability Algorithm 362
2.1. Methods 362
2.2. Results 367
3. The Value of Bayesian Probability Approach as a Meta-Analysis Tool 369
3.1. Methods 369
3.2. Results 373
4. Discussion and Future Directions 373
Acknowledgment 377
References 378
Abstract

Accurate diagnosis of attentional disorders such as attention-deficit hyperactivity disorder (ADHD) is imperative because there are multiple negative psychosocial sequelae related to undiagnosed and untreated ADHD. Early and accurate detection can lead to effective intervention and prevention of negative sequelae. Unfortunately, diagnosing ADHD presents a challenge to traditional assessment paradigms because there is no single test that definitively establishes its presence. Even though ADHD is a physiologically based disorder with a multifactorial etiology, the diagnosis has traditionally been based on a subjective history of symptoms. In this chapter we outline a stochastic method that utilizes Bayesian inference for quantifying and assessing ADHD. It can be used to combine a variety of psychometric tests and physiological markers into a single
standardized instrument that, at each step, refines a probability for ADHD for each individual based on information provided by the individual assessments. The method is illustrated with data from a small study of six female college students with ADHD and six matched controls, in which the method achieves correct classification for all participants, whereas none of the individual assessments was capable of achieving perfect classification. Further, we provide a framework for applying this Bayesian method to the meta-analysis of data obtained from disparate studies and using disparate tests for ADHD, based on calibration of the data onto a unified probability scale. We use this method to combine data from five studies that examine the diagnostic abilities of different behavioral rating scales and EEG assessments of ADHD, enrolling a total of 56 ADHD and 55 control subjects of different age groups and genders.
1. Introduction

Like most psychiatric disorders, the diagnosis of attention-deficit hyperactivity disorder (ADHD) relies on subjective criteria. Unlike a neurological condition such as stroke, in which examination and neuroimaging provide clear, objective criteria in diagnosis, ADHD lacks the "hard evidence" that aids in evaluation and treatment. The difficulty in clinical diagnosis is reflected in the frequent shifts in the diagnostic criteria for ADHD. For example, various versions of the Diagnostic and Statistical Manual of Mental Disorders (DSM), which is used by clinicians to diagnose ADHD, have all presented different conceptualizations of the disorder. Current DSM-IV diagnostic criteria for ADHD include a persistent pattern of inattention and/or hyperactivity–impulsivity that is more frequent and severe than is typically observed in individuals at a comparable level of development. Evidence of six of nine inattentive behaviors and/or six of nine hyperactive–impulsive behaviors must have been present before age 7, and must clearly interfere with social, academic, and/or occupational functioning. The most current criteria, from the Diagnostic and Statistical Manual of Mental Disorders, 4th Edition (DSM-IV; American Psychiatric Association, 1994), distinguish three subtypes of ADHD. One of the subtypes is termed ADHD, predominantly inattentive type, and is often referred to in the literature as ADD, or attention-deficit disorder, signifying that there is an absence of a majority of hyperactive or impulsive symptom criteria. Another subtype is termed ADHD, predominantly hyperactive–impulsive type, which specifies persons who demonstrate a majority of symptoms of hyperactivity and impulsivity, but not inattention. A third subtype is ADHD, combined type, the name used when a person meets diagnostic criteria for both ADHD, inattentive type, and ADHD, hyperactive–impulsive type. In other
words, someone diagnosed with ADHD, combined type, displays a majority of symptoms of both inattention and hyperactivity and impulsivity. The diagnosis of any of the three forms of ADHD must still be made exclusively by history, for no laboratory or psychological test or battery is available that provides sufficient sensitivity and specificity. Consequently, the diagnosis of ADHD is highly dependent on a retrospective report of a patient’s past behavior and subjective judgments on degree of relative impairment. Due to the subjective nature of assessment, precision in diagnosis has been elusive. Further complicating the matter, there are a large number and variety of procedures that purportedly assess ADHD. Among these are clinical interviews, rating scales, psychological and neuropsychological tests, observational assessment techniques, and medical procedures, such as MRI and EEG, each with their own variations, and many rating scales with parent, teacher, and self-report versions. However, no one procedure maps perfectly onto all of the DSM-IV criteria for diagnosis of ADHD. To arrive at an ADHD diagnosis, a combination of assessment procedures—a multimethod approach—is necessary (Anastopoulous and Shelton, 2001; DuPaul et al., 1992).
1.1. Prevalence

It is difficult to tell if the prevalence of ADHD per se has risen, but it is clear that the number of children identified with the disorder who obtain treatment has risen over the past decade. The US Centers for Disease Control estimates that approximately 4.6 million (8.4%) American children aged 6–17 years have at some point in their lives received a diagnosis of ADHD. Of these children, nearly 59% are reported to be taking a prescription medication (Pastor and Reuben, 2008). Rates of stimulant use have been growing fast in both the USA and Europe (Habel et al., 2005; Safer et al., 1996; Zito et al., 2000). Indeed, in the last 10 years, Germany has seen a 47-fold increase (Schwabe and Paffrath, 2006). But per capita stimulant consumption remains greater in the USA than in all of Europe. This increased identification and treatment seeking is due in part to greater media interest, heightened consumer awareness, and the availability of effective treatments. Within the USA, ADHD prevalence rates vary substantially across and within states. Reasons for variation in prevalence rates include changing diagnostic criteria over time, the frequent use of referred samples to estimate rates, variations in ascertainment in different settings, and the lack of a comprehensive, reliable, and cost-effective diagnostic assessment. Practitioners of all types vary greatly in the degree to which they use DSM-IV criteria to diagnose ADHD. Practice surveys among primary care pediatricians and family physicians reveal wide variations in practice patterns of using diagnostic criteria and methods. Statistics suggest that only one out of every three people who have an attention disorder gets help. Therefore, two out of three people who have an attention disorder never receive a diagnosis
or treatment (Monastra et al., 1999). Part of the dilemma is that the diagnosis of ADHD must still be made exclusively by history. Therefore, a significant problem is the lack of a systematic, reliable, comprehensive, and affordable assessment for ADHD (American Academy of Pediatrics, 2000). This problem is made more urgent by the fact that early recognition and management of this condition can redirect the educational and psychosocial development of most children with ADHD, thereby having a significant impact upon the well-being of a child accurately diagnosed with ADHD (Hinshaw, 1994; Klein and Mannuzza, 1991; Reiff et al., 1993; Weiss et al., 1985). According to the NIH Consensus Statement (National Institutes of Health Consensus Development Conference Statement, 2000), the diagnosis of ADHD can be made reliably using well-tested diagnostic interview methods. However, as of yet, there is no independent valid test for ADHD. Although research has suggested a central nervous system basis for ADHD, further research is necessary to firmly establish ADHD as a brain disorder. The Consensus Conference concluded that, after years of clinical research and experience with ADHD, knowledge about the cause or causes of ADHD remains largely speculative. Despite the plethora of research emerging in the last few years, there remains no biologically based method of diagnosis for ADHD. Indeed, the literature calls for studies that will illuminate the etiology of ADHD (Castellanos, 1997).
1.2. Etiology of ADHD

In spite of these well-documented problems, the etiology of ADHD remains methodologically difficult to study and has yielded inconsistent results (Barkley, 1990). One possibility for this is that, because of changing and inconsistent use of diagnostic classifications, very few researchers have screened for or tested diagnostically identical ADHD samples. Most investigators accept that ADHD exists as a distinct clinical syndrome and suggest a multifactorial etiology that includes neurobiology as an important factor. Zametkin and Rapoport (1987) identified 11 separate neuroanatomical hypotheses that have been proposed for the etiology of ADHD. A majority of studies have concluded that either delayed maturation or defects in cortical activation play large roles in the pathophysiology of ADHD. For example, studies of cerebral blood flow measured by single-photon emission computed tomography have demonstrated decreased metabolic activity in suspected attentional areas of the brain (Heilman et al., 1991) and indicated lower arousal in the mesial frontal areas (Lou et al., 1989). Recent anatomical studies have reported reduced bilateral regional brain volumes in specific and multiple subareas of the frontal cortex, which govern premotor and higher-level cognitive function (Mostofsky et al., 2002; Sowell et al., 2003). In addition, there appears to be reduced volume in the anterior temporal cortices, accompanied by bilateral increases in gray matter in the posterior temporal and inferior
parietal cortices (Sowell et al., 2003). These studies highlight the heterogeneity of the disorder and the neuropsychological constructs used to define the weaknesses associated with the disorder (i.e., executive function, working memory). These, as well as additional neurophysiological findings, have been interpreted as evidence of delayed maturation and cortical hypoarousal in regions of the prefrontal and frontal cortex. Unfortunately, while neuroanatomical findings support the notion that ADHD is a distinct clinical syndrome and add to our understanding of the etiology of ADHD, neuroimaging techniques are too expensive for general use, are restricted to a few centers, and lack clear specificity and sensitivity in the diagnosis of ADHD. One technique suggested by a National Institute of Mental Health committee as a possible method to identify functional measures of child and adolescent psychopathology (Jensen et al., 1993) is that of quantitative EEG. Compared to methods of functional neuroimaging (such as positron emission tomography or single photon emission computed tomography), quantitative EEG is easier to perform, less expensive, does not involve radioactive tracers, and is noninvasive (Kuperman et al., 1990).
1.3. Summary of problem

Diagnosing ADHD presents a challenge to traditional assessment paradigms because there is no single assessment tool or medical test that definitively establishes its presence. Because ADHD is considered to be a physiologically based disorder with a multifactorial etiology that includes neurobiology as an important factor, the recommended diagnostic procedure for ADHD relies on a multimethod assessment (American Academy of Pediatrics, 2000; Anastopoulous and Shelton, 2001; DuPaul et al., 1992; National Institutes of Health Consensus Development Conference Statement, 2000). Ideally, this assessment should consist of the following individual components: (a) behavior rating scales, (b) behavioral observations, (c) parent and teacher interviews, (d) neuropsychological assessment, (e) academic screening, and (f) EEG/brain imaging measures (Anastopoulous and Shelton, 2001; Barkley, 2002). The integrated results presumably converge to provide a composite judgment, considered to be a "best estimate" of the diagnosis and to be more accurate than any single source or assessment alone. However, the results from multiple measures and methods often do not clearly converge on a diagnosis, but rather provide contradictory information (Anastopoulous and Shelton, 2001; Barkley, 2002). Thus it is, most often, that the "best estimate" diagnosis is the subjective opinion of a clinician whose clinical judgment is influenced and limited by factors such as experience, resources, and prejudices. A reliable and comprehensive assessment that is founded on and determined by objective and standardized methods would provide a welcome advantage in diagnosing ADHD, as well as in evaluating the
effectiveness of treatments for ADHD. What is needed is not only a comprehensive, multimethod assessment approach, but also a strategy for reliably and consistently combining the results from these assessments in a way that provides a standardized computation of an accurate probability for diagnosis. An ideal strategy for achieving such results is a Bayesian probability approach to combining disparate assessments. As we will discuss in the following sections, this Bayesian approach can be utilized not only to combine disparate results related to diagnosing the same disorder, but also to combine different but related studies examining the same question in a meta-analysis format.
1.4. Comprehensive psychophysiological assessment

We propose and utilize a multimethod procedure that assesses symptoms of ADHD in various domains. Specifically, we employ: (a) standardized psychological questionnaires and ratings from the subjects, their caregivers, and teachers, to determine reported difficulty with cognitive transitions and dysregulation of behavior and attention in the form of ADHD symptoms; (b) prospective behavioral data collected using standardized tests to assess actual impairment on continuous performance tasks and additional tasks associated with poor self-regulation of attention and behavior, as well as standardized ratings from the subjects, parents, and blind raters of the subjects' behavior and performance; (c) physiological assessments in the form of EEGs to assess inconsistency of cognitive transition across multiple time dimensions; and (d) a comprehensive, yet flexible, assessment model combining data from multiple sources to address the complete DSM-IV criteria for ADHD, which is also reactive to treatment effects. This assessment procedure is designed to incorporate various tests and markers for ADHD, none of which alone could claim perfect sensitivity and specificity in diagnosing ADHD. In addition, an important feature of this sequential assessment is that it is test-order-invariant and can accommodate missing data. The formal framework of the combined assessment employs a Bayesian algorithm that allows for the linking of disparate ADHD assessment instruments, within a single study or across multiple studies, into one unified and objective stochastic assessment.
2. Bayesian Probability Algorithm

2.1. Methods

We begin with a general description of the Bayesian algorithm, followed by the method for standardizing the scores of different tests and assessments. We then provide a brief description of the studies and data that we use to illustrate these methods.
2.1.1. Standardizing the scores for different tests

The algorithm is based on the idea that on every assessment subjects earn certain test scores, where the magnitude of the score depends on whether the subject has ADHD, as well as on the severity of the disorder. Therefore, a subject with a certain condition (ADHD) is expected to yield a higher score (or a lower score, depending on the direction of the test), compared to a subject without that condition. However, the relationship between the condition and the test score is not always exact: it may happen that a subject without ADHD receives a score indicating ADHD, or vice versa. Thus, this relationship is probabilistic and is best quantified as a conditional probability of earning a certain score, given a preexisting condition, which is a value between 0 and 1. The exact conversion of the test scores to probabilities for ADHD depends on the assessment's range of scores, the direction of the test's scale (i.e., whether lower or higher scores are associated with ADHD), and cutoff values that separate the scores indicating ADHD from those indicating non-ADHD. The probability of earning a certain score on a test depends on the subject's condition, ADHD or non-ADHD, and the score's calibration translates into conditional probabilities for earning a given test score x, given a condition of ADHD or non-ADHD. Assume that a test for ADHD can generate a range of values from 0 to M, with scores greater than a certain value C, 0 < C < M, indicating ADHD. In this case, the mapping of a test score x on this test, 0 ≤ x ≤ M, to a probability for earning this score in case of ADHD can be computed as

$P(x) = P(x \mid \text{ADHD}) = \left(\frac{x}{M}\right)^{a}, \quad \text{where } a = \frac{\ln(0.5)}{\ln(C/M)},$   (14.1)

where the value of a is obtained from the condition that the cutoff value C is mapped to a probability of 0.5. That is, a is determined from the condition $P(C) = (C/M)^a = 0.5$, leading to $a = \ln(0.5)/\ln(C/M)$. For tests where scores lower than the cutoff value C indicate ADHD, the standardized probability is computed as

$P(x) = P(x \mid \text{ADHD}) = 1 - \left(\frac{x}{M}\right)^{a}, \quad \text{where } a = \frac{\ln(0.5)}{\ln(C/M)},$   (14.2)
mapping a score of 0 to a probability for ADHD equal to 1 and the maximal score M to a probability for ADHD equal to 0. Figure 14.1 depicts these two cases for different values of M and C. An alternative approach for standardizing test scores can be found in Robeva et al. (2004), where calibration of the probabilities is done by piecewise linear functions (with cutoff values again being mapped to a probability of 0.5). A brief discussion of the similarities and differences between these standardizations in the context of a more general approach can be found in Section 4.
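In code, the calibration of Eqs. (14.1) and (14.2) reduces to a single power function once the test's range, cutoff, and direction are known. A minimal Python sketch follows; the function and parameter names are ours, not from any software used in the original study.

```python
# Standardize a raw test score into P(x | ADHD) per Eqs. (14.1)/(14.2).
import math

def score_to_probability(x, M, C, high_indicates_adhd=True):
    """Map score x on a 0..M scale with cutoff C to a probability of ADHD.

    The exponent a is chosen so that the cutoff C maps exactly to 0.5.
    """
    a = math.log(0.5) / math.log(C / M)
    p = (x / M) ** a
    return p if high_indicates_adhd else 1.0 - p

# The two panels of Fig. 14.1:
print(score_to_probability(12, M=36, C=12))                              # 0.5 at cutoff
print(score_to_probability(40, M=100, C=40, high_indicates_adhd=False))  # 0.5 at cutoff
```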
Figure 14.1 Examples of test score standardization according to Eqs. (14.1) and (14.2). (A) Calibration for a test with M = 36, C = 12, and scores >12 indicating ADHD. In this case, $a = \ln(0.5)/\ln(C/M) = 0.63093$ and the function depicted in the figure is $P(x) = (x/36)^{0.63093}$. (B) Calibration for a test with M = 100, C = 40, and scores <40 indicating ADHD. In this case, a = 0.75647 and the function depicted in the figure is $P(x) = 1 - (x/100)^{0.75647}$.
2.1.2. Bayesian probability approach

Having obtained the probabilities for ADHD, the following Bayesian method, introduced by Robeva et al. (2004), allows for combining all of the different assessment scores into one combined standardized probability estimate. The general idea follows the classical theory of Bayesian inference (Carlin and Louis, 2000) in that before each experiment we have a prior probability of a certain
outcome (e.g., ADHD), followed by an update based on the experimental data, resulting in a posterior probability of the outcome. Assume that the results from n different tests for ADHD are available and that each test score has been standardized to reflect the corresponding probability for ADHD. For k = 1, 2, ..., n, denote the conditional probability for earning a score x on the kth test given ADHD by $P_{1,k}(x)$, and the conditional probability for earning this score given non-ADHD by $P_{2,k}(x)$, and use $P_{2,k}(x) = 1 - P_{1,k}(x)$. The process begins by assigning a prior probability for ADHD, $P_{ADHD}^{0}$, of 0.5. Heuristically, this means that, prior to the results of any tests, all subjects are considered to have a 50/50 chance of having ADHD. Next, the probabilities $P_{1,1}(x)$ and $P_{2,1}(x)$ corresponding to the result of the first test yielding a score x are used to calculate a posterior probability $P_{ADHD}^{1}$ for ADHD, given this subject's score x on the first test (k = 1), via the formula

$P_{ADHD}^{1} = \frac{P_{1,1}\, P_{ADHD}^{0}}{P_{1,1}\, P_{ADHD}^{0} + P_{2,1}\,(1 - P_{ADHD}^{0})}.$   (14.3)

The procedure continues recursively: after each step, the posterior probability becomes the prior probability for the next step. For example, taking into consideration the score from the second test, in the formula above $P_{ADHD}^{0}$ is replaced by $P_{ADHD}^{1}$, $P_{1,1}$ is replaced by $P_{1,2}$, and $P_{2,1}$ is replaced by $P_{2,2}$, producing as a result $P_{ADHD}^{2}$, which now incorporates the results from the first two tests. In general, the transition between steps i − 1 and i, for i = 1, 2, ..., n, is given by Eq. (14.4), with $P_{ADHD}^{i}$ representing the probability for ADHD based on the first i tests:

$P_{ADHD}^{i} = \frac{P_{1,i}\, P_{ADHD}^{i-1}}{P_{1,i}\, P_{ADHD}^{i-1} + P_{2,i}\,(1 - P_{ADHD}^{i-1})}.$   (14.4)
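Equation (14.4) folds directly into a loop over tests. The sketch below combines the standardization of Eqs. (14.5) and (14.6) with the recursive update; the two example scores are hypothetical and are not data from the study. Because each step multiplies in one test's likelihood contribution, the final probability is invariant to the order of the tests and tolerates missing assessments, matching the properties noted in Section 1.4.

```python
# Sequential Bayesian update of Eq. (14.4), starting from the neutral prior 0.5.
def combine_assessments(p1_values, prior=0.5):
    """p1_values: P(x | ADHD) for each test; P(x | non-ADHD) = 1 - P(x | ADHD)."""
    p = prior
    for p1 in p1_values:
        p2 = 1.0 - p1
        p = (p1 * p) / (p1 * p + p2 * (1.0 - p))   # Eq. (14.4)
    return p

# Hypothetical subject (not study data): WURS score of 45 and an EEG CI of 30%.
p_wurs = (45 / 100) ** 0.57572         # Eq. (14.5), approx. 0.63
p_ci = 1 - (30 / 100) ** 0.75647       # Eq. (14.6), approx. 0.60
print(combine_assessments([p_wurs, p_ci]))   # approx. 0.72 -> classified ADHD
```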
We next illustrate the value of this method on a small study involving 12 female college students.

2.1.3. Subjects: Bayesian probability approach

Six ADHD and six non-ADHD Caucasian college-age females engaged in a series of short concentration tasks (2–3 min) with shorter resting intervals (1–2 min), and EEG data were collected. The EEG Consistency Index (CI) (Cox et al., 1998; Kovatchev et al., 2001; Merkel et al., 2000) was calculated from these data, with the CI employing data from eight electrode sites: CZ, PZ, P3, P4, F3, F4, T3, and T4. In addition, all participants were administered the Wender–Utah Rating Scale (WURS) (Ward et al., 1993). Participants who presented with an ADHD diagnosis were further screened by completing the Brown ADD test (Brown, 1996; described below) to confirm their ADHD status. As a result of this screening, a total of six
women were selected for this group. All women in this sample were classified as ADHD combined type. The control sample was matched based on similarity to our already defined ADHD sample. In order to confirm the lack of ADHD symptoms and to rule out the possibility of ADHD, the potential control participants were also screened with the Brown ADD test. In cases where more than one individual matched our ADHD participants, those with the lowest Brown scores were selected from the pool of potential control subjects. Results of analyses of these subjects are reported in Robeva et al. (2004).
2.1.4. Procedure
The measures of symptoms of ADHD and EEG markers for ADHD used for the study are described below. The Brown ADD scale and the ADHD-Symptom Inventory (SI) were used for additional screening, while the WURS and the EEG CI were used to validate the proposed combined Bayesian assessment. The Brown ADD scale (Brown, 1996) is a 40-item self-report scale designed to serve as a preliminary screen for ADD and to assess for additional cognitive and affective impairments often associated with ADDs. Individuals rate the occurrences of behavior on a 4-point scale ranging from 0 (Never) to 3 (Almost daily). Individuals with ADHD are likely to score >40 on this scale. The WURS test is a 61-item retrospective self-report scale with adequate reliability and validity (Ward et al., 1993). Individuals rate the severity of ADHD symptoms experienced during childhood using a 5-point Likert scale. The score from the WURS (short form) ranges from 0 to 100, with scores >30 indicating ADHD. For adults, the WURS has been shown to be a valid retrospective screening and dimensional measure of childhood ADHD symptoms (Stein et al., 1999, 2000), to replicate and correlate with the Conners Abbreviated Parent and Teacher Questionnaire and demonstrate internal consistency reliability (Fossati et al., 2001), and to exhibit good construct validity (Weyandt et al., 1995). The EEG-CI is an EEG-based measure of ADHD (Cox et al., 1998; Kovatchev et al., 2001; Merkel et al., 2000). The CI ranges from 0% to 100%; a CI <40% indicates ADHD (Kovatchev et al., 2001). The CI of a person is computed using data from two adjacent disparate cognitive tasks. For this study, we used a previously published algorithm with a threshold parameter of 1.0 and no cutoff (Cox et al., 1998; Kovatchev et al., 2001; Merkel et al., 2000). Details can be found in the above referenced articles. Following Eqs. (14.1) and (14.2), the standardization of the WURS and EEG CI scores into conditional probabilities can now be done using the following functions:
WURS scale ($M = 100$, $0 \le x \le 100$, $C = 30$, scores >30 indicate ADHD):

$$P(x) = P(x \mid \mathrm{ADHD}) = \left(\frac{x}{100}\right)^{0.57572} \qquad (14.5)$$

EEG CI scale ($M = 100$, $0 \le x \le 100$, $C = 40$, scores <40 indicate ADHD, see Fig. 14.1B):

$$P(x) = P(x \mid \mathrm{ADHD}) = 1 - \left(\frac{x}{100}\right)^{0.75647} \qquad (14.6)$$
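The calibration itself is equally compact. The sketch below is our own illustration (names and structure are ours, not from the original paper); it implements the power-function standardization of Eqs. (14.1) and (14.2), using the exponent $a = \ln(0.5)/\ln(C/M)$ quoted in the caption of Fig. 14.1, and reproduces the values of Eqs. (14.5) and (14.6).

```python
import math

def p_given_adhd(x, M, C, high_indicates_adhd=True):
    """Standardize a raw score x on [0, M] with clinical cutoff C into
    P(x | ADHD), so that x = C maps exactly to a probability of 0.5."""
    a = math.log(0.5) / math.log(C / M)   # e.g., 0.57572 for the WURS
    p = (x / M) ** a
    return p if high_indicates_adhd else 1.0 - p

# WURS, Eq. (14.5): high scores indicate ADHD
print(round(p_given_adhd(33, 100, 30), 2))           # -> 0.53 (cf. 0.52-0.53
                                                     #    in Tables 14.1/14.2)
# EEG CI, Eq. (14.6): low scores indicate ADHD
print(round(p_given_adhd(12.5, 100, 40, False), 2))  # -> 0.79, as in Table 14.1
```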
2.1.5. Statistical analyses
Fisher's exact probability tests were used to evaluate the ability of each test to classify ADHD versus non-ADHD participants. T-tests, repeated measures ANOVA, and correlations were used to analyze the probabilities of ADHD computed at and after each test of the combined assessment.
2.2. Results
Table 14.1 presents the scores from each test (step of the combined assessment) as well as the probability of earning these scores given ADHD, as derived from the standardized scores for each test. The highlighted values in Table 14.1 identify subjects who were misclassified or remained unclassified by a test (e.g., probabilities for ADHD ≤ 0.5 for subjects in the ADHD group, or ≥ 0.5 for controls). Table 14.1 shows that the two tests each misclassified one control and one ADHD subject, which led to identical significance levels of p = 0.041 for both of the Fisher's exact probability tests. However, the tests misclassified different subjects; thus we can expect that a combination of the tests would be able to distinguish ADHD participants clearly from controls. Specifically, Table 14.1, in the columns headed Steps of the sequential stochastic assessment, presents the WURS (short form) and the CI scores of all participants, together with their respective conditional probabilities $P_1^{\mathrm{WURS}}$ and $P_1^{\mathrm{CI}}$ of earning this score on each of the specified tests, given ADHD. Table 14.2 presents the Bayesian assessment process, showing each participant's updated probability for ADHD after each test. The initial probabilities for all participants are 0.5; that is, we do not assume anything at the beginning of the assessment. At each step, a probability of ADHD >0.5 classifies the subject as ADHD, while a probability <0.5 classifies the subject as a control. It is evident that the classification improves when the second test is incorporated, achieving, in this case, correct classification for all subjects after the second test. After the first step (WURS), one ADHD and one control subject are misclassified. At Step 2, the CI corrects the classification of the misclassified subjects without compromising the classification of the rest of the subjects.
Table 14.1 Screening scores, scores from the tests included in the sequential stochastic assessment, and conditional probabilities of earning these scores given ADHD for all participants

                  Screening       Steps of the sequential stochastic assessment
                  Brown ADD       A: WURS                  B: CI
Group    ID(a)    scale score     Score    P1(WURS)        CI (%)    P1(CI)
ADHD     121      94              33       0.52            12.50     0.79
         401      58              37       0.55            12.50     0.79
         427      63              45       0.61            50.00     0.41*
         509      55              25       0.42*           12.50     0.79
         874      91              46       0.61             0.00     0.79
         712      64              56       0.69             0.00     1.00
Control  45       20               3       0.05           100.0      0.00
         191      15              12       0.20            50.0      0.41
         194       8               8       0.13            62.5      0.30
         268       6               2       0.03            37.5      0.52*
         500      32              35       0.54*           87.5      0.10
         640      17               9       0.15            87.5      0.10

Fisher's exact tests: exact significance (two-sided) = 0.041 for test A, and = 0.041 for test B.
(a) IDs generated randomly to ensure confidentiality. Values marked * (shown in bold in the original) indicate participants who were misclassified by that particular test.
Moreover, the average probabilities for ADHD increase over the combined assessment from 59% to 80% in the ADHD group, and decrease from 26% to 10% in the control group, thus increasing the separation between the groups (Table 14.2). To confirm that this increase is significant, we conducted a 2 × 2 (ADHD–control) × (pre–post combined assessment) repeated measures ANOVA. This analysis resulted in a significant interaction effect, F = 18.2, p = 0.002, df = 1,10, demonstrating that even with this small sample size the application of the combined assessment resulted in significantly better separation of the two groups. The increase in separation is further confirmed by the increase of the t-values (from 4.376 to 10.094) of the t-tests comparing ADHD versus control probabilities for ADHD at each step (Table 14.2). There is a highly significant correlation between the final probabilities for ADHD and the Brown behavioral scale used as a screening measure: r = 0.91, p < 0.001. In addition, the partial correlation between the final probabilities for ADHD and the Brown scale used for screening, controlling
Table 14.2 Sequential psychophysiological assessment: at Step 0 all subjects are assigned equal probabilities of ADHD, which are then refined by the sequential tests

                         Sequential probabilities of ADHD
                         Step 0:           Step 1:        Step 2:
                         Initialization    WURS           CI
Group    ID              P0(ADHD)          P1(ADHD)       P2(ADHD)
ADHD     121             0.5               0.53           0.81
         401             0.5               0.56           0.83
         427             0.5               0.63           0.54
         509             0.5               0.45*          0.76
         874             0.5               0.64           0.87
         712             0.5               0.72           1.00
Control  45              0.5               0.13           0.00
         191             0.5               0.30           0.22
         194             0.5               0.23           0.12
         268             0.5               0.11           0.11
         500             0.5               0.55*          0.11
         640             0.5               0.25           0.03
t-value, significance    –                 t = 4.376,     t = 10.094,
                                           p = 0.001      p < 0.00001

Values marked * (shown in bold in the original) indicate subjects who were misclassified.
for the first step of the assessment (WURS), is r = 0.69, p = 0.018. This indicates that, despite a significant initial correlation between the screening and Step 1 of the assessment (r = 0.86, p < 0.001), Step 2 makes a sizable additional contribution over Step 1. Thus, the combined approach not only separates control from ADHD subjects better than its components, but also yields a better agreement (collinearity) between the final ADHD score and the baseline diagnosis.
3. The Value of the Bayesian Probability Approach as a Meta-Analysis Tool
3.1. Methods
When analyzed separately, data from multiple studies may provide only limited information with limited clinical generalizability, due to small sample size, differing assessments, and a limited scope focused on a specific age/gender subject group. Often, no gender or age comparisons are possible
within a small study. An important feature of the sequential assessment is that it can be used to unify separate studies, each of which may utilize different assessments or tests, thereby increasing the total number of subjects in a meta-analysis and the diversity of the studied population. As a result, the power of the statistical tests increases, new comparisons between subgroups of the unified large sample become possible, and the clinical validity and generalizability of the results increase. In this section we show how our proposed Bayesian model may be employed as a method for combining not only related diagnostic tests, but also disparate studies that include such tests.
3.1.1. Subjects
We use results from five different studies designed to validate the use of the EEG CI as a physiological marker for ADHD, which in the process employed multiple other assessments, including behavioral rating scales. The studies were conducted over the course of several years in the same laboratories and under supervision by the same team of investigators. All five studies included at least one biological assessment in the form of EEG data and at least one ADHD symptom questionnaire, such as the WURS (Ward et al., 1993), the AD/HD Rating Scale-IV (DuPaul et al., 1998), or the ADHD-SI (Cox et al., 1998). The inclusion/exclusion criteria were consistent for all five studies and can be found in Penberthy et al. (2005). All individuals in the ADHD group were tested off their ADHD medication, discontinued at least 36 h prior to the EEG testing under the supervision of a physician. We briefly outline these studies below, providing references to the original sources for the details.
Study I. Four boys (ages 6–10) with ADHD and four age-matched control boys had their EEG data acquired during two 30-min tasks separated by a 5-min break. For the ADHD boys, this procedure was repeated 3 months later, to assess test–retest reliability. The EEG CI was based on these data and calculated from information obtained from four electrode sites: CZ, PZ, P3, and P4. Parents completed the ADHD-SI. Results are reported in Cox et al. (1998).
Study II. Seven ADHD males and six non-ADHD males, ages 18–25, participated in a double-blind, placebo versus methylphenidate controlled crossover design study. ADHD subjects had to have previously taken methylphenidate but could not be taking any medication for their condition within the 6 months prior to the study. EEG data were acquired while the subjects were given four tasks of the Gordon Diagnostic System, two easy (auditory and visual) and two hard (auditory and visual). The EEG CI was calculated based upon these data obtained from CZ, PZ, P3, and P4, and the relative power of the frequency bands computed as in Study I. Subjects
and their parents completed the ADHD-SI. Results are reported in Merkel et al. (2000) and in Cox et al. (2000).
Study III. Eighteen boys and 17 girls, ages 8–16, classified as either ADHD or non-ADHD, had EEG data collected for 36 min while performing various tasks. The EEG CI was based on these data and calculated from information obtained from eight electrode sites: CZ, PZ, P3, P4, F3, F4, T3, and T4. The relative power of the frequency bands was computed based on the same thresholds employed in Studies I and II. Parents and teachers completed the ADHD-SI. Results are reported in Kovatchev et al. (2001).
Study IV. This is the study of 12 college-age females described in Section 2.1.2.
Study V. Seventy-seven children, ages 8–12, were administered EEGs while watching a movie for 20 min, resting with eyes open for 5 min, reading silently for 10 min, resting with eyes open for 5 min, then performing creative drawing tasks for 10 min. This pattern was repeated once, for a total test time of 100 min. The EEG CI was calculated by contrasting the EEG during the video and reading tasks and during the reading and divergent thinking tasks, utilizing information collected from CZ, PZ, P3, and P4. Parents and teachers were administered the AD/HD Rating Scale-IV (DuPaul et al., 1998), and parents completed the ADHD-SI. Results are reported in Kalbfleisch (2001).
3.1.2. Procedure
To illustrate the utility of the Bayesian probability approach as a meta-analysis tool, we combined data from the five studies described above. For each study, we included at least one measure of symptoms of ADHD and the EEG CI. The symptom measures used for the meta-analysis are the WURS (described in Section 2.1.4), the ADHD-SI (Cox et al., 1998), and the AD/HD Rating Scale (RS)-IV (DuPaul et al., 1998); we describe the latter two next.
The ADHD-SI is an 18-item scale developed from DSM-IV criteria for ADHD and was introduced by Cox et al. (1998). The ADHD-SI is a measure of symptom severity, and is scored on a 3-point Likert scale with higher scores representing higher severity of symptoms. The ADHD-SI correlates highly with other rating scales assessing hyperactivity and inattentive behavior and exhibits good discriminative power (Cox et al., 1998; Merkel et al., 2000). The score on the ADHD-SI ranges from 0 to 36, with scores >12 indicating ADHD (Cox et al., 1998).
The AD/HD Rating Scale-IV is similar to the ADHD-SI, both scales having been developed independently and concurrently at different laboratories. This rating scale has demonstrated adequate reliability and validity (DuPaul et al., 1998). The scale items reflect the DSM-IV criteria, and respondents are asked
to indicate the frequency of each symptom on a 4-point Likert scale. The Home and School Versions of the scale both consist of two subscales: Inattention (nine items) and Hyperactivity–Impulsivity (nine items). The manual provides information regarding the factor analysis procedures used to develop the scales, as well as information regarding the standardization, normative data, reliability, validity, and clinical interpretation of the scales. The score ranges from 0 to 100, with scores >93 indicating ADHD (DuPaul et al., 1998).
Following Eqs. (14.1) and (14.2), the standardization of ADHD-SI and AD/HD RS-IV scores into conditional probabilities is done using the following functions:
ADHD-SI scale ($M = 36$, $0 \le x \le 36$, $C = 12$, scores >12 indicate ADHD, Fig. 14.1A):

$$P(x) = P(x \mid \mathrm{ADHD}) = \left(\frac{x}{36}\right)^{0.63093} \qquad (14.7)$$

AD/HD Rating Scale-IV ($M = 100$, $0 \le x \le 100$, $C = 93$, scores >93 indicate ADHD):

$$P(x) = P(x \mid \mathrm{ADHD}) = \left(\frac{x}{100}\right)^{9.55134} \qquad (14.8)$$

After the data from these five studies are standardized using Eqs. (14.5)–(14.8), we can use the Bayesian algorithm described above to perform a meta-analysis of the data. In this case, the algorithm is employed just as in the case of a single study, except that this time we can use scores from all measures for ADHD used in the studies. In our illustration, we will use the ADHD-SI, AD/HD RS-IV, WURS, and the EEG CI. As before, when transitioning between steps $i-1$ and $i$, for $i = 1, 2, \ldots, k$, where k is the total number of tests used for the analysis, the probability $P^{i}_{\mathrm{ADHD}}$, representing the probability for ADHD based on the first i tests, is computed from Eq. (14.4). Since, however, not all tests have been administered to all subjects in the combined dataset, the following modification is needed to allow for the inclusion of studies for which scores from test i are not recorded (illustrated in the short code sketch that follows Section 3.1.3 below):

$$P^{i}_{\mathrm{ADHD}} = \begin{cases} \dfrac{P_{1,i}\,P^{i-1}_{\mathrm{ADHD}}}{P_{1,i}\,P^{i-1}_{\mathrm{ADHD}} + P_{2,i}\,(1 - P^{i-1}_{\mathrm{ADHD}})} & \text{if a score for test } i \text{ is recorded} \\[2ex] P^{i-1}_{\mathrm{ADHD}} & \text{if a score for test } i \text{ is not recorded} \end{cases} \qquad (14.9)$$

3.1.3. Statistical analyses
T-tests were used to compare the probabilities for ADHD estimated by the combined tests across the ADHD versus non-ADHD groups. Three-way ANOVA was used to elucidate the effect of age and gender on ADHD/non-ADHD classification. Two-way ANOVA was used to elucidate the effect of study on ADHD/non-ADHD classification.
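As referenced above, Eq. (14.9) reduces to Eq. (14.4) when a score is present and leaves the probability untouched when it is not. Below is our own minimal sketch of this modified update; representing a missing test by None is our choice for the illustration, not a convention from the original paper, and the example scores are illustrative rather than taken from a specific subject.

```python
def meta_update(prior, p1):
    """Eq. (14.9): apply the Bayesian update only if test i was recorded."""
    if p1 is None:                 # score for test i not recorded
        return prior               # probability carries over unchanged
    p2 = 1.0 - p1
    return (p1 * prior) / (p1 * prior + p2 * (1.0 - prior))

def combined_probability(p1_scores, prior=0.5):
    for p1 in p1_scores:           # order: ADHD-SI, AD/HD RS-IV, WURS, CI
        prior = meta_update(prior, p1)
    return prior

# A subject tested only with the WURS and CI (as for the girls >16 group):
# the probability stays at 0.5 until the WURS step.
print(round(combined_probability([None, None, 0.59, 0.79]), 2))  # -> 0.84
```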
3.2. Results
To further illustrate the Bayesian algorithm described in Section 2.1.2, Table 14.3 presents the sequential probabilities for ADHD based on the sequence of tests ADHD-SI, AD/HD RS-IV, WURS, and the EEG CI. Notice that, since the results of all the different tests are standardized, it is possible to combine ADHD/control scores across different studies and groups. Table 14.3 presents the process of increasing separation of ADHD and control subjects along the steps of the Bayesian algorithm. With more tests, the difference in the mean classification probabilities for ADHD and controls increases within each gender and age group, achieving better separation at the end of the procedure (Table 14.3, column $P^{4}_{\mathrm{ADHD}}$). Notice also that, according to Eq. (14.9), the probability for ADHD remains unchanged when a certain test is not available for a particular group. For the group of girls who are >16 years of age, the probability for ADHD remains at 0.5 until the WURS is taken into consideration, since the ADHD-SI and AD/HD RS-IV were not administered for this age group and no test scores were available. Table 14.4 exemplifies how the effects of age and gender, as well as the effect of the different studies combined for the meta-analysis, on the classification into ADHD/non-ADHD groups weaken with the incorporation of more tests. The interaction effects between age/gender or study and ADHD/non-ADHD classification diminish progressively, from being highly significant at the first test to becoming negligible after the fourth test. This indicates that the accumulation of data from multiple tests gradually eliminates the effects of confounding factors such as gender and age, and any between-study differences. This justifies the validity of meta-analysis using scores from multiple tests.
Table 14.3 Mean probabilities for ADHD by age and gender groups for the sequential Bayesian assessment

                          Step 1: SI     Step 2: AD/HD RS-IV   Step 3: WURS    Step 4: CI
Group                     P1(ADHD)       P2(ADHD)              P3(ADHD)        P4(ADHD)
Boys ≤ 16    ADHD         0.5860         0.6441                0.6441          0.7199
             Control      0.3824         0.0615                0.0615          0.0961
             t, p         t = 4.87,      t = 7.90,             t = 7.90,       t = 9.62,
                          p = 0.000      p = 0.000             p = 0.000       p = 0.000
Boys > 16    ADHD         0.7188         0.7188                0.7188          0.9357
             Control      0.1924         0.1924                0.1924          0.2310
             t, p         t = 5.12,      t = 5.12,             t = 5.12,       t = 4.55,
                          p = 0.002      p = 0.002             p = 0.002       p = 0.003
Girls ≤ 16   ADHD         0.7707         0.7707                0.7707          0.8704
             Control      0.3208         0.3208                0.3208          0.3676
             t, p         t = 7.90,      t = 7.90,             t = 7.90,       t = 7.10,
                          p = 0.000      p = 0.000             p = 0.000       p = 0.000
Girls > 16   ADHD         0.500          0.500                 0.5883          0.8022
             Control      0.500          0.500                 0.2605          0.1002
             t, p         –              –                     t = 4.38,       t = 10.09,
                                                               p = 0.002       p = 0.000

Table 14.4 Significance of age–gender and study effects on ADHD/control classifications

                                          Three-way interactions:   Two-way interactions:
                                          ADHD–age–gender           ADHD–study
Sequential tests                          F         p               F         p
ADHD-SI                                   22.879    0.000           54.595    0.000
ADHD-SI + AD/HD RS-IV                     4.281     0.041           6.322     0.000
ADHD-SI + AD/HD RS-IV + WURS              0.115     0.735           1.059     0.381
ADHD-SI + AD/HD RS-IV + WURS + EEG-CI     0.221     0.639           0.730     0.574

4. Discussion and Future Directions
Accurately and reliably diagnosing ADHD presents challenges because there is currently no single assessment tool or medical test that can definitively diagnose it. What do exist, however, are numerous assessments and tests of varying design. Some are checklists of symptoms that caregivers such as parents and teachers complete, some are behavioral or neuropsychological tests that the child completes in an office, some are written evaluations of behavior or symptoms based on the clinician's observations, and some are based on physiological data such as MRI or EEG readings. Each assessment has its own scoring system, criteria, and format for administration, and, unfortunately, none of these individual assessment tools has been shown to be 100% accurate in diagnosing ADHD. This is to be expected, however, since ADHD is considered to be a physiologically based disorder with a
multifactorial etiology that includes neurobiology as an important factor, and it would not be easily classified by only one assessment tool. In fact, the reliability of the ADHD diagnosis based on one method or test alone is quite low, and lower still when chance agreement is considered. For example, previous research has found 78% agreement between a structured interview and a diagnosis of ADHD (Welner et al., 1987), and 70–80% accuracy (with considerable variation depending on age range) of laboratory measures of attention in correctly predicting an ADHD diagnosis (Fischer et al., 1995). Thus, as reported by Angold et al. (1999), this results in problems with both overdiagnosis and underdiagnosis of ADHD. What is needed is a methodology for combining disparate assessments and tests, in order not only to provide a more accurate diagnosis of the individual, but also to enable the combination of multiple studies of ADHD assessments, thus increasing the sample size and providing more power, generalizability, and possibilities for cross-sectional comparisons. Other researchers are now utilizing the Bayesian approach to diagnose ADHD and producing interesting preliminary results. For example, Foreman et al. (2009) recently reported using a Bayesian approach to model the various parameters of the Development and Well-Being Assessment (DAWBA) in order to justify its utilization in primary care settings in the UK. They determined that using the DAWBA in primary care may improve access to accurate diagnosis of ADHD (Foreman et al., 2009). Similarly, researchers continue to look for novel methods to classify and predict diagnoses of ADHD in the fields of imaging and genetics that will more closely link assessment data with underlying neurobiological markers (Castellanos and Tannock, 2002).
In our own research utilizing the Bayesian approach described above, we have successfully combined different individual assessments to produce a more reliable and accurate individual diagnosis of ADHD. To be used for combining the results of disparate tests, the Bayesian approach requires a method for standardization of the test results. In general, this standardization is a mapping of test scores into probabilities that is required to follow some basic principles: (1) it should preserve the direction of the original scale (i.e., if scores higher than the threshold are indicative of ADHD, the probabilities for ADHD should increase with increasing test score, and should decrease if test scores lower than the cutoff are indicative of ADHD); (2) it should be a monotone function; and (3) it should map the value recommended as a cutoff for each test to 0.5, and the minimal and maximal score values to 0 or 1. (For the power functions used here, the cutoff property holds by construction: $P(C) = (C/M)^{\ln 0.5/\ln(C/M)} = e^{\ln 0.5} = 0.5$.) Clearly, these basic requirements can be achieved by multiple mappings, including the power functions used in this chapter. We already mentioned that the piecewise linear mapping defined in Robeva et al. (2004) was used in our earlier work, generating similar qualitative results (Penberthy et al., 2005; Robeva et al., 2004). However, we adopted the use of power functions in this chapter, since this eliminates the problem of generating probabilities with different increase/decrease rates on the two sides of the cutoff value. Regardless, claiming any one specific mapping to be preferable or superior to any other would be speculation at this point. We are currently working on finding a more comprehensive answer to this question that includes the use of mappings capable of accommodating the "gray zone" ranges of some tests, as well as translating these ranges into respective "gray zones" on the standardized scales. In each of the five studies described earlier, we standardized and combined disparate assessments, including a behavioral assessment of ADHD, such as the ADHD-SI, AD/HD Rating Scale-IV, or WURS, and an EEG assessment of ADHD, the CI. These assessments were joined, within each study, using a Bayesian algorithm, resulting in a combined probability for ADHD for each subject. In general, this combined probability presents a better assessment of ADHD than each of the separate tests it includes. Such a procedure is especially useful in situations such as diagnosing ADHD, when there is no single conclusive assessment, but rather a number of imperfect tests that marginally address the outcome of interest, and where researchers may have multiple related tests performed on a single subject, which they wish to combine into a more comprehensive assessment of this subject. Equally important, once the data output from each individual study is standardized, the Bayesian approach allows data to be combined across different studies, thus producing a method for meta-analysis. In addition to significantly increasing the sample size, this approach allows the data to be examined in subgroups divided by age and/or gender, diagnostic group, etc. In our example, in all five studies, the subjects were classified into groups of ADHD versus non-ADHD. However, different studies focused on different
age and gender groups. The standardization of the data allowed cross-sectional analyses, which were not possible with the original data. For example, we found that within each age and gender group the Bayesian algorithm increases the separation between the ADHD and control groups with the incorporation of more test scores. We also found that the accumulation of more tests diminishes the effect of age and gender on the ADHD/non-ADHD classification. Based on our review and research, we propose that a viable alternative to a single definitive measure of ADHD is a combination of measures, equipped with a method for refining the results from one test with the results from another, and yielding a compounded assessment that works better than each of its separate components. This concept of combining test outcome with intuitive knowledge and expert opinion is well developed mathematically. Bayesian methods provide a way to combine probabilistic and experimental-data reasoning, as well as convenient tools for creating sequential evaluation procedures that refine the outcome assessment with every subsequent step. These procedures are especially useful in situations where there is no single conclusive assessment, but rather a number of imperfect tests that marginally address the outcome of interest. Various individual ADHD assessment tools, including rating scales and physiological assessments, have not proven to be as accurate in diagnosing ADHD as a comprehensive, standardized, objective, yet flexible and adaptive, assessment package for ADHD that can incorporate multifactorial assessments. The proposed assessment does not aim at replacing any established practices for the screening and diagnosing of ADHD, but instead at demonstrating that the outcomes of related studies can be combined in a manner that allows meta-analysis of different types of data which may not be collected in the same manner in each study, and which can include physiological data as well as symptom reports. It should be emphasized that the application of the proposed meta-analysis tool is not limited to the specific tests used in the discussed studies; the meta-analysis is capable of accommodating a variety of other tests. Specifically, almost any individual test or assessment could be employed within this model, assuming that the output of such a test can be standardized into probabilities for the specific disorder or disease. As such, this meta-analysis procedure may provide a much-needed tool for combining related studies with similar or disparate tests and assessments in a number of research areas, which may otherwise have small, less generalizable studies of limited power.
ACKNOWLEDGMENT
The authors thank Boris Kovatchev from the University of Virginia for a consultation regarding the statistical analyses.
REFERENCES
American Academy of Pediatrics (2000). Clinical practice guideline: Diagnosis and evaluation of the child with attention-deficit/hyperactivity disorder. Pediatrics 105, 1158–1170.
American Psychiatric Association (1994). Diagnostic and Statistical Manual of Mental Disorders. 4th edn. American Psychiatric Association, Washington, DC.
Anastopoulos, A. D., and Shelton, T. L. (2001). Assessing Attention-Deficit/Hyperactivity Disorder. Kluwer Academic/Plenum Publishers, New York.
Angold, A., Costello, E. J., Farmer, E., Burns, B., and Erkanli, A. (1999). Impaired but undiagnosed. J. Am. Acad. Child Adolesc. Psychiatry 38, 129–137.
Barkley, R. A. (1990). A critique of current diagnostic criteria for attention deficit hyperactivity disorder: Clinical and research implications. J. Dev. Behav. Pediatr. 11, 343–352.
Barkley, R. A. (2002). Attention Deficit Hyperactivity Disorder: A Handbook for Diagnosis and Treatment. Guilford Press, New York.
Brown, T. E. (1996). Brown Attention Deficit Disorder Scales: Manual. The Psychological Corporation, San Antonio, TX.
Carlin, B. P., and Louis, T. A. (2000). Bayes and Empirical Bayes Methods for Data Analysis. 2nd edn. Chapman & Hall/CRC, Washington, DC.
Castellanos, F. X. (1997). Toward a pathophysiology of attention-deficit/hyperactivity disorder. Clin. Pediatr. (Phila) 36, 381–393.
Castellanos, F. X., and Tannock, R. (2002). Neuroscience of attention-deficit/hyperactivity disorder: The search for endophenotypes. Nat. Rev. Neurosci. 3, 617–628.
Cox, D. J., Kovatchev, B. P., Morris, J. B., Phillips, C., Hill, R., and Merkel, L. (1998). Electroencephalographic and psychometric differences between boys with and without Attention-Deficit/Hyperactivity Disorder (ADHD): A pilot study. Appl. Psychophysiol. Biofeedback 23, 179–188.
Cox, D. J., Merkel, R. L., Kovatchev, B., and Seward, R. (2000). Effect of stimulant medication on driving performance of young adults with attention-deficit hyperactivity disorder: A preliminary double-blind placebo controlled trial. J. Nerv. Ment. Dis. 188, 230–234.
DuPaul, G. J., Anastopoulos, A. D., Shelton, T. L., Guevremont, D. C., and Metevia, L. (1992). Multi-method assessment of attention deficit hyperactivity disorder: The diagnostic utility of clinic-based tests. J. Clin. Child Psychol. 21, 394–402.
DuPaul, G. J., Power, T. J., Anastopoulos, A. D., and Reid, R. (1998). ADHD Rating Scale—IV: Checklists, Norms, and Clinical Interpretation. Guilford Press, New York.
Fischer, M., Newby, R. F., and Gordon, M. (1995). Who are the false negatives on Continuous Performance Tests? J. Clin. Child Psychol. 24, 427–433.
Foreman, D., Morton, S., and Ford, T. (2009). Exploring the clinical utility of the Development And Well-Being Assessment (DAWBA) in the detection of hyperkinetic disorders and associated diagnoses in clinical practice. J. Child Psychol. Psychiatry 50, 460–470.
Fossati, A., Di Ceglie, A., Acquarini, E., Donati, D., Donini, M., Novella, L., and Maffei, C. (2001). The retrospective assessment of childhood attention-deficit hyperactivity disorder in adults: Reliability and validity of the Italian version of the Wender Utah Rating Scale. Compr. Psychiatry 42, 326–336.
Habel, L. A., Schaefer, C. A., Levine, P., Bhat, A. K., and Elliott, G. (2005). Treatment with stimulants among youths in a large California health plan. J. Child Adolesc. Psychopharmacol. 15, 62–67.
Heilman, D., Voeller, K., and Nadeau, S. (1991). A possible pathophysiologic substrate of attention deficit hyperactivity disorder. J. Child Neurol. 6(Suppl.), S74–S79.
Hinshaw, S. P. (1994). Attention Deficits and Hyperactivity in Children. Sage, Thousand Oaks, CA.
Jensen, P. S., Koretz, D., Locke, B. Z., Schneider, S., Radke-Yarrow, M., Richters, J. E., and Rumsey, J. M. (1993). Child and adolescent psychopathology research: Problems and prospects for the 1990s. J. Abnorm. Child Psychol. 21, 551–581.
Kalbfleisch, M. L. (2001). Electroencephalographic (EEG) differences between boys with average and high aptitude with and without attention deficit hyperactivity disorder (ADHD) during task transitions. Dissertation Abstr. Int. Sect. B Sci. Eng. 62(1-B), 96.
Klein, R. G., and Mannuzza, S. (1991). Long-term outcome of hyperactive children: A review. J. Am. Acad. Child Adolesc. Psychiatry 30, 383–387.
Kovatchev, B. P., Cox, D. J., Hill, R., Reeve, R., Robeva, R. S., and Loboschefski, T. (2001). A psychophysiological marker of Attention Deficit/Hyperactivity Disorder: Defining the EEG consistency index. Appl. Psychophysiol. Biofeedback 26, 127–139.
Kuperman, S., Gaffney, G. R., Hamdan-Allen, G., Preston, D. F., and Venkatesh, L. (1990). Neuroimaging in child and adolescent psychiatry. J. Am. Acad. Child Adolesc. Psychiatry 29, 159–172.
Lou, H., Henriksen, L., Bruhn, P., Borner, H., and Nielsen, J. (1989). Striatal dysfunction in attention deficit and hyperkinetic disorder. Arch. Neurol. 46, 48–52.
Merkel, R. L., Cox, D. J., Kovatchev, B. P., Morris, J., Seward, R., Hill, R., and Reeve, R. (2000). The EEG consistency index as a measure of Attention Deficit/Hyperactivity Disorder and responsiveness to medication: A double blind placebo controlled pilot study. Appl. Psychophysiol. Biofeedback 25, 133–142.
Monastra, V. J., Lubar, J. F., Linden, M., VanDeusen, P., Green, G., Wing, W., Phillips, A., and Fenger, T. N. (1999). Assessing attention deficit hyperactivity disorder via quantitative electroencephalography: An initial validation study. Neuropsychology 13, 424–433.
Mostofsky, S. H., Cooper, K. L., Kates, W. R., Denckla, M. B., and Kaufmann, W. E. (2002). Smaller prefrontal and premotor volumes in boys with attention-deficit/hyperactivity disorder. Biol. Psychiatry 52, 785–794.
National Institutes of Health Consensus Development Conference Statement (2000). Diagnosis and treatment of attention-deficit/hyperactivity disorder (ADHD). J. Am. Acad. Child Adolesc. Psychiatry 39, 182–193.
Pastor, P. N., and Reuben, C. A. (2008). Diagnosed attention deficit hyperactivity disorder and learning disability: United States, 2004–2005. Vital Health Statistics, Vol. 10. National Center for Health Statistics.
Penberthy, J. K., Cox, D., Breton, M., Robeva, R., Kalbfleisch, M. L., Loboschefski, T., and Kovatchev, B. (2005). Calibration of ADHD assessments across studies: A meta-analysis tool. Appl. Psychophysiol. Biofeedback 30(1), 31–51.
Reiff, M. I., Banez, G. A., and Culbert, T. P. (1993). Children who have attentional disorders: Diagnosis and evaluation. Pediatr. Rev. 14, 455–465.
Robeva, R., Penberthy, J. K., Loboschefski, T., Cox, D., and Kovatchev, B. (2004). Sequential psycho-physiological assessment of ADHD: A pilot study of a Bayesian probability approach illustrated by appraisal of ADHD in female college students. Appl. Psychophysiol. Biofeedback 29(1), 1–18.
Safer, D. J., Zito, J. M., and Fine, E. M. (1996). Increased methylphenidate usage for attention deficit disorder in the 1990s. Pediatrics 98, 1084–1088.
Schwabe, U., and Paffrath, D. (2006). Arzneiverordnungs-Report 2006. Springer, Berlin.
Sowell, E. R., Thompson, P. M., Welcome, S. F., Henkenius, A. L., and Toga, A. W. (2003). Cortical abnormalities in children and adolescents with attention-deficit hyperactivity disorder. Lancet 362, 1699–1707.
Stein, M. A., Fischer, M., and Szumowski, E. (1999). Evaluation of adults for ADHD. J. Am. Acad. Child Adolesc. Psychiatry 38, 940–941.
Stein, M. A., Fischer, M., and Szumowski, E. (2000). Evaluation of adults for ADHD: Erratum. J. Am. Acad. Child Adolesc. Psychiatry 39, 674.
Ward, M. F., Wender, P. H., and Reimherr, F. W. (1993). The Wender Utah Rating Scale: An aid in the retrospective diagnosis of childhood attention-deficit-hyperactivity disorder. Am. J. Psychiatry 150, 885–890.
Weiss, G., Hechtman, L., Milroy, T., and Perlman, T. (1985). Psychiatric status of hyperactives as adults: A controlled prospective 15-year follow-up of 63 hyperactive children. J. Am. Acad. Child Adolesc. Psychiatry 24, 211–220.
Welner, Z., Reich, W., Herjanic, B., and Jung, K. G. (1987). Reliability, validity, and parent–child agreement studies of the Diagnostic Interview for Children and Adolescents (DICA). J. Am. Acad. Child Adolesc. Psychiatry 26(5), 649–653.
Weyandt, L. L., Linterman, I., and Rice, J. A. (1995). Reported prevalence of attentional difficulties in a general sample of college students. J. Psychopathol. Behav. 17, 293–304.
Zametkin, A. J., and Rappaport, J. L. (1987). Neurobiology of attention deficit disorder with hyperactivity: Where have we come in 50 years? J. Am. Acad. Child Adolesc. Psychiatry 26, 676–686.
Zito, J. M., Safer, D. J., dosReis, S., Gardner, J. F., Boles, M., and Lynch, F. (2000). Trends in the prescribing of psychotropic medications to preschoolers. JAMA 283, 1025–1030.
C H A P T E R
F I F T E E N
Simple Stochastic Simulation
Maria J. Schilstra* and Stephen R. Martin†

Contents
1. Introduction
2. Understanding Reaction Dynamics
3. Graphical Notation
4. Reactions
5. Reaction Kinetics
   5.1. Second-order reactions
   5.2. First-order reactions
   5.3. Pseudo-first-order reactions
   5.4. Aside
6. Transition Firing Rules
   6.1. Ground rules
   6.2. First-order reactions
   6.3. Multiple options
   6.4. Pseudo-first-order and second-order reactions
7. Summary
8. Notes
References
* Biological and Neural Computation Group, Science and Technology Research Institute, University of Hertfordshire, Hatfield, United Kingdom
† Division of Physical Biochemistry, MRC National Institute for Medical Research, London, United Kingdom

Abstract
Stochastic simulations may be used to describe changes with time of a reaction system in a way that explicitly accounts for the fact that molecules show a significant degree of randomness in their dynamic behavior. The stochastic approach is almost invariably used when small numbers of molecules or molecular assemblies are involved because this randomness leads to significant deviations from the predictions of the conventional deterministic (or continuous) approach to the simulation of biochemical kinetics. Advances in computational methods over the three decades that have elapsed since the publication of Daniel Gillespie's seminal paper in 1977 (J. Phys. Chem. 81, 2340–2361) have allowed researchers to produce highly
sophisticated models of complex biological systems. However, these models are frequently highly specific to the particular application, and their description often involves mathematical treatments inaccessible to the nonspecialist. For anyone completely new to the field, applying such techniques in their own work might seem at first sight a rather intimidating prospect. However, the fundamental principles underlying the approach are in essence rather simple, and the aim of this article is to provide an entry point to the field for a newcomer. It focuses mainly on these general principles, both kinetic and computational, which tend to be not particularly well covered in the specialist literature, and shows that interesting information may even be obtained using very simple operations in a conventional spreadsheet.
1. Introduction
Over the past two decades, a number of important single molecule techniques have emerged and been applied to research in biology. These techniques have given researchers the ability to obtain information about single molecules—or molecular assemblies—that could not be obtained from measurements on large ensembles of molecules. Single molecule methods have facilitated the study of the movement of molecular motors along actin or microtubules; the characterization of RNA/DNA-based motors (polymerases, topoisomerases, and helicases); the dynamic growth/shortening behavior of individual microtubules; the motion of individual ribosomes as they translate single messenger RNA hairpins; the movement of single biomolecules in a membrane or viruses on a cell surface; and the behavior of single proteins or ligands both within living cells and interacting with cell surface receptors. The data generated from these experiments contain kinetic, statistical, and spatial information and allow one to understand how the molecules behave, either individually or—by averaging many data sets—as an ensemble. Quantitative interpretation of these observations requires the construction of dynamic models, in which all components that are thought to be essential for the functioning of the system under study act and interact in a manner that might resemble the original. These models may then be used to simulate the dynamics of the real-world system. If "test runs" under conditions that mimic the experimental ones give the expected results, one may challenge the model by applying different conditions, and check whether its predictions are borne out by the behavior of the real system. One may also make a quantitative comparison between the observed and simulated data, and thereby refine the model parameters to achieve improved agreement with the experimental observations. The traditional approach to modeling dynamic biochemical systems is based on the law of mass-action, and relies for its validity on three major
assumptions. These are that the reaction volume is spatially homogeneous (so that any small subvolume is indistinguishable from any other), that it is well stirred (so that the chance of any two molecules colliding is the same throughout the volume), and that it contains a large number of molecules. Because interactions between molecules are random processes, it is impossible to predict the exact time at which a reaction involving one or two molecules or particles will occur. In systems with a large number of interacting molecules, this random behavior is averaged out and the overall state of the system in terms of the concentrations of the components becomes totally predictable. It is this property that enables the traditional deterministic simulation approach to be employed. The starting point for such simulations is the set of coupled ordinary differential equations (ODEs) which describe the time dependence of the concentrations of the different chemical species involved. When the parameters and initial concentrations have been defined, a numerical integrator is used to calculate the concentrations as a function of time. (For a single unimolecular reaction X → Y with rate constant k, for example, there is just one such equation, $d[\mathrm{X}]/dt = -k[\mathrm{X}]$, with the familiar closed-form solution $[\mathrm{X}](t) = [\mathrm{X}]_0\, e^{-kt}$.) Deterministic simulations will always produce the same "predetermined" result for any given set of starting conditions because they model reactions as continuous fluxes of material and implicitly include the assumption that one is dealing with a very large number of molecules. If only small numbers of molecules are involved, then the stochastic fluctuations are no longer averaged out, and the evolution of the system—in terms of the position, concentration, or number of species—from any given set of starting conditions will never be exactly the same. These stochastic fluctuations are clearly of enormous importance in single molecule studies, where properties of the system may often be inferred from the distributions of experimentally observed values. Stochastic fluctuations are equally prominent in cellular processes, which generally occur in very small volumes, often involve relatively small numbers of molecules, and are neither spatially homogeneous nor well stirred. The deterministic approach cannot be used in such situations, and these systems must be modeled using a stochastic formulation of their chemical kinetics. Stochastic formulations include the random aspects of the system, as well as its deterministic characteristics that yield a predictable average behavior. "Monte Carlo" computer simulations, which use random number generators to incorporate randomness, but take account of the probabilities that particular events will occur, may then be used to generate time courses that single molecules or particles may follow. These probabilities are calculated from exactly the same physical properties of the system—rate constants, etc.—as those used in the deterministic approach. The exponential rise in citations over the last decade (Fig. 15.1) of the paper that presents the algorithm for performing such Monte Carlo simulations, now known as Gillespie's Direct Method (Gillespie, 1977), parallels the revolution in the development of single molecule techniques (Cornish and Ha, 2007). Further analysis of the citation data suggests that the citations mostly originate from papers whose subject areas are classified as (biological)
Figure 15.1 Number of citations per year for Gillespie (1977), based on a search of the Science Citation Index-Expanded and the Conference Proceedings Citation Index-Science (Thomson Reuters, Web of Science). The paper has been cited 1286 times in the period from 1977 to 2008. The solid line shows that the increase has been exponential from 1996 (17 citations) to 2008 (275 citations).
physics, mathematics, or computer science. Perhaps not surprisingly, the papers that contain an analysis, explanation, justification, or refinement of the technique are often laden with equations and notions from statistical physics. As a result, the impression is created that producing stochastic formulations of biochemical reaction systems is the preserve of mathematicians and physicists, and requires one to be almost superhumanly conversant with probability theory and statistics. Although this may be the case for those who wish to analyze and base generally applicable theory on such formulations, we contend that setting up stochastic models for the sole purpose of simulation is not only straightforward, but also engaging, and above all, enlightening. Our view is that anyone who is able to calculate the residual amount of ³²P in a stock of radioactive ATP a week after receiving it can also set up and perform a stochastic simulation of a biochemical reaction system. Moreover, we believe that hands-on experience with quantitative model building and stochastic simulation may significantly deepen one's insight into the basic principles of chemical kinetics. The aim of this article is to provide an entry point to the field for a newcomer. It focuses mainly on explaining the origin and meaning of the fundamental equations of Gillespie's original¹ exact algorithm (Gillespie, 1976), now known as the First Reaction Method. We intentionally focus on the modeling of very simple systems, because our primary purpose here is to show that stochastic simulations are both easy to understand and relatively easy to implement. Although the ability to write some basic computer code is a considerable advantage, it is, in fact, the case that simple applications can quite often be implemented in a spreadsheet program. Many situations one might wish to simulate are, of course, much more complex than those described here, but most of the fundamental principles we discuss remain exactly the same.
¹ Gillespie's (1976) paper is usually credited as the first publication of this method, but a very similar algorithm, applied to the simulation of Ising spin systems, was published a year earlier (Bortz et al., 1975). To our knowledge, the method was independently (re-)discovered at least twice in the period before 1995, when the Gillespie paper was relatively unknown (Bayley et al., 1990; Kleutsch and Frehland, 1991).
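Before developing the machinery formally, here is the flavor of the calculation in a few lines of Python. This example is ours, not from the chapter: a single unimolecular channel, for which a Gillespie-style simulation reduces to drawing exponentially distributed waiting times; the rate constant and molecule count are arbitrary illustrative choices.

```python
import random

# The 32P warm-up: fraction of a stock remaining one week after delivery,
# given the half-life of 14.3 days quoted in the next section.
print(0.5 ** (7 / 14.3))          # ~0.71, i.e., roughly 71% is left

def stochastic_decay(n_x, k, t_end):
    """Simulate the unimolecular reaction X -> Y, one event at a time.
    With n_x molecules present, the total event rate is k * n_x."""
    t, trace = 0.0, [(0.0, n_x)]
    while n_x > 0 and t < t_end:
        t += random.expovariate(k * n_x)  # waiting time to the next event
        n_x -= 1                          # one x changes state to y
        trace.append((t, n_x))
    return trace

# 100 molecules, k = 0.05 per day: individual runs fluctuate around the
# deterministic curve n_x(t) = 100 * exp(-0.05 * t)
print(stochastic_decay(100, 0.05, 2.0)[:3])
```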
2. Understanding Reaction Dynamics
Even seemingly very complex transformations in cell biology can usually be broken down into a series of elementary physical and chemical reactions. A reaction is a process in which one or more "species" change into other species. Species that are consumed in a reaction are the reactants; those that are formed are the reaction products. Members of a particular species—atoms, ions, molecules, or molecular assemblies—have the same properties as one another, but are different from the members of other species. More generally, species embody state, and reactions state transitions. A state is characterized by a finite lifetime: finite meaning nonzero—that is, measurable—and the lifetime of a state is the average time that particles spend in that state. For example, the radionuclide ³²P has a half-life of 14.3 days, from which one may calculate that its atoms have an average life of approximately 20 days (the mean lifetime is the half-life divided by ln 2: 14.3/0.693 ≈ 20.6 days) before decaying to ³²S. Many individual atoms will, of course, exist as ³²P for significantly shorter periods, and many will survive for much longer. In contrast, the transition of a single ³²P atom to the ³²S state is effectively instantaneous. A process that occurs instantaneously is called an event. Appreciation of the notions of state and state transition, lifetime, and event is the key to understanding how the dynamics of chemical, and therefore biochemical, reactions are modeled.
Each state (and thus each species) is, by definition, restricted to some kind of container, and movement from one container into another involves a state transition. A container may be a cellular compartment, such as the cytoplasm or the nucleus, but more commonly is simply the vessel in which an in vitro reaction is occurring. Containers are usually three-dimensional, and have volume, but containers with fewer spatial dimensions are also possible. Two-dimensional containers have an area, rather than a volume, whereas one-dimensional ones have a length. Two-dimensional containers are used, for example, in models of diffusion of proteins in membranes, and one-dimensional ones in models of the motion of particles along a track. To avoid confusion we will only consider three-dimensional containers, but
most concepts and expressions are, with the appropriate substitutions, equally applicable to two- and even one-dimensional containers. In the following, we will give species names that begin with capital letters (e.g., X, Y), and indicate specific instances of a species (molecules, assemblies, etc. in a specific state) with lower case letters (x, y). The number of instances or items of a particular species in a compartment is written as n subscripted with the species name ($n_X$, $n_Y$). Square brackets around the species name indicate concentration ([X], [Y]), the number of items expressed in moles per unit of volume:

$$[\mathrm{X}] = \frac{n_X}{N_A V} \qquad (15.1)$$
Here, V is the volume of the container, and $N_A$ is Avogadro's Number, $6.023 \times 10^{23}$ items per mole. In biochemistry, concentration is generally expressed in molar (abbreviated M), which is the same as moles per liter (mol/L). Unless specified otherwise, reactions will be indicated as $R_{XY}$, where the subscript identifies the reactants.²
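As a quick worked example of Eq. (15.1) (the numbers below are our own, chosen only for illustration):

```python
N_A = 6.023e23   # Avogadro's Number, items per mole (as given above)

def molar_concentration(n_items, volume_litres):
    """Eq. (15.1): [X] = n_X / (N_A * V), in moles per liter (M)."""
    return n_items / (N_A * volume_litres)

# 1000 molecules in a roughly bacterium-sized container of 1e-15 L:
print(molar_concentration(1000, 1e-15))   # ~1.66e-06 M, i.e., ~1.7 uM
```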
3. Graphical Notation
Throughout this article, we will use Petri-nets to depict species and reactions. The Petri-net format was invented³ to describe discrete distributed systems, in which objects are transformed into other objects in multiple processes that occur simultaneously and interact with each other. Petri-nets are used for many different purposes, from the design of complex computer software to the management of manufacturing systems. They are, however, also eminently suitable for depicting biochemical reaction systems at the level of interactions between molecules. The reason we prefer to illustrate our argument with Petri-nets rather than equivalent, often more compact, chemical reaction schemes is the evocative way in which the Petri-net notation and terminology emphasize the discrete nature of biochemical reactions, and draw attention to network structure and its associated dependencies.
² Note that in the formulation used in this paper reversible reactions are always specified as two separate reactions in which the reactants of one reaction are the products of the other.
³ The Petri-net notation is named after its inventor, C. A. Petri (http://www.scholarpedia.org/article/Petri_net). Petri-nets are sometimes called place-transition or P/T graphs.
A Petri-net is a directed bipartite graph; a graph being a mathematical concept that may be visualized as a set of symbols, such as circles or boxes, connected by lines or arrows. The symbols are called "nodes" or "vertices"; the lines are "arcs" or "edges." Each arc connects two nodes. Arcs may have arrowheads, and point from one node to the other, which makes the graph directed. "Bi-partite" indicates that there are two kinds of nodes, and that nodes of the first type can only have connections to nodes of the other type. In the following we shall use the classic, somewhat idiosyncratic Petri-net terminology. The two types of nodes are called place and transition nodes, or more simply, places and transitions. Places represent states (species), and transitions represent processes that involve a state transition (reactions). Places are usually represented by circles or ellipses, transitions by rectangles, squares, and sometimes simply by straight line segments. Arcs that point from place to transition nodes are called input arcs; arcs that point away from a transition toward a place are output arcs. Likewise, places that are connected to transitions via input arcs are called input places; those that are connected via output arcs are the output states of the transition. Input and output places therefore represent reactants and reaction products.
Place nodes may contain tokens, which are often shown as small circles or dots inside the node. Tokens count the items (molecules, assemblies, etc.) that are in the state represented by the place. Although each token represents one item, the tokens in a given place are indistinguishable as they are not associated with any specific item. In the simple Petri-net example in Fig. 15.2A, the single reactant in reaction (transition) RX is species (place) X, and the product is species Y. There are three instances (tokens) of species X in the compartment or container (which is not specifically indicated), and none of Y. Reaction RXZ in Fig. 15.2B has two reactants, X and Z, and one product (Y). Arcs have weights. The default arc weight is 1, but other numbers are allowed, provided they are nonnegative integers. Default weights are not normally indicated in the graph, but other weights are. In Fig. 15.2C and D, the input arcs of RX, RX1, RX2, and the output arc of RY have higher weights (namely 2), whereas all other arcs have a default weight of 1. It may, by now, be apparent that the Petri-net notation closely corresponds to the conventional chemical reaction notation: the example Petri-nets depict the reactions X → Y (A), X + Z → Y (B), 2X → Y (C), and the two reactions 2X → Y and 2X → Z (D), with the input and output arc weights corresponding to the reactant and product stoichiometry in the reactions.
Figure 15.2 Simple Petri-net examples and explanation of the symbols. These Petri-nets can be used to represent the chemical reactions X → Y (A), X + Z → Y (B), 2X → Y (C), and the two reactions 2X → Y and 2X → Z (D). [The diagrams themselves are not reproducible in text; their legend identifies the symbols as place (species, state), transition (reaction), token (species instance, item), and arc with weight (stoichiometry), and marks transitions as enabled or disabled.]
Transitions make the Petri-net dynamic, as they can fire. When a transition fires, tokens are removed from its input place or places, and new tokens are deposited in its output place(s). The number of tokens removed and deposited is equal to the weight of the connecting arcs. Thus, upon firing of RX in Fig. 15.2C, two tokens will be removed from X, and one deposited in Y. Likewise, when RY fires (Fig. 15.2C) one token will be removed from Y, and two deposited in X. It is important to realize that tokens do not move from place to place: they are simply counters that have no identity of their own. Transitions can fire if, and only if, there is a sufficient number of tokens in each of its input places, that is, if the number of tokens in the input places is
equal to or greater than the weights of the associated input arcs. Therefore, RX in Fig. 15.2A is enabled, because there are three tokens in X. Transition RXZ in B, however, cannot fire, because even though there are enough tokens in X, there are none in Z. Similarly, in Fig. 15.2C only RX is enabled; RY, which needs at least one token in its input place Y, is disabled. After the firing of RX and the accompanying removal of two tokens, there will be one token left in X, not enough for RX to fire again. However, one token will have appeared in Y, thereby enabling transition RY. When RY fires after having become enabled, two tokens will be put back in X and one removed from Y, so that firing can go on ad infinitum. In contrast, RX (Fig. 15.2A) can fire three times in succession, but no more.

Apart from the rule that a transition can only fire if all of its input places contain a sufficient number of tokens, the basic Petri-net definition specifies no further rules for transition firing. In Fig. 15.2D, both transitions RX1 and RX2 can fire. However, they cannot fire simultaneously (this would require the presence of four tokens in X), and firing of RX1 will disable RX2, and vice versa. To avoid such conflicts, different firing rules have been
established for different Petri-net applications. For Petri-nets that represent chemical reaction systems, the logical option is to apply the rules that underlie chemical reaction kinetics. These rules are based on the principle that, although it is impossible to predict exactly when an individual species item will undergo a chemical reaction, the likelihood that it will do so within a given time interval can be computed. According to these rules, the reaction that happens first (and thereby possibly prevents other reactions that could also have occurred) is chosen on the basis of a weighted lottery. Because these firing rules are grounded in chemical reaction kinetics, we summarize the basic principles of reaction kinetics in the following sections.
4. Reactions

Most chemical reactions have either one or two reactants. Reactions with a single reactant appear to happen spontaneously, whereas reactions with two reactants require a collision between the two participants. Reactions that require a three-body collision do exist but are uncommon, simply because a three-body collision is a rare event. Radionuclide decay, such as the reaction ³²P → ³²S, is an example of a one-reactant, or unimolecular, reaction. Examples from biochemistry include conformational changes (or isomerizations) and dissociation of protein–ligand complexes. The formation of a complex between two species, such as a protein and a ligand, is an example of the very common two-reactant, or bimolecular, reaction. Although many important complexes in biochemistry contain multiple components, these are almost invariably assembled through a series of individual reactions with just two reactants. Likewise, many complex reaction schemes, such as the enzyme-catalyzed conversion of one or more substrates into one or more reaction products, consist of a series of unimolecular and bimolecular events.
5. Reaction Kinetics

To understand reaction dynamics, it is convenient to consider what happens to a single "central" molecule that takes part in a reaction. We will first consider the case in which the central molecule takes part in a second-order bimolecular reaction, and then discuss what happens in first-order unimolecular reactions. Whereas the terms unimolecular and bimolecular indicate the number of reactants, the designations first-order and second-order refer to the sum of the powers to which the concentration terms in the rate equation are raised, as we will clarify below.
5.1. Second-order reactions

If the central molecule is a reactant of type X in the bimolecular reaction X + Z → Y, as in Fig. 15.2B, it needs to collide with ("find") its reaction partner, Z, before the reaction can take place. The frequency with which type Z molecules collide with the single central molecule, x, is proportional to the concentration of Z: if there are twice as many molecules of the reaction partner in the compartment in which the reaction takes place, the collision frequency will be double. Conversely, if a given number of Z molecules become distributed over a volume that is twice as large, the collision frequency will be halved. The collision frequency also depends on factors such as the temperature and the viscosity of the medium in which the molecules move. Moreover, in few reactions, if any, does every collision lead to a reaction. In biochemical reactions, which often involve large molecules, the reactants must at least be in a favorable orientation with respect to each other, so that only a fraction of the total number of collisions between reactants results in an actual reaction. We will assume, for now, that the environmental factors that affect the collision frequency and the fraction of successful reactions are constant. In this case, the frequency of "successful" encounters (that is, the number of successful encounters within a given period) between the central molecule and its reaction partner would also be proportional to the concentration of the reaction partner (if, of course, the central molecule were not consumed by the reaction). If there are, say, 1000 items of the same species as the central molecule, the probability that any one of those will undergo a successful collision and react with a reaction partner is 1000 times greater.⁴ Therefore, the rate at which the reaction proceeds, that is, the rate at which the reactant molecules are consumed, is proportional to the number of items of one of the species taking part, and to the concentration of the other:
$$-\frac{dn_X}{dt} = -\frac{dn_Z}{dt} = k\,n_X[Z] = k\,\frac{n_X n_Z}{N_A V} \qquad (15.2)$$
Here −dn_X/dt and −dn_Z/dt are the rates at which X and Z items disappear from the medium (a plus sign would indicate appearance). The rate is measured in number of items per unit of time. If time is measured in seconds, and volume in liters, the dimensions of the expressions on both sides of the equation are s⁻¹ (as the numbers n_X and n_Z are dimensionless), and the dimension of the proportionality factor k is M⁻¹ s⁻¹. The proportionality factor is called the rate constant,⁵ and equations that express the rate

⁴ Provided the molecules do not hinder each other. Significant hindrance necessitates the introduction of a correction factor, but this is beyond the scope of the present discussion.
⁵ Following the recommendations in the IUPAC Compendium of Chemical Terminology, we use k as the symbol for rate constants.
at which the quantity of one particular species changes, are rate equations. The reaction rate may also be expressed in terms of concentration by dividing both sides of the equation by N_A·V:
$$-\frac{d[X]}{dt} = -\frac{d[Z]}{dt} = k[X][Z] \qquad (15.3)$$
Note that this operation alters the dimensions of the whole expression to M/s, but leaves the value and dimensions of the rate constant unchanged. Reactions with rate equations of the form of Eq. (15.3) are called second-order reactions, with k as a second-order rate constant, because the total power to which the concentrations of all reactant species are raised is 2 (i.e., 1 + 1). Reactions in which two molecules of the same species react, for example, in the dimerization reaction RX in Fig. 15.2C, are also second-order reactions. In this case, two items disappear following a successful collision. Since a molecule cannot react with itself, each molecule has only n_X − 1 other molecules with which it can react, and Eq. (15.2) then becomes:
$$-\frac{dn_X}{dt} = 2k\,\frac{n_X(n_X - 1)}{N_A V} \qquad (15.4)$$
If there are many molecules, (n_X − 1) ≈ n_X, and the reaction rate may be expressed as:
$$-\frac{dn_X}{dt} = 2k\,n_X[X] \qquad (15.5)$$
Of course, it is also possible to incorporate V or N_A·V in the value of k, as k′ = k/V or k′ = k/(N_A·V). In these cases, the dimensions of k′ are mol⁻¹ s⁻¹ or molecule⁻¹ s⁻¹, and its value depends on the volume of the compartment in which the reaction takes place. It is very important, however, to realize that a second-order rate constant expressed in mol⁻¹ s⁻¹ hides a spatial dimension, and that this concealment may lead to severe confusion. It is advisable to always report the values of second-order rate constants in M⁻¹ s⁻¹, not least because their values may then be compared with those of other second-order reactions. Reactions that require a collision between two molecules are limited by the velocity at which the individual molecules move around. The average velocity of the particles may be estimated from their diffusion coefficients, which, in turn, depend on the temperature and viscosity of the medium. As a rule of thumb, the maximum value for a second-order rate constant at 25 °C in water is of the order of 10⁹ M⁻¹ s⁻¹ (Atkins, 1994). Reported or observed second-order rate constants with values that are significantly greater than 10⁹ M⁻¹ s⁻¹ should therefore be treated with suspicion.
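To make the volume dependence concrete, the conversion between the two conventions can be sketched as follows (our illustration, not from the chapter; the rate constant value is an arbitrary but plausible example):

```python
# Sketch (illustrative, not the chapter's code): converting a second-order
# rate constant k (M^-1 s^-1) into the per-molecule form k' = k/(N_A*V),
# whose numerical value depends on the compartment volume V.

N_A = 6.022e23  # Avogadro's number, mol^-1

def per_molecule_rate_constant(k_molar, volume_litres):
    """Return k' in molecule^-1 s^-1 for a k given in M^-1 s^-1."""
    return k_molar / (N_A * volume_litres)

k = 1.0e6  # M^-1 s^-1, an assumed protein-ligand association rate constant
for V in (1.0e-15, 8.0e-18):  # 1 fL, and the 8e-18 L used later in Fig. 15.9
    kp = per_molecule_rate_constant(k, V)
    print(f"V = {V:.1e} L -> k' = {kp:.3e} molecule^-1 s^-1")
```

The same k thus yields very different per-molecule values in different volumes, which is exactly the "hidden spatial dimension" warned about above.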
5.2. First-order reactions

As collisions are events, it is easy to picture bimolecular reactions as collections of events that happen after the reactants have spent some time moving around in the medium. Unimolecular reactions, in contrast, do not generally or obviously require collisions: they appear to happen spontaneously. Nonetheless, state transitions that involve a single particle as a reactant, such as the reaction X → Y in Fig. 15.2A, are often also best described as events that happen spontaneously after the particle has spent some (so far indeterminate) amount of time in the reactant state, here X. For example, out of a population of 1000 ³²P atoms, approximately 500 would have decayed to ³²S after 14.3 days, but this does not, of course, mean that each of these 500 atoms took 14.3 days to make the full transition. Many unimolecular biochemical transitions also appear to happen instantaneously. Just as in the bimolecular case, the probability per unit time that some molecule or particle in the population will make the transition from state X to state Y is directly proportional to the number of particles present, and the rate at which items of X disappear (−dn_X/dt) is therefore given by:
$$-\frac{dn_X}{dt} = k\,n_X \qquad (15.6)$$
Dividing both sides by N_A·V gives:

$$-\frac{d[X]}{dt} = k[X] \qquad (15.7)$$

The proportionality constant k is, in this case, a first-order rate constant, measured in s⁻¹, and the unimolecular reaction on which the proportionality is based is of the first order, because it involves the concentration of one species, raised to the power of one. Unlike second-order rate constants, first-order ones do not incorporate a volume factor, either explicitly or implicitly. First-order reaction rates are independent of the volume of the compartment in which the reactions take place. Like second-order rate constants, first-order ones also have upper limits to their values. The fastest reactions in biochemistry are those that involve the transfer of energy, electrons, or protons between well-positioned donors and acceptors, such as the chlorophylls and cytochromes in photosynthetic reaction centers. The values of the first-order rate constants for these very fast reactions are of the order of 10¹³ down to 10⁷ s⁻¹. Most other biochemical processes are likely to have significantly smaller rate constants.
5.3. Pseudo-first-order reactions

Suppose that, in a reaction that follows Eq. (15.2) or (15.3), the concentration of Z changes hardly or not at all when the reaction takes place (i.e., d[Z]/dt ≈ 0 or d[Z]/dt = 0). This would happen if [Z] were so much larger than
[X] that even when all items of X had reacted, the number of Z items would remain almost unaffected. In this case, the factor [Z] may be incorporated into a new rate constant, k′ = k[Z], giving:
$$\frac{d[Z]}{dt} = 0 \;\Rightarrow\; -\frac{dn_X}{dt} = k\,n_X[Z] = k'n_X; \qquad -\frac{d[X]}{dt} = k'[X] \qquad (15.8)$$
Here, the shapes of −dn_X/dt and −d[X]/dt are the same as those in Eqs. (15.6) and (15.7), and k′ is called a pseudo-first-order rate constant. Like real first-order rate constants, k′ is measured in s⁻¹ (M⁻¹ s⁻¹ × M), as it incorporates the dimension (molar) of [Z].
5.4. Aside

Rate constants are a measure of the reactivity of the participant(s) in a reaction, and depend on external factors such as temperature, pH, and ionic strength. Second-order rate constants also depend on the viscosity of the medium. If one's aim is to predict concentration changes that occur in a reaction system, whether by evaluating ODEs or by carrying out stochastic simulations, knowledge of the values of the rate constants is sufficient, but essential. It is not necessary to specifically consider any of the factors that are amalgamated in the rate constants, or to consider the situation at the level of collisions. Similarly, observation of concentration changes in a reaction system only yields information on the values of the rate constants. However, the values of the rate constants may provide important, but indirect, clues about the molecular properties of the reactants, or about the environment in which they were obtained. In any case, the effect (if any) of changing environmental conditions on the rate constants must be assessed and taken into account.
6. Transition Firing Rules

In the above, we have established the following:

1. Both first- and second-order reactions involve instantaneous state transitions.
2. Individual items (molecules, complexes, assemblies, etc.) that act as reactants in a reaction spend a certain amount of time in this "reactant state" before they undergo the state transition.
3. The frequency at which a state transition occurs is proportional to the number of items of the single reactant in the case of a first-order reaction, and to the number of items of one reactant and the concentration of the other in the case of a second-order reaction. The proportionality constants are called the rate constants for the reactions.
Although it is not possible to predict exactly when any particular item (or items) will react, it is possible to use the rate constants to compute the probability that it will do so within a given period of time.
6.1. Ground rules

Consider Eq. (15.6), which says that the difference dn_X between the number of X molecules, n_X, present at the beginning, t₀, and the end, t₁, of a very short period, dt, is proportional to n_X (dt = t₁ − t₀ and dn_X = n_X,t=1 − n_X,t=0, where n_X,t=0 is n_X at time t₀, and n_X,t=1 is n_X at time t₁). Equation (15.6) is a relatively simple differential equation that has an analytical solution: an equation that expresses how many X molecules (n_X) are still left in the reaction volume at any time into the reaction.⁶ The equation is:

$$n_X = n_X^0\, e^{-kt} \qquad (15.9)$$
Here, n_X⁰ is the number of X molecules at the beginning of the reaction, k is the first-order rate constant for the reaction, and t is the time into the reaction. By dividing both sides by n_X⁰, we know what fraction of the initial amount is still present at time t (namely n_X/n_X⁰ = e^(−kt)), and which fraction has already reacted (1 − n_X/n_X⁰ = 1 − e^(−kt)). Therefore, if there are 10⁶ molecules of X at the start of the reaction, and k is 10 s⁻¹, after 1 ms, (1 − e^(−0.001·10)) × 100% ≈ 0.995%, or approximately 9950 molecules, will have reacted and disappeared from the compartment. After 10 and 50 ms, the percentages are 9.52 (9.52 × 10⁴ molecules) and 39.3 (3.93 × 10⁵ molecules). Note that, although the absolute quantities will be different for different values of n_X⁰, the percentages will always be the same. Thus, if there are (1 × 10⁶ − 3.93 × 10⁵) = 6.07 × 10⁵ molecules (60.7%) left after 50 ms, there will be 60.7% of 6.07 × 10⁵, or 3.68 × 10⁵, molecules left after another 50 ms. Therefore, if we know k and n_X at any point in time, we can predict how many molecules will be left after a given time interval. Equation (15.9) is exactly valid only when the number of items is effectively infinite. It also expresses the average that would be obtained for many observations ("experiments") on a finite number of molecules. Focusing now on a single X molecule, x, we can predict the odds that it will have disappeared after, for example, 50 ms. As we have seen above, about 39.3% of all X molecules that were present at the beginning will have disappeared by the end of the interval, and the chance that x is among those is therefore also 39.3%. If x is indeed still present at the end of the interval,
⁶ Readers who are unfamiliar with differential equations may just accept that Eq. (15.9) follows from Eq. (15.6) with n_X = n_X⁰ at t = 0.
the probability that it will react within the next 50 ms is, again, 39.3%. This may be expressed as an equation:

$$F = 1 - e^{-k\,\Delta t} \qquad (15.10)$$
Here F is the probability that an item has undergone the state transition in the time interval Δt = t − t₀, and k is the first-order rate constant for the reaction. Equation (15.10) may be used, inter alia, to compute the half-life, t½, of a species for which k is known: this is the time at which 50% of the molecules or other items have reacted (i.e., F = 0.5), namely t½ = ln(2)/k. In the example in which k = 10 s⁻¹, t½ is 69 ms. It can be shown⁷ that the average life span, or lifetime, τ, of the species is equal to 1/k, and that the standard deviation of the life span is equal to the value of τ itself. This means that at Δt = τ, some 63.2% of the amount present at t₀ has disappeared, and 36.8% is still left, as e^(−kτ) = 1/e ≈ 0.368. We will use this knowledge, especially Eq. (15.10), in the formulation of the firing rules for the transitions in the Petri-nets that represent biochemical reactions.
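These relationships are easy to verify numerically. The following minimal Python sketch (ours, not part of the original chapter) reproduces the 39.3% figure, the half-life, and the mean lifetime for k = 10 s⁻¹:

```python
import math

# Evaluating Eq. (15.10) with the chapter's example value k = 10 s^-1.
k = 10.0  # first-order rate constant, s^-1

def fraction_reacted(k, dt):
    """Probability F that a single item has reacted within dt seconds."""
    return 1.0 - math.exp(-k * dt)

print(fraction_reacted(k, 0.050))    # ~0.393: 39.3% react within 50 ms
print(math.log(2) / k)               # half-life t_1/2 = ln(2)/k ~ 0.069 s
print(1.0 / k)                       # mean lifetime tau = 1/k = 0.1 s
print(fraction_reacted(k, 1.0 / k))  # ~0.632 reacted at Delta_t = tau
```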
6.2. First-order reactions

Imagine an object that can undergo multiple sequential, irreversible first-order state transitions, X1 → X2, X2 → X3, etc., until it reaches a final state Y, as illustrated in the Petri-net in Fig. 15.3. This could be a simple model of a protein that undergoes a number of conformational changes, an electron that hops from center to center in an electron transfer chain, or a molecular motor that moves along a track. In this example, we will make the rate constants for all reactions the same: k_X1 = k_X2 = … = k = 10 s⁻¹. Initially, place X1 contains a token, indicating that there is one item in state X1. Upon firing of transition RX1, this token will disappear, and a new one will be deposited in X2 (the weight of each arc is the default, 1). The questions one might pose are how long it takes, on average, for a molecule to undergo the full transformation from X1 to Y, and what the spread in the arrival times is. One seemingly logical approach would be to divide the total estimated reaction time into small steps and calculate the probability of the event occurring within a time step. For example, according to Eq. (15.10), there is a 50% chance that the first transition will fire within 69 ms. If we were to divide the total reaction time into 69 ms segments, we could use a simple coin flip to decide whether or not the transition had fired during the interval between t = 0 and t = 69 ms. If it did not, we could look at the next 69 ms interval, and try again. However, if it had happened, we would not, of
⁷ Because the proof for these statements is quite involved, it is omitted here.
[Figure 15.3 — Petri-net with places X1–X4 and Y linked by transitions RX1–RX4; the five system states (0–4) are shown top to bottom]
Figure 15.3 Petri-net model of five sequential state transitions undergone by a single molecule. X1 to Y represent the five states that the molecule can assume; the five system states (top to bottom) indicate the development of the system after the consecutive firings of transitions RX1 to RX4. The model is used to establish the average dwell time of the token in each place, and to determine the spread in the arrival time of the token in place Y.
course, know exactly when it had happened. Furthermore, it would be possible that not only RX1, but also RX2, or even more transitions, had fired within the 69 ms. To avoid these uncertainties, the size of the interval must be reduced. As we have seen above, in the interval between 0 and 1 ms, there is about a 0.995% chance that RX1 fires. The chance that RX2 also fires within that interval is therefore 0.995% of 0.995%, which is only 0.0099%, or about 1 in 10,000. We can use a computational random number generator⁸ to draw a number from a very large set of uniformly distributed numbers between 0 and 1. If the draw yields a number greater than 0.00995, the situation is unchanged after 1 ms, and we draw another random number to determine whether the event will happen in the interval between 1 and 2 ms, and so on. If we draw a number between 0 and 0.00995 for a particular interval, we know that RX1 has fired. However, there is still a worry that, although unlikely, RX2 may also already have fired within that period. Because of that, we cannot be sure what the situation is after 1 ms: does X2 or X3, or maybe even one of the places further down the chain, contain the token? To resolve this, we could decide that both RX1 and RX2 fire in the interval under consideration if the draw yields a number r that is smaller than 9.9 × 10⁻⁵, that RX3 also fires when r < 9.9 × 10⁻⁷, and so on. Alternatively, we could divide time into smaller and smaller intervals, so that the times at which events happen are identified
⁸ See, for example, http://en.wikipedia.org/wiki/Random_number_generator or http://www.random.org/
with greater precision, and the probability that two or more events occur in the same interval becomes vanishingly small. Unfortunately, this would require generating more random numbers and, for the very small time intervals that would be considered "safe," slow down the computation of the firing times to a snail's pace. Fortunately, there is a simple solution to this problem. Instead of calculating the probability that an event occurs within a fixed time period, we can use a random number to calculate directly the time at which the event occurs. We know that the probability F that the transition will fire within the period between now and Δt increases monotonically from 0 to 1. We now divide the vertical axis in a plot of F against Δt into 10 segments, segment 1 from 0 to 0.1, segment 2 from 0.1 to 0.2, and so on, so that each segment represents an equal part, 10%, of the total probability. Rearranging Eq. (15.10) and taking logarithms of both sides gives Δt = −ln(1 − F)/k, which allows us to compute the time slots that correspond to each segment, as shown in Fig. 15.4. These time slots are unequal in size, and the last time slot is infinitely long. We then draw a random number to choose one of the
[Figure 15.4 — two panels (A, B), each plotting F (0–1) against time (0–0.5 s)]
Figure 15.4 Two ways to decide on the timing of an event in a stochastic simulation. The solid black line indicates the monotonically increasing probability F that the event has occurred in the interval from time 0 to time t. In (A), the time axis is sampled in equal steps. The corresponding probability that an event occurs in a particular time slot is different for each slot (F(n) − F(n−1) = (1 − e^(−kt(n))) − (1 − e^(−kt(n−1))), where F(n) is the function value at the end t(n) of the nth time slot). The correspondence is indicated by the solid gray lines. Because of the different probabilities, it is necessary to draw a random number r (from a uniformly distributed set) for each time slot to decide whether an event has taken place in that slot (r < 1 − e^(−kt(n))). In (B), the F-axis is divided into equally sized segments representing equal probabilities. Each segment corresponds to a particular time slot (t(n) − t(n−1) = (ln(1 − F(n−1)) − ln(1 − F(n)))/k) whose size increases with increasing F, as indicated by the solid gray lines. In this case, it is only necessary to draw a single random number to decide on the time slot in which the event takes place. In both cases, the uncertainty in the event timing (the width of the time slot) may be reduced by increasing the sampling rate. This will reduce the efficiency of the first method (A), but not that of the second (B).
vertical divisions (all are equally probable), and decide that the transition will fire in the corresponding time slot. Of course, these time slots, particularly the ones corresponding to the higher segments, are quite long, and there is an undesirable uncertainty in the timing of the event. By dividing the vertical axis into smaller segments, we can narrow down the corresponding time slots (apart from the last one, which will always be infinitely large), so that if we divide it into an infinitely large number of segments, the time slots will become infinitely short. If we then randomly draw a number, r, from an infinitely large,⁹ uniformly distributed set, we can pinpoint the precise time t at which the event occurs by evaluating Δt = t − t₀ = −ln(1 − r)/k. Now suppose there are n_X tokens in X1, instead of one. In that case, the transition will be firing at a rate of J = n_X·k, where the transition firing rate, J, may also be called the reaction flux, propensity, or hazard. The time interval up to the first transition firing is computed in exactly the same way as illustrated above, with J substituted for k:

$$\Delta t = -\frac{\ln(1 - r)}{J} \qquad (15.11)$$
Note that r may be 0 (in which case the event happens at the same time as the previous one, but the two still occur in an ordered fashion), but may not be 1. With this knowledge, we now return to the situation in which there is just one token in X1, as in system state 0 in Fig. 15.3. Suppose evaluation of Eq. (15.11) (with J_X1 = 1·k) with random number r₁ yields a firing time t₁ = Δt₁ + t₀ for RX1. Coincident with this firing event at t₁, the overall state of the system changes from the starting state 0 to state 1, in which RX1 is disabled as X1 has lost its token, whereas X2 now contains a token and RX2 is enabled (Fig. 15.3). We then carry out the same steps for the now enabled RX2, drawing a random number and computing Δt₂ and firing time t₂ using time t₁ as the starting point, removing the token from RX2's input place and placing one in its output place at t₂. After repeating this process for RX3 and RX4, the endpoint of this simulation is reached, in which Y contains a token, and all transitions are disabled, so that the system cannot develop any further. We now know how long it has taken the molecule on this occasion to undergo the full transformation from X1 to Y (or the
⁹ In practice, the number of random numbers that can be generated by random number generators on digital computers is finite, and limited by the number of bits, n_B, that are used to express each generated number. If n_B = 16 bits (2 bytes), as is sometimes the case, only 65,536 different numbers can be expressed, which means that the vertical axis will be divided into segments of size 1.5 × 10⁻⁵. This means that the uncertainties in the times corresponding to the lowest, middle, and one-but-highest segments (0–1.5 × 10⁻⁵, 0.5–0.500015, and 0.999969–0.999985) are 1.5 × 10⁻⁵/k, 3.0 × 10⁻⁵/k, and 11.0/k! Although most modern random number generators have much greater precision, it is worth keeping in mind that there is always some uncertainty associated with firing times that are computed on the basis of the numbers that they generate.
molecular motor to move from position X1 to Y, etc.), and how long each individual step has taken, and we can plot the "time trajectories" of each place. A typical series of trajectories is plotted in Fig. 15.5A. By starting anew, and repeating the whole process many times, we can generate histograms of the arrival times, and of the "dwell times" (life spans)
[Figure 15.5 — six panels: (A) place-state trajectories for X1–Y; (B) spreadsheet table with columns Random, Δt, t, and Event; (C, E) histograms of the dwell times in X2 and arrival times in Y (number vs. time, with probability density); (D, F) the corresponding accumulated histograms (accumulated number vs. time, with cumulative density)]
Figure 15.5 (A) Typical time trajectories for places X1 to Y in the simple sequential model of Fig. 15.3. Each line represents the state of the place indicated on the left as a function of time (low, no token; high, token present). (B) Computation of the transition firing times using a spreadsheet. Random numbers were drawn using the spreadsheet's random number generator; time intervals Δt were computed from Eq. (15.11) (k = 10 s⁻¹); time t is the accumulation of the Δt values. (C, E) Distribution of the token dwell times (life spans) in X2 and arrival times in Y, based on 1000 trajectories. (D, F) Accumulated data from (C) and (E). Dashed lines indicate the theoretical probability density and cumulative density distributions, obtained using the spreadsheet's exponential distribution function (C, noncumulative, Eq. (15.12); D, cumulative, Eq. (15.10), with k = 10) and gamma distribution function (E, noncumulative; F, cumulative, with β = 10 and α = 4).
of the tokens in each individual place. Dwell times are obtained by subtracting the firing time of the transition that deposited the token in the place from the firing time of the one that removed it, or, in the case of X1, simply by recording the firing time of RX1. The dwell time of the token in Y is infinite, as there is no transition to remove it. The arrival time of the token in Y is obtained by recording the firing time of RX4. The data in these histograms (Fig. 15.5C and E) may be accumulated ("integrated") into new histograms by adding the number in each time slot to the sum of the numbers in all previous time slots. These cumulative histograms show the number of observed dwell or arrival times falling inside a particular time slot or in earlier ones. When normalized to 1 (or 100%), these plots therefore indicate the probability that an event has happened at or before the upper time limit of the slot. In other words, they express the same thing as Eq. (15.10), a cumulative distribution of probabilities. The dwell times of tokens in X2 are determined by the occurrence of a single event (firing of RX2), and the cumulative distribution of these dwell times is, therefore, described by Eq. (15.10). Figure 15.5D contains a plot of the cumulative density (the values of F) obtained by applying either Eq. (15.10) with k = 10, or the equivalent cumulative exponential distribution function provided in the statistical function packages of spreadsheet programs or other computational tools. As the histograms in Fig. 15.5D and F are the integrated versions of those in C and E, C and E are the derivatives of D and F. Since the normalized data in Fig. 15.5D are described by Eq. (15.10), it follows that the data in Fig. 15.5C are described by its derivative, Eq. (15.12):

$$f = \frac{dF}{d(\Delta t)} = k\,e^{-k\,\Delta t} \qquad (15.12)$$
Equation (15.12) is a so-called exponential probability density function, and Eq. (15.10) is the cumulative distribution function for this probability density function. In Fig. 15.5C, the values of f obtained by applying Eq. (15.12) (or the equivalent noncumulative exponential distribution function in statistical function packages) are compared with the histogram data. Unlike the dwell time distribution in Fig. 15.5C, which was derived from a process whose timing is determined by a single event, the arrival time distribution in Fig. 15.5E and F is nonexponential. It can be shown that arrival time distributions in irreversible sequential multistep processes, in which the transition probabilities are the same in each step, are described by the gamma distribution function,¹⁰ which is expressed in terms of the "shape parameter" α, the "rate parameter" β, and the time interval Δt. In this case, α and β are equal to the number of transitions and the rate constant
¹⁰ The gamma distribution function is f = (β^α Δt^(α−1) e^(−βΔt))/Γ(α), where Γ(α) = (α − 1)! if α is a positive integer.
k, respectively. The mean of gamma-distributed values is equal to α/β: the number of steps times the average time taken for each step (which is the lifetime, τ = 1/k). The standard deviation (the square root of the variance) is √(α/β²). Thus, the theoretical values for the mean and standard deviation of the arrival times in this system are 0.4 and 0.2; we obtained values of 0.401 and 0.205 after recording 1000 trajectories.
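These statistics are straightforward to reproduce outside a spreadsheet. The short Python sketch below (our illustration, not the authors' code) generates 1000 arrival times for the four-step chain of Fig. 15.3 by summing waiting times drawn with Eq. (15.11), and recovers the gamma-predicted mean and standard deviation:

```python
import math
import random
import statistics

k = 10.0      # s^-1, the same rate constant for all four transitions
n_steps = 4   # RX1..RX4
n_traj = 1000

def waiting_time(j):
    # Eq. (15.11); random.random() returns r in [0, 1), so 1 - r > 0
    return -math.log(1.0 - random.random()) / j

# Arrival time in Y = sum of four exponential waiting times (J = 1*k each)
arrivals = [sum(waiting_time(k) for _ in range(n_steps)) for _ in range(n_traj)]

# Expect ~0.4 and ~0.2 (alpha/beta and sqrt(alpha)/beta with alpha=4, beta=k)
print(statistics.mean(arrivals), statistics.stdev(arrivals))
```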
6.3. Multiple options

The Petri-net in Fig. 15.6 is similar to that in Fig. 15.3, but in this case the reactions in which the item goes from state to state are reversible. Reversible reactions are modeled using two separate transitions, and the places that are on either side provide input for one and output for the other. In the system states with one token in X1 or Y (0 and 4), only one transition is enabled (RX1 or RY); in all other cases, there are two. Firing of either one will yield different new system states (and enabled transitions). Suppose all rate constants for the forward reactions (k_f, for RX1, RX2f, …, RX4f) are 10 s⁻¹, as in the previous example, those for the reactions in the reverse direction (k_r, for RX2r, RX3r, …, RY) are four times slower at 2.5 s⁻¹, and the system is in state 0 when we start looking at it. We can use the method described above to draw a random number and compute the
[Figure 15.6 — Petri-net with places X1–X4 and Y; forward transitions RX1 and RX2f–RX4f, reverse transitions RX2r–RX4r and RY; system states 0–4 indicated]
Figure 15.6 Petri-net model of a system in which a single molecule or other item undergoes four sequential reversible state transitions. The model is used to estimate the average time it takes until a token appears in Y, the so-called mean first-passage time.
time for the first transition firing. After the transition has fired, the system is in state 1, and both RX2f and RX2r are enabled. However, RX2f fires four times as fast as RX2r, so if they were to fire independently (i.e., if firing of one would not affect the odds of the other firing), about 80 out of 100 firings would originate from RX2f, and 20 from RX2r. It may be understood intuitively that, if a random number is drawn for both transitions, there is a 20% chance that the value of Δt associated with the slower reaction, RX2r, is the smaller of the two. We can, therefore, decide which transition will fire simply by choosing the one that would fire first. This is illustrated in Fig. 15.7A. If this happens to be RX2f, a token will appear in X3 and enable RX3f and RX3r at the time of firing; if it is RX2r, one will appear in X1,
[Figure 15.7 — three panels: (A) cumulative density F vs. time for two rate constants; (B) token trajectories for places X1–Y over the first second; (C) histogram of first-passage times (number vs. time, with probability density)]
Figure 15.7 (A) Cumulative density functions (Eq. (15.10)) for k = 10 (solid gray line) and k = 2.5 (gray dashed line), and comparison of Δt values computed from two sets of random numbers (Eq. (15.11); one random number for each enabled transition): one set for which the smallest Δt value is obtained with the faster transition (solid black lines), and one in which the slower transition "wins" (dashed black lines; smallest Δt values indicated by circles). (B) Typical token trajectories in the Petri-net of Fig. 15.6; the lowest position in each trajectory indicates a token in X1, the second lowest a token in X2, the highest a token in Y, etc. (C) Distribution of the time intervals between the start and the first appearance of a token in Y, based on 1000 trajectories. The mean first-passage time obtained from these data is 0.51 ± 0.29, and the dashed line is a gamma distribution constructed on the basis of these values (α = 3.12, β = 6.15 s⁻¹). All computations were again carried out in a spreadsheet program.
reenabling RX1, and in both cases the token will disappear from X2, and both RX2f and RX2r will be disabled. In contrast to the model in Fig. 15.3, this model will always have enabled transitions, so that the system remains dynamic. Figure 15.7B shows four typical trajectories over the first second into the reaction. This model may be used to estimate mean first-passage times, the average amount of time that passes before a particular state (here Y) is first reached from a starting state (here X1) in a sequence of reversible reactions. Figure 15.7C shows the distribution of first-passage times. The mean μ and standard deviation σ obtained from the first-passage times in 1000 trajectories were used to compute the values of α and β (β = μ/σ²; α = μ·β) to construct the gamma distribution function that is shown in the figure.¹¹

Now consider the Petri-net in Fig. 15.3 again. Rather than collecting data from 1000 trajectories as we have done above, we may also start with 1000 tokens in place X1, and record the distribution of tokens over all five places as time progresses. Transition RX1 will now fire 1000 times, until all its tokens have disappeared. Equation (15.11) may again be used to compute the interval to the first transition firing. As J is now a thousand times greater, the interval is of course likely to be significantly smaller than an interval computed for a single token. After RX1 has fired once, the number of tokens in X1 is 999 and that in X2 is 1. Both RX1 and RX2 are now enabled, with the firing propensity of RX1 (999·k) slightly smaller than it was before the event, and that of RX2 (1·k) significantly smaller than that of RX1, but now finite. Again, we determine which transition will fire first by drawing a random number for both enabled transitions, and compute a value of Δt for each based on its firing propensity J. After one of the enabled transitions has fired, and the tokens have been redistributed accordingly, we may repeat this process until 1000 tokens have arrived in Y, and all transitions are disabled. Figure 15.8 shows the token redistribution and transition firing count over time in this system. Note that the dwell times in X1 and arrival times in Y are distributed in the same way in the 1- and 1000-token systems. However, trajectories of single items, such as the ones in Figs. 15.5A and 15.7B, can only be obtained from simulations in which each place contains a maximum of one token. As tokens are indistinguishable, a particular molecule cannot be associated with a particular token if there is more than one token in one place.
¹¹ As the value of α is noninteger in this case, the factorial expression for Γ(α) in the gamma distribution function is replaced by the continuous form Γ(α) = ∫₀^∞ x^(α−1) e^(−x) dx. Many spreadsheet programs supply functions for evaluating the gamma distribution equation, given Δt, α, and β, in its noncumulative as well as its cumulative form.
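The first-passage experiment of Figs. 15.6 and 15.7 can be sketched in a few lines of Python (our own minimal implementation under the stated rate constants; the authors used a spreadsheet). It applies the rule described above: draw a waiting time for every enabled transition and fire the one with the smallest Δt:

```python
import math
import random
import statistics

kf, kr = 10.0, 2.5  # forward and reverse rate constants, s^-1
N = 5               # states X1 (index 0) .. X4 (index 3), Y (index 4)

def exp_time(j):
    return -math.log(1.0 - random.random()) / j  # Eq. (15.11)

def first_passage():
    """Time until the token first reaches Y, starting from X1."""
    state, t = 0, 0.0
    while state < N - 1:
        candidates = [(exp_time(kf), +1)]           # forward transition
        if state > 0:                               # reverse enabled except in X1
            candidates.append((exp_time(kr), -1))
        dt, move = min(candidates)                  # smallest Delta_t fires
        t += dt
        state += move
    return t

times = [first_passage() for _ in range(1000)]
# Expect roughly 0.5 and 0.3 (cf. the 0.51 +/- 0.29 quoted for Fig. 15.7C)
print(statistics.mean(times), statistics.stdev(times))
```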
[Figure 15.8 — two panels: (A) number of tokens in X1–Y vs. time; (B) number of firings of RX1–RX4 vs. time, with cumulative density]
Figure 15.8 (A) Token redistribution over all five places in the Petri-net model in Fig. 15.3 as a function of time. The first-order rate constant for all transitions was 10 s⁻¹, and at the start of the simulation there were 1000 tokens in X1, and none in the other places. (B) Number of times transitions RX1 to RX4 have fired as a function of time (unlabeled curves are the data of RX2 and RX3, left to right). Filled gray circles indicate the firing times observed in the stochastic simulation; solid black lines are an exponential cumulative density function (Eq. (15.10) with k = 10 s⁻¹) for RX1, and cumulative gamma distribution functions with β = 10 s⁻¹ and α = 2, 3, and 4 for RX2, RX3, and RX4, respectively. Note that curve X1 (in A) describes the distribution of dwell times in X1; that the data in Y (A) and RX4 (B) are equal (as the number of tokens in Y registers the number of times RX4 has fired); and that the derivatives of the cumulative distribution functions for RX1, RX2, RX3, and RX4 describe the token arrival time distributions for X2, X3, X4, and Y.
6.4. Pseudo-first-order and second-order reactions

The Petri-net in Fig. 15.9 is similar to that in Fig. 15.6, but one of its transitions, RX3f, represents a second-order reaction. The compound Z is consumed in a reaction with X3, and a product, A, is released in the reverse (first-order) reaction represented by RX4r. If there is a very large number of tokens in Z, such that the change in the number of tokens over the full simulation is negligible, we may assume that its concentration is constant. In that case, the reaction is pseudo-first-order (see Eq. (15.8)), and treated in the same way as real first-order reactions. Equation (15.11) is used to compute the interval to the next transition firing, with J equal to k′_X3 if there is a single token, and to n_X3·k′_X3 if there are n_X3 tokens in X3. The pseudo-first-order rate constant k′_X3 is equal to k_X3·Z_tot, where k_X3 is the second-order rate constant, and Z_tot the (constant) total concentration of Z. In this case, it is not necessary to keep track of the actual number of tokens in Z, and a constant value for k′_X3 may be used throughout the simulation. However, if the number of Z particles is relatively small and changes significantly over the course of the simulation, the number of tokens n_Z in Z must also be taken into account in the expression for J. Nonetheless, because neither n_X3 nor n_Z changes between events, J = k_X3·n_X3·[Z] = k_X3·n_X3·n_Z/(N_A·V) may be used to compute Δt for RX3f.
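A hedged sketch of this propensity calculation (ours; parameter values taken from the legend of Fig. 15.9) shows the exact and pseudo-first-order routes giving the same J when Z is effectively constant:

```python
# Evaluating the firing propensity J for the second-order transition RX3f,
# either exactly or with the pseudo-first-order shortcut (illustrative code).

N_A = 6.022e23   # mol^-1
V = 8.0e-18      # compartment volume in litres (from the Fig. 15.9 legend)
k_X3 = 500.0     # second-order rate constant, M^-1 s^-1 (panels 1 and 2)

def propensity_exact(n_X3, n_Z):
    # J = k * n_X3 * [Z] = k * n_X3 * n_Z / (N_A * V)
    return k_X3 * n_X3 * n_Z / (N_A * V)

def propensity_pseudo_first_order(n_X3, Z_total_molar):
    # k' = k * Z_tot, with Z_tot treated as constant
    return k_X3 * Z_total_molar * n_X3

print(propensity_exact(1, 100_000))
print(propensity_pseudo_first_order(1, 100_000 / (N_A * V)))  # same value
```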
[Figure 15.9 — two panels: (A) the Petri-net, with species Z consumed by RX3f and product A released by RX4r; (B) three stacked trajectory panels showing the token position in X1–Y and the token numbers nZ and nA vs. time (0–100 s)]
Figure 15.9 (A) Petri-net model of a system similar to that in Fig. 15.6, but in which RX3f is a second-order reaction in which species Z reacts with X3, and a reaction product (A) is released in reaction RX4r. The rate constants for the first-order reactions (i.e., all reactions except RX3f) are all 10 s⁻¹, and the volume of the container in which the reaction takes place is 8 × 10⁻¹⁸ L (the volume of a cube with 0.2 μm sides, about the size of a small bacterium). All simulations were started with a token in X1; the figure depicts possible states before and after depletion of the tokens in Z. (B) Trajectories of the token position in X1 to Y (gray lines), the number of tokens in Z (nZ, large black plusses, only shown in panel 3), and in A (nA, smaller black plusses, all panels). The initial value of nZ and the value of the second-order rate constant kX3f were 100,000 (20 mM) and 500 M⁻¹ s⁻¹ (panels 1 and 2), or 400 (80 μM) and 5 × 10⁷ M⁻¹ s⁻¹ (panel 3). In the simulation shown in panel 1, a pseudo-first-order approximation was used (with nZ constant). In those in panels 2 and 3, the actual value of nZ after each event was used in the computation of J and Δt. Comparison of panels 1 and 2 shows that the results are very similar if nZ changes relatively little (about 0.2%).
7. Summary

In summary, the procedure to set up and perform a stochastic simulation of a dynamic biochemical reaction network includes the following steps.

1. Set up the model structure. The model must describe how particular types of molecules or molecular assemblies ("places") are transformed by chemical or physical reactions ("transitions") into other types. Reactions remove instances ("tokens": molecules, molecular assemblies) of their reactant(s) and produce product instances. Reactions perform individual transformations (or "fire") at a particular rate, with each firing being an "event" that occurs instantaneously. The number of instances removed and produced upon a single firing event is specified in the reaction stoichiometry.
2. Associate each reaction with an equation that can be used to evaluate the firing rate, J, under any set of conditions. In the modeling of biochemical reactions, it is reasonable to use the laws of Mass Action:
   First-order reactions: J = k·n_X
   Second-order reactions between particles of a different type: J = k·n_X·n_Y/(N_A·V)
   Second-order reactions between particles of the same type (dimerization): J = k·n_X·(n_X − 1)/(N_A·V)
   Here, n_X and n_Y are the numbers of instances of type X and Y present in the reaction vessel or compartment volume V (the number of tokens in places X and Y), and k is the first- or second-order rate constant for the reaction. Other expressions for J are allowed, but J must be constant between events. If it is not, the central Eq. (15.11) is no longer valid.
3. Decide how many instances of each type there are at t₀, the beginning of the simulation, and set an end time, t_end, for the simulation. Set the simulated time t to t₀.
4. For each reaction R_i, compute its value J_i, randomly draw a number r_i between 0 and 1 (0 ≤ r_i < 1) from a large, uniformly distributed set, and use these values to calculate the putative time, t_i, at which it will fire next:

$$t_i = t - \frac{\ln(1 - r_i)}{J_i}$$
5. Decide which reaction has produced t_min, the smallest value of t_i. If t_min < t_end, the earliest event will occur within the maximum simulation time. In this case, set the simulated time t to t_min, and let the transition that produced t_min proceed by removing the specified number of instances from its reactant(s) and adding new ones to its product(s). This token redistribution changes the overall state of the system.
6. To continue the simulation, repeat the procedure from step 4 (a minimal implementation of this loop is sketched below).
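A minimal, self-contained Python implementation of this First Reaction loop follows. The two-step isomerization X → Y → Z used to exercise it is our own illustrative example, not one of the chapter's figures; each reaction is supplied as a propensity function of the current token counts plus the token changes applied when it fires:

```python
import math
import random

def first_reaction_simulation(tokens, reactions, t_end):
    """Steps 3-6: advance the system until t_end or until nothing can fire.

    tokens:    dict mapping species name -> current token count
    reactions: list of (propensity_fn, change_dict) pairs
    Returns a list of (time, token-count snapshot) tuples.
    """
    t, history = 0.0, [(0.0, dict(tokens))]
    while True:
        # Step 4: putative firing time for every reaction with J > 0
        best = None
        for propensity, change in reactions:
            J = propensity(tokens)
            if J > 0.0:
                ti = t - math.log(1.0 - random.random()) / J
                if best is None or ti < best[0]:
                    best = (ti, change)
        # Step 5: stop if no reaction is enabled or the earliest event is too late
        if best is None or best[0] >= t_end:
            return history
        t = best[0]
        for species, delta in best[1].items():  # redistribute the tokens
            tokens[species] += delta
        history.append((t, dict(tokens)))

reactions = [
    (lambda n: 10.0 * n["X"], {"X": -1, "Y": +1}),  # X -> Y, J = k*n_X
    (lambda n: 10.0 * n["Y"], {"Y": -1, "Z": +1}),  # Y -> Z, J = k*n_Y
]
trace = first_reaction_simulation({"X": 1000, "Y": 0, "Z": 0}, reactions, t_end=2.0)
print(trace[-1])  # final time and token counts
```

Replacing the reaction list and the initial token counts is all that is needed to simulate a different network.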
8. Notes

1. Improving efficiency. Owing to the special properties of exponential equations (Eqs. (15.9) and (15.10)), computing new firing times for all reactions after an event is justified, but not entirely necessary if the firing propensity of some reactions is unchanged by the event. Gibson and Bruck (2000) have described a method in which the dependencies in the reaction network (expressing which reactions affect the firing propensity of which other reactions) are identified and taken into account in the reevaluation of the system. They coined the name "First Reaction Method" for the method in which all transitions are evaluated, and "Next Reaction Method" for their variant. In another important variant of the First Reaction Method, known as Gillespie's Direct Method (Gillespie, 1977), just two random numbers are drawn per evaluation round. Here, the sum of all propensities is used to compute the firing time, and the reaction that fires at that time is selected through a lottery in which each reaction's chance of being chosen is proportional to its firing propensity. Depending on the characteristics of the modeled network, either variant may improve the efficiency of the simulation. However, as both incur some computational overhead, the First Reaction Method, which is easiest to implement, is often equally efficient in small systems.
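One evaluation round of the Direct Method just described might be sketched as follows (our illustration, not code from Gillespie (1977)); the same Δt formula is used, but with the summed propensity, and a second random number runs the weighted lottery:

```python
import math
import random

def direct_method_step(propensities):
    """One Direct Method round. propensities: list of current J values.
    Returns (dt, reaction_index), or None if no reaction can fire."""
    J_total = sum(propensities)
    if J_total <= 0.0:
        return None
    # First random number: time to the next event, from the summed propensity
    dt = -math.log(1.0 - random.random()) / J_total
    # Second random number: weighted lottery over the individual propensities
    threshold = random.random() * J_total
    running = 0.0
    for i, J in enumerate(propensities):
        running += J
        if threshold < running:
            return dt, i
    return dt, len(propensities) - 1  # guard against floating-point rounding

print(direct_method_step([10.0, 2.5]))  # index 0 is chosen ~80% of the time
```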
2. Modeling more complex systems. Events included in a stochastic simulation do not have to be just chemical reactions. Any process that can be associated with a probability can be included as an event in the simulation, and stochastic simulations really come into their own in systems with specific localized spatial characteristics. For example, a molecular motor moving along an actin or microtubule track will periodically and randomly encounter an obstruction, or may "jump" from one track to another. Likewise, molecules moving in a cellular environment will frequently encounter obstacles. Stochastic simulations may also be easily adapted to model the behavior of a single geometrically complex structure such as, for example, the end of a microtubule (Martin et al., 1993). Note that the graphical Petri-net notation may be helpful in the model design stage; however, its use is not essential. Although its computational implementation may be compact, the drawings quickly become unwieldy as the systems get larger. Moreover, some systems lend themselves better to description in the form of Petri-nets than others. Systems such as those mentioned above are more easily and more efficiently expressed in purpose-built code, outside the straitjacket of the Petri-net formulation or the equivalent chemical reaction notation.
3. Modeling and simulation in practice. Simple stochastic simulations, such as the ones presented in Figs. 15.5 and 15.7, are easily performed in a conventional spreadsheet. Obviously, this requires an excellent understanding of the stochastic approach, and we recommend that newcomers try implementing these examples first. For more complex systems, the ability to write computer code offers a considerable advantage. This code may equally well be written in interpreted "scripting" languages such as Python, Perl, or VBA as in compiled ones such as Java, Fortran, or C/C++, or in numerical computing environments such as MATLAB, Mathematica, Octave, or R. Because of its simplicity, implementation of the First Reaction algorithm as outlined in Section 7 is an ideal goal for those who would like to familiarize themselves with numerical problem solving. The ability to write code gives a programmer the ultimate control over the model and its simulation, input, output, and presentation. In addition, there is nowadays a raft of software tools (e.g., see the list of SBML-supporting packages on http://sbml.org) that allow users to enter a set of (bio)chemical reactions, a set of parameters, and an initial state, and perform a stochastic simulation. These tools usually implement a First Reaction Method variant, sometimes complemented with accelerated approximate techniques such as tau-leaping (Gillespie, 2001), or methods based on the Langevin or Fokker–Planck equations. Because the expressions for the reaction firing propensity J can, in combination with the reaction stoichiometry, be used directly to construct the ODEs for the system, some tools also offer facilities for deterministic simulation. These tools allow the user to quickly set up a model, perform a simulation, and obtain the results in a convenient format. Few, if any, of these tools are suitable for modeling spatial inhomogeneity or structures with geometric complexity. Regrettably, however, such tools contribute little to their users' understanding of the principles that lie beneath the stochastic approach.
4. Further reading. The first section in molecular biology and biochemistry textbooks is often dedicated to the kinetics and thermodynamics of biochemical reaction systems (e.g., see http://en.wikibooks.org/wiki/Biochemistry). More extensive information on this subject may be found in Atkins and de Paula (2006). Cornish-Bowden (1999) explains basic concepts from mathematics, including exponents and logarithms, differential and integral calculus, and statistics, aiming at students of biochemistry. Wilkinson (2006) provides an extensive formal introduction to stochastic modeling in Systems Biology.
The book "Systems Modelling in Cellular Biology" (Szallasi et al., 2006) contains some excellent chapters on stochastic modeling and related numerical simulation methods, notably Gillespie and Petzold (2006), Kruse and Elf (2006), and Paulsson and Elf (2006). Useful reviews discussing recent developments are found in the work of Gillespie (2007) and Pahle (2009).
REFERENCES

Atkins, P. W. (1994). Physical Chemistry. Oxford University Press, Oxford, UK.
Atkins, P. W., and de Paula, J. (2006). Physical Chemistry for the Life Sciences. W. H. Freeman and Company, New York, NY.
Bayley, P. M., Schilstra, M. J., and Martin, S. R. (1990). Microtubule dynamic instability: A numerical simulation of experimental microtubule properties using the lateral cap model. J. Cell Sci. 95, 33–48.
Bortz, A. B., Kalos, M. H., and Lebowitz, J. L. (1975). A new algorithm for Monte Carlo simulation of Ising spin systems. J. Comput. Phys. 17, 10–18.
Cornish, P. V., and Ha, T. (2007). A survey of single-molecule techniques in chemical biology. ACS Chem. Biol. 2, 53.
Cornish-Bowden, A. (1999). Basic Mathematics for Biochemists. Oxford University Press, New York, NY.
Gibson, M. A., and Bruck, J. (2000). Efficient exact stochastic simulation of chemical systems with many species and many channels. J. Phys. Chem. A 104, 1876–1889.
Gillespie, D. T. (1976). A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. J. Comput. Phys. 22, 403–434.
Gillespie, D. T. (1977). Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81, 2340–2361.
Gillespie, D. T. (2001). Approximate accelerated stochastic simulation of chemically reacting systems. J. Chem. Phys. 115, 1716–1733.
Gillespie, D. T. (2007). Stochastic simulation of chemical kinetics. Annu. Rev. Phys. Chem. 58, 35–55.
Gillespie, D. T., and Petzold, L. R. (2006). Numerical simulation for biochemical kinetics. In "System Modeling in Cellular Biology" (Z. Szallasi, J. Stelling, and V. Periwal, eds.), pp. 331–353. MIT Press, Cambridge, MA.
Kleutsch, B., and Frehland, E. (1991). Monte Carlo simulations of voltage fluctuations in biological membranes in the case of small numbers of transport units. Eur. Biophys. J. 19, 203–211.
Kruse, K., and Elf, J. (2006). Kinetics in spatially extended systems. In "System Modeling in Cellular Biology" (Z. Szallasi, J. Stelling, and V. Periwal, eds.), pp. 177–198. MIT Press, Cambridge, MA.
Martin, S. R., Schilstra, M. J., and Bayley, P. M. (1993). Dynamic instability of microtubules: Monte Carlo simulation and application to different types of microtubule lattice. Biophys. J. 65, 578–596.
Pahle, J. (2009). Biochemical simulations: Stochastic, approximate stochastic and hybrid approaches. Brief. Bioinform. 10, 53–64.
Paulsson, J., and Elf, J. (2006). Stochastic modelling of intracellular kinetics. In "System Modeling in Cellular Biology" (Z. Szallasi, J. Stelling, and V. Periwal, eds.), pp. 149–175. MIT Press, Cambridge, MA.
Szallasi, Z., Stelling, J., and Periwal, V. (eds.) (2006). "System Modeling in Cellular Biology." MIT Press, Cambridge, MA.
Wilkinson, D. J. (2006). Stochastic Modelling for Systems Biology. Chapman & Hall/CRC, London, UK.
C H A P T E R   S I X T E E N

Monte Carlo Simulation in Establishing Analytical Quality Requirements for Clinical Laboratory Tests: Meeting Clinical Needs

James C. Boyd and David E. Bruns
Department of Pathology, University of Virginia Health System, Charlottesville, Virginia, USA

Contents
1. Introduction
2. Modeling Approach
   2.1. Simulation of assay imprecision and inaccuracy
   2.2. Modeling physiologic response to changing conditions
3. Methods for Simulation Study
4. Results
   4.1. Yale regimen
   4.2. University of Washington regimen
5. Discussion
References
Abstract

Introduction. Patient outcomes, such as morbidity and mortality, depend on accurate laboratory test results. Computer simulation of the effects of test performance parameters on outcome measures may represent a valuable approach to defining the quality of assay performance that is needed to provide optimal outcomes.

Methods. We carried out computer simulations of patients on intensive insulin treatment to determine the effects of glucose meter imprecision and bias on (1) the frequencies of glucose concentrations >160 mg/dL; (2) the frequencies of hypoglycemia (<60 mg/dL); (3) the mean glucose; and (4) glucose variability. For each patient, starting with a randomly selected initial glucose concentration and individualized responsiveness to insulin, hourly glucose concentrations were simulated to reflect the effects of (1) IV glucose
administration, (2) gluconeogenesis, (3) insulin doses as determined using regimens from the University of Washington and Yale University, and (4) errors in glucose measurements by the meter. For each of 45 sets of glucose meter bias and imprecision conditions, 100 patients were simulated, and each patient was followed for 100 h.

Results. For both insulin regimens, mean glucose was inversely related to assay bias; glucose variability increased with negative assay bias and assay imprecision; frequencies of glucose concentrations >160 mg/dL increased with negative assay bias and assay imprecision; and frequencies of hypoglycemia increased with positive assay bias and assay imprecision. Nevertheless, each regimen displayed unique sensitivity to variations in meter imprecision and bias.

Conclusions. Errors in glucose measurement exert important regimen-dependent effects on glucose control in intensive IV insulin administration. The results of this proof-of-principle study suggest that simulation of the clinical effects of measurement error is an attractive approach for assessment of assay performance requirements.
1. Introduction

Quantitative laboratory measurements play an increasingly important role in medicine. Well-known examples include (a) quantitative assays for cardiac troponins for diagnosing acute coronary syndromes (heart attacks) (Morrow et al., 2007) and (b) measurements of LDL cholesterol to guide decisions on use of statin drugs (Expert Panel on Detection, Evaluation, and Treatment of High Blood Cholesterol in Adults, 2001). Errors in these assays are recognized to lead to misdiagnoses and to inappropriate treatment or lack of appropriate treatment. A persistent problem is defining how accurate laboratory measurements must be.

Various approaches have been used to define quality specifications (or analytical goals) for clinical assays. A hierarchy of these approaches has been proposed (Fraser and Petersen, 1999; Petersen, 1999). At the low end of the hierarchy, performance of an assay can be compared with the "state of the art," with performance criteria set by regulatory agencies, or with the opinions of practicing clinicians or patients. A step higher, biology can be used as a guide to analytical quality by considering the average inherent biological variation within people; for substances whose concentration in plasma varies dramatically from day to day, there is less pressure to have assays that are highly precise, because the analytical variation constitutes a small portion of the total variation. However, none of these approaches directly examines the relationship between quality of test performance and clinical outcomes. The collected opinions of physicians are likely to be
anecdotal and to reflect wide variation of opinion, whereas criteria based on biological variation or the analytical state of the art, or the criteria of regulatory agencies, may have no relation to clinical outcomes.

Few studies have examined analytical quality requirements based on the highest criterion, patient outcomes. Clinical trials to determine the effects of analytical error on patient outcomes (such as mortality) are extremely difficult to devise and expensive to conduct (Price et al., 2006). Unlike trials of new drugs, the costs of such studies are large in relation to the potential for profit. For ethical reasons, it may be impossible to conduct prospective randomized clinical trials in which patients are randomized to groups defined by use of high- or low-quality analytical testing methods. The common lack of standardization of methods used in different studies usually undermines efforts to draw useful general conclusions on this question from systematic reviews of published and unpublished clinical studies. By contrast to these approaches, computer simulation studies allow a systematic examination of many levels of assay performance.

There are common clinical situations in which patient outcomes are almost certainly connected with the analytical performance of a laboratory test. These situations represent ideal models for simulation studies. One such situation occurs when a laboratory test result is used to guide the administration of a drug: a measured concentration of the drug or of a drug target determines the dose of drug, and errors of measurement lead to selection of an inappropriate dose. A common example is the use of measured concentrations of glucose to guide the administration of insulin. In this situation, a higher glucose concentration is a signal that a higher dose of insulin is needed, whereas a low glucose concentration is a signal to decrease or omit the next dose of insulin.

Several years ago, we carried out simulation modeling of the use of home glucose meters by patients to adjust their insulin doses (Boyd and Bruns, 2001). In clinical practice, the insulin dose has been determined from a table that relates the measured glucose concentration to the dose of insulin to give. We examined the relationships between errors in glucose measurements and the resulting errors in selection of the insulin dose that is appropriate for the true glucose concentration. The simulation model addressed glucose meters with specified bias (average error) and imprecision (variability of repeated measurements of a sample, expressed as coefficient of variation (CV)). We found that selecting the intended insulin dosage 95% of the time required that both the bias and the CV of the glucose meter be <2%, considerably less than the error of commonly used meters (Boyd and Bruns, 2001). Based on these results, we concluded that simulation modeling studies could provide a clinically relevant basis for setting quality specifications for home glucose meters used to adjust insulin doses.

Recently, several randomized controlled trials have found that tight control of patients' glucose in surgical, medical, and neonatal intensive
care units improved clinical outcomes, including rates of mortality and morbidity (see, e.g., Van den Berghe et al., 2001, 2006; Vlasselaers et al., 2009). Although some subsequent studies also showed improved patient outcomes with tight glucose control (TGC), others did not, such that meta-analyses of all available studies showed no improvement in rates of mortality or morbidity (Griesdale et al., 2009; Wiener et al., 2008). The three studies cited above measured glucose with devices known to have good accuracy and precision, but most other studies, many of which reported disappointing results, used devices with lower accuracy and precision (Scott et al., 2009). Aside from this suggestive observation, however, little is known regarding the quality of glucose assays that is required to achieve optimum results with TGC programs. We set out to use simulation modeling as an alternative to clinical trials in patients to address the quality requirements for measurements of glucose in TGC programs.

Any simulation model for evaluating the clinical success of TGC regimens requires selection of clinical measures of success. The currently popular measure for assessing the success of TGC is the mean blood glucose. Additional measures for assessing the tightness of blood glucose control include the frequencies of hypo- or hyperglycemia, the percentage of time that patients' blood glucose concentrations are within the target interval, and the relative variability in blood glucose concentrations over time.

We have developed a modeling approach to assess the impact of analytical imprecision and inaccuracy in glucose testing on clinical measures of outcome in TGC. Although the data we present are preliminary and have some weaknesses that we will point out, we believe that such a modeling approach may represent a generally useful method to help answer the question of the analytical testing quality required in a variety of clinical circumstances to meet medical needs.
2. Modeling Approach

2.1. Simulation of assay imprecision and inaccuracy

Laboratory tests generally display both inaccuracy and imprecision. Systematic assay inaccuracy is reflected in assay bias—the mean deviation of test results from the true concentrations. Assay bias is assessed by comparing test measurements on samples with measurements made by a reference measurement system that is known to have very low bias. Assay imprecision is assessed by repetitive measurements of quality control materials that simulate patient samples. The data are used to determine the assay imprecision (standard deviation) at several concentrations of analyte. The average assay imprecision is a reasonably good estimator
of the imprecision that might be seen in the analysis of patient samples. Assay imprecision is usually expressed as a relative imprecision, or CV, in percent, obtained by dividing the observed assay SD by the mean and multiplying by 100%.

Quality control data on repeated analyses of the same samples over months or years are statistically well behaved and follow a Gaussian distribution. Thus, assuming that the imprecision of a laboratory test for patient samples is similar to that observed on quality control samples, simulation of assay imprecision can easily be accomplished using a random number generator that yields normally distributed values with mean = 0 and SD = 1, such as the RANNOR function in SAS. To generate a simulated series of test results that have a bias of B% and a relative imprecision of S%, the following equation is used:

Test(simulated) = Conc(true) + (B/100) × Conc(true) + RANNOR(seed) × (S/100) × Conc(true)   (16.1)

where Test(simulated) is the test result with B% bias and S% relative imprecision, Conc(true) is the true concentration of analyte in the sample, and RANNOR(seed) is the RANNOR function output at a given "seed" value. Alternative equations are easily generated to simulate a constant rather than a proportional bias, or to use standard deviations expressed in concentration units rather than relative standard deviations (CVs). Combinations of these approaches allow simulation of test results that, for instance, have a constant concentration-based SD at low assay values but, above a threshold concentration, a constant relative SD.
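As an illustration, the following is a minimal Python sketch of Eq. (16.1); the chapter's own implementation uses the SAS RANNOR function (see Appendices 1 and 2), and the numeric values here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # plays the role of RANNOR(seed)

def simulated_test(conc_true, bias_pct, cv_pct, n=1):
    """Eq. (16.1): proportional bias of B% plus Gaussian imprecision of S% CV."""
    z = rng.standard_normal(n)
    return conc_true + (bias_pct / 100) * conc_true + z * (cv_pct / 100) * conc_true

# ten simulated readings of a true glucose of 100 mg/dL with +5% bias, 10% CV
print(simulated_test(100.0, 5.0, 10.0, n=10))
```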
2.2. Modeling physiologic response to changing conditions

Where laboratory measurements are used to guide patient treatment, simulation modeling requires a good model of the physiologic response to drug administration. For the example shown in this chapter, we have modeled the physiologic response of glucose to insulin administration. Although the model we have developed is very simplistic (and, therefore, may not reflect the true physiologic response very accurately), it is sufficient for the purposes of demonstration. Sophisticated models that give highly accurate representations of the true physiologic response of glucose to insulin have been developed for Type 1 and Type 2 diabetes and have received FDA approval for use in preclinical trials of closed-loop control of glucose (Kovatchev et al., 2009). These models would be the ideal ones to apply in the simulation studies we present below, but, due to their complexity and cost, we have used our simple physiologic model to demonstrate the concept presented here.
3. Methods for Simulation Study

We utilized simulation modeling studies to evaluate the effect of analytical errors in glucose measurements on the outcomes of simulated intensive care unit patients on TGC regimens. Our simulation models of the TGC regimens were designed to determine the effects of assay imprecision and bias on four measures of success: (1) the frequency of plasma glucose concentrations above the goal range (>160 mg/dL); (2) the frequency of plasma glucose concentrations in the hypoglycemic range (defined as <60 mg/dL for this study, but easily redefined if desired); (3) the mean blood glucose; and (4) the variability of plasma glucose (expressed as the standard deviation of repeated measurements of glucose in the individual).

We modeled two published TGC regimens—the Yale University protocol (referred to herein as Yale) and the University of Washington (UW) protocol. The authors of the Yale protocol describe it as a "safe and effective insulin infusion protocol in a medical intensive care unit" (Goldberg et al., 2004). The UW protocol was developed in the context of diabetic patients undergoing surgery, but it can also be applied in patients without diabetes (Trence et al., 2003). The protocols differ in underlying approach, and each has a different goal for the range of glucose concentrations in patients. We evaluated the effects of glucose assay imprecision and bias on the four measures of clinical success (above) and whether the two regimens differed in their sensitivity to errors of glucose measurement.

The physiologic release of insulin from the pancreatic beta cell is known to be linearly related to the prevailing glucose concentration (Toffolo et al., 1980), and this relationship forms the basis for selection of the pharmacologic doses of insulin given to patients who lack adequate endogenous insulin. In our computer model, true glucose concentrations after administration of insulin were generated based on this model of the relationship of glucose and insulin. For each patient modeled, the initial glucose concentration and the patient's individual responsiveness to insulin were randomly selected: the starting glucose was selected from a range of 40 to 600 mg/dL, and the responsiveness to insulin was selected from a range of 10 to 54 mg/dL decrease in glucose per unit of insulin per hour. To simplify calculations, the insulin responsiveness in a given patient was assumed to remain constant, but it could be programmed to change predictably or randomly. The glucose concentration was modified each hour to reflect IV glucose administration and normal physiologic generation via gluconeogenesis. We chose a mean increment (SD) from IV and endogenous sources of 50 (±10) mg/dL. Each hour, the glucose concentration was decremented according to the patient's underlying insulin responsiveness and the insulin dose determined by the regimen. Combinations of analytical bias ranging
from −20% to +20% in 5% increments and analytical imprecision (expressed as percent coefficient of variation) ranging from 0% to 20% in 5% increments were modeled. For each set of analytical error conditions (paired values of % bias and % relative imprecision expressed as CV), 100 patients were simulated, and each patient was followed for 100 h. This gave 10,000 glucose measurements for each of the 45 sets of analytical error conditions simulated, for a total of 450,000 glucose measurements for each simulation.

We performed a side-by-side comparison of simulated glucose concentrations as measured by a perfect glucose assay and by an assay with inherent analytical error. These glucose concentrations were used to determine the insulin administration rates according to the regimens outlined earlier. Based on the insulin responsiveness for a given patient, the decrement in true glucose concentration resulting from an insulin dose can be calculated. Iterative application of this approach, on consecutively measured glucoses, generates a series of true glucose values in each patient from which the frequencies of hypoglycemia and of above-target glucose concentrations can be determined, along with the variability of the plasma glucose concentration. All computer modeling was performed using SAS software (SAS Institute, Cary, NC). The SAS code used for modeling the Yale and UW regimens is included in Appendices 1 and 2, respectively.
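The hourly simulation loop just described can be sketched in Python as follows; the dosing rule shown is a hypothetical linear placeholder standing in for the Yale/UW tables, which are implemented in SAS in Appendices 1 and 2.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_patient(bias_pct, cv_pct, hours=100):
    """One simulated patient per the scheme above; the dose rule is a placeholder."""
    glucose = rng.uniform(40, 600)   # initial true glucose, mg/dL
    resp = rng.uniform(10, 54)       # mg/dL fall per unit insulin per hour
    trace = []
    for _ in range(hours):
        measured = glucose * (1 + bias_pct / 100
                              + rng.standard_normal() * cv_pct / 100)
        dose = max(0.0, (measured - 100.0) / 30.0)  # placeholder dosing rule, U/h
        # hourly update: IV glucose + gluconeogenesis, minus insulin effect
        glucose = max(0.0, glucose + rng.normal(50, 10) - resp * dose)
        trace.append(glucose)
    return np.array(trace)

# 45 error conditions (9 bias values x 5 CV values), 100 patients each
for bias in range(-20, 25, 5):
    for cv in range(0, 25, 5):
        runs = [simulate_patient(bias, cv) for _ in range(100)]
        hypo = np.mean([np.mean(r < 60) for r in runs]) * 100
        print(f"bias={bias:+d}% CV={cv}% -> {hypo:.2f}% of readings <60 mg/dL")
```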
4. Results

4.1. Yale regimen

Figures 16.1–16.4 show examples of true and measured glucose concentrations in simulated patients and the related insulin infusion rates during 100 h. The left panels of each figure show representative simulated patients in which the Yale regimen was used to determine insulin infusion rates. The right panels show similar patients treated using the UW regimen. We will first describe key features of the example patients treated by use of the Yale regimen (left panels).

The upper and lower left panels of Fig. 16.1 compare the glucose control achieved by the Yale regimen using a perfect glucose assay (top panel) with the glucose control achieved in the same patient using an imperfect glucose assay (lower panel) with a positive 20% bias and a 5% CV. With the perfect assay (top panel), periodic oscillations of glucose concentrations and insulin administration rates occur due to the insulin dosage adjustments specified by the Yale regimen. In the lower panel, with a positively biased glucose assay, similar oscillations can be seen, but now, as
[Figure 16.1 near here: four panels of glucose (mg/dL) and insulin infusion rate (U/h) versus time (h) for a moderately insulin-responsive patient under the Yale regimen (left) and the Washington regimen (right); upper panels, perfect assay; lower panels, assay with bias = 20%, CV = 5%.]
Figure 16.1 Glucose concentrations and insulin infusion rates in two simulated, moderately insulin-responsive patients. Upper panels: Results with a perfect glucose assay. Lower panels: Results with a glucose assay with 20% bias and imprecision (expressed as CV) of 5%. Insulin infusion rates for patients were determined by the Yale regimen (left panels) or the University of Washington regimen (right panels).
[Figure 16.2 near here: panels as in Fig. 16.1 for a moderately insulin-responsive patient (Yale, left) and an insulin-responsive patient (Washington, right); lower panels, assay with bias = 5%, CV = 10%.]
Figure 16.2 Glucose concentrations and insulin infusion rates in two simulated patients. Upper panels: Results with a perfect glucose assay. Lower panels: Results with a glucose assay with 5% bias and imprecision (expressed as CV) of 10%. Insulin infusion rates for patients were determined by the Yale regimen (left panels) or the University of Washington regimen (right panels).
[Figure 16.3 near here: panels as in Fig. 16.1 for an insulin-responsive patient (Yale, left) and a moderately insulin-resistant patient (Washington, right); lower panels, assay with bias = 0%, CV = 20%.]
Figure 16.3 Glucose concentrations and insulin infusion rates in two simulated patients. Upper panels: Results with a perfect glucose assay. Lower panels: Results with a glucose assay with 0% bias and imprecision (expressed as CV) of 20%. Insulin infusion rates for patients were determined by the Yale regimen (left panels) or the University of Washington regimen (right panels).
[Figure 16.4 near here: panels as in Fig. 16.1 for an insulin-resistant patient (Yale, left; glucose scale extends to 1100 mg/dL) and a moderately insulin-resistant patient (Washington, right); lower panels, assay with bias = −15%, CV = 20%.]
Figure 16.4 Glucose concentrations and insulin infusion rates in two simulated patients. Upper panels: Results with a perfect glucose assay. Lower panels: Results with a glucose assay with −15% bias and imprecision (expressed as CV) of 20%. Insulin infusion rates for patients were determined by the Yale regimen (left panels) or the University of Washington regimen (right panels).
expected, the true glucose concentration is displaced downward compared with the perfect assay in the top panel, oscillating between approximately 70 and 120 mg/dL rather than between 100 and 150 mg/dL. The biased assay gives measured glucose concentrations higher than the true glucose concentrations, and thus higher rates of insulin infusion are selected to hold the measured glucose near the target range as measured by the meter. The true glucose is displaced downward.

Figure 16.2, left, shows a different simulated patient (with moderate insulin responsiveness) on the Yale regimen, now using in the bottom panel a less severely biased glucose assay (5%), but one with higher imprecision (CV = 10%), well within the range reported for glucose meters. It can immediately be appreciated that the control of the glucose concentrations and insulin rates in the bottom panel is "noisier" than in the top panel. Thus, the increased imprecision of the glucose assay results in increased variability of the simulated patient's glucose control.

Figure 16.3, left, shows another simulated patient on the Yale regimen, with low insulin resistance, as can be judged by the relatively low hourly insulin administration rates. In the bottom panel, a highly imprecise glucose assay (CV = 20%) with zero bias has been used to make the glucose measurements. It is easy to appreciate a serious degradation in the precision of glucose control in this simulated patient when the more imprecise assay is used. The true glucose concentrations range from approximately 70 to 280 mg/dL in this panel. A major motivation for TGC protocols is to avoid such high concentrations of glucose. Note that these high concentrations are seen despite the absence of bias in the measurements; there is only increased variability (imprecision) of measurement.

Figure 16.4, left, presents a simulated patient who is insulin resistant (i.e., requires higher doses of insulin to control glucose). With the perfect assay, the patient's glucose is well controlled, although much higher insulin doses are required to control it. In the bottom panel, with a highly imprecise (CV = 20%) and strongly negatively biased (bias = −15%) assay, we again see wide fluctuations in glucose control, and eventual loss of glucose control. The imprecise assay results appear to have totally fooled the insulin regimen, such that it gives inappropriately low doses of insulin in the face of ever-increasing glucose concentrations that eventually reach 1000 mg/dL. Although such escape from control could be detected by caregivers, who could intervene, the example points out that it is possible to fool an insulin regimen when using glucose measurements of very poor quality.

Each contour plot in Figs. 16.5–16.8 summarizes 450,000 glucose measurements made in 4500 simulated patients (100 patients for each combination of measurement bias and imprecision) as described in Section 3.
[Figure 16.5 near here: contour plots of the percentage of glucoses <60 mg/dL over assay bias (%, −20 to 20) and CV (%, 0 to 20); Yale (top), Washington (bottom).]
Figure 16.5 Contour plots showing percentage of glucose results that were <60 mg/dL when using meters that have the indicated bias and imprecision (CV). Insulin infusion rates were determined according to the Yale regimen (top panel) or the University of Washington regimen (bottom panel).
The input for each patient is based on simulated measurements of glucose, adjustment of the insulin infusion rate based on the measured glucose, and calculation of the change in glucose concentration; this cycle is repeated for 100 h, as shown for the example patients in Figs. 16.1–16.4. The upper panels in Figs. 16.5–16.8 show results for the Yale regimen and the lower panels for the UW regimen. As before, we will first describe the findings for the Yale regimen (upper panels).

Figure 16.5, top panel, shows the relationship between assay quality measures and the frequency of hypoglycemia observed with the
[Figure 16.6 near here: contour plots of the percentage of glucoses >160 mg/dL over assay bias (%) and CV (%); Yale (top), U. Washington (bottom).]
Figure 16.6 Contour plots showing percentage of glucose results that were >160 mg/dL when using meters that have the indicated bias and imprecision (CV). Insulin infusion rates were determined according to the Yale regimen (top panel) or the University of Washington regimen (bottom panel).
Yale insulin regimen. To use this plot, a particular set of bias and imprecision conditions is chosen to match a given glucose assay. Suppose an assay has a bias of +7% and an imprecision (CV) of 5%. Reading from the isocontours that represent the rate of hypoglycemia (in percent of readings), an assay with these performance characteristics would lead to a rate of hypoglycemia between 0.0% and 0.2%. Thus, for bias and imprecision conditions that fall below the 0.2% isocontour, hypoglycemia would be predicted by the simulation to occur no more frequently than 0.2% of
[Figure 16.7 near here: contour plots of mean glucose (mg/dL) as a function of assay bias (%) and CV (%); Yale (top), U. Washington (bottom).]
Figure 16.7 Contour plots showing mean glucose concentrations when using meters that have the indicated bias and imprecision (CV). Insulin infusion rates were determined according to the Yale regimen (top panel) or the University of Washington regimen (bottom panel).
the time. The frequency of hypoglycemia is increased by positive assay bias and by increased imprecision. As assay bias is increased toward +20% and imprecision is increased toward a CV of 20%, the simulation results suggest that the observed rates of hypoglycemia could exceed 2% of observations.

Figure 16.6 shows similar plots for the percentage of true glucoses that exceed 160 mg/dL—a rough measure of the control of hyperglycemia. As
[Figure 16.8 near here: contour plots of variability in glucose control, expressed as standard deviation (mg/dL), over assay bias (%) and CV (%); Yale (top), U. Washington (bottom).]
Figure 16.8 Contour plots showing imprecision of glucose control (expressed as SD) when using meters that have the indicated bias and imprecision (CV). Insulin infusion rates were determined according to the Yale regimen (top panel) or the University of Washington regimen (bottom panel).
the assay becomes more negatively biased toward −20%, and assay imprecision (as CV) increases toward 20%, the simulation suggests that more than 60% of true glucose concentrations could exceed 160 mg/dL. The mean glucose concentration was inversely related to assay bias—the higher the bias in the positive direction, the lower the mean true glucose (Fig. 16.7). Variability in glucose control (as measured by the average standard deviation of glucose results in a patient, Fig. 16.8) increased rapidly when glucose assay imprecision exceeded 10%.
4.2. University of Washington regimen

We will now turn to the UW regimens, which require some additional description. Table 16.1 shows the four regimens that dictate the insulin infusion rate (in units per hour) for a given glucose concentration. The regimens have been designed for patients with differing insulin resistance. Thus, regimen 1 is for insulin-sensitive patients (who respond readily to low doses of insulin), whereas regimens 2, 3, and 4 are used in increasingly insulin-resistant patients. Higher insulin doses are administered for a given glucose concentration as the regimens progress from regimen 1 to regimen 4.

Table 16.1 The University of Washington standardized insulin administration regimen(a)

Glucose <60 mg/dL = hypoglycemia (administer 50 mL D50W; notify MD if unresolved in 20 min)

Glucose (mg/dL)   Regimen 1 (U/h)   Regimen 2 (U/h)   Regimen 3 (U/h)   Regimen 4 (U/h)
<70               Off               Off               Off               Off
70–109            0.2               0.5               1                 1.5
110–119           0.5               1                 2                 3
120–149           1                 1.5               3                 5
150–179           1.5               2                 4                 7
180–209           2                 3                 5                 9
210–239           2                 4                 6                 12
240–269           3                 5                 8                 16
270–299           3                 6                 10                20
300–329           4                 7                 12                24
330–359           4                 8                 14                29
>360              6                 12                16

(a) Modified from Clement et al. (2004), as adapted from Trence et al. (2003).

To apply the UW approach, the correct insulin regimen has to be selected for each patient. Most patients start out on regimen 1 and are then moved from one regimen to another. Separate criteria are defined for moving up one regimen versus moving down a regimen. For deciding to move up to the next higher regimen, the current regimen is deemed a failure when the measured blood glucose is above the target range (80–180 mg/dL) and the blood glucose does not change by at least 60 mg/dL within 1 h after administration of insulin on the current regimen; when this happens, the decision is made to move to the next higher regimen. The next lower regimen is used when the measured blood glucose is <70 mg/dL for two consecutive measurements. (A code sketch of these switching rules appears at the end of this section.)

Returning to Fig. 16.1, we can compare results in example patients for the UW regimen (right two panels) with the Yale University regimen (left two panels). The upper panel on each side presents the results when glucose is measured with a perfect assay, and the lower panel presents results when an imperfect method is used for glucose measurement. With a perfect glucose assay (top), each insulin regimen controls blood glucose concentrations well (although with a different pattern of oscillation of the values). When glucose is measured by a strongly positively biased assay with 5% imprecision (CV), shown in the lower panels, both regimens appear to control glucose, although the true glucose has been displaced to lower values. Note that an episode of hypoglycemia in the lower right panel at about 3 h is obscured by the high bias of the meter and would have gone unrecognized.

Figure 16.2 shows, for the example patients, that the UW and Yale regimens appear to control glucose within similar bounds when the glucose assay has a bias of 5% and an imprecision of 10%. The UW regimen appears, for this example patient, to show much more variability of glucose and more frequent changes of insulin infusion rate than is seen with the Yale algorithm.

Figure 16.3 compares the regimens, in the bottom panels, for a glucose assay with no bias but an imprecision of 20%. With the highly imprecise glucose assay, both insulin regimens allow extreme variability of glucose, including, for the UW regimen, an episode of marked hypoglycemia (at
about 94 h, lower right panel). Here, it should be noted that the particular patient chosen for the UW simulation on the right is much more insulin resistant than the patient chosen for the Yale simulation on the left. Nevertheless, the effects of high assay imprecision appear to be similar.

Figure 16.4 shows representative results for a highly inaccurate and imprecise glucose assay. Whereas the Yale regimen allows extreme hyperglycemia and appears to have totally lost control of glucose (lower left), the UW regimen shows extreme fluctuations of glucose and again allows at least one episode of hypoglycemia. Common to the two regimens, worsening assay performance has detrimental effects on glucose control.

Returning to the contour plots that show results in 4500 simulated patients (Figs. 16.5–16.8), the lower panel of Fig. 16.5 shows the effect of simulated assay bias and imprecision on the percentage of glucoses <60 mg/dL for the UW regimen. Compared with the Yale regimen, the UW regimen appeared to give a higher frequency of hypoglycemia with increasing glucose assay imprecision. This effect suggests that it is particularly important to maintain glucose assay imprecision at low levels when using the UW regimen.

Figure 16.6, bottom panel, shows the effect of simulated assay bias and imprecision on the percentage of glucoses >160 mg/dL for the UW regimen. The rate of above-target glucose concentrations appears to increase more slowly with increasing negative assay bias using the UW regimen than using the Yale regimen. As with the Yale regimen, the mean glucose concentration obtained using the UW regimen was inversely related to assay bias—the higher the bias in the positive direction, the lower the mean glucose (Fig. 16.7, bottom panel). Interestingly, increasing assay imprecision is associated with increased mean glucose when using the Yale regimen, but this effect is not seen with the UW regimen. Variability in glucose control (as measured by the mean standard deviation of glucose results in a patient) also increased rapidly when glucose assay imprecision exceeded 10% (Fig. 16.8, bottom panel). As we noted earlier, the frequency of glucose concentrations >160 mg/dL was directly related to negative assay bias and increased imprecision.
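For concreteness, the following minimal Python sketch encodes the UW regimen-switching rules described in Section 4.2; the function and variable names are illustrative and do not correspond to the SAS code of Appendix 2.

```python
def update_uw_regimen(regimen, measured, prev_measured, low_count):
    """Sketch of the UW switching rules: thresholds taken from the text."""
    # Move up: measured glucose above the 80-180 mg/dL target range and
    # a fall of less than 60 mg/dL over the past hour on the current regimen.
    if measured > 180 and prev_measured is not None \
            and (prev_measured - measured) < 60:
        regimen = min(regimen + 1, 4)
    # Move down: measured glucose < 70 mg/dL on two consecutive readings.
    low_count = low_count + 1 if measured < 70 else 0
    if low_count >= 2:
        regimen, low_count = max(regimen - 1, 1), 0
    return regimen, low_count
```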
5. Discussion

In this study, we have modeled the effect of errors in glucose measurement on the ability to achieve TGC in patients. The model predicts that measurement error degrades glucose control, and that the effects of measurement error on glucose control depend on the regimen chosen for selection of the rate of intravenous infusion of insulin.
Current approaches to measurement of glucose in TGC programs vary widely in accuracy (freedom from bias) and imprecision. A large early study that showed decreased mortality from TGC used a precise and accurate analyzer (Radiometer ABL Blood Gas Analyzer) with an imprecision (expressed as CV) of <2.8% and <3.5% at concentrations of 92 and 220 mg/dL, respectively, including all results, even rare outliers (personal communication to DEB, Roger Bouillon, 16 March 2002). By contrast, many subsequent studies used "glucose meters," often from unspecified manufacturers. Numerous studies have demonstrated that glucose meters have greater imprecision and biases than do central laboratory or blood gas analyzer methods. One study from the CDC (Kimberly et al., 2006) of five common glucose meters showed mean differences versus a central laboratory method as high as 32% and imprecision (CVs) of 6–11% in the hands of a single trained medical technologist. Several studies have documented that glucose results produced from glucose meters by healthcare workers and patients have worse imprecision (higher CVs) than those generated by laboratory technologists. College of American Pathologists proficiency testing shows CVs for 17 glucose meter methods (19,597 sites) of 12–14% and bias between any two methods as high as 41%. Our results suggest that use of such meters will severely degrade the control of blood glucose with either the Yale regimen or the UW regimen. We do not envision a protocol that can overcome these limitations short of using very frequent sampling.

We emphasize that the simulation model implemented in these studies does not account for test result variations due to patient factors (drugs, interferents, matrix effects), sample collection artifacts (such as drawing blood from IV lines, or collecting skin-puncture blood in the presence of hypoperfusion of skin capillaries, as is often seen in critically ill patients), or the occurrence of random spurious test results. All of these factors are important considerations and will serve only to inflate the observed assay bias and imprecision estimates. Thus, merely establishing that an assay operates in an apparently acceptable range of imprecision and bias does not mean that these other factors cannot degrade the ability of a regimen to achieve glucose concentrations in the target range. Finally, it appears from these studies that the effects of measurement inaccuracy and imprecision should be carefully weighed in decisions to implement intensive IV insulin regimens.

Our study has several limitations. Any simulation model applied to the evaluation of the analytical quality required for TGC regimens would need to be adaptable to the wide variety of regimens that have been proposed. Each regimen may have been designed to meet the needs of a differing patient population, may require different levels of nursing attention, and may have a different goal range for glucose and a greater or lesser
emphasis on avoiding hypoglycemia. It is left to future studies to investigate the effects of assay errors on regimens beyond the two modeled here. As mentioned earlier, our model of physiologic control of plasma glucose accounts for only some of the many characteristics of the patient and does not address sample collection, matrix effects, variations in patient activity, variations in nutritional intake, and many other potentially relevant variables. Thus, this work represents a proof-of-concept approach to the use of simulation modeling.

Despite these limitations, the work described here points to the performance quality of glucose measurement as a critical but overlooked factor in the success of TGC programs. The landmark large study of Van den Berghe et al. (2001) used the precise and accurate glucose method mentioned earlier and demonstrated markedly decreased mortality with TGC. Subsequent reports have most often used relatively imprecise and inaccurate glucose meters, and a meta-analysis of studies of TGC found no decrease in overall mortality with TGC when data from all studies (including Van den Berghe et al., 2001) were analyzed. Moreover, a recent multinational study showed increased mortality with TGC (The NICE-SUGAR Study Investigators, 2009). This latter finding is not surprising given that meters from various manufacturers were used with a single regimen for adjusting insulin infusion rates. A single regimen cannot be appropriate for the variety of available glucose meters, some of which have positive biases and others of which have negative biases. The single regimen will lead to administration of too much insulin to patients monitored by glucose meters that give falsely high results and too little insulin to patients monitored with meters that report falsely low results. Finally, it is simply unrealistic to aim to keep glucose within an interval that represents a range of plus or minus a few percent when the glucose measuring device has a CV of more than 10%.

Future studies of TGC must concentrate on use of the better methods for measurement of glucose, or they risk loss of the benefits of TGC or even harm to patients. It is anticipated that simulation modeling can be a valuable tool in the design of future studies and in understanding the effect of measurement accuracy and precision on desired outcomes for patients treated by TGC protocols.
REFERENCES

Boyd, J. C., and Bruns, D. E. (2001). Quality specifications for glucose meters: Assessment by simulation modeling of errors in insulin dose. Clin. Chem. 47, 209–214.

Clement, S., Braithwaite, S. S., Magee, M. F., Ahmann, A., Smith, E. P., Schafer, R. G., and Hirsch, I. B., American Diabetes Association Diabetes in Hospitals Writing Committee (2004). Management of diabetes and hyperglycemia in hospitals. Diabetes Care 27, 553–597.

Expert Panel on Detection, Evaluation, and Treatment of High Blood Cholesterol in Adults (2001). Executive summary of the third report of the National Cholesterol Education Program (NCEP) expert panel on detection, evaluation, and treatment of high blood cholesterol in adults (Adult Treatment Panel III). J. Am. Med. Assoc. 285, 2486–2497.

Fraser, C. G., and Petersen, P. H. (1999). Analytical performance characteristics should be judged against objective quality specifications. Clin. Chem. 45, 321–323.

Goldberg, P. A., Siegel, M. D., Sherwin, R. S., Halickman, J. I., Lee, M., Bailey, V. A., Lee, S. L., Dziura, J. D., and Inzucchi, S. E. (2004). Implementation of a safe and effective insulin infusion protocol in a medical intensive care unit. Diabetes Care 27, 461–467.

Griesdale, D. E. G., de Souza, R. J., van Dam, R. M., Heyland, D. K., Cook, D. J., Malhotra, A., Dhaliwal, R., Henderson, W. R., Chittock, D. R., Finfer, S., and Talmor, D. (2009). Intensive insulin therapy and mortality among critically ill patients: A meta-analysis including NICE-SUGAR study data. Can. Med. Assoc. J. 180, 821–827.

Kimberly, M. M., Vesper, H. W., Caudill, S. P., Ethridge, S. F., Archibold, E., Porter, K. H., and Myers, G. L. (2006). Variability among five over-the-counter blood glucose monitors. Clin. Chim. Acta 364, 292–297.

Kovatchev, B. P., Breton, M., Man, C. D., and Cobelli, C. (2009). In silico preclinical trials: A proof of concept in closed-loop control of type 1 diabetes. J. Diabetes Sci. Technol. 3, 44–55.

Morrow, D. A., Cannon, C. P., Jesse, R. L., Newby, L. K., Ravkilde, J., Storrow, A. B., Wu, A. H., and Christenson, R. H. (2007). National Academy of Clinical Biochemistry laboratory medicine practice guidelines: Clinical characteristics and utilization of biochemical markers in acute coronary syndromes. Circulation 115, e356–e375.

Petersen, P. H. (1999). Quality specifications based on analysis of effects of performance on clinical decision-making. Scand. J. Clin. Lab. Invest. 59, 517–521.

Price, C. P., Bossuyt, P. M. M., and Bruns, D. E. (2006). Introduction to laboratory medicine and evidence-based laboratory medicine. In "Tietz Textbook of Clinical Chemistry and Molecular Diagnostics" (C. A. Burtis, E. R. Ashwood, and D. E. Bruns, eds.), 4th edn., pp. 323–351. Elsevier, Philadelphia, PA.

Scott, M. G., Bruns, D. E., Boyd, J. C., and Sacks, D. B. (2009). Tight glucose control in the intensive care unit: Are glucose meters up to the task? Clin. Chem. 55, 18–20.

The NICE-SUGAR Study Investigators (2009). Intensive versus conventional glucose control in critically ill patients. N. Engl. J. Med. 360, 1283–1297.

Toffolo, G., Bergman, R. N., Finegood, D. T., Bowden, C. R., and Cobelli, C. (1980). Quantitative estimation of beta cell sensitivity to glucose in the intact organism: A minimal model of insulin kinetics in the dog. Diabetes 29, 979–990.

Trence, D. L., Kelly, J. L., and Hirsch, I. B. (2003). The rationale and management of hyperglycemia for in-patients with cardiovascular disease: Time for change. J. Clin. Endocrinol. Metab. 88, 2430–2437.

Van den Berghe, G., Wouters, P., Weekers, F., Verwaest, C., Bruyninckx, F., Schetz, M., Vlasselaers, D., Ferdinande, P., Lauwers, P., and Bouillon, R. (2001). Intensive insulin therapy in critically ill patients. N. Engl. J. Med. 345, 1359–1367.

Van den Berghe, G., Wilmer, A., Hermans, G., Meersseman, W., Wouters, P. J., Milants, I., Van Wijngaerden, E., Bobbaers, H., and Bouillon, R. (2006). Intensive insulin therapy of medical intensive care patients. N. Engl. J. Med. 354, 449–461.

Vlasselaers, D., Milants, I., Desmet, L., Wouters, P. J., Vanhorebeek, I., van den Heuvel, I., Mesotten, D., Casaer, M. P., Meyfroidt, G., Ingels, C., Muller, J., Van Cromphaut, S., et al. (2009). Intensive insulin therapy for patients in paediatric intensive care: A prospective, randomised controlled study. Lancet 373, 547–556.

Wiener, R., Wiener, D. C., and Larson, R. J. (2008). Benefits and risks of tight glucose control in critically ill adults: A meta-analysis. J. Am. Med. Assoc. 300, 933–944.
C H A P T E R
S E V E N T E E N
Nonlinear Dynamical Analysis and Optimization for Biological/Biomedical Systems

Amos Ben-Zvi and Jong Min Lee

Chemical and Materials Engineering, University of Alberta, Edmonton, Alberta, Canada

Contents
1. Introduction
2. Hypothalamic–Pituitary–Adrenal Axis System
   2.1. Background
   2.2. System model
   2.3. Steady-state analysis
3. Development of a Clinically Relevant Performance-Assessment Tool
   3.1. Construction of an invariant manifold
   3.2. Evaluation of treatment options
   3.3. Development of an appropriate optimal control objective
4. Dynamic Programming
   4.1. DP for deterministic systems
   4.2. Minimizing worst-case cost under uncertainty using DP
5. Computation of Optimal Treatments for HPA Axis System
   5.1. Deterministic optimization
   5.2. Worst-case optimization
6. Conclusions
Acknowledgments
References
Abstract

As mathematical models are increasingly available for biological/biomedical systems, dynamic optimization can be a useful tool for manipulating such systems. Dynamic optimization is a computational tool for finding a sequence of optimal actions to attain desired outcomes from the system. This chapter discusses two dynamic optimization algorithms, model predictive control and dynamic programming, in the context of finding an optimal treatment strategy for correcting hypothalamic–pituitary–adrenal (HPA) axis dysfunction. It is shown that the dynamic programming approach has advantages over the model predictive control (MPC) methodology in terms of robustness to error in parameter estimates and flexibility in accommodating a clinically relevant objective function.
1. Introduction

Many biological and biomedical systems show dynamic behaviors at different length and time scales. Mathematical modeling is useful for quantitative analysis of dynamic behaviors, and such a model typically takes the form of ordinary differential equations. For example, the metabolism of yeast has been modeled as a set of reaction kinetics involving biochemical entities inside the cell to predict cell growth and other cellular phenotypes (Klipp et al., 2006; Rizzi et al., 1997; Teusink et al., 2000). Such modeling efforts are found in a number of areas, including cellular networks, developmental biology, and biomedical applications (Deutsch et al., 2007).

A system can be considered as a mapping between independent variables (inputs) and dependent variables (outputs) when there are more variables than equations. In such a case, the independent variables can be specified to achieve desired outcomes from the system. For example, one can find optimal operating conditions (e.g., temperature, feed rate, etc.) of a bioreactor to maximize cell growth. Dynamic optimization is concerned with finding optimal values of the inputs at each decision time point to optimize a performance criterion. In general, a total cost or reward to be incurred during the time window of interest is used as the performance index, that is, the objective function, and the system dynamics serve as path constraints that the system must follow.

In this chapter, we investigate two major methodologies for formulating dynamic optimization problems. The first is model predictive control (MPC), which can be used to find optimal solutions when the model is exact. The other is dynamic programming (DP), which is a general framework for computing optimal solutions for models with uncertainties. It is proposed that DP possesses several advantages over MPC for biological/biomedical systems. DP offers flexibility in choosing the performance criterion. In addition, robust optimization can be handled naturally in the DP framework when the system has uncertainty (including uncertainty in system parameters and in the measurement of key variables). The hypothalamic–pituitary–adrenal (HPA) axis, a neuroendocrine control system, is studied to illustrate the methodologies. The ability of DP to explicitly assign a cost to each point in the state space is important because it allows great flexibility in incorporating information regarding side effects of the treatment into the optimization algorithm.
2. Hypothalamic–Pituitary–Adrenal Axis System

2.1. Background

The hypothalamic–pituitary–adrenal (HPA) axis is a neuroendocrine control system by which the brain regulates adrenocortical glucocorticoid secretion. Glucocorticoids play an important role in regulating the stress response of the human body as well as affecting mood and cognition. HPA dysfunction has been associated with a variety of stress-related disorders including fibromyalgia and chronic fatigue syndrome (CFS). Although an extensive body of literature has been devoted to studying the HPA axis system (Jacobson, 2005) and the relation of the HPA axis to CFS (Cleare et al., 2001; Crofford et al., 2004; Giorgio et al., 2005), little attention has been paid to using a systems-based approach for explaining HPA axis dysfunction and offering corrective treatment.

The HPA axis can be viewed as a feedback system designed to maintain homeostasis in the face of external stress (Jacobson, 2005). Stress activates the release of corticotropin-releasing hormone (CRH) in the hypothalamus. The release of CRH into the anterior pituitary stimulates the release of adrenocorticotropic hormone (ACTH). ACTH, in turn, stimulates the release of the glucocorticoid cortisol from the adrenal cortex. Cortisol has regulating effects on peripheral tissues, including regulation of the immune response (Kimas et al., 1990). Examples of stresses that affect the HPA axis dynamics include viral infection, dehydration, and fear. Furthermore, cortisol downregulates the production of ACTH and CRH. A schematic of this feedback path is shown in Fig. 17.1.

A lumped-parameter model of the system shown in Fig. 17.1 has been proposed by Gupta et al. (2007). This model of the HPA axis system predicts the existence of two stable steady states. One of these steady-state points corresponds to a state of hypocorticolism (lowered cortisol concentration), which has been observed in CFS patients (Crofford et al., 2004). Cortisol has been shown to have an inhibitory effect on the immune system in general, and on the production of interleukin-6 and
[Figure 17.1 near here: block diagram in which external stress drives the hypothalamus to release CRH, CRH drives the pituitary gland to release ACTH, ACTH drives the adrenal gland to release cortisol, and cortisol feeds back on the hypothalamus and pituitary.]
Figure 17.1 Block diagram of the HPA dynamics.
interleukin-1 beta in particular (Kimas et al., 1990). As a result, hypocorticolism leads to an overactive immune system, which has been observed in CFS and fibromyalgia patients.

While the model proposed by Gupta et al. (2007) includes the HPA axis dynamics, it does not include the effect of pharmacological agents that can be used to treat HPA axis dysfunction. Several pharmacological agents have been shown to affect the dynamics of the HPA axis, including cyproheptadine, which decreases the amplitude of ACTH secretions, as well as exogenous ACTH (EACTH). The model proposed by Gupta et al. (2007) can be augmented to take into account the effects of cyproheptadine and EACTH.

In this chapter, the model proposed by Gupta et al. (2007) is modified and used to develop robust and effective dynamic optimization strategies to treat HPA axis dysregulation. First, it is shown that the system has two stable homeostatic rest points, whose basins of attraction form a dense subset of the state space. Repair of HPA axis deregulation occurs when the system is driven from the basin of attraction of the hypocorticolic rest point to the basin of attraction of the healthy rest point. Secondly, the boundary between the two basins is shown to be a parameter-dependent invariant manifold of codimension one. Finally, the formulation of a clinically relevant dynamic optimization is discussed and solved using DP.

2.2. System model

A model of the HPA axis proposed by Gupta et al. (2007), modified to include the effect of pharmacological agents, is described by the differential equation

$$\dot{x} = f(x, u, p, d) \qquad (17.1)$$
where x = [x1, x2, x3, x4]^T are the system states, whose description is given in Table 17.1. The system parameters are given by the vector p = [k_i1, k_cd, k_ad, k_i2, k_cr, k_rd, k]^T; nominal values for the system parameters are listed in Table 17.2.

Table 17.1 States of HPA axis model

State   Description              Stable rest points
x1      CRH concentration        (0.6261, 0.6610)
x2      ACTH concentration       (0.0597, 0.0513)
x3      Free GR receptor         (0.0809, 0.5629)
x4      Cortisol concentration   (0.0597, 0.0513)

Table 17.2 Nominal parameters of HPA axis model

Parameter   Description                               Value
k_i1        Inhibition constant for CRH production    0.1
k_cd        CRH degradation constant                  1.0
k_ad        ACTH degradation constant                 10.0
k_i2        Inhibition constant for ACTH production   0.1
k_cr        GR production constant                    0.05
k_rd        GR degradation constant                   0.9
k           Inhibition constant for GR production     0.001

The variable d in Eq. (17.1) is the stress term, which describes the effect of stress (both physical and psychological) on the hypothalamus. This variable is seen as a disturbance that perturbs the system (17.1) from a steady-state value. The variable u is the input, which models the effect of pharmacological agents on the system. Administration of EACTH will have a positive impact on ACTH concentration (u > 0), while administration of cyproheptadine will have a negative impact (u < 0). The mapping f : R^4 × R^7 × R × R → R^4 is given by

$$
f = \begin{bmatrix}
\dfrac{1}{1 + x_4/k_{i1}} - k_{cd}\,x_1 \\
\dfrac{x_1}{1 + x_3 x_4/k_{i2}} - k_{ad}\,x_2 \\
k_{cr}\,\dfrac{(x_3 x_4)^2}{k + (x_3 x_4)^2} - k_{rd}\,x_3 \\
x_2 - x_4
\end{bmatrix}
+ \begin{bmatrix} 0 \\ x_2 \\ 0 \\ 0 \end{bmatrix} u
+ \begin{bmatrix} x_1 \\ 0 \\ 0 \\ 0 \end{bmatrix} d
\qquad (17.2)
$$
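For readers who wish to experiment with the model, the following Python sketch transcribes Eq. (17.2) and the Table 17.2 parameters; it reflects our reading of the equation as printed above, with the third (GR) state equation being the least certain.

```python
import numpy as np

# nominal parameter values from Table 17.2
P = dict(ki1=0.1, kcd=1.0, kad=10.0, ki2=0.1, kcr=0.05, krd=0.9, k=0.001)

def f(x, u=0.0, d=0.0, p=P):
    """Right-hand side of Eq. (17.2) as reconstructed above (a sketch)."""
    x1, x2, x3, x4 = x
    g = (x3 * x4) ** 2
    dx1 = 1.0 / (1.0 + x4 / p["ki1"]) - p["kcd"] * x1 + x1 * d      # CRH
    dx2 = x1 / (1.0 + x3 * x4 / p["ki2"]) - p["kad"] * x2 + x2 * u  # ACTH
    dx3 = p["kcr"] * g / (p["k"] + g) - p["krd"] * x3               # free GR
    dx4 = x2 - x4                                                   # cortisol
    return np.array([dx1, dx2, dx3, dx4])
```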
2.3. Steady-state analysis

It has previously been shown (Ben-Zvi et al., 2009) that when no control action is exerted (i.e., when u = 0), the HPA axis model has two stable steady-state points over a range of stress values 0.021 ≤ d ≤ 0.168. This is shown graphically by plotting the steady-state values of all four model states in Fig. 17.2. As can be seen from Fig. 17.2, under normal conditions corresponding to d = 0, the HPA axis has both a normal and a diseased state (the latter corresponding to hypocorticolism). Furthermore, the steady-state map shown in Fig. 17.2 shows that a person starting at the healthy steady state (corresponding to the higher cortisol concentration) may be drawn to the lower cortisol concentration after being exposed to a high stress level (d ≥ 0.168) for a prolonged period of time. This process is labeled "Path A" in Fig. 17.2; the high stress level pushes the HPA axis system into an operating region where only one steady state exists, corresponding to the diseased state. When the external stress is removed, the homeostatic mechanism of the HPA axis drives the system to the steady state corresponding to hypocorticolism.
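The bistable steady-state map can be reproduced numerically by solving f(x, 0, d) = 0 from different initial guesses as d is swept; a minimal sketch follows, using the f defined in the Section 2.2 sketch and initial guesses (ours) near the rest points of Table 17.1.

```python
import numpy as np
from scipy.optimize import fsolve

# assumes the function f from the Section 2.2 sketch is in scope
for d in np.linspace(-0.25, 0.25, 11):
    for x0 in ([0.63, 0.06, 0.09, 0.06], [0.66, 0.05, 0.56, 0.05]):
        xs = fsolve(lambda x: f(x, u=0.0, d=d), x0)
        print(f"d={d:+.2f}  x4={xs[3]:.4f}")
```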
[Figure 17.2 near here: steady-state x4 (cortisol, approximately 0.045–0.075) versus stress d (−0.25 to 0.25), with hysteresis paths A and B marked.]
Figure 17.2 Steady-state values of x4 (cortisol) as functions of d. Path A is a hypothetical path taken by the cortisol concentration of an individual proceeding from a healthy homeostatic equilibrium point to a hypocortical equilibrium point in response to prolonged stress. Path B is a hypothetical path taken by the cortisol concentration of an individual proceeding from a hypocortical to a healthy equilibrium.
Manipulation of system (17.2) to drive it from a hypocortical to a healthy rest point may be accomplished by reducing the cortisol concentration so that only a single homeostatic rest point exists. Once treatment is stopped, system (17.2) will return, under homeostatic feedback, to the healthy rest point. This path is labeled "Path B" in Fig. 17.2. It has previously been shown by Gupta et al. (2007) that there exists a (positive) cortisol concentration that is sufficiently low to ensure that system (17.2) can be driven to the healthy equilibrium. This conclusion holds regardless of the values of the system parameters (assuming all parameters are strictly positive).

The analysis by Gupta et al. (2007) shows that there exists a treatment that will bring system (17.2) from a hypocortical to a healthy equilibrium. However, it does not identify a single input trajectory that is optimal. Furthermore, manipulation of the HPA axis involves changing the concentrations of key hormones and, as such, will have side effects. Cortisol is a key hormone regulating blood sugar and blood pressure, as well as affecting sodium, potassium, and water loss. As a result, large deviations in cortisol concentration may cause serious side effects. In addition to affecting cortisol secretion, potential side effects of ACTH include irritability, increased appetite, and weight gain. Furthermore,
Nonlinear Dynamical Analysis and Optimization
441
cyproheptadine itself causes side-effects including nervousness headache and muscle weakness (Schu¨rmeyer et al., 1996). In this chapter, two methodological approaches, model predictive control (MPC) and dynamic programming (DP), to compute a theoretical treatment that will move system (17.2) to a healthy equilibrium are discussed. MPC has been the most popular approach in process control due to its ability to handle multivariable control problems with constraints (Morari and Lee, 1999). The potential problem of MPC is that the solution may be brittle in the sense that a small deviation in parameter values or uncertainty in the initial state values will make the solution ineffective (Guerlain et al., 2002). On the other hand, DP can provide the general framework for computing optimal solutions of systems under uncertainties. The results presented in this chapter are divided into two parts. First, a method for obtaining a clinically relevant metric for treatment performance will be developed. This method is based on assessing the effect of parameter uncertainty on the boundary between the healthy and hypocorticolic basins of attraction. A nonlinear MPC-based treatment is compared using the proposed method. Secondly, a DP based approach for the computation of treatment that explicitly incorporates a clinically relevant metric is presented.
3. Development of a Clinically Relevant Performance-Assessment Tools 3.1. Construction of an invariant manifold Dynamic optimization methods can be used to compute optimal input signals subject to a variety of constraints or cost functions. However, it is often not clear what the ideal objective function for an optimization scheme. For example, it is often not clear how the input or state penalty in the MPC objective function affects the robustness of the MPC solution to parameter uncertainty. In this section, a method for assessing the performance of the NLMPC controller is presented that allows proposed treatments to be evaluated and compared using a clinically relevant, heuristic-based, approach. It has previously been shown by Ben-Zvi et al. (2009) that system (17.2) has three steady-state points. At the nominal parameter values (i.e., p ¼ pnom) given in Table 17.2, u ¼ 0, d ¼ 0, the three steady-state points are given by xl ¼ ½0:661; 0:051; 0:563; 0:051 xm ¼ ½0:646; 0:055; 0:321; 0:055 xh ¼ ½0:627; 0:060; 0:087; 0:060
442
Amos Ben-Zvi and Jong Min Lee
Let the linearization of f (with respect to x) be denoted by @f Df ¼ @x jp¼pnom ;d¼0;u¼0 . The linearization of system (17.2) at xl, xm, and xh are as follows: 2 3 1 0 0 4:39 6 0:777 10 0:203 2:25 7 7 Df ðxl Þ ¼ 6 4 0 0 0:02 9:713 5 0 1 0 1 2
1 6 0:850 Df ðxm Þ ¼ 6 4 0 0 2
1 6 1 Df ðxh Þ ¼ 6 4 0 0
0 0 10 0:257 0 0:229 1 0
0 0 10 0 0 0:900 1 0
3 4:16 1:50 7 7 6:59 5 1
3 10:0 0 7 7 0 5 1
The eigenvalues ll, lm, and lh and eigenvectors Vl, Vm, and Vh of Df(xl), Df(xm) and Df(xh), respectively are given by ll ¼ f9:81; 1:03 þ 0:669i; 1:03 0:669i; 0:156g lm ¼ f9:90; 1:00 þ 0:660i; 1:00 0:660i; 0:123g lh ¼ f0:940 þ 1:05i; 0:940 1:05i; 10:1; 0:900g and 82 3 2 3 2 3 2 39 0:056 3:26 þ 2:75i 3:26 2:75i 0:919 > > > > <6 6 0:355 þ 0:254i 7 6 0:355 0:254i 7 6 0:149 7= 0:992 7 6 7 6 7 6 7 6 7 Vl ¼ 4 ; ; ; 0:112 5 4 0:383 5:22i 5 4 0:383 þ 5:22i 5 4 12:7 5> > > > : ; 0:113 0:399 þ 0:514i 0:399 0:514i 0:177 82 3 2 3 2 3 2 39 0:052 0:360 1:57i 0:360 þ 1:57i 0:682 > > > > <6 6 7 6 7 6 7= 0:992 7 7; 6 0:037 0:164i 7; 6 0:037 þ 0:164i 7; 6 0:207 7 Vm ¼ 6 4 0:073 5 4 0:910 þ 0:794i 5 4 0:910 0:794i 5 4 11:5 5> > > > : ; 0:111 0:249 0:057i 0:249 þ 0:057i 0:184 82 3 2 3 2 3 2 39 3:23 4:53i 3:23 þ 4:53i 0:119 0 > > > > <6 7 6 0:295 þ 0:534i 7 6 0:987 7 6 0 7= 0:295 0:534i 7; 6 7; 6 7; 6 7 Vh ¼ 6 4 5 4 5 4 0 5 4 1 5> 0 0 > > > : ; 0:493 0:31055i 0:493 þ 0:31055i 0:108 0
443
Nonlinear Dynamical Analysis and Optimization
Note that the order of the eigenvectors and eigenvalues in their respective sets was chosen so that ith eigenvector in each set corresponds to the ith eigenvalue for the same rest point with i 2 f1; 2; 3g (e.g., the third entry in Vh is the eigenvector for the third entry in lh). Two of the points, xl and xh, corresponding to low and high cortisol concentration, respectively, are stable in the sense that both Df ðxl ; pnom ; 0; 0Þ and Df ðxh ; pnom ; 0; 0Þ have only eigenvalues whose real parts are negative. The steady-state point, xm corresponding to an intermediate cortisol concentration is unstable in the sense that the linearization Df ðxm ; pnom ; 0; 0Þ has one eigenvalue with a positive real part and three eigenvalues with negative real parts. Using the stable manifold theorem in (Chicone, 1999), it is possible to show that the point xm lies on an invariant manifold of codimension one. Furthermore, this manifold separates the basin of attraction of points xl and xh. Theorem 17.1 (Chicone, 1999) Suppose that S : Rk ! Rk and U : Rl ! Rl are linear transformations such that all eigenvalues of S have real part less than a, all eigenvalues of U have real part greater than b, and a < b. If F 2 C 1 ðRk Rl ; Rk Þ and G 2 C 1 ðRk Rl ; Rl Þ are such that Fð0; 0Þ ¼ 0, DFð0; 0Þ ¼ 0, Gð0; 0Þ ¼ 0, DGð0; 0Þ ¼ 0 , and such that kFk1 and kGk1 are sufficiently small, then there is a unique function a 2 C 1 ðRk ; Rl Þ, with the following properties: að0Þ ¼ 0;
Dað0Þ ¼ 0;
e 2 sup Rk kDaðeÞk < 1
whose graph, namely the set W ð0; 0Þ ¼ fðz; yÞ 2 Rk Rl : y ¼ aðzÞg is an invariant manifold for the system of differential equations given by Eq. (17.1) z_ ¼ Sz þ Fðx; yÞ;
y_ ¼ Uy þ Gðz; yÞ
ð17:3Þ
To see how Theorem 17.1 is useful for analysis of the HPA axis system let x ¼ x xm . Furthermore, consider the Taylor series expansion of the mapping f from system (17.2) with respect to x, with u ¼ d ¼ 0. f ¼ Df ðxm Þ x þ Bð xÞ Furthermore, let 2 0:0196 6 0:00593 P¼6 4 0:330 0:00528
0:00505 0:0960 0:00702 0:0108
0:488 0:0510 0:161 0:00276
be a linear transformation such that the matrix
ð17:4Þ 3 0:0172 0:00184 7 7 0:328 5 0:0773
ð17:5Þ
444
Amos Ben-Zvi and Jong Min Lee
2
0:123 6 0 L ¼ P 1 Df ðxm ÞP ¼ 6 4 0 0
0 9:90 0 0
0 0 1:00 0:660
3 0 0 7 7 0:660 5 1:00
ð17:6Þ
is in real-Jordan form. In the coordinates x~ ¼ P 1 x, and using the Taylor series expansion shown in Eq. (17.4), system (17.2) can be written as x~_ ¼ L~ x þ BðP~ xÞ
ð17:7Þ
With respect to Theorem 17.1, let k ¼ 3, l ¼ 1, y ¼ x~1 , and z ¼ ½~ x1 ; x~2 ; x~3 T , where x~i denotes the ith element of x~. Furthermore, let 2 3 9:90 0 0 S¼4 0 1:00 0:660 5 0 0:660 1:00 and u ¼ [0.123]. Note that all eigenvalues of S are negative. Let the numbers a and b in Theorem 17.1 be 0 e and 0 þ e, respectively with e > 0 being an arbitrarily small number. Define the function F and G by the relation Gðz; yÞ BðP~ xÞ ¼ Bðz; yÞ ¼ ð17:8Þ Fðz; yÞ Note that G(z, y) and F(z, y) can be written as a sum of polynomials in z and y of order two or higher. As a result, G(0, 0) ¼ 0, F(0, 0) ¼ 0, DG (0, 0) ¼ 0, and DF(0, 0) ¼ 0 and all conditions of Theorem 17.1 are met. The conclusion of Theorem 17.1 implies that there exists an invariant manifold defined by the constraint y ¼ a(z). However, computation of the function a : R3 ! R is computationally intensive. As a result, an approximation to a will be computed. This approximation will be based on the linearization of Eq. (17.3). In particular, if yð0Þ ¼ 0, and about the steadystate point ðy; zÞ ¼ ð0; 0Þ, the differential equation becomes z_ ¼ Sz
ð17:9Þ
and it is exponentially stable. As a result, the set of points belonging to the invariant manifold described by Theorem 17.1 can be approximated locally about (y, z) ¼ (0, 0) by the set Z ¼ fz 2 R3 jkzk < ez g where ez > 0 is a small positive number. Computationally, the set Z may be mapped in the x coordinates by choosing a value z 2 Z and computing x ¼ Pð½0; z T Þ x ¼ xm þ x ¼ xm þ P~
ð17:10Þ
Nonlinear Dynamical Analysis and Optimization
445
The set surface generated by mapping the set Z into the x coordinates can be projected into three dimensions using the projection mapping pr : ðx1 ; x2 ; x3 ; x4 Þ7!ðx1 ; x2 ; x3 Þ. The projection is shown in Fig. 17.3A. The surface shown in Fig. 17.3A is a first-order approximation of the invariant manifold separating the basin of attraction of points xh and xl. This idea is illustrated in Fig. 17.3 where the system dynamics are integrated forward in time for t 2 ½0; 5 from different points on the surface. As can be seen from Fig. 17.3, the system trajectories approach xm for a small time. For a large time (t > 500), however, the system trajectories are driven away from xm and toward either xh and xl. This is shown in Fig. 17.4 which shows system trajectories for t 2 ½0; 500. In order for a treatment to be successful, it must drive the HPA axis system from xl to xh. To do this, the trajectory of the system must penetrate the boundary approximated by the surface shown in Fig. 17.3A. Generally, the prescription and application of medicine is done in two-step iterative process. First, the condition of the patient is assessed (observation) then a treatment is administered (i.e., control is applied). Finally, the condition of the patient is reassessed and further treatment is chosen as necessary. Typically, a patient will take their medication over a time span of several days or weeks. As a result, most medical treatment is, in effect an ‘‘open-loop’’ process where measurements are taken infrequently and typically only to verify that the open-loop treatment has worked. As a result, it is not generally possible to monitor the condition of a patient to determine when the boundary illustrated in Fig. 17.3A has been crossed. Rather, one must rely on a sufficiently robust open-loop treatment so that the boundary is likely to be crossed even if there is significant system-model mismatch or exogenous disturbances.
3.2. Evaluation of treatment options The approximate boundary computed in the previous section corresponds to a nominal and disturbance-free model. If there were no modeling errors and no unmodeled exogenous disturbances then an optimal treatment which employs the minimum amount of therapeutic intervention to penetrate the surface could be computed. For example, the optimal dosage for a drug could be chosen as the minimal dosage necessary to drive the HPA axis across the boundary. This idea is illustrated in Fig. 17.5 where the system trajectory under constant dosages corresponding to input values of uðtÞ ¼ 0:5; uðtÞ ¼ 0:75; uðtÞ ¼ 0:8 and u(t) ¼ 2 for t 2 ½0; 15 are shown. The point along the trajectory where treatment stops is shown as a black square along the trajectory path. As shown in Fig. 17.5, the input corresponding to u(t) ¼ 0.75 is insufficient to drive the HPA axis system across the boundary. That is, the black square located along the trajectory corresponding to u(t) ¼ 0.75 is above the boundary plane.
446
Amos Ben-Zvi and Jong Min Lee
A 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0.09 0.08
0.07
0.07
0.06
0.06 0.05
0.05 0.04
0.04 0.03
0.03 0.02
0.02
The projection of the set Z onto three dimension in the x coordinates.
B 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0.09 0.08
0.07
0.07
0.06
0.06 0.05
0.05 0.04
0.04 0.03
0.03 0.02
0.02
Scenario 2
Figure 17.3 Numerical integration of system (17.2) with initial conditions in Z for 0 t 5.
447
Nonlinear Dynamical Analysis and Optimization
0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0.09
0.08 0.07 0.06 0.05 0.04 0.03 0.02
0.625
0.63
0.635
0.64
0.645
0.65
0.655
0.66
0.665
0.67
0.675
Figure 17.4 Long-time numerical integration of system (17.2) with initial conditions in Z for 0 t 500.
The just-sufficient input (u(t) ¼ 0.8) does force the system across the boundary and therefore drives the system to the new equilibrium. Finally, the high dosage trajectory (u(t) ¼ 2) forces the system to cross the boundary and therefore drives the system to the new equilibrium. The disadvantage of using a high dosage is that, as shown in Fig. 17.5, the state trajectories do not stay close to the equilibrium conditions. As a result, if this treatment was applied, it would cause severe deviation in hormone concentrations. These deviations can cause serious side effects including high blood pressure, activated immune system, weight gain and irritability. The advantage of using a high dosage is that, as shown in Fig. 17.5, it drives the HPA axis well past the boundary plane. As a result, the high dosage treatment is likely to be effective even if the computed boundary surface is wrong due to modeling inaccuracies. This idea is illustrated in Fig. 17.6 where the response of the system to inputs of u ¼ 0.8 and u ¼ 2 are compared for two different values of the parameter kad, one corresponding to the nominal model (kad ¼ 10.0) and one corresponding to a small modeling error (kad ¼ 9.75). As shown in Fig. 17.6 while an input of uðtÞ ¼ 0:8 is effective under nominal conditions, it is likely to be ineffective even for small errors in the nominal model. An ideal control strategy would, therefore, not only penetrate but also drive the process some distance from the boundary surface. The ideal control strategy (i.e., treatment) would have three competing objectives. First, it should drive the HPA axis well past the nominal
448
Amos Ben-Zvi and Jong Min Lee
B 0.6
x3 (Free GR concentration)
x3 (Free GR concentration)
A 0.5 0.4 0.3 0.2 0.1
x
0.06 CT H c 0.05 onc 0.04 ent rati on)
2 (A
0.65
0.7
0.75
0.6 0.5 0.4 0.3 0.2 0.1
x2 ( 0.06 AC TH con 0.05 cen trat 0.04 ion )
on) 0.6 ntrati once RH c x 1 (C
Trajectory for u(t) = −0.5
0.7
0.75
n) 0.6 tratio oncen RH c x 1 (C
Trajectory for u(t) = −0.75
C
D 0.6
x3 (Free GR concentration)
x3 (Free GR concentration)
0.65
0.5 0.4 0.3 0.2 0.1
x2
0.06
(AC 0.05 TH con cen 0.04 trat ion )
0.65
0.7
n) tratio oncen RH c x 1 (C
0.6
Trajectory for u(t) = −0.80
0.75
0.6 0.5 0.4 0.3 0.2 0.1 0.06 x2 ( AC 0.05 TH con 0.04 cen trat ion )
0.65
0.7
0.75
n) 0.6 tratio oncen RH c x 1 (C
Trajectory for u(t) = −2.0
Figure 17.5 Constant dosage trajectories for constant dosage trajectories with t 2 [0, 15]. The square on the trajectory path is the point where treatment stops. The star and circle are the hypocortisolic and healthy rest points.
boundary and therefore ensure that the system will stabilize about the healthy equilibrium. Second, it would prevent large deviation in hormone levels. Finally, an ideal control strategy would minimize the amount of input action required to achieve the first two objectives. Such a control strategy would be robust in the face of modeling errors, would cause minimal side effects and would minimize the cost of treatment.
3.3. Development of an appropriate optimal control objective A general dynamic optimization problem commonly found in optimal control literature is as follows: min
u0; ...;utf 1
tf 1 X
fðxk ; uk Þ þ f ðxtf Þ
ð17:11Þ
i¼0
with the state transition rule given by Eq. (17.1) for a given initial state x0 and piecewise constant inputs u(t) ¼ uk and d(t) ¼ dk for kh t (k þ 1)h.
449
Nonlinear Dynamical Analysis and Optimization
Equilibria points for nominal system kcd = 9.75
0.6
0.5
Equilibria points for kcd = 9.75
0.4 x3 0.3
0.2
0.1 0.065
0.06
0.055 x2
0.05 0.045
0.04
0.58
0.6
0.62
0.64
0.66
0.68
0.7
0.72
0.74
x1
Figure 17.6 Constant dosage trajectories for u(t) ¼ 0.8 (dashed) and u(t) ¼ 2 (solid) with t 2 [0, 15] for two different values of the parameter kad. The black square along the trajectory path is the point where treatment stops.
h is the sampling time and xk represents the value of x at the kth sample is the time (i.e., x(t) at t ¼ hk). f is the single stage cost function and f terminal state cost function at time tf. In MPC, a typical form of the objective function is fðxk ; uk Þ ¼ ðxk rk ÞT Qðxk rk Þ þ ðuk uk1 ÞT Rðuk uk1 Þ t Þ ¼ ðxt rt ÞT Qt ðxt rt Þ fðx f f f f f ð17:12Þ where rt is a reference point at time t, and Q, R, and Qt are weighting matrices with proper dimensions that make the stagewise cost scalar. MPC solves Eq. (17.11) with the stagewise cost of Eq. (17.12) in the context of finding an open-loop input trajectory offline for fixed finite-time process. Given the feedback measurement of x at each time, it can also solve the problem online in a receding horizon fashion (Morari and Lee, 1999). The single stage cost in Eq. (17.12) cannot explicitly incorporate all of the control objectives because choosing reference trajectory is overly restrictive. Specifically, the goal of constraining the states is to reduce the side-effects of treatment. The time-dependence implied by choosing a specific trajectory, rt is not necessary. This idea is illustrated in Fig. 17.7. In Fig. 17.7A and B, the trajectory generated by the NLMPC solution is shown for Q ¼ 0 and R ¼ 0, respectively. As can be seen from Fig. 17.7, the NLMPC solution which seeks to minimize the input effort (i.e., Q ¼ 0) also provides a
450
Amos Ben-Zvi and Jong Min Lee
x3 (Free GR concentration)
A
x2
0.6 0.5 0.4 0.3 0.2 0.1 0.07 (AC 0.06 TH 0.05 con cen 0.04 tra tio n)
0.8 0.75 ion) ntrat once
0.7
0.6 x1
0.65 c (CRH
NLMPC trajectory with Q = 1/100, R = 0. The end of treatment coincides with the healthy equilibrium point.
x3 (Free GR concentration)
B
x2
0.6 0.5 0.4 0.3 0.2 0.1 0.07 (AC 0.06 TH 0.05 con cen 0.04 tra tio n)
0.8 0.75 ion) ntrat once
0.7
0.6 x1
0.65 c (CRH
NLMPC trajectory with Q = 0, R = 1/100.
Figure 17.7 NLMPC generated trajectories for t 2 [0, 15]. The square on the trajectory path is the point where treatment stops. The star and circle are the hypocortisolic and healthy rest points.
trajectory with least state deviation. It can also be seen that choosing Q ¼ 0 results in a much more robust trajectory (the end of treatment coincides with the healthy equilibrium). These qualitative aspects of the trajectories in Fig. 17.7 are not obvious from the choice of Q and R. As a result, one is
451
Nonlinear Dynamical Analysis and Optimization
forced to seek an alternative approach for the development of an appropriate objective function. A clinically relevant and straightforward objective function for computing treatment is suggested. A stagewise cost function is designed to yield a control law that maintains the controlled trajectories in a user-defined tube containing a line connecting an initial state of sickness to the corresponding healthy state as shown in Fig. 17.8. Quantitative definition of the stagewise cost is proposed as 8 d < m and xkþ1 2 Xs : 0 > > < d < m and xkþ12 = Xs : 50l fðxk ; uk Þ ¼ ð17:13Þ d m and x > kþ1 2 Xs : 500d > : d < m and xkþ12 = Xs : 50l þ 500d where d is the distance between the state point at time k þ 1 and the straight line connecting the sick initial state and the corresponding healthy state. m is a user-specified radius of the tube. Xs is a set of states from which no control action (u ¼ 0) is necessary onwards for the system to settle in a healthy state. This set can be found by integration of Eq. (17.2) with u and d set as zero. The scaling factors of the stagewise cost and the radius m are specified by a user according to the relative importance between l and d as well as the order of distances. Since the logics included in the single-stage cost requires integer variables, the resulting optimization problem of MPC becomes multihorizon mixed integer nonlinear programming (MINLP). Though the objective
m xsickk
xk+1 d
l
xhealthy Figure 17.8 Design of clinically relevant objective function for HPA axis system.
452
Amos Ben-Zvi and Jong Min Lee
function is straightforward to formulate, the optimization problem is very difficult to solve due to the increasing number of optimization variables with the size of horizon in addition to the inherent complexity of MINLP.
4. Dynamic Programming Dynamic programming (DP) offers an alternative approach to solving multistage optimal control problems (Bellman, 1957). The approach involves stagewise calculation of the so-called ‘‘cost-to-go’’ values for all states. The cost-to-go of a state is the sum of all costs that you can expect to incur under a given policy starting from that state, and hence expresses the quality of a state in terms of achievable future performance. Though MPC has been the most popular advanced control technique for the process industry owing to its ability to handle a large multivariable system with constraints, DP has several advantages over MPC in solving biological/biomedical dynamic optimization problems. First, DP provides flexibility for choosing a complex stagewise cost because DP reduces a multistage optimization problem to a single-stage one by encoding longterm performance in a cost-to-go function. The single-stage optimization problem is solved by calculating the optimal action that minimizes the sum of the current stagewise cost and the cost-to-go of the successor state. Another advantage of DP is its ability to take into account the uncertainty in the optimal control calculation, whereas the conventional MPC ignores the uncertainty and feedback at future time points and solves a deterministic open-loop optimal control problem.
4.1. DP for deterministic systems For a deterministic system where the success state can be exactly evaluated given the current state and input values, the entire future sequence of states and actions is determined with a fixed starting state and a deterministic policy. The cost-to-go function, J, under a policy m is the sum of stagewise costs up to the end of the horizon. Jkm ðxk Þ ¼
t f 1 X
tÞ fðxi ; ui Þ þ fðx f
ð17:14Þ
i¼k
where ui ¼ mðxi Þ. The optimal cost-to-go function, J , is the cost-to-go function under an optimal policy and is unique: J ¼ min J m ¼ J m m2P
ð17:15Þ
453
Nonlinear Dynamical Analysis and Optimization
where P is a set of all possible deterministic policies that map xi to ui. For the finite horizon problem of Eq. (17.14), the optimal cost-to-go function should satisfy the following Bellman’s optimality equation: Jk ðxk Þ ¼ minffðxk ; uk Þ þ Jkþ1 ðxkþ1 Þg uk
ð17:16Þ
To solve the above optimality equation, sequential calculation of Jk for all state points is performed in a backward manner starting from the terminal t Þ. stage with Jtf ¼ fðx f With the optimal cost-to-go function for the k þ 1 stage, Jkþ1 ðxkþ1 Þ, calculated offline, the following single stage problem, which is equivalent to tf stage problem defined earlier, is solved to compute an optimal control action for any given state x at time k: uk ¼ arg minffðxk ; uk Þ þ Jkþ1 ðxkþ1 Þg uk
ð17:17Þ
Infinite horizon formulation, in which tf is set to infinity, was shown to be advantageous in the context of system’s stability and feasibility of optimal solutions for systems without termination time (Rawlings and Muske, 1993). A typical objective function of infinite horizon problems is given as min
u0 ;u1 ;...;u1
1 X
gi fðxi ; ui Þ
ð17:18Þ
i¼0
where g 2 ð0; 1Þ is a discount factor. It should be noted that f is not limited to a certain type of norm and g is used to prevent the total cost from diverging to infinity. For the infinite horizon problem, by letting k ! 1, the following Bellman equation is obtained. J1 ðxk Þ ¼ minffðxk ; uk Þ þ gJ1 ðxkþ1 Þg uk
ð17:19Þ
The above Bellman equation is solved offline for all possible states and one cost-to-go function is obtained regardless of the time point. There are two conventional approaches for computing the cost-to-go function offline, value iteration and policy iteration. In this chapter, the value iteration will be used for its simplicity. In value iteration, one starts with an initial guess, usually zero, for the cost-to-go for each state and iterates on the Bellman equation until convergence. This is equivalent to calculating the cost-to-go value for each state by assuming an action that minimizes the sum of the current stage cost and the cost-to-go for the next state according to the current estimate. Hence, each update assumes that the calculated action is optimal. The algorithm involves the following steps.
454
Amos Ben-Zvi and Jong Min Lee
1. Discretize the continuous state space into a finite number of state points, xi ði ¼ 1; . . . ; N Þ 2 Xi . 2. Initialize J 0 ðxi Þ as 0 for all xi . 3. In jth iteration, obtain ( j þ 1)th estimate of cost-to-go for each state xk: J jþ1 ðxi Þ ¼ minffðxi ; ui Þ þ gJ j ð^ xÞg ui
ð17:20Þ
where x^ is the successor state of xi obtained by integrating Eq. (17.1) with the constant inputs of ui and di. x^ may not be found in Xi and thus estimation of cost-to-go for x^ is necessary. It was shown that instance-based local averaging schemes such as k-nearest neighborhood method provide stable offline learning for the estimation (Lee et al., 2006). 4. Perform the above iteration (step 3) until k J jþ1 ðxi Þ J j ðxi Þk1 < ½ð1 gÞ=2ge, where e is a user-defined threshold value. 5. If the convergence criterion met, use J jþ1 as J for computing optimal control action for any given state.
4.2. Minimizing worst-case cost under uncertainty using DP In practical situations, HPA axis dynamics of each patient is likely to have variations from the nominal parameter values, and the sequence of optimal treatments based on the nominal model may not work well. Another advantage of DP is that uncertainty can be taken into account explicitly. There are two possible approaches within DP to handle uncertain systems. The first is stochastic DP approach where the expectation of cost-to-go function is minimized given the joint probability distribution function (PDF) of uncertain parameter vector. The other approach is minimizing the worst-case (maximum) scenario of cost-to-go function when the parameters are known only within certain bounds. The resulting solution of min-max optimization is conservative, but it corresponds to the best strategy available in the absence of PDF. In this chapter, we show the latter approach because it is relatively easier to set bounds of parameters than to find PDF, which requires many data sets. In the worst-case formulation, the following discounted infinite worstcase cost is minimized: 1 X max gk fðxk ; uk Þ ð17:21Þ p0; ...;p1
k¼0
The corresponding dynamic program is formulated as J ðxk Þ ¼ min maxffðxk ; uk Þ þ gJ ðxkþ1 Þg uk 2U pk 2P
ð17:22Þ
Nonlinear Dynamical Analysis and Optimization
455
Obtaining the converged cost-to-go function offline is the same as in the deterministic case except the maximization problem is solved for each input first before the minimization step.
5. Computation of Optimal Treatments for HPA Axis System 5.1. Deterministic optimization The first step is to learn the optimal cost-to-go function offline. To store the cost-to-go values in a tabular form with interpolation by k-nearest neighbor, the state and action spaces are discretized in an equal-spaced fashion. The state space is discretized such that each state variable has ten points between the two steady states. As a result, the total number of discrete state points is 10,000. The input is discretized into 25 points between 2 and 2. This is also clinically relevant consideration because adjustment of drug dosage chosen among a finite set of discrete values is easier to implement than among infinite number of continuous values. For the 10,000 points, initial cost-to-go values were set as zero and the value iteration algorithm is implemented to obtain converged cost-to-go values. In estimating cost-to-go values of the points that are not found in the set of discretized points, the following k-nearest neighborhood with k ¼ 4 was employed: X Jð^ xÞ ¼ wi Jðxi Þ ð17:23Þ xi 2Nk ð^ xÞ
where 1=di wi ¼ X 1=di i
ð17:24Þ
di is the Euclidean distance between x^ and its ith nearest point (i ¼ 1, 2, . . ., k). The cost-to-go estimates are not sensitive to the number of neighboring points because the weighting factor is inversely proportional to the distance. With the discount factor of 0.98 and the convergence tolerance e of 0.1, the offline learning step converges after 15 steps. With the converged costto-go function, the following single-stage optimization is solved to find the optimal control action at each time step: uk ¼ arg minffðxk ; uk Þ þ gJ~ ðxkþ1 Þg uk 2U
ð17:25Þ
where J~ is the converged cost-to-go values with the four-nearest neighborhood approximator.
456
x3 (Free GR concentration)
Amos Ben-Zvi and Jong Min Lee
0.6 0.5 0.4 0.3 0.2 0.1 0.06 x ( 2 A CT Hc
0.75 0.05 ent rati on
0.65
onc
0.04 )
0.6 x1
0.7
) ration ncent o c (CRH
Figure 17.9 State trajectory under the treatment policy derived from deterministic DP.
Figure 17.9 shows the controlled trajectory of the nominal system computed from the learned cost-to-go function. The HPA axis system penetrates the boundary successfully under the computed policy. Note, however, that the end of treatment is very close to the boundary surface that separates the healthy and unhealthy equilibria. That is, the nominal treatment is just barely sufficient to achieve the desired outcome (driving the system to the healthy equilibrium) assuming a perfect model and no external disturbances. This solution is not robust to errors in model formulation or parameter estimation and is therefore unlikely to achieve the desired outcome under real-world conditions. To obtain a more realistic treatment, a modified DP approach can be used to generate a treatment that will be effective under a variety of conditions. This type of approach is typically called ‘‘worst-case’’ optimization because it seeks to find a treatment that is effective even for the most difficult-to-treat scenario.
5.2. Worst-case optimization In this chapter, the worst-case optimization is computed to deal with a situation where the parameter values in the process model are not known exactly. The HPA axis model contains seven parameters. While none of the parameters may be precisely known, a more typical situation arises when some subset of the model parameters are unknown. In this case, it will be assumed that two of the parameters kad and krd are not known exactly, but,
457
Nonlinear Dynamical Analysis and Optimization
rather, are known to vary between [9.5, 10.5] and [0.85 0.95], respectively. The approach presented in this chapter may be extended to a larger number of unknown parameters. The DP ‘‘worst-case’’ objective is to find robust control actions that can drive the system to a healthy state without exact knowledge of the systems’ parameters by solving the min-max problem. Since the system is nonlinear, it is computationally exorbitant to find the real worst-case scenario with exhaustive search over all possible combination of the parameters. Hence, we use a sample-based approach to approximate the worst-case cost function. Before the offline learning, fifty points were sampled from the 2-D parameter space. In solving Eq. (17.22), the 50 randomly generated points (using a uniform distribution) are searched over given each control action to find the worst case first. It should be noted that the parameter values chosen for this analysis were constrained to values that allow for multiple equilibria. Once the converged cost-to-go function was obtained, the following optimization problem is solved to find an optimal control action at each time. uk ¼ min max ffðxk ; uk Þ þ gJ ðxkþ1 Þg
ð17:26Þ
uk 2U pk 2P50
x3 (Free GR concentration)
where P50 is the set of sampled parameters. Figure 17.10 shows the case where the true system’s parameter is (kad, krd) ¼ (9.7545, 0.9094). Without the knowledge of true parameter values, the system could be driven to a healthy steady-state point by the suggested approach. It is noted that the worst-case optimization gives very
0.6 0.5 0.4 0.3 0.2 0.1 0.06 x ( 2 A CT Hc
0.75 0.05 ent rati on
0.65
onc
0.04 )
0.7
) ration ncent o c RH x 1 (C
0.6
Figure 17.10 State trajectory under the worst-case solution via DP.
458
Amos Ben-Zvi and Jong Min Lee
conservative treatment, the highest dosage sustained for 7 days compared to 5 days in nominal case. However, this treatment was computed through DP without the knowledge of the true parameters by minimizing the worstcase performance criterion. Without such an optimization tool, it is very difficult to provide guidelines on the dosage when the exact model parameter of each patient is not available. Hence, the proposed procedure will be able to provide a single treatment course that is ‘‘likely,’’ in an appropriate sense, to treat the vast majority of individuals with minimal side-effects.
6. Conclusions This chapter discussed two algorithms for dynamic optimization of biological/biomedical systems. Dynamic programming has the distinct advantage over model predictive control approach in that DP has the flexibility for choosing objective function and ability to take uncertainty into account. Steady state and stability analyses of nonlinear systems with multiple steady states were also discussed with the example of HPA axis system. Despite its generality, the potential issue of DP is that its offline computational requirement increases with the number of state variables, referred to as ‘‘curse-of-dimensionality.’’ In addition, it also requires full measurement of state variables. The recent emergence of reinforcement learning and approximate dynamic programming puts forth a possibility to avert the curse-of-dimensionality and necessity of full state measurement.
ACKNOWLEDGMENTS This work was supported partly by the National Sciences and Engineering Research Council (NSERC) of Canada under Discovery Grant.
REFERENCES Bellman, R. E. (1957). Dynamic Programming. Princeton University Press, Princeton, NJ. Ben-Zvi, A., et al. (2009). Model-based therapeutic correction of hypothalamic-pituitaryadrenal axis dysfunction. PLoS Comput. Biol. 5, e1000273. Chicone, C. (1999). Ordinary Differential Equations with Applications. Springer-Verlag, New York. Cleare, A., et al. (2001). Hypothalamo-pituitary-adrenal axis dysfunction in chronic fatigue syndrome. J. Clin. Endocrinol. Metab. 86, 3545–3554. Crofford, L., et al. (2004). Basal circadian and pulsatile acth and cortisol secretion in patients with fibromyalgia and/or chronic fatigue syndrome. Brain Behav. Immun. 18, 314–325. Deutsch, A., et al. (2007). Mathematical Modeling of Biological Systems: Cellular Biophysics, Regulatory Networks, Development, Biomedicine, and Data Analysis. Birkha¨user Boston, Boston.
Nonlinear Dynamical Analysis and Optimization
459
Giorgio, A., et al. (2005). 24-hour pituitary and dernal hormone profiles in chronic fatigue syndrome. Psychosom. Med. 67, 433–440. Guerlain, S., et al. (2002). The MPC elucidator: A case study in the design for humanautomation interaction. IEEE Trans. Syst. Man Cybernet. Part A: Syst. Hum. 32, 25–40. Gupta, S., et al. (2007). Inclusion of the glucocorticoid receptor in a hypothalamic pituitary adrenal axis model reveals bistability. Theor. Biol. Med. Model 4. Jacobson, L. (2005). Hypothalamic-pituitary-adrenocortical axis regulation. Endocrinol. Metab. Clin. North Am. 34, 271–292. Kimas, N., et al. (1990). Immunologic abnormalities in chronic fatigue syndrome. J. Clin. Microbiol. 28, 1403–1410. Klipp, E., et al. (2006). Integrative model of the response of yeast to osmotic shock. Nat. Biotechnol. 23, 975–982. Lee, J. M., et al. (2006). Choice of approximator and design of penalty function for an approximate dynamic programming based control approach. J. Process Control 16, 135–156. Morari, M., and Lee, J. H. (1999). Model predictive control: Past, present and future. Comput. Chem. Eng. 23, 667–682. Rawlings, J. B., and Muske, K. R. (1993). The stability of constrained receding horizon control. IEEE Trans. Automat. Contr. 38, 1512–1516. Rizzi, M., et al. (1997). In vivo analysis of metabolic dynamics in Saccharomyces cerevisiae: II. Mathematical model. Biotechnol. Bioeng. 55, 592–608. Schu¨rmeyer, T. H., et al. (1996). Effect of cyproheptadine on episodic ACTH and cortisol secretion. Eur. J. Clin. Invest. 26, 397–403. Teusink, B., et al. (2000). Can yeast glycolysis be understood in terms of in vitro kinetics of the constituent enzymes? Testing biochemistry. Eur. J. Biochem. 267, 5313–5329.
C H A P T E R
E I G H T E E N
Modeling of Growth Factor-Receptor Systems: From Molecular-Level Protein Interaction Networks to Whole-Body Compartment Models Florence T. H. Wu,* Marianne O. Stefanini,* Feilim Mac Gabhann,† and Aleksander S. Popel* Contents 1. Background 1.1. Biology of growth factor systems 1.2. Computational models of the VEGF system 2. Molecular-Level Kinetics Models: Simulation of In Vitro Experiments 2.1. Mathematical framework for biomolecular interaction networks 2.2. Case study: Mechanism of PlGF synergy—Shifting VEGF to VEGFR2 versus PlGF–VEGFR1 signaling 2.3. Case study: Mechanism of NRP1–VEGFR2 coupling via VEGF165—Effect on VEGF isoform-specific receptor binding 3. Mesoscale Single-Tissue 3D Models: Simulation of In Vivo Tissue Regions 3.1. Mathematical framework for tissue architecture, blood flow, and tissue oxygenation 3.2. Case study: Proangiogenic VEGF gene therapy for muscle ischemia 3.3. Case study: Proangiogenic VEGF cell-based therapy for muscle ischemia 3.4. Case study: Proangiogenic exercise therapy for muscle ischemia 4. Single-Tissue Compartmental Models: Simulation of In Vivo Tissue
* {
462 462 466 466 468 468 472 474 474 480 480 481 482
Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA Department of Biomedical Engineering, Institute for Computational Medicine, Johns Hopkins University, Baltimore, Maryland, USA
Methods in Enzymology, Volume 467 ISSN 0076-6879, DOI: 10.1016/S0076-6879(09)67018-X
#
2009 Published by Elsevier Inc.
461
462
Florence T. H. Wu et al.
4.1. Mathematical framework for tissue porosity and available volume fractions 4.2. Case study: Pharmacodynamic mechanism and tumor microenvironment affect efficacy of anti-NRP1 therapy in cancer 5. Multitissue Compartmental Models: Simulation of Whole Body 5.1. Mathematical framework of intertissue transport 5.2. Case study: Pharmacokinetics of anti-VEGF therapy in cancer 5.3. Case study: Mechanism of sVEGFR1 as a ligand trap 6. Conclusions Acknowledgments References
482
483 485 486 488 491 493 494 494
Abstract Most physiological processes are subjected to molecular regulation by growth factors, which are secreted proteins that activate chemical signal transduction pathways through binding of specific cell-surface receptors. One particular growth factor system involved in the in vivo regulation of blood vessel growth is called the vascular endothelial growth factor (VEGF) system. Computational and numerical techniques are well suited to handle the molecular complexity (the number of binding partners involved, including ligands, receptors, and inert binding sites) and multiscale nature (intratissue vs. intertissue transport and local vs. systemic effects within an organism) involved in modeling growth factor system interactions and effects. This chapter introduces a variety of in silico models that seek to recapitulate different aspects of VEGF system biology at various spatial and temporal scales: molecular-level kinetic models focus on VEGF ligand–receptor interactions at and near the endothelial cell surface; mesoscale single-tissue 3D models can simulate the effects of multicellular tissue architecture on the spatial variation in VEGF ligand production and receptor activation; compartmental modeling allows efficient prediction of average interstitial VEGF concentrations and cell-surface VEGF signaling intensities across multiple large tissue volumes, permitting the investigation of wholebody intertissue transport (e.g., vascular permeability and lymphatic drainage). The given examples will demonstrate the utility of computational models in aiding both basic science and clinical research on VEGF systems biology.
1. Background 1.1. Biology of growth factor systems 1.1.1. Growth factor systems in angiogenesis At the molecular and cellular levels, growth factors are extracellularly secreted polypeptides which, upon binding to specific cell-surface target receptors, trigger intracellular signal transduction pathways that regulate cell
Modeling Growth Factor-Receptor Systems
463
proliferation, differentiation and survival (Lodish et al., 2004). At the tissue and organ levels, growth factors are responsible for orchestrating many physiological processes in complex multicellular organisms. Of utmost importance in human physiology and pathology is the process known as angiogenesis—the growth of new capillaries or microvessels from preexisting blood vasculature—which critically supports organogenesis during embryonic development (Haigh, 2008); physiological growth and repair in adult tissues (such as in wound healing (Bao et al., 2009), muscular adaptation to exercise (Brown and Hudlicka, 2003), or endometrial regeneration (Girling and Rogers, 2005)); as well as the malignant growth of tumor tissues (Kerbel, 2008). Sprouting angiogenesis is a well-coordinated and complex cascade of molecular, cellular, and tissue-level events (Qutub et al., 2009): First, tissue ischemia is converted into a chemical cue for angiogenesis, as hypoxic cells (e.g., cancer cells in tumor tissue or myocytes in ischemic muscle) transcribe and secrete growth factors in response to hypoxia-inducible factor 1 (HIF1) activation. The growth factor ligands then diffuse throughout the extracellular fluid, where some become sequestered to matrix proteoglycans and others bind cell-surface receptors on the capillary endothelium. Cell-surface receptorbound ligands initiate vessel sprouting by turning quiescent endothelial cells into migratory tip cells. Extracellular matrix-bound and freely diffusing ligands form chemotatic gradients that guide the migration of tip cell filopodia in the capillary sprout. In Sections 1.1.2 and 1.1.3, we further explore the multiscale nature and molecular complexity of angiogenic growth factor interactions. 1.1.2. Systems biology of VEGF: Interaction networks and molecular cross talk Many growth factor systems are involved in angiogenic regulation, including the vascular endothelial growth factor (VEGF) system of at least five ligands (VEGF-A, PlGF, VEGF-B, VEGF-C, VEGF-D) and three receptors (VEGFR1, VEGFR2, VEGFR3) (Mac Gabhann and Popel, 2008; Roy et al., 2006); the fibroblast growth factor (FGF) system of at least 18 ligands (FGF1 to FGF10 and FGF16 to FGF23) and 4 receptors (FGFR1 to FGFR4) (Beenken and Mohammadi, 2009); the angiopoietin (Ang) system of at least four ligands (ANG1 to ANG4) and two receptors (TIE1 and TIE2) (Augustin et al., 2009); the platelet-derived growth factor (PDGF) system of at least four ligands (PDGF-A to PDGF-D) and two receptors (PDGFR-a and PDGFR-b) (Andrae et al., 2008); and the insulin-like growth factor (IGF) system of at least two ligands (IGF1 and IGF2) and two receptors (IGF1R and IGF2R) (Mazitschek and Giannis, 2004; Pollak, 2008). There are organizational similarities between these growth factor systems—all of the above receptors except for the IGF2R are transmembrane receptor tyrosine kinases (RTKs) that are activated by ligand-induced dimerization and transphosphorylation of tyrosine residues (Gschwind et al., 2004);
464
Florence T. H. Wu et al.
coreceptors (e.g., neuropilin-1 (NRP1) for VEGFRs and syndecan-4 for FGFRs) and endothelial integrins (e.g., avb5 for VEGFRs and avb3 for FGFRs) often modulate receptor signaling (Simons, 2004); heparan sulfate proteoglycans (HSPGs) are involved in the extracellular matrix sequestration of ligands in the VEGF, PDGF, FGF systems (Andrae et al., 2008; Beenken and Mohammadi, 2009; Roy et al., 2006). Intracellularly, details are also emerging on the convergent and integrative cross talk between and within growth factor systems in angiogenic signaling. Among the overlap in their transcriptional profiles downstream of RTK activation, all aforementioned growth factor systems can activate the canonical Ras-MAPK signaling pathway (Simons, 2004). VEGF and FGF2 were observed to induce the expression of each other (Simons, 2004). Within the VEGF system, the existence of heterodimeric ligands (e.g., VEGF-A/PlGF and VEGF-A/VEGF-B) and heterodimeric receptors (e.g., VEGFR1/VEGFR2) is expected to introduce new signal transduction pathways in addition to those downstream of classic homodimeric ligand–receptor activation (Cao, 2009; Mac Gabhann and Popel, 2008). While VEGFR1 is mainly a negative regulator of angiogenesis, its possible proangiogenic roles are suspected to involve the intermolecular transphosphorylation of VEGFR2 by PlGF-activated VEGFR1 (Autiero et al., 2003). In this chapter, we will introduce computational frameworks that are well suited for the quantitative modeling of highly complex molecular interaction networks such as that of the VEGF system in angiogenesis. While our examples focus on the VEGF system, the mathematical frameworks are generally adaptable for any of the organizationally similar growth factor systems introduced above, with the potential of further integration between the VEGF, FGF, PDGF, IGF, Ang-Tie system modules themselves. 1.1.3. Multiscale biology of VEGF: Transport and signaling range In understanding the biology of angiogenic growth factors, it is of equal importance to identify the key molecular players and to distinguish where in the body the molecular interactions take place. The spatial range of activity can vary between growth factors (or between isoforms of the same growth factor) depending on their propensity for intratissue and intertissue transport. The intratissue transport of a secreted growth factor—that is, its diffusive and convective transport within the extracellular matrix (ECM)—is dependent on the growth factor’s molecular size, the pore sizes of the ECM, and its chemical affinity for ECM proteoglycans. For instance, heparin-binding affinity of VEGF, which determines the extent of its sequestration by ECM heparan sulfate proteoglycans, is traditionally thought to be encoded in the VEGF-A gene on exons 6 and 7 (Harper and Bates, 2008). Hence, the proangiogenic VEGF121 and antiangiogenic VEGF121b splice isoforms which skip exons 6 and 7 are mostly freely diffusible once secreted into the
Modeling Growth Factor-Receptor Systems
465
extracellular fluid; while the higher molecular-weight isoforms VEGF145(b), VEGF148, VEGF165(b), VEGF183(b), VEGF189(b), and VEGF206 have progressively higher heparin-binding affinity due to their inclusion of increasingly greater portions of exons 6 and 7 (Harper and Bates, 2008). Yet these higher molecular-weight isoforms in their matrix-bound state can be subjected to proteolytic cleavage by plasmin or matrix metalloproteinases (MMPs), which releases active fragments of 110–113 amino acids in length and with similar angiogenic properties as VEGF121 (Ferrara and Davis-Smyth, 1997; Lee et al., 2005; Qutub et al., 2009). On a larger scale, the intertissue transport of a growth factor is first affected by its rate of entry into or exit from the blood or lymphatic vasculature, that is, its permeability through the blood capillary endothelium or its lymphatic drainage rate from interstitial spaces. Once in the bloodstream or lymphatic fluid, the intertissue transport of a growth factor may be further facilitated by specific carrier proteins or circulating cells. For VEGF, its potential carriers in blood include soluble forms of its normal receptors (sVEGFR1, sVEGFR2, sNRP1) (Ebos et al., 2004; Gagnon et al., 2000; Sela et al., 2008), plasma fibronectin (Wijelath et al., 2006), as well as platelets (Verheul et al., 2007). All together, these transport properties influence the distance over which a growth factor can signal. Autocrine signaling occurs when growth factors act upon the same cells that produced them; juxtacrine growth factors act upon adjacent cells after secretion; paracrine growth factors diffuse through the extracellular fluid and target cells within the same tissue but of a different cell type than that which secreted them; whereas growth factors that are transported through the bloodstream to distant target tissues act in an endocrine manner (Lauffenburger and Linderman, 1993; Lodish et al., 2004). Angiogenic VEGF signaling occurs predominantly in a paracrine fashion: epithelial cells in fenestrated organs (e.g., glomerular podocytes), mesenchymal cells (e.g., skeletal myocytes in ischemic muscles), vascular stromal cells (e.g., pericytes and smooth muscle cells) and hypoxic tumor cells are all known to secrete VEGF, which then diffuses to and activates the VEGF receptors on nearby endothelial cell surfaces (Kerbel, 2008; Maharaj and D’Amore, 2007). However, there is also evidence of autocrine VEGF signaling loops in VEGFproducing endothelial cells (Martin et al., 2009). Furthermore, autocrine VEGF signaling involving intracellular VEGF receptors (‘‘intracrine signaling’’) has been documented in breast carcinoma cells (Lee et al., 2007a,b) and hematopoietic stem cells (Gerber et al., 2002); although in these contexts, VEGF functions as a cell survival signal rather than a proangiogenic factor. While there have not been formally established specific endocrine functions of VEGF in normal physiology, aberrantly high circulating levels of VEGF may have deleterious systemic effects. In clinical trials administering VEGF via intravascular infusion to stimulate therapeutic angiogenesis for ischemic muscle diseases, unintended side effects of the high systemic VEGF concentrations such as hypotension and macular edema have been transiently
466
Florence T. H. Wu et al.
and sporadically observed (Collinson and Donnelly, 2004). The VEGF concentrations in the plasma of cancer patients are also known to be several-fold higher than healthy baseline levels, although it is uncertain whether the tumors are themselves the source of the elevated circulating VEGF, or whether conversely the elevated circulating VEGF triggered the malignant growth of tumors (Kut et al., 2007; Stefanini et al., 2008). Therefore, a complete understanding of VEGF biology—including its pathogenic role in cancer and its therapeutic potential in ischemic diseases—necessitates an appreciation of the dynamic distribution of VEGF in the human body (Kut et al., 2007). The computational models presented in this chapter are complementary, capturing the biology of VEGF interactions at different length scales: the molecular-level kinetic models can predict the local intensity of VEGF– VEGFR complex formation on endothelial cell surfaces as a marker of angiogenic activation or vessel sprouting initiation; the mesoscale singletissue 3D models involving multiple cell types can predict the spatial gradients of VEGF in the extracellular space and simulate paracrine signaling distributions that guide capillary sprout migration; and the multitissue compartmental models can be used to simulate whole-body VEGF distributions and to investigate the possibility of endocrine VEGF effects.
1.2. Computational models of the VEGF system Figure 18.1 summarizes the multiscale nature and complexity of the molecular interactions involved in the systems biology of the VEGF ligand–receptor system. In the following sections, we introduce models for investigating emergent behavior at progressively higher spatial scales: from molecular (subcellular) level models, to mesoscale (intratissue) models, to whole-body (intertissue) models. The chosen examples also illustrate the versatility of computational modeling for investigating basic science questions (e.g., simulating the molecular mechanisms underlying the PlGF–VEGF synergy and NRP1–VEGFR2 synergy) and assisting in the design of translational medicine (e.g., comparing the therapeutic efficacy of cell-based versus protein delivery of VEGF; optimizing the dose of anti-VEGF therapy).
2. Molecular-Level Kinetics Models: Simulation of In Vitro Experiments In our first two examples, models were developed to investigate specific molecular functions and interactions of key players in the VEGF ligand–receptor system—PlGF and NRP1—by recapitulating in vitro experiments. The spatial scope of these models focused on molecular behaviors near the endothelial cell surface, including extracellular ligand diffusion and cell-surface ligand–receptor binding.
467
Modeling Growth Factor-Receptor Systems
Interstitial fluid
PIGF
1 VEGF165
sVEGFR1
VEGF121
2
Tumor
5 7
VEGFR2 VEGF
6
R1
stin Ava
Extracellular matrix
4
3 Blood flow
Blood flow
7
Systemic blood circulation
sFn 9
VEGF
sVEGFR1
Lymph flow via thoracic duct
Avastin
Platelets 8
cyte l myo Skeleta VE
GF
R2
12 7 F VEG 10 H 2O
Lymph flow
PIG
F
VEG VEG
FR1
VEG
NR
PI
ECM
11 VEGF
VEG
F
FR1
FR2
8 Blood flow
Figure 18.1 Multiscale Systems Biology of the VEGF ligand–receptor system. (1) Hypoxia, such as that in growing tumor tissues (top panel), trigger the expression and extracellular secretion of VEGF ligand proteins, for example, VEGF-A (isoforms VEGF121 and VEGF165) and PlGF. At the cellular level, VEGF ligands diffuse toward nearby capillary surfaces, binding endothelial cell-surface receptors (VEGFR1, VEGFR2) and coreceptors (NRP1) in various configurations to activate (2) proangiogenic and (3) antiangiogenic downstream signaling. (4) At the tissue level, VEGF ligands with heparin-binding domains can be sequestered at heparan sulfate proteoglycan sites in the extracellular matrix (ECM), forming chemotactic gradients that guide capillary sprout migration. (5) Soluble VEGFR1 (sVEGFR1) potentially modulates angiogenic signaling via ligand trapping or dominant-negative heterodimerization with transmembrane VEGFR monomers. (6) Humanized anti-VEGF antibodies (e.g., AvastinÒ ), through their capacity to sequester specific VEGF ligands, are being investigated as antiangiogenic agents in cancer treatment. At the whole-organism level, macromolecules such as VEGF ligands and their soluble receptors may have systemic
468
Florence T. H. Wu et al.
2.1. Mathematical framework for biomolecular interaction networks Mathematical theory and formulations for kinetic modeling of cell-surface ligand–receptor binding and cell-surface receptor/ligand trafficking have been presented in classical texts (Lauffenburger and Linderman, 1993). The standard description for the binding kinetics of ligand L to receptor R to form complex C involves characterization of the complex association and dissociation rate constants kon and koff: dC ð18:1Þ ¼ kon RL koff C dt Endocytotic internalization of free receptors and complexes are generally characterized by first-order rate constants: RþL ÐC $
dR dC ð18:2Þ ¼ kint;R R; ¼ kint;C C dt dt Free receptor insertion rates are typically introduced through zero-order source terms; in the following models, they are chosen to maintain a steady total population (free and bound) of receptors in the absence of added ligand.
2.2. Case study: Mechanism of PlGF synergy—Shifting VEGF to VEGFR2 versus PlGF–VEGFR1 signaling Our first case study sought to decipher the molecular mechanisms behind PlGF’s observed ability to augment the angiogenic response to VEGF-A in in vitro assays for endothelial cell survival, proliferation and migration. Details and full references can be found in Mac Gabhann and Popel (2004). These two members of the VEGF family have different receptorbinding properties: VEGF-A (hereinafter referred to as simply ‘‘VEGF’’) binds with both VEGFR1 and VEGFR2; while PlGF only binds VEGFR1. Two proposed mechanisms for the PlGF–VEGF synergy were: (a) ‘‘ligand shifting’’, where PlGF displaces VEGF from VEGFR1, effectively freeing effects, as they enter the blood circulatory system (middle panel) through intertissue transport processes including (7) transcapillary vascular permeability and (8) lymphatic drainage of the interstitial fluid. (9) Other VEGF carriers in the blood include soluble fibronectin and platelets. (10) Similarly in skeletal muscle (bottom panel), VEGF ligand expression is upregulated in hypoxic myocytes. However in peripheral arterial disease, the angiogenic response is insufficient to alleviate muscle ischemia. Proangiogenic therapies under investigation include VEGF-A delivery through cell, gene, and protein therapy. Adjuvant therapeutic targets include (11) PlGF (thought to work synergistically through VEGFR1 signaling or ligand shifting) and (12) NRP1 (via presentation of VEGF to VEGFR2 or reducing antiangiogenic VEGFVEGFR1 complexes).
Modeling Growth Factor-Receptor Systems
469
more VEGF to bind the more proangiogenic VEGFR2; and (b) ''PlGF–VEGFR1 signaling,'' where PlGF activation of VEGFR1 may transduce qualitatively different (proangiogenic) signals than VEGF activation (which is generally inhibitory of angiogenic signaling). An in silico model was thus constructed to quantify the contributions of these mechanisms to the VEGF–PlGF synergy. The in silico model formulation mimicked the in vitro assay geometry and conditions, as illustrated in Fig. 18.2A. At the bottom of a cell culture well, a confluent layer of endothelial cells expressing the receptors VEGFR1 and VEGFR2 on their surface (z = 0) was exposed to the fluid media. Into the cell culture media (from z = 0 to z = h), ligands were administered at time zero—either VEGF alone (''PlGF−'' case) or VEGF and PlGF (''PlGF+'' case)—to assess the synergistic effects of PlGF. Mathematically, each molecular species was represented by either a volumetric concentration (V for VEGF, P for PlGF) or a cell-surface concentration (R1 for VEGFR1, R2 for VEGFR2, VR1 for the VEGF·VEGFR1 complex, PR1 for the PlGF·VEGFR1 complex, VR2 for the VEGF·VEGFR2 complex), using the continuum approach. The initial value problem was essentially a single spatial dimension problem (z-direction), as molecular concentrations are assumed uniform in the plane parallel to the endothelial cell surface. Coupled diffusion and reaction equations (Eqs. (18.3) and (18.4), respectively) were used to describe the time evolution of extracellular ligand transport and cell-surface molecular binding interactions. D represents ligand diffusivity (cm²/s); s_R is the insertion rate of receptor R (mol/cm²/s); k_int is the receptor or complex internalization rate (s⁻¹); k_on and k_off are the rate constants of complex association (M⁻¹ s⁻¹) and dissociation (s⁻¹).

$$\frac{\partial V}{\partial t} = D_V\frac{\partial^2 V}{\partial z^2}; \qquad \frac{\partial P}{\partial t} = D_P\frac{\partial^2 P}{\partial z^2} \qquad (18.3)$$

$$\begin{aligned}
\frac{\partial R1}{\partial t} &= s_{R1} - k_{int,R1}R1 - \left(k_{on,V,R1}\,V\,R1 - k_{off,V,R1}\,VR1\right) - \left(k_{on,P,R1}\,P\,R1 - k_{off,P,R1}\,PR1\right)\\
\frac{\partial R2}{\partial t} &= s_{R2} - k_{int,R2}R2 - \left(k_{on,V,R2}\,V\,R2 - k_{off,V,R2}\,VR2\right)\\
\frac{\partial VR1}{\partial t} &= -k_{int,VR1}\,VR1 + k_{on,V,R1}\,V\,R1 - k_{off,V,R1}\,VR1\\
\frac{\partial PR1}{\partial t} &= -k_{int,PR1}\,PR1 + k_{on,P,R1}\,P\,R1 - k_{off,P,R1}\,PR1\\
\frac{\partial VR2}{\partial t} &= -k_{int,VR2}\,VR2 + k_{on,V,R2}\,V\,R2 - k_{off,V,R2}\,VR2
\end{aligned} \qquad (18.4)$$

Boundary conditions were given by Eq. (18.5), where q_V is the endothelial secretion rate of VEGF (mol/cm²/s).
[Figure 18.2 near here. Panel B annotations: ligand–receptor complexes (thousands per cell) versus time (h); ''PlGF has a small effect on VEGFR2 signaling'' (V·R2 with and without PlGF) and ''PlGF has a large impact on VEGF·VEGFR1 and PlGF·VEGFR1 signaling'' (V·R1 and P·R1 curves). Panel D annotations: VEGF·VEGFR2 (thousands per cell) versus initial VEGF (nM) for VEGF165, VEGF121, and VEGF165 + NRP1-Ab; inhibition of VEGFR2 signaling by blocking binding to NRP1 ~8%, and by blocking coupling to NRP1 ~45%.]

Figure 18.2 Molecular-level kinetics models. Modeling of molecular interactions underlying PlGF–VEGF synergy: experimental setup (A) and sample results (B), based on data previously published in Mac Gabhann and Popel (2004). Modeling of NRP1's role in the differential binding of VEGF isoforms to VEGFR2: experimental setup (C) and sample results (D), based on data previously published in Mac Gabhann and Popel (2005).
$$\begin{aligned}
D_V\left.\frac{\partial V}{\partial z}\right|_{z=0} &= q_V + \left(k_{on,V,R1}\,V\,R1 - k_{off,V,R1}\,VR1\right) + \left(k_{on,V,R2}\,V\,R2 - k_{off,V,R2}\,VR2\right)\\
D_P\left.\frac{\partial P}{\partial z}\right|_{z=0} &= \left(k_{on,P,R1}\,P\,R1 - k_{off,P,R1}\,PR1\right)\\
\left.\frac{\partial V}{\partial z}\right|_{z=h} &= \left.\frac{\partial P}{\partial z}\right|_{z=h} = 0
\end{aligned} \qquad (18.5)$$
The initial conditions describing exogenous ligand administration and total receptor densities were specified by Eq. (18.6):

$$V(t{=}0) = V_0;\quad P(t{=}0) = P_0;\quad R1(t{=}0) = R1_0;\quad R2(t{=}0) = R2_0;\quad VR1(t{=}0) = PR1(t{=}0) = VR2(t{=}0) = 0 \qquad (18.6)$$

The predicted final concentration of cell-surface VEGF·VEGFR2 complexes served as a surrogate marker for the achieved angiogenic response. Representative values for the model parameters were based on the experimental literature, as detailed in Mac Gabhann and Popel (2004). Numerical solution of the coupled nonlinear differential equations (Eqs. (18.3)–(18.6)) was achieved using an iterative implicit finite-difference scheme. Sample results are shown in Fig. 18.2B. Crucially, the simulations predicted that PlGF addition increased VEGF·VEGFR2 complex formation by at most 5% (the peak percentage change between the PlGF+ and PlGF− cases); that is, the ''ligand shifting'' effect is expected to be minimal. On the other hand, the transient increase in total ligated VEGFR1 complexes was as high as 43%, during which the magnitude of the increase in PlGF·VEGFR1 complexes exceeded that of the decrease in VEGF·VEGFR1 complexes. This suggests that ''PlGF–VEGFR1 signaling'' plays the more prominent role in the observed PlGF–VEGF synergy: PlGF critically alters the VEGFR1 signaling profile in both the absolute quantity of signaling VEGFR1 complexes and the signaling quality of those complexes (elevated proangiogenic PlGF·VEGFR1 signaling and reduced modulatory VEGF·VEGFR1 signaling). Experimental support for these computational predictions has been found (Autiero et al., 2003), as intermolecular cross talk was reported to occur downstream of PlGF–VEGFR1 binding, leading to the transphosphorylation of VEGFR2 and amplification of proangiogenic VEGF·VEGFR2 signaling.
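To make the numerical strategy concrete, the sketch below implements a drastically reduced version of this solver in Python: one ligand and one receptor instead of the full two-ligand, five-surface-species system, backward-Euler (implicit) finite differences for the diffusion equation, and a single explicit reaction sub-step at the z = 0 boundary in place of the full iteration to convergence. All parameter values are illustrative placeholders, not those of Mac Gabhann and Popel (2004):

```python
import numpy as np

# Reduced sketch of the iterative implicit finite-difference scheme: one
# ligand (V) diffusing in the culture media, one receptor (VEGFR2) on the
# endothelial surface at z = 0. Free-receptor insertion/internalization is
# omitted for brevity; all values are illustrative placeholders.
N, h = 101, 0.1                    # grid points, media depth (cm)
dz, dt = h / (N - 1), 0.05         # mesh size (cm), time step (s)
D = 1e-6                           # ligand diffusivity (cm^2/s)
kon = 1e7 * 1e3                    # association: 1e7 M^-1 s^-1 -> cm^3/mol/s
koff, kint = 1e-3, 1e-4            # dissociation, internalization (1/s)
V = np.full(N, 1e-12)              # ~1 nM ligand everywhere at t = 0 (mol/cm^3)
R2, VR2 = 1e-14, 0.0               # free/bound receptor (mol/cm^2)

# Backward-Euler diffusion: (I - dt*D*Laplacian) V_new = V_old (+ flux terms)
lam = dt * D / dz**2
A = (np.diag(np.full(N, 1 + 2 * lam))
     + np.diag(np.full(N - 1, -lam), 1)
     + np.diag(np.full(N - 1, -lam), -1))
A[0, 0] = A[-1, -1] = 1 + lam      # reflecting (zero-flux) ends

for _ in range(int(600 / dt)):     # simulate 10 min
    # Surface reaction sub-step at z = 0 (the full scheme iterates the
    # reaction and diffusion updates to convergence within each time step).
    rate = kon * V[0] * R2 - koff * VR2      # net binding flux (mol/cm^2/s)
    R2 += dt * (-rate)
    VR2 += dt * (rate - kint * VR2)
    b = V.copy()
    b[0] -= rate * dt / dz                   # ligand removed at boundary node
    V = np.linalg.solve(A, b)

print(f"Bound VEGFR2 after 10 min: {VR2:.3e} mol/cm^2")
```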
2.3. Case study: Mechanism of NRP1–VEGFR2 coupling via VEGF165—Effect on VEGF isoform-specific receptor binding
Our second case study investigated another molecular player in the systems biology of VEGF: neuropilin-1 (NRP1). As illustrated in Fig. 18.2C, NRP1-binding affinity is conferred on the higher molecular-weight isoforms of VEGF, such as VEGF165, predominantly through transcription of exon 7; the VEGF121 isoform, which lacks exon 7, is generally considered to have negligible affinity for NRP1. Because the NRP1-binding domains of VEGF165 do not overlap with its VEGFR-binding domains, VEGF165 can act as a bridge in the formation of a ternary complex: VEGF165·VEGFR2·NRP1. A reduced interaction network model involving VEGF121, VEGF165, VEGFR2, and NRP1 (Fig. 18.2C) was thus constructed to quantify the role of VEGF165-bridged VEGFR2–NRP1 coupling in generating VEGF isoform-specific angiogenic responses. Details and full references can be found in Mac Gabhann and Popel (2005). Geometrically, the experimental setup again involved a confluent layer of endothelial cells on the bottom of a cell culture well, as in Fig. 18.2A. As before, the initial value problem in one spatial dimension was formulated as a system of coupled diffusion and reaction equations (Eq. (18.7) and Eqs. (18.8)–(18.13), respectively). A new parameter, k_c, represents the coupling rate between VEGFR2 and NRP1 via VEGF165:

$$\frac{\partial V_{121}}{\partial t} = D_V\nabla^2 V_{121}; \qquad \frac{\partial V_{165}}{\partial t} = D_V\nabla^2 V_{165} \qquad (18.7)$$
$$\begin{aligned}
\frac{\partial R2}{\partial t} &= s_{R2} - k_{int,R2}R2 - \left(k_{on,VR2}V_{165}\,R2 - k_{off,VR2}\,V_{165}R2\right) - \left(k_{on,VR2}V_{121}\,R2 - k_{off,VR2}\,V_{121}R2\right)\\
&\quad - \left(k_{c,VN1,R2}\,V_{165}N1\,R2 - k_{off,VR2}\,V_{165}R2N1\right) &(18.8)\\
\frac{\partial N1}{\partial t} &= s_{N1} - k_{int,N1}N1 - \left(k_{on,VN1}V_{165}\,N1 - k_{off,VN1}\,V_{165}N1\right)\\
&\quad - \left(k_{c,VR2,N1}\,V_{165}R2\,N1 - k_{off,VN1}\,V_{165}R2N1\right) &(18.9)\\
\frac{\partial V_{121}R2}{\partial t} &= -k_{int,VR2}\,V_{121}R2 + \left(k_{on,VR2}V_{121}\,R2 - k_{off,VR2}\,V_{121}R2\right) &(18.10)\\
\frac{\partial V_{165}R2}{\partial t} &= -k_{int,VR2}\,V_{165}R2 + \left(k_{on,VR2}V_{165}\,R2 - k_{off,VR2}\,V_{165}R2\right)\\
&\quad - \left(k_{c,VR2,N1}\,V_{165}R2\,N1 - k_{off,VN1}\,V_{165}R2N1\right) &(18.11)\\
\frac{\partial V_{165}N1}{\partial t} &= -k_{int,VN1}\,V_{165}N1 + \left(k_{on,VN1}V_{165}\,N1 - k_{off,VN1}\,V_{165}N1\right)\\
&\quad - \left(k_{c,VN1,R2}\,V_{165}N1\,R2 - k_{off,VR2}\,V_{165}R2N1\right) &(18.12)\\
\frac{\partial V_{165}R2N1}{\partial t} &= -k_{int,VR2N1}\,V_{165}R2N1 + \left(k_{c,VN1,R2}\,V_{165}N1\,R2 - k_{off,VR2}\,V_{165}R2N1\right)\\
&\quad + \left(k_{c,VR2,N1}\,V_{165}R2\,N1 - k_{off,VN1}\,V_{165}R2N1\right) &(18.13)
\end{aligned}$$

Boundary conditions were given by Eq. (18.14):

$$\begin{aligned}
D_V\left.\frac{\partial V_{121}}{\partial z}\right|_{z=0} &= q_V + \left(k_{on,VR2}V_{121}\,R2 - k_{off,VR2}\,V_{121}R2\right)\\
D_V\left.\frac{\partial V_{165}}{\partial z}\right|_{z=0} &= q_V + \left(k_{on,VR2}V_{165}\,R2 - k_{off,VR2}\,V_{165}R2\right) + \left(k_{on,VN1}V_{165}\,N1 - k_{off,VN1}\,V_{165}N1\right)\\
\left.\frac{\partial V_{121}}{\partial z}\right|_{z=h} &= \left.\frac{\partial V_{165}}{\partial z}\right|_{z=h} = 0
\end{aligned} \qquad (18.14)$$
The predicted final VEGF·VEGFR2 concentration again served as a marker for the strength of proangiogenic signal transduction. The experimental and theoretical derivations of the model parameter values are detailed in Mac Gabhann and Popel (2005). Numerical solution of the coupled nonlinear differential equations (Eqs. (18.7)–(18.14)) was achieved using an iterative implicit finite-difference scheme. Sample results are shown in Fig. 18.2D. The in silico modeling of stepwise reaction kinetics (Fig. 18.2C) allowed the prediction of differential antiangiogenic efficacies for therapeutic interference with two distinct aspects of NRP1 function: VEGF binding and VEGFR2 coupling. Simulated blockade of VEGF165 binding to NRP1 (blocking reactions ''Rx1'' and ''Rx2'' in Fig. 18.2C) resulted in the convergence of the VEGF165 response to that of VEGF121 in terms of VEGFR2 activation; however, simulated blockade of
NRP1–VEGFR2 coupling (blocking reactions ''Rx1'' and ''Rx3'' in Fig. 18.2C) converted NRP1 into a VEGF165 sink (through the intact reaction ''Rx2'' in Fig. 18.2C), further reducing the VEGF165 response to below that of VEGF121 (Fig. 18.2D).
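The two blockade experiments can be illustrated with a well-mixed (diffusion-free) toy version of the Fig. 18.2C network. In the sketch below, the reaction labels rx1–rx3 follow the mapping used in the text (Rx2, VEGF165–NRP1 binding; Rx1 and Rx3, the two coupling routes into the ternary complex); all rate constants and concentrations are hypothetical placeholders, not the published parameter set:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Well-mixed sketch of the VEGF165/VEGFR2/NRP1 coupling network of Fig. 18.2C.
# Diffusion, VEGF121, and insertion/internalization are omitted; all numbers
# below are hypothetical placeholders.
kon_vn, koff_vn = 1e5, 1e-3     # V165 + NRP1 (Rx2), M^-1 s^-1 and s^-1
kon_vr2, koff_vr2 = 1e5, 1e-3   # V165 + VEGFR2
kc = 1e6                        # surface coupling rate (Rx1, Rx3), placeholder

def rhs(t, y, block):
    V, R2, N, VR2, VN, VRN = y
    rx2 = 0 if "binding" in block else kon_vn * V * N - koff_vn * VN
    vr2 = kon_vr2 * V * R2 - koff_vr2 * VR2            # VEGFR2 binding
    rx3 = 0 if "coupling" in block else kc * VR2 * N - koff_vn * VRN
    rx1 = 0 if block else kc * VN * R2 - koff_vr2 * VRN  # blocked in both cases
    return [-rx2 - vr2, -vr2 - rx1, -rx2 - rx3,
            vr2 - rx3, rx2 - rx1, rx1 + rx3]

y0 = [1e-9, 1e-9, 1e-9, 0.0, 0.0, 0.0]   # initial concentrations (M), arbitrary
for block in ["", "binding", "coupling"]:
    sol = solve_ivp(rhs, [0, 3600], y0, args=(block,), rtol=1e-8)
    V, R2, N, VR2, VN, VRN = sol.y[:, -1]
    print(f"block={block or 'none':8s} signaling VR2 + VR2N1 = {VR2 + VRN:.3e}")
```

With these symmetric placeholder parameters, the coupling-blocked case leaves Rx2 intact, so NRP1 sequesters VEGF165 and the signaling output falls below that of the binding-blocked case, qualitatively reproducing the sink behavior described above.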
3. Mesoscale Single-Tissue 3D Models: Simulation of In Vivo Tissue Regions
The study of VEGF binding to receptors on cells in vitro, and the validation of the VEGF kinetic interaction network between multiple ligands and multiple receptors, lead us to ask: how does this network behave in vivo? In Sections 4 and 5, we will discuss the transport of VEGF between tissues and around the body, but here we focus first on the behavior of VEGF in a local volume of tissue. This multicellular milieu requires significant additions to our model in order to accurately simulate the local transport of VEGF, including diffusion of VEGF ligands over significant distances, extracellular matrix sequestration, and variable production rates of VEGF throughout the tissue. We place all of these in an anatomically based 2D or 3D multicellular tissue geometry. The models can predict the creation of interstitial VEGF gradients due to the nonuniform nature of the tissue anatomy. This is of particular interest because VEGF is believed to be a chemotactic guiding agent for blood vessels, but also because local variability in VEGF concentration can lead to local variation in VEGF receptor ligation and signaling, allowing for focal activation of endothelial cells. The model framework can be adapted to most tissues; here we present a case with parameters specifically selected to represent a skeletal muscle experiencing ischemia (specifically, the rat extensor digitorum longus, or EDL, in a rodent model of hindlimb artery ligation), and we describe how to computationally test several therapeutic interventions, including gene therapy and exercise.
3.1. Mathematical framework for tissue architecture, blood flow, and tissue oxygenation
3.1.1. 2D and 3D tissue geometry based on microanatomy
A cross-section (for 2D) or a volume of tissue (for 3D) is reconstructed from histological and other microanatomical information (Fig. 18.3A–C). The major relevant features of the tissue are the blood vessels, the parenchymal cells (from here on, we will assume these are skeletal myocytes, i.e., long multinucleated cells), and the interstitial space between them. From a computational modeling point of view, the tissue comprises volumes and surfaces, defined as those portions of the tissue where molecules can move in all directions (volumes) and those portions where the movement of molecules is restricted to a plane (e.g., receptors inserted in the cell membrane can move only laterally). There are three major volumes of the tissue for our purposes: the vascular space (i.e., inside the blood vessels, determined by the density of blood vessels and their diameters); the intracellular space (whether inside parenchymal cells or endothelial cells); and the interstitial space between cells, which is itself divided into three volumes (none of which is contiguous), based on the density of the fibrous matrix present: the extracellular matrix, and the basement membrane regions surrounding the endothelial cells and the myocytes.
[Figure 18.3 near here. Panels E and F plot VEGF gradients (%VEGF/10 μm) and VEGF·VEGFR2 complexes (thousands per endothelial cell), peak and average values, for the conditions: untreated; gene therapy (uniform, random, regional); cell therapy (distant, adjacent); and exercise (rest, exercise training).]

Figure 18.3 Mesoscale modeling. (A) Schematic of generated microvascular network of capillaries, arterioles, and venules, consistent with histological and other measurements of rat skeletal muscle. (B, C) Cross-section of muscle (indicated by box in A), showing the capillaries (red) and the muscle fibers or myocytes (brown; ''SM''). Detail in (C) of myocytes, endothelial cells, and the extracellular matrix (''ECM'') and basement membranes (gray; ''MBM'' and ''EBM'') between them. For three-dimensional simulations, the full volume of tissue is used; for two-dimensional simulations, the indicated cross-section is used. (D) Schematic illustrating how the tissue microanatomy (top row) impacts the calculation of blood flow, oxygen distribution, VEGF distribution, and VEGFR binding. (E) Local VEGF gradients within the ligated EDL following treatment. The maximum within the tissue and the average across the tissue are reported. (F) VEGFR2 activation on vessels of the ligated EDL following treatment. (A–D) Adapted from Mac Gabhann et al. (2007a). (E, F) Based on results previously published in Mac Gabhann et al. (2007a).
There are two major surfaces; again, these are not contiguous. First, the combined cell surfaces of the skeletal myocytes, which are assumed to be cylindrical (diameter 37.5 μm, consistent with rat histology) and arranged in a regular hexagonal grid formation, accounting for almost 80% of total tissue cross-sectional area; VEGF is secreted from the myocytes' surfaces. Second, the surface of the endothelial cells that make up the blood vessels, specifically the abluminal surface (the luminal surface faces the blood stream, and we neglect it for now). Again, the blood vessels are assumed to be cylindrical, and although most (but not all) are parallel to the muscle fibers, they do not occupy every possible position between fibers, but instead have
a stochastic, nonuniform arrangement (based on experimentally measured capillary-to-fiber ratios, capillary-to-fiber distances, and histology), occupying 2.5% of total tissue volume (leaving 18% as interstitial space). On this endothelial surface, VEGF receptors are expressed. Thus, VEGF must diffuse from the myocyte surface where it is secreted, through basement membranes and extracellular matrix, to the endothelial surface where it ligates its cognate receptors. To model tissues at the mesoscale, we use the above microanatomical information as an input to a set of integrated models of blood flow, oxygen transport, and VEGF transport (Fig. 18.3D).

3.1.2. Volumes: Blood flow
Blood flow and hematocrit calculations are based on Pries et al.'s two-phase continuum model (Pries and Secomb, 2005), and reduce to a system of nonlinear algebraic equations (two per vessel) that are solved iteratively (Ji et al., 2006). The Fahraeus–Lindqvist effect and nonuniform hematocrit distribution at vascular bifurcations are included in the blood flow model. Higher blood flow rates are used for exercising conditions, to represent the increased perfusion (and enhanced oxygen delivery) of exercising muscles. In addition, exercise-trained rats have higher average capillary blood velocity.

3.1.3. Volumes: Diffusion and consumption of oxygen
Oxygen transport in the tissue is detailed in Ji et al. (2006) and Goldman and Popel (2000). Oxygen arrives in the tissue via the blood vessel network, and the partial pressure of oxygen in the vessels, P_b, is described by

$$v_b\left(\alpha_b + H_D C_{bind}^{RBC}\,\frac{\partial S_{Hb}^{RBC}}{\partial P_b}\right)\frac{\partial P_b}{\partial x} + \frac{2}{R}\,J_{wall} = 0 \qquad (18.15)$$

where v_b is the mean blood velocity; α_b is the oxygen solubility in blood; H_D is the discharge hematocrit; C_bind^RBC and S_Hb^RBC are the oxygen-binding capacity and oxygen saturation of the red blood cells; x is the longitudinal position in the vessel; R is the vessel radius; and J_wall is the oxygen flux across the vessel wall (i.e., into the tissue). Oxygen diffuses across the endothelial cells, and freely throughout the tissue (both interstitial and intracellular). Within the cells, it may also bind reversibly to myoglobin (Mb). The local partial pressure of oxygen in the tissue, P, is described by

$$\left(1 + \frac{C_{bind}^{Mb}}{\alpha}\frac{\partial S_{Mb}}{\partial P}\right)\frac{\partial P}{\partial t} = D_{O_2}\nabla^2 P + \frac{D_{Mb}\,C_{bind}^{Mb}}{\alpha}\,\nabla\cdot\!\left(\frac{\partial S_{Mb}}{\partial P}\,\nabla P\right) - \frac{M(P)}{\alpha} \qquad (18.16)$$

where D_O2 and D_Mb are the diffusivities of oxygen and myoglobin; α is the oxygen solubility in tissue; C_bind^Mb and S_Mb are the oxygen-binding capacity and oxygen saturation of myoglobin; and M(P) represents Michaelis–Menten kinetic consumption of oxygen.
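As a small illustration of the sink and saturation terms entering Eq. (18.16), the sketch below codes a Michaelis–Menten consumption M(P) and a hyperbolic myoglobin saturation curve; the constants are placeholders of roughly the right order of magnitude, not the parameterization of Ji et al. (2006) or Goldman and Popel (2000):

```python
import numpy as np

# Oxygen sink and myoglobin saturation terms of Eq. (18.16); values are
# illustrative placeholders.
M0, Km = 1.0, 1.0        # max consumption rate, Michaelis constant (mmHg)
P50_Mb = 5.3             # myoglobin half-saturation pressure (mmHg)

def M(P):
    """Michaelis-Menten oxygen consumption: M(P) = M0 * P / (P + Km)."""
    return M0 * P / (P + Km)

def S_Mb(P):
    """Myoglobin O2 saturation; its derivative dS/dP enters Eq. (18.16)."""
    return P / (P + P50_Mb)

def dS_Mb_dP(P):
    """Analytic derivative of S_Mb with respect to P."""
    return P50_Mb / (P + P50_Mb) ** 2

P = np.linspace(0.5, 40.0, 5)
print(np.c_[P, M(P), S_Mb(P), dS_Mb_dP(P)])
```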
3.1.4. Volumes: Diffusion of VEGF and sequestration by ECM
The VEGF ligands, VEGF120 (the rodent form of VEGF121) and VEGF164 (the rodent form of VEGF165), can both diffuse through the interstitium following secretion; however, the longer isoform also binds to glycoproteins in the extracellular matrix, becoming reversibly sequestered. The equations are thus identical to those of Section 2, with the addition of binding and unbinding terms:

$$\frac{\partial C_i}{\partial t} = D_i\nabla^2 C_i - \sum_{j}^{\text{if } i+j \text{ bind}}\left(k_{on,i,j}\,C_i C_j - k_{off,ij}\,C_{ij}\right) \qquad (18.17)$$

where C_i and C_j are the concentrations of two interstitial molecules, i and j. In the rat EDL model, C_i = [V120] or [V164]; C_j = [GAG]. The concentration of proteins in the thin endothelial or myocyte basement membranes is given by an equation of the form:

$$\frac{\partial C_i}{\partial t} = \frac{1}{d_{BM}}\left(s_i - J_{out,i}\right) + \sum_{m}^{\text{if } i+m \text{ bind}}\left(k_{off,im}\,R_{im} - k_{on,i,m}\,C_i R_m\right) + \sum_{j}^{\text{if } i+j \text{ bind}}\left(k_{off,ij}\,C_{ij} - k_{on,i,j}\,C_i C_j\right) \qquad (18.18)$$
where si is the secretion rate from the cell (typically from myocytes); Rm and Rim are the concentrations of receptor m and of the i–m complex on the cell surface (typically on endothelial cells); Jout is the Fickian diffusive flux from BM to ECM of VEGF; and dBM is the basement membrane thickness.
3.1.5. Surfaces: Receptor–ligand interactions
The ligand–receptor interactions that take place are precisely those that were outlined in Section 2, and that will be used in Sections 4 and 5: VEGF120 and VEGF164 bind to VEGFR1 and VEGFR2, while only the longer isoform binds Neuropilin-1 and the extracellular matrix. The general form of the receptor and receptor complex equations is therefore:
$$\frac{\partial R_m}{\partial t} = \left(s_m - k_{int,m}R_m\right) + \sum_{i}^{\text{if } i+m \text{ bind}}\left(k_{off,im}\,R_{im} - k_{on,i,m}\,C_i R_m\right) + \sum_{n}^{\text{if } m+n \text{ bind}}\left(k_{dissoc,mn}\,R_{mn} - k_{couple,m,n}\,R_m R_n\right) \qquad (18.19)$$
where s_m and k_int,m are the membrane insertion rate and internalization rate of receptor m; k_couple,m,n and k_dissoc,mn are the kinetic rates of binding and unbinding of two surface receptors m and n to each other. Note in particular that the concentration of the ligand (C_i) in each case is the concentration in the basement membrane region closest to the receptor. Thus, the receptor occupancy varies from cell to cell across the capillary network. Examples of specific individual equations can be found in Section 2.

3.1.6. Surfaces: VEGF production/secretion rates
The production and secretion of VEGF has been observed to be inducible by hypoxia (Forsythe et al., 1996; Qutub and Popel, 2006). Here, we use an empirical relationship (Mac Gabhann et al., 2007b) for the increase in the baseline secretion rate of VEGF (S_0), based on the observed upregulation of VEGF mRNA and protein during hypoxia in cells and tissues (Jiang et al., 1996; Tang et al., 2004):

$$S = S_0\left[1 + 5\left(\frac{20 - P_{O_2}}{19}\right)^{a}\right], \quad \text{such that}\ S = S_0\ \text{at}\ P_{O_2} \ge 20\ \text{mmHg and}\ S = 6S_0\ \text{at}\ P_{O_2} = 1\ \text{mmHg} \qquad (18.20)$$
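Eq. (18.20) is straightforward to implement; the sketch below (with S0 = 1 as a placeholder baseline) reproduces its two anchor points, S = S0 at PO2 ≥ 20 mmHg and S = 6S0 at PO2 = 1 mmHg:

```python
def vegf_secretion(PO2_mmHg, S0=1.0, a=1.0):
    """Hypoxia-upregulated VEGF secretion per Eq. (18.20).
    S = S0 for PO2 >= 20 mmHg, rising to 6*S0 at PO2 = 1 mmHg;
    the exponent a shapes the rise. S0 = 1 here is a placeholder."""
    if PO2_mmHg >= 20.0:
        return S0
    return S0 * (1.0 + 5.0 * ((20.0 - PO2_mmHg) / 19.0) ** a)

for p in (25.0, 20.0, 10.0, 1.0):
    print(f"PO2 = {p:4.1f} mmHg -> S/S0 = {vegf_secretion(p):.2f}")
```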
3.1.7. What is not included in these models?
Intracellular VEGF is not included in these simulations; that includes both postinternalization VEGF and presecretion VEGF. In addition, we neglect the intravasation of VEGF into the bloodstream, either by endothelial cell secretion or through paracellular routes, for example, permeability. Lymphatic transport of VEGF is also neglected. These additional transport routes could be accommodated in the above model structure with the addition of new surfaces or terms. Although endothelial VEGF production and parenchymal VEGFR expression have been observed in recent years (Bogaert et al., 2009; Lee et al., 2007a,b), these are not included as part of these simulations; there is no technical obstacle to doing so.
3.1.8. Relationship to single-compartment models
It is important to note that the spatial averages of VEGF concentrations at the endothelial cell surface and of VEGFR activation in the mesoscale models match well with the values in single-compartment models (Section 4) that do not include diffusion or VEGF gradients. Thus, it may be possible to calculate the average receptor activation using less computationally intensive compartment models, and use the mesoscale models to estimate the spatial gradients.
3.2. Case study: Proangiogenic VEGF gene therapy for muscle ischemia
To improve the perfusion and healing of ischemic muscle tissue with an impaired angiogenic response, several therapies have been suggested, typically involving the delivery of VEGF (one or more isoforms) to the muscle. The first of these, gene therapy, increases VEGF secretion by adding additional VEGF-encoding genes to the transfected cells. By transfecting multiple copies, or by judicious choice of VEGF promoters and enhancers in the new construct, significant increases in VEGF secretion can be obtained. We have modeled both uniform upregulation of VEGF (increasing VEGF secretion at every myocyte surface point in the model) and stochastic upregulation, in which each cell has a randomly increased VEGF production within a certain range; using the myonuclear density, we know the size of the myocyte surface that is under the control of each nucleus, and can thus assign to each such region a random number that stays constant through the simulation (Mac Gabhann et al., 2007a) (see the sketch below). These increases in VEGF production result in increased VEGFR2 activation; however, the VEGF gradients are not significantly increased (Fig. 18.3E and F): in this case, blood vessels might be induced to sprout, but have no directional cues. Further simulations restricting the VEGF transfection to a specific region of the muscle demonstrate increased VEGFR2 activation coupled with very high VEGF gradients towards the transfected tissue, but only in a narrow region between transfected and nontransfected tissue (Mac Gabhann et al., 2007a). This suggests that VEGF gene delivery needs to be localized with a high degree of spatial accuracy to allow the gradients of VEGF to bring the new vessels to the affected volume.
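The stochastic upregulation condition amounts to drawing one time-invariant multiplier per myonuclear surface domain. A minimal sketch follows; the domain count and fold-range are illustrative choices, not the values of Mac Gabhann et al. (2007a):

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed for reproducibility

# Each myonuclear domain on the myocyte surface gets its own random,
# time-invariant secretion multiplier; counts and ranges are placeholders.
n_domains = 500                  # surface regions, one per myonucleus
baseline = 1.0                   # baseline secretion rate (arbitrary units)

uniform_therapy = np.full(n_domains, 3.0 * baseline)              # uniform case
stochastic_therapy = baseline * rng.uniform(1.0, 5.0, n_domains)  # random case

# Multipliers are drawn once and held constant, so each domain's secretion
# boundary condition stays fixed over the whole simulation.
print(f"uniform mean: {uniform_therapy.mean():.2f}; "
      f"stochastic mean: {stochastic_therapy.mean():.2f} "
      f"(sd {stochastic_therapy.std():.2f})")
```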
3.3. Case study: Proangiogenic VEGF cell-based therapy for muscle ischemia
Another route to bringing more VEGF to the tissue, and one which may allow for more spatial specificity, is the delivery of VEGF-overexpressing cells, for example, myoblasts that will effectively integrate into the existing muscle and produce excess VEGF locally. To simulate this, we select specific myocytes in the model to overexpress VEGF, and distribute these distantly or close together (Mac Gabhann et al., 2006, 2007a). That is, since the secretion rate of VEGF can have a different value for every spatial location on the myocyte surface in our model, we can upregulate VEGF in a specific subset of these cells. For this therapy, we observe in the simulations both increased VEGFR2 binding and increased VEGF gradients (Fig. 18.3E and F), but only within approximately one to two myocyte diameters of the new VEGF-overexpressing cells (Mac Gabhann et al., 2006, 2007a). In addition, cells close together synergize while distant ones do not. In this way, we can see that a small number of cells, or cells distributed too broadly, would have a low probability of attracting perfusion from a neighboring region; however, a large mass of cells, at the right location, could serve as a local chemoattractant. The results described in Sections 3.2 and 3.3, for therapies reliant on VEGF upregulation alone, mirror the outcome of several clinical trials of VEGF isoforms in humans for coronary artery disease (CAD) or peripheral artery disease (PAD); these trials have not had the success that was expected of them. Instead, the standard of care for PAD continues to be exercise, and it is this therapy that we consider next.
3.4. Case study: Proangiogenic exercise therapy for muscle ischemia
Exercise training in rats has been shown not only to restore the ability of hypoxic, ischemic tissue to upregulate VEGF following injury, but also to increase the expression levels of the VEGF receptors (Lloyd et al., 2003). Thus, we used our model to simulate the exercise-dependent upregulation of both the ligands and the receptors, using experimentally measured increases (Ji et al., 2007; Mac Gabhann et al., 2007a,b). In this case, we increase the secretion rate of VEGF isoforms from each point on the myocyte surface during exercise; in addition, we increase the insertion rate of the VEGF receptors at every point on the endothelial cell surface at all times (as a result of exercise training). The results of these simulations are quite different from those before: first, during exercise, both the VEGFR2 activation and the VEGF gradients are increased, not just locally but across the upregulated tissue (Ji et al., 2007; Mac Gabhann et al., 2007a); second, during rest periods, while VEGF upregulation ceases and the occupancy of VEGFR2 returns to lower levels, the high VEGF gradients are maintained (Fig. 18.3E and F). This suggests that the activation step for attracting new blood vessels may occur during a smaller window of time, while the guidance of the new vessel to its destination can take place continuously.
The observation that our current best strategy for PAD, exercise, increases both ligand expression and receptor activation leaves us with the possibility of developing a combined ligand–receptor therapy (especially for patients who cannot exercise).
4. Single-Tissue Compartmental Models: Simulation of In Vivo Tissue
From Section 3, we saw that the investigative use of VEGF models for therapeutic applications requires larger-spatial-scale modeling of in vivo growth factor transport and effects at the tissue level. However, the full spatial resolution afforded by 3D modeling becomes computationally intensive when the volume of interest grows from tissue regions to whole tissues and organs. Here we demonstrate the strategy of compartmental modeling, where tissue fluid volumes (e.g., the interstitial fluid volume) are approximated as well-mixed compartments of uniform protein concentrations, based on the assumption that diffusion occurs on faster timescales than molecular binding kinetics (Damköhler number < 1; Mac Gabhann and Popel, 2006). While compartmental models cannot predict diffusion-limited VEGF gradients, they allow the prediction of average interstitial VEGF levels and average cell-surface VEGF–VEGFR binding within single tissues (Section 4) or in multiple interconnected tissues (Section 5).
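A quick order-of-magnitude check of this well-mixed assumption can be coded directly. In the sketch below, the Damköhler number is estimated as the ratio of a characteristic binding rate to the diffusion rate across a capillary-scale length; all inputs are illustrative placeholders:

```python
# Order-of-magnitude check of the well-mixed assumption (Damkohler number < 1).
# Da compares a characteristic binding rate with the diffusion rate across the
# relevant length scale; all values below are illustrative placeholders.
k_on = 1e7 * 1e3      # association rate: 1e7 M^-1 s^-1 -> cm^3/mol/s
R_conc = 1e-13        # receptor concentration seen by the ligand (mol/cm^3)
D = 1e-6              # ligand diffusivity (cm^2/s)
L = 20e-4             # half inter-capillary distance: 20 um (cm)

Da = (k_on * R_conc) * L**2 / D   # (binding rate, 1/s) / (diffusion rate, D/L^2)
print(f"Da = {Da:.3f} -> well-mixed assumption "
      f"{'reasonable' if Da < 1 else 'questionable'}")
```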
4.1. Mathematical framework for tissue porosity and available volume fractions
In converting interstitial volumetric protein concentrations to their appropriate per-tissue-volume basis, it is useful to consider extravascular regions as porous media (Truskey et al., 2004) and to use the following standard definitions. ε is the porosity, defined as the fractional void space within the total tissue volume. Φ is the partition coefficient, representing the fraction of the interstitial fluid volume that is available or accessible to macromolecules, that is, excluding all isolated or impenetrable pores. Together, they define the available volume fraction, K_AV, which is used to convert interstitial concentrations from a per-interstitial-fluid-volume (M) to a per-tissue-volume (mol/(cm³ tissue)) basis:

$$K_{AV} = \frac{\text{available interstitial fluid volume}}{\text{total tissue volume}} = \varepsilon\,\Phi = \left(\varepsilon_{IS}\,f\right)\Phi \qquad (18.21)$$

where ε_IS = (interstitial space)/(total tissue volume), f = (interstitial fluid volume)/(interstitial space), and Φ = (available fluid volume)/(interstitial fluid volume).
Endothelial cell-surface receptor and complex densities also have to be converted from their per-surface-area units (mol/cm2 cell surface area) to per-tissue-volume (mol/cm3 tissue) basis for consistency within the mathematical equations given below. For this purpose, other geometrical attributes (e.g., microvessel density and diameters, surface area per endothelial cell, and endothelial surface area per tissue volume) have to be quantified to reflect the particular tissue architecture modeled, as documented in Mac Gabhann and Popel (2006) for breast tumor tissue geometry.
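The conversion itself is a single multiplication; the sketch below applies Eq. (18.21) with a placeholder K_AV (the actual value is tissue-specific, as documented in Mac Gabhann and Popel (2006) for the breast tumor geometry):

```python
def interstitial_to_tissue(conc_molar, k_av):
    """Convert an interstitial concentration (M = mol/L, defined over the
    available fluid volume) to a per-tissue-volume basis (mol/cm^3 tissue)
    via Eq. (18.21): C_tissue = C_fluid * K_AV, with 1 L = 1000 cm^3."""
    return conc_molar * k_av / 1000.0

# Illustrative values only: 1 nM interstitial VEGF, placeholder K_AV = 0.1
print(interstitial_to_tissue(1e-9, k_av=0.1))   # -> 1e-13 mol/cm^3 tissue
```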
4.2. Case study: Pharmacodynamic mechanism and tumor microenvironment affect efficacy of anti-NRP1 therapy in cancer
In this example of a single-tissue compartmental model, three strategies of inhibiting different aspects of NRP1 functionality were simulated to predict their relative in vivo antiangiogenic efficacies as cancer treatments in breast tumor tissue. The first strategy, inhibiting NRP1 expression (Fig. 18.4A), which could be achieved clinically through siRNA methods, was simulated by reducing the NRP1 insertion rate, s_N. The second strategy, impeding VEGF binding to NRP1 (Fig. 18.4B), which could be implemented clinically through a PlGF fragment (P2D) that binds NRP1 exclusively, was modeled with additional equations representing the competition between VEGF and P2D for NRP1 (Eq. (18.22)) and corresponding modifications to the original NRP1 equation. The third strategy, blocking NRP1–VEGFR coupling while allowing NRP1–VEGF binding (Fig. 18.4C), which had been done experimentally using an NRP1 antibody, was modeled with additional equations for the antibody interactions (Eq. (18.23)) and corresponding modifications to the original VEGF165, NRP1, and VEGF165·NRP1 equations. Detailed references and full equations can be found in Mac Gabhann and Popel (2006):

$$\begin{aligned}
\frac{d[P2D]}{dt} &= -k_{on,PN1}[P2D][N1] + k_{off,PN1}[P2D{\cdot}N1]\\
\frac{d[P2D{\cdot}N1]}{dt} &= -k_{int,PN1}[P2D{\cdot}N1] + k_{on,PN1}[P2D][N1] - k_{off,PN1}[P2D{\cdot}N1]
\end{aligned} \qquad (18.22)$$
[Figure 18.4 near here. Panel D annotations: peak and average (48 h) % inhibition of VEGFR2 signaling for doses of 10 nM to 10 μM of each of the three strategies (block expression, block binding, block coupling), shown for tumor endothelium expressing 0 versus 10,000 VEGFR1 per endothelial cell.]
Figure 18.4 Single-tissue compartmental model. Simulation of three targeted strategies of blocking NRP1 functionality: (A) NRP1 expression; (B) VEGF–NRP1 binding; (C) VEGFR–NRP1 coupling. (D) Inhibition of VEGF–VEGFR2 signaling by increasing doses of each of the three strategies outlined in (A–C). Peak and average over the first 48 h following therapeutic administration. Sample results (D) based on data previously published in Mac Gabhann and Popel (2006).
$$\begin{aligned}
\frac{d[Ab_{NRP}]}{dt} &= -k_{on,AbN1}[Ab_{NRP}][N1] + k_{off,AbN1}[Ab_{NRP}{\cdot}N1]\\
&\quad - k_{on,AbN1}[Ab_{NRP}][V_{165}{\cdot}N1] + k_{off,AbN1}[V_{165}{\cdot}N1{\cdot}Ab_{NRP}]\\
\frac{d[Ab_{NRP}{\cdot}N1]}{dt} &= -k_{int,AbN1}[Ab_{NRP}{\cdot}N1] + k_{on,AbN1}[Ab_{NRP}][N1] - k_{off,AbN1}[Ab_{NRP}{\cdot}N1]\\
&\quad - k_{on,VN1}[V_{165}][Ab_{NRP}{\cdot}N1] + k_{off,VN1}[V_{165}{\cdot}N1{\cdot}Ab_{NRP}]\\
\frac{d[V_{165}{\cdot}N1{\cdot}Ab_{NRP}]}{dt} &= -k_{int,VN1}[V_{165}{\cdot}N1{\cdot}Ab_{NRP}]\\
&\quad + k_{on,AbN1}[Ab_{NRP}][V_{165}{\cdot}N1] - k_{off,AbN1}[V_{165}{\cdot}N1{\cdot}Ab_{NRP}]\\
&\quad + k_{on,VN1}[V_{165}][Ab_{NRP}{\cdot}N1] - k_{off,VN1}[V_{165}{\cdot}N1{\cdot}Ab_{NRP}]
\end{aligned} \qquad (18.23)$$

Numerical solution of the full system of coupled nonlinear ordinary differential equations was achieved using a Runge–Kutta integration scheme (Mac Gabhann and Popel, 2006). Sample results are shown in Fig. 18.4D. The results predict that the third strategy, blocking NRP1–VEGFR coupling, will induce the most effective and sustained inhibition of VEGFR2 ligation and signaling, with its differential advantage being most discernible in tumor types where endothelial expression of VEGFR1 is low. This case study demonstrates that model incorporation of molecularly detailed interaction networks allows the sophisticated optimization of therapeutic strategies, from the fine-tuning of therapeutic agent dosing to the customization of targeted molecular mechanisms according to the diseased tissue type with its characteristic receptor expression levels.
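As an illustration of how such a system is integrated, the sketch below applies an adaptive Runge–Kutta method (SciPy's RK45) to a reduced, NRP1-side subset of the Eq. (18.23) interactions; the rate constants, internalization rates, and dose are hypothetical placeholders rather than the published parameter set:

```python
from scipy.integrate import solve_ivp

# Reduced sketch of the anti-NRP1 antibody interactions of Eq. (18.23);
# all parameter values below are hypothetical placeholders.
kon_ab, koff_ab, kint_ab = 1e5, 1e-4, 1e-4   # Ab + N1 binding/internalization
kon_vn, koff_vn, kint_vn = 1e5, 1e-3, 1e-4   # V165 + N1 (left intact by the Ab)

def rhs(t, y):
    Ab, N1, V, AbN, VN, VNAb = y
    r_ab_n  = kon_ab * Ab * N1 - koff_ab * AbN    # Ab + N1    <-> Ab.N1
    r_v_n   = kon_vn * V * N1 - koff_vn * VN      # V + N1     <-> V.N1
    r_ab_vn = kon_ab * Ab * VN - koff_ab * VNAb   # Ab + V.N1  <-> V.N1.Ab
    r_v_abn = kon_vn * V * AbN - koff_vn * VNAb   # V + Ab.N1  <-> V.N1.Ab
    return [-r_ab_n - r_ab_vn,                    # free antibody
            -r_ab_n - r_v_n,                      # free NRP1
            -r_v_n - r_v_abn,                     # free V165
            r_ab_n - r_v_abn - kint_ab * AbN,     # Ab.N1
            r_v_n - r_ab_vn,                      # V.N1
            r_ab_vn + r_v_abn - kint_vn * VNAb]   # V.N1.Ab (coupling-blocked)

y0 = [1e-7, 1e-9, 1e-10, 0.0, 0.0, 0.0]   # excess antibody dose (M), illustrative
sol = solve_ivp(rhs, [0, 48 * 3600], y0, method="RK45", rtol=1e-8)
print(f"free NRP1 remaining after 48 h: {sol.y[1, -1] / y0[1]:.3%}; "
      f"V.N1.Ab: {sol.y[5, -1]:.2e} M")
```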
5. Multitissue Compartmental Models: Simulation of Whole Body
In this section, we introduce models that compartmentalize the whole body into multiple tissue compartments. The two examples given below each have three compartments (Fig. 18.5): the blood; the tissue of interest (breast tumor tissue in Section 5.2 and calf muscle in Section 5.3); and the rest of the body, which we call the ''normal compartment.'' Future model extensions can further subdivide the normal compartment into individual organ compartments. In this chapter, we demonstrate that the modeling of two major intercompartment transport processes—vascular permeability and lymphatic drainage—is essential in predicting the whole-body pharmacokinetics of macromolecules.
[Figure 18.5 near here. Schematic labels: three compartments (normal tissue, blood/plasma, tissue of interest); within each tissue, parenchymal and stromal cells, interstitium (ECM, EBM, PBM), and the abluminal and luminal sides of the endothelium; processes include VEGF secretion, sR1 secretion, internalization, transendothelial permeability, lymphatic drainage, plasma clearance, and I.V. infusion.]
Figure 18.5 Whole-body multitissue compartmental model. Schematic of the experimental setup for investigation of the whole-body pharmacokinetics of sVEGFR1 and anti-VEGF, as well as their effects on VEGF signaling in the calf muscle and breast cancer (''tissues of interest''), respectively. Endogenous sources of sVEGFR1 include endothelial production. Intravascular infusion of exogenous sVEGFR1 and anti-VEGF was also simulated. Figure adapted from Wu et al. (2009b).
5.1. Mathematical framework of intertissue transport
5.1.1. Macromolecular vascular permeability
Unlike small molecules, which can easily traverse microvessel endothelia through interendothelial cleft junctions (Fu and Shen, 2003), the transendothelial permeation of macromolecules such as peptide growth factors occurs through caveolar structures and vesiculo-vacuolar organelles, at much slower rates that are dependent on protein size and VEGF induction
(VEGF is also known as VPF, vascular permeability factor) (Feng et al., 2002; Fu and Shen, 2003; Garlick and Renkin, 1970). The equations relevant to protein transport between the available fluid volumes of the tissue compartment and the blood compartment via vascular permeability are:

$$\begin{aligned}
\frac{d[\text{protein}]_{interstitium}}{dt} &= -\frac{k_p^{TB}\,S_{TB}}{\text{Tissue Volume}}\,\frac{[\text{protein}]_{interstitium}}{K_{AV,tissue}} + \frac{k_p^{BT}\,S_{TB}}{\text{Tissue Volume}}\,\frac{[\text{protein}]_{blood}}{K_{AV,blood}}\\
\frac{d[\text{protein}]_{blood}}{dt} &= -\frac{k_p^{BT}\,S_{TB}}{\text{Blood Volume}}\,\frac{[\text{protein}]_{blood}}{K_{AV,blood}} + \frac{k_p^{TB}\,S_{TB}}{\text{Blood Volume}}\,\frac{[\text{protein}]_{interstitium}}{K_{AV,tissue}}
\end{aligned} \qquad (18.24)$$

where k_p^TB represents the microvascular permeability rate for the protein of interest going from the tissue (T) to the blood (B) compartment (cm/s), across S_TB (cm²), the total endothelial surface area (i.e., the tissue–blood interface). Microvascular permeability is modeled as a passive bidirectional transport process, represented in Fig. 18.5 by a double arrow, with identical intravasation and extravasation rates, k_p^TB = k_p^BT. The in vivo value of k_p for a given protein is rarely found in the literature. Instead, we extrapolate from calibration curves (Garlick and Renkin, 1970) correlating permeability rates with protein size (the Stokes–Einstein radius of the molecule). For instance, the calculated Stokes–Einstein radius (a_e) of the 45-kDa globular VEGF protein is 30.2 Å, according to a_e = 0.483 × (molecular weight in Da)^0.386, as given in Venturoli and Rippe (2005). Our extrapolation methods yield a baseline permeability rate of 4.3 × 10⁻⁸ cm/s for VEGF (Stefanini et al., 2008). The microvascular permeability for VEGF in tumor tissue (for use in modeling breast tumor as the tissue of interest in Section 5.2) can be estimated from corresponding values for similar-sized molecules; for example, ovalbumin (45 kDa; a_e = 30.8 Å) was measured to permeate at about 5.77 × 10⁻⁷ cm/s (Yuan et al., 1995). Details can be found in Stefanini et al. (2008).

5.1.2. Lymphatic drainage
Lymphatic drainage is a major route by which interstitial proteins are transported to the blood, because size-dependent transendothelial permeability restricts their intravasation into blood capillaries. In contrast, there is no macromolecular impedance in the filling of the initial lymphatics; hence, the protein concentrations drained through the lymphatics are assumed to be continuous with the interstitial concentrations at the lymphatic entrance. Mathematically, we describe unidirectional lymphatic drainage as:
$$\frac{d[\text{protein}]_{interstitium}}{dt} = -\frac{k_L}{\text{Tissue Volume}}\,\frac{[\text{protein}]_{interstitium}}{K_{AV,tissue}}; \qquad \frac{d[\text{protein}]_{blood}}{dt} = +\frac{k_L}{\text{Blood Volume}}\,\frac{[\text{protein}]_{interstitium}}{K_{AV,tissue}} \qquad (18.25)$$
where kL is the lymphatic drainage rate (cm3/s). Detailed derivation and parameterization can be found in Wu et al. (2009b).
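Equations (18.24) and (18.25) combine into a compact two-compartment system for an inert protein. The sketch below integrates that system using the VEGF-scale permeability quoted above; the compartment geometry and drainage rate are illustrative placeholders, not the parameterization of Stefanini et al. (2008) or Wu et al. (2009b):

```python
from scipy.integrate import solve_ivp

# Two-compartment sketch of Eqs. (18.24)-(18.25) for one inert protein
# exchanging between a tissue interstitium and plasma; geometry and rates
# are round illustrative numbers.
V_tissue, V_blood = 1e3, 5e3     # compartment volumes (cm^3)
K_tissue, K_blood = 0.1, 0.6     # available volume fractions
S_TB = 1e5                       # tissue-blood endothelial surface area (cm^2)
k_p = 4.3e-8                     # permeability for a VEGF-sized protein (cm/s)
k_L = 2e-3                       # lymphatic drainage rate (cm^3/s)

def rhs(t, y):
    c_tis, c_bld = y             # per-tissue-volume concentrations (mol/cm^3)
    perm = k_p * S_TB * (c_bld / K_blood - c_tis / K_tissue)  # bidirectional
    lymph = k_L * c_tis / K_tissue                            # unidirectional
    return [(perm - lymph) / V_tissue,      # tissue interstitium
            (-perm + lymph) / V_blood]      # plasma

# Start with protein only in the tissue and watch it redistribute over a week.
sol = solve_ivp(rhs, [0, 7 * 86400], [1e-13, 0.0], rtol=1e-9)
print("final tissue/blood levels (mol/cm^3):", sol.y[:, -1])
```

Note that the right-hand side conserves total protein mass by construction (the permeability and lymphatic fluxes appear with opposite signs in the two compartments), as in Eqs. (18.24) and (18.25).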
5.2. Case study: Pharmacokinetics of anti-VEGF therapy in cancer
The pharmacokinetics of anti-VEGF therapy can be studied via whole-body compartmental modeling based on the detailed biochemical reactions between VEGF ligands and their receptors described above. Specifically, we can simulate and study the VEGF isoform specificity of anti-VEGF agents, ligand–agent binding configurations (e.g., whether such binding is mono- or multimeric), agent biodistribution (whether the anti-VEGF agent is confined to one or several compartments), as well as various therapeutic regimen designs (varying the dosage, the frequency of administration, and the site of injection). This example models bevacizumab, a humanized monoclonal antibody to VEGF, the characteristics and properties of which have been reported (150 kDa; Kd of 1.8 nM in Presta et al. (1997); half-life = 21 days in Gordon et al. (2001)). The diseased tissue of interest is a tumor. In the absence of the anti-VEGF agent, a total of 40 ordinary differential equations (ODEs) describes the compartmental system (19 ODEs for each tissue; 2 ODEs for the blood compartment). When the anti-VEGF agent is added and confined to the blood compartment (a nonextravasating agent), three more equations are added to the model (blood compartment), representing the chemical interactions of the anti-VEGF agent with VEGF121 and VEGF165, as well as free anti-VEGF:

$$\begin{aligned}
\frac{d[A]_{blood}}{dt} &= q_A K_{AV,blood,A} - c_A[A]_{blood}\\
&\quad + k_{off,VA,blood}\frac{K_{AV,blood,A}}{K_{AV,blood,VA}}[V_{165}{\cdot}A]_{blood} - k_{on,VA,blood}\frac{K_{AV,blood,A}}{K_{AV,blood,V}}[V_{165}]_{blood}[A]_{blood}\\
&\quad + k_{off,VA,blood}\frac{K_{AV,blood,A}}{K_{AV,blood,VA}}[V_{121}{\cdot}A]_{blood} - k_{on,VA,blood}\frac{K_{AV,blood,A}}{K_{AV,blood,V}}[V_{121}]_{blood}[A]_{blood}
\end{aligned} \qquad (18.26)$$

where K_AV,blood,i is the available volume fraction of molecule i (A = anti-VEGF, V = VEGF, VA = the VEGF/anti-VEGF complex). Note that U_AV = K_AV U. The first term represents the concentration of anti-VEGF drug injected into the patient's bloodstream. The second term represents the clearance of the anti-VEGF agent by the organs (e.g., kidneys or liver). The equations related to the complexed form of the anti-VEGF drug are of the form:

$$\frac{d[V_{165}{\cdot}A]_{blood}}{dt} = -c_{VA}[V_{165}{\cdot}A]_{blood} - k_{off,VA,blood}[V_{165}{\cdot}A]_{blood} + k_{on,VA,blood}\frac{\left(K_{AV,blood,VA}\right)^2}{K_{AV,blood,V}\,K_{AV,blood,A}}[V_{165}]_{blood}[A]_{blood} \qquad (18.27)$$
Figure 18.6 illustrates the transient dynamics resulting from an intravenous injection of the VEGF antibody; in these simulations, a breast tumor of 2 cm diameter is considered. In the single-dose treatment (10 mg/kg), intravenous injection leads to a rapid decrease in the free VEGF concentration (Fig. 18.6A). That level returns to baseline after about 3–4 weeks. For smaller daily doses, or ''metronomic'' treatment (1 mg/kg for 10 days), a new, lower pseudo-steady state for the plasma VEGF level emerges for the duration of treatment (Fig. 18.6B). Following treatment cessation (10 days), the plasma VEGF level returns to its pretreatment value after about 3 weeks. The metronomic injection also delays the peak of maximum formation of the VEGF–anti-VEGF complex as compared to a single-dose treatment (Fig. 18.6C and D). Equations (18.26) and (18.27) describe an intravenous injection of anti-VEGF antibodies that is confined to the plasma. If the anti-VEGF agent is injected intravenously and can extravasate, terms of the form of Eq. (18.24) are added to Eqs. (18.26) and (18.27), and additional equations corresponding to the free and bound antibody concentrations are needed for each tissue compartment in the system of ODEs. Interestingly, the addition of extravasation to the model drastically changes the form of the response. Transiently after injection, the free VEGF level in plasma drops drastically (data not shown), due to the binding of the antibody to the free VEGF present in the plasma. Following this drop, there is a several-fold increase of the free VEGF concentration in plasma. This apparent ''rebound'' effect is due to the amount of drug delivered and its extravasation. Briefly, while some antibodies bind to the free VEGF in plasma, another portion extravasates and binds to the free VEGF present in the available interstitial fluid volume of the tissues (healthy tissue and breast tumor). Although some of the formed complexes subsequently dissociate within the same tissue, significant quantities are brought into the bloodstream (via microvascular permeability and lymphatic drainage), where the complex dissociates, leading to more free VEGF in the blood compartment.
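The two regimens compared in Fig. 18.6 differ only in the drug source term q_A(t) (the first term of Eq. (18.26)). The sketch below encodes both schedules and confirms that they deliver the same cumulative dose; the 70-kg body mass is an illustrative assumption:

```python
import numpy as np

# Two infusion schedules of Fig. 18.6: a single 90-min infusion versus daily
# 90-min "metronomic" infusions for 10 days. Returns the drug source term
# q_A(t) in mg/s; dose magnitudes follow the text, body mass is illustrative.
BODY_KG = 70.0
INFUSION_S = 90 * 60.0

def q_A(t_s, metronomic=False):
    if metronomic:
        dose_mg = 1.0 * BODY_KG                 # 1 mg/kg daily for 10 days
        t_in_day = t_s % 86400.0
        active = (t_s < 10 * 86400.0) and (t_in_day < INFUSION_S)
        return dose_mg / INFUSION_S if active else 0.0
    dose_mg = 10.0 * BODY_KG                    # single 10 mg/kg dose
    return dose_mg / INFUSION_S if t_s < INFUSION_S else 0.0

# Cumulative delivered drug agrees (~700 mg either way for a 70-kg patient).
t = np.arange(0.0, 11 * 86400.0, 60.0)
for m in (False, True):
    total = sum(q_A(ti, m) for ti in t) * 60.0
    print("metronomic" if m else "single-dose", f"~{total:.0f} mg delivered")
```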
[Figure 18.6 near here. Panels plot free [VEGF] (pM), [VEGF·anti-VEGF] (nM), and free [anti-VEGF] (μM) in healthy tissue, blood, and tumor versus time (days, 0–35).]

Figure 18.6 Compartmental model of whole-body anti-VEGF pharmacokinetics. Comparison between single-dose (A) and metronomic (B) intravenously delivered anti-VEGF treatment (without extravasation of the anti-VEGF molecule). Each dose (intravenous infusion) takes place over 90 min. (A, C, E) Single dose of 10 mg/kg; (B, D, F) 1 mg/kg daily for 10 days.
Such a counterintuitive increase in serum VEGF following intravenous administration of anti-VEGF agents has been observed in experiments (Gordon et al., 2001; Segerstrom et al., 2006; Willett et al., 2005), and our model is, to our knowledge, the first to explain this phenomenon by an intrinsic mechanism of intertissue transport.
This example illustrates how computational models can provide useful insights that are not easily accessible by in vitro or in vivo experiments. This model can also be extended to examine the effects of drug treatment via alternate routes of anti-VEGF administration (e.g., intramuscular injection).
5.3. Case study: Mechanism of sVEGFR1 as a ligand trap
In this last example of a multitissue compartmental model, we investigated the molecular mechanisms by which sVEGFR1, a truncated soluble variant of the endothelial cell-surface VEGFR1, inhibits VEGF signaling. The two prevailing postulated mechanisms are: direct VEGF ligand sequestration, which reduces the VEGF available for VEGFR activation (Fig. 18.7A, middle); and heterodimerization with cell-surface VEGFR monomers, which renders the receptor dimer nonfunctional, as trans-phosphorylation of paired intracellular domains of full-length VEGFRs is necessary for activating signal transduction (Fig. 18.7A, bottom). The model, described in detail in Wu et al. (2009a), simulated the first mechanism, to assess the antiangiogenic potential of sVEGFR1's ligand-trapping capacity alone. sVEGFR1 in its monomeric (110 kDa) and dimeric (220 kDa) forms is about two to five times larger than VEGF; thus, free sVEGFR1 and the sVEGFR1·VEGF complex have lower vascular permeability rates than VEGF, while sharing the same lymphatic drainage rates as VEGF. Full mathematical equations describing these transport properties, along with the sVEGFR1–VEGF binding and sVEGFR1–NRP1 coupling interactions, can be found in Wu et al. (2009a). Sample equations for the concentrations of free sVEGFR1 and sVEGFR1·VEGF165 in tissue j are given here:

$$\begin{aligned}
\frac{d[V_{165}{\cdot}sR1]_j}{dt} &= -\frac{k_{L,j}}{U_j}\frac{[V_{165}{\cdot}sR1]_j}{K_{AV,j}} + \frac{\gamma_j S_{jB}}{U_j}\left(k_{p,VsR1}^{B\to j}\frac{[V_{165}{\cdot}sR1]_B}{K_{AV,B}} - k_{p,VsR1}^{j\to B}\frac{[V_{165}{\cdot}sR1]_j}{K_{AV,j}}\right)\\
&\quad + k_{on,VsR1,j}[V_{165}]_j[sR1]_j - k_{off,VsR1,j}[V_{165}{\cdot}sR1]_j
\end{aligned} \qquad (18.28)$$

$$\begin{aligned}
\frac{d[sR1]_j}{dt} &= q_{sR1,j} - \frac{k_{L,j}}{U_j}\frac{[sR1]_j}{K_{AV,j}} + \frac{\gamma_j S_{jB}}{U_j}\left(k_{p,sR1}^{B\to j}\frac{[sR1]_B}{K_{AV,B}} - k_{p,sR1}^{j\to B}\frac{[sR1]_j}{K_{AV,j}}\right)\\
&\quad - k_{on,sR1M,j}[sR1]_j[M_{EBM}]_j + k_{off,sR1M,j}[sR1{\cdot}M_{EBM}]_j\\
&\quad - k_{on,sR1M,j}[sR1]_j[M_{ECM}]_j + k_{off,sR1M,j}[sR1{\cdot}M_{ECM}]_j\\
&\quad - k_{on,sR1M,j}[sR1]_j[M_{PBM}]_j + k_{off,sR1M,j}[sR1{\cdot}M_{PBM}]_j\\
&\quad - k_{on,sR1N1,j}[sR1]_j[N1]_j + k_{off,sR1N1,j}[sR1{\cdot}N1]_j\\
&\quad - k_{on,VsR1,j}[V_{121}]_j[sR1]_j + k_{off,VsR1,j}[V_{121}{\cdot}sR1]_j\\
&\quad - k_{on,VsR1,j}[V_{165}]_j[sR1]_j + k_{off,VsR1,j}[V_{165}{\cdot}sR1]_j
\end{aligned} \qquad (18.29)$$
[Figure 18.7 near here. Panel B plots free VEGF (pM) in the calf interstitium, the normal interstitium, and plasma versus time (weeks) for control (''CTRL'') and for cases in which the permeability and/or lymphatic drainage rates of free sVEGFR1 and of sVEGFR1·VEGF are set to zero.]
Figure 18.7 Compartmental model of whole-body sVEGFR1 transport. (A) sVEGFR1 has been postulated to have antagonistic effects on VEGF signaling complex formation (top row) through competitive binding of VEGF ligands (middle row) and dominant-negative heterodimerization with endothelial cell-surface VEGF receptors (bottom row). (B) Sample simulations of the ligand-trapping effects of intravascularly administered exogenous sVEGFR1 based on results previously published in Wu et al. (2009a).
In this model, calf muscle tissue was chosen as the ''tissue of interest'' compartment (Fig. 18.5), in order to investigate the effects of endogenous sVEGFR1 produced in the calf muscle on local (calf compartment) and global (normal compartment) VEGF signaling complex formation. Such predictions were expected to provide insight into whether pathological upregulation of sVEGFR1 expression in the calf may contribute to the dampened VEGF response observed in ischemic calf muscles in peripheral arterial disease. Therapeutic intravascular (IV) delivery of exogenous sVEGFR1 was also simulated, to assess its efficacy in lowering systemic levels of VEGF (Wu et al., 2009a). While, intuitively, intravascular sVEGFR1·VEGF complex formation following simulated IV infusion of sVEGFR1 would be expected to lead to a sustained reduction of plasma free VEGF, the sample results in Fig. 18.7B show that the permeability and lymphatic drainage rates of free sVEGFR1 and of sVEGFR1·VEGF are predicted to critically determine whether this reduction takes place. In other words, these simulations suggest that, prior to the clinical translation of administering exogenous sVEGFR1 to lower systemic VEGF levels in antiangiogenic therapy, extensive experimental research is needed to exclude the computationally predicted scenarios in which sVEGFR1 delivery counterintuitively elevates plasma free VEGF. As in Section 5.2, this computational study demonstrates that the intertissue transport properties of proteins significantly affect their whole-body pharmacokinetic effects.
6. Conclusions
In this chapter, we summarized several computational models for investigating different spatial and temporal aspects of VEGF systems biology. In Section 2, we described molecular network models that simulated in vitro endothelial cell-surface interaction experiments to investigate the roles of the PlGF ligand and the NRP1 coreceptor within the VEGF system. In Section 3, we presented mesoscale models for investigating the effects of in vivo tissue architecture on VEGF ligand and receptor interactions, and for predicting the intramuscular response and relative therapeutic efficacy of various modalities of proangiogenic therapy (gene vs. cell vs. exercise therapy) for ischemic muscle diseases. In Sections 4 and 5, molecularly detailed compartmental modeling was introduced as a method for efficient prediction of average molecular concentrations within tissue subcompartments (e.g., interstitial or plasma VEGF concentrations; intramuscular cell-surface density of signaling complexes) and for investigation of intercompartment transport processes (e.g., microvascular permeability and lymphatic drainage). We described several compartmental model studies
predicting the in vivo effects of intratissue trafficking and whole-body pharmacokinetics on the angiogenic response to treatments using NRP1, anti-VEGF, and sVEGFR1 as therapeutic targets/agents. While the VEGF system models presented in this chapter were limited to the representation of two ligand isoforms (VEGF121 and VEGF165) and three receptors (VEGFR1, VEGFR2, NRP1), these model frameworks can be readily extended to include other VEGF ligand isoforms and receptors (e.g., VEGFR3, NRP2). Furthermore, similar computational modeling techniques are applicable and have contributed to the study of other growth factor systems, including those of FGF (Filion and Popel, 2004, 2005; Forsten et al., 2000) and EGF (Wiley et al., 2003).
ACKNOWLEDGMENTS This work was supported by NIH grants R01 HL079653, R33 HL0877351, and R01 CA138264.
REFERENCES Andrae, J., Gallini, R., and Betsholtz, C. (2008). Role of platelet-derived growth factors in physiology and medicine. Genes Dev. 22, 1276–1312. Augustin, H. G., Koh, G. Y., Thurston, G., and Alitalo, K. (2009). Control of vascular morphogenesis and homeostasis through the angiopoietin-tie system. Nat. Rev. Mol. Cell. Biol. 10, 165–177. Autiero, M., Waltenberger, J., Communi, D., Kranz, A., Moons, L., Lambrechts, D., Kroll, J., Plaisance, S., De Mol, M., Bono, F., Kliche, S., Fellbrich, G., et al. (2003). Role of PIGF in the intra- and intermolecular cross talk between the VEGF receptors Flt1 and Flk1. Nat. Med. 9, 936–943. Bao, P., Kodra, A., Tomic-Canic, M., Golinko, M. S., Ehrlich, H. P., and Brem, H. (2009). The role of vascular endothelial growth factor in wound healing. J. Surg. Res. 153, 347–358. Beenken, A., and Mohammadi, M. (2009). The FGF family: Biology, pathophysiology and therapy. Nat. Rev. Drug. Discov. 8, 235–253. Bogaert, E., Van Damme, P., Poesen, K., Dhondt, J., Hersmus, N., Kiraly, D., Scheveneels, W., Robberecht, W., and Van Den Bosch, L. (2009). VEGF protects motor neurons against excitotoxicity by upregulation of GluR2. Neurobiol. Aging [Epub ahead of print] Pubmed ID: 19185395. Brown, M. D., and Hudlicka, O. (2003). Modulation of physiological angiogenesis in skeletal muscle by mechanical forces: Involvement of VEGF and metalloproteinases. Angiogenesis 6, 1–14. Cao, Y. (2009). Positive and negative modulation of angiogenesis by VEGFR1 ligands. Sci. Signal. 2, re1. Collinson, D. J., and Donnelly, R. (2004). Therapeutic angiogenesis in peripheral arterial disease: Can biotechnology produce an effective collateral circulation? Eur. J. Vasc. Endovasc. Surg. 28, 9–23.
Ebos, J. M., Bocci, G., Man, S., Thorpe, P. E., Hicklin, D. J., Zhou, D., Jia, X., and Kerbel, R. S. (2004). A naturally occurring soluble form of vascular endothelial growth factor receptor 2 detected in mouse and human plasma. Mol. Cancer Res. 2, 315–326. Feng, D., Nagy, J. A., Dvorak, H. F., and Dvorak, A. M. (2002). Ultrastructural studies define soluble macromolecular, particulate, and cellular transendothelial cell pathways in venules, lymphatic vessels, and tumor-associated microvessels in man and animals. Microsc. Res. Tech. 57, 289–326. Ferrara, N., and Davis-Smyth, T. (1997). The biology of vascular endothelial growth factor. Endocr. Rev. 18, 4–25. Filion, R. J., and Popel, A. S. (2004). A reaction-diffusion model of basic fibroblast growth factor interactions with cell surface receptors. Ann. Biomed. Eng. 32, 645–663. Filion, R. J., and Popel, A. S. (2005). Intracoronary administration of FGF-2: A computational model of myocardial deposition and retention. Am. J. Physiol. Heart Circ. Physiol. 288, 263–279. Forsten, K. E., Fannon, M., and Nugent, M. A. (2000). Potential mechanisms for the regulation of growth factor binding by heparin. J. Theor. Biol. 205, 215–230. Forsythe, J. A., Jiang, B. H., Iyer, N. V., Agani, F., Leung, S. W., Koos, R. D., and Semenza, G. L. (1996). Activation of vascular endothelial growth factor gene transcription by hypoxia-inducible factor 1. Mol. Cell. Biol. 16, 4604–4613. Fu, B. M., and Shen, S. (2003). Structural mechanisms of acute VEGF effect on microvessel permeability. Am. J. Physiol. Heart Circ. Physiol. 284, 2124–2135. Gagnon, M. L., Bielenberg, D. R., Gechtman, Z., Miao, H. Q., Takashima, S., Soker, S., and Klagsbrun, M. (2000). Identification of a natural soluble neuropilin-1 that binds vascular endothelial growth factor: In vivo expression and antitumor activity. Proc. Natl. Acad. Sci. USA 97, 2573–2578. Garlick, D. G., and Renkin, E. M. (1970). Transport of large molecules from plasma to interstitial fluid and lymph in dogs. Am. J. Physiol. 219, 1595–1605. Gerber, H. P., Malik, A. K., Solar, G. P., Sherman, D., Liang, X. H., Meng, G., Hong, K., Marsters, J. C., and Ferrara, N. (2002). VEGF regulates haematopoietic stem cell survival by an internal autocrine loop mechanism. Nature 417, 954–958. Girling, J. E., and Rogers, P. A. (2005). Recent advances in endometrial angiogenesis research. Angiogenesis 8, 89–99. Goldman, D., and Popel, A. S. (2000). A computational study of the effect of capillary network anastomoses and tortuosity on oxygen transport. J. Theor. Biol. 206, 181–194. Gordon, M. S., Margolin, K., Talpaz, M., Sledge, G. W. Jr., Holmgren, E., Benjamin, R., Stalter, S., Shak, S., and Adelman, D. (2001). Phase I safety and pharmacokinetic study of recombinant human anti-vascular endothelial growth factor in patients with advanced cancer. J. Clin. Oncol. 19, 843–850. Gschwind, A., Fischer, O. M., and Ullrich, A. (2004). The discovery of receptor tyrosine kinases: Targets for cancer therapy. Nat. Rev. Cancer 4, 361–370. Haigh, J. J. (2008). Role of VEGF in organogenesis. Organogenesis 4, 247–256. Harper, S. J., and Bates, D. O. (2008). VEGF-A splicing: The key to anti-angiogenic therapeutics? Nat. Rev. Cancer 8, 880–887. Ji, J. W., Tsoukias, N. M., Goldman, D., and Popel, A. S. (2006). A computational model of oxygen transport in skeletal muscle for sprouting and splitting modes of angiogenesis. J. Theor. Biol. 241, 94–108. Ji, J. W., Mac Gabhann, F., and Popel, A. S. (2007).
Skeletal muscle VEGF gradients in peripheral arterial disease: Simulations of rest and exercise. Am. J. Physiol. Heart Circ. Physiol. 293, H3740–H3749. Jiang, B. H., Semenza, G. L., Bauer, C., and Marti, H. H. (1996). Hypoxia-inducible factor 1 levels vary exponentially over a physiologically relevant range of O2 tension. Am. J. Physiol. 271, C1172–C1180.
Kerbel, R. S. (2008). Tumor angiogenesis. N. Engl. J. Med. 358, 2039–2049. Kut, C., Mac Gabhann, F., and Popel, A. S. (2007). Where is VEGF in the body? A metaanalysis of VEGF distribution in cancer. Br. J. Cancer 97, 978–985. Lauffenburger, D. A., and Linderman, J. L. (1993). Receptors: Models for Binding, Trafficking, and Signaling. Oxford University Press, New York. Lee, S., Jilani, S. M., Nikolova, G. V., Carpizo, D., and Iruela-Arispe, M. L. (2005). Processing of VEGF-A by matrix metalloproteinases regulates bioavailability and vascular patterning in tumors. J. Cell. Biol. 169, 681–691. Lee, S., Chen, T. T., Barber, C. L., Jordan, M. C., Murdock, J., Desai, S., Ferrara, N., Nagy, A., Roos, K. P., and Iruela-Arispe, M. L. (2007a). Autocrine VEGF signaling is required for vascular homeostasis. Cell 130, 691–703. Lee, T., Seng, S., Sekine, M., Hinton, C., Fu, Y., Avraham, H. K., and Avraham, S. (2007b). Vascular endothelial growth factor mediates intracrine survival in human breast carcinoma cells through internally expressed VEGFR1/FLT1. PLoS Med. 4. Lloyd, P. G., Prior, B. M., Yang, H. T., and Terjung, R. L. (2003). Angiogenic growth factor expression in rat skeletal muscle in response to exercise training. Am. J. Physiol. Heart Circ. Physiol. 284, H1668–H1678. Lodish, H., Berk, A., Matsudaira, P., Kaiser, C. A., Krieger, M., Scott, M. P., Zipursky, S. L., and Darnell, J. (2004). Molecular Cell Biology. W.H. Freeman & Co., New York. Mac Gabhann, F., and Popel, A. S. (2004). Model of competitive binding of vascular endothelial growth factor and placental growth factor to VEGF receptors on endothelial cells. Am. J. Physiol. Heart Circ. Physiol. 286, H153–H164. Mac Gabhann, F., and Popel, A. S. (2005). Differential binding of VEGF isoforms to VEGF receptor 2 in the presence of neuropilin-1: A computational model. Am. J. Physiol. Heart Circ. Physiol. 288, H2851–H2860. Mac Gabhann, F., and Popel, A. S. (2006). Targeting neuropilin-1 to inhibit VEGF signaling in cancer: Comparison of therapeutic approaches. PLoS Comput. Biol. 2, e180. Mac Gabhann, F., and Popel, A. S. (2008). Systems biology of vascular endothelial growth factors. Microcirculation 15, 715–738. Mac Gabhann, F., Ji, J. W., and Popel, A. S. (2006). Computational model of vascular endothelial growth factor spatial distribution in muscle and pro-angiogenic cell therapy. PLoS Comput. Biol. 2, e127. Mac Gabhann, F., Ji, J. W., and Popel, A. S. (2007a). Multi-scale computational models of pro-angiogenic treatments in peripheral arterial disease. Ann. Biomed. Eng. 35, 982–994. Mac Gabhann, F., Ji, J. W., and Popel, A. S. (2007b). VEGF gradients, receptor activation, and sprout guidance in resting and exercising skeletal muscle. J. Appl. Physiol. 102, 722–734. Maharaj, A. S., and D’Amore, P. A. (2007). Roles for VEGF in the adult. Microvasc. Res. 74, 100–113. Martin, D., Galisteo, R., and Gutkind, J. S. (2009). CXCL8/IL8 stimulates vascular endothelial growth factor (VEGF) expression and the autocrine activation of VEGFR2 in endothelial cells by activating NFkappaB through the CBM (Carma3/Bcl10/Malt1) complex. J. Biol. Chem. 284, 6038–6042. Mazitschek, R., and Giannis, A. (2004). Inhibitors of angiogenesis and cancer-related receptor tyrosine kinases. Curr. Opin. Chem. Biol. 8, 432–441. Pollak, M. (2008). Insulin and insulin-like growth factor signalling in Neoplasia. Nat. Rev. Cancer 8, 915–928. Presta, L. G., Chen, H., O’Connor, S. J., Chisholm, V., Meng, Y. G., Krummen, L., Winkler, M., and Ferrara, N. (1997). 
Humanization of an anti-vascular endothelial growth factor monoclonal antibody for the therapy of solid tumors and other disorders. Cancer Res. 57, 4593–4599.
Modeling Growth Factor-Receptor Systems
497
Pries, A. R., and Secomb, T. W. (2005). Microvascular blood viscosity in vivo and the endothelial surface layer. Am. J. Physiol. Heart Circ. Physiol. 289, H2657–H2664. Qutub, A. A., and Popel, A. S. (2006). A computational model of intracellular oxygen sensing by hypoxia-inducible factor HIF1 Alpha. J. Cell. Sci. 119, 3467–3480. Qutub, A. A., Mac Gabhann, F., Karagiannis, E. D., Vempati, P., and Popel, A. S. (2009). Multiscale models of angiogenesis. IEEE Eng. Med. Biol. Mag. 28, 14–31. Roy, H., Bhardwaj, S., and Yla-Herttuala, S. (2006). Biology of vascular endothelial growth factors. FEBS Lett. 580, 2879–2887. Segerstrom, L., Fuchs, D., Backman, U., Holmquist, K., Christofferson, R., and Azarbayjani, F. (2006). The anti-VEGF antibody bevacizumab potently reduces the growth rate of high-risk neuroblastoma xenografts. Pediatr. Res. 60, 576–581. Sela, S., Itin, A., Natanson-Yaron, S., Greenfield, C., Goldman-Wohl, D., Yagel, S., and Keshet, E. (2008). A novel human-specific soluble vascular endothelial growth factor receptor 1: Cell-type-specific splicing and implications to vascular endothelial growth factor homeostasis and preeclampsia. Circ. Res. 102, 1566–1574. Simons, M. (2004). Integrative signaling in angiogenesis. Mol. Cell. Biochem. 264, 99–102. Stefanini, M. O., Wu, F. T. H., Mac Gabhann, F., and Popel, A. S. (2008). A compartment model of VEGF distribution in blood, healthy and diseased tissues. BMC Syst. Biol. 2, 77. Tang, K., Breen, E. C., Wagner, H., Brutsaert, T. D., Gassmann, M., and Wagner, P. D. (2004). HIF and VEGF relationships in response to hypoxia and sciatic nerve stimulation in rat gastrocnemius. Respir. Physiol. Neurobiol. 144, 71–80. Truskey, G. A., Yuan, F., and Katz, D. F. (2004). Porosity, tortuosity, and available volume fraction. Transport Phenomena in Biological Systems, Pearson Prentice Hall, NJ, pp. 389–398. Venturoli, D., and Rippe, B. (2005). Ficoll and dextran vs. globular proteins as probes for testing glomerular permselectivity: Effects of molecular size, shape, charge, and deformability. Am. J. Physiol. Renal Physiol. 288, 605–613. Verheul, H. M., Lolkema, M. P., Qian, D. Z., Hilkes, Y. H., Liapi, E., Akkerman, J. W., Pili, R., and Voest, E. E. (2007). Platelets take up the monoclonal antibody bevacizumab. Clin. Cancer Res. 13, 5341–5347. Wijelath, E. S., Rahman, S., Namekata, M., Murray, J., Nishimura, T., Mostafavi-Pour, Z., Patel, Y., Suda, Y., Humphries, M. J., and Sobel, M. (2006). Heparin-II domain of fibronectin is a vascular endothelial growth factor-binding domain: Enhancement of VEGF biological activity by a singular growth factor/matrix protein synergism. Circ. Res. 99, 853–860. Wiley, H. S., Shvartsman, S. Y., and Lauffenburger, D. A. (2003). Computational modeling of the EGF-receptor system: A paradigm for systems biology. Trends Cell. Biol. 13, 43–50. Willett, C. G., Boucher, Y., Duda, D. G., di Tomaso, E., Munn, L. L., Tong, R. T., Kozin, S. V., Petit, L., Jain, R. K., Chung, D. C., Sahani, D. V., Kalva, S. P., et al. (2005). Surrogate markers for antiangiogenic therapy and dose-limiting toxicities for bevacizumab with radiation and chemotherapy: Continued experience of a phase I trial in rectal cancer patients. J. Clin. Oncol. 23, 8136–8139. Wu, F. T., Stefanini, M. O., Mac Gabhann, F., Kontos, C. D., Annex, B. H., and Popel, A. S. (2009a). A computational kinetic model of VEGF trapping by soluble VEGF receptor-1: Effects of transendothelial and lymphatic macromolecular transport. Physiol. Genomics 38, 29–41. Wu, F. T., Stefanini, M. 
O., Mac Gabhann, F., and Popel, A. S. (2009b). A compartment model of VEGF distribution in humans in the presence of soluble VEGF receptor-1 acting as a ligand trap. PLoS ONE 4, e5108. Yuan, F., Dellian, M., Fukumura, D., Leunig, M., Berk, D. A., Torchilin, V. P., and Jain, R. K. (1995). Vascular permeability in a human tumor xenograft: Molecular size dependence and cutoff size. Cancer Res. 55, 3752–3756.
CHAPTER NINETEEN
The Least-Squares Analysis of Data from Binding and Enzyme Kinetics Studies: Weights, Bias, and Confidence Intervals in Usual and Unusual Situations

Joel Tellinghuisen

Contents
1. Introduction
2. Least Squares Review
2.1. Standard linear and nonlinear least squares
2.2. Multiple uncertain variables: Deming's treatment
2.3. Uncertainty in functions of uncertain quantities: Error propagation
3. Statistics of Reciprocals
3.1. A simple Monte Carlo experiment
3.2. Implications—The 10% rule of thumb
3.3. Application to binding and kinetics data
4. Weights When y is a True Dependent Variable
4.1. Constant σ_y
4.2. Illustrations for perfectly fitting data
4.3. Real data example
4.4. Monte Carlo simulations
5. Unusual Weighting: When x is the Dependent Variable
5.1. Effective variance treatment
5.2. Checking the results with exactly fitting data
5.3. The unique answer
6. Assessing Data Uncertainty: Variance Function Estimation
7. Conclusion
References
Department of Chemistry, Vanderbilt University, Nashville, Tennessee, USA
Methods in Enzymology, Volume 467, ISSN 0076-6879, DOI: 10.1016/S0076-6879(09)67019-1
© 2009 Elsevier Inc. All rights reserved.
Abstract

The rectangular hyperbola, y = abx/(1 + bx), is widely used as a fit model in the analysis of data from studies of binding, sorption, enzyme kinetics, and fluorescence quenching. The choice of this or one of its linearized versions—the double-reciprocal, y-reciprocal, or x-reciprocal—in unweighted least squares implies different assumptions about the error structure of the data. The rules of error propagation are reviewed and used to derive weighting expressions for application in weighted least squares, in the usual case where y is correctly considered the dependent variable, and in the less common situations where x is the true dependent variable, in violation of one of the fundamental premises of most least-squares methods. The latter case is handled through an effective variance treatment and through a least-squares method that treats any or all of the variables as uncertain. The weighting expressions for the linearized versions of the fit model are verified by computing the parameter standard errors for exactly fitting data. Consistent weightings yield identical standard errors in this exercise, as is demonstrated with a common data analysis program. The statistical properties of linear and nonlinear estimators of the parameters are examined with reference to the properties of reciprocals of normal variates. Monte Carlo simulations confirm that the least-squares methods yield negligible bias and trustworthy confidence limits for the parameters as long as their percent standard errors are less than 10%. Correct weights being the key to optimal analysis in all cases, methods for estimating variance functions by least-squares analysis of replicate data are reviewed briefly.
1. Introduction

One of the simplest and most frequently encountered nonlinear relations in physical data analysis is the rectangular hyperbola, expressible in a number of ways, including

y = abx/(1 + bx).   (19.1)

This mathematical form occurs in the analysis of data obtained in studies of enzyme kinetics (Askelof et al., 1976; Cleland, 1967; Cornish-Bowden and Eisenthal, 1974; Dowd and Riggs, 1965; Mannervik, 1982; Ritchie and Prvan, 1996; Wilkinson, 1961), binding and complexation (Bowser and Chen, 1998; Feldman, 1972; Johnson, 1985; Munson and Rodbard, 1980), sorption (Barrow, 1978; Bolster, 2008; Kinniburgh, 1986), and fluorescence quenching (Eftink and Ghiron, 1981; Laws and Contino, 1992). From the earliest encounter with this relation, workers recognized that it could be rewritten to facilitate analysis with straight-line graphical plots (Langmuir, 1918), and the various linearized forms earned naming status for
their proposers (Connors, 1987). Thus, the version of Michaelis–Menten enzyme kinetics expressed as

1/y = 1/(abx) + 1/a = CX + A   (19.2)

became the Lineweaver–Burk equation, while equivalent forms became the Benesi–Hildebrand equation of complexation and the Stern–Volmer relation of fluorescence quenching. Multiplying through by x got Hanes and Woolf their version of enzyme kinetics,

x/y = 1/(ab) + x/a = C + Ax,   (19.3)

while expressions with y on both sides of the equation, like

y/x = ab − by,   (19.4)

earned Scatchard and Eadie and Hofstee their places on the linearization marquee. Following Connors (1987) and others, I will call these the double-reciprocal, the y-reciprocal, and the x-reciprocal forms, respectively; and I will use upper-case letters to denote reciprocals, as done above.

Almost from the time these linearizations were proposed, it was clear that their use in quantitative analysis by the method of least squares required attention to weighting (Lineweaver et al., 1934). This problem has attracted much attention, as is clear from the titles of most of the references already cited. Yet, as was lamented by Kinniburgh (1986) over two decades ago, "Although much well-founded criticism of the various linearized forms of the Langmuir isotherm has appeared in the environmental chemistry literature, the lessons to be learned seem to go largely unheeded." Praising the virtues of nonlinear least squares (NLLS), he continued, "The benefit of the NLLS approach, when properly weighted, is subtle and not to be seen in statistics such as R²; rather the benefit is in the assurance that the best parameter estimates have been obtained." Kinniburgh showed that many of the claims in favor of or against the various linearized forms of the Langmuir isotherm were incorrectly based on tacit assumptions about the data error structures embodied in the use of unweighted LS (ULS) with these forms. Writing a decade later, Ritchie and Prvan (1996) noted, "By the time least squares linear regression methods had become readily available on calculators and computers, the need for appropriate weighting had been largely forgotten." They also pointed out that it was not fair to Lineweaver and Burk to use their name for ULS analysis with Eq. (19.2), because LB actually used weighted LS (WLS), working in collaboration with the statistician Deming (Lineweaver et al., 1934). The main conclusions of the works by Kinniburgh and by Ritchie and Prvan, and of many of the other writers already cited, might be summarized as follows: in the quest for optimal LS
analysis of rectangularly hyperbolic data, the weighting of the data is far more important than the choice of fit relation. Sadly, their comments about the lessons going largely unheeded remain current, as works purporting to test the various fitting representations but doing so incorrectly for well-known reasons continue to be published.

Kinniburgh (1986) also addressed a special problem in most sorption work and in many studies of binding and complexation: The quantity normally measured—the equilibrium concentration [L] of ligand or sorbate—is also the independent variable of the common fit models. This violates one of the fundamental assumptions behind most LS methods, that the independent variable be error-free. One way of handling this problem, recognized long ago (Barrow, 1978; Feldman, 1972; Meinert and McHugh, 1968; Munson and Rodbard, 1980) but still not widely used (Bolster, 2008), is to express the measured quantity [L] in terms of the total sorbate or ligand concentration Lt, which is normally much more precisely determined and hence more suitable as independent variable in the fit model. Recently, we (Tellinghuisen and Bolster, 2009a) have examined these dependencies in detail and have used an effective variance treatment (Barker and Diana, 1974; Clutton-Brock, 1967; Orear, 1982) to provide weighting expressions that are statistically valid for all of the common forms of Eq. (19.1), for uncertainty in both [L] and Lt. In this work, we also noted that there is one NLLS algorithm that yields identical results for all ways of expressing the relation among the model parameters and the variables, any number of which can be considered uncertain. This is an algorithm based on Deming's (1964) treatment and implemented in iterative form as early as 1972 (Britt and Luecke, 1973; Jefferys, 1980; Lybanon, 1984; Powell and Macdonald, 1972). Since this algorithm gives results that are independent of the choice of fit relation, it becomes clear that the only user inputs that can affect the parameter values for a given set of ([L], Lt) values are the data weights, or equivalently the assessed uncertainties in [L] and Lt.

One important topic still rarely addressed is how to obtain the weights for the data. In linear LS (LLS), it is rigorously true that minimum-variance (hence most precise) estimates of the model parameters are obtained if and only if the data are assigned weights inversely proportional to their variances, w_i ∝ σ_i^−2. It is not possible to make such an assertion for NLLS (Di Cera, 1992), in part because many nonlinear estimators do not even have finite variance (Shukla, 1972). Nonetheless, the parameter estimates from NLLS are generally reasonable; and Monte Carlo studies have shown that the precision estimates can be reliable in establishing confidence limits (Tellinghuisen, 2000a). Further, there appears to be no general prescription for achieving the narrowest confidence limits that works better than the LLS prescription, w_i ∝ σ_i^−2. I make that assumption here.

In subsequent sections, I first briefly review the fundamental LS relations relevant to the present work. I then address the question: How reliable is NLLS in estimating the constants a and b in Eqs. (19.1)–(19.4) and their
confidence limits, in the normal situation where y is the single error-prone (dependent) variable? The answer to this question is aided by considerations of the statistics of reciprocals of normal variates. I test these predictions through Monte Carlo simulations for selected conditions. I then address the situation where x is the directly measured quantity and y is obtained from x through a simple computation. The application of effective variance methods to this problem requires care, because y is fully correlated with x. I close with a discussion of variance function (VF) analysis for obtaining the data weights.

Why the just expressed concern with reciprocals? In estimating the constant K (b) that characterizes the kinetics or binding, we have the choice of estimating this or its inverse (e.g., the dissociation constant Kd). If one of these is a normal variate, the other does not even have finite variance; and when it has relatively large nominal uncertainty, the reciprocal variate can be significantly biased, with very asymmetric confidence limits (Tellinghuisen, 2000a,b). It is easier to avoid such complications if we know which—K or Kd—is the more normal variate. On the other hand, if they are sufficiently precise, the question becomes irrelevant; so predicting their precision becomes equally important. And that in turn requires information about the data error.

Frequently one encounters statements like "If the data are precise and the fit model is correct, it doesn't matter which of Eqs. (19.1)–(19.4) you use, because all will return nearly identical estimates of the parameters." But it is not just the parameters that are of interest in a LS analysis: At least as important are their uncertainties (Johnson and Faunt, 1992), since after all, one can express all results as 1 ± 1 (times a power of 10). And fitting precise data to these equations without attention to weighting will surely return different parameter uncertainties. Below I will illustrate how one can obtain reliable parameter error estimates by fitting exact data with assumed data error, using the a priori covariance matrix V_prior. In fact, it is not necessary to actually work with this matrix, because many data analysis programs either have an option permitting this choice or make it by default. The KaleidaGraph (Synergy Software) program is in the latter category (Tellinghuisen, 2000c). I have found it both valuable and instructive in predicting parameter precisions from exactly fitting data, including demonstrating that proper weighting yields identical nominal parameter standard errors (SEs) for all of Eqs. (19.1)–(19.4). I will demonstrate its use in this manner below.
2. Least Squares Review

2.1. Standard linear and nonlinear least squares

The theory and procedures of linear and nonlinear least-squares fitting methods have been covered in several of the already cited works (Connors, 1987; Johnson and Faunt, 1992; Tellinghuisen, 2000a) and are
readily available elsewhere (Bevington, 1969; Press et al., 1986), including in two earlier contributions from me in this series (Tellinghuisen, 2004, 2009a). Here, I emphasize a few important points, for which I will use the same notation as in the last two cited works.
Minimum-variance estimation of the adjustable parameters requires that the data be weighted inversely as their variances,

w_i ∝ σ_i^−2.   (19.5)
As already noted, this is rigorously true for LLS, and I assume it for NLLS.
The variances for the estimated parameters are the diagonal elements of the variance–covariance matrix, of which we distinguish two versions, V_prior and V_post, both proportional to

A^−1 = (X^T W X)^−1,   (19.6)
where the design matrix X (also called the Jacobian matrix) is as given in the earlier works and the weight matrix W is diagonal, with elements W_ii = w_i. If the data variances are known a priori, use of w_i = σ_i^−2 yields V_prior = A^−1. V_prior is exact for LLS and exact in the limit of small data error for NLLS. If the data are normally distributed, the estimated parameters will be normally distributed for LLS, and normal in the small-data-error limit for NLLS. In LLS, with x representing the independent variable and y the dependent, V_prior depends only on the x-structure and the error structure of the data; in NLLS it may depend also on the values of the y_i and the fit parameters. The LS solution minimizes S = Σ w_i δ_i², where δ_i is the fit residual in the dependent variable y. If the weights are taken as w_i = σ_i^−2, S follows the χ² distribution, which has expectation value ν and variance 2ν, where ν is the number of statistical degrees of freedom. Equivalently, S/ν follows the reduced χ² distribution, with mean 1 and variance 2/ν. These properties require data with normally distributed error and a fit to a true model. If the data variances are not known absolutely, parameter variance estimates are obtained from V_post = s_y² A^−1, where s_y² is the estimated variance for data of unit weight, calculated from the fit residuals using
s_y² = S/ν.   (19.7)

In LLS, the parameter variances are now estimated quantities having the statistical properties of χ². In NLLS, the variability inherent in V_prior makes V_post-based estimates even more variable.
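As a concrete illustration of the V_prior/V_post distinction, the following minimal numpy sketch (my own, not from the chapter; the data values are invented) builds the weighted normal equations of Eq. (19.6) for a straight-line fit and applies the Eq. (19.7) scaling:

import numpy as np

x = np.array([0.1, 0.2, 0.5, 1.0, 2.0])       # independent variable
y = np.array([0.55, 0.62, 0.74, 1.01, 1.52])  # measured responses (invented)
sig = 0.03 * np.ones_like(y)                  # assumed-known sigma_y

X = np.column_stack([np.ones_like(x), x])     # design matrix: intercept, slope
W = np.diag(sig**-2)                          # weights w_i = sigma_i**-2, Eq. (19.5)

A = X.T @ W @ X                               # Eq. (19.6)
V_prior = np.linalg.inv(A)
beta = V_prior @ (X.T @ W @ y)                # WLS parameter estimates

d = y - X @ beta
S = d @ W @ d                                 # chi-square; expectation value nu
nu = len(y) - len(beta)
V_post = (S / nu) * V_prior                   # Eq. (19.7) scaling

print("prior SEs:    ", np.sqrt(np.diag(V_prior)))
print("posterior SEs:", np.sqrt(np.diag(V_post)))

The prior SEs depend only on the x-structure and the assumed σ values; the posterior SEs additionally inherit the χ² sampling variability of S discussed above.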
The consequences of violating the prescription of Eq. (19.5) include the obvious one—that the estimated parameters will not be of best precision—and a less recognized but arguably more important one: V-based parameter standard-error estimates are unreliable.
2.2. Multiple uncertain variables: Deming's treatment

All of the foregoing applies to most LS methods, in which a single variable is taken to be uncertain. Deming's (1964; original version 1938) treatment makes no such distinction between independent (hence error-free) and dependent variables. In his approach, the minimization target is

S = Σ (w_xi δ_xi² + w_yi δ_yi² + ...),   (19.8)

where the sum runs over all uncertain variables, with each residual being the difference between measured and adjusted value, for example, δ_xi = x_adj,i − x_i. The iterative implementations of Deming's approach facilitate convergence on the minimum S. If the weights are again taken as the inverse variances in x, y, ..., the resulting V_prior is again exact in the small-error limit. From the definition of S in Eq. (19.8), it is clear that the results must be independent of the manner in which the fit model is expressed. This includes situations where x is the single uncertain variable. By contrast, properly weighted fits to different versions of a given response function, like Eqs. (19.1)–(19.4), yield statistically equivalent but not numerically identical results. The very form of Eq. (19.8) directs the user's attention to the weights, which must all be correct within a common scale factor to achieve the minimum-variance results. Thus, with just x and y uncertain, the w_xi and w_yi must correctly reflect the relative uncertainties of all x_i and y_i. This is not done in, for example, methods that minimize "perpendicular" distances from the measured points to the calculated curve (Schulthess and Dey, 1996; Valsami et al., 2000). In such methods, results change with scale changes in the axes, requiring "axis conversion factors," which are ill-defined and fail to recognize that even the perpendicular distances should be shorter for very precise points than for imprecise ones. It follows that such methods are not invariant with respect to changes in the representation of the fit relationship.
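For readers who wish to experiment, scipy's odr module (a wrapper for the ODRPACK library) minimizes a weighted sum of squared residuals in all uncertain variables, in the same spirit as Eq. (19.8). The sketch below, with invented data near a = 2, b = 1, shows the setup for Eq. (19.1); it is offered as an accessible stand-in, not as the iterative Deming implementations cited above:

import numpy as np
from scipy import odr

def hyperbola(beta, x):                       # Eq. (19.1)
    a, b = beta
    return a * b * x / (1.0 + b * x)

x = np.array([0.5, 3.0, 5.0, 7.5, 10.0])
y = np.array([0.67, 1.50, 1.67, 1.76, 1.82])  # illustrative (invented) values
data = odr.RealData(x, y, sx=0.04, sy=0.08)   # uncertainties in BOTH variables
out = odr.ODR(data, odr.Model(hyperbola), beta0=[2.0, 1.0]).run()
print("a, b:", out.beta)
print("SEs: ", out.sd_beta)                   # note: scaled by the residual variance

Because the criterion involves only the variables and their assessed uncertainties, the result does not depend on which algebraic rearrangement of the model one writes down, which is exactly the invariance property emphasized above.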
2.3. Uncertainty in functions of uncertain quantities: Error propagation

To calculate the uncertainty σ_f in some function f of uncertain quantities, we use error propagation. Taking the uncertain quantities as elements of the vector b,
σ_f² = g^T V g,   (19.9)
in which the jth element of the vector g is ∂f/∂b_j. This expression is rigorously correct for functions f that are linear in variables b_j that are themselves normal variates (Tellinghuisen, 2001). If the b_j are independent, the covariance matrix V will not have off-diagonal elements, and one has the more familiar expression,

σ_f² = Σ_j (∂f/∂b_j)² σ_bj².   (19.10)

The simplest application of Eqs. (19.9) and (19.10) is to functions of a single uncertain variable, giving

σ_f = |df/db| σ_b.   (19.11)
On the other hand, if the b_j are the parameters from a least-squares fit, they are usually correlated, requiring the use of Eq. (19.9). The foregoing has direct application to the linearized versions, Eqs. (19.2)–(19.4), of the Langmuir relation. Thus, with minor provisos on the data and proper weighting, one can use LLS to obtain statistically reliable estimates of A and C from Eq. (19.2) or (19.3), and hence satisfactory results for a (= 1/A), b (= A/C), and σ_a [from Eq. (19.11), σ_a/a = σ_A/A]. However, A and C are typically correlated, so Eq. (19.9) is needed to obtain σ_b. All three "linear" forms are in fact nonlinear in a and b. Accordingly, when analyzed with NLLS, they all yield directly the desired estimates of a and b and their SEs. [Equation (19.4), with y on both sides, has special problems that will be addressed with the effective variance method below.]
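A short numerical sketch of Eq. (19.9) for σ_b with b = A/C. The values of A, C, and their SEs are those of the exact-fit exercise of Section 4.2 (Fig. 19.4); the covariance is my assumption, chosen only so that σ_b comes out near the correct 0.2045 found there, to show how correlation changes the result relative to Eq. (19.10):

import numpy as np

A, C = 0.5, 0.5
sA, sC = 0.02065, 0.08590
covAC = -1.325e-3                      # ASSUMED value (not from the chapter)

V = np.array([[sA**2, covAC],
              [covAC, sC**2]])
g = np.array([1.0 / C, -A / C**2])     # gradient of f = A/C
sigma_b = np.sqrt(g @ V @ g)           # Eq. (19.9): ~0.204 with this covariance

# Neglecting the covariance reproduces the 0.1767 of Eq. (19.17) instead:
sigma_b_indep = (A / C) * np.hypot(sA / A, sC / C)
print(sigma_b, sigma_b_indep)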
3. Statistics of Reciprocals

3.1. A simple Monte Carlo experiment

Here I will use the KaleidaGraph (KG) program to illustrate some properties of random variates and their sums, and then to examine the statistics of reciprocals of normal variates. I strongly believe that serious readers not already acquainted with this or a similar program designed for scientific data analysis and presentation (Origin, Igor, SigmaPlot) will find the time invested in becoming acquainted well spent. However, the present exercise can also be done with Excel (de Levie, 2008). The main points will parallel results illustrated in Table 19.1 and Fig. 19.2 of my "Bias and inconsistency" paper (Tellinghuisen, 2000b) and in my instructional work and its online supplement (Tellinghuisen, 2000c).
Table 19.1 Monte Carlo statistics of a = 1 and its reciprocal A, from 10^5 values with normally distributed random error of specified σ_a (a)

σ_a     ⟨a⟩       s_a (b)    ⟨A⟩       s_A (b)
0.05    0.99991   0.050135   1.00262   0.050621
0.10    0.99982   0.100270   1.01054   0.104363
0.20    0.99964   0.200539   1.04642   0.24286
0.30    0.99946   0.300809   1.13154   2.4916
0.40    0.99928   0.401078   1.05163   71.732
0.40    0.99933   0.399943   1.39158   34.877
0.40    1.00202   0.399634   1.43629   23.782

(a) Same random normal variates used for the first five σ_a values, to illustrate the effects of scaling.
(b) Obtained from sampling statistics, as (⟨a²⟩ − ⟨a⟩²)^(1/2), and analogously for A.
For good sampling statistics, it is desirable to generate very large data sets. To a very good approximation, histogrammed (binned) data follow Poisson statistics, meaning the variance equals the bin count. Thus, bin counts of 10^4 have 1% error (σ = 10²); smaller bin counts have smaller absolute but larger relative error, and vice versa for larger bin counts. I start by expanding the number of rows in the KG data sheet to 10^5. Then, executing from the "Formula Entry" box the statement

c0 = ran() + ran()   (19.12a)
generates a sum of two random, uniform (0 < x < 1) variates in the first column of the data sheet. It is instructive to use the "Bin Data" command at this point to gain visual appreciation: the result is a triangular distribution (Fig. 19.1), peaking at 1. Adding another uniform random variate, c0 = c0 + ran(), generates a distribution that is piecewise quadratic with mean 1.5. Revise to

c0 = c0 + ran() + ran() + ran()   (19.12b)
and execute the "Run" command three times. The resulting sum of 12 random numbers will have a mean close to 6.00 and standard deviation 1.00; and through the beauty of the Central Limit Theorem, the distribution will be very close to the Gaussian, or normal, distribution,

P_G(μ, σ; x) = C_G exp[−(x − μ)²/(2σ²)],   (19.13)

where μ is the mean, σ² the variance, and C_G a normalizing constant. Subtract 6 [c0 = c0 − 6] to produce a column of 10^5 random, normal deviates of mean 0. [While easy, this is not the best method for generating random normal deviates; see Press et al. (1986).]

Figure 19.1 Histogram of results from summing two uniform random deviates, each defined over the range 0 < x < 1.
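The same construction takes a few lines of Python (numpy); this sketch, my own stand-in for the KG recipe, generates 10^5 approximately normal deviates as sums of 12 uniforms, then centers them:

import numpy as np

rng = np.random.default_rng(1)
c0 = rng.random((100_000, 12)).sum(axis=1) - 6.0   # approx N(0, 1) deviates
print(c0.mean(), c0.std())                          # close to 0 and 1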
Now we use this column to add varying amounts of Gaussian error to a constant. For example, c1 = 1 + 0.1*c0 produces in the second column a random variate of mean 1 with normal error of σ = 0.1. Clearly, the statistics of this column or any other produced in the same manner are fully predictable from the statistics of the entries in c0. But the statistics and distributions for the reciprocals of these quantities—for example, c1 = 1/(1 + 0.1*c0), c2 = 1/(1 + 0.2*c0), etc.—are another matter. With increasing σ, we observe progressively increasing positive bias in their means and their standard deviations, and eventually, instability in both (Table 19.1). The reason for the instability is that the distribution of reciprocals has Lorentzian tails, which means it has infinite variance. This violates a prime requirement for sampling under the Central Limit Theorem, meaning sampling cannot be relied upon to yield convergent estimates of the mean and standard deviation. Results for σ_a = 0.4 are illustrated in Fig. 19.2. Qualitatively, the instability arises from the significant probability of getting a 0 in the initial normal distribution. As long as this probability is small, there is only a modest systematic bias in the mean of A (considered as an estimator of a) and in s_A (which should be considered an asymptotic estimator, since formally the variance is infinite). Thus, the bias in ⟨A⟩ is only ≈1% for σ_a = 0.10 [relative standard error (RSE) σ_a/a the same], rising to ≈5% for 20% RSE. By Eq. (19.11), s_A should equal σ_a (same RSE); but it is ≈1% larger for 5% RSE (σ_a = 0.05), with the excess rising sharply thereafter, to ≈4% at 10% RSE, ≈21% at 20% RSE, and >700% for 30% RSE.

Figure 19.2 Histograms of 10^5 values of the normal variate a (μ = 1, σ = 0.4) and its inverse A. The smooth curves are fits to Eq. (19.13) for a and to the derived distribution for A given in Eq. (12) of Tellinghuisen (2000b).
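A sketch of the corresponding reciprocal experiment in Python, reproducing the flavor of Table 19.1 (rerun with different seeds at σ = 0.4 to see the sampling instability in the statistics of A):

import numpy as np

rng = np.random.default_rng(7)
z = rng.standard_normal(100_000)
for sigma in (0.05, 0.10, 0.20, 0.30, 0.40):
    a = 1.0 + sigma * z              # same deviates, rescaled (cf. Table 19.1)
    A = 1.0 / a
    print(f"sigma={sigma:.2f}  mean(A)={A.mean():.5f}  sd(A)={A.std():.5f}")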
3.2. Implications—The 10% rule of thumb

I have examined with Monte Carlo (MC) simulations a number of nonlinear fit models in recent years, and this sort of reciprocal behavior is the most pathological I have seen. It has led me to state a "10% rule of thumb" for when to trust the V-based parameter SEs from NLLS (Tellinghuisen, 2000a): If the parameter's estimated RSE is less than 10%, then this SE is valid within 10% for assessing the confidence limits of the parameter, for which the bias is insignificant. Note from Table 19.1 that at 10% RSE (σ_a = 0.10), the bias in A is less than 10% of its SE, which is only 5% in excess of its predicted value (0.10). Thus, the 10% rule would appear to be conservative. However, as already noted, in NLLS there is variability inherent in V_prior for "real" data, which augments that from reciprocal behavior. Of course, there are many cases where the nonlinear estimator behaves much better than this. On the other hand, most workers still rely upon the V_post-based estimates for their parameter SEs, and these can never be any more reliable than their inherent uncertainty from the statistical properties of χ². This means relative uncertainty (2/ν)^(1/2) in the estimated variances, and by Eq. (19.11), (2ν)^(−1/2) in the estimated SEs. Many binding and kinetics studies employ eight or fewer points in a dataset. This translates into about 30% uncertainty in the V_post-based parameter SEs—an amount that will usually dwarf that from the NLLS method itself. Concerns about reciprocal behavior apply to the data also, which are inverted in the use of Eqs. (19.2) and (19.3) to analyze binding and kinetics data. Although we seldom collect enough data to confirm that they
are normal, it seems more reasonable to assume that the raw data are normal than that their reciprocals are; and the data certainly have finite variance, whereas their reciprocals may not, thus violating one of the requirements for LS. Many instruments average numerous digital conversions of an analog signal, and by the Central Limit Theorem, these (with certain instrumental limitations) should approach normality, just as we observed above in the MC experiment. Counting instruments follow Poisson statistics, and for large numbers of counts, the Poisson distribution approaches the Gaussian. So it does seem more reasonable to attribute normality to the raw data than to their inverses. As before, the proper warning is to avoid inverting data with large relative uncertainty, and the 10% rule is again a good guideline. Although the inverted data remain biased estimators of the original quantities and thus yield biased and even inconsistent estimates of the LS parameters (Tellinghuisen, 2000b), the magnitudes of the biases will typically remain insignificant compared with the parameter SEs, if the data are properly weighted.
3.3. Application to binding and kinetics data

Consider analysis of data using the double-reciprocal linearization, Eq. (19.2). Neglecting the nonnormality and bias of the data from inversion, LLS yields A and C estimates that are normally distributed. With proper weighting these will be minimum-variance estimates, and we can obtain a and b from a = 1/A and b = A/C. If C is precise to better than 10%, we expect b to be well characterized and near-normal in this analysis (Tellinghuisen, 2000b). Similarly, if A is precise, a will be well characterized. Then we also expect B = 1/b (from C/A) to be close to normal, even if C is imprecise. Neither a nor b emerges naturally from these considerations as a normal variate, so we cannot tell which of b and B will be more normal. Note that if we employ NLLS with Eq. (19.2) and fit to a and b, we will obtain identical values as from fitting to A and C, provided the data are weighted the same; and of course we obtain estimates of σ_a and σ_b directly in the nonlinear fit. We will see below that if σ_y is constant, the transformation to 1/y imposes strongly y-dependent weighting in fits to Eq. (19.2). This weighting itself can be a source of enhanced bias and imprecision for noisy data. Alternatively, consider the nonlinear fit to the variation of Eq. (19.1),

y = x/(C + Ax),   (19.14)

which requires the same weighting as fitting to Eq. (19.1)—unweighted for constant σ_y. This was originally proposed as a better way to obtain results from the nonlinear fit (Ratkowsky, 1986), which is a nonexistent problem with today's computational methods. Thus, we would normally prefer the fit to Eq. (19.1), which yields directly the SEs in a and b. However,
the previous considerations about the properties of the six estimators remain valid, and we expect both A and C from an analysis with Eq. (19.14) to be nearly normal variates.
4. Weights When y is a True Dependent Variable

4.1. Constant σ_y

In most experimental ways of studying MM kinetics, complexation, or quenching, x is a controlled variable and y is measured, making the usual identification of independent and dependent variables appropriate (Connors, 1987). For some of these methods, it is also reasonable to take σ_y as constant, especially if measurements are taken over a small dynamic range. In this case, the direct NLLS fit to Eq. (19.1) is the straightforward approach, doable with ULS; if σ_y is thought to be known, taking w_i = σ_y^−2 permits use of V_prior and subsequent use of the χ² test as a check of the fit's reasonableness (Zeng et al., 2008a). Use of Eq. (19.11) yields the well-known weighting expressions for fits to the double-reciprocal and y-reciprocal linearizations. Letting z stand for the transformed dependent variable, σ_z = |dz/dy| σ_y, hence

σ_{1/y} = σ_y/y².   (19.15a)
By assumption x is error-free, so Eq. (19.11) applies for Eq. (19.3), too, giving

σ_{x/y} = σ_y x/y².   (19.15b)
Both results yield weights that vary strongly over a typical dataset, and the question arises, what values of y should be used to obtain numerical values? With precise data, this is of little concern; but for noisy data, Monte Carlo tests have indicated it is better to use the calculated than the observed values. This makes the computation iterative, since the calculated values are not known at the outset. It is noteworthy that consistently weighted fits to Eqs. (19.2) and (19.3) yield identical results. With naive use of ULS with these forms, workers in some fields have come to prefer Eq. (19.3) over Eq. (19.2), perhaps because the factor of x neutralizes some of the y² dependence in the denominator, making the weighting error from ULS less significant with Eq. (19.3). The use of y as both the dependent and the independent variable puts the linearized version of Eq. (19.4) in violation of the basic LS assumptions. However, it can be treated with an effective variance (EV) approach to obtain weights that yield consistent results for the parameter SEs. The idea behind the EV approach is to project the uncertainty in the
"independent" variable into an equivalent uncertainty in the dependent variable. Again error propagation is used to yield σ_eff = |dy/dx| σ_x, and if the two contributions are independent, the variances add, giving σ_y,tot² = σ_eff² + σ_y². However, here the two contributions are not independent; rather, they are fully correlated, because they involve the same variable. Thus, an error ε_y in y produces a direct error ε_y/x in the dependent variable and an indirect error of magnitude (df/dy)ε_y = −bε_y, through its effect on the fit function, f = ab − by. The result of the perfect correlation is to make the two contributions additive in σ, giving

σ_y,tot = σ_y (x^−1 + b).   (19.16)
This result is not readily available in the literature. It is given incorrectly in Eq. (3.25) of Connors (1987), who treats the two contributions as independent, hence adding in quadrature. Bowser and Chen (1999) give it correctly but cite Connors and give no explanation. In using Eqs. (19.15a), (19.15b), and (19.16), one must of course use values of x and y for each of the i = 1, ..., n data points. Accordingly, these equations are already in a form to accommodate any variability in σ_y, by just using σ_yi values for the individual data points.
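The identical-SE property of consistent weighting is easy to verify numerically. The following numpy sketch (mine, anticipating the exact-fit exercise of the next subsection) shows that, for exactly fitting data with the model of Fig. 19.3, the double-reciprocal fit with the weights of Eq. (19.15a) gives the same V_prior-based SEs as the direct fit to Eq. (19.1):

import numpy as np

a, b, sy = 2.0, 1.0, 0.08
x = np.array([0.5, 3.0, 5.0, 7.5, 10.0])
y = a * b * x / (1.0 + b * x)                    # exactly fitting data

def prior_ses(J, sig):
    # V_prior-based SEs: sqrt(diag((J^T W J)^-1)), W = diag(sig**-2)
    A_mat = (J / sig[:, None]**2).T @ J
    return np.sqrt(np.diag(np.linalg.inv(A_mat)))

# Direct fit, Eq. (19.1): Jacobian columns dy/da, dy/db
J1 = np.column_stack([b * x / (1 + b * x), a * x / (1 + b * x)**2])
print(prior_ses(J1, sy * np.ones_like(x)))       # ~[0.083, 0.204]

# Double-reciprocal fit, Eq. (19.2): z = 1/y, sigma_z = sy/y**2 (Eq. 19.15a)
J2 = np.column_stack([-(1 / a**2) * (1 + 1 / (b * x)), -1 / (a * b**2 * x)])
print(prior_ses(J2, sy / y**2))                  # identical, by the chain rule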
4.2. Illustrations for perfectly fitting data

Figures 19.3 and 19.4 illustrate results obtained from exactly fitting data for a 5-point model having an approximately arithmetic structure, with x having a dynamic range of 20 (0.5–10) and y a range of 5, with constant σ_y. Note that the parameter SEs all agree to the fourth significant figure, where the discrepancies result from imprecision in the numerical differentiation used by the KG program. Also, the χ² values ("Chisq") are all 0, as expected for exactly fitting data; thus the V_post-based SEs would also all be 0, which is of course meaningless for the present exercise. Figure 19.4 includes results for fitting the double-reciprocal data to the second form of Eq. (19.2). It is easy to verify that σ_a/a = σ_A/A, as predicted by Eq. (19.11) for A = 1/a. On the other hand, application of Eq. (19.10) to b = A/C yields

σ_b/b = [(σ_A/A)² + (σ_C/C)²]^(1/2) = 0.1767,   (19.17)
which is about 14% smaller than the correct value, the difference being due to the correlation between A and C, omitted from Eqs. (19.10) and (19.17). To achieve the results illustrated in Figs. 19.3 and 19.4 using the KG program, one first plots the data by selecting independent ("x") and dependent ("y") variables. When the "General" fit menu is then opened under "Curve Fit," the user selects a fit name (or adds it if necessary) and then clicks in the "Define..." box to enter the fit relation. Clicking the "Weight Data"
Figure 19.3 Five-point model having an approximately arithmetic structure (x = 0.5, 3, 5, 7.5, 10), with a = 2, b = 1, and σ_y = 0.08, displayed and fitted in accord with Eq. (19.1) (left ordinate scale) and Eq. (19.3) (right). The KaleidaGraph "General" NLLS routine is used to obtain the results presented in the fit results boxes, where "Error" is the V_prior-based standard error. The entries at the top of each box show the fit model, in which "x" is the default independent variable and "y" the dependent; the user enters only the part to the right of the "=" sign in the fit definition box. [Fit boxes: y = a*b*x/(1 + b*x): a = 2.000 ± 0.08261, b = 1.0000 ± 0.2044; y = x/a + 1/(a*b): a = 2.000 ± 0.08266, b = 1.000 ± 0.2045; Chisq ≈ 0 and R = 1 in both.]
Figure 19.4 Same data displayed and analyzed in accord with the double-reciprocal [Eq. (19.2), axes top and left] and x-reciprocal [Eq. (19.4), bottom and right axes] linearizations. [Fit boxes: y = X/(b*a) + 1/a: a = 2.000 ± 0.08266, b = 1.0000 ± 0.2045; y = C*X + A: A = 0.5000 ± 0.02065, C = 0.5000 ± 0.08590; y = a*b − b*x: a = 2.000 ± 0.08262, b = 1.0000 ± 0.2044; Chisq ≈ 0 and R = 1 in all.]
box ensures that the user will later be prompted to designate a column of σ_y values for weights upon selecting a dependent variable to fit. This manner of providing for the computation of weights makes it easy to verify that the
parameter SEs scale with σ_y. Thus, for example, increasing σ_y to 0.2 in the present exercise will increase the SEs by a factor of 0.2/0.08 = 2.5. With such a change, a remains fairly precise, with RSE ≈ 0.1; but the RSE in C increases from 17% to 43%, so we can expect b (= A/C) to exhibit strongly non-Gaussian behavior, while B should be close to normal. Before examining these distributions, let us consider several other aspects of the present computations with exact data. First, there is no particular role for the distribution of the values on the independent axis, because with proper weighting, the results from Fig. 19.4, where the data are strongly bunched on the "x" axes, are identical to those from Fig. 19.3, where they are evenly distributed. We can take such considerations further by asking, what difference is made by using a geometric distribution for x? Or what is the effect of changing b or the data error structure for the same x-structure? Or what if the fit is redefined in the form more commonly used to analyze kinetics data, with B instead of b? Results answering these questions are displayed in Figs. 19.5–19.7. From the middle results in Fig. 19.5, we see that changing the x-structure decreases the precision in a (from σ_a = 0.083 to 0.091) but increases that for b (σ_b = 0.20 to 0.16). Changes of this magnitude are not likely to be practically important, so we can conclude that for a specified range of x and constant σ_y, geometric and arithmetic data structures are roughly equivalent. Results in all these figures show that a is better defined when there are more points near the large-y asymptote, which occurs for large b. On the other hand, b is determined with best relative precision in the midrange of values sampled here. These trends are not significantly altered when the error structure is changed from constant σ_y to a constant 8% coefficient of variation (σ_y = 0.08y), in part because
Figure 19.5 Results for same model with geometric x structure (0.5, 1, 2, 5, 10), obtained as a function of b. Note the logarithmic axis scale for x. [Fit boxes for y = a*b*x/(1 + b*x): b = 0.1: a = 2.000 ± 0.6429, b = 0.1000 ± 0.05429; b = 1: a = 2.000 ± 0.09050, b = 1.000 ± 0.1595; b = 10: a = 2.000 ± 0.05712, b = 10.000 ± 3.529.]
Figure 19.6 Results for same model as in Fig. 19.5, but error structure changed from σ_y = 0.08 to σ_y = 0.08y. [Fit boxes for y = a*b*x/(1 + b*x): b = 0.1: a = 2.000 ± 0.3376, b = 0.1000 ± 0.02136; b = 1: a = 2.000 ± 0.1387, b = 1.000 ± 0.1691; b = 10: a = 2.000 ± 0.1101, b = 10.000 ± 6.252.]

Figure 19.7 Same data as in Fig. 19.5, but with fit model redefined in terms of B = 1/b. [Fit boxes for y = a*x/(B + x): b = 0.1: a = 2.000 ± 0.6433, B = 10.000 ± 5.433; b = 1: a = 2.000 ± 0.09052, B = 1.000 ± 0.1595; b = 10: a = 2.000 ± 0.05712, B = 0.1000 ± 0.03528.]
the sampling range of the y values is not large enough to give strong heteroscedasticity except for b = 0.1. Finally, Fig. 19.7 confirms expectations that σ_B/B = σ_b/b when the fit model includes B instead of b.
4.3. Real data example

Next let us use some of the normal random deviates produced in the earlier exercise on reciprocals to produce a "real" dataset and see how it responds to analysis with the different models. I take σ_y = 0.08 to scale the normal deviates, but assume just that σ_y = constant for the analyses.
This corresponds to the common situation where the uncertainty is thought to be independent of y but its magnitude is unknown. Figure 19.8 shows the data and results of their analysis using Eqs. (19.1) and (19.3), as in Fig. 19.3. Consider first the unweighted fit to Eq. (19.1). The results returned by KG for ULS are V_post-based, so the parameter variances already include the prefactor s_y² from Eq. (19.7). Using the output "Chisq" value, we can estimate s_y = 0.073 [= (0.01586/3)^(1/2)]. This is somewhat smaller than the value adopted for the simulation. Had we assumed that σ_y was known to be 0.08 and used these values in a weighted fit, we would have obtained V_prior-based parameter SEs larger by the factor 0.08/0.073, and χ² = 2.5 [= 3 × (0.073/0.08)²]. For analysis using the y-reciprocal form of Eq. (19.3), we must weight the data using Eq. (19.15b), but we do not know σ_y, so we take it to be 1.0, thus using σ_{x/y} = x/y². For y in this expression, we use the calculated value from the fitted function, which makes the weighting iterative. The calculations converge in several cycles, yielding the results in the second fit box in Fig. 19.8. However, KG always uses V_prior in weighted fits, which means that it treats our weights as absolute. To obtain correct results for the parameter SEs we must now include the scale factor of Eq. (19.7), which means multiplying each SE by the factor (Chisq/3)^(1/2). We can obtain the same results by just rescaling our data σ values by the same factor, yielding the results shown in the third results box. The results from the two different fits are now close but not identical. Note also that rescaling the data σ values has raised χ² to its expected value of 3 (= n − p = 5 − 2) in the third box.

Figure 19.8 Synthetic data having x structure of Fig. 19.3 and σ_y = 0.08, giving the following synthetic y values: 0.551, 1.555, 1.597, 1.759, and 1.889. The curves show results from the unweighted fit to Eq. (19.1) and the weighted fit to Eq. (19.3) with iterative adjustment of the weights. The middle box shows WLS results prior to scaling the data σ values; final parameter SEs can be obtained by multiplying the indicated errors for a and b by (χ²/ν)^(1/2) = (0.01551/3)^(1/2) = 0.0719. [Fit boxes: y = a*b*x/(1 + b*x), ULS: a = 2.097 ± 0.08893, b = 0.7738 ± 0.1458, Chisq = 0.01586; y = x/a + 1/a/b, WLS unscaled: a = 2.098 ± 1.225, b = 0.7668 ± 1.974, Chisq = 0.01551; rescaled: a = 2.098 ± 0.08810, b = 0.7668 ± 0.1419, Chisq = 3.001.]
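The same iterative procedure in a scipy-based sketch (my stand-in for the KG workflow, using the Fig. 19.8 data; curve_fit with absolute_sigma=False applies the Eq. (19.7) scaling automatically):

import numpy as np
from scipy.optimize import curve_fit

x = np.array([0.5, 3.0, 5.0, 7.5, 10.0])
y = np.array([0.551, 1.555, 1.597, 1.759, 1.889])   # synthetic data of Fig. 19.8

def hanes(x, a, b):                                  # Eq. (19.3): x/y vs x
    return x / a + 1.0 / (a * b)

p = np.array([2.0, 1.0])
for _ in range(5):                                   # iterate weights to convergence
    y_calc = p[0] * p[1] * x / (1 + p[1] * x)        # calculated y from current fit
    sig = x / y_calc**2                              # Eq. (19.15b) with sigma_y = 1
    p, cov = curve_fit(hanes, x, x / y, p0=p, sigma=sig, absolute_sigma=False)

print("a, b:", p)                     # ~2.098, ~0.767
print("SEs: ", np.sqrt(np.diag(cov))) # ~0.088, ~0.142, matching the third box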
The use of consistent data weighting returns identical results for any data set analyzed with Eqs. (19.2) and (19.3). We can confirm this by converting our final σ_{x/y} values to σ_{1/y} by dividing by x_i [Eqs. (19.15a) and (19.15b)] and then using these in fitting the data to Eq. (19.2). Results are shown in Fig. 19.9 (first box), which also includes results for fitting the same data to the linear double-reciprocal form (second box) and to the x-reciprocal form. In all cases, the σ values in the columns used to compute the weights have been scaled to ensure that χ² = ν = 3, which is equivalent to using Eq. (19.7) to obtain V_post. From the three different analyses, the a values now fall in the range 2.10–2.13 and their SEs 0.088–0.100, while b = 0.72–0.77 and σ_b = 0.14–0.15. Thus, for this dataset a is about 1σ larger than its true value, while b is almost 2σ smaller. It is interesting to note that the SEs for A and C (middle box) are not identical to those in Fig. 19.4. This is because the weights have been defined in terms of the fitted y values in Fig. 19.9. Use of the original weights (based on the exact y_i) would have resulted in SEs identical to those in Fig. 19.4, and in slightly different A and C values no longer compatible with those in the first box.
4.4. Monte Carlo simulations

I turn now to the question of how well data analyzed with these relations comport with the predictions based on exactly fitting data. All of the results reported here were obtained for the model first introduced in Fig. 19.3, under variation of the data error σ_y and changes in b. Table 19.2 presents

Figure 19.9 Analyses of same synthetic data with Eqs. (19.2) (solid points and line) and (19.4). All parameter SEs are a posteriori, i.e., based on V_post. [Fit boxes: y = X/(b*a) + 1/a: a = 2.098 ± 0.08810, b = 0.7668 ± 0.1419; y = C*X + A: A = 0.4766 ± 0.02000, C = 0.6215 ± 0.09324; y = a*b − b*x: a = 2.130 ± 0.09999, b = 0.7155 ± 0.1445; Chisq = 3.001 in all.]
Table 19.2 Monte Carlo statistics (as % biases) of a, b, C and their reciprocals, from 10^5 datasets, for model illustrated in Fig. 19.3 (a = 2, b = 1) with constant σ_y of varying magnitude (a,b)

σ_y     A      s_A    C      s_C    c      s_c    B      s_B    b      s_b
0.004   0.00   0.0    0.00   0.0    0.00   0.0    0.00   0.0    —      —
0.008   0.00   0.0    0.02   0.0    0.01   0.0    0.03   0.0    —      —
0.080   0.15   1.1    1.42   2.4    1.57   3.0    2.32   4.6    —      —
0.200   0.91   6.0    8.87   14.4   11.16  28.2   16.11  36.7   —      —
0.010   0.00   0.0    —      —      —      —      0.03   0.0    0.04   0.0
0.050   0.06   0.0    —      —      —      —      0.89   1.5    0.75   1.3
0.100   0.23   1.3    —      —      —      —      3.62   6.9    3.14   5.6
0.200   0.94   5.7    —      —      —      —      16.38  38.0   14.46  36.0

(a) First four lines from fit using Eq. (19.14); others using Eq. (19.1). Where entries are missing, they were not evaluated.
(b) Exact parameter standard errors for this model: σ_A = 0.2582 σ_y; σ_a = 1.0327 σ_y; σ_C = 1.0737 σ_y; σ_c = 4.2948 σ_y; σ_b = σ_B = 2.5550 σ_y. Thus the predicted RSEs equal 10% when σ_y = 0.19 (A and a), 0.039 (B and b), 0.047 (C and c).
summary statistics in the form of % bias in the parameters and their SEs, the references being the true values for the parameters and the predictions of their SEs from exactly fitting data. The results bear out expectations of negligible bias in all quantities for sufficiently small data error and show progressively increasing bias with increasing σ_y. The parameter A (= 1/a) is nearly normal for all simulations summarized here, as illustrated in Fig. 19.10. (For this reason, a is not included in the table.) A and a are also the most precise of the three base quantities summarized here, so that σ_A/A reaches 0.1 only for the highest data error included (σ_y = 0.2). Deviations from normality were comparable for B and C and their reciprocals, as illustrated for C and c in Fig. 19.11. Still, for the smallest data error included in Table 19.2, both are approximately normal at the level of precision obtained from 10^5 data sets. Figure 19.12 shows that these properties change with changes in b. When b = 10, a is closer to normal than its reciprocal, though both are reasonably close, because their RSE is only ≈1.5%. B is also nearly normal, but its reciprocal is significantly nonnormal; their RSEs are ≈22%. These properties would follow from the predictions above if C were nearly normal (since B = C/A remains normal if C is normal and A is precise). An MC check confirmed this. When b is reduced to 0.1, A is the most nearly normal

Figure 19.10 Histogrammed results for A from 10^5 simulated datasets for model of Fig. 19.3 with varying data error (σ_y = 0.008, 0.08, 0.20), analyzed using Eq. (19.14). The displayed fit results are obtained by fitting the binned data for σ_y = 0.008 to a Gaussian [Eq. (19.13), coded as y = a*exp(−0.5*(x − c)²/b²)], with weighting based on the Poisson treatment of bin counts (variance = count): a = 9928.2 ± 38.5, b = 1.0043 ± 0.0023, c = −0.0051 ± 0.0032, Chisq = 29.08. The Chisq value is reasonable for the 32 data points fitted here, but not for the other two datasets (Chisq = 138 and 747), showing that these are not Gaussian at this precision level.
Figure 19.11 Histogrammed results for C and c for the same model. The curves are fitted Gaussians for σ_y = 0.004 and yield χ² values that are only marginally consistent with normality: 46 (left) and 38.

Figure 19.12 Histogrammed results for a, b, A, and B with σ_y = 0.05, a = 2, and b = 10 (left) and 0.1 (right). The fitted curves are for the most normal dataset in each case and confirm normality at this precision for a when b = 10 (χ² = 23) but not for A when b = 0.1 (χ² = 240).
parameter (≈21% RSE) and b is closer to normal than B. This result is expected when A is normal and C is precise (b = A/C). This is not quite true here, although the predicted RSE in C for these conditions (≈16%) is much less than the ≈37% for b and B. Space does not permit extending these MC computations to the linearized forms of Eq. (19.1), but previous work has shown that these behave similarly, though with additional sources of bias from the inverted data and from the occurrence of y-dependent weights (Tellinghuisen, 2000a,b).
These problems make the transformed relations poorer statistically, but not drastically so, except when data having large relative error are inverted, as already discussed. In summary, the 10% rule of thumb is conservative in predicting the range of validity of the V-based parameter SEs, in that there are situations where the parameters can be roughly normal for RSEs significantly exceeding 10%. In predicting conditions where near normality may hold for RSEs exceeding 10%, the considerations of LLS fitting to A and C are useful but not infallible. And recall that if V_post is used to estimate the SEs, the relative uncertainty in this estimator, (2ν)^(−1/2) = 0.41 for the present example, would swamp the uncertainty inherent in the NLLS method itself [and require the use of the t-distribution to assess confidence limits (Tellinghuisen, 2000a)].
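The Monte Carlo protocol behind Table 19.2 is easy to emulate. A minimal sketch (mine; far fewer datasets than the 10^5 used for the table, so expect noisier statistics):

import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
a0, b0, sy = 2.0, 1.0, 0.08
x = np.array([0.5, 3.0, 5.0, 7.5, 10.0])
y0 = a0 * b0 * x / (1 + b0 * x)

def f(x, a, b):                         # Eq. (19.1)
    return a * b * x / (1 + b * x)

est = []
for _ in range(2000):
    y = y0 + sy * rng.standard_normal(x.size)
    try:
        p, _ = curve_fit(f, x, y, p0=[a0, b0])
        est.append(p)
    except RuntimeError:                # skip rare nonconvergent sets
        pass

est = np.array(est)
print("% bias in a, b:", 100 * (est.mean(axis=0) / [a0, b0] - 1))
print("sampling SDs:  ", est.std(axis=0))
# Compare the SDs with the exact-data predictions of Table 19.2, footnote b:
# sigma_a = 1.0327*sy ~ 0.083, sigma_b = 2.5550*sy ~ 0.204.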
5. Unusual Weighting: When x is the Dependent Variable

5.1. Effective variance treatment

If x is a measured rather than controlled quantity, it is uncertain, in violation of an important LS assumption for the independent variable. For example, in sorption work one measures the equilibrium concentration x of sorbate, and the sorbed amount y is computed from the initial concentration x0 using an equation of form y = V(x0 − x) (Bolster, 2008). Often x0 and V are both much more precisely determined than x, making the errors in y and x perfectly correlated. This situation can arise also in binding studies where free ligand is not in great excess, making it improper to set [L] ≈ Lt, the prepared total ligand concentration, taken as precisely known. Notably, dialysis and related methods fall in this category. In such cases, it is not correct to compute weights as w_i ∝ σ_yi^−2, because the uncertainty in x is manifested in y in two ways: the direct contribution from y = V(x0 − x) and an indirect contribution from the effect of changes in x on the fit function. Bolster and I have recently treated this case with the effective variance (EV) method and have derived weighting formulas for the situation just described (Tellinghuisen and Bolster, 2009a). Here, I will rederive the weighting formulas for fitting with Eqs. (19.1)–(19.4) and will illustrate how exactly fitting data can be used to verify the results. Interested readers are referred to the full paper for numerical examples and results of MC simulations. As was noted already in connection with Eq. (19.16), the idea behind the EV method is to project the uncertainty in x onto the y-axis. Again here the only source of error in y is presumed to be that in x, and this makes the two contributions fully correlated, requiring that the σs be added, with attention to signs, to give σ_tot² = (σ_eff ± σ_dir)². The two contributions arise as follows for analysis using Eq. (19.1): Let a point on the true curve be
subject to an error ε_x in x. This produces a direct error ε_y = −Vε_x, leading to σ_dir = Vσ_x. There is also an effective or indirect error (dy/dx)ε_x, through the displacement of the fit function to (x + ε_x). The two contributions add in the same direction, leading to a total

σ_tot,1 = σ_x [V + ab/(1 + bx)²],   (19.18)

from which the weights can be computed as usual, as σ_tot,1^−2. [The same result can be obtained more directly by first rewriting the equation as Vx0 = Vx + abx/(1 + bx).] Error propagation again suffices to obtain corresponding expressions for fits to Eqs. (19.2) and (19.3). Thus, since we have already fully projected the effects of σ_x into σ_tot,1, we can use Eqs. (19.15a) and (19.15b) to obtain for Eq. (19.2)

σ_tot,2 = σ_x [V(1 + bx)² + ab]/(abx)²,   (19.19)
and we can obtain the expression needed for fitting with Eq. (19.3) from Eq. (19.19) by noting that σ_tot,3 = x σ_tot,2. (These expressions can also be derived "starting from scratch" for each.) Equation (19.4) is complicated by the use of the pseudo-independent variable y, requiring considerations like those already discussed in connection with Eq. (19.16). We obtain

σ_tot,4 = [V + b(a − y)] σ_x/x + bVσ_x.   (19.20)

Similar considerations can be used to add contributions from x0, if it is considered uncertain. All results for both are collected in Table 19.3, where I include also another version of Eq. (19.1) that has been used relatively little, but which is especially appropriate when x0 is error-free and x is uncertain. This is the equation obtained by solving the quadratic expression
\[
x^2 + x\left(\frac{a}{V} - x_0 + \frac{1}{b}\right) - \frac{x_0}{b} = 0, \tag{19.21}
\]
for x, which I refer to as the direct equation, since it properly treats x as the dependent and $x_0$ as the independent variable.
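As a simple illustration of how such weights enter a calculation, the following minimal MATLAB sketch evaluates Eq. (19.18) on a grid of x values; the parameter values and the grid are illustrative only, not taken from any particular data set.

```matlab
% Minimal sketch: evaluating the EV weights of Eq. (19.18) for the
% Langmuir form.  The values of a, b, V, sigx, and the x grid are
% illustrative, not taken from any particular experiment.
a = 2; b = 1; V = 3; sigx = 0.04;            % example parameter values
x = linspace(0.2, 3, 10)';                   % hypothetical measured x values
sigtot = sigx * (V + a*b ./ (1 + b*x).^2);   % Eq. (19.18)
w = sigtot.^(-2);                            % weights, w_i = sigma_tot^(-2)
```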
5.2. Checking the results with exactly fitting data

The signs of the two terms that add to give $\sigma_{\mathrm{tot}}$ can be a source of confusion. Fortunately, it is easy to check the results with exactly fitting data, since we have already seen that consistent weighting yields identical parameter SEs for all forms of Eq. (19.1) with exactly fitting data. We will do this with the model of Figs. 19.3 and 19.4, but now with error in x instead of y. I take V = 3, from which one could compute $x_0$ values consistent with $y = V(x_0 - x)$. However,
Table 19.3 Summary of effective-variance-based weighting expressions for the least-squares analysis of data following the equation $y = V(x_0 - x) = abx(1 + bx)^{-1}$ᵃ

Relation (Equation) | $\sigma_{\mathrm{tot},x}$ | $\sigma_{\mathrm{tot},x_0}$
Direct (19.21) | $\sigma_x$ | $\sigma_{x_0}\,\dfrac{x + 1/b}{2x + a/V - x_0 + 1/b}$
Langmuir (19.1) | $\sigma_x\left[V + \dfrac{ab}{(1+bx)^2}\right]$ | $\sigma_{x_0}V$
Double reciprocal (19.2) | $\sigma_x\,\dfrac{V(1+bx)^2 + ab}{(abx)^2}$ | $\sigma_{x_0}V\,\dfrac{(1+bx)^2}{(abx)^2}$
y-Reciprocal (19.3) | $\sigma_x x\,\dfrac{V(1+bx)^2 + ab}{(abx)^2}$ | $\sigma_{x_0}xV\,\dfrac{(1+bx)^2}{(abx)^2}$
x-Reciprocal (19.4) | $\dfrac{\sigma_x}{x}\,[V + b(a-y)] + \sigma_x bV$ | $\dfrac{\sigma_{x_0}}{x_0}\,[V + b(a-y)] + \sigma_{x_0}bV$

ᵃ V is presumed to be a known constant of negligible uncertainty. Weights are $w = \sigma_{\mathrm{tot}}^{-2}$, where $\sigma^2_{\mathrm{tot}} = \sigma^2_{\mathrm{tot},x} + \sigma^2_{\mathrm{tot},x_0}$, and quantities are evaluated for each point using the relevant $x_i$ and $x_{0,i}$ values. Errors in x and $x_0$ are assumed to be independent.
[Figure 19.13 presents four fit-report panels, one for each of the fitted forms y = a*b*x/(1 + b*x), y = x/a + 1/(a*b), y = x/(b*a) + 1/a, and y = a*b − b*x; in every panel the returned values are a = 2.000 (error 0.137), b = 1.000 (error 0.369), Chisq ≈ 0, and R = 1.]

Figure 19.13 Results from LS analyses via Eqs. (19.1)–(19.4) of exactly fitting data for the model of Fig. 19.3, with constant error in x, $\sigma_x = 0.04$. Weights were obtained using the EV expressions of Eqs. (19.18)–(19.20) and Table 19.3, with the constant V taken as 3.0.
this is not necessary except for the direct fit to Eq. (19.21), because the other models express y as a function of x. I also take $\sigma_x = 0.04$ for this illustration. Results for fits to Eqs. (19.1)–(19.4) (Fig. 19.13) confirm that Eqs. (19.18)–(19.20) provide consistent weightings for error in x. To verify that these weightings are correct for the physical model, we conducted MC computations in which we added random error to x and propagated it into y through the latter's definition. We also checked that the solutions to Eq. (19.21), where x is properly the dependent variable, yielded identical parameter SEs. In application to real data, the EV weighting expressions require iterative adjustment, since they depend on the values of the fit parameters. Again, such iterations are easy to perform through repetitive fits with KG and similar programs, and they converge rapidly.
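The iterative reweighting is easy to script as well. The following MATLAB sketch, which uses only the base function fminsearch and illustrative starting values, mimics the procedure just described: synthetic error is added to x, y is computed from its definition, and the EV weights of Eq. (19.18) are refreshed from the current parameter values on each cycle.

```matlab
% Sketch of an iteratively reweighted fit of Eq. (19.1) with the EV
% weights of Eq. (19.18), using only base MATLAB (fminsearch).  The
% synthetic-data setup mirrors the text: error enters through x and is
% propagated into y via y = V*(x0 - x).  All names are illustrative.
a0 = 2; b0 = 1; V = 3; sigx = 0.04;           % "true" parameters and x error
xt = linspace(0.2, 3, 10)';                   % error-free x values
x0 = xt + a0*b0*xt ./ ((1 + b0*xt)*V);        % x0 consistent with the true curve
x  = xt + sigx*randn(size(xt));               % measured x carries the error
y  = V*(x0 - x);                              % y computed from x, as in sorption work
f  = @(p,xx) p(1)*p(2)*xx ./ (1 + p(2)*xx);   % Eq. (19.1)
p  = [1.5, 0.8];                              % starting values
for iter = 1:5                                % weights settle in a few cycles
    st = sigx*(V + p(1)*p(2) ./ (1 + p(2)*x).^2);        % Eq. (19.18)
    p  = fminsearch(@(q) sum(((y - f(q,x)) ./ st).^2), p);
end
```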
5.3. The unique answer

As I have noted, the algorithm of Deming (1964) yields the same values for the parameters and their estimated SEs for all correct forms expressing the relation among the variables and parameters. This follows from the definition of the minimization target in Eq. (19.8), and Bolster and I have demonstrated its performance in applications to both synthetic and real data (Tellinghuisen and Bolster, 2009a,b). In all of these examples, the application of the EV treatment to Eq. (19.1) yielded results very close to those obtained with the Deming algorithm, so I feel confident in advising users to just use Eq. (19.1) with the EV weights of Eq. (19.18) and Table 19.3 [or their analogs if isotherms or binding relations other than Eq. (19.1) are involved].
6. Assessing Data Uncertainty: Variance Function Estimation

From the foregoing, it is clear that a correct analysis of data by LS fitting requires knowledge of the data uncertainty. An obvious approach is to repeat the experiments many times and collect sampling statistics. Lest the reader despair of having to run dozens of day-long experiments, it is useful to know that neglect of weighting may not be a serious problem when the range of weights is moderate, say less than a factor of 10 over the data set (Tellinghuisen, 2007). By contrast, the weights needed for transformed relations like Eq. (19.2) cannot be neglected, as these can easily span a range of 100 or greater, from their $y^4$ dependence. As an alternative to tedious repetition of every experiment, the data error can often be gleaned
from archival data collected over time in many different experiments done with similar equipment and techniques. There is great value in knowing the data error, even if it is approximately constant, because this knowledge permits the use of the $\chi^2$ test to judge the suitability of a fit model (Tellinghuisen and Bolster, 2009b; Zeng et al., 2008a). While the $\chi^2$ test may not be adequate to guarantee the suitability of a model (Straume and Johnson, 1992), it does work well to eliminate many inadequate models. Also, data heteroscedasticity can usually be characterized through VFs that contain only two or three parameters, which can be estimated adequately from as few as 20 data points (Tellinghuisen, 2008a, 2009b). The estimation is generally done by LS, methods for which I will briefly review for the analysis of replicate data. Details and examples, and discussion of VF estimation from residuals, can be found in the cited references.

Under the assumption that the data errors depend in some simple, smooth way on the experimental parameters, we seek to obtain that relation, var($x_i$,$y_i$), through LS fitting of sampling estimates $s_i^2$ that we obtain from replicate measurements. Such estimates follow a scaled $\chi^2$ distribution, which means that they have error proportional to their magnitude, namely $\sigma(s^2) = (2/\nu)^{1/2}\,s^2$. If we obtain the estimate $s_i^2$ from m measurements, first averaged to obtain a mean, then $\nu = m - 1$. Accordingly, the estimates should be weighted $w_i \propto s^{-4}$. The large relative uncertainty (e.g., 100% for m = 3, 50% for m = 9) means that the $w_i$ can be quite uncertain. The remedy for this is to use the estimated VF itself to compute the weights, $w_i = \mathrm{var}(x_i,y_i)^{-2}$. This renders the fit iterative, since the VF is not known at the outset. However, like similar iterative weightings already discussed, these computations typically converge adequately in a few cycles.

An alternative approach is to fit $\ln(s_i^2)$. From error propagation, if $z = \ln(y)$ and y has uncertainty $\sigma_y$, then $\sigma_z = \sigma_y/y$. If $\sigma_y$ is proportional to y, $\sigma_y = cy$, then $\sigma_z = c$. This is the case here, and $\sigma(\ln(s^2)) = (2/\nu)^{1/2}$. This approach has the advantage that the weights are independent of any fitting, so no iteration is required, but the disadvantage that the resulting estimates of the VF are biased negatively for small m (Tellinghuisen, 2008a, 2009b). Still, if the VF is not itself a primary target of the study, log fitting should give negligible loss of precision in the fitted response function.

Why not just assign the weights as $w_i = s_i^{-2}$? This approach works well when large numbers of replicates (10 or more) are used, but it has long been known that such weighting can actually be worse than ULS for small m (Jacquez and Norusis, 1973; Tellinghuisen, 2007). For illustration, suppose data of constant $\sigma$ are sampled with m = 3 replicates at a number of (x,y) points. The large (100%) uncertainty in the $s_i^2$ values ensures that many of these estimates will be much too small or too large, and by the principle of Eq. (19.5), the resulting WLS fit cannot be the minimum-variance one, which in this case would be the ULS fit.
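A minimal sketch of the iterative scheme described above follows; the replicate variances, the grid of mean responses, and the assumed two-parameter VF form are all hypothetical and serve only to show the structure of the calculation.

```matlab
% Sketch of VF estimation from replicate variances, with the weights
% taken from the current VF itself, w_i = VF^(-2), as described above.
% The data and the two-parameter form var = c0 + c1*y^2 are illustrative.
ybar = [1 2 4 8 16]';                 % hypothetical mean responses
s2   = [0.9 1.5 4 17 60]';            % hypothetical replicate variances
X    = [ones(size(ybar)), ybar.^2];   % design matrix for var = c0 + c1*y^2
c    = [1; 0.2];                      % starting VF parameters [c0; c1]
for iter = 1:10                       % the weights converge in a few cycles
    w = (X*c).^(-2);                  % w_i = VF^(-2)
    c = lscov(X, s2, w);              % weighted linear LS (base MATLAB)
end
```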
One approach for deriving weighting functions that definitely should not be used is the trial-and-error analysis of the data with various weighting formulas, using "quality coefficients" to assess the results. Although this method has seen increasing use recently in some areas of biochemical and medical research, I have shown that it is fundamentally flawed, because the nature of this test makes it self-fulfilling (Tellinghuisen, 2008b).

What functions can be used as VFs? A number of studies have indicated that measurement or instrumental variances generally contain terms that are constant, proportional to y, and proportional to $y^2$ (Ingle and Crouch, 1972; Rodbard and Frazier, 1975; Thompson, 1988; Zeng et al., 2008b), though often only two of these three are justified statistically. The measurement variance can be assessed straightforwardly through repeated measurements of samples that span the desired range of y. On the other hand, the method variance is needed to assign realistic weights, and it is harder to assess, as it requires repetition of entire experiments. The functional dependence of the method variance on y is also not easy to predict. Statisticians have preferred power functions and exponentials (Davidian and Carroll, 1987). Bolster and I found that the exponential form was needed in a recent sorption study (Tellinghuisen and Bolster, 2009b).
7. Conclusion

Numerous studies conducted over the last half century have emphasized the importance of proper weighting in the least-squares analysis of rectangularly hyperbolic data. Here, I have attempted to substantiate and simplify those findings by showing that consistent weights for Eq. (19.1) and its linearizations are easily derived using the rules for error propagation. Extending the oft-seen statement that precise data yield identical values of the LS parameters in all fit representations, I have emphasized that consistently weighted, precise data also yield identical parameter standard errors. Further, I have addressed the unusual situation of many sorption and binding studies, where the "independent" variable x of the fit models is the measured, uncertain quantity. Several treatments of this problem (effective variance, reexpressing the relation with x the dependent variable, and use of the Deming/Lybanon algorithm) all yield satisfactory results.

Consistent weighting is unfortunately not the same as correct weighting. For the latter, there is a simple but widely neglected rule for obtaining minimum-variance estimates, rigorously true in LLS and anecdotally valid in NLLS: $w_i \propto \sigma_i^{-2}$. Better awareness of this rule should serve to direct analysts' attention toward determining their data error structure, instead of seeking "magic weighting formulas" through trial-and-error experimentation with
different weighting expressions, an invalid approach that has unfortunately gained traction in some fields in recent years.

Knowledge of the data VF is useful for more than correct weighting of the data from the most recent experiment: It also facilitates the design of better experiments. This is an important topic only touched on here but addressed in many other works, including those by Connors (1987) and Bowser and Chen (1998, 1999). The same methods that I have used to confirm weighting expressions, with exactly fitting data, can also be used to explore other ranges of parameter and variable extent, in order to achieve better results. But such efforts are of limited value without reliable information about the data error structure.

Finally, I have used Monte Carlo computations to show that the V-based parameter error estimates from nonlinear LS analysis of rectangularly hyperbolic data are trustworthy for establishing confidence limits unless the RSEs are large. The 10% rule of thumb is a useful and normally conservative guideline in this context: The V-based estimates should be reliable within 10% if the RSE is less than 10%. This guideline has emerged from examination of the statistical properties of reciprocals of normal variates, the most pathological behavior typically observed for NLLS estimators.
REFERENCES

Askelof, P., Korsfeldt, M., and Mannervik, B. (1976). Error structure of enzyme kinetic experiments: Implications for weighting in regression analysis of experimental data. Eur. J. Biochem. 69, 61–67.
Barker, D. R., and Diana, L. M. (1974). Simple method for fitting data when both variables have uncertainty. Am. J. Phys. 42, 224–227.
Barrow, N. J. (1978). The description of phosphate adsorption curves. J. Soil Sci. 29, 447–462.
Bevington, P. R. (1969). Data Reduction and Error Analysis for the Physical Sciences. McGraw-Hill, New York.
Bolster, C. H. (2008). Revisiting a statistical shortcoming when fitting the Langmuir model to sorption data. J. Environ. Qual. 37, 1986–1992.
Bowser, M. T., and Chen, D. D. Y. (1998). Monte Carlo simulation of error propagation in the determination of binding constants from rectangular hyperbolae. 1. Ligand concentration range and binding constant. J. Phys. Chem. A 102, 8063–8071.
Bowser, M. T., and Chen, D. D. Y. (1999). Monte Carlo simulation of error propagation in the determination of binding constants from rectangular hyperbolae. 2. Effect of the maximum-response range. J. Phys. Chem. A 103, 197–202.
Britt, H. I., and Luecke, R. H. (1973). The estimation of parameters in nonlinear, implicit models. Technometrics 15, 233–247.
Cleland, W. W. (1967). The statistical analysis of enzyme kinetic data. Adv. Enzymol. 29, 1–32.
Clutton-Brock, M. (1967). Likelihood distributions for estimating functions when both variables are subject to error. Technometrics 9, 261–269.
Connors, K. A. (1987). Binding Constants: The Measurement of Molecular Complex Stability. Wiley, New York.
Cornish-Bowden, A., and Eisenthal, R. (1974). Statistical considerations in the estimation of enzyme kinetic parameters by the direct linear plot and other methods. Biochem. J. 139, 721–730.
Davidian, M., and Carroll, R. J. (1987). Variance function estimation. J. Am. Stat. Assoc. 82, 1079–1091.
de Levie, R. (2008). Advanced Excel for Scientific Data Analysis. Oxford University Press, New York.
Deming, W. E. (1964). Statistical Adjustment of Data. Dover, New York.
Di Cera, E. (1992). Use of weighting functions in data fitting. Methods Enzymol. 210, 68–87.
Dowd, J. E., and Riggs, D. S. (1965). A comparison of estimates of Michaelis–Menten kinetic constants from various linear transformations. J. Biol. Chem. 240, 863–869.
Eftink, M. R., and Ghiron, C. A. (1981). Fluorescence quenching studies with proteins. Anal. Biochem. 114, 199–227.
Feldman, H. A. (1972). Mathematical theory of complex ligand-binding systems at equilibrium: Some methods for parameter fitting. Anal. Biochem. 48, 317–338.
Ingle, J. D. Jr., and Crouch, S. R. (1972). Evaluation of precision of quantitative molecular absorption spectrometric measurements. Anal. Chem. 44, 1375–1386.
Jacquez, J. A., and Norusis, M. (1973). Sampling experiments on the estimation of parameters in heteroscedastic linear regression. Biometrics 29, 771–779.
Jefferys, W. H. (1980). On the method of least squares. Astron. J. 85, 177–181.
Johnson, M. L. (1985). The analysis of ligand-binding data with experimental uncertainties in independent variables. Anal. Biochem. 148, 471–478.
Johnson, M. L., and Faunt, L. M. (1992). Parameter estimation by least-squares methods. Methods Enzymol. 210, 1–37.
Kinniburgh, D. G. (1986). General purpose adsorption isotherms. Environ. Sci. Technol. 20, 895–904.
Langmuir, I. (1918). The adsorption of gases on plane surfaces of glass, mica, and platinum. J. Am. Chem. Soc. 40, 1361–1403.
Laws, W. R., and Contino, P. B. (1992). Fluorescence quenching studies: Analysis of nonlinear Stern-Volmer data. Methods Enzymol. 210, 448–463.
Lineweaver, H., Burk, D., and Deming, W. E. (1934). The dissociation constant of nitrogen-nitrogenase in azotobacter. J. Am. Chem. Soc. 56, 225–230.
Lybanon, M. (1984). A better least-squares method when both variables have uncertainties. Am. J. Phys. 52, 22–26.
Mannervik, B. (1982). Regression analysis, experimental error, and statistical criteria in the design and analysis of experiments for discriminating between rival kinetic models. Methods Enzymol. 87, 370–390.
Meinert, C. L., and McHugh, R. B. (1968). The biometry of an isotope displacement immunologic microassay. Math. Biosci. 2, 319–338.
Munson, P. J., and Rodbard, D. (1980). LIGAND: A versatile computerized approach for characterization of ligand-binding systems. Anal. Biochem. 107, 220–239.
Orear, J. (1982). Least squares when both variables have uncertainties. Am. J. Phys. 50, 912–916.
Powell, D. R., and Macdonald, J. R. (1972). A rapidly convergent iterative method for the solution of the generalized nonlinear least squares problem. Computer J. 15, 148–155.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. (1986). Numerical Recipes. Cambridge University Press, Cambridge, UK.
Ratkowsky, D. A. (1986). A suitable parameterization of the Michaelis–Menten enzyme reaction. Biochem. J. 240, 357–360.
Ritchie, R. J., and Prvan, T. (1996). A simulation study on designing experiments to measure the Km of Michaelis–Menten kinetics curves. J. Theor. Biol. 178, 239–254.
Rodbard, D., and Frazier, G. R. (1975). Statistical analysis of radioligand assay data. Methods Enzymol. 37, 3–22.
Schulthess, C. P., and Dey, D. K. (1996). Estimation of Langmuir constants using linear and nonlinear least squares regression analysis. Soil Sci. Soc. Am. J. 60, 433–442.
Shukla, G. K. (1972). Problem of calibration. Technometrics 14, 547–553.
Straume, M., and Johnson, M. L. (1992). Analysis of residuals: Criteria for determining goodness of fit. Methods Enzymol. 210, 87–105.
Tellinghuisen, J. (2000a). A Monte Carlo study of precision, bias, inconsistency, and non-Gaussian distributions in nonlinear least squares. J. Phys. Chem. A 104, 2834–2844.
Tellinghuisen, J. (2000b). Bias and inconsistency in linear regression. J. Phys. Chem. A 104, 11829–11835.
Tellinghuisen, J. (2000c). Nonlinear least-squares using microcomputer data analysis programs: KaleidaGraph in the physical chemistry teaching laboratory. J. Chem. Educ. 77, 1233–1239.
Tellinghuisen, J. (2001). Statistical error propagation. J. Phys. Chem. A 105, 3917–3921.
Tellinghuisen, J. (2004). Statistical error in isothermal titration calorimetry. Methods Enzymol. 383, 245–282.
Tellinghuisen, J. (2007). Weighted least squares in calibration: What difference does it make? Analyst 132, 536–543.
Tellinghuisen, J. (2008a). Least squares with non-normal data: Estimating experimental variance functions. Analyst 133, 161–166.
Tellinghuisen, J. (2008b). Weighted least squares in calibration: The problem with using "quality coefficients" to select weighting formulas. J. Chromatogr. B 872, 162–166.
Tellinghuisen, J. (2009a). Least squares in calibration: Weights, nonlinearity, and other nuisances. Methods Enzymol. 454, 259–285.
Tellinghuisen, J. (2009b). Variance function estimation by replicate analysis and generalized least squares: A Monte Carlo comparison. Chemometr. Intell. Lab. Syst. doi: 10.1016/j.chemolab.2009.09.001.
Tellinghuisen, J., and Bolster, C. H. (2009a). Weighting formulas for the least-squares analysis of binding phenomena data. J. Phys. Chem. B 113, 6151–6157.
Tellinghuisen, J., and Bolster, C. H. (2009b). Least-squares analysis of high-replication phosphorus sorption data with weighting from variance function estimation. Environ. Sci. Technol., unpublished work.
Thompson, M. (1988). Variation of precision with concentration in an analytical system. Analyst 113, 1579–1587.
Valsami, G., Iliadis, A., and Macheras, P. (2000). Non-linear regression analysis with errors in both variables: Estimation of co-operative binding parameters. Biopharm. Drug Dispos. 21, 7–14.
Wilkinson, G. N. (1961). Statistical estimations in enzyme kinetics. Biochem. J. 80, 324–332.
Zeng, Q. C., Zhang, E., and Tellinghuisen, J. (2008a). Univariate calibration by reversed regression of heteroscedastic data: A case study. Analyst 133, 1649–1655.
Zeng, Q. C., Zhang, E., Dong, H., and Tellinghuisen, J. (2008b). Weighted least squares in calibration: Estimating data variance functions in high-performance liquid chromatography. J. Chromatogr. A 1206, 147–152.
C H A P T E R
T W E N T Y
Nonparametric Entropy Estimation Using Kernel Densities

Douglas E. Lake

Contents
1. Introduction
2. Motivating Application: Classifying Cardiac Rhythms
3. Renyi Entropy and the Friedman–Tukey Index
4. Kernel Density Estimation
5. Mean-Integrated Square Error
6. Estimating the FT Index
7. Connection Between Template Matches and Kernel Densities
8. Summary and Future Work
Acknowledgments
References
Abstract

The entropy of experimental data from the biological and medical sciences provides additional information over summary statistics. Calculating entropy involves estimates of probability density functions, which can be effectively accomplished using kernel density methods. Kernel density estimation has been widely studied, and a univariate implementation is readily available in MATLAB. The traditional definition of Shannon entropy is part of a larger family of statistics, called Renyi entropy, which is useful in applications that require a measure of the Gaussianity of data. Of particular note is the quadratic entropy, which is related to the Friedman–Tukey (FT) index, a widely used measure in the statistical community. One application where quadratic entropy is very useful is the detection of abnormal cardiac rhythms, such as atrial fibrillation (AF). Asymptotic and exact small-sample results for optimal bandwidth and kernel selection to estimate the FT index are presented and lead to improved methods for entropy estimation.
Departments of Internal Medicine (Cardiovascular Division) and Statistics, University of Virginia, Charlottesville, Virginia, USA

Methods in Enzymology, Volume 467, ISSN 0076-6879, DOI: 10.1016/S0076-6879(09)67020-8. © 2009 Published by Elsevier Inc.
1. Introduction

Results from experiments are often simply reported with the summary statistics of sample mean and standard deviation. These statistics give qualitative information about the location and scale of the data, but do not answer questions about the shape of the distribution that may be important. Do the data have a bell-shaped Gaussian (normal) distribution? Do the data have more than one mode? Higher order sample statistics such as the skewness and kurtosis are common calculations that provide additional information to begin to answer these questions, but more work is usually needed. One way to understand these properties fully is to look at the distribution of the data by simply constructing a histogram. Mathematically, a histogram is an estimate of the underlying probability density function (PDF) of a random variable X representing the quantity being calculated. Better PDF estimates can be obtained using a method called kernel density estimation, which is readily available in many software packages, for example, the MATLAB function KSDENSITY, and has been widely studied (Scott, 1992). Despite this, the method remains underutilized in practice, and the goal of this chapter is to introduce its use for the analysis of experimental data from the biological and medical sciences.

While visualizing and pondering an estimate of the distribution is feasible on a case-by-case basis, many applications require determining information about many distributions in an automated fashion by calculating a number, called a functional of the PDF. One widely used functional of the PDF is the entropy, or more precisely the Shannon entropy, originally developed as part of communication theory (Shannon, 1997). Shannon entropy is a member of the Renyi entropy family (discussed below) and is an example of a measure of Gaussianity, which can indicate whether a PDF is bell shaped or perhaps has multiple modes (Jones and Sibson, 1987; Lake, 2006). Beirlant et al. provide an excellent overview of nonparametric methods to estimate entropy, and some of the terminology used there is repeated here (Beirlant et al., 1997). However, the statistical properties studied there are asymptotic in nature, and many of these results have limited use in practice.

Another important member of the Renyi entropy family is quadratic entropy. Quadratic entropy does not share some of the optimal theoretical properties of Shannon entropy, but it has advantages that will make it the focus of this chapter. One advantage is that many of the statistical properties of estimates of a quantity related to quadratic entropy, called the Friedman–Tukey (FT) index, can be expressed with exact closed-form expressions. Another advantage is that the properties for quadratic entropy are generally better for small sample sizes where, for example, Shannon entropy estimates can have large bias.
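As a minimal illustration of the kind of comparison just described (and used in the examples below), the following MATLAB sketch overlays a kernel density estimate on a histogram of the same simulated sample; it assumes the Statistics Toolbox is available for KSDENSITY, and the 'Normalization' option of HISTOGRAM is present only in newer MATLAB releases.

```matlab
% Sketch: comparing a histogram with a kernel density estimate of the
% same sample.  KSDENSITY requires the Statistics Toolbox.
x = randn(100, 1);                        % n = 100 standard normal values
histogram(x, 'Normalization', 'pdf');     % histogram scaled as a density
hold on
[f, xi] = ksdensity(x);                   % kernel density estimate, defaults
plot(xi, f, 'LineWidth', 2)
hold off
```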
2. Motivating Application: Classifying Cardiac Rhythms

The classification of cardiac rhythms using series of interbeat or RR intervals is an important clinical problem that has proved to benefit from entropy analysis (Costa et al., 2002; Lake, 2006; Lake et al., 2002). Several common clinical scenarios call for identification of cardiac rhythm in ambulatory outpatients. Atrial fibrillation (AF) is an increasingly common disorder of cardiac rhythm in which the atria depolarize at exceedingly fast rates; it is often paroxysmal (occurs suddenly) in nature. There is an increased risk of stroke with AF, along with risks associated with treatments, so decisions about its therapy are best informed by knowledge of the frequency, duration, and severity of the arrhythmia. AF can often be confused with normal sinus rhythm with ectopic (premature) beats, or with other arrhythmias such as bigeminy or trigeminy, complicating detection and classification.

Figure 20.1 shows examples of a series of 100 RR intervals for AF and trigeminy rhythms.
[Figure 20.1: two panels plotting RR interval (ms) against beat number, the upper for atrial fibrillation (AF) and the lower for trigeminy.]

Figure 20.1 Examples of n = 100 consecutive beats from two abnormal heart rhythms, atrial fibrillation (AF) and trigeminy.
[Figure 20.2: probability density versus standardized values, showing the kernel density estimates for AF (Q = 1.24) and trigeminy (Q = 0.502).]

Figure 20.2 Kernel density estimates for the n = 100 standardized observations from the two examples from Fig. 20.1, using the default settings of the MATLAB function KSDENSITY. The corresponding entropy estimate for the bell-shaped AF density is much higher than for the multimodal trigeminy density function.
The AF series can perhaps best be described as looking like "white" noise, while the trigeminy rhythm has three distinct levels (or modes) of heart rate. The differences between these rhythms are clear from the distribution of the RR intervals. Figure 20.2 shows the kernel density estimates of the two series after they have been standardized (zero mean and unit variance). The results use the default MATLAB settings of KSDENSITY. The trigeminy series has three readily apparent peaks in its density estimate, representing each of the heart rate modes, while the AF rhythm has more of a bell-shaped or normal distribution. Entropy estimates associated with the densities (given below) are much higher for the AF (Q = 1.24) than for the trigeminy (Q = 0.502).

While the differences in the two rhythms are obvious with n = 100 points, sometimes decisions for therapy need to be made with much smaller sample sizes. An important example of this is for patients with severe heart disease who have implantable cardioverter-defibrillator (ICD) devices to reduce the incidence of sudden cardiac death. The therapy in this case is an electric shock, and accurate decisions need to be made on records on the order of n = 16 beats. Finding good entropy estimates on small data sets can be challenging and requires some of the mathematical detail presented here.
3. Renyi Entropy and the Friedman–Tukey Index

The precise mathematical definitions of the entropy measures to be discussed will now be presented. The entropy (Shannon entropy) of a continuous random variable X with density f is

\[
H(X) = E[-\log(f(X))] = -\int_{-\infty}^{\infty} \log(f(x))\, f(x)\, dx, \tag{20.1}
\]

where E is the expectation and log is the natural logarithm. The quadratic entropy is defined to be

\[
Q(X) = -\log(E[f(X)]) = -\log\left(\int_{-\infty}^{\infty} f^2(x)\, dx\right), \tag{20.2}
\]

which is similar to entropy with the expectation and logarithm operations reversed. Both measures are special cases of Renyi entropy (or q-entropy), defined to be

\[
R_q(X) = \frac{1}{1-q}\,\log(E[f(X)^{q-1}]) = \frac{1}{1-q}\,\log\left(\int_{-\infty}^{\infty} f^q(x)\, dx\right), \tag{20.3}
\]

where for q = 1 the limit can be obtained using calculus (l'Hospital's rule). In particular, Shannon entropy corresponds to q = 1, that is, $H(X) = R_1(X)$, and quadratic entropy corresponds to q = 2, that is, $Q(X) = R_2(X)$.

All of the above entropies involve finding the expectation of a function of X, which is defined to be an integral involving the PDF f(x). For quadratic entropy this is just the integral of "f-squared" or $f^2(x)$. This quantity also arises in the analysis of kernel densities, and the following specific notation will be used:

\[
I(f) = \int_{-\infty}^{\infty} f^2(x)\, dx. \tag{20.4}
\]

This quantity is also named the FT index of a random variable, and the notation FT(X) = I(f) will also be used. A good place to start in finding accurate entropy estimates is to investigate the statistical properties of estimates of FT(X).

Entropy is an example of a measure of Gaussianity that has received much attention recently in a variety of applications, including the analysis of heart rate (Lake, 2006). These measures are used in independent component analysis (ICA), where the alternative terminology measure of non-Gaussianity is used (Hyvarinen et al., 2001). One example application of ICA is the separation of signals from multiple speakers, informally called the cocktail party problem. Measures of Gaussianity are also used for exploratory projection pursuit (EPP), which searches for interesting low-dimensional
projections of high-dimensional data (Jones and Sibson, 1987). Here, interesting means non-Gaussian and is measured by what is called a projection index. Non-Gaussian projections can be used as features for multivariate discrimination and for data visualization (e.g., XGobi software) (Ripley, 1996). The Friedman–Tukey index was originally developed as a projection index and is commonly used for this purpose.

The competition and debate over the use of order 2 (q = 2) versus order 1 (q = 1) entropies has taken place in a variety of applications and appears in many different forms. For the development of optimal decision trees using the CART method, the discrete form of the FT index, called the Gini index (order 2), is an alternative measure of impurity to the discrete form of entropy (order 1) (Breiman et al., 1984). For goodness-of-fit tests there has been a long history of debate over using the log-likelihood ratio test (order 1) versus chi-squared tests (order 2) (Read and Cressie, 1988). Finally, the original motivation for this work involves two popular entropy measures for time series data, approximate entropy (order 1) and sample entropy (order 2). These measures will be discussed in more detail below.
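To make Eqs. (20.2) and (20.4) concrete, here is a minimal MATLAB sketch of a plug-in calculation of the FT index and the quadratic entropy; it anticipates the plug-in estimate discussed formally in Section 6, and the sample is illustrative.

```matlab
% Sketch: a plug-in estimate of the FT index (Eq. 20.4) and of the
% quadratic entropy Q (Eq. 20.2), integrating a KSDENSITY estimate
% numerically with TRAPZ.
x = randn(100, 1);            % illustrative sample
[f, xi] = ksdensity(x);       % density estimate on a default grid
Ihat = trapz(xi, f.^2);       % FT index, I(f) = integral of f^2
Qhat = -log(Ihat);            % quadratic entropy, Q = -log(I(f))
```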
4. Kernel Density Estimation

Kernel density estimation has been widely studied for the past 50 years, and an excellent resource for its many interesting mathematical details is Scott (1992). The basic setup is a random sample of independent and identically distributed (iid) data $X_1, X_2, \ldots, X_n$ coming from a random variable X with PDF f. To estimate f, another function K(u), called the kernel, is associated with each observation. All the kernels to be considered here will also be a PDF associated with a random variable (which will also be called K). All kernels considered here will have mean equal to 0 and be symmetric, so that $K(-u) = K(u)$. The variance of the kernel density function will be denoted by $\sigma_K^2$, which, for purposes of simplifying the comparison of kernels, will be assumed to be 1.

The kernel function is scaled by an important parameter h called the bandwidth, and the notation $K_h(u) = K(u/h)/h$ is used. The bandwidth can be interpreted as a scale parameter of the kernel, and the random variable $K_h$ has standard deviation h. The bandwidth is analogous to the bin size for histograms. Once a kernel function and bandwidth have been specified, the estimated density function at each point x is calculated by

\[
\hat{f}(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - X_i) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right). \tag{20.5}
\]
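Equation (20.5) is straightforward to evaluate directly, as the following sketch shows; the bandwidth uses the formula given below in Eq. (20.9), and the sample and grid are illustrative.

```matlab
% Sketch: direct evaluation of Eq. (20.5) with a standard Gaussian
% kernel, without toolbox functions.
x = randn(50, 1);  n = numel(x);
h = 1.0592 * std(x) * n^(-1/5);          % bandwidth of Eq. (20.9)
K = @(u) exp(-u.^2/2) / sqrt(2*pi);      % standard Gaussian kernel
xgrid = linspace(-4, 4, 201);
fhat = zeros(size(xgrid));
for i = 1:n
    fhat = fhat + K((xgrid - x(i))/h) / (n*h);   % sum of scaled kernels
end
```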
Common kernels include the Gaussian, uniform, and triangle kernels. Another kernel that has many optimal asymptotic properties is called the Epanechnikov kernel, which is parabolic in shape. These kernels, scaled to have unit variance, are all shown in Fig. 20.3 and are all available options with the KSDENSITY function.

To illustrate the effect of the shape of the kernel and of bandwidth selection on PDF estimates, a small set of n = 16 observations was simulated from a standard normal distribution. This is a reasonable model of 16 standardized points from a short episode of AF. Histograms and kernel density estimates (using the default options), along with the standard normal PDF, are displayed in Fig. 20.4. The optimal bandwidth of h = 0.587 used in the kernel density estimate comes from a formula, to be discussed below, that involves both the sample size n and an estimate of the standard deviation of the data. This simple example clearly shows the benefit of using a smooth estimate of the PDF versus the usual "choppy" step-function histogram estimate.

With this same data set, Fig. 20.5 displays the effect of kernel and bandwidth selection on the PDF estimates. For all kernels, bandwidths that are too small provide too much resolution for each of the 16 points, and bandwidths that are too large smear out the data, essentially removing all the characteristics of the distribution. The optimal bandwidth for each kernel is approximately h = 0.6, and the estimate provides a reasonably good tradeoff between the two extreme cases.
[Figure 20.3: K(u) versus u for the four kernels (Gaussian, uniform, Epanechnikov, and triangle).]

Figure 20.3 Shapes of four common kernels used in kernel density estimation, all normalized to have unit variance.
[Figure 20.4: probability density f(x) versus x, overlaying the histogram, the kernel density estimate, and the true density.]

Figure 20.4 A comparison of the kernel density method with the more common histogram (both using default MATLAB settings) for n = 16 random observations from a standard normal distribution. The kernel density method clearly provides a far better estimate in this case. The 16 observations were 0.39, 0.14, 2.33, 1.36, 1.81, 1.11, 0.142, 1.11, 0.56, 0.48, 0.68, 0.28, 1.33, 0.72, 0.66, 0.20.
Note that the uniform kernel produces a discontinuous step-function estimate similar to the histogram, and that the Epanechnikov and Gaussian results are very similar.
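For reference, the four unit-variance kernels of Fig. 20.3 can be written in closed form, as in the following sketch; the max(·,0) factors clip the finite-support kernels to zero outside their ranges.

```matlab
% Sketch: the four kernels of Fig. 20.3, each scaled to unit variance.
Kgauss = @(u) exp(-u.^2/2) / sqrt(2*pi);
Kunif  = @(u) (abs(u) <= sqrt(3)) / (2*sqrt(3));
Kepan  = @(u) (3/(4*sqrt(5))) * max(1 - u.^2/5, 0);
Ktri   = @(u) max(1 - abs(u)/sqrt(6), 0) / sqrt(6);
u = linspace(-4, 4, 401);
plot(u, Kgauss(u), u, Kunif(u), u, Kepan(u), u, Ktri(u))
legend('Gaussian', 'Uniform', 'Epanechnikov', 'Triangle')
```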
5. Mean-Integrated Square Error

The bandwidth for each of the kernels used by KSDENSITY is selected using a formula that asymptotically minimizes a goodness-of-fit criterion called the mean-integrated squared error (MISE) for the Gaussian kernel. The MISE is defined to be

\[
\mathrm{MISE} = \int_{-\infty}^{\infty} E[(\hat{f}(x) - f(x))^2]\, dx, \tag{20.6}
\]

where f is the true PDF and $\hat{f}$ is the kernel density estimate. For the simplest case, where both the true density and the kernel function are standard Gaussian, the MISE can be calculated exactly (after some work) to be
[Figure 20.5: a grid of density estimates for the Gaussian, uniform, and Epanechnikov kernels (rows) at bandwidths h = 0.1, 0.6, 1, and 2 (columns).]

Figure 20.5 Effect of the bandwidth parameter h and the kernel function K on the density estimate for the example data from Fig. 20.4. For all kernels, a bandwidth of around h = 0.6 gives an optimal estimate.
\[
\mathrm{MISE}(h) = \frac{1}{2\sqrt{\pi}}\left(1 - 2\sqrt{\frac{2}{2+h^2}} + \frac{1}{\sqrt{1+h^2}} + \frac{1}{nh} - \frac{1}{n\sqrt{1+h^2}}\right), \tag{20.7}
\]

where h is the bandwidth. This complicated expression can be well approximated by a Taylor's series expansion, giving the asymptotic MISE (AMISE)

\[
\mathrm{AMISE}(h) = \frac{3}{32\sqrt{\pi}}\,h^4 + \frac{1}{2\sqrt{\pi}\,nh}, \tag{20.8}
\]

whose minimal value can be found using calculus. This results in an optimal bandwidth of

\[
h^* = (4/3)^{1/5}\, n^{-1/5} = 1.0592\, n^{-1/5}, \tag{20.9}
\]
and this is precisely the formula used by KSDENSITY. A subtle, but nontrivial, point is that the above quantity assumes that the data are standardized to have unit variance. This can only be achieved exactly if $\sigma$ is known, which is not likely in practice, so an estimate is needed. While the sample standard deviation (usually denoted by s) seems a
reasonable choice, an estimate that is robust in the presence of outliers is generally preferable. In fact, the KSDENSITY function estimates $\sigma$ as the median of the absolute deviations from the median of the data, divided by an appropriate constant (the expected value of this calculation for standard normal data). Another such robust estimate would be a normalized interquartile range.
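A minimal sketch of this robust scale estimate and the resulting bandwidth follows; 0.6745 is the standard normal consistency constant for the median absolute deviation, and the sample is illustrative.

```matlab
% Sketch: MAD-based robust sigma (divided by its expected value 0.6745
% for standard normal data) and the bandwidth of Eq. (20.9).
x = randn(16, 1);  n = numel(x);
sigma = median(abs(x - median(x))) / 0.6745;   % robust scale estimate
h = 1.0592 * sigma * n^(-1/5);                 % Eq. (20.9)
```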
6. Estimating the FT Index

The quantity FT(X) is a functional I(f) of the PDF f and can be estimated in a couple of ways (Beirlant et al., 1997; Rao, 1983). One approach is called the plug-in estimate and simply involves inserting any estimate of the density into the formula

\[
\hat{I} = \hat{I}(f) = I(\hat{f}\,) = \int_{-\infty}^{\infty} \hat{f}^2(x)\, dx \tag{20.10}
\]
and evaluating the integral numerically (e.g., Simpson's method). A second method is called the resubstitution estimate, given by

\[
\hat{I} = \hat{I}(f) = \frac{1}{n}\sum_{i=1}^{n} \hat{f}(X_i), \tag{20.11}
\]
which requires estimating the density only at the observed values from the sample. Unless otherwise stated, all the estimates of I(f) will refer to the resubstitution estimate above.

The kernel density estimate at one of the sample points $X_i$ is more biased when the term involving $X_i$ is included in the sum. A less biased estimate is

\[
\hat{f}(X_i) = \frac{1}{n-1}\sum_{j \neq i} K_h(X_i - X_j), \tag{20.12}
\]
which only involves n − 1 terms. This is analogous to not including self-matches in the SampEn algorithm, which has proved to be less biased than the ApEn algorithm, which includes self-matches (Lake et al., 2002; Richman, 2004; Richman and Moorman, 2000). Both these algorithms are discussed below. The biased estimate includes all n terms,

\[
\hat{f}_b(X_i) = \frac{1}{n}\sum_{j=1}^{n} K_h(X_i - X_j) = \frac{K(0)}{nh} + \frac{n-1}{n}\,\hat{f}(X_i), \tag{20.13}
\]

which can be substantially biased if nh is not large. Combining Eqs. (20.11) and (20.12), the estimate of the FT index becomes a double summation over all possible pairs of sample points:
\[
\hat{I} = \frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j \neq i} K_h(X_i - X_j) = \frac{2}{n(n-1)}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} K_h(X_i - X_j), \tag{20.14}
\]

where the last expression follows from the assumed symmetry of the kernel density function. The form of this estimate is, relatively speaking, much simpler than a corresponding estimate of the Shannon entropy. For example, many of its statistical properties can be found exactly in closed form and optimized. The selection of the bandwidth h becomes a tradeoff between the increased bias at large values and the increased variance at small values. Traditionally, these two properties are combined in the mean-squared error (MSE) of an estimator,

\[
\mathrm{MSE} = E[(I(f) - \hat{I}(f))^2] = (I(f) - E[\hat{I}(f)])^2 + V[\hat{I}(f)], \tag{20.15}
\]
which is the sum of the variance V and the bias squared. Note that while there are some similarities between the expressions for MISE and MSE, they are different criteria for evaluating the estimators. The optimal asymptotic bandwidths to minimize the MSE were investigated by Pawlak, though some of the formulas there are incorrect (Pawlak, 1987). In order to do this, expressions for the asymptotic mean-squared error (AMSE) are needed. Expanding the double summation in Eq. (20.14) and after some work, the exact MSE can be calculated as follows:

\[
\mathrm{MSE}(h) = (I(f) - E_1)^2 + \frac{2}{n(n-1)}\left(E_2 + 2(n-1)E_{11} - (2n-3)E_1^2\right), \tag{20.16}
\]

where the three expectations are

\[
E_1 = E[K_h(X_1 - X_2)], \qquad
E_2 = E[K_h^2(X_1 - X_2)], \qquad
E_{11} = E[K_h(X_1 - X_2)K_h(X_1 - X_3)]. \tag{20.17}
\]
Using Taylor series expansions of these kernel density expectations (as is done for the AMISE in Scott (1992)) results in an expression for the AMSE in terms of the bandwidth h:

\[
\mathrm{AMSE}(h) = \frac{4}{n}\,\mathrm{Var}(f(X)) + \frac{2}{n^2 h}\,I(K)I(f) + \frac{1}{4}\,I^2(f')\,h^4, \tag{20.18}
\]

where $f'$ is the derivative of the PDF. This expression asymptotically approximates the MSE to within a constant times $h^4$, that is, MSE = AMSE + $O(h^4)$. The bandwidth that minimizes this quantity can again be found using calculus to be
\[
h^* = (2 I(K) I(f))^{1/5}\, I(f')^{-2/5}\, n^{-2/5}, \tag{20.19}
\]
which for Gaussian data becomes $h^* = 2^{3/5} n^{-2/5} = 1.516\, n^{-2/5}$. For moderate sizes of n, this bandwidth is smaller than that in Eq. (20.9). This suggests that increased entropy estimation accuracy can be achieved using smaller bandwidths than those optimized for MISE. The optimal bandwidth in Eq. (20.19) gives the following minimal AMSE:

\[
\mathrm{AMSE}^* = \frac{4}{n}\,\mathrm{Var}(f(X)) + \frac{5}{4}\,(2 I(K) I(f))^{4/5}\, I(f')^{2/5}\, n^{-8/5}. \tag{20.20}
\]

The MSE of these estimators goes to zero for large n, and they are therefore consistent. A similar expression using the biased estimate of the PDF with self-matches can be determined; its second term is of the form of a constant times $n^{-4/3}$. This is asymptotically larger (not as good an estimator) than the unbiased version without self-matches.

As before, the exact MSE for the special case of standard Gaussian data and kernel can be found for all n in closed form, with

\[
I(f) = \frac{1}{2\sqrt{\pi}}, \qquad
E_1 = \frac{1}{2\sqrt{\pi}}\,(1 + h^2/2)^{-1/2}, \qquad
E_2 = \frac{1}{4\pi h}\,(1 + h^2/4)^{-1/2}, \qquad
E_{11} = \frac{1}{2\pi\sqrt{3}}\,(1 + h^2)^{-1/2}(1 + h^2/3)^{-1/2}. \tag{20.21}
\]
Since bandwidths are usually selected using asymptotic results, a natural question is to what extent the error is reduced by using the exact formulas. Figure 20.6 shows the exact and asymptotic formulas for n = 16. In this case, the optimal bandwidth from the exact MSE is h* = 0.674, versus h* = 0.5 from the AMSE, and the corresponding minimum MSE is approximately 10% lower than that obtained with the asymptotic approximation.

The expression in Eq. (20.20) depends on the kernel through the quantity I(K), and more generally $\sigma_K I(K)$, with smaller values giving better results. The optimal MISE depends on this quantity in the same manner, and it is used to compare the efficiency of estimators. Table 20.1 shows these results for commonly used kernels. It can be shown that the Epanechnikov parabolic kernel is asymptotically the most efficient among all possible estimators. It is also interesting to note that the normal kernel is more efficient than the uniform, but not as efficient as the triangle kernel.
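The exact results are easy to evaluate numerically; the following sketch reproduces this comparison by coding Eqs. (20.16) and (20.21) and minimizing the exact MSE over h with the base MATLAB function FMINBND.

```matlab
% Sketch: the exact MSE of Eq. (20.16) for standard Gaussian data and
% kernel, using the closed-form expectations of Eq. (20.21), with the
% optimal bandwidth located numerically (here n = 16).
n   = 16;
If  = 1/(2*sqrt(pi));
E1  = @(h) (1 + h.^2/2).^(-1/2) / (2*sqrt(pi));
E2  = @(h) (1 + h.^2/4).^(-1/2) ./ (4*pi*h);
E11 = @(h) (1 + h.^2).^(-1/2) .* (1 + h.^2/3).^(-1/2) / (2*pi*sqrt(3));
MSE = @(h) (If - E1(h)).^2 + (2/(n*(n-1))) * ...
           (E2(h) + 2*(n-1)*E11(h) - (2*n-3)*E1(h).^2);   % Eq. (20.16)
hstar = fminbnd(MSE, 0.05, 2);    % exact optimum, about 0.67 for n = 16
```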
[Figure 20.6: MSE versus bandwidth h, comparing the exact Gaussian MSE with the asymptotic MSE (AMSE); the minima fall at h* = 0.674 (exact) and h* = 0.5 (asymptotic).]
Figure 20.6 The exact MSE and the asymptotic AMSE for n = 16. The optimal exact value is h* = 0.674 versus the optimal asymptotic value of h* = 0.5, the exact optimum yielding approximately a 10% reduction in the MSE.

Table 20.1 Efficiency of commonly used kernels

Kernel | $\sigma_K I(K)$ | Efficiency
Uniform | 0.2887 | 1.0758
Triangle | 0.2722 | 1.0143
Epanechnikov | 0.2683 | 1.0000
Normal | 0.2821 | 1.0513
A final observation should be made about the plug-in estimate in comparison to the resubstitution estimate results presented here. At first glance, the plug-in estimate looks superior to the sample-point estimate because it uses estimates of the density at all points. However, it can be shown that the plug-in estimate is equivalent to the resubstitution estimate with a new kernel $K_2$ equal to the convolution of the original kernel K with itself, and using self-matches. This corresponds to a new random variable $K_2$ equal to the sum of two independent random variables with the same distribution as K. So if K is Gaussian, $K_2$ is also Gaussian, and the two methods are equivalent (with different bandwidths) (Erdogmus et al., 2004). However, the equivalent resubstitution estimate includes self-matches, which introduces extra bias and argues against using the plug-in method.
7. Connection Between Template Matches and Kernel Densities

Our example showing the analysis of heart rhythms by simply looking at the distribution of the RR intervals does not tell the full story about the physical process. The interactions between successive observations provide additional information that is not available in a kernel density estimate, which does not depend on the order of the observations. In fact, the most prominent feature of AF is not that its overall distribution is unimodal rather than multimodal, but that the observations are "white": they are unpredictable and appear to occur randomly, with little apparent dependence on previous observations. The concepts of entropy used to measure properties like predictability or order do extend to time series data, in the form of what is termed the entropy rate (Cover and Thomas, 1991). These properties have been widely studied with the two popular and related measures of approximate entropy and sample entropy. The fundamental calculation of both of these methods involves the counting of template matches, which is basically part of a multivariate kernel density estimate using a uniform kernel. Within the framework of Renyi entropy, these two measures correspond to orders q = 1 and q = 2, respectively. We now briefly describe these methods to show their relation to the results on kernel density estimation presented here.

For a time series $x_1, x_2, \ldots, x_N$, let $x_m(i)$ denote the m points $x_i, x_{i+1}, \ldots, x_{i+m-1}$, which we call a template and can be considered a vector of length m. An instance where all the components of the vector $x_m(j)$ are within a distance r of $x_m(i)$ is called a template match. The quantity r is essentially the bandwidth of a uniform kernel. Let $B_i$ denote the number of template matches with $x_m(i)$ and $A_i$ denote the number of template matches with $x_{m+1}(i)$. The quantity $p_i = A_i/B_i$ is an estimate of the conditional probability that the point $x_{j+m}$ is within r of $x_{i+m}$, given that $x_m(j)$ matches $x_m(i)$.

Pincus introduced the statistic approximate entropy, or ApEn, as a measure of regularity (Pincus, 1991). This can be calculated by

\[
\mathrm{ApEn}(m, r, N) = -\frac{1}{N-m}\sum_{i=1}^{N-m} \log\!\left(\frac{A_i}{B_i}\right) \tag{20.22}
\]

and is the negative average natural logarithm of this conditional probability. Self-matches are included in the original ApEn algorithm to avoid the
$p_i = 0/0$ indeterminate form, but this convention leads to noticeable bias, especially for smaller N and larger m. A related but more robust statistic called sample entropy, or SampEn, was introduced by Richman and Moorman, designed to reduce this bias by not including self-matches (Richman and Moorman, 2000). SampEn is calculated by

\[
\mathrm{SampEn}(m, r, N) = -\log\!\left(\sum_{i=1}^{N-m} A_i \Big/ \sum_{i=1}^{N-m} B_i\right), \tag{20.23}
\]

which is just the negative logarithm of an estimate of the conditional probability of a match of length m + 1 given a match of length m. As with quadratic entropy, SampEn has the added advantage that its statistical properties are more accessible than those of ApEn. The optimal bandwidths presented here provide a formal setting for evaluating and selecting the tolerance r. In particular, the matching part of these algorithms for templates of length 1 is proportional to results using the uniform kernel in Fig. 20.3 with $r = 3^{1/2} h = 1.732\,h$.
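A direct, unoptimized implementation of Eq. (20.23) is sketched below (saved as a file sampen.m); with m = 1 and r = 1.732h it corresponds to the unit-variance uniform kernel with bandwidth h, per the relation just noted.

```matlab
% Sketch: sample entropy, Eq. (20.23), counting template matches of
% lengths m and m+1 without self-matches; illustrative, not optimized.
function se = sampen(x, m, r)
N = numel(x);  A = 0;  B = 0;
for i = 1:N-m
    for j = 1:N-m
        if j == i, continue, end                     % no self-matches
        if max(abs(x(i:i+m-1) - x(j:j+m-1))) <= r    % length-m match
            B = B + 1;
            if abs(x(i+m) - x(j+m)) <= r             % extends to m+1
                A = A + 1;
            end
        end
    end
end
se = -log(A/B);                                      % Eq. (20.23)
end
```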
8. Summary and Future Work

Entropies are functionals of PDFs and can be effectively estimated using kernel density methods. The optimal bandwidths for the FT index, which is part of the calculation of quadratic entropy, have been presented. The optimal bandwidth tends to be smaller than that traditionally obtained to minimize the MISE and used by MATLAB. Estimating entropy for small samples can benefit from exact results, which are in closed form for the Gaussian signal and kernel case and can be evaluated numerically in other instances. Exact results for Gaussian mixtures are straightforward, but messy.

Future work includes extending these results to arbitrary entropies (q ≠ 2) and in particular Shannon entropy (q = 1). This is not a trivial undertaking, because the estimates are not simply an average and involve nonlinear functions, for example, the logarithm. The one-dimensional results (d = 1) also need to be extended to higher dimensions (d > 1). An important application of these results is estimating entropy for time-series data, which not only involves d > 1 but also has dependent data. These advances would be directly applicable to finding the optimal tolerance r for the template matching step in the calculation of SampEn.
ACKNOWLEDGMENTS

This work was supported by grant 0855399E from the American Heart Association, Mid-Atlantic Research Consortium. Yan Liu and Sida Peng provided support in checking the mathematical details of some of the formulas presented here, as well as investigating future directions in the area of nonparametric entropy estimation using kernel densities and other methods. My continued collaboration with Randall Moorman, MD, in the mathematical analysis of heart rate, including detecting atrial fibrillation in short records, provided ample clinical motivation for the development of the methods presented here.
REFERENCES

Beirlant, J., Dudewicz, E. J., Gyorfi, L., and van der Meulen, E. C. (1997). Nonparametric entropy estimation: An overview. Int. J. Math. Stat. Sci. 6(1), 17–39.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth and Brooks/Cole, Monterey, CA.
Costa, M., Goldberger, A. L., and Peng, C. K. (2002). Multiscale entropy analysis of complex physiologic time series. Phys. Rev. Lett. 89, 068102.
Cover, T. M., and Thomas, J. A. (1991). Elements of Information Theory. John Wiley and Sons, New York.
Erdogmus, D., Hild, K., Principe, J., Lazaro, M., and Santamaria, I. (2004). Adaptive blind deconvolution of linear channels using Renyi's entropy with Parzen window estimation. IEEE Trans. Signal Process. 52(6), 1489–1498.
Hyvarinen, A., Karhunen, J., and Oja, E. (2001). Independent Component Analysis. John Wiley and Sons, New York.
Jones, M. C., and Sibson, R. (1987). What is projection pursuit? J. R. Stat. Soc. A 150, 1–36.
Lake, D. E. (2006). Renyi entropy measures of heart rate Gaussianity. IEEE Trans. Biomed. Eng. 53(1), 21–27.
Lake, D. E., Richman, J. S., Griffin, M. P., and Moorman, J. R. (2002). Sample entropy analysis of neonatal heart rate variability. Am. J. Physiol. 283, R789–R797.
Pawlak, M. (1987). Contribution to the discussion of "What is projection pursuit?" by M. C. Jones and R. Sibson. J. R. Stat. Soc. A 150, 31–32.
Pincus, S. M. (1991). Approximate entropy as a measure of system complexity. Proc. Natl. Acad. Sci. 88, 2297–2301.
Rao, B. L. S. P. (1983). Nonparametric Functional Estimation. Academic Press, London.
Read, T., and Cressie, N. (1988). Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer-Verlag, New York.
Richman, J. S. (2004). Sample entropy statistics. Ph.D. dissertation, University of Alabama at Birmingham.
Richman, J. S., and Moorman, J. R. (2000). Physiological time series analysis using approximate entropy and sample entropy. Am. J. Physiol. 278, 2039–2049.
Ripley, B. (1996). Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge.
Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley, New York.
Shannon, C. E. (1997). The mathematical theory of communication (Reprinted). M D Comput. 14, 306–317.
C H A P T E R
T W E N T Y - O N E
Pancreatic Network Control of Glucagon Secretion and Counterregulation

Leon S. Farhy and Anthony L. McCall

Contents
1. Introduction
2. Mechanisms of Glucagon Counterregulation (GCR) Dysregulation in Diabetes
3. Interdisciplinary Approach to Investigating the Defects in the GCR
4. Initial Qualitative Analysis of the GCR Control Axis
4.1. β-Cell inhibition of α-cells
4.2. δ-Cell inhibition of α-cells
4.3. α-Cell stimulation of δ-cells
4.4. Glucose stimulation of β- and δ-cells
4.5. Glucose inhibition of α-cells
5. Mathematical Models of the GCR Control Mechanisms in STZ-Treated Rats
6. Approximation of the Normal Endocrine Pancreas by a Minimal Control Network (MCN) and Analysis of the GCR Abnormalities in the Insulin Deficient Pancreas
6.1. Dynamic network approximation of the MCN
6.2. Determination of the model parameters
6.3. In silico experiments
6.4. Validation of the MCN
6.5. In silico experiments with simulated complete insulin deficiency
6.6. Defective GCR response to hypoglycemia with the absence of a switch-off signal in the insulin deficient model
6.7. GCR response to switch-off signals in insulin deficiency
6.8. Reduction of the GCR response by high glucose conditions during the switch-off or by failure to terminate the intrapancreatic signal
6.9. Simulated transition from a normal physiology to an insulinopenic state
7. Advantages and Limitations of the Interdisciplinary Approach
8. Conclusions
Acknowledgement
References

Department of Medicine, Center for Biomathematical Technology, University of Virginia, Charlottesville, Virginia, USA

Methods in Enzymology, Volume 467, ISSN 0076-6879, DOI: 10.1016/S0076-6879(09)67021-X. © 2009 Published by Elsevier Inc.
Abstract

Glucagon counterregulation (GCR) is a key protection against hypoglycemia that is compromised in insulinopenic diabetes by an unknown mechanism. In this work, we present an interdisciplinary approach to the analysis of the GCR control mechanisms. Our results indicate that a pancreatic network which unifies a few explicit interactions between the major islet peptides and blood glucose (BG) can replicate the normal GCR axis and explain its impairment in diabetes. A key and novel component of this network is an α-cell auto-feedback, which drives glucagon pulsatility and mediates triggering of pulsatile GCR by hypoglycemia via a switch-off of the β-cell suppression of the α-cells. We have performed simulations based on our models of the endocrine pancreas which explain the in vivo GCR response to hypoglycemia of the normal pancreas and the enhancement of defective pulsatile GCR in β-cell deficiency by switch-off of intrapancreatic α-cell-suppressing signals. The models also predicted that reduced insulin secretion decreases and delays the GCR. In conclusion, based on experimental data, we have developed and validated a model of the normal GCR control mechanisms and their dysregulation in insulin deficient diabetes. One advantage of this construct is that all model components are clinically measurable, thereby permitting its transfer, validation, and application to the study of the GCR abnormalities of the human endocrine pancreas in vivo.
1. Introduction

Blood glucose (BG) homeostasis is maintained by a complex, ensemble control system characterized by a highly coordinated interplay between and among various hormone and metabolite signals. One of its key components, the endocrine pancreas, responds dynamically to changes in BG and to nutrient, neural, and other signals by releasing insulin and glucagon in a pulsatile manner to regulate glucose production and metabolism. Abnormalities in the secretion and interaction of these two hormones mark the progression of many diseases, including diabetes, but also the metabolic syndrome, the polycystic ovary syndrome, and others. Diminished or complete loss of endogenous insulin secretion in diabetes is closely associated with failure
of the pancreas to respond with proper glucagon secretion not only to hyper- but also to hypoglycemia. The latter is not caused by loss of glucagon-secreting α-cells, but instead is due to defects in glucagon counterregulation (GCR) signaling through an unknown mechanism. This failure is generally recognized as a major barrier to safe treatment of diabetes (Cryer and Gerich, 1983; Gerich, 1988), since unopposed hypoglycemia can cause coma, seizures, or even death (Cryer, 1999, 2002, 2003).

Our recent experimental (Farhy et al., 2008) and mathematical modeling (Farhy and McCall, 2009; Farhy et al., 2008) results show that a novel understanding of the defects in the GCR control mechanisms can be gained if these are viewed as abnormalities of the network of intrapancreatic interactions that control glucagon secretion, rather than as defects in an isolated molecular interaction or pathway. In particular, we have demonstrated that in a β-cell-deficient rat model the GCR control mechanisms can be approximated by a simple feedback network (construct) of dose–response interactions between BG and the islet peptides. Within the framework of this construct, the defects of the GCR response to hypoglycemia can be explained by loss of rapid switch-off of β-cell signaling during hypoglycemia to trigger an immediate GCR response. These results support the "switch-off" hypothesis, which posits that α-cell activation during hypoglycemia requires both the availability and rapid decline of intraislet insulin (Banarer et al., 2002). They also extend this hypothesis by refocusing from the lack of endogenous insulin signaling to the α-cells as a sole mechanistic explanation and instead focusing on possible abnormalities in the general way by which the β-cells regulate the α-cells.

In addition, the experimental and theoretical modeling data collected so far indicate that the GCR control network must have two key features: a (direct or indirect) feedback of glucagon-secreting α-cells on themselves (auto-feedback) and a (direct or indirect) negative regulation of glucagon by BG. In our published model these two properties are mediated by δ-cell somatostatin, and we have shown that such connectivity adequately explains our [and others' (Zhou et al., 2004)] experimental data (Farhy and McCall, 2009; Farhy et al., 2008).

The construct we proposed recently (Farhy and McCall, 2009; Farhy et al., 2008) is suitable for the study and analysis of rodent physiology, but the explicit involvement of somatostatin limits its applicability to clinical studies, since in the human, pancreatic somatostatin cannot be reliably measured; therefore, the ability of the model to describe adequately the human physiology and its potential differences from rodent physiology cannot be verified. In the current work, we review our existing models and show that a control network in which somatostatin is not explicitly involved (but incorporated implicitly) can also adequately approximate the GCR control mechanisms. We confirm that the (new) construct can substitute for the older, more complex construct by verifying that it explains the same
experimental observations already shown to be reconstructed by the older network (Farhy and McCall, 2009; Farhy et al., 2008). We also demonstrate that the newer network can explain the regulation of the normal pancreas by BG and the gradual reduction in the GCR response to hypoglycemia during the transition from a normal to an insulin-deficient state. The result is a model of GCR regulation that provides a more precise description of the components most critical to the system. This model can be applied to study the abnormalities in glucagon secretion and counterregulation and to identify hypothetical ways to repair them, not only in the rodent but also in the human.
2. Mechanisms of Glucagon Counterregulation (GCR) Dysregulation in Diabetes

Studies of tight BG control in type 1 and type 2 diabetes to prevent chronic hyperglycemia-related complications have found a threefold excess of severe hypoglycemia (The Action to Control Cardiovascular Risk in Diabetes Study Group, 2008; The Diabetes Control and Complications Trial Research Group, 1993; The UK Prospective Diabetes Study Group, 1998). Hypoglycemia impairs quality of life and risks coma, seizures, accidents, brain injury, and death. Severe hypoglycemia is usually due to overtreatment against a background of delayed and deficient hormonal counterregulation. In health, GCR curbs dangerously low BG nadirs and stimulates quick recovery from hypoglycemia (Cryer and Gerich, 1983; Gerich, 1988). However, in type 1 (Fukuda et al., 1988; Gerich et al., 1973; Hoffman et al., 1994) and type 2 diabetes (Segel et al., 2002), the GCR is impaired by uncertain mechanisms; when this impairment is accompanied by a loss of epinephrine counterregulation, it leads to severe hypoglycemia and thus presents a major barrier to safe treatment of diabetes (Cryer, 1999, 2002). Understanding the mechanisms that mediate GCR, how it becomes dysregulated, and how it can be repaired is therefore a major challenge in the struggle for safer treatment of diabetes.

Despite more than 30 years of research, the mechanisms by which hypoglycemia stimulates GCR and by which GCR is impaired in diabetes have yet to be elucidated (Gromada et al., 2007). First described by Gerich et al. (1973), defective GCR is common after about 10 years of T1DM. The loss of GCR appears to be more rapid with a very young age of onset and may occur within a few years after the onset of T1DM. Although unproven, the appearance of defective GCR seems to parallel insulin secretory loss in these patients. The defect appears to be stimulus specific, since α-cells retain their ability to secrete glucagon in response to other stimuli, such as arginine (Gerich et al., 1973).

Three mechanisms have been proposed as potential sources of the impairment of GCR. Those that account for the stimulus specificity of the
defect include impaired BG-sensing in α-cells (Gerich et al., 1973) and/or autonomic dysfunction (Hirsch and Shamoon, 1987; Taborsky et al., 1998). The "switch-off" hypothesis envisions that α-cell activation by hypoglycemia requires both the availability and the rapid decline of intraislet insulin, and attributes the defect in the GCR in insulin deficiency to loss of an (insulin) "switch-off" signal from the β-cells (Banarer et al., 2002). These theories are not mutually exclusive, but they all can be challenged. For example, α-cells do not express GLUT2 transporters (Heimberg et al., 1996), and it is unclear whether the α-cell GLUT1 transporters can account for the rapid α-cell response to variations in BG (Heimberg et al., 1995). In addition, proglucagon mRNA levels are not altered by BG (Dumonteil et al., 2000), and it is debatable whether BG variations in the physiological range can affect the α-cells (Pipeleers et al., 1985). The switch-off hypothesis can also be disputed, since in α-cell-specific insulin receptor knockout mice the GCR response to hypoglycemia is preserved (Kawamori et al., 2009). Finally, the hypothesis of autonomic control contradicts evidence that blockade of epinephrine and acetylcholine actions did not reduce the GCR in humans (Hilsted et al., 1991), and that the denervated human pancreas still releases glucagon in response to hypoglycemia (Diem et al., 1990).

Recent in vivo experiments by Zhou et al. support the "switch-off" hypothesis. They have shown that, in STZ-treated rats, GCR is impaired but can be restored if intraislet insulin is reestablished and then decreased (switched off) during hypoglycemia (Zhou et al., 2004). Additional in vitro and in vivo evidence to support the switch-off hypothesis has been reported (Hope et al., 2004; Zhou et al., 2007a). Whether insulin is the trigger of GCR in the studies by Zhou et al. (2004, 2007a) has been challenged by results from the same group, in which zinc ions, not the insulin molecule itself, provided the switch-off signal that initiated glucagon secretion during hypoglycemia (Zhou et al., 2007b).

In view of the above background, the mechanisms that control the secretion of glucagon and their dysregulation in diabetes are not well understood. This lack of understanding prevents restoration of normal GCR in patients with diabetes and blocks the development of treatments to effectively repair defective GCR and allow safer control of hyperglycemia. No such treatment currently exists.
3. Interdisciplinary Approach to Investigating the Defects in the GCR

The network underlying the GCR response to hypoglycemia includes hundreds of components from numerous pathways and targets in various pools and compartments. It would therefore be unfeasible to collect and
relate experimental data pertaining to all components of this network. Nevertheless, understanding the glucagon secretion control network is vital for furthering knowledge of the control of GCR and its compromise in diabetes, and for developing treatment strategies. To address this problem, we have taken a minimal model approach in which the system is simplified by clustering all known and unknown factors into a small number of explicit components. Initially, these components were chosen with the goal of testing whether recognized physiological relationships can explain key experimental findings. In our case, the first reports describing the in vivo enhancement of GCR by switch-off of insulin (Zhou et al., 2004) prompted us to propose a parsimonious model of the complex GCR control mechanisms, including relationships between the α- and δ-cells, BG, and switch-off signals (below). According to these initial efforts (Farhy et al., 2008), the postulated network explains the switch-off phenomenon by interpreting the GCR as a rebound. It further predicts that: (i) in β-cell deficiency, multiple α-cell-suppressing signals should enhance GCR if they are terminated during hypoglycemia, and (ii) the switch-off-triggered GCR must be pulsatile. The model-based predictions motivated a series of in vivo experiments, which showed that indeed, in STZ-treated male Wistar rats, intrapancreatic infusion of insulin or somatostatin followed by its switch-off during hypoglycemia enhances the pulsatile GCR response (Farhy et al., 2008). These experimental results confirmed that the proposed network is a good candidate for a model of the GCR control axis.

In addition to confirming the initial model predictions, our experiments also suggested some new features of the GCR control network, including indications that different α-cell-suppressing switch-off signals not only can enhance GCR in β-cell deficiency but do so via different mechanisms. For example, the results suggest a higher response to insulin switch-off and a more substantial suppression of glucagon by somatostatin (Farhy et al., 2008). To show that these observations are consistent with our network model, we had to extend it to reflect the assumption that the α-cell activity can be regulated differently by different α-cell-suppressing signals. We showed that this assumption can explain the difference in the GCR-enhancing action of two α-cell-suppressing signals (Farhy and McCall, 2009). The simulations suggest strategies to use α-cell inhibitors to manipulate the network and repair defective GCR. However, they also indicate that not all α-cell inhibitors may be suitable for that purpose, and that the infusion rate of those that are must be carefully selected. In this regard, a clinically verified and tested model of the GCR control axis can greatly enhance our ability to precisely and credibly simulate changes resulting from certain interventions, and ultimately will assist us in defining the best strategy to manipulate the system in vivo in humans. However, the explicit involvement of somatostatin and the δ-cells in our initial network and model limits the potential for clinical applications, as pancreatic somatostatin cannot be
reliably measured in the human in vivo, and the ability of the model to describe the human glucagon axis cannot be verified. To address this limitation we have recently reduced our initial network to a Minimal Control Network (MCN) of the GCR control axis in which somatostatin and the δ-cells are no longer explicitly involved, but their effects are implicitly incorporated in the model. Our analysis (presented below) shows that the new MCN is an excellent model of the GCR axis and can substitute for the older, more complex structure. Thereby, we have developed a model that can be verified clinically and used to assist the analysis of the GCR axis in vivo in humans. Importantly, the new model is not limited to β-cell deficiency and hypoglycemia only. In fact, it describes the transition from a normal to a β-cell-deficient state and can explain the failure of suppression of basal glucagon secretion in response to an increase in BG observed in this state. If it is confirmed experimentally that the MCN can successfully describe both the normal and the β-cell-deficient pancreas, future studies may focus on the defects of the pancreatic network not only in type 1 but also in type 2 diabetes or, more generally, in any pathophysiological condition that is accompanied by metabolic abnormalities of the endocrine pancreas.
4. Initial Qualitative Analysis of the GCR Control Axis

To understand the mechanisms of GCR and their dysregulation, the pancreatic peptides have been extensively studied, and much evidence suggests that a complex network of interacting pathways modulates glucagon secretion and GCR. Some of the well-documented relationships between different islet cell signals are summarized in the following subsections.
4.1. β-Cell inhibition of α-cells

Pancreatic perfusions with antibodies to insulin, somatostatin, and glucagon have suggested that the blood within the islets flows from β- to α- to δ-cells in dogs, rats, and humans (Samols and Stagner, 1988, 1990; Stagner et al., 1988, 1989). It was then proposed that insulin regulates glucagon, which in turn regulates somatostatin. Various β-cell signals provide an inhibitory stimulus to the α-cells and suppress glucagon. These include cosecreted insulin, zinc, GABA, and amylin (Gedulin et al., 1997; Gromada et al., 2007; Ishihara et al., 2003; Ito et al., 1995; Maruyama et al., 1984; Rorsman and Hellman, 1988; Rorsman et al., 1989; Samols and Stagner, 1988; Wendt et al., 2004; Xu et al., 2006). In particular, β-cells store and secrete GABA, which can diffuse to neighboring cells and bind to GABAA receptors, localized within the islets only on α-cells (Rorsman and Hellman, 1988; Wendt et al., 2004). Insulin can directly
suppress glucagon by binding to its own receptors (Kawamori et al., 2009) or to IGF-1 receptors on the α-cells (Van Schravendijk et al., 1987). Insulin also translocates and activates GABAA receptors on the α-cells, which leads to membrane hyperpolarization and, ultimately, suppresses glucagon. Hence, insulin may directly inhibit the α-cells and indirectly potentiate the effects of GABA (Xu et al., 2006). Infusion of amylin in rats inhibits arginine-stimulated glucagon (Gedulin et al., 1997), but not the GCR to hypoglycemia (Silvestre et al., 2001); similar results were found with the synthetic amylin analog pramlintide (Heise et al., 2004), even though in some studies hypoglycemia was increased; it is unclear whether this is a GCR effect or is related to failure to reduce mealtime insulin adequately (McCall et al., 2006). Finally, a negative effect of zinc on glucagon has been proposed (Ishihara et al., 2003), including a role in the control of GCR (Zhou et al., 2007b). The role of zinc remains unclear, however, as zinc ions do not suppress glucagon in the mouse (Ravier and Rutter, 2005).
4.2. δ-Cell inhibition of α-cells

Exogenous somatostatin inhibits insulin and glucagon; however, the role of the endogenous hormone is controversial (Brunicardi et al., 2001, 2003; Cejvan et al., 2003; Gopel et al., 2000a; Klaff and Taborsky, 1987; Kleinman et al., 1994; Ludvigsen et al., 2004; Portela-Gomes et al., 2000; Schuit et al., 1989; Strowski et al., 2000; Sumida et al., 1994; Tirone et al., 2003). The concept that δ-cells are downstream of the α- and β-cells favors the perception that, in vivo, intraislet somatostatin cannot directly suppress the α- or β-cell through the islet microcirculation (Samols and Stagner, 1988, 1990; Stagner et al., 1988, 1989). On the other hand, the pancreatic α- and β-cells express at least one of the somatostatin receptors (SSTR1-5) (Ludvigsen et al., 2004; Portela-Gomes et al., 2000; Strowski et al., 2000), and recent in vitro studies involving somatostatin immunoneutralization (Brunicardi et al., 2001) or application of selective antagonists of different somatostatin receptors suggest that intraislet somatostatin inhibits the release of glucagon (Cejvan et al., 2003; Strowski et al., 2000). In addition, δ-cells are in close proximity to α-cells in rat and human islets, and δ-cell processes have been observed to extend into α-cell clusters in rat islets (Kleinman et al., 1994, 1995). Therefore, somatostatin may act via common gap junctions or by diffusion through the islet interstitium.
4.3. α-Cell stimulation of δ-cells

The ability of endogenous glucagon to stimulate δ-cell somatostatin is supported by a study in which administration of glucagon antibodies in the perfused human pancreas resulted in inhibition of somatostatin release (Brunicardi et al., 2001). Earlier immunoneutralization perfusions of the
rat or dog pancreas also showed that glucagon stimulates somatostatin (Stagner et al., 1988, 1989). The glucagon receptor was found to colocalize with 11% of immunoreactive somatostatin cells (Kieffer et al., 1996), suggesting that the α-cells may directly regulate some of the δ-cells. Exogenous glucagon also stimulates somatostatin (Brunicardi et al., 2001; Epstein et al., 1980; Kleinman et al., 1995; Utsumi et al., 1979). Finally, glutamate, which is cosecreted with glucagon under low-glucose conditions, stimulates somatostatin release from diencephalic neurons in primary culture (Tapia-Arancibia and Astier, 1988), and a similar relation could exist in the islets of the pancreas.
4.4. Glucose stimulation of β- and δ-cells

It is well established that hyperglycemia directly stimulates β-cells, which react instantaneously to changes in BG (Ashcroft et al., 1994; Bell et al., 1996; Dunne et al., 1994; Schuit et al., 2001). Additionally, it has been proposed that δ-cells have a glucose-sensing mechanism similar to that of β-cells (Fujitani et al., 1996; Gopel et al., 2000a) and, consequently, that somatostatin release is increased in response to glucose stimulation (Efendic et al., 1978; Hermansen et al., 1979), possibly via a Ca²⁺-dependent mechanism (Hermansen et al., 1979).
4.5. Glucose inhibition of α-cells

Hyperglycemia has been proposed to inhibit glucagon, even though hypoglycemia alone appears insufficient to stimulate high-amplitude GCR (Gopel et al., 2000b; Heimberg et al., 1995, 1996; Reaven et al., 1987; Rorsman and Hellman, 1988; Schuit et al., 1997; Unger, 1985).

In addition to the above, mostly consensus, findings, which show that the α-cell activity is controlled by multiple intervening pathways, there is other indirect evidence suggesting that the dynamic relationships between the islet signals are important for the regulation of glucagon secretion and GCR. For example, the concept is supported by the pulsatility of the pancreatic hormones (Genter et al., 1998; Grapengiesser et al., 2006; Grimmichova et al., 2008), which implies feedback control (Farhy, 2004), and by results suggesting that: insulin and somatostatin pulses are in phase (Jaspan et al., 1986; Matthews et al., 1987); pulses of insulin and glucagon recur with a phase shift (Grapengiesser et al., 2006); pulses of somatostatin and glucagon appear in antisynchronous fashion (Grapengiesser et al., 2006); and insulin pulses entrain α- and δ-cell oscillations (Salehi et al., 2007). A pancreatic network consistent with these findings is shown in Fig. 21.1. It summarizes (mostly consensus) interactions between BG and the β-, α-, and δ-cells: somatostatin (or, more generally, the δ-cells) is stimulated by glucagon (α-cells) and BG; glucagon (α-cells) is inhibited by the δ-cells (by somatostatin) and by β-cell signals; and BG stimulates the β-cells.

Figure 21.1 Schematic presentation of a network model of the GCR control mechanisms in STZ-treated rats.

This network could easily explain the
GCR response to hypoglycemia. Indeed, hypoglycemia would decrease both β- and δ-cell activity, which would entail increased release of glucagon from α-cells after the suppression from the neighboring β- and δ-cells is removed. However, it is not apparent whether this network can explain the defect in GCR observed in β-cell deficiency, or the above-mentioned restoration of defective GCR by a switch-off. This dampens the appeal of the network as a simple unifying hypothesis for the regulation of GCR and for the compromise of this regulation in diabetes. The difficulty of intuitively reconstructing the properties of the network emerges from the surprisingly complex behavior of this system due to the α–δ-cell feedback loop. Shortly after the first reports describing the in vivo repair of GCR by intrapancreatic infusion and switch-off of insulin (Zhou et al., 2004), we applied mathematical modeling to analyze and reconstruct the GCR control network. These considerations demonstrated that the network in Fig. 21.1 can explain the switch-off effect (Farhy and McCall, 2009; Farhy et al., 2008). We have also presented experimental evidence to support these model predictions (Farhy et al., 2008). These efforts are described in the following section.
5. Mathematical Models of the GCR Control Mechanisms in STZ-Treated Rats

We have developed and validated (Farhy and McCall, 2009; Farhy et al., 2008) a mathematical model of the GCR control mechanisms in the β-cell-deficient rat pancreas which explains two key experimental observations: (a) in STZ-treated rats, the rebound GCR triggered by a switch-off signal (a signal that is intrapancreatically infused and terminated during hypoglycemia) is pulsatile; and (b) the switch-off of either
somatostatin or insulin enhances the pulsatile GCR. The basis of this mathematical model is the network outlined in Fig. 21.1, which summarizes the major interactive mechanisms of glucagon secretion in β-cell deficiency by selected consensus interactions between plasma glucose, α-cell-suppressing switch-off signals, α-cells, and δ-cells. We should note that the β-cells were part of the network proposed in Farhy et al. (2008), but not part of the corresponding mathematical model, which was designed to approximate the insulin-deficient pancreas.

In addition to explaining glucagon pulsatility during hypoglycemia and the switch-off responses mentioned above, this construct predicts each of the following experimental findings in diabetic STZ-treated rats:

(i) Glucagon pulsatility during hypoglycemia after a switch-off, with pulses recurring at 15–20 min, as suggested by the results of the pulsatility deconvolution analysis we have previously performed (Farhy et al., 2008);
(ii) Pronounced (almost fourfold increase over baseline) pulsatile glucagon response following a switch-off of either insulin or somatostatin during hypoglycemia (Farhy et al., 2008);
(iii) Restriction of the GCR enhancement by insulin switch-off under high BG conditions (Zhou et al., 2004);
(iv) Lack of a GCR response to hypoglycemia when there is no switch-off signal (Farhy et al., 2008);
(v) Suppression of GCR when insulin is infused into the pancreas but not switched off during hypoglycemia (Zhou et al., 2004);
(vi) More than 30% higher GCR response to insulin vs. somatostatin switch-off (Farhy et al., 2008);
(vii) Better glucagon suppression by somatostatin than by insulin before a switch-off (Farhy et al., 2008).

We note that in our prior study (Farhy et al., 2008) the comparisons between insulin and somatostatin switch-off in (vi) and (vii) were not significant. However, the difference in (vii) was close to significance at p = 0.07. Therefore, one of the goals of the latter study (Farhy and McCall, 2009) was to test in silico whether the differences (vi), a higher GCR response to insulin switch-off, and (vii), a better glucagon suppression by somatostatin switch-off, were likely and could be predicted by the model of the insulin-deficient pancreas (Fig. 21.1).

To demonstrate the above predictions we used dynamic network modeling and formalized the network shown in Fig. 21.1 as a system of nonlinear ordinary differential equations approximating the glucagon and somatostatin concentration rates of change under the control of switch-off signals and BG. We were then able to adjust the model parameters to reconstruct the experimental findings listed in (i)–(vii), which validates the model based on the network shown in Fig. 21.1.
The model equations are:

$$GL' = -k_{GL}\,GL + r_{basal}\,\frac{1}{1 + I_1(t)} + r_{GL}\,\frac{1}{1 + \left[SS(t - D_{SS})/t_{SS}\right]^{n_{SS}}}\cdot\frac{1}{1 + I_2(t)} \qquad (21.1)$$

$$SS' = -k_{SS}\,SS + r_{SS}\,\frac{\left[GL(t - D_{GL})/t_{GL}\right]^{n_{GL}}}{1 + \left[GL(t - D_{GL})/t_{GL}\right]^{n_{GL}}} + b_{SS}\,\frac{\left[BG(t)/t_{BG}\right]^{n_{BG}}}{1 + \left[BG(t)/t_{BG}\right]^{n_{BG}}} \qquad (21.2)$$
Here, GL(t), SS(t), BG(t), I1(t), and I2(t) denote the concentrations of glucagon, somatostatin, blood glucose, and the exogenous switch-off signal(s) [acting on the pulsatile and/or the basal glucagon secretion], respectively; the prime denotes the derivative with respect to time t. The meaning of the remaining parameters is explained in the following section. We note that the presence of two terms, I1(t) and I2(t), representing the switch-off signal in Eq. (21.1) reflects the assumption that different switch-off signals may have a different impact on glucagon secretion and may suppress differently the basal and/or the δ-cell-regulated α-cell release.

We have used the above model (Farhy and McCall, 2009) to show that the glucagon control axis postulated in Fig. 21.1 is consistent with the experimental findings (i)–(vii) above, and we showed that insulin and somatostatin affect differently the basal and the system-regulated α-cell activity. After the model was validated, we used it to predict the outcome of different switch-off strategies and to explore their potential to improve GCR in β-cell deficiency (Fig. 21.2; Farhy and McCall, 2009). The figure summarizes results from in silico experiments tracking the dynamics of glucagon from time t = 0 h (start) to t = 4 h (end). In some simulations, intrapancreatic infusion of insulin or somatostatin started at t = 0.5 h and was either continued to the end or switched off at t = 2.5 h. When hypoglycemia was simulated, BG = 110 mg/dL from t = 0 h to t = 2 h, glucose decline started at t = 2 h, BG = 60 mg/dL at t = 2.5 h (the switch-off point), and at the end of the simulations (t = 4 h) BG = 43 mg/dL. At the top of the bar graph (a), we show baseline results without switch-off signals. The black bar illustrates the glucagon level before t = 2 h, the time at which BG = 110 mg/dL and glucagon would be maximally suppressed if a switch-off signal were present. The white and the gray bars illustrate the maximal glucagon response in the 1 h interval from t = 2.5 h to t = 3.5 h without (white) and with (gray) the hypoglycemia stimulus. This interval corresponds to the 1 h interval after a switch-off in all other simulations. The black and white bars are the same, since glucagon levels remain unchanged if there is no hypoglycemia. Each subsequent set of three bars indicates these effects with a single switch-off [(b) and (c)], a combined switch-off (d), no switch-off of a single signal [(e) and (f)], a mixture of switch-off
and no switch-off for the two signals [(g) and (h)], and no switch-off for the combination of the two signals (i). Thus, the bar graph gives the following glucagon concentrations: glucagon suppressed by the intrapancreatic signal (black bars: the glucagon concentration immediately before the onset of BG decline at t = 2 h, when glucagon is maximally suppressed by the intrapancreatic infusion and not yet affected by the decline in glucose); the GCR response to a switch-off if hypoglycemia was not induced (white bars: the maximal glucagon concentration achieved within a 1 h interval after the switch-off); and the GCR response if hypoglycemia was induced (gray bars: the maximal glucagon concentration achieved within a 1 h interval after the switch-off). The graph also includes the maximal fold increase in glucagon in response to a switch-off during hypoglycemia relative to the glucagon levels before the onset of BG decline.

Thus, we concluded that the impact of an α-cell inhibitor on the GCR depends on the nature of the signal and the mode of its delivery. These comparisons between strategies for manipulating the network to enhance the GCR by a switch-off revealed a good potential of a combined switch-off to
amplify the benefits provided by each of the individual signals (Farhy and McCall, 2009), and even a potential to explore scenarios in which the α-cell-suppressing signal is not terminated.

Figure 21.2 Summary of the model-predicted GCR responses to different switch-off signals with or without simulated hypoglycemia (see text for more detail). SO, switch-off; no SO, the signal was not switched off; SS, somatostatin; INS, insulin. The maximal fold increases in glucagon annotated on the bars were: (a) no SO, 1.4; (b) SS (SO), 3.8; (c) INS (SO), 3.9; (d) SS (SO) + INS (SO), 10.2; (e) SS (no SO), 1.6; (f) INS (no SO), 2.4; (g) SS (no SO) + INS (SO), 3.2; (h) SS (SO) + INS (no SO), 8.6; (i) SS (no SO) + INS (no SO), 2.9. Modified from Farhy and McCall (2009).
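To make the structure of Eqs. (21.1) and (21.2) concrete, the following minimal sketch (our illustration, not the published implementation) translates their right-hand sides into Python. The parameter values are left to the caller, and the delayed terms SS(t − D_SS) and GL(t − D_GL) are assumed to be supplied from a stored history.

```python
# Minimal sketch of the right-hand sides of Eqs. (21.1) and (21.2).
# Parameter values are supplied by the caller; delayed values
# SS(t - D_SS) and GL(t - D_GL) must be looked up from a stored history.

def dGL_dt(GL, SS_delayed, I1, I2, k_GL, r_basal, r_GL, t_SS, n_SS):
    """Eq. (21.1): glucagon rate of change."""
    basal = r_basal / (1.0 + I1)                        # basal release, suppressed by I1
    pulsatile = (r_GL
                 / (1.0 + (SS_delayed / t_SS) ** n_SS)  # delayed somatostatin inhibition
                 / (1.0 + I2))                          # switch-off signal on pulsatile release
    return -k_GL * GL + basal + pulsatile

def dSS_dt(SS, GL_delayed, BG, k_SS, r_SS, b_SS, t_GL, n_GL, t_BG, n_BG):
    """Eq. (21.2): somatostatin rate of change."""
    glucagon_drive = (GL_delayed / t_GL) ** n_GL
    glucose_drive = (BG / t_BG) ** n_BG
    return (-k_SS * SS
            + r_SS * glucagon_drive / (1.0 + glucagon_drive)   # delayed glucagon stimulation
            + b_SS * glucose_drive / (1.0 + glucose_drive))    # BG stimulation
```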
6. Approximation of the Normal Endocrine Pancreas by a Minimal Control Network (MCN) and Analysis of the GCR Abnormalities in the Insulin-Deficient Pancreas

The explicit involvement of somatostatin in the model described above limits its potential clinical application, as pancreatic somatostatin cannot be reliably measured in the human in vivo, and the ability of the model to describe the human glucagon axis cannot be verified. It is, however, possible to simplify the network so that somatostatin is no longer explicitly involved but is incorporated implicitly. In the original model shown in Fig. 21.1, somatostatin appears in two compound pathways: the α-cell → δ-cell → α-cell feedback loop and the BG → δ-cell → α-cell pathway. By virtue of its interactions in the α-cell → δ-cell → α-cell pathway, the α-cells effectively control their own activity, and therefore this pathway can be replaced by a delayed α-cell → α-cell auto-feedback loop. Such regulation is also consistent with reports that glucagon may directly suppress its own release (Kawai and Unger, 1982), possibly by binding to glucagon receptors located on a subpopulation of the α-cells (Kieffer et al., 1996) or by other autocrine mechanisms. Through the BG → δ-cell → α-cell pathway, blood glucose downregulates the release of glucagon, with the action mediated by somatostatin. Therefore, this pathway can be simplified and substituted by a direct BG → α-cell interaction. The outcome of the described procedure of network reduction is a new Minimal Control Network (MCN) of the GCR control mechanisms in which somatostatin and the δ-cells are no longer explicitly involved (Fig. 21.3). As originally proposed in our prior work (Farhy et al., 2008), the β-cells of the normal pancreas are now part of the MCN (and of the mathematical model). This feature also extends the physiological relevance of the model. The β-cells are assumed to be stimulated by hyperglycemia and to suppress the activity of the α-cells. The latter action is based on extensive data showing that the β-cells (co)release a variety of signals, including insulin, GABA, zinc, and amylin, all of which are known to suppress the α-cell activity (Gedulin et al., 1997; Ishihara et al., 2003; Ito et al., 1995; Reaven et al., 1987; Rorsman and Hellman, 1988; Samols and Stagner, 1988; Van Schravendijk et al., 1987; Wendt et al., 2004; Xu et al., 2006). In addition, it has been reported that the pulses of insulin and glucagon recur with a phase shift (Grapengiesser et al., 2006), which is
consistent with the postulated negative regulation of the α-cells by the β-cells. An extensive background justifying all postulated MCN relationships was presented in Section 4.

Figure 21.3 A Minimal Control Network (MCN) of the interactions between BG and the α- and β-cells postulated to regulate the GCR in the normal pancreas. In this network the δ-cells are not represented explicitly.
6.1. Dynamic network approximation of the MCN

Similar to the analysis of the old network, dynamic network modeling methods are used to study the properties of the MCN shown in Fig. 21.3. In particular, two differential equations approximate the glucagon and insulin concentration rates of change:

$$GL' = -k_{GL}\,GL + r_{GL,basal}\,\frac{t_{INS}}{t_{INS} + INS} + r_{GL}\,\frac{1}{1 + (BG/t_{BG})^{n_{BG}}}\cdot\frac{1}{1 + \left[GL(t - D_{GL})/t_{GL}\right]^{n_{GL}}}\cdot\frac{t_{INS}}{t_{INS} + INS} \qquad (21.3)$$

$$INS' = -k_{INS}\,INS + r_{INS}\,\frac{(BG/t_{BG,2})^{n_{BG,2}}}{1 + (BG/t_{BG,2})^{n_{BG,2}}} + r_{INS,basal}\cdot Pulse \qquad (21.4)$$
Here, GL(t), BG(t), and INS(t) denote the time-dependent concentrations of glucagon, blood glucose, and insulin (or of the exogenous switch-off signal in the β-cell-deficient model), respectively; the prime denotes the derivative with respect to time t. The term Pulse in Eq. (21.4) denotes a pulse generator specific to the β-cells, superimposed to guarantee the physiological relevance of the simulations. The meaning of the parameters is defined as follows: kGL and kINS are the rates of elimination of glucagon and insulin, respectively; rGL is the BG- and auto-feedback-regulated rate of release of glucagon;
rGL,basal is the basal rate of glucagon release; rINS is the BG-regulated rate of release of insulin; rINS,basal is the basal rate of insulin release; tINS is the half-maximal inhibitory dose (ID50) for the negative action of insulin on glucagon; tBG and tBG,2 are the half-maximal doses for BG (ID50 and ED50, respectively); tGL is the half-maximal inhibitory dose (ID50) for glucagon; nBG, nBG,2, and nGL are Hill coefficients describing the slopes of the corresponding dose–response interactions; and DGL is the delay in the auto-feedback.
6.2. Determination of the model parameters

The half-life (t1/2) of glucagon was assumed to be 2 min, to match the results of our pulsatility analysis (Farhy et al., 2008) and other published data. Therefore, we fixed the parameter kGL = 22 h⁻¹. The half-life of insulin was assumed to be 3 min, as suggested in the literature (Grimmichova et al., 2008). Therefore, to approximate insulin's t1/2, we fixed the parameter kINS = 14 h⁻¹. The remaining parameters used in the simulations were determined functionally, and some of the concentrations presented below are in arbitrary units (specifically, those related to insulin). These units, however, can easily be rescaled to match real concentrations. The delay in the auto-feedback, DGL = 7.2 min, was functionally determined, together with the potencies tBG = 50 mg/dL and tGL = 6 pg/mL and the sensitivities nBG = 5 and nGL = 5 in the auto-feedback control function, to guarantee that glucagon pulses during GCR recur at intervals of 15–20 min, corresponding to the number of pulses after a switch-off point detected in the pulsatility analysis (Farhy et al., 2008). The parameters rINS = 80,000 and rINS,basal = 270, together with the amplitude of the pulses of the pulse generator and the parameters tBG,2 = 400 mg/dL and nBG,2 = 3, were functionally determined to guarantee that BG is capable of stimulating a more than ninefold increase in insulin over baseline in response to a glucose bolus. The ID50, tINS = 20, was functionally determined based on the insulin concentrations, to guarantee that insulin withdrawal during hypoglycemia can trigger GCR. The glucagon release rate (rGL = 42,750 pg/mL/h) and basal secretion rate (rGL,basal = 2,128 pg/mL/h) were functionally determined so that a strong hypoglycemic stimulus can trigger a more than 10-fold increase in glucagon from the normal pancreas. The parameters of the pulse generator, Pulse, were chosen to generate, every 6 min, a square wave of height 10 over a period of 36 s, based on published reports of insulin pulses recurring every 4–12 min (Pørksen, 2002). We note that insulin pulsatility was modeled to mimic the variation of insulin in the portal vein, rather than in the circulation. This explains the deep nadirs between the pulses evident in the simulations. The parameter values of the model are summarized in Table 21.1.
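As a consistency check (our back-calculation, not part of the original parameter derivation), a first-order elimination rate constant follows from the half-life as $k = \ln 2 / t_{1/2}$:

$$k_{GL} = \frac{\ln 2}{2\ \text{min}} \approx 0.35\ \text{min}^{-1} \approx 21\ \text{h}^{-1}, \qquad k_{INS} = \frac{\ln 2}{3\ \text{min}} \approx 0.23\ \text{min}^{-1} \approx 14\ \text{h}^{-1},$$

in good agreement with the rounded values $k_{GL} = 22\ \text{h}^{-1}$ and $k_{INS} = 14\ \text{h}^{-1}$ used in the model.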
Table 21.1 Summary of core interactive constants in the auto-feedback MCN

| | Elimination rate (1/h) | Release rate (concentration/h) | ED50 or ID50 (concentration) | Slope | Delay (min) |
|---|---|---|---|---|---|
| Glucagon | kGL = 22 | rGL = 42,570 pg/mL/h; rGL,basal = 2,128 pg/mL/h | tGL = 85 pg/mL | nGL = 5 | DGL = 7.2 |
| BG | | | tBG = 50 mg/dL; tBG,2 = 400 mg/dL | nBG = 5; nBG,2 = 3 | |
| Insulin | kINS = 14 | rINS = 80,000; rINS,basal = 270 | tINS = 20 | | |
| Pulse | Periodic function: a square wave of height 10 over a period of 36 s, recurring every 6 min | | | | |
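To illustrate how Eqs. (21.3) and (21.4) and the Table 21.1 constants fit together, the following is a minimal simulation sketch. It is our illustrative re-implementation, not the authors' code: the authors integrated the model with a Runge–Kutta 4 scheme in Berkeley-Madonna (Section 6.4), whereas this sketch uses a simple fixed-step Euler scheme with a history buffer for the delayed glucagon term; the initial values and the BG profile are arbitrary assumptions.

```python
# Illustrative fixed-step Euler integration of the MCN (Eqs. 21.3 and 21.4)
# with the Table 21.1 constants. Time in hours, glucagon in pg/mL,
# BG in mg/dL, insulin in arbitrary units. Initial values are assumptions.

k_GL, r_GL, r_GL_basal = 22.0, 42570.0, 2128.0   # glucagon elimination/release
t_GL, n_GL, D_GL = 85.0, 5, 7.2 / 60.0           # auto-feedback ID50, slope, delay (h)
t_BG, n_BG = 50.0, 5                             # BG inhibition of glucagon
k_INS, r_INS, r_INS_basal = 14.0, 80000.0, 270.0 # insulin elimination/release
t_BG2, n_BG2, t_INS = 400.0, 3, 20.0             # BG stimulation of insulin; insulin ID50

def pulse(t):
    """Beta-cell pulse generator: square wave of height 10 lasting 36 s
    (0.01 h), recurring every 6 min (0.1 h)."""
    return 10.0 if (t % 0.1) < 0.01 else 0.0

def simulate(BG_of_t, t_end=5.0, dt=0.001, GL0=60.0, INS0=10.0):
    steps = int(t_end / dt)
    lag = int(D_GL / dt)                 # auto-feedback delay expressed in steps
    GL_hist = [GL0] * (lag + 1)          # history buffer supplying GL(t - D_GL)
    GL, INS, trace = GL0, INS0, []
    for i in range(steps):
        t = i * dt
        BG = BG_of_t(t)
        GL_delayed = GL_hist[-(lag + 1)]
        brake = t_INS / (t_INS + INS)    # insulin suppression of the alpha cells
        dGL = (-k_GL * GL
               + r_GL_basal * brake
               + r_GL * brake
                 / (1.0 + (BG / t_BG) ** n_BG)              # BG inhibition
                 / (1.0 + (GL_delayed / t_GL) ** n_GL))     # delayed auto-feedback
        drive = (BG / t_BG2) ** n_BG2
        dINS = (-k_INS * INS
                + r_INS * drive / (1.0 + drive)             # BG-stimulated release
                + r_INS_basal * pulse(t))                   # pulsatile basal release
        GL, INS = GL + dt * dGL, INS + dt * dINS
        GL_hist.append(GL)
        trace.append((t, GL, INS))
    return trace

# Example: euglycemia (110 mg/dL) for 2 h, then a linear decline to 60 mg/dL
bg_profile = lambda t: 110.0 if t < 2.0 else max(60.0, 110.0 - 100.0 * (t - 2.0))
results = simulate(bg_profile)
```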
6.3. In silico experiments

The simulations were performed as follows:

Simulation of the glucose input to the system. We performed two different simulations to mimic hypoglycemia: (a) a BG decline from 110 to 60 mg/dL in 1 h, and (b) a stepwise (1 h steps) decline in BG from 110 to 60 (same as in (a)), then to 45, and then to 42 mg/dL. The stepwise decline into hypoglycemia is intended to investigate a possible distinction between the model responses to 60 mg/dL (a) and to a stronger hypoglycemic stimulus (b); it also mimics a commonly employed human experimental condition (the staircase hypoglycemic clamp). To generate glucose profiles that satisfy (a) and (b), we used the equation BG′ = −3·BG + 3·step + 330, where the function step changes from 110 to 60, 45, and 42 mg/dL in 1-h steps. We then used the solution of this equation in Eqs. (21.3) and (21.4). Similarly, an increase of glucose was simulated by using the above equation and a step function which increases the BG levels from 110 to 240 mg/dL to mimic acute hyperglycemia.

Transition from a normal to an insulin-deficient state. The simulation was performed by gradually reducing to zero the amplitude of the pulses generated by the pulse generator, Pulse.

Simulation of intrapancreatic infusion of different α-cell-suppressing signals. These simulations were performed in an insulin-deficient model. Equation (21.4) is replaced by an equation which describes the dynamics of the infused signal:
$$SO' = -k_{SO}\,SO + Infusion(t)$$

Here, SO represents the concentration of the switch-off signal, that is, of an α-cell-suppressing signal that is abruptly terminated. The function Infusion describes the rate of its intrapancreatic infusion (equal to Height if the signal is being infused and to 0 otherwise), and kSO is its (functional) rate of elimination. The terms (1 + m1·SO) and (1 + m2·SO) are then used in Eq. (21.3) to divide the parameters rGL and rGL,basal, respectively, to simulate suppression of the α-cell activity by the signal. Differences in the parameters m1 and m2 model the unequal action of the infused signal on the basal and the BG/auto-feedback-regulated glucagon secretion. In particular, to simulate an insulin switch-off we used the parameters kSO = 3, Height = 55, m1 = 0.08, and m2 = 0.5; to simulate a somatostatin switch-off we used kSO = 3.5, Height = 10, m1 = 1, and m2 = 1.4. The parameters were functionally determined to explain our experimental observations (below) and the possible differences in the response to the two types of switch-off (Farhy et al., 2008). In particular, the action of exogenous insulin on the BG/auto-feedback-regulated and the basal glucagon secretion is distributed in a 1:6.3 ratio. Similar to our previous work (Farhy and McCall, 2009), exogenous insulin suppresses the basal more
than the pulsatile glucagon release; for somatostatin, the suppressive effect is more uniform, in a 1:1.4 ratio.
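The sketch below shows, under the same assumptions as the earlier simulation sketch (whose parameter definitions it reuses), how the switch-off experiment of this section can be coded for the insulin-like signal: Eq. (21.4) is dropped, the SO equation is integrated instead, and the glucagon release terms are divided by (1 + m1·SO) and (1 + m2·SO).

```python
# Switch-off experiment under complete insulin deficiency (illustrative).
# Reuses k_GL, r_GL, r_GL_basal, t_BG, n_BG, t_GL, n_GL from the earlier sketch.

def infusion(t, start=0.5, stop=2.0, height=55.0):
    """Intrapancreatic infusion rate: Height while infusing, 0 after switch-off."""
    return height if start <= t < stop else 0.0

def dSO_dt(SO, t, k_SO=3.0):
    """SO' = -k_SO*SO + Infusion(t); insulin-like signal (k_SO = 3, Height = 55)."""
    return -k_SO * SO + infusion(t)

def dGL_dt_deficient(GL, GL_delayed, BG, SO, m1=0.08, m2=0.5):
    """Eq. (21.3) with the INS terms removed (complete insulin deficiency) and
    the suppression terms added: r_GL / (1 + m1*SO), r_GL_basal / (1 + m2*SO)."""
    return (-k_GL * GL
            + r_GL_basal / (1.0 + m2 * SO)
            + (r_GL / (1.0 + m1 * SO))
              / (1.0 + (BG / t_BG) ** n_BG)
              / (1.0 + (GL_delayed / t_GL) ** n_GL))
```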
6.4. Validation of the MCN

To validate the new network, we performed an in silico study in three steps:
Demonstrate that the (new) MCN (Fig. 21.3) is compatible with the mechanism of GCR and the response to switch-off signals in insulin deficiency. We have already shown that our (original) network, which includes somatostatin as an explicit node, is consistent with key experimental data. To confirm that the (new) MCN can substitute for the older, more complex construct, we tested the hypothesis that it can approximate the same key experimental observations [(i) through (vii) listed at the beginning of Section 5] already shown to be predicted by the old network (Fig. 21.1).

Show that the mechanisms underlying the dysregulation of GCR in insulin deficiency can be explained by the MCN. To this end, we demonstrated that the BG-regulated MCN can explain (i) a high GCR response if the β-cells are intact and provide a potent switch-off signal to the α-cells; and (ii) a reduction of GCR following a simulated gradual decrease in insulin secretion mimicking the transition from normal physiology to an insulinopenic state.

Verify that the proposed MCN approximates the basic properties of the normal endocrine pancreas. Even though our primary goal is to explain the GCR control mechanisms and their dysregulation, we have demonstrated that the postulated MCN can explain the increase in insulin secretion and the decrease in glucagon release in response to BG stimulation.

The goal of this in silico study is to validate the MCN by demonstrating that the parameters of the mathematical model (Eqs. (21.3) and (21.4)) that approximates the MCN (Fig. 21.3) can be determined in such a way that the output of the model predicts certain general features of the in vivo system. The simulated profiles are therefore expected to reproduce the overall behavior of the system rather than to match exactly the experimentally observed individual hormone dynamics. To integrate the equations we used a Runge–Kutta 4 algorithm and its specific implementation within the software package Berkeley-Madonna.
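For readers who prefer not to rely on Berkeley-Madonna, one classical Runge–Kutta 4 step for a system y′ = f(t, y) can be sketched as below. This is the generic textbook scheme, not the authors' implementation; the delayed glucagon term can be supplied by sampling a stored history, as in the earlier Euler sketch.

```python
def rk4_step(f, t, y, dt):
    """One classical Runge-Kutta 4 step for y' = f(t, y); y is a list of floats,
    f returns a list of derivatives of the same length."""
    k1 = f(t, y)
    k2 = f(t + dt / 2, [yi + dt / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + dt / 2, [yi + dt / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + dt, [yi + dt * ki for yi, ki in zip(y, k3)])
    return [yi + dt / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
```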
6.5. In silico experiments with simulated complete insulin deficiency

We first demonstrate that the proposed MCN model, which has changed significantly since it was initially introduced (Farhy and McCall, 2009; Farhy et al., 2008), is consistent with the experimental observations in STZ-treated rats reported by us and others (Farhy et al., 2008; Zhou et al., 2004).
6.6. Defective GCR response to hypoglycemia in the absence of a switch-off signal in the insulin-deficient model

The plot in Fig. 21.4 (bottom left panel) shows the predicted lack of glucagon response to hypoglycemia if a switch-off signal is missing, a key observation reported in our experimental study (Farhy et al., 2008) and elsewhere (Zhou et al., 2004, 2007a,b). The system responds with only about a 30% increase in the pulse amplitude of glucagon in the 45 min interval after BG reaches 60 mg/dL, which agrees with our experimental observations (Fig. 21.4, top panels) and shows that the model satisfies condition (iv) from Section 5 (no GCR response to hypoglycemia without a switch-off signal).
Figure 21.4 The mean observed (top) and model-predicted (bottom) glucagon response to hypoglycemia and saline switch-off or no switch-off (left), insulin switch-off (middle), and somatostatin switch-off (right). The shaded area marks the period monitored in our experimental study. The simulations were performed with complete insulin deficiency.
6.7. GCR response to switch-off signals in insulin deficiency

The model response to a 1.5 h intrapancreatic infusion of insulin or somatostatin switched off at hypoglycemia (BG = 60 mg/dL) is shown in the bottom middle and right panels of Fig. 21.4. The infusion was initiated at time t = 0.5 h (arbitrary time units) and switched off at t = 2 h. A simulated gradual BG decline started at t = 1 h, with BG = 60 mg/dL at the switch-off point. The model response illustrates a pulsatile rebound glucagon secretion after the switch-off, reaching almost a fourfold increase in glucagon in the 45 min period after the switch-off as compared to the pre-switch-off levels, which is similar to the experimental observations (Fig. 21.4, top middle and right panels). Therefore, the model satisfies conditions (i), the pulsatility timing, and (ii), the pulsatility amplitude increase, from Section 5 with regard to insulin and somatostatin switch-off. In addition, the bottom middle and right panels of Fig. 21.4 demonstrate that the model satisfies conditions (vi), a more than 30% higher GCR response to insulin vs. somatostatin switch-off, and (vii), better glucagon suppression by somatostatin than by insulin before a switch-off, from Section 5. Of interest, the prediction that an insulin switch-off signal suppresses the basal more potently than the pulsatile glucagon release is similar to the prediction of the previous model (Farhy and McCall, 2009), and it is necessary to explain the difference between the insulin switch-off and the somatostatin switch-off (Fig. 21.4, middle vs. right panels). Note that the experimental data shown in the top panels of Fig. 21.4 were collected during our previous experimental study (Farhy et al., 2008). The pulsatility of glucagon is not apparent in the plots presented in Fig. 21.4, since they reflect averaged experimental data (n = 7 in the saline group and n = 6 in the insulin and somatostatin switch-off groups). In Farhy et al. (2008), glucagon pulsatility was confirmed on the individual profiles of glucagon measured in the circulation by deconvolution analysis, and the current simulations, which approximate the dynamics of glucagon in the portal circulation, agree well with these results.
6.8. Reduction of the GCR response by high glucose conditions during the switch-off or by failure to terminate the intrapancreatic signal

For comparison, Fig. 21.5 depicts the GCR response if an insulin signal was infused and switched off but hypoglycemia was not present (top panel), and if intrapancreatic insulin was infused but not switched off during hypoglycemia (bottom panel). In the first simulation, glucagon increases by only 60 pg/mL relative to the concentration at the switch-off point, and in the second simulation the GCR response is reduced approximately twofold as compared to the response depicted in Fig. 21.4 (bottom middle panel).
Figure 21.5 Model-predicted minimal absolute glucagon response to insulin switch-off if the intrapancreatic signal is terminated during euglycemia (top panel: glucagon increases minimally, by only 60 pg/mL over the concentration at the switch-off point) and to intrapancreatic insulin infusion if the signal is not switched off (bottom panel: glucagon increases by only 85 pg/mL over the concentration at the time when BG = 60 mg/dL, and only about twofold relative to baseline); by contrast, these values increase more than 3.5-fold when the switch-off occurs (see Fig. 21.4, bottom middle panel). All of these simulations were performed with complete insulin deficiency.
This result agrees with the observations reported in Zhou et al. (2004), which demonstrate a lack of significant increase in glucagon in this 1 h interval if insulin is not switched off. In an additional analysis (results not shown), we increased the simulated rate of infusion of the insulin switch-off signal fourfold, by increasing the parameter Height from 55 to 220 (see Section 6.3), and used a stronger hypoglycemic stimulus (40 mg/dL). The model responded with an increase in glucagon after the switch-off which reached concentrations above 800 pg/mL in the 1 h interval after the switch-off point. When the same signal was not terminated in this simulation, the response was restricted to a rise to only 180 pg/mL. This outcome reproduces more closely the observations in Zhou et al. (2004). Thus, the model satisfies conditions (iii), the restriction of the response to an insulin switch-off by high BG conditions, and (v), the absence of a pronounced GCR when no insulin switch-off is performed, as detailed in Section 5.
6.9. Simulated transition from a normal physiology to an insulinopenic state

One set of simulations was performed to evaluate the model-generated glucagon response to a stepwise BG decline into hypoglycemia with a normal and with an insulin-deficient pancreas. The response of the normal model shown in Fig. 21.6 (top panel) illustrates a pronounced glucagon response to hypoglycemia (about a fourfold increase when BG = 60 mg/dL and about a 14-fold increase over baseline when BG approaches 42.5 mg/dL). Of interest, the model predicts that when BG starts to fall, the high-frequency glucagon pulsatility during the basal period, entrained by the insulin pulses, will be replaced by low-frequency oscillations maintained by the α-cell auto-feedback. The model also predicts that a complete absence of BG-stimulated and basal insulin release will result in the following abnormalities in glucagon secretion and in the response to hypoglycemia (Fig. 21.6, bottom panel):
- A significant reduction in the fold glucagon response to hypoglycemia relative to baseline (only about a 1.3-fold increase when BG = 60 mg/dL and only about a threefold increase when BG approaches 42 mg/dL).
- A reduction in the absolute glucagon response to hypoglycemia (a 15% lower response when BG = 60 mg/dL and a 42% lower response when BG approaches 42.5 mg/dL).
- A delay in the GCR response (BG remains below 60 mg/dL for more than 1 h without any sizable change in glucagon).
- A 2.5-fold increase in basal glucagon.
- Disappearance of the insulin-driven high-frequency glucagon pulsatility.

Figure 21.6 Model-derived glucagon response to hypoglycemia (stepwise BG decline) in normal physiology with intact insulin release (top), and the predicted decrease and delay in GCR following a simulated removal of insulin secretion to mimic a transition from a normal to an insulin-deficient state (bottom).

A comparison between the model response to hypoglycemia when BG remains at 60 mg/dL (Fig. 21.4, lower left panel) and when it falls further, to about 42.5 mg/dL, in the staircase hypoglycemic clamp (Fig. 21.6, bottom panel) reveals the interesting model prediction that a sufficiently strong hypoglycemic stimulus may still evoke some delayed glucagon release. However, additional analysis (results not shown) disclosed that if the basal glucagon release (model parameter rGL,basal) were 15–20% higher, this response
would be completely suppressed. Therefore, the model predicts that GCR abnormalities may be due both to the lack of an appropriate switch-off signal and to a significant basal hyperglucagonemia. The same simulations were also performed under the assumption that BG declines only to 60 mg/dL and remains at that level, similar to the experiments depicted in the lower panels of Fig. 21.4 (results not shown). We found that the glucagon pulses released by the normal pancreas were about 47% lower, which stresses the importance of the strength of the hypoglycemic stimulus for the magnitude of the GCR response. Under conditions of complete absence of insulin, the weaker hypoglycemic stimulus evokes practically no response (this outcome has already been shown in Fig. 21.4, lower left panel), and the concentration of glucagon was 57% lower than the response stimulated by the stepwise decline (Fig. 21.6, bottom panel).

A second set of simulations was designed to test the hypothesis that the model of the MCN can correctly predict a typical increase in insulin secretion and a decrease in glucagon following an increase in BG. We also monitored how these two system responses change during a transition from normal physiology to an insulinopenic state. To this end, an increase in BG was simulated (see Section 6.3) with an elevation of the BG concentration from 110 to about 240 mg/dL in 1 h and then a return to normal over the next 1.5 h. The model-predicted response of the normal pancreas is shown in the top panel of Fig. 21.7. In this simulation, the BG-driven release of insulin increased almost ninefold, which caused significant suppression of glucagon release. The bottom plot in Fig. 21.7 illustrates the effect on the system response of a 100% reduction in BG-stimulated insulin release. As expected, insulin deficiency results in an increase of glucagon and a limited ability of hyperglycemia to suppress glucagon (Meier et al., 2006).
7. Advantages and Limitations of the Interdisciplinary Approach

A key conclusion of our model-based simulations is that some of the observed system behavior (such as the system response to a switch-off) emerges from the interplay between multiple components. Models like the networks in Figs. 21.1 and 21.3 are certainly not uncommon in endocrine research and typically exemplify regulatory hypotheses. Traditionally, such models are studied using methods that probe individual components or interactions in isolation from the rest of the system. This approach has been taken in the majority of the published studies that investigate the GCR regulation (see Section 4). The limitation of this approach is that the temporal relationships between the system components and the relative contribution of each interaction to the overall system behavior cannot be properly assessed.
Figure 21.7 Simulated progressive decline of the ability of glucose to suppress glucagon resulting from a gradual transition (same as in Fig. 21.6) from a normal physiology (top) to an insulinopenic state (bottom).
Therefore, especially when the model contains feedbacks, the component-by-component approach cannot answer the question of whether the model explains the system control mechanisms. The main reason for this limitation is that some key specifics of the system behavior, such as its capability to oscillate and to respond with a rebound to a switch-off, both require and are the result of the time-varying interactions of several components. If these are studied in isolation, little information will be gained about the dynamic behavior of this network-like mechanism. Numerous reports have documented that the glucagon control axis is indeed a complex network-like structure, and it therefore lends itself to an analysis of its dynamic behavior as a whole. This highlights both the significance and the necessity of the mathematical methods that we propose for analyzing the experimental data. Differential-equation-based modeling is perhaps the only way to estimate the dynamic interplay of the pancreatic hormones and their importance for GCR control.
Mathematical models have not previously been applied to study the GCR control mechanisms, but they have been used to explore other aspects of the control of BG homeostasis (Guyton et al., 1978; Insel et al., 1975; Steele et al., 1974; Yamasaki et al., 1984). For example, the minimal model of Bergman and colleagues, proposed in 1979 for estimating insulin sensitivity (Bergman et al., 1979), received considerable attention and further development (Bergman et al., 1987; Breda et al., 2001; Cobelli et al., 1986, 1990; Mari, 1997; Quon et al., 1994; Toffolo et al., 1995, 2001). We have previously used modeling methods to successfully estimate and predict the onset of counterregulation in T1DM patients (Kovatchev et al., 1999, 2000), as well as to study other complex endocrine axes (Farhy, 2004; Farhy and Veldhuis, 2003, 2004, 2005; Farhy et al., 2001, 2002, 2007). However, despite the proven utility of this methodology, our recent efforts were the first to apply a combination of network modeling and in vivo studies to dissect the GCR control axis (Farhy and McCall, 2009; Farhy et al., 2008).

The selected few MCN components cannot exhaustively recreate all signals that control the GCR. Indeed, in the normal pancreas, glucagon may control its own secretion via α/β-cell interactions. For example, human β-cells express glucagon receptors (Huypens et al., 2000; Kieffer et al., 1996), and exogenous glucagon stimulates insulin via glucagon and GLP-1 receptors (Huypens et al., 2000). One immunoneutralization study suggests that endogenous glucagon stimulates insulin (Brunicardi et al., 2001), while other results imply that α-cell glutamate may bind to receptors on β-cells to stimulate insulin and GABA (Bertrand et al., 1992; Inagaki et al., 1995; Uehara et al., 2004). It has recently been reported that in human islets, α-cell glutamate serves as a positive autocrine signal for glucagon release by acting on ionotropic glutamate receptors (iGluRs) on the α-cells (Cabrera et al., 2008). Thus, absence of functional β-cells may cause glutamate hypersecretion, followed by desensitization of the α-cell iGluRs, and ultimately by defects in GCR, as conjectured (Cabrera et al., 2008). Interestingly, a similar hypothesis, explaining the defective GCR in diabetes by a chronically increased α-cell activity due to lack of β-cell signaling, can be formulated based on our results. However, in our case hyperglucagonemia is the main reason for the GCR defects. The two hypotheses are not mutually exclusive, but ours can also explain the in vivo GCR pulsatility during hypoglycemia observed by us (Farhy et al., 2008) and others (Genter et al., 1998). Most importantly, the α-cell positive autoregulation is consistent with the negative delayed α-cell auto-feedback proposed here, which could be mediated in part by iGluR desensitization, as suggested (Cabrera et al., 2008). This autocrine regulation is implicitly incorporated in our model equations in the parameter rGL.

The β-cells may control the δ-cells, which are downstream from the β-cells in the order of intraislet vascular perfusion. However, in one study, anterograde infusion of insulin antibody in the perfused rat pancreas stimulated
both glucagon and somatostatin (Samols and Stagner, 1988), while another immunoneutralization study documented a decrease in somatostatin at high glucose concentrations (Brunicardi et al., 2001). Suppression of the α-cells by insulin (as proposed here) could explain this apparent contradiction. It is also possible that the δ-cells inhibit the β-cells (Brunicardi et al., 2003; Huypens et al., 2000; Schuit et al., 1989; Strowski et al., 2000). Finally, the MCN components are influenced by numerous extrapancreatic factors, some of which have important impacts on glucagon secretion and GCR, including autonomic input, catecholamines, growth hormone, ghrelin, and incretins (Gromada et al., 2007; Havel and Ahren, 1997; Havel and Taborsky, 1989; Heise et al., 2004). For example, the incretin GLP-1 inhibits glucagon, though the mechanism of this inhibition is still controversial (Gromada et al., 2007). Also, there are three major autonomic influences on the α-cell: sympathetic nerves, parasympathetic nerves, and circulating epinephrine, all of which are activated by hypoglycemia and are capable of stimulating glucagon and suppressing insulin (Bolli and Fanelli, 1999; Brelje et al., 1989; Taborsky et al., 1998). We cannot track all signals that control the GCR, and most of them have no explicit terms in our model. However, they are not omitted or considered unimportant. In fact, when we describe the MCN mathematically, we include the impact of the nervous system and other factors, even though they have no individual terms in the equations. Thus, the MCN unifies all factors that control glucagon release, based on the assumption that the primary physiological relationships that are explicit in the MCN are influenced by these factors.

The model-based simulations suggest that the postulated MCN model of regulation of the GCR is consistent with the experimental data. However, at this stage we cannot estimate how good this model is, and it is therefore hard to assess the validity of its predictions. The simulations can only reconstruct the general "averaged" behavior of the in vivo system, and new experimental data are required to support the important property that the model can explain the GCR response in individual animals. These data should come from interventional studies that manipulate the vascular input to the pancreas and analyze the corresponding changes in the output by collecting frequently sampled portal vein data for multiple hormones simultaneously. These data must then be analyzed with the mathematical model to estimate whether the MCN provides an objectively good description of the action of the complex GCR control mechanism. Note that with this approach we cannot establish the model-based inferences in "micro" detail, since they imply molecular mechanisms that are out of reach of the in vivo methodology. The approach cannot, nor is it intended to, address the microscopic behavior of the α-cells or the molecular mechanisms that govern this behavior. In this regard, insulin and glucagon (and somatostatin) should be viewed only as (macroscopic) surrogates for the activity of the different cell types under a variety of other intra- and extrapancreatic influences.
Even though it is usually not stated explicitly, simple models are always used in experimental studies and, especially in in vivo experiments, many factors are ignored or postulated to have no impact on the outcome. Using constructs like the ones described in this work to analyze hormone concentration data has the advantage that the underlying model is very explicit, incorporates multiple relationships, and uses well-established mathematical and statistical techniques to demonstrate its validity and to reconstruct the involved signals and pathways.
8. Conclusions

In the current work, we present our interdisciplinary efforts to investigate the system-level network control mechanisms that mediate the GCR and their abnormalities in diabetes, a concept as yet almost completely unexplored for the GCR. The results confirm the hypothesis that a streamlined model, which omits an explicit (but not implicit) somatostatin (δ-cell) node, entirely reproduces the results of our original, more complex models. Our new findings define more precisely the components that are most critical for the system and strongly suggest that a delayed α-cell auto-feedback plays a key role in GCR regulation. The results demonstrate that such regulation is consistent not only with most of the in vivo system behavior typical of the insulin-deficient pancreas, but also explains key features characteristic of the transition from a normal to an insulin-deficient state. A major advantage of the current model is that its only explicit components are BG, insulin, and glucagon. These are clinically measurable, which would allow the application of the new construct to the study of the control, function, and abnormalities of the human glucagon axis.
ACKNOWLEDGMENT

The study was supported by NIH/NIDDK grant R21 DK072095.
REFERENCES

Ashcroft, F. M., Proks, P., Smith, P. A., Ammala, C., Bokvist, K., and Rorsman, P. (1994). Stimulus-secretion coupling in pancreatic beta cells. J. Cell. Biochem. 55(Suppl), 54–65.
Banarer, S., McGregor, V. P., and Cryer, P. E. (2002). Intraislet hyperinsulinemia prevents the glucagon response to hypoglycemia despite an intact autonomic response. Diabetes 51(4), 958–965.
Bell, G. I., Pilkis, S. J., Weber, I. T., and Polonsky, K. S. (1996). Glucokinase mutations, insulin secretion, and diabetes mellitus. Annu. Rev. Physiol. 58, 171–186.
Bergman, R. N., Ider, Y. Z., Bowden, C. R., and Cobelli, C. (1979). Quantitative estimation of insulin sensitivity. Am. J. Physiol. 236, E667–E677.
Bergman, R. N., Prager, R., Volund, A., and Olefsky, J. M. (1987). Equivalence of the insulin sensitivity index in man derived by the minimal model method and the euglycemic glucose clamp. J. Clin. Invest. 79, 790–800.
Bertrand, G., Gross, R., Puech, R., Loubatieres-Mariani, M. M., and Bockaert, J. (1992). Evidence for a glutamate receptor of the AMPA subtype which mediates insulin release from rat perfused pancreas. Br. J. Pharmacol. 106(2), 354–359.
Bolli, G. B., and Fanelli, C. G. (1999). Physiology of glucose counterregulation to hypoglycemia. Endocrinol. Metab. Clin. North Am. 28, 467–493.
Breda, E., Cavaghan, M. K., Toffolo, G., Polonsky, K. S., and Cobelli, C. (2001). Oral glucose tolerance test minimal model indexes of beta-cell function and insulin sensitivity. Diabetes 50(1), 150–158.
Brelje, T. C., Scharp, D. W., and Sorenson, R. L. (1989). Three-dimensional imaging of intact isolated islets of Langerhans with confocal microscopy. Diabetes 38(6), 808–814.
Brunicardi, F. C., Kleinman, R., Moldovan, S., Nguyen, T. H., Watt, P. C., Walsh, J., and Gingerich, R. (2001). Immunoneutralization of somatostatin, insulin, and glucagon causes alterations in islet cell secretion in the isolated perfused human pancreas. Pancreas 23(3), 302–308.
Brunicardi, F. C., Atiya, A., Moldovan, S., Lee, T. C., Fagan, S. P., Kleinman, R. M., Adrian, T. E., Coy, D. H., Walsh, J. H., and Fisher, W. E. (2003). Activation of somatostatin receptor subtype 2 inhibits insulin secretion in the isolated perfused human pancreas. Pancreas 27(4), e84–e89.
Cabrera, O., Jacques-Silva, M. C., Speier, S., Yang, S. N., Köhler, M., Fachado, A., Vieira, E., Zierath, J. R., Kibbey, R., Berman, D. M., Kenyon, N. S., Ricordi, C., et al. (2008). Glutamate is a positive autocrine signal for glucagon release. Cell Metab. 7(6), 545–554.
Cejvan, K., Coy, D. H., and Efendic, S. (2003). Intra-islet somatostatin regulates glucagon release via type 2 somatostatin receptors in rats. Diabetes 52(5), 1176–1181.
Cobelli, C., Pacini, G., Toffolo, G., and Sacca, L. (1986). Estimation of insulin sensitivity and glucose clearance from minimal model: New insights from labeled IVGTT. Am. J. Physiol. 250, E591–E598.
Cobelli, C., Brier, D. M., and Ferrannini, E. (1990). Modeling glucose metabolism in man: Theory and practice. Horm. Metab. Res. Suppl. 24, 1–10.
Cryer, P. E. (1999). Hypoglycemia is the limiting factor in the management of diabetes. Diabetes Metab. Res. Rev. 15(1), 42–46.
Cryer, P. E. (2002). Hypoglycemia: The limiting factor in the glycaemic management of type I and type II diabetes. Diabetologia 45(7), 937–948.
Cryer, P. E., and Gerich, J. E. (1983). Relevance of glucose counterregulatory systems to patients with diabetes: Critical roles of glucagon and epinephrine. Diabetes Care 6(1), 95–99.
Cryer, P. E., Davis, S. N., and Shamoon, H. (2003). Hypoglycemia in diabetes. Diabetes Care 26, 1902–1912.
Diem, P., Redmon, J. B., Abid, M., Moran, A., Sutherland, D. E., Halter, J. B., and Robertson, R. P. (1990). Glucagon, catecholamine and pancreatic polypeptide secretion in type I diabetic recipients of pancreas allografts. J. Clin. Invest. 86(6), 2008–2013.
Dumonteil, E., Magnan, C., Ritz-Laser, B., Ktorza, A., Meda, P., and Philippe, J. (2000). Glucose regulates proinsulin and prosomatostatin but not proglucagon messenger ribonucleic acid levels in rat pancreatic islets. Endocrinology 141(1), 174–180.
Dunne, M. J., Harding, E. A., Jaggar, J. H., and Squires, P. E. (1994). Ion channels and the molecular control of insulin secretion. Biochem. Soc. Trans. 22(1), 6–12.
Efendic, S., Nylen, A., Roovete, A., and Uvnas-Wallenstein, K. (1978). Effects of glucose and arginine on the release of immunoreactive somatostatin from the isolated perfused rat pancreas. FEBS Lett. 92(1), 33–35.
Epstein, S., Berelowitz, M., and Bell, N. H. (1980). Pentagastrin and glucagon stimulate serum somatostatin-like immunoreactivity in man. J. Clin. Endocrinol. Metab. 51, 1227–1231.
Farhy, L. S. (2004). Modeling of oscillations in endocrine networks with feedback. Methods Enzymol. 384, 54–81.
Farhy, L. S., and McCall, A. L. (2009). System-level control to optimize glucagon counterregulation by switch-off of α-cell suppressing signals in β-cell deficiency. J. Diabetes Sci. Technol. 3(1), 21–33.
Farhy, L. S., and Veldhuis, J. D. (2003). Joint pituitary-hypothalamic and intrahypothalamic autofeedback construct of pulsatile growth hormone secretion. Am. J. Physiol. Regul. Integr. Comp. Physiol. 285(5), R1240–R1249.
Farhy, L. S., and Veldhuis, J. D. (2004). Putative GH pulse renewal: Periventricular somatostatinergic control of an arcuate-nuclear somatostatin and GH-releasing hormone oscillator. Am. J. Physiol. Regul. Integr. Comp. Physiol. 286(6), R1030–R1042.
Farhy, L. S., and Veldhuis, J. D. (2005). Deterministic construct of amplifying actions of ghrelin on pulsatile growth hormone secretion. Am. J. Physiol. Regul. Integr. Comp. Physiol. 288, R1649–R1663.
Farhy, L. S., Straume, M., Johnson, M. L., Kovatchev, B., and Veldhuis, J. D. (2001). A construct of interactive feedback control of the GH axis in the male. Am. J. Physiol. Regul. Integr. Comp. Physiol. 281(1), R38–R51.
Farhy, L. S., Straume, M., Johnson, M. L., Kovatchev, B., and Veldhuis, J. D. (2002). Unequal autonegative feedback by GH models the sexual dimorphism in GH secretory dynamics. Am. J. Physiol. Regul. Integr. Comp. Physiol. 282(3), R753–R764.
Farhy, L. S., Bowers, C. Y., and Veldhuis, J. D. (2007). Model-projected mechanistic bases for sex differences in growth-hormone (GH) regulation in the human. Am. J. Physiol. Regul. Integr. Comp. Physiol. 292, R1577–R1593.
Farhy, L. S., Du, Z., Zeng, Q., Veldhuis, P. P., Johnson, M. L., Brayman, K. L., and McCall, A. L. (2008). Amplification of pulsatile glucagon secretion by switch-off of α-cell suppressing signals in streptozotocin-treated rats. Am. J. Physiol. Endocrinol. Metab. 295, E575–E585.
Fujitani, S., Ikenoue, T., Akiyoshi, M., Maki, T., and Yada, T. (1996). Somatostatin and insulin secretion due to common mechanisms by a new hypoglycemic agent, A-4166, in perfused rat pancreas. Metab. Clin. Exp. 45(2), 184–189.
Fukuda, M., Tanaka, A., Tahara, Y., Ikegami, H., Yamamoto, Y., Kumahara, Y., and Shima, K. (1988). Correlation between minimal secretory capacity of pancreatic beta-cells and stability of diabetic control. Diabetes 37(1), 81–88.
Gedulin, B. R., Rink, T. J., and Young, A. A. (1997). Dose-response for glucagonostatic effect of amylin in rats. Metabolism 46, 67–70.
Genter, P., Berman, N., Jacob, M., and Ipp, E. (1998). Counterregulatory hormones oscillate during steady-state hypoglycemia. Am. J. Physiol. 275(5), E821–E829.
Gerich, J. E. (1988). Lilly lecture: Glucose counterregulation and its impact on diabetes mellitus. Diabetes 37(12), 1608–1617.
Gerich, J. E., Langlois, M., Noacco, C., Karam, J. H., and Forsham, P. H. (1973). Lack of glucagon response to hypoglycemia in diabetes: Evidence for an intrinsic pancreatic alpha cell defect. Science 182(108), 171–173.
Gopel, S. O., Kanno, T., Barg, S., and Rorsman, P. (2000a). Patch-clamp characterisation of somatostatin-secreting δ-cells in intact mouse pancreatic islets. J. Physiol. 528(3), 497–507.
Gopel, S. O., Kanno, T., Barg, S., Weng, X. G., Gromada, J., and Rorsman, P. (2000b). Regulation of glucagon release in mouse α-cells by KATP channels and inactivation of TTX-sensitive Na+ channels. J. Physiol. 528, 509–520.
Grapengiesser, E., Salehi, A., Quader, S. S., and Hellman, B. (2006). Glucose induces glucagon release pulses antisynchronous with insulin and sensitive to purinoceptor inhibition. Endocrinology 147, 3472–3477.
Grimmichova, R., Vrbikova, J., Matucha, P., Vondra, K., Veldhuis, P., and Johnson, M. (2008). Fasting insulin pulsatile secretion in lean women with polycystic ovary syndrome. Physiol. Res. 57, 1–8.
Gromada, J., Franklin, I., and Wollheim, C. B. (2007). α-Cells of the endocrine pancreas: 35 years of research but the enigma remains. Endocr. Rev. 28(1), 84–116.
Guyton, J. R., Foster, R. O., Soeldner, J. S., Tan, M. H., Kahn, C. B., Koncz, L., and Gleason, R. E. (1978). A model of glucose-insulin homeostasis in man that incorporates the heterogeneous fast pool theory of pancreatic insulin release. Diabetes 27, 1027–1042.
Havel, P. J., and Ahren, B. (1997). Activation of autonomic nerves and the adrenal medulla contributes to increased glucagon secretion during moderate insulin-induced hypoglycemia in women. Diabetes 46, 801–807.
Havel, P. J., and Taborsky, G. J. Jr. (1989). The contribution of the autonomic nervous system to changes of glucagon and insulin secretion during hypoglycemic stress. Endocr. Rev. 10(3), 332–350.
Heimberg, H., De Vos, A., Pipeleers, D., Thorens, B., and Schuit, F. (1995). Differences in glucose transporter gene expression between rat pancreatic alpha- and beta-cells are correlated to differences in glucose transport but not in glucose utilization. J. Biol. Chem. 270(15), 8971–8975.
Heimberg, H., De Vos, A., Moens, K., Quartier, E., Bouwens, L., Pipeleers, D., Van Schaftingen, E., Madsen, O., and Schuit, F. (1996). The glucose sensor protein glucokinase is expressed in glucagon-producing alpha-cells. Proc. Natl. Acad. Sci. USA 93(14), 7036–7041.
Heise, T., Heinemann, T., Heller, S., Weyer, C., Wang, Y., Strobel, S., Kolterman, O., and Maggs, D. (2004). Effect of pramlintide on symptom, catecholamine, and glucagon responses to hypoglycemia in healthy subjects. Metabolism 53(9), 1227–1232.
Hermansen, K., Christensen, S. E., and Orskov, H. (1979). Characterization of somatostatin release from the pancreas: The role of potassium. Scand. J. Clin. Lab. Invest. 39(8), 717–722.
Hilsted, J., Frandsen, H., Holst, J. J., Christensen, N. J., and Nielsen, S. L. (1991). Plasma glucagon and glucose recovery after hypoglycemia: The effect of total autonomic blockade. Acta Endocrinol. 125(5), 466–469.
Hirsch, B. R., and Shamoon, H. (1987). Defective epinephrine and growth hormone responses in type I diabetes are stimulus specific. Diabetes 36(1), 20–26.
Hoffman, R. P., Arslanian, S., Drash, A. L., and Becker, D. J. (1994). Impaired counterregulatory hormone responses to hypoglycemia in children and adolescents with new onset IDDM. J. Pediatr. Endocrinol. 7(3), 235–244.
Hope, K. M., Tran, P. O., Zhou, H., Oseid, E., Leroy, E., and Robertson, R. P. (2004). Regulation of alpha-cell function by the beta-cell in isolated human and rat islets deprived of glucose: The "switch-off" hypothesis. Diabetes 53(6), 1488–1495.
Huypens, P., Ling, Z., Pipeleers, D., and Schuit, F. (2000). Glucagon receptors on human islet cells contribute to glucose competence of insulin release. Diabetologia 43(8), 1012–1019.
Inagaki, N., Kuromi, H., Gonoi, T., Okamoto, Y., Ishida, H., Seino, Y., Kaneko, T., Iwanaga, T., and Seino, S. (1995). Expression and role of ionotropic glutamate receptors in pancreatic islet cells. FASEB J. 9(8), 686–691.
Insel, P. A., Liljenquist, J. E., Tobin, J. D., Sherwin, R. S., Watkins, P., Andres, R., and Berman, M. (1975). Insulin control of glucose metabolism in man. A new kinetic analysis. J. Clin. Invest. 55, 1057–1066.
Ishihara, H., Maechler, P., Gjinovci, A., Herrera, P. L., and Wollheim, C. B. (2003). Islet β-cell secretion determines glucagon release from neighboring α-cells. Nat. Cell Biol. 5, 330–335.
Ito, K., Maruyama, H., Hirose, H., Kido, K., Koyama, K., Kataoka, K., and Saruta, T. (1995). Exogenous insulin dose-dependently suppresses glucopenia-induced glucagon secretion from perfused rat pancreas. Metab. Clin. Exp. 44(3), 358–362.
Jaspan, J. B., Lever, E., Polonsky, K. S., and Van Cauter, E. (1986). In vivo pulsatility of pancreatic islet peptides. Am. J. Physiol. 251(2 Pt 1), E215–E226.
Kawai, K., and Unger, R. H. (1982). Inhibition of glucagon secretion by exogenous glucagon in the isolated, perfused dog pancreas. Diabetes 31(6), 512–515.
Kawamori, D., Kurpad, A. J., Hu, J., Liew, C. W., Shih, J. L., Ford, E. L., Herrera, P. L., Polonsky, K. S., McGuinness, O. P., and Kulkarni, R. N. (2009). Insulin signaling in alpha cells modulates glucagon secretion in vivo. Cell Metab. 9(4), 350–361.
Kieffer, T. J., Heller, R. S., Unson, C. G., Weir, G. C., and Habener, J. F. (1996). Distribution of glucagon receptors on hormone-specific endocrine cells of rat pancreatic islets. Endocrinology 137(11), 5119–5125.
Klaff, L. J., and Taborsky, G. J. Jr. (1987). Pancreatic somatostatin is a mediator of glucagon inhibition by hyperglycemia. Diabetes 36(5), 592–596.
Kleinman, R., Gingerich, R., Wong, H., Walsh, J., Lloyd, K., Ohning, G., De Giorgio, R., Sternini, C., and Brunicardi, F. C. (1994). Use of the Fab fragment for immunoneutralization of somatostatin in the isolated perfused human pancreas. Am. J. Surg. 167(1), 114–119.
Kleinman, R., Gingerich, R., Ohning, G., Wong, H., Olthoff, K., Walsh, J., and Brunicardi, F. C. (1995). The influence of somatostatin on glucagon and pancreatic polypeptide secretion in the isolated perfused human pancreas. Int. J. Pancreatol. 18(1), 51–57.
Kovatchev, B. P., Farhy, L. S., Cox, D. J., Straume, M., Yankov, V. I., Gonder-Frederick, L. A., and Clarke, W. L. (1999). Modeling insulin-glucose dynamics during insulin induced hypoglycemia. Evaluation of glucose counterregulation. J. Theor. Med. 1, 313–323.
Kovatchev, B. P., Straume, M., Farhy, L. S., and Cox, D. J. (2000). Dynamic network model of glucose counterregulation in subjects with insulin-requiring diabetes. Methods Enzymol. 321, 396–410.
Ludvigsen, E., Olsson, R., Stridsberg, M., Janson, E. T., and Sandler, S. (2004). Expression and distribution of somatostatin receptor subtypes in the pancreatic islets of mice and rats. J. Histochem. Cytochem. 52(3), 391–400.
Mari, A. (1997). Assessment of insulin sensitivity with minimal model: Role of model assumptions. Am. J. Physiol. 272, E925–E934.
Maruyama, H., Hisatomi, A., Orci, L., Grodsky, G. M., and Unger, R. H. (1984). Insulin within islets is a physiologic glucagon release inhibitor. J. Clin. Invest. 74(6), 2296–2299.
Matthews, D. R., Hermansen, K., Connolly, A. A., Gray, D., Schmitz, O., Clark, A., Orskov, H., and Turner, R. C. (1987). Greater in vivo than in vitro pulsatility of insulin secretion with synchronized insulin and somatostatin secretory pulses. Endocrinology 120(6), 2272–2278.
McCall, A. L., Cox, D. J., Crean, J., Gloster, M., and Kovatchev, B. P. (2006). A novel analytical method for assessing glucose variability: Using CGMS in type 1 diabetes mellitus. Diabetes Technol. Ther. 8(6), 644–653.
Meier, J. J., Kjems, L. L., Veldhuis, J. D., Lefebvre, P., and Butler, P. C. (2006). Postprandial suppression of glucagon secretion depends on intact pulsatile insulin secretion: Further evidence for the intraislet insulin hypothesis. Diabetes 55(4), 1051–1056.
Pipeleers, D. G., Schuit, F. C., Van Schravendijk, C. F., and Van de Winkel, M. (1985). Interplay of nutrients and hormones in the regulation of glucagon release. Endocrinology 117(3), 817–823.
Pørksen, N. (2002). The in vivo regulation of pulsatile insulin secretion. Diabetologia 45(1), 3–20.
Portela-Gomes, G. M., Stridsberg, M., Grimelius, L., Oberg, K., and Janson, E. T. (2000). Expression of the five different somatostatin receptor subtypes in endocrine cells of the pancreas. Appl. Immunohistochem. Mol. Morphol. 8(2), 126–132.
Quon, M. J., Cochran, C., Taylor, S. I., and Eastman, R. C. (1994). Non-insulin mediated glucose disappearance in subjects with IDDM. Discordance between experimental results and minimal model analysis. Diabetes 43, 890–896.
Ravier, M. A., and Rutter, G. A. (2005). Glucose or insulin, but not zinc ions, inhibit glucagon secretion from mouse pancreatic alpha cells. Diabetes 54, 1789–1797.
Reaven, G. M., Chen, Y. D., Golay, A., Swislocki, A. L., and Jaspan, J. B. (1987). Documentation of hyperglucagonemia throughout the day in nonobese and obese patients with noninsulin-dependent diabetes mellitus. J. Clin. Endocrinol. Metab. 64(1), 106–110.
Rorsman, P., and Hellman, B. (1988). Voltage-activated currents in guinea pig pancreatic alpha 2 cells. Evidence for Ca2+-dependent action potentials. J. Gen. Physiol. 91(2), 223–242.
Rorsman, P., Berggren, P. O., Bokvist, K., Ericson, H., Mohler, H., Ostenson, C. G., and Smith, P. A. (1989). Glucose-inhibition of glucagon secretion involves activation of GABAA-receptor chloride channels. Nature 341(6239), 233–236.
Salehi, A., Quader, S. S., Grapengiesser, E., and Hellman, B. (2007). Pulses of somatostatin release are slightly delayed compared with insulin and antisynchronous to glucagon. Regul. Pept. 144, 43–49.
Samols, E., and Stagner, J. I. (1988). Intra-islet regulation. Am. J. Med. 85(5A), 31–35.
Samols, E., and Stagner, J. I. (1990). Islet somatostatin–microvascular, paracrine, and pulsatile regulation. Metab. Clin. Exp. 39(9 Suppl 2), 55–60.
Schuit, F. C., Derde, M. P., and Pipeleers, D. G. (1989). Sensitivity of rat pancreatic A and B cells to somatostatin. Diabetologia 32(3), 207–212.
Schuit, F., De Vos, A., Farfari, S., Moens, K., Pipeleers, D., Brun, T., and Prentki, M. (1997). Metabolic fate of glucose in purified islet cells. Glucose-regulated anaplerosis in beta cells. J. Biol. Chem. 272(30), 18572–18579.
Schuit, F. C., Huypens, P., Heimberg, H., and Pipeleers, D. G. (2001). Glucose sensing in pancreatic beta-cells: A model for the study of other glucose-regulated cells in gut, pancreas, and hypothalamus. Diabetes 50(1), 1–11.
Segel, S. A., Paramore, D. S., and Cryer, P. E. (2002). Hypoglycemia-associated autonomic failure in advanced type 2 diabetes. Diabetes 51(3), 724–733.
Silvestre, R. A., Rodríguez-Gallardo, J., Jodka, C., Parkes, D. G., Pittner, R. A., Young, A. A., and Marco, J. (2001). Selective amylin inhibition of the glucagon response to arginine is extrinsic to the pancreas. Am. J. Physiol. Endocrinol. Metab. 280, E443–E449.
Stagner, J. I., Samols, E., and Bonner-Weir, S. (1988). Beta-alpha-delta pancreatic islet cellular perfusion in dogs. Diabetes 37(12), 1715–1721.
Stagner, J. I., Samols, E., and Marks, V. (1989). The anterograde and retrograde infusion of glucagon antibodies suggests that A cells are vascularly perfused before D cells within the rat islet. Diabetologia 32(3), 203–206.
Steele, R., Rostami, H., and Altszuler, N. (1974). A two-compartment calculator for the dog glucose pool in the nonsteady state. Fed. Proc. 33, 1869–1876.
Strowski, M. Z., Parmar, R. M., Blake, A. D., and Schaeffer, J. M. (2000). Somatostatin inhibits insulin and glucagon secretion via two receptor subtypes: An in vitro study of pancreatic islets from somatostatin receptor 2 knockout mice. Endocrinology 141(1), 111–117.
Sumida, Y., Shima, T., Shirayama, K., Misaki, M., and Miyaji, K. (1994). Effects of hexoses and their derivatives on glucagon secretion from isolated perfused rat pancreas. Horm. Metab. Res. 26(5), 222–225.
Taborsky, G. J. Jr., Ahren, B., and Havel, P. J. (1998). Autonomic mediation of glucagon secretion during hypoglycemia: Implications for impaired alpha-cell responses in type 1 diabetes. Diabetes 47(7), 995–1005.
Tapia-Arancibia, L., and Astier, H. (1988). Glutamate stimulates somatostatin release from diencephalic neurons in primary culture. Endocrinology 123, 2360–2366.
The Action to Control Cardiovascular Risk in Diabetes Study Group (2008). Effects of intensive glucose lowering in type 2 diabetes. N. Engl. J. Med. 358, 2545–2559.
The Diabetes Control and Complications Trial Research Group (1993). The effect of intensive treatment of diabetes on the development and progression of long-term complications in insulin-dependent diabetes mellitus. N. Engl. J. Med. 329, 977–986.
Tirone, T. A., Norman, M. A., Moldovan, S., DeMayo, F. J., Wang, X. P., and Brunicardi, F. C. (2003). Pancreatic somatostatin inhibits insulin secretion via SSTR-5 in the isolated perfused mouse pancreas model. Pancreas 26(3), e67–e73.
Toffolo, G., De Grandi, F., and Cobelli, C. (1995). Estimation of beta-cell sensitivity from intravenous glucose tolerance test C-peptide data. Knowledge of the kinetics avoids errors in modeling the secretion. Diabetes 44, 845–854.
Toffolo, G., Breda, E., Cavaghan, M. K., Ehrmann, D. A., Polonsky, K. S., and Cobelli, C. (2001). Quantitative indices of β-cell function during graded up&down glucose infusion from C-peptide minimal models. Am. J. Physiol. 280, E2–E10.
Uehara, S., Muroyama, A., Echigo, N., Morimoto, R., Otsuka, M., Yatsushiro, S., and Moriyama, Y. (2004). Metabotropic glutamate receptor type 4 is involved in autoinhibitory cascade for glucagon secretion by alpha-cells of islet of Langerhans. Diabetes 53(4), 998–1006.
UK Prospective Diabetes Study Group (1998). Intensive blood-glucose control with sulphonylureas or insulin compared with conventional treatment and risk of complications in patients with type 2 diabetes. Lancet 352, 837–853.
Unger, R. H. (1985). Glucagon physiology and pathophysiology in the light of new advances. Diabetologia 28, 574–578.
Utsumi, M., Makimura, H., Ishihara, K., Morita, S., and Baba, S. (1979). Determination of immunoreactive somatostatin in rat plasma and responses to arginine, glucose and glucagon infusion. Diabetologia 17, 319–323.
Van Schravendijk, C. F., Foriers, A., Van den Brande, J. L., and Pipeleers, D. G. (1987). Evidence for the presence of type I insulin-like growth factor receptors on rat pancreatic A and B cells. Endocrinology 121(5), 1784–1788.
Wendt, A., Birnir, B., Buschard, K., Gromada, J., Salehi, A., Sewing, S., Rorsman, P., and Braun, M. (2004). Glucose inhibition of glucagon secretion from rat alpha-cells is mediated by GABA released from neighboring beta-cells. Diabetes 53(4), 1038–1045.
Xu, E., Kumar, M., Zhang, Y., Ju, W., Obata, T., Zhang, N., Liu, S., Wendt, A., Deng, S., Ebina, Y., Wheeler, M. B., Braun, M., et al. (2006). Intraislet insulin suppresses glucagon release via GABA-GABAA receptor system. Cell Metab. 3, 47–58.
Yamasaki, Y., Tiran, J., and Albisser, A. M. (1984). Modeling glucose disposal in diabetic dogs fed mixed meals. Am. J. Physiol. 246, E52–E61.
Zhou, H., Tran, P. O., Yang, S., Zhang, T., LeRoy, E., Oseid, E., and Robertson, R. P. (2004). Regulation of alpha-cell function by the beta-cell during hypoglycemia in Wistar rats: The "switch-off" hypothesis. Diabetes 53(6), 1482–1487.
Zhou, H., Zhang, T., Oseid, E., Harmon, J., Tonooka, N., and Robertson, R. P. (2007a). Reversal of defective glucagon responses to hypoglycemia in insulin-dependent autoimmune diabetic BB rats. Endocrinology 148, 2863–2869.
Zhou, H., Zhang, T., Harmon, J. S., Bryan, J., and Robertson, R. P. (2007b). Zinc, not insulin, regulates the rat α-cell response to hypoglycemia in vivo. Diabetes 56, 1107–1112.
CHAPTER TWENTY-TWO

Enzyme Kinetics and Computational Modeling for Systems Biology

Pedro Mendes,*,†,‡ Hanan Messiha,*,§ Naglis Malys,*,¶ and Stefan Hoops‡

Contents
1. Introduction 584
2. Computational Modeling and Enzyme Kinetics 586
2.1. Standards in computational systems biology 586
2.2. COPASI: A biochemical modeling and simulation package 587
3. Yeast Triosephosphate Isomerase (EC 5.3.1.1) 588
4. Initial Rate Analysis 590
5. Progress Curve Analysis 594
6. Concluding Remarks 598
Acknowledgments 598
References 598
Abstract

Enzyme kinetics is a century-old area of biochemical research which is regaining popularity due to its use in systems biology. Computational models of biochemical networks depend on rate laws and kinetic parameter values that describe the behavior of enzymes in the cellular milieu. While there is a considerable body of enzyme kinetic data available from the past several decades, a large number of enzymes of specific organisms were never assayed or were assayed in conditions that are irrelevant to those models. The result is that systems biology projects are having to carry out large numbers of enzyme kinetic assays. This chapter reviews the main methodologies of enzyme kinetic data analysis and proposes using computational modeling software for that purpose. It applies the biochemical network modeling software COPASI to data from enzyme assays of yeast triosephosphate isomerase (EC 5.3.1.1).
* Manchester Centre for Integrative Systems Biology, The University of Manchester, Manchester, United Kingdom
† School of Computer Science, The University of Manchester, Manchester, United Kingdom
‡ Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, USA
§ School of Chemistry, The University of Manchester, Manchester, United Kingdom
¶ Faculty of Life Sciences, The University of Manchester, Manchester, United Kingdom
1. Introduction

Modern biochemical research is becoming a systems approach in which mathematical models of the dynamics of molecular networks play an important role. These models are needed to understand the relationship between the underlying biophysical and biochemical parameters and the nonlinear behavior of the system. Models are also important as devices that integrate the various types of data needed for these studies. Under the term systems biology we include two distinct types of studies: one is driven by whole-genome data, such as that from transcriptomics and high-throughput protein–protein interactions, while the other is based on in vitro data from purified molecules. The former is a top-down (analytic) approach that centers on network inference, while the latter is a bottom-up (synthetic) approach that reconstructs the system based on knowledge of the individual parts. The ultimate objective of both approaches is the same, however: to understand how the behavior of living cells depends on the molecular mechanisms that compose them. To some extent, systems biology can be seen as the link between biochemistry and physiology.

The bottom-up approach to systems biology is based on existing knowledge of the network of molecular interactions, and much work is ongoing to create accurate descriptions of these networks (e.g., Herrgård et al., 2008). But assembling the network structure is only the first part, and while it provides for interesting analyses (Schilling et al., 1999), the bulk of cellular properties are dynamic and require dynamic models for their understanding. Dynamics are introduced in models through the kinetics of the molecular interactions, the majority of which are enzyme-catalyzed reactions. The determination of kinetic parameters and rate laws is thus an important activity in systems biology. But the new field of application has its own specific requirements that result in different constraints on assays and data analysis.

Traditionally, enzyme kinetics has been a vehicle for determining reaction mechanisms. This means that assays had to expose differences between mechanisms, which are often subtle, and therefore there was a strong emphasis on accuracy of results. Since the mechanism of catalysis of an enzyme is rarely different for each of its substrates, many assays were carried out with synthetic substrate analogs, which are often more readily available (and cheaper) than the physiological substrate; other reasons for the use of substrate analogs are related to advantageous physicochemical properties (e.g., solubility, light absorption, etc.). The same applies to modifiers, which were often also analogs of, or even entirely unrelated to, physiological metabolites of that pathway. Another common practice in the quest for mechanisms is to carry
out the assays at the optimum pH of the enzyme, not the physiological pH. Frequently, only one of the directions of the reaction was assayed, and parameters for the products were not determined. Finally, the enzyme preparations themselves were often not sufficiently pure, containing unknown proportions of isoenzymes.

To construct biochemical network models that are relevant to cellular physiology it is important to determine the kinetic properties of the enzyme in conditions as close as possible to the cellular milieu. At a minimum, the pH and temperature should be consistent with those of the relevant cells. Importantly, synthetic substrates or inhibitor analogs are undesirable and provide no useful information to the model. As much as possible, one should also determine the kinetic properties of each single isoenzyme (or at least the isoenzymes of relevance); after all, when several forms exist in an organism it is because they have different properties and fulfill different roles (even if in certain diseases or mutants one form may substitute for the other). The kinetic parameters of all substrates and products should be determined, so that one can appropriately include reversible reactions in the model. Even if one has to represent some reaction as irreversible, it is important that the rate law be sensitive to the product concentrations.

It is not surprising that little data fulfilling the requirements above has been published to date. As it turns out, even without these requirements, the number of isoenzymes that have been studied kinetically in any form is smaller than is often portrayed. Consequently, there is a real need to assay a large number of different isoenzymes to provide data for the construction of physiologically relevant biochemical network models. Systems biology needs enzyme kinetic assays in large numbers and therefore, in this age of robotics, there is a real need for high-throughput enzyme characterizations that follow the principles described here.

But an enhanced interaction between systems biology and enzyme kinetics is synergistic: enzyme kinetics also has something to gain from systems biology. With the increasing interest in modeling biochemical networks, computational systems biology has been creating a series of tools that are also useful when applied to enzyme kinetics. This is particularly true in the area of parameter estimation, where several algorithms have proven valuable. The availability of increasingly sophisticated and standardized modeling and simulation software will undoubtedly benefit enzyme kinetics. Here we review the main approaches to enzyme kinetic data analysis, discuss them in light of their new field of application, and show how systems biology modeling tools can be useful. An illustration is presented with the COPASI modeling software (Hoops et al., 2006) applied to the kinetics of purified yeast triosephosphate isomerase (EC 5.3.1.1).
2. Computational Modeling and Enzyme Kinetics

Biochemical networks are sets of reactions that are linked by common substrates and products. The dynamics of biochemical networks are frequently described by sets of coupled ordinary differential equations (ODEs) that represent the rate of change of the concentrations of the chemical species involved in the network. The right-hand side of these ODEs is the algebraic sum of the rate laws of the reactions that produce or consume the chemical species (positive when the species is produced, negative when it is consumed). There is formally no difference between a biochemical network and an enzyme reaction mechanism, as both conform to this description. It is possible (though perhaps not desirable) to represent an entire biochemical network through elementary reactions, as was done in the past (Chance et al., 1960), but this was soon shown to be impractical and unnecessary (Rhoads et al., 1968). For the purposes of systems biology studies it suffices to represent each enzyme-catalyzed reaction as a single step and associate with it an appropriate integrated rate law. It is debatable whether the rate laws even need to be based on a mechanism, and generic rate laws have been proposed for this purpose (Liebermeister and Klipp, 2006). The systems biologist should be cautioned, though, that mechanistic details may indeed affect the dynamics, as is the case with competitive versus uncompetitive inhibitor drugs (Cornish-Bowden, 1986; Westley and Westley, 1996).
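In practice the ODE system is just the stoichiometry matrix multiplied by the vector of rate laws. A minimal sketch in Python makes this concrete (a hypothetical two-step chain S → M → P with Michaelis–Menten kinetics; all parameter values are placeholders):

    import numpy as np
    from scipy.integrate import solve_ivp

    # Hypothetical chain S -> M -> P, each step Michaelis-Menten.
    # Species order: [S, M, P]; one column per reaction.
    N = np.array([[-1,  0],
                  [ 1, -1],
                  [ 0,  1]])

    def rates(y, V1=1.0, Km1=0.5, V2=0.8, Km2=0.3):
        S, M, P = y
        return np.array([V1 * S / (Km1 + S),   # v1: S -> M
                         V2 * M / (Km2 + M)])  # v2: M -> P

    def rhs(t, y):
        # dy/dt is the algebraic sum of the rate laws, i.e., N @ v
        return N @ rates(y)

    sol = solve_ivp(rhs, (0.0, 50.0), [1.0, 0.0, 0.0])

The same assembly applies whether each column of N represents an elementary step of a mechanism or a whole enzyme-catalyzed reaction with an integrated rate law.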
2.1. Standards in computational systems biology

A major force driving computational systems biology has been the establishment of standards for various aspects of modeling. The systems biology markup language (SBML) (Hucka et al., 2003) is a standard format to encode the information required to express a biochemical network model, including its kinetics. SBML is represented in the extensible markup language (XML), which is itself a standard widely adopted on the Internet. Although it could appear that a simple common format would not be terribly significant, the creation and subsequent development of SBML has resulted in the formation of a vibrant community of researchers that has passed critical mass. The consequence is that there are now several compatible software packages to model biochemical networks. Some are generic and provide many algorithms, while others are more specialized. Importantly, all of these are compatible in the sense that they can read and write models in a way that allows researchers to use them without hindrance. This includes not only simulators (Hoops et al., 2006) but also packages for graphical depiction of networks (Funahashi et al., 2003), databases of reactions and kinetic parameters (Rojas et al., 2007), network analysis and data
Enzyme Kinetics for Computational Systems Biology
587
visualization (Kohler et al., 2006; Shannon et al., 2003), and so on. In some cases these packages can even work in a more integrated way, such as the SBW suite (Sauro et al., 2003), or CellDesigner and COPASI. SBML has also been a source of innovation, as the specification has covered modeling methods that were not previously supported well or at all. Models represented in SBML can be based on ODEs, algebraic equations, stochastic kinetics, and discrete events. Beyond SBML, there are also standards for how to report models and their simulations (MIRIAM) (Le Novère et al., 2005) and for the graphical representation of networks and models (SBGN) (Le Novère et al., 2009). An ontology for systems biology is being developed, a large section of which covers enzyme kinetics terms. Finally, there are emerging standards for specifying modeling procedures (MIASE) and data (SBRML). All of these could be useful to some extent to enzyme kinetics, and the software that has resulted from them certainly is.
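As an illustration of how compact such an encoding can be, the fragment below sketches the assembly of a minimal SBML Level 3 model with the python-libsbml bindings. This is our own sketch, not code taken from any of the packages cited above, and the identifiers ('cell', 'S', 'v1') are hypothetical:

    import libsbml

    doc = libsbml.SBMLDocument(3, 1)  # SBML Level 3, Version 1
    model = doc.createModel()
    model.setId('mm_example')

    comp = model.createCompartment()
    comp.setId('cell'); comp.setSize(1.0); comp.setConstant(True)

    sp = model.createSpecies()
    sp.setId('S'); sp.setCompartment('cell'); sp.setInitialConcentration(1.0)
    sp.setConstant(False); sp.setBoundaryCondition(False)
    sp.setHasOnlySubstanceUnits(False)

    rxn = model.createReaction()
    rxn.setId('v1'); rxn.setReversible(False); rxn.setFast(False)
    reactant = rxn.createReactant()
    reactant.setSpecies('S'); reactant.setConstant(False)

    kl = rxn.createKineticLaw()
    for pid, val in (('V', 1.0e-3), ('Km', 0.5)):
        p = kl.createLocalParameter()
        p.setId(pid); p.setValue(val)
    kl.setMath(libsbml.parseL3Formula('cell * V * S / (Km + S)'))

    print(libsbml.writeSBMLToString(doc))  # serialize the model as SBML/XML

Any SBML-compliant tool, COPASI included, can then load the resulting file without further translation.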
2.2. COPASI: A biochemical modeling and simulation package

COPASI (Hoops et al., 2006) is an open-source biochemical network modeling and simulation software package that we (PM and SH) have been developing with our colleagues Ursula Kummer and Sven Sahle (University of Heidelberg) and many coworkers. COPASI has implemented almost all of the features described in SBML (with the single exception of explicit time delays). It contains algorithms for simulation through ODEs, algebraic equations, the stochastic simulation algorithm of Gillespie (1977) and its derivatives, and discrete events. It also allows several of these to be mixed in a single simulation. COPASI also includes a number of algorithms for stoichiometric analyses, systematic parameter scanning or Monte Carlo sampling, metabolic control analysis and generic sensitivity analysis, time scale and stability analysis, optimization, and parameter estimation. Of greatest relevance to the present topic are sensitivity analysis and parameter estimation. COPASI is available free of charge for nonprofit research. The COPASI user represents a biochemical network model in the language of biochemistry, while the software internally constructs the appropriate mathematical representation (which the user is able to check if needed). As indicated earlier, models can consist of the elementary reactions of an enzyme-catalyzed mechanism or use integrated rate laws (of which there are several predefined). Thus, COPASI is also useful for modeling enzyme catalysis. The parameter estimation infrastructure of COPASI is fairly sophisticated, allowing the use of data from several different experiments that can even be of different types (e.g., time courses or steady-state measurements) and be stored across several files. COPASI currently uses a least-squares approach
whereby the sum of squared residuals between the data and the model is minimized. The sum of squares can be constructed over several variables, which will be scaled appropriately (such that all contribute equally to the total sum). The number and type of parameters to be estimated are unrestricted by the software. The minimization can be subject to arbitrary nonlinear constraints on any feature of the model. The approach used in COPASI follows the framework of Mendes and Kell (1998), in which a number of different nonlinear optimization algorithms can be used to minimize the sum of squares. These can be run as alternatives to each other or in sequence (Rodriguez-Fernandez et al., 2006). The obvious application of COPASI's parameter estimation engine to enzyme kinetics is progress curve analysis. This is fairly straightforward and requires only (1) entering the relevant reactions and rate laws in the model, either a single overall reaction following an integrated rate law or a series of elementary reactions following mass action kinetics, (2) setting up the link between the data and the model by identifying which elements of the model the columns in the data file represent, (3) selecting which parameters are to be estimated, their boundaries (if any), and whether the fit is to be independent for each experiment or global to all experiments, and (4) selecting an algorithm for minimization. In addition to progress curves, COPASI is also useful for initial rate analysis, being able to carry out the two steps needed for this approach: determination of initial rates and nonlinear regression on the appropriate rate law.
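The scaled sum of squares described above can be sketched generically as follows (this is an illustration of the idea only, not COPASI's internal code; the function name and the crude scaling rule are ours):

    import numpy as np

    def scaled_ssq(residual_blocks):
        """Sum of squares over several variables/experiments, with each
        block scaled so that all contribute comparably to the total.
        residual_blocks: list of (data, model) 1-D array pairs."""
        total = 0.0
        for data, model in residual_blocks:
            scale = float(np.max(np.abs(data)))
            if scale == 0.0:
                scale = 1.0  # avoid division by zero for all-zero data
            total += np.sum(((data - model) / scale) ** 2)
        return total

Any of the nonlinear optimizers mentioned above can then be pointed at such an objective function.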
3. Yeast Triosephosphate Isomerase (EC 5.3.1.1)

One of the objectives of the Manchester Centre for Integrative Systems Biology is to demonstrate the feasibility of the bottom-up approach by applying it to the metabolism of the yeast Saccharomyces cerevisiae. To achieve this we established a range of experimental and computational methodologies consisting of purification of proteins, kinetic assays, measurement of enzyme concentrations through targeted mass spectrometry, measurement of metabolite levels by GC–MS and LC–MS, and computational workflows to manage and analyze data. Here, we use the yeast enzyme triosephosphate isomerase (EC 5.3.1.1) to illustrate the procedures discussed in the remainder of the chapter; many other enzymes are being analyzed in our pipeline. Protein production and purification is based on the MORF mutant collection (Gelperin et al., 2005), composed of yeast strains that each overexpress a single one of the proteins of the yeast genome (for other proteins we also use the TAP mutant collection; Ghaemmaghami et al., 2003). Yeast cultures are grown in raffinose medium and then switched to galactose to
trigger the overexpression of the protein of interest. MORF proteins carry C-terminal tags that allow affinity purification using IgG and nickel. While the majority of the MORF tag is cleaved off, a small 6×His peptide is still left at the C terminus at the end of the purification. This might affect the kinetics of these enzymes, and ideally we would prefer to obtain native enzymes; however, this would require devising new constructs, which is presently beyond the scope of our work. Aliquots of the purified protein are stored at −20 °C in MES (2-[N-morpholino]-ethanesulfonic acid) buffer at pH 6.5, as used in the kinetic assays.

In this scenario, where hundreds of proteins are being assayed, it is important to standardize the assay conditions and to process them at as high a throughput as possible. Thus, we have settled on running spectrophotometric assays monitoring the consumption or production of NADH or NADPH, using one or more coupling reactions where needed. Assays are carried out with a NOVOstar plate reader in 384-well format plates with a reaction volume of 60 μl. A reaction buffer consisting of 100 mM MES (2-[N-morpholino]-ethanesulfonic acid), pH 6.5, 100 mM KCl, and 5 mM MgCl2 was used throughout.

Triosephosphate isomerase (EC 5.3.1.1) was isolated from the MORF strain overexpressing the gene TPI1 as described above. The kinetics of the purified enzyme were then determined in both reaction directions by coupling to glyceraldehyde 3-phosphate dehydrogenase (EC 1.2.1.12) or glycerol 3-phosphate dehydrogenase (EC 1.1.1.8). The forward reaction was measured according to Krietsch (1975) with slight modifications. The reaction mixture contained 1 mM NAD+, 1 mM EDTA, 120 mM DTT, 4 mM sodium arsenate, and 2.5 U glyceraldehyde 3-phosphate dehydrogenase in the reaction buffer at various concentrations of glycerone phosphate (DHAP). The overall reaction scheme considered is:

DHAP → G3P
G3P + NAD+ + arsenate → NADH + 3PG    (22.1)
The reverse reaction was measured in the reaction buffer based on Bergmeyer et al. (1974) with minor modifications, with 8.5 U/ml glycerol 3-phosphate dehydrogenase and 0.15 mM NADH at various concentrations of glyceraldehyde 3-phosphate (G3P). The overall reaction considered is:

G3P → DHAP
DHAP + NADH → NAD+ + Gol3P    (22.2)
In both cases the enzyme was incubated in the reaction mixture, the reactions were started by the addition of DHAP or G3P, and absorbance was collected every 19 s for 4731 s (nearly 80 min).
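As an aside, the behavior of such a coupled assay is easy to check by simulation before running it. The sketch below integrates scheme (22.2) with hypothetical parameter values (the rate constants and concentrations are placeholders, and the NADH absorption coefficient of 6.22 per mM per cm is a textbook value, not a result from this work) to verify that the coupling step tracks the TPI rate and to predict the absorbance trace:

    import numpy as np
    from scipy.integrate import solve_ivp

    # Scheme (22.2): G3P -> DHAP (TPI, Michaelis-Menten), coupled to
    # DHAP + NADH -> NAD+ + Gol3P (GPD, approximated as mass action).
    # All parameter values below are hypothetical placeholders.
    V, Km, k_gpd = 8.0e-4, 6.0, 10.0     # mM/s, mM, 1/(mM s)
    eps_path = 6.22 * 0.43               # NADH abs. coefficient x 0.43 cm path

    def rhs(t, y):
        g3p, dhap, nadh = y
        v_tpi = V * g3p / (Km + g3p)     # reaction of interest
        v_gpd = k_gpd * dhap * nadh      # coupling reaction
        return [-v_tpi, v_tpi - v_gpd, -v_gpd]

    t_eval = np.arange(0.0, 4731.0, 19.0)           # sampling as in the assay
    sol = solve_ivp(rhs, (0.0, 4731.0), [5.0, 0.0, 0.15], t_eval=t_eval)
    absorbance = eps_path * sol.y[2]                # Beer-Lambert conversion

If the intermediate (here DHAP) stays small relative to its turnover, the coupling enzyme is effectively first order and the measured signal faithfully reports the TPI rate.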
4. Initial Rate Analysis

In the early days of enzymology, when there were no computational aids for calculations, Michaelis and Menten (1913) proposed to determine the kinetics of enzymes by measuring initial rates of reaction. This had the advantage of simplifying calculations, as there is no product accumulation to consider. The methodology proceeds by determining progress curves at different concentrations of substrate and estimating the rate at t = 0. These data are then used to estimate the kinetic parameters by regression on the rate equation. In the case of Henri–Michaelis–Menten kinetics it is also possible to estimate these parameters by simple linear regression using transformations of the rate law (Lineweaver and Burk, 1934) or by graphical methods (Eisenthal and Cornish-Bowden, 1974). However, with the widespread availability of computers, this is now recognized to be best carried out through nonlinear regression. Several software packages exist that are capable of carrying out this type of regression, including DynaFit (Kuzmic, 1996), described elsewhere in this volume. Several authors have criticized the (unfortunately still widespread) practice of determining the initial rate by linear regression of the "linear" part of the curve. Of course, the curve has no linear part, and regression of a set of initial data points results in underestimating the rate (Duggleby, 1985). A better approach is to fit the parameters of a hyperbola to each progress curve and then use the corresponding initial substrate concentration to obtain the rate at t = 0:
v_0 = \frac{V^{app} S_0}{K_m^{app} + S_0}    (22.3)
where the parameters of the hyperbola, V^app and K_m^app, are only gross estimates of V and Km but nevertheless allow an accurate estimation of the initial rate through Eq. (22.3). This procedure is easy to carry out with COPASI, which can estimate all of the initial rates in one step. Essentially, one enters the reaction scheme (22.1) or (22.2), assigns the irreversible Henri–Michaelis–Menten rate law to the reaction of interest and mass action kinetics to the coupling reaction (a more complex rate law could also be used, but if the assay was designed correctly the linking enzyme should be operating in conditions near first-order kinetics). An algebraic equation needs to be added to the model to express the conversion of absorbance units to the concentration of NADH:

Abs_{340nm} = [NADH] \varepsilon + offset    (22.4)
where Abs_340nm is a new variable in the model, ε (a constant) is the molar absorptivity coefficient of NADH (in our case multiplied by the path
length, which we calibrated to be 0.43 cm in the 384-well plate), and offset is another constant that is needed to adjust for the initial absorbance. The data file is organized with rows representing each time-dependent reading and columns containing values of time, initial concentrations of DHAP and G3P, the initial absorbance (offset), and the absorbance measured; an empty line separates one time course from the next. These data can easily be formatted by the multiplate reader software or with a simple (automated) script. Once this file is mapped to the appropriate model elements in COPASI, one selects the parameters to estimate (in this case V and Km for the enzyme of interest, as well as the rate constant representing the rate of the coupling enzyme reaction). Finally, one needs to choose an optimization method and run the minimization. Here, we applied the SRES algorithm (Runarsson and Yao, 2000), followed by Levenberg–Marquardt (Levenberg, 1944; Marquardt, 1963), as suggested by Rodriguez-Fernandez et al. (2006). This is easily done by setting COPASI to update the model with the result of the estimation and then simply running the LM algorithm from where SRES finished. Application of this method to the data of TPI's forward reaction yields a set of V^app and K_m^app values, which are then used to calculate initial rates (in a spreadsheet, applying Eq. (22.3)). Note that while the absorbance is very well fit, many of the V^app and K_m^app values are poor estimates of V and Km. The second step uses the initial rates already estimated and the corresponding initial substrate concentrations and fits them to the Michaelis–Menten equation. This is carried out in COPASI in a new model similar to the first, but in which we fixed the concentrations of the substrate and product and associated the measured initial rates with the steady-state rate of the TPI reaction in the model (the coupling reaction is no longer needed). The results are depicted in Fig. 22.1 and the final estimates for the parameter values are Km = 6.4265 ± 0.18582 mM and V = 8.5267 × 10⁻⁴ ± 7.1161 × 10⁻⁶ mM s⁻¹. A similar procedure was repeated with the data for the reverse reaction, and it was observed, at the end of the first stage, that strong substrate inhibition was taking place (Fig. 22.2). This meant that a different rate law needed to be used in the second step. First, we attempted to fit the initial rates to the substrate inhibition rate law that is derived when a second molecule of substrate binds the enzyme–substrate complex (and forms a nonproductive complex):
v = \frac{V S}{K_m + S \left(1 + S/K_i\right)}    (22.5)
however, the software was not able to provide a good fit, even after applying global optimization algorithms (all of those available in COPASI).
[Figure 22.1: (A) "Initial rate estimation," absorbance time courses (time axis 0–500 s); (B) v0 versus [DHAP]0.]
Figure 22.1 Initial rate analysis of the forward reaction of triosephosphate isomerase (EC 5.3.1.1). (A) Independent fits to the time courses from which the initial rates were determined (crosses are data points, solid lines are the fitted curves). (B) Nonlinear regression of kinetic parameters on initial rate data; the positions of Km and V are indicated by dashed lines.
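For readers without access to COPASI, the two-step procedure shown in Fig. 22.1 can also be prototyped with generic tools. The sketch below (Python with SciPy; data arrays and starting values are hypothetical) fits the irreversible model to each progress curve, applies Eq. (22.3), and then regresses the initial rates on the Michaelis–Menten equation:

    import numpy as np
    from scipy.integrate import solve_ivp
    from scipy.optimize import curve_fit

    def product_curve(t, Vapp, Kmapp, S0):
        # integrate irreversible Michaelis-Menten; return product vs. time
        def rhs(_, y):
            v = Vapp * y[0] / (Kmapp + y[0])
            return [-v, v]
        return solve_ivp(rhs, (0.0, t[-1]), [S0, 0.0], t_eval=t).y[1]

    def v0_from_curve(t, P, S0):
        # Step 1: fit V_app, Km_app to one progress curve, then Eq. (22.3)
        popt, _ = curve_fit(lambda tt, V, K: product_curve(tt, V, K, S0),
                            t, P, p0=[1e-3, 5.0], bounds=(0.0, np.inf))
        Vapp, Kmapp = popt
        return Vapp * S0 / (Kmapp + S0)

    def fit_michaelis_menten(S0s, v0s):
        # Step 2: nonlinear regression of (S0, v0) pairs on the MM equation
        mm = lambda s, V, Km: V * s / (Km + s)
        popt, pcov = curve_fit(mm, S0s, v0s, p0=[np.max(v0s), np.median(S0s)])
        return popt, np.sqrt(np.diag(pcov))  # estimates and standard errors

In step 2 any other rate law, such as Eq. (22.5) or (22.6), can be substituted for the plain Michaelis–Menten function.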
[Figure 22.2: (A) "Initial rate estimation," absorbance time courses (time axis 0–5000 s); (B) v0 versus [G3P]0.]
Figure 22.2 Initial rate analysis of the reverse reaction of triosephosphate isomerase (EC 5.3.1.1), displaying strong substrate inhibition. (A) Independent fits to the time courses from which the initial rates were determined (crosses are data points, solid lines are the fitted curves). (B) Nonlinear regression of kinetic parameters on initial rate data; the positions of Km and V are indicated by dashed lines.
Therefore, we attempted a rate law in which the substrate inhibition term is raised to the fourth power:

v = \frac{V S}{K_m + S \left(1 + S/K_i\right)^4}    (22.6)
and this provided a very good fit to the data (Fig. 22.2). Mechanistic enzymologists would not normally use such an equation without identifying a mechanism that explains it. However, for our purposes of building a network model, this rate law is perfectly acceptable. With the network model we identify which steps have a strong effect on other parts of the network using sensitivity analysis, and those steps that do indeed have high levels of control are then chosen for a further, more thorough kinetic analysis. That means that TPI could be examined further in case it has a strong effect on the rest of the network model; if it does not then it is not important to identify a more accurate rate law. The most important feature for building a bottom-up biochemical network model is that the relation between the concentration of the effectors and the rate be accurate; the underlying mechanism is secondary.
5. Progress Curve Analysis

What made progress curves problematic early in the history of enzyme kinetics is now what makes them very attractive: they combine information from the forward and reverse reactions. Thus, progress curves contain more information than initial rates, and because of that one may be able to estimate kinetic parameters from a smaller number of samples than with the initial rate approach. The main difficulty with progress curves stems from the need to integrate the ODEs, since the progress curve is an explicit relation between concentrations (rather than rates of change) and time. This is not a problem, however, for biochemical simulation software, which is equipped with integrators that can deal with a very wide range of initial value problems. In particular, software that also incorporates minimization algorithms, such as COPASI, is able to carry out progress curve analysis directly. To carry out this type of analysis the model must include the reversible reaction, since immediately after the start there will be molecules of both substrate and product present, and therefore the two reactions happen simultaneously. It is also important to include the coupling reactions, when they are used. Indeed, it would be beneficial to include the full kinetic details of the coupling enzyme(s), as it is likely that at some point in a time course they are no longer operating in optimal conditions. But if the assay is designed carefully then the linking reaction can be represented with a fast mass action rate law.
While it is possible to obtain estimates of all parameters of a rate law from a single time course, those estimates are poor. A much more robust method is to perform a global analysis where the same set of parameter values must fit all of the time courses measured. To set up such a procedure in COPASI is similar to the first step of the initial rate described above: the data file must contain all of the trajectories and columns with all of the metabolites whose concentration was changed, plus the variables measured. It is better to use the measured signals (absorbance, fluorescence intensity, etc.) and to include in the model the equations that transform them into concentrations. This allows for factors that are included in such equations to be adjusted as part of the fit, if needed. For the example of yeast TPI, we have included all of the time courses up until the point when the absorbance reaches 3.25 where the detector is saturated and no longer provides a linear relation between signal and concentration. We also removed obvious outliers, in this case an absorbance curve with a negative slope but which should have been positive. At this stage, we need to consider which rate law to use, since through the initial rate analysis we already identified that it should contain substrate inhibition by G3P. The solution is to use either v¼
$$v = \frac{V_f\,\frac{S}{K_{ms}} - V_r\,\frac{P}{K_{mp}}}{\left(1 + \frac{S}{K_{ms}} + \frac{P}{K_{mp}}\right)\left(1 + \frac{S}{K_i}\right)^{4}} \qquad (22.7)$$
for the reaction of G3P to DHAP, or

$$v = \frac{V_f\,\frac{S}{K_{ms}} - V_r\,\frac{P}{K_{mp}}}{\left(1 + \frac{S}{K_{ms}} + \frac{P}{K_{mp}}\right)\left(1 + \frac{P}{K_i}\right)^{4}} \qquad (22.8)$$
for the reaction of DHAP to G3P. The reaction assays used here were planned exclusively for initial rate analysis and could be optimized further for progress curve analysis. For example, the levels of NAD and especially NADH used were fairly low, partly to ensure the linking enzyme was operating close to first order, but also because NADH strongly absorbs light. However, the initial NADH concentration could still be increased two- or threefold to allow the reaction to proceed further toward completion. Ideally, one would like progress curves to reach close to equilibrium, since in this way each curve carries more information about the reverse reaction parameters. In the example presented here, the progress curves of the forward reaction have little information about the reverse, and vice versa. Figures 22.3 and 22.4 show the progress curves of the forward and reverse reactions and their fits.
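Since the two rate laws differ only in which species carries the inhibition term, a single function can serve both fits. A minimal Python sketch (our own naming; we read Eqs. (22.7) and (22.8) as applying the fourth-power inhibition factor to the whole binding polynomial, as written above):

```python
def v_rev_inhib(S, P, Vf, Vr, Kms, Kmp, Ki, inhibitor="S"):
    """Rate law of Eqs. (22.7)/(22.8): `inhibitor` selects whether the
    fourth-power inhibition term contains S (Eq. 22.7) or P (Eq. 22.8)."""
    X = S if inhibitor == "S" else P
    numerator = Vf * S / Kms - Vr * P / Kmp
    denominator = (1.0 + S / Kms + P / Kmp) * (1.0 + X / Ki) ** 4
    return numerator / denominator
```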
[Figure 22.3 appears here: forward progress curves, absorbance versus time.]
Figure 22.3 Progress curve analysis of the forward reaction of triosephosphate isomerase (EC 5.3.1.1). All curves were fit simultaneously to Eq. (22.8) and consequently share the same values for the kinetic parameters.
[Figure 22.4 appears here: reverse progress curves, absorbance versus time.]
Figure 22.4 Progress curve analysis of the reverse reaction of triosephosphate isomerase (EC 5.3.1.1). All curves were fit simultaneously to Eq. (22.7) and consequently share the same values for the kinetic parameters.
It is clear by eye that the fit in the reverse direction is a worse approximation overall; a plot of residuals is not needed to reveal this (though residuals are usually the best way to assess the quality of a fit). A summary of all results obtained in the initial rate analysis and in the two progress curve analyses is presented in Table 22.1. If we take the parameters obtained by initial rate analysis as the most reliable, then one can conclude that the two progress curve analyses were able to obtain some parameter values in the correct range, but not those of the reaction in the opposite direction. This is because the progress curves ended quite far from equilibrium; had the assays been designed for progress curve analysis, those estimates would likely have been better. Despite this, the estimates for the substrate inhibition constant of G3P are quite consistent across the three methods. The hardest kinetic data to find published are parameters for the reverse reaction (the direction that is less favorable thermodynamically). Progress curves are ideal for revealing at least some information about the parameters of the reverse reaction when it is not feasible to run assays in that direction. This is a problem for reactions that have a strong energetic drive in one direction (such as those catalyzed by many kinases), but also when the products of the reaction are not available commercially. Obviously, when possible, one should run the reaction in both directions, as the data obtained that way are of higher quality (whether based on initial rates or progress curves); but when this is not possible, the availability of progress curve data is a much appreciated gift to the modeler.
Table 22.1 Summary of kinetic parameters of yeast triosephosphate isomerase (EC 5.3.1.1) determined by initial rate and progress curve analyses. Values are given as estimate ± standard deviation (coefficient of variation in brackets)

Parameter | Initial rate | Progress curves (forward) | Progress curves (reverse)
Km DHAP | 6.43 ± 0.186 (2.89%) | 8.82 ± 1.19 (13.5%) | 9.21×10⁻³ ± 2.36×10⁻³ (25.6%)
Km G3P | 5.25 ± 0.635 (12.1%) | 2.70×10⁻³ ± 0.759×10⁻³ (28.1%) | 10.4 ± 0.925 (8.90%)
Ki,G3P | 35.1 ± 1.07 (3.06%) | 16.0 ± 1.10 (6.84%) | 25.3 ± 0.528 (2.08%)
Vf | 0.853×10⁻³ ± 7.12×10⁻⁶ (0.835%) | 0.938×10⁻³ ± 0.127×10⁻³ (13.6%) | 1.00×10⁻⁸ ± 6.17×10⁻⁷ (6150%)
Vr | 0.446×10⁻³ ± 22.4×10⁻⁶ (5.03%) | 1.00×10⁻⁸ ± 8.75×10⁻⁷ (8750%) | 1.20×10⁻³ ± 0.108×10⁻³ (9.02%)
6. Concluding Remarks

Systems biology has produced many innovative experimental and computational technologies that are revolutionizing research. But it is also creating a stronghold for a technology that is very well established and has a strong theoretical foundation: enzyme kinetics. In our own laboratory we have embarked on a large-scale effort to obtain enzyme kinetic data for the purpose of constructing models of metabolism. The objective, however, is clearly to learn more about how cells work by means of computational models, and not about the mechanisms of catalysis, except where they reveal themselves to be important for cellular function. Computational systems biology has made considerable advances recently and appears poised to enter an exponential growth phase, fueled by a strong community that grew out of the standardization efforts. The technologies of the semantic Web are already having an impact on this field, and more is to be expected (Kell and Mendes, 2008). Computational modeling and simulation software are becoming more and more sophisticated, allowing one to carry out computations that would have been unbelievable only a couple of decades ago. These advances are also benefiting enzyme kinetics data analysis, and we foresee a time when the concept of "gene function" becomes synonymous with the kinetics of its protein product embedded in the cellular biochemical network.
ACKNOWLEDGMENTS

We are grateful to many colleagues for discussions about this topic, in particular Neil Swainston, Juergen Pahle, and Douglas B. Kell. COPASI is a collaborative project with Ursula Kummer and Sven Sahle (University of Heidelberg). PM and SH thank the National Institute for General Medical Sciences for financial support (R01 GM080219), PM and NM thank the BBSRC and EPSRC for funding the MCISB (BB/C008219/1), and PM and HM thank the BBSRC for funding through grant BB/F003501/1. This is a contribution from the Manchester Centre for Integrative Systems Biology.
REFERENCES

Bergmeyer, H. U., et al. (1974). Enzymes as biochemical reagents. In "Methods of Enzymatic Analysis" (H. U. Bergmeyer, ed.), Vol. I, pp. 425-522. Academic Press, New York, NY.
Chance, B., et al. (1960). Metabolic control mechanisms. V. A solution for the equations representing interaction between glycolysis and respiration in ascites tumor cells. J. Biol. Chem. 235, 2426-2439.
Cornish-Bowden, A. (1986). Why is uncompetitive inhibition so rare? A possible explanation, with implications for the design of drugs and pesticides. FEBS Lett. 203, 3-6.
Duggleby, R. G. (1985). Estimation of the initial velocity of enzyme-catalysed reactions by non-linear regression analysis of progress curves. Biochem. J. 228, 55-60.
Eisenthal, R., and Cornish-Bowden, A. (1974). The direct linear plot. A new graphical procedure for estimating enzyme kinetic parameters. Biochem. J. 139, 715-720.
Funahashi, A., et al. (2003). CellDesigner: A process diagram editor for gene-regulatory and biochemical networks. Biosilico 1, 159-162.
Gelperin, D. M., et al. (2005). Biochemical and genetic analysis of the yeast proteome with a movable ORF collection. Genes Dev. 19, 2816-2826.
Ghaemmaghami, S., et al. (2003). Global analysis of protein expression in yeast. Nature 425, 737-741.
Gillespie, D. T. (1977). Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81, 2340-2361.
Herrgård, M. J., et al. (2008). A consensus yeast metabolic network reconstruction obtained from a community approach to systems biology. Nat. Biotechnol. 26, 1155-1160.
Hoops, S., et al. (2006). COPASI: A complex pathway simulator. Bioinformatics 22, 3067-3074.
Hucka, M., et al. (2003). The systems biology markup language (SBML): A medium for representation and exchange of biochemical network models. Bioinformatics 19, 524-531.
Kell, D. B., and Mendes, P. (2008). The markup is the model: Reasoning about systems biology models in the Semantic Web era. J. Theor. Biol. 252, 538-543.
Köhler, J., et al. (2006). Graph-based analysis and visualization of experimental results with ONDEX. Bioinformatics 22, 1383-1390.
Krietsch, W. K. (1975). Triosephosphate isomerase from yeast. Methods Enzymol. 41, 434-438.
Kuzmic, P. (1996). Program DYNAFIT for the analysis of enzyme kinetic data: Application to HIV proteinase. Anal. Biochem. 237, 260-273.
Le Novère, N., et al. (2005). Minimum information requested in the annotation of biochemical models (MIRIAM). Nat. Biotechnol. 23, 1509-1515.
Le Novère, N., et al. (2009). The systems biology graphical notation. Nat. Biotechnol. 27, 735-741.
Levenberg, K. (1944). A method for the solution of certain nonlinear problems in least squares. Quart. Appl. Math. 2, 164-168.
Liebermeister, W., and Klipp, E. (2006). Bringing metabolic networks to life: Convenience rate law and thermodynamic constraints. Theor. Biol. Med. Model. 3, 41.
Lineweaver, H., and Burk, D. (1934). The determination of enzyme dissociation constants. J. Am. Chem. Soc. 56, 658-666.
Marquardt, D. W. (1963). An algorithm for least squares estimation of nonlinear parameters. SIAM J. 11, 431-441.
Mendes, P., and Kell, D. (1998). Non-linear optimization of biochemical pathways: Applications to metabolic engineering and parameter estimation. Bioinformatics 14, 869-883.
Michaelis, L., and Menten, M. L. (1913). Die Kinetik der Invertinwirkung. Biochem. Z. 49, 333-369.
Rhoads, D. G., et al. (1968). A method of calculating time-course behavior of multi-enzyme systems from the enzymatic rate equations. Comput. Biomed. Res. 2, 45-50.
Rodriguez-Fernandez, M., et al. (2006). A hybrid approach for efficient and robust parameter estimation in biochemical pathways. Biosystems 83, 248-265.
Rojas, I., et al. (2007). Storing and annotating of kinetic data. In Silico Biol. 7, S37-S44.
Runarsson, T., and Yao, X. (2000). Stochastic ranking for constrained evolutionary optimization. IEEE Trans. Evol. Comp. 4, 284-294.
Sauro, H. M., et al. (2003). Next generation simulation tools: The Systems Biology Workbench and BioSPICE integration. Omics 7, 355-372.
Schilling, C. H., et al. (1999). Metabolic pathway analysis: Basic concepts and scientific applications in the post-genomic era. Biotechnol. Prog. 15, 296-303.
Shannon, P., et al. (2003). Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498-2504.
Westley, A. M., and Westley, J. (1996). Enzyme inhibition in open systems. Superiority of uncompetitive agents. J. Biol. Chem. 271, 5347-5352.
CHAPTER TWENTY-THREE

Fitting Enzyme Kinetic Data with KinTek Global Kinetic Explorer

Kenneth A. Johnson
Department of Chemistry and Biochemistry, Institute for Cell and Molecular Biology, University of Texas, Austin, Texas, USA

Contents
1. Background
2. Challenges of Fitting by Simulation
3. Methods
3.1. Defining the model
3.2. Defining each experiment
3.3. Defining output factors
3.4. A note on units
3.5. Information content of data
3.6. A note on statistics
4. Progress Curve Kinetics
5. Fitting Full Progress Curves
5.1. Error analysis
6. Slow Onset Inhibition Kinetics
7. Summary
Acknowledgments
References
Abstract

KinTek Global Kinetic Explorer software offers several advantages in fitting enzyme kinetic data. Behind the intuitive graphical user interface lie fast and efficient algorithms that perform numerical integration of rate equations, so that kinetic parameters or starting concentrations can be scrolled while the time dependence of the reaction is dynamically updated in the graphical display. This immediate feedback between the model and the output provides a powerful tool for learning kinetics, for exploring the complex relationships between rate constants and the observable signals, and for fitting data. Dynamic simulation provides an easy means to obtain starting estimates for kinetic parameters before fitting by nonlinear regression and to explore parameter space after a fit is achieved. Moreover, the fast algorithms for numerical integration allow for
the brute force computation of confidence contours to provide reliable estimates of the range over which parameters can vary, which is especially important because it reveals when parameters are not well constrained. As illustrated by several examples outlined here, standard nonlinear regression methods fail to detect when parameters are not constrained by the data and generally produce standard error estimates that are extremely misleading. This brings forth an important distinction between a "good" fit, where a minimum χ² is achieved, and one where all variable parameters are well constrained by virtue of sufficient information content in the data. These concepts are illustrated by example in fitting full progress curve kinetics and in fitting the time dependence of slow-onset inhibition.
1. Background

Fitting kinetic data based upon numerical integration of rate equations has several advantages over conventional fitting to mathematical functions derived by analytical solution of the rate equations (Barshop et al., 1983; Johnson et al., 2009a,b; Zimmerle and Frieden, 1989). In particular, by fitting primary data directly to a model by computer simulation, all aspects of the data are included in the fitting process, including the rates as well as the amplitudes of the reactions, without any simplifying assumptions. In contrast, conventional data fitting depends upon solving mathematical expressions to define the time and concentration dependence of the reaction. Solving these expressions usually requires simplifying assumptions that may only be valid to a first approximation. For example, steady-state kinetic methods assume that one can measure an initial velocity without significant changes in the concentrations of substrate or product, and this restricts data collection to the early stages of the reaction where the signal amplitude is low. Integration of differential equations for fitting pre-steady-state kinetic data usually requires construction of a simplified model with no more than two or three kinetically significant steps because of the complexity of the math, producing one exponential phase for each step. In either case, fitting the primary data to measure the rates of reaction is usually followed by subsequent analysis of the concentration dependence of the observed rates. By this process, one fits the data to multiple equations and parameters, some of which are redundant in their information content and many of which are subsequently discarded (e.g., in plotting the concentration dependence of only the observed rate and ignoring the amplitude of a reaction). As an end result, errors are compounded, or worse yet, glossed over in reaching mechanistic conclusions. As a point of contrast, we consider the fitting of data defining the formation of a quinonoid species upon reaction of serine with pyridoxal phosphate in the first step of the beta reaction of tryptophan synthase.
The reaction can be monitored by fluorescence stopped-flow and fit to a simple two-step model (Anderson et al., 1991) to derive all four rate constants:

E + S ⇌ ES ⇌ EA   (rate constants k1/k−1 and k2/k−2)
By conventional methods, each transient obtained at a different serine concentration was fit to a double-exponential function with five unknown variables (two amplitudes, two rates, and an endpoint):

$$Y = A_1 e^{-\lambda_1 t} + A_2 e^{-\lambda_2 t} + C$$

This fitting ignores the relationships between the rates and amplitudes that are inherent in the data set and therefore increases the errors in extracting the two rates, λ1 and λ2. The rates of the fast and slow reaction phases were then plotted as a function of substrate concentration and fitted to equations obtained by solving the differential equations for the two-step reaction. In the end, the data, consisting of transients collected at four different substrate concentrations, were fit to a total of 23 independent parameters, yet only three of the rate constants could be estimated from this analysis. A fourth rate constant was estimated by analysis of the reaction amplitudes (Anderson et al., 1991) to define the net equilibrium constant K1K2. Data fitting based upon numerical integration of rate equations overcomes the many limitations of conventional data fitting. In this process, primary data, consisting of the observable signal as a function of time at several substrate concentrations, are fit globally to the model, including appropriate output factors to scale the observable signal to the absolute concentrations of reactants. In the tryptophan synthase example, the data set can be fit directly to the model to derive all four rate constants and two fluorescence scaling factors, where the observed fluorescence was attributable to the formation and decay of the ES complex: F = F0 + ΔF·[ES], as described in detail in Johnson et al. (2009a). Moreover, the full extent to which individual kinetic parameters are constrained by the data was revealed by analysis of the confidence contours derived by monitoring the sum square error as parameters are systematically varied while fitting the data (Johnson et al., 2009b).
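For reference, the conventional first stage of that analysis is easily reproduced with generic tools; the following Python sketch (assuming SciPy; the data here are synthetic placeholders) fits one transient to the double-exponential function above:

```python
import numpy as np
from scipy.optimize import curve_fit

def double_exp(t, A1, lam1, A2, lam2, C):
    # Y = A1*exp(-lambda1*t) + A2*exp(-lambda2*t) + C
    return A1 * np.exp(-lam1 * t) + A2 * np.exp(-lam2 * t) + C

# One synthetic transient standing in for a stopped-flow trace
t = np.linspace(0.0, 1.0, 500)
y = double_exp(t, 0.5, 50.0, 0.2, 5.0, 1.0) + np.random.normal(0.0, 0.01, t.size)
popt, pcov = curve_fit(double_exp, t, y, p0=[0.4, 40.0, 0.1, 4.0, 1.0])
```

Note that each transient fit this way yields five parameters, of which only the two rates are typically carried forward, which is exactly the information loss the global approach avoids.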
2. Challenges of Fitting by Simulation

There are two competing challenges in fitting data by computer simulation; namely, a model must be complete enough to provide an adequate description of the underlying mechanism, but not more complex than can be supported by the data. A complete model is required so that the data fitting is built upon a realistic mechanism without unsupported simplifying assumptions. Even with a realistic minimal model, not all of the rate constants may be known or constrained by the data. Accordingly, one needs a good understanding of what can be determined in fitting the data and how to set up the system to extract meaningful information. Most importantly, after a good fit has been obtained, it is essential that the model and the parameter set be carefully evaluated to estimate how well each of the kinetic parameters is constrained by the data. Here, an important distinction must be made between a good fit and well-constrained parameters. A good fit is achieved when the minimum χ² value derived by nonlinear regression reflects the sigma value of the original data (Bates and Watts, 1988). However, if an equally good fit can be achieved with a different set of parameters, then the parameters are not well constrained. In this chapter, the concept of the information content of data will be introduced: how many constants can be determined from a given set of data, and specifically which rate constants are constrained by the data? This is the most important question, and one that is too often overlooked in fitting kinetic data to a model. With modern computer programs, it is far too easy to define an overly complex model with many parameters that are not determined by the data. Although one expects that the standard error analysis from nonlinear regression should indicate when parameters are ill-defined, this approach usually fails when fitting multiple parameters that are not well constrained by the data (Johnson et al., 2009b). Thus, one must clearly define what is known, what is not known, and what simplifying assumptions were made to enable the data to be fit. The process of developing a model and fitting experimental data will be illustrated with several examples in this chapter. Confidence contour analysis will be used to show what happens when parameters are not well constrained and how the problems can be overcome by performing additional experiments or by simplifying the model. Although the KinTek Explorer professional version is offered for sale to defray the programming costs, a free student version is available at www.kintek-corp.com, which includes an extensive instruction manual describing the operation of the program in more detail than can be given here. In addition, each of the examples in this manuscript illustrating the use of simulation and data fitting is included with the simulation program in the examples folder of the software available online. Many of the concepts explained here are best appreciated by simply running the program, opening the appropriate file, and adjusting the rate constants and output factors to see how the curves change in shape. One unique feature of KinTek Explorer is the ability of the user to click with the mouse on a rate constant, starting concentration, or signal output factor and scroll the value up and down while simultaneously observing the changes in the shape of the output curves. This dynamic simulation provides rich feedback to help learn kinetics and to
provide initial estimates of kinetic parameters for fitting by nonlinear regression. Perhaps more importantly, dynamic simulation affords a powerful means to explore parameter space: to see how well individual constants are constrained by the data, to examine whether individual parameters are linked to one another, and to search a wide range of parameter space for alternative values that fit the data. In this short review, details about how to perform simulations and fit data will be given only in general terms, since the manual provided with KinTek Explorer gives the necessary instructions on how to use the software. Rather, the approach of using simulation to fit data will be illustrated by examples. In particular, the examples show how tools unique to KinTek Explorer can be used to evaluate the extent to which parameters are constrained by the data, and then to use that information either to design new experiments that fill the gaps in knowledge or to understand how the model must be reduced to be in line with the information inherent in the data.
3. Methods

In fitting kinetic data there is no substitute for a sound understanding of the principles of experimental design and interpretation. Nonetheless, by use of kinetic simulators in general, and KinTek Explorer in particular, many of the pitfalls in interpretation can be avoided. Every week, models are published that are simply not consistent with the data. These errors could be avoided by fitting data using computer simulation, because all elements of the data must be consistent with the model to achieve a good fit. Moreover, the simulation program itself serves as a valuable learning tool. Prior to performing any experiments, the user can run a simulation and see what results might be obtained from a given experiment based upon different underlying models, as described below (see ahead to Fig. 23.1). One can also readily see the effects of changing substrate concentrations or rate constants on the observable outputs. In this way, intuition can be developed that helps a great deal in deciphering more complex kinetic data to divine the simplest model. A simulation is based upon four required elements: a model, a set of starting concentrations of reactants, an observable output function, and a set of rate constants. In using simulation to fit data, one seeks a minimal model and a set of unique rate constants that quantitatively account for the observable data.
3.1. Defining the model

To begin the simulation, the reaction sequence is entered using a simple text description. For example, the reaction in Scheme 23.1
E + S ⇌ ES ⇌ EP ⇌ E + P   (rate constants k1/k−1, k2/k−2, k3/k−3)

Scheme 23.1
is entered simply as: E + S = ES = EP = E + P. The program then solves the differential equations and sets up the necessary equations for performing the numerical integration to simulate the time dependence of the reaction. Each enzyme species must have a unique, user-defined description consisting of case-sensitive alphanumeric characters plus the special characters , $, and #. Multiple steps of the reaction can be written on one continuous line as long as mass balance is maintained for each reaction. More complex pathways involving two substrates and two products, such as EPSP synthase, require multiple lines to maintain mass balance (Anderson et al., 1988):

E + A = EA
EA + B = EAB = EI = EPQ = EQ + P
EQ = E + Q
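What the program builds from that one line of text is an ordinary mass-action ODE system. A hand-written Python equivalent for Scheme 23.1 (assuming SciPy; the rate constants here are those of Table 23.2, line a, and the starting concentrations are arbitrary placeholders) would look like:

```python
import numpy as np
from scipy.integrate import solve_ivp

def scheme_23_1(t, y, k1, km1, k2, km2, k3, km3):
    E, S, ES, EP, P = y
    v1 = k1 * E * S - km1 * ES   # E + S <-> ES
    v2 = k2 * ES - km2 * EP      # ES <-> EP
    v3 = k3 * EP - km3 * E * P   # EP <-> E + P
    return [-v1 + v3, -v1, v1 - v2, v2 - v3, v3]

y0 = [1.0, 10000.0, 0.0, 0.0, 0.0]            # E, S, ES, EP, P (in uM)
k = (10.0, 400.0, 180.0, 20.0, 1200.0, 10.0)  # Table 23.2, line a
sol = solve_ivp(scheme_23_1, (0.0, 200.0), y0, args=k, method="LSODA",
                t_eval=np.linspace(0.0, 200.0, 400))
```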
3.2. Defining each experiment

An experiment is defined by specifying the starting concentrations of the reactants and the signal that is measured, much the same way an experiment is defined in the laboratory. In fact, every aspect of the simulation should mimic the experimental details underlying the original data collection. In the software, the starting concentrations of reactants are entered into a table created by the program based upon the mechanism. In addition, if two or more reactants are allowed to equilibrate before additional reactants are added, this can easily be programmed by using multiple mixing steps. This is valuable because the rate constants governing the initial equilibration are then included in the process of fitting the data. For example, in studies on DNA polymerases, we often incubated enzyme with DNA and then added the nucleotide substrate. In some experiments, the data fitting defines the DNA dissociation rate according to constraints imposed both during the pre-incubation phase, which determines the amplitude of the reaction, and
during the subsequent phase, where the rate of multiple turnovers is limited by DNA release. A good example of this is the work on the inhibition of HIV reverse transcriptase by nonnucleoside inhibitors, which bind slowly to the enzyme (Spence et al., 1995). Fitting of the original data by simulation is given in the example file HIV_NNRTI.mec, provided with the software.

Figure 23.1 Progress curve kinetics. Curves were calculated by numerical integration to illustrate the changes in the shape of the curves dependent upon the kinetic parameters. All curves were computed with 1 µM enzyme, 10 mM substrate (unless noted), and the kinetic constants given in Table 23.1. (A) Effect of variable Km (Km = 0.1, 0.5, 1, 2, 5, 10 mM). (B) Effect of product inhibition (k−3 = 0, 1, 2, 5 µM⁻¹ s⁻¹). (C) Effect of reversible chemistry (k−2 = 0, 2, 5, 10, 20 s⁻¹, with k−3 = 5 µM⁻¹ s⁻¹). (D) Variable substrate concentration ([S] = 5, 10, 20 mM); the dotted line shows the simulation with irreversible chemistry and irreversible product release, and other constants given in Table 23.1.
3.3. Defining output factors

An essential part of the definition of an experiment is specifying the properties of the output signal. All simulations are performed in absolute concentrations of reacting species. One must then define an output expression that relates concentrations of species to observable signals. In a rapid quench-flow experiment, or another method based upon quenching a sample and quantifying the amount of product formed, the output may be the sum of all species containing the product. For example, for Scheme 23.1, total product will be defined by the sum EP + P, because upon quenching the reaction, product bound to the enzyme will be released. In the case of EPSP synthase, total product Q will be defined by the sum EPQ + EQ + Q. If there is an absorbance change upon conversion of substrate to product, the signal will be defined by the difference in extinction coefficients: a·(ES + S) − b·(EP + P). On the other hand, if one is monitoring a change in protein fluorescence across different enzyme-bound states, the net signal will be defined by a different fluorescence coefficient for each species: a·E + b·ES + c·EP. In this case, it is often useful to normalize the fluorescence relative to the starting enzyme and include a scaling factor: f·(E + b·ES + c·EP). Defining the output expression in this manner helps the user keep track of the relative fluorescence change while fitting data, and thereby avoids the pitfall of fitting data with an inordinately large fluorescence coefficient and a correspondingly low concentration of the species. Possible output expressions for Scheme 23.1 include:

Fluorescence: Signal = f·(E + b·ES + c·EP)
Burst of product formation: Signal = EP + P
Absorbance of S and P: Signal = a·(S + ES) + b·(P + EP)

The output coefficients can readily be derived as unknowns during the fitting process, but care must be taken to define a minimal output expression. For example, an output expression of the form f·(a·E + b·ES + c·EP) is overdefined and has an infinite number of solutions, since, for example, any combination of f and a can give a desired constant.
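In code terms, an output expression is simply a map from simulated species concentrations to an observable; for example (names and coefficients are ours, for illustration only):

```python
def fluorescence_signal(E, ES, EP, f, b, c):
    # Signal = f * (E + b*ES + c*EP), normalized to the starting enzyme
    return f * (E + b * ES + c * EP)

def quenched_product(EP, P):
    # Rapid-quench output: enzyme-bound product is released on quenching
    return EP + P
```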
3.4. A note on units

Units of time and concentration can be whatever is convenient for the experiments, but there must be consistency: all concentrations must be in the same units and must correspond to the dimensions of the second-order
rate constants. Similarly, all rate constants must be entered in the units of time chosen for the experiment. For most enzymes, concentration units of micromolar and time in seconds are most appropriate, such that second-order rate constants are given in units of µM⁻¹ s⁻¹ (10⁶ M⁻¹ s⁻¹). In these units, the diffusion limit for substrate binding is approximately 1000 µM⁻¹ s⁻¹, and a conservative estimate may be 100 µM⁻¹ s⁻¹. First-order rate constants typically range from 0.001 s⁻¹ to 10,000 s⁻¹ for observable enzyme-catalyzed reactions. One can easily adopt different units for time and concentration, and it is advisable to keep entered numbers in the range of 1e-6 to 1e6, in part to avoid round-off errors in the math, but also to afford easier and therefore less error-prone data entry. Even though all math is done in 64-bit double precision, avoiding extremely large or small numbers in data entry will minimize round-off errors.
3.5. Information content of data

Understanding the information content of data is important to prevent overinterpretation. KinTek Explorer offers several unique tools to assess whether fitted parameters are well constrained by the data (i.e., whether the model is overly complex). This question is distinct from whether a good fit can be achieved, and it cannot be answered based upon whether nonlinear regression returns small estimates of standard error; standard error calculations fail when multiple parameters are underconstrained. Rather, we address these questions by exploring parameter space: scrolling rate constants to see how strongly the curves depend upon a given parameter, assessing whether certain parameters may be linked, and checking whether a very different area of parameter space may contain another good fit to the data. Finally, we also rely upon computation of the confidence contours by quantifying how the total sum square error surface varies as a function of individual parameters. There are currently 50 example files included with the software online that illustrate the use of KinTek Explorer in fitting data from multiple experiments, based largely on transient kinetic data. Here, the program will be illustrated using methods from the field of steady-state kinetics, where rigorous fitting is nonetheless greatly facilitated by use of computer simulation; namely, the analysis of full progress curves and the fitting of slow-onset inhibition.
3.6. A note on statistics

Data fitting by nonlinear regression analysis is based upon finding the minimum sum square error, defined as the sum of the squared residuals:

$$SSE = \sum_{i=1}^{N} \left[ y_i - y(x_i) \right]^2$$

where yi is the observed data, y(xi) is the calculated value at the ith data point, and N is the number of data points. When the standard deviation (sigma) values for the data are known, the residuals are normalized by dividing by sigma to compute χ²:

$$\chi^2 = \sum_{i=1}^{N} \left( \frac{y_i - y(x_i)}{\sigma_i} \right)^2$$
When the sigma values are not known, it is often assumed that sigma is constant across all of the data, which allows one to compute an average sigma value:

$$\sigma_{AVE}^2 = \frac{\chi^2}{N - M} = \frac{\sum_{i=0}^{N-1} \left[ y_i - y(x_i) \right]^2}{N - M}$$
where N is the number of data points and M is the number of parameters being fit to the data. In the examples shown here, we will use these three quantities to evaluate goodness of fit. In particular, the calculated average sigma value can be compared to the sigma values input when generating artificial data.
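These three quantities are one-liners in any numerical environment; a Python sketch mirroring the definitions above (assuming NumPy):

```python
import numpy as np

def sse(y_obs, y_calc):
    return np.sum((np.asarray(y_obs) - np.asarray(y_calc)) ** 2)

def chi2(y_obs, y_calc, sigma):
    return np.sum(((np.asarray(y_obs) - np.asarray(y_calc)) / sigma) ** 2)

def sigma_ave(y_obs, y_calc, n_params):
    # Average sigma assuming constant error, with N - M degrees of freedom
    return np.sqrt(sse(y_obs, y_calc) / (len(y_obs) - n_params))
```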
4. Progress Curve Kinetics

Standard steady-state kinetic analysis is based upon error-prone estimates of initial velocities, restricting data to the first 10-20% of the reaction and requiring that the initial slope be measured before the reaction becomes nonlinear. The initial velocities must then be plotted as a function of substrate (and perhaps inhibitor) concentration and fit to another set of equations to extract kcat and Km values, all the while being careful to propagate error estimates. These time-consuming methods can be replaced by direct fitting of the primary data to the model using computer simulation (Johnson et al., 2009a). Moreover, once a reaction is started, it can be followed to completion to get the most information from each sample. In particular, analysis of the full progress curve as the reaction goes to completion allows definition of the kcat and Km values for substrate and, in some cases, the Kd for product inhibition, or possibly kcat and Km for the reverse reaction. Fitting the full progress curve is not new; in fact, Michaelis and Menten (1913) fit their data to the integrated form of the rate equation in their landmark 1913 paper. However, the ease and utility of fitting based upon computer simulation finally make fitting of full progress curves the preferred method, rather than restricting attention to initial velocities.
In order to understand the information content of progress curve kinetic data, we begin with the analysis of a simple enzyme-catalyzed reaction and compare the effects of different sets of rate constants, as shown in Fig. 23.1, using the constants summarized in Table 23.1. This illustrates the use of KinTek Explorer to explore the landscape before doing an experiment, which can aid in defining optimal reaction conditions. If product inhibition is negligible, then the shape of the curvature is determined solely by the substrate concentration dependence of the rate. Figure 23.1A shows the effect of variable Km on the shape of the progress curve. As Km is increased, the curvature becomes more pronounced. Of course, the observed shape is also dependent upon the concentrations of enzyme and substrate, so if the Km is at the lower limit of what is shown in Fig. 23.1A, then the experiment needs to be repeated at a lower enzyme concentration and lower substrate concentration to resolve the curvature. Thus, the first step in fitting data is to collect data in the optimal region of concentration-time space to reveal the underlying parameters. Note, for example, that if one attempted to extract Km from the lowest-Km curve in Fig. 23.1A, one could only place an upper limit on the estimated value of Km. In practice, however, this initial estimate could be used to design an experiment at lower concentration, optimized to measure the lower Km, a goal that can be accomplished easily using the simulation program. Product inhibition also changes the shape of the curves, as shown in Fig. 23.1B, which illustrates the effect of decreasing the Kd for product rebinding. A priori, one does not know whether the curvature in the time dependence is due to a higher Km for substrate or a lower Kd for product rebinding. Therefore, experiments must be done at several concentrations of substrate and/or product to resolve the two parameters, a process that can be achieved easily by global fitting of the family of curves, as illustrated below. The variation in the curvature as a function of starting substrate concentration or added product provides the information content needed to define both the Km for substrate and the Kd for product. If the chemical reaction is reversible (k−2 > 0), then the amplitude of the reaction is also affected (as shown in Fig. 23.1C), which provides the additional information necessary to define the overall equilibrium constant and therefore derive kcat and Km values in both the forward and reverse directions. However, the extent to which k−2 can be defined based upon forward-rate measurements depends upon the magnitude of its effect on the observed reaction, as described below. As a general rule, it is necessary to measure full progress curves at several starting substrate concentrations, as shown in Fig. 23.1D. Here, the fact that the curvature is different at the three substrate concentrations provides the information to define product inhibition. For comparison, the dotted lines in Fig. 23.1D show the case where there is no product inhibition, and one can see that the curved portion could be superimposed for each of the concentrations.
Table 23.1 Rate constants for computing progress curves (a)

Figure | k1 (µM⁻¹ s⁻¹) | k−1 (s⁻¹) | k2 (s⁻¹) | k−2 (s⁻¹) | k3 (s⁻¹) | k−3 (µM⁻¹ s⁻¹)
A | 10 | Variable | 120 | 0 | 10,000 | 0
B | 10 | 50,000 | 120 | 0 | 10,000 | 0, 1, 2, 5
C | 10 | 50,000 | 120 | 0, 2, 5, 10, 20 | 10,000 | 5
D | 10 | 20,000 | 120 | 20 | 10,000 | 5
D (dots) | 10 | 20,000 | 120 | 0 | 10,000 | 0

(a) Curves displayed in Fig. 23.1 were calculated using Scheme 23.1 and the rate constants summarized here. The variable Km values of 0.1, 1, 2, 5, and 10 mM in Fig. 23.1A were obtained by varying k−1 from 1000 to 100,000 s⁻¹.
If the kinetics are measured using a coupled enzyme assay, so that product does not accumulate, the data would follow the dotted line and can be fit to extract only the kcat and Km values for the forward reaction, without complications due to product inhibition. Global fitting of several progress curves simultaneously allows definition of all relevant kinetic parameters.
5. Fitting Full Progress Curves

In fitting steady-state or full progress curve kinetics, all that can be determined are kcat and Km values, possibly in both the forward and reverse directions depending upon the reversibility of the reaction and the properties of the data. Accordingly, one can only fit the data to extract two or four constants. However, the minimal model (Scheme 23.1) contains three steps and six rate constants. One easy approach is to simply fit to a model with all six rate constants as variable parameters and then calculate kcat and Km values. However, the set of six rate constants will not be unique, and it will not be possible to estimate errors on the kcat and Km values. That is, one could arbitrarily (within some limits) choose another set of six rate constants that fits the data and yields the same kcat and Km values. In fact, it is a useful exercise to fit a given set of data using multiple sets of rate constants and show that the same kcat and Km values are obtained. This was done in the analysis of alanine racemase data (Johnson et al., 2009a,b) in order to refute claims that eight rate constants could be extracted from the progress curve data using Dynafit (Johnson et al., 2009a,b; Spies and Toney, 2007; Spies et al., 2004) and subsequent claims of fitting 18 rate constants (Spies and Toney, 2007). The alanine racemase example illustrates how easy it is to be misled when fitting multiple parameters to a data set without carefully considering the distinction between a good fit and one in which the parameters are constrained by the data. In order to estimate errors on parameters, simplifications are needed to reduce the number of variables to correspond to the information content of the data. If there is no product inhibition and the reaction is largely irreversible, one can only get kcat and Km for the forward reaction, and one progress curve would be sufficient. Better yet, full progress curves recorded at several substrate concentrations, or in the presence and absence of added product, would improve confidence in the parameters. In order to develop a general method for fitting progress curve kinetics, one must allow for the possibility of product inhibition and reversal of the chemical reaction. Therefore, the fitting procedure must provide estimates of kcat and Km in both the forward and reverse directions. This still entails fitting only four constants to a minimal model containing six rate constants.
One method to reduce the number of variable parameters involves setting the second-order rate constants for substrate and product binding at the diffusion limit. Under these conditions, the rates of product release and substrate release are then much greater than kcat, so that k2 and k−2 limit the net rate of turnover in each direction.
E + S ⇌ ES ⇌ EP ⇌ E + P, with the binding steps fixed at the diffusion limit (k1 = k−3 = 100 µM⁻¹ s⁻¹) and k−1, k2, k−2, and k3 as the variable parameters

Scheme 23.2
By fitting the data to this rapid equilibrium binding model, Km,S = k−1/k1 and kcat = k2 for the forward reaction, and Km,P = k3/k−3 and kcat,rev = k−2 for the reverse. It is important to note that this does NOT imply that the rapid equilibrium binding model necessarily represents a valid description of the elementary rate constants. Rather, it serves only as a tool to extract the steady-state kinetic parameters. To illustrate this approach to fitting progress curve kinetics, artificial data were generated based upon the model shown below. The time course of reaction was simulated and random noise was added (sigma = 0.02) to generate data at three substrate concentrations, as shown in Fig. 23.2A, using the constants shown in Table 23.2:
E + S ⇌ ES ⇌ EP ⇌ E + P
k1 = 10 µM⁻¹ s⁻¹, k−1 = 400 s⁻¹; k2 = 180 s⁻¹, k−2 = 20 s⁻¹; k3 = 1200 s⁻¹, k−3 = 10 µM⁻¹ s⁻¹
The data were then fit to a model assuming diffusion-limited substrate and product binding steps (fixed at 100 µM⁻¹ s⁻¹), so that only the remaining four rate constants were allowed to float during fitting. The following parameters were derived:
E + S ⇌ ES ⇌ EP ⇌ E + P
k1 = (100 µM⁻¹ s⁻¹, fixed), k−1 = 4780 s⁻¹; k2 = 157 s⁻¹, k−2 = 13.9 s⁻¹; k3 = 11,400 s⁻¹, k−3 = (100 µM⁻¹ s⁻¹, fixed)
From this model it is easy to calculate that kcat = k2 = 157 s⁻¹ and Km,S = k−1/k1 = 47.8 µM for the forward reaction, and kcat,rev = k−2 = 13.9 s⁻¹ and Km,P = k3/k−3 = 114 µM for the reverse reaction. These are the same steady-state kinetic constants that are calculated from the starting model. Moreover, by limiting the number of parameters used in fitting to correspond to the information content of the data, the standard error estimates derived in fitting apply directly to the kcat and Km values. The data can also be fit to the model in which all six rate constants are varied. One such fit is shown below, which yields the same kcat and Km values in each direction:
E + S ⇌ ES ⇌ EP ⇌ E + P
k1 = 4.4 µM⁻¹ s⁻¹, k−1 = 241 s⁻¹; k2 = 2000 s⁻¹, k−2 = 132 s⁻¹; k3 = 178 s⁻¹, k−3 = 1.85 µM⁻¹ s⁻¹
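The overall strategy—simulate each trajectory by numerical integration and minimize the pooled residuals with the binding steps pinned at 100 µM⁻¹ s⁻¹—can be sketched with generic tools as follows (a simplified Python illustration using SciPy, not KinTek Explorer's algorithm; the noise level of 20 µM is our reading of sigma = 0.02 in mM units):

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

def simulate(tpts, S0, k1, km1, k2, km2, k3, km3, E0=1.0):
    def rhs(t, y):
        E, S, ES, EP, P = y
        v1 = k1 * E * S - km1 * ES
        v2 = k2 * ES - km2 * EP
        v3 = k3 * EP - km3 * E * P
        return [-v1 + v3, -v1, v1 - v2, v2 - v3, v3]
    sol = solve_ivp(rhs, (0.0, tpts[-1]), [E0, S0, 0.0, 0.0, 0.0],
                    t_eval=tpts, method="LSODA")
    return sol.y[4]  # product trace

tpts = np.linspace(0.0, 200.0, 100)
true = (10.0, 400.0, 180.0, 20.0, 1200.0, 10.0)     # Table 23.2, line a
data = {S0: simulate(tpts, S0, *true) + np.random.normal(0.0, 20.0, tpts.size)
        for S0 in (2000.0, 5000.0, 10000.0)}        # 2, 5, 10 mM in uM

def residuals(p):
    km1, k2, km2, k3 = p  # binding steps fixed at 100 uM^-1 s^-1
    return np.concatenate([simulate(tpts, S0, 100.0, km1, k2, km2, k3, 100.0) - y
                           for S0, y in data.items()])

fit = least_squares(residuals, x0=[1000.0, 100.0, 10.0, 5000.0])
```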
Figure 23.2 Simultaneous fitting of three progress curves. (A) Simulated curves were calculated according to the constants in Table 23.2, line a, with an enzyme concentration of 1 µM, substrate concentrations of 2, 5, and 10 mM, and with added random errors giving a sigma value of 0.02. Two sets of fitted curves are shown, with rate constants summarized in Table 23.2. An optimal global fit required fitting four parameters according to the original model (solid black line). When an attempt was made to fit the data to a simplified irreversible model (dotted line), only one of the three curves could be fit adequately. (B) The reverse reaction was simulated using 1 µM enzyme, 2 mM product, and a trap to sequester any free substrate, with random noise added (sigma = 0.02).
Thus, multiple sets of parameters can be fit to the data and used to compute the values for kcat and Km. This exercise reinforces what we already know: steady-state kinetic data cannot be used to establish elementary rate constants in an enzyme-catalyzed reaction.
Table 23.2 Kinetic parameters in fitting three progress curves (a)

Curve | k1 (µM⁻¹ s⁻¹) | k−1 (s⁻¹) | k2 (s⁻¹) | k−2 (s⁻¹) | k3 (s⁻¹) | k−3 (µM⁻¹ s⁻¹) | Sigma
a | (10) | 400 | 180 | 20 | 1200 | (10) | 0.02
b | (100) | 4780 | 157 | 13.9 | 11,400 | (100) | 0.0204
c | 0.0701 | (0) | 297 | (0) | (10,000) | (0) | 0.0197 (b)

(a) Three sets of constants were used in attempting to fit the three progress curves shown in Fig. 23.2 simultaneously. Curve a gives the parameters used to generate the artificial data, with a sigma value of 0.02, using 1 µM enzyme and 2, 5, and 10 mM substrate. Curve b gives the best fit obtained with diffusion-limited binding of substrate and product fixed at 100 µM⁻¹ s⁻¹, shown as the solid black lines. Curve c shows an attempt to fit the middle progress curve to a simplified irreversible model to derive kcat and Km values; it fails to account for the data obtained at lower or higher substrate concentrations. Numbers in parentheses were held fixed during the fitting.
(b) Average sigma values were computed from the best fit. In this case, the sigma value was calculated from fitting only the one curve at 5 mM substrate.
One shortcut that has been suggested for fitting full progress curves is based upon reducing the model to a minimal two-step irreversible sequence:

E + S → ES → EP → E + P   (k1, then k2, then fast product release; all reverse rate constants set to 0)
With this simplified model, k1 = kcat/Km and k2 = kcat. This approach can work, but only under the limited circumstances where product does not rebind to the enzyme during the approach to the endpoint. Because it is not known a priori whether product inhibition is significant, this approach can be very misleading unless the reactions are examined at several concentrations of substrate. As shown in Fig. 23.2A, data collected at one concentration can be fit using this model (the middle concentration in this example), but one cannot fit all three concentrations simultaneously using this oversimplified model. Because fitting by computer simulation does not require such a potentially misleading oversimplification, and because this reduced model offers no advantages, it is not recommended.
5.1. Error analysis

The next step in the analysis is to assess the errors on the estimates for each of the rate constants. Standard error analysis based upon the covariance matrix derived during nonlinear regression suggests that each of the rate constants is known with a great deal of certainty, as summarized in Table 23.3. However, confidence contour analysis, which provides a much more robust assessment of the limits on each parameter, suggests that the parameters are not well constrained. Construction and evaluation of FitSpace confidence contours are explained in more detail in Johnson et al. (2009b). In order to construct the confidence contour, individual rate constants are pushed to higher and lower values while all other constants are adjusted in deriving the best fit. The limits on each constant are then defined by the observed increase in the sum square error attributable to constraints on that parameter individually, without any assumptions regarding the values of the other constants. A three-dimensional plot is then generated showing the dependence of the sum square error on each pair of parameters.
Table 23.3 Error estimates on kinetic parameters in fitting three progress curves (a)

Source | k−1 (s⁻¹) | k2 (s⁻¹) | k−2 (s⁻¹) | k3 (s⁻¹)
NR | 4776 ± 241 | 156.6 ± 0.2 | 13.9 ± 0.2 | 11,380 ± 560
FS-A | 14-10,300 | 155-276 | 13-3020 | 570-24,600
FS-A&B | 2850-7350 | 155-159 | 13.4-14.4 | 6730-17,800

(a) Error estimates were derived while simultaneously fitting the three progress curves shown in Fig. 23.2A. NR, nonlinear regression standard error; FS, FitSpace confidence contour error limits based upon a 10% increase in the sum square error. FS-A is based upon fitting the data in Fig. 23.2A. FS-A&B is based upon fitting the data in Fig. 23.2A and B simultaneously.
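The logic of the contour construction is easy to emulate by brute force: pin one parameter at each value on a grid, refit all of the others, and record the minimized SSE. A generic one-dimensional Python sketch (our own helper, not the FitSpace implementation itself; assuming SciPy):

```python
import numpy as np
from scipy.optimize import least_squares

def sse_profile(residuals_fixed, grid, x0_free):
    """For each fixed value in `grid`, refit the remaining free parameters
    and record the minimized sum of squared residuals."""
    profile = []
    for value in grid:
        fit = least_squares(lambda p: residuals_fixed(value, p), x0_free)
        profile.append(np.sum(fit.fun ** 2))
    return np.array(profile)

# E.g., residuals_fixed(km2, [km1, k2, k3]) would rebuild the model with
# k-2 pinned; a 10% rise above min(profile) then brackets the allowed range.
```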
The shape of the surface reveals underlying relationships between the parameters, and a fixed threshold in the sum square error surface can be used to define upper and lower limits for each parameter. Figure 23.3 shows the confidence contours computed for the data shown in Fig. 23.2A fit to Scheme 23.2. The most striking results, visible immediately, are that kcat for the reverse reaction (k−2) is not constrained by the data and that there is a linear correlation between k3 and k−1. The ranges allowed for the individual parameters are listed in Table 23.3, row FS-A. Upon seeing these results, the initial reaction of most investigators is disbelief: nonlinear regression gives very small errors, so how can it be that these constants are so poorly defined? The answer is that nonlinear regression grossly underestimates the errors.
Figure 23.3 Confidence contours in fitting progress curves. Confidence contours are shown derived from the fitting of the data in Fig. 23.2A. Red shows the area of best fit, and the yellow band between red and green shows a threshold at which the sum square error increased by 10% over the minimum value. The results show that k−2 has no upper limit and that there is a wide range over which k+3 and k−1 can vary as long as the constant ratio k+3/k−1 is maintained. The isolated peaks in the k+3 versus k−1 plot result from sampling on a grid, whereas the underlying function should produce a continuous ridge.
Another test of whether to believe the large range over which the rate constants can vary is to overlay on the data all of the curves calculated at the extremes of the parameter set. This can be done within the simulation program, but it is difficult to display in print because all of the curves superimpose at the resolution of the figure. This analysis shows that even the most extreme ranges of rate constants still account for the data and produce traces that are largely indistinguishable. A careful reassessment of the experimental design and of the parameters derived in fitting points to the possible limitations of the data. First, the rate of the reverse reaction is small and contributes negligibly to the observable signal; therefore, an experiment should perhaps be performed to better define the reverse rate constants. Of course, this is easy when the experiments are done by simulation, but even in the real world it is often useful to simulate experiments first to see whether they could help to distinguish models. An additional "experiment" was therefore performed by monitoring the reaction in reverse. In the simulation, the starting conditions contained only the product of the reaction, and the formation of substrate was monitored as a function of time. However, it was immediately recognized that one cannot simply drive the reaction in reverse without the addition of a coupled-enzyme assay to remove substrate. This can be programmed in KinTek Explorer simply by adding a trap to sequester substrate, or by fully programming the kinetic properties of the coupled-enzyme assay (Hanes and Johnson, 2008):
k1 k–1
ES
k2 k–2
EP
k3 k–3
E+P
S + trap = Strap or E2 + S
E2S
E2 + X
The new ‘‘data’’ are shown in Fig. 23.2B. Including this data in the process of global fitting greatly improves confidence in the value of kcat in the reverse direction as defined by k 2 in the model. Moreover, by increasing confidence in k 2, the range over which k 1 and k3 can vary was also restricted to provide a better global fit to all of the data (Fig. 23.2A and B) fit simultaneously. This is illustrated by the confidence contour shown in Fig. 23.4 and Table 23.3 (row FS-A&B); in which each constant is bounded by an upper and lower limit. In summary, full progress curve kinetic traces can be fit to a simplified model in which substrate and product binding rates are assumed to be diffusion limited only for the sake of extracting kcat and Km values. Simultaneous fitting of data collected at several concentrations is required to test
620
175
Kenneth A. Johnson
155
k +2
1.625 (min)
2.112 (1.3x)
10500
2.437 (1.5x)
k –2
3.25 (2x)
10500
k +2
162
k +2
174
259 (160x)
k –1
10500
k +3 1520
k +3 1520
k +3 1450 545
153 26,300
k –1
26,300
545
26,300
13.4
13.1
k –2
16.3
k –1
16.3
545
155
13.4
k –2
16
Figure 23.4 Confidence contours in fitting progress curves forward and reverse reaction. Confidence contours are shown from the fitting of the data in Fig. 23.2A and B simultaneously. Colors are as in Fig. 23.3. The results show that all parameters are well constrained.
for and possibly quantify product inhibition. The process of data fitting and refinement is facilitated by careful use of the confidence contours to find gaps in the data that lead to large errors in estimated parameters, which can then be overcome by performing additional experiments.
6. Slow Onset Inhibition Kinetics In this example, we consider data collected in the steady state involving slow-onset inhibition. The data shown in Fig. 23.5 were generously provided by Vern Schramm and Andrew Murkin of the Albert Einstein College of Medicine from their work in developing transition state analog inhibitors of purine nucleoside phosphorylase (PNPase) (Kicska et al., 2002). These unpublished data show the increase in absorbance with time in the
621
KinTek Explorer
A 0.6
Absorbance
0.5 0.4 0.3 0.2 0.1 0
0
1000
2000
3000 Time, s
4000
5000
0
1000
2000
3000 Time, s
4000
5000
B 0.6
Absorbance
0.5 0.4 0.3 0.2 0.1 0
Figure 23.5 PNPase slow onset inhibition kinetics. The time dependence of product formation is shown after starting the PNPase reaction with 1 mM substrate and various concentrations of the inhibitor, DADMe-ImmH (0, 0.02, 0.06, 0.1, 0.15, 0.3, 0.5, 1, 2, 5, 7, and 10 mM). The kcat ¼ 0.34 s 1 and Km ¼ 5 mM values were used and the enzyme concentration was adjusted to 26 nM to fit the trace in the absence of inhibitor. Data were then fit globally (black lines superimposed on the data shown as thicker green lines) to either a one- or two-step inhibitor binding model based upon a Km ¼ 5 mM and kcat ¼ 0.34 s 1. Fitted curves are shown according to the constants summarized in Table 23.4. (A) Fit to the two-step binding model based upon fits a or b (Table 23.4). (B) Fitted curves based upon the one-step binding model (Scheme 23.4) with parameters in row c of Table 23.4. The three sets of fitted curves are indistinguishable. Data were kindly provided by Vern Schramm and Andrew Murkin of the Albert Einstein College of Medicine (Kicska et al., 2002).
presence of various concentrations of the DADMe-ImmH inhibitor with the PNPase from Plasmodium falciparum. The tight binding of this inhibitor makes it a promising candidate for treating malaria.
622
Kenneth A. Johnson
The relevant mechanistic question to address is whether the data reveal a two-step inhibitor binding mechanism with an initial weak binding followed by a slower isomerization to tighter binding (Scheme 23.3) or whether a onestep binding model is sufficient to account for the data (Scheme 23.4). If the one-step binding model accounts for the slow inhibition, then it is still likely that the reaction occurs in two steps, but the initial binding may be too weak to measure. To address which model accounts for the data we simply fit the data to both models and then examine the errors in the parameters and evaluate goodness of fit both visually and computationally. E+I
K1
EI
k2 k-2
FI
Scheme 23.3 E+I
K1 k-1
EI
Scheme 23.4
The results of fitting the data to a two-step model are shown in Fig. 23.5A with rate constants summarized in Table 23.4, row a. Initial analysis based upon nonlinear regression suggests that the parameters are well constrained, supporting the conclusion that the two-step binding model is well defined. However, one can scroll the constants and find another area of parameter space leading to an equally good fit, with the constants summarized in Table 23.4, row b. Clearly, the parameters are not as well constrained as the nonlinear regression error analysis would lead us to believe. A full confidence contour analysis of the fitting to a two-step model reveals the underlying problem, as shown in Fig. 23.6A. The figure shows a linear correlation between k2 and k–1. This implies that, above a lower limit, the data only define the ratio k2/k–1. Because k1 was assumed to be a constant, we can translate this to a constant term defined by k1k2/k–1, which equals K1k2 = 0.22 μM⁻¹ s⁻¹. This can be immediately recognized as the apparent second-order rate constant for a two-step binding reaction in the range of low concentrations ([I] ≪ 1/K1), where the rate is linearly dependent upon inhibitor concentration.
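This limiting behavior is the standard result for slow-onset inhibition by a two-step mechanism (written here with substrate competition omitted for clarity; competition would scale the apparent dissociation constant by a factor of (1 + [S]/Km)):

$$k_{\mathrm{obs}} \;=\; k_{-2} \;+\; \frac{k_{2}\,[\mathrm{I}]}{[\mathrm{I}] + 1/K_{1}} \;\;\approx\;\; k_{-2} + K_{1}k_{2}\,[\mathrm{I}] \qquad \text{for } [\mathrm{I}] \ll 1/K_{1}$$

Data confined to low inhibitor concentrations therefore determine only the slope K1k2 (= k1k2/k–1 ≈ 0.22 μM⁻¹ s⁻¹ here), not K1 and k2 individually, which is exactly the diagonal boundary seen in Fig. 23.6A.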
Table 23.4  Kinetic parameters in fitting slow onset inhibition of PNPase^a

Two-step (Scheme 23.3):
Fit   k–1 (s⁻¹)      k2 (s⁻¹)          k–2 (s⁻¹)           Chi²
a     7.15 ± 0.05    0.0154 ± 0.0003   0.00013 ± 0.0001    0.034
b     2000 ± 100     4.3 ± 0.2         0.00013 ± 0.0001    0.0422

One-step (Scheme 23.4):
Fit   k1 (μM⁻¹ s⁻¹)    k–1 (s⁻¹)             Chi²
c     0.250 ± 0.0006   0.00021 ± 0.000007    0.032

^a Three sets of parameters illustrate the fitting of the data in Fig. 23.5 to either a two-step inhibitor binding mechanism (a and b) or a one-step mechanism (c). In fitting these data to Scheme 23.3, k1 was fixed at 100 μM⁻¹ s⁻¹.
Figure 23.6 Confidence contours for fitting PNPase slow onset inhibition. Pair-wise confidence contours are shown after fitting the data in Fig. 23.5 to either a two-step inhibitor binding model with three variable parameters (A) or a one-step model with two variable parameters (B). The contours are colored with red showing the area of best fit. The yellow boundary separating red and green defines a threshold where the SSE was increased by 10% over the minimum. The numbers at the corners of each plot show the ranges for each kinetic parameter. These plots were used to derive the parameter confidence intervals summarized in Table 23.5. The analysis shows that the data do not support the definition of a two-step binding mechanism. Rather, in the two-step binding model, the product K1k2 defines a second-order rate constant for inhibitor binding equal to 0.22 μM⁻¹ s⁻¹ according to the slope of the diagonal boundary in the plot of SSE for k2 versus k–1 in (A). Note that the diagonal boundary in (B) demonstrates that the ratio defining the net Kd = k–1/k1 is known with greater certainty than either of the parameters individually. Nonetheless, both parameters are well constrained.
Table 23.5  PNPase kinetic parameter confidence intervals^a

Model      Parameter        Lower limit   Upper limit
Two-step   1/K1 (nM)        18            none
           k2 (s⁻¹)         0.0033        none
           k–2 (s⁻¹)        0.000096      0.00027
One-step   k1 (μM⁻¹ s⁻¹)    0.233         0.265
           k–1 (s⁻¹)        0.00017       0.00026

^a Confidence intervals on individual kinetic parameters were derived from the threshold defined by a 10% increase in the SSE as described (Johnson et al., 2009b).
This analysis leads to the conclusion that although one can fit the data to a two-step model, there are no data to define the Kd (1/K1) for the initial complex, and the model collapses to a one-step mechanism. The fit to a one-step mechanism is shown in Fig. 23.5B and the corresponding confidence contour is shown in Fig. 23.6B. Clearly, the data are adequately fit by the one-step model, and the rate constants for inhibitor binding and release are well constrained. The brief linear correlation between k–1 and k1 implies that the net dissociation constant 1/K1 = k–1/k1 is known with greater certainty than either of the rate constants, but the range over which the individual constants can vary is still relatively small, with the greatest uncertainty in k–1, as summarized in Table 23.5. This analysis once again illustrates that the standard error estimates derived from nonlinear regression are not to be trusted. However, the confidence contour analysis reveals the extent to which parameters are underconstrained and defines the underlying relationships between parameters. Careful analysis leads one either to simplify the model or to perform additional experiments to fill in the gaps in the data.
7. Summary

The two examples of data fitting serve to illustrate the use of KinTek Explorer in deriving steady-state kinetic constants and the rates of slow onset inhibition. In these cases, fitting based upon simulation is fast and reliable. By fitting the parameters of the model directly to the data, simplifying assumptions and the errors they introduce are eliminated. Standard error analysis during nonlinear regression is not reliable, and it fails to reveal when parameters are seriously underconstrained. This can be understood in that the Hessian matrix that must be inverted is nearly singular when
parameters are not well constrained, so there are huge round-off errors in computing the covariance matrix. We are in the process of solving this problem by singular value decomposition (illustrated schematically below), but in the meantime it is important to recognize that the standard nonlinear regression routines used by all currently available data-fitting programs seriously underestimate errors. The software can easily be adapted to fit data examining enzyme activation, an important area of research in the pharmaceutical industry. We are currently using the software to simultaneously fit data collected by rapid quench methods and data obtained by fluorescence methods in the stopped-flow instrument. The rigorous fitting of both datasets simultaneously overcomes many of the limitations of previous attempts to correlate the results from the two experiments (Johnson and Taylor, 1978). The ease of use and the efficiency of the program allow many experiments to be fit directly to models with the greatest accuracy in estimating kinetic parameters and evaluating models.

Financial conflict of interest: KinTek Explorer was developed using private funds, and a professional version of the software is offered for sale.
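To see why the covariance computation fails, consider estimating it from the Jacobian J of the residuals at the best fit (e.g., fit.jac from the least_squares sketch earlier). A singular value decomposition makes the failure explicit and suggests the remedy; this is a generic numerical illustration under those assumptions, not the algorithm implemented in KinTek Explorer.

import numpy as np

def svd_covariance(J, sse, rcond=1e-10):
    # Estimate the parameter covariance sigma^2 * (J^T J)^(-1) via SVD.
    # Singular values near zero correspond to parameter combinations the
    # data do not constrain; inverting them directly produces the huge
    # round-off errors described above, so they are zeroed out instead.
    m, n = J.shape
    U, s, Vt = np.linalg.svd(J, full_matrices=False)
    ok = s > rcond * s[0]
    if not ok.all():
        # Rows of Vt paired with tiny singular values are the underconstrained
        # directions in parameter space (e.g., only the ratio k2/k-1 defined).
        print("Underconstrained combinations:\n", Vt[~ok])
    inv_s2 = np.where(ok, 1.0 / s**2, 0.0)
    sigma2 = sse / (m - n)  # residual variance estimate
    return (Vt.T * inv_s2) @ Vt * sigma2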
ACKNOWLEDGMENTS

Supported by KinTek Corporation (www.kintek-corp.com).
REFERENCES

Anderson, K. S., Sikorski, J. A., and Johnson, K. A. (1988). A tetrahedral intermediate in the EPSP synthase reaction observed by rapid quench kinetics. Biochemistry 27, 7395–7406.
Anderson, K. S., Miles, E. W., and Johnson, K. A. (1991). Serine modulates substrate channeling in tryptophan synthase. A novel intersubunit triggering mechanism. J. Biol. Chem. 266, 8020–8033.
Barshop, B. A., Wrenn, R. F., and Frieden, C. (1983). Analysis of numerical methods for computer simulation of kinetic processes: Development of KINSIM—a flexible, portable system. Anal. Biochem. 130, 134–145.
Bates, D. M., and Watts, D. G. (1988). Nonlinear Regression Analysis and Its Applications. Wiley, New York.
Hanes, J. W., and Johnson, K. A. (2008). Real-time measurement of pyrophosphate release kinetics. Anal. Biochem. 372, 125–127.
Johnson, K. A., and Taylor, E. W. (1978). Intermediate states of subfragment 1 and acto-subfragment 1 ATPase: Reevaluation of the mechanism. Biochemistry 17, 3432–3442.
Johnson, K. A., Simpson, Z. B., and Blom, T. (2009a). Global Kinetic Explorer: A new computer program for dynamic simulation and fitting of kinetic data. Anal. Biochem. 387, 20–29.
Johnson, K. A., Simpson, Z. B., and Blom, T. (2009b). FitSpace Explorer: An algorithm to evaluate multidimensional parameter space in fitting kinetic data. Anal. Biochem. 387, 30–41.
Kicska, G. A., Tyler, P. C., Evans, G. B., Furneaux, R. H., Kim, K., and Schramm, V. L. (2002). Transition state analogue inhibitors of purine nucleoside phosphorylase from Plasmodium falciparum. J. Biol. Chem. 277, 3219–3225.
Michaelis, L., and Menten, M. L. (1913). Die Kinetik der Invertinwirkung. Biochem. Z. 49, 333–369.
Spence, R. A., Kati, W. M., Anderson, K. S., and Johnson, K. A. (1995). Mechanism of inhibition of HIV-1 reverse transcriptase by nonnucleoside inhibitors. Science 267, 988–993.
Spies, M. A., and Toney, M. D. (2007). Intrinsic primary and secondary hydrogen kinetic isotope effects for alanine racemase from global analysis of progress curves. J. Am. Chem. Soc. 129, 10678–10685.
Spies, M. A., Woodward, J. J., Watnik, M. R., and Toney, M. D. (2004). Alanine racemase free energy profiles from global analyses of progress curves. J. Am. Chem. Soc. 126, 7464–7475.
Zimmerle, C. T., and Frieden, C. (1989). Analysis of progress curves by simulations generated by numerical integration. Biochem. J. 258, 381–387.
simulation cell dynamics dependence, 101 cell populations, 100 time evolution, 102 summary, 93 variable definition, 94 Threshold Boolean networks, 297–298 throwDarts sequential program, 215–217 Thymic selection, 80 T-LGL survival signaling network, 302–303 Transition firing rules first-order reactions, 395–401 ground rules, 394–395 pseudo-first-order and second-order reactions, 404–405 rate constants, 393 Trimeric enzyme, interacting vs. independent sites, 253–255 Tumor-associated antigens (TAA), 81 V Vascular endothelial growth factor (VEGF) computational models, 466 ligand shifting, 468 multiscale biology autocrine signaling, 465 heparin-binding affinity, 464–465 intertissue transport, 465 intratissue transport, 464 paracrine signaling, 465–466 nonlinear differential equations, 471 NRP1–VEGFR2 coupling, 472–474 in silico model formulation, 469 systems biology, 463–464 VEGF VEGFR2 complex, 471 X X-linked agammaglobulinemia, 91 Y Yeast triosephosphate isomerase (EC 5.3.1.1) forward reaction, 589 kinetic parameters, 597 MORF proteins, 588–589 reverse reaction, 589 Saccharomyces cerevisiae, 588