Evolutionary Genomics and Systems Biology


Author: Gustavo Caetano-Anolles


Evolutionary Genomics and Systems Biology Edited by Gustavo Caetano-Anolles

Copyright 2010 by Wiley-Blackwell. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services, or technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:
Evolutionary genomics and systems biology / [edited by] Gustavo Caetano-Anollés.
p. ; cm.
Includes bibliographical references and index.
ISBN 978-0-470-19514-7 (cloth)
1. Evolutionary genetics. 2. Molecular evolution. I. Caetano-Anollés, Gustavo, 1955–
[DNLM: 1. Evolution, Molecular. 2. Genome–genetics. 3. Systems Biology. QU 475 E9566 2010]
QH390.E985 2010
572.8’38–dc22
2009025383

Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

Contents

Preface

Contributors

Part I  Evolution of Life

1. Evolutionary Genomics Leads the Way
David Penny and Lesley J. Collins
  1.1 Introduction
  1.2 Evolution and the Power of Genomes
  1.3 The Problem of Deep Phylogeny and “The Tree”
  1.4 Fred, the Last Common Ancestor of Modern Eukaryotes
  1.5 Eukaryote Origins: Continuity from the RNA World?
  1.6 Minimal Genomes and Reductive Evolution
  1.7 Evolutionary Genomics for the Future
  References

2. Current Approaches to Phylogenomic Reconstruction
Denis Baurain and Herve Philippe
  2.1 Phylogenomics and Supermatrices
  2.2 Phylogenetic Signal Versus Nonphylogenetic Signal
  2.3 Probabilistic Models and Nonphylogenetic Signal
    2.3.1 Homogeneous Models
    2.3.2 Handling of Rate Signal
    2.3.3 Handling of Compositional Signal
    2.3.4 Other Model Violations
    2.3.5 Future Developments
  2.4 Reduction of Nonphylogenetic Signal Under Fixed Models
    2.4.1 Variations in Taxon Sampling
    2.4.2 Recoding and Removal of Offending Data
  2.5 CAT Model
  2.6 Case Study: Cambrian Explosion
  2.7 Conclusion
  References

3. The Universal Tree of Life and the Last Universal Cellular Ancestor: Revolution and Counterrevolutions
Patrick Forterre
  3.1 Introduction
  3.2 The Woesian Revolution
  3.3 A Rampant “Prokaryotic” Counterrevolution
  3.4 How to Polarize Characters Without a Robust Root?
  3.5 The Hidden Root: When the Weather Became Cloudy
  3.6 LUCA and Its Companions
  3.7 The Problem of Horizontal Gene Transfer and Ancient Phylogenies: Trees Versus Gene Webs
  3.8 The Nature of the RNA World
  3.9 The DNA Replication Paradox and the Nature of LUCA
  3.10 When Viruses Find Their Way into the Universal Tree of Life
  3.11 Future Directions
  References

4. Eukaryote Evolution: The Importance of the Stem Group
Anthony M. Poole
  4.1 Introduction
    4.1.1 What is Significant About the Origin of the Eukaryote Cell?
  4.2 Interpreting Trees
  4.3 Moving Beyond the Deep Roots of Eukaryotes
    4.3.1 Origin of the Mitochondrion
    4.3.2 Stepwise Development of Mitochondria
    4.3.3 Intron Proliferation and Eukaryote Origins
  4.4 Concluding Remarks
  References

5. The Role of Information in Evolutionary Genomics of Bacteria
Antoine Danchin and Agnieszka Sekowska
  5.1 Introduction
  5.2 Revisiting Information
  5.3 Ubiquitous Functions for Life
  5.4 The Cenome and the Paleome
  5.5 Functions Corresponding to Nonessential Persistent Genes
  5.6 A Ubiquitous Information-Gaining Process: Making a Young Organism from an Aged One
  5.7 Provisional Conclusion
  Acknowledgments
  References

6. Evolutionary Genomics of Yeasts
Bernard Dujon
  6.1 Introduction
  6.2 A Brief History of Hemiascomycetous Yeast Genomics
  6.3 The Scientific Attractiveness of S. cerevisiae
    6.3.1 Functional Genomics
    6.3.2 Genome Duplication
    6.3.3 A Bunch of Fermentative Engines
    6.3.4 Speciation and Species Definition
  6.4 Evolutionary Genomics of Hemiascomycetes
    6.4.1 Distinct and Specific Genome Organization of Three Major Evolutionary Subdivisions of Hemiascomycetes
    6.4.2 Comparison of Proteins: Pan- and Core-Proteomes
    6.4.3 Genome Redundancy and Paralogues
    6.4.4 Conservation of Synteny
    6.4.5 Genes for Noncoding RNAs, Introns, and Genetic Code Variation
    6.4.6 Sex, Transposons, Plasmids, Inteins, and Horizontal Gene Transfer
    6.4.7 Mitochondrial Genomes and NUMTs
  6.5 Surprises
  6.6 What Next?
  Acknowledgments
  Epilogue
  References

Part II  Evolution of Molecular Repertoires

7. Genotypes and Phenotypes in the Evolution of Molecules
Peter Schuster
  7.1 The Landscape Paradigm
  7.2 Molecular Phenotypes
    7.2.1 Protein Structures
    7.2.2 Nucleic Acid Structures
  7.3 The RNA Model
    7.3.1 RNA Replication and Mutation
    7.3.2 RNA Secondary Structures
    7.3.3 Neutrality and Its Consequences
    7.3.4 Stochastic Effects in RNA Evolution
    7.3.5 Beyond the One Sequence–One Structure Paradigm
  7.4 Conclusions and Outlook
  Acknowledgments
  References

8. Genome Evolution Studied Through Protein Structure
Philip E. Bourne, Kristine Briedis, Christopher Dupont, Ruben Valas, and Song Yang
  8.1 Introduction
  8.2 Structural Granularity and Its Implications
  8.3 Protein Domains in the Study of Genome Rearrangements
  8.4 Protein Domain Gain and Loss
  8.5 And in the Beginning . . .
  8.6 But Let Us Not Forget the Influence of the Environment
  8.7 Conclusions
  References

9. Chromosomal Rearrangements in Evolution
Hao Zhao and Guillaume Bourque
  9.1 Introduction
  9.2 Genome Representation
  9.3 Constructing Genome Permutations from Sequence Data
  9.4 Genomic Distances
    9.4.1 Model-Free Distances
    9.4.2 Rearrangement-Based Distances
  9.5 Reconstruction of Ancestors and Evolutionary Scenarios
    9.5.1 Model-Free Reconstruction Algorithms
    9.5.2 Rearrangement-Based Reconstruction Algorithms
  9.6 Recent Applications on Large Genomes
  9.7 Challenges and Promising New Approaches
  Acknowledgment
  References

10. Molecular Structure and Evolution of Genomes
Todd A. Castoe, A. P. Jason de Koning, and David D. Pollock
  10.1 Introduction
  10.2 Overview of Considerations in Studying Protein Evolution
  10.3 Function and Evolutionary Genomics
    10.3.1 Deciphering Complexities of Protein Evolution
    10.3.2 The Future of Modeling Protein Evolution: Merging Realism with Tractability
    10.3.3 The Effect of Increasing Taxon Sampling and Sequence Biodiversity
    10.3.4 Removing the Mutational Noise and Context-Dependent Biases from Protein Evolution
    10.3.5 Where is Protein Evolution Going?
    10.3.6 Detecting Adaptation and Functional Innovation
  10.4 Integrating Inferences to Detect and Interpret Adaptation: An Example with Snake Metabolic Proteins
    10.4.1 Snake Metabolic Proteins: Integration of Inferences for Adaptation
    10.4.2 Detection of Accelerated Nonsynonymous Change
    10.4.3 Changes at Conserved Sites and Coevolutionary Signal
    10.4.4 Integrating Evolutionary Inferences with Structure and Function Information
    10.4.5 Further Evidence of Adaptation from Molecular Convergence
    10.4.6 Integrating Inferences with Possible Causal Factors
  10.5 Conclusion
  References

11. The Evolution of Protein Material Costs
Jason G. Bragg and Andreas Wagner
  11.1 Introduction
  11.2 Protein Material Costs
  11.3 An Example: Proteomic Sulfur Sparing
  11.4 Episodic Nutrient Scarcity Can Shape Protein Material Costs
  11.5 Highly Expressed Gene Products Often Exhibit Reduced Material Costs
  11.6 Material Costs and the Evolution of Genomes
  11.7 Material Costs and Other Costs of Making Proteins
  11.8 Conclusions
  Acknowledgments
  References

12. Protein Domains as Evolutionary Units
Andrew D. Moore and Erich Bornberg-Bauer
  12.1 Modular Protein Evolution
  12.2 Domain-Based Homology Identification
    12.2.1 Domain Architecture Similarity
    12.2.2 Domain Resources and Domain-Based Search
    12.2.3 Deciphering Circular Permutations with Domains
  12.3 Domains in Genomics and Proteomics
    12.3.1 Building Domain Trees
  12.4 The Coverage Problem
  12.5 Conclusion
  References

13. Domain Family Analyses to Understand Protein Function Evolution
Adam James Reid, Sarah Addou, Robert Rentzsch, Juan Ranea, and Christine Orengo
  13.1 Introduction
  13.2 Universal Domain Structure Families Identified in the Last Universal Common Ancestor
  13.3 Some Domain Families Recur More Frequently and Are Structurally Very Diverse
  13.4 Correlation of Structural Diversity in Superfamilies with Functional Diversity
  13.5 To What Extent Does Function Vary Between Homologous?
    13.5.1 Phylogenetic Analysis of Protein Families
    13.5.2 Structural Domain Characterization of Clusters of Orthologous Genes Using CATH
    13.5.3 Evolution of (COGs) Function Within CATH Superfamilies
    13.5.4 Resolving Ambiguous Evolutionary Scenarios Between Parent and Child COGs in a CATH Superfamily
    13.5.5 Relationship Between Domain Architecture Rearrangement and Functional Divergence Within CATH Superfamilies
  13.6 How Safely Can Function Be Inherited Between Homologues?
  13.7 How Are Domain Families Distributed in Protein Complexes?
  References

14. Noncoding RNA
Alexander Donath, Sven Findeiß, Jana Hertel, Manja Marz, Wolfgang Otto, Christine Schulz, Peter F. Stadler, and Stefan Wirth
  14.1 Introduction
  14.2 Ancient RNAs
    14.2.1 RNase P and RNase MRP RNA
    14.2.2 Signal Recognition Particle RNA
    14.2.3 snoRNAs
  14.3 Domain-Specific RNAs
    14.3.1 Telomerase RNA
    14.3.2 Spliceosomal snRNAs
    14.3.3 U7 snRNA
    14.3.4 tmRNA
    14.3.5 6S RNA
  14.4 Conserved ncRNAs with Limited Distribution
    14.4.1 Y RNAs
    14.4.2 Vault RNAs
    14.4.3 7SK RNA
    14.4.4 SmY RNA
    14.4.5 Bacterial RNAs
    14.4.6 A Zoo of Diverse Examples
  14.5 ncRNAs from Repeats and Pseudogenes
  14.6 mRNA-like ncRNAs
    14.6.1 Dosage Compensation
    14.6.2 Imprinting
    14.6.3 Stress Response
    14.6.4 Transcriptional Regulators
  14.7 RNAs with Dual Functions
    14.7.1 RNAIII
    14.7.2 SgrS
    14.7.3 SRA/SRAP
    14.7.4 Enod40
  14.8 Concluding Remarks
  Acknowledgments
  References

15. Evolutionary Genomics of microRNAs and Their Relatives
Andrea Tanzer, Markus Riester, Jana Hertel, Clara Isabel Bermudez-Santana, Jan Gorodkin, Ivo L. Hofacker, and Peter F. Stadler
  15.1 Introduction
  15.2 The Small RNA Zoo
    15.2.1 Endogenous siRNAs
    15.2.2 piRNA
    15.2.3 rasiRNA
    15.2.4 “Exotic” Small RNA Species
  15.3 Small RNA Biogenesis
    15.3.1 Components of the Small RNA Processing Machinery
    15.3.2 MicroRNA Biogenesis
    15.3.3 Biogenesis of Other Small RNAs
    15.3.4 Three Main Mechanisms, Same Global Effect on Gene Expression
  15.4 Computational microRNA Prediction
  15.5 microRNA Targets
    15.5.1 How Many Targets?
    15.5.2 Target Prediction
    15.5.3 Targets and Polymorphisms
  15.6 Evolution of microRNAs
    15.6.1 Animal microRNAs
    15.6.2 Plant microRNAs
    15.6.3 MicroRNAs and Viruses
    15.6.4 Mirtrons
  15.7 Origin(s) of microRNA Families
    15.7.1 Metazoa
    15.7.2 Mechanisms in Plants
    15.7.3 microRNAs and Transposable Elements
    15.7.4 Are Animal and Plant microRNAs Homologous?
  15.8 Genomic Organization
    15.8.1 Clusters and Families
    15.8.2 Regulation of microRNA Expression
  15.9 Summary and Outlook
  References

16. Phylogenetic Utility of RNA Structure: Evolution’s Arrow and Emergence of Early Biochemistry and Diversified Life
Feng-Jie Sun, Ajith Harish, and Gustavo Caetano-Anolles
  16.1 Introduction
    16.1.1 A Novel Phylogenetic Approach Based on Macromolecular Structure
    16.1.2 Broadened Utility of Constraint Analysis
  16.2 Structural Characters and Derived Phylogenetic Trees
    16.2.1 Character Coding
    16.2.2 Character Polarization
    16.2.3 Phylogenetic Analysis of RNA Structural Characters
    16.2.4 Character State Change Frequency in RNA Structural Evidence
    16.2.5 Major Properties of Phylogenetic Trees Derived from RNA Structure
    16.2.6 Potential Limitations of the Methodology
  16.3 Applications
    16.3.1 tRNA
    16.3.2 5S rRNA
    16.3.3 RNase P RNA
    16.3.4 SINE RNA
    16.3.5 rRNA
  16.4 Conclusions
  Acknowledgments
  References

Part III  Evolution of Biological Networks

17. A Hitchhiker’s Guide to Evolving Networks
Charles G. Kurland and Otto G. Berg
  17.1 Introduction
    17.1.1 Mere Words?
    17.1.2 Sideways
    17.1.3 One Man’s Glitch
    17.1.4 Three Challenges
  17.2 Phylogenetic Continuities, Biological Coherence
    17.2.1 The Rosetta Stone
    17.2.2 Compositional Outliers
    17.2.3 Phylogeny by Any Other Name
    17.2.4 The Emperor’s Blast Search
    17.2.5 Between Consenting Adults
  17.3 Nested Structural Networks
    17.3.1 Nip and Tuck
    17.3.2 Fold Selection
    17.3.3 Higher Order Structural Networks
  17.4 Optimal Networks
    17.4.1 Optimality Not Maximality
    17.4.2 Patchy Environments, Patchy Genomes
    17.4.3 Fixation of Novel Sequences
    17.4.4 Selfish Operons
  17.5 The Emperor’s BLAST Search Revisited
    17.5.1 Ecological Settings
    17.5.2 The Way Forward
    17.5.3 Less May be More
  17.6 Will the Real Missing Link Please Stand Up?
  17.7 All’s Well
  Acknowledgments
  References

18. Evolution of Metabolic Networks
Eivind Almaas
  18.1 Introduction
  18.2 Metabolic Network Properties
  18.3 Network Models For Metabolic Evolution
  18.4 Dynamic Models Of Genome-Level Metabolic Function
  References

19. Single-Gene and Whole-Genome Duplications and the Evolution of Protein–Protein Interaction Networks
Grigoris Amoutzias and Yves Van de Peer
  19.1 Introduction
  19.2 Evolution of PINs
  19.3 Single-Gene Duplications
  19.4 Whole-Genome Duplications
  19.5 Diploidization Phase
  19.6 Dosage Balance Hypothesis
  19.7 Types of Interactions
  19.8 WGDs, Transient Interactions, and Organismal Complexity
  19.9 Studies on PPIs of Ohnologues
  19.10 Concerns About the Methods of Analysis and the Quality of the Data
  19.11 The Importance of Medium-Scale Studies: the Case of Dimerization
  19.12 Evolution of Dimerization Networks
  19.13 Conclusions
  Acknowledgment
  References

20. Modularity and Dissipation in Evolution of Macromolecular Structures, Functions, and Networks
Gustavo Caetano-Anolles, Liudmila Yafremava, and Jay E. Mittenthal
  20.1 Introduction
  20.2 Biological Structure as an Emergent Property of Dissipative Systems
  20.3 Information and Its Dissipation
  20.4 Time, Thermodynamic Irreversibility, and Growth of Order in the Universe
  20.5 Information Dissipation and Modularity Pervade Structure in Biology
  20.6 Modularity and Dissipation in Protein Evolution
  20.7 Conclusions
  Acknowledgments
  References

Index

Preface

“The hardest thing to see is what is in front of our eyes.” Johann Wolfgang von Goethe

Change, the process of becoming different, is at the heart of biology. Without it, nothing makes sense. Understanding and describing change is a fundamental endeavor in mathematics and the natural sciences. Change embraces evolutionary thought but is particularly difficult to define in complex dynamical systems, such as those encountered in life. The intricate arrangement of parts in these systems is responsible for unique relationships that must be established and emergent properties that must be explored.

What makes something in biology unique? Closely related individuals in a population may differ little in terms of their genetic makeup. Yet they can exhibit marked phenotypic differences, beginning with appearance and behavior. They can react differently when exposed to mutation or stress. Their progeny can even inherit genetic or learned features differently. As taxonomical distance widens, what is unique widens to the extreme. For example, there is only a 1.2% difference in the genomic makeup of chimpanzee and human at the nucleotide level,¹ yet phenotypic differences are huge. We may be tempted to explain uniqueness by citing critical differences (e.g., unique proteins, gene copy variants, or regulatory mechanisms) or general patterns (e.g., differential regulation of a large number of genes). We can claim that the repertoire of proteins or noncoding RNA, or the way the expression or functioning of these molecules is regulated, is responsible, or that the explanation lies in the integration of the thousands of component parts that make up biological systems. The genomic revolution now provides millions of protein sequences with which to dissect what is different, yet we are far away from understanding the complexities of life.

In the face of so much diversity, we could frame the question differently. What makes something in biology common? Again, we find ourselves amazed by unabated similarities in everything from molecules, molecular structures, and cellular machinery, to body plans, behavior, and language. Why homogeneity amid so much heterogeneity? The answer, I believe, lies in the patterns and processes that are responsible for evolutionary change and the complexity and embedded simplicity of our evolving world. Modern phylogenetics and evolutionary bioinformatics have a lot to say about how to approach these questions, especially within the molecular realm. Similarly, the emerging field of evolutionary genomics attempts to address the complexity of entire repertoires of component parts acquired by the genomic revolution by reconciling what is common and what is unique at different levels of organization and along taxonomical transects. Approaches link pragmatism with theoretical discourse, the bench with the computer, and statistical inference with the hypothetico-deductive method in a more modern framework for scientific and philosophical advance. Concepts are borrowed from unlikely disciplines, such as graph theory and networks from applied mathematics and social sciences, and biologists brush comfortably with algorithms and simulations. This is probably where frontiers in science live.

These past several months have been exciting because of what they represent. This is why this book is particularly timely. The birth of Darwin, 200 years ago, and the publication of his famous book² 50 years later are clearly of profound importance. Darwin catalyzed everlasting change in science and society, revolutionizing biology and prompting the search for deeper understanding of the fundamentals of biological change. His views took more relevance in the synthesis that ensued, and their significance was later (though slowly) embraced by molecular biology. The report of the first crystallographic structure (Protein Data Bank entry 1MBN)³ 50 years ago also marks a fundamental milestone in the exploration of the atomic makeup of the molecules of life. About 50,000 molecules later, we have a better understanding of the molecular machinery of the cell as we try to reconcile models of structure with biological function. The beginning of this year also marks the acquisition of the sequences of the first 1000 organisms, which now span superkingdoms Archaea, Bacteria, and Eukarya, showing that the genomic revolution is indeed full-fledged and unstoppable.

Postgenomic science has changed the face of biology forever. The reductionist view of understanding the living world by atomizing systems and examining their parts is slowly caving in to more integrative approaches that impinge on the power of synthesis and the integration of knowledge. Physics enters the realm of biology and vice versa, but so do computer science, statistics, mathematics, and philosophy. A marvelous time for science! The confluence of technological feats, such as “nextgen” sequencing, with informatics and exponential increases in our biological knowledge base prompted the consideration of advances in two seemingly distinct but emerging fields of endeavor, Evolutionary Genomics and Systems Biology.

This book represents an attempt to bring timely problems and insights from a selected group of researchers to the table. The treatment of advances and challenges is therefore by no means exhaustive. In fact, the purpose was to entice thought rather than to seek exhaustive coverage. Following that spirit, the book has been organized in three sections, treating aspects of evolutionary genomics and systems biology that are crucial for the “new biology”. In Part I, “Evolution of Life,” six chapters cover the impact that evolutionary genomics has had on our understanding of life, from the general to the specific, exploring current challenges, controversial but central issues, and the relevance of model systems in biology. In Part II, “Evolution of Molecular Repertoires,” 10 chapters discuss how levels of organization map to each other, how structure and function impinge on genomic repertoires, the centrality of evolutionary models, and the hidden world of RNA molecules that pervades biology. Finally, in Part III, “Evolution of Biological Networks,” four chapters explore the complexity of the cellular makeup and wiring diagram of an organism, the dynamics and emerging properties of networks, and the role of fundamental evolutionary processes in the integration of component parts in systems. I hope the readers will find each and every chapter exciting and thought provoking.

¹ The Chimpanzee Sequencing and Analysis Consortium, 2005. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437:69–87.

² Darwin, C.R., 1859. On the Origin of Species by Means of Natural Selection. Murray, London.

³ Kendrew, J.C., Bodo, G., Dintzis, H.M., Parrish, R.G., Wyckoff, H.W., and Phillips, D.C., 1958. A three-dimensional model of the myoglobin molecule obtained by X-ray analysis. Nature 181:662–666.


This book would not have been possible without the patience and wholehearted cooperation of its contributors, which made this project feasible. I also wish to thank those who generously took time out of their busy schedules to provide valuable comments. Their input is highly appreciated. I would also like to thank Jay E. Mittenthal for his friendship and encouragement, Derek Caetano-Anollés for artwork, and Karen Chambers and her team at Wiley for being so understanding in all issues related to the production of this book. Finally, I wish to recognize the National Science Foundation for continued and enabling support and my research team for making all things possible. From a personal point of view, I am particularly grateful to my wife, Gloria, for her patience, encouragement, and understanding, but fundamentally, for the sacrifices she bore when I embarked on the pursuit of science. Without her, none of this would have been possible.

GUSTAVO CAETANO-ANOLLES

Urbana, Illinois
January 2010

Contributors

Sarah Addou, Department of Structural & Molecular Biology, University College, London, England

Eivind Almaas, Professor of Systems Biology, Department of Biotechnology, Norwegian University of Science and Technology, Trondheim, Norway

Grigoris Amoutzias, VIB Department of Plant Systems Biology, Ghent University, Belgium, Brussels

Denis Baurain, Unit of Animal Genomics, GIGA-R and Faculty of Veterinary Medicine, University of Liege, Liege, Belgium

Otto G. Berg, Department of Molecular Evolution, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden

Clara Isabel Bermudez-Santana, Bioinformatics Group, Department of Computer Science and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany; Department of Biology, National University of Colombia, Bogota, Colombia

Erich Bornberg-Bauer, IEB, University of Muenster, Muenster, Germany

Philip E. Bourne, Skaggs School of Pharmacy and Pharmaceutical Sciences and San Diego Supercomputer Center, University of California, San Diego, California

Guillaume Bourque, Computational & Mathematical Biology, Genome Institute of Singapore, Singapore

Jason G. Bragg, Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts

Kristine Briedis, Skaggs School of Pharmacy and Pharmaceutical Sciences and San Diego Supercomputer Center, University of California, San Diego, California

Gustavo Caetano-Anolles, Bioinformatics Laboratory, Department of Crop Sciences, University of Illinois at Urbana-Champaign, Urbana, Illinois

Todd A. Castoe, Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, Colorado

Lesley J. Collins, Allan Wilson Center for Molecular Ecology and Evolution, Institute for Molecular BioSciences, Massey University, Palmerston North, New Zealand

Antoine Danchin, Genetics of Bacterial Genomes/CNRS, Institut Pasteur, Paris, France

A. P. Jason de Koning, Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, Colorado

Alexander Donath, Bioinformatics Group, Department of Computer Science and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany

Bernard Dujon, Molecular Genetic Unit of Levures, Institut Pasteur, Paris, France

Christopher Dupont, Skaggs School of Pharmacy and Pharmaceutical Sciences and San Diego Supercomputer Center, University of California, San Diego, California

Sven Findeiß, Bioinformatics Group, Department of Computer Science and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany

Patrick Forterre, Institut Pasteur and Institute of Genetics and Microbiology, University of South, Paris, France

Jan Gorodkin, Division of Genetics and Bioinformatics, IBHV, University of Copenhagen, Frederiksberg, Denmark

Ajith Harish, Bioinformatics Laboratory, Department of Crop Sciences, University of Illinois at Urbana-Champaign, Urbana, Illinois

Jana Hertel, Bioinformatics Group, Department of Computer Science and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany

Ivo L. Hofacker, Institute for Theoretical Chemistry, University of Vienna, Wien, Austria

Charles G. Kurland, Department of Microbial Ecology, Lund University, Lund, Sweden

Manja Marz, Bioinformatics Group, Department of Computer Science and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany

Jay E. Mittenthal, Department of Cell and Developmental Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois

Andrew D. Moore, IEB, University of Muenster, Muenster, Germany

Christine Orengo, Department of Structural & Molecular Biology, University College, London, England

Wolfgang Otto, Bioinformatics Group, Department of Computer Science and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany

David Penny, Allan Wilson Center for Molecular Ecology and Evolution, Institute for Molecular BioSciences, Massey University, Palmerston North, New Zealand

Herve Philippe, Department of Biochemistry, Robert-Cedergren Center, University of Montreal, Montreal, Quebec, Canada

David D. Pollock, Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, Colorado

Anthony M. Poole, Department of Molecular Biology and Functional Genomics, Stockholm University, Stockholm, Sweden, and School of Biological Sciences, University of Canterbury, Christchurch, New Zealand

Juan Ranea, Department of Structural & Molecular Biology, University College, London, England

Adam James Reid, Department of Structural & Molecular Biology, University College, London, England

Robert Rentzsch, Department of Structural & Molecular Biology, University College, London, England

Markus Riester, Bioinformatics Group, Department of Computer Science and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany

Christine Schulz, Fraunhofer Institute for Cell Therapy and Immunology, Leipzig, Germany

Peter Schuster, Institute for Theoretical Chemistry, Wien University, Wien, Austria; Santa Fe Institute, Santa Fe, New Mexico

Agnieszka Sekowska, Genetics of Bacterial Genomes/CNRS, Institut Pasteur, Paris, France

Peter F. Stadler, Bioinformatics Group, Department of Computer Science and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany; Institute for Theoretical Chemistry, University of Vienna, Wien, Austria; Fraunhofer Institute for Cell Therapy and Immunology, Leipzig, Germany; Santa Fe Institute, Santa Fe, New Mexico

Feng-Jie Sun, W. M. Keck Center for Comparative and Functional Genomics, Roy J. Carver Biotechnology Center, University of Illinois at Urbana-Champaign, Urbana, Illinois

Andrea Tanzer, Bioinformatics Group, Department of Computer Science and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany; Institute for Theoretical Chemistry, University of Vienna, Wien, Austria

Ruben Valas, Skaggs School of Pharmacy and Pharmaceutical Sciences and San Diego Supercomputer Center, University of California, San Diego, California

Yves Van de Peer, VIB Department of Plant Systems Biology, Ghent University, Belgium

Andreas Wagner, Department of Biochemistry, University of Zurich, Zurich, Switzerland; Santa Fe Institute, Santa Fe, New Mexico

Stefan Wirth, Bioinformatics Group, Department of Computer Science and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany

Liudmila S. Yafremava, Bioinformatics Laboratory, Department of Crop Sciences, University of Illinois at Urbana-Champaign, Urbana, Illinois

Song Yang, Skaggs School of Pharmacy and Pharmaceutical Sciences and San Diego Supercomputer Center, University of California, San Diego, California

Hao Zhao, Computational & Mathematical Biology, Genome Institute of Singapore, Singapore

Part I

Evolution of Life

Chapter 1

Evolutionary Genomics Leads the Way
David Penny and Lesley J. Collins

1.1 INTRODUCTION
1.2 EVOLUTION AND THE POWER OF GENOMES
1.3 THE PROBLEM OF DEEP PHYLOGENY AND “THE TREE”
1.4 FRED, THE LAST COMMON ANCESTOR OF MODERN EUKARYOTES
1.5 EUKARYOTE ORIGINS: CONTINUITY FROM THE RNA WORLD?
1.6 MINIMAL GENOMES AND REDUCTIVE EVOLUTION
1.7 EVOLUTIONARY GENOMICS FOR THE FUTURE
REFERENCES

1.1 INTRODUCTION
When the older of our authors was an undergraduate (we won’t tell you how long ago, but it was certainly way back in the last millennium), there were considered to be three “Great Scientific Problems.” All three were questions about origins that might (in principle) have genuine scientific answers, but at that time they were thought to be so complex that we might never find them; the questions might just be too big ever to have a scientific solution. The questions were
1. the origin of humans,
2. the origin of life, and
3. the origin of the universe.
It is brilliant that in a single working life the first is answered and the second is crumbling away. The third (the origin of the universe) we recognized even “way back then” as a question of a different kind in that it could, in principle, lead to an infinite regress. That is, solve the question about the origin of our universe (say, hypothesis A) and it immediately

Evolutionary Genomics and Systems Biology, edited by Gustavo Caetano-Anolles Copyright 2010 John Wiley & Sons, Inc.


opens up another question, namely, what explains hypothesis A. So the third question is best left to the physicists or philosophers! As we show below, analyzing genomics datasets allows us to address such major questions where solutions were not possible even 5–10 years ago.

1.2 EVOLUTION AND THE POWER OF GENOMES
The first step is showing how access to information about complete genomes allowed the first of the above three questions to be answered; so in this section, we will only refer to the comparison of the chimpanzee and human genomes. The practical point is how these two genomes can be used as a test of the question whether microevolutionary processes are sufficient for macroevolution (Penny and Phillips, 2004). Can the origin of the human genome be understood solely in terms of the normal microevolutionary processes that occur in natural populations? This is a major scientific question, perhaps “the” major question. The genetic processes in populations (that we know about) include point mutations (seen as SNPs, single nucleotide polymorphisms), small insertions and deletions (indels, from a single nucleotide to larger indels), variations in copy number of a gene or other fragment of DNA (CNVs; Redon et al., 2006), inversions and translocations of sections of chromosome, chromosome fusions, and activated retrotransposable elements. These are precisely the differences we see between the human and chimpanzee genomes, and the classes of differences are outlined in Table 1.1 (see Li and Saunders (2005) and Levy and Strausberg (2008)). Thus, the conclusion is that the human genome arises strictly from the natural processes that occur in plant and animal populations; we can find nothing different or unexpected about the genetic processes leading to humans. The conclusion is extremely powerful and should be reiterated continually by biologists talking to members of the public, especially in more religious countries. In principle, other possibilities exist for the origin of the human genome. We make jokes about it, saying that perhaps a Kindly Creator, or a Group of Itinerant Space Travellers (the GIST model), might have inserted into the human genome a whole lot of genes for both wisdom and intelligence.
Just think of that, we tell our students, many, many genes in the human genome for wisdom and intelligence. After the appropriate pause, we continue: all we have to do now is to find how to turn those genes on! All right, so on the surface the story is a joke, but the story has a very serious purpose and carries a much deeper significance. As far as we can tell, the human genome arose from an ape-like genomic ancestor through 100% normal microevolutionary processes, that is, processes that occur within populations or between sibling species. So yes, in principle, there were other alternatives. If normal microevolutionary processes are sufficient to lead to humans, then that is a very powerful conclusion about the sufficiency of microevolutionary processes. Of course, we do not know which combinations of mutations (of the many kinds of mutations that can occur) led to which changes in human morphology, behavior, and mental and social abilities. And similarly, we do not know which mutations led to which changes on the chimpanzee lineage. Fortunately, we tell our students again, there is certainly a huge amount to be learnt in the future, and there is a major role for the next generation. But what we can say is that at the level of the human genome, there is nothing unusual about humans. That is a great improvement over what could have been said even a decade ago. It illustrates the tremendous importance of having genomes of related organisms. If humans, and our achievements, can arise by natural mechanisms, then complete genome analysis has led to the major conclusion that microevolutionary processes are sufficient for a major macroevolutionary change, that is, the origin of humans, with all our creative (and destructive) powers.

Table 1.1 Natural Microevolutionary Differences Between Chimpanzee and Human Genomes
Natural
One chromosome fusion
Many point mutations
An enzyme lost
Many differences in copy number
Many small inversions
Different transposable elements activated
Many indels (insertions/deletions)
Some introns expanded/contracted
The differences between the human and chimpanzee genomes are all the normal microevolutionary processes seen within populations and sibling species (see Li and Saunders (2005) and Levy and Strausberg (2008)).

Before leaving humans, here is one more example of how molecular genetic data is generally so important, in this case even for interpreting fossils. Again, it comes from the experience of the older author as a graduate student when Louis Leakey toured North America lecturing about their new fossils of early humans in Africa. At a reception afterward, the paleontologists were quite unimpressed; “we know humans evolved in Asia” was their conclusion (for several decades they had scathingly ignored the early human fossils found by Raymond Dart, Robert Broom, and others in Southern Africa). 
No way, said the molecular biologists: Morris Goodman (Goodman et al., 1962) is finding a very close molecular relationship between humans, chimpanzees, and gorillas. The latter two are exclusively African, so we should be looking in Africa for early human fossils. “Look,” said the molecular biologists, “Louis Leakey is doing just that and he is finding the predicted fossils in Africa.” Thus, we clearly see that molecular data is critical for many other areas of biology. So although the human/chimp case is just one example, the evolution of whole genomes is a rapidly developing field. For example, 12 new genomes of fruit fly (Drosophila) species were published in a single publication (Clark et al., 2007), more genomes of individual humans are becoming available (e.g., Wang et al., 2008), and there are now proposals to sequence 1000 human genomes as soon as possible. We have given one example of how genomes can answer major scientific questions, but in many cases we need to know the evolutionary relationships among the taxa involved; as the next section shows, that is not as easy as it sounds for deep divergences.

1.3 THE PROBLEM OF DEEP PHYLOGENY AND “THE TREE”
We need to know the basic divisions both within and between the archaea, bacteria, and eukaryotes. However, there is a real problem in resolving ancient divergences; almost certainly the deep branching orders cannot be known from aligned DNA and protein sequences. So although genomics data is necessary, we don’t yet have the theory for determining the most powerful approach for comparing genomes. The primary approach in phylogeny uses aligned DNA or protein sequences (see Figure 1.1a). Both theory and simulation show that this approach is excellent over a range of times (with some lack of signal at shorter times). However, there is a major loss of information at longer times (say a billion years or more). As in many areas of science, the


Figure 1.1 Current phylogenetic methods use only a small part of the information in a sequence and are expected to saturate for deep divergences. (a) Aligned amino acid sequences for seven taxa, with amino acids color coded for chemical properties. (b) The same sequence with the columns (sites) reordered; any of the c! reorderings of the columns always gives the same parsimony or likelihood value, or distance matrix. (c) Probability of recovering the character state at the root for different mutation rates (the x-axis is time in millions of years). Figure from Penny and Steel, unpublished manuscript. (See insert for color representation of this figure.)

standard mechanism that we use is a Markov model. This is interesting from an evolutionary viewpoint in that Markov models assume “continuity” of process: there was a continuous series of generations of DNA molecules between the starting and end points of the process. Great, that is good evolutionary theory, but the model does not use all the information in the data; for example, the order of the columns in an alignment is not used. Figure 1.1b shows one reshuffling of the columns in Figure 1.1a, and this will give precisely the same tree and parameters as the data from Figure 1.1a. If the number of columns in an aligned data set is c, then there are c! ways of shuffling the alignment. (There are c ways of selecting the first column, c − 1 ways of selecting the second column, c − 2 for the third, and so on.) There is clearly more information than this in the sequence data; for example, if we shuffled the order of amino acids in a protein, we could get very different 2D and 3D structures. Markov models are well studied mathematically, and in the case of trees, it is known that at longer times all information about the tree that generated the data is lost (see Mossel and Steel, 2004). These authors show that there is a “phase transition”: the probability of either recovering the tree correctly or (given the tree) inferring the ancestral character state is initially high but decreases markedly in the longer term. The rate of information loss depends on the mutation rate; information loss is obviously faster at higher mutation rates (Figure 1.1c). Thus, there must be a real limit to using aligned sequence data for ancient divergences, and the loss of information will be even faster if there are any systematic errors (Phillips et al., 2004), such as differences in nucleotide composition or in the function or 3D structure of the protein. 
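The point about column order can be checked directly on a small example. The sketch below is only an illustration under stated assumptions: the four sequences are a hypothetical toy alignment (not data from this chapter), the function names are invented for the example, and a simple p-distance (proportion of differing sites) stands in for the parsimony, likelihood, or distance calculations mentioned in the text.

```python
import math
import random


def p_distance_matrix(alignment):
    """Return pairwise p-distances (proportion of differing sites)."""
    n_taxa = len(alignment)
    n_sites = len(alignment[0])
    matrix = [[0.0] * n_taxa for _ in range(n_taxa)]
    for i in range(n_taxa):
        for j in range(i + 1, n_taxa):
            diffs = sum(a != b for a, b in zip(alignment[i], alignment[j]))
            matrix[i][j] = matrix[j][i] = diffs / n_sites
    return matrix


def shuffle_columns(alignment, rng):
    """Apply the same random column permutation to every sequence."""
    order = list(range(len(alignment[0])))
    rng.shuffle(order)
    return ["".join(seq[k] for k in order) for seq in alignment]


# Hypothetical toy alignment; these are not sequences from the chapter.
toy_alignment = ["MKVLAT", "MKILAT", "MRVLGT", "MRVLGS"]
shuffled = shuffle_columns(toy_alignment, random.Random(0))

# Column order never enters the calculation, so the matrices are identical.
distances_match = p_distance_matrix(toy_alignment) == p_distance_matrix(shuffled)

# With c columns there are c! possible column orderings.
n_orderings = math.factorial(len(toy_alignment[0]))

print(distances_match, n_orderings)
```

For this six-column toy alignment the distance matrices agree and there are 6! = 720 orderings; any method that treats sites as independent and identically distributed would show the same invariance.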
The calculations of Mossel and Steel (2004) assume the “best possible” case; reality will usually ensure that in practice the loss of information will be even faster. Yes, maybe we can do better by downweighting the fast evolving sites (Jeffroy et al., 2006), but we do not expect this to eliminate all problems.
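The dependence of information loss on mutation rate can be illustrated with the simplest possible case. This is a minimal sketch, not the tree-based result of Mossel and Steel (2004): it assumes a symmetric two-state Markov model along a single lineage, for which the probability that the state at time t matches the starting state is 1/2 + (1/2)exp(−2μt), where μ is the rate of change in each direction. The rates and times below are hypothetical values chosen only to show the shape of the decay.

```python
import math


def prob_same_state(rate, time_myr):
    """P(state at time t equals the starting state), symmetric two-state model.

    `rate` is the substitution rate per million years in each direction
    (an illustrative parameter, not a value from the chapter).
    """
    return 0.5 + 0.5 * math.exp(-2.0 * rate * time_myr)


# Hypothetical rates and times, chosen only to show the shape of the decay.
rates = [0.0005, 0.001, 0.005]
times_myr = [0, 100, 500, 1000, 2000]

table = {r: [prob_same_state(r, t) for t in times_myr] for r in rates}
for r, probs in table.items():
    print(r, [round(p, 3) for p in probs])
```

In every row the probability starts at 1 and falls toward 0.5, the value expected by chance for two states, and it falls faster at the higher rates. That is the qualitative pattern the text attributes to Figure 1.1c.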


At this point, our earlier comment about our Markov models not using all the information is highly relevant. There is additional information in the data from gene order (Henz et al., 2005), unique structural changes to the genome (such as insertions and deletions, Boore, 2006), and 2D and 3D structures (Caetano-Anolles et al., 2007). So there is plenty of room for progress. It is just that the purists want to check whether these classes of additional information do really lead to similar results, and we also want a better theoretical understanding of the expected changes in information from gene order and 3D structure, especially when there is a change in function of the gene (Lockhart et al., 1996). Despite these qualifications, using the additional information is a top priority for understanding deep divergences.

The rise of RNA has also seen a shift in thinking in phylogenetics. In addition to examining relationships using protein genes (e.g., the 2% of the human genome), we now have to handle genes that are non-protein coding. We also have to take the structure of a molecule into account as well as the sequence. The idea of using RNA secondary structure has been around for a while, but using it in a practical manner in phylogenetics is still under development despite the volume of software available (Freyhult et al., 2007). With RNA becoming flavor of the month, we expect to see a large jump in research in secondary structure evolution, and this will lead to better predictive software for ncRNA searching (especially for non-human and non-Arabidopsis miRNAs) and new software for ncRNA phylogenetics.

When it comes to the universal “Tree of Life,” we need to be much more careful and be much more “Darwinian” than many modern commentators. The reason is that Darwin almost universally used the phrase “the theory of descent with modification,” not the “tree of life” (a concept that is biblical in origin and has strong mystical overtones). 
He did suggest at one stage that the “tree of life” was a useful simile, and that is certainly a constructive way of expressing the situation. Thus, although “tree of life” (written in lowercase) is a useful analogy, “descent with modification” is much more accurate and inclusive. We know that both mitochondria and chloroplasts arose by endosymbiosis, as did some other organelles, including the nitrogen-fixing organelle in the interesting diatom Rhopalodia gibba (Prechtl et al., 2004). With molecular data, hybridization is being found much more often in both plants and animals (Mallet, 2008). Bacteria regularly “beg, borrow, and steal” genes from relatives or more distant organisms (Dagan et al., 2008), and indeed even the concept of bacterial “species” needs updating away from the eukaryote expectation that every individual has more or less the full genome of the “species” (Lan and Reeves, 2000). With bacteria, that concept is not appropriate, and strains can be relatively different in their composition of genes (but still able to regain genes from related strains). This is only one way in which we have changed our thinking about genome evolution over the past 10 years. We still see this muddle of prokaryote evolution because bacteria and archaea exchange DNA faster than their researchers. However, most prokaryotic evolutionists now accept that the “network” nature of bacterial genome evolution means that accurately working out the finer details of bacterial tree rooting is not going to be easy (Dagan et al., 2008).

1.4 FRED, THE LAST COMMON ANCESTOR OF MODERN EUKARYOTES
Of course, we are on one side of an argument about eukaryotic origins (Kurland et al., 2006), but is diving directly into origins really the way to go? Consider that instead of going for the origin, we focus on the later step, the last eukaryotic common ancestor. The deep phylogeny of eukaryotes splits into five or six deep groups (e.g., Keeling et al., 2005) that are very much accepted, though the deeper rooting certainly is not (Roger and Hug, 2006). Trees based on the


Figure 1.2 Stem lineages and crown groups. The crown group consists of all descendants (living and extinct) of the last common ancestor of all (in this case) eukaryotes; we call this organism “Fred.” In contrast, the stem lineages are all the earlier organisms, most of which will probably not be on the direct lineage to extant eukaryotes. In principle, some early members of the stem lineage might be difficult to recognize as being eukaryotic. We might, for example, choose the origin of the mitochondrion, by endosymbiosis, as the event that “defined” eukaryotes, and this is indicated by a large dot labeled “identifiable.” This event predates Fred by an unknown amount. The choice of endosymbiosis to define eukaryotes is arbitrary, though not unreasonable, and other choices might be a nucleus, or the existence of a spliceosome, and so on. The “sister group” has its usual meaning of including the crown group of the next most closely related group, possibly archaea in the case of eukaryotes. However, in the case of eukaryotes, it is conceivable (possibly) that there was no “sister group” and that the protoeukaryote lineage was formed by a fusion of an archaeal and a bacterial cell, in which case each of those would be equally related. Even in this scenario, there would almost certainly be a stem lineage of “protoeukaryotes.”

fusion of a single gene do not take gene fission into account; trees based on SSU rRNA create problems for species with longer branch lengths. We are left with tongue-twisting groupings (e.g., Opisthokonts for fungamals (fungi plus animals)) and a variety of paths by which the “first” eukaryotes arose. Given the uncertainty about deep phylogeny from aligned sequences, our current approach for eukaryote origins (Collins and Penny, 2005) is to temporarily put aside the ultimate origin of eukaryotes and to concentrate on the properties of the last common ancestor of modern eukaryotes, Fred.¹ Figure 1.2 illustrates the relative difference between these two key concepts. The more we learn about the biochemical and subcellular properties of Fred, the more we are basing our inferences on real data. Because we have huge amounts of genomic information from present-day eukaryotes, we are able to infer many aspects of the biology of past eukaryotes. Our strategy has been to search for features that occur in all deep lineages of eukaryotes. We call such features “general” (or “ancestral”) for eukaryotes, rather than universal; this allows some groups to have lost an ancestral feature. A general feature is expected to have been present in the last common ancestor of the eukaryotic cell. Contrary to tradition, the younger of our authors refuses to call this hypothetical beastie LECA (last eukaryote common ancestor). The name could imply a close relationship to LUCA (the last universal common ancestor), and at this stage how close they are in time is hypothetical, so instead the name became “Fred.” With a neutral name, we can approach independently the characteristics of Fred and begin with an obvious question: what did Fred look like?

Over the past few years, we have been getting some good information (e.g., Kurland et al., 2006). Fred almost certainly had a mitochondrial-type organelle and functional splicing (implying both introns and the complicated apparatus to remove them, Collins and Penny, 2005). Other facets of RNA processing, especially within the transcription–translation system, are under investigation (Collins and Chen, 2009). There are immediate questions that should be solvable very soon (with perhaps a few more key eukaryote genomes). How much of RNA-based regulation (such as RNAi) can be traced back to Fred? Riboswitches are found in both eukaryotes and prokaryotes (Cheah et al., 2007), so we could make an assumption that they, as a general mechanism, extend even back to LUCA, but before making this great step, we have to check that the mechanisms are sufficiently similar to infer common ancestry. Recent work makes it likely that alternative splicing occurred relatively early within eukaryotes (Irimia et al., 2007) and therefore was available to be recruited into development of later multicellularity. But do the proteins of the nuclear pore complex (D’Angelo and Hetzer, 2008) occur throughout eukaryotes? It is interesting that the assumptions originally applied just to sequences (ancestral sequences or convergent evolution) are now being applied to entire mechanisms. To investigate the generality of mechanisms, we have to apply evolutionary models and concepts not only to sequences, but also to expression and pathway information, cell structure, and more complex metabolic information. Thus, we begin to step beyond evolutionary genomics and into the realm of evolutionary systems biology.

For example, there are many proteins that are only found in eukaryotes, the eukaryote signature proteins (ESPs) (e.g., Hartman and Fedorov, 2002; Kurland et al., 2006). Clearly, an archaeal plus bacterial fusion does not explain ESPs directly, though a very long period between fusion and Fred would help. This is a point where considering stem and crown groups does help (see Figure 1.2). Similarly, a fusion model neither predicts nor explains the origin of the spliceosome/intron/exon structure of Fred (Collins and Penny, 2005). 

¹ Some people refer to Fred as a “fairly remote eukaryotic daddy.” We just use Fred.
Certainly, some models attempt to explain the intron/exon structure as an invasion of type II introns at the time of mitochondrial acquisition by endosymbiosis. However, this certainly does not explain the origin of the spliceosome, a bigger structure even than the ribosome; this is one reason we emphasize the origin of the spliceosome as part of the exon/intron processing apparatus. Beware “the invasion of the introns”: some models propose it, but those models (by themselves) appear to us unlikely. In our view, such an invasion would slow down RNA processing severely, and any members of the population that did not suffer intron invasion would strongly outcompete those poor individuals suffering from the invasion. Nevertheless, it serves as a point to introduce the widespread involvement of RNA in eukaryotes.

One of the greatest surprises for many people is that eukaryotic molecular biology is so RNA based (Amaral et al., 2008; Costa, 2007). Eukaryotes abound in RNA-based processing of other RNAs (Figure 1.3): rRNA transcripts cleaved by RNase MRP and modified by guide RNAs such as snoRNAs (small nucleolar RNAs); mRNA cleaved by snRNAs (small nuclear RNAs) in the spliceosome; and tRNA transcripts cleaved by RNase P. There are others, and in addition there is the very widespread regulation of RNA (RNAi, Munroe and Zhu, 2006); even though riboswitches are much more frequent in bacteria, they still occur in eukaryotes (Montange and Batey, 2008). Overall, there is a complex network of ncRNA-catalyzed and controlled processes, especially around transcription and translation (Figure 1.3). We find that many systems such as transcription, splicing, and RNA export are so coordinated that they not only share protein components but also operate on the RNA at the same time. When we take into account the numerous biogenesis pathways that produce some of the components for these mechanisms (e.g., snRNAs, Matera et al., 2007), we find more RNA-based molecules linked to this pathway. 
We quickly see what we call the RNA infrastructure (Collins et al., 2009), a network of ncRNA-based processes regulating RNA processes around the cell, both in time and in space. It is moving components in and out of the nucleus (Hopper, 2006), or in and out of


[Figure 1.3, two panels. (a) RNA-based processing cascade: transcription, then mRNA, tRNA, and rRNA processing, then translation. (b) The RNA infrastructure of the eukaryotic cell, spanning nucleus and cytoplasm. Polymerase labels in the figure: Pol I (rRNA); Pol II (mRNA, miRNA, U1, U2, U4, U5 snRNA); Pol III (U6 snRNA, miRNA, tRNA, SRP RNA, RNase P RNA, RNase MRP RNA*); Pol IV, plants (miRNA).]

Figure 1.3 The eukaryote RNA infrastructure. (a) A generalized RNA processing cascade showing ncRNA-based processing from transcription to translation, concentrating on the processing of mRNAs, rRNAs, and tRNAs. This is the central section of the overall RNA infrastructure network (b), where different RNA-based processes feed into and regulate others, including those in the central section. This is a generic model that can differ in detail within different lineages; for example, MRP RNA is transcribed by Pol III in humans but Pol II in Saccharomyces cerevisiae. Red dashed arrows indicate that processes within each group interact in either direction. Based on Woodhams et al. (2007). (See insert for color representation of this figure.)

RNA storage granules (Anderson and Kedersha, 2006). We could ask the question as to whether ncRNA is the biological “dark matter,” the previously unappreciated molecules that take a single stretch of DNA and produce a functional protein in the right place at the right time. Now that we understand more about such basic eukaryote features as RNA processing, we are getting a good overview of the composition of Fred, and we need this before moving backward in time toward the origin of the eukaryote lineage. The widespread occurrence of RNA in eukaryotes will come up again in the next section.

1.5 EUKARYOTE ORIGINS: CONTINUITY FROM THE RNA WORLD?
Put another way, by focusing on Fred, we can use highly detailed biological knowledge from real (i.e., existing) organisms to infer ancestral properties, rather than just invoking magic to suggest something about the earlier origin of eukaryotes. We sometimes feel that “every man and her dog” has a theory about eukaryote origins (well summarized in Embley and Martin (2006)); it is just that we (all of us, ourselves included) don’t fully know all the features that need to be accounted for in the ancestral eukaryote. Thus, our preference at the moment is to define as fully as possible the properties of Fred, thereby helping us understand what questions have to be answered in any theory about the origins of eukaryotes. We may delay facing it, but the question of the ultimate origin of eukaryotes will certainly not go away. Although endosymbiosis is established for the origin of the mitochondrion (and its derivatives such as mitosomes and hydrogenosomes), this model has not by itself really helped us understand the complexity of either the cell that at an early stage engulfed the


endosymbiont or the later stage of Fred (see Figure 1.2). It used to be fashionable to suggest that the eukaryote cell arose from a fusion of an archaeal and a bacterial cell; we call this the 0 + 0 = 1 model. Because both archaea and bacteria lack important general features that are characteristic of eukaryotes (see Section 1.4), fusing an archaeon and a bacterium certainly does not give a eukaryote! Fusion, by itself, does not explain the origin of eukaryotes. We need to be careful here: such an argument cannot establish that fusion did not occur; it is just that fusion by itself does not explain the origin of the many unique eukaryote features.

Now we get to a very fundamental and important, but difficult, question: is there continuity of the RNA processing of RNA from eukaryotes, past Fred, and all the way back to the predicted widespread processing of RNA by RNA in the earlier stages of life (Penny, 2005)? We expect there to have been an RNP world (an RNA plus protein world that must have preceded DNA) and an even earlier RNA world that would have preceded encoded proteins. For the ribosome, tRNAs, and mRNA, there is indeed little doubt that they are very ancient. What about the RNase MRP ribozyme that processes the rRNA transcript in eukaryotes? To what extent is RNA processing and regulation of RNA in modern eukaryotes continuous right back to an RNP world? Since 1998 (Jeffares et al., 1998), we have been exploring this rather unfashionable (some would perhaps say heretical) idea that eukaryotes retain some ancestral RNA processing features that have been lost in the highly streamlined (and efficient) “prokaryotes” (see Collins et al., 2009). Protein enzymes are far more effective than ribozymes (see comparative values of kcat and kcat/Km in Table 1 of Jeffares et al. (1998)). Superficially, at least, it is unlikely that ribozymes will take over a catalytic or regulatory function that proteins are already doing. The simplest model is therefore an irreversible trend of ribozymes → proteins for catalysis (see Figure 1.4). Thus, whatever the final decision, the idea of RNA continuity is well worth exploring.

Figure 1.4 Comparison of two models for the origin of the high RNA complexity in eukaryotes. (a) Under the RNA continuity model, the complex system of RNA processing of RNA is largely continuous from an earlier ribonucleoprotein stage of the origin of life, an RNP world. The model involves two losses of complex RNA processing in the streamlined and efficient “prokaryote” groups. (b) Under the RNA reexpansion model, there was the same early complex system of RNA processing of RNA, but it was largely lost for the last universal common ancestor, and then subsequently reexpanded in eukaryotes. This model has one loss and one gain. (The order of branching of archaea, bacteria, and eukaryotes is not shown because it is not relevant to either model.)


In contrast to the RNA continuity concept, the dominant theory by far for the origin of eukaryotes is the pre-Darwinian theory of evolution by orthogenesis (for this concept, see Blomberg and Garland (2002)). This assumes some unknown (and possibly unknowable) “universal principle” of evolution going from the simple to the complex. In other words, it is “blindingly obvious” (except to a few heretics) that prokaryotes preceded eukaryotes! The idea that the smaller prokaryotic cells lead inexorably to the larger eukaryote cells is just one manifestation of this; “bigger is better” seems to be the motto. An alternative way of thinking about prokaryotes and eukaryotes is to consider prokaryotes as elegant and efficient in both their genome structure and RNA functioning, and to consider eukaryotes as clumsy and inefficient in their genome organization and their RNA processing. Clumsy and inefficient maybe, but that very inefficiency and redundancy has allowed all sorts of complexities to develop, for which we multicellular eukaryotes are quite grateful. Indeed, we joke about the inordinate complexity of the RNA processing system in eukaryotes and say that “not even a University Committee could invent a system as clumsy and inefficient as the eukaryote genome.” In contrast, we might be quite proud to have been on a committee that intelligently designed a prokaryote genome! Again, it is a joke with a serious message. We cannot accept “a priori” that “because the RNA processing of RNA in eukaryotes is so complex, it must be advanced”! To us, eukaryote RNA processing is just a clumsy example of “unintelligent design.” The usual hypothesis to explain genome organization under the prokaryotes-first model implies that the extensive RNA processing and regulation that we expect in an early RNA and RNP world would be largely lost in prokaryotes, and then reappear by magic in eukaryotes (Figure 1.4b). 
This is not impossible (maybe just unlikely from our point of view, given the relatively poor catalytic power of ribozymes, Jeffares et al., 1998). New discoveries are reinforcing the fundamental (and possibly ancient) role of RNA in the basic functioning of the cell. The most recent is the report that tRNA itself is involved in catalyzing the attachment of an amino acid (threonine) onto the tRNA (Minajigia and Francklyn, 2008); it is not only protein that is involved in the catalysis. More research is expected here; if this involvement of tRNAs in amino acid charging is widespread in other tRNAs and in other organisms, then it strengthens even further the concept of an RNA world being “alive and well” in modern organisms. So our plea is really quite simple: keep an open mind about the relationships between eukarya, archaea, and bacteria. We need more evidence, and we need to consider earlier stages of evolution (especially before DNA).

1.6 MINIMAL GENOMES AND REDUCTIVE EVOLUTION

Certainly, since the work of Forterre (1995), the option that prokaryotes have streamlined their genomes and RNA processing from a more complex earlier state has to be taken seriously; whether the selective forces were thermoreduction (Forterre, 1995), k-selection (Jeffares et al., 1998), or something else is a separate issue. An immediate test is whether reduction in genome size is an ongoing strategy in some organisms today. It is basic to a “Darwinian” approach to evolution that the same selective forces are there, even though they might operate at very different rates in different lineages. It is also fundamental to Darwinian evolution that evolution is never directed to long-term goals; anything that gives an immediate advantage will be selected, even if it is possibly deleterious in the longer term. Evolution is certainly not always “forward” toward increased complexity. Thus, it is conceptually important that genome reduction is observed in a variety of cases, as discussed by Andersson (2006) and Ochman and Davalos (2006). For example,
many pathogens (both bacterial and eukaryotic) have reduced genomes and rely on the host for many nutrients. Buchnera is a bacterium that lives inside insects such as aphids and makes essential amino acids for the aphid, but has lost its genes for making the other amino acids (Moran and Baumann, 2000). The critical point here is that there are many examples of reduced genomes among existing organisms; there is certainly no a priori argument against genome reduction in archaea and bacteria (see Figure 1.4). On a different subject, in some eukaryotes there has been selection against high intron numbers. For example, a plot of the average number of introns per gene versus life cycle time shows a strong correlation (Jeffares et al., 2006): eukaryotes (such as yeast) with short life cycles have few introns per gene, whereas eukaryotes (such as humans) with a long life cycle have many introns per gene, around eight per gene in our case. Similarly, it appears that genes that are turned on and off quickly (“nimble genes”) have fewer introns (Jeffares et al., 2008), even though it is not yet clear which is cause and which is effect. The evolution of intron numbers has been well studied (Roy and Irimia, 2008), and it seems clear that early eukaryotes did have a larger number of introns per gene (Roy, 2006). Certainly, population size factors must be important (Lynch, 2002), and selection strengths on the gain or loss of an individual intron (for example) will be small (Wagner, 2005), but the results outlined above indicate that selective factors do appear to be important. The way we phrase it is that evolution is “sideways, backward, and occasionally, forward.” There is certainly no universal tendency toward becoming larger and more complex, though there is certainly a niche there for some organisms that do manage it. 
The widespread occurrence of reductive evolution throughout nature means that the RNA continuity model has to be considered seriously as an option for the origin of eukaryote RNA processing.

1.7 EVOLUTIONARY GENOMICS FOR THE FUTURE

Of course, the younger author’s undergraduate years were not quite so long ago, but long enough to have witnessed the rise of the bioinformatic and genomic era. We started out in molecular biology at a time when it was cool to “ftp” rDNA sequences onto our new computer, and the rise in computing power then saw the introduction of breakthrough technologies such as PCR, automated sequencing, and microarrays. Thus, we began to investigate evolutionary principles seen in many genes, rather than the accidental examples explored previously. When will it stop? A downside of breakthrough technologies is that the analysis procedures and software may be years behind. We saw this with microarrays, where the initial analysis software left a lot to be desired. But it did spur the bioinformatics industry. Suddenly bioinformatics was no longer the odd biologist who knew how to program but programmers working on biological issues, and it was these programmers who developed and published the solutions. But what of the odd biologists pushed into the “bioinformatics” niche? Like any clever organism they had to evolve; some moved into heavier programming to join the bioinformaticians, and others stayed solidly in the biological realm and created the field of genomics. Technology has now moved again in the form of “next-generation sequencing (NGS).” As former bench molecular biologists, we remember with nostalgia the times when it was great to get our “gene” sequenced; now the problem is quickly analyzing data from entire genomes. The data is produced in a week, but the analysis can take forever! On the positive side, gone are the days when in evolutionary research our organism of choice was too distant from an already sequenced genome. Now we just sequence it. Unfortunately, next-generation sequencing still comes at a cost, but even that is predicted to be reduced within a few
years. Instead of a complete genome, we will be sequencing a population of genomes. This rapid progression of high-throughput technologies pumping out genome-scale data set after data set is also enabling genomics to grow at a rate much faster than a molecular clock, and entire centers now analyze and integrate data from a wide range of species (Schuster, 2008). But smaller centers are not being left out, as we piggyback on the protocols set by these large centers, shifting the focus from model species to every species. The problem, of course, is that with all these data, we still have to analyze them: “ah, but there’s the rub” (to borrow a Shakespearean phrase). Should we be dismayed and allow ourselves to be overwhelmed by the sea of genomes in which we see ourselves swimming? Actually, no. Those of us with NGS data are rather excited. We turn again to bioinformatics and, although it has been a slow start, the “sea of data” problem has been recognized (Valdivia-Granda, 2008), and this is the critical first step. Solutions to sifting and filtering these data are being approached as we type, but now as evolutionary biologists we have to face our own problems. How do we compare the genomes not of one or two representatives of a species, but of individuals from an entire population? Knowing as we do now the differences in evolutionary rates within a gene, and between genes, how do we apply our tools and models to correct for rate variation within entire genomes, and can we infer ancestral genomes (Muffato and Crollius, 2008)? (Can we have a bigger computer to do it on, please?) The last comment is in fact a harsh reality of the genomics and systems biology world. We require not only “hunky” computers and flash programming in order to correlate a genome’s worth of evolution but also people with biological and evolutionary knowledge to work out what questions to ask of the bioinformaticians and the computers. Working with one and not the other is pointless. 
Evolutionary genomics was born beyond the chalk-on-blackboard age, right onto the silicon chip. As the technology advances, we pull in those advances, often before the classical molecular biologists have realized that evolutionary genomicists have poached upon their territory. We look to the future and see a time when, along with coffee, a laptop, a server, and a fast wireless connection, we can sit in the sun to look at how the world and eukaryotes have evolved. We are confident that the availability of more genomes from deeply diverging eukaryotes will answer many of the questions about the nature of the ancestral eukaryote and its origins. More important, we should be able to answer the fundamental evolutionary question, extending the analysis from the first section: is there anything in genome evolution, for any species, that is not the result of normal microevolutionary processes? Yes, the human genome appears 100% the product of natural processes; how long before we can claim that for all genomes? Better make that lots of coffee!

REFERENCES

AMARAL, P.P., DINGER, M.E., MERCER, T.R., and MATTICK, J.S., 2008. The eukaryotic genome as an RNA machine. Science 319: 1787–1789.
ANDERSON, P. and KEDERSHA, N., 2006. RNA granules. J. Cell Biol. 172: 803–808.
ANDERSSON, S.G.E., 2006. The bacterial world gets smaller. Science 314: 259–260.
BLOMBERG, S.P. and GARLAND, T., 2002. Tempo and mode in evolution: phylogenetic inertia, adaptation and comparative methods. J. Evol. Biol. 15: 899–910.
BOORE, J.L., 2006. The use of genome-level characters for phylogenetic reconstruction. Trends Ecol. Evol. 21: 439–446.
CAETANO-ANOLLÉS, G., KIM, H.S., and MITTENTHAL, J.E., 2007. The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture. Proc. Natl. Acad. Sci. USA 104: 9358–9363.
CHEAH, M.T., WACHTER, A., SUDARSAN, N., and BREAKER, R.R., 2007. Control of alternative RNA splicing and gene expression by eukaryotic riboswitches. Nature 447: 497–507.
CLARK, A.G., EISEN, M.B., SMITH, D.R., et al., 2007. Evolution of genes and genomes on the Drosophila phylogeny. Nature 450: 203–218.
COLLINS, L.J. and PENNY, D., 2005. Complex spliceosomal organization ancestral to extant eukaryotes. Mol. Biol. Evol. 22: 1053–1066.
COLLINS, L.J. and CHEN, X.S., 2009. Ancestral RNA: the RNA biology of the eukaryote ancestor. RNA Biol. 6: 1–8.
COLLINS, L.J., KURLAND, C.G., BIGGS, P., and PENNY, D., 2009. The modern RNP world of eukaryotes. J. Hered. 100: 597–604.
COSTA, F.F., 2007. Non-coding RNAs: lost in translation? Gene 386: 1–10.
DAGAN, T., ARTZY-RANDRUP, Y., and MARTIN, W., 2008. Modular networks and cumulative impact of lateral transfer in prokaryote genome evolution. Proc. Natl. Acad. Sci. USA 105: 10039–10044.
D’ANGELO, M.A. and HETZER, M.W., 2008. Structure, dynamics and function of the nuclear pore complexes. Trends Cell Biol. 18: 456–466.
EMBLEY, T.M. and MARTIN, W., 2006. Eukaryote evolution, changes and challenges. Nature 440: 623–630.
FORTERRE, P., 1995. Thermoreduction, a hypothesis for the origin of prokaryotes. CR Acad. Sci. Paris Life Sci. 318: 415–422.
FREYHULT, E.K., BOLLBACK, J.P., and GARDNER, P.P., 2007. Exploring genomic dark matter: a critical assessment of the performance of homology search methods on noncoding RNA. Genome Res. 17: 117–125.
GOODMAN, M. et al., 1962. Immunochemistry of primates and primate evolution. Ann. NY Acad. Sci. 102: 219–234.
HARTMAN, H. and FEDOROV, A., 2002. The origin of the eukaryotic cell: a genomic investigation. Proc. Natl. Acad. Sci. USA 99: 1420–1425.
HENZ, S.R., HUSON, D.H., AUCH, A.F., NIESELT-STRUWE, K., and SCHUSTER, S.C., 2005. Whole-genome prokaryote phylogeny. Bioinformatics 21: 2329–2335.
HOPPER, A.K., 2006. Cellular dynamics of small RNAs. Crit. Rev. Biochem. Mol. Biol. 41: 3–19.
IRIMIA, M., RUKOV, J.L., PENNY, D., and ROY, S.W., 2007. Functional and evolutionary analysis of alternatively spliced genes is consistent with an early eukaryotic origin of alternative splicing. BMC Evol. Biol. 7: 188.
JEFFARES, D.C., POOLE, A.M., and PENNY, D., 1998. Relics from the RNA world. J. Mol. Evol. 46: 18–36.
JEFFARES, D.C., MOURIER, T., and PENNY, D., 2006. The biology of intron gain and loss. Trends Genet. 22: 16–22.
JEFFARES, D.C., PENKETT, C.J., and BÄHLER, J., 2008. Selection against introns in rapidly regulated genes. Trends Genet. 24: 375–378.
JEFFROY, O., BRINKMANN, H., DELSUC, F., and PHILIPPE, H., 2006. Phylogenomics: the beginning of incongruence? Trends Genet. 22: 225–231.
KEELING, P.J. et al., 2005. The tree of eukaryotes. Trends Ecol. Evol. 20: 670–676.
KURLAND, C.G., COLLINS, L.J., and PENNY, D., 2006. Genomics and the irreducible nature of eukaryote cells. Science 312: 1011–1014.
LAN, R. and REEVES, P.R., 2000. Intraspecies variation in bacterial genomes: the need for a species genome concept. Trends Microbiol. 8: 396–401.
LEVY, S. and STRAUSBERG, R.L., 2008. Individual genomes diversity. Nature 456: 49–50.
LI, W.H. and SAUNDERS, M.A., 2005. The chimpanzee and us. Nature 437: 50–51.
LOCKHART, P.J., LARKUM, A.W.D., STEEL, M.A., WADDELL, P.J., and PENNY, D., 1996. Evolution of chlorophyll and bacteriochlorophyll: the problem of invariant sites in sequence analysis. Proc. Natl. Acad. Sci. USA 93: 1930–1934.
LYNCH, M., 2002. Intron evolution as a population-genetic process. Proc. Natl. Acad. Sci. USA 99: 6118–6123.
MALLET, J., 2008. Hybridization, ecological races and the nature of species: empirical evidence for the ease of speciation. Philos. Trans. R. Soc. B 363: 2971–2986.
MATERA, A.G., TERNS, R.M., and TERNS, M.P., 2007. Noncoding RNAs: lessons from the small nuclear and small nucleolar RNAs. Nat. Rev. Mol. Cell Biol. 8: 209–220.
MINAJIGIA, A. and FRANCKLYN, C.S., 2008. RNA-assisted catalysis in a protein enzyme: the 2′-hydroxyl of tRNAThr A76 promotes aminoacylation by threonyl-tRNA synthetase. Proc. Natl. Acad. Sci. USA 105: 17748–17753.
MONTANGE, R.K. and BATEY, R.T., 2008. Riboswitches: emerging themes in RNA structure and function. Annu. Rev. Biophys. 37: 117–133.
MORAN, N.A. and BAUMANN, P., 2000. Bacterial endosymbionts in animals. Curr. Opin. Microbiol. 3: 270–275.
MOSSEL, E. and STEEL, M., 2004. A phase transition for a random cluster model on phylogenetic trees. Math. Biosci. 187: 189–203.
MUFFATO, M. and CROLLIUS, H.R., 2008. Paleogenomics in vertebrates, or the recovery of lost genomes from the mist of time. Bioessays 30: 122–134.
MUNROE, S.H. and ZHU, J., 2006. Overlapping transcripts, double-stranded RNA and antisense regulation: a genomic perspective. Cell. Mol. Life Sci. 63: 2102–2118.
OCHMAN, H. and DAVALOS, L.M., 2006. The nature and dynamics of bacterial genomes. Science 311: 1730–1733.
PENNY, D., 2005. An interpretive review of the origin of life research. Biol. Philos. 20: 633–671.
PENNY, D. and PHILLIPS, M.J., 2004. The rise of birds and mammals: are microevolutionary processes sufficient for macroevolution. Trends Ecol. Evol. 19: 516–522.
PHILLIPS, M.J., DELSUC, F., and PENNY, D., 2004. Genome-scale phylogeny: sampling and systematic errors are both important. Mol. Biol. Evol. 21: 1455–1458.
PRECHTL, J., KNEIP, C., LOCKHART, P., WENDEROTH, K., and MAIER, U.G., 2004. Intracellular spheroid bodies of Rhopalodia gibba have nitrogen-fixing apparatus of cyanobacterial origin. Mol. Biol. Evol. 21: 1477–1481.
REDON, R. et al., 2006. Global variation in copy number in the human genome. Nature 444: 444–454.
ROGER, A.J. and HUG, L.A., 2006. The origin and diversification of eukaryotes: problems with molecular phylogenetics and molecular clock estimation. Philos. Trans. R. Soc. Lond. B Biol. Sci. 361: 1039–1054.
ROY, S.W., 2006. Intron-rich ancestors. Trends Genet. 22: 468–471.
ROY, S.W. and IRIMIA, M., 2008. Spliceosomal introns as tools for genomic and evolutionary analysis. Nucleic Acids Res. 36: 1703–1712.
SCHUSTER, S.C., 2008. Next-generation sequencing transforms today’s biology. Nat. Methods 5: 16–18.
VALDIVIA-GRANDA, W., 2008. The next meta-challenge for bioinformatics. Bioinformation 2: 358–362.
WAGNER, A., 2005. Energy constraints on the evolution of gene expression. Mol. Biol. Evol. 22: 1365–1374.
WANG, J. et al., 2008. The diploid genome sequence of an Asian individual. Nature 456: 60–65.
WOODHAMS, M.D., STADLER, P.F., PENNY, D., and COLLINS, L.J., 2007. RNase MRP and the RNA processing cascade in the eukaryotic ancestor. BMC Evol. Biol. 7: S13.

Chapter 2

Current Approaches to Phylogenomic Reconstruction

Denis Baurain and Herve Philippe

2.1 PHYLOGENOMICS AND SUPERMATRICES
2.2 PHYLOGENETIC SIGNAL VERSUS NONPHYLOGENETIC SIGNAL
2.3 PROBABILISTIC MODELS AND NONPHYLOGENETIC SIGNAL
2.4 REDUCTION OF NONPHYLOGENETIC SIGNAL UNDER FIXED MODELS
2.5 CAT MODEL
2.6 CASE STUDY: CAMBRIAN EXPLOSION
2.7 CONCLUSION
REFERENCES

2.1 PHYLOGENOMICS AND SUPERMATRICES

In the 1960s, the seminal papers of Zuckerkandl and Pauling (1965) and Fitch and Margoliash (1967) spawned the whole field of molecular phylogenetics. Since then, literally thousands of trees based on primary sequences (i.e., DNA or protein) have been published. In most cases, these studies tried to infer the evolutionary history of some groups of species in some region of the Tree of Life, the so-called “organismal phylogeny,” which is the Holy Grail of any molecular systematist. To the casual reader, it might seem that any given gene should provide an equally truthful account of the organismal phylogeny. In practice, this is rarely the case, and two genes will never tell the exact same story provided that a large enough set of organisms is considered. A valuable phylogenetic marker must honestly represent its host organism, at least across the species of interest (Fitch, 1970). This implies that all sequences in a multispecies alignment have to be orthologs, that is, exclusively derived from speciation events and not

Evolutionary Genomics and Systems Biology, edited by Gustavo Caetano-Anolles Copyright 2010 John Wiley & Sons, Inc.

from gene duplications (leading to paralogy) or transfers (leading to xenology). Moreover, when an ancestral polymorphism is maintained through more than one speciation event, its multiple alleles can be lost independently in the diverging descendants (lineage sorting), which results in a gene tree departing from the organismal phylogeny (Maddison, 1997). While many genes potentially reﬂect the true evolutionary history, most are too short to properly sample the speciation process (stochastic error) and thus yield phylogenies where some (or even most) species display uncertain relationships (i.e., not statistically supported as estimated by bootstrap values (BVs) or posterior probabilities). Consequently, conﬂicts are widespread when comparing trees inferred from individual genes—a phenomenon termed incongruence (Jeffroy et al., 2006). With the never-ending accumulation of sequences enabled by the advent of automated sequencing, it became obvious that combining several genes in a single analysis should overcome the lack of resolution plaguing single-gene phylogenies. The idea was to drastically reduce the stochastic (or sampling) error and to mitigate the effects of undetected paralogy or xenology in some markers by analyzing thousands of informative positions from hundreds of genes (Philippe et al., 2005a). Two main strategies were developed to this end: the supertree that combines the trees resulting from the separate analysis of individual alignments, and the supermatrix that combines genes prior to phylogenetic reconstruction (Delsuc et al., 2005). Though supertrees have some distinct advantages, such as the possibility to mix morphological and molecular trees (Liu et al., 2001) or to trace xenology (Beiko et al., 2005), they are often biased toward large and/or unbalanced source trees while suffering from the underlying lack of resolution of single genes. 
For instance, a supertree of animals built with the popular MRP method (matrix representation with parsimony; Baum, 1992; Ragan, 1992) appears to have more difficulty in placing fast-evolving lineages than the equivalent supermatrix (Philippe et al., 2005a). In the remainder of this chapter, we will therefore focus on supermatrices, which have been intensively explored, tested, and validated in the last decade. Supermatrices follow the total evidence principle of using all the relevant available data (Kluge, 1989). They are assembled by concatenating individual alignments into one huge alignment. As for supertrees, source genes do not need to share exactly the same set of species, except that missing genes are explicitly encoded as question marks in the supermatrix (Delsuc et al., 2005). Once a cause of worry when the total number of positions was low (<1000) (Huelsenbeck, 1991; Novacek, 1992; Wiens, 1998; Wilkinson and Benton, 1995), missing data are now considered relatively harmless to the accuracy of phylogenetic inference based on supermatrices (>10,000 positions). Computer simulations (Wiens, 2003) and empirical studies (Philippe et al., 2004) have both shown that a sizable fraction of missing data (10–30%) is acceptable, and even desirable in some cases (see microsporidia below). Indeed, because missing characters are not known to be positively misleading, the total number of positions for a given species (at least several thousands) is more important than completeness of data (Philippe et al., 2004; Wiens, 2003). The tolerance of supermatrices to missing data makes them easy to assemble from disparate sources of primary sequences, such as public protein databases, cDNA libraries, raw sequencing trace archives, and newly acquired sequences following targeted PCR. 
From the point of view of cost effectiveness, EST sequencing is the way to go since mining as few as 5000 randomly chosen ESTs from a nonnormalized cDNA library yields on average >10,000 reliably aligned amino acids (Philippe and Telford, 2006). In contrast to
the labor-intensive targeted PCR strategy, the major cost of the EST approach is the sequencing itself, which is decreasing steadily, especially with the availability of the next-generation sequencing platforms. Standard inference methods developed for the phylogenetic analysis of single genes are applicable to supermatrices (Delsuc et al., 2005), sometimes with minor enhancements (e.g., separate models in which branch lengths and parameters other than the tree topology can be different for each gene; Yang, 1996b). Assembly and analysis of supermatrices nonetheless require a dedicated tool to concatenate the individual alignments. Among those publicly available, SCaFoS (Roure et al., 2007) offers the finest control over the merging process. One interesting feature of the latter is the semiautomated generation of chimeric sequences blending data from different species (in principle from the same monophyletic group). This allows the phylogenetic analysis of clades for which primary sequences are still scarce without increasing the share of missing data. Even if supermatrices lead to a considerable gain in resolving power, they have their own issues. First, their analysis is computationally expensive and requires a large amount of memory, especially when using complex evolutionary models (see below). Second, search heuristics are not always able to cross the high barriers in tree space and can get trapped in local optima (Salter, 2001). Third, owing to the quantity of data, sequence biases become important and have to be accounted for to avoid systematic errors during phylogenetic inference. Since this topic is a major concern, it will be thoroughly reviewed in the next sections.
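
The concatenation step described in this section can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the procedure used by SCaFoS or any other published tool: the function names `build_supermatrix` and `missing_fraction` are hypothetical, and the toy sequences are invented. Each gene is assumed to be already aligned, and a species absent from a gene is padded with question marks over that gene's columns.

```python
def build_supermatrix(alignments):
    """Concatenate per-gene alignments into one supermatrix.

    alignments: list of dicts mapping species name -> aligned sequence.
    Within one gene, all sequences must have the same length.
    A species missing from a gene receives '?' for that gene's columns.
    """
    # Collect every species seen in at least one gene, in first-seen order.
    species = []
    for aln in alignments:
        for name in aln:
            if name not in species:
                species.append(name)

    parts = {name: [] for name in species}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a gene must have equal length")
        gene_len = lengths.pop()
        for name in species:
            parts[name].append(aln.get(name, "?" * gene_len))
    return {name: "".join(chunks) for name, chunks in parts.items()}


def missing_fraction(supermatrix):
    """Fraction of '?' cells over the whole supermatrix."""
    total = sum(len(seq) for seq in supermatrix.values())
    missing = sum(seq.count("?") for seq in supermatrix.values())
    return missing / total


# Toy data, invented for illustration only.
gene_a = {"human": "MKTAY", "mouse": "MKSAY", "yeast": "MRTAF"}
gene_b = {"human": "GLV", "yeast": "GIV"}  # mouse was not sampled for this gene
matrix = build_supermatrix([gene_a, gene_b])
print(matrix["mouse"])           # gene_b columns are padded with '?'
print(missing_fraction(matrix))  # share of missing cells in the supermatrix
```

Tracking the share of missing cells per supermatrix (or per species) is one simple way to check whether a data set stays within the range of missing data discussed above.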

2.2 PHYLOGENETIC SIGNAL VERSUS NONPHYLOGENETIC SIGNAL

By reducing the stochastic error, supermatrices are expected to yield phylogenetic trees that are both correct and statistically supported because they enhance the phylogenetic signal. In practice, all combinations of the attributes “correct” and “supported” are found throughout the literature, from the highly supported correct trees to the unsupported incorrect trees, as well as the perplexing unsupported correct trees and their evil twins, the highly supported incorrect trees. This variety of outcomes explains why supermatrices have not brought “the end of incongruence” (Jeffroy et al., 2006), contrary to an expectation largely shared at the beginning of phylogenomics (Gee, 2003; Rokas et al., 2003). The very reason for this is that standard inference methods applied to supermatrices analyze an apparent signal that is actually a blend of phylogenetic and nonphylogenetic signals, which are both increased at genome scale (Baurain et al., 2007; Rodriguez-Ezpeleta et al., 2007b). To enrich the apparent signal in genuine phylogenetic signal, it is important to understand what the nonphylogenetic signal is and what causes it. The phylogenetic signal stems from the substitutions occurring along the various branches of the evolutionary tree. In a maximum parsimony (MP) framework, the signal is encoded in the informatively shared residues at homologous positions that are interpreted as synapomorphies for sequences derived from a common ancestor (Hennig, 1966). Felsenstein (1985) showed that a bootstrap value above or equal to 95% for any given group of species requires at least three substitutions on the corresponding branch. This relationship between the number of substitutions and the strength of the phylogenetic signal explains why longer branches are easier to recover and why supermatrices yield more robust trees than smaller data sets. 
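
The three-substitution threshold can be motivated with a short calculation. The sketch below assumes an idealized case that is not spelled out in the text: a group is supported by k sites, no site contradicts it, and the group is recovered in a bootstrap replicate whenever at least one of the k sites is drawn. With n sites resampled with replacement, the chance that none of the k sites is drawn is (1 - k/n)^n, which approaches exp(-k) for large n. The function name is hypothetical.

```python
import math


def bootstrap_support(k, n):
    """Probability that at least one of k supporting sites appears in a
    bootstrap replicate of n sites drawn with replacement (idealized case
    with no conflicting sites)."""
    return 1.0 - (1.0 - k / n) ** n


# Compare the finite-sample value with the large-n limit 1 - exp(-k).
for k in (1, 2, 3, 4):
    exact = bootstrap_support(k, 1000)
    limit = 1.0 - math.exp(-k)
    print(f"k={k}  n=1000: {exact:.4f}  limit: {limit:.4f}")
```

Under these assumptions, two supporting sites fall short of 95% while three just reach it, which matches the threshold quoted above.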
In theory, branch length is directly proportional to the number of
substitutions along that branch, which is in turn the product of time and mutation rate. An increase in one of them (and of course in both) will thus lengthen a branch, which should in principle lead to easier recovery of the corresponding group, as more substitutions will have occurred along that branch. However, during the course of evolution, the oldest signal is progressively erased by the continuous accumulation of substitutions, that is, by the additional substitutions occurring at the same positions at the tips of the tree (Fitch, 1979; Ho and Jermiin, 2004). When multiple substitutions dominate over single substitutions, the phylogenetic marker is said to be saturated for the set of species considered. Since recent branches conserve more single substitutions than older branches, saturation mostly affects the deepest branches in a tree. If these branches are already short (e.g., because of closely spaced speciation events), saturation completely destroys the deep branching pattern and the resulting tree is unresolved. Though disappointing, the lack of resolution is not the worst consequence of saturation. Most often, multiple substitutions lead to spurious identities between sequences (termed homoplasies) due to convergence or reversal. This is especially true for long branches (either early branching or fast evolving) that will share many homoplasies with other (generally unrelated) long branches. As soon as homoplasies are more abundant than synapomorphies (around saturation), phylogenetic methods are susceptible to becoming inconsistent (Felsenstein, 2004). By definition, this means that they will converge toward supporting an incorrect tree with increasing statistical support as more and more data are analyzed. 
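
Saturation can be made concrete with the simplest substitution model. The sketch below assumes the Jukes and Cantor model for nucleotides, under which two sequences separated by d substitutions per site are expected to differ at a proportion p = 3/4 (1 - exp(-4d/3)) of their sites. The function names are hypothetical, and the model is chosen here only for illustration.

```python
import math


def jc69_observed_differences(d):
    """Expected proportion of differing sites between two sequences
    separated by d substitutions per site under Jukes-Cantor."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def jc69_distance(p):
    """Inverse correction: substitutions per site from an observed
    proportion of differences p (defined only for 0 <= p < 0.75)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) under Jukes-Cantor")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# As d grows, observed differences level off near 0.75 and an increasing
# share of the substitutions is hidden by multiple hits at the same site.
for d in (0.1, 0.5, 1.0, 2.0, 5.0):
    p = jc69_observed_differences(d)
    print(f"d={d:<4} observed={p:.3f} hidden={d - p:.3f}")
```

The curve flattens as d increases, so deep or fast-evolving branches carry little recoverable signal per site, and small errors in the observed differences translate into large errors in the inferred distance.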
The most famous case of inconsistency in phylogenetics is the long-branch attraction (LBA) artifact affecting MP (and distance methods), in which two fast-evolving yet unrelated lineages (i.e., having long branches) cluster together (Felsenstein, 1978). In an ideal world, probabilistic methods (maximum likelihood and Bayesian inference) would be immune to saturation (i.e., be consistent) because, in contrast to parsimony, they are supposed to explicitly mimic the evolutionary process. Thus, they should be able to distinguish true synapomorphies from confounding homoplasies by accurately modeling all the intermediate character states (the substitution history) at each position (Felsenstein, 2004). In the real world, however, sequences evolve according to a very complex and heterogeneous process that reconstruction methods only grossly approximate. Among the complexities of the evolutionary process are the mutations themselves, which are not homogeneous over time or across the genome; the population structure, which is not homogeneous over time; the various selective pressures, which are certainly not homogeneous over time or across the genome; the nucleotide composition, which is known to be heterogeneous across species; the evolutionary rate and the mutation process, which are both heterogeneous across positions and over time; the interdependency between positions owing to protein structure; and so on. To ensure that phylogenetic inference remains tractable, probabilistic models have to make simplifying assumptions about the evolutionary process, hence the gross approximations. When the data at hand have evolved according to a true process that differs too much from the modeled process, model violations ensue (Lemmon and Moriarty, 2004). The main consequence of model violations is that a larger or smaller fraction of the multiple substitutions is incorrectly interpreted as genuine synapomorphies. 
As long as the violations are random, the phylogenetic reconstruction is not affected, at least not beyond a possible decrease in resolution (but see Susko et al., 2005). The real concern stems from the widespread evidence that model violations are often biased in one direction (systematic
error). For example, a group of sequences in the alignment could have evolved under the same compositional pressure (compositional signal) or at a similarly aberrant rate (rate signal). Analyzed by standard models, such biased evolutionary processes lead noisy homoplasies to be erroneously converted into a structured, yet nonhistorical, signal (Ho and Jermiin, 2004). As mentioned above, this nonphylogenetic signal blends with the true phylogenetic signal to generate a composite (or apparent) signal that is actually the signal inferred by a given method (Baurain et al., 2007; Rodriguez-Ezpeleta et al., 2007b). In single-gene phylogenies, the nonphylogenetic signal due to systematic errors is rarely an issue because it is commonly overwhelmed by the stochastic error associated with the short size of the sequences. There exist nonetheless notable exceptions, such as the wrong clustering of two unrelated mesophilic bacteria having the same lower G + C content (compositional signal) in their rRNA gene (Embley et al., 1993) or the erroneous placement of fast-evolving (rate signal) microsporidia (Philippe and Germot, 2000). Conversely, in supermatrices where the stochastic error is low, even subtle model violations, such as minor variations in amino acid composition, might be enough to divert the interpretation of multiple substitutions from homoplasy to synapomorphy and to create a strong nonphylogenetic signal that will eventually favor an incorrect tree, sometimes highly supported (Jeffroy et al., 2006). This contrasts with the stochastic errors that plague single-gene phylogenies, which are almost never statistically supported because, by nature, they do not accumulate. Of course, an incorrect tree generally arises when the genuine phylogenetic signal is faint, that is, for short branches. In most other cases, the historical signal present in supermatrices is so powerful that the nonphylogenetic signal simply cannot compete. 
If both signals are of comparable amplitude, the resulting tree may be correct or not, but will not be statistically supported (Rodriguez-Ezpeleta et al., 2007b). Actually, such a lack of resolution with large supermatrices should prompt specific experiments aimed at uncovering a potentially confounding nonphylogenetic signal (see below). Somewhat unexpectedly, nonphylogenetic signal is not necessarily evil and can even supplement the genuine phylogenetic signal (Albert, 2006), provided that the systematic bias similarly affects all species of a monophyletic group (e.g., owing to a change in the evolutionary process in the common ancestor of the group). Finally, another kind of nonphylogenetic signal is worth mentioning. It relates to the orthology issue already discussed, when gene trees are in conflict with the organismal phylogeny. This variant of nonphylogenetic signal occurs in supermatrices contaminated by a nonnegligible level of xenology or hidden paralogy. To detect orthologous sequences, most methods rely on sequence similarity (e.g., best reciprocal BLAST hits), which is reasonably effective in general, except for xenologs having replaced the corresponding receiver gene (Philippe et al., 2005a). Therefore, to reduce the probability of including some sequences of doubtful orthology in a supermatrix, orthologs have to be selected through time-consuming building and inspection of separate trees for all the individual alignments before concatenation (e.g., Leigh et al., 2008). Since the orthology assessment itself depends on prior knowledge of the organismal phylogeny, this approach involves a certain amount of circularity. This is why a commonly adopted work-around consists of using only genes present in single copy in all or most species and/or ensuring afterward that individual gene trees are congruent with the organismal phylogeny inferred from the concatenation (Rodriguez-Ezpeleta et al., 2007a). 
That said, it should be noted that supermatrices are surprisingly robust to the inclusion of nonorthologous sequences (Brochier et al., 2002; H. Philippe, unpublished results).

Chapter 2

Current Approaches to Phylogenomic Reconstruction

2.3 PROBABILISTIC MODELS AND NONPHYLOGENETIC SIGNAL

As nonphylogenetic signal ultimately stems from model violations, improving the models of sequence evolution is the most direct way to reduce the systematic errors that impede phylogenetic inference. In this section, we will first briefly introduce standard (homogeneous) models, and then review the enhancements that allow state-of-the-art models to deal with two variants of nonphylogenetic signal: the rate signal and the compositional signal. Finally, we will mention other sources of nonphylogenetic signal and summarize the logic driving future model development.

2.3.1 Homogeneous Models

With probabilistic methods, the evolutionary process operating along each branch of a phylogenetic tree is modeled as a Markov process. In such a process, the conditional probability of change at a given position (or site) only depends on the current character state (either nucleotide or amino acid) and is thus independent of its earlier states (Jermiin et al., 2008). The substitution process is described by an instantaneous rate matrix that specifies the rate of replacement between each pair of nucleotides (or amino acids). This matrix results from the combination of a vector of stationary probabilities (i.e., equilibrium frequencies) of nucleotides and of a matrix of conditional instantaneous exchange rates between nucleotides. For practical reasons, all positions from an alignment are assumed to be independent and to evolve at a constant rate under a single Markov process. Moreover, the most widely used evolutionary models expect the process to be globally stationary, reversible, and homogeneous over time. In phylogenetic terms, stationarity implies that nucleotide frequencies are constant across the tree, while reversibility means that the probability of sampling nucleotide i from the stationary distribution and going to nucleotide j is the same as that of sampling nucleotide j and going to nucleotide i. Note that, by definition, a reversible process is necessarily stationary as well. Finally, homogeneity implies that the conditional exchange rates are constant across the tree. Taken together, these three assumptions allow reconstruction methods to ignore the direction of evolution during phylogenetic inference and to yield unrooted trees. For nucleotide sequences, available models range from the simplest one-parameter Jukes and Cantor model (Jukes and Cantor, 1969) to the general time-reversible (GTR) model (Lanave et al., 1984) where all nucleotide frequencies and conditional exchange rates are estimated from the data.
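To make the construction of the rate matrix concrete, the following minimal sketch assembles a GTR-style matrix from a vector of stationary frequencies and a set of symmetric exchange rates, then checks the reversibility condition described above. The function names, the numerical values, and the Taylor-series matrix exponential are illustrative assumptions, not taken from any particular phylogenetics package.

```python
NUCS = "ACGT"

def gtr_rate_matrix(freqs, exch):
    """Combine stationary frequencies and symmetric exchange rates into Q.

    freqs: stationary probabilities in the order A, C, G, T (sum to 1).
    exch:  exchange rates keyed by unordered pair, e.g. "AG".
    """
    n = len(NUCS)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                pair = "".join(sorted(NUCS[i] + NUCS[j]))
                q[i][j] = exch[pair] * freqs[j]
        q[i][i] = -sum(q[i])  # each row of a rate matrix sums to zero
    # Rescale so that one time unit equals one expected substitution per site.
    mu = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[x / mu for x in row] for row in q]

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_probs(q, t, squarings=10, terms=12):
    """P(t) = exp(Qt), by a truncated Taylor series with scaling and squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[x * scale for x in row] for row in q]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

# Hypothetical parameter values, chosen only for illustration.
freqs = [0.3, 0.2, 0.2, 0.3]
exch = {"AC": 1.0, "AG": 4.0, "AT": 1.0, "CG": 1.0, "CT": 4.0, "GT": 1.0}
Q = gtr_rate_matrix(freqs, exch)
P = transition_probs(Q, 0.5)

# Reversibility (detailed balance): pi_i * P_ij equals pi_j * P_ji.
reversible = all(abs(freqs[i] * P[i][j] - freqs[j] * P[j][i]) < 1e-9
                 for i in range(4) for j in range(4))
```

Setting all exchange rates and all frequencies equal in this sketch recovers the Jukes and Cantor model as a special case.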
For amino acids, conditional exchange rates (and possibly stationary probabilities) are generally estimated once from a corpus of data supposed to faithfully represent a particular class of proteins (e.g., globular proteins), hence the numerous matrices available (e.g., WAG, mtREV, Dayhoff, JTT, VT, Blosum62, CpREV, RtREV, MtMam, MtArt, HIVb, and HIVw; Thorne and Goldman, 2003). These precompiled rates are biochemically meaningful; that is, exchanges between similar amino acids are more likely than between residues with different properties. To help with the selection of the most appropriate evolutionary model for the data at hand, software based on likelihood ratio tests (when models are nested) or on the Akaike (AIC; Akaike, 1973) or Bayesian (BIC; Schwarz, 1978) information criteria has been developed and is in wide use (Abascal et al., 2005; Posada and Crandall, 1998). However, this automated approach suffers from at least two shortcomings. First, the best-fitting model sometimes gets its very good statistical fit from parameters that are not directly relevant for the phylogenetic problem considered. Second, all models evaluated are standard models that assume site independence along with stationarity, reversibility, and homogeneity of the evolutionary process.
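As a rough illustration of how such information criteria rank candidate models, the sketch below computes AIC and BIC from a log-likelihood and a parameter count. The log-likelihood values and the alignment length are hypothetical, and the parameter counts deliberately ignore branch lengths, which are shared by all three candidates.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return k * math.log(n) - 2 * log_lik

# Hypothetical fits: model name -> (log-likelihood, free substitution parameters).
# GTR: 5 exchange rates + 3 frequencies; "+G" adds the gamma shape parameter.
fits = {"JC69": (-5230.4, 0), "GTR": (-5101.2, 8), "GTR+G": (-5050.0, 9)}
n_sites = 1200  # hypothetical alignment length

best_aic = min(fits, key=lambda m: aic(*fits[m]))
best_bic = min(fits, key=lambda m: bic(*fits[m], n_sites))
```

Both criteria reward fit and penalize parameters, but neither can flag a violation that every candidate model shares, which is the second shortcoming noted above.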

2.3.2 Handling of Rate Signal

In the first probabilistic models, all positions of a nucleotide or protein sequence were expected to evolve at the same rate, both across positions and over time. With the introduction of the rate across sites (RAS) model (Yang, 1993), this assumption was partially relaxed through the modeling of site-specific substitution rates as random variables following a gamma distribution. In practice, positions are binned into discrete categories (from 4 to 16) that approximate a gamma distribution whose unique shape parameter (alpha) is estimated from the data. By improving the interpretation of fast-evolving (i.e., saturated) positions, which are the most likely to create a nonphylogenetic signal, the introduction of the RAS model led to considerable gains in phylogenetic accuracy (Yang, 1996a). More sophisticated distributions of substitution rates further increase the fit of the model to alignments (Huelsenbeck et al., 2006; Mayrose et al., 2005), thus suggesting that improvements of the RAS model are still possible. Since functional and structural constraints operating at a given site are subject to change over time (Penny et al., 2001), substitution rates should be allowed to vary not only across positions, but also over time. The first attempt at handling such a behavior is due to Fitch and Markowitz (1970) with the covarion hypothesis: at a given time, only a constant fraction of the codons (the concomitantly variable codons or covarions) can accept substitutions, while others (invariable sites) cannot. Over time, however, a site can shift from being variable to being invariable (and vice versa). In spite of its early appearance, the covarion hypothesis did not receive much attention until the end of the 1990s. The covarion idea (variable versus invariable) was generalized by the concept of heterotachy (meaning “different speeds” in Greek) corresponding to sites that evolve at different rates over time (Philippe and Lopez, 2001).
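The discrete gamma categories of the RAS model mentioned above can be sketched as follows. This is a minimal illustration that assumes equal-probability categories, each represented by its median rate and rescaled to a mean of 1; this is one common discretization, and actual programs may use category means instead. The helper functions are hypothetical and rely only on the standard library.

```python
import math

def gamma_cdf(x, shape):
    """Regularized lower incomplete gamma P(shape, x), by its power series."""
    if x <= 0:
        return 0.0
    term = 1.0 / shape
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= x / (shape + n)
        total += term
        n += 1
    return total * math.exp(-x + shape * math.log(x) - math.lgamma(shape))

def gamma_quantile(p, shape):
    """Quantile of a gamma distribution with mean 1 (rate = shape), by bisection."""
    lo, hi = 0.0, 1.0
    while gamma_cdf(hi * shape, shape) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid * shape, shape) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def discrete_gamma_rates(alpha, k=4):
    """Median rate of each of k equal-probability categories, rescaled to mean 1."""
    medians = [gamma_quantile((2 * i + 1) / (2 * k), alpha) for i in range(k)]
    mean = sum(medians) / k
    return [m / mean for m in medians]

# A small alpha gives strongly unequal rates; a large alpha gives nearly equal ones.
rates_skewed = discrete_gamma_rates(0.5, 4)
rates_even = discrete_gamma_rates(50.0, 4)
```

The likelihood at each site is then averaged over the k categories, so the shape parameter alpha is the only additional free parameter.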
Heterotachy has since been shown to frequently occur in both nucleotide and amino acid sequences (e.g., Lopez et al., 2002; Misof et al., 2002). Owing to the resulting nonphylogenetic signal, this violation of the rate constancy assumption is able to impede phylogenetic inference whether in simulations (Kolaczkowski and Thornton, 2004; Lockhart et al., 1996) or empirical studies (Inagaki et al., 2004; Lockhart et al., 1998; Philippe and Germot, 2000). To estimate the level of heterotachy, a supermatrix is ﬁrst divided into monophyletic groups for each of which is computed a tree. A given position is then said to be homotachous if the number of substitutions undergone in each group is proportional to the tree length of that group, which is tested by a modiﬁed chi-square procedure (Baele et al., 2006; Lopez et al., 1999). Alternatively, heterotachy can be detected and modeled through probabilistic models speciﬁcally designed to allow different evolutionary rates over time. Two families of models have been developed to this end, the covarion-like models and the “mixture of branch lengths” (MBL) models. In the covarion model as ﬁrst formalized by Tufﬂey and Steel (1998), the substitution history at each position unfolds according to a doubly stochastic process, in which the rate of a classical ﬁrst-order Markov substitution process is itself time modulated in an on–off fashion. Its implementation in a Bayesian framework is due to Huelsenbeck (2002) and orthogonally supports the RAS model for the “on” state. In a variant proposed by Galtier (2001), a constant fraction of sites evolve under the RAS model, while the remaining

sites switch among the same different rate classes over time. Huelsenbeck’s and Galtier’s models have advantages and disadvantages for different data sets but are signiﬁcantly outperformed by a generalized covarion model that contains both of them as special cases (Wang et al., 2007). Tufﬂey and Steel’s covarion model only adds two global stationary parameters (the switching rates from on to off and from off to on, respectively) to handle heterotachy (as does Galtier’s yet through a different approach). On the other hand, this model assumes that rate shifts occur in a strictly site-independent fashion, whereas more general scenarios are possible, such as those involving collective rate shifts of many sites at once due to a sudden change in the selection pressure (Germot and Philippe, 1999; Inagaki et al., 2004; Philippe and Germot, 2000). In principle, this kind of collective rate shifting could be better handled by the MBL model, ﬁrst proposed by Kolaczkowski and Thornton (2004), and then reformulated as a correct likelihood model (Spencer et al., 2005). As its name implies, the MBL model incorporates heterotachy by including a predeﬁned number of components per topology, each specifying a distinct set of branch lengths. Given the serious increase in parameters required, any potential gain over covarion models would be statistically expensive—without mentioning the computational challenges raised by the implementation of the MBL model in a free topology perspective. Using AIC, BIC, and cross-validation techniques, the relative merits of covarion and MBL models were recently investigated on three large real data sets (Zhou et al., 2007). Although the heterotachy detected by the MBL model corresponds to collective rate shifts that are biologically meaningful, both BIC and cross-validation indicate that the covarion model performs signiﬁcantly better on all data sets analyzed. 
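The on/off structure of a Tuffley and Steel style covarion process can be sketched as an enlarged rate matrix over (state, nucleotide) pairs. This is a minimal illustration under stated assumptions: the base substitution matrix is Jukes and Cantor, the two switching rates are hypothetical values, and rate variation across sites is left out.

```python
def covarion_matrix(q, s_on_off, s_off_on):
    """Rate matrix over 2n states: indices 0..n-1 are "on", n..2n-1 are "off".

    Substitutions follow q only while a site is "on"; a site that is "off"
    keeps its character state and can only switch back "on".
    """
    n = len(q)
    big = [[0.0] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                big[i][j] = q[i][j]   # substitution while "on"
        big[i][n + i] = s_on_off      # on -> off, character unchanged
        big[n + i][i] = s_off_on      # off -> on, character unchanged
    for r in range(2 * n):
        big[r][r] = -sum(big[r][c] for c in range(2 * n) if c != r)
    return big

# Jukes and Cantor base process; the switching rates below are hypothetical.
jc = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]
s_on_off, s_off_on = 0.4, 0.1
C = covarion_matrix(jc, s_on_off, s_off_on)

# Expected proportion of sites that are "on" at equilibrium.
p_on = s_off_on / (s_on_off + s_off_on)
stationary = [p_on * 0.25] * 4 + [(1 - p_on) * 0.25] * 4
```

Because the switching is defined per site, this formulation cannot by itself represent many sites changing rate together, which is the limitation discussed above.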
MBL’s failure was somewhat expected as it implies inferring a complete set of branch lengths for each component, while heterotachy may affect only a subset of the tree. Accordingly, branch lengths are well correlated among components (Zhou et al., 2007). Furthermore, separate models used to capture heterotachy among genes (Yang, 1996b) likely suffer from the same weaknesses (but see Kolaczkowski and Thornton, 2008), including branch length correlation (Moreira et al., 2002), as they are similar to the MBL model except that the number of components and their structure are predeﬁned. A promising avenue to combine the explicit account of collective events allowed by the MBL model with the economy of parameters of the covarion model would be to develop a divergence point model suited to heterotachy (Zhou et al., 2007). In such a model, some positions would evolve differently from other sites in the different areas of the tree deﬁned by the various breakpoints corresponding to functional or structural shifts (see below for an application to compositional signal).
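The homotachy criterion described earlier in this section (substitutions per group proportional to the tree length of that group) can be illustrated with an ordinary Pearson chi-square statistic. This sketch is not the modified procedure of the cited studies; the function name and the example counts are hypothetical.

```python
def homotachy_chi2(subs_per_group, tree_lengths):
    """Pearson chi-square of observed substitutions against tree-length expectations.

    subs_per_group: substitutions inferred at one position in each monophyletic group.
    tree_lengths:   total tree length of each group (same order).
    """
    total_subs = sum(subs_per_group)
    total_len = sum(tree_lengths)
    if total_subs == 0:
        return 0.0  # a constant position carries no information about rate shifts
    stat = 0.0
    for observed, length in zip(subs_per_group, tree_lengths):
        expected = total_subs * length / total_len
        stat += (observed - expected) ** 2 / expected
    return stat

# A position whose substitutions track tree length, and one that does not.
homotachous = homotachy_chi2([2, 4], [1.0, 2.0])
heterotachous = homotachy_chi2([6, 0], [1.0, 1.0])
```

A large statistic for a position suggests that its rate differs among groups, that is, that the position is heterotachous.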

2.3.3 Handling of Compositional Signal

Besides rate constancy, standard models assume that a single Markov process of substitution may be applicable to all positions at all times. This implies that the stationary probabilities of the four nucleotides (or of the 20 amino acids) remain the same both along the sequence and across the tree. If this assumption were true, all sites from all species should display homogeneous state compositions, up to stochastic fluctuations. However, state composition has been shown to be highly heterogeneous among species, whether for nucleotide (Bernardi, 1993; Jukes and Bhushan, 1986; Montero et al., 1990) or amino acid sequences (Foster et al., 1997), a phenomenon generally denoted as compositional bias. When using standard (homogeneous) models, this violation of the stationarity assumption creates a nonphylogenetic signal that may lead to the attraction of sequences sharing a similar composition yet otherwise unrelated (Foster, 2004; Foster and Hickey, 1999; Galtier and Gouy, 1995; Lake, 1994; Lockhart et al., 1994; Lockhart et al., 1992; Mooers and Holmes, 2000; Yang and Roberts, 1995). As for heterotachy, statistical tests are available to detect compositional heterogeneity in supermatrices. Building on Tavaré’s early work (Tavaré, 1986), Jermiin and coworkers devised a series of matched-pairs tests of homogeneity advocated as more appropriate than other strategies commonly used (for review see Jermiin et al., 2004). These include Bowker’s test of symmetry (Bowker, 1948), Stuart–Maxwell’s test of marginal symmetry (Stuart, 1955), and Ababneh et al.’s test of internal symmetry (Ababneh et al., 2006). Their interpretative complexity (Jermiin et al., 2008), along with a relative lack of evaluation outside simulated nucleotide data sets, has so far hindered the wide adoption of such tests (but see Rodriguez-Ezpeleta et al., 2007b). For nucleotide supermatrices, a related visual approach based on tetrahedral plots has been developed (Ho et al., 2006), whereas for protein data sets, a two-dimensional plot in a principal component analysis comparing amino acid frequencies across the various sequences is generally enough to identify outlier species that have to be handled with special care (Delsuc et al., 2006; Rodriguez-Ezpeleta et al., 2007b). Beyond these strategies merely aimed at detecting compositional biases, several nonstationary models actually able to handle such heterogeneities have been proposed. In Yang and Roberts’ model (Yang and Roberts, 1995), the most general nonhomogeneous substitution model over time, each branch of the tree (including the root) is assigned four nucleotide composition parameters (three free parameters). Rooted trees have to be considered because Felsenstein’s pulley principle (Felsenstein, 1981) does not apply to processes that are not time reversible.
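Of the matched-pairs tests listed above, Bowker’s test of symmetry is the simplest to sketch. The function below builds the divergence matrix for one pair of aligned sequences and compares each off-diagonal cell with its mirror image. Skipping empty cell pairs when counting degrees of freedom is an assumption of this sketch, and computing the p-value from the chi-square distribution is left out.

```python
from collections import Counter
from itertools import combinations

def bowker_statistic(seq_a, seq_b, alphabet="ACGT"):
    """Bowker's symmetry statistic and its degrees of freedom for one sequence pair.

    n[x, y] counts sites showing state x in seq_a and state y in seq_b. If both
    sequences evolved under the same process, n[x, y] and n[y, x] should be similar.
    """
    counts = Counter((a, b) for a, b in zip(seq_a, seq_b)
                     if a in alphabet and b in alphabet)
    stat, df = 0.0, 0
    for x, y in combinations(alphabet, 2):
        n_xy, n_yx = counts[(x, y)], counts[(y, x)]
        if n_xy + n_yx > 0:   # empty cell pairs are skipped
            stat += (n_xy - n_yx) ** 2 / (n_xy + n_yx)
            df += 1
    return stat, df

# Toy example: every difference is A in the first sequence and C in the second.
stat, df = bowker_statistic("AAAACCCC", "CCCCCCCC")
```

The statistic is compared with a chi-square distribution with `df` degrees of freedom; in a supermatrix, the test is repeated over all pairs of species.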
The high parameterization of this model is an issue when there are not enough data available to accurately infer all the parameters. Moreover, the estimation of an independent composition vector for each branch results in a considerable computational burden. In the simplified model introduced by Galtier and Gouy (1998), the compositional vector is replaced by a single parameter corresponding to the G + C content. Though packaged in an efficient reimplementation (Boussau and Gouy, 2006), this model still requires the estimation of many free parameters when a large number of species are used. To reduce such overparameterization effects, Foster (2004) proposed a model based on a predefined number of composition vectors, smaller than the number of branches in the tree. While this approach helps to control the number of free parameters, it still lacks flexibility as the number of compositional vectors is fixed. Along with other refinements, this limitation has been recently removed by the model of Gowri-Shankar and Rattray (2007) in which the number of composition vectors becomes a free parameter that is allowed to vary during the inference. With all branchwise models described above, the equilibrium frequencies of the substitution process have to be reassessed at the base of each branch of the tree. In addition to leading to overparameterization, such an approach is likely not the best option since, in many practical situations, the equilibrium frequencies may have remained constant for periods spanning several branches, sometimes entire groups. Furthermore, in these models, changes in compositional biases are always associated with speciation events (i.e., occurring at tree nodes), which is not realistic. Ideally, both phenomena should be uncoupled.
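The overparameterization issue can be made tangible by counting composition parameters for nucleotide data. The sketch below assumes a rooted, fully bifurcating tree and counts composition parameters only (no branch lengths or exchange rates). The model labels are hypothetical identifiers for this example, and counting one G + C parameter for the root in the Galtier and Gouy case is an assumption.

```python
def n_branches_rooted(n_species):
    """Number of branches in a rooted, fully bifurcating tree."""
    return 2 * n_species - 2

def composition_parameters(n_species, model, n_vectors=None):
    """Free composition parameters (nucleotide data) under three branchwise schemes."""
    slots = n_branches_rooted(n_species) + 1  # every branch plus the root
    if model == "yang_roberts":    # one 4-state vector (3 free) per branch and root
        return 3 * slots
    if model == "galtier_gouy":    # one G + C content per branch and root
        return slots
    if model == "foster":          # a predefined number of shared vectors
        return 3 * n_vectors
    raise ValueError(f"unknown model: {model}")

for n in (10, 50):
    print(n,
          composition_parameters(n, "yang_roberts"),
          composition_parameters(n, "galtier_gouy"),
          composition_parameters(n, "foster", n_vectors=3))
```

Under these assumptions the first two counts grow linearly with the number of species, whereas the third depends only on the chosen number of vectors.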
In this direction, Blanquart and Lartillot (2006) have introduced a nonhomogeneous divergence point model based on a compound stochastic (or doubly Markov) process (Huelsenbeck et al., 2000) where an additional Poisson process operates along the branches to model compositional shift events independently of speciation events. Interestingly, this BP model (for breakpoint) can be made equivalent to either Yang and Roberts’ or Galtier and Gouy’s models by enforcing a breakpoint at the base of each branch. Albeit very parsimonious when the number of breakpoints is small, Blanquart and Lartillot’s model does not allow the different areas of the tree to share parameters. This contrasts with Gowri-Shankar and Rattray’s model where each branch draws its composition vector from a common pool. Therefore, if similar substitution patterns evolve many times independently, the latter model may provide a simpler explanation of the data (Gowri-Shankar and Rattray, 2007).

2.3.4 Other Model Violations

Since their introduction 40 years ago (Jukes and Cantor, 1969), probabilistic models of sequence evolution have made considerable progress. Standard models in wide use handle at least three kinds of heterogeneities that would otherwise create a nonphylogenetic signal: the differences in the instantaneous substitution rate among each character state pair, the differences in the equilibrium frequencies of each character state, and the rate heterogeneity across sites. In many cases, these models perform correctly because these heterogeneities are likely to be the most prevalent in the evolution of biological sequences. Moreover, we have just reviewed a number of improved models able to deal either with heterotachy or with compositional biases, thus further reducing the risk of systematic errors. Among the remaining model violations, let us cite the assumptions of homogeneity of the substitution pattern along the sequence and of independence between positions, as well as the (lack of) account of the underlying genetic code in nucleotide coding sequences. Concerning the homogeneity assumption, it is trivial to note that a given site of a protein alignment does not display all possible amino acids, but only a particular subset, generally characterized by similar biochemical properties (Bruno, 1996; Miyamoto and Fitch, 1996). Though empirical exchangeability matrices (e.g., JTT, WAG) certainly help, they were shown to deal inaccurately with saturation effects due to multiple substitutions among highly exchangeable amino acids (Lartillot et al., 2007), which creates a nonphylogenetic signal that can lead to LBA artifacts (Felsenstein, 1978). The CAT model (Lartillot and Philippe, 2004) has been developed specifically to handle such cases but owing to its increasing application in recent years, it will be presented in a dedicated section below, along with key achievements.
Although model development provided more reasonable descriptions of the process of sequence evolution, several approximations persist. Hence, most of the models currently in use make the assumption that evolutionary events at a particular position are independent from events at other sites. While this simplification has both practical and computational justifications, it is biologically unrealistic. For example, the overall fold of a protein must involve interactions between various residues of the primary sequence. Consequently, means of relaxing this assumption have been investigated, usually with correlations or dependence between a limited number of sites (e.g., Felsenstein and Churchill, 1996; Siepel and Haussler, 2004) or for a limited number of sequences (Jensen and Pedersen, 2000; Pedersen and Jensen, 2001; Robinson et al., 2003). One way to introduce explicit protein structure constraints into evolutionary models is based on statistical potentials, which are empirically derived scores akin to exchangeability matrices, except that they give pairwise plausibilities of finding two amino acids at a particular spatial proximity in natural sequences (e.g., Bastolla et al., 2001; Miyazawa and Jernigan, 1985). In the models first proposed by Robinson et al. (2003) in mechanistic terms at the codon level, then formulated directly at the amino acid level by Rodrigue et al. (2005), statistical potentials are used as a proxy to capture the approximate structural fitness of a sequence, so that differences in compatibility, before and after inferred amino acid substitutions, influence the probability of an evolutionary scenario. In a recent evaluation of this approach on three real data sets using Bayesian methods of model selection, it was shown that considering site interdependency due to tertiary structure through statistical potentials always improves the fit of the model. However, resorting to pairwise contact potentials alone cannot account for the complexities of the evolutionary process successfully modeled under the assumption of site independence. Therefore, a pragmatic strategy, which receives the highest support in model assessments, is to combine the JTT empirical exchangeability matrix and the RAS model with Bastolla statistical potentials in a layered model (Rodrigue et al., 2006). Aside from inference based on rRNA genes, most phylogenetic trees are computed from alignments of coding sequences. Since the number of character states is greater for amino acids than for nucleotides, inference using DNA sequences is often reserved for closely related organisms (e.g., mammals), whereas broad phylogenetic studies are carried out using proteins. Following the same logic, inference based on codons is appealing as it triples the character space relative to amino acids, while potentially accounting for heterogeneities due to the coding nature of DNA. The first codon models were introduced 15 years ago (Goldman and Yang, 1994; Muse and Gaut, 1994) and gave rise to two families of models, each with many extensions (e.g., Huelsenbeck et al., 2006; Yang and Nielsen, 2008).
In all cases, the exchangeability matrix has 61 × 61 entries, thus ignoring the three nonsense codons, yet models derived from Muse and Gaut’s model (Muse and Gaut, 1994) have a small vector of nucleotide stationary probabilities, whereas models in the Goldman and Yang style (Goldman and Yang, 1994) have a large vector of codon stationary probabilities. Using the same set of Bayesian techniques as above applied to three real data sets, Rodrigue et al. (2008) recently compared many variants of both models and found that Muse and Gaut’s formulation was the soundest. Therefore, future extensions to codon models should be incorporated in that context, which allows for a flexible account of global amino acid or codon preferences while maintaining distinct parameters for nucleotide stationary probabilities. One should also note that owing to their inherently large exchangeability matrix, phylogenetic analyses under codon models are very time-consuming.
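The 61 × 61 state space and the role of nucleotide stationary probabilities in a Muse and Gaut style formulation can be sketched as follows. This is a simplified illustration assuming the standard genetic code, a single nonsynonymous/synonymous ratio (here called `omega`), and no transition/transversion parameter. All names and values are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; "*" marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
GENETIC_CODE = dict(zip(CODONS, AMINO))
SENSE = [c for c in CODONS if GENETIC_CODE[c] != "*"]  # the 61 sense codons

def mg_rate(c1, c2, nuc_freq, omega):
    """Rate from codon c1 to c2, proportional to the frequency of the target nucleotide."""
    diffs = [(a, b) for a, b in zip(c1, c2) if a != b]
    if len(diffs) != 1:
        return 0.0                    # only single-nucleotide changes are allowed
    rate = nuc_freq[diffs[0][1]]
    if GENETIC_CODE[c1] != GENETIC_CODE[c2]:
        rate *= omega                 # nonsynonymous changes are rescaled
    return rate

def mg_matrix(nuc_freq, omega):
    """Full 61 x 61 instantaneous rate matrix over sense codons."""
    n = len(SENSE)
    q = [[mg_rate(SENSE[i], SENSE[j], nuc_freq, omega) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])
    return q

nuc_freq = {"T": 0.3, "C": 0.2, "A": 0.3, "G": 0.2}  # hypothetical values
Q_codon = mg_matrix(nuc_freq, omega=0.2)
```

In a Goldman and Yang style model, the factor `nuc_freq[...]` would instead be the stationary probability of the whole target codon, which is what makes the frequency vector large.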

2.3.5 Future Developments

With the generalization of supermatrices, opportunities for more realistic, parameter-rich models are on the rise (Philippe et al., 2005a). In principle, model development should focus on issues that cause inconsistency (i.e., undetected multiple substitutions) of the current methods without trying to “fit an elephant” (Steel, 2005). To this end, an advantage of probabilistic methods is that one can explicitly look for the assumptions of the model that are violated by the data and then devise a new model accounting for these heterogeneities. Similarly, simulations make it possible to check that the revised model correctly describes the true evolutionary process by comparing statistics obtained on real data with those resulting from simulated data sets (for a review see Sullivan and Joyce, 2005). As long as the number of parameters increases more slowly than the number of positions, a model does not fall into the “infinitely many parameter trap” and thus has good consistency (Felsenstein, 2004).

So far, most model violations requiring special modeling have been addressed only one at a time (but see below the CAT–BP model; Blanquart and Lartillot, 2008). In this respect, a long-term goal of modern phylogenetics is to develop improved models that could cope with all types of heterogeneities at once. Besides the risk of overparameterization, such models do not exist at present because their computational (hence environmental, see Philippe, 2008) cost would be absolutely prohibitive. This illustrates the fact that any evolutionary model will always be a subtle compromise trading off a bit of biological realism for large gains in mathematical simplicity and computational speed. Consequently, model violations in supermatrices are here to stay, which calls for alternative strategies able to reduce the nonphylogenetic signal associated with the heterogeneities not yet accounted for.

2.4 REDUCTION OF NONPHYLOGENETIC SIGNAL UNDER FIXED MODELS

Since model violations impede phylogenetic inference by preventing efficient detection of multiple substitutions, any nonphylogenetic signal is exacerbated at mutational saturation (Rodriguez-Ezpeleta et al., 2007b). Given a reasonable tree, saturation can be estimated by plotting the number of observed differences as a function of the number of inferred substitutions for all pairs of species in a supermatrix. As soon as the curve reaches a plateau (i.e., the same amount of observed differences corresponds to a large range of inferred substitutions), the supermatrix is saturated (Jeffroy et al., 2006; Philippe et al., 1994b) and should be searched for systematic biases that may affect analyses. Alternatively, a nonphylogenetic signal can be revealed by its effects on phylogenetic reconstruction, for example, through the lack of resolution of a tree computed from a large supermatrix (Baurain et al., 2007; Rodriguez-Ezpeleta et al., 2007b). There are nonetheless cases where the confounding signal is so great that the resulting tree, albeit perfectly supported by bootstrap values or posterior probabilities, is flat wrong (Jeffroy et al., 2006; Rokas et al., 2003). Hence, the search for nonphylogenetic signal in supermatrices should be part of routine analyses when doing phylogenomics. To this end, one commonly adopted strategy is to compute trees using several reconstruction methods and/or evolutionary models. As those display different sensitivities to the various sources of systematic error, this approach frequently leads to a number of distinct trees from which the most and least robust parts of the phylogeny can be identified (Baurain et al., 2007; Jeffroy et al., 2006).
Another possibility is to perform multiple rounds of phylogenetic inference on different subsamples of the data of interest, either by varying the set of species considered (taxon sampling) or after recoding or removing the most saturated positions (Rodriguez-Ezpeleta et al., 2007b).
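The saturation plot described at the start of this section can be prepared with a few lines of code. The sketch below assumes that the inferred (tree-based) distances between species pairs have been computed elsewhere and are passed in as a dictionary. The toy alignment, the distances, and the function names are hypothetical. The slope of a least-squares fit is used here as a crude numerical summary of the plateau.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of compared sites at which two aligned sequences differ (gaps skipped)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)

def saturation_points(alignment, inferred):
    """One (inferred substitutions, observed differences) point per species pair."""
    points = []
    for s1, s2 in combinations(sorted(alignment), 2):
        points.append((inferred[(s1, s2)], p_distance(alignment[s1], alignment[s2])))
    return sorted(points)

def slope(points):
    """Least-squares slope of observed differences against inferred substitutions."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return sxy / sxx

# Toy data: inferred distances would normally be path lengths on a reference tree.
alignment = {"sp1": "ACGTACGTAC", "sp2": "ACGTACGTAA", "sp3": "TCGAACCTAA"}
inferred = {("sp1", "sp2"): 0.11, ("sp1", "sp3"): 0.9, ("sp2", "sp3"): 0.8}
points = saturation_points(alignment, inferred)
fit = slope(points)
```

A slope close to 1 indicates that observed differences still track inferred substitutions, whereas a slope well below 1 points to a plateau and hence to saturation.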

2.4.1 Variations in Taxon Sampling

The link between taxon sampling and detection of multiple substitutions has been known for a while and can be summarized as “breaking long branches” (Hendy and Penny, 1989; Hillis, 1996; Zwickl and Hillis, 2002). This means that a rich (and well designed) set of species helps to reduce nonphylogenetic signal by promoting the identification of fast-evolving positions, thus minimizing LBA artifacts due to homoplasies. In this respect, the controversy about the new animal phylogeny (Adoutte et al., 2000) is a cautionary tale. Early phylogenomic studies aimed at reconstructing the evolution of metazoans were based on hundreds of genes from a handful of model organisms (Blair et al., 2002; Dopazo et al., 2004; Philip et al., 2005; Wolf et al., 2004). These attempts led to long-branched trees that strongly supported the morphological view, known as the “Coelomata hypothesis,” in which arthropods are grouped with vertebrates (Brusca and Brusca, 1990). With the multiplication of large-scale sequencing projects allowing broader taxon samplings (Philippe and Telford, 2006), new studies provided convincing evidence for the alternative view (Ecdysozoa), in which arthropods are instead grouped with nematodes (e.g., Baurain et al., 2007; Dunn et al., 2008; Matus et al., 2006; Philippe et al., 2005b). Interestingly, the overall resolution of the animal tree also rose with the number of species considered, as can be seen in a series of three papers using the same genes and methods (Baurain et al., 2007; Philippe et al., 2005b; Philippe et al., 2004). Although the emphasis is often put on the accuracy of tree reconstruction, this demonstrates that the resolving power of phylogenomics can also be drastically affected by taxon sampling (Baurain et al., 2007; Lecointre et al., 1993). In some phylogenomic analyses, the lack of resolution persists in spite of a rich set of species. This is generally due to a limited number of the so-called “rogue” species, nearly always fast evolving, that cause model violations (Baurain et al., 2007; Philippe et al., 2005a; Rodriguez-Ezpeleta et al., 2007b; Sanderson and Shaffer, 2002). With a bit of practice, these are easy to spot as they often have long branches in preliminary trees and tend to take part in the topological moves observed when using different methods or models. It could happen, however, that some rogue species can only be identified by the statistical tests designed to detect heterotachy or compositional bias (see above).
In any case, the next step is to selectively remove these species from the analysis to check whether alternative approaches now converge on a unique, though possibly still incorrect, tree (Brinkmann et al., 2005; Rodriguez-Ezpeleta et al., 2007b). Such targeted removals can have drastic effects not only on the topology of phylogenomic trees but also on their support because nonphylogenetic signal decreases without affecting phylogenetic signal (unchanged number of positions; Baurain et al., 2007; Rodriguez-Ezpeleta et al., 2007b). For instance, while platyhelminthes strongly cluster with nematodes within Ecdysozoa when analyzed simultaneously in a large-scale animal phylogeny, removing nematodes from the supermatrix yields a tree where platyhelminthes become sister to Lophotrochozoa (e.g., molluscs and annelids; Philippe et al., 2005b). Similarly, using two different species as the lone red algal representative results in two incongruent but highly supported eukaryotic phylogenies, which is a hallmark of nonphylogenetic signal (Rodriguez-Ezpeleta et al., 2007b). Whenever possible, rogue species should be replaced by a slow-evolving representative of the same taxonomic group. By avoiding multiple substitutions in the first place, this strategy makes it possible to reduce the amount of nonphylogenetic signal while preserving the genuine phylogenetic signal (Aguinaldo et al., 1997). For example, with a limited taxon sampling, the replacement of a classical yet fast-evolving nematode by a slow-evolving relative was enough to suppress the LBA artifact between nematodes and platyhelminthes that prevented recovery of Ecdysozoa monophyly (Baurain et al., 2007). Of course, such an exchange is only feasible when many related species have been sequenced. For some key isolated species (e.g., the angiosperm Amborella), this cannot be done.
More important, outgroup species, which are required to root molecular trees, are also rogue species because they diverged earlier and likely evolved under constraints distinct from those of the ingroup species. Therefore, it is always informative to infer a tree after removal of the outgroup for comparison (Brinkmann et al., 2005; Philippe et al., 2007). Conversely, creating long branches by deliberately using a distant outgroup or by removing early-branching ingroups is a good way of disclosing a nonphylogenetic signal in the remaining species (Rodriguez-Ezpeleta et al., 2007b). When rooting is not dispensable, a closer (and broader) outgroup can also completely counteract phylogenetic artifacts in the ingroup, especially in a context of scarce taxon sampling (Delsuc et al., 2005). This illustrates the fact that sensitivity to species sampling is currently the most salient signature of nonphylogenetic signal.

2.4.2 Recoding and Removal of Offending Data

A distinct advantage of data profusion is that the least reliable characters can be discarded without compromising statistical significance (Delsuc et al., 2005), as would happen with single-gene phylogenies (Philippe et al., 2000). Along this line, two main approaches can be envisaged: either reducing the number of character states at each position through a sensible coding scheme or selectively removing the most saturated (i.e., fast-evolving) and evolutionarily biased positions. In both cases, the overall apparent signal will decrease but the confounding nonphylogenetic signal will be more reduced than the genuine phylogenetic signal, thus resulting in increased accuracy and support (Philippe et al., 2005a). Recoding of character states is generally done according to functional groups, so as to homogenize sequence composition across species. For instance, the RY coding of DNA sequences (Woese et al., 1991) consists in replacing nucleotides A and G by R (purine) and C and T by Y (pyrimidine), which leads to ignoring frequent transition events in favor of more informative transversions. Beyond efficient desaturation, this strategy also helps to cope with compositional biases (including G + C bias), as these are often more pronounced between nucleotides of the same family. Hence, RY coding has been successfully used to handle the low GC bias of many mitochondrial genomes (Delsuc et al., 2003; Gibson et al., 2005; Phillips and Penny, 2003), as well as in yeast phylogenomics (Jeffroy et al., 2006; Phillips et al., 2004). Similarly, the value of Dayhoff coding (six biochemical categories) was demonstrated on biased amino acid sequences (Hrdy et al., 2004). To allow analyses under the widely implemented GTR model, a four-state variant has been introduced.
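The two recoding schemes mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the six amino acid groups used here are the partition commonly labelled "Dayhoff-6" (the text only says "six biochemical categories"), and the function and table names are hypothetical rather than taken from any phylogenetics package.

```python
# Recoding tables. RY coding follows the text (A, G -> R; C, T -> Y).
# The six amino acid groups are the partition commonly labelled "Dayhoff-6";
# the source text does not list them, so treat them as an assumption.
RY_GROUPS = {"R": "AG", "Y": "CTU"}
DAYHOFF6_GROUPS = {
    "1": "AGPST",
    "2": "DENQ",
    "3": "HKR",
    "4": "ILMV",
    "5": "FWY",
    "6": "C",
}


def build_table(groups):
    """Invert {code: members} into a per-character lookup {member: code}."""
    return {member: code for code, members in groups.items() for member in members}


RY_TABLE = build_table(RY_GROUPS)
DAYHOFF6_TABLE = build_table(DAYHOFF6_GROUPS)


def recode(sequence, table, missing="?"):
    """Recode one aligned sequence.

    Gaps ("-") are kept; any symbol absent from the table (e.g., N or X)
    is replaced by `missing`.
    """
    recoded = []
    for symbol in sequence.upper():
        if symbol == "-":
            recoded.append("-")
        else:
            recoded.append(table.get(symbol, missing))
    return "".join(recoded)


def recode_alignment(alignment, table):
    """Apply `recode` to every sequence of a {name: sequence} alignment."""
    return {name: recode(seq, table) for name, seq in alignment.items()}
```

For example, `recode("ACGTN-", RY_TABLE)` gives `"RYRY?-"`, so a transition such as A to G no longer counts as a difference between two sequences, whereas a transversion still does.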
Applied to a eukaryotic supermatrix, this four-state variant was able to reduce the nonphylogenetic signal caused by compositional biases and by multiple substitutions among highly exchangeable amino acids (Rodriguez-Ezpeleta et al., 2007b).

At the DNA level, it is customary to discard third codon positions, as they are the most saturated and G + C biased due to the redundancy of the genetic code (Canback et al., 2004; Delsuc et al., 2002; Jeffroy et al., 2006; Swofford et al., 1996). For amino acid supermatrices, the strategy is to identify and then remove fast-evolving sites, which are likely to have accumulated multiple substitutions and model violations (Philippe et al., 2000). Except for compatibility approaches that do not require prior knowledge of the phylogenetic relationships (Pisani, 2004; Qiu and Estabrook, 2008), most methods of site removal depend on a tree to estimate the evolutionary rates. To avoid circularity when only one species of interest is difficult to locate, sitewise rates can be estimated on a tree built in its absence. For cases with more than one rogue species, computing mean sitewise rates from a set of best topologies and the full supermatrix is a good work-around (Rodriguez-Ezpeleta et al., 2007b). A simple way to identify fast-evolving positions is to take affiliations to discrete gamma categories of the RAS model as a proxy for evolutionary rates: the higher the category number, the faster the corresponding positions (Burleigh and Mathews, 2004; Ruiz-Trillo et al., 1999). Alternatively, sitewise rates can be computed explicitly. Once positions are sorted by decreasing evolutionary rate, progressive removal of the fast-evolving positions allows monitoring the rise and fall of competing signals by plotting the statistical support for alternative groupings as a function of the number of positions considered (Rodriguez-Ezpeleta et al., 2007b). In the slow–fast (SF) method, evolutionary rates are estimated within predefined monophyletic groups for which only intergroup relationships are under study, thus ensuring practical independence toward topology (Brinkmann and Philippe, 1999). Used on a saturated metazoan data set, this strategy turned an initial support for Coelomata into an equally strong support for Ecdysozoa as nearly two-thirds of the original supermatrices were discarded (Delsuc et al., 2005).

The removal of fast-evolving genes is another possibility. For example, when 133 genes are used, all inference methods strongly (yet artifactually) locate microsporidia (highly derived fungi) at the base of eukaryotes, whereas once >70% of the fastest-evolving microsporidial sequences are coded as missing, probabilistic methods avoid the artifact. This shows that a highly incomplete species can be more accurately positioned than its complete version, provided that a minimum number of positions are kept for the analysis (Brinkmann et al., 2005). In the animal tree, a similar approach, in which genes are removed in their totality, was successful at avoiding the attraction between nematodes and platyhelminthes (Philippe et al., 2005b) or between nematodes and the outgroup (Dopazo and Dopazo, 2005).
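The progressive removal of fast-evolving positions described in this section can be sketched as follows. The sketch assumes that sitewise rates (or gamma category numbers used as a proxy) have already been estimated by some other tool; it only produces the nested, progressively shortened alignments on which support values would then be computed. The function names and the dictionary-based alignment format are hypothetical.

```python
def drop_fastest(alignment, rates, n_drop):
    """Return a copy of the alignment without its `n_drop` fastest sites.

    alignment: {species name: aligned sequence}, all of equal length.
    rates: one evolutionary rate (or gamma category number) per site.
    The original column order is preserved among the retained sites.
    """
    n_sites = len(rates)
    if any(len(seq) != n_sites for seq in alignment.values()):
        raise ValueError("every sequence must have one character per rate")
    if not 0 <= n_drop <= n_sites:
        raise ValueError("n_drop must lie between 0 and the number of sites")
    # Site indices sorted by decreasing rate: the fastest sites come first.
    fastest_first = sorted(range(n_sites), key=lambda i: rates[i], reverse=True)
    dropped = set(fastest_first[:n_drop])
    kept = [i for i in range(n_sites) if i not in dropped]
    return {name: "".join(seq[i] for i in kept) for name, seq in alignment.items()}


def progressive_removal(alignment, rates, step):
    """Yield (number of sites kept, reduced alignment) for growing removals.

    Sites are removed `step` at a time, starting with the full alignment
    and stopping before the alignment becomes empty.
    """
    n_sites = len(rates)
    for n_drop in range(0, n_sites, step):
        yield n_sites - n_drop, drop_fastest(alignment, rates, n_drop)
```

Each reduced alignment would then be analyzed separately (e.g., by bootstrap), and the support for competing groupings plotted against the number of positions kept.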

2.5 CAT MODEL

A special section is devoted to CAT (Lartillot and Philippe, 2004) because the improvements introduced by this model had the most significant impact on phylogenomic reconstruction, whereas modeling heterotachy or compositional heterogeneity brought limited benefits. The main motivation behind the CAT model is that a given site of a protein can generally tolerate a limited number of amino acids, often similar in their biochemical properties (Miyamoto and Fitch, 1996). As the number of parameters required to describe these position-specific amino acid preferences is huge, handling this heterogeneity across sites is challenging (Bruno, 1996). One obvious way to reduce parameterization is to use a mixture model where positions with similar properties are grouped in the same categories. To avoid the difficult question of the adequate number of categories, Lartillot and Philippe (2004) took advantage of the infinite mixture model based on the Dirichlet process prior (Ferguson, 1973). This approach adapts the number of categories to the amount of signal present in the supermatrix at the expense of a single hyperparameter. To further reduce the number of parameters, categories are only defined by stationary probabilities (thus offering straightforward modeling of amino acid preferences), while the exchangeability matrix is kept the same for all categories. Since most of the biological complexity is taken into account by these site-specific equilibrium frequencies, the matrix becomes less important. This is why Lartillot and Philippe (2004) chose a uniform matrix (Poisson) that greatly decreases the computational burden and makes it possible to process phylogenomic supermatrices in a realistic time (a few days). Analyzing real data sets with the CAT model yields meaningful results (Lartillot and Philippe, 2004). For instance, the 627 positions of elongation factor 2 are grouped into 28 categories, 11 of which are stable across the MCMC sampling.
These categories appear to be very peaked, with only two or three amino acids displaying high equilibrium frequencies, while others are close to zero. Even more important, categories are biochemically consistent (e.g., D + E, F + Y, I + V, or K + R). The ability of the infinite mixture model to adapt to the data at hand is demonstrated by the higher number of categories obtained from the analysis of a larger matrix (35 categories for 3596 mitochondrial amino acid positions), as well as by the different nature of the inferred categories (many more refinements for hydrophobic amino acids, as expected for predominantly transmembrane proteins). The efficiency of the CAT model in dealing with one of the most important functional constraints acting on protein evolution is logically confirmed by model selection analysis. For all data sets of reasonable size (i.e., >20 species and >500 sites), CAT has a better fit than all site-homogeneous models, including GTR, using either Bayes factors or cross-validation (Lartillot et al., 2007; Lartillot and Philippe, 2004; Lartillot and Philippe, 2008). With regard to phylogenetic inference, the main improvement of the CAT model is to enhance the detection of multiple substitutions, thanks to the limited size of the alphabet at each position (Lartillot et al., 2007). The best way to illustrate this property is to perform computer simulations using different models and to compare the predictions to the observed values. To this end, a useful statistic is the mean number of amino acids per site. CAT is the only model to correctly predict observed values, whereas site-homogeneous models expect a much higher number of amino acids, especially for fast-evolving positions (Lartillot et al., 2007). This is due to the fact that, after very few substitutions (two or three), the influence of the exchangeability matrix drops in favor of stationary probabilities that are not biochemically meaningful (except for CAT). Consequently, the CAT model leads to a better estimation of branch lengths (N. Lartillot et al., unpublished results). Its improved handling of saturation allows CAT to locate long branches with better accuracy during phylogenetic inference, since multiple substitutions along these branches are much more often correctly interpreted as mere autapomorphies.
For instance, while fast-evolving nematodes are attracted by the distant fungal outgroup when using site-homogeneous models (thus supporting the Coelomata hypothesis), they group with arthropods (Ecdysozoa hypothesis) with the CAT model (Lartillot et al., 2007). Interestingly, if the distant outgroup is replaced by a close outgroup (e.g., nonbilaterian animals), all models recover Ecdysozoa monophyly. This indicates that the remaining nonphylogenetic signal is too low to mislead even site-homogeneous models. Another example is the erroneous clustering of two long branches owing to taxonomic isolation rather than fast evolutionary rate (i.e., scarce taxon sampling). Under the WAG matrix, cephalochordates (Branchiostoma) erroneously cluster with echinoderms (Strongylocentrotus), thus breaking chordate monophyly (Delsuc et al., 2006). On the other hand, using the CAT model recovers chordate monophyly (Delsuc et al., 2008), in agreement with results obtained after improved taxon sampling (Bourlat et al., 2006). This efficiency makes it possible to study difficult cases, such as clades characterized by fast evolutionary rate and poor taxonomic diversity. Hence, a very fast-evolving acoel (Convoluta), which always groups with the longest branch of the tree with WAG (i.e., platyhelminthes or the outgroup), is robustly excluded from protostomes under the CAT model. Unfortunately, its precise location, not far from slow-evolving deuterostomes, cannot be resolved yet (Philippe et al., 2007). Although CAT’s achievements are impressive, several model violations remain. In particular, following the stationarity assumption, this model expects a process of amino acid substitution that is homogeneous over time. As a result, CAT erroneously clusters two extremely AT-rich clades (ticks and hymenopterans) when analyzing a compositionally biased mitochondrial amino acid alignment (Blanquart and Lartillot, 2008).
However, the combination of CAT with breakpoint modeling (CAT–BP) recovers the correct topology, albeit at the expense of a huge computational burden. This major study demonstrates that combining the various independent model improvements that have been recently developed (e.g., covarion or GTR) is probably an efficient way to reduce the impact of systematic errors in phylogenomics. As underlined above, the most important limitation to this approach is the computational cost, thus calling for massive algorithmic improvements.
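The statistic used in this section to compare models, the mean number of amino acids per site, is simple to compute on an observed alignment. The sketch below covers only that observed value; simulating alignments under a given model and comparing the simulated values to the observed one is not shown. The function name, the alignment format, and the decision to skip columns without any standard residue are assumptions made for illustration.

```python
AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def mean_amino_acids_per_site(alignment):
    """Mean number of distinct amino acids observed per alignment column.

    alignment: {species name: aligned protein sequence}, equal lengths.
    Gaps and ambiguous symbols (-, ?, X, ...) are ignored, and columns
    that contain no standard amino acid at all are skipped.
    """
    sequences = [seq.upper() for seq in alignment.values()]
    if not sequences:
        raise ValueError("the alignment is empty")
    n_sites = len(sequences[0])
    if any(len(seq) != n_sites for seq in sequences):
        raise ValueError("all sequences must have the same length")

    counts = []
    for i in range(n_sites):
        states = {seq[i] for seq in sequences} & AMINO_ACIDS
        if states:
            counts.append(len(states))
    if not counts:
        raise ValueError("no column contains a standard amino acid")
    return sum(counts) / len(counts)
```

A model that ignores site-specific amino acid preferences would be expected to produce simulated alignments with a higher value of this statistic than the real data, which is the discrepancy reported above for site-homogeneous models.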

2.6 CASE STUDY: CAMBRIAN EXPLOSION

To finish this chapter, we will discuss recent insights into the Cambrian explosion (Conway Morris, 2000) brought by phylogenomics. The paleontological record suggests that almost all bilaterian lineages suddenly appear near the base of the Cambrian, with very little, and disputed, evidence of Precambrian fossils. Since earlier branching animals (Porifera and Cnidaria) are found in Precambrian sediments, the lack of Bilateria can be interpreted as a rapid diversification of Bilateria at the end of this period (Budd and Jensen, 2000). The lack of resolution of molecular phylogenies based on rRNA (Abouheif et al., 1998; Field et al., 1988; Philippe et al., 1994a) first seemed to confirm this hypothesis known as the “Cambrian explosion”. Similarly, a recent large-scale study (50 genes), in which only vertebrates and platyhelminthes were supported by 100% BV, was interpreted as a “molecular signature of radiations compressed in time” (Rokas et al., 2005). In spite of a limited taxon sampling (23 species) featuring extensive rate variation among species, only site-homogeneous models were used. This opens the possibility that the reported lack of resolution could be due to a large amount of nonphylogenetic signal masking the phylogenetic signal (Baurain et al., 2007). To address this question, let us first perform a very rough quantification of the expected phylogenetic signal. As mentioned earlier, Felsenstein (1985) has shown that, given a perfect method, an internal branch needs at least three substitutions to be recovered with significant statistical support (BV > 95%). Assuming a rough molecular clock and a divergence of Bilateria around 550 MYa, this “theorem” allows us to compute the minimum time interval between two speciations (ΔTs) required to significantly recover the corresponding branch (i.e., the time to accumulate three substitutions).
For the slow-evolving rRNA (1000 positions), ΔTs is equal to 15 MY, thus indicating that we cannot test the Cambrian explosion with this marker. As expected, the value of ΔTs decreases sharply with phylogenomic data sets: 0.7 MY for 12,060 positions (Rokas et al., 2005) and 0.25 MY for 33,800 positions (Delsuc et al., 2006). Therefore, provided that tree reconstruction methods are efficient, a lack of resolution in phylogenomics would validate the Cambrian explosion hypothesis. In principle, the supermatrix of Rokas et al. (2005) has the potential to resolve the animal phylogeny. However, a comparison of support values obtained with MP and ML strongly suggests that nonphylogenetic signal plays a major role in the lack of resolution. Our reasoning assumes that the amount of genuine phylogenetic signal is insensitive to the reconstruction method. In Figure 2.1 of Rokas et al. (2005), BVs are much higher for ML than for MP (e.g., Bilateria monophyly is supported at 74% and 32%, respectively). Since MP is much less efficient in detecting multiple substitutions, a considerably larger amount of nonphylogenetic signal is anticipated. By counteracting the phylogenetic signal, this large nonphylogenetic signal should result in a very weak apparent signal (Figure 2.1a, left). In contrast, the apparent signal should be higher with ML, as the amount of nonphylogenetic signal due to undetected multiple substitutions would be smaller (Figure 2.1a, center). To further test this idea, we reanalyzed this data set using the CAT model (Figure 2.1a, right). As expected, the CAT tree (Figure 2.2) is much more resolved, with, for instance, BVs of 95% and 99% for the monophyly of Bilateria and protostomes, respectively. Moreover, Ecdysozoa are recovered (BV = 52%), whereas site-homogeneous models artifactually cluster fast-evolving nematodes and platyhelminthes (BV = 89%; Rokas et al., 2005). Such an increase in support indicates that apparent signal is much higher with CAT (Figure 2.1b). This is due to the improved interpretation of multiple substitutions, which leads to a serious decrease in nonphylogenetic signal, consistent with our hypothesis.

Figure 2.1 Effect of the reconstruction method on the amount of nonphylogenetic and apparent signal. (a) Cartoons depicting the opposite effects of phylogenetic and nonphylogenetic signals on the apparent signal under MP, ML, and CAT. 1, 2, and 3 are imaginary nodes (see text for details). (b) Bootstrap values (%) obtained after analysis of 100 replicates of the Rokas et al. (2005) supermatrix (12,060 amino acid positions from 32 species) using MP, ML, or the CAT model. Both ML (rtREV model) and CAT analyses included the RAS model. MP and ML values taken from Table S3 of Rokas et al. (2005). Bil: Bilateria; Cho: Chordates; Cni: Cnidaria; Ecd: Ecdysozoa; Lop: Lophotrochozoa; Por: Porifera; Pro: Protostomes.

Figure 2.2 Bayesian tree obtained after analysis of the Rokas et al. (2005) supermatrix under the CAT + Γ model. Outgroup (15 fungi) omitted for brevity. Resolution is largely improved and Ecdysozoa are recovered with BV = 52%. Support rises to 76% when ignoring priapulids (i.e., only considering arthropods and nematodes).

This brief analysis demonstrates that the amount of phylogenetic signal is not the limiting factor in phylogenomic supermatrices (ΔTs = 0.25 MY for 33,800 positions, at the edge of lineage sorting influences) and that accounting for nonphylogenetic signal is the major concern. In the case of Rokas et al. (2005), the combination of scarce taxon sampling with heterogeneous evolutionary rates and site-homogeneous models generates so much nonphylogenetic signal that the metazoan phylogeny artifactually appears unresolved. On the other hand, recent analyses with many species and better models yield a much more resolved animal tree (Dunn et al., 2008; Lartillot and Philippe, 2008), though open issues remain, such as deuterostome monophyly (e.g., echinoderms and chordates) or relationships among nonbilaterian animals and among Lophotrochozoa.
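The rough quantification of phylogenetic signal used in this section can be written as a one-line calculation. The sketch below assumes the simplest reading of the argument: under a molecular clock, an alignment of L positions evolving at r substitutions per site per MY accumulates r × L substitutions per MY, so the time needed to accumulate three substitutions is 3 / (r × L). The text does not state the rates that were used; the function `implied_rate` merely back-calculates the rate that each quoted value of the interval would imply under this reading, and both function names are hypothetical.

```python
def min_speciation_interval(n_positions, rate_per_site_per_my, n_substitutions=3):
    """Time (in MY) needed to accumulate `n_substitutions` on an internal branch.

    Assumes a strict clock: substitutions accumulate over the whole
    alignment at n_positions * rate_per_site_per_my per MY.
    """
    if n_positions <= 0 or rate_per_site_per_my <= 0:
        raise ValueError("positions and rate must be positive")
    return n_substitutions / (n_positions * rate_per_site_per_my)


def implied_rate(n_positions, interval_my, n_substitutions=3):
    """Per-site rate (substitutions/site/MY) implied by a given interval.

    This is only the inverse of `min_speciation_interval`; it is not a
    rate reported in the source.
    """
    if n_positions <= 0 or interval_my <= 0:
        raise ValueError("positions and interval must be positive")
    return n_substitutions / (n_positions * interval_my)
```

Under this reading, 15 MY for 1000 rRNA positions corresponds to an implied rate of 3 / (1000 × 15) = 2 × 10⁻⁴ substitutions per site per MY, while the two protein supermatrices (0.7 MY for 12,060 positions and 0.25 MY for 33,800 positions) both imply roughly 3.6 × 10⁻⁴. The sharp drop in the interval therefore comes mainly from the number of positions, not from a faster marker.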

2.7 CONCLUSION

While phylogenomic supermatrices hold the promise of resolving most nodes in the Tree of Life, the trouble caused by nonphylogenetic signal still undermines this exciting eventuality. In the coming years, however, steady improvements in models of sequence evolution combined with the richer taxon samplings enabled by the rise of new sequencing technologies are expected to seriously boost the apparent phylogenetic signal, which should provide robust statistical support for important questions presently unsettled. Owing to the incredible computational resources that they would require, ultrarealistic approaches modeling all the heterogeneities of the evolutionary process are likely to remain out of reach for a long time. Meanwhile, model violations will be efficiently alleviated through selective exclusion of rogue data (species, genes, or positions) prior to phylogenetic analysis. Another way of detecting nonphylogenetic signal in primary sequences is to look for corroboration from independent sources. To this end, a variety of whole genome approaches have been proposed, including gene content and gene order methods, as well as DNA string (or oligopeptide) distances (for reviews see Delsuc et al., 2005 and Philippe et al., 2005a). Although these methods share the convenient property of bypassing tedious multiple alignments, most still depend on correct orthology assessment and some have their own artifacts, such as the clustering of genomes having acquired a similar gene content by convergence (Lake and Rivera, 2004). Similar in spirit to the methodology used by comparative morphologists, rare genomic changes (e.g., gene fusion and fission events) are in principle a promising avenue of research and have been put to use on some notable occasions, for example, to pinpoint the root of the eukaryotic tree (Philippe et al., 2000; Stechmann and Cavalier-Smith, 2002). On the other hand, an emblematic gene split apparently linking green algae to alveolates has recently been shown to be convergent, thus demonstrating that even rare genomic changes may be affected by homoplasy (Waller and Keeling, 2006). Ultimately, it is nonetheless highly desirable to develop powerful and reliable alternatives to the use of primary sequences for phylogenetic inference. The advent of a trustworthy global picture of the Tree of Life hinges on that.

REFERENCES ABABNEH, F., JERMIIN, L.S., MA, C., and ROBINSON, J., 2006. Matched-pairs tests of homogeneity with applications to homologous nucleotide sequences. Bioinformatics 22: 1225–1231. ABASCAL, F., ZARDOYA, R., and POSADA, D., 2005. ProtTest: selection of best-ﬁt models of protein evolution. Bioinformatics 21: 2104–2105. ABOUHEIF, E., ZARDOYA, R., and MEYER, A., 1998. Limitations of metazoan 18S rRNA sequence data: implications for reconstructing a phylogeny of the animal kingdom and inferring the reality of the Cambrian explosion. J. Mol. Evol. 47: 394–405. ADOUTTE, A., BALAVOINE, G., LARTILLOT, N., LESPINET, O., PRUD’HOMME, B., and de ROSA, R., 2000. The new animal phylogeny: reliability and implications. Proc. Natl. Acad. Sci. USA 97: 4453–4456. AGUINALDO, A.M., TURBEVILLE, J.M., LINFORD, L.S., RIVERA, M.C., GAREY, J.R., RAFF, R.A., and LAKE, J.A., 1997. Evidence for a clade of nematodes, arthropods and other moulting animals. Nature 387: 489–493. AKAIKE, H., 1973. Information theory and an extension of the maximum likelihood principle. In Proceedings of the 2nd International Symposium on Information Theory (eds Petrov and Csaki). Akademiai Kiado, Budapest, pp. 267–281. ALBERT, V.A., 2006. Parsimony and phylogenetics in the genomic age. In Parsimony, Phylogeny, and Genomics (ed. V.A. Albert). Oxford University Press, USA, pp. 1–11. BAELE, G., RAES, J., Van de PEER, Y., and VANSTEELANDT, S., 2006. An improved statistical method for detecting heterotachy in nucleotide sequences. Mol. Biol. Evol. 23: 1397–1405. BASTOLLA, U., FARWER, J., KNAPP, E.W., and VENDRUSCOLO, M., 2001. How to guarantee optimal stability for most representative structures in the Protein Data Bank. Proteins 44: 79–96. BAUM, B.R., 1992. Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees. Taxon 41: 3–10. BAURAIN, D., BRINKMANN, H., and PHILIPPE, H., 2007. Lack of resolution in the animal phylogeny: closely spaced clado-

geneses or undetected systematic errors? Mol. Biol. Evol. 24: 6–9. BEIKO, R.G., HARLOW, T.J., and RAGAN, M.A., 2005. Highways of gene sharing in prokaryotes. Proc. Natl. Acad. Sci. USA 102: 14332–14337. BERNARDI, G., 1993. The vertebrate genome: isochores and evolution. Mol. Biol. Evol. 10: 186–204. BLAIR, J.E., IKEO, K., GOJOBORI, T., and HEDGES, S.B., 2002. The evolutionary position of nematodes. BMC Evol Biol 2: 7. BLANQUART, S. and LARTILLOT, N., 2006. A Bayesian compound stochastic process for modeling nonstationary and nonhomogeneous sequence evolution. Mol. Biol. Evol. 23: 2058–2071. BLANQUART, S. and LARTILLOT, N., 2008. A site- and timeheterogeneous model of amino acid replacement. Mol. Biol. Evol. 25: 842–858. BOURLAT, S.J., JULIUSDOTTIR, T., LOWE, C.J., FREEMAN, R., ARONOWICZ, J., KIRSCHNER, M., LANDER, E.S., THORNDYKE, M., NAKANO, H., KOHN, A.B., et al., 2006. Deuterostome phylogeny reveals monophyletic chordates and the new phylum Xenoturbellida. Nature 444: 85–88. BOUSSAU, B. and GOUY, M., 2006. Efﬁcient likelihood computations with nonreversible models of evolution. Syst. Biol. 55: 756–768. BOWKER, A.H., 1948. A test for symmetry in contingency tables. J. Am. Stat. Assoc. 43: 572–574. BRINKMANN, H., GIEZEN, M., ZHOU, Y., RAUCOURT, G.P., and PHILIPPE, H., 2005. An empirical assessment of long-branch attraction artefacts in deep eukaryotic phylogenomics. Syst. Biol. 54: 743–757. BRINKMANN, H. and PHILIPPE, H., 1999. Archaea sister group of Bacteria? Indications from tree reconstruction artifacts in ancient phylogenies. Mol. Biol. Evol. 16: 817–825. BROCHIER, C., BAPTESTE, E., MOREIRA, D., and PHILIPPE, H., 2002. Eubacterial phylogeny based on translational apparatus proteins. Trends Genet. 18: 1–5. BRUNO, W.J., 1996. Modeling residue usage in aligned protein sequences via maximum likelihood. Mol. Biol. Evol. 13: 1368–1374.

References BRUSCA, R.C. and BRUSCA, G.J., 1990. Invertebrates. Sinauer Associates, Inc., Sunderland, MA. BUDD, G.E. and JENSEN, S., 2000. A critical reappraisal of the fossil record of the bilaterian phyla. Biol. Rev. Camb. Philos. Soc. 75: 253–295. BURLEIGH, J.G. and MATHEWS, S., 2004. Phylogenetic signal in nucleotide data from seed plants: implications for resolving the seed plant tree of life. Am. J. Bot. 91: 1599–1613. CANBACK, B., TAMAS, I., and ANDERSSON, S.G., 2004. A phylogenomic study of endosymbiotic bacteria. Mol. Biol. Evol. 21: 1110–1122. CONWAY MORRIS, S., 2000. The Crucible of Creation: The Burgess Shale and the Rise of Animals. Oxford University Press, USA. DELSUC, F., SCALLY, M., MADSEN, O., STANHOPE, M.J., de JONG, W.W., CATZEFLIS, F.M., SPRINGER, M.S., and DOUZERY, E. J., 2002. Molecular phylogeny of living xenarthrans and the impact of character and taxon sampling on the placental tree rooting. Mol. Biol. Evol. 19: 1656–1671. DELSUC, F., PHILLIPS, M.J., and PENNY, D., 2003. Comment on “Hexapod origins: monophyletic or paraphyletic?” Science 301: 1482; author reply 1482. DELSUC, F., BRINKMANN, H., and PHILIPPE, H., 2005. Phylogenomics and the reconstruction of the tree of life. Nat. Rev. Genet. 6: 361–375. DELSUC, F., BRINKMANN, H., CHOURROUT, D., and PHILIPPE, H., 2006. Tunicates and not cephalochordates are the closest living relatives of vertebrates. Nature 439: 965–968. DELSUC, F., TSAGKOGEORGA, G., LARTILLOT, N., and PHILIPPE, H., 2008. Additional molecular support for the new chordate phylogeny. Genesis 46: 592–604. DOPAZO, H. and DOPAZO, J., 2005. Genome-scale evidence of the nematode-arthropod clade. Genome Biol. 6: R41. DOPAZO, H., SANTOYO, J., and DOPAZO, J., 2004. Phylogenomics and the number of characters required for obtaining an accurate phylogeny of eukaryote model species. Bioinformatics 20: i116–i121. DUNN, C.W., HEJNOL, A., MATUS, D.Q., PANG, K., BROWNE, W. 
E., SMITH, S.A., SEAVER, E., ROUSE, G.W., OBST, M., EDGECOMBE, G.D., et al. 2008. Broad phylogenomic sampling improves resolution of the animal tree of life. Nature 452: 745–749. EMBLEY, T.M., THOMAS, R.H., and WILLIAMS, R.A.D., 1993. Reduced thermophilic bias in the 16S rDNA sequence from Thermus ruber provides further support for a relationship between Thermus and Deinococcus. Syst. Appl. Microbiol. 16: 25–29. FELSENSTEIN, J., 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27: 401–410. FELSENSTEIN, J., 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17: 368–376. FELSENSTEIN, J., 1985. Conﬁdence limits on phylogenies: an approach using the bootstrap. Evolution 40: 783–791. FELSENSTEIN, J., 2004. Inferring Phylogenies. Sinauer Associates, Inc., Sunderland, MA.


FELSENSTEIN, J. and CHURCHILL, G.A., 1996. A Hidden Markov Model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13: 93–104. FERGUSON, T.S., 1973. A Bayesian analysis of some nonparametric problems. Ann. Stat. 1: 209–230. FIELD, K.G., OLSEN, G.J., LANE, D.J., GIOVANNONI, S.J., GHISELIN, M.T., RAFF, E.C., PACE, N.R., and RAFF, R.A., 1988. Molecular phylogeny of the animal kingdom. Science 239: 748–753. FITCH, W.M., 1970. Distinguishing homologous from analogous proteins. Syst. Zool. 19: 99–113. FITCH, W.M., 1979. Cautionary remarks on using gene expression events in parsimony procedures. Syst. Zool. 28: 375–379. FITCH, W.M. and MARGOLIASH, E., 1967. Construction of phylogenetic trees. Science 155: 279–284. FITCH, W.M. and MARKOWITZ, E., 1970. An improved method for determining codon variability in a gene and its application to the rate of ﬁxation of mutations in evolution. Biochem. Genet. 4: 579–593. FOSTER, P.G., 2004. Modeling compositional heterogeneity. Syst. Biol. 53: 485–495. FOSTER, P.G. and HICKEY, D.A., 1999. Compositional bias may affect both DNA-based and protein-based phylogenetic reconstructions. J. Mol. Evol. 48: 284–290. FOSTER, P.G., JERMIIN, L.S., and HICKEY, D.A., 1997. Nucleotide composition bias affects amino acid content in proteins coded by animal mitochondria. J. Mol. Evol. 44: 282–288. GALTIER, N., 2001. Maximum-likelihood phylogenetic analysis under a covarion-like model. Mol. Biol. Evol. 18: 866–873. GALTIER, N. and GOUY, M., 1995. Inferring phylogenies from DNA sequences of unequal base compositions. Proc. Natl. Acad. Sci. USA 92: 11317–11321. GALTIER, N. and GOUY, M., 1998. Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Mol. Biol. Evol. 15: 871–879. GEE, H., 2003. Evolution: ending incongruence. Nature 425: 782. GERMOT, A. and PHILIPPE, H., 1999. 
Critical analysis of eukaryotic phylogeny: a case study based on the HSP70 family. J. Eukaryot. Microbiol. 46: 116–124. GIBSON, A., GOWRI-SHANKAR, V., HIGGS, P.G., and RATTRAY, M., 2005. A comprehensive analysis of mammalian mitochondrial genome base composition and improved phylogenetic methods. Mol. Biol. Evol. 22: 251–264. GOLDMAN, N. and YANG, Z., 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11: 725–736. GOWRI-SHANKAR, V. and RATTRAY, M., 2007. A reversible jump method for Bayesian phylogenetic inference with a nonhomogeneous substitution model. Mol. Biol. Evol. 24: 1286–1299. HENDY, M.D. and PENNY, D., 1989. A framework for the quantitative study of evolutionary trees. Syst. Zool. 38: 297–309.


HENNIG, W., 1966. Phylogenetic Systematics. University of Illinois Press, Urbana, IL. HILLIS, D.M., 1996. Inferring complex phylogenies. Nature 383: 130–131. HO, S.Y. and JERMIIN, L., 2004. Tracing the decay of the historical signal in biological sequence data. Syst. Biol. 53: 623–637. HO, J.W., ADAMS, C.E., LEW, J.B., MATTHEWS, T.J., NG, C.C., SHAHABI-SIRJANI, A., TAN, L.H., ZHAO, Y., EASTEAL, S., WILSON, S.R., et al., 2006. SeqVis: visualization of compositional heterogeneity in large alignments of nucleotides. Bioinformatics 22: 2162–2163. HRDY, I., HIRT, R.P., DOLEZAL, P., BARDONOVA, L., FOSTER, P. G., TACHEZY, J., and EMBLEY, T.M., 2004. Trichomonas hydrogenosomes contain the NADH dehydrogenase module of mitochondrial complex I. Nature 432: 618–622. HUELSENBECK, J.P., 1991. When are fossils better than extant taxa in phylogenetic analysis? Syst. Zool. 40: 458–469. HUELSENBECK, J.P., 2002. Testing a covariotide model of DNA substitution. Mol. Biol. Evol. 19: 698–707. HUELSENBECK, J.P., LARGET, B., and SWOFFORD, D., 2000. A compound poisson process for relaxing the molecular clock. Genetics 154: 1879–1892. HUELSENBECK, J.P., JAIN, S., FROST, S.W., and POND, S.L., 2006. A Dirichlet process model for detecting positive selection in protein-coding DNA sequences. Proc. Natl. Acad. Sci. USA 103: 6263–6268. INAGAKI, Y., SUSKO, E., FAST, N.M., and ROGER, A.J., 2004. Covarion shifts cause a long-branch attraction artifact that unites Microsporidia and Archaebacteria in EF-1a phylogenies. Mol. Biol. Evol. 21: 1340–1349. JEFFROY, O., BRINKMANN, H., DELSUC, F., and PHILIPPE, H., 2006. Phylogenomics: the beginning of incongruence? Trends Genet. 22: 225–231. JENSEN, J.L. and PEDERSEN, A.-M.K., 2000. Probabilistic models of DNA sequence evolution with context dependent rates of substitution. Adv. Appl. Probab. 32: 499–517. JERMIIN, L., HO, S.Y., ABABNEH, F., ROBINSON, J., and LARKUM, A.W., 2004. 
The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated. Syst. Biol. 53: 638–643. JERMIIN, L.S., JAYASWAL, V., ABABNEH, F., and ROBINSON, J., 2008. Phylogenetic model evaluation. Methods Mol. Biol. 452: 331–364. JUKES, T.H. and CANTOR, C.R., 1969. Evolution of protein molecules. In Mammalian Protein Metabolism (ed. H.N. Munro). Academic Press, New York, pp. 21–132. JUKES, T.H. and BHUSHAN, V., 1986. Silent nucleotide substitutions and G + C content of some mitochondrial and bacterial genes. J. Mol. Evol. 24: 39–44. KLUGE, A.G., 1989. A concern for evidence and a phylogenetic hypothesis of relationships among Epicrates (Boidae, Serpentes). Syst. Zool. 38: 7–25. KOLACZKOWSKI, B. and THORNTON, J.W., 2004. Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous. Nature 431: 980–984.

KOLACZKOWSKI, B. and THORNTON, J.W., 2008. A mixed branch length model of heterotachy improves phylogenetic accuracy. Mol. Biol. Evol. 25: 1054–1066. LAKE, J.A., 1994. Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc. Natl. Acad. Sci. USA 91: 1455–1459. LAKE, J.A. and RIVERA, M.C., 2004. Deriving the genomic tree of life in the presence of horizontal gene transfer: conditioned reconstruction. Mol. Biol. Evol. 21: 681–690. LANAVE, C., PREPARATA, G., SACCONE, C., and SERIO, G., 1984. A new method for calculating evolutionary substitution rates. J. Mol. Evol. 20: 86–93. LARTILLOT, N. and PHILIPPE, H., 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol. Biol. Evol. 21: 1095–1109. LARTILLOT, N. and PHILIPPE, H., 2008. Improvement of molecular phylogenetic inference and the phylogeny of Bilateria. Philos. Trans. R Soc. Lond. B Biol. Sci. 363: 1463–1472. LARTILLOT, N., BRINKMANN, H., and PHILIPPE, H., 2007. Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model. BMC Evol. Biol. 7 (Suppl 1): S4. LECOINTRE, G., PHILIPPE, H., Van LE, H.L., and Le GUYADER, H., 1993. Species sampling has a major impact on phylogenetic inference. Mol. Phylogenet. Evol. 2: 205–224. LEIGH, J.W., SUSKO, E., BAUMGARTNER, M., and ROGER, A.J., 2008. Testing congruence in phylogenomic analysis. Syst. Biol. 57: 104–115. LEMMON, A.R. and MORIARTY, E.C., 2004. The importance of proper model assumption in Bayesian phylogenetics. Syst. Biol. 53: 265–277. LIU, F.G., MIYAMOTO, M.M., FREIRE, N.P., ONG, P.Q., TENNANT, M.R., YOUNG, T.S., and GUGEL, K.F., 2001. Molecular and morphological supertrees for eutherian (placental) mammals. Science 291: 1786–1789. LOCKHART, P.J., HOWE, C.J., BRYANT, D.A., BEANLAND, T.J., and LARKUM, A.W., 1992. Substitutional bias confounds inference of cyanelle origins from sequence data. J. Mol. Evol. 34: 153–162. 
LOCKHART, P., STEEL, M., HENDY, M., and PENNY, D., 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol. 11: 605–612. LOCKHART, P.J., LARKUM, A.W., STEEL, M., WADDELL, P.J., and PENNY, D., 1996. Evolution of chlorophyll and bacteriochlorophyll: the problem of invariant sites in sequence analysis. Proc. Natl. Acad. Sci. USA 93: 1930–1934. LOCKHART, P.J., STEEL, M.A., BARBROOK, A.C., HUSON, D., CHARLESTON, M.A., and HOWE, C.J., 1998. A covariotide model explains apparent phylogenetic structure of oxygenic photosynthetic lineages. Mol. Biol. Evol. 15: 1183–1188. LOPEZ, P., FORTERRE, P., and PHILIPPE, H., 1999. The root of the tree of life in the light of the covarion model. J. Mol. Evol. 49: 496–508.

LOPEZ, P., CASANE, D., and PHILIPPE, H., 2002. Heterotachy, an important process of protein evolution. Mol. Biol. Evol. 19: 1–7. MADDISON, W.P., 1997. Gene trees in species trees. Syst. Biol. 46: 523–536. MATUS, D.Q., COPLEY, R.R., DUNN, C.W., HEJNOL, A., ECCLESTON, H., HALANYCH, K.M., MARTINDALE, M.Q., and TELFORD, M.J., 2006. Broad taxon and gene sampling indicate that chaetognaths are protostomes. Curr. Biol. 16: R575–576. MAYROSE, I., FRIEDMAN, N., and PUPKO, T., 2005. A Gamma mixture model better accounts for among site rate heterogeneity. Bioinformatics 21 (Suppl 2): ii151–ii158. MISOF, B., ANDERSON, C.L., BUCKLEY, T.R., ERPENBECK, D., RICKERT, A., and MISOF, K., 2002. An empirical analysis of mt 16S rRNA covarion-like evolution in insects: site-specific rate variation is clustered and frequently detected. J. Mol. Evol. 55: 460–469. MIYAMOTO, M.M. and FITCH, W.M., 1996. Constraints on protein evolution and the age of the eubacteria/eukaryote split. Syst. Biol. 45: 568–575. MIYAZAWA, S. and JERNIGAN, R.L., 1985. Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules 18: 534–552. MONTERO, L.M., SALINAS, J., MATASSI, G., and BERNARDI, G., 1990. Gene distribution and isochore organization in the nuclear genome of plants. Nucleic Acids Res. 18: 1859–1867. MOOERS, A.O. and HOLMES, E.C., 2000. The evolution of base composition and phylogenetic inference. Trends Ecol. Evol. 15: 365–369. MOREIRA, D., KERVESTIN, S., JEAN-JEAN, O., and PHILIPPE, H., 2002. Evolution of eukaryotic translation elongation and termination factors: variations of evolutionary rate and genetic code deviations. Mol. Biol. Evol. 19: 189–200. MUSE, S.V. and GAUT, B.S., 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol. Biol. Evol. 11: 715–724. NOVACEK, M.J., 1992.
Fossils, topologies, missing data, and the higher level phylogeny of eutherian mammals. Syst. Biol. 41: 58–73. PEDERSEN, A.-M.K. and JENSEN, J.L., 2001. A dependent-rates model and an MCMC-based methodology for the maximum-likelihood analysis of sequences with overlapping reading frames. Mol. Biol. Evol. 18: 763–776. PENNY, D., MCCOMISH, B.J., CHARLESTON, M.A., and HENDY, M.D., 2001. Mathematical elegance with biochemical realism: the covarion model of molecular evolution. J. Mol. Evol. 53: 711–723. PHILIP, G.K., CREEVEY, C.J., and MCINERNEY, J.O., 2005. The Opisthokonta and the Ecdysozoa may not be clades: stronger support for the grouping of plant and animal than for animal and fungi and stronger support for the Coelomata than Ecdysozoa. Mol. Biol. Evol. 22: 1175–1184.


PHILIPPE, H., 2008. Less is more: decreasing the number of scientific conferences to promote economic degrowth. Trends Genet. 24: 265–267. PHILIPPE, H. and GERMOT, A., 2000. Phylogeny of eukaryotes based on ribosomal RNA: long-branch attraction and models of sequence evolution. Mol. Biol. Evol. 17: 830–834. PHILIPPE, H. and LOPEZ, P., 2001. On the conservation of protein sequences in evolution. Trends Biochem. Sci. 26: 414–416. PHILIPPE, H. and TELFORD, M.J., 2006. Large-scale sequencing and the new animal phylogeny. Trends Ecol. Evol. 21: 614–620. PHILIPPE, H., CHENUIL, A., and ADOUTTE, A., 1994a. Can the Cambrian explosion be inferred through molecular phylogeny? Development 120: S15–S25. PHILIPPE, H., SÖRHANNUS, U., BAROIN, A., PERASSO, R., GASSE, F., and ADOUTTE, A., 1994b. Comparison of molecular and paleontological data in diatoms suggests a major gap in the fossil record. J. Evol. Biol. 7: 247–265. PHILIPPE, H., LOPEZ, P., BRINKMANN, H., BUDIN, K., GERMOT, A., LAURENT, J., MOREIRA, D., MULLER, M., and Le GUYADER, H., 2000. Early-branching or fast-evolving eukaryotes? An answer based on slowly evolving positions. Proc. R. Soc. Lond. Biol. Sci. 267: 1213–1221. PHILIPPE, H., SNELL, E.A., BAPTESTE, E., LOPEZ, P., HOLLAND, P.W., and CASANE, D., 2004. Phylogenomics of eukaryotes: impact of missing data on large alignments. Mol. Biol. Evol. 21: 1740–1752. PHILIPPE, H., DELSUC, F., BRINKMANN, H., and LARTILLOT, N., 2005a. Phylogenomics. Annu. Rev. Ecol. Evol. Syst. 36: 541–562. PHILIPPE, H., LARTILLOT, N., and BRINKMANN, H., 2005b. Multigene analyses of bilaterian animals corroborate the monophyly of Ecdysozoa, Lophotrochozoa, and Protostomia. Mol. Biol. Evol. 22: 1246–1253. PHILIPPE, H., BRINKMANN, H., MARTINEZ, P., RIUTORT, M., and BAGUNA, J., 2007. Acoel flatworms are not platyhelminthes: evidence from phylogenomics. PLoS ONE 2: e717. PHILLIPS, M.J. and PENNY, D., 2003. The root of the mammalian tree inferred from whole mitochondrial genomes. Mol. Phylogenet. Evol. 28: 171–185. PHILLIPS, M.J., DELSUC, F., and PENNY, D., 2004. Genome-scale phylogeny and the detection of systematic biases. Mol. Biol. Evol. 21: 1455–1458. PISANI, D., 2004. Identifying and removing fast-evolving sites using compatibility analysis: an example from the Arthropoda. Syst. Biol. 53: 978–989. POSADA, D. and CRANDALL, K.A., 1998. MODELTEST: testing the model of DNA substitution. Bioinformatics 14: 817–818. QIU, Y.-L. and ESTABROOK, G.F., 2008. Inference of phylogenetic relationships among key angiosperm lineages using a compatibility method on a molecular data set. J. Syst. Evol. 46: 130–141. RAGAN, M.A., 1992. Phylogenetic inference based on matrix representation of trees. Mol. Phylogenet. Evol. 1: 53–58.


ROBINSON, D.M., JONES, D.T., KISHINO, H., GOLDMAN, N., and THORNE, J.L., 2003. Protein evolution with dependence among codons due to tertiary structure. Mol. Biol. Evol. 20: 1692–1704. RODRIGUE, N., LARTILLOT, N., BRYANT, D., and PHILIPPE, H., 2005. Site interdependence attributed to tertiary structure in amino acid sequence evolution. Gene 347: 207–217. RODRIGUE, N., PHILIPPE, H., and LARTILLOT, N., 2006. Assessing site-interdependent phylogenetic models of sequence evolution. Mol. Biol. Evol. 23: 1762–1775. RODRIGUE, N., LARTILLOT, N., and PHILIPPE, H., 2008. Bayesian comparisons of codon substitution models. Genetics 180: 1579–1591. RODRIGUEZ-EZPELETA, N., BRINKMANN, H., BURGER, G., ROGER, A.J., GRAY, M.W., PHILIPPE, H., and LANG, B.F., 2007a. Toward resolving the eukaryotic tree: the phylogenetic positions of jakobids and cercozoans. Curr. Biol. 17: 1420–1425. RODRIGUEZ-EZPELETA, N., BRINKMANN, H., ROURE, B., LARTILLOT, N., LANG, B.F., and PHILIPPE, H., 2007b. Detecting and overcoming systematic errors in genome-scale phylogenies. Syst. Biol. 56: 389–399. ROKAS, A., WILLIAMS, B.L., KING, N., and CARROLL, S.B., 2003. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425: 798–804. ROKAS, A., KRUGER, D., and CARROLL, S.B., 2005. Animal evolution and the molecular signature of radiations compressed in time. Science 310: 1933–1938. ROURE, B., RODRIGUEZ-EZPELETA, N., and PHILIPPE, H., 2007. SCaFoS: a tool for selection, concatenation and fusion of sequences for phylogenomics. BMC Evol. Biol. 7 (Suppl 1): S2. RUIZ-TRILLO, I., RIUTORT, M., LITTLEWOOD, D.T., HERNIOU, E. A., and BAGUNA, J., 1999. Acoel ﬂatworms: earliest extant bilaterian Metazoans, not members of Platyhelminthes. Science 283: 1919–1923. SALTER, L.A., 2001. Complexity of the likelihood surface for a large DNA dataset. Syst. Biol. 50: 970–978. SANDERSON, M.J. and SHAFFER, H.B., 2002. Troubleshooting molecular phylogenetic analyses. Annu. Rev. Ecol. Syst. 33: 49–72. 
SCHWARZ, G., 1978. Estimating the dimension of a model. Ann. Stat. 6: 461–464. SIEPEL, A. and HAUSSLER, D., 2004. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol. Biol. Evol. 21: 468–488. SPENCER, M., SUSKO, E., and ROGER, A.J., 2005. Likelihood, parsimony, and heterogeneous evolution. Mol. Biol. Evol. 22: 1161–1164. STECHMANN, A. and CAVALIER-SMITH, T., 2002. Rooting the eukaryote tree by using a derived gene fusion. Science 297: 89–91. STEEL, M., 2005. Should phylogenetic models be trying to ‘ﬁt an elephant’? Trends Genet. 21: 307–309. STUART, A., 1955. A test for homogeneity of the marginal distributions in a two-way classiﬁcation. Biometrika 42: 412–416.

SULLIVAN, J. and JOYCE, P., 2005. Model selection in phylogenetics. Annu. Rev. Ecol. Evol. Syst. 36: 445–466. SUSKO, E., SPENCER, M., and ROGER, A.J., 2005. Biases in phylogenetic estimation can be caused by random sequence segments. J. Mol. Evol. 61: 351–359. SWOFFORD, D.L., OLSEN, G.J., WADDELL, P.J., and HILLIS, D.M., 1996. Phylogenetic inference. In Molecular Systematics (eds D.M. Hillis, C. Moritz, and B.K. Mable). Sinauer Associates, Inc., Sunderland, MA, pp. 407–514. TAVARÉ, S., 1986. Some probabilistic and statistical problems in the analysis of DNA sequences. Lect. Math. Life Sci. 17: 57–86. THORNE, J.L. and GOLDMAN, N., 2003. Probabilistic models for the study of protein evolution. In Handbook of Statistical Genetics (eds D.J. Balding, M. Bishop, and C. Cannings). John Wiley & Sons, Ltd, Chichester, UK. pp. 209–226. TUFFLEY, C. and STEEL, M., 1998. Modeling the covarion hypothesis of nucleotide substitution. Math. Biosci. 147: 63–91. WALLER, R.F. and KEELING, P.J., 2006. Alveolate and chlorophycean mitochondrial cox2 genes split twice independently. Gene 383: 33–37. WANG, H.C., SPENCER, M., SUSKO, E., and ROGER, A.J., 2007. Testing for covarion-like evolution in protein sequences. Mol. Biol. Evol. 24: 294–305. WIENS, J.J., 1998. Does adding characters with missing data increase or decrease phylogenetic accuracy? Syst. Biol. 47: 625–640. WIENS, J.J., 2003. Missing data, incomplete taxa, and phylogenetic accuracy. Syst. Biol. 52: 528–538. WILKINSON, M. and BENTON, M.J., 1995. Missing data and rhynchosaur phylogeny. Hist. Biol. 10: 137–150. WOESE, C.R., ACHENBACH, L., ROUVIERE, P., and MANDELCO, L., 1991. Archaeal phylogeny: reexamination of the phylogenetic position of Archaeoglobus fulgidus in light of certain composition-induced artifacts. Syst. Appl. Microbiol. 14: 364–371. WOLF, Y.I., ROGOZIN, I.B., and KOONIN, E.V., 2004. Coelomata and not ecdysozoa: evidence from genome-wide phylogenetic analysis. Genome Res. 14: 29–36. YANG, Z., 1993.
Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10: 1396–1401. YANG, Z., 1996a. Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11: 367–370. YANG, Z., 1996b. Maximum-likelihood models for combined analyses of multiple sequence data. J. Mol. Evol. 42: 587–596. YANG, Z. and ROBERTS, D., 1995. On the use of nucleic acid sequences to infer early branchings in the tree of life. Mol. Biol. Evol. 12: 451–458. YANG, Z. and NIELSEN, R., 2008. Mutation-selection models of codon substitution and their use to estimate selective strengths on codon usage. Mol. Biol. Evol. 25: 568–579.

ZHOU, Y., RODRIGUE, N., LARTILLOT, N., and PHILIPPE, H., 2007. Evaluation of the models handling heterotachy in phylogenetic inference. BMC Evol. Biol. 7: 206. ZUCKERKANDL, E. and PAULING, L., 1965. Evolutionary divergence and convergence in proteins. In Evolving Genes and Proteins (eds V. Bryson and H.J. Vogel). Academic Press, New York. pp. 97–166. ZWICKL, D.J. and HILLIS, D.M., 2002. Increased taxon sampling greatly reduces phylogenetic error. Syst. Biol. 51: 588–598.

Chapter 3

The Universal Tree of Life and the Last Universal Cellular Ancestor: Revolution and Counterrevolutions

Patrick Forterre

3.1 INTRODUCTION
3.2 THE WOESIAN REVOLUTION
3.3 A RAMPANT “PROKARYOTIC” COUNTERREVOLUTION
3.4 HOW TO POLARIZE CHARACTERS WITHOUT A ROBUST ROOT?
3.5 THE HIDDEN ROOT: WHEN THE WEATHER BECAME CLOUDY
3.6 LUCA AND ITS COMPANIONS
3.7 THE PROBLEM OF HORIZONTAL GENE TRANSFER AND ANCIENT PHYLOGENIES: TREES VERSUS GENE WEBS
3.8 THE NATURE OF THE RNA WORLD
3.9 THE DNA REPLICATION PARADOX AND THE NATURE OF LUCA
3.10 WHEN VIRUSES FIND THEIR WAY INTO THE UNIVERSAL TREE OF LIFE
3.11 FUTURE DIRECTIONS

Evolutionary Genomics and Systems Biology, edited by Gustavo Caetano-Anolles Copyright 2010 John Wiley & Sons, Inc.

3.1 INTRODUCTION

The intellectual and technical advances associated with molecular biology during the last decades of the twentieth century opened, for the first time, the possibility of determining the evolutionary relationships between all living organisms. The design of elegant and efficient methods for sequencing macromolecules paved the way to tackling the old challenge posed by Darwin’s theory of evolution: to reconstruct the history of life on our planet. The dream started to become reality when Carl Woese, focusing on the translation apparatus, decided in the early 1970s to use the 16S rRNA as “the” universal molecular marker. This led Carl Woese and the Urbana School to the groundbreaking discovery that all cellular organisms previously grouped under the umbrella term “prokaryote” actually belong to two different evolutionary lineages (originally called eubacteria and archaebacteria) (for a historical account of this great scientific saga, see Woese (2007)). Woese and Fox (1977a) then suggested replacing the prokaryote/eukaryote dichotomy based on cellular structure (the presence or absence of a nucleus) with a trinity based on molecular structure (the ribosome and its macromolecules) and underlying sequences (Figure 3.1a).

The trinity concept opened a Pandora’s box of new interpretations of early cellular evolution. Whereas the pre-Woesian era was dominated by common thinking, with nearly all evolutionists working in the framework of the prokaryote/eukaryote dichotomy, the post-Woesian era has been dominated by vivid conflicts between opposing views of early cellular evolution, with nearly as many theories as evolutionists. For instance, whereas Woese and Fox (1977b) proposed a very simple “preprokaryotic” last common ancestor to all modern cells, the progenote (the historical ancestor of the communal last universal cellular ancestor (LUCA) now favored by Woese, see below), several authors, including myself, have proposed instead a more complex, possibly “eukaryotic-like” ancestor (Forterre, 1995; Forterre and Philippe, 1999; Fuerst, 2005; Glansdorff, 2000; Glansdorff et al., 2008; Kurland et al., 2006; Poole et al., 1999; see Chapter 17) (Figure 3.1b), whereas still others maintained the idea of a bacterial ancestor (Cavalier-Smith, 2002, Figure 3.1c) or even suggested an archaeal ancestor (Wang et al., 2007).

The origin of eukaryotes also became a new open and controversial question (see Chapter 4). For Woese and Fox (1977a), the three primary lineages (called urkingdoms at that time) evolved more or less independently from the progenote. In this scenario, the “prokaryotic state” was reached three times independently, once in each of the three lineages, and the “eukaryotic stage” was reached in only one of them (the urkaryotic lineage, sensu Woese) (Figure 3.1a). In contrast, the proponents of a protoeukaryotic LUCA suggested that several complex eukaryotic features were ancestral and that the “prokaryotic” state evolved from the eukaryotic one by streamlining or reduction (Figure 3.1b). Others suggested various scenarios in which eukaryotes emerged from the association of a bacterium and an archaeon (Lopez-Garcia and Moreira, 1999 and references therein for earlier suggestions; Rivera and Lake, 2004 for the “ring of life” scenario) (Figure 3.1d).

Figure 3.1 Four ancient and alternative versions of the universal tree of life proposed early on to take into account the discovery of archaea (see text for explanation). The old nomenclature (archaebacteria, eubacteria, eukaryotes) is used to fit the age of these proposals. The old prokaryote/eukaryote dichotomy is maintained in all of them, with prokaryotes as primitive organisms in a, c, and d.

The introduction of the trinity concept in biology by Carl Woese is often referred to as the Woesian revolution. As known from history, there is no revolution without counterrevolution. In this chapter, I will adopt a historical perspective in an attempt to untangle the various revolutionary and counterrevolutionary aspects of the discussions about the universal tree of life, and propose my own interpretation of the subjective considerations that led scientists to favor particular topologies for the universal tree. My main thesis is that many questions about this topology remain unsolved and that it is worthwhile to continue tackling them by combining traditional approaches (molecular phylogenetic analyses, comparative biochemistry) with an aggressive search for new data, that is, new groups of organisms, both viral and cellular. An important part of this review will also be devoted to the problem of the place that viruses should occupy in the universal tree of life.
Recent data and new hypotheses have indeed put viruses for the first time in a central position in various scenarios for early life evolution. This could finally lead to a grand unification of all living organisms under a single conceptual framework, an exciting perspective for the new century.

3.2 THE WOESIAN REVOLUTION

A formidable consequence of the Woesian revolution was to shake previous conceptions about early cellular evolution. Before this revolution, the prevailing idea was that all cellular life started with something looking like a simple bacterium (possibly related to a mycoplasma) that progressively evolved toward more complex and diversified bacteria. Then “lower eukaryotes” appeared by the association of several differentiated bacteria and finally evolved into “higher eukaryotes,” up to the appearance of the brain! In the 1970s, this new version of Aristotle’s scala naturae was still recent and had been happily adopted by the pioneers of the molecular biology revolution, who were then active proponents of the prokaryote/eukaryote dichotomy (prokaryote meaning before the nucleus) (for a historical account of the prokaryote nomenclature, see Sapp (2005)). This partly explains why the Woesian revolution was initially ignored by most molecular biologists and why its importance is still not fully appreciated today. Many of the pioneers of molecular biology had been trained in physics or chemistry and were not aware of what was going on in evolutionary biology. This situation has changed only slightly, and terms such as “lower” or “higher” eukaryotes are still widely used in journals such as Nature, Cell, or Science.

Whereas most mainstream molecular biologists originally simply ignored the Woesian revolution (with a few important exceptions, such as Wolfram Zillig, who actively promoted the study of archaea very early on) (Zillig, 1991), the fierce resistance to the trinity concept came mainly from some evolutionists who, to this day, still favor the prokaryote/eukaryote dichotomy (hence the continuous debate around the use of the term prokaryote itself (see Pace (2006) and the reply by Martin and Koonin (2006)) and about its meaning (see the debate between Carl Woese and Ernst Mayr; Woese, 1998 and references therein)). An extreme view has been taken by Cavalier-Smith (2002), who still thinks that all life originated from a bacterium and that archaea originated much later on, from Gram-positive bacteria, a scenario that clearly introduces a new version of the scala naturae in which archaea replace the “lower eukaryotes” (Figure 3.1c). Some of these scenarios can easily be dismissed; for example, the grouping of archaea with Gram-positive bacteria (in fact, Firmicutes, one of the 20–30 or so bacterial phyla), originally observed in a few protein trees, is not supported by global comparative genomic analyses, which clearly indicate that archaea are monophyletic and that core archaeal genes are very divergent from all their bacterial homologues (when these homologues exist at all) (for review see Forterre et al. (2007a) and Yutin et al. (2008)).

The prokaryote/eukaryote dichotomy is also maintained in recently popular views in which eukaryotes originated secondarily from the association of archaea and bacteria. In these “ring of life” scenarios (or complicated conventional scenarios, in the terminology of Chapter 17), the original trinity concept, in which the three domains have an equal evolutionary status, is replaced by a new dichotomy between two primary domains that originated first (usually still named archaebacteria and eubacteria by the authors who favor such scenarios) and a secondary (more evolved) domain that originated later on (eukaryotes) (Figure 3.1d).
However, the “ring of life” scenarios do not explain why there are three basic versions (and not two) of nearly all universally conserved proteins, testifying to an old ancestral protoeukaryotic lineage (Woese and Fox, 1977a; discussed in Forterre (2006a)). The hypothesis of an old ancestral protoeukaryotic lineage also better explains the existence of many eukaryotic-specific proteins, protein domains, and molecular mechanisms (Caetano-Anolles and Caetano-Anolles, 2005; Caetano-Anolles, 2002; Ciccarelli et al., 2006; Collins and Penny, 2005; Hartman and Fedorov, 2002; Kurland et al., 2006; Poole and Penny, 2007; Wong et al., 2007; Yutin et al., 2008; for specific criticisms of “ring of life” scenarios, see Forterre (2006a), Kurland et al. (2006), and Poole and Penny (2007), as well as Chapter 17). In my opinion, all views that aim at reinstalling the prokaryote/eukaryote dichotomy at the center of evolutionary thinking are clearly counterrevolutionary, and they originate from a gradist-oriented view in which eukaryotes are at the top of the scale. This does not make these views incorrect by definition (the revolution might have taken the wrong path) but, in this case, they are not supported by comparative genomic analyses, which instead confirm the initial insight of Carl Woese and his followers.

Woese et al. (1990) proposed a new nomenclature for the three domains to get rid of the confusion that was still commonly made between archaebacteria and eubacteria (both “bacteria,” both “prokaryotes”). The replacement of the term archaebacteria by the term archaea was indeed justified by the fact that most biologists at that time (and still today) considered archaebacteria as simply “strange” (possibly old) bacteria (mostly extremophiles).
To support their new nomenclature (archaea, bacteria, and eukarya), Woese and his colleagues endorsed the rooting of the tree of life proposed independently in 1989 by two research groups (Gogarten et al., 1989; Iwabe et al., 1989). Iwabe and colleagues tentatively localized the root of the universal tree in the bacterial branch, from the phylogenetic analysis of paralogous proteins that diverged by gene duplication before the LUCA (for the origin of the LUCA acronym see http://www-archbac.u-psud.fr/Meetings/LesTreilles/Treilles_frm.html). By putting archaea and eukarya in the same clade, this rooting nicely justified breaking the last link remaining between archaea(bacteria) and bacteria (Figure 3.2a). Woese et al. (1990) thus used the figure of Iwabe’s tree to illustrate their seminal paper proposing the new nomenclature for the three domains, and Woese continued to use this tree later on to illustrate his papers on the nature of the universal ancestor (Woese, 2000, 2002).

Figure 3.2 Two visions of the same classical Woese tree: the traditional drawing (a) and a drawing (b) showing how this tree can be interpreted as supporting the old prokaryote/eukaryote dichotomy, with prokaryotes as primitive organisms.
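The logic of rooting the universal tree with anciently duplicated genes can be made concrete with a minimal sketch. It assumes that the gene tree is already known and is written as nested tuples whose deepest split is the pre-LUCA duplication, so that each paralog subtree acts as the outgroup of the other. The tip labels (`Bacteria_copy1` and so on), the function names, and the toy topology are all hypothetical and are not the data of Iwabe et al. (1989); the sketch only shows how a root position is read from such a tree, not whether real sequences support it.

```python
def domains(tree):
    """Collect the domain names (text before '_') of all tips under a node."""
    if isinstance(tree, str):
        return {tree.split("_")[0]}
    found = set()
    for child in tree:
        found |= domains(child)
    return found


def basal_domain(paralog_subtree):
    """Return the single domain that sits alone on one side of the basal split.

    Returns None when neither side of the split is made of one domain only.
    """
    left, right = paralog_subtree
    for side, other in ((left, right), (right, left)):
        here = domains(side)
        if len(here) == 1 and not here & domains(other):
            return next(iter(here))
    return None


def root_from_duplication(gene_tree):
    """Read the root position of each paralog subtree, using the other as outgroup."""
    copy1, copy2 = gene_tree  # assumption: the deepest split is the duplication
    return basal_domain(copy1), basal_domain(copy2)


# Hypothetical toy topology (not real data): both copies agree on the root.
toy_tree = (
    ("Bacteria_copy1", ("Archaea_copy1", "Eukarya_copy1")),
    ("Bacteria_copy2", ("Archaea_copy2", "Eukarya_copy2")),
)
print(root_from_duplication(toy_tree))
```

With this toy input both paralog subtrees place bacteria alone on one side of the basal split, which is what a root in the bacterial branch means. A `None` result would signal that a paralog subtree gives no clean answer.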

3.3 A RAMPANT “PROKARYOTIC” COUNTERREVOLUTION

Although the replacement of the term archaebacteria by the term archaea was a great step forward toward the elimination of the prokaryotic concept, a subtle perfume of counterrevolution was surreptitiously introduced by rooting the universal tree in the bacterial branch. In my opinion, the main reason for the success of the Iwabe/Gogarten tree is indeed that it can be easily interpreted in the framework of the prokaryote/eukaryote dichotomy. You just have to rotate the initial drawing by 90° and draw a line separating prokaryotes (below) from eukaryotes (on top) to put prokaryotes (even bacteria) at the base of the tree and eukaryotes (once again) at the top (Figure 3.2b). Amazingly, although Karl Stetter has been an early admirer and follower of Carl Woese, he redrew Iwabe’s tree, to illustrate the hypothesis of an ancestral origin for hyperthermophiles, in a way that also reminds us of the old dichotomy. In a recent tree drawn by Karl Stetter, the eukaryotes still appear in the upper part of the figure, dominating the two other domains (Stetter, 2006). Incidentally, in looking back at his early drawings, I noticed that Carl Woese himself tends to put the eukaryotes slightly above the two other domains in popular representations (Woese, 1981). The Iwabe/Gogarten tree (redrawn by Carl Woese) is still widely used today as “the” tree, reflecting the natural relationships between all living organisms, despite the fact that later studies have shown that the rooting in the bacterial branch is not really supported by the actual data (Forterre et al., 1993; Philippe and Forterre, 1999). One of the reasons is probably that the universal tree rooted in the bacterial branch apparently explains well why archaea exhibit many eukaryotic features at the molecular level (Olsen and Woese, 1997).
The bacterial rooting was immediately interpreted as meaning that these common archaeal/eukaryal traits appeared in the part of the tree exclusively shared by archaea and eukarya. In fact, these characters might just as well be ancestral characters, already present in the last universal cellular ancestor, that were lost or modified in the branch leading to bacteria (Figure 3.3). This possibility is usually not mentioned in studies dealing with early evolution because, for most biologists (again following Aristotle’s principle), eukaryotic traits should be, by principle, more evolved than prokaryotic ones (prokaryote meaning in this case bacteria). As a consequence of this prejudice, many biologists still automatically consider that features common to archaea and eukaryotes cannot be ancestral characters, but are necessarily shared derived characters (synapomorphies) testifying to the sisterhood of archaea and eukarya. This is not to say that they are not. It might be the case, but I think that we still don’t know. My feeling is that the prejudice viewing eukarya as the last step in the scala naturae has the deleterious effect of greatly limiting the willingness of evolutionists to consider the alternative possibility.

Finally, the Iwabe/Gogarten tree was welcomed by proponents of a hot origin of life and a hyperthermophilic LUCA because it gave the impression that its root was indeed populated by hyperthermophiles. This impression was reinforced in Stetter’s version of this tree, with lineages leading from the origin of life to hyperthermophiles drawn in bold lines and the two “prokaryotic domains” located at the bottom of the tree (Stetter, 2006). Many did not realize that the presence of a particular phenotype at the tip of a branch does not mean that the ancestors of this lineage exhibited the same one.

Figure 3.3 Two alternative (and opposite) possibilities to polarize a character common to archaea and eukarya in the framework of the same classical Woese tree: (a) a shared derived trait (synapomorphy) testifying to the sisterhood of archaea and eukarya and (b) a primitive trait lost in bacteria.
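The two polarizations contrasted in Figure 3.3 can be compared with a small parsimony count. The sketch below assumes a tree rooted in the bacterial branch, a single binary character (1 = present, 0 = absent) found in archaea and eukarya but not in bacteria, and equal costs for gains and losses. The node name `AE` (a hypothetical common ancestor of archaea and eukarya) and the function names are illustrative only. It enumerates every assignment of states to LUCA and to `AE` and keeps those requiring the fewest changes.

```python
from itertools import product

# Tree rooted in the bacterial branch: LUCA splits into Bacteria and AE,
# and AE (hypothetical archaea/eukarya ancestor) splits into Archaea and Eukarya.
EDGES = [
    ("LUCA", "Bacteria"),
    ("LUCA", "AE"),
    ("AE", "Archaea"),
    ("AE", "Eukarya"),
]

# Observed states of one binary character at the tips (1 = present, 0 = absent).
TIPS = {"Bacteria": 0, "Archaea": 1, "Eukarya": 1}


def count_changes(states):
    """Number of branches on which the character changes state."""
    return sum(states[parent] != states[child] for parent, child in EDGES)


def most_parsimonious(tips):
    """Return the minimal number of changes and every (LUCA, AE) assignment reaching it."""
    best, best_cost = [], None
    for luca, ae in product((0, 1), repeat=2):
        cost = count_changes(dict(tips, LUCA=luca, AE=ae))
        if best_cost is None or cost < best_cost:
            best, best_cost = [(luca, ae)], cost
        elif cost == best_cost:
            best.append((luca, ae))
    return best_cost, best


cost, scenarios = most_parsimonious(TIPS)
print(cost, scenarios)
```

Under these assumptions two assignments tie with a single change each: the trait is absent in LUCA and gained in the archaea/eukarya branch (the synapomorphy of Figure 3.3a), or present in LUCA and lost in the bacterial branch (Figure 3.3b). The count alone therefore cannot polarize the character, which is the point made in the text.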
Despite warnings that the position of hyperthermophiles in 16S rRNA-based trees is most likely strongly biased by the high GC content of hyperthermophiles (Forterre, 1996), despite phylogenetic data supporting a mesophilic LUCA (Galtier et al., 1999) or a mesophilic bacterial ancestor (Brochier and Philippe, 2002), and despite comparative genomics and molecular data showing that hyperthermophiles are not “primitive” organisms but highly evolved creatures (Forterre, 1996, 2002; Xu and Glansdorff, 2002), the apparent rooting of the Iwabe/Gogarten tree between hyperthermophilic archaea and bacteria is still widely used as an argument in favor of a hot origin of life. Recent data based on the tentative reconstruction of the nucleotide composition of ancestral rRNA and of the amino acid composition of ancestral proteins suggest that LUCA was a mesophile, but that the last common ancestors of archaea and of bacteria were both thermophiles (or hyperthermophiles) (Bousseau et al., 2008). This would agree with the “thermoreduction hypothesis” (Forterre, 1995), which postulated that genome streamlining and increased macromolecular turnover induced by adaptation to high temperature have been major forces in the selection of the “prokaryotic” phenotype. If true, this thermoreduction scenario can again be accommodated with any rooting of the tree of life, since it could have happened either only once (in a branch common to archaea and bacteria) or twice independently, in the two “prokaryotic” domains.

Figure 3.4 Four alternative scenarios showing that LUCA could have been “prokaryotic-like” (small circles) (a, d) or “eukaryotic-like” (small circle in a larger one) (b, c), whether the universal tree is rooted in the bacterial branch (a, b) or in the eukaryotic branch (c, d).

Amazingly, although acceptance of the bacterial rooting has clearly been helped by conscious or unconscious adherence to the eukaryote/prokaryote dichotomy and to the hypothesis of a hyperthermophilic LUCA, this rooting by itself does not imply a prokaryotic-like LUCA, as in Figure 3.1a. Indeed, one can easily imagine a “protoeukaryotic” LUCA leading twice independently to the “prokaryotic state” by streamlining (Figure 3.4b) (Penny and Poole, 1999). I was myself misled for a while by the idea that the bacterial rooting implies a “prokaryotic LUCA,” and this, together with my prejudice for a protoeukaryotic-like LUCA, explains why I scrutinized early on (using the cladistic approach) the data supporting this particular rooting. I thus observed that the data set used by Iwabe to draw the elongation factor tree completely lacked valid phylogenetic signal (Forterre et al., 1993; Forterre, 1997). Detailed phylogenetic analyses performed later in collaboration with Herve Philippe indicated that all data sets used by several authors to support the bacterial rooting were strongly biased by tree reconstruction artifacts (Lopez et al., 1999; Philippe and Forterre, 1999). Brinkman and Philippe (1999) even obtained some indication in favor of a eukaryotic rooting by analyzing slowly evolving positions in paralogous proteins, and the eukaryotic rooting is now favored by some analyses based on the distribution of protein folds in the three domains of life (Caetano-Anolles, 2002; Caetano-Anolles and Caetano-Anolles, 2005; Kurland et al., 2007; see Chapter 17). However, by the same reasoning, rooting the universal tree in the eukaryotic branch would not by itself favor a protoeukaryotic LUCA. In such a scenario, LUCA might have been either “eukaryotic-like” (implying a transition from the “eukaryotic state” to the “prokaryotic state” in the part of the branch common to archaea and bacteria) (Figure 3.4c) or “prokaryotic-like” (the transition from the “prokaryotic” to the “eukaryotic” state having taken place in the eukaryotic branch) (Figure 3.4d). Furthermore, the terms “eukaryotic-like” and “prokaryotic-like” are themselves misleading; they reflect the fact that we have difficulty imagining organisms very different from modern ones and that one always tries to reconstruct the past with concepts of the present. The confusion introduced by the eukaryote/prokaryote nomenclature quickly lets things get out of control. It is therefore clearer now that, even if one finally succeeds in obtaining convincing arguments in favor of a particular rooting, this will not solve the problem of the nature of LUCA, unless the root turns out to be within one of the three domains. However, this seems very unlikely, since it would imply that organisms of a particular modern lineage once had the ability to transform themselves into organisms of another domain. One can then wonder why such a transformation did not occur several times during the last two to three billion years of existence of the three domains, giving rise to a plethora of alternative domains. The same criticism can also be raised against the ring of life hypothesis, which involves the ancestral fusion of organisms resembling modern archaea and bacteria. Why did this take place only once? Why is it not taking place today?
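
The four scenarios of Figure 3.4 can be enumerated mechanically. The Python sketch below is only an illustration under simplifying assumptions that are not part of the text: each domain is scored with a single binary state ("P" for “prokaryotic-like”, "E" for “eukaryotic-like”), the ancestor of the two sister domains takes whichever state minimizes the number of changes, and the names `TIP_STATE`, `ROOTINGS`, and `min_changes` are hypothetical.

```python
from itertools import product

# Assumed binary coding of the three modern domains (illustrative only).
TIP_STATE = {"Bacteria": "P", "Archaea": "P", "Eukarya": "E"}

# Each rooting is described as (first-diverging domain, pair of sister domains).
ROOTINGS = {
    "bacterial": ("Bacteria", ("Archaea", "Eukarya")),
    "eukaryotic": ("Eukarya", ("Archaea", "Bacteria")),
}


def min_changes(rooting, luca_state):
    """Minimum number of state changes on a rooted three-domain tree,
    given the state assumed for LUCA at the root."""
    outgroup, sisters = ROOTINGS[rooting]
    best = None
    # Try both possible states for the ancestor of the two sister domains.
    for ancestor in "PE":
        changes = int(luca_state != TIP_STATE[outgroup])   # LUCA -> first domain
        changes += int(luca_state != ancestor)             # LUCA -> sister ancestor
        changes += sum(int(ancestor != TIP_STATE[d]) for d in sisters)
        best = changes if best is None else min(best, changes)
    return best


for rooting, luca in product(ROOTINGS, "PE"):
    print(f"{rooting:10s} rooting, LUCA={luca}: {min_changes(rooting, luca)} change(s)")
```

Under this coding, only the combination of a bacterial rooting with a eukaryotic-like LUCA requires two changes (the double streamlining of Figure 3.4b); the three other scenarios require one. A difference of a single step on one loosely defined character is hardly decisive, which is in line with the argument that the rooting alone does not settle the nature of LUCA.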

3.4 HOW TO POLARIZE CHARACTERS WITHOUT A ROBUST ROOT?

For me, the greatest challenge in determining the nature of LUCA is to polarize the characters that are common to archaea and eukaryotes. These characters are found not only in informational systems, as usually assumed, but also in fundamental “operational” processes such as energy production (ATP synthase) or protein secretion (the Sec systems, the signal recognition particle) (Gogarten et al., 1989; Bolhuis, 2004). We would like to know whether these are “shared derived traits,” as usually assumed (the conservative view), or ancestral (the revolutionary, but possibly wrong, view). To give an example, we would like to know whether the 33 ribosomal proteins specifically shared by archaea and eukarya (Lecompte et al., 2002) were present in LUCA and lost in bacteria, or whether they were added to the ribosome in a common lineage leading to archaea and eukarya. Another challenge is to determine whether eukaryotic-specific characters, such as the spliceosome, were present in LUCA and lost in both archaea and bacteria, or whether they appeared later in the eukaryotic lineage. In the absence of an outgroup for the universal tree of life, these are very difficult questions. It has been suggested to use a virtual cellular ancestor of the RNA/protein world to root the universal tree (Jeffares et al., 1998). In that case, relics of the RNA world in eukarya, such as the spliceosome, or common archaeal–eukaryal features related to RNA, such as the use of RNA guides in rRNA and tRNA methylation, could be considered ancestral features that were lost in bacteria. Unfortunately, such a polarization process rests mainly on subjective assumptions and, in my opinion, a convincing detailed analysis polarizing the history of a particular molecular machine (e.g., the ribosome) is still lacking.
A promising recent attempt to determine the nature of LUCA is based on the parsimony analysis of protein-fold distribution in the three domains. This analysis suggests a eukaryotic-like LUCA and again favors streamlining at the origin of archaea and bacteria (Kurland et al., 2007; see also Chapter 17). The presence of a particular fold in LUCA could be one way to polarize molecular features containing that fold. Finally, one should not forget the existence of characters shared by archaea and bacteria, such as circular genomes, ribosomal superoperons, or some cell division proteins. Again, we would like to know whether these are ancestral characters (present in LUCA) or whether they originated by convergent evolution. We have many pieces of a puzzle, but we still don’t know how to put them all together in a coherent way. It should be interesting to try.

3.5 THE HIDDEN ROOT: WHEN THE WEATHER BECAME CLOUDY

One way to get rid of the rooting and/or polarization problems is to dismiss the possibility of really rooting the tree or polarizing ancient characters by assuming that early evolution did not exhibit a tree-like structure. Such a scenario was proposed in recent years by Carl Woese himself, who suggested that organisms living before the divergence of the three domains coevolved by pervasively exchanging their genes and thus sharing innovations (Woese, 2000, 2002). In his view, it is not possible to draw a tree with well-defined lineages for this period of evolution. In that scenario, the modern domains emerged when barriers to gene transfer delineated a group of organisms that could still exchange genes extensively among themselves, but not with outside lineages (Figure 3.5a). From that moment, communal evolution was replaced by classical Darwinian evolution (hence the term Darwinian threshold coined by Carl Woese to label this transition).

Figure 3.5 (a) The scenario of communal LUCA cells, in which all cells freely exchanged their genes within a common gene pool. A domain segregated from the communal LUCA when a subset of cells became genetically isolated: they continued to exchange genes freely and frequently among themselves, but much more rarely with other cells still belonging to the common pool. (b) The root of the universal tree is hidden behind a cloud that could symbolize either our ignorance or a population of communal LUCA (adapted from Gary Olsen).

A problem with the notion of a Darwinian threshold is that it could give the impression that the mechanism of variation plus natural selection was not operational during the early steps of life. However, this powerful and unique (if somewhat trivial) mechanism is the only one known that can produce a continuous increase in complexity, making possible the emergence of structures such as the human brain in eukarya (Dawkins, 2006). There is no reason to abandon this powerful (although simple) mechanism when we need to explain the formidable transition from inanimate to animate matter (Forterre and Gribaldo, 2007). Darwinian evolution does not explain the emergence of the brain and will not explain the emergence of life, but it explains why these emergences have been possible. Of course, we still need to understand the physical constraints and principles that shaped these historical entities, as well as the particular chemical structures that were selected under these constraints (not a small task!). For instance, simulation experiments have shown that horizontal gene transfer (HGT) between ancient cells, by enabling the sharing of innovations, was probably a major factor in the emergence of an optimal genetic code (Vetsigian et al., 2006); but at the level of individuals, those with better codes (in terms of optimization to reduce the mutation load) were actually selected (produced more successful descendants) over cell populations with less optimal codes. In a recent version of the universal tree drawn by Gary Olsen for the anniversary of the discovery of archaea, the root of the universal tree was hidden behind a cloud (Figure 3.5b). This can mean either that we still do not know where the root is, or that the question is meaningless, according to the scenario of early communal evolution. An advantage of the cloud is that it somehow reconciles the nomenclature proposed by Woese et al. (1990) for the three domains with the rooting of the tree of life in the bacterial branch.
Indeed, whereas the three domains had a similar evolutionary status in the first trees drawn by Carl Woese in the early 1980s (they all emerged directly from the progenote), bacteria now have a special status in the Iwabe/Gogarten tree, since they emerge alone from LUCA, whereas archaea and eukarya emerged later from a common primordial lineage (Figure 3.2). Accordingly, following the now widely accepted principle of cladistic nomenclature, Woese and colleagues should normally have proposed a name for the clade grouping archaea and eukarya (whose status would have been equivalent to that of bacteria). They probably refrained from doing so to avoid introducing a new dichotomy and reducing the rank of archaea and eukarya compared to bacteria. However, in doing so, they adopted de facto a gradist viewpoint, suggesting that the three domains emerged from organisms of a “lower grade”. These “lower organisms” are now defined as those living before the Darwinian threshold, including the last common ancestor of archaea and eukarya. The problem with this view is that it risks eliminating a real question (how to polarize shared archaea/eukarya or archaea/bacteria characters) and finally giving the impression that the topology of the tree of life has been solved (closing Pandora’s box). Instead, I think this is still an open question. The communal LUCA hypothesis implies that HGT of “core” genes was indeed possible at an earlier stage of evolution, because information processing mechanisms were less precise than they are now, allowing easy protein exchange. There is presently no evidence for this idea, but it seems reasonable. In fact, most modern HGT events are products of viral activity, and we can confidently assume today that viruses are ancient and most likely predated LUCA (Bamford, 2003; Forterre, 2006b). One can therefore safely assume that virus-driven HGT also occurred extensively at the time of LUCA.
Virus-driven HGT between ancient cells might in fact have occurred long before LUCA, as soon as viruses emerged in a world of cells with RNA genomes. However, it is not clear why extensive HGT should have prevented a Darwinian type of evolution. Darwinian evolution implies variation followed by natural selection. As first noticed by Gary Olsen, a successful HGT for a particular organism (fixation of a new RNA or DNA gene) can be viewed as a variation, no different from a classical mutation (Gary Olsen, Crafoord lectures, 2003) (Figure 3.6). HGT events are also considered a peculiar type of mutation by Kurland and Berg (Chapter 17). If HGT was much more abundant at the time of LUCA (and before) and affected all types of proteins, this simply means that variations occurred much more rapidly and could have drastically changed any molecular feature of the organism in a short period of time. Accordingly, evolution was still Darwinian, but its tempo was much faster. The lifetime of any particular lineage was shorter than today, and new lineages emerged more rapidly. This does not prevent us from identifying, among the fast-evolving organisms of that time, a particular individual that was the LUCA (Figure 3.7), as long as the process of cell division was not blurred by cell fusion.

Figure 3.6 Variation followed by natural selection: (a) the variation is a product of lateral gene transfer; (b) the variation is a product of a mutation. The result is the same (adapted from a conference given by Gary Olsen for the 2003 Crafoord lectures in honor of Carl Woese).
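
The point made with Figure 3.6, that selection does not care how a variant arose, can be made concrete with a toy calculation. The sketch below is an assumption-laden illustration, not a model taken from the cited lectures: cells carry a single allele (0 or 1), the new allele 1 has a hypothetical relative fitness of 1 + `advantage`, and selection is deterministic and haploid. All function names are invented for this example.

```python
def select(freq, advantage, generations):
    """Deterministic haploid selection on a variant of relative fitness
    1 + advantage; returns its frequency after the given generations."""
    for _ in range(generations):
        freq = freq * (1 + advantage) / (1 + advantage * freq)
    return freq


def introduce_by_mutation(population, index):
    """Variation by mutation: one cell's own allele changes from 0 to 1."""
    new = list(population)
    new[index] = 1
    return new


def introduce_by_hgt(population, index, donor_allele):
    """Variation by HGT: one cell replaces its allele with a donor's copy."""
    new = list(population)
    new[index] = donor_allele
    return new


population = [0] * 1000                     # all recipient cells carry allele 0
after_mutation = introduce_by_mutation(population, 0)
after_hgt = introduce_by_hgt(population, 0, donor_allele=1)

for label, pop in (("mutation", after_mutation), ("HGT", after_hgt)):
    start = sum(pop) / len(pop)
    end = select(start, advantage=0.05, generations=300)
    print(f"{label:8s}: start {start:.3f} -> after selection {end:.3f}")
```

Once the variant is present, the two populations are identical and selection acts on them in exactly the same way. In this picture, more frequent HGT only means that the variation step is repeated more often per unit of time, which corresponds to a faster tempo rather than a different mode of evolution.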

Figure 3.7 LUCA (gray circle within an open circle) and its companions. Cells from now-extinct lineages are in white or dark gray. Cells from LUCA’s lineage are in gray. The gray arrows symbolize the three modern domains of life. The black arrows indicate two independent transfers of the same character from an extinct lineage to two modern lineages.

Cell fusion is a real issue in the question of LUCA. Indeed, if such a process occurred frequently among early cells, one cannot define a LUCA before the end of this process. It should be stressed, however, that real fusion between cells of different lineages is unknown in the modern cellular world (symbiosis cannot be equated with real fusion, since the two partners remain distinct) and that we have no evidence for it in the past. Modern endosymbiosis in eukaryotic cells does not originate from the fusion of a bacterium and a eukaryote, but from the engulfment of the bacterium by the eukaryotic cell, which remains a eukaryote (and this can probably be extrapolated to the origin of mitochondria and chloroplasts). There is therefore currently no good reason to think that cell fusion occurred frequently at the time of LUCA. Even if this had been the case, Darwinian evolution would still have operated on the organisms produced by these fusions.

3.6 LUCA AND ITS COMPANIONS

In recent years, many biologists have probably been attracted to the communal LUCA hypothesis in reaction to popular views of LUCA as the first cell or the “lonely cell” of its time. These are of course simplistic and wrong views! Like the African Eve, LUCA did not live alone but shared our planet with many contemporaries (possibly hundreds of thousands of Homo sapiens in the case of Eve, billions of primitive cells and viruses of various types in the case of LUCA). The only peculiar feature of LUCA was that its descendants have, in one way or another, eliminated the lineages of all of its contemporaries. If LUCA can be traced to an individual cell, it is scientifically sound to ask questions such as: was LUCA a “progenote”, a “prokaryote”, a “protoeukaryote”, or, more likely, something else? More precisely, as an example of a clear-cut question: was LUCA endowed with the 33 ribosomal proteins common to archaea and eukarya or not? An important point to keep in mind is that not all ancient traits found in modern organisms were necessarily present in LUCA. For a long time, the descendants of LUCA must have coexisted with those of now-extinct lineages. In particular, the ancestral lineages of modern domains might have inherited traits from these extinct lineages and not from LUCA (Figure 3.7). One can even imagine a case in which two domains independently inherited the same trait from a third, now-extinct lineage, giving the false impression that this character was present in the common ancestor of the two domains. This might be the case in particular for cellular genes of viral origin. For instance, bacteria and eukarya harbor a type II DNA topoisomerase of the same family (A) that was not present in the last common archaeal ancestor (Forterre et al., 2007b). One could conclude from this observation that the enzyme was present in LUCA and later lost in archaea.
However, the bacterial and eukaryotic versions of this enzyme are very divergent, and a third divergent version is encoded by bacterioviruses of the T4 superfamily. One cannot therefore exclude the possibility that the bacterial and eukaryotic versions of this particular topoisomerase were introduced independently by viruses into these two domains, after their separation from archaea but before their own diversification.
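
The opening point of this section, that a universal ancestor can be one cell among many contemporaries whose lineages later disappear, can be illustrated with a small simulation. The sketch below assumes a neutral Wright-Fisher population of constant size with asexual reproduction; neither assumption comes from the text, and the function name `surviving_founder` is hypothetical. Note that it identifies a founder that ends up as the ancestor of all cells, which is a common ancestor but not necessarily the most recent one.

```python
import random


def surviving_founder(n_cells, seed=0):
    """Neutral Wright-Fisher sketch: follow which founder each cell descends
    from until the descendants of a single founder fill the population."""
    rng = random.Random(seed)
    lineage = list(range(n_cells))    # generation 0: each cell is its own founder
    generations = 0
    while len(set(lineage)) > 1:
        # Every cell of the next generation copies the founder label of a
        # randomly chosen parent from the current generation.
        lineage = [rng.choice(lineage) for _ in range(n_cells)]
        generations += 1
    return lineage[0], generations


founder, generations = surviving_founder(n_cells=200)
print(f"founder {founder} is the ancestor of all cells after {generations} generations")
```

In this toy population, the surviving founder is not special in any way at generation 0: it shares the population with 199 contemporaries, and only in retrospect does it become the ancestor of everything alive.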

3.7 THE PROBLEM OF HORIZONTAL GENE TRANSFER AND ANCIENT PHYLOGENIES: TREES VERSUS GENE WEBS

The idea of a community of LUCA cells extensively exchanging their genes has clearly been influenced by the early outcome of comparative genomics, which showed that HGT can occur between domains more frequently than previously assumed. This even led to the idea that HGT has been so prevalent in the history of life that reconstructing the history of species from molecular phylogeny is impossible (Doolittle, 1999; Doolittle and Bapteste, 2007). For some authors, the notion of a tree is in itself misleading and should be replaced by a web of life (Dagan and Martin, 2006). These “microbialists,” as they call themselves, present this idea as novel (and revolutionary) and the tree concept favored by “positivists” as the conservative view. However, the idea that HGT could prevent the establishment of microbial phylogenies is not new. Prior to the work of Carl Woese and the Urbana School, some microbiologists believed that all bacteria on Earth possibly formed a single organism within which a continuum of genes was exchanged, which would make it impossible to sort out their evolutionary relationships. For instance, many microbiologists believed at that time that the enzymes for methanogenesis had been transferred between “Gram-positive” and “Gram-negative” bacteria, explaining the phenotypic diversity of methanogens. In my opinion, the present attempt by “microbialists” to revive notions that for so long delayed the introduction of the evolutionary concept in microbiology (going back to the pre-rRNA era) can be viewed as a clear case of “counterrevolution”. Fortunately, a number of studies have now clearly shown that the trinity of life, first identified using a single gene, can be recovered in whole-genome trees despite HGT (reviewed in Forterre et al. (2007a) and Chapter 17) and that “core genes” involved in central information processing mechanisms are highly resistant to HGT (Brochier et al., 2005a, 2005b; Daubin et al., 2002).
In the case of archaea, phylogenies based on ribosomal proteins and on RNA polymerase subunits are highly congruent, and this congruence has increased with the number of genomes analyzed (Brochier et al., 2005a). This clearly shows that it is possible to extract the history of species from the phylogeny of core genes, although careful and multiple analyses are required to position some lineages correctly, as emphasized by the cases of nanoarchaea and Methanopyrus (Brochier et al., 2004, 2005b). Phylogenetic analysis of concatenated ribosomal protein sequences, combined with cladistic analysis of genome content, has recently made it possible to propose a new archaeal phylum (Thaumarchaea) and to tentatively root the archaeal tree in the branch leading to this new phylum (Brochier-Armanet et al., 2008a). One can hope that future work in that direction will progressively unravel the deepest evolutionary relationships between phyla within the bacterial and eukaryal domains. However, the nature of the events that preceded the formation of the domains and, even worse, of those that occurred prior to the emergence of LUCA will always be out of reach of approaches based on molecular phylogeny and comparative genomics. One has to rely on a comprehensive understanding of modern biochemistry to dig further into the past.
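
As a rough illustration of what “congruent” phylogenies means, one common way to compare two rooted trees on the same taxa is to count the clades found in one tree but not in the other (a rooted Robinson-Foulds distance). The sketch below is not the method of the cited studies, which is not described here; trees are written as nested tuples, and the taxon labels A to D and the tree names are placeholders, not real data.

```python
def clades(tree):
    """Return (leaf set, set of clades) for a rooted tree given as nested
    tuples of taxon names, e.g. ((("A", "B"), "C"), "D")."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)                 # the clade subtended by this node
    return leaves, found


def clade_distance(tree_a, tree_b):
    """Number of clades present in one tree but not in the other;
    0 means that the two topologies are identical."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must contain the same taxa")
    return len(clades_a ^ clades_b)


# Placeholder topologies standing in for trees inferred from two gene sets.
ribosomal_tree = ((("A", "B"), "C"), "D")
polymerase_tree = ((("A", "B"), "C"), "D")
discordant_tree = ((("A", "C"), "B"), "D")

print(clade_distance(ribosomal_tree, polymerase_tree))
print(clade_distance(ribosomal_tree, discordant_tree))
```

A distance of zero between trees built from independent sets of core genes is what one would expect if both track the same history of species; a gene affected by HGT would typically produce a tree with a nonzero distance to the others.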

3.8 THE NATURE OF THE RNA WORLD

An important debate associated with our conception of the universal tree of life, and especially of its root, concerns the nature of the cells that predated the three types of cells present today on our planet (modern cells). Some authors, usually in the framework of the ring of life hypothesis, have argued that such “ancient” cells never existed and that archaea and bacteria originated directly from an acellular world (Koga et al., 1998; Koonin and Martin, 2005). However, one can reasonably argue that natural selection of integrated molecular mechanisms would have required the existence of well-defined cells before the differentiation of the three domains (Forterre and Gribaldo, 2007 and references therein). In particular, I strongly argue that the RNA world was not a world of free RNA molecules (or RNA-based macromolecular complexes) competing with each other, but a world of RNA cells competing with each other (Forterre, 2005). Although the definition (and nature) of the RNA world is still a matter of debate, everybody agrees today that DNA genomes were preceded by RNA genomes. If one considers the prerequisites for transforming an organism with an RNA genome into an organism with a DNA genome, it seems quite obvious to me that this transition started from very sophisticated RNA cells (for arguments in favor of the existence of such cells, based on the capacity of modern cells to transcribe RNA faithfully and to repair it, see Poole and Logan (2005)). Indeed, the RNA to DNA transition required the presence of the four ribonucleotide triphosphates (rNTP), a ribonucleotide reductase to produce dNTP from rNTP (an especially complex reaction requiring the generation of free radicals in the protein enzyme), and an RNA polymerase able to shift its specificity from rNTP to dNTP (or capable of using both). This implies a complex metabolism to produce rNTP (with a continuous energy source) and several complex protein enzymes that were already produced quite faithfully. This in turn implies an already sophisticated translation apparatus (with an optimal genetic code) and a complex ATP-generating mechanism. I suggested that the selective force that triggered the RNA to DNA transition was the need for a cell or a virus (preferentially the latter) to protect its genome against attack from its “enemy” by chemical modification (Forterre, 2002). This again implies that the cell/virus couple already existed (the notion of a virus with its capsid existing in the absence of cells is at odds with common sense). If DNA was first selected as genetic material for a virus, it was first produced inside the cell infected by this virus, that is, in the viral factory (Claverie, 2006).
All these arguments strongly support the notion that our world of modern DNA/RNA/protein cells was preceded by a world of RNA/protein cells (and viruses) competing with each other (Forterre, 2005).

3.9 THE DNA REPLICATION PARADOX AND THE NATURE OF LUCA

Independently of any theoretical consideration, the fact that two distinct mechanisms for cellular DNA replication presently exist on Earth (one in bacteria, another in archaea and eukarya) forcefully raises the question of the genome of LUCA (Mushegian and Koonin, 1996; Leipe et al., 1999; Forterre, 1999). Was LUCA still an RNA cell or already a DNA cell? In the framework of the viral theory for the origin of DNA, this question can be translated as follows: when was DNA transferred from viruses to cells, before or after LUCA? This is still an unsolved question. I recently suggested that three transfers could have occurred independently, one at the origin of each domain (Forterre, 2006a) (Figure 3.8a). This proposal was made to explain some critical differences in the DNA replication apparatus between archaea and eukarya, in particular their different complements of DNA topoisomerases (IB and IIA in eukarya, IIB in archaea). However, two new results have weakened the idea that the common ancestor of archaea and eukarya still had an RNA genome: (1) an archaeal TopoIB has been detected in Thaumarchaea and was probably present in the last common archaeal/eukaryal ancestor (Brochier-Armanet et al., 2008b), and (2) archaea and eukaryotes seem to share a common mechanism coupling DNA replication and protein synthesis (Berthon et al., 2008). In the three viruses–three cells theory, the finding of an archaeal TopoIB can still be explained by the presence of this enzyme in the two viral ancestors of archaea and eukarya, but the existence of a common mechanism coupling DNA replication and translation (if confirmed) is more difficult to justify. The three viruses–three domains hypothesis was also proposed to explain the existence of three different versions of universal proteins, by assuming that the rate of protein evolution rapidly increased after the RNA to DNA transition (at the onset of each domain). If LUCA had an RNA genome but the common ancestor of archaea and eukarya had a DNA genome (Figure 3.8b), this would explain why the bacterial versions of universal proteins are much more divergent from the archaeal and eukaryal versions than the latter two are from each other. In that case, the root of the universal tree should indeed be in the bacterial branch. Alternatively, if LUCA already had a DNA genome (Figure 3.8c and d), all options remain possible for the rooting (Figure 3.8c), but this requires returning to the idea of a nonorthologous displacement of the ancestral DNA replication mechanism by a new one (of possible viral origin) in one of the first two cellular lineages that diverged from LUCA (Forterre, 1999).

Figure 3.8 Four scenarios for the transfer of DNA from viruses to cells, occurring either entirely after LUCA (a, b) (LUCA still had an RNA genome) or, for one transfer, already before LUCA (c, d) (LUCA had a DNA genome): (a) the three viruses–three domains scenario; (b) the two viruses–three domains scenario. In the two viruses–three domains scenario, the last common ancestor of archaea and eukarya was already a member of the DNA world. In (a), the root of the universal tree could be anywhere, and in (b) it should be located on the bacterial branch. The dotted line indicates the transition from lineages with RNA genomes (thin line) to lineages with DNA genomes (bold line). DNA could also have been transferred before LUCA, and the ancestral DNA replication machinery displaced by a new one either in bacteria (c) or in a common ancestor of archaea and eukarya (d). This scenario would be compatible with any rooting if the second transfer occurred in bacteria (c); otherwise the root has to be in the bacterial branch (d).

3.10 WHEN VIRUSES FIND THEIR WAY INTO THE UNIVERSAL TREE OF LIFE

Viruses have recently made a spectacular appearance on the stage of evolution (for recent reviews and debates, see Forterre, 2006a,b; Koonin et al., 2006; Forterre and Prangishvili, 2009a,b). Dennis Bamford, from Helsinki, whose laboratory played a major role in obtaining evidence for the antiquity of viruses (Bamford et al., 2006), suggested immersing the universal tree of life in a “viral ocean” (Bamford, 2003) (Figure 3.9a). However, the virosphere does not necessarily form a single ocean without borders. Some viruses infecting members of different domains are clearly evolutionarily related (at least from the viewpoint of capsid structure and some DNA replication proteins), suggesting a common origin of some viral features before LUCA (Bamford, 2003; Forterre, 2006a,b), but at the same time viruses are quite specific to a given domain (for instance, fuselloviruses only exist in archaea) (Prangishvili et al., 2006). These observations can be reconciled by a model in which each domain “selected” at its onset a different part of the ancient virosphere (the virosphere at the time of LUCA) (Prangishvili et al., 2006). From that time, viruses (and related plasmids) have mainly coevolved with their hosts (Figure 3.9b). Some viral lineages evolved in a tree-like structure much like their cellular hosts (for instance, the bacteriovirus T4 superfamily; Filee et al., 2006), whereas others evolved in a much more web-like structure (for instance, the Tectiviridae; see Krupovic and Bamford (2007)), with the possibility of creating novel viruses by recombining cassettes of genes encoding structural proteins with cassettes of genes encoding DNA replication proteins. Viruses are much more numerous than cells and can evolve more rapidly (Suttle, 2007). They are also very ancient, and it is likely that viral genes have always outnumbered cellular genes.
In contrast with the traditional view that viruses are mainly pickpockets of cellular genes, the bidirectional flow of genes between viruses and cells has probably been much greater from viruses to cells than from cells to viruses. Indeed, all cellular genomes contain a high proportion of genes of recent viral origin, up to 42% for mammalian genomes (De Parseval and Heidmann, 2005), and several cases of essential cellular genes of viral origin have recently been documented, such as the viral origin of the initiator protein DnaC in bacteria (Slominski et al., 2007) or the viral origin of the mitochondrial DNA replication and transcription apparatus (Filée and Forterre, 2005). This supports hypotheses in which viral genes are at the origin of DNA and the DNA replication machinery (Forterre, 2002), or in which viruses played a critical role in the origin of the nucleus (Bell, 2001; Claverie, 2006; Forterre, 2006b; Takemura, 2001). The understanding that viruses have played a major role in cellular evolution has stimulated new thinking on their nature and origin. Several authors consider that “viruses are alive and well.” Claverie (2006) suggested focusing on the intracellular viral factory as the organismal component of viruses, whereas Raoult and Forterre (2008) suggested dividing the living world into two categories: ribosome-encoding organisms (cells) and capsid-encoding organisms (viruses). This should not be considered a new dichotomy replacing the prokaryotic/eukaryotic one! The proposal does not reflect a true natural classification, since viruses probably do not form a monophyletic group. Instead, it attempts to knock down the prejudice of cellular organisms (like ourselves) and the assumption that viruses occupy a second rank in nature! It will probably never be possible to draw a universal tree of viral organisms similar to the universal tree of cellular organisms, for the reason previously mentioned (viral chimerism), but it should be feasible to identify the major ancestral viral lineages and to reconstruct part of the web-like viral tree. This is clearly one of the main challenges of the future.

Figure 3.9 The place of viruses in the universal tree of life. (a) The viral ocean, which symbolizes the evolutionary connections between viruses infecting cells from different domains. (b) The three viral “domains” emerging from an ancestral virosphere and coevolving with their cellular hosts independently in each cellular domain. [Labels in both panels: Archaea, Eukarya, Bacteria, LUCA.]

3.11 FUTURE DIRECTIONS

Two main factors have contributed to our present knowledge of the universal tree of life and have fuelled the debates surrounding its topology: the dramatic advances of molecular biology and associated methods for sequencing macromolecules on the one hand, and the continuous exploration of microbial diversity on the other, with, for instance, the discovery of new groups of archaea and their viruses, the discovery of mimiviruses, and the discovery of Planctomycetales and other nucleated bacteria (more details in Chapter 17). These two lines of research will probably continue to bring important information in the future. On the one hand, programs are underway to systematically sequence the genomes of many representatives of all known phyla and/or divisions in the three cellular domains. On the other hand, the importance of systematically exploring all natural biotopes for microbes and their viruses is now better understood. Fashionable metagenomic projects notwithstanding, it is clear that only the cultivation of living cellular and viral organisms will permit dramatic discoveries similar to those previously mentioned.

The possibility of reconstructing the universal tree of life from conserved proteins has been ridiculed by some, who called it “the tree of 1%” in view of the low number of universal genes useful for such analysis (Dagan and Martin, 2006). However, they confuse organisms with their genes, and species trees with gene trees. The aim of evolutionists is to reconstruct the universal species tree (a historical object). Focusing on the 1% of conserved genes is a typical reductionist approach, reminiscent of the approach of geneticists who focused on the gene to understand the hereditary properties of organisms (something highly criticized by some biologists in the first part of the twentieth century) or of biochemists who focused on DNA (less than 1% of cellular molecules) to understand the nature of the gene (not to speak of Darwin himself and Darwin’s finches of the Galapagos islands!). The discovery that cellular life is divided into three domains and not two was itself one of the greatest success stories of the reductionist approach (the tree of 0.01%!). One should be careful not to abandon these extremely fruitful strategies simply because we now have an avalanche of genomic data at our disposal (the forest could hide the tree!). So, one should not dismiss the tree of 1% but instead cherish this 1% of genes (probably a lower estimate, by the way) that allows us to dig into the most distant past of our history. After all, they could have been missing as well: core genes could have failed to resist the disruptive forces of evolution. Fortunately, this does not seem to be the case. Although more troublesome, the lack of a valid phylogenetic signal in these genes to root the universal tree of life, and the difficulty of rooting even each domain, should not be an excuse to forget the Darwinian tree concept. One should continue to pursue aggressively the reductionist approach that focuses on central molecular mechanisms and on the meaningful genes that can reveal the history of organisms. Looking for emergent properties in evolution based on physical principles is an interesting and important new avenue (Woese, 2004), but it should only complement the historical approach. As we have seen with the problem of the emergence of the genetic code, simulation experiments can be essential to unravel fundamental evolutionary processes, but the actual historical pathway in which these processes were realized (such as the topology of the tree of life) can only be unraveled by looking deep into the molecular details of modern organisms.
Finally, one should always keep an open mind and get rid of our old “macrobe prejudices” (Forterre, 2008) that favor gradist trees of life giving eukaryotes a special status, with humans at the top of the universal tree. The Pandora’s box opened 30 years ago by the Woesian revolution still has many treasures to deliver for Darwin’s followers.

REFERENCES

BAMFORD, D.H., 2003. Do viruses form lineages across different domains of life? Res. Microbiol. 154: 231–236.
BAMFORD, D.H., GRIMES, J.M., and STUART, D.I., 2006. What does structure tell us about virus evolution? Curr. Opin. Struct. Biol. 15: 655–663.
BERTHON, J., CORTEZ, D., and FORTERRE, P., 2008. Genomic context analysis in Archaea suggests previously unrecognized links between DNA replication and translation. Genome Biol. 9: R71.
BELL, P.J., 2001. Viral eukaryogenesis: was the ancestor of the nucleus a complex DNA virus? J. Mol. Evol. 53: 251–256.
BRINKMAN, H. and PHILIPPE, H., 1999. Archaea sister group of bacteria? Indications from tree reconstruction artifacts in ancient phylogenies. Mol. Biol. Evol. 16: 817–825.
BOLHUIS, A., 2004. The archaeal Sec-dependent protein translocation pathway. Philos. Trans. R. Soc. Lond. B Biol. Sci. 359: 919–927.
BOUSSEAU, B., BLANQUART, S., NECSULA, A., LARTILLOT, N., and GOUY, M., 2008. Parallel adaptations to high temperatures in the Archaean eon. Nature 456: 942–945.
BROCHIER, C. and PHILIPPE, H., 2002. Phylogeny: a nonhyperthermophilic ancestor for bacteria. Nature 417: 244.
BROCHIER, C., FORTERRE, P., and GRIBALDO, S., 2004. Archaeal phylogeny based on proteins of the transcription and translation machineries: tackling the Methanopyrus kandleri paradox. Genome Biol. 5: R17.
BROCHIER, C., FORTERRE, P., and GRIBALDO, S., 2005a. An emerging phylogenetic core of Archaea: phylogenies of transcription and translation machineries converge following addition of new genome sequences. BMC Evol. Biol. 5: 36.
BROCHIER, C., GRIBALDO, S., ZIVANOVIC, Y., CONFALONIERI, F., and FORTERRE, P., 2005b. Nanoarchaea: representatives of a novel archaeal phylum or a fast-evolving euryarchaeal lineage related to Thermococcales? Genome Biol. 6: R42.
BROCHIER-ARMANET, C., BOUSSEAU, B., GRIBALDO, S., and FORTERRE, P., 2008a. Mesophilic crenarchaeota: proposal for a third archaeal phylum, the Thaumarchaeota. Nat. Rev. Microbiol. 6: 245–252.
BROCHIER-ARMANET, C., GRIBALDO, S., and FORTERRE, P., 2008b. A DNA topoisomerase IB in Thaumarchaeota testifies for the presence of this enzyme in the last common ancestor of Archaea and Eucarya. Biol. Direct 3: 54.
CAETANO-ANOLLÉS, G., 2002. Evolved RNA secondary structure and the rooting of the universal tree. J. Mol. Evol. 54: 333–345.

CAETANO-ANOLLÉS, G. and CAETANO-ANOLLÉS, D., 2005. Universal sharing patterns in proteomes and evolution of protein fold architecture and life. J. Mol. Evol. 60: 484–498.
CAVALIER-SMITH, T., 2002. The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification. Int. J. Syst. Evol. Microbiol. 52: 7–76.
CICCARELLI, F.D., DOERKS, T., von MERING, C., CREEVEY, C.J., SNEL, B., and BORK, P., 2006. Toward automatic reconstruction of a highly resolved tree of life. Science 311: 1283–1297.
CLAVERIE, J.M., 2006. Viruses take center stage in cellular evolution. Genome Biol. 7: 110.
COLLINS, L. and PENNY, D., 2005. Complex spliceosomal organization ancestral to extant eukaryotes. Mol. Biol. Evol. 22: 1053–1066.
DAGAN, T. and MARTIN, W., 2006. The tree of one percent. Genome Biol. 7: 118.
DAUBIN, V., GOUY, M., and PERRIERE, G., 2002. A phylogenomic approach to bacterial phylogeny: evidence of a core of genes sharing a common history. Genome Res. 12: 1080–1090.
DAWKINS, R., 2006. The God Delusion. Bantam Press.
DE PARSEVAL, N. and HEIDMANN, T., 2005. Human endogenous retroviruses: from infectious elements to human genes. Cytogenet. Genome Res. 110: 318–332.
DOOLITTLE, W.F., 1999. Phylogenetic classification and the universal tree. Science 284: 2124–2129.
DOOLITTLE, W.F. and BAPTESTE, E., 2007. Pattern pluralism and the tree of life hypothesis. Proc. Natl. Acad. Sci. USA 104: 2043–2049.
FILÉE, J. and FORTERRE, P., 2005. Viral proteins functioning in organelles: a cryptic origin? Trends Microbiol. 13: 510–513.
FILÉE, J., BAPTESTE, E., SUSKO, E., and KRISCH, H.M., 2006. A selective barrier to horizontal gene transfer in the T4-type bacteriophages that has preserved a core genome with the viral replication and structural genes. Mol. Biol. Evol. 23: 1688–1696.
FORTERRE, P., 1995. Thermoreduction, a hypothesis for the origin of prokaryotes. CR Acad. Sci. III 318: 415–422.
FORTERRE, P., 1996. A hot topic: the origin of hyperthermophiles. Cell 85: 789–792.
FORTERRE, P., 1997. Protein versus rRNA: rooting the universal tree of life? ASM News 63: 89–95.
FORTERRE, P., 1999. Displacement of cellular proteins by functional analogues from plasmids or viruses could explain puzzling phylogenies of many DNA informational proteins. Mol. Microbiol. 33: 457–465.
FORTERRE, P., 2002. The origin of DNA genomes and DNA replication. Curr. Opin. Microbiol. 5: 525–532.
FORTERRE, P., 2005. The two ages of the RNA world, and the transition to the DNA world: a story of viruses and cells. Biochimie 87: 793–803.
FORTERRE, P., 2006a. Three RNA cells for ribosomal lineages and three DNA viruses to replicate their genomes: a hypothesis for the origin of cellular domain. Proc. Natl. Acad. Sci. USA 103: 3669–3674.
FORTERRE, P., 2006b. The origin of viruses and their possible roles in major evolutionary transitions. Virus Res. 117: 5–16.
FORTERRE, P., 2008. In a world of microbes, where should microbiology stand? Res. Microbiol. 159(1): 74–80.
FORTERRE, P. and GRIBALDO, S., 2007. The origin of modern terrestrial life. HFSP J. 1: 156–168.
FORTERRE, P. and PHILIPPE, H., 1999. Where is the root of the universal tree of life? Bioessays 21: 871–879.
FORTERRE, P., BENACHENHOU-LAHFA, N., CONFALONIERI, F., DUGUET, M., ELIE, C., and LABEDAN, B., 1993. The nature of the last universal ancestor and the root of the tree of life, still open questions. Biosystems 28: 15–32.
FORTERRE, P., GRIBALDO, S., and BROCHIER-ARMANET, C., 2007a. Natural history of the archaeal domain. In Archaea (eds R. Garrett and H.P. Klenk). Blackwell Publishing, pp. 17–29.
FORTERRE, P., GRIBALDO, S., GADELLE, D., and SERRE, M.C., 2007b. Origin and evolution of DNA topoisomerases. Biochimie 89: 427–46.
FORTERRE, P. and PRANGISHVILI, D., 2009a. The great billion-year war between ribosome- and capsid-encoding organisms (cells and viruses) as the major source of evolutionary novelties. Ann. N. Y. Acad. Sci. 1178: 65–77.
FORTERRE, P. and PRANGISHVILI, D., 2009b. The origin of viruses. Res. Microbiol. 160(7): 466–472.
FUERST, J.A., 2005. Intracellular compartmentation in planctomycetes. Annu. Rev. Microbiol. 59: 299–328.
GALTIER, N., TOURASSE, N., and GOUY, M., 1999. A non hyperthermophilic common ancestor to extant life forms. Science 283: 220–221.
GLANSDORFF, N., 2000. About the last common ancestor, the universal tree of life and lateral gene transfer: a reappraisal. Mol. Microbiol. 38: 177–185.
GLANSDORFF, N., XU, Y., and LABEDAN, B., 2008. The last universal common ancestor: emergence, constitution and genetic legacy of an elusive forerunner. Biol. Direct 3: 29.
GOGARTEN, J.P., KIBAK, H., DITTRICH, P., TAIZ, L., BOWMAN, E.J., BOWMAN, B.J., MANOLSON, M.F., POOLE, R.J., DATE, T., OSHIMA, T., DENDA, K.J.K., and YOSHIDA, M., 1989. Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes. Proc. Natl. Acad. Sci. USA 86: 6661–6665.
HARTMAN, H. and FEDOROV, A., 2002. The origin of the eukaryotic cell: a genomic investigation. Proc. Natl. Acad. Sci. USA 99: 1420–1425.
IWABE, N., KUMA, K., HASEGAWA, M., OSAWA, S., and MIYATA, T., 1989. Evolutionary relationship of archaebacteria, eubacteria and eukaryotes inferred from phylogenetic trees of duplicated genes. Proc. Natl. Acad. Sci. USA 86: 9355–9359.
JEFFARES, D.C., POOLE, A.M., and PENNY, D., 1998. Relics from the RNA world. J. Mol. Evol. 46: 18–36.
KOONIN, E.V. and MARTIN, W., 2005. On the origin of genomes and cells within inorganic compartments. Trends Genet. 21: 647–654.

KOONIN, E.V., SENKEVICH, T.G., and DOLJA, V.V., 2006. The ancient Virus World and evolution of cells. Biol. Direct 1: 29.
KRUPOVIC, M. and BAMFORD, D.H., 2007. Putative prophages related to lytic tailless marine dsDNA phage PM2 are widespread in the genomes of aquatic bacteria. BMC Genomics 8: 236.
KOGA, Y., KYURAGI, T., NISHIHARA, M., and SONE, N., 1998. Did archaeal and bacterial cells arise independently from noncellular precursors? A hypothesis stating that the advent of membrane phospholipids with enantiomeric glycerophosphate backbones caused the separation of the two lines of descent. J. Mol. Evol. 46: 54–63.
KURLAND, C.G., COLLINS, L.J., and PENNY, D., 2006. Genomics and the irreducible nature of eukaryote cells. Science 312: 1011–1014.
KURLAND, C.G., CANBACK, B., and BERG, O.G., 2007. The origins of modern proteomes. Biochimie 89: 1454–1463.
LECOMPTE, O., RIPP, R., THIERRY, J.C., MORAS, D., and POCH, O., 2002. Comparative analysis of ribosomal proteins in complete genomes: an example of reductive evolution at the domain scale. Nucleic Acids Res. 30: 5382–5390.
LEIPE, D.D., ARAVIND, L., and KOONIN, E.V., 1999. Did DNA replication evolve twice independently? Nucleic Acids Res. 27: 3389–3401.
LOPEZ, P., FORTERRE, P., and PHILIPPE, H., 1999. The root of the tree of life in the light of the covarion model. J. Mol. Evol. 49: 496–508.
LOPEZ-GARCIA, P. and MOREIRA, D., 1999. Metabolic symbiosis at the origin of eukaryotes. Trends Biochem. Sci. 24: 88–93.
MARTIN, W. and KOONIN, E.V., 2006. A positive definition of prokaryotes. Nature 442: 868.
MUSHEGIAN, A.R. and KOONIN, E.V., 1996. A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proc. Natl. Acad. Sci. USA 93(19): 10268–10273.
OLSEN, G.J. and WOESE, C.R., 1997. Archaeal genomics: an overview. Cell 89: 991–994.
PACE, N.R., 2006. Time for a change. Nature 441: 289.
PENNY, D. and POOLE, A., 1999. The nature of the last universal common ancestor. Curr. Opin. Genet. Dev. 9: 672–677.
PHILIPPE, H. and FORTERRE, P., 1999. The rooting of the universal tree is not reliable. J. Mol. Evol. 49: 509–523.
POOLE, A.M. and LOGAN, D.T., 2005. Modern mRNA proofreading and repair: clues that the last universal common ancestor possessed an RNA genome? Mol. Biol. Evol. 22: 1444–1455.
POOLE, A.M. and PENNY, D., 2007. Evaluating hypotheses for the origin of eukaryotes. Bioessays 29(1): 74–84.
POOLE, A.M., JEFFARES, D.C., and PENNY, D., 1999. Early evolution: prokaryotes, the new kids on the block. Bioessays 21: 880–889.
PRANGISHVILI, D., FORTERRE, P., and GARRETT, R.A., 2006. Viruses of the Archaea: a unifying view. Nat. Rev. Microbiol. 4: 837–848.
RAOULT, D. and FORTERRE, P., 2008. Redefining viruses: lessons from Mimivirus. Nat. Rev. Microbiol. 6: 315–319.

RIVERA, M.C. and LAKE, J.A., 2004. The ring of life provides evidence for a genome fusion origin of eukaryotes. Nature 431: 152–155.
SAPP, J., 2005. The prokaryote–eukaryote dichotomy: meanings and mythology. Microbiol. Mol. Biol. Rev. 69: 292–305.
SLOMINSKI, B., CAKIEWICZ, J., GOLEC, P., WEGRZYN, G., and WRÓBEL, B., 2007. Plasmids derived from Gifsy-1/Gifsy-2, lambdoid prophages contributing to the virulence of Salmonella enterica serovar Typhimurium: implications for the evolution of replication initiation proteins of lambdoid phages and enterobacteria. Microbiology 153: 1884–1896.
STETTER, K.O., 2006. Hyperthermophiles in the history of life. Philos. Trans. R. Soc. Lond. B Biol. Sci. 361: 1837–1842.
SUTTLE, C.A., 2007. Marine viruses—major players in the global ecosystem. Nat. Rev. Microbiol. 5: 801–812.
TAKEMURA, M., 2001. Poxviruses and the origin of the eukaryotic nucleus. J. Mol. Evol. 52: 419–425.
VETSIGIAN, K., WOESE, C.R., and GOLDENFELD, N., 2006. Collective evolution and the genetic code. Proc. Natl. Acad. Sci. USA 103: 10696–10701.
WANG, M., YAFREMAVA, L.S., CAETANO-ANOLLÉS, D., MITTENTHAL, J.E., and CAETANO-ANOLLÉS, G., 2007. Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world. Genome Res. 17: 1572–1585.
WOESE, C.R., 1981. Archaebacteria. Sci. Am. 244: 98–122.
WOESE, C.R., 1998. Default taxonomy: Ernst Mayr’s view of the microbial world. Proc. Natl. Acad. Sci. USA 95: 11043–11046.
WOESE, C.R., 2000. Interpreting the universal phylogenetic tree. Proc. Natl. Acad. Sci. USA 97: 8392–8396.
WOESE, C.R., 2002. On the evolution of cells. Proc. Natl. Acad. Sci. USA 99: 8742–8747.
WOESE, C.R., 2004. A new biology for a new century. Microbiol. Mol. Biol. Rev. 68: 173–86.
WOESE, C.R., 2007. In Archaea (eds R.A. Garrett and H.P. Klenk). Blackwell Publishing, Oxford, pp. 1–15.
WOESE, C.R. and FOX, G.E., 1977a. The phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc. Natl. Acad. Sci. USA 74: 5088–5090.
WOESE, C.R. and FOX, G.E., 1977b. The concept of cellular evolution. J. Mol. Evol. 10: 1–6.
WOESE, C.R., KANDLER, O., and WHEELIS, M.L., 1990. Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc. Natl. Acad. Sci. USA 87: 4576–4579.
WONG, J.T., CHEN, J., MAT, W.K., NG, S.K., and XUE, H., 2007. Polyphasic evidence delineating the root of life and roots of biological domains. Gene 403: 39–52.
XU, Y. and GLANSDORFF, N., 2002. Was our ancestor a thermophilic procaryote? Comp. Biochem. Physiol. A 133: 677–688.
YUTIN, N., MAKAROVA, K.S., MEKHEDOV, S.L., WOLF, Y.I., and KOONIN, E.V., 2008. The deep archaeal roots of eukaryotes. Mol. Biol. Evol. 25: 1619–30.
ZILLIG, W., 1991. Comparative biochemistry of Archaea and Bacteria. Curr. Opin. Genet. Dev. 1: 544–51.

Chapter 4

Eukaryote Evolution: The Importance of the Stem Group

Anthony M. Poole

4.1 INTRODUCTION
4.2 INTERPRETING TREES
4.3 MOVING BEYOND THE DEEP ROOTS OF EUKARYOTES
4.4 CONCLUDING REMARKS
REFERENCES

4.1 INTRODUCTION

There is now widespread consensus that the main features of eukaryote cells can be traced back to the crown group ancestor, also known as the last eukaryotic common ancestor (LECA). A substantial body of evidence now supports the presence of a mitochondrion in the LECA (reviewed in Embley and Martin, 2006; van der Giezen and Tovar, 2005), with substantial mitochondrial to host gene transfer (Esser et al., 2004; Pisani et al., 2007; Rivera and Lake, 2004). Moreover, there is substantial evidence to support the view that the LECA was a fully fledged eukaryote with a well-developed nucleus and nuclear pore complex (Bapteste et al., 1999; Devos et al., 2004; Mans et al., 2004), endomembrane system (Dacks and Field, 2007; Dacks et al., 2003, 2008; Jekely, 2003, 2008), mitosis and meiosis (Cavalier-Smith, 2002a; Dacks and Roger, 1999; Egel and Penny, 2007; Ramesh et al., 2005), introns and the spliceosomal apparatus (Collins and Penny, 2005), linear chromosomes with telomeres and telomerase (Nakamura and Cech, 1998), and phagocytosis (Jekely, 2003, 2008) (Table 4.1).

Table 4.1 Features of the Eukaryote Cell That Were Present in the Last Eukaryotic Common Ancestor

Feature | Reference
Mitochondrion | (Embley and Martin, 2006; van der Giezen and Tovar, 2005)
Nucleus and nuclear pore complex | (Bapteste et al., 1999; Devos et al., 2004; Mans et al., 2004)
Endomembrane system | (Dacks et al., 2003; Dacks and Field, 2007; Dacks et al., 2008; Jekely, 2003, 2008)
Mitosis and meiosis | (Cavalier-Smith, 2002a; Dacks and Roger, 1999; Egel and Penny, 2007; Ramesh et al., 2005)
Introns and spliceosomal apparatus | (Collins and Penny, 2005; Jeffares et al., 2006; Roy and Gilbert, 2005, 2006)
Linear chromosomes with telomeres and telomerase | (Nakamura and Cech, 1998)
Phagocytosis | (Cavalier-Smith, 2002b; Jekely, 2003, 2008)

4.1.1 What is Significant About the Origin of the Eukaryote Cell?

There are two reasons for posing this question. First, the emergence of modern eukaryotes does appear to be associated with the appearance of a large number of traits, and even genes, that apparently do not have equivalents in either bacteria or archaea. How has such a fundamentally different cellular architecture emerged, and does this require special explanation? Second, among these, is one particular feature especially significant?

Considering the second point first: for historical reasons, the defining feature of the eukaryote cell has of course been the nucleus. However, given the large number of features that can now be placed in the LECA, there has been a departure from this view. For instance, a number of authors have suggested that the mitochondrion predates the nucleus (Jekely, 2008; Koonin, 2006; López-García and Moreira, 2006; Martin and Koonin, 2006; Martin and Müller, 1998). Indeed, given the lack of resolution in establishing the relative order of events, the emergence of any of the features inferred to be present in the LECA could in principle be taken as the defining trait demarcating emergence of this lineage (see, for example, the discussion between Forterre and Glansdorff et al. in Glansdorff et al. (2008)). While there is merit in trying to ascertain the timing of emergence of major eukaryote traits, it may in practice be difficult to produce a definitive picture of the sequence of events. Even a cursory look at any of the key traits reveals similar or analogous features in either archaea or bacteria, or both (see Table 4.2). Against this backdrop, it becomes difficult to say whether there really is a defining cellular or molecular feature of eukaryotes that separates them from archaea and bacteria; this point has been elegantly made with respect to chromosome architecture (Bendich and Drlica, 2003) and nuclear architecture (Fuerst, 2005). Perhaps the most evidently eukaryotic feature is phagotrophy, in that neither bacteria nor archaea have been documented as being capable of cell engulfment.
There are forms of “predation” wherein bacterial cells invade the cells of their “prey” (Guerrero et al., 1986; Martin, 2002; Poole and Penny, 2007a), though this is not at all like phagotrophy (Table 4.2), and perhaps a nearer analogy may be drawn to lytic viruses, particularly in the case of Bdellovibrio. Nevertheless, there is clearly a propensity for certain bacteria to enter other bacteria via this route, and the relevance of this type of mechanism to the formation of bacterial–bacterial endosymbioses is clear. That said, no current data indicate that archaeal cells are the target of such an invasive strategy (Poole and Penny, 2007b). On current data, then, this does not seem a likely mechanism for the origin of the eukaryote cell from an archaeon and a bacterium, but the mealybug endosymbiosis (where a γ-proteobacterium is resident within a β-proteobacterium within the mealybug cells comprising the bacteriome) (von Dohlen et al., 2001) does establish that compartmentation of the genetic material is not an absolute requirement of endosymbiosis. Moreover, the example indicates that there is no obvious problem with the suggestion that the nucleus evolved after the mitochondrial ancestor had gained entry to its protoeukaryotic host.

Table 4.2 Eukaryote-Specific Features and Similar or Analogous Features Found in Archaea and/or Bacteria

Feature in eukaryote crown ancestor | Bacterial or archaeal counterpart | Notes | References
Spliceosomal introns | Self-splicing introns | tRNA introns in archaea and eukaryotes. Group II introns in bacteria, some plastids, and mitochondria, and have recently invaded archaea. Group I introns in bacteria, protist rDNA, and organelles | (Belfort and Weiner, 1997; Bonen and Vogel, 2001; Lykke-Andersen et al., 1997) (Dai and Zimmerly, 2003; Rest and Mindell, 2003) (Haugen et al., 2005)
snoRNA-based rRNA modification | Both C/D and H/ACA snoRNA-like sRNAs found in archaea | Numerous proteins in archaeal and eukaryote systems are homologous; some modification positions are conserved across domains | (Dennis and Omer, 2005; Gaspin et al., 2000; Muller et al., 2008; Omer et al., 2000)
Endomembranes and nucleus | Diverse forms of endomembrane architecture in planctomycete bacteria | Gemmata has a double-membraned nuclear envelope with pore-like structures; homology to eukaryote nuclear envelope remains unclear | (Fuerst, 2005)
Cytoskeleton | Cytoskeletal protein homologues known from bacteria | Bacterial and eukaryote systems are homologous | (Amos et al., 2004; Graumann, 2007; van den Ent et al., 2001)
Diploidy/polyploidy | Polyploidy documented in a number of bacterial lineages | See also zygogenesis | (Bendich and Drlica, 2003)
Linear chromosomes and telomeres | Linear chromosomes in bacteria | Linearity has evolved several times independently in bacteria | (Chen et al., 2002; Volff and Altenbuchner, 2000)
Meiosis and syngamy | Zygogenesis in bacteria | Diploids formed, recombination can occur. Interspecies zygogenesis has been observed | (Gratia, 2005; Gratia, 2007a; Gratia, 2007b; Gratia and Thiry, 2003)
Endosymbioses | Bacterial–bacterial endosymbiosis | Inside a mealybug; mechanism of entry not known, though entry via cell invasion mechanistically plausible | (von Dohlen et al., 2001)
Phagotrophy | Cell or organelle invasion by parasitic bacteria: Bdellovibrio, Daptobacter, α-proteobacterial tick symbionts | Cell invasion is not equivalent to phagocytosis | (Beninati et al., 2004; Guerrero et al., 1986; Martin, 2002; Sacchi et al., 2004)

A minor but important semantic point perhaps needs to be raised at this point. If, as some workers have argued, the origin of the eukaryote nucleus was not the deﬁning event that formed eukaryotes, and the nucleus appeared relatively late in the evolution of this domain, we need to distinguish between a cladistic and a cell biological deﬁnition of the term eukaryote. In terms of cell biology, an organism lacking a nucleus is a prokaryote, so one could argue that a late origin for the nucleus equates with a prokaryotic origin for eukaryotes. Insofar as the nucleus appears to be a derived feature of the eukaryotic cell (Devos et al., 2004), rather than a feature that can be deﬁnitively traced back to the last universal common ancestor (though see de Roos (2006) and Glansdorff et al. (2008) for suggestions to the contrary), this would indeed be technically correct in a cell biological sense. However, to my mind this is too vague a statement because it does not distinguish between the two hypotheses described in Figure 4.1a and b. Under both models, there is no requirement for the nucleus to be an early or late development relative to the emergence of other key eukaryotic traits. However, an archaeal (and hence “prokaryotic”— in the cell biological sense) origin (Figure 4.1b) is not the same as a sister relationship between eukaryotes and archaea (Figure 4.1a), with the common ancestor of both groups being anucleate. Cladistically, there is no requirement that the earliest members of the eukaryote lineage had a nucleate cellular architecture; the semantic confusion can be best avoided by explicit reference to cladistics, that is, whether a particular model implies eukaryotes are direct descendents of archaea or whether they are sister to the archaea (see Section 4.2). 
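The cladistic distinction between Figure 4.1a and Figure 4.1b can be made concrete as a monophyly test. The sketch below is illustrative only: the two toy topologies, the taxon labels, and the helper names (`leaf_set`, `clades`, `is_monophyletic`) are assumptions introduced here and are not taken from any published analysis. In the sister-group tree the archaea form a clade; in the tree where eukaryotes branch within the archaea they do not.

```python
# Toy rooted trees written as nested tuples; leaves are strings.
# Both topologies and all taxon labels are hypothetical placeholders.
SISTER = ("Bacteria", (("Archaeon_1", "Archaeon_2"), ("Eukaryote_1", "Eukaryote_2")))
NESTED = ("Bacteria", (("Archaeon_1", ("Eukaryote_1", "Eukaryote_2")), "Archaeon_2"))

ARCHAEA = {"Archaeon_1", "Archaeon_2"}
EUKARYOTES = {"Eukaryote_1", "Eukaryote_2"}


def leaf_set(tree):
    """Return the set of leaf labels found below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clades(tree):
    """Yield the leaf set of every clade: the node itself and all subtrees."""
    yield leaf_set(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)


def is_monophyletic(tree, group):
    """True if some clade of the tree contains exactly the taxa in group."""
    return frozenset(group) in set(clades(tree))


for name, tree in (("sister groups (Figure 4.1a)", SISTER),
                   ("eukaryotes within archaea (Figure 4.1b)", NESTED)):
    print(name,
          "| archaea monophyletic:", is_monophyletic(tree, ARCHAEA),
          "| eukaryotes monophyletic:", is_monophyletic(tree, EUKARYOTES))
```

In both toy trees the eukaryotes are monophyletic; only the status of the archaea changes. The two hypotheses are therefore distinguished by where the eukaryote lineage attaches, not by any trait of eukaryotes themselves.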
In considering the current state of knowledge on eukaryote origins, my intention is thus to focus on the following issues: considering to what extent disparities concerning the phylogenetic relationship between archaea and eukaryotes impact our understanding of eukaryote evolution, and to what extent the series of evolutionary events leading to modern eukaryotes can be established, given no obvious intermediate stages. The ﬁrst issue can be summarized as one concerning interpretation of conﬂicting phylogenies regarding the relationship between eukaryotes and archaea. Phylogenies have indicated that eukaryotes are either archaeal in origin or sister to archaea, and this has led to debate concerning the origin of eukaryotes. I will show how an acknowledgment of the concepts of stem and crown group can serve as a helpful framework in informing interpretation of seemingly conﬂicting phylogenetic results. The second issue concerns our ability to establish the relative order of emergence of the key eukaryotic features

3 Figure 4.1 Three different takes on the origin of eukaryotes. (a) This tree depicts archaea and eukaryotes as sister groups that share a common ancestor. In each case, the crown group is deﬁned as containing all extant lineages and may also contain extinct lines (on the basis of evidence). The stem is by deﬁnition populated by lineages that have gone extinct. The total group is the combination of crown and stem. For this depiction to be correct, molecular phylogenies would need to recover the monophyly of each domain. It also supposes that the placement of the root of the tree of life between bacteria and archaea–eukaryotes is correct. Note that it is not possible a priori to establish whether the common ancestor of archaea and eukaryotes was more archaeal-like or more eukaryote-like; domain-speciﬁc characters can only be placed in the crown group ancestor for that domain. (b) In this tree, eukaryotes are depicted as a group within archaea. For this relationship to be borne out, eukaryotes must branch within the extant diversity of archaea (i.e., eukaryote are a subgroup within the archaea). Placement of eukaryotes within archaea would need to be based on the existence of phylogenetic data to support this topology. Note that this relationship would tell us that eukaryotes evolved from archaea, but does not tell us the relative order in which key eukaryote features evolved (Table 4.1). Moreover, it does not invalidate the existence of a eukaryote stem and cannot be used per se as an argument in direct support of models wherein an archaeon “engulfed” a bacterium (see text for details). (c) By invoking extinct unobservable archaeal lineages, the tree in (a) is converted into the tree in (b). For reasons described in the text, this position is philosophically untenable.


Chapter 4

Eukaryote Evolution: The Importance of the Stem Group

described above. Here, again, there are numerous plausible ideas, and a veritable cornucopia of published speculation. Given the lack of evolutionarily informative intermediate forms, it may be difﬁcult to arrive at a deﬁnitive scenario. I will therefore take a conservative view here and point to two examples (phagocytosis and intron proliferation) where our knowledge of existing biological processes proves informative.

4.2 INTERPRETING TREES

Consensus has yet to be reached regarding the order in which the numerous eukaryotic cellular features summarized above emerged, and there is likewise ongoing debate concerning the deep evolutionary origins of eukaryotes. Some of the more speculative hypotheses, particularly those that advocate an endosymbiotic origin for the nucleus, cellular fusion, or three-way symbioses (e.g., Bell, 2005; Gupta and Golding, 1996; Hartman and Fedorov, 2002; Horiike et al., 2001; López-García and Moreira, 1999; Margulis et al., 2000; Moreira and López-García, 1998; Takemura, 2001; Zillig et al., 1989; see Martin et al. (2001) for a near-exhaustive list), have been argued to be unlikely given available genomic and cell biological data, and known biological processes (Cavalier-Smith, 2002b; Embley and Martin, 2006; Martin, 1999; Poole and Penny, 2001, 2007a, 2007b; Rotte and Martin, 2001). On current data, it therefore seems most likely that the path to modern (crown group) eukaryotes involved a single endosymbiosis, wherein the bacterial ancestor of the mitochondrion was engulfed by a cell sharing a common ancestry with modern archaea. Whether the host cell can be traced back to the archaeal crown (making it an archaeon), or whether it represented a separate lineage that had diverged prior to the diversification of the archaeal crown (making eukaryotes and archaea sister groups) is the subject of considerable debate (Dagan and Martin, 2007; Embley and Martin, 2006; Koonin and Martin, 2005; López-García and Moreira, 1999; Poole and Penny, 2007a, 2007b, 2007c; Rivera and Lake, 2004; Yutin et al., 2008). In this section, I will examine the implications of these two very different phylogenetic results. The phylogenetic relationship between archaea and eukaryotes can have profound implications for the tree of life.
Formally, a crown archaeal origin for eukaryotes implies the existence of only two domains (bacteria and archaea, since, by definition, eukaryotes would be derived from archaea), whereas a sister-group relationship implies Woese and colleagues’ three-domain classification (Harris et al., 2003; Wang et al., 2007; Woese et al., 1990). Figure 4.1 shows the fundamental difference between these two scenarios, which can in principle be distinguished formally by a phylogenetic test (Cavalier-Smith, 2002b; Poole and Penny, 2007b). However, the phylogenetic relationship between eukaryotes and archaea has a long history of being difficult to resolve (e.g., Lake, 1988; Rivera and Lake, 1992; Tourasse and Gouy, 1999; Woese and Fox, 1977; Woese et al., 1990); results disagree as to whether eukaryotes and archaea are phylogenetically distinct (e.g., Daubin et al., 2001, 2002; Harris et al., 2003; Wang et al., 2007), or whether, phylogenetically, eukaryotes are derived archaea (Pisani et al., 2007; Rivera and Lake, 1992, 2004; Tourasse and Gouy, 1999). Perhaps indicative of the problem, Yutin et al. (2008) recently reported an analysis wherein both types of signal were recovered, though with greater signal for a sister grouping. While the numerous technical problems associated with genome-level phylogenetic analyses are well acknowledged (e.g., Delsuc et al., 2005; Jeffroy et al., 2006; Steel et al., 2000; Wilkinson et al., 2005), it is equally important that there is a consistently applied framework for interpretation. This area of enquiry is, moreover, prone to the additional


difficulty of being so far back in evolution that independent data such as fossils are essentially absent. This has led to a wide range of interpretations stemming from very similar bioinformatic data (see Poole and Penny (2007b) for discussion). Given the general lack of informative morphological, cellular, or molecular biological characters and the corresponding low resolution when attempting to examine evolutionary change across domains, and given the widely held (but rarely evaluated) assumption that all eukaryote features are by definition derived, the application of cladistic thinking is easily sidestepped in favor of intuition. This severely limits our ability to distinguish between hypotheses and to separate testable claims from the occasional vagaries of speculation. One point that therefore needs to be clarified is the formal interpretation of such data (Poole and Penny, 2007b, 2007c; Yutin et al., 2008). For instance, while current data favor the view that the nucleus (and therefore by definition the eukaryotic state) is derived and evolved autogenously (Devos et al., 2004; Jekely, 2003, 2008; Koonin, 2006; López-García and Moreira, 2006; Martin and Koonin, 2006), this does not logically lead to the conclusion that eukaryotes evolved from archaea. To understand this distinction requires an acknowledgment of stem, crown, and total group concepts (developed by Hennig, though the terms themselves were coined by Jefferies; see Budd (2001) and Donoghue (2005)), the application of which has led to significant recent progress in paleontology (Budd, 2001; Budd and Jensen, 2000; Donoghue, 2005; Donoghue and Purnell, 2005).
Of particular note is the fruitful use of stem group fossils to resolve conundrums emergent in solely molecular analyses; by reference to fossil data, it has, for example, been possible to establish that there is no need to invoke special molecular mechanisms (resulting in rapid emergence of new morphological features following genome duplication) to account for the evolution of jawed vertebrates (Donoghue and Purnell, 2005). In brief, the crown group encompasses the current diversity of a monophyletic group and can include both extant and extinct species. The stem includes only extinct lineages; it can be populated with fossil specimens, illuminating important facets of the evolutionary history of a group in a way that phylogenetics of extant species cannot because, ancient DNA studies aside, stems are by definition naked in molecular phylogenies. The total group is the combined stem and crown. Obtaining the phylogenetic result represented by Figure 4.1b for eukaryotes leads to a straightforward interpretation in that it directly tells us the direction of evolution; universal features of archaeal cellular architecture would here represent the ancestral state, and we can surmise that eukaryotes evolved from archaea and that eukaryote-specific traits are, in all probability, derived. However, the result represented by Figure 4.1a is often erroneously interpreted as meaning that eukaryotes nevertheless evolved from archaea (Yutin et al., 2008), as illustrated in Figure 4.1c. There is difficulty with this interpretation for two reasons. First, it is not possible to deduce the origins of features found only in one crown by reference to the other (Poole and Penny, 2007b). Second, it is not fruitful to invoke extinct (and unobservable) lineages solely in order to enable the interpretation that eukaryotes evolved from archaea.
This fallacy has been dubbed the “celestial teapot of phylogenetics” (Poole and Penny, 2007c) (Figure 4.1c), recalling Bertrand Russell’s quip that while it is possible to invoke entities that are by definition unobservable (Russell’s teapot in orbit around the sun but too small to be observed; or, in the present case, extinct archaeal lineages where evidence for their existence is unobtainable), the burden of evidence lies with those who choose to invoke such entities, not those who doubt their existence. Russell was in fact discussing the burden of evidence in regard to theistic and atheistic views (Russell, 1952), but the generality of the point is clear. In the case of eukaryote origins, the proposal illustrated in Figure 4.1c should only be invoked where there is evidence to


demonstrate that stem group archaeal lineages existed prior to the emergence of the lineage leading to eukaryotes. By deﬁnition, this is not amenable to phylogenetic or morphological analyses of extant lineages and requires fossil evidence at a level of molecular and cellular detail greater than that currently available for the deepest divergences.
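The crown, stem, and total group definitions used in this section, and the monophyly criterion that separates the trees in Figure 4.1a and b, can be made concrete with a small sketch. The two toy trees below are hypothetical stand-ins for the figure (the taxon names `A1`, `E1`, `S1`, `LECA`, and so on are placeholders, not real lineages), and the functions are an illustrative assumption about how one might encode the definitions, not a method taken from the cited studies.

```python
from typing import Dict, Iterable, List, Set, Tuple

Tree = Dict[str, str]  # child -> parent; the root has no entry

def path_to_root(tree: Tree, node: str) -> List[str]:
    """Return the node followed by all of its ancestors up to the root."""
    path = [node]
    while node in tree:
        node = tree[node]
        path.append(node)
    return path

def mrca(tree: Tree, taxa: Iterable[str]) -> str:
    """Most recent common ancestor of the given taxa."""
    paths = [path_to_root(tree, t) for t in taxa]
    shared = set(paths[0]).intersection(*paths[1:])
    # Walking up from one tip, the first shared node is the MRCA.
    return next(n for n in paths[0] if n in shared)

def clade(tree: Tree, node: str) -> Set[str]:
    """The node plus everything descended from it."""
    nodes = set(tree) | set(tree.values())
    return {n for n in nodes if node in path_to_root(tree, n)}

def tips(tree: Tree, node: str) -> Set[str]:
    """Tips (extant or extinct) descended from the node."""
    return clade(tree, node) - set(tree.values())

def is_monophyletic(tree: Tree, taxa: Iterable[str]) -> bool:
    """True if the MRCA of the taxa has no other tips among its descendants."""
    taxa = set(taxa)
    return tips(tree, mrca(tree, taxa)) == taxa

def crown_and_stem(tree: Tree, total_root: str,
                   extant: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Crown = MRCA of the living tips and all its descendants; stem = the rest."""
    total = clade(tree, total_root)
    living = tips(tree, total_root) & set(extant)
    crown = clade(tree, mrca(tree, living))
    return crown, total - crown

# Hypothetical toy trees. S1 and S2 are extinct stem lineages, which a
# molecular phylogeny of extant taxa could never sample.
SISTER = {"Bac": "root", "AE": "root", "Arc": "AE", "EukTotal": "AE",
          "A1": "Arc", "A2": "Arc", "S1": "EukTotal", "n1": "EukTotal",
          "S2": "n1", "LECA": "n1", "E1": "LECA", "E2": "LECA"}
NESTED = {"Bac": "root", "Arc": "root", "A1": "Arc", "X": "Arc",
          "A2": "X", "EukTotal": "X", "S1": "EukTotal",
          "LECA": "EukTotal", "E1": "LECA", "E2": "LECA"}
EXTANT = {"Bac", "A1", "A2", "E1", "E2"}
ARCHAEA, EUKARYOTES = {"A1", "A2"}, {"E1", "E2"}

if __name__ == "__main__":
    for name, tree in (("sister (4.1a)", SISTER), ("nested (4.1b)", NESTED)):
        crown, stem = crown_and_stem(tree, "EukTotal", EXTANT)
        print(name, "| archaea monophyletic:", is_monophyletic(tree, ARCHAEA),
              "| eukaryote stem:", sorted(stem))
```

In this sketch the two topologies differ only in whether the archaeal taxa form a monophyletic group, while a non-empty eukaryote stem is recovered under both, which is the point the section relies on.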

4.3 MOVING BEYOND THE DEEP ROOTS OF EUKARYOTES

Given conflicting results (e.g., Pisani et al., 2007; Yutin et al., 2008) regarding whether eukaryotes should be considered phylogenetically archaeal or sister to archaea (i.e., results of the type shown in Figure 4.1a and b), it is important to establish whether anything concerning eukaryote origins can be deduced independently of these conflicting views of archaeal–eukaryote relationships. Here, the central point is that under both scenarios, there is a eukaryotic stem (Figure 4.1). While fossils may potentially document the evolution of cell structure (Knoll et al., 2006) and even sexual reproduction (Butterfield, 2000), identifying putative stem group fossils for eukaryotes is far from trivial and may at best shed limited light on the evolutionary origins of cell ultrastructure and genome architecture (the problems of taxonomic resolution aside (Butterfield, 2007)). Nevertheless, it is crucial to consider the need for a stem, regardless of whether one favors an archaeal origin for eukaryotes or an archaeal–eukaryote sister grouping. Using the more robust framework of stem and crown groups eliminates the need to appeal to special changes in tempo and mode for the evolution of major groups (Budd and Jensen, 2000; Donoghue and Purnell, 2005); the same case can be argued for eukaryotes (Butterfield, 2007; Poole and Penny, 2007b). The primary difference between the two positions embodied in Figure 4.1a and b is the start point; for both scenarios, any account of the evolution of crown group eukaryotes requires an explanation for the origins of numerous eukaryote-specific features (Table 4.1), including the mitochondrion, cell engulfment, meiosis, introns and splicing, linear chromosomes with telomeres, the nucleus, and the associated endomembrane system. In addition, there are eukaryote-specific proteins (Kurland et al., 2006) and RNA molecules and processing (Penny and Poole, 1999; Woodhams et al., 2007).
Some elaboration is required here. An archaeal origin for eukaryotes makes all eukaryotic-specific features derived by definition, whereas a three-domains view leaves open the possibility that certain features of eukaryotes can in principle be traced back to the last universal common ancestor (see Forterre and Philippe, 1999; Poole et al., 1999; Wang et al., 2007 for further discussion). However, as detailed later in this chapter, for the purposes of the current discussion, debate concerning ultimate and proximate origins for features such as introns and splicing can be safely ignored. Regardless of how the eukaryote stem connects to the tree of life, a priori, there is no way of ordering the emergence of these features, since all can be placed in the crown group ancestor of eukaryotes (LECA). This has led, somewhat unfortunately, to suggestions that eukaryote origins are the consequence of a big-bang (meaning here a qualitatively distinct speeding up of evolution invoked to explain the absence of intermediate forms, with subsequent dramatic evolutionary slowdown after formation of the ‘type’ (Koonin, 2007); this is not the same usage of the term ‘big-bang’ as in the proposal that the main eukaryote lineages may have diverged over a relatively short evolutionary time span (Brinkmann and Philippe, 2007; Philippe et al., 2000)). As pointed out above, acknowledgment of the stem–crown distinction, and of the fact that phylogenetic stems are by definition naked, eliminates the need to appeal to special (and untestable) circumstances to explain the evolution of


major phenotypic differences between eukaryotes and archaea. However, with so many eukaryote-speciﬁc features now crowded into the stem, and with no observable intermediates, deducing the order of events seems a futile task. Modest progress can nevertheless be made for some aspects by reference to known mechanisms. I will now brieﬂy review two cases where I think a noncontroversial interpretation is possible.

4.3.1 Origin of the Mitochondrion

It is now generally accepted that mitochondria, mitosomes, and hydrogenosomes are all derived from the same endosymbiotic origin and that the eukaryote crown group ancestor possessed a mitochondrion (reviewed in Embley and Martin (2006) and van der Giezen and Tovar (2005)). As has been widely discussed (e.g., Embley and Hirt, 1998; Embley and Martin, 2006; Keeling, 1998; Poole and Penny, 2007b; Roger, 1999), under the now defunct archezoa hypothesis (Cavalier-Smith, 1983), the origin of the nucleus clearly preceded the endosymbiotic origin of mitochondria in that some eukaryotes were thought to be ancestrally amitochondriate. As new evidence has come to light, this hypothesis has been refuted. However, there has been confusion regarding what this means. In particular, the view that the origin of the mitochondrion was the defining event in the origin of eukaryotes has become popular (Embley and Martin, 2006; López-García and Moreira, 1999; Martin and Muller, 1998; Moreira and López-García, 1998). Subsequently, hypotheses arguing for a direct role of mitochondria in the emergence of selective pressures favoring the development of a nucleus have been proposed (Jekely, 2008; Koonin, 2006; López-García and Moreira, 2006; Martin and Koonin, 2006). While there may be objections to the details of such hypotheses (as discussed below), it is important to point out that no data currently enable us to specifically establish whether the origin of a nuclear envelope predates or postdates the origin of the mitochondrion. Subsequent to the rejection of the archezoa hypothesis, a number of practitioners have argued that the host that “engulfed” the ancestor to mitochondria was an archaeon (e.g., Martin and Muller, 1998); while there are numerous hypotheses that suggest this (see Martin et al.
(2001) for a near-exhaustive survey), so as to focus on the nature of the host and not the specifics of any one hypothesis, I will refer to this class of hypothesis under the rubric “the archaeal hypothesis.” However, as has recently been pointed out, there is value in separating the archezoa hypothesis into its two component parts ((a) that a protoeukaryote host (PEH) engulfed the mitochondrial ancestor, the PEH hypothesis (Cavalier-Smith, 2002b; Poole and Penny, 2007a), and (b) that modern archezoa are missing links that never possessed mitochondria), rather than risking prematurely throwing the baby out with the bathwater. The key point I wish to make, though, concerns the stem: the advantage of invoking a stem is that, regardless of whether one subscribes to the PEH hypothesis or the archaeal hypothesis in defining the host (Figure 4.1a and b), the tricky problem of engulfment that emerges from the latter hypothesis (Poole and Penny, 2007b) can be avoided by not requiring the origin of the mitochondrion to be the very first event in the path to the eukaryote crown. In other words, different hypotheses on the ultimate origin of eukaryotes are independent of the logical order of events along the stem. The argument goes as follows. Invoking an archaeal host suggests that phagocytosis was a feature of some crown group archaeal lineages (under the scenario in Figure 4.1b), but that this feature was subsequently lost, thus explaining the fact that no archaea with bacterial endosymbionts have been observed. However, an archaeal origin for eukaryotes is


not incompatible with the observation that only eukaryotes are known to be capable of cell engulfment. By simply proposing that the propensity to engulf by phagocytosis (a trait that, like mitochondria, can be placed in the eukaryote crown group ancestor) evolved before engulfment of the mitochondrial ancestor occurred, and that both evolved in the stem, in that order, the whole problem is alleviated and no unknown mechanisms (i.e., engulfment by archaea) need be invoked. The distinction may seem too subtle to be taken seriously (Davidov and Jurkevitch, 2007), but it is important; if the weight of phylogenetic evidence were ultimately to favor an archaeal origin for eukaryotes, the stem avoids invoking engulfment as either an extant or an extinct feature of noneukaryotic members of the archaeal crown (by definition, the eukaryote total group would be contained within the archaeal crown; see Figure 4.1b). A clear example of the value of such a subtle distinction can be seen in Jekely’s version of events (he envisages that eukaryotes evolved from archaea) where phagocytosis evolves in the eukaryote stem, but prior to the origin of mitochondria (Jekely, 2003, 2007). This avoids the conundrum of engulfment apparent in a number of other hypotheses (Horiike et al., 2001; Margulis et al., 2000; Martin and Muller, 1998; Moreira and López-García, 1998) in the same sense that the PEH hypothesis does; the difference concerns which specific phylogenetic relationships are endorsed, but by permitting a stem, both are mechanistically plausible. To summarize, the key point is that the relative timing of the origin of phagocytosis and mitochondria can be uncontroversially separated from the more problematic question of the ultimate evolutionary origins of the eukaryote lineage. Crucially, evolution along the stem must be distinguished from the ultimate origins of the eukaryote total group.
By applying such thinking, both a sister group (Figure 4.1a) and an archaeal origin for eukaryotes (Figure 4.1b) are in principle compatible with phagotrophy emerging prior to the mitochondrion. It is important to acknowledge that no archaeal–bacterial endosymbioses are known, and cell invasion by “predatory” bacteria (as per Bdellovibrio) is only known to occur in the bacterial domain (Davidov and Jurkevitch, 2007; Poole and Penny, 2007b and references therein). In invoking an archaeal origin for eukaryotes (from compatible phylogenetic analyses, for example, Pisani et al. (2007)), due care should nevertheless be taken to consider the difference between a direct archaeal host (e.g., Koonin, 2006) and the necessity of accounting for evolution of numerous eukaryote-speciﬁc features along the eukaryotic stem (see Jekely (2007, 2008) and Poole and Penny (2007b, 2007c)), even if that stem emerges from the archaeal crown.

4.3.2 Stepwise Development of Mitochondria

Irrespective of the timing of the origin of mitochondria relative to the emergence of other eukaryotic features, a key question is how a stable endosymbiosis between stem eukaryote and free-living bacterium came into being. Determining the specific basis for the endosymbiosis is difficult. To date, the most detailed attempt to address this problem is the hydrogen hypothesis (Martin and Muller, 1998), wherein it was proposed that the eukaryote cell arose via symbiosis between a hydrogen-dependent autotrophic archaeon and a bacterium that produced molecular hydrogen as a waste product. While several objections have been raised concerning the specifics of this hypothesis (whether the host cell was an archaeon (Poole and Penny, 2007a) and the finding that, phylogenetically, hydrogenosomes are derived mitochondria (Embley, 2006)), the model stands out as one of the few that attempts to specify a biochemically and ecologically plausible initial selection pressure favoring symbiotic association.


Establishing the exact metabolic nature of the initial contact that favored symbiotic association may not be possible in the absence of direct data, and given the current metabolic diversity of mitochondria, mitosomes, and hydrogenosomes (Tovar et al., 2005). This difficulty notwithstanding, it is possible to envisage a plausible set of intermediates that indicate the general steps in the process. Drawing from extant examples, it has recently been suggested that a simple phagotroph-prey scenario would likely involve the following steps (Poole and Penny, 2007a):

1. Phagotrophy: engulfment of prey cells via phagocytosis.
2. Emergence of individuals within the population of prey that evade digestion; these could either be prey with resistance to phagocytosis, or parasites that gain entry into their host via the phagocytic pathway.
3. Establishment of a facultative symbiotic relationship; this may be mutualistic, commensal, or pathogenic in nature.
4. Shift from a facultative to obligate endosymbiotic association.
5. The endosymbiont evolves into an organelle.

To give a quick rundown, phagotrophy and phagocytosis are widespread in eukaryotes, and cell engulfment can lead to a lifestyle switch, both transiently, as is the case for mixotrophic lineages, and stably, as with the evolution of land plants and photosynthetic algae (Raven, 1997; Stoecker, 1999; Jones, 2000; Archibald, 2005). A particularly stunning example of mixotrophy involving a phagotrophic phase and a phototrophic phase is the recent description of Hatena, a flagellate eukaryote that carries a green algal endosymbiont (Okamoto and Inouye, 2005). Upon cell division, only one daughter receives the endosymbiont; the other develops a specialized apparatus that enables the symbiontless Hatena cell to engulf a new algal cell; Hatena then apparently returns to a phototrophic existence.
While this does not establish the nature of the symbiotic interaction between stem group eukaryote and mitochondrial ancestor, mixotrophy illustrates the propensity for a lifestyle switch from phagotrophy (which provides a mechanism for engulfment) to an alternative lifestyle where the host is at least transiently dependent on the symbiont. Phagocytosis and phagotrophy are likewise exploited by bacteria. In the case of phagocytosis, this is well documented for numerous medically important bacterial genera; some pathogens in fact exploit macrophage phagocytosis as a pathway for initiating infection and facilitating systemic spread (Rosenberger and Finlay, 2003). Moreover, many such pathogenic bacteria are found in association with phagotrophic amoebae, where they are likewise known to be resistant to digestion; prey may have evolved into parasite and may in some cases exist in a commensal relationship with their amoebal hosts (Greub and Raoult, 2004). One example that illustrates aspects of steps 3 and 4, in that it sets the stage for a switch from pathogenesis to commensalism, is the endosymbiosis between Acanthamoeba and Candidatus Odyssella thessalonicensis. The latter is an obligate intracellular pathogen of Acanthamoebae, and will lyse its amoebal host at elevated temperatures (30–37 °C); at lower temperatures (22 °C) the bacterium is a relatively stable occupant of the amoeba. A number of endosymbioses are known to be obligate, with some apparently mutualistic associations raising questions as to when an endosymbiont should be considered an organelle. One particularly well-studied case is that of aphids and their endosymbiont bacteria (genus: Buchnera), where the bacteria are vertically transmitted from mother to offspring (Baumann et al., 1995), and, consequently, there is clear phylogenetic congruence between host aphid lineages and their endosymbiont bacteria (Munson et al., 1991), as is the


case for organelles such as mitochondria and chloroplasts. Moreover, neither host nor bacterium appears to be capable of independent existence; treatment of some aphid strains with antibiotics has been demonstrated to result in sterility and reductions in growth rate and life span (Douglas, 2007). These examples give some indication as to the wide range of interactions by which stable endosymbioses may emerge (i.e., predation, parasitism, commensalism). While it is formally difficult to establish mutualism between endosymbiont and host, the existence of endosymbioses appears to be a widespread phenomenon between members of the eukaryote domain and bacteria (and also archaea), whereas there is a dearth of examples that do not involve eukaryotes. Whether examples of prokaryote–prokaryote endosymbioses will be discovered remains to be seen (Davidov and Jurkevitch, 2007), and for the time being, it makes most sense to conclude that the cellular mechanism by which the mitochondrion entered the eukaryote lineage was phagocytosis, irrespective of the timing of emergence of this process relative to the emergence of other features along the eukaryote stem (Cavalier-Smith, 2002b; Jekely, 2007, 2008; Poole and Penny, 2007a).

4.3.3 Intron Proliferation and Eukaryote Origins

A second conundrum that can be resolved by invoking a eukaryote stem is the proliferation, though not the origin, of introns. Given that intron proliferation and origin are often discussed in parallel, a brief aside concerning intron origins is warranted in order to illustrate that it is possible to separate proximate and ultimate origins. There are three current ideas concerning the origin of introns that enjoy different levels of popularity at different times, but all suffer from a paucity of evidence (reviewed in Jeffares et al. (2006) and Rodriguez-Trelles et al. (2006)). These are the introns-first (Poole et al., 1999), introns-early (Blake, 1978; Doolittle, 1978; Gilbert, 1978), and mitochondrial seed (Cavalier-Smith, 1991; Logsdon, 1998) hypotheses. None of these models are incompatible with a possible common ancestry for group II introns and the spliceosomal apparatus; a recent incarnation of introns-early (Gilbert and de Souza, 1999) argues for an RNA world origin for group II introns (but not for spliceosomal introns as per some earlier suggestions), while the mitochondrial seed hypothesis argues instead for transfer of group II introns from the mitochondrion to the nucleus, leading to the late emergence of the spliceosomal apparatus and spliceosomal introns. The introns-first hypothesis is foremost an attempt to account for the origins of protein-coding genes and mRNA from an RNA world, and while it does not specifically address the origin of group II introns, a common ancestry for group II introns and spliceosomal introns and apparatus is possible under this model, but would imply that mobile group II introns evolved by reductive evolution. It is likewise worth noting that introns-late is not at odds with an RNA world origin for group II introns (e.g., Hickey, 1992).
Furthermore, none of these models are particularly at odds with the proliferation of introns in some crown group eukaryotes; origins can be extricated from conditions favoring proliferation (Jeffares et al., 2006). Proliferation subsequent to origin can thus be separated even from the most diametrically opposed starting points (spliceosomal introns originating in the RNA world versus spliceosomal introns originating in the eukaryote stem, having evolved from mitochondrially derived group II introns). In both cases, introns are present in LECA. The key point concerning intron spread has been developed by Hickey in relation to the effect that sexual outcrossing has on the dynamics of mobile elements (Hickey, 1982, 1992). I will briefly summarize Hickey’s argument and point out a corollary of


importance to the current discussion: Hickey’s model indicates that meiosis and sexual outcrossing preceded the successful proliferation of introns in early stem group eukaryote lineages (as testiﬁed by a moderate ancestral intron content in LECA (Roy and Gilbert, 2005)). Consequently, it does not support models for the origin of eukaryotes where intron proliferation began in an archaeon (Poole, 2006). For convenience, I consider the proliferation of introns from the perspective of the mitochondrial seed hypothesis, but there is no mechanistic requirement that intron proliferation involve self-splicing introns; proliferation appears to be possible for spliceosomal introns provided that there is a functioning spliceosome, as substantiated by cases of recent intron gain (reviewed in Roy and Gilbert (2006)). Hickey’s point is essentially one concerning levels of selection. In archaea and bacteria, where cells reproduce clonally, there is limited propensity for the spread of mobile elements within a population; intragenomic proliferation of an element in a cell lineage is possible, but would impart a cost on that lineage. Consequently, the effect of proliferation is that the host is less able to compete with uninfected members of the population; between-cell selection should therefore dominate, and element-bearing lineages will tend to be lost from the population. Another possibility is of course that elements evolve to be benign (Arkhipova and Meselson, 1997), meaning they have a minimal capacity to spread—which seems the norm among bacterial transposable elements, in that, under normal cellular conditions, their mode of transposition is predominantly conservative, not replicative (Tavakoli and Derbyshire, 2001; Twiss et al., 2005). Such elements may also survive by horizontal spread (but not intragenomic proliferation), perhaps by hitchhiking on beneﬁcial genes they carry (Dobrindt et al., 2004; Hickey, 1992). 
This situation is vastly different in obligately sexually outcrossing lineages, where the effect of between-cell selection within the population is reduced, and elements can replicate despite a considerable cost on their host lineage (Dobrindt et al., 2004; Hickey, 1982, 1992). The result is thus effective spread and proliferation of mobile elements; it therefore follows that obligate sexual reproduction (and outcrossing) would favor intron proliferation (Hickey, 1992). Against this background, several recent hypotheses on the origin of the nucleus seem highly implausible. Martin and Koonin, and, independently, López-García and Moreira (Koonin, 2006; López-García and Moreira, 2006; Martin and Koonin, 2006) recently argued that, subsequent to the endosymbiotic origin of the mitochondrion in an archaeal host, the invasion of archaeal genes by group II introns would result in translation of aberrant gene products. Their reasoning is that, given modern rates of translation and splicing, ribosomal readthrough into introns would have been a problem unless these were physically separated. The physical separation of splicing (nuclear) and translation (cytoplasmic) would of course resolve this problem, and it was therefore hypothesized that readthrough generated conditions for the selection of nuclear compartmentation. Significantly, the model has effectively been tested by a natural observation (see Poole (2006)): group II introns have recently entered archaea, Methanosarcina specifically (Dai and Zimmerly, 2003; Rest and Mindell, 2003), re-creating the condition hypothesized in the readthrough models: an archaeal genome invaded by group II introns. The documented behavior of the group II introns in Methanosarcina, however, conforms with Hickey’s general predictions concerning element spread in an asexual lineage, and not with those proposed in the various studies by Martin, Koonin, López-García, and Moreira.
Group II introns in Methanosarcina were not documented to jump into archaeal open reading frames; instead, they were found in intergenic regions, and primarily jumped into one another, creating a nested intron architecture.
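The levels-of-selection argument summarized above can be illustrated with a deliberately simple recursion. This is a toy haploid model written for this purpose, not Hickey's own formulation: the function names, the parameter values, and the assumption that a mixed mating passes the element to a fixed fraction `transmission` of offspring are all illustrative choices.

```python
def next_freq(p: float, cost: float, sexual: bool,
              transmission: float = 1.0) -> float:
    """Advance the frequency p of element-bearing cells by one generation.

    cost         -- fitness reduction of carriers (assumed, 0 <= cost < 1)
    sexual       -- True for obligate random outcrossing, False for clonal growth
    transmission -- fraction of offspring from carrier x non-carrier matings
                    that inherit the element (1.0 = the element copies itself
                    into the naive genome; 0.5 = ordinary Mendelian inheritance)
    """
    if sexual:
        # Carrier x carrier matings always transmit; mixed matings transmit
        # with probability `transmission`.
        p = p * p + 2.0 * p * (1.0 - p) * transmission
    # Between-cell selection: carriers have relative fitness 1 - cost.
    return p * (1.0 - cost) / (1.0 - cost * p)

def simulate(p0: float, cost: float, sexual: bool, generations: int,
             transmission: float = 1.0) -> float:
    """Iterate next_freq for a number of generations and return the frequency."""
    p = p0
    for _ in range(generations):
        p = next_freq(p, cost, sexual, transmission)
    return p

if __name__ == "__main__":
    p0, gens = 0.01, 50
    print("clonal, cost 0.2:    ", simulate(p0, 0.2, sexual=False, generations=gens))
    print("outcrossing, cost 0.2:", simulate(p0, 0.2, sexual=True, generations=gens))
    print("outcrossing, cost 0.6:", simulate(p0, 0.6, sexual=True, generations=gens))
```

Under these assumptions a costly element is purged from a clonal population by between-cell selection, but spreads in an outcrossing population as long as its cost is below roughly one half, because each mixed mating converts a naive lineage into a carrier. With `transmission=0.5` the outcrossing case collapses back to the clonal one, which shows that the advantage comes from over-transmission rather than from sex as such.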


Clarity can be gained by splitting this scenario into its component parts. First, intron proliferation in an archaeal host at the level invoked (Koonin, 2006; López-García and Moreira, 2006; Martin and Koonin, 2006) cannot account for the origin of the nucleus. Given the significant intron density predicted for LECA (Jeffares et al., 2006; Roy and Gilbert, 2005), together with evidence for a substantial spliceosomal apparatus (Collins and Penny, 2005), initial intron proliferation likely occurred in the eukaryote stem, and it seems reasonable to propose that this occurred subsequent to the origin of meiosis, which is also likely to be a feature of LECA (Cavalier-Smith, 2002a; Egel and Penny, 2007; Ramesh et al., 2005). (Note that while meiosis is a probable feature of LECA, recombination evolved much earlier and was arguably even a feature of the RNA world (Lehman, 2003; Reanney, 1984, 1987); there are also links between meiosis and mitosis (see Egel and Penny (2007)), so dating the origin of meiosis to LECA should not be taken to imply that the process in its entirety necessarily evolved in some brief period that excludes the eukaryote stem, or earlier stages.) That secondary intron loss appears common among unicellular lineages with large effective population sizes where sexual reproduction is not always obligate (Jeffares et al., 2006; Mourier and Jeffares, 2003; Roy and Gilbert, 2006) further fits this picture. The merit of the original mitochondrial seed hypothesis compared to the newer archaeal host models is consequently that, in the former, intron spread is not precluded on the basis of the host. The second component is the scenario for the origin of the nucleus based on intron proliferation. This can potentially be rescued if the host is instead a stem group sexual eukaryote, as under conditions favoring intron proliferation, the selective conditions invoked as favoring the origin of the nucleus could still hold in principle. 
Whether such a revised model provides an adequate explanation for the origin of the nucleus is harder to ascertain, and one obvious question is why aberrant readthrough is not frequently observed with insertion of mobile, self-splicing introns. To return to the two phylogenies in Figure 4.1a and b, while these may have profound implications concerning the various intron origin hypotheses, again, intron proliferation is possible provided a stem is invoked or a direct archaeal host is avoided.

4.4 CONCLUDING REMARKS

The main point of this chapter is to illustrate the value of applying stem and crown group concepts to the question of the origin of the eukaryote cell. I have argued that doing so avoids the traps that come with invoking special cases in order to explain phylogenies built solely from sequences from extant (crown group) taxa. Furthermore, it enables us to separate a particular phylogenetic result from speculation regarding the series of events leading to the origin of modern crown group eukaryotes; this in turn should allow more informed and nuanced discussion and debate. Unfortunately, it seems unlikely that evolution along the eukaryote stem will be enriched and augmented with informative fossil data, and as such it may be difficult to arrive at a definitive account of the evolution of modern eukaryotes. While there may not be fossils illuminating the series of intermediates leading to crown eukaryotes, frugal speculation (i.e., preferring known biological mechanisms over those for which no known precedent exists) and due consideration of theory (the genetics of mobile element spread in the example given in Section 4.3.2) can nevertheless be informative in shaping a partial picture; I am optimistic that, in this way, our picture of eukaryote origins can be further embellished.

REFERENCES

AMOS, L.A., VAN DEN ENT, F., and LOWE, J., 2004. Structural/functional homology between the bacterial and eukaryotic cytoskeletons. Curr. Opin. Cell. Biol. 16: 24–31. ARCHIBALD, J.M., 2005. Jumping genes and shrinking genomes – probing the evolution of eukaryotic photosynthesis with genomics. IUBMB Life 57: 539–547. ARKHIPOVA, I. and MESELSON, M., 2005. Deleterious transposable elements and the extinction of asexuals. Bioessays 27: 76–85. BAPTESTE, E., CHARLEBOIS, R.L., MACLEOD, D., and BROCHIER, C., 2005. The two tempos of nuclear pore complex evolution: highly adapting proteins in an ancient frozen structure. Genome Biol. 6: R85. BAUMANN, P., BAUMANN, L., LAI, C.Y., ROUHBAKHSH, D., MORAN, N.A., and CLARK, M.A., 1995. Genetics, physiology, and evolutionary relationships of the genus Buchnera: intracellular symbionts of aphids. Annu. Rev. Microbiol. 49: 55–94. BELFORT, M. and WEINER, A., 1997. Another bridge between kingdoms: tRNA splicing in archaea and eukaryotes. Cell 89: 1003–1006. BELL, P.J., 2001. Viral eukaryogenesis: was the ancestor of the nucleus a complex DNA virus? J. Mol. Evol. 53: 251–256. BENDICH, A.J. and DRLICA, K., 2000. Prokaryotic and eukaryotic chromosomes: what’s the difference? Bioessays 22: 481–486. BENINATI, T., LO, N., SACCHI, L., GENCHI, C., NODA, H., and BANDI, C., 2004. A novel alpha-Proteobacterium resides in the mitochondria of ovarian cells of the tick Ixodes ricinus. Appl. Environ. Microbiol. 70: 2596–2602. BLAKE, C.C.F., 1978. Do genes-in-pieces imply proteins-in-pieces? Nature 273: 267. BONEN, L. and VOGEL, J., 2001. The ins and outs of group II introns. Trends Genet. 17: 322–331. BRINKMANN, H. and PHILIPPE, H., 2007. The diversity of eukaryotes and the root of the eukaryotic tree. Adv. Exp. Med. Biol. 607: 20–37. BUDD, G., 2001. Climbing life’s tree. Nature 412: 487. BUDD, G.E. and JENSEN, S., 2000. A critical reappraisal of the fossil record of the bilaterian phyla. Biol. Rev. Camb. Philos. Soc. 75: 253–295. 
BUTTERFIELD, N.J., 2000. Bangiomorpha pubescens n. gen., n. sp.: implications for the evolution of sex, multicellularity, and the Mesoproterozoic/Neoproterozoic radiation of eukaryotes. Paleobiology 26: 386–404. BUTTERFIELD, N.J., 2007. Macroevolution and macroecology through deep time. Palaeontology 50: 41–55. CAVALIER-SMITH, T., 1983. A six-kingdom classiﬁcation and a uniﬁed phylogeny. Endocytobiology 2: 1027–1034. CAVALIER-SMITH, T., 1991. Intron phylogeny: a new hypothesis. Trends Genet. 7: 145–148. CAVALIER-SMITH, T., 2002a. Origins of the machinery of recombination and sex. Heredity 88: 125–141. CAVALIER-SMITH, T., 2002b. The phagotrophic origin of eukaryotes and phylogenetic classiﬁcation of Protozoa. Int. J. Syst. Evol. Microbiol. 52: 297–354.

CHEN, C.W., HUANG, C.H., LEE, H.H., TSAI, H.H., and KIRBY, R., 2002. Once the circle has been broken: dynamics and evolution of Streptomyces chromosomes. Trends Genet. 18: 522–529. COLLINS, L. and PENNY, D., 2005. Complex spliceosomal organization ancestral to extant eukaryotes. Mol. Biol. Evol. 22: 1053–1066. DACKS, J.B. and FIELD, M.C., 2007. Evolution of the eukaryotic membrane-trafﬁcking system: origin, tempo and mode. J. Cell Sci. 120: 2977–2985. DACKS, J. and ROGER, A.J., 1999. The ﬁrst sexual lineage and the relevance of facultative sex. J. Mol. Evol. 48: 779–783. DACKS, J.B., DAVIS, L.A., SJOGREN, A.M., ANDERSSON, J.O., ROGER, A.J., and DOOLITTLE, W.F., 2003. Evidence for Golgi bodies in proposed ‘Golgi-lacking’ lineages. Proc. Biol. Sci. 270 (Suppl. 2): S168–171. DACKS, J.B., POON, P.P., and FIELD, M.C., 2008. Phylogeny of endocytic components yields insight into the process of nonendosymbiotic organelle evolution. Proc. Natl. Acad. Sci. USA 105: 588–593. DAGAN, T. and MARTIN, W., 2007. Testing hypotheses without considering predictions. Bioessays 29: 500–503. DAI, L. and ZIMMERLY, S., 2003. ORF-less and reverse-transcriptase-encoding group II introns in archaebacteria, with a pattern of homing into related group II intron ORFs. RNA 9: 14–19. DAUBIN, V., GOUY, M., and PERRIERE, G., 2001. Bacterial molecular phylogeny using supertree approach. Genome Inform. 12: 155–164. DAUBIN, V., GOUY, M., and PERRIERE, G., 2002. A phylogenomic approach to bacterial phylogeny: evidence of a core of genes sharing a common history. Genome Res. 12: 1080–1090. DAVIDOV, Y. and JURKEVITCH, E., 2007. Comments of Poole and Penny’s essay “Evaluating hypotheses for the origin of eukaryotes”. Bioessays 29: 74–84. DE ROOS, A.D., 2006. The origin of the eukaryotic cell based on conservation of existing interfaces. Artif. Life 12: 513–523. DELSUC, F., BRINKMANN, H., and PHILIPPE, H., 2005. Phylogenomics and the reconstruction of the tree of life. Nat. Rev. Genet. 6: 361–375. 
DENNIS, P.P. and OMER, A., 2005. Small non-coding RNAs in Archaea. Curr. Opin. Microbiol. 8: 685–694. DEVOS, D., DOKUDOVSKAYA, S., ALBER, F., WILLIAMS, R., CHAIT, B.T., SALI, A., and ROUT, M.P., 2004. Components of coated vesicles and nuclear pore complexes share a common molecular architecture. PLoS Biol. 2: e380. DOBRINDT, U., HOCHHUT, B., HENTSCHEL, U., and HACKER, J., 2004. Genomic islands in pathogenic and environmental microorganisms. Nat. Rev. Microbiol. 2: 414–424. DONOGHUE, P.C.J., 2005. Saving the stem group—a contradiction in terms? Paleobiology 31: 553–558. DONOGHUE, P.C.J. and PURNELL, M.A., 2005. Genome duplication, extinction and vertebrate evolution. Trends Ecol. Evol. 20: 312–319.

DOOLITTLE, W.F., 1978. Genes in pieces: were they ever together? Nature 272: 581–582. DOUGLAS, A.E., 2007. Symbiotic microorganisms: untapped resources for insect pest control. Trends Biotechnol. 25: 338–342. EGEL, R. and PENNY, D., 2007. On the origin of meiosis in eukaryotic evolution: coevolution of meiosis and mitosis from feeble beginnings. In Recombination and Meiosis (eds R. Egel and D.-H. Lankenau). Springer-Verlag, Berlin. EMBLEY, T.M. and HIRT, R.P., 1998. Early branching eukaryotes? Curr. Opin. Genet. Dev. 8: 624–629. EMBLEY, T.M., 2006. Multiple secondary origins of the anaerobic lifestyle in eukaryotes. Philos. Trans. R. Soc. Lond. B Biol. Sci. 361: 1055–1067. EMBLEY, T. M. and MARTIN, W., 2006. Eukaryotic evolution, changes and challenges. Nature 440: 623–630. ESSER, C., AHMADINEJAD, N., WIEGAND, C., ROTTE, C., SEBASTIANI, F., GELIUS-DIETRICH, G., HENZE, K., KRETSCHMANN, E., RICHLY, E., LEISTER, D., et al., 2004. A genome phylogeny for mitochondria among alpha-proteobacteria and a predominantly eubacterial ancestry of yeast nuclear genes. Mol. Biol. Evol. 21: 1643–1660. FORTERRE, P. and PHILIPPE, H., 1999. Where is the root of the universal tree of life? Bioessays 21: 871–879. FUERST, J.A., 2005. Intracellular compartmentation in planctomycetes. Annu. Rev. Microbiol. 59: 299–328. GASPIN, C., CAVAILLE, J., ERAUSO, G., and BACHELLERIE, J.P., 2000. Archaeal homologs of eukaryotic methylation guide small nucleolar RNAs: lessons from the Pyrococcus genomes. J. Mol. Biol. 297: 895–906. GILBERT, W., 1978. Why genes in pieces? Nature 271: 501. GILBERT, W. and DE SOUZA, S.J., 1999. Introns and the RNA world. In The RNA World (eds R.F. Gesteland, T.R. Cech, and J.F. Atkins). Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, pp. 221–231. GLANSDORFF, N., XU, Y., and LABEDAN, B., 2008. The last universal common ancestor: emergence, constitution and genetic legacy of an elusive forerunner. Biol. Direct 3: 29. GRATIA, J.P., 2005. 
Noncomplementing diploidy resulting from spontaneous zygogenesis in Escherichia coli. Microbiology 151: 2947–2959. GRATIA, J.P., 2007a. Spontaneous zygogenesis (Z-mating) in mecillinam-rounded bacteria. Arch. Microbiol. 188: 565–574. GRATIA, J.P., 2007b. Spontaneous zygogenesis, a wide-ranging mating process in bacteria. Res. Microbiol. 158: 671–678. GRATIA, J.P. and THIRY, M., 2003. Spontaneous zygogenesis in Escherichia coli, a form of true sexuality in prokaryotes. Microbiology 149: 2571–2584. GRAUMANN, P.L., 2007. Cytoskeletal elements in bacteria. Annu. Rev. Microbiol. 61: 589–618. GREUB, G. and RAOULT, D., 2004. Microorganisms resistant to free-living amoebae. Clin. Microbiol. Rev. 17: 413–433. GUERRERO, R., PEDROS-ALIO, C., ESTEVE, I., MAS, J., CHASE, D., and MARGULIS, L., 1986. Predatory prokaryotes: predation and primary consumption evolved in bacteria. Proc. Natl. Acad. Sci. USA 83: 2138–2142. GUPTA, R.S. and GOLDING, G.B., 1996. The origin of the eukaryotic cell. Trends Biochem. Sci. 21: 166–171. HARRIS, J.K., KELLEY, S.T., SPIEGELMAN, G.B., and PACE, N.R., 2003. The genetic core of the universal ancestor. Genome Res. 13: 407–412. HARTMAN, H. and FEDOROV, A., 2002. The origin of the eukaryotic cell: a genomic investigation. Proc. Natl. Acad. Sci. USA 99: 1420–1425. HAUGEN, P., SIMON, D.M., and BHATTACHARYA, D., 2005. The natural history of group I introns. Trends Genet. 21: 111–119. HICKEY, D.A., 1982. Selfish DNA: a sexually-transmitted nuclear parasite. Genetics 101: 519–531. HICKEY, D.A., 1992. Evolutionary dynamics of transposable elements in prokaryotes and eukaryotes. Genetica 86: 269–274. HORIIKE, T., HAMADA, K., KANAYA, S., and SHINOZAWA, T., 2001. Origin of eukaryotic cell nuclei by symbiosis of Archaea in Bacteria is revealed by homology-hit analysis. Nat. Cell Biol. 3: 210–214. JEFFARES, D.C., MOURIER, T., and PENNY, D., 2006. The biology of intron gain and loss. Trends Genet. 22: 16–22. JEFFROY, O., BRINKMANN, H., DELSUC, F., and PHILIPPE, H., 2006. Phylogenomics: the beginning of incongruence? Trends Genet. 22: 225–231. JÉKELY, G., 2003. Small GTPases and the evolution of the eukaryotic cell. Bioessays 25: 1129–1138. JÉKELY, G., 2007. Origin of phagotrophic eukaryotes as social cheaters in microbial biofilms. Biol. Direct 2: 3. JÉKELY, G., 2008. Origin of the nucleus and Ran-dependent transport to safeguard ribosome biogenesis in a chimeric cell. Biol. Direct 3: 31. JONES, R.I., 2000. Mixotrophy in planktonic protists: an overview. Freshwater Biol. 45: 219–226. KEELING, P.J., 1998. A kingdom’s progress: Archezoa and the origin of eukaryotes. Bioessays 20: 87–95. KNOLL, A.H., JAVAUX, E.J., HEWITT, D., and COHEN, P., 2006. Eukaryotic organisms in Proterozoic oceans. Philos. Trans. R. Soc. Lond. B Biol. Sci. 361: 1023–1038. KOONIN, E.V., 2006. The origin of introns and their role in eukaryogenesis: a compromise solution to the introns-early versus introns-late debate? Biol. Direct 1: 22. KOONIN, E.V., 2007. The Biological Big Bang model for the major transitions in evolution. Biol. Direct 2: 21. KOONIN, E.V. and MARTIN, W., 2005. On the origin of genomes and cells within inorganic compartments. Trends Genet. 21: 647–654. KURLAND, C.G., COLLINS, L.J., and PENNY, D., 2006. Genomics and the irreducible nature of eukaryote cells. Science 312: 1011–1014. LAKE, J.A., 1988. Origin of the eukaryotic nucleus determined by rate-invariant analysis of rRNA sequences. Nature 331: 184–186. LEHMAN, N., 2003. A case for the extreme antiquity of recombination. J. Mol. Evol. 56: 770–777.

LOGSDON, J.M., JR., 1998. The recent origins of spliceosomal introns revisited. Curr. Opin. Genet. Dev. 8: 637–648. LÓPEZ-GARCÍA, P. and MOREIRA, D., 1999. Metabolic symbiosis at the origin of eukaryotes. Trends Biochem. Sci. 24: 88–93. LÓPEZ-GARCÍA, P. and MOREIRA, D., 2006. Selective forces for the origin of the eukaryotic nucleus. Bioessays 28: 525–533. LYKKE-ANDERSEN, J., AAGAARD, C., SEMIONENKOV, M., and GARRETT, R.A., 1997. Archaeal introns: splicing, intercellular mobility and evolution. Trends Biochem. Sci. 22: 326–331. MANS, B.J., ANANTHARAMAN, V., ARAVIND, L., and KOONIN, E.V., 2004. Comparative genomics, evolution and origins of the nuclear envelope and nuclear pore complex. Cell Cycle 3: 1612–1637. MARGULIS, L., DOLAN, M.F., and GUERRERO, R., 2000. The chimeric eukaryote: origin of the nucleus from the karyomastigont in amitochondriate protists. Proc. Natl. Acad. Sci. USA 97: 6954–6959. MARTIN, W., 1999. A briefly argued case that mitochondria and plastids are descendants of endosymbionts, but that the nuclear compartment is not. Proc. R. Soc. Lond. B 266: 1387–1395. MARTIN, M.O., 2002. Predatory prokaryotes: an emerging research opportunity. J. Mol. Microbiol. Biotechnol. 4: 467–477. MARTIN, W. and KOONIN, E.V., 2006. Introns and the origin of nucleus-cytosol compartmentalization. Nature 440: 41–45. MARTIN, W. and MULLER, M., 1998. The hydrogen hypothesis for the first eukaryote. Nature 392: 37–41. MARTIN, W., HOFFMEISTER, M., ROTTE, C., and HENZE, K., 2001. An overview of endosymbiotic models for the origins of eukaryotes, their ATP-producing organelles (mitochondria and hydrogenosomes), and their heterotrophic lifestyle. Biol. Chem. 382: 1521–1539. MOREIRA, D. and LÓPEZ-GARCÍA, P., 1998. Symbiosis between methanogenic archaea and delta-proteobacteria as the origin of eukaryotes: the syntrophic hypothesis. J. Mol. Evol. 47: 517–530. MOURIER, T. and JEFFARES, D.C., 2003. Eukaryotic intron loss. Science 300: 1393. 
MULLER, S., LECLERC, F., BEHM-ANSMANT, I., FOURMANN, J.B., CHARPENTIER, B., and BRANLANT, C., 2008. Combined in silico and experimental identification of the Pyrococcus abyssi H/ACA sRNAs and their target sites in ribosomal RNAs. Nucleic Acids Res. 36: 2459–2475. MUNSON, M.A., BAUMANN, P., CLARK, M.A., BAUMANN, L., MORAN, N.A., VOEGTLIN, D.J., and CAMPBELL, B.C., 1991. Evidence for the establishment of aphid-eubacterium endosymbiosis in an ancestor of four aphid families. J. Bacteriol. 173: 6321–6324. NAKAMURA, T.M. and CECH, T.R., 1998. Reversing time: origin of telomerase. Cell 92: 587–590. OKAMOTO, N. and INOUYE, I., 2005. A secondary symbiosis in progress? Science 310: 287.

OMER, A.D., LOWE, T.M., RUSSELL, A.G., EBHARDT, H., EDDY, S.R., and DENNIS, P.P., 2000. Homologs of small nucleolar RNAs in Archaea. Science 288: 517–522. PENNY, D. and POOLE, A., 1999. The nature of the last universal common ancestor. Curr. Opin. Genet. Dev. 9: 672–677. PHILIPPE, H., GERMOT, A., and MOREIRA, D., 2000. The new phylogeny of eukaryotes. Curr. Opin. Genet. Dev. 10: 596–601. PISANI, D., COTTON, J.A., and MCINERNEY, J.O., 2007. Supertrees disentangle the chimerical origin of eukaryotic genomes. Mol. Biol. Evol. 24: 1752–1760. POOLE, A.M., 2006. Did group II intron proliferation in an endosymbiont-bearing archaeon create eukaryotes? Biol Direct 1: 36. POOLE, A. and PENNY, D., 2001. Does endosymbiosis explain the origin of the nucleus? Nat. Cell Biol. 3: E173–174. POOLE, A. and PENNY, D., 2007a. Eukaryote evolution: engulfed by speculation. Nature 447: 913. POOLE, A. and PENNY, D., 2007b. Evaluating hypotheses for the origin of eukaryotes. Bioessays 29: 74–84. POOLE, A. and PENNY, D., 2007c. Response to Dagan and Martin. Bioessays 29: 611–614. POOLE, A., JEFFARES, D., and PENNY, D., 1999. Early evolution: prokaryotes, the new kids on the block. Bioessays 21: 880–889. RAMESH, M.A., MALIK, S.B., and LOGSDON, J.M., JR., 2005. A phylogenomic inventory of meiotic genes; evidence for sex in Giardia and an early eukaryotic origin of meiosis. Curr. Biol. 15: 185–191. RAVEN, J.A., 1997. Phagotrophy in phototrophs. Limnol. Oceanogr. 42: 198–205. REANNEY, D.C., 1984. RNA splicing as an error-screening mechanism. J. Theor. Biol. 110: 315–321. REANNEY, D.C., 1987. Genetic error and genome design. Cold Spring Harbor Symp. Quant. Biol. 52: 751–757. REST, J.S. and MINDELL, D.P., 2003. Retroids in archaea: phylogeny and lateral origins. Mol. Biol. Evol. 20: 1134–1142. RIVERA, M.C. and LAKE, J.A., 1992. Evidence that eukaryotes and eocyte prokaryotes are immediate relatives. Science 257: 74–76. RIVERA, M.C. and LAKE, J.A., 2004. 
The ring of life provides evidence for a genome fusion origin of eukaryotes. Nature 431: 152–155. RODRIGUEZ-TRELLES, F., TARRIO, R., and AYALA, F.J., 2006. Origins and evolution of spliceosomal introns. Annu. Rev. Genet. 40: 47–76. ROGER, A.J., 1999. Reconstructing early events in eukaryotic evolution. Am. Nat. 154: S146–S163. ROSENBERGER, C.M. and FINLAY, B.B., 2003. Phagocyte sabotage: disruption of macrophage signalling by bacterial pathogens. Nat. Rev. Mol. Cell. Biol. 4: 385–396. ROTTE, C. and MARTIN, W., 2001. Does endosymbiosis explain the origin of the nucleus? Nat. Cell Biol. 3: E173–174. ROY, S.W. and GILBERT, W., 2005. Complex early genes. Proc. Natl. Acad. Sci. USA 102: 1986–1991.

ROY, S.W. and GILBERT, W., 2006. The evolution of spliceosomal introns: patterns, puzzles and progress. Nat. Rev. Genet. 7: 211–221. RUSSELL, B., 1952. Is there a god? In The Collected Papers of Bertrand Russell, Vol. 11: Last Philosophical Testament, 1943-68 (eds J.G. Slater and P. Kollner). Routledge, London, 1977, pp. 542–548. SACCHI, L., BIGLIARDI, E., CORONA, S., BENINATI, T., LO, N., and FRANCESCHI, A., 2004. A symbiont of the tick Ixodes ricinus invades and consumes mitochondria in a mode similar to that of the parasitic bacterium Bdellovibrio bacteriovorus. Tissue Cell 36: 43–53. STEEL, M., DRESS, A.W., and BOCKER, S., 2000. Simple but fundamental limitations on supertree and consensus tree methods. Syst. Biol. 49: 363–368. STOECKER, D.K., 1999. Mixotrophy among dinoflagellates. J. Eukaryot. Microbiol. 46: 397–401. TAKEMURA, M., 2001. Poxviruses and the origin of the eukaryotic nucleus. J. Mol. Evol. 52: 419–425. TAVAKOLI, N.P. and DERBYSHIRE, K.M., 2001. Tipping the balance between replicative and simple transposition. EMBO J. 20: 2923–2930. TOURASSE, N.J. and GOUY, M., 1999. Accounting for evolutionary rate variation among sequence sites consistently changes universal phylogenies deduced from rRNA and protein-coding genes. Mol. Phylogenet. Evol. 13: 159–168. TWISS, E., COROS, A.M., TAVAKOLI, N.P., and DERBYSHIRE, K.M., 2005. Transposition is modulated by a diverse set of host factors in Escherichia coli and is stimulated by nutritional stress. Mol. Microbiol. 57: 1593–1607. VAN DEN ENT, F., AMOS, L.A., and LOWE, J., 2001. Prokaryotic origin of the actin cytoskeleton. Nature 413: 39–44. VAN DER GIEZEN, M. and TOVAR, J., 2005. Degenerate mitochondria. EMBO Rep. 6: 525–530.

VOLFF, J.N. and ALTENBUCHNER, J., 2000. A new beginning with new ends: linearisation of circular chromosomes during bacterial evolution. FEMS Microbiol. Lett. 186: 143–150. VON DOHLEN, C.D., KOHLER, S., ALSOP, S.T., and MCMANUS, W.R., 2001. Mealybug beta-proteobacterial endosymbionts contain gamma-proteobacterial symbionts. Nature 412: 433–436. WANG, M., YAFREMAVA, L.S., CAETANO-ANOLLES, D., MITTENTHAL, J.E., and CAETANO-ANOLLES, G., 2007. Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world. Genome Res. 17: 1572–1585. WILKINSON, M., COTTON, J.A., CREEVEY, C., EULENSTEIN, O., HARRIS, S.R., LAPOINTE, F.J., LEVASSEUR, C., MCINERNEY, J.O., PISANI, D., and THORLEY, J.L., 2005. The shape of supertrees to come: tree shape related properties of fourteen supertree methods. Syst. Biol. 54: 419–431. WOESE, C.R. and FOX, G.E., 1977. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc. Natl. Acad. Sci. USA 74: 5088–5090. WOESE, C.R., KANDLER, O., and WHEELIS, M.L., 1990. Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc. Natl. Acad. Sci. USA 87: 4576–4579. WOODHAMS, M.D., STADLER, P.F., PENNY, D., and COLLINS, L.J., 2007. RNase MRP and the RNA processing cascade in the eukaryotic ancestor. BMC Evol. Biol. 7 (Suppl. 1): S13. YUTIN, N., MAKAROVA, K.S., MEKHEDOV, S.L., WOLF, Y.I., and KOONIN, E.V., 2008. The deep archaeal roots of eukaryotes. Mol. Biol. Evol. 25: 1619–1630. ZILLIG, W., KLENK, H.P., PALM, P., LEFFERS, H., PÜHLER, G., GROPP, F., and GARRETT, R.A., 1989. Did eukaryotes originate by a fusion event? Endocyt. Cell Res. 6: 1–25.

Chapter 5

The Role of Information in Evolutionary Genomics of Bacteria

Antoine Danchin and Agnieszka Sekowska

5.1 INTRODUCTION
5.2 REVISITING INFORMATION
5.3 UBIQUITOUS FUNCTIONS FOR LIFE
5.4 THE CENOME AND THE PALEOME
5.5 FUNCTIONS CORRESPONDING TO NONESSENTIAL PERSISTENT GENES
5.6 A UBIQUITOUS INFORMATION-GAINING PROCESS: MAKING A YOUNG ORGANISM FROM AN AGED ONE
5.7 PROVISIONAL CONCLUSION
ACKNOWLEDGMENTS
REFERENCES

5.1 INTRODUCTION

This chapter is written in the spirit with which Claude Bernard wrote his Introduction à l’étude de la médecine expérimentale, deliberately placing biology within the realm of physics and chemistry, albeit in a modern context: En résumé, le but de la science est partout identique : connaître les conditions matérielles des phénomènes. Mais si ce but est le même dans les sciences physico-chimiques et dans les sciences biologiques, il est beaucoup plus difficile à atteindre dans les dernières, à cause de la mobilité et de la complexité des phénomènes qu’on y rencontre¹ (Bernard, 1865).

¹ In summary, the goal of science is everywhere the same, to know the material conditions of phenomena. However if this goal is the same in physico-chemical sciences and in biological sciences, it is much more difficult to reach in the latter, because of the instability and the complexity of the phenomena met there.

Since the time of Claude Bernard, physics has witnessed an extraordinary revolution that unveiled a deep insufficiency in the tight links associating the four categories of Nature (matter, energy, space, and time), as revealed by the famous equation by Albert Einstein, $E = mc^2$. Indeed, a deep revolution, in 1905, began to introduce a formal description of Nature in quantum physics, based on a completely different mathematical formalism, which separated classical physics from quantum physics. The separation was so deep that Einstein himself saw that resolving the inherent contradictions of these two different ways of representing Nature, classical and quantum physics, required the existence of “hidden variables” that would make it possible to harmonize both models into a single self-consistent one. Together with other scientists, he proposed experiments to test whether this hypothesis of missing variables held. We now know that all these experiments, and many others constructed since then, refute the idea of hidden variables. However, there is a remarkable way out, which, as we shall see, is central to our view of biology: one has to add to the four categories of Nature a fifth one, information, which is naturally still in the process of being correctly represented (it is almost entirely novel, despite a long tradition, going back to Aristotle, that discussed the role of form). A quotation by Andrew Steane summarizes the situation in physics: Historically, much of fundamental physics has been concerned with discovering the fundamental particles of nature and the equations which describe their motions and interactions. It now appears that a different programme may be equally important: to discover the ways that nature allows, and prevents, information [my emphasis] to be expressed and manipulated, rather than particles to move (Steane, 1998). The stance that will be taken here is to take this remark by a physicist seriously, and to consider information as central to understanding what life is. 
Furthermore, we shall endeavor to place in context the relationships within living organisms between the first four categories (which make up the core of the cell’s architecture and dynamics) and the site where information is unfolded, the genetic program with its support, the DNA molecule forming the genome of the organism. To help readers better understand this viewpoint, and to place them on steady and familiar ground, we chose to recall an old enigma, posed by the Pythia at the Temple of Delphi to passersby who came to ask her about their future. She asked the question: I have a boat made of planks of pine, that are replaced while they decay; after some time, all the planks have been changed, is this the same boat? Yes it is, indeed, and what has been kept unchanged is the making up of the boat, the relationships between the planks, the form of the boat and so on. We obviously see here that in addition to matter, information is a central part of the object of interest. Matter can even be changed (pine can be replaced by oak to make the boat age more slowly), and provided the planks are well adjusted (and not too heavy), the boat will keep its function, to float and transport passengers and goods, demonstrating that the properties of the matter used here are not specific to the nature of the boat (it could even be made of metal or cement) (Danchin, 2003). This is typically one of the endeavors of those daring ones among us who try to change some of the basic bricks making life (such as amino acids (Wang et al., 2006) or nucleotides (Adelfinskaya et al., 2007)), in some of the efforts of synthetic biology (de Lorenzo and Danchin, 2008). Before going on, we ask the reader to be open minded and accept the challenge of relating unexpected developments of genomics that have their roots in very abstract regions of knowledge with more familiar biology. 
It is also important to note that some of the literature we will quote does not appear in journals or books familiar to microbiologists, nor is it always indexed in PubMed (fortunately, however, important papers such as those of Alan Turing are readily available on the World Wide Web). This chapter aims at stimulating curiosity, enticing the reader to go well beyond the standard literature, which despite its enormous size is unfortunately ignorant of the past, and of domains of science that have strong bearings on the very nature of biology.


5.2 REVISITING INFORMATION

We all use the word “information” repeatedly (539,302 references for this word at PubMed as of November 22, 2008, and see a general discussion in Danchin, 2003). This is true in particular when we speak about the genetic program and the processes of information transfer (1248 references for this expression in PubMed at the same date, mostly based on work in neurobiology and in molecular biology), namely, transcription and translation. But do we know exactly what this word means? Information is one of those “prospective” concepts that change in the course of being discussed (Myhill, 1952). The first quantitative mathematical description of the concept came from the work of Claude Shannon, when analyzing the limits of communication of strings of symbols (Shannon and Weaver, 1949). Interestingly, the theory of communication he established was only interested in measuring whether the integrity of the string of symbols (a message) was preserved, without taking into account its meaning (Cover and Thomas, 1991). In the biological context, this is exactly what replication does, when a DNA molecule is duplicated into an (almost) identical molecule. This is, however, very far from what we would like to say about information, and this is certainly very far from the way information is exploited at the level of transcription or translation. When the structure of DNA was discovered with its clear “aperiodic crystal” character (Schrödinger, 1945), a few original scientists stressed the importance of the concept and its limitations. In 1953, the physician Henry Quastler was among the first to realize the importance of information theory and coding in molecular biology (Quastler, 1953). His interest was however mostly driven by the question raised by the nature of the brain and its remarkable properties of learning, memory, and consciousness. 
However, together with the physicist Hubert Yockey, he organized a Symposium on Information Theory in Biology (Quastler et al., 1958), where topics on what was becoming molecular biology were discussed extensively. In a short work published posthumously, Quastler further developed a theory of biological organization starting with the enigma of the origin of life. The author’s emphasis on the problem of creation of information in simple cells is important (Quastler, 1964). In a nutshell, most investigators at the time (and still many today) thought intuitively that creation of information required energy (for a historical discussion, see Danchin (2003)) in such large amounts that what we could observe in life appeared quite mysterious (typically 1/2 kT for one bit of information). This is not so however: creation of information is reversible, as demonstrated by Landauer (1961), and therefore the process does not require energy. This remarkable property of the physical world, which however is not discussed here in detail (see Danchin (2003, 2008b) for a general presentation), has the consequence that accumulation of information by living organisms is not a paradox and can be associated with explicit molecular processes, as we shall see subsequently. Although sometimes a good starting point for initiating a research programme, intuition cannot be a reliable way to understand the nature of things. And, as a matter of fact, the use of the word “entropy” by Shannon to characterize information triggered considerable misunderstandings (Danchin, 1986, 1996, 2003). The parallel works of Kolmogorov in the Soviet Union and Solomonoff and Chaitin in the United States aimed at defining information not in terms of communication but in terms of meaning. This required distinguishing a random sequence from a meaningful sequence (Cover and Thomas, 1991). 
The concept of algorithmic complexity characterizes a sequence by the length of the shortest algorithm needed to generate it: with this definition, based on sequence compression, a random sequence is said to have maximum algorithmic complexity (it cannot be compressed to a length
shorter than itself) while a repeated sequence would be of low complexity.² A further concept shows that much more can be said about the very nature of what information is (Danchin, 2003). In 1988, for example, the physicist Charles Bennett created the concept of logical depth, based on the remark that two sequences with the same algorithmic complexity might differ widely in the way they carry information. In a repeated sequence, the value of the nth symbol, for large n, is obtained in a straightforward way. By contrast, in sequences that are the result of a recursive algorithm, such as the sequence of the digits of π, one often cannot infer the nature of the nth symbol when n is large without running the algorithm, and this can take a very long time, possibly longer than the age of the Universe (Bennett, 1988). Bennett proposed that the time required to gain access to the corresponding information measures the logical depth of the sequence. Examples of the nontrivial features of algorithmically simple but logically deep sequences are the outcomes of algorithms generating fractal figures such as the Koch snowflake or the Mandelbrot set. Both these remarkable figures are generated by fairly short algorithms, but it is not possible to predict easily the color of a pixel until the algorithm has been run. In the case of genomic sequences, the very fact that DNA comes from DNA through generations suggests that any nucleotide in a sequence is logically deep, supporting the view that there is no such thing as “junk” DNA. This is a further indication that information cannot be derived from the four categories of Nature (matter, energy, space, and time), but is a category in itself. This also shows that we need to develop deeper formulations of what information is (Steane, 1998; to define critical depth, see Danchin (1996)). 
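The contrast drawn above between a repeated sequence and a random one can be made tangible with a general-purpose compressor. The sketch below is only an illustration under stated assumptions: it uses Python's standard `zlib` module as a crude, computable stand-in for algorithmic complexity (which is not computable in general), and the two test sequences and the helper name `compressed_fraction` are invented for the example.

```python
import random
import zlib


def compressed_fraction(seq: str) -> float:
    """Compressed size divided by original size (smaller = more compressible)."""
    raw = seq.encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)


rng = random.Random(0)   # fixed seed so that the run is repeatable
n = 10_000

# A sequence with an obvious short generating rule: "ACGT" repeated.
repeated = "ACGT" * (n // 4)
# A sequence with no rule that zlib can find: independent draws from ACGT.
shuffled = "".join(rng.choice("ACGT") for _ in range(n))

print(f"repeated sequence: {compressed_fraction(repeated):.3f}")
print(f"random sequence:   {compressed_fraction(shuffled):.3f}")
```

The repeated sequence shrinks to a small fraction of its length, while the random one stays close to the two bits per nucleotide that a four-letter alphabet requires. Note the caveat that makes the link with logical depth: the "random" string here is pseudo-random, so it is in fact produced by a short program (the generator plus its seed); a compressor such as zlib simply cannot discover that rule.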
At this point, despite the title of Yockey’s book (Information Theory and Molecular Biology), the reader may have the impression that we are very far from biology. However, many articles about genomes deal with Shannon’s information in a constructive way, and the “sequence logos” used to identify DNA binding sites for regulators, for example, are certainly quite familiar to many readers (Hénaut and Danchin, 1996; Schneider, 2006; Schneider and Stephens, 1990). Another example: in terms of algorithmic complexity, the concept of information measured as “sequence complexity” is now familiar to all investigators using BLAST complexity filters (Zhang and Madden, 1997), which are in fact a default setting for most BLAST searches. Finally, there is a wealth of other processes where biologists contemplate a direct involvement of information in their topic of interest, or when one deals with properties of the central nervous system. This is the case of the way molecular motors perform their task, and even molecules can act as “molecular information ratchets” (Serreli et al., 2007). In the same way, analysis of the exploration of space by male insects searching for females involves a process of infotaxis, which computes ways to reach the source of a specific pheromone in a highly turbulent environment, using information as the driving element permitting identification of the target (Vergassola et al., 2007). In what follows, we try to relate information to the way bacterial genomes are organized.
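As a reminder of how Shannon's measure enters the sequence logos mentioned above, the following minimal sketch computes the information content, in bits, of each position of a small alignment as log2(4) minus the Shannon entropy of the column. The four aligned sites are made up for illustration and do not come from any database, the function name `site_information` is hypothetical, and the small-sample correction used in published logo methods is deliberately omitted.

```python
import math
from collections import Counter


def site_information(column: str, alphabet_size: int = 4) -> float:
    """Information (bits) at one aligned position: log2(alphabet size) - entropy."""
    total = len(column)
    counts = Counter(column)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(alphabet_size) - entropy


# Hypothetical aligned binding sites (invented for the example).
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]

# Transpose the alignment: one string per aligned position.
columns = ["".join(site[i] for site in sites) for i in range(len(sites[0]))]
bits = [site_information(col) for col in columns]

for position, value in enumerate(bits, start=1):
    print(f"position {position}: {value:.2f} bits")
```

Fully conserved positions carry the maximum of 2 bits, and positions where one site out of four differs carry less; these values are what the letter-stack heights of a logo represent.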

² Note that this definition of complexity goes against a considerable number of usages that would certainly not identify complexity with increased randomness. This is why I consider that the word should be avoided at all costs, and that instead we should use the Greek equivalent, symplectic, when we wish to describe highly organized systems. See de Lorenzo and Danchin (2008).

5.3 UBIQUITOUS FUNCTIONS FOR LIFE

Life can be defined as combining two entities that rest on completely different physicochemical properties and on a particular way of handling information. The cell, first, is a “machine” that combines elements that are quite similar (although in a fairly fuzzy way) to those involved in a man-made factory. The machine combines two processes. First, it requires explicit compartmentalization, including scaffolding structures similar to the châssis of engineered machines. Cells clearly define an inside, the cytoplasm, and an outside. The cell envelope is more or less complicated in bacteria, where one can roughly distinguish monoderms, which have a single cytoplasmic membrane, from diderms, which have two membranes enclosing a periplasm (Gupta, 2000a). Second, the machine also requires dynamic chemical processes, metabolism, which can be split into intermediary metabolism, managing chemical transformations, transport of small building blocks, and management of energy (often with a rotating nanomachine, ATP synthase), and the macromolecule synthesis, salvage, and turnover machinery, which uses a variety of nanomachines, the ribosome being the most prominent one. The second entity that needs to be associated with life is the genetic program, in the form of the genome, composed of one or several chromosomes made of DNA. This is the entity that associates most clearly with information. As we saw, the latter is a fairly abstract concept. In Auguste Comte’s classification of sciences, mathematics is the basic reference for all sciences, while being the most difficult to grasp, and we subsequently move away from pure abstractness to physics, chemistry, and then finally biology. Because we are alive, we tend to think that life is easy to grasp. This certainly does not reflect reality. When molecules became central to life half a century ago (we only recently witnessed the death of some of their discoverers, Jacques Monod, Francis Crick, and Arthur Kornberg), biology climbed one step up toward abstractness. 
Then, with recognition of the abstract schemas dictating how genes are expressed and with in silico biology (Danchin et al., 1991), two further steps were ascended, placing biology not so far away from mathematics. This culminates today with the buzzword “Systems Biology,” while flesh is given to the image of the genetic program, which remains somewhat difficult to grasp unless one identifies that particular program with a computer program (Danchin, 2008a). Indeed, the association between a machine and a program that can be represented as a linear sequence of symbols is highly suggestive of the construction of a Turing Machine, the abstract representation of the human artifacts we have constructed to carry out all the operations of computing and logic (Turing, 1936/1937), the ubiquitous computers. Many features of cells suggest that this analogy is much deeper than a simple metaphor, and that the cell is a real implementation of a Turing Machine, with the remarkable feature that these particular instances of Turing Machines use their computing power to make Turing Machines (cells make cells) (Danchin, 1996, 2008b). Indeed, an experiment such as that of genome transplantation, with generation of a Mycoplasma mycoides species driven by the chromosome of a species (M. mycoides) differing from the initial host cell (Mycoplasma capricolum) (Lartigue et al., 2007), is an overwhelming argument in favor of this model. The common rebuttal raised against the idea of the cell as a Turing Machine, by those who are reluctant to leave the firm ground of traditional biology, is that its information content is much more extensive than that in its chromosome. 
This argument does not hold, however, as we meet exactly the same situation with real computers, which nobody would dispute are material implementations of Turing Machines: the engineering of the explicit machine that reads programs definitely requires much more information than that in the program loaded with the Operating System (OS) that makes the computer work. Some have further argued against the “computer” model by stating that in a biological machine it is not possible to completely separate the hardware from the software, and this is right. However, again exactly as in the case of the objection raised about the existence of information in the machine itself, in addition to that in the OS, the same holds true for the way the OS is
carried into the machine. As in the case of DNA, which carries the genetic program, an OS is an abstract entity, but to be usable it must be carried by concrete objects, such as flash memories, compact disks (CDs), or magnetic tapes. Let us imagine a computer driven by a program stored on a CD. Let us further suppose that the CD has been left standing for some time in the sun: it will be deformed and, despite the fact that the program it carries is unaltered, it will no longer be readable by the computer’s laser beam and the computer will not be able to use it to start. This does not alter the very existence of the abstract laws establishing what a computer is (a Turing Machine), but it tells us that in any real implementation of the Turing Machine, one cannot completely separate the hardware from the software. This is an important constraint that may explain the noticeable absence of a transplantation experiment in the recent synthesis of an artificial Mycoplasma genome, using Saccharomyces cerevisiae as an intermediary host (Gibson et al., 2008): it could well be that the resulting folding of the chromosome does not make it readable by the receiving Mycoplasma machinery (Peckham et al., 2007; Peter et al., 2004). This generalization of the concept of Turing Machine to the cell, which may be highly relevant to the processes involving accumulation of information in biology, cannot be discussed further here. This type of investigation of the role and form of information in molecular biology continues to be developed, and it is likely that new insights will appear in the next few years (Chaitin, 2007; Danchin, 1996; Lifson, 2005; Yockey, 1992). This model reminds us that there is a deep interaction between the level of information and that of matter, energy, space, and time. This implies that all processes involving the abstract entity, information, need to be concretely implemented. We therefore need to look for concrete objects that implement biological functions. 
This latter concept is subject to a large variety of interpretations, based on questions raised by the frequent teleological arguments used to account for the existence of a given function (Allen et al., 1998). We shall simply assume here that we use “function” with its commonsense, intuitive (inaccurate!) meaning, as do engineers when they construct a machine. Many ubiquitous functions are required to make a cell. They are needed to implement the essential macromolecular biosynthesis processes: transcription, translation, and replication. They are also essential to manage energy and exchanges between the inside and the outside of the cell. They need to control physical constraints, such as osmotic pressure or temperature. And, of course, there is a need to synthesize the building blocks that cannot be obtained outside the cell. A standard way to try to gain access to these ubiquitous functions is to think that they will be associated with ubiquitous structures, and this drove the quest for a “core genome” that would lie at the intersection of the genomes of all autonomous cells of a given clade because they share a common history (Daubin et al., 2002; Moore and Lindsay, 2001). Unfortunately, this approach presumes the existence of one common ancestor (named LUCA, for Last Universal Common Ancestor, by Christos Ouzounis (Kyrpides et al., 1999)), and it also implies that there is generally a bi-univocal (one-to-one) correspondence between structure and function, while, in fact, many structures can fulfill a given function (for a general discussion, see Danchin (2003)). Such structures may be recruited by horizontal gene transfer (Poptsova and Gogarten, 2007), after acquisitive evolution (Ashida et al., 2005; Thompson and Krawiec, 1983; Wright, 2004), or even created de novo (we do not have clear ideas, however, about that particular process, forming the path to gene creation, except for some hypotheses, such as the “gluon” hypothesis (Pascal et al., 2005)). 
This approach generally led to the identification of the so-called “housekeeping” genes, which are assumed to make up the minimal genome. As a matter of fact, these genes coincide with the approximately 250 genes first identified in silico (Danchin, 1988, 1995; Mushegian and
Koonin, 1996) and then found to be essential for growth under laboratory conditions (Baba et al., 2006; Joyce et al., 2006; Kobayashi et al., 2003). We therefore need to explore genomes further, under the difficult situation that even if we expect some functions to be ubiquitous, we do not expect the genes of interest to be strictly ubiquitous. Fortunately, because organisms derive from each other by descent, functional objects tend to persist through generations, allowing us to identify “persistent” genes, that is, genes that are present in a clique of genomes but not necessarily in all of them, by comparing a large number of genomes to model genomes (Fang et al., 2005). To substantiate that these genes indeed correspond to ubiquitous functions, we need to check that they display organizational properties shared by essential genes. This is done by fixing the threshold that sets the size of the clique according to properties that are specific to experimentally identified “essential” genes. This approach led us to identify a set of some 500 genes (twice the number of essential genes) (Fang et al., 2005) that retained the strong tendency of essential genes to be located in the leading DNA replication strand (Rocha and Danchin, 2003). Subsequently, knowing these genes permitted us to identify (most of) the ubiquitous functions we need to consider as essential to life and to identify the corresponding genes when they do not belong to the core genome gene set (see an example in Mechold et al. (2007)).
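The difference between a strict "core genome" and a set of "persistent" genes can be sketched in a few lines. This is a deliberately simplified illustration, not the procedure of Fang et al. (2005): persistence is reduced here to a plain frequency threshold over a presence/absence table, whereas the text describes a clique built around a model genome with a threshold calibrated on essential genes. The genome and gene labels, and the function names, are hypothetical.

```python
def core_genes(presence: dict[str, set[str]]) -> set[str]:
    """Genes present in every genome (strict intersection)."""
    return set.intersection(*presence.values())


def persistent_genes(presence: dict[str, set[str]], threshold: float) -> set[str]:
    """Genes present in at least `threshold` (between 0 and 1) of the genomes."""
    n_genomes = len(presence)
    counts: dict[str, int] = {}
    for genes in presence.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    return {gene for gene, c in counts.items() if c / n_genomes >= threshold}


# Toy presence/absence data (invented labels, not real genes or genomes).
presence = {
    "genome1": {"a", "b", "c", "d"},
    "genome2": {"a", "b", "c"},
    "genome3": {"a", "b", "e"},
    "genome4": {"a", "c", "f"},
}

print("core:      ", sorted(core_genes(presence)))
print("persistent:", sorted(persistent_genes(presence, threshold=0.75)))
```

In this toy table the strict intersection keeps a single gene, while the relaxed criterion also retains genes missing from one genome, which is the situation the persistent-gene approach is designed to handle when a function is ubiquitous but the gene carrying it is not.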

5.4 THE CENOME AND THE PALEOME

Analysis of a large number of bacterial genomes for ubiquitous functions allowed us to construct a set of persistent genes in two major bacterial clades, the gamma-Proteobacteria (with Escherichia coli as the model organism) and the Firmicutes (with Bacillus subtilis as the model organism) (Fang et al., 2005). We further analyzed their syntenies, as essential genes tend to remain clustered together. A detailed analysis of conservation of proximity of genes in genomes revealed a remarkable feature of their organization: both persistent genes and rare genes tend to stay clustered together (Danchin, 2007), making two highly consistent families of genes, separated by a large twilight zone (Figure 5.1). The cause of clustering of the genes that are rarely present in genomes is fairly well understood. It is the straightforward consequence of the behavior of horizontally transferred genes, interpreted as “selfish” in the usual anthropocentric view of Nature (Lawrence and Roth, 1996). Indeed, bacteria need not only to survive and perpetuate life but also to occupy a particular niche. Karl Möbius in 1877 referred to the common pool of living species in a particular environment as the biocenose (see, for example, Movila et al. (2006)). The concept of the gene did not exist at the time, but we now need to relate the idea of the biocenose to that of the genes that permit the cell to occupy an ecological niche. These genes are acquired from an unknown pool of genes by horizontal gene transfer and, as a consequence of the corresponding gene transfer processes (transformation, transduction, integration of prophages, and conjugation), they generally come in as clusters (Lawrence and Roth, 1996). This very large class, the cenome (after koino, common, as in biocenose (Danchin et al., 2007)), tends to comprise novel members in different strains of the same species. 
Taking into account the concept of the pan-genome, which puts together all the genes of a given species (Tettelin et al., 2005), the cenome of a given species is a subset of the pan-genome, comprising all the genes permitting any strain of that species to live in its favored niche. The number of cenome genes can be very large: in the case of E. coli, for which several tens of strains are already known, the number of genes in the cenome exceeds 20,000, with a fraction (usually 1000–1500 genes) present in each E. coli strain.
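The bookkeeping behind these numbers is plain set arithmetic, which the following sketch illustrates on invented data. It rests on a strong simplifying assumption: the genes shared by all strains are treated as a stand-in for the paleome and everything else as the species-wide pool from which each strain draws its niche genes, ignoring the intermediate "twilight zone". Strain and gene labels, and the function name, are hypothetical.

```python
def summarize_species(strains: dict[str, set[str]]) -> dict[str, object]:
    """Split the gene content of several strains into shared and variable parts."""
    pan_genome = set().union(*strains.values())      # every gene seen in any strain
    shared = set.intersection(*strains.values())     # genes present in all strains
    variable_pool = pan_genome - shared              # species-wide variable pool
    per_strain = {name: genes - shared for name, genes in strains.items()}
    return {
        "pan_genome": pan_genome,
        "shared": shared,
        "variable_pool": variable_pool,
        "per_strain": per_strain,
    }


# Toy strains: "p" genes are shared, "n" genes are strain-dependent (made up).
strains = {
    "strain1": {"p1", "p2", "p3", "n1", "n2"},
    "strain2": {"p1", "p2", "p3", "n3"},
    "strain3": {"p1", "p2", "p3", "n2", "n4", "n5"},
}

summary = summarize_species(strains)
print("pan-genome size:   ", len(summary["pan_genome"]))
print("shared genes:      ", sorted(summary["shared"]))
print("variable pool size:", len(summary["variable_pool"]))
for name, genes in summary["per_strain"].items():
    print(f"{name} carries {len(genes)} variable genes")
```

Even in this toy case the variable pool of the species is larger than what any single strain carries, which mirrors, at a much smaller scale, the contrast between a species-wide cenome of more than 20,000 genes and the 1000–1500 found in one strain.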

Figure 5.1 Gene clustering in an example of a bacterial genome. Genes are grouped in sets of 50, and the mutual attraction within each group is measured on the vertical axis (Fang et al., 2008). The horizontal axis displays groups of genes as a function of their frequency in bacterial genomes. The horizontal bar indicates the Kuiper’s test score for a p-value of 0.01. Rare genes, making up the cenome, are highly clustered. About 500 persistent genes are also significantly clustered. The intermediate twilight zone corresponds to genes that are present only in a fraction of the genomes, with less selection pressure for maintaining clustering. Modified from Danchin et al. (2007), p. 105, supplementary Figure 1.

Persistent genes are by definition frequent (with a caveat, however, as measuring frequency depends on the type and size of the set of genomes of interest), and they also tend to remain clustered together in different genomes (Fang et al., 2008). Remarkably, the connection network of the persistent genes coding for ubiquitous functions is reminiscent of a scenario of the origin of life. This suggested the name paleome (from palaio, ancient), as the network recapitulates the three phases of a scenario based on surface metabolism, uncovered via analysis of extant metabolism (Granick, 1957): synthesis of small molecules on solid surfaces (including ribonucleotides and coenzymes), substitution of solid particle surfaces by an RNA world where transfer RNA played a central role, and invention of template-mediated information transfer (Danchin, 1989). We also found that paleomes constructed from seed genomes of different clades differed in some of their genes (Fang et al., 2005), showing that besides the core genome that shares a common evolutionary history (Daubin et al., 2002) (but whose elements tend to vanish as more genomes are included in the analysis), there are several coevolving genes whose function makes the paleome functions differ in different clades (Danchin, 2009; Fang et al., 2005; Mechold et al., 2007). A study of the coevolution of the corresponding genes in all deep-rooted bacterial clades is therefore a must for in silico research in the next few years.

5.5 FUNCTIONS CORRESPONDING TO NONESSENTIAL PERSISTENT GENES

Besides genes coding for unknown functions, which deserve to be studied as a priority, two major types of functions are coded by nonessential persistent genes. First, one observes a few metabolic functions that are likely to play a role as “metabolic patches.” Indeed, biomolecules are always chemically reactive entities, which in some environments may stumble (as life itself must have stumbled at its origin) on very concrete dead ends. Some objects may be mutually incompatible (something that physicists often refer to as “frustration,” mostly considering energy states), and this needs to be taken care of by special metabolic pathways (which, however, while indispensable, may differ in different species). As an example, in the list of persistent genes based on the E. coli model genome, we have identified the puzzling presence of a purely catabolic enzyme system, serine dehydratase (Sda) (Fang et al., 2005). This indicates that serine is an amino acid that may play a complex role in the process of metabolic reproduction. We indeed showed long ago that serine is toxic in E. coli for nontrivial reasons (Uzan and Danchin, 1978), and our unpublished work has shown that the toxicity of this amino acid is probably ubiquitous and certainly unrelated to the similarity between serine and threonine. Distribution of the sda gene within persistent gene sets constructed using a variety of model organisms tells us that several genes coevolve with this gene family, opening novel threads in our quest for the processes essential to metabolic reproduction. Second, we observe a large number of genes involved in degradation processes, usually in an energy-dependent fashion. Many are involved in RNA degradation. In fact, when E. coli is used as the model organism to identify persistent genes, all the genes of the degradosome (Carpousis, 2007) are found in the set. 
This allowed us to identify the likely counterpart of this complex structure in B. subtilis (Danchin, 2009). Remarkably, these genes are closely associated with processes that use energy, a puzzling property for degradation processes, which would generally produce energy rather than use it. Most of the corresponding functions are annotated as related to the processes of maintenance, management of stress, or environmental transitions and repair processes (Fang et al., 2005). In a nutshell, we have a significant number of functions that do not appear to be absolutely essential, as inactivation of the corresponding genes still lets the mutants form colonies on plates in the laboratory (Baba et al., 2006; Kobayashi et al., 2003), but which more often than not involve significant energy consumption.

5.6 A UBIQUITOUS INFORMATION-GAINING PROCESS: MAKING A YOUNG ORGANISM FROM AN AGED ONE

Let us now have a second look at the functions encoded in the paleome. We have just seen that rapid inspection suggests that, besides functions required for anabolism and replication, many functions involved in maintenance, repair, and sometimes highly specific catabolic activities are always present (Fang et al., 2005, 2008). This shows us that we need to go back to the way cells manage information, as maintenance and repair, for example, tend to increase the information content of a cell subjected to some decay, as in any aging process. This observation reminds us of the very important distinction made by Freeman Dyson in his reflection on the origin of life, distinguishing between reproduction, which can be improved over time (and which in the context of the present chapter will require functions that increase information, restoring it to its former level or even improving on that
background), and replication, which tends to keep a given information content in a context where it can only accumulate errors unless there is an external way to compensate for errors or detect the rare events where errors increase the information content of the genetic program. As a matter of fact, Dyson convincingly showed that replication could not have existed first, before a fairly stable reproduction system was established, and he titled his book Origins of Life (plural) to stress this important observation (Dyson, 1985). Replication per se results in the error catastrophe, as pointed out by Orgel (1963) in the case of protein synthesis and often recognized as Muller’s ratchet in the case of heredity (Muller, 1932), while reproduction is not doomed to decay progressively but, on the contrary, may improve over time (Dyson, 1985). While the core of the mathematical model constructed by Dyson rests on the demonstration that metabolic reproduction predates replication and can improve over time, it does not propose an explicit implementation of the concrete processes it must involve (Dyson, 1985). The model shows in mathematical terms how prebiotic metabolic systems could progressively become more accurate, before they could discover replication after the role of RNAs as molecular substrates of chemical reactions shifted to a role of molecular templates (Danchin, 1986). The argument we wish to develop here is that creating a link between these views and information theories can lead to very concrete, experimentally testable hypotheses. In this view, the paleome is made of several subsets of functions that cope with these different aspects of life: to survive and to perpetuate life, while separating the process of replication of the genetic program from that of reproduction of the machine that expresses it (Figure 5.2). 
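The statement that replication on its own can only accumulate errors can be illustrated with a toy calculation. The model below is an assumption made for this illustration only and is none of the models cited above (it is not Dyson's, Orgel's, or Muller's): each site of a four-letter sequence is copied with a per-site error probability mu, errors pick one of the three other letters at random, and there is no selection and no repair. Under these assumptions the expected fraction q of sites still identical to the original obeys q(t+1) = q(t)(1 - mu) + (1 - q(t)) mu/3, which gives q(t) = 1/4 + 3/4 (1 - 4 mu/3)^t.

```python
def fraction_intact(mu: float, generations: int) -> float:
    """Expected fraction of sites identical to the original, by iteration."""
    q = 1.0
    for _ in range(generations):
        # A correct site stays correct with probability 1 - mu; a wrong site
        # is turned back into the original letter with probability mu / 3.
        q = q * (1.0 - mu) + (1.0 - q) * (mu / 3.0)
    return q


def fraction_intact_closed_form(mu: float, generations: int) -> float:
    """Same quantity from the closed-form solution of the recurrence."""
    return 0.25 + 0.75 * (1.0 - 4.0 * mu / 3.0) ** generations


mu = 0.01  # hypothetical per-site error probability per copying round
for t in (0, 1, 10, 100, 1000):
    print(f"t = {t:4d}  iterated = {fraction_intact(mu, t):.4f}  "
          f"closed form = {fraction_intact_closed_form(mu, t):.4f}")
```

Without any compensating process, the copy drifts toward the 25% identity expected between unrelated sequences; anything that keeps the fraction above that floor has to come from functions outside plain copying, which is where the maintenance and repair functions discussed in this section enter.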
Remarkably, in the process of reproduction, it is always an aged organism that creates a progeny made of young organisms (Aguilaniu et al., 2003); even bacteria age (Lindner et al., 2008; Nystrom, 2007). It cannot escape our attention that the process is highly reminiscent of Dyson’s discussion about the reproduction/replication dichotomy. The following is then the important point: making young entities from aged ones implies creating (or recovering) some information, improving over the information background of the cell (Danchin, 2008a). On the one hand, therefore, we have a process that needs to create information. On the other hand, as we have seen in our analysis of the paleome’s functions, we have energy-dependent functions that make room to recover structures that may be information-rich. An aged biological system expresses particular objects (RNAs, proteins, and metabolites) and processes (metabolic pathways), some of which are accurate, while others are less active or less accurate variants. Despite this lack of exactitude, an aged system will generally be able to express metabolic pathways, and by using energy it will generate young systems, thereby making new objects using old objects. Note that this is not an assumption but a fact. If this is a correct view of a fairly common process, then we need to identify the genes coding for the corresponding objects: Dyson’s improvement of reproduction would result in a process accumulating information in a ratchet-like manner (Danchin, 2008). The contention proposed here is that many of the corresponding genes code for functions that belong to the paleome. An analysis of nonessential persistent genes shows that energy is indeed involved in many of their functions. In particular, we find that many of the enzymes of the “degradosome” required for RNA degradation belong to this category of paleome functions (it can be noticed that the core of the degradosome, polynucleotide phosphorylase, which uses phosphorolysis to produce nucleoside diphosphates (energy-rich compounds), is nonessential (Portier, 1980)). While this structure has explicitly been identified in E. coli, where it is associated with polynucleotide phosphorylase and enolase, both functions associated with energy salvage or production (Carpousis, 2007), the exact counterpart has not yet been identified in either Firmicutes or Eukarya, for example. 
However, a thorough review of the literature, associated with a general analysis of the coevolution of genes in B. subtilis, suggests that, at least in the former case, a degradosome-like structure, directly associated with energy-producing enzymes of glycolysis, also exists (Danchin, 2009). In the case of Eukarya, finally, the exosome is also tightly associated with RNA phosphorolysis, not hydrolysis, which is an energy-saving process (Lin-Chao et al., 2007). In parallel, genes coding for ATP-dependent proteases as well as ATP-dependent RNA helicases belong to the paleome. Taken together, these functional associations are remarkably consistent with the present conjecture.

Figure 5.2 Organization of the functions coded in bacterial genomes. The cenome collects functions permitting the cell to occupy a particular niche. The paleome permits the cell to survive aging processes and to propagate life. For a given species, the genes of all strains make up the pan-genome (Tettelin et al., 2005), which comprises the paleome (which is the same in all strains) and the cenome, which is expressed as a particular subset of all the functions permitting life in context, specific to each strain. Cenome functions are subject to considerable exchange by horizontal gene transfer.

5.7 PROVISIONAL CONCLUSION

To live involves three major processes: combating aging processes, perpetuating life, and living in a particular environmental context. The first two processes presumably require ubiquitous functions, which have been grouped in a set named the paleome, because its organization is relative to a large number of bacterial genomes. The last one corresponds to large pools of horizontally transferred genes, with as yet undefined borders, that are shared by the individual strains of a given species (Danchin, 2007; Tettelin et al., 2005). Most of the functions in the paleome have been identified. They are often (as in the case of transcription and translation processes) the target of drugs that prevent propagation of the relevant organisms. While these functions are conserved, they often do not result from similar structures. RNA degradation is a case in point, as the corresponding enzymes vary considerably in different bacterial clades (the best explored situation is that of the gamma-Proteobacteria and Firmicutes clades (Danchin, 2009)), opening lively and heated discussions about the origin of the various branches of living organisms (Doolittle and Bapteste, 2007; Gupta, 2000b; Kurland, 2005). Two further splits must be made of the
functions of the paleome. Some are essential for permitting formation of a colony on plates supplemented with rich medium, while some, although ubiquitous, do not have this property (Fang et al., 2005). This split, which comes on top of the distinction between the need to survive and the need to perpetuate life, has itself to be superimposed on a further split, which separates reproduction from replication (Dyson, 1985). Taken together, these views of the paleome open a novel way to consider genomes and evolution, where management of the creation of information required when a young organism is born from an aged one is the central issue. Repeated invention of energy-dependent processes required to make room while accumulating information in a ratchet-like manner probably accounts for the remarkable diversity of the structures involved in the process.

ACKNOWLEDGMENTS

This work summarizes many years of continuous discussions with the Stanislas Noria group. AS was supported by the PROBACTYS programme, grant CT-2006-029104, in an effort to define genes essential for the construction of a synthetic cell. Some in silico analyses were also supported by the BioSapiens program, grant LSHG CT-2003-503265.

REFERENCES ADELFINSKAYA, O., TERRAZAS, M., FROEYEN, M., MARLIERE, P., NAUWELAERTS, K., and HERDEWIJN, P., 2007. Polymerasecatalyzed synthesis of DNA from phosphoramidate conjugates of deoxynucleotides and amino acids. Nucleic Acids Res. 35: 5060–5072. AGUILANIU, H., GUSTAFSSON, L., RIGOULET, M., and NYSTROM, T., 2003. Asymmetric inheritance of oxidatively damaged proteins during cytokinesis. Science 299: 1751–1753. ALLEN, C., BEKOFF, M., and LAUDER, G., 1998. Nature’s Purposes. MIT Press, Cambridge, MA. ASHIDA, H., DANCHIN, A., and YOKOTA, A., 2005. Was photosynthetic RuBisCO recruited by acquisitive evolution from RuBisCO-like proteins involved in sulfur metabolism? Res. Microbiol. 156: 611–618. BABA, T., ARA, T., HASEGAWA, M., TAKAI, Y., OKUMURA, Y., BABA, M., DATSENKO, K.A., TOMITA, M., WANNER, B.L., and MORI, H., 2006. Construction of Escherichia coli K-12 inframe, single-gene knockout mutants: the Keio collection. Mol. Syst. Biol. 2: 2006.0008. BENNETT, C., 1988. Logical depth and physical complexity. In The Universal Turing Machine: A Half-Century Survey (ed. R. Herken). Oxford University Press, Oxford, pp. 227–257. BERNARD, C., 1865 (reed 1966). Introduction a l’e´tude de la me´decine expe´rimentale. Bailliere (reed Garnier-Flammarion), Paris. CARPOUSIS, A.J., 2007. The RNA degradosome of Escherichia coli: an mRNA-degrading machine assembled on RNase E. Annu. Rev. Microbiol. 61: 71–87. CHAITIN, G., 2007. Speculations on biology, information and complexity. EATCS Bull. 91: 231–237.

COVER, T. and THOMAS, J., 1991. Elements of Information Theory. John Wiley & Sons, Inc., New York. DANCHIN, A., 1986. Order and necessity. In From Enzyme Adaptation to Natural Philosophy: Heritage from Jacques Monod. Symposium “J Monod and Molecular Biology, Yesterday and Today” (eds E. Quagliariello, G. Bernardi, and A. Ullmann). Elsevier Sciences Publishers, Trani, Italy. DANCHIN, A., 1988. Complete genome sequencing: future and prospects. In BAP 1988–1989 (ed. A. Goffeau). Commission of the European Communities, Brussels, pp. 1–24. DANCHIN, A., 1989. Homeotopic transformation and the origin of translation. Prog. Biophys. Mol. Biol. 54: 81–86. DANCHIN, A., 1995. Why sequence genomes? The Escherichia coli imbroglio. Mol. Microbiol. 18: 371–376. DANCHIN, A. 1996. On genomes and cosmologies. In Integrative Approaches to Molecular Biology (eds J. ColladoVides B. Magasanik, and T. Smith). The MIT Press, Cambridge, MA, pp. 91–111. DANCHIN, A., 2003. The Delphic Boat. What Genomes Tell Us. Harvard University Press, Cambridge, MA. DANCHIN, A., 2007. Archives or palimpsests? Bacterial genomes unveil a scenario for the origin of life. Biol. Theory 2: 52–61. DANCHIN, A., 2008a. Natural selection and immortality. Biogerontology, published online: August 22, 2008, DOI 10.1007/s10522-008-9171-5. DANCHIN, A., 2008b. Bacteria as computers making computers. FEMS Microbiol. Rev., published online: November 11, 2008, DOI: 10.1111/j.1574-6976.2008.00137.x. DANCHIN, A., 2009. A phylogenetic view of bacterial ribonucleases. Prog. Biophys. Mol. Biol. 85: 1–41.

DANCHIN, A., MÉDIGUE, C., GASCUEL, O., SOLDANO, H., and HÉNAUT, A., 1991. From data banks to data bases. Res. Microbiol. 142: 913–916. DANCHIN, A., FANG, G., and NORIA, S., 2007. The extant core bacterial proteome is an archive of the origin of life. Proteomics 7: 875–889. DAUBIN, V., GOUY, M., and PERRIERE, G., 2002. A phylogenomic approach to bacterial phylogeny: evidence of a core of genes sharing a common history. Genome Res. 12: 1080–1090. DE LORENZO, V. and DANCHIN, A., 2008. Synthetic biology and the discovery of new worlds/words. EMBO Reports 9: 822–827. DOOLITTLE, W.F. and BAPTESTE, E., 2007. Pattern pluralism and the tree of life hypothesis. Proc. Natl. Acad. Sci. USA 104: 2043–2049. DYSON, F.J., 1985. Origins of Life. Cambridge University Press, Cambridge, UK. FANG, G., ROCHA, E., and DANCHIN, A., 2005. How essential are nonessential genes? Mol. Biol. Evol. 22: 2147–2156. FANG, G., ROCHA, E.P., and DANCHIN, A., 2008. Persistence drives gene clustering in bacterial genomes. BMC Genomics 9: 4. GIBSON, D.G., BENDERS, G.A., ANDREWS-PFANNKOCH, C., DENISOVA, E.A., BADEN-TILLSON, H., et al., 2008. Complete chemical synthesis, assembly, and cloning of a Mycoplasma genitalium genome. Science 319: 1215–1220. GRANICK, S., 1957. Speculations on the origins and evolution of photosynthesis. Ann. N. Y. Acad. Sci. 69: 292–308. GUPTA, R.S., 2000. The natural evolutionary relationships among prokaryotes. Crit. Rev. Microbiol. 26: 111–131. GUPTA, R.S., 2000. The phylogeny of proteobacteria: relationships to other eubacterial phyla and eukaryotes. FEMS Microbiol. Rev. 24: 367–402. HÉNAUT, A. and DANCHIN, A. 1996. Analysis and predictions from Escherichia coli sequences or E. coli in silico. In Escherichia coli and Salmonella, Cellular and Molecular Biology (ed. F. Neidhardt). ASM Press, Washington, pp. 2047–2065. JOYCE, A.R., REED, J.L., WHITE, A., EDWARDS, R., OSTERMAN, A., BABA, T., MORI, H., LESELY, S.A., PALSSON, B.O. and AGARWALLA, S., 2006. 
Experimental and computational assessment of conditionally essential genes in Escherichia coli. J. Bacteriol. 188: 8259–8271. KOBAYASHI, K., EHRLICH, S.D., ALBERTINI, A., AMATI, G., ANDERSEN, K.K., et al., 2003. Essential Bacillus subtilis genes. Proc. Natl. Acad. Sci. USA 100: 4678–4683. KURLAND, C.G., 2005. What tangled web: barriers to rampant horizontal gene transfer. BioEssays 27: 741–747. KYRPIDES, N., OVERBEEK, R. and OUZOUNIS, C., 1999. Universal protein families and the functional content of the last universal common ancestor. J. Mol. Evol. 49: 413–423. LANDAUER, R., 1961. Irreversibility and heat generation in the computing process. IBM J. Res. Dev. 3: 184–191. LARTIGUE, C., GLASS, J.I., ALPEROVICH, N., PIEPER, R., PARMAR, P.P., HUTCHISON, C.A., 3RD, SMITH, H.O., and VENTER, J.C.,


2007. Genome transplantation in bacteria: changing one species to another. Science 317: 632–638. LAWRENCE, J.G. and ROTH, J.R., 1996. Selﬁsh operons: horizontal transfer may drive the evolution of gene clusters. Genetics 143: 1843–1860. LIFSON, S., 2005. What the book say: What is information for molecular biology? BioEssays 16: 373–375. LIN-CHAO, S., CHIOU, N.T. and SCHUSTER, G., 2007. The PNPase, exosome and RNA helicases as the building components of evolutionarily-conserved RNA degradation machines. J. Biomed. Sci. 14: 523–532. LINDNER, A.B., MADDEN, R., DEMAREZ, A., STEWART, E.J., and TADDEI, F., 2008. Asymmetric segregation of protein aggregates is associated with cellular aging and rejuvenation. Proc. Natl. Acad. Sci. USA 105: 3076–3081. MECHOLD, U., FANG, G., NGO, S., OGRYZKO, V., and DANCHIN, A., 2007. YtqI from Bacillus subtilis has both oligoribonuclease and pAp-phosphatase activity. Nucleic Acids Res. 35: 4552–4561. MOORE, P.C. and LINDSAY, J.A., 2001. Genetic variation among hospital isolates of methicillin-sensitive Staphylococcus aureus: evidence for horizontal transfer of virulence genes. J. Clin. Microbiol. 39: 2760–2767. MOVILA, A., USPENSKAIA, I., TODERAS, I., MELNIC, V., and CONOVALOV, J., 2006. Prevalence of Borrelia burgdorferi sensu lato and Coxiella burnetti in ticks collected in different biocenoses in the Republic of Moldova. Int. J. Med. Microbiol. 296: 172–176. MULLER, H., 1932. Some genetic aspects of sex. Am. Nat. 66: 118–128. MUSHEGIAN, A.R. and KOONIN, E.V., 1996. A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proc. Natl. Acad. Sci. USA 93: 10268–10273. MYHILL, J., 1952. Some philosophical implications of mathematical logic. Three classes of ideas. Rev. Metaphys. 6: 165–198. NYSTROM, T., 2007. A bacterial kind of aging. PLoS Genet. 3: e224. ORGEL, L., 1963. The maintenance of the accuracy of protein synthesis and its relevance to aging. Proc. Natl. Acad. Sci. USA 49: 517–521. 
PASCAL, G., MÉDIGUE, C., and DANCHIN, A., 2005. Universal biases in protein composition of model prokaryotes. Proteins 60: 27–35. PECKHAM, H.E., THURMAN, R.E., FU, Y., STAMATOYANNOPOULOS, J.A., NOBLE, W.S., STRUHL, K., and WENG, Z., 2007. Nucleosome positioning signals in genomic DNA. Genome Res. 17: 1170–1177. PETER, B.J., ARSUAGA, J., BREIER, A.M., KHODURSKY, A.B., BROWN, P.O., and COZZARELLI, N.R., 2004. Genomic transcriptional response to loss of chromosomal supercoiling in Escherichia coli. Genome Biol. 5: R87. POPTSOVA, M.S. and GOGARTEN, J.P., 2007. The power of phylogenetic approaches to detect horizontally transferred genes. BMC Evol. Biol. 7: 45.


PORTIER, C., 1980. Isolation of a polynucleotide phosphorylase mutant using a kanamycin resistant determinant. Mol. Gen. Genet. 178: 343–349. QUASTLER, H., 1953. Essays on the Use of Information Theory in Biology. University of Illinois Press, Urbana, IL. QUASTLER, H., 1964. The Emergence of Biological Organization. Yale University Press, New York. QUASTLER, H., PLATZMAN, R.L., and YOCKEY, H., 1958. Symposium on Information Theory in Biology. Pergamon Press, Oxford. ROCHA, E.P. and DANCHIN, A., 2003. Gene essentiality determines chromosome organisation in bacteria. Nucleic Acids Res. 31: 6570–6577. SCHNEIDER, T.D., 2006. Twenty years of Delila and molecular information theory: the Altenberg–Austin workshop in theoretical biology biological information, beyond metaphor: causality, explanation, and unification Altenberg, Austria, 11–14 July 2002. Biol. Theory 1: 250–260. SCHNEIDER, T.D. and STEPHENS, R.M., 1990. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 18: 6097–6100. SCHRÖDINGER, E., 1945. What is Life? The Physical Aspect of the Living Cell. The Macmillan Company, New York. SERRELI, V., LEE, C.F., KAY, E.R. and LEIGH, D.A., 2007. A molecular information ratchet. Nature 445: 523–527. SHANNON, C. and WEAVER, W., 1949. The Mathematical Theory of Communication. University of Illinois, Urbana, IL. STEANE, A., 1998. Quantum computing. Rep. Prog. Phys. 61: 117–173.

TETTELIN, H., MASIGNANI, V., CIESLEWICZ, M.J., DONATI, C., MEDINI, D., et al., 2005. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome.” Proc. Natl. Acad. Sci. USA 102: 13950–13955. THOMPSON, L.W. and KRAWIEC, S., 1983. Acquisitive evolution of ribitol dehydrogenase in Klebsiella pneumoniae. J. Bacteriol. 154: 1027–1031. TURING, A. (1936/1937) On computable numbers, with an application to the Entscheidungsproblem. Proc. Lond. Math. Soc. 42: 230–265. UZAN, M. and DANCHIN, A., 1978. Correlation between the serine sensitivity and the derepressibility of the ilv genes in Escherichia coli relA- mutants. Mol. Gen. Genet. 165: 21–30. VERGASSOLA, M., VILLERMAUX, E., and SHRAIMAN, B.I., 2007. ‘Infotaxis’ as a strategy for searching without gradients. Nature 445: 406–409. WANG, L., XIE, J., and SCHULTZ, P.G., 2006. Expanding the genetic code. Annu. Rev. Biophys. Biomol. Struct. 35: 225–249. WRIGHT, B.E., 2004. Stress-directed adaptive mutations and evolution. Mol. Microbiol. 52: 643–650. YOCKEY, H., 1992. Information Theory and Molecular Biology. Cambridge University Press, Cambridge, UK. ZHANG, J. and MADDEN, T.L., 1997. PowerBLAST: a new network BLAST application for interactive or automated sequence analysis and annotation. Genome Res. 7: 649–656.

Chapter 6

Evolutionary Genomics of Yeasts

Bernard Dujon

6.1 INTRODUCTION
6.2 A BRIEF HISTORY OF HEMIASCOMYCETOUS YEAST GENOMICS
6.3 THE SCIENTIFIC ATTRACTIVENESS OF S. CEREVISIAE
6.4 EVOLUTIONARY GENOMICS OF HEMIASCOMYCETES
6.5 SURPRISES
6.6 WHAT NEXT?
ACKNOWLEDGMENTS
EPILOGUE
REFERENCES

6.1 INTRODUCTION

Used empirically over millennia for a variety of fermentations and food processes, the microscopic yeasts played a key role during the nineteenth century in the emergence of major biological sciences such as microbiology and biochemistry and, during the twentieth century, became favored experimental models for cell biology and genetics (Barnett, 2003; 2007). More recently, their small and compact genomes, and the ease with which cells and pure cultures can be manipulated in the laboratory, placed them at the forefront of genomics. The baker’s yeast, Saccharomyces cerevisiae, was the first eukaryote whose genome was entirely sequenced (Goffeau et al., 1996) and has subsequently been so instrumental in the early development of functional genomics that it has become one of the best characterized eukaryotic cells, serving as a reference for the study of numerous molecular mechanisms. The fission yeast, Schizosaccharomyces pombe, sequenced a few years later (Wood et al., 2002), also contributed major fundamental discoveries about eukaryotic cells, although the functional analysis of its genome remains less extensive than that of S. cerevisiae (Sunnerhagen, 2002; Aslett and Wood, 2006). These two favored models of molecular biology laboratories are so distant from each other on an evolutionary timescale that they do not share obvious traces of common ancestry except those universal to all eukaryotes. But around them exists a very broad variety of yeast species that have received much more limited attention, except for human pathogens such as Candida albicans and species involved in fermentation processes, that is, essentially the close relatives of S. cerevisiae. Over a thousand species of yeasts have been described (Kurtzman and Fell, 2000) and this number is now rising rapidly, because modern genomic methods allow rapid and efficient exploration of natural systems, such as insect guts, that appear to be rich sources of yeasts (Boekhout, 2005). Yeasts are essentially terrestrial saprobes that often live in close association with other organisms, but commensals, pathogens, and even marine forms are also found (Kutty and Philip, 2008). Despite their common unicellular mode of life and propagation, yeasts belong to distinct phylogenetic groups of the fungal kingdom, which includes several phyla (James et al., 2006; Hibbett et al., 2007; Marcet-Houben and Gabaldon, 2009). S. cerevisiae is a member of a large subphylum of Ascomycota, designated Saccharomycotina or Hemiascomycetes, which contains many species of interest. They are believed to have separated between 400 and 1000 Myr ago from the Pezizomycotina, another large subphylum of Ascomycota containing hyphal fungi (Hedges et al., 2004; Taylor and Berbee, 2006). S. pombe belongs to the third subphylum of Ascomycota, called Taphrinomycotina, which separated even earlier from the common ancestor of both Pezizomycotina and Saccharomycotina and, therefore, represents an independent emergence of a yeast form of life. Yet other yeast species, such as the pathogenic Cryptococcus (Loftus et al., 2005) and the dandruff-associated Malassezia (Xu et al., 2007), belong to the other phylum of “modern” fungi, the Basidiomycota, whose subdivisions contain a variety of yeast species amid a majority of multicellular species (Hibbett et al., 2007). 
The unicellular yeasts, therefore, do not represent primitive fungi but, on the contrary, have emerged independently several times from multicellular fungal ancestors. Phyla of basal fungi also contain unicellular species, some with a flagellum (chytrids), but they are not regarded as yeasts. Given the difficulty of covering the evolutionary genomics of such a heterogeneous group of organisms, phylogenetically embedded among more complex multicellular species, this chapter will focus solely on the monophyletic group of hemiascomycetous yeasts, which contains by far the largest number of sequenced genomes (Dujon, 2006).

6.2 A BRIEF HISTORY OF HEMIASCOMYCETOUS YEAST GENOMICS

Yeast genomics started from a European program, later enlarged into an international cooperative project, that resulted in the early complete sequencing and annotation of a haploid laboratory strain of S. cerevisiae (Goffeau et al., 1996). This 12.3 Mb genome revealed two major surprises that largely guided studies in subsequent years: (i) a large number of genes identified from the sequence were previously unknown (orphans) despite the earlier genetic studies on this organism (Dujon, 1996), and (ii) the sequence-based genome map showed dozens of duplicated segments of various sizes, dispersed throughout it (Mewes et al., 1997; Figure 6.1). The conspicuous presence of orphan genes prompted large-scale functional studies, opening the way for rapid developments of functional genomics. The traces of duplicated segments provided the first hint of an ancestral genome duplication. Both aspects helped reinforce the focus on S. cerevisiae (see Section 6.3), leaving aside the very large number of other hemiascomycetous species.

Figure 6.1 The scientific attractiveness of S. cerevisiae. (a) Genome duplication: The original schematic view of the 53 clustered gene duplications (colored diagonals) between the 16 fully sequenced chromosomes of S. cerevisiae strain S288C (horizontal black thin lines), as published in Mewes et al. (1997). (b) Evolution of functional annotation of S. cerevisiae ORFs, as published in Pena-Castillo and Hughes (2007). Of about 5780 actual protein-coding genes expected, about 1000 remain to be functionally characterized despite numerous intensive functional genomic studies. (See insert for color representation of this figure.)

The first large-scale exploration of hemiascomycetous yeast genomes included 13 species selected to cover the phylogenetic diversity of this group as evenly as possible (Souciet et al., 2000). The limited coverage allowed by sequencing methods of the time was sufficient, however, to reveal the broad evolutionary range of hemiascomycetes, as judged from high sequence divergence of homologous genes (Malpertuy et al., 2000) and extensive loss of synteny (Llorente et al., 2000) when compared to S. cerevisiae. The same strategy of unidirectional comparisons to S. cerevisiae was applied 3 years later to several species closely related to S. cerevisiae to compare promoter regions and regulatory elements (Cliften et al., 2003; Kellis et al., 2003). Sequencing coverage was higher but still incomplete.
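Loss of synteny relative to S. cerevisiae can be quantified in several ways. The following is a minimal sketch of one simple measure, the fraction of adjacent gene pairs in a reference genome whose orthologs are also adjacent in a second genome. It is not the method of Llorente et al. (2000); the function name, the toy gene labels, and the single-chromosome simplification are all assumptions made for illustration.

```python
def conserved_adjacency_fraction(reference_order, other_order, orthologs):
    """Fraction of adjacent reference gene pairs whose orthologs are
    also adjacent in the other genome (hypothetical helper).

    reference_order, other_order: gene names in chromosomal order
    (one chromosome each, a simplifying assumption).
    orthologs: dict mapping reference gene -> gene in the other genome.
    """
    position = {gene: index for index, gene in enumerate(other_order)}
    conserved = 0
    comparable = 0
    for left, right in zip(reference_order, reference_order[1:]):
        # Skip pairs for which one gene has no ortholog in the other genome.
        if left not in orthologs or right not in orthologs:
            continue
        a, b = orthologs[left], orthologs[right]
        if a not in position or b not in position:
            continue
        comparable += 1
        if abs(position[a] - position[b]) == 1:
            conserved += 1
    return conserved / comparable if comparable else 0.0


# Toy example: five genes, two of the four adjacencies are retained.
reference = ["A", "B", "C", "D", "E"]
other = ["a", "b", "d", "e", "c"]
ortholog_map = {gene: gene.lower() for gene in reference}
print(conserved_adjacency_fraction(reference, other, ortholog_map))
```

With low-coverage data, many genes lack an identified ortholog or a known neighbor, so a measure of this kind is computed only over the comparable pairs.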


The draft assembly of Kluyveromyces waltii (Kellis et al., 2004) and the complete sequence of the plant pathogen Ashbya gossypii (Dietrich et al., 2004), published a year later, were also each compared to S. cerevisiae, showing for the first time regions of dual synteny left over from its ancestral duplication and subsequent gene loss (see below). Similarly, the diploid genome sequence of the pathogenic yeast C. albicans, published the same year, was also essentially compared to S. cerevisiae (Jones et al., 2004). The first multidimensional comparison between hemiascomycetous genomes was made possible by the simultaneous complete sequencing of Candida glabrata, Kluyveromyces lactis, Debaryomyces hansenii, and Yarrowia lipolytica (Dujon et al., 2004), four species selected according to our previous exploratory analysis (Souciet et al., 2000). A total of about 25,700 predicted proteins were systematically compared to each other and clustered into families whose conservation could be examined. The degree of sequence divergence between orthologous proteins indicated for the first time that hemiascomycetous yeasts cover an evolutionary range comparable to or larger than the entire phylum of Chordates (Dujon et al., 2004; Dujon, 2006). While the sequencing of other yeast species continued to accelerate (today, over 30 species have been sequenced, see Tables 6.1 and 6.2), comparative genomics of hemiascomycetes addressed specific medical, biological, or biotechnological aspects in addition to evolutionary aspects. Several species more or less closely related to C. albicans were sequenced, such as Lodderomyces elongisporus or the human pathogens C. tropicalis, C. dubliniensis, and C. parapsilosis, all isolated as heterozygous diploids, as well as Pichia guilliermondii and Clavispora lusitaniae, found as haploids. 
Hansenula polymorpha was sequenced because it is a methylotrophic yeast used for industrial applications (Ramezani-Rad et al., 2003) and Pichia stipitis because it metabolizes xylose and other lignocellulose-derived products (Jeffries et al., 2007). Additional sequencing was also performed on new strains of S. cerevisiae for studies on population polymorphism (Liti et al., 2009; Schacherer et al., 2009), usage in wine production (Borneman et al., 2008; Novo et al., 2009), or because they were isolated from immunocompromised patients (Wei et al., 2007). Wine spoilage was also a reason to start sequencing Dekkera bruxellensis, in addition to its uncertain phylogenetic relationship with other yeasts (Woolfit et al., 2007). Evolutionary genomics obviously benefits from the above data, but additional sequences of species selected on the basis of their phylogeny are still needed, given the evolutionary range of hemiascomycetes. The genome of Vanderwaltozyma polyspora (Kluyveromyces polysporus), a member of the most distant clade from S. cerevisiae among those that underwent whole-genome duplication, was sequenced to better understand postduplication evolutionary events (Scannell et al., 2007). Conversely, several additional species were sequenced among those that did not share the ancestral genome duplication of S. cerevisiae (Zygosaccharomyces rouxii, Lachancea (Kluyveromyces) thermotolerans, and Lachancea (Saccharomyces) kluyveri). Multispecies genome comparisons of these and other yeasts (about 49,000 protein-coding genes in total) were recently used to define the “protoploid” status of the Saccharomycetaceae genomes (Genolevures Consortium, 2009).
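Clustering predicted proteins into families, as mentioned above for the multispecies comparisons, can be illustrated with a small sketch. The code below groups proteins by single linkage: any two proteins joined by a chain of pairwise similarity hits end up in the same family. This is an assumption chosen for simplicity, not necessarily the procedure used in the studies cited, and the protein identifiers and similarity pairs are invented toy data.

```python
def cluster_families(proteins, similar_pairs):
    """Group proteins into families by single linkage (union-find).

    proteins: iterable of protein identifiers.
    similar_pairs: iterable of (a, b) pairs judged similar, for example
    from an all-against-all sequence comparison (assumed input).
    Returns a list of families, each a sorted list of identifiers.
    """
    parent = {protein: protein for protein in proteins}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    families = {}
    for protein in parent:
        families.setdefault(find(protein), []).append(protein)
    return sorted((sorted(members) for members in families.values()),
                  key=lambda members: members[0])


# Toy data: identifiers are hypothetical, prefixed with species codes.
proteins = ["CAGL_1", "KLLA_1", "DEHA_1", "YALI_1", "CAGL_2", "KLLA_2", "YALI_3"]
hits = [("CAGL_1", "KLLA_1"), ("KLLA_1", "DEHA_1"),
        ("DEHA_1", "YALI_1"), ("CAGL_2", "KLLA_2")]
for family in cluster_families(proteins, hits):
    print(family)
```

Single linkage tends to merge families through multidomain proteins, so real pipelines usually add alignment coverage or score thresholds before linking two sequences.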

6.3 THE SCIENTIFIC ATTRACTIVENESS OF S. CEREVISIAE

6.3.1 Functional Genomics

The genetics of S. cerevisiae started in the middle of the twentieth century and grew so rapidly that over 800 of its genes were functionally characterized and genetically mapped

Table 6.1 Hemiascomycetous Yeast Genomes Clade or groupw

Species Saccharomycetaceae Saccharomyces cerevisiae

Strain

Status

Ploidy

References

Saccharomyces pastorianus

1

Kazachstania (Saccharomyces) exigua Kazachstania (Saccharomyces) servazzii Saccharomyces castelliiy Candida (Torulopsis) glabrata Vanderwaltozyma (Kluyveromyces) polyspora Zygosaccharomyces rouxii Lachancea (Kluyveromyces) thermotolerans Lachancea (Kluyveromyces) waltii Lachancea (Saccharomyces) kluyveri Kluyveromyces lactis var. lactis

2

S288c (FY1679)a YJM789b RM11-1ac AWRI 1631d EC1118d M22c YPS163e CBS432T IFO1815 IFO1802 CBS7001 (623-6C) CBS 7001 (MCYC623) Weihenstephan 34/70 CBS379T

2

CBS4311T

E

2n

12

3 4 6

CBS4309T CBS138 DSMZ 70294T

S C S

n n

8 13 14

7 10

CBS732T CBS6340T

C C

n 2n

15 15

10 10 11

NCYC2644 CBS3082T CBS2359 (CLIB210) CBS712T

D C C

2n n

16 15 13

ATCC10895 NBRC 1722 CBS2499/Y1031

C P E

n 2n n?

18 19 20

CBS4732T (RB11)

D

2n

21

C

CBS767T

C

n

13

C B B B A

CBS7064 CBS6054 ATCC6260p ATCC42720p CDC317p

E C D D D

2n n n n 2n

22 23 24 24 24

Saccharomyces Saccharomyces Saccharomyces Saccharomyces uvarum Saccharomyces

1

paradoxus mikatae kudriavzevii bayanus var.

1 1 1 1

bayanus

1

Kluyveromyces marxianus var. marxianus Eremothecium (Ashbya) gossypii Saccharomycodes ludwigii Dekkera (Brettanomyces) bruxellensis Pichia angusta (Hansenula polymorpha) CTG complex Debaryomyces hansenii var. hansenii Pichia farinosa (sorbitophila) Pichia stipitis Pichia (Candida) guilliermondii Clavispora (Candida) lusitaniae Candida parapsilosis

11 12 13

C D D D D S S S S S S

n n n n 2n

2n

1 2 3 4 5 6 6 7 7,8 8 9

S

2n

7,8

S

Hybrid

10

E

2n

11

2n 2n

E

17


Table 6.1 (Continued)

Species | Clade or group (w) | Strain | Status | Ploidy | References
Lodderomyces elongisporus | A | CBS2605T | D | 2n | 24
Candida tropicalis | A | MYA 3404p | D | 2n | 24
Candida tropicalis | A | CBS94Tp | E | 2n | 25
Candida albicans | A | WO-1p | D | 2n | 24
Candida albicans | A | SC5314p | D | 2n | 26
Candida dubliniensis | A | CD36p | C | 2n | 27
Dipodascaceae
Arxula adeninivorans | | | P | | 22
Yarrowia lipolytica | | CBS7504 (CLIB122) | C | n | 13

The table lists the yeast species (Kurtzman and Fell, 2000) and strains (brackets: actual derivative used for some sequencing studies; T: type strain; a: reference laboratory strain; b: AIDS patient; c: vineyard; d: commercial wine strain or derivative; e: oak tree; p: human pathogen) whose genome sequences are available in different status of completion: C: complete (includes ﬁnishing); D: draft assembly (limited number of supercontigs identiﬁed with chromosomes), S: whole genome shotgun (typically, about 3–7X coverage and/or about 500–2000 contigs), E: exploratory genome survey, P: in progress. References: (1) Goffeau et al., 1996; (2) Wei et al., 2007; (3) www.broad.mit.edu/annotation/genome/ saccharomyces_cerevisiae; (4) Borneman et al., 2008; (5) Novo et al., 2009; (6) Doniger et al., 2008; (7) Kellis et al., 2003; (8) Cliften et al., 2003; (9) Bon et al., 2000a; (10) Nakao et al., 2009; (11) Bon et al., 2000b; (12) Casaregola et al., 2000; (13) Dujon et al., 2004 and www.genolevures.org; (14) Scannell et al., 2007; (15) Souciet et al., 2009; (16) Kellis et al., 2004; (17) Llorente et al., 2000b; (18) Dietrich et al., 2004 and agd.vital-it.ch; fungal.genome.duke.edu; (19) M. Knop, EMBL (personal communication); (20) Woolﬁt et al., 2007; (21) Ramezani-Rad et al., 2003; (22) Genolevures Consortium (in progress); (23) Jeffries et al., 2007; (24) www.broad.mit.edu/annotation/genome/candida_group/GenomeStats.html; (25) Blandin et al., 2000; (26) Jones et al., 2004 (www.candidagenome.org); (27) ftp.sanger.ac.uk/pub/pathogens/Candida/dubliniensis. w Clades or groups for members of the Saccharomycetaceae and CTG complex are according to Kurtzman and Robnett (2003) and Tsui et al. (2008), respectively. Note that although taxonomically classiﬁed among the family Saccharomycetaceae, D. bruxellensis and P. angusta do not belong to one of the clades deﬁned by Kurtzman and Robnett (2003). Their actual phylogeny is not entirely clear. 
y This clade is designated Naumovia by Kurtzman (2003), but this designation is not accepted by NCBI, hence the original genus name.

when the sequencing of its genome started (references in Cherry et al., 1997). The early availability of this genome prompted large-scale functional studies that pioneered the field of functional genomics. The original annotation of this genome predicted over 6200 protein-coding genes. Following successive reannotations based on new functional and comparative data, this number now converges toward about 5780. Nearly all of these genes have been individually deleted and replaced by molecularly bar-coded cassettes to allow systematic studies of mutants (Winzeler et al., 1999; Giaever et al., 2002; Pierce et al., 2007). This was the first and remains the most comprehensive collection of null mutants among eukaryotes. In parallel, large collections of epitope-tagged mutants were constructed, allowing tandem affinity purification of native protein complexes or in vivo fluorescence assays (Gavin et al., 2002; Howson et al., 2005; Krogan et al., 2006). Protein-coding genes were also individually cloned with tags in various vectors (Gelperin et al., 2005; Hu et al., 2007 and references therein). Vectors bearing titratable promoter alleles were constructed for the functional exploration of essential genes, that is, about 20% of the total (Mnaimneh et al., 2004; Davierwala et al., 2005). Combined with haploid-selectable markers, such constructs opened the possibility of large-scale analysis of synthetic phenotypes between combinations of mutants, using methods so far unique to S. cerevisiae (Pan et al., 2004; Ooi et al., 2006; Tong and Boone, 2006). Dominant negative phenotypes

Table 6.2 Genome Organization and Content of Major, Fully Sequenced Hemiascomycetous Yeast Genomes

Species code | Strain | Ploidy | Chromosome number | Genome size (Mb) | GC% | Total CDS | Split CDS % (a) | Redundancy (b) | Total tDNA | rDNA loci
Saccharomycetaceae
SACE | S288c | n | 16 | 12.1 | 38.3 | 5 769 | 4.4 | 1.42 | 274 | 1 (I)
CAGL | CBS138 | n | 13 | 12.3 | 38.8 | 5 204 | 2.5 | 1.36 | 207 | 2 (S)
VAPO | DSMZ70294 | n | 13 | 14.7 | 32.3 | 5 652 | — | — | 251 | —
ZYRO | CBS732 | n | 7 | 9.8 | 39.1 | 5 055 | 3.3 | 1.30 | 272 | 1 (I)
LATH | CBS6340 | 2n | 8 | 10.4 | 47.3 | 5 137 | 5.6 | 1.30 | 229 | 1 (I)
LAWA | NCYC2644 | — | 8 | 10.7 | 43.8 | 5 230 | — | — | 217 | —
LAKL | CBS3082 | 2n | 8 | 11.3 | 41.5 (c) | 5 397 | 6.0 | 1.33 | 257 | 1 (I)
KLLA | CLIB210 | n | 6 | 10.7 | 38.8 | 5 108 | 3.4 | 1.29 | 163 | 1 (I)
ERGO | ATCC10895 | n | 7 | 8.7 | 52.0 | 4 715 | 4.5 | 1.25 | 190 | 1 (I)
PIAN | CBS4732 | 2n | 6 | 9.5 | 47.9 | 5 933 | — | — | 80 | 1 (I)
CTG complex
DEHA | CBS767 | n | 7 | 12.2 | 36.3 | 6 397 | 6.5 | 1.39 | 200 | 3 (I)
PIST | CBS6054 | n | 8 | 15.4 | 41.1 | 5 841 | — | — | — | —
PIGU | ATCC6260 | n | — | 10.6 | 43.8 | 5 920 | — | — | — | —
CLLU | ATCC42720 | n | — | 12.1 | 44.5 | 5 941 | — | — | — | —
CAPA | CDC317 | 2n | — | 13.1 | 38.7 | 5 733 | — | — | — | —
LOEL | CBS2605 | 2n | — | 15.5 | 37.0 | 5 802 | — | — | — | —
CATR | MYA3404 | 2n | — | 14.6 | 33.1 | 6 258 | — | — | — | —
CAAL | WO-1 | 2n | 8 | 14.4 | 33.5 | 6 160 | — | — | — | —
CAAL | SC5314 | 2n | 8 | 14.3 | 33.5 | 6 107 | 6.0 | — | 131 | 1 (I)
CADU | CD36 | 2n | 8 | 14.6 | 33.2 | 5 758 | — | — | — | —
Dipodascaceae
YALI | CLIB122 | n | 6 | 20.5 | 49.0 | 6 582 | 14.5 | 1.45 | 510 | 6 (S)

The table summarizes chromosome numbers, genome sizes and compositions, total numbers of protein-coding genes and % of intron-containing genes, global genome redundancy, tRNA-coding genes and rDNA loci for all yeast species whose genome has been fully sequenced (see Table 6.1). Species code: SACE: S. cerevisiae, CAGL: C. glabrata, VAPO: V. polyspora, ZYRO: Z. rouxii, LATH: L. thermotolerans, LAWA: L. waltii, LAKL: L. kluyveri, KLLA: K. lactis, ERGO: E. gossypii, PIAN: P. angusta, DEHA: D. hansenii, PIST: P. stipitis, PIGU: P. guilliermondii, CLLU: C. lusitaniae, CAPA: C. parapsilosis, LOEL: L. elongisporus, CATR: C. tropicalis, CAAL: C. albicans, CADU: C. dubliniensis, YALI: Y. lipolytica. —: data not available in computed form, (I): internal within chromosome arm, (S): subtelomeric.
(a) % of CDS interrupted by introns.
(b) Ratio of total CDS over number of distinct protein families.
(c) Heterogeneous composition.

were also examined after gene overexpression (Boyer et al., 2004; Sopko et al., 2006). The wealth of information obtained from all the above strategies makes S. cerevisiae one of the best described eukaryotic cells, with about 80% of its protein-coding genes functionally characterized (Pena-Castillo and Hughes, 2007). There remain about 1000 genes, however, whose function is still unknown (Figure 6.1). But, besides offering a powerful experimental platform to study eukaryotic cell functions, S. cerevisiae is attractive for two other reasons: its ancestral genome duplication, and its use in alcoholic fermentations. Both reasons have stimulated genomic studies on the Saccharomyces sensu stricto complex, that is, a set of species closely related to S. cerevisiae and able to form interspecific hybrids with it.
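The rounded figures quoted in this section can be checked against each other with a few lines of arithmetic. The sketch below uses only numbers given in the text and in Table 6.2 (about 5780 genes, about 80% characterized, about 20% essential, and for S. cerevisiae 5769 CDS with a redundancy of 1.42); the variable and function names are illustrative, and the results are only as precise as the rounded inputs.

```python
# Figures quoted in the text and in Table 6.2 for S. cerevisiae.
TOTAL_GENES = 5780             # current estimate of protein-coding genes
CHARACTERIZED_FRACTION = 0.80  # "about 80%" functionally characterized
ESSENTIAL_FRACTION = 0.20      # essential genes, "about 20%" of the total
SACE_TOTAL_CDS = 5769          # Table 6.2, SACE row
SACE_REDUNDANCY = 1.42         # total CDS / number of distinct protein families


def summarize():
    """Derive approximate counts implied by the rounded figures above."""
    uncharacterized = round(TOTAL_GENES * (1 - CHARACTERIZED_FRACTION))
    essential = round(TOTAL_GENES * ESSENTIAL_FRACTION)
    # Redundancy is defined as CDS per family, so families = CDS / redundancy.
    families = round(SACE_TOTAL_CDS / SACE_REDUNDANCY)
    return {
        "uncharacterized_genes": uncharacterized,
        "essential_genes": essential,
        "distinct_protein_families": families,
    }


for name, value in summarize().items():
    print(f"{name}: {value}")
```

The count of uncharacterized genes obtained this way is of the same order as the "about 1000" quoted in the text; the difference reflects the rounding of the 80% figure rather than a discrepancy in the data.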


6.3.2 Genome Duplication

The idea that S. cerevisiae emerged from an ancient duplication of the entire genome (Wolfe and Shields, 1997) stimulated numerous studies about the role of such events in genome evolution. One aspect is to try to reconstruct the original genome at the time of the duplication event; the other, of more general significance, is to examine the fate of the duplicated genes. It is generally accepted that newly arisen duplicate gene pairs experience an altered selective regime, leading either to the loss of one copy or to an increased rate of sequence divergence if both copies are preserved. In the S. cerevisiae genome, although relics of ancestral genes have been found among duplicated chromosomal segments (Fischer et al., 2001; Lafontaine et al., 2004), the majority of postduplication events seem to be complete gene deletions, as only about 550 duplicated pairs (ohnologues) still remain (Byrne and Wolfe, 2005). Other species that separated from S. cerevisiae after the proposed duplication event show a similar extent of gene loss, as in S. castellii and V. polyspora (Cliften et al., 2006; Scannell et al., 2007), or an even greater loss, as in C. glabrata (Dujon et al., 2004). Many of the gene deletions are common to several of these species, suggesting that they occurred before speciation, but quite a number are species specific, indicating postspeciation events (Cliften et al., 2006). The mechanism by which so many individual gene deletions take place remains one of the problems to solve under the whole-genome duplication hypothesis (Martin et al., 2007). In addition, the successive gene deletions from the initial polyploid stage are expected to create phenotypically disadvantaged intermediates, owing to gene dosage imbalance or the rewiring of the protein interaction network (Andalis et al., 2004; Presser et al., 2008; Vinogradov and Anatskaya, 2009). Solving these problems will probably demand additional data, including experimental ones. Meanwhile, the duplicated pairs remaining in extant postduplication genomes have proved to be very valuable material for characterizing evolutionary mechanisms. Assuming that their sequences were identical immediately after the duplication, that is, that the duplication was an autopolyploidization rather than an allopolyploidization event, the sequence divergence and differential expression among pairs of ohnologues are informative about evolutionary trends. When compared to single-copy homologues from a protoploid species of Saccharomycetaceae, a majority of ohnologous pairs exhibit strong asymmetry in evolution rate, suggesting neofunctionalization of the most diverged copy (Kim and Yi, 2006; Byrne and Wolfe, 2007; Turunen et al., 2009). This asymmetry is also reflected at the level of gene expression and cis-regulatory elements (Papp et al., 2003; Gu et al., 2005), and seems correlated with the recombination rate of the chromosomal region, as predicted from the Hill–Robertson effect (Zhang and Kishino, 2004). For other ohnologous pairs, however, functional redundancy remains for a long time (Dean et al., 2008). But subfunctionalization also occurs. About a third of proteins encoded by pairs of ohnologues localize to distinct subcellular compartments (Marques et al., 2008). Despite such detailed studies, the rate of sequence evolution and, consequently, the date of the ancestral duplication event remain uncertain (Sugino and Innan, 2005). Comparing eight yeast species, it was concluded that immediately after the duplication event, duplicate gene pairs evolved at least three times faster than before, giving rise to a burst of protein sequence evolution, followed by a rapidly declining rate (Scannell and Wolfe, 2008).
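Asymmetric evolution of two ohnologues relative to a single-copy homologue can be tested in several ways. One simple option, shown here as a sketch and not as the method of the studies cited, is Tajima's relative rate test: over aligned sites, count the positions where only copy 1 differs from the outgroup (m1) and those where only copy 2 differs (m2), then compare (m1 − m2)² / (m1 + m2) to a chi-square distribution with one degree of freedom (3.84 at the 5% level). The sequences below are invented toy data and the function name is hypothetical.

```python
def tajima_relative_rate(copy1, copy2, outgroup, gap="-"):
    """Tajima-style relative rate test on three aligned sequences.

    copy1, copy2: aligned sequences of the two ohnologues.
    outgroup: aligned single-copy homologue from a protoploid species.
    Returns (m1, m2, chi2), where m1 and m2 count the sites at which
    only copy1 or only copy2 differs from the outgroup.
    """
    if not (len(copy1) == len(copy2) == len(outgroup)):
        raise ValueError("sequences must be aligned to the same length")
    m1 = m2 = 0
    for a, b, o in zip(copy1, copy2, outgroup):
        if gap in (a, b, o):
            continue  # ignore alignment gaps
        if a != o and b == o:
            m1 += 1
        elif b != o and a == o:
            m2 += 1
    total = m1 + m2
    chi2 = (m1 - m2) ** 2 / total if total else 0.0
    return m1, m2, chi2


# Toy alignment (20 residues): one copy has diverged much more than the other.
outgroup_seq = "MKTAYIAKQRQISFVKSHFS"
slow_copy = "MKTGYIAKQRQISFVKSHFS"
fast_copy = "MRTAFISKERNITFLKAHYT"

m_fast, m_slow, chi2 = tajima_relative_rate(fast_copy, slow_copy, outgroup_seq)
print(m_fast, m_slow, round(chi2, 2))
# A chi2 above 3.84 rejects equal rates at the 5% level (1 degree of freedom).
```

The test needs no substitution model, but on short alignments it has little power, so in practice it would be applied to full-length proteins or combined across many ohnologue pairs.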

6.3.3 A Bunch of Fermentative Engines Yeasts of the Saccharomyces sensu stricto complex were empirically domesticated long ago for their fermentative capabilities (Replansky et al., 2008). Two commercial wine strains of
S. cerevisiae (or their derivatives) have now been sequenced (Borneman et al., 2008; Novo et al., 2009), along with several strains isolated from vineyards (Doniger et al., 2008; Liti et al., 2009; http://www.broad.mit.edu/annotation/genome/saccharomyces_cerevisiae). Similarly, genomic studies have recently been used to characterize brewing yeasts. The lager beer yeast S. pastorianus is a hybrid between S. cerevisiae and S. bayanus or S. uvarum (Rainieri et al., 2006; Dunn and Sherlock, 2008; Nakao et al., 2009). Other beer strains appear as hybrids between S. cerevisiae and S. kudriavzevii (Gonzalez et al., 2008; Belloch et al., 2009). Interestingly, their genomes underwent nonreciprocal exchanges between the two parental genomes, accompanied by loss of chromosomal segments or even complete chromosomes, producing chimeras in which the non-cerevisiae genome is generally reduced, probably as a result of selective pressure during the fermentations. The variety of chimeras found suggests several independent hybridization events followed by distinct genome rearrangements, copy number variations, and ploidy differences. DNA sequence polymorphism further suggests that independent hybridization events occurred between different parental strains in different breweries or geographic locations. Introgression of large DNA fragments from S. paradoxus, S. kudriavzevii, or S. uvarum has also been found in the genome of S. cerevisiae wine strains (Naumova et al., 2005; Muller and McCusker, 2009). Although most chimeras involve Saccharomyces sensu stricto species, large fragments expressed during winemaking were found in the genome of a commercial wine strain, EC1118, that originate from non-Saccharomyces yeasts, one of which is Zygosaccharomyces bailii, a major contaminant of wine fermentation (Novo et al., 2009). Ecological proximity and selective pressures to adapt to high-sugar, low-nitrogen, and high-ethanol conditions probably favor such remodeling of the S. cerevisiae genome by exogenous genes, but this raises questions about the definition of species during natural evolution. Extensive karyotypic changes, depending on growth conditions, were observed in artificial hybrids between S. cerevisiae and S. paradoxus (Greig et al., 2002) and correlate with the loss of meiotic fertility, the criterion originally used to define species of the Saccharomyces sensu stricto complex (Naumov, 1987; Naumov et al., 2000).

6.3.4 Speciation and Species Definition Reproductive isolation between Saccharomyces sensu stricto species results from a combination of phenomena. Postzygotic isolation, that is, the lack of fertility of meiotic products from the hybrids, results in part from the Bateson–Dobzhansky–Muller incompatibility effect created by the postduplication differential gene loss, many meiotic products inheriting an incomplete gene set (Scannell et al., 2006). In addition, sequence divergence between parental genomes was suspected to lower meiotic fertility of hybrids because of the mismatch repair system (Chambers et al., 1996). Chromosomal translocations also play some role in meiotic infertility (Delneri et al., 2003; Liti et al., 2006) but do not alone explain speciation (Fischer et al., 2000). Recently, a single gene from S. bayanus was demonstrated to create incompatibility with S. cerevisiae mitochondria, causing hybrid sterility (Lee et al., 2008). But species isolation also results from prezygotic barriers. In experimental assays, germinating ascospores of S. cerevisiae and S. paradoxus showed a preference for own-species mating over interspecific mating (Maclean and Greig, 2008), while vegetative haploid cells of S. paradoxus did not (Murphy et al., 2006). The molecular basis for this phenomenon remains to be elucidated, but such a preference may play an important role in wild yeast populations, as mating in these species is suspected to occur primarily between
newly germinated ascospores. From the analysis of sequence polymorphism between several S. cerevisiae strains, it was calculated that this species undergoes outcrossing only once in every 50,000 asexual generations (Ruderfer et al., 2006). A similar figure was recently reported for S. paradoxus, with one sexual cycle every 1000 asexual generations and only 1% of them corresponding to outcrossing (Tsai et al., 2008). In conclusion, while admixture of genetic material between distantly related species may play an important role on an evolutionary timescale, Saccharomyces sensu stricto yeasts have evolved a reproductive behavior that limits the level of such exchanges, hence allowing the sympatric existence of distinct species (Sampaio and Gonçalves, 2008).
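
The two outcrossing estimates quoted above can be put on the same scale with a line of arithmetic. The sketch below only restates the figures given in the text, reading the 1% as the fraction of sexual cycles that involve outcrossing; the variable names are ours.

```python
from fractions import Fraction

# S. cerevisiae: one outcrossing event per 50,000 asexual generations (Ruderfer et al., 2006).
cerevisiae_gens_per_outcross = 50_000

# S. paradoxus: one sexual cycle per 1000 asexual generations,
# of which 1% are outcrossed (Tsai et al., 2008).
gens_per_sexual_cycle = 1_000
outcrossed_fraction = Fraction(1, 100)

# Asexual generations elapsed, on average, between two outcrossing events.
paradoxus_gens_per_outcross = gens_per_sexual_cycle / outcrossed_fraction

ratio = paradoxus_gens_per_outcross / cerevisiae_gens_per_outcross
print(paradoxus_gens_per_outcross, ratio)
```

Under this reading, S. paradoxus outcrosses about once per 100,000 asexual generations, within a factor of two of the S. cerevisiae estimate, which is why the two figures are described as similar.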

6.4 EVOLUTIONARY GENOMICS OF HEMIASCOMYCETES Although much more limited knowledge is available for other hemiascomycetous species than for S. cerevisiae, this subphylum of Ascomycota is presently one of the most extensively sequenced groups of eukaryotes, and comparative genomics within it has proven very useful not only to identify major evolutionary events of yeast history but also to place some of the mechanisms studied in S. cerevisiae back into a proper evolutionary perspective.

6.4.1 Distinct and Specific Genome Organization of Three Major Evolutionary Subdivisions of Hemiascomycetes From presently available genome sequences (Table 6.1), three major subdivisions of hemiascomycetes can be recognized (Figure 6.2). The Saccharomycetaceae family (or Saccharomyces complex; Kurtzman and Robnett, 2003) contains no fewer than 14 distinct clades identified by phylogenetic studies on selected gene sets (Kurtzman, 2003). Genome sequences are available for 10 of them. Yeasts from this family share common genomic signatures that distinguish them from other subdivisions. Among them is the presence of short centromeres, made of two conserved motifs important for kinetochore attachment and separated by AT-rich intervals that vary in size between species of distinct clades (Meraldi et al., 2006; Genolevures Consortium, 2009). All Saccharomycetaceae genomes sequenced so far show one such structure per chromosome (except S. castellii). A second genomic signature of the Saccharomycetaceae is the triplication of the mating-type cassette, ensuring the presence of the genetic information for both mating types in haploid genomes (see below). A third genomic signature of the Saccharomycetaceae is the presence of a single rDNA locus, internal to one chromosome arm, containing tandem copies of the rRNA genes (C. glabrata is an exception with two subtelomeric rDNA loci). Each tandem copy contains one 5S rRNA gene in the opposite orientation to the other genes (C. glabrata has two tandem copies of 5S rRNA genes per unit). Finally, the genetic content of the mitochondrial genome constitutes another signature that distinguishes the Saccharomycetaceae from the other subdivisions of Hemiascomycetes (see below). The second major subdivision of hemiascomycetous yeasts, recognized from common genomic traits, contains D. hansenii, P. stipitis, and several other species sequenced for medical relevance (Figure 6.2). 

Figure 6.2 Hemiascomycetous yeast genomics. Shown in color are species whose genome sequences are available as complete or draft assembly forms (bold type names), or partial coverage shotguns. Note that exploratory sequences, work in progress, and sequences not publicly available are ignored (see Table 6.1). Tree topology is adapted from Kurtzman and Robnett (2003), Diezmann et al. (2004), and Tsui et al. (2008). Branching of P. stipitis, P. guilliermondii, and C. lusitaniae is uncertain. Black names illustrate nonsequenced clades artificially grouped (dotted triangles) between the three major sequenced subdivisions (gray background) designated according to family names (Saccharomycetaceae, Dipodascaceae) or “CTG” to summarize the complex taxonomical classification of corresponding species. Major distinctive features of their genomes are summarized. ( ) Short centromeric motifs not found in S. castellii (de Montigny, personal communication). ( ) Two rDNA loci in C. glabrata. P: human (blue) or plant (green) pathogen. (See insert for color representation of this figure.)

Their genomes share signatures that distinguish them from those of the Saccharomycetaceae. Many of these yeasts have no known sexual cycle, hence their collective designation as Candida, irrespective of actual phylogeny. For lack of a simple taxonomy of this subdivision, we call it here the “CTG” complex since a major common feature of these yeasts is their genetic code deviation: the CUG codon specifies serine instead of the normal leucine, due to the existence of an abnormal serine tRNA with a CAG anticodon (see below). Yeasts from this subdivision have a single mating-type cassette. Other features that may characterize this subdivision have not been systematically studied (Table 6.2). D. hansenii has three subtelomeric rDNA loci within which each tandem repeat contains two copies of 5S rRNA genes (Dujon et al., 2004). C. albicans has long (about 3–4 kb) regional centromeres, lacking conserved sequence motifs and pericentric heterochromatin (Sanyal et al., 2004), and neocentromeres form efficiently (Ketel et al., 2009). Mitochondrial DNA of yeasts of the “CTG” complex contains genes for complex I subunits (see below). So far, Y. lipolytica stands alone in a third major subdivision, distinct from other hemiascomycetes. Its genome, which is nearly twice as large as that of other yeasts, with only a limited number of additional genes, is less compact and contains many more tRNA genes (Dujon et al., 2004). Protein-coding genes also have more introns, despite the fact that introns are still rare (Bon et al., 2003). In contrast to all the yeasts above, in which the gene for 5S RNA is part of the rDNA repeat unit, the 5S RNA genes are dispersed in the genome of
Y. lipolytica (as in many other eukaryotes). Interestingly, nearly half of them are found as dicistronic polymerase III fusions behind tRNA genes (Acker et al., 2008). Centromeres of Y. lipolytica have been experimentally deﬁned as short sequences necessary for autonomous replication of plasmids but do not contain the sequence motifs characteristic of Saccharomycetaceae (Vernis et al., 1997). Instead, the six centromeric sequences share a short consensus sequence forming a 17–21 bp imperfect palindrome (Yamane et al., 2008). The genomes of yeasts of these three subdivisions are so clearly differentiated that major evolutionary transitions, which remain to be understood, must have taken place at their origin. One reason for these large evolutionary divides between the three characterized subdivisions of hemiascomycetes may be the lack of genomic data from other branches, except the genome of P. angusta and the partial sequencing of D. bruxellensis. Given the power of novel sequencing technologies, this problem will likely be corrected rapidly. Another reason may be that intermediate branches no longer exist.

6.4.2 Comparison of Proteins: Pan- and Core-Proteomes Meanwhile, the wealth of genome sequences accumulated on a monophyletic group of eukaryotes such as the hemiascomycetes offers a unique opportunity to examine the content and evolution of its protein repertoire. Comprehensive pairwise sequence comparisons and clustering of sequence-predicted proteins have been performed several times independently using slightly different sets of yeasts (as sequencing progressed) and often including multicellular fungi as well. Deduced protein families were analyzed in a variety of ways. About 49,000 proteins from 9 yeast species spanning the entire evolutionary range of hemiascomycetes cluster into about 8000 families, a third of them being represented in all 9 species (Sherman et al., 2009; Genolevures Consortium, 2009). The Saccharomycetaceae alone have a total pan-proteome of about 5100 families and a common core-proteome of about 3300 families, representing over 80% of the proteome of each individual species, the complement being represented by species-speciﬁc proteins or those shared by some but not all yeast species. Nonuniversal proteins attract attention as potential determinants for the distinct physiological and metabolic properties of each species or clade. At the same time, they illustrate the intense gain and loss of genes during evolution (Dujon et al., 2004; Wapinski et al., 2007). A hierarchical clustering of the sequence-predicted proteomes from 13 hemiascomycetous species, 14 species of Pezizomycotina, 4 species of Basidiomycota, plus 1 Taphrinomycotina (S. pombe) and 1 Zygomycota (Rhizopus oryzae), revealed 466 clusters speciﬁc to the hemiascomycetes (Arvas et al., 2007). They represent a variety of cellular functions but are enriched in transcription and mitochondrion-related functions. By contrast, the Pezizomycotina-speciﬁc clusters are often involved in plant biomass degradation and secondary metabolism, and show signs of recent gene expansion.
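
Once proteins have been clustered into families, the pan- and core-proteomes discussed above reduce to a set union and a set intersection. The following is a minimal sketch under that assumption; the species labels and family identifiers are invented for illustration and do not correspond to any published clustering.

```python
# Pan- and core-proteome from per-species sets of protein family identifiers.
# All names and data below are hypothetical.


def pan_and_core(families_by_species: dict[str, set[str]]) -> tuple[set[str], set[str]]:
    """Return (pan, core): families found in any species and families found in all."""
    species_sets = list(families_by_species.values())
    pan = set().union(*species_sets)
    core = set.intersection(*species_sets) if species_sets else set()
    return pan, core


families = {
    "species_A": {"F1", "F2", "F3", "F4"},
    "species_B": {"F1", "F2", "F3", "F5"},
    "species_C": {"F1", "F2", "F3", "F4", "F6"},
}

pan, core = pan_and_core(families)

# Share of each species' families that belongs to the common core.
core_share = {name: len(core) / len(fams) for name, fams in families.items()}

print(sorted(pan), sorted(core), core_share)
```

The families outside the core are the nonuniversal ones, either species-specific (such as F5 and F6 here) or shared by some but not all species (such as F4).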

6.4.3 Genome Redundancy and Paralogues Building a minimal yeast genome, with one optimized gene per function, seems attractive to synthetic biology engineers. But nature has never done that. Instead, all yeast genomes contain numerous series of paralogous genes arising from ancient or recent duplications and leading to a significant “genome redundancy” (Table 6.2). In S. cerevisiae and related species, some of the paralogous gene pairs originate from the ancestral genome duplication (see above), but most of the overall genome redundancy is due to dispersed paralogues and
tandem gene arrays. In the Saccharomycetaceae species that have not inherited the genome duplication, about one-third of the protein-coding genes are members of multigene families (Genolevures Consortium, 2009), and in Y. lipolytica and the species of the CTG complex, this figure is significantly higher. There are several reasons for this universal redundancy. Recently duplicated genes (highly similar sequences) are often found in subtelomeric regions (Fairhead and Dujon, 2006). By contrast, internal segmental duplications are rare in yeast genomes (Genolevures Consortium, 2009). This paucity contrasts with the high frequency of spontaneous formation of segmental duplications in S. cerevisiae (Payen et al., 2008) and can be explained by their instability in the absence of evolutionary pressure to preserve them. On an evolutionary scale, however, the intense dynamics of formation and loss of segmental duplications may leave behind some duplicated gene copies and contribute to the numerous dispersed paralogues observed. This remains to be quantified. Tandem gene arrays also contribute a fraction of the overall redundancy of yeast genomes, but large gene clusters remain exceptional, and those that are functionally characterized correspond to adaptive evolution (expansion of gene families with functional diversification). Tandem gene arrays appear to be highly dynamic structures and are generally not conserved between species, with a few significant exceptions such as the genes encoding the B-type cyclins. Quantitatively, the dispersed paralogues account for the majority of the genome redundancy in all yeast species. Most of them are highly diverged in sequence and it is generally impossible to identify the type of duplication mechanism that was at their origin. 
Duplication of gene fragments at ectopic location is possible through Ty-mediated RNA intermediates (Schacherer et al., 2004), and it is possible that some dispersed gene paralogues are, actually, retrogenes originated from such a mechanism. Alternatively, map reshufﬂing after duplication of chromosomal segments is also a possible source of dispersed paralogues. Finally, it is not excluded that whole-genome duplications have occurred at some point of ancestry but left no recognizable signatures, such as dual synteny regions, in presently living yeasts because they are too ancient.

6.4.4 Conservation of Synteny Remarkably, hemiascomycetous yeasts have similar numbers of chromosomes in their haploid sets (6–8) despite their very large evolutionary range (Table 6.2). Saccharomycetaceae that underwent whole-genome duplication have twice as many chromosomes (13–16). The reason for this conservation is not well understood, but chromosome maps are also highly conserved over long evolutionary ranges. Although representatives of the three studied subdivisions of hemiascomycetes share minimal synteny conservation between themselves, reasonably long synteny blocks are conserved when members of the same subdivision are compared to one another. Genome rearrangement rates differ between lineages (Fischer et al., 2006) but are limited enough to ensure significant conservation of gene order over evolutionary periods corresponding to extensive sequence divergence (Genolevures Consortium, 2009). Species of the Saccharomyces sensu stricto complex have nearly identical chromosomal maps while their orthologous proteins may already differ by up to 20% amino acid replacement on average (Fischer et al., 2001; Dujon, 2006). Protoploid Saccharomycetaceae share synteny blocks of about 20 genes on average, that is, their genomes have experienced fewer than 300 rearrangements from their common ancestor (Genolevures Consortium, 2009). Superimposed on this slow rate of gross chromosomal rearrangements, microrearrangements, mostly small inversions, occur within conserved synteny blocks and contribute to gene adjacency breakage (Seoighe et al., 2000; Fischer et al., 2006).
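
The notion of a synteny block used above can be made concrete with a short sketch. It assumes a deliberately simplified setting that is not the procedure of the cited studies: a single chromosome per genome, one-to-one orthologues identified by a shared label, and blocks defined as runs of genes that are adjacent in both genomes, in either orientation. The gene labels are invented, and the figure of roughly 5000 genes used in the last lines is our order-of-magnitude assumption rather than a number taken from this passage.

```python
# Delimit synteny blocks between two gene orders (toy data, hypothetical labels).


def synteny_blocks(reference: list[str], other: list[str]) -> list[list[str]]:
    """Split the reference gene order into runs that are also contiguous in `other`."""
    position = {gene: i for i, gene in enumerate(other)}
    shared = [gene for gene in reference if gene in position]
    blocks: list[list[str]] = []
    current: list[str] = []
    direction = 0  # +1 or -1 once a block has at least two genes, 0 before that
    for gene in shared:
        if current:
            step = position[gene] - position[current[-1]]
            if step in (1, -1) and direction in (0, step):
                current.append(gene)
                direction = step
                continue
            blocks.append(current)  # adjacency broken: close the running block
        current = [gene]
        direction = 0
    if current:
        blocks.append(current)
    return blocks


reference = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
other = ["g1", "g2", "g3", "g6", "g5", "g4", "g7", "g8"]  # g4-g6 inverted

blocks = synteny_blocks(reference, other)
print(blocks)

# Rough consistency check of the figures in the text, under the stated assumption.
assumed_gene_count = 5000
mean_block_size = 20
expected_blocks = assumed_gene_count // mean_block_size
print(expected_blocks)
```

A single inversion splits the toy map into three blocks. With blocks of about 20 genes, a genome of the assumed size would be cut into a few hundred blocks, the same order of magnitude as the bound on rearrangements quoted above.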


6.4.5 Genes for Noncoding RNAs, Introns, and Genetic Code Variation Genes for noncoding RNA molecules other than tRNAs and rRNAs are not frequently annotated in genome sequences for lack of predictive features. In yeasts, annotation has been unequal depending on the species and on the class of noncoding RNA molecules. Thermodynamic prediction of noncoding RNA has been recently achieved for S. cerevisiae (Kavanaugh and Dietrich, 2009) and compared to previously available annotations. Comparative genomics was successfully used to identify many noncoding RNA genes in protoploid Saccharomycetaceae, but those for H/ACA snoRNA could not be systematically recognized (Genolevures Consortium, 2009). Besides occasional cases of recent duplications (identical sequences), snoRNA genes and snRNA genes are highly conserved, at least among all Saccharomycetaceae. But the structures of some of these RNA molecules exhibit an unanticipated degree of variation. It has long been known that S. cerevisiae has large insertions in regions of U1, U2, and U5 snRNAs that are otherwise highly constrained in both length and secondary structures. The lengthening of stem III of U1 snRNA appears to have started in the ancestry of both Saccharomycetaceae and members of the “CTG complex,” and contrasts with the shortening of the same stem in Y. lipolytica (Mitrovich and Guthrie, 2007). A variety of additional changes occurred in the various branches, some of which offer interesting examples of RNA and protein coevolution. This variability of snRNAs may be correlated with the limited number of spliceosomal introns, a common characteristic of all hemiascomycetous genomes (Table 6.2). The loss of introns, therefore, appears as an ancestral trait to the entire subphylum and, together with the loss of the machineries for RNA control (corresponding genes are not found in hemiascomycetous yeasts), must have played a critical role in subsequent evolution (see below). 
Compared to intron-rich ancestors of Opisthokonta, assumed to have had several introns per gene, intron loss started early in the fungal kingdom but accelerated at the origin of hemiascomycetes (Stajich et al., 2007). Thanks to their strongly conserved structural characteristics, tRNA genes were extensively annotated in many yeast genomes (Table 6.2). Many of them contain introns. The two- to threefold variation in their total number between Y. lipolytica and several other yeasts remains puzzling. The evolution of the tRNA set has been more extensively studied among the Saccharomycetaceae (Marck et al., 2006) and revealed a few surprises (see below). But the most striking evolutionary feature with regard to decoding strategy remains the alteration of the genetic code in the “CTG” subdivision of hemiascomycetes: the CUG codon speciﬁes serine instead of the normal leucine, due to the existence of an abnormal serine tRNA with a CAG anticodon (Miranda et al., 2006). Because such a change simultaneously reprogrammed the identity of several thousands of CUG codons, predicting a profound impact on their evolution, this phenomenon has attracted attention. It was found that an insertion of one adenosine in the intron of a Ser-tRNA (CGA), which normally decodes the UCG serine codon, generated a CAG anticodon, forming an inaccurate decoder, charged by both serine and leucine synthetases (Massey et al., 2003). This ambiguity lowered the negative effect of the CUG reassignment and made it possible. Two secondary mutations on each side of the anticodon subsequently lowered the ambiguity and resulted in the nearly complete reassignment of CUG codons to serine (only 3–5% leucine ambiguity remains). Phylogenetic analysis showed that such secondary mutations occurred approximately at the time of separation of the “CTG” subdivision from the Saccharomycetaceae, while ambiguous CUG decoding was ancestral to both evolutionary branches. 
The separation between the two major subdivisions of hemiascomycetous yeasts
was accompanied by extensive mutational changes of codons to preserve serine or leucine at essential positions in proteins (about 30,000 CUG codons of S. cerevisiae do not coincide with about 17,000 CUG codons of C. albicans).
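
The effect of the CUG reassignment on a coding sequence, and the codon-anticodon pairing behind it, can be illustrated as follows. This is a small self-contained sketch: the codon table is the standard one with a single override for CTG (which, to our knowledge, matches what NCBI lists as translation table 12), pairing is restricted to strict Watson-Crick rules without wobble or base modifications, and the short reading frame is invented.

```python
# Translate one reading frame with the standard code and with CUG read as serine.

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code, built from the usual compact TCAG-ordered representation.
STANDARD = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# "CTG" complex: identical to the standard code except that CUG specifies serine.
CTG_COMPLEX = {**STANDARD, "CTG": "S"}


def translate(sequence: str, table: dict[str, str]) -> str:
    """Translate complete codons of a DNA or RNA sequence, reading frame 1."""
    dna = sequence.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))


def codon_read_by(anticodon: str) -> str:
    """Codon (5'->3', RNA) pairing with an anticodon (5'->3') by Watson-Crick rules."""
    pair = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(pair[base] for base in reversed(anticodon.upper()))


toy_orf = "ATGCTGTCGTTGTAA"  # invented example containing one CTG codon

print(translate(toy_orf, STANDARD))
print(translate(toy_orf, CTG_COMPLEX))
print(codon_read_by("CGA"), codon_read_by("CAG"))
```

The same frame yields leucine or serine at the second position depending on the table, and the pairing helper shows why a serine tRNA whose anticodon changes from CGA to CAG shifts from reading UCG to reading CUG.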

6.4.6 Sex, Transposons, Plasmids, Inteins, and Horizontal Gene Transfer S. cerevisiae has developed a sophisticated mating system, whose molecular mechanisms have been intensively studied, while other yeast species with a similar gene set have no known sexual cycles (Knop, 2006; Muller et al., 2007). Several species of the CTG complex such as C. albicans are also asexual, while others such as L. elongisporus exhibit a complete sexual cycle, and both possess similar mating type loci (Table 6.1). As previously noted, the genomes of Saccharomycetaceae are characterized by a triplication of the mating cassettes, whose location on chromosomes varies between species (Muller et al., 2007). In some species, one cassette is missing from the assembled genome sequence (in L. kluyveri, two cassettes are missing; Payen et al., 2009). In most cases, therefore, haploid genomes carry the genetic information of both mating types (in E. gossypii, all three cassettes carry the same mating type; Dietrich et al., 2004) but express only one. The genomes of P. angusta and D. hansenii, both reported as homothallic, have a single cassette (as their phylogenetic position predicts) but it contains genes of both mating types simultaneously. Overall, yeasts seem to have evolved a variety of mechanisms to limit the genetic consequences of sex. Among Saccharomycetaceae, the recruitment of the site-specific endonuclease HO, probably from an ancestral LAGLIDADG homing endonuclease, determines the switching of mating types of haploid cells, hence favoring mating between isogenic members of the same clade (Butler et al., 2004; Fabre et al., 2005). A limited variety of transposons exists in yeast genomes. Although class I elements (retrotransposons) were originally discovered in S. cerevisiae and traces of such elements are observed in many hemiascomycetous genomes (both Ty1/copia type and Ty3/gypsy type), only a few species have complete elements (Goodwin and Poulter, 2000; Neuveglise et al., 2002). Non-LTR class I elements were also found in C. albicans (Goodwin et al., 2001). Class II elements have been found in Y. lipolytica and members of the CTG complex (Neuveglise et al., 2005) but are generally absent from Saccharomycetaceae with one exception recently identified among some protoploid species (Genolevures Consortium, 2009). Several extrachromosomal autonomously replicating elements have also been irregularly observed in yeasts, raising questions about their origin and function. Next to the intensively studied 2 μm circular plasmid of S. cerevisiae (Hartley and Donelson, 1980), occasionally found in related species as well, circular plasmids with similar structures have been described in Z. rouxii and Z. bisporus, but they do not show sequence conservation (Araki et al., 1985; Toh-e and Utatsu, 1985). Linear plasmids sharing some similarities are found in K. lactis and D. hansenii. Three cytoplasmic plasmids of unknown function are found in the salt-tolerating yeast D. hansenii and require high osmolarity for their proper replication at high temperature (Fukuda et al., 2004). In K. lactis, the two plasmids confer a killer phenotype to the cells (Hishinuma et al., 1984). A killer phenomenon also exists in S. cerevisiae and unrelated fungi, but in these cases it depends upon double-stranded RNA viruses (Schmitt and Breinig, 2006). This brief overview only illustrates the fact that we are far from having a comprehensive view of extrachromosomal elements in hemiascomycetes and of their evolutionary origin and role.


A slightly more comprehensive view exists for the inteins. These are protein inserts endowed with a self-catalytic protein splicing activity, required for their excision from the protein precursor and religation of the host gene product, and a homing-endonuclease activity, required for their initial insertion in the gene and their propagation into intein-free alleles of the population until ﬁxation is reached (Perler, 2005 and references therein). The ﬁrst intein ever discovered, VDE1, was found inserted within the vacuolar ATPase gene VMA1 of S. cerevisiae, and many inteins have been found in a variety of fungal genes since. VDE1 contains a homing endonuclease of the LAGLIDADG type, a class of enzymes originally discovered from a mitochondrial group I intron of S. cerevisiae (Dujon, 2005 and references therein). Among hemiascomycetes, inteins have been found so far in the glutamate synthase gene, GLT1, of P. guilliermondii and D. hansenii and in the threonyl-tRNA synthetase gene, ThrRS, of C. tropicalis and C. parapsilosis (Poulter et al., 2007). Given their evolutionary persistence and wide distribution in eukaryotes, inteins likely propagate by horizontal gene transfer, as also proposed for other selﬁsh elements, such as plasmids, mycoviruses, transposons, or group I introns mentioned above (Rosewich and Kistler, 2000). Interspeciﬁc crosses between more or less distantly related yeast species (Marinoni et al., 1999) may contribute to such genetic exchanges, as well as to introgression (see above). 
But cases of horizontal transfer of genes from bacterial origin have also been reported in a variety of yeast species, sometimes contributing to important functional innovations or reacquisitions (Dujon et al., 2004; Gojdovic et al., 2004; Hall et al., 2005; Hall and Dietrich, 2007; Wei et al., 2007; Woolﬁt et al., 2007; Fitzpatrick et al., 2008), and the systematic analysis of inserted genes within conserved synteny blocks suggests that the phenomenon is not rare (Rolland et al., 2009).

6.4.7 Mitochondrial Genomes and NUMTs Although mitochondrial genetics was born with S. cerevisiae (Dujon, 1981 and references therein), recent reports on evolutionary genomics of yeasts rarely mention mitochondrial genomes. The mitochondrial DNAs (mtDNA) of more than two dozen hemiascomycete species have, however, been fully sequenced (the list of species nearly coincides with the fully sequenced nuclear genomes, with few significant exceptions). They all encode cytochrome b, subunits I–III of cytochrome oxidase complex, subunits 6, 8, and 9 of the ATP synthase complex, the small and large rRNA subunits, and a set of tRNA genes sufficient to interpret the entire genetic code or, at least, the codons actually used (some species ignore some codon families). Genes contain group I and group II introns, many of which bear internal reading frames. From such a picture, it seems that not much remains to be learned about yeast mitochondrial genomes. However, mtDNA reinforces the subdivision of hemiascomycetes in an evolutionarily interesting manner. MtDNA of Y. lipolytica and all species of the “CTG complex” encode seven subunits of complex I of the respiratory chain, while none of the Saccharomycetaceae do so, this complex being totally missing. Instead, mtDNA from Saccharomycetaceae encode a highly variable protein, which is a constituent of the mitochondrial ribosome in S. cerevisiae, and the RNA moiety of mitochondrial RNase P (not always annotated). They also encode an abnormal Thr-tRNA (UAG), with an extra nucleotide in the anticodon loop, which reads CUN codons (normally leucine), hence adding one further change to the genetic code in addition to the reading of UGA as tryptophan and AUA as methionine. The perfect coincidence between mtDNA content and nuclear genome characteristics is highly suggestive of a major ancestral divergence at the separation between
Saccharomycetaceae and other hemiascomycetes, subsequently conserved for functional reasons. The recent report that the organization and content of the mtDNA of Candida zemplinina are typical of Saccharomycetaceae (closely resembling C. glabrata), while this species is phylogenetically closely related to Y. lipolytica by sequence similarity, remains to be further examined (Pramateftaki et al., 2008). It may be related to the difficulty of establishing phylogeny based on mitochondrial sequence divergence. Although the mitochondrial genome content is consistent with the phylogenetic subdivisions of hemiascomycetes, the size and shape of mtDNA vary extensively. While most species have circular mtDNA molecules, linear molecules with some kind of telomeric structure were found for C. parapsilosis, C. orthopsilosis, and C. metapsilosis (Nosek et al., 2004; Kosa et al., 2006). Sizes vary extensively, according to the sporadic occurrence of group I and group II introns, which is more a strain-specific trait than a characteristic of the species, but also because of the variable length of intergenic regions. Compared to the long mtDNA of S. cerevisiae (about 80 kb), many hemiascomycetes have mtDNA molecules of about 20–25 kb. Some species of the Nakaseomyces clade have mtDNA five times larger than that of C. glabrata, without change of gene content, due to expansion of intergenic regions and multiplication of GC-rich palindromes (Bouchier et al., 2009). Note, finally, that short fragments of mtDNA molecules (NUMTs) can be found in the chromosomes of yeasts where they represent a potentially mutagenic force and possibly a source of innovation over evolutionary times (Sacerdot et al., 2008). Such fragments were experimentally demonstrated to insert at double-strand breaks in S. cerevisiae chromosomes (Ricchetti et al., 1999).

6.5 SURPRISES Over the years, while genomics of hemiascomycetes progressed, the sequencing of novel species often continued to reveal surprises, indicating that we were and still are far from having observed and understood all evolutionary characteristics of these genomes (Figure 6.3). The first surprise was that S. cerevisiae originated from genome duplication (Wolfe and Shields, 1997). This was not anticipated despite the extensive genetic analyses performed before on this species. Although the mechanism and date of this event remain to be defined, it is clear that it concerns only a monophyletic subdivision of the Saccharomycetaceae made of six related clades. Another surprise was the extensive gene loss in C. glabrata (Dujon et al., 2004) and E. gossypii (Dietrich et al., 2004). Such losses were a posteriori explained by the adaptation of these species as human or plant pathogens, respectively. In C. glabrata, the corresponding loss of function has permitted a novel regulatory adaptation, involved in pathogenicity (Domergue et al., 2005). Also surprising was the significant excess of tandem gene arrays observed in D. hansenii (Dujon et al., 2004). This difference with other yeasts has not yet been explained, but species related to D. hansenii seem to share the same trend. The heterozygosity of the diploid genome of the asexual yeast C. albicans was not itself a surprise, but its irregular distribution along chromosomes is puzzling (Jones et al., 2004). The gradual loss of heterozygosity during commensal growth may explain this trait (Diogo et al., 2009). In C. glabrata, the large number of sequence repeats with unusually long motifs (designated megasatellites) was not anticipated (Thierry et al., 2008). Such repeats occur within protein-coding genes, consequently expanding the proteins by tandem arrays of repeated motifs about 45–140 amino acids long. No similar structures have been described in other yeasts so far, but some may have been overlooked because specific finishing procedures are required to properly assemble such sequences.


Chapter 6

Evolutionary Genomics of Yeasts

Figure 6.3

Major surprises in hemiascomycetous genomes. The figure summarizes unexpected observations and features of yeast genomes revealed by sequencing (see text for explanation). Surprises found in only one species (names in bold type) are shown on the right; features common to several species were attributed to evolutionary nodes (left) following the parsimony criterion (note that they have not always been verified for all species indicated). Black text: features of genome structure; blue text: protein-synthesizing machinery. (See insert for color representation of this figure.)

Recently, the genome of S. kluyveri surprised us by its heterogeneous composition, a phenomenon never observed in other yeasts. A segment about 1 Mb long (nearly coinciding with a chromosome arm) shows a GC content 12.5% higher than the rest of the genome (Genolevures Consortium, 2009). The origin of this heterogeneity is unknown, but the high-GC fragment, which contains a normal set of genes, is devoid of transposable elements (Payen et al., 2009). Finally, S. cerevisiae itself held further surprises when additional strains were sequenced: large DNA fragments were found that correspond to introgressions from other species. Although this phenomenon may explain the fermentative abilities of various strains, its mechanism remains unclear (Naumova et al., 2005; Muller and McCusker, 2009; Novo et al., 2009). Besides the above peculiarities of genome structure, the RNA-based translational machinery also offered us a number of surprises. The assignment of the CUG codon to serine instead of leucine, as mentioned above, was discovered before complete genome sequencing of the affected species (Miranda et al., 2006), but sequence analysis was subsequently used to complete the list of species concerned (Tekaia et al., 2000; Jeffries et al., 2007). An additional surprise with regard to the genetic code is that all hemiascomycetes studied so far, except Y. lipolytica, use the bacterial rule instead of the eukaryotic one to decode the four arginine codons CGN (the A34 base of tRNA molecules reads codons ending with U, C, and A at the third position, while other tRNA molecules with a C34 base read those ending with G) (Marck et al., 2006). A similar change to the bacterial mode occurred in the Saccharomycetaceae (but not in other hemiascomycetes) for reading the four leucine codons CUN. Interestingly, the reverse change (to the eukaryotic mode) is observed in S. castellii only. Some captures of tRNAs also took place during the evolution of hemiascomycetes. The Met-tRNA (CAT) of Yarrowia lipolytica differs from its homologues in other yeast species and may have originated from the capture of a Thr-tRNA (CGT). Similarly, all studied Saccharomycetaceae use an Arg-tRNA (CCG) suspected to have emerged from an Asp-tRNA (GTC). Finally, the discovery of transcriptional fusions between the 5S RNA gene and tRNA genes in Y. lipolytica (Acker et al., 2008) could not have been anticipated, even though 5S RNA genes are suspected to transpose behind a variety of repeated genes in other eukaryotes (Drouin and Moniz de Sa, 1995; Bergeron and Drouin, 2008).
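The decoding rule stated above for the arginine CGN box can be written out as a small lookup, which makes it easy to check that the two tRNA classes together cover all four codons without overlap. This is only an illustrative sketch: the labels "A34" and "C34" follow the wording of the text, the function names are hypothetical, and no other decoding rule (including the eukaryotic one) is modeled.

```python
# Third-position codon bases read by each tRNA class, keyed by the base at
# position 34 (first anticodon position), exactly as stated in the text for
# the four arginine codons CGN. Labels and function names are illustrative.
WOBBLE_READS = {
    "A34": {"U", "C", "A"},  # one tRNA reads codons ending in U, C, or A
    "C34": {"G"},            # a second tRNA reads the codon ending in G
}


def decoders(codon, wobble_reads=WOBBLE_READS):
    """Return the tRNA classes able to read the third base of `codon`."""
    third = codon[2]
    return sorted(name for name, bases in wobble_reads.items() if third in bases)


def covers_box(prefix, wobble_reads=WOBBLE_READS):
    """True if every codon prefix+N is read by exactly one tRNA class."""
    return all(len(decoders(prefix + n, wobble_reads)) == 1 for n in "UCAG")


if __name__ == "__main__":
    for n in "UCAG":
        print("CG" + n, decoders("CG" + n))
    print("CGN box fully covered:", covers_box("CG"))
```

Under this rule two tRNA species suffice for the whole four-codon box, which is the point of the observation reported by Marck et al. (2006).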

6.6 WHAT NEXT?

Despite the number of species already analyzed, much remains to be learned about the evolution of yeast genomes, judging from the frequent surprises that accompany novel sequences. But perhaps the greatest surprise of all is that despite their very long evolutionary history, yeasts remain yeasts, that is, fungal species able to propagate indefinitely in unicellular form without the need for multicellular fruiting bodies to complete their biological cycle. Some differentiation is visible at the genome level between the major subdivisions, and some specific properties characterize the species, but all propagate by budding in either haploid or diploid form, even if some can occasionally form pseudohyphae. Extrapolation of the spontaneous mutation rate precisely calculated for S. cerevisiae (Lang and Murray, 2008) to the considerable number of successive generations undergone by hemiascomycetes since their origin predicts that an enormous spectrum of mutational changes has been explored. Similarly, genome analysis showed us how yeasts can acquire novel functions or genetic elements through gene duplication, introgression, interspecific mating, and, occasionally, horizontal transfer. Why, then, have changes remained so limited over about 400–1000 Myr of their evolution? One possibility is that yeasts have been trapped in an evolutionary deadlock by the conjunction of several phenomena (Figure 6.4). Compared to other fungal branches, hemiascomycetes underwent considerable loss of spliceosomal introns, which limits exon shuffling, an important source of innovation in other eukaryotic branches (De Souza et al., 1996). They also do not show any trace of the RNA control machineries, which limits the possibilities of regulation at this level. In the absence of RNA-mediated events, yeast genomes may evolve rapidly but fail to produce novelties. Their very mode of propagation may add a further limitation to yeast evolution. In appropriate conditions, yeasts rapidly form large populations by clonal expansion. Although they have inherited a sophisticated mating system, many lineages seem to be completely asexual, and in others, like S. cerevisiae and related species, even greater sophistication has developed to limit outcrossing as far as possible while still performing meiotic cycles. The consequently limited genetic exchanges make gene loss a frequent, reiterative, and irreversible phenomenon, further confining yeast clones to a limited environment. Similarly, functional transposable elements tend to be lost, further limiting the possible creation of retrogenes, gene fusions, and exon shuffling.
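The extrapolation mentioned above (a per-base mutation rate multiplied by the number of successive generations) can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: the mutation rate and the number of generations per year are placeholder assumptions chosen for order-of-magnitude purposes, not values taken from Lang and Murray (2008) or from this chapter; only the 400–1000 Myr time span comes from the text, and the genome size is the approximate size of the S. cerevisiae nuclear genome.

```python
def expected_mutations(mu_per_base, genome_size_bp, years, generations_per_year):
    """Expected mutations accumulated along a single lineage.

    mu_per_base: mutations per base pair per generation
    genome_size_bp: haploid genome size in base pairs
    years, generations_per_year: define the number of successive generations
    """
    generations = years * generations_per_year
    return {
        "generations": generations,
        # expected number of mutational hits at any one site
        "per_site": mu_per_base * generations,
        # expected number of mutations anywhere in the genome
        "per_genome": mu_per_base * genome_size_bp * generations,
    }


# ASSUMPTION: placeholder order of magnitude, not the published estimate.
MU_PER_BASE = 1e-10
# Approximate size of the S. cerevisiae nuclear genome (about 12 Mb).
GENOME_SIZE_BP = 12_000_000
# ASSUMPTION: hypothetical long-term average for growth in nature.
GENERATIONS_PER_YEAR = 10

if __name__ == "__main__":
    for years in (400e6, 1000e6):  # time span quoted in the text
        r = expected_mutations(MU_PER_BASE, GENOME_SIZE_BP, years, GENERATIONS_PER_YEAR)
        print(f"{years / 1e6:.0f} Myr: {r['generations']:.1e} generations, "
              f"{r['per_site']:.2f} hits per site, "
              f"{r['per_genome']:.1e} mutations per lineage")
```

With these placeholder values, each site of the genome is expected to have been hit on the order of once along a single lineage, which illustrates why the limited extent of observed change calls for an explanation.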


Figure 6.4

A cartoon view of molecular evolution leading to yeasts. Hemiascomycetous yeasts (Saccharomycotina) are believed to have emerged from other subphyla of Ascomycota between 400 and 1000 Myr ago (Hedges et al., 2004; Taylor and Berbee, 2006). Given their average rate of asexual growth, this represents many billions of successive generations during which numerous evolutionary events of various types have occurred, increasing or decreasing their genomic complexity but keeping their original mode of unicellular propagation essentially unchanged while they adapted to a variety of distinct environments and specialized in metabolic and physiological functions. Major losses or gains that seem to have occurred in the ancestry of yeasts (as judged from lateral branches) are summarized. The drawing emphasizes the fact that, counting from the common ancestor of all Opisthokonta, Metazoa (hence man) are much more "primitive" than yeasts in the sense that they have accumulated fewer successive generations and have not undergone the irreversible losses mentioned. ( )Taphrinomycotina and Pezizomycotina only. (See insert for color representation of this figure.)

Within grown populations of unicellular organisms, what matters most is not the fitness or survival of individual cells (a critical parameter for the growth phase itself) but the presence of mutants or variants able to generate the next populations. The high frequency of mutation, duplication, and gene loss observed in yeasts, further enhanced in diploids and hybrids by extensive rearrangements of chromosomal segments, probably answers this need and subjects yeast populations to stochastic evolutionary events. The noise in protein expression observed between individual cells in a genetically homogeneous population appears to have been selected for the same reason (Newman et al., 2006). Arising from the stochastic production and destruction of mRNA molecules, this noise is much larger for functions involved in the response to environmental changes.

Additional data will obviously be needed to examine whether the evolutionary scheme proposed above has any chance of being correct, at least in part. As we are now reaching the stage of synthesizing new yeasts using modern technologies (Dymond et al., 2009), it is perhaps useful to consider carefully the experiments done by nature itself if one wants to create species that will be as friendly to us as yeasts have been over millennia. While we still need additional sequences of well-selected genomes to unravel further details of the evolution of the hemiascomycetes, the time may be ripe to explore the numerous species of yeasts that exist in other fungal phyla, particularly the Basidiomycota, which have so far received much less attention in genomic studies but in which many independent yeast forms of life have evolved.


ACKNOWLEDGMENTS

I am indebted to members of the Genolevures Consortium for stimulating discussions and access to unpublished data. This work was supported in part by grants from CNRS (GDR2354) and ANR (ANR-05-BLAN-0331). BD is a member of Institut Universitaire de France.

EPILOGUE

In the Darwin year, it did not escape my attention that the evolutionary scheme drawn in Figure 6.4 is not without similarity to the present status of the world's capitalist economies. Yeasts are only slightly ahead of us!

REFERENCES

ACKER, J., OZANNE, C., KACHOURI-LAFOND, R., GAILLARDIN, C., NEUVÉGLISE, C., and MARCK, C., 2008. Dicistronic tRNA–5S rRNA genes in Yarrowia lipolytica: an alternative TFIIIA-independent way for expression of 5S rRNA genes. Nucleic Acids Res. 36: 5832–5844.
ANDALIS, A.A., STORCHOVA, Z., STYLES, C., GALITSKI, T., PELLMAN, D., et al., 2004. Defects arising from whole-genome duplication in Saccharomyces cerevisiae. Genetics 167: 1109–1121.
ARAKI, H., JEARNPIPATKUL, A., TASUMI, H., SAKURAI, T., USHIO, K., et al., 1985. Molecular and functional organization of yeast plasmid pSR1. J. Mol. Biol. 182: 191–203.
ARVAS, M., KIVIOJA, T., MITCHELL, A., SALOHEIMO, M., USSERY, D., et al., 2007. Comparison of protein coding gene contents of the fungal phyla Pezizomycotina and Saccharomycotina. BMC Genomics 8: 325.
ASLETT, M. and WOOD, V., 2006. Gene ontology annotation status of the fission yeast genome: preliminary coverage approaches 100%. Yeast 23: 913–919.
BARNETT, J.A., 2003. Beginnings of microbiology and biochemistry: the contribution of yeast research. Microbiology 149: 557–567.
BARNETT, J.A., 2007. A history of research on yeasts 10: foundations of yeast genetics. Yeast 24: 799–845.
BELLOCH, C., PEREZ-TORRADO, R., GONZALEZ, S.S., PEREZ-ORTIN, J.E., GARCIA-MARTINEZ, J., et al., 2009. The chimerical genomes of natural hybrids between Saccharomyces cerevisiae and Saccharomyces kudriavzevii. Appl. Environ. Microbiol. 75: 2534–2544.
BERGERON, J. and DROUIN, G., 2008. The evolution of 5S ribosomal RNA genes linked to the rDNA units of fungal species. Curr. Genet. 54: 123–131.
BLANDIN, G., OZIER-KALOGEROPOULOS, O., WINCKER, P., ARTIGUENAVE, F., and DUJON, B., 2000. Genomic exploration of the hemiascomycetous yeasts: 16. Candida tropicalis. FEBS Lett. 487: 91–94.
BOEKHOUT, T., 2005. Gut feeling for yeasts. Nature 434: 449–451.

BON, E., NEUVÉGLISE, C., CASARÉGOLA, S., ARTIGUENAVE, F., WINCKER, P., et al., 2000a. Genomic exploration of the hemiascomycetous yeasts: 5. Saccharomyces bayanus var. uvarum. FEBS Lett. 487: 37–41.
BON, E., NEUVÉGLISE, C., LÉPINGLE, A., WINCKER, P., ARTIGUENAVE, F., et al., 2000b. Genomic exploration of the hemiascomycetous yeasts: 6. Saccharomyces exiguus. FEBS Lett. 487: 42–46.
BON, E., CASARÉGOLA, S., BLANDIN, G., LLORENTE, B., NEUVÉGLISE, C., et al., 2003. Molecular evolution of eukaryotic genomes: hemiascomycetous yeast spliceosomal introns. Nucleic Acids Res. 31: 1121–1135.
BORNEMAN, A.R., FORGAN, A.H., PRETORIUS, I.S., and CHAMBERS, P.J., 2008. Comparative genome analysis of a Saccharomyces cerevisiae wine strain. FEMS Yeast Res. 8: 1185–1195.
BOUCHIER, C., MA, L., CRÉNO, S., DUJON, B., and FAIRHEAD, C., 2009. Complete mitochondrial genomes of Kluyveromyces delphensis, Candida castellii and Kluyveromyces bacillisporus reveal invasion by GC clusters. FEMS Yeast Res. 9: 1283–1292.
BOYER, J., BADIS, G., FAIRHEAD, C., TALLA, E., HANTRAYE, F., et al., 2004. Large-scale exploration of growth inhibition caused by overexpression of genomic fragments in Saccharomyces cerevisiae. Genome Biol. 5 (9): R72.
BUTLER, G., KENNY, C., FAGAN, A., KURISCHKO, C., GAILLARDIN, C., and WOLFE, K.H., 2004. Evolution of the MAT locus and its HO endonuclease in yeast species. Proc. Natl. Acad. Sci. USA 101: 1632–1637.
BYRNE, K.P. and WOLFE, K.H., 2005. The Yeast Gene Order Browser: combining curated homology and syntenic context reveals gene fate in polyploid species. Genome Res. 15: 1456–1461.
BYRNE, K.P. and WOLFE, K.H., 2007. Consistent patterns of rate asymmetry and gene loss indicate widespread neofunctionalization of yeast genes after whole-genome duplication. Genetics 175: 1341–1350.
CASARÉGOLA, S., LÉPINGLE, A., BON, E., NEUVÉGLISE, C., NGUYEN, H.-V., et al., 2000. Genomic exploration of the hemiascomycetous yeasts: 7. Saccharomyces servazzii. FEBS Lett. 487: 47–51.
CHAMBERS, S.R., HUNTER, N., LOUIS, E.J., and BORTS, R.H., 1996. The mismatch repair system reduces meiotic homeologous recombination and stimulates recombination-dependent chromosome loss. Mol. Cell Biol. 16: 6110–6120.
CHERRY, J.M., BALL, C., WENG, S., JUVIK, G., SCHMIDT, R., et al., 1997. Genetic and physical maps of Saccharomyces cerevisiae. Nature 387 (Suppl.): 67–73.
CLIFTEN, P., SUDARSANAM, P., DESIKAN, A., FULTON, L., FULTON, B., et al., 2003. Finding functional features in Saccharomyces cerevisiae by phylogenetic footprinting. Science 301: 71–76.
CLIFTEN, P.F., FULTON, R.S., WILSON, R.K., and JOHNSTON, M., 2006. After the duplication: gene loss and adaptation in Saccharomyces genomes. Genetics 172: 863–872.
DAVIERWALA, A.P., HAYNES, J., LI, Z., BROST, R.L., ROBINSON, M.D., et al., 2005. The synthetic genetic interaction spectrum of essential genes. Nat. Genet. 37: 1147–1152.
De SOUZA, S.J., LONG, M., and GILBERT, W., 1996. Introns and gene evolution. Genes Cells 1: 493–505.
DEAN, E.J., DAVIS, J.C., DAVIS, R.W., and PETROV, D.A., 2008. Pervasive and persistent redundancy among duplicated genes in yeast. PLoS Genet. 4 (7): e1000113.
DELNERI, D., COLSON, I., GRAMMENOUDJI, S., ROBERTS, I.N., LOUIS, E.J., and OLIVER, S.G., 2003. Engineering evolution to study speciation in yeasts. Nature 422: 68–72.
DIETRICH, F.S., VOEGELI, S., BRACHAT, S., LERCH, A., GATES, K., STEINER, S., MOHR, C., PUHLMANN, R., LUEDI, P., CHOI, S., et al., 2004. The Ashbya gossypii genome as a tool for mapping the ancient Saccharomyces cerevisiae genome. Science 304: 304–307.
DIEZMANN, S., COX, C.J., SCHONIAN, G., VILGALYS, R.J., and MITCHELL, T.G., 2004. Phylogeny and evolution of medical species of Candida and related taxa: a multigenic analysis. J. Clin. Microbiol. 42: 5624–5635.
DIOGO, D., BOUCHIER, C., D'ENFERT, C., and BOUGNOUX, M.-E., 2009. Loss of heterozygosity in commensal isolates of the asexual diploid yeast Candida albicans. Fungal Genet. Biol. 46: 159–168.
DOMERGUE, R., CASTANO, I., De LAS PENAS, A., ZUPANCIC, M., LOCKATELL, V., et al., 2005. Nicotinic acid limitation regulates silencing of Candida adhesins during UTI. Science 308: 866–870.
DONIGER, S.W., KIM, H.S., SWAIN, D., CORCUERA, D., WILLIAMS, M., et al., 2008. A catalog of neutral and deleterious polymorphisms in yeast. PLoS Genet. 4: e1000183.
DROUIN, G. and MONIZ DE SA, M., 1995. The concerted evolution of 5S ribosomal genes linked to the repeat units of other multigene families. Mol. Biol. Evol. 12: 481–493.
DUJON, B., 1981. Mitochondrial genetics and function. In Molecular Biology of the Yeast Saccharomyces: Life Cycle and Inheritance (eds Strathern et al.). Cold Spring Harbor Laboratory Press, pp. 505–635.
DUJON, B., 1996. The yeast genome project: what did we learn? Trends Genet. 12: 263–270.
DUJON, B., 2005. Homing endonucleases and the yeast mitochondrial omega locus: a historical perspective. In Homing Endonucleases and Inteins (eds M.L. Belfort, B.L. Stoddard, D.W. Wood, and V. Derbyshire). Springer-Verlag, pp. 11–31.
DUJON, B., 2006. Yeasts illustrate the molecular mechanisms of eukaryotic genome evolution. Trends Genet. 22: 375–387.
DUJON, B., SHERMAN, D., FISCHER, G., DURRENS, P., CASAREGOLA, S., et al., 2004. Genome evolution in yeasts. Nature 430: 35–44.
DUNN, B. and SHERLOCK, G., 2008. Reconstruction of the genome origins and evolution of the hybrid lager yeast Saccharomyces pastorianus. Genome Res. 18: 1610–1623.
DYMOND, J.S., SCHEIFELE, L.Z., RICHARDSON, S., LEE, P., CHANDRASEGARAM, S., et al., 2009. Teaching synthetic biology, bioinformatics and engineering to undergraduates: the interdisciplinary build-a-genome course. Genetics 181: 13–21.
FABRE, E., MULLER, E., THERIZOLS, P., LAFONTAINE, I., DUJON, B., and FAIRHEAD, C., 2005. Comparative genomics in hemiascomycete yeasts: evolution of sex, silencing, and subtelomeres. Mol. Biol. Evol. 22: 856–873.
FAIRHEAD, C. and DUJON, B., 2006. Structure of Kluyveromyces lactis subtelomeres: duplication and gene content. FEMS Yeast Res. 6: 428–441.
FISCHER, G., JAMES, S.A., ROBERTS, I.N., OLIVER, S.G., and LOUIS, E.J., 2000. Chromosomal evolution in Saccharomyces. Nature 405: 451–454.
FISCHER, G., NEUVÉGLISE, C., DURRENS, P., GAILLARDIN, C., and DUJON, B., 2001. Evolution of gene order in the genomes of two related yeast species. Genome Res. 11: 2009–2019.
FISCHER, G., ROCHA, E.P., BRUNET, F., VERGASSOLA, M., and DUJON, B., 2006. Highly variable rates of genome rearrangements between hemiascomycetous yeast lineages. PLoS Genet. 2: e32.
FITZPATRICK, D.A., LOGUE, M.E., and BUTLER, G., 2008. Evidence of recent interkingdom horizontal gene transfer between bacteria and Candida parapsilosis. BMC Evol. Biol. 8: 181.
FUKUDA, K., JIN-SHAN, C., KAWANO, M., SUDO, K., and GUNGE, N., 2004. Stress responses of linear plasmids from Debaryomyces hansenii. FEMS Microbiol. Lett. 15: 243–248.
GAVIN, A.-C., BÖSCHE, M., KRAUSE, R., GRANDI, P., MARZIOCH, M., et al., 2002. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415: 141–147.
GELPERIN, D.M., WHITE, M.A., WILKINSON, M.L., KON, Y., KUNG, L.A., et al., 2005. Biochemical and genetic analysis of the yeast proteome with a movable ORF collection. Genes Dev. 19: 2816–2826.
SOUCIET, J.-L., DUJON, B., GAILLARDIN, C., JOHNSTON, M., BARET, P.V., CLIFTEN, P., SHERMAN, D.J., WEISSENBACH, J., WESTHOF, E., WINCKER, P., JUBIN, C., POULAIN, J., BARBE, V., SEGURENCE, B., ARTIGUENAVE, F., ANTHOUARD, V., VACHERIE, B., VAL, M.E., FULTON, R.S., MINX, P., WILSON, R., DURRENS, P., JEAN, G., MARCK, C., MARTIN, T., NIKOLSKI, M., ROLLAND, T., SERET, M.-L., CASAREGOLA, S., DESPONS, L., FAIRHEAD, C., FISCHER, G., LAFONTAINE, I., LEH, V., LEMAIRE, M., DE MONTIGNY, J., NEUVEGLISE, C., THIERRY, A., BLANC-LENFLE, I., BLEYKASTEN, C., DIFFELS, J., FRITSCH, E., FRANGEUL, L., GOEFFON, A., JAUNIAUX, N., KACHOURI-LAFOND, R., PAYEN, C., POTIER, S., PRIBYLOVA, L., OZANNE, C., RICHARD, G.-F., SACERDOT, C., STRAUB, M.-L., and TALLA, E., 2009. Comparative genomics of protoploid Saccharomycetaceae. Genome Res. 19: 1696–1709.
GIAEVER, G., CHU, A.M., NI, L., CONNELLY, C., RILES, L., et al., 2002. Functional profiling of the Saccharomyces cerevisiae genome. Nature 418: 387–391.
GOFFEAU, A., BARRELL, B.G., BUSSEY, H., DAVIS, R.W., DUJON, B., et al., 1996. Life with 6000 genes. Science 274: 563–567.
GOJKOVIC, Z., KNECHT, W., ZAMEITAT, E., WARNEBOLDT, J., COUTLIS, J.-B., et al., 2004. Horizontal gene transfer promoted evolution of the ability to propagate under anaerobic conditions in yeasts. Mol. Genet. Genomics 271: 387–393.
GONZALEZ, S., BARRIO, E., and QUEROL, A., 2008. Molecular characterization of new natural hybrids of Saccharomyces cerevisiae and S. kudriavzevii in brewing. Appl. Environ. Microbiol. 74: 2314–2320.
GOODWIN, T.J. and POULTER, R.T., 2000. Multiple LTR–retrotransposon families in the asexual yeast Candida albicans. Genome Res. 10: 174–191.
GOODWIN, T.J., ORMANDY, J.E., and POULTER, R.T., 2001. L1-like non-LTR retrotransposons in the yeast Candida albicans. Curr. Genet. 39: 83–91.
GREIG, D., LOUIS, E.J., BORTS, R.H., and TRAVISANO, M., 2002. Hybrid speciation in experimental populations of yeast. Science 298: 1773–1775.
GU, X., ZHANG, Z., and HUANG, W., 2005. Rapid evolution of expression and regulatory divergences after yeast gene duplication. Proc. Natl. Acad. Sci. USA 102: 707–712.
HALL, C., BRACHAT, S., and DIETRICH, F.S., 2005. Contribution of horizontal gene transfer to the evolution of Saccharomyces cerevisiae. Eukaryot. Cell 4: 1102–1115.
HALL, C. and DIETRICH, F.S., 2007. The reacquisition of biotin prototrophy in Saccharomyces cerevisiae involved horizontal gene transfer, gene duplication and gene clustering. Genetics 177: 2293–2307.
HARTLEY, J.L. and DONELSON, J.E., 1980. Nucleotide sequence of the yeast plasmid. Nature 286: 860–865.
HEDGES, S.B., BLAIR, J.E., VENTURI, M.L., and SHOE, J.L., 2004. A molecular timescale of eukaryote evolution and the rise of complex multicellular life. BMC Evol. Biol. 4: 2.
HIBBETT, D.S., BINDER, M., BISCHOFF, J.F., BLACKWELL, M., CANNON, P.F., et al., 2007. A higher-level phylogenetic classification of the fungi. Mycol. Res. 111: 509–547.
HISHINUMA, F., NAKAMURA, K., HIRAI, K., NISHIZAWA, R., GUNGE, N., and MAEDA, T., 1984. Cloning and nucleotide sequences of the linear DNA killer plasmids from yeast. Nucleic Acids Res. 12: 7581–7597.
HOWSON, R., HUH, W.-K., GHAEMMAGHAMI, S., FALVO, J.V., BOWER, K., et al., 2005. Construction, verification and experimental use of two epitope-tagged collections of budding yeast strains. Comp. Funct. Genomics 6: 2–16.


HU, Y., ROLFS, A., BHULLAR, B., MURTHY, T.V.S., ZHU, C., et al., 2007. Approaching a complete repository of sequence-verified protein-encoding clones for Saccharomyces cerevisiae. Genome Res. 17: 536–543.
JAMES, T.Y., KAUFF, F., SCHOCH, C.L., MATHENY, P.B., and HOFSTETTER, V., 2006. Reconstructing the early evolution of fungi using a six-gene phylogeny. Nature 443: 818–822.
JEFFRIES, T.W., GRIGORIEV, I.V., GRIMWOOD, J., LAPLAZA, J.M., AERTS, A., et al., 2007. Genome sequence of the lignocellulose-bioconverting and xylose-fermenting yeast Pichia stipitis. Nat. Biotechnol. 25: 319–326.
JONES, T., FEDERSPIEL, N.A., CHIBANA, H., DUNGAN, J., KALMAN, S., et al., 2004. The diploid genome sequence of Candida albicans. Proc. Natl. Acad. Sci. USA 101: 7329–7334.
KAVANAUGH, L.A. and DIETRICH, F.S., 2009. Non-coding RNA prediction and verification in Saccharomyces cerevisiae. PLoS Genet. 5 (1): e1000321.
KELLIS, M., PATTERSON, N., ENDRIZZI, M., BIRREN, B., and LANDER, E.S., 2003. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423: 241–254.
KELLIS, M., BIRREN, B.W., and LANDER, E.S., 2004. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 428: 617–624.
KETEL, C., WANG, H.S.W., MCCLELLAN, M., BOUCHONVILLE, K., SELMECKI, A., et al., 2009. Neocentromeres form efficiently at multiple possible loci in Candida albicans. PLoS Genet. 5 (3): e1000400.
KIM, S.-H. and YI, S.V., 2006. Correlated asymmetry of sequence and functional divergence between duplicate proteins of Saccharomyces cerevisiae. Mol. Biol. Evol. 23: 1068–1075.
KNOP, M., 2006. Evolution of the hemiascomycete yeasts: on the life styles and the importance of inbreeding. BioEssays 28: 696–708.
KOSA, P., VALACH, M., TOMASKA, L., WOLFE, K.H., and NOSEK, J., 2006. Complete DNA sequences of the mitochondrial genomes of the pathogenic yeasts Candida orthopsilosis and Candida metapsilosis: insight into the evolution of linear DNA genomes from mitochondrial telomere mutants. Nucleic Acids Res. 34: 2472–2481.
KROGAN, N.J., CAGNEY, G., GUO, X., IGNATCHENKO, A., LI, J., et al., 2006. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440: 637–643.
KURTZMAN, C.P., 2003. Phylogenetic circumscription of Saccharomyces, Kluyveromyces and other members of the Saccharomycetaceae, and the proposal of the new genera Lachancea, Nakaseomyces, Naumovia, Vanderwaltozyma and Zygotorulaspora. FEMS Yeast Res. 4: 233–245.
KURTZMAN, C.P. and FELL, J.W., 2000. The Yeasts, A Taxonomic Study. Elsevier, Amsterdam, NL, p. 1055.
KURTZMAN, C.P. and ROBNETT, C.J., 2003. Phylogenetic relationships among yeasts of the ‘Saccharomyces complex’ determined from multigene sequence analyses. FEMS Yeast Res. 3: 417–432.


KUTTY, S.N. and PHILIP, R., 2008. Marine yeasts: a review. Yeast 25: 465–483.
LAFONTAINE, I., FISCHER, G., TALLA, E., and DUJON, B., 2004. Gene relics in the genome of the yeast Saccharomyces cerevisiae. Gene 335: 1–17.
LANG, G.L. and MURRAY, A.W., 2008. Estimating the per-base mutation rate in the yeast Saccharomyces cerevisiae. Genetics 178: 67–82.
LEE, H.-Y., CHOU, J.-Y., CHEONG, L., CHANG, N.-H., YANG, S.-Y., et al., 2008. Incompatibility of nuclear and mitochondrial genomes causes hybrid sterility between two yeast species. Cell 135: 1065–1073.
LITI, G., BARTON, D.B.H., and LOUIS, E.J., 2006. Sequence diversity, reproductive isolation and species concepts in Saccharomyces. Genetics 174: 839–850.
LITI, G., CARTER, D.M., MOSES, A.M., WARRINGER, J., PARTS, L., et al., 2009. Population genomics of domestic and wild yeasts. Nature 458: 337–341.
LLORENTE, B., MALPERTUY, A., NEUVÉGLISE, C., de MONTIGNY, J., AIGLE, M., et al., 2000a. Genomic exploration of the hemiascomycetous yeasts: 18. Comparative analysis of chromosome maps and synteny with Saccharomyces cerevisiae. FEBS Lett. 487: 101–112.
LLORENTE, B., MALPERTUY, A., BLANDIN, G., ARTIGUENAVE, F., WINCKER, P., and DUJON, B., 2000b. Genomic exploration of the hemiascomycetous yeasts: 12. Kluyveromyces marxianus var. marxianus. FEBS Lett. 487: 71–75.
LOFTUS, B.J., FUNG, E., RONCAGLIA, P., ROWLEY, D., AMEDEO, P., et al., 2005. The genome of the basidiomycetous yeast and human pathogen Cryptococcus neoformans. Science 307: 1321–1324.
MACLEAN, C.J. and GREIG, D., 2008. Prezygotic reproductive isolation between Saccharomyces cerevisiae and Saccharomyces paradoxus. BMC Evol. Biol. 8: 1.
MALPERTUY, A., TEKAIA, F., CASARÉGOLA, S., AIGLE, M., ARTIGUENAVE, F., et al., 2000. Genomic exploration of the hemiascomycetous yeasts: 19. Ascomycete-specific genes. FEBS Lett. 487: 113–121.
MARCET-HOUBEN, M. and GABALDON, T., 2009. The tree versus the forest: the fungal tree of life and the topological diversity within the yeast phylome. PLoS One 4 (2): e4357.
MARCK, C., KACHOURI-LAFOND, R., LAFONTAINE, I., WESTHOF, E., DUJON, B., and GROSJEAN, H., 2006. The RNA polymerase III-dependent family of genes in hemiascomycetes: comparative RNomics, decoding strategies, transcription and evolutionary implications. Nucleic Acids Res. 34: 1816–1835.
MARINONI, G., MANUEL, M., PETERSEN, R.F., HVIDTFELDT, J., SLO, P., and PISKUR, J., 1999. Horizontal transfer of genetic material among Saccharomyces yeasts. J. Bacteriol. 181: 6488–6496.
MARQUES, A.C., VINCKENBOSCH, N., BRAWAND, D., and KAESSMANN, H., 2008. Functional diversification of duplicate genes through subcellular adaptation of encoded proteins. Genome Biol. 9: R54.
MARTIN, N., RUEDI, E.A., LEDUC, R., SUN, F.J., and CAETANO-ANOLLES, G., 2007. Gene-interleaving patterns of synteny in the Saccharomyces cerevisiae genome: are they proof of an ancient genome duplication event? Biol. Direct 2: 23.
MASSEY, S.E., MOURA, G., BELTRAO, P., ALMEIDA, R., GAREY, J.R., et al., 2003. Comparative evolutionary genomics unveils the molecular mechanism of reassignment of the CTG codon in Candida spp. Genome Res. 13: 544–557.
MERALDI, P., MCAINSH, A.D., RHEINBAY, E., and SORGER, P.K., 2006. Phylogenetic and structural analysis of centromeric DNA and kinetochore proteins. Genome Biol. 7: R23.
MEWES, H.W., ALBERMANN, K., BÄHR, M., FRISHMAN, D., GLEISSNER, A., et al., 1997. Overview of the yeast genome. Nature 387 (Suppl.): 7–65.
MIRANDA, I., SILVA, R., and SANTOS, M.A.S., 2006. Evolution of the genetic code in yeasts. Yeast 23: 203–213.
MITROVICH, Q.M. and GUTHRIE, C., 2007. Evolution of small nuclear RNAs in S. cerevisiae, C. albicans, and other hemiascomycetous yeasts. RNA 13: 2066–2080.
MNAIMNEH, S., DAVIERWALA, A.P., HAYNES, J., MOFFAT, J., PENG, W.-T., et al., 2004. Exploration of essential gene functions via titratable promoter alleles. Cell 118: 31–44.
MULLER, L.A.H. and MCCUSKER, J.H., 2009. A multispecies-based taxonomic microarray reveals interspecies hybridization and introgression in Saccharomyces cerevisiae. FEMS Yeast Res. 9: 143–152.
MÜLLER, H., HENNEQUIN, C., DUJON, B., and FAIRHEAD, C., 2007. Ascomycetes: the Candida MAT locus: comparing MAT in the genomes of hemiascomycetous yeasts. In Sex in Fungi (eds J. Heitman et al.). ASM Press, Washington, DC, pp. 185–201.
MURPHY, H., KUEHNE, H., FRANCIS, C., and SNIEGOWSKI, P., 2006. Mate choice assays and mating propensity differences in natural yeast populations. Biol. Lett. 2: 553–556.
NAKAO, Y., KANAMORI, T., ITOH, T., KODAMA, Y., RAINIERI, S., et al., 2009. Genome sequence of the lager brewing yeast, an interspecies hybrid. DNA Res., advanced e-publication.
NAUMOV, G.I., 1987. Genetic basis for classification and identification of the ascomycetous yeasts. Stud. Mycol. 30: 469–475.
NAUMOV, G.I., JAMES, S.A., NAUMOVA, E.S., LOUIS, E.J., and ROBERTS, I.N., 2000. Three new species in the Saccharomyces sensu stricto complex: Saccharomyces cariocanus, Saccharomyces kudriavzevii and Saccharomyces mikatae. Int. J. Syst. Evol. Microbiol. 50: 1931–1942.
NAUMOVA, E.S., NAUMOV, G.I., MASNEUF-POMAREDE, I., and AIGLE, M., 2005. Molecular genetic study of introgression between Saccharomyces bayanus and S. cerevisiae. Yeast 22: 1099–1115.
NEUVÉGLISE, C., FELDMAN, H., BON, E., GAILLARDIN, C., and CASARÉGOLA, S., 2002. Genomic evolution of the long terminal repeat retrotransposons in hemiascomycetous yeasts. Genome Res. 12: 930–943.
NEUVÉGLISE, C., CHALVET, F., WINCKER, P., GAILLARDIN, C., and CASARÉGOLA, S., 2005. Mutator-like element in the yeast Yarrowia lipolytica displays multiple alternative splicings. Eukaryot. Cell 4: 615–624.

NEWMAN, J.R.S., GHAEMMAGHAMI, S., IHMELS, J., BRESLOW, D.K., NOBLE, M., et al., 2006. Single-cell proteomic analysis of S. cerevisiae reveals the architecture of biological noise. Nature 441: 840–846.
NOSEK, J., NOVOTNA, M., HLAVATOVICOVA, Z., USSERY, D.W., FAJKUS, J., and TOMASKA, L., 2004. Complete DNA sequence of the linear mitochondrial genome of the pathogenic yeast Candida parapsilosis. Mol. Genet. Genomics 272: 173–180.
NOVO, M., BIGEY, F., BEYNE, E., GALEOTE, V., GAVORY, F., et al., 2009. Eukaryote-to-eukaryote gene transfer events revealed by the genome sequence of the wine yeast Saccharomyces cerevisiae EC1118. Proc. Natl. Acad. Sci. USA 106: 16333–16338.
OOI, S.L., PAN, X., PEYSER, B.D., YE, P., MELUH, P.B., et al., 2006. Global synthetic-lethality analysis and yeast functional profiling. Trends Genet. 22: 56–63.
PAN, X., YUAN, D.S., XIANG, D., WANG, X., SOOKHAI-MAHADEO, S., et al., 2004. A robust toolkit for functional profiling of the yeast genome. Mol. Cell 16: 487–496.
PAYEN, C., KOSZUL, R., DUJON, B., and FISCHER, G., 2008. Segmental duplications arise from pol32-dependent repair of broken forks through two alternative replication-based mechanisms. PLoS Genet. 4: e1000175.
PAYEN, C., FISCHER, G., MARCK, C., PROUX, C., SHERMAN, D.J., COPPEE, J.-Y., JOHNSTON, M., DUJON, B., and NEUVEGLISE, C., 2009. Unusual composition of a yeast chromosome arm is associated with its delayed replication. Genome Res. 19: 1710–1721.
PAPP, B., PAL, C., and HURST, L.D., 2003. Evolution of cis-regulatory elements in duplicated genes of yeast. Trends Genet. 19: 417–422.
PENA-CASTILLO, L. and HUGHES, T.R., 2007. Why are there still over 1000 uncharacterized yeast genes? Genetics 176: 7–14.
PERLER, F.B., 2005. Inteins: a historical perspective. In Homing Endonucleases and Inteins (eds M.L. Belfort, B.L. Stoddard, D.W. Wood, and V. Derbyshire). Springer-Verlag, pp. 193–210.
PIERCE, S.E., DAVIS, R.W., NISLOW, C., and GIAEVER, G., 2007. Genome-wide analysis of barcoded Saccharomyces cerevisiae gene-deletion mutants in pooled cultures. Nat. Protoc. 2: 2958–2974.
POULTER, R.T.M., GOODWIN, T.J.D., and BUTLER, M.I., 2007. The nuclear-encoded inteins of fungi. Fungal Genet. Biol. 44: 153–179.
PRAMATEFTAKI, P.V., KOUVELIS, V.N., LANARIDIS, P., and TYPAS, M.A., 2008. Complete mitochondrial genome sequence of the wine yeast Candida zemplinina; intraspecies distribution of a novel group-IIB1 intron with eubacterial affiliations. FEMS Yeast Res. 8: 311–327.
PRESSER, A., ELOWITZ, M.B., KELLIS, M., and KISHONY, R., 2008. The evolutionary dynamics of the Saccharomyces cerevisiae protein interaction network after duplication. Proc. Natl. Acad. Sci. USA 105: 950–954.
RAMEZANI-RAD, M., HOLLENBERG, C.P., LAUBER, J., WEDLER, H., GRIESS, E., et al., 2003. The Hansenula polymorpha (strain CBS4732) genome sequencing and analysis. FEMS Yeast Res. 4: 207–215.
RAINIERI, S., KODAMA, Y., KANEKO, Y., MIKATA, K., NAKAO, Y., et al., 2006. Pure and mixed genetic lines of Saccharomyces pastorianus and their contribution to the lager brewing strain genome. Appl. Environ. Microbiol. 72: 3968–3974.
REPLANSKY, T., KOUFOPANOU, V., GREIG, D., and BELL, G., 2008. Saccharomyces sensu stricto as a model system for evolution and ecology. Trends Ecol. Evol. 23: 494–501.
RICCHETTI, M., FAIRHEAD, C., and DUJON, B., 1999. Mitochondrial DNA repairs double-strand breaks in yeast chromosomes. Nature 402: 96–100.
ROLLAND, T., NEUVÉGLISE, C., SACERDOT, C., and DUJON, B., 2009. Insertion of horizontally transferred genes within conserved syntenic regions of yeast genomes. PLoS One 4 (8): e6515.
ROSEWICH, U.L. and KISTLER, H.C., 2000. Role of horizontal gene transfer in the evolution of fungi. Annu. Rev. Phytopathol. 38: 325–363.
RUDERFER, D.M., PRATT, S.C., SEIDEL, H.S., and KRUGLYAK, L., 2006. Population genomic analysis of outcrossing and recombination in yeast. Nat. Genet. 38: 1077–1081.
SACERDOT, C., CASARÉGOLA, S., LAFONTAINE, I., TEKAIA, F., DUJON, B., and OZIER-KALOGEROPOULOS, O., 2008. Promiscuous DNA in the nuclear genomes of hemiascomycetous yeasts. FEMS Yeast Res. 8: 846–857.
SAMPAIO, J.P. and GONÇALVES, P., 2008. Natural populations of Saccharomyces kudriavzevii in Portugal are associated with oak bark and are sympatric with S. cerevisiae and S. paradoxus. Appl. Environ. Microbiol. 74: 2144–2152.
SANYAL, K., BAUM, M., and CARBON, J., 2004. Centromeric DNA sequences in the pathogenic yeast Candida albicans are all different and unique. Proc. Natl. Acad. Sci. USA 101: 11374–11379.
SCANNELL, D.R. and WOLFE, K.H., 2008. A burst of protein sequence evolution and a prolonged period of asymmetric evolution follow gene duplication in yeast. Genome Res. 18: 137–147.
SCANNELL, D.R., BYRNE, K.P., GORDON, J.L., WONG, S., and WOLFE, K.H., 2006.
Multiple rounds of speciation associated with reciprocal gene loss in polyploid yeasts. Nature 440: 341–345. SCANNELL, D.R., FRANK, A.C., CONANT, G.C., BYRNE, K.P., WOOLFIT, M., and WOLFE, K.H., 2007. Independent sortingout of thousands of duplicated gene pairs in two yeast species descended from a whole-genome duplication. Proc. Natl. Acad. Sci. USA 104: 8397–8402. SCHACHERER, J., TOURETTE, Y., SOUCIET, J.-L., POTIER, S., and De MONITGNY, J., 2004. Recovery of a function involving gene duplication by retroposition in Saccharomyces cerevisiae. Genome Res. 14: 1291–1297. SCHACHERER, J., SHAPIRO, J.A., RUDERFER, D.M., and KRUGLYAK, L., 2009. Comprehensive polymorphism survey elucidates population structure of Saccharomyces cerevisiae. Nature 458: 342–346.

120

Chapter 6

Evolutionary Genomics of Yeasts

SCHMITT, M.J. and BREINIG, F., 2006. Yeast viral killer toxins: lethality and self-protection. Nature Rev. Microbiol. 4: 212–221. SEOIGHE, C., FEDERSPIEL, N., JONES, T., HANSEN, N., BIVOLAROVIC, V., et al., 2000. Prevalence of small inversions in yeast gene order evolution. Proc. Natl. Acad. Sci. USA 97: 14433–14437. SHERMAN, D.J., MARTIN, T., NIKOLSKI, M., CAYLA, C., SOUCIET, J.-L., et al., 2009. Genolevures: protein families and synteny among complete hemiascomycetous yeast proreomes and genomes. Nucleic Acids Res. 37: D550–D554. SOPKO, R., HUANG, D., PRESTON, N., CHUA, G., PAPP, B., et al., 2006. Mapping pathways and phenotypes by systematic gene overexpression. Mol. Cell 21: 319–330. SOUCIET, J.L., AIGLE, M., ARTIGUENAVE, F., BLANDIN, G., BOLOTIN-FUKUHARA, M., et al., 2000. Genomic exploration of the hemiascomycetous yeasts: 1. A set of yeast species for molecular evolution studies. FEBS Lett. 487: 3–12. STAJICH, J.E., DIETRICH, F.S., and ROY, S.W., 2007. Comparative genomic analysis of fungal genomes reveals intronrich ancestors. Genome Biol. 8 (10): R223. SUGINO, R. and INNAN, H., 2005. Estimating the time to the whole-genome duplication and the duration of concerted evolution via gene conversion in yeast. Genetics 171: 63–69. SUNNERHAGEN, P., 2002. Prospects for functional genomics in Schizosaccharomyces pombe. Curr. Genet. 42: 73–84. TAYLOR, J.W. and BERBEE, M.L., 2006. Dating divergence in the fungal tree of life: review and new analyses. Mycologia 98: 838–849. TEKAIA, F., BLANDIN, G., MALPERTUY, A., LLORENTE, B., DURRENS, P., et al., 2000. Genomic exploration of the hemiascomycetous yeasts: 3. Methods and strategies used for sequence analysis and annotation. FEBS Lett. 487: 17–30. THIERRY, A., BOUCHIER, C., DUJON, B., and RICHARD, G.-F., 2008. Megasatellites: a peculiar class of giant minisatellites in genes involved in cell adhesion and pathogenicity in Candida glabrata. Nuc. Acids Res. 36: 5970–5982. TOH-E, A. and UTATSU, I., 1985. 
Physical and functional structure of a yeast plasmid, pSB3, isolated from Zygosaccharomyces bisporus. Nucleic Acids Res. 13: 4267–4283. TONG, A.H. and BOONE, C., 2006. Synthetic genetic array analysis in Saccharomyces cerevisiae. Methods Mol. Biol. 313: 171–192. TSAI, I.J., BENSASSON, D., BURT, A., and KOUFOPANOU, V., 2008. Population genomics of the wild yeast Saccharomyces

paradoxus: quantifying the life cycle. Proc. Natl. Acad. Sci. USA 105: 4957–4962. TSUI, C.K.M., DANIEL, H.-M., ROBERT, V., and MEYER, W., 2008. Re-examining the phylogeny of clinically relevant Candida species and allied genera based on multigene analysis. FEMS Yeast Res. 8: 651–659. TURUNEN, O., SEELKE, R., and MACOSKO, J., 2009. In silico evidence for functional specialization after genome duplication in yeast. FEMS Yeast Res. 9: 16–31. VERNIS, L., ABBAS, A., CHASLES, M., GAILLARDIN, C.M., BRUN, C., et al., 1997. A origin of replication and a centromere are both needed to establish a replicative plasmid in the yeast Yarrowia lipolytica. Mol. Cell. Biol. 17: 1995–2004. VINOGRADOV, A.E. and ANATSKAYA, O.V., 2009. Loss of protein interactions and regulatory divergence in yeast wholegenome duplicates. Genomics 93: 534–542. WAPINSKI, I., PFEFFER, A., FRIEDMAN, N., and REGEV, A., 2007. Natural history and evolutionary principles of gene duplication in fungi. Nature 449: 54–64. WEI, W., MCCUSKER, J.H., HYMAN, R.W., JONES, T., NING, Y., et al., 2007. Genome sequencing and comparative analysis of Saccharomyces cerevisiae strain YJM789. Proc. Natl. Acad. Sci. USA 104: 12825–12830. WINZELER, E., SHOEMAKER, D.D., ASTROMOFF, A., LIANG, H., ANDERSON, K., et al., 1999. Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285: 901–906. WOLFE, K.H. and SHIELDS, D.C., 1997. Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387: 708–713. WOOD, V., GWILLIAM, R., RAJANDREAM, M.A., LYNE, R., STEWART, A., et al., 2002. The genome sequence of Schizosaccharomyces pombe. Nature 415: 871–880. WOOLFIT, M., ROZPEDOWSKA, E., PISKUR, J., and WOLFE, K.H., 2007. Genome survey sequencing of the wine spoilage yeast Dekkera (Brettanomyces) bruxellensis. Eukaryot. Cell 6: 721–733. XU, J., SAUNDERS, C.W., HU, P., GRANT, R.A., BOEKHOUT, T., et al., 2007. 
Dandruff-associated Malassezia genomes reveal convergent and divergent virulence traits shared with plant and human fungal pathogens. Proc. Natl. Acad. Sci. USA 104: 18730–18735. YAMANE, T., OGAWA, T., and MATSUOKA, M., 2008. Derivation of consensus sequence for protein binding site in Yarrowia lipolytica centromere. J. Biosci. Bioeng. 105: 671–674. ZHANG, Z. and KISHINO, H., 2004. Genomic background predicts the fate of duplicated genes: evidence from the yeast genome. Genetics 166: 1995–1999.

Part Two

Evolution of Molecular Repertoires

Chapter 7

Genotypes and Phenotypes in the Evolution of Molecules

Peter Schuster

7.1 THE LANDSCAPE PARADIGM
7.2 MOLECULAR PHENOTYPES
7.3 THE RNA MODEL
7.4 CONCLUSIONS AND OUTLOOK
ACKNOWLEDGMENTS
REFERENCES

7.1 THE LANDSCAPE PARADIGM

Genotypes are DNA or RNA sequences that, together with epigenetic and environmental influences, determine the unfolding of the phenotype. Commonly, this process is extremely complicated and, at least for the time being, escapes rigorous mathematical analysis and serious computer modeling. Nevertheless, the relations between genotypes and phenotypes play a fundamental role in biology and in its applications to pharmaceutical research and medicine. In particular, many questions concerning evolution and its mechanisms cannot be answered without an understanding of the phenotypic consequences of changes in the genotypes. Neglecting epigenetics and environmental change for the moment, genotypes and phenotypes play clearly defined, distinct parts in Darwinian evolution, which is understood as the interplay of variation and selection: all variations (mutations, recombination, and gene duplication) are changes in the polynucleotide sequences of the genotype, whereas the phenotype is the target of selection.

Historically, the idea of encapsulating genotype–phenotype relations in the postulate of a landscape in the theory of evolution is due to Wright (1932). He used the landscape metaphor to illustrate optimization in the sense of Darwin's natural selection: populations climb a fitness landscape through optimization of mean fitness, and in a stationary situation all species occupy local optima that correspond to the niches in an ecosystem. In the 1930s, one major problem of Wright's metaphor was that it remained unclear, in essence, what was to be plotted on the horizontal axes of the landscape, given that fitness is the vertical coordinate axis. The second and even more substantial criticism had already been raised by Ronald Fisher: the metaphor is built upon the assumptions of (i) constant fitness values and (ii) time-invariant fitness landscapes, which had to be made in order to guarantee the applicability of the theorem of natural selection (see, for example, Fisher's fundamental theorem of natural selection (Fisher, 1930, 1941) and the recent analysis of it (Okasha, 2008)).

Molecular biology revealed the structures of nucleic acids and proteins and provided a basis for handling genotypes and phenotypes by means of sound theoretical concepts. The notion of sequence space was introduced for nucleic acids (Eigen, 1971) and for proteins (Maynard Smith, 1970). The principle of sequence space construction is simple: a point $i$ is assigned to every sequence $X_i$, and the Hamming distance $d_{ij} = d_H(X_i, X_j)$ serves as metric.¹ The properties of sequence space are best illustrated by means of a buildup principle: the sequence space $Q_n^{(k)} = \{X; d_H\}$ is the set of all strings of length $n$ over an alphabet of $k$ digits. $Q_n^{(k)}$ may be constructed recursively by joining $k$ spaces of strings of length $n-1$, $Q_{n-1}^{(k)}$ (Figure 7.1). The construction principle is the same for any alphabet (binary, three-letter, four-letter), but the objects obtained are difficult to describe except in the case of binary sequences, where $Q_n^{(2)}$ is a hypercube of dimension $n$. Sequence spaces, in general, are high-dimensional objects (the dimension is $n(k-1)$), and low-dimensional, in particular two-dimensional, illustrations are frequently misleading.
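The buildup principle and the Hamming metric can be illustrated with a short Python sketch. The function names (`hamming`, `sequence_space`, `neighbors`) are illustrative choices and not part of the original text; the sketch enumerates the space explicitly, which is feasible only for very small $n$.

```python
def hamming(x: str, y: str) -> int:
    """Number of positions at which two aligned sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(x, y))


def sequence_space(n: int, alphabet: str) -> list[str]:
    """All strings of length n, built recursively from the space of length n - 1.

    Each of the k symbols is prefixed to every string of length n - 1,
    which joins k copies of the smaller space.
    """
    if n == 0:
        return [""]
    shorter = sequence_space(n - 1, alphabet)
    return [symbol + s for symbol in alphabet for s in shorter]


def neighbors(x: str, alphabet: str) -> list[str]:
    """All sequences at Hamming distance 1 from x (single point mutations)."""
    return [
        x[:i] + s + x[i + 1:]
        for i in range(len(x))
        for s in alphabet
        if s != x[i]
    ]


space = sequence_space(3, "AUGC")
print(len(space))                      # k**n = 4**3 sequences
print(len(neighbors("AUG", "AUGC")))   # n * (k - 1) neighbors
print(hamming("AUGC", "AGGA"))         # differs at positions 2 and 4
```

In this toy space every sequence has the same number of one-error neighbors, $n(k-1)$, and no two sequences are further apart than $n$ point mutations.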
Two features that are often misconceived are (i) all sequences in sequence space are (topologically) equivalent and hence have the same number of neighbors, so that there are no sequences in the interior of $Q$, and (ii) distances in high-dimensional spaces are small compared to those in low-dimensional spaces with the same number of nodes. A trivial but nevertheless important feature of sequence spaces is their connectedness. From every genotype, we can reach every arbitrarily chosen genotype through a series of successive point mutations whose number never exceeds $n$; in other words, the Hamming distance between two arbitrarily chosen sequences fulfils $d_H(X_i, X_j) \le n$ for all $(X_i, X_j) \in Q_n^{(k)}$ (independently of $k$).

Genotype–phenotype relations can be viewed as mappings from sequence space into a space of phenotypes $S_n = (S, d_s)$ that comprises all possible phenotypes and has some distance measure $d_s$ as metric. Fitness and other properties of phenotypes are thought to be expressed quantitatively by some function, $f_k = F(S_j)$, where $S_j$ is the phenotype formed by some genotype $X_i$: $S_j = C(X_i)$. Fitness values are represented by real numbers, $f_k \in \mathbb{R}^1$, with the common restriction to nonnegative values.

The definition of genotype or sequence space is only a minor first step toward an understanding of genotype–phenotype relations. The complexity of phenotypes and additional influences through epigenetics and environmental factors are currently prohibitive for useful constructions of phenotype spaces for whole cells or organisms. There are, however, examples of simpler mappings from sequence spaces into phenotypes or structures that are currently accessible by theory as well as by experiment (see also Lehmann (2005)). We mention two of them: (i) in vitro evolution of biomolecules with predefined functions, in particular nucleic acid molecules (Joyce, 2004; Klussmann, 2006) and proteins (Brakmann and Johnsson, 2002; Jäckel et al., 2008), and (ii) virus evolution (Domingo et al., 2008).
¹ The Hamming distance counts the number of positions in which two aligned sequences differ (Hamming, 1950, 1989). It is identical to the minimal number of (single) point mutations required to convert one sequence into the other.

Figure 7.1 Sequence spaces. The properties of sequence spaces are illustrated by means of a recursive construction principle. The sequence space for strings of chain length $n+1$, $Q_{n+1}^{(k)}$, is constructed from $k$ sequence spaces for strings of chain length $n$, $Q_n^{(k)}$, which are obtained by adding one symbol, (0 or 1) or (A or U or G or C), respectively, on the left-hand side of the string. Joining all pairs of sequences with Hamming distance $d_H = 1$ by a straight line yields the sequence space $Q_{n+1}^{(k)}$. The upper part of the figure deals with binary sequences: $Q_n^{(2)}$ is a hypercube of dimension $n$. The lower part of the figure indicates the same construction for natural four-letter sequences. The single-digit element, which is a straight line (and one-dimensional) for binary sequences, is a tetrahedron (and three-dimensional) in the four-digit case. The sequence space $Q_2^{(4)}$ for two-letter AUGC strings is a tetrahedron of tetrahedra (middle), a fairly complicated-looking object in six-dimensional space.

In both cases, the genotype is a polynucleotide that is short compared to the genomes of organisms. The numbers of possible genotypes in these relatively small sequence spaces are nevertheless huge compared to realistic population sizes: for chain lengths $n > 40$, the number of possible polynucleotide sequences exceeds Avogadro's number, and in protein sequence space this happens already at chain length $n > 19$. The enormous size of sequence spaces and the principal accessibility of every genotype by mutation, in essence, set the stage for evolutionary optimization.
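The size comparison with Avogadro's number is simple arithmetic and can be checked directly. The following Python sketch is an illustration added here, not part of the original text; the helper name `smallest_length_exceeding` is hypothetical, and the alphabet sizes 4 (nucleotides) and 20 (amino acids) are the usual ones.

```python
AVOGADRO = 6.02214076e23  # particles per mole (exact SI value)


def smallest_length_exceeding(k: int, bound: float) -> int:
    """Smallest chain length n for which k**n is larger than bound."""
    n = 0
    size = 1  # exact integer arithmetic for k**n
    while size <= bound:
        n += 1
        size *= k
    return n


n_rna = smallest_length_exceeding(4, AVOGADRO)       # four-letter alphabet
n_protein = smallest_length_exceeding(20, AVOGADRO)  # twenty-letter alphabet
print(n_rna, n_protein)
```

The thresholds come out at chain lengths of about 40 for polynucleotides and about 19 for proteins, in line with the magnitudes quoted in the text.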

7.2 MOLECULAR PHENOTYPES

The notion of molecular phenotype was already used in the analysis and interpretation of the first evolution experiments with RNA molecules in vitro (Mills et al., 1967; Spiegelman, 1971). It is commonly understood as the structure of biomolecules and the properties derived from the structure. In the case of polynucleotide evolution, the situation is particularly simple because genotype and phenotype are different features of the same molecule: the nucleotide sequence and the molecular structure with its properties, respectively. In directed evolution of proteins (Jäckel et al., 2008), the genotype is a DNA or RNA molecule and the phenotype is the protein molecule obtained by (transcription and) translation. In DNA display (Wrenn and Harbury, 2007), the sequences are coupled to small molecules from a library that can be created, for example, by combinatorial chemistry, and in this case the phenotype is the small molecule and its properties. Most of the currently adopted attempts to predict function for DNA, RNA, or protein sequences try to split the genotype–phenotype relation into two parts represented by mappings from sequence to structure and from structure to function:

$$\text{sequence} \;\xrightarrow{\,S_j = C(X_i)\,}\; \text{structure} \;\xrightarrow{\,f_k = F(S_j)\,}\; \text{function}. \tag{7.1}$$

The rationale underlying the two-step approach is that both the prediction of structure from known sequence and the prediction of function from known structure are easier problems than the one-step prediction of function from sequence. Indeed, folding biopolymer sequences into molecular structures and inferring functions from structures follow principles of molecular physics that are, in principle, known from structural chemistry, chemical kinetics, and thermodynamics.

7.2.1 Protein Structures

Historically, the concept that protein folding follows a straightforward and reversible downhill process and ends at the thermodynamically stable (and therefore uniquely characterized) conformation of the molecule was derived from early work on the protein bovine pancreatic ribonuclease A (Raines, 1998; Marshall et al., 2008). The sequence of this small protein, with a chain length of 124 amino acids, was determined by Stanford Moore and William Stein (Smyth et al., 1963; Moore and Stein, 1973), and only 4 years later the three-dimensional molecular structure of the protein was determined (Kartha et al., 1967; Wyckoff et al., 1967a, 1967b; Kim et al., 1992). The major breakthrough in understanding the folding of ribonuclease A came from the work of Christian Anfinsen (White, 1961; Anfinsen and Haber, 1961; Anfinsen et al., 1961; Anfinsen, 1973): the protein was denatured through breaking four disulfide bonds by reduction and complete unfolding. On oxidation by air, the molecule returned to its native conformation in extremely high yields. Anfinsen cast the findings on ribonuclease A folding and unfolding into three criteria called the thermodynamic hypothesis of protein structure: (i) uniqueness: the sequence has only one conformation of minimum free energy (mfe) and no other state lying energetically nearby; (ii) stability: small changes in the surrounding environment cannot result in substantial changes in the mfe conformation; and (iii) kinetic accessibility: a smooth free energy path leads from the unfolded random coil to the folded state. In essence, a two-state model considering a folded and an unfolded state of the molecule is sufficient to describe the observations. In more recent years, the kinetics of the catalytic mechanism of ribonuclease A has been studied in detail (Park and Raines, 2003).
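A two-state model of this kind has a simple equilibrium form: with an unfolding free energy $\Delta G$, the ratio of folded to unfolded molecules is $\exp(\Delta G / RT)$. The Python sketch below is an illustration under that textbook assumption; the function name `fraction_folded` and the free energy values in the loop are hypothetical and are not measured data for ribonuclease A.

```python
import math

R = 8.314462618e-3  # gas constant in kJ/(mol K)


def fraction_folded(delta_g_unfold: float, temperature: float = 298.15) -> float:
    """Equilibrium fraction of folded molecules in a two-state model.

    delta_g_unfold is G(unfolded) - G(folded) in kJ/mol, so positive
    values favor the folded state. temperature is in kelvin.
    """
    # Equilibrium ratio [folded]/[unfolded]
    k_fold = math.exp(delta_g_unfold / (R * temperature))
    return k_fold / (1.0 + k_fold)


# Hypothetical unfolding free energies in kJ/mol, for illustration only
for dg in (-10.0, 0.0, 10.0, 40.0):
    print(f"dG = {dg:6.1f} kJ/mol -> fraction folded = {fraction_folded(dg):.6f}")
```

At $\Delta G = 0$ both states are equally populated; a few tens of kJ/mol are enough to make the folded state dominate almost completely.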
The beautiful results obtained for ribonuclease suffer from the fact that biomolecules behaving like ribonuclease are a rather small minority in the universe of proteins. As a matter of fact, the majority of natural and artificial proteins behave differently, and all three criteria of the thermodynamic hypothesis are rarely fulfilled. A stable protein structure requires a subtle balance between hydrophilic and hydrophobic interactions, and accordingly the sequences from large sections of protein sequence space fail to form structures, because the polypeptides aggregate and the aggregates are insoluble in aqueous solution. Membrane proteins are an exception because they adopt their structures in a natural hydrophobic environment (for a review of the state of the art in membrane protein structure analysis see, for example, Elofson and von Heijne (2007)). Understanding protein structure and predicting structures from known sequences turned out to be extremely hard, has remained a major issue in biophysics for more than 30 years, and is still one of the hot topics.

In the late 1980s, a new concept for the interpretation of protein folding was developed that made use of an energy landscape (Onuchic et al., 1997; Dobson et al., 1998). The energy of the protein is plotted over the conformation space of the protein sequence. Conformation space is commonly continuous in chemistry; the coordinates are all bond lengths, bond angles, and dihedral angles that determine the structure of the molecule.² The notion of an energy landscape describing energy as a function of the coordinates of a molecule is a result of the Born–Oppenheimer approximation in quantum mechanics, which separates the motions of fast-moving electrons and slow nuclei. Accurate energy landscapes are accessible through computation for small molecules and small molecular aggregates. As more and more data become available, the empirical reconstruction of free energy landscapes of proteins at atomic resolution is coming within reach (Vendruscolo and Dobson, 2005). The numbers of degrees of freedom in conformation space are also hyperastronomical: considering only dihedral angles, in other words keeping bond lengths and bond angles constant, we estimate for the chain length of ribonuclease A, $n = 124$, some 1090 angular degrees of freedom leading to about $10^{150}$ local minima of the energy landscape. Levinthal (1968, 1969) formulated a paradox in view of these huge numbers of degrees of freedom: how can a protein manage to find the native conformation in a time interval as short as a millisecond when sequential sampling of conformations, one every picosecond, would take longer than the age of the universe?
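The arithmetic behind Levinthal's paradox can be checked with a few lines of Python. The sketch assumes about $10^{150}$ local minima as the number of conformations to be visited and one conformation sampled per picosecond, as in the text; the age of the universe (roughly 13.8 billion years) is a value supplied here, and the helper name `log10_search_time` is hypothetical. Logarithms are used to avoid handling astronomically large numbers directly.

```python
import math

SAMPLING_TIME_S = 1e-12      # one conformation per picosecond (from the text)
AGE_OF_UNIVERSE_S = 4.35e17  # roughly 13.8 billion years in seconds (assumed)


def log10_search_time(log10_conformations: float) -> float:
    """log10 of the time in seconds needed to visit every conformation once."""
    return log10_conformations + math.log10(SAMPLING_TIME_S)


log10_t = log10_search_time(150.0)
excess = log10_t - math.log10(AGE_OF_UNIVERSE_S)
print(f"exhaustive search: about 10^{log10_t:.0f} s")
print(f"longer than the age of the universe by about {excess:.0f} orders of magnitude")
```

Even if the assumed number of conformations were off by dozens of orders of magnitude, sequential sampling would remain hopeless, which is the point of the paradox.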
The answer to Levinthal's paradox is shown in Figure 7.2: the folding landscape has the shape of a funnel. Under folding conditions, (almost) all random coil conformations have conformations of lower free energy in their neighborhood, an enormously large number of trajectories lead to the target conformation, and hence only a negligibly small fraction of conformation space is sampled along an individual trajectory. The Anfinsen funnel (Figure 7.2, left-hand side) describes the idealized case of a fast-folding protein like ribonuclease A, whereas most proteins are characterized by a rugged folding landscape with a great number of local (free) energy minima (Figure 7.2, right-hand side). Many proteins need assistance for folding, which is provided in vivo by chaperonins: large protein assemblies with cavities inside which the unfolded protein finds its way into the native conformation (Horwich et al., 2007). In essence, the mechanism of protein folding is understood by now (Onuchic et al., 1997; Dobson et al., 1998; Dill et al., 2008; Service, 2008).

Conventionally, protein structure is described at four hierarchical levels: primary, secondary, tertiary, and quaternary structure. The primary structure is the amino acid sequence of the polypeptide chain, the secondary structure consists of regular structural elements formed through closure of hydrogen bonds of the polypeptide backbone, the tertiary structure is the 3D structure of a protein or a protein subunit, and the quaternary structure, finally, provides information on the numbers and spatial arrangements of protein subunits.³ Two notions of structural units of proteins are used in addition to the presented classification of structure: (i) A protein domain consists of a part of the protein sequence that can fold, exist, function, and evolve independently of the rest of the protein. Domains are highly variable with respect to chain length, which typically lies between 25 and 500 residues.
(ii) A structural motif is a 3D structural element or fold within the polypeptide chain that is transferable from one protein to other proteins. In folding, the polypeptide chain passes through a series of stages: (i) local interactions, in particular nuclei of α-helices and β or reverse turns, are introduced into the random coil at one of many more or less equivalent positions; (ii) secondary structures grow until about 30% of the contacts in the native state have been established and the molecule forms a so-called molten globule, a partially ordered structure with still substantial flexibility; (iii) further loss of conformational freedom induces transitions to more rigid states, sometimes of glassy nature; and (iv) the chain is confined to one of the narrow deep valleys corresponding either to the native structure or to one of the suboptimal conformations, which are usually inactive and therefore addressed as misfolded states. Conversion from a suboptimal state to the native conformation may be fast or slow depending on the barrier separating the valleys. It is commonly assumed that an ordered and rigid structure is required for efficient catalysis; a recent protein engineering study, however, produced a molten globule with a catalytic performance that is practically the same as that of the natural counterpart (Vamvaca et al., 2004).

Prediction of protein structure from known sequence is still a very hard task. Progress is monitored every 2 years by the Critical Assessment of Techniques for Protein Structure Prediction (CASP) contests: the two latest prediction evaluation meetings of the committees were CASP 6 and CASP 7 (Lattman, 2005; Trapane and Lattman, 2007). The progress within the past 2 years has been modest, but two changes were significant: (i) the gap between human prediction groups and automatic servers has been closed, and (ii) an improvement has been observed with template-based models resulting from the usage of multiple templates, template-free modeling in regions where no template is available, and refinement.

² In the case of model proteins on lattices and also for nucleic acid secondary structures, discrete conformation spaces are appropriate; see Section 7.3.

³ Subunits of proteins are defined as independent polypeptide chains; in other words, each subunit of a protein is characterized by a separate chain.

Figure 7.2 Energy landscapes of protein folding. The sketch of the landscape on the left-hand side corresponds to the Anfinsen dogma of protein folding: the unfolded random coil of the polypeptide sequence is converted smoothly into the unique and stable native structure, as observed with ribonuclease A. The sketch of the folding funnel on the right-hand side represents the more common case as observed with most proteins (Onuchic et al., 1997): the native structure is reached via various intermediates that are represented by molten globules, sometimes long-lived glassy states, and (discrete) suboptimal conformations, which act as folding traps. The abscissa axis in both sketches is an appropriate cross section of conformational space. The factor $Q$ is the fraction of native-like contacts. Typically, $Q = 0.3$ for molten globules, $Q = 0.6$ in the transition region, and $Q = 0.7$ in the range of glass transitions. "Entropy" and "energy" are put in quotation marks because they are just illustrations implying that a wide funnel sustains a larger ensemble of trajectories leading to the target state, and the depth of the funnel is a measure of the stability of the native state. The majority of entropic contributions are not encapsulated in the width of the funnel, and commonly the quantity on the ordinate axis is not pure energy but Gibbs free energy lacking entropy contributions from those degrees of freedom that are illustrated on the abscissa axis.

7.2.2 Nucleic Acid Structures

Folding of random coil polynucleotide chains into DNA and RNA structures has been studied far less frequently than protein folding: there are more than 40,000 protein-only structures in the Protein Data Bank compared to 575 deposited RNA-only structures. Nevertheless, the current understanding of nucleic acid structures is not far behind our knowledge of proteins. There are three main reasons for this: (i) nucleic acids are polyelectrolytes and hence almost always soluble in water, (ii) the structures of nucleic acids fall into two distinct classes, double helical duplexes and single-stranded structures, and (iii) the dominant contribution to the stability of structures is the interaction of base pairs in double helical stacks. Indeed, formation of stacked base pairs is the major driving force for folding single-stranded nucleic acid molecules into structures, as it is for the formation of duplexes. Although DNA in nature is almost always double-stranded and RNA mostly single-stranded, both nucleic acids can and do exist in both forms. Examples are deoxyribozymes, which are single-stranded, catalytically active DNA molecules (Breaker, 1999), double-stranded RNA viruses, and double-stranded RNA in the regulation of gene expression through RNA interference (McManus and Sharp, 2002). The most important issue of double-stranded DNA is the sequence dependence of double helical (B-DNA) structures, which are the key to protein recognition. Empirical data-based duplex structure prediction from known local DNA sequences has been successful (Packer et al., 2000a, 2000b; Gardiner et al., 2003). Important issues of higher order structures in cyclic DNA concern supercoiling, catenation, and other topological properties (Benham and Mielke, 2005). In this review, we shall not discuss duplex structures further but concentrate on conformations of single-stranded (RNA) molecules.
Polyelectrolytes require counterions, which inﬂuence structure, and accordingly the structures of nucleic acids depend on ionic strength as well as the nature of the ions. This had already been known in the early days of modeling DNA double helical structures from ﬁber diffraction data and turned out to be particularly important for most full 3D RNA structures that are formed only when divalent Mg2 þ is present in the solution. Metal ions are also known to occur as elements of protein structure—a well-known example is Zn2 þ in zinc ﬁngers (Klug and Rhodes, 1987; Klug, 1999)—but more frequently they play an essential part in the catalytic function of proteins. Like in proteins, RNA structures can be partitioned into primary, secondary, tertiary, and quaternary structure elements. The primary structure is the nucleotide sequence, the secondary structure, in essence, is a listing of Watson–Crick and GU wobble base pairs and consists of a small number of motifs that can be combined with little restrictions only, and the tertiary structure comprises additional interactions in RNA structure that place the secondary structure elements in 3D space. These interactions are often characterized as tertiary structural motifs. Commonly, the introduction of tertiary interactions keeps secondary structures unchanged but in rare cases tertiary structure

130

Chapter 7

Genotypes and Phenotypes in the Evolution of Molecules

formation causes secondary structure rearrangements (Wu and Tinoco, 1998). The quaternary structure is deﬁned as in proteins but plays only a minor role except in RNA–protein complexes, for example, in virions or cellular complexes like the ribosome. RNA structure analysis and prediction is facilitated by the existence of motifs at all structural levels (Moore, 1999; Hendrix et al., 2005; Holbrook, 2008). Secondary structure motifs fall into four classes (Figure 7.3): (i) stacks, (ii) loops, (iii) joints, and (iv) free ends. In essence, the stacks provide the (only) stabilizing contributions to RNA structure, whereas the other elements are accompanied with positive free energy contributions. Loops are

Figure 7.3

Modules of RNA secondary structures. Stacks (blue) consist of base pairs combined in Watson–Crick-type double helices. Hairpin loops (red) terminate stacks, bulges and internal loops (pink and magenta) are adjacent to two stacks, and multiloops (violet) combine three or more stacks. A joint (brown) is an element joining two otherwise independent parts of the structure, and free ends (orange) are mobile single strands at the 5′-end and/or the 3′-end of the RNA. Below the conventional representation of the secondary structures, we show an equivalent representation of structures by parentheses and dots: parentheses symbolize base pairs, the opening parenthesis is nearer to the 5′-end and the closing parenthesis is nearer to the 3′-end, and the dots stand for unpaired nucleotides. As with sequences, the 5′-end is on the left-hand side and the 3′-end on the right-hand side of the parentheses string. The assignment of parentheses to base pairs follows the mathematical notation. (See insert for color representation of this figure.)


single-stranded elements attached to stacks: a hairpin loop is attached to a single stack, a bulge or an internal loop to two stacks, and a multiloop to three or more stacks. Small hairpin loops commonly lead to large positive free energy contributions because several degrees of freedom are frozen when the loop is closed. Exceptions include especially stable tetraloops, where a favorable geometry allows for additional base stacking (Antao et al., 1991; Antao and Tinoco, 1992). Joints are single strands combining two otherwise independent motifs; if the joint is cut, the RNA is partitioned into two unconnected molecules. Free ends, finally, are single-stranded stretches at the 5′-end or the 3′-end of the RNA molecule. Joints and free ends are characterized by high conformational flexibility. As in proteins, composite motifs are also found in RNA. As an example, we mention the kink-turn motif (Klein et al., 2001), which is a combination of two stacks and a bulge or an internal loop between them. Under certain constraints on loop size and RNA sequence, the result is a sharp turn of the ribose–phosphate backbone and an acute angle formed by the axes of the double helices. The conventional definition of RNA secondary structure excludes pseudoknots (see Figure 7.4 and Section 7.3). RNA secondary structures are much more important than protein secondary structures because every nucleotide is contained in a secondary structure motif and secondary structure formation commonly accounts for the major part of the free energy of folding. Tertiary motifs are larger in number and richer in diversity than secondary structure motifs (Holbrook, 2008). A systematic nomenclature of base pairs allows for a classification of non-Watson–Crick-type nucleotide–nucleotide interactions (Leontis and Westhof, 2001).
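The parentheses-and-dots representation introduced in Figure 7.3 can be decoded mechanically with a stack. The following is a minimal sketch of that idea, not a routine from the chapter; the function names `parse_dot_bracket` and `hairpin_loops` are illustrative assumptions, and positions are counted from zero at the 5′-end.

```python
def parse_dot_bracket(structure):
    """Return the sorted list of base pairs (i, j), 0-based with i < j."""
    open_positions = []  # stack of unmatched opening parentheses
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(pos)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((open_positions.pop(), pos))
        elif symbol != ".":
            # Any other symbol (e.g. '[' for a pseudoknot) is rejected here.
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


def hairpin_loops(structure):
    """Return the base pairs that close a hairpin loop (only dots inside)."""
    return [(i, j) for i, j in parse_dot_bracket(structure)
            if set(structure[i + 1:j]) <= {"."}]


if __name__ == "__main__":
    example = "((...))..(((....)))"
    print(parse_dot_bracket(example))
    print(hairpin_loops(example))
```

Because every closing parenthesis is matched with the most recent unmatched opening one, the pairs returned are nested by construction, which is why pseudoknots cannot be written in this notation without extra symbols.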
The search for tertiary RNA motifs has been very successful so far (Leontis and Westhof, 2003; Lescoute et al., 2005; Lescoute and Westhof, 2006) and continues on a worldwide basis (Leontis et al., 2006). The most common and overall structure-dominating motif is end-to-end base pair stacking of helices, also called continuous interhelical base stacking (COIN stacking). It combines stacks into elongated double-helical stretches. A well-known example is found in tRNAs, whose 3D structure was first determined for phenylalanyl-tRNA (tRNAPhe) (Rich and RajBhandary, 1976): the stacks terminated by the dihydro-U-loop and the anticodon loop form one extended helix, and so do the stack of the TΨC-loop and the terminal stack carrying the CCA end. The

Figure 7.4

Pseudoknots in RNA structures. Pseudoknots are structures with Watson–Crick base pairs that cannot be cast into the parenthesis representation without violating the mathematical notation. Parentheses cannot be assigned unambiguously to the base pairs without the use of colors. The figure sketches two classes: (i) the hairpin-type (H-type) pseudoknot (left-hand side), where a hairpin is involved in downstream base pairing, and (ii) the kissing loops motif (right-hand side), in which two hairpin loops form a stack. Colored parentheses representations are shown below the figures. (See insert for color representation of this figure.)


Figure 7.5

Tertiary interactions in tRNA structures. The figure on the left-hand side shows the conventional cloverleaf secondary structure of phenylalanyl transfer RNA (tRNAPhe). Continuous interhelical base stacking shapes the molecule into an “L”. The stack closed by the dihydro-U-loop (green) associates end-on-end with the anticodon stack (red), and the nucleotide between the two stacks, G26, forms a non-Watson–Crick base pair with A44. Similarly, the stack of the TΨC-loop (blue) is coaxial with the terminal stack (violet) with one regular AU base pair in between. Other tertiary interactions further stabilizing the “L”-structure are shown as broken gray lines. (See insert for color representation of this figure.)

“L”-shaped tRNA structure is stabilized by four Mg²⁺ ions binding to specific sites and by a number of tertiary interactions involving a pseudoknot, non-Watson–Crick base pairs, base intercalation, and binding to the 2′-OH of the ribose moieties (Figure 7.5). Studies on randomized genes have shown that the reverse-Hoogsteen base pair bridging the TΨC-loop (T54=A58) is essential for the rigid and strong contact between the dihydro-U-loop and the TΨC-loop, and that this base pair is needed together with other interactions for the maintenance of the “L”-shape (Zagryadskaya et al., 2004). Kinetic folding of RNA molecules follows principles similar to those of kinetic protein folding. The process is initiated by local folding of structural nuclei of stacks at several positions of the RNA sequence, and then the stacks grow until the still flexible secondary structure is formed. Introduction of tertiary contacts and addition of Mg²⁺ cations result in the full 3D structure of the molecule. Although the kinetic details of hairpin formation are quite involved (Chen, 2008), the overall kinetics can be described well as a cooperative process (Pörschke and Eigen, 1971; Pörschke, 1974, 1977) and modeled by straightforward algorithms (Flamm et al., 1999) or computed by Arrhenius kinetics (Wolfinger et al., 2004). It is worth mentioning the highly promising single-molecule techniques that are steadily providing additional information on biopolymer structures and structure formation. Techniques successfully applied to RNA and protein folding are atomic force microscopy and fluorescence techniques (Borgia et al., 2008; Li et al., 2008).

7.3 THE RNA MODEL

The landscape metaphor introduced in Section 7.1 requires either empirical data or a realistic model in order to test its applicability to RNA evolution and optimization of molecular properties. The RNA model is based on two different inputs: (i) the kinetic theory of molecular evolution (Eigen, 1971; Eigen and Schuster, 1977, 1978; Eigen et al., 1989) provides the tool for the analysis of evolutionary dynamics at the molecular level and (ii) folding of RNA sequences into secondary structures yields simplified biomolecular structures that are suitable for the computation of parameters (Schuster, 2003). The relation between RNA sequences and secondary structures is used for modeling fitness landscapes of evolutionary optimization since secondary structures are physically well defined and meaningful and, at the same time, accessible to rigorous mathematical analysis (Schuster, 2006). In particular, RNA secondary structures allow the introduction of most features of real structures in a straightforward and analyzable way.

7.3.1 RNA Replication and Mutation

Evolution of RNA molecules in the test tube represents the simplest system that fulfils the criteria for Darwinian evolution: (i) multiplication, (ii) variation, and (iii) selection. Evolutionary studies of RNA molecules in the test tube were initiated in the 1960s by Sol Spiegelman and his group (Mills et al., 1967; Spiegelman, 1971), and the field has remained highly active ever since (Watts and Schwarz, 1997; Joyce, 2004). The kinetics of RNA replication by means of viral replicases has been studied in great detail (Biebricher et al., 1983, 1984, 1985). Although RNA replication follows a complicated multistep reaction mechanism, the overall kinetics under suitable conditions consisting of excess replicase and nucleotide triphosphates can be described by simple exponential growth (Figure 7.6). In this phase, complementary replication shown in Figure 7.7 can be represented by a simple two-step mechanism

$$X_{+} \xrightarrow{\;f_{+}\;} X_{-} + X_{+} \quad\text{and}\quad X_{-} \xrightarrow{\;f_{-}\;} X_{+} + X_{-}. \tag{7.2}$$

[Figure 7.6 plot: concentration of RNA, c(t), versus time t, with three phases: exponential, c(t) = c(0) e^{kt}; linear, c(t) = c(0) + k′t; saturation or product inhibition, c(t) = c_max (1 − a e^{−k″t}).]

Figure 7.6

RNA replication by viral replicases. Shown is the growth curve of RNA concentration in a closed system with polymerase and excess nucleotide triphosphates (Biebricher et al., 1983). In the exponential phase, the total concentration of RNA is smaller than the total concentration of replicase, in the linear phase RNA is present in excess, and eventually at high RNA concentration the growth curve levels off, since the enzyme is bound in inactive RNA–replicase complexes and RNA synthesis is blocked by product inhibition. (See insert for color representation of this ﬁgure.)


[Figure 7.7 diagram: a plus strand serves as template for template-induced synthesis of the complementary minus strand; the completed plus–minus duplex then undergoes complex dissociation into a free plus strand and a free minus strand.]

Figure 7.7

Complementary replication of RNA. Complementary replication consists of (i) duplex formation from single strands by template-induced synthesis and (ii) dissociation of the duplex into a plus and a minus strand. The dissociation of the completed duplex is highly unfavorable because of the large negative free energy of duplex formation. Complex dissociation is facilitated by the enzyme, which separates the two strands on the ﬂy in order to allow for independent structure formation and prevention of the formation of the complete duplex. (See insert for color representation of this ﬁgure.)
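As a numerical sketch of the two-step mechanism of Equation 7.2, the rate equations dx+/dt = f− x− and dx−/dt = f+ x+ (mass-action kinetics, assumed here) can be integrated directly. The function name `simulate_plus_minus`, the Runge–Kutta integrator, and the parameter values are illustrative assumptions, not the authors' procedure.

```python
import math


def simulate_plus_minus(f_plus, f_minus, x_plus0, x_minus0, t_end, steps=10000):
    """Integrate dx+/dt = f- * x- and dx-/dt = f+ * x+ with classical RK4."""

    def rhs(xp, xm):
        # Plus strands are produced on minus templates and vice versa.
        return f_minus * xm, f_plus * xp

    h = t_end / steps
    xp, xm = x_plus0, x_minus0
    for _ in range(steps):
        k1p, k1m = rhs(xp, xm)
        k2p, k2m = rhs(xp + 0.5 * h * k1p, xm + 0.5 * h * k1m)
        k3p, k3m = rhs(xp + 0.5 * h * k2p, xm + 0.5 * h * k2m)
        k4p, k4m = rhs(xp + h * k3p, xm + h * k3m)
        xp += h * (k1p + 2 * k2p + 2 * k3p + k4p) / 6
        xm += h * (k1m + 2 * k2m + 2 * k3m + k4m) / 6
    return xp, xm


if __name__ == "__main__":
    f_plus, f_minus = 4.0, 1.0
    xp, xm = simulate_plus_minus(f_plus, f_minus, 1.0, 0.0, 5.0)
    # After the initial transient the ratio of minus to plus strands
    # settles at sqrt(f+/f-), and both grow with rate sqrt(f+ * f-).
    print(xm / xp, math.sqrt(f_plus / f_minus))
```

Starting from plus strands only, the ratio of the two strand concentrations relaxes quickly to a constant, after which the whole plus–minus ensemble grows exponentially.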

The solution of the kinetic equations leads to two modes that describe fast internal equilibration and growth of the plus–minus ensemble with a rate parameter $f = \sqrt{f_{+} f_{-}}$, that is, the geometric mean of the two rate constants:

$$\eta(t) = \eta(0)\,e^{-ft},\quad \eta = \sqrt{f_{+}}\,x_{+} - \sqrt{f_{-}}\,x_{-} \qquad\text{and}\qquad \zeta(t) = \zeta(0)\,e^{+ft},\quad \zeta = \sqrt{f_{+}}\,x_{+} + \sqrt{f_{-}}\,x_{-}, \tag{7.3}$$

with $x_{+} = [X_{+}]$ and $x_{-} = [X_{-}]$ being the concentrations of plus and minus strands, respectively. Variation is introduced into ensembles of replicating RNA molecules by imprecise copying or mutation. Three classes of mutations are distinguished: (i) single nucleotide mismatch in the replication duplex leading to a point mutation (Figure 7.8), (ii) duplication of part of the RNA sequence leading to insertion of nucleotides, and (iii) deletion of nucleotides. A molecular theory of evolution based on the kinetics of replication and mutation has been formulated by Eigen (1971). The concept is based on the reaction network,

$$X_j \xrightarrow{\;Q_{ij} f_j\;} X_i + X_j; \qquad i, j = 1, 2, \ldots, N, \tag{7.4}$$


[Figure 7.8 diagram: plus and minus strands in two successive rounds of template-induced synthesis.]

Figure 7.8

Point mutation in replication of RNA. Point mutation results from a mismatch in the replication duplex. The figure sketches the result of a U–G mismatch that leads to a point mutation of transition type, A → G and U → C. (See insert for color representation of this figure.)

which is considered under the idealized conditions of excess nucleotide triphosphates and replicase. The rate parameter $f_j$ refers to replications, correct and incorrect, of template $X_j$, and the factor $Q_{ij}$ represents the frequency of production of $X_i$ as a copy of $X_j$. Since every copy has to be either correct or a mutant, the conservation relation $\sum_{i=1}^{N} Q_{ij} = 1$ holds. The kinetic differential equations resulting from Equation 7.4 with $x_i = [X_i]$ are linear,

$$\frac{d\mathbf{x}}{dt} = W\,\mathbf{x} \quad\text{and}\quad \mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix}, \tag{7.5}$$

and can be solved in terms of the eigenvalues and eigenvectors of the selection-mutation matrix $W$, which can be factorized into a product of the mutation matrix $Q$ and the diagonal matrix of replication rate parameters, $F$:

$$W = Q\,F \quad\text{with}\quad Q = \begin{pmatrix} Q_{11} & Q_{12} & \cdots & Q_{1N} \\ Q_{21} & Q_{22} & \cdots & Q_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ Q_{N1} & Q_{N2} & \cdots & Q_{NN} \end{pmatrix} \quad\text{and}\quad F = \begin{pmatrix} f_1 & 0 & \cdots & 0 \\ 0 & f_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & f_N \end{pmatrix}.$$

Since all sequences $X_i$ can be reached from everywhere in sequence space by a chain of successive point mutations, the matrix $W^m$ has only strictly positive entries for sufficiently large $m$, and the Perron–Frobenius theorem (Seneta, 1981) holds: the eigenvector $\zeta_0$ corresponding to the largest eigenvalue $\lambda_0$ has exclusively strictly positive components, and all mutants $X_i$ are present in the population after some time: $x_i(t) > 0$ for $t > 0$. The eigenvalue $\lambda_0$ is positive and all components of the eigenvector are growing exponentially. The mutant distribution determined by the eigenvector $\zeta_0$ is called quasispecies since it represents the genetic reservoir of an asexually replicating species. It is straightforward to introduce a constraint into Equation 7.5 that limits the population size $\sum_{i=1}^{N} x_i = c$ asymptotically to $\lim_{t\to\infty} c(t) = c_0$:

$$\frac{d\mathbf{x}}{dt} = W\,\mathbf{x} - \frac{\bar{f}}{c_0}\,\mathbf{x} = \Bigl(W - \frac{\bar{f}}{c_0}\,E\Bigr)\mathbf{x}. \tag{7.5′}$$

Here, $E$ is the unit matrix and $\bar{f} = \sum_{i=1}^{N} f_i x_i / c$ represents the mean replication rate parameter of the population. In the solutions of Equation 7.5′, the population indeed approaches the stable stationary state $\lim_{t\to\infty} \sum_{i=1}^{N} x_i(t) = \sum_{i=1}^{N} \bar{x}_i = c_0$, which is determined by the components of the eigenvector $\zeta_0$. Choosing $c_0 = 1$ yields relative or L1-normalized concentrations, $\sum_{i=1}^{N} \bar{x}_i = 1$ (which will be used in the rest of this section). All entries of the mutation matrix $Q$ can be derived from three parameters, the mutation rate $p$, the Hamming distance $d_H(X_i, X_j)$, and the sequence length $n$, provided the uniform mutation rate model is adopted. This model is based on the assumption that mutation rates are independent of the position on the RNA sequence:

$$Q_{ij} = (1 - p)^{n}\,\varepsilon^{d_H(X_i, X_j)} \quad\text{with}\quad \varepsilon = \frac{p}{1 - p}. \tag{7.6}$$
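The stationary mutant distribution can be illustrated numerically for a small example. The sketch below assumes binary sequences (so that the uniform error model in the form Q_ij = (1 − p)^n ε^d, with ε = p/(1 − p), has columns summing to one), encodes sequences as integers, and replaces a full eigenvalue routine by plain power iteration on W = QF; the names `mutation_matrix` and `quasispecies` are hypothetical.

```python
def hamming(a, b):
    """Hamming distance between two binary sequences encoded as integers."""
    return bin(a ^ b).count("1")


def mutation_matrix(n, p):
    """Uniform error model for binary sequences of length n (0 < p < 1)."""
    eps = p / (1.0 - p)
    size = 2 ** n
    return [[(1.0 - p) ** n * eps ** hamming(i, j) for j in range(size)]
            for i in range(size)]


def quasispecies(n, p, fitness, iterations=500):
    """L1-normalized dominant eigenvector of W = Q F by power iteration."""
    q = mutation_matrix(n, p)
    size = 2 ** n
    x = [1.0 / size] * size
    for _ in range(iterations):
        y = [sum(q[i][j] * fitness[j] * x[j] for j in range(size))
             for i in range(size)]
        total = sum(y)
        x = [value / total for value in y]  # keep relative concentrations
    return x


if __name__ == "__main__":
    n, p = 5, 0.01
    fitness = [10.0] + [1.0] * (2 ** n - 1)  # sequence 0 is fittest
    x = quasispecies(n, p, fitness)
    print(f"relative concentration of the fittest sequence: {x[0]:.4f}")
```

Power iteration converges here because all entries of W are strictly positive for 0 < p < 1, which is the situation covered by the Perron–Frobenius theorem.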

The rate parameters $f_i$ are derived from the mappings $\Psi$ and $\Phi$ from sequence space into shape space and into real numbers as formulated in Equation 7.1. In the absence of neutrality, the stationary distribution of sequences contains a master sequence, $X_M$, which is characterized in terms of the largest replication rate parameter: $f_M = \max\{f_i \;\forall\, i = 1, \ldots, N\}$. At sufficiently small mutation rates $p$, the stationary concentration of the master sequence is largest, $\bar{x}_M = \max\{\bar{x}_i \;\forall\, i = 1, \ldots, N\}$. A simple expression for stationary concentrations can be derived from the single-peak model landscape. In this landscape, a higher replication parameter is assigned to the master and identical values to all other sequences: $f_M = \sigma_M f$ and $f_i = f$ for all $i \neq M$ (Swetina and Schuster, 1982; Tarazona, 1992; Alves and Fontanari, 1996). The (dimensionless) factor $\sigma_M$ is called the superiority of the master. The assumption leading to the single-peak landscape is in the spirit of mean field approximations, since all mutants are lumped together into a single molecular species with average fitness. The concentration of the mutant cloud is simply $x_c = \sum_{j=1, j\neq M}^{N} x_j = 1 - x_M$, and the replication-mutation problem boils down to an exercise in a single variable, $x_M$, the frequency of the master. A mean-except-the-master replication rate parameter is defined as $\bar{f}_{-M} = \sum_{j\neq M} f_j x_j / (1 - x_M)$, and then the superiority is of the form $\sigma_M = f_M / \bar{f}_{-M}$. Neglecting mutational backflow, we can readily compute the stationary frequency of the master sequence,

$$\bar{x}_M = \frac{f_M Q_{MM} - \bar{f}_{-M}}{f_M - \bar{f}_{-M}} = \frac{\sigma_M Q_{MM} - 1}{\sigma_M - 1}. \tag{7.7}$$

Nonzero frequency of the master requires $Q_{MM} > \sigma_M^{-1} = Q_{min}$. Within the uniform error rate model, an error threshold, defined by $\bar{x}_M = 0$ in the no mutational backflow approximation, occurs at a minimum single-digit accuracy of

$$q_{min} = 1 - p_{max} = \sqrt[n]{Q_{min}} = \sigma_M^{-1/n} \quad\text{or}\quad p_{max} = 1 - \sigma_M^{-1/n}. \tag{7.8}$$

Figure 7.9 shows the stationary frequency of the master sequence, $\bar{x}_M$, as a function of the error rate. The exact solution of Equation 7.5 approaches the uniform distribution at mutation rates above the error threshold. In other words, the concentrations of all molecular species in the population become identical. Such a state can never be achieved in real populations since population sizes are always many orders of magnitude smaller than the numbers of sequences in sequence space: for a very large population size of $N = 10^{15}$, the chain length at which sequence space matches population size is about

[Figure 7.9 plot: stationary mutant distribution, with the stationary concentration of the master sequence, x̄_M, plotted against the mutation rate p.]

Figure 7.9

Error threshold in replication. The figure sketches the (relative) stationary concentration of the master sequence in the population as a function of the mutation rate, $\bar{x}_M(p)$. It vanishes at the error threshold in the no mutational backflow approximation. The insert shows curves obtained (i) as the exact solution derived from the eigenvector of the matrix $W$ that belongs to the largest eigenvalue (red), (ii) by an approximation based on equal concentrations of all mutants, which corresponds to the population at mutation rates $p > p_{max}$ and becomes exact at $p = 0.5$ (blue), and (iii) by the no mutational backflow approximation (Equation 7.7, black). The red curve and the blue curve approach each other above the error threshold and converge to the uniform distribution. The deterministic Equation 7.5 and its approximations fail to describe population dynamics (Section 7.3.4) at mutation rates above threshold. In addition, all replication processes in reality are bound by a minimum error rate, $p_{min}$, that represents the physical accuracy limit of replication. (See insert for color representation of this figure.)


$n = 25$. Accordingly, the no mutational backflow approximation and the exact solution of the differential Equation 7.5′ fail to describe replication-mutation dynamics at mutation rates above the error threshold because of finite population size effects (see Section 7.3.4). The error threshold phenomenon is used in virology for the design of new antiviral drugs (Domingo, 2005; Domingo et al., 2008).
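The closed-form expressions of this subsection are easy to evaluate. The sketch below assumes the no mutational backflow approximation with Q_MM = (1 − p)^n, a threshold error rate p_max = 1 − σ_M^(−1/n), and a four-letter alphabet for the comparison of population size and sequence space; the function names are illustrative, and clamping negative frequencies to zero is an added convenience rather than part of the formula.

```python
import math


def stationary_master_frequency(sigma, p, n):
    """Master frequency (sigma*Q_MM - 1)/(sigma - 1), clamped at zero."""
    q_mm = (1.0 - p) ** n  # probability of an error-free copy
    return max(0.0, (sigma * q_mm - 1.0) / (sigma - 1.0))


def error_threshold(sigma, n):
    """Largest per-site error rate with nonzero master frequency."""
    return 1.0 - sigma ** (-1.0 / n)


def matching_chain_length(population_size, alphabet_size=4):
    """Chain length n at which alphabet_size**n equals the population size."""
    return math.log(population_size) / math.log(alphabet_size)


if __name__ == "__main__":
    sigma, n = 10.0, 100
    p_max = error_threshold(sigma, n)
    print(f"p_max = {p_max:.5f}")
    for p in (0.0, 0.5 * p_max, p_max):
        print(p, stationary_master_frequency(sigma, p, n))
    print(f"n matching 1e15 molecules: {matching_chain_length(1e15):.2f}")
```

For a superiority of 10 and chain length 100, the threshold lies at roughly a 2.3% error rate per site, and a population of 10^15 molecules matches the size of sequence space at a chain length just below 25.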

7.3.2 RNA Secondary Structures

An RNA secondary structure $S$ of the sequence $X = (a_1, a_2, \ldots, a_n)$, where the nucleotides are chosen from an alphabet, for example, $a_i \in \{A, U, G, C\}$, is a planar graph with the nodes being the individual nucleotides $a_i$. The edges, $ij \in S$, are defined by the following criteria:

i. for all nodes $i \le n - 1$, $i\,(i+1) \in S$ holds (backbone),
ii. for every node $i$ there exists at most one $k \notin \{i - 1, i + 1\}$ such that $ik \in S$ (base pairs),
iii. from $ij \in S$ and $kl \in S$ with $k < l$ and $i < k < j$, it follows that $i < k < l < j$ (no pseudoknot rule), and
iv. a criterion for structure formation, commonly minimization of free energy (mfe).

The backbone (i) represents the polynucleotide chain consisting of alternating phosphate and ribose moieties. The rule for base pairs (ii) defines all base pairs in structure $S$ and excludes base triplets and other interactions involving more than two bases. The no pseudoknot rule (iii) excludes the structures shown in Figure 7.4. The mfe condition (iv), finally, defines the conditions under which $S$ is a possible structure of $X$. Thanks to these criteria, the search for RNA secondary structures can be performed by means of dynamic programming (Zuker and Stiegler, 1981; Zuker and Sankoff, 1984; Hofacker et al., 1994; Hofacker, 2003; Svobodova Varekova et al., 2008). Introducing the search for pseudoknots into the search for optimal structures is possible in principle but raises the computational demands enormously (Rivas and Eddy, 1999). The situation with other tertiary motifs is similar. The currently used approach to predicting tertiary structures starts from secondary structures and introduces tertiary contacts where sequence and structure make it possible. Secondary structures can be represented, without losing information, as strings consisting of dots and left and right parentheses matched by mathematical convention (Figure 7.3).
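To show how the no pseudoknot rule makes dynamic programming possible, the following sketch maximizes the number of base pairs instead of minimizing free energy. This is a deliberate simplification and an assumption of the example: it is not the energy-based algorithm of the cited references, and the minimum hairpin loop size of three, the set `PAIRABLE`, and the function name `max_pair_fold` are illustrative choices.

```python
PAIRABLE = {("A", "U"), ("U", "A"), ("G", "C"),
            ("C", "G"), ("G", "U"), ("U", "G")}


def max_pair_fold(seq, min_loop=3):
    """Return (number of pairs, dot-bracket string) of one optimal structure."""
    n = len(seq)
    if n == 0:
        return 0, ""
    # best[i][j] = maximal number of nested base pairs within seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                     # case: j is unpaired
            for k in range(i, j - min_loop):           # case: j pairs with k
                if (seq[k], seq[j]) in PAIRABLE:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score

    # Traceback: recover one structure that realizes best[0][n - 1].
    structure = ["."] * n
    todo = [(0, n - 1)]
    while todo:
        i, j = todo.pop()
        if j - i <= min_loop:
            continue
        if best[i][j] == best[i][j - 1]:
            todo.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRABLE:
                left = best[i][k - 1] if k > i else 0
                if left + 1 + best[k + 1][j - 1] == best[i][j]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        todo.append((i, k - 1))
                    todo.append((k + 1, j - 1))
                    break
    return best[0][n - 1], "".join(structure)


if __name__ == "__main__":
    print(max_pair_fold("GGGAAAUCC"))
```

Because a base pair splits the sequence into an enclosed part and an outside part that cannot interact, the optimum for an interval is assembled from optima of shorter intervals; allowing pseudoknots would break exactly this decomposition.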
This fact provides an upper bound for the number of possible secondary structures, $N_S(n) < 3^n$, which is far from tight since acceptable mathematical parentheses notation is a severe restriction. Application of combinatorics yields a remarkably good approximation for sufficiently long sequences (Hofacker et al., 1998; Schuster, 2003): $N_S(n) \approx 1.4848\, n^{-3/2}\,(1.84892)^n$. Accordingly, the number of sequences, $N = 4^n$, is always larger, and commonly much larger, than the number of secondary structures, and we are therefore dealing with neutrality. Folding RNA sequences into conventional secondary structures with minimal free energies provides a suitable model system for studying realistic sequence-structure maps of biopolymers for several reasons: (i) almost all RNA sequences form some base pairs and structures are found everywhere in sequence space, (ii) RNA folding follows a simple base pairing logic and hence is accessible by mathematics and computation, and (iii) RNA secondary structures are physically meaningful and provide a basis for discussing RNA function. These three properties, which are not fulfilled in the case of proteins, and the


Figure 7.10

Neutral networks and compatible sequences. The set of sequences folding into the same mfe structure $S$ is denoted by $G(S)$. It defines the nodes of the neutral network of structure $S$ in sequence space. Connecting all pairs of sequences with Hamming distance $d_H = 1$ yields the neutral network $G(S)$ (the graph drawn in red). A neutral network is embedded in the set of compatible sequences $C(S)$, $G(S) \subseteq C(S)$. A compatible sequence of structure $S$, $X_{C(S)}$, forms $S$ either as its mfe structure or as one of its suboptimal conformations. (See insert for color representation of this figure.)

capability of multiplication in simple replication assays, make RNA a suitable model for studies of evolution in vitro and in silico.
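The excess of sequences over structures can be checked by counting pseudoknot-free structures with a simple recursion over the last nucleotide, which is either unpaired or paired with an earlier position. The sketch assumes only a minimum hairpin loop size of three nucleotides and permits isolated base pairs, so its counts need not coincide with the asymptotic expression quoted in the text, which may rest on further restrictions; the function name `count_structures` is illustrative.

```python
def count_structures(n, min_loop=3):
    """Number of pseudoknot-free secondary structures on n nucleotides."""
    counts = [1] * (n + 1)  # counts[0] = 1: the empty structure
    for length in range(1, n + 1):
        total = counts[length - 1]           # last nucleotide unpaired
        # Last nucleotide paired with position j (1-based); the pair must
        # enclose at least min_loop nucleotides.
        for j in range(1, length - min_loop):
            total += counts[j - 1] * counts[length - j - 1]
        counts[length] = total
    return counts[n]


if __name__ == "__main__":
    for n in (10, 20, 30):
        print(n, count_structures(n), 3 ** n, 4 ** n)
```

Already at modest chain lengths the number of structures is smaller by orders of magnitude than both the bound 3^n and the number of sequences 4^n, so many sequences must share a structure.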

7.3.3 Neutrality and Its Consequences

The mappings defined in Equation 7.1 provide the theoretical basis for both rational and evolutionary design of biomolecules. Since we are dealing with orders of magnitude more sequences than structures and with a multitude of structures serving the same task, both mappings $\Psi$ and $\Phi$ are noninvertible in the sense that many sequences form the same mfe structure and many different structures may have the same function. The mapping $\Psi$ is sketched in Figure 7.10. The inversion of the mapping $S = \Psi(X)$ generally results in a set of sequences $G(S)$ defining the preimage of structure $S$ in sequence space:

$$G(S) = \Psi^{-1}(S) \doteq \{X \mid \Psi(X) = S\}. \tag{7.9}$$

It is a subset of the compatible set of structure $S$: $G(S) \subseteq C(S)$.⁴ Since every sequence $X_k$ maps into some structure $S_k$, the union of all neutral sets covers the entire sequence space, $\bigcup_k G(S_k) = Q$. Global properties of neutral networks can be derived from random graph theory (Bollobas, 1985). The characteristic quantity for a neutral network is the degree of neutrality $\lambda$, which is obtained by averaging the fraction of Hamming distance one neighbors that fold into the same mfe structure, $\lambda_X = n_{ntr}^{(1)} / (n\,(\kappa - 1))$, with $n_{ntr}^{(1)}$ being the number of neutral single nucleotide exchange neighbors and $\kappa$ the number of letters in the nucleotide alphabet, over the whole network $G(S)$:

$$\lambda(S) = \frac{1}{|G(S)|} \sum_{X \in G(S)} \lambda_X, \tag{7.10}$$

where $|G(S)|$ is the number of sequences forming the neutral network.

⁴ Compatibility of sequences and structures is defined in the following way: a sequence X is compatible with structure S if and only if, for every base pair in S, the sequence X contains pairable nucleotides in the two positions forming the pair. Similarly, a structure S is compatible with a sequence X when the same relation (and, obviously, not its inversion) holds.

Connectedness of neutral networks, among other properties, is determined by the degree of neutrality (Reidys et al., 1997):

$$\text{With probability one, a network is}\quad \begin{cases} \text{connected} & \text{if } \lambda > \lambda_{cr}, \\ \text{not connected} & \text{if } \lambda < \lambda_{cr}, \end{cases} \tag{7.11}$$

where $\lambda_{cr} = 1 - \kappa^{-1/(\kappa - 1)}$. Interestingly, this threshold value depends exclusively on the number of digits in the nucleotide alphabet. Calculation yields the critical values $\lambda_{cr} = 0.5$, 0.423, and 0.370 for two, three, and four letter alphabets, respectively. Random graph theory predicts a single largest component for nonconnected networks, that is, networks below threshold, which is commonly called the giant component. Real neutral networks derived from RNA secondary structures sometimes deviate significantly from the prediction of random graph theory. In particular, they can have two or four equally sized largest components. This deviation is readily explained by a nonuniform distribution of the sequences belonging to $G(S_k)$ over sequence space that is caused by specific properties of the structure $S_k$ (Grüner et al., 1996a,b). For example, structures that allow for closure of additional base pairs at the ends of stacks are more likely to be formed by sequences that have an excess of one of the two bases forming a base pair than by those with an ideally balanced distribution ($n_G = n_C$ and $n_A = n_U$). For GC sequences, the neutral network of such a structure is less dense in the middle part of sequence space ($n_G = n_C$) than above ($n_G > n_C$) or below ($n_G < n_C$), and we find two equally sized largest components, one at excess G and one at excess C. Neutrality in sequence space has consequences for the selection process. The scenario of neutral evolution has been investigated in great detail by Kimura (1968, 1983). In the absence of differences in fitness values, the distribution of neutral genotypes or sequences drifts randomly in sequence space until one particular genotype becomes fixed.
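The critical degree of neutrality quoted for the connectivity of neutral networks can be reproduced in a few lines, assuming the threshold has the form λcr = 1 − κ^(−1/(κ − 1)) with κ the alphabet size. The helper names below are illustrative.

```python
def critical_neutrality(kappa):
    """Connectivity threshold for an alphabet with kappa letters."""
    return 1.0 - kappa ** (-1.0 / (kappa - 1))


def expected_connected(degree_of_neutrality, kappa):
    """True if the degree of neutrality lies above the threshold."""
    return degree_of_neutrality > critical_neutrality(kappa)


if __name__ == "__main__":
    for kappa in (2, 3, 4):
        print(kappa, round(critical_neutrality(kappa), 3))
```

The threshold decreases with alphabet size, so a given fraction of neutral neighbors is more likely to yield a connected network for a four-letter alphabet than for a binary one.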
Kimura’s theory yields two highly relevant results: (i) the average time of replacement of one genotype by another is the reciprocal rate of mutation, $\tau_{subst} = 1/p$, and hence independent of population size, and (ii) the time of fixation of a mutant is proportional to the population size, $\tau_{fix} = 4N_e$ (with $N_e$ being the effective population size). Neutrality can be introduced into model fitness landscapes; the corresponding selection-mutation Equation 7.5 is solved straightforwardly and yields, in the limit of small mutation rates, stationary concentrations of two neutral sequences that depend on their Hamming distance (Schuster and Swetina, 1988):

$$d_H(X_i, X_j) \begin{cases} = 1: & \lim_{p\to 0} \bar{x}_i = 0.5 \ \text{and}\ \lim_{p\to 0} \bar{x}_j = 0.5, \\ = 2: & \lim_{p\to 0} \bar{x}_i = \alpha \ \text{and}\ \lim_{p\to 0} \bar{x}_j = 1 - \alpha, \\ \ge 3: & \lim_{p\to 0} \bar{x}_i = 0 \ \text{and}\ \lim_{p\to 0} \bar{x}_j = 1, \\ \ge 3: & \lim_{p\to 0} \bar{x}_i = 1 \ \text{and}\ \lim_{p\to 0} \bar{x}_j = 0. \end{cases} \tag{7.12}$$

A pair of fittest neutral nearest neighbor sequences appears in the stationary mutant distribution strongly coupled at equal concentrations; two sequences, $X_i$ and $X_j$, with

[Figure 7.11 plot: relative stationary concentrations x̄(p).]

Figure 7.11 Neutral networks and quasispecies. An example of a quasispecies core for a degree of neutrality $\lambda = 0.1$. Fitness values $f_i$ were assigned randomly to all 1024 binary (GC) sequences of chain length $n = 10$ with the constraint of 10% having the highest fitness value. The numbers on the sequences represent the decimal equivalent of the binary sequence, for example, the two sequences $X_{184}$ = CCGCGGGCCC and $X_{248}$ = CCGGGGGCCC with Hamming distance $d_H(X_{184}, X_{248}) = 1$. The selected neutral network (upper part, left-hand side) comprises seven sequences. The relative concentrations in the limit of vanishing mutation rates, $\lim p \to 0$, are given by the largest eigenvector of the adjacency matrix $A$ (upper part, right-hand side): $\zeta_0 = (\bar{x}_{184}, \bar{x}_{248}, \bar{x}_{504}, \bar{x}_{600}, \bar{x}_{728}, \bar{x}_{729}, \bar{x}_{760}) = (0.1, 0.2, 0.1, 0.1, 0.2, 0.1, 0.2)$. As the computed curves $\bar{x}_i(p)$ show, the ratio of the individual stationary concentrations in the limit is also a good approximation for finite mutation rates almost up to the error threshold.

Hamming distance $d_H(X_i, X_j) = 2$ form a strongly coupled pair with a concentration ratio $\alpha/(1 - \alpha)$, and for Hamming distance $d_H(X_i, X_j) \ge 3$ the Kimura scenario holds: either of the two sequences is selected, depending on initial conditions (and/or random fluctuations). The group of two or more neutral sequences that is selected is called the core of the quasispecies and replaces the master sequence of the nonneutral case. For more than two neutral nearest neighbor sequences, the core of the quasispecies is derived straightforwardly: we consider the selection-mutation matrix $W$ and neglect all terms of order $O(\varepsilon^2)$. Without changing the eigenvectors of $W$, we remove the diagonal term $f(1 - p)^n$ and set the off-diagonal entries $f(1 - p)^n \varepsilon$ to 1, which yields the adjacency matrix $A$. The core is then computed as the eigenvector of $A$ that belongs to the largest eigenvalue. An example is shown in Figure 7.11. Increasing mutation rates $p > 0$ lead to small or moderate changes in the relative concentrations of sequences in the core, and in fortunate cases the ratios of concentrations hold almost up to the error threshold.
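The core of the quasispecies in Figure 7.11 can be recomputed from the seven sequence numbers given in the caption. The sketch below treats the decimal labels as 10-bit binary sequences, connects sequences at Hamming distance one, and extracts the dominant eigenvector by power iteration. Iterating with A plus the unit matrix is an assumption of this example: it leaves the eigenvectors unchanged but avoids the oscillation that plain power iteration shows on such networks, which are bipartite. The function name `quasispecies_core` is hypothetical.

```python
def hamming(a, b):
    """Hamming distance between two binary sequences encoded as integers."""
    return bin(a ^ b).count("1")


def quasispecies_core(sequences, iterations=2000):
    """L1-normalized dominant eigenvector of the neutral-network adjacency matrix."""
    n = len(sequences)
    adjacency = [[1.0 if hamming(a, b) == 1 else 0.0 for b in sequences]
                 for a in sequences]
    x = [1.0 / n] * n
    for _ in range(iterations):
        # Multiply by (A + E): same eigenvectors as A, eigenvalues shifted by 1.
        y = [x[i] + sum(adjacency[i][j] * x[j] for j in range(n))
             for i in range(n)]
        total = sum(y)
        x = [value / total for value in y]
    return x


if __name__ == "__main__":
    labels = [184, 248, 504, 600, 728, 729, 760]
    for label, value in zip(labels, quasispecies_core(labels)):
        print(label, round(value, 3))
```

For these seven sequences the result agrees with the concentrations listed in the caption of Figure 7.11; the sequences with more neutral neighbors in the network carry the larger shares. The approach presupposes a connected network, since otherwise the dominant eigenvector is not unique.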

7.3.4 Stochastic Effects in RNA Evolution

Stochasticity becomes important when particle numbers are small, and this is certainly the case for rare mutations in evolution. For RNA molecules, the number of possible single point mutations is $3n$, and the number of mutants increases like binomial coefficients with the Hamming distance. A related source of stochastic effects concerns the smallness of all real populations compared to sequence space: in molecular evolution experiments, the number of RNA molecules can hardly exceed $10^{15}$, which is practically nothing compared to $1.6 \times 10^{60}$, the number of sequences in $Q_{100}^{(4)}$, and therefore quasispecies are always truncated at a certain distance from the population center. Stochastic effects are thus particularly important in molecular evolution under several conditions:

i. In the regime of sufficiently accurate replication, the master sequence or the core of a quasispecies is surrounded by a cloud of mutants. Near the truncation distance from the population center, mutations become very rare and the mutants cannot reach stationarity but remain fluctuating elements.
ii. At mutation rates above threshold, mutations to distant sequences gain sufficiently high probability to destroy inheritance, and all mutants become equally frequent in the deterministic approach. Since the population cannot cover the whole sequence space, it spreads and starts to migrate through sequence space.
iii. Populations on neutral networks drift in the sense of Kimura’s neutral evolution. In particular, the population spreads and breaks up into different clones that migrate through sequence space.

Scenario (ii) and scenario (iii) are similar but have completely different origins. Scenario (ii) results from low accuracy that manifests itself in the elements of the Q-matrix and gives rise to migration of the population because of frequent mutations. The error threshold has also been interpreted as a localization threshold of the quasispecies in sequence space (McCaskill, 1984). Scenario (iii) is tantamount to random drift in sequence space because of a degeneracy of the largest entries of matrix $F$ (Huynen et al., 1996).
In order to simulate selection-mutation dynamics of RNA at the stochastic level, a realistic model based on chemical reactions in a flow reactor was conceived (Fontana and Schuster, 1987, 1998a; Fontana et al., 1989). The sequence-structure map is an integral part of this model in the sense that sequences are converted into mfe secondary structures by means of an RNA folding mechanism. Structures are evaluated to yield replication rate parameters or fitness values f_i. The simulation tool starts from a population of RNA molecules and simulates chemical reactions corresponding to replication and mutation in a continuously stirred flow reactor (CSFR) by using Gillespie's algorithm (Gillespie, 1976, 1977b, 2007). In target search, the replication rate parameter f_i of a sequence X_i is chosen to be a function of the distance between the mfe structure formed by the sequence, S_i = f(X_i), and the target structure S_T (footnote 5),

\[
f_i(S_i, S_T) = \frac{1}{\alpha + d_H(S_i, S_T)/n}, \tag{7.13}
\]

which increases when S_i approaches the target structure S_T (α is an adjustable parameter that was chosen to be 0.1). A trajectory is completed when the population reaches a sequence that folds into the target structure. Accordingly, the simulated stochastic process has two absorbing barriers, the target and the state of extinction. For sufficiently large populations (N > 30 molecules), the probability of extinction is very small, and for the population sizes reported here, N ≥ 1000, extinction has never been observed.

Footnote 5. The measure for the distance between two structures S_i and S_j applied here is the Hamming distance between the two parentheses representations: d_H(S_i, S_j).
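The replication rate parameter of Equation 7.13 is easy to evaluate once structures are written in parentheses (dot-bracket) notation. The sketch below is a minimal illustration, assuming that n is the chain length shared by both structures and using the value 0.1 quoted in the text for the adjustable parameter; the function names and the short example structures are hypothetical.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equally long strings differ."""
    if len(a) != len(b):
        raise ValueError("structures must have the same length")
    return sum(x != y for x, y in zip(a, b))


def replication_rate(structure: str, target: str, alpha: float = 0.1) -> float:
    """Replication rate parameter in the form of Equation 7.13.

    Assumption: n is the chain length, i.e. the length of the
    parentheses representation of the target structure.
    """
    n = len(target)
    return 1.0 / (alpha + hamming_distance(structure, target) / n)


if __name__ == "__main__":
    target = "((...))"                            # toy target, not tRNA
    print(replication_rate("((...))", target))    # maximal rate, 1/alpha
    print(replication_rate(".......", target))    # open chain, lower rate
```

With this choice the rate is bounded by 1/α, which is reached only when the structure coincides with the target; a smaller α therefore makes the final step toward the target more strongly rewarded.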


Figure 7.12 A trajectory of evolutionary optimization. The topmost plot presents the mean distance to the target structure of a population of 1000 molecules. The plot in the middle shows the width of the population in Hamming distance between sequences, and the plot at the bottom is a measure of the velocity with which the center of the population migrates through sequence space. Diffusion on neutral networks causes spreading of the population in the sense of neutral evolution (Huynen et al., 1996). A remarkable synchronization is observed: At the end of each quasistationary plateau, a new adaptive phase in the approach toward the target is initiated that is accompanied by a drastic reduction in the population width and a jump in the population center (the top of the peak at the end of the second long plateau is marked by a black arrow). A mutation rate of p = 0.001 was chosen, the replication rate parameter is defined in Equation 7.13, and initial and target structure are shown in Table 7.1. (See insert for color representation of this figure.)

A typical trajectory is shown in Figure 7.12. In this simulation, a homogeneous population consisting of N molecules with the same randomly chosen sequence is used as the initial condition. The target structure is the well-known secondary structure of phenylalanyl transfer RNA (tRNA^Phe; see Figure 7.5). The distance to target averaged over the entire population decreases stepwise until the target is reached (Fontana et al., 1989; Fontana and Schuster, 1998a; Schuster, 2003). The process occurs on two timescales: short adaptive phases are interrupted by long quasistationary epochs. Transitions between two structures S_i and S_j can be classified according to the nearness of their neutral networks G(S_i) and G(S_j) (Fontana and Schuster, 1998b; Stadler et al., 2001). Inspection of the sequence record during a quasistationary epoch on a given plateau provides hints for the distinction of two scenarios:


Table 7.1 Statistics of the Optimization Trajectories

                                                   Real time from start to target   Number of replications (10^7)
Alphabet   Population size N   Number of runs n_R   Mean value   σ                   Mean value   σ
AUGC       1,000               120                  900          +1,380/-542         1.2          +3.1/-0.9
AUGC       2,000               120                  530          +880/-330           1.4          +3.6/-1.0
AUGC       3,000               1,199                400          +670/-250           1.6          +4.4/-1.2
AUGC       10,000              120                  190          +230/-100           2.3          +5.3/-1.6
AUGC       30,000              63                   110          +97/-52             3.6          +6.7/-2.3
AUGC       100,000             18                   62           +50/-28             -            -
GC         1,000               46                   5,160        +15,700/-3,890      -            -
GC         3,000               278                  1,910        +5,180/-1,460       7.4          +35.8/-6.1
GC         10,000              40                   560          +1,620/-420         -            -

The table shows the results of sampled evolutionary trajectories leading from a random initial structure S_I to the structure of tRNA^Phe, S_T, as target. Simulations were performed with an algorithm introduced by Gillespie (1976, 1977a, 1977b). The time unit is undefined here. A mutation rate of p = 0.001 per site and replication was used. The mean and standard deviation were calculated under the assumption of a lognormal distribution, which fits the simulation data well. The following structures S_I and S_T were used in the optimization:
S_I: ((.(((((((((((((............(((....)))......)))))).))))))).))...(((......)))
S_T: ((((((...((((........)))).(((((.......))))).....(((((.......))))).)))))).....
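The asymmetric standard deviations in Table 7.1 are what one would obtain by fitting a normal distribution to the logarithms of the run times and transforming back. The sketch below assumes exactly that: the reported central value is taken as exp(mu) and the upper and lower deviations as exp(mu + s) - exp(mu) and exp(mu) - exp(mu - s), where mu and s are the mean and sample standard deviation of the log-values. This reading is an assumption; it is consistent with, for example, the first AUGC row (900, +1380, -542), where 2280/900 and 900/358 are both about 2.5, but the chapter does not spell out the estimator. The function name `lognormal_summary` is illustrative.

```python
import math


def lognormal_summary(samples):
    """Central value and asymmetric deviations under a lognormal assumption.

    Returns (center, upper, lower) with
      center = exp(mu), upper = exp(mu + s) - center, lower = center - exp(mu - s),
    where mu and s are the mean and sample standard deviation of log(samples).
    """
    logs = [math.log(x) for x in samples]
    n = len(logs)
    if n < 2:
        raise ValueError("need at least two samples")
    mu = sum(logs) / n
    s = math.sqrt(sum((v - mu) ** 2 for v in logs) / (n - 1))
    center = math.exp(mu)
    return center, math.exp(mu + s) - center, center - math.exp(mu - s)


if __name__ == "__main__":
    # Hypothetical run times spanning two orders of magnitude.
    center, upper, lower = lognormal_summary([100.0, 1000.0, 10000.0])
    print(center, upper, lower)   # roughly 1000, +9000, -900
```

Because the interval is symmetric on the logarithmic scale, the upper deviation always exceeds the lower one, which matches the pattern seen in every row of the table.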

i. The structure is constant because of neutrality in the map C, and we observe neutral evolution. In particular, the number of neutral mutations accumulated is proportional to the number of replications on the population level. Evolution is a random walk of the population on a neutral network.
ii. The process during the stationary epoch involves several structures with the same replication rate parameters. Because of neutrality in the map from structure to function, F, the population performs a kind of random walk in the space of neutral structures.

The random walk or the diffusion of the population on neutral networks is illustrated by the plot in the middle of Figure 7.12, showing the width of the population as a function of time (Schuster, 2003). The population width increases during the quasistationary epoch and sharpens almost instantaneously after mutation has created a sequence that allows for the start of a new adaptive phase. The scenario at the end of the plateau corresponds to a bottleneck of evolution. The bottom part of the figure shows a plot of the migration rate or drift of the population center in sequence space and confirms this interpretation: The drift is almost always negligibly slow unless the population center jumps from one point in sequence space to another point where the molecule initiating the new adaptive phase is located. A closer look at the figure reveals the coincidence of three events: (i) collapse-like narrowing of the population width, (ii) jump-like migration of the population center, and (iii) beginning of a new adaptive phase.

In Table 7.1, numerical data obtained from sampling evolutionary trajectories under identical conditions (footnote 6) are presented. The individual trajectories show enormous scatter in the time or the number of replications required to reach the target. Mean values and standard deviations were obtained from statistics of trajectories under the assumption of a lognormal distribution. Despite the scatter, three features are unambiguously detectable:

i. The search in GC sequence space takes about five times as long as the corresponding process in AUGC sequence space, in agreement with the difference in neutral network structure.
ii. The time from initial conditions to target decreases with increasing population size.
iii. The number of replications required to reach the target from initial conditions increases with population size.

Combining items (ii) and (iii) allows for a clear conclusion concerning the requirements in time and resources of the optimization process: Fast optimization requires large populations, whereas economic use of material and/or energy suggests working with small population sizes just large enough to avoid extinction.

Footnote 6. Identical means here that everything was kept constant except the seeds for the random number generators.

Systematic studies on the parameter dependence of RNA evolution were reported in a recent simulation (Kupczok and Dittrich, 2006). Increase in mutation rate leads to an error threshold phenomenon that is close to the one observed with quasispecies on a single-peak landscape as described above (Eigen et al., 1989). Evolutionary optimization becomes more efficient (footnote 7) with increasing error rate until the error threshold is reached. Further increase in the error rate leads to an abrupt breakdown of the success in optimization. As expected, the distribution of replication rates or fitness values f_i in sequence space is highly relevant too: Steep decrease of fitness with the distance from the fittest master sequence (forming the target structure) leads to the sharp error threshold behavior observed with single-peak landscapes, whereas flat landscapes show a broad maximum of optimization efficiency without an indication of threshold-like behavior.
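The stochastic trajectories discussed in this section were generated with Gillespie's algorithm. The following is a minimal sketch of the direct method applied to a toy flow reactor, not the simulation code used for Table 7.1. Several simplifications are assumptions made here for brevity: species are identified by an index rather than an RNA sequence, mutation and folding are omitted, and the outflow propensity is scaled so that the total population fluctuates around a hypothetical `capacity`. The names `gillespie_step` and `simulate_flow_reactor` are illustrative.

```python
import math
import random


def gillespie_step(propensities, rng):
    """One step of Gillespie's direct method.

    Returns (tau, j): the waiting time until the next reaction and the
    index of the reaction channel that fires.
    """
    total = sum(propensities)
    if total <= 0.0:
        raise ValueError("no reaction can fire")
    # 1 - random() lies in (0, 1], so the logarithm is always defined.
    tau = -math.log(1.0 - rng.random()) / total
    threshold = rng.random() * total
    running = 0.0
    for j, a in enumerate(propensities):
        running += a
        if threshold < running:
            return tau, j
    # Guard against round-off: fall back to the last channel that can fire.
    return tau, max(j for j, a in enumerate(propensities) if a > 0.0)


def simulate_flow_reactor(counts, fitness, capacity, t_end, seed=0):
    """Replication and dilution of competing species (mutation omitted).

    Replication of species i has propensity fitness[i] * counts[i];
    removal of species i has propensity counts[i] * flux / capacity,
    where flux is the total replication propensity (an assumed form
    of the dilution flow that keeps the population near capacity).
    """
    rng = random.Random(seed)
    counts = list(counts)
    k = len(counts)
    t = 0.0
    while t < t_end and sum(counts) > 0:
        births = [fitness[i] * counts[i] for i in range(k)]
        flux = sum(births)
        deaths = [counts[i] * flux / capacity for i in range(k)]
        tau, j = gillespie_step(births + deaths, rng)
        t += tau
        if j < k:
            counts[j] += 1       # replication of species j
        else:
            counts[j - k] -= 1   # removal of species j - k by the outflow
    return t, counts


if __name__ == "__main__":
    print(simulate_flow_reactor([50, 50], [1.0, 2.0], capacity=100, t_end=5.0, seed=1))
```

In a fuller model each replication event would additionally draw point mutations with probability p per site and refold the new sequence to obtain its rate parameter; extinction (all counts zero) is an absorbing state here, as in the text.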

Footnote 7. Efficiency of evolutionary optimization is measured by average and best fitness values obtained in populations after a predefined number of generations.

7.3.5 Beyond the One Sequence–One Structure Paradigm

So far it has been assumed implicitly that every RNA sequence gives rise to one unique structure. This is almost always true when the notion of structure is restricted to a well-defined thermodynamic or process-determined folding criterion, mfe or in situ folding during RNA synthesis. In general, the number of structures S_k that are compatible with a given sequence X is quite large; these structures form the set of compatible structures C(X), which consists of the mfe structure together with all suboptimal structures. Efficient algorithms for the computation of suboptimal structures are available (Zuker, 1989; Wuchty et al., 1999). Because the numbers of suboptimal structures are almost always too large to be computed, stored, and retrieved, the computational procedures use restrictions: In Zuker (1989), certain common but less important classes of structures are neglected, and in Wuchty et al. (1999) all structures are computed that lie within a predefined energy band above the mfe (Figure 7.13). Alternatively, using the partition function of the states S_k, the superposition of all Boltzmann-weighted structures can be calculated with little more computational effort than is needed for the computation of the mfe structure (McCaskill, 1990; Hofacker et al., 1994). Yes-or-no pairing between two nucleotides is then replaced by a base pairing probability.

Figure 7.13 RNA structures. The mfe structure of an RNA sequence is accompanied by a large number of suboptimal structures. The sequence GGCCCCUUUGGGGGCCAGACCCCUAAAGGGGUC folds into a single hairpin structure S0 with an mfe of −26.3 kcal/mol. The first suboptimal structure of this molecule, S1, is a double hairpin with a free energy of −25.3 kcal/mol. The figure shows the mfe structure (left-hand side; red), the spectrum of suboptimal structures (middle; suboptimal conformations related to S0 are shown in red, those related to S1 in blue), and the barrier tree of the sequence (right-hand side) with two major basins for S1 (blue) and S0 (red). (See insert for color representation of this figure.)

Rules defining nearest neighbors in shape space and a measure of distance between structures are required for the construction of a free energy surface that identifies the (meta)stable conformations as local minima and the transition states for conformational changes as saddle points. Such rules form the move set of allowed elementary transitions between structures and represent individual steps in models for folding kinetics. An acceptable move set guarantees that every structure can be reached from every structure in shape space by a sequence of moves (footnote 8). Opening and closing of single base pairs already forms a move set fulfilling this condition. Empirical evidence suggests also including a shift move, which can be understood as a specific combination of base pair opening and base pair closing into one move:

.(((....))).      →  ((((....))))       base pair closure;
((((....))))      →  .(((....))).       base pair opening; and
((.(((....)))))   →  (((.((....)))))    base pair shift.
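The simplest move set, opening and closing of single base pairs, can be enumerated directly on the parentheses representation. The sketch below is an illustration under stated assumptions: it works on the structure alone (it does not check whether the two nucleotides of a new pair can actually pair in a given sequence), it requires at least three unpaired positions inside a hairpin loop (`min_loop=3`, a common convention assumed here), and it leaves out the shift move. All function names are hypothetical.

```python
def pair_table(structure: str):
    """Map each position to its pairing partner, or -1 if unpaired."""
    stack, table = [], [-1] * len(structure)
    for pos, ch in enumerate(structure):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            partner = stack.pop()
            table[pos], table[partner] = partner, pos
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r}")
    if stack:
        raise ValueError("unbalanced structure")
    return table


def to_dot_bracket(table) -> str:
    """Inverse of pair_table."""
    return "".join("." if p == -1 else "(" if p > i else ")" for i, p in enumerate(table))


def neighbors(structure: str, min_loop: int = 3):
    """All structures reachable by opening or closing one base pair."""
    table = pair_table(structure)
    n = len(table)
    result = set()
    # Base pair opening: remove any existing pair.
    for i, j in enumerate(table):
        if j > i:
            t = table[:]
            t[i] = t[j] = -1
            result.add(to_dot_bracket(t))
    # Base pair closure: join two unpaired positions of the same loop.
    for i in range(n):
        if table[i] != -1:
            continue
        for j in range(i + min_loop + 1, n):
            if table[j] != -1:
                continue
            # Every pair starting between i and j must also end between them,
            # otherwise the new pair would cross an existing one.
            if all(table[k] == -1 or i < table[k] < j for k in range(i + 1, j)):
                t = table[:]
                t[i], t[j] = j, i
                result.add(to_dot_bracket(t))
    return result


if __name__ == "__main__":
    print(sorted(neighbors("......")))
    print(sorted(neighbors("((...))")))
```

Since every opening can be undone by a closure and the open chain is reachable from any structure by repeated opening, this move set connects all structures, which is the condition stated above.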

The move set defines the nearest neighbors of a given structure and allows for classification (Figure 7.14). A structure that is surrounded by structures of higher free energy represents a local minimum of the free energy surface and corresponds to a (meta)stable conformation. The conformation S_k corresponding to a local minimum of the free energy surface has a uniquely defined basin of attraction that is defined by the set of all structures from which downhill walks end uniquely in S_k. In addition to local minima, the saddle points of free energy surfaces are required for folding kinetics. A saddle point is defined by a locally lowest point in shape space that has (two or more) nearest neighbors in shape space that belong to two distinct basins of attraction. All structures except those corresponding to local minima and saddle points are (fully) unstable structures (footnote 9). It is straightforward to show that the inclusion of the shift move may change the nature of structures: some local minima are turned into unstable states.

Footnote 8. A move set in sequence space that fulfills this condition is point mutation.

Figure 7.14 Conformation space and barrier tree. RNA secondary structures formed by one sequence fall into three classes: (i) local minima of the energy surface (black), which are surrounded exclusively by suboptimal structures with higher free energies, (ii) saddle points (red), which have two (or more) nearest neighbors in shape space that belong to two distinct basins, and (iii) (fully) unstable structures, which are neither local minima nor saddle points (green). The reaction coordinate is a path in shape space that leads from one local minimum (conformation S_k) to another local minimum (conformation S_j). The barrier tree (Flamm et al., 1999, 2002) is constructed by discarding ...

The barrier tree is a coarse-grained simplification of the free energy surface of an RNA molecule. It discards all (fully) unstable structures and retains only (meta)stable conformations and saddle points. The barrier tree, nevertheless, allows for an identification of the basins of attraction (see the example shown in Figure 7.13). Small basins of attraction can be united to form larger ones until we end up with a few major conformations, each defining a large basin, and this procedure can be continued until only very few basins are retained or a single conformation remains. RNA molecules with several dominant basins of attraction corresponding to two or more (meta)stable conformations are called riboswitches; they can be designed in silico (Flamm et al., 2001) and also occur in vivo (Montange and Batey, 2008). Conformational changes in natural riboswitches are commonly triggered by binding of small molecules and have regulatory function in metabolism. The barrier tree has also been used to compute Arrhenius-type folding kinetics of RNA molecules. The results are in good agreement with the exact computations of the folding kinetics on the computed conformational energy landscape unless there are many transition states whose energies lie close together (Wolfinger et al., 2004).

Finally, RNA suboptimal structures can also be considered in the context of sequence-structure mappings (Schuster, 2006). The set of structures that are compatible with a given sequence, C(X), considered in Figure 7.15, is in a way inverse to the set of compatible sequences (shown in Figure 7.10) since it deals with a noninvertible mapping in the opposite direction, from shape space into sequence space.
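The three classes of structures and the merging of basins can be illustrated with a flooding procedure on a toy landscape. The sketch below is not the barrier tree program cited in the text; it is a simplified illustration that assumes all energies are distinct (degenerate landscapes need extra care), represents states by arbitrary hashable labels, and takes the move set as a user-supplied `neighbors` function. States are visited in order of increasing energy: a state with no previously visited neighbor opens a new basin (local minimum), a state touching one basin joins it (unstable), and a state touching two or more basins merges them (saddle point).

```python
def classify_landscape(energies, neighbors):
    """Split states into local minima, saddle points, and unstable states.

    energies:  dict mapping state -> energy (assumed pairwise distinct)
    neighbors: function mapping a state to an iterable of adjacent states
    """
    parent = {}  # union-find forest; each root is the deepest minimum of its basin

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    minima, saddles, unstable = [], [], []
    for state in sorted(energies, key=energies.get):
        roots = {find(t) for t in neighbors(state) if t in parent}
        if not roots:
            parent[state] = state          # no lower neighbor: new basin
            minima.append(state)
            continue
        deepest = min(roots, key=energies.get)
        parent[state] = deepest
        for r in roots:
            parent[r] = deepest            # merge all touched basins
        (saddles if len(roots) > 1 else unstable).append(state)
    return minima, saddles, unstable


if __name__ == "__main__":
    # Hypothetical one-dimensional landscape: state i is adjacent to i - 1 and i + 1.
    levels = [0.0, 1.0, 3.0, 2.0, 0.5, 4.0, 2.5]
    energies = dict(enumerate(levels))

    def chain(i):
        return [j for j in (i - 1, i + 1) if 0 <= j < len(levels)]

    print(classify_landscape(energies, chain))
```

Recording, for each saddle, which basins it merged and at what energy would yield the branching points of a barrier tree; keeping only the minima and saddles corresponds to discarding the unstable structures.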
Footnote 9. A saddle point is also unstable in at least one direction, but it is locally stable in at least one other direction.

A subset of the compatible structures, G(X) ⊂ C(X), which contains all local minima of the free energy surface and the saddle points connecting the basins corresponding to (meta)stable conformations, provides the basis for the construction of barrier trees. All structures that are neither local minima nor saddle points are neglected. Local minima with positive free energies relative to the open chain, ΔG > 0, and saddle points leading into their basins are likewise excluded. RNA evolution on neutral networks can be considered as a process with conservation of structure; likewise, kinetic RNA folding in conformation space is a process with conservation of sequence (Schuster, 2006).

Figure 7.15 Suboptimal and compatible structures. Metastable conformations S_k(X) of sequence X are defined by two conditions: (i) ΔG < 0 for folding and (ii) conformation S_k(X) is a local minimum of the free energy surface. These conformations form the set of metastable structures in shape space. This set is embedded in the compatible set C(X), which contains all structures of shape space that are compatible with sequence X. For the consideration of kinetic folding, it is useful to add the set of saddle point structures to the set of metastable structures, thereby forming the set G(X) of structures of sequence X that is needed for the construction of barrier trees; G(X) is the union of the metastable set and the saddle point set and is itself a subset of C(X). (See insert for color representation of this figure.)

7.4 CONCLUSIONS AND OUTLOOK

The current state of the art in computation and empirical determination of fitness landscapes for evolution does not allow for predictions because the accessible data are still rudimentary. The most promising areas of application are evolutionary design of molecules in vitro and virus evolution, where genotype spaces are large but accessible through extensive data collection. The greatest challenge for the future, presumably, is the same as in computational systems biology: Despite an enormous wealth of data, only a small fraction is comparable because most of the currently accessible information is widely scattered in the literature and has been measured under conditions that are not comparable. Further progress in reliability and predictive power of models depends, among other things, on validation and standardization of data. Mathematical and computational tools are nevertheless available and can be implemented and used as soon as reliable information on the structure of landscapes becomes available. Evolution can be formally described and properly modeled as a process in sequence space, just as kinetic folding is visualized in shape space. The RNA model serves as a kind of tool kit that provides fundamental insights into basic structures and dynamics, which will later be encountered also in the real world.


ACKNOWLEDGMENTS

This work was supported in part by the German DFG under the auspices of SPP-1258 “sensory and regulatory RNAs in prokaryotes,” SPP-1174 “metazoan deep phylogeny,” and the Graduiertenkolleg Wissensrepräsentation, and by the European Union through the sixth framework program projects EMBIO http://www-embio.ch.cam.ac.uk/ and SYNLET http://synlet.izbi.uni-leipzig.de/. We thank Claudia S. Copeland for editing the manuscript for English language and clarity.

REFERENCES

ALVES, D. and FONTANARI, J.F., 1996. Population genetics approach to the quasispecies model. Phys. Rev. E 54: 4048–4053.
ANFINSEN, C.B., 1973. Principles that govern the folding of protein chains. Science 181: 223–230.
ANFINSEN, C.B. and HABER, E., 1961. Studies on the reduction and re-formation of protein disulfide bonds. J. Biol. Chem. 236: 1361–1363.
ANFINSEN, C.B., HABER, E., SELA, M., and WHITE, F.H., JR., 1961. Studies on the reduction and re-formation of protein disulfide bonds. Proc. Natl. Acad. Sci. USA 47: 1309–1314.
ANTAO, V.P. and TINOCO, I., JR., 1992. Thermodynamic parameters for loop formation in RNA and DNA hairpin tetraloops. Nucleic Acids Res. 20: 819–824.
ANTAO, V.P., LAI, S.Y., and TINOCO, I., JR., 1991. A thermodynamic study of unusually stable RNA and DNA hairpins. Nucleic Acids Res. 19: 5901–5905.
BENHAM, C.J. and MIELKE, S.P., 2005. DNA mechanics. Annu. Rev. Biomed. Eng. 7: 21–53.
BIEBRICHER, C.K., EIGEN, M., WILLIAM, C., and GARDINER, J., 1983. Kinetics of RNA replication. Biochemistry 22: 2544–2559.
BIEBRICHER, C.K., EIGEN, M., WILLIAM, C., and GARDINER, J., 1984. Kinetics of RNA replication: plus–minus asymmetry and double-strand formation. Biochemistry 23: 3186–3194.
BIEBRICHER, C.K., EIGEN, M., WILLIAM, C., and GARDINER, J., 1985. Kinetics of RNA replication: competition and selection among self-replicating RNA species. Biochemistry 24: 6550–6560.
BOLLOBAS, B., 1985. Random Graphs. Academic Press, London.
BORGIA, A., WILLIAMS, P.M., and CLARKE, J., 2008. Single-molecular studies of protein folding. Annu. Rev. Biophys. 77: 101–125.
BRAKMANN, S. and JOHNSSON, K., 2002. Directed Molecular Evolution of Proteins: or How to Improve Enzymes for Biocatalysis. Wiley-VCH, Weinheim.
BREAKER, R.R., 1999. Catalytic DNA: in training and seeking employment. Nat. Biotechnol. 17: 422–423.
CHEN, S., 2008. RNA-folding: conformational statistics, folding kinetics, and ion electrostatics. Annu. Rev. Biophys. 37: 197–234.
DILL, K.A., OZKAN, S.B., SHELL, M.S., and WEIKL, T.R., 2008. The protein folding problem. Annu. Rev. Biophys. 37: 289–316.
DOBSON, C.M., SALI, A., and KARPLUS, M., 1998. Protein folding: a perspective from theory and experiment. Angew. Chem. Int. Ed. 37: 868–893.
DOMINGO, E., 2005. Virus entry into error catastrophe as a new antiviral strategy. Virus Res. 107: 115–228.
DOMINGO, E., PARRISH, C., and HOLLAND, J., 2008. Origin and Evolution of Viruses, 2nd edition. Academic Press, San Diego, CA.
EIGEN, M., 1971. Selforganization of matter and the evolution of biological macromolecules. Naturwissenschaften 58: 465–523.
EIGEN, M. and SCHUSTER, P., 1977. The hypercycle. A principle of natural self-organization. Part A: emergence of the hypercycle. Naturwissenschaften 64: 541–565.
EIGEN, M. and SCHUSTER, P., 1978. The hypercycle. A principle of natural self-organization. Part B: the abstract hypercycle. Naturwissenschaften 65: 7–41.
EIGEN, M., MCCASKILL, J., and SCHUSTER, P., 1989. The molecular quasispecies. Adv. Chem. Phys. 75: 149–263.
ELOFSON, A. and von HEIJNE, G., 2007. Membrane protein structure: prediction versus reality. Annu. Rev. Biophys. 76: 125–140.
FISHER, R.A., 1930. The Genetical Theory of Natural Selection. Oxford University Press, Oxford, UK.
FISHER, R.A., 1941. Average excess and average effect of a gene substitution. Ann. Eugenics 11: 53–63.
FLAMM, C., FONTANA, W., HOFACKER, I.L., and SCHUSTER, P., 1999. Elementary step dynamics of RNA folding. RNA 6: 325–338.
FLAMM, C., HOFACKER, I.L., MAURER-STROH, S., STADLER, P.F., and ZEHL, M., 2001. Design of multi-stable RNA molecules. RNA 7: 254–265.
FLAMM, C., HOFACKER, I., STADLER, P., and WOLFINGER, M., 2002. Barrier trees of degenerate landscapes. Z. Phys. Chem. 216: 155–173.
FONTANA, W. and SCHUSTER, P., 1987. A computer model of evolutionary optimization. Biophys. Chem. 26: 123–147.


FONTANA, W. and SCHUSTER, P., 1998a. Continuity in evolution: on the nature of transitions. Science 280: 1451–1455.
FONTANA, W. and SCHUSTER, P., 1998b. Shaping space: the possible and the attainable in RNA genotype–phenotype mapping. J. Theor. Biol. 194: 491–515.
FONTANA, W., SCHNABL, W., and SCHUSTER, P., 1989. Physical aspects of evolutionary optimization and adaptation. Phys. Rev. A 40: 3301–3321.
GARDINER, E.J., HUNTER, C.A., PACKER, M.J., PALMER, D.S., and WILLETT, P., 2003. Sequence-dependent DNA structure: a database of octamer structural parameters. J. Mol. Biol. 332: 1025–1035.
GILLESPIE, D.T., 1976. A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. J. Comput. Phys. 22: 403–434.
GILLESPIE, D.T., 1977a. Concerning the validity of the stochastic approach to chemical kinetics. J. Stat. Phys. 16: 311–318.
GILLESPIE, D.T., 1977b. Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81: 2340–2361.
GILLESPIE, D.T., 2007. Stochastic simulation of chemical kinetics. Annu. Rev. Phys. Chem. 58: 35–55.
GRÜNER, W., GIEGERICH, R., STROTHMANN, D., REIDYS, C., WEBER, J., HOFACKER, I.L., and SCHUSTER, P., 1996a. Analysis of RNA sequence structure maps by exhaustive enumeration. I. Neutral networks. Monatsh. Chem. 127: 355–374.
GRÜNER, W., GIEGERICH, R., STROTHMANN, D., REIDYS, C., WEBER, J., HOFACKER, I.L., and SCHUSTER, P., 1996b. Analysis of RNA sequence structure maps by exhaustive enumeration. II. Structures of neutral networks and shape space covering. Monatsh. Chem. 127: 375–389.
HAMMING, R.W., 1950. Error detecting and error correcting codes. Bell Syst. Tech. J. 29: 147–160.
HAMMING, R.W., 1989. Coding and Information Theory, 2nd edition. Prentice Hall, Englewood Cliffs, NJ.
HENDRIX, D.K., BRENNER, S.E., and HOLBROOK, S.R., 2005. RNA structural motifs: building blocks of a modular biomolecule. Q. Rev. Biophys. 38: 221–243.
HOFACKER, I.L., 2003. Vienna RNA secondary structure server. Nucleic Acids Res. 31: 3429–3431.
HOFACKER, I.L., FONTANA, W., STADLER, P.F., BONHOEFFER, L.S., TACKER, M., and SCHUSTER, P., 1994. Fast folding and comparison of RNA secondary structures. Monatsh. Chem. 125: 167–188.
HOFACKER, I.L., SCHUSTER, P., and STADLER, P.F., 1998. Combinatorics of RNA secondary structures. Discrete Appl. Math. 89: 177–207.
HOLBROOK, S.R., 2008. Structural principles from large RNAs. Annu. Rev. Biophys. 37: 445–464.
HORWICH, A.L., FENTON, W.A., CHAPMAN, E., and FARR, G.W., 2007. Two families of chaperonin: physiology and mechanism. Annu. Rev. Cell Dev. Biol. 23: 115–145.
HUYNEN, M.A., STADLER, P.F., and FONTANA, W., 1996. Smoothness within ruggedness: the role of neutrality in adaptation. Proc. Natl. Acad. Sci. USA 93: 397–401.
JÄCKEL, C., KAST, P., and HILVERT, D., 2008. Protein design by directed evolution. Annu. Rev. Biophys. 37: 153–173.
JOYCE, G.F., 2004. Directed evolution of nucleic acid enzymes. Annu. Rev. Biochem. 73: 791–836.
KARTHA, G., BELLO, J., and HARKER, D., 1967. Tertiary structure of ribonuclease. Nature 213: 862–865.
KIM, E.E., VARADARAJAN, R., WYKOFF, H.W., and RICHARDS, F.M., 1992. Refinement of the crystal structure of ribonuclease S. Comparison with and between the various ribonuclease A structures. Biochemistry 31: 12304–12314.
KIMURA, M., 1968. Evolutionary rate at the molecular level. Nature 217: 624–626.
KIMURA, M., 1983. The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge, UK.
KLEIN, D.J., SCHMEING, T.M., MOORE, P.B., and STEITZ, T.A., 2001. The kink-turn: a new RNA secondary structure motif. EMBO J. 20: 4214–4221.
KLUG, A., 1999. Zinc fingers peptides for the regulation of gene expression. Trends Biochem. Sci. 293: 215–218.
KLUG, A. and RHODES, D., 1987. ‘Zinc fingers’: a novel protein motif for nucleic acid recognition. Trends Biochem. Sci. 12: 464–469.
KLUSSMANN, S., 2006. The Aptamer Handbook. Functional Oligonucleotides and Their Applications. Wiley-VCH Verlag, Weinheim.
KUPCZOK, A. and DITTRICH, P., 2006. Determinants of simulated RNA evolution. J. Theor. Biol. 238: 726–735.
LATTMAN, E.E., 2005. Sixth meeting on the critical assessment of techniques for protein structure prediction. Proteins 61: 1–2.
LEHMANN, N., 2005. Special issue on experimental evolution. J. Mol. Evol. 61 (2).
LEONTIS, N. and WESTHOF, E., 2001. Geometric nomenclature and classification of RNA base pairs. RNA 7: 499–512.
LEONTIS, N.B. and WESTHOF, E., 2003. Analysis of RNA motifs. Curr. Opin. Struct. Biol. 13: 300–308.
LEONTIS, N.B., ALTMANN, R.B., BERMAN, H.M., BRENNER, S.E., BROWN, J.W., ENGELKE, D.R., HARVEY, S.C., HOLBROOK, S.R., JOSSINET, F., LEWIS, S.E., MAJOR, F., MATHEWS, D.H., RICHARDSON, J.S., WILLIAMSON, J.R., and WESTHOF, E., 2006. The RNA ontology consortium: An open invitation to the RNA community. RNA 12: 533–541.
LESCOUTE, A. and WESTHOF, E., 2006. The interaction network of structured RNAs. Nucleic Acids Res. 34: 6587–6604.
LESCOUTE, A., LEONTIS, N.B., MASSIRE, C., and WESTHOF, E., 2005. Recurrent structural RNA motifs, isostericity matrices and sequence alignments. Nucleic Acids Res. 33: 2395–2409.
LEVINTHAL, C., 1968. Are there protein folding pathways? J. Chim. Phys. 65: 44–45.
LEVINTHAL, C., 1969. How to fold graciously. In Mössbauer Spectroscopy in Biological Systems (eds P. Debrunner, J.C.M. Tsibris, and E. Münck). University of Illinois Press, Urbana, IL, pp. 22–24.
LI, P.T.X., VIEREGG, J., and TINOCO, I., JR., 2008. How RNA unfolds and refolds. Annu. Rev. Biochem. 77: 77–100.

MARSHALL, G.R., FENG, J.A., and KUSTER, D.J., 2008. Back to the future: ribonuclease A. Biopolymers 90: 259–277.
MAYNARD SMITH, J., 1970. Natural selection and the concept of protein space. Nature 225: 563–564.
MCCASKILL, J.S., 1984. A localization threshold for macromolecular quasispecies from continuously distributed replication rates. J. Chem. Phys. 80: 5194–5202.
MCCASKILL, J.S., 1990. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 29: 1105–1119.
MCMANUS, M.T. and SHARP, P.A., 2002. Gene silencing in mammals by small interfering RNAs. Nat. Rev. Genetics 3: 737–747.
MILLS, D.R., PETERSON, R.L., and SPIEGELMAN, S., 1967. An extracellular Darwinian experiment with a self-duplicating nucleic acid molecule. Proc. Natl. Acad. Sci. USA 58: 217–224.
MONTANGE, R.K. and BATEY, R.T., 2008. Riboswitches: emerging themes in RNA structure and function. Annu. Rev. Biophys. 37: 117–133.
MOORE, P.B., 1999. Structural motifs in RNA. Annu. Rev. Biochem. 67: 287–300.
MOORE, S. and STEIN, W.H., 1973. Chemical structures of pancreatic ribonuclease and deoxyribonuclease. Science 180: 458–464.
OKASHA, S., 2008. Fisher’s fundamental theorem of natural selection: a philosophical analysis. Br. J. Philos. Sci. 59: 319–351.
ONUCHIC, J.N., LUTHEY-SCHULTEN, Z., and WOLYNES, P.G., 1997. Theory of protein folding: the energy landscape perspective. Annu. Rev. Phys. Chem. 48: 545–600.
PACKER, M.J., DAUNCEY, M.P., and HUNTER, C.A., 2000a. Sequence-dependent DNA structure: dinucleotide conformational maps. J. Mol. Biol. 295: 71–83.
PACKER, M.J., DAUNCEY, M.P., and HUNTER, C.A., 2000b. Sequence-dependent DNA structure: tetranucleotide conformational maps. J. Mol. Biol. 295: 85–103.
PARK, C. and RAINES, R.T., 2003. Ribonuclease A is limited by the rate of substrate association. Biochemistry 42: 3509–3518.
PÖRSCHKE, D., 1974. Thermodynamic and kinetic parameters of an oligonucleotide hairpin helix. Biophys. Chem. 1: 381–386.
PÖRSCHKE, D., 1977. Elementary steps of base recognition and helix-coil transitions in nucleic acids. In Chemical Relaxation in Molecular Biology (eds I. Pecht and R. Rigler). Springer Verlag, pp. 191–218.
PÖRSCHKE, D. and EIGEN, M., 1971. Co-operative non-enzymic base recognition. III. Kinetics of the helix-coil transition of the oligoribouridylic–oligoriboadenylic acid systems and of oligoriboadenylic acid alone at acidic pH. J. Mol. Biol. 62: 361–381.
RAINES, R.T., 1998. Ribonuclease A. Chem. Rev. 98: 1045–1065.
REIDYS, C., STADLER, P.F., and SCHUSTER, P., 1997. Generic properties of combinatory maps: neutral networks of RNA secondary structure. Bull. Math. Biol. 59: 339–397.


RICH, A. and RAJBHANDARY, U.L., 1976. Transfer RNA: molecular structure, sequence, and properties. Annu. Rev. Biochem. 45: 805–860. RIVAS, E. and EDDY, S.R., 1999. A dynamic programming algorithm for RNA structure prediction including pseudoknots. J. Mol. Biol. 285: 2053–2068. SCHUSTER, P., 2003. Molecular insight into the evolution of phenotypes. In Evolutionary Dynamics: Exploring the Interplay of Accident, Selection, Neutrality, and Function (eds J. P. Crutchfield and P. Schuster). Oxford University Press, New York, pp. 163–215. SCHUSTER, P., 2006. Prediction of RNA secondary structures: from theory to models and real molecules. Rep. Prog. Phys. 69: 1419–1477. SCHUSTER, P. and SWETINA, J., 1988. Stationary mutant distribution and evolutionary optimization. Bull. Math. Biol. 50: 635–660. SENETA, E., 1981. Non-negative matrices and Markov chains, 2nd edition. Springer-Verlag, New York. SERVICE, R.F., 2008. Problem solved (sort of). Science 321: 784–786. SMYTH, D.G., STEIN, W.H., and MOORE, S., 1963. Sequence of amino acid residues in bovine pancreatic ribonuclease: revisions and confirmations. J. Biol. Chem. 238: 227–234. SPIEGELMAN, S., 1971. An approach to the experimental analysis of precellular evolution. Q. Rev. Biophys. 4: 213–253. STADLER, B.R.M., STADLER, P.F., WAGNER, G.P., and FONTANA, W., 2001. The topology of the possible: formal spaces underlying patterns of evolutionary change. J. Theor. Biol. 213: 241–274. SVOBODOVA VAREKOVA, R., BRADAC, I., PLCHÚT, M., SKRDLA, M., WACENOVSKY, M., MAHR, H., MAYER, G., TANNER, H., BRUGGER, H., WITHALM, J., LEDERER, P., HUBER, H., GIERLINGER, G., GRAF, R., TAFER, H., HOFACKER, I., SCHUSTER, P., and POLCÍK, M., 2008. www.rnaworkbench.com: a new program for analyzing RNA interference. Comput. Methods Programs Biomed. 90: 89–94. SWETINA, J. and SCHUSTER, P., 1982. Self-replication with errors: a model for polynucleotide replication. Biophys. Chem. 16: 329–345. TARAZONA, P., 1992. Error-thresholds for molecular quasispecies as phase transitions: from simple landscapes to spin-glass models. Phys. Rev. A 45: 6038–6050. TRAPANE, T.L. and LATTMAN, E.E., 2007. Seventh meeting on the critical assessment of techniques for protein structure prediction. Proteins 69: 1–2. VAMVACA, K., VÖGELI, B., KAST, P., PERVUSHIN, K., and HILVERT, D., 2004. An enzymatic molten globule: efficient coupling of folding and catalysis. Proc. Natl. Acad. Sci. USA 101: 12860–12864. VENDRUSCOLO, M. and DOBSON, C.M., 2005. Towards complete description of free-energy landscapes of proteins. Philos. Transact. A Math. Phys. Eng. Sci. 363: 433–452. WATTS, A. and SCHWARZ, G., 1997. Evolutionary Biotechnology: From Theory to Experiment,

Vol. 66/2-3, Biophysical Chemistry. Elsevier, Amsterdam, pp. 67–284. WHITE, F.H., JR., 1961. Regeneration of native secondary and tertiary structure by air oxidation of reduced ribonuclease. J. Biol. Chem. 236: 1353–1360. WOLFINGER, M.T., SVRCEK-SEILER, W.A., FLAMM, C., HOFACKER, I.L., and STADLER, P.F., 2004. Efficient computation of RNA folding dynamics. J. Phys. A: Math. Gen. 37: 4731–4741. WRENN, S.J. and HARBURY, P.B., 2007. Chemical evolution as a tool for molecular discovery. Annu. Rev. Biochem. 76: 331–349. WRIGHT, S., 1932. The roles of mutation, inbreeding, crossbreeding and selection in evolution. In International Proceedings of the Sixth International Congress on Genetics, Vol. 1, Brooklyn Botanic Garden, Ithaca, NY (ed. D. F. Jones), pp. 356–366. WU, M. and TINOCO, I., JR., 1998. RNA folding causes secondary structure rearrangement. Proc. Natl. Acad. Sci. USA 95: 11555–11560. WUCHTY, S., FONTANA, W., HOFACKER, I.L., and SCHUSTER, P., 1999. Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers 49: 145–165. WYCKOFF, H.W., HARDMAN, K.D., ALLEWELL, N.M., INAGAMI, T., JOHNSON, L.N., and RICHARDS, F.M., 1967a. The structure of ribonuclease-S at 3.5 Å resolution. J. Biol. Chem. 242: 3984–3988. WYCKOFF, H.W., HARDMAN, K.D., ALLEWELL, N.M., INAGAMI, T., TSERNOGLOU, D., JOHNSON, L.N., and RICHARDS, F.M., 1967b. The structure of ribonuclease-S at 6 Å resolution. J. Biol. Chem. 242: 3749–3753. ZAGRYADSKAYA, E.I., KOTLOVA, N., and STEINBERG, S.V., 2004. Key elements in maintenance of the t-RNA L-shape. J. Mol. Biol. 340: 435–444. ZUKER, M., 1989. On finding all suboptimal foldings of an RNA molecule. Science 244: 48–52. ZUKER, M. and SANKOFF, D., 1984. RNA secondary structures and their prediction. Bull. Math. Biol. 46: 591–621. ZUKER, M. and STIEGLER, P., 1981. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 9: 133–148.

Chapter 8

Genome Evolution Studied Through Protein Structure

Philip E. Bourne, Kristine Briedis, Christopher Dupont, Ruben Valas, and Song Yang

8.1 INTRODUCTION
8.2 STRUCTURAL GRANULARITY AND ITS IMPLICATIONS
8.3 PROTEIN DOMAINS IN THE STUDY OF GENOME REARRANGEMENTS
8.4 PROTEIN DOMAIN GAIN AND LOSS
8.5 AND IN THE BEGINNING . . .
8.6 BUT LET US NOT FORGET THE INFLUENCE OF THE ENVIRONMENT
8.7 CONCLUSIONS
REFERENCES

8.1 INTRODUCTION

In 1859, Charles Darwin published The Origin of Species, a seminal work that defined the beginning of evolutionary biology (Darwin, 1859). However, Darwin's studies were constrained by phenotype, that is, those characteristics of organisms that could be visibly observed, both in living organisms and in the fossil record. The discovery of DNA as the carrier of genetic information (Avery et al., 1944) and the subsequent ability to sequence DNA (Sanger and Coulson, 1975; Sanger et al., 1977), RNA (Sanger, 1971), and later proteins (Biemann, 1992) enabled evolution to be studied at the molecular level. Advances in our evolutionary understanding came from an increase in the number of gene sequences and from improved computational techniques and computational infrastructure. Combining molecular observations with observations at the species level led to innovations such as molecular clocks (Zuckerkandl and Pauling, 1962), principles of parsimony, and to an understanding of allele frequencies based on the processes of mutation, genetic drift, gene flow, and natural selection. Such methods have largely focused on finding similarities and differences in the DNA or protein sequences of selected organisms.

As organisms evolve, events such as point mutations, chromosome rearrangements, duplications, insertions, and deletions may alter their respective genomes. Pinpointing these changes allows researchers to reconstruct how species are related (Freeman and Herron, 2003). Yet as time passes, DNA, and thus protein sequences, can evolve to a point where it becomes difficult to ascertain their past evolutionary relationship based on sequence analysis alone. It is here that protein structure, which defines the molecular repertoire, can play a unique role.

It was first noted by Lesk and Chothia (1980) that globin structures exhibited remarkable similarity even when their sequence identity was of the order of 15%. The relative plasticity of protein sequence when compared to protein structure has implications for the redundancy of protein sequences versus structures, the relative size of protein sequence versus structure space, and so on, issues that will be explored further in this chapter. For now consider a couple of further examples beyond the globins. Hon et al. (1997) found a surprising structural similarity between the aminoglycoside phosphotransferase APH(3′)-IIIa and the typical eukaryotic protein kinase catalytic domain, despite a low sequence identity. Another example was found by Holm and Sander in two glucosyltransferases. The protein sequences had less than 10% sequence identity, but the protein structures contained similarities that suggested evolutionary relatedness (Holm and Sander, 1995). It is important to note, however, that such structural similarities may also be the result of convergent evolution. For example, the serine endopeptidases subtilisin and chymotrypsin share a catalytic triad but are not otherwise similar (Bartlett et al., 2003).
It now appears that, from a protein structure perspective, convergent evolution is a relatively rare phenomenon and is mainly restricted to cases such as a catalytic triad, where three residues can adopt a similar conformation while the global structures that support that arrangement are quite different. In other words, while physical principles alone can govern the secondary structure of proteins, the likelihood that they would converge on the same tertiary and quaternary structure is very small.

Our major point, then, is that protein structure can provide valuable information when studying long evolutionary timescales, information that may not be apparent through DNA and protein sequence analysis alone. The potential issue with this approach then becomes the relative available coverage of protein structure versus protein sequence space. It is therefore worth considering the coverage of structural space as defined by the contents of the Protein Data Bank (PDB). In 1971, the PDB was established and contained seven protein structures. At the time of writing (November 2008), that number had grown to more than 53,000 structures (Bernstein et al., 1977; Berman et al., 2000). This explosion in structural data has presented new opportunities to study evolution from a structural perspective; however, that number needs to be put into perspective. First, there is a high level of redundancy within the PDB. Given the discussion above, it should be apparent that the number of nonredundant structures in the PDB will depend on whether you consider redundancy relative to sequence or relative to structure.
From a sequence perspective, as of May 2008 the PDB contained 3881 unique protein polypeptide chains according to PDBselect (Hobohm and Sander, 1994) from approximately 58,000 protein chains in the PDB of greater than 30 amino acids, a 15-fold redundancy brought about by many structures being the same protein with a different ligand, the same structure with posttranslational modifications, and so on. Given this redundancy, a useful question to ask, if trying to assess the value of structure to study molecular repertoires, is: what is the estimated coverage of protein structure space at the present time, and how does that relate to the coverage of fully sequenced proteomes? To begin to answer this question, we need to think more about redundancy, how we classify protein structures, and how we map them to existing fully sequenced proteomes. PDBselect groups protein structures by clustering similar sequences, but as we have seen that does not reflect structural redundancy. This phenomenon is illustrated in Figure 8.1, where each point represents one of 1000 structurally similar pairs of polypeptide chains according to CE, an algorithm for three-dimensional structure comparison (Shindyalov and Bourne, 1998). As can be seen, sequence identity is highly variable across all polypeptide chain lengths. Russ Doolittle coined the phrase the “Twilight Zone” (Doolittle, 1986) to describe the region of sequence identity where it was difficult to ascertain a relationship between two sequences, and subsequently Rost coined the phrase the “Midnight Zone” (Rost, 1999), where only structure can reveal a sequence relationship. From Figure 8.1, it can be seen that a very significant number of evolutionary relationships will only be easily revealed through structure comparison.

Figure 8.1 One thousand randomly selected structurally similar PDB polypeptide chains from CE with z > 4.5. (See insert for color representation of this figure.)

Automated structure comparison, as used above, is a useful tool, and yet achieving the best structural alignments is not a solved problem (Bourne and Shindyalov, 2003). However, the various methods work to guide other criteria, both human and algorithmic, to ascertain whether two protein structures can be considered similar. This similarity is captured by both the structural classification of proteins (SCOP) (Andreeva et al., 2008) and CATH (Greene et al., 2007), two protein structure classification schemes, with significant overlap, but some differences (Day et al., 2003). These resources are a tour de force in our understanding of protein structure space. While analogous conclusions can be drawn from either resource, for simplicity we will illustrate what is possible in using structure to study evolution using SCOP as our standard. SCOP is a hierarchy that in its current release (1.73) is based on 34,494 protein entries from PDB and consists of 97,198 protein domains (we will get to domains in a moment) organized into 3464 protein families and 1777 superfamilies. A protein family consists of proteins that have a clear evolutionary relationship observable at the sequence level.
A protein superfamily is one or more families where an evolutionary relationship is only clear from structural similarity. A member of a protein family consists of one or more domains that are reused to provide a large functional diversity from a small number of building blocks. A domain typically consists of a unique fold, of which there are currently 1086 according to SCOP. This is a remarkable level of redundancy given the diversity of protein sequence space. The premise that follows, and upon which much of our thinking is based, is that the invention of a new structural scaffold is a major evolutionary event that can be exploited in the study of evolution.

To make this leap, it is necessary to assign domains, families, and superfamilies to fully sequenced proteomes. Fortunately, the developers of SCOP and CATH and their colleagues have done that for us in the form of SUPERFAMILY (Wilson et al., 2007) and Gene3D (Yeats et al., 2008), respectively. Considering SUPERFAMILY, assignments are made by building hidden Markov models (HMMs) for domain combinations at the family and superfamily levels (Gough et al., 2001). Built from structure alignment seeds, they provide distant evolutionary relationships not necessarily seen from sequences alone.

The study of evolution utilizing protein structural information originated with Gerstein (1997), when only one species from each of the three superkingdoms had been sequenced. The fold recognition method using FASTA could only annotate 10–20% of the genome, and it was more of a classification than tree construction for the three species studied. However, this approach became more promising when more 3D structures became available and sequence comparison algorithms became more sophisticated (Wolf et al., 1999; Caetano-Anolles and Caetano-Anolles, 2003). The proof that the molecular repertoire as defined by protein structure was indeed a powerful tool in the study of evolution, at least for us, came from a simple experiment (Yang et al., 2005). We built a simple binary matrix that on one axis contained all the fully sequenced proteomes, and on the other axis a list of all the known protein superfamilies. In each cell of that matrix, if that superfamily was present in that organism, it was given a one, otherwise it was given a zero.
From this binary distribution, it was a simple step to establish a distance matrix and hence a tree. Remarkably, this tree looked very much like the tree of life after some weighting adjustments for highly symbiotic bacteria. That we could approximate the species lineage through the use of structural information set us on a path to exploit the value of structure in the study of evolution. It is the use of this molecular repertoire that forms the basis for the remainder of this chapter. That the tree was built from mere presence or absence rather than from abundance is itself an interesting issue. Trees built including the number of times a fold superfamily occurs are arguably less distinct.
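The presence/absence experiment described above can be sketched in a few lines. This is only an illustration under stated assumptions: the organism and superfamily names are made-up placeholders, and both the Jaccard distance and the average-linkage (UPGMA) clustering are choices made here for simplicity, since the text does not say which distance measure or tree-building method was used.

```python
from itertools import combinations

# Hypothetical toy data: placeholder organisms and placeholder superfamily IDs.
proteomes = {
    "org_A": {"sf1", "sf2", "sf3"},
    "org_B": {"sf1", "sf2", "sf4"},
    "org_C": {"sf5", "sf6"},
    "org_D": {"sf1", "sf5", "sf6"},
}

def presence_matrix(proteomes):
    """Return the sorted superfamily list and a 0/1 row per organism."""
    superfamilies = sorted(set().union(*proteomes.values()))
    rows = {org: [1 if sf in found else 0 for sf in superfamilies]
            for org, found in proteomes.items()}
    return superfamilies, rows

def jaccard_distance(row_a, row_b):
    """1 - shared / (present in either); an assumed distance measure."""
    both = sum(1 for a, b in zip(row_a, row_b) if a and b)
    either = sum(1 for a, b in zip(row_a, row_b) if a or b)
    return 1.0 - both / either if either else 0.0

def upgma(rows):
    """Average-linkage clustering; returns the tree as nested tuples."""
    # Each cluster is (subtree, list of member organisms).
    clusters = [(name, [name]) for name in sorted(rows)]

    def avg_dist(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(jaccard_distance(rows[a], rows[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two clusters with the smallest average distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: avg_dist(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

superfamilies, rows = presence_matrix(proteomes)
tree = upgma(rows)
print(tree)
```

Replacing the 0/1 entries with occurrence counts would give the abundance-based variant mentioned at the end of the section; the distance function would then need to be changed accordingly.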

8.2 STRUCTURAL GRANULARITY AND ITS IMPLICATIONS

Fold superfamilies are a coarse structural measure, whereas the domain provides a finer level of granularity. Protein domain is a misused term. Here, we refer to domains in a structural sense as compact independent folding units that can be considered the evolutionary units of currency: they are swapped, added, and taken away to provide a complex repertoire of proteins with diverse functions. In other words, the function of a domain can change, but its overall fold remains the same. Consider the P-loop containing nucleoside triphosphate hydrolases superfamily, which is one of the most abundant superfamilies in nature. It is present 1034 times in the human genome according to SUPERFAMILY. All of these proteins share a common evolutionary ancestor since they share this domain, but they each have a unique function. The difference in function between these proteins evolved through several mechanisms. The function of an individual domain can change, but it is also dependent on its functional context. To understand how the functions of these proteins evolved, one must study the history of their domains.

The simplest proteins are monomers composed of a single domain. Within a superfamily, there can be many proteins like this with different functions. Shikimate kinase is an example of a monomer composed of a single P-loop hydrolase domain. It catalyzes the phosphorylation of shikimate. Changes in sequence in the active site of an enzymatic protein could result in the binding of a new ligand and catalysis of a new reaction. There are 10 known different families of the P-loop hydrolase domain that are single-domain proteins that can bind 46 different ligands (Bashton et al., 2006). It is not surprising that most of these ligands are quite similar in structure. It is far simpler to tinker with an existing structure and perform a similar function than it is to evolve an entirely new structure that happens to perform a similar function.

The protein kinase-like superfamily is a case in point (Scheeff and Bourne, 2005). Protein kinases exist in all three superkingdoms of life. SCOP classifies them as a single domain, while CATH considers them as two domains. Regardless, there is a distinct ATP binding cassette that has been conserved across all species. What has changed in more dramatic fashion is the substrate binding component of the structure, which has clearly adapted to bind a large variety of substrates and use a variety of second messengers as signal transduction has continued to evolve. The characteristics of the structural signal within the families of the superfamily were distinct enough to build a phylogenetic tree based on differing structural characteristics (Scheeff and Bourne, 2005).

The function of a domain is also determined by its surrounding context within the protein. Not all proteins that contain P-loop hydrolase domains are composed of a single domain. This domain is found in combination with 91 other superfamilies in the human genome according to SUPERFAMILY. The same domain can have very different functions in different combinations. Table 2 in Bashton et al. (2006) summarizes the functions of the P-loop hydrolase domain in different combinations.
In this case, the domain does not gain a new function (it still only binds ligands it could bind as a single domain), but its function can be coupled with another domain. Most DNA helicases contain a P-loop domain as well as a DNA binding domain (Caruthers and McKay, 2002). Having both of these domains in a protein chain allows the unwinding of DNA to be powered by ATP hydrolysis. The helicase function did not evolve because the P-loop domain gained a new function. Instead, the new function arose because of a change in the context of that domain.

Another factor that can change the context and the function of a domain is its quaternary structure in a protein. Like domain combinations, binding partners can introduce several other domains, each with their own function. The FoF1-ATP synthase is composed of several different protein chains, several of which contain the P-loop hydrolase domain. In this structure, the γ-subunit rotates around the α and β subunits (Itoh et al., 2004). The energy from this rotation is stored in the production of ATP from ADP. The P-loop hydrolase domain binds the ADP and is the site where the reaction occurs, but it would be useless without the rest of the quaternary structure. Each of the chains performs a different function, but the combination of all the domains together completes the structure. In this case, the domain really is a building block in a structure with a higher level function.

An analysis of multidomain proteins in the three superkingdoms shows that two-thirds of prokaryote proteins have two or more domains, whereas four-fifths of proteins in eukaryotes are multidomain (Teichmann et al., 1998). Domain combinations in 40 genomes also show a power law distribution (Apic et al., 2001), where some two-domain or three-domain combinations, the so-called “supradomains”, frequently recur in different protein contexts (Vogel et al., 2004).
A simulation of the processes of domain duplication and combination suggests that domain combinations are stochastic processes followed by duplication to various extents (Vogel et al., 2005). During the evolution of domains, gene fusion is more common than gene fission (Kummerfeld and Teichmann, 2005), and convergent evolution is a rare event (Gough, 2005). A recent analysis suggested that the abundance of protein domains and domain combinations are correlated with the complexity of the organism, as characterized by the numbers of cell types each organism contains (Vogel and Chothia, 2006).
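As a rough illustration of how recurring domain combinations might be tallied, the sketch below counts, for every adjacent domain pair, the number of distinct domain architectures in which it appears. The protein names, the single-letter domain labels, and the threshold of two architectures are hypothetical, and this simple count is not the published procedure for defining supradomains.

```python
from collections import defaultdict

# Hypothetical proteins, each an ordered list of placeholder domain labels.
proteins = {
    "p1": ["A", "B"],
    "p2": ["A", "B", "C"],
    "p3": ["D", "A", "B"],
    "p4": ["C", "D"],
}

def pair_contexts(proteins):
    """Map each adjacent domain pair to the set of architectures containing it."""
    contexts = defaultdict(set)
    for domains in proteins.values():
        architecture = tuple(domains)
        for left, right in zip(domains, domains[1:]):
            contexts[(left, right)].add(architecture)
    return contexts

# Keep pairs seen in at least two different architectures (assumed threshold).
recurrent = {pair: len(archs)
             for pair, archs in pair_contexts(proteins).items()
             if len(archs) >= 2}
print(recurrent)
```

Counting architectures rather than raw occurrences keeps a pair from looking "recurrent" merely because one protein has been duplicated many times.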

8.3 PROTEIN DOMAINS IN THE STUDY OF GENOME REARRANGEMENTS

In the past few years, the accumulation of complete genomes from a variety of taxonomic groups has enabled whole-genome comparative analysis that yielded exciting insights into the components, structure, and evolution of genomes (Bentley and Parkhill, 2004; Miller et al., 2004). The structure of genomic DNA can be analyzed at different levels, ranging from the nucleotide sequence, to gene and protein location and organization, to operon structure, and to the overall genome size and GC content. Previous work has shown that genome structures are very dynamic with various evolutionary events, such as large-scale genomic inversion, translocation, duplication, as well as insertion and deletion, frequently occurring and altering the genomic structure of an organism (Mira et al., 2002b). Genomic rearrangements between closely related species can be observed using gene position plots, where two complete genomes are aligned according to the gene sequence on their linear chromosomes, similar to gene or protein sequence alignment based on nucleotide or amino acid sequences (Eisen et al., 2000; Suyama and Bork, 2001; Tillier and Collins, 2000). Using, for example, data from SUPERFAMILY, similar approaches based on protein domain order rather than gene order can be applied. Since the number of protein domains is limited and structure is more conserved than sequence, revealing remote paralogues and orthologues, it is beneficial to use protein domains as the basic element when comparing multiple species, in terms of both speed and accuracy. Both gene and protein domain position plots reveal that bacterial genomes often undergo symmetrical inversion around the origin and/or terminus of replication as shown schematically in Figure 8.2 (Mira et al., 2002a; Tillier and Collins, 2000). These appear as X-shapes in the position plots, although translocation and other evolutionary processes occur, disrupting this pattern.
Figure 8.3 illustrates a generic example of what can be seen from structural domain comparison. Each dot represents a similar domain. Lines show alignment, inversion, and translocation caused by insertion and deletion. The lines on the plot are calculated by local dynamic programming similar to the method used in nucleotide or amino acid sequence alignment. Here, the domain sequences of two chromosomes are compared; each domain is represented by its SCOP ID, so there are about 2000 unique identifiers in total, compared to 4 in nucleotide sequences and 20 in amino acid sequences, thus resulting in a more granular plot. As with a protein sequence alignment, gap penalties and mismatches are involved in getting an optimum alignment between two chromosomes. Because large-scale genome inversions are frequent, after the two chromosomes are compared in one direction, one chromosome is compared to the reversed domain sequence of the other chromosome. The combined score for the two-direction alignment reflects the overall similarity of the two chromosomes. Figure 8.4 illustrates a specific example taken from two closely related strains of Salmonella enterica. As the evolutionary distance between organisms increases, the relationship of genome structures becomes less resolved.

Figure 8.2 Schematic representation of symmetrical multistep chromosome inversion about the origin of replication.

Figure 8.3 Comparative domain mapping. (See insert for color representation of this figure.)

Figure 8.4 Comparison of two strains of Salmonella enterica, CT18 and Ty2, showing a distinct inversion. Wen Deng et al. Journal of Bacteriology, 2003, 185, 2330–2337.

In summary, comparative domain mapping uses data already computed and available from the SUPERFAMILY resource (Wilson et al., 2007) and, in a few seconds of compute time, generates a coarse-grained view of genome rearrangements based on their respective molecular repertoires.
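The comparison described in this section, a dot plot of shared domains plus a local alignment run in both orientations, might be sketched as follows. The domain labels are placeholders rather than real SCOP identifiers, the scoring values are arbitrary, and adding the forward and reverse scores is an assumption, since the text does not specify how the two directions are combined.

```python
def local_alignment_score(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Best Smith-Waterman local alignment score between two domain-ID lists."""
    prev = [0] * (len(seq_b) + 1)
    best = 0
    for a in seq_a:
        curr = [0]
        for j, b in enumerate(seq_b, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best

def two_direction_score(chrom_a, chrom_b, **scoring):
    """Forward score plus score against the reversed chromosome (assumed sum)."""
    forward = local_alignment_score(chrom_a, chrom_b, **scoring)
    reverse = local_alignment_score(chrom_a, chrom_b[::-1], **scoring)
    return forward + reverse

def dot_plot(chrom_a, chrom_b):
    """Coordinates (i, j) where the same domain occurs on both chromosomes."""
    return [(i, j) for i, a in enumerate(chrom_a)
            for j, b in enumerate(chrom_b) if a == b]

# Placeholder domain order; chrom_b carries an inversion of the d3..d5 segment.
chrom_a = ["d1", "d2", "d3", "d4", "d5", "d6"]
chrom_b = ["d1", "d2", "d5", "d4", "d3", "d6"]

print(dot_plot(chrom_a, chrom_b))
print(two_direction_score(chrom_a, chrom_b))
```

In the dot plot the inverted segment shows up as an anti-diagonal run of points, which is the building block of the X-shaped patterns mentioned earlier in the section.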

8.4 PROTEIN DOMAIN GAIN AND LOSS

Just as domain arrangements can tell us quickly and simply of genome rearrangements, domain gain and loss mapped to species trees can tell us a great deal about the emergence of new protein functions. Again, these data are present in SUPERFAMILY and merely need to be extracted and mapped to existing species trees. Consider an example taken from a recent study (Yang and Bourne, 2009). Figure 8.5 shows the domain tree for the class II MHC-associated invariant chain ectoplasmic trimerization domain (SCOP a.109.1.1), which plays a critical role in the assembly of the major histocompatibility complex (MHC), as well as in MHC II antigen processing (Stern et al., 2006). Absent in all bacteria and archaea, this domain appears in the genomes of all Amniota except Danio rerio. With regard to the principle of maximum parsimony, the evolutionary history of a.109.1.1 can be explicitly derived according to this distribution: a.109.1.1 originated from the root of Amniota and was inherited by all descendant organisms but lost from Danio rerio. Note that we cannot discount the possibility that the domain exists in Danio rerio but goes undetected because of the limits of the domain homology detection methodology. The abundance of domains in the genome of each species allows us to infer possible duplication events. In principle, such inference about evolutionary events can be applied to any protein domain, though the complexity varies.

Figure 8.5 Single domains and domain combinations mapped to the eukaryotic tree for SCOP domain a.109.1.1, the class II MHC-associated invariant chain ectoplasmic trimerization domain. (a) The number next to the species name represents the abundance of the domain in the genome of that species. (b) The letters represent different combination types. In this case, type b corresponds to N/A a.109.1.1 and c represents N/A a.109.1.1 g.28.1.1, where N/A is an unknown domain (no 3D structure, no SCOP ID).
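The kind of inference described above (a single origin followed by lineage-specific loss) can be illustrated with a small sketch. It assumes a Dollo-style reading of parsimony in which a domain is gained once, at the smallest clade containing every species that has it, and any subclade lacking it counts as a loss. The tree and species names are hypothetical placeholders, not the tree of Figure 8.5, and the set of species with the domain is assumed to be non-empty.

```python
# Hypothetical species tree as nested tuples; leaves are placeholder names.
species_tree = ("out1", (("sp1", "sp2"), ("sp3", ("sp4", "sp5"))))

# Species in which the domain was detected (made-up distribution).
present = {"sp1", "sp2", "sp4", "sp5"}

def leaves(node):
    """All leaf names below a node."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]

def gain_node(node, present):
    """Smallest subtree that contains every species carrying the domain."""
    if not isinstance(node, str):
        for child in node:
            if present <= set(leaves(child)):
                return gain_node(child, present)
    return node

def losses(node, present):
    """Maximal subtrees below the gain point in which the domain is absent."""
    if not present & set(leaves(node)):
        return [node]
    if isinstance(node, str):
        return []
    return [lost for child in node for lost in losses(child, present)]

origin = gain_node(species_tree, present)
lost_in = losses(origin, present)
print("gained at ancestor of:", leaves(origin))
print("lost in:", lost_in)
```

As the text cautions for Danio rerio, an inferred loss may simply be a detection failure, so the `present` set should be read as "detected", not "certainly present".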

8.5 AND IN THE BEGINNING . . .

As we have seen, genome evolution can be studied by looking at how protein domains, as representatives of the molecular repertoire, change in function and context. However, this does not explain the origin of domains. The birth of a new superfamily is the most difficult event to understand in terms of domain evolution. Where did the P-loop domain come from? What other domains have evolved from it? These questions are beyond our current level of understanding because there is no visible sequence or structural homology between many regions of protein space. Work has been done on defining relationships between superfamilies that have structural similarities at a subdomain level (Friedberg and Godzik, 2005; Taylor, 2002). While these methods do provide links between superfamilies, it is not clear whether they represent convergent or divergent evolution. Protein structures can also be compared across superfamilies by comparing their functional sites. Most members of the P-loop hydrolase superfamily bind nucleotides. It is possible that other superfamilies that bind the same ligands share a common ancestor with this superfamily. One can imagine a scenario where selective pressure allows a protein's structure to vary as long as it continues to bind a specific ligand. Two structures could drift apart to the point where they would not be considered to be the same superfamily. However, their ligand binding pockets should retain homology. New resources allow for comparison of active sites to detect such homology (Xie and Bourne, 2008). More tools will be developed for mapping such relationships, but the fundamental unit of these relationships will be the protein domain.

8.6 BUT LET US NOT FORGET THE INFLUENCE OF THE ENVIRONMENT

The evolution of molecular repertoires during the more than 4 billion years of the earth's history has not always occurred under identical environmental conditions. Temperatures have fluctuated by 50 K, atmospheric pressure is estimated to have fluctuated between 1 and 5 atm, while photoenergy has been fairly constant. Since 90% of evolution has taken place in the ocean, it makes sense to consider how these and other conditions impacted life in the ocean. Perhaps the greatest change came not from these changes in physical phenomena but from changes in life itself, which in turn impacted the environment; to complete the circle, the environment then impacted life, as suggested by the Gaia hypothesis originally proposed by Lovelock (2001). The emergence of cyanobacteria, and hence oxygenic photosynthesis, is associated with major changes in global biogeochemistry and metabolism (Kopp et al., 2005; Raymond and Segre, 2006). In particular, the rise in atmospheric oxygen approximately 2.3 billion years ago (Gya) (Bekker et al., 2004; Farquhar et al., 2000) potentially led to the ocean becoming euxinic (sulfidic and anoxic) approximately 1.8 Gya (Canfield and Teske, 1996; Arnold et al., 2004), prior to an oxygenation of deep waters from 0.7 to 1.0 Gya. These changes in the redox state of the ocean dramatically influenced trace metal chemistry and bioavailability. Specifically, the anoxic Archean ocean would have been rich in Fe, Mn, and Co, but deficient in Zn. Conversely, the modern oxic ocean is depleted in Fe and relatively rich in Zn (Saito et al., 2003).

A question we asked ourselves several years ago is, are these changes in any way imprinted on modern-day proteomes (Dupont et al., 2006)? Answering this question requires an examination of the molecular repertoire. One needs to look at the complement of metal binding proteins in modern-day proteomes, which is difficult from an examination of DNA or protein sequence alone. Metal binding proteins have specific amino acid residues in exact locations and orientations within a three-dimensional scaffold, and a structural bioinformatics approach takes full advantage of these traits. The hypothesis is that if early earth possessed an ocean relatively rich in, say, iron and cobalt, the proteomes of species that emerged in that period, namely, archaea and bacteria, would also show that enrichment. Likewise, eukaryotes that emerged later would be rich in zinc. We showed that these trends were indeed the case, raising the question as to whether other correlations can be found between the environment at a given time and the organisms that emerged in that time. This work is ongoing.

8.7 CONCLUSIONS

If you assume that the average protein is 300 amino acids long, this leads to 20^300 possible proteins, which is more than the number of atoms in the universe. At present, RefSeq (release 31) (Pruitt et al., 2007) contains 5.9 million reference protein sequences from 5513 taxa. While nowhere near approaching all protein sequences in all species, this still points to a massive reduction in possible sequence space. Similarly remarkable is that those 5.9 million sequences are currently represented by 1086 folds in SCOP 1.73 (Murzin et al., 1995). Certainly, the PDB is biased in what it contains, both because membrane proteins are experimentally hard to obtain and because its content is skewed toward drug discovery and certain types of highly sought-after functions, but nevertheless we have something quite remarkable. Nature has chosen a very small number of three-dimensional scaffolds upon which to base all of life. Current estimates range from 1500 to 5000 folds as scaffolds. This limited molecular repertoire points us to several very important conclusions. First, many different protein sequences adopt a singular structural framework. Consequently, using current sequence methods many homologous relationships cannot be detected without structure; the sequences have drifted so far as to not leave a discernible signal. Second, any structural invention represents a significant evolutionary event. Finally, there must be a mechanism whereby functional diversity is achieved with such a limited macromolecular structure repertoire. Understanding these conclusions begins with the protein domain. Through duplication, gain, loss, and rearrangement the functionality required to sustain advanced life is achieved. Pragmatically, protein domain distributions can be used to study at least those parts of genomes that code for proteins.
We have shown that simply treating each domain present in a protein as a point on a dot matrix plot can reveal gross features associated with genome rearrangement. Furthermore, mapping specific protein domains to their respective proteomes and looking at their presence, absence, and arrangement on species trees tells us a great deal about the emergence of new functions. Finally, the molecular repertoire has characteristics not available from sequence alone that can teach us a great deal about evolution. We have illustrated this by relating the repertoire to changes in the environment, but much more will be uncovered as the molecular repertoire expands to cover more of a given proteome and as more proteomes are sequenced. It will be interesting to see what new discoveries are made.
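As a rough check on the sequence-space estimate quoted in these conclusions, the number of possible 300-residue proteins can be converted to a power of ten. The comparison value of about $10^{80}$ atoms in the observable universe is a commonly quoted order-of-magnitude estimate and is an assumption here, not a figure taken from this chapter.

```latex
20^{300} \;=\; 10^{\,300 \log_{10} 20} \;\approx\; 10^{\,300 \times 1.301} \;\approx\; 10^{390} \;\gg\; 10^{80}
```

Against this, the 5.9 million RefSeq sequences mentioned above amount to fewer than $10^{7}$ points in that space.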


REFERENCES ANDREEVA, A., HOWORTH, D., CHANDONIA, J.M., BRENNER, S.E., HUBBARD, T.J., CHOTHIA, C., and MURZIN, A.G., 2008. Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 36: D419–D425. APIC, G., GOUGH, J., and TEICHMANN, S.A., 2001. Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J. Mol. Biol. 310: 311–325. ARNOLD, G.L., ANBAR, A.D., BARLING, J., and LYONS, T.W., 2004. Molybdenum isotope evidence for widespread anoxia in mid-Proterozoic oceans. Science 304: 87–90. AVERY, O., MACLEOD, C., and MCCARTY, M., 1944. Studies on the chemical nature of the substance inducing transformation of pneumococcal types. Inductions of transformation by a desoxyribonucleic acid fraction isolated from pneumococcus type III. J. Exp. Med. 2: 137–158. BARTLETT, G.J., TODD, A.E., and THORNTON, J.M., 2003. Inferring protein function from structure. Methods Biochem. Anal. 44: 387–407. BASHTON, M., NOBELI, I., and THORNTON, J.M., 2006. Cognate ligand domain mapping for enzymes. J. Mol. Biol. 364: 836–852. BEKKER, A., HOLLAND, H.D., WANG, P.L., RUMBLE, D., 3RD, STEIN, H.J., HANNAH, J.L., COETZEE, L.L., and BEUKES, N.J., 2004. Dating the rise of atmospheric oxygen. Nature 427: 117–120. BENTLEY, S.D. and PARKHILL, J., 2004. Comparative genomic structure of prokaryotes. Annu. Rev. Genet. 38: 771–792. BERMAN, H.M., WESTBROOK, J., FENG, Z., GILLILAND, G., BHAT, T.N., WEISSIG, H., SHINDYALOV, I.N., and BOURNE, P.E., 2000. The Protein Data Bank. Nucleic Acids Res. 28: 235–242. BERNSTEIN, F.C., KOETZLE, T.F., WILLIAMS, G.J., MEYER, E.F., JR., BRICE, M.D., RODGERS, J.R., KENNARD, O., SHIMANOUCHI, T., and TASUMI, M., 1977. The Protein Data Bank: a computer-based archival ﬁle for macromolecular structures. J. Mol. Biol. 112: 535–542. BIEMANN, K., 1992. Mass spectrometry of peptides and proteins. Annu. Rev. Biochem. 61: 977–1010. BOURNE, P.E. and SHINDYALOV, I.N., 2003. Structure comparison and alignment. Methods Biochem. Anal. 44: 321–337. 
CAETANO-ANOLLÉS, G. and CAETANO-ANOLLÉS, D., 2003. An evolutionarily structured universe of protein architecture. Genome Res. 13: 1563–1571. CANFIELD, D.E. and TESKE, A., 1996. Late Proterozoic rise in atmospheric oxygen concentration inferred from phylogenetic and sulphur-isotope studies. Nature 382: 127–132. CARUTHERS, J.M. and MCKAY, D.B., 2002. Helicase structure and mechanism. Curr. Opin. Struct. Biol. 12: 123–133. DARWIN, C., 1859. On the Origin of Species by Natural Selection, Murray, London, UK. DAY, R., BECK, D.A., ARMEN, R.S., and DAGGETT, V., 2003. A consensus view of fold space: combining SCOP, CATH, and the Dali Domain Dictionary. Protein Sci. 12: 2150–2160.

DOOLITTLE, R., 1986. Of URFs and ORFs: A Primer on How to Analyze Derived Amino Acid Sequences. University Science Books, Mill Valley, CA. DUPONT, C.L., YANG, S., PALENIK, B., and BOURNE, P.E., 2006. Modern proteomes contain putative imprints of ancient shifts in trace metal geochemistry. Proc. Natl. Acad. Sci. USA 103: 17822–17827. EISEN, J.A., HEIDELBERG, J.F., WHITE, O., and SALZBERG, S.L., 2000. Evidence for symmetric chromosomal inversions around the replication origin in bacteria. Genome Biol. 1: RESEARCH0011. FARQUHAR, J., BAO, H., and THIEMENS, M., 2000. Atmospheric inﬂuence of Earth’s earliest sulfur cycle. Science 289: 756–759. FREEMAN, S. and HERRON, J.C., 2003. Evolutionary Analysis. Prentice Hall. FRIEDBERG, I. and GODZIK, A., 2005. Fragnostic: walking through protein structure space. Nucleic Acids Res. 33: W249–W251. GERSTEIN, M., 1997. A structural census of genomes: comparing bacterial, eukaryotic, and archaeal genomes in terms of protein structure. J. Mol. Biol. 274: 562–576. GOUGH, J., 2005. Convergent evolution of domain architectures (is rare). Bioinformatics 21: 1464–1471. GOUGH, J., KARPLUS, K., HUGHEY, R., and CHOTHIA, C., 2001. Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J. Mol. Biol. 313: 903–919. GREENE, L.H., LEWIS, T.E., ADDOU, S., CUFF, A., DALLMAN, T., DIBLEY, M., REDFERN, O., PEARL, F., NAMBUDIRY, R., REID, A., SILLITOE, I., YEATS, C., THORNTON, J.M., and ORENGO, C.A., 2007. The CATH domain structure database: new protocols and classiﬁcation levels give a more comprehensive resource for exploring evolution. Nucleic Acids Res. 35: D291–D297. HOBOHM, U. and SANDER, C., 1994. Enlarged representative set of protein structures. Protein Sci. 3: 522–524. HOLM, L. and SANDER, C., 1995. Evolutionary link between glycogen phosphorylase and a DNA modifying enzyme. EMBO J. 14: 1287–1293. 
HON, W.C., MCKAY, G.A., THOMPSON, P.R., SWEET, R.M., YANG, D.S., WRIGHT, G.D., and BERGHUIS, A.M., 1997. Structure of an enzyme required for aminoglycoside antibiotic resistance reveals homology to eukaryotic protein kinases. Cell 89: 887–895. ITOH, H., TAKAHASHI, A., ADACHI, K., NOJI, H., YASUDA, R., YOSHIDA, M., and KINOSITA, K., 2004. Mechanically driven ATP synthesis by F1-ATPase. Nature 427: 465–468. KOPP, R.E., KIRSCHVINK, J.L., HILBURN, I.A., and NASH, C. Z., 2005. The Paleoproterozoic snowball Earth: a climate disaster triggered by the evolution of oxygenic photosynthesis. Proc. Natl. Acad. Sci. USA 102: 11131–11136. KUMMERFELD, S.K. and TEICHMANN, S.A., 2005. Relative rates of gene fusion and ﬁssion in multi-domain proteins. Trends Genet. 21: 25–30.


LESK, A.M. and CHOTHIA, C., 1980. How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins. J. Mol. Biol. 136: 225–270. LOVELOCK, J., 2001. Homage to Gaia: The Life of an Independent Scientist. Oxford University Press. MILLER, W., MAKOVA, K.D., NEKRUTENKO, A., and HARDISON, R.C., 2004. Comparative genomics. Annu. Rev. Genomics Hum. Genet. 5: 15–56. MIRA, A., KLASSON, L., and ANDERSSON, S.G., 2002a. Microbial genome evolution: sources of variability. Curr. Opin. Microbiol. 5: 506–512. MIRA, A., KLASSON, L., and ANDERSSON, S.G.E., 2002b. Microbial genome evolution: sources of variability. Curr. Opin. Microbiol. 5: 506–512. MURZIN, A.G., BRENNER, S.E., HUBBARD, T., and CHOTHIA, C., 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536–540. PRUITT, K.D., TATUSOVA, T., and MAGLOTT, D.R., 2007. NCBI Reference Sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 35: D61–D65. RAYMOND, J. and SEGRE, D., 2006. The effect of oxygen on biochemical networks and the evolution of complex life. Science 311: 1764–1767. ROST, B., 1999. Twilight zone of protein sequence alignments. Protein Eng. 12: 85–94. SAITO, M., SIGMAN, D., and MOREL, F., 2003. Inorg. Chim. Acta 356: 308–318. SANGER, F., 1971. Nucleotide sequences in bacteriophage ribonucleic acid. The eighth Hopkins memorial lecture. Biochem. J. 124: 833–843. SANGER, F. and COULSON, A.R., 1975. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J. Mol. Biol. 94: 441–448. SANGER, F., NICKLEN, S., and COULSON, A.R., 1977. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. USA 74: 5463–5467. SCHEEFF, E.D. and BOURNE, P.E., 2005. Structural evolution of the protein kinase-like superfamily. PLoS Comput. Biol. 1: e49. SHINDYALOV, I.N. and BOURNE, P.E., 1998. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11: 739–747. STERN, L.J., POTOLICCHIO, I., and SANTAMBROGIO, L., 2006. MHC class II compartment subtypes: structure and function. Curr. Opin. Immunol. 18: 64–69.

SUYAMA, M. and BORK, P., 2001. Evolution of prokaryotic gene order: genome rearrangements in closely related species. Trends Genet. 17: 10–13. TAYLOR, W.R., 2002. A ‘periodic table’ for protein structures. Nature 416: 657–660. TEICHMANN, S.A., PARK, J., and CHOTHIA, C., 1998. Structural assignments to the Mycoplasma genitalium proteins show extensive gene duplications and domain rearrangements. Proc. Natl. Acad. Sci. USA 95: 14658–14663. TILLIER, E.R. and COLLINS, R.A., 2000a. Genome rearrangement by replication-directed translocation. Nat. Genet. 26: 195–197. VOGEL, C. and CHOTHIA, C., 2006. Protein family expansions and biological complexity. PLoS Comput. Biol. 2: e48. VOGEL, C., BERZUINI, C., BASHTON, M., GOUGH, J., and TEICHMANN, S.A., 2004. Supra-domains: evolutionary units larger than single protein domains. J. Mol. Biol. 336: 809–823. VOGEL, C., TEICHMANN, S.A., and PEREIRA-LEAL, J., 2005. The relationship between domain duplication and recombination. J. Mol. Biol. 346: 355–365. WILSON, D., MADERA, M., VOGEL, C., CHOTHIA, C., and GOUGH, J., 2007. The SUPERFAMILY database in 2007: families and functions. Nucleic Acids Res. 35: D308–D313. WOLF, Y.I., BRENNER, S.E., BASH, P.A., and KOONIN, E.V., 1999. Distribution of protein folds in the three superkingdoms of life. Genome Res. 9: 17–26. XIE, L. and BOURNE, P.E., 2008. Detecting evolutionary relationships across existing fold space, using sequence order-independent profile–profile alignments. Proc. Natl. Acad. Sci. USA 105: 5441–5446. YANG, S. and BOURNE, P.E., 2009. The evolutionary history of protein domains viewed by species phylogeny. PLoS One, in press. YANG, S., DOOLITTLE, R.F., and BOURNE, P.E., 2005. Phylogeny determined by protein domain content. Proc. Natl. Acad. Sci. USA 102: 373–378. YEATS, C., LEES, J., REID, A., KELLAM, P., MARTIN, N., LIU, X., and ORENGO, C., 2008. Gene3D: comprehensive structural and functional annotation of genomes. Nucleic Acids Res. 36: D414–D418. ZUCKERKANDL, E. and PAULING, L.B., 1962. Molecular Disease, Evolution, and Genetic Heterogeneity. Academic Press, New York.

Chapter 9

Chromosomal Rearrangements in Evolution

Hao Zhao and Guillaume Bourque

9.1 INTRODUCTION
9.2 GENOME REPRESENTATION
9.3 CONSTRUCTING GENOME PERMUTATIONS FROM SEQUENCE DATA
9.4 GENOMIC DISTANCES
9.5 RECONSTRUCTION OF ANCESTORS AND EVOLUTIONARY SCENARIOS
9.6 RECENT APPLICATIONS ON LARGE GENOMES
9.7 CHALLENGES AND PROMISING NEW APPROACHES
ACKNOWLEDGMENT
REFERENCES

9.1 INTRODUCTION

Genomes evolve via two types of mutational events: point mutations and chromosomal mutations. Point mutations cause the substitution, insertion, or deletion of single nucleotides. Although they can have a significant impact on the genome (e.g., a base change could introduce an early stop codon that completely inactivates a gene), the consequences of point mutations are mostly limited to individual genes. In contrast, chromosomal mutations, or chromosomal rearrangements, are large-scale events that shuffle large sections of chromosomes and can affect the gene order or gene content of genomes. Compared to traditional molecular studies that focus on the analysis of point mutations, exploring the rearrangement history of a set of genomes provides a global view of the evolution of these organisms, in which the entire DNA is considered. Moreover, because chromosomal rearrangements are relatively rare events, studying them allows the recovery of deep evolutionary histories (e.g., the mammalian phylogeny; Ma et al., 2006; Murphy et al., 2005; Zhao and Bourque, 2007) with fewer problems associated with homoplasy. The analysis of comparative maps and of the rearrangement events they reveal was pioneered almost a century ago at Morgan’s Drosophila lab (Morgan and Bridges, 1916).

Evolutionary Genomics and Systems Biology, edited by Gustavo Caetano-Anolles. Copyright 2010 John Wiley & Sons, Inc.


More recently, large sequencing projects have rejuvenated this topic by making available the complete DNA sequences of many prokaryotic (Chain et al., 2006; Eppinger et al., 2007; Thomson et al., 2006) and eukaryotic genomes (Gibbs et al., 2004, 2007; Hillier et al., 2004; Lander et al., 2001; Venter et al., 2001; Waterston et al., 2002). One of the stated purposes of these projects is to further our knowledge of these species through comparative analyses (O’Brien et al., 1999). Such analyses can, for instance, lead to the identification of regions of genomic instability (i.e., regions with high rates of rearrangement) that challenge and help refine our understanding of the dynamics of chromosome evolution (Murphy et al., 2005; Pevzner and Tesler, 2003b). Furthermore, high-resolution breakpoint mapping and the identification of ancestral rearrangements enable the analysis of sequence features associated with such rearrangement events and can shed light on their underlying molecular mechanisms.

In this chapter, we review advances and developments in the analysis of rearrangement scenarios and present some recent applications of these methods to large genomes. The chapter is organized as follows. In Section 9.2, we present some basic concepts for analyzing chromosomal rearrangements, such as how to model and represent genomes. In Section 9.3, we review the main methods for identifying synteny blocks conserved between two or more genomes directly from the DNA sequences of these species. Next, in Section 9.4, we summarize two categories of criteria that can be used to compute the similarity or the distance between a pair of genomes. In Section 9.5, we review the main methods for reconstructing the gene orders of ancestors and evolutionary scenarios using the various genomic distances presented in Section 9.4. Finally, we discuss some challenges and promising directions in the study of chromosomal rearrangements.

9.2 GENOME REPRESENTATION

A genome consists of a single chromosome or of a collection of chromosomes and is called unichromosomal or multichromosomal accordingly. The initial focus of genome rearrangement studies was on the comparative analysis of small genomes such as mitochondria (Blanchette et al., 1999), chloroplasts (Cosner et al., 2000), and viruses (Hannenhalli et al., 1995), and of small regions or single chromosomes of larger genomes (Bafna and Pevzner, 1995; Pevzner and Tesler, 2003a). In these studies, the relative order of homologous genes in different organisms was used to infer phylogenetic relationships and even rearrangement scenarios. For this purpose, the relative gene order of different genomes can be encapsulated in a set of signed permutations. In this way, a genome is associated with a permutation in which each integer corresponds to one of its genes. One of the genomes can be designated as the reference genome and associated with the identity permutation id if it is unichromosomal; otherwise, its chromosomes can be concatenated into the identity permutation. The permutation associated with each other genome can be obtained directly from the order of appearance of the homologous genes in that genome. Furthermore, a sign corresponding to the relative orientation (strand) of the gene, as compared to the reference genome, is given to each integer of the new permutation. In practice, it is sometimes challenging to unambiguously identify gene orthologues across different genomes. Moreover, in large eukaryotic genomes (e.g., human or mouse), genes cover only a small fraction of the chromosomes. For these reasons, it is sometimes advantageous to build permutations from homologous markers that are not restricted to genes but can also correspond to arbitrary homologous genomic segments. Such segments are called synteny blocks. The example shown in Figure 9.1 is a permutation with


Figure 9.1 A permutation with five synteny blocks corresponding to a region of the human chr7 homologous to a region of the rhesus macaque chr3. The relative position and orientation of the blocks are shown with the “net” track of the UCSC genome web browser. The human genome is used as the reference and is set to be the identity permutation 1 2 3 4 5. The corresponding permutation for the rhesus macaque is 1 2 3 4 5. (See insert for color representation of this figure.)

five synteny blocks conserved across a region of the human chr7 and of the rhesus macaque chr3. Using the human genome as the reference, the homologous region of the rhesus macaque can be represented by the permutation 1 2 3 4 5. In the rest of this chapter, we will use gene and synteny block interchangeably to mean a homologous marker used to build a permutation representing a genome. The “gene order” data for a collection of genomes will be the set of permutations associated with these species. When representing genomes with permutations, we have assumed that a set of n genes was found with a unique homologous counterpart in all genomes. In many cases this assumption will be violated: a genome may have gained additional genes through rearrangement events such as insertions or duplications, and it may have lost genes following deletions. To encapsulate the relative gene orders of genomes with unequal gene content, we need to generalize the representation to account for this variable alphabet. Although models that are not restricted to equal gene content are more complete and realistic (see El-Mabrouk (2005) for a review), they are also more challenging algorithmically and have been limited to only a few applications (Sankoff, 1999; Tang and Moret, 2003). We will focus here on genomes with equal gene content, except for the presentation of some rearrangement events affecting gene content.
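The encoding described in this section can be sketched in a few lines of Python. This is a minimal illustration under the equal-gene-content assumption, not a published tool: the function name `signed_permutation`, the `(marker, strand)` input format, and the marker names are all hypothetical.

```python
def signed_permutation(reference, other):
    """Encode `other` as a signed permutation relative to `reference`.

    Both arguments are lists of (marker, strand) pairs in chromosomal order,
    with strand '+' or '-'. Markers are numbered 1..n by their position in
    the reference, so the reference itself is the identity permutation.
    Assumes equal gene content: every marker occurs exactly once in each list.
    """
    index = {name: (k + 1, strand) for k, (name, strand) in enumerate(reference)}
    if len(index) != len(reference) or sorted(index) != sorted(n for n, _ in other):
        raise ValueError("genomes must share the same set of unique markers")
    perm = []
    for name, strand in other:
        number, ref_strand = index[name]
        # The sign records the orientation relative to the reference genome.
        perm.append(number if strand == ref_strand else -number)
    return perm


# Hypothetical markers A..D; C and B appear inverted and swapped in `other`.
reference = [("A", "+"), ("B", "-"), ("C", "+"), ("D", "+")]
other = [("A", "+"), ("C", "-"), ("B", "+"), ("D", "+")]
print(signed_permutation(reference, other))  # [1, -3, -2, 4]
```

Rejecting unequal marker sets up front mirrors the restriction to equal gene content adopted in the text; handling insertions, duplications, and deletions would need a richer representation.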

9.3 CONSTRUCTING GENOME PERMUTATIONS FROM SEQUENCE DATA

As we have now seen, an important prerequisite for the analysis of rearrangement scenarios is the ability to convert the raw sequences of the genomes of interest into gene order data. In other words, we need to identify the synteny blocks conserved across the genomes under study and convert them into permutations. This problem, also known as comparative genome mapping, has been studied intensively as more and more genomes are sequenced (Darling et al., 2004; Pevzner and Tesler, 2003a; Swidan et al., 2006). In practice, it is quite challenging to identify conserved synteny blocks accurately, largely because (i) the genomes might contain significant portions of segmental duplications (SDs) (Kent et al., 2003) and repeats that can cause ambiguous mappings, and (ii) some genomic regions have accumulated so many point mutations that we can hardly define the ancestral state and hence detect the exact boundaries of the synteny blocks. In this section, we review the main approaches that aim to construct gene order data from the DNA sequences of two or more genomes.


Algorithms seeking to build synteny blocks are usually split into two main steps. The first step searches for a set of small homologous genomic segments called anchors. These anchors are similar genomic regions that are locally conserved in all the genomes under study. In the second step, the anchors are clustered and connected according to various rules, and the clusters are refined into synteny blocks. Several variants of both steps have been implemented.

Pevzner and Tesler (2003a), for instance, designed a method called GRIMM-Synteny to identify the synteny blocks of the human and mouse genomes. The idea behind GRIMM-Synteny is that microrearrangements (e.g., less than 1 Mb) can be distinguished from macrorearrangements affecting the large-scale organization of chromosomes. The anchors used in this particular work were the set of nonoverlapping local alignments detected by PatternHunter (Ma et al., 2002). In the clustering step, consecutive anchors within a distance cutoff G were connected (irrespective of their orientations). Small synteny blocks whose length was below a given minimum cluster size C were removed. Finally, the orientations of the resulting blocks were determined carefully using an algorithm presented in Pevzner and Tesler (2003c). By neglecting the orientations of anchors within blocks and removing small clusters, the microrearrangements were masked. Using this approach, a total of 281 synteny blocks larger than 1 Mb were identified between human and mouse.

Another method, called Mauve (Darling et al., 2004), finds candidate anchors called maximal unique matches (multi-MUMs) using multiple sequence alignments, and thus it is directly applicable to two or more genomes. Using a greedy breakpoint elimination algorithm, a subset of the multi-MUMs is selected as anchors for the clustering step. Unlike in GRIMM-Synteny, the selected anchors are required to be collinear, and thus a single block does not contain any rearrangement events. Mauve uses a recursive strategy to detect homology and to identify the blocks conserved across all the genomes. This approach was recently used to identify the synteny blocks of a set of bacterial genomes (Darling et al., 2008).

Another tool for synteny block identification is MAGIC (Swidan et al., 2006), an integrative method that incorporates more biological intuition into the comparative mapping problem. Specifically, the anchors used in MAGIC are a set of high-quality orthologues obtained from KEGG (Kanehisa et al., 2004). This approach makes use of the functional and positional information of the genes. The anchors are combined and refined to build a comprehensive many-to-many table that contains an initial list of reorder-free segments (RFs) between a pair of genomes. The ambiguous mappings resulting from outparalogues (called nuisance overlap) and duplications from inparalogues are then carefully removed.

Finally, other alignment tools have also been used to construct synteny blocks. In Ma et al. (2006), the UCSC chains and nets were used to build the blocks conserved in a set of large genomes. The chains are collinear local alignments obtained by Blastz (Schwartz et al., 2003), while the nets are sets of nested chains (Kent et al., 2003). In this approach, the nets were initially split into orthology blocks, and contiguous orthology blocks were then connected to form synteny blocks if they had the same orientation. We note that although these various approaches have been successfully applied to both bacterial and eukaryotic genomes (see Section 9.5), there is no standard statistical way to compare the accuracy of the synteny blocks identified by these methods.
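The two-step scheme above (find anchors, then cluster them) can be illustrated with a deliberately simplified sketch. It is not the GRIMM-Synteny implementation: anchors are reduced to hypothetical `(x, y)` coordinate pairs on one chromosome of each genome, any two anchors within a Manhattan distance `gap_cutoff` are linked (an assumed stand-in for the cutoff G), and clusters whose span in the first genome is below `min_size` (standing in for C) are discarded. Orientation is ignored, as in the clustering step described in the text.

```python
def cluster_anchors(anchors, gap_cutoff, min_size):
    """Group (x, y) anchors into candidate synteny blocks.

    Anchors closer than gap_cutoff (Manhattan distance) are linked, connected
    components become clusters, and clusters spanning less than min_size in
    the first genome are dropped. Quadratic in the number of anchors.
    """
    parent = list(range(len(anchors)))

    def find(k):
        # Union-find lookup with path halving.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for i in range(len(anchors)):
        for j in range(i + 1, len(anchors)):
            (x1, y1), (x2, y2) = anchors[i], anchors[j]
            if abs(x1 - x2) + abs(y1 - y2) <= gap_cutoff:
                parent[find(i)] = find(j)

    clusters = {}
    for k, anchor in enumerate(anchors):
        clusters.setdefault(find(k), []).append(anchor)

    blocks = []
    for members in clusters.values():
        xs = [x for x, _ in members]
        if max(xs) - min(xs) >= min_size:
            blocks.append(sorted(members))
    return sorted(blocks)


# Toy coordinates: two clusters survive, the isolated anchor is removed.
anchors = [(0, 100), (10, 110), (20, 120), (500, 50), (510, 40), (900, 900)]
print(cluster_anchors(anchors, gap_cutoff=30, min_size=10))
# [[(0, 100), (10, 110), (20, 120)], [(500, 50), (510, 40)]]
```

Numbering the surviving blocks by their order in the first genome and reading them off in the order of the second genome would then yield the permutation used in the following sections.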

9.4 GENOMIC DISTANCES

Given the permutations associated with two genomes, there are various criteria that can be used to measure the rearrangement distance between them. If we assume that one of the permutations is the identity $id$ and that the other genome is $G$, we can use $d(G, id)$, or $d(G)$ for simplicity, to represent the genomic distance between them. In this section, we review different criteria that can be used to measure their similarity. Overall, the distances fall into two main categories: model-free and rearrangement-based.

9.4.1 Model-Free Distances

In defining model-free distances, no assumption is made about the underlying rearrangement model; that is, these distances are not associated with any rearrangement operations. They only quantify the evolutionary distance between two genomes and do not infer the intermediate steps that separate them. Here, we present three such measures.

Breakpoint distance. This distance compares two permutations by directly counting the number of adjacencies present in one genome but not in the other (see, for example, Nadeau and Taylor (1984)). Formally, given two signed permutations $G$ and $id$ of size $n$, the first step in computing the breakpoint distance is to extend both permutations so that they start with 0 and end with $n+1$: $G = 0, g_1, g_2, \ldots, g_n, n+1$ and $id = 0, 1, 2, \ldots, n, n+1$. The breakpoint distance, $d_{break}(G)$, is defined as the number of adjacencies $(i, i+1)$, $0 \le i \le n$, such that neither the pair $(i, i+1)$ nor its reverse $(-(i+1), -i)$ appears in $G$. For instance, if G = 2 1 3 4, then the first step is to extend G to 0 2 1 3 4 5, and we get that $d_{break}(G) = 4$ since the adjacencies (0, 1), (2, 3), (3, 4), and (4, 5) are now disrupted in G.

We now present two additional model-free criteria: common intervals and conserved intervals. Although these two criteria measure similarity rather than distance between genomes, we discuss them here because they generalize the breakpoint distance by considering intervals of integers instead of only adjacencies.

Common interval. Given two permutations $G$ and $id$, a common interval is a shared set of two or more integers that are present as an interval in both $G$ and $id$ (Heber and Stoye, 2001; Uno and Yagiura, 2000). For example, Figure 9.2 shows the 10 common intervals shared between the permutation 1 −4 −2 3 5 8 9 −7 −6 and the identity permutation. The more common intervals two permutations share, the more similar they are. In the extreme case, where $G$ corresponds to the identity permutation $id$, there are $n(n-1)/2$ common intervals. In Figure 9.2, there are 10 common intervals, while there could have been a maximum of 36 common intervals between two permutations of size 9.

Conserved interval. Similar to a common interval, a conserved interval is also an interval of shared integers in $G$ and $id$, with the additional restriction that the set of integers must be framed by $[a, b]$ or $[-b, -a]$, where a is the smallest integer among

Figure 9.2 The 10 common intervals associated with the permutation 1 −4 −2 3 5 8 9 −7 −6. Each box corresponds to a common interval.

Figure 9.3 The three conserved intervals associated with the permutation 1 −4 −2 3 5 8 9 −7 −6. Each box corresponds to a conserved interval.

the set and b is the largest (Bergeron and Stoye, 2003). Continuing with the example used above, there are only three conserved intervals in G, and two of them are trivial in the sense that they contain only their frame elements (see Figure 9.3). Initially, the definition of conserved intervals may seem a bit arbitrary, but it is tightly connected to the Hannenhalli-Pevzner (HP) theory (Hannenhalli and Pevzner, 1995b; see below), and it has been shown to be useful in efficiently sorting permutations by reversals (Bergeron et al., 2004b). All three model-free measures have the advantage of being computable in polynomial time (Heber and Stoye, 2001; Uno and Yagiura, 2000). Furthermore, the computation of breakpoints and common/conserved intervals can be directly extended to a set of more than two genomes, which allows for the identification of shared features in a family of organisms.
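The three model-free measures can be sketched directly from their definitions. The Python below is an illustrative, unoptimized reading of those definitions, and the function names are hypothetical. Two assumptions are worth flagging: the breakpoint example uses the signed permutation −2 −1 3 −4, which is one sign assignment consistent with the four disrupted adjacencies listed in the text, and the interval functions compare against the identity without adding the frame elements 0 and n+1.

```python
def breakpoint_distance(perm):
    """Count identity adjacencies (i, i+1) absent from the extended permutation."""
    n = len(perm)
    ext = [0] + list(perm) + [n + 1]
    present = set(zip(ext, ext[1:]))
    return sum(
        1
        for i in range(n + 1)
        if (i, i + 1) not in present and (-(i + 1), -i) not in present
    )


def count_common_intervals(perm):
    """Count sets of >= 2 integers contiguous in both perm and id (signs ignored)."""
    vals = [abs(g) for g in perm]
    count = 0
    for i in range(len(vals)):
        lo = hi = vals[i]
        for j in range(i + 1, len(vals)):
            lo, hi = min(lo, vals[j]), max(hi, vals[j])
            if hi - lo == j - i:  # the values form a run of consecutive integers
                count += 1
    return count


def conserved_intervals(perm):
    """List frames (left, right) of intervals framed by [a, b] or [-b, -a]."""
    found = []
    for i in range(len(perm)):
        for j in range(i + 1, len(perm)):
            left, right = perm[i], perm[j]
            if 0 < left < right:
                a, b = left, right
            elif left < right < 0:
                a, b = -right, -left
            else:
                continue
            inner = [abs(g) for g in perm[i + 1:j]]
            # The framed elements must be exactly the integers between a and b.
            if len(inner) == b - a - 1 and all(a < v < b for v in inner):
                found.append((left, right))
    return found


print(breakpoint_distance([-2, -1, 3, -4]))               # 4
print(count_common_intervals(list(range(1, 10))))          # 36, i.e. n(n-1)/2 for n = 9
print(conserved_intervals([1, -4, -2, 3, 5, 8, 9, -7, -6]))
# [(1, 5), (8, 9), (-7, -6)]
```

The last call recovers the three conserved intervals of the running example, two of which, (8, 9) and (−7, −6), are trivial. These quadratic loops favor readability; the cited papers describe faster algorithms.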

9.4.2 Rearrangement-Based Distances

In contrast to model-free distances, rearrangement-based distances are directly associated with specific rearrangement operations. The most common rearrangement operations include reversal, translocation, fusion, fission, and transposition. More general operations have also been proposed, such as the block interchange (BI) (Christie, 1996; Lin et al., 2005) and the double-cut-and-join (DCJ) (Yancopoulos et al., 2005). A common feature of these operations is that performing them on a permutation modifies only the gene order of a genome, not its gene content. See Table 9.1a for examples of such operations. Given the permutations of two chromosomes $X = x_1 x_2 \ldots x_{i-1} x_i x_{i+1} \ldots x_m$ and $Y = y_1 y_2 \ldots y_{j-1} y_j y_{j+1} \ldots y_n$, these operations can be formally defined as follows.

• A reversal $r(i, j)$, where $i \le j$, transforms $X$ into $x_1 x_2 \ldots x_{i-1}\, (-x_j)(-x_{j-1}) \ldots (-x_{i+1})(-x_i)\, x_{j+1} \ldots x_m$ by inverting both the order of $x_i x_{i+1} \ldots x_{j-1} x_j$ and the sign of each gene.

• A translocation $tloc(i, j, X, Y)$ acts on two chromosomes $X = X_1 X_2$ and $Y = Y_1 Y_2$, where $X_1 = x_1 x_2 \ldots x_{i-1}$, $X_2 = x_i x_{i+1} \ldots x_m$, $Y_1 = y_1 y_2 \ldots y_{j-1}$, and $Y_2 = y_j y_{j+1} \ldots y_n$. It either exchanges $X_1$ and $Y_1$, leading to two new chromosomes $X' = Y_1 X_2$ and $Y' = X_1 Y_2$, or exchanges $X_1$ and $Y_2$, leading to two new chromosomes $X'' = -Y_2 X_2$ and $Y'' = X_1\, {-Y_1}$.

• A fission breaks a chromosome $X = X_1 X_2$ into two new chromosomes $X_1$ and $X_2$ (where $X_1$ and $X_2$ are nonempty segments). A fusion is the opposite of a fission and connects two chromosomes $X_1$ and $X_2$ to form a single chromosome $X_1 X_2$ or $X_1\, {-X_2}$.

• A transposition $t(i, j, k)$ picks up a segment $x_i \ldots x_j$ of a chromosome and reinserts it immediately after $x_k$. If $x_k$ is on the same chromosome ($k > j$ or $k < i$), the transposition is intrachromosomal; otherwise, it is interchromosomal.

• A block interchange $BI(i, j; k, m)$, with $i \le j < k \le m$, on a chromosome $X$ swaps the two segments $x_i \ldots x_j$ and $x_k \ldots x_m$ and transforms $X$ into $X' = x_1 x_2 \ldots x_{i-1}\; x_k \ldots x_m\; x_{j+1} \ldots x_{k-1}\; x_i \ldots x_j\; x_{m+1} \ldots x_n$ (here $n$ denotes the length of $X$). Interestingly, if a transposition t is intrachromosomal, it can be viewed equivalently as a BI that exchanges two adjacent segments (Christie, 1996). For instance, the transposition t shown in Table 9.1 moves the segment 2 3 after gene 6 and leads to the target genome 1 4 5 6 2 3 7 8. It is easy to see that the BI operation (labeled as special) in Table 9.1a swaps the two adjacent segments 2 3 and 4 5 6 and leads to the same target genome.

• A double-cut-and-join operation cuts two adjacencies (a, b) and (c, d) and rejoins the four affected integers, leading to either (a, d), (c, b) or (a, −c), (−b, d). The flexibility in rejoining the associated genes allows a DCJ to act as a reversal or as a translocation, depending on whether the two adjacencies belong to the same chromosome or to different ones. One of the two adjacencies is allowed to be “empty” (see the adjacency (0, 0) in Table 9.2a), in which case the DCJ mimics a fusion or a fission. Moreover, if (a, b) and (c, d) are nonempty and on the same chromosome, they are also permitted to be rejoined into (a, d), (b, c), where (b, c) indicates that the two ends of the segment b . . . c are connected to form a circular intermediate (CI). Although the formation of a CI looks unnatural and does not correspond to any of the most common rearrangement operations, it is useful because it allows two consecutive DCJ operations to mimic a BI or a transposition (see Table 9.2b).

Table 9.1 Examples of the Rearrangement Operations That Can Affect (a) the Gene Order or (b) the Gene Content of a Genome

Operation                     Original               Target
(a) Operations affecting gene order
Reversal                      1 2 3 4 5              1 −3 −2 4 5
Translocation                 1 2 3 4 5 | 6 7 8 9    1 2 8 9 | 6 7 3 4 5
Fusion                        1 2 3 4 | 5 6 7        1 2 3 4 5 6 7
Fission                       1 2 3 4 5 6 7          1 2 3 4 | 5 6 7
Transposition                 1 2 3 4 5 6 7 8        1 4 5 6 2 3 7 8
Block interchange (special)   1 2 3 4 5 6 7 8        1 4 5 6 2 3 7 8
Block interchange (general)   1 2 3 4 5 6 7 8        1 6 7 4 5 2 3 8
(b) Operations affecting gene content
Duplication                   1 2 3 4 5              1 2 3 2′ 3′ 4 5
Insertion                     1 2 3 4 5              1 2 3 6 4 5
Deletion                      1 2 3 4 5              1 4 5

A “|” separates the chromosomes of a multichromosomal genome.
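The order-changing operations defined above translate almost directly into list manipulations. The following Python sketch uses 1-based, inclusive indices to match the definitions; the function names are hypothetical, chromosomes are plain lists of signed integers, and only the first translocation variant (exchange of $X_1$ and $Y_1$) is shown. The example calls reproduce targets from Table 9.1a.

```python
def reversal(chrom, i, j):
    """r(i, j): reverse the order of x_i..x_j and flip the sign of each gene."""
    return chrom[:i - 1] + [-g for g in reversed(chrom[i - 1:j])] + chrom[j:]


def translocation(x, y, i, j):
    """tloc(i, j, X, Y), prefix exchange only: returns (Y1 X2, X1 Y2)."""
    x1, x2 = x[:i - 1], x[i - 1:]
    y1, y2 = y[:j - 1], y[j - 1:]
    return y1 + x2, x1 + y2


def transposition(chrom, i, j, k):
    """t(i, j, k): move x_i..x_j to just after x_k on the same chromosome."""
    if i <= k <= j:
        raise ValueError("x_k must lie outside the moved segment")
    segment = chrom[i - 1:j]
    if k > j:
        return chrom[:i - 1] + chrom[j:k] + segment + chrom[k:]
    return chrom[:k] + segment + chrom[k:i - 1] + chrom[j:]


def block_interchange(chrom, i, j, k, m):
    """BI(i, j; k, m) with i <= j < k <= m: swap x_i..x_j and x_k..x_m."""
    return (chrom[:i - 1] + chrom[k - 1:m] + chrom[j:k - 1]
            + chrom[i - 1:j] + chrom[m:])


genome = [1, 2, 3, 4, 5, 6, 7, 8]
print(reversal([1, 2, 3, 4, 5], 2, 3))                     # [1, -3, -2, 4, 5]
print(translocation([1, 2, 3, 4, 5], [6, 7, 8, 9], 3, 3))  # ([6, 7, 3, 4, 5], [1, 2, 8, 9])
print(transposition(genome, 2, 3, 6))                      # [1, 4, 5, 6, 2, 3, 7, 8]
print(block_interchange(genome, 2, 3, 4, 6))               # [1, 4, 5, 6, 2, 3, 7, 8]
print(block_interchange(genome, 2, 3, 6, 7))               # [1, 6, 7, 4, 5, 2, 3, 8]
```

The fourth and fifth lines show the equivalence noted in the text: an intrachromosomal transposition gives the same result as a block interchange of two adjacent segments.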

In Table 9.2a, we show, using examples presented in Table 9.1a, how a DCJ can be used to mimic a reversal, a translocation, a fission, and a fusion. In Table 9.2b, we illustrate how two DCJs can be used to mimic a BI. In the first step of this example, the segment 2 3 4 5 forms a CI, and in the second step, the CI is absorbed by another DCJ and the target genome is obtained.
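The two-step BI example can be traced in code. The sketch below is a restricted illustration rather than a general DCJ implementation: it covers only the excision of a circular intermediate from a linear chromosome and its reabsorption in one orientation, with hypothetical function names and 0-based cut positions.

```python
def excise_circle(chrom, p, q):
    """First DCJ: cut the adjacencies just before indices p and q (0 < p < q < len).

    Rejoining the outer ends leaves the linear chromosome chrom[:p] + chrom[q:];
    the segment chrom[p:q] closes on itself as a circular intermediate.
    """
    return chrom[:p] + chrom[q:], chrom[p:q]


def absorb_circle(chrom, circle, cut_linear, cut_circle):
    """Second DCJ: open the circle before cut_circle and splice it into chrom.

    The linear chromosome is cut just before index cut_linear; the circle,
    read cyclically from cut_circle, is inserted at that point.
    """
    opened = circle[cut_circle:] + circle[:cut_circle]
    return chrom[:cut_linear] + opened + chrom[cut_linear:]


genome = [1, 2, 3, 4, 5, 6, 7, 8]
# DCJ one: cut (1, 2) and (5, 6); rejoin (1, 6) and close 2 3 4 5 into a circle.
linear, circle = excise_circle(genome, 1, 5)
print(linear, circle)   # [1, 6, 7, 8] [2, 3, 4, 5]
# DCJ two: cut (7, 8) and (3, 4); rejoin (7, 4) and (3, 8).
print(absorb_circle(linear, circle, 3, 2))   # [1, 6, 7, 4, 5, 2, 3, 8]
```

The result matches the general block interchange of Table 9.1a, obtained here with two cut-and-join steps instead of one dedicated operation.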

172

Chapter 9

Chromosomal Rearrangements in Evolution

Table 9.2 (a) A DCJ Is Used to Mimic the Reversal, the Translocation, the Fusion, and the Fission from Table 9.1a. (b) Two Consecutive DCJs Are Used to Mimic the BI from Table 9.1a

(a) Examples of operations that are mimicked by a single DCJ

Operation       Original                  Affected adjacencies   Resulting adjacencies   Target
Reversal        1/2 3/4 5                 (1, 2), (3, 4)         (1, −3), (−2, 4)        1 −3 −2 4 5
Translocation   1 2/3 4 5; 6 7/8 9        (2, 3), (7, 8)         (2, 8), (7, 3)          1 2 8 9; 6 7 3 4 5
Fission         1 2 3 4/5 6 7; 0/0        (4, 5), (0, 0)         (4, 0), (0, 5)          1 2 3 4 0; 0 5 6 7
Fusion          1 2 3 4/0; 0/5 6 7 8      (4, 0), (0, 5)         (4, 5), (0, 0)          1 2 3 4 5 6 7 8; 0 0

(b) An example of a BI that is mimicked by two consecutive DCJs

1/2 3 4 5/6 7 8
DCJ one. It cuts (1, 2), (5, 6) and leads to a circular intermediate (CI) 2 3 4 5 by rejoining (2, 5) and (1, 6).
1 6 7/8 and the CI 2 3 4 5
DCJ two. It cuts (3, 4), (7, 8) and absorbs the CI by rejoining (7, 4) and (3, 8).
1 6 7 4 5 2 3 8

Here, a "0" in italic represents an artificial empty gene. The "/" between two integers denotes the place where a DCJ cuts the adjacency. A ";" separates two chromosomes.

Finally, we note that other types of rearrangement operations can affect the gene content of a genome: duplications, insertions, and deletions (see Table 9.1b for examples of such operations).

Given a set of permissible rearrangements selected from the ones presented above, the rearrangement distance between a genome G and the identity permutation, $d_{rear}(G)$, is defined as the minimum number of such operations required to convert G into id. The interest in looking for the minimum number of steps is that, under the assumption that such events are rare (and that our rearrangement model is correct), we hope to recover the sequence of rearrangements that actually occurred. The caveat is that the most parsimonious scenarios will underestimate the actual number of operations, especially when the number of steps is above a threshold of yn, where n is the size of the permutation and y is in the range from 1/3 to 2/3 (Bourque and Pevzner, 2002; Wang and Warnow, 2001). When the genomes studied are unichromosomal and only reversals are considered, the problem of transforming one permutation G into id with the minimum number of reversals is


known as the problem of sorting by reversals (SBR). In the rest of this section, we review a methodology that was developed to solve the SBR problem and its generalization to multichromosomal genomes, which considers reversals, translocations, fusions, and fissions. We will refer to this method as the Hannenhalli–Pevzner theory (H-P theory for short). This approach was developed by Bafna and Pevzner (1995) and by Hannenhalli and Pevzner (1995a) more than a decade ago. It has since been improved by Tesler (2002a) and implemented in a program called GRIMM (Tesler, 2002b). After this summary, we will introduce alternative distances associated with other rearrangement operations.

Hannenhalli–Pevzner Theory and Its Elementary Interpretation

The first method to solve the SBR problem was presented by Hannenhalli and Pevzner (1995a). We describe how it calculates the distance and sorts unichromosomal genomes when reversals are the only permissible operations. Again, we assume the two given permutations are G and id. In this approach, a breakpoint graph composed of alternating black and gray edges is constructed to describe the relative order and orientations of the genes in the two genomes. Using the information embedded in the breakpoint graph, Hannenhalli and Pevzner devised a polynomial-time algorithm to optimally transform G into id. This algorithm relies on manipulating structures of the breakpoint graph such as cycles (c), hurdles (h), and fortresses (f). Specifically, the reversal distance $d_{Rev}(G)$ can be calculated by

$d_{Rev}(G) = n + 1 - c + h + f,$

where once again n is the size of G. We refer the reader to Hannenhalli and Pevzner (1999) for the details and the proof of this result. Instead, we will present the elementary interpretation of the H-P theory that uses concepts that are defined directly on the permutations (Bergeron, 2005).
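Before turning to that interpretation, the cycle term of the formula $d_{Rev}(G) = n + 1 - c + h + f$ can be made concrete. The sketch below counts the alternating cycles of the breakpoint graph using a standard construction that we assume here (it is not spelled out in this chapter): each signed gene x is replaced by two vertices, the permutation is framed by 0 and 2n + 1, black edges join consecutive vertices of different genes, and gray edges join the vertices 2k and 2k + 1. It then returns n + 1 − c, which equals the reversal distance only when the permutation has no hurdles and no fortress; the function names are our own.

```python
def breakpoint_graph_cycles(perm):
    """Count alternating black/gray cycles in the breakpoint graph of a signed permutation."""
    n = len(perm)
    # +x becomes the vertex pair (2x-1, 2x); -x becomes (2x, 2x-1).
    verts = [0]
    for x in perm:
        a, b = 2 * abs(x) - 1, 2 * abs(x)
        verts.extend((a, b) if x > 0 else (b, a))
    verts.append(2 * n + 1)

    # Black edges: vertices at positions (0, 1), (2, 3), ... of the framed sequence.
    black = {}
    for i in range(0, len(verts), 2):
        u, v = verts[i], verts[i + 1]
        black[u], black[v] = v, u

    def gray(v):
        # Gray edges join 2k and 2k+1 (adjacent in the identity permutation).
        return v + 1 if v % 2 == 0 else v - 1

    seen, cycles = set(), 0
    for start in range(2 * n + 2):
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:       # alternate black edge, then gray edge
            seen.add(v)
            w = black[v]
            seen.add(w)
            v = gray(w)
    return cycles


def reversal_distance_without_hurdles(perm):
    """n + 1 - c: the reversal distance when h = f = 0, a lower bound otherwise."""
    return len(perm) + 1 - breakpoint_graph_cycles(perm)


print(breakpoint_graph_cycles([1, 2, -5, -3, -4, 6]))            # 4
print(reversal_distance_without_hurdles([1, 2, -5, -3, -4, 6]))  # 3
```

For the identity permutation every black edge coincides with a gray edge, so c = n + 1 and the value returned is 0.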
An oriented pair (a, b) in a permutation G is composed of two integers a and b such that |a| − |b| = ±1 and such that a and b have opposite signs. Intuitively, an oriented pair indicates the two ends of a potential reversal that can be performed to create two consecutive integers. For instance, in the permutation 1 2 −5 −3 −4 6, there are two oriented pairs: (2, −3) and (−5, 6). The pair (2, −3) corresponds to a reversal r1 that would flip the segment −5 −3 and would lead to two new consecutive integers 2 3. Because the reversal r1 is associated with an oriented pair, we call it an oriented reversal. Similarly, the pair (−5, 6) can be associated with an oriented reversal r2 that would flip the segment −5 −3 −4 and create two consecutive integers 5 6. Because the process of sorting G corresponds to removing breakpoints by creating consecutive integers, a straightforward sorting strategy is to use oriented reversals at every step (Bergeron, 2005). However, it is important to note that not all oriented reversals are equally beneficial. For instance, in the example above, performing the reversal r2 leads to an intermediate permutation with no oriented pairs (and hence no oriented reversal). A better strategy is therefore to perform the oriented reversal with the maximum score, where the score of an oriented reversal r of G is defined as the number of oriented pairs in the permutation G′ that results from performing r. In the example above, the score of r1 is 2 because the resulting permutation 1 2 3 5 −4 6 contains two oriented pairs, (3, −4) and (5, −4), while the score of r2 is 0 because it leads to the permutation 1 2 4 3 5 6, which contains only positive integers. In this case, r1 is the only oriented reversal that can be performed in the first step to sort G optimally. Interestingly, this simple strategy leads to an optimal sequence of reversals that sorts G in most cases.
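The scoring strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are ours, pairs are reported as index pairs, and the permutation is used unframed, as in the example in the text (Bergeron's presentation frames permutations with 0 and n + 1, which matters for inputs such as a single negative gene). The loop stops as soon as no oriented pair remains, so it does not handle hurdles.

```python
def oriented_pairs(perm):
    """Index pairs (i, j), i < j, whose absolute values differ by 1 and whose signs differ."""
    return [(i, j)
            for i in range(len(perm))
            for j in range(i + 1, len(perm))
            if abs(abs(perm[i]) - abs(perm[j])) == 1 and perm[i] * perm[j] < 0]


def reverse_segment(perm, start, end):
    """Reverse perm[start..end] (inclusive) and flip the sign of every gene in it."""
    return perm[:start] + [-x for x in reversed(perm[start:end + 1])] + perm[end + 1:]


def oriented_reversal(perm, i, j):
    """Apply the reversal that makes the oriented pair at (i, j) consecutive."""
    if perm[i] + perm[j] == 1:
        return reverse_segment(perm, i, j - 1)
    return reverse_segment(perm, i + 1, j)      # here perm[i] + perm[j] == -1


def score(perm, i, j):
    """Number of oriented pairs left after applying the oriented reversal for (i, j)."""
    return len(oriented_pairs(oriented_reversal(perm, i, j)))


def greedy_oriented_sort(perm):
    """Repeatedly apply a maximum-score oriented reversal while one exists."""
    steps = [perm]
    while oriented_pairs(perm):
        i, j = max(oriented_pairs(perm), key=lambda p: score(perm, *p))
        perm = oriented_reversal(perm, i, j)
        steps.append(perm)
    return steps


g = [1, 2, -5, -3, -4, 6]
print(oriented_pairs(g))              # [(1, 3), (2, 5)], i.e. (2, -3) and (-5, 6)
print(score(g, 1, 3), score(g, 2, 5)) # 2 0
for step in greedy_oriented_sort(g):
    print(step)
```

On this example the loop applies three reversals and ends at the identity permutation, which agrees with the distance of 3 implied by the text.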


When the permutation consists of only positive integers and no oriented pairs exist, framed intervals can be defined to tackle these more complicated cases. Formally, a framed interval [i . . . i + k] is a conserved interval whose frames are i and i + k and in which every integer between i and i + k has a positive sign. A hurdle of G is a framed interval that contains no shorter framed interval (Bergeron, 2005). Such a hurdle also corresponds to the hurdle as defined on the breakpoint graph of G in Hannenhalli and Pevzner (1995a). Additional reversals need to be performed to remove such hurdles from a permutation and to create oriented pairs without generating new hurdles. Such reversals can either cut a single hurdle or merge two hurdles. A reversal is said to cut a hurdle [i . . . i + k] if it flips the segment between the frames i and i + k, while a reversal is said to merge two hurdles [i . . . i + k] and [i′ . . . i′ + k′], where the first hurdle precedes the second, if it flips the segment i + k . . . i′. The algorithm presented in Bergeron (2005) can be summarized as follows: when an oriented pair exists, perform the oriented reversal with the maximum score; otherwise, cut a hurdle or merge two hurdles such that the number of hurdles is decreased. It was proved in Bergeron (2005) that this straightforward algorithm can optimally sort any permutation G.

An extension of the SBR problem to multichromosomal genomes was studied by Hannenhalli and Pevzner (1995b). In this work, the authors derived an algorithm and an equation to compute the rearrangement distance between two multichromosomal genomes using reversals, translocations, fusions, and fissions. We refer the reader to Tesler (2002a) for the details of the calculation, but we will briefly present how the formula can be obtained.
In order to compute the rearrangement distance between a multichromosomal genome G and the identity permutation, the main idea is to concatenate the chromosomes of G into a new permutation G′ and apply the algorithm designed for the SBR problem. In this way, every rearrangement in the multichromosomal genome G is mimicked by a reversal in G′. For an optimal concatenation of G, sorting G′ is equivalent to sorting G. Tesler (2002a) also showed that when such an optimal concatenate does not exist, a near-optimal concatenate exists such that sorting this concatenate corresponds to sorting the multichromosomal genome and uses a single extra reversal to reorder the chromosomes.

When transpositions are involved, the problem of sorting a genome G becomes very difficult (Artif et al., 2008; Bafna and Pevzner, 1998; Tzvika and Ron, 2006). Interestingly, sorting by the more general operation of BI can be done in polynomial time (Lin et al., 2005). Two software tools, ROBIN (Lu et al., 2005) and SPRING (Lin et al., 2006), have been developed for sorting by BIs or by reversals plus BIs.

Yancopoulos et al. (2005) and Bergeron et al. (2006) have studied the rearrangement distance under the DCJ operation. They defined the adjacency graph between two genomes, formulated the DCJ distance, and explained its relationship to the HP distance. Formally, the DCJ distance was defined as $d_{DCJ}(G_1, G_2) = n - (C + I/2)$, where n is the size of the genomes G1 and G2 and where C and I are two parameters associated with the configuration of the adjacency graph (its number of cycles and its number of odd-length paths, respectively). In a recent study (Bergeron et al., 2008), the DCJ distance is related to the HP distance by $d_{DCJ} = d_{HP} - t$, where t is the cost of not resorting to unoriented DCJ operations.
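The formula $d_{DCJ}(G_1, G_2) = n - (C + I/2)$ can be evaluated directly from the two gene orders. The sketch below is an illustration under several assumptions of ours: genomes are lists of linear chromosomes (circular chromosomes are not handled), both genomes contain the same genes exactly once, each gene is split into a tail and a head extremity, and C and I are taken to be the number of cycles and of odd-length paths of the adjacency graph. All function names are hypothetical.

```python
def extremities(gene):
    """Return the (left, right) extremities of a signed gene as read along the chromosome."""
    g = abs(gene)
    tail, head = (g, "t"), (g, "h")
    return (tail, head) if gene > 0 else (head, tail)


def partner_map(genome):
    """Map each gene extremity to the extremity adjacent to it (None at a telomere)."""
    partner = {}
    for chrom in genome:
        ends = [extremities(g) for g in chrom]
        for left, right in ends:
            partner.setdefault(left, None)
            partner.setdefault(right, None)
        for (_, right), (left, _) in zip(ends, ends[1:]):
            partner[right] = left
            partner[left] = right
    return partner


def dcj_distance(genome_a, genome_b):
    """n - (C + I/2), computed on the adjacency graph of two linear genomes."""
    pa, pb = partner_map(genome_a), partner_map(genome_b)
    if pa.keys() != pb.keys():
        raise ValueError("both genomes must contain the same genes")
    n = len(pa) // 2
    seen, cycles, odd_paths = set(), 0, 0
    for start in pa:
        if start in seen:
            continue
        # Each extremity is one edge of the adjacency graph; collect one component.
        stack, component = [start], []
        seen.add(start)
        while stack:
            e = stack.pop()
            component.append(e)
            for nxt in (pa[e], pb[e]):
                if nxt is not None and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if all(pa[e] is not None and pb[e] is not None for e in component):
            cycles += 1                      # no telomere involved: a cycle
        elif len(component) % 2 == 1:
            odd_paths += 1                   # path with an odd number of edges
    return n - (cycles + odd_paths // 2)


print(dcj_distance([[1, 2, 3, 4, 5]], [[1, -3, -2, 4, 5]]))                    # 1
print(dcj_distance([[1, 2, 3, 4, 5, 6, 7, 8]], [[1, 6, 7, 4, 5, 2, 3, 8]]))    # 2
```

The two calls agree with Table 9.2: a reversal costs a single DCJ, whereas the block interchange requires two.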

9.5 RECONSTRUCTION OF ANCESTORS AND EVOLUTIONARY SCENARIOS

So far we have presented methods that are mostly used for the analysis of pairs of genomes. Given the gene order data of a set of related genomes, another important challenge is the reconstruction of the rearrangement phylogeny that best explains the genetic relationships


Figure 9.4 An example of a phylogenetic tree T with four species: human, chimpanzee, rhesus macaque, and dog. The internal nodes on the tree correspond to their ancestors.

between the organisms. In this context, phylogenies are represented by binary trees where the leaf nodes of the trees correspond to contemporary genomes and the internal nodes correspond to their extinct ancestors (see Figure 9.4 for an example). The problem of phylogenetic tree reconstruction includes the recovery of the tree topology, of the ancestors, and of the rearrangement scenario linking these ancestors to the extant genomes. It can be formally defined as follows: given the permutations of a set of m genomes, the problem is to reconstruct the tree topology T for these species and label the ancestral genomes with permutations such that D(T) is minimized:

$D(T) = \sum_{e=(A,B) \in T} d(e),$

where d(e) can be any of the genomic distances described in Section 9.3. This problem is difficult because the number of trees grows faster than exponentially with the number of leaf nodes. When the tree topology is known, this problem is known as the small parsimony problem. In particular, this problem with m = 3 (called the median problem) was proved to be NP-hard (Caprara, 1999; Pe'er and Shamir, 1998) under both the breakpoint distance $d_{break}$ and the reversal distance $d_{rev}$. Although no polynomial-time algorithm is known for this problem, different heuristics have been devised to approximate it. Many algorithms, for instance, search all (or a subset) of the candidate trees and solve the small parsimony problem on each of them. Finally, the tree T with the minimum D(T) is output. In this section, we review a few of the main approaches that aim to reconstruct the ancestral genomes under different distances presented in Section 9.3.
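To make the objective concrete, the sketch below evaluates the sum of d(e) over the edges of a small tree whose internal node has already been labeled. Several choices here are assumptions made only for illustration: d(e) is taken to be a breakpoint distance on signed linear permutations framed by 0 and n + 1, the tree is given as a list of edges, and the genome names are hypothetical. The last function counts unrooted binary topologies on m labeled leaves with the standard formula (2m − 5)!!, which shows how quickly the search space grows.

```python
from math import prod


def framed(perm):
    """Frame a signed permutation with 0 and n + 1."""
    return [0] + list(perm) + [len(perm) + 1]


def adjacency_set(perm):
    """All adjacencies of a framed permutation, read in both directions."""
    f = framed(perm)
    adj = set()
    for a, b in zip(f, f[1:]):
        adj.add((a, b))
        adj.add((-b, -a))   # the same adjacency read on the opposite strand
    return adj


def breakpoint_distance(p, q):
    """Number of adjacencies of p that are not adjacencies of q."""
    q_adj = adjacency_set(q)
    f = framed(p)
    return sum((a, b) not in q_adj for a, b in zip(f, f[1:]))


def tree_score(edges, genomes, dist=breakpoint_distance):
    """D(T): sum of the distances along all edges (A, B) of the tree."""
    return sum(dist(genomes[a], genomes[b]) for a, b in edges)


def count_unrooted_binary_trees(m):
    """(2m - 5)!! unrooted binary topologies on m >= 3 labeled leaves."""
    return prod(range(3, 2 * m - 4, 2))


# A median instance (m = 3): one internal node M joined to three leaves.
genomes = {
    "M": [1, 2, 3, 4],       # candidate ancestor
    "A": [1, 2, 3, 4],
    "B": [1, -3, -2, 4],
    "C": [1, 2, -3, 4],
}
edges = [("M", "A"), ("M", "B"), ("M", "C")]
print(tree_score(edges, genomes))                                # 4
print([count_unrooted_binary_trees(m) for m in (4, 5, 10)])      # [3, 15, 2027025]
```

Solving the median problem amounts to searching over the label of M for the value that minimizes this score; the sketch only evaluates one candidate.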

9.5.1 Model-Free Reconstruction Algorithms

One of the first attempts at solving this problem was made by Sankoff and Blanchette (1997), who studied the median problem under the breakpoint distance and made a clever reduction to the traveling salesman problem, for which reasonably efficient algorithms are available.


Using this result, Blanchette et al. (1997) developed BPAnalysis, a method to recover the most parsimonious scenario also under the breakpoint distance. This method looks for the optimal assignment of internal nodes for a given topology by iteratively solving a series of median problems. More recently, a ﬁrst approach to reconstruct conserved intervals in ancestors was presented in Bergeron et al. (2004a). The algorithm can be viewed as a variant of the Fitch algorithm (Fitch, 1971) in the sense that internal nodes are labeled with all possible conserved intervals from the bottom to the root and then a reﬁnement step is performed starting from the root and down to the leaves to obtain the unique solution. In 2006, Ma and colleagues published an algorithm called inferCARs to recover contiguous ancestral regions (Ma et al., 2006). In this work, the adjacencies observed in the modern genomes are represented with binary integers and, using once again a Fitch-like algorithm, the ancestors are labeled using a bottom-up approach followed by a top-down reﬁnement step. For this, the authors constructed predecessor and successor graphs to detect the possible left and right neighbors of each gene in the ancestors and used adjacencies in an outgroup species to resolve conﬂicts in these graphs.
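The bottom-up and top-down passes mentioned above can be illustrated with the classical Fitch procedure applied to a single binary character, for example the presence (1) or absence (0) of one adjacency in each modern genome. This is a generic sketch of Fitch parsimony and not the inferCARs implementation: the nested-tuple tree encoding, the function names, the tie-breaking rule, and the example topology and character states are all assumptions made for illustration.

```python
def fitch_bottom_up(node, leaf_states, sets):
    """Post-order pass: intersect the children's state sets, or unite them if disjoint."""
    if isinstance(node, str):                      # a leaf, identified by its name
        sets[node] = {leaf_states[node]}
    else:
        left, right = node
        a = fitch_bottom_up(left, leaf_states, sets)
        b = fitch_bottom_up(right, leaf_states, sets)
        sets[node] = (a & b) or (a | b)
    return sets[node]


def fitch_top_down(node, sets, labels, parent_state=None):
    """Pre-order pass: keep the parent's state when allowed, otherwise pick one.

    Returns the number of state changes in the subtree rooted at node.
    """
    state = parent_state if parent_state in sets[node] else min(sets[node])
    labels[node] = state
    changes = 0
    if not isinstance(node, str):
        for child in node:
            changes += fitch_top_down(child, sets, labels, state)
            changes += labels[child] != state
    return changes


# Hypothetical rooted topology on the four species of Figure 9.4.
tree = ((("human", "chimp"), "rhesus"), "dog")
# Hypothetical character: the adjacency is seen only in human and chimp.
adjacency_present = {"human": 1, "chimp": 1, "rhesus": 0, "dog": 0}

sets, labels = {}, {}
fitch_bottom_up(tree, adjacency_present, sets)
n_changes = fitch_top_down(tree, sets, labels)

print(sets[(("human", "chimp"), "rhesus")])   # {0, 1}: ambiguous after the first pass
print(labels[(("human", "chimp"), "rhesus")]) # 0: resolved by the top-down pass
print(labels[("human", "chimp")], n_changes)  # 1 1
```

The ambiguous set at the human–chimp–rhesus ancestor is resolved by the root, so the adjacency is inferred to have arisen once, on the branch leading to the human–chimp ancestor.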

9.5.2 Rearrangement-Based Reconstruction Algorithms

As an extension of the BPAnalysis method described above, Moret and colleagues (Siepel and Moret, 2001) studied the small parsimony problem and the full phylogeny problem using the linear-time algorithm developed to compute the reversal distance (Bader and Moret, 2001). The method was called GRAPPA and was designed for unichromosomal genomes under both the breakpoint and the reversal distance models (Moret et al., 2002). Concurrently, Bourque and Pevzner (2002) developed a method called MGR that can also be used for both the small parsimony problem and the full phylogeny problem and that makes use of properties of additive or nearly additive trees. This algorithm, combined with GRIMM (Tesler, 2002a), is applicable to unichromosomal genomes for the reversal distance and to multichromosomal genomes for a rearrangement distance that allows reversals, translocations, fusions, and fissions. The main idea of the algorithm is to look for rearrangements in the starting genomes that reduce the total distance to the other genomes and to iteratively perform these rearrangements to "reverse history". The key is to use a good criterion for selecting the order in which to perform the rearrangements. Both GRAPPA and MGR can output the optimal (or nearly optimal) tree and the gene order of the ancestors on this tree.

More recently, Zhao and Bourque (2007) used simulated data to show that the scenarios produced by GRAPPA and MGR recover only a fraction of the "actual" ancestral events and that the specificity of the events in these reconstructed scenarios is relatively low. They proposed a new criterion to evaluate the reconstruction quality: the reliability of the rearrangement events predicted. They also presented an efficient method called EMRAE to recover reliable ancestral events. The idea behind this algorithm is to use adjacencies conserved across many modern genomes to recover the events associated with limited breakpoint reuse.
In this way, EMRAE predicts a partial but reliable scenario consisting of reversals and transpositions. This approach can also be extended to other types of events (e.g., translocations and fusions/ﬁssions; Zhao and Bourque, 2009). Rearrangement-based reconstructions can also be studied in a maximum likelihood (ML) framework. In this context, similar assumptions are made on the permissible types of rearrangement operations and a well-deﬁned parameter space is searched to identify the


tree(s) and the ancestors that are most likely to have generated the observed genomes. For instance, BADGER (Larget et al., 2005) was designed to solve this problem and uses a Markov chain Monte Carlo (MCMC) strategy to efficiently sample the parameter space. ML-based methods are computationally intensive, as they face a large number of potential states for both the ancestors and the trees. Finally, although the DCJ and the BI operations have been used to study chromosomal rearrangements, these analyses have so far mostly focused on pairs of genomes, as described in the previous section.

9.6 RECENT APPLICATIONS ON LARGE GENOMES

The growing availability of complete prokaryotic and eukaryotic genomic sequences has enabled a number of studies exploring ancestral genome architectures and rearrangement histories and has led to new insights into the chromosomal evolution of modern species. For instance, the application of MGR to the gene order data of a set of mammalian genomes and of the chicken genome has highlighted significant changes in the rates of rearrangement events across different lineages (Bourque et al., 2005). Another comparative analysis, using eight mammalian genomes, revealed a high level of breakpoint reuse during mammalian evolution (approximately 20% of the breakpoints were predicted to have been reused; Murphy et al., 2005). The algorithm EMRAE (Zhao and Bourque, 2007) was developed in part to bypass an ongoing debate on the quality of ancestral reconstructions, where ambiguities and alternative solutions are mostly a consequence of breakpoint reuse (Bourque et al., 2006; Froenicke et al., 2006; Rocchi et al., 2006). EMRAE aims to identify highly reliable ancestral events and was recently applied to the gene order data of six mammalian genomes: human, chimpanzee, rhesus, mouse, rat, and dog (Zhao and Bourque, 2009). At a 50 Kb resolution, the total numbers of rearrangement events predicted on this tree with six species are shown in Figure 9.5. A schematic illustration of the reversals predicted on the path from the primate ancestor to the human is also shown in Figure 9.6. Interestingly, the analysis of the sequence

Figure 9.5 Predicted rearrangement events made by EMRAE on a subset of the mammalian phylogeny (human, chimp, rhesus, mouse, rat, and dog) using synteny blocks constructed at a 50 Kb resolution (see Zhao and Bourque (2009)). The four numbers on each edge of the tree correspond to the number of events predicted: reversals, translocations, transpositions, and fission/fusions.

Figure 9.6 A schematic localization on the human genome of the reversals that are predicted by EMRAE on the path between the primate ancestor and the human. "hs" means a human-specific reversal, "hcs" means a human–chimp-specific reversal, and "hcrs" means a human–chimp–rhesus-specific reversal. Black boxes correspond to centromeres. (See insert for color representation of this figure.)

features associated with the predicted reversals in the same data set also revealed that both paired SD and paired L1 repeats were signiﬁcantly enriched. Apart from the comparative analyses performed on eukaryotic genomes, Darling et al. (2008) studied the rearrangement evolution of nine bacterial genomes. In this work, the authors generated the gene order data using Mauve and applied a modiﬁed version of BADGER to sample the solution space. Interestingly, they reported extensive breakpoint reuse in the evolution of the studied genomes and also revealed an overrepresentation of symmetric reversals whose ends are of equal distance to the replication origins. Advances and developments in the analysis of chromosomal rearrangements along with the availability of more and more fully sequenced genomes will continue to provide opportunities for large-scale evolutionary comparisons. A number of software tools devised to either compare genomes using different distances or to study chromosomal evolution are listed in Table 9.3.

9.7 CHALLENGES AND PROMISING NEW APPROACHES

The underlying evolutionary model selected will always have a critical impact on the reconstructed scenarios. In this chapter, we presented approaches that used either model-free distances or rearrangement model distances. One of the limitations of the rearrangement-based models is that the candidate operations that they exploit (e.g., reversals,


Table 9.3 Software Available to Study Chromosomal Evolution

BPAnalysis    http://www.mcb.mcgill.ca/blanchem/software.html
GRIMM         www-cse.ucsd.edu/groups/bioinformatics/GRIMM
MGR           www-cse.ucsd.edu/group/bioinformatics/MGR
BADGER        badger.duq.edu
GRAPPA        www.cs.unm.edu/moret/GRAPPA
EMRAE         www.gis.a-star.edu.sg/bourque
ROBIN         genome.life.nctu.edu.tw/ROBIN/
SPRING        algorithm.cs.nthu.edu.tw/tools/SPRING/
inferCARs     www.bx.psu.edu/miller_lab/car/

translocations, fusions, and fissions) are usually considered to be equally likely (i.e., the weight of each of the events is the same when the distance is computed). In reality, short reversals, for instance, are probably more common than other operations, but this is not incorporated into the models. A creative solution to this problem might be to combine a model-free distance with a rearrangement-based distance when trying to recover evolutionary scenarios. For instance, in Berard et al. (2007), Matthias et al. (2006), and Matthias et al. (2008), the reconstruction methods aim to find the most parsimonious scenario in which the reversals are restricted to preserve the common or conserved intervals of the permutations of the input genomes. In this way, the predictions made by these approaches might be more biologically realistic.

Another limitation of current rearrangement-based reconstruction methods is that they assume that each position of the genome is equally likely to be affected by rearrangement events. This assumption, which corresponds to the random breakage model (Nadeau and Taylor, 1984), contrasts with recent observations that some regions of the genome are more prone to rearrangement events (Peng et al., 2006; Pevzner and Tesler, 2003b). In this second model, the fragile breakage model, some breakpoints are recurrently reused, which makes the prediction of unique ancestral scenarios extremely hard. In a recently presented method (Ma et al., 2008), the authors proposed an infinite sites model of chromosome evolution to circumvent the challenges associated with breakpoint reuse. The idea is that if the DNA sequences of the genomes are assumed to have an infinite number of nucleotides, specific rearrangement events will have zero probability of recurrently hitting the exact same nucleotides.
An attractive alternative could also be to partition the genome into weak and strong positions to reﬂect the different levels of instability in the genome. As a result, the candidate operations would not all be equivalent to each other but some would be more likely than others.

ACKNOWLEDGMENT

The authors are supported by funding from the Biomedical Research Council (BMRC) of the Agency for Science, Technology, and Research (A*STAR) in Singapore.

REFERENCES

ARTIF, R., SWAKKHAR, S., and MASUD, H., 2008. An approximation algorithm for sorting by reversals and transpositions. J. Discrete Algorithms 6: 449–457.
BADER, D. and MORET, B., 2001. A linear-time algorithm for computing inversion distance between signed permutations with an experimental study. Proceedings of the Seventh International Workshop on Algorithms and Data Structures (WADS 2001), pp. 365–376.
BAFNA, V. and PEVZNER, P., 1995. Sorting by reversals: genome rearrangements in plant organelles and evolutionary history of X chromosome. Mol. Biol. Evol. 12: 239–246.
BAFNA, V. and PEVZNER, P.A., 1998. Sorting by transpositions. SIAM J. Discrete Math. 11: 224–240.
BERARD, S., BERGERON, A., CHAUVE, C., and PAUL, C., 2007. Perfect sorting by reversals is not always difficult. IEEE/ACM Trans. Comput. Biol. Bioinform. 4: 4–16.
BERGERON, A., 2005. A very elementary presentation of the Hannenhalli–Pevzner theory. Discrete Appl. Math. 134–145.
BERGERON, A. and STOYE, J., 2003. On the similarity of sets of permutations and its application to genome comparison. COCOON 2003, pp. 68–79.
BERGERON, A., BLANCHETTE, M., CHATEAU, A., and CHAUVE, C., 2004a. Reconstructing ancestral gene orders using conserved intervals. WABI 2004, pp. 14–25.
BERGERON, A., MIXTACKI, J., and STOYE, J., 2004b. Reversal distance without hurdles and fortresses. CPM 2004, pp. 388–399.
BERGERON, A., MIXTACKI, J., and STOYE, J., 2006. A unifying view of genome rearrangements. WABI 2006, 4175, pp. 163–173.
BERGERON, A., MIXTACKI, J., and STOYE, J., 2008. HP distance via double cut and join distance. CPM 2008, pp. 56–68.
BLANCHETTE, M., KUNISAWA, T., and SANKOFF, D., 1999. Gene order breakpoint evidence in animal mitochondrial phylogeny. J. Mol. Evol. 49: 193–203.
BOURQUE, G. and PEVZNER, P.A., 2002. Genome-scale evolution: reconstructing gene orders in the ancestral species. Genome Res. 12: 26–36.
BOURQUE, G., ZDOBNOV, E.M., BORK, P., PEVZNER, P.A., and TESLER, G., 2005. Comparative architectures of mammalian and chicken genomes reveal highly variable rates of genomic rearrangements across different lineages. Genome Res. 15: 98–110.
BOURQUE, G., TESLER, G., and PEVZNER, P.A., 2006. The convergence of cytogenetics and rearrangement-based models for ancestral genome reconstruction. Genome Res. 16: 311–313.
CAPRARA, A., 1999. Formulations and complexity of multiple sorting by reversals. Proceedings of the Third Annual International Conference on Computational Molecular Biology (RECOMB99).
CHAIN, P.S., HU, P., MALFATTI, S.A., RADNEDGE, L., LARIMER, F., VERGEZ, L.M., WORSHAM, P., CHU, M.C., and ANDERSEN, G.L., 2006. Complete genome sequence of Yersinia pestis strains Antiqua and Nepal516: evidence of gene reduction in an emerging pathogen. J. Bacteriol. 188: 4453–4463.
CHRISTIE, P.A., 1996. Sorting permutations by block-interchanges. Inf. Process. Lett. 60: 165–169.
COSNER, M., JANSEN, R., MORET, B., RAUBESON, L., WANG, L., WARNOW, T., and WYMAN, S., 2000. A new fast heuristic for computing the breakpoint phylogeny and experimental phylogenetic analyses of real and synthetic data. Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB 2000), pp. 104–115.
DARLING, A.C., MAU, B., BLATTNER, F.R., and PERNA, N.T., 2004. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 14: 1394–1403.
DARLING, A.E., MIKLOS, I., and RAGAN, M.A., 2008. Dynamics of genome rearrangement in bacterial populations. PLoS Genet. 4: e1000128.
EL-MABROUK, N., 2005. Genome rearrangement with gene families. Mathematics of Evolution and Phylogeny. Oxford University Press.
EPPINGER, M., ROSOVITZ, M.J., FRICKE, W.F., RASKO, D.A., KOKORINA, G., FAYOLLE, C., LINDLER, L.E., CARNIEL, E., and RAVEL, J., 2007. The complete genome sequence of Yersinia pseudotuberculosis IP31758, the causative agent of Far East scarlet-like fever. PLoS Genet. 3: e142.
FITCH, W.M., 1971. Toward defining the course of evolution: minimum change for a specific tree topology. Syst. Zool. 406–416.
FROENICKE, L., CALDES, M.G., GRAPHODATSKY, A., MULLER, S., LYONS, L.A., ROBINSON, T.J., VOLLETH, M., YANG, F., and WIENBERG, J., 2006. Are molecular cytogenetics and bioinformatics suggesting diverging models of ancestral mammalian genomes? Genome Res. 16: 306–310.
GIBBS, R.A., WEINSTOCK, G.M., METZKER, M.L., MUZNY, D.M., SODERGREN, E.J., SCHERER, S., SCOTT, G., STEFFEN, D., WORLEY, K.C., BURCH, P.E., et al., 2004. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428: 493–521.
GIBBS, R.A., ROGERS, J., KATZE, M.G., BUMGARNER, R., WEINSTOCK, G.M., MARDIS, E.R., REMINGTON, K.A., STRAUSBERG, R.L., VENTER, J.C., WILSON, R.K., et al., 2007. Evolutionary and biomedical insights from the rhesus macaque genome. Science 316: 222–234.
HANNENHALLI, S. and PEVZNER, P., 1995a. Transforming cabbage into turnip (polynomial algorithm for sorting signed permutations by reversals). Proceedings of the 27th Annual ACM-SIAM Symposium on the Theory of Computing (STOC 1995), pp. 178–189.
HANNENHALLI, S. and PEVZNER, P., 1995b. Transforming men into mice: polynomial algorithm for genomic distance problem. Proceedings of the 36th IEEE Symposium on Foundations of Computer Science (FOCS 1995), pp. 581–592.
HANNENHALLI, S. and PEVZNER, P.A., 1999. Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversals. J. ACM 46: 1–27.
HANNENHALLI, S., CHAPPEY, C., KOONIN, E.V., and PEVZNER, P.A., 1995. Genome sequence comparison and scenarios for gene rearrangements: a test case. Genomics 30: 299–311.
HEBER, S. and STOYE, J., 2001. Finding all common intervals of k permutations. CPM 2001, pp. 201–218.
HILLIER, L., MILLER, W., and BIRNEY, E., 2004. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432: 695–716.
KANEHISA, M., GOTO, S., KAWASHIMA, S., OKUNO, Y., and HATTORI, M., 2004. The KEGG resource for deciphering the genome. Nucleic Acids Res. 32: D277–D280.
KENT, W.J., BAERTSCH, R., HINRICHS, A., MILLER, W., and HAUSSLER, D., 2003. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc. Natl. Acad. Sci. USA 100: 11484–11489.
LANDER, E.S., LINTON, L.M., BIRREN, B., NUSBAUM, C., ZODY, M.C., BALDWIN, J., DEVON, K., DEWAR, K., DOYLE, M., FITZHUGH, W., et al., 2001. Initial sequencing and analysis of the human genome. Nature 409: 860–921.
LARGET, B., SIMON, D.L., KADANE, J.B., and SWEET, D., 2005. A Bayesian analysis of metazoan mitochondrial genome arrangements. Mol. Biol. Evol. 22: 486–495.
LIN, Y.C., LU, C.L., CHANG, H.Y., and TANG, C.Y., 2005. An efficient algorithm for sorting by block-interchanges and its application to the evolution of vibrio species. J. Comput. Biol. 12: 102–112.
LIN, Y.C., LU, C.L., LIU, Y.C., and TANG, C.Y., 2006. SPRING: a tool for the analysis of genome rearrangement using reversals and block-interchanges. Nucleic Acids Res. 34: W696–W699.
LU, C.L., WANG, T.C., LIN, Y.C., and TANG, C.Y., 2005. ROBIN: a tool for genome rearrangement of block-interchanges. Bioinformatics 21: 2780–2782.
MA, B., TROMP, J., and LI, M., 2002. PatternHunter: faster and more sensitive homology search. Bioinformatics 18: 440–445.
MA, J., ZHANG, L., SUH, B.B., RANEY, B.J., BURHANS, R.C., KENT, W.J., BLANCHETTE, M., HAUSSLER, D., and MILLER, W., 2006. Reconstructing contiguous regions of an ancestral genome. Genome Res. 16: 1557–1565.
MA, J., RATAN, A., RANEY, B.J., SUH, B.B., MILLER, W., and HAUSSLER, D., 2008. The infinite sites model of genome evolution. Proc. Natl. Acad. Sci. USA 105: 14254–14261.
MATTHIAS, B., MERKLE, D., and MIDDENDORF, M., 2006. Genome rearrangement based on reversals that preserve conserved intervals. IEEE/ACM Trans. Comput. Biol. Bioinform. 3: 275–288.
MATTHIAS, B., MERKLE, D., and MIDDENDORF, M., 2008. Solving the preserving reversal median problem. IEEE/ACM Trans. Comput. Biol. Bioinform. 5: 332–347.
MORET, B., TANG, J., WANG, L., and WARNOW, T., 2002. Steps toward accurate reconstruction of phylogenies from gene-order data. J. Comput. Syst. Sci. 65: 508–525.
MORGAN, T. and BRIDGES, C., 1916. Sex-Linked Inheritance in Drosophila. Carnegie Institution of Washington, pp. 1–88.
MURPHY, W.J., LARKIN, D.M., EVERTS-VAN DER WIND, A., BOURQUE, G., TESLER, G., AUVIL, L., BEEVER, J.E., CHOWDHARY, B.P., GALIBERT, F., GATZKE, L., et al., 2005. Dynamics of mammalian chromosome evolution inferred from multispecies comparative maps. Science 309: 613–617.
NADEAU, J.H. and TAYLOR, B.A., 1984. Lengths of chromosomal segments conserved since divergence of man and mouse. Proc. Natl. Acad. Sci. USA 81: 814–818.

181

O’BRIEN, S., MENOTTI-RAYMOND, M., MURPHY, W., NASH, W., WIENBERG, J., STANYON, R., COPELAND, N., JENKINS, N., WOMACK, J., and GRAVES, J., 1999. The promise of comparative genomics in mammals. Science 286: 458–462. PE’ER, I. and SHAMIR, R., 1998. The median problems for breakpoints are NP-complete. Electronic Colloquium on Computational Complexity. PENG, Q., PEVZNER, P.A., and TESLER, G., 2006. The fragile breakage versus random breakage models of chromosome evolution PLoS Comput. Biol. 2: e14. PEVZNER, P. and TESLER, G., 2003a. Genome rearrangements in mammalian evolution: lessons from human and mouse genomes. Genome Res. 13: 37–45. PEVZNER, P. and TESLER, G., 2003b. Human and mouse genomic sequences reveal extensive breakpoint reuse in mammalian evolution. Proc. Natl. Acad. Sci. USA 100: 7672–7677. PEVZNER, P. and TESLER, G., 2003c. Transforming men into mice: the Nadeau-Taylor chromosomal breakage model revisited. RECOMB 2003, pp. 247–256. ROCCHI, M., ARCHIDIACONO, N., and STANYON, R., 2006. Ancestral genomes reconstruction: an integrated, multidisciplinary approach is needed. Genome Res. 16: 1441–1444. SANKOFF, D., 1999. Genome rearrangement with gene families. Bioinformatics 15: 909–917. SANKOFF, D. and BLANCHETTE., M., 1997. The median problem for breakpoints in comparative genomics. Proceedings of COCOON’97, pp. 251–263. SCHWARTZ, S., KENT, W., SMIT, A., ZHANG, Z., BAERTSCH, R., HARDISON, R.C., HAUSSLER, D., and MILLER, W., 2003. Human–mouse alignments with BLASTZ. Genome Res. 13: 103–107. SIEPEL, A. and MORET, B., 2001. Finding an optimal inversion median: experimental results. Algorithms in Bioinformatics: First International Workshop (WABI2001), pp. 189–203. SWIDAN, F., ROCHA, E.P., SHMOISH, M., and PINTER, R.Y., 2006. An integrative method for accurate comparative genome mapping. PLoS Comput. Biol. 2: e75. TANG, J. and MORET, B., 2003. Phylogenetic reconstruction from gene-rearrangement data with unequal gene content. WADS 2003, pp. 37–46. 
TESLER, G., 2002a. Efﬁcient algorithms for multichromosomal genome rearrangements. J. Comput. Syst. Sci. 65: 23. TESLER, G., 2002b. GRIMM: genome rearrangements web server. Bioinformatics 18: 492–493. THOMSON, N.R., HOWARD, S., WREN, B.W., HOLDEN, M.T., CROSSMAN, L., CHALLIS, G.L., CHURCHER, C., MUNGALL, K., BROOKS, K., CHILLINGWORTH, T., et al., 2006. The complete genome sequence and comparative genome analysis of the high pathogenicity Yersinia enterocolitica strain 8081. PLoS Genet. 2: e206. TZVIKA, H. and RON, S., 2006. A simpler and faster 1.5approximation algorithm for sorting by transpositions. Inf. Comput. 204: 275–290.

182

Chapter 9

Chromosomal Rearrangements in Evolution

UNO, T. and YAGIURA, M., 2000. Fast algorithms to enumerate all common intervals of two permutations. Algorithmica 26: 290–309. VENTER, J.C., ADAMS, M.D., MYERS, E.W., LI, P.W., MURAL, R.J., SUTTON, G.G., SMITH, H.O., YANDELL, M., EVANS, C.A., HOLT, R.A., et al., 2001. The sequence of the human genome. Science 291: 1304–1351. WANG, L. and WARNOW, T., 2001. Estimating true evolutionary distances between genomes. STOC 2001, pp. 637–646. WATERSTON, R.H., LINDBLAD-TOH, K., BIRNEY, E., ROGERS, J., ABRIL, J.F., AGARWAL, P., AGARWALA, R., AINSCOUGH, R., ALEXANDERSSON, M., AN, P., et al., 2002. Initial sequencing

and comparative analysis of the mouse genome. Nature 420: 520–562. YANCOPOULOS, S., ATTIE, O., and FRIEDBERG, R., 2005. Efﬁcient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics 21: 3340–3346. ZHAO, H. and BOURQUE, G., 2007. Recovering true rearrangement events on phylogenetic trees. In Comparative Genomics, RECOMB 2007 International Workshop, RECOMBCG 2007 (eds G., Tesler and D., Durand). Springer, San Diego, CA, pp. 149–161. ZHAO, H. and BOURQUE, G., 2009. Recovering genome rearrangements in the mammalian phylogeny. Genome Res., 19: 934–942.

Chapter 10

Molecular Structure and Evolution of Genomes

Todd A. Castoe, A. P. Jason de Koning, and David D. Pollock

10.1 INTRODUCTION
10.2 OVERVIEW OF CONSIDERATIONS IN STUDYING PROTEIN EVOLUTION
10.3 FUNCTION AND EVOLUTIONARY GENOMICS
10.4 INTEGRATING INFERENCES TO DETECT AND INTERPRET ADAPTATION: AN EXAMPLE WITH SNAKE METABOLIC PROTEINS
10.5 CONCLUSION
REFERENCES

Evolutionary Genomics and Systems Biology, edited by Gustavo Caetano-Anolles. Copyright 2010 John Wiley & Sons, Inc.

10.1 INTRODUCTION

The field of evolutionary genomics is extremely young, with multiple complete (or nearly complete) eukaryotic genomes having been sequenced only recently. It is an exciting but challenging time for evolutionary genomics. Some of the biggest hurdles have been mechanical, as methods that worked well on a few genes from 10 to 20 species have had to be redesigned to rapidly handle tens of thousands of genes. The fundamentally interesting challenges, though, have been to understand the mechanistic factors that have shaped the evolution of genomes and the genes they contain. This requires an integrated understanding of mutation, population genetics, and the functional, structural, and thermodynamic bases of selection. Ultimately, we would like to know how functional molecules fold and interact, how structure is dynamically altered, and how novelty is created, and to reconstruct how these factors have effected evolutionary change. In short, we seek to understand the interface of sequence and structure within the genomic context.

As more genomic resources accumulate, the fields of genomics and molecular biology are rapidly morphing into the new fields of evolutionary (comparative) genomics and systems biology. This transformation largely represents a more realistic, comparative, and holistic reassessment of previous research aimed at appreciating the true complexities of the evolutionary process and the biological dynamics and interrelationships in nature. Molecular evolutionary biology, in contrast, has largely maintained a reductionist perspective that focuses on analysis of how single molecules change through time. The new fields are primed to deliver novel understanding of how sequence evolution shapes the evolutionary complexity of biological diversity.

Before going further, it is important to note that the following is a highly personal account of one laboratory’s perspective on the topic of deciphering molecular structure and the evolution of genomes. It is not intended to be a review. This serves as an apology in advance to the authors of the many fundamentally interesting and relevant papers that will not be covered here, and also to the thoroughly relevant topics that could have been discussed, but aren’t. RNA, for example, is an extremely important functional molecule, but we will focus more on proteins because we have done more work on them.

10.2 OVERVIEW OF CONSIDERATIONS IN STUDYING PROTEIN EVOLUTION

There are some general points about the interface of sequence and structure that need to be clarified. Probably the most important of these is to emphasize how little we know about the mechanistic details of how proteins fold, undergo dynamic movements, and function. This is not to complain about the rate of progress in the field of structural biology, but simply to note that our ability to predict structures and the structural and functional effects of amino acid replacements in a protein, or even worse, a series of replacements, is limited. If the crystal structure of a protein is available, we usually must assume that the structure is about the same in proteins from the various species in a data set we are using. Predictions of things such as interactions between amino acid replacements must thus be tempered by a healthy skepticism as to their accuracy. In statistical language, we must put a fairly large prior on the possibility that precise inferences from structural data are wrong. We also must rely heavily on information that is more likely to remain accurate over evolutionary time, such as the approximate distance between residues, the orientation of the vector from the Cα to the Cβ atom, and distance from the surface or from active sites.

Another related point is that we are even further from understanding the complete thermodynamic explanations for why proteins evolve the way they do. Although we do not have space here to go into work on recreating protein-like complex systems in a thermodynamic setting (Goldstein and Pollock, 2006; Williams et al., 2001; Williams et al., 2006; Xu et al., 2005), such work is important, and in the current climate it needs to be justified, because it is not always seen as relevant given that the models used do not reflect real proteins (see previous paragraph).
Oftentimes, for computational reasons, simulations are grossly simplified far below even the limits of our best knowledge of structural biophysics. The rationale for this is that evolving complex systems in a thermodynamic setting can often produce dramatically counterintuitive results. While such results do not constitute “proof” of anything, they do provide clues about what to look for when observing the products of protein evolution and how to interpret them, and they also provide null expectations. Important examples include the ideas that proteins may evolve to lessen the deleterious effects of mutations, that different protein structures may have different degrees of “designability” and thus different freedom to vary and still form the same structure, and that proteins at evolutionary equilibrium will tend to be only marginally stable (Taverna and Goldstein, 2002a, 2002b; Williams et al., 2006a). This last example means at a minimum that there is no inherent requirement to invoke stabilizing selection to explain the marginal stability of most proteins, and it increases the burden of proof to demonstrate that stabilizing selection actually exists in some cases.


A particularly interesting (but often ignored) component of complex systems is ploidy, since it strongly alters the effect of mutations on fitness. By providing redundancy by default in the genetic system, it strongly interacts with redundancy and divergence in terms of duplicate gene copies. Another key question is how the form of the relationships between binding, expression, and fitness affects the outcome, and whether analysis of the outcome provides enough data to discriminate among alternative forms of these relationships. For example, our default will be to assume that expression levels are proportional to percent binding, that expression levels are additive across binding sites (unless they are overlapping or interacting), and that fitness levels due to the function of an expressed gene linearly increase with expression level up to a maximum, where they are constant, but that this is balanced by a constant energy/fitness cost for each protein synthesized. A useful feature of inference in evolutionary genetics, then, is that it can be pursued independently of knowledge of structure and function, and that is the route that has mostly been pursued in the past. Another way of putting this is that evolutionary genetics inference can be made based on models of the process, rather than using mechanistic or fitness-based models. We hope that this will change, however, not only because mechanistic models are bound to be more realistic and therefore presumably more effective for making inference, but also because inclusion of mechanistic models will allow an independent (from basic biophysics) method of understanding mechanisms. With even a passing knowledge of thermodynamics, for example, it is difficult to believe that individual residue positions evolve independently from one another. Much of this chapter will focus on how this integration can be achieved. If we are to achieve such integration, however, there are still a few more introductory points to consider.
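The default assumptions about binding, expression, and fitness stated earlier in this section can be written down as a small sketch. The function names and all numerical values (the per-site gain, the saturation point, and the cost per unit of protein) are hypothetical placeholders chosen for illustration, and overlapping or interacting binding sites are not handled.

```python
def expression_level(fraction_bound, per_site_gain=1.0):
    """Expression proportional to fractional binding, additive across sites.

    fraction_bound: iterable of occupancies in [0, 1], one per binding site.
    per_site_gain is an assumed proportionality constant.
    """
    return per_site_gain * sum(fraction_bound)


def fitness(expression, benefit_slope=1.0, saturation=2.0, cost_per_unit=0.1):
    """Benefit rises linearly with expression up to a plateau, and a constant
    cost is charged for every unit of protein synthesized (assumed values)."""
    benefit = benefit_slope * min(expression, saturation)
    return benefit - cost_per_unit * expression


if __name__ == "__main__":
    # Three hypothetical regulatory regions with increasing total occupancy.
    for occupancy in ([0.2, 0.3], [0.9, 0.9], [1.0, 1.0, 1.0]):
        e = expression_level(occupancy)
        print(occupancy, round(e, 3), round(fitness(e), 3))
```

Under these assumptions fitness peaks at the saturation point: beyond it, further expression adds synthesis cost without adding benefit, which is the balance described in the text.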
When studying proteins and protein structure, it is easy to forget sometimes that all proteins are coded by DNA before they are transcribed to RNA and translated to protein. Thus, a complete approach should consider mutation processes in the DNA, and translation processes as well as possibly selectable structural and functional properties of DNA and RNA. All of these are affected by the local genomic context, and thus the study of protein evolution may ultimately be inseparable from the study of evolutionary genomics. In vertebrate mitochondria, for example, the mutation process changes over the entire genome (Faith and Pollock, 2003; Krishnan et al., 2004a, 2004b), while in mammalian (and probably more divergent animal) mitochondrial genomes, there are strong differences in mutation process along the genome that bias the equilibrium G + C concentration (Gu and Zhang, 1997).

Another point that is often underappreciated is the need for large amounts of taxon sampling to obtain accurate site-specific models. This is important for understanding coevolutionary interactions and other forms of context-dependent evolution. At the genomic level, this is important even for predicting such basic features as whether a region is under functional constraint. Such reasoning forms the basic rationale for sequencing more (at least 26) mammal genomes more thoroughly simply to predict the existence of transcription factor binding sites (Amemiya et al., 2003). For understanding the evolutionary properties of functional molecules (proteins as well as transcription factor binding sites), the benefits continually increase as more and more taxa are sampled, particularly if they are sampled to break up long branches (rather than adding more and more deeply branching taxa). For that reason, we have been using the mitochondrial genome as a sort of pilot for predicting the benefits of expanded sequencing of complete genomes.
The number of vertebrate mitochondrial genomes has increased from 67 in 2000 (Pollock et al., 2000) to over 1000 today.

Finally, it is key to consider the role of selection and adaptation, and how they might interact with protein structure and function. Our biggest evolutionary predictor of such things is the observation of changes in nucleotide substitution (or amino acid replacement) rates, but we should be careful not to employ circular reasoning in our inferences: conservation predicts function but is not the same thing, and changes in conservation predict changes in function (functional divergence), but functional divergence is a generally historical causative inference and should not be defined as changes in evolutionary rate. Adaptation is also a concept that is difficult to pin down; here, we will use the term in the sense of evolving to improve the optimization of a trait with regard to its average functional role in maintaining the relative fitness of an organism in its usual environment. An adaptive event will result in long-term alteration of the physical characteristics that describe an interaction and will usually involve multiple amino acid replacements. “Convergence” is another slippery term, and it is necessary to distinguish between random convergence due to neutral processes and convergence due to natural selection or adaptation. We summarize below some recent work on vertebrate mitochondria, particularly snake mitochondria, to illustrate how these processes, as well as coevolution between residues, can be detected. “Innovation” is an adaptive event that creates a new functional role for a protein or regulatory element and will often involve gene duplication and/or a change in relative evolutionary rates of amino acid residues or nucleotides. “Function” is itself a somewhat ill-defined concept, but much of it can be defined as the degree to which two molecules or regions of molecules bind together (including binding as a step in catalysis). Note that the relationship between the degree of binding and its average effect on organismal fitness is not necessarily direct and may be described by a variety of parameterized distributions. The “fitness” of a genetic element (e.g., a protein or DNA segment, haplotype, or genotype) will refer to the expected relative fitness of an organism bearing the element, averaged over all genotypes and environmental variables not being directly considered.
Finally, the “speciation” question will be addressed by limiting our interest in it to the narrow question of how well regulatory interactions are maintained when their constituent elements (proteins and DNA segments) are brought back together after having diverged via multiple substitutions during independent evolution in separate species or subpopulations.

10.3 FUNCTION AND EVOLUTIONARY GENOMICS

10.3.1 Deciphering Complexities of Protein Evolution

As proteins accumulate change and diverge over time, they must continue to satisfy the structural and energetic constraints that enable them to function properly. Because of this, the diversity of protein sequences available from living organisms represents a wealth of data on the relationships between protein sequence, structure, and function. Extracting insights from these data, however, remains a challenge. In principle, the optimal approach to decoding functional information in protein sequence biodiversity is to use parametric statistical inference with realistic phylogeny-based models. Historically, this approach has been limited by the amount of sequence data available and by the difficulty and prohibitive computational complexity of the probability calculations needed. The evolution of proteins is complex primarily because it is directed by a large number of underlying stochastic processes that are bounded by structural and functional constraints (Bloom et al., 2006; Drummond et al., 2006; Golding and Dean, 1998; Julenius and Pedersen, 2006; Lemos et al., 2005; Lopez et al., 2002; Robinson et al., 2003; Rodrigue et al., 2005, 2006; Thorne, 2007). Structural and functional features of proteins, and how they interact with the evolutionary process, are not easily or accurately predictable based on first principles (Hayashi et al., 2006; Wood and Pearson, 1999; Xu et al., 2005). Nevertheless, the potential rewards of incorporating these factors into an accurate yet feasible framework for modeling protein evolution are truly immense. An understanding of protein functional evolution is essential for identifying mutations of functional significance that may lead to disease, using multiple sequences in structure and function prediction, reconstructing ancestral sequence and function (Williams et al., 2006a), identifying sites involved in protein–protein interactions, and identifying changes in function (see Philippe et al. (2003) and Wang and Pollock (2005) for perspectives). Identifying and understanding the principles that dictate evolutionary diversification would also help to improve strategies for protein design (Glasner et al., 2007; Tobin et al., 2000).

The principal difficulty in modeling protein evolution is that it is highly context dependent, meaning that the probability of amino acid residue replacement during evolution is expected to vary across positions and over time (Andreeva and Murzin, 2006; Buchner, 1999; Koshi and Goldstein, 1995; Kozak, 1999; Midelfort and Wittrup, 2006; Pollock and Goldstein, 2002; Templeton et al., 2004; Xu et al., 2005). Particular positions in a protein will have different structural and functional environments, and thus different evolutionary constraints, and will therefore have distinctive patterns of substitution at the corresponding codons. Context-dependent changes over time may occur when the structural or functional environment around that position changes via replacement of interacting residues, or in general when any intrinsic or extrinsic factors (e.g., protein function, physiological role, or expression pattern) change, altering selective constraints. When the context changes, the process of evolutionary change at each position may also change (Figure 10.1).
When speaking of evolutionary interactions among protein residue positions, this is sometimes called molecular coevolution, and positions that interact are said to coevolve (Pollock, 2002; Pollock and Taylor, 1997; Pollock et al., 1999; Wang and Pollock, 2005, 2007, 2009). Despite its critical importance, traditional approaches to phylogenetic analysis of protein evolution have not substantially taken context dependence into account. This is mostly for computational and conceptual reasons, but also because of data limitations. It is this context dependence, however, that is central to proper understanding of evolutionary genomics and its relation to structure and function. If the replacement of an amino acid at one residue position alters the functional effect (e.g., pathogenicity) of mutations at an adjacent position, the first replacement has changed the evolutionary context at the adjacent position. If another replacement leads to loss of binding affinity for a ligand, such functional divergence may alter the functional context of many residue positions, and the resultant alteration of evolutionary flexibility should be detectable in the descendants. Subtle changes in flexibility, packing of side chains, or distribution of charge on the surface may yield correspondingly subtle, yet functionally significant changes in contextual effects on the evolutionary process. Our ability to predict and interpret the details of evolutionary shifts caused by changes in structure and function requires that we develop and observe the outcomes of models and data sets that are capable of accurately reflecting the details of these context-dependent effects. The unrealistically simple models used in the past cannot detect biochemical realities that they are not designed to reflect. They thus fail to reveal subtle yet critical details of the true process and reduce the accuracy of model-dependent inferences of functional innovation or ancestral reconstruction of sequence and function (Williams et al., 2006).

Figure 10.1 Functional or structural context changes. Replacement of amino acid A by amino acid B may lead to changes in selective pressure that alter substitution processes at different sites depending on the constraints at those sites. (See insert for color representation of this figure.)

10.3.2 The Future of Modeling Protein Evolution: Merging Realism with Tractability

One of the main constraints hampering the development and analysis of more realistically complex models has been computational limitations, although progress by our group and others has dramatically decreased these limitations (de Koning et al., 2009; Hwang and Green, 2004; Krishnan et al., 2004c; Nielsen, 2002; Rodrigue et al., 2006). Key innovations that enable more computation are mostly due to truncations or simplifications of the calculations made in computing probabilistic models. Essentially, these approaches function by sacrificing a small degree of computational accuracy for massive gains in computational efficiency. Given the computational ability to efficiently incorporate complex models of the evolutionary process in a tractable framework, we can consider many of the potential complexities of both the underlying mutational processes and the complex patterns of protein evolution that may reveal changes in function representing adaptation.

Our group and others have made substantial progress developing methods that are capable of efficiently evaluating complex evolutionary models. We developed a fast likelihood-based “conditional pathway” approach that scales extremely well with the number and complexity of context-dependent models used in phylogeny-based analyses. This approach removes many computational barriers that have previously limited the types of model-building experiments that were feasible. The conditional pathway approach allows exploration of fundamentally novel levels of model complexity and thus provides new potential to reconstruct and understand sequence, structural, and functional changes that have occurred through evolution (de Koning et al., 2009). The goal of modeling protein evolution is not to develop complex substitution models for their own sake, but to develop models that reflect the true complexity of protein evolution.
This will lead to novel insight into how features of protein structure and function affect substitution processes and will enable novel and fundamental insight into the relationship between sequence, structure, and function. A central need is to develop flexible context-dependent models to assess how substitution processes can differ in different parts of proteins and change through time. Such models should accommodate uncertain knowledge of which features are important. By allowing this flexibility in the substitution models, we will in the future be able to accurately infer the biologically relevant features that determine the evolutionary process. Previously, assessing context-dependent processes in proteins would have been confounded by the incorrect assumption that the substitution process does not change over time, or was the same across a large number of protein residues. The accurate assessment of protein evolutionary processes is essential in determining the surrounding structural and functional features that constrain how amino acid residues evolve.

Although post hoc correlation of substitution states to structure and function is an important means of understanding the causal basis of differences among sites and changes over time, it is ultimately best to integrate structural information directly into the models. Some basic structural information is essentially static or requires little additional computation, and can be incorporated into models directly; examples are the amino acid composition in residues adjacent to a site, the secondary structure, the accessible surface, and the local side chain or charge densities. Incorporation of each particular category of information should be well justified based upon the amount of data available and its demonstrable effect on substitution processes. This avoids superfluous model complexity yet ensures inclusion of all potentially contributing information. The impact of structural information on key functional information, such as proximity to ligand binding sites, proton channels, and other functional features, should also be considered, and both continuous functions and discrete effects should be considered. Other structural information (e.g., from molecular modeling) can require computationally expensive energetic calculations. It is essential to incorporate such information carefully to control the computational burden and maintain the tractability of calculations.
Levels of structural integration should incorporate progressively more and more detailed energy potentials, beginning with simple pseudoenergy contact potentials, moving up to more detailed and realistic physical models including all-atom rotamer-based potentials and pseudowater potentials, and later incorporating ﬂexible backbones and simulated water molecules. It is also important to move toward free energy approximations by considering distributions of alternative (competing) structures and local folding and unfolding processes. As the calculations become more complex, it will be feasible to make energy calculations on only a small proportion of the possible ancestral replacements.
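To give a sense of the probability calculations that the truncations and simplifications mentioned at the start of this section are measured against, the following minimal sketch computes the exact likelihood of one alignment column on a fixed tree using Felsenstein's pruning algorithm under the Jukes-Cantor model. This is the standard textbook calculation, not the conditional pathway approach of de Koning et al. (2009), whose details are not given here. The nested-tuple tree encoding, the function names, and the branch lengths are assumptions made purely for illustration.

```python
import math

NUCS = "ACGT"


def jc69_prob(i, j, t):
    """Jukes-Cantor probability that state i becomes state j over a branch
    of length t (expected substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay


def partial_likelihoods(node):
    """Return {state: P(tip data below node | node is in that state)}.

    A leaf is a one-letter string; an internal node is a tuple
    (left_child, left_branch_length, right_child, right_branch_length).
    """
    if isinstance(node, str):  # leaf: the observed nucleotide
        return {s: 1.0 if s == node else 0.0 for s in NUCS}
    left, t_left, right, t_right = node
    below_left = partial_likelihoods(left)
    below_right = partial_likelihoods(right)
    result = {}
    for s in NUCS:
        sum_left = sum(jc69_prob(s, x, t_left) * below_left[x] for x in NUCS)
        sum_right = sum(jc69_prob(s, x, t_right) * below_right[x] for x in NUCS)
        result[s] = sum_left * sum_right
    return result


def site_likelihood(tree):
    """Likelihood of one alignment column with uniform root frequencies."""
    root = partial_likelihoods(tree)
    return sum(0.25 * root[s] for s in NUCS)


if __name__ == "__main__":
    # Hypothetical four-taxon tree with tip states A, A, G, A.
    tree = (("A", 0.1, "A", 0.1), 0.05, ("G", 0.2, "A", 0.2), 0.05)
    print(site_likelihood(tree))
```

Even in this simplest case, every internal node requires a sum over all states for each child. Replacing the four nucleotides with context-dependent states that span interacting positions makes that state space grow rapidly, which is one way to see why approximations become attractive.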

10.3.3 The Effect of Increasing Taxon Sampling and Sequence Biodiversity

Inferences about the evolutionary process and how it has changed through time are highly dependent on taxon sampling (Abbott et al., 2005). Thus, dense evolutionary sampling of genomic biodiversity is necessary for the powerful detection of subtle changes in protein evolution. Probabilistic methods of evolutionary inference rely critically on information about patterns of change at each site in order to accurately portray and estimate the evolutionary process. It is also vital to redesign probabilistic evolutionary models to incorporate further biological realism, both to avoid systematic error and to detect subtle departures from regimes of selection. Particularly in analyses of a single protein through evolutionary time, the only way to increase the information used to estimate evolutionary models is through dense taxon sampling to provide many examples of site patterns to inform the probabilistic models.

Taxonomic sampling is particularly important for studies of site-specific evolution and coevolution in proteins. Such studies have had moderate success, but it is clear that studies of this kind are currently limited by the need to include dense taxonomic sampling so that there are multiple amino acid substitutions at each site over the entire tree, but not too many multiple substitutions along individual branches. Twenty to one hundred diverse taxa appear to be a minimum for successful analysis, and more would improve the power considerably. When considering coevolution between genes, both data sets must include the same specific taxa. Furthermore, the amount of evolution between any two points on the tree should not be so large that there have been changes in the structural context of individual sites. For most proteins, the only means of obtaining a large data set is to include widely divergent bacteria and eukaryotes, but even conserved proteins will have undergone considerable change since these taxa diverged.

Another reason for good taxon sampling is to obtain accurate phylogenies for the study of gene duplication and evolutionary innovation. Estimating the phylogenetic history of multigene families to understand the process of gene duplication is uniquely difficult because only the gene of interest (i.e., the duplicated gene) and related sequences are available to make this inference. This factor is often underappreciated, but poses a severe limitation because the phylogenetic signal available for making such inferences of phylogeny is limited to a single locus. Thus, particular attention and care are required in interpreting and inferring multigene family phylogenies, and major sources of inference error must be kept in mind and evaluated. Taxon sampling helps both by increasing certainty in the placement of duplications and by allowing better models that in turn reconstruct phylogeny more accurately (Pollock et al., 2002).
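The requirement above, many substitutions over the whole tree but few multiple substitutions on any single branch, can be illustrated with a short calculation. The sketch below assumes, purely for illustration, that the number of substitutions at a site along a branch is Poisson distributed with mean equal to the branch length. The function names are hypothetical, and real site-specific processes will depart from this simple assumption.

```python
import math


def prob_multiple_hits(branch_length):
    """P(two or more substitutions at a site on one branch), assuming a
    Poisson number of substitutions with mean equal to branch_length."""
    return 1.0 - math.exp(-branch_length) * (1.0 + branch_length)


def prob_multiple_hits_after_split(branch_length, pieces):
    """Same probability for each of `pieces` equal segments, as if added
    taxa had broken the branch up evenly (an idealized assumption)."""
    return prob_multiple_hits(branch_length / pieces)


if __name__ == "__main__":
    print("length  unsplit  split_in_two")
    for t in (0.1, 0.5, 1.0, 2.0):
        print(t,
              round(prob_multiple_hits(t), 4),
              round(prob_multiple_hits_after_split(t, 2), 4))
```

Under this assumption the total expected number of substitutions is unchanged when a long branch is broken up, but each segment is much less likely to hide superimposed changes, which is the sense in which sampling to break long branches adds information.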

10.3.4 Removing the Mutational Noise and Context-Dependent Biases from Protein Evolution

To infer changes in the evolutionary process in protein-coding genes that may represent adaptation or functional change, it is critical that analyses of how proteins evolve begin by appreciating, and removing, the effects and dynamics of the mutational processes at the nucleotide level. Equally important, extremely large data sets are subject to extremely large systematic error when probabilistic evolutionary model assumptions are violated, and such is the case if underlying mutational biases or context dependencies are ignored. In addition to mutational effects at a local or small scale, larger scale context effects, such as genomic contexts, may also require appreciation in the modeling process, although little is known regarding the ways in which genome architecture might affect the various aspects of genome function and evolution (including replication, transcription, and function of proteins and RNAs). Nevertheless, patterns linking mitochondrial genome structure, function, and nucleotide evolution have begun to emerge (Krishnan et al., 2004b, 2004c; Raina et al., 2005). Thus, at the core of any good model of protein evolution is a good model of how the DNA alone would evolve if it were not involved in determining protein structure and function. To meet this goal, the DNA models underlying the amino acid replacement process must be accurate and realistic to avoid confounding estimates of DNA evolution with influences from amino acid evolution.

At a fine spatial scale, the local nucleotide environment or context may affect nucleotide evolutionary dynamics. In this case, the nucleotide content at adjacent sites may have a notable context-dependent effect on the probability of nucleotide substitution.
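A minimal sketch of what such neighbor dependence means in practice is given below: the relative rate of a substitution is looked up from the base being replaced and from its immediate neighbors, with transitions at CpG dinucleotides given an elevated rate. This is a deliberately simplified stand-in rather than a fully context-dependent dinucleotide model, and the function name and all parameter values (`kappa`, `cpg_boost`) are hypothetical placeholders, not estimates.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def substitution_rate(seq, pos, new_base,
                      base_rate=1.0, kappa=2.0, cpg_boost=10.0):
    """Relative rate of seq[pos] -> new_base given the adjacent bases.

    kappa scales all transitions; cpg_boost further scales transitions
    that hit the C or the G of a CpG dinucleotide (assumed values).
    """
    old = seq[pos]
    if old == new_base:
        return 0.0
    rate = base_rate
    if (old, new_base) in TRANSITIONS:
        rate *= kappa
        # C followed by G, or G preceded by C, lies within a CpG.
        in_cpg = ((old == "C" and seq[pos + 1:pos + 2] == "G")
                  or (old == "G" and pos > 0 and seq[pos - 1] == "C"))
        if in_cpg:
            rate *= cpg_boost
    return rate


if __name__ == "__main__":
    print(substitution_rate("ACGT", 1, "T"))  # C->T inside a CpG
    print(substitution_rate("ACAT", 1, "T"))  # C->T outside a CpG
    print(substitution_rate("ACGT", 1, "A"))  # transversion
```

Because the rate at one position now depends on its neighbors, sites can no longer be treated as evolving independently, which is what makes such models more demanding to fit than site-independent ones.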
We have previously investigated and demonstrated this context-dependent effect by modeling the nucleotide evolutionary process in alignments of SINE elements in the opossum (Monodelphis domestica) genome (Gu et al., 2007). Based on analysis using a symmetric fully context-dependent dinucleotide model, it is clear that adjacent nucleotide

content can have an important effect on adjacent site substitution rates and that accounting for such context-dependent effects represents an important feature of modeling the underlying nucleotide evolutionary process. On a broader spatial scale, the genomic location of a locus can also have a notable effect on nucleotide evolution. Molecular evolutionary analysis showed that the substitution process differed among SINE1 elements with different adjacent GC content in the opossum genome (Gu et al., 2007). Also, different genomic regions may have different degrees of accelerated nucleotide substitution at CpG dinucleotides. Elevation of substitution rates at CpG dinucleotides is thought to be linked to relative degrees of cytosine methylation, which leads to higher rates of stochastic mutation (especially elevated transition substitutions). Examples of such acceleration can be seen in the SINE elements of the opossum based on the context-dependent dinucleotide model described above; note that transition substitution rates at CpGs are an order of magnitude higher than those of other transitions. Another striking example of how genomic location may lead to different mutational contexts has been demonstrated in vertebrate mitochondrial genomes, primarily affecting transition substitutions (purine↔purine or pyrimidine↔pyrimidine). We previously showed that the mutation process is different at every position, but that differences among sites are fairly predictable (Faith and Pollock, 2003; Krishnan et al., 2004a, 2004b, 2004c), largely based on the asymmetrical replication of the mitochondrial genome. According to the “classic” model of mitochondrial replication, different positions in the mitochondrial genome spend different amounts of time in an asymmetric and mutagenic single-strand state during replication.
This apparently leads to gradients of thymine (T) to cytosine (C) and adenine (A) to guanine (G) substitution caused by thymine to uracil and adenine to hypoxanthine deaminations on the displaced heavy strand (Faith and Pollock, 2003). The response to the mutation gradient differs between these two substitution types, with T→C having a roughly asymptotic response and A→G a strikingly linear response (Faith and Pollock, 2003; Krishnan et al., 2004a, 2004b, 2004c). To account for this, we developed a nucleotide model that allowed evolutionary patterns to vary at each site in the mitochondrial genome and applied this model to fourfold and twofold redundant third codon positions (Krishnan et al., 2004a, 2004b, 2004c). In the case of vertebrate mitochondria, and also in other circular genomes such as plastids and bacterial genomes, different locations in the genome may experience very different background mutational processes due to mutational gradients that result from the process of genome replication. In addition to changes in mutational processes across various spatial contexts, it is important to consider changes to the process through time, and the combination of spatial and temporal dynamics in the evolutionary process. For example, in primate mitochondrial genomes, genome-wide gradients of substitution bias have been shown to evolve rapidly across lineages such that different primate species may have quite different mutational gradients (Krishnan et al., 2004b).
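As a rough illustration of the neighbor-dependent effect discussed in this section, the following Python sketch compares the proportion of C→T changes at CpG and non-CpG cytosines when element copies are compared with a consensus sequence. The function name, the toy sequences, and the single-strand, uncorrected counting scheme are assumptions made purely for illustration; this is not the context-dependent dinucleotide model of Gu et al. (2007).

```python
def cpg_transition_proportions(consensus, copies):
    """Proportion of consensus C sites observed as T in the copies,
    split by whether the C is followed by G (a CpG) in the consensus."""
    counts = {"CpG": [0, 0], "other": [0, 0]}  # [C->T changes, comparable sites]
    for copy in copies:
        if len(copy) != len(consensus):
            raise ValueError("copies must be aligned to the consensus")
        # The final position has no 3' neighbor, so it is skipped.
        for i in range(len(consensus) - 1):
            if consensus[i] != "C" or copy[i] not in "ACGT":
                continue  # only ungapped cytosine sites are scored
            context = "CpG" if consensus[i + 1] == "G" else "other"
            counts[context][1] += 1
            counts[context][0] += copy[i] == "T"
    return {
        ctx: (changed / sites if sites else float("nan"))
        for ctx, (changed, sites) in counts.items()
    }


# Toy alignment (hypothetical data): two copies against one consensus.
consensus = "ACGTCATCGA"
copies = ["ATGTCATTGA", "ACGTCATCGA"]
proportions = cpg_transition_proportions(consensus, copies)
```

On real data, both strands, multiple hits at a site, and alignment uncertainty would all need to be handled, which is what a full probabilistic context-dependent model is designed to do.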

10.3.5 Where is Protein Evolution Going?

Once the mutation process and amino acid replacement process are modeled separately and integrated, there are two main routes to more biologically realistic models. The first route is to generalize process-based models into what we call nonstationary context-dependent (NSCD) models. In the broadest sense, these would allow context-dependent mixture model processes to vary across sites and over time (Figure 10.2), and could incorporate structural information to inform the mixture and mixture switching. Mixture models

Figure 10.2 A simple example of a mixture of models changing over time. The mammalian phylogenetic tree is shown on the left, and the portion in red consists of the primates, where the mixture change point was set. Substitution rates in heavy-strand encoded proteins for the homogeneous model and the nonhomogeneous model (two mixture components) are shown on the right. The two-class model had a log BF improvement of 6575, indicating strong support for a primate-specific alteration in evolutionary patterns/rates of some sites. (See insert for color representation of this figure.)

Figure 10.3 Posterior estimates for mammalian cytb under the RAR model (de Koning, Castoe, and Pollock, unpublished) with three independent nonreversible rate class components. (a) Mean rate estimates for each mixture class component, with rows and columns clustered according to posterior average rates. (b) The same as (a), but with rows and columns in the same order among mixture classes. (c) The most likely posterior model class overlaid (by color) onto a secondary structure diagram of cytb (alpha helices are squiggly lines and beta strands are arrows). (See insert for color representation of this ﬁgure.)

Figure 10.4 Posterior probability distributions for mammalian cytb using various amino acid substitution models. (a) mtMam versus a single (5-cat Gamma RAR) unrestricted model and a mixture of two unrestricted (5-cat Gamma 2) or two dependent rate assignment (DRA) RAR models. (b) Comparison of DRA models with different priors. Marginal log likelihoods shown above the distributions were estimated using the harmonic mean over MCMC samples, and DRA priors are labeled. (See insert for color representation of this figure.)

without structural information can be compared to structure post hoc to determine profitable structural guides for future modeling (Figure 10.3). We have found that the conditional pathway method readily accommodates approaches that reduce the number of rate parameters in arbitrary ways relative to the mixture rate matrices. These methods, under development, are called “rates across rates,” or RAR, methods (de Koning, Castoe, and Pollock, unpublished), and can allow for the rapid testing of otherwise difficult protein models (Figure 10.4). The second route is to incorporate fitness into the equation, basing it on some estimate of the thermodynamic stability of variants. We call these SEF (structure/energetic/fitness) models. Ultimately, of course, these two approaches should be integrated, and all components thoroughly tested on large data sets to determine the justification for their inclusion. Although these models are simple to describe, we anticipate that a huge amount of interesting biology will emerge as we begin to understand what physical factors truly influence how proteins evolve and under what conditions.
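The model comparisons reported in the captions of Figures 10.2 and 10.4 rest on marginal likelihoods and log Bayes factors, with the marginal likelihoods estimated by the harmonic mean over MCMC samples. A minimal Python sketch of that estimator is given below; the function names and the sampled log-likelihood values are hypothetical, and the harmonic mean estimator is known to have high variance, so this should be read as an illustration of the arithmetic rather than a recommended procedure.

```python
import math


def log_marginal_harmonic_mean(log_likelihoods):
    """Harmonic mean estimate of the log marginal likelihood from
    log-likelihoods sampled by MCMC from the posterior."""
    if not log_likelihoods:
        raise ValueError("need at least one sample")
    negated = [-x for x in log_likelihoods]
    largest = max(negated)
    # Numerically stable log-sum-exp of the negated log-likelihoods.
    log_sum = largest + math.log(sum(math.exp(v - largest) for v in negated))
    return math.log(len(negated)) - log_sum


def log_bayes_factor(samples_model_1, samples_model_0):
    """Log Bayes factor in favor of model 1 over model 0."""
    return (log_marginal_harmonic_mean(samples_model_1)
            - log_marginal_harmonic_mean(samples_model_0))


# Hypothetical sampled log-likelihoods for a two-class and a one-class model.
two_class = [-1000.0, -1001.5, -999.2]
one_class = [-1010.0, -1012.3, -1009.1]
log_bf = log_bayes_factor(two_class, one_class)
```

A positive log Bayes factor favors the first model; working on the log scale avoids underflow when likelihoods are as small as those of genome-scale alignments.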

10.3.6 Detecting Adaptation and Functional Innovation

One of the most important problems in protein evolution is that of detecting and understanding adaptation and functional innovation. Ideally, these phenomena will leave traces on the evolutionary record, possibly including accelerated evolution at some sites, bursts of substitution along particular branches, and changes in the models of protein evolution. Adaptive events may be detected as an excess number of substitutions in a particular set of sites, or excess numbers on a particular branch, compared to the expectation

calculated from synonymous sites. We have used this approach in analyzing bursts of evolution in snake mitochondrial genomes and have been able to discriminate differential rate acceleration in different genes (Castoe et al., 2008b; Jiang et al., 2007). Also, based on our previous experience (Castoe et al., 2008b; Jiang et al., 2007; Wang and Pollock, 2005, 2007b), adaptation and coevolution often go together; many substitutions during an adaptive burst may be closely paired in the three-dimensional structure, and these same pairs tend to substitute together on different lineages. Patterns of coevolution also differ depending on whether the residues are involved in adaptive bursts. When such events are detected, they may be further dissected with nonstationary mixture models. When (possibly adaptive) functional divergence is inferred, an important means of testing this inference is to reconstruct ancestors in the laboratory and examine their functional features through biochemical analysis. Unfortunately, ancestral reconstruction can be subject to a variety of errors and biases that lead to incorrect functional inferences (Krishnan et al., 2004c; Williams et al., 2006b). The improved biological realism of NSCD models should lead to improved accuracy in ancestral reconstruction, and we will test this through simulation studies. This will also provide the means to examine the relationship between model accuracy and phylogenetic structure (i.e., density and relationships of taxon sampling).
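The idea of detecting an adaptive event as an excess of substitutions relative to an expectation calculated from synonymous sites can be sketched as follows. This example assumes, purely for illustration, that the number of nonsynonymous substitutions on a branch is Poisson distributed with a mean set by the synonymous branch length, a background dN/dS ratio, and the number of nonsynonymous sites; the function names and the input values are hypothetical and are not taken from the analyses cited above.

```python
import math


def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    if observed < 0 or expected < 0:
        raise ValueError("observed and expected must be non-negative")
    below = 0.0
    term = math.exp(-expected)  # P(X = 0)
    for k in range(observed):
        below += term                 # accumulate P(X = k)
        term *= expected / (k + 1)    # advance to P(X = k + 1)
    return max(0.0, 1.0 - below)


def expected_nonsynonymous(branch_ds, background_dn_ds, n_nonsyn_sites):
    """Expected nonsynonymous substitutions on a branch if the branch
    evolved at the background dN/dS ratio (an assumed null model)."""
    return branch_ds * background_dn_ds * n_nonsyn_sites


# Hypothetical branch: dS = 0.5, background dN/dS = 0.1, 200 nonsynonymous sites.
expected = expected_nonsynonymous(0.5, 0.1, 200)
p_value = poisson_upper_tail(25, expected)  # 25 substitutions observed
```

A small tail probability flags a candidate burst on that branch; as the text emphasizes, such a flag is a starting point for further dissection, not a demonstration of adaptation by itself.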

10.4 INTEGRATING INFERENCES TO DETECT AND INTERPRET ADAPTATION: AN EXAMPLE WITH SNAKE METABOLIC PROTEINS

10.4.1 Snake Metabolic Proteins: Integration of Inferences for Adaptation

The best approach to identifying important functional change in proteins that may represent adaptation is through integration of multiple lines of evidence for functional change. Because protein evolution is highly complex, and the detection of changes that may represent adaptation and functional change can be confounded by many factors, no single statistic is sufficient to demonstrate convincingly when adaptation and functional change have occurred in proteins. Recently, we discovered that snakes are an excellent system for studying adaptive evolution and functional change in protein-coding genes, and this system demonstrates how multiple inferences of functional change in proteins can be integrated to provide a more holistic inference of adaptation and also of potential selective factors that may have led to functional change. The proteins involved in aerobic metabolism that are encoded in snake mitochondrial genomes have undergone an extreme burst of adaptive evolution that appears to have led to functional innovation and reorganization of snake oxidative metabolism. To infer how and why this may have occurred, we conducted extensive molecular evolutionary analyses of selection and coevolution in snake mitochondria and evaluated the results in the context of the structure and function of snake mitochondrial proteins.

10.4.2 Detection of Accelerated Nonsynonymous Change

The first indication of a burst of adaptive protein evolution in snake mitochondria was that snake proteins appear to have experienced greatly elevated rates of nonsynonymous change compared to other tetrapods (Castoe et al., 2008b). Mitochondrial protein-coding genes are subject to strong purifying selection to conserve protein function (Reyes et al., 1998; Yang et al., 2000), normally leading to low rates of nonsynonymous change relative to synonymous change (dN/dS). Consistent with this, the median dN/dS ratio (inferred from codon-based selection analyses) for the tetrapod mitochondrial data set is 0.12, and for cytochrome c oxidase subunit 1 (COI), the most conserved mitochondrial protein, it is 0.02. In contrast, along the branch leading to snakes the dN/dS for all proteins combined is about 25-fold higher (3.14) and for COI is about 40-fold higher (0.81) (Castoe et al., 2008b). Ratios are also high along the COI branch leading to the alethinophidian snakes, and along these same two branches for the protein cytochrome b (CytB). Furthermore, branch-site models (Yang et al., 2000) indicate that a large number of sites across all 13 mitochondrial proteins experienced excess nonsynonymous substitutions and positive selection. Paralleling the inferences based on standard dN/dS, the highest numbers of positively selected sites occur in COI and CytB. Although dN/dS-based analyses of protein adaptation are a standard in the field, they are also very susceptible to error from a number of sources, mostly related to the high potential for inaccurate estimation of the dS component. Estimates of dS for both long branches and ancient (deep) branches, both of which occur in our tetrapod mitochondrial data set, are particularly susceptible to saturation and underestimation. Furthermore, in the mitochondrial genome the vast majority of synonymous substitutions are transitions, which evolve at a high rate and are thus likely to saturate.
Mitochondrial transition substitution rates and substitution gradients across the genome may also evolve substantially across lineages (Raina et al., 2005). Because transversion (TV; purine↔pyrimidine) substitution dynamics in mtDNA are slower and far more consistent than those of transitions (Raina et al., 2005), transversions are much less prone to saturation, and the exclusive use of transversions for relative rate comparisons (e.g., dN/dS) can eliminate many potential errors (Raina et al., 2005; Yang et al., 2000). Thus, the transversion component of dN/dS was estimated by averaging over all third codon positions in the mtDNA with conserved fourfold redundancy (dSTV4X), while the nonsynonymous transversion rate was measured at first and second codon positions (dNTV12) for each gene under consideration. It is notable that nonsynonymous transversions at first and second codon positions result primarily in amino acid replacements with radical physicochemical differences and major functional effects, and thus dNTV12 may reflect more radical and functionally relevant amino acid replacements than standard measures of dN. The dNTV12/dSTV4X ratios strongly supported the finding that mitochondrial proteins endured dramatic bursts of amino acid replacement early in snake evolution (Figure 10.5). Notably, high ratios are not maintained in descendant snake lineages, indicating that strong purifying selection subsequently dominates snake mtDNA evolution (Figure 10.5). These findings provide an excellent example of an apparent context-dependent change in protein evolution in snake mitochondrial genes, in which an episodic burst of selection disrupted the normally neutral equilibrium patterns of protein evolution.
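The transversion-only comparison described above can be illustrated with a small Python sketch. It counts transversion differences between two aligned coding sequences at first and second codon positions (a stand-in for dNTV12) and at third positions of codons whose fourfold redundancy is conserved in both sequences (a stand-in for dSTV4X). The function names and toy sequences are hypothetical, and the sketch uses uncorrected pairwise proportions rather than the branch-specific, model-based rates of the published analysis; the set of fourfold-redundant codon families assumes the vertebrate mitochondrial genetic code.

```python
PURINES = {"A", "G"}
# First two bases of the fourfold-redundant codon families
# (assumed here for the vertebrate mitochondrial genetic code).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def is_transversion(a, b):
    """True if a and b differ and one is a purine, the other a pyrimidine."""
    return a != b and ((a in PURINES) != (b in PURINES))


def tv_proportions(seq1, seq2):
    """Return (p_N, p_S): transversion proportions at codon positions 1+2,
    and at third positions with conserved fourfold redundancy."""
    if len(seq1) != len(seq2) or len(seq1) % 3:
        raise ValueError("sequences must be aligned, in frame, and equal in length")
    n12_sites = n12_tv = s4_sites = s4_tv = 0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3].upper(), seq2[i:i + 3].upper()
        if any(base not in "ACGT" for base in c1 + c2):
            continue  # skip codons with gaps or ambiguity codes
        for k in (0, 1):
            n12_sites += 1
            n12_tv += is_transversion(c1[k], c2[k])
        if c1[:2] == c2[:2] and c1[:2] in FOURFOLD_PREFIXES:
            s4_sites += 1
            s4_tv += is_transversion(c1[2], c2[2])
    p_n = n12_tv / n12_sites if n12_sites else float("nan")
    p_s = s4_tv / s4_sites if s4_sites else float("nan")
    return p_n, p_s


# Toy in-frame alignment of four codons (hypothetical data).
p_n, p_s = tv_proportions("GCAGCACTAATG", "GCCGCAGTAATG")
tv_ratio = p_n / p_s
```

Restricting the synonymous count to codons whose first two positions are identical in both sequences mirrors the requirement of conserved fourfold redundancy; a real analysis would also correct for multiple hits and estimate rates per branch on a phylogeny.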

10.4.3 Changes at Conserved Sites and Coevolutionary Signal

The impact of the most functionally relevant amino acid replacements in snake mitochondrial proteins was studied at “unique sites” that had replacements in snakes and were

otherwise conserved across most tetrapods (Castoe et al., 2008b). COI and CytB have the greatest number of unique sites among mitochondrial proteins, and amino acid replacements at the 23 unique COI sites are concentrated in the earliest branches in the snake tree, with 25–31 estimated changes. Nine sites had reversions or multiple replacements, usually leading to parallel or convergent evolution, and about half of these sites underwent substantial changes in polarity or charge (Castoe et al., 2008b). The 23 unique snake sites show an exceptionally high degree of coevolution with each other in this analysis: among all possible combinations of unique site pairs, 66% and 89% have significantly coevolved (p < 0.05; 28% and 36% at p < 0.01) according to polarity and volume, respectively (Castoe et al., 2008b). When these 23 unique sites are visualized on the structure of cow cytochrome c oxidase (CO), 17 of them clearly form structurally clustered pairs or triplets, most of which appear to be in physical contact, and these clusters occur primarily in the core functional regions of the COI protein (Figure 10.6). To our knowledge, such a high proportion of physically close (or touching) clusters of replaced residues has not been previously observed in any protein, nor has this degree of concentrated coevolutionary change been previously reported for a protein. The physical clustering of unique sites strongly supports the hypothesis that these sites have coevolved, independent of the statistical coevolution analysis. Therefore, such tight physically paired coevolving residues at otherwise conserved (and therefore presumably functionally critical) sites are unlikely to have occurred without the influence of strong positive selection for evolutionary redesign. The general coevolutionary signal in snake COI at all sites (not just the unique ones) is also inordinately strong (Castoe et al., 2008b).
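The structural clustering described above amounts to grouping replaced residues that lie close together in the three-dimensional structure. The Python sketch below shows one simple way to do this, by linking any two residues whose coordinates fall within a distance cutoff and reporting the resulting groups. The function name, the coordinates, and the 5 Å cutoff are hypothetical choices for illustration, not the procedure or values used in the published analysis. The sketch also records that 23 sites yield 253 possible site pairs, the denominator for the pairwise coevolution percentages quoted in the text.

```python
import math
from itertools import combinations


def contact_clusters(coords, cutoff):
    """Group residues into clusters linked by pairwise distances <= cutoff.
    coords maps a residue label to an (x, y, z) tuple; singletons are dropped."""
    parent = {residue: residue for residue in coords}

    def find(residue):
        # Union-find lookup with path halving.
        while parent[residue] != residue:
            parent[residue] = parent[parent[residue]]
            residue = parent[residue]
        return residue

    for a, b in combinations(coords, 2):
        if math.dist(coords[a], coords[b]) <= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for residue in coords:
        clusters.setdefault(find(residue), []).append(residue)
    groups = (sorted(c) for c in clusters.values() if len(c) > 1)
    return sorted(groups, key=lambda c: (len(c), c))


# Hypothetical coordinates (in angstroms) for six replaced residues.
coords = {
    "s1": (0.0, 0.0, 0.0), "s2": (3.0, 0.0, 0.0),
    "s3": (20.0, 0.0, 0.0), "s4": (20.0, 4.0, 0.0), "s5": (20.0, 8.0, 0.0),
    "s6": (50.0, 50.0, 50.0),
}
clusters = contact_clusters(coords, cutoff=5.0)
n_site_pairs = math.comb(23, 2)  # number of distinct pairs among 23 unique sites
```

With these toy coordinates the result is one pair and one triplet, with one unclustered site. Note that single-linkage grouping joins s3 and s5 through s4 even though they are more than 5 Å apart, so the choice of linkage rule and cutoff matters when interpreting clusters.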

10.4.4 Integrating Evolutionary Inferences with Structure and Function Information

The structural basis of CO function is complex. Oxidative phosphorylation is carried out by five complexes that generate a proton gradient and drive the synthesis of ATP. CO is the penultimate complex in this chain, where the reduction of oxygen is coupled to proton pumping (Tsukihara et al., 1995, 1996). Of the 13 CO subunits, the three encoded by the mitochondrial genome (I, II, and III) are at the structural and functional core of the complex (Tsukihara et al., 1995, 1996). A copper atom and two heme groups in COI are critical to the coordinated electron transport, oxygen reduction, and proton pumping function of CO (Tsukihara et al., 1995, 1996). Protons transported or “pumped” along three putative channels (D, H, and K) from the mitochondrial matrix to the mitochondrial intermembrane space

Figure 10.5 Mitochondrial proteins have had highly elevated rates of amino acid replacement early in the evolution of snakes. The conservative transversion-based approximations of the relative rates of nonsynonymous to synonymous substitution (dNTV12/dSTV4X) are shown as open or colored circles for each branch of the phylogenetic tree; linear regression lines (excluding points in the red ellipse) are shown in black (a and b). The calculations shown are from (a) all mitochondrial proteins and (b) cytochrome c oxidase subunit 1. Blue-shaded areas of (a) and (b) indicate very long branches with high dSTV4X values where the dNTV12/dSTV4X estimate may be inaccurate, possibly due to dSTV4X saturation and underestimation. The phylogenetic tree of relationships among species in our comparative data set is shown in (c). Branches with extremely high values of dNTV12/dSTV4X for COI are indicated with colored lines (black, blue, red) following the key in the bottom left. The circles for branches in (a) and (b) were colored according to the same legend for ratios of COI (dNTV12/dSTV4X). (See insert for color representation of this figure.)

Figure 10.6 The 23 unique amino acid replacements in the cytochrome c oxidase subunit 1 protein of snakes form seven pairs and one triplet of spatially clustered amino acid replacements, concentrated at the core functional region of the COI protein. The seven spatially adjacent pairs of amino acid residues, strongly suggestive of coevolutionary adaptive change, are shown in blue/red paired spacefill combinations, and one triplet cluster is shown in a blue/purple/red combination. Unique sites that did not form clusters are shown in gray spacefill representations. The two heme groups are shown in gold spacefill shapes, the COI backbone in white, and the magnesium and copper atoms are shown as magenta and green balls, respectively. Two different perspectives are depicted, one in (a) and (b), and a second in (c) and (d); figure sets (a)/(b) and (c)/(d) are the same views with (b) and (d) showing the ribbon structure of the COI backbone in transparent gray. (See insert for color representation of this figure.)

contribute to the proton gradient utilized by the ATP synthase complex to produce ATP and also facilitate the reduction of oxygen to water. The three core COI proton channels appear to have been extensively redesigned during the evolution of snakes. At least two unique site residues (unique residues) are located in or adjacent to each of three proposed channels, and most other unique residues are distributed around these channels.

10.4.5 Further Evidence of Adaptation from Molecular Convergence

Convergent molecular evolution is believed to be rare in nature, although few studies have explicitly searched for it. When it is observed, it is often taken as good evidence for directional selection for functional change, which has acted in parallel on independent

lineages. There is an exceptionally large number of convergent changes between independent lineages of snakes, and between snakes and another group of legless squamate reptiles, the amphisbaenians. In COI, convergent changes included some of the most conserved, structurally and functionally important sites in all three proton channels (Castoe et al., 2008b). Increased taxon sampling and a novel statistical approach for detection and analysis of convergent molecular evolution revealed evidence that a significant excess of convergent molecular evolution has occurred, at an unprecedented scale, between snake and agamid lizard mitochondrial genomes (Castoe et al., 2008a, 2009). There is a strong linear relationship between the number of divergent and convergent substitutions using both ML and Bayesian methods, and this allows for good statistical accounting of the effect of branch lengths on convergence expectations. Although previous analyses of molecular convergence utilized ML approaches (Zhang and Kumar, 1997), there were large differences between the ML and Bayesian results, probably because the ML approach ignores uncertainty in the unknown ancestral states; failure to integrate over unknown ancestral states can generally lead to misleading biological conclusions (Krishnan et al., 2004c; Williams et al., 2006a; Yang, 2003). Likely convergent sites were concentrated in COI and ND1, but were present in other proteins as well (Castoe et al., 2008a, 2009). A thorough analysis of alternative hypotheses to explain this convergence (e.g., nucleotide frequencies, heterogeneous models, and long-branch attraction) eliminated all reasonable neutral explanations. Thus, the remaining obvious potential explanation for this case of excess convergent evolution is adaptation.
Combined with other evidence for adaptive protein evolution in snakes (discussed above), the excess convergence levels observed here are consistent with the action of natural selection rather than random homoplasy. The evolutionary burst in snakes may have been driven by selection related to physiological adaptations for metabolic efﬁciency and to allow radical ﬂuctuations in aerobic metabolic rate (Castoe et al., 2008b). The molecular convergence between snakes and agamid lizards may thus have resulted from shared adaptive pressures on metabolic function. Whatever the underlying cause, since the convergence extends across most regions of the mitochondrial genome, any common adaptive force must have been exceptionally strong and broad in scope.
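The linear relationship between divergent and convergent substitution counts noted in this section suggests a simple way to express excess convergence: fit a line to background branch pairs and ask how far a focal pair lies above it. The Python sketch below does this with ordinary least squares. The function names and all counts are hypothetical, and the published analyses were based on ML and Bayesian estimates of substitution histories rather than on raw counts like these.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def excess_convergence(background_pairs, focal_pair):
    """Observed minus expected convergent count for the focal branch pair.
    Each pair is (divergent substitutions, convergent substitutions)."""
    divergent = [d for d, _ in background_pairs]
    convergent = [c for _, c in background_pairs]
    slope, intercept = fit_line(divergent, convergent)
    focal_divergent, focal_convergent = focal_pair
    return focal_convergent - (intercept + slope * focal_divergent)


# Hypothetical counts for four background branch pairs and one focal pair.
background = [(100, 10), (200, 20), (300, 30), (400, 40)]
focal = (250, 60)
excess = excess_convergence(background, focal)
```

Because the expectation scales with the divergent count, long branches are not flagged simply for having accumulated more change; only convergence beyond what branch length predicts contributes to the excess.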

10.4.6 Integrating Inferences with Possible Causal Factors

Adaptive evolution and coevolution in COI early in snake evolution appear to have redesigned core functions. In particular, the roles of the various amino acid residues and channels in proton transport, coupling of proton transport to oxygen reduction, and regulation of these processes appear to have been reorganized. Although the structural and functional evidence is best in COI, there is also compelling evidence for adaptive evolution in other mitochondrial proteins early in snake evolution (Castoe et al., 2008b). The distribution and number of unique amino acid replacements, the elevated dN/dS for the entire mitochondrial proteome, site-speciﬁc selection analyses, and nucleotide dynamics (Jiang et al., 2007) collectively suggest that most snake mitochondrial proteins have experienced extraordinary levels of functional adaptive change. Snake mitochondrial function and oxidative metabolism thus appear to be exceptional systemwide, implying that snakes are an excellent model system for further metabolic research.

10.5 CONCLUSION

The problem of genome evolution and molecular structure/function is of fundamental importance to a wide variety of scientific and health-related research. The better we understand the relationship between sequence, structure, and function, the better we will be able to predict structure and function, manipulate proteins to achieve our aims, and understand and predict protein failure through mutation that leads to disease. The evolutionary record provides a vast amount of information on the subject of how sequences change under the constraints of structure, function, and functional innovation; evolutionary genomics research should be designed to extract much more accurate and practically useful information about this process. Although evolutionary genomics is not designed necessarily to predict structure directly, the results obtained have obvious potential benefits for structural prediction. Such possible benefits include predicting mutational effects, predicting structural features in novel proteins, predicting protein–protein interactions, protein–substrate and protein–drug interactions, and guiding protein design. Every effort should be made to translate evolutionary genomics results into predictions that can be used in empirical research, or higher level protein structure prediction, and to address any direct predictive utility that arises as an outcome of such research. In general, the next generation of evolutionary genomics should produce a more subtle and biologically realistic understanding of the significance of diversity and variation in proteins than is currently available.

REFERENCES ABBOTT, C.L., DOUBLE, M.C., TRUEMAN, J.W.H., ROBINSON, A., and COCKBURN, A., 2005. An unusual source of apparent mitochondrial heteroplasmy: duplicate mitochondrial control regions in Thalassarche albatrosses. Mol. Ecol. 14: 3605–3613. AMEMIYA, C.T., GREALLY, J.M., JIRTLE, R.L., LANDER, E.S., LINDBLAD-TOH, K., MILLER, R.D., POLLOCK, D.D., SAMALLOW, P.B., SPRINGER, M.S., and WILSON, R.K., 2003. Proposal for complete sequencing of the genome of a Marsupial: the gray, short-tailed opossum, Monodelphis domestica. NIHGRI White Paper. ANDREEVA, A. and MURZIN, A.G., 2006. Evolution of protein fold in the presence of functional constraints. Curr. Opin. Struct. Biol. 16: 399–408. BLOOM, J.D., DRUMMOND, D.A., ARNOLD, F.H., and WILKE, C.O., 2006. Structural determinants of the rate of protein evolution in yeast. Mol. Biol. Evol. 23: 1751–1761. BUCHNER, E., 1999. Molecular complexity at the synapse: new proteins and multiple isoforms detected in Drosophila. Ross Fiziol Zh Im I M Sechenova 85: 159–166. CASTOE, T.A., De KONING, A.P., KIM, H.-M., GU, W., NOONAN, B.P., JIANG, Z.J., PARKINSON, C.L., and POLLOCK, D.D., 2008a. An ancient adaptive episode of convergent molecular evolution confounds phylogenetic inference. Nat. Preced., http://hdl.handle.net/10101/npre.12008.12123.10101. CASTOE, T.A., JIANG, Z.J., GU, W., WANG, Z.O., and POLLOCK, D. D., 2008b. Adaptive evolution and functional redesign of core metabolic proteins in snakes. PLoS ONE 3: e2201.

CASTOE, T.A., de, KONING, A.P.J., KIM, H.-M. GU, W., NOONAN, B.P., NAYLOR, G, JIANG, Z.J., PARKINSON, C.L., and POLLOCK, D.D., 2009. Evidence for an ancient adaptive episode of convergent molecular evolution. Proc. Nat. Acad. Sci. U.S.A 106:8986–8991. de KONING, A.P.J, GU, W., and POLLOCK, D.D., 2009. Rapid likelihood analysis on large phylogenies using partial sampling of substitution histories. Mol. Biol. Evol. (Advanced Acess) doi:10.1093/molbev/msp228. DRUMMOND, D.A., RAVAL, A., and WILKE, C.O., 2006. A single determinant dominates the rate of yeast protein evolution. Mol. Biol. Evol. 23: 327–337. FAITH, J.J. and POLLOCK, D.D., 2003. Likelihood analysis of asymmetrical mutation bias gradients in vertebrate mitochondrial genomes. Genetics 165: 735–745. GLASNER, M.E., GERLT, J.A., and BABBITT, P.C., 2007. Mechanisms of protein evolution and their application to protein engineering. Adv. Enzymol. Relat. Areas Mol. Biol. 75: 193–239, xii-xii. GOLDING, G.B. and DEAN, A.M., 1998. The structural basis of molecular adaptation. Mol. Biol. Evol. 15: 355–369. GOLDSTEIN, R.A. and POLLOCK, D.D., 2006. Observations of amino acid gain and loss during protein evolution are explained by statistical bias. Mol. Biol. Evol. 23: 1444–1449. GU, X. and ZHANG, J., 1997. A simple method for estimating the parameter of substitution rate variation among sites. Mol. Biol. Evol. 14: 1106–1113.

References GU, W., RAY, D.A., WALKER, J.A., BARNES, E., GENTLES, A.J., SAMALLOW, P.B., JURKA, J., BATZER, M.A., and POLLOCK, D. D., 2007. SINEs, evolution and genome structure in the opossum. Gene 396: 46–58. HAYASHI, Y., AITA, T., TOYOTA, H., HUSIMI, Y., URABE, I., and YOMO, T., 2006. Experimental rugged ﬁtness landscape in protein sequence space. PLoS ONE 1: e96. HWANG, D.G. and GREEN, P., 2004. Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proc. Natl. Acad. Sci. USA 101: 13994–14001. JIANG, Z.J., CASTOE, T.A., AUSTIN, C.C., BURBRINK, F.T., HERRON, M.D., MCGUIRE, J.A., PARKINSON, C.L., and POLLOCK, D.D. 2007 Comparative mitochondrial genomics of snakes: extraordinary substitution rate dynamics and functionality of the duplicate control region. BMC Evol. Biol. 7: 123. JULENIUS, K. and PEDERSEN, A.G., 2006. Protein evolution is faster outside the cell. Mol. Biol. Evol. 23: 2039–2048. KOSHI, J.M. and GOLDSTEIN, R.A., 1995. Context-dependent optimal substitution matrices. Protein Eng. 8: 641–645. KOZAK, M., 1999. Initiation of translation in prokaryotes and eukaryotes. Gene 234: 187–208. KRISHNAN, N.M., RAINA, S.Z., and POLLOCK, D.D., 2004a. Analysis of among-site variation in substitution patterns. Biol. Proced. Online 6: 180–188. KRISHNAN, N.M., SELIGMANN, H., RAINA, S.Z., and POLLOCK, D.D., 2004b. Detecting gradients of asymmetry in sitespeciﬁc substitutions in mitochondrial genomes. DNA Cell Biol. 23: 707–714. KRISHNAN, N.M., SELIGMANN, H., STEWART, C.B., de KONING, A.P., and POLLOCK, D.D., 2004c. Ancestral sequence reconstruction in primate mitochondrial DNA: compositional bias and effect on functional inference. Mol. Biol. Evol. 21: 1871–1883. LEMOS, B., BETTENCOURT, B.R., MEIKLEJOHN, C.D., and HARTL, D.L., 2005. 
Evolution of proteins and gene expression levels are coupled in Drosophila and are independently associated with mRNA abundance, protein length, and number of protein–protein interactions. Mol. Biol. Evol. 22: 1345–1354. LOPEZ, P., CASANE, D., and PHILIPPE, H., 2002. Heterotachy, an important process of protein evolution. Mol. Biol. Evol. 19: 1–7. MIDELFORT, K.S. and WITTRUP, K.D., 2006. Context-dependent mutations predominate in an engineered high-afﬁnity single chain antibody fragment. Protein Sci. 15: 324–334. NIELSEN, R., 2002. Mapping mutations on phylogenies. Syst. Biol. 51: 729–739. PHILIPPE, H., CASANE, D., GRIBALDO, S., LOPEZ, P., and MEUNIER, J., 2003. Heterotachy and functional shift in protein evolution. IUBMB Life 55: 257–265. POLLOCK, D.D., 2002. Genomic biodiversity, phylogenetics and coevolution in proteins. Appl. Bioinformatics 1: 81–92. POLLOCK, D.D. and GOLDSTEIN, R.A. 2002. Molecular evolution and phylogenetic analysis. Pac. Symp. Biocomput. Tutorial.

POLLOCK, D.D. and TAYLOR, W.R., 1997. Effectiveness of correlation analysis in identifying protein residues undergoing correlated evolution. Protein Eng. 10: 647–657. POLLOCK, D.D., TAYLOR, W.R., and GOLDMAN, N., 1999. Coevolving protein residues: maximum likelihood identiﬁcation and relationship to structure. J. Mol. Biol. 287: 187–198. POLLOCK, D.D., EISEN, J.A., DOGGETT, N.A., and CUMMINGS, M.P., 2000. A case for evolutionary genomics and the comprehensive examination of sequence biodiversity. Mol. Biol. Evol. 17: 1776–1788. POLLOCK, D.D., ZWICKL, D.J., MCGUIRE, J.A., and HILLIS, D. M., 2002. Increased taxon sampling is advantageous for phylogenetic inference. Syst. Biol. 51: 664–671. RAINA, S.Z., FAITH, J.J., DISOTELL, T.R., SELIGMANN, H., STEWART, C.B., and POLLOCK, D.D., 2005. Evolution of basesubstitution gradients in primate mitochondrial genomes. Genome Res. 15: 665–673. REYES, A., GISSI, C., PESOLE, G., and SACCONE, C., 1998. Asymmetrical directional mutation pressure in the mitochondrial genome of mammals. Mol. Biol. Evol. 15: 957–966. ROBINSON, D.M., JONES, D.T., KISHINO, H., GOLDMAN, N., and THORNE, J.L., 2003. Protein evolution with dependence among codons due to tertiary structure. Mol. Biol. Evol. 20: 1692–1704. RODRIGUE, N., LARTILLOT, N., BRYANT, D., and PHILIPPE, H., 2005. Site interdependence attributed to tertiary structure in amino acid sequence evolution. Gene 347: 207–217. RODRIGUE, N., PHILIPPE, H., and LARTILLOT, N., 2006. Assessing site-interdependent phylogenetic models of sequence evolution. Mol. Biol. Evol. 23: 1762–1775. TAVERNA, D.M. and GOLDSTEIN, R.A., 2002a. Why are proteins marginally stable? Proteins 46: 105–109. TAVERNA, D.M. and GOLDSTEIN, R.A., 2002b. Why are proteins so robust to site mutations? J. Mol. Biol. 315: 479–484. TEMPLETON, A.R., REICHERT, R.A., WEISSTEIN, A.E., YU, X.F., and MARKHAM, R.B., 2004. 
Selection in context: patterns of natural selection in the glycoprotein 120 region of human immunodeﬁciency virus 1 within infected individuals. Genetics 167: 1547–1561. THORNE, J.L., 2007. Protein evolution constraints and modelbased techniques to study them. Curr. Opin. Struct. Biol. 17: 337–341. TOBIN, M.B., GUSTAFSSON, C., and HUISMAN, G.W., 2000. Directed evolution: the ‘rational’ basis for ‘irrational’ design. Curr. Opin. Struct. Biol. 10: 421–427. TSUKIHARA, T., AOYAMA, H., YAMASHITA, E., TOMIZAKI, T., YAMAGUCHI, H., SHINZAWA-ITOH, K., NAKASHIMA, R., YAONO, R., and YOSHIKAWA, S., 1995. Structures of metal sites of oxidized bovine heart cytochrome c oxidase at 2.8 A. Science 269: 1069–1074. TSUKIHARA, T., AOYAMA, H., YAMASHITA, E., TOMIZAKI, T., YAMAGUCHI, H., SHINZAWA-ITOH, K., NAKASHIMA, R., YAONO, R., and YOSHIKAWA, S., 1996. The whole structure of the 13 subunit oxidized cytochrome c oxidase at 2.8 A. Science 272: 1136–1144.

WANG, Z.O. and POLLOCK, D.D., 2005. Context dependence and coevolution among amino acid residues in proteins. Methods Enzymol. 395: 779–790. WANG, Z.O. and POLLOCK, D.D., 2007. Coevolutionary patterns in cytochrome c oxidase subunit I depend on domain structure and functional context. J. Mol. Evol. 65: 485–495. WANG, Z.O. and POLLOCK, D.D., 2009. Context dependent coevolution in protein complex cytochrome c oxidase detected by Bayes Factor analysis, in press. WILLIAMS, P.D., POLLOCK, D.D., and GOLDSTEIN, R.A., 2001. Evolution of functionality in lattice proteins. J. Mol. Graph. Model. 19: 150–156. WILLIAMS, P.D., POLLOCK, D.D., BLACKBURNE, B.P., and GOLDSTEIN, R.A., 2006a. Assessing the accuracy of ancestral protein reconstruction methods. PLoS Comput. Biol. 2: e69. WILLIAMS, P.D., POLLOCK, D.D., and GOLDSTEIN, R.A., 2006b. Functionality and the evolution of marginal stability in

proteins: inferences from lattice simulations. Evo. Bioinformatics Online 2: 59–69. WOOD, T.C. and PEARSON, W.R., 1999. Evolution of protein sequences and structures. J. Mol. Biol. 291: 977–995. XU, Y.O., HALL, R.W., GOLDSTEIN, R.A., and POLLOCK, D.D., 2005. Divergence, recombination and retention of functionality during protein evolution. Hum. Genomics 2: 158–167. YANG, Z. 2003. Adaptive molecular evolution. In Handbook of Statistical Genetics (eds D. Balding, M. Bishop, and C. Cannings). Wiley, New York, pp. 229–254. YANG, Z., NIELSEN, R., GOLDMAN, N., and PEDERSEN, A.M., 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155: 431–449. ZHANG, J. and KUMAR, S., 1997. Detection of convergent and parallel evolution at the amino acid sequence level. Mol. Biol. Evol. 14: 527–536.

Chapter 11

The Evolution of Protein Material Costs

Jason G. Bragg and Andreas Wagner

11.1 INTRODUCTION
11.2 PROTEIN MATERIAL COSTS
11.3 AN EXAMPLE: PROTEOMIC SULFUR SPARING
11.4 EPISODIC NUTRIENT SCARCITY CAN SHAPE PROTEIN MATERIAL COSTS
11.5 HIGHLY EXPRESSED GENE PRODUCTS OFTEN EXHIBIT REDUCED MATERIAL COSTS
11.6 MATERIAL COSTS AND THE EVOLUTION OF GENOMES
11.7 MATERIAL COSTS AND OTHER COSTS OF MAKING PROTEINS
11.8 CONCLUSIONS
ACKNOWLEDGMENTS
REFERENCES

Evolutionary Genomics and Systems Biology, edited by Gustavo Caetano-Anolles. Copyright 2010 John Wiley & Sons, Inc.

11.1 INTRODUCTION

We here survey evidence that natural selection to reduce the material cost of making proteins can profoundly change the chemical composition of proteins and of the genomes that encode them.

Organisms take nutrients from their environment and use them as substrates for generating energy and for making the molecules, cells, and tissues of which they are composed. Different nutrients are often required in specific ratios to support balanced growth. However, the relative availability of nutrients can vary sharply across different environments, or over time within environments, leading to growth limitation by one or more nutrients (Sterner and Elser, 2002). Growth limitation can lead to strong selective pressure for an improved ability to acquire a limiting nutrient, or for increases in the efficiency with which a nutrient is used for growth once acquired. Accordingly, adaptations to nutrient scarcity can be observed on almost all levels of biological organization, from the sizes of whole cells (e.g., Aksnes and Egge, 1991; Button, 1991; Chisholm, 1992; Yoshiyama and Klausmeier, 2007), to the numbers and types of atoms that are used to make molecules (e.g., Baudouin-Cornu and Bragg, 2006; Van Mooy et al., 2006).

Here, we discuss protein atomic (material) costs in relation to nutrient availability. This discussion includes the evolution of protein expression costs in response to transient and chronic nutrient shortages, as well as the evolution of genome composition (with respect to both nucleotide and gene content) as it relates to the costs of protein expression.

11.2 PROTEIN MATERIAL COSTS

The presence, expression, and functional characteristics of specific genes and their protein products are fundamentally important for evolutionary responses to nutrient limitation. For example, when the yeast Saccharomyces cerevisiae evolves in the laboratory under glucose limitation (Paquin and Adams, 1983), chromosomal rearrangements result in amplification of genes encoding high-affinity hexose transporters, leading to greater affinity of yeast cells for glucose (Brown et al., 1998; Dunham et al., 2002).

In addition to performing essential functions, proteins constitute a large proportion of cellular biomass, and their expression carries significant material costs. Specifically, proteins are composed of amino acids that contain different numbers of atoms of carbon (2–11 atoms), nitrogen (1–4 atoms), and sulfur (1 atom in Cys and Met only). In many environments, these elements are ecologically limiting. Amino acids also contain oxygen and hydrogen, although these elements are probably rarely growth limiting (but see Acquisti et al. (2006)). Some proteins also require metal cofactors, including iron, copper, zinc, molybdenum, magnesium, manganese, or nickel. These metals can potentially limit growth, sometimes by limiting the activity of enzymes needed to assimilate macronutrients (Saito et al., 2008).

It has been suspected for some time that the costs of synthesizing and incorporating different amino acids into proteins might be evolutionarily significant. For instance, Richmond (1970) argued that few amino acid substitutions are completely neutral, since different amino acids tend to vary in their abundance and in their costs. Since then, numerous studies have highlighted patterns in protein amino acid and elemental composition that are consistent with selection for reduced energetic or material costs.

The function of a protein places strong constraints on its primary structure. In certain positions along the length of a protein, there will be no opportunity to replace an expensive amino acid with a cheaper one without compromising the functional integrity of the protein. However, in other positions, it may be possible to use one of several different amino acids. This suggests that beyond protein functional constraints, there is room for variation in amino acid composition, and hence elemental composition.

In discussing research on the evolution of protein material costs, we begin by describing the response of yeast to limitation by sulfur, because this response highlights concepts that are central to the evolution of protein material costs in general. In subsequent sections, we consider these patterns and concepts in greater detail. Finally, we consider the evolution of protein elemental composition on a whole-genome scale.
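As an illustration of how the material cost of a protein can be tallied, the following minimal sketch counts carbon, nitrogen, and sulfur atoms per residue. The per-amino-acid atom counts are standard chemistry; the function name `element_content` and the five-residue example peptide are hypothetical and are not taken from any of the studies cited here.

```python
# Atoms of carbon, nitrogen, and sulfur in each standard amino acid.
# Peptide-bond formation removes only H and O, so these counts apply
# equally to free amino acids and to residues within a protein.
ATOMS = {
    # amino acid: (C, N, S)
    "G": (2, 1, 0), "A": (3, 1, 0), "S": (3, 1, 0), "C": (3, 1, 1),
    "D": (4, 1, 0), "N": (4, 2, 0), "T": (4, 1, 0), "P": (5, 1, 0),
    "V": (5, 1, 0), "E": (5, 1, 0), "Q": (5, 2, 0), "M": (5, 1, 1),
    "H": (6, 3, 0), "I": (6, 1, 0), "L": (6, 1, 0), "K": (6, 2, 0),
    "R": (6, 4, 0), "F": (9, 1, 0), "Y": (9, 1, 0), "W": (11, 2, 0),
}


def element_content(seq):
    """Return mean atoms of C, N, and S per residue for a protein sequence."""
    totals = {"C": 0, "N": 0, "S": 0}
    for aa in seq.upper():
        c, n, s = ATOMS[aa]
        totals["C"] += c
        totals["N"] += n
        totals["S"] += s
    return {element: count / len(seq) for element, count in totals.items()}


# Hypothetical peptide, used only to exercise the function.
print(element_content("MKCLA"))  # {'C': 4.6, 'N': 1.2, 'S': 0.4}
```

Expressing content per residue, as here, makes proteins of different lengths comparable; multiplying by protein length and copy number would instead give a total atomic budget.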

11.3 AN EXAMPLE: PROTEOMIC SULFUR SPARING

It was observed more than 25 years ago that sulfur-starved bacteria produced protein containing less sulfur per unit mass than cells that were not sulfur starved (Cuhel et al., 1981). At that time, it was not known whether this was an “active” response by the sulfur-stressed cells, or whether it simply reflected the inability of sulfur-starved cells to make proteins that required large quantities of sulfur.

Sulfur-depleted proteins are produced not only in sulfur-stressed bacteria but also in sulfur-stressed yeast. In a study by Fauchon et al. (2002), yeast cells were exposed to cadmium, which causes the synthesis of the sulfur-rich detoxifying compound glutathione. The diversion of sulfur to glutathione synthesis leads to sulfur stress. Major changes in gene expression occur in response to these conditions, at both the mRNA and protein levels. These changes in protein expression led to a reduction in the sulfur content of expressed proteins by approximately 30% (Fauchon et al., 2002).

Two main processes were responsible for this reduction. First, several abundant proteins were downregulated, and sulfur-poor isozymes were upregulated to replace them (Fauchon et al., 2002). This pattern was observed for several proteins that function in glycolysis. The glycolysis-specific functions of these upregulated isozymes make their direct role in detoxification or sulfur assimilation unlikely. It is therefore probable that these isozymes were upregulated to reduce the sulfur costs of protein expression. Second, sulfur metabolism genes that were upregulated during cadmium exposure tended to encode proteins with low sulfur content (Fauchon et al., 2002). The induction of sulfur metabolism genes and of sulfur-poor glycolytic isozymes was linked to the yeast transcriptional activator Met4p, demonstrating that the observed “sulfur sparing” during sulfur limitation was transcriptionally regulated (Fauchon et al., 2002).

These observations elegantly showcase several important processes related to the evolution of protein material costs. Below, we discuss these processes with further examples, along with the types of selective pressure that may underpin them.
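The idea that swapping isozymes lowers the sulfur content of the expressed proteome can be sketched numerically. The example below is a toy calculation under stated assumptions: the three sequences, their names, and the copy numbers are invented for illustration and do not reproduce the data or the approximately 30% figure of Fauchon et al. (2002). Sulfur atoms are counted as one per Cys or Met residue.

```python
def expressed_sulfur_content(proteins, abundance):
    """Abundance-weighted sulfur atoms per residue across expressed proteins.

    proteins:  dict mapping protein name -> amino acid sequence
    abundance: dict mapping protein name -> copies per cell (hypothetical)
    """
    sulfur_atoms = 0
    residues = 0
    for name, seq in proteins.items():
        copies = abundance.get(name, 0)
        # Cys (C) and Met (M) each carry exactly one sulfur atom.
        sulfur_atoms += copies * (seq.count("C") + seq.count("M"))
        residues += copies * len(seq)
    return sulfur_atoms / residues


# Invented sequences: one sulfur-rich isozyme, its sulfur-poor
# counterpart, and one unrelated protein whose abundance is unchanged.
proteins = {
    "isozyme_rich": "MCMKCLAGMV",
    "isozyme_poor": "MKALSLAGLV",
    "other": "MKTAYIAKQR",
}
replete = {"isozyme_rich": 100, "isozyme_poor": 10, "other": 50}
stressed = {"isozyme_rich": 10, "isozyme_poor": 100, "other": 50}

before = expressed_sulfur_content(proteins, replete)
after = expressed_sulfur_content(proteins, stressed)
print(f"S per residue: {before:.3f} -> {after:.3f}")
```

Weighting by abundance is the key design choice: a sulfur-rich protein matters to the cellular sulfur budget only in proportion to how many copies are made.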

11.4 EPISODIC NUTRIENT SCARCITY CAN SHAPE PROTEIN MATERIAL COSTS

In the example of sulfur-stressed yeast, the replacement of abundant proteins by sulfur-poor isozymes led to a substantial reduction in the total amount of sulfur needed to express proteins (Fauchon et al., 2002). Similar observations have been reported for sulfur and for other elements in a variety of microbes. In cyanobacteria, phycobiliproteins are highly expressed photosynthesis proteins that can account for a substantial proportion of total cellular protein. The cyanobacterium Calothrix encodes a duplicate set of phycobiliproteins whose protein products contain substantially fewer sulfur atoms than the phycobiliproteins that are typically expressed (Mazel and Marliere, 1989). During sulfur limitation, the sulfur-poor versions of the phycobiliproteins are specifically upregulated (Mazel and Marliere, 1989). The sulfur-poor phycobiliproteins do not have an obvious functional connection to sulfur metabolism, suggesting that their induction is related to their smaller demand for sulfur-containing amino acids during protein synthesis.

A number of ribosomal proteins contain zinc-binding domains. Panina et al. (2003) identified four ribosomal proteins that typically contain zinc-binding domains, but that have paralogous copies without zinc-binding domains in a substantial number of bacterial genomes. These paralogues appear to be induced during shortages of zinc, which may allow the bacteria to reduce their use of zinc in ribosomal proteins when zinc is scarce (Panina et al., 2003).

A variety of microbes use the proteins ferredoxin, flavodoxin, or both, for electron transfer in metabolic pathways. Ferredoxin contains iron, but flavodoxin does not. Iron is a key limiting nutrient in large regions of the ocean (Martin et al., 1991). In several microbes, both ferredoxin and flavodoxin are known to be encoded by the same genome and to be differentially expressed according to iron availability. Specifically, during iron limitation, the expression of ferredoxin is suppressed, and iron-free flavodoxin is induced (Knight and Hardy, 1966; La Roche et al., 1993; Mayhew and Massey, 1969). Interestingly, the relative levels of ferredoxin and flavodoxin expression by marine microbes have been used as an index of the intensity of iron limitation in the ocean (e.g., Erdner et al., 1999).

In the above examples, the genes differentially regulated during nutrient stress may not be required for the acquisition and assimilation of an element; their protein products may merely contain much of the element. Not surprisingly, proteins necessary to assimilate an element may also contain less of that element than other proteins (Baudouin-Cornu et al., 2001). As mentioned above, sulfur-stressed yeast upregulate sulfur metabolism genes that encode sulfur-poor protein products (Boer et al., 2003; Fauchon et al., 2002). In fact, it was noted as long ago as 1966 that a bacterial sulfate-binding protein had the “unusual feature” of containing no cysteine or methionine residues (Pardee, 1966, p. 5888). It has since been observed that proteins involved in sulfur uptake and assimilation are sulfur poor in bacteria and yeast (e.g., Baudouin-Cornu et al., 2001; Van der Ploeg et al., 1996). Furthermore, for yeast and Escherichia coli, proteins used for the assimilation of carbon contain fewer carbon atoms compared to the rest of the proteome (Baudouin-Cornu et al., 2001).

Two possible and subtly different types of selective mechanisms could account for the expression of proteins that are depleted in a specific element when the element is scarce. First, selection might favor the induction of such proteins to reduce cellular demand for the limiting element. Second, selection could favor reduced use of an element in specific proteins to ensure that those proteins can be translated when the element is strongly limiting. In the case of assimilatory proteins, it has been suggested that the latter mechanism is more likely, since assimilatory proteins probably contain a relatively small proportion of the total cellular budget of an element, but their translation and activity are critically important during conditions of strong limitation (Baudouin-Cornu et al., 2001). Currently, there is little direct evidence to distinguish between the two hypotheses. A potentially relevant observation is that yeast grown under carbon limitation over short evolutionary timescales exhibit a relaxation in their tendency to upregulate carbon-poor proteins. That is, during adaptive evolution to carbon limitation, the tendency to upregulate carbon-poor proteins is not elaborated as part of an evolved response to carbon limitation (Bragg and Wagner, 2007).

11.5 HIGHLY EXPRESSED GENE PRODUCTS OFTEN EXHIBIT REDUCED MATERIAL COSTS

Evidence that cells economize on gene expression costs is not restricted to transient periods of acute limitation by specific elements. Specifically, genes that are constitutively expressed at high levels have low expression cost according to several criteria. For instance, highly expressed genes tend to encode shorter proteins (Brocchieri and Karlin, 2005), and contain fewer introns (Castillo-Davis et al., 2002), than less highly expressed genes. Proteins encoded by highly expressed genes also contain fewer energetically expensive or heavy amino acids (Akashi and Gojobori, 2002; Heizer et al., 2006; Seligmann, 2003).

Similarly, highly expressed genes encode proteins with relatively low material costs for nutrients that are commonly limiting. For instance, highly expressed yeast genes encode proteins that are depleted in sulfur, carbon, and nitrogen (Bragg and Wagner, 2007; Fauchon et al., 2002). Similar observations have been made for plants, which are commonly limited by nitrogen but may rarely face carbon limitation. Consistent with this difference, highly expressed plant genes tend to encode proteins that are poor in nitrogen, but do not exhibit any significant bias in carbon content (Elser et al., 2006).
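One simple way to test for an association of this kind is a rank correlation between expression level and the elemental content of the encoded proteins. The sketch below implements Spearman's rank correlation from first principles. The choice of Spearman's coefficient is an assumption made for illustration and is not necessarily the statistic used in the studies cited above; the four expression values and nitrogen contents are invented.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented data: expression level and nitrogen atoms per residue.
expression = [5, 50, 500, 5000]
nitrogen_per_residue = [1.40, 1.35, 1.30, 1.20]
print(spearman(expression, nitrogen_per_residue))
```

A rank-based statistic is convenient here because expression levels typically span orders of magnitude, so a measure that depends only on ordering is insensitive to the scale on which expression is reported.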

11.6 MATERIAL COSTS AND THE EVOLUTION OF GENOMES

Material costs of gene expression can influence the evolution of whole genomes. One pertinent line of evidence relates to metal cofactor use over geological time (Dupont et al., 2006). During the history of life, dramatic changes occurred in the availability of metals that are commonly used as cofactors. In particular, the availability of iron, zinc, manganese, and cobalt probably changed greatly when ocean geochemistry was transformed after the oxygenation of the atmosphere some 2 billion years ago. Extant representatives of lineages that went through major expansions before and after these events show patterns in the use of metal cofactors that reflect likely changes in metal availability.

For example, Dupont et al. (2006) characterized the use of different metal-binding domains in the proteomes of organisms in the superkingdoms Archaea, Bacteria, and Eukarya. Specifically, these authors fitted curves of the form $y = m x^{a}$ to relationships between the number of predicted metal-binding domains ($y$) and the total number of structural protein domains ($x$), across the predicted proteomes in each superkingdom. This yielded estimates for the value of the scaling exponent, $a$, which was used to make inferences about the change in the prevalence of metal-binding proteins in proteomes over evolutionary history. Specifically, values of the scaling exponent greater than one ($a > 1$) were taken to imply that the metal-binding domains were preferentially retained during the evolution and growth of proteomes, while values of the scaling exponent smaller than one ($a < 1$) were taken to indicate the opposite (Dupont et al., 2006). After the oxygenation of the earth, the availability of zinc to marine organisms probably increased drastically. Eukaryotes probably originated in an oxygenated earth, and the number of zinc-binding domains in eukaryotic proteomes scales with an exponent greater than one ($a > 1$). Conversely, Bacteria and Archaea evolved prior to the oxygenation of earth, and the numbers of zinc-binding domains in their proteomes scale with exponents smaller than one ($a < 1$, for both Archaea and Bacteria) (Dupont et al., 2006).

Metal cofactor use can also be biased in the genomes of individual species that are adapted to environments where a specific metal is scarce. The human body provides an iron-poor environment for microbes, in part because humans produce and secrete proteins that bind iron, and reduce its availability to pathogens. Some bacteria counter this “withholding” of iron by producing specialized molecules that help extract iron from their hosts (Ratledge and Dover, 2000). In contrast, the bacterial pathogen Borrelia burgdorferi has drastically reduced its iron requirements for growth by encoding very few, if any, iron-binding proteins (Posey and Gherardini, 2000).

Limited evidence suggests that the quantities of macronutrients used in whole genomes and in the proteins they encode respond adaptively to nutrient availability. Aerobic, nitrogen-fixing bacteria tend to have higher genomic GC content (proportion of guanine plus cytosine base pairs) than their non-nitrogen-fixing congeners (McEwan et al., 1998). This phenomenon may be related to the greater nitrogen content of GC base pairs (8 N atoms) than AT base pairs (7 N atoms) (McEwan et al., 1998). That is, bacteria that are capable of fixing atmospheric nitrogen might have encountered relaxed selective pressure for low nitrogen content in their DNA (and possibly in their mRNAs) and may thus use a greater proportion of the more nitrogen-rich GC base pairs. Alternatively, bacteria with higher DNA (and mRNA) nitrogen content might be subjected to greater selective pressure to fix atmospheric nitrogen (McEwan et al., 1998). However, DNA and messenger RNA account for a relatively small proportion of cellular nitrogen, meaning that the reduction in cellular nitrogen content afforded by low GC content is relatively small.

Observations like these motivated a study of protein material costs among diverse prokaryotes that found a positive association between genomic GC content and the average nitrogen content of predicted proteins (per amino acid) (Bragg and Hyder, 2004). Therefore, it is possible that nitrogen-fixing bacteria with high DNA nitrogen content often also have high average protein nitrogen costs and that this association underpins an adaptive association between GC content and atmospheric nitrogen fixing (McEwan et al., 1998).

The correlation between the elemental content of genomes and proteins may stem from the structure of the genetic code. For instance, nitrogen-rich amino acids, such as Arg and His, have relatively GC-rich codons (Bragg and Hyder, 2004). Protein carbon content is associated negatively with GC content among organisms (Baudouin-Cornu et al., 2004; Bragg and Hyder, 2004), probably because carbon-rich amino acids often have AT-rich codons. The average sulfur content of proteins tends to be low in organisms with very high and very low GC content, and higher in organisms with intermediate GC content (Bragg et al., 2006). This association may exist because cysteine and methionine codons collectively have moderate GC content.
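The link between base composition and elemental content can be checked directly from the standard genetic code. The sketch below is illustrative only: it computes the mean GC fraction of the codons assigned to a set of amino acids, and the nitrogen atoms per DNA base pair as a function of GC fraction (using 8 N atoms per GC pair and 7 per AT pair, as stated above). The function names are hypothetical, and the unweighted average over synonymous codons is an assumption made for simplicity, since real genomes use synonymous codons unequally.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (first base varying slowest, third base fastest); "*" marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def mean_codon_gc(amino_acids):
    """Mean GC fraction over all codons encoding the given amino acids."""
    codons = [c for c, aa in CODON_TABLE.items() if aa in amino_acids]
    gc_bases = sum(c.count("G") + c.count("C") for c in codons)
    return gc_bases / (3 * len(codons))


def dna_nitrogen_per_bp(gc_fraction):
    """Mean nitrogen atoms per base pair for a given genomic GC fraction."""
    return 8 * gc_fraction + 7 * (1 - gc_fraction)


print("Arg codons:", round(mean_codon_gc("R"), 3))
print("Cys + Met codons:", round(mean_codon_gc("CM"), 3))
print("Phe codons:", round(mean_codon_gc("F"), 3))
print("N per bp at 30% and 70% GC:",
      dna_nitrogen_per_bp(0.3), dna_nitrogen_per_bp(0.7))
```

Under this unweighted measure, the six Arg codons are markedly GC-rich, the codons for the carbon-rich Phe are AT-rich, and the Cys and Met codons together fall in between, which is in line with the associations described in the text.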
Additional variation among species in average protein sulfur content can be explained by environmental conditions to which different species are adapted. Prokaryotes adapted to high temperatures tend to have lower average protein sulfur content than those adapted to lower temperatures, and anaerobic species tend to have greater average protein sulfur content than nonanaerobes (Bragg et al., 2006). While these analyses do not indicate that protein sulfur content evolves in response to the availability of sulfur, they do suggest that environmental conditions may affect the demand for speciﬁc nutrients.
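The scaling analysis of metal-binding domains described earlier in this section rests on estimating the exponent $a$ in $y = m x^{a}$. A minimal sketch of one way to obtain such an estimate is ordinary least squares on log-transformed values, since $\log y = \log m + a \log x$ is linear. This fitting procedure is an assumption for illustration; the text does not specify how Dupont et al. (2006) performed their fits, and the data below are synthetic.

```python
import math


def fit_power_law(x, y):
    """Estimate (m, a) in y = m * x**a by least squares on log-log axes."""
    log_x = [math.log(v) for v in x]
    log_y = [math.log(v) for v in y]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    a = sxy / sxx                        # slope = scaling exponent
    m = math.exp(mean_y - a * mean_x)    # intercept = log of the prefactor
    return m, a


# Synthetic proteomes: total domain counts (x) and metal-binding
# domain counts (y) generated from a known exponent of 1.5.
total_domains = [100, 200, 400, 800]
metal_domains = [0.01 * v ** 1.5 for v in total_domains]

m, a = fit_power_law(total_domains, metal_domains)
print(f"m = {m:.4f}, a = {a:.2f}")
```

An estimate of $a$ above one would indicate that the domain class grows faster than the proteome as a whole, and an estimate below one that it grows more slowly, which is the contrast drawn in the text for zinc-binding domains.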

11.7 MATERIAL COSTS AND OTHER COSTS OF MAKING PROTEINS

Costs other than material costs can influence the evolution of protein composition. As mentioned previously, energetic costs have been linked to the frequencies with which different amino acids occur in proteins (Akashi and Gojobori, 2002; Craig and Weber, 1998). Similar observations have been made for the size of amino acids (e.g., molecular weight) as a surrogate for biosynthetic cost (Dufton, 1997). The use of such a surrogate is convenient when biosynthetic costs are compared for several organisms, since it does not require detailed information on the metabolic pathways of amino acid biosynthesis in each organism (Heizer et al., 2006; Seligmann, 2003).

In some cases, protein material costs and energetic costs can be strongly related. In yeast, protein carbon content per amino acid is related strongly and positively to the energetic cost per amino acid (Bragg and Wagner, 2007). However, in other cases, material costs and energetic costs may interact in more complex ways. Microbial organisms often obtain inorganic nitrogen and sulfur in oxidation states that are too high to incorporate into organic compounds, and must reduce them before they can be used in proteins. The interacting demands for materials and reducing power, along with metabolic differences among organisms and variation in the availability of elements in different oxidation states, may significantly increase the number of potential ways in which protein material costs could evolve. For example, two species of prokaryotes that perform dissimilatory sulfate reduction (in which sulfate acts as a terminal electron acceptor in anaerobic respiration) have high average protein sulfur content (Bragg et al., 2006). The reason may be that cells of these species produce reduced sulfur as a by-product of their metabolism.

11.8 CONCLUSIONS

Nutrient limitation has profound effects on the evolution of microbial genomes, including the kinds of genes a genome encodes, and their regulation. Many important adaptations to nutrient limitation promote the uptake and assimilation of limiting nutrients, such as the acquisition and retention of genes that encode high-affinity transporters (Dunham et al., 2002). Organisms may also evolve to use smaller quantities of scarce nutrients in making proteins. For instance, specific proteins may evolve to contain relatively small quantities of an element. This has been observed for proteins that are expressed specifically during nutrient limitation (Baudouin-Cornu et al., 2001; Boer et al., 2003; Bragg and Wagner, 2007; Fauchon et al., 2002), and for proteins that are highly expressed in general (Bragg and Wagner, 2007; Elser et al., 2006; Fauchon et al., 2002). In a growing number of cases, organisms respond to nutrient limitation by downregulating specific proteins, and by replacing them with proteins of similar function that contain smaller amounts of the limiting element (e.g., Fauchon et al., 2002; Knight and Hardy, 1966; Mazel and Marliere, 1989; Panina et al., 2003). Over much longer time scales, the availability of specific metals appears to have influenced the rate at which metal-binding domains proliferated in proteomes (Dupont et al., 2006).

The costs of protein expression can thus evolve in diverse ways (i.e., through changes in protein composition, expression, or both), and in response to nutrient shortages that occur over vastly different time scales. Undoubtedly, many more instances where protein material costs have evolved in response to nutrient limitation await discovery. Taken together, these instances will help to illuminate many aspects of genome evolution.

ACKNOWLEDGMENTS

JGB was supported by an NSF Biocomplexity grant (DEB-0083422) and AW by grant 315200-116814 from the Swiss National Foundation.

REFERENCES ACQUISTI, C., KLEFFE, J., and COLLINS, S., 2006. Oxygen content of transmembrane proteins over macroevolutionary time scales. Nature 445: 47–52. AKASHI, H. and GOJOBORI, T., 2002. Metabolic efﬁciency and amino acid composition in the proteomes of Escherichia coli and Bacillus subtilus. Proc. Natl. Acad. Sci. USA 99: 3695–3700. AKSNES, D.L. and EGGE, J.K., 1991. A theoretical model for nutrient uptake in phytoplankton. Mar. Ecol. Prog. Ser. 70: 65–72. BAUDOUIN-CORNU, P. and BRAGG, J.G., 2006. Analyzing proteomic, genomic and transcriptomic elemental compositions to uncover the intimate evolution of biopolymers. In

Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics (eds. L. Jorde, P. Little, M. Dunn, and S. Subramaniam). John Wiley and Sons. BAUDOUIN-CORNU, P., SURDIN-KERJAN, Y., MARLIE`RE, P., and THOMAS, D., 2001. Molecular evolution of protein atomic composition. Science 293: 297–300. BAUDOUIN-CORNU, P., SCHUERER, K., MARLIE`RE, P., and THOMAS, D., 2004. Intimate evolution of proteins. Proteome atomic content correlates with genome base composition. J. Biol. Chem. 279: 5421–5428. BOER, V.M., de WINDE, J.H., PRONK, J.T., and PIPER, M.D.W., 2003. The genome-wide transcriptional responses of Saccharomyces cerevisiae grown on glucose in

210

Chapter 11

The Evolution of Protein Material Costs

aerobic chemostat cultures limited for carbon, nitrogen, phosphorus or sulfur. J. Biol. Chem. 278: 3265–3274. BRAGG, J.G. and HYDER, C.L., 2004. Nitrogen versus carbon use in prokaryotic genomes and proteomes. Proc. R. Soc. B 271(Suppl. 5): S374–S377. BRAGG, J.G. and WAGNER, A., 2007. Protein carbon content evolves in response to carbon availability and may inﬂuence the fate of duplicated genes. Proc. R. Soc. B 274: 1063–1070. BRAGG, J.G., THOMAS, D., and BAUDOUIN-CORNU, P., 2006. Variation among species in proteomic sulphur content is related to environmental conditions. Proc. R. Soc. B 273: 1293–1300. BROCCHIERI, L. and KARLIN, S., 2005. Protein length in eukaryotic and prokaryotic proteomes. Nucleic Acids Res. 33: 3390–3400. BROWN, C.J., TODD, K.M., and ROSENZWEIG, R.F., 1998. Multiple duplications of yeast hexose transporter genes in response to selection in a glucose-limited environment. Mol. Biol. Evol. 15: 931–942. BUTTON, D.K., 1991. Biochemical basis of whole-cell uptake kinetics: speciﬁc afﬁnity, oligotrophic capacity and the meaning of the Michaelis constant. Appl. Environ. Microbiol. 57: 2033–2038. CASTILLO-DAVIS, C.I., MEKHEDOV, S.I., HARTL, D.L., KOONIN, E.V., and KONDRASHOV, F.A., 2002. Selection for short introns in highly expressed genes. Nat. Genet. 31: 415–418. CHISHOLM, S.W., 1992. Phytoplankton size. In Primary Productivity and Biogeochemical Cycles in the Sea (eds P.G. Falkowski and A.D. Woodhead). Plenum, New York. CRAIG, C.L. and WEBER, R.S., 1998. Selection costs of amino acid substitutions in ColE1 and ColIa gene clusters harbored by Escherichia coli. Mol. Biol. Evol. 15: 774–776. CUHEL, R.L., TAYLOR, C.D., and JANNASCH, H.W., 1981. Assimilatory sulfur metabolism in marine microorganisms: sulfur metabolism, growth and protein synthesis of Pseudomonas halodurans and Alteromonas luteo-violaceus during sulfate limitation. Arch. Microbiol. 130: 1–7. DUFTON, M.J., 1997. 
Genetic code synonym quotas and amino acid complexity: cutting the cost of proteins. J. Theor. Biol. 187: 165–173. DUNHAM, M.J., BADRANE, H., FEREA, T., ADAMS, J., BROWN, P. O., RICHMOND, R., and BOTSTEIN, D., 2002. Characteristic genome rearrangements in experimental evolution of Saccharomyces cerevisiae. Proc. Natl. Acad. Sci. USA 99: 16144–16149. DUPONT, C.L., YANG, S., PALENIK, B., and BOURNE, P.E., 2006. Modern proteomes contain putative imprints of ancient shifts in trace metal geochemistry. Proc. Natl. Acad. Sci. USA 103: 17822–17827. ELSER, J.J., FAGAN, W.F., SUBRAMANIAN, S., and KUMAR, S., 2006. Signatures of ecological resource availability in the animal and plant proteomes. Mol. Biol. Evol. 23: 1946–1951. ERDNER, D.L., PRICE, N.M., DOUCETTE, G.J., PELEATO, M.L., and ANDERSON, D.M., 1999. Characterization of ferredoxin

and flavodoxin as markers of iron limitation in marine phytoplankton. Mar. Ecol. Prog. Ser. 184: 43–53. FAUCHON, M., LAGNIEL, G., AUDE, J.C., LOMBARDIA, L., SOULARUE, P., PETAT, C., MARGUERIE, G., SENTENAC, A., WERNER, M., and LABARRE, J., 2002. Sulfur sparing in the yeast proteome in response to sulfur demand. Mol. Cell 9: 713–723. HEIZER, E.M., JR., RAIFORD, D.W., RAYMER, M.L., DOOM, T.M., MILLER, R.V., and KRANE, D.E., 2006. Amino acid cost and codon-usage biases in 6 prokaryotic genomes: a whole-genome analysis. Mol. Biol. Evol. 23: 1670–1680. KNIGHT, J.E. and HARDY, R.W.F., 1966. Isolation and characteristics of flavodoxin from nitrogen-fixing Clostridium pasteurianum. J. Biol. Chem. 241: 2752–2756. La ROCHE, J., GEIDER, R.J., GRAZIANO, L.M., MURRAY, H., and LEWIS, K., 1993. Induction of specific proteins in eukaryotic algae grown under iron-, phosphorus-, or nitrogen-deficient conditions. J. Phycol. 29: 767–777. MARTIN, J.H., GORDON, R.M., and FITZWATER, S.E., 1991. The case for iron. Limnol. Oceanogr. 36: 1793–1802. MAYHEW, S.G., and MASSEY, V., 1969. Purification and characterization of flavodoxin from Peptostreptococcus elsdenii. J. Biol. Chem. 244: 794–802. MAZEL, D. and MARLIÈRE, P., 1989. Adaptive eradication of methionine and cysteine from cyanobacterial light-harvesting proteins. Nature 341: 245–248. MCEWAN, C., GATHERER, D., and MCEWAN, N., 1998. Nitrogen-fixing aerobic bacteria have higher genomic GC content than non-fixing species within the same genus. Hereditas 128: 173–178. PANINA, E.M., MIRONOV, A.A., and GELFAND, M.S., 2003. Comparative genomics of bacterial zinc regulons: enhanced ion transport, pathogenesis and rearrangement of ribosomal proteins. Proc. Natl. Acad. Sci. USA 100: 9912–9917. PAQUIN, C. and ADAMS, J., 1983. Frequency of fixation of adaptive mutations is higher in evolving diploid than haploid yeast populations. Nature 302: 495–500. PARDEE, A.B., 1966. Purification and properties of a sulfate-binding protein from Salmonella typhimurium. J. Biol. Chem. 241: 5886–5892. POSEY, J.E. and GHERARDINI, F.C., 2000. Lack of a role for iron in the Lyme disease pathogen. Science 288: 1651–1653. RATLEDGE, C. and DOVER, L.G., 2000. Iron metabolism in pathogenic bacteria. Annu. Rev. Microbiol. 54: 881–941. RICHMOND, R.C., 1970. Non-Darwinian evolution: a critique. Nature 255: 223–225. SAITO, M.A., GOEPFERT, T.J., and RITT, J.T., 2008. Some thoughts on the concept of colimitation: three definitions and the importance of bioavailability. Limnol. Oceanogr. 53: 276–290. SELIGMANN, H., 2003. Cost-minimization of amino acid usage. J. Mol. Evol. 56: 151–161. STERNER, R.W. and ELSER, J.J., 2002. Ecological Stoichiometry: The Biology of Elements from Molecules to the Biosphere. Princeton University Press, Princeton, NJ.

Van der PLOEG, J.R., WEISS, M.A., SALLER, E., NASHIMOTO, H., SAITO, N., KERTESZ, M.A., and LEISINGER, T., 1996. Identification of sulfate starvation-regulated genes in Escherichia coli: a gene cluster involved in the utilization of taurine as a sulfur source. J. Bacteriol. 178: 5438–5446. Van MOOY, B.A.S., ROCAP, G., FREDRICKS, H.F., EVANS, C.T., and DEVOL, A.H., 2006. Sulfolipids dramatically decrease phosphorus demand by picocyanobacteria in oligotrophic marine environments. Proc. Natl. Acad. Sci. USA 103: 8607–8612. YOSHIYAMA, K. and KLAUSMEIER, C.A., 2007. Optimal cell size for resource uptake in fluids: a new facet of resource competition. Am. Nat. 171: 59–70.

Chapter 12

Protein Domains as Evolutionary Units

Andrew D. Moore and Erich Bornberg-Bauer

12.1 MODULAR PROTEIN EVOLUTION
12.2 DOMAIN-BASED HOMOLOGY IDENTIFICATION
12.3 DOMAINS IN GENOMICS AND PROTEOMICS
12.4 THE COVERAGE PROBLEM
12.5 CONCLUSION
REFERENCES

12.1 MODULAR PROTEIN EVOLUTION

Proteins are composed of subunits termed domains, which are recurrent units with distinct structure, function, and evolutionary history. At the sequence level, a domain can be described as a conserved stretch of amino acids found in various proteins. Domain signatures can be stored as profiles created from alignments of representative family members, from which hidden Markov models (HMMs) are generated, or as position-specific scoring matrices (PSSMs). A number of large databases such as Pfam (Finn et al., 2008), SMART (Letunic et al., 2006), or ProDom (Bru et al., 2005) harbor domain signatures, and the use of domains has become an essential tool of modern proteomics and genomics. However, in order to properly utilize the strength of domain-based analyses, it is of fundamental importance to understand the mechanics and selection forces that govern the evolution of multidomain architectures (MDAs).

A large body of research has explored the mechanisms that drive MDA creation and diversification (recently reviewed in Moore et al. (2008)). Protein evolution is thought to have started with a small library of domains, with structural and functional complexity increasing through the combination and rearrangement of these domains (Chothia, 1992). Complex multidomain proteins can be produced by gene fusion events, in which genes that code for simple proteins are fused, facilitating the formation of more complex architectures

Evolutionary Genomics and Systems Biology, edited by Gustavo Caetano-Anolles Copyright 2010 John Wiley & Sons, Inc.


(Kummerfeld and Teichmann, 2005; Pasek et al., 2006; Weiner et al., 2006). Multidomain proteins are decomposed by ﬁssion events giving rise to less complex architectures (Kummerfeld and Teichmann, 2005; Pasek et al., 2006; Wang et al., 2004; Weiner et al., 2006). Studies have estimated that fusion events are a major force in the evolution of proteins, more so than ﬁssion events (Ekman et al., 2007; Fong et al., 2007; Kummerfeld and Teichmann, 2005; Pasek et al., 2006). Moreover, novel MDAs frequently arise by domain insertion or deletion events at protein termini (Vibranovski et al., 2005; Weiner et al., 2006). At the sequence level, domain insertions or deletions may be governed by gene fusion events followed by simple point mutations. For example, the insertion of a premature stop signal may render a terminal domain nonfunctional and in turn facilitate rapid decay (Weiner et al., 2006; see Figure 12.1). Furthermore, exon shufﬂing bears

Figure 12.1 Schematic example of modular protein evolution. Square boxes represent domains. Connected domain boxes signify multidomain proteins; s and d boxes represent mutations that introduce premature stop signals and a duplication event, respectively. Bounded color boxes around proteins indicate participation in a fusion event. At the initial stage, most proteins consist of a single domain. As time progresses, domains are duplicated and subsequently rearranged to form different domain arrangements. Point mutations that lead to premature stop signals initiate decay of the downstream sequences and lead to eventual loss of terminal domains. For example, the domain architecture D-A-B-C-D arises by an initial fusion of the domains B and C, followed by a duplication of the domain D, which subsequently fuses with domain A. The multidomain protein B-C is fused with the single-domain protein E. The resulting B-C-E protein is duplicated and fused with B-C, followed by terminal domain loss. Finally, the concluding architecture is fused with the D-A architecture. (See insert for color representation of this figure.)


the potential to create new MDAs, at least in higher eukaryotes (Ekman et al., 2007; Vibranovski et al., 2005). Beyond linear domain rearrangement events, the sequential order of domains within an MDA can be permuted (Fliess et al., 2002; Ponting and Russell, 1995; Weiner et al., 2006). This can happen by the combination of a duplication event and a subsequent fusion event followed by losses of terminal domains (Weiner et al., 2006).

The details of the genomic mechanisms governing domain rearrangement events are difficult to trace; corresponding signals rapidly decay in the absence of selective pressure. A sequence of successive, fundamental genomic events initiated by intra- or intergenic duplication events followed by a series of mutations gives rise to events such as fragment loss, fusion/fission, or permutations. The sum of such events determines the domain-wise steps in protein evolution. While higher order "meta-events" on the level of domains are easy to detect and quantify, as they happen with low frequency and leave detectable signals (Ekman et al., 2007; Fong et al., 2007), characterizing the underlying, fundamental genetic events is more challenging and frequently amounts to detailed case studies (e.g., Wang et al., 2004).

The field of graph analysis has provided considerable insights into the modularity of domains. While a minority of domains occur in many MDAs, the majority of domains are found in only a few proteins (Apic et al., 2001a, 2001b). A projection of this phenomenon onto a network, where neighboring domains are connected by an edge, leads to a network that displays scale-free behavior, with a few highly connected nodes and many nodes with a low degree of connection (Apic et al., 2001b). The degree distribution remains consistent, independent of the number of nodes the network has, and characteristically exhibits power law behavior (Apic et al., 2001b; Bornberg-Bauer, 2002; Wuchty, 2001).
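The neighbor network described above can be made concrete with a small sketch. The Python below is an illustration under stated assumptions, not a reconstruction of the cited analyses: the architectures and domain names are toy values invented for the example, tandem repeats of one domain are ignored rather than counted as self-edges, and the function names are hypothetical. It links domains that are directly adjacent in at least one architecture and tabulates the resulting degree distribution.

```python
from collections import Counter, defaultdict


def domain_adjacency_degrees(architectures):
    """Map each domain to its number of distinct direct neighbours."""
    neighbours = defaultdict(set)
    for arch in architectures:
        for domain in arch:
            neighbours[domain]  # register the node, even if it has no edges
        for left, right in zip(arch, arch[1:]):
            if left != right:  # assumption: tandem repeats add no self-edge
                neighbours[left].add(right)
                neighbours[right].add(left)
    return {domain: len(adjacent) for domain, adjacent in neighbours.items()}


def degree_distribution(degrees):
    """Map each degree to the number of domains that have it."""
    return dict(Counter(degrees.values()))


# Toy architectures in N- to C-terminal order (hypothetical data).
toy_architectures = [
    ["SH3", "SH2", "Kinase"],
    ["SH3", "PH"],
    ["SH3", "C2"],
    ["PH", "Kinase"],
    ["Zf"],
]

degrees = domain_adjacency_degrees(toy_architectures)
distribution = degree_distribution(degrees)
print(degrees)
print(distribution)
```

A handful of proteins cannot reveal a power law; on genome-scale data the distribution would typically be inspected on a log-log plot, where a few highly connected domains and many poorly connected ones become visible.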
Whether domain rearrangement is a stochastic process or whether it is driven by selection forces that may guide the rearrangement of domains with speciﬁc intrinsic properties is a matter of dispute. Apic et al. (2001b) conducted genome-wide studies of domain combinations within the three major clades of life. Their results indicate that species unspeciﬁc domains make up the largest proportion of domains within any clade. Either older domains had more time to spread or they are more prone to recombine and be retained. Model-based studies indicate that the number of actual domain combinations is lower than what would be expected under a random model (Apic et al., 2003), supporting a nonrandom model of domain rearrangement. This is further supported by the formation of supradomains (Vogel et al., 2004), which are deﬁned as evolutionary units larger than single domains (Vogel et al., 2005), as well as by the fact that domain combinations carry phylogenetic signals that can be used to reconstruct the tree of life (e.g., Wang and Caetano-Anolles, 2006; see also Section 3.2.). Moreover, a recent study (Forslund et al., 2007) of convergent evolution of domain architectures indicates that convergence is more frequent than previously thought (Gough, 2005). In essence, it seems as if the process of domain arrangement is not purely stochastic. Furthermore, the rate of novel domain emergence tends to be low in higher eukaryotes while the rate of domain arrangement is high (Ekman et al., 2007), indicating that domain-wise rearrangements are at the heart of the formation of complex proteomes.

12.2 DOMAIN-BASED HOMOLOGY IDENTIFICATION

Proteins can be understood as strings of autonomously evolving domains. Two proteins that share domain composition can be considered to be functionally similar (Bashton and Chothia, 2007; Hegyi and Gerstein, 2001), and, as the rate of convergently arisen domain


architectures in protein evolution is low (Forslund et al., 2007), a large proportion of such proteins may also share ancestral background. While, at the sequence level, divergence can mask homology, the composition of domains defined using profiles can remain stable (Bystroff and Krogh, 2008). Furthermore, MDAs have properties that challenge established methods for homology detection, such as circular permutations (CPs) (Weiner et al., 2006), domains shared between nonhomologous proteins, or modified domain composition within a homologous set of proteins. These problems can be alleviated by using domains as informative characters; the change of domain composition of proteins over time harbors considerable evolutionary signal (Ekman et al., 2007; Fong et al., 2007; Itoh et al., 2007). Moreover, domain-wise comparison reduces the computational complexity arising from exhaustive amino acid comparisons. Hence, domains can be useful for studying homology between a set of proteins and for finding functionally similar proteins. To determine the similarity of proteins at the level of their constituent domains, measures of domain architecture similarity can be helpful.

12.2.1 Domain Architecture Similarity

Detecting homology between proteins using constituent domains resembles a problem known from information retrieval, that is, detecting topic overlap between two bodies of text based on their word composition. Proteins sharing common domains are functionally similar, much like two bodies of text that share the same words revolve around the same subject. In principle, a simple measure of domain architecture similarity treats each domain as a complete character, similar to a word. However, four obstacles must be considered. First, some domains are found in many different proteins, making them less informative, analogous to a preposition, whereas others are more informative, such as verbs. Second, some domains occur frequently as repeats within proteins, rendering scoring more difficult. Third, proteins can vary strongly in the total number of domains they contain, and comparison of domain architectures of unequal length is desirable. Fourth, domains can vary in length; equal scoring of a match between a long domain and a short domain may not be meaningful. These issues can be addressed using different strategies. The interested reader should refer to studies that apply various measures of domain architecture similarity to a defined data set (Lin et al., 2006; Song et al., 2007). When comparing domain strings, the considerations mentioned above manifest themselves in the form of weights that are applied to each domain found in the proteins compared. Simply put, the score between two proteins $p_1$ and $p_2$ can be expressed as

$$S(p_1, p_2) = \sum_i w(d_i, p_1)\, w(d_i, p_2) \tag{12.1}$$

where $w(d, p)$ signifies the weight of domain $d$ in protein $p$, with $w(d_i, p) = 0$ if domain $d_i$ is not present in protein $p$. The first consideration, differences in domain versatility (Weiner et al., 2008) or promiscuity (Basu et al., 2008), makes some domains more informative than others. Small, structurally compact domains, such as Src homology domains, are known to combine with a wide array of domains (Marcotte et al., 1999). Therefore, two proteins containing such a promiscuous domain are not necessarily homologous and can be functionally divergent (Marcotte et al., 1999). To alleviate this issue, a weight can be applied to domains that are


frequently found in all proteins. The rationale is that if a domain is found throughout all compared proteins, it is a poor discriminator and thus less informative than a domain occurring only in some proteins. A weight frequently applied in the field of natural language processing is the inverse document frequency (Salton and Buckley, 1987), of which many different formulations exist; Song et al. (2007) use the inverse document frequency and define the idf weight for a domain $d$ as

$$\mathrm{idf}(d) = \log_2 \frac{|P|}{|\{p \mid d \in D(p),\ p \in P\}|} \tag{12.2}$$

where $P$ is the set of proteins contained within the whole data set and $D(p)$ signifies a bag of domains, that is, a multiset consisting of all domains $d$ that are constituents of $p$. Ergo, the denominator signifies the number of proteins $p \in P$ of which $d$ is a constituent domain. Other weights could be applied here, such as measures of domain versatility. Frequently used measures of domain versatility include the number of neighbors (NN), defined as the number of direct N- and C-terminal neighbors a given domain has (Apic et al., 2003), and the number of co-occurrences (NCO), defined as the number of co-constituent domains a given domain has in all proteins in which it occurs (Wuchty, 2001). However, these values all strongly correlate with domain abundance (Weiner et al., 2008). A domain that is abundant in a genome has higher values for NN and NCO, simply because it is more widely spread. A domain that has a low frequency within a genome can nonetheless be very versatile and form many combinations. Weiner et al. (2008) developed a simple measure of domain versatility, the domain versatility index (DVI), where domain versatility is understood as the strength of the relationship between NCO and NN. The DVI differs for different domains and could equally be used to weight a domain. Other values, such as the distinct partner weight defined by Song et al. (2007), can similarly be used to adjust for domain versatility. This helps decrease the impact of versatile domains on the score, as they have the potential to suggest homology despite different ancestral backgrounds. The second consideration concerns repeats. Certain domains are known to occur as repeats; some repeats evolve with a pronounced regularity by duplication of units containing more than a single domain (Björklund et al., 2006). Different strategies to score such repeats exist.
One possibility, used in the case of tandemly repeated domains, is to collapse a stretch of repeating domains and treat them as a single occurrence (Ekman et al., 2007). The rationale for this is that the number of short repeats found in stretches may vary between homologues, such as in proteins containing zinc finger domains (Böhm et al., 1997). A second, opposing possibility assumes that if a domain is found to be repeated in proteins that are to be compared, it could be of particular importance. Similarly, if a nonprepositional word is found to be repeated multiple times throughout a text, it may contribute to the subject in a special way. In information retrieval, this property is measured using the term frequency, adapted by Song et al. (2007) as

$$w_{\mathrm{tf}}(d, p) = \frac{N(d, p)}{|D(p)|} \tag{12.3}$$

where N(d, p) signiﬁes the number of domains d in protein p and |D(p)| denotes the total number of domains in protein p. The term frequency and the above-described inverse document frequency can be considered together; domains that are very frequent throughout all proteins can be deemphasized, while domains that are particularly frequent in compared


proteins, a subset of all proteins, can be emphasized. In information retrieval, this is achieved by using the tf-idf weight, of which many formulations exist. High values of tf-idf are obtained by a high frequency of a domain in a protein and a low frequency of that domain in the entire data set. Thus, this weight tends to weed out nondiscriminating domains. Neither of the aforementioned weighting strategies is geared toward weighting a match between two domains; they assign an initial value of reliability to each domain based on, for example, the frequency of the domain within a data set (such as a genome). However, the similarity between two domain arrangements can be detected using a classical dynamic programming scheme, where a match between two domains is assigned a score, which in turn can be weighted. The fourth consideration from above regards the length of matching domains. When comparing two proteins, matches between long domains should score higher than matches between short domains, as the match covers a larger part of the primary structure. This could be achieved with

$$S_w(d_1, d_2) = S(d_1, d_2)\, \frac{l_{\min}(d)}{100} \tag{12.4}$$

where $S_w$ is the weighted score of the match between $d_1, d_2 \in D$, $S$ is the unweighted score, and $l_{\min}(d)$ is the length of the shorter of the two domains $d_1$ and $d_2$. The length of the match is limited by the shorter domain, hence $l_{\min}(d)$: the longer the shorter matching domain is, the better the score. Moreover, a match between two domains of equal length should score higher than a match between domains of different length. In analogy to the above, this could be defined as

$$S_w(d_1, d_2) = S(d_1, d_2)\, \frac{l_{\min}(d)}{l_{\max}(d)} \tag{12.5}$$

where $l_{\max}(d)$ returns the length of the longer of the two domains $d_1$ and $d_2$; while a match between domains of equal length leaves the score unweighted, differences in length will reduce the score. Finally, there should be a way of comparing proteins that contain a different number of domains, as homologous proteins may exhibit different domain counts. Such differences can be taken into account using a simple, unweighted similarity value given by the Jaccard similarity coefficient

$$J(p_1, p_2) = \frac{|N(d, p_1) \cap N(d, p_2)|}{|N(d, p_1) \cup N(d, p_2)|} \tag{12.6}$$

where the numerator denotes the number of domains shared between the proteins $p_1$ and $p_2$ (the domain intersection of $p_1$ and $p_2$), and the denominator denotes the number of distinct domains found in either protein (the domain union of $p_1$ and $p_2$). Beyond the approaches exemplified above, other measures exist that allow the assessment of domain architecture similarity. For example, the tf-idf weight mentioned above can be used in combination with the cosine similarity (Lee and Lee, 2008; Song et al., 2007; Yandell and Majoros, 2002), where proteins are understood as weighted vectors in n-dimensional space, with n equal to the number of unique domains within the complete data set. In this "domain space," the weights for each vector determine its direction, and the similarity between two proteins is given by the angle between the two vectors. Other strategies for assessing domain architecture similarity exist, of which some combine or average different weighting


schemes. It should be noted that domain architecture similarity can also be expressed as the edit distance between two domain architectures. Bjørklund et al. (2005) defined the domain distance using a simple dynamic programming scheme. The domain distance is calculated by considering each domain as a symbol and conducting an alignment between domain architectures using the string of symbols. The domain distance between two architectures is considered to be the number of unmatched domains in the alignment. This approach considers domain order (Bjørklund et al., 2005) and can similarly be weighted (Song et al., 2007). Regardless of the strategy chosen to quantify domain architecture similarity, an adequate measure should attempt to incorporate both weights based on a priori knowledge regarding the domains within the compared arrangements and weights applied a posteriori to specific domain matches occurring in multidomain proteins. Naturally, similarity between single-domain proteins can only be uncovered at the sequence or structure level.
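Several of the measures discussed in this section are compact enough to sketch in code. The Python below is a minimal illustration, not the implementation used by any of the cited studies: the data set is a toy example with hypothetical domain labels, the function names are invented, proteins are assumed to be non-empty domain lists, and the Jaccard coefficient is computed on sets of distinct domains. The domain distance is taken here, as an assumption, to be the number of domains left unmatched by an order-preserving alignment, which equals the two lengths combined minus twice the longest common subsequence.

```python
import math


def idf(domain, proteins):
    """Eq. 12.2: log2(|P| / number of proteins that contain the domain)."""
    containing = sum(1 for doms in proteins.values() if domain in doms)
    return math.log2(len(proteins) / containing)


def tf(domain, doms):
    """Eq. 12.3: copies of the domain over all domains in the protein."""
    return doms.count(domain) / len(doms)


def tfidf_vector(doms, proteins):
    """Sparse tf-idf weight vector of one protein, keyed by domain."""
    return {d: tf(d, doms) * idf(d, proteins) for d in set(doms)}


def weighted_score(v1, v2):
    """Eq. 12.1: sum of weight products over domains present in both."""
    return sum(w * v2[d] for d, w in v1.items() if d in v2)


def cosine(v1, v2):
    """Cosine of the angle between two weight vectors (0.0 if one is null)."""
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return weighted_score(v1, v2) / (norm1 * norm2) if norm1 and norm2 else 0.0


def jaccard(doms1, doms2):
    """Eq. 12.6 on sets of distinct domains: |intersection| / |union|."""
    s1, s2 = set(doms1), set(doms2)
    return len(s1 & s2) / len(s1 | s2)


def domain_distance(a, b):
    """Unmatched domains in an order-preserving alignment (assumed form)."""
    lcs = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    return len(a) + len(b) - 2 * lcs[-1][-1]


# Toy data set: protein name -> domains in N- to C-terminal order.
proteins = {
    "p1": ["A", "B", "C"],
    "p2": ["A", "B", "D"],
    "p3": ["A", "E"],
    "p4": ["C", "B", "A"],
}
vectors = {name: tfidf_vector(doms, proteins) for name, doms in proteins.items()}

print(cosine(vectors["p1"], vectors["p3"]))  # only the ubiquitous domain A is shared
print(jaccard(proteins["p1"], proteins["p2"]))
print(domain_distance(proteins["p1"], proteins["p4"]))
```

In this toy set, domain A occurs in every protein and therefore receives an idf of zero, so p1 and p3 score no similarity despite sharing it. The order-insensitive measures treat p1 and p4 as identical, whereas the order-aware domain distance separates them.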

12.2.2 Domain Resources and Domain-Based Search

As previously established, domain arrangements can be used to detect protein homology. Fewer characters required for comparison, the ability to detect even distantly related sequences, and the capability to deal with nonlinear rearrangement events within a feasible timescale make homology detection based on domain string comparison particularly attractive. This can be done, for example, using a simple dynamic programming scheme (Bjørklund et al., 2005). One of the most frequent uses of domain-based homology detection is to extract functional information from proteins. It is generally accepted that the larger the number of domains shared between two arrangements, the more these proteins can be assumed to be functionally equivalent (Hegyi and Gerstein, 2001). Functional aspects of proteins can be explored using domain databases. While some domain resources specialize in domain definitions, whether manually curated or automatically derived, other resources rely on predefined domain families and provide methods to interface with the data.

Pfam is one of the most comprehensive protein family databases today. It contains manually curated seed alignments from representative family members that are used to create high-quality HMM profiles (Eddy, 1998) by an iterative search procedure against SwissProt, the underlying sequence database. Query sequences are annotated by running hmmpfam against the Pfam models. Sequences that cannot be annotated using the high-quality models (PfamA) are annotated using automatically derived signatures recruited from ProDom (see below) that are provided as PfamB. Users can explore proteins that have domains in common with the query. For each domain, the seed alignments can be explored, as well as the taxonomic distribution of all sequences in SwissProt that have contributed to the model refinement. Pfam can be supplemented by PfamAlyzer (Hollich and Sonnhammer, 2007), a stand-alone application that provides an interface to commonly used Pfam features.
PfamAlyzer allows querying SwissPfam using domain arrangements and supports browsing architectures by taxonomy, which simplifies the analysis of architectural diversification.

The conserved domain database (CDD) (Marchler-Bauer et al., 2007) is a domain database hosted at the NCBI. CDD contains PSSMs based on alignments from Pfam and SMART, as well as some of NCBI's own resources. CD search is a tool that supports searching for sequences that share domain architecture with a query sequence. The initial step of this procedure is the annotation of a query amino acid sequence using CDD's PSSM models. At the core of this annotation is RPS-BLAST (reverse position-specific BLAST) (Marchler-Bauer et al., 2002); RPS-BLAST searches with a sequence against a database of


profiles. After the query sequence has received domain annotation, the resulting domain architecture is used to search against CDD's collection of PSSM models. For each arrangement in the search results, only a single representative instance is reported and links to identical architectures are provided. Moreover, CDD provides a tool geared toward homology search based on domain architecture. CDART (Geer et al., 2002) allows users to perform a similarity search against the Entrez protein database (Wheeler et al., 2008). Precomputed CD search results are queried to allow rapid identification of proteins that share a common domain set with the query. All proteins in which the domain architecture overlaps with the query in at least one domain are reported, and results are ranked by the total number of domains shared with the query. Users can restrict the search to a given taxonomy or search within clusters of any one of the constituent domains of the original query (taxonomic and domain subsetting, respectively).

While Pfam and CDD rely on manual inspection of initial alignments, other resources do not. An example of an automatically derived domain database is ProDom. ProDom is based on a PSI-BLAST (Altschul et al., 1997) procedure termed MKDOM2 (Gouzy et al., 1999) that clusters homologous sequence fragments into domain families. The underlying data set is a nonredundant representation of SwissProt and TrEMBL. Furthermore, an alternative data set is provided that contains completely sequenced genomes (ProDomGC). Users can explore proteins that contain a given domain, see the distribution of families in a tree, and, when using ProDomGC, explore where a given nonunique domain has emerged. Similar to ProDom, EVEREST (Portugaly et al., 2007) uses an automated scheme to define domain families and does not require manual curation.
Domains are deﬁned using an iterative procedure, where protein sequences from UniProt (Wu et al., 2006) and PDB (Henrick et al., 2008) are subjected to an all-versus-all comparison to identify conserved sequence segments. Segments are subsequently clustered, quality assessed and used as the basis for building HMMs. Iteration over the database reﬁnes these models. Beyond EVEREST domain deﬁnitions, users may select external domain deﬁnition databases such as SCOP (Andreeva et al., 2004) or Pfam from a list. When annotating a query sequence, the accuracy of the domain deﬁnitions is assessed using a custom scoring scheme that scores domains based on the overlap with domain deﬁnitions of Pfam, SCOP, and CATH (Greene et al., 2007). Details on the procedure of annotation and scoring are described in Portugaly et al. (2006). Overlapping domain annotations are permitted and queries can be complex; constraints can be set on which database is searched, which resources are allowed to overlap, which taxa are to be included, minimal domain length, and more. Another domain database that features domain-based search is SIMAP (similarity matrix of proteins) (Rattei et al., 2008). SIMAP is a resource that covers more than 17 million proteins and relies on InterPro (Mulder et al., 2007) domain signatures. It is aimed toward providing a precalculated set of features for all protein sequences found within the major sequence databases. SIMAP features a domain similarity tool that utilizes InterPro annotation to detect homology between multidomain proteins. As one of the few public resources, SIMAP attempts to quantitatively describe the evolutionary distance between two domain architectures. Domain-based search can be used to reﬁne the sorting of homologues as well as for the detection of distant homologues that may be missed by traditional methodologies. 
Besides web services that contain domain deﬁnitions and support queries, there are services that are geared toward homology detection and domain architecture comparison. A web service termed Protein Domain Architecture Retrieval Tool (PDART) (Lin et al., 2006) allows the comparison between domain architectures using domain architecture similarity


values. Using PDART, users can search within data resources such as Pfam for proteins with identical or similar domain architecture. Furthermore, evolutionary relationships between domain architectures can be computed using a distance matrix on which neighbor-joining clustering is performed; the results are displayed in the form of a dendrogram. An alternative service is DAhunter. DAhunter is based on RefSeq (Pruitt et al., 2007) sequences that are annotated using Pfam domain deﬁnitions. A query sequence is annotated using hmmpfam against Pfam domain proﬁles, and DAhunter retrieves candidate homologous proteins that share domains with the query architecture. The candidate architectures are subsequently compared with the query architecture and the degree of homology is assessed using three individual scores that are averaged to a global similarity score. However, the service does not readily support the submission of multiple sequences, and, for long amino acid sequences, the search can take many minutes. Many more resources exist (e.g., Gough and Chothia, 2002; Yeats et al., 2008) and efforts are being made to combine existing resources. InterPro is an integrated domain database that combines existing protein signature resources, ranging from motifs and functional sites to domains, and maps them to speciﬁc IDs. Currently, InterPro combines 10 resources and features extensive cross-links between all databases. Query sequences can be used to search against UniProt using InterProScan, a utility that combines all detection tools on which the cross-linked resources are built.

12.2.3 Deciphering Circular Permutations with Domains

Proteins can be related via nonlinear rearrangement events that are difﬁcult to detect using established methods. An example of such a nonlinear arrangement is a circular permutation (CP). A CP is a cyclic rearrangement of subsequences within a protein in such a way that the N-terminal region of a protein is transferred to the C-terminus (Jeltsch, 1999; Ponting and Russell, 1995; Weiner et al., 2005). A circular rearrangement of a protein with the arrangement A-B-C would facilitate a homologue with the domain architecture B-C-A. One of the mechanisms that can give rise to such variants assumes a duplication event forming a protein A-B-C-A-B-C followed by domain loss events at either terminus. Although CPs are not widely spread, they are of particular interest as they help elucidate how domain-wise evolution occurs. At the level of the genome, genes get duplicated, undergo fusion and ﬁssion events, and are challenged by random mutations. At the protein level, such incidents amount to domain rearrangements. Intermediate states of permutation, the so-called iCPs (Weiner et al., 2005) (such as A-B-C-A-B following the example above), can be seen as traces of the underlying genetic mechanics that fuel domain-wise rearrangements. The detection of CPs is a demanding task; the loss of linearity in change challenges algorithms such as BLAST. Homologues may go undetected despite the fact that proteins harboring the same domains in different N- to C-terminal order can fulﬁll the same tasks (Cheltsov et al., 2001). An exact algorithm for detection of CPs on the level of amino acids is simple, yet requires exhaustive comparisons. All possible CPs of one protein sequence could be generated and subsequently compared to a second protein sequence. If a CP scores higher than a simple alignment, a potential CP has been detected. Such an approach becomes computationally demanding for a large number of proteins. 
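The exhaustive procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions: the scoring values (match 1, mismatch -1, gap -1) are placeholders rather than a substitution matrix suited to real sequences, the two sequences are invented, and the function names are hypothetical. Every rotation of the first sequence is aligned globally against the second, and a rotation that outscores the unrotated alignment is flagged as a candidate CP.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global (Needleman-Wunsch) alignment score, computed row by row."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        curr = [i * gap]
        for j, y in enumerate(b, 1):
            diagonal = prev[j - 1] + (match if x == y else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def best_circular_permutation(a, b, **scoring):
    """Return (unrotated score, best score, cut point giving the best score)."""
    linear = nw_score(a, b, **scoring)
    best_score, best_cut = linear, 0
    for cut in range(1, len(a)):
        rotated = a[cut:] + a[:cut]  # move the N-terminal part to the C-terminus
        score = nw_score(rotated, b, **scoring)
        if score > best_score:
            best_score, best_cut = score, cut
    return linear, best_score, best_cut


# Two invented sequences related by a rotation after position 6.
seq1 = "MKVLAAGIVGHEQR"
seq2 = "GIVGHEQRMKVLAA"
linear, best, cut = best_circular_permutation(seq1, seq2)
is_candidate_cp = best > linear
print(linear, best, cut, is_candidate_cp)
```

For a pair of sequences of length L this performs L alignments that each fill an L-by-L table, which is what makes the approach expensive across many proteins. The same functions accept lists of domain identifiers instead of residue strings, which shortens the input from hundreds of characters to a handful of symbols.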
Heuristic approaches to CP detection have been suggested (Uliel et al., 1999), but domains, as the units of protein evolution, provide an elegant and exact solution to CP detection (see Figure 12.2). A simple Needleman–Wunsch alignment scheme that uses domains as symbols and exploits a simple pattern in a quadrupled alignment lattice allows accurate CP detection up to 50,000 times faster than existing sequence-based algorithms (Weiner et al., 2005).

Chapter 12

Protein Domains as Evolutionary Units

Figure 12.2 Tracing circular permutations using domains. RASPODOM is a variant of the popular Needleman–Wunsch algorithm used for global alignments. The first row/column of the lattice is initialized with zeros. Vertical and horizontal numbers correspond to domain IDs. Matching domains score 10, mismatches score 1, and gap penalties are set to 5. The key to CP detection is the arrangement of the alignment lattice into four quadrants, each of which contains one of the two proteins represented as a string of domains. The top left (TL) and top right (TR) quadrants both contain one protein; the bottom left (BL) and bottom right (BR) quadrants contain the other protein. A match is found when two cells share the same domain, represented by identical IDs. The detection of CPs and iCPs between two proteins begins with a complete alignment over all four quadrants followed by a modified traceback procedure. In the four-quadrant layout, a CP will appear as precisely two alignment paths: one path passes through the TR, TL, and BR quadrants, while another passes through the TR, BL, and BR quadrants. Both of these paths will pass through the corresponding cells in TL and BR. Moreover, one match will be found in the TL, BR, and one of the TR or BL quadrants. These patterns in the four-quadrant lattice appear due to the matching states of the N- and C-terminal domains, and indicate a true CP; iCPs manifest themselves as suboptimal paths with regard to the paths stated above. While two proteins that share domains but are not related by a CP will exhibit match states, there will be no paths that pass through two quadrants (Weiner et al., 2005). (See insert for color representation of this figure.)
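Why domains make CP detection cheap can be illustrated with a much simpler check than the four-quadrant alignment. The sketch below is not RASPODOM; it is an exact-match simplification that treats each architecture as a list of domain IDs and uses the duplicated architecture (A-B-C-A-B-C in the example of this section) as a template. The function names are hypothetical, and the definition of an intermediate state used in `is_intermediate` (a contiguous stretch of the duplicated architecture that is longer than the original but shorter than the full duplication) is an assumption derived from the duplication and loss model described above.

```python
def find_sublist(haystack, needle):
    """Index of the first contiguous occurrence of `needle` in `haystack`, or -1."""
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1


def circular_shift(arch_a, arch_b):
    """Return k such that arch_a[k:] + arch_a[:k] == arch_b, or None.

    A result of 0 means the architectures are identical; any other value
    means arch_b is a circular permutation of arch_a.
    """
    if not arch_a or len(arch_a) != len(arch_b):
        return None
    doubled = arch_a + arch_a  # e.g. A-B-C-A-B-C
    index = find_sublist(doubled, arch_b)
    return index if index != -1 else None


def is_intermediate(arch_a, candidate):
    """True if `candidate` looks like an intermediate (iCP) of arch_a.

    Assumption: an intermediate is a contiguous part of the duplicated
    architecture with more domains than arch_a and fewer than two copies.
    """
    if not len(arch_a) < len(candidate) < 2 * len(arch_a):
        return False
    return find_sublist(arch_a + arch_a, candidate) != -1


if __name__ == "__main__":
    print(circular_shift(["A", "B", "C"], ["B", "C", "A"]))
    print(is_intermediate(["A", "B", "C"], ["A", "B", "C", "A", "B"]))
```

Because the comparison runs over a handful of domain symbols instead of hundreds of residues, each pairwise check is very small. Unlike an alignment-based method, this exact-match version tolerates no inserted, lost, or substituted domains.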

12.3 DOMAINS IN GENOMICS AND PROTEOMICS

Domains and domain rearrangements are being used to explore aspects of protein evolution as well as the evolution of the genome. In genome-scale studies, domains are frequently used to provide functional annotation. The Gene Ontology (GO) (Harris et al., 2004) provides a convenient method for comparing functional contents across different genomes, and efforts have been undertaken to link domains to GO terms. A map from InterPro entries to GO terms is created manually. Curators compare description lines of proteins with similar functionality associated with a given InterPro entry, attempt to identify common annotations between such proteins, and manually map the InterPro entry to the GO graph at the most specific level that is still descriptive of all proteins associated with the particular entry (Camon et al., 2003). Such manual curation cannot easily keep pace with the speed at which new data become available, and methods for automated domain-to-GO term mapping have been developed. For example, Schug et al. (2002) developed a heuristic method that maps GO terms from the molecular function ontology to ProDom and CDD-defined domains. Therein, the intersection between the functional assignments of a gene and the gene's respective domain content is used to automatically generate rules that allow GO term-to-domain association. Hayete and Bienkowska (2005) exploited protein architectures to train decision tree classifiers for GO terms. Once established, the functional domain composition of proteins can be used for a wide variety of inferences. For example, Chou and Cai (2002) used the functional domain composition of a protein and support vector machines to predict the subcellular location of proteins. Similarly, Qian et al. (2006) used functional domain composition to detect transcription factors. Such studies indicate that the use of functional domains, which are closely linked to a biological function of proteins, allows the description of protein properties in terms of small discrete numbers and thus supports extensive, computationally expensive analysis. Other studies have explored domain–domain interaction (Lee et al., 2006; Ng et al., 2003) or have used domains for chromosome comparisons (Pasek et al., 2005). Moreover, domain-based studies have helped uncover findings of medical relevance (Friedrichs et al., 2008). Lucas et al. (2006) used domain graphs to analyze co-composite domains, helping to link RBR supradomains found in E3 ligases to domains involved in RNA binding and metabolism.
Furthermore, the detection of pathogenic proteins has been enhanced by building specialized models. Szczesny and Lupas (2008) developed specialized, manually curated HMM-based domain profiles. These profiles can be used to identify trimeric autotransporter adhesins, proteins that provide the means by which some pathogens adhere to their host cells. Existing models failed to annotate these sequences because of their high degree of diversity, large number of repeats, unusual coiled coils, and large amount of low-complexity regions. Such studies indicate that domains are used beyond the boundaries of evolutionary studies. Nonetheless, while domains can be used for a wide variety of studies, a frequent application is tree construction, which is often conducted in order to study the details of domain architecture evolution.
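The idea of deriving domain-to-GO rules from the intersection of functional assignments, mentioned earlier in this section, can be made concrete with a toy example. The sketch below is an assumed simplification and not the heuristic of Schug et al. (2002): for every domain it intersects the GO term sets of all proteins that carry the domain and keeps the terms shared by all of them. The function name, the `min_support` parameter, and the placeholder identifiers (D1, GO:A, and so on) are hypothetical.

```python
from collections import defaultdict


def domain_go_rules(proteins, min_support=2):
    """Derive candidate domain -> GO term rules by set intersection.

    proteins: dict mapping a protein name to (set of domain IDs, set of GO terms).
    min_support: minimum number of annotated proteins that must carry a domain
                 before a rule is proposed (an illustrative safeguard).
    """
    carriers = defaultdict(list)
    for domains, go_terms in proteins.values():
        for domain in domains:
            carriers[domain].append(set(go_terms))

    rules = {}
    for domain, term_sets in carriers.items():
        if len(term_sets) < min_support:
            continue  # too little evidence to propose a rule
        shared = set.intersection(*term_sets)
        if shared:
            rules[domain] = shared
    return rules


if __name__ == "__main__":
    toy = {
        "p1": ({"D1", "D2"}, {"GO:A", "GO:B"}),
        "p2": ({"D1"}, {"GO:A"}),
        "p3": ({"D2", "D3"}, {"GO:B", "GO:C"}),
    }
    print(domain_go_rules(toy))
```

A strict intersection is sensitive to a single incomplete annotation; a practical method would more likely require a term to be present in a high fraction of carriers rather than in all of them.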

12.3.1 Building Domain Trees

The details of protein evolution can be inferred by studying domain trees. For example, Bjørklund et al. (2005) used the domain distance between domain arrangements, defined as the number of unaligned domains in an alignment between two protein architectures, to create trees of protein families. The trees were created using neighbor-joining (Saitou and Nei, 1987), where domain deletion or insertion events gave rise to branches. Using this methodology, Bjørklund et al. (2005) elucidated the domain-wise events in the evolution of two large protein families. Here, the calculated tree need not correspond to a correct species tree, but it does allow (a) the inference of ancestral domain architectures and (b) the mapping of domain-wise events to nodes (Figure 12.3).

Figure 12.3 Using the edit distance between domain architectures for tree construction. For proteins P1–P6, the domain architecture is indicated as boxes. Differences in domain architectures can be exploited for tree construction. The edit distance between domain architectures, that is, the number of edit steps necessary to move from one architecture to another, is extracted (a) and used to construct a distance matrix (b). This distance matrix is then used for tree construction (here: program neighbor from the PHYLIP package; Felsenstein, 1989). Possible domain-wise events can be extracted from the resulting tree. Phylogenetic profiling prior to tree construction can help fix nodes at which certain domains emerge. Furthermore, a detailed understanding of protein evolution could improve tree construction. For example, given that many studies have shown that domains are gained and lost at protein termini, scenarios that suggest nonterminal changes could be penalized.

An alternative way to infer ancestral architectures is to map domain architectures onto an existing species tree and to use ancestral reconstruction approaches such as maximum parsimony to reconstruct the ancestral state. Fong et al. (2007) conducted a study using 159 proteomes covering all major branches of the tree of life. Using these data, the phylogeny of 85% of all domain architectures could be reconstructed. The methodology assumes that the evolutionary scenarios that require the fewest fusion and fission events are the most likely. A modified maximum parsimony approach was used to allow for nondichotomic trees, and results were verified using alternative parsimony rules. Their results confirm earlier studies (Bjørklund et al., 2005; Pasek et al., 2006; Weiner et al., 2006), supporting the view that most arrangements arise over time by a sequence of simple domain gain and loss events, where fusion seems more important for the creation of novel arrangements than fission. A more rigorous analysis of modular protein evolution using domain trees created by maximum parsimony was conducted by Forslund et al. (2007). In particular, their study was geared toward the identification of convergence in domain architecture creation. In order to detect such events of convergence, the domain trees of each protein family were built separately. Ancestral arrangements were inferred following the parsimony criteria; one arrangement was chosen randomly if arrangements at a given node shared the same cost. At each terminal node of the created tree, the number of identical architectures arising at any other node was counted, leading to upper bound estimates of 12% convergence in domain architecture evolution.

Domain trees are also being used to reconstruct the tree of life. The use of single genes to construct trees is known to be difficult, particularly due to horizontal gene transfer events, hidden paralogy, or variance in evolutionary speed. In fact, trees constructed from different genes can be incompatible (Wolf et al., 2002) as they tend to show the evolutionary history of the gene more than the history of the organism. Hence, in the postgenomic era, attempts have been made to include genome-scale data sets to construct "phylogenomic trees", for example, by creation of supertrees based on several individual gene trees (Daubin et al., 2002). Domain trees can be used to construct species trees, exploiting the evolutionary signal concealed within domain architectures across genomes. Fukami-Kobayashi et al. (2007) abstracted the comparison of gene contents between genomes by using protein domains. Evolutionary distances were computed based on the number of domain arrangements, and a species tree was calculated using neighbor-joining and evaluated using bootstrapping. The constructed tree was mostly consistent with the current view of the species tree, indicating that domains, as the unit of protein evolution, can be used to simplify the study of genome evolution. Wang and Caetano-Anollés (2006) constructed different phylogenomic trees based on domain pairings, domain combinations, and presence or absence of domains in 185 fully sequenced organisms. Tree construction was done using both maximum parsimony supported by bootstrapping and neighbor-joining. The phylogenomic patterns were then compared.
In summary, the constructed consensus tree confirms the tripartite division of life, with interesting exceptions such as the grouping of chordates with arthropods within the eukaryotic clade (the so-called Coelomata hypothesis). Furthermore, the authors provide support for the observed increase in domain combinations within the eukaryotic clade, in particular within the metazoan subclade (Bjørklund et al., 2006). Domains derived from sensitive models are more conserved than sequences. Therefore, domains are meaningful characters for studying distant relationships. Moreover, the use of domains alleviates hidden paralogy and site-specific evolutionary rates, both of which complicate tree reconstruction. Typically, for genome-scale data sets, the use of nondistance-based methods for phylogenetic reconstruction, such as maximum parsimony or likelihood methods, is not feasible due to the large tree space and hard optimization problems. While the use of domains as informative characters alleviates this problem to some degree, especially if a large number of taxa are used, it is not applicable to single-domain proteins, nor to proteins that are very closely related and share a common domain architecture.
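The distance-based route to domain trees discussed in this section (Figure 12.3) starts from pairwise distances between architectures. The following sketch computes an edit distance between two architectures, each given as a list of domain IDs, and assembles a symmetric distance matrix. It assumes unit costs for inserting, deleting, or substituting a single domain, which the text does not specify; the function names are hypothetical. The resulting matrix could then be passed to a neighbor-joining implementation such as the neighbor program of PHYLIP mentioned in the figure caption.

```python
def edit_distance(arch_a, arch_b):
    """Levenshtein distance between two domain architectures (lists of IDs).

    Assumes unit cost for insertion, deletion, and substitution of one domain.
    """
    prev = list(range(len(arch_b) + 1))
    for i, dom_a in enumerate(arch_a, 1):
        curr = [i]
        for j, dom_b in enumerate(arch_b, 1):
            curr.append(min(
                prev[j] + 1,                    # delete dom_a
                curr[j - 1] + 1,                # insert dom_b
                prev[j - 1] + (dom_a != dom_b)  # match or substitute
            ))
        prev = curr
    return prev[-1]


def distance_matrix(architectures):
    """Return (sorted protein names, square matrix of pairwise edit distances)."""
    names = sorted(architectures)
    matrix = [[edit_distance(architectures[x], architectures[y]) for y in names]
              for x in names]
    return names, matrix


if __name__ == "__main__":
    toy = {"P1": ["A", "B", "C"], "P2": ["A", "B"], "P3": ["B", "C", "D"]}
    names, matrix = distance_matrix(toy)
    for name, row in zip(names, matrix):
        print(name, row)
```

A cost scheme that penalizes nonterminal changes more heavily than terminal gains and losses, as suggested in the caption of Figure 12.3, would require replacing the unit costs with position-dependent ones.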

12.4 THE COVERAGE PROBLEM

The basis of most domain detection methods is a sensitive, position-specific search method based on either position-specific scoring matrices or Hidden Markov Models. However, by using such approaches only roughly half of a proteome's residue content can be accurately assigned to a domain. This poses a challenge to domain-based, genome-scale analysis as much of the potential signal is lost. A wide array of methods is available to increase domain coverage. One method is to allow for domains of varying quality, where unassigned regions may be reannotated using less well-characterized domain signatures (such as in the case of PfamA and PfamB) (Figure 12.4). Another approach can be used to increase annotation levels within repeats. Within repeating domains, E-value constraints can be loosened if N- and C-terminal flanking domains with significant E-values can be detected. Such approaches have proved useful and can increase domain coverage by 40% in some families (Bjørklund et al., 2006). Contextual information can also be utilized to assign domains. When comparing two architectures, it may happen that one arrangement contains an unassigned region of a certain length, while the other arrangement contains a detectable domain below cutoff. Domain detection may find a corresponding domain within the unassigned region of the first protein, but decayed with respect to the original domain signature. In such cases, as within repeats, the annotation can be repeated with less stringent cutoff values. Moreover, context information can be utilized to transfer domain annotation. By clustering similar arrangements and transferring contextual information between arrangements within one cluster, annotation artifacts can be uncovered and the proportion of unassigned regions can be reduced. Such efforts aid automatically annotated databases that typically boast higher coverage at the cost of quality (Beaussart et al., 2007).

Figure 12.4 Context-dependent analysis reveals annotation artifacts and evolutionary events. Boxes represent domains and lines correspond to amino acid sequences. Proteins are clustered based on their domain contents. Considering the context of domains in proteins makes it possible to detect and differentiate between evolutionary events and annotation artifacts. (a) Evolutionary events. Substitution events are considered as such when, at a given position within the architecture, a different domain is found in comparison to the rest of the cluster (E-value (B, D) > 1). Shadow domains signify that a given position in a protein p1 harbors an identifiable domain B, while the corresponding stretch of sequence in p2 is similar neither to B nor to any other known domain (E-value (B, seq) > 1). A physical deletion is detected when protein p2 contains, instead of a domain B, a small linker sequence of <20 aa between the domains that are the N- and C-terminal neighbors of domain B in p1. (b) Annotation artifacts. A camouflage is detected if two domains B and D share a high degree of similarity (E-value (B, D) < 1) and are found at the same position within the architecture. Domain erosion events are signified by the identification of a domain B in p1 to which no corresponding domain can be found at the same position in p2, despite high similarity between the annotated domain B in p1 and the corresponding sequence stretch in p2 (E-value (B, seq) < 1). All of these events result in an apparent domain deletion event of B.

A large part of even very diverse genomes can be annotated using existing domain models. Nonetheless, kingdom-specific domain libraries such as FPfam (Alam et al., 2007), a library of HMMs specialized for fungal proteomes, have demonstrated the potential of specialized models by achieving higher, more sensitive domain coverage. Despite the efforts undertaken to increase domain coverage, roughly one-fourth of the proteomes remains unannotated. Such unannotated regions can vary in length, ranging from small flexible linker regions between domains to larger, so-called orphan domains (Ekman et al., 2005). Orphan domains, in analogy to ORFans (Fischer and Eisenberg, 1999), are regions within protein sequences that have no known homology to other proteins (Ekman et al., 2005), raising the question of the origin of such sequences. Orphan domains, particularly in eukaryotes, harbor regions that yield no structure (regions of low complexity) (Ekman et al., 2005). Such regions possibly represent known domain families that have diverged beyond detection using existing models, perhaps en route to new functionality. Alternatively, such regions could have originated as noncoding sequences, accidentally transfigured into coding regions by mutations that challenge exon boundaries or termination signals, in turn facilitating the birth of new exons (Sorek, 2007).
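The coverage figures quoted in this section refer to the fraction of residues that fall inside an annotated domain. A minimal sketch of how such a number could be computed is given below. It assumes that domain hits are available as 1-based, inclusive (start, end) coordinates and that overlapping hits should be counted only once; the function names and the input layout are hypothetical and do not correspond to any particular database format.

```python
def covered_residues(protein_length, domains):
    """Number of residues covered by at least one domain.

    domains: iterable of (start, end) pairs, 1-based and inclusive.
    Overlapping or nested hits are counted once; hits are clipped to the protein.
    """
    covered = 0
    last_end = 0  # last residue already counted
    for start, end in sorted(domains):
        start = max(start, last_end + 1, 1)
        end = min(end, protein_length)
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered


def proteome_coverage(proteins):
    """Residue-level domain coverage over a set of proteins.

    proteins: iterable of (protein_length, domains) pairs.
    Returns the covered fraction of all residues, or 0.0 for an empty input.
    """
    total = 0
    covered = 0
    for length, domains in proteins:
        total += length
        covered += covered_residues(length, domains)
    return covered / total if total else 0.0


if __name__ == "__main__":
    toy = [(100, [(10, 50), (40, 80)]), (200, [(1, 60)]), (150, [])]
    print(round(proteome_coverage(toy), 3))
```

The complementary uncovered stretches are the unassigned regions discussed above, ranging from short linkers to orphan domains.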

12.5 CONCLUSION

Differences in the domain architectures of multidomain proteins that are known to share common ancestry raise the question of whether the concept of orthology is applicable only at the domain level or also to proteins that share identical domain architectures (Koonin et al., 2000; Ponting and Russell, 2002). As the sequencing of genomes continues, regions previously thought to be orphans may find their homologues in distant relatives (Yooseph et al., 2007). Moreover, the incorporation of newly available data into existing domain definition databases will further help extend domain coverage (such as the most recent version of Pfam, which sees the inclusion of 1000 new families in comparison to the previous release). Similarly, the improvement of existing tools for domain detection will prove valuable in increasing domain coverage. For example, the newest version of the HMMER package (http://hmmer.wustl.edu/), which contains tools used to derive HMMs from sequence alignments or to search with a query sequence against existing models, promises more sensitive results (Eddy, 2008) in a shorter time frame. Furthermore, specialized kingdom-specific models may provide an approach for increasing domain coverage. The diversity of available domain-based resources further improves the applicability of domain data; manually curated domain databases such as Pfam, automated databases such as EVEREST or ProDom, and integrated resources such as InterPro and SIMAP help offer a balanced mixture of the well-annotated domains required for functional inference and the simple homologous characters required for high-throughput analysis. To truly harvest the power of domains as the unit of evolution, efforts geared toward understanding the genomic details of domain rearrangements that lead to the formation of complex proteomes are of importance.
Such efforts supplement models of domain-wise evolution, which can in turn be used as the theoretical basis on which modern genomic tools can be developed. Eventually, such efforts may be combined with models of sequence evolution facilitating a detailed statistical framework that describes how genomes and proteomes evolve.


The use of domains to study proteins and protein function will increase in importance as the flood of genomic data continues. However, new methods must be established that use domains as the units of evolution. While many studies utilize domain-based approaches, well-described methodologies that allow for proper statistical evaluation must still be developed. If this is achieved, domain-based analysis will complement existing sequence-based methods, yielding tools that can be used to study genomes and proteomes at multiple levels.

REFERENCES

ALAM, I., HUBBARD, S.J., OLIVER, S.G., and RATTRAY, M., 2007. A kingdom-specific protein domain HMM library for improved annotation of fungal genomes. BMC Genomics 8: 97.
ALTSCHUL, S.F., MADDEN, T.L., SCHÄFFER, A.A., ZHANG, J., ZHANG, Z., MILLER, W., and LIPMAN, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402.
ANDREEVA, A., HOWORTH, D., BRENNER, S.E., HUBBARD, T.J.P., CHOTHIA, C., and MURZIN, A.G., 2004. SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res. 32: D226–D229.
APIC, G., GOUGH, J., and TEICHMANN, S.A., 2001. An insight into domain combinations. Bioinformatics 17: S83–S89.
APIC, G., GOUGH, J., and TEICHMANN, S.A., 2001. Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J. Mol. Biol. 310: 311–325.
APIC, G., HUBER, W., and TEICHMANN, S.A., 2003. Multidomain protein families and domain pairs: comparison with known structures and a random model of domain recombination. J. Struct. Funct. Genomics 4: 67–78.
BASHTON, M. and CHOTHIA, C., 2007. The generation of new protein functions by the combination of domains. Structure 15: 85–99.
BASU, M.K., CARMEL, L., ROGOZIN, I.B., and KOONIN, E.V., 2008. Evolution of protein domain promiscuity in eukaryotes. Genome Res. 18: 449–461.
BEAUSSART, F., WEINER, J., and BORNBERG-BAUER, E., 2007. Automated Improvement of Domain ANnotations using context analysis of domain arrangements (AIDAN). Bioinformatics 23: 1834–1836.
BJØRKLUND, A.K., EKMAN, D., and ELOFSSON, A., 2006. Expansion of protein domain repeats. PLoS Comput. Biol. 2: e114.
BJØRKLUND, A.K., EKMAN, D., LIGHT, S., FREY-SKÖTT, J., and ELOFSSON, A., 2005. Domain rearrangements in protein evolution. J. Mol. Biol. 353: 911–923.
BORNBERG-BAUER, E., 2002. Randomness, structural uniqueness, modularity and neutral evolution in sequence space of model proteins. Z. Phys. Chem. 216: 139–154.
BÖHM, S., FRISHMAN, D., and MEWES, H.W., 1997. Variations of the C2H2 zinc finger motif in the yeast genome and classification of yeast zinc finger proteins. Nucleic Acids Res. 25: 2464–2469.
BRU, C., COURCELLE, E., CARRERE, S., BEAUSSE, Y., DALMAR, S., and KAHN, D., 2005. The ProDom database of protein domain families: more emphasis on 3D. Nucleic Acids Res. 33: D212–D215.
BYSTROFF, C. and KROGH, A., 2008. Hidden Markov Models for prediction of protein features. Methods Mol. Biol. 413: 173–198.
CAMON, E., MAGRANE, M., BARRELL, D., BINNS, D., FLEISCHMANN, W. et al., 2003. The Gene Ontology Annotation (GOA) project: implementation of GO in SWISSPROT, TrEMBL, and InterPro. Genome Res. 13: 662–672.
CHELTSOV, A.V., BARBER, M.J., and FERREIRA, G.C., 2001. Circular permutation of 5-aminolevulinate synthase: mapping the polypeptide chain to its function. J. Biol. Chem. 276: 19141–19149.
CHOTHIA, C., 1992. Proteins. One thousand families for the molecular biologist. Nature 357: 543–544.
CHOU, K.-C. and CAI, Y.-D., 2002. Using functional domain composition and support vector machines for prediction of protein subcellular location. J. Biol. Chem. 277: 45765–45769.
DAUBIN, V., GOUY, M., and PERRIERE, G., 2002. A phylogenomic approach to bacterial phylogeny: evidence of a core of genes sharing a common history. Genome Res. 12: 1080–1090.
EDDY, S.R., 1998. Profile hidden Markov models. Bioinformatics 14: 755–763.
EDDY, S.R., 2008. A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput. Biol. 4: e1000069.
EKMAN, D., BJØRKLUND, A.K., FREY-SKÖTT, J., and ELOFSSON, A., 2005. Multi-domain proteins in the three kingdoms of life: orphan domains and other unassigned regions. J. Mol. Biol. 348: 231–243.
EKMAN, D., BJØRKLUND, A.K., and ELOFSSON, A., 2007. Quantification of the elevated rate of domain rearrangements in metazoa. J. Mol. Biol. 372: 1337–1348.
FELSENSTEIN, J., 1989. PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 5: 164–166.
FINN, R.D., TATE, J., MISTRY, J., COGGILL, P.C., SAMMUT, S.J. et al., 2008. The Pfam protein families database. Nucleic Acids Res. 36: D281–D288.
FISCHER, D. and EISENBERG, D., 1999. Finding families for genomic ORFans. Bioinformatics 15: 759–762.
FLIESS, A., MOTRO, B., and UNGER, R., 2002. Swaps in protein sequences. Proteins 48: 377–387.
FONG, J.H., GEER, L.Y., PANCHENKO, A.R., and BRYANT, S.H., 2007. Modeling the evolution of protein domain architectures using maximum parsimony. J. Mol. Biol. 366: 307–315.
FORSLUND, K., HENRICSON, A., HOLLICH, V., and SONNHAMMER, E.L.L., 2007. Domain tree based analysis of protein architecture evolution. Mol. Biol. Evol. 25: 254–264.
FRIEDRICHS, F., HENCKAERTS, L., VERMEIRE, S., KUCHARZIK, T., SEEHAFER, T., MÖLLER-KRULL, M., BORNBERG-BAUER, E., STOLL, M., and WEINER, J., 2008. The Crohn's disease susceptibility gene DLG5 as a member of the CARD interaction network. J. Mol. Med. 86: 423–432.
FUKAMI-KOBAYASHI, K., MINEZAKI, Y., TATENO, Y., and NISHIKAWA, K., 2007. A tree of life based on protein domain organizations. Mol. Biol. Evol. 24: 1181–1189.
GEER, L.Y., DOMRACHEV, M., LIPMAN, D.J., and BRYANT, S.H., 2002. CDART: protein homology by domain architecture. Genome Res. 12: 1619–1623.
GOUGH, J., 2005. Convergent evolution of domain architectures (is rare). Bioinformatics 21: 1464–1471.
GOUGH, J. and CHOTHIA, C., 2002. SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments. Nucleic Acids Res. 30: 268–272.
GOUZY, J., CORPET, F., and KAHN, D., 1999. Whole genome protein domain analysis using a new method for domain clustering. Comput. Chem. 23: 333–340.
GREENE, L.H., LEWIS, T.E., ADDOU, S., CUFF, A., DALLMAN, T., DIBLEY, M., REDFERN, O., PEARL, F., NAMBUDIRY, R., REID, A. et al., 2007. The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution. Nucleic Acids Res. 35: D291–D297.
HARRIS, M.A., CLARK, J., IRELAND, A., LOMAX, J., ASHBURNER, M., FOULGER, R., EILBECK, K., LEWIS, S., MARSHALL, B., MUNGALL, C. et al., 2004. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 32: D258–D261.
HAYETE, B. and BIENKOWSKA, J.R., 2005. Gotrees: predicting go associations from protein domain composition using decision trees. Pac. Symp. Biocomput. 127–138.
HEGYI, H. and GERSTEIN, M., 2001. Annotation transfer for genomics: measuring functional divergence in multidomain proteins. Genome Res. 11: 1632–1640.
HENRICK, K., FENG, Z., BLUHM, W.F., DIMITROPOULOS, D., DORELEIJERS, J.F., DUTTA, S., FLIPPEN-ANDERSON, J.L., IONIDES, J., KAMADA, C., KRISSINEL, E. et al., 2008. Remediation of the protein data bank archive. Nucleic Acids Res. 36: D426–D433.
HOLLICH, V. and SONNHAMMER, E.L.L., 2007. PfamAlyzer: domain-centric homology search. Bioinformatics 23: 3382–3383.
ITOH, M., NACHER, J., KUMA, K.-I., GOTO, S., and KANEHISA, M., 2007. Evolutionary history and functional implications of protein domains and their combinations in eukaryotes. Genome Biol. 8: R121.
JELTSCH, A., 1999. Circular permutations in the molecular evolution of DNA methyltransferases. J. Mol. Evol. 49: 161–164.


KOONIN, E.V., ARAVIND, L., and KONDRASHOV, A.S., 2000. The impact of comparative genomics on our understanding of evolution. Cell 101: 573–576.
KUMMERFELD, S.K. and TEICHMANN, S.A., 2005. Relative rates of gene fusion and fission in multi-domain proteins. Trends Genet. 21: 25–30.
LEE, B. and LEE, D., 2008. DAhunter: a web-based server that identifies homologous proteins by comparing domain architecture. Nucleic Acids Res. 36: W60–W64.
LEE, H., DENG, M., SUN, F., and CHEN, T., 2006. An integrated approach to the prediction of domain–domain interactions. BMC Bioinf. 7: 269.
LETUNIC, I., COPLEY, R.R., PILS, B., PINKERT, S., SCHULTZ, J., and BORK, P., 2006. SMART 5: domains in the context of genomes and networks. Nucleic Acids Res. 34: D257–D260.
LIN, K., ZHU, L., and ZHANG, D.-Y., 2006. An initial strategy for comparing proteins at the domain architecture level. Bioinformatics 22: 2081–2086.
LUCAS, J.I., ARNAU, V., and MARÍN, I., 2006. Comparative genomics and protein domain graph analyses link ubiquitination and RNA metabolism. J. Mol. Biol. 357: 9–17.
MARCHLER-BAUER, A., PANCHENKO, A.R., SHOEMAKER, B.A., THIESSEN, P.A., GEER, L.Y., and BRYANT, S.H., 2002. CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res. 30: 281–283.
MARCHLER-BAUER, A., ANDERSON, J.B., DERBYSHIRE, M.K., DEWEESE-SCOTT, C., GONZALES, N.R., GWADZ, M., HAO, L., HE, S., HURWITZ, D.I., JACKSON, J.D. et al., 2007. CDD: a conserved domain database for interactive domain family analysis. Nucleic Acids Res. 35: D237–D240.
MARCOTTE, E.M., PELLEGRINI, M., NG, H.L., RICE, D.W., YEATES, T.O., and EISENBERG, D., 1999. Detecting protein function and protein–protein interactions from genome sequences. Science 285: 751–753.
MOORE, A.D., BJØRKLUND, A.K., EKMAN, D., BORNBERG-BAUER, E., and ELOFSSON, A., 2008. Arrangements in the modular evolution of proteins. Trends Biochem. Sci. 33: 444–451.
MULDER, N.J., APWEILER, R., ATTWOOD, T.K., BAIROCH, A., BATEMAN, A., BINNS, D., BORK, P., BUILLARD, V., CERUTTI, L., COPLEY, R. et al., 2007. New developments in the InterPro database. Nucleic Acids Res. 35: D224–D228.
NG, S.-K., ZHANG, Z., TAN, S.-H., and LIN, K., 2003. InterDom: a database of putative interacting protein domains for validating predicted protein interactions and complexes. Nucleic Acids Res. 31: 251–254.
PASEK, S., BERGERON, A., RISLER, J.-L., LOUIS, A., OLLIVIER, E., and RAFFINOT, M., 2005. Identification of genomic features using microsyntenies of domains: domain teams. Genome Res. 15: 867–874.
PASEK, S., RISLER, J.-L., and BRÉZELLEC, P., 2006. Gene fusion/fission is a major contributor to evolution of multi-domain bacterial proteins. Bioinformatics 22: 1418–1423.


PONTING, C.P. and RUSSELL, R.B., 1995. Swaposins: circular permutations within genes encoding saposin homologues. Trends Biochem. Sci. 20: 179–180. PONTING, C.P. and RUSSELL, R.R., 2002. The natural history of protein domains. Annu. Rev. Biophys. Biomol. Struct. 31: 45–71. PORTUGALY, E., HAREL, A., LINIAL, N., and LINIAL, M., 2006. EVEREST: automatic identiﬁcation and classiﬁcation of protein domains in all protein sequences. BMC Bioinf. 7: 277. PORTUGALY, E., LINIAL, N., and LINIAL, M., 2007. EVEREST: a collection of evolutionary conserved protein domains. Nucleic Acids Res. 35: D241–D246. PRUITT, K.D., TATUSOVA, T., and MAGLOTT, D.R., 2007. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 35: D61–D65. QIAN, Z., CAI, Y.-D., and LI, Y., 2006. Automatic transcription factor classiﬁer based on functional domain composition. Biochem. Biophys. Res. Commun. 347: 141–144. RATTEI, T., TISCHLER, P., ARNOLD, R., HAMBERGER, F., KREBS, J., KRUMSIEK, J., WACHINGER, B., S€uMPFLEN, V., and MEWES, W., 2008. SIMAP–structuring the network of protein similarities. Nucleic Acids Res. 36: D289–D292. SAITOU, N. and NEI, M., 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4: 406–425. SALTON, G. and BUCKLEY, C., 1987. Term weighting approaches in automatic text retrieval. Technical report, Ithaca, NY. SCHUG, J., DISKIN, S., MAZZARELLI, J., BRUNK, B.P., and STOECKERT, C.J., 2002. Predicting gene ontology functions from ProDom and CDD protein domains. Genome Res. 12: 648–655. SONG, N., SEDGEWICK, R.D., and DURAND, D., 2007. Domain architecture comparison for multidomain homology identiﬁcation. J. Comput. Biol. 14: 496–516. SOREK, R., 2007. The birth of new exons: mechanisms and evolutionary consequences. RNA 13: 1603–1608. SZCZESNY, P. and LUPAS, A., 2008. Domain annotation of trimeric autotransporter adhesins–daTAA. Bioinformatics 24: 1251–1256. 
ULIEL, S., FLIESS, A., AMIR, A., and UNGER, R., 1999. A simple algorithm for detecting circular permutations in proteins. Bioinformatics 15: 930–936. VIBRANOVSKI, M.D., SAKABE, N.J., de OLIVEIRA, R.S., and DE SOUZA, S.J., 2005. Signs of ancient and modern exonshufﬂing are correlated to the distribution of ancient and modern domains along proteins. J. Mol. Evol. 61: 341–350.

VOGEL, C., BERZUINI, C., BASHTON, M., GOUGH, J., and TEICHMANN, S.A., 2004. Supra-domains: evolutionary units larger than single protein domains. J. Mol. Biol. 336: 809–823.
VOGEL, C., TEICHMANN, S.A., and PEREIRA-LEAL, J., 2005. The relationship between domain duplication and recombination. J. Mol. Biol. 346: 355–365.
WANG, M. and CAETANO-ANOLLÉS, G., 2006. Global phylogeny determined by the combination of protein domains in proteomes. Mol. Biol. Evol. 23: 2444–2454.
WANG, W., YU, H., and LONG, M., 2004. Duplication-degeneration as a mechanism of gene fission and the origin of new genes in Drosophila species. Nat. Genet. 36: 523–527.
WEINER, J., THOMAS, G., and BORNBERG-BAUER, E., 2005. Rapid motif-based prediction of circular permutations in multi-domain proteins. Bioinformatics 21: 932–937.
WEINER, J., BEAUSSART, F., and BORNBERG-BAUER, E., 2006. Domain deletions and substitutions in the modular protein evolution. FEBS J. 273: 2037–2047.
WEINER, J., MOORE, A.D., and BORNBERG-BAUER, E., 2008. Just how versatile are domains? BMC Evol. Biol. 8: 285.
WHEELER, D.L., BARRETT, T., BENSON, D.A., BRYANT, S.H., CANESE, K., CHETVERNIN, V., CHURCH, D.M., DICUCCIO, M., EDGAR, R., FEDERHEN, S. et al., 2008. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 36: D13–D21.
WOLF, Y.I., ROGOZIN, I.B., GRISHIN, N.V., and KOONIN, E.V., 2002. Genome trees and the tree of life. Trends Genet. 18: 472–479.
WU, C.H., APWEILER, R., BAIROCH, A., NATALE, D.A., BARKER, W.C., BOECKMANN, B., FERRO, S., GASTEIGER, E., HUANG, H., LOPEZ, R. et al., 2006. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 34: D187–D191.
WUCHTY, S., 2001. Scale-free behavior in protein domain networks. Mol. Biol. Evol. 18: 1694–1702.
YANDELL, M.D. and MAJOROS, W.H., 2002. Genomics and natural language processing. Nat. Rev. Genet. 3: 601–610.
YEATS, C., LEES, J., REID, A., KELLAM, P., MARTIN, N., LIU, X., and ORENGO, C., 2008. Gene3D: comprehensive structural and functional annotation of genomes. Nucleic Acids Res. 36: D414–D418.
YOOSEPH, S., SUTTON, G., RUSCH, D.B., HALPERN, A.L., WILLIAMSON, S.J., REMINGTON, K., EISEN, J.A., HEIDELBERG, K.B., MANNING, G., LI, W. et al., 2007. The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families. PLoS Biol. 5: e16.

Chapter 13

Domain Family Analyses to Understand Protein Function Evolution

Adam James Reid, Sarah Addou, Robert Rentzsch, Juan Ranea, and Christine Orengo

13.1 INTRODUCTION
13.2 UNIVERSAL DOMAIN STRUCTURE FAMILIES IDENTIFIED IN THE LAST UNIVERSAL COMMON ANCESTOR
13.3 SOME DOMAIN FAMILIES RECUR MORE FREQUENTLY AND ARE STRUCTURALLY VERY DIVERSE
13.4 CORRELATION OF STRUCTURAL DIVERSITY IN SUPERFAMILIES WITH FUNCTIONAL DIVERSITY
13.5 TO WHAT EXTENT DOES FUNCTION VARY BETWEEN HOMOLOGOUS RELATIVES?
13.6 HOW SAFELY CAN FUNCTION BE INHERITED BETWEEN HOMOLOGUES?
13.7 HOW ARE DOMAIN FAMILIES DISTRIBUTED IN PROTEIN COMPLEXES?
REFERENCES

13.1 INTRODUCTION

Although many genetic mechanisms contribute to the evolution of protein diversity across the various kingdoms of life, a common and dominant theme has been the duplication of domains and their fusion in different multidomain arrangements. Several comparative genome analyses (Yang et al., 2005; Ranea et al., 2006) have revealed relatively few universal domain families (fewer than 300) that account for 30–40% of domain sequences in the genomes of organisms in all kingdoms of life. That these have been combined in very diverse ways is illustrated by the fact that fewer than 10% of complete proteins within these organisms are common to all kingdoms of life (Grant et al., 2004).

Evolutionary Genomics and Systems Biology, edited by Gustavo Caetano-Anolles Copyright 2010 John Wiley & Sons, Inc.


It is indeed widely held that as protein domain families evolved, their functional repertoires were expanded through domain rearrangements. For example, gene duplication and subsequent divergence are thought to contribute significantly to the emergence of new functions (Chothia et al., 2003). Upon gene duplication, one gene copy is free to evolve a new or modified function by divergence at the level of the sequence, at the level of the spatial or temporal expression of the gene, or by combining with different partner domains.

Many evolutionary analyses of domain families have exploited structural data, where possible, to capture very distant evolutionary relationships, thereby gleaning more comprehensive insights into the extent to which domain families can diverge both in structure and in function. Structure tends to be much more highly conserved than sequence (Chothia and Lesk, 1986), and structural classifications such as SCOP (Andreeva et al., 2008) and CATH (Greene et al., 2007) were established to exploit this phenomenon and to characterize and analyze domain families and their evolution.

The CATH domain structure classification currently contains 2200 domain families comprising 114,000 domain structures. By building sequence profiles representative of each domain structure family, it is possible to predict domain sequence relatives in completed genomes, and the SUPERFAMILY (Wilson et al., 2007) and CATH-Gene3D (Yeats et al., 2008) resources have arisen to provide these data for SCOP and CATH, respectively. The CATH-Gene3D resource now contains over 600 completed genomes, comprising a total of 6 million protein sequences. Domain annotations have been mapped onto these sequences using HMM libraries based on CATH families.
While these HMM-based structural annotations account for between 40% and 60% of domain sequences in each genome, predictions performed using threading algorithms such as GenThreader (McGufﬁn and Jones, 2003) assign 80–90% of domain sequences in most organisms to structure-based CATH domain families (McGufﬁn et al., 2006). This suggests that a signiﬁcant proportion of domain sequences can be assigned to fewer than 2200 domain families. By contrast, clustering of the domain sequences into protein families using the APC protocol (Frey and Dueck, 2007) produces more than 350,000 protein families supporting the hypothesis of extensive variations in domain combinations mediating protein diversity. Figure 13.1 shows the increase in the number of protein families with each new completely sequenced genome over the past 10 years. By contrast, Figure 13.2 shows the proportion of domain sequences in representative model organisms, which can be assigned to domain structure families in CATH.

13.2 UNIVERSAL DOMAIN STRUCTURE FAMILIES IDENTIFIED IN THE LAST UNIVERSAL COMMON ANCESTOR

Comparative genome analysis performed on 200 completed genomes in Gene3D version 4 revealed a set of 140 “universal” families present in at least 90% of all organisms, rising to 205 families when the threshold is relaxed to 70% (Ranea et al., 2006). Functional analysis of these families, exploiting public data (e.g., from GO, COG, EC) and the literature, revealed that they include not only families involved in protein biosynthesis but also families involved in key biochemical pathways and regulation. These analyses suggested a more complex picture of early life than had first been revealed by examining sequence-based families (Koonin, 2003). Although the sequence-based analyses detected the protein biosynthesis machinery shared by all kingdoms of life, other


Figure 13.1

Increase in the number of protein families with each new completely sequenced genome over the last 10 years. Families with at least two members are shown in red and those with more than ﬁve members are shown in green. Singleton families are shown in blue. (See insert for color representation of this ﬁgure.)

classes of proteins, clearly required by any independent organism, were missed, as the sequence signal had been washed away over time in these more divergent families. Thus, by exploiting structure-based families, a more convincing representation of the processes occurring in the last universal common ancestor (LUCA) can be imagined.
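The "universal family" criterion described in this section (presence in at least 90%, or 70%, of organisms) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `universal_families`, the organism labels, and the assignment of superfamily codes to organisms are invented toy data, not results from Ranea et al. (2006).

```python
from collections import Counter

def universal_families(genome_families, min_fraction):
    """Return the families present in at least `min_fraction` of the genomes."""
    n_genomes = len(genome_families)
    counts = Counter()
    for families in genome_families.values():
        counts.update(set(families))  # count each family once per genome
    return {fam for fam, n in counts.items() if n / n_genomes >= min_fraction}

# Hypothetical toy data: which superfamilies are annotated in which organism.
genomes = {
    "org1": {"3.40.50.300", "3.30.70.340", "2.60.120.260"},
    "org2": {"3.40.50.300", "3.30.70.340"},
    "org3": {"3.40.50.300", "2.60.120.260"},
    "org4": {"3.40.50.300", "3.30.70.340"},
}

strict = universal_families(genomes, 0.9)   # analogous to the 90% threshold
relaxed = universal_families(genomes, 0.7)  # analogous to the 70% threshold
print(sorted(strict), sorted(relaxed))
```

Relaxing the threshold can only enlarge the set, which mirrors the step from 140 to 205 families reported in the text.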

Figure 13.2

Structural coverage of genes in model organisms. Coverage achieved using Gene3D is shown in dark blue. Additional coverage achieved using GenThreader (McGufﬁn and Jones, 2003) shown in light blue. (See insert for color representation of this ﬁgure.)


13.3 SOME DOMAIN FAMILIES RECUR MORE FREQUENTLY AND ARE STRUCTURALLY VERY DIVERSE

There is considerable variation in the populations of different CATH domain superfamilies, with the 100 most highly expanded superfamilies accounting for nearly 40% of all CATH domain annotations in the genomes. Furthermore, recent analyses of structural relatives in CATH families, using a sensitive structure comparison algorithm (Redfern et al., 2007) to compare and cluster domains into structurally similar groups, have shown that the expansion of these families correlates with considerable structural diversity across the family (Cuff et al., 2009). For example, if we define relatives having significantly similar structures or folds as those superposing with <5 Å RMSD (normalized by the proportion of the largest domain aligned) and we cluster these relatives into the same structural subgroup (SSG), it can be seen that in some families more than 12 different subgroups can be identified. In some of these superfamilies, relatives in different SSGs can differ in size by threefold or more. Usually, relatives have at least 40–50% of the residues in the core in common, so that the topological motif in the core of the domain is conserved.

Although all relatives are also likely to have very similar secondary structure arrangements in the cores of their domains, relatives in different SSGs are likely to exhibit considerable differences in the secondary structure embellishments or decorations to these cores. By performing multiple structure alignments across the superfamily, it is possible to identify the common core secondary structures and those secondary structure embellishments occurring in diverse relatives. The 2DSEC method (Reeves et al., 2006) highlights these differences and can be used to identify changes occurring across different sets of relatives. For example, Figure 13.3 shows secondary structure variations observed in different SSGs in the ATP-Grasp domain family.
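To make the SSG definition concrete, the sketch below groups domains whose pairwise normalized RMSD falls under the 5 Å cutoff. The text does not state which linkage rule was used, so single linkage (connected components) is assumed here; the domain names and RMSD values are hypothetical.

```python
def structural_subgroups(domains, norm_rmsd, cutoff=5.0):
    """Group domains into SSGs by single linkage on normalized RMSD (assumed rule)."""
    parent = {d: d for d in domains}

    def find(d):
        # Follow parent links to the representative, compressing the path.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for (a, b), value in norm_rmsd.items():
        if value < cutoff:            # structurally similar: merge the two groups
            parent[find(a)] = find(b)

    groups = {}
    for d in domains:
        groups.setdefault(find(d), []).append(d)
    return sorted(sorted(members) for members in groups.values())

# Hypothetical pairwise normalized RMSD values in angstroms.
domains = ["d1", "d2", "d3", "d4"]
norm_rmsd = {
    ("d1", "d2"): 2.1, ("d2", "d3"): 4.8, ("d1", "d3"): 6.0,
    ("d1", "d4"): 9.5, ("d2", "d4"): 8.7, ("d3", "d4"): 7.9,
}
ssgs = structural_subgroups(domains, norm_rmsd)
print(ssgs)
```

With single linkage, d1 and d3 end up in the same subgroup through d2 even though they differ by more than the cutoff; a complete-linkage rule would split them, so the choice of linkage affects the number of SSGs reported.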

13.4 CORRELATION OF STRUCTURAL DIVERSITY IN SUPERFAMILIES WITH FUNCTIONAL DIVERSITY

Domain families that have been highly expanded in the genomes and that exhibit great structural diversity also tend to have multiple functions. Since Gene3D contains all the predicted sequence relatives for CATH domain superfamilies, the functional annotations associated with these predicted CATH domains can be examined to gauge the extent of functional diversity across each family. Several structural analyses of domain families in CATH and SCOP have examined the manner in which functions are modified across a superfamily. Todd et al. (2001) showed that changes could arise simply due to mutations of single residues (e.g., catalytic residues in the active site) or structural embellishments of the domain modifying the binding site. Changes in domain partnerships can also alter access to the binding site or change its geometry. Some structural variations in relatives were associated with the promotion of different oligomerization states, and these changes could also affect access of ligands and substrates to the binding site.

A more detailed analysis of the types of functional changes that can be mediated by varying the combination of protein domains was performed recently for a set of 46 domain homologues found in single-domain and multidomain proteins for which comprehensive data on both the structure and the function were available (Bashton and Chothia, 2007). The authors reported a variety of functional modifications, functional activity gain, and loss as a


Figure 13.3

Three domains from the galectin-type carbohydrate recognition domain superfamily. The domains are colored so that residues having the same α-helix or β-sheet conformation in 75% of domains appear red. The blue regions represent those secondary structures that are present in fewer than 75% of domains or where the structure is coil. Domains 1bkzA0 and 1a8d01 are in the same orientation so that the embellishments can be seen. 1dypA0 in a different orientation shows how these embellishments can modify the geometry of the binding site. The binding site remains in the same place in all members of this superfamily. From Reeves et al. (2006). (See insert for color representation of this figure.)

result of domain combinations. It was noted that the fusion of an additional domain could affect the type of substrate the protein acts upon. As well as modifying substrate binding through oligomer formation, different domain combinations may also have roles in regulating the function of the domain responsible for catalysis. Moreover, fusion of domains was shown to sometimes produce bifunctional enzymes (i.e., enzymes with two catalytic activities). Loss of the catalytic function by a homologue in a domain combination was also reported. In order to understand more clearly the precise structural mechanisms by which changes in structure mediate divergence in function, 31 of the most structurally divergent superfamilies in CATH were analyzed recently (Reeves et al., 2006). It was observed that the speciﬁc secondary structure embellishments, found in different functional subgroups, were often impacting on binding site geometry or modifying surface features of the domain, thereby promoting different domain partnerships or oligomerization states. An example of secondary structure insertions modifying the binding site geometry is shown in the galactose binding domain-like superfamily (CATH code 2.60.120.260). The crystal structure of galectin-7 complexed with galactose (Leonidas et al., 1998) reveals the


carbohydrate binding site in loops above and below the β-sandwich. Binding interactions are provided by the loops at both ends of the β-sheets. Galectin-7 (CATH domain 1bkzA0) is the smallest member of this superfamily, comprising two antiparallel β-sheets, each with five β-strands. Two types of embellishments are observed. The tetanus neurotoxin (1a8d01) shows the most typical, with β-strand insertions at both sides of the β-sandwich (Figure 13.3). All relatives have active sites in the same location, but there are significant changes in the binding pocket shaped by the extensive β-strand embellishments. For example, κ-carrageenase (PDB code 1dyp), an enzyme acting on κ-carrageenans (sulfated galactans), has a tunnel-shaped active site, created by these insertions and thought to be responsible for the degradation of polysaccharides (Michel et al., 2001).

An example of secondary structure insertions modifying the binding site and promoting a different oligomerization state is found in the ATP-dependent carboxylate-amine/thiol ligase (ATP-grasp) superfamily. Members typically catalyze ATP-dependent ligation of a substrate carboxylate to an amine or thiol group of a second substrate (Todd et al., 2001). Nearly all relatives share three common domains. ATP is bound in the cleft between two of the domains, which are referred to as the small and large ATP binding domains (Figure 13.4a). The third domain, the biotin carboxylase N-terminal domain-like domain, is referred to as the B domain. The size of the β-sheet in the large domain varies from 5 strands to 11 and the total number of secondary structures varies from 8 to 20, a 2.5-fold increase (Figure 13.4b). Additional α-helices are appended to the N-termini of the large domain in several relatives, and there is an embellishment of varying size between consensus strand 2 and consensus α-helix 1.
These embellishments aggregate to form an extension to the β-sheet and the domains are arranged so that the embellishments enclose the active site, resulting in a box-like geometry. Functions exhibited by embellished relatives include carboxylases (phosphoribosylamino-imidazole carboxylase, 1b6rA, and biotin carboxylase, 1bncA) and synthetases (glycinamide ribonucleotide synthetase, 1gso), and the substrates are all small in size due to the box-like active site. In biotin carboxylase, which dimerizes to form the biological unit, the C-terminal embellishment of the large domain is also directly involved in the dimerization interface (Artymiuk et al., 1996) (Figure 13.4c).

When secondary structure insertions occurring in different relatives were examined, they typically occurred at positions distributed along the whole length of the polypeptide chain. However, when inspected visually in 3D, it could be seen that despite being dispersed along the polypeptide, they aggregated in 3D at the same site, effectively forming a much larger substructure that could have a more profound impact on binding site geometry or surface contours. Furthermore, it was rare to see large insertions. Ninety percent of the insertions comprised only one or two secondary structure elements. This suggests a simple evolutionary mechanism whereby small changes, occurring throughout the chain, can be amplified because they colocate in 3D. The tendency of these inserted secondary structures to be colocated in 3D was found to be statistically significant (Reeves et al., 2006).

Interestingly, a large proportion of the superfamilies showing the greatest structural and functional divergence in CATH belong to a small subset of domain architectures. Thirty-seven percent of the superfamilies belong to the two-layer β-sandwiches and two- and three-layer αβ sandwiches. These are the most heavily populated architectures in CATH, accounting for more than a third of predicted CATH domains in completed genomes.
Domain structures adopting these regular, layered architectures typically comprise one or two central β-sheets. Since residue insertions are rarely tolerated within β-sheets, this


Figure 13.4

(a) Three domains from the ATP-grasp family. The large domain is shown in red, the small domain in blue, and the B domain in light blue. Residues shown in yellow are involved in ATP binding and the green residues represent those involved in substrate binding. Right structure shows D-alanine D-alanine ligase (1iow) in the L conformation. Left structure shows biotin carboxylase (box-like geometry). (b) A 2DSEC plot of the ATP-grasp large domain. Pink circles represent consensus helices (in >75% of the aligned domains) and purple circles represent embellished helices. Yellow triangles represent consensus strands and brown triangles represent strand embellishments. The size of the symbol is in proportion to the length (number of residues) of the secondary structure. (c) The biological unit of biotin carboxylase (1bnc; group I) visualized using PQS. The small domain is shown in blue, the B domain in light blue, the large domain consensus region in red, and the embellishment in yellow. From Reeves et al. (2006). (See insert for color representation of this figure.)

arrangement constrains insertions to occur predominantly at the loop regions connecting the β-strands, that is, at the top and the bottom of the β-sheets. Alternatively, insertions can comprise additional β-strands added to the edges of the β-sheets. Therefore, there are four locations in the structure that are more tolerant of insertions and at which insertions aggregate, thus giving rise to larger structural features that can modify the functions of the domains.


13.5 TO WHAT EXTENT DOES FUNCTION VARY BETWEEN HOMOLOGUES?

13.5.1 Phylogenetic Analysis of Protein Families

There have been many interesting evolutionary analyses of individual domain families (e.g., Aravind et al., 2002; Burroughs et al., 2006; Ojha et al., 2007). Both orthologous and paralogous relatives within these families have been considered. Orthologues are relatives in different species that have evolved from a common ancestral gene by speciation. By contrast, paralogues are related by duplication within a genome. Orthologues tend to retain the same function in the course of evolution, whereas paralogues generally evolve new functions, even if related to the original one. For a protein phylogeny where the objective is to understand the evolution of an entire family of proteins, both orthologues and paralogues should be used. By contrast, an accurate evolutionary reconstruction of the protein family tracing its vertical evolution on a species tree requires sets of orthologous genes only.

A species tree represents the speciation events that took place during evolution and led to the emergence of present-day organisms. Such a tree can also be used to trace the evolutionary changes in protein families. Parsimonious reconstruction of protein families, using this approach, aims to trace the first acquisition of the sets of orthologues making up the protein family, or to predict lineage-specific gene loss events, with the minimum number of events. Algorithms have recently been developed to trace the most parsimonious scenario for the evolution of the entire complement of protein families in the genomes and to reconstruct the gene content of ancestral genomes (Snel et al., 2002; Mirkin et al., 2003; Kunin and Ouzounis, 2003).
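The idea of parsimonious reconstruction on a species tree can be illustrated with a small weighted-parsimony (Sankoff-style) sketch. It is not the algorithm of any of the cited tools: the tree, the presence pattern, and the gain and loss costs are hypothetical, the family is assumed to be absent before the root, and ties between equally cheap scenarios are broken arbitrarily.

```python
INF = float("inf")

def leaf_names(node):
    """List the species under a node (a leaf is a string, a clade is a tuple)."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaf_names(child)]

def fill_costs(node, present, gain, loss, costs):
    """Bottom-up pass: costs[node][s] is the cheapest history below node in state s."""
    if isinstance(node, str):
        table = [INF, 0.0] if node in present else [0.0, INF]
    else:
        table = [0.0, 0.0]
        for child in node:
            c = fill_costs(child, present, gain, loss, costs)
            table[0] += min(c[0], c[1] + gain)  # family absent at this node
            table[1] += min(c[1], c[0] + loss)  # family present at this node
    costs[node] = table
    return table

def emergence(tree, present, gain=2.0, loss=1.0):
    """Return (minimal cost, clades at which the family is inferred to be gained)."""
    costs = {}
    root = fill_costs(tree, present, gain, loss, costs)
    gains = []

    def walk(node, state):
        # Top-down pass: pick the cheapest state for each child given its parent.
        if isinstance(node, str):
            return
        for child in node:
            c = costs[child]
            if state == 0:
                child_state = 0 if c[0] <= c[1] + gain else 1
                if child_state == 1:
                    gains.append(sorted(leaf_names(child)))
            else:
                child_state = 1 if c[1] <= c[0] + loss else 0
            walk(child, child_state)

    root_state = 0 if root[0] <= root[1] + gain else 1
    if root_state == 1:
        gains.append(sorted(leaf_names(tree)))
    walk(tree, root_state)
    return min(root[0], root[1] + gain), gains

# Hypothetical species tree and presence pattern for one family.
species_tree = ((("A", "B"), "C"), ("D", "E"))
total_cost, gain_clades = emergence(species_tree, present={"A", "B", "C"})
print(total_cost, gain_clades)
```

Here a single gain in the ancestor of A, B, and C is cheaper than a gain at the root followed by a loss in the D/E lineage. Changing the relative costs of gains and losses changes which scenario wins, which is one reason several equally parsimonious scenarios can exist for the same family.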
The gene most frequently used to reconstruct species or organism phylogeny is the small subunit ribosomal RNA (SSU rRNA), which has been claimed to rarely undergo lateral transfer among genomes (Woese, 1987; Jain et al., 1999) and to generate more accurate reconstruction of deep branching patterns than gene content-based methods (Maidak et al., 2001). However, horizontal gene transfer (HGT) needs to be taken into account, as this can confound attempts at phylogenetic reconstruction of the evolution of protein families. Its prevalence, even between different kingdoms, has been supported by mapping the phyletic patterns of orthologous gene sets (COGs) onto a species tree (Mirkin et al., 2003).

Structural data are more conserved than sequence data and allow the detection of more distant evolutionary relationships, illuminating evolutionary processes that are undetectable by sequence-based methods. For example, as discussed above, studies of LUCA using information on remote relationships detected using structural data highlighted the existence of a more comprehensive functional repertoire in LUCA than suggested by sequence data alone (Ranea et al., 2006).

A recent analysis of domain structure superfamilies in CATH exploited a protocol based on parsimony analysis to trace the most likely evolutionary scenarios leading to the expansion of the functional repertoires associated with these superfamilies. The evolution of new functions, defined by the clusters of orthologous groups (COGs), within a superfamily was traced through sequence mapping of CATH relatives onto complete ORFs from the COG database. This effectively allowed the reconstruction of the evolutionary history of each COG, suggesting the most likely emergent node using parsimony and the topology of a species tree to identify earlier and later evolved COG functions.
Structural data from CATH were exploited to detect distant evolutionary relationships between COGs and enable parent–child relationships to be deduced between earlier and later evolved COGs that share domain relatives from the same homologous superfamily. In


turn, these parent–child relationships were used to unravel the trends associated with the evolution of new functions within CATH superfamilies. The role of domain rearrangement events in driving domain context change and hence functional change in CATH superfamilies was also examined. The identiﬁcation of nodes and species in the tree of life, showing where a particular COG is believed to have emerged for the ﬁrst time during the course of evolution, was deduced by reconstructing the most parsimonious evolutionary history of that COG. Although parsimony analysis searches for the branching order that requires the fewest number of evolutionary events for each COG, there may be multiple evolutionary scenarios for a given COG with the same minimal number of events, which in turn means that a COG could be predicted to have more than one emergent point in the tree. In many of these cases, HGT is likely to be the cause of the additional evolutionary paths. For this reason, organisms known to be associated with high preponderance of HGT were removed. The identiﬁcation of earlier evolved and later evolved COGs in the remaining data set (1997 COGs) was used to explore which functions are more ancestral and which functions evolved more recently. Figure 13.5 shows the distributions of COG-associated functional categories in earlier and later evolved COGs. Unsurprisingly, the functions most basic to a cellular organism such as those involved in translation, ribosomal structure, and biogenesis (J) are more signiﬁcantly represented by

Figure 13.5

Distribution of functional categories associated with earlier and later evolved COGs. (a) Distribution of COG functional categories associated with earlier and later evolved COGs. Functional categories of COGs describe the functions carried out by sequence relatives in the ancestral COGs (blue) and the more recently evolved COGs (red). (b) χ² statistical deviation values reflect the tendency to be an earlier evolved or later evolved function for each of the 24 COG functional categories; at χ² = 2.71, the null hypothesis that a COG functional category is equally likely to be linked with earlier and later evolved COGs can be rejected with a confidence level. (See insert for color representation of this figure.)


earlier evolving COGs. Eighty percent of the COGs classified within the translation, ribosomal structure, and biogenesis category (J) are ancestral. Conversely, functions that are associated with later evolved COGs include those COGs with “unknown function”. It is likely that these COGs include relatives with organism-specific functions that are relatively hard to characterize.

13.5.2 Structural Domain Characterization of Clusters of Orthologous Genes Using CATH

While phylogenetic methods that are based on sequence similarity alone often fail to detect remote relationships between homologous proteins, knowledge of protein structure can significantly improve homology recognition methods, as structure is generally more conserved than sequence (Chothia, 1992). To this end, evolutionarily related COGs were identified when their sequence members mapped to domain sequences classified in the same homologous superfamily in CATH. Genes in 2081 COGs, representing 43% of all COGs and 72% of all genes classified in the COG database, had structural domain assignments that mapped to 822 CATH domain superfamilies. Furthermore, structural domain annotation covered 70% or more of all genes in 68% of the COGs used in the analysis. This corresponds to annotation of 70% or more of the species, indicating that relatively high and homogeneous structural assignment coverage is obtained with respect to both genes and species. The average size of COGs without structural annotation was almost three times smaller than that of COGs where structural domain relatives in CATH superfamilies were identified. Figure 13.6 shows that more than half of the genes in the COGs have relatives that map to a single-domain superfamily in CATH. The remaining genes encode multidomain proteins that have domain relatives classified in different superfamilies in CATH.

13.5.3 Evolution of COG Function Within CATH Superfamilies

The identification of reliable emergent nodes and structural relatives for distinct clusters of orthologous proteins was used to establish the chronological order of emergence of sets of

Figure 13.6

Distribution of the number of structural domain relatives in CATH identiﬁed in 1997 COGs. (See insert for color representation of this ﬁgure.)


COGs within a domain superfamily in CATH. This allowed inference of the parent-child relationships for the different types of functions carried out by the different relatives in the CATH domain superfamily.

13.5.4 Resolving Ambiguous Evolutionary Scenarios Between Parent and Child COGs in a CATH Superfamily

As the number of ancestral nodes is limited and the majority of speciation processes cannot be captured from the species tree, the probability of assigning the emergence of two different COGs to the same node is high. Sequence homology was used to resolve ambiguous parent/child relationships between COGs. A sequence similarity score based on that derived by Abascal and Valencia (2002) was used to distinguish, where possible, the most probable parent–child COG scenario by identifying which of all possible parent–child relationships, as suggested by the phyletic reconstruction using GeneTRACE (Kunin and Ouzounis, 2003), has the highest sequence similarity score. Figure 13.7 provides a summary of the protocol employed in this study in order to trace the set of evolutionary events behind the expansion of functional diversity observed in some CATH domain superfamilies. A total of 942 parsimonious scenarios (parent/child relationships between groups of orthologues within a domain superfamily) were identified. Figure 13.8 below shows normalized frequencies of functional shifts observed between relatives in parent and child COGs within a CATH superfamily.

The analysis of COG repertoires in CATH domain superfamilies illustrated in Figure 13.8 shows that in the majority of superfamilies, functions remain conserved or shift within the same functional category (e.g., replication, amino acid biosynthesis). To a lesser extent, some parent functions diverge to functions classified in different functional categories that are part of the same general functional class (e.g., regulation, metabolism). The most common examples of such functional change usually involve interlinked functions involved in transcription regulation and signal transduction mechanisms.
For example, the reconstructed scenario tracing the evolution of function in the CATH superfamily 3.30.70.340 predicts the emergence of membrane GTPases involved in stress response (COG1217) from more ancestral GTPases serving as translation elongation factors (COG0480). However, there are examples involving shifts to a different broad functional class. These observations bring new insights into the extent to which function can diverge beyond specificities of ligand binding in homologous superfamilies. The degree of functional divergence can be classified into one of three types:

Type 1: a parent function evolves to a new function that is within the same functional category according to the COG classification.
Type 2: a parent function diverges to a greater extent and is subsequently classified into a different functional category, but still within the same broad functional class in the COG classification (i.e., information, regulation, metabolism).
Type 3: a parent function evolves to a different broad functional class.

A significant proportion of ancestral functions seem to evolve into paralogues that remain poorly annotated ([R], [S]), which may well be associated with organism-specific functions. Some functional categories are naturally underrepresented in prokaryotes (i.e., RNA processing and modification [A], defense mechanisms [V]), as indicated by the low

Chapter 13

Domain Family Analyses to Understand Protein Function Evolution

Figure 13.7

Schematic representation showing steps in the protocol employed to identify earlier and later evolved sets of orthologous genes (COGs) within CATH domain superfamilies. (a) The most probable emerging node for groups of orthologues (COGs) sharing one or more component domain(s) from the same CATH domain superfamily was calculated by a parsimony-based analysis based on the phylogenetic distribution of each COG on the species tree (the “+” and “−” symbols represent the presence and absence of a COG representative(s) in a particular species, respectively). (b) The predictive scenarios for the emergence of COGs, as proposed by the GeneTRACE algorithm, sometimes revealed ambiguous evolutionary relationships linking COGs within the same CATH superfamily. In the example, COG D can be said to have evolved from either COG B or COG C. (c) A distance matrix is used to identify the most likely parent–child relationship linking two of the three COGs where a sequence similarity score was calculated between ORFs of the two possible parent–child pairs: COG D/COG B and COG D/COG C. (d) The pair with the highest score (i.e., the closest pair, illustrated with a double plus sign (++) in the distance matrix) is kept and added to the reconstructed evolutionary scenario describing the chronological order in which three distinct functional groups within the superfamily have emerged. (See insert for color representation of this figure.)

numbers of functional shifts associated with these categories. As a consequence, the frequencies derived to reﬂect evolution of function in these categories should be considered with caution.
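The two decisions described in this section, keeping the candidate parent with the highest similarity score (Figure 13.7c, d) and labelling the resulting functional shift as type 1, 2, or 3, can be sketched in a few lines of Python. This is a minimal illustration, not the protocol used in the study: the function names and the toy scores are hypothetical, and the grouping of one-letter categories into broad classes is assumed to follow the standard COG database scheme, which may differ in detail from the grouping used for Figure 13.8.

```python
# Assumed grouping of COG one-letter categories into broad functional classes.
COG_CLASSES = {
    "information": set("JAKLB"),
    "cellular": set("DYVTMNZWUO"),
    "metabolism": set("CGEFHIPQ"),
    "poorly_characterized": set("RS"),
}


def broad_class(category):
    """Return the broad COG class that contains a one-letter category code."""
    for name, letters in COG_CLASSES.items():
        if category in letters:
            return name
    raise ValueError(f"unknown COG category: {category!r}")


def pick_parent(child, candidates, similarity):
    """Keep the candidate parent COG with the highest similarity to the child.

    `similarity` maps (child, parent) pairs to a precomputed score
    (hypothetical values here; the chapter uses a score based on
    Abascal and Valencia, 2002).
    """
    return max(candidates, key=lambda parent: similarity[(child, parent)])


def divergence_type(parent_category, child_category):
    """Classify a parent-to-child functional shift as type 1, 2, or 3."""
    if parent_category == child_category:
        return 1  # same functional category
    if broad_class(parent_category) == broad_class(child_category):
        return 2  # different category, same broad functional class
    return 3      # different broad functional class


# Toy example mirroring Figure 13.7: COG D could descend from COG B or COG C.
scores = {("COG_D", "COG_B"): 0.42, ("COG_D", "COG_C"): 0.71}
parent = pick_parent("COG_D", ["COG_B", "COG_C"], scores)
```

With these toy scores, COG C is kept as the parent of COG D. A shift from category J to J is type 1, from K to L type 2, and from J to C type 3 under the assumed grouping.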

13.5.5 Relationship Between Domain Architecture Rearrangement and Functional Divergence Within CATH Superfamilies

Recent analyses on the evolution of domain architectures in proteins have suggested that most multidomain protein architectures evolve by simple domain rearrangements, involving a single domain at a time (Bjorklund et al., 2005; Fong et al., 2007). Bjorklund et al. (2005) used a domain distance measure reflecting the number of unmatched domains between protein homologues with dissimilar multidomain architectures to identify and quantify domain recombination events that have driven the evolution of multidomain

Figure 13.8 Percent frequencies of functional shifts observed in parent/child reconstructed evolutionary scenarios for COGs mapped to the same CATH domain superfamily. Function changes with the highest frequency in each category are highlighted in yellow. Other shifts representing more than 10% of all observed evolutionary scenarios for parent functions associated with a particular functional category (e.g., transcription, signal transduction, etc.) are underlined. Shaded boxes indicate where shifts in functions occur within the same broad functional class in the COGs classiﬁcation: Information storage and processing (blue), cellular processes and signaling (red), metabolism (green), and poorly characterized (gray). In order to highlight those functional categories that are underrepresented in COGs, the last row of the table shows the raw frequencies of function changes for each category. (See insert for color representation of this ﬁgure.)

proteins in eukaryotes. A related analysis by Fong et al. (2007) employed parsimony to reconstruct the most probable domain recombination events that may have taken place during the course of evolution. The evolution of function in CATH superfamilies with respect to the multidomain architectures in parent and child COGs has also been investigated. An approach similar to the domain distance method developed by Bjorklund et al. (2005) was employed, whereby differences between the domain architectures in the ancestral and the more recently evolved sequence relatives indicate a domain rearrangement event that may have caused functional divergence in relatives. Out of the 306 evolutionary scenarios that we studied, 228 had comprehensive functional annotation (i.e., excluding the poorly characterized COG functional classes R and S) and complete domain annotation. A total of 84 (36.8%) of these 228 evolutionary scenarios do not include any kind of domain rearrangement event, and relatives in both parent and child COGs share identical domain architectures. Domain rearrangements in the remaining evolutionary scenarios include 54 (23.7%) instances of domain deletion, 42 (18.4%) of domain accretion, 36 (15.8%) of domain exchange, and 12 (5.3%) complex domain rearrangements, which comprise more than a single type of recombination event and may include insertion or deletion of domains accompanied by the exchange of a domain(s). Evolutionary scenarios involving complex domain rearrangement events that include domain accretions and exchange or domain deletion and exchange are the least frequent scenarios. This observation is in accord with recently published data (Fong et al., 2007).
Since previous studies showed that most multidomain architectures in proteins arise from single recombination events (Bjorklund et al., 2005; Fong et al., 2007), it is possible that the complex domain rearrangements observed here have arisen as the product of several recombination events and that the chronological order of these events cannot be readily determined from the data (e.g., absent COGs). The relationship between the occurrence and type of domain recombination events and the extent to which function can diverge between relatives in a CATH superfamily (i.e., type 1, type 2, or type 3 functional change, as defined in Section 13.5.4) is summarized in Table 13.1. As can be seen from the data presented in Table 13.1, the relationship between the occurrence and type of domain rearrangement and the extent to which function diverges between relatives in parent and child COGs is not a straightforward one. For example, function seems to diverge in all three ways (i.e., type 1, type 2, and type 3) regardless of the type of recombination event. Similarly, it is hard to link any type of domain recombination event with a specific type of change in function. While this could simply reflect the presence of an intricate relationship between the evolution of function and domain rearrangements within a protein superfamily, it could also be that the number of informative scenarios, with respect to COGs functional annotation and complete multidomain characterization of relatives, is too small to reach definite conclusions about the impact that multidomain architecture change has on the extent of functional divergence. In summary, in 63% of scenarios in which the domain architectures in parent and child COGs are identical, the more recently evolved functions (child functions) can be classified in the same functional category as defined by the COGs functional classification (i.e., type 1 functional change).
In addition, half of all scenarios in which multidomain architecture change comprises the addition of a single domain and, to a lesser extent, several domains involve functions classiﬁed in the same broad functional class (i.e., type 2 functional change).
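The event categories used above (identical architecture, accretion, deletion, exchange, complex rearrangement) can be illustrated with a small sketch that compares the domain content of a parent and a child architecture. This is only an approximation under stated assumptions: the function name is hypothetical, domain order is ignored, and the rule that an exchange means equal numbers of gained and lost domains is an assumption made for illustration rather than the criterion used in the study.

```python
from collections import Counter


def classify_rearrangement(parent_arch, child_arch):
    """Label the change from a parent to a child domain architecture.

    Architectures are sequences of CATH superfamily codes. Domain order is
    ignored, so only gains and losses of domains are detected.
    """
    parent, child = Counter(parent_arch), Counter(child_arch)
    gained = sum((child - parent).values())  # domains present only in the child
    lost = sum((parent - child).values())    # domains present only in the parent
    if gained == 0 and lost == 0:
        return "identical"
    if lost == 0:
        return "accretion"
    if gained == 0:
        return "deletion"
    if gained == lost:
        return "exchange"  # assumed: one-for-one replacement of domains
    return "complex"       # assumed: exchange combined with a net gain or loss


# Example: a P-loop domain gains a winged-helix partner domain.
event = classify_rearrangement(["3.40.50.300"], ["3.40.50.300", "1.10.10.10"])
```

The sum of gained and lost domains in this sketch corresponds to a simple count of unmatched domains between the two architectures, in the spirit of the domain distance described above.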

Table 13.1 The Relationship Between the Incidence and the Type of Domain Recombination Events and the Extent of Functional Divergence (i.e., Type 1, Type 2, Type 3 Functional Change), Observed in the Evolutionary Scenarios Linking Parent and Child COGs in CATH Superfamilies

Architecture change                              Type 1 functional change   Type 2 functional change   Type 3 functional change   Total
Identical single and multidomain architectures   53: 47.7% (63%)            20: 27.8% (23.8%)          11: 24.4% (13.1%)          84
Domain accretion                                 13: 11.7% (31%)            21: 29.2% (50%)            8: 17.8% (19%)             42
Domain deletion                                  23: 20.7% (42.6%)          15: 20.9% (27.8%)          16: 35.6% (29.6%)          54
Domain exchange                                  17: 15.3% (47%)            9: 12.5% (25%)             10: 22.2% (27.8%)          36
Complex rearrangement                            5: 4.5% (41.7%)            7: 9.7% (58.3%)            -                          12
Total                                            111                        72                         45                         228

Information in each cell describes the number of instances observed for a particular type of multidomain architecture change as well as function shift between relatives in parent and child COGs. Information on the percentage of total functional changes and the percentage of multidomain architecture changes (in brackets) is also presented.
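As a worked check of how the two percentages in each cell of Table 13.1 relate to the raw counts, the short sketch below recomputes them. The counts are taken from the table; the variable and function names are illustrative, and the empty cell for complex rearrangements with a type 3 change is assumed to mean zero instances.

```python
# Counts per architecture change: (type 1, type 2, type 3), from Table 13.1.
counts = {
    "identical": (53, 20, 11),
    "accretion": (13, 21, 8),
    "deletion": (23, 15, 16),
    "exchange": (17, 9, 10),
    "complex": (5, 7, 0),  # assumption: the empty type 3 cell is treated as 0
}

# Totals per functional-change type (columns) and per architecture change (rows).
column_totals = [sum(row[i] for row in counts.values()) for i in range(3)]
row_totals = {name: sum(row) for name, row in counts.items()}


def cell_percentages(name, type_index):
    """Return (% of the functional-change column, % of the architecture row)."""
    n = counts[name][type_index]
    return (round(100 * n / column_totals[type_index], 1),
            round(100 * n / row_totals[name], 1))


identical_type1 = cell_percentages("identical", 0)
```

For identical architectures with a type 1 change, this gives 47.7% of all type 1 changes and 63.1% of all identical-architecture scenarios, matching the first cell of the table; the column totals come out as 111, 72, and 45.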

Interestingly, the deletion and the exchange of a domain do not necessarily seem to dramatically change the function associated with the new multidomain architectures. In more than 42% of scenarios where a domain is deleted, and in 47% of those where a domain is exchanged for another, the parent and child functions fall in the same functional category (type 1 functional change). Perhaps some of these domains are lost because they are not essential for the function. Complex domain rearrangements account for only 12 evolutionary scenarios, in which function diverges within the same functional category or the same broad functional class (i.e., type 1 and type 2 functional changes). Because complex domain rearrangements are so infrequent, it is difficult to conclude precisely how this type of domain architecture rearrangement affects function change in relatives.

13.6 HOW SAFELY CAN FUNCTION BE INHERITED BETWEEN HOMOLOGUES?

The Gene3D database contains functional information from many diverse public sources (e.g., GO, KEGG) integrated with domain and protein families. This has allowed a recent study of the conservation of function between homologous proteins/domains to identify safe sequence identity thresholds for transferring functional annotations between relatives. Previous analyses of this type based on enzyme superfamilies in CATH or SCOP (Devos and Valencia, 2000; Rost, 2002; Todd et al., 2002; Tian and Skolnick, 2003) suggested that protein relatives sharing 40% or more sequence identity had more than 90% probability of having the same function at the third level of the EC classification. At this level, relatives perform the same catalytic functions, but they could be operating on different substrates, that is, possess different specificities. If domain relatives are being considered, higher thresholds (60% sequence identity) should be used to ensure a high probability of related functions. However, Rost (2002) showed that the bias in populations of enzyme superfamilies meant that some superfamilies were skewing the analysis and that when this was taken into account, much higher thresholds (70%) should be used for safe inheritance of function

Figure 13.9

Conservation and variation of enzyme function in CATH-Gene3D superfamilies. The degree of enzyme function variation is measured at the four levels of the EC classiﬁcation (variation is indicated by a hyphen). The number of S35 sequence families in CATH-Gene3D (secondary y-axis) measures sequence diversity in the functionally conserved and variable superfamilies (red line). Note that domain relatives found in functionally variable superfamilies populate about 13 times more S35 families in CATH-Gene3D than domains in functionally conserved superfamilies. (See insert for color representation of this ﬁgure.)

between protein relatives. Revisiting these analyses with a 10-fold larger data set and taking into account the bias in superfamily populations and also the bias in functional subfamilies within superfamilies, we have found that a threshold of 60% is more reasonable (Addou et al., 2009). Furthermore, when transferring functional annotations between domains, much lower levels of sequence identity can be used (20%) provided the domains share the same multidomain contexts. It can be seen from Figure 13.9 that in agreement with analyses discussed above on the correlation between family population in the genomes and functional diversity, the 400 families showing the lowest degree of function conservation are also those most highly populated in the genomes. Thus, although the majority of CATH domain superfamilies are highly functionally conserved, these tend to be small superfamilies. Perhaps of more interest is the difference in degree of functional conservation between domain superfamilies. Figure 13.10 shows that in some superfamilies, relatives are likely to be sharing similar functional annotations at sequence identity thresholds as low as 20%.
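The thresholds quoted in this section can be turned into a simple decision rule for annotation transfer. The sketch below is only an illustration: the function and constant names are hypothetical, it uses the 60% and 20% values reported above (Addou et al., 2009), and it ignores the family-specific thresholds shown in Figure 13.10, which would refine the rule for individual superfamilies.

```python
# Thresholds (percent sequence identity) quoted in the text.
GENERAL_THRESHOLD = 60.0        # general threshold for safe function transfer
SAME_CONTEXT_THRESHOLD = 20.0   # domains sharing the same multidomain context


def can_transfer_annotation(percent_identity, same_multidomain_context=False):
    """Decide whether a functional annotation may be inherited from a relative.

    `same_multidomain_context` should be True only when the two domains occur
    in the same multidomain architecture, in which case the lower threshold
    applies.
    """
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent_identity must lie between 0 and 100")
    threshold = SAME_CONTEXT_THRESHOLD if same_multidomain_context else GENERAL_THRESHOLD
    return percent_identity >= threshold


# A domain pair at 25% identity qualifies only if the domain context is shared.
shared_context = can_transfer_annotation(25.0, same_multidomain_context=True)
different_context = can_transfer_annotation(25.0)
```

In this example the transfer is accepted for the shared-context pair and rejected otherwise, which reflects the point made above that domain context allows much lower identity thresholds.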

Figure 13.10

Family-speciﬁc sequence identity thresholds for function conservation. Threshold distribution among CATH-Gene3D enzyme superfamilies and the number of member sequences, where conservation of the ﬁrst three EC digits is at least 90% in all families. Functional comparisons were performed at the levels of the domain sequence. (See insert for color representation of this ﬁgure.)

13.7 HOW ARE DOMAIN FAMILIES DISTRIBUTED IN PROTEIN COMPLEXES?

In some recent analyses, we have explored the distribution of CATH superfamilies among protein complexes in several species. This was achieved by clustering networks of known protein–protein interactions using the MCL algorithm (Enright et al., 2002) as described by Brohee and van Helden (2006). Such clusters of interacting proteins have been found to represent protein complexes and thus groups of proteins that are involved in common biological processes (Pereira-Leal et al., 2004). Various analyses have shown that homologous proteins conserve their interactions to some extent and are often found either interacting or involved in the same or similar biological processes (Brun et al., 2003; Baudot et al., 2004; Mika and Rost, 2006; Pereira-Leal et al., 2007). However, study of the distribution of domain families in networks has been relatively neglected. Protein domain superfamilies allow more distant evolutionary relationships between proteins to be identified. Domain–domain interactions are more conserved than protein–protein interactions (Stein et al., 2005), and therefore domains may allow a greater insight into the evolution of protein complexes and interaction networks. It has been proposed that a proportion of protein complexes in yeast have evolved from duplicated homodimers, that is, interacting homologues (Pereira-Leal et al., 2007). Our analysis showed that, whether considering whole protein families (with common multidomain architectures) or domain superfamilies, the tendency for homologues to occur together in complexes is significantly lower in E. coli than in S. cerevisiae and D. melanogaster. This suggests that protein complexes evolve differently in E. coli due to the restricted size of protein domain superfamilies in prokaryotes.
We wanted to know which superfamilies tended to cluster together in complexes more than expected by chance and were therefore likely to contribute to the model of complex formation described above. When considering how superfamilies in general are distributed among complexes, there is a strong correlation between family size and number of complexes in which that family is found. This suggests that most protein domain families may be randomly distributed among complexes. Indeed, we find that the majority of superfamilies in each organism studied were not found together more often than expected. In E. coli, there were two superfamilies that were clustered together more than would be expected by chance. In yeast and fly, several more superfamilies did tend to colocate in protein complexes (Table 13.2). A total of 656, 630, and 724 different superfamilies were detected in E. coli, yeast, and fly, respectively. Although superfamilies tend to spread randomly across complexes, they might still be involved in complexes with similar functions. To test this, we compared the functional annotation of complexes using gene ontology semantic similarity (GOSS) scores as described by Lord et al. (2003). The GOSS score between two complexes was determined as the average GOSS score between each intercomplex protein pair. We found that, in general, the similarity between complexes containing a particular superfamily (ignoring proteins containing that superfamily) and random complexes of the same size was not different from that expected by chance. In E. coli, we find that only two CATH superfamilies are found in complexes that tend to be more functionally similar to each other than expected by chance. These are nucleotidyltransferase (3.30.420.40) and carbamate kinase (3.40.1160.10).
There are six in yeast, “winged helix” repressor DNA binding domain (1.10.10.10), homeodomain-like (1.10.10.60), histone acetyltransferase, chain A (1.20.920.10), an OB-fold superfamily (2.40.50.40), erythroid transcription factor (3.30.50.10), and an aminopeptidase superfamily (3.40.630.30). There was only one

Table 13.2 Superfamilies in E. coli, Yeast, and Fly That Are Nonrandomly Distributed in Complexes with p < 0.05 Species

Superfamily

E. coli

Nucleic acid binding proteins (2.40.50.140) NAD(P) binding Rossmann-like domain (3.40.50.720) RNA binding (2.30.30.100) Glutamine phosphoribosylpyrophosphate, subunit 1, domain 1 (3.60.20.10) Quinoprotein amine dehydrogenase (2.130.10.10) Protein tyrosine phosphatase superfamily (3.90.190.10) 3.10.20.30 P-loop containing nucleotide triphosphate hydrolases (3.40.50.300) Cyclin superfamily (1.10.472.10) RNA binding (2.30.30.100) Immunoglobulins (2.60.40.10) Minichromosome maintenance (MCM) complex, chain A, domain 1 (3.30.1640.10) Phosphatase (3.60.21.10)

Yeast

Fly

Frequency

MYOD basic-helix-loop-helix domain, subunit B (4.10.280.10) Phosphatidylinositol 3-kinase catalytic subunit; chain A, domain 1 (3.10.20.90) Retinoid X receptor (1.10.565.10) P-loop containing nucleotide triphosphate hydrolases (3.40.50.300) (1.20.58.60)

Species distribution

23 73 11 18

Universal Universal Universal Universal

68 12 5 190

Universal Eukaryotic Universal Universal

21 11 63 5

37

Universal Universal Universal Eukaryotic/archaeal Eukaryotic/bacterial Eukaryotic

42

Eukaryotic

12 202

Eukaryotic Universal

18

Universal

21

Superfamilies are considered to belong to a kingdom when they are found in at least 70% of completed genomes from that kingdom. Universal refers to eukaryotes, eubacteria, and archaea. CATH superfamily codes are shown in brackets.

significant superfamily in fly: Dbl homology domain chain A (1.20.900.10). These do not overlap at all with those superfamilies found to be nonrandomly distributed among complexes. We found that the members of most CATH protein domain superfamilies are randomly spread in the protein complexes of E. coli, yeast, and fly. In addition, these superfamilies tend to be found among complexes whose functions are no more similar than expected by chance. This phenomenon appears to be stronger in E. coli than in the two eukaryotes examined, suggesting that superfamily members tend to share common biological process functions less often in prokaryotes than in eukaryotes.
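The complex-level similarity used in this section, defined as the average GOSS score over all intercomplex protein pairs, can be sketched as follows. The function name and the toy scores are hypothetical, and the pairwise GOSS score itself (Lord et al., 2003) is assumed to be supplied by the caller rather than computed here.

```python
from itertools import product


def complex_similarity(complex_a, complex_b, protein_goss):
    """Average pairwise score over all protein pairs drawn one from each complex.

    `protein_goss` is any callable returning a semantic similarity score for
    two protein identifiers; here it stands in for a real GOSS implementation.
    """
    pairs = list(product(complex_a, complex_b))
    if not pairs:
        raise ValueError("both complexes must contain at least one protein")
    return sum(protein_goss(p, q) for p, q in pairs) / len(pairs)


# Toy pairwise scores for two small complexes (illustrative values only).
toy_scores = {
    ("p1", "q1"): 0.25, ("p1", "q2"): 0.5,
    ("p2", "q1"): 0.75, ("p2", "q2"): 1.0,
}
score = complex_similarity(["p1", "p2"], ["q1", "q2"],
                           lambda p, q: toy_scores[(p, q)])
```

For the toy values above the complex-level score is 0.625, the mean of the four intercomplex pairs. Comparing such scores against those obtained for random complexes of the same size would give the chance expectation referred to in the text.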

REFERENCES

ABASCAL, F. and VALENCIA, A., 2002. Clustering of proximal sequence space for the identification of protein families. Bioinformatics 18: 908–921. ADDOU, S., RENTZSCH, R., LEE, D., and ORENGO, C.A., 2009. Domain-based and family-specific sequence identity thresholds increase the levels of reliable protein function transfer. J. Mol. Biol. 387: 416–430. ANDREEVA, A., HOWORTH, D., CHANDONIA, J.M., BRENNER, S.E., HUBBARD, T.J., CHOTHIA, C., and MURZIN, A.G., 2008. Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 36: D419–D425. ARAVIND, L., ANANTHARAMAN, V., and KOONIN, E.V., 2002. Monophyly of class I aminoacyl tRNA synthetase, USPA, ETFP, photolyase, and PP-ATPase nucleotide-binding domains: implications for protein evolution in the RNA world. Proteins 48: 1–14. ARTYMIUK, P.J., POIRRETTE, A.R., RICE, D.W., and WILLETT, P., 1996. Biotin carboxylase comes into the fold. Nat. Struct. Biol. 3: 128–132. BASHTON, M. and CHOTHIA, C., 2007. The generation of new protein functions by the combination of domains. Structure 15: 85–99. BAUDOT, A., JACQ, B., and BRUN, C., 2004. A scale of functional divergence for yeast duplicated genes revealed from analysis of the protein–protein interaction network. Genome Biol. 5: R76. BJORKLUND, A.K., EKMAN, D., LIGHT, S., FREY-SKOTT, J., and ELOFSSON, A., 2005. Domain rearrangements in protein evolution. J. Mol. Biol. 353: 911–923. BROHEE, S. and van HELDEN, J., 2006. Evaluation of clustering algorithms for protein–protein interaction networks. BMC Bioinf. 7: 488. BRUN, C., CHEVENET, F., MARTIN, D., WOJCIK, J., GUENOCHE, A., and JACQ, B., 2003. Functional classification of proteins for the prediction of cellular function from a protein–protein interaction network. Genome Biol. 5: R6. BURROUGHS, A.M., ALLEN, K.N., DUNAWAY-MARIANO, D., and ARAVIND, L., 2006. Evolutionary genomics of the HAD superfamily: understanding the structural adaptations and catalytic diversity in a superfamily of phosphoesterases and allied enzymes. J. Mol. Biol. 361: 1003–1034. CHOTHIA, C., 1992. Proteins. One thousand families for the molecular biologist. Nature 357: 543–544. CHOTHIA, C. and LESK, A.M., 1986. The relation between the divergence of sequence and structure in proteins. EMBO J. 5: 823–826. CHOTHIA, C., GOUGH, J., VOGEL, C., and TEICHMANN, S.A., 2003. Evolution of the protein repertoire. Science 300: 1701–1703.
CUFF, A., REDFERN, O., GREENE, L.H., SILLITOE, I., LEWIS, T., DIBLEY, M., REID, A.J., PEARL, F., DALLMAN, T., TODD, A.E., GARRAT, R., THORNTON, J., and ORENGO, C.A., 2009. The CATH hierarchy revisited: structural divergence in domain superfamilies and the continuity of fold space. Structure, 17(8): 1051–1062. DEVOS, D. and VALENCIA, A., 2000. Practical limits of function prediction. Proteins 41: 98–107. ENRIGHT, A.J., Van DONGEN, S., and OUZOUNIS, C.A., 2002. An efﬁcient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30: 1575–1584. FONG, J.H., GEER, L.Y., PANCHENKO, A.R., and BRYANT, S.H., 2007. Modeling the evolution of protein domain architectures using maximum parsimony. J. Mol. Biol. 366: 307–315. FREY, B.J. and DUECK, D., 2007. Clustering by passing messages between data points. Science 315: 972–976.

GRANT, A., LEE, D., and ORENGO, C., 2004. Progress towards mapping the universe of protein folds. Genome Biol. 5: 107. GREENE, L.H., LEWIS, T.E., ADDOU, S., CUFF, A., DALLMAN, T., DIBLEY, M., REDFERN, O., PEARL, F., NAMBUDIRY, R., REID, A. et al., 2007. The CATH domain structure database: new protocols and classiﬁcation levels give a more comprehensive resource for exploring evolution. Nucleic Acids Res. 35: D291–D297. JAIN, R., RIVERA, M.C., and LAKE, J.A., 1999. Horizontal gene transfer among genomes: the complexity hypothesis. Proc. Natl. Acad. Sci. USA 96: 3801–3806. KOONIN, E.V., 2003. Comparative genomics, minimal genesets and the last universal common ancestor. Nat. Rev. Microbiol. 1: 127–136. KUNIN, V. and OUZOUNIS, C.A., 2003. GeneTRACE: reconstruction of gene content of ancestral species. Bioinformatics 19: 1412–1416. LEONIDAS, D.D., VATZAKI, E.H., VORUM, H., CELIS, J.E., MADSEN, P., and ACHARYA, K.R., 1998. Structural basis for the recognition of carbohydrates by human galectin-7. Biochemistry 37: 13930–13940. LORD, P.W., STEVENS, R.D., BRASS, A., and GOBLE, C.A., 2003. Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 19: 1275–1283. MAIDAK, B.L., COLE, J.R., LILBURN, T.G., PARKER, C.T., JR., SAXMAN, P.R., FARRIS, R.J., GARRITY, G.M., OLSEN, G.J., SCHMIDT, T.M., and TIEDJE, J.M., 2001. The RDP-II (Ribosomal Database Project). Nucleic Acids Res. 29: 173–174. MCGUFFIN, L.J. and JONES, D.T., 2003. Improvement of the GenTHREADER method for genomic fold recognition. Bioinformatics 19: 874–881. MCGUFFIN, L.J., SMITH, R.T., BRYSON, K., SORENSEN, S.A., and JONES, D.T., 2006. High throughput proﬁle–proﬁle based fold recognition for the entire human proteome. BMC Bioinf. 7: 288. MICHEL, G., CHANTALAT, L., DUEE, E., BARBEYRON, T., HENRISSAT, B., KLOAREG, B., and DIDEBERG, O., 2001. The kappa-carrageenase of P. 
carrageenovora features a tunnel-shaped active site: a novel insight in the evolution of Clan-B glycoside hydrolases. Structure 9: 513–525. MIKA, S. and ROST, B., 2006. Protein–protein interactions more conserved within species than across species. PLoS Comput. Biol. 2: e79. MIRKIN, B.G., FENNER, T.I., GALPERIN, M.Y., and KOONIN, E.V., 2003. Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes. BMC Evol. Biol. 3: 2. OJHA, S., MENG, E.C., and BABBITT, P.C., 2007. Evolution of function in the “two dinucleotide binding domains” ﬂavoproteins. PLoS Comput. Biol. 3: e121. PEREIRA-LEAL, J.B., ENRIGHT, A.J., and OUZOUNIS, C.A., 2004. Detection of functional modules from protein interaction networks. Proteins 54: 49–57.

PEREIRA-LEAL, J.B., LEVY, E.D., KAMP, C., and TEICHMANN, S. A., 2007. Evolution of protein complexes by duplication of homomeric interactions. Genome Biol. 8: R51. RANEA, J.A., SILLERO, A., THORNTON, J.M., and ORENGO, C.A., 2006. Protein superfamily evolution and the last universal common ancestor (LUCA). J. Mol. Evol. 63: 513–525. REDFERN, O.C., HARRISON, A., DALLMAN, T., PEARL, F.M., and ORENGO, C.A., 2007. CATHEDRAL: a fast and effective algorithm to predict folds and domain boundaries from multidomain protein structures. PLoS Comput. Biol. 3: e232. REEVES, G.A., DALLMAN, T.J., REDFERN, O.C., AKPOR, A., and ORENGO, C.A., 2006. Structural diversity of domain superfamilies in the CATH database. J. Mol. Biol. 360: 725–741. ROST, B., 2002. Enzyme function less conserved than anticipated. J. Mol. Biol. 318: 595–608. SNEL, B., BORK, P., and HUYNEN, M.A., 2002. Genomes in ﬂux: the evolution of archaeal and proteobacterial gene content. Genome Res. 12: 17–25. STEIN, A., RUSSELL, R.B., and ALOY, P., 2005. 3did: interacting protein domains of known three-dimensional structure. Nucleic Acids Res. 33: D413–D417.

TIAN, W. and SKOLNICK, J., 2003. How well is enzyme function conserved as a function of pairwise sequence identity? J. Mol. Biol. 333: 863–882. TODD, A.E., ORENGO, C.A., and THORNTON, J.M., 2001. Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol. 307: 1113–1143. TODD, A.E., ORENGO, C.A., and THORNTON, J.M., 2002. Sequence and structural differences between enzyme and nonenzyme homologs. Structure 10: 1435–1451. WILSON, D., MADERA, M., VOGEL, C., CHOTHIA, C., and GOUGH, J., 2007. The SUPERFAMILY database in 2007: families and functions. Nucleic Acids Res. 35: D308–D313. WOESE, C.R., 1987. Bacterial evolution. Microbiol. Rev. 51: 221–271. YANG, S., DOOLITTLE, R.F., and BOURNE, P.E., 2005. Phylogeny determined by protein domain content. Proc. Natl. Acad. Sci. USA 102: 373–378. YEATS, C., LEES, J., REID, A., KELLAM, P., MARTIN, N., LIU, X., and ORENGO, C., 2008. Gene3D: comprehensive structural and functional annotation of genomes. Nucleic Acids Res. 36: D414–D418.

Chapter 14

Noncoding RNA

Alexander Donath, Sven Findeiß, Jana Hertel, Manja Marz, Wolfgang Otto, Christine Schulz, Peter F. Stadler, and Stefan Wirth

14.1 INTRODUCTION
14.2 ANCIENT RNAs
14.3 DOMAIN-SPECIFIC RNAs
14.4 CONSERVED ncRNAs WITH LIMITED DISTRIBUTION
14.5 ncRNAs FROM REPEATS AND PSEUDOGENES
14.6 mRNA-LIKE ncRNAs
14.7 RNAs WITH DUAL FUNCTIONS
14.8 CONCLUDING REMARKS
ACKNOWLEDGMENTS
REFERENCES

14.1 INTRODUCTION

The advent of high-throughput techniques that allow comprehensive unbiased studies of transcription has led to a dramatic change in our understanding of genome organization. A decade ago, the genome was seen as a linear arrangement of separated individual genes, which are predominantly protein-coding, with a small set of ancient noncoding “housekeeping” RNAs such as tRNA and rRNA dating all the way back to an RNA-world. However, in contrast to this simple view, more recent studies reveal a much more complex genomic picture. The ENCODE Pilot Project (The ENCODE Project Consortium, 2007), the mouse cDNA project FANTOM (Maeda et al., 2006), and a series of other large-scale transcriptome studies (e.g., Ravasi et al. (2006)) leave no doubt that the mammalian transcriptome is characterized by a complex mosaic of overlapping, bidirectional transcripts, and a plethora of nonprotein-coding transcripts arising from the same locus (Figure 14.1). This newly discovered complexity is not unique to mammals. Similar high-throughput studies in invertebrate animals (Manak et al., 2006; He et al., 2007) and plants

Evolutionary Genomics and Systems Biology, edited by Gustavo Caetano-Anolles Copyright 2010 John Wiley & Sons, Inc.

[Figure 14.1 labels: highly transcribed regions; mosaic transcript; noncoding transcript; protein-coding mRNA; coding exons; ncRNA; taRNAs; intronic ncRNA; antisense transcripts; paRNAs]

Figure 14.1

Sketch of the post-ENCODE view of a mammalian transcriptome (adapted from Kapranov et al., 2007). Highly transcribed regions consist of a complex mosaic of overlapping transcripts (arrows) in both reading-directions. These transcripts link together the locations of several protein-coding genes (coding exon indicated by black rectangles). Conversely, multiple transcription products, many of which are noncoding, are processed from the same locus as a protein-coding mRNA.

(Li et al., 2007a) demonstrate the generality of the mammalian genome organization among higher eukaryotes. Even the yeasts Saccharomyces cerevisiae and Schizosaccharomyces pombe, whose genomes have been considered to be well understood, are surprising us with a much richer repertoire of transcripts than previously thought (Havilio et al., 2005; David et al., 2006; Miura et al., 2006; Wilhelm et al., 2008). Even in bacteria, an unexpected complexity of regulatory RNAs was discovered in recent years (Gottesman, 2004). Given the importance and ubiquity of noncoding RNAs (ncRNAs) and RNA-based mechanisms in all extant life forms, it is surprising that we still know relatively little about the evolutionary history of most RNA classes, although a series of systematic studies have greatly improved our understanding since our first attempt at a comprehensive review of this topic (Bompfünewerer et al., 2005). There are strong reasons to conclude that the last universal common ancestor (LUCA) was preceded by simpler life forms that were based primarily on RNA. In this RNA-world scenario (Gilbert, 1986; Gesteland and Atkins, 1993), the translation of RNA into proteins and the usage of DNA (Freeland et al., 1999) as an information storage device are later innovations. The wide range of catalytic activities that can be realized by relatively small ribozymes (Serganov and Patel, 2007; Strobel and Cochrane, 2007) as well as the usage of RNA catalysis at crucial points of the information metabolism of modern cells provides further support for the RNA-world hypothesis. Multiple ancient ncRNAs are involved in translation: the ribosome itself is an RNA machine (Moore and Steitz, 2002), tRNAs perform a major part of the decoding on the messenger RNAs, and RNase P, another ribozyme, is involved in processing of primary tRNA transcripts.
The signal recognition particle, another ribonucleoprotein (RNP), also interacts with the ribosome and organizes the transport of secretory proteins to their target locations. For a discussion of rRNAs and tRNAs, refer to Chapter 16. On the other hand, most functional ncRNAs do not date back to the LUCA but are the result of later innovations. Some crucial “housekeeping” functions involve domain-specific ncRNAs. Eukaryotes, for instance, have invented the splicing machinery involving several small spliceosomal RNAs (snRNAs), while bacteria use tmRNA to free stalled ribosomes and the 6S RNA as a common transcriptional regulator. The invention of the RNAi machinery in eukaryotes and the subsequent evolution of microRNAs in plants and animals are discussed in detail in Chapter 15.


Table 14.1 Selected Experimental Surveys for ncRNAs

Organism                   ncRNAs (a)   ncRNAs (b)   Reference
C. elegans                 161          31           Deng et al. (2006)
D. discoideum              20           16           Aspegren et al. (2004)
Aspergillus fumigatus      30           15           Jöchl et al. (2008)
Giardia intestinalis       30           26           Chen et al. (2007)
P. falciparum              41           6            Chakrabarti et al. (2007)
Caulobacter crescentus     3            27           Landt et al. (2008)
Sulfolobus solfataricus    22           23           Zago et al. (2005)
S. solfataricus            31           33           Tang et al. (2005)

We only list, without claim to completeness, extensive studies for RNAs larger than those associated with the RNA machinery and smaller than typical mRNAs.
(a) ncRNAs for which at least membership in a known class such as snoRNAs could be established.
(b) ncRNAs without annotation.
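As a reading aid for Table 14.1, the share of unannotated ncRNAs in each survey can be computed directly from the two count columns. The following is a minimal sketch, assuming that the first row of counts in the table belongs to column a (annotated) and the second to column b (unannotated); the names `SURVEYS` and `unannotated_fraction` are illustrative and not taken from any of the cited studies.

```python
# Counts transcribed from Table 14.1: (organism, annotated "a", unannotated "b").
# Assumption: the first count row of the table is column a, the second column b.
SURVEYS = [
    ("C. elegans", 161, 31),
    ("D. discoideum", 20, 16),
    ("Aspergillus fumigatus", 30, 15),
    ("Giardia intestinalis", 30, 26),
    ("P. falciparum", 41, 6),
    ("Caulobacter crescentus", 3, 27),
    ("Sulfolobus solfataricus (Zago et al.)", 22, 23),
    ("S. solfataricus (Tang et al.)", 31, 33),
]


def unannotated_fraction(annotated, unannotated):
    """Return the share of ncRNAs in one survey that lack a class annotation."""
    return unannotated / (annotated + unannotated)


if __name__ == "__main__":
    for organism, a, b in SURVEYS:
        print(f"{organism:40s} {unannotated_fraction(a, b):.2f}")
```

Under this reading of the columns, the unannotated share varies widely between surveys, which is consistent with the remark that many of the reported elements have no detectable homologues elsewhere.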

The innovation of ncRNAs is an ongoing process. In fact, most experimental surveys for ncRNAs (Table 14.1) report lineage-specific elements without detectable homologies in other species. An overview of evolutionarily older ncRNA families is compiled in Figure 14.2 without claim to completeness. Many RNA classes, however, such as Y RNAs and vault RNAs, and most bacterial ncRNA families have not been studied in sufficient detail to date their origin with certainty. Some of them thus might have originated earlier than shown.

Figure 14.2 Origins of major ncRNA families. The origin of each ncRNA family is marked on the branch leading to the last common ancestor of its known representatives. For details on RNAi and microRNAs, refer to Chapter 15. The microRNA families of (a) eumetazoa (animals except sponges), (b) the slime mold Dictyostelium, (c) embryophyta (land plants), and (d) the green alga Chlamydomonas are nonhomologous. In addition, the putative origin of the RNAi mechanism and the microRNA pathways is indicated. (See insert for color representation of this figure.)

14.2 ANCIENT RNAs

14.2.1 RNase P and RNase MRP RNA

Ribonuclease P (RNase P) and RNase MRP (mitochondrial RNA processing) are ribonucleoprotein complexes that act as endoribonucleases in tRNA and rRNA processing, respectively. Their RNA subunits are evolutionarily related and are involved in the catalytic activity of the enzymes. While it has long been known that RNase P RNA is a ribozyme in bacteria and several archaea, it was demonstrated only recently that eukaryotic RNase P RNA also exhibits ribozyme activity (Willkomm and Hartmann, 2007). The main function of RNase P is the generation of the mature 5′-ends of tRNAs (Walker and Engelke, 2006). In contrast, RNase MRP is eukaryote specific. It processes nuclear precursor rRNA (cleaving the A3 site and leading to the maturation of the 5′-end of 5.8S rRNA), generates RNA primers for mitochondrial DNA replication, and is involved in the degradation of certain mRNAs. The phylogenetic distribution of P RNA clearly indicates that it dates back to LUCA (Piccinelli et al., 2005). MRP RNA can be traced to the most basal eukaryotes (Piccinelli et al., 2005) and apparently was part of the rRNA processing cascade of the eukaryotic ancestor (Woodhams et al., 2007). The high similarity of P and MRP RNA secondary structures (Collins et al., 2000) and the similarity of the protein contents and interactions of RNase P and MRP (Aspinall et al., 2007; Walker and Engelke, 2006) suggest that P and MRP RNAs are paralogues. RNase P RNA is found almost ubiquitously. Interestingly, so far only MRP RNA has been found in plants, including green algae and red algae (Piccinelli et al., 2005). Whether the ancestral P RNA has been lost in these clades or possibly replaced by MRP RNA is unclear. It is also possible that the P RNA sequences are so derived that they have escaped detection so far.
Despite the highly conserved core structures, P and MRP RNAs can exhibit dramatic variations in size, which mostly arise from large insertions in several “expansion domains” (Piccinelli et al., 2005; Kachouri et al., 2005). In eukaryotes, additional P RNAs are often encoded in organelle genomes. Chloroplast P RNA is structurally similar to bacterial type A (de la Cruz and Vioque, 2003) and exhibits ribozyme activity (Li et al., 2007a). Mitochondrial P RNAs, in particular those of fungi, are highly derived and exhibit only a small subset of the conserved structural elements shown in Figure 14.3, mostly P1, P4, and P18 (Seif et al., 2003). Despite its core function in tRNA processing, RNase P appears to be absent in the archaeon Nanoarchaeum equitans. Instead, placement of its tRNA gene promoters allows the synthesis of leaderless tRNAs (Randau et al., 2008).

Figure 14.3 Schematic drawing of the consensus structures of P and MRP RNAs. Adapted from Marquez et al., 2005; Walker and Engelke, 2006; Woodhams et al., 2007; and Zhu et al., 2007. The table indicates the distribution of structural elements. Black circles indicate conserved elements, stems indicated in gray are present in known sequences, and open circles refer to elements that are sometimes present. (See insert for color representation of this figure.)

14.2.2 Signal Recognition Particle RNA

The signal recognition particle (SRP) is a ribonucleoprotein that interacts with the ribosome during the synthesis and translocation of secretory proteins. SRP is responsible for the cotranslational targeting of proteins that contain signal peptides to membranes including the prokaryotic plasma membrane and the endoplasmic reticulum. SRP first recognizes the signal sequence of the nascent polypeptide and then the SRP receptor in the membrane. The interaction between the SRP and its receptor involves a conformational switch that appears to be controlled by the SRP RNA. The assembled SRP consists of two structurally and functionally distinct domains. The small domain, which consists of the Alu-domain of the SRP RNA and the proteins SRP9 and SRP14, modulates the elongation of the secretory protein. The larger S-domain, which consists of the S-domain of the SRP RNA and several specific proteins, captures the signal peptide. Eukaryotic SRP RNA is also known as 7SL RNA. SRP RNAs are highly conserved across large phylogenetic distances and fit a generalized structure, shown in Figure 14.4, that can be used to define a standard nomenclature for the individual structural features (Zwieb et al., 2005). Eukaryotes, Archaea, Firmicutes, and several other early branching bacteria such as Thermotoga maritima have large SRPs containing both domains. In contrast, most Gram-negative bacteria as well as all known organellar SRPs have a reduced SRP RNA consisting of helix 8 only (Andersen et al., 2006). Since Firmicutes are among the earliest branching bacterial phyla, the SRP RNAs have most likely lost the Alu-domain in higher bacteria.


Figure 14.4 Standard nomenclature of the helices of SRP RNA and their phylogenetic distribution. A subgroup of bacteria, which includes proteobacteria, has a drastically reduced SRP RNA. Cases where the existence of the helix is unknown are indicated by ?. Adapted from Zwieb et al., 2005; Andersen et al., 2006. (See insert for color representation of this figure.)

It is interesting to note that the SRP of trypanosomatids contains a second, tRNA-like, RNA component (Liu et al., 2003).

14.2.3 snoRNAs

Small nucleolar RNAs (snoRNAs) represent one of the most abundant classes of ncRNAs. They act as guides for single nucleotide modifications in nascent ribosomal RNAs and other RNAs in the nucleolus of eukaryotic species (Decatur et al., 2007). While no targets are present or known for so-called orphan snoRNAs, most guide snoRNAs target ribosomal RNAs (rRNAs) or small nuclear RNAs (snRNAs). Recently, there has been a report that several snoRNAs may also target tRNAs and other snoRNAs (Zemann et al., 2006). The orphan snoRNAs have been implicated in modulating alternative splicing (Kishore and Stamm, 2006; Bazeley et al., 2008). Based on conserved sequence elements and secondary structure, snoRNAs are divided into two major classes designated as C/D and H/ACA snoRNAs. The two classes differ in their characteristic secondary structure and in the sequence motifs that gave them their names: C/D snoRNAs exhibit the “boxes” C and D with consensus sequences (A)UGAUGA and CUGA, respectively. They frequently contain a second copy of these two boxes, usually designated C′ and D′, in the region between the C and D boxes. The H box of the H/ACA snoRNAs has the consensus sequence ANANNA. The ACA motif located at the 3′-end of the molecule is highly conserved. Believed to be protein binding elements, the sequence boxes are located in unpaired regions (Figure 14.5). The characteristic secondary structures of snoRNAs in Figure 14.5 are inferred mostly from phylogenetic comparisons. Thermodynamic predictions using, for example, mfold or the Vienna RNA Package may differ considerably from the functional structure, indicating that the functional RNA structure within the small nucleolar ribonucleoprotein (snoRNP) particles is changed by the protein components, which bring the binding sites into the correct conformation and thus help to bind the target (Ganot et al., 1997; Samarsky et al., 1998).
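The box consensus sequences quoted above lend themselves to a simple motif scan. The following is a minimal sketch, not a snoRNA gene finder: it only looks for exact matches to the consensus strings given in the text. The function names, the requirement that a D box lie downstream of a C box, and the `tail` window used to decide whether ACA counts as being "at the 3′-end" are assumptions made for illustration. Real snoRNAs deviate from the consensus, so a scan of this kind would be expected to miss genuine genes and to report spurious hits.

```python
import re

# Consensus motifs as quoted in the text (RNA alphabet).
C_BOX = "UGAUGA"                   # the leading A of (A)UGAUGA is optional
D_BOX = "CUGA"
H_BOX = "A[ACGU]A[ACGU][ACGU]A"    # ANANNA


def normalize(seq):
    """Upper-case the sequence and convert DNA notation (T) to RNA (U)."""
    return seq.upper().replace("T", "U")


def find_boxes(seq, motif):
    """Return 0-based start positions of all matches, including overlapping ones."""
    return [m.start() for m in re.finditer(f"(?={motif})", normalize(seq))]


def classify_candidate(seq, tail=6):
    """Label a sequence as 'C/D-like' and/or 'H/ACA-like' from box motifs alone.

    Assumptions (illustrative only): a C/D candidate has a D box that starts
    downstream of a complete C box; an H/ACA candidate has an H box upstream
    of an ACA triplet that starts within the last `tail` nucleotides.
    """
    rna = normalize(seq)
    labels = set()
    c_hits = find_boxes(rna, C_BOX)
    d_hits = find_boxes(rna, D_BOX)
    if any(d >= c + len(C_BOX) for c in c_hits for d in d_hits):
        labels.add("C/D-like")
    aca = rna.rfind("ACA", max(0, len(rna) - tail))
    if aca != -1 and any(h + 6 <= aca for h in find_boxes(rna, H_BOX)):
        labels.add("H/ACA-like")
    return labels
```

For example, `classify_candidate("GG" + "AUGAUGA" + "C" * 20 + "CUGA" + "CC")` reports a C/D-like arrangement. Because the functional structure is shaped by the snoRNP proteins, a motif scan of this kind would in practice be combined with target complementarity and phylogenetic conservation.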
Figure 14.5 Schematic representation of H/ACA and C/D snoRNA structures and their characteristic consensus sequence motifs (boxes). H/ACA snoRNAs (left) fold into a double stem-loop structure with the Box H between both stems and the ACA motif immediately following the second stem. The target is bound to both stem-loop structures inside their symmetrical interior loops. Neither the U that is modified nor the following nucleotide is bound to the snoRNA: instead, they are centered within the interior loop. The functional structure of C/D snoRNAs is probably stabilized by proteins bound to the boxes, since there is only a short (3–10 nt) stem that encloses boxes C and D while the region between the two boxes is mainly unpaired. The target RNA is bound upstream of box D so that the fifth nucleotide is methylated. Many C/D snoRNAs have a second copy of both boxes between box C and box D (namely, C′ and D′), with a second functional binding site upstream of box D′. (See insert for color representation of this figure.)

The poor predictability of their secondary structures is a major obstacle for computational snoRNA gene finding (Hertel et al., 2008 and references therein). The two major classes direct distinct chemical modifications. Most C/D snoRNAs determine specific target nucleotides for methylation of the 2′-hydroxyl position of the sugar, while most of the H/ACA snoRNAs specify uridines that are converted into pseudouridine by rotation of the base. These modifications apparently influence the rRNA structure and are functionally essential (Lafontaine and Tollervey, 2001; Decatur et al., 2007). A few exceptional snoRNAs of both classes are also involved in pre-rRNA processing but do not lead to chemical modifications. Instead, they mediate structural changes of the rRNA to establish the correct conformation for endonuclease cleavage (Atzorn et al., 2004). In human and yeast, these are U3, U17 (also known as snR30), U22, and U14, as well as vertebrate U8 and yeast snR10. A subgroup of snoRNAs (only U3, U8, and U13 in human, with several additional families in nematodes) shares features of snRNAs, such as a posttranscriptionally hypermethylated 2,2,7-trimethylguanosine (TMG) cap at their 5′-end (Jia et al., 2007). Another small subgroup of both C/D and H/ACA snoRNAs carries an additional localization signal for the Cajal body (Jady et al., 2004). These small Cajal body-specific RNAs (scaRNAs) target snRNAs and are retained in the Cajal bodies (or yeast nuclear bodies, respectively). These sequences may be chimeric composites containing both a C/D and an H/ACA domain for recruitment of both classes of snoRNPs; an example is U85 (E and Kiss, 2001). A number of scaRNAs are known that modify the RNA pol II-specific U1, U2, U4, and U5 snRNAs. For example, the human U85 scaRNA directs 2′-O-methylation of the C45 and pseudouridylation of the U46 residues in the U5 spliceosomal snRNA (Darzacq et al., 2002; E and Kiss, 2001). The snoRNA-based RNA modification system is found in both Eukaryota and Archaea (Tang et al., 2002; Terns and Terns, 2002; Reichow et al., 2007). However, little is known about the evolution of individual snoRNA families. Even the characteristic sequence motifs of snoRNAs are modified in several basal eukaryotic lineages.
In Leishmania major, the ACA box is systematically replaced by an AGA motif (Liang et al., 2006), while other snoRNA molecules in the primitive eukaryote Giardia lamblia show mutations in the D box of C/D-like snoRNAs (Yang et al., 2005). Due to their poor sequence conservation, it is a nontrivial problem to establish the homology of snoRNAs over large evolutionary distances (e.g., between a mammalian and a yeast sequence). A plausible assumption is that homologous snoRNAs target homologous positions. Based on this assumption, C/D snoRNAs in the Drosophila melanogaster genome were successfully identified (Accardo et al., 2004). Additionally, a table of human/yeast correspondences can be found in the snoRNABase (Lestrade and Weber, 2008). On much shorter timescales, the investigation of snoRNA evolution in nematodes (Zemann et al., 2006) and in mammals (Schmitz et al., 2008) has shown that snoRNAs, like many other ncRNAs, evolve by duplication events and by coevolution with their targets. As with duplicated proteins (Force et al., 1999), three different fates for duplicated snoRNAs were observed: (i) inactivation and eventual loss of one of the paralogues, (ii) maintenance of function and the same target in both paralogues, and (iii) divergence of the target site of one of the paralogues. This may provide a mechanism by which the extant diversity of snoRNAs may have evolved from a single ancestral C/D and a single ancestral H/ACA snoRNA. SnoRNAs are found in two different genomic contexts: within introns of protein-coding genes and in independent transcription units. In both cases, the snoRNA is initially processed as a pre-snoRNA that requires further maturation (Weinstein and Steitz, 1999). Almost all snoRNAs in vertebrates are localized within introns of “housekeeping” genes, in particular of genes for ribosomal proteins. However, in several cases the host gene is known to be noncoding (e.g., Bachellerie et al., 2002).
Outside the Metazoa, gene organization is more diverse. In yeast, most snoRNAs are processed from independent mono-, di-, or polycistronic RNA transcripts (Vincenti et al., 2007). Such polycistronic clusters of multiple different snoRNA genes are also common in higher plants (Brown et al., 2003) and Kinetoplastida (Liang et al., 2007). The biosynthesis of box C/D snoRNAs seems to be related to the position of the gene within the intron and to optimal distances between the snoRNA coding sequence and conserved intron elements (e.g., 3′-end and branch point) in mammals and yeast (Hirose and Steitz, 2001; Vincenti et al., 2007). In contrast, it has been demonstrated that H/ACA snoRNAs originate in a splicing-dependent manner but do not possess any preferential localization with respect to intronic splice sites (Richard et al., 2006). Duplicated snoRNA paralogues are often inserted into different positions in the same gene (cis-duplication). However, duplicated snoRNAs in distant genes or even on other chromosomes (trans-duplication) have also been reported (Zemann et al., 2006; Schmitz et al., 2008). They may spread by retrotransposition (Weber, 2006; Luo and Li, 2007). Finally, snoRNA paralogues may be moved to different chromosomal locations through a duplication of the host genes. Consistent with these mechanisms, duplications and “jumps” of the snoRNA gene from an intron in the ancestral host gene to an intron of another gene have been observed in vertebrates (Bompfünewerer et al., 2005). Intergenic copies may also remain functional: the polymerase-III transcribed C/D snoRNA in yeast (snR52) (Harismendy et al., 2003) may have arisen from a paralogue that by chance came under the control of a polymerase-III promoter after duplication. In some cases, small RNAs that appear to function as microRNAs are processed from snoRNAs. In Giardia, a 26 nucleotide (nt) small RNA was identified as a product of the Dicer-processed snoRNA GlsR17 (Saraiya and Wang, 2008). While the small RNA is cytoplasmic, the complete snoRNA is predominantly located in one of the two nuclei. Similarly, small RNAs that appear to act like microRNAs are processed from the human snoRNA ACA45 (Ender et al., 2008).

14.3 DOMAIN-SPECIFIC RNAs

14.3.1 Telomerase RNA

In contrast to the circular genomes of prokaryotes, eukaryotes have linear chromosomes. Special mechanisms are necessary to replicate the chromosome ends, the telomeres. In almost all species investigated to date, a telomerase enzyme maintains telomere length by adding G-rich telomeric repeats to the ends of eukaryotic chromosomes. Telomerase thus dates back to the origin of eukaryotes. Notable exceptions are Diptera including Anopheles and Drosophila, which use retrotransposons or unequal recombination instead of a telomerase enzyme. The core telomerase enzyme consists of two components: an essential RNA component, which serves as template for the repeat sequence, and the catalytic protein component telomerase reverse transcriptase (TERT). The RNA component varies dramatically in sequence composition and size (Figure 14.6). Although dozens of telomerase RNAs (usually called TER in vertebrates and TLC-1 in yeasts) have been cloned and sequenced, the known examples are restricted to four narrow phylogenetic groups: vertebrates, yeasts, ciliates, and plasmodia. The protein component TERT, on the other hand, is known in a much wider range of eukaryotes (Podlevsky et al., 2008). Yeast telomerase RNAs appear to be even less well conserved: in Tzfati et al. (2003), only seven short sequence motifs are reported within more than 1.2 kb transcripts of Kluyveromyces species, and of these only a few are partially conserved in Saccharomyces. In fact, Saccharomyces and Kluyveromyces TLC-1 genes cannot be aligned with each other by standard alignment programs. The same is true for the recently discovered TLC gene of Schizosaccharomyces pombe (Leonardi et al., 2008; Webb and Zakian, 2008). Interestingly, in S. pombe, 3′-end processing by the spliceosomal machinery is essential for telomerase function (Box et al., 2008). In yeasts, the snRNA and snoRNA methyltransferase Tgs1 is responsible for TLC1 methylation, which influences telomere length and structure (Franke et al., 2008). The small ciliate TER genes include a pseudoknot domain that contains an unusual triple-helical segment with an AUU base triplet. This domain is also shared by the vertebrate and yeast telomerase RNAs (Ulyanov et al., 2007). Whether such a structure is also present in the computationally predicted TER genes of plasmodia (Chakrabarti et al., 2007) is not yet known. Although there is a common core structure of all these telomerase RNAs (Chen and Greider, 2004), and despite their length of several hundred to almost 2000 nt, these RNAs remain a worst-case scenario for homology search. Indeed, a survey of vertebrate telomerase RNAs (Chen et al., 2000) shows dramatic sequence variation with only a few, short, well-conserved sequence patterns separated by regions of highly variable length. The recent discovery of the TER genes of teleost fishes (Xie et al., 2008) highlights the variability of this molecule, which has acquired several lineage-specific domains, such as the snoRNA domain in vertebrates and the Ku80 binding domain in budding yeast (Figure 14.6).

Figure 14.6 Telomerase RNA structures of yeast and human share the topology of the pseudoknot region and a functionally important junction region. The panels show yeast, vertebrate, and ciliate telomerase RNAs. The template and its boundary element (TB) are highlighted. The yeast structure is a consensus of Saccharomyces (Dandjinou et al., 2004; Zappulla and Cech, 2004) and Kluyveromyces structures (Brown et al., 2007). The Ku80 binding domain is specific for Saccharomyces. Vertebrate telomerases have a snoRNA domain at their 3′-end (Mitchell et al., 1999). This domain carries a Cajal-body localization signal (CAB) (Jady et al., 2004), which is present in all vertebrates except teleosts (Xie et al., 2008). Black regions may vary dramatically in length. (See insert for color representation of this figure.)
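The role of the RNA component as a template for the telomeric repeat can be illustrated with a toy model of template-directed synthesis. This is a minimal sketch under stated assumptions: the template string `CUAACCCUAAC` is a human-like template region that is not given in this chapter, the function names are hypothetical, and the model ignores primer alignment and translocation. It simply appends, per round, the DNA copy of the 5′ portion of the template.

```python
# Watson-Crick partners when an RNA template is copied into DNA.
RNA_TO_DNA = {"A": "T", "C": "G", "G": "C", "U": "A"}


def copy_template(template_rna):
    """Return the DNA (5'->3') made by copying an RNA template (given 5'->3')."""
    return "".join(RNA_TO_DNA[base] for base in reversed(template_rna.upper()))


def extend_telomere(dna_end, template_rna, rounds, repeat_len=6):
    """Append `rounds` repeat units to a DNA 3' end (toy model).

    Assumption: each round adds the last `repeat_len` nucleotides of the
    copied template; the starting `dna_end` is taken to be in register.
    """
    unit = copy_template(template_rna)[-repeat_len:]
    return dna_end + unit * rounds


if __name__ == "__main__":
    # Hypothetical human-like template region (an assumption, not from the text).
    print(extend_telomere("GGTTAG", "CUAACCCUAAC", rounds=2))
```

With this assumed template, the extended end consists of G-rich repeats of the TTAGGG type. The sketch also shows why the short template is of little use for homology search: it constrains only a few nucleotides of a transcript that may be hundreds to almost 2000 nt long.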

14.3.2 Spliceosomal snRNAs

In eukaryotes, introns of protein-coding mRNAs and mRNA-like ncRNAs are spliced out of the primary transcript by the spliceosome, a large RNP complex that consists of up to 200 proteins and 5 small ncRNAs (Nilsen, 2003). Mounting evidence suggests that these snRNAs exert crucial catalytic functions in the splicing process (Valadkhan et al., 2007). Spliceosomal splicing is one of four distinct splicing mechanisms (Table 14.2).

Table 14.2 Splicing Mechanisms

Domain      (A) Group I   (B) Group II   Spliceosomal   (C) Endonuclease
Bacteria    +             +              -              -
Archaea     -             -              -              tRNA, rRNA, mRNA
Eukaryota   +             +              “mRNA”         tRNA

Three major mechanisms, (A), (B), and (C), can be distinguished (Lykke-Andersen et al., 1997). Group I (Haugen and Simon, 2005) and Group II (Fedorova and Zingler, 2007) (which include the Group III introns) are self-splicing. However, Group II introns also share several characteristic traits, including the lariat intermediate, with spliceosomal introns and might share a common origin. The splicing of eukaryotic tRNAs and all archaeal introns uses specific splicing endonucleases (reviewed in Calvin and Li, 2008). The spliceosomal machinery does not distinguish between protein-coding mRNAs and mRNA-like ncRNAs.

The spliceosomal machinery itself may be present in three distinct variants in eukaryotic cells. The dominant form is the major spliceosome, which contains the snRNAs U1, U2, U4, U5, and U6 and removes introns delimited by the canonical donor–acceptor pair GT-AG (as well as some AT-AC and GC-AG introns). A recent report on the expression of a U5 snRNA candidate in Giardia (Chen et al., 2007), a protozoan with few introns, suggests that the spliceosome and its snRNAs date back to the eukaryote ancestor. In general, snRNAs are subject to concerted evolution if they are present in multiple copies. Nevertheless, there is evidence for differential regulation of paralogous snRNA genes in several lineages (Domitrovich and Kunkel, 2003; Chen et al., 2005; Marz et al., 2008). Approximately 1 in 10,000 protein-coding genes is spliced by the minor spliceosome (Patel and Steitz, 2003), which is composed of the snRNAs U11, U12, U4atac, U5m, and U6atac and acts on AT-AC (rarely GT-AG) introns (Sheth et al., 2006). The snRNAs U11, U12, U4atac, and U6atac take on the roles of U1, U2, U4, and U6. Whereas both U6 and U6atac are polymerase-III transcripts, all other spliceosomal snRNAs are transcribed by polymerase-II. Interestingly, the minor spliceosome can also act outside the nucleus and has a function in the control of cell proliferation (König et al., 2007). Functional and structural differences between the two types of spliceosomes are reviewed in Will and Lührmann (2005). The snRNAs themselves are not only part of the spliceosomes but are also involved in transcriptional regulation (Kwek et al., 2002). The third type of splicing is spliced-leader trans-splicing. Here, a “miniexon” derived from the noncoding spliced-leader RNA (SL RNA) is attached to the 5′-end of each protein-coding exon (Palfi et al., 2005; Pouchkina-Stantcheva and Tunnacliffe, 2005; Hastings, 2005).
The corresponding spliceosomal complex contains the snRNAs U2, U4, U5, and U6, as well as an SL RNA (Hastings, 2005). The minor spliceosome is present in most eukaryotic lineages and traces back to an origin early in eukaryotic evolution (Collins and Penny, 2005; Lorkovic et al., 2005; Russell et al., 2006). Although it appears to have been lost in many lineages, most Metazoa have a minor spliceosome, with the notable exception of nematodes such as Caenorhabditis elegans (Patel and Steitz, 2003) and certain Cnidaria (Davila López et al., 2008; Marz et al., 2008). Within fungi, minor spliceosomes have been reported only for Zygomycota and some Chytridiomycota. Minor spliceosomes are also reported in oomycetes (Heterokonta) and Streptophyta (Davila López et al., 2008), whereas Euglenozoa and Alveolata do not seem to have minor spliceosomes.
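The terminal dinucleotides associated with the major and minor spliceosomes can be turned into a small lookup. This is a minimal sketch that uses only the donor–acceptor pairs named in the text (GT-AG, GC-AG, and AT-AC for the major spliceosome; AT-AC and, rarely, GT-AG for the minor one). The function names are illustrative, and the intron is assumed to be given as DNA or RNA from its first to its last nucleotide.

```python
# Donor-acceptor pairs named in the text for each spliceosome type.
MAJOR_TERMINI = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}
MINOR_TERMINI = {("AT", "AC"), ("GT", "AG")}


def intron_termini(intron):
    """Return the (donor, acceptor) dinucleotides of an intron sequence."""
    seq = intron.upper().replace("U", "T")
    if len(seq) < 4:
        raise ValueError("intron too short to have distinct termini")
    return seq[:2], seq[-2:]


def compatible_spliceosomes(intron):
    """List the spliceosome types whose known termini match this intron."""
    termini = intron_termini(intron)
    hits = []
    if termini in MAJOR_TERMINI:
        hits.append("major")
    if termini in MINOR_TERMINI:
        hits.append("minor")
    return hits
```

Since GT-AG and AT-AC introns occur with both spliceosome types, the termini alone return both labels for those pairs; the lookup narrows the possibilities but does not by itself decide which machinery removes a given intron.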


The evolutionary origin of SL-trans-splicing is unclear. It has been described in tunicates, nematodes, platyhelminthes, cnidarians, kinetoplastids (Hastings, 2005), rotifers, and dinoflagellates (Lidie and van Dolah, 2007). In contrast, SL RNAs are absent in vertebrates, insects, plants, and yeasts (Pouchkina-Stantcheva and Tunnacliffe, 2005). Due to the rapid evolution and the small size of SL RNAs, it is hard to determine whether examples from different phyla are true homologues or not. Thus, two competing hypotheses are discussed in the literature: (i) trans-splicing is ancient, and SL RNAs have been lost in multiple lineages, and (ii) the mechanism has evolved independently as a variant of spliceosomal cis-splicing in multiple lineages. In nematodes, polycistronic pre-mRNAs are trans-spliced to two or even more distinct SL RNAs (Pettitt et al., 2008) that provide the 5′-acceptor site for the first (SL1) and all subsequent (SL2) mRNA sequences. This leads to the formation of discrete monocistronic mRNAs that start with either the SL1 or the SL2 sequence (Blumenthal, 1995). Trans-splicing to the SL1 acceptor requires no specific signal. In contrast to cis-splicing or SL1-trans-splicing, the attachment of SL2 appears to be linked with polyadenylation of the preceding mRNA in the polycistron (Pirrotta, 2002).
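The nematode scheme described above, in which the first mRNA of a polycistronic transcript receives SL1 and all subsequent ones receive SL2, can be written as a one-line rule. This is a minimal sketch of that rule only; the function name and the gene names in the usage example are hypothetical, and additional SL variants are not modeled.

```python
def assign_spliced_leaders(operon_genes):
    """Pair each gene of a polycistronic transcript with its spliced leader.

    Simplified rule from the text: the first mRNA receives SL1 and every
    downstream mRNA receives SL2.
    """
    return [
        (gene, "SL1" if index == 0 else "SL2")
        for index, gene in enumerate(operon_genes)
    ]


if __name__ == "__main__":
    # Hypothetical gene names, used for illustration only.
    for gene, leader in assign_spliced_leaders(["geneA", "geneB", "geneC"]):
        print(gene, leader)
```

Each resulting pair corresponds to one monocistronic mRNA that starts with the indicated leader sequence.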

14.3.3 U7 snRNA

Replication-dependent histone genes are the only known eukaryotic protein-coding mRNAs that are not polyadenylated, ending instead in a conserved stem-loop sequence (see Marzluff (2005) for a recent review). The processing of the 3′-end of these histone mRNAs is performed by the U7 snRNP. The U7 snRNA is the smallest RNA polymerase-II transcript known to date, with a length ranging from 57 (sea urchin) to 70 nt (fruit flies). Its expression level of only a few hundred copies per cell in mammals is at least three orders of magnitude lower than the abundance of other snRNAs. The U7 snRNP-dependent mode of histone end processing has long been believed to be a metazoan innovation (Azzouz and Schümperli, 2003; Marzluff, 2005). However, the phylogenetic distribution of the U7 snRNP proteins provides evidence for an origin of this mechanism early in eukaryote evolution (Davila López and Samuelsson, 2008). Nevertheless, U7 snRNAs so far have been reported only for deuterostomes and drosophilids (Marz et al., 2007; Davila López and Samuelsson, 2008). A detailed analysis shows that each of its three major domains, the histone binding region, the Sm binding sequence, and the 3′ stem-loop structure, exhibits substantial variation in both sequence and structural details (Figure 14.7). Recently, splicing-associated functions were also attributed to U7 snRNA: modified U7 snRNAs, induced in vitro by doxycycline, lead to different alternative splice variants (Marquis et al., 2008), and introducing U7 snRNA in vivo by germline transgenesis stimulates exon 7 of survival motor neurons, which extends the mouse SMA model (Meyer et al., 2008).

14.3.4

tmRNA

The transfer-messenger RNA (tmRNA), also known as 10Sa RNA or SsrA, is part of a complex that acts as a unique translation quality control and ribosome rescue system in all bacteria and some eukaryotic organelles (Armbrust et al., 2004; Gueneau de Novoa and Williams, 2004). “Nonstop” mRNAs, which due to the processing errors lack appropriate

263

Figure 14.7 Aligned sequence logos of the consensus sequences of U7 snRNAs from tetrapods, teleosts, sea urchins, and fruit ﬂies. Adapted from Marz et al., 2007. (See insert for color representation of this ﬁgure.)

264

Chapter 14

Noncoding RNA

termination signals, cannot promote release factor binding and thus lead to stalled ribosomes attached to the nascent incomplete polypeptide. tmRNA rescues these ribosomes and provides the template for a peptide tag that then causes the rapid degradation of the incomplete translation product. In a typical Escherichia coli bacterium, there are approximately 700 tmRNAs per cell; that is, one tmRNA for every 10–20 ribosomes (Moore and Sauer, 2005). For recent detailed reviews of tmRNA structure and function, refer to Moore and Sauer (2007) and Dulebohn et al. (2007). In brief, tmRNA combines the functions of a tRNA and an mRNA, a dichotomy that is also reﬂected by its structure (Figure 14.8). Its tRNA-like domain binds to the small protein B (SmpB) and is then aminoacetylated by alanyl-tRNA synthetase. A quaternary complex that in addition contains elongation factor EF-Tu recognizes ribosomes stalled at the 30 -end of a nonstop mRNA and like tRNA enters the ribosome’s A site. The nascent polypeptide is transferred to the alanyl-tmRNA, which now switches to its mRNA-like mode of action by translocating to the P site of the ribosome where it places a TAG codon in the ribosome’s mRNA channel. This leads to the release of the defective mRNA and its subsequent selective degradation by RNase R. The ribosome continues translation with the tmRNA ORF as a surrogate template and terminates at a tmRNA-encoded stop codon, thereby releasing the nascent protein with the 11-amino acid degradation tag, which contains epitopes for ubiquitin proteases. Recent ﬁndings imply a role of tmRNA also in the decay of defective mRNAs (Richards et al., 2008). Genes for tmRNA and the accompanying SmpB protein have been found so far in each of the completely sequenced bacterial genomes (e.g., Gueneau de Novoa and Williams, 2004), although tmRNA is not crucial for survival in some bacterial species (e.g., Okan et al., 2006; Braud et al., 2006). In E. coli, tmRNA is transcribed as a 457 nt

Figure 14.8

Canonical tmRNA structure. The 50 -end and the 30 -end form a tRNA-like domain (shaded) that includes an acceptor stem, a D-loop without stem, and a T-arm (Komine et al., 1994). Instead of the anticodon stem-loop that is typical for a normal tRNA, however, tmRNA has a long stem that connects the tRNA-domain with the rest of the molecule. The mRNAlike domain contains the template ORF and is terminated by a hairpin with the stop codon in its loop. The mRNA-like and tRNA-like domains are connected by a series of four pseudoknots of unknown function. Adapted from Andersen et al., 2006. (See insert for color representation of this ﬁgure.)

14.3 Domain-Speciﬁc RNAs


precursor and then cleaved to obtain the 363 nt mature molecule. In some species, circularly permuted genes produce split tmRNAs, which still share the ancestral domain organization (Sharkady and Williams, 2004). There are tmRNA genes in some bacterial-like organelles, in particular in the chloroplasts of diatoms (e.g., Thalassiosira pseudonana) and red algae (e.g., Cyanophora paradoxa) (Gimple and Schön, 2001; Armbrust et al., 2004). Reduced tmRNAs, which lack the mRNA-like region, were identified in the mitochondrial genomes of a few protozoans, including the jakobid Reclinomonas americana (Jacob et al., 2004). So far, neither an archaeal nor a nuclear eukaryotic tmRNA has been found, although there is a nuclear-encoded SmpB gene with an organelle import signal in heterokonta with organellar tmRNAs (Jacob et al., 2005). A comparable mechanism of ribosome rescue is not known for eukaryotes.

14.3.5 6S RNA

The 6S RNA of E. coli is one of the first known bacterial small RNAs. Nevertheless, its function in transcriptional regulation was only recently elucidated (reviewed in Willkomm and Hartmann, 2005; Wassarman, 2007). It binds specifically to the bacterial RNA polymerase (RNAP) holoenzyme and selectively inhibits σ70-dependent transcription. During exponential growth, almost all housekeeping genes are regulated in a σ70-dependent manner (Maeda et al., 2000). 6S RNA seems to mimic the open σ70-dependent promoter complex (Cavanagh et al., 2008). This hypothesis is supported by the fact that mutations within the “central bubble” of the 6S RNA abrogated σ70-holoenzyme binding. The conserved secondary structure of the RNA molecule is therefore critical for its function. Because the 6S RNA concentration increases 10-fold from the exponential to the stationary phase (Wassarman and Storz, 2000), it stops transcription of the “housekeeping” genes and stores the RNAP holoenzyme. This RNAP-6S complex is present during late stationary phase, when NTP (nucleoside triphosphate) concentrations are low. As soon as new NTPs become available to the RNAP-6S complex, the 6S RNA serves as a template for the transcription of the 14–20 nt pRNAs. This results in the formation of an unstable 6S RNA-pRNA complex and the release of the RNAP holoenzyme (Wassarman and Saecker, 2006). The secondary structure of 6S RNA is discussed in detail by Barrick et al. (2005). It consists of three domains (Figure 14.9) and closely resembles an open σ70 promoter. While 6S is a single-copy gene in most bacterial phyla, there are two differentially regulated copies in Gram-positive bacteria (Barrick et al., 2005). Interestingly, two alternative structural conformations have been reported for cyanobacterial 6S RNAs (Axmann et al., 2007). In E. coli, the ssrS1 gene forms a bicistronic message combining the 6S RNA (5′) and the open reading frame of ygfA (3′).
A tandem promoter is responsible for regulating 6S RNA expression, producing either a short, σ70-dependent transcript or a longer version that uses either σ70 or σS. Different RNases process these transcripts (Kim and Lee, 2004). In addition, repression by several regulatory proteins has been reported (Neusser et al., 2008). A similar gene structure was observed in Prochlorococcus, although the short transcript does not seem to be processed in this case (Axmann et al., 2007). The broad phylogenetic distribution of 6S RNA across all bacterial phyla emphasizes the universality of its function. Nevertheless, 6S RNA-null cells show only a minor phenotypic effect in bacteria such as E. coli and Synechococcus (Trotochaud and Wassarman, 2006).

Figure 14.9 6S RNA structure. 6S RNA consists of three domains, known as the “closing stem,” “central bubble,” and “terminal loop.” The central bubble is delimited by stable, GC-rich stems in all known 6S RNAs. Three major variants of the terminal loop have been identified: (a) α-proteobacteria, (b) γ-proteobacteria, cyanobacteria, and firmicutes, and (c) β-proteobacteria, δ-proteobacteria, and spirochetes. For comparison, the open promoter complex is shown, with its template and nontemplate strands and the −10 and −35 elements. Adapted from Barrick et al., 2005.


14.4 CONSERVED ncRNAs WITH LIMITED DISTRIBUTION

14.4.1 Y RNAs

Y RNAs were discovered as the RNA component of the Ro RNP particle. They form a small family of short RNA polymerase-III transcripts with a characteristic secondary structure (Farris et al., 1999; Teunissen et al., 2000). The function of Y RNAs remains largely elusive, although a direct role of Y RNAs in DNA replication was demonstrated by Christov et al. (2006), and Y5 was implicated in 5S rRNA quality control (Hogg and Collins, 2007; van Horn et al., 1995). Y RNAs have so far been reported only in vertebrates, the nematode C. elegans, and the prokaryote Deinococcus radiodurans (Chen et al., 2000). At this point, it remains unclear whether the latter is homologous to the animal Y RNAs and, if so, whether Y RNAs date back to the LUCA. The amniota exhibit a single cluster of Y RNAs whose evolution has been traced in detail (Mosig et al., 2007b; Perreault et al., 2007). Interestingly, the orientation of the Y1 gene is inverted in eutheria. Loss of members of the Y RNA family seems to be a fairly frequent phenomenon in several lineages. In both rat and mouse, there is no trace of either Y5 or Y4, while in the closely related squirrel genome Y4 is still present (Figure 14.10). In primates, Y RNAs are the founders of a family of approximately 1000 pseudogenes that constitute a class of L1-dependent nonautonomous retroelements (Perreault et al., 2005). In contrast, in almost all other species (with the notable exception of the guinea pig Cavia porcellus) there are only a few Y RNA-derived pseudogenes.

14.4.2 Vault RNAs

Vaults are large ribonucleoprotein particles that are ubiquitous in eukaryotic cells, with a characteristic barrel-like shape and a still poorly understood function in multidrug resistance (van Zon et al., 2003). Vault RNAs are short (80–150 nt) polymerase-III transcripts comprising approximately 5% of the mammalian vault complex (Figure 14.11). So far, vault RNAs have been studied experimentally in frogs and a few mammals. They exhibit little sequence conservation beyond their Box A and Box B internal promoter elements. Nevertheless, computational analyses have identified vault RNAs in most vertebrates (Mosig et al., 2007a; Stadler et al., 2008). Similar to Y RNAs, mammalian vault RNAs are organized in a small cluster (e.g., comprising two or three paralogues in primates). This locus is linked to the protocadherin cluster in most vertebrates. Recently, a second vault RNA locus on the same chromosome was discovered in eutheria (Mrazek et al., 2007; Nandy et al., 2008; Stadler et al., 2008).

14.4.3 7SK RNA

The 7SK snRNA is one of the most abundant ncRNAs in vertebrate cells (Wassarman and Steitz, 1991). Due to its abundance, it has been known since the 1960s. Its function as a transcriptional regulator, however, has only recently been discovered: 7SK mediates the inhibition of transcription elongation factor P-TEFb, a critical regulator of RNA polymerase-II transcription that stimulates the elongation phase (Egloff et al., 2006; Krueger et al., 2008; and references therein). 7SK binds either to P-TEFb/HEXIM1 proteins or, for activation of P-TEFb, to heterogeneous nuclear ribonucleoproteins (hnRNPs). LARP7,

Figure 14.10 Evolution of the vertebrate Y RNA locus. With the exception of Xenopus, the functional Y RNA genes are located in a single cluster in all sufficiently assembled genomes (symbols with arrows on a line marking an uninterrupted piece of genomic DNA). For most species, only short scaffolds or shotgun traces are available (white symbols without direction). Updated from Mosig et al., 2007b. (See insert for color representation of this figure.)

Figure 14.11 Vault RNAs contain two internal polymerase-III promoter sequences, Box A and Box B, and a typical terminator sequence at the 3′-end. The terminal stem is conserved among all known examples. Circles indicate positions with compensatory mutations. The variable regions have a length of 20 to >100 nt. Adapted from Mosig et al., 2007b. (See insert for color representation of this figure.)

MePCE, Tat, La lupus antigen, and RNA helicase A are known to bind specific hairpins of human 7SK (Michels and Bensaude, 2008). The polymerase-III transcript, with a length of approximately 330 nt, is highly conserved in vertebrates (Gürsoy et al., 2000). In contrast to the nearly perfect sequence and structure conservation in jawed vertebrates (Wassarman and Steitz, 1991; Egloff et al., 2006), however, the lamprey 7SK RNA homologue differs in more than 30% of its nucleotide positions from its mammalian counterpart. Until recently, the molecule was regarded as a vertebrate innovation because searches for invertebrate homologues remained unsuccessful despite considerable efforts. A combination of computational and experimental approaches, however, uncovered 7SK homologues in basal deuterostomes and several lophotrochozoans and revealed its evolutionary flexibility (Gruber et al., 2008b) (Figure 14.12). A genome-wide survey for well-conserved type-3 polymerase-III promoter structures in Drosophila finally led to the discovery of arthropod 7SK RNAs (Gruber et al., 2008a). Hence, the 7SK RNA was already present in the protostome–deuterostome ancestor.

Figure 14.12 Like telomerase RNA, 7SK RNA is highly variable in both sequence and structure. Only two stem-loop structures, toward the 5′-end and the 3′-end, are conserved. The intervening sequence is highly variable in both length and structure. The 5′-terminal hairpin is responsible for HEXIM1 and P-TEFb binding (Egloff et al., 2006); the 3′-stem also interacts with P-TEFb. Length ranges given in the figure: Drosophila 300–350 nt, Lophotrochozoa 150–200 nt, Deuterostomia 200–240 nt. (See insert for color representation of this figure.)

Chapter 14 Noncoding RNA

Figure 14.13 Consensus structure of the C. elegans SmY RNAs. The terminal hairpin and the consensus pattern AARUUUUGGA form a canonical Sm binding site.

14.4.4 SmY RNA

The analysis of the cis- and trans-spliceosomes in Ascaris lumbricoides (Maroney et al., 1996) led to the discovery of two snRNA-like small RNAs that exhibit a canonical Sm protein binding site. In Caenorhabditis elegans, this novel class of RNAs contains at least 12 closely related genes (Deng et al., 2006; He et al., 2007; MacMorris et al., 2007) (Figure 14.13). The SmY RNAs occur in a complex with the spliced-leader RNA (SL RNA) formed by direct base pairing reminiscent of the interactions between spliceosomal RNAs. This strongly suggests a direct involvement in the trans-splicing machinery, either as a direct component of the trans-spliceosome that is required for function or as a chaperone for the SL snRNP that prevents inappropriate splicing. Although all sequenced nematode genomes contain SmY homologues, no SmY RNAs were found outside the phylum Nematoda (Jones et al., 2009).
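As a minimal sketch of how such a consensus can be used in practice, the pattern AARUUUUGGA from Figure 14.13 can be expanded into a regular expression using the IUPAC nucleotide codes (R = A or G) and matched against candidate sequences. The function names and the toy sequences are hypothetical and serve only as illustration; a real SmY screen would also have to test for the adjacent terminal hairpin, which this sketch ignores.

```python
import re

# IUPAC nucleotide codes expanded to RNA character classes.
IUPAC_RNA = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "[AG]", "Y": "[CU]", "W": "[AU]", "S": "[CG]",
    "K": "[GU]", "M": "[AC]", "B": "[CGU]", "D": "[AGU]",
    "H": "[ACU]", "V": "[ACG]", "N": "[ACGU]",
}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular expression string."""
    return "".join(IUPAC_RNA[code] for code in consensus.upper())


def find_sites(sequence, consensus="AARUUUUGGA"):
    """Return (position, matched text) for every consensus hit, overlaps included.

    DNA input is accepted: T is converted to U before matching.
    """
    rna = sequence.upper().replace("T", "U")
    # A lookahead with a capture group reports overlapping matches as well.
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(rna)]


if __name__ == "__main__":
    toy = "GGCAAGUUUUGGACC"  # hypothetical sequence, not a real SmY RNA
    print(find_sites(toy))
```

The positions returned are 0-based offsets into the input sequence. Matching the sequence motif alone is a weak criterion, so hits would normally be filtered further by structural context.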

14.4.5 Bacterial RNAs

A plethora of evolutionarily unrelated small RNAs (sRNAs) have been described across the bacteria, most of which have a more or less limited phylogenetic distribution. It is at present nearly impossible to offer a comprehensive survey or even a reasonably complete list. A series of reviews covers much of this terrain (e.g., Altuvia, 2007; Storz et al., 2006; Vogel and Sharma, 2005). However, many novel and mostly poorly understood ncRNAs continue to be discovered at a staggering rate. The Sm-like protein Hfq plays a special role in small RNA-based regulation of bacterial gene expression (reviewed in Brennan and Link, 2007). This RNA chaperone is present in half of all Gram-positive as well as Gram-negative bacteria and at least one archaeon, Methanococcus jannaschii. The existence of an hfq gene, however, does not imply the presence of a functional Hfq protein or its involvement in sRNA-based regulation. The 70–110-amino acid protein forms homohexamers and binds sRNA as well as mRNA (via its proximal site) and also poly(A) tails. In E. coli, approximately one-fourth of the known sRNAs bind to Hfq. They also interact with proteins that are involved in mediated decay of specific mRNAs. In the following, we can only briefly introduce a few subjectively selected examples of small bacterial RNAs.


OLE RNA
The “ornate, large, and extremophilic” (OLE) RNA, with a length of approximately 610 nt, is highly conserved in both sequence and secondary structure. OLE is found predominantly in extremophilic Gram-positive bacteria. A functional link between OLE RNA and putative membrane proteins, including BH2780, has been suggested (Puerta-Fernandez et al., 2006).

GcvB
Computational searches detected GcvB in many enterobacteria as well as in Pasteurellaceae and Vibrionaceae. Further investigation in Salmonella enterica revealed that it regulates at least seven mRNAs. The GcvB RNA represses its targets by binding to highly conserved regions. Examples include a 29 nt G-U rich region and the Shine–Dalgarno sequence of dppA and oppA. GcvB RNA also binds upstream of the Shine–Dalgarno sequence, preventing the binding of the 30S ribosome (Sharma et al., 2007).

Yfr1
Cyanobacterial functional RNAs (Yfr) were discovered by computer-aided searches (Axmann et al., 2005; Nakamura et al., 2007). Yfr2–5 appear to be related, as they share a highly conserved sequence pattern (“GGAAACA” ×2) within the loop region of a hairpin. Yfr7 exhibits a faint sequence similarity to 6S RNA. The approximately 60 nt short Yfr1 RNA folds into a structure containing an ultraconserved sequence surrounded by two stem-loops. This element is found in all cyanobacterial lineages, where it appears in very high copy numbers and with a long half-life of 60 min, emphasizing its functional importance. Yfr1 seems to be involved in growth and stress responses, but its regulatory mechanism remains largely unknown.

RsmY and RsmZ
A complex network of ncRNAs and proteins controls pathogenesis in Pseudomonas aeruginosa (Toledo-Arana et al., 2007). Homologous GacS-GacA-RsmY/Z-RsmA networks are known in E. coli and many other bacterial species. The GacS-GacA proteins upregulate the expression of RsmY and RsmZ.
The sensor proteins LadS and RetS seem to up- and downregulate GacA, respectively. RsmY and RsmZ bind to the translation regulatory protein RsmA. The sequestration of this molecule results in expression of survival- and virulence-associated mRNAs such as the elements of the type-III secretion complex.

Riboswitches
These primarily regulatory elements of protein-coding mRNAs are mostly found in bacteria. We include them here because in the “off” state, the riboswitch sequence itself is often transcribed. Almost all riboswitches are concatenations of two components: a very well-conserved binding domain acting as a highly specific aptamer that senses its metabolite, and a comparatively variable expression platform that changes its structure upon binding of the metabolite. One can distinguish two functional principles: in kinetically controlled riboswitches, metabolite binding competes with RNA folding; in thermodynamically controlled ones, the binding energy of the metabolite is sufficient to cause structural changes (Coppins et al., 2007) (Table 14.3).


Table 14.3 Classification of verified riboswitches by their function and primary ligand; candidate riboswitches are also listed. The SAM and PreQ groups utilize different structures to bind the same ligand. Where known, the mode of action is indicated by H (transcription control) and/or (translation control).

Riboswitch | Ligand | mRNA location | Expression regulation | Mode
Cleavage:
glmS | Glucosamine-6-phosphate | 5′ UTR | |
Splicing:
TPP (THI-box) | Thiamin pyrophosphate | intron in 5′ UTR | |
ON/OFF:
TPP (THI-box) | Thiamin pyrophosphate | 3′ UTR, 5′ UTR | +/− | H
Purine | Adenine, guanine, and 2′-deoxyguanosine | 5′ UTR | +/− | H
Lysine (L-box) | Lysine | 5′ UTR | |
Glycine | Glycine | 5′ UTR | + |
Cobalamin (B12-element) | Adenosylcobalamin | 5′ UTR | |
FMN (RFN-element) | Flavin mononucleotide | 5′ UTR | |
Mg2+ | Mg2+ | 5′ UTR | |
PreQ1-I, PreQ1-II | Pre-queuosine1 | 5′ UTR | |
SAM-I (S-box leader), SAM-II (SAM-alpha), SAM-III (SMK) | S-adenosylmethionine | 5′ UTR | | H
SAH | S-adenosylhomocysteine, S-adenosylcysteine | 5′ UTR | +/− | H
Candidates:
Moco | Molybdenum cofactor | 5′ UTR | | H
Tuco | Tungsten cofactor | | | H
SAM-V | | | |
yybP-ykoY (SraF) | | | |
ykkC-yxkD | | | |

Data are compiled from Breaker, 2008; Barrick and Breaker, 2007; Coppins et al., 2007; Regulski and Breaker, 2008; Meyer et al., 2008; Wang and Breaker, 2008. The symbols +/− indicate an increase/decrease of gene expression.

The expression platform can employ two principal modes of action (Figure 14.14). It can regulate translation by inhibiting translational initiation, for example, by blocking the ribosomal binding site. Alternatively, riboswitches can control transcription. Depending on whether the aptamer is loaded with the metabolite or not, a terminator hairpin is formed by the expression platform that prematurely terminates the transcription of the mRNA. In this case, the “raw” riboswitch RNA is produced as an unstable product. The TPP riboswitch is mechanistically more complex (Barrick and Breaker, 2007) and employs various expression platforms. It can be found in both the 5′-UTR and the 3′-UTR; in the latter case, it usually acts in conjunction with an upstream ORF. Riboswitches may act as activators or repressors depending on genomic context (Cheah et al., 2007) by switching ON → OFF or

Figure 14.14 Mechanisms of riboswitch control. Most switches are located in the 5′-UTR (top), acting either by forming a transcription terminator, blocking the anti-terminator hairpin, or sequestering the Shine–Dalgarno sequence. Less common riboswitches (below) act by cleaving (glmS), causing alternative splicing (TPP), or stabilizing the mRNA. (See insert for color representation of this figure.)

OFF → ON with increasing substrate concentration. A more complex regulatory logic can be implemented by placing two or more riboswitches that sense the same or different ligands in series. For instance, a tandem riboswitch architecture in Bacillus clausii implements a logical NOR gate. This and further examples are reviewed by Sudarsan et al. (2006). A unique case is the glmS riboswitch in certain Gram-positive bacteria. Instead of instigating a conformational change, it stimulates a self-cleaving ribozyme activity that acts on the glmS mRNA. Recent studies show that this ribozyme is still functional after physical separation from its target RNA. From the phylogenetic point of view, it is important to note that glmS expression in Gram-negative bacteria is regulated by two sRNAs, GlmY and GlmZ. Both sRNAs are highly similar in sequence and structure, but they form a regulatory hierarchy instead of acting redundantly. For a review, see Görke and Vogel (2008). Riboswitches are common in bacteria, although their phylogenetic distribution and genomic frequency are quite variable (Kazanov et al., 2007; Wang and Breaker, 2008). A few riboswitches have also been detected in archaea, fungi, and plants (Sudarsan et al., 2003; Barrick and Breaker, 2007; Breaker, 2008). A TPP riboswitch in Neurospora crassa, for example, regulates the expression of mRNAs by controlling alternative splicing.

RNA Thermometers
While classical riboswitches sense a specific metabolite, the so-called RNA thermometers sense temperature as a physical stimulus, providing a simple regulatory component that does not need an interaction partner (Waldminghaus et al., 2008). All known RNA thermometers control translation initiation, typically by sequestering the Shine–Dalgarno sequence in the low-temperature regime and refolding the 5′-UTR with increasing temperature. This refolding process is supported by temperature-labile, noncanonical base pairs around the Shine–Dalgarno sequence.
Two well-known examples are the fourU and ROSE-like thermometers. The latter are common regulators of heat-shock genes in many α- and γ-proteobacteria.
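Returning to the tandem riboswitch architecture mentioned earlier in this section, the following minimal sketch illustrates why two switches placed in series can behave as a NOR gate. It rests on an assumption that is not spelled out in the text: each riboswitch is modeled as an independent OFF switch that terminates transcription whenever its own ligand is bound. The function names and the generic ligand labels A and B are hypothetical.

```python
from itertools import product


def reads_through(ligand_bound):
    """Assumed OFF switch: the polymerase passes only if no ligand is bound."""
    return not ligand_bound


def tandem_expression(ligand_a_bound, ligand_b_bound):
    """The gene is expressed only if the polymerase passes both switches in series."""
    return reads_through(ligand_a_bound) and reads_through(ligand_b_bound)


def truth_table():
    """Enumerate all ligand combinations and the resulting expression state."""
    return {
        (a, b): tandem_expression(a, b)
        for a, b in product([False, True], repeat=2)
    }


if __name__ == "__main__":
    for (a, b), expressed in truth_table().items():
        print(f"ligand A bound={a!s:5}  ligand B bound={b!s:5}  expressed={expressed}")
```

Under this assumption the gene is expressed only when neither ligand is present, which is exactly the NOR function, since NOT A AND NOT B equals NOT (A OR B).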

14.4.6 A Zoo of Diverse Examples

The reasonably well-understood ncRNA classes have been complemented during the last few years by many other examples that are less well described. In the following paragraphs, we can only briefly touch upon a few of them.

Guide RNAs
Mitochondrial mRNAs of some protozoa need to undergo a posttranscriptional editing process before they can be translated. Kinetoplastids of the trypanosomatid group possess two types of mitochondrial DNA molecules: maxicircles bear protein and ribosomal RNA genes, whereas minicircles specify guide RNAs (gRNAs), with a typical length of approximately 50 nt, that mediate uridine insertion/deletion RNA editing. Following the hybridization of the 5′-anchor region of a gRNA to the 3′-end of its target mRNA, U insertion or deletion is directed by sequential base pairing. The enzyme cascade involves cleavage of the mRNA, U insertion by a TUTase, and re-ligation (Alatortsev et al., 2008 and references therein). With a few exceptions, gRNAs are encoded in the short variable regions located between highly conserved sequence blocks on the minicircles. A computational approach based on this


observation predicted approximately 100 gRNA candidates in Trypanosoma cruzi (Thomas et al., 2007), consistent with recent experimental results for Trypanosoma brucei (Madej et al., 2007, 2008).

Dictyostelium discoideum
In addition to the usual repertoire of eukaryotic ncRNAs, there are two related classes of ncRNAs that share the sequence motif CCUUACAGCAA (Aspegren et al., 2004). One of them, the class I RNAs, is organized in a few genomic clusters (Mosig et al., 2005; Hinas and Söderbom, 2007). Several novel families of ncRNAs have been identified based on a computational screen and subsequent experimental verification (Larsson et al., 2008).

Plasmodium falciparum
The ncRNAs of this malaria parasite have been studied rather extensively. Its notable scarcity of identifiable transcription factors led to speculation that this organism may be unusually reliant on chromatin modifications as a mechanism for regulating gene expression. The centromeres of Plasmodium falciparum contain transcriptionally active promoters and produce noncoding transcripts with a length of 75–175 nt that are retained in the nucleus and appear to associate with the centromeres (Li et al., 2008). A large number of other ncRNAs without homologues outside apicomplexa have been found using computational techniques and microarrays (Chakrabarti et al., 2007; Mourier et al., 2008).

TelRNAs
A recent study showed that mammalian telomeric repeats are transcribed from the C-rich strand by polymerase-II (Schoeftner and Blasco, 2008). The transcripts contain UUAGGG repeats and are polyadenylated. Transcription is regulated depending on developmental status, telomere length, cellular stress, tumor stage, and chromatin structure. In vitro, TelRNAs block telomerase activity, suggesting an active role of TelRNAs in regulating telomerase in vivo. Similar transcripts were recently reported in Leishmania donovani (Saxena et al., 2007), suggesting that TelRNAs might be evolutionarily ancient.
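The relationship between the C-rich template strand and the UUAGGG repeats of the transcript can be checked with a few lines of code. This is a minimal sketch: it assumes that the C-rich strand of the mammalian telomere reads CCCTAA in the 5′→3′ direction, and the function name is hypothetical.

```python
# Complement of each DNA template base in the RNA product.
DNA_TO_RNA_COMPLEMENT = {"A": "U", "C": "G", "G": "C", "T": "A"}


def transcribe_from_template(template_5_to_3):
    """Return the RNA (written 5'->3') copied from a DNA template strand.

    The template is given 5'->3'; the polymerase reads it 3'->5', so the
    transcript is the reverse complement with U in place of T.
    """
    return "".join(
        DNA_TO_RNA_COMPLEMENT[base] for base in reversed(template_5_to_3.upper())
    )


if __name__ == "__main__":
    c_rich_strand = "CCCTAA" * 3  # assumed C-rich telomeric strand, 5'->3'
    print(transcribe_from_template(c_rich_strand))
```

Transcribing the C-rich strand therefore yields a G-rich RNA carrying the same repeat as the G-rich DNA strand, with U in place of T.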
Centromeric Transcripts
ncRNA derived from centromeric repeats plays an active role, mostly through the RNAi pathway, in the formation of pericentromeric and centromeric heterochromatin (Pezer and Ugarkovic, 2008; Li et al., 2008).

Polymerase V Transcripts in Plants
In Arabidopsis, RNA polymerase IVb/Pol V, a multisubunit nuclear enzyme required for siRNA-mediated gene silencing of transposons and other repeats, transcribes intergenic and noncoding sequences. This process facilitates heterochromatin formation and appears to cause the silencing of overlapping and adjacent genes (Wierzbicki et al., 2008).

Promoter- and Termini-Associated RNAs
A high-resolution tiling array map of the human transcriptome implied a novel role for some unannotated RNAs as primary transcripts for the production of short RNAs, and identified


three novel classes of RNAs (Kapranov et al., 2007): “promoter-associated small RNAs” (pasRNAs) and “termini-associated small RNAs” (tasRNAs) are syntenically conserved between human and mouse. The presence of pasRNAs appears to be associated with a small increase in the expression of the corresponding protein-coding gene. Short sequence reads arising from human bidirectional promoters, which might be pasRNAs or at least related to them, are also reported by Kawaji et al. (2008). Longer “promoter-associated long RNAs” (palRNAs) with a length of a few hundred nucleotides are also abundant throughout the human genome (Kapranov et al., 2007). A detailed study of a palRNA associated with the promoter of EF1α suggests a function in epigenetic gene silencing (Han et al., 2007).

RNA Polymerase-III Transcripts
Genome-wide surveys have recently uncovered a plethora of novel, hitherto unclassified transcripts that are produced by RNA polymerase-III (Isogai et al., 2007; Pagano et al., 2007; reviewed by Dieci et al., 2007). In Drosophila, many of these transcripts exhibit well-conserved secondary structures (Rose et al., 2007). The snaR-A RNA, on the other hand, is present in human and chimpanzee only and appears to have undergone accelerated evolution (Parrott and Mathews, 2007).

TIN RNAs
A survey of public human mRNA and EST databases revealed more than 55,000 totally intronic noncoding (TIN) RNAs transcribed from the introns of the majority of RefSeq genes (Nakaya et al., 2007). Surprisingly, RNA polymerase-II inhibition resulted in increased expression of a fraction of intronic RNAs in cell cultures. This suggests that the recently discovered spRNAP-IV, an RNA polymerase of mitochondrial origin (Kravchenko et al., 2005), might be responsible for these transcripts. The functional importance of some TIN RNAs is supported by conserved expression patterns between human and mouse (Louro et al., 2008).
IGS RNAs
The intergenic spacers (IGSs) that separate individual rDNA genes contain polymerase-I promoters that cause the transcription of 150–300 nt ncRNAs complementary to rRNA promoter regions. These IGS RNAs help to establish and maintain a specific heterochromatin configuration in a subset of rRNA promoters (Mayer et al., 2006).

Ciliates
Tetrahymena and other ciliates use an RNA-based mechanism for directing their genome-wide DNA rearrangements (Mochizuki et al., 2002; Yao et al., 2003). The “scanning RNAs” appear to be closely related to the RNAi pathway.

14.5 ncRNAs FROM REPEATS AND PSEUDOGENES

Repetitive elements account for approximately half of mammalian genomes. In the past, these sequences were often considered as “junk DNA,” that is, devoid of cellular function


(Belancio et al., 2008). We are only beginning to understand that this is probably very far from the truth: For example, two distinct short polymerase-III transcribed SINE elements, B2 in mouse and Alu in human, have been recognized as negative regulators of polymerase-II transcription (Allen et al., 2004; Espinoza et al., 2004, 2007; Mariner et al., 2008). Both act in a similar way by directly binding to polymerase-II. The Alu RNA arose from the fusion of the 5′- and 3′-ends of 7SL RNA and later evolved by a head-to-tail fusion of two related Alu sequences into a dimeric structure. It can still bind some of the SRP proteins. The resulting Alu RNP complex has been found to downregulate translation initiation (Häsler and Strub, 2006). In Leishmania infantum, conserved tandem head-to-tail subtelomeric repeats are expressed in a stage-specific manner (Dumas et al., 2006). The expression of telomeric repeats has been briefly described in the previous section. Repetitive DNA elements as well as pseudogenes are an important source of novel ncRNA genes. In some cases, protein-coding genes lose their coding capacity and become functional as ncRNA. Probably the best-known example is Xist (Duret et al., 2006; see Section 14.6). A different mechanism, “exaptation” (Brosius, 1999), starts with the reactivation of retrotransposed pseudogenes, which are by chance either integrated into a locus that provides promoter sequences or integrated into a locus where a promoter happens to be generated by mutations after integration. An example of an exapted gene is the mRNA-like ncRNA Makorin1-p1 (Yano et al., 2004): Both the functional protein-coding gene Makorin1 and the pseudogene Makorin1-p1 possess the same cis-acting destabilizing elements in their 5′-region. The pseudogene stabilizes its functional paralogue by competing for the mRNA degradation apparatus. In some cases, pseudogenes may also be processed to generate small RNAs, including microRNAs and piRNAs.
See Chapter 15 for examples. In addition, several snoRNAs seem to belong to repetitive families in the mouse genome (Hüttenhofer et al., 2001). Conversely, snoRNAs may also appear as founders of repeat families, as exemplified by a recently discovered element in the platypus genome that consists of an H/ACA box snoRNA followed by a sequence element similar to retrotransposon-like elements (Schmitz et al., 2008). Another source of pseudogenes is small stable RNAs such as tRNAs, 7SK RNAs, or even snoRNAs (Luo and Li, 2007). Retrotransposition produces large numbers of such pseudogenes, which may give rise to novel ncRNAs. The BC1 RNA, for instance, which is specific to rodent neurons, shares 80% sequence similarity with its progenitor tRNA(Ala). It folds into a stable stem-loop rather than into a cloverleaf structure (Rozhdestvensky et al., 2001). BC200, another brain-specific transcript, is specific to anthropoidea (Kuryshev et al., 2001) and was exapted from a retrotransposed ancient Alu monomer. Although evolutionarily unrelated, BC1 RNA and BC200 RNA share the same expression pattern and exert analogous functions (Zalfa et al., 2005). A related primate lineage contains the analogous, also Alu-derived, ncRNA G22 (Khanam et al., 2007). Two closely related RNAs are exapted from rodent SINEs: the 94 nt 4.5SH RNA and the 101–108 nt 4.5S RNA derive from rodent B1 and B2 elements, respectively (Gogolevskaya et al., 2005). The function of these RNAs remains unknown, although it was shown that 4.5SH is bound by mouse nucleolin (Hirose and Harada, 2008).

14.6 mRNA-LIKE ncRNAs

A rapidly growing class of ncRNAs looks like protein-coding messenger RNAs in many respects. These mRNA-like ncRNAs (mlncRNAs) are transcribed by polymerase-II,
polyadenylated at their 3′-end, capped with 7-methylguanosine at the 5′-end, and typically spliced. These transcripts are the main target of systematic full-length cDNA cloning (reviewed in Carninci, 2007). While mlncRNAs have been found in both animals (Inagaki et al., 2005; Carninci, 2007; Szell et al., 2008; Ahanda et al., 2008) and plants (Wen et al., 2007; Rymarquis et al., 2008), next to nothing is known about most of them. There is, however, mounting evidence that many of them are associated with diseases (Szymanski et al., 2005; Table 14.4). A recent study based on in situ hybridization data from the Allen Brain Atlas identified 849 ncRNAs expressed in mouse, of which most were specifically associated with particular neuroanatomical regions, cell types, or subcellular compartments (Mercer et al., 2008). This kind of tight regulation is at least indicative of specific functionalities. The NRED database compiles expression information on this class of transcripts (Dinger et al., 2009). In a few cases, the mode of regulation implicates mlncRNAs in a particular functional context, as in the case of the DiGeorge syndrome-associated noncoding RNA, DGCR5, and related neuronal RNAs that are repressed by REST through a proximal upstream binding site (Johnson et al., 2008).

Table 14.4 mlncRNAs Associated with Human Diseases

ncRNA           Expr.  Disease/Disorder                                      Reference(s)
Altered expression levels in cancer
PCGEM1          ↑      Prostate cancer                                       Srikantan et al. (2000)
DD3/PCA3               Prostate cancer                                       Bussemakers et al. (1999)
MALAT-1                NSCLC, endometrial sarcoma, hepatocellular carcinoma  Yamada et al. (2006); Lin et al. (2007)
OCC-1           ↑      Colon carcinoma                                       Pibouin et al. (2002)
NCRMS           ↑      Alveolar rhabdomyosarcoma                             Chan et al. (2002)
BCMS/DLEU1             B-cell neoplasia                                      Wolf et al. (2001)
H19             ↑      Liver and breast cancer                               Gabory et al. (2006); Matouk et al. (2007)
NC612                  Prostate cancer                                       Silva et al. (2003)
HULC            ↑      Hepatocellular carcinoma                              Panzitt et al. (2007)
HIS-1           ↑      Myeloid leukemia                                      Li et al. (1997)
BIC                    Accumulates in B-cell lymphoma and leukemia           Eis et al. (2005); Tam and Dahlberg (2006)
SRA                    Isoform in breast cancer                              Lanz et al. (1999)
TRNG10                 Various cancers                                       Roberts et al. (1998)
U50HG                  At chromosomal breakpoint in B-cell lymphoma          Tanaka et al. (2000)
PEG8/IGF2AS            Fetal tumors                                          Okutsu et al. (2000)
Neurological diseases/disorders
SZ-1/PSZA11q14  ↓      Schizophrenia                                         Polesskaya et al. (2003)
DISC2                  Schizophrenia and bipolar affective disorder          Millar et al. (2004); Chubb et al. (2008)
IPW                    Prader-Willi syndrome                                 Wevrick et al. (1994)
SCA8                   Spinocerebellar ataxia type 8                         Mutsuddi et al. (2004)
Miscellaneous diseases/disorders
DGCR5                  Disrupted in DiGeorge syndrome                        Sutherland et al. (1996)
MIAT                   Risk of myocardial infarction                         Ishii et al. (2006)
22k48                  Deletion in DiGeorge syndrome                         Pizzuti et al. (1999)
LIT1                   Romano-Ward, Jervell and Lange-Nielsen, and           Horike et al. (2000); Niemitz et al. (2004)
                       Beckwith-Wiedemann syndromes
BR514                  Congenital developmental abnormalities                Haider et al. (2006)

Altered expression in disease/disorder: ↑ up-regulated, ↓ down-regulated.

Recent data strongly suggest that mlncRNAs do not form a homogeneous class with respect to function and processing. A large subclass consists of natural antisense transcripts (NATs), which are implicated in regulating the expression of their protein-coding counterparts in both animals and plants (Katayama et al., 2005; Li et al., 2007b). Antisense interactions may also play a role in regulatory interactions among protein-coding mRNAs (Wang et al., 2008 and references therein). A significant fraction of transcripts, which probably includes many of the mlncRNAs, is processed into short RNAs (Kapranov et al., 2007). Only a small number of such examples are reasonably well understood at present, however. The most prominent mlncRNAs that function as carriers of other functional ncRNAs are the noncoding host genes of snoRNAs (Tycowski and Steitz, 2001; de los Santos et al., 2000) and primary microRNA precursors (He et al., 2008), including H19 (Cai and Cullen, 2007) and BIC (Tam and Dahlberg, 2006). Carlile et al. (2008) showed that a coexpressed pair of sense and antisense transcripts of the phosphate transporter gene Slc34a2a is specifically processed into small RNAs. A highly conserved cytoplasmic tRNA-like ncRNA is processed from the nascent nuclear-retained transcript MALAT-1 by means of RNase P cleavage (Wilusz et al., 2008). Refer to Chapter 15 for more details on miRNAs, siRNAs, and their relatives.

A small subclass of mlncRNAs is predominantly present in the nucleus.
A recent screen (Hutchinson et al., 2007) identified only four such genes: the two well-conserved genes NEAT1 and MALAT-1, as well as NTT and Xist, the latter well studied in the context of X-chromosome inactivation (Hutchinson et al., 2007; Ng et al., 2007). The neuronally expressed mouse transcript Gomafu also seems to belong to this class (Sone et al., 2007). Recent work furthermore implicates mlncRNAs in the regulation of chromatin structure (Sanchez-Elsner et al., 2006; Bühler, 2008; Dinger et al., 2008; Whitehead et al., 2008).

From a functional point of view, one can distinguish transcripts involved in dosage compensation, imprinting events, and stress signals, as well as regulators of gene expression. Mechanistic details, however, remain to be elucidated. In the following, we briefly touch upon a few of the better-understood representatives.

Long mRNA-like ncRNAs appear to be more conserved in evolution than previously thought. At least a subclass is conserved across the 12 fruit fly genomes; indeed, their conserved intron positions have recently been used to construct a computational gene-finder for nonstructured ncRNAs (Hiller et al., 2008). Another example is the discovery of a human homologue of Air (Yotova et al., 2008), a fairly well-understood imprinting-related RNA originally described in mouse.

14.6.1 Dosage Compensation

Many species with sex chromosomes, including mammals and fruit flies, need to equalize the expression levels of (in this case) X chromosome genes between the sexes. In Drosophila, this is achieved by X chromosome upregulation in XY cells with the help of the mlncRNAs roX1 and roX2 (Wutz, 2003), while mammals inactivate one of the two X chromosomes (reviewed in Hutchinson et al., 2007; Ng et al., 2007). X inactivation proceeds at an early developmental stage in females and is regulated by different factors, including a region of the X chromosome called the X inactivation center (XIC). The Xist (X inactive specific transcript) gene, which produces a nuclear 19.3 kb transcript expressed exclusively from the XIC of the inactive X chromosome, is necessary and sufficient for the inactivation of the X chromosome. So far, the exact mechanism is unclear, but it is assumed that transcription of Xist alone could be enough to change the chromatin structure so that different silencing factors can bind.

14.6.2 Imprinting

H19, which produces a 2.3 kb transcript, is the first and probably best-characterized autosomally imprinted gene. It is located in a cluster of imprinted genes (11p15 in human) that also contains the IGF2 gene. While H19 is expressed exclusively from the maternal allele, IGF2 expression is limited to the paternal allele. H19 RNA is highly expressed in placental, embryonic, and most fetal tissues, but after birth its expression is suppressed in nearly all tissues. Recently identified as a microRNA precursor (Cai and Cullen, 2007), it appears to play a role in development and differentiation. Both loss and overexpression of H19 are associated with different cancers.

14.6.3 Stress Response

Both prokaryotes and eukaryotes induce a set of heat shock genes to counteract environmental stress (reviewed in Arya et al., 2007). The heat shock response causes not only a widespread inhibition of transcription but also a blockade of splicing and other posttranscriptional processing. Usually, heat shock genes code for proteins. In Drosophila, however, the major site of transcription after temperature-induced stress is the hsrω locus, which produces an ncRNA (see Jolly and Lakhotia, 2006, for a recent review). The hsrω RNA is constitutively expressed nearly ubiquitously, and its transcription level can be rapidly increased in response to stress signals. There seem to be three distinct isoforms: both the full-length (10 kb) hsrω1 RNA and the 7–8 kb hsrω2 RNA, which is obtained by alternative polyadenylation, accumulate in the nucleus. In contrast, the spliced 1.2–1.3 kb hsrω3 RNA is cytoplasmic. The hsrω RNAs contain a short translatable reading frame in several Drosophila species; however, the corresponding peptides have not been detected. The large RNAs appear to act as “organizers” for sequestering components of the mRNA processing machinery: Together with diverse hnRNPs, the nuclear hsrω RNAs are localized in subnuclear compartments. These ω-speckles are believed to act as dynamic storage sites for RNA-processing and related proteins. Mammals do not have an hsrω homologue. Recently, however, heat-induced ncRNAs transcribed from satellite III repetitive sequences have been described in human cells. These polyadenylated satellite III transcripts are regarded as functional analogues of hsrω (Jolly and Lakhotia, 2006).

14.6.4 Transcriptional Regulators

The 2.7 kb Evf-1 transcript and its 3.8 kb splice variant Evf-2 originate from the Dlx-5/6 bigene cluster and overlap an ultraconserved region. The Dlx genes encode homeodomain transcription factors with crucial functions in differentiation and migration of neuronal cells as well as craniofacial and limb patterning during development. The Evf-2 RNA specifically interacts with the Dlx-2 protein, forming a stable complex in the nucleus. The Evf-2/Dlx-2 interaction increases the transcriptional activity of the Dlx-5/6 enhancer region in a target- and homeodomain-specific manner. Most likely, the Evf-2/Dlx-2 complex stabilizes the binding between the Dlx-2 homeodomain protein and the Dlx-5/6 enhancer sequence (Faedo et al., 2004; Feng et al., 2006; Kohtz and Fishell, 2004).

A few more examples that are at least partially understood are described in some detail in recent reviews (Prasanth and Spector, 2007; Rymarquis et al., 2008; Szell et al., 2008).

14.7 RNAs WITH DUAL FUNCTIONS

The complex mosaic of transcripts outlined in the introduction implies that protein-coding and noncoding transcripts frequently overlap, in different reading directions or even in the same direction. In several cases, distinct types of functional products are produced from the same primary transcript, the best-known example being snoRNAs, which are frequently processed from introns of genes for ribosomal proteins. An extreme example of this type is mfl, the locus for the pseudouridine synthase minifly/Nop60b of D. melanogaster. It not only encodes alternative splice forms that can be polyadenylated at different downstream poly(A) sites but also contains within its introns a cluster comprising four isoforms of a C/D box snoRNA and two highly related copies of a small ncRNA gene of unknown function. The alternative 3′-ends allow mfl not only to produce two distinct protein subforms but also to differentially release different ncRNAs (Riccardo et al., 2007).

Coding and noncoding information can be packed even more tightly, however, by superimposing it on the same sequence. Outside RNA viruses, a few examples are known in both bacteria and eukaryotes. Strictly speaking, these RNAs are not “noncoding” but rather bifunctional antisense regulators and mRNAs.

14.7.1 RNAIII

With a length of 514 nt, the Staphylococcus-specific RNAIII is one of the largest regulatory RNAs in bacteria. As an intracellular effector of the quorum-sensing system, it is a key regulator of virulence gene expression (Toledo-Arana et al., 2007). A regulatory structure of 14 stem-loops and the δ-hemolysin peptide, the latter specified by a short ORF close to the 5′-end, are encoded within one genomic locus. RNAIII exerts its function by binding to at least three target mRNAs, hla, spa, and rot, and alters the expression levels of the corresponding proteins by modifying the accessibility of the Shine–Dalgarno sequence. The type of sRNA–mRNA interaction varies between targets: different loops of the RNAIII molecule selectively bind to a given target and either unfold the complete stem or just form stabilizing kissing hairpins (Boisset et al., 2007; Toledo-Arana et al., 2007).

14.7.2 SgrS

The approximately 200 nt SgrS RNA is expressed in E. coli during glucose–phosphate stress and downregulates the translation of the glucose transporter in an Hfq-dependent manner. It has been shown that the regulatory mechanism consists of base pairing of the SgrS RNA around the Shine–Dalgarno sequence of its target, followed by degradation of this complex by RNase E. Additionally, the 5′-region of the SgrS RNA contains a 43-amino-acid ORF, sgrT, which is well conserved and translated under stress conditions. The SgrT peptide and the SgrS RNA base pairing are equally efficient in compensating for the glucose–phosphate stress. So far, SgrS RNAs have been described for various enterobacteria (Görke and Vogel, 2008).
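The antisense mechanism shared by RNAIII and SgrS, pairing of the regulatory RNA with the region around the Shine–Dalgarno sequence of its target, can be illustrated with a deliberately simplified sketch. The Python function below (longest_antisense_match, a hypothetical helper name) reports the longest stretch of a target mRNA that could pair contiguously with an sRNA, assuming Watson–Crick pairs only. The sequences in the usage lines are invented toy examples, not the real SgrS or target sequences; real interactions also involve G–U pairs, interruptions, and RNA structure, none of which is modeled here.

```python
def reverse_complement(rna):
    """Reverse complement of an RNA sequence (T is read as U)."""
    pairs = {"A": "U", "U": "A", "G": "C", "C": "G"}
    rna = rna.upper().replace("T", "U")
    return "".join(pairs[base] for base in reversed(rna))


def longest_antisense_match(srna, mrna):
    """Return (start, length) of the longest mRNA stretch that can pair
    contiguously with the sRNA, counting Watson-Crick pairs only.

    A stretch of the mRNA pairs with the sRNA exactly when it also occurs in
    the reverse complement of the sRNA, so this is a longest-common-substring
    search solved by dynamic programming.
    """
    target = mrna.upper().replace("T", "U")
    probe = reverse_complement(srna)
    best_start, best_len = 0, 0
    prev = [0] * (len(probe) + 1)
    for i in range(1, len(target) + 1):
        cur = [0] * (len(probe) + 1)
        for j in range(1, len(probe) + 1):
            if target[i - 1] == probe[j - 1]:
                # Extend the match that ended one base earlier in both strings.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len = cur[j]
                    best_start = i - best_len
        prev = cur
    return best_start, best_len


# Toy sequences (invented): the mRNA carries a Shine-Dalgarno-like AGGAGGU motif.
toy_mrna = "AAAGGAGGUUUAUG"
toy_srna = "CCACCUCCUGG"
start, length = longest_antisense_match(toy_srna, toy_mrna)
print(start, length, toy_mrna[start:start + length])  # 2 7 AGGAGGU
```

A perfect-complementarity search of this kind can only flag candidate pairing regions; it does not predict whether the pairing occurs or whether it blocks ribosome binding.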

14.7.3 SRA/SRAP

In humans, the steroid receptor RNA activator (SRA) modulates the transcriptional activity of steroid receptors as an RNA molecule. On the other hand, it encodes a protein that is highly conserved among Chordata. A recent review (Leygue, 2007) lists 13 SRA variants, apparently arising from alternative transcription start sites and alternative splicing. They all share exons 2–5, which encode the functional secondary structure core. Some of these isoforms also encode the SRAP protein, while others lack the translation start. It appears that the main function of SRA RNA is to organize the various protein components in the SRA–RNP complexes, which contain both transcription factors and positive or negative regulators of nuclear receptor activity.

14.7.4 Enod40

The plant gene enod40 participates in the regulation of symbiotic interactions between leguminous plants and bacteria or fungi, and it has also been implicated in the development of nonsymbiotic plants (Sousa et al., 2001). Its molecular mechanisms remain unclear, but both short peptides and well-conserved RNA secondary structure appear to play a role. A recent computational study (Gultyaev and Roussis, 2007) identified a structural core that is well conserved across angiosperms, together with highly variable expansion domains reminiscent of the patterns observed in many other functional ncRNAs. Legumes often contain more than one enod40 gene.

The analysis of transcript structures in the ENCODE regions shows that overlapping arrangements of coding and noncoding transcripts and splice forms are the rule rather than the exception. At this point, however, the relevance of noncoding RNAs arising from protein-coding loci remains unclear. The recent discovery of the short tarsal-less peptides, translated from ORFs only 33 nt long in a Drosophila transcript previously classified as noncoding (Galindo et al., 2007), might indicate that many other “mlncRNAs” in fact code for short peptides. The peptides of enod40 and tarsal-less are highly conserved over long evolutionary time. So far, no additional examples have been reported (apart from uORFs of protein-coding mRNAs).
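The observation that a transcript classified as noncoding can hide ORFs as short as 33 nt suggests a simple first screen of mlncRNA candidates for short reading frames. The following Python sketch scans the three forward frames of a sequence for ATG-initiated ORFs within a length window. The function name find_short_orfs and the default thresholds are assumptions made for illustration, lengths are counted here from the start codon through the stop codon, and the test sequence is a toy example rather than the tarsal-less transcript.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_short_orfs(seq, min_nt=30, max_nt=150):
    """Return sorted (start, end, orf) tuples for forward-strand ORFs.

    An ORF runs from the first ATG after the previous in-frame stop codon up
    to and including the next in-frame stop codon. Only ORFs whose length in
    nucleotides lies between min_nt and max_nt (inclusive) are reported.
    Coordinates are 0-based, and end is exclusive.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabet
    hits = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a reading frame
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if min_nt <= end - start <= max_nt:
                    hits.append((start, end, seq[start:end]))
                start = None  # close the reading frame
    return sorted(hits)


# Toy transcript (invented): a 33 nt ORF (ATG, nine GCT codons, TAA) with flanks.
toy = "CC" + "ATG" + "GCT" * 9 + "TAA" + "GGG"
for start, end, orf in find_short_orfs(toy):
    print(start, end, len(orf))  # 2 35 33
```

The presence of such an ORF says nothing by itself about translation. As the text notes for enod40 and tarsal-less, conservation of the encoded peptide is the kind of additional evidence that makes a coding function plausible.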

14.8 CONCLUDING REMARKS

We have attempted in this chapter to give a comprehensive overview of the inventory of non-protein-coding RNAs across all domains of life, excluding viral RNAs, viroid and satellite DNAs, ribozymes, and regulatory RNA elements of mRNAs. Of course, in a few pages such an endeavor is bound to remain incomplete and subjective. New classes, mechanisms, and functions of ncRNAs are being discovered almost every week. Therefore, this chapter will likely be even more incomplete, and halfway outdated, when it reaches the reader in printed form. After all, a series of computational studies has provided quite convincing evidence for tens of thousands of unclassified RNAs whose secondary structure is under stabilizing selection (Missal et al., 2005; Rose et al., 2007; Torarinsson et al., 2006; Uzilov et al., 2006; Washietl et al., 2005, 2007). Structure-based clustering (Will et al., 2007; Rose et al., 2008) furthermore strongly suggests that several new classes of ncRNAs with characteristic secondary structures are still lurking in eukaryotic genomes.

In order to keep the list of references at a reasonable length, we had to give priority to reviews over the reviewed original works and to give preference to the most recent publications over classical papers; hence, this chapter does not even attempt to review the history of the discovery of the “modern RNA-world” over the last decade.

ACKNOWLEDGMENTS

This work was supported in part by the German DFG under the auspices of SPP-1258 “Sensory and Regulatory RNAs in Prokaryotes,” SPP-1174 “Deep Metazoan Phylogeny,” and the Graduiertenkolleg Wissensrepräsentation, and by the European Union through the 6th framework program projects EMBIO (http://www-embio.ch.cam.ac.uk/) and SYNLET (http://synlet.izbi.uni-leipzig.de/). We thank Claudia S. Copeland for editing the manuscript for English language and clarity.

REFERENCES

ACCARDO, M.C., GIORDANO, E., RICCARDO, S., DIGILIO, F.A., IAZZETTI, G., CALOGERO, R.A., and FURIA, M., 2004. A computational search for box C/D snoRNA genes in the Drosophila melanogaster genome. Bioinformatics 20: 3293–3301.
AHANDA, M.L., RUBY, T., WITTZELL, H., BED’HOM, B., CHAUSSÉ, A.M., MORIN, V., OUDIN, A., CHEVALIER, C., YOUNG, J.R., and ZOOROB, R., 2008. Non-coding RNAs revealed during identification of genes involved in chicken immune responses. Immunogenetics, DOI: 10.1007/s00251-008-0337-8.
ALATORTSEV, V.S., CRUZ-REYES, J., ZHELONKINA, A.G., and SOLLNER-WEBB, B., 2008. Trypanosoma brucei RNA editing: coupled cycles of U deletion reveal processive activity of the editing complex. Mol. Cell. Biol. 28: 2437–2445.
ALLEN, T.A., VON KAENEL, S., GOODRICH, J.A., and KUGEL, J., 2004. The SINE-encoded mouse B2 RNA represses mRNA transcription in response to heat shock. Nat. Struct. Mol. Biol. 11: 816–821.
ALTUVIA, S., 2007. Identification of bacterial small noncoding RNAs: experimental approaches. Curr. Opin. Microbiol. 10: 257–261.
ANDERSEN, E.S., ROSENBLAD, M.A., LARSEN, N., WESTERGAARD, J.C., BURS, J., WOWER, I.K., WOWER, J., GORODKIN, J., SAMUELSSON, T., and ZWIEB, C., 2006. The tmRDB and SRPDB resources. Nucleic Acids Res. 34: D163–D168.
ARMBRUST, E.V., et al., 2004. The genome of the diatom Thalassiosira pseudonana: ecology, evolution, and metabolism. Science 306: 79–86.
ARYA, R., MALLIK, M., and LAKHOTIA, S.C., 2007. Heat shock genes: integrating cell survival and death. J. Biosci. 32: 595–610.
ASPEGREN, A., HINAS, A., LARSSON, P., LARSSON, A., and SÖDERBOM, F., 2004. Novel non-coding RNAs in Dictyostelium discoideum and their expression during development. Nucleic Acids Res. 32: 4646–4656.
ASPINALL, T.V., GORDON, J.M.B., BENNET, H.J., KARAHALIOS, P., BUKOWSKI, J.-P., WALKER, S.C., ENGELKE, D.R., and AVIS, J.M., 2007. Interactions between subunits of Saccharomyces cerevisiae RNase MRP support a conserved eukaryotic RNase P/MRP architecture. Nucleic Acids Res. 35: 6439–6450.
ATZORN, V., FRAGAPANE, P., and KISS, T., 2004. U17/snR30 is a ubiquitous snoRNA with two conserved sequence motifs essential ribosome assembly. Cell 72: 443–457.
AXMANN, I.M., HOLTZENDORFF, J., VOSS, B., KENSCHE, P., and HESS, W.R., 2007. Two distinct types of 6S RNA in Prochlorococcus. Gene 406: 69–78.
AXMANN, I.M., KENSCHE, P., VOGEL, J., KOHL, S., HERZEL, H., and HESS, W.R., 2005. Identification of cyanobacterial noncoding RNAs by comparative genome analysis. Genome Biol. 6: R73.
AZZOUZ, T.N. and SCHÜMPERLI, D., 2003. Evolutionary conservation of the U7 small nuclear ribonucleoprotein in Drosophila melanogaster. RNA 9: 1532–1541.
BACHELLERIE, J.-P., CAVAILLÉ, J., and HÜTTENHOFER, A., 2002. The expanding snoRNA world. Biochimie 84: 775–790.
BARRICK, J.E. and BREAKER, R.R., 2007. The distributions, mechanisms, and structures of metabolite-binding riboswitches. Genome Biol. 8: R239.
BARRICK, J.E., SUDARSAN, N., WEINBERG, Z., RUZZO, W.L., and BREAKER, R.R., 2005. 6S RNA is a widespread regulator of eubacterial RNA polymerase that resembles an open promoter. RNA 11: 774–784.
BAZELEY, P.S., SHEPELEV, V., TALEBIZADEH, Z., BUTLER, M.G., FEDOROVA, L., FILATOV, V., and FEDOROV, A., 2008. snoTARGET shows that human orphan snoRNA targets locate close to alternative splice junctions. Gene 408: 172–179.
BELANCIO, V.P., HEDGES, D.J., and DEININGER, P., 2008. Mammalian non-LTR retrotransposons: for better or worse, in sickness and in health. Genome Res. 18: 343–358.
BLUMENTHAL, T., 1995. Trans-splicing and polycistronic transcription in Caenorhabditis elegans. Trends Genet. 11: 132–136.
BOISSET, S., et al., 2007. Staphylococcus aureus RNAIII coordinately represses the synthesis of virulence factors and the transcription regulator Rot by an antisense mechanism. Genes Dev. 21: 1353–1366.
BOMPFÜNEWERER, A.F., et al., 2005. Evolutionary patterns of non-coding RNAs. Theory Biosci. 123: 301–369.

BOX, J.A., BUNCH, J.T., TANG, W., and BAUMANN, P., 2008. Spliceosomal cleavage generates the 3′ end of telomerase RNA. Nature 456: 910–914.
BRAUD, S., LAVIRE, C., BELLIER, A., and MAZODIER, P., 2006. Effect of SsrA (tmRNA) tagging system on translational regulation in Streptomyces. Arch. Microbiol. 184: 343–352.
BREAKER, R.R., 2008. Complex riboswitches. Science 319: 1795–1797.
BRENNAN, R.G. and LINK, T.M., 2007. Hfq structure, function and ligand binding. Curr. Opin. Microbiol. 10: 125–133.
BROSIUS, J., 1999. RNAs from all categories generate retrosequences that may be exapted as novel genes or regulatory elements. Gene 238: 115–134.
BROWN, J.W.S., ECHEVERRIA, M., and QU, L.-H., 2003. Plant snoRNAs: functional evolution and new modes of gene expression. Trends Plant Sci. 8: 42–49.
BROWN, Y., ABRAHAM, M., PEARL, S., KABAHA, M.M., ELBOHER, E., and TZFATI, Y., 2007. A critical three-way junction is conserved in budding yeast and vertebrate telomerase RNAs. Nucleic Acids Res. 35: 6280–6289.
BÜHLER, M., 2008. RNA turnover and chromatin-dependent gene silencing. Chromosoma, DOI: 10.1007/s00412-008-0195-z.
BUSSEMAKERS, M.J., VAN BOKHOVEN, A., VERHAEGH, G.W., SMIT, F.P., KARTHAUS, H.F., SCHALKEN, J.A., DEBRUYNE, F.M., RU, N., and ISAACS, W.B., 1999. DD3: a new prostate-specific gene, highly overexpressed in prostate cancer. Cancer Res. 59: 5975–5979.
CAI, X. and CULLEN, B.R., 2007. The imprinted H19 noncoding RNA is a primary microRNA precursor. RNA 13: 313–316.
CALVIN, K. and LI, H., 2008. RNA-splicing endonuclease structure and function. Cell. Mol. Life Sci. 65: 1176–1185.
CARLILE, M., NALBANT, P., PRESTON-FAYERS, K., MCHAFFIE, G.S., and WERNER, A., 2008. Processing of naturally occurring sense/antisense transcripts of the vertebrate Slc34a gene into short RNAs. Physiol. Genomics 34: 95–100.
CARNINCI, P., 2007. Constructing the landscape of the mammalian transcriptome. J. Exp. Biol. 210: 1497–1506.
CAVANAGH, A.T., KLOCKO, A.D., LIU, X., and WASSARMAN, K.M., 2008. Promoter specificity for 6S RNA regulation of transcription is determined by core promoter sequences and competition for region 4.2 of sigma70. Mol. Microbiol. 67: 1242–1256.
CHAKRABARTI, K., PEARSON, M., GRATE, L., STERNE-WEILER, T., DEANS, J., DONOHUE, M., and ARES, J.P., JR., 2007. Structural RNAs of known and unknown function identified in malaria parasites by comparative genomics and RNA analysis. RNA 13: 1923–1939.
CHAN, A.S., THORNER, P.S., SQUIRE, J.A., and ZIELENSKA, M., 2002. Identification of a novel gene NCRMS on chromosome 12q21 with differential expression between rhabdomyosarcoma subtypes. Oncogene 21: 3029–3037.
CHEAH, M.T., WACHTER, A., SUDARSAN, N., and BREAKER, R.R., 2007. Control of alternative RNA splicing and gene expression by eukaryotic riboswitches. Nature 447: 497–500.
CHEN, J.-L. and GREIDER, C.W., 2004. An emerging consensus for telomerase RNA structure. Proc. Natl. Acad. Sci. USA 101: 14683–14684.
CHEN, L., LULLO, D.J., MA, E., CELNIKER, S.E., RIO, D.C., and DOUDNA, J.A., 2005. Identification and analysis of U5 snRNA variants in Drosophila. RNA 11: 1473–1477.
CHEN, X., QUINN, A.M., and WOLIN, S.L., 2000. Ro ribonucleoproteins contribute to the resistance of Deinococcus radiodurans to ultraviolet irradiation. Genes Dev. 14: 777–782.
CHEN, X., ROZHDESTVENSKY, T.S., COLLINS, L.J., SCHMITZ, J., and PENNY, D., 2007. Combined experimental and computational approach to identify non-protein-coding RNAs in the deep-branching eukaryote Giardia intestinalis. Nucleic Acids Res. 35: 4619–4628.
CHRISTOV, C.P., GARDINER, T.J., SZÜTS, D., and KRUDE, T., 2006. Functional requirement of noncoding Y RNAs for human chromosomal DNA replication. Mol. Cell. Biol. 26: 6993–7004.
CHUBB, J.E., BRADSHAW, N.J., SOARES, D.C., PORTEOUS, D.J., and MILLAR, J.K., 2008. The DISC locus in psychiatric illness. Mol. Psychiatry 13: 36–64.
COLLINS, L. and PENNY, D., 2005. Complex spliceosomal organization ancestral to extant eukaryotes. Mol. Biol. Evol. 22: 1053–1066.
COLLINS, L.J., MOULTON, V., and PENNY, D., 2000. Use of RNA secondary structure for studying the evolution of RNase P and RNase MRP. J. Mol. Evol. 51: 194–204.
COPPINS, R.L., HALL, K.B., and GROISMAN, E.A., 2007. The intricate world of riboswitches. Curr. Opin. Microbiol. 10: 176–181.
DANDJINOU, A.T., LÉVESQUE, N., LAROSE, S., LUCIER, J.-F., ELELA, S.A., and WELLINGER, R.J., 2004. A phylogenetically based secondary structure for the yeast telomerase RNA. Curr. Biol. 14: 1148–1158.
DARZACQ, X., JÁDY, B.E., VERHEGGEN, C., KISS, A.M., BERTRAND, E., and KISS, T., 2002. Cajal body-specific small nuclear RNAs: a novel class of 2′-O-methylation and pseudouridylation guide RNAs. EMBO J. 21: 2746–2756.
DAVID, L., HUBER, W., GRANOVSKAIA, M., TOEDLING, J., PALM, C.J., BOFKIN, L., JONES, T., DAVIS, R.W., and STEINMETZ, L.M., 2006. A high-resolution map of transcription in the yeast genome. Proc. Natl. Acad. Sci. USA 103: 5320–5325.
DÁVILA LÓPEZ, M. and SAMUELSSON, T., 2008. Early evolution of histone mRNA 3′ end processing. RNA 14: 1–10.
DÁVILA LÓPEZ, M., ALM ROSENBLAD, M., and SAMUELSSON, T., 2008. Computational screen for spliceosomal RNA genes aids in defining the phylogenetic distribution of major and minor spliceosomal components. Nucleic Acids Res. 36: 3001–3010.
DE LA CRUZ, J. and VIOQUE, A., 2003. A structural and functional study of plastid RNAs homologous to catalytic bacterial RNase P RNA. Gene 321: 47–56.
DE LOS SANTOS, T., SCHWEIZER, J., REES, C.A., and FRANCKE, U., 2000. Small evolutionarily conserved RNA, resembling C/D box small nucleolar RNA, is transcribed from PWCR1, a novel imprinted gene in the Prader–Willi deletion region, which is highly expressed in brain. Am. J. Hum. Genet. 67: 1067–1082.
DECATUR, W.A., LIANG, X.H., PIEKNA-PRZYBYLSKA, D., and FOURNIER, M.J., 2007. Identifying effects of snoRNA-guided modifications on the synthesis and function of the yeast ribosome. Methods Enzymol. 425: 283–316.
DENG, W., et al., 2006. Organisation of the Caenorhabditis elegans small noncoding transcriptome: genomic features, biogenesis and expression. Genome Res. 16: 30–36.
DIECI, G., FIORINO, G., CASTELNUOVO, M., TEICHMANN, M., and PAGANO, A., 2007. The expanding RNA polymerase III transcriptome. Trends Genet. 23: 614–622.
DINGER, M.E., PANG, K.C., MERCER, T.R., CROWE, M.L., GRIMMOND, S.M., and MATTICK, J.S., 2009. NRED: a database of long noncoding RNA expression. Nucleic Acids Res. 37: D122–D126.
DINGER, M.E., et al., 2008. Long noncoding RNAs in mouse embryonic stem cell pluripotency and differentiation. Genome Res. 18: 1433–1445.
DOMITROVICH, A.M. and KUNKEL, G.R., 2003. Multiple, dispersed human U6 small nuclear RNA genes with varied transcriptional efficiencies. Nucleic Acids Res. 31: 2344–2352.
DULEBOHN, D., CHOY, J., SUNDERMEIER, T., OKAN, N., and KARZAI, A.W., 2007. Trans-translation: the tmRNA-mediated surveillance mechanism for ribosome rescue, directed protein degradation, and nonstop mRNA decay. Biochemistry 46: 4681–4693.
DUMAS, C., CHOW, C., MÜLLER, M., and PAPADOPOULOU, B., 2006. A novel class of developmentally regulated noncoding RNAs in Leishmania. Eukaryotic Cell 5: 2033–2046.
DURET, L., CHUREAU, C., SAMAIN, S., WEISSENBACH, J., and AVNER, P., 2006. The Xist RNA gene evolved in eutherians by pseudogenization of a protein-coding gene. Science 312: 1653–1655.
JÁDY, B.E. and KISS, T., 2001. A small nucleolar guide RNA functions in both 2′-O-ribose methylation and pseudouridylation of the U5 spliceosomal RNA. EMBO J. 20: 541–551.
EGLOFF, S., VAN HERREWEGHE, E., and KISS, T., 2006. Regulation of polymerase II transcription by 7SK snRNA: two distinct RNA elements direct P-TEFb and HEXIM1 binding. Mol. Cell. Biol. 26: 630–642.
EIS, P.S., TAM, W., SUN, L., CHADBURN, A., LI, Z., GOMEZ, M.F., LUND, E., and DAHLBERG, J.E., 2005. Accumulation of miR155 and BIC RNA in human B cell lymphomas. Proc. Natl. Acad. Sci. USA 102: 3627–3632.
ENDER, C., KREK, A., FRIEDLÄNDER, M.R., BEITZINGER, M., WEINMANN, L., CHEN, W., PFEFFER, S., RAJEWSKY, N., and MEISTER, G., 2008. A human snoRNA with microRNA-like functions. Mol. Cell 32: 519–528.
ESPINOZA, C.A., ALLEN, T.A., HIEB, A.R., KUGEL, J.F., and GOODRICH, J.A., 2004. B2 RNA binds directly to RNA polymerase II to repress transcript synthesis. Nat. Struct. Mol. Biol. 11: 822–829.

ESPINOZA, C.A., GOODRICH, J.A., and KUGEL, J.F., 2007. Characterization of the structure, function, and mechanism of B2 RNA, an ncRNA repressor of RNA polymerase II transcription. RNA 13: 583–596.
FAEDO, A., QUINN, J.C., STONEY, P., LONG, J.E., DYE, C., ZOLLO, M., RUBENSTEIN, J., PRICE, D., and BULFONE, A., 2004. Identification and characterization of a novel transcript down-regulated in Dlx1/Dlx2 and up-regulated in Pax6 mutant telencephalon. Dev. Dyn. 231: 614–620.
FARRIS, A.D., KOELSCH, G., PRUIJN, G.J., VAN VENROOIJ, W.J., and HARLEY, J.B., 1999. Conserved features of Y RNAs revealed by automated phylogenetic secondary structure analysis. Nucleic Acids Res. 27: 1070–1078.
FEDOROVA, O. and ZINGLER, N., 2007. Group II introns: structure, folding and splicing mechanism. Biol. Chem. 388: 665–678.
FENG, J., BI, C., CLARK, B.S., MADY, R., SHAH, P., and KOHTZ, J.D., 2006. The Evf-2 noncoding RNA is transcribed from the Dlx-5/6 ultraconserved region and functions as a Dlx-2 transcriptional coactivator. Genes Dev. 20: 1470–1484.
FORCE, A., LYNCH, M., PICKETT, F.B., AMORES, A., YAN, Y.-L., and POSTLETHWAIT, J., 1999. Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151: 1531–1545.
FRANKE, J., GEHLEN, J., and EHRENHOFER-MURRAY, A.E., 2008. Hypermethylation of yeast telomerase RNA by the snRNA and snoRNA methyltransferase Tgs1. J. Cell Sci. 121: 3553–3560.
FREELAND, S.J., KNIGHT, R.D., and LANDWEBER, L.F., 1999. Do proteins predate DNA? Science 286: 690–692.
GABORY, A., RIPOCHE, M.A., YOSHIMIZU, T., and DANDOLO, L., 2006. The H19 gene: regulation and function of a noncoding RNA. Cytogenet. Genome Res. 113: 188–193.
GALINDO, M.I., PUEYO, J.I., FOUIX, S., BISHOP, S.A., and COUSO, J.P., 2007. Peptides encoded by short ORFs control development and define a new eukaryotic gene family. PLoS Biol. 5: e106.
GANOT, P., CAIZERGUES-FERRER, M., and KISS, T., 1997. The family of box ACA small nucleolar RNAs is defined by an evolutionarily conserved secondary structure and ubiquitous sequence elements essential for RNA accumulation. Genes Dev. 11: 941–956.
GESTELAND, R.F. and ATKINS, J.F., 1993. The RNA World. Cold Spring Harbor Laboratory Press, Plainview, NY.
GILBERT, W., 1986. The RNA world. Nature 319: 618.
GIMPLE, O. and SCHÖN, A., 2001. In vitro and in vivo processing of cyanelle tmRNA by RNase P. Biol. Chem. 382: 1421–1429.
GOGOLEVSKAYA, I.K., KOVAL, A.P., and KRAMEROV, D.A., 2005. Evolutionary history of 4.5SH RNA. Mol. Biol. Evol. 22: 1546–1554.
GOTTESMAN, S., 2004. The small RNA regulators of Escherichia coli: roles and mechanisms. Annu. Rev. Microbiol. 58: 303–328.
GRUBER, A., KILGUS, C., MOSIG, A., HOFACKER, I.L., HENNIG, W., and STADLER, P.F., 2008a. Arthropod 7SK RNA. Mol. Biol. Evol. 25: 1923–1930.

GRUBER, A.R., KOPER-EMDE, D., MARZ, M., TAFER, H., BERNHART, S., OBERNOSTERER, G., MOSIG, A., HOFACKER, I.L., STADLER, P.F., and BENECKE, B.-J., 2008b. Invertebrate 7SK snRNAs. J. Mol. Evol. 66: 107–115. GUENEAU DE NOVOA, P. and WILLIAMS, K.P., 2004. The tmRNA website: reductive evolution of tmRNA in plastids and other endosymbionts. Nucleic Acids Res. 32: D104–D108. GULTYAEV, A.P. and ROUSSIS, A., 2007. Identification of conserved secondary structures and expansion segments in enod40 RNAs reveals new enod40 homologues in plants. Nucleic Acids Res. 35: 3144–3152. GÜRSOY, H.-C., KOPER, D., and BENECKE, B.-J., 2000. The vertebrate 7S K RNA separates hagfish (Myxine glutinosa) and lamprey (Lampetra fluviatilis). J. Mol. Evol. 50: 456–464. GÖRKE, B. and VOGEL, J., 2008. Noncoding RNA control of the making and breaking of sugars. Genes Dev. 22: 2914–2925. HAIDER, S., MATSUMOTO, R., KUROSAWA, N., WAKUI, K., FUKUSHIMA, Y., and ISOBE, M., 2006. Molecular characterization of a novel translocation t(5;14)(q21;q32) in a patient with congenital abnormalities. J. Hum. Genet. 51: 335–340. HAN, J., KIM, D., and MORRIS, K.V., 2007. Promoter-associated RNA is required for RNA-directed transcriptional gene silencing in human cells. Proc. Natl. Acad. Sci. USA 104: 12422–12427. HARISMENDY, O., GENDREL, C.G., SOULARUE, P., GIDROL, X., SENTENAC, A., WERNER, M., and LEFEBVRE, O., 2003. Genome-wide location of yeast RNA polymerase III transcription machinery. EMBO J. 22: 4738–4747. HÄSLER, J. and STRUB, K., 2006. Alu elements as regulators of gene expression. Nucleic Acids Res. 34: 5491–5497. HASTINGS, K.E., 2005. SL trans-splicing: easy come or easy go? Trends Genet. 21: 240–247. HAUGEN, P., SIMON, D.M., and BHATTACHARYA, D., 2005. The natural history of group I introns. Trends Genet. 21: 111–119. HAVILIO, M., LEVANON, E.Y., LERMAN, G., KUPIEC, M., and EISENBERG, E., 2005. Evidence for abundant transcription of non-coding regions in the Saccharomyces cerevisiae genome.
BMC Genomics 6: 93. HE, H., et al., 2007. Mapping the C. elegans noncoding transcriptome with a whole-genome tiling microarray. Genome Res. 17: 1471–1477. HE, S., SU, H., LIU, C., SKOGERBØ, G., HE, H., HE, D., ZHU, X., LIU, T., ZHAO, Y., and CHEN, R., 2008. MicroRNA-encoding long non-coding RNAs. BMC Genomics 21: 236. HERTEL, J., HOFACKER, I.L., and STADLER, P.F., 2008. snoReport: computational identification of snoRNAs with unknown targets. Bioinformatics 24: 158–164. HILLER, M., FINDEISS, S., LEIN, S., MARZ, M., NICKEL, C., ROSE, D., SCHULZ, C., BACKOFEN, R., PROHASKA, S.J., REUTER, G., and STADLER, P.F., 2008. Conserved introns reveal novel transcripts in Drosophila melanogaster. Genome Res. 19: 1289–1300.

HINAS, A. and SÖDERBOM, F., 2007. Treasure hunt in an amoeba: non-coding RNAs in Dictyostelium discoideum. Curr. Genet. 51: 141–159. HIROSE, T. and STEITZ, J.A., 2001. Position within the host intron is critical for efficient processing of box C/D snoRNAs in mammalian cells. Proc. Natl. Acad. Sci. USA 98: 12914–12919. HIROSE, Y. and HARADA, F., 2008. Mouse nucleolin binds to 4.5S RNAh, a small noncoding RNA. Biochem. Biophys. Res. Commun. 365: 62–68. HOGG, J.R. and COLLINS, K., 2007. Human Y5 RNA specializes a Ro ribonucleoprotein for 5S ribosomal RNA quality control. Genes Dev. 21: 3067–3072. HORIKE, S., MITSUYA, K., MEGURO, M., KOTOBUKI, N., KASHIWAGI, A., NOTSU, T., SCHULZ, T.C., SHIRAYOSHI, Y., and OSHIMURA, M., 2000. Targeted disruption of the human LIT1 locus defines a putative imprinting control element playing an essential role in Beckwith–Wiedemann syndrome. Hum. Mol. Genet. 9: 2075–2083. HUTCHINSON, J.N., ENSMINGER, A.W., CLEMSON, C.M., LYNCH, C.R., LAWRENCE, J.B., and CHESS, A., 2007. A screen for nuclear transcripts identifies two linked noncoding RNAs associated with SC35 splicing domains. BMC Genomics 8: 39. HÜTTENHOFER, A., KIEFMANN, M., MEIER-EWERT, S., O’BRIEN, J., LEHRACH, H., BACHELLERIE, J.P., and BROSIUS, J., 2001. RNomics: an experimental approach that identifies 201 candidates for novel, small, non-messenger RNAs in mouse. EMBO J. 20: 2943–2953. INAGAKI, S., NUMATA, K., KONDO, T., TOMITA, M., YASUDA, K., KANAI, A., and KAGEYAMA, Y., 2005. Identification and expression analysis of putative mRNA-like non-coding RNA in Drosophila. Genes Cells 10: 1163–1173. ISHII, N., et al., 2006. Identification of a novel non-coding RNA, MIAT, that confers risk of myocardial infarction. J. Hum. Genet. 51: 1087–1099. ISOGAI, Y., TAKADA, S., TJIAN, R., and KELES, S., 2007. Novel TRF1/BRF target genes revealed by genome-wide analysis of Drosophila Pol III transcription. EMBO J. 26: 79–89. JACOB, Y., SEIF, E., PAQUET, P.-O., and LANG, B.F., 2004.
Loss of the mRNA-like region in mitochondrial tmRNAs of jakobids. RNA 10: 605–614. JACOB, Y., SHARKADY, S.M., BHARDWAJ, K., SANDA, A., and WILLIAMS, K.P., 2005. Function of the SmpB tail in transfer-messenger RNA translation revealed by a nucleus-encoded form. J. Biol. Chem. 280: 5503–5509. JÁDY, B.E., BERTRAND, E., and KISS, T., 2004. Human telomerase RNA and box H/ACA scaRNAs share a common Cajal body specific localization signal. J. Cell Biol. 164: 647–652. JIA, D., CAI, L., HE, H., SKOGERBØ, G., LI, T., AFTAB, M.N., and CHEN, R., 2007. Systematic identification of non-coding RNA 2,2,7-trimethylguanosine cap structures in Caenorhabditis elegans. BMC Mol. Biol. 8: 86. JÖCHL, C., REDERSTORFF, M., HERTEL, J., STADLER, P.F., HOFACKER, I.L., SCHRETTL, M., HAAS, H., and HÜTTENHOFER, A., 2008. Small ncRNA transcriptome

analysis from Aspergillus fumigatus suggests a novel mechanism for regulation of protein-synthesis. Nucleic Acids Res. 36: 2677–2689. JOHNSON, R., TEH, C.H., JIA, H., VANISRI, R.R., PANDEY, T., LU, Z.H., BUCKLEY, N.J., STANTON, L.W., and LIPOVICH, L., 2008. Regulation of neural macroRNAs by the transcriptional repressor REST. RNA, DOI: 10.1261/rna.1127009. JOLLY, C. and LAKHOTIA, S.C., 2006. Human sat III and Drosophila hsrω transcripts: a common paradigm for regulation of nuclear RNA processing in stressed cells. Nucleic Acids Res. 34: 5508–5514. JONES, T., OTTO, W., MARZ, M., EDDY, S., and STADLER, P., 2009. A survey of nematode SmY RNAs. RNA Biol. 6(1): 5–8. KACHOURI, R., STRIBINSKIS, V., ZHU, Y., RAMOS, K.S., WESTHOF, E., and LI, Y., 2005. A surprisingly large RNase P RNA in Candida glabrata. RNA 11: 1064–1072. KAPRANOV, P., et al., 2007. RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science 316: 1484–1488. KATAYAMA, S., et al., 2005. Antisense transcription in the mammalian transcriptome. Science 309: 1564–1566. KAWAJI, H., et al., 2008. Hidden layers of human small RNAs. BMC Genomics 9: 157. KAZANOV, M.D., VITRESCHAK, A.G., and GELFAND, M.S., 2007. Abundance and functional diversity of riboswitches in microbial communities. BMC Genomics 8: 347. KHANAM, T., ROZHDESTVENSKY, T.S., BUNDMAN, M., GALIVETI, C.R., HANDEL, S., SUKONINA, V., JORDAN, U., BROSIUS, J., and SKRYABIN, B.V., 2007. Two primate-specific small nonprotein-coding RNAs in transgenic mice: neuronal expression, subcellular localization and binding partners. Nucleic Acids Res. 35: 529–539. KIM, K.-s. and LEE, Y., 2004. Regulation of 6S RNA biogenesis by switching utilization of both sigma factors and endoribonucleases. Nucleic Acids Res. 32: 6057–6068. KISHORE, S. and STAMM, S., 2006. Regulation of alternative splicing by snoRNAs. Cold Spring Harb. Symp. Quant. Biol. 71: 329–334. KOHTZ, J. and FISHELL, G., 2004.
Developmental regulation of EVF-1, a novel non-coding RNA transcribed upstream of the mouse Dlx6 gene. Gene Expr. Patterns 4: 407–412. KOMINE, Y., KITABATAKE, M., YOKOGAWA, T., NISHIKAWA, K., and INOKUCHI, H., 1994. A tRNA-like structure is present in 10Sa RNA, a small stable RNA from Escherichia coli. Proc. Natl. Acad. Sci. USA 91: 9223–9227. KÖNIG, H., MATTER, N., BADER, R., THIELE, W., and MÜLLER, F., 2007. Splicing segregation: the minor spliceosome acts outside the nucleus and controls cell proliferation. Cell 131: 718–729. KRAVCHENKO, J., ROGOZIN, I.B., KOONIN, E.V., and CHUMAKOV, P.M., 2005. Transcription of mammalian mRNAs by a novel nuclear RNA polymerase of mitochondrial origin. Nature 436: 735–739. KRUEGER, B.J., et al., 2008. LARP7 is a stable component of the 7SK snRNP while P-TEFb, HEXIM1 and hnRNP A1

are reversibly associated. Nucleic Acids Res. 36: 2218–2229. KURYSHEV, V.Y., SKRYABIN, B.V., KREMERSKOTHEN, J., JURKA, J., and BROSIUS, J., 2001. Birth of a gene: locus of neuronal BC200 snmRNA in three prosimians and human BC200 pseudogenes as archives of change in the Anthropoidea lineage. J. Mol. Biol. 309: 1049–1066. KWEK, K.Y., MURPHY, S., FURGER, A., THOMAS, B., O’GORMAN, W., KIMURA, H., PROUDFOOT, N.J., and AKOULITCHEV, A., 2002. U1 snRNA associates with TFIIH and regulates transcriptional initiation. Nat. Struct. Biol. 9: 800–805. LAFONTAINE, D.L. and TOLLERVEY, D., 2001. Ribosomal RNA. Encyclopedia of Life Sciences. LANDT, S.G., ABELIUK, E., MCGRATH, P.T., LESLEY, J.A., MCADAMS, H.H., and SHAPIRO, L., 2008. Small non-coding RNAs in Caulobacter crescentus. Mol. Microbiol. 68: 600–614. LANZ, R.B., MCKENNA, N.J., ONATE, S.A., ALBRECHT, U., WONG, J., TSAI, S.Y., TSAI, M.J., and O’MALLEY, B.W., 1999. A steroid receptor coactivator, SRA, functions as an RNA and is present in an SRC-1 complex. Cell 97: 17–27. LARSSON, P., HINAS, A., ARDELL, D.H., KIRSEBOM, L., VIRTANEN, A., and SÖDERBOM, F., 2008. De novo search for non-coding RNA genes in the AT-rich genome of Dictyostelium discoideum: performance of Markov-dependent genome feature scoring. Genome Res. 18: 888–899. LEONARDI, J., BOX, J.A., BUNCH, J.T., and BAUMANN, P., 2008. TER1, the RNA subunit of fission yeast telomerase. Nat. Struct. Mol. Biol. 15: 26–33. LESTRADE, L. and WEBER, M.J., 2008. snoRNA-LBME-db, a comprehensive database of human H/ACA and C/D box snoRNAs. Nucleic Acids Res. 34: D158–D162. LEYGUE, E., 2007. Steroid receptor RNA activator (SRA1): unusual bifaceted gene products with suspected relevance to breast cancer. Nucl. Recept. Signal. 5: e006. LI, D., WILLKOMM, D.K., SCHÖN, A., and HARTMANN, R.K., 2007a. RNase P of the Cyanophora paradoxa cyanelle: a plastid ribozyme. Biochimie 89: 1528–1538. LI, F., SONBUCHNER, L., KYES, S.A., EPP, C., and DEITSCH, K.W., 2008.
Nuclear non-coding RNAs are transcribed from the centromeres of Plasmodium falciparum and are associated with centromeric chromatin. J. Biol. Chem. 283: 5692–5698. LI, J., WITTE, D.P., DYKE, T.V., and ASKEW, D.S., 1997. Expression of the putative proto-oncogene His-1 in normal and neoplastic tissues. Am. J. Pathol. 150: 1297–1305. LI, L., et al., 2007b. Global identiﬁcation and characterization of transcriptionally active regions in the rice genome. PLoS ONE 2: e294. LIANG, X., HURY, A., HOZE, E., ULIEL, S., MYSLYUK, I., APATOFF, A., UNGER, R., and MICHAELI, S., 2006. Genome-wide analysis of C/D and H/ACA-like small nucleolar RNAs in Leishmania major indicates conservation among trypanosomatids in the repertoire and in their rRNA targets. Eukaryotic Cell 6: 361–377. LIANG, X.-h., HURY, A., HOZE, E., ULIEL, S., MYSLYUK, I., APATOFF, A., UNGER, R., and MICHAELI, S., 2007. Genomewide analysis of C/D and H/ACA-like small nucleolar

RNAs in Leishmania major indicates conservation among Trypanosomatids in the repertoire and in their rRNA targets. Eukaryotic Cell 6: 361–377. LIDIE, K.B. and van DOLAH, F.M., 2007. Spliced leader RNA-mediated trans-splicing in a dinoflagellate, Karenia brevis. J. Eukaryot. Microbiol. 54: 427–435. LIN, R., MAEDA, S., LIU, C., KARIN, M., and EDGINGTON, T.S., 2007. A large noncoding RNA is a marker for murine hepatocellular carcinomas and a spectrum of human carcinomas. Oncogene 26: 851–858. LIU, L., BEN-SHLOMO, H., HU, Y.X., STERN, M.Z., GONCHAROV, I., ZHANG, Y., and MICHAELI, S., 2003. The trypanosomatid signal recognition particle consists of two RNA molecules, a 7SL RNA homolog and a novel tRNA-like molecule. J. Biol. Chem. 278: 18271–18280. LORKOVIC, Z.J., LEHNER, R., FORSTNER, C., and BARTA, A., 2005. Evolutionary conservation of minor U12-type spliceosome between plants and humans. RNA 11: 1095–1107. LOURO, R., EL-JUNDI, T., NAKAYA, H.I., REIS, E.M., and VERJOVSKI-ALMEIDA, S., 2008. Conserved tissue expression signatures of intronic noncoding RNAs transcribed from human and mouse loci. Genomics 92: 18–25. LUO, Y. and LI, S., 2007. Genome-wide analyses of retrogenes derived from the human box H/ACA snoRNAs. Nucleic Acids Res. 35: 559–571. LYKKE-ANDERSEN, J., AAGAARD, C., SEMIONENKOV, M., and GARRETT, R.A., 1997. Archaeal introns: splicing, intercellular mobility and evolution. Trends Biochem. Sci. 22: 326–331. MACMORRIS, M., KUMAR, M., LASDA, E., LARSEN, A., KRAEMER, B., and BLUMENTHAL, T., 2007. A novel family of C. elegans snRNPs contains proteins associated with trans-splicing. RNA 13: 511–520. MADEJ, M.J., ALFONZO, J.D., and HÜTTENHOFER, A., 2007. Small ncRNA transcriptome analysis from kinetoplast mitochondria of Leishmania tarentolae. Nucleic Acids Res. 35: 1544–1554. MADEJ, M.J., NIEMANN, M., HÜTTENHOFER, A., and GÖRINGER, H.U., 2008. Identification of novel guide RNAs from the mitochondria of Trypanosoma brucei. RNA Biol. 5: 84–91.
MAEDA, H., FUJITA, N., and ISHIHAMA, A., 2000. Competition among seven Escherichia coli sigma subunits: relative binding affinities to the core RNA polymerase. Nucleic Acids Res. 28: 3497–3503. MAEDA, N., et al., 2006. Transcript annotation in FANTOM3: mouse gene catalog based on physical cDNAs. PLoS Genetics 2: e62. MANAK, J.R., et al., 2006. Biological function of unannotated transcription during the early development of Drosophila melanogaster. Nat. Genet. 38: 1151–1158. MARINER, P.D., WALTERS, R.D., ESPINOZA, C.A., DRULLINGER, L.F., WAGNER, S.D., KUGEL, J.F., and GOODRICH, J.A., 2008. Human Alu RNA is a modular trans-acting repressor of mRNA transcription during heat shock. Mol. Cell 29: 499–509. MARONEY, P.A., YU, M., JANKOWSKA, Y.T., and NILSEN, T.W., 1996. Direct analysis of nematode cis- and trans-spliceosomes: a functional role for U5 snRNA in spliced leader

addition trans-splicing and the identification of novel Sm snRNPs. RNA 2: 735–745. MARQUEZ, S.M., HARRIS, J.K., KELLEY, S.T., BROWN, J.W., DAWSON, S.C., ROBERTS, E.C., and PACE, N.R., 2005. Structural implications of novel diversity in eucaryal RNase P RNA. RNA 11: 739–751. MARQUIS, J., KÄMPFER, S.S., ANGEHRN, L., and SCHÜMPERLI, D., 2008. Doxycycline-controlled splicing modulation by regulated antisense U7 snRNA expression cassettes. Gene Ther. MARZ, M., KIRSTEN, T., and STADLER, P., 2008. Evolution of spliceosomal snRNA genes in metazoan animals. J. Mol. Evol. MARZ, M., MOSIG, A., STADLER, B.M.R., and STADLER, P.F., 2007. U7 snRNAs: a computational survey. Geno. Prot. Bioinfo. 5: 187–195. MARZLUFF, W.F., 2005. Metazoan replication-dependent histone mRNAs: a distinct set of RNA polymerase II transcripts. Curr. Opin. Cell. Biol. 17: 274–280. MATOUK, I.J., DEGROOT, N., MEZAN, S., AYESH, S., ABU-LAIL, R., HOCHBERG, A., and GALUN, E., 2007. The H19 noncoding RNA is essential for human tumor growth. PLoS ONE 2: e845. MAYER, C., SCHMITZ, K.M., LI, J., GRUMMT, I., and SANTORO, R., 2006. Intergenic transcripts regulate the epigenetic state of rRNA genes. Mol. Cell 22: 351–361. MERCER, T.R., DINGER, M.E., SUNKIN, S.M., MEHLER, M.F., and MATTICK, J.S., 2008. Specific expression of long noncoding RNAs in the mouse brain. Proc. Natl. Acad. Sci. USA 105: 716–721. MEYER, M.M., ROTH, A., CHERVIN, S.M., GARCIA, G.A., and BREAKER, R.R., 2008. Confirmation of a second natural preQ1 aptamer class in Streptococcaceae bacteria. RNA 14: 685–695. MICHELS, A.A. and BENSAUDE, O., 2008. RNA-driven cyclin-dependent kinase regulation: when CDK9/cyclin T subunits of P-TEFb meet their ribonucleoprotein partners. Biotechnol. J. 3: 1022–1032. MILLAR, J.K., JAMES, R., BRANDON, N.J., and THOMSON, P.A., 2004. DISC1 and DISC2: discovering and dissecting molecular mechanisms underlying psychiatric illness. Ann. Med. 36: 367–378.
MISSAL, K., ROSE, D., and STADLER, P.F., 2005. Non-coding RNAs in Ciona intestinalis. Bioinformatics 21 (S2): i77–i78. MITCHELL, J.R., CHENG, J., and COLLINS, K., 1999. A box H/ACA small nucleolar RNA-like domain at the human telomerase 3′ end. Mol. Cell Biol. 19: 567–576. MIURA, F., KAWAGUCHI, N., SESE, J., TOYODA, A., HATTORI, M., MORISHITA, S., and ITO, T., 2006. A large-scale full-length cDNA analysis to explore the budding yeast transcriptome. Proc. Natl. Acad. Sci. USA 103: 17846–17851. MOCHIZUKI, K., FINE, N.A., FUJISAWA, T., and GOROVSKY, M.A., 2002. Analysis of a piwi-related gene implicates small RNAs in genome rearrangement in Tetrahymena. Cell 110: 689–699. MOORE, P.B. and STEITZ, T.A., 2002. The involvement of RNA in ribosome function. Nature 418: 229–235.

MOORE, S.D. and SAUER, R.T., 2005. Ribosome rescue: tmRNA tagging activity and capacity in Escherichia coli. Mol. Microbiol. 58: 456–466. MOORE, S.D. and SAUER, R.T., 2007. The tmRNA system for translational surveillance and ribosome rescue. Annu. Rev. Biochem. 76: 101–124. MOSIG, A., CHEN, J.L., and STADLER, P.F., 2007a. Homology search with fragmented nucleic acid sequence patterns. In Algorithms in Bioinformatics (WABI 2007), Vol. 4645 of Lecture Notes in Computer Science (eds R. Giancarlo and S. Hannenhalli). Springer Verlag, Berlin, Heidelberg. pp. 335–345. MOSIG, A., GUOFENG, M., STADLER, B.M.R., and STADLER, P.F., 2007b. Evolution of the vertebrate Y RNA cluster. Theory Biosci. 126: 9–14. MOSIG, A., SAMEITH, K., and STADLER, P.F., 2005. Fragrep: efficient search for fragmented patterns in genomic sequences. Geno. Prot. Bioinfo. 4: 56–60. MOURIER, T., et al., 2008. Genome-wide discovery and verification of novel structured RNAs in Plasmodium falciparum. Genome Res. 18: 281–292. MRÁZEK, J., KREUTMAYER, S.B., GRÄSSER, F.A., POLACEK, N., and HÜTTENHOFER, A., 2007. Subtractive hybridization identifies novel differentially expressed ncRNA species in EBV-infected human B cells. Nucleic Acids Res. 35: e73. MUTSUDDI, M., MARSHALL, C.M., BENZOW, K.A., KOOB, M.D., and REBAY, I., 2004. The spinocerebellar ataxia 8 noncoding RNA causes neurodegeneration and associates with staufen in Drosophila. Curr. Biol. 14: 302–308. NAKAMURA, T., NAITO, K., YOKOTA, N., SUGITA, C., and SUGITA, M., 2007. A cyanobacterial non-coding RNA, Yfr1, is required for growth under multiple stress conditions. Plant Cell Physiol. 48: 1309–1318. NAKAYA, H.I., AMARAL, R., LOURO, P.P., LOPES, A., FACHEL, A.A., MOREIRA, Y.B., EL-JUNDI, T.A., da SILVA, A.M., REIS, E.M., and VERJOVSKI-ALMEIDA, S., 2007. Genome mapping and expression analyses of human intronic noncoding RNAs reveal tissue-specific patterns and enrichment in genes related to regulation of transcription. Genome Biol. 8: R43.
NANDY, C., MRÁZEK, J., STOIBER, H., GRÄSSER, F.A., HÜTTENHOFER, A., and POLACEK, N., 2008. Epstein–Barr virus-induced expression of a novel human vault RNA. J. Mol. Biol. NEUSSER, T., GILDEHAUS, N., WURM, R., and WAGNER, R., 2008. Studies on the expression of 6S RNA from E. coli: involvement of regulators important for stress and growth adaptation. Biol. Chem. 389: 285–297. NG, K., PULLIRSCH, D., LEEB, M., and WUTZ, A., 2007. Xist and the order of silencing. EMBO Rep. 8: 34–39. NIEMITZ, E.L., DEBAUN, M.R., FALLON, J., MURAKAMI, K., KUGOH, H., OSHIMURA, M., and FEINBERG, A.P., 2004. Microdeletion of LIT1 in familial Beckwith–Wiedemann syndrome. Am. J. Hum. Genet. 75: 844–849. NILSEN, T.W., 2003. The spliceosome: the most complex macromolecular machine in the cell? Bioessays 25: 1147–1149.

OKAN, N.A., BLISKA, J., and KARZAI, A.W., 2006. A role for the SmpB-SsrA system in Yersinia pseudotuberculosis pathogenesis. PLoS Pathog. 2: e6. OKUTSU, T., et al., 2000. Expression and imprinting status of human PEG8/IGF2AS, a paternally expressed antisense transcript from the IGF2 locus, in Wilms’ tumors. J. Biochem. 127: 475–483. PAGANO, A., CASTELNUOVO, M., TORTELLI, F., FERRARI, R., DIECI, G., and CANCEDDA, R., 2007. New small nuclear RNA gene-like transcriptional units as sources of regulatory transcripts. PLoS Genet. 3: e1. PALFI, Z., SCHIMANSKI, B., GÜNZL, A., LÜCKE, S., and BINDEREIF, A., 2005. U1 small nuclear RNP from Trypanosoma brucei: a minimal U1 snRNA with unusual protein components. Nucleic Acids Res. 33: 2493–2503. PANZITT, K., et al., 2007. Characterization of HULC, a novel gene with striking up-regulation in hepatocellular carcinoma, as noncoding RNA. Gastroenterology 132: 330–342. PARROTT, A.M. and MATHEWS, M.B., 2007. Novel rapidly evolving hominid RNAs bind nuclear factor 90 and display tissue-restricted distribution. Nucleic Acids Res. 35: 6249–6258. PATEL, A.A. and STEITZ, J.A., 2003. Splicing double: insights from the second spliceosome. Nat. Rev. Mol. Cell Biol. 4: 960–970. PERREAULT, J., NOÉL, J.-F., BRIÈRE, F., COUSINEAU, B., LUCIER, J.-F., PERREAULT, J.-P., and BOIRE, G., 2005. Retropseudogenes derived from human Ro/SS-A autoantigen-associated hY RNAs. Nucleic Acids Res. 33: 2032–2041. PERREAULT, J., PERREAULT, J.-P., and BOIRE, G., 2007. The Ro associated Y RNAs in metazoans: evolution and diversification. Mol. Biol. Evol. 24: 1678–1689. PETTITT, J., MÜLLER, B., STANSFIELD, I., and CONNOLLY, B., 2008. Spliced leader trans-splicing in the nematode Trichinella spiralis uses highly polymorphic, noncanonical spliced leaders. RNA 14: 760–770. PEZER, Z. and UGARKOVIĆ, D., 2008. Role of non-coding RNA and heterochromatin in aneuploidy and cancer. Semin. Cancer Biol. 18: 123–130.
PIBOUIN, L., VILLAUDY, J., FERBUS, D., MULERIS, M., PROSPÉRI, M.-T., REMVIKOS, Y., and GOUBIN, G., 2002. Cloning of the mRNA of overexpression in colon carcinoma-1: a sequence overexpressed in a subset of colon carcinomas. Cancer Genet. Cytogenet. 133: 55–60. PICCINELLI, P., ROSENBLAD, M.A., and SAMUELSSON, T., 2005. Identification and analysis of ribonuclease P and MRP RNA in a broad range of eukaryotes. Nucleic Acids Res. 33: 4485–4495. PIRROTTA, V., 2002. Trans-splicing in Drosophila. Bioessays 24: 988–991. PIZZUTI, A., et al., 1999. Isolation and characterization of a novel transcript embedded within HIRA, a gene deleted in DiGeorge syndrome. Mol. Genet. Metab. 67: 227–235. PODLEVSKY, J.D., BLEY, C.J., OMANA, R.V., QI, X., and CHEN, J.J., 2008. The telomerase database. Nucleic Acids Res. 36: D339–D343.

POLESSKAYA, O.O., HAROUTUNIAN, V., DAVIS, K.L., HERNANDEZ, I., and SOKOLOV, B.P., 2003. Novel putative nonprotein-coding RNA gene from 11q14 displays decreased expression in brains of patients with schizophrenia. J. Neurosci. Res. 74: 111–122. POUCHKINA-STANTCHEVA, N.N. and TUNNACLIFFE, A., 2005. Spliced leader RNA-mediated trans-splicing in phylum Rotifera. Mol. Biol. Evol. 22: 1482–1489. PRASANTH, K.V. and SPECTOR, D.L., 2007. Eukaryotic regulatory RNAs: an answer to the ‘genome complexity’ conundrum. Genes Dev. 21: 11–42. PUERTA-FERNANDEZ, E., BARRICK, J.E., ROTH, A., and BREAKER, R.R., 2006. Identification of a large noncoding RNA in extremophilic eubacteria. Proc. Natl. Acad. Sci. USA 103: 19490–19495. RANDAU, L., SCHRÖDER, I., and SÖLL, D., 2008. Life without RNase P. Nature 453: 120–123. RAVASI, T., et al., 2006. Experimental validation of the regulated expression of large numbers of non-coding RNAs from the mouse genome. Genome Res. 16: 11–19. REGULSKI, E.E. and BREAKER, R.R., 2008. In-line probing analysis of riboswitches. Methods Mol. Biol. 419: 53–67. REICHOW, S.L., HAMMA, T., FERRÉ-D’AMARÉ, A.R., and VARANI, G., 2007. The structure and function of small nucleolar ribonucleoproteins. Nucleic Acids Res. 35: 1452–1464. RICCARDO, S., TORTORIELLO, G., GIORDANO, E., TURANO, M., and FURIA, M., 2007. The coding/non-coding overlapping architecture of the gene encoding the Drosophila pseudouridine synthase. BMC Mol. Biol. 8: 15. RICHARD, P., KISS, M.A., DARZACQ, X., and KISS, T., 2006. Cotranscriptional recognition of human intronic box H/ACA snoRNAs occurs in a splicing-independent manner. Mol. Cell. Biol. 26: 2540–2549. RICHARDS, J., SUNDERMEIER, T., SVETLANOV, A., and KARZAI, A.W., 2008. Quality control of bacterial mRNA decoding and decay. Biochim. Biophys. Acta 1779: 574–582. ROBERTS, T., CHERNOVA, O., and COWELL, J.K., 1998.
NB4S, a member of the TBC1 domain family of genes, is truncated as a result of a constitutional t(1;10)(p22;q21) chromosome translocation in a patient with stage 4S neuroblastoma. Hum. Mol. Genet. 7: 1169–1178. ROSE, D., HACKERMÜLLER, J., WASHIETL, S., FINDEIß, S., REICHE, K., HERTEL, J., STADLER, P.F., and PROHASKA, S.J., 2007. Computational RNomics of drosophilids. BMC Genomics 8: 406. ROSE, D., JÖRIS, J., HACKERMÜLLER, J., REICHE, K., LI, Q., and STADLER, P.F., 2008. Duplicated RNA genes in teleost fish genomes. J. Bioinf. Comp. Biol., In press. ROZHDESTVENSKY, T.S., KOPYLOV, A.M., BROSIUS, J., and HÜTTENHOFER, A., 2001. Neuronal BC1 RNA structure: evolutionary conversion of a tRNA(Ala) domain into an extended stem-loop structure. RNA 7: 722–730. RUSSELL, A.G., CHARETTE, J.M., SPENCER, D.F., and GRAY, M.W., 2006. An early evolutionary origin for the minor spliceosome. Nature 443: 863–866. RYMARQUIS, L.A., KASTENMAYER, J.P., HÜTTENHOFER, A.G., and GREEN, P.J., 2008. Diamonds in the rough: mRNA-like non-coding RNAs. Trends Plant Sci. 13: 329–334.

SAMARSKY, D.A., FOURNIER, M.J., SINGER, R.H., and BERTRAND, E., 1998. The snoRNA box C/D motif directs nucleolar targeting and also couples snoRNA synthesis and localization. EMBO J. 17: 3747–3757. SANCHEZ-ELSNER, T., GOU, D., KREMMER, E., and SAUER, F., 2006. Noncoding RNAs of trithorax response elements recruit Drosophila Ash1 to Ultrabithorax. Science 311: 1118–1123. SARAIYA, A.A. and WANG, C.C., 2008. snoRNA, a novel precursor of microRNA in Giardia lamblia. PLoS Pathog. 4: e1000224. SAXENA, A., LAHAV, T., HOLLAND, N., AGGARWAL, G., ANUPAMA, A., HUANG, Y., VOLPIN, H., MYLER, P.J., and ZILBERSTEIN, D., 2007. Analysis of the Leishmania donovani transcriptome reveals an ordered progression of transient and permanent changes in gene expression during differentiation. Mol. Biochem. Parasitol. 152: 53–65. SCHMITZ, J., ZEMANN, A., CHURAKOV, G., KUHL, H., GRÜTZNER, F., REINHARDT, R., and BROSIUS, J., 2008. Retroposed SNOfall: a mammalian-wide comparison of platypus snoRNAs. Genome Res. 18: 1005–1010. SCHOEFTNER, S. and BLASCO, M.A., 2008. Developmentally regulated transcription of mammalian telomeres by DNA-dependent RNA polymerase II. Nat. Cell Biol. 10: 228–236. SEIF, E.R., FORGET, L., MARTIN, N.C., and LANG, B.F., 2003. Mitochondrial RNase P RNAs in ascomycete fungi: lineage-specific variations in RNA secondary structure. RNA 9: 1073–1083. SERGANOV, A. and PATEL, D.J., 2007. Ribozymes, riboswitches and beyond: regulation of gene expression without proteins. Nat. Rev. Genet. 8: 776–790. SHARKADY, S.M. and WILLIAMS, K.P., 2004. A third lineage with two-piece tmRNA. Nucleic Acids Res. 32: 4531–4538. SHARMA, C.M., DARFEUILLE, F., PLANTINGA, T.H., and VOGEL, J., 2007. A small RNA regulates multiple ABC transporter mRNAs by targeting C/A-rich elements inside and upstream of ribosome-binding sites. Genes Dev. 21: 2804–2817. SHETH, N., ROCA, X., HASTINGS, M.L., ROEDER, T., KRAINER, A.R., and SACHIDANANDAM, R., 2006.
Comprehensive splice-site analysis using comparative genomics. Nucleic Acids Res. 34: 3955–3967. SILVA, A.P.M., et al., 2003. Identification of 9 novel transcripts and two RGSL genes within the hereditary prostate cancer region (HPC1) at 1q25. Gene 310: 49–57. SONE, M., HAYASHI, T., TARUI, H., AGATA, K., TAKEICHI, M., and NAKAGAWA, S., 2007. The mRNA-like noncoding RNA Gomafu constitutes a novel nuclear domain in a subset of neurons. J. Cell Sci. 120: 2498–2506. SOUSA, C., JOHANSSON, C., CHARON, C., MANYANI, H., SAUTTER, C., KONDOROSI, A., and CRESPI, M., 2001. Translational and structural requirements of the early nodulin gene enod40, a short-open reading frame-containing RNA, for elicitation of a cell-specific growth response in the alfalfa root cortex. Mol. Cell. Biol. 21: 354–366.

SRIKANTAN, V., et al., 2000. PCGEM1, a prostate-specific gene, is overexpressed in prostate cancer. Proc. Natl. Acad. Sci. USA 97: 12216–12221. STADLER, P.F., et al., 2008. Evolution of vault RNAs. J. Mol. Biol., Submitted. STORZ, G., OPDYKE, J.A., and WASSARMAN, K.M., 2006. Regulating bacterial transcription with small RNAs. Cold Spring Harb. Symp. Quant. Biol. 71: 269–273. STROBEL, S.A. and COCHRANE, J.C., 2007. RNA catalysis: ribozymes, ribosomes, and riboswitches. Curr. Opin. Chem. Biol. 11: 636–643. SUDARSAN, N., BARRICK, J.E., and BREAKER, R.R., 2003. Metabolite-binding RNA domains are present in the genes of eukaryotes. RNA 9: 644–647. SUDARSAN, N., HAMMOND, M.C., BLOCK, K.F., WELZ, R., BARRICK, J.E., ROTH, A., and BREAKER, R.R., 2006. Tandem riboswitch architectures exhibit complex gene control functions. Science 314: 300–304. SUTHERLAND, H.F., et al., 1996. Identification of a novel transcript disrupted by a balanced translocation associated with DiGeorge syndrome. Am. J. Hum. Genet. 59: 23–31. SZÉLL, M., BATA-CSÖRGŐ, Z., and KEMÉNY, L., 2008. The enigmatic world of mRNA-like ncRNAs: their role in human evolution and in human diseases. Semin. Cancer Biol. 18: 141–148. SZYMANSKI, M., BARCISZEWSKA, M.Z., ERDMANN, V.A., and BARCISZEWSKI, J., 2005. A new frontier for molecular medicine: noncoding RNAs. Biochim. Biophys. Acta 1756: 65–75. TAM, W. and DAHLBERG, J.E., 2006. miR-155/BIC as an oncogenic microRNA. Genes Chromosomes Cancer 45: 211–212. TANAKA, R., SATOH, H., MORIYAMA, M., SATOH, K., MORISHITA, Y., YOSHIDA, S., WATANABE, T., NAKAMURA, Y., and MORI, S., 2000. Intronic U50 small-nucleolar-RNA (snoRNA) host gene of no protein-coding potential is mapped at the chromosome breakpoint t(3;6)(q27;q15), of human B-cell lymphoma. Genes Cells 5: 277–287. TANG, T.-H., POLACEK, N., ZYWICKI, M., HUBER, H., BRUGGER, K., GARRETT, R., BACHELLERIE, J.P., and HÜTTENHOFER, A., 2005.
Identification of novel non-coding RNAs as potential antisense regulators in the archaeon Sulfolobus solfataricus. Mol. Microbiol. 55: 469–481. TANG, T.H., ROZHDESTVENSKY, T.S., D’ORVAL, B.C., BORTOLIN, M.L., HUBER, H., CHARPENTIER, B., BRANLANT, C., BACHELLERIE, J.P., BROSIUS, J., and HÜTTENHOFER, A., 2002. RNomics in Archaea reveals a further link between splicing of archaeal introns and rRNA processing. Nucleic Acids Res. 30: 921–930. TERNS, M.P. and TERNS, R.M., 2002. Small nucleolar RNAs: versatile trans-acting molecules of ancient evolutionary origin. Gene Expr. 10: 17–39. TEUNISSEN, S.W.M., KRUITHOF, M.J.M., FARRIS, A.D., HARLEY, J.B., van VENROOIJ, W.J., and PRUIJN, G.J.M., 2000. Conserved features of Y RNAs: a comparison of experimentally derived secondary structures. Nucleic Acids Res. 28: 610–619.

The ENCODE Project Consortium, 2007. Identiﬁcation and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447: 799–816. THOMAS, S., MARTINEZ, L.I.T., WESTENBERGER, S.J., and STURM, N.R., 2007. A population study of the minicircles in Trypanosoma cruzi: predicting guide RNAs in the absence of empirical RNA editing. BMC Genomics 8: 133. TOLEDO-ARANA, A., REPOILA, F., and COSSART, P., 2007. Small noncoding RNAs controlling pathogenesis. Curr. Opin. Microbiol. 10: 182–188. TORARINSSON, E., SAWERA, M., HAVGAARD, J., FREDHOLM, M., and GORODKIN, J., 2006. Thousands of corresponding human and mouse genomic regions unalignable in primary sequence contain common RNA structure. Genome Res. 16: 885–889. Erratum Genome Res. 16:1439 (2006). TROTOCHAUD, A.E. and WASSARMAN, K.M., 2006. 6S RNA regulation of pspF transcription leads to altered cell survival at high pH. J. Bacteriol. 188: 3936–3943. TYCOWSKI, K.T. and STEITZ, J.A., 2001. Non-coding snoRNA host genes in Drosophila: expression strategies for modiﬁcation guide snoRNAs. Eur. J. Cell Biol. 80: 119–125. TZFATI, Y., KNIGHT, Z., ROY, J., and BLACKBURN, E.H., 2003. A novel pseudoknot element is essential for the action of a yeast telomerase. Genes Dev. 17: 1779–1788. ULYANOV, N.B., SHEFER, K., JAMES, T.L., and TZFATI, Y., 2007. Pseudoknot structures with conserved base triples in telomerase RNAs of ciliates. Nucleic Acids Res. 35: 6150–6160. UZILOV, A.V., KEEGAN, J.M., and MATHEWS, D.H., 2006. Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change. BMC Bioinformatics 7: 173. VALADKHAN, S., MOHAMMADI, A., WACHTEL, C., and MANLEY, J. L., 2007. Protein-free spliceosomal snRNAs catalyze a reaction that resembles the ﬁrst step of splicing. RNA 13: 2300–2311. van HORN, D.J., EISENBERG, D., O’BRIEN, C.A., and WOLIN, S. L., 1995. Caenorhabditis elegans embryos contain only one major species of Ro RNP. RNA 1: 293–303. 
van ZON, A., MOSSINK, M.H., SCHEPER, R.J., SONNEVELD, P., and WIEMER, E.A.C., 2003. The vault complex. Cell. Mol. Life Sci. 60: 1828–1837. VINCENTI, S., CHIARA, V.D., BOZZONI, I., and PRESUTTI, C., 2007. The position of yeast snoRNA-coding regions within host introns is essential for their biosynthesis and for efﬁcient splicing of the host pre-mRNA. RNA 13: 138–150. VOGEL, J. and SHARMA, C.M., 2005. How to ﬁnd small non-coding RNAs in bacteria. Biol. Chem. 386: 1219–1238. WALDMINGHAUS, T., KORTMANN, J., GESING, S., and NARBERHAUS, F., 2008. Generation of synthetic RNA-based thermosensors. Biol. Chem. 389: 1319–1326. WALKER, S.C. and ENGELKE, D.R., 2006. Ribonuclease P: the evolution of an ancient RNA enzyme. Crit. Rev. Biochem. Mol. Biol. 41: 77–102.

292

Chapter 14

Noncoding RNA

WANG, J.X. and BREAKER, R.R., 2008. Riboswitches that sense S-adenosylmethionine and S-adenosylhomocysteine. Biochem. Cell Biol. 86: 157–168. WANG, P., YIN, S., ZHANG, Z., XIN, D., HU, L., KONG, X., and HURST, L.D., 2008. Evidence for common short natural trans sense-antisense pairing between transcripts from protein coding genes. Genome Biol. 9: R169. WASHIETL, S., HOFACKER, I.L., LUKASSER, M., HU¨TTENHOFER, A., and STADLER, P.F., 2005. Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome. Nat. Biotech. 23: 1383–1390. WASHIETL, S., et al., 2007. Structured RNAs in the ENCODE selected regions of the human genome. Genome Res. 17: 852–864. WASSARMAN, D.A. and STEITZ, J.A., 1991. Structural analyses of the 7SK ribonucleoprotein (RNP), the most abundant human small RNP of unknown function. Mol. Cell. Biol. 11: 3432–3445. WASSARMAN, K.M., 2007. 6S RNA: a small RNA regulator of transcription. Curr. Opin. Microbiol. 10: 164–168. WASSARMAN, K.M. and SAECKER, R.M., 2006. Synthesis-mediated release of a small RNA inhibitor of RNA polymerase. Science 314: 1601–1603. WASSARMAN, K.M. and STORZ, G., 2000. 6S RNA regulates E. coli RNA polymerase activity. Cell 101: 613–623. WEBB, C.J. and ZAKIAN, V.A., 2008. Identiﬁcation and characterization of the Schizosaccharomyces pombe TER1 telomerase RNA. Nat. Struct. Mol. Biol. 15: 34–42. WEBER, M.J., 2006. Mammalian small nucleolar RNAs are mobile genetic elements. PLoS Genet. 2: e205. WEINSTEIN, L.B. and STEITZ, J.A., 1999. Guided tours: from precursor snoRNA to functional snoRNP. Curr. Opin. Cell Biol. 11: 378–384. WEN, J., PARKER, B.J., and WEILLER, G.F., 2007. In silico identiﬁcation and characterization of mRNA-like noncoding transcripts in Medicago truncatula. In Silico Biol. 7: 485–505. WEVRICK, R., KERNS, J.A., and FRANCKE, U., 1994. Identiﬁcation of a novel paternally expressed gene in the Prader–Willi syndrome region. Hum. Mol. Genet. 3: 1877–1882. 
WHITEHEAD, J., PANDEY, G.K., and KANDURI, C. 2008 Regulation of the mammalian epigenome by long noncoding RNAs. Biochim. Biophys. Acta, DOI: 10.1016/j. bbagen.2008.10.007. WIERZBICKI, A.T., HAAG, J.R., and PIKAARD, C.S., 2008. Noncoding transcription by RNA polymerase Pol IVb/Pol V mediates transcriptional silencing of overlapping and adjacent genes. Cell 135: 635–648. WILHELM, B.T., MARGUERAT, S., WATT, S., SCHUBERT, F., WOOD, V., GOODHEAD, I., PENKETT, C.J., ROGERS, J., and B€aHLER, J., 2008. Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature 453: 1239–1243. WILL, C.L. and LU¨HRMANN, R., 2005. Splicing of a rare class of introns by the U12-dependent spliceosome. Biol. Chem. 386: 713–724.

WILL, S., MISSAL, K., HOFACKER, I.L., STADLER, P.F., and BACKOFEN, R., 2007. Inferring non-coding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comp. Biol. 3: e65. WILLKOMM, D.K. and HARTMANN, R.K., 2005. 6S RNA: an ancient regulator of bacterial RNA polymerase rediscovered. Biol. Chem. 386: 1273–1277. WILLKOMM, D.K. and HARTMANN, R.K., 2007. An important piece of the RNase P jigsaw solved. Trends Biochem. Sci. 32: 247–250. WILUSZ, J.E., FREIER, S.M., and SPECTOR, D.L., 2008. 30 end processing of a long nuclear-retained noncoding RNA yields a tRNA-like cytoplasmic RNA. Cell 135: 919–932. WOLF, S., MERTENS, D., SCHAFFNER, C., KORZ, C., D€oHNER, H., STILGENBAUER, S., and LICHTER, P., 2001. B-cell neoplasia associated gene with multiple splicing (BCMS): the candidate B-CLL gene on 13q14 comprises more than 560 kb covering all critical regions. Hum. Mol. Genet. 10: 1275–1285. WOODHAMS, M.D., STADLER, P.F., PENNY, D., and COLLINS, L.J., 2007. RNAse MRP and the RNA processing cascade in the eukaryotic ancestor. BMC Evol. Biol. 7: S13. WUTZ, A., 2003. RNAs templating chromatin structure for dosage compensation in animals. Bioessays 25: 434–442. XIE, M., MOSIG, A., QI, X., LI, Y., STADLER, P.F., and CHEN, J.J.L., 2008. Size variation and structural conservation of vertebrate telomerase RNA. J. Biol. Chem. 283: 2049–2059. YAMADA, K., KANO, J., TSUNODA, H., YOSHIKAWA, H., OKUBO, C., ISHIYAMA, T., and NOGUCHI, M., 2006. Phenotypic characterization of endometrial stromal sarcoma of the uterus. Cancer Sci. 97: 106–112. YANG, C.-Y., ZHOU, H., LUO, J., and QU, L.-H., 2005. Identiﬁcation of 20 snoRNA-like RNAs from the primitive eukaryote Gardia lamblia. Biochem. Biophys. Res. Commun. 328: 1224–1231. YANO, Y., SAITO, R., YOSHIDA, N., YOSHIKI, A., WYNSHAWBORIS, A., TOMITA, M., and HIROTSUNE, S., 2004. A new role for expressed pseudogenes as ncRNA: regulation of mRNA stability of its homologous coding gene. J. Mol. Med. 82: 414–422. 
YAO, M.C., FULLER, P., and XI, X., 2003. Programmed DNA deletion as an RNA-guided system of genome defense. Science 300: 1517–1518. YOTOVA, I.Y., VLATKOVIC, I.M., PAULER, F.M., WARCZOK, K.E., AMBROS, P.F., OSHIMURA, M., THEUSSL, H.C., GESSLER, M., WAGNER, E.F., and BARLOW, D.P., 2008. Identiﬁcation of the human homolog of the imprinted mouse Air non-coding RNA. Genomics 92: 464–473. ZAGO, M.A., DENNIS, P.P., and OMER, A.D., 2005. The expanding world of small RNAs in the hyperthermophilic archaeon Sulfolobus solfataricus. Mol. Microbiol. 55: 1812–1828. ZALFA, F., ADINOLFI, S., NAPOLI, I., K€uHN-H€oLSKEN, E., URLAUB, H., ACHSEL, T., PASTORE, A., and BAGNI, C., 2005. Fragile X mental retardation protein (FMRP) binds speciﬁcally to the brain cytoplasmic RNAs BC1/BC200 via a novel RNAbinding motif. J. Biol. Chem. 280: 33403–33410.

References ZAPPULLA, D.C. and CECH, T.R., 2004. Yeast telomerase RNA: a ﬂexible scaffold for protein subunits. Proc. Natl. Acad. Sci. USA 101: 10024–10029. ZEMANN, A., OP DE BEKKE, A., KIEFMANN, M., BROSIUS, J., and SCHMITZ, J., 2006. Evolution of small nucleolar RNAs in nematodes. Nucleic Acids Res. 34: 2676–2685.

293

ZHU, Y., PULUKKUNAT, D.K., and LI, Y., 2007. Deciphering RNA structural diversity and systematic phylogeny from microbial metagenomes. Nucleic Acids Res. 35: 2283–2294. ZWIEB, C., van NUES, R.W., ROSENBLAD, M.A., BROWN, J.D., and SAMUELSON, T., 2005. A nomenclature for all signal recognition particle RNAs. RNA 11: 7–13.

Chapter 15

Evolutionary Genomics of microRNAs and Their Relatives

Andrea Tanzer, Markus Riester, Jana Hertel, Clara Isabel Bermudez-Santana, Jan Gorodkin, Ivo L. Hofacker, and Peter F. Stadler

15.1 INTRODUCTION
15.2 THE SMALL RNA ZOO
15.3 SMALL RNA BIOGENESIS
15.4 COMPUTATIONAL microRNA PREDICTION
15.5 microRNA TARGETS
15.6 EVOLUTION OF microRNAs
15.7 ORIGIN(S) OF microRNA FAMILIES
15.8 GENOMIC ORGANIZATION
15.9 SUMMARY AND OUTLOOK
REFERENCES

Evolutionary Genomics and Systems Biology, edited by Gustavo Caetano-Anolles. Copyright 2010 John Wiley & Sons, Inc.

15.1 INTRODUCTION

MicroRNAs (miRNAs) are an abundant class of small noncoding RNA (ncRNA) genes found in eukaryotes, particularly in metazoans and plants, and in their viruses. MicroRNA research has come a long way since the first discoveries of lin-4 (Lee et al., 1993) and let-7 (Reinhart et al., 2000) in Caenorhabditis elegans. The turn of the century brought the realization that miRNAs form a large new class of ncRNAs (Lagos-Quintana et al., 2001; Lau et al., 2001; Lee and Ambros, 2001) that provide a ubiquitous and powerful mechanism for RNA-mediated control of gene expression. The miRBase (Griffiths-Jones et al., 2008), a comprehensive database that collects published miRNAs and assigns unique names (Ambros et al., 2003b) to novel ones, started with only 218 sequences (v1.0, December 2002) and now lists 8619 entries in the current version 12.0 (September 2008). Today, there are 5517 publications about miRNAs in PubMed, of which 924 are reviews. These numbers illustrate the impact of this field of research on our understanding of the information encoded by the vast majority of genomic sequences and transcribed units.

MicroRNAs were the first small regulatory RNAs found in animals, but they turned out not to be the only ones. During the past few years, a variety of additional classes have been detected, many of which share functional properties and processing machinery. In the following sections, we address these similarities and differences by outlining biogenesis and function.

15.2 THE SMALL RNA ZOO

Size and final destination define the RNA classes addressed in this chapter as a reasonably homogeneous group of functional RNAs: they are about 20–30 nucleotides (nt) in length, and they guide large protein complexes to their targets, thus acting as the “RNA sensor” that allows sequence-specific binding of the proteins. Both miRNAs and siRNAs are subclasses of this large class of small ncRNAs. Like miRNAs, many other small RNAs are involved in gene silencing. However, whereas miRNAs act posttranscriptionally, some of the other classes act at other levels. MiRNAs stand out from the other small RNAs in many ways, in particular by their energetically stable precursor hairpin, which has been a key component in computational search methods. While most of this contribution deals with microRNAs, in this section we attempt to compile the related small RNAs that have come into the focus of RNA research. Given that only approximately 1% of the human genome codes for proteins, it is likely that only a fraction of the regulatory RNome has been discovered so far. New insights constantly require regrouping of classes of small RNAs, so our list can only provide a snapshot of current knowledge.

15.2.1 Endogenous siRNAs

The term small interfering RNAs (siRNAs) is often used for regulatory RNAs approximately 20 nt long and thus subsumes members of several classes introduced in this section. However, the original meaning of the term stems from Hamilton and Baulcombe (1999), who discovered 25 nt long RNA intermediates in both transgene-induced and virus-induced posttranscriptional gene silencing (PTGS) in plants. Since then, siRNAs have been detected in numerous eukaryotes across kingdoms (Czech et al., 2008). They all originate from endogenous or exogenous (viral) transcripts that are turned into double-stranded RNA by RNA-dependent RNA polymerase (RdRP); they show high complementarity to their target mRNAs and induce degradation of these targets. Endogenous siRNAs have also been found in most major eukaryote lineages, including animals (C. elegans (Ambros et al., 2003a), Drosophila melanogaster (Aravin et al., 2003), and mouse (Tam et al., 2008; Watanabe et al., 2008)), fungi (Schizosaccharomyces pombe (Reinhart et al., 2002)), amoebozoa (Dictyostelium (Kuhlmann et al., 2006)), plants (Arabidopsis thaliana (Xie and Qi, 2008)), and kinetoplastids (Trypanosoma brucei (Djikeng et al., 2001)). On the other hand, several lineages have lost the entire RNA interference (RNAi) machinery, including budding yeasts and Leishmania species (see Ullu et al. (2004) for a review of RNA interference in protozoan parasites).

Recently, endogenous siRNAs were detected in higher eukaryotes that lack RdRP. A novel class of short interfering RNAs in D. melanogaster was found to be excised from hairpins longer than animal miRNAs and in several instances longer than plant miRNAs (Okamura et al., 2008b). These hairpins, termed hpRNAs, are located in regions of limited coding potential and were found by searching for inverted repeats resulting from inverted terminal repeats of transposons or tandem inversions of transposable elements and mRNAs. The siRNAs, approximately 21 nt in size, are processed from the hairpin by known components of both the siRNA and the miRNA pathways (Okamura et al., 2008a). However, because of their 5′-methylation and their dependence on Dicer-2 and AGO2, it was concluded that the short RNAs derived from hpRNAs are siRNAs and not miRNAs. In mice, pseudogenes and transposons were also shown to serve as sources of potential siRNAs (Tam et al., 2008; Watanabe et al., 2008). In plants, so-called tasiRNAs (trans-acting endogenous siRNAs) are transcribed in trans to their target mRNAs and lead to mRNA degradation (Kim, 2005a).

MicroRNAs and siRNAs share several components and processing steps in their maturation pathways. However, there are a number of differences. For instance, siRNAs show a higher degree of sequence complementarity to their target sites than miRNAs do.

15.2.2 piRNA

Another class of small RNAs discovered in the attempt to find miRNAs is the germline-specific piRNAs (Piwi-interacting RNAs) of length 25–32 nt (Lau et al., 2006; Kim, 2006; Aravin et al., 2007a). In D. melanogaster, piRNAs are involved in repression of transposons in the germline. In contrast to rasiRNAs (see below), piRNAs are restricted to specific genomic loci and are organized in a limited number of large clusters of noncoding transcripts. piRNAs were found to be expressed in two meiotic stages in spermatocytes. Pachytene piRNAs are depleted of repeats. Pre-pachytene piRNAs, in contrast, depend on Mili proteins, show similarity to repeat sequences, and mediate DNA methylation of transposable elements such as L1 elements (Aravin et al., 2007b).

In nematodes, the 21U-RNAs (Ruby et al., 2006) are characterized by an initial uridine 5′-monophosphate and a chemical modification at either the 2′- or the 3′-oxygen of this nucleotide, as reported for small RNAs in plants and rasiRNAs in fruit flies (Li et al., 2005; Vagin et al., 2006). A recent study identified them as the piRNAs of C. elegans by virtue of their association with Piwi-Argonaute (Batista et al., 2008). They are far more diverse than miRNAs, and unlike siRNAs and piRNAs in other organisms, which are expressed in tight clusters, the 21U-RNAs appear to be autonomously expressed.

15.2.3 rasiRNA

Repeat-associated RNAs in animals (Aravin et al., 2003) and plants (Hamilton et al., 2002) both lead to silencing of repeat regions by DNA methylation. However, they show certain differences in biogenesis. In plants, transcripts from transposons are turned into double-stranded RNA (dsRNA) by means of RdRP. In D. melanogaster, rasiRNAs were discovered in a genome-wide screen for small RNAs, found to be expressed in testes and early embryos, and later shown to interact with Piwi proteins (Aravin et al., 2007a). Thus, rasiRNAs might be another family or subgroup of piRNAs.

15.2.4 “Exotic” Small RNA Species

Mouse mrhl RNA is transcribed during meiosis as a polyadenylated primary transcript several kilobases long and is then processed by Drosha into fragments approximately 80 nt long. Dicer products were found in vitro but not in vivo. The function of mrhl remains elusive. A homologous sequence has so far been found only in rat (Ganesan and Rao, 2008).

The ciliate protozoan Tetrahymena undergoes a complicated process of macro- and micronucleus formation during sexual reproduction. In the course of this process, DNA is removed from the macronuclei. Small scan RNAs (scnRNAs) originate from regions in the micronucleus that possibly contain repeats or transposons, and they guide histone methylation, which in turn recruits proteins facilitating DNA excision. This process might help to prevent propagation of transposons to the next Tetrahymena generation (Kim, 2005a).

Another class of small RNAs whose function is not yet well understood comprises the approximately 20–200 nt long PASRs (promoter-associated small RNAs) and TASRs (termini-associated small RNAs). They associate with approximately 50% of mammalian protein-coding genes in promoter and termini regions, respectively, and the PASRs also correlate with the expression of proteins (Kapranov et al., 2007). It remains unclear at present whether PASRs and TASRs are related to siRNAs in function and biogenesis, or whether they belong to an entirely distinct part of the cell's regulatory system.

15.3 SMALL RNA BIOGENESIS

15.3.1 Components of the Small RNA Processing Machinery

Type III RNases

RNase III type enzymes bind and cleave dsRNAs and are divided into three classes. Besides the cleavage domain, they all contain a dsRNA binding domain. In small RNA pathways, we find members of classes II and III. Drosha, a class II enzyme, resides in the nucleus and requires Pasha as cofactor (the human homologue of D. melanogaster Pasha is DGCR8, DiGeorge syndrome critical region 8). It cleaves pre-miRNAs from longer precursors; the pre-miRNAs are then further processed by Dicer. So far, Drosha homologues have been found exclusively in animals. Drosophilids and possibly all arthropods harbor two homologues, whereas all other metazoans have only a single copy (Murphy et al., 2008). Dicer, a class III enzyme, has an N-terminal DExD/H-box helicase domain and a PAZ (Piwi/Argonaute/Zwille) domain. It “dices” long dsRNA into approximately 20 nt long duplexes with a typical 2 nt overhang at the 3′-end. In contrast to Drosha, it is found in all organisms that use the small RNA pathways described here. The number of homologues within a genome varies greatly by organism: Drosophila has two (Dcr1 and Dcr2), all other metazoans and protists have one, and plants even have four (DCL1–4), which are involved in different small RNA pathways (Kim, 2005a; Murphy et al., 2008; MacRae and Doudna, 2007).

Piwi Proteins and Argonautes

The family of Argonaute proteins (AGOs) comprises a multitude of members with various functions (Hutvagner and Simard, 2008). AGOs consist of an N-terminal PAZ domain, also found in Dicer, and a C-terminal PIWI domain. The exact functions of the domains remain unresolved. However, the PIWI domain seems to bind the 5′-seed region of miRNAs, whereas the PAZ domain interacts with the 3′-OH. Vertebrates have four AGOs (Ago1–4, also known as eIF2C1–4). Ago2 is required for RNAi, whereas Ago1 acts in translational inhibition. Both interact with Dicer (Murphy et al., 2008). For a detailed review of the numerous members of the Argonaute family, we refer to Parker and Barford (2006). Detailed studies in Drosophila are described in Tomari et al. (2007) and Förstemann et al. (2007).

Piwi proteins are predominantly expressed in the germline. They contain the characteristic Piwi domain and associate with piRNAs. In vertebrates, three Piwis have been found so far: the mouse homologues are termed Mili, Miwi, and Miwi2, and the zebrafish homologues Zili, Ziwi, and Ziwi2. Even though Mili is expressed in ovaries, Piwis seem to promote male germline-specific functions (Aravin et al., 2007a).

Polymerases

When it comes to transcription, small RNAs behave just like ordinary protein-coding genes. Expression of miRNAs, for instance, has been studied in great detail. The primary transcripts originate either from introns (although often driven by an intronic promoter) or from mlncRNAs (mRNA-like ncRNAs). Most of them are transcribed by RNA polymerase II, show alternative start and splice sites, and are 5′-capped and 3′-polyadenylated.

Organisms with strong siRNA activity require another enzyme to amplify their response to parasitic RNAs. In plants, protozoans, and lower metazoans, RdRP performs siRNA-primed synthesis of dsRNA, which is then cleaved by RISC (RNA-induced silencing complex) and Dicer homologues. In the case of plant rasiRNAs, the resulting small RNAs mediate silencing of the genomic loci of the parasitic sequences (transposable elements). Even though endogenous siRNAs were found, Drosophilids and vertebrates lack endogenous RdRP homologues. Exogenous (transposon- or virus-encoded) RdRPs are not required for siRNA function (Stein et al., 2003). It is tempting to speculate that this lack of RdRP in vertebrates might have led to the emergence of new defense mechanisms for responding to viral and other infections, for example, the acquired immune system.

15.3.2 MicroRNA Biogenesis

Unless stated otherwise, we outline here miRNA biogenesis in mammals. (For miRNAs in introns, see below.) For miRNAs in intergenic regions, a primary pol II transcript is 5′-capped and polyadenylated (Cai et al., 2004). Some of these transcripts can also function as protein-coding mRNAs (Cai et al., 2004). The primary transcript (pri-miRNA) is then processed in the nucleus by the microprocessor complex, consisting of the endonuclease Drosha and its cofactor Pasha (Gregory et al., 2004), resulting in a characteristic hairpin of length 60–120 nt. In plants, which do not contain Drosha, its function is carried out by DCL1 and HYL1 (reviewed in Jones-Rhoades et al. (2006)). The resulting stem-loop precursor, also referred to as the pre-miRNA, is transported into the cytoplasm by Exportin-5 (Lund et al., 2004). In the cytoplasm, the pre-miRNA is processed further and is both sliced and diced. Dicer associates with TRBP (trans-activator RNA binding protein) and processes the hairpin into a double-stranded RNA (dsRNA) of length approximately 22 nt with a 2 nt 3′-overhang. In general, only one strand of the duplex, termed the mature miRNA, is incorporated into RISC to guide it to the target mRNA. The other strand (miRNA*) is degraded. However, recent results in D. melanogaster revealed that a number of miRNA* sequences might be functional, since they are expressed above background signal and show higher conservation than expected of a nonfunctional sequence in a pri-miRNA helix (Okamura et al., 2008c). The molecular machinery determines which of the two strands is loaded into RISC by selecting the strand whose 5′-end is less stably paired than its 3′-end in the miRNA:miRNA* duplex (Schwarz et al., 2003; Krol et al., 2004; Khvorova et al., 2003; Tomari et al., 2004). In addition, short conserved sequence motifs within the mature miRNA might serve as signals in both asymmetric processing and strand selection (Gorodkin et al., 2006).
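The asymmetry rule for strand selection can be illustrated with a small sketch. It is not the procedure of the cited studies, which rely on measured thermodynamic stabilities: here a crude hydrogen-bond count (G-C = 3, A-U = 2, G-U = 1, mismatch = 0) over the terminal four base pairs serves as a stand-in, and the function names, the scoring table, and the window of four pairs are assumptions made for illustration only.

```python
# Toy illustration of the asymmetry rule: the strand whose 5'-end is less
# stably paired in the duplex is predicted to be loaded into RISC.
# ASSUMPTION: stability is approximated by counting hydrogen bonds over the
# terminal base pairs; real analyses use nearest-neighbor free energies.

PAIR_SCORE = {
    ("G", "C"): 3, ("C", "G"): 3,
    ("A", "U"): 2, ("U", "A"): 2,
    ("G", "U"): 1, ("U", "G"): 1,
}


def end_stability(strand, partner, n_pairs=4):
    """Score the pairing of the first n_pairs bases at the 5'-end of strand.

    Both arguments are the paired portions of the duplex, written 5'->3'
    and of equal length (overhangs removed), so that base i of `strand`
    faces base len - 1 - i of `partner`.
    """
    if len(strand) != len(partner):
        raise ValueError("paired portions must have equal length")
    score = 0
    for i in range(n_pairs):
        score += PAIR_SCORE.get((strand[i], partner[-1 - i]), 0)
    return score


def select_guide(strand_a, strand_b, n_pairs=4):
    """Return 'a', 'b', or 'tie' for the strand with the weaker 5'-end."""
    score_a = end_stability(strand_a, strand_b, n_pairs)
    score_b = end_stability(strand_b, strand_a, n_pairs)
    if score_a < score_b:
        return "a"
    if score_b < score_a:
        return "b"
    return "tie"
```

For a hypothetical duplex with an A-U rich end at the 5′-side of strand a and a G-C rich end at the 5′-side of strand b, `select_guide` returns "a", in line with the rule described above.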

15.3.3 Biogenesis of Other Small RNAs

Only miRNAs are generated without the help of RdRP in both plants and animals (see Figure 15.1). In plants, the primary transcripts of other small RNAs are converted to dsRNA, which in turn is cleaved by Dicer homologues. The resulting small RNAs are often 3′-methylated by HEN1. In contrast, higher metazoans use small RNAs (most of them of unknown origin, as in the case of piRNAs) to slice primary transcripts. In both cases, each RNA family requires a distinctive set of Ago, Piwi, and Dicer homologues. Depending on the subcellular localization of pathway components and targets, small RNAs shuttle between nucleus and cytoplasm. Exportin-5 is the only export pathway described in detail so far, but there are speculations about a piRNA-specific transport mechanism. For a more detailed description, see Figure 15.1.

Figure 15.1 Biogenesis of major small RNA families. Green: miRNAs are transcribed as long primary transcripts, which are processed by the nuclear RNase III Drosha and its cofactor Pasha (DCL1/HYL1 in plants). In vertebrates, these stem-loop structures are exported to the cytoplasm by means of the exportin-5 pathway, where the mature miRNA is cleaved by Dicer. In plants, the second cleavage step also takes place in the nucleus, and short methylated dsRNAs are exported by HST. Black: plant tasiRNAs are processed in the cytoplasm. Tas precursors use the same export mechanism as protein-coding mRNAs. miRNA-primed synthesis of dsRNA is followed by DCL4-mediated dicing and HEN1 methylation. Blue: natsiRNAs in plants might use a mechanism similar to that of tasiRNAs. Cis antisense transcripts bind the sense RNA and serve as primers for RdRP. Red: rasiRNAs in plants never leave the nucleus. Primary transcripts are converted into dsRNA by RDR2 (an RdRP) and diced by DCL3. The resulting small RNAs guide DNA methylation. Magenta: a not yet complete model outlining piRNA processing (ping-pong mechanism). Transcription of piRNA clusters results in mature piRNAs antisense to their target transposon. Upon binding, an “antisense piRNA” is processed and interacts further with the transposon. Only weak sequence constraints are required. The process does not require any Dicer homologue. Cleavage is mediated by Ago3; Aubergine acts as a cofactor (Klattenhoff and Theurkauf, 2008). Figure based on drawings from Kim (2005a) and Vazquez (2006). (See insert for color representation of this figure.)

15.3.4 Three Main Mechanisms, Same Global Effect on Gene Expression

Originally, RNA interference described a variety of gene silencing processes that require small RNAs to mediate site specificity. RNAi was discovered in C. elegans (Fire et al., 1998) and can be induced in a number of eukaryotes, such as D. melanogaster (Kennerdell and Carthew, 1998), vertebrates (Elbashir et al., 2001), and many protozoans (Ullu et al., 2004). In plants, cosuppression or PTGS was first described in petunia (Napoli et al., 1990; van der Krol et al., 1990). RNAi also refers to an efficient technology to knock down expression of specific genes (Fire et al., 1998), for which Craig C. Mello and Andrew Fire were awarded the Nobel Prize in Physiology or Medicine in 2006 (Fire, 2007) (reviewed in Kim et al., 2005b).

The pathways operated by the small RNA machinery appear to overlap to a certain extent. While they use distinct core proteins, they share several components. A screen for proteins involved in miRNA, siRNA, and endo-siRNA function in Drosophila, for instance, revealed 117 candidate genes, 54 of which show effects in all three pathways (Zhou et al., 2008). Ago2 is clearly the major player when it comes to RNAi, but it is also found in miRNPs, which normally use Ago1. These findings might explain why miRNAs and siRNAs are capable of switching between their modes of function (Doench et al., 2003).

Translational Inhibition
Classes: miRNA

The small RNA binds to an mRNA and causes translational inhibition. The degree of base pairing between small RNA and target sequence, as well as the protein components of the miRNPs (Ago1), determine the mode of function. The so-called seed region (7 nt at the 5′-end of the miRNA) mediates sequence specificity. RNA degradation requires almost perfect complementarity, whereas translational inhibition allows a certain number of unpaired bases. The actual mechanism behind translational repression has not been resolved yet. MicroRNAs have been isolated from ribonucleoparticles (RNPs) containing ribosomes, RISC, mRNA, and miRNAs (Dostie et al., 2003), which suggests that miRNA binding blocks translational elongation by stalling ribosomes, leading to release of the nascent polypeptide. In contrast, more recent studies showed that at least some miRNAs are able to inhibit the formation of the translational initiation complex (Mathonnet et al., 2007). Efficient miRNA repression in metazoans seems to be governed by multiple target sites residing in the 3′-UTR of the messenger; that is, the same or different miRNAs target the same mRNA simultaneously. MicroRNA functions were reviewed in detail in Bushati and Cohen (2007).

RNAi: mRNA Degradation
Classes: miRNA, siRNA, tasiRNA, natsiRNA, and piRNA

In contrast to translational repression, RNAi causes degradation of the target by RISC. Two factors determine this mode: the composition of the RISC complex and the small RNA:mRNA binding pattern. RNAi requires the presence of Ago2 and nearly perfect complementarity between the small RNA and its target. Whereas metazoan miRNAs target the 3′-end of the mRNA and block translation by a mechanism that is not yet fully understood, miRNAs in plants target the coding region and cause degradation by an siRNA-like pathway (reviewed in Filipowicz et al. (2008)).

Transcriptional Gene Silencing and Imprinting
Classes: miRNA, rasiRNA, and piRNA

Small RNAs were shown to promote de novo methylation as well as maintenance of DNA methylation (Aufsatz et al., 2002) in plants and animals (Kawasaki and Taira, 2004). Several studies also gave rise to the idea that histone methylation of specific loci might be guided by small RNAs. MicroRNAs target promoter regions of genes, whereas rasiRNAs shut down repeat-rich regions in the genome.
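The two determinants of the silencing mode named above, the presence of Ago2 and the degree of complementarity, can be written down as a toy decision rule. The sketch below is a strong simplification and not a published classifier: it assumes that the target site is already known and has the same length as the small RNA, counts only Watson-Crick pairs as matches, and interprets "nearly perfect" as at most `max_mismatches` unpaired positions, with an arbitrary default of 1. All names are hypothetical.

```python
# Toy decision rule for the silencing mode of a small RNA at a known site.
# ASSUMPTIONS (for illustration only): G-U wobble counts as a mismatch,
# small RNA and site have equal length, and every outcome that is not
# degradation is labeled translational inhibition.

WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def count_mismatches(small_rna, target_site):
    """Count unpaired positions in the antiparallel alignment (both 5'->3')."""
    if len(small_rna) != len(target_site):
        raise ValueError("small RNA and target site must have equal length")
    return sum(
        (base, partner) not in WATSON_CRICK
        for base, partner in zip(small_rna, reversed(target_site))
    )


def predicted_mode(small_rna, target_site, ago2_present, max_mismatches=1):
    """Return the silencing mode suggested by the two factors in the text."""
    mismatches = count_mismatches(small_rna, target_site)
    if ago2_present and mismatches <= max_mismatches:
        return "mRNA degradation"
    return "translational inhibition"
```

The rule deliberately ignores site context and RISC composition beyond Ago2, so it should be read as a summary of the text rather than as a predictor.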

15.4 COMPUTATIONAL microRNA PREDICTION

There are two basic strategies to detect novel miRNAs. The simpler one uses sequence homology to experimentally known miRNAs as well as the characteristic hairpin structure of the pre-miRNA (Weber, 2005; Legendre et al., 2005; Hertel et al., 2006; Dezulian et al., 2006). The second, de novo computational prediction, relies primarily on the thermodynamically stable pre-miRNA hairpin and on the characteristic pattern of sequence conservation. Conservation is high on both sides of the stem region and decreases toward the unpaired region of the apical loop. If only one mature miRNA is produced from the precursor, the region encoding the mature sequence is best conserved. In some cases, both sides of the hairpin produce mature sequences, usually labeled miR and miR*. In this case, both mature loci are conserved nearly equally, as in the case of mir-125b-1 (Figure 15.2).

Figure 15.2 MicroRNA sequence and structure features illustrated by mir-125b-1 and mir-315. The 85 nt precursor folds into the typical hairpin structure (secondary structure predicted with RNAfold), which is cleaved by Dicer, resulting in the mature miRNA (20 nt) indicated by a line. In the case of mir-125b-1, the mature miR and miR* are both well conserved. For mir-315 only one miR is expressed, which is much better conserved than the opposite side of the stem. The ClustalW multiple sequence alignment of the precursor sequences emphasizes the conservation pattern. The colors of the base pairs encode the number of consistent and compensatory mutations supporting each pair: red marks pairs with no sequence variation; ochre, green, and turquoise mark pairs with two, three, and four different types of pairs, respectively. (See insert for color representation of this figure.)

Several software tools have been designed to utilize this information for miRNA gene finding: miRscan (Lim et al., 2003), miRseeker (Lai et al., 2003), miralign (Wang et al., 2005), and RNAmicro (Hertel and Stadler, 2006) have all led to the discovery of a large number of animal microRNAs. For closely related species, phylogenetic shadowing can be used to identify regions that are under stabilizing selection and exhibit the characteristic variations in sequence conservation between stems, loop, and mature miRNA (Berezikov et al., 2005). Genomic context can also give additional information: Mirscan-II, for example, takes conservation of surrounding genes into account (Ohler et al., 2004), while the propensity of microRNAs to appear in genomic clusters is used as an additional selection criterion in Altuvia et al. (2005).

On the other hand, the miRank tool (Xu et al., 2008) is independent of genomic annotation and cross-species conservation. This is particularly important given the limited quality of many sequenced genomes and the lack of well-annotated related species. MicroRNA detection without the aid of comparative sequence analysis is a very hard task, but it is unavoidable when species-specific miRNAs are of prime interest. The miR-abela approach first searches for hairpins that are robust against changes in the folding windows (and also thermodynamically stabilized) and then uses a support vector machine (SVM) to identify microRNAs among these candidates (Sewer et al., 2005). A related technique is described by Xue et al. (2005).

In summary, computational prediction of novel miRNAs can be roughly categorized into the following types: straightforward sequence and/or structure homology search, characterization of candidates based on scored sequence and/or structural properties, machine learning techniques, and prediction of novel miRNAs in combination with putative targets (Hou et al., 2008).

Plant miRNA precursors show much more variation in length and secondary structure, and therefore filters must be less restrictive in this context. On the other hand, plant miRNA targets display complementary sites with near-perfect base pairing. Tools such as findMIRNA (Adai et al., 2005) thus predict miRNAs and their targets simultaneously and ignore candidate miRNA genes without putative targets.
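The role of the precursor hairpin as a first filter can be sketched in a few lines. The example below is deliberately simplistic and does not reproduce any of the tools named above: instead of thermodynamic folding it pairs the two arms of a window position by position, without bulges or interior loops, around a central loop of fixed length. The window size, loop length, threshold, and function names are assumptions chosen for illustration.

```python
# Simplified hairpin filter: NOT a thermodynamic fold. Each window is
# treated as a gapless hairpin whose arms are paired from the outside in.
# ASSUMPTIONS: fixed loop length, no bulges, G-U wobble pairs allowed.

CANONICAL = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),
}


def stem_pairs(seq, loop_len):
    """Return (number_of_pairs, arm_length) for a gapless hairpin.

    The central loop_len nucleotides (one more if the remaining length is
    odd) are left unpaired; DNA input is converted to RNA.
    """
    seq = seq.upper().replace("T", "U")
    arm = (len(seq) - loop_len) // 2
    if arm <= 0:
        raise ValueError("sequence too short for the requested loop length")
    pairs = sum((seq[i], seq[-1 - i]) in CANONICAL for i in range(arm))
    return pairs, arm


def hairpin_candidates(genome, window=80, loop_len=8, min_fraction=0.8, step=1):
    """Yield (start, paired_fraction) for windows that pass the filter."""
    for start in range(0, len(genome) - window + 1, step):
        pairs, arm = stem_pairs(genome[start:start + window], loop_len)
        fraction = pairs / arm
        if fraction >= min_fraction:
            yield start, fraction
```

In a realistic pipeline, candidates that pass such a coarse filter would still have to be folded with a thermodynamic model and checked against the conservation pattern described above.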


Chapter 15

Evolutionary Genomics of microRNAs and Their Relatives

15.5 microRNA TARGETS

Since microRNAs act as guide molecules that program the RISC complex to recognize a target mRNA, it is essential to understand the mechanism by which miRNAs recognize their targets and to predict target mRNAs for a given miRNA sequence. To date, the number of verified miRNA–mRNA interactions is still small. The Tarbase database (Sethupathy et al., 2006) currently lists only 570 mRNAs targeted by 123 animal miRNAs. These known interactions have been used heavily to derive rules of miRNA–mRNA interactions. However, only a few guiding principles have emerged: (i) Perfect complementarity between miRNA and target is not required; in fact, most miRNA:mRNA complexes form imperfect duplexes containing mismatches as well as bulges. (ii) miRNA:mRNA duplexes are asymmetric; the 5′-end of the miRNA (3′-end of the target) binds more strongly than the 3′-side of the miRNA. (iii) Base pairing at positions 9–11 triggers mRNA degradation, whereas mismatches at these positions lead to translational inhibition, leaving the mRNA largely intact. The region comprising positions 2–8 of the miRNA often exhibits perfect complementarity and is therefore referred to as the seed region (Doench and Sharp, 2004; Ambros, 2004). There are at best weak sequence signals associated with either miRNA or target sites. Target sites with evolutionarily conserved seed regions, however, show stronger regulatory impact than nonconserved ones (Baek et al., 2008; Selbach et al., 2008). Proteins from nonconserved targets nevertheless outnumber those from conserved ones 6:1. The context of the target site also influences the protein response: an AU-rich local neighborhood significantly increases the effect on protein expression (Baek et al., 2008). Cooperative effects caused by additional target sites within 40 nt can enhance PTGS. While the effect of multiple seed regions in the 3′-UTR is cumulative for translational repression, this is not the case for mRNA cleavage. For mir-223 (Baek et al., 2008), the majority of experimentally verified targets with 7–8 mer seed regions lead to mRNA destabilization, while only a small fraction of mRNAs remained stable and were downregulated via translational repression.
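The seed rule described above can be made concrete with a small sketch. It assumes the simplest possible reading of the rule: a site counts only if it is perfectly Watson-Crick complementary to miRNA positions 2–8. The function names, the let-7a-like miRNA string, and the 3′-UTR fragment are illustrative choices and are not taken from any of the cited tools.

```python
def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA string (A, C, G, U)."""
    pairs = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(rna))


def seed_sites(mirna: str, utr: str) -> list[int]:
    """Return 0-based start positions in `utr` that are perfectly
    complementary to miRNA positions 2-8 (Watson-Crick pairs only)."""
    mirna = mirna.upper().replace("T", "U")
    utr = utr.upper().replace("T", "U")
    seed = mirna[1:8]                   # positions 2..8 in 1-based numbering
    site = reverse_complement(seed)     # the site as read 5'->3' on the mRNA
    hits = []
    start = utr.find(site)
    while start != -1:
        hits.append(start)
        start = utr.find(site, start + 1)  # step by one so overlapping sites are found
    return hits


# Illustrative input only: a let-7a-like miRNA and a made-up 3'-UTR fragment.
MIRNA = "UGAGGUAGUAGGUUGUAUAGUU"
UTR = "AAACUACCUCAAACUACCUCGG"
print(seed_sites(MIRNA, UTR))  # prints [3, 13]
```

Such a strict filter ignores the imperfect duplexes mentioned under (i), so it should be read as a lower bound on candidate sites rather than as a target predictor.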

15.5.1 How Many Targets?

Since miRNAs are short and need not match perfectly, it should come as no surprise that a single miRNA can regulate several targets. How many targets a typical miRNA might have is still open to debate. This is also reflected in the widely fluctuating number of targets returned by the various target prediction approaches. For example, Robins et al. (2005) estimate fewer than 30 targets per miRNA, while Miranda et al. (2006), based on their rna22 method, suggest that a single miRNA may have several thousand targets. SILAC analysis (stable isotope labeling with amino acids in cell culture) (Ong et al., 2002) of mir-223 in neutrophils showed that 78 out of 3819 proteins investigated were direct targets. Since only a third of the proteome was quantified, mir-223 might have 200 targets in neutrophils and possibly even more targets specifically present in other cell types and processes (Baek et al., 2008). In part, these diverging numbers may be due to the fact that it is not clear what constitutes a functional target site. Some targets of a miRNA might lead to only slightly lower protein expression levels, or may become functional only at elevated miRNA concentrations. It is clear that a large fraction of human mRNAs is under miRNA control. However, the more generous estimates for the number of miRNAs and the number of targets per miRNA suggest a picture where every mRNA is subject to regulation by a large ensemble of miRNAs from the cell’s miRNA milieu. In such a scenario, any mutation in a 3′-UTR would be expected to influence expression patterns. The observation that housekeeping genes seem to avoid miRNA regulation through the use of very short 3′-UTRs (Stark et al., 2005) is consistent with this view. MicroRNAs preferentially target mRNAs whose protein products also have regulatory functions. Overrepresented groups include transcription factors, components of the miRNA machinery, and other proteins involved in translational regulation, as well as components of the ubiquitin machinery (John et al., 2004). This points to an intricately interwoven network of transcriptional and posttranscriptional regulation (Zhou et al., 2007b). The average number of targets per plant miRNA is low because plant miRNAs are highly similar to their target sites, and the targets are mostly closely related genes (Jones-Rhoades et al., 2006). One rare example of a miRNA with unrelated targets is Arabidopsis mir395 (Allen et al., 2005), which regulates an ATP sulfurylase and a sulfate transporter.

15.5.2 Target Prediction

Over the past years, a plethora of new methods have been proposed to predict microRNA targets (for a recent review, see Rajewsky (2006)). In most cases, the initial search for candidate sites is purely sequence based. An often used approach, exemplified by the miRanda (John et al., 2004) and PicTar (Krek et al., 2005) programs, is to equip a standard local alignment algorithm with a scoring system that favors base complementarity, using separate scores for G–C, A–U, and G–U pairs and for mismatches. A similar effect can be achieved by training hidden Markov models (Stark et al., 2003), or even by pattern search based on sequence patterns that are overrepresented in a database of known miRNAs (Miranda et al., 2006). The resulting scores provide a measure of the thermodynamic stability of the miRNA:mRNA duplex. The sequence-based methods can be replaced by a direct search for the most stably interacting sites under the standard energy model for RNA structures. The first such approach was implemented in RNAhybrid (Rehmsmeier et al., 2004) and is slower than sequence alignment only by a constant factor. Alternatively, some methods, such as TargetScan (Lewis et al., 2005), start immediately with a search for (near-)perfect seed matches that are then extended toward the 3′-side. In plants, approximate matching of the whole miRNA sequence is typically used, and empirical scoring rules later penalize mismatches in the seed region (Jones-Rhoades and Bartel, 2004; Schwab et al., 2005; Zhang, 2005). There are, however, recent findings that more extensively mismatched targets also exist in plants, which are missed with this approach (Brodersen et al., 2008). In any case, the initial phase tends to generate a large number of candidate sites that have to be further filtered and ranked in order to produce predictions with reasonable confidence.
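The alignment-based idea can be illustrated with a minimal sketch: a Smith-Waterman style local alignment in which the substitution score rewards base complementarity instead of identity. The numerical scores, the linear gap penalty, and the function name are assumptions chosen for illustration; they are not the published parameters of miRanda or PicTar.

```python
# Illustrative complementarity scores (assumed values, not those of any cited tool).
PAIR_SCORE = {
    ("G", "C"): 5, ("C", "G"): 5,
    ("A", "U"): 5, ("U", "A"): 5,
    ("G", "U"): 2, ("U", "G"): 2,   # wobble pairs score lower
}
MISMATCH = -3
GAP = -4  # linear gap penalty, a simplification


def best_duplex_score(mirna: str, mrna: str) -> tuple[int, int]:
    """Local alignment of a miRNA against an mRNA, scoring complementarity.

    Returns (best score, 0-based mRNA index where the best alignment ends).
    """
    # Reverse the miRNA so that it runs 3'->5' alongside the mRNA read 5'->3'.
    query = mirna.upper().replace("T", "U")[::-1]
    target = mrna.upper().replace("T", "U")
    previous = [0] * (len(target) + 1)
    best, best_end = 0, -1
    for i in range(1, len(query) + 1):
        current = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            pair = PAIR_SCORE.get((query[i - 1], target[j - 1]), MISMATCH)
            current[j] = max(
                0,                        # start a new local alignment
                previous[j - 1] + pair,   # pair the two bases
                previous[j] + GAP,        # unpaired miRNA base (bulge)
                current[j - 1] + GAP,     # unpaired mRNA base (bulge)
            )
            if current[j] > best:
                best, best_end = current[j], j - 1
        previous = current
    return best, best_end
```

Sliding this over a 3′-UTR and keeping all end positions above a threshold would give the kind of candidate list that the later filtering steps then have to prune.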
The most important features used to rank targets are (i) quality of the seed match, (ii) conservation of the target site in related species, (iii) existence of multiple target sites in a single 3′-UTR, (iv) sequence composition around the mRNA target site (Grimson et al., 2007), (v) hybridization energy of the miRNA:mRNA duplex, and (vi) structural accessibility of the target site. All of these criteria imply some kind of balance between sensitivity and specificity, that is, the ability to predict as many target sites as possible while avoiding false predictions. For example, restricting oneself to targets with perfect seed complementarity significantly reduces the false positive rate but will exclude many valid targets. As a compromise, some methods allow G–U base pairs or a single mismatch or bulge within the seed region.
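The trade-off between strict and relaxed seed matching can be sketched as follows. The function below scans a 3′-UTR for windows that pair with miRNA positions 2–8 while tolerating a configurable number of G–U wobble pairs and mismatches; bulges are not modeled. The default limits, the function name, and the example sequences are assumptions made for illustration.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def relaxed_seed_sites(mirna, utr, max_wobble=1, max_mismatch=0):
    """Return (start, wobble_count, mismatch_count) for every 7-nt window of
    `utr` that pairs with miRNA positions 2-8 within the given tolerances."""
    mirna = mirna.upper().replace("T", "U")
    utr = utr.upper().replace("T", "U")
    # Seed read 3'->5' so that it lines up with the mRNA read 5'->3'.
    seed = mirna[1:8][::-1]
    hits = []
    for start in range(len(utr) - len(seed) + 1):
        window = utr[start:start + len(seed)]
        wobble = mismatch = 0
        for m_base, t_base in zip(seed, window):
            if (m_base, t_base) in WATSON_CRICK:
                continue
            if (m_base, t_base) in WOBBLE:
                wobble += 1
            else:
                mismatch += 1
        if wobble <= max_wobble and mismatch <= max_mismatch:
            hits.append((start, wobble, mismatch))
    return hits


# Made-up 3'-UTR with one perfect site and one site containing a G-U pair.
UTR = "AAACUACCUCAAAUUACCUCAA"
MIRNA = "UGAGGUAGUAGGUUGUAUAGUU"
print(relaxed_seed_sites(MIRNA, UTR))                # [(3, 0, 0), (13, 1, 0)]
print(relaxed_seed_sites(MIRNA, UTR, max_wobble=0))  # [(3, 0, 0)]
```

Raising either tolerance increases sensitivity at the cost of specificity, which is the balance the ranking criteria above have to strike.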


Yet, some validated targets have poorly matched seed regions that will defy almost any seed-based approach (Vella et al., 2004; Didiano and Hobert, 2006; Miranda et al., 2006). Similarly, the introduction of evolutionary conservation led to a marked improvement in prediction accuracy (Lewis et al., 2005; Krek et al., 2005). Many methods rely on conservation either by demanding that target sites for a particular miRNA occur in homologous genes from several species or, more strongly, that these target sites occur at homologous positions of the aligned mRNAs. The work of Xie et al. (2005a) follows an alternative route by first determining conserved regions in 3′-UTRs of mammalian mRNAs to identify more than 100 candidate motifs likely involved in posttranscriptional regulation. More than half of them were then identified as putative targets for known microRNAs. Presumably, however, many microRNAs are evolutionarily young or even species specific (Bentwich et al., 2005), and in this case evolutionary conservation is of little help. Since secondary structure of the mRNA might interfere with miRNA binding, a few recent methods have tried to improve target predictions by including the effect of target site accessibility (Long et al., 2007; Kertesz et al., 2007). Accessibility is usually expressed as the probability that the target site is free of secondary structure (and thus available for binding) or, equivalently, as the free energy needed to break any existing structure. The total binding energy of the miRNA can then be expressed as the sum of the free energy gained from forming the heteroduplex and the breaking energy expended to make the site accessible (Mückstein et al., 2006). Including the breaking energy gives a significant improvement over using the interaction energy alone, as done, for example, in RNAhybrid, and may yield performance comparable with conservation-based methods.
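The energy bookkeeping described above can be written down in a few lines. The sketch assumes the standard thermodynamic relation between a probability and a free energy, ΔG_open = −RT ln P_unpaired, and adds it to a duplex energy supplied by the caller. In practice both quantities would come from an RNA folding package; the numbers used here are invented for illustration, and the function names are hypothetical.

```python
import math

GAS_CONSTANT = 0.0019872   # kcal/(mol*K)
TEMPERATURE = 310.15       # K, i.e. 37 degrees Celsius


def opening_energy(p_unpaired: float) -> float:
    """Free energy (kcal/mol) needed to make a site accessible, given the
    probability that the site is free of secondary structure."""
    if not 0.0 < p_unpaired <= 1.0:
        raise ValueError("p_unpaired must lie in (0, 1]")
    return -GAS_CONSTANT * TEMPERATURE * math.log(p_unpaired)


def total_binding_energy(duplex_energy: float, p_unpaired: float) -> float:
    """Sum of the (negative) duplex energy and the (positive) opening energy."""
    return duplex_energy + opening_energy(p_unpaired)


# Invented example: site A forms the more stable duplex but is buried in
# structure, site B forms a weaker duplex but is mostly accessible.
site_a = total_binding_energy(duplex_energy=-22.0, p_unpaired=0.001)
site_b = total_binding_energy(duplex_energy=-19.0, p_unpaired=0.5)
print(f"site A: {site_a:.2f} kcal/mol, site B: {site_b:.2f} kcal/mol")
```

With these invented values the ranking of the two sites is reversed once the opening energy is included, which is the effect that accessibility-aware methods aim to capture.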
Current target prediction methods are still burdened with a significant false positive rate. Presumably, this is not because some vital ingredient is missing in current methods, but simply because the set of known validated targets (as well as known nonfunctional sites) is currently too small to allow optimizing the relative weight of the features discussed above. This situation may well change soon as significant experimental effort is expended on the large-scale identification of miRNA targets, for example, by immunoprecipitation of mRNAs with components of the RISC complex (Easow et al., 2007). Comparing target predictions with experimental proteome analysis revealed that predictions from TargetScan and PicTar, both of which search for seed matches, gave the most accurate results (Baek et al., 2008; Selbach et al., 2008).

15.5.3 Targets and Polymorphisms

Single nucleotide polymorphisms (SNPs) can destroy miRNA target sites or inactivate the miRNA itself (Georges et al., 2007). In fact, even a single substitution can have a dramatic effect (Brennecke et al., 2005). Natural variation by SNPs not only disrupts miRNA–mRNA interactions but can also give rise to novel miRNA targets. A prime example is the Belgian Texel sheep, famous for their hyper-developed muscles. A QTL study of the phenotype (Clop et al., 2006; Georges et al., 2006) uncovered a SNP in the 3′-UTR of the myostatin gene (gdf8), which is involved in limiting the growth of muscle tissue. The GA SNP creates target sites for mir-1 and mir-206, which results in downregulation of myostatin and thereby in increased muscle growth. In a similar vein, a GA SNP (var321) in the 3′-UTR of SLITRK1, which is associated with Tourette’s Syndrome, tightens the binding with miR-189. Recent work (Wang et al., 2008) reports a link between miR-433 and SNPs in the FGF20 (fibroblast growth factor 20) gene, which is expressed in the brain and has been shown to be associated with Parkinson’s disease. A more systematic study (Saunders et al., 2007) identified approximately 400 SNPs in target sites and reported SNPs that give rise to approximately 250 putative novel target sites. SNP data were used to estimate that approximately 30–50% of the nonconserved miRNA targets in 3′-UTRs are functional when the transcript and miRNA are coexpressed (Chen and Rajewsky, 2007). Databases collecting disease-relevant miRNA-related SNPs are emerging: examples are www.patrocles.org by Georges and coworkers and PolymiRTS (compbio.utmem.edu/miRSNP) (Bao et al., 2007).
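How a single substitution can create or destroy a target site can be sketched by comparing the reference and variant alleles against a seed match. The example below assumes the strict seed criterion (perfect complementarity to miRNA positions 2–8) and uses made-up sequences; it does not reproduce the actual gdf8 or SLITRK1 variants, and the function names are hypothetical.

```python
def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA string (A, C, G, U)."""
    pairs = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(rna))


def snp_effect(mirna: str, utr: str, pos: int, alt: str) -> str:
    """Classify a substitution at 0-based position `pos` of `utr`.

    Returns "creates site", "destroys site", or "no change" with respect
    to perfect seed matches (miRNA positions 2-8) overlapping the SNP.
    """
    mirna = mirna.upper().replace("T", "U")
    reference = utr.upper().replace("T", "U")
    alt = alt.upper().replace("T", "U")
    site = reverse_complement(mirna[1:8])
    variant = reference[:pos] + alt + reference[pos + 1:]
    # Only windows that overlap the SNP position can be affected.
    low = max(0, pos - len(site) + 1)
    high = pos + len(site)
    before = site in reference[low:high]
    after = site in variant[low:high]
    if after and not before:
        return "creates site"
    if before and not after:
        return "destroys site"
    return "no change"


# Made-up example: a G-to-A substitution completes a let-7a-like seed match.
MIRNA = "UGAGGUAGUAGGUUGUAUAGUU"
print(snp_effect(MIRNA, "GGACUGCCUCAGG", pos=5, alt="A"))  # creates site
print(snp_effect(MIRNA, "GGACUACCUCAGG", pos=5, alt="G"))  # destroys site
```

A systematic survey would apply such a test to every annotated SNP in every 3′-UTR and every known miRNA, and would normally be followed by the additional filters discussed in Section 15.5.2.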

15.6 EVOLUTION OF microRNAs

15.6.1 Animal microRNAs

The number of annotated microRNAs collected in miRBase (11.0; April 2008; http://microrna.sanger.ac.uk/sequences/) (Griffiths-Jones et al., 2008) varies greatly between different animals. For instance, it currently lists 695 human and 184 frog miRNAs, but only 34 in the tunicate Ciona intestinalis and 63 in the planarian Schmidtea mediterranea. A few microRNA families, such as let-7 (Pasquinelli et al., 2000), mir-1, and mir-124 (Hertel et al., 2006), are well conserved among most animal clades. On the other hand, many other families are evolutionarily very young, some even specific to primates and possibly to human (Berezikov et al., 2005; Bentwich et al., 2005). Members of a given miRNA family can be fairly reliably recovered from genomic DNA sequences due to the extreme sequence conservation of the mature miR and the characteristic stable hairpin structure of the precursor. Such a systematic search for miRNA homologues can be used first to determine the phylogenetic distribution of a family and then to infer the likely time of evolutionary origin, which must predate the last common ancestor of all extant family members. Pioneered by Hertel et al. (2006) and subsequently extended to increasingly larger data sets and complemented by experimental verification of predicted miRNAs (Sempere et al., 2006; Prochnik et al., 2007; Heimberg et al., 2007), the analysis (see Figure 15.3 for updated data) reveals striking patterns in miRNA evolution and suggests that miRNAs have huge impacts on animal phylogeny. The dramatically expanding repertoire of both miRNA genes and their putative targets (Lee et al., 2007) appears to be correlated with major body-plan innovations. On the other hand, lineage-specific microRNAs may account for phenotypic variation in closely related species.
A survey of the literature (Lee et al., 2007) concludes that the diversity of the microRNA repertoire, the complexity of their expression patterns, and the diversity of the miRNA targets are correlated with the animal’s morphological complexity. Mechanistically, this is more than plausible since the miRNA pathway can influence large gene networks in a coordinated manner and miRNAs are known to be involved in the regulation of nearly all cellular processes. The evolution of microRNAs is characterized not only by the continuing innovation of novel families but also by the diversification of established families spawning additional paralogous family members. Animal miRNAs are often organized in genomic clusters, usually indicating a single polycistronic primary precursor transcript, which may carry members of several distinct microRNA families. Like protein-coding gene families, miRNA families evolve through gene duplications and gene loss (Tanzer and Stadler, 2004; Tanzer et al., 2005; Hertel et al., 2006). Two distinct types of duplication events can be distinguished: (a) local duplications leading to additional copies on the same primary transcript, and (b) nonlocal duplications, which eventually place the paralogues under different transcriptional control. Nonlocal duplications are mostly caused by the whole-genome duplications in early vertebrate and teleost evolution (Spring, 2002), while only a few individual duplications of primary miRNA precursor genes have been described (Hertel et al., 2006). In contrast to the typical mode in protein evolution, mature miRNA paralogues usually acquire no or only minimal substitutions, suggesting that functional differences between paralogues are predominantly caused by differences in the regulation of their expression and processing rather than by changes in the portfolio of their potential targets. The continuing innovation of miRNAs is also highlighted by the presence of a large number of evolutionarily very young and sometimes even species-specific miRNAs. A pipeline designed to find miRNAs without enforcing initial constraints of evolutionary conservation (Bentwich et al., 2005) discovered 89 novel miRNAs, of which 53 are primate specific. These findings partially overlap with similar results from other groups (Berezikov et al., 2005, 2006; Xie et al., 2005b). In a high-throughput sequencing study (Berezikov et al., 2006) of human and chimpanzee small RNAs, Plasterk and coworkers found 447 miRNAs that were novel at the time, of which 244 were expressed in human and 230 in chimpanzee, with an overlap of only 27. Of these novel human miRNAs, more than 50 are specific to primates and 8% specific to human according to sequence conservation. The same study also shows that some miRNA families apparently expand in a species-specific fashion. The general trend of expanding the microRNA repertoire in most lineages appears to correlate with increasing morphological complexity (Hertel et al., 2006; Sempere et al., 2006; Prochnik et al., 2007; Heimberg et al., 2007).

The morphological simplification in urochordates, on the other hand, is accompanied by the loss of numerous highly conserved bilaterian miRNAs and a reorganization of their miRNAs that clearly sets them apart from the other chordate lineages (Fu et al., 2008). In Oikopleura, the effect is particularly striking. In urochordates, a large number of introns have been eliminated due to the strong pressure toward genome compression, explaining the reduction of the fraction of intronic microRNAs from approximately 80% in vertebrates to less than 30%. The need to reduce genome size may also explain why the majority of urochordate miRNAs are located antisense to their target gene (Fu et al., 2008). Another example of the opposite trend of minimizing or reorganizing the miRNA repertoire can be found in the flatworm lineage. The planarian S. mediterranea encodes 71 miRNAs (Palakodeti et al., 2006). In contrast to other protostome lineages, most of their precursor sequences cannot be faithfully aligned with family members in other phyla. Structural alignments of miRNA precursors, however, might help to clarify these evolutionary relationships (Kaczkowski et al., 2009). In the trematode flatworm Schistosoma mansoni, the closest sequenced relative of Schmidtea, a computational survey (unpublished data) recognized fewer than 10 microRNAs unambiguously as members of known microRNA families. In basal metazoans, only a small number of miRNAs have been reported so far. Among the 32 miRNAs (Grimson et al., 2008) in Nematostella vectensis is a mir-100 homologue (Prochnik et al., 2007), demonstrating a common origin of bilaterian and cnidarian microRNAs. In contrast, the eight microRNAs detected in the sponge Amphimedon queenslandica (Grimson et al., 2008) show no homology with any of the known cnidarian or bilaterian microRNAs and have precursors that are structurally more similar to plant and slime-mold miRNAs than to their bilaterian relatives.

[Figure 15.3: phylogenetic tree of the animal genomes searched, annotated with clade names (Metazoa, Cnidaria, Bilateria, Protostomia, Mollusca, Gastropoda, Annelida, Platyhelminthes, Nematoda, Arthropoda, Drosophila, Deuterostomia, Echinodermata, Urochordata, Vertebrata, Gnathostoma, Teleostomi, Teleostei, Mammalia, Eutheria, Rodentia, Primates) and per-branch counts. Genomes shown: Trichoplax adhaerens, Acropora millepora, Acropora palmata, Nematostella vectensis, Hydra magnipapillata, Spisula solidissima, Biomphalaria glabrata, Aplysia californica, Lottia gigantea, Capitella capitata, Helobdella robusta, Schmidtea mediterranea, Schistosoma mansoni, Trichinella spiralis, Brugia malayi, Pristionchus pacificus, Caenorhabditis briggsae, Caenorhabditis remanei, Caenorhabditis elegans, Daphnia pulex, Apis mellifera, Tribolium castaneum, Bombyx mori, Anopheles gambiae, Saccoglossus kowalevskii, Strongylocentrotus purpuratus, Oikopleura dioica, Ciona savignyi, Ciona intestinalis, Branchiostoma floridae, Petromyzon marinus, Callorhinchus milii, Danio rerio, Oryzias latipes, Gasterosteus aculeatus, Takifugu rubripes, Tetraodon nigroviridis, Xenopus tropicalis, Gallus gallus, Ornithorhynchus anatinus, Monodelphis domestica, Canis familiaris, Bos taurus, Rattus norvegicus, Mus musculus, Pan troglodytes, Homo sapiens.]

Figure 15.3 Evolution of animal miRNAs. Starting from miRBase 11.0 (April 2008), a comprehensive homology search in all genomes shown in the tree was conducted. Each microRNA family is mapped to the branch leading to the last common ancestor of the computationally identified homologues (for technical details, refer to Hertel et al. (2006)). Innovation of new miRNA families is clearly an ongoing process in metazoan evolution. Due to the incomplete genomes of the lamprey Petromyzon marinus and the shark Callorhinchus milii, the assignment of innovations around the root of vertebrates is uncertain in its details, and more complete data might shift some innovations back to the gnathostome and/or the vertebrate root. Taken together, there is, however, a clear increase in microRNA innovation between the vertebrate ancestor and the split of the teleost and tetrapod lineages. The most striking burst of innovations, however, is observed in the eutherian ancestor. Note that the data are biased by the fact that independent surveys for miRNAs have been conducted only for a few model organisms; thus, the lack of innovations along many of the invertebrate lineages might be due to missing data. (See insert for color representation of this figure.)
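The mapping used for Figure 15.3, in which each family is assigned to the branch leading to the last common ancestor of its homologues, can be sketched with a small rooted tree. The tree below is a heavily simplified, hand-written topology with only a handful of taxa, and the family presence data are invented; neither is taken from the figure or from Hertel et al. (2006).

```python
# Child -> parent links of a simplified, illustrative species tree.
PARENT = {
    "Homo sapiens": "Mammalia",
    "Mus musculus": "Mammalia",
    "Danio rerio": "Teleostei",
    "Mammalia": "Vertebrata",
    "Teleostei": "Vertebrata",
    "Vertebrata": "Deuterostomia",
    "Caenorhabditis elegans": "Protostomia",
    "Protostomia": "Bilateria",
    "Deuterostomia": "Bilateria",
    "Nematostella vectensis": "Cnidaria",
    "Cnidaria": "Metazoa",
    "Bilateria": "Metazoa",
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def last_common_ancestor(species):
    """Return the deepest node shared by the root paths of all species."""
    paths = [path_to_root(name) for name in species]
    shared = set(paths[0]).intersection(*paths[1:])
    for node in paths[0]:          # walk upward; first shared node is the LCA
        if node in shared:
            return node
    raise ValueError("species do not share a root")


def family_origins(presence):
    """Map each family to the node at which it is inferred to have arisen."""
    return {family: last_common_ancestor(species)
            for family, species in presence.items()}


# Invented presence/absence data for two hypothetical families.
print(family_origins({
    "family_A": ["Homo sapiens", "Danio rerio"],
    "family_B": ["Homo sapiens", "Mus musculus", "Caenorhabditis elegans"],
}))
# {'family_A': 'Vertebrata', 'family_B': 'Bilateria'}
```

Because absence of a detected homologue can reflect an incomplete genome or a secondary loss, such an assignment gives only the latest possible origin of a family, in line with the caveats stated in the figure caption.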

15.6.2 Plant microRNAs

As in animals, microRNA innovation is an ongoing process in plant evolution (Figure 15.4). Interestingly, there are far fewer distinct families of conserved miRNAs, many of which are evolutionarily very old (e.g., Zhang et al., 2006; Axtell et al., 2007; Sunkar and Jagadeeswaran, 2008). At least 16 families date back to the last common ancestor of bryophytes and angiosperms. At the same time, many plant studies so far have revealed large, diverse sets of species-specific miRNAs that often outnumber the conserved miRNAs (Axtell et al., 2007; Barakat et al., 2007; Fahlgren et al., 2007; Fattash et al., 2007; Lindow and Krogh, 2005; Talmor-Neiman et al., 2006; Sunkar et al., 2008). Many of these species-specific miRNAs are single-copy genes and show significant sequence similarity with their putative targets, supporting the view that these miRNAs are indeed evolutionarily very recent. Conceivably, some of the species-specific miRNAs may be misclassified members of other siRNA families. Nonlocal events can be detected and dated by examining conservation patterns of protein-coding genes flanking individual miRNA family members, allowing calculation of phylogenetic trees of miRNA families (Maher et al., 2006). Approximately 67% of all Arabidopsis multifamily miRNA genes, for instance, emerged from local duplications.

[Figure 15.4: tree of plant lineages (Chlamydomonas, Bryophyta, Lycopodiophyta, Filicophyta, Coniferales, Magnoliidae, Poales, and the dicot groups Ranunculales, Caryophyllales, Solanales, Asterales, Vitacea, Rosacea, Fabacea, Malpighiales, Malvales, Sapindales, and Brassicales), with clade labels Embryophyta, Spermatophyta, Magnoliophyta, and Dicots. MicroRNA families placed on the branches: 403; 158, 161, 391; 435, 437, 444; 162, 397, 399, 413; 164, 393, 394, 398, 783; 168, 169, 396; 156/157, 160, 159/319, 165/166, 167, 170/171, 172, 390, 395, 408, 414, 418, 419, 473/477, 529, 535, 536, 537, 1220. The figure also carries the label "no common microRNAs".]

Figure 15.4 Phylogenetic distribution of plant microRNA families. As in Figure 15.3, microRNA families are mapped to the branch leading to the last common ancestor of annotated family members. The figure combines the data listed in miRBase 11 (April 2008) and in several references (Axtell et al., 2007; Barakat et al., 2007; Sunkar and Jagadeeswaran, 2008; Willmann and Poethig, 2007; Zhang et al., 2006).

15.6.3 MicroRNAs and Viruses

MicroRNAs regulate host–pathogen interactions in different directions (virus → virus, virus → host, host → virus) and at different stages of the viral life cycle (infectious, latent), and therefore in different pathways (replication, apoptosis, infection). The mode of interaction also depends on the subcellular localization of the virus within the host cell. Since the first cleavage step of the pre-miRNA from the primary transcript takes place in the nucleus, viruses encoding their own miRNAs have to be able to cross the nuclear membrane. This is the case in particular for retroviruses, which even integrate into host genomes, and DNA viruses. RNA viruses remaining in the cytoplasm require either a transport mechanism shuttling their mRNA into the nucleus or some alternative miRNA maturation pathway. EBV (Epstein–Barr virus, also called human herpesvirus-4 (HHV-4)) was the first virus shown to encode several microRNAs (Pfeffer et al., 2004), located in introns and UTRs. Typically, viral miRNAs are conserved only in closely related species or not at all, making their computational prediction a difficult task. A machine-learning approach using a set of properties of stem-loop structures such as free energy of folding, length, and base pair composition (Pfeffer et al., 2005) nevertheless led to the discovery of miRNAs from a diverse array of DNA viruses and retroviruses, including herpesviruses, polyomaviruses, and Adenoviridae. One microRNA each was found in measles virus (Paramyxoviridae) and yellow fever virus (YFV, Flaviviridae) (Pfeffer et al., 2005). Drosophila C virus, a picornavirus naturally associated with D. melanogaster, also expresses small RNAs (Aravin et al., 2003). In the following, we briefly introduce a few well-studied cases; for more detailed reviews of microRNAs in viruses, refer to Nair and Zavolan (2006), Sullivan and Ganem (2005), Sarnow et al. (2006), Cullen (2006), and Dykxhoorn (2007).
DNA Viruses Encode Their Own miRNAs

The latency-associated transcript (LAT) in Herpes simplex virus-1 (HSV-1) inhibits apoptosis and helps the virus to remain undetected in the infected host cells. A miRNA encoded in this transcript targets components (TGF-beta, SMAD3) of the TGF-beta pathway, which induces apoptosis, and thus sustains viability of the host (Gupta et al., 2006). In Simian virus 40 (SV40), a member of the polyomavirus family, sv40-miR-S1 is processed from the 3′-UTR of the late pre-mRNA. It targets T-antigen, one of the viral early genes, which in turn is recognized by cytotoxic T lymphocytes (CTLs). The mechanism enables the virus to escape the host defense response. The miRNA is highly conserved across related primate polyomaviruses (Sullivan et al., 2005). Epstein–Barr virus (EBV) encodes several miRNAs. One of them (miR-BART2) was predicted to target the viral DNA polymerase BALF5 and was recently shown to inhibit the transition from latent to lytic viral replication (Barth et al., 2008). miR-K12-10 from Kaposi’s sarcoma-associated herpesvirus is encoded in the ORF of the kaposin gene. Excision of the miRNA caused cleavage of the mRNA. In addition, this miRNA provides an editing site leading to a glycine to serine change in the kaposin protein (Pfeffer et al., 2005).

RNA Viruses Regulated by Host miRNAs

Human mir-122 leads to accumulation of viral RNA during Hepatitis C virus (HCV) infection and was therefore suggested to positively interfere with viral replication. This explains why successful HCV infection depends on the presence of mir-122 (Jopling et al., 2005). Knocking down components of the miRNA pathways (Randall et al., 2007) or mir-122 (Jopling et al., 2006) leads to reduced HCV replication. Other RNA viruses are sensitive to host miRNAs. Mouse mir-24 and mir-93 serve as a host defense by targeting the large protein (L protein) and phosphoprotein (P protein) genes of the rhabdovirus vesicular stomatitis virus (VSV) (Otsuka et al., 2007).

Retroviral RNAs Blocking Host miRNA Pathways

Several retroviral RNAs have been shown to enter or interfere with the miRNA pathways and thus cause inhibition of the host machinery. Adenoviral VAII RNA is processed by Dicer. It is highly abundant in late cells and blocks the host miRNA machinery by saturating the various protein components. For instance, 60% of the small RNAs incorporated into the RISC complex resemble viral VAII RNA products (Aparicio et al., 2006; Xu et al., 2007). Two miRNAs of the mir-17 cluster (mir-17 and mir-20a) target the histone acetyltransferase Tat cofactor PCAF, an important factor in HIV-1 replication. The expression of these miRNAs was found to be actively suppressed by HIV-1 (Triboulet et al., 2007). Human miR-32, finally, targets Tas, a gene of primate foamy virus type 1 (PFV-1) that suppresses the microRNA pathways (Lecellier et al., 2005).

Plant Viruses

As of today, no miRNAs in plant virus genomes have been reported, and although exogenous RNAi plays a central role in fighting viruses in plants (Hamilton and Baulcombe, 1999), there is no evidence that miRNAs are directly involved in responses to viral infections. High mutation rates allow viruses to escape miRNA cleavage by quickly altering the sequences of putative miRNA target sites (Simón-Mateo and García, 2006). Furthermore, almost every plant virus encodes suppressors of the siRNA-mediated host response to infections, and some of these inhibit steps that are shared with the miRNA pathway (Kasschau et al., 2003; Chellappan et al., 2005). It is, however, reported that some viruses without such PTGS suppressors may also exploit the miRNA pathway (Bazzini et al., 2007). An example of a plant miRNA with a probable regulatory role in an infection response has been observed in Brassica rapa. Here, an evolutionarily young miRNA induced by Turnip mosaic virus (TuMV) cleaves specific disease-resistance genes of the TIR-NBS-LRR class (He et al., 2008b).

[Figure 15.5: sequence patterns at the intron ends of mirtrons, shown in two panels labeled Mammalia and Invertebrates.]

Figure 15.5 Mirtrons exhibit characteristic sequence patterns just inside the exon/intron boundaries that differ significantly between vertebrates and invertebrates (Berezikov et al., 2007). The splice-donor GU and the splice-acceptor AG are shown in bold. Arrows indicate the mature microRNAs, which can be located on both arms. While their 5′-end is well defined, there is some variation at their 3′-end.

15.6.4 Mirtrons

Mirtrons are alternative precursors for microRNAs that employ the splicing machinery for the first steps of their processing, thereby bypassing Drosha cleavage. This alternative processing pathway was recently described in mammals, Drosophila, and Caenorhabditis (Berezikov et al., 2007; Okamura et al., 2007; Ruby et al., 2007) (Figure 15.5), and a candidate mirtron has recently been reported even in rice (Zhu et al., 2008). While mirtrons are often well conserved within nematodes, insects, and vertebrates, none of the known mirtrons is shared between these clades. Since vertebrate and invertebrate mirtrons exhibit several differences (Figure 15.5), Berezikov et al. (2007) suggested that the mirtron pathway evolved independently in several clades. Alternatively, mirtron sequences might not be sufficiently well conserved to establish homology between phyla unambiguously.

15.7 ORIGIN(S) OF microRNA FAMILIES

15.7.1 Metazoa

Since almost the entire eukaryotic genome is transcribed (The ENCODE Project Consortium, 2007), there is no shortage of RNAs that can potentially enter the microRNA processing cascade. In fact, stem-loop structures of the approximate size of microRNA precursors are a highly abundant feature of random RNA sequences. It stands to reason that a sizeable fraction of these is sufficiently stable and symmetric to be processed. Indeed, a computational approach that started with an initial search for all hairpins in the human genome and subsequently employed stringent computational and experimental filtering (Bentwich et al., 2005) identified 53 primate-specific microRNAs. Similarly, several lineage-specific miRNAs were listed by Berezikov et al. (2006), some of which exhibit rapid evolution. This picture is reinforced by high-throughput sequencing (Berezikov et al., 2006) and the finding of hundreds of specific miRNAs in human and chimpanzee brain, respectively. This leads us to conclude (Tanzer and Stadler, 2004) that novel metazoan microRNA families constantly arise from expressed transcripts that are currently not under strong selection. Hairpins formed by the precursor RNAs are then processed with a nonnegligible probability into novel microRNAs, which are retained and rapidly optimized if they provide a beneficial regulatory impact.


Chapter 15

Evolutionary Genomics of microRNAs and Their Relatives

In some cases, the precursor transcript can be identified either as a repetitive element (see Section 7.3) or as a pseudogene. Pseudogenes are good candidates for ancestors of novel miRNA-bearing transcripts, because expressed pseudogenes are found in considerable numbers in many genomes, often arising from strongly expressed genes such as housekeeping genes. Examples of miRNAs observed in pseudogenes are the primate-specific mir-220 and mir-492 (Devor, 2006). Recently, a curious class of nonprotein-coding RNAs was reported in human (Ender et al., 2008) and in the parasite Giardia lamblia (Saraiya and Wang, 2008) that originates from small nucleolar RNAs but acts like microRNAs. Some snoRNAs seem to serve as precursors that need not be processed by Drosha but require Dicer activity to extract the functional 22-nt miRNA. Ender et al. (2008) and Saraiya and Wang (2008) predicted reasonable targets that support their hypothesis.

15.7.2 Mechanisms in Plants

Evolutionarily young, species-specific plant miRNAs often show high sequence similarity to their target genes even beyond the mature miRNA sequence. For example, both arms of miR822 show extended similarity with DC1 domain-containing genes (Allen et al., 2005), and a similar pattern was reported for mir161, mir163 (Allen et al., 2004), miR826, and miR841 (Rajagopalan et al., 2006) and their predicted targets. In some cases, the sequence similarities also include promoter regions (Wang et al., 2006). These observations led to the inverted duplication hypothesis (Allen et al., 2004), which postulates that miRNA genes arise from local inverted duplications of their target genes. A variant of this mechanism has been proposed for miR842 and miR846 in Arabidopsis (Rajagopalan et al., 2006), where miRNA and miRNA* likely arose by an early duplication event within their targets. Later duplications then generated these miRNA loci. Transcription of such young miRNA genes produces foldback structures that are probably processed by DCL4; acquired mutations may then lead to a switch to DCL1 processing (Axtell and Bowman, 2008).
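The logic of the inverted duplication hypothesis can be made concrete with a toy example. The Python sketch below assumes a made-up target segment and spacer (neither is a real gene sequence, and the function names are hypothetical). It builds a locus in which the segment is followed by its own inverted copy, then checks the two properties the hypothesis relies on: the two arms of the resulting transcript are reverse complements of each other, so the transcript can fold back on itself, and one arm is antisense to the target, so a small RNA excised from it could pair with the target mRNA.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def inverted_duplication(target_segment, spacer):
    """Locus consisting of a target segment, a spacer, and an inverted copy."""
    return target_segment + spacer + reverse_complement(target_segment)


def arms(locus, arm_len):
    """Return the 5' and 3' arms of the foldback transcribed from the locus."""
    return locus[:arm_len], locus[-arm_len:]


if __name__ == "__main__":
    target = "ATGGCTTCAGGAGCTTTGACCGA"  # hypothetical target-gene segment
    locus = inverted_duplication(target, spacer="TTTT")
    five_prime, three_prime = arms(locus, len(target))
    print("arms can pair:", reverse_complement(three_prime) == five_prime)
    print("3' arm is antisense to target:", three_prime == reverse_complement(target))
```

In a freshly duplicated locus both arms match the target along their full length, which is consistent with the extended similarity described above for young miRNAs. Subsequent mutations would be expected to erode that similarity outside the mature miRNA.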

15.7.3 microRNAs and Transposable Elements

A subset of the mammalian miRNAs are derived from transposable elements (TEs) (Table 15.1). This phenomenon appears to be associated with the expansion of TEs in mammalian genomes, since no repeat-related miRNA precursors have been reported in chicken or Drosophila. The single example in C. elegans, cel-mir-69, was later reclassified as an siRNA (Lim et al., 2003). Overall, TE-derived miRNAs are significantly less conserved than non-TE-derived ones (Piriyapongsa et al., 2007), and the list includes several lineage-specific miRNAs (e.g., rno-mir-333 and hsa-mir-95). The better conserved ones mostly stem from L2 and MIR elements (Smalheiser and Torvik, 2005), while the mariner-derived element MADE1 and other miniature inverted-repeat transposable elements (MITEs) are a major source of human-specific microRNAs (Figure 15.6) (Piriyapongsa and Jordan, 2007). Several genomic loci in plants have been reported to encode both siRNAs and miRNAs. Comparative analysis revealed that these are repeat derived (Piriyapongsa and Jordan, 2008). While long, nearly exact double strands, including those formed by the terminal inverted repeats of full-length DNA elements, produce siRNAs, miRNAs are derived from short imperfect hairpin structures. The latter may arise from MITEs, which consist of two terminal inverted repeats with little intervening sequence (Figure 15.6).


Table 15.1 microRNAs Derived from Transposable Elements

Repeat Class       Mammalian microRNAs(a)
LINE               {(hsa, mmu, rno) 28, 151/151, 325}, {(hsa, mmu) 374, 421, 493}, {(hsa) 95, 545, 552, 562, 571, 576, 578, 579, 582, 588, 606, 616, 619, 625, 626, 634, 644, 648, 649}
MITE               {(hsa) 361, 513-[a-1, a-2], 544, 548-[a-1, a-2, a-3, b, c, d-1, d-2], 570, 579, 584, 587, 603, 645, 652}
SINE               {(hsa, mmu) 130-b, 330, 345, 370, 378}, {(hsa) 422-a, 566, 575, 607, 619, 633, 640, 649}, {(rno) 333}
LTR                {(hsa) 548-a-3, 558}, {(mmu) 297}, {(rno) 327}
DNA (mariner)      {(hsa, mmu, rno) 340}
Other (Arthur1)    {(hsa) 659}

Repeat Class       Plant microRNAs(b)
DNA                {(ath) 416}, {(osa) 439-[a, b], 817, 821-[a, b, c]}
LTR                {(ath) 401, 854-[a, b, c, d], 855}, {(osa) 416, 420, 531}
MITE               {(ath) 405 [a, b, d]}, {(osa) 442, 443, 445-a, 806-[b, g], 807-[b, c], 809-h, 811-[a, b, c], 812-[a, b, c, d, e], 813, 814-[a, b, c], 816, 818-[b, e], 819-[a, d, f, g, h], 821-[a, b, c]}

hsa: human; mmu: mouse; rno: rat; ath: arabidopsis; osa: rice; LINE: long interspersed element; MITE: miniature inverted-repeat transposable element; SINE: short interspersed element; LTR: long terminal repeat retrotransposons; MIR: mammalian interspersed repeat.

(a) Compiled from Smalheiser and Torvik (2005), Piriyapongsa et al. (2007), and Piriyapongsa and Jordan (2007).

(b) Compiled from Arteaga-Vazquez et al. (2006) and Piriyapongsa and Jordan (2008).

15.7.4 Are Animal and Plant microRNAs Homologous?

Until very recently, endogenous miRNAs were known only in multicellular organisms: land plants and metazoans. This picture changed with the discovery of miRNAs in the green alga Chlamydomonas reinhardtii (Zhao et al., 2007; Molnar et al., 2007) and in the slime mold Dictyostelium discoideum (Hinas et al., 2007). A computational study presents evidence for miRNAs in trypanosomes (Mallick et al., 2008), although these predictions have not yet been verified experimentally. There is no convincing evidence that any of the known microRNA families dates back to the last common ancestor of plants and animals. The only published candidate, mir854/855 (Arteaga-Vazquez et al., 2006), cannot be traced consistently through either plant or animal phylogeny; the low-complexity sequence is most likely an analogous invention. In the same vein, none of the seed plant microRNAs is related to microRNAs of the green alga C. reinhardtii (Zhao et al., 2007; Molnar et al., 2007). Chlamydomonas miRNAs differ in several respects from microRNAs of land plants. In particular, multiple mature miRs are processed from a single stem-loop. The slime mold miRNAs (Hinas et al., 2007) also show no homology to plant, Chlamydomonas, or animal miRNAs.

Figure 15.6 Transition from a full-length DNA element (a) with terminal inverted repeats (black triangles) enclosing an ORF to a MITE (b) that consists of the inverted repeats only. Transcripts with a large internal region (c) give rise to siRNAs, while short hairpin RNAs arising from MITEs (d) are the first step toward generating microRNAs from TEs. Adapted from Piriyapongsa and Jordan (2007, 2008).

The small RNA processing pathways, and the RNAi machinery in particular, are evolutionarily very old (Ullu et al., 2004), presumably dating back to the ancestral eukaryote, since their components are present in the most basal lineages (Carlton et al., 2007). For the origin of the miRNA processing machinery, there are two possible scenarios between which we cannot distinguish based on the available evidence:

1. It arose once, rather early in eukaryote evolution. In this case, the ancestral microRNAs have long since been replaced by more modern innovations in the different kingdoms, while the protein components of the microRNA processing machinery have been retained.

2. The endogenous production of specific miRNAs has evolved multiple times with different requirements on the RNAs to be processed. Thus, not only did the microRNAs arise independently, but the processing machinery was also derived multiple times from ancestral siRNA pathway(s). Chlamydomonas, for instance, has undergone extensive duplications of Dicer and Argonaute proteins after the divergence of the green algae and land plant lineages, leading to a diversification of the core RNAi machinery (Casas-Mollano et al., 2008).

15.8 GENOMIC ORGANIZATION

15.8.1 Clusters and Families

Several miRNA genes can be found in proximity to each other at a particular genomic locus. Such groups of miRNAs are termed miRNA clusters. Mammalian genomes contain two distinct types of microRNA clusters. In the first type, groups of microRNAs expressed from polycistronic primary precursors are easily recognized by the syntenic conservation of their genomic location over long evolutionary times (Hertel et al., 2006). Such clusters typically contain only a few miRNA precursor hairpins, the largest and most impressive example being the mir-17 clusters, whose evolution is summarized in Figure 15.7. The largest cluster of this type in vertebrates is the mir-379/mir-656 cluster, located in human within the imprinted DLK-DIO3 region on chromosome 14 (Cavaille et al., 2002). This cluster is present in the genomes of all sequenced placental mammals (Glazov et al., 2008). miRBase (Griffiths-Jones et al., 2008) lists 42 miRNAs in human and 37 in mouse located in this cluster. Its members are produced from a large noncoding RNA (Seitz et al., 2004). The second type of cluster consists of large numbers of miRNAs that are transcribed independently or possibly in small groups. An example is the C19MC cluster (Borchert et al., 2006), whose members are individually transcribed by pol-III utilizing the promoters of Alu elements.

Figure 15.7 The evolution of the mir-17 clusters is governed by a complex history of duplications and loss of individual members as well as duplications of entire clusters. The extant clusters consist of members of three nonhomologous groups of miRNAs, namely, the mir-17, mir-19, and mir-92 groups, each of which is composed of several subfamilies with different miRBase names (lower right insets). Only mir-92 pre-dates the origin of vertebrates; the earliest evidence for clusters stems from lamprey and shark. The formation of the ancestral cluster and the divergence of both mir-18 and mir-93 from the mir-17 group appear to have pre-dated the first round of genome duplication in the ancestral vertebrate. Differential loss of one of the mir-93 and mir-18 paralogues apparently followed the first duplication. The two clusters then evolved independently: The type-I cluster was extended by a duplication of the entire region from mir-17 over mir-18 to mir-19a, immediately behind mir-19a, and a secondary loss of the mir-18 copy. MicroRNAs of the type-II cluster evolved independently in their sequence, resulting in homologous miRNAs mir-106a, 19d, and mir-25. Only a single cluster was found in the genome of the lamprey P. marinus, which contains both a mir-20 and mir-19b homologue, suggesting that it shares the first genome duplication. A second round of genome duplication resulted in two copies of type-I clusters, while the type-II cluster was not duplicated at all. In elephant fish (C. milii), an early representative whose genome was exposed to two genome duplications, mir-19a and mir-20 were lost in both type-I clusters and mir-106a and mir-19d were lost in the single type-II cluster. In mammals, the homologous miRNAs mir-19a and mir-19d were lost in the second copy of type-I cluster and the type-II cluster, respectively, while the first copy of the type-I cluster remained complete. In teleost fishes, which underwent a third whole genome duplication, the two copies of type-I clusters were duplicated and one of these duplicated clusters was lost subsequently, resulting in three type-I clusters (type-I-A, -B, and -C). Again, only one copy of the type-II cluster was retained. In zebrafish (Danio rerio) the first gene of the type-II cluster (mir-106b) and the first (mir-17) and last (mir-92) ones of the third copy of type-I cluster were lost. While mir-19a is absent in medaka (Oryzias latipes) and in stickleback (Gasterosteus aculeatus), the pufferfishes Takifugu rubripes and Tetraodon nigroviridis lost mir-20 and mir-19b. The figure is based on a reevaluation and extension of earlier studies of the mir-17 cluster (Tanzer and Stadler, 2004, 2006). (See insert for color representation of this figure.)

In contrast, miRNA clusters are not frequently observed in plant genomes. One of the exceptional cases is the miRNA-395 family. Clusters of various sizes and intergenic distances have been reported for several genomes, and rice EST data indicate that at least some of them are expressed as a single, polycistronic transcript (Jones-Rhoades and Bartel, 2004). A MIR156 tandem cluster has been reported both in several monocots and in the dicotyledonous plant Ipomoea nil; MIR169 and MIR1219 are also observed as clusters in distantly related plants (Talmor-Neiman et al., 2006; Axtell et al., 2007). In all these cases, the clusters contain only members of the same family. In contrast to all other land plants investigated to date, approximately a quarter of the miRNAs of the moss Physcomitrella patens are located in clusters (Talmor-Neiman et al., 2006; Axtell et al., 2007). The exceptional microRNAs of the green alga C. reinhardtii are also partially clustered. In particular, several members of the MIR918 family are potentially derived from a single stem-loop (Zhao et al., 2007; Molnar et al., 2007).
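Operationally, "proximity" has to be turned into a distance threshold before clusters can be counted. The Python sketch below shows one simple way this might be done: loci on the same chromosome are chained into a cluster whenever the gap to the previous locus does not exceed a maximum distance. The 10 kb default, the locus names, and the coordinates are all hypothetical; the text above does not prescribe a threshold, and strand and transcript information (which would be needed to separate polycistronic from independently transcribed clusters) are ignored.

```python
def cluster_loci(loci, max_gap=10_000):
    """Group miRNA loci into clusters by genomic proximity.

    loci    -- iterable of (name, chromosome, start, end) tuples
    max_gap -- largest allowed distance between neighbouring loci
               (an assumed cutoff, not one taken from the literature)
    Returns a list of clusters, each a list of locus names.
    """
    clusters = []
    current = []
    last_chrom, last_end = None, None
    for name, chrom, start, end in sorted(loci, key=lambda locus: (locus[1], locus[2])):
        if current and chrom == last_chrom and start - last_end <= max_gap:
            current.append(name)
            last_end = max(last_end, end)
        else:
            if current:
                clusters.append(current)
            current = [name]
            last_chrom, last_end = chrom, end
    if current:
        clusters.append(current)
    return clusters


if __name__ == "__main__":
    # Hypothetical coordinates for illustration only.
    example = [
        ("mir-x1", "chr1", 100, 180),
        ("mir-x2", "chr1", 500, 580),
        ("mir-y", "chr1", 50_000, 50_080),
        ("mir-z", "chr2", 600, 680),
    ]
    print(cluster_loci(example))
```

The number and size of clusters reported for a genome depend directly on the chosen `max_gap`, which is one reason cluster counts from different studies are not always comparable.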

15.8.2 Regulation of microRNA Expression

Most pri-miRNAs are transcribed by RNA polymerase II, since these transcripts contain cap structures as well as poly(A) tails (Lee et al., 2004; Cai et al., 2004). Core promoters have been characterized in both animals and plants (Zhou et al., 2007a). Transcriptional regulation of pri-miRNAs does not seem to differ substantially from that of protein-coding pol-II transcripts, although only a few examples have been analyzed in detail. Expression of the human mir-21 gene, for example, depends on the transcription factor Stat3 due to two Stat3 binding sites in an upstream enhancer region that has been strictly conserved since the first observed evolutionary appearance of mir-21 and Stat3 (Löffler et al., 2007). This connection between microRNA and transcription factor is highly conserved in evolution (Figure 15.8). Recently, the involvement of Stat3 in mir-21 expression was also observed in a teleost (Ramachandra et al., 2008). MicroRNA misregulation by the oncogenic transcription factor Myc (Chang et al., 2008) and the tumor suppressor p53 (Hermeking, 2007) contributes to tumorigenesis. Phylogenetic footprinting, furthermore, revealed that transcription factors that play essential roles in development preferentially regulate miRNA genes in Drosophila. A recent analysis of the primary precursors of mouse microRNAs uncovered a conserved sequence element that might be involved in posttranscriptional regulation of microRNAs (He et al., 2008a). Some microRNAs, most notably the members of the C19MC cluster on human chromosome 19, are transcribed by RNA polymerase III, in this case utilizing Alu repeats to recruit the polymerase (Borchert et al., 2006).

In general, it is not clear whether transposable elements provide the transcriptional starts for adjacent microRNAs (Smalheiser and Torvik, 2005), although human MITEs are transcribed as read-through transcripts initiated from adjacent genomic positions and not by means of a strand-specific promoter provided by the transposable element itself (Piriyapongsa and Jordan, 2007). Transposable elements that dually code for microRNAs and siRNAs can be expressed as read-through transcripts from intronic regions of spliced RNA messages (Piriyapongsa and Jordan, 2008). Approximately a quarter of the human microRNAs are located in introns of protein-coding genes (Li et al., 2007); in Xenopus, intronic locations are predominant (Tang and Maxwell, 2008). In contrast to intronic snoRNAs, the majority of intronic miRNAs are processed from unspliced intronic regions before the catalysis of splicing in vertebrates (Kim and Kim, 2007) and show a bias toward large 5′ introns (Zhou and Lin, 2008). However, the recently discovered mirtrons are diced from unbranched introns (see Section 6.4). Interestingly, several intronic miRNAs have their own complex promoters and are transcribed independently of the promoters of their host genes (Ozsolak et al., 2008). Differential miRNA precursor processing in both the nucleus and the cytoplasm may lead to distinct expression profiles of miRNA precursors and their mature microRNAs, indicating that posttranscriptional processing plays an important role in regulating miRNA expression (Obernosterer et al., 2006; Tang and Maxwell, 2008). MicroRNA-specific sequence motifs located within a few hundred nucleotides upstream of the pre-mir hairpin in both nematodes and vertebrates (Ohler et al., 2004; Inouchi et al., 2007) may well be involved in these processes.

Figure 15.8 Evolutionarily conserved regulation of mir-21 by Stat3. A highly conserved enhancer featuring two Stat3 binding sites (LHS: schematic sequence alignment) is located between 3 and 4 kb upstream of the pre-mir-21 hairpin (RHS). Although the enhancer is located in an intron of the adjacent TMEM49 gene, this distance does not correlate with genome size, suggesting that the enhancer region is associated with mir-21 and independent of the TMEM49 gene. Adapted from Löffler et al. (2007). (See insert for color representation of this figure.)

Plant pri-miRNAs are probably also transcribed by RNA polymerase II (Xie et al., 2005b). Differential pri-miRNA processing is also observed in plants, and splicing variants of several pri-miRNAs have been reported (e.g., Xie et al., 2005b; Warthmann et al., 2008). In contrast to animals, the overwhelming majority of plant miRNA genes are located in regions between annotated genes (Reinhart et al., 2002), and only a few miRNA loci have been reported to overlap with protein-coding genes (Rajagopalan et al., 2006; Axtell et al., 2007). In general, little is known about the regulation of miRNA expression in plants. The microRNAs miR162 and miR168, for example, were shown to be negative regulators of miRNA pathways in plants by targeting DCL1 (Xie et al., 2003) and AGO1 (Vaucheret et al., 2004), respectively, both central genes of the plant miRNA machinery. In Arabidopsis, the DCL1 pre-mRNA level is also self-regulated with the help of mir838, which is located in intron 14 of DCL1. High levels of DCL1 protein lead to a competition with the splicing machinery, and DCL1-processed DCL1 primary transcripts are nonfunctional and subject to degradation (Rajagopalan et al., 2006). Analyses of the upstream regions of known miRNAs revealed a TATA box sequence motif in the promoter region (Xie et al., 2005b; Warthmann et al., 2008). Furthermore, binding sites of the transcription factors AtMYC2, ARF, SORLREP3, and LFY are overrepresented in comparison to protein-coding genes, indicating an important role of these transcription factors in miRNA regulation (Megraw et al., 2006). The miRNA319a locus has been investigated in different species from Brassicaceae (Warthmann et al., 2008), and a strongly conserved upstream region has been shown to be essential for transcription.
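The notion of a motif being "overrepresented" in miRNA upstream regions relative to protein-coding genes can be illustrated with a minimal calculation. The Python sketch below is not the procedure used in the cited analyses. It assumes a simple IUPAC consensus string (here `TATAWA`, used only as an illustrative TATA-like pattern), two made-up sets of upstream sequences, and hypothetical function names, and it reports the ratio of the fractions of sequences containing the motif.

```python
import re

# Subset of IUPAC nucleotide codes translated to regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}


def motif_regex(consensus):
    """Compile an IUPAC consensus string into a regular expression."""
    return re.compile("".join(IUPAC[code] for code in consensus))


def fraction_with_motif(sequences, consensus):
    """Fraction of sequences containing at least one motif occurrence."""
    pattern = motif_regex(consensus)
    return sum(bool(pattern.search(seq)) for seq in sequences) / len(sequences)


def enrichment(foreground, background, consensus):
    """Ratio of motif-containing fractions (foreground over background)."""
    bg = fraction_with_motif(background, consensus)
    fg = fraction_with_motif(foreground, consensus)
    return fg / bg if bg else float("inf")


if __name__ == "__main__":
    # Made-up upstream sequences, for illustration only.
    mirna_upstream = ["GGTATAAAGG", "CCTATATACC", "GGGGGGGGGG", "TATAAACCCC"]
    coding_upstream = ["GGGGGGGGGG", "CCTATAAACC", "ACGTACGTAC", "TTTTTTTTTT"]
    print(enrichment(mirna_upstream, coding_upstream, "TATAWA"))
```

A ratio alone says nothing about significance. A real analysis would also need a statistical test, a background matched for length and base composition, and position weight matrices rather than a single consensus string.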

15.9 SUMMARY AND OUTLOOK

We are just beginning to understand the significance of small RNAs as regulators in eukaryotes. Starting from miRNAs, researchers from a variety of disciplines have set out on the quest for other small noncoding transcripts by exploring the RNome. The impact of the findings was remarkable. We have found not only that transcription extends beyond the regions of protein-coding genes but also that these noncoding regions carry significant information content. Intergenic DNA, often described as the “playground of evolution,” turned out to harbor a plethora of cis and trans regulatory elements, many of them in the form of noncoding RNAs. In this chapter, we not only tried to provide a comprehensive view of one such class of ncRNAs but also underscored the importance of other RNA regulators. To date, microRNAs are one of the best-described classes of small ncRNAs. Less than a decade of microRNA research has profoundly changed the perception of the role of RNAs, from rather uninteresting carriers of coding information to key players in cellular regulation. Indeed, microRNAs affect gene expression on multiple levels: specific histone methylation patterns alter the accessibility of genomic regions, activation or silencing of promoters defines the transcriptional activity of genes, and finally PTGS of mRNAs serves as another checkpoint before energy-consuming translation into protein takes place. In all these processes, miRNAs serve as the exchangeable RNA module in large protein complexes and are responsible for specific interactions with the target sequences. The consequences of this novel picture of eukaryotic regulation need to be explored in more detail, using also approaches from systems biology. Studying the evolutionary history of genes and targets revealed an RNA-based gene regulatory layer, implying an additional source of genome plasticity.

Questions of how novel RNAs contribute to an increase in genome complexity and how they lead to the emergence of novel traits remain largely unanswered. Tracing back the ancestor(s) of recent small ncRNAs seems a promising approach toward understanding whether small RNAs in plants and metazoans are analogies or homologies. The protein machinery that facilitates processing as well as functionality clearly shares main features, but at the same time possesses enough flexibility to allow the acquisition of novel RNA substrates. Even the epigenome turned out to be under the influence of RNA control. The discovery of rasiRNAs and piRNAs, for instance, reveals a flux of information between generations that goes beyond the “programs” hard coded in our genomes. Are we indeed ribo-organisms? If so, we need to be careful in designing RNA-based drugs. Shortly after siRNAs were introduced as the ultimate tool for transient knockdown experiments, researchers found themselves dealing with cross-reactivity and other unexplained side effects, or even no effects at all. Nevertheless, these new technologies seem to be key in the development of new laboratory techniques and medical applications. A variety of diseases, foremost cancer, have been linked to RNA misexpression, once again pointing out that small can be mighty.

REFERENCES

ADAI, A., JOHNSON, C., MLOTSHWA, S., ARCHER-EVANS, S., MANOCHA, V., VANCE, V., and SUNDARESAN, V., 2005. Computational prediction of miRNAs in Arabidopsis thaliana. Genome Res. 15: 78–91.
ALLEN, E., XIE, Z., GUSTAFSON, A.M., and CARRINGTON, J.C., 2005. microRNA-directed phasing during trans-acting siRNA biogenesis in plants. Cell 121: 207–221.
ALLEN, E., XIE, Z., GUSTAFSON, A.M., SUNG, G.H., SPATAFORA, J.W., and CARRINGTON, J.C., 2004. Evolution of microRNA genes by inverted duplication of target gene sequences in Arabidopsis thaliana. Nat. Genet. 36: 1282–1290.
ALTUVIA, Y., LANDGRAF, P., LITHWICK, G., ELEFANT, N., PFEFFER, S., ARAVIN, A., BROWNSTEIN, M.J., TUSCHL, T., and MARGALITH, H., 2005. Clustering and conservation patterns of human microRNAs. Nucleic Acids Res. 33: 2697–2706.
AMBROS, V., 2004. The functions of animal microRNAs. Nature 431: 350–355.
AMBROS, V., LEE, R.C., LAVANWAY, A., WILLIAMS, P.T., and JEWELL, D., 2003a. MicroRNAs and other tiny endogenous RNAs in C. elegans. Curr. Biol. 13: 807–818.
AMBROS, V. et al., 2003b. A uniform system for microRNA annotation. RNA 9: 277–279.
APARICIO, O., RAZQUIN, N., ZARATIEGUI, M., NARVAIZA, I., and FORTES, P., 2006. Adenovirus virus-associated RNA is processed to functional interfering RNAs involved in virus production. J. Virol. 80: 1376–1384.
ARAVIN, A., LAGOS-QUINTANA, M., YALCIN, A., ZAVOLAN, M., MARKS, D., SNYDER, B., GAASTERLAND, T., MEYER, J., and TUSCHL, T., 2003. The small RNA profile during Drosophila melanogaster development. Dev. Cell 5: 337–350.
ARAVIN, A.A., HANNON, G.J., and BRENNECKE, J., 2007a. The Piwi-piRNA pathway provides an adaptive defense in the transposon arms race. Science 318: 761–764.
ARAVIN, A.A., SACHIDANANDAM, R., GIRARD, A., FEJES-TOTH, K., and HANNON, G.J., 2007b. Developmentally regulated piRNA clusters implicate MILI in transposon control. Science 316: 744–747.

ARTEAGA-VÁZQUEZ, M., CABALLERO-PÉREZ, J., and VIELLE-CALZADA, J.P., 2006. A family of microRNAs present in plants and animals. Plant Cell 18: 3355–3369.
AUFSATZ, W., METTE, M.F., van der WINDEN, J., MATZKE, A.J., and MATZKE, M., 2002. RNA-directed DNA methylation in Arabidopsis. Proc. Natl. Acad. Sci. USA 99(Suppl 4): 16499–16506.
AXTELL, M. and BOWMAN, J., 2008. Evolution of plant microRNAs and their targets. Trends Plant Sci. 13: 343–349.
AXTELL, M.J., SNYDER, J.A., and BARTEL, D.P., 2007. Common functions for diverse small RNAs of land plants. Plant Cell 19: 1750–1769.
BAEK, D., VILLÉN, J., SHIN, C., CAMARGO, F.D., GYGI, S.P., and BARTEL, D.P., 2008. The impact of microRNAs on protein output. Nature 455: 44–45.
BAO, L., ZHOU, M., WU, L., LU, L., GOLDOWITZ, D., WILLIAMS, R.W., and CUI, Y., 2007. PolymiRTS database: linking polymorphisms in microRNA target sites with complex traits. Nucleic Acids Res. 35: 51–54.
BARAKAT, A., WALL, K., LEEBENS-MACK, J., WANG, Y.J., CARLSON, J.E., and DEPAMPHILIS, C.W., 2007. Large-scale identification of microRNAs from a basal eudicot (Eschscholzia californica) and conservation in flowering plants. Plant J. 51: 991–1003.
BARTH, S., PFUHL, T., MAMIANI, A., EHSES, C., ROEMER, K., KREMMER, E., JÄKER, C., HÖCK, J., MEISTER, G., and GRÄSSER, F.A., 2008. Epstein–Barr virus-encoded microRNA miR-BART2 down-regulates the viral DNA polymerase BALF5. Nucleic Acids Res. 36: 666–675.
BATISTA, P.J. et al., 2008. PRG-1 and 21U-RNAs interact to form the piRNA complex required for fertility in C. elegans. Mol. Cell 31: 67–78.
BAZZINI, A.A., HOPP, H.E., BEACHY, R.N., and ASURMENDI, S., 2007. Infection and coaccumulation of tobacco mosaic virus proteins alter microRNA levels, correlating with symptom and plant development. Proc. Natl. Acad. Sci. USA 104: 12157–12162.


BENTWICH, I. et al., 2005. Identification of hundreds of conserved and nonconserved human microRNAs. Nat. Genet. 37: 766–770.
BEREZIKOV, E., CHUNG, W.J., WILLIS, J., CUPPEN, E., and LAI, E.C., 2007. Mammalian mirtron genes. Mol. Cell 28: 328–336.
BEREZIKOV, E., GURYEV, V., VAN DE BELT, J., WIENHOLDS, E., and RONALD PLASTERK, H.A., 2005. Phylogenetic shadowing and computational identification of human microRNA genes. Cell 120: 21–24.
BEREZIKOV, E., THUEMMLER, F., VAN LAAKE, L.W., KONDOVA, I., BONTROP, R., CUPPEN, E., and PLASTERK, R.H., 2006. Diversity of microRNAs in human and chimpanzee brain. Nat. Genet. 38: 1375–1377.
BORCHERT, G.M., LANIER, W., and DAVIDSON, B.L., 2006. RNA polymerase III transcribes human microRNAs. Nat. Struct. Mol. Biol. 13: 1097–1101.
BRENNECKE, J., STARK, A., RUSSELL, R.B., and COHEN, S.M., 2005. Principles of microRNA-target recognition. PLoS Biol. 3: e85.
BRODERSEN, P., SAKVARELIDZE-ACHARD, L., BRUUN-RASMUSSEN, M., DUNOYER, P., YAMAMOTO, Y., SIEBURTH, L., and VOINNET, O., 2008. Widespread translational inhibition by plant miRNAs and siRNAs. Science 320: 1185–1190.
BUSHATI, N. and COHEN, S.M., 2007. MicroRNA functions. Annu. Rev. Cell Dev. Biol. 23: 175–205.
CAI, X., HAGEDORN, C.H., and CULLEN, B.R., 2004. Human microRNAs are processed from capped, polyadenylated transcripts that can also function as mRNAs. RNA 10: 1957–1966.
CARLTON, J.M. et al., 2007. Draft genome sequence of the sexually transmitted pathogen Trichomonas vaginalis. Science 315: 207–212.
CASAS-MOLLANO, J.A., ROHR, J., KIM, E.J., BALASSA, E., VAN DIJK, K., and CERUTTI, H., 2008. Diversification of the core RNA interference machinery in Chlamydomonas reinhardtii and the role of DCL1 in transposon silencing. Genetics 179: 69–81.
CAVAILLÉ, J., SEITZ, H., PAULSEN, M., FERGUSON-SMITH, A., and BACHELLERIE, J.P., 2002. Identification of tandemly repeated C/D snoRNA genes at the imprinted human 14q32 domain reminiscent of those at the Prader–Willi/Angelman syndrome region. Hum. Mol. Genet. 11: 1527–1538.
CHANG, T., YU, D., LEE, Y.S., WENTZEL, E.A., ARKING, D.E., WEST, K.M., DANG, C.V., THOMAS-TIKHONENKO, A., and MENDELL, J.T., 2008. Widespread microRNA repression by Myc contributes to tumorigenesis. Nat. Genet. 40: 43–50.
CHELLAPPAN, P., VANITHARANI, R., and FAUQUET, C.M., 2005. MicroRNA-binding viral protein interferes with Arabidopsis development. Proc. Natl. Acad. Sci. USA 102: 10381–10386.
CHEN, K. and RAJEWSKY, N., 2007. The evolution of gene regulation by transcription factors and microRNAs. Nat. Rev. Genet. 8: 93–103.
CLOP, A. et al., 2006. A mutation creating a potential illegitimate microRNA target site in the myostatin gene affects muscularity in sheep. Nat. Genet. 38: 813–818.
CULLEN, B.R., 2006. Viruses and microRNAs. Nat. Genet. 38(Suppl): 25–30.
CZECH, B. et al., 2008. An endogenous small interfering RNA pathway in Drosophila. Nature 453: 798–802.
DEVOR, E.J., 2006. Primate microRNAs miR-220 and miR-492 lie within processed pseudogenes. J. Hered. 97: 186–190.
DEZULIAN, T., REMMERT, M., PALATNIK, J.F., WEIGEL, D., and HUSON, D.H., 2006. Identification of plant microRNA homologs. Bioinformatics 22: 359–360.
DIDIANO, D. and HOBERT, O., 2006. Perfect seed pairing is not a generally reliable predictor for microRNA-target interactions. Nat. Struct. Mol. Biol. 13: 849–851.
DJIKENG, A., SHI, H., TSCHUDI, C., and ULLU, E., 2001. RNA interference in Trypanosoma brucei: cloning of small interfering RNAs provides evidence for retroposon-derived 24-26-nucleotide RNAs. RNA 7: 1522–1530.
DOENCH, J. and SHARP, P., 2004. Specificity of microRNA target selection in translational repression. Genes Dev. 18: 504–511.
DOENCH, J.G., PETERSEN, C.P., and SHARP, P.A., 2003. siRNAs can function as miRNAs. Genes Dev. 17: 438–442.
DOSTIE, J., MOURELATOS, Z., YANG, M., SHARMA, A., and DREYFUSS, G., 2003. Numerous microRNPs in neuronal cells containing novel microRNAs. RNA 9: 180–186.
DYKXHOORN, D.M., 2007. MicroRNAs in viral replication and pathogenesis. DNA Cell Biol. 26: 239–249.
EASOW, G., TELEMAN, A., and COHEN, S., 2007. Isolation of microRNA targets by miRNP immunopurification. RNA 13: 1198–1204.
ELBASHIR, S.M., LENDECKEL, W., and TUSCHL, T., 2001. RNA interference is mediated by 21- and 22-nucleotide RNAs. Genes Dev. 15: 188–200.
ENDER, C., KREK, A., FRIEDLÄNDER, M.R., BEITZINGER, M., WEINMANN, L., CHEN, W., PFEFFER, S., RAJEWSKY, N., and MEISTER, G., 2008. A human snoRNA with microRNA-like functions. Mol. Cell 32: 519–528.
FAHLGREN, N. et al., 2007. High-throughput sequencing of Arabidopsis microRNAs: evidence for frequent birth and death of MIRNA genes. PLoS ONE 2: e219.
FATTASH, I., VOSS, B., RESKI, R., HESS, W.R., and FRANK, W., 2007. Evidence for the rapid expansion of microRNA-mediated regulation in early land plant evolution. BMC Plant Biol. 7: 13.
FILIPOWICZ, W., BHATTACHARYYA, S.N., and SONENBERG, N., 2008. Mechanisms of post-transcriptional regulation by microRNAs: are the answers in sight? Nat. Rev. Genet. 9: 102–114.
FIRE, A.Z., 2007. Gene silencing by double-stranded RNA. Cell Death Differ. 14: 1998–2012.
FIRE, A., XU, S., MONTGOMERY, M.K., KOSTAS, S.A., DRIVER, S.E., and MELLO, C.C., 1998. Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature 391: 806–811.

FÖRSTEMANN, K., HORWICH, M.D., WEE, L., TOMARI, Y., and ZAMORE, P.D., 2007. Drosophila microRNAs are sorted into functionally distinct argonaute complexes after production by dicer-1. Cell 130: 287–297.
FU, X., ADAMSKI, M., and THOMPSON, E.M., 2008. Altered miRNA repertoire in the simplified chordate Oikopleura dioica. Mol. Biol. Evol. 25: 1067–1080.
GANESAN, G. and RAO, S.M., 2008. A novel noncoding RNA processed by Drosha is restricted to nucleus in mouse. RNA 14: 1399–1410.
GEORGES, M., COPPIETERS, W., and CHARLIER, C., 2007. Polymorphic miRNA-mediated gene regulation: contribution to phenotypic variation and disease. Curr. Opin. Genet. Dev. 17: 166–176.
GEORGES, M. et al., 2006. Polymorphic microRNA-target interactions: a novel source of phenotypic variation. Cold Spring Harb. Symp. Quant. Biol. 71: 343–350.
GLAZOV, E.A., MCWILLIAM, S., BARRIS, W.C., and DALRYMPLE, B.P., 2008. Origin, evolution, and biological role of miRNA cluster in DLK-DIO3 genomic region in placental mammals. Mol. Biol. Evol. 25: 939–948.
GORODKIN, J., HAVGAARD, J., ENSTERÖ, M., SAWERA, M., JENSEN, P., OHMAN, M., and FREDHOLM, M., 2006. MicroRNA sequence motifs reveal asymmetry between the stem arms. Comput. Biol. Chem. 30: 249–254.
GRAHAM RUBY, J.G., JAN, C.H., and BARTEL, D.P., 2007. Intronic microRNA precursors that bypass Drosha processing. Nature 448: 83–86.
GREGORY, R.I., YAN, K.P., AMUTHAN, G., CHENDRIMADA, T., DORATOTAJ, B., COOCH, N., and SHIEKHATTAR, R., 2004. The microprocessor complex mediates the genesis of microRNAs. Nature 432: 235–240.
GRIFFITHS-JONES, S., SAINI, H.K., VAN DONGEN, S., and ENRIGHT, A.J., 2008. miRBase: tools for microRNA genomics. Nucleic Acids Res. 36: D154–D158.
GRIMSON, A., FARH, K., JOHNSTON, W., GARRETT-ENGELE, P., LIM, L., and BARTEL, D., 2007. MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Mol. Cell 27: 91–105.
GRIMSON, A., SRIVASTAVA, M., FAHEY, B., WOODCROFT, B.J., CHIANG, H.R., KING, N., DEGNAN, B.M., ROKHSAR, D.S., and BARTEL, D.P., 2008. Early origins and evolution of microRNAs and Piwi-interacting RNAs in animals. Nature 455: 1193–1197. GUPTA, A., GARTNER, J.J., SETHUPATHY, P., HATZIGEORGIOU, A. G., and FRASER, N.W., 2006. Anti-apoptotic function of a microRNA encoded by the HSV-1 latency-associated transcript. Nature 442: 82–85. HAMILTON, A., VOINNET, O., CHAPPELL, L., and BAULCOMBE, D., 2002. Two classes of short interfering RNA in RNA silencing. EMBO J. 21: 4671–4679. HAMILTON, A.J. and BAULCOMBE, D.C., 1999. A species of small antisense RNA in post-transcriptional gene silencing in plants. Science 286: 950–952. HE, S., SU, H., LIU, C., SKOGERBØ, G., HE, H., HE, D., ZHU, X., LIU, T., ZHAO, Y., and CHEN, R., 2008a. MicroRNAencoding long non-coding RNAs. BMC Genomics 21: 236.




Chapter 15

Evolutionary Genomics of microRNAs and Their Relatives

KASSCHAU, K.D., XIE, Z., ALLEN, E., LLAVE, C., CHAPMAN, E.J., KRIZAN, K.A., and CARRINGTON, J.C., 2003. P1/HC-Pro, a viral suppressor of RNA silencing, interferes with Arabidopsis development and miRNA function. Dev. Cell 4: 205–217. KAWASAKI, H. and TAIRA, K., 2004. Induction of DNA methylation and gene silencing by short interfering RNAs in human cells. Nature 431: 211–217. KENNERDELL, J.R. and CARTHEW, R.W., 1998. Use of dsRNAmediated genetic interference to demonstrate that frizzled and frizzled-2 act in the wingless pathway. Cell 95: 1017–1026. KERTESZ, M., IOVINO, N., UNNERSTALL, U., GAUL, U., and SEGAL, E., 2007. The role of site accessibility in microRNA target recognition. Nat. Genet. 39: 1278–1284. KHVOROVA, A., REYNOLDS, A., and JAYASENA, S.D., 2003. Functional siRNAs and miRNAs exhibit strand bias. Cell 115: 209–216. KIM, V.N., 2005a. MicroRNA biogenesis: coordinated cropping and dicing. Nat. Rev. Mol. Cell Biol. 6: 376–385. KIM, V.N., 2005b. Small RNAs: classiﬁcation, biogenesis, and function. Mol. Cell 19: 1–15. KIM, V.N., 2006. Small RNAs just got bigger: Piwiinteracting RNAs (piRNAs) in mammalian testes. Genes Dev. 20: 1993–1997. KIM, Y.K. and KIM, V.N., 2007. Processing of intronic microRNAs. EMBO J. 26: 775–783. KLATTENHOFF, C. and THEURKAUF, W., 2008. Biogenesis and germline functions of piRNAs. Development 135: 3–9. KREK, A. et al., 2005. Combinatorial microRNA target predictions. Nat Genet 37: 495–500. KROL, J., SOBCZAK, K., WILCZYNSKA, U., DRATH, M., JASINSKA, A., KACZYNSKA, D., and KRZYZOSIAK, W.J., 2004. Structural features of microRNA (miRNA) precursors and their relevance to miRNA biogenesis and small interfering RNA/short hairpin RNA design. J. Biol. Chem. 279: 42230–42239. KUHLMANN, M., POPOVA, B., and NELLEN, W., 2006. RNA interference and antisense-mediated gene silencing in Dictyostelium. Methods Mol. Biol. 346: 211–226. LAGOS-QUINTANA, M., RAUHUT, R., LENDECKEL, W., and TUSCHL, T., 2001. 
Identiﬁcation of novel genes coding for small expressed RNAs. Science 294: 853–858. LAI, E.C., TOMANCAK, P., WILLIAMS, R.W., and RUBIN, G.M., 2003. Computational identiﬁcation of Drosophila microRNA genes. Genome Biol. 4: R42. LAU, N.C., LIM, L.P., WEINSTEIN, E.G., and BARTEL, D.P., 2001. An abundant class of tiny RNAs with probable regulatory roles in Caenorhabditis elegans. Science 294: 858–862. LAU, N.C., SETO, A.G., KIM, J., KURAMOCHI-MIYAGAWA, S., NAKANO, T., BARTEL, D.P., and KINGSTON, R.E., 2006. Characterization of the piRNA complex from rat testes. Science 313: 363–367. LECELLIER, C.H., DUNOYER, P., ARAR, K., LEHMANN-CHE, J., EYQUEM, S., HIMBER, C., SAI¨B, A., and VOINNET, O., 2005. A cellular microRNA mediates antiviral defense in human cells. Science 308: 557–560.

LEE, C.T., RISOM, T., and STRAUSS, W.M., 2007. Evolutionary conservation of microRNA regulatory circuits: an examination of microRNA gene complexity and conserved microRNA-target interactions through metazoan phylogeny. DNA Cell Biol. 26: 209–218. LEE, R.C. and AMBROS, V., 2001. An extensive class of small RNAs in Caenorhabditis elegans. Science 294: 862–864. LEE, R.C., FEINBAUM, R.L., and AMBROS, V., 1993. The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell 75: 843–854. LEE, Y., KIM, M., HAN, J., YEOM, K.H., LEE, S., BAEK, S.H., and KIM, V.N., 2004. MicroRNA genes are transcribed by RNA polymerase II. EMBO J. 23: 4051–4060. LEGENDRE, M., LAMBERT, A., and GAUTHERET, D., 2005. Proﬁle-based detection of microRNA precursors in animal genomes. Bioinformatics 21: 841–845. LEWIS, B.P., BURGE, C.B., and BARTEL, D.P., 2005. Conserved seed pairing, often ﬂanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120: 15–20. LI, J., YANG, Z., YU, B., LIU, J., and CHEN, X., 2005. Methylation protects miRNAs and siRNAs from a 30 end uridylation activity in Arabidopsis. Curr. Biol. 15: 1501–1507. LI, S.C., TANG, P., and LIN, W.C., 2007. Intronic microRNA: discovery and biological implications. DNA Cell Biol. 26: 195–207. LIM, L.P., LAU, N.C., WEINSTEIN, E.G., ABDELHAKIM, A., YEKTA, S., RHOADES, M.W., BURGE, C.B., and BARTEL, D. P., 2003. The microRNAs of Caenorhabditis elegans. Genes Dev. 17: 991–1008. LINDOW, M. and KROGH, A., 2005. Computational evidence for hundreds of non-conserved plant microRNAs. BMC Genomics 6: 119. L€oFFLER, D. et al., 2007. Induction of microRNA-21 contributes to the oncogenic potential of Stat3. Blood 110: 1330–1333. LONG, D., LEE, R., WILLIAMS, P., CHAN, C.Y., AMBROS, V., and DING, Y., 2007. Potent effect of target structure on microRNA function. Nat. Struct. Mol. Bio. 14: 287–294. LUND, E., GU¨TTINGER, S., CALADO, A., DAHLBERG, J., and KUTAY, U., 2004. 
Nuclear export of microRNA precursors. Science 303: 95–98. MACRAE, I.J. and DOUDNA, J.A., 2007. Ribonuclease revisited: structural insights into ribonuclease iii family enzymes. Curr. Opin. Struct. Biol. 17: 138–145. MAHER, C., STEIN, L., and WARE, D., 2006. Evolution of Arabidopsis microRNA families through duplication events. Genome Res. 16: 510–519. MALLICK, B., GHOSH, Z., and CHAKRABARTI, J., 2008. MicroRNA switches in Trypanosoma brucei. Biochem. Biophys. Res. Commun. 372: 459–463. MATHONNET, G. et al., 2007. MicroRNA inhibition of translation initiation in vitro by targeting the capbinding complex eIF4F. Science 317: 1764–1767.

References MEGRAW, M., BAEV, V., RUSINOV, V., JENSEN, S.T., KALANTIDIS, K., and HATZIGEORGIOU, A.G., 2006. MicroRNA promoter element discovery in Arabidopsis. RNA 12: 1612–1619. MIRANDA, K.C., HUYNH, T., TAY, Y., ANG, Y.S., TAM, W.L., THOMSON, A.M., LIM, B., and RIGOUTSOS, I., 2006. A pattern-based method for the identiﬁcation of microRNA binding sites and their corresponding heteroduplexes. Cell 126: 1203–1217. MOLNA´R A., SCHWACH, F., STUDHOLME, D., THUENEMANN, E.C., and BAULCOMBE, D.C., 2007. miRNAs control gene expression in the single-cell alga Chlamydomonas reinhardtii. Nature 447: 1126–1129. MU¨CKSTEIN, U., TAFER, H., HACKERMU¨LLER, J., BERNHART, S.H., STADLER, P.F., and HOFACKER, I.L., 2006. Thermodynamics of RNA-RNA binding. Bioinformatics 22: 1177–1182. MURPHY, D., DANCIS, B., and BROWN, J.R., 2008. The evolution of core proteins involved in microRNA biogenesis. BMC Evol. Biol. 8: 92–92. NAIR, V. and ZAVOLAN, M., 2006. Virus-encoded microRNAs: novel regulators of gene expression. Trends Microbiol. 14: 169–175. NAPOLI, C., LEMIEUX, C., and JORGENSEN, R., 1990. Introduction of a chimeric chalcone synthase gene into petunia results in reversible co-suppression of homologous genes in trans. Plant Cell 2: 279–289. OBERNOSTERER, G., LEUSCHNER, P.J., ALENIUS, M., and MARTINEZ, J., 2006. Post-transcriptional regulation of microRNA expression. RNA 12: 1161–1167. OHLER, U., YEKTA, S., LIM, L.P., BARTEL, D.P., and BURGE, C. B., 2004. Patterns of ﬂanking sequence conservation and a characteristic upstream motif for microRNA gene identiﬁcation. RNA 10: 1309–1322. OKAMURA, K., CHUNG, W.J., and LAI, E.C., 2008a. The long and short of inverted repeat genes in animals: microRNAs, mirtrons and hairpin RNAs. Cell Cycle 7: 2840–2845. OKAMURA, K., CHUNG, W.J., RUBY, J.G., GUO, H., BARTEL, D.P., and LAI, E.C., 2008b. The Drosophila hairpin RNA pathway generates endogenous short interfering RNAs. Nature 453: 803–806. 
OKAMURA, K., HAGEN, J.W., DUAN, H., TYLER, D.M., and LAI, E.C., 2007. The mirtron pathway generates microRNA-class regulatory RNAs in Drosophila. Cell 130: 89–100. OKAMURA, K., PHILLIPS, M.D., TYLER, D.M., DUAN, H., CHOU, Y.T., and LAI, E.C., 2008c. The regulatory activity of microRNA species has substantial inﬂuence on microRNA and 30 UTR evolution. Nat. Struct. Mol. Biol. 15: 354–363. ONG, S.E., BLAGOEV, B., KRATCHMAROVA, I., KRISTENSEN, D.B., STEEN, H., PANDEY, A., and MANN, M., 2002. Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol. Cell Proteomics 1: 376–386. OTSUKA, M. et al., 2007. Hypersusceptibility to vesicular stomatitis virus infection in dicer1-deﬁcient mice is due to impaired mir24 and mir93 expression. Immunity 27: 123–134.






Chapter 16

Phylogenetic Utility of RNA Structure: Evolution’s Arrow and Emergence of Early Biochemistry and Diversified Life

Feng-Jie Sun, Ajith Harish, and Gustavo Caetano-Anolles

16.1 INTRODUCTION
16.2 STRUCTURAL CHARACTERS AND DERIVED PHYLOGENETIC TREES
16.3 APPLICATIONS
16.4 CONCLUSIONS
ACKNOWLEDGMENTS
REFERENCES

Evolutionary Genomics and Systems Biology, edited by Gustavo Caetano-Anolles. Copyright 2010 John Wiley & Sons, Inc.

16.1 INTRODUCTION

A common method to establish evolution’s arrow in phylogenetic trees that describe the history of organismal diversification (with leaves of the trees representing “ingroup” taxa) is to identify outlier groups known as “outgroups.” These distantly related taxa represent external hypotheses of relationships that need to be amply justified and supported by external assumptions (Maddison et al., 1984). Outgroups are widely used in phylogenetic studies at taxonomic levels below superkingdom (i.e., within lineages of Archaea, Bacteria, or Eukarya). Selection of outgroups is a nontrivial and often troublesome exercise, especially when branches are deep in the trees. Choosing an appropriate outgroup matters because the choice ultimately determines whether a correct phylogeny is reconstructed and affects how that phylogeny is interpreted (Maddison et al., 1984). This is because selection of outgroups can alter optimization of phylogenetic characters (the biological features that are under evolutionary study) in the reconstruction of ingroup relationships (Dalevi et al., 2001; O’Brien et al., 2002; Rota-Stabelli and Telford, 2008). Furthermore, when the tree of life is reconstructed at the superkingdom level, it is difficult, if not impossible, to identify an outgroup and subsequently determine the evolutionary direction on the unrooted global phylogenetic tree (Baldauf et al., 1996). The long-standing controversy over the root of the tree of life is a well-known issue in the study of the origins and evolution of early life on Earth (Forterre and Philippe, 1999). Various rooting scenarios have been explored. For example, using elongation factor proteins EF-Tu and EF-G, Baldauf et al. (1996) placed the root of the universal tree of life between Bacteria and Archaea/Eukarya. This rooting is also supported, for example, by a study of aminoacyl-tRNA synthetases (Brown and Doolittle, 1995). While striking similarities exist between Archaea and Eukarya in many aspects of DNA, RNA, and protein synthesis, there are notable exceptions, exemplified by analyses of 16S and 23S ribosomal RNA (rRNA) molecules, for which Archaea are closer to Bacteria in sequence (Leffers et al., 1987; Olsen et al., 1994). However, the availability of a large number of completely sequenced genomes has caused a paradigm shift, in which the focus has moved from individual molecules to entire molecular repertoires. For example, recent phylogenomic studies using protein domains at fold and fold superfamily levels of molecular structure placed the root of the tree of life between Eukarya and Archaea/Bacteria (Caetano-Anolles and Caetano-Anolles, 2003; Wang and Caetano-Anolles, 2006). This rooting scenario, originally proposed by Forterre and Philippe (1999), suggests that many fundamental eukaryotic features are ancient (Penny and Poole, 1999).
Interestingly, while phylogenomic analysis of protein domain structure revealed a rooting in Eukarya, trees of architectures (with leaves representing protein domains) derived from these same data uncovered very early reductive evolutionary tendencies in archaeal proteomes (Wang et al., 2007). These observations provide strong support for a very early origin of the archaeal lineage. Moreover, evolutionary studies of transfer RNA (tRNA) also challenged the popular Bacteria-rooting scenario by rooting the tree of life at various groups in Archaea (Xue et al., 2003, 2005; Di Giulio, 2007; Wong et al., 2007; Sun and Caetano-Anolles, 2008b, 2009). To date, the rooting of the tree of life and the origins of diversified life remain controversial, mostly due to our inability to root trees and distinguish between base and crown group taxa that are both extinct and extant (see Chapters 1 and 4) but also because of limited phylogenetic signal in sequences. Together with the unresolved issue of the tree of life, the monophyly of Archaea is also critically debated (monophyly refers to ingroup taxa sharing a common ancestor). In general, rRNA has been the most widely used and influential source of evidence for supporting the monophyly of Archaea. However, this evidence is questionable, especially because strong support for monophyly is lacking from both small subunit (Barnes et al., 1996) and large subunit (LSU) rRNA (De Rijk et al., 1995). In contrast, other studies suggest that Archaea is likely paraphyletic (Baldauf et al., 1996; F.-J. Sun, K.M. Kim, and G. Caetano-Anolles, unpublished).
Recently, there has been a growing tendency to apply molecular structure evidence in phylogenetic analysis, either by using RNA secondary structures to improve the alignment of sequences (Higgs, 2000; Higgs et al., 2003) or by directly embedding structural characters in phylogenetic analysis (Billoud et al., 2000; Collins et al., 2000; Caetano-Anolles, 2001, 2002a, 2002b, 2005; Pollock, 2003; Swain and Taylor, 2003; Grajales et al., 2007; Sun et al., 2007; Sun and Caetano-Anolles, 2008a, 2008b, 2008c, 2008d, 2009). These applications of RNA structure are made possible by, and benefit from, three fundamental biological tendencies, which we consider here axiomatic: (i) RNA structure is far more conserved than sequence, and is therefore advantageous when used to study ancient events in the history of life; (ii) there is a universal tendency toward molecular order that is supported by thermodynamic, statistical, and phylogenetic arguments; and (iii) successfully implemented biological designs tend to be reused in nature, generally through recruitment by takeovers or cooptions. These tendencies, which we will consider subsequently, are exploited in an award-winning phylogenetic approach that uses RNA structural information to reconstruct the evolutionary history of macromolecules such as rRNA (Caetano-Anolles, 2002a, 2002b). This method has also been applied to tRNA, the ancient and central player of translation, to explore long-standing evolutionary questions, such as the origin and evolution of the genetic code, amino acid charging, early life and viruses, and the evolutionary significance of the long variable arm in tRNA structure (Sun and Caetano-Anolles, 2008a, 2008b, 2008c, 2008d). It has also been applied to other RNA molecules (Collins et al., 2000; Caetano-Anolles, 2002a, 2002b, 2005; Swain and Taylor, 2003; Grajales et al., 2007; Sun et al., 2007; Sun and Caetano-Anolles, 2009). Similarly, a focus on the structure of individual protein domains or the organization of domains in multidomain proteins in the proteomes of hundreds of completely sequenced organisms has uncovered remarkable patterns in the diversification of cellular life (Caetano-Anolles and Caetano-Anolles, 2003; Wang and Caetano-Anolles, 2006, 2009; Wang et al., 2007). These methods have also been used to study molecular networks such as modern metabolism (Caetano-Anolles et al., 2007).
Here we discuss the generation of global phylogenies that incorporate information from RNA structure and how these have the potential to uncover patterns associated with the emergence of modern biochemistry and early life. Since evolution’s arrow and the rooting of the tree of life are central and unavoidable issues of evolutionary biology, we discuss how the use of RNA structural evidence can be developed and improved to gain insights into deep evolution of ancient molecules. We will show that these studies consistently and congruently identify the same or similar basal groups in different systems (Xue et al., 2003, 2005; Wong et al., 2007; Sun and Caetano-Anolles, 2008b, 2009). We summarize these results and discuss the tree rooting strategy and the role of thermodynamics in life.

16.1.1 A Novel Phylogenetic Approach Based on Macromolecular Structure

In the cladistic method of Caetano-Anolles (2002a), structural components of RNA secondary structure are coded as polarized and linearly ordered multistate characters, following a model of character state transformation that favors molecular order and stability. Results show that considerable phylogenetic signal is present in RNA secondary structure, and that this information is capable of grouping organisms in highly divergent lineages. The intrinsically rooted phylogenies reconstructed from evolved structure match those derived from nucleic acid sequences at all taxonomical levels and group organisms in concordance with traditional classification, especially in superkingdoms Archaea and Eukarya. Specifically, phylogenetic analysis of secondary structure of the small and large rRNA subunits reconstructs a universal tree of life that branched in three monophyletic groups corresponding to Archaea, Bacteria, and Eukarya, and that is rooted in the eukaryotic branch. These results suggest that it is equally parsimonious to consider that ancestral unicellular eukaryotes or prokaryotes (herein referred to as akaryotes) gave rise to all extant

Chapter 16

Phylogenetic Utility of RNA Structure

life-forms. Results challenge the “prokaryotic dogma” stating that the last universal common ancestor, the “cenancestor” proposed by Fitch and Upper (1987), was bacterial-like in genomic and cellular organization, and support a eukaryotic-like ancestry (Reanney, 1974; Darnell, 1978; Doolittle, 1978) and the hypothesis that many akaryotic features originated by simplification through gene loss and nonorthologous displacement (Forterre and Philippe, 1999). Note that a eukaryotic rooting has been proposed for other molecules (e.g., signal recognition particles; Brinkmann and Philippe, 1999) and provides a simple explanation for the close genomic relationship of Archaea and Bacteria (Koonin et al., 1997).

RNA molecules are mostly composed of stem regions (helices) that stabilize the molecule by establishing hydrogen-bonded base-pairing interactions and unpaired regions (hairpin and internal loops, bulges, multiloops, and free ends) that generally destabilize the molecule (Figure 16.1). In the process of reconstructing phylogenetic trees of organisms, RNA structural characters have multiple discrete states, are linearly ordered, and are polarized by fixing the direction of character state change using a transformation sequence that distinguishes ancestral states as those more stable thermodynamically (e.g., larger state values for stems and smaller state values for hairpins, bulges, and unpaired regions). The model of character state transformation is based on the hypothesis that evolved RNA molecules are optimized to produce highly stable folded conformations with increased conformational order. This optimization process increases favorable and decreases nonfavorable inter- and intramolecular interactions and restricts alternative outcomes of the RNA folding process. The model assumes a priori that individual substructures evolve independently of each other.
The structure-based cladistic method is a new tool for phylogenetic inference that complements classical methods of primary sequence comparison. One of its unique features is its ability to produce rooted topologies capable of establishing direction of evolutionary change at the molecular level. This can be very useful in applications where suitable outgroups or paralogous sequences are not available. The true potential of the method has since been explored with several important RNA molecules, including tRNA, short interspersed element (SINE) RNA, 5S rRNA, and ribonuclease (RNase) P RNA. Furthermore,

Figure 16.1 The secondary structure of 5S rRNA from Haloarcula marismortui (Ban et al., 2000) showing different stem (helices) and unpaired (loops) substructural components. Unpaired segments include hairpin loops (loops C and D), internal loops (e.g., loops B and E), multiloops (loop A), and bulges.

16.2 Structural Characters and Derived Phylogenetic Trees

an ongoing analysis with a comprehensive sampling of over 13,000 sequences of rRNA has demonstrated that the evolutionary patterns previously identiﬁed are still recovered and an evolutionary model of rRNA structure is currently under construction (A. Harish and G. Caetano-Anolles, unpublished; see Section 16.3.5). All these RNA molecules are evolutionarily, structurally, and/or functionally closely related to each other, and their molecular functions are directly or indirectly linked to the process of protein synthesis. Models for the evolution of these molecules have been proposed based on phylogenetic trees of substructures (see below). Furthermore, several common evolutionary trends were identiﬁed among these types of RNAs.

16.1.2 Broadened Utility of Constraint Analysis

We recently broadened the use of phylogenetic constraints in the analysis of tRNA sequence and structure (Sun and Caetano-Anolles, 2008b, 2008c). In constraint analysis, the search for optimal trees is restricted to prespecified tree topologies defining sets of taxa that share a common ancestor (monophyletic groups). We then test alternative or nonmutually exclusive hypotheses. The number of additional steps (S) required to force (constrain) particular taxa into a monophyletic group, obtained, for example, with the enforce topological constraint option in PAUP* (Swofford, 2003), defines a lineage coalescence distance that can be used to test alternative hypotheses or define evolutionary timelines. In these timelines, lower values of S correspond to ancient groups (this trend was derived from the rooted trees and embedded assumptions of character polarization). In cybernetics, this type of analysis represents a formal method to decompose a reconstructable system into its components (Ashby, 1956). Although this method is not new and is widely used in phylogenetic analyses to test, for example, hypotheses of monophyly (Doyle, 2006), to our knowledge its use to dissect systematically the patterns in a phylogenetic tree is a novel application. The evolutionary timeline is generated starting from a smaller S value (more ancient) to a larger S value (more recent). Two fundamental assumptions support constraint analyses of tRNA molecules. First, we assume that tRNA structures acquired new identities and functions as the genetic code expanded and that different structures were coopted to perform functions in different lineages and different functional contexts. This assumption seems reasonable because recruitment processes are quite common in the evolution of macromolecules. Second, we assume that old tRNA structures developed or recruited new functions (cooptions) more often than new tRNA structures acquired old functions (takeovers).
This assumption is also reasonable and appears to be supported by studies of enzyme recruitment in metabolism (Caetano-Anolles et al., 2007, 2008).
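The arithmetic behind a constraint-based timeline can be sketched in a few lines. The example below is only an illustration under stated assumptions: the tree lengths and group names are hypothetical, and in practice the lengths would come from separate unconstrained and constrained parsimony searches in a program such as PAUP*. S is taken as the constrained tree length minus the unconstrained tree length, and groups are ordered from small S (more ancient) to large S (more recent), as described above.

```python
from typing import Dict, List, Tuple


def coalescence_distances(unconstrained_length: int,
                          constrained_lengths: Dict[str, int]) -> Dict[str, int]:
    """Return S, the number of additional steps needed to force each group
    of taxa into a monophyletic group."""
    distances = {}
    for group, length in constrained_lengths.items():
        if length < unconstrained_length:
            # A constrained search cannot find a tree shorter than the optimum.
            raise ValueError(f"constrained length for {group!r} is below the optimum")
        distances[group] = length - unconstrained_length
    return distances


def timeline(distances: Dict[str, int]) -> List[Tuple[str, int]]:
    """Order groups from smaller S (more ancient) to larger S (more recent)."""
    return sorted(distances.items(), key=lambda item: (item[1], item[0]))


# Hypothetical tree lengths (in steps); these are not values from the chapter.
optimal_length = 1000
constrained = {"group_A": 1012, "group_B": 1003, "group_C": 1040}

S = coalescence_distances(optimal_length, constrained)
for group, steps in timeline(S):
    print(group, steps)
```

Ties in S are broken alphabetically here only to make the ordering deterministic; the chapter does not say how ties are handled.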

16.2 STRUCTURAL CHARACTERS AND DERIVED PHYLOGENETIC TREES

Figure 16.2 illustrates through a flow diagram the rationale and steps involved in an evolutionary study of RNA structure. In the example, tRNA structures are collected, coded, and used to build data matrices with elements describing geometrical or statistical features of structure. Phylogenetic trees are then generated that depict the evolution of molecules and substructures. The trees are finally used to build evolutionary heat maps, in which ancestries are traced directly onto two-dimensional (planar representations) or three-dimensional models of structure.

Figure 16.2 Phylogenetic reconstruction of trees of molecules and substructures. The structure of an RNA molecule (illustrated with tRNAs) can be decomposed into substructures, such as coaxial stem tracts and unpaired regions, which can be studied using features (characters) that describe their geometry (e.g., length of stems or unpaired regions) or their branching, stability, and uniqueness (e.g., statistical parameters describing a morphospace). These “shape” and “statistical” characters are coded and assigned “character states” according to an evolutionary model that polarizes character transformation toward an increase in molecular order (character argumentation). Coded characters (s) are arranged in data matrices, which can be transposed and subjected to cladistic analyses to generate phylogenies of either molecules or substructures. Rooted trees are used to generate evolutionary heat maps of secondary structure that color 3D structural models with molecular ancestries.
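The transposition step mentioned in the caption can be illustrated with a toy matrix. This is a minimal sketch with hypothetical molecule names, substructure labels, and character states: in one orientation the rows (taxa) are molecules and the columns (characters) are substructures, and transposing the matrix swaps these roles so that substructures become the taxa.

```python
from typing import List, Tuple


def transpose_matrix(rows: List[List[str]],
                     row_names: List[str],
                     col_names: List[str]) -> Tuple[List[List[str]], List[str], List[str]]:
    """Swap taxa and characters in a rectangular matrix of coded states."""
    if len(rows) != len(row_names):
        raise ValueError("one name is required per row")
    if any(len(row) != len(col_names) for row in rows):
        raise ValueError("every row needs one state per column")
    transposed = [list(column) for column in zip(*rows)]
    # Former column names label the new rows, and vice versa.
    return transposed, col_names, row_names


# Hypothetical coded states for three molecules and four substructures.
molecules = ["tRNA_1", "tRNA_2", "tRNA_3"]
substructures = ["stem_1", "loop_1", "stem_2", "loop_2"]
matrix = [
    ["7", "4", "5", "7"],
    ["7", "3", "5", "8"],
    ["6", "4", "4", "7"],
]

by_substructure, taxa, characters = transpose_matrix(matrix, molecules, substructures)
for name, states in zip(taxa, by_substructure):
    print(name, "".join(states))
```

Either orientation can then be passed to the same parsimony software; only the meaning of rows and columns changes.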

16.2.1 Character Coding

Sets of character attributes are used to describe RNA structures that are inferred from nucleic acid sequence by comparative sequence analysis, comparison with crystallographic models, and other criteria. These features are decomposed into substructural components and characterized and coded using an alphanumerical format (numbers 0–9 and letters A–Z) suitable for cladistic analysis (Caetano-Anolles, 2001, 2002a). Two different kinds of attributes are used to characterize RNA structure numerically, “shape” and “statistical” characters. Shape characters (also referred to as “geometrical” characters) describe the geometry of the molecules by measuring, for example, the length in nucleotides of each spatial component of secondary structure. These components include double helical stems, hairpin loops, bulges and interior loops, and unpaired segments such as 5′- or 3′-free ends, connecting joints, and multiloop sequences separating stems. In general, the topographic correspondence of the substructures is the main criterion for determining character homology. We do not focus on unusual base pairings or noncovalent interactions important for RNA tertiary structure, because there are not enough crystallographic models available for comparative analysis of these interactions. Statistical characters describe the branching, stability, and plasticity (uniqueness) of the molecules. A molecular morphospace defined quantitatively by three statistical parameters, the Shannon entropy of the base-pairing

probability matrix (Q), base pairing propensity (P), and the mean length of helical stem structures (S) of RNA sequences, depicts the degree of conformational order of the molecular components as a point in a 3D order–disorder space (Schultes et al., 1999). The calculations of Q, P, and S were conducted by using the program STOAT (Knudsen and Caetano-Anolles, 2008). These characters take advantage of molecular mechanics aspects of RNA, such as molecular ensembles that measure conflicting molecular interactions in RNA folding. To make characters useful, they need to be appropriately coded so that they can provide maximum phylogenetic signal.
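As a rough illustration of how shape characters might be scored, the sketch below measures stem and hairpin loop lengths from a secondary structure and codes each length with the alphanumerical symbols 0–9 and A–Z. Several things here are assumptions rather than the published procedure: the input is a pseudoknot-free structure in dot-bracket notation, a stem is taken to be a run of directly stacked base pairs, and the function names are hypothetical.

```python
from typing import List, Tuple

SYMBOLS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # states 0..35


def encode_state(value: int) -> str:
    """Code a length (in nucleotides or base pairs) as one alphanumeric symbol."""
    if not 0 <= value < len(SYMBOLS):
        raise ValueError("length outside the 0-9, A-Z coding range")
    return SYMBOLS[value]


def base_pairs(dot_bracket: str) -> List[Tuple[int, int]]:
    """Return (i, j) index pairs for matching brackets, sorted by i."""
    stack, pairs = [], []
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), i))
        elif ch != ".":
            raise ValueError(f"unexpected symbol {ch!r}")
    if stack:
        raise ValueError("unbalanced structure")
    return sorted(pairs)


def stem_lengths(dot_bracket: str) -> List[int]:
    """Length, in base pairs, of each run of directly stacked pairs."""
    pairs = set(base_pairs(dot_bracket))
    lengths = []
    for i, j in sorted(pairs):
        if (i - 1, j + 1) in pairs:
            continue  # not the outermost pair of its stack
        n = 0
        while (i + n, j - n) in pairs:
            n += 1
        lengths.append(n)
    return lengths


def hairpin_loop_lengths(dot_bracket: str) -> List[int]:
    """Number of unpaired nucleotides enclosed by each hairpin-closing pair."""
    return [j - i - 1 for i, j in base_pairs(dot_bracket)
            if set(dot_bracket[i + 1:j]) <= {"."}]


# Hypothetical structure: a 4-bp hairpin followed by an 11-bp hairpin.
structure = "(" * 4 + "." * 3 + ")" * 4 + ".." + "(" * 11 + "." * 4 + ")" * 11
stems = [encode_state(n) for n in stem_lengths(structure)]
loops = [encode_state(n) for n in hairpin_loop_lengths(structure)]
print(stems, loops)
```

The homology of columns across molecules still has to be established separately (the text relies on topographic correspondence); this sketch only shows the measurement and symbol coding.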

16.2.2 Character Polarization

We invoke an evolutionary tendency toward molecular order to polarize character state change in the resulting phylogenetic trees and produce trees that are intrinsically rooted so that we are able to define lineages that are either ancient or derived. Polarization identifies the ancestral states in the character transformation series and results in reversible character transformation sequences that are directional and show asymmetry between gains and losses (see below). Specifically, distinguishing ancestral states as states that are thermodynamically more stable polarizes phylogenetic trees. Maximum and minimum character states are defined as the ancestral states for structures that stabilize (e.g., stems, modified bases, and G:U base pairs) and destabilize (e.g., bulges, hairpin loops, and other unpaired regions) RNA molecules, respectively. The model of character state transformation follows an evolutionary trend that results in molecules that are less plastic (more unique) but more modular (sensu Ancel and Fontana (2000)). RNAs with structures that are more ordered are therefore considered ancestral (plesiomorphic). The validity of character argumentation has not only been discussed in detail elsewhere (Caetano-Anolles, 2000, 2001, 2002a, 2002b, 2005; Sun et al., 2007; Caetano-Anolles et al., 2008; Sun and Caetano-Anolles, 2008a, 2008b, 2008c, 2008d, 2009) and briefly reviewed (Sun and Caetano-Anolles, 2008e) but is also supported by a considerable body of theoretical and experimental evidence:

i. Molecular mechanics. The tendency toward molecular order has been supported by many studies that focus on molecular mechanics.
Comparative studies of extant and randomized sequences show that evolution enhances conformational order and diminishes conflicting molecular interactions over those intrinsically acquired by self-organization (Stegger et al., 1984; Le and Maizel, 1989; Higgs, 1993, 1995; Schultes et al., 1999; Steffens and Digby, 1999; Gultyaev et al., 2002; Caetano-Anolles, 2005). Indeed, randomizations of mono- and dinucleotides have been used to dissect the effects of composition and order of nucleotides on the stability of folded nucleic acid molecules and uncover evolutionary processes acting at RNA and DNA levels (Forsdyke, 2007). In recent bench experiments, extant evolved RNA molecules encoding complex and functional structural folds were compared to oligonucleotides corresponding to randomized counterparts (Schultes et al., 2005). Results show that arbitrary sequences, unlike evolved molecules, were prone to having multiple competing conformations. In contrast to arbitrary proteins, which rarely fold into well-ordered structures (Hecht et al., 2004), these arbitrary RNA sequences were, however, quite soluble and compact and appeared delimited by physicochemical constraints such as nucleotide composition that were inferred in previous computational studies (Schultes et al., 1999).

ii. Thermodynamics. The molecular tendency toward order that drives biological change can be found linked to fundamental concepts in thermodynamics (see Chapter 20). This tendency has been verified experimentally using thermodynamic principles generalized to account for nonequilibrium conditions (Gladyshev and Ershov, 1982). In fact, the impact of thermodynamic principles in living systems (the “building order from disorder” concept championed by Schrödinger (1944)) manifests through optimization of more modern thermodynamic quality descriptors of energy gradients (e.g., maximization of energy) in nonequilibrium systems that are open to flows of energy and matter (Schneider and Kay, 1994a, 1994b). This optimization results in more efficient degradation of incoming (solar) energy through autocatalytic, self-assembly, reproduction, evolution, and adaptation processes acting on molecular structures, all of which enhance the order of the system to decrease energy gradients and oppose disequilibrium (in line with the second law of thermodynamics). It also has important consequences for evolution of molecular structure and the mapping of sequence to structure spaces, representing different levels of biological organization. For example, RNA molecules have low informational entropy in sequence space, but in structure space highly evolvable phenotypes are also more entropic (Wagner, 2008). These results suggest that increases in the order at one level of organization are counteracted by decreases in the order of the next. This relationship ultimately encourages escape (evolvability) from constraints of order (stasis through structural canalization). Note that a large body of theoretical evidence supports these sequence-to-structure mappings and their consequences on the energetic and kinetic landscapes of the evolving molecules (Ancel and Fontana, 2000; Higgs, 2000; Fontana, 2002).
Furthermore, some important predictions have already been confirmed experimentally by in vitro evolution of ribozymes (Schultes and Bartel, 2000).

iii. Cosmology. A tendency towards order is also supported by dissipation tendencies in energy and matter that exist in an open cosmological model of the Friedmann type (see Chapter 20). Under this model, the universe expands faster than its contents can equilibrate, turning the nearly homogeneous hot gas at the beginning of the big bang into clumps of energy-dissipating matter that acquire more and more elaborate and finer grained properties (Layzer, 1970; Frautschi, 1982). This emerging structure ultimately materializes in ordered structures and life. Note that the big bang model is supported by three observational pillars: the motion of galaxies away from each other, the cosmic microwave background radiation, and the relative quantities of light chemical elements (e.g., He, H) in cosmic gas.

iv. Phylogenetics. Finally, and more importantly, tendencies toward structural order and the rooting of trees have been experimentally supported by phylogenetic congruence of phylogenies reconstructed using geometrical and statistical structural characters (Caetano-Anolles, 2005; Sun et al., 2007; Sun and Caetano-Anolles, 2008a) and from sequence, structure, and genomic rearrangements at different taxonomical levels, which also match statements from traditional organismal classification (Billoud et al., 2000; Collins et al., 2000; Caetano-Anolles, 2002a, 2002b, 2005; Swain and Taylor, 2003; Sun et al., 2007). For example, the same phylogenies are produced when using characters that describe the topology of tRNA or characters that describe a molecular morphospace (Schultes et al., 1999) defined by the Q, P, and S statistics (Sun and Caetano-Anolles, 2008a). Remarkably, tests in which characters were polarized in the

opposite direction generated phylogenetic trees that were less parsimonious and had topologies incompatible with accepted taxonomical knowledge (e.g., Caetano-Anolles, 2002a, 2005). Similarly, congruent reconstructions from RNA structure and orthology of large-scale recombination events in grass genomes also strongly support assumptions of polarization in character argumentation (Caetano-Anolles, 2005). Other more indirect results derived from using our focus on structure also proved to be congruent, such as hypotheses of organismal origin that used global trees of tRNA structures and constraint analysis (Sun and Caetano-Anolles, 2008b) and phylogenies of proteomes derived from an analysis of protein structure in entire genomic complements (Wang et al., 2007). Many new instances of congruence from ongoing phylogenetic studies (unpublished data from F.-J. Sun, A. Harish, M. Wang, K.M. Kim, and G. Caetano-Anolles) consistently support our hypothesis of polarization.

Note that order is seldom achieved in frustrated systems that are driven by the energetics of conformation and stability, such as RNA or proteins, and that while the proposed generalized trend in structure appears valid by the evidence outlined above, we do not know the nature and stability of selective preferences or constraints acting on primordial RNA during the early stages of evolution of these molecules. In this regard, we speculate that increases in redundancy could have driven selection of beneficial traits in molecules during early life, for example, under a phenotypic model in which increasingly adaptive phenotypes evolve (Zhu and Freeland, 2006). Character attributes represent transformation pathways and hypotheses of relationship that are falsifiable and link character states to each other using fundamental evolutionary assumptions or axioms (Bryant, 1991).
We use the auxiliary assumption of an evolutionary tendency toward order (hypothesis of polarization), which may represent an accurate depiction of important generalized trends. We caution, however, that the model may fail to explain exceptions and departures from the trend (Caetano-Anolles, 2002b). Furthermore, multistate characters transform from one character state to another in linearly ordered and reversible pathways. This order implies a distance relationship between character states in which costs related to the transformation of two nonneighboring states are larger than one step. In fact, costs are initially simply the difference between the integers that describe the character states. However, polarization identifies the ancestral states in the character transformation series and this results in characters that are directional and show asymmetry between gains and losses. The choice of linearly ordered characters is appropriate for both shape and statistical RNA features. RNA structures normally change in a discrete manner by nucleotide addition or removal, resulting in one-step extension or contraction of topological features. Insertions or deletions are possible but have a higher cost associated with them. They can remove entire substructures and are generally rare events. In our studies, we consider that the cost of insertions and deletions is proportional to their length. Statistical characters represent continuous-valued features, but are gap recoded into discrete entities. Therefore, only linearly ordered characters describe transformations appropriately. Note that geometrical and statistical features cannot be described appropriately using unordered characters.

The evolution of RNA can be further explained by a discrete multidimensional fitness landscape that illustrates how molecules traverse sequence space in search of improvements in function and fitness during evolution (Wright, 1932; Kauffmann, 1993; Schuster et al., 1994).
The highly frustrated nature of RNA folding determines that this landscape is rugged and dynamic. Its ruggedness shapes the pathway of adaptive walks toward evolved

function within constraints imposed by the mapping of sequence into structure (Huynen et al., 1997; Fontana and Schuster, 1998). The biophysical model suggests that RNA molecules are usually trapped in local optima and can only escape by diffusive walks in neutral space. This has been recently demonstrated experimentally in elegant in vitro evolutionary experiments (Schultes and Bartel, 2000). The existence of local optima and a rugged landscape defines a world of suboptimal evolutionary outcomes in which conformation is not perfectly optimized and frustration is diminished by natural selection. These suboptimal conformations are used in comparative exercises to infer phylogenetic relationships (Caetano-Anolles, 2005). Evolved structures are embedded in extant molecules and these are placed at the leaves of the tree. These structures are derived from a common ancestor and consequently have been evolving at the same time. However, history is differentially imprinted on them as they traverse evolutionary landscapes and produce lineages. Consequently, phylogenies presuppose ancestors and ancestral nodes and inferences about them must be derived from extant data with a model of character state transformation. Character polarization defines virtual ancestral–descendant relationships between the structure of molecules that are being studied, and these relationships can be established by connecting a hypothetical ancestral molecule that is also virtual to the base of the tree. According to the model, molecules that have a more ancestral origin will exhibit increased optimality and display structures that are more ordered and less frustrated. In contrast, those that had the chance to explore structural space more efficiently (e.g., by diffusive walks in neutral space) should display suboptimal but more refined structures.
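The coding rules described earlier in this section (linearly ordered states, a cost equal to the difference between state integers, ancestral states at the maximum for stabilizing features and at the minimum for destabilizing ones, and gap recoding of continuous statistics) can be sketched as follows. The function names are hypothetical, and the gap-recoding rule shown, which starts a new state wherever consecutive sorted values differ by more than a chosen threshold, is only one simple variant assumed for illustration; the text does not specify the exact procedure used.

```python
from typing import List, Sequence


def ordered_cost(state_a: int, state_b: int) -> int:
    """Steps between two states of a linearly ordered character."""
    return abs(state_a - state_b)


def ancestral_state(states: Sequence[int], stabilizing: bool) -> int:
    """Polarization rule: the maximum state is ancestral for stabilizing
    features (e.g., stems), the minimum for destabilizing ones (e.g., loops)."""
    return max(states) if stabilizing else min(states)


def gap_recode(values: Sequence[float], min_gap: float) -> List[int]:
    """Recode continuous values into ordered integer states.

    Assumed rule: sort the distinct values and start a new state wherever
    two consecutive values are separated by more than min_gap.
    """
    distinct = sorted(set(values))
    state_of = {}
    state = 0
    for previous, current in zip([None] + distinct, distinct):
        if previous is not None and current - previous > min_gap:
            state += 1
        state_of[current] = state
    return [state_of[v] for v in values]


# Hypothetical stem lengths (stabilizing) and loop lengths (destabilizing).
stem_states = [3, 7, 5]
loop_states = [4, 9, 6]
print(ancestral_state(stem_states, stabilizing=True))    # largest stem state
print(ancestral_state(loop_states, stabilizing=False))   # smallest loop state
print(ordered_cost(2, 5))                                 # nonneighboring states cost more than one step
print(gap_recode([0.10, 0.12, 0.30, 0.31, 0.55], min_gap=0.1))
```

Length-proportional costs for insertions and deletions, which the text also mentions, are not modeled in this sketch.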

16.2.3 Phylogenetic Analysis of RNA Structural Characters

All data matrices for both geometrical and statistical characters are analyzed using equally weighted maximum parsimony (MP) as the optimality criterion (Swofford, 2003). Note that a more realistic weighting scheme should consider, for example, the evolutionary rates of change in structural features. However, this requires the measurement of evolutionary parameters along individual branches of the tree and the development of an appropriate quantitative model. In the absence of this information, it is most parsimonious and preferable to give equal weight to the relative contribution of each character. The preference of solutions that require the least amount of change in MP is particularly appropriate and can outperform maximum likelihood (ML) approaches in certain circumstances (Steel and Penny, 2000). Indeed, MP is precisely ML when character changes occur with equal probability but rates vary freely between characters in each branch. This model is useful when there is limited knowledge about underlying mechanisms linking characters to each other (Steel and Penny, 2000). Furthermore, the use of large multistep character state spaces decreases the likelihood of revisiting the same character state on the underlying tree, making MP statistically consistent. Character reconstruction and character change frequency are examined using MacClade (Maddison and Maddison, 2003). Structural attributes that describe the geometry or the molecular mechanics of RNA molecules were used to generate phylogenetic trees of molecules that portray the evolution of molecular lineages (Figure 16.2).
The general approach has been used, for example, to reconstruct a tree of life from rRNA (Caetano-Anolles, 2002a), trace the evolution of RNA structures in ribosomes (Caetano-Anolles, 2002b), establish deep phylogenetic relationships in Poaceae (Caetano-Anolles, 2005), and study the evolution of retrotransposable elements (Sun et al., 2007) in addition to the studies of tRNA structural evolution (Sun and Caetano-Anolles, 2008a), the diversification of cellular life and viruses (Sun and Caetano-Anolles, 2008b), the evolution of amino acid charging and the genetic code (Sun and Caetano-Anolles, 2008c), and the evolutionary significance of the long variable arm in tRNA (Sun and Caetano-Anolles, 2008d). The method also produces phylogenetic trees of molecular substructures, histories of the structural evolution of RNA molecules (Figure 16.1). These trees are atypical in that they do not describe the evolution of organismal taxa or molecules embedded in them. Instead they describe the evolution of molecular repertoires of RNA components that are present in the organismal world. Very much as with the trees of molecules (Caetano-Anolles, 2002a), these trees are derived directly from RNA structure and are intrinsically rooted. Consequently, the ancestral-derived relationships of substructures embedded in the phylogenies can be used to establish which of them are the most ancient (defining a structural origin) and how individual substructures were added (sequentially or in groups) into an evolving RNA molecule. These trees therefore define a molecular chronology.
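To make the parsimony criterion concrete for linearly ordered characters, the sketch below scores a single character on a fixed, rooted toy tree with the Sankoff dynamic-programming algorithm, using the difference between state integers as the step cost. This is not the tree search performed by PAUP*, which also explores topologies; the tree, the states, and the choice to express polarization by fixing the root state are assumptions made for illustration.

```python
from typing import List, Optional, Union

Tree = Union[int, tuple]  # a leaf is an observed state; an internal node is a tuple of subtrees


def sankoff(tree: Tree, n_states: int) -> List[float]:
    """Return cost[s]: minimum steps in the subtree if its root has state s."""
    if isinstance(tree, int):
        return [0 if s == tree else float("inf") for s in range(n_states)]
    child_costs = [sankoff(child, n_states) for child in tree]
    return [
        sum(min(costs[t] + abs(s - t) for t in range(n_states)) for costs in child_costs)
        for s in range(n_states)
    ]


def tree_length(tree: Tree, n_states: int, root_state: Optional[int] = None) -> float:
    """Steps required by one ordered character on a fixed tree.

    If root_state is given, the ancestral state is fixed at the root (a simple
    way to express polarization in this sketch); otherwise the best root state
    is chosen freely.
    """
    costs = sankoff(tree, n_states)
    return costs[root_state] if root_state is not None else min(costs)


# Hypothetical stem-length states for five molecules on a fixed topology.
toy_tree = ((5, 4), (2, (1, 0)))
print(tree_length(toy_tree, n_states=6))                 # root state unconstrained
print(tree_length(toy_tree, n_states=6, root_state=5))   # maximum state ancestral
print(tree_length(toy_tree, n_states=6, root_state=0))   # minimum state ancestral
```

Summing such per-character lengths over all columns of a data matrix gives the length of a candidate tree under equal weighting.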

16.2.4 Character State Change Frequency in RNA Structural Evidence

Molecular models have been developed to describe evolutionary patterns of change within the individual nucleotide sites (Swofford et al., 1996). Such models are generally necessary to explain variations in the nucleic acid sequences. In contrast, evolutionary models that interpret variation in morphological or RNA structural characters are still sparse. These kinds of characters require examination of variation of parameters such as the frequency of character state change. This information is important for inferring evolutionary patterns of molecular diversification and changes in molecular size of RNA substructural components. In the original cladistic method of Caetano-Anolles (2002a, 2002b), a model of character evolution was inferred from patterns of character change using the “state changes and stasis” feature in MacClade (Maddison and Maddison, 2003) and the resulting step matrix of character transformation was used to reconstruct a refined universal tree of rRNA (Caetano-Anolles, 2002b). The relative frequencies of change were plotted in a bubble diagram and were converted to a transformation type using functions described by Wheeler (1990). The refined model depicted the frustrated energetics of base pairing invoked by the original model of character change. While changes occurred most frequently in single steps, there was a differential behavior of helical stems and unpaired loop regions. In helices (the component stabilizing the rRNA structure), losses were favored over gains for stack lengths of 9–11 bases, with a reverse trend for longer segments. In unpaired regions (the component destabilizing the rRNA structure), gains were favored over losses, with frequencies decreasing with nucleotide length. These trends were especially evident when one-step gains and losses were individually charted for each of the two subunits.
Similar patterns of character change frequency were also revealed in tRNA molecules (Figure 16.3). For example, a high percentage of character state changes occurred in single steps for both stabilizing and destabilizing characters. However, the actual length change was limited in stems (5 bases) and quite extended in unpaired segments. The contrasting probabilities of changes in helix and unpaired regions reﬂect opposing energetic trends. However, results also suggest that there is an evolutionary tendency to optimize molecular size of substructural components that is speciﬁc for each RNA species that is analyzed (F.-J. Sun and Caetano-Anolles, unpublished).
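A tally of the kind summarized in the bubble charts can be sketched as follows. The input is a hypothetical list of (ancestral state, descendant state) pairs, one per branch, such as might be exported from a character reconstruction; the code does not reproduce MacClade's “state changes and stasis” feature, and the function name is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def change_frequencies(changes: Iterable[Tuple[int, int]]
                       ) -> Tuple[Dict[int, float], Dict[int, float]]:
    """Relative frequency of gains and losses, keyed by step size.

    Branches with no change (stasis) are ignored; frequencies are relative
    to the total number of branches on which the character changed.
    """
    gains, losses = Counter(), Counter()
    for ancestor, descendant in changes:
        step = descendant - ancestor
        if step > 0:
            gains[step] += 1
        elif step < 0:
            losses[-step] += 1
    total = sum(gains.values()) + sum(losses.values())
    if total == 0:
        return {}, {}
    return ({k: v / total for k, v in gains.items()},
            {k: v / total for k, v in losses.items()})


# Hypothetical reconstructed changes for one stem character.
branch_changes = [(5, 4), (5, 4), (4, 5), (3, 3), (6, 3)]
gain_freq, loss_freq = change_frequencies(branch_changes)
print(gain_freq, loss_freq)
```

Running the tally separately for stabilizing and destabilizing characters would give the two panels that the text contrasts.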

Figure 16.3 Bubble charts describing the average frequency of character state changes in stabilizing and destabilizing characters for tRNA and rRNA molecules. This information is used in modeling character evolution. For tRNA, stabilizing characters include helices, G:U pairs, and modified bases and destabilizing characters include hairpins, bulges, and unpaired regions. For rRNA, stabilizing characters include helices and destabilizing characters include hairpins, bulges, and unpaired regions.

16.2.5 Major Properties of Phylogenetic Trees Derived from RNA Structure

Trees of molecules and trees of substructures have several desirable properties that are salient and enable direct inferences about the evolutionary origin and history of RNA.

1. Phylogenetic trees are intrinsically rooted, establishing evolution’s arrow. Phylogenetic reconstruction produces trees that are rooted according to models of character transformation that delimit how individual characters transform from one character state to another along the branches of the trees (Caetano-Anolles, 2001, 2002a). Consequently, the derived trees of molecules and substructures are rooted without the need and associated uncertainties of local external hypotheses of relationship (e.g., outgroup or paralogous sequences). Even though we apply different models of character transformation, our analysis, just like any phylogenetic method, still rests on the validity of the phylogenetic models that are used. Most important, the validity of conclusions about molecular origins depends on the

axiomatic component that defines polarization of character transformation, which, as we mentioned, is falsifiable.

2. Terminal leaves in trees of substructures represent a finite repertoire of RNA substructural components. Many kinds of RNA substructures can be recognized as taxa, such as structural domains (e.g., arms in tRNA), stem loops, stems, hairpin loops, internal loops, bulges, unpaired segments in multiloops or external segments, base pairs (e.g., weak G:U pairs), and modified bases. To ensure homology, each kind of RNA substructure is used separately in tree reconstruction. In contrast to the traditional phylogenetic and phylogenomic approaches that generally deal with taxa that cannot be considered finite, RNA substructures represent necessarily a finite set, which is defined by the repertoire present in sampled RNAs. For example, when building organismal phylogenies using morphological or molecular characters, taxa represent species drawn from millions of species that inhabit Earth (Bull et al., 1992). The same is true for gene or genome sequences in phylogenomic analyses. In contrast, atoms in molecules are by definition finite and the diversity of an RNA encountered in nature is also finite.

3. Internal nodes in trees of substructures determine relative chronologies of structural diversification. An internal node delimits the emergence of substructures in the context of other substructures of the molecular repertoire. Consequently, branches in trees describe network-like genealogies in which substructural evolution provides the foundation for evolutionary change. Furthermore, branch lengths measure character state change at the structural level. An interesting note relates to the putative existence of hard polytomies describing simultaneous divergence of substructures. The existence of hard polytomies in a tree of substructures can be explained by molecular duplications or other rearrangements leading to homologous segments in different parts of the RNA molecule.
Their absence provides evidence of gradual buildup of substructures in the course of evolution. 4. Structural characters fulﬁll the requirement of character independence needed in phylogenetic analysis. Characters used to build trees of substructures represent molecular lineages associated with RNA evolving within organismal lineages, deﬁning, for example, a species or tRNA molecules coding for individual amino acids or with speciﬁc codon speciﬁcities. Therefore, molecular lineages and associated characters should be relatively independent of each other, satisfying the requirements of character independence in phylogenetic analyses. It is noteworthy that when considering nucleic acid or protein sequences, nucleotide or amino acid sites are related by a process linked to the evolution of structure and function that is curbed by epigenetic effects, such as the interaction within and between genes (Felsenstein, 1988). We note that similar arguments could be drawn for morphological characters or features describing higher levels of structural organization. Character independence is therefore difﬁcult to achieve in traditional phylogenetic analyses. In contrast, characters used to build trees of substructures represent features in molecular lineages that are, for the most part, evolving independent of each other. Therefore, we argue that in this regard, structural characters are more advantageous than other types of characters that are generally used in phylogenetic reconstructions. 5. The criterion of primary homology is determined by the feature of structure being studied and its associated evolutionary model, and how this feature relates to the substructural taxa analyzed. Features can describe the geometry (e.g., shape

342

Chapter 16

Phylogenetic Utility of RNA Structure

characters) or the branching, stability, and plasticity (e.g., statistical characters) of homologous substructural components. Homologous substructures represent those that are of the same kind (e.g., domains, stems, nucleotide base pairs) and respond to the same evolutionary model deﬁning the character transformation sequences. For example, we reconstruct trees of coaxial stems corresponding to arms in tRNA, separate from trees of hairpin loops or trees of tRNA arms using statistical characters (Sun and Caetano-Anolles, 2008a). This is because character change leading to coaxial stem taxa depends on models of character state that are quite different from those governing unpaired segments or Shannon entropy. While taxa (substructures) are implicitly related by a phylogenetic tree describing the evolution of a molecular repertoire and characters (molecules) can be considered relatively free of covariation patterns (tendencies of overconﬁdence in phylogenies), the validity of the reconstruction exercise rests on an adequate sampling of the molecular repertoire. For a speciﬁc type of RNA molecule, we generally examine the entire group of RNA sequences representing the structures of the complete set of known sequences acquired at the RNA level. This will guarantee an extensive sampling of molecular variants in our studies. Note that poor sampling could result in missing substructures and deﬁcient models of molecular evolution. 6. Substructural taxa representing molecular features larger than a nucleotide are necessarily the subject of information compression, generally dependent on how substructures are deﬁned. The negative consequences of information compression would be loss of information or differential weighting of taxa. However, this impact will decrease if the lost information constitutes noise and if models are derived from substructures of one kind. 
The conversion between amino acid and nucleotide sequences represents a typical example of information compression that has a natural rationale delimited by the genetic code. Indeed, the compression of RNA sequences into structures has also a natural rationale, the mapping of genotype (sequence) into phenotype (structure) enabled by the unique chemistry and folding of the RNA biopolymers (Ancel and Fontana, 2000; Higgs, 2000; Fontana, 2002). This mapping has three important properties: (i) the sequence-to-structure map is degenerate because there are many more sequences (in orders of magnitude) than structures; (ii) few common but many rare structural conformations materialize in structure space; and (iii) extensive neutral networks percolate sequence space and deﬁne common structures and structural neighborhoods (see Chapter 7). Because the distribution of sequences that fold into a same structure within neutral networks in RNA is approximately random, the mapping has “space covering” properties. In other words, all structures can materialize within relatively few mutational changes in sequence space, a property that has been conﬁrmed experimentally using RNA functional switches (Schultes and Bartel, 2000). For example, the complexities of sequence-to-structure mapping in tRNA are driven by decreases in frustration and increases in thermodynamic stability of the folding ensemble that ultimately deﬁne the cloverleaf structure of the molecule (Sun and Caetano-Anolles, 2008a). Based on these considerations, information compression in RNA is natural and should not bias signiﬁcantly the exercise of phylogenetic reconstruction. Instead, we argue that it will uncover deeper phylogenetic signals that are embedded at higher levels of biological organization and are difﬁcult to retrieve in an analysis of primary sequence.

16.2 Structural Characters and Derived Phylogenetic Trees

343

7. Trees of substructures provide by deﬁnition models of RNA structural evolution. When using shape characters, models describe evolutionary processes in which RNA molecules evolve from an originating substructure by addition of nucleotides and base pairs to substructural components. This process manifests as a sequence of events that are ordered in time. Consequently, the topologies of trees of substructures can be used to build models of RNA evolution. Generally, the phylogenetic relationship of stems determines the evolution of the overall shape of the RNA molecule that is typical of the molecular repertoire studied, and should be considered ﬁrst. This means that trees of stems will set a framework in the model for the chronological addition of major substructures to the evolving RNA. Once this framework is in place, trees of unpaired structural components deﬁne patterns of diversiﬁcation that do not result in molecular multifurcation (branching), and therefore, provide evolutionary patterns of decoration of the evolving molecule. Finally, substructures describing the nature of base pairs in stems generate trees that depict preference for usage of nucleotides in base pairing interactions. Numbers of paired and unpaired regions describe interruptions in coaxial stacking of helical segments and relative frequency of bulges and internal loops in these segments and generate trees that describe evolution of these geometrical features in RNAs.
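The substructural taxa and shape characters described in the points above can be made concrete with a small sketch. It assumes the secondary structure is available in dot-bracket notation without pseudoknots, and it scores only two simple shape characters, stem length and hairpin loop size. The function names and the toy structure are hypothetical illustrations, not the coding scheme used in the cited studies.

```python
def base_pairs(dot_bracket):
    """Return sorted (i, j) index pairs for matching parentheses."""
    stack, pairs = [], []
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure")
    return sorted(pairs)


def stems(dot_bracket):
    """Group base pairs into stems: runs of directly stacked pairs
    (i, j), (i + 1, j - 1), ... with no bulge or loop in between."""
    result, current = [], []
    for i, j in base_pairs(dot_bracket):
        if current and (i, j) == (current[-1][0] + 1, current[-1][1] - 1):
            current.append((i, j))
        else:
            if current:
                result.append(current)
            current = [(i, j)]
    if current:
        result.append(current)
    return result


def hairpin_loop_sizes(dot_bracket):
    """Return the number of unpaired bases in each hairpin loop."""
    sizes = []
    for stem in stems(dot_bracket):
        i, j = stem[-1]  # innermost pair of the stem
        if "(" not in dot_bracket[i + 1:j]:
            sizes.append(j - i - 1)
    return sizes


# Hypothetical toy structure: one closing stem and two hairpins.
toy = "((..((...))..((...))..))"
stem_lengths = [len(s) for s in stems(toy)]
loop_sizes = hairpin_loop_sizes(toy)
```

Each kind of substructure (stems, hairpin loops) is scored separately here, mirroring the requirement that only substructures of the same kind be compared in a given tree.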

16.2.6 Potential Limitations of the Methodology

In contrast to traditional phylogenetic trees reconstructed from nucleic acid sequences, trees derived from RNA structure establish ancestral–descendant relationships between the individual structural components of a nucleic acid molecule. Instead of nucleic acid bases, quantitative and geometrical attributes of RNA secondary structure are the characters analyzed. Superficially, when quantitative descriptions of this kind are chosen, the exercise of distinguishing among character states is to some extent subjective. Moreover, the method also has to deal with the trade-off between the accuracy of the characters that are defined and the amount of phylogenetic signal embedded in them. In other words, the amount of phylogenetic signal may be influenced by the way structural characters are selected and coded. However, studies have shown little if any discrepancy between results obtained using alternative quantitative character definitions (Caetano-Anolles, 2002b; Sun and Caetano-Anolles, 2008a). Ongoing investigations in our laboratory that focus on an analysis of structural topologies and evolutionary histories of more molecules are geared to the clarification of this issue. Furthermore, phylogenetic analysis of structure may also suffer from the same problems that affect reconstructions of phylogenies from sequences, such as mutational saturation, variation of evolutionary rates across sites, and covarion structure. Character selection and scoring are also influenced by the structural complexity of the RNA molecules that are studied. For example, microRNA and 5S rRNA have rather simple secondary structures while RNase P RNA and rRNA subunit molecules normally fold into much more complex structures. Differences in RNA length impose the limitation that character selection and scoring procedures be established case-by-case for each type of molecule examined. 
Another potential limitation of using structural characters in phylogenetic analysis relates to currently available phylogenetic software. For example, large molecules may generate more character states than the programs MacClade (Maddison and Maddison, 2003) and PAUP (Swofford, 2003) can handle. Furthermore, evolutionary


models for quantitative characters are still sparse and, again, the number of character states greatly limits the application of popular phylogenetic software such as MrBayes (Huelsenbeck and Ronquist, 2001; Ronquist and Huelsenbeck, 2003). One way to deal with this problem is to combine character states so that their number is decreased and can be appropriately processed with currently available phylogenetic software by applying various weighting schemes. However, this practice sacrifices phylogenetic signal embedded in the combined character states. Finally, we ensure the accuracy of RNA secondary structures during character selection and coding by using robust structural models. The structural characters are coded based on available secondary structures, which are supported by structural information derived from nuclear magnetic resonance (NMR) spectroscopy, crystallography, genetics, biochemistry, and comparative sequence analysis. In some cases, RNA secondary structure can be defined by sequence and structural alignments and further complemented by the use of RNA folding software (Hofacker, 2003; Zuker, 2003).
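The idea of combining character states so that their number fits what the software can handle can be sketched as follows. This is a minimal illustration that assumes integer-valued characters and simple equal-width binning; the function name and the `max_states` parameter are hypothetical, and the limit that applies to any given program should be taken from its own documentation.

```python
def combine_states(values, max_states):
    """Map raw integer character values onto at most max_states ordered
    states (0 .. max_states - 1) by equal-width binning."""
    lo, hi = min(values), max(values)
    span = hi - lo + 1
    if span <= max_states:
        # Already few enough states: only shift so the smallest is 0.
        return [v - lo for v in values]
    return [(v - lo) * max_states // span for v in values]


# Hypothetical raw scores (e.g., lengths of one substructure in five molecules).
raw = [3, 7, 12, 40, 41]
recoded = combine_states(raw, 4)
```

In this toy case the raw values 3, 7, and 12 all collapse into the same state, which illustrates the loss of phylogenetic signal noted above.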

16.3 APPLICATIONS

Since its inception, our phylogenetic method has been applied to the study of many noncoding RNA molecules. We here summarize results derived from our investigations of the structure and evolution of five types of RNA molecules: tRNA, 5S rRNA, RNase P RNA, SINE RNA, and rRNA.

16.3.1 tRNA

tRNA molecules are generally short and are composed of runs of 73–95 nucleotides. The molecules generally adopt a typical cloverleaf-shaped secondary structure and fold into an L-like conformation in 3D space (Figure 16.2). This cloverleaf structure is made up of two structural and functional domains. The acceptor (Acc) and the pseudouridine (TΨC) arms comprise the “top half” domain, while the “bottom half” domain is composed of the dihydrouridine (DHU) and the anticodon (AC) arms. A variable (extra) arm spans the TΨC and AC arms. This molecular adaptor delimits the genetic code and the encoding of amino acids in proteins by establishing numerous interactions with many other macromolecules, many of which are associated with the ribosome. For each specific tRNA, a specific aminoacyl-tRNA synthetase attaches a specific amino acid to the Acc arm at one end of the L-shaped folded molecule, while a three-nucleotide sequence on the AC arm at the other end recognizes the complementary codon sequence in messenger RNA (mRNA) when embedded in the ribosomal ensemble. Because of its central role in biology, it is believed that tRNA carries deep phylogenetic signal and its evolution reflects the evolution of the entire protein biosynthetic apparatus (Dick and Schamel, 1995). Indeed, this fundamental molecule has been the focus of many evolutionary investigations that explore the origins and evolution of life. For example, tRNA molecules have been extensively used in evolutionary studies of the genetic code (Wong, 1975; Eigen and Winkler-Oswatitsch, 1981a, 1981b; Fitch and Upper, 1987; Di Giulio, 2000; Widmann et al., 2005) and a chronology of amino acid discovery has been linked to prebiotic chemistries and processes related to an ancient RNA world (Zuckerkandl et al., 1971; Brooks et al., 2002; Trifonov, 2004; Jordan et al., 2005). The evolutionary history of the two structural and functional domains of a typical tRNA molecule has been contentious. 
In addition, the evolutionary relevance of the variable region that spans the AC and TΨC arms is generally unknown. Competing and divergent


hypotheses have been proposed regarding which of the two domains is more ancient. On one hand, the primordial AC arm adaptor that embodies the classic genetic code may be more ancient than structures in the top half, because amino acids may have served as cofactors establishing a stereochemical relation with the anticodon or codon in the primordial stage of life (Szathmary, 1999). On the other hand, the first tRNA structures may have been “genomic tags” that marked the 3′-ends of ancient RNA genomes for replication by RNA enzymes in the RNA world (Weiner and Maizels, 1987). These tags in their simplest form could have been hairpin structures involving a coaxial stack of the Acc and TΨC arms (Maizels and Weiner, 1994). The genomic tag hypothesis therefore considers the top half of modern tRNA as being the ancient structural and functional domain, while the bottom half represents a later addition to provide additional specificity necessary for codon recognition in mRNA. In a recent study (Sun and Caetano-Anolles, 2008a), we retrieved the entire set of 571 tRNA sequences with cloverleaf secondary structures from Part 2 (Compilation of tRNA Sequences) of the September 2004 edition of the Bayreuth tRNA Database (Sprinzl and Vassilenko, 2005; Jühling et al., 2009). We scored a total of 42 structural characters and inferred the evolutionary histories of organisms and tRNA substructures. Phylogenetic results generated from a total of 56 data matrices sorted according to taxonomic distribution, amino acid specificity, and biological sources show that the monophyly of tRNAs belonging to the three superkingdoms or expressing different amino acid specificities was not revealed in most of the parsimonious trees that were reconstructed. This is probably due to loss of phylogenetic signal or because tRNA diversification predated organismal diversification. 
The substructure trees derived from both geometrical and statistical characters congruently reveal the same evolutionary pattern: the Acc arm is the most basal substructure, generally followed in order by the TΨC, AC, and DHU and variable arms. With these two congruent lines of phylogenetic evidence, we proposed an evolutionary model that explains the origin and evolution of the major functional and structural components of tRNA (Sun and Caetano-Anolles, 2008a; Figure 16.2). This model considers that the modern cloverleaf tRNA structure resulted from gradual addition of structural components to the growing molecule, either by insertion of single or multiple nucleotides or by partial or total duplications. In this model, short RNA hairpins with stems homologous to the Acc arm of present-day tRNAs were extended with regions homologous to the TΨC and AC arms. The DHU arm was then incorporated into the resulting three-stemmed structure to form a protocloverleaf structure. The variable region was the last structural addition to the molecular repertoire of evolving tRNA substructures. The results and the tRNA model support the ancestral nature of the Acc and the TΨC arms, the cornerstone of the genomic tag hypothesis that postulates tRNAs were ancient telomeres in the RNA world (Weiner and Maizels, 1987; Maizels and Weiner, 1994). tRNA structural phylogenies and the use of phylogenetic constraint also unraveled evolutionary patterns and processes related to the genetic code and the origins and evolution of life. We were able to provide novel insights into the origin and diversification of cellular life and viruses and into the evolution of the genetic code and amino acid charging (Sun and Caetano-Anolles, 2008b, 2008c). 
The generation of organismal timelines that were based on evolutionary distances among alternative or competing hypotheses of origin and diversiﬁcation revealed Archaea was the most ancestral superkingdom, followed by viruses, and superkingdoms Eukarya and Bacteria in that order (Sun and Caetano-Anolles, 2008b). These results support conclusions from recent phylogenomic studies of protein architecture (Wang et al., 2007; Wang and Caetano-Anolles, 2009). Strikingly, constraint analyses showed that the origin of viruses was not only ancient but was also linked to Archaea. Our


findings have important biological and evolutionary implications. They support the notion that the archaeal lineage was very ancient, resulted in the first organismal divide, and predated diversification of tRNA function and specificity. Results are also consistent with the concept that viruses contributed to the development of the DNA replication machinery during the early diversification of the living world (Sun and Caetano-Anolles, 2008b). Timelines of amino acid charging and codon discovery show that charging of selenocysteine (Sec), tyrosine (Tyr), serine (Ser), and leucine (Leu) were ancient, while specificities related to asparagine (Asn), methionine (Met), and arginine (Arg) were derived (Sun and Caetano-Anolles, 2008c). The timelines also uncovered an early role of the second and then first codon bases, identified codons for alanine (Ala) and proline (Pro) as the most ancient, and revealed important evolutionary takeovers linked to the loss of the long variable arm in tRNA. The lack of correlation between ancestries of amino acid charging and encoding indicated that the separate discoveries of these functions reflected independent histories of recruitment. These histories were probably curbed by cooptions and important takeovers during early diversification of the living world (Sun and Caetano-Anolles, 2008c). For the first time, our method allowed us to address the evolutionary significance of the long variable arm in tRNA from a phylogenetic perspective (Sun and Caetano-Anolles, 2008d). We showed that class II tRNA molecules containing a long variable arm, including tRNASec, tRNASer, tRNATyr, and tRNALeu, were ancestral compared to those lacking this structural feature. This suggests that Sec, Ser, Tyr, and Leu were among the first amino acids to be charged by their cognate tRNAs and that they may represent the first group of amino acids with functional specificities linked to modern biochemistry. 
Results also suggest that the stop codon UGA, which also codes for Sec, may be the oldest codon to have a modern functional role in the history of the genetic code. Finally, the charging of amino acids by cognate tRNAs appears to have occurred once the canonical cloverleaf structure was fully realized in evolution but prior to the diversification of amino acid specificities and superkingdoms of life (Sun and Caetano-Anolles, 2008a, 2008b).

16.3.2 5S rRNA

Generally considered an integral component of the large subunit of the ribosome, the 5S rRNA molecule plays fundamental roles during protein synthesis. It probably acts as a signal transducer between the peptidyl transferase center and domain II of the large subunit responsible for translocation (Bogdanov et al., 1995; Dokudovskaya et al., 1996) and between regions of 23S rRNA responsible for ribosomal function (Kouvela et al., 2007). 5S rRNA is also a determinant of stability for the large subunit (Holmberg and Nygard, 2000). It can act as an evolutionary resource for other genetic components. For example, SINE3 molecules, a class of SINE RNAs, are indeed evolutionarily derived from 5S rRNA (Kapitonov and Jurka, 2003). Despite extensive studies undertaken during the past decades, detailed understanding of 5S rRNA function is still lacking (Bogdanov et al., 1995; Barciszewska et al., 2000, 2001; Szymanski et al., 2003) and the evolutionary histories of the molecular components of this universally conserved molecule are still unclear. Regardless of its origin, 5S rRNA can always be folded into a common secondary structure (Figure 16.1). This structure contains ﬁve helices (I–V), two hairpin loops (C and D), two internal loops (B and E), and a multiloop (hinge) region (A) connecting helices I, II, and V. This three-branched general structure has been conﬁrmed by a number of structural studies and comparative sequence analyses. We explored the evolutionary history of 5S rRNA with structural characters. Our goal was to identify the ancestral functional and


structural components or substructures. We retrieved the entire set of 1371 5S rRNA sequences from the September 2005 edition of the 5S Ribosomal RNA Database (Szymanski et al., 2002). It is well known that although programming algorithms for secondary structure prediction have improved, predicted 5S rRNA structures are not always satisfactory (Azad et al., 1998; Ding and Lawrence, 1999; Mathews et al., 1999). We therefore decided to select only those structures that match a set of predefined requirements. We first predicted 5S rRNA secondary structure with RNAFOLD (Hofacker, 2003). Approximately one half of the available sequences (666), representing a comprehensive sampling among the three superkingdoms of life (89 Archaea, 168 Bacteria, and 409 Eukarya), were then selected for further study because they folded into minimum free energy structures that were compatible with phylogeny and known 3D models of structure. Phylogenetic analysis of 46 structural characters in 5S rRNA showed that, individually, the three superkingdoms of life were not monophyletic (Sun and Caetano-Anolles, 2009). As with tRNA, we reasoned that this is probably due to loss of phylogenetic signal in the secondary structure of 5S rRNA. Alternatively, 5S rRNA diversification could have predated true organismal diversification and structures were the result of primitive processes during evolutionary stages prior to superkingdom differentiation. By applying the total evidence approach, we conducted phylogenetic analysis of combined structure and sequence data. Results showed that both Bacteria and Eukarya were each monophyletic, while Archaea was paraphyletic and occupied a basal position in the universal tree. Interestingly, this archaeal rooting of the tree of life again supports previous studies based on tRNA (Xue et al., 2003, 2005; Di Giulio, 2007; Wong et al., 2007; Sun and Caetano-Anolles, 2008b; F.-J. Sun and G. 
Caetano-Anolles, unpublished) and proteomes (Wang et al., 2007; Wang and Caetano-Anolles, 2009). Based on phylogenetic trees of substructures reconstructed from geometrical characters describing the complete data set of 666 5S rRNA sequences, an evolutionary model of 5S rRNA structure is proposed. According to this model, short RNA hairpins with stems homologous to helix I (the most ancient substructure) of present-day 5S rRNAs were extended with regions homologous to the branch containing helices II and III, then followed by a third branch composed of helices IV and V, to form a proto-three-branched structure. Specifically, the evolutionary history of 5S rRNA proceeded via two routes in Archaea and Bacteria/Eukarya, respectively. Both routes started with helix I, followed by helix III. At this point, the two routes departed. For Archaea, helix III is followed by helix V; helices IV and II are the last components added to the three-branched structure. For Bacteria/Eukarya, the order is helical arms I, III, II, and V and IV; in other words, helix I is followed by the branch harboring helices III and II, then the branch containing helices V and IV. Interestingly, the two-route scenario that distinguishes molecules between Archaea and Bacteria/Eukarya agrees with the two-route scenario of the evolutionary model of tRNA structure (Sun and Caetano-Anolles, 2008a). Both molecules suggest the divergence of Archaea from Bacteria/Eukarya. More important, the existence of these two evolutionary routes is also compatible with whole-genome analysis of protein complements and domain combinations that suggest an early split of the archaeal lineage from a protein architecture-rich communal world by reductive genomic tendencies that were protracted and ultimately led to superkingdom Archaea and then to Bacteria and Eukarya (Wang and Caetano-Anolles, 2006, 2009; Wang et al., 2007). 
A terminal stem-loop structure (a hairpin) has also been proposed to be the starting evolutionary component of tRNA molecules (Woese, 1969; Hopﬁeld, 1978; Eigen and Winkler-Oswatitsch, 1981b; Bloch et al., 1985; Dick and Schamel, 1995; Tanaka and Kikuchi, 2001; Widmann et al., 2005; Sun and Caetano-Anolles, 2008a). Similarly, a


terminal hairpin structure may also be the starting structure in the evolution of RNase P RNA (see Section 16.3.3; F.-J. Sun and G. Caetano-Anolles, unpublished), SINE RNA (Sun et al., 2007), and rRNA (A. Harish and G. Caetano-Anolles, unpublished). The evolutionary importance of hairpin structures has been emphasized by the genomic tag hypothesis (Weiner and Maizels, 1987; Maizels and Weiner, 1994). Based on these results, we hypothesize that tRNA, SINE RNA, RNase P RNA, 5S rRNA, and rRNA may have shared the same common ancestral structure during early evolutionary history of the RNA world, a primordial stem-loop hairpin structure. We speculate that further studies in other types of RNA molecules may provide further evidence to support this hypothesis.
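The terms monophyletic and paraphyletic used in this section can be made concrete with a small sketch. It assumes a rooted tree written as nested tuples with leaf names as strings; the tree and its leaf labels are hypothetical and only mimic the pattern reported above for the combined analysis (Bacteria and Eukarya each monophyletic, Archaea paraphyletic and basal).

```python
def leaves(tree):
    """Return the set of leaf names under a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result


def is_monophyletic(tree, group):
    """True if some clade of the rooted tree contains exactly the group."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Hypothetical toy tree: A = Archaea, B = Bacteria, E = Eukarya.
toy_tree = ("A1", ("A2", (("B1", "B2"), ("E1", "E2"))))

bacteria_ok = is_monophyletic(toy_tree, {"B1", "B2"})
eukarya_ok = is_monophyletic(toy_tree, {"E1", "E2"})
archaea_ok = is_monophyletic(toy_tree, {"A1", "A2"})
```

In the toy tree no clade contains only the two archaeal leaves, so that group is not monophyletic even though both leaves branch near the root.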

16.3.3 RNase P RNA

It is well known that the RNase P enzyme, one of the two ribozymes (the other is the ribosome) present in all living organisms, is generally composed of one RNA subunit and one or more protein subunits in Bacteria and Archaea/Eukarya, respectively (for exceptions, see Manam and Van Tuyle, 1987; Wang et al., 1988; Rossmanith and Karwan, 1998; Thomas et al., 2000b; Salavati et al., 2001; Holzmann et al., 2008). This ribonucleoprotein functions as a phosphodiesterase and carries out the endonucleolytic cleavage of the precursor sequence from the 5′-end of a primary tRNA transcript to form the mature 5′ end, a process that catalyzes the hydrolysis of the phosphate backbone of pre-tRNA at the 5′ end (reviewed in Karwan et al., 1995; Pace and Brown, 1995; Frank and Pace, 1998; Hartmann and Hartmann, 2003; Evans et al., 2006; Kazantsev and Pace, 2006; Walker and Engelke, 2006). The catalytic function can be conducted by the RNA subunit independent of protein subunits, indicating that it resides in the RNA subunit (Guerrier-Takada et al., 1983; Pannucci et al., 1999; Thomas et al., 2000a; Kikovska et al., 2007). This RNA-based catalytic activity has been preserved during evolution in all three superkingdoms, an idea that was recently confirmed (Pulukkunat and Gopalan, 2008). RNase P RNAs (RPRs) in Bacteria have been widely studied and can be divided into two independently folding domains involved in both substrate cleavage, the catalytic (C) domain, and substrate binding, the specificity (S) domain (Pan, 1995; Loria and Pan, 1996; Figure 16.4). Structurally, the S domain is composed of helical structure P7 and everything else distal to P7, while the C domain is composed of the rest of the molecule. Furthermore, five universally distinct conserved regions (CR I–V) that are distal to each other in the primary sequence are proposed in RPRs (Chen and Pace, 1997). These five regions consist of the universally conserved core structure. 
It is further suggested that the S domain comprises CR II and III, and that the C domain comprises CR I, IV, and V (Torres-Larios et al., 2006). Despite the overwhelmingly helical components in the tertiary fold, it is interesting that both domains are nonhelical: CR II and III form two interleaving T-loop motifs, whereas CR I, IV, and V are part of loops and turns. In terms of functions, the C domain contains the entire active site and binds the Acc stem/5′-leader and the ACCA sequence at the 3′-end (by a Watson-Crick base pairing mechanism) of the pre-tRNA substrate. It also cleaves the leader sequence of the pre-tRNA in the presence of the bacterial RNase P protein (RPP) cofactor (Guerrier-Takada and Altman, 1992; Pan, 1995; Pan et al., 1995; Loria and Pan, 1996; Christian et al., 2002; Harris and Christian, 2003). In contrast, the S domain binds the TΨC stem-loop region of the pre-tRNA substrate and confers the substrate specificity. Given the distinct functions of these two structural domains during enzymatic catalysis, the S and C domains may have different evolutionary histories. In fact, Altman and Kirsebom (1999) proposed that an earlier form of the RPR in the RNA world lacked the


Figure 16.4 Secondary structure models of RNase P RNA. (a) Bacterial type A and type B molecules from Escherichia coli (Harris et al., 2001) and B. subtilis (Haas and Brown, 1998), respectively. (b) Consensus structure showing the catalytic and specificity domains highlighted with different shades. Extension stems present in RNase P MRP in eukaryotes are shown as dashed lines in light gray. Loops are indicated with circles, stems with solid lines, and pseudoloops (P4 and P6) with dashed lines connecting interacting segments of the molecule.

S domain simply because this domain was not needed for the binding of the substrate. This hypothesis indicates that between the two functional and structural domains of RPR, the C domain is more ancient than the S domain, and the S domain is a rather derived structure. In fact, substrate speciﬁcity can be altered without changing the basic cleavage reaction by modifying the S domain of Bacillus subtilis (Mobley and Pan, 1999). This strongly suggests that the C domain is ancestral. However, current evolutionary studies do not provide convincing evidence in support of this hypothesis. Pace and coworkers have deﬁned the “minimal” RPR, the smallest molecule needed to carry out the hydrolysis reaction associated with RNase P (Siegel et al., 1996). However, this minimal RPR contains molecular components from both S and C domains. To date, the evolutionary histories of the molecular components of each of these two folding domains remain unclear. To explore the origin and evolution of these two structural and functional domains in RPR, we reconstructed phylogenetic trees derived from 129 structural characters in 133 RPRs using our cladistic method (F.-J. Sun and Caetano-Anolles, manuscript submitted). These RPR sequences with secondary structures were retrieved from Release No. 12 of the 2005 edition of the RNase P Database (Brown, 1999). Partial RNA sequences were excluded and the remaining sequences folded into secondary structures that were compatible with the RPR phylogeny and the known 3D models of RPR structure (Kazantsev et al., 2005). Phylogenetic analysis of structure revealed the monophyly of two superkingdoms of life, Archaea and Eukarya. In contrast, molecules of bacterial lineages clustered into two distantly related groups. One group was basal and contained molecules from 15 Gram positive bacteria (low-GC subdivision having type B RPRs; Figure 16.4) plus one green


nonsulfur bacterium (Thermomicrobium roseum with type C RPR). The other group, a sister clade of Archaea, was a largely unresolved clade composed of molecules from other bacteria with type A RPRs. Phylogenetic analysis of combined structure and sequence data revealed that both Eukarya and Archaea were monophyletic, that Bacteria remained paraphyletic, and that Archaea was basal in the tree. Our results support superkingdom monophyly and are in agreement with studies by Collins et al. (2000), who demonstrated that phylogenetically informative characters are indeed embedded in the secondary structure of RPR molecules and produce trees that recover the three superkingdoms. One of the remarkable benefits of using structural characters in these evolutionary studies is that phylogenetic relationships are inferred from distantly related RPR and RNase MRP RNAs with scarcely alignable sequences because structures are more conserved than sequences. Recent studies have indicated that RPR molecules are also suitable for phylogenetic analysis of closely related bacterial taxa and have potential as a tool for species discrimination (Tapp et al., 2003; Birkenheuer et al., 2002). One distinctive feature of RPR that makes it a reliable molecular marker is that there is only one copy of its gene per genome. This makes its gene less likely to be compromised by interspecific lateral gene transfer. Phylogenetic trees of substructures allowed us to propose an evolutionary model of RPR structure. Phylogenetic analyses of stem substructures derived from three data sets partitioned according to superkingdoms revealed topologies similar to the topology derived from the complete data set. In contrast, analyses of bulges, hairpin loops, and other unpaired segments produced uninformative patterns. In this model, RPRs start from a hairpin structure equivalent to P12 (a helical substructure in the S domain) in present-day RPR molecules. 
This is followed by four helical regions in the C domain (in the order P1, P3, P4, and P2) and four stems in the S domain (P10/P11, P9, P8, and P7). Five additional stems in the S domain are then added in defined order. Later on, one helix in the C domain (P18) and two in the S domain (P13 and P14) are added, followed by a last group of elements composed of eight helices in the C domain and one helix (P10.1) in the S domain. With the exception of P12, the C domain appears generally more ancient than the S domain on all of the phylogenetic trees of stem substructures. For example, four helical regions in the C domain, P1–P4, which are also structural components of the conserved minimal RPR (Siegel et al., 1996), are ancient in comparison to many other helical structures in both the C and S domains. It has been suggested that the S domain facilitates substrate binding in the ribozymic reaction but is dispensable for catalysis, and that the RPPs can compensate to aid substrate binding in its absence (Krasilnikov et al., 2003, 2004). This indicates the ancestral nature of the C domain, as proposed by Altman and Kirsebom (1999). Our results add a further line of evidence and largely support this hypothesis.

RPRs catalytically cleave the 5′-end of the pre-tRNA sequence to generate the mature tRNA. In doing so, RPRs interact with the ancient top domain of the pre-tRNA, composed of the TΨC and acceptor (Acc) arms. Previous studies indicate that the top domain of tRNA evolutionarily predated the bottom domain, composed of the DHU and anticodon (AC) arms (Maizels and Weiner, 1994; Sun and Caetano-Anolles, 2008a). In our study, the phylogenetic trees of substructures show that helices P1–P4 are the ancestral elements. These are the only four conserved stems of the C domain in the universal consensus minimum structure of RPR (Evans et al., 2006).
These results suggest a coevolutionary scenario between the top domain of tRNA (the substrate) and the C domain of the RPR (the catalyst). Given that the S domain carries out supporting rather than catalytic functions, the evolutionary significance of P12 is worthy of further study. Furthermore, it should also be noted that in comparison to the currently available RPRs (Brown, 1999), especially with the generation
of sequences by metagenomic studies (Zhu et al., 2007), the number of RPR sequences sampled in this study is limited. Consequently, conclusions drawn should be taken with caution. Future studies with comprehensive sampling would also provide insights into the evolutionary patterns of various types of RPR molecules, the evolutionary relation between the RPRs and their associated proteins, and the phylogenetic utility of RPR molecules.

16.3.4 SINE RNA

In eukaryotic genomes, short interspersed elements (SINEs) are widely distributed mobile retroelements (Kramerov and Vassetzky, 2005). SINEs are approximately 80–500 bp long and normally have genomic copy numbers ranging from a few hundred to more than a million (Jurka, 1995; Okada and Ohshima, 1995; Kramerov and Vassetzky, 2005). They may play important regulatory roles closely related to transcription and translation in higher eukaryotes (Moran and Gilbert, 2002; Dewannieux et al., 2003; Pelissier et al., 2004; Ohshima and Okada, 2005). SINEs are believed to be derived from tRNAs or, more exceptionally, from 7SL or 5S RNAs (Jurka, 1995; Kapitonov and Jurka, 2003). The common composite structure of SINEs usually consists of a 5′ tRNA-related segment followed by a tRNA-unrelated segment of variable size (Okada and Ohshima, 1995; but see an exception described by Piskurek et al., 2003). In all cases, SINEs harbor in their tRNA-related segment the internal promoter recognized by the RNA polymerase III (Pol III) machinery (Arnaud et al., 2001). Therefore, SINE-specific transcription completely depends on the Pol III transcription system. Several vertebrate and invertebrate SINE families have been grouped into three superfamilies, CORE-SINE (Gilbert and Labuda, 2000), V-SINE (Ogiwara et al., 2002), and AmnSINE (Nishihara et al., 2006), based on the conserved nature of their central region. Given the strong conservation of the central domain of these SINEs over a long evolutionary period, it has been suggested that this region might have been subject to some form of positive selection (Gilbert and Labuda, 2000; Ogiwara et al., 2002; Nishihara et al., 2006).

To date, little is known about SINE RNA structure and its evolutionary significance. This is especially true for tRNA-related SINEs. Only two tRNA-related SINE structures have been experimentally studied (Rozhdestvensky et al., 2001; Kawagoe-Takaki et al., 2006).
Recently, Deragon and Zhang (2006) probed the secondary structure of two tRNA-related plant SINE RNAs from Brassica oleracea and Arabidopsis thaliana and found that these molecules did not conserve the folding patterns of ancestral tRNAs. However, they still presented similar RNA structures despite being largely unrelated at the primary sequence level. Furthermore, using a bioinformatic approach, the RNA secondary structures of different SINEs corresponding to known SINE families in Brassica and Arabidopsis were also predicted (Sun et al., 2007) to test whether similar RNA folding patterns could be observed for tRNA-related SINE families outside the Brassicaceae. More than 50 tRNA-derived SINE families have now been described in mammals, fishes, and plants (Kramerov and Vassetzky, 2005). Consensus sequences from many of these SINE families are not robust (they were generated from the alignment of a small number of copies) but were nevertheless used to predict the corresponding RNA structures. Surprisingly, the diversity of structures observed for most of these SINE families is similar to that observed for SINEs of Brassicaceae and shows a clear conservation of patterns. These RNAs either fold as two extended stem loops or adopt a structure comprising three stem loops, which is the most common structure for tRNA-related SINEs. Using our cladistic method, we compared the different SINE RNA structures to test their relatedness more rigorously. We scored 37
substructural characters from 20 representative eukaryotic SINE RNAs. Phylogenetic analysis of these structural characters showed that SINE RNAs started as simple structures with a single tRNA-related segment and became more complex in some lineages or stayed simple in others. It is noteworthy that SINE length is associated with increases in structural complexity, suggesting that SINE sequences increase in length during evolution. Based on the phylogenetic trees of substructures from these 20 representative SINE RNAs, a general model of structural evolution for tRNA-related SINEs was proposed (Sun et al., 2007). Phylogenies describing the evolution of substructures established clearly that the SINE RNA molecule has an origin in a GC-rich tRNA-related segment. A timeline for the evolution of SINE molecules can be inferred directly from the phylogenetic trees of substructures. This timeline represents a global model of evolution of SINE RNA structure that suggests modern SINEs resulted from the gradual addition of structural components to a tRNA-related founder. Overall, the proposed model shows that SINEs could have evolved and amplified from a single stem loop to two and then three stem loops, as inferred from the phylogenetic trees of substructures. To accomplish this process, SINEs either amplify the tRNA-related segment or add short or long tRNA-unrelated segments to their molecules.
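The association between SINE length and structural complexity described above can be illustrated with a small sketch. It assumes that secondary structures are available in dot-bracket notation (matching parentheses for paired bases, dots for unpaired bases), which is an assumption of this example and not a description of the character-coding scheme used in the study. The function names and the three toy structures are hypothetical and do not correspond to real SINE families.

```python
from typing import List, Tuple


def base_pairs(dot_bracket: str) -> List[Tuple[int, int]]:
    """Return (i, j) index pairs for matching brackets; raise on malformed input."""
    stack: List[int] = []
    pairs: List[Tuple[int, int]] = []
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
        elif ch != ".":
            raise ValueError(f"unexpected symbol {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


def hairpin_count(dot_bracket: str) -> int:
    """Count hairpin loops: base pairs that enclose only unpaired positions."""
    return sum(
        1
        for i, j in base_pairs(dot_bracket)
        if set(dot_bracket[i + 1:j]) <= {"."}
    )


# Toy structures (hypothetical, for illustration only).
TOY_STRUCTURES = {
    "one_stem_loop": "((((....))))",
    "two_stem_loops": "((((....))))..((((....))))",
    "three_stem_loops": "((((....))))..((((....))))..(((....)))",
}

if __name__ == "__main__":
    for name, structure in TOY_STRUCTURES.items():
        # Report length, number of base pairs, and number of stem loops.
        print(name, len(structure), len(base_pairs(structure)), hairpin_count(structure))
```

In these toy inputs, length and the number of stem loops rise together by construction; with real data, the same two quantities could be tabulated per SINE family to examine the reported association.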

16.3.5 rRNA

In addition to small molecules such as tRNAs and SINE RNAs, our method has proven useful for studying the structural evolution of complex molecules such as rRNA, providing insights into the evolution of the ribosome and phylogenetic support for theories of ribosome evolution. Recent advances in solving high-resolution structures of the ribosomal complex (Yusupov et al., 2001; Schuwirth et al., 2005; Selmer et al., 2006) resulted in a scaffold for evolutionary tracings (Caetano-Anolles, 2002b). Initial studies revealed remarkable patterns of evolution in intersubunit bridge contacts and tRNA binding sites (Caetano-Anolles, 2002b). These patterns were consistent with the proposed coupling of tRNA translocation and subunit movement (Yusupov et al., 2001). Results also supported the concerted evolution of tRNA binding sites in the two rRNA subunits and the ancestral nature of both the peptidyl site and the functional relay of the penultimate stem of the small subunit (SSU).

Further advances in our methods allowed reconstruction of trees of substructures of rRNA and evolutionary heat maps of the entire ribosomal ensemble (Harish and Caetano-Anolles, manuscript submitted). The tree of substructures provides a chronology that helps identify which parts of the ribosome are ancestral and which are derived. We analyzed 13,000 rRNA sequences from the European ribosomal RNA database (Wuyts et al., 2004) annotated with secondary structures derived from comparative sequence analysis, scoring 150 substructural characters: 50 for the small subunit rRNA and 100 for the large subunit rRNA. Analysis of SSU rRNA confirmed that the penultimate stem (S49 or helix 44 in the bacterial model), the critical SSU component of the subunit interface (Yusupov et al., 2001) and the proposed ribosomal functional relay (Cate et al., 1999; Frank and Agrawal, 2000), was the most ancestral substructure of the ensemble. Figure 16.5 shows the evolution of the SSU ribosome.
Analysis of the large subunit (LSU) rRNA suggested a late and more complicated structural origin. Interestingly, ancient substructures were located at the center of the ribosomal complex surrounding the intersubunit interface and were clearly linked to ratchet mechanics. In other words, the ancestral substructures of the SSU and LSU form a functional core. Our observations are compatible with an origin of the ribosome in structures that were not linked with modern protein synthesis. It is intuitive that an ensemble as complex as the ribosome did not evolve at once. Our trees of substructures and timelines support the progressive evolution of this complex biosynthetic machinery from a much simpler proto-ribosomal structure. However, the origins of the molecule did not lie in the functional core as one might expect. Instead, ancient and derived substructures occupied the same regions of the molecular ensemble. A complicated history of recruitment of substructures seems to dominate the evolution of these complex molecules.

Figure 16.5 An evolutionary 3D heat map of the SSU ribosome showing the relative age of rRNA substructures and ribosomal proteins. Phylogenetic trees of rRNA stem substructures and proteins are reconstructed. Ancestries from the trees are mapped onto the 3D structure to reveal an evolutionary model describing the relative ages of rRNA substructures (a) and ribosomal proteins (b). The penultimate stem in red can be easily identified across most of the surface of the molecule. Color ancestry scales for RNA and proteins are not comparable. (See insert for color representation of this figure.)

Using our method we were also able to test theories of RNA structure evolution (Ancel and Fontana, 2000; Fontana, 2002). We traced the complete repertoire of ribosomal structural characters, lineage by lineage, in the universal phylogenetic tree of rRNA molecules (Caetano-Anolles, 2002b). This tracing showed patterns of evolutionary change in rRNA and allowed the reconstruction of hypothetical ancestral molecules. The exercise revealed a tendency toward molecular simplification, especially in highly variable regions of the molecules. This tendency was maximal in rRNA from Encephalitozoon cuniculi, an amitochondriate microsporidian endoparasite with a highly reduced genome and protein complement. The exercise also showed a reduction of ribosomal structural change with time that occurred concomitantly in both ribosomal subunits, which is compatible with plastogenetic congruence and structural canalization (Ancel and Fontana, 2000; Fontana, 2002).
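The idea of mapping ancestries from a tree of substructures onto a color scale, as in the heat map of Figure 16.5, can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to produce the figure: relative age is approximated here by the normalized depth of each leaf in a rooted tree, the tree is a toy example, and every label other than S49 is hypothetical. Only the use of red for the most ancestral substructure follows the figure caption; the blue end of the scale is an assumption of this example.

```python
from typing import Dict, Tuple, Union

# A tree is either a leaf label (str) or a tuple of subtrees.
Tree = Union[str, tuple]


def leaf_depths(tree: Tree, depth: int = 0) -> Dict[str, int]:
    """Count the internal nodes on the path from the root to each leaf."""
    if isinstance(tree, str):
        return {tree: depth}
    depths: Dict[str, int] = {}
    for child in tree:
        depths.update(leaf_depths(child, depth + 1))
    return depths


def relative_ages(tree: Tree) -> Dict[str, float]:
    """Rescale leaf depths to [0, 1]: 0 = most ancestral, 1 = most derived."""
    depths = leaf_depths(tree)
    lo, hi = min(depths.values()), max(depths.values())
    span = hi - lo
    return {leaf: (d - lo) / span if span else 0.0 for leaf, d in depths.items()}


def age_to_rgb(age: float) -> Tuple[int, int, int]:
    """Map a relative age to a color: red (ancestral) through blue (derived)."""
    if not 0.0 <= age <= 1.0:
        raise ValueError("age must lie between 0 and 1")
    return (round(255 * (1.0 - age)), 0, round(255 * age))


# Toy rooted tree (hypothetical): S49 branches off first, stem_y and stem_z last.
TOY_TREE: Tree = ("S49", ("stem_x", ("stem_y", "stem_z")))

if __name__ == "__main__":
    for label, age in relative_ages(TOY_TREE).items():
        print(label, age, age_to_rgb(age))
```

The resulting colors could then be assigned to the residues of each substructure in a molecular viewer; that step depends on the viewer and is left out of the sketch.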

16.4 CONCLUSIONS

In addition to the traditionally known RNA molecules (e.g., rRNA, tRNA, and mRNA), a large number of novel non-protein-coding RNA (ncRNA) systems have been documented in various model organisms in recent years (Berezikov et al., 2006; Ruby et al., 2006; Axtell et al., 2007; Brennecke et al., 2007; Houwing et al., 2007; Kasschau et al., 2007; Lu et al., 2007; Zhao et al., 2007). These studies have uncovered a truly remarkable and complex modern RNA world. Functional analysis of these ncRNAs poses a great challenge and is attracting tremendous attention from scientists in various fields. One
excellent example that demonstrates the significance of these ncRNAs is the 2006 Nobel Prize in Physiology or Medicine awarded to Andrew Fire and Craig Mello for their discovery of RNA interference, which provided new insights into ncRNA-based gene regulation. The cladistic method that we describe here has been applied to the evolutionary study of a number of ncRNAs, including rRNA, 5S rRNA, tRNA, RNase P RNA, and SINE RNA. The method embeds structure directly in phylogenetic analysis and generates intrinsically rooted phylogenetic trees without the need for outgroups. We have exemplified the potential of this novel phylogenetic approach by focusing on several fundamental molecules that are involved in or are linked to protein synthesis. These studies provide novel insights into important evolutionary questions surrounding the emergence of cellular life and viruses and the origins and evolution of the genetic code and the protein biosynthetic machinery. Because trees of life generated from these ncRNA molecules establish evolution’s arrow, it becomes possible to establish the location of the root on the tree of life. We show here that a common topology emerges congruently from the analysis of these ncRNA molecules, which indicates that Archaea is the most ancient lineage on Earth. This result is important because the root of the tree of life has been debated for decades, with controversy largely stemming from the various rooting approaches that have been used and the alternative evolutionary scenarios that have been derived. The rooting strategy that we apply to the study of ncRNA has also been extended to the study of protein domain structure at fold and fold superfamily levels (see Chapter 20), revealing again the ancestry of the archaeal lineage and uncovering the origin, evolution, and structure of the protein world (Caetano-Anolles et al., 2009).
We anticipate future studies will extend analysis to all kinds of ncRNAs, providing further support to the validity of the method and clarifying evolutionary questions surrounding origins of modern biochemistry and diversified life. Phylogenetic analyses will also impact the study of function and structure of ncRNA molecules, placing them within an evolutionary context. Together with evidence derived from molecular, genetic, and biochemical studies, evolutionary insights will surely enhance our understanding of biological functions and how these are linked to mechanisms embodied in molecular repertoires. Given the increasing demand for phylogenetic applications in various scientific fields (Baldauf, 2003), we optimistically foresee the increased phylogenetic utility of establishing evolution’s arrow for the study of molecular evolution.

ACKNOWLEDGMENTS

We thank GCA Laboratory members Hee Shin Kim, Minglei Wang, Liudmila S. Yafremava, Kyung Mo Kim, and Jay E. Mittenthal for helpful discussions. The National Science Foundation (MCB-0343126 and MCB-0749836), the United Soybean Board, and the Critical Research Initiative of the University of Illinois at Urbana-Champaign support the RNA evolutionary studies in the GCA Laboratory (http://www.cropsci.uiuc.edu/faculty/gca/gca.html). Any opinions, findings, conclusions, and recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

REFERENCES

ALTMAN, S. and KIRSEBOM, L., 1999. Ribonuclease P. In The RNA World: the nature of modern RNA suggests a prebiotic RNA (eds. R.F. GESTELAND, T.R. CECH, J.F. ATKINS). 2nd edition. Cold Spring Harbor Laboratory Press, New York, pp. 351–380.

ANCEL, L.W. and FONTANA, W., 2000. Plasticity, evolvability, and modularity in RNA. J. Exp. Zool. (Mol. Dev. Evol.) 288: 242–283. ARNAUD, P., YUKAWA, Y., LAVIE, L., PÉLISSIER, T., SUGIURA, M., and DERAGON, J.M., 2001. Analysis of the SINE S1 Pol III promoter from Brassica; impact of methylation and influence of external sequences. Plant J. 26: 295–305. ASHBY, W.R., 1956. An Introduction to Cybernetics. Chapman and Hall, London. AXTELL, M.J., SNYDER, J.A., and BARTEL, D.P., 2007. Common functions for diverse small RNAs of land plants. Plant Cell 19: 1750–1769. AZAD, A.A., FAILLA, P., and HANNA, P.J., 1998. Inhibition of ribosomal subunit association and protein synthesis by oligonucleotides corresponding to defined regions of 18S rRNA and 5S rRNA. Biochem. Biophys. Res. Commun. 248: 51–56. BALDAUF, S.L., 2003. Phylogeny for the faint of heart: a tutorial. Trends Genet. 19: 345–351. BALDAUF, S.L., PALMER, J.D., and DOOLITTLE, W.F., 1996. The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny. Proc. Natl. Acad. Sci. USA 93: 7749–7754. BAN, N., NISSEN, P., HANSEN, J., MOORE, P.B., and STEITZ, T.A., 2000. The complete atomic structure of the large ribosomal subunit at 2.4 Å resolution. Science 289: 905–920. BARCISZEWSKA, M.Z., SZYMANSKI, M., ERDMANN, V.A., and BARCISZEWSKI, J., 2000. 5S ribosomal RNA. Biomacromolecules 1: 297–302. BARCISZEWSKA, M.Z., SZYMANSKI, M., ERDMANN, V.A., and BARCISZEWSKI, J., 2001. Structure and functions of 5S rRNA. Acta Biochim. Pol. 48: 191–198. BARNES, S.M., DELWICHE, C.F., PALMER, J.D., and PACE, N.R., 1996. Perspectives on archaeal diversity, thermophily and monophyly from environmental rRNA sequences. Proc. Natl. Acad. Sci. USA 93: 9188–9193. BEREZIKOV, E., THUEMMLER, F., van LAAKE, L.W., KONDOVA, I., BONTROP, R., CUPPEN, E., and PLASTERK, R.H., 2006. Diversity of microRNAs in human and chimpanzee brain. Nat. Genet. 38: 1375–1377. BILLOUD, B., GUERRUCCI, M.A., MASSELOT, M., and DEUTSCH, J.S., 2000. Cirripede phylogeny using a novel approach: molecular morphometrics. Mol. Biol. Evol. 17: 1435–1445. BIRKENHEUER, A.J., BREITSCHWERDT, E.B., ALLEMAN, A.R., and PITULLE, C., 2002.
Differentiation of Haemobartonella canis and Mycoplasma haemofelis on the basis of comparative analysis of gene sequences. Am. J. Vet. Res. 63: 1385–1388. BLOCH, D.P., MCARTHUR, B., and MIRROP, S., 1985. tRNA–rRNA sequence homologies: evidence for an ancient modular format shared by tRNAs and rRNAs. Biosystems 17: 209–225. BOGDANOV, A.A., DONTSOVA, O.A., DOKUDOVSKAYA, S.S., and LAVRIK, I.N., 1995. Structure and function of 5S rRNA in the ribosome. Biochem. Cell Biol. 73: 869–876. BRENNECKE, J., ARAVIN, A.A., STARK, A., DUS, M., KELLIS, M., SACHIDANANDAM, R., and HANNON, G.J., 2007. Discrete small RNA-generating loci as master regulators of transposon activity in Drosophila. Cell 128: 1089–1103. BRINKMANN, H. and PHILIPPE, H., 1999. Archaea sister group of Bacteria? Indications from tree reconstruction artifacts in ancient phylogenies. Mol. Biol. Evol. 16: 817–825.

BROOKS, D.J., FRESCO, J.R., LESK, A.M., and SINGH, M., 2002. Evolution of amino acid frequencies in proteins over deep time: inferred order of introduction of amino acids into the genetic code. Mol. Biol. Evol. 19: 1645–1655. BROWN, J.W., 1999. The ribonuclease P database. Nucleic Acids Res. 27: 314. BROWN, J.R. and DOOLITTLE, W.F., 1995. Root of the universal tree of life based on ancient aminoacyl-tRNA synthetase gene duplications. Proc. Natl. Acad. Sci. USA 92: 2441–2445. BRYANT, H.N., 1991. The polarization of character transformations in phylogenetic systematics: role of axiomatic and auxiliary assumptions. Syst. Zool. 40: 433–445. BULL, A.T., GOODFELLOW, M., and SLATER, J.H., 1992. Biodiversity as a source of innovation in biotechnology. Annu. Rev. Microbiol. 46: 219–252. CAETANO-ANOLLÉS, G., 2000. Evolution of RNA secondary structure and genome size in the grasses. Plant Mol. Biol. Rep. 18: S21–S23. CAETANO-ANOLLÉS, G., 2001. Novel strategies to study the role of mutation and nucleic acid structure in evolution. Plant Cell Tiss. Org. Cult. 67: 115–132. CAETANO-ANOLLÉS, G., 2002a. Evolved RNA secondary structure and the rooting of the universal tree of life. J. Mol. Evol. 54: 333–345. CAETANO-ANOLLÉS, G., 2002b. Tracing the evolution of RNA structure in ribosomes. Nucleic Acids Res. 30: 2575–2587. CAETANO-ANOLLÉS, G., 2005. Grass evolution inferred from chromosomal rearrangements and geometrical and statistical features in RNA structure. J. Mol. Evol. 60: 635–652. CAETANO-ANOLLÉS, G. and CAETANO-ANOLLÉS, D., 2003. An evolutionarily structured universe of protein architecture. Genome Res. 13: 1563–1571. CAETANO-ANOLLÉS, G., KIM, H.S., and MITTENTHAL, J.E., 2007. The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture. Proc. Natl. Acad. Sci. USA 104: 9358–9363. CAETANO-ANOLLÉS, G., SUN, F.-J., WANG, M., YAFREMAVA, L.S., HARISH, A., KIM, H.S., KNUDSEN, V., CAETANO-ANOLLÉS, D., and MITTENTHAL, J.E., 2008. Origins and evolution of modern biochemistry: insights from genomes and molecular structure. Front. Biosci. 13: 5212–5240. CAETANO-ANOLLÉS, G., WANG, M., CAETANO-ANOLLÉS, D., and MITTENTHAL, J.E., 2009. The origin, evolution and structure of the protein world. Biochem. J. 417: 621–637. CATE, J.H., YUSUPOV, M.M., YUSUPOVA, G.Z., EARNEST, T.N., and NOLLER, H.F., 1999. X-ray crystal structures of 70S ribosome functional complexes. Science 285: 2095–2104. CHEN, J.-L. and PACE, N.R., 1997. Identification of the universally conserved core of ribonuclease P RNA. RNA 3: 557–560. CHRISTIAN, E.L., ZAHLER, N.H., KAYE, N.M., and HARRIS, M.E., 2002. Analysis of substrate recognition by the ribonucleoprotein endonuclease RNase P. Methods 28: 307–322. COLLINS, L.J., MOULTON, V., and PENNY, D., 2000. Use of RNA secondary structure for studying the evolution of RNase P and RNase MRP. J. Mol. Evol. 51: 194–204.

DALEVI, D., HUGENHOLTZ, P., and BLACKALL, L.L., 2001. A multiple-outgroup approach to resolving division-level phylogenetic relationships using 16S rDNA data. Int. J. Syst. Evol. Microbiol. 51: 385–391. DARNELL, J.E., JR., 1978. On the origin of prokaryotes. Science 202: 1257–1260. De RIJK, P., Van de PEER, Y., Van den BROECK, I., and De WACHTER, R., 1995. Evolution according to large ribosomal subunit RNA. J. Mol. Evol. 41: 366–375. DERAGON, J.M. and ZHANG, X., 2006. Short interspersed elements (SINEs) in plants: origin, classification and use as phylogenetic markers. Syst. Biol. 55: 949–956. DEWANNIEUX, M., ESNAULT, C., and HEIDMANN, T., 2003. LINE-mediated retrotransposition of marked Alu sequences. Nat. Genet. 35: 41–48. Di GIULIO, M., 2000. The RNA world, the genetic code and the tRNA molecule. Trends Genet. 16: 17–18. Di GIULIO, M., 2007. The tree of life might be rooted in the branch leading to Nanoarchaeota. Gene 401: 108–113. DICK, T.P. and SCHAMEL, W.W.A., 1995. Molecular evolution of transfer RNA from two precursor hairpins: implications for the origin of protein synthesis. J. Mol. Evol. 41: 1–9. DING, Y. and LAWRENCE, C.E., 1999. A Bayesian statistical algorithm for secondary structure prediction. Comput. Chem. 23: 387–400. DOKUDOVSKAYA, S., DONTSOVA, O., SHPANCHENKO, O., BOGDANOV, A., and BRIMACOMBE, R., 1996. Loop IV of 5S ribosomal RNA has contacts both to domain II and to domain V of the 23S RNA. RNA 2: 146–152. DOOLITTLE, W.F., 1978. Gene in pieces: were they ever together. Nature 272: 581–582. DOYLE, J.A., 2006. Seed ferns and the origin of angiosperms. J. Torrey Bot. Soc. 133: 169–209. EIGEN, M. and WINKLER-OSWATITSCH, R., 1981a. Transfer-RNA, an early gene? Naturwissenschaften 68: 282–292. EIGEN, M. and WINKLER-OSWATITSCH, R., 1981b. Transfer-RNA: the early adaptor. Naturwissenschaften 68: 217–228. EVANS, D., MARQUEZ, S.M., and PACE, N.R., 2006. RNase P: interface of the RNA and protein worlds. Trends Biochem. Sci. 31: 333–341. FELSENSTEIN, J., 1988.
Phylogenies from molecular sequences: inference and reliability. Annu. Rev. Genet. 22: 521–565. FITCH, W.M. and UPPER, K., 1987. The phylogeny of tRNA sequences provides evidence for ambiguity reduction in the origin of the genetic code. Cold Spring Harb. Symp. Quant. Biol. 52: 759–767. FONTANA, W., 2002. Modelling “evo–devo” with RNA. BioEssays 24: 1164–1177. FONTANA, W. and SCHUSTER, P., 1998. Continuity in evolution: on the nature of transitions. Science 280: 1451–1455. FORSDYKE, D.R., 2007. Calculation of folding energies of single-stranded nucleic acid sequences: conceptual issues. J. Theor. Biol. 248: 745–753. FORTERRE, P. and PHILIPPE, H., 1999. Where is the root of the universal tree of life. BioEssays 21: 871–879. FRANK, J. and AGRAWAL, R.K., 2000. A ratchet-like intersubunit reorganization of the ribosome during translocation. Nature 406: 318–322.

FRANK, D.N. and PACE, N.R., 1998. Ribonuclease P: unity and diversity in a tRNA processing ribozyme. Annu. Rev. Biochem. 67: 153–180. FRAUTSCHI, S., 1982. Entropy in an expanding universe. Science 217: 593–599. GILBERT, N. and LABUDA, D., 2000. Evolutionary inventions and continuity of CORE-SINEs in mammals. J. Mol. Biol. 298: 365–377. GLADYSHEV, G.P. and ERSHOV, Y.A., 1982. Principles of the thermodynamics of biological systems. J. Theor. Biol. 94: 301–343. GRAJALES, A., AGUILAR, C., and SÁNCHEZ, J.A., 2007. Phylogenetic reconstruction using secondary structures of Internal Transcribed Spacer 2 (ITS2, rDNA): finding the molecular and morphological gap in Caribbean gorgonian corals. BMC Evol. Biol. 7: 90. GUERRIER-TAKADA, C. and ALTMAN, S., 1992. Reconstitution of enzymatic activity from fragments of M1 RNA. Proc. Natl. Acad. Sci. USA 89: 1266–1270. GUERRIER-TAKADA, C., GARDINER, K., MARSH, T., PACE, N.R., and ALTMAN, S., 1983. The RNA moiety of ribonuclease P is the catalytic subunit of the enzyme. Cell 35: 849–857. GULTYAEV, P.A., van BATENBURG, F.H.D., and PLEIJ, C.W.A., 2002. Selective pressures on RNA hairpins in vivo and in vitro. J. Mol. Evol. 54: 1–8. HAAS, E.S. and BROWN, J.W., 1998. Evolutionary variation in bacterial RNase P RNAs. Nucleic Acids Res. 26: 4093–4099. HARRIS, M.E. and CHRISTIAN, E.L., 2003. Recent insights into the structure and function of the ribonucleoprotein enzyme ribonuclease P. Curr. Opin. Struct. Biol. 13: 325–333. HARRIS, J.K., HAAS, E.S., WILLIAMS, D., FRANK, D.N., and BROWN, J.W., 2001. New insight into RNase P RNA structure from comparative analysis of archaeal RNA. RNA 7: 220–232. HARTMANN, E. and HARTMANN, R.K., 2003. The enigma of ribonuclease P evolution. Trends Genet. 19: 561–569. HECHT, M.H., DAS, A., GO, A., BRADLEY, L.H., and WEI, Y.N., 2004. De novo proteins from designed combinatorial libraries. Protein Sci. 13: 1711–1723. HIGGS, P.G., 1993. RNA secondary structure: a comparison of real and random sequences. J. Phys. I France 3: 43–59. HIGGS, P.G., 1995. Thermodynamic properties of transfer RNA: a computational study. J. Chem. Soc. Faraday Trans. 91: 2531–2540. HIGGS, P.G., 2000. RNA secondary structure: physical and computational aspects. Quart. Rev. Biophys. 33: 199–253. HIGGS, P.G., JAMESON, D., JOW, H., and RATTRAY, M., 2003. The evolution of tRNA-Leu genes in animal mitochondrial genomes. J. Mol. Evol. 57: 435–445. HOFACKER, I.L., 2003. Vienna RNA secondary structure server. Nucleic Acids Res. 31: 3429–3431. HOLMBERG, L. and NYGARD, O., 2000. Release of ribosome-bound 5S rRNA upon cleavage of the phosphodiester bond between nucleotides A54 and A55 in 5S rRNA. Biol. Chem. 381: 1041–1046. HOLZMANN, J., FRANK, P., LÖFFLER, E., BENNETT, K.L., GERNER, C., and ROSSMANITH, W., 2008. RNase P without RNA: identification and functional reconstitution of the human mitochondrial tRNA processing enzyme. Cell 135: 462–474. HOPFIELD, J.J., 1978. Origin of the genetic code: a testable hypothesis based on tRNA structure, sequence, and kinetic proofreading. Proc. Natl. Acad. Sci. USA 75: 4334–4338. HOUWING, S., KAMMINGA, L.M., BEREZIKOV, E., CRONEMBOLD, D., GIRARD, A., van den ELST, H., FILIPPOV, D.V., BLASER, H., RAZ, E., MOENS, C.B., PLASTERK, R.H., HANNON, G.J., DRAPER, B.W., and KETTING, R.F., 2007. A role for Piwi and piRNAs in germ cell maintenance and transposon silencing in zebrafish. Cell 129: 69–82. HUELSENBECK, J.P. and RONQUIST, F., 2001. MRBAYES: Bayesian inference of phylogeny. Bioinformatics 17: 754–755. HUYNEN, M., GUTELL, R., and KONINGS, D., 1997. Assessing the reliability of RNA folding using statistical mechanics. J. Mol. Biol. 267: 1104–1112. JORDAN, I.K., KONDRASHOV, F.A., ADZHUBEI, I.A., WOLF, Y.I., KOONIN, E.V., KONDRASHOV, A.S., and SUNYAEV, S., 2005. A universal trend of amino acid gain and loss in protein evolution. Nature 433: 633–638. JÜHLING, F., MÖRL, M., HARTMANN, R.K., SPRINZL, M., STADLER, P.F., and PÜTZ, J., 2009. tRNAdb 2009: compilation of tRNA sequences and tRNA genes. Nucleic Acids Res. 37: D159–D162. JURKA, J., 1995. Origin and evolution of Alu repetitive elements. In The Impact of Short Interspersed Elements (SINEs) on the Host Genome (ed. R. Maraia). R.G. Landes, Springer-Verlag, New York, pp. 25–41. KAPITONOV, V.V. and JURKA, J., 2003. A novel class of SINE elements derived from 5S rRNA. Mol. Biol. Evol. 20: 694–702. KARWAN, R., PLUK, H., and van VENROOIJ, W.J., 1995. RNase MRP/RNase P Systems. Mol. Biol. Rep. 22 (2–3): 67–200. KASSCHAU, K.D., FAHLGREN, N., CHAPMAN, E.J., SULLIVAN, C.M., CUMBIE, J.S., GIVAN, S.A., and CARRINGTON, J.C., 2007. Genome-wide profiling and analysis of Arabidopsis siRNAs. PLoS Biol. 5: e57. KAUFFMANN, S.A., 1993. The Origins of Order. Oxford University Press, New York. KAWAGOE-TAKAKI, H., NAMEKI, N., KAJIKAWA, M., and OKADA, N., 2006. Probing the secondary structure of salmon SmaI SINE RNA. Gene 365: 67–73. KAZANTSEV, A.V. and PACE, N.R., 2006. Bacterial RNase P: a new view of an ancient enzyme. Nat. Rev. Microbiol. 4: 729–740. KAZANTSEV, A.V., KRIVENKO, A.A., HARRINGTON, D.J., HOLBROOK, S.R., ADAMS, P.D., and PACE, N.R., 2005. Crystal structure of a bacterial ribonuclease P RNA. Proc. Natl. Acad. Sci. USA 102: 13392–13397. KIKOVSKA, E., SVÄRD, S.G., and KIRSEBOM, L.A., 2007. Eukaryotic RNase P RNA mediates cleavage in the absence of protein. Proc. Natl. Acad. Sci. USA 104: 2062–2067. KNUDSEN, V. and CAETANO-ANOLLÉS, G., 2008. NOBAI: a web server for character coding of geometrical and statistical
features in RNA structure. Nucleic Acids Res. 36: W85–W90. KOONIN, E.V., MUSHEGIAN, A.R., GALPERIN, M.Y., and WALKER, D.R., 1997. Comparison of archaeal and bacterial genomes: computer analysis of protein sequences predicts novel functions and suggests a chimeric origin for the Archaea. Mol. Microbiol. 25: 619–637. KOUVELA, E., GERBANAS, G.V., XAPLANTERI, M.A., PETROPOULOS, A.D., DINOS, G.P., and KALPAXIS, D.L., 2007. Changes in the conformation of 5S rRNA cause alternations in principal functions of the ribosomal nanomachine. Nucleic Acids Res. 35: 5108–5119. KRAMEROV, D.A. and VASSETZKY, N.S., 2005. Short retroposons in eukaryotic genomes. Int. Rev. Cytol. 247: 165–221. KRASILNIKOV, A.S., YANG, X., PAN, T., and MONDRAGÓN, A., 2003. Crystal structure of the specificity domain of ribonuclease P. Nature 421: 760–764. KRASILNIKOV, A.S., XIAO, Y., PAN, T., and MONDRAGÓN, A., 2004. Basis for structural diversity in homologous RNAs. Science 306: 104–107. LAYZER, D., 1970. Cosmic evolution and thermodynamic irreversibility. Pure Appl. Chem. 22: 457–468. LE, S.-Y. and MAIZEL, J.V., 1989. A method for assessing the statistical significance of RNA folding. J. Theor. Biol. 138: 495–510. LEFFERS, H., KJEMS, J., OSTERGAARD, L., LARSON, N., and GARRETT, R.A., 1987. Evolutionary relationships amongst archaebacteria: a comparative study of 23S ribosomal RNAs of a sulphur-dependent extreme thermophile, an extreme halophile and a thermophilic methanogen. J. Mol. Biol. 195: 43–61. LORIA, A. and PAN, T., 1996. Domain structure of the ribozyme from eubacterial ribonuclease P. RNA 2: 551–563. LU, C., MEYERS, B.C., and GREEN, P.J., 2007. Construction of small RNA cDNA libraries for deep sequencing. Methods 43: 110–117. MADDISON, D.R. and MADDISON, W.P., 2003. MacClade Version 4: Analysis of Phylogeny and Character Evolution. Sinauer Associates, Sunderland, MA. MADDISON, W.P., DONOGHUE, M.J., and MADDISON, D.R., 1984. Outgroup analysis and parsimony. Syst. Zool. 33: 83–103. MAIZELS, N. and WEINER, A.M., 1994. Phylogeny from function: evidence from the molecular fossil record that tRNA originated in replication, not translation. Proc. Natl. Acad. Sci. USA 91: 6729–6734. MANAM, S. and Van TUYLE, G.C., 1987. Separation and characterization of 5′- and 3′-tRNA processing nucleases from rat liver mitochondria. J. Biol. Chem. 262: 10272–10279. MATHEWS, D.H., SABINA, J., ZUKER, M., and TURNER, D.H., 1999. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 288: 911–940. MOBLEY, E.M. and PAN, T., 1999. Design and isolation of ribozyme substrate pairs using RNase P-based ribozymes containing altered substrate binding sites. Nucleic Acids Res. 27: 4298–4304.

MORAN, J. and GILBERT, N., 2002. Mammalian LINE-1 retrotransposons and related elements. In Mobile DNA II (eds N. L. Craig, R. Cragie, M. Gellert, and A.M. Lambowitz). ASM Press, Washington DC, pp. 836–869.
NISHIHARA, H., SMIT, A.F.A., and OKADA, N., 2006. Functional noncoding sequences derived from SINEs in the mammalian genome. Genome Res. 16: 864–874.
O’BRIEN, M.J., LYMAN, R.L., SAAB, Y., SAAB, E., DARWENT, J., and GLOVER, D.S., 2002. Two issues in archaeological phylogenetics: taxon construction and outgroup selection. J. Theor. Biol. 215: 133–150.
OGIWARA, I., MIYA, M., OHSHIMA, K., and OKADA, N., 2002. VSINEs: a new superfamily of vertebrate SINEs that are widespread in vertebrate genomes and retain a strongly conserved segment within each repetitive unit. Genome Res. 12: 316–324.
OHSHIMA, K. and OKADA, N., 2005. SINEs and LINEs: symbionts of eukaryotic genomes with a common tail. Cytogenet. Genome Res. 110: 475–490.
OKADA, N. and OHSHIMA, K., 1995. Evolution of tRNA-derived SINEs. In The Impact of Short Interspersed Elements (SINEs) on the Host Genome (ed. R. Maraia). R.G. Landes, Springer-Verlag, New York, pp. 61–80.
OLSEN, G.J., WOESE, C.R., and OVERBEEK, R., 1994. The winds of (evolutionary) change: breathing new life into microbiology. J. Bacteriol. 176: 1–6.
PACE, N.R. and BROWN, J.W., 1995. Evolutionary perspective on the structure and function of ribonuclease P, a ribozyme. J. Bacteriol. 177: 1919–1928.
PAN, T., 1995. Higher order folding and domain analysis of the ribozyme from Bacillus subtilis ribonuclease P. Biochemistry 34: 902–909.
PAN, T., LORIA, A., and ZHONG, K., 1995. Probing of tertiary interactions in RNA: 2′-hydroxyl-base contacts between the RNase P RNA and pre-tRNA. Proc. Natl. Acad. Sci. USA 92: 12510–12514.
PANNUCCI, J.A., HAAS, E.S., HALL, T.A., HARRIS, J.K., and BROWN, J.W., 1999. RNase P RNAs from some Archaea are catalytically active. Proc. Natl. Acad. Sci. USA 96: 7803–7808.
PÉLISSIER, T., BOUSQUET-ANTONELLI, C., LAVIE, L., and DERAGON, J.-M., 2004. Synthesis and processing of tRNA-related SINE transcripts in Arabidopsis thaliana. Nucleic Acids Res. 32: 3957–3966.
PENNY, D. and POOLE, A., 1999. The nature of the last universal common ancestor. Curr. Opin. Genet. Dev. 9: 672–677.
PISKUREK, O., NIKAIDO, M., BOEADI, BABA, M., and OKADA, N., 2003. Unique mammalian tRNA-derived repetitive elements in dermopterans: the t-SINE family and its retrotransposition through multiple sources. Mol. Biol. Evol. 20: 1659–1668.
POLLOCK, D., 2003. The Zuckerkandl Prize: structure and evolution. J. Mol. Evol. 56: 375–376.
PULUKKUNAT, D.K. and GOPALAN, V., 2008. Studies on Methanocaldococcus jannaschii RNase P reveal insights into the roles of RNA and protein cofactors in RNase P catalysis. Nucleic Acids Res. 36: 4172–4180.

REANNEY, D.C., 1974. On the origin of prokaryotes. J. Theor. Biol. 48: 243–251.
RONQUIST, F. and HUELSENBECK, J.P., 2003. MRBAYES 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19: 1572–1574.
ROSSMANITH, W. and KARWAN, R.M., 1998. Characterization of human mitochondrial RNase P: novel aspects in tRNA processing. Biochem. Biophys. Res. Commun. 247: 234–241.
ROTA-STABELLI, O. and TELFORD, M.J., 2008. A multi criterion approach for the selection of optimal outgroups in phylogeny: recovering some support for Mandibulata over Myriochelata using mitogenomics. Mol. Phylogenet. Evol. 48: 103–111.
ROZHDESTVENSKY, T.S., KOPYLOV, A.M., and HÜTTENHOFER, A., 2001. Neuronal BC1 RNA structure: evolutionary conversion of a tRNA (Ala) domain into an extended stem-loop structure. RNA 7: 722–730.
RUBY, J.G., JAN, C., PLAYER, C., AXTELL, M.J., LEE, W., NUSBAUM, C., GE, H., and BARTEL, D.P., 2006. Large-scale sequencing reveals 21U-RNAs and additional microRNAs and endogenous siRNAs in C. elegans. Cell 127: 1193–1207.
SALAVATI, R., PANIGRAHI, A.K., and STUART, K.D., 2001. Mitochondrial ribonuclease activity of Trypanosoma brucei. Mol. Biochem. Parasit. 115: 109–117.
SCHNEIDER, E.D. and KAY, J.J., 1994a. Life as a manifestation of the second law of thermodynamics. Math. Comp. Model. 19: 25–48.
SCHNEIDER, E.D. and KAY, J.J., 1994b. Complexity and thermodynamics: towards a new ecology. Futures 26: 626–647.
SCHRÖDINGER, E., 1944. What is life? Cambridge University Press, Cambridge.
SCHULTES, E.A. and BARTEL, D.P., 2000. One sequence, two ribozymes: implications for the emergence of new ribozyme folds. Science 289: 448–452.
SCHULTES, E.A., HRABER, P.T., and LABEAN, T.H., 1999. Estimating the contributions of selection and self-organization in RNA secondary structure. J. Mol. Evol. 49: 76–83.
SCHULTES, E.A., SPASIC, A., MOHANTY, U., and BARTEL, D.P., 2005. Compact and ordered collapse of randomly generated RNA sequences. Nat. Struct. Mol. Biol. 12: 1130–1136.
SCHUSTER, P., FONTANA, W., STADLER, P.F., and HOFACKER, I.L., 1994. From sequences to shapes and back: a case study in RNA secondary structure. Proc. R. Soc. Lond. Ser. B 255: 279–284.
SCHUWIRTH, B.S., BOROVINSKAYA, M.A., HAU, C.W., ZHANG, W., VILA-SANJURJO, A., HOLTON, J.M., and CATE, J.H.D., 2005. Structures of the bacterial ribosome at 3.5 Å resolution. Science 310: 827–834.
SELMER, M., DUNHAM, C.M., MURPHY, F.V. IV, WEIXLBAUMER, A., PETRY, S., KELLEY, A.C., WEIR, J.R., and RAMAKRISHNAN, V., 2006. Structure of the 70S ribosome complexed with mRNA and tRNA. Science 313: 1935–1942.
SIEGEL, R.W., BANTA, A.B., HAAS, E.S., BROWN, J.W., and PACE, N.R., 1996. Mycoplasma fermentans simplifies our view of the catalytic core of ribonuclease P RNA. RNA 2: 452–462.
SPRINZL, M. and VASSILENKO, K.S., 2005. Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res. 33: D139–D140.
STEEL, M. and PENNY, D., 2000. Parsimony, likelihood, and the role of models in molecular phylogenetics. Mol. Biol. Evol. 17: 839–850.
STEFFENS, W. and DIGBY, D., 1999. mRNA have greater negative folding free energies than shuffled or codon choice randomized sequences. Nucleic Acids Res. 27: 1578–1584.
STEGGER, G., HOFMAN, H., FÖRTSCH, J., GROSS, H.J., RANDLES, J.W., SÄNGER, H.L., and RIESNER, D., 1984. Conformational transitions in viroids and virusoids: comparison of results from energy minimization algorithm and from experimental data. J. Biomol. Struct. Dynam. 2: 543–571.
SUN, F.-J. and CAETANO-ANOLLÉS, G., 2008a. The origin and evolution of tRNA inferred from phylogenetic analysis of structure. J. Mol. Evol. 66: 21–35.
SUN, F.-J. and CAETANO-ANOLLÉS, G., 2008b. Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses. PLoS Comput. Biol. 4: e1000018.
SUN, F.-J. and CAETANO-ANOLLÉS, G., 2008c. Evolutionary patterns in the sequence and structure of transfer RNA: a window into early translation and the genetic code. PLoS ONE 3: e2799.
SUN, F.-J. and CAETANO-ANOLLÉS, G., 2008d. The evolutionary significance of the long variable arm in transfer RNA. Complexity 14 (5): 26–39.
SUN, F.-J. and CAETANO-ANOLLÉS, G., 2008e. Transfer RNA and the origins of diversified life. Sci. Prog. 91: 265–284.
SUN, F.-J. and CAETANO-ANOLLÉS, G., 2009. The evolutionary history of the structure of 5S ribosomal RNA. J. Mol. Evol. DOI 10.1007/s00239-009-9264-z.
SUN, F.-J., FLEURDÉPINE, S., BOUSQUET-ANTONELLI, C., CAETANO-ANOLLÉS, G., and DERAGON, J.-M., 2007. Common evolutionary trends for SINE RNA structures. Trends Genet. 23: 26–33.
SWAIN, T.D. and TAYLOR, D.J., 2003. Structural rRNA characters support monophyly of raptorial limbs and paraphyly of limb specialization in water fleas. Proc. R. Soc. Lond. Ser. B 270: 887–896.
SWOFFORD, D.L., 2003. PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods), Version 4. Sinauer Associates, Sunderland, MA.
SWOFFORD, D.L., OLSEN, G.J., WADDELL, P.J., and HILLIS, D.M., 1996. Phylogenetic inference. In Molecular Systematics, 2nd edition (eds D.M. Hillis, C. Moritz, and B.K. Mable). Sinauer Associates, Sunderland, MA, pp. 407–514.
SZATHMÁRY, E., 1999. The origin of the genetic code: amino acids as cofactors in an RNA world. Trends Genet. 15: 223–229.
SZYMANSKI, M., BARCISZEWSKA, M.Z., ERDMANN, V.A., and BARCISZEWSKI, J., 2002. 5S ribosomal RNA database. Nucleic Acids Res. 30: 176–178.

SZYMANSKI, M., BARCISZEWSKA, M.Z., ERDMANN, V.A., and BARCISZEWSKI, J., 2003. 5S rRNA: structure and interactions. Biochem. J. 371: 641–651.
TANAKA, T. and KIKUCHI, Y., 2001. Origin of the cloverleaf shape of transfer RNA—the double-hairpin model: implication for the role of tRNA intron and the long extra loop. Viva Origino 29: 134–142.
TAPP, J., THOLLESSON, M., and HERRMANN, B., 2003. Phylogenetic relationships and genotyping of the genus Streptococcus by sequence determination of the RNase P RNA gene, rnpB. Int. J. Syst. Evol. Microbiol. 53: 1861–1871.
THOMAS, B.C., CHAMBERLAIN, J., ENGELKE, D.R., and GEGENHEIMER, P., 2000a. Evidence for an RNA-based catalytic mechanism in eukaryotic nuclear ribonuclease P. RNA 6: 554–562.
THOMAS, B.C., LI, X., and GEGENHEIMER, P., 2000b. Chloroplast ribonuclease P does not utilize the ribozyme-type pre-tRNA cleavage mechanism. RNA 6: 545–553.
TORRES-LARIOS, A., SWINGER, K.K., PAN, T., and MONDRAGÓN, A., 2006. Structure of RNase P, a universal ribozyme. Curr. Opin. Struct. Biol. 16: 327–335.
TRIFONOV, E.N., 2004. The triplet code from first principles. J. Biomol. Struct. Dyn. 22: 1–11.
WAGNER, A., 2008. Robustness and evolvability: a paradox resolved. Proc. R. Soc. Lond. Ser. B 275: 91–100.
WALKER, S.C. and ENGELKE, D.R., 2006. Ribonuclease P: the evolution of an ancient RNA enzyme. Crit. Rev. Biochem. Mol. 41: 77–102.
WANG, M. and CAETANO-ANOLLÉS, G., 2006. Global phylogeny determined by the combination of protein domains in proteomes. Mol. Biol. Evol. 23: 2444–2454.
WANG, M. and CAETANO-ANOLLÉS, G., 2009. The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world. Structure 17: 66–78.
WANG, M.J., DAVIS, N.W., and GEGENHEIMER, P., 1988. Novel mechanisms for maturation of chloroplast transfer RNA precursors. EMBO J. 7: 1567–1574.
WANG, M., YAFREMAVA, L.S., CAETANO-ANOLLÉS, D., MITTENTHAL, J.E., and CAETANO-ANOLLÉS, G., 2007. Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world. Genome Res. 17: 1572–1585.
WEINER, A.M. and MAIZELS, N., 1987. tRNA-like structures tag the 3′ ends of genomic RNA molecules for replication: implications for the origin of protein synthesis. Proc. Natl. Acad. Sci. USA 84: 7383–7387.
WHEELER, W.C., 1990. Combinational weights in phylogenetic analysis: a statistical parsimony procedure. Cladistics 6: 269–275.
WIDMANN, J., Di GIULIO, M., YARUS, M., and KNIGHT, R., 2005. tRNA creation by hairpin duplication. J. Mol. Evol. 61: 524–535.
WOESE, C.R., 1969. The biological significance of the genetic code. Prog. Mol. Subcell. Biol. 1: 5–46.
WONG, J.T., 1975. A coevolution theory of the genetic code. Proc. Natl. Acad. Sci. USA 72: 1909–1912.

WONG, J.T.-F., CHEN, J., MAT, W.-K., NG, S.-K., and XUE, H., 2007. Polyphasic evidence delineating the root of life and roots of biological domains. Gene 403: 39–52.
WRIGHT, S., 1932. The roles of mutation, inbreeding, crossbreeding and selection in evolution. Proceedings of the Sixth International Congress of Genetics, Vol. 1, pp. 356–366.
WUYTS, J., PERRIERE, G., and Van de PEER, Y., 2004. The European ribosomal RNA database. Nucleic Acids Res. 32: D101–D103.
XUE, H., TONG, K.-L., MARCK, C., GROSJEAN, H., and WONG, J.T.-F., 2003. Transfer RNA paralogs: evidence for genetic code–amino acid biosynthesis coevolution and an archaeal root of life. Gene 310: 59–66.
XUE, H., NG, S.-K., TONG, K.-L., and WONG, J.T.-F., 2005. Congruence of evidence for a Methanopyrus-proximal root of life based on transfer RNA and aminoacyl-tRNA synthetase genes. Gene 360: 120–130.
YUSUPOV, M.M., YUSUPOVA, G.Z., BAUCOM, A., LIEBERMAN, K., EARNEST, T.N., CATE, J.H.D., and NOLLER, H.F., 2001. Crystal structure of the ribosome at 5.5 Å resolution. Science 292: 883–896.
ZHAO, T., LI, G., MI, S., LI, S., HANNON, G.J., WANG, X.J., and QI, Y., 2007. A complex system of small RNAs in the unicellular green alga Chlamydomonas reinhardtii. Genes Dev. 21: 1190–1203.
ZHU, W. and FREELAND, S., 2006. The standard genetic code enhances adaptive evolution of proteins. J. Theor. Biol. 239: 63–70.
ZHU, Y., PULUKKUNAT, D.K., and LI, Y., 2007. Deciphering RNA structural diversity and systematic phylogeny from microbial metagenomes. Nucleic Acids Res. 35: 2283–2294.
ZUCKERKANDL, E., DERANCOURT, J., and VOGEL, H., 1971. Mutational trends and random processes in the evolution of informational macromolecules. J. Mol. Biol. 59: 473–490.
ZUKER, M., 2003. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 31: 3406–3415.

Part III

Evolution of Biological Networks

Chapter 17

A Hitchhiker’s Guide to Evolving Networks

Charles G. Kurland and Otto G. Berg

17.1 INTRODUCTION
17.2 PHYLOGENETIC CONTINUITIES, BIOLOGICAL COHERENCE
17.3 NESTED STRUCTURAL NETWORKS
17.4 OPTIMAL NETWORKS
17.5 THE EMPEROR’S BLAST SEARCH REVISITED
17.6 WILL THE REAL MISSING LINK PLEASE STAND UP?
17.7 ALL’S WELL
ACKNOWLEDGMENTS
REFERENCES

17.1 INTRODUCTION

No two members of the same reproducing population are likely to be identical nor are the feathers, face, proteins, and genome of a parent necessarily the same as those of its progeny. Substitutions, sequence rearrangements, insertions, deletions, and alien transfers are continuously impacting successive generations of organisms. We may wonder how under these circumstances a tree of life can coherently represent the diversity that is apparent in our biosphere. Or, paraphrasing one skeptic (Doolittle, 1999), is noise the essence of the evolutionary signal? Even the notion that random mutation could have generated the enormous diversity of genome sequences evident in our biosphere is questionable. Rough calculations show that neither the mass nor the age of our galaxy is large enough to have generated all the unique proteins in the biosphere by testing independent random mutations (Salisbury, 1969). This calculation encouraged John Maynard Smith (1970) to abandon the assumption that random mutations drive the independent evolution of individual proteins. His alternative formulation is that proteins form a single network or space of interrelated functional sequences in which mutation has generated every point in this space from a prior functional protein (Maynard Smith, 1970).
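The scale behind such rough calculations can be illustrated with a minimal sketch. It does not reproduce Salisbury's (1969) own figures; the chain length of 100 residues and the search budget of 10^50 trials are assumptions chosen only to show how quickly sequence space outgrows any blind, independent search.

```python
import math

AMINO_ACIDS = 20   # size of the standard amino acid alphabet
LENGTH = 100       # assumed chain length, illustrative only

# Number of distinct sequences of this length, as a power of ten.
log10_sequences = LENGTH * math.log10(AMINO_ACIDS)

# Assumed, deliberately generous budget of independent random trials.
LOG10_TRIALS = 50

# Fraction of sequence space that such a budget could sample at most.
log10_fraction_sampled = LOG10_TRIALS - log10_sequences

print(f"sequences of length {LENGTH}: about 10^{log10_sequences:.0f}")
print(f"fraction sampled by 10^{LOG10_TRIALS} trials: about 10^{log10_fraction_sampled:.0f}")
```

Whatever budget is assumed, the exponent of the sampled fraction stays hugely negative, which is the sense in which independent random testing seems unable to account for the observed proteins.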

This view of a functional protein network identiﬁes a space of sequences that is responsive to purifying selection. Any sequences not stabilized by purifying selection are transients that are obliterated eventually by random mutation unless the selective factors change (Kimura, 1983; Berg and Kurland, 2002). Accordingly, it can be inferred that the phylogenetic network of protein sequences is nested in a network of functional parameters. Indeed, protein phylogeny seems robust precisely because it reﬂects the interplay of conservative structural and functional networks in which the intrusion of novelty is constrained by intense network interactions, that is to say by purifying selection.

17.1.1 Mere Words?

That being said, we deny any implication that all genome evolution is “adaptive.” Rather, we side with Lynch (2007), who has argued for the necessity of including nonadaptive events in genome evolution. We part company with him on details of his formulation. In particular, we advocate a classification that differs from that of Lynch (2007) for the parameters that sustain evolutionary change. We agree with Lynch (2007) that the mechanisms through which random mutations are generated are not adaptive as such. On the other hand, changes in the mutation frequency may be adaptive in the sense that an allele for high mutation rates may enhance the survival probabilities of cells under environmental stress (e.g., Taddei et al., 1997). Mutator alleles hitchhiking on survivors of environmental challenges provide a transparent route to the enhancement of mutation rates in cells. Mutatis mutandis, under stable conditions a lower mutation rate might reduce the load on populations and thereby improve fitness. So, in general, mutation is neither adaptive nor nonadaptive, but it is undeniably the source of genome change.

Besides adaptive forces (which we interpret to mean purifying selection), Lynch (2007) identifies three nonadaptive driving forces: mutation, recombination, and drift. We group these same parameters differently in the context of genome evolution driven by periodically varying environments and relentless mutation. First, we classify gene transfers as well as recombination events as mutations. Second, we identify three modes through which sequence changes may hitchhike through populations of genomes. In the first mode, genes or DNA sequences hitchhike on the organisms in which they produce a phenotype with a fitness advantage. In the second, they hitchhike by the diffusion of organisms in which they produce no phenotypic advantage or disadvantage. In the third, DNA sequences hitchhike on infective vehicles such as transposons to which they make no infective contribution (Orgel and Crick, 1980; Berg and Kurland, 2002; Kurland, 2005). We note that specific infective sequences augment the propagation of genome parasites, whose fitness is separable from that of their hosts. That is to say, the virulence of a genome parasite is maintained by Darwinian selection for the sequences that support infectivity, while the evolving parasite may be decidedly nonadaptive for its host.

The term “hitchhiking” is a convenient antidote to a confusing precedent set by an influential popular text. Dawkins (1976) adopted, in mutated forms, Hamilton’s (1964a,b) term “selfish” to describe genes that are favored by Darwinian natural selection. Unfortunately, Dawkins (1976) uses this term in several contradictory ways, including the sense that DNA sequences perpetuate themselves “selfishly” without regard to the phenotypes they may express (Kurland, 2005). Dawkins implies that all genes are genome parasites. However, selfish in the sense of being “parasitic” is precisely what was not meant by Hamilton (1964a,b).

Finally, we appreciate Lynch’s (2007) insistence that a meaningful theory of evolution must be consistent with population genetics. However, we are mindful that population genetics alone is not enough. The missing bits are usually packaged anonymously into the parameters describing ﬁtness. So, we explicate the links with chemistry, cell biology, and ecology that determine the parameters of population genetics commonly used to describe evolving protein networks.

17.1.2 Sideways

In the same spirit, it is useful to identify recombination as well as sequence transfer (horizontal or lateral) as distinct mutation modes. This emphasizes the hazardous character of all such intrusions that alter genome sequences, and it counters the notion that there might be something special about the changes generated in genomes by recombination or sequence transfer. So, by entering a genome “sideways,” a sequence transfer or a recombination event is still just a change of A’s, G’s, C’s and T’s.

We can illustrate this point by considering a minimal genomic sideways event. Here, we imagine that a sequence of nucleotides has been copied into a genome from an alien genome, but we stipulate that the transferred sequence has precisely the same nucleotide sequence as the original one it replaces. This must happen at some appreciable rate in nature. But how would the genome sequencer detect such an event? Why would the genome sequencer care? Our view is that such an event is phylogenetically a nonevent. We propose that it is the sequence changes themselves, and not the modes through which they arise, that may disrupt a phylogenomic signal. Accordingly, we discuss gene transfers as examples of mutation that permit us to inspect the ways that purifying selection may or may not support a robust evolutionary signal in phylogenomic networks.

17.1.3 One Man’s Glitch

Horizontal gene transfer (HGT) had been a well-established phenomenon for decades (Watson, 1965) when Doolittle (1999) presented his concerns about the phylogenetic mischief it might create. A bit later, Koonin et al. (2000) essayed the essence of Doolittle’s (1999) views as “strictly speaking, a tree can not reflect the phylogeny if there has been any horizontal gene transfer at all, but if it has been extensive the tree can become meaningless.” It might be worthwhile to entertain such a purist view if all the genome sequences in a population of organisms were singular and evolving in a continuum of smooth variation. But the reality is that genomes in a natural population are not singular (Kimura, 1987). They are instead highly heterogeneous, so that all nominal genome sequences are consensus sequences. Furthermore, sequence variation is anything but smooth. Genomes evolve discontinuously by accumulating discrete mutations. Consequently, successive generations of genomes are like the individual frames in a film that, when screened, may project an illusion of smooth transitions. But it is easy to ignore the random discontinuities that mutations work on genomes in order to retain one or another screen version of evolution. A case in point is precisely Doolittle’s (1999) version of Darwinian evolution, in which he attributes a unique disruptive influence to HGT. He does so by ignoring the erratic variability and inevitable discontinuities that arise from other mutational events that, like HGT, intrude, diffuse and, most often, disappear without pause (Kimura, 1987).

Doolittle’s (1999) critique of the Darwinian view of evolution consisted of citing a small number of glitches in gene families and then inferring that a single discontinuity in a single gene family implies a corresponding discontinuity in the phylogeny of genomes. So Doolittle (1999), citing Mayr, claims that natural phylogeny should be “inclusively hierarchical.” He continues: here “each species should be part of one and only one genus, each genus should be part of one and only one family,” etc. But where did Doolittle (1999) get the notion that genome sequences ordered within a natural phylogeny may not be marked by occasional phylogenetic discontinuities such as those following sequence transfers or other prominent mutations?

17.1.4 Three Challenges

Doolittle’s (1999) formulation projects three practical concerns into a critique of Darwinian evolution. One is that sequence transfers invalidate phylogeny. A second is that transfer likewise invalidates the species concept. The third is the replacement of a last universal common ancestor (LUCA) of all three domains of life (Woese and Fox, 1977; Woese, 1983; Woese, 1998) with a speculative scenario in which archaea and bacteria genomes fuse to make chimeric eukaryote genomes (Zillig et al., 1989; Gogarten et al., 1996; Martin, 1999). As it has turned out, none of these mooted assertions survives scrutiny.

The evolution of proteomes is in our view driven by selection for near-optimal performance of highly cooperative cellular networks competing in periodically evolving environments. Organisms are nearly always playing “catch up” with changing environments in which their network optimality is often challenged. By selecting uncommon mutations, including rare sequence transfers, evolution tracks the optimal performance of cellular networks in an evolving milieu. Among the consequences of selection for optimality are two that are most relevant to Doolittle’s (1999) mooted dilemmas. First, the intense cooperativity of cellular networks purges most mutant variants because they tend to be incompatible with network optimality (Berg and Kurland, 2002; Kurland, 2005; Pedersen et al., 2003). Accordingly, phylogeny and species remain valid constructions. Second, a general criterion of the optimality of cellular networks requires that all of their components perform at their maximum rates normalized to their mass equivalents (Ehrenberg and Kurland, 1984). This means that small and fast are good. So, under competitive conditions, loss of inessential proteins as well as nondestructive reduction of the mass investment in proteins always improve the fitness of cells.

In other words, reductive evolution of proteomes tends to lurk in the background of all evolutionary scenarios, but its intensity is quite uneven. Indeed, a huge ecological challenge, namely, the rise of phagotrophy, has been posited to account for the divergence of archaea and bacteria by reductive evolution from a common phagotrophic ancestor to modern eukaryotes (Kurland et al., 2006, 2007). Thus, LUCA reemerges as an essential phylogenetic construction to account for the common origins of the three modern domains. Data suggest that the three modern superkingdoms belong to a single monophyletic tree for which each of the three domains bears the traces of descent from a common ancestor as well as the imprint of the distinctive evolutionary adaptations that characterize each domain (Woese, 1983; Darnell, 1978; Penny and Poole, 1999; Philippe and Forterre, 1999; Kurland et al., 2006, 2007). The domain-specific adaptations include the distinctive population genetic parameters of archaea, bacteria, and eukaryotes (Berg and Kurland, 2002; Lynch, 2007) as well as specific cellular adaptations (Kurland et al., 2006, 2007; Valentine, 2007).

According to our view, network optimality and its correlate, reductive evolution, emerge as two particularly useful concepts with which to explore the origins of modern proteomes. Finally, purifying selection for network optimality leads to a degree of cooperativity between proteins such that multimutated variants, in general, and gene transfers, in particular, are often toxic to cells. Accordingly, gene transfer is useful to probe protein networks, but as a biological phenomenon its rumored significance is highly inflated.

17.2 PHYLOGENETIC CONTINUITIES, BIOLOGICAL COHERENCE

It must be said that Doolittle’s (1999) influential paradigm shift was not idiosyncratic. Rather, it faithfully reflected the prejudices of other molecular biologists who previously voiced the conviction that gene transfer is a prime mover in all facets of evolution (Gogarten et al., 1996; Lawrence and Ochman, 1997, 1998; Martin and Müller, 1998). Furthermore, Doolittle’s (1999) rhetorically coherent observations were all the more startling because of their dissonance with the genome data surfacing at the time. Genome phylogeny was emerging in a robust form while anecdotal support for pervasive gene transfer was never more than incidental (Huynen et al., 1999a, 1999b; Fitz-Gibbon and House, 1999; Tekaia et al., 1999; Snel et al., 1999; Kurland, 2000). In fact, the potential to confuse phylogeny is not unique to sequence transfer. Other more frequent mutational events can and do make similar mischief. Modern phylogenomic networks of coding sequences are quite robust only up to a point. That point of diminishing returns is reached in the deep divergences of ancient lineages. In effect, deep time not sequence transfer limits the applicability of common phylogenetic algorithms to ancient lineages. On the other hand, by working with fully sequenced genomes and by studying structurally related protein domains, some of the pitfalls of protein phylogeny can be finessed (Wang et al., 2007).

17.2.1 The Rosetta Stone

Doolittle’s (1999) critique of rRNA-based phylogeny rested on several sorts of challenges. One of these concerned minor discrepancies between protein family and rRNA phylogenies. But the contemporary development of genome phylogeny had already relegated those few discrepancies to irrelevancies (Huynen et al., 1999a, 1999b; Fitz-Gibbon and House, 1999; Tekaia et al., 1999; Snel et al., 1999). By contrast, the truly startling observation was that rRNA phylogeny and genome phylogeny were emerging in such clear accord.

A second challenge was the finding of a few archaea that apparently were the recipients of rRNA transfers from alien donors (cited in Doolittle, 1999). Though these two transgenic rRNAs stimulated Doolittle’s (1999) pique, it has been noted elsewhere that in those few sightings, the alien rRNA had been incorporated into genomes containing as well a normal set of endogenous rRNA determinants (Kurland, 2000; Kurland et al., 2003). In other words, these few incomplete rRNA transfers turn out to be and remain isolated curiosities.

The third challenge was richly informative, though perhaps not as intended. Doolittle (1999) reveled in experiments showing that functional ribosomes can be reconstituted in vitro even from distantly related rRNA and proteins (Asai et al., 1999). Doolittle’s (1999) interpretation of these observations was that sequence transfer is facile. However, while he was jubilant about the recovery of 70% of normal activity from ribosomes compounded from distantly related rRNA and protein, others were more impressed with the fact that in the match up between the two most closely related organisms, that is, between proteins from Escherichia coli and rRNA from Salmonella typhimurium, only 91% of normal E. coli activity is recovered in the hybrid (Asai et al., 1999). A loss of 9%, not to mention 30%, in translation rate would doom a recombinant ribosome to rapid extinction (Kurland et al., 2003). Even much smaller defects in the functional efficiency of the recipients of alien transfers could lead to the rapid loss of transgenic organisms in short term evolution (Kurland et al., 2003). Thus, suboptimal performance characteristics are likely to be one general reason that interspecies recombinants may be scarcer in nature than in Science.
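How quickly such a loss would be purged can be illustrated with a minimal sketch. It assumes, purely for illustration, that a fractional loss in translation rate maps one-to-one onto a selection coefficient s in a haploid, asexual population with discrete generations and no drift. That mapping, the function name, and the start and end frequencies are assumptions of this sketch, not values taken from Kurland et al. (2003).

```python
import math

def generations_to_decline(s, p_start=0.5, p_end=1e-6):
    """Generations for a variant of relative fitness (1 - s) to fall from
    frequency p_start to p_end while competing with the wild type.

    Assumed model: haploid, asexual, discrete generations, no drift, so the
    odds p / (1 - p) shrink by a factor (1 - s) every generation.
    """
    odds_start = p_start / (1.0 - p_start)
    odds_end = p_end / (1.0 - p_end)
    return math.log(odds_end / odds_start) / math.log(1.0 - s)

# Hypothetical selection coefficients: the 30% and 9% activity losses quoted
# above, plus a much smaller 1% defect for comparison.
for s in (0.30, 0.09, 0.01):
    print(f"s = {s:.2f}: about {generations_to_decline(s):.0f} generations")
```

Under these assumptions the time to near-extinction scales roughly as 1/s, so even a defect of a few percent is removed within a number of generations that is short on evolutionary time scales.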

17.2.2 Compositional Outliers

The most important early source of large-scale information about alien transfer was uncovered from compositional properties of genomes (Hooper and Berg, 2002; Medigue et al., 1991; Lawrence and Ochman, 1997, 1998; Ochman and Jones, 2000; Ochman et al., 2000). Not incidentally, their observations provided important clues about how seemingly large-scale gene transfer in bacteria could be reconciled with the robust performance of bacterial phylogenomic networks. In their approach, compositional idiosyncrasies of genomes are used to distinguish the sequences whose nucleotide and codon frequencies typify most of a genome from the rarer nonconforming sequences. The latter are the presumed transfers simply because they seem not to fit a compositional bias that is identified as a genomic consensus. One limitation of this approach is that those transfers that have persisted in genomes for long evolutionary times will gradually mutate into the consensus of the host background. They are ameliorated by the characteristic mutation frequencies, which eventually transform their alien compositions into that of the host (Lawrence and Ochman, 1998). Thus, sharply divergent compositional strings must be relatively recent transfers and therefore are probably not the signature of ancient events. All in all, the compositional approach has yielded extremely valuable information about observable sequence transfers in bacteria (Lawrence and Ochman, 1997, 1998; Ochman and Jones, 2000; Ochman et al., 2000; Hooper and Berg, 2002). Estimates for different bacterial species put the compositional outliers between 0% and 19% of the genome sequences, with a mean close to 6% (Ochman et al., 2000). On the other hand, it turns out that the sampling of different strains within a species such as E. coli reveals two crucial characteristics: One is that many of the open reading frames encode what seems to be junk, that is, partially degraded pseudogenes with recognizable sequences mostly originating from viruses, pili, etc. The other is that the identities of putative transferred sequences vary from strain to strain, which strongly suggests that transferred sequences are unstable. Lawrence and Ochman (1998) conclude that “most acquired sequences do not confer a long term selective advantage to the host and are, consequently, likely to be lost by deletion.”
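The logic of the compositional approach can be sketched in a few lines. The example below is a deliberately simplified illustration and not the procedure of the cited studies, which also draw on codon frequencies: it uses GC content as the only compositional measure, and the function names and the cutoff of two standard deviations are assumptions made for the sake of the example.

```python
from statistics import mean, pstdev


def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def compositional_outliers(genes, z_cutoff=2.0):
    """Return the names of genes whose GC content deviates from the
    genome-wide mean by more than z_cutoff standard deviations.

    genes: dict mapping a gene name to its nucleotide sequence.
    z_cutoff: hypothetical threshold, not a value from the literature.
    """
    values = {name: gc_content(seq) for name, seq in genes.items()}
    gc_values = list(values.values())
    mu = mean(gc_values)
    sigma = pstdev(gc_values)
    if sigma == 0:
        # Every gene has the same composition, so nothing stands out.
        return []
    return [name for name, gc in values.items()
            if abs(gc - mu) / sigma > z_cutoff]
```

A sequence flagged in this way is only a candidate transfer. As noted above, an old transfer that has been ameliorated toward the host composition would pass such a screen unnoticed.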

17.2.3

Phylogeny by Any Other Name

Another method exploits phylogenetic reconstruction to identify examples of sequence transfer as clades in unusual phylogenetic associations. Here, unusual might mean something as dramatic as finding nominal eukaryote sequences nested within the bacterial domain. Or, it might mean something much less pronounced such as a putative exchange between two closely related genomes. Such phylogenetic anomalies are commonly perceived as the most reliable signals of sequence transfer. That perception probably arises because phylogenetic reconstructions are visually simple and comprehensible at a glance. However, that immediacy is an illusion since phylogenetic algorithms often are based on assumptions that are not applicable. These include the assumption that sequence evolution proceeds at an even rate, with an unbiased compositional variation, and that duplications do not diverge and paralogues do not segregate within genome lineages (Lockhart et al., 1994; Galtier and Guoy, 1995; Forterre and Philippe, 1999; Galtier et al., 1999; Lopez et al., 1999; Kurland, 2000; Felsenstein, 2001; Lopez et al., 2002). Deviations from these normative constraints can and do create phylogenetic discontinuities that are remarkable imitations of sequence transfer. In order to avoid false identifications of sequence transfer, a minimum requirement is that phylogeny must be based on an adequate number of clades (Canback et al., 2002). Otherwise, the fluctuations of random mutation events, especially in closely related clades, will generate anomalies that mimic gene transfers as they do in Zhaxybayeva et al. (2006). In addition, it is sometimes possible to recognize departures from the normative assumptions upon which algorithms are based, such as those generated by compositional bias. Then it becomes possible to correct a mistakenly identified sequence transfer (Itoh et al., 2002) using a specific tool (Galtier and Guoy, 1995) to deconstruct the anomalies generated by biased mutation rates (Canback et al., 2004). There is not a little irony in the fact that the phylogenetic method works only where there is a reliable reference phylogeny with which to identify the nonconformists.
The first assays of the match between phylogeny obtained from ribosomal RNA and that from gene or sequence content of fully sequenced genomes were encouraging (Huynen et al., 1999a, 1999b; Fitz-Gibbon and House, 1999; Tekaia et al., 1999; Snel et al., 1999). More refined comparisons with hundreds of fully sequenced genomes have turned out to be even more striking (Brown et al., 2000; Snel et al., 2002; Korbel et al., 2002; Kunin et al., 2005a). Thus, Woese’s rRNA trees seem to be accurate predictors of gene content or genome sequence trees. Quantitative analysis using weighting functions to sort out gene transfer from vertical transfer consistently reveals the overwhelming dominance of vertical events as well as a marked preponderance of glitches ascribed to gene loss events compared to those ascribed to gene transfer. The conservative consensus is that after millions of generations, the aggregate fixation of alien transfers is far less than 10% of the aggregate of vertical inheritance (Korbel et al., 2002). Most phylogenetic reconstructions are represented as two dimensional trees or dendrograms. Nevertheless, it was recognized from the beginning of the great HGT scare that reticulate networks would be useful to track infrequent gene transfers (Kurland, 2000). Indeed, recent reconstructions of the phylogenomic network for microbial genomes feature a third dimension to record reticulation by gene transfer (Kunin et al., 2005b). Here, the transfers are represented as thin “vines” intertwining a tree or dendrogram of vertical branches. The bottom line of three independent approaches to such reconstructions is that 3–5% of vertical transfers and loss is identifiable as gene transfer (Kunin et al., 2005b). Again, these are cumulative figures for millions of generations. Several percent for the aggregate frequencies of random gene transfer accumulated over millions of years might seem to be a barely significant contribution to microbial evolution.
On the other hand, certain specific evolutionary phenomena such as the acquisition of pathogenicity islands, of antibiotic resistance genes or of the capacity to degrade unusual chemicals are dependent on gene transfer between specific microbial clades. Thus, microbial gene transfer is structured and not random in the sense that it involves most often a particular subset of coding sequences, and it seems to be locally intense at specific nodes in the dendrogram (Kunin et al., 2005b).

17.2.4

The Emperor’s Blast Search

The identification of horizontal transfers involves identifying a node in one clade that is the root or origin of a lineage of sequences that were initially transferred to another clade that is not a vertical descendant of the first. Such rootings are transparently identified in the three dimensional phylogeny of Kunin et al. (2005b). In contrast, the most widely used protocol to identify transfers is one that exploits BLAST searches to identify the strongest similarities between proteins from two otherwise distant clades (Smith et al., 1998; Doolittle et al., 1996; Gogarten et al., 1996; Feng et al., 1997; Aravind et al., 1998; Nelson et al., 1999; Koonin et al., 2000). Here, strong homology is promoted as a substitute for phylogenetic rooting of the putative transfers. The problem with this protocol is that homology, no matter how strong, does not by itself distinguish rooting in alien clades from rooting in a common ancestor (Eisen, 1998; Kurland, 2000; Koski and Golding, 2001; Canback et al., 2002; Kurland et al., 2003, 2006). There is, as far as we know, no method to root lineages by homology alone. In contrast, observations of strong or weak homology may be more informative about biased mutation rates between clades. Nevertheless, there are well-cited studies that claim to have identified through BLAST searches extensive gene transfer between protein-coding sequences from archaeal and bacterial thermophiles (Aravind et al., 1998; Nelson et al., 1999). But the fact is that both archaeal and bacterial thermophiles share unusually strong compositional bias that is clearly distinguishable from that of mesophiles. Here, the base compositions, synonymous codon usage as well as nonsynonymous codon frequencies are identified in highly biased patterns that are general for hyperthermophiles (Chakravarty and Varadarajan, 2000; Kreil and Ouzounis, 2001; Kumar and Nussinov, 2001; Lynn et al., 2002; Singer and Hickey, 2003).
These extreme biases are not restricted to some minor subset of the coding sequences but are distributed even among the most highly expressed sequences. In other words, there is no reason to believe that a subset of genes with particularly biased composition shared by the archaeal and bacterial genomes are gene transfers. Rather, convergent compositional adaptations to high temperature together with descent from a common ancestor most likely account for the strong BLAST signals observed between orthologous proteins of archaeal and bacterial thermophiles (Lynn et al., 2002; Singer and Hickey, 2003). Consistent with this inference is the finding that phylogeny places the genome sequences of archaeal and bacterial hyperthermophiles well within their respective superkingdoms (Korbel et al., 2002). There are a large number of publications that report an even larger number of putative modern or ancient sequence transfers identified by homology searches. None of these distinguish homology originating with sequence transfer from homology arising as convergent compositional bias. As we have seen, convergent compositions in the hyperthermophiles (Lynn et al., 2002; Singer and Hickey, 2003) and more generally, convergent mutation rate biases strengthen homology between phylogenetically distant sequences (Lockhart et al., 1994; Galtier and Guoy, 1995; Yang, 1995; Forterre and Philippe, 1999; Galtier et al., 1999; Lopez et al., 1999, 2002; Kurland, 2000; Felsenstein, 2001; Canback et al., 2004). We will have occasion to evaluate again the results obtained by homology searches in the quest for the root of the modern superkingdoms.
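The way in which a shared compositional bias inflates apparent similarity can be illustrated with a toy calculation. The sketch below is not a model of BLAST scoring. It assumes, for illustration only, that the residues of two unrelated sequences are drawn independently from frequency tables, and the four-letter alphabet and the frequencies used are hypothetical.

```python
def expected_chance_identity(freqs_a, freqs_b):
    """Expected fraction of identical positions between two unrelated,
    aligned sequences whose residues are drawn independently from the
    given frequency tables (dicts mapping residue -> frequency)."""
    # Two positions match on residue x with probability p_a(x) * p_b(x);
    # summing over all residues gives the overall chance of a match.
    return sum(p * freqs_b.get(res, 0.0) for res, p in freqs_a.items())


if __name__ == "__main__":
    unbiased = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    biased = {"A": 0.40, "T": 0.40, "G": 0.10, "C": 0.10}
    print(expected_chance_identity(unbiased, unbiased))  # about 0.25
    print(expected_chance_identity(biased, biased))      # about 0.34
```

In this toy case two unrelated sequences that share the same bias match at about 34% of positions by chance alone, compared with 25% for unbiased sequences. A similarity score therefore rises with convergent composition even when no transfer has taken place.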

17.2.5

Between Consenting Adults

The convergent compositional biases of archaeal and bacterial hyperthermophiles clearly illustrate the influences of selection for strongly biased mutation rates (Lynn et al., 2002; Singer and Hickey, 2003). Likewise, the nonrandom patterns of gene transfer observed within the vertical dendrogram of microbial coding sequences strongly hint that though rare, persistent sequence transfers are selected (Kunin et al., 2005b; Sorek et al., 2007), just as inferred by Lawrence and Ochman (1998). Indeed, the compositional studies of Lawrence and Ochman with E. coli most often identified nonrandom gene transfers of pili and viral sequences among alien pseudogenes. Similarly, Pal et al. (2005) identified up to 30% of gene transfers between E. coli and S. typhimurium as viral or transposon-associated proteins. It would seem that viral propagation of alien sequences is one way of mediating alien transfers with some probability of providing a competitive edge for the host. All sequences that are not stabilized in a population by their contribution to the fitness of a host or to a genome parasite are transient since they are vulnerable to mutations that lead to their eventual eradication (Kimura, 1987). Strong selection is a sine qua non to fixation of transferred or mutant sequences within the huge populations of microorganisms (Berg and Kurland, 2002). The requirement for strong selection is somewhat relaxed for fixation in small patches within global populations (Berg and Kurland, 2002).

17.3 NESTED STRUCTURAL NETWORKS

Proteins have evolved to work in a special environment: a dense “overcrowded” gel with solute densities approaching 400 mg/mL (Laurent, 1971; Minton, 1981; Berg, 1990; Cayley et al., 1991; Zimmerman and Trach, 1991; Ellis et al., 2001a,b; Robinson et al., 2007). At macromolecular densities that exceed those of some protein crystals the cytosol is nothing like an ideal solution. Typically, proteins are so tightly packed in the cytosol that the distance between them is less than one third of their virtual diameters, a situation referred to as “macromolecular crowding” (Ellis et al., 2001). Crowding means in effect that protein diffusion is limited and that intermolecular interactions are highly favored, as in aggregation, complex formation and precipitation. Maintaining the solubility of proteomes is therefore a fundamental cellular problem. The reasons that cellular proteomes press the solubility-envelope are simple: chemistry tends to be faster at higher concentrations and fast chemistry is at a premium when organisms are competing (Ehrenberg and Kurland, 1984; Kurland, 1992; Lovmar and Ehrenberg, 2006). That is to say, macromolecular crowding improves the fitness of cells, but as we shall see the kinetic enhancement at high concentrations of proteins is biologically costly.

17.3.1

Nip and Tuck

Dense proteomes are metastable gels in which relatively small structural aberrations may tip the proteins from a soluble phase into amorphous precipitates or even into ordered arrays (Dobson, 1999; Otzen et al., 2000; Hartl and Hayer-Hartl, 2002). Entangled, precipitated or crystalline proteins are dysfunctional and lethal to cells. Frequently cited clinical examples of protein entanglement and precipitation are the accumulation of amyloidal protein associated with diseases such as Alzheimer’s and Huntington’s (Dobson, 1999; Radford, 2000). Accordingly, the evolution of proteomes depends on selection for structures that minimize the instability of the cytosol. In general, native proteins tend to be folded or compacted molecules elaborated from single compact domains (folds) or from several concatenated compact domains (Doolittle, 1995; Murzin et al., 1995; Teichman and Mitchison, 1999). There are many fewer domain structures recruited by modern proteomes than there are proteins. Though it contains considerably fewer nodes, the underlying network of domains is nested within the space of all proteins and has proven to be especially informative in phylogenomic studies (Caetano-Anolles and Caetano-Anolles, 2003; Yang et al., 2005; Wang et al., 2007). Selection for compact or otherwise protected folds is maintained by ubiquitous cellular infrastructure that monitors and destroys proteins presenting strings of unprotected sequence (Goldberg and Dice, 1974; Wickner et al., 1999; Voges et al., 1999; Glickman and Ciechanover, 2002). Note that there are invulnerable unstructured loops in proteins as well as in the transient intermediates that lead to the folding of proteins (Dyson and Wright, 1998; Deming, 2002). The fact is that exposed protein loops as well as linkers between compact domains are important for the provision of binding sites between protein domains as well as for modulating the dynamic flexibility of protein domains (Robinson and Sauer, 1998). This is strikingly illustrated by the progressive increase in frequency and lengths of loops in the proteins of hyperthermophiles, thermophiles, mesophiles, and psychrophiles, respectively. Here, the dynamics of enzyme functions in cold-adapted proteins is greatly enhanced by conspicuous loops (Thompson and Eisenberg, 1999; Narinx et al., 1997; Tekaia et al., 2002; McElhenny et al., 2005; Deming, 2002). Accordingly, there are loops and there are loops.
Loops that initiate protein demolition are presumed to be sufﬁciently accessible and sufﬁciently long to be targets for the proteolytic surveillance systems, while the others are buried within the tertiary fold of proteins or masked by site-speciﬁc interactions between different proteins or other ligands (Dyson and Wright, 1998; Robinson and Sauer, 1998). The rule of thumb seems to be that if a loop together with its ligands is “wider” than a single beta strand, it will not ﬁt into common proteolytic active sites (Tyndall et al., 2005). Relevant here are ubiquitous families of protein folding machines, so-called chaperonins that mediate refolding of aberrant loops as well as entangled proteins. By doing so, chaperonins oppose aggregation and entanglement of proteomes as well as proteolytic demolition (Rutherford and Lindquist, 1998; Wickner et al., 1999; Hartl and Hayer-Hartl, 2002; Maisnier-Patin et al., 2005). Accordingly, accidental misfolding is reversible while persistent misfolding is lethal, once it is recognized by the proteolytic surveillance systems. So, there is a transient competition between proteolytic surveillance and chaperonin functions (Wickner et al., 1999; Maisnier-Patin et al., 2005). Since proteolysis is irreversible, in time it always wins the competition with the chaperonins for proteins that tend to unfold or misfold. Thus, the targets of posttranslational editing systems are proteins that are produced as translation errors, or variants produced from mutant genes or aged proteins that are oxidized or otherwise modiﬁed.

17.3.2

Fold Selection

Seen from the perspective of evolving proteomes, the proteolytic surveillance systems of eukaryotes, bacteria, and archaea are the enforcers of the fold networks. It is accordingly
appropriate to think of the evolution of members of any particular proteome as having coevolved through fold selection in a particular cellular environment supported by a particular editing system. For example, the fold selection systems of psychrophiles and those of hyperthermophiles are distinct because they support the evolution of structurally distinguishable proteomes that feature either prominent loops or virtually no loops at all (Thompson and Eisenberg, 1999; Narinx et al., 1997; Tekaia et al., 2002; McElhenny et al., 2005; Deming, 2002). Accordingly, proteins from one genome may not be compatible with the proteome of another genome that has evolved with different features of structural optimality. Thus, any sequence change however small that destabilizes a fold or that disrupts the folding process of a protein may signal the destruction of the affected proteins by the proteolytic surveillance system (Otzen and Oliveberg, 1999; Otzen et al., 2000). In the laboratory, by enhancing the mutation rates of cells or otherwise introducing artiﬁcial structural changes in speciﬁc genes, chaperonin functions may be augmented, but these structural repair functions can protect the systems only up to a point, after which the structural dislocations overwhelm the system and create irreversible damage (Wickner et al., 1999; Maisnier-Patin et al., 2005). Again, the stabilities of proteins that are dependent on chaperonin functions are likely to be dependent on cell-speciﬁc features of the corresponding proteomes. Although the evolution of this network of compact domains answers the needs of a high-density cytosol, its maintenance by the chaperonin/protease monitoring systems comes at a considerable evolutionary cost. First, by limiting the acceptable random sequence variation to those changes supporting compact domains in proteins, this screening retards the rates of viable mutation. 
Second, the underlying network of domains is a far less dense space of permissible structures than that anticipated by Maynard Smith (1970). This means that random mutations are much more likely to be lethal than would be the case if compacted domains were not a selective trait for polypeptide sequences. Thus, the genetic load generated by random mutations that generate lethal unstable structures is a measure of the cost of the fold space to evolving populations. Nevertheless, both the universality of compact folded domains in proteins and the ubiquity of the cellular networks that police protein folds suggest that this is an ancient order that preceded the last universal common ancestor of the three modern domains (Kurland et al., 2007).

17.3.3

Higher Order Structural Networks

If cells were simply unorganized bags of proteins and nucleic acids encased in lipid membranes, the functional cycles of biosynthesis and metabolism would depend in each case on the independent diffusion of hundreds of individual components to their specific work sites in the cell. That would be a kinetic nightmare that would be greatly exacerbated by the macromolecular density of the cytosol, which markedly reduces free diffusion of macromolecules (Laurent, 1971; Cayley et al., 1991; Zimmerman and Trach, 1991; Ellis, 2001a,b; Kurland et al., 2006). The evolutionary solution to that nightmare is to associate the individual components that cooperate in a functional cycle into complex structures that reduce waiting times between the successive steps of the cycle (Welch, 1977; Srivastava and Bernhard, 1986; Srere, 1990; Kurland et al., 2006; Collins et al., 2009). Indeed, cells are supported by a continuum of such complexes ranging in size from multienzyme complexes through ribosomes and spliceosomes
with their hundreds of components up to nucleoli and chromosomes visible in the light microscope. All of these complexes are examples of biological structures that solve kinetic problems (Kurland et al., 2006): they bring together functionally related macromolecules in organized complexes that mediate kinetically efﬁcient processing of small molecules. It is not incidental that macromolecular complex formation is facilitated by crowding. Indeed, complex formation is favored because associations tend to increase the entropy of solute and solvent molecules. Furthermore, complex formation between proteins as well as between nucleic acids and proteins stabilizes both these components in a cytosol policed by nucleases and proteases (Collins et al., 2009). The densely packed diffusing macromolecules and their complexes in eukaryotes are particularly hampered by large cell volumes, larger on average than those of archaea and bacteria. Larger volumes make for diffusion times across eukaryote cells orders of magnitude longer than those for archaea and bacteria. The evolutionary solution to this problem in eukaryotes has been to introduce cellular compartments. Here, the concentration of complex machinery to carry out compartment-speciﬁc functional cycles has two effects on kinetic efﬁciencies: First, the individual steps of the cycles speciﬁc to the nuclear, nucleolar, endoplasmic, and cytoplasmic compartments are more kinetically efﬁcient because effective volumes are reduced (Kurland et al., 2006; Collins et al., 2009). Second, the biomass invested in particular cycles that is required to support efﬁcient kinetics is reduced correspondingly (Kurland et al., 2006; Collins et al., 2009). Again, subcellular partitioning is a structural adaptation that solves formidable kinetic problems for eukaryotes (Kurland et al., 2006; Collins et al., 2009). Cellular compartments in eukaryotes provide a higher order adaptation not usually required by bacteria or archaea. 
Within these compartments are nested biochemical networks as well as links to other compartments of the cell. Thus, all genome-encoded structures are connected to each other by the chemical ﬂuxes they mediate, by their physical contacts and by regulatory feedback. The point is that ubiquitous connectivity enables the cell to respond as an integrated network to perturbations, whether these perturbations are mutations or environmental. This means that we can describe in molecular detail what we mean by the ﬁtness of cells or organisms without resorting to transcendental language.
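The dependence of diffusion times on cell size can be made concrete with a rough estimate. The sketch below assumes free three-dimensional diffusion, for which the mean squared displacement is 6Dt; it ignores crowding, binding and compartment boundaries. The diffusion coefficient and the cell dimensions are illustrative assumptions and not measurements from the cited work.

```python
def diffusion_time(length_um, diff_coeff_um2_per_s):
    """Characteristic time in seconds for a molecule to diffuse a distance
    of length_um micrometers in three dimensions, from <r^2> = 6 D t."""
    return length_um ** 2 / (6.0 * diff_coeff_um2_per_s)


if __name__ == "__main__":
    D = 10.0  # assumed diffusion coefficient, in square micrometers per second
    for label, size in (("bacterium-sized cell", 1.0),
                        ("eukaryote-sized cell", 10.0),
                        ("large eukaryotic cell", 100.0)):
        print(label, diffusion_time(size, D), "s")
```

Because the time scales with the square of the distance, a tenfold increase in linear dimension costs a hundredfold in diffusion time. Confining a functional cycle to a compartment shortens the relevant distance and recovers much of that loss.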

17.4 OPTIMAL NETWORKS

The fitness of organisms is conveniently compared by measuring their relative duplication rates. These can follow simple or complicated cycles but over sufficiently long times any such cycle can be summarized by a time-averaged growth rate. The simplest cycle to analyze in detail is that for steady state cellular growth, which to a good approximation is most relevant to growing populations of archaea, bacteria, and single cell eukaryotes. Here, the cells are taken to be networks of macromolecules that are connected to each other in an ordered scheme that relates the sequential transformations of substrates leading to the replication and duplication of macromolecular products, that is, cell growth. In the following, we summarize some properties of growing cellular networks that are informative about the constraints on hitchhiking sequences in evolving populations. The arithmetic formalities of these networks are described in the literature (Ehrenberg and Kurland, 1984; Berg and Kurland, 2002; Lovmar and Ehrenberg, 2006) as well as elsewhere in this volume. We focus on the kinetic network corresponding to the translation system because much is known about how it responds to mutation in bacteria.

Since we are most interested in the evolution of proteomes we simplify our notations by representing cells as networks of proteins (Ehrenberg and Kurland, 1984). All other macromolecules such as nucleic acids, complex carbohydrates, and lipids are ignored. This semirealistic simplification makes for a transparent notation to relate cellular fitness in the form of growth rates to the evolving characteristics of the proteome. Important features of this simple model of a growing kinetic network have been vetted experimentally under different environmental situations for mutant and wild type cells (Mikkola and Kurland, 1991a, 1991b, 1992; Kurland, 1992; Dong et al., 1996; Berg and Kurland, 1997, 2002). The growth rate of this cellular network, $k_0$, can be expressed as

$$k_0 = \frac{k_{el} R}{\rho} \qquad (17.1)$$

Here, $k_{el}$ is the translation rate per ribosome, $R$ is the number of translating ribosomes and $\rho$ is the total mass (in amino acids) of protein produced per cell cycle (Ehrenberg and Kurland, 1984; Kurland et al., 2007). In general, the density of total biomass is more or less constant for all cells while the mass fraction ($\rho_i$) of each protein and substrate in the network can evolve under selection for maximum growth rates. In particular, the growth efficiency increases when the biosynthetic capacity, in this case the rate of function for ribosomes expressed as rate per mass ($k_{el}/\rho_r$), increases, as well as when that for other components of the network ($k_i/\rho_i$) increases (Ehrenberg and Kurland, 1984). We assume that the rate of function per molecular mass tends to evolve to a maximum under any given growth conditions. Such networks have a well-defined maximum rate of growth that corresponds to a unique arrangement of all components of growing cells (Ehrenberg and Kurland, 1984; Kurland, 1992; Lovmar and Ehrenberg, 2006). That maximum corresponds to the most efficient mass investment for each kinetic compartment of the network for a particular environment. The signature of this optimal arrangement is that each and every component of the cell is operating at a maximum rate normalized to its mass in that environment. Thus, there is always a tendency to minimize the mass investment in molecular components.
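Equation 17.1 is simple enough to evaluate directly. The sketch below is a minimal illustration in which the parameter names and the numerical values are our own assumptions, chosen only to give a plausible order of magnitude for a bacterial cell. It also assumes that the growth rate is a specific rate of exponential growth, so that the doubling time is ln 2 divided by the growth rate.

```python
import math


def growth_rate(k_el, ribosomes, protein_mass):
    """Equation 17.1: growth rate = k_el * R / (total protein mass).

    k_el: amino acids polymerized per ribosome per second.
    ribosomes: number of translating ribosomes, R.
    protein_mass: protein produced per cell cycle, in amino acids.
    Returns the growth rate in units of 1/s.
    """
    return k_el * ribosomes / protein_mass


def doubling_time_minutes(k0):
    """Doubling time in minutes for exponential growth at rate k0 (1/s)."""
    return math.log(2) / k0 / 60.0


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    k0 = growth_rate(k_el=15.0, ribosomes=20000, protein_mass=1.0e9)
    print(k0, doubling_time_minutes(k0))
    # A 9% slower ribosome lowers the growth rate by the same 9%.
    slow = growth_rate(k_el=15.0 * 0.91, ribosomes=20000, protein_mass=1.0e9)
    print(slow / k0)
```

With these assumed values the growth rate is 3 x 10^-4 per second, a doubling time of roughly 38.5 minutes. Because the relation is linear in the translation rate, any loss of ribosome performance is passed on undiminished to the growth rate, provided the other two quantities are unchanged.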

17.4.1

Optimality Not Maximality

We use optimality as a synonym for maximum fitness and we take the maximum rate of growth of a cellular network as the standard state to which cells are driven by purifying selection under any environmental circumstances in nature. Note that in general, fitness is not necessarily obtained by selection for maximized responses from the separate components of biological networks. For example, mutations in a particular ribosomal protein may increase the accuracy of translation but they simultaneously decrease the growth rates of the cells. This suboptimality arises from the enhanced rates of substrate (aminoacyl-tRNA) rejection that support higher accuracy but lead to lower net rates of translation (Ehrenberg and Kurland, 1984; Kurland, 1992; Lovmar and Ehrenberg, 2006). The translation network is in general a highly cooperative one in which clusters of proteins and RNAs modulate the kinetics of each of the individual steps in the network (reviewed in Kurland, 1992; Lovmar and Ehrenberg, 2006). The streptomycin cluster is an exemplary cluster of three proteins (S4, S5, and S12) in the small ribosomal subunit of E. coli. Mutations in any one of these can be selected with streptomycin directly or as suppressors of streptomycin-dependent mutations (Kurland, 1992). Variants of any one protein tend to ameliorate the effects of other variant proteins in the cluster. Thus, different
combinations of mutant proteins from the cluster are associated with phenotypic responses to streptomycin that range from sensitivity, to resistance, to dependence or to hypersensitivity. One other remarkable property of this ribosomal cluster is that the phenotype of any one of its variants is often determined by a simple sequence change such as one or a few amino acid substitutions. In addition, kinetic analyses of ribosomes containing mutant members of the streptomycin cluster reveal speciﬁc rate adjustments that can inﬂuence not only antibiotic phenotypes but also kinetic performance characteristics as diverse as ribosomal translation rates, accuracy of aminoacyl tRNA selection and bacterial growth rates, that is, ﬁtness in the presence and absence of antibiotics (Kurland, 1992; Johansson et al., 2008). Furthermore, mutant alleles of one protein in the streptomycin cluster tend to ameliorate the kinetic phenotype expressed by other mutant alleles of a different protein (Kurland, 1992; Johansson et al., 2008). Thus, the variant members of the streptomycin cluster cooperate as though they were selected to oppose or balance one another. In effect, small structural changes in clusters of ribosomal proteins seem to “tune” the kinetics of translation as well as to modulate bacterial growth rates. There are two separable inﬂuences on the evolution of the streptomycin cluster to account for what seems to be the incremental “tuning” phenotype of mutations in any one member of the cluster. First, constraints on protein evolution in a dense cytosol policed by proteolytic machines (see Section 17.3) lead to modular architectures in which viable proteins are constituted from stable domains. The sequences that fold into compact domains and that also associate speciﬁcally with other macromolecules are members of a sparsely populated polypeptide network. 
Only some random minimal changes out of all the conceivable ones that could impact this network will preserve the capacity to fold and bind. In general the key structural constraints preserve the tightly packed hydrophobic cores of domains (reviewed in Oliveberg and Wolynes, 2006). For this reason, we expect those mutations that are viable to effect a minimal change in the domain involving one or a few amino acid changes. Second, the evolution of extensively cooperative sequences has a special dynamic because variants that are selected for fitness in one environment may be suboptimal when the environment shifts. For example, when streptomycin is introduced to a growth medium, sequence changes in protein S12 may be selected to support resistance to the antibiotic. If the streptomycin is subsequently removed, both translation kinetics and growth rates are likely to become suboptimal. A second mutation that reverses the resistance phenotype in protein S12 might restore optimality in the absence of the antibiotic. However, if the functional cooperativity between proteins S4, S5, and S12 is sufficiently intimate, different mutations in the structures of S4 and S5 might also reverse the S12 streptomycin resistance phenotype. In this case, because the number of mutatable sites is greater, the probabilities are such that random mutations would most often modulate the ribosome kinetics through variants of S4 and S5 rather than through a back mutation in S12. And so we might imagine that environmental shifts that select changes in the kinetic performance of the streptomycin cluster will produce variants in all three protein sequences that appear to tune the functions of this cluster in the ribosome. Extensive structural differences between native and alien homologues may often disfavor alien transfer into the integrated networks characteristic of cells.
Thus, “tuning” by minor mutational variations of the members of an interactive cluster is likely to be the signature of all transient network optimizations. The tuning or optimizing functions of ribosomal components are unusual only because they have been observed in some detail.
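The argument from the number of mutable sites amounts to a simple ratio. The sketch below assumes, purely for illustration, that every site mutates at the same rate and that each qualifying mutation restores the phenotype equally well. The site counts are hypothetical and are not estimates for the streptomycin cluster.

```python
def compensation_probability(back_sites, compensatory_sites):
    """Probability that the first phenotype-restoring mutation is a
    compensatory change elsewhere in the cluster rather than a back
    mutation, assuming all sites mutate at the same rate."""
    return compensatory_sites / (back_sites + compensatory_sites)


if __name__ == "__main__":
    # Hypothetical counts: one back-mutation site in S12 against twenty
    # compensatory sites spread over S4 and S5.
    print(compensation_probability(back_sites=1, compensatory_sites=20))
```

With one back-mutation site and twenty compensatory sites, about 95% of restoring mutations would fall in S4 or S5. The larger the cooperative cluster, the more strongly the odds favor tuning by partners over simple reversion.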

Other cooperative networks such as those mediating DNA replication, RNA transcription and processing, the Krebs cycle, and the like are certain to provide further examples of network optimizations when the kinetics of the appropriate mutant components are analyzed systematically, as has been done for the translation networks of bacteria (Kurland, 1992; Lovmar and Ehrenberg, 2006).

Another way to view the expected incompatibilities of gene transfer comes from Woese (1998, 2000) in his progenote hypothesis. He reasons that early in the evolution of ancestral cellular forms, sequence transfer between different lineages would be a favored way to sample mutant variants. However, at some point the proteins that make up a particular pathway will together approach optimality, that is, maximum network efficiency. Beyond this point the most likely consequence of sequence transfer will be to introduce suboptimal features, and thereby generate a less competitive cell. Woese refers to this break point as the Darwinian Boundary (Woese, 1998, 2000).

However, there is no evidence that LUCA was a primitive cell population with a simple proteome for which the major biochemical cycles were still under construction. Furthermore, estimates of LUCA proteome size based on the self-contradictory assumption that the minimalist genomes of highly reduced obligate bacterial parasites could support a free living LUCA (Koonin, 2003) are probably quite inaccurate. In contrast, Ouzounis et al. (2006) have extrapolated from the sequences of 184 genomes of modern organisms the common sequences that are likely to have been present in LUCA. Three different extrapolations were exploited; these are based either on gene content or on average sequence similarity or on genome conservation. Ouzounis et al. (2006) correct their extrapolations for the influences of gene transfer and, more importantly, gene loss, but are inevitably left with minimal estimates of LUCA’s proteome, which are between 1344 and 1529 gene families for the ancestor of the three domains. As Ouzounis et al. (2006) say, this is a “fairly complex genome” that emerges from their extrapolations and we may suppose that LUCA was a fairly well evolved organism. Thus, the Progenote is a useful concept that may be relevant to a far more primitive ancestor of LUCA.

Modern organisms seem to evolve with physical associations and kinetic cooperativity between components of their networks that provide a strong barrier to mutation in general and to gene transfer in particular. In fact, we find that small structural changes in cooperative clusters of proteins can reoptimize the workings of a network when that network is challenged, for example, by an antibiotic. On the other hand, it would be surprising if an alien transfer to an environmentally challenged organism would function like a minimally substituted variant of an endogenous protein. Alien proteins are after all alien, and their aberrant sequences make it difficult for them to slip into a foreign molecular network without disturbing its functions. That is one reason that after millions of years of evolution, modern organisms seem to have accumulated only a small fraction of gene transfers (Korbel et al., 2002; Kunin et al., 2005a).

17.4.2 Patchy Environments, Patchy Genomes

Comparisons of the growth phenotypes of a cohort of E. coli taken from nature (natural isolates) reveal how environmentally idiosyncratic these can be. In one study the growth rates of 65 natural isolates of E. coli were compared in glucose minimal medium in the laboratory (Mikkola and Kurland, 1991a). These rates ranged between 0.48 and 1.42 doublings per hour, with more than half growing at rates less than 1.2 doublings per hour. A “wild type” laboratory strain grew at 1.33 doublings per hour. So, natural strains are distinctive even in the same growth medium and the subpopulations of a global population are far from uniform. We refer to these subpopulations as patches.

Seven of the natural isolates with a representative range of growth rates in glucose minimal medium between 0.48 and 1.42 doublings per hour were grown in chemostats. After 280 generations in glucose-limited chemostats, their growth rates evolved to between 1.15 and 1.36 doublings per hour, that is to say, to rates similar to each other as well as ones approaching that of the laboratory wild type strain (Mikkola and Kurland, 1992). The evolution of the “laboratory” growth phenotype was accompanied by incremental steps in the growth rates that were interpreted as evidence for the accumulation of multiple adaptive mutations. Likewise, the selected growth phenotypes could be associated with improved in vitro translation rates by ribosomes isolated from the different strains (Mikkola and Kurland, 1992). We infer that patches of a bacterial species such as E. coli present growth rates and ribosomal translation rates that vary significantly according to demands on fitness presented by the environmental patches in which they evolve.

The adaptations to environmental patches are likely to be reflected in genome heterozygosity, that is, sequence patchiness that distinguishes the members of a global population (Berg and Kurland, 2002). Recognition of the generality of genomic patchiness might be used to redefine species as patches or ecotypes (Cohan, 2001). However, we would oppose using the patch or ecotype to redefine clades. Such a convention would tend to promote a false view of the clade as a stable ecotype. Ecotypes are expressions of volatile genome specializations associated with relatively short-lived environmental circumstances. Rather than define away the dynamic character of global populations, it would be more realistic to accept that the global adaptations of a species are periodic in time and space. In other words, for large populations fitness is local and the heterozygosity of global genome populations is not entirely neutral, but locally selected (Berg and Kurland, 2002).
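The growth rates quoted above are easier to compare when expressed as doubling times. The following is a minimal sketch of that arithmetic in Python; the dictionary labels and the helper names `doubling_time_minutes` and `fold_spread` are illustrative choices, and only the rates themselves are taken from the text.

```python
# Growth rates (doublings per hour) as quoted in the text.
RATES = {
    "slowest natural isolate": 0.48,
    "fastest natural isolate": 1.42,
    "laboratory wild type": 1.33,
    "chemostat-evolved, slowest": 1.15,
    "chemostat-evolved, fastest": 1.36,
}


def doubling_time_minutes(doublings_per_hour: float) -> float:
    """Minutes needed for one doubling at the given growth rate."""
    return 60.0 / doublings_per_hour


def fold_spread(slowest: float, fastest: float) -> float:
    """Ratio of the fastest to the slowest growth rate in a cohort."""
    return fastest / slowest


for label, rate in RATES.items():
    minutes = doubling_time_minutes(rate)
    print(f"{label}: {rate:.2f} doublings/h, {minutes:.0f} min per doubling")

# Spread of rates before and after 280 generations in the chemostat.
print(f"natural isolates: {fold_spread(0.48, 1.42):.1f}-fold spread")
print(f"after chemostat evolution: {fold_spread(1.15, 1.36):.1f}-fold spread")
```

On these figures the slowest natural isolate needs 125 minutes per doubling, and the roughly threefold spread among the natural isolates narrows to about 1.2-fold after selection in the chemostat.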

17.4.3 Fixation of Novel Sequences

Because drift is more relevant to small populations and selection more relevant to large ones, mutations, including gene transfers that enter a large microbial population are more likely to ﬁx in relatively small patches than in the whole global population (Berg and Kurland, 2002; Lynch, 2007). This makes patches as well as the patchiness of genomes particularly interesting from the evolutionary point of view. In order to explore novel sequence acquisition (and loss) in the immense populations of microorganisms, we need to explore the inﬂuences of drift, destructive mutations, and selective forces on sequences hitchhiking through patchy populations. In this we assume that most novel sequences are neutral because sequence duplications and gene transfer between organisms sharing the same environment are rarely expected to generate new adaptive functions straight off. Fixation of novel genes has been explored formally in two extreme situations (Berg and Kurland, 2002): hitchhiking either through a homogeneous global population or through a population that is subdivided into patches. In the second situation there is exchange between patches, but this is taken to be slower than the progression within a patch. The results are that novel, neutral or near-neutral coding sequences in global populations of microorganisms are opposed by fresh destructive mutations as they slowly migrate through that population. So, they usually do not ﬁx globally but they may have a transient presence in a small fraction of the population, that is, in one or more patches. In effect, a large microbial population will exhibit quite extensive diversity of transient neutral gene content with a neutral heterozygosity close to 0.99 (Berg and Kurland, 2002).
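The contrast between small patches and a large global population can be illustrated with a standard textbook result rather than with the full treatment of Berg and Kurland (2002). The sketch below uses Kimura's diffusion approximation for the fixation probability of a single new copy in a haploid population; it ignores destructive mutation and exchange between patches, and the population sizes and selection coefficients are hypothetical values chosen only for illustration.

```python
import math


def fixation_probability(pop_size: int, s: float) -> float:
    """Kimura's diffusion approximation for one new copy in a haploid
    population of pop_size cells with selection coefficient s.

    Assumption: effective size equals census size, and there is no
    recurrent mutation or migration.
    """
    if s == 0.0:
        return 1.0 / pop_size          # neutral case: pure drift
    exponent = -2.0 * pop_size * s
    if exponent > 700.0:               # exp() would overflow; P is effectively 0
        return 0.0
    # (1 - exp(-2s)) / (1 - exp(-2Ns)), written with expm1 for accuracy
    return math.expm1(-2.0 * s) / math.expm1(exponent)


PATCH_SIZE = 10**3    # hypothetical small patch
GLOBAL_SIZE = 10**9   # hypothetical global population

for s in (0.0, 0.01, -1e-5):
    p_patch = fixation_probability(PATCH_SIZE, s)
    p_global = fixation_probability(GLOBAL_SIZE, s)
    print(f"s = {s:+.0e}: patch {p_patch:.3e}, global {p_global:.3e}")
```

Under these assumptions a neutral sequence fixes with probability 1/N, so it is a million times more likely to fix in the hypothetical patch than in the hypothetical global population, whereas a sequence under positive selection fixes with a probability near 2s that is almost independent of population size.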

A principal conclusion of these calculations is that only sequences that are under strong selection globally can be expected to persist globally (Berg and Kurland, 2002). In general, it is rare that selection for a novel sequence is sufficiently strong throughout a global population to promote its fixation because the selective parameters are for the most part periodic, that is to say patchy. Antibiotics and the resistance phenotypes of pathogens provide a good example. Thus, not all members of a pathogenic bacterial population are antibiotic resistant because selection of resistant phenotypes is favored only by environmental patches in which antibiotics are dispersed, for example, in hospitals and on farms.

This conclusion has been challenged by Novozhilov et al. (2005) on the basis of three mistaken premises: First, they claim that Berg and Kurland (2002) failed to recognize the influence of infective modes on the transfer of neutral sequences. Second, Novozhilov et al. (2005) along with Dawkins (1976) seem not to recognize that the rates with which sequences hitchhike through a population on a parasitic or infective vehicle are dependent on selection. Thus, as mentioned in the introduction, genome parasites are dependent on purifying selection to maintain the infectivity of specific sequences that determine their virulence. Finally, Novozhilov et al. (2005) assume that they can arbitrarily choose extravagantly large transfer rates with which to show that in theory selection is not needed to reach fixation. However, Berg and Kurland (2002) employed transfer rates that were observed in bacteria; those data suggest that in reality selection is needed to fix transferred neutral sequences in large microbial populations.

Indeed, had Novozhilov et al. (2005) not been so obsessed with making the world safe for rampant gene transfer, they might have noticed two novelties in the treatment presented by Berg and Kurland (2002). One is that the virulence of parasitic sequences enters the equation describing the probabilities for fixation in the same way that host selection coefficients enter (see Eq. 17.1; Berg and Kurland, 2002). For this reason, Berg and Kurland (2002) were obliged to discuss explicitly the containment by random deletion of novel sequences introduced into microbial genomes by viruses and transposons. Second, while global fixation is highly unlikely, Berg and Kurland (2002) emphasized the result that probabilities for local fixation of novel sequences in small patches are much more favorable. This second novelty may have been ignored by Novozhilov et al. (2005) because it only fits mundane facts. Primary among these is that patchy transfers have limited impact on global phylogeny (Berg and Kurland, 2002; Korbel et al., 2002; Kunin et al., 2005b). Both aspects of infectious sequence acquisition are discussed in more detail in Kurland (2005).

The general conclusion stands: the probabilities that novel sequences may be fixed in populations large or small depend on the magnitude of selective forces (Berg and Kurland, 2002). These selective forces may be associated with either the fitness of the host or the fitness of an infective vehicle. However, neutral hitchhikers riding infective vehicles are not immune to erosion by mutation (Berg and Kurland, 2002). If infective sequences are inactivated by mutation, their neutral hitchhikers are left to a near certain fate of extinction (Kimura, 1987).

These general conclusions help to sort some speculative transfer stratagems that have been prominent in the literature. We begin with a simple case. It has been suggested that E. coli became E. coli by recruiting through alien gene transfer the lac operon (Ochman et al., 2000; Lawrence, 2001). It was further postulated that by enabling the metabolism of milk sugar this operon supported E. coli’s invasion of the mammalian intestine as well as its divergence from S. typhimurium. Indeed, several such gene transfer-mediated speciation events were identified by the presence and absence of defining enzyme activities (Ochman et al., 2000; Lawrence, 2001). But what is the evidence that acquisition of the outstanding operon by one ancestor of a pair of closely related species rather than loss of the operon from an ancestor of the other closely related species provided impetus for that divergence (Berg and Kurland, 2002)? After all, sequence loss is much more likely a priori than sequence acquisition (Korbel et al., 2002). Indeed, phylogenetic studies of the lac operon lend no support whatsoever for the postulated lac transfer to E. coli. Rather, loss of lac from S. typhimurium seems a better fit for the data (Stoebel, 2004). Of course, at the turn of the millennium sequence loss or reductive evolution was not recognized as a potent evolutionary force. In contrast, the absence of any evidence to support operon transfer scenarios that could have led to speciation was not an ideological handicap.

17.4.4 Selfish Operons

In the same vein, Lawrence and Roth (1996) suggested earlier that the evolution of operons for metabolism that is not often on demand (i.e., facultative operons) might be particularly dependent on alien transfer. Their idea is that enzymatic pathways that are not essential but only occasionally on demand would tend to be lost by bacteria during functional eclipse. They further suggest that the lost operons would return to their former hosts or enter new ones as needed through gene transfer from some organism that has conveniently maintained a reservoir of the relevant genes. According to Lawrence and Roth (1996) genes specifying the lost enzymes would be expected to map together as a contiguous sequence, which would facilitate their return by a single transfer event. In fact transfer-friendly contiguous coding sequences of enzymes deﬁne the “selﬁsh operon” model because that contiguity is thought to have no other function but to facilitate the transfer of the operon as a unit (Lawrence and Roth, 1996). In addition, niche-deﬁning chemistry for many microorganisms, whether it is aerobic respiration, nitrogen ﬁxation, sulfate reduction, or photosynthesis is identiﬁed in disparate lineages, interspersed by related lineages lacking the same facultative pathways. In effect, the facultative metabolic pathways are distributed in phylogenetic patches. How do such patches evolve among the bacteria and archaea? The standard answer is through sequence transfer according to the selﬁsh operon model (Boucher et al., 2003). The data to date are not encouraging for aﬁcionados of the selﬁsh operon. The expectation that the genes of facultative operons would be more tightly linked than genes from essential operons is contradicted by genomic data (Pal and Hurst, 2004). Likewise, the frequency with which members of facultative operons can be identiﬁed as gene transfers is “average” and not as expected, high (Price et al., 2005). 
Alien transfers of entire operons containing substantial numbers of coding sequences have yet to be described in bacterial genomes (Stoebel, 2004) but transfer of pairs of cooperating proteins that are members of an operon is favored (Pal et al., 2005). The most favored coding sequences for transfer are surface proteins as well as viral and transposon-associated proteins (Ochman et al., 2000; Price et al., 2005). It is generally recognized that differential gene loss rather than the selﬁsh operon alternative might account for the patchy distributions of operons among microorganisms (Boucher et al., 2003). However, according to Boucher et al. (2003) “differential loss scenarios will be wildly unparsimonious when other phylogenetic information is taken into account.” In fact, Boucher et al. (2003) apply two oddly nonquantitative parsimony analyses to support their speculations.

First, they assume that the ancestor was a population of uniform organisms with a single genome that was ancestral to each and every modern organism but not larger than modern microbial genomes. Then, Boucher et al. (2003) claim that differential gene loss would only work if this unique LUCA genome were significantly larger than modern bacterial genomes in order to encode all of the facultative operons that are patchily distributed in modern microorganisms. That criticism is essentially a restatement of two unsupported assumptions, to wit that the ancestral population contained a unique genome and that this genome was no larger than that of modern bacteria. However, where is the evidence that LUCA did not evolve and segregate into a number of metabolically distinct cell lineages so that the population ancestral to all bacteria was characterized by a multiplicity of genomes, as in the models of Woese (1998, 2000) and Kurland et al. (2006)? If, then, the ancestral population produced by LUCA was a complex one with diverse organisms related by general cellular properties but differentiated into distinct classes with characteristic genomes, the first parsimony argument of Boucher et al. (2003) is just circularly simplistic.

Second, they claim that the patchy distribution of facultative operons would require an “absurd” intensity of gene loss to generate modern patchy distributions. However, the numbers seem to have eluded Boucher et al. (2003). It would seem that by “unparsimonious” they mean that many losses with a probability close to 1.0 are less parsimonious than fewer transfers with probabilities close to zero. This questionable inequality is an allusion to two boundary conditions. One arises when there is no functional demand for a sequence; there the probability for the eventual loss of the sequence is commonly assumed to be close to one. 
On the other hand, based on their rarity (Stoebel, 2004; Pal et al., 2005) transfers of whole operons containing many coding sequences in a single event may have probabilities close to zero. So, a rational comparison of the virtues of differential gene loss and the selﬁsh operon model would require numbers that seem to be unavailable at present. It might turn out that both sorts of events are relevant but it might also turn out that the reductive scenarios are far more parsimonious than the acquisitive ones. The bottom line is that the most parsimonious trajectory might turn out to involve many events involving certain loss rather than a scenario with numerically fewer acquisitions through extremely rare transfers. It is anyone’s guess at this point.
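The comparison at issue can be made concrete by weighting each scenario by the joint probability of its events rather than by the number of events alone. The sketch below does this in Python under the simplifying assumption that events are independent; every probability and event count in it is hypothetical, since, as noted above, the real numbers are unavailable at present.

```python
import math


def log10_scenario_probability(n_events: int, p_event: float) -> float:
    """log10 of the joint probability of n_events independent events,
    each occurring with probability p_event."""
    return n_events * math.log10(p_event)


def break_even_losses(n_transfers: int, p_transfer: float, p_loss: float) -> float:
    """Number of independent losses whose joint probability equals that of
    the transfer scenario (both scenarios treated as independent events)."""
    return log10_scenario_probability(n_transfers, p_transfer) / math.log10(p_loss)


# Hypothetical illustration values, not estimates from any data set.
P_LOSS = 0.9        # loss of an operon with no functional demand
P_TRANSFER = 1e-6   # transfer of a whole operon in a single event

loss_scenario = log10_scenario_probability(n_events=50, p_event=P_LOSS)
transfer_scenario = log10_scenario_probability(n_events=3, p_event=P_TRANSFER)

print(f"50 losses:   log10 P = {loss_scenario:.2f}")
print(f"3 transfers: log10 P = {transfer_scenario:.2f}")
print(f"break-even number of losses: "
      f"{break_even_losses(3, P_TRANSFER, P_LOSS):.0f}")
```

With these illustrative values, 50 near-certain losses are many orders of magnitude more probable than 3 rare transfers, and several hundred losses would be needed before the transfer scenario became the likelier one. Different assumed probabilities would shift that balance, which is why the comparison cannot be settled by counting events.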

17.5 THE EMPEROR’S BLAST SEARCH REVISITED

The unrooted trees depicted in Figure 17.1 for 16S rRNA and protein-coding sequences, as well as that in Figure 17.2 depicting the corresponding protein domains, so-called fold superfamilies or FSFs, reveal in all cases a virtual node to which all three domains converge. Those virtual nodes confirm the much earlier inference from rRNA phylogeny that the three domains are direct or indirect descendents of a common ancestor (Woese and Fox, 1977; Woese, 1983; Woese, 1998). Furthermore, that virtual node suggests that all three domains contain a significant number of proteins that are orthologous to proteins in other domains. This is easily confirmed by examination of databases containing orthologous protein families (Goldovsky et al., 2005).

However, there are two competing views about how the three domains emerged from the ancestor. One class of scenarios suggests that LUCA evolves into archaea and bacteria before these two fuse genomes to produce a chimeric eukaryote genome (Zillig et al., 1989; Gogarten et al., 1996; Doolittle, 1999; Lopez-Garcia and Moreira, 1999). We refer to this trajectory as the complicated conventional scenario. The simpler alternative suggests that a cellular ancestor most resembling a primitive eukaryote was the progenitor of archaea and bacteria as well as of modern eukaryotes (Penny and Poole, 1999; Forterre and Philippe, 1999). In general, the original versions of both alternatives grew out of the idea that LUCA was a primitive organism that emerged relatively early in evolution, say, 3–4 billion years ago. This timing for both the complicated and the simple scenarios is in our view highly questionable, because it is inconsistent with the suggestion that LUCA was a fairly complex organism (Ouzounis et al., 2006). However, the proposal of a common ancestor to all three modern domains is quite viable (see Figure 17.3).

Figure 17.1 Unrooted phylogeny for 50 fully sequenced genomes based on gene content and on small subunit ribosomal RNA sequences, as described by Korbel et al. (2002) and generously provided by Martijn Huynen. (See insert for color representation of this figure.)

Critical phylogenetic data to support the complicated conventional scenario are absent at present. In their place, BLAST-based homology searches are offered (Gogarten et al., 1996; Doolittle, 1999; Lopez-Garcia and Moreira, 1999; Koonin et al., 2000). Indeed, a multitude of homology searches have produced vast amounts of information that defy rational interpretation because the intrinsic ambiguities are impenetrable (Eisen, 1998; Kurland, 2000; Koski and Golding, 2001; Canback et al., 2002; Kurland et al., 2003, 2006). See Section 17.2.4.

The aesthetic appeal of the complicated scenario is that it proposes that two smallish proteomes are combined to form a bigger one (Gogarten et al., 1996; Doolittle, 1999; Lopez-Garcia and Moreira, 1999; Koonin et al., 2000). Here, then, a common prejudice that evolution is always about getting bigger and more complex is reinforced. However, the sums do not add up. Characteristic eukaryote proteins and cellular features are missing from archaea and bacteria (Kurland et al., 2006; Collins et al., 2009). That is to say, with the exception of only partially describing the origins of mitochondria, nothing about eukaryotes that makes them eukaryotes is explained by the complicated conventional scenario (Kurland et al., 2006; Collins et al., 2009).

Figure 17.2 Unrooted phylogeny and accompanying Venn diagram for 1259 FSFs from 84 fully sequenced genomes as described in Wang et al. (2007) and kindly provided by Gustavo Caetano-Anolles. (See insert for color representation of this figure.)

Figure 17.3 An artist’s view of phylogeny relating the origins of the modern domains from LUCA. Here, descendents of LUCA are identiﬁed both as primitive eukaryote phagotrophes and as its prey that diverged by reductive evolution in the form of streamlined cells. The streamlined prey evolve into archaea and bacteria, while the phagotrophe evolving in the retentive mode takes on a bacterium that became the mitochondrion.

Figure 17.4 This electron micrograph of Gemmata obscuriglobus, a Planctomycetes, was generously provided by John Fuerst. The bacterium features a nucleoid surrounded by an internal double membrane.

What are typically missing from archaea and bacteria are spliceosomes with their hundreds of specific proteins, cytoskeleton elements, membrane-limited intracellular compartments, and the like that are nearly ubiquitous in modern eukaryotes and accordingly inferred to be elements of the last common ancestor of eukaryotes (Collins and Penny, 2005; Kurland et al., 2006; Dacks and Field, 2007; Collins et al., 2009). Nevertheless, there are exceptional bacteria, such as the Planctomycetes (see Figure 17.4), to which we return, that present what may be relics of ancestral eukaryotic equipment. Most tantalizing, Planctomycetes have been reliably identified at the root of the bacterial tree (Brochier and Philippe, 2002).

Though nearly ubiquitous in eukaryotes, mitochondria are likely descendents of bacteria that were parasites or symbionts of early modern eukaryotes. Indeed, there are reliable phylogenetic data suggesting that limited transfer of bacterial sequences to eukaryote genomes was associated with the acquisition of mitochondria (Andersson et al., 1998; Berg and Kurland, 2000; Karlberg et al., 2000; Kurland and Andersson, 2000; Gabaldon and Huynen, 2004). Nevertheless, there are still no data to support the complementary aspect of the complicated scenario, namely, that an archaeal genome contributed sequences to the nuclear genome of an early eukaryote.

The simpler scenario seems more credible. Here, relevant support includes recognition that archaea and bacteria diverged as streamlined cells (Darnell, 1978; Penny and Poole, 1999; Forterre, 2001), identification of eukaryotes at the root of phylogenies of paralogous proteins (Brinkmann and Philippe, 1999; Forterre and Philippe, 1999), identification of eukaryote signature proteins in mitochondrial proteomes but encoded in eukaryote nuclear genomes (Karlberg et al., 2000; Kurland and Andersson, 2000), and the rooting of trees based on rRNA secondary structures among the eukaryotes (Caetano-Anolles, 2000).

17.5.1 Ecological Settings

Two global ecological catastrophes are likely to have provided some impetus to the divergence of all three modern domains from an ancestral population: First, it is commonly agreed that the acquisition of mitochondria by eukaryotes was preceded by a gradual increase in atmospheric oxygen tension. Since oxygen is toxic to cells, its accumulation in the biosphere may be viewed as an environmental catastrophe, to which mitochondria are in part a detoxifying adaptation (Andersson and Kurland, 1999). Second, the debut of phagotrophy by primitive eukaryotes would likewise be a catastrophic ecological event for the communities of cells made up of the phagotrophes and their prey. Thus, the phagotrophic “raptors” were a novelty that created huge ecological challenges for both hunted and hunters (Kurland et al., 2006).

Both catastrophes may have contributed to the divergence of the three modern domains. Certainly, the acquisition of bacterial symbionts that eventually evolve into organelles in primitive eukaryotes would be facilitated by the prior appearance of phagotrophic vacuoles in primitive cell populations (Stanier and van Niel, 1962; De Duve, 1982; Cavalier-Smith, 1987). Furthermore, it has been suggested that small cell size and relatively rapid growth physiology were the adaptations that allowed prey to outgrow and survive the depredations of the newly arrived cellular “raptors” (Kurland et al., 2006). Accordingly, new cell types corresponding to streamlined, rapidly growing bacteria, adapted to the newly oxygen-enriched environment, could evolve into the symbionts that seeded the mitochondrial lineage in a larger phagotrophic raptor.

If this scenario is anything like accurate, the succession of events would have been the introduction of phagotrophy and the appearance of aerobic bacteria prior to the acquisition of mitochondria by eukaryotes, which culminated roughly 1.8 billion years ago (Sicheritz-Ponten et al., 1998). That is to say, without relying too much on the precision of this extrapolation, the divergence of modern eukaryotes with mitochondria is somewhat later than the divergence of aerobic bacteria, which would be closer to two billion years than to four billion years ago. 
This scenario is depicted in Figure 17.3. Why two microbial domains? The thermoreduction hypothesis of Forterre (1995) is an example of an ecological account in which archaea populating extreme environments might have diverged from bacteria. However, the widespread distributions of archaea that have been detected recently suggest that preferential exploitation of extreme environments may not be an archaeal preference (Valentine, 2007). However, a related biochemical account has suggested that the archaea and the bacteria have diverged by adapting different membrane structures and metabolic strategies. The outcome of their design choices is that archaea are more efﬁciently adapted to low energy stress, while bacteria are specialists in rapid growth where there are easily accessible energy sources (Valentine, 2007). A third global factor that must have contributed to the timing of the divergence of the three domains is the elaboration of the cohort of modular protein domains or folds from which protein families are compounded (Kurland et al., 2007). Such a network of protein domains is illustrated in Figure 17.2, and its implications for the evolution of the three domains are the focus of the next section. It is sufﬁcient to say here that if Dollo’s Law is applicable to polypeptide evolution (Farris, 1977; Ferrari, 1988; Marshall et al., 1994), the invention of the individual domains from which all proteins are compounded was almost certainly the rate-limiting step in the evolution of modern proteomes. Furthermore, it may well have taken most of approximately 2 billion years that preceded the divergence of the three domains to evolve the modern cohort of these modules. In that case, the timing of the divergence of the three domains may have coincided with or followed closely after the evolution of the modern cohort of protein domains (Kurland et al., 2007). 
Amazingly, such a concurrence is consistent with the ﬁrst “uncorrected” estimates of the coalescence time to LUCA that turn out to be approximately 2 billion years (Doolittle et al., 1996).

17.5.2 The Way Forward

Figure 17.2 shows the unrooted tree for protein domains (FSFs) from 82 fully sequenced genomes, slightly overrepresented by those of bacteria. In fact the modular domains of proteins can be represented in at least three flavors: as folds, fold families and fold superfamilies (FSFs), each an inclusive collection of domains related by structure homology. There are 887 folds, 1443 fold superfamilies and 2630 fold families that have been identified so far in the Structural Classification of Proteins (SCOP) database release from February 2005 in which 65,122 domains were identified (Wang et al., 2007). Those numbers are slowly creeping up, but they are not expected to surge dramatically. Rather they seem to be approaching an asymptote as more genome sequences become available. We have focused on the FSFs because they are thought to be monophyletic collections of domains related by structure, sequence, and function (Wang et al., 2006).

The first point about these families is that compared to the hundreds of thousands of orthologous protein families recorded in the OFAM database (Goldovsky et al., 2005) there are orders of magnitude fewer FSFs. Second, unlike the protein families, which are dominated by eukaryote members, the FSFs are represented in roughly comparable numbers in each of the three domains. For example, for the cohort described in Figure 17.2 with a total of 1259 different FSFs, there are 1192 in eukaryotes, 1147 in bacteria, and 826 in archaea. Of those 1259 FSFs, fully 779 or nearly 62% are present in all three domains, while 1127 or nearly 90% are present in two or three domains. This pattern of commonality suggests rather unambiguously that the three domains are the descendents of a common ancestor, LUCA (Kurland et al., 2007).

Next, we can compare the degree of overlap or distance between each of the domains and a putative LUCA. For this comparison, we calculate the most parsimonious path to the present day cohort of 1259 FSFs. 
For this calculation, we take the present number of FSFs in each of the three domains in turn to represent a putative LUCA cohort of FSFs. We then compare the numbers of FSFs that must be added to each one of the domains in order to generate the other two. We find that starting with eukaryotes we would need to invent 67 new FSFs, while starting with bacteria would require 112 new FSFs and starting with archaea would require 433 additional FSFs. In other words, the most parsimonious scenario clearly begins with a LUCA resembling modern eukaryotes to which additional FSFs are added to generate the archaea and bacteria. This calculation can be done “in the other direction” by assuming that LUCA had all 1259 FSFs and asking how many would have been lost in evolving each of the modern domains. The numbers are of course the same because the eukaryotes are the FSF cohort most closely related to all 1259 FSFs.

These results confirm those obtained independently from a different cohort of 57 fully sequenced genomes (Yang et al., 2005; Kurland et al., 2007). The previous cohort contained 1155 FSFs distributed between the three domains, roughly, as is the present cohort, and with the same most parsimonious identification of eukaryotes with the LUCA cohort of FSFs (Kurland et al., 2007). However, the previous cohort was slightly more suitable than the present one because there were exactly equally many genomes representing each of the three domains, 19. For this reason, the convergence of the results from the two independent parsimony tests suggests that the comparisons are robust.

The unrooted phylogenies in Figures 17.1 and 17.2 strongly suggest that the three modern domains are branches of a monophyletic tree. The implication is that a unifying network identity pervades all three domains. Indeed, that identity is evident in the network of modern FSFs featuring the overlap of folds shared by the domains as in the Venn diagram of Figure 17.2. On the other hand, a pervasive orthology of protein folds within the three modern domains implies that the identification of LUCA with either archaea or bacteria or eukaryotes might be arbitrary unless it supports an informative hierarchy, as we show it does in the next section.

We conclude that the data from the Venn diagrams of FSFs here and in Kurland et al. (2007) are in accord with all previous identifications of the rooting of proteomes from the three modern domains in the eukaryote lineage as most clearly shown for the fold networks (Caetano-Anolles and Caetano-Anolles, 2003; Wang and Caetano-Anolles, 2006; Wang et al., 2007). Since 90% of the FSFs analyzed here as well as elsewhere (Kurland et al., 2007) are shared by at least two of the three domains, the common ancestor LUCA is likely to have encoded that 90% cohort if not all the FSFs in modern proteomes. This is consistent with the notion that the divergence of the modern domains may have been delayed until most of the modern FSFs had evolved (Kurland et al., 2007). Next we consider evidence that the evolution of protein-coding sequences has been dominated by the reductive mode in the modern archaea and bacteria, while a more retentive mode has dominated that of eukaryotes.
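The parsimony counts quoted for the cohort of Figure 17.2 follow directly from the per-domain FSF totals, and the Venn diagram figures can be cross-checked in the same way. The following Python sketch reproduces that bookkeeping; the constant and function names are illustrative, and only the counts themselves come from the text.

```python
# FSF counts quoted in the text for the cohort of Figure 17.2.
TOTAL_FSFS = 1259
FSFS_PER_DOMAIN = {"eukaryotes": 1192, "bacteria": 1147, "archaea": 826}
IN_ALL_THREE = 779
IN_TWO_OR_THREE = 1127


def fsfs_to_invent(start_domain: str) -> int:
    """FSFs that must be added if LUCA held only the FSFs of start_domain."""
    return TOTAL_FSFS - FSFS_PER_DOMAIN[start_domain]


# Venn-diagram regions derived from the quoted totals.
in_exactly_two = IN_TWO_OR_THREE - IN_ALL_THREE
in_exactly_one = TOTAL_FSFS - IN_TWO_OR_THREE

# Each FSF is counted once for every domain in which it occurs, so the
# per-domain totals must sum to 1*one + 2*two + 3*three.
memberships = in_exactly_one + 2 * in_exactly_two + 3 * IN_ALL_THREE
venn_is_consistent = memberships == sum(FSFS_PER_DOMAIN.values())

for domain in FSFS_PER_DOMAIN:
    print(f"start from {domain}: invent {fsfs_to_invent(domain)} FSFs")
print(f"in all three domains: {IN_ALL_THREE / TOTAL_FSFS:.1%}")
print(f"in two or three domains: {IN_TWO_OR_THREE / TOTAL_FSFS:.1%}")
print(f"Venn totals consistent: {venn_is_consistent}")
```

The sketch returns the 67, 112, and 433 additions stated above, and the per-domain totals agree with the shared and unshared counts, which supports the internal consistency of the quoted figures. It counts only presence and absence of FSFs, so it says nothing about how many genomes within a domain carry each one.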

17.5.3 Less May be More

We expect the evolution of all proteomes to be minimalist in the sense that the fitness of competing cells will be enhanced by efficient function of the smallest possible mass investment in molecular components such as ribosomes, polymerases, cytochromes, etc. Small is good, or as Maynard Olson (1999) has quipped: less may be more. Accordingly, there is always selection pressure that minimizes the complexity of proteomes as well as a tendency to reduce the sizes of proteins (Ehrenberg and Kurland, 1984; Kurland et al., 2007). The pressure to approach minimal limits depends on parameters such as expression levels of components and population sizes (Kurland et al., 2007). When the approach to minimal limits is intensified, as for large populations and relatively high expression levels of individual molecular components, selection is expressed as reductive evolution of genomes with fewer and shorter coding sequences. Alternatively, when the approach to minimal limits is less pronounced, as for smaller populations with low expression levels for the members of highly diverse proteomes, retentive evolution is expressed in genomes as selection for more and longer sequences. A scan of the orthologous families of modern genomes shows that there are typically many more cellular proteins expressed in individual eukaryotes than in individual archaea or bacteria (Goldovsky et al., 2005). Hence from this viewpoint, eukaryotes typically are examples of retentive protein evolution while archaea and bacteria are examples of reductive protein evolution. We recall that the greater complexity and size of proteomes, as well as smaller effective population size, mitigate selection pressure for minimal protein lengths. These tendencies have been verified for a curated database of archaeal, bacterial, and eukaryote proteins that are members of orthologous protein families shared by all three domains (Kurland et al., 2007). 
It is found that the mean and median lengths of orthologous proteins in archaea are 311 and 308 amino acids, in bacteria are 309 and 308 amino acids, and in eukaryotes are 508 and 506 amino acids, respectively. Thus, archaeal and bacterial protein lengths are less than two thirds those of eukaryotes. Previous measurements of lengths for randomly chosen proteins from the three domains are consistent with these values (Zhang, 2000; Liang and Riley, 2001; Brocchieri and Karlin, 2005). Furthermore, it had been suggested that this domain-specific size

difference arose from the special regulatory needs of eukaryote proteins to form complexes and to be targeted to specific compartments in cells (Brocchieri and Karlin, 2005). Although the complexities of archaeal, bacterial, and eukaryote protein-protein interactions seem not to be fundamentally different, the N-terminal and C-terminal sequences of the orthologous cohorts were inspected to determine whether eukaryote proteins are more specialized for interactive functions. They seem not to be so (Kurland et al., 2007). More to the point, it was possible to compare the selective pressure exerted on the lengths of proteins by studying the distributions of lengths in all the orthologous families. Here, the standard deviation of the lengths normalized to the mean length for proteins within orthologous families was compared in each of the three domains. This normalized standard deviation was taken as a measure of selective pressure: for archaea, bacteria, and eukaryotes, the resulting figures are 0.099, 0.091, and 0.2, respectively (Kurland et al., 2007). On the basis of these data, it may be concluded that selective pressure for minimal protein length is greatest among microbes and least among eukaryotes. In summary, the population sizes and individual protein expression levels are significantly lower in eukaryotes than in archaea and bacteria, with the consequence that the selection pressure to minimize lengths is significantly greater in the microbes than in the eukaryotes. We note that the other dimension of reductive evolution, namely, loss of coding sequences, will automatically raise the pressure for minimal lengths. Thus, loss of proteins tends to raise the expression levels of the remaining proteins when the total density of proteins is conserved. 
Likewise, reducing the complexity of the proteome tends to increase the growth rate maximum of cells because the fewer proteins that are expressed, the greater the protein density available for ribosomes, polymerases, etc. In terms of growth rates, loss of expressed sequences as well as shorter sequences necessarily lead to more efﬁcient growth physiology as in archaea and bacteria. The reductive evolution of genomes and their proteomes involves an exchange of complexity for efﬁciency (Kurland et al., 2007). Such an exchange is favored in archaea as well as bacteria and less favored among eukaryotes. Thus, the reductive evolution of microbial genomes and proteomes is a veriﬁable signature of the divergence of the archaea and bacteria from the eukaryotes.
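The length statistics used in this section can be made concrete with a short sketch. It checks the "less than two thirds" claim against the mean lengths quoted above and shows one way to compute a standard deviation normalized to the mean within an orthologous family. This is an illustration under stated assumptions, not the published analysis: the use of the population standard deviation, the averaging of per-family values into one domain-level figure, the function names, and the toy length lists are all assumptions made here.

```python
from statistics import mean, pstdev

# Mean orthologous protein lengths quoted in the text (amino acids).
MEAN_LENGTH = {"archaea": 311, "bacteria": 309, "eukaryotes": 508}

length_ratio = {
    domain: MEAN_LENGTH[domain] / MEAN_LENGTH["eukaryotes"]
    for domain in ("archaea", "bacteria")
}


def normalized_sd(lengths):
    """Standard deviation of protein lengths divided by their mean length."""
    return pstdev(lengths) / mean(lengths)


def domain_normalized_sd(families):
    """Average the per-family normalized SD over all orthologous families.

    Assumption: the domain-level figure is a simple mean over families.
    """
    return mean(normalized_sd(family) for family in families)


# Hypothetical families: tightly constrained versus loosely constrained lengths.
tight_families = [[300, 310, 320], [295, 300, 305]]
loose_families = [[400, 500, 600], [450, 520, 610]]

print(length_ratio)
print(domain_normalized_sd(tight_families), domain_normalized_sd(loose_families))
```

A smaller normalized standard deviation means that lengths within a family cluster more tightly around the mean, which is the sense in which the text reads the microbial values of 0.099 and 0.091 as stronger constraint than the eukaryote value of 0.2.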

17.6 WILL THE REAL MISSING LINK PLEASE STAND UP?

The notion that the ancestor of modern eukaryotes was an anaerobic phagotrophic eukaryote is not new (e.g., Cavalier-Smith, 1987). However, the discovery of mitochondrial relics in putative descendants of that phagotrophe, the so-called archaezoans, appropriately discredited the identification of these as descendants of a premitochondrial ancestor of eukaryotes (Cavalier-Smith, 2002). It is arguable that the initial search for the ancestral anaerobes among highly reduced anaerobic eukaryote parasites was misdirected from the beginning. That search might have been more usefully restricted to free-living eukaryote anaerobes because these are the organisms most likely to be descended from an ancestral free-living anaerobic phagotrophe (Kurland et al., 2006). For these reasons, the claim that there is “proof” (Martin et al., 2006) that the putative phagotrophe never existed is somewhat less than convincing. The bottom line is that so far the search for a free-living anaerobic eukaryote phagotrophe with no traces of a mitochondrial history has been fruitless. However, Figure 17.4 shows a prime candidate for the position of “missing link” between a hypothetical eukaryote ancestor and the bacterial domain. This micrograph is a portrait of Gemmata obscuriglobus, a representative of the bacterial phylum Planctomycetes (Fuerst, 2005). It illustrates general characteristics of Planctomycetes that include the absence of the peptidoglycan cell wall characteristic of other bacterial clades, an outer membrane enclosing the cytosol, and a double membrane enclosing an internal compartment containing what has been identified as a nucleoid (Fuerst, 2005). The partial resemblance to eukaryote cellular features is stunning. Early evidence (Stackebrandt et al., 1984) suggested that the 16S rRNA of Planctomycetes was missing some oligonucleotide signatures of bacterial 16S rRNA. This and the absence of peptidoglycan walls suggest that the phylum had diverged early in the evolution of the bacterial domain, and that these missing features may have been introduced to the bacteria later in their evolution (Stackebrandt et al., 1984). Nevertheless, conventional phylogenetic reconstructions based on 16S rRNA place the Planctomycetes distant from the putative thermophilic root of bacteria (Brochier and Philippe, 2002; Teeling et al., 2004). This discrepancy was resolved by phylogeny that takes into account the disparate rates of mutation in both rRNA and proteins. Thus, the thermophiles have been displaced from the root of bacteria by reliable phylogeny that identifies mesophiles at the root of the bacterial domain (Galtier and Gouy, 1995; Galtier et al., 1999; Forterre and Philippe, 1999; Brochier and Philippe, 2002). Accordingly, it comes as no surprise that Planctomycetes are now found at the root of phylogenetic reconstructions of bacteria based on conservative positions in 16S rRNA sequences (Brochier and Philippe, 2002). Most recently, Fuerst et al. (2008) have discovered, in micrographs of the inner membranes of G. obscuriglobus, pores that resemble in detail the nuclear pores of eukaryotes. There is in addition the general observation that Planctomycetes divide by budding, somewhat reminiscent of fungi (Fuerst and Nisbet, 2004). 
Finally, extensive internal membranous compartments have been found in some archaea, but here a cell wall is also discerned (Fuerst et al., 1998, 1999; Rachel et al., 2002). In summary, the Planctomycetes as well as a small number of archaea present what seem to be internal membranous compartments that, though simple, are reminiscent of eukaryote structures. These and other considerations have persuaded Fuerst and Nisbet (2004) as well as Fuerst (2005) that the archaea and bacteria may be descended from a simple eukaryote-like ancestor. Martin et al. (2006) would certainly want to be the first to point out that this inference requires “proof”, perhaps from extensive BLAST-based homology searches. Hopefully, new genomic data will soon provide a test of Fuerst’s thesis. A working hypothesis is that archaea and bacteria arose from a primitive eukaryote-like ancestor (Kurland et al., 2006, 2007). In that case the membranous structures initially inherited from this ancestor were eventually lost by reductive evolution. This account is also consistent with the hypotheses of Dacks and Field (2007). They have suggested that the eukaryote membrane trafficking system originated in a proto-eukaryote ancestor. According to both views, the internal membrane systems originated within the eukaryote lineage, not in either the archaeal or the bacterial lineages. Of course, critics will want to suggest that eukaryote features such as internal membranes could have arisen by gene transfer from eukaryotes to microbes. But such a mechanism would not provide a simple scenario to account for the location of the Planctomycetes at the root of the bacteria, nor would it explain their missing peptidoglycan walls.

17.7 ALL’S WELL

The thread running through this chapter is that chemistry, cell biology, ecology, and population genetics provide insights into the links that constrain the evolution of proteomes.

Indeed, these links can be summarized at every level as context-dependent constraints. Such constraints explain why it is that mutations hitchhiking through populations exhibit an ephemeral transience. The overwhelming majority of mutations, including gene transfers, simply disappear without trace. Mutations that are fixed in a global population are rare, because the requirements to slip into a network to carry out the same selectively trimmed job as the previous occupant are exacting. Deviations by small fractions of a percent of kinetic efficiency or mass investment are enough to promote the elimination of a mutant variant from the network, particularly in large populations and for high expression level components. The consequence is that most often only strong selection either for the host fitness or for the virulence of an infective vehicle can provide hitchhiking mutations the lift they need to fixation. It follows that gene transfer is a seldom-seen event. For example, we may take 3–5% gene transfers in modern genomes and combine these figures with the understanding that they represent cumulative transfers after millions of generations. The bottom line is that the fraction of vertical transfer per generation is more than 0.99999995. If you want to quibble about how fast transfers are ameliorated, we can take 10,000 generations as our base. Then the fraction of vertical transfers per generation is still at least 0.999995. From this perspective gene transfer seems quite rare, and certainly not frequent enough to justify announcing either the demise of Darwinian descent or the debut of a phylogenetic paradigm shift. When gene transfer was being used to invent molecular genetics, it was a phenomenon under strict experimental control. By the time that it was recruited to bring on the paradigm shift, the identification of gene transfer was no longer experimental. It was instead inferential. 
Since it was based on interpretations of various sorts of genome data, the detection of sequence transfer became inextricably entwined in “theories” of genome evolution. BLAST-based homology searches that replace phylogeny are a fine example of the confusion created by failing to distinguish results that are consistent with a hypothesis from those that validate it. A universal high-density cytosol is the selective condition upon which ubiquitous policing systems composed of chaperonins and proteases are predicated. These together are the context out of which the modular system of polypeptide folds emerged early in the evolution of proteomes composed of compact, interactive polypeptides. It is not impossible that the evolution of LUCA with its modest 1500-protein proteome was coincidental with the near completion of the modern cohort of FSFs. This is suggested by the finding that 90% of modern FSFs are shared by two or three of the modern domains. If so, the subsequent evolution of modern proteomes has been more a matter of combining compact domains in novel arrangements and much less a matter of inventing new folds. The differences between the proteomes of the three modern domains are marginal at the level of FSFs. However, the archaea and the bacteria share a common reductive evolutionary mode in which both the diversities of their proteomes and the lengths of their proteins are strongly constrained. In contrast, eukaryotes are more retentive and their genomes by comparison encode enormously diverse protein cohorts. The efficient exploitation of the much more diverse proteomes of eukaryotes probably provided the selective condition for the evolution of cellular compartments and the internal membrane systems of eukaryotes. The invention of phagotrophy in an early eukaryote descendant of LUCA may have triggered the reductive evolution of related descendants of LUCA, the archaea and the bacteria. 
That impetus to reductive evolution eventually led to the divergence of two domains of specialist chemists selected for relatively efﬁcient growth. In contrast, the increasingly complex phagotrophes with their specialized hunting and grazing modes

developed large compartmental cells with an eventual expansion into multicellular architectures supported by exquisitely diverse proteomes. Looming on the immediate horizon are two explorations directed in opposite directions. One is directed further back in time to the world preceding LUCA. This is the ribonucleoprotein world from whence ribosomes, spliceosomes, and other wondrous things emerged. But serious progress in that exploration will in all likelihood require abandoning the notion of an RNA World, the existence of which is entirely speculative. Furthermore, it seems meaningless to invoke the evolution of protein-encoding RNA genomes or machinery for protein synthesis in an RNA World without significant biological roles for proteins. The fact that in modern cells RNA is always found together with proteins may be a sign of the primitive ancestral arrangement. Finally, the age of genome communities or metagenomics is upon us. It can only be hoped that investigations of genome communities will profit from the experience already gained in identifying the transience of sequence transfers and the need to distinguish theory from observation. Bioinformatics will need to evolve into a science that is more than superficially informed by biology as it takes up the metagenomic challenge.
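The per-generation figures for vertical transmission quoted earlier in this section follow from dividing the cumulative fraction of transferred genes by the number of generations over which it accumulated. The sketch below reproduces that arithmetic under two simplifying assumptions made here for illustration: transfers accumulate linearly, and "millions of generations" is taken as exactly one million. The function name is hypothetical.

```python
def vertical_fraction(cumulative_transfer, generations):
    """Per-generation fraction of vertically inherited sequence.

    Assumes the cumulative transferred fraction built up evenly, so the
    per-generation horizontal share is cumulative_transfer / generations.
    """
    return 1.0 - cumulative_transfer / generations


# Upper and lower bounds of the 3-5% cumulative transfer quoted in the text.
over_a_million = {
    share: vertical_fraction(share, 1_000_000) for share in (0.03, 0.05)
}

# The more generous base of 10,000 generations.
over_ten_thousand = {
    share: vertical_fraction(share, 10_000) for share in (0.03, 0.05)
}

for share in (0.03, 0.05):
    print(f"{share:.0%}: {over_a_million[share]:.8f} {over_ten_thousand[share]:.6f}")
```

Even at the 5% upper bound and the shorter 10,000-generation base, the horizontal share per generation is only about five parts in a million, which is the basis for the claim that transfer is rare relative to vertical descent.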

ACKNOWLEDGMENTS

We thank Irmgard Winkler for invaluable help with the manuscript, as well as Martijn Huynen, Gustavo Caetano-Anolles, Thomas Johansson, and John Fuerst for providing material to illustrate this essay. We are grateful to Dan Andersson, Siv Andersson, Patrick Biggs, Gustavo Caetano-Anolles, Lesley Collins, Mans Ehrenberg, John Fuerst, Roger Garret, and David Penny for criticism, guidance, stimulation as well as help with the literature. Charles G. Kurland’s research is supported by The Royal Physiographic Society, Lund and The Nobel Committee for Chemistry, the Royal Swedish Academy of Sciences, Stockholm. Otto G. Berg’s research is supported by the Swedish Research Council, Stockholm.

REFERENCES

ANDERSSON, S.G.E. and KURLAND, C.G., 1999. Origins of mitochondria and hydrogenosomes. Curr. Opin. Microbiol. 2: 535–541. ANDERSSON, S.G., ZOMORODIPOUR, A., ANDERSSON, J.O., SICHERITZ-PONTEN, T., ALSMARK, C.M.U., PODOWSKI, R.M., NÄSLUND, A.K., ERIKSSON, A.-S., WINKLER, H.H., and KURLAND, C.G., 1998. The genome sequence of Rickettsia prowazekii and the origin of mitochondria. Nature 396: 133–140. ARAVIND, L., TATUSOV, R.L., WOLF, Y.I., WALKER, D.R., and KOONIN, E.V., 1998. Evidence for massive gene exchange between archaeal and bacterial hyperthermophiles. Trends Genet. 14: 442–444. ASAI, T., ZAPOROJETS, D., SQUIRES, C., and SQUIRES, C.L., 1999. An Escherichia coli strain with all chromosomal rRNA operons inactivated: complete exchange of rRNA genes between bacteria. Proc. Natl. Acad. Sci. USA 96: 1971–1976. BERG, O.G., 1990. The influence of macromolecular crowding on thermodynamic activity: solubility and dimerization constants for spherical and dumbbell-shaped molecules in a hard-sphere mixture. Biopolymers 30: 1027–1037. BERG, O.G. and KURLAND, C.G., 1997. Growth rate-optimised tRNA abundance and codon usage. J. Mol. Biol. 270(4): 544–550. BERG, O.G. and KURLAND, C.G., 2000. Why mitochondrial genes are most often found in nuclei. Mol. Biol. Evol. 17: 951–961. BERG, O.G. and KURLAND, C.G., 2002. Evolution of microbial genomes: sequence acquisition and loss. Mol. Biol. Evol. 19: 2265–2276. BOUCHER, Y., DOUADY, C.J., PAPKE, R.T., WALSH, D.A., BOUDREAU, M.E.R., NESBO, C.L., CASE, R.J., and DOOLITTLE, W.F., 2003. Lateral gene transfer and the origins of prokaryotic groups. Annu. Rev. Genet. 37: 283–328. BRINKMANN, H. and PHILIPPE, H., 1999. Archaea sister group of bacteria? Indications from tree reconstruction artifacts in ancient phylogenies. Mol. Biol. Evol. 16: 817–825.

BROCCHIERI, C. and KARLIN, S., 2005. Protein lengths in eukaryotic and prokaryotic proteomes. Nucleic Acids Res. 33: 3390–3400. BROCHIER, C. and PHILIPPE, H., 2002. Phylogeny: a nonhyperthermophilic ancestor for bacteria. Nature 417: 244. CAETANO-ANOLLÉS, G., 2000. Evolved RNA secondary structure and the rooting of the universal tree of life. J. Mol. Evol. 54: 333–345. CAETANO-ANOLLÉS, G. and CAETANO-ANOLLÉS, D., 2003. An evolutionarily structured universe of protein architecture. Genome Res. 13: 1563–1571. CANBACK, B., ANDERSSON, S.G., and KURLAND, C.G., 2002. The global phylogeny of glycolytic enzymes. Proc. Natl. Acad. Sci. USA 99: 6097–6102. CANBACK, B., TAMAS, I., and ANDERSSON, S.G., 2004. A phylogenomic study of endosymbiotic bacteria. Mol. Biol. Evol. 21: 1110–1122. CAVALIER-SMITH, T., 1987. The simultaneous symbiotic origin of mitochondria, chloroplasts, and microbodies. In Endocytobiology (eds J.L. Lee and J.F. Frederick). New York Academy of Sciences, New York, pp. 55–71. CAVALIER-SMITH, T., 2002. The phagotrophic origin of eukaryotes and phylogenetic classification of Protozoa. Int. J. Syst. Evol. Microbiol. 52: 297–354. CAYLEY, S., LEWIS, B.A., GUTTMAN, H.J., and RECORD, M.T., JR., 1991. Characterization of the cytoplasm of Escherichia coli K-12 as a function of external osmolarity: implications for protein-DNA interactions in vivo. J. Mol. Biol. 222: 281–300. CHAKRAVARTY, S. and VARADARAJAN, R., 2000. Elucidation of determinants of protein stability through genome sequence analysis. FEBS Lett. 470: 65–69. COHAN, F.M., 2001. Bacterial species and speciation. Syst. Biol. 50(4): 513–524. COLLINS, L.G., KURLAND, C.G., BIGGS, P., and PENNY, D., 2009. The modern RNP world of eukaryotes. J. Heredity 100(5): 597–604. COLLINS, L. and PENNY, D., 2005. Complex spliceosomal organization, ancestral to extant eukaryotes. Mol. Biol. Evol. 22: 1053–1066. DACKS, J.B. and FIELD, M.C., 2007. 
Evolution of the eukaryotic membrane-trafficking system: origin, tempo and mode. J. Cell Sci. 120: 2977–2985. DARNELL, J.E., 1978. On the origin of prokaryotes. Science 202: 1257–1260. DAWKINS, R., 1976. The Selfish Gene, Oxford University Press, New York. De DUVE, C., 1982. Peroxisomes and related particles. Ann. NY Acad. Sci. 386: 1–4. DEMING, J.W., 2002. Psychrophiles and polar regions. Curr. Opin. Microbiol. 5: 301–309. DOBSON, C.M., 1999. Protein misfolding, evolution and disease. Trends Biochem. Sci. 24: 329–332. DONG, H., NILSSON, L., and KURLAND, C.G., 1996. Covariation of tRNA abundance and codon usage in Escherichia coli at different growth rates. J. Mol. Biol. 260(5): 649–663.

DOOLITTLE, R.F., 1995. The multiplicity of domains in proteins. Annu. Rev. Biochem. 64: 287–314. DOOLITTLE, R.F., FENG, D.-F., TSANG, S., CHO, S., and LITTLE, E., 1996. Response: dating the cenancester of organisms. Science 274: 1751–1753. DOOLITTLE, W.F., 1999. Phylogenetic classification and the universal tree. Science 284: 2124–2128. DYSON, H.J. and WRIGHT, P.E., 1998. Equilibrium NMR studies of unfolded and partially folded proteins. Nat. Struct. Biol. (NMR Supplement) 499–503. EHRENBERG, M. and KURLAND, C.G., 1984. Costs of accuracy determined by a maximal growth rate constraint. Q. Rev. Biophys. 17: 45–82. EISEN, J.A., 1998. Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res. 8: 163–167. ELLIS, R.J., 2001a. Macromolecular crowding: obvious but underappreciated. Trends Biochem. Sci. 26: 597–603. ELLIS, R.J., 2001b. Macromolecular crowding: an important but neglected aspect of the intracellular environment. Curr. Opin. Struct. Biol. 11: 114–119. FARRIS, J.S., 1977. Phylogenetic analysis under Dollo’s Law. Syst. Zool. 26: 77–88. FELSENSTEIN, J., 2001. Taking variation of evolutionary rates between sites into account in inferring phylogenies. J. Mol. Evol. 53: 447–455. FENG, D.-F., CHO, G., and DOOLITTLE, R.F., 1997. Determining divergence times with a protein clock: update and reevaluation. Proc. Natl. Acad. Sci. USA 94: 13028–13033. FERRARI, F.D., 1988. Evolutionary transformations and Dollo’s Law. J. Crustacean Biol. 8(4): 618–619. FITZ-GIBBON, S.T. and HOUSE, C.H., 1999. Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res. 27: 4218–4222. FORTERRE, P., 1995. Thermoreduction, a hypothesis for the origin of prokaryotes. C.R. Acad. Sci. III Sci 318: 871–879. FORTERRE, P. and PHILIPPE, H., 1999. Where is the root of the universal tree of life? BioEssays 21: 871–879. FUERST, J.A., 2005. Intracellular compartmentation in Planctomycetes. Annu. Rev. Microbiol. 
59: 299–328. FUERST, J.A. and NISBET, E.G., 2004. Buds from the tree of life: linking compartmentalized prokaryotes and eukaryotes by a non-hyperthermophile common ancestor and implications for understanding Archaean microbial communities. Int. J. Astrobiol. 3: 183–187. FUERST, J.A., WEBB, R.I., GARSON, M.J., HARDY, L., and REISWIG, H.M., 1998. Membrane-bounded nucleoids in microbial symbionts of marine sponges. FEMS Microbiol. Lett. 166: 29–34. FUERST, J.A., WEBB, R.I., GARSON, M.J., HARDY, L., and REISWIG, H.M., 1999. Membrane-bounded nuclear bodies in a diverse range of microbial symbionts of Great Barrier Reef sponges. Mem. Qld. Mus. 44: 193–203. FUERST, J.A., WEBB, R.I., LEE, K.-C., MCCAMMON, R., YEE, B., and BUTLER, M.K., 2008. Structures in the planctomycete bacterium Gemmata obscuriglobus analogous to nuclear pores of eukaryotes. Submitted.

GABALDON, T. and HUYNEN, M.A., 2004. Shaping the mitochondrial proteome. Biochim. Biophys. Acta 1659: 212–220. GALTIER, N. and GOUY, M., 1995. Inferring phylogenies from DNA sequences of unequal base compositions. Proc. Natl. Acad. Sci. USA 92: 11317–11321. GALTIER, N., TOURASSE, N., and GOUY, M., 1999. A nonhyperthermophilic common ancestor to extant life forms. Science 283: 220–221. GLICKMAN, M.H. and CIECHANOVER, A., 2002. The ubiquitin-proteasome pathway: destruction for the sake of construction. Physiol. Rev. 82: 373–428. GOGARTEN, J.P., OLENDZENSKI, L., HILARIO, E., SIMON, C., and HOLSINGER, K.E., 1996. Dating the cenancester of organisms. Science 274: 1750–1751. GOLDBERG, A.L. and DICE, J.F., 1974. Intracellular protein degradation in mammalian and bacterial cells. Annu. Rev. Biochem. 43: 835–869. GOLDOVSKY, L., JANSSEN, P., AHREN, D., AUDIT, B., CASES, I., DARZENTAS, N., ENRIGHT, A.J., LOPEZ-BIGAS, N., PEREGRIN-ALVAREZ, J.M., SMITH, M., et al., 2005. CoGenT: an extensive and extensible data environment for computational genomics. Bioinformatics 21: 3806–3810. HAMILTON, W.D., 1964a. The genetical evolution of social behavior. I. J. Theor. Biol. 7: 1–16. HAMILTON, W.D., 1964b. The genetical evolution of social behavior. II. J. Theor. Biol. 7: 17–52. HARTL, F.U. and HAYER-HARTL, M., 2002. Molecular chaperones in the cytosol: from nascent chain to folded protein. Science 295: 1852–1858. HOOPER, S.D. and BERG, O.G., 2002. Gene import or deletion: a study of the difference genes in Escherichia coli strains K12 and O157:H. J. Mol. Evol. 54: 734–744. HUYNEN, M.A., SNEL, B., and BORK, P., 1999a. Lateral gene transfer, genome surveys, and the phylogeny of prokaryotes (in Technical Comments). Science 286: 1443. HUYNEN, M.A., DANDEKAR, T., and BORK, P., 1999b. Variation and evolution of the citric acid cycle: a genomic approach. Trends Microbiol. 7: 281–291. ITOH, T., MARTIN, W., and NEI, M., 2002. 
Acceleration of genomic evolution caused by enhanced mutation rate in endocellular symbiosis. Proc. Natl. Acad. Sci. USA 99: 12944–12948. JOHANSSON, M., LOVMAR, M., and EHRENBERG, M., 2008. Rate and accuracy of bacterial protein synthesis revisited. Curr. Opin. Microbiol. 11: 1–7. KARLBERG, O., CANBÄCK, B., KURLAND, C.G., and ANDERSSON, S.G.E., 2000. The dual origin of the yeast mitochondrial proteome. Yeast 17: 170–187. KIMURA, M., 1983. The Neutral Theory of Molecular Evolution, Cambridge University Press, Cambridge. KIMURA, M., 1987. Molecular evolutionary clock and the neutral theory. J. Mol. Evol. 26: 24–33. KOONIN, E.V., 2003. Comparative genomics, minimal gene-sets and the last universal common ancestor. Nat. Rev. Microbiol. 1: 127–136.

KOONIN, E.V., ARAVIND, L., and KONDRASHOV, A.S., 2000. The impact of comparative genomics on our understanding of evolution. Cell 101: 573–576. KORBEL, J.O., SNEL, B., HUYNEN, M.A., and BORK, P., 2002. SHOT: a web server for the construction of genome phylogenies. Trends Genet. 18: 158–162. KOSKI, L.B. and GOLDING, G.B., 2001. The closest BLAST hit is often not the nearest neighbor. J. Mol. Evol. 52: 540–542. KREIL, D.P. and OUZOUNIS, C.A., 2001. Identification of thermophilic species by the amino acid compositions deduced from their genomes. Nucleic Acids Res. 29: 1608–1615. KUMAR, S. and NUSSINOV, R., 2001. How do thermophilic proteins deal with heat? Cell. Mol. Life Sci. 58: 1216–1233. KUNIN, V., GOLDOVSKY, L., DARZENTAS, N., and OUZOUNIS, C.A., 2005a. The net of life: reconstructing the microbial phylogenetic network. Genome Res. 15: 954–959. KUNIN, V., AHREN, D., GOLDOVSKY, L., JANSSEN, P., and OUZOUNIS, C.A., 2005b. Measuring genome conservation across taxa: divided strains and united kingdoms. Nucleic Acids Res. 33: 616–621. KURLAND, C.G., 1992. Translational accuracy and the fitness of bacteria. Annu. Rev. Genet. 26: 29–50. KURLAND, C.G., 2000. Something for everyone; horizontal gene transfer in evolution. EMBO Rep. 1: 92–95. KURLAND, C.G., 2005. What tangled web: barriers to rampant horizontal gene transfer. BioEssays 27: 741–747. KURLAND, C.G. and ANDERSSON, S.G.E., 2000. Origin and evolution of the mitochondrial proteome. Microbiol. Mol. Biol. Rev. 64: 786–820. KURLAND, C.G., CANBACK, B., and BERG, O.G., 2003. Horizontal gene transfer: a critical view. Proc. Natl. Acad. Sci. USA 100: 9658–9662. KURLAND, C.G., CANBACK, B., and BERG, O.G., 2007. The origins of modern proteomes. Biochimie 89: 1454–1463. KURLAND, C.G., COLLINS, L.J., and PENNY, D., 2006. Genomics and the irreducible nature of eukaryotic cells. Science 312: 1011–1014. LAURENT, T.C., 1971. Enzyme reactions in polymer media. Eur. J. Biochem. 21: 498–506. LAWRENCE, J.G., 2001. 
Catalyzing bacterial speciation: correlating lateral transfer with genetic headroom. Syst. Biol. 50: 479–496. LAWRENCE, J.G. and OCHMAN, H., 1997. Amelioration of bacterial genomes: rates of change and exchange. J. Mol. Evol. 44: 383–397. LAWRENCE, J.G. and OCHMAN, H., 1998. Molecular archaeology of the Escherichia coli genome. Proc. Natl. Acad. Sci. USA 95: 9413–9417. LAWRENCE, J.G. and ROTH, J.R., 1996. Selﬁsh operons: horizontal transfer may drive the evolution of gene clusters. Genetics 143: 1843–1860. LIANG, P. and RILEY, M., 2001. A comparative genomics approach for studying ancestral proteins and evolution. Adv. Appl. Microbiol. 50: 39–72.

LOCKHART, P.J., STEEL, M.A., HENDY, M.D., and PENNY, D., 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol. 11: 605–612. LOPEZ, P., FORTERRE, P., and PHILIPPE, H., 1999. The root of the tree of life in the light of the covarion model. J. Mol. Evol. 49: 496–508. LOPEZ, P., CASANE, D., and PHILIPPE, H., 2002. Heterotachy, an important process of protein evolution. Mol. Biol. Evol. 19: 1–7. LOPEZ-GARCIA, P. and MOREIRA, D., 1999. Metabolic symbiosis at the origin of eukaryotes. Trends Biochem. Sci. 24: 88–93. LOVMAR, M. and EHRENBERG, M., 2006. Rate, accuracy and cost of ribosomes in bacterial cells. Biochimie 88: 951–961. LYNCH, M., 2007. The evolution of genetic networks by nonadaptive processes. Nat. Rev. Genet. 8: 803–813. LYNCH, M., 2007. The frailty of adaptive hypotheses for the origins of organismal complexity. Proc. Natl. Acad. Sci. USA 104: 8597–8604. LYNN, D.J., SINGER, G.A.C., and HICKEY, D.A., 2002. Synonymous codon usage is subject to selection in thermophilic bacteria. Nucleic Acids Res. 30: 4272–4277. MAISNIER-PATIN, S., ROTH, J.R., FREDRIKSSON, A., NYSTROM, T., BERG, O.G., and ANDERSSON, D.I., 2005. Genomic buffering mitigates the effects of deleterious mutations in bacteria. Nat. Genet. 37: 1376–1377. MARSHALL, C.R., RAFF, E.C., and RAFF, R.A., 1994. Dollo’s Law and the death and resurrection of genes. Proc. Natl. Acad. Sci. USA 91: 12283–12287. MARTIN, W., 1999. Mosaic bacterial chromosomes: a challenge en route to a tree of genomes. BioEssays 21: 99–104. MARTIN, W., DAGAN, T., KOONIN, E.V., DIPIPPO, J.L., GOGARTEN, J.P., and LAKE, J.A., 2006. The evolution of eukaryotes (in Letters). Science 316: 540. MARTIN, W. and MÜLLER, M., 1998. The hydrogen hypothesis for the first eukaryote. Nature 392: 37–41. MAYNARD SMITH, J., 1970. Natural selection and the concept of a protein space. Nature 225: 563–564. MCELHENY, D., SCHNELL, J.R., LANSING, J.C., DYSON, H.J., and WRIGHT, P.E., 2005. 
Deﬁning the role of active-site loop ﬂuctuations in dihydrofolate reductase catalysis. Proc. Natl. Acad. Sci. USA 102: 5032–5037. MEDIGUE, C., ROUXEL, T., VIGIER, P., HENAUT, A., and DANCHIN, A., 1991. Evidence for horizontal gene transfer in Escherichia coli speciation. J. Mol. Biol. 222: 851–856. MIKKOLA, R. and KURLAND, C.G., 1991a. Is there a unique ribosome phenotype for naturally occurring Escherichia coli? Biochimie 73: 1061–1066. MIKKOLA, R. and KURLAND, C.G., 1991b. Evidence for demand-regulation of ribosome accumulation in E. coli. Biochimie 73: 1551–1556. MIKKOLA, R. and KURLAND, C.G., 1992. Selection of laboratory wild type phenotype from natural isolates of E. coli in chemostats. Mol. Biol. Evol. 9: 394–402.

MINTON, A.P., 1981. Excluded volume as a determinant of macromolecular structure and reactivity. Biopolymers 20: 2093–2120. MURZIN, A.G., BRENNER, S.E., HUBBARD, T., and CHOTHIA, C., 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536–540. NARINX, E., BAISE, E., and GERDAY, C., 1997. Subtilisin from psychrophilic antarctic bacteria: characterization and site-directed mutagenesis of residues possibly involved in the adaptation to cold. Prot. Eng. 10: 1271–1279. NELSON, K.E., CLAYTON, R.A., GILL, S.R., GWINN, M.L., DODSON, R.J., HAFT, D.H., HICKEY, E.K., PETERSON, J.D., NELSON, W.C., and KETCHUM, K.A. et al., 1999. Evidence for lateral gene transfer between archaea and bacteria from genome sequence of Thermotoga maritima. Nature 399: 323–329. NOVOZHILOV, A.S., KAREV, G.P., and KOONIN, E.V., 2005. Mathematical modeling of evolution of horizontally transferred genes. Mol. Biol. Evol. 22 (8): 1721–1732. OCHMAN, H. and JONES, I.B., 2000. Evolutionary dynamics of full genome content in Escherichia coli. EMBO J. 19: 6637–6643. OCHMAN, H., LAWRENCE, J.G., and GROISMAN, E.A., 2000. Lateral gene transfer and the nature of bacterial innovation. Nature 405: 299–304. OLIVEBERG, M. and WOLYNES, P.G., 2006. The experimental survey of protein-folding energy landscapes. Q. Rev. Biophys. 38: 1–44. OLSON, M., 1999. When less is more: gene loss as an engine of evolutionary change. Am. J. Hum. Genet. 64: 18–23. ORGEL, L.E. and CRICK, F.H.C., 1980. Selfish DNA: the ultimate parasite. Nature 284: 604–607. OTZEN, D.E., KRISTENSEN, O., and OLIVEBERG, M., 2000. Designed protein tetramer zipped together with a hydrophobic Alzheimer homology: a structural clue to amyloid assembly. Proc. Natl. Acad. Sci. USA 97: 9907–9912. OTZEN, D.E. and OLIVEBERG, M., 1999. Salt-induced detour through compact regions of the protein folding landscape. Proc. Natl. Acad. Sci. USA 96: 11746–11751. 
OUZOUNIS, C.A., KUNIN, V., DARZENTAS, N., and GOLDOVSKY, L., 2006. A minimal estimate for the gene content of the last universal common ancestor: exobiology from a terrestrial perspective. Res. Microbiol. 157: 57–68. PAL, C. and HURST, L.D., 2004. Evidence against the selfish operon theory. Trends Genet. 20: 232–234. PAL, C., PAPP, B., and LERCHER, M.J., 2005. Adaptive evolution of bacterial metabolic networks by horizontal gene transfer. Nat. Genet. 37: 1372–1375. PEDERSEN, C., BROSS, B.P., STENBROEN-WINTER, V., CORYDON, T.J., BOLUND, L., BARTLETT, K., VOCKLEY, J., and GREGERSEN, N., 2003. Misfolding, Degradation, and Aggregation of Variant Proteins. J. Biol. Chem. 278: 47449–47458. PENNY, D. and POOLE, A., 1999. The nature of the last universal common ancestor. Curr. Opin. Genet. Dev. 9: 672–677.

PHILIPPE, H. and FORTERRE, P., 1999. The rooting of the universal tree of life is not reliable. J. Mol. Evol. 49: 509–523. PRICE, M.N., HUANG, K.H., ARKIN, A.P., and ALM, E.J., 2005. Operon formation is driven by co-regulation and not by horizontal gene transfer. Genome Res. 15: 809–819. RACHEL, R., WYSCHKONY, I., RIEHL, S., and HUBER, H., 2002. The ultrastructure of Ignicoccus: evidence for a novel outer membrane and for intracellular vesicle budding in an archaeon. Archaea 1: 9–18. RADFORD, S.E., 2000. Protein folding: progress made and promises ahead. Trends Biochem. Sci. 25: 611–618. ROBINSON, C.V., SALI, A., and BAUMEISTER, W., 2007. The molecular sociology of the cell. Nature 450: 973–982. ROBINSON, C. and SAUER, R., 1998. Optimizing the stability of single-chain proteins by linker length and composition mutagenesis. Proc. Natl. Acad. Sci. USA 95: 5929–5934. RUTHERFORD, S.L. and LINDQUIST, S., 1998. Hsp90 as a capacitor for morphological evolution. Nature 396: 336–342. SALISBURY, F.B., 1969. Natural selection and the complexity of the gene. Nature 224: 342–343. SICHERITZ-PONTEN, T., KURLAND, C.G., and ANDERSSON, S.G.E., 1998. A phylogenetic analysis of the cytochrome b and cytochrome c oxidase I genes supports an origin of mitochondria from within the Rickettsiaceae. Biochim. Biophys. Acta 1365: 545–551. SINGER, G.A.C. and HICKEY, D.A., 2003. Thermophilic prokaryotes have characteristic patterns of codon usage, amino acid composition and nucleotide content. Gene 317: 39–47. SMITH, D., DOUCETTE-STAMM, L., DELOUGHERY, C., LEE, H., DUBOIS, J., ALDREDGE, T., BASHIRZADEH, R., BLAKELY, D., COOK, R., GILBERT, K., et al., 1998. Complete genome sequence of Methanobacterium thermoautotrophicum ΔH: functional analysis and comparative genomics. J. Bacteriol. 179: 7135–7155. SNEL, B., BORK, P., and HUYNEN, M.A., 1999. Genome phylogeny based on gene content. Nat. Genet. 21: 108–110. SNEL, B., BORK, P., and HUYNEN, M., 2002. 
Genomes in ﬂux: the evolution of archaeal and proteobacterial gene content. Genome Res. 12: 17–25. SOREK, R., ZHU, Y., CREEVEY, C.J., FRANCINO, M.P., BORK, P., and RUBIN, E.M., 2007. Genome-wide experimental determination of barriers to horizontal gene transfer. Science 318: 1449–1452. SRERE, P.A., 1990. Citric acid cycle redux. Trends Biochem. Sci. 15: 164–165. SRIVASTAVA, D.K. and BERNHARD, S.A., 1986. Metabolite transfer via enzyme–enzyme complexes. Science 234: 1081–1086. STACKEBRANDT, E., LUDWIG, W., SCHUBERT, W., KLINK, F., and SCHLESNER, H., 1984. Molecular genetic evidence for early evolutionary origin of budding peptidoglycan-less eubacteria. Nature 307: 735–737. STANIER, R.Y. and van NIEL, C.B., 1962. The concept of a bacterium. Arch. Microbiol. 42: 17–35.


STOEBEL, D.M., 2004. Lack of evidence for horizontal transfer of the lac operon into Escherichia coli. Mol. Biol. Evol. 22 (3): 683–690. TADDEI, F., RADMAN, M., MAYNARD-SMITH, J., TOUPONCE, B., GOUYON, P.H., and GODELLE, B., 1997. Role of mutator alleles in adaptive evolution. Nature 387: 700–702. TEELING, H., LOMBARDOT, T., BAUER, M., LUDWIG, W., and GLOCKNER, F.O., 2004. Evaluation of the phylogenetic position of the planctomycete ‘Rhodopirellula baltica’ SH 1 by means of concatenated ribosomal protein sequences, DNA-directed RNA polymerase subunit sequences and whole genome trees. Int. J. Syst. Evol. Microbiol. 54: 791–801. TEICHMANN, S.A. and MITCHISON, G., 1999. Is there a phylogenetic signal in prokaryote proteins? J. Mol. Evol. 49: 98–107. TEKAIA, F., LAZCANO, A., and DUJON, B., 1999. The genomic tree as revealed from whole proteome comparisons. Genome Res. 9: 550–557. TEKAIA, F., YERAMIAN, E., and DUJON, B., 2002. Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene 297: 51–60. THOMPSON, M.J. and EISENBERG, D., 1999. Transproteomic evidence of a loop-deletion mechanism for enhancing protein thermostability. J. Mol. Biol. 290: 595–604. TYNDALL, J.D., NALL, T., and FAIRLIE, D.P., 2005. Proteases universally recognize beta strands in their active sites. Chem. Rev. 105: 973–999. VALENTINE, D.L., 2007. Adaptations to energy stress dictate the ecology and evolution of the Archaea. Nat. Rev. Microbiol. 5: 316–323. VOGES, D., ZWICKEL, P., and BAUMEISTER, W., 1999. The 26S proteasome: a molecular machine designed for controlled proteolysis. Annu. Rev. Biochem. 68: 1015–1058. WANG, M. and CAETANO-ANOLLÉS, G., 2006. Evolution inferred from domain combination in proteins. Mol. Biol. Evol. 23: 2444–2454. WANG, M., BOCA, S.M., KALELKAR, R., MITTENTHAL, J.E., and CAETANO-ANOLLÉS, G., 2006. 
A phylogenomic reconstruction of the protein world based on a genomic census of protein fold architecture. Complexity 12: 27–40. WANG, M., YAFREMAVA, L.S., CAETANO-ANOLLÉS, D., MITTENTHAL, J.E., and CAETANO-ANOLLÉS, G., 2007. Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world. Genome Res. 17: 1572–1585. WATSON, J.D., 1965. Molecular Biology of the Gene. W.A. Benjamin, Inc., New York. WELCH, G.R., 1977. On the role of organized multienzyme systems in cellular metabolism: a general synthesis. Prog. Biophys. Mol. Biol. 32: 103–196. WICKNER, S., MAURIZI, M.R., and GOTTESMAN, S., 1999. Posttranslational quality control: folding, refolding, and degrading proteins. Science 286: 1888–1893. WOESE, C.R., 1983. The primary lines of descent and the universal ancestor. In Evolution from molecules to men


(ed. D.S. Bendall). Cambridge University Press, Cambridge, pp. 209–233. WOESE, C.R., 1998. The universal ancestor. Proc. Natl. Acad. Sci. USA 95: 6854–6859. WOESE, C.R., 2000. Interpreting the universal phylogenetic tree. Proc. Natl. Acad. Sci. USA 97: 8392–8396. WOESE, C.R. and FOX, G.E., 1977. The concept of cellular evolution. J. Mol. Evol. 10: 1–6. YANG, S., DOOLITTLE, R.F., and BOURNE, P.E., 2005. Phylogeny determined by protein domain content. Proc. Natl. Acad. Sci. USA 102: 373–378. YANG, Z., 1995. A space-time process model for the evolution of DNA sequences. Genetics 139: 993–1005.

ZHANG, J., 2000. Protein-length distributions for the three domains of life. Trends Genet. 16: 107–109. ZHAXYBAYEVA, O., GOGARTEN, J.P., CHARLEBOIS, R.L., DOOLITTLE, W.F., and PAPKE, R.T., 2006. Phylogenetic analyses of cyanobacterial genomes: quantification of horizontal gene transfer events. Genome Res. 16: 1099–1108. ZILLIG, W., KLENK, H.-P., PALM, P., LEFFERS, H., PUHLER, G., GROPP, F., and GARRETT, R.A., 1989. Did eucaryotes originate by a fusion event? Endocytobiosis Cell Res. 6: 1–25. ZIMMERMAN, S.B. and TRACH, S.O., 1991. Estimation of macromolecule concentrations and excluded volume effects for the cytoplasm of Escherichia coli. J. Mol. Biol. 222: 599–620.

Chapter 18

Evolution of Metabolic Networks

Eivind Almaas

18.1 INTRODUCTION
18.2 METABOLIC NETWORK PROPERTIES
18.3 NETWORK MODELS FOR METABOLIC EVOLUTION
18.4 DYNAMIC MODELS OF GENOME-LEVEL METABOLIC FUNCTION
REFERENCES

18.1 INTRODUCTION

The postgenomic era has made it possible to study organisms at the systems level: integrating knowledge generated by a multitude of experimental high-throughput approaches to construct detailed models that aim to describe cellular function at the genome level. The basis of most system-level models consists of one or several cellular networks, such as an organism’s protein interaction network, its metabolism, or large signal transduction pathways. Since evolutionary forces have shaped the complex and highly nonlinear interactions between genes, proteins, and metabolites, there exists considerable variation in the nature of the basic cellular building blocks, as well as in their interactions. It is therefore important to develop modeling approaches and methodologies that are capable of taking into account their varied nature and functions. In this chapter, we will discuss the evolution of biological networks, by which we predominantly mean intracellular networks, in contrast to interaction networks between cells or interactions between species (food webs). While we will mainly focus on cellular metabolism, it will be necessary to consider the effect of changes in gene regulation, signal transduction, and protein interaction pathways, as these networks both take part in and regulate a well-functioning cell’s metabolism. To effectively discuss the consequences of evolutionary processes on biological networks, it is important to clarify our terminology. In the community of network scientists, network evolution is frequently described in terms of the appearance and disappearance of nodes and links, with no reference to principles of fitness and selection (Albert and Barabasi, 2002; Newman, 2003). The meaning of “evolution” is simply the

Evolutionary Genomics and Systems Biology, edited by Gustavo Caetano-Anolles Copyright 2010 John Wiley & Sons, Inc.


change in network topology over time and the dynamical rules that govern this change. From this perspective, network evolution can be studied with the same suite of tools as evolution of the World Wide Web, a friendship network, or a protein interaction network. In contrast, a biologist will typically think of evolution in terms of the genetic change occurring from one generation to another in a population of individuals through the action of natural selection (Mayr, 2001). This understanding of evolution is frequently also described as optimization on a ﬁtness landscape. There is not an inherent disagreement between these two understandings of the term evolution, just a difference in the emphasis on the origin and evaluation of variability. Indeed, a particular topology for a gene regulatory network that, for example, has high ﬁdelity in suppressing noise may be understood as a choice of survival strategy, and thus associated with a ﬁtness score in a given environmental setting (Weitz et al., 2007). Are biological networks optimal, and if so, which evolutionary processes have driven the adaptation? While this is a question that has to be settled separately for each network or subnetwork under consideration, it is reasonable to expect that natural selection has inﬂuenced properties of biological networks. It has been suggested that the action of evolution has more in common with tinkering than engineering (Jacob, 1977). The former approach would likely result in a variety of solutions to a given problem, while the latter would provide the same, optimal solution every time. Thus, is it possible to extract general principles of metabolic network organization from organisms that have evolved through tinkering? 
In fact, widely different biological networks have many large-scale topological properties in common, such as being small world (Watts and Strogatz, 1998; Albert and Barabasi, 2002), the shape of their connectivity distributions, and their modular and hierarchical organization (Barabasi and Oltvai, 2004). It has been argued that several similarities between biological networks originating from tinkering and systems designed according to “good engineering” practices exist: Both display high levels of modularity, they are often able to function in the presence of component failure, and they contain recurring circuit elements (Alon, 2003). The discovery of motifs in biological networks, frequently occurring subnetworks consisting of a few nodes (Shen-Orr et al., 2002; Milo et al., 2002), has been used as an argument for the selection of network topologies. For instance, motifs may display properties that are associated with high ﬁtness, such as high robustness and efﬁcient combination of components to achieve a given function (Alon, 2007). However, since random network models with no process of selection give rise to networks with an overabundance of motifs (Mazurie et al., 2005; Cordero and Hogeweg, 2006), this question is far from settled (Wagner, 2003a; Sole and Valverde, 2006; Lynch, 2007). We have organized the chapter as follows. The aim of Section 18.2 is to discuss general properties of the topology of cellular metabolism. We will pay attention to the role played by the so-called network hubs, network modularity, the small-world property of metabolic networks, and how the topology of metabolic networks may reveal patterns of transcriptional regulation. Section 18.3 is focused on discussing models for the evolution of metabolic network topology. We will discuss the limitations of such models in drawing conclusions for evolutionary selection of particular network properties. 
In Section 18.4, we will focus our discussion on models of genome-level metabolic function and insights they may give on principles of metabolic evolution.

18.2 METABOLIC NETWORK PROPERTIES

Cells depend on their ability to import molecules from the environment and convert these to the needed metabolites. The conversions are carried out by proteins, the enzymes that


catalyze specific conversions of starting molecules (substrates) into products. There may be several intermediary steps until the ultimate product is generated, each carried out by a different enzyme, and the set of all these component substrates, products, reactions, and enzymes forms a metabolic pathway. Metabolic pathways can be classified as either anabolic pathways that construct needed molecules or catabolic pathways that break down molecules to provide necessary reactants. Different reactions and catalyzing enzymes vary tremendously in their properties, and their activities may depend on the presence of cofactors. For instance, their rates of catalysis, or efficiency, may vary over several orders of magnitude. These variations will affect the overall rate of flow (flux) of metabolites in a particular pathway. For example, a pathway consisting of only two steps, and thus seemingly “faster” for the cell to implement, may in fact be very slow compared to a five-step pathway generating the same product, in which all the enzymes are highly efficient. From the reactant perspective, a particular type of molecule may participate in only one reaction or be used in several different reactions. A reaction may require one or more reactants, and the ratios (the stoichiometry) of those reactants may vary. Finally, while for the most part metabolic pathways can be assumed to be one way, there are cases of reversible reactions in a cell, as well as cyclic reaction pathways that take a reactant through a series of intermediates but end up regenerating the initial reactant. Since the cell’s metabolism is the sum of all the reactions it carries out, it is important to recognize that the set of active pathways at any instant critically depends on the cell’s nutrient environment. Additionally, cellular metabolism is not a stand-alone biological network; it is highly integrated with the cell’s gene regulation, signal transduction, and protein interaction networks. 
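The comparison between a short, slow pathway and a long, fast one can be made concrete with a minimal sketch. It assumes that every step is an irreversible first-order reaction, so that a molecule waits on average 1/k at a step with rate constant k; this kinetic simplification and all numerical values are illustrative assumptions, not taken from the chapter.

```python
def mean_transit_time(rate_constants):
    """Mean time for one molecule to traverse a chain of irreversible
    first-order steps; each step contributes an expected wait of 1/k."""
    return sum(1.0 / k for k in rate_constants)

# Hypothetical rate constants (per second), chosen only for illustration.
short_slow_pathway = [0.1, 0.1]   # two inefficient enzymes
long_fast_pathway = [10.0] * 5    # five highly efficient enzymes

t_short = mean_transit_time(short_slow_pathway)
t_long = mean_transit_time(long_fast_pathway)
print("two-step pathway:", t_short, "s; five-step pathway:", t_long, "s")
```

Under these assumptions the number of steps matters far less than the slowest rate constants, which is the point made in the text.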
An example that clearly demonstrates this point, and highlights some of its consequences, is the elaborate control network of the GAL pathway in the yeast Saccharomyces cerevisiae. The function of this subnetwork is to direct the uptake of the extracellular sugar galactose and the subsequent activation of the galactose degradation pathway, which converts galactose to glucose-1-phosphate that may enter the glycolysis pathway and central carbon metabolism (see Figure 18.1). Extracellular galactose is first transported into the cytosol. The enzyme GAL10p will subsequently initiate the degradation of galactose. In addition to its metabolic role, cytosolic galactose can bind to the protein GAL3p to form the activated complex GAL3p*. This activated form of GAL3p can now bind to the protein GAL80p, hindering it from inhibiting GAL4p. The protein GAL4p is then able to transcriptionally activate the genes GAL1, GAL2, GAL3, GAL7, GAL10, and GAL80. Consequently, the network in Figure 18.1 contains two positive feedback loops: An increase in the concentrations of GAL2p and GAL3p will lead to a further increase in their transcriptional activity. The opposite effect governs the behavior of GAL80p, where an increase in its concentration will suppress GAL4p, and thus acts as a negative feedback loop. Note that in a system with multiple feedback loops, it is possible for the dynamics to display several stable states. The balance in strength between the positive and negative feedback loops will determine which of the steady states is eventually reached. The example of the GAL network emphasizes that not only is network structure important, but the kinetic rates and affinities will also critically influence a circuit’s functional repertoire. Recent studies have demonstrated the effect of both network topology and interaction parameters on the properties of this circuit. 
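The statement that positive feedback can produce several stable states can be illustrated with a deliberately simplified, one-variable caricature. The sketch below is not a model of the GAL network: it assumes a single auto-activating protein with Hill-type feedback, and the function names and all parameter values are hypothetical.

```python
def rate(x, basal=0.05, vmax=1.0, K=0.5, n=4, decay=1.0):
    """Net production rate of an auto-activating protein at concentration x."""
    activation = x**n / (K**n + x**n)   # Hill-type positive feedback
    return basal + vmax * activation - decay * x

def steady_state(x0, dt=0.01, steps=5000):
    """Integrate dx/dt = rate(x) with forward Euler and return the final x."""
    x = x0
    for _ in range(steps):
        x += dt * rate(x)
    return x

# Two initial conditions on either side of the unstable threshold.
low = steady_state(0.1)
high = steady_state(0.9)
print("low state:", round(low, 3), "high state:", round(high, 3))
```

With these assumed parameters the same equation settles into a low or a high expression state depending only on where it starts, which is the kind of history dependence that a memory effect requires.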
For instance, by tuning the extracellular galactose concentration, one study found that the GAL network has the potential for reliably remembering previous exposure to galactose over hundreds of generations (Acar et al., 2005). This memory effect is chieﬂy caused by the GAL3p positive feedback loop, while the competing action of the negative feedback loop through GAL80p may destabilize memory. The amount of destabilization depends on its relative


Figure 18.1 Galactose uptake and degradation pathway in the yeast S. cerevisiae. External galactose is transported into the cell by the protein GAL2p, where it binds to GAL3p, generating its activated form GAL3p*. The products of genes GAL1, GAL7, and GAL10 degrade galactose to Glu1P in the cytosol, which then may enter the glycolysis pathway. Pointed and blunt arrows indicate activation and inhibition, respectively. Dotted lines indicate the direct connection between metabolic activity and gene product. Metabolite short names: βGal is β-D-galactose, αGal is α-D-galactose, αGal1P is α-D-galactose-1-phosphate, Glu1P is glucose-1-phosphate, UDP–Glu is UDP–D-glucose, and UDP–Gal is UDP–galactose. Gene names are all capital letters and their corresponding proteins have a last character “p.”

strength compared to the GAL3p positive feedback loop (Acar et al., 2005). A separate study of the GAL3p and the GAL80p feedback loops has demonstrated that they are significant for ensuring a homogeneous GAL network gene expression response, as well as being necessary for its expedient activation in the presence of galactose (Ramsey et al., 2006). However, generating detailed models that couple signaling pathways with metabolic activity requires extensive studies of the individual network components. The ability to generate models of this complexity and richness of detail with regard to binding strengths, enzyme affinities, and topological connections on a genome-level scale is, so far, outside the realm of experimental possibilities. It has recently become possible to generate genome-level representations of protein interaction networks and cellular metabolism, but at the cost of a coarser-grained representation. Before we can effectively discuss insights gained from a network representation, we must address how to represent chemical reactions as a collection of nodes and links. Figure 18.2b and c show two possible network representations of the reactions given in Figure 18.2a. In both cases, two metabolites (nodes) are connected by a link if they participate as a substrate/product pair in a reaction. When the directionality of the links is kept (Figure 18.2c) as dictated by the reaction representation (Figure 18.2a), the connectivity structure of the network differs from that of the undirected version, despite the seeming similarity of Figure 18.2b and c. This fact is clearly reflected in the list of possible paths connecting two nodes. Node B (Figure 18.2c) only has outgoing links, while node A has both incoming and outgoing links. While it is possible to find a directed path from B to A (B to D to A), no directed path


Figure 18.2 Network representation of cellular metabolism. (a) Example metabolic reaction set. (b) Generating network by introducing an undirected link between two metabolites (nodes) if they are a substrate and a product pair of a metabolic reaction. (c) Same as (b), while using a directed link.

exists from A to B. In general terms, the strongly connected component consists of node pairs that can be connected by paths in both directions; the out-component consists of nodes that can be reached from the strongly connected component (but not the other way), and the in-component consists of nodes that can reach the strongly connected component (but cannot be reached from it). There is a clear functional distinction between metabolites classified in the different components. For instance, nutrient sources only have outgoing links and are part of a metabolic network’s in-component. The representation of various biological systems as networks has revealed surprising similarities, many of which are intimately tied to power laws. The simplest network measure is the average number of nearest neighbors of a node, or the average connectivity $\langle k \rangle$. However, this is a rather crude property, and to gain further insight into the topological organization of real networks, we need to determine the variation in the number of nearest neighbors, given by the connectivity distribution $P(k)$. For a surprisingly large number of networks, this distribution is well characterized by a power law functional form (Barabasi and Albert, 1999). Not surprisingly, metabolic networks are also characterized by a heavy-tailed connectivity distribution (Jeong et al., 2000; Wagner and Fell, 2001). Figure 18.3 shows the connectivity distribution of Escherichia coli metabolism for two representation schemes of the chemical reaction network. If the connectivity distribution were instead single peaked (e.g., Poisson or Gaussian), the majority of the nodes would be well described by the average connectivity and we could with reason talk about a “typical” node of the network. This

Figure 18.3 Connectivity distribution of E. coli metabolic network. (a) Network resulting from algorithm in Fig. 18.2b, and (b) network resulting from algorithm in Fig. 18.2c.


is very different for networks with a power law connectivity distribution; the majority of the nodes only have a few neighbors, while many nodes have hundreds and some even thousands of neighbors. Although average node connectivity values can be calculated for these networks since their size is finite, these values are not representative of a typical node. For this reason, these networks are often referred to as “scale-free”. The “small-world” property of complex networks has received a significant amount of attention (Watts and Strogatz, 1998). Simply stated, the distance between two nodes can be measured as the smallest number of links that must be traversed in order to go from one to the other. In small-world networks, a property now recognized as typical of most complex networks (Albert and Barabasi, 2002; Newman, 2003), the average distance $\ell$ between the nodes scales as $\ell \sim \log(N)$ or smaller, where $N$ is the number of nodes. This is also the case for metabolic networks, possibly to allow the metabolic networks to rapidly sense and respond to perturbations (Wagner and Fell, 2001). A measure that gives insight into the local structure of a network is the so-called clustering $C$ of a node: the degree to which the neighborhood of a node resembles a complete subgraph (Watts and Strogatz, 1998), that is, the degree to which a node’s neighbors are connected among themselves. For a node that is part of a fully interlinked cluster $C = 1$, while $C = 0$ for a node where none of its neighbors are interconnected. Accordingly, the overall clustering coefficient of a network quantifies its potential modularity. Biological networks are expected to be fundamentally modular, meaning that the network can be seamlessly partitioned into a collection of modules where each module performs an identifiable task, separable from the function(s) of other modules (Hartwell et al., 1999). 
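The topological measures introduced so far (directed reachability, the strongly connected component, shortest-path distance, and node clustering) can be sketched on a toy graph. The link list below is hypothetical and is not the reaction set of Figure 18.2a; it was only chosen so that node B has outgoing links alone and a directed path B to D to A exists. Function and variable names are likewise illustrative.

```python
from collections import deque

# Hypothetical substrate -> product links (NOT the reactions of Figure 18.2a).
directed_links = [("B", "D"), ("D", "A"), ("A", "C"), ("C", "D"), ("C", "E")]

def adjacency(links, directed):
    """Build a node -> set-of-neighbours map from a list of links."""
    adj = {}
    for u, v in links:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set())
        if not directed:
            adj[v].add(u)
    return adj

def distances_from(adj, source):
    """Breadth-first search: links on the shortest path to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def clustering(adj, node):
    """Fraction of pairs of neighbours that are themselves linked (undirected)."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    linked = sum(1 for i in range(k) for j in range(i + 1, k)
                 if nbrs[j] in adj[nbrs[i]])
    return 2.0 * linked / (k * (k - 1))

directed = adjacency(directed_links, directed=True)
undirected = adjacency(directed_links, directed=False)

# Directed reachability: B can reach A, but A cannot reach B.
b_reaches_a = "A" in distances_from(directed, "B")
a_reaches_b = "B" in distances_from(directed, "A")

# Strongly connected component of A: nodes reachable from A that can reach A.
scc_of_a = {n for n in distances_from(directed, "A")
            if "A" in distances_from(directed, n)}

# Average distance over all ordered node pairs of the undirected graph.
pairs = [(u, v) for u in undirected for v in undirected if u != v]
mean_distance = sum(distances_from(undirected, u)[v] for u, v in pairs) / len(pairs)

print(b_reaches_a, a_reaches_b, sorted(scc_of_a), mean_distance)
print({node: round(clustering(undirected, node), 2) for node in sorted(undirected)})
```

In this toy graph B plays the role of a nutrient source in the in-component, and E sits in the out-component, since it can be reached from the strongly connected component but has no path back.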
By studying the average clustering of nodes with a given connectivity k, information about the actual modular organization of a metabolic network can be extracted (Ravasz et al., 2002): For all metabolic networks available, the average clustering follows a power law of the form $C(k) \sim k^{-\alpha}$, suggesting the existence of a hierarchy of nodes with different degrees of modularity (as measured by the clustering coefficient) overlapping in an iterative manner (Ravasz et al., 2002). Thus, a module consists of smaller modules that are more tightly knit together (higher C). This picture agrees very well with the assertion that functional modules in biological networks are by no means rigid; a particular module component may also take an active part in another module (Hartwell et al., 1999), with temporal separation of module activity ensuring separation of function. Expanding the discussion of metabolic network properties to include information from temporal gene expression patterns, as well as changes in gene expression induced by different stressors, it is possible to identify parts of the metabolic network in which transcriptional responses are highly correlated (Patil and Nielsen, 2005). This is achieved by adding up the changes in transcriptional activity among all enzymes that can modify a given metabolite. Metabolites for which this collective activity is statistically significant are termed “reporter metabolites,” thus representing sites in the metabolic network with high regulation activity (Patil and Nielsen, 2005). Analyzing the metabolic network of the yeast S. cerevisiae in concert with mRNA expression data that capture metabolic response to gene deletions and growth on different carbon sources, several metabolites and enzyme networks were uncovered as working under tight transcriptional control. 
In particular, when analyzing the transcriptional response to the deletion of GDH1, which encodes glutamate dehydrogenase (NADPH dependent), the algorithm for uncovering reporter metabolites identified glucose-6-phosphate, fructose-6-phosphate, and sedoheptulose-7-phosphate to be among them. When studying a map of the metabolic capabilities of S. cerevisiae, it becomes apparent that the identified sugars are located at branching points between the Embden–Meyerhof–Parnas and the pentose phosphate pathways


Figure 18.4 Distribution of metabolic reaction fluxes in the yeast S. cerevisiae under (a) aerobic, glucose-limited conditions and (b) aerobic, acetate-limited conditions using flux balance analysis simulations.

(Patil and Nielsen, 2005). This is understandable from the following observations. Subsequent to the deletion of GDH1, there is a signiﬁcant reduction of NADPH consumption, and thus, a reduced need for using the pentose phosphate pathway. This promising method highlights a relatively new direction in the study of metabolic network structure and its properties that likely will be highly successful in uncovering global principles of the dynamic organization and control of metabolic networks. To use the words of Patil and Nielsen: “Although the regulatory network structure deﬁnes the details of how the transcriptional regulatory program is executed, the metabolic network itself seems to guide this machinery, which we see as the consequence of the fact that metabolic regulation has been designed and evolved for and around the metabolites” (Patil and Nielsen, 2005).
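The idea of scoring a metabolite by the combined transcriptional change of the enzymes acting on it can be sketched as follows. This is a simplified illustration and not the published algorithm: it assumes each enzyme already has a z-score for its expression change, combines them by dividing the sum by the square root of the number of enzymes (a common way of aggregating z-scores, used here as an assumption), and omits any background correction or significance test. All identifiers and numbers are hypothetical.

```python
import math

# Hypothetical per-enzyme z-scores for the expression change between two conditions.
enzyme_z = {"E1": 2.5, "E2": 1.8, "E3": 0.1, "E4": -0.3, "E5": 2.1}

# Hypothetical map: metabolite -> enzymes that produce or consume it.
metabolite_enzymes = {
    "M1": ["E1", "E2", "E5"],
    "M2": ["E3", "E4"],
    "M3": ["E2", "E3"],
}

def metabolite_score(enzymes, z):
    """Aggregate the z-scores of the k neighbouring enzymes as sum(z) / sqrt(k)."""
    k = len(enzymes)
    return sum(z[e] for e in enzymes) / math.sqrt(k)

scores = {m: metabolite_score(es, enzyme_z) for m, es in metabolite_enzymes.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
for m in ranked:
    print(m, round(scores[m], 2))
```

Metabolites at the top of such a ranking would be the candidates for sites of concentrated transcriptional regulation; deciding which scores are statistically significant requires a null model that this sketch does not include.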

18.3 NETWORK MODELS FOR METABOLIC EVOLUTION

A surprisingly simple and general network growth model by Barabasi and Albert (BA) is able to describe the appearance of power law connectivity distributions by appealing to two key mechanisms (Barabasi and Albert, 1999). First, it assumes that networks grow through the addition of new nodes linking to nodes already present in the system. Second, there is a higher probability for a new node to link to an already existing node that has a large number of connections, a property called preferential attachment. These two principles are implemented as follows: starting from a small core graph consisting of $m_0$ nodes, a new node with $m$ links is added at each time step and connected to the already existing set of nodes. Each of the $m$ new links is then preferentially attached to a node $i$ (with $k_i$ neighbors) with probability $\Pi_i = k_i / \sum_j k_j$, which is linear in the target node connectivity. The simultaneous combination of these two network growth rules gives rise to power law connectivity distributions (Barabasi and Albert, 1999). Compared to traditional random networks, the probability that a node is highly connected is statistically significant in scale-free networks. Consequently, many network properties are determined by a relatively small number of highly connected nodes, often called “hubs”.
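The two growth rules can be sketched in a few lines. The version below is a minimal illustration under stated assumptions: the core graph is taken to be fully connected, the targets of a new node are forced to be distinct, and the function names and default values are hypothetical choices rather than part of the original model description.

```python
import random

def barabasi_albert(n_nodes, m, m0=None, seed=0):
    """Grow a network by preferential attachment and return its link list."""
    rng = random.Random(seed)
    m0 = m0 if m0 is not None else m + 1
    # Core graph: all pairs among the first m0 nodes are linked (an assumption).
    edges = [(i, j) for i in range(m0) for j in range(i + 1, m0)]
    # Every node appears in `stubs` once per link end, so a uniform draw from
    # `stubs` picks node i with probability k_i / sum_j k_j.
    stubs = [node for edge in edges for node in edge]
    for new in range(m0, n_nodes):
        targets = set()
        while len(targets) < m:          # m distinct targets (an assumption)
            targets.add(rng.choice(stubs))
        for t in targets:
            edges.append((new, t))
            stubs.extend((new, t))
    return edges

def degrees(edges):
    """Count the number of links attached to each node."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

edges = barabasi_albert(2000, m=2)
deg = degrees(edges)
mean_degree = 2 * len(edges) / len(deg)
print("mean degree:", round(mean_degree, 2), "largest hub:", max(deg.values()))
```

Keeping a list with one entry per link end is a common shortcut: it implements degree-proportional sampling without recomputing the normalizing sum at every step. The gap between the mean degree and the largest hub is the signature of the heavy-tailed distribution discussed above.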


By simple modifications to this model, it is possible to develop growing network schemes that generate networks for which the power law connectivity distribution has tunable slope and exponential cutoff (Albert and Barabasi, 2002; Newman, 2003). However, the question of interest for biological networks is what local growth mechanisms may give rise to the observed network topologies, as the preferential attachment rule can be considered a global constraint. Multiple alternative processes exist that give rise to power law connectivity distributions (Newman, 2005). These local growth rules are typically based on gene duplication (addition of nodes) and gene diversification (loss and/or addition/rewiring of links), all giving rise to scale-free connectivity distributions (Albert and Barabasi, 2002; Newman, 2003). In particular, among the simplest models is one that includes the following three steps (Pastor-Satorras et al., 2003): (i) random selection of a gene for duplication (together with its existing links); (ii) a fraction $\delta$ of the duplicate links are randomly removed; and (iii) new links are introduced to existing nodes with probability $\alpha$. Note that the duplication of multiple genes (or even of the whole genome) is not included in the original formulation of this model. It is possible to directly estimate the evolutionary rates of link addition and removal, as well as those of node duplication, from empirical data (Wagner, 2001). Focusing on the yeast protein interaction network, two empirical studies (Eisenberg and Levanon, 2003; Wagner, 2003b) clearly support the hypothesis that local network growth rules give rise to linear preferential attachment, where highly connected proteins display an elevated rate of interaction turnover. For appropriate choices of the parameters $\alpha$ and $\delta$, this model produces a connectivity distribution, an average clustering coefficient, and average path lengths that are strikingly similar to actual network measurements (Pastor-Satorras et al., 2003). 
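The three duplication and divergence steps can be sketched as below. Several details are assumptions made for illustration: the starting graph is two linked nodes, each inherited link is dropped independently with probability `delta`, and a new link to each remaining existing node is created independently with probability `alpha` (the original formulation may scale this probability differently). The function name and parameter values are hypothetical.

```python
import random

def duplication_divergence(n_nodes, delta, alpha, seed=0):
    """Grow an undirected network by node duplication followed by link loss and gain."""
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}                      # starting graph (an assumption)
    while len(adj) < n_nodes:
        new = len(adj)
        parent = rng.randrange(new)             # (i) pick a node to duplicate
        # (ii) each inherited link is removed with probability delta
        kept = {nb for nb in adj[parent] if rng.random() >= delta}
        # (iii) new links to other existing nodes appear with probability alpha
        extra = {n for n in adj if n not in kept and rng.random() < alpha}
        adj[new] = set()
        for nb in kept | extra:
            adj[new].add(nb)
            adj[nb].add(new)
    return adj

def n_links(adj):
    """Number of undirected links in the adjacency map."""
    return sum(len(nbrs) for nbrs in adj.values()) // 2

network = duplication_divergence(500, delta=0.5, alpha=0.005)
largest = max(len(nbrs) for nbrs in network.values())
print("nodes:", len(network), "links:", n_links(network), "largest degree:", largest)
```

No node ever consults the global degree distribution here. A highly connected node is simply more likely to be a neighbor of whichever node is duplicated, and that is how a purely local rule can produce effective preferential attachment.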
At this point, it is important to emphasize that metabolic networks are different from protein interaction networks, of which metabolic enzymes are only a part, in several respects. Perhaps most importantly, metabolism is a mass transfer network where each step in a pathway modifies the metabolites to some degree. Thus, acceptable evolutionary dynamics for the enzyme subset of the protein interaction network are constrained by the need to maintain chemical reaction function and thereby avoid the loss of necessary metabolic functions. A separate investigation of the enzyme subset of the protein interaction networks of E. coli and S. cerevisiae (Huthmacher et al., 2007) found that adjacent metabolic enzymes (i.e., enzymes that catalyze successive chemical reactions) show a significantly elevated probability of interaction. The authors hypothesize that metabolic channeling, the intercomplex transfer of reaction intermediates to increase chemical reaction efficiency, is an important functional constraint on the evolution of metabolic enzymes (Huthmacher et al., 2007). The observation that enzymes sharing metabolites are significantly overrepresented among homologous protein pairs (Alves et al., 2002; Light and Kraulis, 2004) further emphasizes the importance of explicitly accounting for metabolic constraints on protein evolution. Thus, the challenge is to develop evolutionary models of metabolism that not only generate a protein (enzyme) interaction network with the above noted large-scale properties, but also satisfy the metabolism-specific constraints.
In the following, we will highlight a recent model for the large-scale evolution of metabolic networks that shows great promise. Based on a scenario of metabolic evolution (Kacser and Beeby, 1984) that includes the preexisting ingredients of a cellular membrane, extraction of metabolites from the environment, and a set of genes coding for an initial metabolism, Pfeiffer et al. (2005) developed a model for the evolution of group transfer metabolic networks. A schematic of the model dynamics is given in Figure 18.5. The metabolic dynamics are assumed to be well described by linear kinetics for the group transfer to an enzyme, thus giving rise to
Michaelis–Menten-like kinetics for the transfer from donor to receptor molecule. A fraction of the metabolites are randomly selected to contribute to the formation of biomass. Evolutionary dynamics are implemented by changing the kinetic rate of an enzyme's group transfer capability, either by increasing one transfer rate while decreasing the others, or by decreasing one while increasing the others (Figure 18.5c). The model also allows for gene duplications and removals that change the enzyme dosage (Figure 18.5d). Evolutionary selection is modeled as a steepest descent process (Pfeiffer et al., 2005): for each evolutionary step, all possible mutations are evaluated and only the one that maximizes the steady-state metabolic growth rate of the network is selected. Networks emerging from the simulations are relatively small in size, making a direct comparison with known microbial metabolic networks difficult. However, when the emerging metabolic networks are compared with subnetworks of similar size sampled from actual metabolic networks (Pfeiffer et al., 2005), the properties of the model-generated topologies are found to be consistent with them. When the resulting connectivity distributions are averaged over multiple simulations, the emerging connectivity distribution is consistent with a power law with slope 2.6 (Pfeiffer et al., 2005). The authors also note that while the initial set of enzymes are all generalists, that is, capable of facilitating all transfer reactions, the evolutionary dynamics select for increasing enzyme specificity, and the majority of the enzymes become fully specialized to the transfer of a single biochemical group (Pfeiffer et al., 2005). However, a small fraction of the enzymes have not fully specialized and are able to catalyze the transfer of more than one biochemical group. These reactions are characterized by both low gene dosage and low impact on biomass formation (Pfeiffer et al., 2005).
Figure 18.5 Schematic of the growing metabolic network model of Pfeiffer et al. (2005). (a) Binary representation of a molecule in terms of seven different biochemical groups. A value of 1 indicates the presence of a group. (b) Transfer of group 4 from donor to receptor molecule in two steps by enzyme E4: first to the enzyme (E4*), and then from the enzyme to the receptor. Model evolutionary dynamics: (c) two possible mutations changing the rate of transfer reactions (indicated by link thickness) of an enzyme E (node) to generate an enzyme E′ that participates in the transfer of four groups; (d) gene (node) duplication or loss, changing the enzyme (node) dosage. Two nodes are connected if they can transfer the same biochemical group. Evolutionary selection is implemented by only choosing mutations that facilitate maximal possible biomass production.

A typical trait of the evolved metabolic networks is that some of the metabolites have come to dominate the transfer of particular biochemical groups, with especially high group transfer potentials. This suggests that there is an implicit "rich-gets-richer" principle in the model dynamics allowing for the emergence of specialized molecules that participate in almost every reaction in which a particular group is transferred (Pfeiffer et al., 2005). It is interesting to speculate whether the resulting metabolic network topologies would bear similar characteristics if the evolutionary selection principle were different from maximization of growth. As will be discussed in the next section, there is now much evidence suggesting that other cellular metabolic optimization principles may also be active, depending upon the extracellular environment. Finally, we suggest that a combination of this modeling approach with that of genetic algorithms (see, for example, Mitchell, 1998) would possibly allow for the simulation of much larger metabolic networks.
In conjunction with this model, it is interesting to discuss an analysis of how the microbial ability to use oxygen as a high-potential redox couple, in response to intense selection pressure in the ancient biosphere, has affected the topology of metabolic networks (Raymond and Segre, 2006). Since the focus of this study is not on the metabolic network of individual microbes, but rather on the collection of all known pathways in the KEGG (Kyoto Encyclopedia of Genes and Genomes) database (Kanehisa and Goto, 2000), it is possible to draw general conclusions about the impact on enzyme evolution and cellular metabolism. Simply stated, the authors compare topological properties, such as network size and connectivity, between networks attainable with and without oxygen. This is made possible by using a heuristic network expansion method (Ebenhoh et al., 2004): starting from a predetermined set of metabolites M0, all reactions for which every substrate is available are executed, and the resulting metabolites are added to M0 to generate M1. This procedure is iterated until no new metabolites are produced. The resulting network of Mn metabolites contains all attainable pathways from the seed set M0 and is a reflection of available metabolic capabilities in KEGG.
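The network expansion procedure can be sketched as follows. This is a minimal illustration on a made-up reaction list, not the KEGG-scale analysis: the function name `expand`, the metabolite labels, and the four toy reactions are hypothetical.

```python
def expand(seed, reactions):
    """Iteratively expand a seed set of metabolites.

    reactions: list of (substrates, products) pairs, each a set of labels.
    Returns the final metabolite set and the list [M0, M1, ..., Mn].
    """
    available = set(seed)
    generations = [set(available)]
    while True:
        new = set()
        for substrates, products in reactions:
            # A reaction fires only if every substrate is available
            if substrates <= available:
                new |= products - available
        if not new:  # no new metabolites: expansion has converged
            return available, generations
        available |= new
        generations.append(set(available))

# Hypothetical toy reaction set; only the second reaction consumes oxygen
TOY_REACTIONS = [
    ({"A"}, {"B"}),
    ({"B", "O2"}, {"C"}),
    ({"C"}, {"D"}),
    ({"A", "B"}, {"E"}),
]

anoxic, anoxic_steps = expand({"A"}, TOY_REACTIONS)
oxic, oxic_steps = expand({"A", "O2"}, TOY_REACTIONS)
```

In this toy case the reaction producing D does not itself use oxygen, yet it only becomes reachable when oxygen is in the seed set, because its substrate C depends on an oxygen-consuming step.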
When the expanded metabolic networks resulting from seed sets with and without oxygen are compared, the former are clearly distinct in their properties from the latter (Raymond and Segre, 2006). In particular, the presence of oxygen is necessary to generate the largest and most complex metabolic networks. Furthermore, reactions that are included in an expanded network contingent on the presence of oxygen tend to be located at the periphery of the networks (Raymond and Segre, 2006). From the network expansion analysis, a plausible scenario emerges. While some of the metabolic pathways were retooled to accommodate the availability of oxygen, the majority of network change was caused by innovation of new metabolic pathways and enzymatic functions, and approximately 52% of these new reactions do not directly use oxygen (Raymond and Segre, 2006). The network expansion approach as applied to the KEGG database uncovers a hierarchical organization of metabolic network complexity related to the presence of biomolecules, such as NAD, coenzyme A, and ATP, in the metabolic seed set (Raymond and Segre, 2006), in agreement with a topological analysis (Ravasz et al., 2002). This should be no surprise, as these molecules typically are hubs in microbial metabolic networks around which hierarchical modularity is organized (Ravasz et al., 2002). This observation further agrees with the above discussed results from metabolic growth modeling (Pfeiffer et al., 2005), where it was observed that the transfer of biochemical groups is very important for the emergence of network hubs. Finally, the analysis of Raymond and Segre (2006) strongly suggests that adaptation to molecular oxygen occurred independently after the three domains of life evolved from the last common ancestor organism. A multidimensional analysis of the oxic and the anoxic networks from 44 different genomes shows that the anoxic networks clearly separate into the categories of archaea, bacteria, and eukaryotes, in agreement with genome-based phylogenetic studies (Raymond and Segre, 2006). However, the oxic networks are mostly inconsistent with traditional phylogenetic categorization.

18.4 DYNAMIC MODELS OF GENOME-LEVEL METABOLIC FUNCTION

Up to this point, we have mostly considered cellular metabolism as a static entity for which functional phenotypes were determined by the network topology. In the introduction, we alluded to the fact that although a metabolic capability is encoded in a genome, it may not be activated until the appropriate environmental cues are present. Thus, a purely topological investigation of metabolism will tend to overestimate a cell's metabolic flexibility by assuming that all pathways are always available.
Flux balance analysis (FBA) is a computational method based on linear programming that takes as input a metabolic network in the form of a list of chemical reactions, a set of constraints on these reactions (for example, maximal capacity), and an objective function that the cell is assumed to either maximize or minimize. If the linear problem is feasible, FBA will return predicted metabolic reaction turnover rates, or fluxes, corresponding to an optimal utilization of the metabolic network with respect to the chosen goal function. FBA thus provides a tool to extend complex network analysis of metabolic networks from a purely topological approach to also consider dynamic effects. The flux balance method is based on three basic assumptions that can briefly be stated as follows: (i) cellular metabolism is in a steady state, (ii) mass is conserved, and (iii) metabolic reaction fluxes optimize a goal function on the network (Kauffman et al., 2003). The first hypothesis is motivated by the dire lack of information on kinetic parameters inside a cell, and by the observation that metabolic dynamics tend to equilibrate on a timescale of seconds to minutes. By assuming a steady state, it is possible to formulate the problem without appealing to time derivatives, consequently avoiding kinetic parameters altogether. The trade-off is that the problem has now become underdetermined, as there are typically many more fluxes (variables) than constraints.
The third hypothesis breaks the stalemate by allowing the question to be formulated as an optimization problem, allowing us to investigate possible evolutionary principles that have shaped the network structure. Applying principles of optimization to evaluate and understand metabolic network performance provides a connection with Darwin's theory of natural selection, since the more efficiently a metabolic network is able to utilize nutrients, the faster the cell may duplicate. However, we do not invoke optimization to test whether an organism is optimal. Rather, "it is the assumptions of optimality that are tested. The failure to find support for a prediction can be used to determine whether an assumption is wrong" (Sutherland, 2005). Thus, approaches such as FBA allow for testing assumptions of optimal network dynamics to achieve a particular cellular objective, subject to a given metabolic network topology. FBA can also indicate the potential impact of additions or deletions on metabolic capability, that is, network topology. In traditional implementations of FBA, the chosen optimality principle is cellular growth rate, as evolutionary pressures through the competition for resources select for fast-growing organisms in many natural and laboratory situations. However, alternative objective functions, such as maximal ATP production or minimization of flux through a given pathway, have also been investigated (Kauffman et al., 2003; Price et al., 2004; Schuetz et al., 2007; Schuster et al., 2007). When cells are growing under conditions with strong nutrient limitations, the optimal growth assumption agrees well with experimental results, while in situations of virtually no nutrient limitations, cellular metabolism appears to maximize ATP production per flux unit (Schuetz et al., 2007). Thus, a repertoire of cellular metabolic strategies as a function of nutrient availability seems to have evolved (Nielsen, 2007).

To formulate the flux calculation problem more succinctly, the mass conservation constraint is expressed as

$$\frac{dX_i}{dt} = \sum_j S_{ij}\,\nu_j, \tag{18.1}$$

where $X_i$ is the concentration of metabolite $i$, $\nu_j$ is the unknown flux rate of reaction $j$, and $S_{ij}$ is the stoichiometric coefficient of metabolite $i$ in reaction $j$. In general, the stoichiometric matrix $S$ is of size $m \times n$, where $m$ is the number of metabolites and $n$ the number of different reactions. For example, the stoichiometric coefficients associated with reaction R1 in Figure 18.2a are $S_{A,R1} = S_{B,R1} = -1$ and $S_{C,R1} = 1$. Applying the steady-state approximation to Equation 18.1 by setting all time derivatives to zero, the linear problem is typically underdetermined as there are usually more metabolic reactions than metabolites, $n > m$. The FBA approach can be formulated as a standard form linear program:

$$\min \left\{ c^{T}\nu \;:\; S\nu = b,\ \nu \ge 0 \right\}, \tag{18.2}$$

where $c, \nu \in \mathbb{R}^{n}$ and $b \in \mathbb{R}^{m}$. The vector $c$ corresponds to the goal function, and the vector $b$ represents the flux constraints, including the environmental availability of chemicals and nutrients (Kauffman et al., 2003; Price et al., 2004).
There is ample experimental evidence that the FBA methodology is biologically relevant. For instance, it is possible to simulate gene knockouts by computationally removing genes with products (enzymes) that participate in metabolic processes. For a genome-level model of, for example, E. coli metabolism, the predicted growth phenotypes (characterized as either "viable" or "lethal") agree with experimental results in 86% of the tested cases (Edwards and Palsson, 2000). It is also possible to experimentally determine transport fluxes (uptake or production of metabolites and chemicals from the environment) using batch cultivation of the organism (Edwards et al., 2001) or a chemostat. For batch growth of E. coli, a direct comparison of predicted and measured uptake flux values of acetate and oxygen, as well as cellular growth rates, had an average error of only 5.8%, while using succinate as nutrient source instead of acetate resulted in an average error of 10.7% (Edwards et al., 2001). Additionally, the network-wide distribution of metabolic fluxes is found to have a heavy-tailed behavior, consistent with experimentally measured fluxes in central carbon metabolism for E. coli (Almaas et al., 2004). Figure 18.4 shows the flux distribution resulting from FBA simulations of the yeast S. cerevisiae under aerobic glucose-limited (Figure 18.4a) and acetate-limited (Figure 18.4b) conditions. It is interesting to note that despite the different carbon sources, and thus significant difference in metabolic pathway utilization, the shape of the resulting flux distributions is nearly unchanged. This is possibly related to the robust function of cellular metabolism (Wagner, 2005).
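As a hedged illustration of the standard form linear program in Equation 18.2, consider a hypothetical four-reaction toy network that is not taken from the chapter or from any genome-level model: uptake of A, conversion of A to B, export of B as "biomass," and secretion of A as a by-product. The uptake capacity of 10 flux units and the slack variable $s$, which turns the capacity bound into an equality and adds one row to $S$ and $b$, are assumptions made only for this sketch.

```latex
% Toy network (hypothetical): R1: -> A,  R2: A -> B,  R3: B -> biomass,  R4: A -> by-product
% Variables: \nu = (\nu_1, \nu_2, \nu_3, \nu_4, s)^T, with slack s for the bound \nu_1 \le 10
\begin{align*}
S &= \begin{pmatrix}
1 & -1 &  0 & -1 & 0 \\  % metabolite A
0 &  1 & -1 &  0 & 0 \\  % metabolite B
1 &  0 &  0 &  0 & 1     % capacity row: \nu_1 + s = 10
\end{pmatrix}, &
b &= \begin{pmatrix} 0 \\ 0 \\ 10 \end{pmatrix}, &
c &= (0, 0, -1, 0, 0)^{T}.
\end{align*}
% Steady state gives \nu_2 = \nu_3 and \nu_1 = \nu_2 + \nu_4, so
\begin{equation*}
\nu_3 = \nu_1 - \nu_4 = 10 - s - \nu_4 \le 10,
\end{equation*}
% and minimizing c^T \nu = -\nu_3 subject to \nu \ge 0 yields
\begin{equation*}
\nu^{*} = (10, 10, 10, 0, 0)^{T}, \qquad c^{T}\nu^{*} = -10.
\end{equation*}
```

Maximizing the biomass flux is written as minimizing its negative, which is why $c$ carries $-1$ in the third position. At the optimum, all uptake is routed into biomass and the by-product reaction carries no flux.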
Since the goal function of choice for the FBA method is maximal cellular growth rate, it is important to be aware of the underlying assumptions in making this choice. Arguably, the three most important assumptions are that (i) the organism is functioning optimally, (ii) it is in the exponential growth phase (the cellular population doubles at regular intervals), and (iii) the environment is not changing. Carefully conducted experiments on wild-type and mutant E. coli cells kept in the exponential growth phase with a stable nutrient source have demonstrated that both wild-type and mutant cellular populations were initially operating in a suboptimal metabolic state relative to the given nutrient conditions (Ibarra et al., 2002; Fong and Palsson, 2004). However, after undergoing adaptive evolution
over a few hundred generations while keeping the nutrient environment stable, the end point cellular populations were well characterized by FBA using maximal growth as the goal function (Ibarra et al., 2002; Fong and Palsson, 2004). Since no dramatic shifts were observed, and the ability of the FBA approach to predict the outcome increased with the cellular population's competitive adaptation to the stable nutrient environment, it is reasonable to assume that the evolutionary optimization of cellular performance did not include the invention of novel metabolic capabilities or pathways. Instead, it consisted of a fine-tuning of the gene regulation pathways to channel the metabolic activity more efficiently along the pathways that gave rise to higher growth rates. Thus, natural selection was able to achieve optimal dynamic network utilization for this environment after a relatively short time. Consequently, the wild-type E. coli metabolism was not performing optimally under the conditions with fixed nutrient availability. The observation of nonoptimality was experimentally tested for Bacillus subtilis, leading to the suggestion that a gene regulatory apparatus has evolved that allows for rapid metabolic response to changing nutrient environments, at the cost of optimal growth (Fischer and Sauer, 2005).
The idea of metabolic adaptation to optimize response to variations in the extracellular nutrient environment has found support in a recent experimental study on the GAL network (Figure 18.1) in yeast (Bennett et al., 2008). Here, the response of the GAL network was probed by varying extracellular glucose concentration in a constant galactose background. The cytosolic presence of glucose will inhibit the activity of the galactose transporter GAL2p, as well as the transcription of genes GAL1, GAL3, and GAL4 (not included in Figure 18.1). Thus, variations in glucose concentration will regulate the galactose uptake and degradation activity.
By probing the system using low- and high-frequency oscillations in the concentration of available glucose, the authors showed that the system acts as a low-pass filter (Bennett et al., 2008). In other words, high-frequency variations in glucose concentration were effectively ignored, while low-frequency variations would significantly affect the galactose uptake and degradation activities, suggesting that cellular metabolic response is optimized to respond robustly to slowly varying nutrient conditions (Bennett et al., 2008). An FBA investigation of the genome-level steady-state metabolic response of H. pylori, E. coli, and S. cerevisiae to a large number of different environmental conditions found that the robust and optimal metabolic response necessitated the consistent activity of a core set of metabolic reactions (Almaas et al., 2005). Furthermore, this core set of reactions is significantly enriched in phenotypically essential and evolutionarily conserved reactions, and is encoded by genes whose mRNA expression profiles are significantly correlated (Almaas et al., 2005). The metabolic plasticity to environmental changes is likely also related to the observation that approximately 80% of the genes in yeast appear irrelevant for viability when tested under a small set of nutrient-rich conditions. A systematic investigation of gene knockout phenotypes in a wide range of nutrient environments using FBA identified that up to 68% of the apparently dispensable genes were essential in some environments (Papp et al., 2004). In a similar study (Harrison et al., 2007), FBA was used to investigate the nutrient condition dependence of synthetically lethal gene knockouts in S. cerevisiae, finding that the majority of synthetically lethal gene interactions were limited to a narrow range of nutrient conditions, in agreement with random expectation.
Thus, the functional redundancy and robustness of metabolic networks as quantified through epistatic compensation is possibly a direct by-product of natural selection for survival in environments where nutrient availability may vary widely (Harrison et al., 2007). Since the extracellular nutrient conditions so clearly influence metabolic performance and evolved (possibly optimal) microbial strategies, the use of FBA and related approaches to infer possible paths of metabolic evolution is enticing. A promising approach for investigating reductive evolution was recently introduced by Pal et al. (2006). Buchnera aphidicola and Wigglesworthia glossinidia are both endosymbiotic organisms that are close evolutionary relatives of E. coli. During their evolutionary divergence from E. coli, B. aphidicola and W. glossinidia have shed genes and metabolic pathways that were nonessential in the context of the nutrient environment provided by their hosts. The method for simulating reductive gene loss is intriguingly simple (Pal et al., 2006): (1) Begin from a genome-level metabolic model of choice and implement environmental and nutrient constraints that match the host. (2) Randomly select a gene and test the fitness of the knockout mutant. (3) If the value of the objective function is nearly unchanged, remove the gene from the model. (4) Iterate steps (2) and (3) until no further gene reductions are available. Note that, due to the significant level of metabolic network redundancy, the repeated application of steps (1)–(4) will likely give rise to minimal metabolic models that differ in gene content. Applying this framework while using E. coli as the starting point, the accuracy of the determined minimal genomes was 80% for B. aphidicola and 76% for W. glossinidia, as compared to the random expectation of 50% (Pal et al., 2006). Furthermore, all the simulated minimal genomes for B. aphidicola (W. glossinidia) had 88% (82%) of their reactions in common, suggesting that the observed variation in gene content from one strain to another is caused by differences in selection pressures as well as random gene losses (Pal et al., 2006).
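The gene-loss loop in steps (1) to (4) can be sketched as below. This is a toy illustration only, with one loudly stated substitution: instead of an FBA objective on a genome-level model, fitness is reduced to a yes/no viability test (can all target metabolites still be produced from the host-supplied nutrients?). The gene names, reactions, nutrient set, and function names are all hypothetical.

```python
import random

def producible(nutrients, reactions):
    """Return every metabolite reachable from the nutrients via the reactions."""
    available = set(nutrients)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions:
            if substrates <= available and not products <= available:
                available |= products
                changed = True
    return available

def viable(genes, gene_reactions, nutrients, targets):
    """Toy stand-in for the FBA fitness test: are all targets still producible?"""
    reactions = [gene_reactions[g] for g in genes]
    return targets <= producible(nutrients, reactions)

def reduce_genome(gene_reactions, nutrients, targets, seed=0):
    """Randomly delete dispensable genes until no further deletion is tolerated."""
    rng = random.Random(seed)
    genes = set(gene_reactions)
    while True:
        # Genes whose individual loss leaves the current model viable
        removable = [g for g in sorted(genes)
                     if viable(genes - {g}, gene_reactions, nutrients, targets)]
        if not removable:
            return genes
        genes.remove(rng.choice(removable))

# Hypothetical model: gene -> (substrates, products) of the reaction it encodes
GENE_REACTIONS = {
    "g1": ({"glc"}, {"pyr"}),
    "g2": ({"pyr"}, {"aa"}),
    "g3": ({"glc"}, {"aa"}),    # redundant route to "aa"
    "g4": ({"pyr"}, {"lipid"}),
    "g5": ({"aa"}, {"vit"}),    # dispensable: the host supplies "vit"
}
HOST_NUTRIENTS = {"glc", "vit"}
TARGETS = {"aa", "lipid", "vit"}

minimal_genomes = [reduce_genome(GENE_REACTIONS, HOST_NUTRIENTS, TARGETS, seed=s)
                   for s in range(5)]
```

Because `g2` and `g3` are redundant in this toy model, different random deletion orders can end in minimal gene sets that differ in content, which mirrors the remark above about network redundancy.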

REFERENCES

ACAR, M., BECSKEI, A., and VAN OUDENAARDEN, A., 2005. Enhancement of cellular memory by reducing stochastic transitions. Nature 435: 228–232.
ALBERT, R. and BARABASI, A.-L., 2002. Statistical mechanics of complex networks. Rev. Mod. Phys. 74: 47–97.
ALMAAS, E., KOVACS, B., VICSEK, T., OLTVAI, Z.N., and BARABASI, A.-L., 2004. Global organization of metabolic fluxes in the bacterium Escherichia coli. Nature 427: 839–843.
ALMAAS, E., OLTVAI, Z.N., and BARABASI, A.-L., 2005. The activity reaction core and plasticity of metabolic networks. PLoS Comput. Biol. 1 (7): e68.
ALON, U., 2003. Biological networks: the tinkerer as an engineer. Science 301: 1866–1867.
ALON, U., 2007. Network motifs: theory and experimental approaches. Nat. Rev. Genet. 8: 450–461.
ALVES, R., CHALEIL, R.A., and STERNBERG, M.J., 2002. Evolution of enzymes in metabolism: a network perspective. J. Mol. Biol. 320: 751–770.
BARABASI, A.-L. and ALBERT, R., 1999. Emergence of scaling in random networks. Science 286: 509–512.
BARABASI, A.-L. and OLTVAI, Z.N., 2004. Network biology: understanding the cell's functional organization. Nat. Rev. Genet. 5: 101–113.
BENNETT, M.R., PANG, W.L., OSTROFF, N.A., BAUMGARTNER, B.L., NAYAK, S., TSIMRING, L.S., and HASTY, J., 2008. Metabolic gene regulation in a dynamically changing environment. Nature 454: 1119–1122.
CORDERO, O.X. and HOGEWEG, P., 2006. Feed-forward loop circuits as a side effect of genome evolution. Mol. Biol. Evol. 23: 1931–1936.
EBENHOH, O., HANDORF, T., and HEINRICH, R., 2004. Structural analysis of expanding metabolic networks. Genome Inform. 15: 35–45.
EDWARDS, J.S. and PALSSON, B.O., 2000. The Escherichia coli MG1655 in silico metabolic genotype: its definition, characteristics, and capabilities. Proc. Natl. Acad. Sci. USA 97: 5528–5533.
EDWARDS, J.S., IBARRA, R.U., and PALSSON, B.O., 2001. In silico predictions of Escherichia coli metabolic capabilities are consistent with experimental data. Nat. Biotechnol. 19: 125–130.
EISENBERG, E. and LEVANON, E.Y., 2003. Preferential attachment in the protein network evolution. Phys. Rev. Lett. 91: 138701.
FISCHER, E. and SAUER, U., 2005. Large-scale in vivo flux analysis shows rigidity and suboptimal performance of Bacillus subtilis metabolism. Nat. Genet. 37: 636–640.
FONG, S.S. and PALSSON, B.O., 2004. Metabolic gene-deletion strains of Escherichia coli evolve to computationally predicted growth phenotypes. Nat. Genet. 36: 1056–1058.
HARRISON, R., PAPP, B., PAL, C., OLIVER, S.G., and DELNERI, D., 2007. Plasticity of genetic interactions in metabolic networks of yeast. Proc. Natl. Acad. Sci. USA 104: 2307–2312.
HARTWELL, L.H., HOPFIELD, J.J., LEIBLER, S., and MURRAY, A.W., 1999. From molecular to modular cell biology. Nature 402: C47.
HUTHMACHER, C., GILLE, C., and HOLZHUTTER, H.-G., 2007. A computational analysis of protein interaction in metabolic networks reveals novel enzyme pairs potentially involved in metabolic channeling. J. Theor. Biol. 252: 456–464.
IBARRA, R.U., EDWARDS, J.S., and PALSSON, B.O., 2002. Escherichia coli K-12 undergoes adaptive evolution to achieve in silico predicted optimal growth. Nature 420: 186–189.
JACOB, F., 1977. Evolution and tinkering. Science 196: 1161–1166.
JEONG, H., TOMBOR, B., ALBERT, R., OLTVAI, Z.N., and BARABASI, A.-L., 2000. The large-scale organization of metabolic networks. Nature 407: 651–654.
KACSER, H. and BEEBY, R., 1984. Evolution of catalytic proteins or on the origin of enzyme species by means of natural selection. J. Mol. Evol. 20: 38–51.
KANEHISA, M. and GOTO, S., 2000. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28: 27–30.
KAUFFMAN, K.J., PRAKASH, P., and EDWARDS, J.S., 2003. Advances in flux balance analysis. Curr. Opin. Biotechnol. 14: 491–496.
LIGHT, S. and KRAULIS, P., 2004. Network analysis of metabolic enzyme evolution in Escherichia coli. BMC Bioinf. 5: 15.
LYNCH, M., 2007. The evolution of genetic network by nonadaptive processes. Nat. Rev. Genet. 8: 803–813.
MAYR, E., 2001. What Evolution Is. Basic Books, New York.
MAZURIE, A., BOTTANI, S., and VERGASSOLA, M., 2005. An evolutionary and functional assessment of regulatory network motifs. Genome Biol. 6: R35.
MILO, R., SHEN-ORR, S.S., ITZKOVITZ, S., KASHTAN, N., CHKLOVSKII, D., and ALON, U., 2002. Network motifs: simple building blocks of complex networks. Science 298: 824–827.
MITCHELL, M., 1998. An Introduction to Genetic Algorithms. MIT Press.
NEWMAN, M.E.J., 2003. The structure and function of complex networks. SIAM Rev. 45: 167–256.
NEWMAN, M.E.J., 2005. Power laws, Pareto distributions and Zipf's law. Contemp. Phys. 46: 323–351.
NIELSEN, J., 2007. Principles of optimal metabolic network operation. Mol. Syst. Biol. 3: 126.
PAL, C., PAPP, B., LERCHER, M.J., CSERMELY, P., OLIVER, S.G., and HURST, L.D., 2006. Chance and necessity in the evolution of minimal metabolic networks. Nature 440: 667–670.
PAPP, B., PAL, C., and HURST, L.D., 2004. Metabolic network analysis of the causes and evolution of enzyme dispensability in yeast. Nature 429: 651–654.
PASTOR-SATORRAS, R., SMITH, E., and SOLE, R.V., 2003. Evolving protein interaction networks through gene duplication. J. Theor. Biol. 222: 199–210.
PATIL, K.R. and NIELSEN, J., 2005. Uncovering transcriptional regulation of metabolism by using metabolic network topology. Proc. Natl. Acad. Sci. USA 102: 2685–2689.
PFEIFFER, T., SOYER, O.S., and BONHOEFFER, S., 2005. The evolution of connectivity in metabolic networks. PLoS Biol. 3 (7): e228.
PRICE, N.D., REED, J.L., and PALSSON, B.O., 2004. Genome-scale models of microbial cells: evaluating the consequences of constraints. Nat. Rev. Microbiol. 2: 886–897.
RAMSEY, S.A., SMITH, J.J., ORRELL, D., MARELLI, M., PETERSON, T.W., de ATAURI, P., BOLOURI, H., and AITCHISON, J.D., 2006. Dual feedback loops in the GAL regulon suppress cellular heterogeneity in yeast. Nat. Genet. 38: 1082–1087.
RAVASZ, E., SOMERA, A.L., MONGRU, D.A., OLTVAI, Z.N., and BARABASI, A.-L., 2002. Hierarchical organization of modularity in metabolic networks. Science 297: 1551–1555.
RAYMOND, J. and SEGRE, D., 2006. The effect of oxygen on biochemical networks and the evolution of complex life. Science 311: 1764–1767.
SCHUETZ, R., KUEPFER, L., and SAUER, U., 2007. Systematic evaluation of objective functions for predicting intracellular fluxes in Escherichia coli. Mol. Syst. Biol. 3: 119.
SCHUSTER, S., PFEIFFER, T., and FELL, D.A., 2007. Is maximization of molar yield in metabolic networks favoured by evolution? J. Theor. Biol. 252: 497–504.
SHEN-ORR, S.S., MILO, R., MANGAN, S., and ALON, U., 2002. Network motifs in the transcriptional regulation network of Escherichia coli. Nat. Genet. 31: 64–68.
SOLE, R.V. and VALVERDE, S., 2006. Are network motifs the spandrels of cellular complexity? Trends Ecol. Evol. 21: 419–422.
SUTHERLAND, W.J., 2005. The best solution. Nature 435: 569.
WAGNER, A., 2001. The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. Mol. Biol. Evol. 18: 1283–1292.
WAGNER, A., 2003a. Does selection mold molecular networks? Sci. STKE, pe41.
WAGNER, A., 2003b. How the global structure of protein interaction networks evolves. Proc. R. Soc. Lond. B 270: 457–466.
WAGNER, A., 2005. Robustness and Evolvability in Living Systems. Princeton University Press, Princeton, NJ.
WAGNER, A. and FELL, D.A., 2001. The small world inside large metabolic networks. Proc. R. Soc. Lond. B 268: 1803–1810.
WATTS, D. and STROGATZ, S.H., 1998. Collective dynamics of 'small-world' networks. Nature 393: 440–442.
WEITZ, J.S., BENFEY, P.N., and WINGREEN, N.S., 2007. Evolution, interactions, and biological networks. PLoS Biol. 5 (1): e11.

Chapter 19

Single-Gene and Whole-Genome Duplications and the Evolution of Protein–Protein Interaction Networks

Grigoris Amoutzias and Yves Van de Peer

19.1 INTRODUCTION
19.2 EVOLUTION OF PINS
19.3 SINGLE-GENE DUPLICATIONS
19.4 WHOLE-GENOME DUPLICATIONS
19.5 DIPLOIDIZATION PHASE
19.6 DOSAGE BALANCE HYPOTHESIS
19.7 TYPES OF INTERACTIONS
19.8 WGDS, TRANSIENT INTERACTIONS, AND ORGANISMAL COMPLEXITY
19.9 STUDIES ON PPIS OF OHNOLOGUES
19.10 CONCERNS ABOUT THE METHODS OF ANALYSIS AND THE QUALITY OF THE DATA
19.11 THE IMPORTANCE OF MEDIUM-SCALE STUDIES: THE CASE OF DIMERIZATION
19.12 EVOLUTION OF DIMERIZATION NETWORKS
19.13 CONCLUSIONS
REFERENCES

19.1 INTRODUCTION

Proteins within a cell do not function in isolation, but instead physically interact with their molecular environment, either to transduce information from the external environment to the nucleus or to form multisubunit protein complexes that act as sophisticated molecular machines. Since the functionality of the cell depends on these physical interactions, it is no surprise that great effort is being made to catalogue the interactome of a genome, that is, to identify and describe all the protein–protein interactions (PPIs) in which each protein participates. In this era of “omics” technologies and systems biology, researchers try to treat the interactome of a given organism holistically and to reconstruct protein–protein interaction networks (PINs) using graph theory (Barabasi and Oltvai, 2004). In such networks, proteins are represented as nodes in a graph, and edges connect nodes that physically interact or that participate in the same complex. There is a significant amount of work on the principles of PINs, their statistical properties, and their significance for the cell (Barabasi and Oltvai, 2004), but the focus of this review is on the evolution of PPIs and, more particularly, on the contribution of two major sources of molecular innovation, namely, single-gene and whole-genome duplications.

It is important to understand the evolution of PPIs in order to address fundamental questions about molecular biology and to use the interactome correctly. First of all, we need to understand which molecular mechanisms are responsible for innovation and the evolution of PINs, and the extent to which each of those mechanisms contributes. Also, PINs from different organisms need to be compared in order to understand which protein complexes form the universal core, which are specific to a certain clade of organisms, and which are unique to one species. In this way, we will know which experimentally determined interactions can be transferred from one organism to another and which cannot. In addition, by studying the evolution of PPIs and PINs, we will better understand the components and types of interactions that are responsible for increasing biological complexity.
It is well acknowledged that organismal complexity correlates with the number and coverage of PPI domains per protein (Xia et al., 2008). Complexity also seems to correlate with an expansion of certain gene families (van Nimwegen, 2003). We need to understand whether these particular families are linked to specific types of interactions and to improve our knowledge of the relationship between organismal complexity, the various modes of duplication, and PPIs. Here, we will first introduce the sources of molecular innovation in PINs, that is, gene and genome duplications and mutations. Second, we will review studies of genome-scale data that provide a bird’s-eye view of the importance of each source of molecular innovation. Finally, we will cite some medium-scale studies that use high-quality data and provide an in-depth view of the impact of gene and genome duplication on the evolution of PPIs.
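
The graph representation described in this introduction, with proteins as nodes and physical interactions as edges, can be made concrete with a minimal Python sketch. The function names and the toy interaction list are hypothetical and are not taken from any interactome database; the sketch only assumes that interactions are undirected.

```python
from collections import defaultdict


def build_pin(interactions):
    """Build an undirected PIN: each protein maps to the set of its partners."""
    pin = defaultdict(set)
    for a, b in interactions:
        pin[a].add(b)
        pin[b].add(a)  # physical interactions are symmetric
    return dict(pin)


def degree(pin, protein):
    """Connectivity of a protein: the number of distinct interaction partners."""
    return len(pin.get(protein, set()))


# Hypothetical toy interactome; the protein names are placeholders.
toy_interactions = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
toy_pin = build_pin(toy_interactions)
```

Storing the partners of each protein as a set makes it cheap to ask how many interactors two proteins have in common, which is the kind of comparison made between gene duplicates later in this chapter.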

19.2 EVOLUTION OF PINs

During the last decade, the significance of gene duplications, point mutations, and domain rearrangements in shaping the regulatory and protein interaction networks has been well established (Amoutzias et al., 2004b; Babu et al., 2004; Bornberg-Bauer et al., 2005; Evlampiev and Isambert, 2008; Ispolatov et al., 2005; Pastor-Satorras et al., 2003; Wagner, 1994). PINs may evolve by two mechanisms: either by mutations, such as point mutations and/or domain rearrangements, of existing proteins, or by gene duplication and subsequent mutation of the duplicate(s). In the former case, the number of nodes in the PIN remains stable, but the PIN is rewired, as some new PPI interfaces emerge and some are lost. In the latter case, a gene duplicates, and one or both copies may undergo one of three fates (Figure 19.1), namely, (i) subfunctionalization, where the functions of the ancestral gene are divided between the two duplicates; (ii) neofunctionalization, where one of the copies retains the ancestral function but the other evolves a novel function; and (iii) most frequently, nonfunctionalization, where one of the copies accumulates deleterious mutations and turns into a pseudogene.

Figure 19.1 The three most common fates of a duplicated gene and its interactions: subfunctionalization, neofunctionalization, and nonfunctionalization. (See insert for color representation of this figure.)

In either of the two cases where both copies survive (subfunctionalization or neofunctionalization), all the initial interactions of the ancestral gene’s protein product are inherited by the identical duplicate. Then, depending on the extent and character of the mutations, a few or all of the interactions shared by the duplicates may be retained or lost, whereas some new interactions may also emerge. Mutations of existing genes and mutations of redundant duplicates may take place at the same time in different parts of the network. Nevertheless, genomic data strongly support the latter case, gene duplication and subsequent divergence, as a major contributor to molecular innovation. For example, in yeast, only one-third of the genes are characterized as singletons (Davis and Petrov, 2005). Analyses of PPIs among gene duplicates in yeast show that the duplicates diverge asymmetrically in terms of PPIs (Wagner, 2002), meaning that one of the duplicates has more PPIs than the other, although they still retain a significant number of common interactors, more than expected by chance (Musso et al., 2007).

We can consider different types of gene duplication events, depending on how many genes are duplicated simultaneously: single-gene duplications (SGDs), block duplications, and whole-genome duplications (WGDs). Block duplications are observed in cases of trisomies or chromosomal aneuploidies, where one whole genomic block, such as a whole chromosome or part of it, may duplicate or become lost. Both the strong negative effect of these block duplications on the fitness of individuals and simulations of the evolution of genetic networks indicate that organisms should preferentially evolve either by single-gene or by whole-genome duplications (Wagner, 1994). These two mechanisms of duplication (SGDs and WGDs) are totally different in nature, and it has been proposed that they should favor the duplication and survival of different types of genes (Davis and Petrov, 2005).
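
The duplication-and-divergence scenario described in this section can be illustrated with a short Python sketch, in which a duplicate first inherits every interaction of its parent and then loses interactions at random. This is a toy model under stated assumptions, not the model of any of the cited studies: the function names, the loss probability, and the three-protein network are all hypothetical, and the gain of new interactions is left out for brevity.

```python
import random


def duplicate_gene(pin, gene, copy_name):
    """Return a new PIN in which copy_name inherits all interactions of gene."""
    new_pin = {protein: set(partners) for protein, partners in pin.items()}
    new_pin[copy_name] = set(pin[gene])
    for partner in pin[gene]:
        new_pin[partner].add(copy_name)
    if gene in pin[gene]:
        # An ancestral homodimer yields two homodimers plus a heterodimer.
        new_pin[copy_name].add(copy_name)
    return new_pin


def lose_interactions(pin, gene, loss_prob, rng):
    """Return a new PIN in which each interaction of gene is lost with loss_prob."""
    new_pin = {protein: set(partners) for protein, partners in pin.items()}
    for partner in sorted(pin[gene]):
        if rng.random() < loss_prob:
            new_pin[gene].discard(partner)
            new_pin[partner].discard(gene)
    return new_pin


def shared_interactors(pin, a, b):
    """Partners common to a and b, excluding the two proteins themselves."""
    return (pin[a] & pin[b]) - {a, b}


# Hypothetical ancestral network: protein A interacts with B and C.
ancestral = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}}
duplicated = duplicate_gene(ancestral, "A", "A2")
diverged = lose_interactions(duplicated, "A2", loss_prob=0.5, rng=random.Random(0))
```

Immediately after duplication the two copies share all of their interactors; applying `lose_interactions` to only one copy is one simple way to mimic the asymmetric divergence reported for yeast duplicates.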


19.3 SINGLE-GENE DUPLICATIONS

Single-gene duplications occur continuously within a genome. Lynch and Conery (2000) describe them as stochastic events that become fixed in a population at a rate that differs among species and reaches up to 1 per 100 genes per million years. As noted, the most common fate of a gene duplicate is nonfunctionalization, in which the duplicate gene is rapidly lost. The average half-life of a single-gene duplicate has been estimated to be approximately 4 million years (Lynch and Conery, 2000). It is estimated that half the genes of a genome will be duplicated and fixed within a timescale of 35 to 350 million years (Lynch and Conery, 2000).
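
The numbers quoted above can be related to one another with a small worked example. The sketch below assumes, as a simplification, that both duplicate loss and duplicate fixation behave as constant-rate (exponential) processes; the function names are hypothetical. Under that assumption, the per-gene rates of 0.02 and 0.002 per million years used at the end are back-calculated here to match the 35 to 350 million year range and are not values stated in the text.

```python
import math


def surviving_fraction(t_myr, half_life_myr=4.0):
    """Fraction of duplicates still present after t_myr million years."""
    return 0.5 ** (t_myr / half_life_myr)


def time_until_half_duplicated(rate_per_gene_per_myr):
    """Time (in Myr) until half of all genes have duplicated at least once.

    Solves 1 - exp(-rate * t) = 0.5 for t, which gives t = ln(2) / rate.
    """
    return math.log(2) / rate_per_gene_per_myr


# With a 4 Myr half-life, a quarter of the duplicates remain after 8 Myr.
remaining_after_8_myr = surviving_fraction(8.0)

# Rate of 1 per 100 genes per Myr, the upper value quoted in the text.
t_at_quoted_rate = time_until_half_duplicated(0.01)

# Back-calculated rates (assumption) that bracket the quoted time range.
t_fast = time_until_half_duplicated(0.02)
t_slow = time_until_half_duplicated(0.002)
```

Under these assumptions, the rate of 1 per 100 genes per million years gives a time of roughly 69 million years, which falls inside the quoted range.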

19.4 WHOLE-GENOME DUPLICATIONS

Whole-genome duplications were initially thought to happen very frequently but to become fixed very rarely. Therefore, their impact on evolution has long been underestimated. Nevertheless, the advent of the genomic era revealed that the fixation of WGDs is more frequent than originally thought and of major significance for speciation, radiation, and adaptation (De Bodt et al., 2005; Scannell et al., 2006; Van de Peer, 2004). Initially, this idea was hotly debated and opposed, but it is now widely accepted that almost all eukaryotic lineages, such as animals, fungi, protists, and especially plants, have undergone one or more rounds of WGD in their evolutionary past. For example, in animals, two successive rounds of WGD occurred at the origin of vertebrates (the 2R event) (Dehal and Boore, 2005; Panopoulou and Poustka, 2005) and one in the bony fish lineage (the 3R event) (Jaillon et al., 2004; Taylor et al., 2003; Vandepoele et al., 2004). In the yeast lineage, a WGD occurred around 100 million years ago (Wolfe and Shields, 1997), whereas in the ciliate Paramecium, three or four WGDs have occurred (Aury et al., 2006). In plants, one or two genome duplications are shared by all flowering plants, and many of them underwent additional rounds of polyploidization (Blanc and Wolfe, 2004; Cui et al., 2006; De Bodt et al., 2005; Schlueter et al., 2004; Sterck et al., 2007).

19.5 DIPLOIDIZATION PHASE

During a WGD event, either through autopolyploidy or through allopolyploidy (hybridization), all genes of a genome duplicate simultaneously and the organism becomes tetraploid. The duplicate pairs that result from such an event are called ohnologues, after Susumu Ohno, who was the first to discuss the importance of gene and genome duplications (Ohno, 1970). Usually, this tetraploid phase does not last long. Extensive genomic rearrangements and gene loss occur as the organism returns to its diploid state, a process called diploidization (Wolfe, 2001). During this phase, the rate of duplicate loss can be more or less constant over time, as observed in Paramecium (Aury et al., 2006), or it can be very high at the beginning and slow down later, as observed in baker’s yeast (Scannell et al., 2006) and Arabidopsis (Maere et al., 2005). Lynch and Conery (2000) observed that the level of retention of ohnologues is unexpectedly high compared to the retention rates of single-gene duplicates. This high retention rate has been confirmed for many species that underwent one or more WGDs. Several reasons have been proposed to explain it, such as protein dosage effects (see Section 19.6), buffering of essential genes, enhancement of metabolic fluxes, and rapid divergence of gene pairs (Aury et al., 2006; Chapman et al., 2006; Kondrashov et al., 2002; Lynch and Katju, 2004; Ohno, 1970; Papp et al., 2003; Veitia, 2005).


19.6 DOSAGE BALANCE HYPOTHESIS

Perhaps the most important factor for ohnologue retention that is directly linked to PPIs is the protein dosage effect. According to the dosage balance hypothesis (DBH) (Veitia et al., 2008), stoichiometric imbalances in macromolecular complexes can have phenotypic effects, most probably fitness defects. These defects are the result of overexpression or underexpression of a protein subunit, which disrupts proper formation of the complex. The DBH has also been extended to signaling and transcriptional networks, where, according to theory, the balance between activators and repressors should be preserved (Birchler and Veitia, 2007). Studies have shown that regulatory genes are overrepresented among genes sensitive to haploinsufficiency (Kondrashov and Koonin, 2004; Papp et al., 2003). Therefore, where stoichiometric balance needs to be preserved, single-gene duplications can have a detrimental effect. On the other hand, whole-genome duplications should not affect the stoichiometries, since all parts of the complex are duplicated simultaneously. To put it simply, complexes that are sensitive to stoichiometric imbalances are bound to evolve mostly by WGDs and not by SGDs. During the diploidization phase, duplicate genes that participate in complexes and are sensitive to dosage imbalances would tend to be retained instead of being lost. Later on, mutations and the resulting rewiring of the genetic network will allow some of these retained duplicates to diversify or disappear, because the imbalances are compensated (Aury et al., 2006; Semon and Wolfe, 2007). Several mechanisms can compensate for this imbalance at the mRNA or protein level (Veitia et al., 2008), such as the pathway and kinetics of macromolecular assembly, the topology of the complex, negative feedback regulatory loops that keep mRNA or protein concentrations stable, or proteasomal degradation of excess monomers.

From the DBH alone, it is evident that the mode of duplication should have a strong effect on which categories of genes will be retained. Indeed, genes involved in signal transduction and transcription have a strong tendency to be retained after a WGD event, as has been shown for fungi, plants, and vertebrates (Blomme et al., 2006; Davis and Petrov, 2005; Maere et al., 2005). Maere et al. (2005) estimated that around two-thirds of the transcriptional regulators and half of the kinases of Arabidopsis are ohnologues, retained after several ancient WGD events over a period of 150 million years, whereas Blomme et al. (2006 and personal communication) estimated that half of the transcriptional regulators and two-thirds of the signal transducers of humans are ohnologues, retained from two genome doublings at the origin of vertebrates around 500 million years ago. It is evident that WGDs had a significant impact on these specific categories of genes, but the question remains what their impact was on the evolution of PINs. Intriguingly, transcription factors (TFs) and kinases participate in a certain type of interaction, termed transient.
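
The stoichiometric argument behind the DBH can be sketched with a toy calculation. The Python example below assumes, purely for illustration, that the amount of each subunit is proportional to its gene copy number and that none of the compensation mechanisms listed above is active; the function name and the two-subunit complex are hypothetical.

```python
def complexes_and_excess(copies, stoichiometry):
    """Count the complete complexes that can form and the leftover monomers.

    copies:        gene -> copy number (used here as a proxy for protein amount)
    stoichiometry: gene -> number of subunits needed per complex
    """
    n_complexes = min(copies[g] // stoichiometry[g] for g in stoichiometry)
    excess = {g: copies[g] - n_complexes * stoichiometry[g] for g in stoichiometry}
    return n_complexes, excess


# Hypothetical heterodimer with one A subunit and one B subunit per complex.
stoichiometry = {"A": 1, "B": 1}

diploid = {"A": 2, "B": 2}
after_sgd = {"A": 3, "B": 2}  # single-gene duplication of A only
after_wgd = {"A": 4, "B": 4}  # whole-genome duplication doubles every gene

balanced = complexes_and_excess(diploid, stoichiometry)
imbalanced = complexes_and_excess(after_sgd, stoichiometry)
doubled = complexes_and_excess(after_wgd, stoichiometry)
```

In this toy case the single-gene duplication leaves one unpaired A monomer, whereas the whole-genome duplication doubles the number of complexes without leaving any excess subunit, which is the contrast the hypothesis relies on.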

19.7 TYPES OF INTERACTIONS

Not all interactions are of the same nature; rather, they are categorized into different types, as reviewed extensively by Nooren and Thornton (2003). More and more studies discuss the importance of distinguishing between the various types of interactions, instead of treating them all as the same, in order to understand their characteristics and evolution (Brown and Jurisica, 2007; Mintseris and Weng, 2005; Nooren and Thornton, 2003; Sprinzak et al., 2006; Tompa and Fuxreiter, 2008; Wilkins and Kummerfeld, 2008). One distinction is between obligate and nonobligate complexes. For instance, some proteins are not found as stable structures alone in vivo and take on their characteristic structure only when they become part of a complex. The complexes that they form are called obligate. On the other hand, proteins that have a defined crystal structure independently of their interacting partners form nonobligate complexes. Another distinction can be made between transient and permanent complexes. The former have a short lifetime and associate and dissociate in vivo, whereas the latter are usually disrupted by proteolysis. Obligate interactions are usually permanent, whereas nonobligate interactions can be either permanent or transient. Despite these classification efforts, Nooren and Thornton (2003) note that there is no simple and clear distinction between obligate and nonobligate interactions, but rather a continuum between the two. In addition, physiological conditions such as the concentration of ions and chemicals, pH, temperature, or phosphorylation may affect the stability of an interaction.

Several recent studies have shown that a distinction should be made between transient and stable interactions, due to their different properties. Otherwise, the various signals of a PPI analysis are weakened or scrambled (Brown and Jurisica, 2007; Mintseris and Weng, 2005; Nooren and Thornton, 2003; Sprinzak et al., 2006; Tompa and Fuxreiter, 2008; Wilkins and Kummerfeld, 2008). One of the most important differences is that the interactions of stable complexes are highly conserved, even among distantly related organisms, whereas transient interactions are usually much less conserved (Brown and Jurisica, 2007; Mintseris and Weng, 2005). This difference is also reflected in the amino acid conservation level of the interacting interfaces (Mintseris and Weng, 2005). Furthermore, interacting partners of stable complexes tend to be more coexpressed than the partners of transient interactions (such as the yeast kinome), whose coexpression is not higher than that of random protein pairs (Brown and Jurisica, 2007). Interestingly, the human PIN seems to be dominated by transient interactions (Brown and Jurisica, 2007).

19.8 WGDs, TRANSIENT INTERACTIONS, AND ORGANISMAL COMPLEXITY

Phosphorylation is the reversible addition of a phosphate group by a protein kinase to a target protein, while phosphatases have the opposite effect and remove the phosphate. Phosphorylation causes conformational changes in the structure of the target protein and may alter its functions. Between 30% and 50% of a proteome may be phosphorylated under certain physiological conditions, but interestingly, certain categories of genes that are mainly retained following WGD (see Section 19.6), such as TFs and kinases, are often enriched in phosphorylation sites (Chi et al., 2007; Heazlewood et al., 2008; Ptacek et al., 2005). These sites usually occur within fast-evolving regions that are intrinsically disordered and lack regular structure, such as hinges and loops (Gnad et al., 2007; Iakoucheva et al., 2004; Peck, 2006). The phosphorylation motifs that are recognized by kinases are rather short, with a length of 8–12 amino acids (Gnad et al., 2007). Given their short length and their location within fast-evolving regions, they are expected to appear and disappear very rapidly. Indeed, regions of phosphorylation sites have lower conservation than the average conservation of the entire protein (Gnad et al., 2007), whereas yeast phosphoproteins, but not the actual phosphorylation sites, are conserved across large evolutionary distances (Chi et al., 2007). Also, phosphorylation sites in plants are usually found outside of PFAM domains, and they tend to be more conserved between orthologues than between paralogues (Nuhse et al., 2004; Peck, 2006).

Another fact that links WGDs to transient interactions is the enrichment of intrinsically disordered regions in TFs and kinases. Some regions in a protein do not have a well-defined and stable 3D structure in their native state, but instead have dynamic structures that interconvert. They are termed intrinsically disordered regions (IDRs) and may cover either a small part or the whole of a protein (Lobley et al., 2007). Although they lack a well-defined structure, they are often involved in transient protein–protein interactions of regulatory and signaling molecules that require high specificity and low affinity. The enrichment of TFs in IDRs was demonstrated by Liu et al. (2006). Activation domains in particular are mostly or totally unstructured. There is a preference for phosphorylation sites to be embedded within IDRs (Iakoucheva et al., 2004), and Lobley et al. (2007) have shown that transcriptional regulators and kinases are among the groups of proteins that are enriched in IDRs.

Given that phosphorylation sites and IDRs are overrepresented in TFs and kinases and that WGD strongly favors the retention of these categories of genes, it becomes evident that this mode of duplication would increase the number of transient interactions within a PIN. In addition, the rapid emergence and loss of phosphorylation sites by a few point mutations strongly suggests that WGD provides the raw material for rapid rewiring of a PIN, especially the part that is involved in transient interactions and information processing. Since transient interactions dominate gene regulation and signal transduction, and given the established link between organismal complexity (at least in terms of distinct cell types) and an increase in the percentage of signal transducers and TFs within a genome (Ranea et al., 2005; van Nimwegen, 2003), it is tempting to assume that WGD is one of the major contributors of raw genetic material for increasing biological complexity.

19.9 STUDIES ON PPIs OF OHNOLOGUES

A pair of retained gene duplicates may follow one of three scenarios: (i) the duplicates may retain the majority of the functions of the ancestral gene and show redundancy; (ii) they may subdivide the functions or expression of the ancestral molecule, that is, subfunctionalize; or (iii) one of the two duplicates retains the ancestral functions, whereas the other evolves new functions (neofunctionalization) (Casneuf et al., 2006). In the first scenario, the organism should become more robust to mutations, whereas in the other two scenarios, the organism should evolve new functions and possibly adapt better to new environments. Several studies (Guan et al., 2007; Hakes et al., 2007) have tried to address the question of whether the mode of duplication (SGD versus WGD) could be linked to one of the three aforementioned scenarios, by comparing the functional divergence between groups of ohnologues and groups of SG duplicates in yeast. As measures of functional divergence, these studies used the number of common interactors, an integrated Bayesian analysis of diverse functional data, and a semantic distance based on Gene Ontology annotation. The results of both studies showed that, for genes with the same level of sequence divergence, ohnologues diverge less in function and PPIs than SG duplicates. In addition, ohnologues tend to be more dispensable than SG duplicates and also have higher synthetic lethality.

Although in our opinion these results need to be considered with caution (see Section 19.10), the analysis of Hakes et al. (2007) highlights important differences between SG and WG duplicates. Both SG and WG duplicates have the same connectivity, with an average of 10 PPIs per protein, but WG duplicates tend to share more common interactors than SG duplicates. Furthermore, for genes that participate in complexes, SG duplicates tend to be more essential than WG duplicates (21% versus 10%, respectively), whereas for genes not participating in complexes, both types show similar essentiality (9% versus 6%, respectively). WG duplicates tend to participate in complexes slightly more often than SG duplicates (19% versus 14%).

Another difference between SG and WG duplicates that relates to their physical interactions is underwrapping. Crystal structures from the PDB were analyzed to test whether the duplicability of a gene is affected by the underwrapping of its protein (Liang et al., 2008). Underwrapping describes the solvent accessibility of the hydrogen bonds of the protein backbone. The less accessible these hydrogen bonds are to water molecules, the more functionally competent the structure is. This inaccessibility is achieved by clusters of nonpolar amino acids that wrap the hydrogen bond and protect it from water. Intramolecular hydrogen bonds that are accessible to water are called dehydrons and constitute structural vulnerabilities. Therefore, the more underwrapped proteins are, the more they rely on their interactive context to maintain structural integrity. Overexpression of highly underwrapped proteins could increase misfolding and aggregation, thus leading to dosage sensitivity. The theory therefore predicts that the more underwrapped proteins are, the more sensitive they are to dosage imbalances, which should also be reflected in their family sizes and mode of duplication. Liang et al. (2008) compiled protein structures from the PDB and calculated the extent of underwrapping of each protein. Next, the gene family size was determined. The authors found a negative correlation between underwrapping and duplicability, showing that underwrapping indeed makes genes more sensitive to dosage effects and hinders SGDs. They also compared the yeast SG duplicates against the WG duplicates and found that WG duplicates could tolerate higher underwrapping. It was also noted that the underwrapping effect, and therefore the dosage imbalance effect, was strong for simple unicellular organisms but less strong for more complex organisms. Five major reasons were suggested for this loss of sensitivity to dosage imbalance: (i) more efficient regulatory networks that can compensate for higher expression, (ii) alternative splicing as an escape route, (iii) higher allostery in complex organisms, (iv) a smaller effective population size, which allows slightly deleterious dosage imbalances to become fixed, and (v) positive selection.

Another study tried to determine the effect of WGD on homodimeric interactions. By using network motifs and a mathematical model of the gain and loss of interactions, Presser et al. (2008) analyzed the interactions among the yeast ohnologues and concluded that proteins of the pre-WGD genome tended to self-interact more than those present after the WGD. The WGD probably caused a rewiring effect. It is possible that mutations changed some of the ancient homodimers to obligate heterodimers. Other groups also suggested a model of PIN evolution in which redundant duplicate homodimers would evolve into heterodimers by mutations (Amoutzias et al., 2004b; Ispolatov et al., 2005; Pereira-Leal et al., 2007).
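
The use of common interactors as a measure of divergence between duplicates, mentioned at the start of this section, can be illustrated with a simple overlap score. The sketch below uses the Jaccard index, which is an assumption made here for illustration and is not necessarily the statistic used by Guan et al. (2007) or Hakes et al. (2007); the function name and the toy network are hypothetical.

```python
def interactor_overlap(pin, a, b):
    """Jaccard index of the partner sets of a and b (the pair itself excluded)."""
    partners_a = pin[a] - {a, b}
    partners_b = pin[b] - {a, b}
    union = partners_a | partners_b
    if not union:
        return 0.0  # neither duplicate has any other partner
    return len(partners_a & partners_b) / len(union)


# Hypothetical duplicate pairs: P1/P2 share most partners, Q1/Q2 share none.
toy_pin = {
    "P1": {"X", "Y", "Z"},
    "P2": {"X", "Y"},
    "Q1": {"X", "W"},
    "Q2": {"V"},
}

conserved_pair = interactor_overlap(toy_pin, "P1", "P2")
diverged_pair = interactor_overlap(toy_pin, "Q1", "Q2")
```

A score close to 1 indicates that the two duplicates have kept largely the same partners, and a score of 0 indicates that they no longer share any. Comparing the distribution of such scores between ohnologues and SG duplicates of similar sequence divergence is one way to frame the comparisons described above.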

19.10 CONCERNS ABOUT THE METHODS OF ANALYSIS AND THE QUALITY OF THE DATA

Analyses of large-scale datasets are valuable for observing trends at the genome level. Nevertheless, we need to bear in mind that several issues complicate such analyses, such as the dimension of time, the fact that different organisms are under different constraints and live in different environments, the quality and coverage of the data, and the biases of each dataset, among others.

One example of how time complicates an analysis concerns the two studies mentioned earlier on the functional differences between SG and WG duplicates (Guan et al., 2007; Hakes et al., 2007). All ohnologues of yeast have the same age, around 100 million years, whereas the SG duplicates have various ages. Ideally, one should use SG duplicates as old as WG duplicates, but then the number of SG duplicates would probably not be sufficient for statistical analysis. If the SGD dataset is dominated by very old duplicates, then the observed difference in functional divergence could be just a function of time and not due to the mode of duplication. Indeed, Guan et al. (2007) recognize the importance of dating the duplicates to increase the confidence in future analyses and discuss in depth the potential pitfalls and how they could be avoided. Future analyses with more species and functional data are needed to fully resolve this problem.

Another issue related to time is that of the rate of gene loss during the diploidization phase. In yeast and in plants, this rate is initially very high but declines over time (Maere et al., 2005; Semon and Wolfe, 2007). In addition, the rate of loss differs among GO categories in plants (Maere et al., 2005). Therefore, depending on how old the event is, we should observe different outcomes in terms of the gene content of a genome.

Occasionally, analyses appear to contradict common beliefs and stir up discussion about our understanding of a process. A very interesting case is the effect of gene dosage on the evolution of protein complexes, as discussed by Freeling and Thomas (2006) and Pereira-Leal and Teichmann (2005). Although the DBH posits that protein complexes should preferentially evolve by WGDs and not by SGDs, Pereira-Leal and Teichmann (2005) show the opposite. These authors found that the predominant mechanism of protein complex creation was in fact not duplication. Nevertheless, a small but significant portion of the complexes evolved by duplication from other complexes. Pereira-Leal and Teichmann (2005) classified homologous complexes as concurrent or parallel. Concurrent complexes share some of their components, whereas parallel complexes have similar but no shared components. Bioinformatics analyses showed that the most reasonable scenario is one where new homologous complexes arise by a slow process of stepwise duplications, whereas the ancient whole-genome duplication that occurred in yeast did not have a significant impact on complex creation. The new homologous complexes retained the general function of the ancestral complexes but evolved new specificities. This finding seems to contradict the DBH, which predicts whole-genome duplication as the most favorable mechanism, in order to avoid dosage imbalance, unless mechanisms that compensate for the dosage imbalance are in action. In addition, Hakes et al. (2007) found that for genes that participate in complexes, ohnologues are more dispensable than single-gene duplicates, which again appears to contradict the DBH. Mintseris and Weng (2005) found evidence for the effect of dosage imbalance on stable protein complexes, but could not address this issue for transient interactions.

Many large-scale studies are based on literature-curated data as well as on high-throughput experiments and try to provide a global snapshot of the effect of whole-genome duplication. The literature-curated data are considered the gold standard and are expected to have low coverage but to describe highly confident interactions. In contrast, many of the large-scale experiments used for some of the analyses are incomplete and contain errors and biases. For example, the yeast and human interactomes are estimated to be about 50% and 10% complete, respectively, with an estimated total of 38,000–75,000 interactions in yeast and 154,000–369,000 interactions in humans (Hart et al., 2006). A more recent analysis converges on the number of yeast interactions, but doubles the number of human interactions (Stumpf et al., 2008). In addition, the false positive rates for any experimental high-throughput (HTP) method may fluctuate between 30% and 80% (Hart et al., 2006). Therefore, it is no surprise that experts in the field suggest that in the future, the interactome should be treated like a genome, with multiple coverage, to account for mistakes.
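
To give a feeling for what these error rates imply, the following Python sketch works through the arithmetic for a hypothetical screen. It assumes that the quoted false positive rate can be read as the fraction of reported interactions that are spurious, and the figure of 10,000 reported interactions is invented for illustration; only the 30% to 80% error range and the 38,000 to 75,000 estimate of yeast interactome size come from the text.

```python
def expected_true_interactions(n_reported, false_positive_rate):
    """Expected number of genuine interactions among those reported."""
    if not 0.0 <= false_positive_rate <= 1.0:
        raise ValueError("false_positive_rate must lie between 0 and 1")
    return n_reported * (1.0 - false_positive_rate)


def coverage_range(n_true, interactome_low, interactome_high):
    """Lowest and highest fraction of the interactome covered by n_true."""
    return n_true / interactome_high, n_true / interactome_low


N_REPORTED = 10_000  # hypothetical size of one HTP dataset

best_case = expected_true_interactions(N_REPORTED, 0.30)
worst_case = expected_true_interactions(N_REPORTED, 0.80)

# Coverage of the yeast interactome in the worst case, for both size estimates.
worst_low, worst_high = coverage_range(worst_case, 38_000, 75_000)
```

Under these assumptions, the same dataset could contain anywhere between roughly 2,000 and 7,000 genuine interactions, which illustrates why repeated, independent coverage of the interactome is recommended.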


Another major concern about HTP interactome data is the bias of certain technologies toward detecting the interactions of certain gene categories (extensively reviewed by Lalonde et al., 2008). Methods that have been adapted for HTP screening include the yeast two-hybrid (Y2H) system, the mating-based split-ubiquitin system (mbSUS), and affinity purification of protein complexes followed by mass spectrometry identification of proteins (AP-MS). The various methods differ in their sensitivity, specificity, and ability to detect interactions over a broad spectrum of affinities. Also, some methods detect direct physical interactions (e.g., Y2H), whereas others determine the presence of one protein within a protein complex (AP-MS) or in its vicinity (FRET). Y2H and AP-MS also differ in their ability to detect PPIs with different kinetics and binding affinities. AP-MS, for example, is biased in favor of stable complexes, whereas Y2H is more capable of detecting binary transient interactions. In one of the first evaluations of the various interactome datasets, many biases were identified in the yeast interaction data (von Mering et al., 2002), related to certain cellular environments and to more ancient, conserved, or highly expressed proteins. Lalonde et al. (2008) discuss common problems of these HTP technologies, which include (i) a limited number of replicate tests, (ii) the assaying of interactions in an all-or-nothing scheme that ignores binding affinities, (iii) frequent overexpression of proteins, which modifies the relative concentration of potential partners, (iv) the use of heterologous systems, and (v) the analysis of interactions in cellular extracts, which may bring together proteins from different compartments. Therefore, HTP data include potential interactions, together with in vivo interactions. Without a doubt, HTP screens are necessary to obtain an overview of the potential interactome, but low-throughput, carefully prepared follow-up studies are needed to verify the initial data (Lalonde et al., 2008).

Given all the problems that blight HTP interaction data, one may wonder to what extent the conclusions of duplication–interaction analyses are robust. At this early phase in the era of interactomics, caution is definitely advised, but it should not give way to pessimism or nihilism. Most of the bioinformatics analyses are performed on yeast genomic and functional data. Yeast is indeed the best choice for such analyses, because a significant number of related yeast species (some predating the WGD) have been sequenced and have thus helped to provide a very confident and carefully analyzed dataset of ohnologues (Byrne and Wolfe, 2005; Kellis et al., 2004). In addition, the yeast interactome has the highest coverage compared to other organisms, an estimated 50%. Analyses by two groups found that, in terms of functional distance, PPI data are in agreement with an integrated Bayesian analysis of diverse functional data and with semantic distance based on Gene Ontology annotation (Guan et al., 2007; Hakes et al., 2007). Therefore, most of the analyses of PPIs should at least be able to capture some strong global trends.

19.11 THE IMPORTANCE OF MEDIUM-SCALE STUDIES: THE CASE OF DIMERIZATION

Although large-scale studies provide a bird's-eye view of the evolution of PINs, they are complicated by several problems that are less pronounced in medium-scale studies. Usually focused on a protein family or a particular pathway, medium-scale studies use data of higher quality and coverage and provide deeper insight into the mechanisms and processes that govern the evolution of PPIs. One well-studied case related to regulatory PINs is the evolution of dimerizing interactions in metazoan TFs. Dimerization is defined as the formation of a functional protein complex composed of two subunits (Klemm et al., 1998). In signal transduction pathways, dimeric interactions usually are not very stable. Rather, they are dynamic and act as reversible switches in the process of information flow (Nooren and Thornton, 2003). Dimerization is observed in many signal transduction and regulatory gene families (Klemm et al., 1998; Marianayagam et al., 2004). In TFs, two monomers need to dimerize in order to bind DNA and, depending on the choice of partner and the cellular context, each unique TF dimer triggers a sequence of regulatory events that leads to a particular cellular fate. The best-studied TF families that form homotypic dimers (dimers among homologous proteins) are the bHLH, bZIP, nuclear receptor (NR), MADS-box, HD-ZIP, NF-κB, and STAT families. These TFs create a large number of dimers with distinct biological properties (more than 500 in human and up to 2500 when alternative splicing is accounted for) and form elaborate control circuits that are central to the evolution and generation of organismal complexity (Amoutzias et al., 2008). Dimerizing TFs regulate a very wide range of processes, such as the cell cycle, reproduction, development, homeostasis, metabolism, immunity, inflammation, and programmed cell death (Amoutzias et al., 2008). Members of these dimerizing TF families mediate their DNA-binding and dimerization activities via highly conserved domains. In all families, the DNA-binding domain is the most conserved portion of the protein, whereas the dimerization domain (which usually lies downstream from the DNA-binding domain) is less conserved (Amoutzias et al., 2008). These domains are shared among all the members of each family and therefore are often used in phylogenetic analyses. Usually, specific functions, such as recognition of DNA elements and dimerization, are tightly linked to the phylogenetic clustering. Other regions of the proteins can contain transcriptional activation/repression domains, various functional domains, or phosphorylation sites, but these elements are usually not as highly conserved.
Some of the most important implications of TF dimerization are reviewed by Amoutzias et al. (2008), Klemm et al. (1998), and Marianayagam et al. (2004), with perhaps the most important being differential regulation. One TF monomer can have multiple binding partners and thus form dimers that possess distinct properties and perform specific functions, thereby mediating differential gene regulation. In this case, the concentration of each monomer in the cell, its posttranslational modifications (e.g., phosphorylation), and its binding affinity for other monomers will determine which dimer will form, and thus which signaling process will prevail over the others. Very good examples are the Myc–Max and Mad–Max heterodimers, which determine whether a large number of target genes will be expressed or silenced, respectively (Grandori et al., 2000; Luscher, 2001). N genes of a given TF family could in theory generate N homodimers + N(N − 1)/2 unique heterodimers, assuming negligible binding specificity among the monomers, a lack of cell- or tissue-specific expression patterns, and little alternative splicing. Therefore, for the 51 bZIPs, 118 bHLHs, and 48 NRs in humans, there is the potential to form 1326, 7021, and 1176 unique dimers, respectively. Theoretically, TF dimerization could make a huge contribution to the flexibility and complexity of gene regulation, given that there are approximately 2000–3000 human sequence-specific TFs (Kummerfeld and Teichmann, 2006; van Nimwegen, 2003). In practice, the specificity of monomer–monomer interactions limits the available binding options. Protein-array experiments and reliable predictions based on biophysical constraints on leucine zipper (LZ) interactions lead to estimates of approximately 350 unique bZIP dimers (Fong et al., 2004; Grigoryan and Keating, 2006; Newman and Keating, 2003; Vinson et al., 2006).
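The theoretical upper bounds quoted above follow directly from the counting formula, N homodimers plus N(N − 1)/2 heterodimers. As a minimal sketch of that arithmetic (the function name `max_unique_dimers` is introduced here for illustration only and does not come from the cited studies):

```python
def max_unique_dimers(n_genes: int) -> int:
    """Upper bound on unique dimers for a family of n_genes monomers.

    Assumes, as in the text, that every monomer can pair with every
    other monomer and with itself (no binding specificity).
    """
    homodimers = n_genes
    heterodimers = n_genes * (n_genes - 1) // 2
    return homodimers + heterodimers


# Human family sizes quoted in the text.
family_sizes = {"bZIP": 51, "bHLH": 118, "NR": 48}
upper_bounds = {name: max_unique_dimers(n) for name, n in family_sizes.items()}

for name, bound in upper_bounds.items():
    print(f"{name}: up to {bound} unique dimers")
```

Set against these bounds, the estimate of approximately 350 bZIP dimers corresponds to roughly a quarter of the 1326 combinations that are possible in theory.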
Strong evidence also indicates specificity of dimerization in the bHLHs, NRs, HD-ZIPs, MADS-box TFs, and the plant bZIPs (Amoutzias et al., 2004a, 2007a, 2007b; de Folter et al., 2005; Ehlert et al., 2006; Johannesson et al., 2001; Veron et al., 2007). A very interesting finding is that the paralogues of any given phylogenetic subgroup in the bZIPs, bHLHs, NRs, MADS-box TFs, and HD-ZIPs share, to a high degree, their various dimerization partners (Amoutzias et al., 2004a, 2007a, 2007b; Johannesson et al., 2001; Newman and Keating, 2003). This pattern results from the way in which the TF families evolved, as discussed in the next section.
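One simple way to quantify how far paralogues share their dimerization partners is the Jaccard index of their partner sets, compared within and between phylogenetic subgroups. The sketch below is illustrative only: the network, the protein names (A1, A2, B1, B2, C1), and the subgroup labels are hypothetical toy data, and the Jaccard index is our choice of measure, not necessarily the one used in the cited studies.

```python
from itertools import combinations

# Hypothetical toy network: monomer -> set of dimerization partners.
# C1 also homodimerizes, so it lists itself as a partner.
partners = {
    "A1": {"B1", "B2", "C1"},
    "A2": {"B1", "B2"},
    "B1": {"A1", "A2"},
    "B2": {"A1", "A2"},
    "C1": {"A1", "C1"},
}
# Hypothetical phylogenetic subgroup of each monomer.
subgroup = {"A1": "A", "A2": "A", "B1": "B", "B2": "B", "C1": "C"}


def jaccard(x, y):
    """Shared partners divided by all partners of either protein."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def _mean(values):
    return sum(values) / len(values) if values else float("nan")


def mean_partner_sharing(partners, subgroup):
    """Mean Jaccard index for pairs within and between subgroups."""
    within, between = [], []
    for p, q in combinations(sorted(partners), 2):
        score = jaccard(partners[p], partners[q])
        if subgroup[p] == subgroup[q]:
            within.append(score)
        else:
            between.append(score)
    return _mean(within), _mean(between)


within_mean, between_mean = mean_partner_sharing(partners, subgroup)
print(f"within subgroups: {within_mean:.2f}, between subgroups: {between_mean:.2f}")
```

In a network that behaves as described above, the within-subgroup mean would be expected to be clearly higher than the between-subgroup mean.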

19.12 EVOLUTION OF DIMERIZATION NETWORKS

Although some of the DNA-binding folds of dimerizing TFs are also found in prokaryotes (e.g., HTH), all of the TF families that we have discussed so far are specific to eukaryotes. Some are found in the metazoan, fungal, and plant lineages (bHLH, bZIP, and MADS-box), whereas others are specific to plants (HD-ZIP) or to the metazoa/opisthokonta (NF-κB, NR, and STAT) (Amoutzias et al., 2008). Although several ancient TF families are found in all three of these eukaryotic lineages, some have independently undergone significant lineage-specific expansion in only one (MADS-box TFs in plants) or two (bHLH and bZIP in metazoa and plants) of the lineages. The integration of genomic and functional data for the three largest families of dimerizing TFs in metazoa, the bHLHs (Amoutzias et al., 2004a), NRs (Amoutzias et al., 2007a) and, especially, the bZIPs (Amoutzias et al., 2007b), delineates, to some extent, the evolution of DNA-binding and dimerization specificity during the major phases of animal macroevolution and shows the effects of SGDs and WGDs (Figure 19.2). From the genomes of fungi, diploblastic cnidarians (Nematostella vectensis), invertebrates (insects), and vertebrates (fishes and human), we can now understand more about the events that occurred during the emergence of bilaterian and vertebrate animals (Figure 19.2). Briefly, major gene duplications, point mutations, and domain rearrangements occurred at the origin of metazoa, approximately 1 billion years ago. These events shaped the repertoire of gene subfamilies and the interactions among them. Miyata and Suga (2001) have hypothesized that the emergence of multicellularity was accompanied by a phase of large-scale duplications and domain rearrangements for the signaling families in general, but so far there is no hard evidence for a WGD during this early period. By the time the urbilaterian ancestor arose around 650 MYA, a highly conserved core dimerization network had already been formed, with most of the subfamilies present. The genes that evolved during this period seem to have been shaped by single-gene duplications. It has been hypothesized that the emergence and rapid radiation of bilateria was due not to a large-scale duplication event but rather to the duplication and rewiring of a few key gene families (Davidson, 2006; Miyata and Suga, 2001). Spring (2003), however, suggests the opposite, namely that a genome duplication led to the emergence of bilateria. Later, two rounds of whole-genome duplication occurred at the origin of vertebrates (the 2R event) around 550 MYA (Dehal and Boore, 2005; Panopoulou and Poustka, 2005) and added more paralogues to each subfamily but, overall, did not create many new subfamilies. These highly similar paralogues still possess very similar DNA-binding and dimerization specificities in humans today. The evolution of the bZIP, bHLH, and NR networks strongly supports this picture (Amoutzias et al., 2004a, 2007a).

Figure 19.2 A general example of how TF dimerizing interactions have evolved during the major phases of animal macroevolution. SGD events and mutations (point mutations or domain rearrangements) created the various subfamilies, each one with a distinct interaction pattern. The subfamilies and their core dimerization network were formed by the time the bilaterian animals appeared. The two rounds of WGDs at the dawn of vertebrate evolution created more paralogues for each subfamily, but overall, these paralogues have retained the dimerizing interaction pattern of their ancestral molecules to the present day. (See insert for color representation of this figure.) [Labels recoverable from the original artwork: proteins A, B, C, and D; steps SGD, Mutations, and WGD; legend: lost interaction, gained interaction.]

The bZIPs are a well-studied case of how dimerization and DNA binding evolved in animals, owing to a plethora of functional data and, especially, to the high quality and coverage of their dimerization data. The human bZIP dimerization network was reconstructed using a protein-array technology, which does not show the biases that afflict Y2H assays.
This array was used to test every bZIP protein against all the others (Newman and Keating, 2003). The interaction array showed high symmetry and reproducibility and was in good agreement with the literature. In addition, rules that were derived from this dataset for predicting specificity were in good overall agreement with rules derived from independent studies. Current data show that the genome of the last common ancestor of eumetazoa contained genes for many dimerizing bZIP subfamilies (Amoutzias et al., 2007b). Most of these subfamilies must have emerged after the divergence of the fungi and before that of the cnidaria: only 1 of the 19 human bZIP subfamilies is shared with the fungi, whereas 13 are shared with cnidaria. In addition, these 13 subfamilies recognize all 6 DNA elements bound by the human bZIPs. Therefore, specificity of DNA binding mainly evolved during this period. Several of these ancient bZIP subfamilies subsequently duplicated and, while retaining their DNA-binding affinity for certain motifs, started to diverge at the dimerization domain, thus gaining and losing interactions with other bZIP subfamilies. This change in dimerization specificity could allow the new combinations of monomers to recognize new DNA motifs and thus increase the regulatory capacity of the genome. Of note, many bZIPs are involved in developmental processes. By the time the common ancestor of bilateria arose (just before the hypothesized Cambrian explosion), 17 of the 19 bZIP subfamilies were present and formed a complex core dimerization network, which is conserved in many vertebrate and invertebrate bilaterians. There is no evidence so far in the literature for a WGD during this period, although Spring (2003) suggests the opposite. Until this time, most of the subfamilies must have consisted of only one gene. At the origin of vertebrates, around 550 MYA, all of the 19 bZIP subfamilies were present.
Then, the two rounds of whole-genome duplication (the 2R event) that the vertebrate ancestor underwent created more paralogues for each subfamily. These paralogues have retained not only the DNA-binding specificity of their ancestral molecule but also most of its dimerizing interactions. The paralogues evidently diverged outside of the DNA-binding and dimerization domains, thus making new interactions with other signal transduction molecules. At least 35 of the 51 human bZIPs are retained duplicates from the 2R event, based on phylogenetic analysis. The high retention of WGD duplicates is a general trend observed in vertebrate TFs (Blomme et al., 2006). Subsequent lineage-specific SGDs and losses of TFs also occurred in the vertebrate lineage, though to a limited extent. The same scenario has been proposed for the evolution of the bHLH dimerization network in metazoa (Amoutzias et al., 2004b). Again, single-gene duplications and domain rearrangements formed the various subfamilies somewhere between the origin of metazoa and the origin of bilaterian animals. During that time, the core topology of the bHLH network was formed. Later on, the 2R event increased the number of paralogues for each subfamily, but overall, the majority of dimerizing interactions among the paralogues have been conserved to the present day.
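The scenario described in this section (duplication of genes, followed by gain or loss of interactions) can be illustrated with a toy duplication-divergence model. The sketch below is a simplified illustration under stated assumptions, not the procedure used in the cited studies: the network, the gene names, the loss probability, and all function names are hypothetical, a duplicate is assumed to inherit every interaction of its parent, and a homodimerizing parent is assumed to yield two homodimers plus a parent-copy heterodimer.

```python
import random


def duplicate_gene(network, gene, copy):
    """Return a new network in which `copy` inherits all interactions of `gene`."""
    new = {node: set(partners) for node, partners in network.items()}
    new[copy] = set()
    for partner in network[gene]:
        if partner == gene:
            # Assumption: a homodimerizing parent gives two homodimers
            # and a heterodimer between parent and copy.
            new[copy].update({copy, gene})
            new[gene].add(copy)
        else:
            new[copy].add(partner)
            new[partner].add(copy)
    return new


def whole_genome_duplication(network, suffix="'"):
    """Duplicate every gene of the original network once, one after another."""
    new = network
    for gene in sorted(network):
        new = duplicate_gene(new, gene, gene + suffix)
    return new


def lose_interactions(network, gene, p_loss, rng):
    """Drop each interaction of `gene` independently with probability p_loss."""
    new = {node: set(partners) for node, partners in network.items()}
    for partner in sorted(network[gene]):
        if rng.random() < p_loss:
            new[gene].discard(partner)
            new[partner].discard(gene)
    return new


# Toy ancestral network: A homodimerizes and also pairs with B.
ancestral = {"A": {"A", "B"}, "B": {"A"}}

after_sgd = duplicate_gene(ancestral, "A", "A2")   # single-gene duplication
after_wgd = whole_genome_duplication(ancestral)    # every gene duplicated
diverged = lose_interactions(after_sgd, "A2", p_loss=0.5, rng=random.Random(0))

print("after SGD:", after_sgd)
print("after WGD:", after_wgd)
```

Immediately after duplication, each paralogue has exactly the partners of its parent; in this toy model, any differences between paralogues arise only in the divergence step. A mutation step that gains interactions could be added in the same style.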

19.13 CONCLUSIONS

As more PPI data are generated in small- and large-scale experiments, a conceptual model of the interactome, composed of binary interactions and multisubunit complexes, is gradually being formalized and refined. Many binary interactions are transient and involved in information processing, such as transcriptional regulation or signal transduction (Brown and Jurisica, 2007; Mintseris and Weng, 2005). On the other hand, multisubunit complexes form sophisticated molecular machines that are composed of core, module, and attachment proteins (Gavin et al., 2006; Krogan et al., 2006). Usually, most of the members of a complex are tightly coregulated and form an immature complex that waits for the final components to be expressed at the right moment to complete the complex, a principle termed “just-in-time assembly” (de Lichtenberg et al., 2005). Protein complexes seem to form the highly conserved part of the PIN, whereas the binary transient interactions form an evolutionarily flexible coat around the conserved core. From the analyses that we have discussed so far, WGDs seem to have a strong effect on the part of the PIN that is composed of transient interactions. WGDs provide the raw material for rapid evolution and rewiring of interactions that are involved in information processing, such as phosphorylation. Nevertheless, the effect of SGDs on the evolution of PPIs should not be underestimated. Medium-scale studies on metazoan dimerizing networks (Amoutzias et al., 2004b, 2007b, 2008) together with large-scale studies on PPIs (Guan et al., 2007; Hakes et al., 2007; Pereira-Leal and Teichmann, 2005) show that paralogues from SGDs underwent more drastic changes in their PPIs than paralogues from WGDs. In addition, more protein complexes seem to have evolved by stepwise gene duplications than by WGDs (Pereira-Leal and Teichmann, 2005).
Is it possible that SGDs are involved in the rewiring of the conserved PIN core and thus are linked to major innovations in evolution, whereas WGDs provide the raw material for rapid rewiring of PINs and thus rapid adaptation and species radiation? In our opinion, it is not yet clear whether what we have observed so far is an effect of the time of duplication or is inherent to the mode of duplication, although current data point toward the latter possibility. Future work and more data on genomes and interactomes will undoubtedly shed further light on these fundamental questions.

ACKNOWLEDGMENT

Grigoris Amoutzias is supported by an EMBO long-term fellowship.

REFERENCES

AMOUTZIAS, G.D., PICHLER, E.E., MIAN, N., De GRAAF, D., IMSIRIDOU, A., ROBINSON-RECHAVI, M., BORNBERG-BAUER, E., ROBERTSON, D.L., and OLIVER, S.G., 2007a. A protein interaction atlas for the nuclear receptors: properties and quality of a hub-based dimerisation network. BMC Systems Biol. 1: 34. AMOUTZIAS, G.D., ROBERTSON, D.L., and BORNBERG-BAUER, E., 2004a. The evolution of protein interaction networks in regulatory proteins. Comp. Func. Genomics 5: 79–84. AMOUTZIAS, G.D., ROBERTSON, D.L., OLIVER, S.G., and BORNBERG-BAUER, E., 2004b. Convergent evolution of gene networks by single-gene duplications in higher eukaryotes. EMBO Rep. 5: 274–279. AMOUTZIAS, G.D., ROBERTSON, D.L., VAN DE PEER, Y., and OLIVER, S.G., 2008. Choose your partners: dimerization in eukaryotic transcription factors. Trends Biochem. Sci. 33: 220–229. AMOUTZIAS, G.D., VERON, A.S., WEINER, J., 3RD, ROBINSON-RECHAVI, M., BORNBERG-BAUER, E., OLIVER, S.G., and ROBERTSON, D.L., 2007b. One billion years of bZIP transcription factor evolution: conservation and change in dimerization and DNA-binding site specificity. Mol. Biol. Evol. 24: 827–835. AURY, J.M., JAILLON, O., DURET, L., NOEL, B., and JUBIN, C. et al., 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444: 171–178. BABU, M.M., LUSCOMBE, N.M., ARAVIND, L., GERSTEIN, M., and TEICHMANN, S.A., 2004. Structure and evolution of transcriptional regulatory networks. Curr. Opin. Struct. Biol. 14: 283–291. BARABASI, A.L. and OLTVAI, Z.N., 2004. Network biology: understanding the cell’s functional organization. Nat. Rev. Genet. 5: 101–113. BIRCHLER, J.A. and VEITIA, R.A., 2007. The gene balance hypothesis: from classical genetics to modern genomics. Plant Cell 19: 395–402. BLANC, G. and WOLFE, K.H., 2004. Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell 16: 1667–1678.
BLOMME, T., VANDEPOELE, K., De BODT, S., SIMILLION, C., MAERE, S., and VAN DE PEER, Y., 2006. The gain and loss of genes during 600 million years of vertebrate evolution. Genome Biol. 7: R43. BORNBERG-BAUER, E., BEAUSSART, F., KUMMERFELD, S.K., TEICHMANN, S.A., and WEINER, J., 3RD. 2005. The evolution of domain arrangements in proteins and interaction networks. Cell. Mol. Life Sci. 62: 435–445. BROWN, K.R. and JURISICA, I., 2007. Unequal evolutionary conservation of human protein interactions in interologous networks. Genome Biol. 8: R95. BYRNE, K.P. and WOLFE, K.H., 2005. The Yeast Gene Order Browser: combining curated homology and syntenic context reveals gene fate in polyploid species. Genome Res. 15: 1456–1461.

CASNEUF, T., De BODT, S., RAES, J., MAERE, S., and VAN DE PEER, Y., 2006. Nonrandom divergence of gene expression following gene and genome duplications in the ﬂowering plant Arabidopsis thaliana. Genome Biol. 7: R13. CHAPMAN, B.A., BOWERS, J.E., FELTUS, F.A., and PATERSON, A. H., 2006. Buffering of crucial functions by paleologous duplicated genes may contribute cyclicality to angiosperm genome duplication. Proc. Natl. Acad. Sci. USA 103: 2730–2735. CHI, A., HUTTENHOWER, C., GEER, L.Y., COON, J.J., SYKA, J.E., BAI, D.L., SHABANOWITZ, J., BURKE, D.J., TROYANSKAYA, O. G., and HUNT, D.F., 2007. Analysis of phosphorylation sites on proteins from Saccharomyces cerevisiae by electron transfer dissociation (ETD) mass spectrometry. Proc. Natl. Acad. Sci. USA 104: 2193–2198. CUI, L., WALL, P.K., LEEBENS-MACK, J.H., LINDSAY, B.G., SOLTIS, D.E., DOYLE, J.J., SOLTIS, P.S., CARLSON, J.E., ARUMUGANATHAN, K., BARAKAT, A., ALBERT, V.A., MA, H., and DEPAMPHILIS, C.W., 2006. Widespread genome duplications throughout the history of ﬂowering plants. Genome Res. 16: 738–749. DAVIDSON, E.H., 2006. The regulatory genome, Academic Press. DAVIS, J.C. and PETROV, D.A., 2005. Do disparate mechanisms of duplication add similar genes to the genome? Trends Genet. 21: 548–551. DE BODT, S., MAERE, S., and VAN DE PEER, Y., 2005. Genome duplication and the origin of angiosperms. Trends Ecol. Evol. 20: 591–597. DE FOLTER, S., IMMINK, R.G., KIEFFER, M., PARENICOVA, L., HENZ, S.R., WEIGEL, D., BUSSCHER, M., KOOIKER, M., COLOMBO, L., KATER, M.M., DAVIES, B., and ANGENENT, G.C., 2005. Comprehensive interaction map of the Arabidopsis MADS box transcription factors. Plant Cell 17: 1424–1433. DE LICHTENBERG, U., JENSEN, L.J., BRUNAK, S., and BORK, P., 2005. Dynamic complex formation during the yeast cell cycle. Science 307: 724–727. DEHAL, P. and BOORE, J.L., 2005. Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biol. 3: e314. 
EHLERT, A., WELTMEIER, F., WANG, X., MAYER, C.S., SMEEKENS, S., VICENTE-CARBAJOSA, J., and DROGE-LASER, W., 2006. Two-hybrid protein–protein interaction analysis in Arabidopsis protoplasts: establishment of a heterodimerization map of group C and group S bZIP transcription factors. Plant J. 46: 890–900. EVLAMPIEV, K. and ISAMBERT, H., 2008. Conservation and topology of protein interaction networks under duplication-divergence evolution. Proc. Natl. Acad. Sci. USA 105: 9863–9868. FONG, J.H., KEATING, A.E., and SINGH, M., 2004. Predicting specificity in bZIP coiled-coil protein interactions. Genome Biol. 5: R11. FREELING, M. and THOMAS, B.C., 2006. Gene-balanced duplications, like tetraploidy, provide predictable drive to increase morphological complexity. Genome Res. 16: 805–814. GAVIN, A.C., ALOY, P., GRANDI, P., KRAUSE, R., and BOESCHE, M. et al., 2006. Proteome survey reveals modularity of the yeast cell machinery. Nature 440: 631–636. GNAD, F., REN, S., COX, J., OLSEN, J.V., MACEK, B., OROSHI, M., and MANN, M., 2007. PHOSIDA (phosphorylation site database): management, structural and evolutionary investigation, and prediction of phosphosites. Genome Biol. 8: R250. GRANDORI, C., COWLEY, S.M., JAMES, L.P., and EISENMAN, R.N., 2000. The Myc/Max/Mad network and the transcriptional control of cell behavior. Annu. Rev. Cell Dev. Biol. 16: 653–699. GRIGORYAN, G. and KEATING, A.E., 2006. Structure-based prediction of bZIP partnering specificity. J. Mol. Biol. 355: 1125–1142. GUAN, Y., DUNHAM, M.J., and TROYANSKAYA, O.G., 2007. Functional analysis of gene duplications in Saccharomyces cerevisiae. Genetics 175: 933–943. HAKES, L., PINNEY, J.W., LOVELL, S.C., OLIVER, S.G., and ROBERTSON, D.L., 2007. All duplicates are not equal: the difference between small-scale and genome duplication. Genome Biol. 8: R209. HART, G.T., RAMANI, A.K., and MARCOTTE, E.M., 2006. How complete are current yeast and human protein-interaction networks? Genome Biol. 7: 120. HEAZLEWOOD, J.L., DUREK, P., HUMMEL, J., SELBIG, J., WECKWERTH, W., WALTHER, D., and SCHULZE, W.X., 2008. PhosPhAt: a database of phosphorylation sites in Arabidopsis thaliana and a plant-specific phosphorylation site predictor. Nucleic Acids Res. 36: D1015–1021. IAKOUCHEVA, L.M., RADIVOJAC, P., BROWN, C.J., O’CONNOR, T.R., SIKES, J.G., OBRADOVIC, Z., and DUNKER, A.K., 2004. The importance of intrinsic disorder for protein phosphorylation. Nucleic Acids Res. 32: 1037–1049. ISPOLATOV, I., YURYEV, A., MAZO, I., and MASLOV, S., 2005. Binding properties and evolution of homodimers in protein–protein interaction networks. Nucleic Acids Res. 33: 3629–3635. JAILLON, O., AURY, J.M., BRUNET, F., PETIT, J.L., and STANGE-THOMANN, N. et al., 2004. Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature 431: 946–957. JOHANNESSON, H., WANG, Y., and ENGSTROM, P., 2001. DNA-binding and dimerization preferences of Arabidopsis homeodomain-leucine zipper transcription factors in vitro. Plant Mol. Biol. 45: 63–73. KELLIS, M., BIRREN, B.W., and LANDER, E.S., 2004. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 428: 617–624. KLEMM, J.D., SCHREIBER, S.L., and CRABTREE, G.R., 1998. Dimerization as a regulatory mechanism in signal transduction. Annu. Rev. Immunol. 16: 569–592. KONDRASHOV, F.A. and KOONIN, E.V., 2004. A common framework for understanding the origin of genetic dominance and evolutionary fates of gene duplications. Trends Genet. 20: 287–290.

KONDRASHOV, F.A., ROGOZIN, I.B., WOLF, Y.I., and KOONIN, E. V., 2002. Selection in the evolution of gene duplications. Genome Biol. 3: RESEARCH0008. KROGAN, N.J., CAGNEY, G., YU, H., ZHONG, G., and GUO, X. et al., 2006. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440: 637–643. KUMMERFELD, S.K. and TEICHMANN, S.A., 2006. DBD: a transcription factor prediction database. Nucleic Acids Res. 34: D74–81. LALONDE, S., EHRHARDT, D.W., LOQUE, D., CHEN, J., RHEE, S.Y., and FROMMER, W.B., 2008. Molecular and cellular approaches for the detection of protein–protein interactions: latest techniques and current limitations. Plant J. 53: 610–635. LIANG, H., PLAZONIC, K.R., CHEN, J., LI, W.H., and FERNANDEZ, A., 2008. Protein under-wrapping causes dosage sensitivity and decreases gene duplicability. PLoS Genet. 4: e11. LIU, J., PERUMAL, N.B., OLDFIELD, C.J., SU, E.W., UVERSKY, V.N., and DUNKER, A.K., 2006. Intrinsic disorder in transcription factors. Biochemistry 45: 6873–6888. LOBLEY, A., SWINDELLS, M.B., ORENGO, C.A., and JONES, D.T., 2007. Inferring function using patterns of native disorder in proteins. PLoS Comput Biol. 3: e162. LUSCHER, B., 2001. Function and regulation of the transcription factors of the Myc/Max/Mad network. Gene 277: 1–14. LYNCH, M. and CONERY, J.S., 2000. The evolutionary fate and consequences of duplicate genes. Science 290: 1151–1155. LYNCH, M. and KATJU, V., 2004. The altered evolutionary trajectories of gene duplicates. Trends Genet. 20: 544–549. MAERE, S., De BODT, S., RAES, J., CASNEUF, T., Van MONTAGU, M., KUIPER, M., and VAN DE PEER, Y., 2005. Modeling gene and genome duplications in eukaryotes. Proc. Natl. Acad. Sci. USA 102: 5454–5459. MARIANAYAGAM, N.J., SUNDE, M., and MATTHEWS, J.M., 2004. The power of two: protein dimerization in biology. Trends Biochem Sci. 29: 618–625. MINTSERIS, J. and WENG, Z., 2005. Structure, function, and evolution of transient and obligate protein–protein interactions. Proc. 
Natl. Acad. Sci. USA 102: 10930–10935. MIYATA, T. and SUGA, H., 2001. Divergence pattern of animal gene families and relationship with the Cambrian explosion. Bioessays 23: 1018–1027. MUSSO, G., ZHANG, Z., and EMILI, A., 2007. Retention of protein complex membership by ancient duplicated gene products in budding yeast. Trends Genet. 23: 266–269. NEWMAN, J.R. and KEATING, A.E., 2003. Comprehensive identification of human bZIP interactions with coiled-coil arrays. Science 300: 2097–2101. NOOREN, I.M. and THORNTON, J.M., 2003. Diversity of protein–protein interactions. EMBO J. 22: 3486–3492. NUHSE, T.S., STENSBALLE, A., JENSEN, O.N., and PECK, S.C., 2004. Phosphoproteomics of the Arabidopsis plasma membrane and a new phosphorylation site database. Plant Cell 16: 2394–2405. OHNO, S., 1970. Evolution by gene duplication, Springer, Berlin.

PANOPOULOU, G. and POUSTKA, A.J., 2005. Timing and mechanism of ancient vertebrate genome duplications: the adventure of a hypothesis. Trends Genet. 21: 559–567. PAPP, B., PAL, C., and HURST, L.D., 2003. Dosage sensitivity and the evolution of gene families in yeast. Nature 424: 194–197. PASTOR-SATORRAS, R., SMITH, E., and SOLE, R.V., 2003. Evolving protein interaction networks through gene duplication. J. Theor. Biol. 222: 199–210. PECK, S.C., 2006. Phosphoproteomics in Arabidopsis: moving from empirical to predictive science. J. Exp. Bot. 57: 1523–1527. PEREIRA-LEAL, J.B., LEVY, E.D., KAMP, C., and TEICHMANN, S.A., 2007. Evolution of protein complexes by duplication of homomeric interactions. Genome Biol. 8: R51. PEREIRA-LEAL, J.B. and TEICHMANN, S.A., 2005. Novel specificities emerge by stepwise duplication of functional modules. Genome Res. 15: 552–559. PRESSER, A., ELOWITZ, M.B., KELLIS, M., and KISHONY, R., 2008. The evolutionary dynamics of the Saccharomyces cerevisiae protein interaction network after duplication. Proc. Natl. Acad. Sci. USA 105: 950–954. PTACEK, J., DEVGAN, G., MICHAUD, G., ZHU, H., and ZHU, X. et al., 2005. Global analysis of protein phosphorylation in yeast. Nature 438: 679–684. RANEA, J.A., GRANT, A., THORNTON, J.M., and ORENGO, C.A., 2005. Microeconomic principles explain an optimal genome size in bacteria. Trends Genet. 21: 21–25. SCANNELL, D.R., BYRNE, K.P., GORDON, J.L., WONG, S., and WOLFE, K.H., 2006. Multiple rounds of speciation associated with reciprocal gene loss in polyploid yeasts. Nature 440: 341–345. SCHLUETER, J.A., DIXON, P., GRANGER, C., GRANT, D., CLARK, L., DOYLE, J.J., and SHOEMAKER, R.C., 2004. Mining EST databases to resolve evolutionary events in major crop species. Genome 47: 868–876. SEMON, M. and WOLFE, K.H., 2007. Consequences of genome duplication. Curr. Opin. Genet. Dev. 17: 505–512. SPRING, J., 2003.
Major transitions in evolution by genome fusions: from prokaryotes to eukaryotes, metazoans, bilaterians and vertebrates. J. Struct. Funct. Genomics 3: 19–25. SPRINZAK, E., ALTUVIA, Y., and MARGALIT, H., 2006. Characterization and prediction of protein–protein interactions within and between complexes. Proc. Natl. Acad. Sci. USA 103: 14718–14723. STERCK, L., ROMBAUTS, S., VANDEPOELE, K., ROUZE, P., and VAN DE PEER, Y., 2007. How many genes are there in plants (... and why are they there)? Curr. Opin. Plant Biol. 10: 199–203. STUMPF, M.P., THORNE, T., de SILVA, E., STEWART, R., AN, H.J., LAPPE, M., and WIUF, C., 2008. Estimating the size of the human interactome. Proc. Natl. Acad. Sci. USA 105: 6959–6964. TAYLOR, J.S., BRAASCH, I., FRICKEY, T., MEYER, A., and VAN DE PEER, Y., 2003. Genome duplication, a trait shared by 22000 species of ray-finned fish. Genome Res. 13: 382–390. TOMPA, P. and FUXREITER, M., 2008. Fuzzy complexes: polymorphism and structural disorder in protein–protein interactions. Trends Biochem. Sci. 33: 2–8. VAN DE PEER, Y., 2004. Computational approaches to unveiling ancient genome duplications. Nat. Rev. Genet. 5: 752–763. van NIMWEGEN, E., 2003. Scaling laws in the functional content of genomes. Trends Genet. 19: 479–484. VANDEPOELE, K., De VOS, W., TAYLOR, J.S., MEYER, A., and VAN DE PEER, Y., 2004. Major events in the genome evolution of vertebrates: paranome age and size differ considerably between ray-finned fishes and land vertebrates. Proc. Natl. Acad. Sci. USA 101: 1638–1643. VEITIA, R.A., 2005. Paralogs in polyploids: one for all and all for one? Plant Cell 17: 4–11. VEITIA, R.A., BOTTANI, S., and BIRCHLER, J.A., 2008. Cellular reactions to gene dosage imbalance: genomic, transcriptomic and proteomic effects. Trends Genet. 24: 390–397. VERON, A.S., KAUFMANN, K., and BORNBERG-BAUER, E., 2007. Evidence of interaction network evolution by whole-genome duplications: a case study in MADS-box proteins. Mol. Biol. Evol. 24: 670–678. VINSON, C., ACHARYA, A., and TAPAROWSKY, E.J., 2006. Deciphering B-ZIP transcription factor interactions in vitro and in vivo. Biochim. Biophys. Acta 1759: 4–12. von MERING, C., KRAUSE, R., SNEL, B., CORNELL, M., OLIVER, S.G., FIELDS, S., and BORK, P., 2002. Comparative assessment of large-scale data sets of protein–protein interactions. Nature 417: 399–403. WAGNER, A., 1994. Evolution of gene networks by gene duplications: a mathematical model and its implications on genome organization. Proc. Natl. Acad. Sci. USA 91: 4387–4391. WAGNER, A., 2002. Asymmetric functional divergence of duplicate genes in yeast. Mol. Biol. Evol. 19: 1760–1768. WILKINS, M.R.
and KUMMERFELD, S.K., 2008. Sticking together? Falling apart? Exploring the dynamics of the interactome. Trends Biochem. Sci. 33: 195–200. WOLFE, K.H., 2001. Yesterday’s polyploids and the mystery of diploidization. Nat. Rev. Genet. 2: 333–341. WOLFE, K.H. and SHIELDS, D.C., 1997. Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387: 708–713. XIA, K., FU, Z., HOU, L., and HAN, J.D., 2008. Impacts of protein–protein interaction domains on organism and network complexity. Genome Res. 19: 1500–1508.

Chapter 20

Modularity and Dissipation in Evolution of Macromolecular Structures, Functions, and Networks

Gustavo Caetano-Anolles, Liudmila Yafremava, and Jay E. Mittenthal

20.1 INTRODUCTION
20.2 BIOLOGICAL STRUCTURE AS AN EMERGENT PROPERTY OF DISSIPATIVE SYSTEMS
20.3 INFORMATION AND ITS DISSIPATION
20.4 TIME, THERMODYNAMIC IRREVERSIBILITY, AND GROWTH OF ORDER IN THE UNIVERSE
20.5 INFORMATION DISSIPATION AND MODULARITY PERVADE STRUCTURE IN BIOLOGY
20.6 MODULARITY AND DISSIPATION IN PROTEIN EVOLUTION
20.7 CONCLUSIONS
ACKNOWLEDGMENTS
REFERENCES

Evolutionary Genomics and Systems Biology, edited by Gustavo Caetano-Anolles. Copyright 2010 John Wiley & Sons, Inc.

20.1 INTRODUCTION

Living cells harbor a multitude of molecules that interact with themselves and each other in an integrative manner. These interactions delimit, for example, fundamental metabolic and signaling processes. In metabolism, biochemical and transport reactions synthesize and break down small molecules. In signaling, molecules interconvert signals or stimuli needed for cellular function or interactions with the environment. This complex fabric of interconnected component parts (networks) is responsible for the many biological functions that embed fundamental processes, including cellular maintenance, reproduction, and survival in complex environments. These networks are ultimately responsible for structuring life. Evolution today shapes the function of molecules, as protein and RNA impact the success of organismal lineages. Generally, this evolutionary shaping is a consequence of natural selection operating at high levels of structural organization on evolving organismal populations (Lynch, 2007). However, other important evolutionary mechanisms also play roles, including self-organization, the increase in internal organization of a system that materializes without external control (Hoelzer et al., 2006). Since information in life is ultimately structural, molecular structure embeds all that is needed to fulfill molecular functional roles. This includes the ability to interact with other molecules (e.g., to receive or communicate regulatory or transduction signals) or to associate into multimolecular complexes responsible for complex cellular functions (ranging from multienzyme complexes to cellular machinery). These abilities are only possible through constraints, historical imprintings that result from increasing molecular, cellular, and higher-level structure in biological systems and the emergence of many layers of biological organization.

In this chapter, we investigate the connection between dissipation (the loss, dispersion, or diffusion of energy, matter, or information in a natural system) and modularity (the emergence of highly integrated components of the system, called modules). Modules cooperate to perform a task and are sometimes repeatedly used in different contexts. We follow Layzer (1970, 1975, 1988) and his views on information and growth of order in the universe. We also assume dissipation tendencies in energy and matter that exist in an open cosmological model of the Friedmann type (Dyson, 1979; Frautschi, 1982; Davies, 1983). Under this model, the universe expands faster than its contents can equilibrate, turning the nearly homogeneous hot gas at the beginning of the big bang into clumps of energy-dissipating matter (galaxies, stars, planets, complex chemistries, monomers, etc.) that are out of equilibrium in the present. As the universe expands, these clumps acquire more and more elaborate and finer-grained properties, and this emerging structure ultimately materializes in life.
According to this inflationary model of cosmic structure, currently our best, the universe took approximately 14 billion years to make, with life emerging approximately 4 billion years ago on our planet. Consequently, Layzer’s views and those of many that followed (e.g., Brooks, Wiley, Collier, Wicken, Schneider, Chaisson; see Weber et al., 1988) have important implications for biological evolution and are here analyzed in relation to the new concept of dissipation of information. In our analysis, we do not explore the initial condition where dissipation tendencies were minimal, as these have more to do with the origins of time and the universe/multiverse itself (Carroll, 2008). Instead, we are interested in the generation of order and complexity in our world. In the exploration of dissipation and modularity, we focus on a repertoire of molecules with remarkable properties, the proteins that make up living systems. The emergence of this repertoire in evolution enables cellular life and the many levels of organization that exist in organisms. A survey of this repertoire is currently possible, thanks to high-throughput sequencing and comparative genomics, increasing knowledge of protein structure, and advances in network and systems biology. Using phylogenetic principles and knowledge from protein sequence and structure, we here provide evidence supporting a tendency toward order and dissipation in the evolution of living systems.

20.2 BIOLOGICAL STRUCTURE AS AN EMERGENT PROPERTY OF DISSIPATIVE SYSTEMS

Collier (2003), cautioning that any definition of complex entities can be impredicative (i.e., self-referencing from a logical point of view), defines a dynamical system as a natural object delimited by a set of interacting components that is characterized and individuated from other systems by its cohesion. Cohesion refers here to dynamical stabilities that arise from constraints within the system when these are imposed on its components. Being natural objects, systems have properties that must be discovered and induce models that must be tested. By definition, systems can be decomposed into components, which are dynamically stable subsystems that can be dissected from the whole. Components can be spatially bounded (e.g., amino acids in protein molecules) but can also be spatially unlinked, embodying processes (e.g., Collier’s lift and propulsion in flight example, important for bird flocking behavior) or other entities that are not physical in nature (e.g., wavelengths of a waveform). Cohesion allows the integrated operation of the dynamical system, the system’s mechanics, through the organization of forces and flows, which defines how these are interrelated (Collier, 2003). These forces and flows establish the energetic, kinetic, or organizational differences that exist in the system.

System mechanisms explain how a system works, providing information on the operation and interaction of components. Such an account is useful for understanding the significance of components and interactions, because it predicts the consequences of altering or deleting them. Verbal accounts can be useful and intuitive, but a mathematical description allows quantitative comparison with observation. Dynamical systems can often be described at a macroscopic level with sets of time-dependent equations, in which state variables characterize the state of components, and parameters characterize the relative contribution of components and interactions to changes in the system’s state. This dynamical interplay causes systems to be organized entities. Organization is expressed here as structure, complexity endowed in levels by the cohesion of the system. Components in an organized system are distinct and at the same time interrelated, contributing to the unity of an integrated, coordinated, and cohesive behavior through local and global interactions.
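The macroscopic description with state variables and parameters mentioned above can be illustrated with a minimal sketch. The two-component system below is purely hypothetical and is not taken from the chapter: component x is produced at rate a and lost at rate b, and component y is driven by x with coupling c and lost at rate d. The function name, parameter values, and forward-Euler integration are all assumptions chosen for brevity.

```python
def simulate(a=2.0, b=1.0, c=1.0, d=0.5, dt=0.01, steps=5000):
    """Forward-Euler integration of a hypothetical two-component system.

    State variables: x, y (states of the two components).
    Parameters: a, b, c, d (contributions of production, loss, and coupling).
        dx/dt = a - b*x
        dy/dt = c*x - d*y
    """
    x, y = 0.0, 0.0
    for _ in range(steps):
        dx = a - b * x          # production and loss of x
        dy = c * x - d * y      # y is driven by x and decays
        x, y = x + dt * dx, y + dt * dy
    return x, y


# Analytical steady state for comparison: x* = a/b, y* = c*a/(b*d)
x_final, y_final = simulate()
```

Changing or deleting an interaction (for example, setting c to zero) changes the predicted steady state, which is the sense in which such a description predicts the consequences of altering components.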
The distinctness and interrelationship of components generates ordering relationships in the system, delimiting consecutive layers of organization (levels and hierarchies) and expressing redundancy (Figure 20.1a). Redundancy is the repeated occurrence of components of a system in quantities greater than necessary for its activity and for fulfilling the system’s constraints (e.g., biological function). Redundancy is always defined with respect to the performance of an activity. If the activity of interest changes, the definition of redundancy changes accordingly. When considering genes, gene redundancy is expressed in evolution when two or more genes exert the same or overlapping functions (Nowak et al., 1997).

Figure 20.1 Phase space of structural states of the system. (a) The phase space becomes structured in evolution and can be coarse-grained into levels. These levels are generally hierarchical and express intralevel redundancy (reuse of components with the same state of structure within a level) or interlevel redundancy (mapping relationships between states in different levels), which are depicted with lines connecting states (aij). Interlevel redundancy refers to different lower level states that manifest the same higher level state (e.g., different protein sequences expressing the same folded structure). Alternatively, modularity, a special kind of interlevel redundancy, refers to different higher level states including components with the same lower level state, which now become a “module” (e.g., a protein domain combining with others to form multidomain proteins). Note that interlevel relationships need not occur between adjacent levels. (b) A wander set of two sequences diffusing in multidimensional sequence space, visualized in a two-dimensional wire frame describing single-point mutations. Highlighted are adaptive walks of mutational steps around an attractor that minimizes free energy in the folded molecules.

Redundancy is a desirable characteristic of systems. For example, the repeated use of biological components ensures evolutionary novelties are adequately preserved for the future. Redundancy is also the normal outcome of evolutionary processes and events, and its concept is tightly linked to the concept of information, which can be defined precisely in mathematical terms, borrowing from information theory (Shannon and Weaver, 1949). While redundancy can be expressed at a given level of complexity, the redundant component can be easily degraded (e.g., by mutation) as the system diversifies. In contrast, when redundancy is expressed between levels, the cohesion of the system is enhanced as higher levels constrain change at lower levels. One special kind of redundancy, modularity, involves the generation of discrete entities (modules), sets of integrated components that cooperate to perform a task and interact more extensively with each other than with components outside the modular set (Hartwell et al., 1999). Modules are particularly important because they enhance the component makeup of the system. When components become parts of modules they constrain change at lower levels and at the same time enhance the diversity of the component repertoire at higher levels, generating interlevel redundant relationships. Examples of modules at the molecular level include monomers in polymers, secondary structures in macromolecules, domains in folded macromolecules, and units and subunits in molecular complexes.
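The criterion that components of a module interact more extensively with each other than with components outside it can be made concrete with a small sketch. The interaction network below is a hypothetical toy example (its node names and edges are assumptions, not data from the chapter), and counting internal versus boundary interactions is only one simple way to quantify the idea.

```python
def module_cohesion(edges, module):
    """Count interactions inside a candidate module and those crossing its boundary."""
    module = set(module)
    internal = sum(1 for a, b in edges if a in module and b in module)
    boundary = sum(1 for a, b in edges if (a in module) != (b in module))
    return internal, boundary


# Hypothetical toy network: two densely connected groups joined by a single link.
EDGES = [
    ("A", "B"), ("A", "C"), ("B", "C"),   # first group
    ("C", "D"),                           # single link between groups
    ("D", "E"), ("D", "F"), ("E", "F"),   # second group
]

good_module = module_cohesion(EDGES, {"A", "B", "C"})   # mostly internal interactions
poor_module = module_cohesion(EDGES, {"C", "D"})        # mostly boundary interactions
```

In this toy case the set {A, B, C} behaves like a module in the sense used above, whereas {C, D} does not, because most of its interactions cross its boundary.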
The emergence of modules generally involves their combination in processes that enhance the diversity of the system as a whole. We, as many others, contend that the ultimate explanation of structure and biological organization comes from thermodynamics, especially when phrased within a cosmological framework. Thermodynamics explains energy transformation. Energy available to do work (e.g., free energy) transforms into energy that is unavailable to do work. This generates energy flows from a source at higher free energy to a sink at lower free energy. Biological systems exist under these gradient conditions, far from equilibrium. They do so by dissipating energy, by doing work, and by increasing entropy (S; meaning ‘transformation’). Entropy is a physical macroscopic state variable generated in all natural processes. No process in nature will occur spontaneously without some energetic cost S. The second law of thermodynamics posits that increasing S occurs during irreversible transformations that increase randomness in an isolated system. Boltzmann in the 1870s and then Gibbs linked thermodynamic properties of macroscopic bodies to the dynamical behavior of their components, the molecules, in a statistical interpretation of irreversible physical processes. This elaboration made S a measurement of molecular disorder of the macroscopic system and a description of the number of its distinct microstates (precise descriptors of each component of a system). However, one important limitation of thermodynamic laws of energy exchanges is that they are fulfilled only close to equilibrium. Since many natural processes on Earth occur in open systems not in local equilibrium as energy flows are dissipated from the sun, a theoretical framework beyond classical thermodynamics is needed.
This framework should reconcile the second law with the development of order observed copiously in nature and described in cosmological models, under conditions that are far from equilibrium. Quite promising in this regard are extensions of thermodynamics to situations where some microscopic state variables, such as molecule velocities, do not reach local equilibrium (Vilar and Rubi, 2001; Rubi, 2008). These variables are taken as coordinates with the status of spatial coordinates, and the system is at local equilibrium with respect to the new sets of coordinates.


While entropy increases, work is ultimately responsible for biological organization, and as structure unfolds in evolution, it increasingly provides more efficient energy dissipation. Biological structure acts as an engine that extracts, concentrates, and stores free energy by acting on an energy flow and dissipating the energy gradient (Cottrell, 1979). The free energy that is extracted can then be used to build more engines and to contribute to the emergence of biological organization. These systems are known in physics as self-organizing or emergent dissipative systems (EDS). Dissipation fulfills the maximum entropy production principle (MEPP) advanced by Ziegler in nonequilibrium systems, which is a logical generalization of the second law of thermodynamics (Martyushev and Seleznev, 2006). From an evolutionary point of view, MEPP formalizes Lotka’s proposal that evolution proceeds in a way that maximizes energy flux through the system and at the same time is compatible with a system’s constraints (Lotka, 1922). Using the fluctuation theorem, the constrained maximization of information entropy of Jaynes (1957) was recently used to establish MEPP as the physical selection principle for the most probable flux configuration of a system, clarifying the links between the proposals of Ziegler, Onsager, and Prigogine (Dewar, 2005). We have been considering biological evolution. Now we broaden evolution’s meaning by considering it in a cosmological sense. Several central questions can be posed. What are the fundamental evolutionary drivers of EDS organization and complexity? How does dissipation of energy and matter contribute to EDS formation? We contend the answer lies in information.

20.3 INFORMATION AND ITS DISSIPATION

In its broadest possible definition, information arises from symmetry breaking through processes that make distinct (distinguish) the components of systems. It includes possible structure in the system and the distinction and interconnection of the system’s components, and can be equated to the system’s complexity (Wicken, 1988). Examples include possible events of nucleosynthesis forming atoms, possible atoms making different molecules, different chemistries defining possible metabolic reactions, different nucleotide or amino acid monomers defining possible macromolecular machinery of life, cells in organisms differentiating into possible cellular types in tissues and organs, and so on. This broad definition implies a phase space of distinctions in which all regions of the space can be instantiated. However, not all information that is possible (potential information) actually materializes in evolution, so the “it from bit” notion of John Archibald Wheeler in which all bits (distinctions) count can be restricted to mean only those bits that materialize in systems (including those that can do work). Examples include atoms that are stable, actual metabolic networks, actual sequences and repertoires of nonprotein-coding RNA and proteins, viable cellular types, etc. All of these items of information can be found (materialize or are lawful) in our world and are linked to the state of a physical system. This bound information embodies information entropy (Shannon and Weaver, 1949) and can be instantiated (potential information) (Gatlin, 1972). Examples include information in nucleic acid or protein strings, alternative energetic configurations in folded macromolecules, and the genetic code.
There is also information that is not expressed and is not easily accessible (e.g., information in recessive alleles, or information in neutral networks that is revealed when mapping, for example, the space of sequences to the space of structures; see Chapter 7). Finally, the definition of information can be restricted even further by considering meaning in semantic information. Because biological function relationships are both hierarchical and sometimes many-to-many, as exemplified by the semantics of the GO (Gene Ontology Consortium, 2000), constraints on systems are complex. Clearly, further restrictions in the definition of information are in line with higher and higher levels of organization and new emergent properties (e.g., purpose) that link biology to teleological interpretation in cognitive biological systems.

These views of information are holistic and can be misleading when extended to life. This is because evolution manifests only as time unfolds. For example, a biological organism lives and dies and by itself cannot evolve, very much as its molecular components, which readily transform once the organism disappears. Instead, biological evolution manifests as a historical succession of systems that are causally related by transmission of information as the world unfolds in time. This transmitted information (the historical arrow of time) includes, for example, genetic information, learned behaviors, and accumulated experience, and provides continuity, maintaining cohesion and at the same time allowing change at the system level. It is interesting to note two kinds of transmitted information, analog and digital. Analog signals are continuous time- or space-varying features. In turn, digital signals are discrete and embody a finite number of alternative states (e.g., conformations or levels). It is also remarkable that the historical succession of systems in biology requires discrete units (organisms) that behave as digital transducers. The evolution of these units embodies descent (information maintenance) and modification (information diversity), two Darwinian properties that are embedded in the microscopic components of the system and are the subject of irreversible processes of change, which by necessity follow Dollo’s law (Brooks et al., 1988). When examining transmission of information as a historical succession of systems, diffusion becomes central to evolution.
The states of the system’s components explore the phase space of possibilities (potential information) given constraints (cohesion and levels of organization) as change unfolds, instantiating actual information from potential information into states. Phase space is a space in which all possible states of a system are represented and can materialize given enough time to reach equilibrium (the ergodic hypothesis). However, in a universe that expands faster than its contents can equilibrate, it also defines a dissipative system in which clouds of points representing the actual states of the system “wander” in diffusive walks (Figure 20.1b). These materialized states are generally never visited again as evolution proceeds and the space is explored in far from equilibrium conditions. This is especially true for large phase spaces and relatively long times. In fact, ergodic theory uses these wandering sets to provide a precise mathematical definition of the concept of dissipation (Nichols, 1989). Interestingly, while wandering sets diffuse in the space of structural states they can become adaptive and populate attractors, portions of the space favored by the system’s constraints. For example, the mapping of protein sequences into folded three-dimensional (3D) structures in a space defined by single-mutational steps delimits neutral networks (nets) in sequence space, ensembles of sequences with native fold structures that are impervious to mutational change. While protein sequences drift in space by mutation in search of thermodynamic and kinetic folding optimality (and ultimately optimal function), neutral nets tailor trajectories and favor discovery of new attractors, concentrating structures in dense clusters (Taverna and Goldstein, 2002; Wroe et al., 2005).
Interestingly, and despite the clumping, the distribution of sequences folding into structures is approximately random at global levels and all structures can materialize (are accessible) within a few mutations (Babajide et al., 2001). This property is especially true for RNA molecules and has been confirmed experimentally using molecular functional switches that were engineered in vitro (Schultes and Bartel, 2000). Remarkably, extensive neutral nets enable the efficient exploration of phase space, and space covering ensures a constant rate of discovery of possible mapping relationships between different sets of states, such as the sequence–structure mappings described above. Dissipation of information is therefore a tendency of states to diversify in a space of possibilities, with wander sets materializing states in each generation through diffusion in the state space. The space of proteins and that of other molecular repertoires is clearly dissipative both from a thermodynamic and from an informational point of view. Dissipation of information intertwines with dissipation of energy in proteins as molecular ensembles optimize their ability to do work and dissipate energy. It is a property of matter that also complies with the second law. We propose that the repeated use of components in systems and the mapping relationships (exemplified in Figure 20.1a) are properties that facilitate diffusion of energy and information and generate order. Every new level of complexity added to systems increases the phase space and the ability to establish new mapping relationships between components. It increases the dissipation potential of the system, as the system increases the number of possible states. At the same time, mapping relationships facilitate the exploration of the different states by wander sets in search of optimality of energy and organization. Fundamental questions that immediately arise are: What are optimality criteria in this diffusive space? How is this diffusive space linked to thermodynamics?
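The claim that wandering states are seldom revisited when the phase space is large can be explored with a minimal simulation. The sketch below is an illustrative assumption, not a model from the chapter: it performs an unconstrained random walk of single-point mutations (no selection, no folding constraints) and counts how many distinct sequences are visited. The function name, alphabet, and parameter values are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def wander(length, steps, alphabet=AMINO_ACIDS, seed=0):
    """Random walk of single-point mutations in sequence space.

    Returns the number of distinct sequences visited (including the start).
    """
    rng = random.Random(seed)
    seq = [rng.choice(alphabet) for _ in range(length)]
    visited = {tuple(seq)}
    for _ in range(steps):
        pos = rng.randrange(length)
        # Mutate to a residue different from the current one at this position.
        seq[pos] = rng.choice([a for a in alphabet if a != seq[pos]])
        visited.add(tuple(seq))
    return len(visited)


# Small space: 2 positions over a 2-letter alphabet gives only 4 possible states,
# so the walk must keep returning to states it has already visited.
small = wander(length=2, steps=1000, alphabet="AB")

# Large space: 50 positions over 20 residues; nearly every step reaches a new state.
large = wander(length=50, steps=1000)
```

In the small space the walk is effectively ergodic on the timescale of the simulation, whereas in the large space it behaves like the wander sets described above.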

20.4 TIME, THERMODYNAMIC IRREVERSIBILITY, AND GROWTH OF ORDER IN THE UNIVERSE

The evolution of information is intrinsically defined by time, which can be considered a phenomenological property of the entropic dissipative function (Layzer, 1970, 1975, 1988) but more generally as an emergent property of the space–time geometry of the universe (Castagnino et al., 2003). We take Layzer’s arguments that time and evolution are tightly linked to information and emergence of order in the universe (Layzer, 1970, 1975, 1988). His views transform a reversible microscopic description of a system to an irreversible macroscopic description, freeing entropy from its thermodynamic context. We will also use entropy in a broad sense, extending its thermodynamic meaning to information and broader cosmological interpretations. However, our interest is to apply these arguments to EDS within particular levels of biological organization relevant to the biological repertoires and cellular organization in network biology. Information is linked to structure and order. It should not be considered merely as negative entropy, but rather as the difference between total possible entropy (potential entropy) and realized entropy in the system (actual entropy) (Layzer, 1988). Potential entropy represents the maximum amount of entropy that is possible. Since the information in a biological macromolecule is intimately linked to its molecular structure, entropy summarizes both molecular energetics and dynamics (thermodynamic entropy) and information embedded in its sequence and structure (information entropy). For a protein sequence, for example, it represents the permutational space of amino acid monomers that could materialize in proteins of particular lengths (also known as sequence or genotype space) together with all energy conformations of the protein polymers in this space. Actual entropy is the instantiation of potential entropy or the realized diversity of the system.
For proteins, it would represent the actual sequence space that was explored in evolution and is compatible with constraints imposed by other levels of organization on the system (i.e., other spaces mapping onto the space of protein sequences) (see Chapter 7). This includes constraints imposed by folding energetics of the protein polymers as they acquire 3D structure and produce conformational ensembles.
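As a rough worked example of the sequence part of potential entropy, assume (hypothetically, and ignoring conformational contributions) that every sequence of length $L$ built from the 20 standard amino acids is equally likely. The number of possible sequences is then $20^{L}$, and the corresponding potential entropy in bits is

$$
S_{\max} = \log_2\!\left(20^{L}\right) = L \log_2 20 \approx 4.32\,L .
$$

For a protein of 100 residues this bound is about 432 bits. Constraints that restrict which sequences are actually explored keep the actual entropy below this value.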


Layzer’s proposal takes the form

$$
S + S_i = S_{\max} \tag{20.1}
$$

with S representing the actual entropy of the system, a measure of instantiated diversity or disorder and a macroscopic descriptor of the system, and Si representing macroscopic information entropy, a measure of unity or order (imposed by constraints) of microscopic descriptors of the system. Smax is the maximum value of S that is possible consistent with the macroscopic constraints. It represents potential entropy, which is also potential information. In thermodynamic contexts Smax can be constant, but in astronomical and biological contexts it may increase in time (Layzer, 1988). The equation describes a trade-off, a “yin-yang” if you will, that links information to entropy. We here contend this simple equation also links diversity (disorder) and unity (order) of systems, two fundamental opposing properties that fuel evolution and emergence of structure in the universe. Under this concept, diversity represents a tendency toward dissipation of energy or information, while unity represents a tendency toward cohesion of the system imposed by its constraints. Equation 20.1 defines constraints on how microscopic events take place (Layzer, 1970). According to Heisenberg’s uncertainty principle there is an irreducible uncertainty about the position and momentum of a particle, and as a consequence any finite physical system can be described by a finite amount of information. By introducing a statistical description of entropy, the information of a classical system with discrete states is defined by a probability distribution {pk} instead of defining the system’s N components in k defined states (events, or microstates) as sets of state vectors each specified by 6N coordinates and momenta. It is intuitive that there is an inverse relationship between information and uncertainty about the overall state of a physical system: the more the information, the less uncertain the system’s state will be.
This allows information to be treated using probability theory and entropy to be considered as a dissipative tendency toward randomness and as a measure of uncertainty. Claude Shannon and colleagues uncovered this connection over half a century ago and realized that information entropy (Si) is a function of the probabilities pi of the k possible outcome events (Shannon and Weaver, 1949). S is a compound and continuous function of its variables that can be represented by a weighted sum of the entropies of its component simple states:

$$
S(p_1, p_2, \ldots, p_k) = -w \sum_{i=1}^{k} p_i \log_2 p_i \tag{20.2}
$$
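Equation 20.2 is straightforward to evaluate numerically. The sketch below is a minimal illustration: the function names are hypothetical, the convention that a zero probability contributes nothing to the sum is the usual one, and the `information` helper assumes, for illustration only, that Smax in Equation 20.1 is the unconstrained maximum w log2 k.

```python
import math


def shannon_entropy(probs, w=1.0):
    """Equation 20.2: S = -w * sum(p_i * log2(p_i)); terms with p_i = 0 contribute 0."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -w * sum(p * math.log2(p) for p in probs if p > 0)


def information(probs, w=1.0):
    """Illustrative reading of Equation 20.1: S_i = S_max - S.

    Assumption: S_max is taken as w * log2(k), the entropy of k equiprobable states.
    """
    return w * math.log2(len(probs)) - shannon_entropy(probs, w)


uniform = [0.25, 0.25, 0.25, 0.25]   # all states equally likely: maximal entropy
skewed = [0.7, 0.1, 0.1, 0.1]        # one state favored: intermediate entropy
certain = [1.0, 0.0, 0.0, 0.0]       # one state dominates: zero entropy

s_uniform = shannon_entropy(uniform)
s_skewed = shannon_entropy(skewed)
s_certain = shannon_entropy(certain)
```

Under these assumptions the uniform distribution carries no information in the sense of Equation 20.1, while the distribution dominated by a single state carries the maximum, log2 k bits.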

Note that S takes a maximal value w(log2 k) if and only if (iff) all p are equal (information is equally distributed among all states), approaches 0 as more of the component-state probabilities become 0, and equals 0 when the probability of one component state is 1 and the rest are 0 (i.e., when one state dominates the whole and carries all information). By virtue of Liouville’s theorem in Hamiltonian systems (constancy of density in phase space through time), evolution does not create or destroy information; that is, S is a constant of the motion for an isolated k-body system. The statistical description of S is therefore temporally symmetrical. However, note that higher levels of organization in EDS may not be Hamiltonian systems. Also note the tight connection and symbolic isomorphy of S with Boltzmann’s entropy when considering microscopic descriptions of the molecules of an ideal gas or with more classical entropy measures. In fact, entropy and information have been linked and the minimum entropy cost of a bit equated to approximately $10^{-23}$ J/K (Brillouin, 1962). Information makes a nontrivial contribution to the entropy of an expanding universe. However, we caution that Equation 20.2 is isomorphic with DeMoivre’s equation for games of chance, showing that connections among entropic concepts can only be established if they result from irreversible processes and they increase in time (Wicken, 1988).

Macrostates and microstates need to be distinguished to reveal how they are linked and to provide an irreversible macroscopic description. Layzer (1970) takes the k microstates of the system and coarse-grains them into aggregates (within which macroscopic variables do not vary appreciably) to show that changes in macrostate information will be accompanied by an equal and opposite change in microstate information. He then takes the argument further to the initial condition of the system using the theorem of Van Hove (1955), suggesting that the entropy of a closed system (one that does not communicate with its surroundings) will increase only if macroscopic information is initially present and microscopic information is initially absent. These initial conditions provide an explanation for the thermodynamic arrow of time: the unidirectional (irreversible) flow of information from macroscopic to microscopic degrees of freedom in the universe. For example, the diffusion of macromolecules through sequence space by mutation (decrease in macroscopic information associated with S) results in the optimization of the energetics of conformations of the actual protein or nucleic acid molecules (increase in microscopic information associated with S). While one can argue that certain systems can be considered closed in the universe, the mathematician Émile Borel showed that no finite physical system can be considered completely isolated from external perturbations and therefore its dynamical history cannot be completely determined (see Bergmann and Lebowitz, 1955).
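The coarse-graining of microstates into aggregates described above can be illustrated with the standard grouping property of Shannon entropy, which splits the total entropy into a macroscopic part (entropy of the aggregates) and a microscopic part (average entropy within aggregates). This sketch is offered only as an illustration of that bookkeeping, not as a reproduction of Layzer’s derivation; the function names and the example distribution are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def coarse_grain(micro, blocks):
    """Split the entropy of microstate probabilities into two parts.

    micro  : list of microstate probabilities (summing to 1)
    blocks : list of index lists partitioning the microstates into aggregates
    Returns (macro_entropy, within_entropy), where within_entropy is the
    probability-weighted average entropy inside the aggregates.
    """
    macro = [sum(micro[i] for i in block) for block in blocks]
    within = sum(
        weight * entropy([micro[i] / weight for i in block])
        for weight, block in zip(macro, blocks)
        if weight > 0
    )
    return entropy(macro), within


micro = [0.4, 0.1, 0.25, 0.25]        # hypothetical microstate distribution
blocks = [[0, 1], [2, 3]]             # two aggregates (macrostates)
s_macro, s_within = coarse_grain(micro, blocks)
s_total = entropy(micro)              # equals s_macro + s_within
```

Because the two parts always sum to the total, any process that leaves the total unchanged can only shift entropy, and hence information, between the macroscopic and microscopic descriptions.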
Microscopic information generated from macroscopic sources is therefore dissipated by random perturbations, including those from higher levels of organization, connecting Layzer’s views with EDS and microscopic structure. However, Layzer invokes a strong “cosmological principle” with a universe that is statistically homogeneous and isotropic (despite irregularities) and contains no microscopic (nonstatistical) information about itself. While microscopic information of bounded subsystems remains accessible, this cosmic uncertainty principle is irreducible. It is interesting to relate this work to Ebeling (1993), who separates entropies in coarse-grained systems into a macroscopic (free) part and a microscopic (bound) part. The macroscopic part can be available as information for structure and is associated with the order parameters of the system. The microscopic part is bound and associated with the space of molecular motions. When entropies are considered in Markovian chains, asymptotic tendencies can show long-range memory effects typical of historical and evolutionary processes at the edge of chaos, when they are neither stochastic nor periodic in nature.

The historical and thermodynamic arrows of time discussed above describe different aspects of the same phenomenon. In the historical arrow, wander sets seldom return to previously visited states when the phase space is large. Similarly, in the thermodynamic arrow dissipation of energy is often irreversible. The temporal asymmetry proposed by Layzer probably arises from a singularity during the big bang, with two crucial episodes toward disequilibrium, the inflationary expansion during the first second and the incomplete nucleosynthesis during the minutes that followed (Frautschi, 1982, 1988). During these two early stages, the gap between equilibrium and actuality broadens, and this generates most of the available free energy.
The singularity is driven out of equilibrium as it changes (during expansion) on a timescale shorter than its natural equilibration time (its relaxation), that is, t(expansion) < t(relaxation). Given these initial conditions of the universe, there is no reason why this asymmetry, the cosmological arrow

Chapter 20

Modularity and Dissipation in Evolution

Figure 20.2

Evolution of entropy/information and generation of structure in the universe. Equation 20.1 is valid in situations far from equilibrium that are irreversible and compatible with a rapidly expanding universe, in which Smax increases more rapidly than energy and matter can equilibrate (Layzer 1975, 1988). This condition has been formalized to explain the origins of chemistry in the big bang (Frautschi 1982, 1988).

of time, would not continue as a pervasive property of the universe, broadening the difference between Smax and S and increasing order and structure as the universe evolves away from a hypothetical state of thermodynamic equilibrium at the big bang (Figure 20.2). This concept is further elaborated by Wiley and Brooks (1982) and others as it applies to life, extending symmetry breaking to lineage splitting, changes in gene frequencies, and ontogenetic change. Since rates of energy and matter equilibration are slower than the rate of expansion of universal redistribution, Equation 20.1 should incorporate nonequilibrium entropic terms (not elaborated here) and, within an evolutionary context, can be depicted with a plot that describes the relationship between entropy and information over time (Figure 20.2). Seq is the entropy of the space of possible states with no constraints (maximum disorder). Smax is the maximum attainable entropy subject to constraints, which defines boundary potential entropy. S is the entropy of materialized states (actual entropy). Entropic functions in this graph are drawn as concave functions of time, as we assume historical constraints on the system retard the increase of entropy and information (Wiley and Brooks, 1982; Brooks et al., 1989). Regardless of the shape of these curves, note that “internalized” information and organization (Si) and diversity (S) are always increasing functions of time, showing that the universe and individual systems as a whole are increasing in order as time progresses. They also comply with the tenets of the second law and the dissipation tendencies of the universe. By the same token, the difference between the maximum possible entropy at equilibrium Seq and actual entropy S, which represents negative entropy, also increases, showing that potential information and structure (order) build up in evolution.

20.5 INFORMATION DISSIPATION AND MODULARITY PERVADE STRUCTURE IN BIOLOGY

Very much as the discrepancy between the expansion of the universe and the equilibration of its contents creates free energy gradients, the discrepancy of time constants between the enlargement of phase space and its search by wander sets drives the emergence of redundancy and modularity. It is noteworthy that the potential for diffusion is analogous to the free energy gradients that exist in the universe and propel life. As a system becomes more structured in evolution, more and more levels of organization fine-grain its cohesive properties. This is made possible by intralevel and interlevel ordering relationships that

define redundancy and modularity and integrate information in the phase space of structures (Figure 20.1a). Redundancy in biology manifests, for example, at the genetic repository level, with gene and genomic regions being duplicated and sometimes maintained for relatively long periods of historical time, defining intralevel redundant relationships in the space of sequences. However, mutation tends to degrade these extra copies if the same or a shared function does not select for them. In contrast, interlevel redundancy in molecular repertoires is expressed as mapping relationships between, for example, the space of macromolecular sequences (genotype space) and the space of corresponding structures (structure space). Interlevel redundancy in this case constrains change at lower levels of biological organization and can give rise to modularity. Heteropolymers such as proteins or RNA can be defined as strings that hold a message and, evolutionarily, as a set of outcomes that can be modeled using Markov processes. Outcomes emit symbols of an alphabet describing the identity of the monomer constituent at each position of the heteropolymeric string and at a precise time. Emission can depend on the emission of the n previous symbols (the previous history of the system) in an nth-order Markov process that is computationally tractable. Consequently, macromolecular repertoires can be defined by the workings of these processes and informational entropy extended to high-order Markov processes. For an alphabet of k symbols and a Markov process of order 0, maximum entropy (maximum disorder of the system) approaches log k. The ratio of S to log k therefore ranges from 0 (maximal information content per symbol) to 1 (uniform distribution of information). A relative measure of redundancy (R) can be defined as

$$R = 1 - \frac{S}{\log k}.$$
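As an illustration of the redundancy measure, the following Python sketch estimates the zero-order entropy S from the symbol frequencies of a single string and then computes R = 1 - S / log k. The function names and the toy sequences are ours, not the chapter's. Natural logarithms are assumed, k is assumed to be greater than 1, and a real analysis would estimate frequencies from a whole repertoire rather than from one short string.

```python
import math
from collections import Counter


def zero_order_entropy(sequence):
    """Shannon entropy S (in nats) from the symbol frequencies of `sequence`.

    Order-0 estimate: each position is treated as independent of its
    neighbours (a simplifying assumption made for illustration).
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    n = len(sequence)
    counts = Counter(sequence)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def redundancy(sequence, k):
    """Relative redundancy R = 1 - S / log k for an alphabet of k symbols."""
    return 1.0 - zero_order_entropy(sequence) / math.log(k)


if __name__ == "__main__":
    # Hypothetical nucleotide strings over a 4-letter alphabet (k = 4).
    for seq in ("ACGT", "AACG", "AAAA"):
        print(seq, round(redundancy(seq, 4), 3))
```

A string that uses all four symbols equally often gives R close to 0, and a fully repetitive string gives R = 1, which matches the range described in the text.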

To relate this measure to redundancy, note that the more interlevel and intralevel redundancy, the more clumped the distribution of states in the system. For example, intralevel redundancy lowers S and therefore increases R. Interlevel redundancy also lowers S effectively. Suppose we have 10 protein sequences as sequence space and one fold as fold space. If three sequences adopt the fold, the system is more clumped and has lower entropy than if only one sequence has the fold. Redundancy is intimately linked to (macroscopic) information, organization, and order. If outcomes are considered over long time intervals, information is expected to increase, and so is organization of the system. Landsberg (1984) took the redundancy concept of Gatlin (1972) and defined a function Q that he termed macroscopic order,

$$Q = 1 - \frac{S}{S_{\max}} = \frac{S_i}{S_{\max}} = R.$$
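The middle equality in the expression for Q is not spelled out in the text. A short derivation follows, assuming that internalized information is the gap between maximum and actual entropy, S_i = S_max - S (our reading of the relationship depicted in Figure 20.2, stated here as an assumption), and that S_max = log k for a process of order 0:

```latex
\begin{align}
Q &= 1 - \frac{S}{S_{\max}}
   = \frac{S_{\max} - S}{S_{\max}}
   = \frac{S_i}{S_{\max}}, \\
S_{\max} &= \log k
   \;\Longrightarrow\;
   Q = 1 - \frac{S}{\log k} = R .
\end{align}
```

Under this reading Q ranges from 0 (S = S_max, no internalized order) to 1 (S = 0, all potential entropy internalized as order).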

Interlevel and intralevel redundancies act as a unifying force in evolution, embedding order and leading to new structure and biological organization. From a biological point of view, the informational properties of redundancy we discussed have two unanticipated and remarkable consequences. They increase genetic robustness (Krakauer and Plotkin, 2002) and delimit modularity (Hintze and Adami, 2008). Robustness relates to the ability of a system to cope with genetic or environmental change. For example, each additional copy in a redundant set decreases the probability of a system’s failure. However, as we mentioned before, redundancy by its definition also provides fodder for change, enhancing the dissipation capabilities of the system. For example, gene duplication provides opportunities for neofunctionalization or subfunctionalization of the duplicate and for the discovery of new functions, division of labor, or the use of already discovered functions in new contexts (e.g., Wapinski et al., 2007). Redundancy therefore increases the potential to create further diversity in the system (more bound information) and, more importantly, provides genetic “backups” in the face of external perturbation. Modularity involves the generation of discrete entities, the modules. Modularity emerges spontaneously as broken symmetry states in response to changing environments (Sun and Deem, 2007), and evolvability in proteins appears to be a trait that can be selected (Earl and Deem, 2004). A system’s perturbation (such as horizontal gene transfer) therefore enhances the emergence of modules in evolution. This links to Borel’s arguments and the idea that perturbations help dissipate microscopic information in EDS. In molecules, interlevel redundancy can lock sequences into structures (Ancel and Fontana, 2000) and increase their robustness (Ancel Meyers et al., 2004) as macromolecules evolve. Point mutations stabilize structures that are suboptimal, decreasing their minimum free energies and reducing their conformational plasticity. This phenomenon (known as “plastogenetic congruence”) has the unanticipated consequence of increasing sequence neutrality (i.e., increasing the ability to resist structural change by adding interlevel redundant relationships) and decreasing the potential of the evolved molecules for phenotypic innovation. This generates structural modules, more stable instantiations of redundancy made possible by constraints imposed on the system at lower and/or higher levels of organization. In some cases, these modules define hourglass patterns in the diagram of Figure 20.1a and can propagate in molecular systems.
Examples of modules that have spread pervasively include atoms in molecules, monomers in polymers, domains in macromolecules, molecules in supramolecular assemblies, supramolecular assemblies in cells, and cells in tissues. The formation and reuse of modules is clearly illustrated in the physical structure of proteins (protein architecture). Modules occupy specific segments of the polypeptide chain that fold tightly and are evolutionarily conserved, the protein domains. Their existence is recurrent in proteins. For example, enzymes with the same architectural design are reused in different metabolic contexts, sometimes associated with different enzymatic chemistries (Caetano-Anolles et al., 2007, 2009a). Domains with similar or distinct 3D folded structures often combine to produce homomultimeric (domain repeat) and heteromultimeric (multidomain) proteins, respectively (Wang and Caetano-Anolles, 2006, 2009; Caetano-Anolles et al., 2009b). The number of proteins that engage in the combinatorial arrangement of domains is substantial, representing 59%, 75%, and 81% of proteins of organisms in superkingdoms Archaea, Bacteria, and Eukarya, respectively (Wang and Caetano-Anolles, 2009). The combination of domains in proteins shows power law behavior (Apic et al., 2001; Wuchty, 2001) and holds considerable phylogenetic signal (Wang and Caetano-Anolles, 2006, 2009; Fukami-Kobayashi et al., 2007). Phylogenomic trees of life reconstructed from abundance of domain combinations in proteomes revealed in every instance the tripartite world heralded by the Woese school and many lineages (especially within Eukarya) matching traditional taxonomical classification. Remarkably, the use of domains as modules in combinations appears to have arisen quite late in the evolution of proteins and in explosive manner (Wang and Caetano-Anolles, 2009).
This big bang of the protein world coincides with the rise of diversified organismal lineages and appears to be driven by a mechanics of domain organization that involves fusion and fission of domain modules. It is evident that the emergence of modularity involves redundancy expressed at several levels of the system (Figure 20.1a). In proteins, domains are modules that arise from genomic rearrangement processes (e.g., recombination, alternative splicing of introns) that modify the mutational connectivity of the space of sequences. Rearrangements are large-scale changes in the space defined by single mutations that enhance both the diffusive properties (by allowing long-distance jumps in state space) and the cohesion of the original system (by generating modules). Furthermore, rearrangements curb the size of domains and therefore modulate the permutational space. In fact, domain size is quite a primordial property that is maintained in the evolution of the three domains of life (M. Wang, unpublished results).

20.6 MODULARITY AND DISSIPATION IN PROTEIN EVOLUTION

Modularity is a side product of evolution. Consequently, biological modules must carry strong phylogenetic memory. Since modularity is linked to redundancy and Si by some function, it should increase with the historical arrow of time (Figure 20.2). Can evolutionary genomics support this contention? Modularity can be studied effectively in proteomes, the protein repertoire of an organism. Proteins and RNA make up the bulk of the macromolecular machinery of the cell and their evolution is prerequisite for more organized cellular and organismal structures. Consequently, the emergence of both proteins and RNA represents a fundamental transition (a bottleneck) that impacted the emergence of life on our planet. In proteins, 20+ amino acid monomers emerged to be components that make up the specific primary sequence of proteins, delimiting side chains that are spaced at regular intervals in the heteropolymer molecules capable of interacting with water, nucleic acids, ligands, and other chemical substrates. The permutations in sequence that are possible given our current knowledge of the protein world define a state space (10^321–10^469) of possible sequences that is hundreds of orders of magnitude larger than Eddington’s total number of electrons in the universe (10^80). Not surprisingly, only a minute fraction of this enormous diffusive potential (potential information) appears to have been explored (10^32) in the approximately 4 billion year long history of life (actual information) (Caetano-Anolles et al., 2009b). The same argument can be used in the study of secondary structures, the hydrogen-bonding patterns that give rise to helix and strand elements in protein folds. Permutations of secondary structure define a large space of possibilities (10^8 arrangements as linear strings), which is further enhanced by their arrangement in 3D space (10^10 possible folds).
However, only a small fraction of this has materialized in fold evolution (10^4). The combination of domain modules in proteins, much of which is currently delimited by domain pairs and approximately 1000 protein folds (Wang and Caetano-Anolles, 2006, 2009), also defines an enormous diffusive potential (10^22 possible pairwise domain combinations), a small fraction of which exists today (<10^4). Similarly, interactions between protein molecules define scale-free networks, small world networks that lie between the extremes of order (regular networks) and randomness (random graphs) (Strogatz, 2001). The 15,324 chemical compounds that are described in the Kyoto Encyclopedia of Genes and Genomes (KEGG) (February 2009) delineate a possible state space for global metabolism (1.17 × 10^8). However, KEGG only describes 7819 chemical reactions. Clearly, biological networks such as metabolism have a small number of connections compared to fully connected counterparts that fall within the regular network category. Despite their sparse connectivity these networks link any two nodes through very short paths (a property that proves valuable in biology) (see Chapter 18). Figure 20.3 illustrates the numbers of level-specific states we described that occur and are possible in the protein macromolecular system. Since all these levels of complexity are encased in a molecular hierarchy, both potential and actual information in the system increases with each

Figure 20.3

Potential and realized spaces in proteins at different levels of organization. Selected levels in the phase space of protein macromolecules: (1) sequence, (2) combination of domains, (3) protein structure of domains, and (4) enzymes in metabolic networks. See text for more detailed description. Note that (2) and (3) occur in parallel.

new level of organization and therefore with time. Note that the state space of possible and realized protein conﬁgurations increases in additive manner as new levels and interlevel constraints are added to the system, matching the concave scaling described in Figure 20.2. Diffusion in this growing landscape also increases, as new neutral nets are discovered with the rise of new architectures. At the same time, chromosomal rearrangement speeds the coverage of sequence space while making it sparse. Intuitively, this growing mismatch between possible and actual modular conﬁgurations in all these examples supports Layzer’s historical and thermodynamic arrows of time, and the notion of a growing gap between what is possible (if there was enough time to reach equilibrium) and actuality (which is generally far from equilibrium). In order to deﬁne evolution’s arrow the molecular repertoire must be placed within a historical perspective. Much of protein history must have occurred when proteins were encased in cellular lineages. Consequently, history can be reconstructed using standard phylogenetic methods if repertoire information is available. While contemporary protein sequences change at a considerable pace, higher order structures are constrained by the energetic landscape of protein folding, the exploration of sequence and structure space, and complex interactions with cellular machineries, including those that ensure proteins are folded correctly (Caetano-Anolles et al., 2009b). High-order structures such as fold families (FF), fold superfamilies (FSF), or folds (F) in the Structural Classiﬁcation of Proteins (SCOP) (Murzin et al., 1995) deﬁne increasing levels of abstraction in the 3D molecular makeup. The highest levels (the fold designs) take many millions of years to materialize and consequently represent excellent molecular fossils with which to study protein evolution. 
Using this unique feature, we designed a strategy that enables the study of both the evolution of proteomes and the evolution of components in these molecular repertoires (Caetano-Anolles and Caetano-Anolles, 2003; reviewed in Caetano-Anolles et al., 2009b). Figure 20.4 describes the two kinds of phylogenetic trees that can be generated, trees of proteomes and trees of architectures. This involves analyzing protein sequences encoded in hundreds of genomes that have been completely sequenced and assigning structure to sequences using advanced hidden Markov models (HMMs) of structural recognition, generating a structural

Figure 20.4

Flow diagram showing the reconstruction of phylogenomic trees of proteomes and trees of protein architectures. A structural census in proteomes of hundreds of completely sequenced organisms is used to compose a data matrix and its transpose matrix, which are then used to build phylogenomic trees describing the evolution of individual architectures and entire molecular repertoires, respectively. Elements of the matrix (g_mn) represent genomic abundances of architectures or domain combinations in proteins. The box depicts evolution of architectures within lineages.

census, and using this information to construct a data matrix for phylogenomic analysis. Trees of architectures that are built describe how components of the system (proteins in proteomes) change as the system evolves. They describe the history of architectural discovery in proteins. In turn, trees of proteomes describe the history of the system. The branches of these trees encase the history of its molecular components. Overall, phylogenies describe the evolution of the container and the contained and intuitively link the system’s macrostate to its microstates. The structural census defines reuse of modular components (architectures in proteomes), and the model of evolution that drives the reconstruction of trees assumes the reuse of modules increases in evolution. That is, there will be more copies of an F, FSF, or FF, and more domain combinations with time. This implies that proteome lineages that are ancient will exhibit lower levels of modularity than those that are more recent, and that architectures that are ancient will generally exhibit higher modularity levels than architectures that are more recent. When this evolutionary model was used to reconstruct trees of proteomes, the universal trees described the diversification of organismal lineages appropriately. The universal trees were rooted in archaeal microbes and placed multicellular complex organisms at the crown (the most distal leaves of the tree). Phylogenomic relationships were also largely consistent with accepted classification (see Figure 20.2 in Chapter 17). The fact that

the reuse of modules has strong phylogenetic signal is remarkable and of fundamental significance. It shows that (i) interlevel redundancy mappings between the spaces of sequences and structures capture protein history, (ii) historical relationships are also embedded in the rearrangement of protein domains, a higher order interlevel redundancy space, and (iii) modularity generally increases in evolution. However, we note that organisms that establish parasitic and obligate parasitic relationships with hosts exhibit strong reductive tendencies in the number of architectures in genomes (Wang et al., 2007). This is the result of secondary adaptations to these lifestyles. If not taken into consideration, they can bias this general evolutionary tendency. It is interesting to note that protein structure is unevenly distributed in the world of proteins and proteomes (Caetano-Anolles et al., 2009b). Several genomic surveys have shown that protein families and folds follow power-law distributions and establish networks with scale-free properties (Huynen and van Nimwegen, 1998; Rzhetsky and Gomez, 2001; Qian et al., 2001). This shows a preference for duplication of genes encoding families and folds that are already common, a “rich get richer” process. Interestingly, fold frequency plots for microbial superkingdoms Archaea and Bacteria had steeper slopes than those of Eukarya, showing there are more architectural modules in the proteomes of complex organisms (Koonin et al., 2002; Caetano-Anolles and Caetano-Anolles, 2003). However, the most ancient folds (shared by all organisms or shared by Bacteria and Eukarya) fitted Gaussian-like distributions characteristic of random graphs, suggesting the spread of folds across superkingdoms is complex (Caetano-Anolles and Caetano-Anolles, 2003). What determines how much redundancy and modularity an organism has? Gene duplication, for example, produces redundancy while mutation degrades it.
Consequently, a delicate balance of these contrasting processes is responsible for how much redundancy there is in an organism. Evolution can tinker with the rates of gene duplication and mutation, altering them through self-organization at different levels of biological organization and selecting for those that increase fitness. In turn, stabilizing performance acts as an important selection pressure to increase redundancy, when niche conditions tend to mutate genes underlying organismal performance. In contrast to redundancy, modularity can spread pervasively in genomes, increasing their size and slowing down replication time and proliferation. Consequently, the costs of limited proliferation curb excessive increases in modularity, especially in r-selected organisms such as akaryotic (prokaryotic) microbes, which can only pack a limited gene repertoire in their genomes and thrive in competitive environments. K-selected organisms such as eukaryotes, on the other hand, have room for process parallelization and can tolerate modularity within the confines of rates of error correction in DNA replication and growth conditions dictated by the environment. In order to uncover the diversifying effects of mutation and the unifying effects of redundancy and modularity in protein and proteome evolution, we plotted the total number of distinct FSFs that are used (a measure of module diversity) against the average abundance of FSFs (a measure of module reuse) for each one of 376 free-living organisms belonging to the three superkingdoms of life (Figure 20.5). Remarkably, the plot shows a clear correlation between the richness of the structural repertoire and the levels of modularity, clustering organisms belonging to superkingdoms in separate groups. However, most notable is the fact that as diversity and module reuse increase in proteomes, so does organismal complexity, in the order Archaea, Bacteria, and Eukarya.
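The two quantities plotted in Figure 20.5 are simple to compute once a structural census is available. The sketch below assumes the census for one proteome is a mapping from FSF identifier to the number of domain instances assigned to it. The identifiers, counts, and function name are hypothetical placeholders, not data from the study.

```python
def diversity_and_reuse(fsf_counts):
    """Return (diversity, reuse) for one proteome.

    diversity: number of distinct FSFs present (count > 0).
    reuse: total FSF instances divided by the number of distinct FSFs,
           that is, the average abundance per FSF.
    """
    present = [n for n in fsf_counts.values() if n > 0]
    diversity = len(present)
    reuse = sum(present) / diversity if diversity else 0.0
    return diversity, reuse


if __name__ == "__main__":
    # Made-up census values for two hypothetical proteomes.
    census = {
        "proteome_small": {"FSF_A": 4, "FSF_B": 2, "FSF_C": 0},
        "proteome_large": {"FSF_A": 30, "FSF_B": 12, "FSF_C": 6},
    }
    for name, counts in census.items():
        print(name, diversity_and_reuse(counts))
```

Each proteome contributes one point to the plot, with reuse on one axis and diversity on the other. Stacking the per-proteome counts into rows would give a proteome-by-FSF abundance table of the kind used as a data matrix for tree reconstruction.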
In particular, multicellular organisms and metazoa (most notably man) show both the highest diversity and module reuse values. Since trees of proteomes and architectures have consistently indicated Archaea is the most ancient lineage of the organismal world (Wang et al., 2007; Wang and Caetano-Anolles, 2009), the ahistorical description of Figure 20.5 supports the increase

Figure 20.5

Increase in the diversity and modularity of protein architecture in evolution. Protein-encoding sequences in fully sequenced genomes corresponding to 376 free-living organisms were assigned to structures at FSF level (SCOP 1.71) using advanced HMMs in SUPERFAMILY 1.69. The total number of distinct FSFs used by an organism was tallied and displayed on the ordinate. Average abundance of all FSFs in an organism was calculated as the total number of instances of each FSF in an organism, summated for all FSFs, and divided by their number. Green, blue, and red open circles represent Eukarya, Bacteria, and Archaea, respectively. The overall increasing trend of this graph illustrates the increase both in richness of the structural repertoire and in the abundance of structural modules from Archaea to Bacteria to Eukarya. (See insert for color representation of this ﬁgure.)

of both diversity and modularity in lineages with evolutionary time. This matches Layzer’s arguments of growth of order and the arrow of time illustrated in Figure 20.2. Similar results can be obtained when studying biological networks. Suppose you have a metabolic subnetwork with N substrates acted on by M metabolic enzymes. Adding more enzymatic reactions among substrates will increase interlevel redundancy. Similarly, the more substrates the enzymes operate on, the more clumping and order in the system. For example, analysis of metabolic enzymes of the citrate (TCA) cycle in different organisms showed a clear correlation between the number of chemistries (nodes, enzymatic diversity) and the number of reactions (edges, chemical diversity) linking enzymes in this central metabolic subnetwork (data not shown). Again the plot showed a clear progression in the complexity of pathways from Archaea to Bacteria to Eukarya. Therefore, these biological networks tend toward order in evolution.
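The sparsity argument made earlier in this section for KEGG can be checked with a few lines of arithmetic. The sketch below assumes that the potential state space is the number of unordered compound pairs, n(n - 1)/2, and, as a deliberate simplification, that each reaction contributes a single link between two compounds. The function and variable names are ours.

```python
def possible_pairs(n_nodes):
    """Number of unordered node pairs in a fully connected network."""
    return n_nodes * (n_nodes - 1) // 2


KEGG_COMPOUNDS = 15324  # compounds quoted in the text (KEGG, February 2009)
KEGG_REACTIONS = 7819   # reactions quoted in the text

potential_links = possible_pairs(KEGG_COMPOUNDS)
# Simplification: treat every reaction as one realized link.
realized_fraction = KEGG_REACTIONS / potential_links

if __name__ == "__main__":
    print(f"potential links: {potential_links:.3g}")
    print(f"realized fraction: {realized_fraction:.2e}")
```

Under these assumptions the pair count comes to about 1.17 × 10^8, the figure quoted in the text, and fewer than one in ten thousand of the possible links is realized.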

20.7 CONCLUSIONS

Laws and constraints shape phenomena in the universe. Laws describe regularities and constraints limit instantiations of the laws. Layzer’s views link information and entropy with a simple conservation law that reconciles Shannon’s entropy of information theory with classical thermodynamic entropy of Kelvin and Clausius and statistical entropy definitions of Boltzmann and Gibbs. He does this within a far-from-equilibrium cosmological framework,

stating that the sum of possible and instantiated information (enabled by symmetry breaking diversity) and entropy (energy, chemical, gravitational, etc.) is constant. We build on this concept and postulate that the law illustrates two opposing tendencies that operate in the universe, one that is dissipative and diversifies, and the other that unifies and generates order, structure, and complexity. We explore the entire repertoire of protein molecules present in hundreds of organisms that have been completely sequenced and summarize unity and diversity by measuring the use and abundance of domain architectures and their combinations in the three superkingdoms of life. The census supports experimentally the increase of modularity and diversity of the proteomic repertoire in the macromolecular system. The findings are in line with Layzer’s cosmological views and establish an important general tendency that can be used to gain insight into the historical happenings that gave rise to modern biochemistry and diversified cellular life.

ACKNOWLEDGMENTS

This work was supported by the National Science Foundation (grants MCB-0343126 and MCB-0749836), the C-Far Sentinel program, the United States Department of Agriculture through Hatch Illu-802-314, and the Soybean Disease Biotechnology Center. Any opinions, findings, and conclusions and recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

REFERENCES ANCEL, L.W. and FONTANA, W., 2000. Plasticity, evolvability, and modularity in RNA. J. Exp. Zool. B Mol. Dev. Evol. 288: 242–278. ANCEL MEYERS, L., LEE, J.F., COWPERTHWAITE, M., and ELLINGTON, A.D., 2004. The robustness of naturally and artiﬁcially selected nucleic acid secondary structures. J. Mol. Evol. 58: 681–691. APIC, G., GOUGH, J., and TEICHMANN, S.A., 2001. An insight into domain combinations. Bioinformatics 17: S83–S89. BABAJIDE, A., FARBER, R., HOFACKER, I.L., INMAN, J., LAPEDES, A.S., and STADLER, P.F., 2001. Exploring protein sequence space using knowledge based potentials. J. Theor. Biol. 212: 25–46. BERGMANN, P.G. and LEBOWITZ, J.L., 1955. New approach to nonequilibrium processes. Phys. Rev. 99: 578–587. BRILLOUIN, L., 1962. Science and Information Theory. Academic Press, New York. BROOKS, D.R., CUMMING, D.D., and LEBLOND, P.H., 1988. Dollo’s law and the second law of thermodynamics: analogy or extension? In Entropy, Information, and Evolution (eds B.H. Weber, D.J. Depew, and J.D. Smith). MIT Press, Cambridge, MA, pp. 189–224. BROOKS, D.R., COLLIER, J., MAURER, B.A., SMITH, J.D.H., and WILEY, E.O., 1989. Entropy and information in evolving biological systems. Biol. Phylos. 4: 407–432. CAETANO-ANOLLE´S, G. and CAETANO-ANOLLE´S, D., 2003. An evolutionarily structured universe of protein architecture. Genome Res. 13: 1563–1571. CAETANO-ANOLLE´S, G., KIM, H.S., and MITTENTHAL, J.E., 2007. The origin of modern metabolic networks inferred

from phylogenomic analysis of protein architecture. Proc. Natl. Acad. Sci. USA 104: 9358–9363. CAETANO-ANOLLE´S, G., YAFREMAVA, L.S., GEE, H., CAETANOANOLLE´S, D., KIM, H.S., and MITTENTHAL, J.E., 2009. The origin and evolution of modern metabolism. Intl. J. Biochem. Cell Biol. 41: 285–297. CAETANO-ANOLLE´S, G., WANG, M.L., CAETANO-ANOLLE´S, D., and MITTENTHAL, J.E., 2009. The origin, evolution and structure of the protein world. Biochem. J. 417: 621–637. CARROLL, S.M., 2008. The cosmic origins of time’s arrow. Sci. Am. 298: 48–57. CASTAGNINO, M., LOMBARDI, O., and LARA, L., 2003. The global arrow of time as a geometrical property of the universe. Found. Phys. 33: 877–911. COLLIER, J., 2003. Hierarchical dynamical information systems with a focus on biology. Entropy 5: 100–124. COTTRELL, A., 1979. The natural philosophy of engines. Contemp. Phys. 20: 1–10. DAVIES, P.C.W., 1983. Inﬂation and time asymmetry in the universe. Nature 301: 398–400. DEWAR, R.C., 2005. Maximum entropy production and the ﬂuctuation theorem. J. Phys. A Math. Gen. 38: L371–L381. DYSON, F.J., 1979. Time without end: physics and biology in an open universe. Rev. Med. Phys. 51: 447–460. EARL, D.J. and DEEM, M.W., 2004. Evolvability is a selectable trait. Proc. Natl. Acad. Sci. USA 101: 11531–11536. EBELING, W., 1993. Entropy and information in processes of self-organization: uncertainty and predictability. Physica A 194: 563–573.

FRAUTSCHI, S., 1982. Entropy in an expanding universe. Science 217: 593–599. FRAUTSCHI, S., 1988. Entropy in an expanding universe. In Entropy, Information, and Evolution (eds B.H. Weber, D.J. Depew, and J.D. Smith). MIT Press, Cambridge, MA, pp. 11–22. FUKAMI-KOBAYASHI, K., MINEZAKI, Y., TATENO, Y., and NISHIKAWA, K., 2007. A tree of life based on protein domain organization. Mol. Biol. Evol. 24: 1181–1189. GATLIN, L.L., 1972. Information Theory and the Living System. Columbia University Press, New York. Gene Ontology Consortium, 2000. Gene Ontology: tool for the unification of biology. Nat. Genet. 25: 25–29. HARTWELL, L.H., HOPFIELD, J.J., LEIBLER, S., and MURRAY, A.W., 1999. From molecular to modular cell biology. Nature 401: C47–C52. HINTZE, A. and ADAMI, C., 2008. Evolution of complex modular biological networks. PLoS Comput. Biol. 4: e23. HOELZER, G.A., SMITH, E., and PEPPER, J.W., 2006. On the logical relationship between natural selection and self-organization. J. Evol. Biol. 19: 1785–1794. HUYNEN, M.A. and VAN NIMWEGEN, E., 1998. The frequency distribution of family sizes in complete genomes. Mol. Biol. Evol. 15: 583–589. JAYNES, E.T., 1957. Information theory and statistical mechanics. Phys. Rev. 106: 620–630. KRAKAUER, D.C. and PLOTKIN, J.B., 2002. Redundancy, antiredundancy, and the robustness of genomes. Proc. Natl. Acad. Sci. USA 99: 1405–1409. KOONIN, E.V., WOLF, Y.I., and KAREV, G.P., 2002. The structure of the protein universe and genome evolution. Nature 420: 218–223. LANDSBERG, P.T., 1984. Can entropy and order increase together? Physics Lett. 102A: 171–173. LAYZER, D., 1970. Cosmic evolution and thermodynamic irreversibility. Pure Appl. Chem. 22: 457–468. LAYZER, D., 1975. The arrow of time. Sci. Am. 233: 56–59. LAYZER, D., 1988. Growth of order in the universe. In Entropy, Information, and Evolution (eds B.H. Weber, D.J. Depew, and J.D. Smith). MIT Press, Cambridge, MA, pp. 23–39. LOTKA, A.J., 1922. Contribution to the energetics of evolution. Proc. Natl. Acad. Sci. USA 8: 147–151. LYNCH, M., 2007. The frailty of adaptive hypotheses for the origins of organismal complexity. Proc. Natl. Acad. Sci. USA 104: 8597–8604. MARTYUSHEV, L.M. and SELEZNEV, V.D., 2006. Maximum entropy production principle in physics, chemistry and biology. Phys. Rep. 426: 1–45. MURZIN, A., BRENNER, S.E., HUBBARD, T., and CHOTHIA, C., 1995. SCOP: a structural classification of proteins for the investigation of sequences and structures. J. Mol. Biol. 247: 536–540. NICHOLS, P.J., 1989. The Ergodic Theory of Discrete Groups. Cambridge University Press, Cambridge. NOWAK, M.A., BOERLIJST, M.C., COOKE, J., and MAYNARD SMITH, J., 1997. Evolution of genetic redundancy. Nature 388: 167–171.


QIAN, J., LUSCOMBE, N.M., and GERSTEIN, M., 2001. Protein family and fold occurrence in genomes: power-law behavior and evolutionary model. J. Mol. Biol. 313: 673–681. RZHETSKY, A. and GOMEZ, S.M., 2001. Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics 17: 988–996. RUBI, J.M., 2008. The long arm of the second law. Sci. Am. 299: 62–67. SCHULTES, E.A. and BARTEL, D.P., 2000. One sequence, two ribozymes: implications for the emergence of new ribozyme folds. Science 289: 448–452. SHANNON, C.E. and WEAVER, W., 1949. The Mathematical Theory of Communication. University of Illinois Press, Urbana. STROGATZ, S.H., 2001. Exploring complex networks. Nature 410: 268–276. SUN, J. and DEEM, M.W., 2007. Spontaneous emergence of modularity in a model of evolving individuals. Phys. Rev. Lett. 99: 228107. TAVERNA, D.M. and GOLDSTEIN, R.A., 2002. Why are proteins so robust to site mutations? J. Mol. Biol. 315: 479–484. VAN HOVE, L., 1955. Quantum-mechanical perturbations giving rise to a statistical transport equation. Physica 21: 517–540. VILAR, J.M.G. and RUBI, J.M., 2001. Thermodynamics “beyond” local equilibrium. Proc. Natl. Acad. Sci. USA 98: 11081–11084. WANG, M. and CAETANO-ANOLLÉS, G., 2006. Global phylogeny determined by the combination of protein domains in proteomes. Mol. Biol. Evol. 23: 2444–2454. WANG, M. and CAETANO-ANOLLÉS, G., 2009. The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world. Structure 17: 66–78. WANG, M., YAFREMAVA, L.S., CAETANO-ANOLLÉS, D., MITTENTHAL, J.E., and CAETANO-ANOLLÉS, G., 2007. Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world. Genome Res. 17: 1572–1585. WAPINSKI, I., PFEFFER, A., FRIEDMAN, N., and REGEV, A., 2007. Natural history and evolutionary principles of gene duplication in fungi. Nature 449: 54–61. WEBER, B.H., DEPEW, D.J., and SMITH, J.D., 1988. Entropy, Information, and Evolution. MIT Press, Cambridge, MA. WICKEN, J.S., 1988. Thermodynamics, evolution, and emergence: Ingredients for a new synthesis. In Entropy, Information, and Evolution (eds B.H. Weber, D.J. Depew, and J.D. Smith). MIT Press, Cambridge, MA, pp. 139–169. WILEY, E.O. and BROOKS, D.R., 1982. Victims of history: a nonequilibrium approach to evolution. Syst. Zool. 31: 1–24. WROE, R., BORNBERG-BAUER, E., and CHAN, H.S., 2005. Comparing folding codes in simple heteropolymer models of protein evolutionary landscapes: robustness of the superfunnel paradigm. Biophys. J. 88: 118–131. WUCHTY, S., 2001. Scale-free behavior in protein domain networks. Mol. Biol. Evol. 18: 1694–1701.

Index

A-B-C-A-B-C domain, 221 Adaptation, 186, 194 accelerated nonsynonymous change, detection, 195 coevolutionary signal, 195 conserved sites, changes, 195 detection/interpretation, 194 integrating inferences, 194 integrating evolutionary inferences, 197–198 structure/function information, 197 molecular convergence, evidence, 198 possible causal factors, integrating inferences, 199–200 snake metabolic proteins, 194 Adjacency matrix, 141 Aerobic bacteria, appearance, 385 Alcoholic fermentations, 101 Allele frequencies, 153 Amino acids, 204, 208 costs of synthesizing, 204 replacements, 184, 188 sequences, 6, 23 Amphimedon queenslandica, 310 Ancestral reconstruction approaches, 174, 223 maximum parsimony, 224 Ancestral rRNA, 48 nucleotide composition, 48 Ancient phylogeny, 54 problem, 54–55 Ancient RNAs, 254–259 RNase MRP RNA, 254 RNase P RNA, 254 signal recognition particle RNA, 254–256 snoRNAs, 256–259 Anﬁnsen funnel, 127 Apparent signal, 34 reconstruction method, effect, 34 Arabidopsis thaliana, 320, 351, 416 Archaea, 65, 332, 382 eukaryote-speciﬁc features, 65

genomic relationship, 332 progenitor, 382 Archaeal-bacterial endosymbiosis, 72 Archaeal cellular architecture, 69 universal features, 69 Archaebacteria, 46 Archezoa hypothesis, rejection, 71 Archaezoans, 388 Argonaute proteins (AGO), 298 Aristotle’s scala natura, version, 45 Arrhenius kinetics, 132 Ascaris lumbricoides, 270 ATP binding cassette, 157 dependent proteases, 91 dependent RNA helicases, 91 genes coding, 91 grasp family, domains, 237 production, 407 synthase complex, 85, 198 Avogadro’s number, 124 Bacillus subtilis, 409 Bacteria, 59, 332 genomic relationship, 332 initiator protein DnaC, viral origin, 59 Bacterial genomes, 81, 84, 88, 90 gene clustering, 88 information, role, 81–92 cenome/paleome, 87–88 nonessential persistent genes, 89 revisiting information, 83–84 ubiquitous functions, 84–87 ubiquitous information-gaining process, 89–91 Bacterial lineages, molecules of, 349 Bacterial RNase P protein (RPP), 348 Bacterial sulfate binding protein, 206 Bacterial symbionts, acquisition, 385 Bacteria-rooting scenario, 330

Evolutionary Genomics and Systems Biology, edited by Gustavo Caetano-Anolles Copyright 2010 John Wiley & Sons, Inc.


Index BADGER, version, 178 Baker’s yeast, 95, 416 Barrier tree, classes, 147 Basal fungi, phyla, 96 Basidiomycota, 114 Bayesian analysis, 419, 422 inference, organization of functions, 90 techniques, 27, 199 tree, 35 bHLH dimerization network, evolution, 426 Big-bang consequence, 70 Biological networks, 397, 398, 401, 402, 404, 443 evolution, 397 Biology, 440 information dissipation, 440–443 modularity pervade structure, 440–443 Biopolymers, sequence-structure maps, 138 Biotin carboxylase, 236 Blanquart model, 26 BLAST, 370 algorithms, 221 based homology, searches, 389 signals, 370 Block interchange (BI), 170 Boltzmann-weighted structures, 145 Bootstrap values (BVs), 18 Borrelia burgdorferi, 207 Bowker’s test, 25 Brassicaceae, SINEs, 351 Brassica oleracea, 351 Breakpoint graph, 173 approach, 173 structure, 173 Caenorhabditis elegans, 261 CAG anticodon, 105, 108 Cambrian explosion, case study, 33–35 Candida albicans, 96, 98 Candida elegans, 270 SmY RNAs, consensus structure, 270 Candida glabrata, 111 Canonical tmRNA structure, 264 Catalyzing enzymes, 399 CATH, 155–157, 234 developers, 156 domain structure, 232 Gene3D resource, 232 superfamilies, 234, 240–242, 244, 246, 248 domain architecture rearrangement vs. functional divergence, 242–245 evolution, 240–241 function, 240–241

parent/child COGs, ambiguous evolutionary scenarios, 241–242 CAT model, 31–33 breakpoint modeling, 28, 32 cDNA libraries, 18 C/D snoRNA structures, 257 schematic representation, 257 Cell biology, 67 Cell metabolism, 399 Cellular biomass, proportion, 204 Cellular metabolic optimization principles, 406 Cellular networks, 366, 373–375 cooperativity, 366 growth rate, 375 metabolism, 401, 407 network representation, 401 optimality, 366 properties, 374 Cellular proteins, 387 Cellular proteomes, 371 Chaperonin/protease monitoring systems, 373 Chemical ﬂuxes, 374 Chlamydomonas reinhardtii, 315 Chromosomal evolution study, software, 179 Chromosomal mutations, 165 Chromosomal rearrangements, See Chromosomal mutations Chromosomes, permutation, 170 Chymotrypsin, 154 Ciliates, 276 TER genes, 260 Circular permutations (CP), 216, 222 detection, heuristic approaches, 221 tracing, 222 Clades, phylogenetic analysis, 19 Classical dynamic programming scheme, 218 Class II MHC-associated invariant chain, 160 Class II tRNA molecules, 346 Clustering coefﬁcient, 402 Clusters of orthologous groups (COGs), 238–240, 244 database, 240 functional categories, distribution, 239 Coelomata hypothesis, 225 Cohesion, 432, 433 Communal LUCA hypothesis, 52, 54 HGT, 52 Compact disks (CDs), 86 Comparative domain mapping, 159 Comparative genome analysis, 46, 231, 232 mapping, 167 Compartment-speciﬁc functional cycles, 374 Complex domain rearrangements, 245 Complex networks, small-world property, 402

Index Complex protein enzymes, 56 Complex systems, 184, 185 component, 185 Computational microRNA prediction, 302–303 Computational systems biology, 148 Computer laser beam, 86 Conformation space, 127, 147 classes, 147 degrees of freedom, numbers, 127 Conserved domain database (CDD), 219 Conserved ncRNAs, 267–276 bacterial RNAs, 270–274 GcvB, 271 OLE RNA, 271 riboswitches, 271–274 RNA thermometers, 274 RsmY, 271 RsmZ, 271 Yfr1, 271 7SKRNA, 267–269 SmY RNA, 270 vault RNAs, 267 YRNAs, 267 Context-dependent models, 188, 189, 226 complexity, 188 Continuous interhelical base stacking, 131 Continuously stirred ﬂow reactor (CSFR), 142 Convergence, 186, 199 alternative hypothesis, 199 Convergent molecular evolution analysis, 199 Cooperating proteins, 380 Cooperative networks, 377 Cosmological model, 432 principle, 439 Coupling signaling pathways, 400 Covarion model, 24 Coverage problem, 225–227 CpG dinucleotides, 191 Critical Assessment of Techniques for Protein Structure Prediction (CASP) contests, 129 Crystallographic models, 334 CTG complex, 104, 108 CUG codons, 108, 112 serine, 112 CUG decoding, 108 Cyanobacteria, 161, 205 emergence, 161 highly expressed photosynthesis proteins, phycobiliproteins, 205 Cyclic reaction pathways, 399 Cytochrome c oxidase subunit, 1 (COI), 195, 197, 199


protein, 197 sites, 197 Cytotoxic T lymphocytes (CTLs), 312 DAhunter, 221 Danio rerio, 161 Darwinian evolution approach, 12, 52, 133, 365 principles, 52 threshold, 51, 52 notion, 51 Darwinian tree concept, 60 Darwin natural selection theory, 43, 123 Darwin studies, organisms characteristics, 153 Data analysis, 420–422 data resources, Pfam, 221 Dayhoff coding, 30 Debaryomyces hansenii, 98 Deep phylogeny, problem, 5–7 Degree of coevolution, 197 Degree of cooperativity, 367 Deinococcus radiodurans, 267 DeMoivre equation, 439 Densely packed diffusing macromolecules, 374 Detecting homology, 216 Dictyostelium discoideum, 275, 315 Dimerization, 422–424 deﬁnition, 422 domains, DNA-binding, 426 Dissipative systems, 432 emergent property, biological structure, 432–435 Distinctive evolutionary adaptations, 366 DNA, 6, 56–58, 82, 83, 87, 112, 129, 153, 154, 157, 185, 190, 208, 346, 364 binding, 423, 425 discovery, 153 fragments, 112 genomes, 56 helicases, P-loop domain, 157 models, 190 molecules, 6, 82, 83 generations, 6 replication, 57, 87 machinery, 346 paradox, 56–57 proteins, 58 sequences, RY coding, 30 topoisomerases, 56 viral theory, origin, 56 Dollo’s Law, 385 Domain architecture, measurement, 216 arrangement process, 215 distance, 219


families, 232, 234, 247 distribution, 247–248 evolutionary analysis, 232 minority, 215 rearrangements, 227 space, 218 strings, 216 structures, 372 trees, 225 Domain-based homology identification, 215–222 aspects, 219 deciphering circular permutations with domains, 221–222 domain architecture similarity, 216–219 domain-based search, 219, 220 domain resources, 219 Domain definition databases, SCOP, 220 Domain detection methods, 225 Domain-domain interactions, 247 Domain-specific RNAs, 259–266 spliceosomal snRNAs, 260–262 6S RNA, 265–266 telomerase RNA, 259–260 tmRNA, 262–265 U7 snRNA, 262 Domain versatility index (DVI), 217 Dosage balance hypothesis (DBH), 417 Double-cut-and-join (DCJ), 170–172 distance, 174 operation cuts, 171 Drosophila melanogaster, 247, 258, 281, 300 Dual functional RNAs, 281–282 Enod40, 282 RNAIII, 281 SgrS, 281 SRA/SRAP, 282 Dynamical system, 432, 433 definition, 432 Ecdysozoa, 29, 32 hypothesis, 32 monophyly, 29 Emergent dissipative systems (EDS), 435 Emperor blast search revisited, 381–388 ecological settings, 384–385 EMRAE algorithm, 176, 177 rearrangement events, 177 ENCODE Pilot Project, 251 Endosymbiont, 11 Endosymbiosis, 54 Energy landscape, use, 127

Entropy, 434, 437 evolution, 440 Enzymatic pathways, 380 Enzyme, properties, 404 Epstein–Barr virus (EBV), 311, 312 Error threshold phenomenon, 137, 145 Escherichia coli, 206, 264, 368, 375, 377, 379, 408 batch growth, 408 central carbon metabolism, 408 characteristics, 368 growth phenotypes, 377 metabolic network, connectivity distribution, 401 model genome, 89 ribosomal subunit, 375 Eukaryotes, 9, 67, 68, 70, 162 ancestor, 7–10 archaea, phylogenetic relationship, 68 cell, 8, 63, 64 common ancestor of, 8 features of, 63, 64 origin, signiﬁcance, 63–68 qRNA infrastructure, 10 structure, evolution, 70 deﬁnition, 8 eukaryote signature proteins (ESPs), 9 evolution, stem group, importance, 63 genomes, 9, 351 intron proliferation, 74–76 mitochondria, stepwise development, 72–74 mitochondrion origin, 71–72 origins, 7, 10, 67, 74–76 knowledge, issues, 67 RNA continuity model, 10–12 protein kinase catalytic domain, 154 proteins, 382, 388 phylogeny, 7 riboswitches, 9 RNA-based processing, 9 RNA complexity, 11 stem, 76 E-value constraints, 226 Evolution, 134, 165, 447 chromosomal rearrangements, 165 challenges, 178 new approaches, 178 model, 445 molecular theory, 134 optimization, trajectory, 143 protein architecture, 447 diversity, 447 modularity, 447

Index scenarios, reconstruction, 174 Evolutionary genetics, 185 feature of inferences, 185 Evolutionary genomics, 183 Exon shufﬂing, 113 Experimental high-throughput approaches, 397 Fairly complex genome, 377 Family-speciﬁc sequence identity thresholds, 246 Fast likelihood-based conditional pathway approach, 188 Felsenstein’s pulley principle, 25 Fermentation processes, 96 Ferredoxin, 206 FoF1-ATP synthase, protein chains, 157 Flash memories, 86 Flavodoxin, 206 Flux balance analysis (FBA) methodology, 407–409 linear program, 408 Folding RNA sequences, 138 Fold recognition method, 156 FASTA, use, 156 Folds, fold families and fold superfamilies, 386 Fold superfamilies (FSFs), 386, 387 LUCA cohort, 386 Venn diagrams, 387 Forterre thermoreduction hypothesis, 385 Fragile breakage model, 179 Free energy surface, 145, 146 construction, 145 Free-living anaerobic phagotrophe, 388 Functional molecules, 185 divergence, 186, 241 degree, classiﬁcation, 241 historical causative inference, 186 evolutionary properties, 185 Functional protein network, 364 Function/evolutionary genomics, 186–194 context-dependent biases from protein evolution, 190–191 detecting adaptation, 193–194 functional innovation, 193–194 modeling protein evolution, future, 188–189 mutational noise, removing, 190–191 protein evolution, 186–188, 193–194 deciphering complexities, 186–188 taxon sampling/sequence biodiversity, effect, 189–190 Gaia hypothesis, 161 Galactose binding domain-like superfamily, 235


Galectin-7, 236 Galectin-type carbohydrate recognition domain superfamily, 235 GAL pathway, control network, 399, 409 Gaut’s formulation, 27 GC base pairs, 207, 208 nitrogen content, 207 GC-rich codons, 208 Gemmata obscuriglobus, 384, 389 electron micrograph, 384 inner membranes, 389 Gene coding, 89 duplications, 97, 424 encoding proteins, 206 expansion, 106, 207 material costs, 207 order, 171 data, 167 rearrangement operations, 171 position plots, 158 retention, 209 Gene3D database, 245 Gene ontology (GO), 222 annotation, 419, 422 graph, 223 Gene ontology semantic similarity (GOSS) scores, 247 Generalize process-based models, 191 General time-reversible (GTR) model, 22 Gene regulatory network, 398 Gene transfers, 377, 378, 380, 390 consequence, 390 incompatibilities, 377 Genetic element, ﬁtness, 186 Genetic material, compartmentation, 66 Genetic mechanisms, 4, 231 Genetic program, 85 Genetic system, 185 GeneTRACE algorithm, 241, 242 Genome, 166, 172, 179, 183, 368, 424 biodiversity, evolutionary sampling, 189 communities, age, 391 compositional properties, 368 distances, 168–174 model-free distances, 169–170 rearrangement-based distances, 170–174 DNA sequences, 179 structure, 158 domains, 222 trees building, 223 duplication, 97, 416


Index Genome (Continued ) evolution, 4–5, 12–13, 153, 183, 364 inﬂuence of environment, 161 material costs, 207 mutational events, 165 gene content, 172 heterozygosis, 378 instability, 166 integration, 424 mechanisms, 215 molecular structure, 183 organization, 316–320 phylogeny, development, 367 power, 4–5 proteomics, 222–225 rearrangements study, 158 protein domains, 158–160 representation, 166–167 resources, 183 scale data sets, 225 sequences, 104, 365 unichromosomal/multichromosomal, 166 Genome-level metabolic function, 407, 408, 410 dynamic models, 407 E. coli metabolism, 408 Genotype-phenotype relations, 123, 124, 126 Genotypes, 123, 124 deﬁnition, 124 role, 123 Gillespie’s algorithm, 142 Glutathione synthesis, 205 diversion of sulfur, 205 Glycolysis-speciﬁc functions, 205 Gouy’s models, 26 Gram-negative bacteria, 55 Gram-positive bacteria, 46, 55 GRAPPA method, 176 GRIMM-synteny, 168, 173 microrearrangements, 168 Group of Itinerant Space Travellers (GIST) model, 4 Guide RNAs, 274–275 H/ACA snoRNA structures, 108, 256–259 schematic representation, 257 Hairpin loop, 131 Hamiltonian systems, 438 Liouville’s theorem, 438 Hamming distance, 136, 140, 141 Hannenhalli–Pevzner theory, 173 elementary interpretation, 173–174 Hansenula polymorpha, 98 Hatena cell, 73

Heisenberg’s uncertainty principle, 438 Hemiascomycetes, evolutionary genomics, 104, 111 core-proteomes, 106 gene transfer, 109–110 genome organization, 104–106 genome redundancy, 106 genome sequences, 104 introns, 108 mitochondrial genomes, 110–111 noncoding RNAs, 108 NUMTs, 110–111 pan-proteomes, 106 paralogues, 106–107 synteny, conservation, 107 Hemiascomycetous yeast, genomes, 96–100, 105, 107, 112, 114 exploration, 96 Herpes simplex virus-1 (HSV-1), 311 latency-associated transcript (LAT), 311 Heteropolymers, 441 proteins, 441 RNA, 441 Heterozygosity, 111 Heuristic network expansion method, 406 Hickey’s model, 75 Hidden Markov models (HMMs), 156, 213, 225, 444 domain profiles, 223 Hidden root, weather became cloudy, 51–54 Highly expressed gene, 206, 207 encode proteins, 207 products, 206 High-order structures, 444 High-quality HMM profiles, 219 High-throughput (HTP) method, 421 interaction data, 422 Hill–Robertson effect, 102 Hitchhiker’s guide, 363, 364 emperor’s blast search, 370 to evolving networks, 363 one man’s glitch, 365–366 sideways, 365 three challenges, 366 Haloarcula marismortui, 332 Homologous proteins, 423 Homology detection methods, 216 circular permutations (CP), 216 Homo sapiens, 54 Horizontal gene transfer (HGT), 52, 54, 55, 239, 365 problem, 54–55 Host gene transfer, 63

Index Host-pathogen interactions, 311 Housekeeping genes, 86 Human genomes, 4, 178 schematic localization, 178 Human herpesvirus-4, 311 Hydrogen bonds, 420 Hydrogen-dependent autotrophic archaeon, 72 Hydrophobic amino acids, 32 Hyperthermophilic last universal common ancestor 48, 49 origin of life, 48 Identity permutation, 172 Ill-deﬁned concept, 186 InferCARs algorithm, 176 Information, dissipation, 435–437 Interactions, types, 417–418 Intergenic spacers (IGSs) RNAs, 276 Interpreting trees, 68–70 InterPro database, 221 Intralevel redundancies, 441 Intramolecular hydrogen bonds, 420 Intrinsically disordered regions (IDRs), 419 Introns, 74, 75 exon structure, 9 ﬁrst hypothesis, 74 origin, 74 proliferation, 75 Inverse document frequency, 217 Inverted duplication hypothesis, 314 Iwabe/Gogarten’s tree, 47, 48 Jaccard similarity coefﬁcient, 218 JTT empirical exchangeability matrix, 27 K-carrageenans, 236 Kimura’s theory, 140, 141 Kinetic nightmare, 373 Kingdom-speciﬁc models, 227 domain libraries, 227 Fungal Pfam/ FPfam, 227 Kluyveromyces waltii, 98 K-selected organisms, 446 Kyoto Encyclopedia of Genes and Genomes (KEGG), 168, 406, 443 database, 406 Lac operon, 380 Landscape paradigm, 123 Large genomes, 177 applications, 177 eukaryotic, 166 Lartillot’s model, 26


Last eukaryotic common ancestor (LECA), 63, 64, 76 feature, 76 Last universal cellular ancestor, 43, 44 Last universal common ancestor (LUCA), 45, 46, 49, 50–54, 56, 57, 58, 233, 252, 366, 377, 381, 382, 390 community, 54 companions, 53 DNA transfer scenarios, 57 eukaryote descendent, 390 feature, 45, 54 genome, 381 ITS companions, 54 nature, 50, 56–57 prokaryotic-like, 50 proteome, 377 protoeukaryotic, 50 scenario, 49, 51 viral features, 58 Layzer’s arguments, 437, 447 Layzer’s proposal, 438 Leishmania donovani, 275 Leishmania infantum, 277 Leucine zipper (LZ) interactions, 423 Living world, 59 capsid encoding organisms, 59 ribosome encoding organisms, 59 Lodderomyces elongisporus, 98 Long-branch attraction artifact (LBA), 20 Lytic viruses, 64 Macrobe prejudices, 60 Macromolecular biosynthesis processes, 86 Macromolecular complex, 374 Macromolecular crowding, 371 Macromolecules, 374, 375 networks, 374 Mad–Max heterodimers, 423 Major histocompatibility complex (MHC), 160 Major spliceosome, 261 Mammalian cytochrome b, 193 posterior probability distributions, 193 Mammalian genomes, 177, 316 Mammalian phylogenetic tree, 192 Mammalian transcriptome, 252 post-ENCODE view, 252 Marine microbes, 206 ferredoxin/ﬂavodoxin expression, 206 Markov models, 6–7, 24, 305 Mating-based split-ubiquitin (mbSUS), 422 Mauve method, 168


Index Maximum entropy production principle (MEPP), 435 Maximum likelihood (ML)-based methods, 177 approaches, 199 framework, 176 Maximum parsimony (MP), 19, 338 Maxwell’s test, 25 Messenger RNA (mRNA), 344 Metabolic efﬁciency, 199 physiological adaptations, 199 Metabolic network model, 397–399, 402, 404–406, 409 evolution, 397, 403 hierarchical organization, 406 organization, principles, 398 properties, 398–403 schematic presentation, 405 small-world property, 398 topology, 398 Metabolic reaction ﬂuxes, distribution, 403 Metagenomics, See Genome communities Metal binding domains, 207, 209 Metal binding proteins, 162 complement, 162 Methanococcus jannaschii, 270 MGR method, 176, 177 application, 177 Michaelis–Menten-like kinetics, 405 Microbial metabolic networks, 405 Microbial organisms, 208 Microevolutionary processes, 5 Microorganisms, 378, 380 global populations, 378 niche-deﬁning chemistry, 380 Microprocessor complex, 299 MicroRNAs (miRNAs) families, 295, 296, 302, 303, 305–307, 309, 313 computational prediction, 303 evolution, 307–313 animal microRNAs, 307–310 mirtrons, 313 plant microRNAs, 310–311 viruses, 311–313 expression, 318 innovation, 309 interactions, 304 origin, 313–316 metazoa, 313–314 microRNAs, 314 transposable elements (TEs) 314 precursors, 303 regulation, 318–320 sequence, 303

targets, 304–307 polymorphisms, 306–307 prediction, 305–306 Miniature inverted-repeat transposable elements (MITEs), 314 MiRBase, 295, 316 Mir-17 clusters, evolution, 317 Mir-21, evolutionarily conserved regulation, 319 Mitochondria, 71, 72 development, 72–74 endosymbiotic origin, 71 Mitochondrial DNA (mtDNA), 110 Mitochondrial mRNAs, 274 Mitochondrial proteins, 197 Mitochondrial seed hypothesis, 74 Mitochondrion, endosymbiotic origin, 75 Mixture of branch lengths (MBL) models, 23, 24 Model-free distances, 169 breakpoint distance, 169 common interval, 169 conserved intervals, 169 Model-free reconstruction algorithms, 175–176 Modeling protein evolution analysis, 187, 188, 213, 214, 224, 233 future, 188 genes, structural coverage, 233 goal, 188 principal, 187 schematic example, 214 Modular protein domains, cohort, 385 Molecular biology, 43, 124 Molecular components, 387 cytochromes, 387 polymerases, 387 ribosomes, 387 Molecular evolutionary analysis, 123, 191 genotypes/phenotypes, 123 Molecular function ontology, 223 Molecular phenotypes, 125–132 nucleic acid structures, 129–132 protein structures, 126–129 Monophyletic group, diversity, 69 Monophyly hypothesis, 333 Monte Carlo Markov chain (MCMC) strategy, 177 sampling, 31 Mouse Meryl RNA, 297 mRNA, expression proﬁles, 409 mRNA-like ncRNAs (mlncRNAs), 277–281 dosage compensation, 279–280 imprinting, 280 stress response, 280 transcriptional regulators, 280–281

MRP RNAs, 255 consensus structures, schematic drawing, 255 ribozyme, 11 Mitochondrial DNA (mtDNA) molecules, 111 Multichromosomal genomes, SBR problem, 174 Multidomain architectures (MDAs), 213–216 evolution, 213 properties, 216 Multidomain proteins, 214, 227 analysis, 157 domain architectures, 227 Multigene families, phylogenetic history, 190 Mutation process, 191 gradient, 191 hitchhiking, 390 matrix, 135, 136 rate model, 136 tuning phenotype, 376 Mutatis mutandis, 364 Mycoplasma capricolum, 85 Mycoplasma genome, 86 Mycoplasma mycoides species, 85 Nakaseomyces clade, 111 Nanoarchaeum equitans, 254 Natural antisense transcripts (NATs), 279 Natural phylogeny, 366 Natural population, genomes, 365 Natural selection, 53, 203, 407 Darwin’s theory, 407 variation, 53 ncRNA families, 10, 253 experimental surveys, 253 origins, 253 phylogenetics, 7 Nematostella vectensis, 310, 424 Neofunctionalization, 419 Nested structural networks, 371–374 fold selection, 372 higher order structural networks, 373 nip and tuck, 371 Network expansion analysis, 406 Neutral networks, 139–141, 144 properties, 139 Next-generation sequencing, 13 Nimble genes, 13 No mutational backflow approximation, 137, 138 Nonadaptive driving forces, 364 Non-Arabidopsis miRNAs, 7 Noncoding housekeeping RNAs, 251 genes, 108 rRNA/tRNA, 251 Nonphylogenetic signal, 19, 22–28, 34, 35


compositional signal, 24–26 homogeneous models, 22–23 masking, 33 vs. phylogenetic signal, 19–21 rate signal, 23–24 reconstruction method, effect, 34 reduction, 28 data, recoding and removal, 30–31 taxon sampling, 28–30 Nonprotein coding RNA (ncRNA) systems, 353 Nonstationary context-dependent (NSCD) models, 191, 194 biological realism, 194 Non-Watson–Crick-type nucleotide-nucleotide interactions, 131 No pseudoknot rule, 138 NRED database, 279 Nuclear pore complex proteins, 9 Nucleic acid bases, 343 molecule, 343 Nucleotide sequences, 22 Nutrient scarcity, 204 OFAM database, 386 Operating system (OS), 85 Optimal networks, 374–381 novel sequences, fixation, 378–380 optimality not maximality, 375–377 patchy environments, patchy genomes, 377–378 selfish operons, 380–381 Optimality principle, 407 Optimization process, resources, 145 Optimization trajectories, statistics, 144 Organismal phylogeny, 17 Organism’s protein interaction network, 397 Ornate, large, and extremophilic (OLE) RNA, 271 Orphan domains, 227 Orphan snoRNAs, 256 Orthologous genes, 240 clusters, structural domain characterization, 240 Oxic networks, 406 Pan-genome, 87, 90 concept, 87 Paralogous proteins, phylogenies, 384 Parent/child reconstructed evolutionary scenarios, 243 functional shifts, percent frequencies, 243 Pathogens, resistance phenotypes, 379


Index Pentose phosphate pathways, 402 Permutation, 169, 170 common intervals, 169 conserved intervals, 170 Perron–Frobenius theorem, 136 Pezizomycotina, 96, 106 PfamAlyzer, 219 Pfam database, 213, 219, 227 Phagocytosis, 72, 73 origin, 72 Phagotroph-prey scenario, 73 Phagotrophy, 64, 73 Phenotype, role, 123 Phenylalanyl transfer RNA, 143 secondary structure, 143 Phosphate transporter gene, Slc34a2a, 279 Phosphorylation, 418 motifs, 418 sites, 419 Phylogenetic analysis, 7, 27, 330, 341, 350, 352 Phylogenetic continuities, biological coherence, 367–371 b/w consenting adults, 371 compositional outliers, 368 phylogeny, 368 rosetta stone, 367 Phylogenetic method, 6, 369 algorithms, 369 marker, 17 reconstruction, 20, 369 signal, 19, 35, 331, 345 vs. nonphylogenetic signal, 19–21 stems, 19 Phylogenetic software, 343, 344 MrBayes, 344 Phylogenetic trees, 175, 335, 347 character polarization, 335 cosmology, 336 molecular mechanics, 335 phylogenetics, 336 thermodynamics, 336 example, 175 Phylogenomic analyses, 17–19, 29 construction, 225 relationships, 445 trees, 225, 442 Phylogeny, 383, 389 artist’s view, 383 Physcomitrella patens, 317 Planctomycetales, discovery, 59 Planctomycetes, 389 characteristics, 389 16S RNA, 389

Plant microRNA families, 310 phylogenetic distribution, 310 Plants, 275 pathogen Ashbya gossypii, 98 polymerase V transcripts, 275 viruses, 312 Plasmodium falciparum, 275 P-loop hydrolase superfamily, 161 domain, 157 function, 157 Point mutations, 165 Polarize characters, robust root, 50 Polycistronic pre-mRNAs, 262 Polymerase-III transcript, 269 Position-speciﬁc scoring matrices (PSSM) model, 213, 225 CDDs collection, 220 Position-speciﬁc search methods, 225 Post hoc correlation, 189 Posttranscriptional gene silencing (PTGS), 296 Posttranslational editing systems, targets, 372 Pri-miRNAs, 299, 318 Primitive eukaryote-like ancestor, 389 P RNA, 254, 255 phylogenetic distribution, 254 structures, schematic drawing, 255 Probabilistic models, 22 compositional signal, handling, 24–26 evolutionary models, redesign, 189 future developments, 27–28 homogeneity assumption, 26 homogeneous models, 22–23 rate signal, handling of, 23–24 rRNA genes, 27 sequence evolution, 26 Probe protein networks, 367 Prokaryote(s), 9, 11 elimination, 47 eukaryote dichotomy framework, 44–47 evolutionists, 7 LUCA, 49 prokaryote endosymbioses, 74 riboswitches, 9 Promoter-associated RNAs, 275–276 palRNAs, 276 Protein(s), 126, 208, 127, 155, 156, 186, 190, 194, 206, 208, 213–216, 218, 219, 223, 371, 377 adaptation, dN/dS-based analyses, 195 comparison, 106 complex repertoire, 156 cooperative clusters, 377 cytochrome b, 195

Index domain, 156, 161, 162, 213, 372 dynamic ﬂexibility, 372 as evolutionary units, 213 gain/loss, 160–161 position plots, 158 energetic costs, 208 evolution, 186, 187, 189, 191–193, 443 ancestor, 156 assessment, 189 dissipation/modularity, 443–447 phylogenetic analysis, 187 relationship, 155 study, considerations, 184 expression, 204 costs, 204, 209 levels, 388 families, phylogenetic analysis, 238–240 free energy landscapes, 127 folding, 126–128 energy landscapes, 128 machines, chaperonins, 372 pervasive orthology, 387 polypeptide chain, series of stages, 128 functional aspects, 219 functional change, 194 functional constraints, 204 functional domain composition, 223 genes, 7 material costs, 203, 204, 208 evolution, 203 phylogenomic analysis, 330 phylogeny, 238 sequences, 5, 102, 341 signature resources, 221 structure, 126, 129, 154, 184 degrees of designability, 184 elements, 129 functional features, 186 space, 154 thermodynamic hypothesis, 126 three-dimensional molecular structure, 126 units, notions, 127 synthesis, 352 synthesizing machinery, 112 Protein-array technology, 425 Protein-coding genes, 100, 190, 194 evolutionary process, 190 Protein-coding sequences, 370, 381 Protein Data Bank (PDB), 129, 154, 155, 162 polypeptide chains, 155 protein chains, 154 Protein domain architecture retrieval tool (PDART), 220, 221


Protein interaction networks, 400 genome-level representations, 400 Protein kinase-like superfamily, 157 Protein-like complex systems, 184 Protein-protein interaction networks (PINs), 414 evolution, 414–415, 420 gene duplication, 414, 415 principles, 414 Protein-protein interactions (PPIs), 187, 247, 414 Proteolytic surveillance system, 373 Proteomes, 445 phylogenomic trees, reconstruction, 445 Proteomic sulfur sparing, 205 Protoeukaryote host (PEH), 71 stem, advantage, 71 Pseudogenes, ncRNAs, 276–277 Pseudoknots, 131, 138 Psychrophiles, fold selection systems, 373 PubMed database, 83 Quantitative analysis, 369 Quasispecies, 141, 142 Rampant gene transfer, 379 Rampant prokaryotic counterrevolution, 47–50 Random breakage model, 179 Random gene transfer, 369 aggregate frequencies, 369 Random mutations mechanisms, 363, 364 Rates across rates (RAR) methods, 193 model, mammalian cyt b, 192 Rate across sites (RAS) model, 23 rDNA loci, 104 sequences, 13 Rearrangement-based reconstruction methods, 179 algorithms, 176–177 limitation, 179 Redundancy, 434, 441 deﬁnition, 441 Reorder-free segments (RFs), 168 Replication-dependent histone genes, 262 Retroviral RNAs, 312 based phylogeny, 367 guides, use, 50 genes, 21, 27, 104 molecules, 340, 353 phylogenetic tree, 353 phylogeny, 381 sequences, 352 stem substructures, phylogenetic trees, 353 structure, 339


Reverse position-speciﬁc BLAST (RPSBLAST), 219 Rhopalodia gibba, 7 Ribonuclease, 126 catalytic mechanism, 126 MRP, 349 P enzyme, 348, 349 P RNA, secondary structure models, 349 type III enzymes, 298 Ribonucleoprotein (RNP) complexes, 252, 254 functions, 348 mitochondrial RNA processing, 254 ribonucleases P, 254 Ribonucleotide triphosphates (rNTP), 56 Ribosomal cluster, property, 376 Ribosomal function, 346 23S rRNA responsible, 346 Ribosomal proteins, 205, 206 number, 205 Ribosomes, in vitro translation rates, 378 Riboswitches, 9, 147, 272, 274 classiﬁcation, 272 control mechanisms, 273 Ribozymes, 336, 348 in vitro evolution, 336 Ribozymic reaction, 350 Ring of life scenarios, 46 RNA model, 50, 125, 131–148, 185, 208 archaeal-eukaryal features, 50 Arrhenius-type folding kinetics, 147 based catalytic activity, 348 based regulation, 9 based translational machinery, 112 biophysical model, 338 biopolymers, 342 cells, 56 degradation, 89 dependent RNA polymerase, 296 encoding complex, 335 evolution, 133, 337 experiments, 125 stochastic effects, 141–145 folding mechanism, 142, 332, 337 software, 344 functional switches, 342 inputs, 133 kinetic folding, 132 molecular mechanics, 338 mutation, 133–138 neutrality, consequences, 139–141 one sequence-one structure paradigm, 145–148 phosphorolysis, 91

polymerase holoenzyme, 265 polymerase-III transcripts, 276 processing, 8, 9, 11, 12 protein complexes, 130 reexpansion model, 11 replication, 133–138 error threshold, 137 point mutation, 135 viral replicases, 133 10Sa RNA, 262 secondary structures, 130, 138–139 accuracy, 344 modules, 130 selection-mutation dynamics, 142 sensor, 296 sequences, 335, 342 7SL RNA, 255 5S rRNA, secondary structure of, 332, 347 6S RNA structure, 266 16S/23S ribosomal RNA (rRNA) molecules, 48, 330 thermometers, 274 types, 348 viruses, 109, 312 RNA structure analysis, 130, 131, 146 applications, 344 RNase P RNA, 348–351 rRNA, 352–353 SINE RNA, 351–352 5S rRNA, 346–348 tRNA, 344–346 phylogenetic analysis, 338 phylogenetic trees, 333–334 character coding, 334–335 character polarization, 335–338 character state change frequency, 339–340 phylogenetic analysis, 338–339 potential limitations, 343–344 properties, 340–343 phylogenetic utility, 329 pseudoknots, 131 tertiary structure, 334 RNA world, 55, 76, 391 feature, 76 nature, 55–56 protein synthesis, 391 Robustness, 441 RPR, 350 catalytically cleave, 350 C domain, 350 sequences, 349 structure, 3D models, 349

Saccharomyces cerevisiae, 86, 95, 97, 98, 102, 109, 252 fermentative engines, 102–103 functional genomics, 98–101 genetics, 98 genome duplication, 102 mitochondrial ribosome, 110 speciation, 103–104 species deﬁnition, 103–104 Saccharomyces paradoxus, 103 DNA fragments, 103 Schmidtea mediterranea, 309 Salmonella typhimurium, 380 Saccharomyces pombe, 252 Saccharomyces sensu stricto complex, 101, 103 species, 103 yeasts, 104 Saccharomycetaceae genomes, 98 Saccharomycetaceae species, 107 Saccharomycotina, See Hemiascomycetous yeasts Salmonella enterica, 159 strains, comparison, 159 Salmonella typhimurium, 368 Same structural subgroup (SSG), 234 Scala Naturae, 46, 48 Schizosaccharomyces pombe, 95, 260 Sec systems, 50 Segmental duplications (SDs), 167 signiﬁcant portions, 167 Selection-mutation matrix, 135, 141 eigenvalues/eigenvectors, 135 Selective forces, 379 Sequence data, genome permutations construction, 167 Sequence homology, 241 Sequence space, 124, 125, 139, 140, 142 construction, principle, 124 deﬁnition, 124 properties, 124, 125 Serine dehydratase (Sda), 89 Serine endopeptidases subtilisin, 154 Shape protein material costs, 205 episodic nutrient scarcity, 205 Shine–Dalgarno sequence, 274 Short interspersed element (SINE) elements, 190, 191 alignments, 190 families, 351 RNA structure, 351 Sibling organisms, 161 Signaling process, 423


Signal recognition particle (SRP), 254, 255 helices, nomenclature, 256 RNA, 256 Signal transduction pathways, 422 Signed permutations, 166 SILAC analysis, 304 Simian virus 40 (SV40), 311 Similarity matrix of proteins (SIMAP), 220 domain similarity tool, 220 Single-domain proteins, 157 Single-gene duplications (SGDs), 415–417 Single-gene phylogenies, analysis, 18, 19, 21 Single nucleotide polymorphisms (SNPs), 4, 306 Slow-fast (SF) method, 31 Small Cajal body-speciﬁc RNAs (scaRNAs), 258 Small interfering RNAs (siRNAs), 296 Small nucleolar RNAs (snoRNAs), 256 Small protein B (SmpB), 264 Small RNA biogenesis, 298–302 gene expression, global effect, 300–302 mRNA degradation classes, 302 transcriptional gene silencing class, 302 transcriptional imprinting class, 302 translational inhibition classes, 300–302 microRNA biogenesis, 299–300 small RNA processing machinery, components, 298–299 Argonautes, 298–299 Piwi proteins, 298–299 polymerases, 299 type III RNases, 298 Small RNA group, 296–298 biogenesis, 300 exotic small RNA species, 297–298 piRNA, 297 rasiRNA, 297 Small scan RNAs (scnRNAs), 298 Small subunit ribosomal RNA (SSU rRNA), 238, 352 analysis, 352 ribosome, 353, 3D heat map, 353 SMART database, 213 Snake, 198 cytochrome c oxidase subunit 1 protein, 198 amino acid replacements, 198 mitochondria, adaptive protein evolution, 195 oxidative metabolism, 194 reorganization, 194


snoRNAs, 256, 258 based RNA modiﬁcation system, 258 classes, 256 C/D, 256 H/ACA, 256 genes, 108 occurrence, 258 Sorting by reversals (SBR), 173 Spliced-leader RNA (SL RNA), 270 Spliced-leader-trans-splicing, 261 Spliceosomal splicing, 260 Spliceosome, origin, 9 Splicing mechanisms, 261 Stem lineages, 8 substructures, phylogenetic analyses, 350 Steroid receptor RNA activator, 282 Stochastic process, 215 domain rearrangement, 215 matrix, 408 stochastic error, 18 Straightforward sorting strategy, 173 Streptomycin cluster, evolution, 376 Structure/energetic/ﬁtness (SEF) models, 193 Structural biology, ﬁeld, 184 Structural classiﬁcation of proteins (SCOP) database, 155, 156, 234, 386 developers, 156 domain, eukaryotic tree, 160 domain families, structural analyses, 234 ID, 158 Structural diversity vs. functional diversity, 234–237 Structural domain annotation, 240 Structural granularity, 156 implications, 156–158 Structurally compact domains, 216 Src-homologous domains, 216 Sulfur-depleted proteins, production, 205 Sulfur-poor glycolytic isozymes, 205 Sulfur-stressed bacteria, 205 Superﬂuous model, 189 Supermatrices, 17–19 principle, 33 tolerance, 18 Support vector machine (SVM), 303 Symmetrical multistep chromosome inversion, 158 schematic representation, 158 Symmetric reversals, 178 overrepresentation, 178 Synteny blocks, 166, 167, 168 identiﬁcation, MAGIC tool, 168

permutation, 167 steps, 168 System mechanisms, 433 Tandem gene, 107 Target prediction methods, 306 Taxon sampling, 190 Telomerase reverse transcriptase (TERT), 259 Termini-associated RNAs, 275–276 Tetanus neurotoxin, 236 Thermotoga maritima, 255 Thermodynamics, 331, 434 hypothesis, 126 irreversibility, 437–440 life, role of, 331 Thermomicrobium roseum, 350 Thermoreduction scenario, 49 Totally intronic noncoding (TIN) RNAs, 276 Total mass, density, 375 Tourette’s syndrome, 306 Trans-activator RNA binding protein (TRBP), 299 Transfer RNA (tRNA), 88, 330 binding, 352 function, diversiﬁcation, 346 genes, 108 molecules, 333, 339–341, 344, 347 analysis, 333 evolutionary component, 347 related SINEs, structural evolution, 352 structural phylogenies, 345 structure, 132, 331, 347 tertiary interactions, 132 Translation network, 375 Transposable elements (TEs), 314, 318 Tree of life, 7, 17, 35, 46, 49, 215 Turing machine, 85, 86 Turnip mosaic virus (TuMV), 312 Ubiquitous cellular infrastructure, 372 Universal cellular ancestor, 48 Universal high-density cytosol, 390 Universal tree of life, 43–45, 58, 59 ancient/alternative versions, 44 knowledge, 59 place of viruses, 58 reconstruct, 59 when viruses ﬁnd their way, 58 Unrooted phylogeny, 382, 383, 386 U7 snRNAs, 263 consensus sequences, aligned sequence logos, 263

Vanderwaltozyma polymorpha genome, 98 Vault RNAs, components, 269 Vertebrate Y RNA locus, evolution, 268 Viral ocean, 58 Viruses, 59 role, 59 three domains hypothesis, 57 Watson–Crick pairing, 332 Whole-genome comparative analysis, 158 Whole-genome duplications (WGDs), 415, 416, 419, 421 mechanisms, 415 Wigglesworthia glossinidia, 410 Wild-type E. coli metabolism, 409 Woese’s tree, 47, 48 revolution, 45 rRNA trees, 369 visions, 47 Woesian revolution, 45–47 World Wide Web, 398 Wright’s metaphor, 123, 124 assumptions, 124 problem, 123 Xenology, 18


X inactivation center (XIC), 279 Yarrowia lipolytica, 98, 112 Yeast, 13, 95, 96, 260, 419 evolutionary genomics, 95 genomes, 109, 112, 113 evolution, 113 features, 112 kinome, 418 microscopic, 95 mitochondrial DNA, 105 molecular evolution, 114 ohnologues, 420 protein interaction network, 404 Saccharomyces cerevisiae, 204, 399, 400, 402 degradation pathway, 400 galactose uptake, 400 metabolic network, 402 Saccharomyces sensu stricto complex, 102 species, 96, 114 telomerase RNA structures, 260 yeast-two-hybrid (Y2H), 422 Zinc binding domains, 207 Zinc ﬁnger domains, 217 Zygosaccharomyces bailii, 103

Figure 1.1

(See page 6 for text discussion).

[Panels: (a), (b).]

Figure 1.3

(See page 10 for text discussion).

[Diagram title: The RNA infrastructure of the eukaryotic cell. Compartments: nucleus, cytoplasm. Transcription: Pol I (rRNA); Pol II (mRNA, miRNA, U1, U2, U4, U5 snRNA); Pol III (U6 snRNA, miRNA, tRNA, SRP RNA, RNase P RNA, RNase MRP RNA*); Pol IV, plants (miRNA). Transcripts: pre-mRNA, pre-rRNA, pre-tRNA, transcribed snoRNA. RNA-based processing: modification of snRNA, mRNA splicing (U1, U2, U4–U6 snRNA), intronic snoRNAs, RNase P cleavage of tRNA, snoRNA modification of rRNA, RNase MRP cleavage of rRNA. Cascade: mRNA processing, tRNA processing, rRNA processing. RNA-mediated transcriptional regulation: RNAi, riboswitches, RNase P. Feedback: RNase P regulation of polymerase III transcripts (pre-tRNA, pre-5S rRNA). RNP biogenesis and assembly: snRNPs, snoRNPs, RNase P, RNase MRP, SRP. Translation: ribosomes, rRNA, tRNA. RNA-mediated translational regulation: P-bodies (miRNA). Other labels: RNA stress granules, mRNA storage. RNA degradation: P-bodies, exosomes.]

Figure 6.1

(See page 97 for text discussion).

Figure 6.2

(See page 105 for text discussion).

Figure 6.3

(See page 112 for text discussion).

Figure 6.4

(See page 114 for text discussion).

Figure 7.1

(See page 125 for text discussion).

Figure 7.2

(See page 128 for text discussion).

[Axis labels: “Entropy”, “Energy”, Q.]

Figure 7.3

(See page 130 for text discussion).

Figure 7.4

(See page 131 for text discussion).

Figure 7.5

(See page 132 for text discussion).

Figure 7.6

(See page 133 for text discussion).

[Plot of RNA concentration $c(t)$ against time $t$, with three labeled phases: exponential, $c(t) = c(0)\,e^{kt}$; linear, $c(t) = c(0) + k't$; saturation or product inhibition.]

Figure 7.7

(See page 134 for text discussion).

[Diagram labels: plus strand, minus strand, template-induced synthesis, complex dissociation.]

Figure 7.8

(See page 135 for text discussion).

[Diagram labels: plus strand, minus strand.]

Figure 7.9

(See page 137 for text discussion).

[Plot labels: stationary mutant distribution; mutation rate p.]

Figure 7.10

(See page 139 for text discussion).

Figure 7.12

(See page 143 for text discussion).

[Axis label: replications.]

Figure 7.13

(See page 146 for text discussion).

Figure 7.14

(See page 147 for text discussion).

Figure 7.15

(See page 148 for text discussion).


Figure 8.1

(See page 155 for text discussion).

Figure 8.3

(See page 159 for text discussion).

Figure 9.1

(See page 167 for text discussion).

Figure 9.6

(See page 178 for text discussion).

[Plot labels: chromosomes Chr1–Chr22 and ChrX. Legend: hs reversals, hcs reversals, hcrs reversals, centromeres.]

Figure 10.1

(See page 187 for text discussion).

Figure 10.2

(See page 192 for text discussion).

Figure 10.3

(See page 192 for text discussion).

Figure 10.4

(See page 193 for text discussion).

Figure 10.5

(See page 196 for text discussion).

Figure 10.6

(See page 198 for text discussion).

Figure 12.1

(See page 214 for text discussion).

Figure 12.2

(See page 222 for text discussion).

Figure 13.1

(See page 233 for text discussion).

Figure 13.2

(See page 233 for text discussion).

Figure 13.3

(See page 235 for text discussion).

Figure 13.4

(See page 237 for text discussion).

Figure 13.5

(See page 239 for text discussion).

Figure 13.6

(See page 240 for text discussion).

Figure 13.7

(See page 242 for text discussion).

Figure 13.8

(See page 243 for text discussion).

Figure 13.9

(See page 246 for text discussion).

Figure 13.10

(See page 246 for text discussion).

Figure 14.2

(See page 253 for text discussion).

[Tree labels. Major groups: Animalia, Choanoflagellata, Fungi, Amoebozoa, Plantae, Rhodophyta, Heterokonta, Apicomplexa, Ciliates, Kinetoplastida, Euglenozoa, Metamonada, Nanoarchaeota, Crenarchaeota, Euryarchaeota, Proteobacteria, Chlamydia, Actinobacteria, Cyanobacteria, Firmicutes. Metazoa: Vertebrata, Urochordata, Cephalochordata, Echinodermata, Hemichordata, Nematoda, Arthropoda, Platyhelminthes, Annelida, Mollusca, Cnidaria, Porifera. Fungi: Taphrinomycotina, Saccharomycotina, Pezizomycotina, Basidiomycota, Glomeromycota, Chytridiomycota, Microsporidia. Plants: Chlorophyta, Charales, Bryophyta, Coniferales, Angiosperms. RNA classes marked on the tree: miRNAs, microRNA mechanism, telomerase RNA, MRP, snRNAs, RNAi, gRNAs, snoRNAs, RNase P, rRNA, tRNA, SRP, tmRNA, 6S, Yfr1, Y RNA, 7SK, SmY, vault, ENOD40.]

Figure 14.3

(See page 255 for text discussion).

Figure 14.4

(See page 256 for text discussion).

Figure 14.5

(See page 257 for text discussion).

[Diagram labels: C/D snoRNA (Box C, Box C′, Box D) and H/ACA snoRNA (Box H, ACA-3′), each shown paired with rRNA; spacings of 5 nt, 3–10 nt, and 14–16 nt marked.]

Figure 14.6

(See page 260 for text discussion).

[Diagram labels: ciliate, vertebrate, and yeast structures, with template, pseudoknot, TB, CAB, Ku80, CR4, CR5, P5, P6.1, and P6b marked.]

Figure 14.7

(See page 263 for text discussion).

[Sequence logos; source label: weblogo.berkeley.edu.]

Figure 14.8

(See page 264 for text discussion).

Figure 14.10

(See page 268 for text discussion).

[Tree of vertebrate species annotated with Y RNA genes (Y1, Y3, Y4, Y5; some entries marked "?"). Clade labels: Teleostei, Sauropsida, Mammalia, Afrotheria, Xenarthra, Laurasiatheria, Euarchontoglires; one node labeled "ancestral Eutherian". Species abbreviations: Ho.sap., Pa.tro., Ma.mul., Ta.syr., Mi.mur., Le.cat., Ot.gar., Mu.mus., Ra.nor, Sp.tri., Xe.tro., Xe.lae., Da.rer., On.myk., Or.lat., Ga.acu., Ta.rup., Te.nig., Or.ana., Mo.dom., Ma.eug., Ig.igu., An.car., Tu.bel., Ca.fam, Fe.cat., So.ara, Er.eur, Lo.afr., Ec.tel., Pr.cap., Da.nov., Ch.hof., Pt.vam., My.luc., Ga.gal., Ta.gut., An.pla, Di.ord., Ca.por., Or.cun., Oc.pri., Eq.cab., Bo.tau., Su.scr., Tu.tru.]

Figure 14.11

(See page 269 for text discussion).

[Consensus structure diagram; labels: Box A, Box B, variable region, terminator.]

Figure 14.12

(See page 269 for text discussion).

[Consensus structure diagram; labels: Drosophila 300−350, Lophotrochozoa 150−200, Deuterostomia 200−240.]

Figure 14.14

(See page 273 for text discussion).

[Diagram labels: M, ORF, 5′ and 3′ ends, AGGAGG, UUUUU, GU, AG, AAAAA.]

Figure 15.1

(See page 301 for text discussion).

[Diagram labels. Compartments: nucleus, cytoplasm; genomic DNA. Small RNA classes: miRNA (pri-miRNA, pre-miRNA), tasiRNA, natsiRNA, rasiRNA, piRNA (sense-piRNA, AS-piRNA, sense pi-master, antisense pi-master). Proteins: Drosha/Pasha, Dicer, Exportin-5, HST, DCL1/HYL1, DCL1, DCL2, DCL4, DRB4, HEN1, RDR2, RDR6, SGS3, NRPD1a, NRPD2a, NRPD2b, DRD1, DRM1, DRM2, AGO1, AGO3, AGO4, AGO7, AUB, PIWI. Other labels: TAS1&2, TAS3, Tas precursor, antisense TC, 7mG, AAA, Met, CH3, 21 nt, 24 nt.]

Figure 15.2

(See page 303 for text discussion).

Figure 15.3

(See pages 308–309 for text discussion).

[Tree labels. Clades: Metazoa, Cnidaria, Mollusca, Gastropoda, Annelida, Platyhelminthes, Nematoda, Arthropoda, Protostomia, Bilateria, Deuterostomia, Echinodermata, Urochordata, Vertebrata, Gnathostoma, Teleostomi, Teleostei, Mammalia, Eutheria, Rodentia, Primates. Taxa: Amphimedon queenslandica, Trichoplax adhaerens, Acropora millepora, Acropora palmata, Nematostella vectensis, Hydra magnipapillata, Spisula solidissima, Biomphalaria glabrata, Aplysia californica, Lottia gigantea, Capitella capitata, Helobdella robusta, Schmidtea mediterranea, Schistosoma mansoni, Trichinella spiralis, Brugia malayi, Pristionchus pacificus, Caenorhabditis briggsae, Caenorhabditis remanei, Caenorhabditis elegans, Daphnia pulex, Apis mellifera, Tribolium castaneum, Bombyx mori, Drosophila, Anopheles gambiae, Saccoglossus kowalevskii, Strongylocentrotus purpuratus, Oikopleura dioica, Ciona savignyi, Ciona intestinalis, Branchiostoma floridae, Petromyzon marinus, Callorhinchus mili, Danio rerio, Oryzias latipes, Gasterosteus aculeatus, Takifugu rubripes, Tetraodon nigroviridis, Xenopus tropicalis, Gallus gallus, Ornithorhynchus anatinus, Monodelphis domestica, Canis familiaris, Bos taurus, Rattus norvegicus, Mus musculus, Pan troglodytes, Homo sapiens.]

Figure 15.7

(See page 317 for text discussion).

[Diagram labels. Cluster types: type I, type I-A, type I-B, type I-Pa, type I-Pb, type I-Q, type II. Lineages: Petromyzon marinus, Callorhinchus mili, Teleostei, Mammalia. Events: genome duplication, cluster duplication, type II cluster lost, lost in D. rerio, lost in O. latipes and G. aculeatus, lost in T. nigroviridis and T. rubripes. Annotations: "20 and 19b are copies of 17 and 19a, respectively; 18 lost"; "93 is copy of 17"; "18 is copy of 17". miRNAs: mir-17/mir-106a, mir-106b, mir-18, mir-19a, mir-19b, mir-19d, mir-20, mir-25, mir-92, mir-93.]

Figure 15.8

(See page 319 for text discussion).

[Alignment labels: STAT, Ex, mir-21; motifs TTACAGGAA and TTCASAGAA marked on the alignment ruler; axis: distance to mir-21. Tetrapoda: Homo, Mus, Sorex*, Canis, Bos, Spermophilus*, Myotis*, Procavia*, Echinops, Monodelphis, Trichosurus*, Platypus*, Taeniopygia*, Gallus, Xenopus. Teleosts: Tetraodon, Takifugu*, Gasterosteus, Oryzias, Danio.]

Figure 16.5

(See page 353 for text discussion).

Figure 17.1

(See page 382 for text discussion).

Figure 17.2

(See page 383 for text discussion).

Figure 19.1

(See page 415 for text discussion).

[Diagram labels: duplication, subfunctionalization, neofunctionalization, nonfunctionalization.]

Figure 19.2

(See page 424 for text discussion).

[Diagram labels: nodes A, B, C, D; SGD; WGD; mutations; lost interaction; gained interaction.]

Figure 20.5

(See page 447 for text discussion).