Systems Biology Volume I: Genomics
Series in Systems Biology Edited by Dennis Shasha, New York University EDITORIAL BOARD Michael Ashburner, University of Cambridge Amos Bairoch, Swiss Institute of Bioinformatics Charles Cantor, Sequenom, Inc. Leroy Hood, Institute for Systems Biology Minoru Kanehisa, Kyoto University Raju Kucherlapati, Harvard Medical School Systems Biology describes the discipline that seeks to understand biological phenomena on a large scale: the association of gene with function, the detailed modeling of the interaction among proteins and metabolites, and the function of cells. Systems Biology has wide-ranging application, as it is informed by several underlying disciplines, including biology, computer science, mathematics, physics, chemistry, and the social sciences. The goal of the series is to help practitioners and researchers understand the ideas and technologies underlying Systems Biology. The series volumes will combine biological insight with principles and methods of computational data analysis.
Cellular Computing, edited by Martyn Amos
Systems Biology, Volume I: Genomics, edited by Isidore Rigoutsos and Gregory Stephanopoulos
Systems Biology, Volume II: Networks, Models, and Applications, edited by Isidore Rigoutsos and Gregory Stephanopoulos
Systems Biology Volume I: Genomics
Edited by
Isidore Rigoutsos & Gregory Stephanopoulos
2007
Oxford University Press, Inc., publishes works that further Oxford University’s objective of excellence in research, scholarship, and education. Oxford New York Auckland Cape Town Dar es Salaam Hong Kong Karachi Kuala Lumpur Madrid Melbourne Mexico City Nairobi New Delhi Shanghai Taipei Toronto With offices in Argentina Austria Brazil Chile Czech Republic France Greece Guatemala Hungary Italy Japan Poland Portugal Singapore South Korea Switzerland Thailand Turkey Ukraine Vietnam
Copyright © 2007 by Oxford University Press, Inc. Published by Oxford University Press, Inc. 198 Madison Avenue, New York, New York 10016 www.oup.com Oxford is a registered trademark of Oxford University Press. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior permission of Oxford University Press. Library of Congress Cataloging-in-Publication Data Systems biology/edited by Isidore Rigoutsos and Gregory Stephanopoulos. v. ; cm.—(Series in systems biology) Includes bibliographical references and indexes. Contents: 1. Genomics—2. Networks, models, and applications. ISBN-13: 978-0-19-530081-9 (v. 1) ISBN 0-19-530081-5 (v. 1) ISBN-13: 978-0-19-530080-2 (v. 2) ISBN 0-19-530080-7 (v. 2) 1. Computational biology. 2. Genomics. 3. Bioinformatics. I. Rigoutsos, Isidore. II. Stephanopoulos, G. III. Series. [DNLM: 1. Genomics. 2. Computational Biology. 3. Models, Genetic. 4. Systems Biology. QU58.5 S995 2006] QH324.2.S97 2006 570—dc22 2005031826
9 8 7 6 5 4 3 2 1 Printed in the United States of America on acid-free paper
To our mothers
Acknowledgments
First and foremost, we wish to thank all the authors who contributed the chapters of these two books. In addition to the professionalism with which they handled all aspects of production, they also applied the highest standards in authoring pieces of work of the highest quality. For their willingness to share their unique expertise on the many facets of systems biology and the energy they devoted to the preparation of their chapters, we are profoundly grateful. Next, we wish to thank Dennis Shasha, the series editor, and Peter Prescott, Senior Editor for Life Sciences, Oxford University Press, for embracing the project from the very first day that we presented the idea to them. Peter deserves special mention for it was his continuous efforts that helped remove a great number of obstacles along the way. We also wish to thank Adrian Fay, who coordinated several aspects of the review process and provided input that improved the flow of several chapters, as well as our many reviewers, Alice McHardy, Aristotelis Tsirigos, Christos Ouzounis, Costas Maranas, Daniel Beard, Daniel Platt, Jeremy Rice, Joel Moxley, Kevin Miranda, Lily Tong, Masaru Nonaka, Michael MacCoss, Michael Pitman, Nikos Kyrpides, Rich Jorgensen, Roderic Guigo, Rosaria De Santis, Ruhong Zhou, Serafim Batzoglou, Steven Gygi, Takis Benos, Tetsuo Shibuya, and Yannis Kaznessis, for providing helpful and detailed feedback on the early versions of the chapters; without their help the books would not have been possible. We are also indebted to Kaity Cheng for helping with all of the administrative aspects of this project. And, finally, our thanks go to our spouses whose understanding and patience throughout the duration of the project cannot be overstated.
Contents

Contributors xi
Systems Biology: A Perspective xiii
1 Prebiotic Chemistry on the Primitive Earth 3
Stanley L. Miller & H. James Cleaves
2 Prebiotic Evolution and the Origin of Life: Is a System-Level Understanding Feasible? 57
Antonio Lazcano
3 Shotgun Fragment Assembly 79
Granger Sutton & Ian Dew
4 Gene Finding 118
John Besemer & Mark Borodovsky
5 Local Sequence Similarities 154
Temple F. Smith
6 Complete Prokaryotic Genomes: Reading and Comprehension 166
Michael Y. Galperin & Eugene V. Koonin
7 Protein Structure Prediction 187
Jeffrey Skolnick & Yang Zhang
8 DNA–Protein Interactions 219
Gary D. Stormo
9 Some Computational Problems Associated with Horizontal Gene Transfer 248
Michael Syvanen
10 Noncoding RNA and RNA Regulatory Networks in the Systems Biology of Animals 269
John S. Mattick
Index 303
Contributors

JOHN BESEMER
Department of Biology, Georgia Institute of Technology, Atlanta, Georgia
[email protected]

MARK BORODOVSKY
Department of Biology, Georgia Institute of Technology, Atlanta, Georgia
[email protected]

H. JAMES CLEAVES
The Scripps Institution of Oceanography, University of California, San Diego, La Jolla, California
[email protected]

IAN DEW
Steck Consulting, LLC, Washington, DC
[email protected]

MICHAEL Y. GALPERIN
National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland
[email protected]

EUGENE V. KOONIN
National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland
[email protected]

ANTONIO LAZCANO
Faculty of Science, Universidad Nacional Autónoma de México, Mexico City, Mexico
[email protected]

JOHN S. MATTICK
Institute for Molecular Bioscience, University of Queensland, Brisbane, Australia
[email protected]

STANLEY L. MILLER
Scripps Institution of Oceanography, University of California, San Diego, La Jolla, California
[email protected]

JEFFREY SKOLNICK
New York State Center of Excellence in Bioinformatics and Life Sciences, University at Buffalo, The State University of New York, Buffalo, New York
[email protected]

TEMPLE F. SMITH
BioMolecular Engineering Resource Center, Boston University, Boston, Massachusetts
[email protected]

GARY D. STORMO
Department of Genetics, Washington University in St. Louis, St. Louis, Missouri
[email protected]

GRANGER SUTTON
J. Craig Venter Institute, Rockville, Maryland
[email protected]

MICHAEL SYVANEN
Department of Medical Microbiology and Immunology, University of California Davis School of Medicine, Sacramento, California
[email protected]

YANG ZHANG
Center for Bioinformatics, University of Kansas, Lawrence, Kansas
[email protected]
Systems Biology: A Perspective
As recently as a decade ago, the core paradigm of biological research followed an established path: beginning with the generation of a specific hypothesis, a concise experiment would be designed that typically focused on studying a small number of genes. Such experiments generally measured a few macromolecules and, perhaps, small metabolites of the target system. The advent of genome sequencing and associated technologies greatly improved scientists’ ability to measure important classes of biological molecules and their interactions. This, in turn, expanded our view of cells with a bevy of previously unavailable data and made possible genome-wide and cell-wide analyses. These newly found lenses revealed that hundreds (sometimes thousands) of molecules and interactions, which were outside the focus of the original study, varied significantly in the course of the experiment. The term systems biology was coined to describe the field of scientific inquiry that takes a global approach to the understanding of cells and the elucidation of biological processes and mechanisms. In many respects, this is also what physiology (from the Greek physis = nature and logos = word-knowledge) focused on for most of the twentieth century. Indeed, physiology’s goal has been the study of the function and characteristics of living organisms and their parts and of the underlying physicochemical phenomena. Unlike physiology, systems biology attempts to interpret and contextualize the large and diverse sets of biological measurements that have become visible through our genomic-scale window on cellular processes by taking a holistic approach and bringing to bear theoretical, computational, and experimental advances in several fields. Indeed, there is considerable excitement that, through this integrative perspective, systems biology will succeed in elucidating the mechanisms that underlie complex phenomena and that would have otherwise remained undiscovered.
For the purposes of our discussion, we will be making use of the following definition: “Systems biology is an integrated approach that brings together and leverages theoretical, experimental, and computational approaches in order to establish connections among important molecules or groups of molecules and to aid the eventual mechanistic explanation of cellular processes and systems.” More specifically, we view systems biology as a field that aims to uncover concrete molecular relationships for targeted analysis through the interpretation
of cellular phenotype in terms of integrated biomolecular networks. The fidelity and breadth of our network and state characterization are intimately related to the degree of our understanding of the system under study. As the readers will find, this view permeates the treatises that are found in these two books. Cells have always been viewed as elegant systems of immense complexity that are, nevertheless, well coordinated and optimized for a particular purpose. This apparent complexity led scientists to take a reductionist approach to research which, in turn, contributed to a rigorous understanding of low-level processes in a piecemeal fashion. Nowadays, completed genomic sequences and systems-level probing hold the potential to accelerate the discovery of unknown molecular mechanisms and to organize the existing knowledge in a broader context of high-level cellular understanding. Arguably, this is a formidable task. In order to improve the chances of success, we believe that one must anchor systems biology analyses to specific questions and build upon the existing core infrastructure that the earlier, targeted research studies have allowed us to generate. The diversity of molecules and reactions participating in the various cellular functions can be viewed as an impediment to the pursuit of a more complete understanding of cellular function. However, it actually represents a great opportunity as it provides countless possibilities for modifying the cellular machinery and commandeering it toward a specific goal. In this context, we distinguish two broad categories of questions that can guide the direction of systems biology research. The first category encompasses topics of medical importance and is typically characterized by forward-engineering approaches that focus on preventing or combating disease. 
The second category includes problems of industrial interest, such as the genetic engineering of microbes so as to maximize product formation, the creation of robust production strains, and so on. The applications of the second category comprise an important reverse-engineering component whereby microbes with attractive properties are scrutinized for the purpose of transferring any insights learned from their functions to the further improvement and optimization of production strains.

PRIOR WORK
As already mentioned, and although the term systems biology did not enter the popular lexicon until recently, some of the activities it encompasses have been practiced for several decades. As we cannot possibly be exhaustive, we present a few illustrative examples of approaches that have been developed in recent years and successfully applied to relatively small systems. These examples can serve as useful guides in our attempt to tackle increasingly larger challenges.
Metabolic Control Analysis (MCA)
Metabolic pathways and, in general, networks of reactions are characterized by substantial stoichiometric and (mostly) kinetic complexity in their own right. The commonly applied assumption of a single rate-limiting step leads to great simplification of the reaction network and often yields analytical expressions for the conversion rates. However, this assumption is not justified for most biological systems, where kinetic control is not concentrated in a single step but rather is distributed among several enzymatic steps. Consequently, kinetics and flux control of a bioreaction network represent properties of the entire system and can be determined from the characteristics of individual reactions in a bottom-up approach or from the response of the overall system in a top-down approach. The concepts of MCA and distribution of kinetic control in a reaction pathway have had a profound impact on the identification of target enzymes whose genetic modification permitted the amplification of the product flux through a pathway.

Signaling Pathways
Signal transduction is the process by which cells communicate with each other and their environment and involves a multitude of proteins that can be in active or inactive states. In their active (phosphorylated) state they act as catalysts for the activation of subsequent steps in the signaling cascade. The end result is the activation of a transcription factor, which, in turn, initiates a gene transcription event. Until recently, and even though several of the known proteins participate in more than one signaling cascade, such systems were being studied in isolation from one another. A natural outcome of this approach was of course the ability to link a single gene with a single ligand in a causal relationship whereby the ligand activates the gene. However, such findings are not representative in light of the fact that signaling pathways branch and interact with one another, creating a rather intricate and complex signaling network. Consequently, more tools, computational as well as experimental, are required if we are to improve our understanding of signal transduction. Developing such tools is among the goals of the recently formed Alliance for Cellular Signaling, an NIH-funded project involving several laboratories and research centers (www.signaling-gateway.org).

Reconstruction of Flux Maps
Metabolic pathway fluxes are defined as the actual rates of metabolite interconversion in a metabolic network and represent the most informative measures of the actual physiological state of cells and organisms. Their dependence on enzymatic activities and metabolite concentrations makes them an accurate representation of carbon and energy flows through the various pathway branches. Additionally, they are very
important in identifying critical reaction steps that impact flux control for the entire pathway. Thus, flux determination is an essential component of strain evaluation and metabolic engineering. Intracellular flux determination requires the enumeration and satisfaction of all intracellular metabolite balances along with the use of sufficient measurements, typically derived from the introduction of isotopic tracers and metabolite and mass isotopomer measurement by gas chromatography–mass spectrometry. It is essentially a problem of constrained parameter estimation in overdetermined systems, with overdetermination providing the requisite redundancy for reliable flux estimation. These approaches are basically methods of network reconstruction, and the obtained fluxes represent properties of the entire system. As such, the fluxes accurately reflect changes introduced through genetic or environmental modifications and, thus, can be used to assess the impact of such modifications on cell physiology and product formation, and to guide the next round of cell modifications.

Metabolic Engineering
Metabolic engineering is the field of study whose goal is the improvement of microbial strains with the help of modern genetic tools. The strains are modified by introducing specific transport, conversion, or deregulation changes that lead to flux redistribution and the improvement of product yield. Such modifications rely to a significant extent on modern methods from molecular biology. Consequently, the following central question arises: “What is the real difference between genetic engineering and metabolic engineering?” We submit that the main difference is that metabolic engineering is concerned with the entire metabolic system, whereas genetic engineering specifically focuses on a particular gene or a small collection of genes. It should be noted that over- or underexpression of a single gene or a few genes may have little or no impact on the attempt to alter cell physiology. On the other hand, by examining the properties of the metabolic network as a whole, metabolic engineering attempts to identify targets for amplification as well as rationally assess the effect that such changes will have on the properties of the overall network. As such, metabolic engineering can be viewed as a precursor to functional genomics and systems biology in the sense that it represents the first organized effort to reconstruct and modify pathways using genomic tools while being guided by the information generated by postgenomic developments.

WORDS OF CAUTION
In light of the many exciting possibilities, there are high expectations for the field of systems biology. However, as we move forward, we should not lose sight of the fact that the field is trying to tackle a
problem of considerable magnitude. Consequently, any expectations of immediate returns on the scientific investment should be appropriately tempered. As we set out to forecast future developments in this field, it is important to keep in mind several points. Despite the wealth of available genomic data, many regions in the genomes of interest are functional but have not yet been identified as such. In order to practice systems biology, lists of “parts” and “relationships” that are as complete as possible are needed. In the absence of such complete lists, one can at best hope to derive an approximate description of the actual system’s behavior. A prevalent misconception among scientists is that nearly complete lists of parts are already in place. Unfortunately, this is not the case: the currently available parts lists are incomplete, as evidenced by the fact that genomic maps are continuously updated through the addition or removal of (occasionally substantial numbers of) genes, by the discovery of more regions that code for RNA genes, and so on. Despite the wealth of available genomic data, knowledge about existing optimal solutions to important problems continues to elude us. The current efforts in systems biology are largely shaped by the available knowledge. Consequently, optimal solutions that are implemented by metabolic pathways that are unknown or not yet understood are beyond our reach. A characteristic case in point is the recent discovery, in sludge microbial communities, of a Rhodocyclus-like polyphosphate-accumulating organism that exhibits enhanced biological phosphorus removal abilities. Clearly, this microbe is a great candidate to be part of a biological treatment solution to the problem of phosphorus removal from wastewater. Alas, this is not yet an option, as virtually nothing is known about the metabolic pathways that confer phosphorus removal ability to this organism.
Despite the wealth of available genomic data, there are still many important molecular interactions of whose existence we are unaware. Continuing our parts and relationships comment from above, it is worth noting another prevalent misconception among scientists: that nearly complete lists of relationships are already in place. For many years, pathway analysis and modeling have been characterized by protein-centric views that comprised concrete collections of proteins participating in well-understood interactions. Even for well-studied pathways, new important protein interactions are continuously discovered. Moreover, accumulating experimental evidence shows that numerous important interactions are in fact effected by the action of RNA molecules on DNA molecules and, by extension, on proteins. Arguably, posttranscriptional gene silencing and RNA interference represent one area of research activity with the potential to substantially revise our current
understanding of cellular processes. In fact, the already accumulated knowledge suggests that the traditional protein-centric views of the systems of interest are likely incomplete and need to be augmented appropriately. This in turn has direct consequences for the modeling and simulation efforts and for our understanding of the cell from an integrated perspective. Constructing biomolecular networks for new systems will require significant resources and expertise. Biomolecular networks incorporate a multitude of relationships that involve numerous components. For example, constructing gene interaction maps requires large experimental investments and computational analysis. As for global protein–protein interaction maps, these exist for only a handful of model species. But even reconstructing well-studied and well-documented networks such as metabolic pathways in a genomic context can prove a daunting task. The magnitude of such activities has already grown beyond the capabilities of a single investigator or a single laboratory. Even when one works with a biomolecular network database, the system picture may be incomplete or only partially accurate. In the postgenomic era, the effort to uncover the structure and function of genetic regulatory networks has led to the creation of many databases of biological knowledge. Each of these databases attempts to distill the most salient features from incomplete, and at times flawed, knowledge. As an example, several databases exist that enumerate protein interactions for the yeast genome and have been compiled using the yeast two-hybrid screen. These databases currently document in excess of 80,000 putative protein–protein interactions; however, the knowledge content of these databases has only a small overlap, suggesting a strong dependence of the results on the procedures used and great variability in the criteria that were applied before an interaction could be entered in the corresponding knowledge repository.
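The small overlap among such interaction repositories is straightforward to quantify once interactions are normalized as unordered protein pairs. The sketch below uses invented toy interaction lists (the protein names and pairs are hypothetical, not real database records) to illustrate one common comparison, the Jaccard similarity of the two interaction sets:

```python
# Hypothetical sketch: comparing the contents of two protein-interaction
# databases. The protein names and interaction lists are invented toy data.

def normalize(interactions):
    # Treat interactions as undirected: (A, B) and (B, A) are the same pair.
    return {frozenset(pair) for pair in interactions}

db1 = [("YFG1", "YFG2"), ("YFG2", "YFG3"), ("YFG4", "YFG5")]
db2 = [("YFG2", "YFG1"), ("YFG4", "YFG6"), ("YFG7", "YFG8")]

s1, s2 = normalize(db1), normalize(db2)
shared = s1 & s2
jaccard = len(shared) / len(s1 | s2)
print(f"shared interactions: {len(shared)}, Jaccard similarity: {jaccard:.2f}")
# → shared interactions: 1, Jaccard similarity: 0.20
```

Normalizing to unordered pairs matters: without it, (A, B) in one database and (B, A) in the other would be counted as a disagreement rather than an overlap, understating the (already small) agreement between repositories.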
As one might have expected, the situation is less encouraging for those organisms with lower levels of direct interaction experimentation and scrutiny (e.g., Escherichia coli) or which possess larger protein interaction spaces (e.g., mouse and human); in such cases, the available databases capture only a minuscule fraction of the knowledge spectrum. Carrying out the necessary measurements requires significant resources and expertise. Presently, the only broadly available tool for measuring gene expression is the DNA chip (in its various incarnations). Conducting a large-scale transcriptional experiment will incur significant, unavoidable costs and require that the involved scientists be trained appropriately. Going a step further, in order to measure protein levels, protein states, regulatory elements, and metabolites, one needs access to complex and specialized equipment. Practicing systems biology will
necessitate the creation of partnerships and the collaboration of faculty members across disciplines. Biologists, engineers, chemists, physicists, mathematicians, and computer scientists will need to learn to speak one another’s language and to work together. It is unlikely that a single, complex microarray experiment will shed light on the interactions that a practitioner seeks to understand. Even leaving aside the large amounts of available data and the presence of noise, many of the relevant interactions will simply not incur any large or direct transcriptional changes. And, of course, one should remain mindful of the fact that transcript levels do not necessarily correlate with protein levels, and that protein levels do not correlate well with activity levels. The situation is accentuated further if one considers that both transcript and protein levels are under the control of agents such as microRNAs that were discovered only recently; the action of such agents may also vary temporally, contributing to variations across repetitions of the same experiment. Patience, patience, and patience: the hypotheses that are derived from systems-based approaches are more complex than before and disproportionately harder to validate. For a small system, it is possible to design experiments that will test a particular hypothesis. However, it is not obvious how this can be done when the system under consideration encompasses numerous molecular players. Typically, the experiments that have been designed to date strove to keep most parameters constant while studying the effect of a small number of changes introduced to the system in a controlled manner. This conventional approach will need to be reevaluated since now the number of involved parameters is dramatically higher and the demands on system controls may exceed the limits of present experimentation.

ABOUT THIS BOOK
From the above, it should be clear that the systems biology field comprises multifaceted research work across several disciplines. It is also hierarchical in nature with single molecules at one end of the hierarchy and complete, interacting organisms at the other. At each level of the hierarchy, one can distinguish “parts” or active agents with concrete static characteristics and dynamic behavior. The active agents form “relationships” by interacting among themselves within each level, but can also be involved in inter-level interactions (e.g., a transcription factor, which is an agent associated with the proteomic level, interacts at specific sites with the DNA molecule, an agent associated with the genomic level of the hierarchy). Clearly, intimate knowledge and understanding of the specifics at each level will greatly facilitate the undertaking of systems
biology activities. Experts are needed at all levels of the hierarchy who will continue to generate results with an eye toward the longer-term goal of the eventual mechanistic explanation of cellular processes and systems. The two books that we have edited try to reflect the hierarchical nature of the problem as well as this need for experts. Each chapter is contributed by authors who have been active in the respective domains for many years and who have gone to great lengths to ensure that their presentations serve the following two purposes: first, they provide a very extensive overview of the domain’s main activities by describing their own and their colleagues’ research efforts; and second, they enumerate currently open questions that interested scientists should consider tackling. The chapters are organized into a “Genomics” and a “Networks, Models, and Applications” volume, and are presented in an order that corresponds roughly to a “bottom-up” traversal of the systems biology hierarchy. The “Genomics” volume begins with a chapter on prebiotic chemistry on the primitive Earth. Written by Stanley Miller and James Cleaves, it explores and discusses several geochemically reasonable mechanisms that may have led to chemical self-organization and the origin of life. The second chapter is contributed by Antonio Lazcano and examines possible events that may have led to the appearance of encapsulated replicative systems, the evolution of the genetic code, and protein synthesis. In the third chapter, Granger Sutton and Ian Dew present and discuss algorithmic techniques for the problem of fragment assembly which, combined with the shotgun approach to DNA sequencing, allowed for significant advances in the field of genomics. John Besemer and Mark Borodovsky review, in chapter 4, all of the major approaches in the development of gene-finding algorithms. 
In the fifth chapter, Temple Smith, through a personal account, covers approximately twenty years of work in biological sequence alignment algorithms that culminated in the development of the Smith–Waterman algorithm. In chapter 6, Michael Galperin and Eugene Koonin discuss the state of the art in the field of functional annotation of complete genomes and review the challenges that proteins of unknown function pose for systems biology. The state of the art of protein structure prediction is discussed by Jeffrey Skolnick and Yang Zhang in chapter 7, with an emphasis on knowledge-based comparative modeling and threading approaches. In chapter 8, Gary Stormo presents and discusses experimental and computational approaches that allow the determination of the specificity of a transcription factor and the discovery of regulatory sites in DNA. Michael Syvanen presents and discusses the phenomenon of horizontal gene transfer in chapter 9 and also presents computational questions that relate to the phenomenon. The first volume concludes with a chapter
by John Mattick on non-protein-coding RNA and its involvement in regulatory networks that are responsible for the various developmental stages of multicellular organisms. The “Networks, Models, and Applications” volume continues our ascent of the systems biology hierarchy. The first chapter, which is written by Cristian Ruse and John Yates III, introduces mass spectrometry and discusses its numerous uses as an analytical tool for the analysis of biological molecules. In chapter 2, Chris Floudas and Ho Ki Fung review mathematical modeling and optimization methods for the de novo design of peptides and proteins. Chapter 3, written by William Swope, Jed Pitera, and Robert Germain, describes molecular modeling and simulation techniques and their use in modeling and studying biological systems. In chapter 4, Glen Held, Gustavo Stolovitzky, and Yuhai Tu discuss methods that can be used to estimate the statistical significance of changes in the expression levels that are measured with the help of global expression assays. The state of the art in high-throughput technologies for interrogating cellular signaling networks is discussed in chapter 5 by Jason Papin, Erwin Gianchandani, and Shankar Subramaniam, who also examine schemes by which one can generate genotype–phenotype relationships given the available data. In chapter 6, Dimitrios Mastellos and John Lambris use the complement system as a platform to describe systems approaches that can help elucidate gene regulatory networks and innate immune pathway associations, and eventually develop effective therapeutics. Chapter 7, written by Sang Yup Lee, Dong-Yup Lee, Tae Yong Kim, Byung Hun Kim, and Sang Jun Lee, discusses how computational and “-omics” approaches can be combined in order to appropriately engineer “improved” versions of microbes for industrial applications. 
In chapter 8, Markus Herrgård and Bernhard Palsson discuss the design of metabolic and regulatory network models for complete genomes and their use in exploring the operational principles of biochemical networks. Raimond Winslow, Joseph Greenstein, and Patrick Helm review and discuss the current state of the art in the integrative modeling of the cardiovascular system in chapter 9. The volume concludes with a chapter on embryonic stem cells and their uses in testing and validating systems biology approaches, written by Andrew Thomson, Paul Robson, Huck Hui Ng, Hasan Otu, and Bing Lim. The companion website for Systems Biology Volumes I and II provides color versions of several figures reproduced in black and white in print. Please refer to http://www.oup.com/us/sysbio to view these figures in color: Volume I, figures 7.5 and 7.6; Volume II, figures 3.10, 5.1, 7.4, and 9.8.
1 Prebiotic Chemistry on the Primitive Earth Stanley L. Miller & H. James Cleaves
The origin of life remains one of humankind's last great unanswered questions, as well as one of the most experimentally challenging research areas. It also raises fundamental cultural issues that at times fuel divisive debate. Modern scientific thinking on the topic traces its history across millennia of controversy, although current models are perhaps no older than 150 years. Much has been written about pre-nineteenth-century thought on the origin of life. Early views were wide-ranging and often surprisingly prescient; however, since this chapter deals primarily with modern thinking and experimentation regarding the synthesis of organic compounds on the primitive Earth, the interested reader is referred to several excellent resources [1–3]. Despite recent progress in the field, a single definitive description of the events leading up to the origin of life on Earth some 3.5 billion years ago remains elusive. The vast majority of theories regarding the origin of life on Earth speculate that life began with some mixture of organic compounds that somehow became organized into a self-replicating chemical entity. Although the idea of panspermia (which postulates that life was transported preformed from space to the early sterile Earth) cannot be completely dismissed, it seems largely unsupported by the available evidence, and in any event would simply push the problem to some other location. Panspermia notwithstanding, any discussion of the origin of life is of necessity a discussion of organic chemistry. Not surprisingly, ideas regarding the origin of life have developed to a large degree concurrently with discoveries in organic chemistry and biochemistry. This chapter will attempt to summarize key historical and recent findings regarding the origin of the organic building blocks thought to be important for the origin of life on Earth.
In addition to the background readings regarding historical perspectives suggested above, the interested reader is referred to several additional excellent texts which remain scientifically relevant [4,5; see also the journal Origins of Life and the Evolution of the Biosphere].
BOTTOM-UP AND TOP-DOWN APPROACHES
There are two fundamental, complementary approaches to the study of the origin of life. One, the top-down approach, considers the origin of the constituents of modern biochemistry and their present organization. The other, the bottom-up approach, considers the compounds thought to be plausibly produced under primitive planetary conditions and how they may have come to be assembled. The crux of the study of the origin of life is the overlap between these two regimes.

The top-down approach is biased by the general uniformity of modern biochemistry across the three major extant domains of life (Archaea, Bacteria, and Eukarya). These clearly originated from a common ancestor, based on the universality of the genetic code they use to form proteins and the homogeneity of their metabolic processes. Investigations have assumed that whatever the first living being was, it must have been composed of biochemicals similar to those one would recover from a modern organism (lipids, nucleic acids, proteins, cofactors, etc.), which somehow were organized into a self-propagating assemblage. This bias seems to be legitimized by the presence of biochemical compounds in extraterrestrial material and the relative success of laboratory syntheses of these compounds under simulated prebiotic conditions. It would be a simpler explanatory model if the components of modern biochemistry and the components of the first living things were essentially similar, although this need not necessarily be the case.

The bottom-up approach is similarly biased by present-day biochemistry; however, some more exotic chemical schemes are possible within this framework. All living things are composed of but a few atomic elements (CHNOPS, in addition to other trace components), which do not necessarily reflect their cosmic or terrestrial abundances, which raises the question of why these elements were selected for life.
This may be due to some intrinsic aspect of their chemistry, or some of the components may have been selected based on the metabolism of more complicated already living systems, or there may have been selection based on prebiotic availability, or some mixture of the three.
HISTORICAL FOUNDATIONS OF MODERN THEORY
The historical evolution of thinking on the origin of life is intimately tied to developments in other fields, including chemistry, biology, geology, and astronomy. Importantly, the concept of biological evolution proposed by Darwin led to the early logical conclusion that there must have been a first organism, and hence a distinct origin of life. In part of a letter that Darwin sent in 1871 to Joseph Dalton Hooker, Darwin summarized his rarely expressed ideas on the emergence of
life, as well as his views on the molecular nature of basic biological processes:

It is often said that all the conditions for the first production of a living being are now present, which could ever have been present. But if (and oh what a big if) we could conceive in some warm little pond with all sorts of ammonia and phosphoric salts, –light, heat, electricity &c present, that a protein compound was chemically formed, ready to undergo still more complex changes, at the present such matter wd be instantly devoured, or absorbed, which would not have been the case before living creatures were formed....

By the time Darwin wrote to Hooker, DNA had already been discovered, although its role in genetic processes would remain unknown for almost eighty years. In contrast, the role that proteins play in biological processes had already been firmly established, and major advances had been made in the chemical characterization of many of the building blocks of life. By the time Darwin wrote this letter, the chemical gap separating organisms from the nonliving world had been bridged in part by laboratory syntheses of organic molecules. In 1827 Berzelius, probably the most influential chemist of his day, had written, “art cannot combine the elements of inorganic matter in the manner of living nature.” Only one year later his former student Friedrich Wöhler demonstrated that urea could be formed in high yield by heating ammonium cyanate “without the need of an animal kidney” [6]. Wöhler’s work represented the first synthesis of an organic compound from inorganic starting materials. Although it was not immediately recognized as such, a new era in chemical research had begun. In 1850 Adolph Strecker achieved the laboratory synthesis of alanine from a mixture of acetaldehyde, ammonia, and hydrogen cyanide. This was followed by the experiments of Butlerov showing that the treatment of formaldehyde with alkaline catalysts leads to the synthesis of sugars.
By the end of the nineteenth century a large amount of research on organic synthesis had been performed, and it led to the abiotic formation of fatty acids and sugars using electric discharges with various gas mixtures [7]. This work was continued into the twentieth century by Löb, Baudish, and others on the synthesis of amino acids by exposing wet formamide (HCONH2) to a silent electrical discharge [8] and to UV light [9]. However, since it was generally assumed that the first living beings had been autotrophic organisms, the abiotic synthesis of organic compounds did not appear to be a necessary prerequisite for the emergence of life. These organic syntheses were not conceived as laboratory simulations of Darwin’s warm little pond,
but rather as attempts to understand the autotrophic mechanisms of nitrogen assimilation and CO2 fixation in green plants. It is generally believed that after Pasteur disproved spontaneous generation using his famous swan-necked flask experiments, the discussion of life’s beginnings had been relegated to the realm of useless speculation. However, the scientific literature of the first part of the twentieth century shows many attempts by scientists to solve this problem. The list covers a wide range of explanations from the ideas of Pflüger on the role of hydrogen cyanide in the origin of life, to those of Arrhenius on panspermia. It also includes Troland’s hypothesis of a primordial enzyme formed by chance in the primitive ocean, Herrera’s sulphocyanic theory on the origin of cells, Harvey’s 1924 suggestion of a heterotrophic origin in a high-temperature environment, and the provocative 1926 paper that Hermann J. Muller wrote on the abrupt, random formation of a single, mutable gene endowed with catalytic and replicative properties [10]. Most of these explanations went unnoticed, in part because they were incomplete, speculative schemes largely devoid of direct evidence and not subject to experimentation. Although some of these hypotheses considered life an emergent feature of nature and attempted to understand its origin by introducing principles of evolutionary development, the dominant view was that the first forms of life were photosynthetic microbes endowed with the ability to fix atmospheric CO2 and use it with water to synthesize organic compounds. Oparin’s proposal stood in sharp contrast with the then prevalent idea of an autotrophic origin of life.
Trained as both a biochemist and an evolutionary biologist, Oparin found it was impossible to reconcile his Darwinian beliefs in a gradual evolution of complexity with the commonly held suggestion that life had emerged already endowed with an autotrophic metabolism, which included chlorophyll, enzymes, and the ability to synthesize organic compounds from CO2. Oparin reasoned that since heterotrophic anaerobes are metabolically simpler than autotrophs, the former would necessarily have evolved first. Thus, based on the simplicity and ubiquity of fermentative metabolism, Oparin [11] suggested in a small booklet that the first organisms must have been heterotrophic bacteria that could not make their own food but obtained organic material present in the primitive milieu. Careful reading of Oparin’s 1924 pamphlet shows that, in contrast to common belief, at first he did not assume an anoxic primitive atmosphere. In his original scenario he argued that while some carbides, that is, carbon–metal compounds, extruded from the young Earth’s interior would react with water vapor leading to hydrocarbons, others would be oxidized to form aldehydes, alcohols, and ketones.
These molecules would then react among themselves and with NH3 originating from the hydrolysis of nitrides:

FemCn + 4m H2O → m Fe3O4 + C3nH8m
FeN + 3H2O → Fe(OH)3 + NH3

to form “very complicated compounds,” as Oparin wrote, from which proteins and carbohydrates would form. Oparin’s ideas were further elaborated in a more extensive book published with the same title in Russian in 1936. In this new book his original proposal was revised, leading to the assumption of a highly reducing milieu in which iron carbides of geological origin would react with steam to form hydrocarbons. Their oxidation would yield alcohols, ketones, aldehydes, and so on, which would then react with ammonia to form amines, amides, and ammonium salts. The resulting protein-like compounds and other molecules would form a hot, dilute soup, which would aggregate to form colloidal systems, that is, coacervates, from which the first heterotrophic microbes evolved. Like Darwin, Oparin did not address in his 1938 book the origin of nucleic acids, because their role in genetic processes was not yet suspected. At around the same time, J.B.S. Haldane [12] published a similar proposal, and thus the theory is often credited to both scientists. For Oparin [13], highly reducing atmospheres corresponded to mixtures of CH4, NH3, and H2O with or without added H2. The atmosphere of Jupiter contains these chemical species, with H2 in large excess over CH4. Oparin’s proposal of a primordial reducing atmosphere was a brilliant inference from the then fledgling knowledge of solar atomic abundances and planetary atmospheres, as well as from Vernadsky’s idea that since O2 is produced by plants, the early Earth would be anoxic in the absence of life.
The benchmark contributions of Oparin’s 1938 book include the hypothesis that heterotrophs and anaerobic fermentation were primordial, the proposal of a reducing atmosphere for the prebiotic synthesis and accumulation of organic compounds, the postulated transition from heterotrophy to autotrophy, and the considerable detail in which these concepts are addressed. The last major theoretical contribution to the modern experimental study of the origin of life came from Harold Clayton Urey. An avid experimentalist with a wide range of scientific interests, Urey offered explanations for the composition of the early atmosphere based on then popular ideas of solar system formation, which were in turn based on astronomical observations of the atmospheres of the giant planets and star composition. In 1952 Urey published The Planets, Their Origin and Development [14], which delineated his ideas of the formation
of the solar system, a formative framework into which most origin of life theories are now firmly fixed, albeit in slightly modified fashions. In contrast, shortly thereafter, Rubey [15] proposed an outgassing model based on an early core differentiation and assumed that the early atmosphere would have been reminiscent of modern volcanic gases. Rubey estimated that a CH4 atmosphere could not have persisted for much more than 10^5 to 10^8 years due to photolysis. The Urey/Oparin atmosphere models (CH4, NH3, H2O) are based on astrophysical and cosmochemical models, while Rubey’s CO2, N2, H2O model is based on extrapolation of the geological record. Although this early theoretical work has had a great influence on subsequent research, modern thinking on the origin and evolution of the chemical elements, the solar system, the Earth, and the atmosphere and oceans has not been shaped largely with the origin of life as a driving force. On the contrary, current origin of life theories have been modified to fit contemporary models in geo- and cosmochemistry.

Life, Prebiotic Chemistry, Carbon, and Water
A brief justification is necessary for the discussion that will follow. One might ask why the field of prebiotic chemistry has limited itself to aqueous reactions that produce reduced carbon compounds. First, the necessary bias introduced by the nature of terrestrial organisms must be considered. There is only one example of a functioning biology, our own, and it is entirely governed by the reactions of reduced carbon compounds in aqueous media. The question naturally arises whether there might be other types of chemistry that might support a functioning biology. Hydrogen is the most abundant atom in the universe, tracing its formation to the time shortly after the Big Bang. Besides helium and small amounts of lithium, the synthesis of the heavier elements had to await later cycles of star formation and supernova explosions. Due to the high proportion of oxygen and hydrogen in the early history of the solar system, most other atomic nuclei ended up as either their oxides or hydrides. Water can be considered as the hydride of oxygen or the oxide of hydrogen. Water is one of the most abundant compounds in the universe. Life in the solid state would be difficult, as diffusion of metabolites would occur at an appallingly slow pace. Conversely, it is improbable that life in the gas phase would be able to support the stable associations required for the propagation of genetic information, and large molecules are generally nonvolatile. Thus it would appear that life would need to exist in a liquid medium. The question then becomes what solvent molecules are prevalent and able to exist in the liquid phase over the range of temperatures where reasonable reaction rates might proceed while at the same time preserving the integrity of the solute compounds. The high temperature limit is set by the
decomposition of chemical compounds, while the low temperature limit is determined by the reactivity of the solutes. Water has the largest liquid stability range of any known common molecular compound at atmospheric pressure, and the dielectric constant and high heat capacity of water are uniquely suited to many geochemical processes. There are no other elements besides carbon that appear able to furnish the immense variety of chemical compounds that allow for a diverse biochemistry. Carbon is able to bond with a large variety of other elements to generate stable heteroatomic bonds, as well as with itself to give a huge inventory of carbon-based molecules. In addition, carbon has the exceptional ability to form stable double-bonded structures with itself, which are required for generating fixed molecular shapes and the planar molecules necessary for molecular recognition. Most of the fundamental processes of life at the molecular level are based on molecular recognition, which depends on the ability of molecules to possess functional groups that allow for weak interactions such as hydrogen bonding and π-stacking. Carbon appears unique in the capacity to form stable alcohols, amines, ketones, and so on. While silicon is immediately below carbon in the periodic table, its polymers are generally unstable, especially in water, and silicon is unable to form stable double bonds with itself. Organisms presently use energy, principally sunlight, to transform environmental precursors such as CO2, H2O, and N2 into their constituents. Although silicon is more prevalent in the Earth’s crust than carbon, and both are generated copiously in stars, silicon is unable to support the same degree of molecular complexity as carbon. Silicon is much less soluble in water than are carbon species, and does not have an appreciably abundant gas phase form such as CH4 or CO2, making the metabolism of silicon more difficult for a nascent biology.
THE PRIMITIVE EARTH AND SOURCES OF BIOMOLECULES
The origin of life can be constrained into a relatively short period of the Earth’s history. On the upper end, the age of the solar system has been determined to be approximately 4.65 billion years from isotopic data from terrestrial minerals, lunar samples, and meteorites, and the Earth–moon system is estimated to be approximately 4.5 billion years old. The early age limit for the origin of life on Earth is also constrained by the lunar cratering record, which suggests that the flux of large asteroids impacting the early Earth’s surface until ~3.9 billion years ago was sufficient to boil the terrestrial oceans and sterilize the planet. On the more recent end, there is putative isotopic evidence for biological activity from ~3.8 billion years ago (living systems tend to incorporate the lighter isotope of carbon, 12C, preferentially over 13C during carbon fixation due to metabolic kinetic isotope effects).
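The carbon-isotope biosignature mentioned above is conventionally expressed in δ13C notation, the per-mil deviation of a sample's 13C/12C ratio from a reference standard. A minimal sketch (the VPDB reference ratio is the standard value; the sample ratio below is a hypothetical illustration, not data from the text):

```python
# delta-13C: per-mil deviation of a sample's 13C/12C ratio from the VPDB standard.
R_VPDB = 0.0111802  # 13C/12C ratio of the VPDB reference standard

def delta13C(r_sample: float) -> float:
    """Per-mil (per thousand) deviation of a sample's 13C/12C ratio from VPDB."""
    return (r_sample / R_VPDB - 1.0) * 1000.0

# Carbon fixation discriminates against the heavier 13C, so biogenic carbon
# shows negative delta-13C, typically around -20 to -30 per mil.
r_biogenic = 0.0108853  # hypothetical ratio measured in an organic inclusion
d = delta13C(r_biogenic)
```

A measured δ13C well below zero in ancient carbon is therefore read as a (debated) fingerprint of biological fractionation.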
There is more definitive fossil evidence from ~3.5 billion years ago in the form of small organic inclusions in cherts morphologically similar to cyanobacteria, as well as stromatolitic assemblages (layered mats reminiscent of the layered deposits created by modern microorganismal communities). Thus the time window for the origin of the predecessor of all extant life appears to be between ~3.9 billion and 3.8 billion years ago. The accumulation and organization of organic material leading to the origin of life must have occurred during the same period. While some authors have attempted to define a reasonable time frame for biological organization based on the short time available [16], it has been pointed out that the actual time required could be considerably longer or shorter [17]. It should be borne in mind that there is some uncertainty in many of the ages mentioned above. In any event, life would have had to originate in a relatively short period, and the synthesis and accumulation of the organic compounds for this event must have preceded it in an even shorter time period. The synthesis and survival of organic biomonomers on the primitive Earth would have depended on the prevailing environmental conditions. Unfortunately, to a large degree these conditions are poorly defined by geological evidence.

Solar System Formation and the Origin of the Earth
If the origin of life depends on the synthesis of organic compounds, then the source and nature of these compounds is the crucial factor in any subsequent discussion of molecular organization. The origin of terrestrial prebiotic organic compounds depends on the primordial atmospheric composition. This in turn is determined by the oxidation state of the early mantle, which depends on the manner in which the Earth formed. Discussions of each of these processes are unfortunately compromised by the paucity of direct geological evidence remaining from the time period under discussion, and are therefore somewhat speculative. While a complete discussion of each of these processes is outside the scope of this chapter, they are crucial for understanding the uncertainty surrounding modern thinking regarding the origin of the prebiotic compounds necessary for the origin of life. According to the current model, the solar system is thought to have formed by the coalescence of a nebular cloud that accreted into small bodies called planetesimals, which eventually aggregated to form the planets [18]. In brief, the sequence of events is thought to have commenced when a gas cloud enriched in heavier elements produced in supernovae began to collapse gravitationally on itself. This cool diffuse cloud gradually became concentrated into denser regions where more complex chemistry became possible, and in so doing began to heat up. As this occurred, the complex components of the gas cloud began to differentiate in what may be thought of as a large distillation process. The cloud condensed, became more disk-like, and
began to rotate to conserve angular momentum. Once the center of the disk achieved temperatures and pressures high enough to begin hydrogen fusion, the sun was born. The intense radiative power of the nascent sun drove the lower boiling point elements outward toward the edge of the solar system, where they condensed and froze out. Farther out in the disk, dust-sized grains were also in the process of coalescing due to gravitational attraction. These small grains slowly agglomerated to form larger and larger particles that eventually formed planetesimals and finally planets. It is noteworthy that the moon is thought to have formed from the collision of a Mars-sized body with the primitive Earth. The kinetic energy of such a large collision must have been very great, so great in fact that it would have provided enough energy to entirely melt the newly formed Earth and probably strip away its original atmosphere. Discussions of planetary formation and atmospheric composition are likely to be relevant to various other planets in our solar system and beyond; thus the following discussion may be generalizable.

The Early Atmosphere
The temperature at which the planets accreted is important for understanding the early Earth’s atmosphere, which is essential for understanding the possibility of terrestrial prebiotic organic synthesis. This depends on the rate of accretion. If the planet accreted slowly, more of the primitive gases derived from planetesimals, likely reminiscent of the reducing chemistry of the early solar nebula, could have been retained. If it accreted rapidly, the presently favored model, the original atmosphere would have been lost, and the primitive atmosphere would have been the result of outgassing of retained mineral-associated volatiles and subsequent extraterrestrial delivery of volatiles. CH4, CO2, CO, NH3, H2O, and H2 are the most abundant molecular gas species in the solar system, and this was likely true on the early Earth as well, although it is the relative proportions of these that are of interest. It remains contentious whether the Earth’s water was released via volcanic exhalation of water associated with mineral hydrates accreted during planetary formation or whether it was accreted from comets and other extraterrestrial bodies during planet formation. It seems unlikely that the Earth kept much of its earliest atmosphere during early accretion; thus the primordial atmosphere would have been derived from outgassing of the planet’s interior, which is thought to have occurred at temperatures between 300 and 1500 °C. Modern volcanoes emit a wide range of gas mixtures. Most modern volcanic emissions are CO2 and SO2, rather than CH4 and H2S (table 1.1). It seems likely that most of the gases released today are from the reactions of reworked crustal material and water, and do not represent components of the Earth’s deep interior. Thus modern volcanic gases may tell us little about the early Earth’s atmosphere.
Table 1.1 Gases detected in modern volcanic emissions (adapted from Miller and Orgel [4])

Location                        CO2    CO    CH4    NH3    H2     HCl    H2S    SO2    H2O
White Island, New Zealand       57.9   —     0.5    —      41.5   —      —      —      —
Nyerogongo Lava Lake, Congo     84.4   5.1   —      —      1.6    —      —      9.0    43.2
Mount Hekla, Iceland            23     3     —      —      16     52     —      6      —
Lipari Island, Italy            93.0   —     —      —      —      0.5    2.9    3.6    98.9
Larderello, Italy               92.7   —     0.92   1.72   1.76   —      —      2.45   —
Zavaritskii crater, Kamchatka   —      67    —      —      —      33     —      —      —
Same crater, B1                 —      21    —      —      42     25     —      12     —
Unimak Island, Alaska           47     —     —      —      —      —      —      53     95

Values for gases (except water) are given in volume percent. The value for water is its percentage of the total gases.
The oxidation state of the early mantle likely governed the distribution of reducing gases released during outgassing. Holland [19] proposed a multistage model based on the Earth being formed through cold homogeneous accretion, in which the Earth’s atmosphere went through two stages: an early reduced stage before complete differentiation of the mantle, and a later neutral/oxidized stage after differentiation. During the first stage, the redox state of the mantle was governed by the Fe°/Fe2+ redox pair, or iron–wüstite buffer. The atmosphere in this stage would be composed of H2O, H2, CO, and N2, with approximately 0.27–2.7 × 10^-5 atm of H2. Once Fe° had segregated into the core, the redox state of magmas would have been controlled by the Fe2+/Fe3+ pair, or fayalite–magnetite–quartz buffer. In reconstructing the early atmosphere with regard to organic synthesis, there is particular interest in determining the redox balance of the crust–mantle–ocean–atmosphere system. Endogenous organic synthesis seems to depend, based on laboratory simulations, on the early atmosphere being generally reducing, which necessitates low O2 levels in the primitive atmosphere. Little is agreed upon about the composition of the early atmosphere, other than that it almost certainly contained very little free O2. O2 can be produced by the photodissociation of water:

2H2O → O2 + 2H2
Today this occurs at the rate of ~10^-8 g cm^-2 yr^-1, which is rather slow, and it seems likely that the steady state would have been kept low early in the Earth’s history due to reaction with reduced metals in the crust and oceans such as Fe2+. Evidence in favor of high early O2 levels comes from morphological evidence that fossil bacteria appear to have been photosynthetic, although this is somewhat speculative. On the other hand, uraninite (UO2) and galena (PbS) deposits from 2–2.8 bya testify to low atmospheric O2 levels until relatively recently, since both of these species are easily oxidized, to UO3 and PbSO4, respectively. More evidence that O2 is the result of buildup from oxygenic photosynthesis and a relatively recent addition to the atmosphere comes from the banded iron formations (BIFs). These are distributed around the world from ~1.8–2.2 bya and contain extensive deposits of magnetite (Fe3O4), which may be considered a mixture of FeO and hematite (Fe2O3), interspersed with bands of hematite. Hematite requires a higher pO2 to form. On the modern Earth, high O2 levels allow for the photochemical formation of a significant amount of ozone. Significantly, O3 serves as the major shield against highly energetic UV light at the Earth’s surface today. Even low O2 levels may have created an effective ozone shield on the early Earth [20]. The oceans could also have served as an important UV shield protecting the nascent organic chemicals [21]. It is important to note that while UV can be a significant source of energy for synthesizing organics, it is also a means of destroying them. While this suggests that the early atmosphere was probably not oxidizing, it does not prove or offer evidence that it was very reducing.
Although it is generally accepted that free oxygen was absent from the early Archean Earth’s atmosphere, there is no agreement on the composition of the primitive atmosphere; opinions vary from strongly reducing (CH4 + N2, NH3 + H2O, or CO2 + H2 + N2) to neutral (CO2 + N2 + H2O). The modern atmosphere is not thermodynamically stable, nor is it in equilibrium with respect to the biota, the oceans, or the continents; it is unlikely that it ever was. In the presence of large amounts of H2, the thermodynamically stable form of carbon is CH4:

CO2 + 4H2 → CH4 + 2H2O    K25 = 8.1 × 10^22
CO + 3H2 → CH4 + H2O      K25 = 2.5 × 10^26
C + 2H2 → CH4             K25 = 7.9 × 10^8
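These equilibrium constants map directly onto reaction free energies through the standard relation ΔG° = −RT ln K; a short illustrative calculation using the K25 value quoted above for CO2 reduction:

```python
import math

R = 8.314    # gas constant, J mol^-1 K^-1
T = 298.15   # 25 degrees C, matching the K25 notation

def delta_g_kj(K: float) -> float:
    """Standard Gibbs free energy change (kJ/mol) implied by an equilibrium constant K."""
    return -R * T * math.log(K) / 1000.0

# CO2 + 4H2 -> CH4 + 2H2O with K25 = 8.1e22 gives roughly -131 kJ/mol:
# CH4 is overwhelmingly favored when H2 is abundant.
dg = delta_g_kj(8.1e22)
```

The large negative free energy makes concrete why, given excess H2, carbon at equilibrium ends up as methane rather than CO2 or CO.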
In the absence of large amounts of H2, intermediate forms of carbon, such as formate and methanol, are unstable with respect to CO2
and CH4, and thus these are the stable forms at equilibrium. Even large sources of CO would have equilibrated with respect to these in short geological time spans. In the presence of large amounts of H2, NH3 is the stable nitrogen species, although not to the extreme extent of methane:

1/2 N2 + 3/2 H2 → NH3    K25 = 8.2 × 10^2
If a reducing atmosphere was required for terrestrial prebiotic organic synthesis, the crucial question becomes the source of H2. Miller and Orgel [4] have estimated the pH2 as 10^-4 to 10^-2 atm. Molecular hydrogen could have been supplied to the primitive atmosphere from various sources; for example, extensive weathering of Fe2+-bearing rocks that had not been equilibrated with the mantle, followed by photooxidation in water [22]:
2Fe2+ + 3H2O + hν → 3H2 + Fe2O3

although this reaction may also have been equilibrated during volcanic outgassing. The major sink for H2 is Jeans escape, whereby gas molecules escape the Earth’s gravitational field. This equation is important for all molecular gas species, and thus we will include it here:

L = N(RT/2πm)^1/2 (1 + x)e^-x, where x = GMm/(RTa_e)

where
L = rate of escape (in atoms cm^-2 s^-1)
N = density of the gas in the escape layer
R = gas constant
m = atomic weight of the gas
G = gravitational constant
M = mass of the Earth
T = absolute temperature in the escape layer
a_e = radius at the escape layer

The escape layer on the Earth begins ~600 km above the Earth’s surface. Molecules must diffuse to this altitude prior to escape. The major conduits of H to the escape layer are CH4, H2, and H2O, since H2O and CH4 can be photodissociated at this layer. Water is, however, frozen out at lower altitudes, and thus does not contribute significantly to this process. The importance of the oxidation state of the atmosphere may be linked to the production of HCN, which is essential for the synthesis of amino acids and purine nucleobases, as well as cyanoacetylene
for pyrimidine nucleobase synthesis. In CH4/N2 atmospheres HCN is produced abundantly [23,24], but in CO2/N2 atmospheres most of the N atoms produced by splitting N2 recombine with O atoms to form NO.

The Early Oceans
The pH of the modern ocean is governed by the complex interplay of dissolved salts, atmospheric CO2 levels, and clay mineral ion exchange processes. The pH and the concentrations of Ca2+, Mg2+, Na+, and K+ are maintained by equilibria with clays rather than by the bicarbonate buffer system, and the bicarbonate concentration is in turn determined by the pH and the Ca2+ concentration. From equilibrium considerations with clay species, the pCO2 can be estimated at 1.3 × 10−4 to 3 × 10−4 atm [4]; for comparison, CO2 is presently ~0.03% of the atmosphere by volume. This buffering mechanism and pCO2 would leave the primitive oceans at ~pH 8, which is coincidentally a favorable pH for many prebiotic reactions. The cytosol of most modern cells is also maintained near pH 8 via a series of complicated cellular processes, suggesting that early cells may have evolved in an environment close to this value.

Our star appears to be a typical G2 class star and is expected to have followed typical stellar evolution for its mass and spectral type. Consequently, the solar luminosity would have been ~30% less during the time period we are concerned with, and the UV flux would have been much higher [20]. A possible consequence is that the prebiotic Earth may have frozen completely to a depth of ~300 m [25], and there is now good evidence for various completely frozen "Snowball Earth" periods later in the Earth's history [26]. There is also some evidence that liquid water was available on the early Earth between 4.3 and 4.4 bya [27,28], so the jury is still out as to whether the early Earth was hot or cold, or perhaps offered a variety of environments. The presence of liquid surface water would have required that the early Earth maintained a heat balance offsetting the postulated 30% lower flux of the faint young sun. Presently the Earth's temperature seems to be thermostatted by the so-called BLAG model [29].
The model suggests that modern atmospheric CO2 levels are maintained at a level that ensures moderate temperatures by controlling global weathering rates and thus the flux of silicates and carbonates through the crust–ocean interface. When CO2 levels are high, the Earth is warmed by the greenhouse effect and weathering rates are increased, allowing a high inflow of Ca2+ and silicates to the oceans, which precipitates CO2 as CaCO3, and lowers the temperature. As the atmosphere cools, weathering slows, and the buildup of volcanically outgassed CO2 again raises the temperature. On the early Earth, however, before extensive recycling of the crust became common, large amounts of CO2 may have been sequestered as CaCO3 in sediments, and the environment may have been considerably colder.
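The link between pCO2, pH, and dissolved carbonate that underlies both the ocean-pH estimate and the BLAG thermostat can be sketched with the standard carbonate equilibria (Henry's law plus the two acid dissociations of carbonic acid). The equilibrium constants below are textbook ~25 °C values, assumed for illustration:

```python
# Sketch of open-system carbonate speciation: at fixed pCO2 and pH, the
# bicarbonate and carbonate concentrations follow from Henry's law and the
# two dissociation steps. Constants are assumed ~25 C textbook values.
KH  = 3.4e-2   # Henry's law constant for CO2, mol L-1 atm-1
KA1 = 4.5e-7   # H2CO3* -> H+ + HCO3-
KA2 = 4.7e-11  # HCO3- -> H+ + CO3^2-

def carbonate(p_co2, pH):
    """Return ([CO2(aq)], [HCO3-], [CO3^2-]) in mol/L at fixed pCO2 and pH."""
    h = 10.0 ** (-pH)
    co2aq = KH * p_co2        # dissolved CO2 set by the atmosphere
    hco3 = KA1 * co2aq / h    # first dissociation
    co3 = KA2 * hco3 / h      # second dissociation
    return co2aq, hco3, co3

# Estimated primitive pCO2 (~3e-4 atm) at the clay-buffered pH of ~8:
co2aq, hco3, co3 = carbonate(3e-4, 8.0)
print(f"[CO2(aq)] = {co2aq:.1e} M, [HCO3-] = {hco3:.1e} M, [CO3^2-] = {co3:.1e} M")
```

The sketch shows how, with pH fixed externally (by clays in the text's argument), the bicarbonate concentration is slaved to pCO2, as the section describes.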
Energy Sources on the Early Earth
Provided the early atmosphere was sufficiently reducing, energy would have been needed to dissociate these gases into radicals which could recombine to form reactive intermediates capable of undergoing further reaction to form biomolecules. The most abundant energy sources on Earth today are shown in table 1.2. Energy fluxes from these sources may have been slightly different in the primitive environment. As mentioned earlier, the dim young sun would have provided a much higher flux of UV radiation than the modern sun. It is also likely that volcanic activity was higher than it is today, and radioactive decay would have been more intense, especially from 40K [30], which is the probable source of the high concentration of Ar in the modern atmosphere. Shock waves from extraterrestrial impactors and thunder were also probably more common during the tail of the planetary accretion process. Presently huge amounts of energy are discharged atmospherically in the form of lightning, but this flux is difficult to estimate for the early Earth. Also significant is the energy flux associated with the van Allen belts and static electricity discharges.

Some energy sources may have been more important for particular synthetic reactions than others. For example, electric discharges are very effective at producing HCN from CH4 and NH3 or N2, but UV radiation is not. Electric discharge reactions also occur near the Earth's surface, whereas UV reactions occur higher in the atmosphere. Any molecules created would have to be transported to the ocean, and might be destroyed on the way. Thus transport rates must also be taken into account
Table 1.2 Energy sources on the modern Earth (adapted from Miller and Orgel [4])

Source                       Energy (cal cm−2 yr−1)   Energy (J cm−2 yr−1)
Total radiation from sun     260,000                  1,090,000
Ultraviolet light < 300 nm   3400                     14,000
Ultraviolet light < 250 nm   563                      2360
Ultraviolet light < 200 nm   41                       170
Ultraviolet light < 150 nm   1.7                      7
Electric discharges          4.0a                     17
Cosmic rays                  0.0015                   0.006
Radioactivity (to 1.0 km)    0.8                      3.0
Volcanoes                    0.13                     0.5
Shock waves                  1.1b                     4.6

a 3 cal cm−2 yr−1 of corona discharge and 1 cal cm−2 yr−1 of lightning.
b 1 cal cm−2 yr−1 of this is the shock wave of lightning bolts and is also included under electric discharges.
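The two energy columns of table 1.2 are related simply by the thermochemical calorie (1 cal = 4.184 J), with the J column rounded to roughly two significant figures. A quick check of a few rows:

```python
# Convert selected cal cm-2 yr-1 entries of table 1.2 to J cm-2 yr-1.
CAL_TO_J = 4.184  # thermochemical calorie

sources = {
    "Total radiation from sun": 260_000,
    "Ultraviolet light < 300 nm": 3_400,
    "Electric discharges": 4.0,
    "Radioactivity (to 1.0 km)": 0.8,
}
for name, cal in sources.items():
    print(f"{name}: {cal * CAL_TO_J:,.0f} J cm-2 yr-1")
```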
when considering the relative importance of various energy sources in prebiotic synthesis.

Atmospheric Syntheses
Urey’s early assumptions about the constitution of the primordial atmosphere led to the landmark Miller–Urey experiment, which succeeded in producing copious amounts of biochemicals, including a large percentage of those important in modern biochemistry. Yields of intermediates as a function of the oxidation state of the gases involved have been investigated, and reduced gas mixtures have been shown to be generally much more conducive to organic synthesis than oxidizing or neutral gas mixtures. This appears to be because of the likelihood of reaction-terminating O radical collisions where the partial pressure of O-containing species is high. Even mildly reducing gas mixtures produce copious amounts of organic compounds. The yields may have been limited by either the available carbon or the available energy; it seems likely that energy was not the limiting factor [24].

Small Reactive Intermediates
Small reactive intermediates are the backbone of prebiotic organic synthesis. They include HCHO, HCN, ethylene, cyanoacetylene, and acetylene, which can be recombined to form larger and more complex intermediates that ultimately form stable biochemicals. Most of these reactive intermediates would have been produced at relatively slow rates, resulting in low concentrations in the primitive oceans, where many of the reactions of interest would occur. Subsequent reactions producing more complicated molecules would have depended on the balance between the atmospheric production and rain-out rates of the small reactive intermediates and their degradation rates, which in turn would have depended on the temperature and pH of the early oceans. It is difficult to estimate the concentrations that could have been achieved without knowing these source and loss rates. Nevertheless, under the assumptions above, low temperatures would have been more conducive to prebiotic organic synthesis. For example, steady-state concentrations of HCN would have depended on production rates, and hence on the energy flux and the reducing nature of the early atmosphere; the principal sinks would have been photolysis and hydrolysis [31], whose rates depended on the pH and temperature of the early oceans and on the rate of circulation of the oceans through hydrothermal vents. Ultraviolet irradiation of reduced metal sulfides has also been shown to be able to reduce CO2 to various low molecular weight compounds, including methanol, HCHO, HCOOH, and short fatty acids [32]; this may have been an important source of biomolecules on the early Earth.
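The balance between atmospheric delivery and oceanic destruction amounts to a simple steady state: an intermediate delivered at rate P and destroyed with first-order rate constant k settles at a concentration C = P/k. The numbers below are purely hypothetical placeholders, chosen only to illustrate why slower (colder) destruction yields higher steady-state concentrations:

```python
# Sketch: steady-state ocean concentration of a reactive intermediate,
# C_ss = P / k, with P the delivery (rain-out) rate in mol L-1 yr-1 and
# k the first-order destruction constant in yr-1. All values hypothetical.
def steady_state(production, k_destruction):
    return production / k_destruction

P = 1e-9  # hypothetical delivery rate, mol L-1 yr-1
for k in (1e-2, 1e-4):  # faster (warm) vs slower (cold) destruction, yr-1
    print(f"k = {k:.0e}/yr -> C_ss = {steady_state(P, k):.1e} M")
```

A hundredfold slower destruction gives a hundredfold higher steady-state concentration, which is the quantitative content of the remark that low temperatures favor prebiotic synthesis.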
Concentration Mechanisms
When aqueous solutions are frozen, solutes are excluded as the ice lattice forms, and extremely concentrated brines may result. In the case of HCN, the final eutectic mixture contains 75% HCN, and in principle any degree of concentration up to this point is possible. Salt water, however, cannot be concentrated to the same degree as fresh water: starting from 0.5 M NaCl, similar to the concentration in the modern ocean, freezing can proceed only until the dissolved salt reaches its own eutectic, a concentration factor of only ~10. Eutectic freezing has been shown to be an excellent mechanism for producing biomolecules such as amino acids and adenine from HCN [33]. This would of course require that at least some regions of the early Earth were cold enough to freeze, and hence that atmospheric greenhouse warming due to CO2, CH4, and NH3 or organic aerosols was not so great as to prohibit at least localized freezing.

Concentration by evaporation is also possible for nonvolatile compounds, as long as they are stable to the drying process [34]. Some prebiotic organic syntheses may thus have depended on the availability of dry land areas. Although continental crust had almost certainly not yet formed, the geological record contains some evidence of sedimentary rocks that must have been deposited in shallow environments on the primitive Earth, and it is not unreasonable to assume that some dry land was available in environments such as island arcs.

There is also the possibility that hydrophobic compounds could have been concentrated in lipid phases, if such phases were available. Calculations and some experiments suggest that an early reducing atmosphere might have been polymerized by solar ultraviolet radiation in geologically short periods of time; an oil slick 1–10 m thick could have been produced in this way and could have been important in the concentration of hydrophobic molecules [35].
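The ~10× limit quoted for seawater can be recovered from the NaCl–H2O eutectic: freezing concentrates the brine only until it reaches the eutectic composition (~23.3 wt% NaCl at about −21 °C). A rough estimate, assuming a brine density of ~1.17 g/mL (the density is an assumption, not from the text):

```python
# Sketch: maximum freeze-concentration factor for 0.5 M NaCl "seawater".
# Freezing stops concentrating the brine at the NaCl-H2O eutectic
# (~23.3 wt% NaCl); the eutectic brine density of 1.17 g/mL is assumed.
M_NACL = 58.44   # g/mol
wt_frac = 0.233  # eutectic mass fraction of NaCl
density = 1.17   # g/mL, assumed

# Molarity of the eutectic brine: grams of NaCl per liter / molar mass.
eutectic_molarity = wt_frac * density * 1000 / M_NACL
factor = eutectic_molarity / 0.5  # starting from ~0.5 M (modern ocean)
print(f"eutectic ~ {eutectic_molarity:.1f} M -> concentration factor ~ {factor:.0f}")
```

The result, a factor of roughly 9–10, matches the ~10 quoted above, in contrast to the far greater concentration achievable for HCN in fresh water.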
Clays are complex mineral assemblages formed from dissolved aluminosilicates. Such minerals form templates for the formation of subsequent layers of mineral, leading to speculation that the first organisms may have been mineral-based [36]. Clays are also capable of binding organic material via ionic and van der Waals forces, and may have been locations for early prebiotic synthesis. Early ion exchange processes would also have concentrated 40K+, which would have exposed early prebiotic organics to high fluxes of ionizing radiation [30].

SYNTHESIS OF THE MAJOR CLASSES OF BIOCHEMICALS
The top-down approach to origin of life research operates on the premise that the earliest organisms were composed of the same, or similar,
biochemicals as modern organisms. The following sections will consider biomolecules and experimental results demonstrating how these may have come to be synthesized on the primitive Earth via plausible geochemical processes.

Amino Acids
Experimental evidence in support of Oparin’s hypothesis of chemical evolution came first from Urey’s laboratory, which had been involved with the study of the origin of the solar system and the chemical events associated with this process. Urey considered the origin of life in the context of his proposal of a highly reducing terrestrial atmosphere [37]. The first successful prebiotic amino acid synthesis was carried out with an electric discharge (figure 1.1) and a strongly reducing model atmosphere of CH4, NH3, H2O, and H2 [38]. The result of this experiment was a large yield of racemic amino acids, together with hydroxy acids, short aliphatic acids, and urea (table 1.3). One of the surprising results of this experiment was that the products were not a large random mixture of organic compounds, but rather a relatively small number of compounds were produced in substantial yield. Moreover, with a few exceptions, the compounds were of biochemical significance. The synthetic routes to prebiotic bioorganic compounds and the geochemical
Figure 1.1 The apparatus used in the first electric discharge synthesis of amino acids and other organic compounds in a reducing atmosphere. It was made entirely of glass, except for the tungsten electrodes [38].
Table 1.3 Yields of small organic molecules from sparking a mixture of methane, hydrogen, ammonia, and water (yields given based on input carbon in the form of methane [59 mmoles (710 mg)])

Compound                      Yield (µmoles)   Yield (%)
Glycine                       630              2.1
Glycolic acid                 560              1.9
Sarcosine                     50               0.25
Alanine                       340              1.7
Lactic acid                   310              1.6
N-Methylalanine               10               0.07
α-Amino-n-butyric acid        50               0.34
α-Aminoisobutyric acid        1                0.007
α-Hydroxybutyric acid         50               0.34
β-Alanine                     150              0.76
Succinic acid                 40               0.27
Aspartic acid                 4                0.024
Glutamic acid                 6                0.051
Iminodiacetic acid            55               0.37
Iminoaceticpropionic acid     15               0.13
Formic acid                   2330             4.0
Acetic acid                   150              0.51
Propionic acid                130              0.66
Urea                          20               0.034
N-Methyl urea                 15               0.051
plausibility of these became experimentally tractable as a result of this demonstration.

The mechanism of synthesis of the amino and hydroxy acids formed in the spark discharge experiment was investigated [39]. The presence of large quantities of hydrogen cyanide, aldehydes, and ketones in the water flask (figure 1.2), which were clearly derived from the methane, ammonia, and hydrogen originally included in the apparatus, showed that the amino acids were not formed directly in the electric discharge, but were the outcome of a Strecker-like synthesis that involved aqueous phase reactions of reactive intermediates. The mechanism is shown in figure 1.3. Detailed studies of the equilibrium and rate constants of these reactions have been performed [40]. The results demonstrate that both amino and hydroxy acids could have been synthesized at high dilutions of HCN and aldehydes in the primitive oceans. The reaction rates depend on temperature, pH, and HCN, NH3, and aldehyde concentrations, and are rapid on a geological time scale: the half-lives for the hydrolysis of the intermediate products in the reactions, the amino and hydroxy nitriles, can be less than a thousand years at 0 °C [41].
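"Rapid on a geological time scale" is easy to quantify: even a first-order hydrolysis with a half-life of 10,000 years (the estimate quoted for the slow amino nitrile step) runs essentially to completion within a fraction of a million years. A sketch:

```python
import math

# Sketch: progress of a first-order hydrolysis (e.g., of an amino nitrile)
# with an assumed half-life of 10,000 years.
def fraction_remaining(t_years, half_life_years):
    k = math.log(2) / half_life_years  # first-order rate constant, yr-1
    return math.exp(-k * t_years)

half_life = 1e4  # years
for t in (1e4, 1e5, 1e6):
    print(f"after {t:.0e} yr: {fraction_remaining(t, half_life):.2%} remaining")
```

After 10^5 years, less than 0.1% of the nitrile remains unhydrolyzed, so on geological time scales these steps are effectively instantaneous.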
Figure 1.2 The concentrations of ammonia (NH3), hydrogen cyanide (HCN), and aldehydes (CHO-containing compounds) present in the lowermost U-tube of the apparatus shown in figure 1.1. The concentrations of the amino acids present in the lower flask are also shown. These amino acids were produced from the sparking of a gaseous mixture of methane (CH4), ammonia (NH3), water vapor (H2O), and hydrogen in the upper flask. The concentrations of NH3, HCN, and aldehydes decrease over time as they are converted to amino acids.
The slow step in amino acid synthesis is the hydrolysis of the amino nitrile, which could take 10,000 years at pH 8 and 25 °C. An additional example of a rapid prebiotic synthesis is that of the amino acids on the Murchison meteorite (which will be discussed later), which apparently took place in less than 10^5 years [42]. These results suggest that if the prebiotic environment was reducing, then the synthesis of the building blocks of life was efficient and did not constitute the limiting step in the emergence of life.

The Strecker synthesis of amino acids requires the presence of ammonia (NH3) in the prebiotic environment. As discussed earlier,
Figure 1.3 The Strecker and cyanohydrin mechanisms for the formation of amino and hydroxy acids from ammonia, aldehydes and ketones, and cyanide.
gaseous ammonia is rapidly decomposed by ultraviolet light [43], and during early Archean times the absence of a significant ozone layer would have imposed an upper limit on its atmospheric concentration. Since ammonia is very soluble in water, if the buffer capacity of the primitive oceans and sediments was sufficient to maintain the pH at ~8, then dissolved NH4+ (the pKa of NH3 is ~9.2) in equilibrium with dissolved NH3 would have been available. Since NH4+ is similar in size to K+ and thus easily enters the same exchange sites on clays, NH4+ concentrations were probably no higher than 0.01 M. The ratio of hydroxy acids to amino acids is governed by the NH3 concentration, which would have to be ~0.01 M at 25 °C to give a 50/50 mix; equal amounts of the cyanohydrin and aldehyde are generated at CN− concentrations of 10−2 to 10−4 M.

A more realistic atmosphere for the primitive Earth may be a mixture of CH4 and N2 with traces of NH3. There is experimental evidence that electric discharges in this mixture of gases are quite effective at producing amino acids [41]. Such an atmosphere, however, would nevertheless be strongly reducing. Alternatively, amino acids can be synthesized from the reaction of urea, HCN, and an aldehyde or a ketone (the Bucherer–Bergs synthesis, figure 1.4). This reaction pathway may have been significant if little free ammonia was available.

A wide variety of direct sources of energy must have been available on the primitive Earth (table 1.2). It is likely that in the prebiotic environment solar radiation, and not atmospheric electricity, was the major
Figure 1.4 The Bucherer–Bergs mechanism of synthesis of amino acids, which uses urea instead of ammonia as the source of the amino group.
source of energy reaching the Earth’s surface. However, it is unlikely that any single one of the energy sources listed in table 1.2 can account for all organic compound syntheses. The importance of a given energy source in prebiotic evolution is determined by the product of the energy available and its efficiency in generating organic compounds. Given our current ignorance of the prebiotic environment, it is impossible to make absolute assessments of the relative significance of these different energy sources. For instance, neither the pyrolysis (800 to 1200 °C) of a CH4/NH3 mixture nor the action of ultraviolet light on a strongly reducing atmosphere gives good yields of amino acids. However, the pyrolysis of methane, ethane, and other hydrocarbons gives good yields of phenylacetylene, which upon hydration yields phenylacetaldehyde. The latter could then participate in a Strecker synthesis and act as a precursor to the amino acids phenylalanine and tyrosine in the prebiotic ocean.

The available evidence suggests that electric discharges were the most important source of hydrogen cyanide, which is recognized as an important intermediate in prebiotic synthesis. However, the hot H atom mechanism suggested by Zahnle could also have been significant [44]. In addition to HCN's central role in the formation of amino nitriles during the Strecker synthesis, HCN polymers have been shown to be a source of amino acids. Ferris et al. [45] have demonstrated that, in addition to urea, guanidine, and oxalic acid, hydrolysis of HCN polymers
produces glycine, alanine, aspartic acid, and aminoisobutyric acid, although the yields are not particularly high except for glycine (~1%).

Modern organisms construct their proteins from ~20 universal amino acids, which are almost exclusively of the L enantiomer. The amino acids produced by prebiotic syntheses would have been racemic. It is unlikely that all of the modern amino acids were present in the primitive environment, and it is unknown which, if any, would have been important for the origin of life. Acrolein would have been produced in fairly high yield from the reaction of acetaldehyde with HCHO [46], the latter having several very robust atmospheric syntheses. Acrolein can be converted into several of the biological amino acids via reaction with various plausible prebiotic compounds [47] (figure 1.5).

There has been less experimental work with gas mixtures containing CO and CO2 as carbon sources instead of CH4, although CO-dominated atmospheres could not have existed except transiently. Spark discharge experiments using CH4, CO, or CO2 as a carbon source with various
Figure 1.5 Acrolein may serve as an important precursor in the prebiotic synthesis of several amino acids.
Figure 1.6 Amino acid yields based on initial carbon. In all experiments reported here, the partial pressure of N2, CO, or CO2 was 100 torr. The flask contained 100 ml of water with or without 0.05 M NH4Cl brought to pH 8.7. The experiments were conducted at room temperature, and the spark generator was operated continuously for two days.
amounts of H2 have shown that methane is the best source of amino acids, but CO and CO2 are almost as good if a high H2/C ratio is used (figure 1.6). Without added hydrogen, however, the amino acid yields are very low, especially when CO2 is the sole carbon source. The amino acid diversity produced in CH4 experiments is similar to that reported by Miller [38]. With CO and CO2, however, glycine was the predominant amino acid, with little else besides alanine produced [41]. The implication of these results is that CH4 is the best carbon source for abiotic synthesis. Although glycine was essentially the only amino acid produced in spark discharge experiments with CO and CO2, as the primitive ocean matured the reaction between glycine, H2CO, and HCN could have led to the formation of other amino acids such as alanine, aspartic acid, and serine. Such simple mixtures may have lacked the chemical diversity required for prebiotic evolution and the origin of the first life forms. However, since it is not known which amino acids were required for the emergence of life, we can say only that CO and CO2 are less favorable than CH4 for prebiotic amino acid synthesis, but that amino acids produced from CO and CO2 may have been adequate. The spark discharge yields of amino acids, HCN, and aldehydes are about the same using CH4, H2/CO >1, or H2/CO2 >2. However, it is not clear how such high molecular hydrogen-to-carbon ratios for the last
two reaction mixtures could have been maintained in the prebiotic atmosphere.

Synthesis of Nucleic Acid Bases
Nucleic acids are the central repository of the information that organisms use to construct enzymes via the process of protein synthesis. In all living organisms genetic information is stored in DNA, which is composed of repeating units of deoxyribonucleotides (figure 1.7) and is transcribed into linear polymers of RNA, which are composed of repeating units of ribonucleotides. The two polymers differ in the use of deoxyribose in DNA and ribose in RNA, and of thymine in DNA and uracil in RNA. It is generally agreed that one of the principal characteristics of life is the ability to transfer information from one generation to the next. Nucleic acids seem uniquely structured for this function, and thus a considerable amount of attention has been dedicated to elucidating their prebiotic synthesis.

PURINES
The first evidence that the components of nucleic acids may have been synthesized abiotically was provided in 1960 [48]. Juan Oró, who was at the time studying the synthesis of amino acids from aqueous solutions of HCN and NH3, reported the abiotic formation of adenine, which may be considered a pentamer of HCN (C5H5N5), from these same mixtures. Oró found that concentrated solutions of ammonium cyanide refluxed for a few days produced adenine in up to 0.5% yield, along with 4-aminoimidazole-5-carboxamide and an intractable polymer [48,49]. The polymer also yields amino acids, urea, guanidine, cyanamide, and cyanogen. It is surprising that a synthesis requiring at least five steps should produce such high yields of adenine.

The mechanism of synthesis has since been studied in some detail. The initial step is the dimerization of HCN, followed by further reaction to give the HCN trimer and the HCN tetramer, diaminomaleonitrile (DAMN) (figure 1.8). As demonstrated by Ferris and Orgel [50], a two-photon photochemical rearrangement of diaminomaleonitrile proceeds readily and in high yield in sunlight to give aminoimidazole carbonitrile (AICN) (figure 1.9). Further reaction of AICN with small molecules generated in polymerizing HCN produces the purines (figure 1.10). The limits of the synthesis, as delineated by the kinetics of the reactions and the necessity of forming the dimer, trimer, and tetramer of HCN, have been investigated and used to delineate the limits of geochemically plausible synthesis. The steady-state concentrations of HCN would have depended on the pH and temperature of the early oceans and the input rate of HCN from atmospheric synthesis.
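The constraint on this chemistry is that oligomerization is second order in HCN while hydrolysis is first order, so the two pathways cross over at a characteristic concentration C* = k_hyd/k_oligo, below which hydrolysis wins. The rate constants in the sketch below are hypothetical, chosen only so that the crossover falls near the ~10−2 M reported for pH 9 later in the text:

```python
# Sketch: competition between HCN hydrolysis (first order in HCN) and HCN
# oligomerization (second order). Rate constants are hypothetical, chosen
# so the crossover lands near the ~1e-2 M value reported at pH 9.
k_hyd = 1e-5    # yr-1, hypothetical hydrolysis rate constant
k_oligo = 1e-3  # M-1 yr-1, hypothetical oligomerization rate constant

def dominant_pathway(c_hcn):
    rate_hyd = k_hyd * c_hcn          # first order
    rate_oligo = k_oligo * c_hcn**2   # second order
    return "oligomerization" if rate_oligo > rate_hyd else "hydrolysis"

crossover = k_hyd / k_oligo  # concentration where the two rates are equal
print(f"crossover at {crossover:.0e} M")
for c in (1e-6, 1e-3, 1e-1):
    print(f"[HCN] = {c:.0e} M -> {dominant_pathway(c)} dominates")
```

This is why dilute oceans favor hydrolysis to formamide, and why concentration mechanisms such as eutectic freezing matter for purine synthesis.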
Figure 1.7 The structures of DNA and RNA (A = adenine, G = guanine, C = cytosine, T = thymine, U = uracil).
Figure 1.8 The mechanism of formation of DAMN from HCN.
Assuming favorable production rates, Miyakawa et al. [31] estimated steady-state concentrations of HCN of 2 × 10−6 M at pH 8 and 0 °C in the primitive oceans. At 100 °C and pH 8 the steady-state concentration would have been 7 × 10−13 M. HCN hydrolyzes to formamide, which then hydrolyzes to formic acid and ammonia. It has been estimated that oligomerization and hydrolysis compete at approximately 10−2 M concentrations of HCN at pH 9 [51], although it has been shown that adenine is still produced from solutions as dilute as 10−3 M [52]. If the concentration of HCN were as low as estimated, it is possible that HCN tetramer formation may have occurred on the primitive Earth in eutectic solutions of HCN–H2O, which may have existed in the polar regions of an Earth of the present average temperature. High yields of the HCN tetramer have been reported by cooling dilute cyanide solutions to temperatures between −10 and −30 °C for a few months [51]. Production of adenine by HCN polymerization is accelerated by the presence of formaldehyde and other aldehydes, which could have also been available in the prebiotic environment [53]. The prebiotic synthesis of guanine was first studied under conditions that required unrealistically high concentrations of a number of precursors, including ammonia [54]. Purines, including guanine,
Figure 1.9 The synthesis of AICN via photoisomerization of DAMN.
Figure 1.10 Prebiotic synthesis of purines from AICN.
hypoxanthine, xanthine, and diaminopurine, could have been produced in the primitive environment by variations of the adenine synthesis using aminoimidazole carbonitrile and aminoimidazole carboxamide [55] with other small molecule intermediates generated from HCN. Reexamination of the polymerization of concentrated NH4CN solutions has shown that, in addition to adenine, guanine is also produced at both −80 and −20 °C [56]. It is probable that most of the guanine obtained from the polymerization of NH4CN is the product of hydrolysis of 2,6-diaminopurine, which reacts readily with water to give guanine and isoguanine. The yields of guanine in this reaction are 10 to 40 times less than those of adenine. Adenine, guanine, and a simple set of amino acids dominated by glycine have also been detected in dilute solutions of NH4CN kept frozen for 25 years at −20 and −78 °C, as well as in the aqueous products of a reducing-atmosphere spark discharge experiment kept frozen for 5 years at −20 °C [33]. The mechanisms described above are likely an oversimplification: in dilute aqueous solutions adenine synthesis may also involve the formation and rearrangement of other precursors such as 2-cyano and 8-cyano adenine [53].

PYRIMIDINES
The prebiotic synthesis of pyrimidines has also been investigated extensively. The first synthesis investigated was that of uracil from
malic acid and urea [57]. The abiotic synthesis of cytosine in aqueous solution from cyanoacetylene (HCCCN) and cyanate (NCO−) was later described [58,59]. Cyanoacetylene is abundantly produced by the action of a spark discharge on a mixture of methane and nitrogen, and cyanate is produced from cyanogen (NCCN) or from the decomposition of urea (H2NCONH2). Cyanoacetylene is apparently also a Strecker synthesis precursor to aspartic acid. However, the high concentrations of cyanate (> 0.1 M) required in this reaction are unrealistic, since cyanate is rapidly hydrolyzed to CO2 and NH3. Urea itself is fairly stable, depending on the concentrations of NCO− and NH3.

Later, it was found that cyanoacetaldehyde (the hydration product of cyanoacetylene) and urea react to form cytosine and uracil, and this was extended to a high-yield synthesis that postulated drying lagoon conditions. The reaction of uracil with formaldehyde and formate gives thymine in good yield [60]. Thymine may also be synthesized from the UV-catalyzed dehydrogenation of dihydrothymine, which is produced from the reaction of β-aminoisobutyric acid with urea [61]. In dilute solution, the reaction of cyanoacetaldehyde with urea produces no detectable levels of cytosine [62]. However, when the same nonvolatile compounds are concentrated in laboratory models of “evaporating pond” conditions simulating primitive lagoons or pools on drying beaches on the early Earth, surprisingly high yields of cytosine (>50%) are observed [63]. These results suggest a facile mechanism for the accumulation of pyrimidines in the prebiotic environment (figure 1.11).
Figure 1.11 Two possible mechanisms for the prebiotic synthesis of the biological pyrimidines.
Figure 1.12 One possible mechanism for the formation of N6 modified purines.
A related synthesis under evaporating conditions uses cyanoacetaldehyde with guanidine, which produces diaminopyrimidine [62] in very high yield [64]; this then hydrolyzes to uracil and small amounts of cytosine. Uracil (albeit in low yields), as well as its biosynthetic precursor orotic acid, has also been identified among the hydrolytic products of hydrogen cyanide polymer [45,65].

A wide variety of other modified nucleic acid bases may also have been available on the early Earth. The list includes isoguanine, which is a hydrolytic product of diaminopurine [56], and several modified purines which may have resulted from side reactions of both adenine and guanine with a number of alkylamines under the concentrated conditions of a drying pond [66], including N6-methyladenine, 1-methyladenine, N6,N6-dimethyladenine, 1-methylhypoxanthine, 1-methylguanine, and N2-methylguanine (figure 1.12). Modified pyrimidines may also have been present on the primitive Earth. These include dihydrouridine, which is formed from NCO− and β-alanine [67], and others such as diaminopyrimidine, thiocytosine [64], and the 5-substituted uracils, formed via reaction of uracil with formaldehyde, whose functional side groups may have played an important role in the early evolution of catalysis prior to the origin of proteins, and which are efficiently formed under plausible prebiotic conditions [68] (figure 1.13).

Carbohydrates
Most biological sugars have the empirical formula (CH2O)n, a point that was underscored by Butlerow’s 1861 discovery of the formose reaction [69], which showed that a complex mixture of sugars of biological relevance could be formed by the reaction of HCHO under basic conditions. The Butlerow synthesis is complex and incompletely understood. It depends on the presence of suitable inorganic catalysts, with calcium hydroxide (Ca(OH)2) and calcium carbonate (CaCO3) being the most completely investigated; in the absence of basic catalysts, little or no sugar is obtained. At 100 °C, clays such as kaolin serve to catalyze the formation of sugars, including ribose,
Figure 1.13 The reaction of uracil with formaldehyde to produce 5-hydroxymethyl uracil, and functional groups attached to 5-substituted uracil. Incorporation of these amino acid analogs into polyribonucleotides during the “RNA world” stage may have led to a substantial widening of the catalytic properties of ribozymes.
in small yields from dilute (0.01 M) solutions of formaldehyde [70–72]. This reaction has been extensively investigated with regard to catalysis and several interesting phenomena have been observed. For instance, the reaction is catalyzed by glycolaldehyde, acetaldehyde, and various organic catalysts [73]. Ribose was among the last of life’s building blocks characterized by chemists. Suggestions for the existence of an “RNA world,” a period during early biological evolution when biological systems used RNA both as a catalyst and an informational macromolecule, make it possible that ribose may have been among the oldest carbohydrates to be employed by living beings. Together with the other sugars that are produced by the condensation of formaldehyde under alkaline conditions [69],
Prebiotic Chemistry on the Primitive Earth
33
it is also one of the organic compounds to be synthesized in the laboratory under conditions that are relevant from a prebiotic perspective. The Butlerow synthesis is autocatalytic and proceeds through glycolaldehyde, glyceraldehyde, and dihydroxyacetone, then four-carbon and five-carbon sugars, to give finally hexoses, including biologically important carbohydrates such as glucose and fructose. The detailed reaction sequence may proceed as shown in figure 1.14. The reaction produces a complex mixture of sugars, including 3-, 4-, 5-, 6-, and 7-carbon carbohydrates and all of their isomers (for the addition of each CH2O unit, both stereoisomers are produced) (figure 1.11), and generally is not particularly selective, although methods of overcoming this have been investigated. Inclusion of acetaldehyde in the reaction may lead to the formation of deoxyribose [74] (figure 1.15). The reaction tends to stop when the formaldehyde has been consumed, ending with the production of higher C4–C7 sugars that can form cyclic acetals and ketals. The reaction produces all of the epimers and isomers of the small C2–C6 sugars, some of the C7 ones, and various dendroaldoses and dendroketoses, as well as small molecules such as glycerol and pentaerythritol. Schwartz and De Graaf [72] have discovered an interesting photochemical formose reaction that generates pentaerythritol almost exclusively. Both L- and D-ribose occur in this complex mixture, but are not particularly abundant. Since all carbohydrates have somewhat similar chemical properties, it is difficult to envision simple mechanisms that could lead to the enrichment of ribose from this mixture, or to enhance the relative yield of ribose to the level required for the formation of RNA.
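The isomer problem noted above can be made concrete: each CH2O addition to an aldose creates a new stereocenter, doubling the number of possible stereoisomers. A minimal sketch counting straight-chain aldoses only (the formose mixture also contains ketoses and branched sugars, so the real count is higher):

```python
# For an n-carbon straight-chain aldose, every carbon except C1 (the
# aldehyde) and the terminal CH2OH is a stereocenter, so the number of
# stereoisomers (counting both D and L forms) is 2^(n-2).
def aldose_stereoisomers(n_carbons: int) -> int:
    stereocenters = n_carbons - 2
    return 2 ** stereocenters

for n, name in [(3, "triose"), (4, "tetrose"), (5, "pentose"), (6, "hexose")]:
    print(f"{name}: {aldose_stereoisomers(n)} stereoisomers")

# D-ribose is just 1 of the 8 aldopentoses, before even counting the
# ketoses and branched sugars that the formose reaction also produces.
```

This doubling with every added carbon is why ribose is only a minor component of the product mixture.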
However, the recognition that the biosynthesis of sugars leads not to free carbohydrates but to sugar phosphates led Albert Eschenmoser and his associates to show that, under slightly basic conditions, the condensation of glycolaldehyde-2-phosphate in the presence of formaldehyde results in the considerably selective synthesis of ribose-2,4-diphosphate [75]. This reaction has also been shown to take place under neutral conditions and at low concentrations in the presence of minerals [76], and is particularly attractive given the properties of pyranosyl-RNA (p-RNA), a 2′,4′-linked nucleic acid analog whose backbone includes the six-membered pyranose form of ribose-2,4-diphosphate [77]. The major problem with this work is that a reasonable source of the starting material, oxirane carbonitrile (which hydrolyzes to glycolaldehyde-2-phosphate), has not been identified.

There are three major obstacles to the relevance of the formose reaction as a source of sugars on the primitive Earth. The first problem is that the Butlerow synthesis gives a wide variety of straight-chain and branched sugars. Indeed, more than 40 different sugars have
Figure 1.14 A simplified scheme of the formose reaction.
Figure 1.15 Possible prebiotic synthesis of deoxyribose from glyceraldehyde and acetaldehyde.
been identified in one reaction mixture [78] (figure 1.16). The second problem is that the conditions of synthesis are also conducive to the degradation of sugars [71]. Sugars undergo various diagenetic reactions on geologically short time scales, which would seemingly prohibit the accumulation of significant amounts on the primitive Earth. At pH 7, the half-life for decomposition of ribose is 73 minutes at 100 °C, and 44 years at 0 °C [79]. The same is true of most other sugars, including ribose-2,4-diphosphate. The third problem is that the concentrations of HCHO required appear to be prebiotically implausible, although the limits of the synthesis have not been determined.
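The two quoted ribose half-lives imply an apparent activation energy through the Arrhenius equation, which also lets one interpolate to intermediate temperatures. A sketch assuming simple first-order decomposition (the 25 °C extrapolation is illustrative, not a value from the text):

```python
import math

R = 8.314  # gas constant, J mol^-1 K^-1

# Ribose half-lives at pH 7 quoted in the text [79]
t_hot = 73.0                      # minutes at 100 degC
t_cold = 44 * 365.25 * 24 * 60.0  # 44 years at 0 degC, in minutes
T_hot, T_cold = 373.15, 273.15    # kelvin

# First-order kinetics: k = ln2 / t_half, so the ratio of half-lives gives
# the apparent activation energy via ln(k_hot/k_cold) = (Ea/R)(1/T_cold - 1/T_hot)
Ea = R * math.log(t_cold / t_hot) / (1.0 / T_cold - 1.0 / T_hot)
print(f"apparent Ea ~ {Ea / 1000:.0f} kJ/mol")

# Interpolate to an intermediate ocean temperature, e.g. 25 degC
T = 298.15
t_half = t_hot * math.exp(Ea / R * (1.0 / T - 1.0 / T_hot))
t_half_days = t_half / (60 * 24)
print(f"ribose half-life at 25 degC ~ {t_half_days:.0f} days")
```

Even at temperate ocean conditions the interpolated half-life is under a year, underscoring why ribose accumulation is problematic.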
Figure 1.16 Gas chromatogram of derivatives of the sugars formed by the formose reaction. The arrows point to the two ribose isomers (adapted from Decker et al. [78]).
There are a number of possible ways to stabilize sugars; the most interesting is to attach the sugar to a purine or pyrimidine, that is, to convert the carbohydrate into a glycoside. However, the synthesis of nucleosides is notoriously difficult under plausible prebiotic conditions. It has therefore become apparent that ribonucleotides could not have been the first components of prebiotic informational macromolecules [80]. This has led to proposals of a number of possible substitutes for ribose in nucleic acid analogs, in what has been dubbed the “pre-RNA world” [81].

A Paradox?
When aqueous solutions of HCN and HCHO are mixed, the predominant product is glycolonitrile [82], which seems to preclude the formation of sugars and nucleic acid bases in the same location [83]. Nevertheless, both sugars and nucleic acid bases have been found in the Murchison meteorite [84,85], and it seems likely that the chemistry which produced them involved reactions such as the Strecker synthesis. This suggests that either the conditions for the synthesis of sugars, amino acids, and purines from HCHO and HCN existed only within a narrow window of NH3, HCN, and HCHO concentrations and pH, or the two sets of precursors were produced under different regimes in different locations.

Lipids
Amphiphilic molecules are especially interesting because of their propensity to assemble spontaneously into micelles and vesicles. Cell membranes are almost universally composed of phosphate esters of fatty acid glycerides. Fatty acids are biosynthesized today by multifunctional enzymes or enzyme complexes; nevertheless, since all life we know of is composed of cells, these compounds seem crucial. Eukaryotic and bacterial cell membranes are composed largely of straight-chain fatty acid acyl glycerols, while those of the Archaea are often composed of polyisoprenoid glycerol ethers. Either type may have been the primordial lipid component of primitive cells. Most prebiotic simulations fail to generate large amounts of fatty acids, with the exception of simulated hydrothermal vent reactions, which arguably use unreasonably high concentrations of reactants [86]. Heating glycerol with fatty acids and urea has been shown to produce acylglycerols [87]. A prebiotic synthesis of long-chain isoprenoids has been suggested by Ourisson, based on the Prins reaction of formaldehyde with isobutene [88]. The Murchison meteorite contains small amounts of higher straight-chain fatty acids, some of which may be contamination [89]. Amphiphilic components have been positively identified in the Murchison meteorite [90], although the yields of such molecules are poor in typical spark discharge experiments [91].
Cofactors
It might be assumed that most of the inorganic cofactors (Mo, Fe, Mn, etc.) were present to some degree as soluble ions in the prebiotic seas. Many of the organic cofactors, however, are either clearly byproducts of an extant metabolism or have syntheses so complex that their presence on the early Earth cannot reasonably be postulated. Most enzyme-catalyzed reactions use a cofactor, and these are often members of a small set of small biochemicals known collectively as vitamins. The most widely used is nicotinamide, and several prebiotic syntheses of this compound have been devised [92,93]. Other interesting vitamins that have prebiotic syntheses include components of coenzyme A and coenzyme M [94–96] and analogs of pyridoxal [97]. There have been reports of flavin-like compounds generated from dry-heated amino acids, but these have not been well characterized [98]. It may be that many compounds lacking prebiotic syntheses were generated later, once a functioning biochemistry was in place [99]. Interestingly, many of these cofactors are able to carry out their catalyses, albeit to a lesser degree, in the absence of the enzyme. Nonenzymatic reactions that occur in the presence of vitamin cofactors include thiamin-mediated formose reactions [100] and transamination with pyridoxal [101]. These may have some relevance to prebiotic chemistry, or perhaps to the early development of metabolism. It is unclear whether porphyrins were necessary for the origin of life, although they are now a part of every terrestrial organism’s biochemistry as electron carriers and photopigments. They can be formed rather simply from the reaction of pyrroles and HCHO [102,103] (figure 1.17).

Small Molecules Remaining to be Synthesized
There are numerous biochemicals for which no plausible prebiotic route is known, despite some interesting attempted syntheses. Tryptophan, phenylalanine, tyrosine, histidine, arginine, lysine, and the cofactors pyridoxal, thiamin, riboflavin, and folate are notable examples. These may not be necessary for the origin of life and may instead be byproducts of a more evolutionarily sophisticated metabolism.
Figure 1.17 Prebiotic synthesis of porphyrins from pyrroles and formaldehyde.
Nucleosides
One popular theory for the origin of life posits the existence of an RNA world, a time when RNA molecules played the roles of both catalyst and genetic material [104]. A great deal of research has been carried out on the prebiotic synthesis of nucleosides and nucleotides. Although few researchers still consider this scenario plausible for the very origin of life, it is possible that an RNA world existed as an intermediary stage in the development of life, once a simpler self-replicating system had evolved. Perhaps the most promising nucleoside syntheses start from purines and pure D-ribose in drying reactions, which simulate conditions that might occur in an evaporating basin. Using hypoxanthine and a mixture of salts reminiscent of those found in seawater, up to 8% of β-D-inosine is formed, along with the α-isomer. Adenine and guanine gave lower yields, and in both cases a mixture of α- and β-isomers was obtained [105]. Pyrimidine nucleosides have proven much more difficult to synthesize. Direct heating of ribose with uracil or cytosine has thus far failed to produce uridine or cytidine. Pyrimidine nucleoside syntheses have been demonstrated that start from ribose, cyanamide, and cyanoacetylene; however, α-D-cytidine is the major product [106]. This can be photoanomerized to β-D-cytidine in low yield, although the converse reaction also occurs. Sutherland and coworkers [107] demonstrated a more inventive approach, showing that cytidine-3′-phosphate can be prepared from arabinose-3-phosphate, cyanamide, and cyanoacetylene in a one-pot reaction. The conditions may be somewhat forced, and the source of arabinose-3-phosphate is unclear; nevertheless, the possibility remains that more creative methods of preparing the pyrimidine nucleosides will be found. Alternatively, the difficulties with prebiotic ribose synthesis and nucleoside formation have led some to speculate that a simpler genetic molecule with a more robust prebiotic synthesis may have preceded RNA.
A number of alternatives have been investigated. Some propose substituting other sugars for ribose. When formed into sugar-phosphate polymers, these also often form stable base-paired structures with RNA/DNA and with themselves [77,108–110], opening the possibility of a genetic takeover from a precursor polymer to RNA/DNA. These molecules would likely suffer from the same drawbacks as RNA, such as the difficulty of selective sugar synthesis, sugar instability, and the difficulty of nucleoside formation. Recently it has been demonstrated, based on the speculations of Joyce et al. [81] and the chemistry proposed by Nelsestuen [111] and Tohidi and Orgel [112], that backbones based on acyclic nucleoside analogs may be more easily obtained under reasonable prebiotic conditions, for example by the reaction of nucleobases with acrolein during mixed
formose reactions [113]. This remains a largely unexplored area of research. More exotic alternatives to nucleosides have been proposed based on the peptide nucleic acid (PNA) analogs of Nielsen and coworkers [114]. Miller and coworkers [115] were able to demonstrate the facile synthesis of all of the components of PNA, even in very dilute solution, under the same chemical conditions required for the synthesis of the purines or pyrimidines. The assembly of these molecules into oligomers has not yet been demonstrated, and may be unlikely owing to the instability of PNA to hydrolysis and cyclization [116]. Nevertheless, there may be alternative structures, not yet investigated, that sidestep some of the problems with the original PNA backbone.

Nucleotides
Condensed phosphates are the universal biological energy currency; however, abiological dehydration reactions are extremely difficult in aqueous solution because of the high water activity. Phosphate concentrations in the modern oceans are extremely low, partly because of rapid scavenging of phosphate by organisms, but also because of the extreme insolubility of calcium phosphates. Indeed, almost all of the phosphate present on the Earth today occurs as calcium phosphate deposits such as apatite. There is some evidence, however, that condensed phosphates are emitted in volcanic fumaroles [117]. An extensive review of the hydrolysis and formation rates of condensed phosphates has not been conducted; however, it has been suggested that condensed phosphates are not likely to have been prebiotically available materials [118]. Nevertheless, heating orthophosphate at relatively low temperatures in the presence of ammonia results in a high yield of condensed phosphates [119]. Additionally, trimetaphosphate (TMP) has been shown to be an active phosphorylating agent for various prebiological molecules, including amino acids and nucleosides [120,121]. Early attempts to produce nucleotides used organic condensing reagents such as cyanamide, cyanate, or dicyanamide [122]. Such reactions were generally inefficient because water competes with the alcohol groups of the nucleosides in an aqueous environment. Nucleosides can be phosphorylated with acidic phosphates such as NaH2PO4 when dry heated [123]. The reactions are catalyzed by urea and other amides, particularly if ammonium is included in the reaction. Heating ammonium phosphate with urea also gives a mixture of high molecular weight polyphosphates [119]. Nucleosides can be phosphorylated in high yield by heating ammonium phosphate with urea at moderate temperatures, as might occur in a drying basin [124]. For example, by heating uridine with urea and ammonium phosphate, yields of phosphorylated nucleosides as high as
70% have been achieved. In the case of purine nucleosides, however, there is also considerable glycosidic cleavage due to the acidic microenvironment created. This underscores another problem with the RNA world: the synthesis of purine nucleosides is somewhat robust but their phosphorylation to nucleotides may be difficult, while nucleotide formation from pyrimidine nucleosides is robust but pyrimidine nucleoside formation appears to be difficult. Hydroxyapatite itself is a reasonable phosphorylating reagent: yields of nucleotides as high as 20% were achieved by heating nucleosides with hydroxyapatite, urea, and ammonium phosphate [124]. As noted above, heating ammonium phosphates with urea leads to a mixture of high molecular weight polyphosphates [119]. Although these are not especially good phosphorylating reagents under prebiotic conditions, they tend to degrade, especially in the presence of divalent cations at high temperatures, to cyclic phosphates such as trimetaphosphate, which have been shown to be promising phosphorylating reagents [121]. cis-Glycols react readily with trimetaphosphate under alkaline conditions to yield cyclic phosphates, and the reaction proceeds reasonably well under more neutral conditions in the presence of magnesium cation [125].

HYDROTHERMAL VENTS AND THE ORIGIN OF LIFE
The discovery of hydrothermal vents at the oceanic ridge crests, and the appreciation of their significance in the element balance of the hydrosphere, represents a major development in oceanography [126]. Since the process of hydrothermal circulation probably began early in the Earth’s history, it is likely that vents were present in the Archean oceans. Large amounts of ocean water now pass through the vents, with the whole ocean cycling through them every 10 million years [127]. This flow was probably greater during the early history of the Earth, since the heat flow from the planet’s interior was greater during that period. The topic has received a great deal of attention, partly because of doubts regarding the oxidation state of the early atmosphere. Following the first report of the vents’ existence, a detailed hypothesis suggesting a hydrothermal emergence of life was published [128], in which it was suggested that amino acids and other organic compounds are produced as the 350 °C vent waters pass through a temperature gradient into the 0 °C ocean waters. Polymerization of the organic compounds thus formed, followed by their self-organization, was also proposed to take place in this environment, leading to the first forms of life. At first glance, submarine hydrothermal springs would appear to be ideally suited for creating life, given the geological plausibility of a hot early Earth. More than a hundred vents are known to exist along
the active tectonic areas of the Earth, and at least in some of them catalytic clays and minerals interact with an aqueous reducing environment rich in H2, H2S, CO, CO2, and perhaps HCN, CH4, and NH3. Unfortunately, it is difficult to corroborate these speculations with the findings from the effluents of modern vents, as a great deal of the organic material released from modern sources is diagenized biological material, and it is difficult to separate the biotic from the abiotic components of these reactions. Much of the organic component of hydrothermal fluids may be formed from diagenetically altered microbial matter. So far, the most articulate autotrophic hypothesis stems from the work of Wächtershäuser [129,130], who has argued that life began with the appearance of an autocatalytic, two-dimensional chemolithotrophic metabolic system based on the formation of the highly insoluble mineral pyrite (FeS2). The reaction FeS + H2S → FeS2 + H2 is very favorable: it is irreversible and highly exergonic, with a standard free energy change ∆G° = −9.23 kcal/mol, corresponding to a reduction potential E° = −620 mV. Thus, the FeS/H2S combination is a strong reducing agent and has been shown to provide an efficient source of electrons for the reduction of organic compounds under mild conditions. The scenario proposed by Wächtershäuser [129,130] fits well with the environmental conditions found at deep-sea hydrothermal vents, where H2S, CO2, and CO are quite abundant. The FeS/H2S system does not reduce CO2 to amino acids, purines, or pyrimidines, although there is more than adequate free energy to do so [131]. However, pyrite formation can produce molecular hydrogen, and reduce nitrate to ammonia and acetylene to ethylene [132]. More recent experiments have shown that the activation of amino acids with carbon monoxide and (Ni,Fe)S can lead to peptide-bond formation [133].
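The quoted thermodynamic values can be cross-checked with ΔG° = −nFE. A sketch, under two assumptions not stated in the text: that n = 2 electrons are transferred, and that the −620 mV figure refers to the pyrite-forming couple measured at pH 7, where the H+/H2 couple sits near −414 mV:

```python
F = 96485.0            # Faraday constant, C/mol
n = 2                  # assumed electrons transferred in FeS + H2S -> FeS2 + H2
dG = -9.23 * 4184.0    # quoted delta-G, kcal/mol converted to J/mol

# delta-G = -n F E_cell gives the overall cell potential
E_cell = -dG / (n * F)
print(f"E_cell ~ {E_cell * 1000:.0f} mV")

# Measured against the H+/H2 couple at pH 7 (~ -414 mV), the pyrite-forming
# couple then lands near the -620 mV quoted in the text.
E_couple = -0.414 - E_cell
print(f"E(FeS2 / FeS + H2S) at pH 7 ~ {E_couple * 1000:.0f} mV")
```

Under these assumptions the two quoted numbers are mutually consistent to within a few millivolts.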
In these experiments, however, the reactions take place in an aqueous environment to which powdered pyrite has been added; they do not form a dense monolayer of ionically bound molecules, nor do they take place on the surface of pyrite. None of the experiments using the FeS/H2S system reported so far suggests that enzymes and nucleic acids are the evolutionary outcome of surface-bound metabolism. The results are also compatible with a more general model of the primitive soup in which pyrite formation is an important source of electrons for the reduction of organic compounds. It is possible that under certain geological conditions the FeS/H2S combination could have reduced not only CO but also CO2 released from molten magma in deep-sea vents, leading to biochemical monomers [134]. Peptide synthesis could have taken place in an iron and nickel sulfide system [133] involving amino acids formed by electric discharges via a Strecker-type synthesis, although this scenario requires the transportation of compounds formed at the surface to the deep-sea vents [135]. It seems likely that concentrations of reactants
would be prohibitively low based on second-order reaction kinetics. If the compounds synthesized by this process did not remain bound to the pyrite surface, but drifted away into the surrounding aqueous environment, then they would become part of the prebiotic soup, not of a two-dimensional organism. In general, organic compounds are decomposed rather than created at hydrothermal vent temperatures, although of course temperature gradients exist. As has been shown by Sowerby and coworkers [136], adsorption on mineral surfaces would tend to concentrate any organics created at hydrothermal vents in cooler zones, where other reaction schemes would need to be invoked. The presence of reduced metals and the high temperatures of hydrothermal vents have also led to suggestions that reactions similar to Fischer–Tropsch-type (FTT) syntheses may be common under such regimes. It is unclear to what extent this is valid, as typical FTT catalysts are easily poisoned by water and sulfide. It has been argued that some of the likely environmental catalysts, such as magnetite, may be immune to such poisoning [137].

Stability of Biomolecules at High Temperatures
A thermophilic origin of life is not a new idea. It was first suggested by Harvey [138], who argued that the first life forms were heterotrophic thermophiles originating in hot springs such as those found in Yellowstone Park. As Harvey underlined, the one advantage of high temperatures is that chemical reactions would proceed faster, so primitive enzymes could have been less efficient. However, high temperatures are destructive to organic compounds; the price paid is loss of biochemical compounds to decomposition. Although some progress has been made in synthesizing small molecules under hydrothermal vent type conditions, the larger trend for biomolecules at high temperatures is decomposition. As various authors have demonstrated, most biological molecules have half-lives to hydrolysis on the order of minutes to seconds at the high temperatures associated with hydrothermal vents. As noted above, ribose and other sugars are very thermolabile compounds [79]. The stability of ribose and other sugars is problematic, but pyrimidines and purines, and many amino acids, are nearly as labile. At 100 °C the half-life for deamination of cytosine is 21 days, and 204 days for adenine [139,140]. Some amino acids are stable (alanine, for example, has a half-life for decarboxylation of approximately 19,000 years at 100 °C), but serine decarboxylates to ethanolamine with a half-life of 320 days [141]. White [142] measured the decomposition of various compounds at 250 °C and pH 7 and found half-lives of amino acids from 7.5 s to 278 min, half-lives for peptide bonds from <1 min to 11.8 min, half-lives for glycoside cleavage in nucleosides from <1 s to 1.3 min, decomposition
of nucleobases from 15 to 57 min, and half-lives for phosphate esters from 2.3 to 420 min. It should be borne in mind that the half-lives of polymers would be even shorter, since a polymer has many potential breakage points. Thus, while the vents may serve as synthesis sites for simpler compounds such as acetate, or for more refractory organic compounds such as fatty acids, it is unlikely that they played a major role in synthesizing most biochemicals or their polymers. Submarine vents at 350 °C do not appear to synthesize organic compounds at present; more likely, they decompose them over a time span ranging from seconds to a few hours. The origin of life in the vents is therefore improbable. This does not imply that hydrothermal springs were a negligible factor on the primitive Earth. If the mineral assemblages were sufficiently reducing, the rocks near the vents may have been a source of atmospheric CH4 or H2. As stated earlier, the concentrations of biomolecules that could have accumulated on the primitive Earth are governed largely by the rates of production and the rates of destruction. Submarine hydrothermal vents would thus have been important in the destruction rather than in the synthesis of organic compounds, fixing the upper limit for the organic compound concentration in the primitive oceans. Although it is presently not possible to state which compounds were essential for the origin of life, it does seem possible to preclude certain environments, if even fairly simple organic compounds were involved [143].

EXTRATERRESTRIAL SYNTHESES
Regardless of what the early Earth’s atmosphere was like, the planet was undoubtedly bombarded then, as it is now, by extraterrestrial material such as meteorites and comets. The presence of extraterrestrial organic compounds has been recognized since the mid-nineteenth century, when Berzelius analyzed the Alais meteorite, a carbonaceous C1 chondrite, in 1834; the finding was confirmed a few years later when Wöhler studied the Kaba meteorite, a C2 carbonaceous chondrite. Today the presence of a complex array of extraterrestrial organic molecules in meteorites, comets, interplanetary dust, and the interstellar medium is firmly established, and has led some to propose them as exogenous sources of the prebiotic organic compounds necessary for the origin of life [144–147]. One reason for proposing an extraterrestrial origin of the components of the prebiotic soup is the CO2-rich model of the primitive Earth’s atmosphere [148], which would not be as conducive to atmospheric organic synthesis. The Apollo missions revealed few if any organic materials on the moon; however, doubts as to the occurrence of organic materials in the solar system were laid to rest in 1969 when a meteorite fell in Murchison, Australia. This meteorite was seen to fall and was rapidly
Table 1.4 Relative abundances of amino acids detected in the Murchison meteorite and a spark discharge experiment (adapted from Wolman et al. [158])

Amino acid                  Murchison    Electric discharge
Glycine                     ****         ****
Alanine                     ****         ****
α-Amino-n-butyric acid      ***          ****
α-Aminoisobutyric acid      ****         **
Valine                      ***          **
Norvaline                   ***          ***
Isovaline                   **           **
Proline                     ***          *
Pipecolic acid              *            <*
Aspartic acid               ***          ***
Glutamic acid               ***          **
β-Alanine                   **           **
β-Amino-n-butyric acid      *            *
β-Aminoisobutyric acid      *            *
γ-Aminobutyric acid         *            **
Sarcosine                   **           ***
N-Ethylglycine              **           ***
N-Methylalanine             **           **
collected, thus minimizing field contamination, and analyzed in the laboratory. A host of organic compounds indubitably of extraterrestrial origin was revealed, strongly resembling those produced in laboratory syntheses under presumed prebiotic conditions (table 1.4). Questions remain regarding the survival of organic material delivered by extraterrestrial bodies, although obviously the compounds in the Murchison meteorite did survive. There is also a large abundance of extraterrestrial amino acids associated with the 65-million-year-old impact event, concurrent with the decline of the dinosaurs and recorded geologically [149]. Chyba and Sagan [147] estimated the flux of extraterrestrial organics to the Earth based on the lunar cratering record. They then extrapolated an organic content and a yield based on the survival of these organics during entry, and estimated that exogenous delivery would have made a significant contribution to the primitive Earth’s organic inventory. Survival of extraterrestrial organic material would have been higher if the Earth’s atmosphere had been denser. The estimated flux of HCN equivalents, compared with that produced in a reducing atmosphere, is shown in figure 1.18. Thus, even if the early Earth’s atmosphere was oxidizing, the case can be made that significant amounts of prebiotic organic compounds, resembling the types made in terrestrial atmospheric syntheses, would have been delivered to the Earth.
Figure 1.18 Estimated inputs of HCN from various energy sources and organic compounds from comets. The cometary inputs are from Chyba and Sagan [147]. The area is shaded because of the uncertainty in the fall-off of dust input on the early Earth. The cometary input has been converted to nmol cm^-2 yr^-1 assuming a molecular weight of 100. The HCN production is discussed in Stribling and Miller [24].
The question is what was the relative percentage of prebiotic organic matter contributed by each source. Although there is a wide variety of interstellar molecules, including formaldehyde, hydrogen cyanide, acetaldehyde, cyanoacetylene, and other prebiotic compounds, and their total amount in a nebular dust cloud is high, it is generally felt that they played at most a minor role in the origin of life. The major sources of exogenous compounds would appear to be comets and dust, with asteroids and meteorites being minor contributors. Asteroids would have impacted the Earth frequently in prebiotic times, but the amount of organic material brought in would have been small, even if the asteroid were a Murchison meteorite-type object. The Murchison meteorite contains approximately 1.8% organic carbon, but most of this is a polymer, and there are only about 100 parts per million of amino acids. Assuming a density of approximately 2.0, for the 10-km asteroid believed to have hit the Earth at the end of the Cretaceous
this would be equivalent to some 1.2 × 10^17 g of amino acids, or about 10^-6 M amino acids if distributed evenly in oceans of the present size. However, it apparently left amino acids at the K/T boundary that are detectable by the most sensitive modern analytical methods [149]. The same considerations would apply to carbonaceous chondrites, although the survival of the organics on impact would be much better than with asteroids. As suggested by recent measurements, a significant percentage of meteoritic amino acids and nucleobases could survive the high temperatures associated with frictional heating during atmospheric entry, and become part of the primitive broth [150].

Comets are perhaps the most promising source of exogenous compounds [151]; the first proposal for their prebiotic importance was made by Oró [144]. Cometary nuclei contain about 80% H2O, 1% HCN, and 1% H2CO, as well as important amounts of CO2 and CO. Thus, assuming that cometary nuclei have a density of 1 g cm^-3, a 1-km-diameter comet would contain 2 × 10^11 moles of HCN, or 40 nmol cm^-2 of the Earth’s surface. This is comparable to the yearly production of HCN in a reducing atmosphere from electric discharges, and would be quite important if the Earth did not have a reducing atmosphere. This calculation assumes complete survival of the HCN on impact. In fact, there is little understanding of what happens during the impact of such an object, but much of it would be heated to temperatures above 300 °C, which would decompose HCN and other cometary compounds. However, the resulting highly reactive chemical species could then be used as precursors in the abiotic syntheses of biochemical monomers.

The input from interplanetary dust may also have been important. The present infall is approximately 40 × 10^6 kg/yr [152], but on the primitive Earth it may have been greater by a factor of 100 to 1000. Unfortunately, the organic composition of cosmic dust is poorly known [153].
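The comet estimate above is straightforward to reproduce. A sketch of the arithmetic (assumptions not in the text: a spherical nucleus, an HCN molar mass of 27 g/mol, and an Earth surface area of 5.1 × 10^18 cm^2):

```python
import math

# 1-km-diameter comet nucleus, density 1 g/cm^3, 1% HCN by mass
radius_cm = 0.5e5                              # 0.5 km expressed in cm
mass_g = (4 / 3) * math.pi * radius_cm**3 * 1.0
hcn_mol = 0.01 * mass_g / 27.0                 # 1% HCN, molar mass 27 g/mol
print(f"HCN per comet ~ {hcn_mol:.1e} mol")    # ~ 2e11 mol, as in the text

earth_cm2 = 5.1e18                             # Earth's surface area, cm^2
flux = hcn_mol / earth_cm2 * 1e9               # nmol per cm^2 of surface
print(f"~ {flux:.0f} nmol cm^-2 of Earth's surface")
```

The result, roughly 40 nmol cm^-2 per 1-km comet, matches the figure quoted from the text.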
The only individual molecules that have been detected are polycyclic aromatic hydrocarbons [154,155]. Much of the dust could be organic polymers called tholins, which are produced by electric discharges, ionizing radiation, and ultraviolet light. Although tholins are quite resilient, acid hydrolysis releases a few percent of amino acids from them. A more promising role for the tholins would be as a source of prebiotic precursors such as HCN, cyanoacetylene, and aldehydes. On entry into the Earth's atmosphere, the dust particles would be heated and the tholins pyrolyzed, releasing HCN and other molecules, which could then participate in prebiotic reactions [156,157].
CONCLUSION
Given adequate expertise and experimental conditions, it is possible to synthesize almost any organic molecule in the laboratory under
Prebiotic Chemistry on the Primitive Earth
simulated prebiotic conditions. However, the fact that a number of molecular components of contemporary cells can be formed nonenzymatically in the laboratory does not necessarily mean that they were essential for the origin of life, or that they were available in the prebiotic milieu. The primitive soup must have been a complex mixture, but it could not reasonably have included all the compounds or molecular structures found today in even the simplest prokaryotes. The basic tenet of the heterotrophic theory of the origin of life is that the origin and reproduction of the first living system depended primarily on abiotically synthesized organic molecules. As summarized here, there has been no shortage of discussion about how the formation of the primitive soup took place. It is unlikely that any single mechanism can account for the wide range of organic compounds that may have accumulated on the primitive Earth. The prebiotic soup was undoubtedly formed by contributions from endogenous atmospheric synthesis, deep-sea hydrothermal vent synthesis, and exogenous delivery from sources such as comets, meteorites, and interplanetary dust. This eclectic view does not settle the issue of the relative significance of the different sources of organics; it simply recognizes the wide variety of potential sources of organic compounds, the raw material required for the emergence of life. The existence of different abiotic mechanisms by which biochemical monomers can be synthesized under plausible prebiotic conditions is well established. Of course, not all prebiotic pathways are equally efficient, but the wide range of experimental conditions under which organic compounds can be synthesized demonstrates that prebiotic syntheses of the building blocks of life are robust; that is, the abiotic reactions leading to them do not require a narrow range of highly selective reaction conditions, but proceed under a wide variety of environmental settings.
Although our ideas on the prebiotic synthesis of organic compounds are based largely on experiments in model systems, the robustness of this type of chemistry is supported by the occurrence of most of these biochemical compounds in the Murchison meteorite. It is therefore plausible, but not proven, that similar syntheses took place on the primitive Earth. For all the uncertainties surrounding the emergence of life, it appears that the formation of the prebiotic soup is one of the most firmly established events that took place on the primitive Earth.
ACKNOWLEDGMENTS
We would like to thank Professors Antonio Lazcano and Jeffrey Bada, Dr. John Chalmers and Dr. Charles Salerno, and Mrs. Brenda Leake for helpful comments and assistance with the manuscript.
REFERENCES
1. Farley, J. The Spontaneous Generation Controversy from Descartes to Oparin. Johns Hopkins University Press, Baltimore, 1977. 2. Fry, I. The Emergence of Life on Earth: A Historical and Scientific Overview. Rutgers University Press, New Brunswick, N.J., 2000. 3. Wills, C. and J. Bada. The Spark of Life: Darwin and the Primeval Soup. Perseus, Cambridge, Mass., 2000. 4. Miller, S. and L. Orgel. The Origins of Life on the Earth. Prentice-Hall, Englewood Cliffs, N.J., 1974. 5. Schopf, J. (Ed.) Earth's Earliest Biosphere: Its Origin and Evolution. Princeton University Press, Princeton, N.J., 1983. 6. Leicester, H. Development of Biochemical Concepts from Ancient to Modern Times. Harvard University Press, Cambridge, Mass., 1974. 7. Glockler, G. and S. Lind. The Electrochemistry of Gases and Other Dielectrics. Wiley, New York, 1939. 8. Löb, W. Behavior of formamide under the influence of the silent discharge. Nitrogen assimilation. Chemische Berichte, 46:684–97, 1913. 9. Baudisch, O. Über Nitrat- und Nitritassimilation. Zeitschrift für Angewandte Chemie, 26:612–13, 1913. 10. Lazcano, A. A. I. Oparin: the man and his theory. In B. F. Poglazov, B. I. Kurganov, M. S. Kritsky, and K. L. Gladilin (Eds.), Evolutionary Biochemistry and Related Areas of Physicochemical Biology (pp. 49–56). Bach Institute of Biochemistry and ANKO Press, Moscow, 1995. 11. Oparin, A. Proiskhozhdenie Zhizni. Moskovskii Rabochii, Moscow, 1924. Reprinted and translated in J. D. Bernal, The Origin of Life. Weidenfeld & Nicolson, London, 1967. 12. Haldane, J. The origin of life. In The Rationalist Annual, issue 148, pp. 3–10, 1929. 13. Oparin, A. The Origin of Life. Macmillan, New York, 1938. 14. Urey, H. The Planets, Their Origin and Development. Yale University Press, New Haven, Conn., 1952. 15. Rubey, W. Geologic history of sea water. An attempt to state the problem. Geological Society of America Bulletin, 62:1111–48, 1951. 16. Lazcano, A. and S. Miller.
How long did it take for life to begin and evolve to cyanobacteria? Journal of Molecular Evolution, 39(6):546–54, 1994. 17. Orgel, L. The origin of life—how long did it take? Origins of Life and Evolution of the Biosphere, 28(1):91–6, 1998. 18. Wetherill, G. Formation of the Earth. In Albee, A. and Stehli, F. (Eds.), Annual Review of Earth and Planetary Sciences, Vol. 18, pp. 205–56. Palo Alto, 1990. 19. Holland, H. Model for the evolution of the Earth’s atmosphere. Geological Society of America, Buddington Volume, 447–77, 1962. 20. Canuto, V., J. Levine, T. Augustsson and C. Imhoff. Oxygen and ozone in the early Earth’s atmosphere. Precambrian Research, 20(2-4):109–20, 1983. 21. Cleaves, H. and S. Miller. Oceanic protection of prebiotic organic compounds from ultraviolet radiation. Proceedings of the National Academy of Sciences USA, 95:7260–3, 1998.
22. Mauzerall, D. The photochemical origins of life and photoreaction of ferrous ion in the Archean oceans. Origins of Life and Evolution of the Biosphere, 20(3-4):293–302, 1990. 23. Chameides, W. and J. Walker. Rates of fixation by lightning of carbon and nitrogen in possible primitive atmospheres. Origins of Life and Evolution of the Biosphere, 11(4):291–302, 1981. 24. Stribling, R. and S. Miller. Energy yields for hydrogen cyanide and formaldehyde syntheses: the hydrogen cyanide and amino acid concentrations in the primitive ocean. Origins of Life and Evolution of the Biosphere, 17(3–4):261–73, 1987. 25. Bada, J., C. Bigham and S. Miller. Impact melting of frozen oceans on the early Earth: implications for the origin of life. Proceedings of the National Academy of Sciences USA, 91(4):1248–50, 1994. 26. Hoffman, P. and D. Schrag. The snowball Earth hypothesis: testing the limits of global change. Terra Nova, 14(3):129–55, 2002. 27. Wilde, S., J. Valley, W. Peck and C. Graham. Evidence from detrital zircons for the existence of continental crust and oceans on the Earth 4.4 Gyr ago. Nature, 409(6817):175–8, 2001. 28. Mojzsis, S., T. Harrison and R. Pidgeon. Oxygen-isotope evidence from ancient zircons for liquid water at the Earth’s surface 4,300 Myr ago. Nature, 409(6817):178–81, 2001. 29. Berner, R., A. Lasaga and R. Garrels. The carbonate-silicate geochemical cycle and its effect on atmospheric carbon dioxide over the past 100 million years. American Journal of Science, 283(7):641–83, 1983. 30. Mosqueira, F., G. Albarran and A. Negron-Mendoza. A review of conditions affecting the radiolysis due to 40K on nucleic acid bases and their derivatives adsorbed on clay minerals: implications in prebiotic chemistry. Origins of Life and Evolution of the Biosphere, 26(1):75–94, 1996. 31. Miyakawa, S., H. Cleaves and S. Miller. The cold origin of life. A. Implications based on the hydrolytic stabilities of hydrogen cyanide and formamide. 
Origins of Life and Evolution of the Biosphere, 32(3):195–208, 2002. 32. Zhang, X., S. Martin, C. Friend, M. Schoonen and H. Holland. Mineral-assisted pathways in prebiotic synthesis: photoelectrochemical reduction of carbon(+IV) by manganese sulfide. Journal of the American Chemical Society, 126(36):11247–53, 2004. 33. Levy, M., S. Miller, K. Brinton and J. Bada. Prebiotic synthesis of adenine and amino acids under Europa-like conditions. Icarus, 145:609–13, 2000. 34. Nelson, K., M. Robertson, M. Levy and S. Miller. Concentration by evaporation and the prebiotic synthesis of cytosine. Origins of Life and Evolution of the Biosphere, 31(3):221–9, 2001. 35. Lasaga, A. and H. Holland. Primordial oil slick. Science, 174(4004):53–5, 1971. 36. Cairns-Smith, A. Takeover mechanisms and early biochemical evolution. BioSystems, 9(2–3):105–9, 1977. 37. Urey, H. On the early chemical history of the Earth and the origin of life. Proceedings of the National Academy of Sciences USA, 38:351–63, 1952. 38. Miller, S. A production of amino acids under possible primitive Earth conditions. Science, 117:528, 1953. 39. Miller, S. Production of some organic compounds under possible primitive Earth conditions. Journal of the American Chemical Society, 77:2351–61, 1955.
40. Miller, S. The mechanism of synthesis of amino acids by electric discharges. Biochimica et Biophysica Acta, 23:480–9, 1957. 41. Miller, S. The endogenous synthesis of organic compounds. In Brack, A. (Ed.), The Molecular Origins of Life: Assembling the Pieces of the Puzzle (pp. 59–85). Cambridge University Press, Cambridge, 1998. 42. Peltzer, E., J. Bada, G. Schlesinger and S. Miller. The chemical conditions on the parent body of the Murchison meteorite: some conclusions based on amino-, hydroxy-, and dicarboxylic acids. Advances in Space Research, 4:69–74, 1984. 43. Kuhn, W. and S. Atreya. Ammonia photolysis and the greenhouse effect in the primordial atmosphere of the Earth. Icarus, 37(1):207–13, 1979. 44. Zahnle, K. Photochemistry of methane and the formation of hydrocyanic acid (HCN) in the Earth’s early atmosphere. Journal of Geophysical Research [Atmospheres], 91(D2):2819–34, 1986. 45. Ferris, J., P. Joshi, E. Edelson and J. Lawless. HCN: a plausible source of purines, pyrimidines, and amino acids on the primitive Earth. Journal of Molecular Evolution, 11:293–311, 1978. 46. Cleaves, H. The prebiotic synthesis of acrolein. Monatshefte für Chemie, 134(4):585–93, 2003. 47. Van Trump, J. and S. Miller. Prebiotic synthesis of methionine. Science, 178(63):859–60, 1972. 48. Oró, J. Synthesis of adenine from ammonium cyanide. Biochemical and Biophysical Research Communications, 2: 407–12, 1960. 49. Oró, J. and A. Kimball. Synthesis of purines under primitive Earth conditions. I. Adenine from hydrogen cyanide. Archives of Biochemistry and Biophysics, 94:221–7, 1961. 50. Ferris, J. and L. Orgel. An unusual photochemical rearrangement in the synthesis of adenine from hydrogen cyanide. Journal of the American Chemical Society 88:1074, 1966. 51. Sanchez, R., J. Ferris and L. Orgel. Conditions for purine synthesis: did prebiotic synthesis occur at low temperatures? Science, 153:72–3, 1966. 52. Miyakawa, S., H. Cleaves and S. Miller. The cold origin of life. B. 
Implications based on pyrimidines and purines produced from frozen ammonium cyanide solutions. Origins of Life and Evolution of the Biosphere, 32(3):209–18, 2002. 53. Voet, A. and A. Schwartz. Prebiotic adenine synthesis from HCN: evidence for a newly discovered major pathway. Bioorganic Chemistry, 12:8–17, 1983. 54. Sanchez, R., J. Ferris and L. Orgel. Studies in prebiotic synthesis. II. Synthesis of purine precursors and amino acids from aqueous hydrogen cyanide. Journal of Molecular Biology, 30:223–53, 1967. 55. Sanchez, R., J. Ferris and L. Orgel. Studies in prebiotic synthesis. IV. The conversion of 4-aminoimidazole-5-carbonitrile derivatives to purines. Journal of Molecular Evolution, 38: 121–8, 1968. 56. Levy, M., S. Miller and J. Oró. Production of guanine from NH4CN polymerizations. Journal of Molecular Evolution, 49:165–8, 1999. 57. Fox, S. and K. Harada. Synthesis of uracil under conditions of a thermal model of prebiological chemistry. Science, 133:1923–4, 1961.
58. Sanchez, R., J. Ferris and L. Orgel. Cyanoacetylene in prebiotic synthesis. Science 154:784–5, 1966. 59. Ferris, J., R. Sanchez and L. Orgel. Studies in prebiotic synthesis. III. Synthesis of pyrimidines from cyanoacetylene and cyanate. Journal of Molecular Biology, 33:693–704, 1968. 60. Choughuley, A., A. Subbaraman, Z. Kazi and M. Chadha. A possible prebiotic synthesis of thymine: uracil-formaldehyde-formic acid reaction. BioSystems, 9(2–3):73–80, 1977. 61. Schwartz, A. W. and G. Chittenden. Synthesis of uracil and thymine under simulated prebiotic conditions. BioSystems, 9(2–3):87–92, 1977. 62. Ferris, J., O. Zamek, A. Altbuch and H. Freiman. Chemical evolution. XVIII. Synthesis of pyrimidines from guanidine and cyanoacetaldehyde. Journal of Molecular Evolution, 3:301–9, 1974. 63. Robertson, M. and S. Miller. An efficient prebiotic synthesis of cytosine and uracil. Nature, 375:772–4, 1995. 64. Robertson, M., M. Levy and S. Miller. Prebiotic synthesis of diaminopyrimidine and thiocytosine. Journal of Molecular Evolution, 43:543–50, 1996. 65. Voet, A. and A. Schwartz. Uracil synthesis via hydrogen cyanide oligomerization. Origins of Life and Evolution of the Biosphere, 12(1):45–9, 1982. 66. Levy, M. and S. Miller. The prebiotic synthesis of modified purines and their potential role in the RNA world. Journal of Molecular Evolution, 48: 631–7, 1999. 67. House, C. and S. Miller. Hydrolysis of dihydrouridine and related compounds. Biochemistry, 35(1):315–20, 1996. 68. Robertson, M. and S. Miller. Prebiotic synthesis of 5-substituted uracils: a bridge between the RNA world and the DNA-protein world. Science, 268:702–5, 1995. 69. Butlerow, A. Formation synthétique d’une substance sucrée. Comptes Rendus de l’Académie des Sciences, 53:145–7, 1861. 70. Gabel, N. and Ponnamperuma, C. Model for the origin of monosaccharides. Nature 216:453–5, 1967. 71. Reid, C. and L. Orgel. Synthesis of sugars in potentially prebiotic conditions. Nature, 216:455, 1967. 72. Schwartz, A. 
and R. De Graaf. The prebiotic synthesis of carbohydrates: a reassessment. Journal of Molecular Evolution, 36(2):101–6, 1993. 73. Matsumoto, T., M. Komiyama and S. Inoue. Selective formose reaction catalyzed by diethylaminoethanol. Chemistry Letters, (7):839–42, 1980. 74. Oró, J. Stages and mechanisms of prebiological organic synthesis. In S. W. Fox (Ed.), The Origins of Prebiological Systems and of Their Molecular Matrices (pp. 137–62). Academic Press, New York, 1965. 75. Muller, D., S. Pitsch, A. Kittaka, E. Wagner, C. E. Wintner and A. Eschenmoser. Chemie von alpha-Aminonitrilen (135). Helvetica Chimica Acta, 73:1410–68, 1990. 76. Pitsch, S., R. Krishnamurthy, M. Bolli, S. Wendeborn, A. Holzner, M. Minton, C. Leseur, I. Schlönvogt, B. Jaun and A. Eschenmoser. Pyranosyl-RNA (pRNA): base-pairing selectivity and potential to replicate. Helvetica Chimica Acta, 78:1621–35, 1995. 77. Beier, M., F. Reck, T. Wagner, R. Krishnamurthy and A. Eschenmoser. Chemical etiology of nucleic acid structure: comparing pentopyranosyl-(2′→4′) oligonucleotides with RNA. Science, 283(5402):699–703, 1999.
78. Decker, P., H. Schweer and R. Pohlmann. Identification of formose sugars, presumable prebiotic metabolites, using capillary gas chromatography/gas chromatography-mass spectrometry of n-butoxime trifluoroacetates on OV-225. Journal of Chromatography, 225:281–91, 1982. 79. Larralde, R., M. Robertson and S. Miller. Rates of decomposition of ribose and other sugars: implications for chemical evolution. Proceedings of the National Academy of Sciences USA, 92:8158–60, 1995. 80. Shapiro, R. Prebiotic ribose synthesis: a critical analysis. Origins of Life and Evolution of the Biosphere, 18:71–85, 1988. 81. Joyce, G., A. Schwartz, S. Miller and L. Orgel. The case for an ancestral genetic system involving simple analogs of the nucleotides. Proceedings of the National Academy of Sciences USA, 84(13):4398–402, 1987. 82. Schlesinger, G. and S. Miller. Equilibrium and kinetics of glyconitrile formation in aqueous solution. Journal of the American Chemical Society, 95(11):3729–35, 1973. 83. Arrhenius, T., G. Arrhenius and W. Paplawsky. Archean geochemistry of formaldehyde and cyanide and the oligomerization of cyanohydrin. Origins of Life and Evolution of the Biosphere, 24(1):1–17, 1994. 84. Cooper, G., N. Kimmich, W. Belisle, J. Sarinana, K. Brabham and L. Garrel. Carbonaceous meteorites as a source of sugar-related organic compounds for the early Earth. Nature, 414(6866):879–83, 2001. 85. Van der Velden, W. and A. Schwartz. Search for purines and pyrimidines in the Murchison meteorite. Geochimica et Cosmochimica Acta, 41(7):961–8, 1977. 86. McCollom, T., G. Ritter and B. Simoneit. Lipid synthesis under hydrothermal conditions by Fischer-Tropsch-type reactions. Origins of Life and Evolution of the Biosphere, 29(2):153–66, 1999. 87. Hargreaves, W., S. Mulvihill and D. Deamer. Synthesis of phospholipids and membranes in prebiotic conditions. Nature, 266(5597):78–80, 1977. 88. Ourisson, G. and Y. Nakatani.
The terpenoid theory of the origin of cellular life: the evolution of terpenoids to cholesterol. Chemistry and Biology, 1(1):11–23, 1994. 89. Yuen, G. and K. Kvenvolden. Monocarboxylic acids in Murray and Murchison carbonaceous meteorites. Nature, 246(5431):301–3, 1973. 90. Deamer, D. Boundary structures are formed by organic components of the Murchison carbonaceous chondrite. Nature, 317(6040):792–94, 1985. 91. Allen, W. and C. Ponnamperuma. A possible prebiotic synthesis of monocarboxylic acids. Currents in Modern Biology, 1(1):24–8, 1967. 92. Dowler, M., W. Fuller, L. Orgel and R. Sanchez. Prebiotic synthesis of propiolaldehyde and nicotinamide. Science, 169(952):1320–1, 1970. 93. Cleaves, H. and S. Miller. The nicotinamide biosynthetic pathway is a by-product of the RNA World. Journal of Molecular Evolution, 52(1):73–7, 2001. 94. Miller, S. and G. Schlesinger. Prebiotic syntheses of vitamin coenzymes. II. Pantoic acid, pantothenic acid, and the composition of coenzyme A. Journal of Molecular Evolution, 36(4):308–14, 1993. 95. Miller, S. and G. Schlesinger. Prebiotic syntheses of vitamin coenzymes. I. Cysteamine and 2-mercaptoethanesulfonic acid (coenzyme M). Journal of Molecular Evolution, 36(4):302–7, 1993.
96. Keefe, A., G. Newton and S. Miller. A possible prebiotic synthesis of pantetheine, a precursor to coenzyme A. Nature, 373(6516):683–5, 1995. 97. Austin, S. and T. Waddell. Prebiotic synthesis of vitamin B6-type compounds. Origins of Life and Evolution of the Biosphere, 29(3):287–96, 1999. 98. Heinz, B., W. Ried and K. Dose. Thermal production of pteridines and flavins from amino acid mixtures. Angewandte Chemie, 91(6):510–11, 1979. 99. White, H. Coenzymes as fossils of an earlier metabolic state. Journal of Molecular Evolution, 7(2):101–4, 1976. 100. Shigemasa, Y., H. Matsumoto, Y. Sasaki, N. Ueda, R. Nakashima, K. Harada, N. Takeda, M. Suzuki and S. Saito. Formose reactions. Part 20. The selective formose reaction in dimethylformamide in the presence of vitamin B1. Journal of Carbohydrate Chemistry, 2(3):343–8, 1983. 101. Snell, E. Vitamin B6 group. V. The reversible interconversion of pyridoxal and pyridoxamine by transamination reactions. Journal of the American Chemical Society, 67:194–7, 1945. 102. Rothemund, P. New porphyrin synthesis. Synthesis of porphin. Journal of the American Chemical Society, 58:625–7, 1936. 103. Hodgson, G. and B. Baker. Porphyrin abiogenesis from pyrrole and formaldehyde under simulated geochemical conditions. Nature, 216(5110): 29–32, 1967. 104. Gesteland, R., T. Cech and J. Atkins (Eds.) The RNA World, 2nd ed. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y., 1999. 105. Fuller, W., R. Sanchez and L. Orgel. Prebiotic synthesis. VII. Solid-state synthesis of purine nucleosides. Journal of Molecular Evolution, 1(3):249–57, 1972. 106. Sanchez, R. and L. Orgel. Studies in prebiotic synthesis. V. Synthesis and photoanomerization of pyrimidine nucleosides. Journal of Molecular Biology, 47(3):531–43, 1970. 107. Ingar, A., R. Luke, B. Hayter and J. Sutherland. Synthesis of cytidine ribonucleotides by stepwise assembly of the heterocycle on a sugar phosphate. ChemBioChem, 4(6):504–7, 2003. 108. Schwartz, A. and L. Orgel. 
Template-directed synthesis of novel, nucleic acid-like structures. Science, 228:585–7, 1985. 109. Schneider, K. and S. Benner. Oligonucleotides containing flexible nucleoside analogs. Journal of the American Chemical Society, 112(1):453–5, 1990. 110. Eschenmoser, A. The TNA-family of nucleic acid systems: properties and prospects. Origins of Life and the Evolution of the Biosphere, 34(3):277–306, 2004. 111. Nelsestuen, G. Origin of life: consideration of alternatives to proteins and nucleic acids. Journal of Molecular Evolution, 15(1):59–72, 1980. 112. Tohidi, M. and L. Orgel. Some acyclic analogues of nucleotides and their template-directed reactions. Journal of Molecular Evolution, 28:367–73, 1989. 113. Cleaves, H. The reactions of nitrogen heterocycles with acrolein: scope and prebiotic significance. Astrobiology, 2(4):403–15, 2002. 114. Nielsen, P., M. Egholm, R. Berg and O. Buchardt. Sequence-selective recognition of DNA by strand displacement with a thymine-substituted polyamide. Science, 254(5037):1497–500, 1991. 115. Nelson, K., M. Levy and S. Miller. Peptide nucleic acids rather than RNA may have been the first genetic molecule. Proceedings of the National Academy of Sciences USA, 97(8):3868–71, 2000.
116. Eriksson, M., L. Christensen, J. Schmidt, G. Haaima, L. Orgel and P. Nielsen. Sequence dependent N-terminal rearrangement and degradation of peptide nucleic acid (PNA) in aqueous solution. New Journal of Chemistry, 22(10):1055–9, 1998. 117. Yamagata, Y., H. Watanabe, M. Saitoh and T. Namba. Volcanic production of polyphosphates and its relevance to prebiotic evolution. Nature, 352(6335):516–19, 1991. 118. Keefe, A. and S. Miller. Are polyphosphates or phosphate esters prebiotic reagents? Journal of Molecular Evolution, 41(6):693–702, 1995. 119. Osterberg, R. and L. Orgel. Polyphosphate and trimetaphosphate formation under potentially prebiotic conditions. Journal of Molecular Evolution, 1(3):241–8, 1972. 120. Rabinowitz, J. and A. Hampai. Influence of imidazole and hydrocyanic acid derivatives on the “possible prebiotic” polyphosphate-induced peptide synthesis in aqueous solution. Helvetica Chimica Acta, 61(5):1842–7, 1978. 121. Schwartz, A. Specific phosphorylation of the 2′- and 3′-positions in ribonucleosides. Journal of the Chemical Society D: Chemical Communications, (23):1393, 1969. 122. Lohrmann, R. and L. Orgel. Prebiotic synthesis: phosphorylation in aqueous solution. Science, 161(3836):64–6, 1968. 123. Beck, A., R. Lohrmann and L. Orgel. Phosphorylation with inorganic phosphates at moderate temperatures. Science, 157(3791):952, 1967. 124. Lohrmann, R. and L. Orgel. Urea-inorganic phosphate mixtures as prebiotic phosphorylating agents. Science, 171(3970):490–4, 1971. 125. Yamagata, Y., H. Inoue and K. Inomata. Specific effect of magnesium ion on 2′,3′-cyclic AMP synthesis from adenosine and trimetaphosphate in aqueous solution. Origins of Life and Evolution of the Biosphere, 25(1–3):47–52, 1995. 126. Corliss, J., J. Dymond, L. Gordon, J. Edmond, R. von Herzen, R. Ballard, K. Green, D. Williams, A. Bainbridge, K. Crane and T. van Andel. Submarine thermal springs on the Galapagos Rift. Science, 203:1073–83, 1979. 127. Edmond, J., K. Von Damm, R.
McDuff and C. Measures. Chemistry of hot springs on the East Pacific Rise and their effluent dispersal. Nature, 297:187–91, 1982. 128. Corliss, J., J. Baross and S. Hoffman. An hypothesis concerning the relationship between submarine hot springs and the origin of life on Earth. Oceanologica Acta, 4 Suppl., 59–69, 1981. 129. Wächtershäuser, G. Before enzymes and templates: theory of surface metabolism. Microbiological Reviews, 52:452–84, 1988. 130. Wächtershäuser, G. Groundworks for an evolutionary biochemistry: the iron-sulphur world. Progress in Biophysics and Molecular Biology, 58:85–201, 1992. 131. Keefe, A., S. Miller, G. McDonald and J. Bada. Investigation of the prebiotic synthesis of amino acids and RNA bases from CO2 using FeS/H2S as a reducing agent. Proceedings of the National Academy of Sciences USA, 92:11904–6, 1995. 132. Maden, B. No soup for starters? Autotrophy and origins of metabolism. Trends in Biochemical Sciences, 20:337–41, 1995. 133. Huber, C. and G. Wächtershäuser. Peptides by activation of amino acids with CO on (Ni, Fe)S surfaces and implications for the origin of life. Science, 281:670–2, 1998.
134. Orgel, L. The origin of life—a review of facts and speculations. Trends in Biochemical Sciences, 23:491–5, 1998. 135. Rode, B. Peptides and the origin of life. Peptides, 20:773–86, 1999. 136. Sowerby, S., C. Morth and N. Holm. Effect of temperature on the adsorption of adenine. Astrobiology, 1(4):481–7, 2001. 137. Holm, N. and E. Andersson. Hydrothermal systems. In Brack, A. (Ed.), The Molecular Origins of Life: Assembling the Pieces of the Puzzle (pp. 86–99). Cambridge University Press, Cambridge, 1998. 138. Harvey, R. Enzymes of thermal algae. Science, 60:481–2, 1924. 139. Garrett, E. and J. Tsau. Solvolyses of cytosine and cytidine. Journal of Pharmaceutical Sciences, 61(7):1052–61, 1972. 140. Shapiro, R. The prebiotic role of adenine: a critical analysis. Origins of Life and Evolution of the Biosphere, 25:83–98, 1995. 141. Vallentyne, J. Biogeochemistry of organic matter. II. Thermal reaction kinetics and transformation products of amino compounds. Geochimica et Cosmochimica Acta, 28:157–88, 1964. 142. White, R. Hydrolytic stability of biomolecules at high temperatures and its implication for life at 250 °C. Nature, 310(5976):430–2, 1984. 143. Cleaves, H. and J. Chalmers. Extremophiles may be irrelevant to the origin of life. Astrobiology, 4(1):1–9, 2004. 144. Oró, J. Comets and the formation of biochemical compounds on the primitive Earth. Nature, 190:442–3, 1961. 145. Anders, E. Pre-biotic organic matter from comets and asteroids. Nature, 342:255–7, 1989. 146. Chyba, C. Impact delivery and erosion of planetary oceans in the early inner Solar System. Nature, 343:129–33, 1990. 147. Chyba, C. and C. Sagan. Endogenous production, exogenous delivery and impact-shock synthesis of organic molecules: an inventory for the origins of life. Nature, 355:125–32, 1992. 148. Kasting, J. Earth’s early atmosphere. Science, 259:920–6, 1993. 149. Zhao, M. and J. Bada. Extraterrestrial amino acids in Cretaceous/Tertiary boundary sediments at Stevns Klint, Denmark. 
Nature, 339:463–5, 1989. 150. Glavin, D. and J. Bada. Survival of amino acids in micrometeorites during atmospheric entry. Astrobiology, 1(3):259–69, 2001. 151. Oró, J. and A. Lazcano. Comets and the origin and evolution of life. In Thomas, P. J., Chyba, C. F., and McKay, C. P. (Eds.), Comets and the Origin and Evolution of Life (pp. 3–27). Springer, New York, 1997. 152. Love, S. and D. Brownlee. A direct measurement of the terrestrial accretion rate of cosmic dust. Science, 262:550–3, 1993. 153. Maurette, M. Micrometeorites on the early Earth. In Brack, A. (Ed.), The Molecular Origins of Life: Assembling the Pieces of the Puzzle (pp. 147–86). Cambridge University Press, Cambridge, 1998. 154. Gibson, E. Volatiles in interplanetary dust particles: a review. Journal of Geophysical Research, 97:3865–75, 1992. 155. Clemett, S., C. Maechling, R. Zare, P. Swan and R. Walker. Identification of complex aromatic molecules in individual interplanetary dust particles. Science, 262:721–5, 1993. 156. Mukhin, L., M. Gerasimov and E. Safonova. Origin of precursors of organic molecules during evaporation of meteorites and mafic terrestrial rocks. Nature, 340:46–8, 1989.
157. Chyba, C., P. Thomas, L. Brookshaw and C. Sagan. Cometary delivery of organic molecules to the early Earth. Science, 249:366–73, 1990. 158. Wolman, Y., W. Haverland and S. Miller. Nonprotein amino acids from spark discharges and their comparison with the Murchison meteorite amino acids. Proceedings of the National Academy of Sciences USA, 69(4):809–11, 1972.
2 Prebiotic Evolution and the Origin of Life: Is a System-Level Understanding Feasible? Antonio Lazcano
Systems biology attempts to explain the structure and operation of complex biological processes beyond the mere description of their individual components and their interplay. This rapidly evolving field, which endeavors to overcome the limitations inherent in the reductionist approaches found in many areas of biological research, involves multidisciplinary programs that attempt to develop predictive computer models based on mathematical descriptions of biological network components and their integrated, higher-level interactions [1]. Although still in its infancy, a systems biology approach that combines computational science with developing areas such as genomics and proteomics may provide important insights into the inner workings and robustness of biological systems, including cell signaling and regulatory networks, with foreseeable practical applications [2]. Can such a holistic approach contribute to our understanding of prebiotic evolution and the appearance of life? Addressing these issues from an evolutionary perspective requires the recognition that no unbridgeable gaps existed between the nonliving and the living, but how life first appeared on Earth remains a major unresolved question. The remarkable coincidence between the monomeric constituents of living organisms and those synthesized in laboratory simulations of the prebiotic environment (Miller and Cleaves, chapter 1, this volume) appears too striking to be fortuitous, but at present the hiatus between the primitive soup and the RNA world, that is, the evolutionary stage prior to the development of proteins and DNA genomes during which early life forms largely based on ribozymes may have existed (figure 2.1), is discouragingly enormous. The issue is further complicated by the lack of an all-embracing, generally agreed-upon definition of life.
Although it is generally accepted that any explanation of the origin of living systems should attempt, at least implicitly, to define a set of minimal criteria for what constitutes a living organism, this has proven to be an elusive
Figure 2.1 The hypothesis of a protein-free RNA world assumed that life began with systems based on catalytic and replicative RNA molecules. According to this proposal, ribosome-mediated protein synthesis evolved in an RNA world, while DNA is considered a latecomer in cellular evolution. Alternative sequences have been discussed [17]. The problems associated with the prebiotic synthesis of RNA suggest the prior existence of pre-RNA protobiological world(s). The shaded borders depicted in the diagram define the systems, and do not necessarily imply that membranes were present from the very start. See text for details.
intellectual endeavor. The lack of such a definition means that what is meant by the origin of life is sometimes described in imprecise terms, and that several entirely different questions are often conflated. For instance, until a few years ago the origin of the genetic code was considered synonymous with the appearance of life itself. This is no longer a dominant point of view: four of the
Prebiotic Evolution and the Origin of Life
central reactions involved in protein biosynthesis are catalyzed by ribozymes, and their complementary nature suggests that they may have first appeared in an RNA world [3]; that is, ribosome-catalyzed, nucleic acid-coded protein synthesis is the outcome of natural selection of RNA-based biological systems, and not of mere physicochemical interactions that took place in the prebiotic environment [4]. Despite the seemingly insurmountable obstacles surrounding the understanding of the origin of life, or perhaps because of them, there has been no shortage of discussion about how it took place. It is reasonable to assume that the first living organisms were simpler than any cell now alive, but their attributes are unknown. It is possible, for instance, that the only one of their traits preserved in their extant descendants is the replication of genetic polymers based on complementary nucleobase pairing [5]. It is therefore not surprising that an inventory of current views on the origin of life reveals a mixture of opposites of every kind, including the imaginative suggestion that terrestrial life did not emerge on our planet but was transferred here from another world. As reviewed recently by Anet [6], however, a major dichotomy can be recognized between those claiming that the appearance of the first life forms depended on informational oligomeric compounds (the so-called genetic approach) and those who argue that it was based on autocatalytic metabolic cycles (figure 2.2). In spite of a number of mesmerizing theoretical and experimental analogs of gene-free biological systems, life’s origin may be best understood in terms of the dynamics and evolution of sets of replicating chemical polymers endowed with heredity and thus able to evolve.
This does not imply that wriggling autocatalytic nucleic acid molecules were floating in the waters of the primitive oceans, ready to be used as primordial genes, or that the RNA world sprang fully assembled from simple precursors present in the prebiotic soup. Nor does the genetic-first approach to life’s emergence imply that the first replicating genetic polymers arose spontaneously from an unorganized prebiotic organic broth through an extremely improbable accident, or that precellular evolution was a continuous, unbroken chain of progressive transformations steadily proceeding toward the first living beings. Many prebiotic culs-de-sac and false starts probably took place, with natural selection acting on populations of primordial systems based on genetic polymers simpler than RNA. As argued here, it is reasonable to assume that the emergence of the first life forms required not the appearance of a single living molecule, but the simultaneous coordination of many different components in a confluence of processes. Those working on the emergence of life must ponder not just how replicative systems appeared, but also when and how they became encapsulated, how protein synthesis and the genetic code first evolved, and how metabolic pathways developed as a result of
Figure 2.2 Schematic representation of the metabolism-first (a) and genetic-first (b) theories on the origin of life. In metabolism-first (a) proposals it is assumed that the emergence of life is tantamount to the self-organization of replicating biochemical networks. On the other hand, in the genetic-first (b) case it is assumed that the maintenance and reproduction of a system based on genetic polymers requires monomers and energy sources available in the primitive broth. As in figure 2.1, the shaded areas define the systems, and do not necessarily correspond to boundaries such as membranes or mineral surfaces. See text for details.
Darwinian processes. How such a sequence of events may have taken place is addressed here.

THE ORIGIN OF LIFE: THE MISSING HISTORICAL RECORDS
The traits shared by all known species are far too numerous and complex to assume that they evolved independently. Minor differences in the basic molecular processes can be distinguished among the Bacteria, Archaea, and Eukarya, but all known organisms share the same genetic code and the same essential features of genome replication,
gene expression, basic anabolic reactions, and energy production and utilization. The molecular details of these universal features provide direct evidence of the monophyletic origin of all living beings. As Charles Darwin wrote in the Origin of Species, “all the organic beings which have ever lived on this Earth may be descended from some primordial form.” Although the placement of the root of universal phylogenetic trees is a matter of debate, the development of molecular cladistics has shown that despite their overwhelming diversity and tremendous differences, all organisms are ultimately related and descend from Darwin’s primordial ancestor. What was the nature of this progenitor? How and when did it come into being? As in other areas of evolutionary biology, answers to such questions can only be regarded as tentative and explanatory rather than definitive and conclusive. This does not imply that all origin of life theories and explanations can be dismissed as pure speculation, but rather that the issue should be addressed conjecturally, in an attempt to construct not a mere chronology but a coherent historical narrative by weaving together a large number of miscellaneous observational findings and experimental results [7]. Unfortunately, this goal has been seriously hindered by the incompleteness of the geological and paleontological records. Archean paleobiology is plagued by debates that demonstrate how difficult it is to ascertain the earliest traces of life. Nonetheless, life appears to be an ancient phenomenon, and the relatively short time scale required for the origin and rapid diversification of microbial life on Earth [8] suggests that the critical factor may have been the presence of liquid water, which began to accumulate as soon as the planet’s surface cooled down [9].
However, there is no direct evidence of the environmental conditions on the Earth at the time of the emergence of life, nor any fossil record of the predecessors of the first cells. Direct information is lacking not only on the composition of the terrestrial atmosphere during the period of the origin of life, but also on the temperature, ocean pH values, and other general and local environmental conditions that may or may not have been important for the origin of living systems. Hence, considerable caution should be applied when attempting straight-line extrapolations back in time to the origin of life. Geological evidence for the existence of a continental crust and bodies of liquid water on the Earth 4.4 × 10⁹ years ago implies that the terrestrial surface cooled down rapidly [10], although there are strong indications that the planet was impacted by bodies large enough to evaporate the oceans as late as 3.8 × 10⁹ years ago [11]. At first glance, large-scale analysis of the thermal history of the Earth appears to imply a hyperthermophilic origin of life [12,13], a possibility that would appear to be supported by the position of hyperthermophilic organisms in rooted universal
rRNA-based phylogenies, where they occupy the deepest, shortest branches [14]. How valid are such extrapolations? The awareness that genes and genomes are extraordinarily rich historical documents from which a wealth of evolutionary information can be retrieved has widened the range of phylogenetic studies to previously unsuspected heights (and depths). However, it is unlikely that they can be extended back to the origin of life itself. The development of efficient nucleic acid sequencing techniques, which now allows the detailed analysis of complete cellular genomes, combined with the simultaneous and independent blossoming of computer science, has led not only to an explosive growth of databases and new sophisticated tools for their exploitation, but also to the recognition that different macromolecules may be uniquely suited as molecular chronometers in the construction of nearly universal phylogenies. Caution must nonetheless be exercised in extrapolating deep molecular phylogenies back into primordial times. Organisms located at the base of universal phylogenies are cladistically ancient species, not primitive unmodified microbes. They are not endowed with a rudimentary molecular genetic apparatus, nor do they appear to be more primitive in their metabolic abilities than, for instance, their aerobic counterparts. Primordial living systems would refer initially to pre-RNA worlds, in which life may have been based on polymers with backbones other than ribose phosphate and possibly bases different from adenine, uracil, guanine, and cytosine [15], followed by a stage in which life was based on RNA as both genetic material and catalyst [16,17]. Genome sequencing and analysis provide clues to some very early stages of biological evolution, but it is difficult to see how their applicability can be extended beyond a threshold that corresponds to a period of cellular evolution in which protein biosynthesis was already in operation, that is, an RNA/protein world [18].
Older stages are not yet amenable to molecular phylogenetic analysis. The most basic questions pertaining to the origin of life relate to much simpler replicating entities that predate, by a long series of evolutionary events, the oldest recognizable heat-loving prokaryotes represented in molecular phylogenies. A cladistic approach to the origin of life itself is not feasible, since all possible intermediates that may once have existed have long since vanished.

CHEMICAL EVOLUTION AND THE HETEROTROPHIC ORIGIN OF LIFE
Not surprisingly, the idea that living organisms were the historical outcome of gradual transformation of lifeless matter became widespread soon after the publication of Darwin’s The Origin of Species. This view is epitomized by Ernst Haeckel’s nineteenth-century suggestion that simple physical laws had led to nonnucleated structureless photosynthetic
primordial organisms from nonliving matter. Haeckel’s views were followed by manifold alternatives, including Pflüger’s proposal that the first proteins were descendants of hydrogen cyanide, Svante Arrhenius’s suggestion that terrestrial life came from outer space, Leonard Troland’s hypothesis that the first form of life had been a primordial, replicative enzyme formed by chance events in the primitive ocean, Alfonso L. Herrera’s sulfocyanic theory on the origin of cells, Harvey’s 1924 suggestion of a heterotrophic origin in a high-temperature environment, and the provocative 1926 paper that Hermann J. Muller wrote on the abrupt, random formation of a single, mutable gene endowed with catalytic and autoreplicative properties [19]. Most of these proposals went unnoticed, in part because they were incomplete, speculative schemes largely devoid of direct evidence and not amenable to fruitful experimental testing. Although some of these hypotheses attempted to understand the origin of life by introducing principles of historical explanation, the dominant view was that the first organisms had been photosynthetic microbes endowed from the very beginning with the ability to fix atmospheric CO2 and to use it as a source of carbon for the synthesis of organic compounds. A major scientific breakthrough occurred, however, when Oparin [20,21] suggested a heterotrophic origin of life, which assumed that, prior to the emergence of the first cells, prebiotic syntheses of organic compounds had led to the formation of the so-called primitive broth. Oparin’s proposal was sustained not only by the evidence of organic compounds in meteorites, but also by the striking nineteenth-century experimental demonstrations that biochemical compounds such as urea, alanine, and sugars could be formed under laboratory conditions, as had been shown by Wöhler, Strecker, and Butlerow, respectively [22].
Oparin’s hypothesis, which was based on his orthodox Darwinian conviction that evolution proceeds gradually and slowly from the simple to the complex, stood in sharp contrast to the then prevalent idea of an autotrophic origin of life. Oparin argued that since a heterotrophic anaerobe is metabolically simpler than an autotrophic one, the former would necessarily have evolved first. Based on the simplicity and ubiquity of fermentative reactions, he suggested that the first organisms must have been anaerobic heterotrophs that had resulted from the evolution of colloidal systems such as the coacervates formed in the primitive soup. The hypothesis of chemical evolution is supported not only by a number of laboratory simulations of the primitive environment, but also by a wide range of astronomical observations and by the analysis of samples of extraterrestrial material. These include the existence of organic molecules of potential prebiotic significance in interstellar clouds and cometary nuclei, and of small molecules of considerable biochemical importance that are present in carbonaceous chondrites.
The copious array of amino acids, carboxylic acids, purines, pyrimidines, hydrocarbons, and other molecules that have been found in the 4.5 × 10⁹-year-old Murchison meteorite and other carbonaceous chondrites gives considerable credibility to the idea that comparable syntheses took place on the primitive Earth. The correlation between the compounds that are produced in prebiotic simulations and those found in carbonaceous meteorites is rather stunning, and strongly supports the contention that such molecules were part of the chemical environment from which life evolved [23]. Like those of many of his contemporaries, Oparin’s ideas on the definition of life included enzyme-based assimilation, growth, and reproduction, but not nucleic acids, whose genetic role was not even suspected. Biological inheritance was assumed by Oparin to be the outcome of the growth and division of the coacervate drops that he regarded as models of precellular systems. Oparin’s refusal to assume that nucleic acids had played a unique role in the origin of life resulted not only from his unwillingness to accept that life could be reduced to a single compound such as the “living DNA molecule” advocated by Muller and others [19], but also, within the framework of Cold War politics, from his complex relationship with Lysenko and his long association with the Soviet establishment [19,24,25]. As shown by his work with RNA-containing coacervates and his acceptance, based on the suggestions of Belozerskii [26], Brachet [27], and others, that RNA could have preceded DNA as genetic material, Oparin [28] eventually acknowledged the role of nucleic acids in the origin of life, and assumed that protein synthesis was the evolutionary outcome of the interaction of primordial polypeptides and polynucleotides within the boundaries of precellular systems [29].
The heterotrophic origin theory did not develop in blissful isolation from the rapid developments in genetics and molecular biology that characterized the second half of the twentieth century. The replication-first approach to the origin of life does not require the prior existence of a primitive soup [6,30]. However, a weak link between prebiotic chemistry and molecular biology began to develop a few years after the publication in 1953 of the Miller experiment [31] and the Watson and Crick DNA double-helix model [32], when Oró [33] demonstrated the remarkable ease with which adenine, one of the nucleobases of DNA and RNA, could be produced through the oligomerization of HCN. The evidence of the prebiotic availability of this nucleic acid component would eventually culminate in the independent suggestions of an RNA world by Woese, Orgel, and Crick [22]. Although some of Oparin’s assumptions have been challenged by our current understanding of cell biology and the basic molecular processes of living organisms, the epistemologically open character of his hypothesis has allowed the recognition of the essential role of
primordial genetic polymers without destroying the theory’s overall structure and premises. An updated version of this view would assume that the raw material for assembling the first self-maintaining, replicative chemical systems was the outcome of abiotic synthesis. Thus, even if they were endowed with minimal synthetic abilities, primordial life forms depended primarily on prebiotically synthesized organic compounds, while the energy required to drive the chemical reactions involved in their growth and reproduction could have been provided by cyanamide, thioesters, glycine nitrile, or other high-energy compounds (figure 2.2). Considerable efforts have been devoted to understanding the chemical processes that may have preceded the ubiquitous DNA-based genetic machinery of extant living systems. Simple organic compounds dissolved in the primitive oceans or other bodies of water would need to be concentrated by some mechanism. As shown by numerous experiments, clays, metal cations, imidazole derivatives, and highly reactive derivatives of HCN (such as cyanamide, dicyanamide, and cyanogen) may have catalyzed polymerization reactions [34]. Selective adsorption of molecules onto mineral surfaces could have promoted their polymerization, as suggested by laboratory simulations using a variety of simple compounds and activated monomers [35]. It is unlikely that the activated nucleosides used in these experiments were available in the prebiotic environment. However, the montmorillonite-promoted polymerization of activated adenosine and uridine derivatives to yield 25- to 50-mer oligonucleotides substantially enriched in 3′-5′ phosphodiester bonds [36] suggests that chemically active mineral surfaces could have played dual roles as both adsorbents and catalysts in the prebiotic environment [5].
Since adsorption onto surfaces involves weak noncovalent van der Waals interactions, the mineral-based concentration process and subsequent polymerization would be most efficient at cool temperatures [37,38]. However, as the length of polymers formed on mineral surfaces increases, they would become more firmly bound to the mineral [39]. In order for these polymers to be involved in subsequent interactions with other polymers or monomers, they would need to be released. This could have been accomplished by warming the mineral, although this would also tend to hydrolyze the adsorbed polymers, or by concentrated salt solutions [35], a process that could take place in tidal regions during evaporation or freezing of seawater and that would have led to the release of polymers [4]. Direct concentration of dilute solutions of monomers could also be accomplished by evaporation and by eutectic freezing of dilute aqueous solutions. The evaporation of tidal regions and the subsequent concentration of their organic constituents have also been invoked in the synthesis of a variety of simple organic molecules [36]. Salty brines
may have also been important in the formation of peptides and perhaps other important biopolymers as well. As reviewed elsewhere [4], the salt-induced peptide formation reaction may provide an abiotic route for the formation of peptides directly from amino acids in concentrated NaCl solutions containing copper [40]. Clay minerals such as montmorillonite apparently promote the reaction, which could have taken place in evaporating tidal pools, where the required concentrated salty brines would have been readily available. It has also been shown that the freezing of dilute solutions of activated amino acids at −20 °C yields peptides at higher yields than in experiments with highly concentrated solutions at 0 and 25 °C [37], and there is experimental evidence that eutectic freezing is especially effective in the nonenzymatic synthesis of oligonucleotides [41]. For obvious methodological reasons, experimental simulations of prebiotic events tend to concentrate on the empirical analysis of single variables. However, it is reasonable to assume that the association and interplay of different biochemical monomers and oligomers in more complex experimental settings would lead to physicochemical properties not exhibited by their isolated components. This is not purely speculative: it has been documented in the laboratory [42] that interactions between liposomes and different water-soluble polypeptides lead to major changes in the morphology and permeability of liposomes of phosphatidyl-L-serine, and to a transition of poly-L-lysine from a random coil into an α-helix that exhibits hydrophobic bonding with the lipid phase. Additional examples include experimental models of compartmentalized catalytic RNA [43,44] which, although they do not necessarily correspond to particular stages in the origin of life [43], nonetheless illustrate how the individual components of a system dynamically interact and lead to unexpected new properties.
The current evidence suggests that the inventory of catalytic agents in the prebiotic environment was considerable. The list includes metallic cations, imidazole derivatives, and minerals like clays and pyrite, as well as chemically active small peptides and oligopeptides, as shown by the biologically active random copolymers of glutamate and phenylalanine [45,46]. The mere coexistence of such catalytic species and genetic polymers would not have led directly to the establishment of a genetic code, but the interaction of manifold components could have driven a coordinated interaction between early genomes and compartment boundaries [47]. If self-sustaining chemical reaction chains, whether in solution or encapsulated, arose on the primitive Earth, they could have favored the appearance of replicating genetic polymers capable of undergoing Darwinian evolution by enriching the prebiotic soup with components not readily synthesized by other abiotic mechanisms [48–50].
It is easy to envision that as polymerized molecules became larger and more complex, some of them began to fold into configurations that could bind and interact with other molecules, expanding the list of primitive catalysts that could promote nonenzymatic reactions. Some of these catalytic reactions, especially those involving hydrogen bond formation, may have assisted in making polymerization more efficient. As the variety of polymeric combinations increased, some compounds may have developed the ability to catalyze their own imperfect self-replication and that of related molecules. This could have marked the first molecular entities capable of multiplication, heredity, and variation, and thus the origin of both life and Darwinian evolution [4]. This scheme is necessarily speculative, but its intrinsic heuristic value should not be underestimated. It is very unlikely, however, that the RNA world would have arisen from such processes. Minor amounts of ribose are formed in the formose reaction, but ribose is a very labile molecule. Lead hydroxide [51] and calcium borate [52] are known to stabilize ribose and other pentoses, and there is recent evidence that cyanamide, which is a likely prebiotic compound, reacts with the open-chain aldehyde form of ribose, yielding a stable bicyclic adduct [53]. However, RNA itself is very unstable because of the presence of the 2′-hydroxyl group of ribose in its phosphodiester backbone. Furthermore, the high number of possible random combinations of derivatives of nucleobases, sugars, and phosphate that may have been present in the prebiotic soup makes it unlikely that an RNA molecule capable of catalyzing its own self-replication arose spontaneously. These difficulties have led to the suggestion that the RNA world was not a direct outcome of prebiotic evolution, but may have been the evolutionary outcome of predecessor primordial living systems belonging to what are now referred to as pre-RNA worlds.
The chemical nature of the first genetic polymers and of the catalytic agents of the hypothetical pre-RNA worlds that may have bridged the gap between the prebiotic broth and the RNA world is completely unknown and can only be surmised. There is evidence of numerous double-stranded polymeric structures with backbones quite different from those of nucleic acids, but held together by Watson–Crick base pairing. Modified nucleic acid backbones have been synthesized which either incorporate a different version of ribose or lack it altogether. Peptide nucleic acids (PNAs), which are free of phosphate, have a polypeptide-like backbone of 2-aminoethyl glycine, a nonchiral amino acid to which the nucleobases are attached through an acetic acid linkage [54]. In aqueous solutions, pairs of complementary PNA oligomers form Watson–Crick base-paired double helices as well as stable heteroduplexes with RNA, and have been shown to catalyze the oligomerization of activated nucleotides as well as the ligation of complementary oligonucleotides.
Alternating peptides based on simple D- and L-α-amino acids that form stable antiparallel double helices held together by Watson–Crick hydrogen-bonded pairing have been described [55], and these have considerable appeal from an evolutionary perspective [5]. Experiments on nucleic acid analogs with hexoses instead of pentoses, and with pyranoses instead of furanoses [49], suggest that even when restricted to sugar phosphate backbones a wide variety of informational polymers is possible. A simpler analog of RNA based on a four-carbon sugar, that is, a tetrose, is threose nucleic acid (TNA), composed of α-threofuranosyl (3′,2′)-oligonucleotides. TNA resembles RNA more closely than other proposed models of genetic polymers, exhibits efficient base pairing, and can efficiently cross-pair with DNA and RNA [56]. No convincing model of the evolutionary precursor(s) of RNA is available, but the possibility of pre-RNA protobiological worlds is supported by the facile assembly of the four-carbon tetrose component of TNA, which can be achieved under plausible prebiotic conditions by the reaction of two glycolaldehyde molecules [57].

WHICH WAY TO LIFE?
Although the updated heterotrophic theory has been recognized as a valuable framework for addressing the origin of life, the metabolism-first approach has been renewed in manifold ways. There are several different and even opposing theories suggesting that the first living systems were self-assembled complex biochemical networks lacking genetic polymers. The list includes, among others, (i) sets of loosely self-organized cycles of replicating peptides; (ii) replicating primordial lipid worlds; and (iii) autocatalytic autotrophic metabolic networks, associated with minerals or operating within vesicles. Some of these proposals have been embellished as sets of complex mathematical formulae [6,30,58,59]. One example is the influential dual-origin model of life proposed by Dyson [60,61], in which the assumption that Darwinian selection plays no role is accompanied by the idea that metabolism and homeostasis have a greater biological significance than replication. Dyson’s proposal is based on a symbiotic association between populations of inaccurately replicating polymers and a hypothetical self-maintaining metabolic system involving catalytic oligopeptides and amino acids within permeable membrane-bounded systems. According to Dyson, heredity depends not only on gene replication, but also on a succession of steady states of dynamic systems, a mechanism for which there is little, if any, biological evidence. As shown by Lifson [58] and Anet [6], Dyson’s model is weakened not only by the lack of experimental verification, but also by a number of unrealistic simplifying assumptions and omissions, including
the absence of evidence for the hypothetical highly discriminating but undefined catalysts required by his proposal. Kauffman [62] has attempted to provide evidence that life is an inexorable emergent collective property of complex autocatalytic systems formed by molecules whose possibilities for self-organization were not hindered by the lack of a genetic polymer. Since this metabolic-first theory does not take into account the specific properties of individual organic compounds and polymers (such as the base pairing of AU and GC), it provides few, if any, guidelines for experimental approaches that could demonstrate the sudden emergence of biological order from chaotic systems. Nonetheless, Kauffman has argued that the properties of genetic systems can be deduced as general properties of complex systems, although in practice his model depends on specific assumptions that describe how the system is affected by its components. Detailed analysis of his model reveals internal flaws and unacknowledged inconsistencies, including unrealistic assumptions on the variable concentrations of amino acids or on the constancy of the probability of a given reaction. Such flaws weaken the claims that life “crystallized” out of random collections of catalytic polymers [6,58]. The alternative lipid world scenario proposed by Segre et al. [63] faces similar problems. Based in part on Dyson’s [60,61] ideas, it relies on the prebiotic availability of membrane-forming amphiphiles, and assumes that mutual catalysis among lipid-like molecules led to the growth and cleavage of noncovalent protocellular assemblies displaying lifelike properties. In this proposal the origin of life is tantamount to the origin of a defined, organized spatial structure that can replicate as a whole without having a genome. There is indeed evidence that synthetic vesicles formed by caprylic acid, containing lithium hydroxide and stabilized by an octanoic acid derivative, catalyze the hydrolysis of ethyl caprylate.
The resulting caprylic acid is incorporated into the micelle walls, leading to their growth and, eventually, to their fragmentation [64,65]. Although Segre et al. [63] conclude that reciprocal catalysis among prebiotic amphiphilic molecules could have led to vesicle growth and multiplication, their calculations seem to pay little attention to the chemical properties of the components of their lipid world [6,61]. The hypothetical ensemble replicator advocated by Segre et al. [63] corresponds to what Kauffman [62] has named a “reflexively autocatalytic system,” lacking direct template replication or copying. Accordingly, for Segre et al. [63] the vesicles in their lipid world model are endowed with what they term “compositional inheritance,” a mechanism reminiscent of Oparin’s interpretation of the growth and division exhibited by coacervate droplets [21]. It is unlikely, however, that such genome-free inheritance played a role in the origins of life.
As argued forcefully by Szathmary [66], Anet [6], Pross [59], and others, hypothetical replicator networks such as those advocated by Dyson [60,61], Kauffman [62], and Segre et al. [63], even if they can renew themselves and maintain a given dynamic but stable regime, are in fact phenotypic replicators with limited heredity. Phenotypic replicators have the ability to pass on only some aspects of their phenotypes, not their genotypic components (if any). To what extent the limited heredity advocated by Kauffman [62], Segre et al. [63], and others would have been sufficient to assist the self-organization of the sequences of disparate reactions required by metabolic-first theories is an open question, but the prospects are not encouraging. The only known example of an autocatalytic synthesis is the formose reaction [67], which proceeds in a series of stages through glycolaldehyde, glyceraldehyde, dihydroxyacetone, four-carbon sugars, and five-carbon sugars to give finally hexoses, that is, six-carbon sugars, including biologically important carbohydrates such as glucose and fructose. The formose reaction proceeds without biological catalysts, but all biochemical networks, such as the reverse Krebs cycle, require the presence of enzymes, which, it is worth remembering, are not produced by the cycles but encoded by preexisting genes. There are no empirical indications that the autocatalytic metabolic-first schemes that have been proposed could have self-assembled in the prebiotic environment [6]. Many complex chemical changes must have occurred on the primitive Earth, but do these abiotic processes qualify as metabolism at its simplest manifestation? Not necessarily. There have been several attempts to make direct inferences about prebiotic chemistry from extant biochemistry [68–70].
It is true that some chemical intermediates in prebiotic syntheses and in nonenzymatic degradative processes are identical (or at least similar) to those produced by metabolic pathways [71]. This is the case for the alkaline degradation of glucose-6-phosphate [72] and the clay-mediated deamination of adenine into hypoxanthine [73], as well as for the prebiotic synthesis of (a) 4-aminoimidazole-5-carboxamide (a key intermediate in the abiotic formation of guanine and hypoxanthine), which results from the hydrolysis of 4-aminoimidazole-5-carbonitrile or the corresponding carboxamidine, and which is an intermediate, as a riboside, in the biosynthesis of purines [74]; (b) the photodehydrogenation of dihydroorotate, which yields orotic acid in a reaction comparable to the NAD-dependent, dihydroorotate dehydrogenase-mediated step in pyrimidine biosynthesis [75]; and (c) uracil formation via the nonenzymatic photochemical decarboxylation of orotic acid [76]. These similarities, however, do not necessarily indicate an evolutionary continuity between prebiotic chemistry and biochemical pathways, but may reflect chemical determinism. These processes are
Prebiotic Evolution and the Origin of Life
71
similar because they may be the unique way in which the reactions in question can take place. Wächtershäuser [68] and Morowitz [70] have also argued that the first living systems were genome-free primitive metabolic networks formed by self-sustaining reactions based on monomeric organic compounds made directly from CO2 via the reductive citric acid cycle. This cycle, also known as the reverse Krebs cycle, is a mode of carbon fixation first described in the photosynthetic green sulfur bacterium Chlorobium limicola. Molecular phylogenetic trees show that this autotrophic carbon-fixation pathway and its variants (such as the reductive acetyl-CoA or the reductive malonyl-CoA pathways) are found in anaerobic archaea and the most deeply divergent eubacteria, which has been interpreted by some as evidence of its primitive character [77]. Based on the hypothesis that core metabolic processes have not changed since the emergence of life, Morowitz [70] has argued that intermediary metabolism recapitulates prebiotic chemistry. He maintains that the basic traits of metabolism could only evolve after the closure of an amphiphilic bilayer membrane into a vesicle, that is, that the appearance of membranes represents the discrete transition from nonlife to life. According to his hypothesis, reverse Krebs cycle-dependent life appeared with “minimal protocells” formed by bilayer vesicles made up of small amphiphiles and endowed with pigments capable of absorbing radiant energy, which was stored as a chemiosmotic proton gradient across the membrane. Production of new amphiphiles using this chemiosmotic energy reservoir would lead to spontaneous fission of the vesicles. Wächtershäuser’s [68] model does not depend on such vesicles, but on the idea that life began with the appearance of an autocatalytic, two-dimensional, primordial chemolithotrophic reductive citric acid cycle based on the formation of the highly insoluble mineral pyrite.
Wächtershäuser suggested that the synthesis and polymerization of organic compounds took place on the surface of FeS and FeS2 in extremely reducing volcanic settings resembling those of deep-sea hydrothermal vents. Moreover, he has argued that the organic compounds formed from the reduction of CO2 did not enter the aqueous solution, but remained bound to the surface on which they were synthesized, evolving into a genome-free, autocatalytic, two-dimensional, pyrite-driven chemolithotrophic metabolic living system. According to this metabolic-first proposal, replication followed the appearance of nonorganismal, iron sulfide-based, chemoautotrophic two-dimensional life. As predicted by Wächtershäuser [68], the reaction FeS + H2S → FeS2 + H2 has proven to be a very favorable one. It has an irreversible, highly exergonic character with a standard free energy
change ∆G° = −9.23 kcal/mol, which corresponds to a reduction potential E° = −620 mV. This makes the FeS/H2S combination a strong reducing agent that has been shown to provide an efficient source of electrons for the reduction of organic compounds under mild conditions. Pyrite formation can produce molecular hydrogen and reduce nitrate to ammonia, acetylene to ethylene, and thioacetic acid to acetic acid, and it can drive more complex syntheses [77], including the formation of peptide bonds that results from the activation of amino acids with carbon monoxide and (Ni,Fe)S [78]. Although pyrite-mediated CO2 reduction to organic compounds has not been achieved, the fixation under plausible prebiotic conditions of carbon monoxide into activated acetic acid by a mixture of coprecipitated NiS/FeS has been reported [78]. However, in these experiments the reactions occur in an aqueous environment to which powdered pyrite has been added; they do not take place on the surface of pyrite, nor do the products form a dense monolayer of ionically bound molecules. None of the experiments that have been performed within the context of Wächtershäuser’s ideas demonstrates that enzymes and nucleic acids are the evolutionary outcome of surface-bound metabolism. In fact, these results are also compatible with a more general, modified model of the primitive soup in which pyrite formation is recognized as an important source of electrons for the reduction of organic compounds [48]. It is possible that, under certain geological conditions, the FeS/H2S combination could have reduced not only CO but also CO2 released from molten magma in deep-sea vents, leading to biochemical monomers. Peptide synthesis, for instance, could have taken place in an iron and nickel sulfide system [78] involving amino acids formed by electric discharges via a Miller-type synthesis.
If the compounds synthesized by this process do not remain bound to the pyrite surface, but drift away into the surrounding aqueous environment, then they become part of the prebiotic soup, not of a two-dimensional organism. No metabolic cycles have been demonstrated outside cells and in the absence of enzymes. As summarized by Orgel [67], theories that advocate the emergence of complex, self-organized biochemical cycles in the absence of genetic material are hindered not only by the lack of empirical evidence, but also by a number of unrealistic assumptions about the properties of minerals and other catalysts required to spontaneously organize such sets of chemical reactions. This suggests that the likelihood of self-organization of a closed biochemical network as complex as the hypothetical archaic reductive citric acid cycle is negligible. Wächtershäuser’s proposed model of a metabolism-first origin of life is quite imaginative, but it provides answers for questions that evolutionary biologists stopped asking several decades ago.
CONCLUSIONS AND PERSPECTIVES
The ease of formation of amino acids, purines, and pyrimidines in one-pot reactions supports the hypothesis that these molecules were components of the prebiotic broth (Miller and Cleaves, chapter 1, this volume). They would have been kept company by many other compounds, such as urea and carboxylic acids, sugars formed by the nonenzymatic condensation of formaldehyde, a wide variety of aliphatic and aromatic hydrocarbons, alcohols, and branched and straight fatty acids, including some that are membrane-forming compounds. These results suggest that the prebiotic soup must have been a bewildering organic chemical wonderland, but it could not have included all the compounds or molecular structures found today in even the most ancient extant forms of life—nor did the first cells spring completely assembled, like Frankenstein’s monster, from simple precursors present in the primitive oceans. According to the contemporary heterotrophic theory, the first living entities did not arise suddenly by chance from the primitive soup, but were the outcome of a long (though not necessarily slow) evolutionary series of natural chemical processes. Contemporary metabolic-first theories of the origin of life have promised much but delivered little. Evidence of metabolic replication would indeed be exciting—if it could be established. To prove otherwise would require some rewriting of major facts of biology, but the old genome-free description of central biological processes that came apart with the discoveries of molecular genetics shows very few signs of putting itself back together. As of today, theoretical models of self-organized complex metabolic systems have not led to radical changes in current concepts of heredity and evolution, nor have they provided manageable descriptions of the origin of life.
In some cases invocations of spontaneous generation appear to be lurking behind appeals to undefined “emergent properties” or “self-organizing principles” [79] that are used as the basis for what many life scientists see as grand, sweeping generalizations with little relationship to actual biological phenomena. In spite of many published speculations, everything in biology indicates that life could not have evolved in the absence of an intracellular genetic apparatus able to store, express, and, upon replication, transmit to its progeny information capable of undergoing evolutionary change. How such genetic machinery first evolved is a central question in origin-of-life studies. The pre-RNA world hypothesis does not imply that genetic polymers could only evolve from simpler genetic polymers in a never-ending succession of genetic takeovers, but rather indicates the need to synthesize, under plausible prebiotic conditions, very simple monomers and genetic polymers that could serve as laboratory models of the possible evolutionary precursors of RNA. The assumption
that a single molecule once served both as the repository of genetic information and as a biological catalyst is not necessarily married to a reductionist approach that assumes that life can be reduced to such compounds. As argued here, the emergence of life may be best understood in terms of the dynamics and evolution of systems of chemically replicating entities endowed with genetic polymers. Whether such entities were enclosed within membranes is not yet clear, but given the prebiotic availability of amphiphilic compounds, this may well have been the case. Considerable progress has been made in understanding the emergence and early evolution of life, but major uncertainties remain. The chemistry of some prebiotic simulations is robust and supported by meteorite analyses, but the gap between these experiments and the simplest extant cell is enormous. If the current interpretation of the evolutionary significance of the properties of RNA molecules is correct, then one of the central issues that origin-of-life research must confront is the understanding of the processes that led from the primitive soup to RNA-based life forms. The search for simple organic replicating polymers will play a central role in this inquiry, even if they share few traits with nucleic acids. Some may find the idea of pre-RNA worlds unpalatable. If so, it is worth remembering what N. W. Pirie [80] wrote half a century ago: “if we found a system doing things that satisfied our requirements for life but lacking proteins, would we deny it the title?”

ACKNOWLEDGMENTS

Support from UNAM-DGAPA Proyecto PAPIIT IN 111003-3 (UNAM, Mexico) is gratefully acknowledged.
REFERENCES

1. Zvelebil, M. J. Systems biology. In J. M. Hancock and M. J. Zvelebil (Eds.), Dictionary of Bioinformatics and Computational Biology (pp. 548–9). Wiley-Liss, Hoboken, N. J., 2004.
2. Kitano, H. Systems biology, a brief overview. Science, 295:1662–4, 2002.
3. Kumar, R. K. and M. Yarus. RNA-catalyzed amino acid activation. Biochemistry, 40:6998–7004, 2001.
4. Bada, J. L. and A. Lazcano. The origin of life. In M. Ruse (Ed.), The Harvard Companion of Evolution. Harvard University Press, Cambridge, Mass., 2006, in press.
5. Orgel, L. E. Prebiotic chemistry and the origin of the RNA world. Critical Reviews in Biochemistry and Molecular Biology, 39:99–123, 2004.
6. Anet, F. A. L. The place of metabolism in the origin of life. Current Opinion in Chemical Biology, 8:654–9, 2004.
7. Kamminga, H. The origin of life on Earth, theory, history, and method. Uroboros, 1:95–110, 1991.
8. Westall, F. Life on the early Earth, a sedimentary view. Science, 308:366–7, 2005.
9. Botta, O. and J. L. Bada. The early Earth. In L. Ribas de Pouplana (Ed.), The Genetic Code and the Origin of Life (pp. 1–14). Landes Bioscience, Georgetown, Tex., 2004.
10. Wilde, S. A., J. W. Valley, W. H. Peck, et al. Evidence from detrital zircons for the existence of continental crust and oceans on the Earth 4.4 Gyr ago. Nature, 409:175–8, 2001.
11. Sleep, N. H., K. J. Zahnle, J. F. Kasting, et al. Annihilation of ecosystems by large asteroid impacts on the early Earth. Nature, 342:139–42, 1989.
12. Pace, N. R. Origin of life—facing up to the physical setting. Cell, 65:531–3, 1991.
13. Schwartzman, D. W. and C. H. Lineweaver. The hyperthermophilic origin of life revisited. Biochemical Society Transactions, 32:168–71, 2004.
14. Stetter, K. O. The lesson of archaebacteria. In S. Bengtson (Ed.), Early Life on Earth, Nobel Symposium No. 84 (pp. 114–22). Columbia University Press, New York, 1994.
15. Levy, M. and S. L. Miller. The stability of the RNA bases, implications for the origin of life. Proceedings of the National Academy of Sciences USA, 95:7933–8, 1998.
16. Joyce, G. F. The antiquity of RNA-based evolution. Nature, 418:214–21, 2002.
17. Dworkin, J. P., A. Lazcano and S. L. Miller. The roads to and from the RNA world. Journal of Theoretical Biology, 222:127–34, 2003.
18. Delaye, L., A. Becerra and A. Lazcano. The nature of the last common ancestor. In L. Ribas de Pouplana (Ed.), The Genetic Code and the Origin of Life (pp. 34–47). Landes Bioscience, Georgetown, Tex., 2004.
19. Lazcano, A. Aleksandr I. Oparin, the man and his theory. In B. F. Poglazov, B. I. Kurganov, M. S. Kritsky, et al. (Eds.), Frontiers in Physicochemical Biology and Biochemical Evolution (pp. 49–56). Bach Institute of Biochemistry and ANKO, Moscow, 1995.
20. Oparin, A. I. Proiskhozhedenie Zhizni. Moskovskii Rabochii, Moscow, 1924. Reprinted and translated in J. D. Bernal, The Origin of Life. Weidenfeld & Nicolson, London, 1967.
21. Oparin, A. I. The Origin of Life. Macmillan, New York, 1938.
22. Bada, J. L. and A. Lazcano. Prebiotic soup—revisiting the Miller experiment. Science, 300:745–6, 2003.
23. Ehrenfreund, P., W. Irvine, L. Becker, et al. Astrophysical and astrochemical insights into the origin of life. Reports on Progress in Physics, 65:1427–87, 2002.
24. Graham, L. Science and Philosophy in the Soviet Union. Knopf, New York, 1972.
25. Farley, J. The Spontaneous Generation Controversy, from Descartes to Oparin. Johns Hopkins University Press, Baltimore, 1977.
26. Belozerskii, A. N. On the species specificity of the nucleic acids of bacteria. In A. I. Oparin, A. G. Pasynskii, A. E. Braunshtein, et al. (Eds.), The Origin of Life on Earth (pp. 322–31). Pergamon Press, New York, 1959.
27. Brachet, J. Les acides nucléiques et l’origine des protéines. In A. I. Oparin, A. G. Pasynskii, A. E. Braunshtein, et al. (Eds.), The Origin of Life on Earth (pp. 361–7). Pergamon Press, New York, 1959.
28. Oparin, A. I. Life, its Nature, Origin and Development. Oliver & Boyd, Edinburgh, 1961.
29. Oparin, A. I. The appearance of life in the Universe. In C. Ponnamperuma (Ed.), Exobiology (pp. 1–15). North-Holland, Amsterdam, 1972.
30. Shapiro, R. A replicator was not involved in the origin of life. IUBMB Life, 49:173–6, 2000.
31. Miller, S. L. A production of amino acids under possible primitive Earth conditions. Science, 117:528, 1953.
32. Watson, J. D. and F. H. C. Crick. Molecular structure of nucleic acids. Nature, 171:737–8, 1953.
33. Oró, J. Synthesis of adenine from ammonium cyanide. Biochemical and Biophysical Research Communications, 2:407–12, 1960.
34. Wills, C. and J. Bada. The Spark of Life: Darwin and the Primeval Soup. Perseus, Cambridge, 2000.
35. Hill, A. R., C. Böhler and L. E. Orgel. Polymerization on the rocks: negatively charged D/L amino acids. Origins of Life and Evolution of the Biosphere, 28:235–43, 1998.
36. Ferris, J. P. Montmorillonite catalysis of 30–50mer oligonucleotides, laboratory demonstration of potential steps in the origin of the RNA world. Origins of Life and Evolution of the Biosphere, 32:311–32, 2002.
37. Liu, R. and L. E. Orgel. Efficient oligomerization of negatively-charged D/L amino acids at −20 °C. Journal of the American Chemical Society, 119:4791–2, 1997.
38. Sowerby, S. J., C.-M. Marth and N. G. Holm. Effect of temperature on the adsorption of adenine. Astrobiology, 1:481–8, 2001.
39. Orgel, L. E. The origin of life—a review of facts and speculations. Trends in Biochemical Sciences, 23:491–5, 1998.
40. Rode, B. M. Peptides and the origin of life. Peptides, 20:773–86, 1999.
41. Kanavarioti, A., P. A. Monnard and D. W. Deamer. Eutectic phases in ice facilitate nonenzymatic nucleic acid synthesis. Astrobiology, 1:481–7, 2001.
42. Hammes, G. G. and S. E. Schullery. Structure of molecular aggregates. II. Construction of model membranes from phospholipids and polypeptides. Biochemistry, 9:2555–8, 1970.
43. Szostak, J. W., D. P. Bartel and P. L. Luisi. Synthesizing life. Nature, 409:387–90, 2001.
44. Hanczyc, M. M., S. M. Fujikawa and J. W. Szostak. Experimental models of primitive cellular compartments, encapsulation, growth, and division. Science, 302:618–22, 2003.
45. Naithani, V. K. and M. M. Dhar. Synthetic substitute lysozymes. Biochemical and Biophysical Research Communications, 29:368–72, 1967.
46. Moser, R., R. M. Thomas and B. Gutte. An artificial crystalline DDT-binding polypeptide. FEBS Letters, 157:247–51, 1983.
47. Chen, I. A., R. W. Roberts and J. W. Szostak. The emergence of competition between model protocells. Science, 305:1474–6, 2004.
48. Bada, J. L. and A. Lazcano. The origin of life, some like it hot, but not the first biomolecules. Science, 296:1982–3, 2002.
49. Eschenmoser, A. Chemical etiology of nucleic acid structure. Science, 284:2118–24, 1999.
50. Joyce, G. F. RNA evolution and the origins of life. Nature, 338:217–24, 1989.
51. Zubay, G. Studies on the lead-catalyzed synthesis of aldopentoses. Origins of Life and Evolution of the Biosphere, 28:12–26, 1998.
52. Ricardo, A., M. A. Carrigan, A. N. Alcott, et al. Borate minerals stabilize ribose. Science, 303:196, 2004.
53. Springsteen, G. and G. F. Joyce. Selective derivatization and sequestration of ribose from a prebiotic mix. Journal of the American Chemical Society, 126:9578–83, 2004.
54. Nielsen, P. Peptide nucleic acid, PNA, a model structure for the primordial genetic material? Origins of Life and Evolution of the Biosphere, 23:323–7, 1993.
55. Diederichsen, U. Pairing properties of alanyl peptide nucleic acids containing an amino acid backbone with alternating configuration. Angewandte Chemie International Edition in English, 42:1540–3, 1996.
56. Schöning, K.-U., P. Scholz, S. Guntha, et al. Chemical etiology of nucleic acid structure, the alpha-threofuranosyl-(3′→2′) oligonucleotide system. Science, 290:1347–51, 2000.
57. Orgel, L. E. Origin of life: enhanced, a simpler nucleic acid. Science, 290:1306–7, 2000.
58. Lifson, S. On the crucial stages in the origin of animate matter. Journal of Molecular Evolution, 44:1–8, 1997.
59. Pross, A. Causation and the origin of life, metabolism or replication first? Origins of Life and Evolution of the Biosphere, 34:307–21, 2004.
60. Dyson, F. J. A model for the origin of life. Journal of Molecular Evolution, 18:344–50, 1982.
61. Dyson, F. Origins of Life. Cambridge University Press, Cambridge, 1999.
62. Kauffman, S. A. The Origins of Order: Self-Organization and Selection in Evolution. Oxford University Press, New York, 1993.
63. Segre, D., D. Ben-Eli, D. W. Deamer, et al. The lipid world. Origins of Life and Evolution of the Biosphere, 31:119–45, 2001.
64. Bachmann, P. A., P. L. Luisi and J. Lang. Autocatalytic self-replicating micelles as models for prebiotic structures. Nature, 357:57–9, 1992.
65. Luisi, P. L., P. S. Rasi and F. Mavelli. A possible route to prebiotic vesicle reproduction. Artificial Life, 10:297–308, 2004.
66. Szathmary, E. The evolution of replicators. Philosophical Transactions of the Royal Society, London, Series B, 355:1669–76, 2000.
67. Orgel, L. E. Self-organizing biochemical cycles. Proceedings of the National Academy of Sciences USA, 97:12503–7, 2000.
68. Wächtershäuser, G. Before enzymes and templates, a theory of surface metabolism. Microbiological Reviews, 52:452–84, 1988.
69. De Duve, C. Blueprint for a Cell: The Nature and Origin of Life. Patterson, Burlington, N. C., 1991.
70. Morowitz, H. J. Beginnings of Cellular Life: Metabolism Recapitulates Biogenesis. Yale University Press, New Haven, Conn., 1992.
71. Hegeman, G. D. and S. L. Rosenberg. The evolution of bacterial enzyme systems. Annual Review of Microbiology, 24:429–62, 1970.
72. Degani, C. and M. Halmann. Chemical evolution of carbohydrate metabolism. Nature, 216:1207, 1967.
73. Strasak, M. and F. Sersen. An unusual reaction of adenine and adenosine on montmorillonite, a new way of prebiotic synthesis of some purine nucleotides? Naturwissenschaften, 78:121, 1991.
74. Oró, J. and A. P. Kimball. Synthesis of purines under possible primitive Earth conditions. II. Purine intermediates from hydrogen cyanide. Archives of Biochemistry and Biophysics, 96:293–7, 1962.
75. Yamagata, Y., K. Sasaki, O. Takoaka, et al. Prebiotic synthesis of orotic acid parallels to the biosynthetic pathway. Origins of Life and Evolution of the Biosphere, 20:389–99, 1990.
76. Ferris, J. P. and P. C. Joshi. Chemical evolution from hydrogen cyanide, photochemical decarboxylation of orotic acid and orotate derivatives. Science, 201:361–2, 1978.
77. Maden, B. E. H. No soup for starters? Autotrophy and origins of metabolism. Trends in Biochemical Sciences, 20:337–41, 1995.
78. Huber, C. and G. Wächtershäuser. Peptides by activation of amino acids with CO on (Ni,Fe)S surfaces and implications for the origin of life. Science, 281:670–2, 1998.
79. Fenchel, T. Origin and Early Evolution of Life. Oxford University Press, Oxford, 2002.
80. Pirie, N. W. Ideas and assumptions about the origin of life. Discovery, 14:238–42, 1953.
3 Shotgun Fragment Assembly
Granger Sutton & Ian Dew
From a computational standpoint, a genome (or DNA target sequence) can be viewed as a string, or small set of strings, constructed from the alphabet {A,C,G,T}. The shotgun fragment assembly problem is to reconstruct the DNA target sequence as completely and accurately as possible given a set of randomly selected substrings. In practice these substrings, referred to as reads or fragments, are sequenced from multiple copies of larger substrings called clones. A carefully controlled random shearing process is applied to the DNA target sequence to produce libraries of clones whose lengths are approximately normally distributed and at least one order of magnitude shorter than the target sequence (see figure 3.1). The core challenge of the fragment assembly problem is to identify fragments that share some subsequence of the target sequence and merge them together. Consequently, a fundamental determination must be made for each pair of fragments: whether they share a subsequence of the target sequence. If they do, the fragments are said to overlap, or to have an overlap. Define the target sequence T to be a string t1 t2 t3 ... tn, where n is the length of T. Define a fragment fi to be a string fi1 fi2 fi3 ... fili, where li is the length of fi. Define the tuple (bi, ei) to be the interval of T from which fragment fi was derived. A true positive overlap of the fragment pair (fi, fj) is the correct computational assertion that the interval (bi, ei) intersects the interval (bj, ej). (Figure 3.2 shows an example of clones and end reads taken from two samples of a DNA target sequence.) A false positive overlap of the fragment pair (fi, fj) is the incorrect computational assertion that the intervals (bi, ei) and (bj, ej) intersect. A true negative overlap is the correct absence of such a computational assertion. A false negative overlap is the incorrect absence of such an assertion.
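This bookkeeping lends itself to a small illustration (a hypothetical Python sketch, not from the chapter; the function names are invented, and intervals are treated as half-open pairs). Given the true source intervals of two fragments and whether an overlap was asserted, every assertion falls into one of the four classes:

```python
# Hypothetical sketch: evaluating an overlap assertion for a fragment pair
# against the ground-truth source intervals (b_i, e_i) in the target T.
# Assumption: intervals are half-open, so (b, e) covers positions b..e-1.

def intervals_intersect(iv1, iv2):
    """True if the two source intervals share at least one position of T."""
    return iv1[0] < iv2[1] and iv2[0] < iv1[1]

def classify_overlap(iv_i, iv_j, asserted):
    """Classify a computational overlap assertion as TP, FP, TN, or FN."""
    truly_overlap = intervals_intersect(iv_i, iv_j)
    if asserted:
        return "true positive" if truly_overlap else "false positive"
    return "false negative" if truly_overlap else "true negative"

print(classify_overlap((0, 500), (400, 900), asserted=True))   # true positive
print(classify_overlap((0, 500), (600, 1100), asserted=True))  # false positive
```

Sensitivity and specificity are then simple ratios over the counts of these four classes.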
Sensitivity is the ratio of true positive overlaps to the sum of true positive and false negative overlaps. Specificity is the ratio of true positive overlaps to the sum of true positive and false positive overlaps. Ideally, a shotgun fragment assembly algorithm has sensitivity and specificity as close as possible to one (i.e., no false positives or false negatives). The biological, experimental, and computational issues that lead to false positive and false negative overlaps have prompted the development of solutions to the shotgun fragment assembly problem.

Figure 3.1 Multiple copies of the DNA target sequence (A) are sheared into clones (B), which are inserted into cloning vectors (C). Both ends of the clone are sequenced to generate reads (D).

There are two primary causes of false positive overlaps: the subsequence of T shared by two fragments (the overlap) is short enough that it is likely to appear by chance in T, treated as n (the length of T) random selections from {A,C,G,T}; or the subsequence is long enough that it is unlikely to appear in T by chance but actually appears two or more times in T. In the latter case, a biological event occurred at some point in the past that caused some subsequence of T (a repeat) to be duplicated within T and retained at sufficiently high fidelity. These two sources of false positive overlaps have influenced methods for computationally determining and filtering likely true overlaps.

Figure 3.2 Clones and end reads sampled from two samples (A and B) of a DNA target sequence.

For short, exactly matching overlaps, it is easy to compute the probability of the match being found by random chance. For short overlaps with some differences in the aligned strings, an estimate or bound must suffice. In either case, a threshold is set for deeming an overlap significant, or nonrandom. For longer overlaps, which easily pass the significance threshold, the question is whether the known experimental error in determining the fragment sequences (sequencing error) is consistent with the number of differences between the two aligned sequences in the overlap. A probability model of sequencing error can be constructed and used to set a threshold for this case as well. There are two primary causes of false negative overlaps: the overlap is short enough that it is likely to appear by chance in T, and consequently falls below the significance threshold set to avoid false positive overlaps; or the rate of sequencing error in the overlapping portions of the two fragments is high enough that the overlap cannot be distinguished from random chance at the chosen significance threshold.

In the late 1970s and early 1980s, when the problem first arose, target DNA sequences were relatively short (5–50k base pairs), sequencing costs were high, and sequencing error rates were relatively high (5–10%) for some fragments though much lower on average. Primarily because the early target DNA sequences were short, overlaps of any reasonable length (e.g., 15 base pairs), even with a large number of differences (e.g., 15%), usually represented true overlaps. The earliest shotgun fragment assemblers [1,2] exploited this situation by comparing existing fragments (or merged fragments) to new fragments as they were sequenced, and merging the fragments together whenever an overlap was found. These merged fragments were then examined manually and corrected if necessary.
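The chance-occurrence threshold discussed above can be made concrete with a back-of-the-envelope calculation (an illustrative Python sketch under the chapter's model of T as random selections from {A,C,G,T}; the 0.01 cutoff and the function names are arbitrary choices for this example):

```python
# Expected number of chance occurrences of a specific exact k-mer in a
# random target sequence of length n (each position drawn from {A,C,G,T}).

def expected_chance_matches(n, k):
    return (n - k + 1) * 0.25 ** k

def min_significant_overlap(n, threshold=0.01):
    """Smallest exact-overlap length whose expected chance count in T
    falls below the chosen significance threshold."""
    k = 1
    while expected_chance_matches(n, k) >= threshold:
        k += 1
    return k

print(min_significant_overlap(50_000))         # early, short targets -> 12
print(min_significant_overlap(3_000_000_000))  # human-genome scale   -> 20
```

Consistent with the text, an overlap length that is highly significant against a 50 kbp target is far too short to be significant against a 3 Gbp one.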
A final sequence was then determined from the merged fragments. At this point in history, the shotgun fragment assembly problem was posed as a simpler mathematical optimization problem in hopes that the insight gained would generalize. The inspiration for the simpler problem statement was that almost all long overlaps were true, and the longer the overlap, the more likely it was to be true. If the target sequence is reconstructed by merging overlapping fragments and longer overlaps are preferred to shorter ones, then a shorter merged sequence reconstruction is better than a longer one, because it must be based on longer overlaps [3–6]. The simpler problem, called the Shortest Common Superstring (SCS) problem, assumes that there is no sequencing or other experimental error (i.e., the fragment sequences are true substrings of the target sequence) and is formulated as follows. Given a set of strings, F, find the shortest string (superstring), S, that contains all of the strings in F as substrings. SCS has been proven to be NP-hard [7], which informally means that no efficient algorithm is known that is guaranteed to find an optimal solution: in the worst case, an exponential number of candidate solutions must be searched. A succession of papers [5–13] has shown that an efficient greedy heuristic, which is simply to sort the overlaps by length and iteratively choose the next longest overlap that is consistent with the overlaps already chosen, comes within a small constant factor of the optimal solution. Peltola et al. [14,15] showed that the longest-overlap heuristic also works well in practice when SCS is generalized to use fragments (substrings) with some maximum error rate when aligned to the reconstructed target sequence (superstring). Setubal and Meidanis [3] generalized SCS to accept only overlaps greater than some minimal length, recognizing that shorter overlaps are likely to be spurious, and consequently to generate multiple superstrings. This generalization is implicit in any implementation that uses significance thresholds on overlaps. Peltola et al. [14,15] broke the shotgun fragment assembly problem down into three phases: overlap, layout, and consensus. Almost all shotgun fragment assemblers since have included these phases as well. The overlap phase compares each fragment to every other fragment to determine whether they overlap. The layout phase tries to find maximal or optimal sets of consistent overlaps detected in the overlap phase to merge or order (determine the layout of) the fragments. The consensus phase uses the layout of the fragments to reconstruct the target sequence. After the overlaps are determined, the most common representation of the layout problem is the overlap graph, or multigraph, G. The vertices are the fragments, F, and the edges are the overlaps, O (see figure 3.6A). The multigraph can be reduced to a graph if there is at most one edge between any pair of vertices. A heuristic of choosing the best overlap between any pair of fragments is sometimes applied to simplify the problem.
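The greedy longest-overlap heuristic for SCS described above can be sketched in a few lines (an illustrative, error-free Python toy, not a real assembler; it ignores reverse complements, tie-breaking, and all efficiency concerns):

```python
# Greedy SCS heuristic sketch: repeatedly merge the pair of fragments with
# the longest exact suffix-prefix overlap until one superstring remains.

def overlap_len(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_scs(fragments):
    # Fragments contained in another fragment add no information; drop them.
    frags = [f for f in fragments
             if not any(f != g and f in g for g in fragments)]
    while len(frags) > 1:
        k, a, b = max(((overlap_len(a, b), a, b)
                       for a in frags for b in frags if a is not b),
                      key=lambda t: t[0])
        frags.remove(a)
        frags.remove(b)
        frags.append(a + b[k:])  # merge on the chosen overlap
    return frags[0]

print(greedy_scs(["ACGTAC", "GTACGG", "ACGGTT"]))  # ACGTACGGTT
```

The quadratic all-pairs comparison inside the loop mirrors the overlap phase described next, just without any error tolerance.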
The layout solution then becomes the maximal weight path, or set of disjoint paths, that visits each fragment exactly once. Consensus takes the set of pairwise overlaps given by the path(s) to seed a multiple sequence alignment, which is then refined and evaluated at each position (column) to determine the reconstructed target sequence.

OVERLAP PHASE
Overlap detection has an interesting history in shotgun fragment assembly. Efficient algorithms were essential several decades ago because computer memory and processing power were quite limited even relative to the short target sequences and small number of fragments needed to reconstruct them. Both for speed of computation and as a significance threshold, overlaps were only considered between fragments that shared an exact sequence (substring) of length k, referred to as a k-mer. The value of k varied depending on the length of the target sequence and a tradeoff between sensitivity and specificity, but 8 was a typical length.
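The k-mer seeding strategy can be sketched as follows (a hypothetical Python illustration; production assemblers use far more memory-efficient index structures):

```python
# Sketch of k-mer seeding: index every k-mer of one fragment, then scan the
# other fragment for shared k-mers; each hit is a candidate alignment seed.

from collections import defaultdict

def shared_kmers(f1, f2, k=8):
    """Return (position in f1, position in f2) pairs sharing an exact k-mer."""
    index = defaultdict(list)
    for i in range(len(f1) - k + 1):
        index[f1[i:i + k]].append(i)
    return [(i, j)
            for j in range(len(f2) - k + 1)
            for i in index.get(f2[j:j + k], ())]

print(shared_kmers("AAACGTACGTTT", "GGACGTACGTCC", k=6))  # [(2, 2), (3, 3), (4, 4)]
```

A run of seeds along a common diagonal, as in this example, is exactly the kind of evidence that triggers the alignment step described next.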
Shotgun Fragment Assembly
83
Starting with a shared k-mer (seed) between two fragments, heuristic methods were used to efficiently find a reasonable alignment between them [1,2]. This approach was soon eclipsed by an efficient application [14] of dynamic programming [16–19] that could find an optimal alignment under certain assumptions. More efficient but still optimal variations of the dynamic programming approach followed [20]. Later, as target lengths increased dramatically (3 billion base pairs for the human genome), seeding alignments with shared k-mers (using a much longer k, say 20), or similar seeding strategies (shared k-mers with one difference allowed, or two shared shorter k-mers in close proximity in both fragments), again became necessary. The significance threshold on the length of a match was often included directly in the overlap computation by requiring a shared substring of at least a certain length. The second significance threshold, which attempts to reduce the impact of sequencing error, was defined either as a maximum number of differences per length [2,14] or as the probability of a certain number of errors per length occurring given the error rate [21]. The actual sequencing error for any given technology is not inherently known and must be modeled based on data. Initially, the sequence error was modeled as a fixed rate over the length of the fragment. This fixed error rate was estimated either by bootstrapping [22] (i.e., generate a preliminary reconstruction, compare fragments to this, and iterate) or by sequencing known DNA molecules repeatedly and assuming that the error rate would be similar for a novel DNA molecule. The first improvement to this model was to estimate the amount of sequencing error in each region of a fragment based on its overlaps with other fragments [23]. Most sequencing errors are independent of the sequence context (the sequencing technology is not likely to repeatedly make the same error at the same base pair).
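A minimal, unbanded version of such an overlap alignment can be sketched with dynamic programming; unit edit costs and the `min_olap` cutoff are simplifying assumptions:

```python
def overlap_align(a, b, min_olap=3):
    """Dynamic-programming overlap alignment: find the best alignment of
    a suffix of `a` with a prefix of `b`, returning (edit distance,
    length of b's prefix involved).  Unit edit costs are a simplification."""
    la, lb = len(a), len(b)
    # dp[i][j] = fewest edits aligning some suffix of a[:i] with b[:j]
    dp = [[0] * (lb + 1) for _ in range(la + 1)]
    for j in range(1, lb + 1):
        dp[0][j] = j                    # b's prefix must all be consumed
    for i in range(1, la + 1):
        dp[i][0] = 0                    # free to start anywhere in a
        for j in range(1, lb + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match/mismatch
                           dp[i - 1][j] + 1,         # gap in b
                           dp[i][j - 1] + 1)         # gap in a
    # the overlap must end at the end of a; pick the best prefix of b
    best_j = min(range(min_olap, lb + 1), key=lambda j: dp[la][j])
    return dp[la][best_j], best_j
```

Production implementations confine this computation to a narrow band around the k-mer seed rather than filling the full table.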
This implies that aligned base pairs in an overlap between two fragments that contains no differences are likely to be correct, or error-free, in both fragments. Similarly, maximum error levels can be estimated for overlapping regions based on the number of differences between them. If the error level of an overlap between two fragments exceeds the sum of the maximum error levels for these regions of the two fragments, the overlap can be rejected—the fragments come from two different copies of a repeat. This subtracts out the baseline level of sequencing error and allows repeat discrimination at a much finer level, effectively reducing the amount of repeats which can confound overlap detection. This approach has been extended and refined by employing a multiple sequence alignment of all fragments overlapping the one being evaluated [24]. This multiple sequence alignment computation is efficient because it is constructed directly from pairwise alignments. Any base call that is not confirmed by another aligned fragment is presumed to be a sequencing error and is not used to differentiate repeats.
This process is often called error correction. A slightly different variant looks at the set of all k-mers from all fragments [25]. If the sequencing error rate is low, most k-mers in the target sequence should be correctly ascertained from multiple fragments. If k is large, then the number of possible k-mers (4ᵏ) is much larger than the number of k-mers in the target sequence (n − k + 1). As a result, most sequencing errors will generate k-mers that are not in the target sequence. The approach is to modify each k-mer that occurs only once and determine if it can be converted into a k-mer that occurs multiple times by applying a small (usually one) number of changes. If it can, then it is likely to be from the target sequence. In fact, an error at a single base pair will corrupt all k of the k-mers in the fragment that overlap it. The signature of a single error is therefore k overlapping single-occurrence k-mers (from the same fragment), each with a single difference at the same error position with respect to k overlapping multiple-occurrence k-mers (from other fragments). A complementary approach was developed to estimate the likelihood that a particular base call in a given fragment is in error based on the signal strength and other properties of the sequencing machine [26]. In this approach, a quality value is assigned to each base call that represents an estimate of the probability that the call is in error. This increases the contrast between likely sequencing errors, in which the quality value is low, and repeat-induced differences and thus facilitates even better repeat discrimination. Many of the most recent shotgun fragment assemblers [24,27] combine techniques that exploit both quality values and overlapping fragments to improve error correction.
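The single-occurrence k-mer correction just described can be sketched as follows; treating "more than one occurrence" as trusted is a simplifying assumption:

```python
from collections import Counter

def kmer_counts(fragments, k):
    """Count every k-mer across all fragments."""
    counts = Counter()
    for seq in fragments:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def correct_kmer(kmer, counts):
    """If a k-mer occurs only once, look for a single-base change that
    turns it into a k-mer seen multiple times (the likely true k-mer)."""
    if counts[kmer] > 1:
        return kmer                      # already trusted
    for i in range(len(kmer)):
        for base in "ACGT":
            if base != kmer[i]:
                variant = kmer[:i] + base + kmer[i + 1:]
                if counts[variant] > 1:
                    return variant
    return kmer                          # no confident correction found
```

Real implementations use a coverage-dependent trust threshold and also consult quality values before accepting a correction.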
To some extent, increases in computer power have helped address the issue of efficient overlap detection, but the recent and dramatic increase in the length of target DNA sequences has outpaced the rate of advances in computing hardware. While dynamic programming is generally viewed as optimal in the tradeoff between sensitivity and specificity, k-mer seeding of alignments is essential to make the overlap computation feasible for large genomes. Dynamic programming can be thought of as a traversal or search of an N by N space—two sequences, each of length N, laid out along two orthogonal axes of what is referred to as an edit graph—that takes N² time to compute (see figure 3.3). Using this method to look for overlaps between all fragments is essentially equivalent to concatenating all of the fragments into a single sequence and placing this along both axes. If r is the number of fragments and l the average length of a fragment, then the overlap computation takes (r ∗ l)² time (see figure 3.4). If every k-mer occurs only once in the target sequence, then each k-mer occurs (r ∗ l)/n = c times on average in the set of fragment sequences, where n is the length of the target sequence, and c is called the coverage. Under this k-mer uniqueness assumption, each fragment would be involved in 2c overlap calculations if overlaps were computed only between fragments that
Figure 3.3 Edit graph (top) showing the optimal path of a local overlap between two sequences with their alignment (bottom).
share a k-mer. If dynamic programming is used to compute the k-mer seeded overlaps (in practice the k-mer seeds are extended within a small band of the search space, greatly improving efficiency), then the computation takes (r ∗ 2c ∗ l ∗ l) = (r ∗ l) ∗ (2c ∗ l). Because r >> l >> 2c and l and 2c are constants while r grows with the length of the target sequence, the computation time is dominated by the (r ∗ l) term which is significantly more efficient than the previous (r ∗ l)² time. Unfortunately most long target DNA sequences of interest do not satisfy the k-mer uniqueness assumption for practical values of k. In fact, a sizable portion of many target sequences constitutes ubiquitous repeats where k-mers not only occur more than once but occur many
Figure 3.4 Representation of overlap phase involving 2 ∗ n ∗ n edit graphs (r–j represents the reverse complement of rj).
times, and the number of occurrences grows with target sequence length. In the extreme case, where a single k-mer occurs in every fragment, the problem regresses back to computing an overlap between every pair of fragments and taking (r ∗ l)² time. In practice some k-mers occur too frequently to be used in an efficient overlap computation, so most shotgun fragment assemblers of recent vintage impose some maximum threshold on the number of fragments in which a k-mer can occur before it is no longer used to seed potential overlaps. Not investigating these potential overlaps may increase the number of false negative overlaps, but since most of the potential overlaps in these highly abundant k-mers must come from different copies of repeats, the number of false positive overlaps should be reduced even more. For all but the most similar repeats, less frequently occurring k-mers will exist due to the differences in the repeat copies, and most of the true positive overlaps based on these will be detected. A novel approach to determining fragment overlaps [28] was inspired by a method called sequencing by hybridization (SBH) which attempts to reconstruct sequences from the set of their constituent k-mers [29,30]. A k-mer graph represents the fragments and implicitly their overlaps by depicting each k-mer in any fragment as a directed edge between its (k−1)-mer prefix and (k−1)-mer suffix. A fragment is simply a path in the graph starting with the edge representing the first k-mer in the fragment, proceeding in order through the edges representing the rest of the k-mers in the fragment, and ending with the edge representing the last k-mer in the fragment. The fragment overlaps are implicit in the intersections of the fragment paths. No explicit criterion is given to specify the quality of intersection that would constitute an overlap. Rather, the fragment layout problem is solved directly using the k-mer graph and fragment paths.
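Under the assumption of error-free (or error-corrected) fragments, building the k-mer graph and fragment paths is straightforward; this sketch stores the edges as a prefix-to-suffix adjacency map:

```python
from collections import defaultdict

def kmer_graph(fragments, k):
    """Build a k-mer graph: each k-mer becomes a directed edge from its
    (k-1)-mer prefix to its (k-1)-mer suffix, and each fragment is the
    path of its k-mers in order."""
    edges = defaultdict(list)            # (k-1)-mer prefix -> suffixes
    paths = []
    for seq in fragments:
        path = []
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            edges[kmer[:-1]].append(kmer[1:])
            path.append(kmer)
        paths.append(path)
    return edges, paths
```

Fragments that overlap share a run of k-mers, so their paths traverse the same edges; no pairwise alignment is ever computed.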
Generating the k-mer graph and fragment paths is more efficient than computing overlaps but requires a much greater amount of computer memory. The amount of memory is currently prohibitive for long target DNA sequences, but computer memory continues to become denser and cheaper, so this constraint may soon be surmounted.

LAYOUT PHASE
The overlap graph is a standard representation in fragment assembly algorithms in which the vertices represent fragments and the edges represent overlaps. An innovative, nonstandard approach represents fragments as a pair of vertices, one for each end of the fragment, and a fragment edge joining the vertices. To understand the reason that this is desirable, one must understand the nature of DNA sequence fragments. DNA is a double-stranded helix in which nucleotides (made of a molecule of sugar, a molecule of phosphoric acid, and a molecule called a base) on one strand are joined together along the sugar-phosphate backbone.
A second strand of nucleotides runs antiparallel to the first with the directionality of the sugar-phosphate backbone reversed and each base (one of the chemicals adenine, thymine, guanine, and cytosine, represented by the letters A, T, G, and C, respectively) bonded to a complementary base at the same position on the other strand (A bonds to T and G bonds to C). Each strand has a phosphate at one end called the 5’ (five prime) end and a sugar at the other end called the 3’ (three prime) end. Since the strands are antiparallel, the 5’ end of one strand is paired with the 3’ end of the other strand. When given the DNA sequence of one strand (a string of letters from the alphabet {A,C,G,T}), the sequence of the other strand starting from the 5’ end can be generated by starting from the 3’ end of the given strand and writing the complement of each letter until the 5’ end is reached (see figure 3.5).
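A minimal Python sketch of this complementation procedure:

```python
# translation table mapping each base to its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the sequence of the opposite strand read 5' to 3':
    complement each base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]
```

Applying the operation twice recovers the original strand, which is why the two orientations of a fragment carry the same information.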
Figure 3.5 DNA double helix (A) and diagram (B) distinguishing the sugarphosphate backbone from nitrogenous bases and showing the 5′ and 3′ ends.
This process is called reverse complementing a sequence, and the sequence generated is called its reverse complement. In shotgun sequencing, multiple copies of a large target DNA molecule are sheared into double-stranded clones that are then inserted into a sequencing vector with a random orientation. Each end of the clone can then be sequenced as a fragment. Current sequencing technology determines the sequence of a fragment from the 5’ to the 3’ end starting from a known position (sequencing primer) in the sequencing vector near the insertion site of the clone (this can be done for just one strand or for both using a second sequencing primer site on the opposite side and strand of the clone insertion site). Each fragment sequence thus has an implied 5’ and 3’ end (see figures 3.1 and 3.5). Two fragments from different copies of the target DNA that share some of the same region are said to overlap, but due to the random strand orientation of the clone in the sequencing vector, the two overlapping fragments’ sequences may both be from the same strand of the target DNA, or one from each of the two strands. If the two sequences are from the same strand, the 3’ end of one sequence will overlap the 5’ end of the other. If the two sequences are from opposite strands, then either the two 5’ ends will overlap or the two 3’ ends will overlap (with one or the other fragment reverse complemented). In order to represent this in an overlap graph where one vertex corresponds to one fragment, the edges must be “super”-directed. In a directed graph, an arrow on one end of the edge represents that the edge goes from vertex a to vertex b or vice versa: a super-directed graph imparts additional information on the edge, in this case which end of vertex a (5’ or 3’) goes to which end of vertex b. This can be drawn as a bidirected line with an arrowhead at each vertex with the arrowheads oriented independently of each other toward or away from a vertex (see figure 3.6A and B, and [4]). 
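One way to record the information carried by a super-directed edge is to name fragment ends explicitly; the following sketch uses an assumed `((fragment, end), (fragment, end))` encoding for overlaps:

```python
def build_end_graph(num_fragments, overlaps):
    """Two-vertices-per-fragment representation: fragment f gets end
    vertices (f, "5'") and (f, "3'") joined by an undirected fragment
    edge; each overlap joins two fragment-end vertices.  `overlaps` is
    a list of ((fragment, end), (fragment, end)) pairs -- an assumed
    input encoding."""
    adj = {}
    for f in range(num_fragments):
        five, three = (f, "5'"), (f, "3'")
        adj[five] = {("fragment", three)}
        adj[three] = {("fragment", five)}
    for u, v in overlaps:
        adj[u].add(("overlap", v))
        adj[v].add(("overlap", u))
    return adj
```

A dovetail path in this structure alternates fragment edges and overlap edges, entering each fragment at one end vertex and leaving at the other, with no arrowhead bookkeeping required.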
In the bidirected overlap graph the directions of the arrows at a vertex effectively divide the set of edges touching it into overlaps involving the 5’ end of the fragment and overlaps involving the 3’ end of the fragment. A dovetail path in the bidirected overlap graph is constrained, by rules on the arrowheads, to require that consecutive edges involve opposite ends of the fragment they have in common [4]. A representation using two vertices per fragment (see figure 3.6C), one vertex for each fragment end and a connecting fragment edge, explicitly represents the fragment ends, allows all edges (overlap and fragment) to be undirected, and defines a dovetail path as a path in the overlap graph that traverses a fragment edge, then zero or more ordered pairs of (overlap edge, fragment edge). The information inherent in an overlap is simply the two fragment ends involved and the length of the overlap. When the subsequences of the two fragment ends that overlap are constrained to be identical, the length of the overlap is simply the length of the shared subsequence; if
Figure 3.6 Super-directed graph of reads from figure 3.2 (A), reduced superdirected graph with unitigs circled (B), and alternative, undirected representation of reduced graph (C).
some variation in the subsequences is allowed, particularly insertions and deletions (indels), then the length of the overlap must be represented by a 2-tuple of the lengths of both subsequences aligned in the overlap to be complete. An equivalent representation of overlap length used in [4] is the lengths of the subsequences in each fragment not aligned in the overlap (which is just the length of the fragment minus the length of the subsequence aligned in the overlap). These lengths are called overhangs, or hangs for short. Additional information can be retained for each edge, such as the edit distance to convert the sequence
of one fragment end to the other fragment end, or some likelihood/ probability estimate that the overlap is true. The layout problem is to find a maximal set of edges (overlaps) that are consistent with each other. At the heart of a consistent set of overlaps is a pairwise alignment of two fragments. If a multiple sequence alignment of fragments includes all of the pairwise fragment alignments, then the overlaps are consistent. A more formal approach is to consider the target sequence laid along a line with integer coordinates from 1 to n. Each fragment (subsequence of the target) is viewed as an interval on this line (see figures 3.2, 3.7, and 3.8). A true overlap between a pair of fragments occurs if and only if the fragments’ intervals intersect. This implies that an overlap graph that contains all of the true overlaps (edges) and no false overlaps must be an interval graph (an interval graph is just a term for a graph which meets the intersection of intervals definition above where vertices are intervals and edges are intersections [31,32]). The layout problem can then be viewed as finding a maximal subset of edges (or subgraph) of the overlap graph that forms an interval graph. This criterion was established based on the observation that overlaps must meet the triangle condition of interval graphs [14]. The triangle condition states that if the intersections of intervals i and j and of intervals j and k are known, then the intersection of intervals i and k is completely determined (see figure 3.7). Most assemblers do not require the layout solution to be an interval graph, but rather (first setting aside fragment intervals that are contained in other fragment intervals) that the layout graph must be a subgraph of an interval graph in which the maximal intersections for each fragment end are retained (see figure 3.6). 
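The triangle condition can be illustrated directly on intervals; this sketch assumes dovetail (non-containment) overlaps with the three intervals ordered left to right:

```python
def intersect_len(a, b):
    """Length of the intersection of two intervals given as (begin, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def implied_third_overlap(i, j, k):
    """Triangle condition: with intervals i, j, k ordered left to right
    and only dovetail overlaps, knowing the i,j and j,k overlaps fixes
    the i,k overlap, because i must end where its overlap with j ends
    and k must begin where its overlap with j begins."""
    oij = intersect_len(i, j)
    ojk = intersect_len(j, k)
    # i ends at j[0] + oij; k begins at j[1] - ojk
    return max(0, (j[0] + oij) - (j[1] - ojk))
```

If a detected overlap between i and k disagrees with the implied value, at least one of the three overlaps is false, which is exactly the inconsistency exploited at repeat boundaries.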
For SCS, it is clear that fragments that are substrings of other fragments (called contained fragments) need not be considered because a superstring of the noncontained fragments necessarily includes the contained fragments as substrings, and the superstring cannot be made shorter by adding additional constraints. Most assemblers follow this
Figure 3.7 Triangle condition of intervals showing that i ∩ j and j ∩ k implies i ∩ k.
approach of setting aside contained fragments and then placing them at the end of the layout phase. We will follow [21] in distinguishing between containment overlaps, in which the overlap completely contains one of the two fragments, and dovetail overlaps where only one end of each fragment is included in the overlap. Once contained fragments and their containment overlaps are set aside, the overlap graph contains only dovetail overlaps. Consequently, we will generally refer to dovetail overlaps simply as overlaps. Given the set of these remaining overlaps, the interval of each fragment has at most two maximal intersections, each involving one of its fragment ends, with the intervals of two other fragments. The noncontained fragment intervals can be ordered from left to right along the target sequence for fragments fi, i = 1 to rnc (r is the total number of fragments, rnc is the number of noncontained fragments), such that b1 < b2 < … < bi < … < brnc and e1 < e2 < … < ei < … < ernc (recall that bi is the beginning position of fragment fi and ei the end). The length of the union of the first two fragment intervals is l1 + l2 − o1,2, where li is the length of fi and oi,j is the length of the overlap between fi and fj. Building on this, the length of the target sequence, n, can be written as n = ∑ li − ∑ oi,i+1, where the first sum runs over i = 1 to rnc and the second over i = 1 to rnc − 1. A desired solution in the overlap graph would be (again following [4] and [21]) a dovetail chain/path that traverses a fragment edge, followed by zero or more ordered pairs of (overlap edge, fragment edge). Fragment edges would represent the fragments f1 to frnc and overlap edges would represent the overlaps o1,2 to ornc−1,rnc. With the sum of the fragment lengths a constant, maximizing the length of the overlaps between adjacent fragments is equivalent to minimizing the length of the reconstructed string. Recall that finding an optimal solution to SCS is NP-hard, which is why a simple greedy algorithm is often used. The standard approach is to sort all of the overlaps by length and seed the solution with the longest overlap. Then the rest of the overlaps are iterated over in order and either added to the solution if the fragment ends are not in an overlap that is already part of the solution or discarded if one or both fragment ends are. The constraint that each fragment end can be involved in only one overlap guarantees that the solution will be a set of dovetail paths which must be a subgraph of an interval graph. A similar greedy approach was used in [14], but after the solution was generated, all discarded edges were checked for consistency with the dovetail paths solution using the triangle condition, and discrepancies were reported. Recall that an overlap graph with all true overlaps (no false positive or false negative overlaps) is an interval graph and any dovetail path in it will produce a correct solution. If some short true overlaps are not detected but longer overlaps cover the same intervals, the result is still a subgraph of an interval graph, and any dovetail path through it is a
correct solution. If, however, a short true overlap is missed and no other overlap covers the interval, this will create an apparent gap in the solution. The same effect is produced by a lack of fragment coverage anywhere along the target sequence, and this is just intrinsic to the random sampling of the fragments [33]. The simple greedy solution fails when repeated regions of the target sequence generate false positive overlaps. If the copies of a repeat along the sequence (line) are indistinguishable, the associated graph is no longer an interval graph. In effect, the different copies end up being “glued together” [34], creating loops in the line and cycles in the graph. Within such a repeat, any given maximal overlap may not be a true overlap. Because the fragment intervals in the repeat are glued together there will be interleaving of fragment intervals from different copies of the repeat. This guarantees that some maximal overlaps will be false. The SCS solution fails to take repeats into account, and meeting the shortest superstring criterion actually compresses exact repeat copies into a single occurrence in the superstring (see figure 3.8). The first approach to achieving a correct layout solution in the presence of repeats is to reduce the number of false positive overlaps as much as possible. If the repeats are exact copies, then nothing can be done for
Figure 3.8 Example showing that the SCS solution can be overcompressed, misordered, and disconnected.
overlaps within these regions, but most repeats have some level of discrepancy between copies. As discussed in the overlap phase section, if the level of repeat discrepancy is significantly greater than the differences due to sequencing error (or corrected sequencing error, if using error correction), then false overlaps can be distinguished and kept out of the overlap graph. The Phrap assembler applies this technique aggressively and with good results [35]. Phrap uses quality values (estimates of error probability at each base call) to differentiate sequencing error from repeat copy differences. In and of itself, this provides a large advantage over assemblers not using any form of sequence error correction. The really aggressive aspect of Phrap’s approach is to use a maximum likelihood test to choose between two models: that the two fragments are from the same interval of the target sequence with differences due to sequencing error, or that the two fragments are from different copies of a repeat. The test includes a tunable parameter for the expected number of differences per length, typically 1 to 5 per 100 base pairs. This approach rejects many more false positive overlaps than would a test that an overlap is due to random chance. It also results in more false negative overlaps, but the tradeoff often provides very good results. Perhaps the furthest this repeat separation solution can be pushed is to use correlated differences gleaned from a multiple sequence alignment of fragments [36–39]. The key concept is that differences between copies of a repeat (called distinguishing base sites or defined nucleotide positions) will be the same for fragments from different copies (correlated), whereas sequencing errors will occur at random positions (uncorrelated). This method starts with a multiple sequence alignment of fragments and finds columns with discrepancies (called separating columns) (see figure 3.9). 
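The separating-column machinery can be sketched as follows; the column encoding (a map from fragment id to base) is an assumption for illustration:

```python
def column_partition(column):
    """Partition fragment ids by the base each shows in an alignment
    column; `column` maps fragment id -> base (gap entries omitted)."""
    classes = {}
    for fid, base in column.items():
        classes.setdefault(base, set()).add(fid)
    return {frozenset(s) for s in classes.values()}

def columns_correlated(col_a, col_b):
    """Two separating columns are correlated if they partition the
    fragments spanning both columns identically, which is evidence of
    repeat copy differences rather than independent sequencing errors."""
    shared = set(col_a) & set(col_b)
    part_a = column_partition({f: col_a[f] for f in shared})
    part_b = column_partition({f: col_b[f] for f in shared})
    return part_a == part_b
```

Random sequencing errors split fragments differently in each column, so agreement across several columns is unlikely unless the differences come from distinct repeat copies.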
Each separating column partitions the set of fragments spanning it into two or more classes. When a set of fragments spans multiple separating columns the partitions can be tested for consistency. If the partitioning
Figure 3.9 Correlated differences (C and T in sequences 1–5, A and G in sequences 6–9 in the same column pairs) supporting repeat separation and an uncorrelated sequencing error in sequence 4.
is consistent, then the columns are correlated and the differences are much more likely a result of repeat differences than sequencing error. The correlated partitioning test can be either heuristic or based on a statistical model. The test can be applied either immediately after the overlap phase as another filter to remove false positive overlaps, or after an initial portion of the layout phase has produced a contig (short for contiguous sequence) containing fragments from different repeat copies that need to be separated. Ultimately, however, large and highly complex genomes have repeat copies that are identical, or at least sufficiently similar, such that any approach to repeat separation based on repeat copy differences must fail. The second approach to repeat resolution in the overlap graph is to first recognize what portions of the overlap graph or which sets of fragments are likely to be from intervals of the target sequence that are repeats. One distinguishing feature of repeat fragments that has been widely recognized [40–43] is that they will have, on average, proportionately (to the repeat copy number) more overlaps than fragments from unique regions of the target sequence. The number of overlaps for a fragment or fragment end is a binomially distributed random variable parameterized by the number and length of the fragments and the length of the target sequence (if the fragments are randomly uniformly sampled from the target sequence) [44]. This binomial distribution is usually well approximated by the Poisson distribution [33]. Unfortunately the intersection between the distributions of repeat and unique fragments is large for repeats with a small number of copies (e.g., 2 or 3) when the coverage ratio of total length of fragments to target sequence length is 5 to 15, which is standard for shotgun assembly projects. 
Thus, setting a threshold in terms of a number of overlaps to differentiate between repeat and unique fragments will lead to a high number of false positives, false negatives, or both. A different property of repeat regions has also been widely recognized and has already been mentioned above: inconsistent triples of fragments and their overlaps that do not meet the triangle condition [14]. The triangle condition is violated at the boundaries of repeat regions (see figure 3.10). A particularly elegant approach to distinguishing these repeat boundaries in the overlap graph is to remove overlaps (edges) in those portions of the graph that have interval graph properties. This reduces these intervals to single dovetail chains [4]. The overlaps that are removed can be reconstructed from a pair of other overlaps (see figure 3.6A and B). This is also known as chordal graph reduction in interval graphs. The size and complexity of the overlap graph are greatly reduced using this method. The only branching in the reduced graph occurs where fragments cross repeat boundaries. The dovetail chains in the reduced overlap graph have no conflicting choice of layout positions, so they can be represented as contigs called chunks [4]
Figure 3.10 Reads from figure 3.2 that do not meet the triangle condition (A) and associated unique/repeat boundary detection via sequence alignment shown at the fragment level (B) and at the sequence level (C).
or unitigs [45]. The overlap graph is thus transformed into a chunk or unitig graph that has the same branching structure as the reduced overlap graph. A unitig can comprise fragments from a single interval of the target sequence (a unique unitig), from multiple intervals that have been collapsed or glued together (a repeat unitig), or unfortunately a combination of the two. A combined unitig (part unique, part repeat) only occurs when the boundary between the repeat and the unique sequence is not detected for at least two copies of the repeat. This can occur due to low sequence coverage (the boundary is not sampled) or some failure in overlap detection. For deep sequence coverage, combined unitigs are rare, so we will set aside this problem for now. If we can distinguish unique from repeat unitigs, the layout problem will be greatly simplified. There have been two complementary approaches for differentiating unique from repeat unitigs. One method looks at the density of fragments within the unitig (sometimes called arrival rate) and determines the probability or likelihood ratio that the fragment density is consistent with randomly sampling from a single interval or multiple intervals [45]. This is analogous to the previous approach of determining that a fragment end is in a repeat based on the number of overlaps that include it. The density measure becomes much more powerful than the fragment overlap measure as the length of the unitig increases because the density distributions for unique and repeat unitigs intersect less as unitig length increases. The separation power of the density measure also increases with coverage depth of the random sampling in the same fashion as the fragment overlap measure.
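A likelihood-ratio test of this kind can be sketched under a Poisson model of fragment arrivals; this is one form of such a test, with a two-copy alternative chosen for illustration, not the exact statistic of [45]:

```python
import math

def arrival_log_odds(num_frags, unitig_len, total_frags, genome_len):
    """Natural-log likelihood ratio that a unitig's fragment arrivals
    come from a single-copy interval rather than a collapsed two-copy
    repeat, assuming Poisson arrivals at the global rate.  Positive
    values favor a unique unitig; names are illustrative."""
    rate = total_frags / genome_len      # global fragment starts per base
    expected = rate * unitig_len         # expected arrivals if unique
    # log[ Pois(k; lam) / Pois(k; 2*lam) ] = lam - k * ln 2
    return expected - num_frags * math.log(2)
```

With ten arrivals where ten are expected for a unique interval the ratio is positive; with twenty arrivals it turns negative, favoring a collapsed repeat.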
Figure 3.11 Short repeats with spanning reads (A) that produce a unique, reducible layout graph pattern (B) and the corresponding solvable pattern in the k-mer graph approach (C).
A second method looks at the local branching structure of the unitig graph [24]. Typically each end of a unique unitig has either a single edge (overlap) with another unitig or no edge if it abuts a gap in the fragment coverage. In contrast, both ends of a repeat unitig have multiple edges to other unitigs representing the different unique intervals that flank each copy of the collapsed repeat (see figure 3.6). Unfortunately, there are rare as well as more common counterexamples to this simple rule (see figures 3.11 and 3.12). One approach to overcoming the branching associated with short repeats is to look instead at mate pair branching (see figure 3.13) [24]. Mate pairs should appear in the layout with a known orientation and distance between them. If multiple sets of mate pairs have one fragment in a repeat unitig and the other in different unique flanking unitigs, then from the perspective of the repeat unitig, the unique unitigs should all occupy the same interval. Just as with the overlap branching, this pattern would identify the unitig as a repeat with multiple different flanking regions. Another elegant approach for constructing the unitig/chunk/repeat graph is based on the k-mer structure of fragments rather than explicit fragment overlaps [25,28]. Recall from the overlap phase section that a k-mer graph represents the fragments, and implicitly their overlaps, in terms of the set of sequenced k-mers, which are drawn as edges between the prefix (k−1)-mer and suffix (k−1)-mer of each k-mer. A fragment is just a path in the graph starting with the edge representing the first k-mer in the fragment, proceeding in order through the rest of the edges representing k-mers in the fragment, and ending with the edge representing the last k-mer in the fragment. If there is no sequencing
Figure 3.12 Example of repeats within repeats and unique between repeats (A) and unitig graph showing unique/repeat multiplicities (B).
Figure 3.13 Initial scaffold graph from figure 3.2 example (A), reduced scaffold graph (B), and final scaffold (C).
error, or the sequencing error can be corrected as discussed in the overlap phase section, then the branching structure in the k-mer graph is largely the same as that for the unitig graph, so that branching occurs only where a k-mer crosses a repeat boundary. A few minor differences do exist between the k-mer graph and the unitig graph due to the vertices and edges representing slightly different objects. In the k-mer graph a vertex represents a fixed length (k−1) interval of the target sequence; in the unitig graph (or more correctly the reduced overlap graph before dovetail paths are coalesced into unitigs) a vertex represents a variable length (a fragment length) interval of the target sequence which is usually at least an order of magnitude larger than k–1. In the k-mer graph an edge represents a fixed length (k) interval of the target sequence which contains the two intervals represented by the two vertices it connects; in the unitig graph an edge represents a variable length interval of the target sequence which is the intersection or overlap of the two intervals represented by the two vertices it connects. In the k-mer graph any (k−1)-mer that occurs multiple times in the target sequence will be represented by a single vertex. At the boundary of a repeat where the next (k−1)-mer occurs only once in the target sequence, the k-mer graph must branch with edges to each of the unique (k−1)-mers flanking each copy of the repeat. As a result, repeat boundaries are known precisely in the k-mer graph. In particular, a vertex at the end of a repeat will have edges to several neighboring vertices across the repeat boundary. The neighboring vertices that are in unique sequence will have only one edge in the direction of the repeat boundary, to the last vertex in the repeat. This overcomes the short repeat branching pattern encountered in unitig graphs (see figure 3.11) but not the repeat within a repeat branching pattern (see figure 3.12). 
The short repeat pattern occurs in the unitig graph because, even though each fragment interval in a unique unitig occurs only once in the target sequence, a fragment subinterval at the end of the unique unitig occurs multiple times (a repeat). This short repeat is represented by multiple edges from the end vertex of the unique unitig: each edge represents the intersection of two fragment intervals, and these subintervals are part of a repeat. The k-mer graph avoids this problem by using edges to represent unions rather than intersections of the (k−1)-mer intervals represented by vertices. So, edges are not subintervals of the vertices. Dovetail paths in the k-mer graph can be coalesced in a fashion similar to dovetail paths in the reduced overlap graph. In the reduced overlap graph the union of fragment vertices along the dovetail path is replaced with a vertex representing a unitig; in the k-mer graph the union of k-mer edges along a dovetail path becomes an edge representing a unitig, and the two bounding vertices remain the same [25,28]. The only exception is the trivial dovetail path that has no edges, where
Shotgun Fragment Assembly
99
the single vertex represents a (k−1) length repeat. We will call this coalesced k-mer graph the k-mer unitig graph. As in the unitig graph, unitigs in the k-mer unitig graph can represent unique or repeat intervals in the target sequence. Gaps in fragment coverage at repeat boundaries can obfuscate the true repeat branching structure in either graph. The approach to finding unique unitigs in the k-mer unitig graph is actually a subproblem of determining the multiplicity (number of copies) of each unitig where a unique unitig has multiplicity one. The assumption is that each unitig has multiplicity at least one and that most unitigs have multiplicity exactly one. The local flow of unitig multiplicities into, through, and out of a unitig must be balanced (Kirchhoff’s law). A simple heuristic is to start all unitig multiplicities at one and iteratively apply the balancing condition until a stable balanced solution is reached [46]. For example, if two single copy unitigs both precede a unitig in the unitig graph, that unitig must have at least two copies. This approach cannot correctly solve every situation (see figure 3.12) and even the more rigorous minimal flow algorithm [46,47] does not solve this example. Nevertheless, this approach has been shown to work well in practice for bacterial genomes [25,46]. It remains to be seen if the heuristic or the minimal flow algorithm can scale to large, complex genomes. Perhaps a more promising approach is to combine the depth of coverage as an initial estimate of unitig multiplicity and then apply the heuristic balancing of flows. These initial multiplicity estimates and balanced flows would be real valued but could be forced over iterations to converge to integers. This would give a high probability of correctly solving the example cited above. 
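The iterative balancing heuristic can be sketched as follows (the unitig names and graph are invented for illustration): every unitig starts at multiplicity one, and a unitig's count is raised until it can absorb the flow forced through it by neighbors that have no alternative route, a relaxed form of the Kirchhoff balancing condition.

```python
from collections import defaultdict

def balance_multiplicities(succ):
    """Start every unitig at multiplicity one and iteratively raise a
    unitig's count until it can absorb the flow forced through it by
    neighbors with no alternative route (relaxed Kirchhoff balance)."""
    pred = defaultdict(list)
    for u, vs in succ.items():
        for v in vs:
            pred[v].append(u)
    m = {u: 1 for u in set(succ) | set(pred)}
    changed = True
    while changed:
        changed = False
        for u in m:
            # copies forced in: predecessors whose only successor is u
            inflow = sum(m[p] for p in pred[u] if succ.get(p) == [u])
            # copies forced out: successors whose only predecessor is u
            outflow = sum(m[s] for s in succ.get(u, []) if pred[s] == [u])
            need = max(1, inflow, outflow)
            if need > m[u]:
                m[u] = need
                changed = True
    return m

# two unique unitigs feed repeat R, which feeds two more unique unitigs
m = balance_multiplicities({"U1": ["R"], "U2": ["R"], "R": ["U3", "U4"]})
```

Here R is forced to multiplicity two because both U1 and U2 flow into it, matching the example in the text; as noted there, this simple relaxation cannot resolve every repeat structure.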
With the k-mer unitig graph and unitig graph (called here the fragment unitig graph to clearly differentiate from the k-mer unitig graph), the goal of the layout phase is to find one or more paths which together include each unique unitig once and each repeat unitig the number of times it occurs in the target sequence. This starts with an initial labeling of the unique unitigs or the multiplicity of the unitigs in the unitig graph. Already for both unitig graphs the nonbranching portions of the overlap graph have been coalesced into unitigs. The next step is to extend unique unitigs across repeat boundaries using fragments that start in the unique unitig. For the k-mer unitig graph the repeat boundary is at the end of the unitig, so all fragment paths that include the unitig boundary (k−1)-mer start before the repeat boundary. For the fragment unitig graph, the unique unitigs branching from the repeat unitigs must be aligned to identify the repeat boundaries within each unique unitig [45] (see figure 3.10). In the fragment unitig graph, unitigs that overlap the unique fragment on the other side of the repeat boundary must be the correct overlap, and any conflicting overlaps with that end of the unique unitig (branching in the graph) can be discarded. In the k-mer unitig graph, a set of equivalent graph transformations is
defined which allows unitigs with multiplicity greater than one to be duplicated and the edges adjacent to those unitigs to be assigned to exactly one of the copies [25]. This can only be done when all of the fragment paths through the repeat unitig are consistent with the assignment of the adjacent edges to the unitig copies. This means that if a fragment spans from one unique unitig across a short repeat unitig to another unique unitig, then the two unique unitigs can be combined into a single unique unitig (containing the short repeat unitig) (see figure 3.11). All shotgun fragment assemblers that use only fragment sequence data are stymied by identical repeats that are long enough that they cannot be spanned by a single fragment. The reason for this is that once one traverses an edge from a unique unitig into a long repeat unitig, the fragment data cannot indicate which edge to follow leaving the repeat. For very simple repeat structures, if there is complete fragment coverage of the target sequence there will be only one path that includes every unique unitig in a single path (see figure 3.14A), but
Figure 3.14 Subsequence from U1 to U3 has two Eulerian tours with the same sequence (given R1 and R2 are identical) (A). Addition of third copy of repeat R makes order of U2 and U3 ambiguous in different Eulerian tours (B). Hamiltonian representation is shown on the left to illustrate increased complexity.
even for a slightly more complicated repeat structure this is no longer true (see figure 3.14B). For the SBH problem, in which all that is known about a short target sequence is the set of k-mers that appear in it, the k-mer graph was originally designed to address this issue. By having the edges represent k-mers and the vertices represent (k−1)-mers (overlaps of k-mers) instead of the reverse, the desired solution becomes a path that uses every edge exactly once (an Eulerian path) rather than a path that uses every vertex exactly once (a Hamiltonian path) [48] (see figure 3.14). In general a Hamiltonian path takes exponential time to compute. This means that in practice only target sequences with no duplicated (k−1)-mers can be solved using the Hamiltonian path approach. With no (k−1)-mer duplications all vertices have a maximum of one outgoing and one incoming edge, which makes the determination of the Hamiltonian path trivial [49]. An Eulerian path is efficient to compute (linear in the number of edges) if one exists, and as mentioned above only a single Eulerian path is possible for very simple repeat structures. For short, random target sequences (length 200) and k = 8, a simple Hamiltonian path (maximum degree ≤ 1) was found in 46% of test cases whereas a single Eulerian path was found 94% of the time [48]. Unfortunately for large target sequences that have complex repeat structures, the number of Eulerian paths quickly becomes exponential. So the Eulerian formulation provides no efficiency in solving the layout problem. The equivalent graph transformations approach for the k-mer unitig graph is, however, very useful for simplifying the k-mer unitig graph. This approach has been called the Eulerian Superpath Problem, and it is important to understand that the power of it comes from simplifying the structure of the graph by splitting spanned repeat unitigs and not by any computational advantage of the Eulerian versus Hamiltonian framing of the problem. 
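For concreteness, here is a sketch of the linear-time Eulerian path computation (Hierholzer's algorithm); the edge list is hypothetical, standing in for k-mer edges between (k−1)-mer vertices, and the code assumes an Eulerian path exists.

```python
from collections import defaultdict

def eulerian_path(edges):
    """Hierholzer's algorithm: a path using every edge exactly once,
    found in time linear in the number of edges (assumes one exists)."""
    out, indeg = defaultdict(list), defaultdict(int)
    for u, v in edges:
        out[u].append(v)
        indeg[v] += 1
    start = edges[0][0]
    for u in list(out):          # start where out-degree exceeds in-degree
        if len(out[u]) - indeg[u] == 1:
            start = u
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if out[u]:
            stack.append(out[u].pop())   # consume an unused edge
        else:
            path.append(stack.pop())     # dead end: emit vertex
    return path[::-1]

# repeat R appears twice; the single tour spells U1, R, U2, R, U3
tour = eulerian_path([("U1", "R"), ("R", "U2"), ("U2", "R"), ("R", "U3")])
```

Note that the efficiency is in finding *an* Eulerian path; as the text observes, nothing here helps choose among the exponentially many Eulerian paths of a complex repeat structure.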
Long Identical Repeats
When long, identical repeats are encountered, the solution to the layout problem must use additional information beyond that of the fragment sequences. The most useful and easily obtained auxiliary information used by shotgun fragment assemblers is mate pair data. Mate pair data is obtained when sequence fragments are generated from both ends of a cloned piece of DNA (see figures 3.1, 3.2, and 3.5). Because of this, the method is referred to as double-barreled sequencing, double-ended sequencing, or pairwise-end sequencing. Generally, the DNA is inserted into a sequencing vector with universal sequencing primers at either end of the inserted DNA to produce two fragments. The inserted DNA must be double-stranded, but current sequencing technology can process only a single strand of DNA at a time, from the 5′ end to the 3′ end. This imposes the constraint that the mate pair fragments must be located on opposite strands of the solution. A more useful constraint is the
approximate distance between the 5′ ends of each mate pair. There are standard size selection techniques for creating collections (called libraries) of DNA clones that have an approximately known length distribution. The distribution is often roughly normal (Gaussian) with an approximate mean and variance. The use of mate pairs was first proposed as a method to determine which clones spanned gaps in fragment coverage of the target sequence [50]. The entire clone or just the portion needed to close the gap could then be sequenced to finish the target sequence. This is much more efficient than continuing to generate more random shotgun fragments with diminishing probability that a shotgun fragment would be encountered from the missing portions of the target sequence (the gaps). The clones used to generate the mate pairs are usually larger than the length of the mate pair fragments summed together. The amount of target sequence contained in the clones is larger than the amount contained in the fragments, but the random coverage of the target sequence by the clones, called the clone coverage, follows the same statistical model as the random coverage by the fragments (sequence coverage) [33]. Given sufficient sequence coverage to be able to place mate pair fragments into contigs, the probability that a gap in fragment coverage (called a sequencing gap) will be spanned by a mate pair increases with the clone coverage [51]. A mate pair that spans a sequencing gap constrains the orientation (which strand of the target sequence) and the order (before or after) the two contigs (or unitigs) containing the mate pair fragments have with respect to each other (see figure 3.13). The distance between the two contigs (length of the gap) can also be estimated based on the length distribution of the library that the mate pair was sampled from. 
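The gap estimate itself is simple arithmetic; a sketch (function name and numbers are illustrative), assuming the library length distribution is roughly normal and that the "tail" distances of each mate from its contig end are known from the layout:

```python
def estimate_gap(lib_mean, lib_std, tail_a, tail_b, n_pairs=1):
    """Estimate the gap between two scaffolded contigs from mate pairs.
    tail_a: bases from the left mate's 5' end to the end of contig A;
    tail_b: bases from the start of contig B to the right mate's 5' end.
    Averaging n independent pairs shrinks the standard error by sqrt(n)."""
    gap = lib_mean - tail_a - tail_b
    return gap, lib_std / n_pairs ** 0.5

# a ~3000 +/- 300 bp library; mates sit 1200 and 1400 bp inside the contigs
gap, err = estimate_gap(3000, 300, 1200, 1400)
```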
In practice, the initial estimate of the size distribution of the library is determined based on the apparent size of the clones as measured by the dispersion on an agarose gel run. A refined estimate can be computed by bootstrapping. The assembler is run to generate contigs or unitigs. The lengths of the clones as implied by the positions of the mate pairs in the contigs can be used to estimate the library size distribution. In the absence of repeats, the fragment unitig graph, the k-mer unitig graph, or any of even the simplest greedy layout algorithms will produce a set of unitigs that terminate at sequencing gaps or the ends of the target sequence. The mate pairs allow the unitigs to be placed into structures called scaffolds (see figures 3.13 and 3.15) [51]. A scaffold is thus an ordered and oriented set of contigs separated by gaps. A mate pair graph that is analogous to the fragment unitig graph can be constructed. Vertices are still the unitigs but edges are now the distances between the unitigs as computed by the mate pairs connecting them. Overlap edges in the fragment unitig graph always have negative distances because the sequence in the overlap is shared in common
Figure 3.15 Mate pair edges between i–j and i–k imply distance and orientation between j and k.
between the two unitigs and the distance (length of overlap) is known precisely. Mate pair edges usually have a positive distance but can have a negative distance (indicating a possibly undetected overlap) and the variance on the distance estimate is much larger. In a target sequence without repeats (a line), the unitigs represent single intervals along the line. Since the mate pair edges constrain the relative distance between these intervals, there is a condition analogous to the triangle condition in the overlap graph: if the distances from unitig i to unitig j and from j to k are known from mate pair edges, then the distance from i to k is known (within some variance) and any mate pair edge between i and k must be consistent with this distance if the unitigs are really single intervals. Exactly as in the overlap graph, chordal edge removal can be performed to leave only the mate pair edges between adjacent unitigs. In contrast to the overlap graph, edges may be missing between adjacent unitigs (intervals), but these edges can often be inferred from edges between nonadjacent unitigs (see figure 3.15). Chordal edge removal in the mate pair graph removes all branching in the unique intervals of the genome but leaves the same problem as the unitig graph with branching still occurring at the repeat unitigs. As described above for both types of unitig graphs, if a fragment spans a short repeat unitig, it crosses the repeat boundary on both sides and connects the two flanking unique unitigs. This makes it possible to merge the two unique unitigs with a copy of the intervening repeat unitig into a single contig that represents the correct local portion of the solution path. This would also require removing any edges from the unique unitig ends that are internal to the newly formed contig and replacing the edges adjacent to the external ends of the unique unitigs with edges to the external ends of the newly formed contig. 
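The triangle condition on mate pair distances can be checked numerically. A sketch (the function name and 3-sigma threshold are illustrative), treating each edge as an independent distance estimate carrying a variance:

```python
def edge_consistent(d_ij, v_ij, d_jk, v_jk, d_ik, v_ik, n_sigma=3.0):
    """Triangle check in the mate pair graph: the distance implied for
    i->k is d_ij + d_jk with variance v_ij + v_jk (independent
    estimates); a direct i->k edge must agree within n_sigma."""
    implied, var = d_ij + d_jk, v_ij + v_jk
    return abs(d_ik - implied) <= n_sigma * (var + v_ik) ** 0.5

# distances 1000 and 2000 (std 100 each) imply i->k is ~3000
ok = edge_consistent(1000, 100**2, 2000, 100**2, 3100, 100**2)
```

A direct i→k edge that passes this check can be removed as a chord; one that fails signals that the unitigs are not single intervals (or that a mate pair is bad).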
Meanwhile, the repeat unitig and its other edges may still be used for other parts of the solution path. This equivalent graph transformation simplifies the graph and would yield the solution if all of the repeats could be spanned. Whereas fragments will not span long repeats, mate pairs can, and a wide array of clone lengths is possible using a number of different cloning vectors and methods. In contrast to a fragment that spans a short repeat between unique unitigs, which gives a path through the unitig graph and determines the sequence, a mate pair edge provides a distance estimate but no path.
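One way to exploit such a distance estimate is to enumerate the paths through the unitig graph whose summed repeat-unitig lengths agree with it. A sketch (graph, lengths, and tolerance all invented) that prunes a depth-first search once a branch grows longer than the estimate allows:

```python
def consistent_paths(succ, length, start, goal, dist, slack):
    """Depth-first enumeration of start->goal paths whose summed lengths
    of intervening unitigs match a mate pair distance estimate within
    +/- slack; branches are pruned once they grow too long."""
    found, stack = [], [(start, [start], 0)]
    while stack:
        u, path, d = stack.pop()
        if u == goal:
            if abs(d - dist) <= slack:
                found.append(path)
            continue
        if d - dist > slack:     # already longer than the estimate allows
            continue
        for v in succ.get(u, []):
            step = length[v] if v != goal else 0
            stack.append((v, path + [v], d + step))
    return found

# unique unitigs U1, U2 flank repeat R, which has two exit paths A and B
succ = {"U1": ["R"], "R": ["A", "B"], "A": ["U2"], "B": ["U2"]}
length = {"R": 500, "A": 300, "B": 1200}
paths = consistent_paths(succ, length, "U1", "U2", 800, 100)
```

With a distance estimate of about 800 bases, only the path through A (500 + 300) survives; the path through B is pruned as inconsistent.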
If only a single path exists in the unitig graph and it is consistent with the mate pair edge distance, then it is likely to be correct. There is often only one consistent path for bacterial genomes [46] but not for more complex genomes [45]. Another approach is to require that every unitig on a consistent path between the unique unitigs also have a consistent mate pair edge to the flanking unique unitigs [52]. Spanning and placing the repeat unitigs between mate-pair-connected unique unitigs greatly simplifies the unitig graph. Unfortunately, imperfect labeling of unique unitigs and other imperfections of the data ultimately lead most shotgun fragment assemblers to resort to some form of greedy heuristic, such as using mate pair edges with the largest number of supporting mate pairs in the presence of conflicting edges, to generate a final solution.

Mate Pair Validation
Mate pairs are a powerful tool for validating that the layout phase has produced a correct solution. Even though more recent assemblers use mate pairs in some fashion to guide the layout phase [24,40,42,53] of the assembly, mistakes are still made due to data and software imperfections. Patterns of unsatisfied mate pairs can identify many of these mistakes. A mate pair is satisfied if the two fragments from opposite ends and strands of the same clone appear on opposite strands and at a distance consistent with the clone size distribution (see figure 3.16A and B).
Figure 3.16 End reads of a clone as in target sequence (A), correctly assembled— satisfied (B), too far apart—stretched (C), too close together—compressed (D), misoriented to the right—normal (E), misoriented to the left—antinormal (F), and misoriented away from each other—outtie (G).
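The satisfaction test and the misorientation categories of figure 3.16 can be sketched as a classifier. This assumes the common conventions (positions are 5′ coordinates, strands are '+' or '−', a satisfied pair points inward on opposite strands); the function name and thresholds are illustrative.

```python
def classify_mate_pair(pos_f, strand_f, pos_r, strand_r, mean, std, n_sigma=3):
    """Classify a mate pair placed in one contig (cf. figure 3.16).
    A satisfied pair lies on opposite strands, points inward, and its
    5' separation is within n_sigma std devs of the library mean."""
    left, right = (pos_f, strand_f), (pos_r, strand_r)
    if left[0] > right[0]:
        left, right = right, left
    if left[1] == right[1]:                       # misoriented, same strand
        return "normal" if left[1] == "+" else "antinormal"
    if left[1] == "-":                            # pointing away from each other
        return "outtie"
    d = right[0] - left[0]
    if d > mean + n_sigma * std:
        return "stretched"
    if d < mean - n_sigma * std:
        return "compressed"
    return "satisfied"
```

Counting unsatisfied pairs per region, as described next, is what localizes bad junctions.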
A mate pair that fails either of these two conditions is called unsatisfied. If two nonadjacent intervals of the target sequence have been inappropriately joined in the layout solution, this creates a bad junction, and mate pairs that would have spanned the correct junction will be unsatisfied (see figure 3.16C–G). Given deep clone coverage, most bad junctions will lead to multiple unsatisfied mate pairs. Each unsatisfied mated fragment defines an interval within which its mate should have appeared, implying that there is a bad junction somewhere within this interval. The intersection of these overlapping intervals can then be used to narrowly bound the region in which the bad junction must occur. This kind of analysis has been used to identify bad junctions in target sequence reconstructions [52,54]. In addition, these bad junctions often have co-occurring overlap signatures: a short or low-quality overlap or a layout that is only one sequence deep at that point (see chimeric fragments below). Some assemblers make use of these unsatisfied mate pair patterns to break and rejoin layouts, particularly when they coincide with a weak overlap pattern or chimeric fragment [42,53,55].

Chimeric Fragments
A chimeric clone is produced during the process of creating clone libraries when two pieces of fractured DNA fuse together. This fused piece of DNA no longer corresponds to a contiguous interval of the target sequence. If a fragment sequenced from a chimeric clone is long enough to cross the fused or chimeric junction, then the resulting fragment is called a chimeric fragment. Incorporating chimeric fragments into a layout would result in an incorrect reconstruction of the target sequence. Chimeric fragments tend to share a characteristic pattern of overlaps with nonchimeric fragments. Overlaps that are short enough not to reach (or that go just slightly beyond) the chimeric junction will be found; but, barring the unlikely event of another nearly identical chimeric fragment, there should be no overlaps with the chimeric fragment that cross the chimeric junction by more than a few bases. What distinguishes this pattern from a low coverage region is that fragments will exist that overlap with the fragments overlapping both ends of the chimeric fragment causing a branching in the unitig graph. This pattern is easy to detect after the overlap phase [35,56] and is incorporated by most assemblers by discarding the chimeric fragments before the layout phase. A chimeric fragment can also be recognized and discarded during the layout phase based on the unitig graph pattern (see figure 3.17A) in which a unitig composed of a single fragment causes branching in two intervals of the unitig graph that would otherwise be unbranched. Unfortunately, the same overlap pattern, or equivalent unitig graph pattern, can be induced by a pair of two-copy repeats in close proximity. To compensate for this we can use the previously described techniques
Figure 3.17 Unitig graph pattern of a chimeric single-read unitig U3 (A), spur fragment U3 (B), and polymorphic unitigs U2 and U3 (C).
to determine the likely multiplicity of the two unitigs with edges to the apparently chimeric fragment. If both unitigs appear to have multiplicity two, then the fragment should be retained.

Chimeric Mate Pairs
A chimeric mate pair occurs when the mated fragments from a single clone do not meet the clone constraints (opposite strand and expected distance) when placed on the target sequence. This can occur in at least two basic ways: the clone is chimeric as above, or fragments from different clones are mislabeled as being mated. Before capillary sequencing machines, parallel sequencing lanes on an agarose gel were often mistracked, associating the wrong fragment with a clone. Even after capillary sequencing, sequencing plates used in sequencing machines can be rotated or mislabeled, associating fragments with the wrong clones. Undoubtedly, there are and will continue to be other clever ways by which lab techniques or software misassociate fragments and clones. For this reason, most assemblers do not consider any single, uncorroborated mate pair to be reliable. Most assemblers will only use mate pair edges (as discussed above) if they are supported by at least two mate pairs. For large genomes the chance that any two chimeric mate pairs will support the same mate pair edge is small [45]. For the same reason, bad junction detection based on unsatisfied mate pairs also sets a threshold of the intersection of at least two unsatisfied intervals.

Spur Fragments
Spur fragments (also called dead-end fragments [24]) are fragments whose sequence on one end does not overlap any other fragment.
Of course this is true of fragments on the boundary of a sequencing gap, so an additional criterion that the fragment cause a branching in the unitig graph is also needed to define a spur fragment. The spur pattern in the unitig graph is similar to the chimeric pattern where a single fragment unitig causes a branching in the unitig graph that would otherwise not occur (see figure 3.17B). Spur fragments can arise from undetected low-quality or artifactual sequence generated by the sequencing process that has not been trimmed, or from vector sequence that is undetected and therefore untrimmed. Spur fragments can also result from chimeric junctions close enough to a fragment end that overlaps with the short chimeric portion of the fragment will not be detected. Spur fragments are easy to detect using overlap or unitig graph patterns and can then be discarded. As with chimeric fragments, there are conditions under which a spur pattern can occur even though the spur fragment accurately reflects the target sequence. This can happen if, for instance, a sequencing gap exists in the unique flanking region near a repeat boundary of a two-copy repeat and the fragment coverage is low (single fragment). If the unitig at the branch point caused by the spur appears to be multiplicity two, the spur should probably be retained.

Vector and Quality Trimming
Current sequencing technology usually requires that some known DNA sequence be attached to both ends of each randomly sheared piece of the target sequence (often a plasmid cloning vector). Part of this so-called vector sequence is almost always included at the beginning, or 5′ end, of a fragment as part of the sequencing process. If the sheared piece of target sequence is short, the sequencing process can run into vector sequence at the 3′ end of a fragment (see figure 3.1C). Although today the majority if not the entire length of most fragment sequences is of very high quality, both the beginning and end of the sequence are sometimes of such low quality that overlaps cannot be detected in these regions, even with error correction. The vector and low-quality regions of fragments do not reflect the underlying target sequence and can cause overlaps to be missed. A preprocess called vector and quality trimming is performed before the overlap phase in most assemblers to attempt to detect and remove these regions from the fragments. The vector can be detected using standard sequence alignment techniques that are only complicated in two cases: the sequence quality is low, which can be addressed by quality trimming, or the sequencing vector is very short so that a significant alignment (greater than random chance) does not exist. The latter can be addressed by trimming off any short alignment at the potential cost of trimming off a few nonvector base pairs. Quality trimming is usually based on the quality values (error estimates) for the base calls. Using these error estimates, a maximum number of expected errors per
fixed window length are allowed and the maximum contiguous set of these windows (intervals) is retained as the quality-trimmed fragment. This trimming is usually somewhat conservative and so a complementary method using the overlap machinery is sometimes employed [35,53]. Instead of insisting that an overlap extend all the way to the end of both trimmed fragments, high-quality alignments that terminate before the end of untrimmed fragments can be considered. If the alignment is significant, then it is due to the fragments being from the same interval of the target sequence or sharing a repeat in the target sequence. In either case the aligned portion of the fragment is not likely to be low-quality sequence and can be used to extend the quality-value-based trimming.

Polymorphic Target Sequences
If a clonal target sequence is asexually reproduced DNA, a single version of the target sequence is copied with little or no error, and we can conceptually think of each random shotgun fragment as having been sampled from a single target sequence. Unfortunately when the copies of the target DNA to be sheared are acquired from multiple individuals or even a single individual with two copies of each chromosome (one from each parent), this assumption is incorrect and we must allow for variance between the different copies of the target sequence. If the variance between copies is very low (say a single base pair difference per 1000), then the overlap and layout phases are unlikely to be impacted. A rate of variance that is well within the sequencing error rate (or corrected error rate) will not prevent any overlaps from being discovered. Even the most aggressive repeat separation strategies require at least two differences between fragments for separation, so variance with at most one difference per fragment length will not affect the layout phase. Unfortunately, polymorphic variance is often significantly greater than sequencing error. If the polymorphic variation in all intervals of a two-haplotype target sequence exceeds the sequencing error variation, the problem would be the same as assembling a target sequence that was twice as long as expected, since we could easily separate the two haplotypes. Polymorphic variation more often varies from quite low to high from region to region within the target sequence. The low-variance unique regions end up in a single unitig whereas the high-variance unique regions get split into multiple unitigs (two in the case of two haplotypes). This complicates the branching in the unitig graph and makes it more difficult to determine unitig multiplicities based on the branching structure. 
In some cases a polymorphic branching pattern within a unique region of the target sequence can be recognized and collapsed into a single unitig [57]. A common polymorphic pattern called a bubble occurs when unitig U1 branches out to unitigs U2 and U3 which then converge
back into unitig U4 (see figure 3.17C). There are two possibilities in the underlying target sequence to account for the bubble: unitigs U1 and U4 are unique unitigs and unitigs U2 and U3 are polymorphic haplotypes of the same unique region between U1 and U4, or unitigs U1 and U4 are both repeats and unitigs U2 and U3 are different intervals of the target between copies of U1 and U4. These two cases can often be distinguished by the depth of coverage of the unitigs U1, U2, U3, and U4.

CONSENSUS PHASE
The layout phase determines the order, orientation (strand), and amount of overlap between the fragments. The consensus phase determines the most likely reconstruction of the target sequence, usually called the consensus sequence, which is consistent with the layout of the fragments. As we discussed above, an overlap between fragments i and j, which defines a pairwise alignment, and an overlap between fragments j and k, when taken together create a multiple sequence alignment between fragments i, j, and k (see figures 3.7, 3.9, and 3.18). In general, the pairwise alignments between adjacent fragments in the layout can be used to create a multiple sequence alignment of all of the fragments. At each position, or column, of the multiple sequence alignment, different base calls or gaps inserted within a fragment for alignment may be present for each of the fragments that span that position of the target sequence. Which target sequence base pair (or gap in the absence of a base pair) is most likely to have resulted in the base calls seen in the column? In the absence of information other than the base calls and when the accuracy of the fragments is high, a simple majority voting algorithm works well. With quality values available as error estimates for the base calls, a quality value weighted voting improves the result. A Bayesian estimate which can also incorporate the a priori base pair composition propensities can be used to make the base call and provide an error estimate (quality value) for the consensus base call [22]. If the target sequence copies are polymorphic, the same Bayesian model can be used to assign probabilities that a column reflects sampling from a polymorphic position in the target sequence. The difficulty in generating the best consensus sequence does not lie in calling a base for a given column but in generating optimal columns in the multiple sequence alignment. Optimal multiple sequence alignments have an entire literature of their own. 
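The quality-value weighted vote for a single column can be sketched as follows (the function name is illustrative, and Phred-style quality values are assumed, so a quality q call is correct with probability 1 − 10^(−q/10)):

```python
def call_consensus_column(calls):
    """Quality-weighted vote for one column of the multiple alignment.
    calls: list of (base, q) with q a Phred-style quality value, so the
    probability the call is correct is 1 - 10**(-q/10); '-' is a gap."""
    weight = {}
    for base, q in calls:
        weight[base] = weight.get(base, 0.0) + 1.0 - 10 ** (-q / 10)
    return max(weight, key=weight.get)

# two high-quality A's outweigh three very low-quality C's
col = [("A", 40), ("A", 35), ("C", 3), ("C", 3), ("C", 3)]
```

Note how the weighted vote can overturn a simple majority when the minority calls carry much higher quality; the Bayesian formulation of [22] refines this further with base composition priors and yields a consensus quality value.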
Dynamic programming is generally considered to give optimal pairwise alignments in a reasonable amount of computation (at least for short sequences) needing time proportional to the length of the fragments involved squared. Dynamic programming alignment can be easily extended to multiple sequence alignment but takes time proportional to the length of the fragments raised to the number of fragments to be aligned, which is impractical.
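The quadratic pairwise case is the standard Needleman–Wunsch recurrence; a minimal sketch (scoring parameters chosen arbitrarily) that returns only the optimal global alignment score:

```python
def align_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score in O(len(a)*len(b)) time;
    dp[i][j] is the best score aligning a[:i] against b[:j]."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = i * gap
    for j in range(1, len(b) + 1):
        dp[0][j] = j * gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                           dp[i - 1][j] + gap,        # gap in b
                           dp[i][j - 1] + gap)        # gap in a
    return dp[len(a)][len(b)]
```

Extending this table to N sequences multiplies a dimension per sequence, which is exactly the exponential blow-up that makes direct multiple sequence dynamic programming impractical.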
The most common practical approach is to determine a likely optimal order of pairwise alignments to perform to create a multiple sequence alignment. The order of pairwise alignments is usually determined by maximal pairwise similarity. After a pairwise alignment the sequences are merged either into a consensus sequence or a profile representing each column as a weighted vector rather than as a consensus call. For shotgun fragment assembly the fragments are so similar (except if larger polymorphisms have been collapsed together into a single unitig) that the order of pairwise alignment and merged sequence representation has little impact. As mentioned above, the obvious choice is just to proceed from the first fragment in a contig and use the pairwise alignment with the adjacent fragment until the last fragment is reached. There is one glaring shortcoming resulting from this approach where gaps in adjacent columns are not optimally aligned (see figure 3.18). Alignment B is better because fewer sequencing errors are needed to explain it and sequencing errors are rare events. Two different methods have been developed to refine the initial multiple sequence alignment to correct this problem. The first removes fragments one at a time from the initial multiple sequence alignment and then realigns each fragment to the new profile of the multiple sequence alignment resulting from its removal [58]. This process iterates until the multiple sequence alignment score stops improving or a time limit is reached. The second method first finds some small number of consecutive columns, say six, which have no internal differences (all base calls in a column are the same with no gaps). These anchoring columns are unlikely to be in error and even more unlikely to be improved by any local multiple sequence alignment refinement technique. 
The abacus method, so called because gaps are shifted like beads on an abacus, then tries to reposition the gaps between anchors so that the gaps are concentrated in fewer columns [45]. Neither method always produces optimal results, but both methods produce significantly improved results over unrefined alignments.
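The core abacus move can be illustrated with a toy sketch, valid only in the easy case where every row in the window between anchors spells the same bases once gaps are stripped (as in figure 3.18); the real method is considerably more careful about rows that differ.

```python
def shift_gaps_right(rows):
    """Toy 'abacus' move on a window between anchor columns: slide each
    row's gaps rightward so they pile up in as few columns as possible.
    Only safe when all rows agree base-for-base once gaps are removed."""
    width = len(rows[0])
    return [r.replace("-", "") + "-" * (width - len(r.replace("-", "")))
            for r in rows]

# gaps scattered over three columns collapse into one shared gap column
window = ["AC-GT", "A-CGT", "ACG-T"]
```

After the shift, every column is unanimous, so the alignment implies fewer sequencing errors, which is precisely the criterion by which figure 3.18B is better than 3.18A.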
Figure 3.18 Nonoptimal multiple sequence alignment (A) and optimal alignment (B).
Shotgun Fragment Assembly
111
An entirely different approach to consensus avoids the gap alignment optimization problem by using only a single, high-quality base from one fragment instead of letting bases from all fragments in the column vote in calling a given consensus base [35]. The quality values indicate that there are likely to be very few errors in each interval. A transition region where two fragments’ base calls match exactly is chosen to switch from one high-quality fragment to the next. If desired, this consensus sequence can just be the starting point for the consensus approaches discussed above. First all of the fragments would be aligned pairwise against the consensus sequence and then either or both of the above refinements could be performed.

PAST AND FUTURE
The first whole genome shotgun assembly was performed with a great deal of manual intervention to determine the 48,502 base pair genome of the bacteriophage lambda virus [59,60]. As larger genomic clones and genomes were shotgun sequenced, more automated methods were developed for assembling them. Many felt there was a limit to the genome size or the complexity of the genomic content that whole genome shotgun assemblers could be designed to handle. Of course, at the extreme, one can imagine genomes such as 10 million base pairs of a single nucleotide, say A, with only a smattering of C, T, or G nucleotides intermingled, where there is no hope of using whole genome shotgun assembly or any other current assembly method. The interesting question becomes: for genomes we wish to sequence and assemble, can sufficiently sophisticated whole genome shotgun methodologies and assembly algorithms be devised to produce the sequence for these genomes? The frontier has continued to be expanded in the face of skepticism, from 1 million base pair bacteria [61], to 100 million base pair invertebrates [45], to 3 billion base pair mammals [62]. Our belief is that while large strides have been made in the capabilities of whole genome shotgun assembly algorithms, there is much that can still be done to push the frontier further out and at the same time reduce the finishing effort required for genomes within our current capabilities. No single assembler incorporates the most advanced version of the methods discussed above, and approaches to deal with polymorphism, tandem repeats [63], and large segmental duplications [64] are in their infancy.

LITERATURE
The first shotgun assembly programs [1,2] were primarily concerned with finding overlaps and constructing contigs from these overlaps that could be presented to the scientists for confirmation. The target sequences were small enough that a high degree of manual inspection
and correction was acceptable, and any repeat structure was relatively simple. Even at this early stage the tradeoff between sensitivity and specificity in overlap detection was understood. These early programs assumed that any significant overlap was likely to be real and could be used on a first-come, first-served basis to construct contigs. Any mistakes could be corrected with manual intervention. Shotgun fragment assembly was quickly posed as a mathematical problem, Shortest Common Superstring (SCS), with a well-defined criterion, the length of the superstring, to be optimized. A simple greedy approach of merging fragments with the longest remaining overlap was proposed as a solution, and bounds on its performance were proven and later improved [5–13]. This new approach was then put into practice, and violations of the triangle condition were recognized as indications of repeat boundaries that could not be solved using this simple approach [14]. The next wave of fragment assemblers began arriving almost a decade later with CAP [23], which has continued to be improved over time with CAP2 [56], CAP3 [53], and PCAP [27]. CAP used improved sensitivity and specificity measures but, more importantly, introduced the first version of sequence error correction based on overlap alignments. CAP2 recognized repeat contigs based on triangle condition violations at repeat boundaries and attempted to separate the copies of these repeats based on small differences (dividing columns) between different copies. CAP2 also used an elegant chimeric fragment detection algorithm. CAP3 introduced the use of unsatisfied mate pairs to break and repair incorrect layouts. Phrap [35] used base call quality values generated by Phred [26] for much better error correction and repeat separation. Work on distinguishing columns perhaps has the most promise for repeat separation [36–39]. Others had also recognized the value of quality values [65,66].
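The greedy merge heuristic for SCS described above can be sketched in a few lines of Python (a hypothetical toy assuming error-free fragments and ignoring contained fragments; production assemblers find overlaps far more efficiently than this quadratic search):

```python
def overlap(a, b):
    # Length of the longest suffix of a that equals a prefix of b.
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def greedy_superstring(fragments):
    # Repeatedly merge the pair with the longest remaining overlap,
    # the classic greedy heuristic for Shortest Common Superstring.
    frags = list(fragments)
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i in range(len(frags)):
            for j in range(len(frags)):
                if i != j:
                    k = overlap(frags[i], frags[j])
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for n, f in enumerate(frags) if n not in (best_i, best_j)]
        frags.append(merged)
    return frags[0]
```

A repeat longer than a fragment defeats this heuristic precisely because the longest-overlap merge collapses the repeat copies, which is the triangle condition failure noted above.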
Another method for detecting repeat boundaries was based on determining cliques in the overlap graph [67], which would share fragments with adjoining cliques. If there was more than one adjoining clique on each end, a repeat boundary was present. All of these approaches to finding repeat boundaries as violations of the triangle condition were made explicit in the transitive, or chordal, graph reduction approach [4], which removes overlap edges from the overlap graph until only branching edges due to repeat boundaries are left. TIGR Assembler made the first use of mate pairs to guide the layout phase of the assembly [40,61]. Other assemblers improved the efficiency of some stages of assembly [58,68–73]. A nice formalization of several phases of fragment assembly is presented in [21], but the branch and bound algorithms presented are only practical for target sequences with low repeat complexity. Genetic algorithm and simulated annealing approaches for searching the space of good layouts can outperform the simple greedy heuristic for target sequences with a few repeats, but the search does not scale for complex repeat structures [74–76].
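The transitive reduction mentioned above can be sketched over an abstract directed overlap graph (hypothetical code; the published reduction also takes overlap lengths and fragment orientations into account):

```python
def transitive_reduction(edges):
    """Drop an overlap edge a->c whenever edges a->b and b->c also exist:
    the a-c overlap is implied by the chain through b. Nodes that still
    have branching edges afterward mark repeat boundaries."""
    successors = {}
    for a, b in edges:
        successors.setdefault(a, set()).add(b)
    keep = set(edges)
    for a, c in edges:
        for b in successors.get(a, ()):
            if b != c and c in successors.get(b, ()):
                keep.discard((a, c))
                break
    return keep
```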
A very different approach, the k-mer graph, was also developed in this intermediary time frame [28] and then expanded recently [25,34,46,77]. A new set of fragment assemblers have recently been developed that build on previous work and can scale to mammalian-size genomes [24,27,41–43,45,55]. The results of one of these assemblers for Drosophila [78,79] and human [62] have been compared to finished versions of these genomes. Different genome sequencing strategies have been debated [50,51,80–83]. The impact of a new sequencing technology that produces short reads at low cost, allowing for deep coverage, has been evaluated for the k-mer graph approach [84]. We should like to recommend a few general supplementary texts for the interested reader [3,85,86].

ACKNOWLEDGMENTS

We should like to thank all of those mentioned in this chapter or inadvertently overlooked who have worked on and contributed directly or indirectly to the problem of shotgun fragment assembly. We should like to recognize all of our colleagues, who are too numerous to list here, who have worked either directly with us on shotgun fragment assembly or more generally on whole genome shotgun sequencing, for their efforts and encouragement. A special thanks goes to the group of people who helped design and build the Celera Assembler but, more importantly, made it a joy to come to work: Eric Anson, Vineet Bafna, Randall Bolanos, Hui-Hsien Chou, Art Delcher, Nathan Edwards, Dan Fasulo, Mike Flanigan, Liliana Florea, Bjarni Halldorsson, Sridhar Hannenhalli, Aaron Halpern, Merissa Henry, Daniel Huson, Saul Kravitz, Zhongwu Lai, Ross Lippert, Stephano Lonardi, Jason Miller, Clark Mobarry, Laurent Mouchard, Gene Myers, Michelle Perry, Knut Reinert, Karin Remington, Hagit Shatkay, Russell Turner, Brian Walenz, and Shibu Yooseph. Finally, we owe a debt of thanks to Mark Adams, Mike Hunkapiller, and Craig Venter for providing the opportunity and data to work on the Drosophila, human, and many other exciting genomes.
REFERENCES

1. Gingeras, T., J. Milazzo, D. Sciaky and R. Roberts. Computer programs for the assembly of DNA sequences. Nucleic Acids Research, 7(2):529–45, 1979.
2. Staden, R. Automation of the computer handling of gel reading data produced by the shotgun method of DNA sequencing. Nucleic Acids Research, 10(15):4731–51, 1982.
3. Setubal, J. and J. Meidanis. Introduction to Computational Molecular Biology (pp. 105–42). PWS Publishing Company, Boston, 1997.
4. Myers, E. Toward simplifying and accurately formulating fragment assembly. Journal of Computational Biology, 2(2):275–90, 1995.
5. Tarhio, J. and E. Ukkonen. A greedy approximation algorithm for constructing shortest common superstrings. Theoretical Computer Science, 57(1):131–45, 1988.
6. Turner, J. Approximation algorithms for the shortest common superstring. Information and Computation, 83(1):1–20, 1989.
7. Gallant, J., D. Maier and J. Storer. On finding minimal length superstrings. Journal of Computer and System Sciences, 20:50–8, 1980.
8. Gallant, J. The complexity of the overlap method for sequencing biopolymers. Journal of Theoretical Biology, 101(1):1–17, 1983.
9. Blum, A., T. Jiang, M. Li, J. Tromp and M. Yannakakis. Linear approximation of shortest superstrings. Proceedings of the 23rd ACM Symposium on Theory of Computing, 328–36, 1991.
10. Blum, A., T. Jiang, M. Li, J. Tromp and M. Yannakakis. Linear approximation of shortest superstrings. Journal of the ACM, 41:634–47, 1994.
11. Armen, C. and C. Stein. A 2.75 approximation algorithm for the shortest superstring problem. Technical Report PCS-TR94-214, Department of Computer Science, Dartmouth College, Hanover, N.H., 1994.
12. Armen, C. and C. Stein. A 2 2/3-approximation algorithm for the shortest superstring problem. Combinatorial Pattern Matching, 87–101, 1996.
13. Kosaraju, R., J. Park and C. Stein. Long tours and short superstrings. Proceedings of the 35th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 166–77, 1994.
14. Peltola, H., H. Söderlund, J. Tarhio and E. Ukkonen. Algorithms for some string matching problems arising in molecular genetics. Proceedings of the 9th IFIP World Computer Congress, 59–64, 1983.
15. Peltola, H., H. Söderlund and E. Ukkonen. SEQAID: a DNA sequence assembling program based on a mathematical model. Nucleic Acids Research, 12(1 Pt 1):307–21, 1984.
16. Needleman, S. and C. Wunsch. A general method applicable to the search for similarities in the amino acid sequences of two proteins. Journal of Molecular Biology, 48(3):443–53, 1970.
17. Smith, T. and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–7, 1981.
18. Sellers, P. The theory and computation of evolutionary distances: pattern recognition. Journal of Algorithms, 1:359–73, 1980.
19. Sankoff, D. and J. Kruskal. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, Mass., 1983.
20. Myers, E. Incremental Alignment Algorithms and Their Applications. Technical Report TR 86-2, Department of Computer Science, University of Arizona, Tucson, 1986.
21. Kececioglu, J. and E. Myers. Combinatorial algorithms for DNA sequence assembly. Algorithmica, 13(1/2):7–51, 1995.
22. Churchill, G. and M. Waterman. The accuracy of DNA sequences: estimating sequence quality. Genomics, 14(1):89–98, 1992.
23. Huang, X. A contig assembly program based on sensitive detection of fragment overlaps. Genomics, 14(1):18–25, 1992.
24. Batzoglou, S., D. Jaffe, K. Stanley, J. Butler, S. Gnerre, et al. ARACHNE: a whole genome shotgun assembler. Genome Research, 12:177–89, 2002.
25. Pevzner, P., H. Tang and M. Waterman. An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences USA, 98(17):9748–53, 2001.
26. Ewing, B. and P. Green. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research, 8(3):186–94, 1998.
27. Huang, X. and J. Wang. PCAP: a whole-genome assembly program. Genome Research, 13(9):2164–70, 2003.
28. Idury, R. and M. Waterman. A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2):291–306, 1995.
29. Drmanac, R., I. Labat, I. Brukner and R. Crkvenjakov. Sequencing of megabase plus DNA by hybridization: theory of the method. Genomics, 4(2):114–28, 1989.
30. Drmanac, R., I. Labat and R. Crkvenjakov. An algorithm for the DNA sequence generation from k-tuple word contents of the minimal number of random fragments. Journal of Biomolecular Structure and Dynamics, 8(5):1085–1102, 1991.
31. Golumbic, M. Algorithmic Graph Theory and Perfect Graphs. Academic Press, London, 1980.
32. Fishburn, P. Interval Orders and Interval Graphs: A Study of Partially Ordered Sets. Wiley, New York, 1985.
33. Lander, E. and M. Waterman. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics, 2(3):231–9, 1988.
34. Pevzner, P., H. Tang and G. Tesler. De novo repeat classification and fragment assembly. Genome Research, 14(9):1786–96, 2004.
35. Green, P. PHRAP documentation. http://www.phrap.org, 1994.
36. Kececioglu, J. and J. Yu. Separating repeats in DNA sequence assembly. Proceedings of the 5th ACM Conference on Computational Molecular Biology, 176–83, 2001.
37. Roberts, M., B. Hunt, J. Yorke, R. Bolanos and A. Delcher. A preprocessor for shotgun assembly of large genomes. Journal of Computational Biology, 11(4):734–52, 2004.
38. Tammi, M., E. Arner, T. Britton and B. Andersson. Separation of nearly identical repeats in shotgun assemblies using defined nucleotide positions, DNPs. Bioinformatics, 18(3):379–88, 2002.
39. Tammi, M., E. Arner, E. Kindlund and B. Andersson. Correcting errors in shotgun sequences. Nucleic Acids Research, 31(15):4663–72, 2003.
40. Sutton, G., O. White, M. Adams and A. Kerlavage. TIGR Assembler: a new tool for assembling large shotgun sequencing projects. Genome Science and Technology, 1:9–19, 1995.
41. Wang, J., G. Wong, P. Ni, Y. Han, X. Huang, et al. RePS: a sequence assembler that masks exact repeats identified from the shotgun data. Genome Research, 12(5):824–31, 2002.
42. Mullikin, J. and Z. Ning. The Phusion assembler. Genome Research, 13(1):81–90, 2003.
43. Havlak, P., R. Chen, K. Durbin, A. Egan, Y. Ren, et al. The Atlas genome assembly system. Genome Research, 14(4):721–32, 2004.
44. Roach, J. Random subcloning. Genome Research, 5(5):464–73, 1995.
45. Myers, E., G. Sutton, A. Delcher, I. Dew, D. Fasulo, et al. A whole-genome assembly of Drosophila. Science, 287(5461):2196–204, 2000.
46. Pevzner, P. and H. Tang. Fragment assembly with double-barreled data. Bioinformatics, 17(Suppl 1):S225–33, 2001.
47. Grotschel, M., L. Lovasz and A. Schrijver. Geometric Algorithms and Combinatorial Optimization. Springer-Verlag, Berlin, 1993.
48. Pevzner, P. l-tuple DNA sequencing: computer analysis. Journal of Biomolecular Structure and Dynamics, 7(1):63–73, 1989.
49. Lysov, Y., V. Florentiev, A. Khorlin, K. Khrapko, V. Shik and A. Mirzabekov. DNA sequencing by hybridization with oligonucleotides. Dokl. Academy of Sciences USSR, 303:1508–11, 1988.
50. Edwards, A. and C. Caskey. Closure strategies for random DNA sequencing. Methods: A Companion to Methods in Enzymology, 3:41–7, 1990.
51. Roach, J., C. Boysen, K. Wang and L. Hood. Pairwise end sequencing: a unified approach to genomic mapping and sequencing. Genomics, 26(2):345–53, 1995.
52. Venter, J., M. Adams, E. Myers, P. Li, R. Mural and G. Sutton. The sequence of the human genome. Science, 291(5507):1304–51, 2001.
53. Huang, X. and A. Madan. CAP3: a DNA sequence assembly program. Genome Research, 9(9):868–77, 1999.
54. Huson, D., A. Halpern, Z. Lai, E. Myers, K. Reinert and G. Sutton. Comparing assemblies using fragments and mate-pairs. Proceedings of the 1st Workshop on Algorithms in Bioinformatics, WABI-01:294–306, 2001.
55. Jaffe, D., J. Butler, S. Gnerre, E. Mauceli, K. Lindblad-Toh, et al. Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Research, 13(1):91–6, 2003.
56. Huang, X. An improved sequence assembly program. Genomics, 33(1):21–31, 1996.
57. Fasulo, D., A. Halpern, I. Dew and C. Mobarry. Efficiently detecting polymorphisms during the fragment assembly process. Bioinformatics, 18(Suppl 1):S294–302, 2002.
58. Anson, E. and E. Myers. ReAligner: a program for refining DNA sequence multi-alignments. Journal of Computational Biology, 4(3):369–83, 1997.
59. Sanger, F., A. Coulson, G. Hong, D. Hill and G. Petersen. Nucleotide sequence of bacteriophage λ DNA. Journal of Molecular Biology, 162(4):729–73, 1982.
60. Staden, R. A new computer method for the storage and manipulation of DNA gel reading data. Nucleic Acids Research, 8(16):3673–94, 1980.
61. Fleischmann, R., M. Adams, O. White, R. Clayton, E. Kirkness, et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 269(5253):496–512, 1995.
62. Istrail, S., G. Sutton, L. Florea, A. Halpern, et al. Whole-genome shotgun assembly and comparison of human genome assemblies. Proceedings of the National Academy of Sciences USA, 101(7):1916–21, 2004.
63. Tammi, M., E. Arner and B. Andersson. TRAP: Tandem Repeat Assembly Program produces improved shotgun assemblies of repetitive sequences. Computer Methods and Programs in Biomedicine, 70(1):47–59, 2003.
64. Eichler, E. Masquerading repeats: paralogous pitfalls of the human genome. Genome Research, 8(8):758–62, 1998.
65. Lawrence, E. and V. Solovyev. Assignment of position-specific error probability to primary DNA sequence data. Nucleic Acids Research, 22(7):1272–80, 1994.
66. Bonfield, J. and R. Staden. The application of numerical estimates of base calling accuracy to DNA sequencing projects. Nucleic Acids Research, 23:1406–10, 1995.
67. Gleizes, A. and A. Henaut. A global approach for contig construction. Computer Applications in the Biosciences, 10(4):401–8, 1994.
68. Kim, S. and A. Segre. AMASS: a structured pattern matching approach to shotgun sequence assembly. Journal of Computational Biology, 6(2):163–86, 1999.
69. Bonfield, J., K. Smith and R. Staden. A new DNA sequence assembly program. Nucleic Acids Research, 23(24):4992–9, 1995.
70. Gryan, G. Faster sequence assembly software for megabase shotgun assemblies. Genome Sequencing and Analysis Conference VI, 1994.
71. Chen, T. and S. Skiena. Trie-based data structures for sequence assembly. 8th Symposium on Combinatorial Pattern Matching, 206–23, 1997.
72. Pop, M., D. Kosack and S. Salzberg. Hierarchical scaffolding with Bambus. Genome Research, 14(1):149–59, 2004.
73. Kosaraju, R. and A. Delcher. Large-scale assembly of DNA strings and space-efficient construction of suffix trees. Proceedings of the 27th ACM Symposium on Theory of Computing, 169–77, 1995.
74. Burks, C., R. Parsons and M. Engle. Integration of competing ancillary assertions in genome assembly. ISMB 1994, 62–9, 1994.
75. Parsons, R., S. Forrest and C. Burks. Genetic algorithms, operators, and DNA fragment assembly. Machine Learning, 21(1–2):11–33, 1995.
76. Parsons, R. and M. Johnson. DNA sequence assembly and genetic algorithms: new results and puzzling insights. Proceedings of Intelligent Systems in Molecular Biology, 3:277–84, 1995.
77. Mulyukov, Z. and P. Pevzner. EULER-PCR: finishing experiments for repeat resolution. Pacific Symposium on Biocomputing 2002, 199–210, 2002.
78. Celniker, S., D. Wheeler, B. Kronmiller, J. Carlson, A. Halpern, et al. Finishing a whole-genome shotgun: release 3 of the Drosophila melanogaster euchromatic genome sequence. Genome Biology, 3(12):1–14, 2002.
79. Hoskins, R., C. Smith, J. Carlson, A. Carvalho, A. Halpern, et al. Heterochromatic sequences in a Drosophila whole-genome shotgun assembly. Genome Biology, 3(12):1–16, 2002.
80. Weber, J. and E. Myers. Human whole-genome shotgun sequencing. Genome Research, 7(5):401–9, 1997.
81. Green, P. Against a whole-genome shotgun. Genome Research, 7(5):410–17, 1997.
82. Anson, E. and E. Myers. Algorithms for whole genome shotgun sequencing. Proceedings of RECOMB’99, 1–9, 1999.
83. Chen, E., D. Schlessinger and J. Kere. Ordered shotgun sequencing, a strategy for integrated mapping and sequencing of YAC clones. Genomics, 17(3):651–6, 1993.
84. Chaisson, M., P. Pevzner and H. Tang. Fragment assembly with short reads. Bioinformatics, 20(13):2067–74, 2004.
85. Myers, E. Advances in sequence assembly. In M. D. Adams, C. Fields and J. C. Venter (Eds.), Automated DNA Sequencing and Analysis (pp. 231–8). Academic Press, London, 1994.
86. Myers, G. Whole-genome DNA sequencing. Computing in Science and Engineering, 1(3):33–43, 1999.
4 Gene Finding

John Besemer & Mark Borodovsky
Recent advances in sequencing technology have created an unprecedented opportunity to obtain the complete genomic sequence of any biological species in a short time and at a reasonable cost. Computational gene-finding approaches allow researchers to quickly transform these texts, strings of millions of nucleotides with little obvious meaning, into priceless annotated books of life. Immediately, researchers can start extracting pieces of fundamental knowledge about the species at hand: translated protein products can be characterized and added to protein families; predicted gene sequences can be used in phylogenetic studies; the order of predicted prokaryotic genes can be compared to other genomes to elucidate operon structures; and so on. While software programs for gene finding have reached a high level of accuracy, especially for prokaryotic genomes, they are not designed to replace human experts in the genome annotation process. The programs are extremely useful tools that greatly reduce the time required for sequence annotation, and the overall quality of annotations is improving as the programs become better. This acceleration is becoming a critical issue, as a recent attempt to sequence microbial populations en masse, rather than individual genomes, produced DNA sequences from over 1800 species, including 148 novel phylotypes [1]. While genes can be found experimentally, these procedures are both time-consuming and costly and are best utilized on small scales. Sets of experimentally validated genes, however, are of the utmost importance to the area of computational gene finding as they provide the most trustworthy sets for testing programs. Currently, such sets are rather few in number ([2] and [3] give two notable examples for Escherichia coli K12; [4] and [5] are well known for Homo sapiens and Arabidopsis thaliana, respectively). Typically, they contain a small number of genes used to validate the predictions made in a particular study [6].
Recently, RT-PCR and direct sequencing techniques have been used to verify specific computational predictions [7].

COMPONENTS OF GENE FINDERS
Before considering specific gene finders, it is important to mention the two major components present in many current programs: the prediction
algorithm and the statistical models that typically reflect the genomic features of a particular species (or, perhaps, a group of species such as plants or mammals). The algorithm defines the mathematical framework on which a gene finder is based. While popular gene finders frequently use statistical methods [often based on Markov models or hidden Markov models (HMMs)] and dynamic programming algorithms, other approaches, including artificial neural networks, have been attempted with considerable success as well. Selection of an appropriate framework (algorithm) requires knowledge of the organization of the genome under study, so it relies on biological knowledge in addition to mathematics. The second major component, which specifically influences ab initio, or single sequence, gene finders, is the set of model parameters the program uses to make gene predictions in anonymous DNA. These models can be of many types: homogeneous Markov chains to model noncoding DNA; inhomogeneous three-periodic Markov chains and interpolated Markov chains to model coding DNA; position-specific weight matrices (which also could be derived from statistical models) to model ribosomal binding sites (RBS), splice sites, and other motifs; and exponential and gamma-type distributions to model distances within and between particular gene elements in the genome. A sample of the models used by the prokaryotic and eukaryotic versions of GeneMark.hmm is shown in figure 4.1. Exactly which combination of models is used depends on both the species being studied and the amount of available data. For instance, at the early stages of a genome sequencing project, the number of experimentally validated genes is typically not sufficient to derive the parameters of high-precision models of the protein-coding DNA, but there are ways to circumvent this problem (see below).

MAJOR CHALLENGES IN GENE FINDING
Even though current gene-finding programs have reached high levels of accuracy, there remains significant room for improvement. In prokaryotes, where impressive average sensitivity figures over 95% have frequently been published [8–11], false positive rates of 10–20% are routine. These numbers can be much worse when the focus shifts to short genes, which have earned the moniker “evil little fellows” (ELFs) because of the difficulty of distinguishing true short genes from random noncoding open reading frames (ORFs) [12]. The overannotation of short genes is a problem that plagues nearly all annotated microbial genomes [13]. Exact determination of the 5′-ends of prokaryotic genes has taken great strides from the times when programs simply predicted genes as ORFs with ambiguous starts or extended all predicted genes to the longest ORF. Relatively recently, models of the RBS started to be used in the algorithms in a more advanced manner. Still, the issue of accurate gene start prediction is not yet closed.
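The abundance of spurious short ORFs is easy to appreciate with a naive forward-strand ORF scan (a hypothetical sketch; real prokaryotic gene finders score both strands, consider alternative start codons, and weigh each candidate statistically rather than reporting every ATG-to-stop stretch):

```python
def find_orfs(seq, min_codons=2):
    # Report (start, end) of every ATG...stop stretch in the three
    # forward reading frames, at least min_codons long (counting the
    # start codon but not the stop codon).
    stops = {"TAA", "TAG", "TGA"}
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in stops:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs
```

Run on random DNA with a small length cutoff, a scan like this reports many candidates, which is exactly why short-gene prediction requires more than reading-frame bookkeeping.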
Figure 4.1 A sample of statistical models. (A) Two-component model of the Bacillus subtilis RBS; nucleotide frequency distribution displayed as a sequence logo [121], left; and distribution of spacer lengths, right (used by the prokaryotic version of GeneMark.hmm). (B) Graphical representation of donor site model for A. thaliana, displayed as a pictogram (Burge, C., http://genes.mit.edu/pictogram.html). (C) Same as in B, for acceptor site model. (D) Distribution of exon lengths for A. thaliana (used by the eukaryotic version of GeneMark.hmm).
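A much-simplified illustration of how a coding model of the kind sampled in figure 4.1 scores DNA (hypothetical code: order-0 per-codon-position frequencies with add-one smoothing, scored as a log-odds ratio against a uniform background; real gene finders use higher-order three-periodic Markov chains and many additional models):

```python
import math

def train_three_periodic(coding_seqs):
    # Per-codon-position nucleotide frequencies with add-one smoothing:
    # an order-0 stand-in for the three-periodic Markov chains used to
    # model protein-coding DNA.
    counts = [{b: 1.0 for b in "ACGT"} for _ in range(3)]
    for seq in coding_seqs:
        for i, base in enumerate(seq):
            if base in counts[i % 3]:
                counts[i % 3][base] += 1
    return [{b: pos[b] / sum(pos.values()) for b in pos} for pos in counts]

def coding_log_odds(seq, model, background=0.25):
    # Log-likelihood ratio of the three-periodic model versus a uniform
    # background; positive scores favor the coding hypothesis.
    return sum(math.log(model[i % 3][base] / background)
               for i, base in enumerate(seq))
```

Sequences resembling the training codon bias receive positive scores, while sequences that do not receive negative ones.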
The complicated organization of eukaryotic genes presents even more challenges. While the accuracy of predicting exon boundaries is approaching 90% for some genomes, assembling the predicted exons into complete gene structures can be an arduous task. In addition, the errors made here tend to multiply; a missed splice site may corrupt the
whole gene prediction unless one or more additional missed splice sites downstream allow the exon–intron structure prediction to get back on track. In addition, while some genome organization features may be common to all prokaryotes (gene length distributions in all prokaryotes, for example, are similar to that of E. coli [10]), current data show that the eukaryotes tend to have much more diversity. There is no universal exon or intron length distribution; the average number of introns per gene is variable; branch points are prominent in some genomes and seemingly missing in others; and so on. To deal with this diversity in genome organization, one may need algorithms that can alter their structure, typically the HMM architecture, to better fit the genetic language grammar of a particular genome [14]. While comparisons of different programs are discussed extensively in the literature, the area of gene prediction is missing thoroughly organized competitions such as CASP in the area of protein structure prediction. Recent initiatives such as ENCODE (http://genome.ucsc.edu/ENCODE/) attempt to fill this void. Several publications have set out to determine which program is the “best” gene finder for a particular genome [15–18]. However detailed these studies are, their results are difficult to extrapolate beyond the data sets they used, as the performance differences among gene finders are tightly correlated to differences in the sequence data used for training and testing. As performance tests are clear drivers of the practical gene finders’ development, it is important that the algorithms’ developers consider the simultaneous pursuit of another goal: the creation of programs that not only make accurate predictions, but also serve the purpose of improving our understanding of the biological processes that have brought the genomes, as they are now, to life.

CLASSIFYING GENE FINDERS
In the early years of gene finding, it was quite easy to classify gene finders into two broad categories: intrinsic and extrinsic [19]. Ideally, the intrinsic approach, which gives rise to ab initio or single sequence gene finders, uses no explicit information about DNA sequences other than the one being analyzed. This definition is not perfect though, since an intrinsic approach may rely on statistical models with parameters derived from other sequences. This loophole in the definition of the intrinsic approach is tolerable, provided the term “intrinsic” conveys the meaning of statistical model-based approaches as opposed to similarity search-based ones. Therefore, intrinsic methods rely on the parameters of statistical models which are learned from collections of known genes. In general, this learning has to be genome-specific, though recent studies have shown that reasonable predictions can be obtained even
with models deviating from those precisely tuned for a particular genome [20,21]. Initially, this was observed with the Markov models generated from E. coli K12 genes of class III, a rather small class which displayed the least pronounced E. coli-specific codon usage pattern and presumably contained a substantial portion of laterally transferred genes [22]. These “atypical models” were able to predict the majority of genes of E. coli. This observation led to the development of a heuristic approach for deriving models which capture the basic, but still genome-specific, pattern of nucleotide ordering in protein-coding DNA [20]. Heuristic models can be tuned for a particular genome by adjusting just a few parameters reflecting its specific nucleotide composition. This approach is also useful for deriving models for rather short inhomogeneous sections of genomes, such as pathogenicity islands, or for the genomes of viruses and phages, for which there is not enough data for conventional model training [23,24]. Extrinsic gene-finding approaches utilize sequence comparison methods such as BLASTX (six-frame translated nucleotide query versus protein database), TBLASTX (six-frame translated nucleotide query versus six-frame translated nucleotide database), or BLASTN (nucleotide query versus nucleotide database) [25]. Robison et al. [26] introduced the first extrinsic gene finders for bacterial genomes. Programs performing alignment of DNA to libraries of nucleotide sequences known to be expressed (cDNA and EST sequences) have to properly compensate for large gaps (which represent introns in the genomic sequence) to be useful for detecting genes in eukaryotic DNA. There are several programs that accomplish this task, including est_genome [27], sim4 [28], BLAT [29], and GeneSeqer [30].
Among these, GeneSeqer stands out as the best performing, with this leadership status apparently gained by making use of “intrinsic” features, namely, species-specific probabilistic models of splice sites. The utilization of these models allows the program to better select biologically relevant alignments from a few alternatives with similar scores, resulting in more accurate exon prediction. The classification of modern gene finders becomes more difficult due to the integrated nature of the new methods. As new high-throughput methods are developed and new types of data are becoming available in vast amounts (cDNA, EST, gene expression data, etc.), more complex gene-finding approaches are needed to utilize all this information. The ENSEMBL project [31] serves as an excellent example of a system that intelligently integrates both intrinsic and extrinsic data. In current practice, nearly all uses of gene finding are integrated in nature. For example, the application of the ab initio gene finders to eukaryotic genomes with frequent repetitive sequences is always preceded by a run of RepeatMasker (Smit, A.M.A., Hubley, R., and Green, P., www.repeatmasker.org) to remove species-specific genomic interspersed repeats revealed by
Gene Finding
123
similarity search through an existing repeat database. To fine-tune gene start predictions, prokaryotic gene finders may rely on prior knowledge of the 3′-tail of the 16S rRNA of the species being analyzed. The ab initio definition has to become more general with the recent introduction of gene-finding approaches based on phylogenetic hidden Markov models (phylo-HMMs), such as Siepel and Haussler’s method [32] for predicting conserved exons in unannotated orthologous genomic fragments of multiple genomes. While such a method belongs to the ab initio class as defined, since no knowledge of the gene content of the multiple DNA sequences is required, the algorithm relies heavily on the assumption that the fragments being considered are orthologous and that the phylogenetic relationships of the species are known. While intrinsic and extrinsic approaches will advance further in coming years, genome annotators will continue to rely on integrated approaches. In addition, researchers are also frequently combining the predictions of multiple gene finders into a joint set of metapredictions [33]. Such methods are gaining in sophistication and popularity in genome sequencing projects [34].

ACCURACY EVALUATION
The quality of gene prediction is frequently characterized by values of sensitivity (Sn) and specificity (Sp). Sensitivity is defined as the ratio of the number of true positive predictions made by a prediction algorithm to the number of all positives in the test set. Specificity is defined as the ratio of the number of true positive predictions to the total number of predictions made. Readers with a computer science background may be more familiar with the terms recall and precision than with sensitivity and specificity, respectively. For gene prediction algorithms, sensitivity and specificity are often determined in terms of individual nucleotides, splice sites, translation starts and ends, separate exons, and whole genes. Both sensitivity and specificity must be determined on test sets of sequences with experimentally validated genes. Some levels of sensitivity and specificity definition (i.e., nucleotides, exons, complete genes) are more useful for the realistic evaluation of practical algorithm performance than others. For prokaryotes, the basic unit of gene structure is the complete gene. For state-of-the-art prokaryotic gene finders, Sn at the level of complete genes is typically above 90% and for many species close to 100%. The Sp value in the “balanced” case is expected to be about the same as Sn, but some prediction programs are tuned to exhibit higher Sn than Sp. The rationale here is that a human expert working with a prediction program would rather take the time to eliminate some (usually low-scoring) predictions deemed to be false positives than leave this elimination entirely
to the computer. At first glance, the high Sn and Sp figures may indicate that the problem of prokaryotic gene finding is “solved.” This, however, is not the case, as such overall figures do not adequately reflect the errors in finding the exact gene starts or the rate of erroneous prediction in the group of short genes (shorter than 400 nt). In most eukaryotes, the basic unit of gene structure is the exon. Thus, exon-level accuracy is a quite natural and informative measure. In state-of-the-art eukaryotic gene finders, exon-level Sn and Sp approach 85%. Interestingly, complete gene prediction accuracy will not be a highly relevant measure until exon-level accuracy approaches 100%. Even with 90% Sn at the exon level, the probability of predicting all exons of a ten-exon gene correctly (and thus the complete gene) is only 0.9^10 or ~35%, though this is a rough estimate as the events are not strictly independent. For eukaryotes, Sn and Sp are often presented at the nucleotide level as well. However, such data should be used with caution as some nucleotides are more important than others. For instance, knowledge of the exact locations of gene boundaries (splice sites and gene starts and stops) is especially important. Misplacement of an exon border by one nucleotide may dramatically affect a large portion of the sequence of the predicted protein product. In the worst case, it is possible to predict a gene with near 100% nucleotide-level Sn and Sp while missing every single splice site. Gene-finding programs are often compared based on Sn and Sp calculated for particular test sets. This approach runs into the difficulty of operating with multiple criteria; that is, the tool with the highest Sn may have lower Sp than the others. One way to combine Sn and Sp into a single measure is to employ the F-measure, defined as 2 * Sn * Sp/(Sn + Sp). Yet another integrative method is to use the ROC curve [35].

GENE FINDING IN PROKARYOTES
Organization of protein-coding genes in prokaryotes is relatively simple. In the majority of prokaryotic genomes sequenced to date, genes are tightly packed and make up approximately 90% of the total genomic DNA. Typically, a prokaryotic gene is a continuous segment of a DNA strand (containing an integral number of triplets) which starts with the triplet ATG (the major start codon, or GTG, CTG, and TTG, which are less frequent starts) and ends with one of the gene-terminating triplets TAG, TGA, or TAA. Traditionally, a triplet sequence which starts with ATG and ends with a stop codon is called an open reading frame (ORF). Note that an ORF may or may not code for protein. However, the length distributions of ORFs known to code for protein and ORFs that simply occur by chance differ significantly. Figure 4.2 shows the probability
Figure 4.2 Length distributions of arguably noncoding ORFs and GenBank annotated protein-coding ORFs for the E. coli K12 genome.
densities of the length distributions of both random ORFs and GenBank annotated genes for the E. coli K12 genome [36]. The exponential and gamma distributions are typically used to approximate the length distributions for random ORFs and protein-coding ORFs respectively [10]. Parameters of these distributions can vary across genomes with different G+C contents. As was mentioned, a challenging problem in prokaryotic gene finding is the discrimination of the relatively few true short genes from the large number of random ORFs of similar length. According to the GenBank annotation [36] there are 382 E. coli K12 genes 300 nt or shorter while NCBI’s OrfFinder reports over 17,000 ORFs in the same length range. This statistic is illustrated in the inset of figure 4.2. The development of ab initio approaches for gene finding in prokaryotes has a rather long history initiated by the works of Fickett [37], Gribskov et al. [38], and Staden [39]. The first characterization of nucleotide compositional bias related to DNA protein-coding function was apparently done by Erickson and Altman in 1979 [40]. It is worth noting that a frequently used gene finder, FramePlot [41], utilizes the simple measure of positional G+C frequencies to predict genes in prokaryotic genomes with high G+C%.
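The ORF definition above translates directly into a simple scan. This is a minimal single-strand sketch (it reports, for each stop codon, the ORF beginning at the first in-frame ATG), not the algorithm of any particular gene finder:

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_len=90):
    """Scan the three reading frames of one strand for ORFs: an ATG
    followed, in frame, by the first stop codon.  Returns (start, end)
    pairs in 0-based coordinates; the stop codon is included."""
    orfs = []
    for frame in range(3):
        start = None  # position of the first unmatched ATG in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOP_CODONS:
                if start is not None and i + 3 - start >= min_len:
                    orfs.append((start, i + 3))
                start = None
    return orfs
```

Applied to a genome, such a scan yields the large excess of short spurious ORFs visible in figure 4.2, which is why a length cutoff (`min_len`, a placeholder value here) or, better, a coding-potential score is needed.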
Application of Markov Chain Models
In the 1980s a number of measures of DNA coding potential were suggested based on various statistically detectable features of protein-coding sequences (Fickett and Tung [42] reviewed and compared 21 different measures). Markov chain theory provided a natural basis for the mathematical treatment of DNA sequence [43] and ordinary Markov models have been used since the 1970s [44]. When a sufficient amount of sequence data became available, three-periodic inhomogeneous Markov models were introduced and proven to be more informative and useful for protein-coding sequence modeling and recognition than other types of statistics [45–47]. The three-periodic Markov chain models have not only an intuitive connection to the triplet structure of the genetic code, but also reflect fundamental frequency patterns generated by this code in protein-coding regions. Subsequently, Markov models of different types, ordinary (homogeneous) and inhomogeneous, necessary to describe functionally distinct regions of DNA, were integrated within the architecture of a hidden Markov model (HMM) with duration (see below). The first gene-finding program using Markov chain models, GeneMark [48], uses a Bayesian formalism to assess the a posteriori probability that the functional role of a given short fragment of DNA sequence is coding (in one of the six possible frames) or noncoding. These calculations are performed using a three-periodic (inhomogeneous) Markov model of protein-coding DNA sequence and an ordinary Markov model of noncoding DNA. To analyze a long sequence, the sliding window technique is used and the Bayesian algorithmic step is repeated for each successive window. The default window size and sliding step size are 96 nt and 12 nt respectively. GeneMark has been shown to be quite accurate at assigning functional roles to small fragments [48]. The posterior probabilities of a particular function defined for overlapping windows covering a given ORF are then averaged into a single score. 
The ORF is predicted as a protein-coding gene if the score is above a preselected threshold. The GeneMark program has been used as the primary annotation tool in many large-scale sequencing projects, including such milestones as the projects on the first bacterial genome (Haemophilus influenzae), the first archaeal genome (Methanococcus jannaschii), and the E. coli genome. Interestingly, the new approach subsequently implemented in GeneMark.hmm (see below) turned out to be a method with properties complementary to GeneMark, rather than being a better version of GeneMark (referred to as the HMM-like algorithm [34]). It was shown [6] that this complementarity is akin to the complementarity of the Viterbi algorithm (GeneMark.hmm) and the posterior decoding algorithm (GeneMark), both frequently used in HMM applications.
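The Bayesian window score just described can be sketched as follows. The model parameters are placeholders (a real run uses transition probabilities estimated from training sequences, and scores all six frames plus the noncoding alternative), and only one frame of one strand is scored here:

```python
import math

def coding_posterior(window, coding, noncoding, prior=0.5):
    """Posterior probability that a window is protein-coding in one frame,
    comparing a three-periodic first-order Markov model against a
    homogeneous noncoding model via Bayes' rule.  coding[p] maps
    (prev, cur) nucleotide pairs to probabilities for codon position p in
    {0, 1, 2}; noncoding maps pairs to probabilities; unseen pairs
    default to 0.25 (toy illustration, not GeneMark's trained models)."""
    log_c = log_n = 0.0
    for i in range(1, len(window)):
        pair = (window[i - 1], window[i])
        log_c += math.log(coding[i % 3].get(pair, 0.25))
        log_n += math.log(noncoding.get(pair, 0.25))
    p_c = prior * math.exp(log_c)
    p_n = (1.0 - prior) * math.exp(log_n)
    return p_c / (p_c + p_n)
```

GeneMark evaluates posteriors of this kind in overlapping 96 nt windows stepped by 12 nt, then averages the window posteriors covering each ORF into its single score.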
HMM Algorithms
There are some inherent limitations of the sliding window approach: (i) it is difficult to identify short genes, those of length comparable to the window size, and (ii) it is difficult to pinpoint real gene starts when alternative starts are separated by a distance smaller than half of the window length. The HMM modeling paradigm, initially developed in speech recognition [49] and introduced to biological sequence analysis in the mid-1990s, could naturally be used to reformulate the gene-finding problem in HMM terms. This approach removed the need for the sliding window, and the general Viterbi algorithm, adjusted for the HMM model of genomic DNA, would deliver the maximum likelihood parse of the genomic sequence into protein-coding and noncoding regions. The first algorithm explicitly using a hidden Markov model for gene prediction in the E. coli genome, ECOPARSE, was developed by Krogh et al. [50]. The HMM technique implies, in general, that the DNA sequence is interpreted as a sequence of observed states (the nucleotides) emitted stochastically by the hidden states (labeled by the nucleotide function: protein-coding, noncoding, etc.) which, in turn, experience transitions regulated by probabilistic rules. In its classic form, an HMM would emit an observed state (a nucleotide) from each hidden state. This assumption causes the lengths of protein-coding genes to be distributed geometrically, a significant deviation from the length distribution of real genes. The classic HMM can be modified to allow a single hidden state to emit a whole nucleotide segment with a length distribution of a desirable shape. This modification is known as an HMM “with duration,” a generalized HMM (GHMM), or a semi-Markov HMM [49].
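Viterbi decoding of the kind used by such gene finders can be illustrated on a toy two-state labeling problem. The states, transition, and emission probabilities below are invented for illustration; a real gene-model HMM has many more states plus duration modeling:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path for an observed sequence (log space)."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    paths = {s: [s] for s in states}
    for symbol in obs[1:]:
        V.append({})
        new_paths = {}
        for s in states:
            score, prev = max(
                (V[-2][p] + math.log(trans_p[p][s]) + math.log(emit_p[s][symbol]), p)
                for p in states
            )
            V[-1][s] = score
            new_paths[s] = paths[prev] + [s]
        paths = new_paths
    return paths[max(states, key=lambda s: V[-1][s])]

# Toy parameters: a GC-rich "coding" state versus a uniform "noncoding" state.
states = ("coding", "noncoding")
start = {"coding": 0.5, "noncoding": 0.5}
trans = {"coding": {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
emit = {"coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
        "noncoding": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}
```

With these parameters, a GC-rich stretch is parsed as "coding" throughout, while an AT-rich stretch is parsed as "noncoding"; the high self-transition probabilities discourage frequent label switching.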
The Markov models of protein-coding regions (with separate submodels for typical and atypical gene classes) and models of noncoding regions can then be incorporated into the HMM framework to assess the probability of a stretch of DNA sequence emitted by a particular hidden state. The performance of an HMM-based algorithm critically depends on the choice of the HMM architecture, that is, the choice of the hidden states and transition links between them. For instance, the prokaryotic version of GeneMark.hmm uses the HMM architecture shown in figure 4.3. With all components of the HMM in place, the problem is reduced to finding the maximum likelihood sequence of hidden states associated with emitted DNA fragments, thus the sequence parse, given the whole sequence of nucleotides (observed states). This problem is solved by the modified Viterbi algorithm. Interestingly, the classic notion of statistical significance, which has been used frequently in the evaluation of the strength of pairwise sequence similarity, has not been used in gene prediction algorithms until recently. This measure was reintroduced in the EasyGene algorithm [35] which evaluates the score of an ORF with regard to the expected
Figure 4.3 Simplified diagram of hidden state transitions in the prokaryotic version of GeneMark.hmm. The hidden state “gene” represents the protein-coding sequence as well as an RBS and a spacer sequence. Two distinct Markov chain models represent the typical and atypical genes, thus genes of both classes can be predicted. For simplicity, only the direct strand is shown and gene overlaps, while considered in the algorithm, are not depicted.
number of ORFs of the same or higher score in a random sequence of similar composition.

Gene Start Prediction
Commonly, there exist several potential start codons for a predicted gene. Unless a gene finder with strong enough discrimination power for true gene start prediction was at hand, the codon ATG producing the longest ORF was identified by annotators (or the program itself) as the predicted gene start. It was estimated that this simple method pinpoints the true start for approximately 75% of the real genes [8]. We emphasize that 75% is a rough estimate, obtained under the assumption that there is no use of (relatively rarely occurring) GTG, CTG, and TTG as start codons and that the DNA sequence is described by the simplest multinomial model with equal percentages of each of the four nucleotides. Still, there is a need to predict gene starts more accurately. Such an improvement would not only give the obvious benefit of providing more reliable genes and proteins, but also would improve the delineation of intergenic regions containing sites involved in the regulation of gene expression. To improve gene start prediction, the HMM architecture of prokaryotic GeneMark.hmm contains hidden states for the
RBS, modeled with a position-specific probability matrix [51], and the spacer between the RBS and the predicted translation start codon. This two-component RBS and spacer model is illustrated in figure 4.1A. Note that the accurate detection of gene starts can also be delayed to a postprocessing stage, following the initial rough gene prediction. Such an approach was implemented by Hannenhalli et al. [52] in RBSfinder for the Glimmer program [53]; in MED-Start [54]; and in the initial version of GeneMark.hmm [10].

Markov Models Are Not the Only Way to Go
While Markov models and HMMs provide a solid mathematical framework for the formalization of the gene-finding problem, a variety of other approaches have been applied as well. An algorithm called ZCURVE [11] uses positional nucleotide frequencies, along with phase-specific dinucleotide frequencies (only dinucleotides covering the first two and last two codon positions are considered), to represent fragments of DNA (such as ORFs) as vectors in 33-dimensional space. Training sets of coding and noncoding DNA are used by the Fisher discriminant algorithm to define a boundary in this space (the Z curve) separating the coding and noncoding sequences. The authors recognized that, while a set of ORFs with strong evidence of being protein-coding is not difficult to compile, acquiring a reliable noncoding set is more challenging, as the intergenic regions in bacteria are short and may be densely populated with RNA genes and regulatory motifs which may alter the base composition. The authors proposed building the noncoding training set from the reverse complements of shuffled versions of the sequences used in the coding training set. In tests on 18 complete bacterial genomes, the ZCURVE program demonstrated sensitivity similar to Glimmer and approximately 10% higher specificity [11]. The Bio-Dictionary Gene Finder (BDGF) exhibits features of both extrinsic and intrinsic gene finders [55]. The Bio-Dictionary [56] itself is a database of so-called “seqlets”—patterns extracted from the GenPept protein database using the Teiresias algorithm [57]. The version of the Bio-Dictionary used with BDGF contains approximately 26 million seqlets which represent all patterns of length 15 or less that start and end with a literal (i.e., one of the 20 amino acid characters), contain at least six literals, and occur two or more times in GenPept. The Teiresias algorithm extracts these patterns in an unsupervised mode. Utilizing BDGF to find genes is relatively straightforward.
All ORFs in a genomic sequence are collected and translated into proteins. These proteins are scanned for the presence of seqlets and if the number of detected seqlets is sufficiently large, the ORF is predicted as a gene. In practice, the seqlets are weighted based on their amino acid composition, and these weights are used to calculate the score. These precomputed weights are not species-specific parameters; thus, BDGF can be applied
to any genome without the need for additional training. Tested on 17 prokaryotic genomes, BDGF was shown to predict genes with approximately 95% sensitivity and specificity. As viral genomes are much smaller than those of prokaryotes, gene finders requiring species-specific training are often unsuccessful. BDGF has proven to be a useful tool in this respect, as shown by its use in a reannotation effort of the human cytomegalovirus [58]. The CRITICA (Coding Region Identification Tool Invoking Comparative Analysis) suite of programs [59] uses a combination of intrinsic and extrinsic approaches. The extrinsic information is provided by BLASTN alignments against a database of DNA sequences. For a genomic region aligned to a database sequence, “comparative coding scores” are generated for all six reading frames. In a particular frame, synonymous differences in the nucleotide sequence, those that do not change the encoded amino acid, contribute positively to the score and nonsynonymous changes contribute negatively. Intrinsic information is included in a second score calculated using a version of the dicodon method of Claverie and Bougueleret [60]. These intrinsic and extrinsic scores are used to identify regions of DNA with significant evidence of being coding. Subsequently, regions with significant coding evidence are extended downstream to the first available stop codon. Finally, a score derived from a predefined RBS motif along with a local coding score are used to define the start codon. In a recent test of several prokaryotic gene finders [33], the largely heuristic CRITICA was shown to have the highest specificity at the cost of some sensitivity when compared to Glimmer [9], ORPHEUS [61], and ZCURVE [11].

The Role of Threshold
Regardless of the methods used, all prokaryotic gene finders must divide the set of all ORFs in a given sequence into two groups: those that are coding and those that are noncoding. In this context, the role of a threshold must be discussed. The thresholds (user defined in some programs, hard-coded in others) essentially determine the number of predicted genes. As such, the threshold directly affects the sensitivity and specificity of the prediction method. The following trivial cases illustrate the two possible extremes in choosing thresholds: (i) the program predicts all ORFs as genes (100% Sn, low Sp); and (ii) the program predicts only the highest scoring ORFs as genes, thus making a small number of predictions (100% Sp, low Sn). A rather appealing approach to avoid these extremes would be to define a “balanced” threshold such that the numbers of false positives and false negatives are equal. Most programs, however, lean toward the case of higher sensitivity and lower specificity, perhaps because overprediction is deemed to be the lesser evil, given the hope that false positives would be filtered out by human experts. An experiment with GeneMark, a program that allows the user to adjust the threshold parameter, and GeneMark.hmm, one that does not, demonstrates the effect of the threshold value on the overall prediction result for a particular genome (table 4.1). GeneMark with a 0.4 threshold performs approximately as well as GeneMark.hmm in terms of sensitivity and specificity.

Table 4.1 Sensitivity and specificity values of E. coli gene predictions by GeneMark (with different thresholds) and GeneMark.hmm, abbreviated GM and GM-HMM, respectively

Program   Threshold   Predictions (no.)   Sensitivity (%)   Specificity (%)
GM        0.3         4,447               93.4              88.5
GM        0.4         4,086               91.4              94.9
GM-HMM    n/a         4,045               92.4              96.9
GM        0.5         3,829               88.2              97.7
GM        0.6         3,623               84.5              99.0

GENE FINDING IN EUKARYOTES
The definition of a gene in eukaryotes is more complex than in prokaryotes. Eukaryotic genes are split into alternating coding and noncoding fragments, exons and introns, respectively. The boundaries of introns are referred to as splice sites. Nearly all introns begin with GT and end with AG dinucleotides, a fact exploited by all eukaryotic gene finders. While the nucleotide signatures of the splice sites are highly conserved, other basic features of introns, such as their average lengths and numbers per gene, vary among species. Therefore, species-specific training and algorithmic implementations are of high importance in this area. Gene prediction in eukaryotic DNA is further complicated by the existence of pseudogenes, genomic sequences apparently evolved from functional genes that have lost their protein-coding function. One particular class, called processed pseudogenes, shares many features with single-exon genes and has been a common source of false positive errors in eukaryotic gene finding. Recently, methods for the accurate identification of processed pseudogenes have been developed [62,63].

HMM-Based Algorithms
The GENSCAN program [64], intensively used in annotation of the whole human genome, leveraged key features of earlier successful gene finders, such as the use of three-periodic inhomogeneous Markov models [45] and the parallel processing of direct and reverse strands [48]. The major innovation of GENSCAN was the independent introduction of
a generalized hidden Markov model (described earlier by Kulp et al. [65]), along with a model of splice sites using statistical decomposition to account for the most informative patterns. Unlike the genomes of prokaryotes, where within the HMM framework a single Markov model of second order could accurately detect the majority of genes, several models of fourth or fifth order are required for eukaryotic genomes. The reasons are as follows. First, exons are rather short in comparison with the ORF-length genes of prokaryotes, and the accurate detection of exons requires high-order models. Second, the genomes of many higher eukaryotes (including human) are composed of distinct regions of differing G+C content termed isochores, which are typically hundreds of kilobases in length. Thus, on the scale of whole chromosomes, eukaryotic genomes are quite inhomogeneous and the use of only one model is not practical. Therefore, to estimate the parameters of several models, the training sequences were divided into empirically defined clusters covering the whole range of G+C content of the human genome. The probabilistic framework, that is, the HMM architecture, used by GENSCAN (figure 4.4) includes multi-intron genes, genes without introns, intergenic regions, promoters, and polyadenylation signals. With all of these genomic elements permitted to appear in the direct or reverse strand, this constitutes a quite complete gene model for a eukaryotic genome. In recent years some other elements, such as exonic splicing enhancers, have been explored in considerable detail [66]. In addition to the predictions made by the Viterbi algorithm (i.e., the most likely sequence parse), GENSCAN provided an assessment of the confidence of predicted exons in terms of an a posteriori probability of the exon computed by the posterior decoding algorithm. This feature has been further utilized by Rogic et al.
in a program combining the predictions of GENSCAN and HMMGene [67] to obtain predictions with higher specificity [68]. Following the release of GENSCAN, a number of programs using HMM techniques for eukaryotic gene finding have become available, including AUGUSTUS [69], FGENESH [70], HMMGene [67], and the eukaryotic version of GeneMark.hmm ([10], http://opal.biology.gatech.edu/gmhmm_euk/). In tests of gene-finding accuracy, GENSCAN is still among the top performing programs [15,16]; in one comparative study, the GENSCAN predictions were even used as a standard by which other approaches were judged [71]. However, it seems fair to say that at this time there is no ab initio gene-finding program that is uniformly better than the others for all currently known genomes. For instance, GENSCAN and FGENESH have been among the most accurate for the human genome, while GeneMark.hmm has been one of the most accurate for plant genomes [18], the Genie program has been tuned for the Drosophila genome [69], and so on.
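Isochore-specific model selection of the kind described above can be sketched as a simple lookup by G+C content. The bin boundaries below are illustrative placeholders, not GENSCAN's actual training clusters:

```python
def gc_content(seq):
    """Fraction of G and C nucleotides in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

# Hypothetical (upper bound, parameter-set name) bins for illustration only.
MODEL_BINS = [(0.43, "low_gc"), (0.51, "mid_gc"),
              (0.57, "high_gc"), (1.01, "very_high_gc")]

def pick_model(seq):
    """Choose a parameter set by the G+C content of the region analyzed."""
    gc = gc_content(seq)
    for upper, name in MODEL_BINS:
        if gc < upper:
            return name
```

The gene finder then scores the region with the Markov models trained on sequences from the matching G+C cluster.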
Figure 4.4 Diagram of the hidden state transitions in the HMM of the GENSCAN program. Protein-coding sequences (exons) and noncoding sequences (introns and intergenic regions) are represented by circles and diamonds, respectively [64].
The AUGUSTUS program makes use of a novel intron submodel, which treats introns as members of two groups clustered merely by their length [69]. The paradigm of handling short and long introns separately follows the current biological concept claiming the existence of two mechanisms of intron splicing, “intron definition” and “exon definition,” related to short and long introns, respectively [72]. In many eukaryotic genomes, the intron length distribution contains a peak near 100 nt and a very long tail. Mathematically, this distribution is often best described as a mixture of two lognormal distributions [72,73], though the exact shape of the distribution varies significantly among species. In the AUGUSTUS algorithm, short introns are precisely modeled with length distributions calculated from sets of known genes, and long introns are modeled with a geometric distribution. Even more detailed modeling of the splicing mechanisms was employed in the INTRONSCAN algorithm, which is focused on detecting short introns specifically rather than complete gene structures [73]. In a significant exploratory effort, the authors quantified the amount of information contained in donor and acceptor sites, branch points, and oligonucleotide composition in several eukaryotic genomes. The recent SNAP program [14] is also HMM based, but uses a reduced genome model as compared to GENSCAN: it does not include hidden states for promoters, polyadenylation signals, and UTRs. Even with this simplified model, the program's accuracy was shown to be high enough on many test sets, a fact attributed to species-specific training. An interesting feature of the SNAP program is that its HMM state diagram is not fixed. A user can alter the structure of the HMM to better match the architecture to the genome under study.

Gene Prediction in Genome Pairs
While HMM-based intrinsic approaches have been the main direction in eukaryotic gene finding for some time, efforts utilizing comparative genomics are now becoming more and more widespread. One of these approaches, implemented in a program called SLAM [74], uses a generalized pair HMM (GPHMM) to simultaneously construct an alignment between orthologous regions of two genomes, such as H. sapiens and Mus musculus, and identify genes in the aligned regions of both. A GPHMM is described as a hybrid of a generalized HMM and a pair HMM, which emits pairs of symbols (including a gap symbol) and is useful in the area of sequence alignment (see ch. 4 in [43]). As input, SLAM takes two DNA sequences along with their approximate alignment, defined as a set of “reasonable” alignments determined by the AVID global alignment tool [75]. These alignments help limit the search space, and thus the computational complexity, of the GPHMM. The output consists of predicted gene structures for each of the DNA sequences. One difficulty that SLAM attempts to overcome is that, perhaps surprisingly,
there is a large amount of noncoding sequence conserved between the human and mouse genomes. The implementation of a conserved noncoding state in the GPHMM decreases the rate of false positive predictions by eliminating the possibility of predicting these conserved sequences as exons. Another program based on a pair HMM is Doublescan [76]. Doublescan does not require the two homologous DNA sequences to be prealigned. Still, it imposes a restriction that the features to be matched in the two sequences must be collinear to one another. The authors explain this restriction by the fact that the sequences intended to be analyzed are relatively short and contain only a small number of genes. When tested on a set of 80 human/mouse orthologous pairs, Doublescan exhibited 10% higher Sn and 4% higher Sp than GENSCAN at the level of complete gene structures, even though GENSCAN performs better at the level of individual nucleotides and exons. Two programs, SGP2 [77] and TWINSCAN [78], attempted to improve prediction specificity using the informant genome approach. Both of these programs exploit homology between two genomes, the target genome and the informant (or reference) genome. SGP2 heuristically integrates the ab initio gene finder GeneID [79,80] with the TBLASTX similarity search program [25]. The GeneID program provides a score, a log-likelihood ratio, for each predicted exon. SGP2 adds the GeneID score to a weighted score (also a log-likelihood ratio) of high-scoring pairs identified by the TBLASTX search against the informant genome database. Predicted exons are then combined into complete gene structures “maximizing the sum of the scores of the assembled exons,” the same principle used by GeneID itself [77].
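At its simplest, the "maximize the sum of exon scores" assembly principle shared by GeneID and SGP2 is weighted interval scheduling over candidate exons. This sketch ignores reading-frame compatibility and splice-site pairing, which real assemblers must also enforce:

```python
from bisect import bisect_right

def assemble_exons(candidates):
    """Select a chain of non-overlapping candidate exons maximizing the
    sum of their scores.  candidates: (start, end, score) tuples with
    half-open [start, end) coordinates.  Returns (best_total, chain)."""
    exons = sorted(candidates, key=lambda e: e[1])  # order by end coordinate
    ends = [e[1] for e in exons]
    best = [0.0] * (len(exons) + 1)                 # best[i]: optimum over first i exons
    for i, (start, _end, score) in enumerate(exons):
        j = bisect_right(ends, start, 0, i)         # exons ending at or before `start`
        best[i + 1] = max(best[i], best[j] + score)
    chain, i = [], len(exons)
    while i > 0:                                    # backtrack through the DP table
        start, _end, score = exons[i - 1]
        j = bisect_right(ends, start, 0, i - 1)
        if best[i] == best[j] + score:
            chain.append(exons[i - 1])
            i = j
        else:
            i -= 1
    return best[-1], chain[::-1]
```

Sorting by end coordinate makes the dynamic program run in O(n log n), which matters when TBLASTX hits generate many overlapping exon candidates.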
The first step in the TWINSCAN algorithm is the generation of a conservation sequence, which replaces the nucleotides of the target sequence (with repeats masked by RepeatMasker) with one of three symbols indicating a match, mismatch, or unaligned as compared to the top four homologs in a database of sequences from the informant genome. The probability of the conservation sequence is calculated given the conservation model. This conservation model is a Markov model of certain order with transition probabilities defined for the three state symbols that make up the conservation sequence (e.g., the probability of a gap following five match characters) rather than the nucleotide alphabets. The TWINSCAN program successfully proved the value of human/mouse genome comparisons for producing accurate computer annotations [81]. While the resulting annotation is quite conservative with only 25,622 genes predicted, its sensitivity is slightly higher than GENSCAN, at the level of both exons and complete genes, in concert with high exon-level specificity. ROSETTA [82] and AGenDA (Alignment-based Gene-Detection Algorithm) [83] are algorithms that represent yet another approach to
136
Genomics
finding genes in pairs of homologous DNA sequences, targeting once again the human and mouse genomes. Elements of intrinsic gene finders are utilized in both programs to score the potential gene structures determined from the alignment of the human and mouse sequences. The alignments, identifying syntenic regions, are provided by the GLASS global sequence alignment program. ROSETTA predicts genes by identifying elements of coincident gene structure (e.g., splice sites, exon lengths, and sequence similarity) in syntenic regions of DNA from two genomes. ROSETTA uses a dynamic programming algorithm to define candidate gene structures in both aligned sequences. Each of the gene structures is scored by measures of splice site strength, codon usage bias, amino acid similarity, and exon length. Parameters of the scoring models are estimated from a set of known orthologs. AGenDA searches for conserved splice sites in locally homologous sequences, as determined by the DIALIGN program [84], to define candidate exons. These candidates are then assembled into complete gene structures via a dynamic programming procedure. The only models utilized are relatively simple consensus sequences used to score splice sites, but a training set is still required. In a test on 117 mouse/human gene pairs, ROSETTA performed approximately identically to GENSCAN in terms of nucleotide-level sensitivity and slightly better in terms of nucleotide-level specificity. AGenDA, more recent than ROSETTA, in a test on the same set performed approximately identically to GENSCAN in terms of both exon-level sensitivity and specificity. Construction of the initial alignment of the genomic sequences (sometimes complete genomes) presents a significant challenge for conventional alignment algorithms, whose running times become prohibitively long. A new class of genomic alignment tools, such as MUMmer, OWEN, and VISTA [85–87], has been introduced to address these concerns. 
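Fast genomic aligners of this class gain their speed by anchoring alignments on exact matches shared between the two sequences. The toy illustration below captures only that anchoring idea; MUMmer itself finds maximal unique matches with a suffix tree, so the fixed-length unique k-mer shortcut here is a deliberate simplification.

```python
def unique_kmer_anchors(a, b, k=8):
    """Toy anchor finder in the spirit of MUMmer: k-mers that occur exactly
    once in each sequence yield candidate alignment anchors, returned as
    (position in a, position in b) pairs sorted by position in a."""
    def unique_positions(s):
        pos = {}
        for i in range(len(s) - k + 1):
            w = s[i:i + k]
            pos[w] = -1 if w in pos else i   # -1 marks a repeated k-mer
        return {w: i for w, i in pos.items() if i >= 0}
    pa, pb = unique_positions(a), unique_positions(b)
    return sorted((pa[w], pb[w]) for w in pa.keys() & pb.keys())
```

A longest increasing chain of such anchors delimits the syntenic regions; only the short gaps between anchors then need expensive alignment.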
While these tools make no attempt to pinpoint protein-coding regions and their borders, they are efficient for determining the syntenic regions that are used as input for ROSETTA and similar programs.

Multigenome Gene Finders: Phylogenetic HMMs
The extension of the comparative gene-finding approach to more than two genomes, hence requiring the use of multiple alignments, has recently been implemented on the basis of phylo-HMMs [88] and evolutionary hidden Markov models (EHMMs) [89]. These approaches combine finite HMM techniques of gene modeling with continuous Markov chains, frequently used in the field of molecular evolution. In addition to the alignments utilized by GPHMM methods, a phylo-HMM requires a phylogenetic tree describing the evolutionary relationship of the genomes under study.
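The continuous-time Markov chain component that a phylo-HMM adds to an ordinary HMM can be illustrated with Felsenstein's pruning computation for one alignment column. The star tree, branch length, and Jukes-Cantor substitution model below are illustrative assumptions, not the choices of any particular program.

```python
import math

BASES = "ACGT"

def jc_prob(t):
    """Jukes-Cantor base substitution probabilities after branch length t
    (expected substitutions per site)."""
    same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    diff = 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)
    return {(x, y): same if x == y else diff for x in BASES for y in BASES}

def column_likelihood(leaves, t=0.1):
    """Felsenstein pruning for one alignment column on a star tree: a root
    connected by a branch of length t to each observed leaf base.  This
    likelihood is what a phylo-HMM attaches to each hidden state, with a
    different substitution model for coding and noncoding states."""
    P = jc_prob(t)
    # Sum over the unobserved root base x (uniform prior), multiplying the
    # substitution probabilities down each branch.
    return sum(0.25 * math.prod(P[(x, b)] for b in leaves) for x in BASES)
```

A perfectly conserved column (e.g., A in every species) receives a far higher likelihood than a divergent one, which is exactly the signal a phylo-HMM exploits to favor exon states over neutral DNA.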
Depending on the type of data provided, a phylo-HMM-based procedure could run (i) as a single-sequence gene finder if only one sequence is available, (ii) as a GPHMM if a pairwise alignment is provided, or (iii) as a bona fide phylo-HMM when a multiple alignment and a tree are provided [89]. The EHMM implementation of Pedersen and Hein [89] was presented as a proof of concept and was limited by the choice of a simplistic HMM of gene structure. Although the accuracy of the EHMM predictions did not match that of GENSCAN in tests, it is important to note that the optimal input for an EHMM would be a multiple alignment of several closely related complete genomes; currently, such data frequently are not available [89]. The phylo-HMM-based procedure ExoniPhy developed by Siepel and Haussler [32] for identification of evolutionarily conserved exons is quite sophisticated [90]. This approach targets the exons of core genes (those found in all domains of life), rather than the complete gene structures, because exons are more likely to be preserved over the course of evolution than complete genes. The predicted exons can later be pieced together into complete genes with a dynamic programming algorithm as in SGP2 or GeneID. The diagram of hidden state transitions used in ExoniPhy is shown in figure 4.5. ExoniPhy includes three major features that improve the performance of phylo-HMMs in terms of exon prediction. The first is the use of context-dependent phylogenetic models. The second is the explicit modeling of conserved noncoding DNA as in SLAM. The third is the modeling of insertions and deletions (indels), taking into account that the pattern of indels frequently is quite different in coding and noncoding sequences. Interestingly, almost 90% of the conserved exons in mouse, human, and rat have no gaps in their alignments. As an exon predictor, ExoniPhy was shown to perform comparably to GENSCAN, SGP2, TWINSCAN, and SLAM. 
However, the authors admit that there is clearly room for improvement, as the current version of ExoniPhy does not contain several advanced features common to other gene finders, such as species-specific distributions of exon lengths and higher-order splice site models.

Extrinsic Gene Finders
Increased availability of experimental data indicating DNA sequence transcriptional activity in the form of cDNA and EST sequences (typically from the same genome) and protein sequences (from other genomes) has led to the development of gene finders that leverage this information. Mapping EST and cDNA sequences to genomic DNA as a method of predicting the transcribed genes generally falls in the realm of pairwise sequence alignment. Therefore, such approaches (covered to some extent earlier in this chapter) are not considered here in more detail.
Figure 4.5 Diagram of hidden state transitions in the HMM of the ExoniPhy program. States on the direct strand are shown at the top and states on the reverse strand are shown at the bottom. Circles represent variable-length states, while boxes represent fixed-length states [32].
The concepts of extrinsic evidence-based eukaryotic gene finding were implemented in a rather sophisticated way in the algorithms Procrustes [91] and GeneWise [92]. Both programs are computationally expensive and rely on significant computational resources to identify the piece of extrinsic evidence, the reference protein (if one exists at all) homologous to the one encoded in the given DNA sequence. Essentially, both programs screen the protein database, or sizable subsets of it, one protein at a time, attempting to extract from a
given genomic DNA a set of protein-coding exons (with associated introns) that would be translated into a protein product homologous to the database protein. Procrustes works by first determining all subsequences that could be potential exons; a basic approach is selecting those sequences bounded by AG at their 5′-ends and GT at their 3′-ends. This set, collectively referred to as “candidate blocks,” can be assembled into a large number of potential complete gene structures. Procrustes uses the spliced alignment algorithm to efficiently scan all of the possible assemblies and find the one with the highest similarity score to a given database protein. The authors have determined that if a sufficiently similar protein exists in the protein database, the highest scoring block assembly is “almost guaranteed” to represent the correct gene structure. In a set of human genes having a homologous protein in another primate, 87% were predicted exactly. GeneWise was described in detail relatively recently [92], though it has been used in the ENSEMBL pipeline since 1997 [31] and in a number of genome annotation projects. Its predecessor, PairWise [93], had been developed with the goal of finding frameshifts in genes with protein products belonging to known families. The availability of a protein profile built from a multiple alignment of the family members, delineating a conserved domain, made it possible to obtain extrinsic evidence of a frameshift that would destroy the fit of the newly identified protein product to the profile. GeneWise represents a significant advance over PairWise: its stated goal is gene prediction rather than frameshift detection, and to reach it the program employs a consistent approach based on HMM theory. 
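The basic candidate-block enumeration described above for Procrustes can be sketched directly: collect every stretch preceded by an AG acceptor dinucleotide and followed by a GT donor dinucleotide. The minimum block length used below is an arbitrary illustrative choice; real implementations add length, frame, and scoring constraints.

```python
def candidate_blocks(seq, min_len=3):
    """Enumerate candidate exon blocks in the basic way described above:
    stretches preceded by AG (acceptor) and followed by GT (donor).
    Returns 0-based half-open (start, end) coordinates of each block.
    min_len is an illustrative assumption, not part of the algorithm."""
    acceptors = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    return [(a, d) for a in acceptors for d in donors if d - a >= min_len]
```

The combinatorial size of this set is why a spliced alignment dynamic program, rather than explicit enumeration of assemblies, is needed to find the highest-scoring chain of blocks.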
Given that models of both pairwise alignment and protein-coding gene prediction are readily represented by HMMs, GeneWise uses a combined (merged) HMM model with hidden states reflecting the status of alignment between the amino acids of the database protein and triplets of genomic DNA that may fall either into coding (exon) or noncoding (intron) regions. The HMM modeling the gene structure is much simpler than in GENSCAN—only one strand is considered, with nucleotide triplets being observed states in protein-coding exons, intron boundaries not allowed to split codons, and so on. When the reference database protein is a member of a family, the pairwise sequence alignment is extended to alignment with the family-specific profile HMM with parameters already defined in the HMMER package [94]. As compared to GENSCAN, GeneWise exhibits higher specificity, as would be expected given that predictions are anchored to extrinsic evidence. However, its sensitivity is lower than that of GENSCAN, as GeneWise has no means to identify genes whose protein products do not generate well-detectable hits in the protein databases. The accuracy of GeneWise decreases as the percent identity and the length of the alignment to a reference protein decrease. The most accurate
predictions are made when the database screening hits a reference protein that is more than 85% identical, along the whole length of the new prediction, to the protein to be predicted by GeneWise.

Site Detectors
While many eukaryotic gene finders attempt to predict complete gene structures, there are a number of recent programs that focus on the use of advanced techniques to detect gene components such as promoters [95], transcription starts [96], splice sites [97], and exonic splicing enhancers [66]. There is considerable innovation in this area and the types of algorithms being introduced are quite diverse. Still, the common feature of these approaches is that each one introduces a new concept that can later be integrated into a full-scale gene finder or annotation pipeline. The accurate prediction of promoters is an important task that can contribute to improving the accuracy of eukaryotic gene finders [98]. As promoters are located upstream of the start of transcription, finding a promoter helps narrow down the region where translation starts may be located. This information is important for gene-finding algorithms, as two common sources of error at the level of complete gene structure prediction are the joining of two adjacent genes into one and the splitting of a single gene into two. Either of these errors could be prevented if the gene finder uses a priori knowledge of the promoter locations. The early development of promoter prediction programs was challenged by a notoriously large number of false positive predictions, on the order of more than ten false predictions for each true positive [98]. The PromoterInspector program [95], which specifically predicts polymerase II promoter regions, reduced this overprediction rate to approximately a one-to-one ratio. A promoter region in terms of PromoterInspector is a sequence that contains a promoter on either the direct or the reverse strand. PromoterInspector utilizes an unsupervised learning technique to extract sets of oligonucleotides (with mismatches) from training sets containing promoter and nonpromoter sequences. 
Genomic sequences are processed using a sliding window approach and the prediction of a promoter region requires the classification of a certain number of successive windows as promoters. Sherf et al. demonstrated the power of integrating promoter predictions with ab initio gene predictions [99]. In a test on annotated human chromosome 22, promoters predicted in regions compatible with the 5′-end predictions of GENSCAN matched annotated genes with high frequency. Closely related to the prediction of promoters is the prediction of transcriptional starts. In human and other mammalian genomes, transcriptional starts are frequently located in the vicinity of CpG islands. To predict transcriptional starts, the Dragon Gene Start Finder (Dragon GSF) [96] first identifies the locations of CpG islands in the genome.
Then it uses an artificial neural network to evaluate all of the predicted transcription start sites, supplied by the Dragon Promoter Finder [100], with respect to the locations and compositions of the CpG islands and downstream sequence elements. The sequence site where the sum of scores of these factors reaches the highest value (provided it is above a preset threshold) pinpoints the transcription start. Currently, the program can only find transcription starts associated with CpG islands. This restriction imposes a limit on the sensitivity the program can achieve. Still, these types of signal sensors may suffer more from relatively low specificity, given that the number of detected CpG islands frequently exceeds the number of genes in a sequence by a large margin. There are numerous techniques used to identify splice sites in genomic sequences, as their accurate detection is imperative for nearly all gene-finding systems for eukaryotes. Castelo and Guigo recently presented a new method using inclusion-driven Bayesian networks (idlBNs) [97]. The idlBN method performs comparably to the best of the previously utilized approaches for splice site identification (including position-specific weight matrices based on zero- and first-order Markov models). This method shows superior training dynamics; as the training size increases, the false positive rate decreases more quickly for idlBNs than it does for weight matrix or Markov chain-based approaches. Rather surprisingly, the integration of the idlBN method with a gene prediction program, GeneID, showed that improved signal detection does not necessarily lead to large improvements in gene-finding accuracy. The authors offered the caveat that this relationship might depend on the specific gene finder, as their testing was limited to a single program. Exonic splicing enhancers (ESEs) are short sequence motifs located in the exons near splice sites and implicated in enhancing splicing activity. 
The detection of ESEs is beneficial for gene-finding programs as ESEs can help delineate the boundaries between coding and noncoding DNA, especially when the sequence patterns at these boundary sites are weak [101]. The RESCUE method (Relative Enhancer and Silencer Classification by Unanimous Enrichment), a general sequence motif detection method, was applied to ESE detection and implemented in a program called RESCUE-ESE [66]. In a set of human genes, RESCUE-ESE detected ten candidate splice enhancer motifs. Biochemical experiments confirmed the ESE activity of all ten predictions.

MODEL TRAINING
The accuracy of the gene predictions made by a particular program is highly dependent on the choice of training data and training methods. Thus, making optimal choices is another part of the science (or art) of gene finding.
Determination of Training Set Size
Training sets are typically derived from the expert annotated collections of genomic sequences. Sets of the required size may not always be available either for technical reasons (e.g., at the beginning of genome sequencing projects) or for more fundamental ones (e.g., extremely small genome size). While in practice it has been observed that three-periodic Markov chain models are effective for gene finding, any discussion of the application of Markov models for DNA sequence analysis would be incomplete without addressing the question of which order model is most appropriate [102]. In general, accuracy of gene prediction, especially for short genes (or exons), increases with an increase in the model order. As far as minimum orders are concerned, it was observed that models of less than order two do not perform well for gene prediction applications, largely because the second-order Markov chains are the shortest chains for which entire codons are included in the frequency statistics and codon usage frequency can be captured. Models of order five have an additional advantage as they capture the frequency statistics of all oligonucleotides up to dicodons (hexamers). The maximum order of the model that can be used is limited by the size of the available training sequence. The minimal size of the training set can be defined in terms of 100(1−a)% confidence intervals for the estimated transition probabilities. Then, for a = 0.05 (and assuming a genome with about equal frequencies of each nucleotide), the number of observations required to estimate each transition probability is approximately equal to 400 [103]. Markov models of higher order have a larger number of parameters. As the amount of training data needed per parameter does not change, the required training set size grows geometrically as a function of model order. In real genomes, certain oligomers are overrepresented while others are underrepresented. 
Therefore, some transition probabilities will be defined with higher accuracy than others. To deal with this effect, the Glimmer program [9] employs a special class of Markov chain models called interpolated Markov models (IMMs). IMMs use a combination of high-order and lower-order Markov chains with weights depending on the frequencies of each oligomer. Further generalization of the interpolated models to so-called models with deleted interpolation (DI) is possible [102]. The performance of different types of Markov chain models, both conventional fixed order (FO) models and models with interpolation, was assessed within the framework of the GeneMark algorithm [102]. It was observed that the DI models slightly outperformed other types of models in detecting genes in genomes with medium G+C content. For genomic DNA with high (or low) G+C content, it was observed that the DI models were in some cases slightly outperformed by the FO models.
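The figure of roughly 400 observations per transition probability quoted above translates directly into a minimum training-set size as a function of model order, assuming approximately equal nucleotide frequencies:

```python
def min_training_size(order, obs_per_param=400):
    """An order-k Markov chain over {A,C,G,T} has 4**(k+1) transition
    counts, one per (k-mer context, next base) pair.  With roughly equal
    nucleotide frequencies a specific (k+1)-mer occupies a fraction
    4**-(order+1) of positions, so observing each one about obs_per_param
    times requires this many training positions."""
    return obs_per_param * 4 ** (order + 1)

for k in (2, 3, 4, 5):
    print(k, min_training_size(k))
```

By this estimate an order-2 model needs about 25 kb of training sequence, while an order-5 model already calls for about 1.6 Mb, which is why the available training sequence caps the usable model order.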
Nonsupervised Training Methods
Frequently, it is difficult to find reliably annotated DNA sequence in sufficient amounts to build models by supervised training. However, the total length of sequenced DNA could be sufficient to harbor training sets for high-order models and nonsupervised training would be a valuable option. Nonsupervised training algorithms have been described for prokaryotic gene finders such as GeneMark or Glimmer [9,21,104,105]. Also, a nonsupervised training procedure, GeneMarkS [8], was proposed for building models for GeneMark.hmm. GeneMarkS starts the iterative training process from models with heuristically defined pseudocounts [20]. The rounds of sequence labeling into coding and noncoding regions, recompilation of the training sets, and model training follow until convergence. The heuristic approach [20] by itself may produce sufficiently accurate models without a training set. Models built by the heuristic approach have been successfully used for gene prediction in the genomes of viruses and phages, often too small to provide enough sequence to estimate parameters of statistical models via regular training. The heuristic approach was used for annotation of viral genes in the VIOLIN database [24], which contains computer reannotations for more than 2000 viral genomes. Self-training methods may also successfully incorporate similarity search in databases of protein sequences to identify members of the emerging training set, as is done by the ORPHEUS [61] and EasyGene [35] programs. Nonsupervised training may use clustering routines to separate genes of the atypical class, presumably populated with genes horizontally transferred into a given (microbial) genome in the course of evolution [106]. For analysis of new genomes, in the absence of substantial training sets, models from close phylogenetic neighbors with similar G+C content were used, albeit with varying degrees of success. 
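The iterative cycle used by self-training procedures such as GeneMarkS (label the sequence, recompile the training set, re-estimate the models, repeat until convergence) can be caricatured with a deliberately simple toy. Here fixed windows and single-nucleotide frequency models stand in for the HMM parse and Markov chain models of the real program; the initial GC-content labeling plays the role of the heuristic starting models.

```python
import math
from collections import Counter

def freq_model(chunks):
    # Nucleotide frequency model with pseudocounts, estimated from chunks.
    counts = Counter("".join(chunks))
    total = sum(counts.values())
    return {b: (counts[b] + 1) / (total + 4) for b in "ACGT"}

def log_odds(chunk, m1, m0):
    # Log-odds of the chunk under model m1 versus m0 (ACGT-only sequence).
    return sum(math.log(m1[b] / m0[b]) for b in chunk)

def self_train(seq, win=30, rounds=20):
    """Toy self-training loop: heuristic initial labels, then alternate
    model re-estimation and relabeling until the labeling converges."""
    chunks = [seq[i:i + win] for i in range(0, len(seq) - win + 1, win)]
    # Heuristic starting point: call GC-rich windows "coding".
    labels = [c.count("G") + c.count("C") > win // 2 for c in chunks]
    for _ in range(rounds):
        coding = freq_model([c for c, l in zip(chunks, labels) if l])
        noncoding = freq_model([c for c, l in zip(chunks, labels) if not l])
        new = [log_odds(c, coding, noncoding) > 0 for c in chunks]
        if new == labels:          # converged
            break
        labels = new
    return labels
```

The real procedure replaces the window labeling with a full HMM parse into coding and noncoding regions, but the alternation between labeling and parameter estimation is the same.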
While the reference genome model could be successful for a number of cases, a simple test on prokaryotic genomes can show why this method should be applied cautiously. For instance, the genomes of E. coli K12 and Prochlorococcus marinus str. MIT9313 have G+C% of 50.8 and 50.7, respectively. With a model trained on P. marinus MIT9313 sequence, GeneMark.hmm detects 92% of the genes in E. coli K12, while using a model trained on E. coli K12 sequence GeneMark.hmm detects only 74% of the P. marinus MIT9313 genes. Therefore, the operation of choosing a reasonable reference genome is not a symmetrical one. Interestingly, a complete genome of another strain of P. marinus (P. marinus subsp. marinus str. CCMP1375) with a much lower G+C% (36.4) is available and the usefulness of phylogenetic distance as a criterion for selecting a reference genome can be immediately tested. Cross-species tests show, somewhat surprisingly, that with MIT9313
models GeneMark.hmm detects 90% of the CCMP1375 genes, but with models derived from CCMP1375 the program detects a mere 8% of the MIT9313 genes. Recently, eukaryotic genome sequencing has experienced an acceleration akin to that of prokaryotic genome sequencing in the late 1990s. The feasibility of unsupervised model estimation for eukaryotic genomes has recently been shown (Ter-Hovhannisyan, V., Lomsadze, A., Borodovsky, M., unpublished).

COMBINING THE OUTPUT OF GENE FINDERS
In 1996, Burset and Guigo combined several gene finders employing different methodologies to improve the accuracy of eukaryotic gene prediction [4]. At that time, GENSCAN and similar HMM-based gene finders were not available, and the eukaryotic sequence contigs used in the tests were relatively short and contained exactly one gene per sequence. As even the best eukaryotic gene finders of the time had rather low sensitivity, a significant number of exons were missed by any single program. However, with several methods used in concert, only about 1% of the real exons were missed completely. Exons predicted by all the programs in exactly the same way were labeled “almost certainly true” [4]. Since Burset and Guigo’s paper, combination of the predictions of multiple gene finders into a set of metapredictions has been a popular idea implemented in several programs (see below). Interestingly, in some ways these programs mirror the mode of operation of expert annotators running different gene prediction programs on an anonymous genomic DNA sequence at hand. After the release of GENSCAN, Murakami and Takagi used several methods to combine GENSCAN with three other programs: FEXH [107], GeneParser3 [108], and Grail [109]. The best-performing combination, however, achieved only modest improvements over the predictions of GENSCAN alone. Recently, McHardy et al. [110] combined the outputs of the prokaryotic gene finders Glimmer [9] and CRITICA [59], representing the intrinsic and extrinsic classes, respectively. 
The three combination methods were (i) the union of CRITICA and a special run of Glimmer with its model parameters estimated from the predictions of CRITICA; (ii) an overlap threshold rule in which a Glimmer prediction was discarded if it significantly overlapped a CRITICA prediction; and (iii) a vote score threshold strategy in which predictions of Glimmer (again trained on the set of predictions of CRITICA) were discarded if the vote score (defined as the sum of the scores of the ORF analyzed in all reading frames other than the one voted for) was below a certain threshold. In a test on 113 genomes, the best of these methods, the vote score
approach, showed accuracy comparable to the YACOP program [33], which combines the predictions of CRITICA, Glimmer, and ZCURVE using the Boolean operation CRITICA ∪ (Glimmer ∩ ZCURVE) and outperforms individual gene finders in terms of reducing false positive predictions. Rogic et al. [68] combined two eukaryotic gene finders, GENSCAN and HMMgene, using exon posterior probability scores validated earlier as sufficiently reliable [16]. In tests on long multigene sequences it was shown that the best performance was provided by the method called Exon Union-Intersection with Reading Frame Consistency. This method works by first selecting gene structures based on the union and intersection of the predictions of the two programs and by choosing the program producing the higher gene probability (average of the exon probabilities for all exons in a gene) as the one imposing the reading frame for the complete predicted gene. Yet another method to combine the outputs of several gene prediction programs (eukaryotic GeneMark.hmm, GENSCAN, GeneSplicer [111], GlimmerM [112], and TWINSCAN), along with protein sequence alignments, cDNA alignments, and EST alignments, has been implemented in a program called Combiner [113]. In tests on 1783 A. thaliana genes confirmed by cDNA, Combiner consistently outperformed the individual gene finders in terms of accurate prediction of both complete gene structures and separate exons. Note that Combiner specifically employed programs showing high accuracy of gene prediction in plant genomes [114]. GAZE [115] utilizes a dynamic programming algorithm to integrate information from a variety of intrinsic and extrinsic sources into the prediction of complete gene structures. The novel feature of GAZE is that the model of “legal” gene structure is defined in an easily edited external XML format as a list of features (e.g., translation start and stop, donors and acceptors). 
The use of an external XML gene model description allows GAZE to be quickly reconfigured to handle specific features of particular genomes, such as trans-splicing in Caenorhabditis elegans. The EuGene program [116], using an approach similar to GAZE, was developed to integrate several sources of information for gene prediction in A. thaliana. An extension of this program, called EuGeneHom, utilizes the EuGene framework to predict eukaryotic genes based on similarity to multiple homologous proteins [117].

CONCLUSIONS
Though the problem of prokaryotic gene finding has already been addressed at a level that satisfies experimental biologists, considerable innovation is still needed to drive this field to perfection. Thus, new gene finders utilizing novel mathematical approaches such as
Spectral Rotation Measure [118] and Self-Organizing Map [119] are still being introduced. The accuracy of eukaryotic gene-finding programs, though improved by the development of new training algorithms and more precise methods to locate short signals in genomes, is not expected to reach the level attained in prokaryotes in the near future. Perhaps for this reason, innovation in eukaryotic gene finding currently focuses less on the application of novel statistics-based methods and more on new methods that leverage the power of comparative genomics in ever more complex ways. While the major challenges in prokaryotic gene finding have been narrowed down to the prediction of gene starts, discrimination of short genes from random ORFs, and prediction of atypical genes, the issues facing eukaryotic gene finding are more numerous. The extent of alternative splicing in the genomes of higher eukaryotes, along with exactly how to evaluate alternative splicing predictions, is currently under study. The initial applications of phylo-HMMs have shown the power of ab initio gene finders that can handle multiple genomes simultaneously, and further improvements in this direction are expected. Overlapping (or nested) genes, initially thought to be rare but now known to be quite frequent in both prokaryotes and eukaryotes, may be a rather challenging target for eukaryotic gene finding [120]. More precise models of promoters, terminators, and regulatory sites that may aid in the determination of gene starts and 5′ and 3′ untranslated regions are under development as well.

ACKNOWLEDGMENTS

The authors would like to thank Vardges Ter-Hovhannisyan and Wenhan Zhu for computational support and Alexandre Lomsadze, Alexander Mitrophanov, and Mikhail Roytberg for useful discussions. This work was supported in part by grants from the U.S. Department of Energy and the U.S. National Institutes of Health.
REFERENCES

1. Venter, J. C., K. Remington, J. F. Heidelberg, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science, 304(5667):66–74, 2004.
2. Link, A. J., K. Robison and G. M. Church. Comparing the predicted and observed properties of proteins encoded in the genome of Escherichia coli K12. Electrophoresis, 18(8):1259–1313, 1997.
3. Rudd, K. E. EcoGene: a genome sequence database for Escherichia coli K-12. Nucleic Acids Research, 28(1):60–4, 2000.
4. Burset, M. and R. Guigo. Evaluation of gene structure prediction programs. Genomics, 34(3):353–67, 1996.
5. Korning, P. G., S. M. Hebsgaard, P. Rouze and S. Brunak. Cleaning the GenBank Arabidopsis thaliana data set. Nucleic Acids Research, 24(2):316–20, 1996.
6. Slupska, M. M., A. G. King, S. Fitz-Gibbon, et al. Leaderless transcripts of the crenarchaeal hyperthermophile Pyrobaculum aerophilum. Journal of Molecular Biology, 309(2):347–60, 2001.
7. Guigo, R., E. T. Dermitzakis, P. Agarwal, et al. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proceedings of the National Academy of Sciences USA, 100(3):1140–5, 2003.
8. Besemer, J., A. Lomsadze and M. Borodovsky. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Research, 29(12):2607–18, 2001.
9. Delcher, A. L., D. Harmon, S. Kasif, et al. Improved microbial gene identification with GLIMMER. Nucleic Acids Research, 27(23):4636–41, 1999.
10. Lukashin, A. V. and M. Borodovsky. GeneMark.hmm: new solutions for gene finding. Nucleic Acids Research, 26(4):1107–15, 1998.
11. Guo, F. B., H. Y. Ou and C. T. Zhang. ZCURVE: a new system for recognizing protein-coding genes in bacterial and archaeal genomes. Nucleic Acids Research, 31(6):1780–9, 2003.
12. Ochman, H. Distinguishing the ORFs from the ELFs: short bacterial genes and the annotation of genomes. Trends in Genetics, 18(7):335–7, 2002.
13. Skovgaard, M., L. J. Jensen, S. Brunak, et al. On the total number of genes and their length distribution in complete microbial genomes. Trends in Genetics, 17(8):425–8, 2001.
14. Korf, I. Gene finding in novel genomes. BMC Bioinformatics, 5(1):59, 2004.
15. Guigo, R., P. Agarwal, J. F. Abril, et al. An assessment of gene prediction accuracy in large DNA sequences. Genome Research, 10(10):1631–42, 2000.
16. Rogic, S., A. K. Mackworth and F. B. Ouellette. Evaluation of gene-finding programs on mammalian sequences. Genome Research, 11(5):817–32, 2001.
17. Kraemer, E., J. Wang, J. Guo, et al. An analysis of gene-finding programs for Neurospora crassa. Bioinformatics, 17(10):901–12, 2001.
18. Mathe, C., P. Dehais, N. Pavy, et al. Gene prediction and gene classes in Arabidopsis thaliana. Journal of Biotechnology, 78(3):293–9, 2000.
19. Borodovsky, M., K. E. Rudd and E. V. Koonin. Intrinsic and extrinsic approaches for detecting genes in a bacterial genome. Nucleic Acids Research, 22(22):4756–67, 1994.
20. Besemer, J. and M. Borodovsky. Heuristic approach to deriving models for gene finding. Nucleic Acids Research, 27(19):3911–20, 1999.
21. Hayes, W. S. and M. Borodovsky. How to interpret an anonymous bacterial genome: machine learning approach to gene identification. Genome Research, 8(11):1154–71, 1998.
22. Borodovsky, M., J. D. McIninch, E. V. Koonin, et al. Detection of new genes in a bacterial genome using Markov models for three gene classes. Nucleic Acids Research, 23(17):3554–62, 1995.
23. Huang, S. H., Y. H. Chen, G. Kong, et al. A novel genetic island of meningitic Escherichia coli K1 containing the ibeA invasion gene (GimA): functional annotation and carbon-source-regulated invasion of human
148
24. 25. 26. 27.
28.
29. 30. 31. 32.
33. 34.
35. 36. 37. 38.
39.
40.
41.
42.
Genomics
brain microvascular endothelial cells. Functional and Integrative Genomics, 1(5):312–22, 2001. Mills, R., M. Rozanov, A. Lomsadze, et al. Improving gene annotation of complete viral genomes. Nucleic Acids Research, 31(23):7041–55, 2003. Altschul, S. F., W. Gish, W. Miller, et al. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–10, 1990. Robison, K., W. Gilbert and G. M. Church. Large scale bacterial gene discovery by similarity search. Nature Genetics, 7(2):205–14, 1994. Mott, R. EST_GENOME: a program to align spliced DNA sequences to unspliced genomic DNA. Computer Applications in the Biosciences, 13(4):477–8, 1997. Florea, L., G. Hartzell, Z. Zhang, et al. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Research, 8(9):967–74, 1998. Kent, W. J. BLAT—the BLAST-like alignment tool. Genome Research, 12(4):656–64, 2002. Usuka, J., W. Zhu and V. Brendel. Optimal spliced alignment of homologous cDNA to a genomic DNA template. Bioinformatics, 16(3):203–11, 2000. Birney, E., T. D. Andrews, P. Bevan, et al. An overview of Ensembl. Genome Research, 14(5):925–8, 2004. Siepel, A. and D. Haussler. Computational identification of evolutionarily conserved exons. In RECOMB ‘04: Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology (pp. 177–86). ACM Press, New York, 2004. Tech, M. and R. Merkl. YACOP: enhanced gene prediction obtained by a combination of existing methods. In Silico Biology, 3(4):441–51, 2003. Iliopoulos, I., S. Tsoka, M. A. Andrade, et al. Evaluation of annotation strategies using an entire genome sequence. Bioinformatics, 19(6):717–26, 2003. Larsen, T. S. and A. Krogh. EasyGene—a prokaryotic gene finder that ranks ORFs by statistical significance. BMC Bioinformatics, 4(1):21, 2003. Blattner, F. R., G. Plunkett, 3rd, C. A. Bloch, et al. The complete genome sequence of Escherichia coli K-12. Science, 277(5331):1453–74, 1997. Fickett, J. W. 
Recognition of protein coding regions in DNA sequences. Nucleic Acids Research, 10(17):5303–18, 1982. Gribskov, M., J. Devereux and R. R. Burgess. The codon preference plot: graphic analysis of protein coding sequences and prediction of gene expression. Nucleic Acids Research, 12(1 Pt 2):539–49, 1984. Staden, R. Measurements of the effects that coding for a protein has on a DNA-sequence and their use for finding genes. Nucleic Acids Research, 12(1):551–67, 1984. Erickson, J. W. and G. G. Altman. Search for patterns in the nucleotidesequence of the MS2 genome. Journal of Mathematical Biology, 7(3):219–30, 1979. Ishikawa, J. and K. Hotta. FramePlot: a new implementation of the frame analysis for predicting protein-coding regions in bacterial DNA with a high G+C content. FEMS Microbiology Letters, 174(2):251–3, 1999. Fickett, J. W. and C. S. Tung. Assessment of protein coding measures. Nucleic Acids Research, 20(24):6441–50, 1992.
Gene Finding
149
43. Durbin, R., S. Eddy, A. Krogh and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, 1998. 44. Gatlin, L. L. Information Theory and the Living System. Columbia University Press, New York, 1972. 45. Borodovsky, M., Y. A. Sprizhitsky, E. I. Golovanov and A. A. Alexandrov. Statistical features in the Escherichia coli genome functional primary structure. II. Non-homogeneous Markov chains. Molekuliarnaia Biologiia, 20:833–40, 1986. 46. Borodovsky, M., Y. A. Sprizhitsky, E. I. Golovanov and A. A. Alexandrov. Statistical features in the Escherichia coli genome functional primary structure. III. Computer recognition of protein coding regions. Molekuliarnaia Biologiia, 20:1144–50, 1986. 47. Tavare, S. and B. Song. Codon preference and primary sequence structure in protein-coding regions. Bulletin of Mathematical Biology, 51(1):95–115, 1989. 48. Borodovsky, M. and J. McIninch. Genmark—parallel gene recognition for both DNA strands. Computers and Chemistry, 17(2):123–33, 1993. 49. Rabiner, L. R. A tutorial on hidden Markov-models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–86, 1989. 50. Krogh, A., I. S. Mian and D. Haussler. A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Research, 22(22):4768–78, 1994. 51. Staden, R. Computer methods to locate signals in nucleic-acid sequences. Nucleic Acids Research, 12(1):505–19, 1984. 52. Hannenhalli, S. S., W. S. Hayes, A. G. Hatzigeorgiou and J. W. Fickett. Bacterial start site prediction. Nucleic Acids Research, 27(17):3577–82, 1999. 53. Suzek, B. E., M. D. Ermolaeva, M. Schreiber and S. L. Salzberg. A probabilistic method for identifying start codons in bacterial genomes. Bioinformatics, 17(12):1123–30, 2001. 54. Zhu, H. Q., G. Q. Hu, Z. Q. Ouyang, et al. Accuracy improvement for identifying translation initiation sites in microbial genomes. Bioinformatics, 20(18):3308–17, 2004. 55. 
Shibuya, T. and I. Rigoutsos. Dictionary-driven prokaryotic gene finding. Nucleic Acids Research, 30(12):2710–25, 2002. 56. Rigoutsos, I., A. Floratos, C. Ouzounis, et al. Dictionary building via unsupervised hierarchical motif discovery in the sequence space of natural proteins. Proteins, 37(2):264–77, 1999. 57. Rigoutsos, I. and A. Floratos. Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm. Bioinformatics, 14(1):55–67, 1998. 58. Murphy, E., I. Rigoutsos, T. Shibuya and T. E. Shenk. Reevaluation of human cytomegalovirus coding potential. Proceedings of the National Academy of Sciences USA, 100(23):13585–90, 2003. 59. Badger, J. H. and G. J. Olsen. CRITICA: coding region identification tool invoking comparative analysis. Molecular Biology and Evolution, 16(4):512–24, 1999. 60. Claverie, J. M. and L. Bougueleret. Heuristic informational analysis of sequences. Nucleic Acids Research, 14(1):179–96, 1986. 61. Frishman, D., A. Mironov, H. W. Mewes and M. Gelfand. Combining diverse evidence for gene recognition in completely sequenced bacterial genomes. Nucleic Acids Research, 26(12):2941–7, 1998.
150
Genomics
62. Zhang, Z. and M. Gerstein. Large-scale analysis of pseudogenes in the human genome. Current Opinion in Genetics and Development, 14(4):328–35, 2004. 63. Coin, L. and R. Durbin. Improved techniques for the identification of pseudogenes. Bioinformatics, 20(Suppl 1):I94–100, 2004. 64. Burge, C. and S. Karlin. Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology, 268(1):78–94, 1997. 65. Kulp, D., D. Haussler, M. G. Reese and F. H. Eeckman. A generalized hidden Markov model for the recognition of human genes in DNA. Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 4:134–42, 1996. 66. Fairbrother, W. G., R. F. Yeh, P. A. Sharp and C. B. Burge. Predictive identification of exonic splicing enhancers in human genes. Science, 297(5583):1007–13, 2002. 67. Krogh, A. Two methods for improving performance of an HMM and their application for gene finding. Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 5:179–86, 1997. 68. Rogic, S., B. F. Ouellette and A. K. Mackworth. Improving gene recognition accuracy by combining predictions from two gene-finding programs. Bioinformatics, 18(8):1034–45, 2002. 69. Stanke, M. and S. Waack. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics, 19(Suppl 2):II215–25, 2003. 70. Salamov, A. A. and V. V. Solovyev. Ab initio gene finding in Drosophila genomic DNA. Genome Research, 10(4):516–22, 2000. 71. Reese, M. G., G. Hartzell, N. L. Harris, et al. Genome annotation assessment in Drosophila melanogaster. Genome Research, 10(4):483–501, 2000. 72. Berget, S. M. Exon recognition in vertebrate splicing. Journal of Biological Chemistry, 270(6):2411– 14, 1995. 73. Lim, L. P. and C. B. Burge. A computational analysis of sequence features involved in recognition of short introns. Proceedings of the National Academy of Sciences USA, 98(20):11193-8, 2001. 74. Alexandersson, M., S. Cawley and L. Pachter. 
SLAM: cross-species gene finding and alignment with a generalized pair hidden Markov model. Genome Research, 13(3):496–502, 2003. 75. Bray, N., I. Dubchak and L. Pachter. AVID: a global alignment program. Genome Research, 13(1):97–102, 2003. 76. Meyer, I. M. and R. Durbin. Comparative ab initio prediction of gene structures using pair HMMs. Bioinformatics, 18(10):1309–18, 2002. 77. Parra, G., P. Agarwal, J. F. Abril, et al. Comparative gene prediction in human and mouse. Genome Research, 13(1):108–17, 2003. 78. Korf, I., P. Flicek, D. Duan and M. R. Brent. Integrating genomic homology into gene structure prediction. Bioinformatics, 17(Suppl 1):S140–8, 2001. 79. Guigo, R., S. Knudsen, N. Drake and T. Smith. Prediction of gene structure. Journal of Molecular Biology, 226(1):141–57, 1992. 80. Parra, G., E. Blanco and R. Guigo. GeneID in Drosophila. Genome Research, 10(4):511–15, 2000. 81. Flicek, P., E. Keibler, P. Hu, et al. Leveraging the mouse genome for gene prediction in human: from whole-genome shotgun reads to a global synteny map. Genome Research, 13(1):46–54, 2003.
Gene Finding
151
82. Batzoglou, S., L. Pachter, J. P. Mesirov, et al. Human and mouse gene structure: comparative analysis and application to exon prediction. Genome Research, 10(7):950–8, 2000. 83. Rinner, O. and B. Morgenstern. AGenDA: gene prediction by comparative sequence analysis. In Silico Biology, 2(3):195–205, 2002. 84. Morgenstern, B., A. Dress and T. Werner. Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proceedings of the National Academy of Sciences USA, 93(22):12098–103, 1996. 85. Kurtz, S., A. Phillippy, A.L. Delcher, et al. Versatile and open software for comparing large genomes. Genome Biology, 5(2):R12, 2004. 86. Ogurtsov, A. Y., M. A. Roytberg, S. A. Shabalina and A. S. Kondrashov. OWEN: aligning long collinear regions of genomes. Bioinformatics, 18(12):1703–4, 2002. 87. Couronne, O., A. Poliakov, N. Bray, et al. Strategies and tools for wholegenome alignments. Genome Research, 13(1):73–80, 2003. 88. Siepel, A. and D. Haussler. Combining phylogenetic and hidden Markov models in biosequence analysis. Journal of Computational Biology, 11(2-3): 413–28, 2004. 89. Pedersen, J. S. and J. Hein. Gene finding with a hidden Markov model of genome structure and evolution. Bioinformatics, 19(2):219–27, 2003. 90. Brent, M. R. and R. Guigo. Recent advances in gene structure prediction. Current Opinion in Structural Biology, 14(3):264–72, 2004. 91. Gelfand, M. S., A. A. Mironov and P. A. Pevzner. Gene recognition via spliced sequence alignment. Proceedings of the National Academy of Sciences USA, 93(17):9061–6, 1996. 92. Birney, E., M. Clamp and R. Durbin. GeneWise and Genomewise. Genome Research, 14(5):988–95, 2004. 93. Birney, E., J. D. Thompson and T. J. Gibson. PairWise and SearchWise: finding the optimal alignment in a simultaneous comparison of a protein profile against all DNA translation frames. Nucleic Acids Research, 24(14):2730–9, 1996. 94. Eddy, S. R. Profile hidden Markov models. Bioinformatics, 14(9):755–63, 1998. 95. 
Scherf, M., A. Klingenhoff and T. Werner. Highly specific localization of promoter regions in large genomic sequences by PromoterInspector: a novel context analysis approach. Journal of Molecular Biology, 297(3): 599-606, 2000. 96. Bajic, V. B. and S. H. Seah. Dragon gene start finder: an advanced system for finding approximate locations of the start of gene transcriptional units. Genome Research, 13(8):1923–9, 2003. 97. Castelo, R. and R. Guigo. Splice site identification by idlBNs. Bioinformatics, 20(Suppl 1):I69–76, 2004. 98. Fickett, J. W. and A. G. Hatzigeorgiou. Eukaryotic promoter recognition. Genome Research, 7(9):861–78, 1997. 99. Scherf, M., A. Klingenhoff, K. Frech, et al. First pass annotation of promoters on human chromosome 22. Genome Research, 11(3):333–40, 2001. 100. Bajic, V. B., S. H. Seah, A. Chong, et al. Dragon Promoter Finder: recognition of vertebrate RNA polymerase II promoters. Bioinformatics, 18(1):198–9, 2002.
152
Genomics
101. Graveley, B. R. Sorting out the complexity of SR protein functions. RNA, 6(9):1197–211, 2000. 102. Azad, R. K. and M. Borodovsky. Effects of choice of DNA sequence model structure on gene identification accuracy. Bioinformatics, 20(7):993–1005, 2004. 103. Borodovsky, M., W. S. Hayes and A. V. Lukashin. Statistical predictions of coding regions in prokaryotic genomes by using inhomogeneous Markov models. In R.L. Charlebois (Ed.), Organization of the Prokaryotic Genome (pp. 11–34). ASM Press, Washington, D.C., 1999. 104. Audic, S. and J. M. Claverie. Self-identification of protein-coding regions in microbial genomes. Proceedings of the National Academy of Sciences USA, 95(17):10026–31, 1998. 105. Baldi, P. On the convergence of a clustering algorithm for protein-coding regions in microbial genomes. Bioinformatics, 16(4):367–71, 2000. 106. Hayes, W. S. and M. Borodovsky. Deriving ribosomal binding site (RBS) statistical models from unannotated DNA sequences and the use of the RBS model for N-terminal prediction. In Pacific Symposium on Biocomputing (pp. 279–90). World Scientific, Singapore, 1998. 107. Solovyev, V. and A. Salamov. The Gene-Finder computer tools for analysis of human and model organisms genome sequences. Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 5: 294–302, 1997. 108. Snyder, E. E. and G. D. Stormo. Identification of protein coding regions in genomic DNA. Journal of Molecular Biology, 248(1):1–18, 1995. 109. Xu, Y., R. Mural, M. Shah and E. Uberbacher. Recognizing exons in genomic sequence using GRAIL II. Genetic Engineering, 16:241–53, 1994. 110. McHardy, A. C., A. Goesmann, A. Puhler and F. Meyer. Development of joint application strategies for two microbial gene finders. Bioinformatics, 20(10):1622–31, 2004. 111. Pertea, M., X. Lin and S. L. Salzberg. GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Research, 29(5):1185–90, 2001. 112. Salzberg, S. L., M. Pertea, A. 
L. Delcher, et al. Interpolated Markov models for eukaryotic gene finding. Genomics, 59(1):24-31, 1999. 113. Allen, J. E., M. Pertea and S. L. Salzberg. Computational gene prediction using multiple sources of evidence. Genome Research, 14(1):142–8, 2004. 114. Pavy, N., S. Rombauts, P. Dehais, et al. Evaluation of gene prediction software using a genomic data set: application to Arabidopsis thaliana sequences. Bioinformatics, 15(11):887–99, 1999. 115. Howe, K. L., T. Chothia and R. Durbin. GAZE: a generic framework for the integration of gene-prediction data by dynamic programming. Genome Research, 12(9):1418–27, 2002. 116. Schiex, T., A. Moisan and P. Rouze. EuGene: an eucaryotic gene finder that combines several sources of evidence. In O. Gascuel and M. F. Sagot (Eds.), Computational Biology. LNCS 2066 (pp. 111–25). Springer, Heidelberg, 2001. 117. Foissac, S., P. Bardou, A. Moisan, et al. EuGeneHom: a generic similaritybased gene finder using multiple homologous sequences. Nucleic Acids Research, 31(13):3742–5, 2003. 118. Kotlar, D. and Y. Lavner. Gene prediction by spectral rotation measure: a new method for identifying protein-coding regions. Genome Research, 13(8):1930–7, 2003.
Gene Finding
153
119. Mahony, S., J. O. McInerney, T. J. Smith and A. Golden. Gene prediction using the Self-Organizing Map: automatic generation of multiple gene models. BMC Bioinformatics, 5(1):23, 2004. 120. Veeramachaneni, V., W. Makalowski, M. Galdzicki, et al. Mammalian overlapping genes: the comparative perspective. Genome Research, 14(2):280–6, 2004. 121. Schneider, T. D. and R. M. Stephens. Sequence logos: a new way to display consensus sequences. Nucleic Acids Research, 18(20):6097–100, 1990.
5 Local Sequence Similarities

Temple F. Smith
In today’s genomic era, the use of computer-based comparative genetic sequence analysis has become routine. It is used to identify the function of newly sequenced genes, to identify conserved functional sites, to reconstruct probable evolutionary histories, and to investigate many other biological questions. DNA and protein comparative sequence analysis is often considered to be the founding aspect of what is now called Bioinformatics and Genomics. Given that the development of computational tools played a key role, a bit of history will help in understanding both the motivations and the sequence of ideas that led to many comparative sequence tools, including the local dynamic programming, or Smith–Waterman, alignment algorithm. It has often been stated that Zuckerkandl and Pauling introduced in 1965 the idea of using comparative sequence information to study evolution [1]. It is clear, however, that many were already considering that method by 1964, as seen by the lively discussions at the symposium on “Evolving Genes and Proteins” held September 17–18, 1964, at Rutgers University [2]. Three papers in particular provide sequence comparative alignments: one on a cytochrome c by Margoliash and Smith [3]; one on a dehydrogenase by N. O. Kaplan [4]; and one by Pauling [5]. The first paper was the precursor to the famous paper by Fitch and Margoliash [6] presenting the first large-scale evolutionary reconstruction based on protein sequence alignments. This meeting at Rutgers was attended by over 250 researchers from departments of chemistry, microbiology, applied physics, schools of medicine, and others, foretelling the highly multidisciplinary nature of biology’s future and the future of bioinformatics, an interdisciplinary area often traced back to sequence comparative analyses. Since Sanger’s seminal work on sequencing insulin [7], it had taken fewer than ten years for a large number of scientists to recognize the wealth of information available in such protein sequences.
By 1969, fourteen years later, Margaret Dayhoff had collected and aligned over 300 protein sequences and organized them into evolutionary-based clusters [8]. Alignments by hand were straightforward for many of the proteins sequenced early on, such as the cytochrome c’s and globins. It became clear, however, that there were two fundamental problems requiring a
more rigorous approach. These were the often degenerate or alternative possible placements of alignment gaps, and the need for a means of weighting the differences or similarities between different amino acids that one placed in homologous or aligned positions. There were a number of early heuristic approaches, such as those by Fitch [9] and Dayhoff [8]. It was the algorithm by Needleman and Wunsch in 1970 [10], however, that set the stage for nearly all later biological sequence alignment tools. Although not initially recognized as such, this work by Needleman and Wunsch was an application of Bellman’s 1957 dynamic programming methodology [11]. The fully rigorous application of dynamic programming came slowly with work by Sankoff [12], Reichert et al. [13], and Sellers [14]. In the latter, Sellers was able to introduce a true metric or distance measure between sequence pairs, a metric that Waterman et al. [15] were then able to generalize. This outline of the early sequence alignment algorithm developments lay behind the work of my colleague, Michael Waterman, and myself that led to the local or optimal maximum similar common subsequence algorithm, since labeled the “Smith–Waterman algorithm.” It was work on a related problem, however, the prediction of RNA secondary structure, that was our critical introduction to dynamic programming. Much of our early collaboration took place in the summers at Los Alamos, New Mexico. I first met Michael when we were both invited to spend the summer of 1974 at Los Alamos National Laboratory through the efforts of William Beyer. I had previously spent time at Los Alamos interacting with researchers in the Biological Sciences group and with the mathematician, Stanislaw Ulam. By Michael’s own account (Waterman, Skiing the Sun: New Mexico Essays, 1997) our meeting was not altogether an auspicious one. We were both young faculty from somewhat backwater, nonresearch universities.
We were thus highly motivated by the opportunity to get some real research done over the short summer. Our earliest efforts focused on protein sequence comparisons and evolutionary tree building, yet interestingly, neither we nor our more senior “leader,” Bill Beyer, were initially aware of the key work of Needleman and Wunsch, nor of dynamic programming. Much of this ignorance is reflected in a first paper on the subject [16] produced prior to working with Waterman. Over the next couple of summers we became not only very familiar with the work of Needleman and Wunsch, Sankoff, and Sellers, but were able to contribute to the mathematics of sequence comparison matrices [15]. By the end of 1978 we had successfully applied the mathematics of dynamic programming to the prediction of tRNA secondary structures. Here, as in nearly all of our joint work, Waterman was the true mathematician while I struggled with the biochemistry and the writing of hundreds of lines of FORTRAN code. We made a good team. Los Alamos was a special place at that time, filled with very bright and
Figure 5.1 A photo of Mike Waterman (right) and Temple Smith (left) taken in the summer of 1976 at Los Alamos National Laboratory, Los Alamos, New Mexico, by David Lipman. Dr. Lipman is one of the key developers of the two major heuristic high-speed generalizations of the Smith–Waterman algorithm, FASTA and BLAST.
unique people, and a place where one felt connected to the grand wide-open spaces of the American southwest (see figure 5.1). At about this same time there were discussions about creating a database for DNA sequences similar to what had been done for both protein sequences and 3D-determined structures [17]. As members of the Los Alamos group considering applying for grant support, both Michael and I were active in discussions on likely future analysis needs. These needs included not only using sequence alignment to compare two sequences of known functional similarity, but searching long DNA sequences for matches to much shorter single gene sequences. The need to identify short matches within much longer sequences was reinforced by the discovery of introns in 1976 [18–21]. Peter Sellers [14] and David Sankoff [12] had each worked on this problem, but without obtaining a timely general solution. Waterman and I were lucky enough to have recognized the similarity between protein pairwise sequence alignment and the geology problem called stratigraphic correlations, the problem of aligning two sequences of geological strata. Upon recognition, we were able to write a paper [22] in just a few days that included the dynamic programming solution to finding the optimal matching subsequence of a short sequence within a much longer one. In that solution lay one of the keys to the real problem that needed
answering, the initiation of the traceback matrix boundaries with zeros (see below). To understand the simple, yet subtle, modification we made to the dynamic programming algorithms developed by 1978, we need to recall that nearly everyone was working and thinking in terms of distances between sequences and sequence elements. This was natural from the evolutionary viewpoint since one was generally attempting to infer the distance between organisms, or more accurately between them and their last common ancestor. Various researchers (again including Dayhoff) were working with the probability that any two amino acids would be found in homologous positions or aligned across from one another. These were nearly always converted to some form of distance [8,23], typically an edit distance, the minimum number of mutations or edits required to convert one sequence into the other. In modern form these amino acid pair probabilities are converted to log-likelihoods where the object is to maximize the total likelihood rather than minimize a distance. These log-likelihoods have the form

LL(i,j) = log [P(ai,aj) / (P(ai) P(aj))]    (1)
providing a similarity measure of how likely it is to see amino acid i in the same homologous position as amino acid j in two related proteins, as compared to observing such an association by chance. Such measures range from positive values (similar), through zero, to negative values (dissimilar). Clearly, as recognized by Dayhoff and others, the involved probabilities depend both on how great the evolutionary distance is between the containing organisms and on the degree and type of selection pressures affecting and/or allowing the changes. More importantly, one normally has to obtain estimates of the probabilities from the very groups of proteins that one is trying to align. Why was the concept of a similarity measure so critical to us? If one wants to define an optimal substring match between two long sequences as one having the minimal distance, then any run of identities including single matches has the minimum distance of zero. There will clearly be many such optimal or minimum distance matches between nearly any two real biological sequences. Sellers, attempting to deal with this potential problem [14], introduced a complex optimization of maximum length with minimum distance that later proved to be incorrect [24]. Waterman and I recognized that by using a similarity measure and not having any cost for dissimilar or nonmatching terminal or end subsequences, the rigorous dynamic programming logic could be applied to the problem of identifying the maximum similar subsequences between any two longer sequences. The maximum similarity, not minimum distance, is just what was needed.
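The log-likelihood scoring of equation (1) can be made concrete with a small sketch; the pair and background frequencies below are invented for illustration and do not come from any real substitution matrix.

```python
import math

def log_likelihood(p_pair, p_i, p_j):
    """Equation (1): LL(i,j) = log[P(ai,aj) / (P(ai)P(aj))].
    Positive when the pair occurs in aligned positions more often than
    expected by chance, negative when it occurs less often."""
    return math.log(p_pair / (p_i * p_j))

# Hypothetical frequencies for two amino acid pairs (illustrative only):
similar    = log_likelihood(p_pair=0.010, p_i=0.09, p_j=0.05)  # favored pairing
dissimilar = log_likelihood(p_pair=0.001, p_i=0.09, p_j=0.05)  # disfavored pairing

print(similar > 0, dissimilar < 0)  # True True
```

As the text notes, real values of P(ai,aj) depend on the evolutionary distance and selection pressures of the proteins being compared, so such scores are always estimates tied to a particular data set.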
THE ALGORITHM
The algorithm, like that of Needleman and Wunsch, is a deceptively simple iterative application of dynamic programming that is today routinely encoded by many undergraduates. It is described formally as follows. Given two sequences over a common alphabet, A = a1a2…an and B = b1b2…bm, of lengths n and m; a measure of similarity between the elements of that alphabet, s(ai, bj); and a cost or dissimilarity value, W(k), for a deletion of length k introduced into either sequence, a matrix H is defined such that

Hk0 = H0l = 0  for 0 ≤ k ≤ n and 0 ≤ l ≤ m    (2a)

and

Hij = max { Hi−1,j−1 + s(ai, bj),  max k≥1 [ Hi−k,j − W(k) ],  max l≥1 [ Hi,j−l − W(l) ],  0 }    (2b)

for 1 ≤ i ≤ n and 1 ≤ j ≤ m. The elements of H have the interpretation that the value of matrix element Hij is the maximum similarity of two segments ending in ai and bj, respectively. The formula for Hij follows by considering four possibilities:

1. If ai and bj are associated or aligned, the similarity is Hi−1,j−1 + s(ai, bj).
2. If ai is at the end of a deletion of length k, the similarity is Hi−k,j − W(k).
3. If bj is at the end of a deletion of length l, the similarity is Hi,j−l − W(l).
4. Finally, Hij is assigned zero if all three options above would yield a negative similarity value, indicating that no subsequence with positive similarity ends at both ai and bj.

The zeros in equations (2a) and (2b) allow the exclusion of leading and trailing subsequences whose total sum of element-pair similarities, s(ai,bj), plus any alignment insertion/deletion costs, is negative, and which thus have a net dissimilarity. The matrix of sequence element similarities, s(ai,bj), is arbitrary to a high degree. In most biological cases, s(ai,bj) directly or indirectly reflects some evolutionary or statistical assumptions, as in equation (1). These assumptions influence the constant values in the deletion weight function, W(k), as well. For example, one normally assumes that a sequence deletion is less likely than any point mutation; this in turn requires −W(1) to be less than the minimum value of s(ai, bj).
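A direct transcription of equations (2a) and (2b) might look like the following sketch (not the original FORTRAN implementation); the inner scans over all gap lengths k and l mirror the formula as written, at O(nm(n+m)) cost. The scoring values in the example are illustrative.

```python
def sw_matrix(A, B, s, W):
    """Fill the Smith-Waterman H matrix following equations (2a) and (2b).

    A, B -- sequences; s(a, b) -- element similarity; W(k) -- gap cost,
    assumed positive and monotonically increasing in k.
    """
    n, m = len(A), len(B)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]  # equation (2a): zero boundaries
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = H[i - 1][j - 1] + s(A[i - 1], B[j - 1])  # align a_i with b_j
            for k in range(1, i + 1):                       # deletion ending in a_i
                best = max(best, H[i - k][j] - W(k))
            for l in range(1, j + 1):                       # deletion ending in b_j
                best = max(best, H[i][j - l] - W(l))
            H[i][j] = max(best, 0.0)                        # equation (2b): floor at zero
    return H

# Illustrative scoring: +2 for a match, -1 for a mismatch, linear gap cost W(k) = k.
H = sw_matrix("ACACACTA", "AGCACACA",
              s=lambda a, b: 2.0 if a == b else -1.0,
              W=lambda k: float(k))
print(max(max(row) for row in H))  # maximum local similarity: 12.0
```

Because every cell is floored at zero, the matrix never records a net-dissimilar prefix, which is exactly what lets the best-scoring subsequences be read off directly from the largest H value.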
The pair of maximally similar subsequences is found by first locating the maximum element in the matrix H. A traceback procedure then sequentially determines the H matrix elements that lead to that maximum value, ending at the last nonzero value; the two maximally similar subsequences are composed of the sequence of element pairs associated with the H cells along this traceback. The traceback also produces the optimal alignment, the one having the highest similarity value, between these two subsequences. The zero in equation (2b) thus defines, via the alphabet similarities s(ai,bj), the level of similarity below which aligned elements are considered dissimilar.
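The fill-plus-traceback procedure can be sketched as below; for brevity this self-contained version uses the linear-gap special case (each gap element costs a fixed amount, so only the three neighboring cells are consulted) rather than the general W(k), and the scoring values are illustrative.

```python
def smith_waterman(A, B, match=2.0, mismatch=-1.0, gap=1.0):
    """Local alignment: fill H with zero boundaries, locate the maximum
    element, then trace back until a zero-valued cell is reached."""
    n, m = len(A), len(B)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sij = match if A[i - 1] == B[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + sij,  # align a_i with b_j
                          H[i - 1][j] - gap,      # gap in B
                          H[i][j - 1] - gap,      # gap in A
                          0.0)
    # locate the maximum element of H
    i, j = max(((r, c) for r in range(n + 1) for c in range(m + 1)),
               key=lambda rc: H[rc[0]][rc[1]])
    score, top, bot = H[i][j], [], []
    # trace back, ending at the last nonzero value (ties broken arbitrarily)
    while i > 0 and j > 0 and H[i][j] > 0:
        sij = match if A[i - 1] == B[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sij:
            top.append(A[i - 1]); bot.append(B[j - 1]); i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            top.append(A[i - 1]); bot.append('-'); i -= 1
        else:
            top.append('-'); bot.append(B[j - 1]); j -= 1
    return score, ''.join(reversed(top)), ''.join(reversed(bot))

score, a1, a2 = smith_waterman("ACACACTA", "AGCACACA")
print(score, a1, a2)  # 12.0 A-CACACTA AGCACAC-A
```

Each traceback step requires exact equality with a predecessor cell plus its move score, so the recovered alignment necessarily sums to the maximum H value.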
GAP WEIGHT PROBLEM
The functional form of the gap function, W(k), was originally viewed as restricted by the dynamic programming optimization to be a monotonically increasing function of the gap length, k. The simplest form used, and the one used in the original published example of the Smith–Waterman algorithm [25], is the affine linear function of gap length,

W(k) = W0 + W1 * (k − 1)    (3)
Here W0 is the cost associated with opening a gap in either sequence, and W1 is the cost or penalty of extending the gap beyond a length of one sequence element. Such a function has limited biological justification. In most cases longer gaps or insertions/deletions are surely less likely than shorter ones; however, all deletions above some minimum length may be nearly equally likely. Very long insertion/deletion events can be quite common, such as those in chromosomal rearrangements and in many protein sequences containing large introns. In addition, if one is searching for the optimal local alignments between protein-encoding DNA sequences, then nearly all expected DNA insertion/deletion events have lengths that are multiples of three. Also, in the case of proteins, such mutational events have highly varying probabilities along the sequence; they are much more likely in surface loop regions, for example. The latter is taken into account in many implementations of alignment dynamic programming tools, where W(k) is made a function of the sequence element ai or bj, or more exactly of the three-dimensional environment associated with that sequence element, the amino acid.
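To make this concrete, here is a hedged sketch contrasting the affine cost of equation (3) with an invented, purely illustrative gap function for protein-coding DNA that surcharges frameshifting gap lengths (those not a multiple of three); the constants are not from any published scoring scheme.

```python
def affine_gap(k, w0=1.0, w1=1.0 / 3.0):
    """Equation (3): W(k) = W0 + W1 * (k - 1).
    W0 opens the gap; W1 extends it by one element."""
    return w0 + w1 * (k - 1)

def coding_dna_gap(k, frameshift_surcharge=2.0):
    """Illustrative variant: insertions/deletions in protein-coding DNA are
    expected to be modulo three, so frameshifting lengths pay an extra
    (hypothetical) cost."""
    surcharge = 0.0 if k % 3 == 0 else frameshift_surcharge
    return affine_gap(k) + surcharge

print(affine_gap(1), affine_gap(4))           # 1.0 2.0
print(coding_dna_gap(3) < coding_dna_gap(2))  # True: an in-frame gap is cheaper
```

Note that such a coding-DNA cost is no longer monotonically increasing in k, illustrating the text's point that a biologically plausible W(k) need not follow the affine form.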
NEAR OR SUBOPTIMAL LOCAL ALIGNMENT
The somewhat arbitrary nature of the sequence element similarity or scoring matrix is closely related to the need to be able to identify suboptimal or alternative optimal alignments. Like the insertion/deletion cost function, W(k), discussed above, the “true” cost or likelihood of a given sequence element replacement can be, and generally is, a function of the local context. In DNA sequences this is a function not only of where, but of what type of information is being encoded. Also, in DNA comparisons across different species there are the well-known differences in overall base composition or the varying likelihood of each nucleotide at each codon position. In protein sequences it is not only a function of local structure, but of the class of protein. Membrane-associated, excreted, and globular proteins all have fundamentally different amino acid background frequencies. This means that the typically employed s(ai,bj) element-to-element similarity matrices are at best an average over many context factors, and therefore even minor differences could produce different optimal alignments, particularly in detail. This in turn means that, in any case where the details of the
Figure 5.2a The H matrix generated using equation (3) and the following similarity scoring function: matches, s = +1; mismatches, s = −0.5; and W(k) = − (0.91 + 0.10 * k) for the two DNA input sequences, AAGCATCAAGTCT and TCATCAGCGG. The heavily boxed value of 5.49 is the global maximum, while the other three boxed values are the next three highest suboptimal similarity values. The arrows indicate the diagonal traceback steps; dashes indicate horizontal traceback steps involving W(k). Displayed values are × 100.
Local Sequence Similarities
161
alignment are important, one needs to look at any alternative alignments with the same or nearly the same total similarity score as the optimal. An example would be those cases where one is attempting to interpret the evolutionary history between relatively distant sequences, where there are likely many uncertainties in the alignment. There are in principle three different types of such alternative near-optimal alignments. There are those that contain a substantial portion of the optimal. These can be found by recalculating, at each step in the traceback, which of the H-matrix cells could have contributed to the current cell within some fixed value, X, and then proceeding through each of these in turn, subtracting the needed contribution from the value, X, until the traceback is complete or until the sum exceeds X. This is a very computationally intensive procedure in most cases. The simplest of such tracebacks, however, are those for which X is set equal to zero, so that only those alternative traceback alignments having the same score as the optimal are obtained. Figure 5.2a displays two such equally optimal tracebacks, the implied alignment difference being only a shift in the location of the central gap by one. A third class of suboptimal local alignments is obtained by rerunning the dynamic programming while not allowing the position pairs
Figure 5.2b Pairwise alignments for the maximally similar local aligned segments of the two input DNA sequences, AAGCATCAAGTCT and TCATCAGCGG, using the similarity scoring function given in figure 5.2a. Each of these alignments corresponds to one of the tracebacks shown in figure 5.2a.
aligned in the optimal alignment to contribute. Those position pairs are set equal to zero, requiring any suboptimal alignment to avoid those pairs. The next best such alignment is then obtained by the standard traceback, beginning as before at the largest positive value. This procedure can be repeated to obtain a set of suboptimal alignments of decreasing total similarity. Examples of these are shown in figure 5.2b. Note that the particular cases displayed could have been obtained directly from the initial H matrix only because they do not involve any of the aligned pairs that contributed to the optimal.

STATISTICAL INTERPRETATION
It was noticed early on [26] that, for a fixed alphabet, the optimal local alignment similarity varies linearly with the logarithm of the product of the lengths of the two sequences. This is very pronounced as at least one of the sequences gets very long. Initially this was only an empirical observation; it was later clearly shown to be an instance of the Erdös–Renyi law [27]. One thus expects, as one or both sequences (or the database of sequences against which one is searching a single sequence) become large, the similarity to grow as S(a,b) ~ log(n*m), with an error on the order of log[log(n*m)]. This implies that the statistical significance—its deviation above the expected—of a local sequence alignment against a large database is a function of the logarithm of the database size. Considerable effort has gone into correctly estimating the associated probabilities [28] from the extreme value distribution.

CONCLUSION
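The log(n*m) scaling above is what underlies the extreme-value significance estimates of Karlin and Altschul [28]. A hedged sketch follows: the constants K and lam depend on the scoring system and must be estimated from it, so the default values below are placeholders for illustration, not fitted constants.

```python
import math

# Sketch of extreme-value (Gumbel) significance estimation for local
# alignment scores, in the spirit of Karlin and Altschul [28].
# K and lam are scoring-system-dependent; the defaults are placeholders.

def evalue(score, n, m, K=0.1, lam=0.5):
    """Expected number of distinct local alignments with similarity
    >= score between random sequences of lengths n and m:
    E = K * n * m * exp(-lam * S)."""
    return K * n * m * math.exp(-lam * score)

def pvalue(score, n, m, K=0.1, lam=0.5):
    """Probability of at least one such alignment: 1 - exp(-E)."""
    return 1.0 - math.exp(-evalue(score, n, m, K, lam))

# Doubling the database size m raises the score needed for a fixed
# E-value by log(2)/lam, consistent with the log(n*m) growth noted above.
```

This is why search tools report scores in "bits" normalized by lam and K: the same alignment becomes less significant as the database grows.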
It must be noted that neither the motivation nor, surely, all of the ideas that went into our derivation of this local algorithm were unique to us. Many others, including Sankoff, Sellers, Goad [29], and even Dayhoff, were working on similar ideas, and had we not proposed this solution, one of them would have arrived there shortly. As one of the authors of this algorithm I have always enjoyed telling my own graduate students that my contribution was just adding a zero in the right place at the right time. But even that would have meant little without Waterman's input and his solid mathematical proof that the result remained a rigorous application of Bellman's dynamic programming logic [11]. It is not obvious from the iterative formulation of the algorithm that it is invariant under the reversal of the two sequences, or even that the elements lying between the position associated with the maximum H value and the first nonzero position reached in the traceback constitute the optimal alignment. There have been many efficiency improvements to the original formulation of the local or Smith–Waterman alignment algorithm, beginning with reducing its cubic time complexity [30] and identifying all
nearby suboptimal similar common subsequences. The latter is conceptually very important, since both s(ai,bj) and W(k) are normally obtained as averages over some data set of biologically "believed" aligned sequences, none of which can be assumed with much certainty to have the same selective history or even structural sequence constraints as any new sequence searched for its maximally similar subsequence against all of the currently available data. But most importantly, this algorithm led to the development of very fast heuristic algorithms that obtain generally identical results, these being first FastA [31] and then Blast [32]. Blast and its associated variants have become today's standard for searching very large genomic sequence databases for common or homologous subsequences. There have been many applications of our algorithm to biological problems and to comparisons with other sequence match tools, as attested by the original paper's large citation list. One of these allows me to end this short historic review with one more Smith and Waterman anecdote. While on sabbatical at Yale University, Mike came to visit me, and on our way to lunch we passed through the Yale Geology Department. There stood two stratigraphic columns with strings connecting similar strata within the two columns—a sequence alignment of similar sediments! Given that Sankoff had recently pointed out to us that researchers studying bird songs had identified similar subsequences via time warping [33], we now faced the possibility that the geologists had also solved the problem before us! After a somewhat depressing lunch we went to the Geology Department chairman's office and asked. Lo and behold, this was an unsolved problem in geology! This resulted in our first geology paper [34], basically written over the next couple of days.

REFERENCES

1. Zuckerkandl, E. and L. C. Pauling. Molecules as documents of evolutionary history. Journal of Theoretical Biology, 8:357–8, 1965.
2. Proceedings published as: Bryson, V. and H. Vogel (Eds.), Evolving Genes and Proteins. Academic Press, New York, 1965.
3. Margoliash, E. and E. Smith. Structural and functional aspects of cytochrome c in relation to evolution. In V. Bryson and H. Vogel (Eds.), Evolving Genes and Proteins (pp. 221–42). Academic Press, New York, 1965.
4. Kaplan, N. Evolution of dehydrogenases. In V. Bryson and H. Vogel (Eds.), Evolving Genes and Proteins (pp. 243–78). Academic Press, New York, 1965.
5. Zuckerkandl, E. and L. Pauling. Evolutionary divergence and convergence in proteins. In V. Bryson and H. Vogel (Eds.), Evolving Genes and Proteins (pp. 97–166). Academic Press, New York, 1965.
6. Fitch, W. and E. Margoliash. Construction of phylogenetic trees. Science, 155:279–84, 1967.
7. Sanger, F. The structure of insulin. In D. Green (Ed.), Currents in Biochemical Research. Interscience, New York, 1956.
8. Dayhoff, M. Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Silver Spring, Md., 1969.
9. Fitch, W. An improved method of testing for evolutionary homology. Journal of Molecular Biology, 16:9–16, 1966.
10. Needleman, S. B. and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443–53, 1970.
11. Bellman, R. Dynamic Programming. Princeton University Press, Princeton, N.J., 1957.
12. Sankoff, D. Matching sequences under deletion/insertion constraints. Proceedings of the National Academy of Sciences USA, 69:4–6, 1972.
13. Reichert, T., D. Cohen and A. Wong. An application of information theory to genetic mutations and the matching of polypeptide sequences. Journal of Theoretical Biology, 42:245–61, 1973.
14. Sellers, P. On the theory and computation of evolutionary distances. SIAM Journal of Applied Mathematics, 26:787–93, 1974.
15. Waterman, M., T. F. Smith and W. A. Beyer. Some biological sequence metrics. Advances in Mathematics, 20:367–87, 1976.
16. Beyer, W. A., M. L. Stein, T. F. Smith and S. M. Ulam. A molecular sequence metric and evolutionary trees. Mathematical Biosciences, 19:9–25, 1974.
17. Smith, T. F. The history of the genetic sequence databases. Genomics, 6:701–7, 1990.
18. Berget, S., A. Berk, T. Harrison and P. Sharp. Spliced segments at the 5′ termini of adenovirus-2 late mRNA: a role for heterogeneous nuclear RNA in mammalian cells. Cold Spring Harbor Symposia on Quantitative Biology, XLII:523–30, 1977.
19. Breathnach, R., C. Benoist, K. O'Hare, F. Gannon and P. Chambon. Ovalbumin gene: evidence for leader sequence in mRNA and DNA sequences at the exon-intron boundaries. Proceedings of the National Academy of Sciences USA, 75:4853–7, 1978.
20. Broker, T. R., L. T. Chow, A. R. Dunn, R. E. Gelinas, J. A. Hassell, D. F. Klessig, J. B. Lewis, R. J. Roberts and B. S. Zain. Adenovirus-2 messengers—an example of baroque molecular architecture. Cold Spring Harbor Symposia on Quantitative Biology, XLII:531–54, 1977.
21. Jeffreys, A. and R. Flavell. The rabbit β-globin gene contains a large insert in the coding sequence. Cell, 12:1097–1108, 1977.
22. Smith, T. and M. Waterman. New stratigraphic correlation techniques. Journal of Geology, 88:451–7, 1980.
23. Dayhoff, M. Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Silver Spring, Md., 1972.
24. Waterman, M. Sequence alignments. In M. S. Waterman (Ed.), Mathematical Methods for DNA Sequences (pp. 53–92). CRC Press, Boca Raton, Fla., 1989.
25. Smith, T. and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–7, 1981.
26. Smith, T. F. and C. Burks. Searching for sequence similarities. Nature, 301:174, 1983.
27. Arratia, R. and M. Waterman. An Erdös–Renyi law with shifts. Advances in Mathematics, 55:13–23, 1985.
28. Karlin, S. and S. F. Altschul. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences USA, 87:2264–8, 1990.
29. Goad, W. and M. Kanehisa. Pattern recognition in nucleic acid sequences: a general method for finding local homologies and symmetries. Nucleic Acids Research, 10:247–63, 1982.
30. Gotoh, O. An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162:705–8, 1982.
31. Pearson, W. R. and D. J. Lipman. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences USA, 85:2444–8, 1988.
32. Altschul, S. F., W. Gish, W. Miller, E. W. Myers and D. J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215:403–10, 1990.
33. Bradley, D. W. and R. A. Bradley. Application of sequence comparison to the study of bird songs. In D. Sankoff and J. B. Kruskal (Eds.), Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison (pp. 189–210). Addison-Wesley, Reading, Mass., 1983.
34. Smith, T. F. and M. S. Waterman. New stratigraphic correlation techniques. Journal of Geology, 88:451–7, 1980.
6 Complete Prokaryotic Genomes: Reading and Comprehension

Michael Y. Galperin & Eugene V. Koonin
The windfall of complete genomic sequences in the past ten years has dramatically changed the face of biology, which has started to lose its purely descriptive character. Instead, biology is gradually becoming a quantitative discipline that deals with firmly established numerical values and can be realistically described by mathematical models. Indeed, we now know that, for example, the bacterium Mycoplasma genitalium has a single chromosome which consists of 580,074 base pairs and carries genes for three ribosomal RNAs (5S, 16S, and 23S), 36 tRNAs, and 478 proteins [1,2]. We also know that about a hundred of the protein-coding genes can be disrupted without impairing the ability of this bacterium to grow on a synthetic peptide-rich broth containing the necessary nutrients [3], suggesting that the truly minimal gene set necessary for cellular life might be even smaller, in the 300–350 gene range [4–6]. Furthermore, we know that the cell of Aquifex aeolicus, with its 1521 protein-coding genes, is capable of autonomous, autotrophic existence in the environment, requiring for growth only hydrogen, oxygen, carbon dioxide, and mineral salts [7]. These observations bring us to the brink of finally answering the 60-year-old question posed in Erwin Schrödinger's "What is Life?" Furthermore, although the descriptions of greatly degraded organelle-like cells such as Buchnera aphidicola [8,9] and giant cell-like mimiviruses [10] necessarily complicate the picture, analysis of these genomes allows an even better understanding of what is necessary and what is dispensable for cellular life. That said, every microbial genome contains genes whose products have never been experimentally characterized and which lack experimentally characterized close homologs. The numbers of these "hypothetical" genes vary from a handful in small genomes of obligate parasites and/or symbionts of eukaryotes [11] to thousands in the much larger genomes of environmental microorganisms.
As discussed previously, the existence of proteins with unknown function even in model organisms, such as Escherichia coli, Bacillus subtilis, or yeast Saccharomyces cerevisiae, poses a challenge not just to functional genomics but also to biology in general [12]. While some of these
represent species-specific "ORFans" which might account for the idiosyncrasies of each particular organism [13,14], there are also hundreds of "conserved hypothetical" proteins with a relatively broad phyletic distribution. As long as we do not understand the functions of a significant fraction of the genes in any given genome, "complete" understanding of these organisms as biological systems remains a moving target. Therefore, before attempting to disentangle the riveting complexity of interactions between the parts of biological machines and to develop theoretical and experimental models of these machines—the stated goals of systems biology—it will be necessary to gain at least a basic understanding of the role of each part. Fortunately, it appears that the central pathways of information processing and metabolism are already known, and the existing models of the central metabolism in simple organisms (e.g., obligate parasites such as Haemophilus influenzae or Helicobacter pylori) adequately describe the key processes [15–17]. However, even in these organisms, there are hundreds of uncharacterized proteins, which are expressed even under the best growth conditions [18,19], let alone under nutritional, oxidative, or acidic stress. We will not be able to create full-fledged metabolic models accommodating stress-induced changes of the metabolism without a much better understanding of these processes, which critically depends on elucidation of the functions of uncharacterized and poorly characterized proteins. Here, we briefly review the current state of functional annotation of complete genomes and discuss what can be realistically expected from complete genome sequences in the near future.

KNOWN, KNOWN UNKNOWN, AND UNKNOWN UNKNOWN PROTEINS
The analysis of the first several sequenced genomes proved to be an exciting but also a humbling exercise. It turned out that, even in the genomes of the best-studied organisms, such as E. coli, B. subtilis, or yeast, less than half of all genes have ever been studied experimentally or assigned a phenotype [12]. For a certain portion of the genes, typically 30–40% of the genome, general functions could be assigned based on subtle sequence similarities of their products to experimentally characterized distant homologs. However, for at least 30–35% of genes in most genomes, there was no clue as to their cellular function. For convenience, borrowing the terminology from a popular expression, we shall refer to those genes whose function we know (or, rather, think that we know) as "knowns"; to those genes whose function we can describe only in some general terms as "known unknowns"; and to those genes whose function remains completely enigmatic as "unknown unknowns." This classification will allow us to consider each of these classes of genes separately, concentrating on the
specific problems in understanding—and properly annotating—their functions.

Not All "Knowns" Are Really Understood
Whenever biologists talk about annotation of gene functions in sequenced genomes, they complain about the lack of solid data. The most typical question is, "Is this real (i.e., experimental) data or just a computer-based prediction?" It may come as a great surprise to anybody not immediately involved in database management that (i) experimental data are not always trustworthy and (ii) computational predictions are not always unsubstantiated. It is true, however, that gene and protein annotations in most public and commercial databases are often unreliable. To get the most out of them, one needs to understand the underlying causes of this unreliability, which include misinterpretation of experimental results, misidentification of multidomain proteins, and changes in function due to nonorthologous gene displacement and enzyme recruitment. These errors are exacerbated by propagation through the databases due to the (semi)automatic annotation methods used in most genome sequencing projects. In many cases this results in biologically senseless annotation, which may sometimes be amusing but often becomes really annoying. Nevertheless, there will never be enough time and funds to experimentally validate all gene annotations in even the simplest genomes. Therefore, one necessarily has to rely on annotation generated with computational approaches. It is important, however, to distinguish between actual, experimental functional assignments and those derived from them on the basis of sequence similarity (sometimes questionable). Sometimes wrong annotations appear in the database because of an actual error in an experimental paper. The best example is probably tRNA-guanine transglycosylase (EC 2.4.2.29), the enzyme that inserts queuine (7-deazaguanine) into the first position of the anticodon of several tRNAs. While the bacterial enzyme is relatively well characterized [20], the eukaryotic one, reportedly consisting of two subunits, is not.
Ten years ago, both subunits of the human enzyme were purified and partially sequenced [21,22]. Later, however, it turned out that the putative N-terminal fragment of the 32 kD subunit (GenBank accession no. AAB34767) actually belongs to a short-chain dehydrogenase, a close homolog of peroxisomal 2,4-dienoyl-coenzyme A reductase (GenBank accession no. AF232010), while the putative 60 kD subunit (GenBank accession no. L37420) actually turned out to be a ubiquitin C-terminal hydrolase. Although the correct sequence of the eukaryotic tRNA-guanine transglycosylase, homologous to the bacterial form and consisting of a single 44 kD subunit, was later determined [23] and deposited in GenBank (accession no. AF302784), the original erroneous assignments persist in the public databases.
In many cases, however, the blame for erroneous functional assignments lies not with the experimentalists but with genome annotators, since functional calls are often made (semi)automatically, based solely on the definition line of the top BLAST hit. Even worse, this process often strips "putative" from tentative names, such as "putative protoporphyrin oxidase," assigned by the original authors. This leads to the paradoxical situation in which a gene referred to by the original authors only as hemX (GenBank accession no. CAA31772) is confidently annotated as uroporphyrin-III C-methyltransferase in E. coli strain O157:H7, Salmonella typhimurium, Yersinia pestis, and many other bacteria, even though it has never been studied experimentally and lacks the easily recognizable S-adenosylmethionine-binding sequence motif. Despite repeated warnings, for example in [24], this erroneous annotation found its way into the recently sequenced genomes of Chromobacterium violaceum, Photorhabdus luminescens, and Y. pseudotuberculosis. This seems to be a manifestation of the "crowd effect": so many proteins of the HemX family have been misannotated that their sheer number convinces a casual user of the databases that this annotation must be true. Several examples of such persistent erroneous annotations are listed in table 6.1. Loss of such qualifiers as "putative," "predicted," or "potential" is a common cause of confusion, as it produces a seemingly conclusive annotation out of inconclusive experimental data. For example, the ABC1 (activity of bc1) gene was originally described as a yeast nuclear gene whose product was required for the correct assembly and functioning of the mitochondrial cytochrome bc1 complex [25,26]. Later, mutations in this locus were shown to affect ubiquinone biosynthesis and to coincide with previously described ubiquinone biosynthesis mutations, ubiB in E. coli and COQ8 in yeast [27,28].
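By way of contrast with the top-hit practice criticized above, a transfer rule can be written to preserve uncertainty rather than discard it. The sketch below is a toy: the hit records, field names, and cutoffs are invented for illustration and do not correspond to any real annotation pipeline.

```python
# Toy sketch of annotation transfer that keeps evidence qualifiers,
# in contrast to the "take the top BLAST hit and strip 'putative'"
# practice criticized in the text. Hit records and thresholds are
# invented for illustration.

def transfer_annotation(hits, identity_cutoff=0.4):
    """hits: list of dicts with 'description', 'identity' (0..1), and
    'experimental' (True if the hit itself was characterized in the lab).
    Returns an annotation string that preserves uncertainty."""
    if not hits:
        return "hypothetical protein"
    best = max(hits, key=lambda h: h["identity"])
    desc = best["description"]
    if best["identity"] < identity_cutoff:
        # Too distant to transfer a specific function.
        return "conserved hypothetical protein"
    if not best["experimental"] or desc.lower().startswith("putative"):
        # Never promote a tentative name to a confident one.
        qualifier = "" if desc.lower().startswith("putative") else "putative "
        return qualifier + desc
    return desc + " (by similarity)"

print(transfer_annotation(
    [{"description": "putative protoporphyrin oxidase",
      "identity": 0.62, "experimental": False}]))
```

The point of the sketch is simply that the qualifier survives the transfer: a "putative" name stays putative, and a similarity-only call is labeled as such.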
Still, the authors were careful not to claim that the ABC1 protein directly participates in ubiquinone biosynthesis, as the sequence of this protein clearly identifies it as a membrane-bound kinase, closely related to Ser/Thr protein kinases. Hence, the simplest suggestion was (and still remains) that ABC1 is a protein kinase that regulates ubiquinone biosynthesis and bc1 complex formation. Nevertheless, members of the ABC1 family are often annotated as "ubiquinone biosynthesis protein" or even "2-octaprenylphenol hydroxylase." Even worse, because the name of this family is similar to that of the ATP-binding cassette (ABC) transporters, ABC1 homologs are sometimes annotated as ABC transporters. Given that the original explanation that ABC1 might function as a chaperone has recently been questioned [29], one has to conclude that this ubiquitous and obviously important protein is misannotated in almost every organism.

The story of the ABC1 protein shows that, besides being wrong per se, erroneous or incomplete annotations often obscure the fact that we actually might not know the function of such a misannotated protein. Another good example is the now famous case of the HemK protein family, originally annotated as an "unidentified gene that is involved in the biosynthesis of heme in Escherichia coli" [30], then recognized as a methyltransferase unrelated to heme biosynthesis [31] and reannotated as an adenine-specific DNA methylase [32]. The recent characterization of this protein as a glutamine N5-methyltransferase of peptide release factors [33,34] revealed the importance of this posttranslational modification, which had been previously overlooked [35,36]. Remarkably, orthologs of HemK in humans and other eukaryotes are still annotated (without experimental support) as DNA methyltransferases. As a result, the role and extent of glutamine N5-methylation in eukaryotic proteins remains obscure.

Another example of the annotation of a poorly characterized protein as if it were a "known" is the TspO/CrtK/MBR family of integral membrane proteins, putative signaling proteins found in representatives of all domains of life, from archaea to human [37,38]. These proteins, alternatively referred to as tryptophan-rich sensory proteins or as peripheral-type mitochondrial benzodiazepine receptors, contain five predicted transmembrane segments with 12–14 well-conserved aromatic amino acid residues, including seven Trp residues [37]. They have been shown to regulate photosynthesis gene expression in Rhodobacter sphaeroides, to respond to nutrient stress in Sinorhizobium meliloti, and to bind various benzodiazepines, tetrapyrrols, and steroids, including cholesterol, protoporphyrin IX, and many others [37,39–41]. None of these functions, however, would explain the role of these proteins in B. subtilis or Archaeoglobus fulgidus, which do not carry out photosynthesis and have no known affinity for benzodiazepines or steroids. Thus, instead of describing the function of these proteins, at least in bacteria, the existing annotation obscures the fact that it still remains enigmatic. In practical terms, this means that any talk about "complete" understanding of the bacterial cell, or even about creating the complete "parts list," should be taken with a grain of salt. There always remains a possibility that a confidently annotated gene (protein) might have a different function, in addition to or even instead of what has been assumed previously. Having said that, the problems with "knowns" are few and far between, particularly compared with the problems with annotation of "known unknowns" and "unknown unknowns."

Table 6.1 Some commonly misannotated protein families

Protein name | Protein family COG | Pfam | Erroneous annotation | More appropriate annotation
E. coli HemX | 2959 | 04375 | Uroporphyrinogen III methylase | Uncharacterized protein, HemX family [84]
E. coli NusB | 0781 | 01029 | N utilization substance protein B | Transcription antitermination factor [84]
E. coli PgmB | 0406 | 00300 | Phosphoglycerate mutase 2 | Broad specificity phosphatase [85]
E. coli PurE | 0041 | 00731 | Phosphoribosylaminoimidazole carboxylase | Phosphoribosylcarboxyaminoimidazole mutase [86,87]
M. jannaschii MJ0010 | 3635 | 01676 | Phosphonopyruvate decarboxylase | Cofactor-independent phosphoglycerate mutase [88,89]
M. jannaschii MJ0697 | 1889 | 01676 | Fibrillarin (nucleolar protein 1) | rRNA methylase [90,91]
T. pallidum TP0953 | 1916 | 01963 | Pheromone shutdown protein | Uncharacterized protein, TraB family [92]
A. fulgidus AF0238 | 0130 | 01509 | Centromere/microtubule-binding protein | tRNA pseudouridine synthase [93,94]
R. baltica RB4770 | 0661 | 03109 | ABC transporter | Predicted Ser/Thr protein kinase, regulates ubiquinone biosynthesis
F. tularensis FTT1298 | 0661 | 03109 | 2-polyprenylphenol 6-hydroxylase | Predicted Ser/Thr protein kinase, regulates ubiquinone biosynthesis
O. sativa OJ1191_G08.42 | K1839 | 05303 | Eukaryotic translation initiation factor 3 subunit | Uncharacterized protein with a TPR repeat, CLU1 family [95]

Known Unknowns
As noted above, even relatively small and simple microbial genomes contain numerous genes whose precise functions cannot be assigned with any degree of confidence. While some of these are "ORFans," the number of genes that are found in different phylogenetic lineages and are commonly referred to as "conserved hypothetical" keeps increasing with every new sequenced genome. As repeatedly noted, annotation of an open reading frame as a "conserved hypothetical protein" does not necessarily mean that the function of its product is completely unknown, still less that its existence is questionable [12,24,42]. Generally speaking, if a conserved protein is found in several genomes, it is not really hypothetical anymore (see [43] for a discussion of possible exceptions). Even when a newly sequenced protein has no close homologs with known function, it is often possible to make a general prediction of its function based on: (1) subtle sequence similarity to a previously characterized protein, presence of a conserved sequence
motif, or a diagnostic structural feature [12,24,42]; (2) “genomic context,” that is, gene neighborhood or domain fusion data and phyletic patterns for the given protein family [24,44]; or (3) a specific expression pattern or protein–protein interaction data [19,45,46]. The methods in the first group are homology-based and rely on transfer of function from previously characterized proteins that possess the same structural fold and, typically, belong to the same structural superfamily. Since proteins with similar structures usually display similar biochemical properties and catalyze reactions with similar chemical mechanisms, homology-based methods often allow the prediction of the general biochemical activity of the protein in question but tell little about the specific biological process(es) it might be involved in. Table 6.2 lists several well-known superfamilies of enzymes with the biochemical reactions that they catalyze and the biological processes in which they are involved. One can see that the assignment of a novel protein to a particular superfamily is hardly sufficient for deducing its biological function. This is why we refer to such proteins as “known unknowns.” Genome context-based methods of functional prediction do not rely on homology of the given protein to any previously characterized one. Instead, these methods derive predictions for uncharacterized genes from experimental or homology-based functional assignments for the genes that are either located next to the gene in question or are present (or absent) in the same organisms as that gene [44,47–55]. Thus, identification of homologs by sequence and structural similarity still plays a crucial role in the genome context methods, even if indirectly. In this type of analysis, the reliability of predictions for unknowns critically depends on the accuracy of the functional assignments for the neighboring gene(s) and the strength of the association between the two. 
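The phyletic-pattern component of these genome context methods can be sketched in a few lines: genes whose presence/absence vectors across the same set of genomes are similar are candidates for participation in the same pathway, while a complementary pattern hints at nonorthologous displacement of the same function. The gene names and patterns below are invented, and real analyses span hundreds of genomes.

```python
# Illustration of phyletic-pattern comparison: each gene is represented
# as a presence/absence vector over the same ordered list of genomes.
# Genes and patterns are invented for illustration.

def jaccard(p, q):
    """Similarity of two presence/absence patterns."""
    both = sum(1 for x, y in zip(p, q) if x and y)
    either = sum(1 for x, y in zip(p, q) if x or y)
    return both / either if either else 0.0

patterns = {
    "known_pathway_gene": (1, 1, 0, 1, 1, 0, 1, 0),
    "unknown_A":          (1, 1, 0, 1, 1, 0, 1, 0),  # same pattern
    "unknown_B":          (0, 0, 1, 0, 0, 1, 0, 1),  # complementary pattern
}

query = patterns["known_pathway_gene"]
ranked = sorted(
    ((jaccard(query, p), name)
     for name, p in patterns.items() if name != "known_pathway_gene"),
    reverse=True)
# unknown_A shares the pattern and is a functional-association candidate;
# the complementary pattern of unknown_B is the signature of a possible
# nonorthologous displacement of the same function.
```

As the text notes, such a prediction inherits the reliability of the functional assignments of the pattern-mates; it is a lead for experiments, not an annotation in itself.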
On the plus side, these assignments are purely computational, do not require any experimental analysis, and can be performed on a genomic scale. Such assignments proved particularly successful in identification of candidates for filling gaps (i.e., reactions lacking an assigned enzyme) in metabolic pathways [24,56–59]. These approaches also worked well for multi-subunit protein complexes like the proteasome, DNA repair systems, or the RNA-degrading exosome [60–62]. Table 6.3 shows some nontrivial genome context-based computational predictions that have been subsequently verified by direct experimental studies. Domain fusions (the so-called Rosetta Stone method) also can be used to deduce protein function [46,51,63]. This approach proved to be particularly fruitful in the analysis of signal transduction pathways, which include numerous multidomain proteins with a great variety of domain combinations. Sequence analysis of predicted signaling proteins encoded in bacterial and eukaryotic genomes revealed complex domain architectures and allowed the identification
Complete Prokaryotic Genomes
173
Table 6.2 The range of biological functions among representatives of some common superfamilies of enzymes (a)

Biochemical function | Biological function (pathway)

Acid phosphatase superfamily [96,97]
Phosphatidic acid phosphatase | Lipid metabolism
Diacylglycerol pyrophosphate phosphatase | Lipid metabolism, signaling
Glucose-6-phosphatase | Gluconeogenesis, regulation

ATP-grasp superfamily [98–101]
ATP-citrate lyase | TCA cycle
Biotin carboxylase | Fatty acid biosynthesis
Carbamoyl phosphate synthase | Pyrimidine biosynthesis
D-ala-D-ala ligase | Peptidoglycan biosynthesis
Glutathione synthetase | Redox regulation
Succinyl-CoA synthetase | TCA cycle
Lysine biosynthesis protein LysX | Lysine biosynthesis
Malate thiokinase | Serine cycle
Phosphoribosylamine-glycine ligase | Purine biosynthesis
Protein S6-glutamate ligase | Modification of the ribosome
Tubulin-tyrosine ligase | Microtubule assembly regulation
Synapsin | Regulation of nerve synapses

HAD superfamily [102–105]
Glycerol-3-phosphatase | Osmoregulation
Haloacid dehalogenase | Haloacid degradation
Histidinol phosphatase | Histidine biosynthesis
Phosphoserine phosphatase | Serine, pyridoxal biosynthesis
Phosphoglycolate phosphatase | Sugar metabolism (DNA repair)
Phosphomannomutase | Protein glycosylation
P-type ATPase | Cation transport
Sucrose phosphate synthase | Sugar metabolism

Alkaline phosphatase superfamily [106–108]
Alkaline phosphatase | Phosphate scavenging
N-Acetylgalactosamine 4-sulfatase | Chondroitin sulfate degradation
Nucleotide pyrophosphatase | Cellular signaling
Phosphoglycerate mutase | Glycolysis
Phosphoglycerol transferase | Osmoregulation
Phosphopentomutase | Nucleotide catabolism
Steroid sulfatase | Estrogen biosynthesis
Streptomycin-6-phosphatase | Streptomycin biosynthesis

(a) Not all enzymatic activities or biological functions found in a given superfamily are listed.
of a number of novel conserved domains [38,64–68]. While the exact functions (e.g., ligand specificity) of some of these domains remain obscure, the association of these domains with known components of the signal transduction machinery strongly suggests their involvement in signal transduction [38,67,68]. Over the past several years, functional
174
Genomics
Table 6.3 Recently verified genome context-based functional assignments

Protein name | COG | Pfam | Assigned function | References
E. coli NadR | 3172 | — | Ribosylnicotinamide kinase | [109,110]
E. coli YjbN | 0042 | 01207 | tRNA-dihydrouridine synthase | [111]
B. subtilis YgaA | 2070 | 03060 | Enoyl-ACP reductase | [112]
H. pylori HP1533 | 1351 | 02511 | Thymidylate synthase | [44,113]
Human COASY | 1019 | 01467 | Phosphopantetheine adenylyltransferase | [114–116]
M. jannaschii MJ1440 | 1685 | 00288 | Shikimate kinase | [117]
M. jannaschii MJ1249 | 1465 | 01959 | 3-Dehydroquinate synthase | [24,118]
P. aeruginosa PA2081 | 1878 | 04199 | Kynurenine formamidase | [119]
P. furiosus PF1956 | 1830 | 01791 | Fructose 1,6-bisphosphate aldolase | [44,120–122]
P. horikoshii PH0272 | 0346 | 00903 | Methylmalonyl-CoA racemase | [123]
S. pneumoniae SP0415 | 1024 | 00378 | trans-2,cis-3-decenoyl-ACP isomerase | [124]
S. pneumoniae SP0415 | 2070 | 03060 | Enoyl-ACP reductase | [125]
predictions were made for many of these proteins. Several remarkable examples of recently characterized and still uncharacterized signaling domains are listed in table 6.4. Sometimes a functional assignment for a “known unknown” gene can be made on the basis of experimental data on its expression under particular conditions (e.g., under nutritional stress), its coimmunoprecipitation with another, functionally characterized protein, or two-hybrid and other protein–protein interaction screens [45,69]. Although these annotation methods are not entirely computational, the results of large-scale gene expression and protein–protein interaction experiments are increasingly available in public databases and can be searched without performing actual experiments [69–73] (see [74] for a complete listing). These data, however, have an obvious drawback. Even if there is convincing evidence that a given protein is induced, for example by phosphate starvation or UV irradiation, this observation says virtually nothing about the actual biological function of the protein [19].
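The genome context signal that underlies many of the predictions discussed above can be illustrated with a small sketch of the phylogenetic profile method [54]. The gene names and the presence/absence matrix below are hypothetical, chosen only to show the idea: genes that are consistently present or absent together across genomes are candidates for participation in the same pathway or complex.

```python
# Toy phylogenetic profiles: 1 = gene present, 0 = gene absent in each
# of five hypothetical genomes. All names here are made up for illustration.
profiles = {
    "geneA": (1, 0, 1, 1, 0),   # functionally characterized gene
    "yxxX":  (1, 0, 1, 1, 0),   # "known unknown" with the same profile
    "geneB": (1, 1, 1, 1, 1),   # universal gene, uninformative here
}

def cooccurring(gene, profiles):
    """Return genes whose phyletic pattern exactly matches that of `gene`."""
    target = profiles[gene]
    return [g for g, p in profiles.items() if g != gene and p == target]

print(cooccurring("geneA", profiles))  # ['yxxX']
```

Real applications compare profiles over hundreds of genomes and tolerate a few mismatches (e.g., by Hamming distance) rather than demanding exact identity, since gene loss and nonorthologous displacement introduce noise.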
Table 6.4 Poorly characterized protein domains involved in signal transduction

Domain name | COG | Pfam | Predicted function | Reference
Cache | — | 02743 | Extracytoplasmic sensor domain | [126]
CHASE | 3614 | 03924 | Extracytoplasmic sensor domain | [127,128]
CHASE2 | 4252 | 05226 | Extracytoplasmic sensor domain | [68]
CHASE3 | 5278 | 05227 | Extracytoplasmic sensor domain | [68]
CHASE4 | 3322 | 05228 | Extracytoplasmic sensor domain | [68]
CHASE5 | — | — | Extracytoplasmic sensor domain | [68]
CHASE6 | 4250 | — | Extracytoplasmic sensor domain | [68]
KdpD | 2205 | 02702 | Cytoplasmic turgor-sensing domain | [129]
MHYT | 3300 | 03707 | Membrane-bound metal-binding sensor domain | [130]
MHYE | — | — | Membrane-bound metal-binding sensor domain | [131]
MASE1 | 3447 | 05231 | Membrane-bound sensor domain | [132]
MASE2 | — | 05230 | Membrane-bound sensor domain | [132]
PfoR | 1299 | — | Membrane-bound sensor domain | [133,134]
PutP-like | 0591 | — | Membrane-bound sensor domain | [135]
HDOD | 1639 | 08668 | Signal output domain | [38]
TspO | 3476 | 03073 | Membrane-bound tryptophan-rich sensory protein | [37,39,41]
Likewise, even if an interaction of two proteins is well documented, it is often difficult to judge (1) whether this interaction is biologically relevant, that is, whether it occurs inside the cell at physiological concentrations of the interacting components; and (2) in which way, if at all, this interaction affects the function of the “known” member of the pair. Therefore, the clues to function provided by gene expression and protein–protein interaction experiments are usually just that—disparate clues—that require careful analysis by a well-educated biologist to arrive at even a tentative functional prediction. For this reason, many genes whose altered expression has been documented in microarray experiments still have to be assigned to the category of “unknown unknowns.” An important exception to that rule is a group of poorly characterized genes whose products participate in cell division (table 6.5). Although a significant fraction of such genes have no known (or predicted) enzymatic activity, they still qualify as “known unknowns” based on mutation phenotypes and protein–protein interaction data. Table 6.5 lists some of the known cell division genes and the available sparse clues as to their roles in the process of cell division.
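The guilt-by-association reasoning described in this section can be sketched in a few lines. The interaction pairs and annotations below are hypothetical, and the caveat from the text is built in: a tentative label is suggested only when at least two independently annotated partners agree, rather than trusting any single interaction.

```python
from collections import Counter

# Hypothetical interaction screen results and functional annotations;
# all gene names and pairings here are invented for illustration only.
interactions = [
    ("yabX", "ftsZ"), ("yabX", "ftsA"), ("yabX", "rpoB"),
]
annotations = {"ftsZ": "cell division", "ftsA": "cell division",
               "rpoB": "transcription"}

def tentative_function(gene, interactions, annotations, min_support=2):
    """Suggest a function for `gene` only if >= min_support annotated
    partners share the same annotation; otherwise return None."""
    partners = [b for a, b in interactions if a == gene] + \
               [a for a, b in interactions if b == gene]
    counts = Counter(annotations[p] for p in partners if p in annotations)
    best, support = counts.most_common(1)[0] if counts else (None, 0)
    return best if support >= min_support else None

print(tentative_function("yabX", interactions, annotations))  # cell division
```

Raising `min_support` trades recall for precision, which mirrors the text's warning that isolated interaction or expression clues rarely justify a functional assignment on their own.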
Table 6.5 Cell division proteins of unknown biochemical function

Protein name | COG | Pfam | Functional assignment
BolA | 0271 | 01722 | Stress-induced morphogen
CrcB | 0239 | 02537 | Integral membrane protein possibly involved in chromosome condensation
DivIVA | 3599 | 05103 | Cell division initiation protein
EzrA | 4477 | 06160 | Negative regulator of septation ring formation
FtsB | 2919 | 04977 | Initiator of septum formation
FtsL | 3116 | 04999 | Cell division protein
FtsL | 4839 | — | Protein required for the initiation of cell division
FtsN | 3087 | — | Cell division protein
FtsW | 0772 | 01098 | Bacterial cell division membrane protein
FtsX | 2177 | 02687 | Cell division protein, putative permease
IspA | 2917 | 04279 | Intracellular septation protein A
Maf | 0424 | 02545 | Nucleotide-binding protein implicated in inhibition of septum formation
MukB | 3096 | 04310 | Uncharacterized protein involved in chromosome partitioning
MukE | 3095 | 04288 | Uncharacterized protein involved in chromosome partitioning
MukF | 3006 | 03882 | Uncharacterized protein involved in chromosome partitioning
RelE | 2026 | 05016 | Translational repressor of toxin–antitoxin stability system
SpoIID | 2385 | 08486 | Sporulation protein
StbD | 2161 | — | Antitoxin of toxin–antitoxin stability system
ZipA | 3115 | 04354 | Cell division membrane protein, interacts with FtsZ
Unknown Unknowns
By definition, “unknown unknowns” are those genes that cannot be assigned a biochemical function and have no clearly defined biological function either. A recent survey of the most common “unknown unknowns” showed that, although many of them have a wide phyletic distribution, very few (if any) are truly universal [42]. Many “unknown unknowns” are conserved only within one or more of the major divisions of life (bacteria, archaea, or eukaryotes) or, more often, are restricted to a particular phylogenetic lineage, such as proteobacteria or fungi. The presence of a gene in all representatives of a particular phylogenetic lineage suggests that it might perform a function that is essential for the organisms of that lineage. In contrast, many “unknown unknowns” have a patchy phylogenetic distribution, being present in some representatives of a given lineage and absent from other representatives of the same lineage. This patchy distribution is likely to reflect
Table 6.6 Uncharacterized proteins of known structure

Protein name | COG | Pfam | PDB code | Tentative annotation
BrtG | 2105 | 03674 | 1v30 | Uncharacterized enzyme, butirosin synthesis
NIP7 | 1374 | 03657 | 1sqw | Possible RNA-binding protein
RtcB | 1690 | 01139 | 1uc2 | Possible role in RNA modification
TM0613 | 2250 | 05168 | 1o3u | Uncharacterized protein
TT1751 | 3439 | 03625 | 1j3m | Uncharacterized protein
YbgI | 0327 | 01784 | 1nmo | Possible transcriptional regulator
YchN | 1553 | 01205 | 1jx7 | Uncharacterized protein
YebC | 0217 | 01709 | 1kon | Potential role in DNA recombination
YgfB | 3079 | 03595 | 1izm | Uncharacterized protein
YjeF | 0062 | 03853 | 1jzt | Possible role in RNA processing
YigZ | 1739 | 01205 | 1vi7 | Possible enzyme of sugar metabolism
YodA | 3443 | — | 1s7d | Cadmium-induced protein
frequent horizontal gene transfer and/or gene loss, suggesting that the encoded function is not essential for cell survival. This nonessentiality, at least under standard laboratory conditions, could be the cause of the lack of easily detectable phenotypes, which makes these genes “unknown unknowns” in the first place. Progress in structural genomics has led to a paradoxical situation in which a significant fraction of “unknown unknown” proteins have known 3D structures [19,75–77], which, however, does not really help in functional assignment. Table 6.6 lists some such “unknown unknown” proteins with determined 3D structures.

CONCLUSION
In conclusion, improved understanding of the cell as a biological system critically depends on improvements in functional annotation. As long as there are numerous poorly characterized genes in every sequenced microbial genome, there always remains a chance that some key component of cell metabolism or of a signal response mechanism has been overlooked [42]. The recent discoveries of the deoxyxylulose pathway for terpenoid biosynthesis in bacteria [78,79] and of the cyclic diguanylate (c-di-GMP)-based bacterial signaling system [38,67,80] indicate that these suspicions are not unfounded. Furthermore, several key metabolic enzymes have been described only in the past two to three years, indicating that there still are gaping holes in our understanding of microbial cell metabolism [58,81]. Recognizing the problem and identifying and enumerating these
holes through metabolic reconstruction [59,71] or integrated analysis approaches such as Gene Ontology [82,83] is a necessary prerequisite to launching projects that would aim at closing those holes (see [81]). Nevertheless, we would like to emphasize that the number of completely enigmatic “unknown unknowns” is very limited, particularly in the small genomes of heterotrophic parasitic bacteria. For many other uncharacterized genes, there are clear predictions of enzymatic activity that could (and should) be tested experimentally. It would still take a significant effort to create the complete “parts list,” that is, a catalog of the functions of all genes, even for the relatively simple bacteria and yeast [36]. However, the number of genes in these genomes is relatively small and the end of the road is already in sight.
REFERENCES

1. Fraser, C. M., J. D. Gocayne, O. White, et al. The minimal gene complement of Mycoplasma genitalium. Science, 270:397–403, 1995.
2. Dandekar, T., M. Huynen, J. T. Regula, et al. Re-annotating the Mycoplasma pneumoniae genome sequence: adding value, function and reading frames. Nucleic Acids Research, 28:3278–88, 2000.
3. Hutchison, C. A., S. N. Peterson, S. R. Gill, et al. Global transposon mutagenesis and a minimal Mycoplasma genome. Science, 286:2165–9, 1999.
4. Mushegian, A. R., and E. V. Koonin. A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proceedings of the National Academy of Sciences USA, 93:10268–73, 1996.
5. Koonin, E. V. How many genes can make a cell: the minimal-gene-set concept. Annual Reviews in Genomics and Human Genetics, 1:99–116, 2000.
6. Peterson, S. N., and C. M. Fraser. The complexity of simplicity. Genome Biology, 2:comment2002.1–2002.8, 2001.
7. Deckert, G., P. V. Warren, T. Gaasterland, et al. The complete genome of the hyperthermophilic bacterium Aquifex aeolicus. Nature, 392:353–8, 1998.
8. Andersson, J. O. Evolutionary genomics: is Buchnera a bacterium or an organelle? Current Biology, 10:R866–8, 2000.
9. Gil, R., F. J. Silva, E. Zientz, et al. The genome sequence of Blochmannia floridanus: comparative analysis of reduced genomes. Proceedings of the National Academy of Sciences USA, 100:9388–93, 2003.
10. Raoult, D., S. Audic, C. Robert, et al. The 1.2-Mb genome sequence of mimivirus. Science, 306:1344–50, 2004.
11. Shimomura, S., S. Shigenobu, M. Morioka, et al. An experimental validation of orphan genes of Buchnera, a symbiont of aphids. Biochemical and Biophysical Research Communications, 292:263–7, 2002.
12. Galperin, M. Y. Conserved “hypothetical” proteins: new hints and new puzzles. Comparative and Functional Genomics, 2:14–18, 2001.
13. Siew, N., and D. Fischer. Analysis of singleton ORFans in fully sequenced microbial genomes. Proteins, 53:241–51, 2003.
14. Siew, N., Y. Azaria and D. Fischer. The ORFanage: an ORFan database. Nucleic Acids Research, 32:D281–3, 2004.
15. Schilling, C. H., and B. O. Palsson. Assessment of the metabolic capabilities of Haemophilus influenzae Rd through a genome-scale pathway analysis. Journal of Theoretical Biology, 203:249–83, 2000.
16. Schilling, C. H., M. W. Covert, I. Famili, et al. Genome-scale metabolic model of Helicobacter pylori 26695. Journal of Bacteriology, 184:4582–93, 2002.
17. Raghunathan, A., N. D. Price, M. Y. Galperin, et al. In silico metabolic model and protein expression of Haemophilus influenzae strain Rd KW20 in rich medium. OMICS: A Journal of Integrative Biology, 8:25–41, 2004.
18. Kolker, E., S. Purvine, M. Y. Galperin, et al. Initial proteome analysis of model microorganism Haemophilus influenzae strain Rd KW20. Journal of Bacteriology, 185:4593–602, 2003.
19. Kolker, E., K. S. Makarova, S. Shabalina, et al. Identification and functional analysis of “hypothetical” genes expressed in Haemophilus influenzae. Nucleic Acids Research, 32:2353–61, 2004.
20. Ferre-D’Amare, A. R. RNA-modifying enzymes. Current Opinion in Structural Biology, 13:49–55, 2003.
21. Slany, R. K., and S. O. Muller. tRNA-guanine transglycosylase from bovine liver. Purification of the enzyme to homogeneity and biochemical characterization. European Journal of Biochemistry, 230:221–8, 1995.
22. Deshpande, K. L., P. H. Seubert, D. M. Tillman, et al. Cloning and characterization of cDNA encoding the rabbit tRNA-guanine transglycosylase 60-kilodalton subunit. Archives of Biochemistry and Biophysics, 326:1–7, 1996.
23. Deshpande, K. L., and J. R. Katze. Characterization of cDNA encoding the human tRNA-guanine transglycosylase (TGT) catalytic subunit. Gene, 265:205–12, 2001.
24. Koonin, E. V., and M. Y. Galperin. Sequence—Evolution—Function: Computational Approaches in Comparative Genomics. Kluwer Academic Publishers, Boston, 2002.
25. Bousquet, I., G. Dujardin and P. P. Slonimski. ABC1, a novel yeast nuclear gene has a dual function in mitochondria: it suppresses a cytochrome b mRNA translation defect and is essential for the electron transfer in the bc1 complex. EMBO Journal, 10:2023–31, 1991.
26. Brasseur, G., G. Tron, G. Dujardin, et al. The nuclear ABC1 gene is essential for the correct conformation and functioning of the cytochrome bc1 complex and the neighbouring complexes II and IV in the mitochondrial respiratory chain. European Journal of Biochemistry, 246:103–11, 1997.
27. Poon, W. W., D. E. Davis, H. T. Ha, et al. Identification of Escherichia coli ubiB, a gene required for the first monooxygenase step in ubiquinone biosynthesis. Journal of Bacteriology, 182:5139–46, 2000.
28. Do, T. Q., A. Y. Hsu, T. Jonassen, et al. A defect in coenzyme Q biosynthesis is responsible for the respiratory deficiency in Saccharomyces cerevisiae abc1 mutants. Journal of Biological Chemistry, 276:18161–8, 2001.
29. Hsieh, E. J., J. B. Dinoso and C. F. Clarke. A tRNA(TRP) gene mediates the suppression of cbs2-223 previously attributed to ABC1/COQ8. Biochemical and Biophysical Research Communications, 317:648–53, 2004.
30. Nakayashiki, T., K. Nishimura and H. Inokuchi. Cloning and sequencing of a previously unidentified gene that is involved in the biosynthesis of heme in Escherichia coli. Gene, 153:67–70, 1995.
31. Le Guen, L., R. Santos and J. M. Camadro. Functional analysis of the hemK gene product involvement in protoporphyrinogen oxidase activity in yeast. FEMS Microbiology Letters, 173:175–82, 1999.
32. Bujnicki, J. M., and M. Radlinska. Is the HemK family of putative S-adenosylmethionine-dependent methyltransferases a “missing” zeta subfamily of adenine methyltransferases? A hypothesis. IUBMB Life, 48:247–9, 1999.
33. Nakahigashi, K., N. Kubo, S. Narita, et al. HemK, a class of protein methyl transferase with similarity to DNA methyl transferases, methylates polypeptide chain release factors, and hemK knockout induces defects in translational termination. Proceedings of the National Academy of Sciences USA, 99:1473–8, 2002.
34. Heurgue-Hamard, V., S. Champ, A. Engstrom, et al. The hemK gene in Escherichia coli encodes the N5-glutamine methyltransferase that modifies peptide release factors. EMBO Journal, 21:769–78, 2002.
35. Clarke, S. The methylator meets the terminator. Proceedings of the National Academy of Sciences USA, 99:1104–6, 2002.
36. Roberts, R. J. Identifying protein function—a call for community action. PLoS Biology, 2:E42, 2004.
37. Yeliseev, A. A., and S. Kaplan. TspO of Rhodobacter sphaeroides. A structural and functional model for the mammalian peripheral benzodiazepine receptor. Journal of Biological Chemistry, 275:5657–67, 2000.
38. Galperin, M. Y. Bacterial signal transduction network in a genomic perspective. Environmental Microbiology, 6:552–67, 2004.
39. Gavish, M., I. Bachman, R. Shoukrun, et al. Enigma of the peripheral benzodiazepine receptor. Pharmacological Reviews, 51:629–50, 1999.
40. Lacapere, J. J., and V. Papadopoulos. Peripheral-type benzodiazepine receptor: structure and function of a cholesterol-binding protein in steroid and bile acid biosynthesis. Steroids, 68:569–85, 2003.
41. Davey, M. E., and F. J. de Bruijn. A homologue of the tryptophan-rich sensory protein TspO and FixL regulate a novel nutrient deprivation-induced Sinorhizobium meliloti locus. Applied and Environmental Microbiology, 66:5353–9, 2000.
42. Galperin, M. Y., and E. V. Koonin. “Conserved hypothetical” proteins: prioritization of targets for experimental study. Nucleic Acids Research, 32:5452–63, 2004.
43. Natale, D. A., M. Y. Galperin, R. L. Tatusov, et al. Using the COG database to improve gene recognition in complete genomes. Genetica, 108:9–17, 2000.
44. Galperin, M. Y., and E. V. Koonin. Who’s your neighbor? New computational approaches for functional genomics. Nature Biotechnology, 18:609–13, 2000.
45. Marcotte, E. M., M. Pellegrini, H. L. Ng, et al. Detecting protein function and protein–protein interactions from genome sequences. Science, 285:751–3, 1999.
46. Marcotte, E. M., M. Pellegrini, M. J. Thompson, et al. A combined algorithm for genome-wide prediction of protein function. Nature, 402:83–6, 1999.
47. Overbeek, R., M. Fonstein, M. D’Souza, et al. The use of contiguity on the chromosome to predict functional coupling. In Silico Biology, 1:93–108, 1998.
48. Overbeek, R., M. Fonstein, M. D’Souza, et al. The use of gene clusters to infer functional coupling. Proceedings of the National Academy of Sciences USA, 96:2896–901, 1999.
49. Huynen, M., B. Snel, W. Lathe, 3rd, et al. Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Research, 10:1204–10, 2000.
50. Snel, B., P. Bork and M. A. Huynen. The identification of functional modules from the genomic association of genes. Proceedings of the National Academy of Sciences USA, 99:5890–5, 2002.
51. Dandekar, T., B. Snel, M. Huynen, et al. Conservation of gene order: a fingerprint of proteins that physically interact. Trends in Biochemical Sciences, 23:324–8, 1998.
52. Tatusov, R. L., E. V. Koonin and D. J. Lipman. A genomic perspective on protein families. Science, 278:631–7, 1997.
53. Gaasterland, T., and M. A. Ragan. Microbial genescapes: phyletic and functional patterns of ORF distribution among prokaryotes. Microbial and Comparative Genomics, 3:199–217, 1998.
54. Pellegrini, M., E. M. Marcotte, M. J. Thompson, et al. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proceedings of the National Academy of Sciences USA, 96:4285–8, 1999.
55. Tatusov, R. L., M. Y. Galperin, D. A. Natale, et al. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Research, 28:33–6, 2000.
56. Dandekar, T., S. Schuster, B. Snel, et al. Pathway alignment: application to the comparative analysis of glycolytic enzymes. Biochemical Journal, 343:115–24, 1999.
57. Huynen, M. A., T. Dandekar and P. Bork. Variation and evolution of the citric acid cycle: a genomic perspective. Trends in Microbiology, 7:281–91, 1999.
58. Osterman, A., and R. Overbeek. Missing genes in metabolic pathways: a comparative genomics approach. Current Opinion in Chemical Biology, 7:238–51, 2003.
59. Green, M. L., and P. D. Karp. A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics, 5:76, 2004.
60. Koonin, E. V., Y. I. Wolf and L. Aravind. Prediction of the archaeal exosome and its connections with the proteasome and the translation and transcription machineries by a comparative-genomic approach. Genome Research, 11:240–52, 2001.
61. Verma, R., L. Aravind, R. Oania, et al. Role of Rpn11 metalloprotease in deubiquitination and degradation by the 26S proteasome. Science, 298:611–15, 2002.
62. Makarova, K. S., L. Aravind, N. V. Grishin, et al. A DNA repair system specific for thermophilic Archaea and bacteria predicted by genomic context analysis. Nucleic Acids Research, 30:482–96, 2002.
63. Enright, A. J., I. Iliopoulos, N. C. Kyrpides, et al. Protein interaction maps for complete genomes based on gene fusion events. Nature, 402:86–90, 1999.
64. Aravind, L., and C. P. Ponting. The GAF domain: an evolutionary link between diverse phototransducing proteins. Trends in Biochemical Sciences, 22:458–9, 1997.
65. Aravind, L., and C. P. Ponting. The cytoplasmic helical linker domain of receptor histidine kinase and methyl-accepting proteins is common to many prokaryotic signalling proteins. FEMS Microbiology Letters, 176:111–16, 1999.
66. Taylor, B. L., and I. B. Zhulin. PAS domains: internal sensors of oxygen, redox potential, and light. Microbiology and Molecular Biology Reviews, 63:479–506, 1999.
67. Galperin, M. Y., A. N. Nikolskaya and E. V. Koonin. Novel domains of the prokaryotic two-component signal transduction system. FEMS Microbiology Letters, 203:11–21, 2001.
68. Zhulin, I. B., A. N. Nikolskaya and M. Y. Galperin. Common extracellular sensory domains in transmembrane receptors for diverse signal transduction pathways in bacteria and archaea. Journal of Bacteriology, 185:285–94, 2003.
69. Salwinski, L., C. S. Miller, A. J. Smith, et al. The Database of Interacting Proteins: 2004 update. Nucleic Acids Research, 32:D449–51, 2004.
70. Edgar, R., M. Domrachev and A. E. Lash. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30:207–10, 2002.
71. Karp, P. D., M. Riley, M. Saier, et al. The EcoCyc Database. Nucleic Acids Research, 30:56–8, 2002.
72. Munch, R., K. Hiller, H. Barg, et al. PRODORIC: prokaryotic database of gene regulation. Nucleic Acids Research, 31:266–9, 2003.
73. Makita, Y., M. Nakao, N. Ogasawara, et al. DBTBS: database of transcriptional regulation in Bacillus subtilis and its contribution to comparative genomics. Nucleic Acids Research, 32:D75–7, 2004.
74. Galperin, M. Y. The Molecular Biology Database Collection: 2005 update. Nucleic Acids Research, 33:D5–24, 2005.
75. Gilliland, G. L., A. Teplyakov, G. Obmolova, et al. Assisting functional assignment for hypothetical Haemophilus influenzae gene products through structural genomics. Current Drug Targets and Infectious Disorders, 2:339–53, 2002.
76. Frishman, D. What we have learned about prokaryotes from structural genomics. OMICS: A Journal of Integrative Biology, 7:211–24, 2003.
77. Kim, S. H., D. H. Shin, I. G. Choi, et al. Structure-based functional inference in structural genomics. Journal of Structural and Functional Genomics, 4:129–35, 2003.
78. Eisenreich, W., F. Rohdich and A. Bacher. Deoxyxylulose phosphate pathway to terpenoids. Trends in Plant Sciences, 6:78–84, 2001.
79. Eisenreich, W., A. Bacher, D. Arigoni, et al. Biosynthesis of isoprenoids via the non-mevalonate pathway. Cellular and Molecular Life Sciences, 61:1401–26, 2004.
80. Jenal, U. Cyclic di-guanosine-monophosphate comes of age: a novel secondary messenger involved in modulating cell surface structures in bacteria? Current Opinion in Microbiology, 7:185–91, 2004.
81. Karp, P. D. Call for an enzyme genomics initiative. Genome Biology, 5:401, 2004.
82. Harris, M. A., J. Clark, A. Ireland, et al. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research, 32:D258–61, 2004.
83. Camon, E., D. Barrell, V. Lee, et al. The Gene Ontology Annotation (GOA) Database—an integrated resource of GO annotations to the UniProt Knowledgebase. In Silico Biology, 4:5–6, 2004.
84. Sasarman, A., Y. Echelard, J. Letowski, et al. Nucleotide sequence of the hemX gene, the third member of the Uro operon of Escherichia coli K12. Nucleic Acids Research, 16:11835, 1988.
85. Rigden, D. J., I. Bagyan, E. Lamani, et al. A cofactor-dependent phosphoglycerate mutase homolog from Bacillus stearothermophilus is actually a broad specificity phosphatase. Protein Science, 10:1835–46, 2001.
86. Mathews, I. I., T. J. Kappock, J. Stubbe, et al. Crystal structure of Escherichia coli PurE, an unusual mutase in the purine biosynthetic pathway. Structure with Folding and Design, 7:1395–1406, 1999.
87. Thoden, J. B., T. J. Kappock, J. Stubbe, et al. Three-dimensional structure of N5-carboxyaminoimidazole ribonucleotide synthetase: a member of the ATP grasp protein superfamily. Biochemistry, 38:15480–92, 1999.
88. van der Oost, J., M. A. Huynen and C. H. Verhees. Molecular characterization of phosphoglycerate mutase in archaea. FEMS Microbiology Letters, 212:111–20, 2002.
89. Graham, D. E., H. Xu and R. H. White. A divergent archaeal member of the alkaline phosphatase binuclear metalloenzyme superfamily has phosphoglycerate mutase activity. FEBS Letters, 517:190–4, 2002.
90. Feder, M., J. Pas, L. S. Wyrwicz, et al. Molecular phylogenetics of the RrmJ/fibrillarin superfamily of ribose 2’-O-methyltransferases. Gene, 302:129–38, 2003.
91. Deng, L., N. G. Starostina, Z. J. Liu, et al. Structure determination of fibrillarin from the hyperthermophilic archaeon Pyrococcus furiosus. Biochemical and Biophysical Research Communications, 315:726–32, 2004.
92. An, F. Y., and D. B. Clewell. Characterization of the determinant (traB) encoding sex pheromone shutdown by the hemolysin/bacteriocin plasmid pAD1 in Enterococcus faecalis. Plasmid, 31:215–21, 1994.
93. Koonin, E. V. Pseudouridine synthases: four families of enzymes containing a putative uridine-binding motif also conserved in dUTPases and dCTP deaminases. Nucleic Acids Research, 24:2411–15, 1996.
94. Lafontaine, D. L., C. Bousquet-Antonelli, Y. Henry, et al. The box H+ACA snoRNAs carry Cbf5p, the putative rRNA pseudouridine synthase. Genes and Development, 12:527–37, 1998.
95. Fields, S. D., M. N. Conrad and M. Clarke. The S. cerevisiae CLU1 and D. discoideum cluA genes are functional homologues that influence mitochondrial morphology and distribution. Journal of Cell Science, 111:1717–27, 1998.
96. Stukey, J., and G. M. Carman. Identification of a novel phosphatase sequence motif. Protein Science, 6:469–72, 1997.
97. Neuwald, A. F. An unexpected structural relationship between integral membrane phosphatases and soluble haloperoxidases. Protein Science, 6:1764–7, 1997.
98. Fan, C., P. C. Moews, Y. Shi, et al. A common fold for peptide synthetases cleaving ATP to ADP: glutathione synthetase and D-alanine:D-alanine ligase of Escherichia coli. Proceedings of the National Academy of Sciences USA, 92:1172–6, 1995.
99. Artymiuk, P. J., A. R. Poirrette, D. W. Rice, et al. Biotin carboxylase comes into the fold. Nature Structural Biology, 3:128–32, 1996.
100. Murzin, A. G. Structural classification of proteins: new superfamilies. Current Opinion in Structural Biology, 6:386–94, 1996.
101. Galperin, M. Y., and E. V. Koonin. A diverse superfamily of enzymes with ATP-dependent carboxylate-amine/thiol ligase activity. Protein Science, 6:2639–43, 1997.
102. Koonin, E. V., and R. L. Tatusov. Computer analysis of bacterial haloacid dehalogenases defines a large superfamily of hydrolases with diverse specificity. Application of an iterative approach to database search. Journal of Molecular Biology, 244:125–32, 1994.
103. Aravind, L., M. Y. Galperin and E. V. Koonin. The catalytic domain of the P-type ATPase has the haloacid dehalogenase fold. Trends in Biochemical Sciences, 23:127–9, 1998.
104. Collet, J. F., V. Stroobant, M. Pirard, et al. A new class of phosphotransferases phosphorylated on an aspartate residue in an amino-terminal DXDX(T/V) motif. Journal of Biological Chemistry, 273:14107–12, 1998.
105. Collet, J. F., V. Stroobant and E. Van Schaftingen. Mechanistic studies of phosphoserine phosphatase, an enzyme related to P-type ATPases. Journal of Biological Chemistry, 274:33985–90, 1999.
106. Grana, X., L. de Lecea, M. R. el-Maghrabi, et al. Cloning and sequencing of a cDNA encoding 2,3-bisphosphoglycerate-independent phosphoglycerate mutase from maize. Possible relationship to the alkaline phosphatase family. Journal of Biological Chemistry, 267:12797–803, 1992.
107. Galperin, M. Y., A. Bairoch and E. V. Koonin. A superfamily of metalloenzymes unifies phosphopentomutase and cofactor-independent phosphoglycerate mutase with alkaline phosphatases and sulfatases. Protein Science, 7:1829–35, 1998.
108. Galperin, M. Y., and M. J. Jedrzejas. Conserved core structure and active site residues in alkaline phosphatase superfamily enzymes. Proteins, 45:318–24, 2001.
109. Kurnasov, O. V., B. M. Polanuyer, S. Ananta, et al. Ribosylnicotinamide kinase domain of NadR protein: identification and implications in NAD biosynthesis. Journal of Bacteriology, 184:6906–17, 2002.
110. Singh, S. K., O. V. Kurnasov, B. Chen, et al. Crystal structure of Haemophilus influenzae NadR protein. A bifunctional enzyme endowed with NMN adenylyltransferase and ribosylnicotinamide kinase activities. Journal of Biological Chemistry, 277:33291–9, 2002.
111. Bishop, A. C., J. Xu, R. C. Johnson, et al. Identification of the tRNA-dihydrouridine synthase family. Journal of Biological Chemistry, 277:25090–5, 2002.
112. Heath, R. J., N. Su, C. K. Murphy, et al. The enoyl-[acyl-carrier-protein] reductases FabI and FabL from Bacillus subtilis. Journal of Biological Chemistry, 275:40128–33, 2000.
113. Myllykallio, H., G. Lipowski, D. Leduc, et al. An alternative flavin-dependent mechanism for thymidylate synthesis. Science, 297:105–7, 2002.
114. Daugherty, M., B. Polanuyer, M. Farrell, et al. Complete reconstitution of the human coenzyme A biosynthetic pathway via comparative genomics. Journal of Biological Chemistry, 277:21431–9, 2002.
115. Aghajanian, S., and D. M. Worrall. Identification and characterization of the gene encoding the human phosphopantetheine adenylyltransferase and dephospho-CoA kinase bifunctional enzyme (CoA synthase). Biochemical Journal, 365:13–18, 2002. 116. Zhyvoloup, A., I. Nemazanyy, A. Babich, et al. Molecular cloning of CoA synthase. The missing link in CoA biosynthesis. Journal of Biological Chemistry, 277:22107–10, 2002. 117. Daugherty, M., V. Vonstein, R. Overbeek, et al. Archaeal shikimate kinase, a new member of the GHMP-kinase family. Journal of Bacteriology, 183:292–300, 2001. 118. White, R. H. L-Aspartate semialdehyde and a 6-deoxy-5-ketohexose 1-phosphate are the precursors to the aromatic amino acids in Methanocaldococcus jannaschii. Biochemistry, 43:7618–27, 2004. 119. Kurnasov, O., L. Jablonski, B. Polanuyer, et al. Aerobic tryptophan degradation pathway in bacteria: novel kynurenine formamidase. FEMS Microbiology Letters, 227:219–27, 2003. 120. Galperin, M. Y., L. Aravind and E. V. Koonin. Aldolases of the DhnA family: a possible solution to the problem of pentose and hexose biosynthesis in archaea. FEMS Microbiology Letters, 183:259–64, 2000. 121. Siebers, B., H. Brinkmann, C. Dorr, et al. Archaeal fructose-1,6bisphosphate aldolases constitute a new family of archaeal type class I aldolase. Journal of Biological Chemistry, 276:28710–18, 2001. 122. Lorentzen, E., B. Siebers, R. Hensel, et al. Structure, function and evolution of the Archaeal class I fructose-1,6-bisphosphate aldolase. Biochemical Society Transactions, 32:259–63, 2004. 123. Bobik, T. A., and M. E. Rasche. Identification of the human methylmalonylCoA racemase gene based on the analysis of prokaryotic gene arrangements. Implications for decoding the human genome. Journal of Biological Chemistry, 276:37194–8, 2001. 124. Marrakchi, H., K. H. Choi and C. O. Rock. A new mechanism for anaerobic unsaturated fatty acid formation in Streptococcus pneumoniae. 
Journal of Biological Chemistry, 277:44809–16, 2002.
125. Marrakchi, H., W. E. Dewolf, Jr., C. Quinn, et al. Characterization of Streptococcus pneumoniae enoyl-(acyl-carrier protein) reductase (FabK). Biochemical Journal, 370:1055–62, 2003.
126. Anantharaman, V., and L. Aravind. Cache—a signaling domain common to animal Ca2+-channel subunits and a class of prokaryotic chemotaxis receptors. Trends in Biochemical Sciences, 25:535–7, 2000.
127. Mougel, C., and I. B. Zhulin. CHASE: an extracellular sensing domain common to transmembrane receptors from prokaryotes, lower eukaryotes and plants. Trends in Biochemical Sciences, 26:582–4, 2001.
128. Anantharaman, V., and L. Aravind. The CHASE domain: a predicted ligand-binding module in plant cytokinin receptors and other eukaryotic and bacterial receptors. Trends in Biochemical Sciences, 26:579–82, 2001.
129. Heermann, R., A. Fohrmann, K. Altendorf, et al. The transmembrane domains of the sensor kinase KdpD of Escherichia coli are not essential for sensing K+ limitation. Molecular Microbiology, 47:839–48, 2003.
130. Galperin, M. Y., T. A. Gaidenko, A. Y. Mulkidjanian, et al. MHYT, a new integral membrane sensor domain. FEMS Microbiology Letters, 205:17–23, 2001.
Genomics
131. Nikolskaya, A. N., and M. Y. Galperin. A novel type of conserved DNA-binding domain in the transcriptional regulators of the AlgR/AgrA/LytR family. Nucleic Acids Research, 30:2453–9, 2002.
132. Nikolskaya, A. N., A. Y. Mulkidjanian, I. B. Beech, et al. MASE1 and MASE2: two novel integral membrane sensory domains. Journal of Molecular Microbiology and Biotechnology, 5:11–16, 2003.
133. Awad, M. M., and J. I. Rood. Perfringolysin O expression in Clostridium perfringens is independent of the upstream pfoR gene. Journal of Bacteriology, 184:2034–8, 2002.
134. Savic, D. J., W. M. McShan and J. J. Ferretti. Autonomous expression of the slo gene of the bicistronic nga-slo operon of Streptococcus pyogenes. Infection and Immunity, 70:2730–3, 2002.
135. Häse, C. C., N. D. Fedorova, M. Y. Galperin, et al. Sodium ion cycle in bacterial pathogens: evidence from cross-genome comparisons. Microbiology and Molecular Biology Reviews, 65:353–70, 2001.
7 Protein Structure Prediction
Jeffrey Skolnick & Yang Zhang
Over the past decade, the success of genome sequencing efforts has brought about a paradigm shift in biology [1]. There is increasing emphasis on the large-scale, high-throughput examination of all genes and gene products of an organism, with the aim of assigning their functions [2]. Of course, biological function is multifaceted, ranging from molecular/biochemical to cellular or physiological to phenotypical [3]. In practice, knowledge of the DNA sequence of an organism and the identification of its open reading frames (ORFs) does not directly provide functional insight. Here, the focus is on the proteins in a genome, namely, the proteome; it is recognized that proteins are only a subset of all biologically important molecules, and the discussion addresses aspects of molecular/biochemical function and protein–protein interactions. At present, evolutionary-based approaches can provide insights into some features of the biological function of about 40–60% of the ORFs in a given proteome [4]. However, purely evolutionary-based approaches increasingly fail as the protein families become more distant [5], and predicting the functions of the unassigned ORFs in a genome remains an important challenge. Because the biochemical function of a protein is ultimately determined by both the identity of the functionally important residues and the three-dimensional structure of the functional site, protein structures represent an essential tool in annotating genomes [6–11]. The recognition of the role that structure can play in elucidating function is one impetus for structural genomics, which aims for high-throughput protein structure determination [12]. Another is to provide a complete library of solved protein structures so that an arbitrary sequence is within modeling distance of an already known structure [13]. Then, the protein folding problem, that is, the prediction of a protein's structure from its amino acid sequence, could be solved by enumeration.
In practice, the ability to generate accurate models from distantly related templates will dictate the number of protein folds that need to be determined experimentally [14–16]. Protein–protein interactions, which are involved in virtually all cellular processes [17], represent another arena where protein structure prediction could play an important role. This area is in ferment, with considerable concern about the accuracy and consistency of high-throughput experimental methods [18].
In what follows, an overview of the areas that comprise the focus of this chapter is presented. First, the state of the art of protein structure prediction is discussed. Then, the status of approaches to biochemical function prediction based on both protein sequence and structure is reviewed, followed by the status of approaches for determining protein–protein interactions. Next, some recent promising advances in these areas are described. In the concluding section, the status of the field and directions for future research are summarized.
BACKGROUND
Historically, protein structure prediction approaches are divided into three general categories: Comparative Modeling (CM) [19], threading [20], and New Fold methods or ab initio folding [21–23], which are schematically depicted in figure 7.1. In CM, the protein's structure is predicted by aligning the target protein's sequence to an evolutionarily related template sequence with a solved structure in the PDB [24]; that is, two homologous sequences are aligned, and a three-dimensional model is built based on this alignment [25]. In threading, the goal is to match the target sequence whose structure is unknown to a template that adopts a known structure, whether or not the target and template are evolutionarily related [26]. Threading should thus identify analogous folds, that is, cases where target and template adopt a similar fold without an apparent evolutionary relationship [27–29]. Note that the distinction between these approaches is becoming increasingly blurred [29–31]. Certainly, the general approach of CM and threading is the same: identify a structurally related template, identify an alignment between the target sequence and the template structure, build a continuous, full-length model, and then refine the resulting structure [26]. Ab initio folding usually refers to approaches that model protein structures on the basis of physicochemical principles. However, many recently developed New Fold/ab initio approaches often exploit evolutionary and threading information [30] (e.g., predicted secondary structure or contacts), although some versions are more physics-based [32]; perhaps such approaches should be referred to as semi-first principles. Indeed, a number of groups have developed approaches spanning the range from CM to ab initio folding [29,30] that performed reasonably well in CASP5, the fifth biennial community-wide experiment to assess the status of the field of protein structure prediction [33].
Comparative Modeling
Comparative Modeling (CM) can be used to predict the structure of those proteins whose sequence identity is above 30% with a template protein sequence [34], although progress has been reported at lower sequence identity [26]. An obvious limitation is that it requires a homologous
Figure 7.1 Schematic overview of the methodologies employed in Comparative Modeling/threading and ab initio folding.
protein, the template, whose structure is known. When proteins have more than 50% sequence identity to their templates, the backbone atoms of models built by CM techniques [19] can have up to a 1 Å root-mean-square deviation (RMSD) from native; this is comparable to experimental accuracy [9]. For target proteins with 30–50% sequence identity to their templates, the backbone atoms often have about 85% of their core regions within a RMSD of 3.5 Å from native, with errors
mainly in the loops [19]. When the sequence identity drops below 30%, the accuracy of CM models decreases sharply because of the lack of significant template hits and substantial alignment errors. Sequence identity below 30% is usually termed the "twilight zone" of sequence-based alignment, and more than half of all genome sequences are at these distances from known proteins in the PDB. For all sequence identity ranges, the predicted structures are generally closer to the template on which they are based than to their native conformation [34]. This was true in the recent CASP5 protein structure prediction experiment [35]. Another issue is the accurate construction of the loops. While progress has been made for short loops [36], for longer loops significant problems remain [35]. Therefore, it is essential to develop an automated technology that can deal with proteins in the twilight zone of sequence identity and build models that are closer to the native structure than to the template on which they are based [37,38]. Many recently developed threading algorithms are beginning to identify structural analogs in the twilight zone, but little progress has been reported with regard to template refinement. Despite these limitations, CM has been applied to predict the tertiary structure of the ORFs in a number of proteomes [39]. At present, about 40–50% of all sequences have a homologous protein of known structure, with CM results compiled in the PEDANT [40], GTOP [41], MODBASE [42], and FAMS [41] databases. This percentage is slowly increasing as new structures are being solved at an increasing rate. Interestingly, most newly solved structures exhibit an already known fold [16], an issue examined below.
Threading
The formulation of a threading algorithm involves three choices. First, the interaction sites must be chosen. Due to computational complexity, these are taken to be a subset of the protein's heavy atoms and can be the Cα's [43], Cβ's [44], side-chain centers of mass [45], specially defined interaction centers [46], or any side-chain atom [47]. Second, the functional form of the energy is chosen, with examples ranging from contact [47] to continuous distance-dependent potentials [44]. The energy can include predicted secondary structure preferences [48] or burial patterns [49]. To improve both template recognition ability and the quality of the alignment, most successful threading approaches combine both sequence and structural information [27,48,50]. Third, given an energy function, the optimal alignment of the target sequence to each structural template must be found. If the "energy" terms are local (e.g., secondary structure propensities and/or sequence profiles), then dynamic programming [51] is best. If pair interactions are considered (which use a nonlocal scoring function), the interactions in the template structure must be updated to reflect the target sequence. Some approaches employ dynamic
programming with a frozen environment (with interaction partners taken from the template protein) [20], followed by iterative updating [47]; others employ double dynamic programming, which updates some interactions recognized as being the most important in the first pass of dynamic programming [52]. Other computationally more intensive variants include the actual partners in the target sequence and use Monte Carlo [53] or branch-and-bound search strategies [54]. A reasonably successful and faster alternative uses a sequence profile to align the target sequence to the template structure; then, the partners in the target sequence are used to evaluate the pair interactions [45,50]. These approaches suffer from the disadvantages that the template structure never adjusts to reflect modifications due to differences in the target and template sequences, and that one cannot do better than the best structural alignment between the template and target structures [16,55,56]. As demonstrated in CASP5 [29,30,57–60], there are now a number of threading methods that significantly outperform sequence-only approaches such as PSI-BLAST [58]. Examples include PROSPECT II [27], GENTHREADER [48], and PROSPECTOR [50]. These algorithms found some analogous [29] structural templates for targets in the fold recognition/analogous (FR/A) category [5]. However, threading has many outstanding issues in common with CM: the need to improve aligned regions and move them closer to the native structure than the initial template alignment, and the need for a good loop-building algorithm that fills in the gapped regions and generates statistically significant loop predictions. Furthermore, selection of the best model is often problematic [29].
Metapredictor-Based Approaches
CASP5/CAFASP3 demonstrated the power of Metapredictors (defined as automated predictors that combine consensus information from a variety of threading and sequence-based servers to make more accurate consensus structural predictions) such as 3D-SHOTGUN [59], PCONS [60], and ROBETTA [61], which gave results competitive with the best human predictors [59]. 3D-SHOTGUN and PCONS [60] do not simply select a model from the input models, but generate more complete and accurate hybrid models by splicing fragments from the individual models; however, these can have steric clashes, sometimes making the construction of physically realistic models impossible. Nonetheless, based on EVA [62] and LiveBench [63] results, Metapredictors are quite promising. For example, in large-scale testing, 3D-SHOTGUN produced models with up to 28% higher MaxSub scores than any of the individual methods and 17% higher specificity than any individual method. Here, the specificity is defined as the number of correct predictions with a confidence score higher than that of the first false prediction. These results illustrate the potential power of the Metaprediction approach. However, the
ultimate success of Metaprediction lies in the underlying accuracy of the individual contributing servers.
Completeness of the PDB
CM/threading approaches cannot succeed if a structure related to the target sequence is not already solved. Therefore, the key issue for their applicability is the completeness of the PDB [24]. One way to explore this issue is to use structural alignment algorithms (which find the best structural match between a pair of proteins where the labeling of residues to be matched is not fixed in advance) to establish the structural relationship between newly solved protein structures and those already in the PDB. Indeed, the best alignment between a pair of protein structures that CM/threading can exploit is obtained from a structural alignment. One class of structural alignment algorithms employs dynamic programming [55], whose advantage is speed, but global optimality is not guaranteed. DALI [64] compares the intrastructural residue–residue distances in a pair of structures. Others [65,66] compare spatial arrangements of secondary structure elements. Bachar et al. [67] employ geometric hashing, while an incremental combinatorial extension (CE) method that combines structurally similar fragments was employed by Shindyalov and Bourne [68]. Kedem et al. [69] define the unit-vector RMS to detect chain segment similarities, and MAMMOTH [70] employs a heuristic algorithm to align low-resolution structures and assigns their significance via an extreme value distribution. Several authors compared a set of representative structures in the PDB [71] and emphasized the discreteness of structural space on the domain level of protein structures. On the other hand, using their CE method, Shindyalov and Bourne [72] recently pointed out that substructures obtained from an all-against-all structure comparison sometimes distribute among protein domains transgressing their respective fold types. These substructures are continuous chains of around 130 residues, much longer than the conventional concept of supersecondary structure [73]. Harrison et al. 
also concluded that fold space is a continuum for some topology types in the β or α/β secondary structure class [74]. These studies suggest that there are rather large structure motifs of significant length that occur in many other folds. Yang and Honig [75] also showed that their structure comparison program detects structural similarity between different folds in the SCOP database [76]. This indicates that some regions of protein fold space are not as distinct as once thought. Recently, using a more sensitive structure alignment algorithm, SAL, Kihara and Skolnick demonstrated that for low-to-moderate resolution structures, the PDB is essentially complete for single-domain proteins [16]. That is, the global fold of essentially all single-domain proteins can be found among the already solved structures in the PDB. Furthermore, protein structure space is very dense. The problem is to develop a threading
algorithm that can find these related template structures/good alignments and build a model useful for functional annotation [77]. As shown below, there has been significant progress in this direction, but additional work needs to be done before the protein structure prediction problem can be viewed as being solved, at least by enumeration.
Inference of Biochemical Function from Structure
Currently, most methods that assign the molecular/biochemical function of proteins are based on finding protein sequence homology [78] or conserved protein sequence or structural motifs [79–82] between the uncharacterized protein and a protein of known biochemical function. However, such methods often fail as the sequence identity drops below 40%. Because the global fold of a protein family is more conserved than its sequence, protein biochemical function prediction should benefit by the inclusion of structural information [38]. However, divergent and convergent evolution gives a nonunique relationship between function and fold. In general, fold type by itself is not sufficient for correct function prediction [83,84], and additional information is required to infer biochemical function from structure. Several methods are based on three-dimensional descriptors of biologically relevant sites [7,85–90]. In addition to active site descriptors characterizing the geometric features of catalytic residues [87], a number of approaches that describe binding sites focus on the conservation of geometrical arrangements of residues [90–93], the physicochemical properties of functional residues [90,94], and/or ligand–cavity shape complementarity [95]. Many methods were specifically designed to recognize a particular type of ligand, for example, adenylate [88], calcium [92], or DNA [94], with more general methods only tested for a few ligand types [90,91]. Of interest is the recently available PINTS (Patterns in Nonhomologous Tertiary Structures) [31] approach designed to perform database searches against a collection of ligand-binding sites taken directly from PDB files [96]. Methods based on structural templates have been reasonably successful when applied to high-resolution structures. The question is what happens when predicted models of lower resolution are used? 
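As a concrete illustration of a three-dimensional site descriptor, the sketch below tests whether a set of candidate residues reproduces a template's pattern of pairwise Cα distances within a tolerance. This is not the PINTS or AFT implementation; the function names, coordinates, and tolerance are illustrative only.

```python
from itertools import permutations
import math

def pairwise_distances(coords):
    """All pairwise Euclidean distances between a small set of 3D points."""
    n = len(coords)
    return [
        math.dist(coords[i], coords[j])
        for i in range(n) for j in range(i + 1, n)
    ]

def matches_template(template_coords, candidate_coords, tol=1.0):
    """True if some ordering of the candidate residues reproduces the
    template's pairwise-distance pattern to within tol (in angstroms)."""
    ref = pairwise_distances(template_coords)
    for perm in permutations(candidate_coords):
        d = pairwise_distances(list(perm))
        if all(abs(a - b) <= tol for a, b in zip(ref, d)):
            return True
    return False

# Hypothetical 3-residue active-site template (Calpha coordinates, angstroms)
template = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
# Candidate site from a hypothetical predicted model, slightly distorted
candidate = [(0.2, 4.1, 0.1), (0.1, -0.2, 0.0), (5.2, 0.3, -0.1)]
print(matches_template(template, candidate))  # True: distances agree within 1 A
print(matches_template(template,
                       [(0.0, 0.0, 0.0), (20.0, 0.0, 0.0), (0.0, 20.0, 0.0)]))  # False
```

Real descriptors also encode residue identity, physicochemical type, and cavity shape; the brute-force permutation matching above is feasible only for the small (3–5 residue) sites typical of catalytic templates.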
Given recent improvements in the performance of protein structure prediction algorithms [29,77,97], a structure-based method for protein function prediction that does not require high-resolution structures could be of practical value. The essential issue is to establish the quality of a predicted structure required to transfer a given biochemical function at a specified level of accuracy. In practice, the ability to detect functional sites in low-to-moderate resolution predicted structures had until recently only been tested for a few specific active site descriptors [92,98]. Recently, Arakaki et al. have developed a method that automatically generates a structural library of 3D descriptors of enzyme active sites [77] (automated functional templates or AFTs; 593 in total for
162 different enzymes) based on functional and structural information extracted from public databases. The applicability to predicted structures was investigated by analyzing varying-quality decoys derived from enzyme native structures. For 35% of decoys having a 3–4 Å backbone RMSD from the native structure, the AFT-based method correctly identifies the active site and transfers the first three EC indices. A key challenge is to routinely generate predicted structures of at least this quality so that they can be used for biochemical function inference.
APPROACHES FOR DETERMINING PROTEIN–PROTEIN INTERACTIONS
Given their biological importance [17], the development of efficient methods to detect and characterize protein–protein interactions and assemblies is a major theme of functional genomics and proteomics efforts [99]. Currently, two main types of experimental methods are used: (1) yeast two-hybrid screening (Y2H) [100], which is mainly limited to binary interaction detection; and (2) the combination of large-scale affinity purification with mass spectrometry to detect and characterize multiprotein complexes [101]. First applied to yeast [102], these methods revealed the dense network of interactions linking proteins in the cell, but their error rate is high [18]. The coverage of Y2H screens seems incomplete, with many false negatives and false positives, as evidenced by the limited overlap between sets of interacting proteins identified by different groups [103] and between those identified by Y2H and other approaches [104]. This discrepancy among experimental methods prompted keen interest in the development of computational methods for inferring protein–protein interactions [105–107]. Many consider protein–protein interactions in the most general context and often refer to “functionally interacting proteins” [106], implying that the proteins cooperate to carry out a given task without actually (or necessarily) engaging in physical contact. These methods exploit the fact that the genes of such cooperating proteins tend to be associated within genomes [107]. The earliest methods considered gene fusion [107, 108], conservation of gene order [109], and co-occurrence of genes in different genomes [107] as a means of inferring functional interactions. Subsequent methods frequently use protein sequence information and are based on the idea of gene coevolution, which assumes that the genes of proteins that interact tend to evolve together [110]. 
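The gene co-occurrence idea can be sketched by comparing phylogenetic profiles, that is, presence/absence vectors of a gene across a set of genomes; similar profiles suggest a functional link. The gene names and profiles below are purely illustrative.

```python
def profile_similarity(p, q):
    """Fraction of genomes in which two genes agree on presence/absence
    (simple matching); 1.0 means identical phylogenetic profiles."""
    assert len(p) == len(q)
    return sum(a == b for a, b in zip(p, q)) / len(p)

# Hypothetical presence(1)/absence(0) profiles across eight genomes
profiles = {
    "geneA": [1, 1, 0, 1, 0, 1, 1, 0],
    "geneB": [1, 1, 0, 1, 0, 1, 0, 0],  # near-identical to geneA
    "geneC": [0, 0, 1, 0, 1, 0, 0, 1],  # complementary pattern
}
print(profile_similarity(profiles["geneA"], profiles["geneB"]))  # 0.875
print(profile_similarity(profiles["geneA"], profiles["geneC"]))  # 0.0
```

In practice, more robust measures (mutual information, hypergeometric p-values) are preferred over simple matching, since genome composition is highly nonrandom.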
Approaches based on the coevolution model include phylogenetic tree topology comparison [111], gene preservation correlation, and correlated mutation approaches [110]. These methods offer several advantages: the idea of correlated evolution is appealing a priori and fits basic biological principles. But their downside is their low signal-to-noise ratio [112]. Furthermore, methods based on a coevolution model rely on the knowledge of the phylogenetic trees of the corresponding sets of proteins [113].
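A minimal sketch of the tree-topology comparison underlying such coevolution measures: compute the Pearson correlation between the interortholog distance matrices of two protein families. All distances here are hypothetical, and real implementations work from estimated evolutionary distances rather than hand-entered matrices.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

# Hypothetical evolutionary distances among orthologs from four organisms,
# one matrix per protein family; similar trees give a high correlation.
dist_family1 = [[0.0, 0.1, 0.4, 0.5],
                [0.1, 0.0, 0.4, 0.5],
                [0.4, 0.4, 0.0, 0.2],
                [0.5, 0.5, 0.2, 0.0]]
dist_family2 = [[0.0, 0.2, 0.5, 0.6],
                [0.2, 0.0, 0.5, 0.6],
                [0.5, 0.5, 0.0, 0.3],
                [0.6, 0.6, 0.3, 0.0]]
r = pearson(upper_triangle(dist_family1), upper_triangle(dist_family2))
print(round(r, 3))  # 1.0 here: family2's distances are family1's shifted by 0.1
```

A high correlation is taken as evidence that the two families evolved under a shared constraint, such as a physical interaction.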
Given that the exact evolutionary path of a specific protein is unknown, one must infer phylogeny via a careful analysis of related proteins from different organisms, the so-called orthologs, whose identification is not straightforward [113]. Phylogenetic tree reconstruction is NP-complete [114]. Existing measures for assessing coevolution (such as the Pearson correlation coefficient) attempt to avoid this problem by considering all protein homolog pairs. They are effective when the signal is strong, but often it is not. A conceptually different set of methods uses information from protein quaternary structure, and deals more directly with the actual physical interactions between proteins. It is in this sense that protein–protein interactions are considered in what follows. These approaches not only suggest which two proteins interact, but also provide a quaternary structure. Salient examples of this class of approaches include promising extensions of homology modeling and threading techniques [115,116] and neural net-based approaches [117]. One promising extension of threading to predict protein–protein interactions is described below [118,119].
RECENT ADVANCES
Can the Protein Structure Prediction Problem Be Solved in Principle Using the Current PDB Library?
In recent studies, Skolnick et al. constructed a representative set of all single-domain proteins that have structures in the PDB (no two of which have more than 35% pairwise sequence identity to each other) ranging from 41 to 200 residues in length; there are 1489 such proteins [120], the PDB200 benchmark set. Using an improved structural alignment algorithm, SAL, they then compared these proteins to a benchmark library that is no more than 20% identical to the target protein [77,121]. The resulting average coverage and RMSD between the best template and the native structure are 84% and 2.6 Å, with an average sequence identity of 13% in the aligned regions. These results are compatible with the notion of the completeness of the PDB. Because SAL structural alignments can contain a number of gaps, it might not be possible to build biologically useful models [77], in which case the conclusion on the completeness of the PDB, while of fundamental interest, would not have practical applications. On the other hand, if the PDB were complete and useful models could be constructed, then, in principle, the protein folding problem could be solved, if one defines the protein folding problem on a purely structural level, that is, building statistically significant models that have similar topology to native (e.g., with RMSD <6.5 Å). However, to make this conclusion a reality, the development of better threading algorithms to detect all such fold similarities is required. Using the templates and
alignments identified from SAL, Skolnick et al. demonstrated for the 1489 proteins in the PDB200 benchmark set that:
1. Reasonable full-length models can be built by either MODELLER [42] or TASSER, a newly developed algorithm for threading/assembly/refinement (see below; see also figure 7.3 for a schematic overview). Therefore, the conclusion on the completeness of the PDB is of practical interest.
2. Using TASSER, consistent improvement of the models over the best structural alignments is demonstrated.
3. Significant improvements in loop modeling are found.
As stated above, the average RMSD of the SAL structural alignments to native is 2.6 Å with 84% coverage. Skolnick et al. applied TASSER [122] to build/refine full-length models for the PDB200 benchmark set. The TASSER final models show improvement over their initial template alignments. Over the same aligned regions, on average, the RMSD is reduced to 1.9 Å. Many low-resolution templates improve by refinement to structures with an acceptable resolution for biochemical function annotation [77]. For the entire chain, all but two targets (with dangling termini involved in intermolecular interactions) have an RMSD <6 Å for the best of the top five models, with an average rank of 1.7 and an average RMSD to native of 2.3 Å. In fact, 97% of the target proteins have a global RMSD <4 Å. For the rank one cluster (the highest structure density cluster), the average RMSD to native is 2.4 Å. The average RMSD of the best of the top five MODELLER (a widely used comparative modeling program) models is 3.7 Å, with an average rank of 2.9. In general, TASSER does a better job in the unaligned regions compared to MODELLER, especially for low-coverage templates (see figure 7.2). Looking at those targets with more than 90% coverage (437 in total), the average RMSDs of the full-length chain
Figure 7.2 (A) Scatter plot of RMSD from native to the final models built by TASSER refinements versus RMSD to native in the best initial template alignments identified by SAL. The same aligned regions are used in both RMSD calculations. (B) Using TASSER, the fraction of targets with an RMSD improvement d greater than some threshold value. Here d = “RMSD of template”–”RMSD of final model,” where each RMSD is calculated over the aligned regions. Each point is calculated with a bin width of 1 Å. (C) Similar data as in A, but the models are from MODELLER refinements. (D) Similar data as in B, but the models are from MODELLER refinements. (E) RMSDlocal and (F) RMSDglobal of unaligned/loop regions as a function of loop length. TASSER and MODELLER models are denoted by triangles and circles respectively. The lines connecting the points serve to guide the eye. The dashed line in F denotes an RMSDglobal cutoff of 7 Å.
models generated by TASSER and MODELLER are fairly close, that is, 1.6 Å and 2.2 Å, respectively. However, for targets with initial alignment coverage below 75% (386 in total), the average RMSDs from native to the models by TASSER and MODELLER are 2.9 Å and 6.1 Å, respectively, a significant difference. Overall, in 1120 (102) targets, the TASSER (MODELLER) models have the lower RMSD to native. In essentially all targets, using the structural alignments provided by SAL, reasonable full-length models could be built. Therefore, if one could find the templates and corresponding alignments, given the set of already solved structures in the PDB, these results are highly suggestive that the protein folding problem could be solved for single-domain proteins, if one defines the solution as the ability to generate models with a backbone RMSD below 4 Å. In figure 7.2, a detailed comparison of the final models with respect to the template in the aligned regions is plotted. TASSER models (figures 7.2A and B) often show obvious improvement, especially when templates are more than 3 Å away from native. As shown in figure 7.2B, for initial template aligned regions with an RMSD from native ranging from 2 to 3 Å, in around 61% of these cases the models have at least a 0.5 Å improvement; and for targets having initial template aligned regions with an RMSD from native ranging from 3 to 4 Å, in around 49% of these cases the models have at least a 1.0 Å improvement. This improvement occurs because the force field takes consensus information from multiple templates (the top five templates are used) [59], as well as from the clustering procedure and the energy terms in TASSER [122,123]. Thus, the ability to refine models from the best structural alignments (more precisely, those provided by SAL) is demonstrated. In contrast, figures 7.2C and D show the comparison between the models generated by MODELLER and the initial template alignments.
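The RMSD comparisons throughout presume an optimal rigid-body superposition of model onto native. A minimal sketch of the standard Kabsch procedure is given below; NumPy is assumed available, and the coordinates are illustrative.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two N x 3 coordinate arrays after optimal
    rigid-body superposition of P onto Q (Kabsch algorithm)."""
    P = P - P.mean(axis=0)              # remove translation
    Q = Q - Q.mean(axis=0)
    U, S, Vt = np.linalg.svd(P.T @ Q)   # SVD of the covariance matrix
    d = np.sign(np.linalg.det(U @ Vt))  # guard against an improper rotation
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    diff = P @ R - Q
    return float(np.sqrt((diff ** 2).sum() / len(P)))

# Q is P rotated 90 degrees about z and translated: RMSD should be ~0
P = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]])
Rz = np.array([[0.0, 1.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
Q = P @ Rz + np.array([1.0, 2.0, 3.0])
print(round(kabsch_rmsd(P, Q), 6))  # 0.0
```

Measures such as RMSDlocal and RMSDglobal differ only in which atoms enter the superposition step (the loop itself versus the flanking stem residues).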
Mainly, MODELLER keeps the topology of the models near the template [124,125]. However, sometimes (~10% of cases) the MODELLER models are >1 Å worse than the initial template values. Here an unaligned or "loop" region is defined as a piece of sequence lacking a coordinate assignment in the SAL template alignment. Since no spatial information is provided, modeling the unaligned or loop regions is difficult [125]. Following Fiser et al. [125], two measures of model accuracy are calculated: RMSDlocal denotes the root-mean-square deviation between the native and the modeled loop with direct superposition of the unaligned region, and measures the local conformational accuracy. RMSDglobal is the root-mean-square deviation between the native and modeled loop after superposition of up to five neighboring stem residues on each side of the loop, and measures both the accuracy of the local conformation and its global orientation with respect to the rest of the protein. There are 11,380 unaligned/loop
regions ranging from 1 to 84 residues in length in the 1489 targets. In figures 7.2E and F, the average values of RMSDlocal and RMSDglobal of TASSER and MODELLER models versus loop length are presented. In both cases, the accuracy decreases with increasing loop size. For all size ranges, TASSER models have lower average RMSDlocal and RMSDglobal. Focusing on the unaligned loops ≥4 residues in length, there are 1675 cases with an average length of 8.8 residues. TASSER shows obviously better control of loop orientations. For example, in one-third of the cases, TASSER generates models with an RMSDglobal <3 Å, while the fraction of MODELLER models having an RMSDglobal <3 Å is around one-seventh. Clearly, while the problem of loop modeling is definitely not solved, some progress is being made. The "New Fold" targets in CASP5 [126] were also examined by Skolnick et al., since by definition these targets putatively adopt a novel fold never seen in the PDB. Using TASSER, acceptable models can be built from the initial SAL template alignments with an average RMSD from native of 2.87 Å for the first predicted model. Hence, these putative NF targets have templates in the PDB that give reasonable structural alignments and full-length models.
The PROSPECTOR_3 Threading Algorithm
Recently, Skolnick and coworkers developed an improved threading algorithm, PROSPECTOR_3 [50], designed to identify analogous as well as homologous templates. The scoring function includes close and distant sequence profiles, secondary structure predictions from PSIPRED [28], and a variety of side-chain contact pair potentials supplemented by predicted side-chain contacts (consensus contacts in at least weakly scoring templates). Alignments are generated using a Needleman–Wunsch global alignment algorithm [51]. Based on score significance, target sequences are classified into three categories. If PROSPECTOR_3 has at least one significant hit with a Z-score (the energy in standard deviation units relative to the mean) above 15, or at least two structurally consistent template hits with Z-scores above 7, the target is very likely to have a correct template and a good alignment, and it is assigned to the “Easy set.” (Note that “Easy” does not mean that such targets are trivially identified; indeed, in the PDB200 benchmark, PROSPECTOR_3’s Easy set alone assigns more than twice as many targets to their correct templates as PSI-BLAST [127] does.) Sequences that either hit a single template with 7 < Z < 15 or hit multiple templates lacking a significant consensus structure are assigned to the “Medium set”; these have the correct fold identified in most cases, but their alignment may be incorrect. Finally, sequences not assigned to a template belong to the “Hard set”; from the point of view of the algorithm, they are New Folds, but given the finding that the PDB is complete [16], (almost) all proteins
should be assigned to either the Easy or Medium set by a “perfect” threading algorithm.
PROSPECTOR_3 was applied to the comprehensive PDB200 benchmark set described above, whose targets are no more than 30% identical to any threading template. In the latest version (slightly better than the published results of Skolnick et al. [50], reflecting minor improvements), there are 915 Easy protein targets, of which 791 have an RMSD to native <6.5 Å. The average contact prediction accuracy is 46%. Continuously aligned regions provide rather accurate (~90% accuracy) native-like fragments that can be used in structure assembly. This level of contact prediction accuracy, combined with the fact that continuous fragments are quite accurate, motivated the development of TASSER, described in the next section. Here, “threading alignment” refers to the alignment provided by PROSPECTOR_3 and “structural alignment” to that provided by SAL; 67% of the residues in the threading alignments have the same alignment to the template as in the best structural alignments. In addition, 97% of Easy targets have a template whose SAL structural alignment has an RMSD <6.5 Å, with an average RMSD of 2.4 Å and 82% average coverage. PROSPECTOR_3 assigns 565 proteins to the Medium set, 149 of which have an RMSD <6.5 Å with 44% average coverage. However, 91% have good SAL structural alignments, with an average RMSD of 3.8 Å and 50% coverage. The challenge is to uncover these alignments (which, as seen above, can be used to build good models). Combining the Easy and Medium sets, 65% (94%) of targets have good threading (structural) alignments, and the average target/template sequence identity is 22%. Since all targets have good templates in the template library [16], the fact that roughly one-third are not identified indicates that further improvements to PROSPECTOR_3 are needed. However, consistent with the notion of PDB completeness, there are only 19 Hard targets.

Development and Benchmarking of TASSER
Given a set of threading templates, the next objectives are to build a full-length model and to refine the structure so that the regions with corresponding template alignments move closer to the native state than the template on which they are based. To achieve these two objectives, the Skolnick group developed the TASSER (Threading/ASSEmbly/Refinement) algorithm, an overview of which is schematically depicted in figure 7.3. The protein model is described by the alpha-carbon (Cα) atoms and off-lattice side-chain centers of mass (SG). The chain is divided into continuous aligned regions extracted from PROSPECTOR_3 (>5 residues), whose local conformation is kept essentially unchanged during assembly, and gapped regions that are treated by ab initio methods. The Cα’s of these ab initio residues lie on an underlying cubic lattice (by discretizing the conformational space, lattices can improve the rate of
Figure 7.3 Overview of the TASSER structure prediction methodology, which consists of template identification by PROSPECTOR_3 [50] that provides template fragments and predicted contact restraints, fragment assembly using Parallel Hyperbolic Sampling [141], and fold selection by SPICKER clustering [122]. The entire process for 1ayyD is shown.
conformational sampling), while the Cα’s of aligned residues are excised from the threading template and are off-lattice (preconstructed fragments are very difficult to move around on a lattice, and lattices introduce an error in the local representation). In a certain sense, TASSER represents a convergence of the ROSETTA [30] and TOUCHSTONE II [128] approaches. However, ROSETTA [22] uses small fragments (3–9 residues), and since its conformational search is carried out using large-scale moves (by switching between
different local segments), the acceptance rate of ROSETTA moves decreases significantly with increasing fragment size. Here, the threading-based fragments are longer (~20.7 residues on average), the conformational entropy is significantly reduced, and more native-like interactions are retained. Movements consist of scaled continuous translations and rotations, allowing for the successful movement of substructures of all sizes. The potential includes predicted secondary structure propensities from PSIPRED [129], backbone hydrogen bonds, consensus-predicted side-chain contacts from PROSPECTOR_3 [50], statistical short-range correlations, and hydrophobic interactions [122]. The combination of energy terms was optimized by maximizing the correlation between the energy of decoy structures and their RMSD to native for 100 nonhomologous training proteins (extrinsic to the PDB200 benchmark), each with 60,000 decoys. This gave a funnel-like energy landscape, with a correlation coefficient of 0.7 [122] for the training set. For 200 randomly chosen testing proteins in the PDB200 benchmark set, the correlation coefficient between the energy and RMSD is 0.69; that is, it is essentially the same for both training and testing proteins. The next task is to apply the TASSER algorithm to a comprehensive benchmark set representative of all the proteins in the PDB below a certain size. The goal is a set sufficiently comprehensive that the results are truly representative. When relatively small sets of proteins are used to test a given algorithm, the parameters are often implicitly optimized so that success is found for the benchmark, but not in general. If, say, one considers 100 proteins, and a given variant of a folding algorithm folds 3 additional proteins, does this mean that on average the algorithm is 3% better? In other words, the 3 folded proteins may or may not be representative.
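The energy-term weighting described above, choosing the combination of terms that maximizes the correlation between a decoy's total energy and its RMSD to native, can be illustrated with a minimal sketch. The function names, the coarse weight grid, and the synthetic decoy data are hypothetical; the actual TASSER optimization over 60,000 decoys per protein is far more elaborate.

```python
import numpy as np
from itertools import product

def pearson(x, y):
    """Pearson correlation coefficient of two 1-D arrays."""
    x = x - x.mean()
    y = y - y.mean()
    return float(x @ y / np.sqrt((x @ x) * (y @ y)))

def optimize_weights(term_energies, rmsds, grid=(0.5, 1.0, 2.0)):
    """Grid-search per-term weights maximizing the correlation between
    total decoy energy and RMSD to native.  A hypothetical simplification
    of the weight optimization described in the text."""
    best_w, best_r = None, -1.0
    n_terms = term_energies.shape[1]
    for w in product(grid, repeat=n_terms):
        w = np.array(w)
        r = pearson(term_energies @ w, rmsds)   # funnel-likeness of this mix
        if r > best_r:
            best_w, best_r = w, r
    return best_w, best_r
```

A high best correlation indicates a funnel-like landscape in which low energy reliably implies low RMSD, which is the property exploited during fold selection.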
However, if benchmarking is done on all representative folds in the PDB, improvements will be statistically significant, and one can ascertain the general strengths and weaknesses of a given algorithm; this accelerates progress. On the other hand, such large-scale benchmarking on thousands of proteins is very CPU intensive, and considerable computational resources are required to carry out the calculations.

APPLICATION TO THE PDB200 BENCHMARK
With the goal of comprehensive benchmarking, application of TASSER to the PDB200 benchmark set gave the following results. There are clear improvements for templates of almost all qualities, with the biggest improvement for the poorer quality template alignments (initial RMSD >8 Å); these mainly belong to the Medium and Hard sets. For good templates (mostly Easy set targets), the alignments are much less gapped, and the tertiary contact restraints from PROSPECTOR_3 are more consistent. For initial models with a 4–5 Å (2–3 Å) RMSD from native, 58% (43%)
of the targets improve by at least 1 (0.5) Å. These results are consistent (see figure 7.2) with those obtained when structural alignments are used and show a systematic improvement in model quality. For most initially good templates (mainly from the Easy set) with an initial RMSD of 2–6 Å to native, there is consistently about a 1–3 Å improvement, because of the better local structure and side-chain group packing following optimization. The final alignments in MODELLER [125] tend to stay much closer to the initial template alignments. This comparison is not entirely fair, since MODELLER was designed to fold homologous proteins, and such protein pairs are excluded here. Turning to loop modeling and considering unaligned/loop regions ≥4 residues in length, the average RMSDs of the TASSER and MODELLER models are 6.7 Å and 14.9 Å, respectively. Using an RMSD cutoff of <4 Å, MODELLER is successful in 12% of the cases, while TASSER is successful in 35%. These results are slightly worse than when structural alignments are used, because of the lower accuracy of the core (see figures 7.2E and F). As shown in figure 7.4A, defining foldable cases as those where one of the top five structures has an RMSD to native below 6.5 Å (a statistically significant value [130], though any reasonable cutoff could be used), the overall success rate for TASSER full-length models is 66% (989/1489). The fraction of targets having an RMSD <6.5 Å in the aligned regions increases from 65% to 79% after TASSER refinement. Furthermore, TASSER shows no significant bias with respect to secondary structure class: the success rates for α-, β-, and αβ-proteins are 69%, 61%, and 69%, respectively. Nevertheless, a dependence on protein size exists. For targets <120 residues, the success rate is 73%, but for targets >120 residues it is 58%. All results, including threading templates, structure trajectories, and final models for each of the targets, are available at http://bioinformatics.buffalo.edu/abinitio/1489.
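The two loop-accuracy measures used in this section, RMSDlocal (superpose the loop region directly on its native counterpart) and RMSDglobal (superpose only the flanking stem residues, then measure the loop), can be sketched with a standard Kabsch superposition. This is a minimal illustration assuming pre-extracted Cα coordinate arrays; the function names are hypothetical.

```python
import numpy as np

def _kabsch_rotation(P, Q):
    """Optimal rotation (Kabsch algorithm) mapping centered P onto centered Q."""
    H = P.T @ Q
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])                 # guard against improper rotations
    return Vt.T @ D @ U.T

def rmsd_local(model_loop, native_loop):
    """RMSD after directly superposing the modeled loop on the native loop
    (cf. RMSDlocal): measures local conformational accuracy only."""
    P = model_loop - model_loop.mean(axis=0)
    Q = native_loop - native_loop.mean(axis=0)
    R = _kabsch_rotation(P, Q)
    return float(np.sqrt(np.mean(np.sum((P @ R.T - Q) ** 2, axis=1))))

def rmsd_global(model, native, loop_idx, stem_idx):
    """RMSD of the loop after superposing only the flanking stem residues
    (cf. RMSDglobal): measures local conformation plus global orientation."""
    mc = model[stem_idx].mean(axis=0)
    nc = native[stem_idx].mean(axis=0)
    R = _kabsch_rotation(model[stem_idx] - mc, native[stem_idx] - nc)
    moved_loop = (model[loop_idx] - mc) @ R.T + nc
    diff = moved_loop - native[loop_idx]
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))
```

A loop that is internally correct but misoriented relative to the rest of the protein gives a small rmsd_local and a large rmsd_global, which is exactly the failure mode discussed above for long loops.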
APPLICATION TO THE PDB300 BENCHMARK
To explore the ability of TASSER to treat larger proteins [122], Skolnick and coworkers examined a second comprehensive PDB benchmark, the PDB300 set, of 745 proteins ranging in length from 201 to 300 residues; 258 have more than one domain [131]. No pair of target protein sequences has >35% sequence identity, and all proteins with >35% identity to a target are excluded from the template library. PROSPECTOR_3 identifies 593 Easy set proteins; 441 have good threading alignments (RMSD from native <6.5 Å), with an average RMSD of 3.6 Å, 83% coverage, and 21% sequence identity to their templates. There are 150 Medium and 2 Hard targets. Using this information, figure 7.4B shows the TASSER results for the percent of predicted targets with a given RMSD, with single- and multiple-domain protein targets presented separately. The success rate for all PDB300 targets is 55%.
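The bookkeeping behind these success rates reduces to a simple tally: a target counts as "foldable" if the best of its top five models lies within 6.5 Å RMSD of the native structure. A minimal sketch (function names hypothetical):

```python
def is_foldable(top5_rmsds, cutoff=6.5):
    """A target is 'foldable' if the best of its top five models
    lies below the RMSD cutoff (6.5 A in the text)."""
    return min(top5_rmsds) < cutoff

def success_rate(per_target_rmsds, cutoff=6.5):
    """Fraction of targets whose best-of-top-five model beats the cutoff."""
    hits = sum(is_foldable(r, cutoff) for r in per_target_rmsds)
    return hits / len(per_target_rmsds)
```

For example, `success_rate([[7.2, 6.0, 8.1], [8.0, 9.5, 12.0]])` returns 0.5: only the first target has a model under 6.5 Å.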
Figure 7.4 (A) For the PDB200 benchmark set of proteins, histograms of foldable proteins using MODELLER [124] and TASSER, based on the same templates and alignments from PROSPECTOR_3 [50]. (B) For proteins in the PDB300 benchmark set, using TASSER, the histogram of the percent of predicted targets as a function of global RMSD, divided into single- and multiple-domain categories.
For 61% of single-domain proteins, the best of the top five models has an RMSD to native <6.5 Å. This is slightly below the success rate of 66% for single-domain proteins ≤200 residues [122]. For multiple-domain proteins, 43% have an RMSD <6.5 Å for the best of the top five models; however, two-thirds of these multiple-domain targets have at least one domain (average length of 144 residues) with an RMSD <6.5 Å. Thus, the individual domains are often correctly predicted, but not their mutual orientation. This is a significant problem that must be addressed. As for proteins ≤200 residues, TASSER gives significant improvements with respect to the initial alignments. For example, for initial alignments with an RMSD between 4 and 5 Å, the final models improve by at least 1 Å in 53% of the cases. Turning to loop modeling and focusing on unaligned/loop regions ≥4 residues in length, there are in total 1809 cases, with an average length of 12.2 residues. In around one-third of the cases, the TASSER loop-modeling procedure has acceptable accuracy.

RESULTS FOR TRANSMEMBRANE PROTEINS
There are 18 large membrane proteins in the PDB300 benchmark set. For one-third of them, TASSER generates at least one model in the top five with an RMSD to native below 5.5 Å. For the PDB200 benchmark (proteins of 41–200 residues), there are 20 membrane proteins, with a success rate of 45%. Among the total of 15 foldable membrane targets in both sets, for 10 of them PROSPECTOR_3 hits at least one other nonhomologous transmembrane template; for the remaining five, PROSPECTOR_3 hits globular proteins with regular helical structures consistent with the target structures, which gave TASSER the opportunity to assemble and refine the models. Figure 7.5 shows three typical results for the membrane proteins 1jgjA, 1fqyA, and 1bh3_, with the well-known GPCR rhodopsin, 1jgjA, having the highest resolution. Their best template hits by PROSPECTOR_3 are, respectively, 1ap9_ (1.47 Å RMSD over 96% coverage, 29% sequence identity), 1fx8A (5.20 Å over 92% coverage, 29% sequence identity), and 2por_ (13.44 Å over 88% coverage, 22% sequence identity). The final models have RMSDs to native of 1.1/0.89 Å, 3.3/3.1 Å, and 5.3/5.2 Å over the full-length/aligned regions, respectively. This shows that TASSER improves threading alignments and builds reasonable loops for membrane proteins.

COMPARISON OF TASSER MODELS WITH NMR STRUCTURES
For all representative proteins ≤ 300 residues (in both the PDB200 and PDB300 benchmark sets) that have corresponding multiple NMR structures in the PDB, ~20% of the models generated by TASSER are closer to the NMR structure centroid than the farthest individual NMR model. Note that no experimental information is employed in this set of predictions. Some representative examples for proteins belonging to each of the three secondary structure classes are shown in figure 7.6.
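The comparison criterion used here (is the predicted model nearer the NMR-ensemble centroid than the farthest individual NMR model?) can be sketched as below. The sketch assumes all structures are already mutually superposed, as in a deposited NMR ensemble, and the function names are hypothetical.

```python
import numpy as np

def coord_rmsd(a, b):
    """Plain coordinate RMSD of two pre-superposed structures (N x 3 arrays)."""
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

def closer_than_farthest_nmr(model, ensemble):
    """True if the predicted model is nearer the NMR-ensemble centroid than
    the farthest individual NMR model (the criterion used in the text).
    `ensemble` is an (n_models x N x 3) array of superposed NMR structures."""
    centroid = ensemble.mean(axis=0)                       # mean structure
    spread = max(coord_rmsd(m, centroid) for m in ensemble)
    return coord_rmsd(model, centroid) < spread
```

By this criterion a prediction "within the NMR ensemble" is effectively at experimental resolution, since the NMR models themselves satisfy the distance restraints equally well.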
Figure 7.5 Three representative examples of successful structure prediction of transmembrane proteins by TASSER. The thin (thick) lines denote the Cα backbone of the experimental (predicted) structure. Below the structures are the PDB ID, the RMSD between the model and the native structure, and the length of the protein. To view this figure in color, see the companion web site for Systems Biology, http://www.oup.com/us/sysbio.
Figure 7.6 Three representative examples of TASSER-predicted models that are structurally closer to the NMR structure centroid than some of the individual NMR structures. The thick backbone shows the rank-one model predicted by TASSER; the wire frames show the structures satisfying the NMR distance constraints equally well. The RMSDs of the TASSER models to the NMR centroid for 1adr_ (α-protein), 2fnbA (β-protein), and 1dbyA (αβ-protein) are 1.6 Å, 1.9 Å, and 1.1 Å, respectively; the maximum RMSDs of the NMR models to the centroid are 3.6 Å, 2.3 Å, and 1.3 Å, respectively. To view this figure in color, see the companion web site for Systems Biology, http://www.oup.com/us/sysbio.
While this represents encouraging progress, there remain the roughly 80% of proteins with NMR structures that are not predicted at the level of experimental resolution; these are an outstanding challenge.

EXTENSION OF THREADING TO PREDICT QUATERNARY STRUCTURE
Over the past several years, the multimeric threading algorithm MULTIPROSPECTOR was developed and benchmarked by Skolnick and coworkers [119]. The approach consists of two phases. First, traditional single-chain threading is applied to generate a set of candidate structures. Then, for those proteins whose template structures are part of a known complex, both partners are rethreaded with a protein–protein interfacial energy now included. A database of multimeric protein template structures was constructed [118], interfacial pairwise potentials were derived, and empirical indicators for identifying dimers, based on the threading Z-score and the magnitude of the interfacial energy, were established. The authors tested the algorithm on a benchmark set composed of 58 homodimers, 20 heterodimers, and 96 monomers scanned against 3900 representative template structures. The method correctly recognized and assigned 54 homodimers, all 20 heterodimers, and 91 monomers, demonstrating satisfactory performance [119].

Application to Proteomes

PROSPECTOR_3 RESULTS
To examine the generality of the PDB200 benchmark results, Skolnick and coworkers applied PROSPECTOR_3 to ORFs ≤200 residues in the E. coli [132], M. genitalium [133], and S. cerevisiae [134] proteomes. Unlike the benchmark, here homologous proteins are allowed. An overview is presented here, with details given elsewhere; see http://www.bioinformatics.buffalo.edu/resources/genomethreading/. For E. coli [132] there are 1360 ORFs ≤200 residues. PROSPECTOR_3 assigns 61% to the Easy set (82% average coverage) and 38% to the Medium set (51% average coverage). In contrast, Peitsch et al. [135] produced assignments for ~10–15% of the entire proteome. Using PSI-BLAST [127], Hegyi et al. [136] assigned 28% of all E. coli ORFs to SCOP domains. In PEDANT [40], 31% of E. coli ORFs ≤200 residues have a PSI-BLAST hit to PDB structures. In GTOP [137], Reverse PSI-BLAST [138] assigned 35% of E. coli ORFs ≤200 residues to PDB structures. The M. genitalium [133] proteome has 128 ORFs ≤200 residues; PROSPECTOR_3 assigns 73% to the Easy set (87% average coverage) and 27% to the Medium set (54% average coverage). In S. cerevisiae [134] there are 1496 ORFs ≤200 residues; PROSPECTOR_3 assigns 53% to the Easy set (75% average coverage) and 45% to the Medium set (65% average coverage). There are few putative New Fold ORFs in any of the three proteomes.
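The Easy/Medium/Hard assignment that underlies these coverage numbers follows the Z-score rules given earlier: one hit with Z > 15, or at least two structurally consistent hits with Z > 7, for the Easy set. A minimal sketch, with hypothetical function and parameter names and a simplified stand-in (`consistent_hits`) for the real structural-consistency check:

```python
def classify_target(zscores, consistent_hits=0):
    """Assign a threading target to the Easy / Medium / Hard confidence
    classes from its template Z-scores, following the thresholds quoted
    in the text.  `consistent_hits` counts structurally consistent template
    hits and is a simplification of the actual consistency test."""
    strong = [z for z in zscores if z > 15]     # one such hit suffices
    moderate = [z for z in zscores if z > 7]    # need two consistent hits
    if strong or (len(moderate) >= 2 and consistent_hits >= 2):
        return "Easy"
    if moderate:
        return "Medium"
    return "Hard"                               # putative New Fold
```

Applied proteome-wide, tallying the returned labels reproduces the kind of Easy/Medium/Hard breakdown reported above for E. coli, M. genitalium, and S. cerevisiae.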
TASSER RESULTS
TASSER was also applied to all ORFs ≤200 residues in the E. coli proteome [132]. Based on the PDB benchmarks, a confidence score, the C-score, is defined as a function of the cluster density, the RMSD of cluster members from the cluster centroid, and the threading template Z-score (see eq. 1 of ref. [122]). For the C-score cutoff that gives a false positive/negative rate of 12.4%/14.7% in the PDB200 benchmark, 68% of E. coli ORFs should have acceptable predictions. According to MEMSAT [139], ~23% of these E. coli ORFs have transmembrane regions. All TASSER-predicted first-rank models have at least one long (putative transmembrane) helix consistent with MEMSAT. Using the C-score, 47% of the ORFs have a >80% probability of models with an RMSD <6.5 Å. Furthermore, signal peptides are not masked out, and 149 ORFs have annotated signal peptides in SWISS-PROT [140]. Because of their composition, PROSPECTOR_3 does not align the majority of signal peptide residues, and owing to the resulting lack of predicted contacts, these peptides lie outside the predicted compact core. A possibility to be pursued is to use this behavior to identify signal sequences.

APPLICATION OF MULTIPROSPECTOR TO S. cerevisiae
Using MULTIPROSPECTOR, each possible pairwise interaction among the more than six thousand encoded proteins was evaluated against a dimer database of 768 complex structures, using a confidence estimate of the fold assignment and the magnitude of the interfacial potentials. In all, 7321 interactions involving 1256 proteins are predicted. After filtering by subcellular colocalization, 2028 heterodimeric interactions remain. mRNA abundance analysis shows that the MULTIPROSPECTOR method is not biased toward high-abundance proteins. The predicted interactions were then compared to other large-scale methods and to high-confidence interactions, defined as those supported by two or more other methods [18]. Of the predictions, 374 are found by at least one other study, comparable to the overlap between any two of the other methods. Based on functional category assignment, the MULTIPROSPECTOR predictions have a distribution similar to that of the high-confidence interactions.

CONCLUSION
At this juncture, it is apparent that considerable progress is being made in the field of protein structure prediction, with the greatest success seen for knowledge-based approaches that extend comparative modeling and threading. At present, based on very large-scale benchmarking, for weakly homologous or nonhomologous proteins one can expect to produce low-resolution structures for about two-thirds of all proteins. Given the observation that the PDB is complete for low-to-moderate resolution
single-domain proteins, the outstanding challenge is to develop methods to identify the roughly one-third of proteins whose templates cannot be recognized by contemporary approaches. Furthermore, progress is being made on generating predictions where the model is closer to the native structure than the template on which it is based. Part of the reason for the recent relative success is comprehensive testing on all representative PDB structures, so that one can identify both the strengths and weaknesses of a given approach; in the past, relatively small-scale benchmarking made it difficult to establish the generality of the conclusions. Another reason is the improved correlation between energy and structure quality. This is not to say that existing potentials are perfect, for certainly they are not, but rather that procedures to derive better potentials are starting to bear fruit. A number of outstanding problems remain. With regard to low-resolution modeling, existing approaches to predicting the relative orientation of multiple-domain proteins often fail when the domains adopt a different orientation from the template. This is the same issue as the inability to predict good global orientations for long loops even when (as is often the case) their internal conformation is well predicted; it reflects problems with the force field. Similarly, it is still not possible in general to refine low-resolution structures into higher-quality structures at atomic detail. Whether this is an issue of conformational sampling, or of problems with existing atomic force fields, or both, remains to be established. In that regard, Skolnick and coworkers have embarked on a similar large-scale benchmarking effort to identify the outstanding unresolved issues, with the goal of making progress in detailed atomic model refinement.
Indeed, one would like to supersede the current generation of knowledge-based approaches with more fundamental physics-based approaches. The next issue that must be addressed is the prediction of protein–protein interactions and the quaternary structure of the resulting complexes. Here, the field of structure prediction is in its infancy; approaches similar to ROSETTA and TASSER, generalized to multimers, represent promising avenues of investigation. At the end of the day, one goal of protein structure prediction is to provide models of sufficient quality that they can provide functional insights. While much remains to be done, there is now cause for optimism that progress is being made toward this objective.

ACKNOWLEDGMENTS

This research was supported in part by NIH grants GM-37408, GM-48835, and RR-12255. Stimulating discussions and essential contributions of our colleagues Drs. A. Arakaki, D. Kihara, L. Lu, and H. Lu are gratefully acknowledged.
REFERENCES

1. Venter, J. C., M. D. Adams, E. W. Myers, et al. The sequence of the human genome. Science, 291(5507):1304–51, 2001. 2. Wiley, S. R. Genomics in the real world. Current Pharmaceutical Design, 4(5):417–22, 1998. 3. Betz, S. F., S. M. Baxter and J. S. Fetrow. Function first: a powerful approach to post-genomic drug discovery. Drug Discovery Today, 7(16):865–71, 2002. 4. Pearson, W. R. Effective protein sequence comparison. Methods in Enzymology, 266:227–58, 1996. 5. Kinch, L. N., J. O. Wrabl, S. S. Krishna, I. Majumdar, R. I. Sadreyev, Y. Qi, J. Pei, H. Cheng and N. V. Grishin. CASP5 assessment of fold recognition target predictions. Proteins, 53(Suppl 6):395–409, 2003. 6. Wallace, A. C., R. A. Laskowski and J. M. Thornton. Derivation of 3D coordinate templates for searching structural databases: application to Ser-His-Asp catalytic triads in the serine proteinases and lipases. Protein Science, 5(6):1001–13, 1996. 7. Kleywegt, G. J. Recognition of spatial motifs in protein structures. Journal of Molecular Biology, 285(4):1887–97, 1999. 8. Skolnick, J., J. S. Fetrow and A. Kolinski. Structural genomics and its importance for gene function analysis. Nature Biotechnology, 18(3):283–7, 2000. 9. Baker, D. and A. Sali. Protein structure prediction and structural genomics. Science, 294(5540):93–6, 2001. 10. Aloy, P., E. Querol, F. X. Aviles and M. J. Sternberg. Automated structure-based prediction of functional sites in proteins: applications to assessing the validity of inheriting protein function from homology in genome annotation and to protein docking. Journal of Molecular Biology, 311(2):395–408, 2001. 11. Turcotte, M., S. H. Muggleton and M. J. Sternberg. Automated discovery of structural signatures of protein fold and function. Journal of Molecular Biology, 306(3):591–605, 2001. 12. Gerstein, M., A. Edwards, C. H. Arrowsmith and G. T. Montelione. Structural genomics: current progress. Science, 299(5613):1663, 2003. 13. Vitkup, D., E. Melamud, J.
Moult and C. Sander. Completeness in structural genomics. Nature Structural Biology, 8(6):559–66, 2001. 14. Moult, J. and E. Melamud. From fold to function. Current Opinion in Structural Biology, 10(3):384–9, 2000. 15. McGuffin, L. J. and D. T. Jones. Targeting novel folds for structural genomics. Proteins, 48(1):44–52, 2002. 16. Kihara, D. and J. Skolnick. The PDB is a covering set of small protein structures. Journal of Molecular Biology, 334(4):793–802, 2003. 17. Alberts, B., D. Bray, J. Lewis, M. Raff, K. Roberts and J. D. Watson. Molecular Biology of the Cell, 3rd ed. Garland, New York, 1994. 18. Mering, C. V., R. Krause, B. Snel, M. Cornell, S. G. Oliver, S. Fields and P. Bork. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 417(6887):399–403, 2002. 19. Marti-Renom, M. A., A. C. Stuart, A. Fiser, R. Sanchez, F. Melo and A. Sali. Comparative protein structure modeling of genes and genomes. Annual Review of Biophysics and Biomolecular Structure, 29:291–325, 2000.
20. Bowie, J. U., R. Luthy and D. Eisenberg. A method to identify protein sequences that fold into a known three-dimensional structure. Science, 253:164–170, 1991. 21. Liwo, A., J. Lee, D. R. Ripoll, J. Pillardy and H. A. Scheraga. Protein structure prediction by global optimization of a potential energy function. Proceedings of the National Academy of Sciences USA, 96(10):5482–5, 1999. 22. Simons, K. T., C. Strauss and D. Baker. Prospects for ab initio protein structural genomics. Journal of Molecular Biology, 306:1191–9, 2001. 23. Kihara, D., H. Lu, A. Kolinski and J. Skolnick, TOUCHSTONE: an ab initio protein structure prediction method that uses threading-based tertiary restraints. Proceedings of the National Academy of Sciences USA, 98(18):10125–30, 2001. 24. Westbrook, J., Z. Feng, S. Jain, T. N. Bhat, N. Thanki, V. Ravichandran, G. L. Gilliland, W. Bluhm, H. Weissig, D.S. Greer, P.E. Bourne and H. M. Berman. The Protein Data Bank: unifying the archive. Nucleic Acids Research, 30(1):245–8, 2002. 25. Kopp, J. and T. Schwede. The SWISS-MODEL Repository of annotated threedimensional protein structure homology models. Nucleic Acids Research, 32(Database issue):D230–4, 2004. 26. John, B. and A. Sali. Comparative protein structure modeling by iterative alignment, model building and model assessment. Nucleic Acids Research, 31(14):3982–92, 2003. 27. Xu, D., O. H. Crawford, P. F. LoCascio and Y. Xu. Application of PROSPECT in CASP4: characterizing protein structures with new folds. Proteins, Suppl 5:140–8, 2001. 28. McGuffin, L. J., K. Bryson and D. T. Jones. The PSIPRED protein structure prediction server. Bioinformatics, 16(4):404–5, 2000. 29. Skolnick, J., Y. Zhang, A. Arakaki, A. Kolinski, M. Boniecki, A. Szilagyi, and D. Kihara. A unified approach to protein structure prediction. Proteins, 53(Suppl 6):469–79, 2003. 30. Bradley, P., D. Chivian, J. Meiler, K.M. Misura, C. A. Rohl, W. R. Schief, W. J. Wedemeyer, O. Schueler-Furman, P. Murphy, J. Schonbrun, C. E. 
Strauss, and D. Baker. Rosetta predictions in CASP5: successes, failures, and prospects for complete automation. Proteins, 53(Suppl 6):457–68, 2003. 31. Aloy, P., A. Stark, C. Hadley and R. B. Russell. Predictions without templates: new folds, secondary structure, and contacts in CASP5. Proteins, 53 (Suppl 6):436–56, 2003. 32. Liwo, A., P. Arlukowicz, C. Czaplewski, S. Oldziej, J. Pillardy and H.A. Scheraga. A method for optimizing potential-energy functions by a hierarchical design of the potential-energy landscape: application to the UNRES force field. Proceedings of the National Academy of Sciences USA, 99(4):1937–42, 2002. 33. Venclovas, C., A. Zemla, K. Fidelis and J. Moult. Assessment of progress over the CASP experiments. Proteins, 53(Suppl 6):585–95, 2003. 34. Tramontano, A. and V. Morea Assessment of homology-based predictions in CASP5. Proteins, 53 (Suppl 6):352–68, 2003. 35. Iliopoulos, I., S. Tsoka, M. A. Andrade, et al. Evaluation of annotation strategies using an entire genome sequence. Bioinformatics, 19(6):717–26, 2003. 36. de Bakker, P. I., M. A. DePristo, D. F. Burke and T. L. Blundell. Ab initio construction of polypeptide fragments: accuracy of loop decoy discrimination
by an all-atom statistical potential and the AMBER force field with the Generalized Born solvation model. Proteins, 51(1):21–40, 2003. 37. Bonneau, R., J. Tsai, I. Ruczinski and D. Baker. Functional inferences from blind ab initio protein structure predictions. Journal of Structural Biology, 134(2–3):186–90, 2001. 38. Skolnick, J. and J. S. Fetrow. From genes to protein structure and function: novel applications of computational approaches in the genomic era. Trends in Biotechnology, 18(1):34–9, 2000. 39. Gerstein, M. Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census. Proteins, 33(4):518–34, 1998. 40. Frishman, D., M. Mokreys, D. Kosykh, et al. The PEDANT genome database. Nucleic Acids Research, 31(1):207–11, 2003. 41. Yamaguchi, A., M. Iwadate, E. Suzuki, K. Yura, S. Kawakita, H. Umeyama and M. Go. Enlarged FAMSBASE: protein 3D structure models of genome sequences for 41 species. Nucleic Acids Research, 31(1):463–8, 2003. 42. Pieper, U., N. Eswar, H. Braberg, et al. MODBASE, a database of annotated comparative protein structure models, and associated resources. Nucleic Acids Research, 32(Database issue):D217–22, 2004. 43. Maiorov, V. N. and G. M. Crippen. Contact potential that recognizes the correct folding of globular proteins. Journal of Molecular Biology, 277:876–88, 1992. 44. Sippl, M. J. and S. Weitckus. Detection of native-like models for amino acid sequences of unknown three-dimensional structure in a database of known protein conformations. Proteins, 13:258–71, 1992. 45. Skolnick, J. and D. Kihara. Defrosting the frozen approximation: PROSPECTOR—a new approach to threading. Proteins, 42(3):319–31, 2001. 46. Bryant, S. H. and C. E. Lawrence. An empirical energy function for threading protein sequence through the folding motif. Proteins, 16(1):92–112, 1993. 47. Godzik, A., J. Skolnick and A. Kolinski. A topology fingerprint approach to the inverse folding problem. Journal of Molecular Biology, 227:227–38, 1992. 48. McGuffin, L. J. and D. T. Jones. Improvement of the GenTHREADER method for genomic fold recognition. Bioinformatics, 19(7):874–81, 2003. 49. Zhang, B., L. Jaroszewski, L. Rychlewski and A. Godzik. Similarities and differences between nonhomologous proteins with similar folds: evaluation of threading strategies. Folding and Design, 2(5):307–17, 1997. 50. Skolnick, J., D. Kihara and Y. Zhang. Development and large scale benchmark testing of the PROSPECTOR_3 threading algorithm. Proteins, 56(3):502–18, 2004. 51. Needleman, S. B. and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–53, 1970. 52. Orengo, C. A. and W. R. Taylor. A local alignment method for protein structure motifs. Journal of Molecular Biology, 233(3):488–97, 1993. 53. Panchenko, A. R., A. Marchler-Bauer and S. H. Bryant. Combination of threading potentials and sequence profiles improves fold recognition. Journal of Molecular Biology, 296:1319–31, 2000. 54. Lathrop, R. H. An anytime local-to-global optimization algorithm for protein threading in Θ(m2n2) space. Journal of Computational Biology, 6(3–4):405–18, 1999.
Protein Structure Prediction
213
55. Gerstein, M. and M. Levitt. Using iterative dynamic programming to obtain accurate pairwise and multiple alignments of protein structures. Proceedings of the International Conference on Intelligent Systems in Molecular Biology, 4:59–67, 1996. 56. Shindyalov, I.N. and P.E. Bourne. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering, 11(9):739–47, 1998. 57. Kosinski, J., I. A. Cymerman, M. Feder, M. A. Kurowski, J. M. Sasin and J. M. Bujnicki. A “Frankenstein’s monster” approach to comparative modeling: merging the finest fragments of Fold-Recognition models and iterative model refinement aided by 3D structure evaluation. Proteins, 53(Suppl 6): 369–79, 2003. 58. Kinch, L. N., Y. Qi, T. J. Hubbard and N.V. Grishin. CASP5 target classification. Proteins, 53(Suppl 6):340–51, 2003. 59. Fischer, D. 3D-SHOTGUN: a novel, cooperative, fold-recognition metapredictor. Proteins, 51(3):434–41, 2003. 60. Wallner, B., H. Fang and A. Elofsson. Automatic consensus-based fold recognition using Pcons, ProQ, and Pmodeller. Proteins, 53(Suppl 6):534–41, 2003. 61. Chivian, D., D. E. Kim, L. Malmstrom, P. Bradley, T. Robertson, P. Murphy, C. E. Strauss, R. Bonneau, C. A. Rohl and D. Baker. Automated prediction of CASP-5 structures using the Robetta server. Proteins, 53(Suppl 6): 524–33, 2003. 62. Eyrich, V. A., D. Przybylski, I. Y. Koh, O. Grana, F. Pazos, A. Valencia and B. Rost. CAFASP3 in the spotlight of EVA. Proteins, 53(Suppl 6):548–60, 2003. 63. Rychlewski, L., D. Fischer. and A. Elofsson. LiveBench-6: large-scale automated evaluation of protein structure prediction servers. Proteins, 53 (Suppl 6):542–7, 2003. 64. Holm, L. and C. Sander. Touring protein fold space with Dali/FSSP. Nucleic Acids Research, 26(1):316–9, 1998. 65. Grindley, H.M., P. J. Artymiuk, D.W. Rice and P. Willett. Identification of tertiary structure resemblance in proteins using a maximal common subgraph isomorphism algorithm. 
Journal of Molecular Biology, 229(3):707–21, 1993. 66. Mizuguchi, K. and N. Go. Comparison of spatial arrangements of secondary structural elements in proteins. Protein Engineering, 8(4):353–62, 1995. 67. Bachar, O., D. Fischer, R. Nussinov and H. Wolfson. A computer vision based technique for 3-D sequence-independent structural comparison of proteins. Protein Engineering, 6(3):279–88, 1993. 68. Shindyalov, I. N. and P. E. Bourne. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering, 11(9):739–47, 1998. 69. Kedem, K., L. P. Chew and R. Elber. Unit-vector RMS (URMS) as a tool to analyze molecular dynamics trajectories. Proteins, 37(4):554–64, 1999. 70. Ortiz, A. R., C. E. Strauss and O. Olmea. MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Science, 11(11):2606–21, 2002. 71. Orengo, C. A., A. D. Michie, S. Jones, D. T. Jones, M. B. Swindells and J.M. Thornton. CATH—a hierarchic classification of protein domain structures. Structure, 5(8):1093–1108, 1997.
214
Genomics
72. Shindyalov, I. N. and P. E. Bourne. An alternative view of protein fold space. Proteins, 38(3):247–60, 2000. 73. Boutonnet, N. S., A. V. Kajava and M. J. Rooman. Structural classification of alphabetabeta and betabetaalpha supersecondary structure units in proteins. Proteins, 30(2):193–212, 1998. 74. Harrison, A., F. Pearl, R. Mott, J. Thornton and C. Orengo. Quantifying the similarities within fold space. Journal of Molecular Biology, 323(5):909–26, 2002. 75. Yang, A. S. and B. Honig. An integrated approach to the analysis and modeling of protein sequences and structures. I. Protein structural alignment and a quantitative measure for protein structural distance. Journal of Molecular Biology, 301(3):665–78, 2000. 76. Lo Conte, L., S. E. Brenner, T.J. Hubbard, C. Chothia, and A. G. Murzin. SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Research, 30(1):264–7, 2002. 77. Arakaki, A. K., Y. Zhang and J. Skolnick. Large scale assessment of the utility of low resolution protein structures for biochemical function assignment. Bioinformatics, 20:1087–96, 2004. 78. Claudel-Renard, C., C. Chevalet, T. Faraut and D. Kahn. Enzyme-specific profiles for genome annotation: PRIAM. Nucleic Acids Research, 31(22):6633–9, 2003. 79. Henikoff, J. G., S. Pietrokovski, C. M. McCallum and S. Henikoff. Blocksbased methods for detecting protein homology. Electrophoresis, 21(9):1700–6, 2000. 80. Attwood, T. K., P. Bradley, D. R. Flower, A. Gaulton, N. Maudling, A. L. Mitchell, G. Moulton, A. Nordle, K. Paine, P. Taylor, A. Uddin and C. Zygouri. PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Research, 31(1):400–2, 2003. 81. Fetrow, J. S., N. Siew, J. A. Di Gennaro, M. Martinez-Yamout, H. J. Dyson and J. Skolnick. Genomic-scale comparison of sequence- and structurebased methods of function prediction: does structure provide additional insight? Protein Science, 10(5):1005–14, 2001. 82. Hulo, N., C. J. Sigrist, V. Le Saux, P. S. 
Langendijk-Genevaux, L. Bordoli, A. Gattiker, E. De Castro, P. Bucher and A. Bairoch. Recent improvements to the PROSITE database. Nucleic Acids Research, 32(Database issue):D134–7, 2004. 83. Hegyi, H. and M. Gerstein. The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. Journal of Molecular Biology, 288(1): 147–64, 1999. 84. Kihara, D. and J. Skolnick. Microbial genomes have over 72% structure assignment by the threading algorithm PROSPECTOR_Q. Proteins, 55(2):464–73, 2004. 85. Wallace, A. C., N. Borkakoti and J. M. Thornton. TESS: a geometric hashing algorithm for deriving 3D coordinate templates for searching structural databases. Application to enzyme active sites. Protein Science, 6(11):2308–23, 1997. 86. Russell, R. B. Detection of protein three-dimensional side-chain patterns: new examples of convergent evolution. Journal of Molecular Biology, 279(5):1211–27, 1998.
Protein Structure Prediction
215
87. Fetrow, J.S. and J. Skolnick. Method for prediction of protein function from sequence using the sequence-to-structure-to-function paradigm with application to glutaredoxins/thioredoxins and T1 ribonucleases. Journal of Molecular Biology, 281(5):949–68, 1998. 88. Zhao, S., G. M. Morris, A. J. Olson and D. S. Goodsell. Recognition templates for predicting adenylate-binding sites in proteins. Journal of Molecular Biology, 314(5):1245–55, 2001. 89. Hamelryck, T. Efficient identification of side-chain patterns using a multidimensional index tree. Proteins, 51(1):96–108, 2003. 90. Liang, M. P., D. L. Brutlag and R.B. Altman. Automated construction of structural motifs for predicting functional sites on protein structures. Pacific Symposium on Biocomputing, 204–15, 2003. 91. Peters, K. P., J. Fauck and C. Frommel. The automatic search for ligand binding sites in proteins of known three-dimensional structure using only geometric criteria. Journal of Molecular Biology, 256(1):201–13, 1996. 92. Wei, L., E. S. Huang and R. B. Altman. Are predicted structures good enough to preserve functional sites? Structure with Folding and Design, 7(6):643–50, 1999. 93. Stark, A. and R. B. Russell. Annotation in three dimensions. PINTS: Patterns in Non-homologous Tertiary Structures. Nucleic Acids Research, 31(13):3341–4, 2003. 94. Jones, D. T. and L. J. McGuffin. Assembling novel protein folds from super-secondary structural fragments. Proteins, 53(Suppl 6):480–5, 2003. 95. Schmitt, S., D. Kuhn and G. Klebe. A new method to detect related function among proteins independent of sequence and fold homology. Journal of Molecular Biology, 323(2):387–406, 2002. 96. Adams, M.D., S. E. Celniker, R. A. Holt, et al. The genome sequence of Drosophila melanogaster. Science, 287(5461):2185–95, 2000. 97. Bonneau, R., C. E. Strauss, C. A. Rohl, D. Chivian, P. Bradley, L. Malmstrom, T. Robertson and D. Baker. De novo prediction of threedimensional structures for major protein families. 
Journal of Molecular Biology, 322(1):65–78, 2002. 98. Fetrow, J. S., A. Godzik and J. Skolnick. Functional analysis of the Escherichia coli genome using the sequence-to-structure-to-function paradigm: identification of proteins exhibiting the glutaredoxin/thioredoxin disulfide oxidoreductase activity. Journal of Molecular Biology, 282(4):703–11, 1998. 99. Legrain, P., J. Wojcik. and J. M. Gauthier. Protein-protein interaction maps: a lead towards cellular functions. Trends in Genetics, 17(6):346–52, 2001. 100. Fields, S. and O. Song. A novel genetic system to detect protein-protein interactions. Nature, 340(6230):245–6, 1989. 101. Sobott, F. and C. V. Robinson. Protein complexes gain momentum. Current Opinion in Structural Biology, 12(6):729–34, 2002. 102. Uetz, P., L. Giot, G. Cagney, et al. A comprehensive analysis of proteinprotein interactions in Saccharomyces cerevisiae. Nature, 403(6770):623–7, 2000. 103. Ito, T., T. Chiba, R. Ozawa, M. Yoshida, M. Hattori and Y. Sakaki, A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proceedings of the National Academy of Sciences USA, 98(8):4569–74, 2001.
216
Genomics
104. Janin, J. and B. Seraphin. Genome-wide studies of protein-protein interaction. Current Opinion in Structural Biology, 13(3):383–8, 2003. 105. Valencia, A. and F. Pazos. Computational methods for the prediction of protein interactions. Current Opinion in Structural Biology, 12(3): 368–73, 2002. 106. Huynen, M. A., B. Snel, C. von Mering and P. Bork. Function prediction and protein networks. Current Opinion in Cell Biology, 15(2):191–8, 2003. 107. Marcotte, E. M., M. Pellegrini, H. L. Ng, D.W. Rice, T.O. Yeates and D. Eisenberg, Detecting protein function and protein-protein interactions from genome sequences. Science, 285(5428):751–3, 1999. 108. Enright, A. J., I. Iliopoulos, N.C. Kyrpides and C.A. Ouzounis. Protein interaction maps for complete genomes based on gene fusion events. Nature, 402(6757):86–90, 1999. 109. Overbeek, R., M. Fonstein, M. D’Souza, G.D. Pusch, and N. Maltsev. The use of gene clusters to infer functional coupling. Proceedings of the National Academy of Sciences USA, 96(6):2896–901, 1999. 110. Pazos, F., M. Helmer-Citterich, G. Ausiello and A. Valencia. Correlated mutations contain information about protein-protein interaction. Journal of Molecular Biology, 271(4):511–23, 1997. 111. Pazos, F. and A. Valencia. Similarity of phylogenetic trees as indicator of protein-protein interaction. Protein Engineering, 14(9):609–14, 2001. 112. Valencia, A. and F. Pazos. Prediction of protein-protein interactions from evolutionary information. Methods of Biochemical Analysis, 44:411–26, 2003. 113. Tatusov, R. L., N. D. Fedorova, J. D. Jackson, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4(1):41, 2003. 114. Comet, J. P. and J. Henry. Pairwise sequence alignment using a PROSITE pattern-derived similarity score. Computers and Chemistry, 26(5):421–36, 2002. 115. Aloy, P. and R. B. Russell. The third dimension for protein interactions and complexes. Trends in Biochemical Science, 27(12):633–8, 2002. 116. Adams, J. 
The proteasome: structure, function, and role in the cell. Cancer Treatment Reviews, 29(Suppl 1):3–9, 2003. 117. Fariselli, P., O. Olmea, A. Valencia and R. Casadio. Progress in predicting inter-residue contacts of proteins with neural networks and correlated mutations. Proteins, Suppl 5:157–62, 2001. 118. Lu, L., A. K. Arakaki, H. Lu and J. Skolnick. Multimeric threading-based prediction of protein-protein interactions on a genomic scale: application to the Saccharomyces cerevisiae proteome. Genome Research, 13(6A):1146–54, 2003. 119. Lu, L., H. Lu and J. Skolnick. MULTIPROSPECTOR: an algorithm for the prediction of protein-protein interactions by multimeric threading. Proteins, 49(3):350–64, 2002. 120. Henikoff, S. and J. G. Henikoff. Performance evaluation of amino acid substitution matrices. Proteins, 17(1):49–61, 1993. 121. Zhang, Y. and J. Skolnick. The protein structure prediction problem could be solved using the current PDB library. Proceedings of the National Acadamy of Sciences USA, 102(4):1029–34, 2005. 122. Zhang, Y. and J. Skolnick. Automated structure prediction of weakly homologous proteins on a genomic scale. Proceedings of the National Academy of Sciences USA, 101(20):7594–9, 2004.
Protein Structure Prediction
217
123. Zhang, Y. and J. Skolnick. SPICKER: a clustering approach to identify near-native protein folds. Journal of Computational Chemistry, 25(6):865–71, 2004. 124. Sali, A. and T. L. Blundell. Comparative protein modelling by satisfaction of spatial restraints. Journal of Molecular Biology, 234(3):779–815, 1993. 125. Fiser, A., R.K. Do and A. Sali. Modeling of loops in protein structures. Protein Science, 9(9):1753–73, 2000. 126. Moult, J., K. Fidelis, A. Zemla and T. Hubbard. Critical assessment of methods of protein structure prediction (CASP)—round V. Proteins, 53 (Suppl 6):334–9, 2003. 127. Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–402, 1997. 128. Zhang, Y., A. Kolinski and J. Skolnick. TOUCHSTONE II: a new approach to ab initio protein structure prediction. Biophysics Journal, 85(2):1145–64, 2003. 129. Jones, D. T. Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology, 292:195–202, 1999. 130. Reva, B. A., A. V. Finkelstein and J. Skolnick. What is the probability of a chance prediction of a protein structure with an rmsd of 6 Å? Folding and Design, 3(2):141–7, 1998. 131. Guo, J. T., D. Xu, D. Kim and Y. Xu. Improving the performance of DomainParser for structural domain partition using neural network. Nucleic Acids Research, 31:944–952, 2003. 132. Blattner, F. R., G. Plunkett, C. A. Bloch, et al. The complete genome sequence of Escherichia coli K-12. Science, 277(5331):1453–74, 1997. 133. Fraser, C. M., J. D. Gocayne, O. White, M. D. Adams, R. A. Clayton, R. D. Fleischmann, C. J. Bult, A.R. Kerlavage, G. Sutton, J. M. Kelley and et al. The minimal gene complement of Mycoplasma genitalium. Science, 270(5235):397–403, 1995. 134. Mewes, H. M., D. Frishman, C. 
Gruber, Geier, B., Haase, D., Kaps, A., Lemcke, K., Mannhaupt, G., Pfeiffer, F., Schuller, C., Stocker, S. and Weil, B. MIPS: a database for genomes and protein sequences. Nucleic Acids Research., 28(1):37–40, 2000. 135. Peitsch, M. C., M. R. Wilkins, L. Tonella, J. C. Sanchez, R.D. Appel and D. F. Hochstrasser. Large-scale protein modelling and integration with the SWISS-PROT and SWISS-2DPAGE databases: the example of Escherichia coli. Electrophoresis, 18(3–4):498–501, 1997. 136. Hegyi, H., J. Lin, D. Greenbaum and M. Gerstein. Structural genomics analysis: characteristics of atypical, common, and horizontally transferred folds. Proteins, 47(2):126–41, 2002. 137. Kawabata, T., S. Fukuchi, K. Homma, M. Ota, J. Araki, T. Ito, N. Ichiyoshi and K. Nishikawa. GTOP: a database of protein structures predicted from genome sequences. Nucleic Acids Research, 30(1):294–8, 2002. 138. Marchler-Bauer, A., A. R. Panchenko, B. A. Shoemaker, P. A. Thiessen, L. Y. Geer and S. H. Bryant. CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Research, 30(1):281–3, 2002.
218
Genomics
139. Jones, D. T., W. R. Taylor and J. M. Thornton. A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochemistry, 33(10):3038–49, 1994. 140. Bairoch, A. and R. Apweiler. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1998. Nucleic Acids Research, 26(1):38–42, 1998. 141. Zhang, Y., D. Kihara and J. Skolnick. Local energy landscape flattening: parallel hyperbolic Monte Carlo sampling of protein folding. Proteins, 48(2):192–201, 2002.
8
DNA–Protein Interactions
Gary D. Stormo
"Gene expression" refers to the process by which the information encoded in the DNA of a genome is converted into the RNA and protein products that perform the various functions of the cell. The first step of expression is the transcription, or literally copying, of regions of the DNA into RNA sequences which may themselves perform important functions or which may be the information-carrying intermediates, messenger RNA (mRNA), that specify the sequences of proteins. The process of transcription requires a large complement of protein factors, including some that specify the position within the DNA sequence where transcription begins. This includes many proteins that are components of the RNA polymerase complex that forms at the transcription initiation site, many of which do not directly contact the DNA. The focus of this chapter is on those transcription factors (TFs) that interact directly with DNA in a sequence-specific manner to regulate transcription. They may act as repressors that turn off transcription of a particular gene when they are bound to DNA nearby, or they may be activators that facilitate the transcription of the genes they regulate. Some proteins even provide both functions at different promoters or under different conditions. There are many classes of TFs, defined by their structural similarities. At least one member of each structural family has been co-crystallized bound to DNA so that many details about the interactions are known [1,2]. Their most important feature is that they bind to DNA in a sequence-specific manner. This feature allows them to regulate the expression of a specific subset of genes, those with the appropriate binding site sequences in the appropriate locations. The specificity of the TFs is sometimes referred to by the protein's "consensus sequence," which is usually the sequence with the highest affinity. For example, the consensus sequence for the TF Sp1 is GGCGGGT. 
However, this is not the best description of the specificity of the Sp1 protein because it also binds with high affinity to several other sequences. Representations are described below that can provide better predictions of where the protein binds within a genome and which genes it regulates. In general the binding sites of TFs have a fixed length, although there are exceptions. The exceptions are usually for proteins that bind as dimers, a complex of two proteins that bind DNA together. In such cases the two
parts of the site, which interact with each monomer, may have variable spacing between them. However, even dimeric TFs usually bind to sites of fixed length, so fixed-length sites will be considered the general case in this chapter. If the binding site is a sequence of length l, there are 4^l different sequences to which the protein might bind. The specificity of the protein can be described by how well it can distinguish between the different sequences.

AFFINITY
The binding reaction between a protein, P, and a specific DNA sequence, Di, is diagrammed as

P + D_i ↔ P·D_i    (1)
The two directions of the reaction are usually referred to as the "on rate," k_on, for forming the complex, and the "off rate," k_off, at which the complex dissociates. Those rates depend on the protein and the DNA and can be considered intrinsic properties of the interaction. Of course they will also depend on the conditions of the reaction, such as temperature, pH, ionic strength, and the concentration of some metal ions (for example, Mg^2+ is often required for stable binding). In addition, the affinity of the protein for DNA may be affected by the binding of "effector" molecules, such as cAMP for the CRP protein, or by covalent modifications of the protein, such as phosphorylations. Those effects on the affinity of the protein for DNA provide a means for the cell to respond to signals from the environment by modifying the transcription of specific genes. How a protein's affinity changes in response to the reaction conditions and to modifications is an interesting and important topic. But for most of this chapter we assume that there is some constant reaction condition for which we care about the affinity. For experiments performed in vitro it is usually assumed that the conditions are approximately physiological so that the results are relevant in vivo. At equilibrium the "association constant" of the protein for any specific DNA sequence is

K_A(D_i) = k_on / k_off = [P·D_i] / ([P][D_i])    (2)
The brackets "[" and "]" refer to the concentrations of what is inside. It is important to realize that [P] refers to the concentration of the free protein, not including that which is bound to the DNA, and the same for the concentration of the free DNA, [Di]. The inverse of the association constant is called the "dissociation constant," KD(Di), and is a convenient number to keep in mind because it is the free protein concentration at
which half of the DNA would be bound (i.e., for which [Di]/[P·Di] = 1). The Gibbs standard free energy of binding is defined as:

ΔG°(D_i) ≡ −RT ln K_A(D_i) = RT ln K_D(D_i)    (3)
The units are usually reported in kcal/mole and this represents the difference in free energy of the equilibrium state from the standard state, which is 1 M of reactants and products. The probability that the sequence Di is bound to the protein is:

p(D_i bound) = [P·D_i] / ([P·D_i] + [D_i]) = [P] K_A(D_i) / ([P] K_A(D_i) + 1) = [P] / ([P] + K_D(D_i)) = 1 / (1 + e^((ΔG°(D_i) − μ)/RT))    (4)
The last equation is the Fermi–Dirac form of the binding probability equation where μ = RT ln[P] is the "chemical potential" set by the concentration of the protein [3]. The affinity of a protein for DNA is due to many contacts between them. Contacts with the DNA backbone, the sugar-phosphate chain that links the bases together, generally do not depend on the DNA sequence. There are exceptions because some proteins require the DNA to bend in order to make some contacts. Since the sequence will influence how bent or how flexible the DNA is, it can alter the contacts with the backbone [4]. But it is primarily the contacts to the base pairs that are responsible for the protein binding with higher affinity to some sequences than others. Most of those contacts are direct but some may involve, for example, water intermediates between the protein and the DNA [5]. The affinity of most TFs is in the range of 10^9 to 10^12 M^−1 for their preferred binding sites. But all TFs also have a fairly high affinity for nonspecific DNA, random DNA without any strong binding sites. The ratio of the affinities between specific and nonspecific sites is often in the range of 10^5 to 10^6. This limited ratio of affinity between specific and nonspecific binding has several consequences. It helps the protein to find its binding sites more quickly than simple diffusion because it allows the proteins to scan along the DNA sequence for high-affinity sites. The nonspecific binding mode of a protein is different from how it binds to its high-affinity, specific sites [6]. The limited ratio also means that the cell must make an excess of the TF over the number of regulatory sites. That is because the genome as a whole acts as a sponge that soaks up some of the protein. And while the affinity is much less for those nonspecific sites, there are so many more of them in the genome that most of the protein will be bound there instead of at the specific sites [7].
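Equation (4) is straightforward to evaluate numerically. The sketch below is a minimal illustration with hypothetical numbers (the function names and the value RT ≈ 0.593 kcal/mol at 25 °C are assumptions, not from the text); it computes a site's occupancy both directly from the dissociation constant and through the Fermi–Dirac form, confirming they agree:

```python
import math

RT = 0.593  # kcal/mol at ~25 degrees C (assumed reaction temperature)

def p_bound(p_free, k_d):
    """Occupancy of a site, [P] / ([P] + K_D): the third form in equation (4)."""
    return p_free / (p_free + k_d)

def p_bound_fermi(p_free, k_d):
    """Same occupancy via the Fermi-Dirac form 1 / (1 + exp((dG - mu)/RT)),
    using dG = RT ln K_D from equation (3) and mu = RT ln [P]."""
    dG = RT * math.log(k_d)      # standard free energy of binding
    mu = RT * math.log(p_free)   # chemical potential set by [P]
    return 1.0 / (1.0 + math.exp((dG - mu) / RT))

# A hypothetical site with K_D = 1 nM is half occupied at [P] = 1 nM.
assert abs(p_bound(1e-9, 1e-9) - 0.5) < 1e-12
assert abs(p_bound(1e-9, 1e-9) - p_bound_fermi(1e-9, 1e-9)) < 1e-9
```

Note that the occupancy depends only on the ratio [P]/K_D, which is why the dissociation constant is the convenient reference concentration mentioned above.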
For example, in a bacterial genome there might be only a few regulatory sites for a TF to bind, and therefore an excess of about 10^6 nonregulatory sites. If the protein binds with 10^6-fold higher affinity to the regulatory sites than to the average sequence in the genome, it would spend only about half of its time bound to the regulatory sites. In order for the regulatory sites to be occupied nearly all of the time, which is required for a repressor to function properly, there would have to be many-fold more copies of the protein than the number of sites needed to be bound. In prokaryotic cells, with genomes usually of about 10^6 base pairs, the amount of excess TF needed to compensate for the limited specificity is relatively modest. But in eukaryotic cells, where the genomes can be over 1000 times larger, it can be a much more significant problem. Eukaryotic cells do make an excess of the regulatory proteins, and to an even larger extent than prokaryotes, but not as much as would be needed without some other solutions, several of which cells can use to solve this problem [7]. One solution would be making proteins that have much higher specificity than in bacterial cells, but that strategy apparently is not used because eukaryotic TFs have similar specificity to those of prokaryotic TFs. However, it is much more common in eukaryotic cells that two, or more, TFs are required to regulate a gene's expression. If those TFs bind cooperatively, where their affinities are increased in the presence of the other factor, then specificity of the combined TFs can be much larger than either alone. Another solution is that much of a eukaryotic genome can be effectively hidden from access to the proteins by being sequestered in chromatin, where it does not compete for binding to the TFs. Finally, the mode of regulation also determines how much protein must be made.
Many prokaryotic promoters are "constitutive," meaning that they are active in the absence of regulation, and they are controlled by repressors that keep them turned off in the absence of an inductive signal. For this system to work, the repressor must be bound to the regulatory sites nearly all of the time, because whenever they are not bound there the promoter will be active. To achieve a 10-fold induction the repressor would have to be bound 90% of the time in the absence of inductive signal, and a 100-fold induction would require it to be bound 99% of the time. On the other hand, if a promoter is normally off, and requires an activator to turn it on, then that activator only needs to bind to the regulatory site a fraction of the time in order to achieve a significant increase in expression (the exact fold induction depends on exactly how off the promoter is when uninduced; most promoters are somewhat "leaky" and express some amount of the gene even when not activated). While both prokaryotic and eukaryotic cells use activators to control gene expression, they are the predominant mode in eukaryotes and can help to reduce the amount of TF needed for proper regulation.
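The "genomic sponge" competition argument above can be made concrete with a toy calculation. This sketch (all quantities hypothetical, not from the text) treats binding sites as being in excess of the protein, so the bound protein distributes among site classes in proportion to (number of sites) × (relative affinity):

```python
def specific_fraction(n_specific, n_nonspecific, affinity_ratio):
    """Fraction of bound protein found at specific sites when sites compete
    and are in excess: weight each class by site count times relative affinity."""
    specific_weight = n_specific * affinity_ratio  # specific sites bind affinity_ratio-fold tighter
    nonspecific_weight = n_nonspecific * 1.0       # nonspecific affinity as the reference
    return specific_weight / (specific_weight + nonspecific_weight)

# One regulatory site against ~10^6 nonspecific sites, with 10^6-fold
# specificity: the protein spends only about half its time at its site,
# matching the bacterial example in the text.
assert abs(specific_fraction(1, 10**6, 10**6) - 0.5) < 1e-12
```

Scaling the nonspecific site count up 1000-fold, as for a eukaryotic genome, drops the specific fraction to well under 1%, which is why cooperativity and chromatin sequestration matter.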
SPECIFICITY
Despite the limitations to specificity described above, it is the key aspect of TFs that is required for the proper functioning of a regulatory system and is the main focus of this chapter. Unlike affinity, there is no concise, standard definition of specificity, but the term refers to the difference in binding affinity for different DNA sequences. For any TF, an understanding of its specificity allows one to predict what sequences it will preferentially bind to and what genes it may regulate. As mentioned in the introduction, the specificity of most TFs is not well modeled by a simple consensus sequence. Rather there is an affinity for any sequence, from the most preferred sites down to those with nonspecific affinity. The complete specificity of the protein could be described as a list of binding affinities to all of its potential binding sites, all 4^l sequences of length l. Columns 2 and 3 of figure 8.1 show a list of the binding affinities and free energies for a hypothetical protein that binds to 4-long sequences. This list would be everything there is to know about the specificity of the protein (under the conditions of the experiment), although there are other aspects of the protein that we would also like to know. For instance, knowing the on- and off-rates
Figure 8.1 Binding affinities and free energies. For a hypothetical transcription factor that binds to 4-long sequences, the association constant, KA, and Gibbs free energy, ∆G°, are obtained for all possible binding site sequences. The relative and specific binding constants and energies are also shown. For the specific binding constant, KS, and free energy, ∆GS, background sequences with biases of 35% A and T and 15% C and G are used. For each binding constant the sum is also shown at the bottom, and for the two specific free energies the value of Ispec is provided (see text for definitions).
would tell us about the kinetics of protein–DNA association. We would also like to know about cooperative interactions with other proteins (including other copies of itself). That information might be necessary to correctly model the regulation of the genes under the control of the TF. We would also like to know what effector molecules or protein modifications may switch the protein between active and inactive states. And we might want to know exactly how the specificity is achieved, something for which we could get insight from a crystallographic study of the protein bound to DNA, and preferably bound to a set of different DNA sequences with different affinities [8]. But for the rest of this chapter we focus on the problem of representing and discovering the specificity of a TF from various types of data. Since specificity only depends on the differences in affinity, it is convenient to convert the association constants, and standard free energies, into relative ones as presented by Berg and von Hippel in their thermodynamic analysis of TF binding sites [9]. This could be done using any sequence as the reference but is most useful if the highest affinity site is used. In the example of figure 8.1 the sequence ACCG has the highest affinity so it is assigned a relative binding constant of 1 and a relative binding free energy of 0. The relative binding affinities and free energies are then determined for every other sequence by:

K_rel(D_i) ≡ K_A(D_i) / K_A(ACCG);    ΔG_rel(D_i) = −log₂ K_rel(D_i)    (5)
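Equation (5) can be applied directly to a table of measured association constants. The sketch below uses hypothetical K_A values in the spirit of figure 8.1 (the figure's actual numbers are not reproduced here):

```python
import math

# Hypothetical association constants (arbitrary units); ACCG is the best site.
K_A = {"ACCG": 16.0, "ACCA": 8.0, "TCCG": 4.0, "TTTT": 1.0}

best = max(K_A, key=K_A.get)  # reference sequence: the highest-affinity site
K_rel = {seq: k / K_A[best] for seq, k in K_A.items()}
dG_rel = {seq: -math.log2(kr) for seq, kr in K_rel.items()}  # equation (5)

assert best == "ACCG" and K_rel["ACCG"] == 1.0 and dG_rel["ACCG"] == 0.0
assert dG_rel["ACCA"] == 1.0  # each 2-fold drop in affinity adds 1 to dG_rel
```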
Note that K_rel is unitless and that we choose to use log₂ for the relative free energies so that every 2-fold change in affinity corresponds to a difference of 1 in free energy. In this hypothetical example, different sequences have relative affinities that are powers of one-half so that the relative free energies are all integers (figure 8.1). While relative binding constants and free energies are convenient for some purposes, they still do not give us a simple measure of the specificity of a TF or an easy way to compare the specificities of multiple TFs to each other. Nor do they provide an objective function that we can use for binding site discovery, as we describe later. For those purposes it is useful to define a specific binding constant and a specific binding free energy. The remainder of this section presents some algebraic manipulations of the fundamental equation (2) to define a useful measure of specificity. We assert that the specificity of any nonspecific protein, which is a protein that has the same binding affinity for all sequences, is 0. We can then define the specificity of any protein by comparing its binding probability distribution, over all possible binding sites, to that of the nonspecific protein. This will be shown below to be the average of the specific binding free energy. The binding probability distribution at
equilibrium is the frequency with which each particular sequence, Di, is bound to the protein when in competition with all other potential binding sites and is easily derived from equation (2):

$$F_b(D_i) \equiv \frac{[P \cdot D_i]}{\sum_j [P \cdot D_j]} = \frac{[D_i]\,K_A(D_i)}{\sum_j [D_j]\,K_A(D_j)} \tag{6}$$
Note that this is not the probability that a particular sequence is bound to the protein, which will depend on the concentration of the protein as shown in equation (4), but rather it is the distribution of different sequences within the bound fraction. The concentration of the protein only matters indirectly by determining the concentration of free DNA sequences. We also define a probability distribution for the unbound sequences:

$$F_u(D_i) \equiv \frac{[D_i]}{\sum_j [D_j]} \tag{7}$$
Note that this is the concentration of the free DNA sequences, not the total. Under conditions where there is very little protein compared to the number of DNA binding sites, and therefore all sequences are unbound most of the time, this can be approximated by the background, or prior, frequency of all the sites in a genome or in an experiment. We now define a specific binding constant as

$$K_S(D_i) \equiv \frac{F_b(D_i)}{F_u(D_i)} = \frac{K_A(D_i)}{\sum_j F_u(D_j)\,K_A(D_j)} \tag{8}$$
Note that KS, like Krel, is a unitless measure that is proportional to the association constant to every possible binding site and therefore is descriptive of the ratio of affinity between different sequences. For a nonspecific binding protein, KS(Di) = 1 for every sequence because each sequence will have the same distribution in the bound and unbound fractions. Specific binding constants to different sequences are fairly easily measured, as indicated in equation (8), because one merely has to measure the ratio of the bound and unbound sequences in an experiment where the different sequences are competing for the same pool of protein [10,11]. It is also useful to define a specific binding free energy:

$$\Delta G_S(D_i) = -\log_2 K_S(D_i) \tag{9}$$
A nonspecific binding protein has ∆GS(Di)= 0 for every sequence. We now define a measure of the specificity of a DNA binding protein,
Ispec (for information of specificity), as the negative average specific free energy of binding:

$$I_{spec} \equiv -\sum_i F_b(D_i)\,\Delta G_S(D_i) = \sum_i F_b(D_i)\log_2\frac{F_b(D_i)}{F_u(D_i)} \ge 0 \tag{10}$$
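Equations (6), (8), and (10) chain together directly. The two-sequence toy below is invented for illustration, and the unbound distribution is approximated by the background frequencies, as in the low-protein limit described above:

```python
import math

def bound_distribution(K_A, F_u):
    """Equation (6), with the unbound distribution F_u standing in for the
    free-DNA frequencies: F_b(Di) = F_u(Di) K_A(Di) / sum_j F_u(Dj) K_A(Dj)."""
    Z = sum(F_u[s] * K_A[s] for s in K_A)
    return {s: F_u[s] * K_A[s] / Z for s in K_A}

def I_spec(F_b, F_u):
    """Equation (10): relative entropy, in bits, between the bound and
    unbound distributions; zero only for a nonspecific protein."""
    return sum(f * math.log2(f / F_u[s]) for s, f in F_b.items() if f > 0)

# Hypothetical toy with two sequence classes and a uniform background.
F_u = {"S1": 0.5, "S2": 0.5}

# A nonspecific protein (equal affinities) has I_spec = 0 ...
nonspecific = bound_distribution({"S1": 1.0, "S2": 1.0}, F_u)
assert abs(I_spec(nonspecific, F_u)) < 1e-12

# ... while any preference gives I_spec > 0.
specific = bound_distribution({"S1": 4.0, "S2": 1.0}, F_u)
```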
This is similar to what we have previously termed the "information content" of a DNA binding protein [12,13], but differs in important ways. First, it does not rely on any particular model for DNA binding (see below) but instead is based solely on the association constants for each sequence. Second, it is based on the entire set of possible binding sites, not just those with sufficiently high affinity to be used as regulatory sites in vivo. As indicated in equation (10), Ispec ≥ 0 with equality only in the case of a nonspecific protein, which is a necessary criterion for a useful measure of specificity. The last form of the equation shows that Ispec is the relative entropy between the bound and unbound probability distributions. This is also called the Kullback–Leibler distance between those two distributions and is a useful statistical measure, related to χ2, of how different the two distributions are from each other. That makes it an appropriate measure for the difference in specificity between a TF of interest and a nonspecific DNA binding protein, or between any two TFs. It is also a very useful objective function in algorithms used to discover regulatory sites from coregulated genes [14], as described below. Note that KS, ∆GS, and Ispec all depend on the unbound (or background) distribution Fu. This is appropriate because the average KA depends on the mixture of potential binding sites [see equation (6)], and the average free energy of binding also depends on that mixture. However, to compare the specificity of different proteins to each other it is useful to define a measure that does not depend on the background frequencies of a particular genome or experiment. This is easily done by defining a standard condition of all sequences being equally abundant in the unbound fraction, that is, Fu(Di) = 4^−l for all l-long sequences. Using that value in equation (8) leads to a standard specific binding constant:

$$K_S^{\circ}(D_i) \equiv \frac{K_A(D_i)}{\langle K_A(D_j)\rangle} = 4^l \cdot F_b^{\circ}(D_i) \tag{11}$$
where 〈KA(Dj)〉 is the average over all the sequences. From this it is easy to calculate what the standard condition bound distribution, F°b(Di), would be, even if the experiment were not performed under that condition. This standard specific binding constant has a defined range and a fixed sum:

$$0 \le K_S^{\circ}(D_i) \le 4^l; \qquad \sum_i K_S^{\circ}(D_i) = 4^l \tag{12}$$
This also shows that it is easy to convert the specific binding constant determined under any condition to the value under the standard condition by normalizing the original values to have the fixed sum 4^l. We can also define the standard specific binding energy

$$\Delta G_S^{\circ}(D_i) = -\log_2 K_S^{\circ}(D_i); \qquad -2l \le \Delta G_S^{\circ}(D_i) \le \infty \tag{13}$$
and the standard information of specificity:

$$0 \le I_{spec}^{\circ} \le 2l \text{ bits} \tag{14}$$
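The conversion to the standard condition in equations (11)–(14) amounts to rescaling the specific binding constants so that they sum to 4^l. A sketch with an invented length-2 example (the values are not from the text):

```python
import math
from itertools import product

l = 2  # hypothetical site length, so there are 4**l possible sequences
seqs = ["".join(p) for p in product("ACGT", repeat=l)]

# Hypothetical specific binding constants measured under some background;
# only their ratios matter for the conversion.
K_S = {s: (4.0 if s == "AC" else 1.0) for s in seqs}

# Rescale so the values sum to 4**l, giving the standard condition (eq. 12).
scale = 4**l / sum(K_S.values())
K_S_std = {s: k * scale for s, k in K_S.items()}

# Standard bound distribution (eq. 11) and standard information of specificity.
F_b_std = {s: k / 4**l for s, k in K_S_std.items()}
I_spec_std = sum(f * math.log2(f * 4**l) for f in F_b_std.values() if f > 0)

assert all(0 <= k <= 4**l for k in K_S_std.values())  # bounds of eq. (12)
assert 0 <= I_spec_std <= 2 * l                       # bounds of eq. (14)
```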
where 0 is for a completely nonspecific binding protein and 2l bits is for a protein that binds exclusively to one sequence and has no affinity for all other sequences. All real TFs are between those two extremes and the value of I°spec measures where it is on that continuum. The higher the value, the more different is the binding distribution from a random distribution. We used this approach previously to determine the standard information content from an experiment in which the background frequencies of the sites were heavily biased [15]. It is important to note that the definition of specific binding constant, whether for the standard condition or any other condition, maintains the true ratio of binding affinities for any pair of sequences:

$$\frac{K_A(D_i)}{K_A(D_j)} = \frac{K_{rel}(D_i)}{K_{rel}(D_j)} = \frac{K_S^{\circ}(D_i)}{K_S^{\circ}(D_j)} = \frac{K_S(D_i)}{K_S(D_j)} \qquad \forall\, D_i, D_j \tag{15}$$
This fundamental relationship is true only if the unbound distribution, Fu(Di), is taken into consideration when determining the specific binding constants. It follows that the difference in binding free energies is also maintained regardless of the definition of free energy used. We use ∆∆G(Di,Dj) to refer to the difference in binding free energy between two different sequences:

$$\frac{\Delta\Delta G^{\circ}(D_i, D_j)}{RT\ln 2} = \Delta\Delta G_{rel}(D_i, D_j) = \Delta\Delta G_S^{\circ}(D_i, D_j) = \Delta\Delta G_S(D_i, D_j) \tag{16}$$
Figure 8.1 includes columns for K°S and ∆G°S and the value of I°spec for the hypothetical TF. It also includes values for KS, ∆GS, and Ispec that would be obtained from a genome or experiment in which the background was biased to be 35% A and T and 15% G and C. It can be seen that the important relationships of equations (15) and (16) are maintained, within rounding errors, for all comparisons.
MODELS OF SPECIFICITY
As described in the last section, if we knew the complete list of binding affinities to all possible sequences for a protein we would know all there is to know about its specificity. We could search the genome and find all of the high-affinity sites, and use that information to predict which genes are regulated by it, although knowing about interactions with other proteins might be essential to make accurate predictions. Unfortunately, such a list of affinities does not exist for any protein, and is not easy to obtain because the length of the list increases exponentially with the length of the binding site; there are 4^l possible binding sites of length l. Typical binding site lengths are 6 to 10 for proteins that bind as monomers, and can be twice that or longer for proteins that bind as dimers (or other multimers). At a length of 6 there are over 4000 different binding sites, and at length 10 there are over 1 million. In general the complete list of affinities, as shown in figure 8.1, will not be obtainable. It is also probably not needed for a good approximation to the specificity of the protein, because only a small fraction of the sequences, those with a significant probability of binding, is likely to contribute significantly to the total free energy; the rest can be grouped together as the vast majority of nonspecific sites. Because the complete list of binding affinities is generally not available, one represents the specificity of a DNA binding protein by some model that is reasonably simple and yet is a good approximation to the full specificity [14]. The simplest model would be just a sequence, often called the consensus sequence, for the protein. For the example in figure 8.1 that would be ACCG, which is the highest affinity sequence. For some proteins such a representation works very well.
For instance, restriction enzymes and DNA modification enzymes often have extremely high specificity such that only one sequence is cut, or modified, at an appreciable rate. The restriction enzyme EcoRI cuts the sequence GAATTC and essentially only that sequence, so that consensus is a good representation of its specificity. The specificity for the EcoRI enzyme is I°spec = 12 bits, which is consistent with its expected frequency of occurrence of once per 4096 positions in random DNA of equal composition. Of course in a genome with a biased composition it might occur more or less frequently, and that would be reflected in the value of Ispec for that genome. For some other restriction enzymes there are a group of sequences that are cut, but still the effect is essentially "all or none," such that there is no appreciable difference among the sequences that are cut, while any other sequence is essentially uncut. For example, the restriction enzyme HincII cuts at GTYRAC (R = A or G, Y = C or T) sites, a total of four different DNA sequences. Its specificity is described completely by that consensus sequence and I°spec = 10 bits. But for most
proteins the affinity is spread out over many different sequences with only small differences in affinity for very similar sequences, such as the example of figure 8.1. If every difference in sequence from the preferred site had a constant decrease in affinity, then one might still use a consensus sequence to represent the specificity. For example, suppose that each difference resulted in a 5-fold decrease in affinity; then one could search the genome using the consensus sequence, allowing for mismatches. Those sites with one mismatch would bind with 5-fold lower affinity than the preferred site, those with two mismatches would bind with 25-fold lower affinity, and so on. For such a protein the complete list of affinities, as in figure 8.1, could be replaced by the consensus sequence and an indication of the decrease in affinity per mismatch. However, nature appears not so simple and for most proteins the changes in affinity will vary depending on the position and the base substitution, as in the example of figure 8.1. In such cases a more flexible model is needed. The weight matrix model of DNA–protein interactions was developed as a simple model that captures the different effects of different mismatches from the consensus sequence [16]. In general, the elements of the matrix, W(b,j), are just some abstract scores for each base, b, at each position, j, of the binding site. Any sequence, Di, can be encoded into the same form of a matrix, where Di(b,j) = 1 if base b occurs at position j, and is 0 elsewhere. The score for the sequence is just the sum of the weight matrix elements that correspond to its sequence:

$$\mathrm{Score}(D_i) = \sum_{b=A}^{T}\sum_{j=1}^{l} D_i(b,j)\,W(b,j) = \vec{D}_i \cdot \vec{W} \tag{17}$$
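Scoring a sequence against a weight matrix is just the indicated sum of selected matrix elements. The 4 × 4 matrix below is hypothetical (it is not the matrix of figure 8.2, though it shares the ACCG consensus):

```python
# Hypothetical weight matrix: one row per base, one column per position.
W = {
    "A": [2.0, 0.0, 0.0, 0.5],
    "C": [0.0, 2.0, 2.0, 0.0],
    "G": [0.5, 0.0, 0.0, 2.0],
    "T": [0.0, 0.5, 0.5, 0.0],
}

def score(seq, W):
    """Equation (17): the sequence picks out one matrix element per position."""
    return sum(W[base][j] for j, base in enumerate(seq))
```

With this matrix the consensus ACCG scores 8.0 while TTTC scores 1.0, mirroring the high- and low-affinity sites of the running example.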
Figure 8.2 shows three different weight matrices, based on the example of figure 8.1, and how the score for the preferred sequence, ACCG, and one of the lowest affinity sequences, TTTC, is determined from the matrices. Note that it is customary in the biological literature for higher scores to correspond to higher affinity, and therefore lower energy, sites. In general we switch signs to translate from scores to predicted binding energies and in the ideal case, such as the example of figures 8.1 and 8.2, Score(Di) = −∆G(Di). The last form of equation (17) indicates that the weight matrix can be considered a vector in the space of sequences (which are also represented as vectors) and the score is the dot-product between those vectors. Since the sequence vectors are all of the same length, the differences in scores for different sequences are just related to the differences in angles between the sequences and the weight matrix vector. Sequences that are close to it, with a small angle between the vectors, will have higher scores than those that are farther away. If we imagine that there is some threshold of score required for a sequence to
function properly as a regulatory site in vivo, then that threshold defines a maximum angle between the weight matrix vector and all such regulatory sites; that is, it defines a subspace of functional sequences. While the weight matrix model contains considerable flexibility, it makes the assumption that each base in a binding site contributes to the binding activity independently of the rest of the sequence. Such additive models are probably not completely accurate in most cases, but may still provide a good approximation to the true specificity [17]. More complex models are easily obtained. The next simplest would be a dinucleotide model with 16 rows, one for each possible dinucleotide in adjacent positions. This has been termed a weight array [18], and it could be done as a first-order Markov model where the contribution of a base at one position depends on the adjacent base. Of course the interacting bases do not have to be adjacent, although that seems most likely in DNA binding sites (unlike RNA sites where distant positions interact to form the secondary structure). Higher order interactions can also be imagined and can be encoded as higher order Markov models [19]. More general networks of interactions, and even combinations of networks, can also be used [20]. Of course, the main problem is that each more complicated model has many more parameters that have to be estimated. Often there will not be enough data to estimate all of those parameters accurately. Some model of this type must be a perfect representation of the specificity of the protein because the actual binding energy data is itself a vector (column 3 of figure 8.1). The general goal is to find the simplest model that provides adequate accuracy (which may depend on the problem being addressed) or is the most accurate based on the available data. 
Figure 8.2 Weight matrix models of binding specificity. (A) A matrix with relative energy contributions of each base at each position to the total relative binding energy for the TF of figure 8.1 (except that the matrix values are the negative of the free energy values). The scores, from the matrix, for the highest affinity sequence, ACCG, are shown in bold and for one of the lowest affinity sequences, TTTC, are underlined. (B) The same matrix but using the standard specific binding free energies, ∆G°S, from figure 8.1. The scores for the highest and lowest affinity sites are also shown. Note that their difference is the same as in part A. (C) The same matrix but using the specific binding free energies, ∆GS. (D) The probabilities of each base at each position in the standard case with each base occurring at a probability of 25%. (E) The probabilities of each base at each position in the case where the prior probabilities of the bases are 35% A and T and 15% C and G.

For the remainder of this section we assume that a simple weight matrix is the appropriate model and we want to find the best model based on different types of data or for different purposes. The hypothetical example of figure 8.1 was designed to be completely additive so that one can obtain a weight matrix such that the scores for every sequence
are a perfect match to their binding energies. In general that will not be the case [10,21] and one may want to find a weight matrix that is the best fit to the binding energy data. Two approaches, using different definitions of best fit, have been described for this problem [17,22,23]. For the following we use ∆Ĝ, K̂, and F̂b to be the values of free energy, binding constant, and bound frequency that are predicted by the model. (It does not matter which form of each we use because we will only be comparing the differences in binding energies, or ratios of binding constants and frequencies, which are the same for each form, equations (15) and (16).) Since we are finding the weight matrix model that is the best fit to the quantitative binding data, we set ∆Ĝ(Di) = −Score(Di). The first definition of best fit is the least squared difference between the true and model binding energies and the objective is to determine the weight matrix W(b,j) that minimizes it:

$$R = \min_{W(b,j)} \sum_i \left(\Delta\hat{G}(D_i) - \Delta G(D_i)\right)^2 \tag{18}$$
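The least-squares fit of equation (18) is ordinary multiple regression on one-hot-encoded sequences. A sketch using NumPy with invented, perfectly additive energies (so the fit is exact); the penalty scheme is purely illustrative:

```python
import numpy as np
from itertools import product

BASES = "ACGT"

def one_hot(seq):
    """Flatten the indicator matrix D_i(b, j) into a length 4*l vector."""
    v = np.zeros(4 * len(seq))
    for j, b in enumerate(seq):
        v[4 * j + BASES.index(b)] = 1.0
    return v

# Hypothetical additive relative free energies for all length-2 sites:
# a per-base penalty at position 0, and a penalty for any non-C at position 1.
penalty0 = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}
seqs = ["".join(p) for p in product(BASES, repeat=2)]
dG = {s: penalty0[s[0]] + (0.0 if s[1] == "C" else 1.0) for s in seqs}

X = np.array([one_hot(s) for s in seqs])
y = np.array([-dG[s] for s in seqs])            # Score(D_i) = -dG(D_i)
W_flat, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes equation (18)
```

Because the data here are exactly additive, the fitted matrix reproduces every energy; with real measurements the residual R measures how far the additive model falls short.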
Standard multiple regression methods can obtain that weight matrix [22,23] and the value of R determines the goodness of fit. It may be that the best fit, according to the additive model, is quite poor so that more complex models are needed to provide reasonably accurate binding site predictions. A second definition of best fit is the minimum relative entropy, or Kullback–Leibler distance, between the true and model probability distributions [17]:

$$L = \min_{W(b,j)} \sum_i F_b(D_i)\log_2\frac{F_b(D_i)}{\hat{F}_b(D_i)} = \sum_i F_b(D_i)\left(\Delta\hat{G}(D_i) - \Delta G(D_i)\right) \tag{19}$$
The value of L is also a measure of goodness of fit, related to χ2, and we have argued that it is a better measure, at least for some purposes, because the emphasis is on fitting the high-probability sites and poorer predictions of the lower affinity sites are less important. Methods to obtain the best-fit weight matrix by this criterion are also straightforward [17]. In both cases the ideal situation would be to have the entire list of binding site energies (although in that case one may ask why having a model that approximates it is necessary). However, even having binding energies for a portion of all possible sites allows one to find the best-fit weight matrices from which one can predict the binding energy to all possible sites [22].
The most common data to have is a collection of known binding sites but without any affinity data. If there is a large sample of known sites, and if the additive model is reasonably accurate, then a simple model based on the relative distribution of bases at each position in the binding site can be a good representation of the specificity. From an alignment of the known binding sites we determine the frequency of each base at each position, Fb(b,j). We also need to know the background frequency of each base, Fu(b), in the genome from which the sites are obtained. Under the assumptions that the genome is a random sequence with that composition and that the additive assumption is valid, the weight matrix that minimizes the sum of the specific binding energy of all the sites is [24]:

$$W(b,j) = \log_2\frac{F_b(b,j)}{F_u(b)} \tag{20}$$
From this we can define the information content [12] of the set of sites as

$$IC = \sum_j\sum_b F_b(b,j)\log_2\frac{F_b(b,j)}{F_u(b)} = \sum_j\sum_b F_b(b,j)\,W(b,j) \tag{21}$$
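Building the log-odds matrix of equation (20) and the information content of equation (21) from an alignment takes only a few lines. The aligned sites below are invented (they are not the 26 sites of figure 8.3), and a uniform background is assumed:

```python
import math

# Hypothetical aligned binding sites for a TF with consensus ACCG.
sites = ["ACCG", "ACCG", "ACCA", "TCCG", "ACTG", "GCCG"]
F_u = {b: 0.25 for b in "ACGT"}  # assumed uniform background
l = len(sites[0])

# Per-position base frequencies F_b(b, j) from the alignment.
F_b = [{b: sum(s[j] == b for s in sites) / len(sites) for b in "ACGT"}
       for j in range(l)]

# Log-odds weight matrix, equation (20); -inf flags an unobserved base
# (in practice pseudocounts would be added to avoid this).
W = [{b: (math.log2(f[b] / F_u[b]) if f[b] > 0 else float("-inf"))
      for b in "ACGT"} for f in F_b]

# Information content of the site set, equation (21).
IC = sum(f[b] * W[j][b] for j, f in enumerate(F_b) for b in "ACGT" if f[b] > 0)
```

Here the invariant C at position 2 contributes the full 2 bits, and IC for the whole site is bounded by 2l, as equation (14) requires.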
The information content obtained from known sites like this will tend to overestimate the Ispec of the protein because it is based on the high-affinity sites, those with high enough affinity to function in vivo. The W(b,j) values will tend to give larger differences between the high-affinity bases and the low-affinity bases than is their true difference in binding energy, for the same reason. For example, suppose that for the hypothetical TF of figures 8.1 and 8.2, only sites within 8-fold of the affinity of the best site would function as regulatory sites in vivo. Then of the 256 possible sites, only 26 of them exceed that threshold. Figure 8.3B shows the frequency matrix of those 26 sites. This is the expected frequency of each base at each position if the sites with affinity above the threshold are selected at random. If the higher affinity sites are more likely to be used for regulation, the frequency matrix would be even more skewed toward the high-affinity bases at each position. Figure 8.3C shows the W(b,j) matrix for this data, using formula (20). As compared to the true energy values for this matrix (figure 8.2B) this is overly specific due to the limited number of low-affinity bases that occur. Low-affinity bases have restricted contexts in functional sites because any base that drops the affinity to just above the threshold can only occur when all of the other bases are the preferred ones. There are two main consequences of this. One is that the lower affinity bases are predicted, by formula (20), to have even lower affinity than they really have. The second is that it will appear that positions within the functional sites are correlated with each other. Correlations between positions are
sometimes presented as evidence that the positions do not contribute independently to the binding, but while that may be true it is not necessary. In fact, one of the expectations of an additive model with a threshold for functional activity is that there will be correlations in the base frequencies between positions. This is due simply to the restricted contexts that some bases appear in, especially the low-affinity bases. If one compares the matrix in figure 8.3C with the true matrix that it is to represent, from figure 8.2B, one sees that the preferred site is still the same, although its score and probability are somewhat higher than the true value. The lowest affinity sites are still correctly predicted too, but with lower score and lower probability than the true values. Overall there is a reasonable correlation between the true affinities and predicted ones, but there are some important differences. The lowest scoring true sites, ADGT (D = "not C"), all have scores of 0.6. Yet there are some nonfunctional sites, with lower true affinity, that have better scores. For example, GCKW (K = "G or T," W = "A or T") all have scores of 1.3 by the matrix, but their true affinity is below the threshold (16-fold reduction in relative affinity). This indicates that the matrix provided by the log-odds method, formula (20), while a reasonable fit overall, has some problems at the boundary and no threshold exists without false positive or false negative predictions. While in general such a problem might be due to the inadequacy of the additive model, in this example we know that is not the case. Rather it is due to the effect of having a threshold for functional sites, which is a realistic constraint. A solution is to determine an optimum matrix according to the binding probability [equation (3)] as in the work of [3]. The goal of that method is to simultaneously find a matrix
Figure 8.3. Weight matrix from example sites. (A) For the TF of figures 8.1 and 8.2, and assuming that only sites with affinity within 8-fold of the highest affinity site (i.e., ∆Grel ≤ 3) are functional, the 26 such sites have the occurrences for each base at each position as shown. (B) The probabilities for each base at each position, from part A. The probabilities for ACCG and TTTC are also shown, calculated as the probabilities from the matrix. (C) The weight matrix obtained from part B using the log-odds method [equation (20)]. The scores are shown for the highest scoring site, ACCG, one of the lowest scoring sites, TTTC, the lowest scoring true site, AACG, and one of the highest scoring false sites, GCGA. Note that the difference between the highest and lowest scores is larger than the true value (figure 8.2C) and that the highest false site scores higher than the lowest true site. (D) The weight matrix from part C, except adjusted to maximize the score of the lowest true site [3]. Now the difference in score between the highest and lowest sites is nearly correct and the lowest true site scores higher than the highest false site.
and a threshold, which is determined by the chemical potential μ, that scores all of the known sites above the threshold while minimizing the number of other sites above the threshold. In general we cannot assume that all of the sites that are not observed to bind are below the binding threshold because they may simply be missing in the observed sample. But we can assume that the number of other sequences should be kept to a minimum. The method for obtaining an optimal weight matrix by this criterion is similar to training a support vector machine, except for minimizing the total number of sites above the threshold instead of using specific nonfunctional sites. When the matrix of figure 8.3C is adjusted in this manner, the weight matrix of figure 8.3D is obtained. It can be seen to be much more similar to the true binding energy matrix (figure 8.2B). Now the lowest scoring true site scores 1.6, and the highest scoring nonfunctional site only 1.0, so a threshold exists without any false positives or false negatives. The overall specificity of the matrix is reduced, but that is more than compensated by having a higher threshold for all of the known sites.

MOTIF DISCOVERY
Another type of data has become prevalent in recent years due to technological advances. Using microarrays it is now possible to identify genes that are likely to be coregulated based on their coexpression patterns [25]. And using chromatin immunoprecipitation, followed by microarray hybridization (ChIP-chip experiments), it is possible to identify genomic regions that are bound to specific TFs [26]. One can also compare orthologous promoters between different species and, under the assumption that those genes will be regulated by a common mechanism, identify conserved regions that may be required for the regulation [27]. In each of these cases one obtains a collection of DNA sequences, whose lengths depend on the technique used or the problem being studied, that are expected to contain high-affinity binding sites for some common, and perhaps unknown, TF. The problem is now to identify the binding sites in each sequence and determine the specificity of the protein that binds to them. Those can be seen as dual problems because if one knew the specificity of the TF one could simply search the sequences to find the most likely binding sites, and if one knew where the binding sites were located then one could determine an appropriate representation for them. The algorithms developed to solve this problem come from both sides of the problem, either searching through the space of patterns for one that appears most significant in the data, or searching through the space of alignments to find the most significant one. A brief description of some of these approaches is provided in the rest of this section, and the reader is referred to recent reviews [14,28].
In pattern-driven methods one chooses a space of patterns to search and finds the one that is most significant within the set of sequences. Because a weight matrix has continuously variable elements, one cannot examine all possible weight matrices. Rather, it is usually assumed that some type of consensus sequence is a reasonable approximation to the pattern, and the space of consensus sequences is searched. Galas et al. [29] first applied this approach, finding a consensus sequence that was most represented in a collection of sequences where mismatches to it were allowed, but penalized for each mismatch. For short patterns one can consider all possible consensus sequences [30], even allowing for ambiguous bases in the pattern [31]. As the pattern gets longer it becomes impossible to consider all possible patterns, but one really only needs to consider patterns that occur, perhaps with some mismatches, in the data set. Efficient search methods have been developed for these methods [28]. For example, if the sequences are preprocessed into a suffix tree one can efficiently find the most significant pattern, even allowing for mismatches, in reasonable time and space [28,32,33]. The significance of a pattern can be evaluated by minimizing the total penalties associated with the mismatches, or by comparing the frequency of matches in the data set to an expected frequency or to the frequency in a control set that is not expected to be regulated by the same TF. While pattern-driven methods are unlikely to get the best overall pattern, that is, the one that best matches the specificity of the TF, they can be quite successful at identifying the correct binding sites within the set. How well they work depends on how conserved the sites are or how good an approximation a consensus sequence is for representing them. Once the sites are found, a more accurate representation of the TF specificity can be obtained from the sites, as described above.
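The core counting step of such pattern-driven searches, matching a candidate consensus against the sequence set with a mismatch allowance, can be sketched directly. The promoter fragments and the planted ACCG consensus below are invented for illustration:

```python
def mismatches(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def count_matches(seqs, pattern, max_mm):
    """Count windows matching the consensus with at most max_mm mismatches."""
    l = len(pattern)
    return sum(mismatches(s[i:i + l], pattern) <= max_mm
               for s in seqs for i in range(len(s) - l + 1))

# Hypothetical promoter fragments; two carry an exact ACCG site and one
# carries a single-mismatch site (ACTG).
seqs = ["TTACCGTT", "GGACCGAA", "CCACTGCC"]
```

Significance would then come from comparing such counts to the expected number in background sequence; an exhaustive pattern-driven search simply repeats this count for every candidate consensus.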
The other approaches are referred to as alignment driven or profile driven [14,28], where profile refers to an alignment or probability matrix (figure 8.3A,B) or a weight matrix of some kind derived from the alignment (figure 8.3C). This approach attempts to determine the weight matrix for the protein directly by finding the best alignment. The criterion for best alignment is usually information content [IC, equation (21)], or some similar measure of deviation from the background distribution. Often several alignments are identified and then ranked by methods similar to the pattern-driven approaches, such as a measure of the frequency of matches (high-scoring sites) in the data set compared to the background. The IC itself is a measure of the divergence between the probability distribution within the set of sites compared to a background model, but since real background sequences are often biased in various ways, explicitly counting the matches in the background can give a better measure of significance [24]. However, it is generally impossible to examine all possible alignments to find the best one, so heuristic methods have been developed to try and find the optimal alignments.
The first approach of the alignment-based methods was a simple greedy algorithm [34] that compared individual sequences to one another, saving some number (typically 1000 or more) of the top-scoring pairwise alignments. Those saved alignments are compared to the remaining sequences and updated to three-sequence alignments. The procedure is continued until all of the sequences are included, or some maximum of the objective function is obtained. A p-value can also be computed and used to identify the best alignment [35]. This was followed by an expectation-maximization (EM) method [36] and then by a Gibbs sampling approach [37]. These two approaches differ, but the main ideas have many similarities. They both rely on the dual nature of the problem and iterate between its two sides. Given a model for protein specificity, one can assign binding probabilities to all of the possible sites. Then one can use those probabilities to define the next model of specificity. In EM, the sites are all combined together, weighted by their probabilities, to determine the next model. In Gibbs sampling, all of the sites in one sequence (which was left out of the previous model-building step) are sampled based on their binding probabilities, and a single site is selected to add to the new model. In both cases the iterative procedure is run either for some fixed number of steps or until convergence. Several new programs have been developed over the last few years, but they are primarily built upon one or more of these approaches. In designing or using specific programs a number of important issues arise. One is simply how confident one is that each sequence in the data set contains binding sites for a common TF. If the data set is a collection of genes that are coexpressed under several different conditions they may be coregulated, but there also might be parallel regulatory pathways that lead to the same pattern, in which case there may not be a TF binding site that is common to every gene.
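The Gibbs-sampling idea just described reduces to a small loop: hold one sequence out, build a profile from the current sites in the others, and resample the held-out site in proportion to its weight under the profile. Everything below (the sequences, pseudocount, uniform background, and fixed site width) is an illustrative assumption, a toy sketch rather than any published implementation:

```python
import random

BASES = "ACGT"

def profile(sites, pseudocount=0.5):
    """Per-position base probabilities from the given sites, with pseudocounts."""
    w = len(sites[0])
    return [{b: (sum(s[j] == b for s in sites) + pseudocount) /
                (len(sites) + 4 * pseudocount) for b in BASES}
            for j in range(w)]

def window_weight(window, prof):
    """Likelihood ratio of a window under the profile vs a uniform background."""
    ratio = 1.0
    for j, b in enumerate(window):
        ratio *= prof[j][b] / 0.25
    return ratio

def gibbs(seqs, w, iters=200, seed=0):
    """Toy Gibbs sampler: one site of fixed width w per sequence."""
    rng = random.Random(seed)
    pos = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iters):
        i = rng.randrange(len(seqs))  # hold one sequence out
        others = [seqs[k][pos[k]:pos[k] + w]
                  for k in range(len(seqs)) if k != i]
        prof = profile(others)
        weights = [window_weight(seqs[i][p:p + w], prof)
                   for p in range(len(seqs[i]) - w + 1)]
        pos[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos

# Hypothetical coregulated sequences, each with a planted ACCG site.
seqs = ["TTTTACCGTTTT", "AAAAACCGAAAA", "CCCCACCGCCCC"]
pos = gibbs(seqs, w=4)
```

With a strongly planted motif the sampled positions tend to converge on the planted sites; real implementations add convergence criteria, variable widths, and background models.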
If the data set is a collection of promoter regions that are enriched in a ChIP-chip experiment, then each of those is quite likely to contain a binding site for the TF used in the chromatin immunoprecipitation. Especially in eukaryotic genomes, binding sites for specific TFs often occur in clusters, so the ability to take advantage of multiple similar sites within a particular sequence may be valuable. Finally, for many species there are now complete genome sequences available for other, closely related species, and information about conservation of sites can be useful in discovering true regulatory motifs [38,39]. These approaches first perform a phylogenetic footprinting step in which regions that are conserved between species are identified. Since those are the regions that are most likely to contain the regulatory sites, focusing on them can reduce the search space considerably. Then comparing the conserved regions of coregulated genes can give more reliable predictions about the regulatory motifs that they have in common. In fact, if the species are distant enough, phylogenetic footprinting alone may be sufficient
DNA–Protein Interactions
to identify regulatory motifs because other, nonfunctional sequences will diverge to the point where they are not significantly alignable, leaving only the regulatory sites [27].

cis-REGULATORY MODULES
Transcription factors often work together to control gene expression, especially in eukaryotes. This is one means of overcoming the specificity limitation described above. In fact, searching a large genome for high-affinity sites for any TF generally identifies many such sites, most of which are not functional because they are not in the correct context. The context may be inappropriate because that region of the genome is in a chromatin state where it is inaccessible to the TF, or because the TF requires a complex of other specific factors in order to function. Sets of binding sites for such interacting factors are referred to as cis-regulatory modules (CRMs), or sometimes just modules [40]. The expression of a gene is often controlled by the set of TFs that bind to its CRMs, rather than by individual TFs, and each TF may participate in more than one type of CRM. Accurate modeling of regulatory networks therefore requires knowing the set of CRMs, not just the TF binding sites within the genome. Predictions of genes with specific expression patterns can be much more reliable when based on CRMs than on individual binding sites, which suffer significant numbers of false positives. A variety of methods have been employed to discover important CRMs. The simplest is to look for pairs of TF binding sites that occur together much more often than expected by chance [41,42], which suggests that those TFs may interact. More generally, one can use a library of known motifs and search for any combinations whose occurrences appear to be significantly correlated [43,44]. Because not all of the TF binding sites and motifs are known for any species, these methods are currently incomplete, so one may need to apply motif discovery methods that search for combinations of individual sites that are jointly significant even in cases where the individual motifs may not be [45–47].
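The simplest pair-counting test mentioned above can be sketched as a hypergeometric calculation. This is one reasonable formulation among several; the published methods [41,42] differ in detail. Given n_total promoters, of which n_a contain a site for motif A and n_b a site for motif B, the upper-tail probability of seeing at least n_both promoters containing both, if the two sets were independent, is:

```python
from math import comb

def cooccurrence_pvalue(n_total, n_a, n_b, n_both):
    """Hypergeometric upper tail: probability of observing >= n_both promoters
    with sites for both motifs, given independent sets of sizes n_a and n_b."""
    p = 0.0
    for k in range(n_both, min(n_a, n_b) + 1):
        p += comb(n_a, k) * comb(n_total - n_a, n_b - k) / comb(n_total, n_b)
    return p
```

A small p-value flags a motif pair that co-occurs far more often than chance predicts, and hence a candidate CRM; in practice one would correct for the many pairs tested.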
Extensive expression data, or other sources of TF binding data, can help in the discovery not only of individual binding sites but also of CRMs, using various types of motif discovery algorithms and correlation analyses [48,49]. Since understanding and modeling complete regulatory networks requires the identification of CRMs, advances in methods to discover them and to determine their responses to cellular states are a critical goal.

RECOGNITION CODE
The general idea of a recognition code is that there should be a relationship between the sequence of a TF and its binding site specificity such
that knowing one allows for the accurate prediction of the other. Seeman et al. [50] first proposed a very simple recognition code in which specific amino acids would be used to interact with particular base pairs. After a few TF structures had been determined by crystallography and some striking similarities were observed in their DNA binding domains, Pabo and Sauer also proposed that a simple recognition code might explain the specificity of each TF [51]. However, after the structures of only a few DNA–protein complexes had been determined, it was clear that a simple, universal code did not exist [52]. The same amino acid could be in contact with different base pairs, and each base pair could interact with multiple amino acids. However, it was found that within specific TF protein families—and the zinc-finger proteins have been studied in the most detail—there were clear preferences for particular amino acid and base-pair combinations that constitute a qualitative code [53–55]. For those proteins the preferred binding sites could often be predicted from the protein sequence and one can even design a protein to bind with high affinity to a specific DNA sequence [56]. But the bigger challenge is to find a model that links the DNA binding specificity of a protein, as described in the models of figure 8.2, with the amino acid sequence of the TF. This would be a quantitative, or probabilistic, code that predicts binding site energies from a TF sequence [57]. Figure 8.4 presents an idealized model of this type. In that figure every amino acid is assigned a specific binding energy for every base pair (shown are relative binding energies compared to the preferred base pair). From that table the binding specificity of any protein would be predicted by concatenating the rows that correspond to the protein's sequence. 
For instance, if the DNA binding domain of the protein had the sequence QDTR, then the predicted relative energy matrix would be as shown in the figure, which corresponds to the matrix of figure 8.2. A matrix of the type shown in figure 8.4 was determined using a log-odds scoring system and data from all of the DNA–protein crystal structures [58]. That matrix was shown to give reasonable predictions for the binding specificities of some zinc-finger proteins. However, it is an overly simplified model in several respects. First, it assumes that a single matrix can adequately represent all interactions. Even the qualitative models for zinc-finger proteins show position-specific interaction preferences [54,55]. For example, in positions 1 and 3 of the binding sites, a G is most likely to interact with R (the amino acid arginine). But at position 2, an H (histidine) is the most common amino acid interacting with G, and R is rarely used. This is due to the geometry of the interface between the protein and the DNA, where position 2 is much closer to the protein backbone and a G interacts optimally with H. Positions 1 and 3 are farther from the protein, and the large amino acid R makes the most favorable contact with a G. Two other groups developed models that
Figure 8.4 Recognition code matrix. An energy value is obtained for all pairs of amino acids (aa) with base pairs (base), but only a few of them are shown. According to this simple model, the weight matrix for any protein can be obtained by concatenating the rows from the interaction matrix according to the protein's sequence. For a hypothetical TF that has a DNA binding domain with amino acid sequence QDTR, the weight matrix on the right is obtained, which corresponds to the example protein of the previous figures.
are similar in some ways but also take into account more of the geometry of the interaction [59,60]. Still, with limited data such approaches have limited accuracy. Using extensive data for just one family of proteins, the EGR zinc-finger proteins, Benos et al. developed a position-specific model, with a different matrix, of the type shown in figure 8.4, for each of the four positions of the interaction and showed that it could predict quantitative binding affinities with reasonable accuracy [61].
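The concatenation scheme of figure 8.4 is simple enough to sketch directly. The energy values below are invented for illustration (the figure's actual numbers are not reproduced here), and QDTR is the hypothetical DNA binding domain used in the text:

```python
# Hypothetical relative-energy rows (energy above that of the preferred base)
# for a few amino acids; these values are made up for illustration only.
AA_ROWS = {
    "Q": {"A": 0.0, "C": 1.2, "G": 1.5, "T": 0.9},
    "D": {"A": 1.1, "C": 0.0, "G": 1.3, "T": 1.4},
    "T": {"A": 0.8, "C": 1.0, "G": 0.0, "T": 1.2},
    "R": {"A": 1.6, "C": 1.3, "G": 0.0, "T": 1.1},
}

def predicted_energy_matrix(protein_seq):
    # Concatenate the row for each contacting residue, as in figure 8.4.
    return [AA_ROWS[aa] for aa in protein_seq]

def site_energy(matrix, site):
    # Additive model: total relative binding energy of a DNA site.
    return sum(col[b] for col, b in zip(matrix, site))
```

Under this additive model the preferred site has energy zero, and every substitution in either the protein sequence (a different row) or the DNA site (a different column entry) shifts the predicted energy independently, which is exactly the assumption questioned later in the text.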
All of those approaches are really just a start in this direction. Benos et al. [61] only tried to predict binding affinities for one family. Other approaches try to predict for multiple families [58–60], which makes the approach more general but also suffers from lower accuracy. We might expect that the four matrices of Benos et al. are not the complete set, and that other families might have additional such matrices. Perhaps a collection of such amino acid and base-pair interaction matrices could be obtained that represents the entire repertoire of interacting positions, so that for any particular protein it is only a matter of identifying the correct set to make reasonable predictions of its specificity. Of course this model, like those in figure 8.2, relies on additivity. Here it is even more important because it requires additivity both in the DNA and in the protein. This is undoubtedly an oversimplified model, but we do not know at this time how good or bad an approximation it is. More data are necessary to answer those questions. An alternative approach would be more biophysical. Given a particular example of a DNA–protein complex with a known structure, one could make substitutions, in silico, in either the DNA or the protein sequence. Then, by running molecular dynamics simulations, one could try to compute the change in free energy. So far attempts in that direction have not been very successful, probably owing to limited accuracy in the energy parameters used in the modeling. As better parameters become available, it may be possible to use this very general approach to make predicted binding matrices that are reasonably accurate for any protein that has a structurally similar example with known structure. The success of a recognition code would help to solve many of the problems described in earlier sections. With an accurate code one could, for any protein, develop a model of its specificity from its sequence alone.
And for any binding site motif discovered in a genome, one could compare it to all of the TFs in that genome and determine which one is most likely to bind to it. One could follow this up with other statistical analyses to identify interacting TFs based on CRMs. Experiments would still be needed to determine under what conditions the TFs were active, but much could be predicted from the sequences alone. As sequences become increasingly easy and inexpensive to obtain, improvements in computational methods to extract the most information from them will become increasingly valuable. This is true in most areas of molecular biology, but models of gene regulation will be especially facilitated by such advances.

CONCLUSION
Control of transcription relies upon proteins that bind to specific DNA sites with much greater affinity than to the bulk of genomic sequences.
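One way to quantify this discrimination is the information content of a specificity model [12]. A minimal sketch, representing a weight matrix as a list of per-position frequency dictionaries and assuming a uniform background:

```python
from math import log2

def information_content(pwm, bg=0.25):
    """Total information (bits) of a weight matrix:
    sum over positions of sum_b f_b * log2(f_b / bg)."""
    return sum(sum(f * log2(f / bg) for f in col.values() if f > 0)
               for col in pwm)

def expected_site_frequency(pwm):
    # A motif carrying I bits is expected roughly once per 2**I bases
    # of random sequence, giving the genome-frequency estimate noted here.
    return 2.0 ** (-information_content(pwm))
```

A completely uninformative position contributes zero bits; a fully specific position contributes two, so a fully specific site of length n carries 2n bits and is expected about once per 4**n bases.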
The information content of a specificity model is a useful measure of how effectively a TF can distinguish different sequences and can provide an estimate of the frequency of binding sites expected in a genome. Simple representations of TF specificity, such as models that assume independent contributions of the bases in the binding site, are often adequate for algorithms that discover regulatory sites and search for new sites in the genome. Depending on the type of data available, different approaches are needed to determine the optimal model for any TF, and the underlying assumptions of the model need to be taken into consideration. For example, correlations between positions in a sample of known binding sites do not necessarily imply nonindependent contributions to binding affinity. Improved algorithms are still needed to efficiently determine the optimal models for TFs from a variety of different data types. CRMs, in which combinations of TFs act coordinately to control gene expression, are common in eukaryotic species. Improved algorithms are needed for discovering them and modeling their effects on transcription, although experimental methods that determine their characteristics are currently limiting and technological advances are required for rapid progress. The development and application of recognition codes for all TF families would facilitate many studies of gene regulation and allow for improved predictions based solely on genomic sequences that are readily obtained by current technologies. A combination of experimental and computational approaches to studying control of gene expression is essential, because experimental methods alone are too slow to make the discoveries that are needed, and computational methods alone have limited accuracy because of oversimplified models and a lack of information about important features of the biological systems.

ACKNOWLEDGMENTS

Support for our work in this area has come primarily from two NIH grants, GM28755 and HG00249.
REFERENCES

1. Jones, S., P. van Heyningen, H. M. Berman and J. M. Thornton. Protein-DNA interactions: a structural analysis. Journal of Molecular Biology, 287(5):877–96, 1999.
2. Luscombe, N. M., R. A. Laskowski and J. M. Thornton. Amino acid–base interactions: a three-dimensional analysis of protein–DNA interactions at an atomic level. Nucleic Acids Research, 29:2860–74, 2001.
3. Djordjevic, M., A. M. Sengupta and B. I. Shraiman. A biophysical approach to transcription factor binding site discovery. Genome Research, 13(11):2381–90, 2003.
4. Olson, W. K., A. A. Gorin, X. J. Lu, L. M. Hock and V. B. Zhurkin. DNA sequence-dependent deformability deduced from protein-DNA crystal complexes. Proceedings of the National Academy of Sciences USA, 95(19):11163–8, 1998.
5. Jayaram, B. and T. Jain. The role of water in protein-DNA recognition. Annual Review of Biophysics and Biomolecular Structure, 33:343–61, 2004.
6. Kalodimos, C. G., N. Biris, A. M. Bonvin, M. M. Levandoski, M. Guennuegues, R. Boelens and R. Kaptein. Structure and flexibility adaptation in nonspecific and specific protein-DNA complexes. Science, 305(5682):386–9, 2004.
7. von Hippel, P. H. On the molecular basis of the specificity of interaction of transcriptional proteins with genome DNA. In R. F. Goldberger (Ed.), Biological Regulation and Development, 1:279–347. Plenum, New York, 1979.
8. Elrod-Erickson, M., T. E. Benson and C. O. Pabo. High-resolution structures of variant Zif268-DNA complexes: implications for understanding zinc finger-DNA recognition. Structure, 6(4):451–64, 1998.
9. Berg, O. G. and P. H. von Hippel. Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. Journal of Molecular Biology, 193:723–50, 1987.
10. Man, T.-K. and G. D. Stormo. Non-independence of Mnt repressor–operator interaction determined by a new quantitative multiple fluorescence relative affinity (QuMFRA) assay. Nucleic Acids Research, 29:2471–8, 2001.
11. Man, T.-K., J. S. Yang and G. D. Stormo. Quantitative modeling of DNA-protein interactions: effects of amino acid substitutions on binding specificity of the Mnt repressor. Nucleic Acids Research, 32(13):4026–32, 2004.
12. Schneider, T. D., G. D. Stormo, L. Gold and A. Ehrenfeucht. Information content of binding sites on nucleotide sequences. Journal of Molecular Biology, 188:415–31, 1986.
13. Stormo, G. D. and D. S. Fields. Specificity, free energy and information content in protein-DNA interactions. Trends in Biochemical Sciences, 23(3):109–13, 1998.
14. Stormo, G. D. DNA binding sites: representation and discovery. Bioinformatics, 16:16–23, 2000.
15. Schneider, T. D. and G. D. Stormo. Excess information at bacteriophage T7 genomic promoters detected by a random cloning technique. Nucleic Acids Research, 17(2):659–74, 1989.
16. Stormo, G. D., T. D. Schneider, L. Gold and A. Ehrenfeucht. Use of the "Perceptron" algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Research, 10(9):2997–3011, 1982.
17. Benos, P. V., M. L. Bulyk and G. D. Stormo. Additivity in protein–DNA interactions: how good an approximation is it? Nucleic Acids Research, 30(20):4442–51, 2002.
18. Zhang, M. and T. Marr. A weight array method for splicing signal analysis. Computer Applications in the Biological Sciences, 9:499–509, 1993.
19. Ellrott, K., C. Yang, F. M. Sladek and T. Jiang. Identifying transcription factor binding sites through Markov chain optimization. Bioinformatics, 18(Suppl 2):S100–9, 2002.
20. Barash, Y., G. Elidan, N. Friedman and T. Kaplan. Modeling dependencies in protein-DNA binding sites. Proceedings of the 7th Annual International Conference on Research in Computational Molecular Biology, pp. 28–37, 2003.
21. Bulyk, M. L., P. L. F. Johnson and G. M. Church. Nucleotides of transcription factor binding sites exert inter-dependent effects on the binding affinities of transcription factors. Nucleic Acids Research, 30:1255–61, 2002.
22. Stormo, G. D., T. D. Schneider and L. Gold. Quantitative analysis of the relationship between nucleotide sequence and functional activity. Nucleic Acids Research, 14(16):6661–79, 1986.
23. Lee, M.-L., M. Bulyk, G. Whitmore and G. Church. A statistical model for investigating binding probabilities of DNA nucleotide sequences using microarrays. Biometrics, 58:981–8, 2002.
24. Heumann, J. M., A. S. Lapedes and G. D. Stormo. Neural networks for determining protein specificity and multiple alignment of binding sites. Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 2:188–94, 1994.
25. Roth, F. P., J. D. Hughes, P. W. Estep and G. M. Church. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nature Biotechnology, 16(10):939–45, 1998.
26. Lee, T. I., N. J. Rinaldi, F. Robert, D. T. Odom, Z. Bar-Joseph, G. K. Gerber, N. M. Hannett, C. T. Harbison, C. M. Thompson, I. Simon, J. Zeitlinger, E. G. Jennings, H. L. Murray, D. B. Gordon, B. Ren, J. J. Wyrick, J. B. Tagne, T. L. Volkert, E. Fraenkel, D. K. Gifford and R. A. Young. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298(5594):799–804, 2002.
27. McCue, L., W. Thompson, C. Carmack, M. Ryan, J. Liu, V. Derbyshire and C. Lawrence. Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes. Nucleic Acids Research, 29:774–82, 2001.
28. Pavesi, G., G. Mauri and G. Pesole. In silico representation and discovery of transcription factor binding sites. Briefings in Bioinformatics, 5(3):217–36, 2004.
29. Galas, D. J., M. Eggert and M. S. Waterman. Rigorous pattern-recognition methods for DNA sequences. Analysis of promoter sequences from Escherichia coli. Journal of Molecular Biology, 186(1):117–28, 1985.
30. van Helden, J., B. Andre and J. Collado-Vides. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. Journal of Molecular Biology, 281(5):827–42, 1998.
31. Ulyanov, A. V. and G. D. Stormo. Multi-alphabet consensus algorithm for identification of low specificity protein-DNA interactions. Nucleic Acids Research, 23(8):1434–40, 1995.
32. Pavesi, G., G. Mauri and G. Pesole. An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics, 17(Suppl 1):S207–14, 2001.
33. Marsan, L. and M. F. Sagot. Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification. Journal of Computational Biology, 7(3–4):345–62, 2000.
34. Stormo, G. D. and G. Hartzell III. Identifying protein-binding sites from unaligned DNA fragments. Proceedings of the National Academy of Sciences USA, 86:1183–7, 1989.
35. Hertz, G. and G. D. Stormo. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15:563–77, 1999.
36. Lawrence, C. and A. Reilly. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins, 7:41–51, 1990.
37. Lawrence, C., S. Altschul, M. Boguski, J. Liu, A. Neuwald and J. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262:208–14, 1993.
38. Wasserman, W. W., M. Palumbo, W. Thompson, J. W. Fickett and C. E. Lawrence. Human-mouse genome comparisons to locate regulatory sites. Nature Genetics, 26(2):225–8, 2000.
39. Wang, T. and G. D. Stormo. Combining phylogenetic data with coregulated genes to identify regulatory motifs. Bioinformatics, 19(18):2369–80, 2003.
40. Klingenhoff, A., K. Frech, K. Quandt and T. Werner. Functional promoter modules can be detected by formal models independent of overall nucleotide sequence similarity. Bioinformatics, 15(3):180–6, 1999.
41. Wagner, A. Genes regulated cooperatively by one or more transcription factors and their identification in whole eukaryotic genomes. Bioinformatics, 15(10):776–84, 1999.
42. Pilpel, Y., P. Sudarsanam and G. M. Church. Identifying regulatory networks by combinatorial analysis of promoter elements. Nature Genetics, 29(2):153–9, 2001.
43. Frith, M. C., U. Hansen and Z. Weng. Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics, 17(10):878–89, 2001.
44. Sharan, R., I. Ovcharenko, A. Ben-Hur and R. M. Karp. CREME: a framework for identifying cis-regulatory modules in human-mouse conserved segments. Bioinformatics, 19(Suppl 1):i283–91, 2003.
45. GuhaThakurta, D. and G. D. Stormo. Identifying target sites for cooperatively binding factors. Bioinformatics, 17(7):608–21, 2001.
46. Thompson, W., M. J. Palumbo, W. W. Wasserman, J. S. Liu and C. E. Lawrence. Decoding human regulatory circuits. Genome Research, 14(10A):1967–74, 2004.
47. Zhou, Q. and W. H. Wong. CisModule: de novo discovery of cis-regulatory modules by hierarchical mixture modeling. Proceedings of the National Academy of Sciences USA, 101(33):12114–19, 2004.
48. Beer, M. A. and S. Tavazoie. Predicting gene expression from sequence. Cell, 117(2):185–98, 2004.
49. Bar-Joseph, Z., G. K. Gerber, T. I. Lee, N. J. Rinaldi, J. Y. Yoo, F. Robert, D. B. Gordon, E. Fraenkel, T. S. Jaakkola, R. A. Young and D. K. Gifford. Computational discovery of gene modules and regulatory networks. Nature Biotechnology, 21(11):1337–42, 2003.
50. Seeman, N. C., J. M. Rosenberg and A. Rich. Sequence-specific recognition of double helical nucleic acids by proteins. Proceedings of the National Academy of Sciences USA, 73:804–8, 1976.
51. Pabo, C. O. and R. T. Sauer. Protein–DNA recognition. Annual Review of Biochemistry, 53:293–321, 1984.
52. Matthews, B. W. Protein–DNA interaction. No code for recognition. Nature, 335:294–5, 1988.
53. Desjarlais, J. R. and J. M. Berg. Toward rules relating zinc finger protein sequences and DNA binding site preferences. Proceedings of the National Academy of Sciences USA, 89:7345–9, 1992.
54. Choo, Y. and A. Klug. Physical basis of a protein–DNA recognition code. Current Opinion in Structural Biology, 7:117–25, 1997.
55. Wolfe, S. A., L. Nekludova and C. O. Pabo. DNA recognition by Cys2His2 zinc finger proteins. Annual Review of Biophysics and Biomolecular Structure, 29:183–212, 2000.
56. Jantz, D., B. T. Amann, G. J. Gatto and J. M. Berg. The design of functional DNA-binding proteins based on zinc finger domains. Chemical Reviews, 104(2):789–99, 2004.
57. Benos, P. V., A. S. Lapedes and G. D. Stormo. Is there a code for protein–DNA recognition? Probab(ilistical)ly. Bioessays, 24:66–75, 2002.
58. Mandel-Gutfreund, Y. and H. Margalit. Quantitative parameters for amino acid–base interaction: implications for prediction of protein–DNA binding sites. Nucleic Acids Research, 26:2306–12, 1998.
59. Suzuki, M., S. E. Brenner, M. Gerstein and N. Yagi. DNA recognition code of transcription factors. Protein Engineering, 8:319–28, 1995.
60. Kono, H. and A. Sarai. Structure-based prediction of DNA target sites by regulatory proteins. Proteins: Structure, Function and Genetics, 35:114–31, 1999.
61. Benos, P. V., A. S. Lapedes and G. D. Stormo. Probabilistic code for DNA recognition by proteins of the EGR family. Journal of Molecular Biology, 323(4):701–27, 2002.
9
Some Computational Problems Associated with Horizontal Gene Transfer

Michael Syvanen
It has been over 30 years since the suggestion that horizontal gene transfer (HGT) may have been a factor in the evolution of life entered the literature. Initially these speculations were based on discoveries made in medical microbiology, namely, that genes for resistance to antibiotics were found to move from one bacterial pathogen to another. This discovery was so unexpected and contrary to accepted genetic principles that though it was announced in Japan in 1959 [1,2], it was not generally recognized in the West for another decade. The speculation that HGT may have been a larger factor in the evolution of life was inviting because it offered broad explanations for a variety of biological phenomena that have interested and puzzled biologists for the last century and a half. These were problems that had been raised by botanists who puzzled over the evolution of green plants [3], as well as by paleontologists who recorded macroevolutionary trends [4] in the fossil record that were often difficult to reconcile with the New Synthesis that merged Darwin's thinking with Mendelian genetics. However, outside the field of bacteriology this exercise did not attract much attention until the late 1990s, when a major influx of data indicating that HGT had been very pervasive in early life appeared: complete genome sequences. Simple examination of these sequences showed beyond any doubt that horizontal gene transfer was indeed a major factor in the evolution of modern bacterial, archaeal, and eukaryotic genomes. Hence, in the past seven years or so, investigations into HGT have moved from the realm of the highly speculative and poorly documented to a robust area of investigation, especially for problems based on computationally intensive studies of genome sequences.
A prerequisite for exploring HGT is the ability to distinguish between genomic regions that may have originated from a foreign source (i.e., from a parallel lineage) and genomic regions whose evolutionary history is the result of vertical evolution within a given lineage. In the current review I will go over four areas that pose nontrivial computational problems.
These are: (1) the phylogenetic congruency test, (2) mosaics, (3) distance discrepancy, and (4) nucleotide composition analysis. Even though there is a rich literature concerning phylogenetic incongruities and mosaics, there has been little recent progress on new computational approaches. Therefore this review will focus mainly on the distance discrepancy approach and on atypical nucleotide composition analysis. These latter two approaches have the potential to shed light on some outstanding biological questions. Before going into these problems, I will review the concept of common ancestry as a means of introducing the general topic of HGT and of explaining why it is having such a profound influence on how we think about biology in general.

THE LAST UNIVERSAL COMMON ANCESTOR
An example of how profoundly the notion of HGT has changed our thinking concerns the concept of the last universal common ancestor (LUCA). This idea was central to the hypothesis that all life shared common ancestors. Though the idea of common ancestry remains valid (indeed, evidence for common ancestry is everywhere in the sequences of our genes), there is no longer a need to postulate that all life evolved from a single last universal common ancestor. Rather, we can entertain common descent from multiple ancestors. The notion that all life passed through a single interbreeding bottleneck is probably still believed to be true by most people who think about this problem. The reason is simple. There are many genes involved in information processing (i.e., DNA replication, RNA transcription, and protein synthesis) whose orthologs are found in all three major domains of life. Furthermore, when the sequences of these genes are submitted to phylogenetic analysis they more or less support the following relationship: the Archaea and Eukaryotes define a clade to the exclusion of a bacterial clade, and a single line links both of these clades. Figure 9.1A shows this relationship. The figure shows an unrooted tree with four taxa; this happens to be a topology that is susceptible to semirigorous statistical analysis (see below). The Archaea/Eukaryote clade, by definition, implies the existence of a common ancestor for these two groups, and further we can infer that a point on the line leading to the bacterial clade represents the last common ancestor of all life. Thus we can say that there is empirical support for the existence of the last common ancestor. I mentioned above that this scenario is more or less supported by the informational genes. The striking finding is that other genes common to the three major kingdoms frequently show exceptions to these relationships.
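The four-taxon analysis mentioned above can be made concrete with one simple distance-based criterion, the four-point condition: under additive distances, the true unrooted topology is the pairing with the smallest sum of within-pair distances. This sketch is only an illustration (the taxa labels and distances are generic, not data from figure 9.1), and it omits the statistical tests that a real congruency analysis would apply.

```python
def quartet_topology(d):
    """d: pairwise distances among taxa A, B, C, D, keyed 'AB', 'AC', etc.
    Four-point condition: the supported split pairs the two taxa whose
    within-pair distance sum is smallest."""
    candidates = [
        (d["AB"] + d["CD"], "((A,B),(C,D))"),
        (d["AC"] + d["BD"], "((A,C),(B,D))"),
        (d["AD"] + d["BC"], "((A,D),(B,C))"),
    ]
    return min(candidates)[1]
```

Applied gene by gene, a test of this kind is what reveals incongruence: informational genes tend to support one quartet topology while, as described next, many metabolic genes support another.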
When it comes to the genes for energy metabolism, Eukaryotes and gram-negative bacteria are usually more closely related to one another than they are to the Archaea and other bacteria
Figure 9.1 Universal tree of life and two alternatives. Bacteria contain many deeply rooted clades; here we include two groups, shown as the gram (−) bacteria (more accurately known as proteobacteria) and the gram (+) bacteria (the low-GC gram (+) group). A shows the so-called universal tree that is supported by the rRNA sequences. B shows the relationships found for a very large number of genes involved in metabolism and biosynthesis. C simply shows the remaining four-taxon relationship, which very few genes seem to follow.
(as in figure 9.1B). These genes are thought to have become associated with the eukaryotic cell through the endosymbiont that eventually gave rise to the mitochondrion [5–7]. In green plants we can also trace the ancestry of many genes involved in carbon fixation, photosynthesis, and other metabolic processes to cyanobacteria, the endosymbiont that gave rise to the chloroplast. For many of the biosynthetic pathways the relevant genes yield even more complex relationships. Thus we have arrived at the current situation that is accepted by most: there remain a few genes (almost all associated with basic genetic informational processing) that reflect an evolutionary history that goes back to some very primitive LUCA, but superimposed over the remnants of that primitive ancestor in modern genomes are numerous examples of subsequent horizontal gene transfer events. The above is a good model, and it requires good reasons to reject it. To begin, not all of the informational orthologs support the simple phylogenetic pattern outlined above. Even here there are some exceptions. These exceptions have been dealt with in one of two ways. First, in some cases it can be argued that there is an insufficient amount of sequence to rigorously support the true clade relationships (i.e., sequence noise or homoplasy is hiding the true pattern); alternatively, these are
informational genes that also have been involved in HGT events. Though some of the cases are still open to debate, there are a number of cases where it is simplest to conclude that some of the informational genes have been involved in HGT events; this is especially true for some of the amino acid–tRNA ligases [8]. Once we reach this point it is no longer possible to argue that biochemically complex processes such as protein synthesis are too complicated to have their genes involved in HGT events, a position that was held at least up until 1998. In fact, Woese [9] suggested that there existed in very primitive cells a less functionally constrained protein synthesis machinery that permitted some HGT events involving these components, thereby accounting for the few exceptions. In this formulation a LUCA at least implicitly remains in the model. But evidence for the LUCA is greatly reduced, at least with respect to the number of genes found in modern genomes that can be directly traced back to the LUCA via exclusive vertical evolution. In 1982 it was automatic to assume that because a biochemical process was found in all of modern life, that process must represent evidence for the one interbreeding population of the LUCA. Now we know that many of the universal biochemical processes have moved horizontally multiple times. Thus today we have a greatly truncated LUCA compared to what we believed just a decade ago. When speculating on the nature of the LUCA it is generally accepted that it must have contained the modern universal genetic code, since that is a feature shared by all life. However, even if we accept the existence of this LUCA, there are a variety of reasons to believe that the LUCA itself was the product of an evolutionary process that employed horizontal transfer events; this is especially so with respect to the evolution of the genetic code.
It is very difficult to see how the modern genetic code could have evolved in a sequential fashion; rather, parts of the code must have evolved on separate occasions and become fused into single lineages. This problem is illustrated by the case of the lysine-tRNA ligase genes found in modern life. Life contains two different, completely nonhomologous versions of this enzyme. If the modern genetic code evolved in a sequential fashion, then we would have to imagine a lineage that carried one of the two enzymes subsequently evolving the second. This raises the question: what selective pressure could possibly account for the emergence of a second enzyme in a lineage that already had one? It is much simpler to believe that the lysine enzyme evolved independently in two different lineages, which then fused to give rise to the ancestor of modern life. This is not a radical idea. Of course, if HGT is common to life after the time of the LUCA, then it seems not unreasonable to assume that it was common to life before the LUCA. At this point we come to the following model for the evolution of life if we try to preserve the LUCA. We have multiple lineages of pre-LUCA life that are linked
together by HGT events into a netted or reticulate evolutionary pattern. This leads to the LUCA. The LUCA diversifies into its many modern lineages, and then these lineages are again reticulated. We end up with a topological model that looks like an hourglass: a net above that bottlenecks to the LUCA, which then diversifies and yields a net below. At this point the principle of parsimony should kick in. Why encumber our model with this bottleneck? It is no longer necessary; it is now an extra assumption. There is another reason that we should jettison the LUCA. This has to do with the finding that many of the universal genes, including a number that make up the genetic code, appear to be younger than the major clades of life. That is, we can be reasonably sure that life forms resembling Archaea, Bacteria, and some kind of primitive eukaryote existed before 1.5 and likely before 2 billion years ago. However, parts of the genetic code are younger than that. The simplest explanation is that the genetic code continued to evolve after modern life diversified. If so, then the only reasonable explanation is that these younger members of the genetic code must have achieved their current universal distribution via HGT events. These unexpectedly young genes are young by virtue of their having experienced less divergence than would be expected from certain assumptions of the molecular clock (a computational problem; see the Distance Discrepancy section below). In addition, these young genes often display unusual phylogenetic topologies that are observed as star phylogenies (another computational problem encountered in phylogenetic analysis). Once we accept that something as complex as the genetic code can evolve and spread by HGT events, it strongly suggests that a gene encoding any function could do so as well. There are deep ideological reasons for believing in a LUCA that explain the reluctance of many to abandon it.
In fact this reason is built directly into the most basic model of modern biology, the tree of life. The only figure in Darwin's Origin of Species happens to be a tree, which inevitably maps back to a single trunk. Indeed, the algorithms used in phylogenetic analysis can only find a single trunk, which, of course, is how they are designed. All practicing biologists are aware of the limitations of phylogenetic modeling with its built-in assumptions, but nevertheless these assumptions do cause confusion. Consider, for example, the confusion that commonly surrounds mitochondrial Eve. Isn't it a common misperception to think that all of human life can be mapped back to a single woman? In fact, all we can say is that the only surviving remnant of that distant ancestor is her mitochondrial genome; it is extremely unlikely that any of her other genes survive in any human population. Because of the phenomena of sexual reproduction and recombination we share genes with multiple ancestors, with no need to
hypothesize any individual ancestor from whom we have descended. The same reasoning should apply to the evolution of all life: because of the phenomenon of horizontal gene transfer, we share genes with multiple ancestors with no need to hypothesize individual species from whom we have descended [10].

PHYLOGENETIC CONGRUENCY TEST
Though this is considered the most rigorous method for establishing the occurrence of HGT events, and has been the most frequently employed, it remains very difficult to estimate a level of confidence in the resulting findings. This situation has not improved significantly since I last reviewed this topic [11]. The problem lies in the fact that phylogenetic trees are Steiner trees, and hence finding the minimal-length tree is NP-complete. This means that for large numbers of taxa it is impossible to exhaustively compare alternative topologies and to judge the significance of any differences between them. This is not to say that there has been little progress on developing new algorithms for searching for phylogenetic trees, just that deciding between competing topologies remains highly problematic. Exact solutions to 4- and 5-taxa trees are possible, and there has been some use of 4-taxa trees to determine if two different gene trees, from the same set of taxa, are significantly different. This problem is not too difficult if we simply select an ortholog from four different species and ask if the resulting gene trees are consistent with our expectation based on the underlying species phylogeny. I have performed quartet analysis where a simple t-test was used to assess the significance of the two trees. In this approach, the number of uniquely shared characters was computed for each of the three unique 4-taxa trees [12]. Zhaxybayeva and Gogarten [13] have applied maximum likelihood and Bayesian probabilities to the 4-taxa problem, which gives a more rigorous solution than that provided by the simple t-test. However, the 4-taxa comparison is susceptible to a major artifact, as is any test based upon phylogenetic congruency. One must be wary of the long-branch attraction problem, which arises when the evolutionary rates among the different lineages are highly variable [14].
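To make the quartet approach concrete, here is a minimal sketch (the function and the toy sequences are my own illustration, not the actual procedure of [12]) of the site-counting step: for four aligned sequences, each column in which one pair of taxa uniquely shares a state counts as support for one of the three possible unrooted quartet topologies.

```python
# Hypothetical sketch of quartet support counting. For taxa A-D, the
# three unrooted topologies are AB|CD, AC|BD, and AD|BC; a column
# supports a topology when its paired taxa share one state and the
# remaining two taxa share a different state.
def quartet_support(seqs):
    a, b, c, d = (seqs[t] for t in sorted(seqs))
    splits = {"AB|CD": (a, b, c, d), "AC|BD": (a, c, b, d), "AD|BC": (a, d, b, c)}
    support = {}
    for name, (w, x, y, z) in splits.items():
        support[name] = sum(1 for i in range(len(a))
                            if w[i] == x[i] and y[i] == z[i] and w[i] != y[i])
    return support

# Toy alignment in which taxa A+B and C+D are clearly paired
seqs = {"A": "ACGTACGTAA", "B": "ACGTACGTAA",
        "C": "TGCATGCAGG", "D": "TGCATGCAGG"}
print(quartet_support(seqs))  # AB|CD receives all 10 informative sites
```

A t-test (or the likelihood methods of [13]) would then ask whether the winning count differs significantly from the count for the topology expected from the species phylogeny.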
Long branches attract because of a higher chance that they share unique characters through homoplasy rather than homology. In small data sets this can be very difficult to assess. Highly unequal rates can be identified provided we have good outgroup taxa against which we can perform a relative rate test [15] and thereby directly assess whether or not the rates of evolution in the various lineages are comparable. If this is known, then we can proceed with the 4-taxa test. There is a potential statistical bias in the 4-taxa test, namely, we may identify potential incongruities after examining a gene tree containing
a very large number of taxa. For example, let us say that we have a gene tree consisting of 50 taxa and one of the taxa seems significantly displaced. We can pick out the aberrant taxon, compare it to three other selected taxa, and perform a test on these four taxa. Let us say that the resulting 4-taxa gene tree is shorter than the one expected from known species relationships and that the difference has a significance of P (either based on the t-test or on maximum likelihood). The problem we now have is that our quartet was selected from a much larger data set. What is the correct value of P? Would it be P times 50? Or P times 5,527,200 (the number of ordered quartets that can be drawn from 50 taxa)? Or a value somewhere in between? This is not a simple problem.

Gene and Genome Mosaics
Aside from the phylogenetic congruency test, the finding of mosaics has provided the greatest impetus for the acceptance of HGT, especially in bacterial evolution. Indeed, the finding that different strains of E. coli are mosaics of each other led directly to the rejection of the clonal model of E. coli populations [16]. Mosaic is sometimes used as a synonym for horizontal gene transfer, but the term also implies a specific analytical process, illustrated in figure 9.2. Let us consider homologous genomic regions from two different species, designated D for the donor and R for the recipient. These regions were derived from a common ancestor but have diverged in their primary sequence. One common horizontal gene transfer event is the movement of a DNA segment from D into the recipient R followed by a double recombination event (or possibly a gene conversion) to produce a strain that is
Figure 9.2 Scheme for the formation of a mosaic. Mosaics are created by a simple double recombination event between two homologous, but diverged, DNA sequences designated here as the donor (D sequences a–d) and the recipient (R sequences a′-d′) to give rise to the mosaic hybrid (Hy). The two crossover points are at b/b′ and c′/c; these points are also referred to as the novel junctions.
a hybrid of D and R. Recombination between two homologous but diverged sequences has been termed "homeologous" recombination [16], while "homologous" recombination involves two identical DNA sequences. There is no question that homeologous recombination occurs. There are well-documented examples from laboratory studies [17]. It also likely occurs naturally. Some of the more striking examples involve important pathogenicity genes found in bacterial and viral human pathogens. An early example involved a penicillin resistance gene found in Streptococcus viridans [18]. In this example, sequence R (the sensitive S. viridans), sequence D (a resistant S. pneumoniae), and the hybrid sequence (the resistant S. viridans) were all available. Thus a straightforward parsimony argument suffices to reconstruct a pathway analogous to the one in figure 9.2. In addition, mosaic patterns appear to be common among viruses; it is precisely this type of recombination event that has contributed to much of the variation seen with HIV [19]. A different kind of recombination (called reassortment) has led to the creation of novel human influenza virus strains. Indeed, much of the world is now waiting to see whether such an event will occur with the agent responsible for the current Southeast Asian "bird flu," which could set off a pandemic among humans. It is of interest to map the exact recombination crossover points (i.e., b/b′ and c′/c in figure 9.2). The mosaic problem can be solved using the phylogenetic congruency test, where different regions of the mosaic are compared to one another. But if this is done, some sense of where the crossover points are located is needed. Rarely can the exact "novel junctions" be identified, but a target range can be found. A few authors [20,21] have derived statistical tests to help judge the significance of a presumed mosaic, especially with respect to locating inferred crossover points.
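The windowed comparison implied here can be sketched as follows (a minimal illustration with assumed window sizes and synthetic sequences, not one of the published tests): scan the suspected hybrid against candidate donor and recipient sequences, call each window for the parent it matches better, and take switches between "D-like" and "R-like" runs as brackets on the novel junctions.

```python
# Illustrative sliding-window scan for mosaic structure. The hybrid is
# compared to a candidate donor (D) and recipient (R); each window is
# called for the parent it matches better, and boundaries between
# runs of calls approximate the crossover points of figure 9.2.
def window_identity(x, y, start, size):
    return sum(x[i] == y[i] for i in range(start, start + size)) / size

def mosaic_scan(hybrid, donor, recipient, window=100, step=25):
    calls = []
    for start in range(0, len(hybrid) - window + 1, step):
        d_id = window_identity(hybrid, donor, start, window)
        r_id = window_identity(hybrid, recipient, start, window)
        calls.append((start, "D" if d_id > r_id else "R"))
    # a change of call between consecutive windows marks a candidate junction
    junctions = [s for (s, c), (_, p) in zip(calls[1:], calls[:-1]) if c != p]
    return calls, junctions

# Synthetic example: recipient backbone with a donor segment at 100-200
recipient = "AT" * 150
donor = "GC" * 150
hybrid = recipient[:100] + donor[100:200] + recipient[200:]
calls, junctions = mosaic_scan(hybrid, donor, recipient)
print(junctions)  # window starts bracketing the true junctions at 100 and 200
```

Real data would of course require a divergence model and a significance test of the kind derived in [20,21], rather than raw identity.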
Conceptually there is an overlap between this problem and the haplotype-mapping problem [22–24], for the simple reason that haplotypes are linkage groups that are defined by recombination units. Because the recombination events that produced the pathogenic hybrids described above occurred within the past few decades, there were few or no subsequent point mutations that could erase the pattern shown in figure 9.2. These examples are so clear that there is little difficulty in identifying the mosaic pattern in the hybrids. However, there are also situations where mosaic patterns of evolution have left an imprint, but where we have only partial and/or degraded data sets. As one possibility, consider a suspected hybrid mosaic in a larger set of homologs where we are missing the putative donor and recipient. Can we identify the mosaic? In principle, this can be accomplished by using the larger data set to establish the expected amount of divergence for each region. Based on my own experience from examining large numbers of aligned sequences, the possible occurrence of mosaics is
not rare. The problem is that, lacking a possible donor sequence (D), we cannot conclude that the aberrant sequence is due to horizontal gene transfer, since there are other mutational mechanisms that can produce an apparent increase in sequence divergence. Nevertheless, it is interesting to identify such regions, and there are currently no automated procedures for doing so. We can also imagine a situation where an ancient homeologous recombination event occurred, and we encounter highly diverged descendants of the hybrid and donor and/or recipient. At what point can we still identify the mosaic before its pattern is lost in the evolutionary noise? This is a problem that goes beyond the single issue of horizontal gene transfer. We can imagine homeologous recombination events occurring between paralogs within a genome that result in a protein with a novel function. Modern proteins are certainly the result of fusions and rearrangements of more primitive proteins, and there is considerable interest in reconstructing the pathways by which proteins evolved [25]. It seems likely that events similar to those in figure 9.2 contributed to the evolution of modern proteins. This is a computational problem worthy of continued investigation.

Distance Discrepancy
Distance discrepancy as a means of detecting HGT events remains a potentially powerful but as yet underutilized tool. Accurate determinations of molecular distance have the potential to answer questions about the importance of HGT in evolutionary history in situations where the phylogenetic congruency test is too insensitive. The potential advantage of this tool can be seen most clearly in hypotheses on the importance of HGT in the evolutionary history of higher eukaryotes, especially multicellular plants and metazoans. The earliest speculations concerning HGT repeatedly mentioned its explanatory power for a number of phenomena that had puzzled biologists for over a century [1,2,26]. These include major episodes of evolution (especially the emergence of novel structures) that occurred simultaneously and over short periods of time; events such as the Precambrian radiation and the eutherian radiation have been cited. In addition, widespread parallelism among closely related lineages recurs throughout the fossil record. If these speculations are correct, it means that HGT events among, for example, the metazoans occur more frequently between closely related lineages than among more distantly related lineages. The phylogenetic congruency test detects movement of genes between highly unrelated lineages, but since it compares topology, it is relatively insensitive to movement between close relatives. In principle, such events will be seen through temporal discrepancies without necessarily giving rise to incongruent phylogenetic topologies (reviewed in [27]).
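The temporal logic can be made concrete with a toy calculation (the clock rate, dates, and gene distances below are assumed values for illustration, not data from this chapter):

```python
# Toy temporal-discrepancy check. Under a calibrated clock, a pairwise
# distance implies a divergence time; a gene whose implied time is much
# younger than the known species split is an HGT candidate even if its
# tree topology matches the species tree.
def implied_time_mya(distance, rate_per_site_per_my):
    # two lineages diverge, so distance accumulates at twice the rate
    return distance / (2.0 * rate_per_site_per_my)

rate = 1e-3          # replacements per site per million years (assumed)
species_split = 540  # MYA, e.g., the metazoan radiation

for gene, dist in [("geneA", 1.08), ("geneB", 0.80)]:
    t = implied_time_mya(dist, rate)
    call = "HGT candidate" if t < 0.8 * species_split else "consistent"
    print(gene, round(t), call)
```

With these assumed numbers, geneA implies a divergence near the species split while geneB implies a divergence some 140 million years too young; in practice the cutoff would come from the error of the clock calibration, not a fixed fraction.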
For example, let us consider a very real possibility concerning metazoan evolution. If the radiation occurred 540 MYA (million years ago), could we detect movement of a gene between the lineages leading to two modern phyla that occurred 400 MYA? Developing tools that can attack this problem requires good molecular clocks, good calibration points, and accurate species diversification times. At present this remains outside our ability to resolve, but there is no reason in principle that good statistical tools cannot be developed to solve it. In what follows I will deal primarily with protein distances, as opposed to nucleotide distances, simply because some of the more interesting problems, such as the metazoan radiation, eukaryotic diversification, and vertebrate divergences, occurred over a time period that exceeds the resolving power obtained from an analysis of neutral DNA evolution. We are forced to look at proteins whose molecular clocks have been slowed by functional constraint. Formally, a molecular distance is the number of mutations (or the number of amino acid replacements) that have occurred since the separation of two genes. A distance measure can be extremely useful because of the existence of molecular clocks, which means there is a possibility of determining a time of separation from the distance. In crude terms, we can infer evidence of horizontal gene transfer if the time of divergence of two genes significantly deviates from the time of divergence of the two lineages [27]. The use of molecular distances to infer horizontal gene transfer events has not received as much attention as the phylogenetic congruency test or deviations in gene composition (see below). It is, however, the belief of this author that many of the more interesting developments in horizontal gene transfer in the near future will emerge from distance studies.

Is There a Molecular Clock?
One of the reasons that distance measurements have received minimal notice is a residue of distrust in the notion of the molecular clock. Thirty years ago, proponents of the molecular clock argued not only that there was a stochastic clock but also that replacement or substitution rates in different lineages were the same. This latter point is certainly not correct. We now know that substitutions per bp per year vary between different lineages [15]. This fact does not, however, mean that a molecular clock is not operating within a given lineage or that such a clock cannot be calibrated. The relative rate test is again important to ensure that the rates of evolution in the respective lineages are comparable [15]. There are two distance approaches that I would like to consider here: one is what I have called the distance matrix rate test and the other is the use of protein distance ratios. Both of these have the potential to reveal distance discrepancies that may uncover horizontal gene transfer events. Distance Matrix Rate Test (DMR). This test allows one to compare the rate of divergence of a protein from a large number of species
to the overall rate of evolution of the species being compared. Basically, one plots the whole distance matrix for the gene under consideration against a "standard" distance matrix from the same set of species that presumably represents the genomes of those species. The standard can be the average distance of all of the shared proteins across the genome [28] or it can be a representative gene that is believed to characterize the genome [29]. The advantage of this test is that we need not assume that the rate of evolution in the different lineages is the same (no need for a constant molecular clock), nor do we need to know the phylogenetic topology. A further advantage of the DMR test is that we can estimate a confidence level in the discrepancy between a protein distance and the standard. The approach that I took is shown in figure 9.3
Figure 9.3 Example of a distance matrix rate (DMR) test. Complete distance matrices for both rps11 and rpl14 proteins were computed from the same set of taxa. A distance between taxon X and taxon Y for the rps11 protein is plotted against the distance between taxon X and taxon Y for the rpl14 protein. There are twenty different taxa selected from Bacteria, Eukarya, and Archaea. The location of some representative taxon pairs is at the top of the figure. Ecoli, Escherichia coli; haemin, Haemophilus influenzae; Cain, Caenorhabditis elegans; drome, Drosophila melanogaster; Bacsu, Bacillus subtilis; syny3, Synechococcus; Yeast, Saccharomyces cerevisiae; and metja, Methanococcus jannaschii.
(from Syvanen [29]). The distance matrix for the ribosomal protein rpl14 is plotted against that of rps11. The same set of species is present in both. The species include representatives of Bacteria, Archaea, and Eukaryotes. The dotted lines give an expected 95% confidence level where it is assumed that divergence of the two proteins from the last common ancestor is neutral (hence it is assumed that the rate is determined simply by the genome-wide mutation rate) and, further, that the amount of functional constraint acting against divergence for each protein is the same in the different lineages. There is no need to assume constant clocks or to know phylogenetic relationships. The two ribosomal proteins were chosen in this example because they were expected to have a very low chance of being involved in HGT events. The linear fit of the data, with few points lying outside of the 95% confidence level, supports this assumption. The occurrence of HGT events can be inferred when data points for one of the proteins fall significantly outside of the linear regression. This has been exploited by Novichkov et al. [28], who have successfully used this method to identify very likely HGT events. It is difficult to provide a rigorous interpretation for the error analysis. Normally, with N independent comparisons, one could simply calculate the correlation coefficient and covariance in order to test whether a time-dependent stochastic process relates the two variables. This will not work here because the distance matrix values are autocorrelated; that is, N independent sequences will yield N(N – 1)/2 distances. Thus, any calculated covariance will be artificially low. Therefore, to assess whether or not the replacement process is random, the 95% confidence intervals are given in figure 9.3. In addition to this difficulty, many of the lineages may share histories over a considerable period of time, which means there are not really even N independent sequences.
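A back-of-the-envelope version of this screen can be sketched as follows (my own illustration; the least-squares fit and the 2-standard-deviation cutoff are heuristic stand-ins for the published confidence bands, precisely because of the autocorrelation just described):

```python
# Heuristic sketch of a DMR-style screen: regress the distances for the
# gene of interest against the "standard" distances for the same taxon
# pairs, then flag pairs whose residual exceeds k residual standard
# deviations. The cutoff is heuristic, not a rigorous 95% test.
def dmr_outliers(std, gene, k=2.0):
    n = len(std)
    mx, my = sum(std) / n, sum(gene) / n
    sxx = sum((x - mx) ** 2 for x in std)
    sxy = sum((x - mx) * (y - my) for x, y in zip(std, gene))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [y - (slope * x + intercept) for x, y in zip(std, gene)]
    sd = (sum(r * r for r in resid) / (n - 2)) ** 0.5
    return [i for i, r in enumerate(resid) if abs(r) > k * sd]

# Synthetic pairwise distances: the last taxon pair is aberrant
std = [0.1 * i for i in range(1, 11)]
gene = [0.1 * i for i in range(1, 10)] + [3.0]
print(dmr_outliers(std, gene))  # flags index 9
```

Note that a gross outlier also drags the fitted line toward itself, so robust regression would be preferable in a real analysis.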
This is a problem that also bedevils the relative rate test. Novichkov et al. [28] have used a different approach to analyze error, but the same uncertainties found in [29] apply to their approach as well. Protein Distance Ratios. I have recently developed another approach that exploits aberrant distances to identify genes possibly involved in HGT events. In this approach, distances based upon different proteins within the same genomes are compared. The difficulty with comparing two different proteins is that different levels of functional constraint act upon each, so without knowing the amount of functional constraint in advance, a direct comparison of distances will tell us little. However, ratios of protein distances can overcome this problem. This approach requires the sequences of protein orthologs from three different taxa and is illustrated in figure 9.4. In the current example, distances between proteins from the tunicate Ciona, humans, and the yeast Saccharomyces cerevisiae are used. As described in figure 9.4, the distances between
Figure 9.4 Schematic illustrating the distance ratios metric. Evolutionary time is determined by the time of divergence between two lineages. In this example we have two time points: the human (Hu)/Ciona (Ci) divergence and the S. cerevisiae (Ye)/chordate divergence. If distance were some measure of the number of amino acid replacements per amino acid site, deduced from pairwise comparisons, that is linear with time of divergence, then we would have linear molecular clocks. Protein a and protein b are two different orthologs that differ from each other in their level of functional constraint.
Ciona and human orthologs are normalized to the distances of those proteins to the yeast protein. If the following three assumptions are met (that the evolution of each protein within the three lineages reflects the evolutionary history of the underlying species, that the protein distance measure is linear with time of divergence, and that the amount of functional constraint acting upon each protein is the same within the three lineages), then the ratio of the distances of the two proteins should be the same; that is, from figure 9.4, a/a′ = b/b′. Hence deviations in these ratios offer evidence of atypical protein evolution. This exercise was carried out using the complete protein sequence databases for the Ciona, human, and yeast genomes. A group of about 200 protein sequences that appeared to be orthologous (i.e., a single copy within each genome) was chosen. One advantage of using protein ratios in this way is that we can directly test the assumption that the distance measure is linear with time. If a protein distance saturates with time, then we would expect the ratio of distances to increase with increasing protein distance. Figure 9.5 shows the result of plotting the ratio of distances against absolute distance. There are a number of competing methods for measuring protein distances available.
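The ratio screen itself can be sketched in a few lines (the gene names, distances, and 2-standard-deviation cutoff below are all illustrative, not the actual data set): compute log(Dhc/Dyc) for each putative ortholog set and flag genes far from the bulk of the distribution.

```python
import math

# Sketch of the distance-ratio screen: log(Dhc/Dyc) should cluster
# around a constant if both distances track the same species history;
# genes whose log ratio deviates strongly are HGT (or paralogy) candidates.
def ratio_outliers(dhc, dyc, k=2.0):
    logs = {g: math.log(dhc[g] / dyc[g]) for g in dhc}
    n = len(logs)
    mean = sum(logs.values()) / n
    sd = (sum((v - mean) ** 2 for v in logs.values()) / (n - 1)) ** 0.5
    return sorted(g for g, v in logs.items() if abs(v - mean) > k * sd)

# Most genes sit near the genome-wide ratio (~0.44); one does not
dhc = {"g1": 0.44, "g2": 0.45, "g3": 0.43, "g4": 0.44,
       "g5": 0.46, "g6": 0.42, "gX": 1.8}
dyc = {g: 1.0 for g in dhc}
print(ratio_outliers(dhc, dyc))  # ['gX']
```

As the text cautions below, a flagged gene may simply be a hidden paralog, so a screen of this kind identifies candidates rather than confirmed HGT events.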
Figure 9.5 Distance ratios are constant with increasing distance. The set of 185 proteins shown was chosen in the following manner. The 6800 protein sequences from the S. cerevisiae genome were used as queries in BLAST searches against a database that contained the proteins deduced from the human, Ciona, C. elegans, Drosophila, Arabidopsis, and Oryza sativa genomes. This yielded a list of over a thousand yeast proteins such that at least one copy of the homolog was found in each of the other genomes. This list was reduced by removing the obvious gene families containing paralogs, thereby enriching it for orthologous sets. The chosen genes were aligned, all indels were removed, and then the JTT distances were determined. The corrected distance between human and Ciona (Dhc) was divided by the yeast–Ciona distance (Dyc). The logs of the ratios are normally distributed about a constant mean, independent of Dyc values.
In the current study I tested five of them (data not shown). Shown in figure 9.5 is the one that gave the best result, that is, the slope closest to zero. This distance is based on the JTT matrix [30], though the Dayhoff PAM measures also worked reasonably well. Simple distances, Poisson-corrected distances, and Kimura protein distances gave significantly nonzero slopes. Figure 9.6 shows the data from the ordinate in figure 9.5 plotted as a simple histogram. The log of the distance ratios is used because the direct ratios are not normally distributed, though there are reasons to believe they should be lognormal. In figure 9.6 we can see that the data roughly approximate a normal curve but that there are many outlying points. In fact the distribution is highly overdispersed, with about 10% of the sequences lying outside of a normal distribution. The genes represented by these outliers are candidates for possible horizontal gene transfer events. This study remains unfinished at this point. Though we have a tool here that allows identification of atypical sequences, once they are identified there is a difficulty with concluding that an HGT event
Figure 9.6 Distance ratios distribution. The distance for each of the proteins between human and Ciona (Dhc) was divided by the distance between the two metazoa and S. cerevisiae (Dyc) and the distribution of the log of this ratio is shown.
had occurred. This is due to the very real possibility that the three proteins are not an orthologous set but that a paralog is present. Many of the early claims of HGT (reviewed in [11]) turned out to be artifactual because of the inclusion of paralogous sequences. The possibility of selecting a paralog seems high in a screen that looks at hundreds of genes, as described here. Further analysis of these examples will be required before we can conclude that the highly dispersed points in figure 9.6 are evidence for HGT events. In summary, distance ratios provide another metric that allows us to detect HGT events. Though this work remains incomplete, there are two internal controls that offer encouragement. One is that the log-ratio versus distance plot in figure 9.5 is flat, supporting the notion that the distances upon which the ratios are based are linear with time. The other is that the peak of the curve in figure 9.6 lies at 0.44, which places the Ciona/human (two chordates) divergence at about 440 million years ago (assuming a fungal–metazoan divergence of 1 BYA). This is after the Cambrian radiation of 540 MYA, as it likely should be.

ATYPICAL NUCLEOTIDE COMPOSITION (THE ANC ISLANDS)
In the previous three sections, the procedures described for identifying HGT events rely on comparisons of orthologs from multiple species. This section describes an approach in which deviations in nucleotide composition are used to identify foreign genes [31,32]. The genes identified by this criterion fall into a number of different categories, including phages, plasmids, insertion sequences, and other mobile
genetic elements. In addition, we can include a class of genomic regions that have been called, in different contexts, pathogenicity islands, plasticity islands, or accessory gene regions. These classes range from clear parasites and selfish genes to genes that can be considered accessory, and even some that are essential. The notion of accessory genes has been around for many years, either implicitly or explicitly [33]. They are DNA elements that are often associated with mobile genetic elements. They allow the organism to exploit some highly specialized ecological niche that may require only temporary unions of genes that can be easily lost when no longer useful. Mobility becomes important to long-term gene survival when selection for an associated trait is lost. Because accessory genes include a large variety of genetic elements, I will collectively refer to them as the atypical nucleotide composition (ANC) genes, or ANC islands when multiple genes are clustered. When we refer to atypical compositions, we mean deviations from within a given genome, since different species have their own unique compositions. Different evolutionary forces are responsible for these unique genome compositions. It is clear that major bacterial assemblages vary in their GC content, from a low of 25% to as high as 75%. Even for bacteria with similar GC content, synonymous codon use can differ. There appear to be biases in nearest-neighbor frequencies (dinucleotide frequencies) and even biases in oligonucleotide frequencies up to length eight [34,35]. Biases in longer oligonucleotides are probably caused by mechanisms other than just GC content and codon bias. Selection against certain restriction nuclease sites or other DNA metabolism factors could lead different bacteria to acquire different localized compositions.
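A composition signature in the spirit of the dinucleotide biases cited above [34,35] can be computed in a few lines (the sequences and the use of a simple average difference as an "atypicality" score are illustrative assumptions):

```python
from itertools import product

# Dinucleotide relative abundance, rho(XY) = f(XY) / (f(X) * f(Y)):
# values far from 1 mark over- or under-represented dinucleotides.
def dinuc_signature(seq):
    n = len(seq)
    base = {b: seq.count(b) / n for b in "ACGT"}
    pairs = [seq[i:i + 2] for i in range(n - 1)]
    sig = {}
    for x, y in product("ACGT", repeat=2):
        f_xy = pairs.count(x + y) / len(pairs)
        denom = base[x] * base[y]
        sig[x + y] = f_xy / denom if denom else 0.0
    return sig

# Average absolute signature difference: a crude score for how atypical
# a gene's composition is relative to its host genome
def signature_delta(a, b):
    sa, sb = dinuc_signature(a), dinuc_signature(b)
    return sum(abs(sa[k] - sb[k]) for k in sa) / 16

host = "ACGT" * 50     # stand-in for the host genome
at_gene = "AATT" * 50  # an AT-rich candidate ANC gene
print(signature_delta(host, host), signature_delta(host, at_gene))
```

A genome-wide screen would compute such signatures in windows or per gene and flag the roughly 15% of genes whose scores deviate from the genomic norm.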
Thus, for a variety of reasons, the distribution of oligonucleotides will be skewed from random, and that skewing can provide a unique signature for a given bacterium. In most bacteria, the composition signature that uniquely characterizes the genome applies to about 85% of the genes on average, while the remainder seem to have compositions governed by different rules [36,37]. It was originally proposed that these atypical genes could be the result of horizontal gene transfer, with the atypical gene carrying the signature of a foreign donor genome. According to this postulate, the donor is so remotely related to the recipient that its genome composition is significantly different. In fact, in recent years this explanation has become so widely accepted that the finding of atypical regions has itself been considered a measure of HGT. A strong prediction of the remote donor hypothesis for ANC is the existence of a donor species whose genome composition at the time of HGT reflects this atypical composition. Because such remote donor species have not been found, the hypothesis remains on shaky ground. In this review I suggest a revision of the remote donor HGT hypothesis for atypical nucleotide composition and will go into a detailed
discussion of the need to consider alternatives. I will continue to maintain that the atypical genes identified in the above studies have most likely been involved in HGT events. The revision that I propose is that the atypical composition observed does not reflect the genome composition of the donor species, but rather reflects the property of gene mobility per se. A number of lines of evidence can be offered to support the gene mobility hypothesis for the ANC islands. Let us begin with the sequence composition of phages, prophages, plasmids, and insertion sequences that are invariably atypical when compared to the hosts that carry them. This, of course, has been attributed to their likely residence in remote donors with different nucleotide compositions. The problem, however, is that the sequences of these particular elements have been the subject of extensive study for over three decades without any hint of remotely related donors. For example, let us consider the lambdoid phages that have been extensively surveyed but so far have been encountered only within a narrow group of gamma-proteobacteria, the so-called enterics [38,39]. The lambdoid phages are a closely related group of phages that have similar genome organizations and can give rise to hybrids between different members of the group. The enteric bacteria as a group, however, do not differ in their genome composition. For example, E. coli and Salmonella have virtually identical GC compositions and codon usages [40]. A similar pattern is seen with plasmids and insertion sequences. That is, with a few spectacular exceptions, plasmids and insertion sequences seem to have limited host ranges (i.e., they are found within a group that generally have the same genome compositions) but their sequences themselves show the atypical composition. 
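A minimal version of the screens that identify such atypical genes can be sketched using GC content alone. The cutoff and gene sequences below are invented for illustration; real analyses such as those of refs. [36,37] also use codon bias and higher-order composition.

```python
def gc_content(seq):
    """Fraction of G+C in a DNA sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def flag_atypical(genes, genome_gc, cutoff=0.15):
    """Return names of genes whose GC content deviates from the genome
    average by more than `cutoff` (an illustrative threshold only)."""
    return [name for name, seq in genes.items()
            if abs(gc_content(seq) - genome_gc) > cutoff]
```

Applied genome-wide with a realistic cutoff, a screen of this shape is what yields the roughly 15% of atypical genes discussed above.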
Thus there is no independent evidence for remote species donors; rather, the alternative seems to be the case: the apparent donor species most likely have the same genome composition as the recipients. There are other puzzling patterns in the nature of these genes with atypical composition that are difficult to reconcile with the idea of remote species donors. There are many genes with aberrant GC content found in the chromosome of E. coli. It turns out that over 90% of those that deviate significantly from E. coli's GC content of 0.5 are greatly enriched for AT. It was difficult to explain this asymmetry as being due to the donor species distribution. In 1994 I suggested an alternative explanation [11]: a bias in favor of AT is consistent with a gene that, over the course of its life cycle, was frequently subjected to homeologous recombination events. This predicts a family of mobile genes whose integration in new recipient chromosomes relies on homeologous recombination [17], as opposed to the site-specific integration events associated with prophages, inserted plasmids, and insertion sequences. Evidence that such a class of genes is found in the ANC islands has recently been described. Koski et al. [41] argued against the use of
Problems Associated with Horizontal Gene Transfer
atypical compositions as an indicator of horizontally transferred genes after a detailed analysis of the types of genes that Lawrence and Ochman [42] had identified as HGT candidates because of ANC. One of the apparent problems that Koski et al. noted was that 135 of the 747 genes classified as horizontally transferred in E. coli turned out to have positional orthologs in the bacterium Salmonella. If both E. coli and Salmonella have the same positional ortholog, then this strongly suggests that the common ancestor of these two strains also carried this ortholog. However, it appears that these two bacteria diverged about 100 million years ago, whereas Lawrence and Ochman [42] had already estimated that after 100 million years of vertical evolution a gene's composition would be expected to have converged to that of the host genome. They call this amelioration. Thus the apparent dilemma: common ancestry suggests that these 135 genes were present in the lines leading to E. coli and Salmonella 100 million years ago, but sequence composition suggests that these genes were introduced into E. coli considerably more recently. The explanation that I am offering for this class of genes is not, as Koski et al. [41] argued, that these genes are uninvolved in HGT events, but rather that they are mobile genes that can be lost but also can reestablish themselves in enteric chromosomes via homeologous recombination events. Such a mechanism of transfer would necessarily preserve positional orthology and, according to the current hypothesis, such recombination events select for the atypical nucleotide composition. The phenomenon of homeologous recombination between homologous genetic regions of E. coli and Salmonella has been well documented. Mutant strains of Salmonella missing the methyl-directed mismatch repair pathway recombine with an E. coli donor [17]. It is probably no coincidence that wild strains of E.
coli are frequently encountered that have lost this mismatch repair pathway even though such mutants have a serious growth disadvantage when compared to wild-type strains. Evidence that recombination events have occurred naturally between these two lineages is seen in the apparent concerted evolution of the elongation factor Tu loci, tufA and tufB [40]. Another fact that supports frequent recombination events between diverged but homologous sequences is the finding that one very important class of genes found on one of E. coli's ANC islands has a highly mosaic pattern [43]. As mentioned above, there is so far no direct evidence for the remote species origins of the ANC islands. One recent study illustrates the size of this problem. Nakamura et al. [36] surveyed the genomes of 116 bacteria for nucleotide composition and found that on average 14% of the genes had atypical compositions, represented by 1357 gene clusters. They then probed their 116-genome database with these 1357 gene clusters to see if the codon bias in one of them would show a match to any of the other 115 genomes and thereby
possibly locate the donor. The power of the technique was shown in that they did in fact identify one donor–recipient pair among those 1357 gene clusters. This pair, however, turned out to be an artifact; the strain of Neisseria meningitidis that had been sequenced happened to carry erythromycin-resistant Staphylococcus plasmid genes that had been inadvertently cloned into the Neisseria species [44], and this was the gene that Nakamura et al. [36] identified. Thus, from this fairly large survey, we can conclude that so far no clear case of naturally occurring remote species HGT has been identified using genome difference analysis. The reason for presenting detailed arguments for the mobility hypothesis as an explanation for the ANC islands is that it leads to testable predictions that can be examined by future computations. In addition, it gives rise to certain expectations with respect to the chemical properties of DNA and the molecular mechanisms of genetic recombination, but this goes well beyond the scope of this review. Horizontal gene transfer is an indisputable fact. In general terms, genes have been divided into two classes on the basis of transfer frequency: informational genes and operational genes. The accessory genes found on the ANC islands should be included as a third category:

Informational → Operational → Accessory

Moving from left to right, the likelihood of the genes being involved in horizontal gene transfer seems to increase dramatically.

REFERENCES

1. Ochiai, K., T. Yamanaka, K. Kimura and O. Sawada. Inheritance of drug resistance, and its transfer between Shigella strains and between Shigella and E. coli strains. Nihon Iji Shimpo, 1861:34, 1959, in Japanese. 2. Akiba, T., K. Koyama, Y. Ishiki, S. Kimura and T. Fukushima. On the mechanism of the development of multiple-drug-resistant clones of Shigella. Japanese Journal of Microbiology, 4:219–27,1960. 3. Went, F. W. Parallel evolution. Taxon, 20:197–226,1971. 4. Reanney, D.
Extrachromosomal elements as possible agents of adaptation and development. Bacteriological Reviews, 40:552–90,1976. 5. Golding, G. B. and R. S. Gupta. Protein-based phylogenies support a chimeric origin for the eukaryotic genome. Molecular Biology and Evolution, 12:1–6,1995. 6. Gogarten, J. P., W. F. Doolittle and J. G. Lawrence. Prokaryotic evolution in light of gene transfer. Molecular Biology and Evolution, 19:2226–38,2002. 7. Doolittle, W. F. Lateral genomics. Trends in Cell Biology, 9:M5–8,1999. 8. Brown, J. R. and W. F. Doolittle. Gene descent, duplication, and horizontal transfer in the evolution of glutamyl- and glutaminyl-tRNA synthetases. Journal of Molecular Evolution, 49:485–95,1999. 9. Woese, C. The universal ancestor. Proceedings of the National Academy of Sciences USA, 95:6854–9,1998.
10. Zhaxybayeva, O. and J. P. Gogarten. Cladogenesis, coalescence and the evolution of the three domains of life. Trends in Genetics, 20:291,2004. 11. Syvanen, M. Horizontal gene transfer: evidence and possible consequences. Annual Review of Genetics, 28:237–61,1994.* 12. Syvanen, M. On the occurrence of horizontal gene transfer among an arbitrarily chosen group of 26 genes. Journal of Molecular Evolution, 54:258–66,2002.* 13. Zhaxybayeva, O. and J. P. Gogarten. An improved probability mapping approach to assess genome mosaicism. BMC Genomics, 4:37,2003. 14. Felsenstein, J. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution, 17:368–76,1981. 15. Li, W.-H., M. Tanimura and P. M. Sharp. An evaluation of the molecular clock hypothesis using mammalian DNA sequences. Journal of Molecular Evolution, 25:330–42,1987. 16. Milkman, R. Transduction, restriction and recombination patterns in Escherichia coli. Genetics, 139:35–43,1995. 17. Rayssiguier, C., D. S. Thaler and M. Radman. The barrier to recombination between Escherichia coli and Salmonella typhimurium is disrupted in mismatch-repair mutants. Nature, 342:396–401,1989. 18. Dowson, C. G., A. Hutchison, N. Woodford, A. P. Johnson, R. C. George and B. G. Spratt. Penicillin-resistant viridans streptococci have obtained altered penicillin-binding protein genes from penicillin-resistant strains of Streptococcus pneumoniae. Proceedings of the National Academy of Sciences USA, 87:5858–62,1990. 19. Korber, B., C. Brander, B. F. Haynes, R. Koup, J. P. Moore, B. D. Walker and D. I. Watkins, http://hiv-web.lanl.gov/content/hiv-db/CRFs/CRFs.html at http://hiv-web.lanl.gov/content/immunology/. 20. Maynard-Smith, J. Analyzing the mosaic structure of genes. Journal of Molecular Evolution, 34:126–9,1992. 21. Kececioglu, J. and D. Gusfield. Reconstructing a history of recombinations from a set of sequences. In Proceedings of the 5th ACM-SIAM Symposium on Discrete Algorithms, pp. 471–80, 1994. 22. Gabriel, S. B., S. F.
Schaffner, H. Nguyen, J. M. Moore, J. Roy, B. Blumenstiel, J. Higgins, M. DeFelice, A. Lochner, M. Faggart, S. N. Liu-Cordero, C. Rotimi, A. Adeyemo, R. Cooper, R. Ward, E. S. Lander, M. J. Daly and D. Altshuler. The structure of haplotype blocks in the human genome. Science, 296:2225–9,2002. 23. Eskin, E., E. Halperin and R. M. Karp. Efficient reconstruction of haplotype structure via perfect phylogeny. Journal of Bioinformatics and Computational Biology, 1:1–20,2003. 24. Gusfield, D., S. Eddhu and C. Langley. Optimal, efficient reconstruction of phylogenetic networks with constrained recombination. Journal of Bioinformatics and Computational Biology, 2:173–213,2004. 25. Tani, T., Y. Takahashi, S. Urushiyama and Y. Oshima. In M. Go and P. Schimmel (Eds.), Tracing Biological Evolution in Protein and Gene Structures. Elsevier, Amsterdam, 1995. 26. Syvanen, M. Cross-species gene transfer: implications for a new theory of evolution. Journal of Theoretical Biology, 112:333–43,1985.* 27. Syvanen, M. Molecular clocks and evolutionary relationships: possible distortions due to horizontal gene flow. Journal of Molecular Evolution, 26:16–23,1987.*
28. Novichkov, P. S., M. V. Omelchenko, M. S. Gelfand, A. A. Mironov, Y. I. Wolf and E. V. Koonin. Genome-wide molecular clock and horizontal gene transfer in bacterial evolution. Journal of Bacteriology, 186:6575–85,2004. 29. Syvanen, M. Rates of ribosomal RNA evolution are uniquely accelerated in eukaryotes. Journal of Molecular Evolution, 55:85–91,2002.* 30. Jones, D. T., W. R. Taylor and J. M. Thornton. The rapid generation of mutation data matrices from protein sequences. Computer Applications in the Biosciences, 8:275–82,1992. 31. Lawrence, J. G. and H. Ochman. Molecular archaeology of the Escherichia coli genome. Proceedings of the National Academy of Sciences USA, 95:9413–17,1998. 32. Garcia-Vallve, S., E. Guzman, M. A. Montero and A. Romeu. HGT-DB: a database of putative horizontally transferred genes in prokaryotic complete genomes. Nucleic Acids Research, 31:187–9,2003. 33. Court, D. and A. Oppenheim. Phage lambda's accessory genes. In R. Hendrix, J. Roberts, F. Stahl and R. Weisberg (Eds.), Lambda II. Cold Spring Harbor Press, Cold Spring Harbor, N.Y., 1983. 34. Pride, D. T., R. J. Meinersmann, T. M. Wassenaar and M. J. Blaser. Evolutionary implications of microbial genome tetranucleotide frequency biases. Genome Research, 13:145–58,2003. 35. Tsirigos, A. and I. Rigoutsos. A new computational method for the detection of horizontal gene transfer events. Nucleic Acids Research, 33:922–33,2005. 36. Nakamura, Y., T. Itoh, H. Matsuda and T. Gojobori. Biased biological functions of horizontally transferred genes in prokaryotic genomes. Nature Genetics, 36:760–6,2004. 37. Ochman, H., J. G. Lawrence and E. A. Groisman. Lateral gene transfer and the nature of bacterial innovation. Nature, 405:299–304,2000. 38. Hendrix, R. W. Bacteriophage genomics. Current Opinion in Microbiology, 6:506–11,2003. 39. Hendrix, R. W. Bacteriophage lambda: the genetic neighborhood. In R. Calendar (Ed.), The Bacteriophages (pp. 409–47). Oxford University Press, New York, 2006. 40.
Sharp, P. M. Determinants of DNA sequence divergence between Escherichia coli and Salmonella typhimurium: codon usage, map position, and concerted evolution. Journal of Molecular Evolution, 33:23–33,1991. 41. Koski, L. B., R. A. Morton and G. B. Golding. Codon bias and base composition are poor indicators of horizontally transferred genes. Molecular Biology and Evolution, 18:404–12,2001. 42. Lawrence, J. G. and H. Ochman. Amelioration of bacterial genomes: rates of change and exchange. Journal of Molecular Evolution, 44:383–97,1997. 43. Denamur, E., G. Lecointre, P. Darlu, O. Tenaillon, C. Acquaviva, C. Sayada, I. Sunjevaric, R. Rothstein, J. Elion, F. Taddei, M. Radman and I. Matic. Evolutionary implications of the frequent horizontal transfer of mismatch repair genes. Cell, 103:711–21,2000. 44. van Passel, M., A. Bart, Y. Pannekoek and A. van der Ende. Phylogenetic validation of horizontal gene transfer? Nature Genetics, 36:1028, 2004.
*These papers are available in pdf format at http://www.vme.net/hgt/
10 Noncoding RNA and RNA Regulatory Networks in the Systems Biology of Animals John S. Mattick
DNA is best regarded as biological software. It encodes the analog components of cells (proteins, RNAs, and their derived products) as well as the regulatory architecture that controls their division and assembly into multicellular organisms. The human genome contains about 3 billion base pairs of a quaternary code [1,2] that programs the growth and development of a complex organism of around 10^14 cells, with a wide variety of bones, muscles, and specialized organs, such as the brain, lung, liver, and kidneys, all with very specific anatomical features and often asymmetric shapes. These structures are not just field approximations, but reflect in their detail the particular genomic program and its polymorphic idiosyncrasies that we inherit from our parents and their ancestors, as evidenced by the individual contours of our faces and bodies. The cellular ontogeny of the nematode worm C. elegans is precise and largely invariant [3], and it is likely that the same generally holds for all animals, with the possible exception of the clonal expansion of some cell types under different physiological or immunological conditions. Controlling the growth patterns, positional identity, organization, and differentiation of cells in the assembly of a complex organism is the major challenge of genomic programming. Where and how is the information that controls developmental trajectories encoded? What determines the number, position, and size of our limbs and digits, the placement and particular shape of different muscles, bones, and organs, the timing of developmental transitions including puberty and aging, and the differences in these characteristics between and within humans and other species? Although only 1.2% of the human genome encodes protein, a large fraction of it is transcribed.
Indeed, around 98% of the transcriptional output in humans and other mammals consists of non-protein-coding RNAs (ncRNA) from the introns of protein-coding genes and the exons and introns of non-protein-coding genes [4,5], including many that are antisense to or overlapping protein-coding genes [6–12]. Until recently the noncoding RNA fraction was considered mainly useless with the
exception of the common infrastructural RNAs involved in protein synthesis, transport, and splicing. Introns have long been regarded as evolutionary debris, with intronic RNA assumed to be simply degraded after splicing excision, and the increasing number of non-protein-coding transcripts being detected in mammalian cells has been suggested, at least by some, to be largely "transcriptional noise" [13]. However, a significant proportion of ncRNAs appear to be stable in eukaryotic cells. For example, some excised introns have half-lives comparable with mRNA and are even exported from the nucleus to the cytoplasm [14,15]. Whole chromosome tiling chip arrays have shown that the range of detectable ncRNAs in human cells is much greater than can be accounted for by mRNAs [11,12,16–19], and that there appear to be roughly equal numbers of protein-coding and noncoding transcripts regulated by common transcription factors in the human genome [17,18]. Similar data have been reported in Drosophila [20]. Apart from transfer RNAs (tRNAs) and spliceosomal small nuclear RNAs (snRNAs), which are housekeeping RNAs involved in mRNA splicing and translation, there are several functionally and structurally distinct classes of short RNAs in eukaryotic cells. In most if not all cases their function is based on recognition of RNA or DNA target sequences by specific base pairing. Because of this feature, even short RNAs contain sufficient information to specify individual targets in the genome and the transcriptome, in a much more compact and energy-efficient manner than proteins, which may have been a necessary adaptation to address the accelerating regulatory requirements of more complex organisms [21,22] and may have been crucial to their evolution and development [23,24] (see below).

THE NUMBERS OF PROTEIN-CODING GENES DO NOT SCALE WITH DEVELOPMENTAL COMPLEXITY
By extension from bacteria, it is widely believed that most genes encode proteins and that most genetic information, including regulatory information, is transacted by proteins. Bacterial genomes are largely composed of densely packed protein-coding sequences, with variation between species and strains being achieved by changes in the encoded proteome, which are often quite dramatic. For example, over 20% of genes are different between benign and pathogenic strains of E. coli [25]. In contrast, the proteome is relatively stable in complex organisms [26]. The vast majority of protein-coding genes are common among mammals [27], despite their wide phenotypic diversity (think of whales, dogs, mice, cats, elephants, giraffes, lemurs, etc.), most (albeit with variations [28]) are common to all vertebrates [29,30], and many are shared with invertebrates [31]. These include not only those encoding proteins that are central to cell division and cell biology [32] but also those that control
development [3,33]. However, the numbers of protein-coding genes appear to bear little relationship to developmental complexity—the simple nematode worm with only 1000 cells has almost 50% more protein-coding genes (~19,500) [34,35] than the far more complex insects (~13,500) [36–38], and a similar number to the even more complex vertebrates, including mammals (~20,000–25,000) [2,27,29,30,39]. In contrast, the relative amount of noncoding DNA [21,40] and the transcription of noncoding RNA [41] do generally scale with complexity, in spite of variations in the extent of chromosomal ploidy or the amount of repetitive DNA in some lineages. At least some of the differences between species, and the incongruities between gene number and developmental complexity, may be explained by variations in the sequences and range of encoded proteins and their isoforms, including splice variants [42,43]. The other important source, and probably the major source, of phenotypic differences (particularly between closely related individuals and species) lies in the variation of the regulatory architecture that controls the patterns of gene expression during differentiation and development. In humans, protein-coding sequences occupy only 1.2% of the genome, with another approximately 4% apparently conserved under purifying selection [27], a figure that sits uncomfortably with current conceptions of protein-based gene regulation, a matter that we will return to later in this chapter. If the 5% that is apparently conserved (albeit with variations) between mammals is all that is required (although of course by definition some instructions must be different between species), then the ontogeny and biology of a mammal can theoretically be specified in about 150 Mb of genomic sequence, quite a feat.

INFORMATION REQUIREMENTS OF FUNCTIONALLY INTEGRATED COMPLEX SYSTEMS
Can proteins interacting with cis-regulatory sequences in RNA and DNA provide sufficient information to program the precise deployment of 10^14 organized differentiated cells in a human? Perhaps, but probably not. The usual assumption is that the increased complexity of multicellular organisms is enabled by the explosive possibilities afforded by combinatorics of regulatory protein interactions [44,45], a conclusion that fits comfortably with the current view that genes are generally synonymous with proteins. However, the problem is not to generate "complexity" by combinatorics, but rather to control it to produce meaningful, that is, functional and competitive, outcomes. This requires considerable information, especially regulatory information. Indeed, the best (albeit somewhat abstract) definition of relative complexity is the minimum amount of information that is required to specify the ontogeny and operation of the object or system [46]. Moreover a
general feature of integrated complex systems is that the proportion of the nodes, links, and information that must be devoted to regulation (communication and control) increases as the system grows in size, which is in turn ultimately limited by the physical basis of the regulatory architecture [22]. Reciprocally, it is well established that rapid increases in complexity occur as a consequence of the introduction of more advanced control technologies and embedded networking, most of which is invisible to the naïve observer [47]. In line with theoretical predictions, the number of regulatory proteins in prokaryotes increases as a quadratic function of gene number [21,48]. If this empirical relationship is correct, there must be a limit to the size of genomes whose functions are regulated solely by proteins, and indeed the largest bacterial genome sizes show a close correspondence with the point at which the predicted number of required additional regulatory proteins outweighs the number of new nonregulatory proteins [48–50], and no prokaryotes achieve anything other than rudimentary multicellular organization—a good example is "fruiting body" formation in Myxococcus xanthus, whose genome is among the largest in bacteria [51]. This also implies that the more complex eukaryotes must have breached this limit by a different method, perhaps by developing a digital system of regulatory controls, which is the generic solution to the increasing regulatory requirements of complex systems. One can build and operate a bicycle or an early automobile using analog controls, but can one imagine a modern jet being possible using such systems? Have we made the mistake of tacitly assuming that the majority of genetic information in complex organisms is transacted by proteins, by extrapolation from simple organisms?
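The quadratic scaling argument can be made concrete with a toy calculation. The coefficient `a` below is invented purely for illustration and is not the empirical fit of refs. [21,48]; the point is the shape of the curve, not the numbers.

```python
def regulators_needed(n_genes, a=5e-5):
    """Toy quadratic scaling: total regulators R ~ a * N**2."""
    return a * n_genes ** 2

def marginal_regulators(n_genes, a=5e-5):
    """dR/dN = 2*a*N: regulatory proteins required per additional gene.
    A protein-only regulatory regime becomes self-defeating roughly where
    this reaches 1, i.e. every new gene demands a whole new regulator."""
    return 2 * a * n_genes
```

With the illustrative a = 5e-5, the marginal cost reaches one regulator per new gene at N = 10,000 genes, which is of the order of the largest bacterial gene complements.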
Moreover, being conditioned by experience to an analog cellular world from foundation studies in biochemistry and molecular biology (which were undertaken prior to modern digital computing), have we made the mistake of expecting that complex organisms only use analog controls, being totally unaware (until recently) of more sophisticated digital information transmission systems and therefore unreceptive to their existence? A good analogy is to imagine how engineers in 1960 might approach an understanding of a modern commercial or military jet. Like biochemists, they would attempt to identify the physical components of the aircraft, determine their structure–function relationships, and randomly or purposefully inactivate them to see what happens, which may or may not be informative. In doing so they would have had trouble detecting, let alone understanding, the software encoded within the computing systems, and would have been very likely to dismiss as dross the hundreds of miles of gossamer-like optical fibers hidden under the floor, simply because they had no conception of their function. To discern such systems on the microscale of a cell is even more difficult.
THE HIDDEN LAYER OF NONCODING RNA IN COMPLEX ORGANISMS
Is there another major level of information transaction in the genome of complex organisms? Ostensibly yes. As noted already, the vast majority (~98%) of the transcriptional output of the human genome is non-protein-coding RNA [4], derived from introns of protein-coding genes and from what appears to be a large but as yet poorly characterized set of non-protein-coding transcripts, many of which are also spliced, and at least some of which have important biological functions. There is firm evidence from annotated "known genes," mRNAs, and spliced ESTs [41], as well as other bioinformatic analyses [52], that most of the genome is transcribed, which gives two clear choices: either the genome is replete with useless transcription or many of these RNAs have an unexpected but important function. There is no formal reason not to consider the latter possibility, nor any observations that would deny it. Indeed it would be surprising if evolution had not explored these RNAs as regulatory molecules [23,53], a function originally predicted in principle for RNA by Jacob and Monod in 1961 [54] and by Britten and Davidson in 1969 to explain the presence of large amounts of RNA in the nucleus of metazoan cells [55]. Although a limited number of small regulatory RNAs occur in prokaryotes [56–65], the large-scale opportunity for RNA to evolve trans-acting regulatory functions almost certainly arose as a consequence of the separation of transcription from translation in eukaryotes, and the subsequent colonization of eukaryotic protein-coding genes by introns [23].
This was followed by the devolution of the intron-enclosed catalytic RNA sequences required for splicing into spliceosomal RNAs capable of acting in trans and the concomitant recruitment of accessory proteins into the spliceosome, which reduced the internal sequence restrictions on introns, improved the efficiency of their excision after transcription, and allowed them to drift and to explore new evolutionary space with minimal constraint [23]. Under such conditions, it would not be unreasonable to suggest that any random mutational change that produced RNA sequences that could meaningfully contact some other part of the genetic network (presumably in the main DNA and other RNAs, but also proteins) would have positive selection value and be likely to be retained. Importantly, such evolution would have occurred in parallel with protein expression, without directly interfering with it—the essential difference being that the evolving molecules were RNAs that were separable from their associated protein-coding sequences by splicing (and from each other by postsplicing processing pathways), and therefore could act multiply and independently in trans. This process would have accelerated as it became more established, leading ultimately to a radical change in the genetic operating system of the organism. This is not to suggest that all introns produce
RNA signals, but that an increasing number will have gained such functions as evolution explored higher levels of biological complexity. That is, the expansion of introns into eukaryotic genes initiated a new round of molecular evolution, which allowed RNA to emerge as a regulatory agent in its own right. This would have occurred in conjunction with the coevolution of an expanded RNA and protein infrastructure for RNA processing and signaling [66–68], leading ultimately to a more sophisticated regulatory system that was the prerequisite for the appearance of developmentally complex multicellular organisms [23,24]. This line of logic, which is strongly supported by a wide range of evidence, suggests therefore that noncoding RNAs derived from introns and from non-protein-coding transcripts (see below) are not evolutionary debris or transcriptional noise, but rather have evolved to produce functional signals (efference signals or eRNAs [4], to borrow a term from neurobiology [69,70]), in parallel with protein-coding sequences. It also leads to a number of subsidiary predictions, among which are: (i) that the majority of the genome is functional and under evolutionary selection (see below); (ii) that many genes will have evolved only to produce RNA, as higher order regulators in this network, which is supported by the emerging realization that at least half and perhaps as many as three-quarters of all transcripts in mammals do not encode proteins [5]; (iii) that these RNAs largely transmit information via primary sequence (RNA:RNA and RNA:DNA) interactions, as a kind of bit string or zip code that addresses receptive targets [21]; and (iv) that the actions taken upon receipt of these signals will be determined by the secondary structures embedded in these interactions that are in turn recognized by different types of proteins and protein domains. Each of these predictions, and the evidence that supports them, is discussed in more detail below.
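The "zip code" capacity of short sequences in prediction (iii) is easy to quantify with a back-of-envelope calculation (my own, which ignores repeats and mismatch tolerance): the shortest tag able in principle to address every position in a genome is the smallest n with 4^n at least the genome size.

```python
import math

def min_tag_length(genome_size):
    """Smallest n with 4**n >= genome_size: the shortest oligonucleotide
    that could uniquely address every position in a genome of that size."""
    return math.ceil(math.log(genome_size, 4))
```

For a 3 Gb mammalian genome this gives 16 nt, comfortably within the length of a single miRNA, let alone a longer ncRNA, which is why even very short RNAs can in principle carry genome-wide addressing information.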
This system, in principle and in practice, would constitute a quasi-digital feedforward control network which sets the trajectories of differentiation and development in concert with analog protein-based sensor and signaling pathways that relay contextual cues to supplement and correct stochastic errors in the endogenously encoded program. That is, the RNA signals that are produced in parallel with protein-coding sequences would not only act to coordinate complex suites of gene expression, but would also act to set the future transcriptional state of cells and the posttranscriptional control of gene expression in a programmed manner [21,24,53,71].

RNA-BASED REGULATORY NETWORKS IN COMPLEX ORGANISMS
A variety of evidence now points to the fact that RNA regulatory systems are well developed in the higher organisms, and probably dominate their control architecture (figure 10.1). This evidence has been summarized in detail elsewhere [4,5,21,23,24,53], but the salient
Noncoding RNA and RNA Regulatory Networks
275
Figure 10.1 Diagrammatic outline of the complex RNA and protein networks involved in gene regulation. Noncoding RNAs regulate genome structure and gene expression at many levels. miRNAs, siRNAs, snoRNAs and possibly other small RNAs are involved in the regulation of translation, mRNA stability, and chromatin structure, as well as self-regulation (dashed lines) and possibly also the control of transcription and splicing (dashed lines with question mark). (Figure adapted from ref. [93], with permission from Human Molecular Genetics.)
points are: (i) that most if not all of the complex genetic phenomena in the eukaryotes, including transcriptional and posttranscriptional gene silencing, RNA interference (RNAi), imprinting, and methylation, are RNA-directed [24,72–79]; (ii) that large numbers of noncoding RNAs are being detected in animal cells by cDNA cloning [80] and by genome tiling arrays [11,12,16–20]; and (iii) that new classes of small regulatory RNAs, notably microRNAs (miRNAs) and small interfering RNAs (siRNAs), are being discovered that control a wide variety of developmental processes in animals and plants, including stem cell and embryonic development, brain development, hematopoietic differentiation, patterning, insulin secretion, adipocyte differentiation, growth,
apoptosis, leaf and floral development, and stress responses, among others [67,81–93]. Databases of known and predicted miRNAs [94–97] and siRNAs [98,99] have recently been published, and almost 1000 miRNAs have now been identified with over 5000 predicted target mRNAs in humans [100,101]. miRNAs and siRNAs are involved in complex networks controlling the expression of other RNAs and proteins, including Hox proteins, in developmental pathways [93,102,103], and alterations in the expression of miRNAs are observed in aberrant development, that is, various cancers [93,104–110]. As predicted [23,24], many miRNAs are derived from the introns of protein-coding genes and non-protein-coding genes [111–113], as are most snoRNAs (small nucleolar RNAs) that modify other RNAs [114–116]. It is likely that those small RNAs that have been discovered to date are only the tip of a very big iceberg, and that there are tens or hundreds of thousands of such RNAs which direct a variety of regulatory functions, including epigenetic modification of chromatin, transcriptional selection, and alternative splicing [93], all of which are central to the complex molecular ontogeny of multicellular organisms. While it has been shown that miRNAs can target other RNAs for destruction or for translational repression, depending on the degree of sequence match to the target, knocking out the pathways involved in their production has also been shown to affect chromosomal dynamics [117–123]. The introduction of small RNAs targeted against specific sequences has also been shown to alter gene-specific methylation patterns and induce gene silencing [119,124,125]. Indeed there is considerable evidence that the processes of gene silencing, methylation, and imprinting are all connected to RNA signaling [5,24,72,74,122,126], and that these pathways underlie certain types of genetic disease [127]. 
At least some intronic RNAs are stably detectable in cells and trafficked to different subcellular locations [14,15]. Not only are there large numbers of RNA-binding and RNA-processing proteins encoded in eukaryotic genomes, but some so-called transcription factors, including zinc finger proteins and Y-box proteins, seem to have high affinity for RNA or RNA–DNA hybrids [128,129]. It is also likely that the large numbers of nucleic acid- and chromatin-binding proteins and domains therein whose specificity is unknown are in fact recognizing various types of RNA–RNA and RNA–DNA complexes [21,130,131] (see below).

KNOWN NONCODING RNAs
Apart from miRNAs and snoRNAs, of which over 200 each have been identified in humans and mice [94–97], there are now almost 300 functional noncoding RNAs that have been identified in mammals, many of which have been shown or implied to be involved in developmental processes, neural function, and disease, including various types of cancer,
ataxia, autism, and schizophrenia (for summaries and databases see [5,96,97]). There are many thousands more showing up in EST collections and cDNA libraries, at least some of which are produced in tissue-specific patterns [80,132]. In addition, all well-studied gene loci, including β-globin and several imprinted regions in mouse and human, as well as the bithorax-abdominal-A/B developmental control locus in Drosophila, have a majority of noncoding transcripts (including miRNA precursors) that are developmentally regulated [133–141]. Many of these RNAs are antisense to known protein-coding transcripts, and recent evidence suggests that there are thousands of such transcripts in the mammalian genome [6,7,80]. There are also approximately 20,000 pseudogenes in the human genome [142]. These are generally thought to be dead genes (nonfunctional relics of past duplications or reverse transcriptase-mediated insertions) because they contain nonsense and frameshift mutations, but one has been recently shown to regulate the stability of its homologous protein-coding mRNA [143]. This suggests that the significance of pseudogenes may have been misinterpreted, and that both sense and antisense regulatory RNAs may be important features of gene regulation by RNA networks in animals. Moreover, at least some long-distance transcriptional "enhancers," including the archetypal globin "locus control region" (LCR), which are thought to act in cis (by chromosomal looping) to control regional gene expression, have been recently shown to be transcribed, in the case of the globin LCR specifically in erythroid cells [133,144]. RNA signaling is involved in the genetic process termed transinduction [133] and probably also in a form of allelic cross-talk termed transvection [24,144], which appears to involve trans-acting signals (presumably RNAs) that act locally.

SEQUENCE-SPECIFIC CONTROL OF CHROMATIN MODIFICATION, TRANSCRIPTION, AND SPLICING
One of the hallmarks of eukaryotic differentiation and development is the epigenetic modification of chromatin structure at different loci in different cells. There must either be an army of sequence-specific DNA-binding proteins that carry out these modifications, which is not the case—there are only a limited number of DNA and histone modifying enzymes (methylases, acetylases, deacetylases, etc.) [145]—or these enzymes must be directed to their sites of action by some other signal, most logically sequence-specific RNAs [77,122,123]. Such signals would also potentially solve the conundrum of how to select from the huge number of transcription factor binding sites that exist in the genome, and indeed transcription of some genes has been shown to require trans-acting noncoding RNAs [146–149]. Trans-acting guide RNAs may also regulate
alternative splicing [150], which is currently mainly thought to be controlled by the combinatoric effects of protein "splicing factors" but is not at all well understood in these terms [5,151,152]. Consistent with the possibility that site-specific trans-acting RNAs are involved, the nucleotide sequences around alternative splice sites are often highly conserved between species [153,154], and it has been shown by many studies that splicing patterns may be easily altered in cultured cells and in whole animals by introducing small antisense RNAs [155–159]. It is not a big leap of faith to conclude that this is also likely to happen naturally, and that the reason that it has not yet been demonstrated to be the case is because of the sheer complexity of the numbers and variety of such signals in regulatory networks in different cells. RNA control of the translation or turnover (destruction) of specific mRNAs by miRNAs and siRNAs has already been demonstrated (see above). If cells are awash with small RNA signals processed from longer precursors, which (as such) have short half-lives, identification of these signals will be difficult, although bioinformatics using appropriate search algorithms may provide a means to do so, as they have to a limited extent in the identification of miRNAs [160–166] (see below). It would be predicted that the majority of RNA signaling would occur on a sequence-specific basis, targeting complementary sequences in relevant RNA and DNA targets, as compact encoding of sequence recognition is the main advantage of RNA over proteins in network communication and the key requirement for a digital control system. These RNA signals may obey different sequence recognition rules for different types of RNA–RNA and RNA–DNA interactions, as well as containing various embedded secondary structural signals that recruit particular proteins to the resulting complexes (see below). 
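The compactness argument above lends itself to a back-of-envelope calculation. A minimal sketch (the ~3 Gb genome size is the only input; the random-sequence assumption is of course an idealization):

```python
import math

# Sketch of the "compact addressing" argument: in a random genome of
# G nucleotides, a specific n-mer is expected to occur about G / 4**n
# times, so average uniqueness requires 4**n > G, i.e. n > log4(G).
def min_address_length(genome_size):
    """Smallest n-mer length expected to be unique in a random genome."""
    return math.ceil(math.log(genome_size, 4))

human = 3_000_000_000  # ~3 Gb haploid human genome (idealized as random)
print(min_address_length(human))  # 16: even a ~16 nt RNA can, in principle, address a unique site
```

This is why short RNAs such as miRNAs and snoRNA guide elements can carry genome-scale specificity that a protein of comparable information content cannot.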
In this context it is worth pointing out that one of the few ways that RNA might address genomic DNA is by triplex formation. There is strong evidence for the existence of triplex and other complex structures in human chromatin [167,168] and that sequences capable of forming triplexes occur at higher than expected frequency in the human genome [169,170]. It has also been shown that triplex formation can regulate transcription [171,172] and inhibit tumor growth [173].

GENOMIC IMMIGRATION AND EXPANSION
Genomes of complex organisms are thought to evolve and acquire new functions by duplication (in part or full) and by the accumulation of sequences laterally transferred from other organisms. A large proportion of the human genome is comprised of transposons, again (like introns) largely thought to be accumulated evolutionary dross with little if any active genetic function. However, recent evidence suggests that transposon sequences have also evolved in situ with the rest of the genome
and contribute to its dynamics, including the control of development [174–176]. In this context it is interesting that another form of RNA modification, adenosine-to-inosine (A-I) editing, can alter both splicing patterns and protein sequence and is essential for mammalian development [177,178]. In humans, A-I editing has recently been shown to be much more widespread than was previously thought, and to occur primarily in Alu elements (which are primate-specific) in noncoding RNA sequences in both protein-coding and noncoding transcripts [179–182]. A-I editing is particularly active in the brain [177], and aberrant editing has been associated with certain cancers and a range of abnormal behaviors including epilepsy and depression [183–186]. This raises the intriguing possibility that RNA modification is central to higher order neural function, learning, and memory, which must by definition be molecularly stored but not hard-wired, and that the colonization of the primate lineage by Alu elements may have provided an expanded platform for the more rapid evolution of this capacity. Moreover, the pathways for A-I editing and RNA interference have been found to intersect [187,188], hinting at the enormous complexity of RNA regulatory interactions and regulatory networks in epigenetic control of gene expression, development, and complex characteristics.

MUCH OF THE GENOME MAY BE FUNCTIONAL AND UNDER SELECTION
If the majority of the RNA that is transcribed from the human genome is functional in regulatory networks, there are several logical predictions that must hold, not the least of which is that a large proportion of the genome (not just the protein-coding sequences and their immediate cis-acting regulatory elements) must be under evolutionary selection, both positive and negative. As noted above, 5% of the human genome appears to be under purifying selection, as judged by comparison to the level of conservation between human and mouse genomes in ancient repeats [27], which are assumed (in general) to be drifting neutrally, as are most intronic and intergenic sequences. However, a recent reanalysis of these data by mathematical approaches that take account of compounded changes indicates that the real figure may be much higher, at least 10% (300 Mb) [189]. This suggests that the amount of regulatory information under negative selection is (at least) an order of magnitude greater than that encoding proteins, a figure that is surprising if one accepts a protein-centric orthodoxy but not if the amount of RNA regulatory information is scaling faster than analog functional information as organisms become more complex. It is also difficult if not impossible to reconcile with current paradigms for gene regulation by cis-acting protein regulators. 300 Mb of DNA sequence represents
an awful lot of potential regulatory protein-binding sites, an average of ~12–15 Kb per protein-coding gene. These figures also do not take into account those sequences that must be under positive evolutionary selection for adaptive radiation (phenotypic divergence) [190], nor the possibility that the reference sequences in the rest of the genome that are presumed to be evolving neutrally may not be. Selection forces can be both strong and weak. It has been known for some time that the rate of nucleotide substitution varies across the genome, in intronic and intergenic sequences as well as at the third base of redundant codons (see, e.g., [191,192]), which is commonly thought to be the consequence of local variations in the rate of de novo mutation [193] (although this has not been actually demonstrated to occur), since the alternative—that the observed variation in nucleotide substitution frequencies is the consequence of different selection pressures on different sequences in the genome—would require acceptance of the implication that the majority rather than the minority of the genome may be genetically functional. However, evidence supporting the latter conclusion is mounting. Recent comparative studies of particular genomic regions in multiple species show that noncoding regions that are not conserved between some species are conserved between others, and that noncoding sequences that are conserved between any two or more species occupy a much greater proportion of these regions than is evident from pairwise comparisons alone [194–197]. Presumably those that are conserved between (some or all) species reflect some shared regulatory feature of the common biology of the species in question, whereas those that have diverged at these places in particular species are (or at least may well be) due either to drift under weak selection or due to positive selection associated with some aspect of their divergent biology. 
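The per-gene figure quoted above is simple arithmetic, worth making explicit. A sketch (the 300 Mb under-selection estimate is from the text; the 20,000–25,000 gene count is an assumed round range for the human genome):

```python
# Back-of-envelope: if ~300 Mb of the genome is under purifying selection
# and there are ~20,000-25,000 protein-coding genes, the selected sequence
# averages ~12-15 kb per gene -- far more than typical promoter/enhancer
# models would predict.
SELECTED_BP = 300_000_000            # >=10% of ~3 Gb, per the reanalysis cited
GENE_COUNT_RANGE = (20_000, 25_000)  # assumed approximate human gene count

per_gene_kb = [SELECTED_BP / n / 1000 for n in GENE_COUNT_RANGE]
print(per_gene_kb)  # [15.0, 12.0] kb of selected sequence per gene
```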
Thus a substantial proportion of the studied genomic regions, which are presumably typical, shows evidence of conservation when multiple species are compared, which must reflect either local variation in selection history in different lineages or lineage-specific changes in the underlying de novo mutation frequency at orthologous positions. The latter is surely the less likely alternative.

PLASTIC AND FROZEN REGULATORY SEQUENCES
It is also worth pointing out in the context of this discussion that a number of well-validated functional noncoding RNAs, such as XIST involved in X chromosome dosage compensation, can evolve very quickly [198–200], even though their function is conserved, presumably because the primary sequence constraints are less severe than (for example) those encoding proteins [93]. That is, rapid sequence divergence of noncoding sequences may not necessarily mean that they are nonfunctional or even that their function has significantly changed, since such regulatory
sequences would be expected to be quite plastic; in this case it is the sequence interactions and/or the structures of the resulting complexes that matter, rather than the sequences themselves [93]. This is a quite different situation to that for sequences encoding proteins that have tight structure–function relationships and where the amino acid sequences are important for analog structure and function. The plasticity inherent in digital RNA signaling (depending on how many contacts are involved—see below) would also allow such sequences to explore new contacts in regulatory networks, and allow organismal ontogeny and phenotype to evolve relatively quickly, which has been proposed to account for the rapid expansion of phenotypic variation of multicellular organisms into uncontested environments, such as occurred initially in the Cambrian radiation and subsequently after major extinction events [23,24]. On the other hand, the recent discovery of noncoding ultraconserved sequences that are identical between humans and rodents and that have remained essentially frozen since the divergence of mammals and birds 300 million years ago shows that there are some noncoding sequences that are far more fiercely conserved than proteins [201]. These sequences are significantly associated with genes encoding RNA-binding proteins and developmental regulators, and are presumably essential for the ontogeny of vertebrates, although their mechanism of action and reasons for their ultraconservation remain unknown. Similar sequences also occur in invertebrates (insects), albeit at shorter lengths [202]. In any case, these sequences do not fit any established model for gene regulation by proteins, but presumably reflect their involvement in critical pathways with multiple interacting components (similar to highly conserved sequences in ribosomal RNA) that restrict the opportunity for variation.

BIOINFORMATIC DISSECTION OF RNA REGULATORY NETWORKS
The complexity of the transcriptome in mammals and other complex organisms is being revealed in part by cDNA cloning [80] and, more recently, by whole chromosome or whole genome tiling arrays probed with cDNAs prepared by random priming from various RNA fractions [16–18,20], which give a more comprehensive view of the stable transcripts that exist in cells. An increasing number of these transcripts appear to be noncoding RNAs, which presumably have a regulatory function [17,18,20,80]. However, these approaches do not effectively poll smaller RNAs, such as miRNAs and snoRNAs, which need more specialized and more difficult cloning approaches. While a number of these RNAs have been identified by such approaches (see, e.g., [203–208]), the problems of cloning small RNAs and the contamination of cDNA libraries with rRNAs and other common RNA sequences have led to the conclusion that not many more miRNAs will be identified by this approach [206], particularly if it is the case (as one might expect)
that most small RNAs are present in low amounts in cell-specific patterns. Indeed there are a number of known examples of miRNAs, such as that encoded by the lsy-6 locus, which controls the asymmetry of chemosensory neurons in C. elegans [209], and that encoded by the bantam locus in Drosophila [84], that were only discovered by sensitive genetic screens, which are difficult if not impossible to carry out in mammals. In most if not all cases the function of known regulatory RNAs is based on recognition of RNA or DNA target sequences by specific base pairing, for example by snoRNAs, miRNAs, and siRNAs, which is analogous to digital signaling. Because of this feature, even short RNAs contain sufficient information to specify individual targets in the genome and the transcriptome, in a much more compact and efficient manner than proteins. Therefore, if RNA-mediated regulatory networks comprise the majority of information encoded in the genomes of complex organisms, it would be expected that the majority of this information would be transacted in a sequence-dependent manner, which should make these networks amenable to bioinformatic dissection by inter- and intragenomic comparisons. However, this will not be simple and will involve sophisticated approaches that take into account the likely complexities of such a system. As noted already, such networks are likely to be evolving quickly at the primary sequence level, even if the signaling relationships are maintained, and hence long-distance comparisons of genomic sequences between species will only be useful for identifying those sequences that are both relevant to the common biology of the species being compared and (presumably) inhibited from rapid evolution by their participation in multiple interactions or multiple pathways.
Much more informative is likely to be intragenomic analyses that look for patterns of primary sequence and secondary structure conservation to identify RNA signals and their potential targets within cellular networks. Such approaches have been used with some success to identify miRNAs and their targets, using a combination of the presence of short hairpins in the putative precursors of the former, evolutionary conservation, and the presence of matching sequences in putative mRNA targets [160–166], with particular emphasis on 3′-UTRs, since the archetypal and best-studied examples of miRNAs (such as lin-4 and let-7) exert their action through binding to UTRs [210,211]. At least some of these predicted miRNAs have been validated experimentally. Consistent with the premise that RNA regulatory sequences that are involved in multiple interactions do not evolve quickly, a recent analysis of 218 known miRNAs identified 2273 potential target genes (about 10% of all known or predicted human protein-coding genes) with one or more target sites showing 90% sequence conservation between human, mouse, and rat [166]. On the other hand, endogenous siRNAs are less conserved presumably because these RNAs and their homologous
targets can easily covary and still maintain specificity [67,93,212], which will make them difficult to identify bioinformatically, at least on the basis of evolutionary conservation. This would also apply to miRNAs and other regulatory RNAs that have one or very few targets. Similar approaches utilizing a combination of primary sequence and secondary structure prediction have also been used to identify other types of small RNAs, including snoRNAs [213–217]. However, while such approaches potentially minimize the numbers of false positives, they may (and almost certainly do) seriously underestimate the actual numbers of such RNAs, by restricting the search space to particular structures and/or particular types of signals and targets that show long-range conservation. For these and other reasons, the development of appropriately discriminatory bioinformatic approaches to identify potential RNA signaling molecules and their targets in systemic regulatory networks is a significant challenge, exemplified by the case of snoRNAs where target recognition is mediated by one or two short stretches of sequences (10–21 nt) that are antisense to the target, which for many snoRNAs ("orphan guide snoRNAs") remains unknown [115,218]. It is also possible that other types of RNA signals will have embedded mismatches or secondary structures (such as stem-loops) that interrupt the primary sequences that recognize specific targets, and that enable the specific recruitment of particular types of proteins to such complexes (see below). Secondary structure predictions for RNAs, especially if these involve long precursors, remain problematic [219–221], although new algorithms based on secondary structural parameters are being developed which appear to have the potential to identify new types of noncoding RNAs [222].
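At their core, the target searches described above reduce to scanning transcripts for short antisense matches. A minimal illustration (the let-7 sequence is the well-known C. elegans miRNA; the UTR and the seed conventions used here are simplified and hypothetical, and real pipelines add conservation and free-energy filters):

```python
# Minimal sketch of a seed-match style target search: scan a 3'-UTR for
# sites complementary to the 5' "seed" (here positions 2-8) of a miRNA.
def revcomp(rna):
    """Reverse complement of an RNA string."""
    return rna.translate(str.maketrans("AUGC", "UACG"))[::-1]

def seed_sites(mirna, utr, seed_start=1, seed_len=7):
    """Return 0-based positions in `utr` matching the seed's complement."""
    site = revcomp(mirna[seed_start:seed_start + seed_len])
    return [i for i in range(len(utr) - len(site) + 1)
            if utr[i:i + len(site)] == site]

mirna = "UGAGGUAGUAGGUUGUAUAGUU"  # C. elegans let-7
utr = "AAACUACCUCAAAA"            # hypothetical UTR with one seed match
print(seed_sites(mirna, utr))     # [3]
```

The same scan, run genome-wide and intersected with cross-species conservation, is essentially the strategy of the miRNA target predictions cited above.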
Moreover the rules for primary sequence recognition between RNA strands (for example G:U base pairs, lone pair triloops, and tetraloop hairpins) and between RNA and DNA (for example in triplexes) vary from canonical Watson–Crick base-pairing [223–227]. This will require elegant new search algorithms that not only take these RNA and higher order structural base-pairing rules into account (insofar as this may be possible), but also permit or specify various levels of gaps, mismatches, and enclosed secondary structures, without degrading signal-to-noise ratios to nonsense levels, an area of bioinformatics likely to blossom in the coming years. As always, and despite the countervailing force of anthropocentrism, it will be simplest to do such analyses first in model organisms (yeast, C. elegans, and Drosophila). These organisms not only have smaller genomes and simpler regulatory networks, but also permit bioinformatics predictions to be readily genetically tested and validated. Indeed we already have bioinformatic evidence that such networks exist in a primitive form in yeast (S. Stanley and J. S. Mattick, unpublished data).
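The non-canonical pairing rules noted above are straightforward to fold into such a matcher, at least for the G:U wobble case. A sketch (the pair table and sequences are illustrative; triplex/Hoogsteen rules, gaps, and enclosed structures are exactly the harder extensions the text anticipates):

```python
# Sketch of an antisense matcher that accepts G:U wobble pairs alongside
# Watson-Crick pairs, as a search over RNA:RNA duplexes would require.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"),
         ("G", "U"), ("U", "G")}  # wobble pairs included

def duplex_score(guide, target):
    """Fraction of paired positions with `guide` laid antiparallel on `target`."""
    hits = sum((g, t) in PAIRS for g, t in zip(guide, reversed(target)))
    return hits / len(guide)

print(duplex_score("GGAC", "GUCC"))  # 1.0: perfect Watson-Crick duplex
print(duplex_score("GGAC", "GUUC"))  # 1.0: the G:U wobble still counts as paired
```

A strict Watson–Crick matcher would score the second duplex 0.75, illustrating why relaxed pairing rules enlarge the candidate-target space so quickly.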
BIOCHEMICAL DISSECTION OF DIGITAL–ANALOG CONVERSION SYSTEMS IN RNA REGULATORY PATHWAYS
The corollary of using RNA as regulatory molecules is that such signals must be converted to analog actions upon receipt. That is, while the target specificity of the regulatory RNA is dependent upon a primary sequence interaction with its cognate target(s), the response to the receipt of this signal is dependent upon the (prior or subsequent) recruitment of appropriate proteins that then undertake the appropriate action, such as mRNA degradation or modification of chromatin. Thus, RNA signals must have two forms of embedded information, that which specifies their targets, and that which specifies the required subsequent action, assuming that (in most cases) the RNAs themselves do not have catalytic activity. Indeed this is precisely what is observed in the known examples of snoRNAs and miRNAs, which select their targets based on antisense interactions, but whose effects are exerted through the recruitment of associated complexes, that is, snoRNPs (small nucleolar ribonucleoprotein particles) [115] or the RNA-induced silencing complex (RISC) [67,93], respectively, which contain the relevant catalytic functions. Presumably the same type of recruitment of (for example) histone methylases or acetylases, or DNA methylases, occurs in the RNA-directed modification of chromatin [228–230]. Indeed it is likely that many of the protein families and protein domains that have nucleic acid- or chromatin-binding functions, but whose specificity is unknown or uncertain, are in fact recognizing different forms of RNA:RNA and RNA:DNA complexes, linking such signaling to resulting actions [228–230]. Consistent with this it has been shown that chromodomains, which are present in many different types of chromatin-binding proteins involved in epigenetic memory and the control of development, appear to recognize a combination of modified histones and RNA signals [121,123,228,231–235].
If this is generally the case, there exists a significant opportunity to dissect these pathways and to identify the relevant proteins/domains by a variety of approaches, for example affinity chromatography, using different RNA-containing complexes as ligands, followed by proteomic analysis of the bound fraction. One may also consider such approaches as chromatin immunoprecipitation [236–238], using complex-specific antibodies followed both by proteomic analysis and by interrogation of genomic tiling chips ("ChIP-on-chip"), to identify (for example) the positions of RNA-containing complexes in chromatin and their dynamic changes during developmental or physiological transitions. There are many possible variations on these themes and such experiments are currently underway in our laboratory.
CONCLUSION
Proteins are the key catalytic and structural analog components of all biological systems, but that does not mean that they necessarily transmit the majority of regulatory information in complex organisms. However, this is not to say that regulatory proteins are not essential to eukaryotic ontogeny—there is a rich literature to show that they are. Proteins are essential for reporting cellular, physiological, and environmental information, including cell–cell communication during development (such as the Notch, Hedgehog, BMP/TGF-β, Jak/STAT, and Wnt/wingless signaling pathways) [239–245], as it is impossible to assemble complex systems without such information and positional reference. A considerable portion of this information is relayed through a surprisingly small number of signaling pathways, which are highly evolutionarily conserved and used in many different cell types [241]. As discussed above, proteins are also required to transduce RNA signals into analog actions, one example being the Argonaute family of proteins which are involved in RNA interference and affect many developmental processes [66]. Nonetheless, while protein signaling and function is an essential part of cellular control systems, it seems not only plausible but highly likely that the trajectories of development are deeply embedded in the endogenous RNA signaling networks that emanate out in a programmed manner from the fertilized zygote to daughter cells as they divide and grow (see below), to produce the dizzying variety of developmental phenotypes in different mammals, birds, reptiles, fishes, insects, crustaceans, and so on, as well as many if not most of the idiosyncratic but clearly genomically encoded differences among individuals. Thus, the evolution, development, and adaptive radiation of the higher eukaryotes have likely been primarily dependent on a previously hidden network of RNA signals, whose specification and targets occupy the majority of the genome.
Importantly these RNA signals, derived both from introns of protein-coding transcripts and from the exons and introns of noncoding transcripts, provide not only the power of feedback signals that can report and efferently coordinate the expression of transcripts in the network, but also have the ability to send feedforward signals to direct subsequent transcription and gene expression in a developmental series. That is, developmental ontogeny may be primarily encoded in the RNA-based regulatory architecture of the genome and orchestrated by the unfolding patterns of interacting RNA regulatory signals that control chromatin modification, transcription, splicing, RNA editing, RNA stability, mRNA translation, and so on. These RNA-directed networks, in conjunction with cell–cell signaling, determine the timing of cell division and the pathways of differentiation, beginning with the
fertilized embryo, whose initial asymmetry (maintained in subsequent generations by cell–cell communication) sets up a cascade of consequential but largely predefined actions that are lineage-specific, with the transcriptional status and trajectory of each cell—its "ribotype" or soft-wiring [53,71]—dependent on its developmental history and physical position. If correct, this means that most current conceptions of how genes are controlled in the higher organisms are wrong, and that future research should more actively consider the likelihood of underlying RNA regulatory networks in experimental design and interpretation. This is likely to be particularly important when trying to track down the source of phenotypic trait variation, the genetic basis of complex diseases such as cancer, and the genetic contributions to complex characteristics such as behavior.

ACKNOWLEDGMENTS

I am grateful to the Australian Research Council and the Queensland State Government for their financial support. I thank Igor Makunin for critical reading of the manuscript and for assistance with the figure. I also thank the other members of my laboratory for their contributions and for many stimulating discussions.
REFERENCES

1. Lander, E. S., L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody, J. Baldwin, K. Devon, K. Dewar, M. Doyle, W. FitzHugh, R. Funke, D. Gage, K. Harris, A. Heaford, et al. Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921, 2001.
2. Venter, J. C., M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural, G. G. Sutton, H. O. Smith, M. Yandell, C. A. Evans, R. A. Holt, J. D. Gocayne, P. Amanatides, R. M. Ballew, D. H. Huson, et al. The sequence of the human genome. Science, 291(5507):1304–51, 2001.
3. Lambie, E. J. Cell proliferation and growth in C. elegans. Bioessays, 24(1):38–53, 2002.
4. Mattick, J. S. Non-coding RNAs: the architects of eukaryotic complexity. EMBO Reports, 2(11):986–91, 2001.
5. Mattick, J. S. Challenging the dogma: the hidden layer of non-protein-coding RNAs in complex organisms. Bioessays, 25(10):930–9, 2003.
6. Yelin, R., D. Dahary, R. Sorek, E. Y. Levanon, O. Goldstein, A. Shoshan, A. Diber, S. Biton, Y. Tamir, R. Khosravi, S. Nemzer, E. Pinner, S. Walach, J. Bernstein, et al. Widespread occurrence of antisense transcription in the human genome. Nature Biotechnology, 21(4):379–86, 2003.
7. Lehner, B., G. Williams, R. D. Campbell and C. M. Sanderson. Antisense transcripts in the human genome. Trends in Genetics, 18(2):63–5, 2002.
8. Lavorgna, G., D. Dahary, B. Lehner, R. Sorek, C. M. Sanderson and G. Casari. In search of antisense. Trends in Biochemical Science, 29(2):88–94, 2004.
Noncoding RNA and RNA Regulatory Networks
287
9. Dahary, D., O. Elroy-Stein and R. Sorek. Naturally occurring antisense: transcriptional leakage or real overlap? Genome Research, 15(3):364–8, 2005.
10. Kiyosawa, H., N. Mise, S. Iwase, Y. Hayashizaki and K. Abe. Disclosing hidden transcripts: mouse natural sense-antisense transcripts tend to be poly(A) negative and nuclear localized. Genome Research, 15(4):463–74, 2005.
11. Cheng, J., P. Kapranov, J. Drenkow, S. Dike, S. Brubaker, S. Patel, J. Long, D. Stern, H. Tammana, G. Helt, V. Sementchenko, A. Piccolboni, S. Bekiranov, D. K. Bailey, et al. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science, 308(5725):1149–54, 2005.
12. Kapranov, P., J. Drenkow, J. Cheng, J. Long, G. Helt, S. Dike and T. R. Gingeras. Examples of the complex architecture of the human transcriptome revealed by RACE and high density tiling arrays. Genome Research, 15(7):987–97, 2005.
13. Dennis, C. The brave new world of RNA. Nature, 418(6894):122–4, 2002.
14. Clement, J. Q., L. Qian, N. Kaplinsky and M. F. Wilkinson. The stability and fate of a spliced intron from vertebrate cells. RNA, 5(2):206–20, 1999.
15. Clement, J. Q., S. Maiti and M. F. Wilkinson. Localization and stability of introns spliced from the Pem homeobox gene. Journal of Biological Chemistry, 276(20):16919–30, 2001.
16. Kapranov, P., S. E. Cawley, J. Drenkow, S. Bekiranov, R. L. Strausberg, S. P. Fodor and T. R. Gingeras. Large-scale transcriptional activity in chromosomes 21 and 22. Science, 296(5569):916–19, 2002.
17. Cawley, S., S. Bekiranov, H. H. Ng, P. Kapranov, E. A. Sekinger, D. Kampa, A. Piccolboni, V. Sementchenko, J. Cheng, A. J. Williams, R. Wheeler, B. Wong, J. Drenkow, M. Yamanaka, et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell, 116(4):499–509, 2004.
18. Kampa, D., J. Cheng, P. Kapranov, M. Yamanaka, S. Brubaker, S. Cawley, J. Drenkow, A. Piccolboni, S. Bekiranov, G. Helt, H. Tammana and T. R. Gingeras. Novel RNAs identified from an in-depth analysis of the transcriptome of human chromosomes 21 and 22. Genome Research, 14(3):331–42, 2004.
19. Bertone, P., V. Stolc, T. E. Royce, J. S. Rozowsky, A. E. Urban, X. Zhu, J. L. Rinn, W. Tongprasit, M. Samanta, S. Weissman, M. Gerstein and M. Snyder. Global identification of human transcribed sequences with genome tiling arrays. Science, 306(5705):2242–6, 2004.
20. Stolc, V., Z. Gauhar, C. Mason, G. Halasz, M. F. van Batenburg, S. A. Rifkin, S. Hua, T. Herreman, W. Tongprasit, P. E. Barbano, H. J. Bussemaker and K. P. White. A gene expression map for the euchromatic genome of Drosophila melanogaster. Science, 306(5696):655–60, 2004.
21. Mattick, J. S. RNA regulation: a new genetics? Nature Reviews Genetics, 5(4):316–23, 2004.
22. Mattick, J. S. and M. J. Gagen. Accelerating networks. Science, 307(5711):856–8, 2005.
23. Mattick, J. S. Introns: evolution and function. Current Opinion in Genetics and Development, 4(6):823–31, 1994.
24. Mattick, J. S. and M. J. Gagen. The evolution of controlled multitasked gene networks: the role of introns and other noncoding RNAs in the development of complex organisms. Molecular Biology and Evolution, 18(9):1611–30, 2001.
25. Hayashi, T., K. Makino, M. Ohnishi, K. Kurokawa, K. Ishii, K. Yokoyama, C. G. Han, E. Ohtsubo, K. Nakayama, T. Murata, M. Tanaka, T. Tobe, T. Iida, H. Takami, et al. Complete genome sequence of enterohemorrhagic Escherichia coli O157:H7 and genomic comparison with a laboratory strain K-12. DNA Research, 8(1):11–22, 2001.
26. Duboule, D. and A. S. Wilkins. The evolution of "bricolage." Trends in Genetics, 14(2):54–9, 1998.
27. Waterston, R. H., K. Lindblad-Toh, E. Birney, J. Rogers, J. F. Abril, P. Agarwal, R. Agarwala, R. Ainscough, M. Alexandersson, P. An, S. E. Antonarakis, J. Attwood, R. Baertsch, J. Bailey, et al. Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915):520–62, 2002.
28. Yeo, G. W., E. Van Nostrand, D. Holste, T. Poggio and C. B. Burge. Identification and analysis of alternative splicing events conserved in human and mouse. Proceedings of the National Academy of Sciences USA, 102(8):2850–5, 2005.
29. Aparicio, S., J. Chapman, E. Stupka, N. Putnam, J. M. Chia, P. Dehal, A. Christoffels, S. Rash, S. Hoon, A. Smit, M. D. Gelpke, J. Roach, T. Oh, I. Y. Ho, et al. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science, 297(5585):1301–10, 2002.
30. Hillier, L. W., W. Miller, E. Birney, W. Warren, R. C. Hardison, C. P. Ponting, P. Bork, D. W. Burt, M. A. Groenen, M. E. Delany, J. B. Dodgson, A. T. Chinwalla, P. F. Cliften, S. W. Clifton, et al. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature, 432(7018):695–716, 2004.
31. Rubin, G. M., M. D. Yandell, J. R. Wortman, G. L. Gabor Miklos, C. R. Nelson, I. K. Hariharan, M. E. Fortini, P. W. Li, R. Apweiler, W. Fleischmann, J. M. Cherry, S. Henikoff, M. P. Skupski, S. Misra, et al. Comparative genomics of the eukaryotes. Science, 287(5461):2204–15, 2000.
32. Chervitz, S. A., L. Aravind, G. Sherlock, C. A. Ball, E. V. Koonin, S. S. Dwight, M. A. Harris, K. Dolinski, S. Mohr, T. Smith, S. Weng, J. M. Cherry and D. Botstein. Comparison of the complete protein sets of worm and yeast: orthology and divergence. Science, 282(5396):2022–8, 1998.
33. Dehal, P., Y. Satou, R. K. Campbell, J. Chapman, B. Degnan, A. De Tomaso, B. Davidson, A. Di Gregorio, M. Gelpke, D. M. Goodstein, N. Harafuji, K. E. Hastings, I. Ho, K. Hotta, et al. The draft genome of Ciona intestinalis: insights into chordate and vertebrate origins. Science, 298(5601):2157–67, 2002.
34. The C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science, 282(5396):2012–18, 1998.
35. Stein, L. D., Z. Bao, D. Blasiar, T. Blumenthal, M. R. Brent, N. Chen, A. Chinwalla, L. Clarke, C. Clee, A. Coghlan, A. Coulson, P. D'Eustachio, D. H. Fitch, L. A. Fulton, et al. The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics. PLoS Biology, 1(2):E45, 2003.
36. Adams, M. D., S. E. Celniker, R. A. Holt, C. A. Evans, J. D. Gocayne, P. G. Amanatides, S. E. Scherer, P. W. Li, R. A. Hoskins, R. F. Galle, R. A. George, S. E. Lewis, S. Richards, M. Ashburner, et al. The genome sequence of Drosophila melanogaster. Science, 287(5461):2185–95, 2000.
37. Holt, R. A., G. M. Subramanian, A. Halpern, G. G. Sutton, R. Charlab, D. R. Nusskern, P. Wincker, A. G. Clark, J. M. Ribeiro, R. Wides, S. L. Salzberg, B. Loftus, M. Yandell, W. H. Majoros, et al. The genome sequence of the malaria mosquito Anopheles gambiae. Science, 298(5591):129–49, 2002.
38. Zdobnov, E. M., C. von Mering, I. Letunic, D. Torrents, M. Suyama, R. R. Copley, G. K. Christophides, D. Thomasova, R. A. Holt, G. M. Subramanian, H. M. Mueller, G. Dimopoulos, J. H. Law, M. A. Wells, et al. Comparative genome and proteome analysis of Anopheles gambiae and Drosophila melanogaster. Science, 298(5591):149–59, 2002.
39. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature, 431(7011):931–45, 2004.
40. Taft, R. J. and J. S. Mattick. Increasing biological complexity is positively correlated with the relative genome-wide expansion of non-protein-coding DNA sequences. Genome Biology Preprint Depository, http://genomebiology.com/2003/5/1/P1, 2003.
41. Frith, M. C., M. Pheasant and J. S. Mattick. The amazing complexity of the human transcriptome. European Journal of Human Genetics, 13(8):894–7, 2005.
42. Graveley, B. R. Alternative splicing: increasing diversity in the proteomic world. Trends in Genetics, 17(2):100–7, 2001.
43. Maniatis, T. and B. Tasic. Alternative pre-mRNA splicing and proteome expansion in metazoans. Nature, 418(6894):236–43, 2002.
44. Levine, M. and R. Tjian. Transcription regulation and animal diversity. Nature, 424(6945):147–51, 2003.
45. Buchler, N. E., U. Gerland and T. Hwa. On schemes of combinatorial transcription logic. Proceedings of the National Academy of Sciences USA, 100(9):5136–41, 2003.
46. Li, M. and P. M. B. Vitanyi. An Introduction to Kolmogorov Complexity and its Applications, 2nd ed. Springer-Verlag, New York, 1998.
47. Csete, M. E. and J. C. Doyle. Reverse engineering of biological complexity. Science, 295(5560):1664–9, 2002.
48. Croft, L. J., M. J. Lercher, M. J. Gagen and J. S. Mattick. Is prokaryotic complexity limited by accelerated growth in regulatory overhead? Genome Biology Preprint Depository, http://genomebiology.com/qc/2003/5/1/p2, 2003.
49. Gagen, M. J. and J. S. Mattick. Inherent size constraints on prokaryote gene networks due to "accelerating" growth. arXiv Preprint Archive, http://arXiv.org/abs/q-bio.MN/0312021, 2004.
50. Gagen, M. J. and J. S. Mattick. Inherent size constraints on prokaryote gene networks due to "accelerating" growth. Theory in Bioscience, 123:381–411, 2005.
51. Casjens, S. The diverse and dynamic structure of bacterial genomes. Annual Review of Genetics, 32:339–77, 1998.
52. Semon, M. and L. Duret. Evidence that functional transcription units cover at least half of the human genome. Trends in Genetics, 20(5):229–32, 2004.
53. Herbert, A. and A. Rich. RNA processing and the evolution of eukaryotes. Nature Genetics, 21(3):265–9, 1999.
54. Jacob, F. and J. Monod. Genetic regulatory mechanisms in the synthesis of proteins. Journal of Molecular Biology, 3:318–56, 1961.
55. Britten, R. J. and E. H. Davidson. Gene regulation for higher cells: a theory. Science, 165(891):349–57, 1969.
56. Wassarman, K. M., A. Zhang and G. Storz. Small RNAs in Escherichia coli. Trends in Microbiology, 7(1):37–45, 1999.
57. Rivas, E., R. J. Klein, T. A. Jones and S. R. Eddy. Computational identification of noncoding RNAs in E. coli by comparative genomics. Current Biology, 11(17):1369–73, 2001.
58. Gottesman, S. Stealth regulation: biological circuits with small RNA switches. Genes and Development, 16(22):2829–42, 2002.
59. Vogel, J., V. Bartels, T. H. Tang, G. Churakov, J. G. Slagter-Jager, A. Huttenhofer and E. G. Wagner. RNomics in Escherichia coli detects new sRNA species and indicates parallel transcriptional output in bacteria. Nucleic Acids Research, 31(22):6435–43, 2003.
60. Mandal, M., B. Boese, J. E. Barrick, W. C. Winkler and R. R. Breaker. Riboswitches control fundamental biochemical pathways in Bacillus subtilis and other bacteria. Cell, 113(5):577–86, 2003.
61. Lenz, D. H., K. C. Mok, B. N. Lilley, R. V. Kulkarni, N. S. Wingreen and B. L. Bassler. The small RNA chaperone Hfq and multiple small RNAs control quorum sensing in Vibrio harveyi and Vibrio cholerae. Cell, 118(1):69–82, 2004.
62. Grundy, F. J. and T. M. Henkin. Regulation of gene expression by effectors that bind to RNA. Current Opinion in Microbiology, 7(2):126–31, 2004.
63. Nudler, E. and A. S. Mironov. The riboswitch control of bacterial metabolism. Trends in Biochemical Science, 29(1):11–17, 2004.
64. Wilderman, P. J., N. A. Sowa, D. J. FitzGerald, P. C. FitzGerald, S. Gottesman, U. A. Ochsner and M. L. Vasil. Identification of tandem duplicate regulatory small RNAs in Pseudomonas aeruginosa involved in iron homeostasis. Proceedings of the National Academy of Sciences USA, 101(26):9792–7, 2004.
65. Storz, G., J. A. Opdyke and A. Zhang. Controlling mRNA stability and translation with small, noncoding RNAs. Current Opinion in Microbiology, 7(2):140–4, 2004.
66. Carmell, M. A., Z. Xuan, M. Q. Zhang and G. J. Hannon. The Argonaute family: tentacles that reach into RNAi, developmental control, stem cell maintenance, and tumorigenesis. Genes and Development, 16(21):2733–42, 2002.
67. Bartel, D. P. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell, 116(2):281–97, 2004.
68. Sanford, J. R. and J. F. Caceres. Pre-mRNA splicing: life at the centre of the central dogma. Journal of Cell Science, 117(26):6261–3, 2004.
69. Bridgeman, B. A review of the role of efference copy in sensory and oculomotor control systems. Annals of Biomedical Engineering, 23(4):409–22, 1995.
70. Andersen, R. A., L. H. Snyder, D. C. Bradley and J. Xing. Multimodal representation of space in the posterior parietal cortex and its use in planning movements. Annual Review of Neuroscience, 20:303–30, 1997.
71. Herbert, A. and A. Rich. RNA processing in evolution: the logic of soft-wired genomes. Annals of the New York Academy of Science, 870:119–32, 1999.
72. Wassenegger, M. RNA-directed DNA methylation. Plant Molecular Biology, 43(2):203–20, 2000.
73. Sharp, P. A. RNA interference-2001. Genes and Development, 15(5):485–90, 2001.
74. Aufsatz, W., M. F. Mette, J. van der Winden, A. J. Matzke and M. Matzke. RNA-directed DNA methylation in Arabidopsis. Proceedings of the National Academy of Sciences USA, 99(Suppl 4):16499–506, 2002.
75. Cerutti, H. RNA interference: traveling in the cell and gaining functions? Trends in Genetics, 19(1):39–46, 2003.
76. Andersen, A. A. and B. Panning. Epigenetic gene regulation by noncoding RNAs. Current Opinion in Cell Biology, 15(3):281–9, 2003.
77. Baulcombe, D. RNA silencing in plants. Nature, 431(7006):356–63, 2004.
78. Schramke, V. and R. Allshire. Those interfering little RNAs! Silencing and eliminating chromatin. Current Opinion in Genetics and Development, 14(2):174–80, 2004.
79. Tariq, M. and J. Paszkowski. DNA and histone methylation in plants. Trends in Genetics, 20(6):244–51, 2004.
80. Carninci, P., T. Kasukawa, S. Katayama, J. Gough, M. C. Frith, N. Maeda, R. Oyama, T. Ravasi, B. Lenhard, C. Wells, R. Kodzius, K. Shimokawa, V. B. Bajic, S. E. Brenner, et al. The transcriptional landscape of the human genome. Science, 309(5740):559–63, 2005.
81. He, L. and G. J. Hannon. MicroRNAs: small RNAs with a big role in gene regulation. Nature Reviews Genetics, 5(7):522–31, 2004.
82. Pasquinelli, A. E. and G. Ruvkun. Control of developmental timing by microRNAs and their targets. Annual Review of Cell and Developmental Biology, 18:495–513, 2002.
83. Ambros, V. MicroRNA pathways in flies and worms: growth, death, fat, stress, and timing. Cell, 114(2):269, 2003.
84. Brennecke, J., D. R. Hipfner, A. Stark, R. B. Russell and S. M. Cohen. bantam encodes a developmentally regulated microRNA that controls cell proliferation and regulates the proapoptotic gene hid in Drosophila. Cell, 113(1):25–36, 2003.
85. Carrington, J. C. and V. Ambros. Role of microRNAs in plant and animal development. Science, 301(5631):336–8, 2003.
86. Houbaviy, H. B., M. F. Murray and P. A. Sharp. Embryonic stem cell-specific microRNAs. Developmental Cell, 5(2):351–8, 2003.
87. Kidner, C. A. and R. A. Martienssen. Macro effects of microRNAs in plants. Trends in Genetics, 19(1):13–16, 2003.
88. Bartel, D. P. and C. Z. Chen. Micromanagers of gene expression: the potentially widespread influence of metazoan microRNAs. Nature Reviews Genetics, 5(5):396–400, 2004.
89. Poy, M. N., L. Eliasson, J. Krutzfeldt, S. Kuwajima, X. Ma, P. E. Macdonald, S. Pfeffer, T. Tuschl, N. Rajewsky, P. Rorsman and M. Stoffel. A pancreatic islet-specific microRNA regulates insulin secretion. Nature, 432(7014):226–30, 2004.
90. Kasashima, K., Y. Nakamura and T. Kozu. Altered expression profiles of microRNAs during TPA-induced differentiation of HL-60 cells. Biochemical and Biophysical Research Communications, 322(2):403–10, 2004.
91. Esau, C., X. Kang, E. Peralta, E. Hanson, E. G. Marcusson, L. V. Ravichandran, Y. Sun, S. Koo, R. J. Perera, R. Jain, N. M. Dean, S. M. Freier, C. F. Bennett, B. Lollo, et al. MicroRNA-143 regulates adipocyte differentiation. Journal of Biological Chemistry, 279(50):52361–5, 2004.
92. Chen, X. A microRNA as a translational repressor of APETALA2 in Arabidopsis flower development. Science, 303(5666):2022–5, 2004.
93. Mattick, J. S. and I. V. Makunin. Small regulatory RNAs in mammals. Human Molecular Genetics, 14:R121–32, 2005.
94. Griffiths-Jones, S. The microRNA Registry. Nucleic Acids Research, 32(Database Issue):D109–11, 2004.
95. Griffiths-Jones, S., S. Moxon, M. Marshall, A. Khanna, S. R. Eddy and A. Bateman. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Research, 33(Database Issue):D121–4, 2005.
96. Pang, K. C., S. Stephen, P. G. Engström, K. Tajul-Arifin, W. Chen, C. Wahlestedt, B. Lenhard, Y. Hayashizaki and J. S. Mattick. RNAdb—a comprehensive mammalian noncoding RNA database. Nucleic Acids Research, 33(Database Issue):D125–30, 2005.
97. Liu, C., B. Bai, G. Skogerbo, L. Cai, W. Deng, Y. Zhang, D. Bu, Y. Zhao and R. Chen. NONCODE: an integrated knowledge database of non-coding RNAs. Nucleic Acids Research, 33(Database Issue):D112–15, 2005.
98. Chalk, A. M., R. E. Warfinge, P. Georgii-Hemming and E. L. Sonnhammer. siRNAdb: a database of siRNA sequences. Nucleic Acids Research, 33(Database Issue):D131–4, 2005.
99. Truss, M., M. Swat, S. M. Kielbasa, R. Schafer, H. Herzel and C. Hagemeier. HuSiDa—the human siRNA database: an open-access database for published functional siRNA sequences and technical details of efficient transfer into recipient cells. Nucleic Acids Research, 33(Database Issue):D108–11, 2005.
100. Berezikov, E., V. Guryev, J. van de Belt, E. Wienholds, R. H. Plasterk and E. Cuppen. Phylogenetic shadowing and computational identification of human microRNA genes. Cell, 120(1):21–4, 2005.
101. Lewis, B. P., C. B. Burge and D. P. Bartel. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell, 120(1):15–20, 2005.
102. Yekta, S., I. H. Shih and D. P. Bartel. MicroRNA-directed cleavage of HOXB8 mRNA. Science, 304(5670):594–6, 2004.
103. Vazquez, F., H. Vaucheret, R. Rajagopalan, C. Lepers, V. Gasciolli, A. C. Mallory, J. L. Hilbert, D. P. Bartel and P. Crete. Endogenous trans-acting siRNAs regulate the accumulation of Arabidopsis mRNAs. Molecular Cell, 16(1):69–79, 2004.
104. Calin, G. A., C. D. Dumitru, M. Shimizu, R. Bichi, S. Zupo, E. Noch, H. Aldler, S. Rattan, M. Keating, K. Rai, L. Rassenti, T. Kipps, M. Negrini, F. Bullrich, et al. Frequent deletions and down-regulation of microRNA genes miR15 and miR16 at 13q14 in chronic lymphocytic leukemia. Proceedings of the National Academy of Sciences USA, 99(24):15524–9, 2002.
105. Michael, M. Z., S. M. O'Connor, N. G. van Holst Pellekaan, G. P. Young and R. J. James. Reduced accumulation of specific microRNAs in colorectal neoplasia. Molecular Cancer Research, 1(12):882–91, 2003.
106. McManus, M. T. MicroRNAs and cancer. Seminars in Cancer Biology, 13(4):253–58, 2003.
107. Metzler, M., M. Wilda, K. Busch, S. Viehmann and A. Borkhardt. High expression of precursor microRNA-155/BIC RNA in children with Burkitt lymphoma. Genes, Chromosomes and Cancer, 39(2):167–9, 2004.
108. Calin, G. A., C. G. Liu, C. Sevignani, M. Ferracin, N. Felli, C. D. Dumitru, M. Shimizu, A. Cimmino, S. Zupo, M. Dono, M. L. Dell'Aquila, H. Alder, L. Rassenti, T. J. Kipps, et al. MicroRNA profiling reveals distinct signatures in B cell chronic lymphocytic leukemias. Proceedings of the National Academy of Sciences USA, 101(32):11755–60, 2004.
109. Calin, G. A., C. Sevignani, C. D. Dumitru, T. Hyslop, E. Noch, S. Yendamuri, M. Shimizu, S. Rattan, F. Bullrich, M. Negrini and C. M. Croce. Human microRNA genes are frequently located at fragile sites and genomic regions involved in cancers. Proceedings of the National Academy of Sciences USA, 101(9):2999–3004, 2004.
110. Takamizawa, J., H. Konishi, K. Yanagisawa, S. Tomida, H. Osada, H. Endoh, T. Harano, Y. Yatabe, M. Nagino, Y. Nimura, T. Mitsudomi and T. Takahashi. Reduced expression of the let-7 microRNAs in human lung cancers in association with shortened postoperative survival. Cancer Research, 64(11):3753–56, 2004.
111. Cai, X., C. H. Hagedorn and B. R. Cullen. Human microRNAs are processed from capped, polyadenylated transcripts that can also function as mRNAs. RNA, 10(12):1957–66, 2004.
112. Rodriguez, A., S. Griffiths-Jones, J. L. Ashurst and A. Bradley. Identification of mammalian microRNA host genes and transcription units. Genome Research, 14(10A):1902–10, 2004.
113. Ying, S. Y. and S. L. Lin. Intronic microRNAs. Biochemical and Biophysical Research Communications, 326(3):515–20, 2005.
114. Maxwell, E. S. and M. J. Fournier. The small nucleolar RNAs. Annual Review of Biochemistry, 64:897–934, 1995.
115. Bachellerie, J. P., J. Cavaille and A. Hüttenhofer. The expanding snoRNA world. Biochimie, 84(8):775–90, 2002.
116. Kiss, T. Small nucleolar RNAs: an abundant group of noncoding RNAs with diverse cellular functions. Cell, 109(2):145–8, 2002.
117. Volpe, T. A., C. Kidner, I. M. Hall, G. Teng, S. I. Grewal and R. A. Martienssen. Regulation of heterochromatic silencing and histone H3 lysine-9 methylation by RNAi. Science, 297(5588):1833–7, 2002.
118. Hall, I. M., K. Noma and S. I. Grewal. RNA interference machinery regulates chromosome dynamics during mitosis and meiosis in fission yeast. Proceedings of the National Academy of Sciences USA, 100(1):193–8, 2003.
119. Schramke, V. and R. Allshire. Hairpin RNAs and retrotransposon LTRs effect RNAi and chromatin-based gene silencing. Science, 301(5636):1069–74, 2003.
120. Volpe, T., V. Schramke, G. L. Hamilton, S. A. White, G. Teng, R. A. Martienssen and R. C. Allshire. RNA interference is required for normal centromere function in fission yeast. Chromosome Research, 11(2):137–46, 2003.
121. Mochizuki, K. and M. A. Gorovsky. Small RNAs in genome rearrangement in Tetrahymena. Current Opinion in Genetics and Development, 14(2):181–7, 2004.
122. Lippman, Z. and R. Martienssen. The role of RNA interference in heterochromatic silencing. Nature, 431(7006):364–70, 2004.
123. Verdel, A., S. Jia, S. Gerber, T. Sugiyama, S. Gygi, S. I. Grewal and D. Moazed. RNAi-mediated targeting of heterochromatin by the RITS complex. Science, 303(5658):672–6, 2004.
124. Kawasaki, H. and K. Taira. Induction of DNA methylation and gene silencing by short interfering RNAs in human cells. Nature, 431(7005):211–17, 2004.
125. Imamura, T., S. Yamamoto, J. Ohgane, N. Hattori, S. Tanaka and K. Shiota. Non-coding RNA directed DNA demethylation of Sphk1 CpG island. Biochemical and Biophysical Research Communications, 322(2):593–600, 2004.
126. Dernburg, A. F. and G. H. Karpen. A chromosome RNAissance. Cell, 111(2):159–62, 2002.
127. Tufarelli, C., J. A. Stanley, D. Garrick, J. A. Sharpe, H. Ayyub, W. G. Wood and D. R. Higgs. Transcription of antisense RNA leading to gene silencing and methylation as a novel cause of human genetic disease. Nature Genetics, 34(2):157–65, 2003.
128. Shi, Y. and J. M. Berg. Specific DNA-RNA hybrid binding by zinc finger proteins. Science, 268(5208):282–4, 1995.
129. Ladomery, M. Multifunctional proteins suggest connections between transcriptional and post-transcriptional processes. Bioessays, 19(10):903–9, 1997.
130. Wilkinson, M. F. and A. B. Shyu. Multifunctional regulatory proteins that control gene expression in both the nucleus and the cytoplasm. Bioessays, 23(9):775–87, 2001.
131. Bomsztyk, K., O. Denisenko and J. Ostrowski. hnRNP K: one protein multiple processes. Bioessays, 26(6):629–38, 2004.
132. Ravasi, T., H. Suzuki, K. C. Pang, S. Katayama, M. Furuno, R. Okunishi, S. Fukuda, K. Ru, M. C. Frith, M. Gongora, S. Grimmond, D. A. Hume, Y. Hayashizaki and J. S. Mattick. Experimental validation of the regulated expression of large numbers of non-coding RNAs from the mouse genome. Genome Research, 16(1):11–19, 2006.
133. Ashe, H. L., J. Monks, M. Wijgerde, P. Fraser and N. J. Proudfoot. Intergenic transcription and transinduction of the human beta-globin locus. Genes and Development, 11(19):2494–509, 1997.
134. Charlier, C., K. Segers, D. Wagenaar, L. Karim, S. Berghmans, O. Jaillon, T. Shay, J. Weissenbach, N. Cockett, G. Gyapay and M. Georges. Human-ovine comparative sequencing of a 250-kb imprinted domain encompassing the callipyge (clpg) locus and identification of six imprinted transcripts: DLK1, DAT, GTL2, PEG11, antiPEG11, and MEG8. Genome Research, 11(5):850–62, 2001.
135. Sleutels, F., R. Zwart and D. P. Barlow. The non-coding Air RNA is required for silencing autosomal imprinted genes. Nature, 415(6873):810–13, 2002.
136. Lin, S. P., N. Youngson, S. Takada, H. Seitz, W. Reik, M. Paulsen, J. Cavaille and A. C. Ferguson-Smith. Asymmetric regulation of imprinting on the maternal and paternal chromosomes at the Dlk1-Gtl2 imprinted cluster on mouse chromosome 12. Nature Genetics, 35(1):97–102, 2003.
137. Holmes, R., C. Williamson, J. Peters, P. Denny and C. Wells. A comprehensive transcript map of the mouse Gnas imprinted complex. Genome Research, 13(6b):1410–15, 2003.
138. Seitz, H., N. Youngson, S. P. Lin, S. Dalbert, M. Paulsen, J. P. Bachellerie, A. C. Ferguson-Smith and J. Cavaille. Imprinted microRNA genes transcribed antisense to a reciprocally imprinted retrotransposon-like gene. Nature Genetics, 34(3):261–2, 2003.
139. Lipshitz, H. D., D. A. Peattie and D. S. Hogness. Novel transcripts from the Ultrabithorax domain of the bithorax complex. Genes and Development, 1(3):307–22, 1987.
140. Sanchez-Herrero, E. and M. Akam. Spatially ordered transcription of regulatory DNA in the bithorax complex of Drosophila. Development, 107(2):321–9, 1989.
141. Bae, E., V. C. Calhoun, M. Levine, E. B. Lewis and R. A. Drewell. Characterization of the intergenic RNA profile at abdominal-A and Abdominal-B in the Drosophila bithorax complex. Proceedings of the National Academy of Sciences USA, 99(26):16847–52, 2002.
142. Harrison, P. M., H. Hegyi, S. Balasubramanian, N. M. Luscombe, P. Bertone, N. Echols, T. Johnson and M. Gerstein. Molecular fossils in the human genome: identification and analysis of the pseudogenes in chromosomes 21 and 22. Genome Research, 12(2):272–80, 2002.
143. Hirotsune, S., N. Yoshida, A. Chen, L. Garrett, F. Sugiyama, S. Takahashi, K. Yagami, A. Wynshaw-Boris and A. Yoshiki. An expressed pseudogene regulates the messenger-RNA stability of its homologous coding gene. Nature, 423(6935):91–6, 2003.
144. Ronshaugen, M. and M. Levine. Visualization of trans-homolog enhancer-promoter interactions at the Abd-B Hox locus in the Drosophila embryo. Developmental Cell, 7(6):925–32, 2004.
145. Khorasanizadeh, S. The nucleosome: from genomic organization to genomic regulation. Cell, 116(2):259–72, 2004.
146. Krause, M. O. Chromatin structure and function: the heretical path to an RNA transcription factor. Biochemistry and Cell Biology, 74(5):623–32, 1996.
147. Lanz, R. B., N. J. McKenna, S. A. Onate, U. Albrecht, J. Wong, S. Y. Tsai, M. J. Tsai and B. W. O'Malley. A steroid receptor coactivator, SRA, functions as an RNA and is present in an SRC-1 complex. Cell, 97(1):17–27, 1999.
148. Lanz, R. B., B. Razani, A. D. Goldberg and B. W. O'Malley. Distinct RNA motifs are important for coactivation of steroid hormone receptors by steroid receptor RNA activator (SRA). Proceedings of the National Academy of Sciences USA, 99(25):16081–6, 2002.
149. Martinho, R. G., P. S. Kunwar, J. Casanova and R. Lehmann. A noncoding RNA is required for the repression of RNApolII-dependent transcription in primordial germ cells. Current Biology, 14(2):159–65, 2004.
150. Holliday, R. and V. Murray. Specificity in splicing. Bioessays, 16(10):771–4, 1994.
151. Lopez, A. J. Alternative splicing of pre-mRNA: developmental consequences and mechanisms of regulation. Annual Review of Genetics, 32:279–305, 1998.
152. Singh, R. RNA-protein interactions that regulate pre-mRNA splicing. Gene Expression, 10(1-2):79–92, 2002.
153. Sorek, R. and G. Ast. Intronic sequences flanking alternatively spliced exons are conserved between human and mouse. Genome Research, 13(7):1631–7, 2003.
154. Sugnet, C. W., W. J. Kent, M. Ares, Jr. and D. Haussler. Transcriptome and genome conservation of alternative splicing events in humans and mice. Pacific Symposium on Biocomputing, pp. 66–77, 2004.
155. Dunckley, M. G., M. Manoharan, P. Villiet, I. C. Eperon and G. Dickson. Modification of splicing in the dystrophin gene in cultured Mdx muscle cells by antisense oligoribonucleotides. Human Molecular Genetics, 7(7):1083–90, 1998.
156. Mann, C. J., K. Honeyman, A. J. Cheng, T. Ly, F. Lloyd, S. Fletcher, J. E. Morgan, T. A. Partridge and S. D. Wilton. Antisense-induced exon skipping and synthesis of dystrophin in the mdx mouse. Proceedings of the National Academy of Sciences USA, 98(1):42–7, 2001.
157. Kole, R. and P. Sazani. Antisense effects in the cell nucleus: modification of splicing. Current Opinion in Molecular Therapeutics, 3(3):229–34, 2001.
158. Suwanmanee, T., H. Sierakowska, G. Lacerra, S. Svasti, S. Kirby, C. E. Walsh, S. Fucharoen and R. Kole. Restoration of human beta-globin gene expression in murine and human IVS2-654 thalassemic erythroid cells by free uptake of antisense oligonucleotides. Molecular Pharmacology, 62(3):545–53, 2002.
159. Sazani, P., F. Gemignani, S. H. Kang, M. A. Maier, M. Manoharan, M. Persmark, D. Bortner and R. Kole. Systemically delivered antisense oligomers upregulate gene expression in mouse tissues. Nature Biotechnology, 20(12):1228–33, 2002.
160. Rhoades, M., B. Reinhart, L. Lim, C. Burge, B. Bartel and D. Bartel. Prediction of plant microRNA targets. Cell, 110(4):513–20, 2002.
161. Lewis, B. P., I. H. Shih, M. W. Jones-Rhoades, D. P. Bartel and C. B. Burge. Prediction of mammalian microRNA targets. Cell, 115(7):787–98, 2003.
162. Stark, A., J. Brennecke, R. B. Russell and S. M. Cohen. Identification of Drosophila microRNA targets. PLoS Biology, 1(3):e60, 2003.
163. Kiriakidou, M., P. T. Nelson, A. Kouranov, P. Fitziev, C. Bouyioukos, Z. Mourelatos and A. Hatzigeorgiou. A combined computational-experimental approach predicts human microRNA targets. Genes and Development, 18(10):1165–78, 2004.
164. Wang, X. J., J. L. Reyes, N. H. Chua and T. Gaasterland. Prediction and identification of Arabidopsis thaliana microRNAs and their mRNA targets. Genome Biology, 5(9):R65, 2004.
165. Rehmsmeier, M., P. Steffen, M. Hochsmann and R. Giegerich. Fast and effective prediction of microRNA/target duplexes. RNA, 10(10):1507–17, 2004.
166. John, B., A. J. Enright, A. Aravin, T. Tuschl, C. Sander and D. S. Marks. Human microRNA targets. PLoS Biology, 2(11):e363, 2004.
167. van Holde, K. and J. Zlatanova. Unusual DNA structures, chromatin and transcription. Bioessays, 16(1):59–68, 1994.
168. Ohno, M., T. Fukagawa, J. S. Lee and T. Ikemura. Triplex-forming DNAs in the human interphase nucleus visualized in situ by polypurine/polypyrimidine DNA probes and antitriplex antibodies. Chromosoma, 111(3):201–13, 2002.
169. Vasquez, K. M. and J. H. Wilson. Triplex-directed modification of genes and gene activity. Trends in Biochemical Science, 23(1):4–9, 1998.
170. Goni, J. R., X. de la Cruz and M. Orozco. Triplex-forming oligonucleotide target sequences in the human genome. Nucleic Acids Research, 32(1):354–60, 2004.
171. Giovannangeli, C. and C. Helene. Triplex-forming molecules for modulation of DNA information processing. Current Opinion in Molecular Therapeutics, 2(3):288–96, 2000.
Noncoding RNA and RNA Regulatory Networks
172. Song, J., Z. Intody, M. Li and J. H. Wilson. Activation of gene expression by triplex-directed psoralen crosslinks. Gene, 324:183–90, 2004. 173. Re, R. N., J. L. Cook and J. F. Giardina. The inhibition of tumor growth by triplex-forming oligonucleotides. Cancer Letters, 209(1):51–3, 2004. 174. Whitelaw, E. and D. I. Martin. Retrotransposons as epigenetic mediators of phenotypic variation in mammals. Nature Genetics, 27(4):361–5, 2001. 175. Peaston, A. E., A. V. Evsikov, J. H. Graber, W. N. de Vries, A. E. Holbrook, D. Solter and B. B. Knowles. Retrotransposons regulate host genes in mouse oocytes and preimplantation embryos. Developmental Cell, 7(4): 597–606, 2004. 176. Lippman, Z., A. V. Gendrel, M. Black, M. W. Vaughn, N. Dedhia, W. R. McCombie, K. Lavine, V. Mittal, B. May, K. D. Kasschau, J. C. Carrington, R. W. Doerge, V. Colot and R. Martienssen. Role of transposable elements in heterochromatin and epigenetic control. Nature, 430(6998):471–6, 2004. 177. Bass, B. L. RNA editing by adenosine deaminases that act on RNA. Annual Review of Biochemistry, 71:817–46, 2002. 178. Saunders, L. R. and G. N. Barber. The dsRNA binding protein family: critical roles, diverse cellular functions. FASEB Journal, 17(9):961–83, 2003. 179. Blow, M., P. A. Futreal, R. Wooster and M. R. Stratton. A survey of RNA editing in human brain. Genome Research, 14(12):2379–87, 2004. 180. Athanasiadis, A., A. Rich and S. Maas. Widespread A-to-I RNA editing of Alu-containing mRNAs in the human transcriptome. PLoS Biology, 2(12):e391, 2004. 181. Kim, D. D., T. T. Kim, T. Walsh, Y. Kobayashi, T. C. Matise, S. Buyske and A. Gabriel. Widespread RNA editing of embedded alu elements in the human transcriptome. Genome Research, 14(9):1719–25, 2004. 182. Levanon, E. Y., E. Eisenberg, R. Yelin, S. Nemzer, M. Hallegger, R. Shemesh, Z. Y. Fligelman, A. Shoshan, S. R. Pollock, D. Sztybel, M. Olshansky, G. Rechavi and M. F. Jantsch. 
Systematic identification of abundant A-to-I editing sites in the human transcriptome. Nature Biotechnology, 22(8):1001–5, 2004. 183. Higuchi, M., S. Maas, F. N. Single, J. Hartner, A. Rozov, N. Burnashev, D. Feldmeyer, R. Sprengel and P. H. Seeburg. Point mutation in an AMPA receptor gene rescues lethality in mice deficient in the RNA-editing enzyme ADAR2. Nature, 406(6791):78–81, 2000. 184. Vissel, B., G. A. Royle, B. R. Christie, H. H. Schiffer, A. Ghetti, T. Tritto, I. Perez-Otano, R. A. Radcliffe, J. Seamans, T. Sejnowski, J. M. Wehner, A. C. Collins, S. O'Gorman and S. F. Heinemann. The role of RNA editing of kainate receptors in synaptic plasticity and seizures. Neuron, 29(1):217–27, 2001. 185. Maas, S., S. Patt, M. Schrey and A. Rich. Underediting of glutamate receptor GluR-B mRNA in malignant gliomas. Proceedings of the National Academy of Sciences USA, 98(25):14687–92, 2001. 186. Schmauss, C. Serotonin 2C receptors: suicide, serotonin, and runaway RNA editing. Neuroscientist, 9(4):237–42, 2003. 187. Knight, S. W. and B. L. Bass. The role of RNA editing by ADARs in RNAi. Molecular Cell, 10(4):809–17, 2002. 188. Tonkin, L. A. and B. L. Bass. Mutations in RNAi rescue aberrant chemotaxis of ADAR mutants. Science, 302(5651):1725, 2003.
Genomics
189. Smith, N. G., M. Brandstrom and H. Ellegren. Evidence for turnover of functional noncoding DNA in mammalian genome evolution. Genomics, 84(5):806–13, 2004. 190. Bazykin, G. A., F. A. Kondrashov, A. Y. Ogurtsov, S. Sunyaev and A. S. Kondrashov. Positive selection at sites of multiple amino acid replacements since rat-mouse divergence. Nature, 429(6991):558–62, 2004. 191. Fullerton, S. M., J. Bond, J. A. Schneider, B. Hamilton, R. M. Harding, A. J. Boyce and J. B. Clegg. Polymorphism and divergence in the betaglobin replication origin initiation region. Molecular Biology and Evolution, 17(1):179–88, 2000. 192. Hardison, R. C., K. M. Roskin, S. Yang, M. Diekhans, W. J. Kent, R. Weber, L. Elnitski, J. Li, M. O'Connor, D. Kolbe, S. Schwartz, T. S. Furey, S. Whelan, N. Goldman, et al. Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution. Genome Research, 13(1):13–26, 2003. 193. Hwang, D. G. and P. Green. Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proceedings of the National Academy of Sciences USA, 101(39):13994–14001, 2004. 194. Williams, S. H., N. Mouchel and A. Harris. A comparative genomic analysis of the cow, pig, and human CFTR genes identifies potential intronic regulatory elements. Genomics, 81(6):628–39, 2003. 195. Thomas, J. W., J. W. Touchman, R. W. Blakesley, G. G. Bouffard, S. M. Beckstrom-Sternberg, E. H. Margulies, M. Blanchette, A. C. Siepel, P. J. Thomas, J. C. McDowell, B. Maskeri, N. F. Hansen, M. S. Schwartz, R. J. Weber, et al. Comparative analyses of multi-species sequences from targeted genomic regions. Nature, 424(6950):788–93, 2003. 196. Frazer, K. A., H. Tao, K. Osoegawa, P. J. de Jong, X. Chen, M. F. Doherty and D. R. Cox. Noncoding sequences conserved in a limited number of mammals in the SIM2 interval are frequently functional. Genome Research, 14(3):367–72, 2004. 197. Hare, M. P. and S. R. Palumbi. 
High intron sequence conservation across three Mammalian orders suggests functional constraints. Molecular Biology and Evolution, 20(6):969–78, 2003. 198. Chureau, C., M. Prissette, A. Bourdet, V. Barbe, L. Cattolico, L. Jones, A. Eggen, P. Avner and L. Duret. Comparative sequence analysis of the X-inactivation center region in mouse, human, and bovine. Genome Research, 12(6):894–908, 2002. 199. Hong, Y. K., S. D. Ontiveros and W. M. Strauss. A revision of the human XIST gene organization and structural comparison with mouse Xist. Mammalian Genome, 11(3):220–4, 2000. 200. Nesterova, T. B., S. Y. Slobodyanyuk, E. A. Elisaphenko, A. I. Shevchenko, C. Johnston, M. E. Pavlova, I. B. Rogozin, N. N. Kolesnikov, N. Brockdorff and S. M. Zakian. Characterization of the genomic Xist locus in rodents reveals conservation of overall gene structure and tandem repeats but rapid evolution of unique sequence. Genome Research, 11(5):833–49, 2001. 201. Bejerano, G., M. Pheasant, I. Makunin, S. Stephen, W. J. Kent, J. S. Mattick and D. Haussler. Ultraconserved elements in the human genome. Science, 304(5675):1321–5, 2004.
202. Glazov, E. A., M. Pheasant, E. A. McGraw, G. Bejerano and J. S. Mattick. Ultraconserved elements in insect genomes: a highly conserved intronic sequence implicated in the control of homothorax mRNA splicing. Genome Research, 15(6):800–8, 2005. 203. Lee, R. C. and V. Ambros. An extensive class of small RNAs in Caenorhabditis elegans. Science, 294(5543):862–4, 2001. 204. Lau, N. C., L. P. Lim, E. G. Weinstein and D. P. Bartel. An abundant class of tiny RNAs with probable regulatory roles in Caenorhabditis elegans. Science, 294(5543):858–62, 2001. 205. Lagos-Quintana, M., R. Rauhut, W. Lendeckel and T. Tuschl. Identification of novel genes coding for small expressed RNAs. Science, 294(5543):853–8, 2001. 206. Lagos-Quintana, M., R. Rauhut, J. Meyer, A. Borkhardt and T. Tuschl. New microRNAs from mouse and human. RNA, 9(2):175–9, 2003. 207. Aravin, A. A., M. Lagos-Quintana, A. Yalcin, M. Zavolan, D. Marks, B. Snyder, T. Gaasterland, J. Meyer and T. Tuschl. The small RNA profile during Drosophila melanogaster development. Developmental Cell, 5(2): 337–50, 2003. 208. Reinhart, B. J., E. G. Weinstein, M. W. Rhoades, B. Bartel and D. P. Bartel. MicroRNAs in plants. Genes and Development, 16(13):1616–26, 2002. 209. Johnston, R. J. and O. Hobert. A microRNA controlling left/right neuronal asymmetry in Caenorhabditis elegans. Nature, 426(6968):845–9, 2003. 210. Moss, E. G. and L. Tang. Conservation of the heterochronic regulator Lin-28, its developmental expression and microRNA complementary sites. Developmental Biology, 258(2):432–42, 2003. 211. Vella, M. C., E. Y. Choi, S. Y. Lin, K. Reinert and F. J. Slack. The C. elegans microRNA let-7 binds to imperfect let-7 complementary sites from the lin41 3′UTR. Genes and Development, 18(2):132–7, 2004. 212. Mattick, J. S. The functional genomics of noncoding RNA. Science, 309(5740):1527–8, 2005. 213. Lowe, T. M. and S. R. Eddy. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. 
Nucleic Acids Research, 25(5):955–64, 1997. 214. Lowe, T. M. and S. R. Eddy. A computational screen for methylation guide snoRNAs in yeast. Science, 283(5405):1168–71, 1999. 215. Cavaille, J., H. Seitz, M. Paulsen, A. C. Ferguson-Smith and J. P. Bachellerie. Identification of tandemly-repeated C/D snoRNA genes at the imprinted human 14q32 domain reminiscent of those at the Prader-Willi/Angelman syndrome region. Human Molecular Genetics, 11(13):1527–38, 2002. 216. McCutcheon, J. P. and S. R. Eddy. Computational identification of noncoding RNAs in Saccharomyces cerevisiae by comparative genomics. Nucleic Acids Research, 31(14):4119–28, 2003. 217. Edvardsson, S., P. P. Gardner, A. M. Poole, M. D. Hendy, D. Penny and V. Moulton. A search for H/ACA snoRNAs in yeast using MFE secondary structure prediction. Bioinformatics, 19(7):865–73, 2003. 218. Cavaille, J., K. Buiting, M. Kiefmann, M. Lalande, C. I. Brannan, B. Horsthemke, J. P. Bachellerie, J. Brosius and A. Huttenhofer. Identification of brain-specific and imprinted small nucleolar RNA genes exhibiting an
unusual genomic organization. Proceedings of the National Academy of Sciences USA, 97(26):14311–16, 2000. 219. Rivas, E. and S. R. Eddy. The language of RNA: a formal grammar that includes pseudoknots. Bioinformatics, 16(4):334–40, 2000. 220. Rivas, E. and S. R. Eddy. Secondary structure alone is generally not statistically significant for the detection of noncoding RNAs. Bioinformatics, 16(7):583–605, 2000. 221. Eddy, S. R. How do RNA folding algorithms work? Nature Biotechnology, 22(11):1457–8, 2004. 222. Washietl, S., I. L. Hofacker and P. F. Stadler. Fast and reliable prediction of noncoding RNAs. Proceedings of the National Academy of Sciences USA, 102:2454–9, 2005. 223. Gutell, R. R., J. C. Lee and J. J. Cannone. The accuracy of ribosomal RNA comparative structure models. Current Opinion in Structural Biology, 12(3):301–10, 2002. 224. Lee, J. C., J. J. Cannone and R. R. Gutell. The lonepair triloop: a new motif in RNA structure. Journal of Molecular Biology, 325(1):65–83, 2003. 225. Moody, E. M., J. C. Feerrar and P. C. Bevilacqua. Evidence that folding of an RNA tetraloop hairpin is less cooperative than its DNA counterpart. Biochemistry, 43(25):7992–8, 2004. 226. Gilbert, D. E. and J. Feigon. Multistranded DNA structures. Current Opinion in Structural Biology, 9(3):305–14, 1999. 227. Neidle, S., B. Schneider and H. M. Berman. Fundamentals of DNA and RNA structure. In P. E. Bourne and H. Weissig (Eds.), Structural Bioinformatics: Methods in Biochemical Analysis, Vol. 44 (pp. 41–73). Wiley, New York, 2003. 228. Akhtar, A. Dosage compensation: an intertwined world of RNA and chromatin remodelling. Current Opinion in Genetics and Development, 13(2):161–9, 2003. 229. Jeffery, L. and S. Nakielny. Components of the DNA methylation system of chromatin control are RNA-binding proteins. Journal of Biological Chemistry, 279(47):49479–87, 2004. 230. Krajewski, W. A., T. Nakamura, A. Mazo and E. Canaani. A motif within SET-domain proteins binds single-stranded nucleic acids and transcribed and supercoiled DNAs and can interfere with assembly of nucleosomes. Molecular and Cellular Biology, 25(5):1891–9, 2005. 231. Akhtar, A., D. Zink and P. B. Becker. Chromodomains are protein-RNA interaction modules. Nature, 407(6802):405–9, 2000. 232. Tajul-Arifin, K., R. Teasdale, T. Ravasi, D. A. Hume and J. S. Mattick. Identification and analysis of chromodomain-containing proteins encoded in the mouse transcriptome. Genome Research, 13(6b):1416–29, 2003. 233. Meehan, R. R., C. F. Kao and S. Pennings. HP1 binding to native chromatin in vitro is determined by the hinge region and not by the chromodomain. EMBO Journal, 22(12):3164–74, 2003. 234. Brehm, A., K. R. Tufteland, R. Aasland and P. B. Becker. The many colours of chromodomains. Bioessays, 26(2):133–40, 2004. 235. Bottomley, M. J. Structures of protein domains that create or recognize histone modifications. EMBO Reports, 5(5):464–9, 2004.
236. Das, P. M., K. Ramachandran, J. vanWert and R. Singal. Chromatin immunoprecipitation assay. Biotechniques, 37(6):961–9, 2004. 237. Yan, Y., H. Chen and M. Costa. Chromatin immunoprecipitation assays. Methods in Molecular Biology, 287:9–19, 2004. 238. Barski, A. and B. Frenkel. ChIP Display: novel method for identification of genomic targets of transcription factors. Nucleic Acids Research, 32(12):e104, 2004. 239. Lai, E. C. Notch signaling: control of cell communication and cell fate. Development, 131(5):965–73, 2004. 240. Weng, A. P. and J. C. Aster. Multiple niches for Notch in cancer: context is everything. Current Opinion in Genetics and Development, 14(1):48–54, 2004. 241. Hansson, E. M., U. Lendahl and G. Chapman. Notch signaling in development and disease. Seminars in Cancer Biology, 14(5):320–8, 2004. 242. Lum, L. and P. A. Beachy. The Hedgehog response network: sensors, switches, and routers. Science, 304(5678):1755–9, 2004. 243. Mishina, Y. Function of bone morphogenetic protein signaling during mouse development. Frontiers in Bioscience, 8:855–69, 2003. 244. Hombria, J. C. and S. Brown. The fertile field of Drosophila Jak/STAT signalling. Current Biology, 12(16):R569–75, 2002. 245. Seto, E. S. and H. J. Bellen. The ins and outs of Wingless signaling. Trends in Cell Biology, 14(1):45–53, 2004.
Index
Page references to tables and figures are in italics. ab initio folding 188, 189. See also protein, folding abiotic synthesis 25, 30, 65 acetylene 17, 41, 72 activators 219, 222 AFT (automated functional templates) 193 AGenDA algorithm 135–6 alcohols 6–7, 9, 39, 73 aldehydes 6–7, 20–2, 25–6, 28, 67 algorithms AGenDA 135–6 AUGUSTUS 132, 134 Bayesian 109–10, 126, 141, 253 BLAST 156, 163, 169, 261 DALI 192 dynamic programming 83–5, 109–10, 119, 136, 145, 152, 155, 158–60, 162, 190–2, 242 EasyGene 127, 143 ECOPARSE 127 FastA 163 Fischer discriminant 129 folding 188–9, 202 fragment assembly. See shotgun fragment assembly fragment detection 112 GeneWise 138–40 Genie 132 GENTHREADER 191 GPHMM 134–37 greedy 91, 102, 104, 112, 238 heuristic 82, 83, 94, 99, 104, 112, 122, 130, 143, 155–6, 163, 192, 237 HMM-like 126 INTRONSCAN 134 minimal flow 99 MODELLER 196, 196, 198–9, 203, 204 motif discovery 236, 239 MULTIPROSPECTOR 207–8
Needleman and Wunsch 155, 158 prediction 123, 127, 193 Procrustes 138–9 PROSPECT II 191 PROSPECTOR 191, 199–200, 201, 202–3, 204, 205, 207–8 PROSPECTOR_3 199, 200–2, 204, 205, 207–8 PSIPRED 199, 202 ROSETTA 135–6, 172, 201, 209 SAL 192, 195–6, 198–200 Smith–Waterman 154–6, 159, 162 spliced alignment 139 structural alignment 192, 195 TASSER 196, 198–201, 201, 202–3, 204, 205, 206, 208–9 Teiresias 129 threading. See threading TOUCHSTONE II 201 training 143, 146 TWINSCAN 135, 137, 145 Viterbi 126–7, 132 ZCURVE 129–30, 145 See also computer programs alignment multiple sequence 82–3, 90, 93, 107, 109–10 optimal 109, 151, 160–2, 190, 237 pairwise 83, 90, 109–10, 137, 139, 161, 238 Smith–Waterman algorithm for 154–6, 159, 162 structural 191–2, 195–6, 198–200, 203 See also gene(s); gene finding; sequence amines and amides 7, 9, 31, 39 amino acids sequence 114, 164, 187, 240, 241, 281 synthesis of 5, 14, 19–25, 19–26, 41, 50, 54, 68, 72
ammonia 7–8, 11–14, 16, 18–23, 25–6, 28, 30, 36, 39, 41, 50, 72 ammonium cyanide 26, 50, 76 amphiphiles 69, 71 annotation 118–19, 125–6, 130–1, 135, 139–40, 143, 167–71, 177, 177, 193, 196 annotators, genomic 123, 128, 144 Aquifex aeolicus 166 Arabidopsis thaliana 118, 120, 145 Archaea 13, 15, 22, 40, 49, 52, 61 Archaeoglobus fulgidus 170, 171 Arrhenius, Svante 6, 63 artificial neural networks 119, 216 asteroids 9, 45–6 atmosphere early 11–15, 17, 26, 43–4, 46, 61 origin 6, 10–15, 17–19, 22, 24, 26, 40 ozone 13, 22, 48 reducing 7, 16, 18, 23 atypical nucleotide composition (ANC) 262–6 AUGUSTUS algorithm 132, 134 automated functional templates 193 autotrophic organisms 5–7, 41, 54, 63, 68, 71, 77, 166 AVID program 7, 134 Bacillus subtilis 120, 166–7, 171, 174 bacteria evolution of 13, 254 genomes of 99, 126 gram-negative 249, 250 gram-positive 250 photosynthetic green sulfur 71 See also under genus name Bayesian algorithms and networks 109, 126, 141, 253 Bellman’s dynamic programming 192 benchmark, PDB200 195–6, 199, 202, 204, 205, 207–8 Berzelius 5, 43 Big Bang 8 Bio-Dictionary Gene Finder (BDGF) 129–30 bioinformatics 154, 273, 278, 280–3 BLAST program 156, 163, 169, 261 BLASTN program 122, 130 BLASTX program 122
BLAT program 122 bootstrapping 83, 102 branch-and-bound search 191 Buchnera aphidicola 166 Caenorhabditis elegans 145, 258, 261, 269, 282–3 Cambrian radiation 262, 281 carbides 6–7 carbohydrate synthesis 31–3, 32, 34–5, 35–6 carbon fixation in plants 6, 9, 71, 250 carbonaceous chondrites 43, 46, 63–4 CASP5 protein structure prediction 188, 190–1, 199 catalytic agents 66–7 cDNA 122, 137, 145, 148, 275, 277 cell division 175, 176, 270, 285 chemical evolution 19, 62–8 ChIP-chip technique 236, 238, 284 Chlorobium limicola 71 chromatin immunoprecipitation 236, 238, 284 Chromobacterium violaceum 169 chromosomes 108, 132, 140, 166, 176, 264–5, 270, 280–1, 287 Ciona 259–60, 261–2, 262 cis-regulatory modules (CRMs) 239, 242–3, 271 cladistics 61–2 clay 15, 18, 22, 31, 41, 49, 65–6, 70 clones 79, 80, 81, 88, 102–6, 104, 106, 111, 122, 187, 193, 195, 200, 203, 239, 266, 277, 281 CM (Comparative Modeling) 188, 189, 196, 208 coacervates 63–4, 69 codons 122, 128–30, 136, 139, 142, 160, 168, 263–5, 279–80 start 124, 128–30, 149 stop 124, 130 (see also open reading frames) coenzymes 37, 168 coevolution 194–5, 274 cofactors 4, 37 combinatorial extension (CE) 192 Combiner program 145 comparative coding scores 130 modeling (CM) 188, 189, 196, 208
computer programs 3-D SHOTGUN 191 AUGUSTUS 132, 134 AVID 7, 134 BLAST 156, 163, 169, 261 BLASTN 122, 130 BLASTX 122 BLAT 122 Combiner 145 CRITICA 130, 144–5 DALI 192 DIALIGN 136 Doublescan 135 Dragon Gene Start Finder 140 EasyGene 127, 143 ECOPARSE 127 ENSEMBL 122, 139 est_genome 122 EuGene 145 EuGeneHom 145 ExoniPhy 137, 138 FEXH 144 FGENESH 132 GAZE 145 GeneID 145 GeneMark 126, 131, 142–3 GeneMark.hmm 119–20, 126–7, 128, 129, 131, 132, 143–5 GeneMarkS 143 GeneParser3 144 GENSCAN program 131–2, 133, 134–7, 139–40, 144–5 GeneSeqer 122 GeneSplicer 145 GeneWise 138–40 GENTHREADER 191 GLASS 136 Glimmer 129–30, 141–5 GlimmerM 145 Grail 144 HMMGene 132–45 idlBN 44 MAMMOTH 192 MED-Start 129 MEMSAT 208 MULTIPROSPECTOR 207–8 MUMmer 136 ORPHEUS 130, 143 OWEN 136
PCONS 191 PINTS 193 Procrustes 138–9 PromoterInspector 140 PROSPECT II 191 PROSPECTOR 191, 199–200, 201, 202–3, 204, 205, 207–8 PROSPECTOR_3 199, 200–2, 204, 205, 207–8 PSIPRED (server) 199, 202, 211 RBSfinder 129 RepeatMasker 122, 135 RESCUE-ESE 141 ROBETTA 191 SGP2 135, 137 sim4 122 SLAM 134, 137 SNAP 134 TBLASTX 122, 135 TOUCHSTONE II 201 TWINSCAN 135, 137, 145 VISTA 136 YACOP 145 ZCURVE 129–30, 145 See also algorithms conserved exons 123, 137 contigs (contiguous sequences) 94, 102–3, 105, 108, 110–12, 144. See also unitigs correlated partitioning test 94 CpG islands 140–1 CRITICA program 130, 144–5 crystallography 224, 240 C-score 208 cyanide ammonium 26, 50, 76 hydrogen 5–6, 14, 16–18, 20–3, 25–6, 28–9, 36, 41, 44–56, 64–5, 72, 166 cyanoacetylene 14, 30, 38, 45–6 cyanobacteria 10, 48, 250 cytochrome 154, 169 DALI algorithm 192 Darwin, Charles 4–5, 7, 61 database(s) Ciona genome 260, 261 DNA sequence 130, 135, 156, 162 existing repeat 123 FAMS 190 genome 135, 163, 265
database(s) (continued) GenPept 129 GTOP 190 human genome 260, 261 management 168 miRNA 276 MODBASE 190 nucleotide 122 PEDANT 190, 207 protein 122, 129, 138–9, 143, 175–6, 207 protein dimer 208 public 168, 174, 194 SCOP 192, 207 seqlet 129 SWISS-PROT 208 VIOLIN 143 yeast genome 260, 261 deletions 89, 137, 158–60, 164 deoxyribose 26, 33, 35, 35 development, control of 269, 279, 284–5 DIALIGN program 136 distance matrix rate (DMR) test 257–9 DNA affinity with protein 220–2 asexually reproduced 108 binding protein 225–6, 228, 277 cis-regulatory sequences 271–2 deletions 89, 137, 158–60, 164 double helix. See DNA, structure of eukaryotic 122, 131, 246 homologous sequences 135–6 insertions 88–9, 137, 158–60, 262, 264, 277 protein-coding 119, 122, 126. See also gene(s), protein-coding recognition code in 239–40, 241, 242–3 recombination 177, 252, 254–6, 264–6 reconstruction 79 repair of 172–3, 181 sequence database 130, 135, 156, 162 sequence fragments 86 structure of 27, 64, 66–8, 86, 87, 283 target sequences 79–80, 80, 270, 282 See also gene(s); sequence Doublescan program 135 Dragon Gene Start Finder program 140 Drosophila 113, 132, 258, 261, 270, 277, 282–3
dynamic programming. See programming, dynamic Earth atmosphere. See atmosphere energy sources 16–17, 16 mantle of 10, 12, 14 oceans 6, 8–9, 12–13, 15–18, 20, 22–3, 25–6, 28, 39–40, 43, 46, 48, 59, 61, 63, 65, 73 origin of 10–11 primitive 9–11, 19, 22, 28, 31, 33, 35, 43–4, 46–7, 64, 67, 70 as source of biomolecules 9–18 EasyGene algorithm 127, 143 ECOPARSE algorithm 127 ENCODE (Encyclopedia Of DNA Elements) project 120–1, 270, 274 ENSEMBL program 122, 139 enzymes 6, 26, 36–7, 41–2, 63, 70, 72, 168, 172, 173, 177, 194, 228, 251, 277 Escherichia coli 118, 121–2, 125, 125–7, 131, 141, 143, 166–7, 169, 170, 174, 207–8, 254, 258, 264–5, 270 ESEs (exonic splicing enhancers) 132, 140–1. See also splicing, of exons EST (expressed sequence tag) sequences 122 est_genome program 122 ethylene 17, 41, 72 Eubacteria 71 EuGene program 145 EuGeneHom program 145 Eukarya (eukaryotes) 4, 60, 121, 124, 131–2, 141, 146, 166, 170, 176, 222, 239, 249, 256, 258–9, 272–3, 275, 285 eutectic freezing and solutions 18, 28, 66 EVA. See evaluation of automatic structure prediction servers evaluation of automatic structure prediction servers 191 evolution chemical 19, 62–3 convergent 193 divergent 193 of green plants 7, 248, 250, 256 horizontal gene transfer and 248–53 model for 252–3 molecular 136, 274 natural selection and 59, 279–80
prebiotic 23, 25, 57, 60, 67 radiations in 256–7, 262, 280–1, 285 rates of 257–8, 279 role of genes in 62 vertical 248, 251, 265 See also natural selection and under specific chemical compounds and structures and specific organisms evolutionary histories 154, 161, 248, 250, 256, 260 exonic splicing enhancers 132, 140–1. See also splicing, of exons ExoniPhy program 137, 138 exons 120, 121–2, 124, 132, 134–7, 145 FAMS database 190 FastA algorithm 163 fatty acids 5, 17, 36, 43, 73, 185 FEXH program 144 FGENESH program 132 Fischer discriminant algorithm 129 F-measure 124 folding algorithm 188–9, 202 formose reaction 31, 33, 34, 35, 37, 39, 51–2, 67 4-taxa 250, 253–4 Francisella tularensis 170 fragment assembly algorithms 86, 112–13. See also shotgun fragment assembly detection algorithms 112 fructose 33, 70, 174, 185. See also carbohydrate synthesis GAZE program 145 GenBank 125, 168–9 gene(s) accessory 263, 266 activators 219, 222 atypical nucleotide composition of 262–6 eukaryotic 122, 132, 134, 144, 172, 222, 238, 248, 276 as historical document 62 horizontal transfer of (see horizontal gene transfer) mosaic 254, 254–6 (see also horizontal gene transfer) multiple ancestors of 252–3
numbers and complexity 270–1 primordial 59, 65 prokaryote 118–19, 123–8, 130, 132, 140, 144–6, 148 protein-coding 124, 126–7, 139, 166, 269–71, 273, 276, 280, 282 (see also DNA, protein-coding) RNA 129 short 119, 124–5, 127, 142, 146 trees 253–4 uncharacterized 172, 178 See also exons gene finding accuracy evaluation in 123–4, 141–2, 146 challenges in 119–21 classifying models for 121–3 computational methods for 118–19, 122–5, 128–45 coregulated 226, 238 eukaryotic 124, 131–2, 134, 138–40, 144–6 expression 61, 122, 128, 171, 174–5, 219, 222, 239, 243, 271, 274, 275, 277, 279, 285 GeneWise algorithm 138–40 hidden Markov (HMM) models for 119, 123, 126–7, 132, 136, 141–2, 231, 235–7, 237, 241 Markov models for 119, 122, 126–7, 128, 129, 131–2, 135 motifs 119, 129–30, 141, 147, 149, 172, 192–3, 236, 238–9, 242 prokaryotic 124–31, 145–6 regulation 242–3, 271, 275, 277, 279, 281 ROSETTA for 135–6, 172, 201, 209 sensitivity (Sn) 79, 82, 84, 112, 119, 123, 129–30, 131, 135–6, 139, 141, 144 silencing 275–6, 284 sliding window approach 126–7, 140 specificity (Sp) 123, 129–32, 135–6, 139, 141 start prediction 128–9 weight matrices in 119, 141, 229, 231–3, 235 GeneID program 145 GeneMark program 126, 131, 142–3 .hmm 119–20, 126–7, 128, 129, 131, 132, 143–5
GeneMarkS program 143 GeneParser3 program 144 generalized pair HMM (GPHMM) algorithm 134–7 GeneSeqer program 122 GeneSplicer program 145 genetic code 4, 59–60, 66, 126, 251–2 GeneWise algorithm 138–40 Genie algorithm 132 genome annotators 123, 128, 144 bacterial 122, 126, 129, 222, 270, 272 complexity 270–2 database 135, 163, 265 eukaryotic 122, 132, 134, 144, 172, 222, 238, 248, 276 expansion 278–9 -free early life 69, 71, 73 functionality 279–80 as historical document 62 human 83, 131–2, 136, 269–70, 273, 277–9 immigration 278–9 mammalian 140, 277 microbial 119, 143, 166, 171, 177–8 models 134, 143 mosaics 254–6 mouse 135–6, 279 plant 132, 145 prokaryote 132, 222 reference 135, 143 regulation by noncoding RNA 275, 276 sequences 167, 187, 190, 238, 248 sequencing 62, 113, 118–19, 123, 142, 144, 168 (see also sequencing) shotgun assembly (see shotgun fragment assembly) size 111, 142, 270–1 tiling chips 236, 238, 284 viral 130, 143 yeast 166–7, 169, 178, 260 See also under specific organisms genomics 57, 134, 146, 154, 166, 177, 187, 194 GenPept protein database 129 GENSCAN program 131–2, 133, 134–7, 139–40, 144–5 GENTHREADER algorithm 191 Gibbs sampling 238
GLASS program 136 Glimmer program 129–30, 141–5 GlimmerM program 145 glucose 33, 70, 173. See also carbohydrate synthesis GPHMM (generalized pair HMM) 134–7 Grail program 144 greedy algorithm 91, 102, 104, 112, 238 greenhouse effect 15, 18, 50 GTOP database 190 Haeckel, Ernst 62–3 Haemophilus influenzae 126, 167, 258 Helicobacter pylori 167, 174 heredity 59, 64, 67–70, 73 heterotrophs 7, 63 HGT. See horizontal gene transfer heuristic approach 82, 83, 94, 99, 104, 112, 122, 130, 143, 155–6, 163, 192, 237 hidden Markov (HMM) models 119, 123, 126–7, 132, 136, 141–2, 231, 235–7, 237, 241. See also gene finding HMMGene program 132–45 HMMs. See hidden Markov models Homo sapiens 118 homology 135, 172, 193, 195, 253 homoplasy 250, 253 horizontal gene transfer (HGT) 177 atypical nucleotide composition and 262–6 distance discrepancy for 256–62 mosaics and 254, 254–6 phylogenetic congruency test for 253–62 protein distance ratios and 259–62 human adenosine-to-inosine (A-I) editing 279 chromatin 278 chromosome 22, 140 cytomegalovirus 130 exons 137 gene pairs 136 genome 83, 131–2, 135–6, 139–40, 260, 269–70, 273, 277–9 genome database 260, 261 mRNAs 276 noncoding ultraconserved sequences 281 pathogens 255
protein-coding genes and sequences 174, 259–60, 260–2, 271, 282 pseudogenes 277 signaling proteins 171 hydrogen 8, 11, 14, 20–1, 25 bond 9, 67–8, 202 cyanide 5–6, 14, 16–18, 20–3, 25–6, 28–9, 36, 41, 44–56, 64–5, 72, 166 hydrothermal vents 17, 36, 40–3, 47, 71–2 hydroxy acids 19, 20, 22 idlBN method 141 indels 89, 137, 261. See also deletions; insertions inheritance 59, 64, 67–70, 73 insertions 88–9, 137, 158–60, 262, 264, 277 introns 121–2, 131–2, 133, 134, 139, 159, 269–70, 273–4, 276, 278, 285 INTRONSCAN algorithm 134 ketones 6–7, 9, 20, 22 Kullback–Leibler distance 226, 232 last universal common ancestor (LUCA) 249–53 libraries, genetic 79, 102, 105, 122, 187, 193, 195, 200, 203, 239, 277, 281 life characteristics of 26 definition of 57–8 emergence and origin of 3–10, 18–19, 21, 24–5, 37–8, 40, 43, 45, 47, 58–64, 66, 68–9, 71–4, 251, 256 (see also origin of life) temperature limits for 8–9 lipid world 68–9 lipids 4, 18, 36, 52, 66, 68–9, 173 LiveBench 191 LUCA. See last universal common ancestor MAMMOTH program 192 Markov models 119, 122, 126–7, 128, 129, 131–2, 135. See also gene finding Maxsub score 191 MED-Start program 129 MEMSAT program 208 metabolism 4, 6, 9, 37, 41, 59, 60, 68, 70–2, 167, 172, 177, 181, 249–50, 263 dependence on enzymes 72
independence from genes 72 metapredictor 191 meteorite 9, 43, 45, 47, 52, 55, 63–4 Murchison 21, 36, 43–5, 47, 64 methane 7–9, 11–14, 16, 18–25, 30, 41, 43, 50 Methanococcus jannaschii 156, 170, 258 methanol 13, 17 micelles 36, 77 microarrays 175, 236 microbes and microbiology 6, 13, 61–3, 71, 154, 248. See also under specific organisms minimal flow algorithm 99 MODBASE database 190 MODELLER algorithm 196, 196, 198–9, 203, 204 molecular biology 64, 272 clocks 62, 252, 257–60 distance 256–7, 257–61 Monte Carlo method 191 motif discovery algorithm 236, 239 motifs 119, 129–30, 141, 147, 149, 172, 192–3, 236, 238–9, 242 mRNA (messenger RNA) 219, 282, 284–5 MULTIPROSPECTOR algorithm 207–8 MUMmer alignment tool 136 Murchison meteorite 21, 36, 43–5, 47, 64 Mus musculus 134–7, 277, 279, 282 mutations 157–9, 169, 175, 194, 255–7, 259, 273, 277, 280. See also molecular distance Mycoplasma genitalium 166, 207 Myxococcus xanthus 272 natural selection 59, 274, 279–80. See also evolution Needleman and Wunsch algorithm 155, 158 Neisseria meningitidis 266 networks Bayesian 141 metabolic 68 neural 119, 126 protein 275 reaction 4 RNA regulatory 57, 239, 269–86 signaling 285
neural networks 119, 216 New Fold methods 188, 189 nitrogen assimilation by plants 6, 48 noncoding RNA (ncRNA) 271–83 NP-complete 195, 253 NP-hard 81, 91 nucleic acids 26, 27–31, 28–31, 33, 59, 63, 67 nucleosides 36, 38–40, 42, 65 nucleotide(s) 93, 111, 115–16, 120, 124–5, 127, 129–31, 135–6, 139, 142, 148, 160, 176, 183, 244–6, 257, 263–4, 278, 287 catabolism 173 composition 122, 249, 262, 263, 265 database 122 formation 40 sequence 130 substitution 280 objective function 224, 226, 238 oceans 6, 8–9, 12–13, 15–18, 20, 22–3, 25–6, 28, 39–40, 43, 46, 48, 59, 61, 63, 65, 73 ontogeny 269, 271, 276, 281, 285 Oparin, A. 6–8, 19, 63 open reading frames (ORFs) 119, 124, 126–9, 132, 144, 171, 187. See also codons, stop ORFans 167, 171 ORFs. See open reading frames organic compounds assembly of 10–11 extraterrestrial synthesis of 4, 11, 16, 43–6, 44–5, 63 organic synthesis 5, 11–12, 14, 17, 43, 51 origin of life 3–10, 18–19, 21, 24–5, 37–8, 40, 43, 45, 47, 58–64, 66, 68–9, 71–4, 251, 256 extraterrestrial compound synthesis and 43–6, 44–5 genetics-first theory 59, 60 heterotrophic 62–8, 73 hydrothermal vents and 36, 40–3, 47 metabolics-first theory 60, 69–71, 73 missing historical records of 60–2 thermophilic 42, 61, 75 See also evolution; life Origin of Species 61–2, 252 ORPHEUS program 130, 143 orthologs 123, 134–6, 170, 195, 236, 249–50, 253, 259–60, 260–1, 262, 265, 280
Oryza sativa 170 overlap 85 containment 91 determination of 81, 86 dovetail 88, 91–4, 98 false negative 79, 81, 86, 91–4, 130, 194, 235–6 false positive 79–81, 86, 91–4, 119, 123, 130–1, 135, 140–1, 145, 194, 208, 235–6, 239, 283 k-mer and 82–6, 96, 98–102, 113 path of 85 repeat resolution 94 sensitivity 79, 82, 84 specificity 79, 82, 84, 112 triangle condition and 90, 90, 91–5, 103, 112 OWEN alignment tool 136 panspermia 3, 6, 63–4, 69 parasites 166–7, 263 Pasteur, Louis 6 pathogenicity islands 12, 263 pathogens 122, 186, 248, 255, 263, 270 PCONS metapredictor 191 PDB200 benchmark 195–6, 199, 202, 204, 205, 207–8 Pearson correlation coefficient 195 PEDANT database 190, 207 peptide 68, 166, 170, 180, 208 bond 42, 72 nucleic acid (PNA) 39, 53, 67 signals 208 synthesis 41, 54, 66, 72, 183 phosphates 33, 36, 38–40, 43, 62, 67–8, 70, 86–7, 173, 174, 182 photolysis 8, 17 Photorhabdus luminescens 169 photosynthesis 13, 171, 250 photosynthetic microbes 6, 13, 61–3, 71 phylogenetic analysis 62, 118, 123, 249 footprinting 238 methods 136–7, 143, 249–50, 252–9 trees 61, 71, 136, 194–5, 216, 253 phylogenies 62, 195, 252–3, 266 PINTS program 193 planets. See solar system plasmids 107, 262, 264, 266 polyadenylation signals 132, 134
polymerization reactions 28–9, 40, 50, 65, 67, 74
polypeptides 64, 66–7, 170, 180, 211. See also peptide
prebiotic (primitive) soup 7, 41–3, 46–7, 57, 59–60, 63–4, 66–7, 72–5, 166
prediction algorithm 123, 127, 193
pre-RNA world 36, 58, 63, 67–8, 73–4. See also RNA world
Procrustes algorithm 138–9
Prochlorococcus marinus 143
programming
  dynamic 83–5, 109–10, 119, 136, 145, 152, 155, 158–60, 162, 190–2, 242
  genomic 269
programs. See computer programs
PromoterInspector program 140
promoters 132, 134, 140–1, 146, 219, 222, 236, 238
PROSPECT II algorithm 191
PROSPECTOR algorithm 191, 199–200, 201, 202–3, 204, 205, 207–8
PROSPECTOR_3 algorithm 199, 200–2, 204, 205, 207–8
protein
  affinity with DNA 220–2
  CASP5 structure prediction experiment 188, 190–1, 199
  comparative modeling (CM) of 188–9, 196, 208
  database 122, 129, 138–9, 143, 175–6, 207–8
  dimer database 208
  distance 257–61
  distance matrix rate (DMR) test 257–9
  folding 187, 192, 195, 198
  function 172, 193–4
  “known unknown” 167, 174
  membrane 160, 171, 176, 205–6
  networks 275
  orthologs 123, 134–6, 170, 195, 236, 249–50, 253, 259–60, 262–5, 280
  products 118, 124, 139, 219
  –protein interactions 172, 174–5, 180, 187–8, 194–5, 207, 209
  signaling 171–2, 285
  structure 121, 160, 187–8, 190, 193, 195, 205, 208–9
  structure prediction 140, 187–8, 190, 193, 195–6, 208–9
  synthesis 26, 58–9, 64, 249, 251, 270
  threading to determine structure 188–92, 195–6, 199–203, 205, 207–8
  uncharacterized 177
  “unknown unknown” 167, 177
  zinc-finger 240–1
proteobacteria 176, 250, 264
proteome 187, 207–8, 270
proteomics 57, 194
pseudogenes 131, 277
Pseudomonas aeruginosa 174
PSIPRED algorithm 199, 202, 211
purines 14, 26–9, 29, 36, 40, 50, 53, 70, 77
pyrimidines, synthesis of 15, 29–31, 30–1, 36, 38–42, 64, 70, 73, 173
Pyrococcus
  furiosus 174
  horikoshii 174
RBS (ribosomal binding site) 119
RBSfinder program 129
reads 79, 80, 89, 95–6, 104, 113
reductionist approach 57, 74
reductive citric acid cycle 71–2, 181
regulatory motifs 129, 238–9. See also motifs
regulatory networks 57, 239, 274–6, 278–9, 281–3
RepeatMasker program 122, 135
repressors 176, 219, 222
RESCUE method 141
RESCUE-ESE program 141
reverse Krebs cycle 71–2, 181
Rhodobacter sphaeroides 171
Rhodocyclus ix
Rhodopirellula baltica 170
ribose 26, 31–3, 35–6, 38, 42, 52, 62, 67, 76, 183
ribosomal binding site (RBS) 119
ribosomes 58–9, 173
ribozymes 32, 57, 59
RISC (RNA-induced silencing complex) 284
ROBETTA metapredictor 191
RNA
  cis-regulatory sequences 271–2
  digital–analog conversion systems 284
  -induced silencing complex 284
  interference (RNAi) 275, 279, 285
  messenger (mRNA) 219, 282, 284–5
  micro- (miRNA) 276–8, 281
  noncoding (ncRNA) 271–83
  polymerase 219
  regulatory function 273, 275, 277, 282–4
  regulatory networks 274–6, 275
  ribosomal (rRNA) 166, 281
  signaling 274–8, 281–5
  small interfering RNA (siRNA) 275, 275–276, 282
  small nucleolar (snoRNA) 275, 276, 281–4
  structure of 26, 27, 283
  trans-acting 273, 277–8
RNA world 32, 36, 38, 40, 51, 57–9, 64, 67. See also lipid world; pre-RNA world
ROSETTA algorithm 135–6, 172, 201, 209
rRNA (ribosomal RNA) 166, 281
Saccharomyces cerevisiae 166, 207–8, 258, 259, 260–2
SAL algorithm 192, 195–6, 198–200
Salmonella typhimurium 169, 264–5
SBH 86, 101
SCOP database 192, 207
SCS (shortest common string) problem 81–2, 90–2, 102, 104, 112, 238
self-organizing maps 146
self-organizing systems 40, 60, 68–70, 72–3
seqlets 129
sequence
  alignment (see alignment)
  complete genomic 118, 166
  consensus 109–11, 136, 153, 219, 223, 228–9, 237
  fragment 81, 84, 88, 100–1, 105, 107
  insertion 262, 264
  loop region 159, 190–1, 196, 196, 198–9, 203, 205, 209, 277, 283
  noncoding 129, 133, 135, 137, 280–1
  primary 149, 254, 280, 282–4
  protein-coding 126, 128, 133, 194, 203, 220–2, 224, 240, 242, 260–1, 270–1, 273–4, 279
  recombination of 177, 252, 254–6, 264–6
  similarity 110, 121, 123, 127, 135–6, 139, 143, 145, 156–62, 168, 171–2
  -specific control of transcription 277–8
  See also sequencing and under specific organisms
sequencing
  computational methods for 154, 248 (see also gene finding; shotgun fragment assembly)
  error 81, 83–4, 93–4, 93, 98, 108, 110
  gap 102, 106–7
  homologs 135, 166–9, 171–2, 195, 255, 261
  by hybridization (SBH) 86, 101
  long identical repeats and 101, 104
  polymorphic targets 108–9
  primers 88, 101
  protein 137, 143, 154, 156, 159–60, 203, 260, 261
  reverse complement 88, 129
  technology 83, 88, 101, 107, 113, 118, 140
  training sets for 129, 140, 143
  vector 88, 101, 107, 229
SGP2 program 135, 137
shortest common string (SCS) problem 81–2, 90–2, 102, 104, 112, 238
shotgun fragment assembly
  algorithms for 79, 82, 86, 91, 100, 102, 109, 111–12
  assemblers for 81–2, 84, 86, 100–1, 104, 112–13
  chimeric fragments in 105
  consensus phase of 82, 109–11, 136, 153
  error correction 84, 93, 107, 112
  Eulerian path for 100, 101, 114
  Hamiltonian path for 100, 101
  layout phase of 82, 86–102, 104–5, 108–9, 112
  mate pairs and 96, 102–6, 112
  overlap phase in 82, 85, 93–4, 96, 98, 105, 107
  Phrap assembler and 93, 112
  spur fragments 106–7
signal transduction 172–3, 175, 180
signaling
  cell 57, 171–4, 177, 185
  digital 272, 274, 278, 281–2, 284
  RNA 274–8, 281–5
sim4 program 122
Sinorhizobium meliloti 171
siRNA (small interfering RNA) 275, 275–6, 282
SLAM program 134, 137
Smith–Waterman algorithm 154–6, 159, 162
SNAP 134
snoRNA (small nucleolar RNA) 275, 276, 281–4
solar system 3–4, 7–11, 16, 19, 40, 43, 46–7, 55, 59, 61
spark discharge experiments 20, 20, 24–5, 29–30, 36, 44. See also amino acids, synthesis of
specificity
  of DNA–protein interactions 223–7
  gene prediction and 123–4
  models of 228
  of transcription factors 224, 237, 243
  See also gene finding
Spectral Rotation Measure 146
splicing 134, 141, 145–6, 191, 270, 273, 275, 276–9, 285
  enhancers 132, 140–1, 150, 277–8
  of exons 132, 140–1, 150
  of introns 134
spontaneous generation 6, 71, 73
Strecker, Adolph 5, 63
Strecker synthesis 20, 22–3, 30, 36, 41
Streptococcus
  pneumoniae 174
  viridans 255
structural motifs 193. See also motifs
substrings 79–80, 89, 95–6, 104, 113
sugar, synthesis of 5–6, 33, 36, 38
supernovas 8, 10
SWISS-PROT database 208
Synechococcus 258
systems biology 57, 167
TASSER algorithm 196, 198–201, 201, 202–3, 204, 205, 206, 208–9
TBLASTX program 122, 135
Teiresias algorithm 129
thermophilic origin of life 42, 61, 75
threading 188–92, 195–6, 199–203, 205, 207–8
  metapredictor approaches to 191–2
  PDB200 195–6, 199–200, 202, 204, 205, 207–8
  PROSPECTOR_3 algorithm 191, 199–208
  TASSER algorithm 196, 198–203, 201, 204, 205–6, 208–9
  3-D SHOTGUN metapredictor 191
  TOUCHSTONE II algorithm 201
training
  algorithm 143, 146
  sets 129–40, 142–3
transcription
  activators 219, 222
  binding affinities and 223, 224, 227–8, 241–3
  control of 177, 242–3
  factors 219, 221–4, 223, 226–7, 231, 233, 235, 236–40, 241, 243, 270, 276–7
  noise 270, 274, 283
  non-protein-coding 270–1, 273–4, 277, 285
  output 269, 273
  starts 140–1
transcripts
  non-protein-coding 270, 273–4
  protein-coding 270, 273–4, 277, 285
translation 123, 129, 140, 145, 170, 176, 202, 270, 273, 275, 276–278, 285
transposons 178, 278
Treponema pallidum 170
tRNA-guanine transglycosylase 168
TWINSCAN program 135, 137, 145
two-hybrid screen for yeast 194
ubiquinone 169, 170
unitigs
  combined 95–6, 97, 98–105, 106, 107–8
  constructing 96–7
  repeat 95–6, 99–101, 103–4
  scaffolds for 97, 102
  structure of 96–7
  See also contigs; shotgun fragment assembly
urea 5, 19–20, 22–3, 26, 30, 36, 39–40, 63, 73
Urey, Harold Clayton 7–8, 17, 19
UTRs (mRNA untranslated regions) 134, 282
UV light 5, 13, 15–16, 30, 174
van der Waals forces 18, 65
vectors 107–8
  cloning 80, 103, 107
  sequencing 88, 101, 107, 229
vesicles 36, 68–9, 71
VIOLIN database 143
viruses 111, 122, 130, 143, 166, 255
VISTA tool 136
Viterbi algorithm 126–7, 132
volcanic activity 8, 11–12, 12, 14–16, 39, 71
water
  chemistry of 9, 15, 18, 21, 29, 39, 42, 221
  Earth’s source of 11–12, 61
  photodissociation of 12, 14
weight matrix model for DNA–protein interaction 229, 230, 232–4, 235, 236, 241
Wöhler, Friedrich 5, 43, 63
XML gene model 145
Y2H (yeast two-hybrid screening) 194
YACOP program 145
yeast
  genes and genome 166–7, 169, 178, 260
  proteins 258, 259–60, 261
  proteomes 207–8
  two-hybrid screening (Y2H) 194
Yersinia pestis 169
Yersinia pseudotuberculosis 169
ZCURVE algorithm 129–30, 145
Z-score 199, 207–8