The Origins of Evolutionary Innovations
The Origins of Evolutionary Innovations A Theory of Transformative Change in Living Systems
Andreas Wagner Institute of Evolutionary Biology and Environmental Studies University of Zurich Switzerland
Great Clarendon Street, Oxford ox2 6dp Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide in Oxford New York Auckland Cape Town Dar es Salaam Hong Kong Karachi Kuala Lumpur Madrid Melbourne Mexico City Nairobi New Delhi Shanghai Taipei Toronto With offices in Argentina Austria Brazil Chile Czech Republic France Greece Guatemala Hungary Italy Japan Poland Portugal Singapore South Korea Switzerland Thailand Turkey Ukraine Vietnam Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries Published in the United States by Oxford University Press Inc., New York © Andreas Wagner 2011 The moral rights of the author have been asserted Database right Oxford University Press (maker) First published 2011 All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above You must not circulate this book in any other binding or cover and you must impose the same condition on any acquirer British Library Cataloguing in Publication Data Data available Library of Congress Cataloging in Publication Data Data available Typeset by SPI Publisher Services, Pondicherry, India Printed in Great Britain on acid-free paper by CPI Antony Rowe, Chippenham, Wiltshire ISBN 978-0-19-969259-0 (Hbk.) 978-0-19-969260-6 (Pbk.) 1 3 5 7 9 10 8 6 4 2
If you want to have a good invention, have a lot of them.
Attributed to T.A. Edison
Acknowledgments
Research is a social endeavor. The research leading to this book is no exception. The book’s bibliography comprises almost 900 items. This large number reflects the size of my debt to a community that has accumulated the knowledge on which I build. And still, the bibliography is not complete. Any attempt at being exhaustive would have led to a tome many times the current size. Please accept my apologies if your work is not cited here. I did not omit it for willful negligence, but to keep the exposition focused, for the benefit of the non-expert reader. A significant portion of the book relies on research by PhD students and postdocs in my own laboratory over more than ten years. Their work is cited throughout. No one person could assemble a body of work this size in such a limited time. I am in great debt to my co-workers, not least because of the trust they placed in an unorthodox research program
based on innovation. Special thanks also go to my collaborator Olivier Martin. His expertise has been instrumental in analyzing the structure of large genotype spaces. Allan Drummond, Angela Hay, Miltos Tsiantis, Danny Tawfik, and Nobuhiko Tokuriki have provided illustrations. Several trusted colleagues reviewed individual chapters of this book. They include Homayoun Bagheri, Peter and Rosemary Grant, Lukas Keller, Marcelo Sánchez, and Daniel Segrè. Thanks to all of them, as well as to Ian Sherman and Helen Eaton for their editorial work. Finally, Johannes Jaeger and Alessandro Minelli, as well as an anonymous reviewer, went far beyond the call of duty and read and critiqued the entire volume. I followed most of their advice, which helped improve the book considerably. Where I decided otherwise, it may have been for the worse and only I am to blame.
Contents
Acknowledgments
1 Introduction
2 Metabolic innovation
3 Innovation through regulation
4 Novel molecules
5 The origins of evolutionary innovation
6 Genotype networks, self-organization, and natural selection
7 A synthesis of neutralism and selectionism
8 The role of robustness for innovation
9 Gene duplications and innovation
10 The role of recombination
11 Environmental change in adaptation and innovation
12 Evolutionary constraints and genotype spaces
13 Phenotypic plasticity and innovation
14 Towards continuous genotype spaces
15 Evolvable technology and innovation
16 Summary and outlook
References
Index
CHAPTER 1
Introduction
The history of life is a history of innovations. We are all familiar with countless examples, but are there principles behind them? Is there a property that facilitates innovations, regardless of their physical manifestation? I here argue that the answer is yes, and I characterize this property—I will call it innovability.
Innovations everywhere
Every macroscopic organism has visible traits that were dramatic, transformative innovations when they first became fully formed. They changed not only organismal lifestyles, but also the future evolutionary path of life. Examples include plants with flowers, animals with a hard skeleton, birds and insects with wings, organisms living in groups, and, most fundamentally, multicellularity itself. Others include teeth to digest hard foodstuffs, vascular systems of plants and animals, syringes to deliver venoms, the endosperm storage tissues of seeds, and the silk production of arthropods [807]. Underneath this surface of macroscopically visible innovations is a universe of microscopic and submicroscopic innovations. Ultimately, they are the basis of all macroscopic innovations. An example is oxygen-producing photosynthesis. It originated with light-harvesting molecules that can split water to produce oxygen, and with mechanisms to incorporate carbon dioxide into biomass. By allowing oxygen to accumulate in the atmosphere, it changed not only the entire geochemistry of the planet, but also the future trajectory of life [410]. It permitted the macroscopic innovations of higher plant life, and ultimately supports most of the 1000 billion tons of biomass that exist today on earth [229]. Other similarly profound innovations involve the ability of organisms to thrive on unusual (for
us) food sources, such as minerals, natural gas, or crude oil; the ability to synthesize keratins, a critical component of the outer covering of many animals, such as the scales of reptiles, the feathers of birds, and the hairs of mammals; the ability to incorporate gaseous nitrogen—an otherwise growth-limiting element for many plants—directly into biomass; the origin of myelin, an electrical insulator that allows mammalian neurons to conduct electrical signals efficiently, and that may have promoted the evolution of complex brains [264, 620, 667]. It may be difficult to define rigorously what an evolutionary innovation is [538, 616]. However, these and countless other examples show that it is usually easy to recognize: a new feature that endows its bearer with qualitatively new, often game-changing abilities. These may not only mean the difference between life and death in a given environment (just think of biosynthetic abilities), they may also create broad platforms for future innovations, as did the innovations of photosynthesis and of complex nervous systems.
Towards a theory of innovation
During Charles Darwin’s era, molecular innovations were inaccessible to science. In his theory of evolution by natural selection, Darwin thus focused on complex macroscopic innovations, such as our eyes, “organs of extreme perfection and complication,” which he acknowledged as potential difficulties for his theory [Ch. 6 of ref. 162]. At the same time Darwin emphasized his conviction that such complex innovations could evolve from simpler antecedents through gradual variation that is preserved by natural selection. Since then, eyes have become a textbook example of evolutionary innovation. We now know that they have evolved multiple times independently [213].
While Darwin’s theory rightly emphasized the role of natural selection in preserving useful variation, it left untouched the question of how new and useful variation originated. As the geneticist Hugo de Vries put it in 1904 [170], “Natural selection may explain the survival of the fittest, but it cannot explain the arrival of the fittest.” This question about the origins of new things is still fundamentally unanswered. What is it about life that allows innovation through random changes in its parts? This ability becomes especially striking when we contrast it with the properties of most man-made, engineered systems. Would random changes in a typical complex technological system, say, a computer or an airplane, be a sensible recipe to improve the system? Hardly. There is something special about the architecture of life that makes it amenable to improvement through random change. This something is the subject of this book. I here provide evidence that it is more than a combination of natural selection and random change. Both are necessary but not sufficient for innovation. A deep understanding of innovation would have been inaccessible in Darwin’s time. First, he and his contemporaries knew little about the nature of inheritance: they were ignorant of the nature of the inherited material, how it is transmitted between generations, and how it changes over time. Second, they also knew nothing about the molecular events that are key to innovation in general (and to eyes in particular, such as the evolution of photoreceptors and of lens proteins). These events are changes in the interactions among biological molecules, as well as in the molecules themselves. 150 years later, we are in a completely different position. We understand the nature of genotypes, the genetic material (DNA or RNA) of organisms. In addition, we have amassed much information concerning the structure and function of biological molecules, and how they change over time.
We are also beginning to understand the interactions between these molecules, and the large molecular networks that they form. These molecules and networks together ultimately determine all observable characteristics of organisms, their phenotype. Many evolutionary innovations have been studied individually in great detail. They provide fascinating case studies of natural history. However, no
number of case studies can add up to the deeper and general insights that would answer how organisms can innovate. Case studies cannot provide the general perspective needed to answer this question. By themselves, they are a heap of observations without a principle that unifies them. This principle could only come from an overarching explanatory framework for evolutionary innovations. One might call such a framework a theory of evolutionary innovation, an “innovability theory.” I argue here that if such a theory exists, the concepts I discuss will be its necessary (and perhaps sufficient) building blocks.
What must a theory of innovation accomplish?
At first sight, the very search for such a theory may seem utterly quixotic, even absurd. Does not the very nature of innovation defy prediction, and is not one main purpose of a theory to predict? An analogy with Darwin’s theory is instructive. Darwin explained many apparently unrelated phenomena with the key organizing principle of his theory, natural selection. Although population genetics and quantitative genetics have seen limited success in predicting evolutionary trajectories, such prediction is elusive for many real-life evolutionary processes. Even small, biochemically well-understood molecules with completely known genotypes, evolved in the laboratory under minutely controlled chemical conditions, can take surprising evolutionary turns [317, 379, 380]. Yet such unpredictability does not cast Darwin’s theory into doubt. The theory may fail to predict any one phenomenon, but it succeeds in organizing a myriad disparate phenomena. It has value as a unifying framework. Similarly, a theory of innovation may have little to say about any one specific innovation. Instead, it may provide an explanatory framework for innovations in general. It would fit the definition of a theory as a “small body of general principles that work together to explain a large number of empirical observations, often by describing an underlying mechanism common to all of them” [Ch. 5 of ref. 657]. It can be powerful in its generality, without trivializing the individual innovation and its marvelous uniqueness. Here is a minimal list of what a theory of innovation should accomplish.
1. (The paramount problem.) It should explain how biological systems can preserve existing, well adapted phenotypes while exploring myriad new phenotypes. This is perhaps the most fundamental challenge of biological innovation, because destroying the old before finding something new and better may spell death. In addition, finding an innovation may require exploration of many different inferior phenotypes, before a new and superior phenotype is uncovered. The inventor Thomas Alva Edison’s adage “if you want to have a good invention, have a lot of them” holds for organisms, perhaps even more so than for human inventors.
2. It should unify innovations that involve different levels of biological organization. Some innovations are caused by new molecules with new structures and functions; others are caused by regulatory changes, for example, in the expression of molecules; yet others occur through combining existing molecules into new pathways. A theory of innovation should be general enough to accommodate different kinds of innovation. It needs to be dissociated from any particular substrate of innovation, but it must apply to each such substrate.
3. It should be able to capture the combinatorial nature of innovation. Biological systems have parts that are elementary units of system function. They include the amino acids that compose proteins, the enzymes that compose metabolic pathways and networks, and many others. Innovation usually involves new combinations of these “modules” and of higher order units of organization. Because combinatorial change is at the heart of many innovations, it must also be central to a theory of innovation.
4. It should be able to capture that the same problem can be solved by different innovations. Innovations can be viewed as solutions to a problem an organism faces. In the history of life, these problems have often been solved multiple times, and in very different ways. Examples include the evolution of image-forming eyes, tetrapod wings, aerobic respiration, and carbon fixation [807]. For instance, the last problem is that of incorporating inert atmospheric CO2 into biomass. It has been solved through the Calvin–Benson cycle, the reductive citric acid cycle, and the hydroxypropionate cycle, in quite different ways [661].
5. It should enable us to study how environmental change influences innovability. The environment determines whether any one novel phenotype is an innovation. Some aspects of an organism’s environment may be constant; others may change rapidly or slowly, predictably or unpredictably. We do not know whether these differences can affect the rate of innovation. Because environments can change in many different ways, universal answers may not exist. However, a theory of innovation must at least provide a framework to study this environmental influence.
6. It should be applicable to technological systems. A theory of innovation dissociated from any concrete material substrate should also apply to non-biological systems. In doing so, it might help develop technologies that can use evolutionary principles to accelerate innovation.
What kind of information does a theory of innovation need?
I stated earlier that a deep understanding of innovation was inaccessible in Darwin’s time, because essential information was lacking. What is this essential information? In my view, it has at least four elements. The first element is a systematic and comprehensive understanding of genotypes. Ultimately, evolutionary innovations are caused by genotypic change, change in DNA or RNA molecules. (An apparently contradictory view holds that innovations begin with phenotypic change. I discuss this view in Chapter 11, where I argue that the contradiction is more apparent than real.) Our ability to understand genotypes is becoming nearly limitless with technologies to sequence entire genomes in single experiments. Genotypes can be organized into vast genotype spaces. Albeit astronomically large, these spaces have countably many member genotypes, which allows their systematic analysis. The second element is a systematic and comprehensive understanding of realistically complex phenotypes. The phenotypes of biological systems range from molecular phenotypes, such as protein structures,
all the way up to macroscopic phenotypes, such as the body plans of organisms. Each level of organization, from molecules to whole organisms, can have astronomically many different phenotypes. I write “realistically complex” with this observation in mind. Innovations create new complex phenotypes from existing complex phenotypes. To define a comprehensive “phenotype space” that is also amenable to systematic analysis is an enormous challenge. It is obvious how to meet this challenge for some molecular phenotypes, such as protein structures, but unclear for others. The more comprehensive our understanding of phenotype is on any level of organization, the better our chances for a framework that can apply to all innovations on this level. The third element is the ability to link phenotype to genotype. Innovations originate with a genotypic change whose effects translate into a phenotypic change. Thus, if we do not understand how exactly genotypic change maps into phenotypic change, we cannot hope to develop a comprehensive explanatory framework for innovation. The link between them can be provided through experiments, through comparative data, or through computational and mathematical modeling. The fourth element is an understanding of population-level processes. Any evolutionary process involves populations of reproducing objects with heritable differences that affect their reproductive success [452]. Population-level processes can limit or enhance the efficacy of natural selection [402, 476] and they affect the exploration of novel phenotypes, depending on factors such as population sizes and mutation rates. They thus also affect the emergence of innovations. Any theory of innovation needs to take them into consideration. Fortunately, these processes are well-understood, largely to the credit of the “modern synthesis” of evolutionary biology that brought forth population genetics in the early twentieth century [503]. 
Which areas of biological knowledge already fulfill these four requirements? The answer to this question might help identify the nuclei of a theory of innovation. I will next discuss what many would consider the best candidate areas. Unfortunately, neither of them currently has all of the above elements.
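The claim that genotype spaces are astronomically large yet countable can be made concrete with a little arithmetic. The following is a minimal sketch, assuming genotypes are RNA strings over the four-letter alphabet ACGU; the function names are hypothetical and chosen only for illustration. A sequence of length L then has 4^L possible genotypes and 3L one-mutant neighbors.

```python
from itertools import product

ALPHABET = "ACGU"  # assumption: RNA genotypes over four nucleotides


def genotype_space_size(length, alphabet=ALPHABET):
    """Number of distinct genotypes of a given length."""
    return len(alphabet) ** length


def enumerate_space(length, alphabet=ALPHABET):
    """Lazily list every genotype of a given length (feasible only for small lengths)."""
    return ("".join(letters) for letters in product(alphabet, repeat=length))


def one_mutant_neighbors(genotype, alphabet=ALPHABET):
    """All genotypes that differ from `genotype` at exactly one position."""
    neighbors = []
    for i, original in enumerate(genotype):
        for letter in alphabet:
            if letter != original:
                neighbors.append(genotype[:i] + letter + genotype[i + 1:])
    return neighbors


# A toy space can be enumerated; a realistic one can only be counted.
print(genotype_space_size(3))
print(len(one_mutant_neighbors("ACG")))
print(genotype_space_size(100))
```

For a length of 100 nucleotides the count is 4^100, roughly 1.6 × 10^60, which is why systematic analysis has to rely on the structure of the space rather than on exhaustive enumeration.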
Population genetics and evolutionary developmental biology
Population genetics and quantitative genetics constitute a body of quantitative evolutionary theory that emerged decades after Darwin with the modern synthesis [503]. Through its ability to handle a potentially infinite number of genotypes [310, 402] this body of theory is well-equipped to encapsulate the richness of possible genotypes of DNA and RNA sequences (element 1 above). Population processes (element 4) lie of course at its very heart, and pose no problem for it. However, models in population genetics and quantitative genetics usually make simple assumptions about the relationship between genotype and phenotype. By design, they thus make phenotype easy to predict from genotype (element 3). Unfortunately, this raises a serious problem with the phenotypic complexity (element 2) represented in these models. Many population genetic models, for example, represent phenotype only through a (scalar) fitness. Even quantitative genetic models that consider multivariate quantitative traits represent complex phenotypes only through correlations among each dimension of such a trait. They can thus only capture statistical dependencies among the constituents of complex phenotypes. These representations are not well-suited for realistically complex phenotypes. There is a world of difference, for example, between this statistical representation of a phenotype, and the myriad numbers of possible protein structures. The latter are best represented as atomic coordinates for amino acids, and not as correlations among quantitative characters. For this reason, population genetics and quantitative genetics are missing a critical element. While clearly necessary for a theory of innovation, they are not sufficient, at least in their current state. In contrast to population genetics, evolutionary developmental biology tackles complex phenotypes head-on.
Its phenotypes are the most complex phenotypes of all, the macroscopic phenotypes— tissues, organs, body plans—of multicellular organisms. Evolutionary developmental biology has elucidated many beautiful and fascinating examples of phenotypic change and its roots in genotypic change. We will encounter some of them below. Nonetheless, the very complexity of organismal phenotypes presents two problems. The first of
them, perhaps less serious, regards the systematic account of phenotypes a theory of innovation would require (element 2 above). It may be possible to overcome this problem, for example, through concepts such as the morphospace of paleontologists [507, 634]. The second problem, however, poses a more formidable obstacle. Given how complex the phenotypes are that developmental biology is studying, and how many genes contribute to them, we currently are unable to determine phenotype from genotype for them. Thus, element 3 above is missing. Hundreds of genes may influence even the simplest organismal phenotypes, such as the shape of a bacterial cell wall or the structure of a human hair. The fact that organismal phenotypes are not static but unfold over time, and that they are intricately organized in space, would further complicate the task of linking genotype to phenotype. In addition, the understanding of population-level processes (element 4) is less advanced in developmental biology. In sum, these considerations show that the phenotypes of population genetics are still too simple, and those of developmental biology still too complex to become part of a theory of innovation, given our current knowledge. They also show that key in trying to develop a theory is to find a middle ground: Phenotypes that are sufficiently rich to capture the astronomical diversity of actual phenotypes, yet manageable enough to understand how genotypes translate into phenotypes, while at the same time being crucial for many different kinds of innovation. This book revolves around such phenotypes.
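To illustrate what it means for a population genetic model to represent phenotype only through a scalar fitness, here is a minimal sketch of the standard textbook recursion for selection between two haploid types in an infinitely large population. It is not a model taken from this book; the function names and the fitness values are hypothetical. Each genotype is reduced to one number, its fitness, and nothing else about its phenotype enters the model.

```python
def next_frequency(p, w_a, w_b):
    """Frequency of type A after one generation of selection.

    p is the current frequency of A; w_a and w_b are the scalar
    fitnesses of types A and B (assumed positive).
    """
    mean_fitness = p * w_a + (1.0 - p) * w_b
    return p * w_a / mean_fitness


def trajectory(p0, w_a, w_b, generations):
    """Frequencies of type A over time, starting from p0."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_frequency(freqs[-1], w_a, w_b))
    return freqs


# Hypothetical values: A starts rare and has a 5 percent fitness advantage.
for generation, freq in enumerate(trajectory(0.01, 1.05, 1.0, 200)):
    if generation % 50 == 0:
        print(generation, round(freq, 4))
```

Such a model predicts how fast an advantageous type spreads, but it cannot say what makes the type advantageous. That gap is the missing phenotypic complexity discussed above.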
Three classes of tractable phenotypes are involved in most innovations
I will next discuss three broad and very different classes of biological systems in which innovations occur. Here and elsewhere, I view a system as a set of elements or parts that cooperate to perform a task. An example is a protein whose parts (amino acids) cooperate to catalyze a chemical reaction. The phenotype of a protein is the three-dimensional folded structure it assumes, and the biological function it performs. Together, all proteins constitute a system class. The three classes of systems central to innovation are large metabolic networks, regulatory circuits, and macromolecules (proteins and RNA). They correspond to three broad classes of innovations that have played a key role in the history of life: innovations involving new metabolic pathways, involving new patterns of gene activity in regulatory circuits, and involving new molecules. Most innovations in macroscopic traits can ultimately be traced to molecular innovations in these three system classes, or to combinations thereof. Any fundamental shared principles they reveal may thus apply to the complex, macroscopic phenotypes of developmental biologists, once we are able to study these phenotypes with the same level of rigor. These three classes of systems are also special for a different reason: We have either massive amounts of empirical data linking genotypes and phenotypes, or we can predict phenotype from genotype. In other words, they fulfill the above requirement 3 for a theory of innovation. Admittedly, predictions of these phenotypes are far from perfect. They involve mathematical modeling based on limited empirical data, and a heavy dose of computation. This holds especially true for the kinds of analyses I will discuss here, analyses needed to study new phenotypes systematically. They require us to map thousands to millions of genotypes to their phenotypes. Experimental genotyping on this scale is routine, but analysis of this many phenotypes is still difficult. Until such phenotyping becomes possible, computational approaches remain essential. They may not be sufficiently good to predict any one phenotype with very high accuracy, but they are sufficient to tackle the broad questions a theory of innovation needs to answer. They give us a place to start.
Innovation through metabolism
Innovation often arises through combining enzymes (or, more specifically, enzyme-coding genes) into new metabolic pathways. Such pathways can make new energy sources available to the organism, or they can synthesize new compounds useful for self-defense, protection, and communication. A new metabolic pathway can mean the difference between life and death, either by allowing its carrier to subsist on new food sources, defend itself against an enemy, or survive in a hostile environment.
Microbial metabolism provides dramatic examples of how novel combinations of enzymes, and the reactions they catalyze, can lead to innovations. Microbes can use a bewildering variety of substances as food, including many man-made compounds not known to occur in nature [620]. For instance, microbial isolates from pristine soils that have been minimally exposed to humans can use several antibiotics as sole carbon sources, including fully synthetic compounds such as ciprofloxacin [160]. Microbes also thrive on many xenobiotic substances of industrial importance, such as polychlorinated biphenyls, a highly toxic, now banned class of industrial compounds [638]; chlorobenzenes, organic solvents [796, 797]; or pentachlorophenol, a synthetic pesticide first produced in 1936 [126, 141]. Just take the last chemical. Not known to occur in nature, pentachlorophenol can nonetheless be digested by the bacterium Sphingomonas chlorophenolica. The necessary metabolic pathway involves four reactions that this organism assembled, using enzymes that process naturally occurring chlorinated chemicals, as well as an enzyme involved in tyrosine metabolism [141] (Figure 1.1). In microbes, horizontal gene transfer is an extremely effective and abundant way of creating such new combinations of reactions.
[Figure 1.1: pathway diagram. Labels: pentachlorophenol; pentachlorophenol hydroxylase; maleylacetoacetate isomerase (two steps); 2,6-dichlorohydroquinone dioxygenase; ring cleavage product.]
Figure 1.1 Degradation of pentachlorophenol as a metabolic innovation. Shown are four enzymatic steps in the degradation of pentachlorophenol. The enzymes written in light gray type have probably been recruited to pentachlorophenol degradation from pathways that are involved in the degradation of naturally occurring chlorophenols, such as 2,6-dichlorophenol, which are produced naturally by some fungi and insects. The last reaction leads to cleavage of the aromatic ring shown. The reactions marked with dark gray arrows are catalyzed by maleylacetoacetate isomerase, an enzyme involved in the degradation of phenylalanine and tyrosine in some organisms, including some bacteria, fungi, and humans [141].
A second example regards halophilic bacteria and algae, some of which can survive in saturating salt concentrations of 30 percent, or even in fluid inclusions of growing salt crystals. In contrast, drinking seawater with its paltry 3 percent of sodium chloride kills many other organisms [620]. Several complementary strategies allow halophilic bacteria to survive in such high salt concentrations [75, 179, 180, 620]. One of them involves the production of “compatible solutes,” such as ectoine or glycine betaine. These substances stabilize proteins to keep them functioning, and they neutralize the high external osmotic pressure caused by high salt concentrations. Ectoine and glycine betaine are produced by a short chain of reactions that starts from ubiquitous molecules, such as the amino acid aspartate.
Yet another metabolic innovation occurred in the origin of oxygen-producing photosynthesis. Although the evolution of photosynthesis was not a one-step process [661, 869], one associated key innovation was the evolution of the light-harvesting pigment chlorophyll. Chlorophylls are tetrapyrrole compounds like heme and vitamin B12, whose biosyntheses share many features.
Microbes provide the most dramatic examples of metabolic innovations, because they exchange genes so readily. However, such innovations also occur in higher organisms. A case in point regards the detoxification of ammonia, a waste product of animal metabolism. Water-living animals can excrete it directly into the water, but land-living organisms cannot do so. To avoid poisoning themselves, they convert it into a less toxic compound for excretion. Many do so through the production of urea, made possible by another metabolic innovation, the urea cycle (Figure 1.2). The urea cycle illustrates a key theme of metabolic innovations: The individual reactions are not necessarily new, but their combination is. The urea cycle arose when a set of four reactions involved in arginine biosynthesis combined
with arginase, a reaction involved in arginine degradation [753]. All the reactions involved are widespread in both prokaryotes and eukaryotes [753]. In sum, metabolism provides a treasure trove of innovations, new metabolic abilities that have enabled new lifestyles. They often arise through new combinations of enzymes that already exist in other organisms. They have played important roles in the earliest history of life, and continue to play such roles to this day. As we shall see, metabolic phenotypes are one kind of complex phenotype that can help understand innovation in a principled way.
[Figure 1.2: cycle diagram. Metabolite labels: NH3, CO2, 2 ATP, carbamyl phosphate, ornithine, citrulline, aspartate, argininosuccinate, fumarate, arginine, urea. Enzyme labels: carbamylphosphate synthetase, ornithine transcarbamylase, argininosuccinate synthetase, argininosuccinate lyase, arginase.]
Figure 1.2 The urea cycle as a metabolic innovation. The figure shows the urea cycle, whose enzymes are expressed in the mammalian liver. They serve to convert ammonia into urea, which can be excreted in liquid form. The four reactions marked with light gray arrows constitute an arginine biosynthesis pathway, and are expressed in various tissues other than the liver for arginine biosynthesis. The reaction marked in dark gray is the first reaction involved in arginine degradation. The five reactions occur in many different organisms from prokaryotes to human, and are thus not themselves mammalian innovations [753].
Innovation through regulation
For my purpose, I define regulation as a process that changes the abundance or activity of a gene product at a particular time and place (but does not change its encoding DNA, RNA, or amino acid sequence). Examples include changes in the rate of transcription of a gene into its RNA product, the rate of translation of a messenger RNA into protein, or the modification of a protein that changes the protein’s activity, for example, through phosphorylation. Suggestions that regulatory changes play an important role in evolution date back many years. In 1975 King and Wilson, for example, noted the small amount of sequence divergence between humans and chimpanzees (≈1 percent). Homologous proteins in these two species, so their argument went, are too similar to explain what makes us human. Thus, they argued, changes in regulatory DNA that affect gene expression are responsible for these differences [404]. Many traits that distinguish us from chimpanzees are evolutionary innovations. They include bipedalism and qualitatively new cognitive abilities, such as symbolic communication. And while many researchers continue to search for genetic changes responsible for these species differences, others have focused on innovations outside primates [125, 190, 376, 397, 558, 867]. As a result, we now have plenty of candidates for innovations that involve regulatory change. Several examples follow.
Some butterflies use an ingenious innovation to scare off would-be predators many times their size [736–738]. Their wings harbor spots that resemble
the eyes of animals much larger than their predators (Figure 1.3a). A display of these eyespots is a bluff that may save the butterfly’s life when it is attacked. (Eyespots belong to a much larger class of color-patterning innovations. Such patterns often serve to inform friends or deceive foes.) In developing butterfly larvae and pupae, eyespots form in a prospective wing region called the eyespot focus. One feature that distinguishes eyespot foci from their surrounding tissue is the expression of a key regulatory molecule, the transcription factor Distal-less [86]. In many animals Distal-less plays a role in the development of several body structures, including legs and wings [104]. Its expression in the eyespot focus is an early key event that demarcates the eyespot (Figure 1.3). Even though butterflies vary in the numbers and positions of eyespots, Distal-less is expressed in all eyespot foci. Conversely, grafts of Distal-less-expressing eyespots to developing wing tissue suffice to cause eyespot formation in the recipient tissue [86]. Other regulatory molecules are also expressed in eyespot foci [396], and some of them in turn drive surrounding cells to produce the pigments that give eyespots their striking appearance. We may never know whether one of these regulators or Distal-less first changed its expression in the origin of eyespots. However, the key point is that a change in the expression of one or more already existing molecules is critical to form these defensive innovations.
Figure 1.3 Butterfly eyespots as regulatory innovations. (a) Butterfly eyespots on the wings of the moth Automeris io (image from http://commons.wikimedia.org). (b) Eyespots on the ventral surface of the forewing (upper) and hindwing (lower) of the butterfly Bicyclus anynana. From figure 3 of [54]. (c) Distal-less expression in a B. anynana hindwing imaginal disc (seven small white spots in upper-left panel), the larval structure from which wings form. Distal-less expression is visible in seven spots that correspond to the future position of seven eyespots on the adult hindwing (upper-right). Distal-less expression during development of the Cyclops mutant (lower-left) occurs in a single stripe corresponding to the sole eyespot that will form in this mutant (lower right). From figure 3 of [86], used with permission from Nature Publishing Group.
The lenses of vertebrate eyes are marvelous innovations [431]. They are able to form images with minimal aberration, the distortion of an image as light passes through a lens. The materials responsible for this ability and for a lens’ glassy transparency are crystallins. They comprise a class of proteins with multiple functions elsewhere in the body [612], many of them enzymes. What unites them is that they can be highly expressed while remaining soluble and transparent. These properties make them ideal materials for eye lenses. Regardless of their function elsewhere in the body, regulatory mutations have caused them to be highly expressed in the lens.
Many crystallins have undergone gene duplication, but non-duplicated crystallins also exist. They include ε-crystallin, which is the same molecule as lactate dehydrogenase, and τ-crystallin, which is the same molecule as α-enolase [611, 612, 781]. In such non-duplicated crystallins, changes in regulatory DNA regions have allowed enhanced gene expression in the lens.
The lenses of water-living animals face a particularly stiff challenge [431]. To bend light’s trajectory, lenses take advantage of the difference in refractive index as light passes from one medium to another. In land-living animals, light passes from air into the water-rich biological tissue of the lens. But in water-living animals light already travels through water, so their lenses cannot take advantage of the air–water difference in refractive index. Lenses of water-living animals thus need to bend light much more strongly than those of land-living animals, and they suffer greater aberration. To minimize this aberration, fish and squid have lenses with a graded refractive index. Their lenses are built of many onion-like layers. Central layers have a higher refractive index (higher crystallin concentration). Peripheral layers have a lower index. This lens architecture allows high power with little aberration. Regulatory mutations are key to achieving it [431, 611, 612, 747, 781]. In sum, regulatory changes in the expression of existing proteins are responsible both for the existence of transparent lenses and for their sophisticated fine structure.
Some plant leaves are simple in shape, others are highly complex or dissected, consisting of multiple small leaflets (Figure 1.4a). The first flowering plants most likely had simple leaves [60]. Leaf dissection is an innovation that can serve many purposes, among them to prevent leaf overheating in hot environments, and to increase CO2 uptake in water [275, 299]. The developing leaflets of most flowering plants with complex leaves show a marked increase in the expression of KNOX (KNOTTED1-like homeobox) transcription factors [60] (Figure 1.4b). This association is causal, as shown in the lamb’s cress Cardamine hirsuta, which has dissected leaves. Reducing the activity of KNOX genes severely impairs leaflet formation, whereas an increase in their expression is sufficient to produce additional leaflets [316].
Thus here again, a change in the expression of regulatory molecules is closely associated with an evolutionary innovation. Many animals use highly specialized body parts as tools to access food. The availability of a tool can have dramatic consequences on the animal’s survival probability in times of food scarcity. One such tool is a bird’s beak. Beaks come in many shapes and sizes. They range from the long and narrow
hummingbird beak, specialized to access deep and narrow flowers, to the wide and squat beak of seed-crushing birds. The beaks of Darwin’s finches on the Galapagos and Cocos islands provide well-studied and diverse examples [288]. These finches include cactus finches, such as Geospiza scandens with long, pointed beaks, specialized in feeding on cactus flowers or the insects therein, and ground finches, such as G. magnirostris, capable of crushing hard and large seeds. Beak-shape differences have great adaptive significance. For example, only the largest ground finch G. magnirostris can feed on the largest occurring seeds in its habitats, because only its beak can exert the necessary force to crush them [288; Ch.6]. Other ground finches are restricted to smaller seeds in their diet. During some periods of drought, when small seeds can get depleted quickly, only large and hard seeds remain on the ground [287]. Recent studies linked a change in the expression of two regulatory molecules to beak shape and size. One of them is bone morphogenetic protein 4 (bmp4), a signaling protein with a role in skeleton and jaw development [44, 383, 572]. Its expression is highest in the developing deep and wide beaks of seed-crushing species [4]. The other protein is calmodulin, a signaling protein that mediates signals of changing calcium concentration to a variety of proteins. Calmodulin is most highly expressed in the developing elongate beaks of cactus finches. When bmp4 expression or calmodulin-mediated calcium signaling are artificially increased in chicken embryos, the embryos’ beaks change shape in the same way as they do among different finch species [3, 4]. Bmp4 and calmodulin expression thus probably play a causal role in changing beak shape.
[Figure 1.4 panel labels: (a) simple leaf, dissected leaf; (b) Cardamine hirsuta (hairy bittercress), Arabidopsis thaliana (thale cress).]
Figure 1.4 Leaf dissection as a regulatory innovation. (a) A simple and a dissected leaf; (b) left panels: dissected leaves of Cardamine hirsuta (top) and simple leaves of Arabidopsis thaliana (bottom); right panels: accumulation of class 1 KNOX proteins in the primordia of dissected leaves of C. hirsuta (top), but not of A. thaliana (bottom), as revealed by antibody staining (dark spots in right panels) of KNOX proteins [316]. The central region, marked with an asterisk, is the shoot apical meristem, from which the shoot forms, and which shows KNOX expression in both species [316]. The enclosed areas, two of which are indicated by arrows, indicate initiating leaf cells of leaf primordia, which do not show KNOX expression in either species. The exceptions are the dark-staining small regions within C. hirsuta indicated by arrowheads, which correspond to initiating leaflets where KNOX proteins are expressed. After figure 1 of [316], used with permission from Nature Publishing Group.
The last example may seem different from the preceding ones. On the one hand, it was not about a qualitatively new feature (presence or absence of an eyespot), but a quantitative modification of an existing feature (beak shape). From this perspective, it may not seem like the qualitative change an innovation requires. On the other hand, if only one kind of beak—exemplified by that of G. magnirostris—can crush the hardest seeds, having this beak will make a qualitative, life-preserving difference whenever only hard seeds are available. From this perspective, the beak can be viewed as an innovation. I included this example on purpose, to remind us of an oft-overlooked fact: Many innovations, when examined closely, are of this kind [528], although it is usually not obvious from the final product. They fall into a large gray area between unambiguously qualitative and merely quantitative phenotypic change. However, one key feature unites all examples in this section: they revolve around regulatory change.
Innovation through new molecules
It is difficult to know even where to begin. Every single one of thousands of highly specific enzymes in our body was a molecular innovation when it first arose, many millions of years ago. The same holds true for other proteins and RNA molecules that are involved in metabolism, development, mechanical support, and communication. Innovations in them arise through mutations of individual nucleotides and recombination. They are facilitated by the functional promiscuity of some proteins (Chapter 11), and by gene duplications (Chapter 9) that can liberate the molecules from functional constraints [135, 368]. Here I will highlight only a few well-studied cases.
The first example shows how even the smallest possible change in a protein can lead to qualitatively new functions. It concerns the bacterial enzyme L-ribulose-5-phosphate 4-epimerase (L-Ru5P). This enzyme from Escherichia coli catalyzes the interconversion of L-ribulose-5-phosphate and D-xylulose-5-phosphate. The enzyme links arabinose metabolism and the pentose phosphate pathway. It allows bacterial cells to survive on arabinose as a carbon and energy source. The enzyme is a homotetramer with four identical subunits, one of which is shown in Figure 1.5a [475]. The active site of this enzyme includes a histidine residue at position 97, which is also shown in Figure 1.5a. A single mutation at this position from histidine to asparagine gives rise to a new catalytic activity, that of an aldolase (Figure 1.5b), while preserving the structure shown in Figure 1.5a [371]. Specifically, the mutant enzyme is able to join one molecule each of dihydroxyacetone phosphate and glycolaldehyde phosphate in a condensation reaction. There are other known (and probably many unknown) enzymes where single amino acid changes give rise to new catalytic activities [566, 849].
Isocitrate dehydrogenases (IDHs) are enzymes in the energy-producing citric acid cycle; β-isopropylmalate dehydrogenases (IMDHs) are their distant relatives that catalyze a reaction in leucine biosynthesis. Despite their common ancestry, these enzymes have very different biological roles. A key distinction between them is their use of cofactors. IDHs can use either nicotinamide adenine dinucleotide (NAD) or NADP, whereas IMDHs can use only NAD. Because NAD and NADP play very different roles in metabolism (providing electrons for ATP production and biosynthesis, respectively), the question of what causes this functional shift is intriguing. It turns out that fewer than ten amino acid differences are sufficient to dramatically shift the cofactor preferences of these enzymes [116, 171, 277, 474].
The next example revolves around the threats posed by freezing temperatures. When ice crystals grow, they kill cells. They incorporate the liquid water molecules that proteins need to function, and they slice through cell membranes [620]. Organisms that can survive this threat include arctic and antarctic fish, as well as overwintering terrestrial insects and plants. They have independently evolved a class of proteins called antifreeze proteins (Figure 1.6).
These proteins bind the surface of small ice crystals and prevent them from growing [118, 166, 246]. For example, many fish adapted to cold waters can survive ice-laden seawater at almost –2°C, about 1°C lower than the freezing temperature of their body fluids [246].
Figure 1.5 A single amino acid change can create a novel enzymatic function. (a) The structure of one subunit of the homotetrameric L-ribulose-5-phosphate 4-epimerase from Escherichia coli. A histidine residue (His97) in the catalytic site is highlighted. The structure is rendered from information in Protein Data Bank file 1K0W [475]. (b) Schematic drawing of the chemical reaction catalyzed by the epimerase shown in (a), as well as for a mutant with a single histidine to asparagine amino acid change at position 97 (after [566]). The mutant can catalyze a new aldolase reaction.
One important observation about this molecular innovation is that it occurred repeatedly: Antifreeze proteins fall into five classes [118] that show very little similarity in sequence or structure. Their ancestors are very different proteins, for example, serine proteases and chitinases. Arctic and antarctic fish have evolved antifreeze proteins with very similar sequences independently [115]. In addition, antifreeze proteins can evolve very rapidly. For example, the arctic glaciation, which drove antifreeze protein evolution in arctic fish, occurred less than 3 million years ago [691]. Sister species in the same genus Myoxocephalus (sculpins) have even independently evolved two different classes of antifreeze proteins [118]. Antifreeze proteins stand for a much larger class of innovations that occurred repeatedly, rapidly, from different ancestors, and sometimes with very different solutions to the same problem [661, 807]. They underscore how readily innovations can arise in living systems.
Figure 1.6 Antifreeze proteins. Antifreeze protein of (a) the longsnout poacher Brachyopsis rostratus, a benthic fish living off the northeast coast of Japan, and (b) the mealworm beetle Tenebrio molitor. Note the very different structures of the two proteins, which are rendered from Protein Data Bank files 2ZIB [560] and 1EZG [463].
The last three examples were ordered by the amount of change (from minimal to drastic) required for a new protein function. The next and last example illustrates again the sliding scale between a quantitative change in an existing phenotype, and the qualitative change characteristic of innovation. It requires minimal change in a protein, modifies an existing protein’s function, but can open a completely new habitat to an organism, and can thus set the stage for the conquest of new environments.
At Mount Everest’s peak, the air contains only one-third of the amount of oxygen compared to sea level. Because oxygen is so limited, exercise becomes very strenuous at high altitudes. This is why many human high-altitude climbers need supplementary oxygen. The bar-headed goose (Anser indicus) does not have this luxury. This bird lives in central Asia and migrates over the Himalayas, at altitudes exceeding 10 kilometers. It is one of the highest flying birds known. The ability to migrate over a mountain range this high is an amazing adaptation that can greatly expand an organism’s habitat range. How does this bird do it? The answer is multi-faceted, but an important aspect regards oxygen transport [466, 531]. The bar-headed goose has a hemoglobin molecule with higher oxygen affinity than its lowland relatives. A proline to alanine substitution in one of the hemoglobin subunits is important for this change [277, 459]. It eliminates a key contact between the hemoglobin subunits, which shifts the equilibrium of hemoglobin towards a conformation that has higher affinity for oxygen.
Genotype networks and their history
The preceding three sections illustrate three classes of innovation in different kinds of phenotypes. Together, they form the basis of most evolutionary innovations. Some innovations may arise in a single large step, but many arise more gradually, through a series of changes with individually modest effects. Such innovations will typically involve hopelessly entangled changes in all three phenotypic classes. For example, a new metabolic ability may arise in an organism through the “import” of new enzyme-coding genes via horizontal gene transfer, together with changes in the regulation of metabolic enzymes already encoded in an organism’s genome, and the
evolution of new enzymatic activities through mutations in existing genes. Similarly, complex macroscopic innovations, such as the evolution of new body parts, may involve changes in the regulation of multiple molecules, and the evolution of new molecules. Known macroscopic innovations are so complex that we do not yet understand all required changes for any one of them. Despite these complexities, it is useful to keep these three classes of innovations conceptually separate, and not only because some innovations fall within a single class. Such separation allows us to ask whether different classes of biological systems have similarities relevant to innovation, even though they may differ in most other respects. I argue in this book that such deep similarities exist, and that they are key to understand innovation. One important similarity is that their phenotypes can be organized into genotype networks. A genotype network is a set of genotypes that have the same phenotype. Genotypes in such a network are connected in the following sense: you can reach each genotype by a series of small mutational changes, each of which leaves the phenotype unchanged. Each such small change affects only a single part of a genotype, such as one amino acid in a protein. (I will call two genotypes that differ in only a single part neighbors.) All human understanding requires abstraction from the unfathomable complexity of the world around us. If one tries to understand a particular phenomenon, one needs to ask about the level of abstraction on which this phenomenon can be understood. In my view, the concepts of genotype spaces and genotype networks are the right level of abstraction to understand evolutionary innovation comprehensively and systematically. The reasons are spelled out throughout this book. I will explain the concept of a genotype network and its importance for innovation in much greater detail in later chapters. For now I will just say a few words about its history. 
To my knowledge, the concept was first foreshadowed in a 1970 paper on protein spaces—now usually called genotype spaces or sequence spaces—by the late John Maynard Smith. The paper stated “. . . if evolution by natural selection is to occur, functional proteins must form a continuous network which can be traversed by unit
mutational steps without passing through nonfunctional intermediates” [498]. Maynard Smith’s interest in this paper did not regard the origins of evolutionary innovation, but whether natural selection could plausibly lead to any functional proteins. He argued that this was the case for real proteins. The network Maynard Smith had in mind differed from genotype networks in this book in another important respect. His is not a network of proteins with the same phenotype, but of proteins with any phenotype (function). To understand innovation, however, it is important to distinguish between different phenotypes and the genotype networks each forms. We shall see that the organization of these networks in the space of all possible genotypes is important for innovation. After Maynard Smith’s paper, it took another twenty years and considerable advances in computational technology before computational studies first showed that genotype networks may exist, at least for simple models of “coarse grained” structural phenotypes that can be computationally estimated from genotypes. The genotypes of these studies [434, 464] were coarse models of amino acid strings that consist of only two types of amino acids, hydrophobic and hydrophilic amino acids. The phenotypes were geometric models of protein structure, where each amino acid occupies a different position on a regular geometric lattice. Lipman and Wilbur [464] showed that such a structure is typically adopted by a large number of genotypes. Many of these genotypes can be reached from one another through series of single amino acid changes that do not change the phenotype. A few years later, unrelated work on RNA genotypes by Peter Schuster and his associates provided further support for the existence of genotype networks [688]. The phenotypes in this work were RNA secondary structures, the planar shapes that such sequences can adopt through internal base pairing (more about them in Chapter 4).
These authors showed that RNA molecules with the same secondary structure can typically have very different sequences. In addition, sequences with the same phenotype typically form large sets whose sequences can be reached from one another through a series of single nucleotide changes [687, 688]. Their work remains among the most detailed characterizations of a genotype space.
Each of the last two lines of work was limited. It either considered only model proteins, or only partial phenotypes—RNA secondary structure phenotypes are necessary but not sufficient for RNA function. However, the concepts that emerge from this work remain important. This book shows that these concepts apply not just to model or partial phenotypes. In addition, they are important far beyond molecules like protein and RNA. They apply to different levels of biological organization, and can tie innovations on different levels together. The importance of this unifying power is hard to overstate. Such unification is essential for any comprehensive theory of innovation.
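The definition of a genotype network given in this section can be made concrete with a small computational sketch. The Python code below is only an illustration under stated assumptions: genotypes are short strings over a two-letter alphabet (echoing the hydrophobic/hydrophilic models mentioned above), two genotypes are neighbors if they differ in a single position, and the phenotype function (the length of the longest run of H) is a hypothetical stand-in chosen for brevity. It is not a lattice protein or RNA folding model, and all function names are invented for this example.

```python
from collections import deque
from itertools import product

ALPHABET = "HP"  # toy two-letter alphabet (hydrophobic / hydrophilic)


def phenotype(genotype: str) -> int:
    """Hypothetical toy phenotype: length of the longest run of 'H'."""
    longest = current = 0
    for letter in genotype:
        current = current + 1 if letter == "H" else 0
        longest = max(longest, current)
    return longest


def neighbors(genotype: str):
    """Yield every genotype that differs from `genotype` in exactly one position."""
    for i, letter in enumerate(genotype):
        for other in ALPHABET:
            if other != letter:
                yield genotype[:i] + other + genotype[i + 1:]


def genotype_networks(length: int):
    """Split genotype space into connected sets of genotypes sharing one phenotype.

    Returns a list of (phenotype, set_of_genotypes) pairs. Within each set,
    any genotype can be reached from any other through single-position
    changes that never alter the phenotype.
    """
    unvisited = {"".join(g) for g in product(ALPHABET, repeat=length)}
    networks = []
    while unvisited:
        start = unvisited.pop()
        target = phenotype(start)
        network, queue = {start}, deque([start])
        while queue:  # breadth-first search restricted to one phenotype
            current = queue.popleft()
            for nb in neighbors(current):
                if nb in unvisited and phenotype(nb) == target:
                    unvisited.remove(nb)
                    network.add(nb)
                    queue.append(nb)
        networks.append((target, network))
    return networks


def robustness(genotype: str) -> float:
    """Fraction of one-mutation neighbors that keep the genotype's phenotype."""
    nbs = list(neighbors(genotype))
    same = sum(phenotype(nb) == phenotype(genotype) for nb in nbs)
    return same / len(nbs)


if __name__ == "__main__":
    nets = sorted(genotype_networks(6), key=lambda item: (item[0], -len(item[1])))
    for target, network in nets:
        print(f"phenotype {target}: network of {len(network)} genotypes")
    print("robustness of HPHPHP:", robustness("HPHPHP"))
```

In this toy map a single phenotype may split into several disconnected networks or form one large one; which of these happens depends entirely on the assumed phenotype function, so the output says nothing about real proteins or RNA. The sketch is meant only to show what "connected through phenotype-preserving single changes" means operationally.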
Neutral versus genotype networks
Schuster and collaborators coined the term “neutral network” for the genotype networks they studied [688]. “Neutrality” in their sense means invariance of a well-defined phenotype among all genotypes on a neutral network. The term “neutral network” is widely used; it is evocative, and has alliterative appeal. In evolutionary biology, however, neutrality has a different meaning: a change in a genotype that is invisible to natural selection, because it does not affect fitness (more about that neutrality in Chapter 7). Neutrality in the first sense does not imply neutrality in the second. To avoid confusion, I will thus use the term “neutral network” sparingly, and only where its meaning (in the first sense above) should be unambiguous from the context. Elsewhere, I will refer to “genotype networks.” Most phenomena I will discuss do not require that the genotypes on the same genotype network have exactly the same fitness. For example, many mutations in proteins of well-studied organisms are deleterious, but weakly so [227, 676]. Such weakly deleterious mutations can rise to high frequency in a population by chance events (Chapter 7), or they can persist until other mutations arise that compensate for their deleterious effects and thus preserve them [393, 428]. They are no strong impediment to evolutionary change on one genotype network. Conversely, many mutations that increase fitness do so only very slightly, and their fate can be determined by the same forces that determine the fate of neutral mutations [310, 676].
In sum, because the term “neutral network” insinuates that its genotypes have the same fitness, it is too narrow for the purpose of studying innovation, and I will prefer the term “genotype network.”
Innovability versus evolvability or phenotypic variability
A few words are now necessary to motivate my use of the neologism “innovability.” Perhaps a more popular word, such as “evolvability,” might be a better choice? The most widely used meaning of evolvability is the ability to produce heritable phenotypic variation. Why, then, not just use this notion here, or simply “phenotypic variability”? The reason is that phenotypic variability can merely refer to quantitative variation in existing phenotypes (body height, thermotolerance, etc.). When studying innovation, however, qualitative variation becomes important. The approaches I use below to analyze different phenotypes all aim to distinguish such qualitative differences. We currently do not have a good word to refer to such qualitative differences. This is the main motivation for using a new word, innovability. In addition, many authors use evolvability to describe some aspect of their study system. Unfortunately, the word’s meaning has thus become rather muddled by overuse. Moreover, evolvability has many aspects that I do not discuss here [264]. This is another reason to sidestep this word in the book.
Chapter overview
Each of the next three chapters will focus on one of the three main system classes important for innovation. Specifically, Chapter 2 will focus on metabolic systems, Chapter 3 will focus on regulatory circuits, and Chapter 4 will focus on protein and RNA molecules. Each chapter will provide evidence for the existence of genotype networks; it will also characterize these networks. The emphasis is on common features, not an exhaustive review. These features include that genotype networks typically have vast size, that they extend far through genotype space, and that the neighbors of different genotypes on any one such network form very different novel phenotypes. If you are not interested in the technical details of these chapters, you may wish to skip to Chapter 5, which summarizes and synthesizes information from the earlier chapters. It explains why these and other features are crucial for innovation, and probably were crucial since the origin of life. The chapter is as self-contained as I knew how to write it.
Chapter 6 concerns perhaps the most puzzling observation left unexplained by Chapter 5. It is that the three very different system classes from Chapters 2 to 4 have key commonalities that are important for innovation. The chapter shows that a simple fact is both necessary and sufficient for these commonalities: in all three system classes many neighbors of any one genotype G typically have the same phenotype as G itself. In other words, genotypes are to some extent robust to mutations. This chapter is the only mathematical chapter of the book, although the mathematics are elementary and used to make largely qualitative statements.
Taken together, Chapters 2–6 show that the framework I propose here accomplishes three of the five basic goals for a theory of innovation, including the most important goal: to explain how life can preserve the old while exploring the new. The remaining two goals are the subject of later chapters. Subsequent chapters deal with several apparently disparate phenomena and problems in evolutionary biology, and show how the concepts of the earlier chapters allow us to unify them and resolve tensions between them. Some of the chapters summarize large bodies of work; others mainly outline directions for future research.
Chapter 7 regards the tension between selectionism and neutralism. Selectionism emphasizes the role of natural selection in evolution, whereas neutralism ascribes an important role to neutral change that is invisible to natural selection. The tension between them has permeated the field of molecular evolutionary biology at least since Motoo Kimura proposed the neutral theory of molecular evolution [402]. The chapter shows how we can resolve this tension.
It argues that neutral or nearly neutral changes may be frequent, but that most such changes will later become subject to selection. Such neutrality, albeit ephemeral, is indispensable for innovation, because it allows the exploration of novel phenotypes. Chapter 8 is about robustness, a biological system’s ability to preserve—in any one environment—
its phenotype under perturbations such as mutations [825]. The chapter first makes the elementary qualitative observation that robustness causes the existence of genotype networks (as shown in Chapter 6), and is thus essential for innovation. Quantitatively, however, the relationship between robustness and innovability is more complex. On the one hand, and almost by definition, the more robust a genetic system is, the less phenotypic variation it produces in response to perturbation. From this perspective, robustness hinders innovability. On the other hand, both experimental and computational studies show that robustness can promote innovability in some system classes. The chapter resolves this tension and shows how robust phenotypes can promote innovation. Whether they do depends on details of genotype space organization for a system class.
The closely related Chapter 9 focuses on gene duplication, a kind of mutation linked to dramatic innovations. The chapter shows that we can understand this link by considering that gene duplications are mutations which increase robustness in a particular way: Without destroying old phenotypes, they greatly facilitate the exploration of new phenotypes around a genotype network.
Chapter 10 will discuss recombination, an important class of mutation that causes large-scale genotypic change. Recombination can be highly effective in producing novel phenotypes. However, it can also destroy existing, well-adapted phenotypes. I will show here that this destructive potential of recombination may be smaller than its creative potential. For example, molecules or regulatory circuits can be highly resilient to recombination, especially if they are exposed to it frequently. Recombination can help explore the new without destroying the old.
Chapter 11 returns to a key remaining question about robustness. Robustness brings forth genotype networks, but why does it exist in the first place, and why in such very different systems?
The answer leads to the role of environmental change. I argue that coping with changing environments may require systems to increase their size, and that this increase in system complexity causes robustness in any one environment. In other words, the need to cope with environmental change has been a driving force behind the enormous complexity of present-day biological systems. This complexity entails robustness, which is behind the existence of genotype networks. The chapter also shows how genotype networks can help study the quantitative influence of environmental change on innovation: systems able to cope with multiple environments exist in intersections of multiple genotype networks, which affects their ability to innovate. Chapter 12 discusses constraints on phenotypic evolution, which are biases or limitations in the phenotypic variation a system produces. It shows that genotype networks are useful to understand and unify several apparently unrelated causes of such constraints. These causes emerge from an underlying “developmental” cause, the processes that produce phenotypes from genotypes. Chapter 13 focuses on phenotypic plasticity, a genotype’s ability to produce multiple phenotypes. Genotype networks can facilitate the origin of genotypes that have a novel phenotype in their plastic repertoire of phenotypes. The chapter also discusses genetic assimilation and related phenomena that may lead to the fixation of plastic phenotypes after their origin. When characterizing genotype networks, one usually represents both genotypes and phenotypes as discrete objects, which facilitates enumeration and comparison. Chapter 14 discusses systems that are best represented by continuously valued phenotypes and genotypes. Such systems
are a research frontier, because they have not been rigorously studied. What little we know, however, suggests that the main principles I described apply to them as well. Chapter 15 discusses technological systems. A general innovability theory should apply to both biological and technological substrates of innovation. This chapter shows one technological application. It focuses on reconfigurable hardware, a commercially important class of electronic circuitry whose internal wiring (“genotype”) can be altered to compute different functions (“phenotypes”). The chapter shows that such circuitry can display key features of biological systems. It suggests that the biological principles I explored earlier are transferable to technology, and may promote technological adaptation and innovation. Chapter 16 is a short summary of key points and an outlook on future challenges. Taken together, the material in Chapters 2–6 shows that the framework I propose here meets the above minimal requirements 1–3 for a theory of innovation. Chapter 11 shows that it meets requirement 4, and Chapter 15 shows that it meets the last minimal requirement, number 5. Although some data in this book come from previously unpublished research, most of what originated in my own research group is scattered throughout some 30 articles that are cited where appropriate.
CHAPTER 2
Metabolic innovation
Metabolic genotype: An organism’s enzyme-coding genes and the enzymes they produce.
Metabolic phenotype: An organism’s ability to synthesize biomass, produce essential molecules, and extract energy from a chemical environment.
In this chapter, I examine innovation in metabolic phenotypes. I focus on the most fundamental such phenotypes: those that allow an organism to survive in a given chemical environment. I will discuss how vast genotype networks, sets of metabolic genotypes with the same phenotype, facilitate the evolution of novel metabolic phenotypes. My discussion applies to all organisms, but especially to microbes, where metabolic network evolution is rapid and mediated through horizontal gene transfer.
Metabolic networks
The genome of a free-living organism encodes metabolic enzymes that catalyze most of the hundreds to thousands of chemical reactions needed to sustain life. These reactions convert food into biochemical building blocks or energy, build biomass out of light, air, and minerals, and produce chemicals that serve in storage, self-defense, communication, and other processes. Typically, each metabolic process involves the joint action of multiple enzymes. Traditional biochemistry has taught us to think of such processes as linear chains of reactions catalyzed by enzymes, or as simple cycles. Now that we have complete, or nearly complete, information about the metabolisms of well-studied organisms, we realize that it is better to think of them as highly reticulate metabolic networks [207, 208, 324, 562].
Many metabolic abilities of organisms surely were game-changing when they first arose. These include abilities I already mentioned, such as the ability to synthesize a protective cell wall, to produce communication molecules, or to synthesize diverse storage compounds for times of nutrient scarcity [842]. The most fundamental of these is the ability to use new foods—new chemicals in the environment—to synthesize building blocks for cell or biomass growth. The chemicals in question may serve both as sources of energy and of essential building materials, especially of the elements nitrogen, phosphorus, sulfur, and carbon. Because carbon is the most abundant of these four elements, it is of central importance in this regard. Over its four-billion-year history, life—and in particular prokaryotic life—has evolved the ability to thrive on a myriad different carbon sources, however toxic they may be to humans. Chapter 1 already mentioned some astounding innovations in carbon metabolism, where organisms learned to feed on xenobiotic carbon sources that include industrial chemicals and antibiotics [126, 141, 160, 638, 796, 797]. Other elements can also be provided by a broad range of sources [195, 257, 375, 515, 588, 783]. Whether an organism has the ability to survive in an environment where only one nutrient or energy source occurs is a life-or-death matter. The ability to produce biomass from a given set of nutrients is thus perhaps the most fundamental requirement metabolism must fulfill. I will focus on it here, but much of what I say may apply to a broader spectrum of metabolic phenotypes. More specifically, I will focus on organisms that feed on organic nutrients, and on carbon metabolism, because of carbon’s centrality to life. Although I will refer to these nutrients as carbon sources, it should be understood that they are at the same time energy sources. What I say here about carbon metabolism also holds for other chemical elements [653].
Metabolic genotypes
The biosphere contains enzyme-coding genes whose products catalyze of the order of $10^4$ or more chemical reactions [571]. The genome of any one organism encodes enzymes that catalyze some of the reactions in this reaction “universe.” We can view this collection of enzyme-coding genes as an organism’s metabolic genotype. This genotype is ultimately a string of DNA, but representing it as such is not effective for studying metabolism. More compact representations are necessary, representations that focus on the set of chemical reactions that the enzymes encoded by this genotype can catalyze. One simple such representation is a binary string whose length is the number of reactions in the known reaction universe (Figure 2.1). This string contains a “1” at position i if the organism encodes a gene for reaction i, and a “0” if it does not. This representation is gene-centered, which reflects the fact that the gene and not the enzyme is the unit of metabolic evolution: only genes are subject to mutations that are inherited from parents to offspring. Some enzymes catalyze multiple reactions, which can simply be represented through the gene encoding them. Conversely, some reactions are catalyzed by multiple enzymes; these can be represented through one of their enzyme-coding genes. This genotype representation is discrete, which is a great advantage if one wants to enumerate metabolic genotypes. Such enumeration is useful to develop the genotype network concept. It focuses on whether one or more reactions can be catalyzed at all, rather than on merely quantitative differences in the rates at which they can be catalyzed. In doing so, it allows us to focus on qualitatively new metabolic abilities—the ability to survive on new food sources—rather than on quantitative changes in existing abilities.
[Figure 2.1: panel (a), “Genotype (determines metabolic reaction network)”, lists stoichiometric equations such as Glucose + ATP → Glucose 6-phosphate + ADP, each marked “1” or “0”; an arrow labeled “Flux Balance Analysis” leads to panel (b), “Phenotype (survival on food source)”, which marks the sole carbon sources alanine, citrate, ethanol, glucose, melibiose, and xanthosine with “1” or “0”.]
Figure 2.1 The concept of metabolic genotypes and phenotypes. (a) The metabolic genotype of a genome-scale metabolic network can be represented in discrete form as a binary string, each of whose entries corresponds to one biochemical reaction in a “universe” of known reactions. Individual entries indicate the presence (“1,” black type in stoichiometric equation) and absence (“0,” gray type) of an enzyme-coding gene whose product catalyzes the respective reaction. The binary string is as long as the number of known enzyme-catalyzed reactions, and for any one organism, only a small fraction of its entries may be equal to one. (b) Qualitatively, metabolic phenotypes can be represented by a binary string. The entries of this string correspond to individual carbon sources. The string contains a “1” for every carbon source (black type) for which a metabolic network can synthesize all major biomass molecules, if this source is the only available carbon source. Flux balance analysis (arrow) can be used to determine the metabolic phenotype from the genotype.
We can think of all metabolic genotypes as forming a vast genotype space. Even with our current, limited knowledge of some $10^4$ metabolic reactions, the size of this space is astronomical. Given the genotype representation I use here, it contains of the order of $2^{10^4}$ metabolic network genotypes. Each metabolic network is a point in this space. It will be useful to have a measure of the distance between two metabolic genotypes (networks) in this space. The measure I will use is the number or fraction D of reactions that are not catalyzed by both metabolic networks. I will refer to D as the genotype distance. Two metabolic genotypes would have the maximal distance if they have no reactions in common. Conversely, two metabolic genotypes are neighbors if they differ in exactly one reaction, i.e., if one metabolic network catalyzes all the reactions that the other also catalyzes, except for one. The neighborhood of a metabolic genotype comprises all the metabolic networks that differ from it by one reaction [652]. More generally, one can define a k(-mutant)-neighborhood as the set of metabolic networks that differ from it by k reactions.
By the end of the twentieth century, the first genome-scale metabolic networks of well-studied organisms had been thoroughly characterized [207, 208]. To characterize such a network is to assemble a list of enzyme-catalyzed chemical reactions known to occur in a given organism. These reactions are usually represented by their stoichiometric equations (Figure 2.1a). To characterize an organism’s metabolic network is easier if its genome has been completely sequenced and if its genes have been identified.
In this case, one can compare these genes to all known genes that encode specific enzymes in other organisms, and thus infer the likely functions of the enzymes they encode. The resulting information about metabolic genotypes is usually still incomplete and needs to be complemented by biochemical information. Such information is easiest to come by for well-studied model organisms such as E. coli and the yeast S. cerevisiae, whose metabolism
has been studied for decades, and where mountains of primary literature on metabolic enzymes exist [208, 254, 324, 637]. However, metabolic networks are also being characterized for many other organisms, including humans [232]. The genome-scale metabolic networks that result from this effort may not include every single reaction catalyzed in an organism, but are usually comprehensive enough to cover the synthesis and recycling of all major biomass molecules.
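The binary genotype representation, the genotype distance D, and the notion of neighboring genotypes described in this section can be made concrete with a small sketch. It is an illustration only: the six-reaction universe and its reaction names are hypothetical, and dividing by the number of reactions present in either network is one assumed reading of “fraction,” chosen because it gives D = 1 for networks that share no reactions.

```python
# Hypothetical toy reaction universe; the known universe has of the order of 10^4 entries.
UNIVERSE = ("R1", "R2", "R3", "R4", "R5", "R6")


def to_bitstring(network, universe=UNIVERSE):
    """Binary genotype: '1' at position i if reaction i is encoded, '0' otherwise."""
    return "".join("1" if reaction in network else "0" for reaction in universe)


def genotype_distance(a, b):
    """Fraction of reactions that occur in only one of the two networks.

    Assumption: the fraction is taken relative to the reactions present in
    at least one of the two networks.
    """
    union = a | b
    if not union:
        return 0.0
    return len(a ^ b) / len(union)


def are_neighbors(a, b):
    """Two genotypes are neighbors if they differ in exactly one reaction."""
    return len(a ^ b) == 1


def neighborhood(network, universe=UNIVERSE):
    """All genotypes reached by adding or deleting a single reaction."""
    return [network ^ {reaction} for reaction in universe]


a = frozenset({"R1", "R2", "R3"})
b = frozenset({"R2", "R3", "R4"})

print(to_bitstring(a))            # 111000
print(to_bitstring(b))            # 011100
print(genotype_distance(a, b))    # 0.5
print(len(neighborhood(a)))       # 6
print(2 ** len(UNIVERSE))         # 64 genotypes in this toy genotype space
print(len(str(2 ** 10_000)))      # 3011 decimal digits in 2^(10^4)
```

Sets make the symmetric difference explicit, which is why the sketch stores a network as a set of reaction names and derives the bit string only for display.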
Metabolic phenotypes
I will now turn from metabolic genotypes to metabolic phenotypes related to the synthesis of essential biomass molecules. For free-living (non-parasitic) organisms, these biomass molecules include all amino acid and nucleotide precursors, lipids, many carbohydrates, and multiple cofactors for metabolic enzymes. For example, for E. coli, they comprise some 50 different molecules [231]. One could define a metabolic network’s phenotype as the number or fraction of these biomass molecules it can synthesize. This definition, however, has a serious limitation: unless a network synthesizes every single essential biomass molecule, it cannot sustain life. In other words, if we want to understand phenotypes of living organisms and their metabolic innovations, this definition is too limited.
Here is an alternative definition of metabolic phenotypes. It is motivated by the observation that the nutrients available to a cell determine whether the cell can synthesize all biomass molecules. Many organisms, such as E. coli, can survive in minimal environments that contain a terminal electron acceptor (such as O2), a source of nitrogen (e.g., NH3), sulfur (SO4), phosphorus (PO4), carbon, and energy. A simple way of characterizing a metabolic phenotype is to ask whether a metabolic network can synthesize all biomass molecules in a given chemical medium, such as a minimal medium with glucose as the sole carbon source. In other words, can the network sustain cell growth in this environment?
The above definition is useful but still has drawbacks, because astronomically many combinations of carbon sources might allow a network to sustain growth. A more systematic categorization of metabolic phenotypes is necessary, where these combinations are represented in a simple way. To arrive at such a categorization, consider a minimal environment like that above, where no molecules except the carbon source vary in their availability. That is, some environments may harbor one carbon source, others may harbor another, yet others may harbor multiple carbon sources. With this categorization in mind, I propose the following representation of a metabolic phenotype. Let us write this phenotype as a binary string. The length of this string corresponds to the number of molecules that could potentially serve as sole carbon source for some metabolic network. For any one metabolic network, this string contains a “1” at position i if the network can synthesize all biomass molecules whenever carbon source i is provided as the only carbon source, in an otherwise minimal environment (Figure 2.1b). A string with multiple ones corresponds to a network that can synthesize biomass in multiple minimal environments that differ in the sole carbon source they contain. The length of this string will be much smaller than the total number of carbon-containing molecules [202]. The reasons are that, first, organisms can import a limited number of such molecules; second, some carbon-containing molecules may be highly unstable; third, some molecules may be toxic reaction intermediates of metabolism. For brevity, I will call a network that can synthesize all biomass molecules in a given environment a viable network. An advantage of this representation is that it accounts for environments that contain many carbon molecules: a network that can synthesize all biomass molecules on each of several sole carbon sources is also likely to do so if all these carbon sources occur together. Obviously, this kind of reasoning can also be applied to categorizing metabolic phenotypes with respect to sources of sulfur, nitrogen, and phosphorus, when these sources vary in their availability.
The same holds for different energy sources, even though sources of these elements often provide energy as well. In sum, for my purpose an organism’s metabolic genotype is the totality of biochemical reactions catalyzed by enzymes of the organism. It is but a point in a vast space of metabolic networks. A network’s metabolic phenotype is the spectrum of alternative carbon (or other food) sources that the
network can use to synthesize all of an organism’s biomass molecules.
From metabolic genotype to phenotype
With the above definitions in hand, how can we determine phenotype from genotype? The time-honored experimental approach is to expose a specific organism to a minimal environment with a sole carbon source, and determine whether it can grow and divide. By exploring many different sole carbon sources, the metabolic phenotype can then be elucidated experimentally. However, to understand metabolic innovation, we will need to explore the metabolic phenotypes of many thousands of well-defined genotypes. To create and characterize such genotypes is beyond current experimental techniques, and thus requires computational approaches. Fortunately, the method of flux balance analysis allows us to compute metabolic phenotypes from genotypes for very large metabolic networks comprising thousands of reactions [678]. This method is widely used to optimize metabolic properties of industrially important microbes [232, 233, 624]. I will not discuss its mathematical details but merely highlight some central features. Flux balance analysis requires a list of all stoichiometric equations (Figure 2.1a) in a network. The method assumes that a network operates in steady-state conditions, as, for example, in a growing population with a constant nutrient supply. Flux balance analysis is commonly used for two purposes. First, it can identify, for all reactions in a network, the set of allowed metabolic fluxes through these reactions—the rate at which the substrates of a reaction are converted into products—in a given chemical environment. Not all possible fluxes are allowed. The reason is that metabolism needs to conserve mass. It cannot produce more than it consumes. In calculating allowed fluxes, flux balance analysis must take into consideration that environmental nutrients can flow into a network at a limited rate. The task of calculating allowed fluxes amounts to solving a large set of linear equations, one for each kind of molecule in the network.
The set of allowed fluxes typically forms a large connected region of a high-dimensional space with as many dimensions as there are reactions in a metabolic network.
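For orientation, the steady-state mass balance and the limited nutrient inflow just described are conventionally written as a linear system with bounds, and the optimization step that the text turns to next as a linear program over the same constraints. The notation below is an assumption introduced here for illustration and does not appear in the text: $S$ is the stoichiometric matrix with one row per kind of molecule and one column per reaction, $v$ is the vector of fluxes, $\ell_j$ and $u_j$ are lower and upper bounds on flux $j$ (including limits on nutrient uptake), and $c$ holds the weights of the linear function to be maximized.

```latex
\begin{aligned}
& S\,v = 0, \qquad \ell_j \le v_j \le u_j \quad (j = 1, \dots, n)
  && \text{(allowed fluxes)} \\
& \max_{v} \; c^{\top} v \quad \text{subject to the constraints above}
  && \text{(fluxes with a property of interest)}
\end{aligned}
```

Each row of $S\,v = 0$ is one of the linear equations mentioned above, one for each kind of molecule, and the solution set lives in a space with as many dimensions as there are reactions.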
The second purpose of flux balance analysis is to identify those fluxes among all allowed fluxes that have a property of interest. For example, they might allow efficient synthesis of a biotechnologically interesting molecule, or they might allow production of all of a cell’s essential biomass molecules. To achieve the second purpose, flux balance analysis maximizes or minimizes linear functions of fluxes using a numerical optimization technique called linear programming [144]. In practice, the predictions of flux balance analysis are often in good agreement with experimental data for well-studied organisms [64, 248, 348, 593, 690]. The usual exceptions involve situations where the organism may not have adapted to a nutrient environment during its recent evolutionary past [146, 243, 348]. However, even in this case, laboratory evolution experiments can rapidly create microbial genotypes whose metabolic properties match the predictions of flux balance analysis [146, 247, 248, 348]. The most common reason for the original mismatch is suboptimal regulation of necessary enzymes in a novel laboratory environment. Such regulatory constraints are easily and rapidly overcome, more easily than the complete absence of an enzyme for an essential reaction. (I will examine regulation extensively in Chapter 3 and later chapters.) For my purpose, the key question about any one metabolic genotype is whether it can synthesize all essential biomass molecules in a given environment. To understand metabolic innovation, it is important to understand how this qualitative ability—regardless of the rate of synthesis—can change as a genotype changes. It is prudent to ignore the rate of synthesis, because the fast cell growth that a high synthesis rate supports is not necessarily the best indicator of an organism’s fitness, especially for microbes. For example, some highly successful microbial species, such as Mycobacterium tuberculosis, grow slowly in the wild [150, 405, 843]. 
Slow-growing microbes may even outcompete faster-growing microbes under conditions often found in nature [172]. Nonetheless, most of what I will say below also holds for metabolic networks with high biomass synthesis rates [670]. In order to determine a metabolic network’s phenotype, as defined above, one can apply flux balance analysis n times, that is, to n different minimal chemical media that differ only in the sole carbon sources
they contain. Each medium in which an organism can synthesize all biomass molecules from the respective carbon source is assigned a “1” in the metabolic phenotype (Figure 2.1b). In sum, we can think of an organism’s metabolic genotype as a set of biochemical reactions represented by enzyme-coding genes. These reactions form a metabolic network whose most basic task is to synthesize all biomass molecules, small molecules essential for cell growth. As defined here, a metabolic phenotype is the ability to synthesize all of these molecules from chemicals found in the environment. Because of carbon’s centrality, I here focus on chemicals that can serve as carbon (and energy) sources in heterotrophic organisms. I categorize phenotypes according to those chemicals that can serve as sole carbon sources in an otherwise minimal chemical environment. Flux balance analysis allows us to predict metabolic phenotypes from genotypes. In closing this section, I note that the concepts and tools I just introduced fulfill several of the requirements I posited in Chapter 1. First, they can capture the combinatorial nature of metabolic innovations, which arise through novel combinations of enzymatic reactions; second, they explicitly represent qualitatively different phenotypes that regard the ability or inability to sustain life, and thus allow us to focus on the qualitative phenotypic change that characterizes innovation; third, they allow us to predict phenotype from genotype. I will next turn to what I earlier called the paramount problem of innovation: how organisms can explore myriad qualitatively new phenotypes while preserving their existing phenotypes.
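The procedure of testing a network once per sole carbon source and recording a “1” or “0” each time can be sketched as follows. This is a toy illustration under loudly stated assumptions: the reactions, metabolite names, and carbon sources are hypothetical, and the viability check is a simple qualitative producibility test (a reaction can run once all its substrates are available). It stands in for flux balance analysis and is not flux balance analysis, since it ignores stoichiometry and mass balance.

```python
# Hypothetical toy reaction universe: name -> (substrates, products).
REACTIONS = {
    "glc_to_pyr": ({"glucose"}, {"pyruvate"}),
    "eth_to_pyr": ({"ethanol"}, {"pyruvate"}),
    "cit_to_pyr": ({"citrate"}, {"pyruvate"}),
    "pyr_to_aa": ({"pyruvate", "NH3"}, {"amino_acids"}),
    "pyr_to_lip": ({"pyruvate"}, {"lipids"}),
}
MINIMAL_MEDIUM = frozenset({"NH3"})              # everything except the carbon source
BIOMASS = frozenset({"amino_acids", "lipids"})   # molecules a viable network must make
CARBON_SOURCES = ("glucose", "ethanol", "citrate")


def is_viable(network, carbon_source):
    """Qualitative stand-in for flux balance analysis (an assumption of this sketch).

    Starting from the medium, repeatedly run every reaction whose substrates
    are all available, and report whether all biomass molecules become available.
    """
    available = set(MINIMAL_MEDIUM) | {carbon_source}
    changed = True
    while changed:
        changed = False
        for name in network:
            substrates, products = REACTIONS[name]
            if substrates <= available and not products <= available:
                available |= products
                changed = True
    return BIOMASS <= available


def phenotype(network, carbon_sources=CARBON_SOURCES):
    """Binary phenotype: one viability test per sole carbon source."""
    return "".join("1" if is_viable(network, c) else "0" for c in carbon_sources)


network = frozenset({"glc_to_pyr", "cit_to_pyr", "pyr_to_aa", "pyr_to_lip"})
print(phenotype(network))                    # 101: viable on glucose and citrate
print(phenotype(network | {"eth_to_pyr"}))   # 111: one added reaction, one new food source
print(phenotype(network - {"pyr_to_lip"}))   # 000: lipids can no longer be made
```

The second line of output shows the qualitative kind of change at stake here: a single added reaction turns a “0” into a “1” without disturbing the existing entries.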
Evolution of metabolic networks
The unit of evolutionary change in metabolic networks is the enzyme-coding gene. I will postpone discussing changes in individual genes to Chapter 4, and consider here instead two larger-scale changes; both of them are more appropriate for the level of resolution at which I represent metabolic networks. First, I consider changes that arise through the addition of enzyme-coding genes (and thus reactions) to a metabolic network. An important driver of such change is horizontal or lateral gene transfer. It occurs in both pro- and eukaryotes. It is so frequent in prokaryotes that it can change genome organization on short evolutionary time scales [122, 163, 419, 445, 550, 568, 569, 589]. Second, I consider elimination of individual reactions from a network, as might occur through gene deletions or through loss-of-function mutations in enzyme-coding genes.
Even though much of what I will say below applies to all metabolic networks, it is useful to be aware of how fast especially microbial metabolic networks may change in evolution. Here are some relevant observations from the well-studied genome of the bacterium Escherichia coli and its relatives. DNA is transferred into the E. coli genome at a rate of 64 kilobase pairs per million years [437]. With an average gene length of approximately 1 kilobase pair, this amounts to the transfer of 64 genes per million years [65]. Even different E. coli strains can differ by more than 1 Mbp of DNA, or more than 20% of their genome, and may have experienced of the order of 100 gene additions through horizontal transfer relative to other strains [567, 590]. Because some 30 percent of E. coli genes have metabolic functions, the effects of such horizontal gene transfer on metabolism are profound [65, 231]. The addition of new DNA is compensated by the deletion of other DNA, and many newly added genes reside in the genome only for short amounts of time [437, 590]. Gene turnover in microbial genomes can thus be very high. Environmental demands on the organism in general, and on metabolism in particular, play an important role in such turnover [437, 590].
Over long time-scales, the accumulated effects of this gene turnover on metabolic network organization are staggering. Figure 2.2a shows the distribution of the genotype distance D between metabolic genotypes that are representatives of 222 prokaryotic genera whose complete genome sequences are known [831]. These genomes include both bacteria and archaea, and they span a broad range of prokaryotic diversity. Each network reaction is represented through its encoding gene, as represented by orthologs of metabolic enzyme-coding genes in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database, a curated database of metabolic networks [571]. Such databases have limitations, but they are currently the only viable means to centralize and manage the enormous amounts of data required for such an analysis [831]. As defined earlier, the genotype distance D is the fraction of reactions that occur in only one, but not both, of two networks in a pair. (Networks that share no reactions have D=1.) The large mean D=0.68 suggests that two networks share on average only about one-third of their reactions. Superimposed on the distribution of Figure 2.2a is information about the diversity of prokaryotic taxa that share a similar, broadly defined habitat. Specifically, horizontal bars indicate the mean (center of bar) and standard deviation (length of bar) of genotype distance D for species in each labeled habitat type [831]. The figure shows that metabolic networks are not much less diverse in these habitats than in the whole dataset of metabolic networks. Even apparently closely related organisms can be metabolically very diverse. For example, the metabolic networks of 13 completely sequenced strains of E. coli show a mean genotype distance of D=0.36 (with a large standard deviation of 0.31) [831]. In general, the genotype distance increases with the phylogenetic distance between two species. This is illustrated in Figure 2.2b, which uses the pairwise nucleotide divergence in 16S ribosomal DNA data [178] as a measure of phylogenetic distance. The figure shows that the genotype distance D among metabolic networks is broadly associated with phylogenetic distance.
[Figure 2.2: panel (a), histogram with horizontal axis “metabolic network genotype distance D” (0 to 1) and vertical axis “number of network pairs” (0 to 1200), with horizontal bars labeled anaerobic, aquatic, aerobic, terrestrial, thermophilic, halophilic, and marine; panel (b), plot with horizontal axis “evolutionary distance (16S divergence)” (0.00 to 0.40) and vertical axis “metabolic network distance D” (0.0 to 1.0), annotated “Spearman's r = 0.39; P < $10^{-17}$; n = 2.1 × $10^4$”.]
Figure 2.2 Metabolic network genotypes are extremely diverse. (a) A histogram of the fraction D of reactions (represented by KEGG orthologs; [571]) that occur in only one network of a pair of metabolic networks. The histogram is based on all networks that occur in representatives of 222 prokaryotic genera with completely sequenced genomes. Horizontal bars indicate mean (center of bar) and one standard deviation (length of bar) of D for organisms that live in the habitat-type indicated above each bar. I obtained information on these habitat types from the National Center for Biotechnology Information (NCBI), (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi), and from genomes sequenced in the Marine Microbiology Initiative (http://www.moore.org/microgenome/strain-list.aspx) with metabolic network information in KEGG [831]. (b) For each of 222 species pairs, the fractional nucleotide divergence in 16S rDNA molecules (horizontal axis) is plotted against metabolic genotype distance D (vertical axis). The fractional divergence is calculated from a multiple alignment of 16S rDNA molecules [178]. From [831].
Genotype networks in metabolic genotype space
The great diversity of metabolic genotypes, and the rapid turnover of metabolic genes even in closely related species, suggest that metabolic networks have a very plastic organization. As we shall see, a question important for understanding evolutionary innovation is how variable a network’s organization can be, if its metabolic phenotype is not allowed to change. Unfortunately, comparative analyses like that above do little to answer this question, because they study species whose metabolic phenotypes might be very different and unknown to us, even if they live in similar habitats. The only currently available approach to address this question systematically is computational. In this approach, one changes a network with a given phenotype (as determined by flux balance analysis) in a series of steps, where each step is required to preserve the phenotype. Each step can either eliminate a reaction, and thus emulate the effect of a loss-of-function mutation in an enzyme-coding gene, or it can add a reaction chosen at random from the known universe of biochemical reactions, thus emulating the effects of horizontal gene transfer. Whether gene deletions or horizontally transferred genes become established in a population may depend on many vagaries of evolution, including environmental change and population sizes. In addition, both gene deletions and horizontal transfer may involve multiple, physically linked genes with related metabolic roles [346, 433, 436, 590, 591, 623]. In contrast, the approach just outlined provides a view on metabolic plasticity that is independent of these vagaries. By iterating many such steps of change, one can study systematically by how much a metabolic genotype can be altered while leaving its phenotype unchanged. If carried out multiple times independently to generate multiple networks, each with numerous mutational changes, this approach can generate large sets of networks that are very different from the original network, but that are all viable and have the same phenotype. One can show that such networks are random samples of metabolic genotype space with a given phenotype [670]. A good starting point for this approach is the metabolic network of a well-studied organism such as Escherichia coli, which encodes more than 700 metabolic reactions, and which can import more than 100 carbon-containing molecules [637]. These molecules can thus serve as potential carbon sources. Because glucose is the most prominent carbon source, it is sensible to first ask how different two metabolic networks can be that support cell growth (synthesis of all essential biomass molecules) on a minimal medium containing glucose as sole carbon source. To address this question, it is useful to explore genotype space through phenotype-preserving random walks similar to those just described.
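A phenotype-preserving random walk of the kind just described can be sketched in a few lines. The sketch rests on several assumptions that are not in the text: the eight-reaction universe is hypothetical, viability is judged by a qualitative producibility test that stands in for flux balance analysis, a proposed step that destroys viability is simply rejected so that the walk stays where it is, and a reaction drawn at random is deleted if present and added if absent. The sketch does not force the distance from the starting network to grow, nor does it hold network size constant.

```python
import random

# Hypothetical toy universe: name -> (substrates, products).
REACTIONS = {
    "A1": ({"glucose"}, {"X"}),
    "A2": ({"glucose"}, {"Y"}),
    "B1": ({"X"}, {"biomass"}),
    "B2": ({"Y"}, {"biomass"}),
    "C1": ({"X"}, {"Y"}),
    "C2": ({"Y"}, {"X"}),
    "D1": ({"glucose"}, {"Z"}),
    "D2": ({"Z"}, {"W"}),
}
UNIVERSE = tuple(sorted(REACTIONS))


def is_viable(network, nutrients=frozenset({"glucose"}), biomass=frozenset({"biomass"})):
    """Qualitative stand-in for flux balance analysis (an assumption of this sketch)."""
    available = set(nutrients)
    changed = True
    while changed:
        changed = False
        for name in network:
            substrates, products = REACTIONS[name]
            if substrates <= available and not products <= available:
                available |= products
                changed = True
    return biomass <= available


def random_walk(start, steps, rng):
    """Return the list of genotypes visited; every one of them is viable."""
    current = frozenset(start)
    assert is_viable(current), "the walk must start from a viable network"
    path = [current]
    for _ in range(steps):
        reaction = rng.choice(UNIVERSE)
        candidate = current ^ {reaction}   # delete if present, add if absent
        if is_viable(candidate):           # keep only phenotype-preserving steps
            current = candidate
        path.append(current)
    return path


path = random_walk({"A1", "B1"}, steps=200, rng=random.Random(0))
end = path[-1]
print(sorted(end), all(is_viable(network) for network in path))
```

Running several such walks with different seeds yields a set of viable networks whose distances from the starting network can then be compared, which is the kind of comparison the following text reports for E. coli.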
During each such random walk, one forces each mutant to increase its distance from the E. coli starter network. To facilitate comparison among networks, it is also convenient to leave the total number of network reactions approximately constant during any one random walk [652]. Figure 2.3a shows the distribution of genotype distances D from the E. coli starting
network for the end-points of 1000 independent such random walks comprising 10⁴ reaction changes each [652]. The mean fraction of shared
reactions (1–D) is less than 0.24, and no two networks have more than 30 percent of reactions in common. This means that metabolic networks with vastly different organization, sharing only a small percentage of reactions, can support life on a glucose minimal environment. The random walks in this approach use a limited number of 10⁴ reaction changes (“mutations”), and 5870 well-characterized metabolic reactions from the known reaction universe [652]. The differences among networks would be even larger if we had allowed more reaction changes, and if we had considered more chemical reactions from the reaction universe. They are also larger for a complex medium that contains more than the bare minimum of nutrients [652]. The reason is that more complex environments provide more nutrients, and thus render non-essential some chemical reactions that are essential in a minimal environment. These reactions are then free to vary, which further increases the plasticity of metabolic genotypes and their maximally feasible genotype distance. What holds for glucose-minimal environments also holds for metabolic phenotypes more generally, that is for genotypes that can grow on each of n different molecules as sole carbon sources. Starting from a network like that of E. coli, one can readily generate, through random addition or deletion of metabolic reactions, metabolic genotypes that grow on n specific alternative molecules as sole carbon sources. Any one such network can then serve as the departure point for phenotype-preserving random walks like those described above. One might think that a higher value of n would constrain change in a metabolic network, because such a network needs more reactions to use all of its n alternative carbon sources. However, the differences between networks with the same phenotype are almost as high as in the previous, simpler environment (Figure 2.3a), and they do not depend strongly on n. Figure 2.3b shows the maximal distance of genotypes with the same phenotype, averaged over several phenotypes capable of growing on the same number of carbon sources n. From n=1 (glucose) to n=60 (networks able to grow on 60 alternative sole carbon sources), this maximal distance decreases by merely 10 percent. Taken together, these observations show that metabolic genotypes with the same carbon-metabolizing metabolic phenotypes are organized into genotype networks that span more than 75 percent of a vast metabolic genotype space. To avoid confusion, it is worth highlighting the distinction between a metabolic network and a genotype network here. A metabolic network is one point in metabolic genotype space. A genotype network is a collection of metabolic networks with the same metabolic phenotype, that is, a network of networks. One can use statistical properties of phenotype-preserving random walks to estimate the size of a genotype network. Such estimates show that a genotype network typically contains astronomically many metabolic networks [670].
Figure 2.3 Extended networks of genotypes with the same metabolic phenotype. (a) Distribution of maximum genotype distances between 1000 metabolic networks that are the end-points of random walks leading away from an initial network, while preserving this network’s ability to sustain life in a minimal environment with glucose as the sole carbon source. (b) Maximum genotype distances (vertical axis) between initial metabolic networks able to sustain life on a given number of carbon sources (horizontal axis) and networks derived from them through long random walks of 10⁴ random mutations that leave the initial phenotype invariant. The data in (b) are based on 10 different phenotypes (and 10 different initial networks) for each number of carbon sources, and on 10 phenotype-preserving random walks for each initial network. Each circle on the plot is thus based on an average over 100 networks. The 95% confidence intervals are shorter than D=0.01 for each data point, and are therefore not drawn here [652]. I note that the distance measure D used here differs somewhat from the Hamming distance of two binary strings, which measures the fraction of entries at which two binary strings differ. Specifically, it does not take into account all the reactions that are absent in both metabolic networks, and therefore its maximal value of 1 is not the same as the diameter of genotype space. All known metabolic networks contain only a small fraction of the total number of reactions in the known reaction universe [571]. Even if two such networks differed in all their reactions, there would still be many reactions that are absent in both networks. For this reason, D is more appropriate than the Hamming distance to characterize the maximal distance among real metabolic networks. I emphasize that the qualitative observations I make here do not depend on the number of reactions in a network [670].
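The phenotype-preserving random walk described in this section can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not the procedure of [652]: viability is decided by simple reachability of a biomass metabolite instead of flux balance analysis, the reaction universe is a small made-up set, network size is not held constant, and the normalization of the distance D (shared reactions divided by the size of the larger network) is my own guess, since the text only states that 1–D is the fraction of shared reactions. All names (`viable`, `distance`, `random_walk`, the metabolite labels) are hypothetical.

```python
import random

def viable(network, nutrient="glc", biomass="bio"):
    """Toy stand-in for flux balance analysis (assumption): a network is
    viable if the biomass metabolite is reachable from the nutrient."""
    reached = {nutrient}
    changed = True
    while changed:
        changed = False
        for substrate, product in network:
            if substrate in reached and product not in reached:
                reached.add(product)
                changed = True
    return biomass in reached

def distance(a, b):
    """D = 1 - shared fraction; reactions absent from both networks are
    ignored. Normalizing by the larger network is an assumption."""
    return 1.0 - len(a & b) / max(len(a), len(b))

def random_walk(start, universe, steps, rng):
    """Add or delete one random reaction per step; keep the change only
    if the toy phenotype (viability) is preserved."""
    current = set(start)
    for _ in range(steps):
        candidate = set(current)
        if rng.random() < 0.5 and len(candidate) > 1:
            candidate.remove(rng.choice(sorted(candidate)))  # loss of function
        else:
            absent = sorted(universe - candidate)
            if not absent:
                continue
            candidate.add(rng.choice(absent))  # horizontal gene transfer
        if viable(candidate):
            current = candidate
    return current

# Hypothetical reaction universe: every directed conversion among six metabolites.
METABOLITES = ["glc", "a", "b", "c", "d", "bio"]
UNIVERSE = {(s, p) for s in METABOLITES for p in METABOLITES if s != p}
START = {("glc", "a"), ("a", "bio")}

end_point = random_walk(START, UNIVERSE, steps=500, rng=random.Random(0))
print(sorted(end_point), distance(START, end_point))
```

Repeating the walk from many seeds and collecting `distance(START, end_point)` would give a toy analogue of the distribution in Figure 2.3a, although the numbers themselves carry no biological meaning.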
Metabolic networks typically have many neighbors with the same phenotype
At the very least, the existence of genotype networks requires that metabolic networks generally have more than one neighbor with the same phenotype. If that were not so, the set of genotypes with the same metabolic phenotype would contain many, if not mostly, isolated networks, that is, metabolic networks with no neighbors of the same phenotype. Metabolic networks exceed this minimal number of necessary neighbors dramatically, because their number of neighbors with the same or similar phenotype is typically much greater than one. This is already the case for the simple phenotype of growth on glucose as a sole carbon source. For example, for the E. coli metabolic network, adding any one of more than 5000 possible biochemical reactions would not impair this ability. (Some of these new reactions might even endow the E. coli metabolic network with the ability to grow on new carbon sources.) Deletions of reactions may be a different matter. However, the deletion of only 29 percent (210) of 726 metabolic reactions abolishes this ability. In other words, 96.4 percent ((5870–210)/5870) of the neighbors of the E. coli metabolic network preserve the ability to grow on glucose as a sole carbon source. Reactions whose removal destroys viability in a given environment are essential in that environment. The E. coli network has only 29 percent of reactions that are essential for viability on glucose. It is similar in this respect to randomly sampled networks with the same phenotype. Figure 2.4 shows the distribution of the fraction of essential reactions for 10³ metabolic networks that are able to sustain life on glucose, but whose content of metabolic reactions was randomized with respect to that of E. coli [670]. The proportion of essential reactions in these 1000 random viable networks ranges from 25 to 32 percent, and the 29 percent of E. coli fall within this range. This observation implies that a network with the ability to grow on glucose generally has many neighbors with the same ability. The same observation, that metabolic networks have many neighbors with the same phenotype, also holds for the ability to sustain life in environments containing any one of n alternative carbon sources [652, 670]. It merits emphasizing that none of the key observations I have made thus far (the existence of extended genotype networks and the abundance of a network’s neighbors with the same phenotype) are specific to E. coli. They are general properties of carbon metabolism. The E. coli network simply is a convenient starting point to explore these properties.
Figure 2.4 The fraction of essential reactions in a metabolic network is typically small. The vertical axis shows the number of genotypes with a given proportion of essential reactions, as indicated on the horizontal axis, in a random sample of 1000 genotypes viable in a glucose minimal environment [670]. The number of reactions in each network is equal to that in E. coli. Error bars were estimated by a jackknife procedure [670].
The number of metabolic genotypes with a given phenotype varies depending on phenotype
Arguably, the greater the number n of alternative sole carbon sources is, the smaller should be the number of metabolic networks that can sustain growth on every one of these n carbon sources. The reason is that an increase in the number of alternative carbon sources should require an increase in the number of reactions needed to metabolize these carbon sources; Figure 2.5 shows that this is indeed the case. The figure is based on 1000 random metabolic networks verified to be viable in a glucose minimal environment [670]. For each such network, we asked on how many additional carbon sources the network was viable. The distribution in Figure 2.5 shows the answer. Two features are apparent. First, networks viable on glucose are often also viable on a few additional carbon sources. The likely reason is that viability on glucose requires particular sets of chemical reactions that are also sufficient to sustain viability on other carbon sources. Second, and more importantly, as the number of additional carbon sources increases beyond two, the number of viable networks (on all of these carbon sources) decreases rapidly. Thus, required growth on increasing numbers of carbon sources generally reduces the number of viable genotypes.
Figure 2.5 Genotype network size depends on metabolic phenotype. Data in this figure are based on a sample of 1000 random metabolic networks that are viable on a minimal environment containing glucose as its sole carbon source. For each of these networks, we evaluated whether the network was also viable on 88 other known carbon sources [670]. The histogram indicates the number of metabolic genotypes that were also viable on the number of additional carbon sources shown on the horizontal axis. (A value of zero on this axis would indicate networks that were viable only on glucose, and on no other of the 88 other carbon sources.) Similar histograms can be obtained for other minimal environments [670].
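The figure of 96.4 percent quoted earlier for the E. coli network's phenotype-preserving neighbors can be checked with a short calculation. The sketch below uses only numbers stated in the text (5870 reactions in the universe, 726 in the E. coli network, 210 essential ones). It assumes, as the text does, that every single-reaction addition preserves growth on glucose, and that each reaction in the universe corresponds to exactly one neighbor (an addition if absent, a deletion if present). The variable names are mine.

```python
UNIVERSE = 5870         # reactions in the known universe used in the text
ECOLI_REACTIONS = 726   # reactions in the E. coli network
ESSENTIAL = 210         # deletions that abolish growth on glucose

# Neighbors reached by adding one reaction that the network lacks.
additions = UNIVERSE - ECOLI_REACTIONS
# Neighbors reached by deleting one non-essential reaction.
viable_deletions = ECOLI_REACTIONS - ESSENTIAL

# One neighbor per reaction in the universe: add it if absent, delete it if present.
neighbors = additions + ECOLI_REACTIONS
preserving = additions + viable_deletions

fraction_preserving = preserving / neighbors
fraction_essential = ESSENTIAL / ECOLI_REACTIONS

print(f"additions: {additions}, viable deletions: {viable_deletions}")
print(f"phenotype-preserving neighbors: {100 * fraction_preserving:.1f}%")
print(f"essential reactions: {100 * fraction_essential:.0f}%")
```

The count of possible additions, 5144, matches the text's "more than 5000", and the two fractions round to the 96.4 and 29 percent given there.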
The analyses I have discussed thus far require evaluation of multiple carbon phenotypes for tens of thousands of metabolic networks. They are at the limit of current computational technology. Not surprisingly then, they leave unanswered questions. Perhaps the most fundamental one is whether all genotypes with the same phenotype are connected in genotype space. The thousands of metabolic networks that the random sampling approach generates are but a minute fraction of the set of all networks with the same phenotype. Do all networks in this set belong in the same genotype network? Do they fall into multiple genotype networks (and if so, how many)? Does this set include isolated metabolic networks with no neighbors that are also in the set? We do not know the answer for the phenotypes I have analyzed thus far. However, we have an answer for a slightly different notion of phenotype. As opposed to defining a
phenotype as requiring viability on exactly n specific sole carbon sources, let us define it through viability on at least these n carbon sources (it may be viable on more than these n sources). It is then easy to see that any two networks with the same phenotype thus defined are part of the same genotype network. Consider two networks R1 and R2. Their union R1 ∪ R2 is a metabolic network with the same phenotype, because addition of any number of reactions to either network R1 or R2 will not abolish its ability to grow on any one carbon source. It is thus possible to design a series of reaction additions that lead from R1 to R1 ∪ R2, and a series of subsequent deletions that lead from R1 ∪ R2 to R2 (or vice versa), such that each addition or deletion will preserve the metabolic phenotype [652]. Therefore, all networks with a phenotype thus defined form a single genotype network.
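The connectivity argument above is constructive, and the path it describes can be written down directly. The following minimal sketch represents a network as a set of reaction labels and builds the path from R1 through R1 ∪ R2 to R2. The function name and the example labels are hypothetical. The sketch does not test viability; it only relies on the text's premise that adding reactions never abolishes growth on a carbon source, so every superset of R1 or of R2 keeps the "at least n carbon sources" phenotype.

```python
def path_via_union(r1, r2):
    """Networks visited when walking from r1 to r2: first add every
    reaction of r2 that r1 lacks, then delete every reaction r2 lacks.
    Each step changes exactly one reaction."""
    current = set(r1)
    path = [frozenset(current)]
    for reaction in sorted(set(r2) - set(r1)):   # additions: r1 -> r1 | r2
        current.add(reaction)
        path.append(frozenset(current))
    for reaction in sorted(set(r1) - set(r2)):   # deletions: r1 | r2 -> r2
        current.remove(reaction)
        path.append(frozenset(current))
    return path

# Hypothetical reaction labels, for illustration only.
R1 = {"rxn_a", "rxn_b", "rxn_c"}
R2 = {"rxn_c", "rxn_d"}

for network in path_via_union(R1, R2):
    print(sorted(network))
```

Every network on the addition leg contains R1, and every network on the deletion leg contains R2, which is why no step can lose the phenotype under the stated premise.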
The evolution of novel metabolic phenotypes
I will next turn to a central question about metabolic innovations. Does the existence of extended genotype networks facilitate the discovery of new metabolic phenotypes? To begin addressing this question, we need to study the neighborhoods of metabolic networks. Neighborhoods are important for innovation, because they contain all genotypes that are only one or a few small genetic changes (changes in one or a few individual reactions) away. They comprise genotypes that are easily accessible to evolution, much more easily than the rest of a vast genotype space. As I have discussed, the neighborhood of any metabolic network contains many networks with the same phenotype. However, it also contains networks with different phenotypes. Some (perhaps small) fraction of these variants would constitute evolutionary innovations. Consider two different genotypes on the same genotype network (Figure 2.6). Are the phenotypes occurring in their neighborhoods similar? If so, a genotype’s location on a genotype network does not influence the potential metabolic innovations accessible to this genotype. Conversely, if these phenotypes are very different, this location becomes important, and so does the very existence of a genotype network for evolutionary innovation. Why?
Figure 2.6 Two neighborhoods on a genotype network. A highly schematic drawing of part of a genotype network (black lines), two genotypes G1 and G2 on it, and their neighborhoods (circles). A neighborhood of a genotype G comprises all genotypes whose distance D to G does not exceed some small value, such as D=1. Note that genotype spaces are high-dimensional, and two-dimensional drawings like this are mere caricatures of structures in this space. In particular, genotype networks are much more reticulate than the drawing indicates.
Because in this case, a series of small genotypic changes that preserve the phenotype can transform a metabolic network’s genotype and permit exploration of very different regions on a genotype network. During this exploration, the metabolic network can gain access to the very different novel phenotypes that occur in different neighborhoods of this genotype network, and that are just one or few mutations away from it. (In biological evolution, genotype space is explored by populations, not individual networks, but the basic principle I just highlighted still applies. I will talk more about populations in Chapter 5.) To ask whether different genotype neighborhoods contain different novel phenotypes, one can examine the neighborhoods of multiple genotypic end-points of long phenotype-preserving random walks starting from the same metabolic network [652]. Specifically, one can study those novel phenotypes that occur only in the neighborhood of one but not the other genotype, as indicated by the gray section of the inset in Figure 2.7a. The distribution shown in this figure is the fraction of phenotypes unique to the neighborhood of one member of a metabolic network pair. The figure is based on 4950 random metabolic network pairs with the same phenotype. It shows that the majority of neighborhoods are completely different (right-most bar in the figure). In other words, they do not have a single novel metabolic phenotype in common. More generally, the vast majority of neighborhood pairs share only a small fraction of new metabolic phenotypes. A complementary analysis asks how the composition of a network neighborhood changes as one changes a network relative to some starting network G0, while preserving its phenotype. Specifically, what fraction of phenotypes are unique to the neighborhood of an evolving network Gt relative to the neighborhood of its ancestor G0, as a function of the number of changed reactions t that distinguish Gt from G0? 
Figure 2.7b shows the answer. The fraction of unique phenotypes increases initially very rapidly with the number of changed reactions, and it approaches some asymptotic value where more than 70 percent of phenotypes are unique to a neighborhood. The data in Figures 2.7a and 2.7b are based on networks with different metabolic phenotypes,
where on average all phenotypes are able to grow on five alternative sole carbon sources. The proportion of phenotypes unique to a neighborhood is even greater for networks that can grow on more than these five alternative carbon sources [652]. A third type of analysis asks how many different phenotypes become accessible, through single mutations, to a metabolic network that changes over time but preserves its phenotype. This analysis carries out a phenotype-preserving random walk through metabolic genotype space, starting from a network with a specific metabolic phenotype. During this random walk one counts the cumulative number of unique phenotypes that occur in the neighborhood of this random walker. This means that if a phenotype occurs twice or more, either in the neighborhood of the same network, or in the neighborhood of a network encountered previously during the random walk, it is only counted once. Figure 2.7c shows this cumulative number of phenotypes. It varies with the complexity of the phenotype, and increases if one requires a metabolic network to grow on an increasing number of alternative carbon sources. The most important feature of this figure, however, is independent of the number of carbon sources. The cumulative number of accessible novel phenotypes increases steadily with no sign of reaching a plateau, even though the number of mutational changes to the network is enormous (10⁴). Such a plateau would of course have to be reached eventually, because the number of phenotypes, albeit enormous (≈2¹⁰⁰), is finite. Nonetheless, the analysis suggests that it would take an enormous amount of time to reach this plateau, and that a metabolic network, by evolving on a genotype network, can gain access to a virtually inexhaustible reservoir of new phenotypes [652].
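The two quantities used in these analyses, the fraction U of phenotypes unique to one neighborhood and the cumulative number of unique phenotypes along a random walk, reduce to simple set operations. The sketch below illustrates them under assumptions: a phenotype is represented as a frozen set of carbon sources, a neighborhood as the set of novel phenotypes found among a network's neighbors, and U is normalized by the size of the first neighborhood, which is my reading and may differ from the exact normalization in [652]. Function names and example data are hypothetical.

```python
def unique_fraction(neighborhood_a, neighborhood_b):
    """Fraction of phenotypes in neighborhood_a that are absent from
    neighborhood_b (normalization by |neighborhood_a| is an assumption)."""
    if not neighborhood_a:
        return 0.0
    return len(neighborhood_a - neighborhood_b) / len(neighborhood_a)

def cumulative_unique(neighborhoods_along_walk):
    """Running count of distinct phenotypes seen in the neighborhoods of
    successive networks; a phenotype seen before is not counted again."""
    seen = set()
    counts = []
    for phenotypes in neighborhoods_along_walk:
        seen |= phenotypes
        counts.append(len(seen))
    return counts

def phenotype(*carbon_sources):
    """A phenotype: the set of sole carbon sources that support growth."""
    return frozenset(carbon_sources)

# Made-up novel phenotypes in the neighborhoods of two genotypes G1 and G2.
neigh_g1 = {phenotype("glucose", "melibiose"), phenotype("glucose", "galactitol")}
neigh_g2 = {phenotype("glucose", "galactitol"), phenotype("glucose", "lactose")}

print(unique_fraction(neigh_g1, neigh_g2))
print(cumulative_unique([neigh_g1, neigh_g2, neigh_g1]))
```

A curve like that in Figure 2.7c would come from feeding `cumulative_unique` the neighborhoods of every network visited during a long phenotype-preserving walk.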
The above analyses imply the following: whether any one evolutionary innovation is accessible from a metabolic network with a given phenotype depends strongly on the metabolic network’s location on its genotype network. Two typical examples make this abstract observation more concrete [652]. The carbon source melibiose is a sugar similar to lactose. Specifically, it is made of the same two monosaccharides, galactose and glucose, but differs in the (glycosidic) link between them. While lactose
Figure 2.7 Different neighborhoods of a genotype network contain very different phenotypes. (a) The distribution of the fraction of different phenotypes that occur in the neighborhood of one, but not the other, network in a pair of random metabolic networks with the same phenotype. The two circles in the inset stand for the sets of phenotypes in the two neighborhoods, where phenotypes that would occur in only one but not the other neighborhood have gray shading and are labeled with the letter “U” for unique. Both networks in a pair are obtained independently from a “starter” network through two long phenotype-preserving random walks of 10⁴ reaction changes that randomize the starter network. (b) How the fraction of phenotypes unique to one neighborhood (vertical axis) changes as a function of the number of mutations (horizontal axis), during a phenotype-preserving random walk that starts at some genotype G0. Here, the neighborhood of G0 is compared to the neighborhoods of networks Gt that have experienced t reaction changes. As the inset below the data illustrates, the number of phenotypes unique to the neighborhood of Gt increases dramatically early during such a random walk, until more than 70 percent of phenotypes are unique to a neighborhood. Whiskers represent 95% confidence intervals. The data in (a) and (b) are based on networks that can sustain life on 5 alternative sole carbon sources [652]. (c) The average cumulative number of unique phenotypes (vertical axis) found in the neighborhood of an evolving network as a function of the number of reaction changes (horizontal axis) the network experienced during its evolution. This cumulative number is displayed for networks that can sustain life on varying numbers of alternative carbon sources (5, 10, 20, and 40). Data in all panels are averages over 10 independently generated initial networks, and over 10 random walks of 10⁴ mutations starting from each of these 10 networks [652].
can be metabolized by many microbes, melibiose is a less commonly usable carbon and energy source. The metabolization of lactose and melibiose also requires different enzymes (α-galactosidase for melibiose and β-galactosidase for lactose). Metabolic networks that can use melibiose are important for biotechnology. For example, yeast cells engineered to utilize melibiose improve efficiency and reduce waste in fermenting dairy products [84]. Among metabolic networks with identical phenotypes, there are networks where adding the α-galactosidase reaction is sufficient to endow the network with melibiose utilization [652]. In contrast, in other networks with the same phenotype this reaction does not suffice, because such networks lack a reaction necessary to excrete the excess galactose from the degradation of melibiose. An unrelated example involves a metabolic network’s viability on galactitol, a molecule similar to galactose. In some metabolic genotypes G that are viable on galactose, the addition of a single enzymatic reaction leads to viability on galactitol as well. The reaction in question causes the transfer of a phosphate group from a phospho-histidine to galactitol. The reaction produces galactitol-1-phosphate, which other “downstream” enzymes can transform into galactose, on which the network is viable. In other metabolic networks with the same carbon utilization phenotype as G, the addition of the same reaction does not lead to viability on galactitol. The reason is that in these networks, the “downstream” reactions needed for conversion into galactose are absent [652]. These examples illustrate the mechanistic reasons why not all metabolic innovations are equally accessible from two different metabolic networks with the same phenotype. They complement the statistical perspective on genotype space organization I developed here. This perspective reveals that different neighborhoods of a genotype network contain different metabolic phenotypes.
The mechanistic reasons will vary among phenotypes.
Metabolic networks with different phenotypes are close together in genotype space
A last question I will discuss is how far one would have to travel through genotype space to reach a network with an arbitrary (but predetermined)
novel metabolic phenotype. This question is about the shortest distance between different genotype networks in genotype space. The smaller this shortest distance is, the easier it is to reach novel phenotypes. To estimate this distance, one creates a pair (G1, G2) of metabolic network genotypes with arbitrary but different phenotypes; for example, through a random search for these phenotypes in genotype space [652]. One then carries out a random walk that starts from G1 and that approaches G2 in genotype space, while leaving G1’s phenotype unchanged. After a large number of steps in this random walk, one records a final distance. By repeating this procedure with many metabolic network pairs, each with different phenotypes, one can estimate how far apart two genotype networks are. I note that this estimate is an upper bound, because the actual distance may be lower, but the procedure may not have found this lowest distance. Figure 2.8 shows the distribution of the minimal distance thus estimated for 1000 phenotype pairs. The phenotypes of these metabolic network pairs share only the property that they can sustain life on five different (randomly chosen) carbon sources. Notably, the distance in this distribution is typically small, of the order of one-tenth of the maximally possible distance, or D≈0.1 [652]. This may not seem very small, but that appearance changes if one asks what fraction of genotype space a neighborhood of this size contains. Because of the counterintuitive geometry of genotype spaces (more about that in Chapter 6), a neighborhood around a genotype with radius D=0.1 would contain a vanishing fraction, much less than 10⁻⁵⁰⁰, of all genotypes in our metabolic genotype space. This means that only a very small fraction of genotypes would need to be explored to encounter an arbitrary, specific novel metabolic phenotype.
In closing this section, I note that the minimal genotype distance I have just discussed depends only weakly on the number of alternative carbon sources a network is required to sustain growth on [670]. For example, as this number is increased from 5 to 40 different carbon sources, the smallest observed distance increases from 0.08 to only 0.1 [652]. Thus, complex constraints on metabolism do not dramatically increase the difficulty of finding specific, novel phenotypes.
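To get a feeling for why a neighborhood of radius D=0.1 holds such a vanishing fraction of genotype space, one can count genotypes in a simplified setting. The sketch below is a rough illustration only, under assumptions of mine: genotypes are binary presence/absence strings over the 5870 reactions of the universe used in the text, and distance is measured as Hamming distance, which the author notes is not the same as D. The neighborhood is taken to be all strings that differ from a focal genotype at no more than 10 percent of positions. The function name is hypothetical.

```python
import math

def log10_ball_fraction(n_reactions, radius):
    """log10 of the fraction of all binary genotypes of length n_reactions
    that lie within Hamming distance `radius` of one focal genotype."""
    # Exact integer count of genotypes in the ball: sum of binomial coefficients.
    ball = sum(math.comb(n_reactions, k) for k in range(radius + 1))
    # Work in logarithms, because both counts are far beyond floating-point range.
    return math.log10(ball) - n_reactions * math.log10(2)

N_REACTIONS = 5870              # size of the reaction universe in the text
RADIUS = N_REACTIONS // 10      # 10 percent of positions (an assumption)

exponent = log10_ball_fraction(N_REACTIONS, RADIUS)
print(f"fraction of genotype space in the ball: about 10^{exponent:.0f}")
```

Even in this crude model the exponent lies far below -500, which is in line with the qualitative claim in the text. The precise value should not be read as the author's estimate.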
Figure 2.8 Metabolic networks with different phenotypes can have similar genotypes. The distribution of minimal genotype distance between pairs of genotype networks with different metabolic phenotypes. Specifically, phenotypes in the analyzed pairs are able to sustain life on exactly 5 carbon sources, but these carbon sources are different within and among pairs. They are randomly chosen from 101 possible carbon sources that E. coli cells can import. Data are based on random walks of at most 10³ mutations for each phenotype pair [652].
Summary
I have here focused on the most fundamental metabolic innovations, new metabolic phenotypes that allow an organism to survive on new sources of chemical elements and energy [652, 653, 670]. Metabolic networks with the same phenotype are connected in vast genotype networks that extend far through metabolic genotype space. The extent to which metabolic networks can change their genotype, while preserving their phenotype, is dramatic. The ability of evolution to preserve phenotype, while changing genotypes, is crucial to “discover” many novel phenotypes via random
changes in a network’s reactions. The reason is that neighborhoods of different metabolic networks with the same phenotype contain metabolic networks with very different new phenotypes. In addition, the distance between genotype networks of arbitrary different phenotypes is typically small. These properties are not peculiarities of metabolism in one organism, but general features of metabolic network space. Beyond them, much remains to be discovered about genotype networks, including their size distribution, and their organization in metabolic genotype space.
CHAPTER 3
Innovation through regulation
Genotype: The DNA sequences determining (cis)regulatory interactions among regulatory proteins.
Phenotype: Gene expression or activity patterns produced via these interactions.
We have seen in Chapter 1 that regulatory changes underlie innovations as different as the transparent lenses of eyes, the dissected leaves of plants, and the eyespots of butterflies. As I stated there, I refer to regulation as any process that influences the abundance or activity of a gene product. This definition is very broad. It encompasses changes in enzyme activity caused by small molecules, changes in the activity of signaling proteins through phosphorylation, changes in messenger RNA concentration caused by RNA interference, changes in the rate at which a gene is transcribed, and many others. Transcription is the process in which DNA is copied into RNA, such as the messenger RNA that encodes proteins. Transcription is highly regulated. My earlier examples of regulatory innovation involved the regulation of transcription, as do many others. This is not a coincidence. Most regulatory processes ultimately affect transcriptional regulation. The transcription of metabolic enzymes increases when they are needed [13, 173, 490]; hormonal signals arriving at a cell’s surface trigger a complex regulatory process that, ultimately, activates or represses transcription of genes [13, 293]; and complex circuits of transcriptional regulators drive embryonic development in organisms as different as fruit flies and humans [104, 340, 342]. One can think of transcriptional regulation as a regulatory backbone that integrates many different kinds of regulatory phenomena. Because of this centrality of transcriptional regulation, I will focus on it in this chapter. However, systems involving other kinds
of regulation show phenomena similar to those I discuss here (Chapter 14), although they are less well characterized. The evolution of transcriptional regulation is difficult to study experimentally. Transcriptional regulation is mediated by the binding of proteins (transcriptional regulators) to specific DNA sequences in the vicinity of a regulated gene. These DNA sequences (transcription factor binding sites or cis-regulatory elements) can be as short as five base pairs. Mutations can thus create and destroy them rapidly on short evolutionary time-scales. The DNA regions in which they regulate transcription can span hundreds of kilobase pairs upstream and downstream of a regulated gene. In addition, cis-regulatory elements can often function regardless of their orientation and distance from a regulated gene [13]. Finally, the DNA regions surrounding them often evolve rapidly, not only through changes of individual nucleotides, but through insertions and deletions of large swaths of DNA [261, 470, 483, 490, 655, 740, 758, 787, 861]. These phenomena make it challenging to characterize not only the transcriptional regulation of individual genes, but also evolutionary changes in this regulation. As if that were not enough complexity, transcriptional regulators form intricate regulatory circuits whose individual regulators mutually regulate each other’s expression, and that of many target genes [340, 342, 355, 440, 698, 787]. Such combinatorial regulation further compounds the difficulty of studying the evolution of transcriptional regulation. It means that to understand the evolution of transcriptional regulation one must understand the evolution of such circuitry, and not just that of individual regulators. Because regulatory evolution is difficult to study experimentally, the principles of innovation involving the regulation of gene activity are very poorly
33
34
T H E O R I G I N S O F E V O L U T I O N A RY I N N O V A T I O N S
understood, although we know many examples of such innovation. For the same reason, a systematic analysis of new regulatory phenotypes must currently rely heavily on computational modeling. However, some exceptions exist where empirical evidence is beginning to accumulate. I will discuss them in the next section.
Many ways to achieve the same gene expression phenotype

In Chapter 2, I discussed how the ability to change a genotype without changing a phenotype was crucial for innovation. Recent evidence shows that this ability also exists for regulatory genotypes (genes encoding transcriptional regulators and their binding sites on DNA) and for their phenotypes, the gene expression patterns they produce. Below, I will first discuss this empirical evidence. To address the many questions this evidence leaves unanswered, I will then discuss a computational model that captures general properties of transcriptional regulation circuits. This and other models show that regulatory genotypes may be connected in vast genotype networks that provide ready access to a broad spectrum of novel gene activity phenotypes.

As I mentioned above, regulatory DNA can evolve very rapidly, for reasons such as its small size and frequent location-independence. In contrast, the gene expression phenotypes mediated by this changing DNA often remain unchanged. By gene expression phenotypes I mean specific patterns of gene expression, either in space or in time, produced through transcriptional regulation. The following three examples illustrate that such phenotypes can remain unchanged despite drastic changes in how the affected genes are regulated. These examples come from well-studied microbial organisms. This is not a coincidence: regulatory evolution is studied most easily in such organisms, with their small genomes and short regulatory regions. Similar phenomena, however, exist in higher organisms [177, 426, 470–472, 482, 483, 655, 861].

Most organisms metabolize the sugar galactose by converting it into glucose-6-phosphate, which they then feed into glycolysis. In the yeast Saccharomyces cerevisiae and its distant relative Candida albicans, three of the enzymes necessary for this conversion are regulated transcriptionally [490]. Their expression increases in response to galactose. In S. cerevisiae, this increase is mediated by the transcriptional activator Gal4p, which binds to a specific sequence motif in the vicinity of these genes. Gal4p itself is regulated by the proteins Gal80p and Gal3p. In stark contrast to S. cerevisiae, C. albicans does not regulate the expression of the three enzymes through Gal4p, but instead through a completely different transcriptional regulator [490]. This regulator has a counterpart in S. cerevisiae. Strikingly, however, this counterpart is not involved in the galactose metabolism of S. cerevisiae. Instead, it regulates genes involved in mating behavior. Conversely, there also exists a protein related to Gal4p in C. albicans. Again, this protein does not regulate the galactose metabolic genes of C. albicans, but genes unrelated to galactose metabolism. Overall, the similar regulation of galactose metabolism in two different organisms is achieved by very different mechanisms. Some species intermediate between S. cerevisiae and C. albicans show an intermediate mode of regulation that may involve both regulators. This observation hints at how this transition in regulatory mechanisms has occurred [490].

A second example regards the a and α mating types in yeast cells [786]. These mating types are yeast’s two different sexes. Each of them expresses a set of mating-type specific genes. Different yeasts regulate these genes’ expression in profoundly different ways. For example, in C. albicans a transcriptional activator is responsible for expression of a-specific genes. Their unexpressed state in α-cells is the default state. In contrast, in S. cerevisiae, the a-specific genes are expressed by default, and they must be transcriptionally repressed in α-cells. These very different modes of regulation have evolved through a series of small genetic changes in transcription factors and regulatory regions that left the regulatory phenotype intact [786].
A third example regards ribosomal protein coding genes, classical examples of “housekeeping” genes with ubiquitous functions. The expression of such genes needs to be precisely coordinated in order to allow rapid cell growth and proper cell maintenance. One might think that such precise coordination can be accomplished in only one optimal way. As it turns out, it can be achieved through quite different regulatory mechanisms. For example, the
distantly related yeasts Saccharomyces cerevisiae and Schizosaccharomyces pombe use completely different transcription factor binding sites and different transcription factors to accomplish this coordination. Species intermediate between the two show intermediate forms of regulation [758]. These examples illustrate that similar gene expression phenotypes can be produced through different regulators in different species. In all three examples, intermediate species display intermediate forms of regulation, showing how radically different regulation can be accomplished through gradual changes on evolutionary time-scales [490, 758, 786]. Examples of different regulatory mechanisms with similar phenotypic outcomes also abound in higher organisms [177, 426, 470–472, 482, 483, 655, 847, 861, 891]. They affect many genes expressed in embryonic development, such as the gene even-skipped in fruit flies, or the Endo 16 gene from sea urchins [470, 472, 655]. Their common theme is that one or more genes show similar expression patterns despite very different transcriptional regulation. These studies use a comparative approach. They analyze events that unfolded over time scales of many million years. Together, they hint at one property of regulatory systems: in order to change a regulatory mechanism without changing an expression phenotype, the expression phenotype must be robust to changes in regulation; otherwise, it would not be possible to gradually but completely transform regulatory mechanisms (genotypes) while preserving gene activity phenotypes. A recent study on smaller, laboratory time-scales dramatically demonstrates such robustness on a genome-wide scale [355]. This study focused on the transcriptional regulation circuit of the bacterium Escherichia coli. This circuit includes some 300 transcriptional regulators. The target genes of these regulators comprise a broad spectrum of E. 
coli genes, including some of the genes encoding the transcriptional regulators themselves. The authors of the study “rewired” this circuit in many ways. Specifically, they introduced some 600 new transcription factor–target gene interactions into the circuit. It is intriguing that E. coli cells readily tolerated 95% of these new interactions, some of which can indirectly affect the regulation of many genes. E. coli cells rewired in this way grow at a rate very similar to that of unmanipulated cells in a laboratory environment [355].

Despite the impressive scale of this experiment, it is worth noting that no laboratory experiment can strictly prove that a regulatory (or any other genetic) change is neutral, that is, that it does not affect fitness. There are two reasons. First, very subtle changes in fitness are undetectable in the laboratory [763]. Second, some fitness effects might manifest themselves only in one of the many different environments that organisms experience in the wild, but that laboratory experiments do not explore [826]. Nonetheless, the radical genotypic changes observed in the comparative studies above, together with the latter laboratory observations, raise the possibility that such neutral changes facilitate long-term genotypic change.
Transcriptional regulation circuits

The examples I have just discussed share an important feature: they demonstrate that regulatory genotypes can change profoundly without affecting gene activity phenotypes. However, empirical case studies like these are also limited in several respects. First, they cannot address whether this phenomenon might be a general property of transcriptional regulatory circuits. Second, they do not explicitly ask about regulatory innovation, the evolution of new expression phenotypes. Third, except for the last example, they all focus on the target genes of regulation, not on the regulators themselves. Transcriptional regulators, however, often crossregulate each other’s activity and thus form regulatory circuits, with complex dynamics of gene expression [15, 104]. Such circuits are important not only in the physiology of single cells, they are also central to the embryonic development of main body axes and their segmentation, of body appendages such as limbs, and of many other structures, such as the central nervous system [104, 268, 355, 440, 654, 698]. Because of their importance, they are also involved in various evolutionary innovations [103, 104, 342, 701, 784]. Such circuits thus merit special attention.

Figure 3.1 (a) A transcriptional regulation circuit’s regulatory genotype. Solid horizontal black bars indicate genes that encode transcriptional regulators in a hypothetical 5-gene circuit. Each gene is expressed at a rate that is determined by the transcriptional regulators in the circuit. Each regulator typically exerts its influence on a gene by binding to the gene’s regulatory region (horizontal line). The model represents the regulatory interactions between transcription factor j and genes i through a matrix w = (wij). A regulator’s effect can be activating (wij > 0, patterned rectangles) or repressing (wij < 0, solid rectangles). Any given gene’s expression may be unaffected by most regulators in the circuit (wij = 0, open rectangles). Rectangles drawn in different shades of gray and different patterns correspond to different magnitudes of regulatory interactions wij. The highly regular correspondence of matrix entries to transcription factor binding sites serves the purpose of illustration and is not normally found, because such binding sites often function regardless of their position in a regulatory region. (b) Neighbors in genotype space. The middle panel shows a hypothetical circuit of five genes (top) and its genotype w of regulatory interactions (bottom), if genes are numbered clockwise from the uppermost gene. Light gray arrows indicate activating interactions and dark gray lines terminating in a circle indicate repressive interactions. The left-most circuit and the middle circuit differ in one repressive interaction from gene four to gene three (dashed thick gray line, black cross, large open rectangle). The right-most circuit and the middle circuit differ in one activating interaction from gene one to gene five (dashed thick line, black cross, large open rectangle). Each of the three circuit topologies corresponds to one point (indicated by the large circles around them) in a vast regulatory genotype space. These genotypes (circles) are connected because they are neighbors, that is, they differ by one regulatory interaction. After [123].

Unfortunately, currently available experimental technologies are not sufficiently powerful to overcome the limitations I have just discussed. They cannot study evolutionary innovation in transcriptional regulation circuitry systematically. Computational approaches are needed, approaches that allow us to characterize many regulatory genotypes and their expression phenotypes. I will now turn to a computational model of transcriptional regulation circuits that captures general features of the crossregulatory interactions in such circuits [820]. The model is concerned with a circuit of S transcriptional regulators. Here and elsewhere S stands for the size of a system, that is, the number of its
parts. These regulators are represented by their expression patterns $E_t = E(t) = (E_1(t), E_2(t), \ldots, E_S(t))$ at some time t during a developmental or cell-biological process, and in one cell or region of an embryo. The model’s transcriptional regulators can influence each other’s expression through crossregulatory and autoregulatory interactions, which are encapsulated in a matrix w = (wij). The entries wij of this matrix indicate the strength of the regulatory influence that gene j has on gene i (Figure 3.1a). This influence can be activating (wij > 0), repressing (wij < 0), or absent (wij = 0). The entries of w represent cis-regulatory elements or transcription factor binding sites on DNA. Each row $w_{i\cdot}$ of this matrix represents the regulatory region of an entire gene i. The entire matrix w represents the (regulatory) genotype of this system.

This model does not represent the genotype of the transcriptional regulators themselves. It focuses instead on the regulatory interactions encapsulated in the genotype w, and on evolutionary changes in this genotype. In doing so, it can disentangle the evolution of regulatory interactions from the evolution of the interacting molecules, the transcriptional regulators. For conceptual clarity, the evolution of these and other molecules is best examined separately, as I will do in Chapter 4. Also, cis-regulatory elements evolve much more rapidly, and may thus be more important for circuit evolution than evolutionary change in transcriptional regulators, which are often highly conserved [683, 775, 856, 857, 889].

The regulatory interactions of circuit genes change the expression state E(t) of the circuit over time t, a change that is modeled by the difference equation:

$$E_i(t + \tau) = \sigma\left[\sum_{j=1}^{S} w_{ij} E_j(t)\right], \tag{3.1}$$

where τ is a time constant that is determined by the time scale characteristic for transcriptional regulation, which is of the order of minutes. The function σ(·) is a sigmoidal function whose values lie in the interval (-1, +1). Equation 3.1 reflects the extent to which circuit genes contribute to the regulation of any circuit gene i. The sigmoidal function on the right-hand side of Equation 3.1 reflects the common observation that transcriptional regulators regulate the expression of their target genes cooperatively [101, 438, 577, 862]. The influence of a regulator j on the expression of gene i is reflected in the relative magnitude of the wij in Equation 3.1. The equation is analogous to equations used in neural computation [19, 325]. This model is different from earlier models of generic regulatory circuits in that it specifically models transcriptional regulation, and not just any type of regulatory process [387]. More details on its biological motivation can be found in reference [820].

I here consider the limit where σ(·) has a very steep slope at the origin and becomes a step function, giving -1 for negative arguments and +1 for positive ones (with σ(0) = 0). This limit is used for computational convenience, but much of what I discuss below would also hold for steep sigmoidal functions. This limit means that the model represents two gene expression states, one where a gene is not expressed (-1), and one where it is expressed (+1). This choice of variables is made for computational convenience [123, 820]. Other modeling choices that have been examined, such as Boolean (0–1) expression states, lead to phenomena similar to those I discuss below [341, 481].

The model is concerned with circuits whose expression dynamics start from a pre-specified initial state E0 at some time t = 0 during development, and arrive at a “target” equilibrium expression state E∞. The initial state can be viewed as being determined by regulatory factors upstream of the circuit, which may represent signals from the cell’s environment, or from other regions of a developing embryo. Transcriptional regulators that are expressed in the equilibrium state E∞ can affect the expression of genes downstream of the circuit. I will here focus on stable equilibrium states E∞, but extensions to states that vary over time are straightforward [704].
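The dynamics described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the cited studies: the function names `step`, `update`, and `equilibrium`, the cutoff `max_steps`, and the small example circuits are all hypothetical. The sketch uses the step-function limit of Equation 3.1, with expression states -1 and +1, and iterates the circuit from an initial state until the state no longer changes.

```python
def step(x):
    """Step-function limit of the sigmoid: -1, 0, or +1 by the sign of x."""
    return (x > 0) - (x < 0)


def update(w, state):
    """One application of Equation 3.1: E_i(t + tau) = sigma(sum_j w[i][j] * E_j(t))."""
    return tuple(step(sum(w_ij * e_j for w_ij, e_j in zip(row, state))) for row in w)


def equilibrium(w, e0, max_steps=100):
    """Iterate from e0; return the fixed point reached, or None if none is found.

    max_steps is an arbitrary cutoff (an assumption of this sketch), so circuits
    that oscillate are reported as having no equilibrium.
    """
    state = tuple(e0)
    for _ in range(max_steps):
        nxt = update(w, state)
        if nxt == state:
            return state
        state = nxt
    return None


# Hypothetical 3-gene circuit: gene 1 activates itself and gene 2, and represses gene 3.
w_example = [
    [1, 0, 0],
    [1, 0, 0],
    [-1, 0, 0],
]
print(equilibrium(w_example, (1, -1, 1)))  # (1, 1, -1)

# Hypothetical 2-gene circuit that oscillates and never reaches an equilibrium.
print(equilibrium([[0, -1], [1, 0]], (1, 1)))  # None
```

In this sketch the pair formed by the initial state and the returned fixed point plays the role of the expression phenotype (E0, E∞).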
The expression state E∞ is the expression phenotype of a regulatory circuit. Because any such state is only attained from some initial expression state E0, the pair (E0, E∞) needs to be considered jointly when one studies expression phenotypes. Given that there is an astronomical number of $2^S \times 2^S = 2^{2S}$ such pairs for circuits with S genes, it may seem hopeless to make general statements about them and the genotypes that express them. However, one can show that most properties of the model encapsulated in Equation 3.1 that are relevant for my purpose depend only on the number of genes whose expression state differs in the initial and the equilibrium state [123]. This means that instead of characterizing $2^{2S}$ possible pairs of expression states, it is sufficient to consider those (S + 1) classes of pairs (E0, E∞) for which anywhere between 0 and S genes differ in their expression, regardless of the identity of these genes, and regardless of the specific initial and equilibrium expression state.

This model is abstract, which is an advantage for characterizing generic properties of transcriptional regulation circuits. It may seem like a disadvantage for modeling specific circuits in any one organism. However, I note that variants of this model are highly successful in such modeling. For example, they can predict the regulatory dynamics of early developmental genes in the fruit fly Drosophila, including their mutant phenotypes [362, 527, 645, 646, 696]. These model variants have also helped address a broad range of conceptual questions in evolutionary biology. They include why mutants often show a release of genetic variation that is cryptic in the wild-type, how adaptive evolution of robustness occurs in genetic networks of a given topology, and how sexual reproduction can enhance robustness to recombination [34, 55, 83, 704, 820]. For brevity, I will refer to my observations below as properties of transcriptional regulation circuits.
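The reduction from $2^{2S}$ pairs to S + 1 classes can be checked with a short calculation. The sketch below is only an illustration; the function names `differing_genes` and `pairs_per_class` are hypothetical. It counts the ordered pairs (E0, E∞) in which exactly k genes differ: there are $2^S$ choices of E0, and for each of them $\binom{S}{k}$ ways to choose which k genes change.

```python
from math import comb


def differing_genes(e0, e_inf):
    """Number of genes whose expression state differs between e0 and e_inf."""
    return sum(a != b for a, b in zip(e0, e_inf))


def pairs_per_class(s):
    """Number of ordered pairs (E0, E_inf) in each class k = 0..S of differing genes."""
    return [2**s * comb(s, k) for k in range(s + 1)]


classes = pairs_per_class(5)
print(len(classes))   # 6 classes, that is, S + 1 for S = 5
print(sum(classes))   # 1024 pairs, that is, 2^(2S) for S = 5
```

Summing the class sizes recovers all $2^{2S}$ pairs, because the binomial coefficients sum to $2^S$.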
Genotype space, neighborhoods, and numbers of genotypes and phenotypes

I will refer to the structure of a regulatory genotype matrix w as the topology of a circuit, the pattern of existing (wij ≠ 0) and absent (wij = 0) regulatory interactions (Figure 3.1). Changes in topology correspond to the loss of regulatory interactions (wij → 0), or to the appearance of new regulatory interactions that were previously absent. A variety of evidence, some of it discussed above, shows that such topological changes can occur on very short evolutionary time scales [490, 740, 758, 786]. My analysis in this chapter will focus on such topological changes. Part of the motivation is biological [123]: the biochemical parameters determining the behavior of cellular circuitry vary continually, because a cell’s internal and external environment varies constantly. This variation makes less variable circuit topologies (instead of the more variable interaction strengths within a topology) an obvious and important focus of study [349, 512, 812]. A second motivation for this focus on topology is conceptual. It will allow us to see deep similarities between innovation in regulatory circuits and in other systems.

I will refer to the set of all possible circuit topologies as the genotype space of regulatory circuits, and will call two circuits (topologies, genotypes) neighbors in this space if they differ in the presence or absence of exactly one regulatory interaction (Figure 3.1b). The entire neighborhood of a circuit consists of all circuits that differ in exactly one regulatory interaction from it. That is, this neighborhood consists of circuits that either have one additional interaction or lack one interaction. More generally, one can define a circuit’s k-neighborhood as comprising all circuits that differ from it in k regulatory interactions. By extension, one can define the genotype distance D of two circuits as the fraction of non-zero regulatory interactions that they differ in [349, 512, 812]. This distance ranges from zero to a maximal value of D = 1 for two circuits that do not share a single interaction. A k-neighborhood around any one circuit can also be viewed as a shell or ball of some small radius D around the circuit.

For my purpose, it is useful to consider a simplified and discrete regulatory genotype, where individual regulatory interactions can only assume values of wij = +1 (activation), wij = -1 (repression), or wij = 0 (no interaction). However, much of what I will say holds for continuous interactions [123, 124, 491]. The main purpose of this simplification is that it allows enumeration of the possible genotypes in this model. This number of genotypes is the size of the circuits’ genotype space. It is equal to $3^{S^2}$ for circuits with S genes. This number remains large even if one considers only circuits with some given number I of (non-zero) regulatory interactions. The number of such circuits is given by:

$$2^I \binom{S^2}{I} = 2^I \frac{(S^2)!}{(S^2 - I)!\, I!}, \tag{3.2}$$

where X! = 1 × 2 × . . . × X denotes the factorial function for positive integers X. It is easy to see (for example, by applying Stirling’s formula [2] to approximate $(S^2)!$) that this number of circuits grows exponentially in $S^2$. Thus, even a modest number of regulators can interact in astronomically many different ways, and form a vast genotype space of possible regulatory circuits.

Below I will restrict myself to circuits where not all genes regulate each other’s expression; that is, circuits where the number of interactions I is much smaller than $S^2$. This restriction reflects biological reality. For example, the circuit involved in the patterning of the syncytial Drosophila embryo comprises 15 genes with merely 32 interactions among them [104]. Among 76 transcriptional regulators in the yeast Saccharomyces cerevisiae, 106 putative regulatory interactions exist in the form of transcription factors bound to the promoter regions of other transcription factor coding genes [225, 440].

Although the number of regulatory genotypes grows exponentially in $S^2$, the number of expression phenotypes, as I mentioned earlier, grows only exponentially in 2S, that is, as $2^{2S}$. This means that
there are exponentially more regulatory genotypes than phenotypes. It follows that there must be many circuits with the same expression phenotype. This excess of regulatory genotypes over phenotypes does not mean, however, that every phenotype has the same large number of associated genotypes. Some expression phenotypes may be abundant; that is, they may have many more circuits adopting them than other “rare” phenotypes. For the generic circuits I study here, the number of genotypes per phenotype depends only on the number of genes whose activity differs between the initial expression state E0 and the final equilibrium expression state E∞ [123]. If we view a regulatory circuit as a device that “computes” the state E∞ from E0, then this computation becomes more difficult as more genes need to change their activity during the computation. In consequence, fewer circuit genotypes can execute the computation. Figure 3.2 shows this dependency of circuit number on phenotype [222]. The figure’s vertical axis shows the fraction of genotype space occupied by circuits with a given phenotype. This fraction decreases rapidly as more genes need to change their activity during a circuit’s regulatory dynamics.

[Figure 3.2 plot. Vertical axis: fraction of genotype space. Horizontal axis: fraction d of expression state differences.]

Figure 3.2 Some phenotypes have many more associated genotypes than others. The horizontal axis shows the fraction d of genes which differ in their expression between a circuit’s initial expression state E0 and its equilibrium expression state E∞ [123]. The vertical axis shows, for each value of d, the fraction of a random sample of $1.7 \times 10^6$ circuits with this value of d, divided by the total number of expression phenotypes (E0, E∞) with this value of d. (A value of d = 0 corresponds to circuits where E0 = E∞, and where thus the only requirement on the circuit is that E∞ is a stable equilibrium state.) It can be shown that the number of circuit genotypes per phenotype (E0, E∞) depends only on the fraction d. The data is shown for circuits of size S = 20 genes and I = 55 regulatory interactions, but similar relationships would hold for other circuits [123, 222].
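The counting argument of this section can be made concrete with exact integer arithmetic. The following sketch is an illustration only; the function names are hypothetical, and the pairs (S, I) other than S = 20, I = 55 (the values quoted for Figure 3.2) are arbitrary choices. It evaluates the size $3^{S^2}$ of genotype space, the number of circuits with exactly I interactions from Equation 3.2, and the number $2^{2S}$ of phenotype pairs.

```python
from math import comb, log10


def genotype_space_size(s):
    """All matrices w with entries in {-1, 0, +1}: 3^(S^2) genotypes."""
    return 3 ** (s * s)


def circuits_with_interactions(s, i):
    """Equation 3.2: choose which i of the S^2 entries are non-zero, then a sign for each."""
    return 2**i * comb(s * s, i)


def phenotype_pairs(s):
    """Pairs (E0, E_inf) of binary expression states: 2^(2S)."""
    return 2 ** (2 * s)


# Compare orders of magnitude for a few circuit sizes.
for s, i in [(5, 10), (10, 30), (20, 55)]:
    print(
        f"S={s:2d}, I={i:2d}: "
        f"log10(circuits) = {log10(circuits_with_interactions(s, i)):6.1f}, "
        f"log10(phenotype pairs) = {log10(phenotype_pairs(s)):5.1f}"
    )
```

As a consistency check on Equation 3.2, summing the number of circuits over all possible I from 0 to $S^2$ gives back $3^{S^2}$, by the binomial theorem.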
Circuit genotypes with the same phenotype form very large connected sets

An excess of regulatory genotypes over phenotypes has another consequence for the number of circuits that have a given expression phenotype, that is, for the genotype set of this phenotype: this genotype set may be very large, yet it may constitute a very small fraction of genotype space. As an example, Figure 3.3 shows how the fraction (left vertical axis) and number (right vertical axis) of circuits with a given expression phenotype depend on circuit size. For example, for networks with merely S = 10 genes, there are approximately $10^{39}$ circuits with any one expression phenotype, but these circuits occupy only a small fraction $10^{-5}$ of genotype space. We will see similar relationships when we discuss phenotypes of molecules in Chapter 4.

[Figure 3.3 plot. Left vertical axis: log10 (fraction of genotypes). Right vertical axis: log10 (number of genotypes). Horizontal axis: number S of circuit genes.]

Figure 3.3 Circuits with any one expression phenotype represent a small fraction of genotype space. The horizontal axis shows the circuit size S, that is, the number of genes in a circuit. The left vertical axis shows the fraction of circuits in genotype space that have a given expression phenotype (see below), on a logarithmic scale. The right vertical axis shows the number of these circuits. Note the logarithmic scales, the exponential decrease in the fraction of genotypes with increasing S, and the greater than exponential increase in the number of genotypes, which is caused by the exponential scaling of genotype space size with the square of circuit size S [123]. The relationship shown here can be shown to hold quantitatively for all phenotypes where the fraction d of genes which differ in their expression between E0 and E∞ is equal to d = 0.5, regardless of the actual expression states E0 and E∞ [123]. The number of genotypes per phenotype would decrease as d increases (Figure 3.2), but otherwise qualitatively identical relationships would hold for other values of d. The data is based on circuits where each gene is regulated by half of all circuit genes, but similar scaling relationships would hold for different numbers of regulatory interactions. The data was obtained through random sampling of circuits from circuit genotype space. Sampling errors are at least one order of magnitude smaller than the estimated values, and thus invisible on the plot.

I next focus on the genotype set of a given phenotype. This genotype set can be viewed as a graph. Graphs are mathematical objects that consist of nodes and edges, and that are used to represent networks in many areas of science, from physics to sociology. The nodes in our genotype set are genotypes (regulatory circuits). Two nodes are connected by an edge if they are neighbors; that is, if they differ in exactly one regulatory interaction. A fundamental question about this graph is whether it is connected; that is, whether there is a path of edges connecting any two circuits that does not leave this graph. Alternatively, this graph might consist of only isolated nodes (nodes not neighboring any other circuit on the graph) or of some number of connected components, subsets of nodes that are connected to one another but isolated from all other nodes. I will refer to any such component as a genotype network. A closely related and equally important question is whether genotype networks typically occur in small, highly localized “islands” of genotype space, or whether they extend far through this space.

To determine connectedness of a genotype network, one can pursue two different strategies. First, for very small circuits (S < 7 genes), one can exhaustively enumerate all circuits with a given phenotype. For larger circuits, such exhaustive enumeration is no longer possible [123]. In this case, one needs to sample circuits from a genotype set and ask whether there are paths through genotype space that connect these circuits without leaving this set. In practice, such sampling allows one to estimate connectivity of a genotype set to arbitrary accuracy, given sufficiently large samples. A combination of enumeration and sampling approaches demonstrates the following property [123]: regardless of the specific phenotype studied, the vast
majority of all circuits belong in the same genotype network, and only a tiny fraction form islands of largely isolated circuits. For example, for circuits with 12 genes and an average of three regulatory interactions per gene, 99.8 percent of all circuits with the same phenotype are connected in a single genotype network. This percentage increases further for larger circuits, and is similarly high for networks with different phenotypes [123]. Such high connectedness of a genotype set is only possible if circuits in the set have on average more than one neighbor with the same phenotype. In fact, most circuits have vastly more such “neutral” neighbors. As an example, Figure 3.4 shows the distribution of the fraction of a circuit’s neutral neighbors for circuits of S = 20 genes. Most circuits of this size have dozens to hundreds of neutral neighbors, regardless of their specific expression phenotype. The next question is how far individual genotype networks extend through genotype space. To obtain a first answer, one can sample circuits chosen at random from the same genotype network, and ask how different they are from each other. Put differently,
800 700
Number of circuits
600 500 400 300 200 100 0 0.0
0.2 0.4 0.6 0.8 Fraction n of neighbors with the same phenotype
1.0
Figure 3.4 Most circuits have many neighbors with the same phenotype. The horizontal axis shows the fraction n of a circuit’s neighbors with the same phenotype, out of a total of 2S(S – 1) possible neighbors. The data are based on a sample of 104 networks (S = 20 genes) with approximately S/2 regulatory interactions per gene, sampled at random from a set of genotypes with the same phenotype [123]. The distribution shown applies to any phenotype where half of all genes differ in their expression between E0 and E∞ [123], regardless of the specific phenotype. While the shape of the distribution depends on some of the parameters just mentioned, the fact that most circuits have many neighbors with the same phenotype does not change as these parameters are varied [123].
42
T H E O R I G I N S O F E V O L U T I O N A RY I N N O V A T I O N S
[Figure 3.5: histogram. Horizontal axis: genotype distance D (0 to 1). Vertical axis: number of circuit pairs (×10^4). Inset: mean genotype distance D (0.75 to 0.85) versus number S of network genes (12 to 36).]
Figure 3.5 Randomly chosen circuit pairs with identical gene expression phenotype have very different topology. The horizontal axis shows genotype distance D, the fraction of regulatory interactions that differ between two circuits with topologies w and w’. The histogram is based on a sample of 5 × 10^5 circuit pairs with S = 24 genes, and an average of 6 regulatory interactions per gene. The same distribution exists for all phenotypes where 50 percent of genes differ in their expression between E0 and E∞, regardless of the specific phenotype. Similarly large distances are observed when these circuit properties are varied [123, 124]. The inset shows, as an example, the mean (and standard deviation) of the distribution shown, but for circuits with different numbers S of genes. The genotype distance D is defined as $D(w, w') = \frac{1}{2M_+} \sum_{i,j} \bigl(1 - d[\mathrm{sign}(w_{ij}), \mathrm{sign}(w'_{ij})]\bigr)$, where $M_+$ denotes the maximally allowed number of regulatory interactions, and where the function $d[x, y] = 1$ if and only if $x = y$, and $d[x, y] = 0$ otherwise [124].
this approach asks how different circuits with the same phenotype typically are [123, 124]. Figure 3.5 shows an example. The example illustrates that randomly chosen pairs of circuits with the same phenotype typically differ in approximately 80% of their regulatory interactions. This observation suggests that circuits with the same phenotype can be found in very distant parts of genotype space. A similar approach can be used to estimate the maximal genotype distance of circuits within one genotype network. Specifically, a lower bound for this maximal distance is given by the maximal genotype distance among pairs of circuits in a large random sample of a genotype network. When estimating this maximum, one finds that it is very large and often equal to the diameter of genotype space, i.e., D = 1 [124]. Two circuits at this extreme distance have the same expression phenotype, yet they do not share a single regulatory interaction. These observations hold not only for some circuits and distance measures, but over a broad range of circuit sizes, circuit phenotypes, and for different measures of genotype distance [124]. They are a generic property of the circuits I analyze here. I note that these observations are broadly consistent with the qualitative empirical observations that I discussed earlier, and which showed that very different regulatory interactions (regulatory genotypes) can produce the same gene expression phenotype.
In sum, transcriptional regulation circuits with the same phenotype typically form vast sets of genotypes. Most or all of these genotypes are connected in one giant genotype network that can be traversed through small individual steps, changing one regulatory interaction at a time, and affecting as little as a single transcription factor binding site. The genotype networks of different phenotypes extend far through genotype space, and many such networks even traverse this space completely [123, 124].
I N N OVAT I O N T H R O U G H R E G U L AT I O N
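The genotype distance D defined in the caption of Figure 3.5 can be made concrete with a short sketch. It assumes, purely for illustration, that a circuit is stored as a square matrix whose entry w[i][j] is positive for activation, negative for repression, and zero for no interaction. The function names and this encoding are assumptions of the sketch and are not taken from [124].

```python
def sign(x):
    """Return -1, 0, or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def genotype_distance(w1, w2, m_plus):
    """Genotype distance D between two circuits, following the Figure 3.5 caption.

    w1, w2 : square matrices (lists of lists) of regulatory interactions
             (assumed encoding: >0 activation, <0 repression, 0 no interaction).
    m_plus : the maximally allowed number of regulatory interactions (M+).
    """
    # Count matrix positions whose interaction signs differ, i.e. the terms
    # for which d[sign(w_ij), sign(w'_ij)] = 0 in the caption's sum.
    differing = sum(
        1
        for row1, row2 in zip(w1, w2)
        for a, b in zip(row1, row2)
        if sign(a) != sign(b)
    )
    return differing / (2 * m_plus)


# Two hypothetical 3-gene circuits with three interactions each and none in common.
w_a = [[1, 0, 0], [0, -1, 0], [0, 0, 1]]
w_b = [[0, 1, 0], [0, 0, -1], [1, 0, 0]]
print(genotype_distance(w_a, w_b, m_plus=3))  # 1.0
```

When two circuits each contain M+ interactions at disjoint matrix positions, they differ at 2M+ positions, so the normalization by 2M+ gives the maximal distance D = 1 in exactly the case the text describes: same phenotype, no shared regulatory interaction.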
Accessing novel gene expression phenotypes
Thus far, I have focused on circuits with the same expression phenotype. I will next turn to different (new) expression phenotypes, and how they arise through regulatory mutations. Many new expression phenotypes may have no consequence or deleterious consequences for the organism, but some fraction of them may give rise to regulatory innovations. Whether a new phenotype is an innovation will depend on many circumstances, but this much is certain: circuits that can produce many new variants over time while preserving their existing phenotype will increase their odds of bringing forth regulatory innovations. I will next show that the sprawling genotype networks characteristic of regulatory circuits facilitate such innovations.
Consider two different circuits that have very different genotypes (large genotype distance D), but that lie on the same genotype network; that is, they share the same gene activity phenotype P. The two immediate neighborhoods around each of these circuits contain many circuits that also have the same phenotype P. However, both neighborhoods may also contain many circuits whose phenotypes are different from P. The same holds for two larger neighborhoods; that is, k-neighborhoods that contain circuits differing in some small number k of regulatory interactions [124]. If the neighborhoods of two very distant circuits in genotype space typically contain exactly the same novel phenotypes, then the extendedness of a genotype network is irrelevant for the accessibility of such phenotypes. No matter where a circuit is located on a genotype network, small changes in it will produce the same novel phenotypes. However, if such neighborhoods contain different novel phenotypes, the very existence of a genotype network becomes important to evolutionary innovation. In this case, a series of small genotypic changes that preserve a circuit’s phenotype may give the circuit access to many novel phenotypes. The following observations show that this is the case.
The horizontal axis of Figure 3.6a shows the genotype distance between circuit pairs drawn at random from the same genotype network. The data in this figure are based on more than 10^3 such circuit pairs [124]. The vertical axis shows the fraction of novel phenotypes unique to the k-neighborhood (k ≤ 3) of one circuit. Here and elsewhere, I use the term unique phenotype in the sense that a phenotype occurs in the neighborhood of one but not the other circuit in a pair. The fraction of unique, novel phenotypes is generally large, with a mean of greater than 0.7. Figure 3.6b shows a similar relationship but for circuit pairs at smaller genotype distances. The figure displays only the mean fraction of novel and unique phenotypes in a neighborhood as a function of genotype distance among circuits in a pair. This mean fraction increases very rapidly with increasing genotype distance. For instance, for circuits of S = 8 genes, at a genotype distance of merely D = 0.06, 34 percent of new phenotypes occurring in the 2-neighborhood of two circuits are different. This number increases further until it exceeds 50 percent [124]. Taken together, these data show that small neighborhoods around different circuits with the same phenotype contain many novel phenotypes that are unique to each neighborhood.
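The notion of a unique phenotype used above amounts to a set difference. The following minimal sketch assumes that the phenotypes found in each circuit's neighborhood have already been collected into sets; the function name and the string labels for phenotypes are hypothetical and serve only to illustrate the definition.

```python
def unique_fraction(neighborhood_1, neighborhood_2, shared_phenotype):
    """Fraction of novel phenotypes near circuit 1 that do not occur near circuit 2.

    neighborhood_1, neighborhood_2 : collections of phenotypes found in the
        k-neighborhoods of the two circuits (any hashable representation).
    shared_phenotype : the phenotype P that both circuits have; it is not novel.
    """
    novel_1 = set(neighborhood_1) - {shared_phenotype}
    novel_2 = set(neighborhood_2) - {shared_phenotype}
    if not novel_1:
        return 0.0  # no novel phenotypes at all near circuit 1
    return len(novel_1 - novel_2) / len(novel_1)


# Hypothetical example: four novel phenotypes near circuit 1, one of them shared.
near_g1 = {"P", "A", "B", "C", "D"}
near_g2 = {"P", "A", "E"}
print(unique_fraction(near_g1, near_g2, "P"))  # 0.75
```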
Genotype networks are highly interwoven
An important final question regards how far one must typically travel in genotype space to reach a circuit with an arbitrary novel phenotype. This is a question about the shortest distance between two genotype networks in this space. It is analogous to a question I asked in Chapter 2 for metabolic systems. To address it, one can choose at first two random circuits that have the same initial expression state E0 but arrive at different and arbitrary (random) equilibrium states E∞ and E’∞. Starting from the first circuit, one can then perform a random walk that aims to reach the second circuit, while leaving the first circuit’s phenotype E∞ unchanged. After a large number of mutations in this random walk, a distance Dmin is reached that cannot be reduced further without changing the phenotype E∞. Repeating this procedure with more than 10^3 pairs of random networks yields the distribution of Dmin shown in Figure 3.7. To be precise, this distribution is only an upper bound for the minimum distance
[Figure 3.6: (a) Horizontal axis: genotype distance D (0.2 to 1.0). Vertical axis: fraction U of phenotypes unique to neighborhood (0.0 to 1.0). Circle sizes indicate numbers of circuit pairs, in bins from (0,5] to > 45. Inset: two genotypes G1 and G2 with phenotype P, their neighborhoods, and the unique phenotypes U. (b) Horizontal axis: genotype distance D (0.00 to 0.24). Vertical axis: fraction U of phenotypes unique to neighborhood (0.0 to 0.7). Data series: S = 8 genes and S = 12 genes.]
Figure 3.6 Different neighborhoods of a genotype network contain different expression phenotypes. (a) The mean fraction U of novel and unique phenotypes that occur in the neighborhood (circles in the inset on the lower right) of one but not the other regulatory circuit (gray color in the inset, symbol U for unique), for pairs of regulatory circuits whose genotype distance D is shown on the horizontal axis. Sizes of small circles represent numbers of circuit pairs with a given distance and fraction of unique phenotypes. Data are based on the k-neighborhoods (k ≤ 3) of a sample of 2210 circuit pairs chosen at random from the same genotype network. Notice the large fraction of unique new phenotypes for almost all circuit pairs shown (mean U = 0.73). (b) As in (a) but for the mean U of genotypes at smaller distances from one another [124], and for k = 2. Standard deviations around each data point are no greater than 8 × 10^-3 and error bars are thus too small to be visible. Data are for circuits of S = 8 genes (in addition, S = 12 genes for b), an average of 2 regulatory interactions per gene, for any phenotype where 50% of genes differ in their expression between E0 and E∞ (regardless of the specific phenotype) and for circuits that have between 45 and 60% of neutral neighbors. Qualitatively similar patterns hold over a broad range of these circuit properties [124].
[Figure 3.7: histogram. Horizontal axis: minimal genotype network distance (Dmin), 0 to 1. Vertical axis: number of circuit pairs (0 to 300). Inset: minimum genotype distance (upper bound; 0.10 to 0.50) versus number of circuit genes (8 to 28).]
Figure 3.7 Genotype networks of arbitrary phenotypes are close together in genotype space. The figure shows a histogram for an upper bound in the minimum genotype distance Dmin of two circuits with different phenotypes. It is based on circuits with S = 20 genes, 5 regulatory interactions per gene on average, and on 1600 random pairs of phenotypes, where 50 percent of genes differ in their expression between E0 and E∞, and where the expression states of individual genes in E∞ and E’∞ are uncorrelated. The inset shows the mean (error bars: one standard deviation) of Dmin as a function of the number of genes. The mean Dmin decreases with increasing number of genes S (inset) and with an increasing number of regulatory interactions per gene (data not shown). See also [124].
between two typical genotype networks. The actual Dmin may still be lower. Nonetheless, the figure shows that Dmin is small and that it decreases as circuit size increases. For example, for circuits of merely twenty genes, the average minimum distance to circuits with arbitrary different phenotypes (expression patterns) is only D = 0.14. In genotype space, a region with this radius contains only a tiny fraction of all networks. For example, there are an estimated 10^128 networks of S = 20 genes with an average of 5 interactions per gene. Only a fraction 10^-102 of them is contained in a neighborhood of D = 0.14 around any one circuit. In sum, the genotype networks of an arbitrary phenotype can typically be found in a tiny region around any given genotype network. This observation shows that typical genotype networks not only span large distances, but they also are highly interwoven [124].
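The random walk used to estimate Dmin can be sketched in a few lines. The sketch below is not the procedure of [124]: it replaces the circuit model with a hypothetical toy phenotype (the sign of the sum of a genotype's entries), represents a genotype as a tuple with entries in {-1, 0, 1}, and assumes a simple acceptance rule in which a mutation is kept only if it preserves the phenotype and does not increase the distance to the target. All names are illustrative.

```python
import random


def fraction_differing(g1, g2):
    """Fraction of positions at which two equal-length genotypes differ."""
    return sum(a != b for a, b in zip(g1, g2)) / len(g1)


def toy_phenotype(genotype):
    """Hypothetical stand-in for a circuit's phenotype: the sign of the entry sum."""
    total = sum(genotype)
    return (total > 0) - (total < 0)


def mutate(genotype, rng):
    """Change one randomly chosen entry to a different value in {-1, 0, 1}."""
    i = rng.randrange(len(genotype))
    new_value = rng.choice([v for v in (-1, 0, 1) if v != genotype[i]])
    return genotype[:i] + (new_value,) + genotype[i + 1:]


def neutral_walk_towards(start, target, phenotype, steps, rng):
    """Approach `target` by single mutations that never change the phenotype of `start`.

    Returns the final genotype and its distance to `target`. Because the walk
    may get stuck, this distance is only an upper bound on the minimum
    distance between the two genotype networks.
    """
    required = phenotype(start)
    current = start
    best = fraction_differing(current, target)
    for _ in range(steps):
        candidate = mutate(current, rng)
        if phenotype(candidate) != required:
            continue  # reject: the mutation would change the phenotype
        d = fraction_differing(candidate, target)
        if d <= best:  # keep neutral mutations that do not move away from the target
            current, best = candidate, d
    return current, best


start = (1, 1, 1, 1, 0, 0, 0, 0)       # toy phenotype +1
target = (-1, -1, -1, -1, 0, 0, 0, 0)  # toy phenotype -1
final, d_min_bound = neutral_walk_towards(
    start, target, toy_phenotype, steps=2000, rng=random.Random(0)
)
print(final, d_min_bound)
```

Repeating such a walk for many random pairs of start and target genotypes would give a distribution of upper bounds, in the spirit of the histogram in Figure 3.7.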
Summary
The analyses I have summarized here suggest that transcriptional regulation circuits have several generic properties. First, there are many more circuits (genotypes) than phenotypes. Second, typically almost all genotypes with any one expression phenotype form one vast, connected genotype network. Third, typical circuits have a large number of neighbors with the same expression phenotype. Their phenotype is robust to regulatory change. Some of the experimental data I discussed above speak to these properties. For example, the fact that one can readily alter regulatory interactions in the E. coli transcriptional regulation circuit without detrimental effects suggests that this kind of robustness also exists for the E. coli regulatory circuit [355]. Fourth, the longest distance between two genotypes in such a network is close to the diameter of genotype space. This means that two circuits with the same expression phenotype may have few regulatory interactions in common. As I mentioned earlier, these observations are broadly consistent with empirical evidence that very different cis-regulatory regions and transcription factors can indeed confer the same expression pattern on many different genes in different species [490, 758, 786, 847]. Also, transitional forms of regulation in species intermediate to those studied show how such highly diverged circuitry may arise mechanistically: new regulatory interactions arise before old interactions are eliminated, and a sequence of such additions and eliminations of interactions may eventually cause regulatory circuitry to diverge beyond recognition, yet leave an expression phenotype largely unchanged [490, 758, 786]. Fifth, the neighborhoods of different circuits on the same genotype network contain many unique new phenotypes. Sixth, the shortest distance between genotype networks in genotype space is small. No experimental evidence is yet available that speaks to these last two observations.
The models on which these observations are based contain assumptions that facilitate enumeration of genotypes and phenotypes. However, relaxing these assumptions is not likely to affect these observations. For example, wide-ranging and vast genotype networks still exist for circuits with continuous regulatory interactions, with different representations of gene expression states, and for signaling circuits, circuits that do not just involve transcriptional regulation [123, 124, 481, 491, 561, 823]. Neighborhood diversity, a large fraction of unique novel phenotypes in different neighborhoods of a genotype network, can also exist for such circuitry (Chapter 14). Taken together, all this evidence suggests that the properties I discuss here are general properties of regulatory circuitry.
CHAPTER 4
Novel molecules
Genotype: A sequence of ribonucleotides or amino acids.
Phenotype: Secondary and tertiary structure, biochemical activity.
Every distinct and specific enzymatic activity we see in organisms today was an evolutionary innovation when it first arose. These innovations permit the use of new energy sources, the biosynthesis of essential molecules from unusual food, or protection against a hostile world. The same holds for many non-enzymatic proteins that serve in structural support, communication, or defense. Understanding the relationship between macromolecular genotypes and their phenotypes is thus very important for understanding evolutionary innovation. I will examine this relationship here for two important macromolecules: proteins and RNA. A previous book of mine also discussed parts of this material [825].
RNA and protein genotypes
In RNA molecules, genotypes can be represented as nucleotide strings. As in previous chapters, I will let the letter S stand for system size. It corresponds to the length of a nucleotide string, its number of nucleotide monomers. RNA molecules exist in a vast genotype space. Specifically, because there are four different RNA nucleotide monomers, this genotype space comprises 4^S possible genotypes for a molecule of length S. Because all proteins are encoded by RNA or DNA molecules, one could also represent the genotype of a protein by the encoding nucleotide sequence. However, it is often more convenient to represent it directly as an amino acid sequence. This is unproblematic, because a nucleotide string usually encodes only one amino acid string. Like RNA genotype space, protein genotype space is vast. For example, it contains 20^S proteins with S amino acids.
In both protein and RNA genotype spaces, a simple measure of the distance D between two genotypes exists. It is the number or fraction of nucleotides or amino acids in which they differ. Two genotypes are (1-mutant) neighbors in this space, if they differ in only one nucleotide or amino acid. The (1-mutant) neighborhood of a genotype comprises all its neighbors. More generally, the k-mutant neighborhood of a genotype comprises all molecules differing in no more than k monomers from itself.
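The quantities just introduced are simple to compute. The following sketch, whose function names are illustrative, gives the size of genotype space, the distance between two sequences of equal length, the 1-mutant neighbors of a sequence, and the number of sequences in a k-mutant neighborhood (not counting the sequence itself).

```python
from math import comb

RNA_ALPHABET = "ACGU"


def genotype_space_size(alphabet_size, length):
    """Number of sequences of a given length: 4^S for RNA, 20^S for proteins."""
    return alphabet_size ** length


def distance(seq1, seq2):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq1, seq2))


def one_mutant_neighbors(seq, alphabet):
    """All sequences that differ from seq in exactly one monomer."""
    return [
        seq[:i] + monomer + seq[i + 1:]
        for i, original in enumerate(seq)
        for monomer in alphabet
        if monomer != original
    ]


def k_neighborhood_size(length, alphabet_size, k):
    """Number of sequences differing from a given one in 1 to k monomers."""
    return sum(comb(length, j) * (alphabet_size - 1) ** j for j in range(1, k + 1))


print(genotype_space_size(4, 10))                     # 1048576
print(distance("GGGU", "GCGA"))                       # 2
print(len(one_mutant_neighbors("GGGU", RNA_ALPHABET)))  # 12
print(k_neighborhood_size(4, 4, 1))                   # 12
```

A sequence of length S over an alphabet of size A has S(A - 1) neighbors, so the neighborhood grows only linearly with S while genotype space grows exponentially.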
RNA and protein phenotypes
Because proteins have many different roles in a cell, they have multifaceted phenotypes. An enormous amount of work has focused on characterizing secondary structure and tertiary structure phenotypes of proteins, the arrangement of their amino acid sequence in three-dimensional space. This aspect of a protein’s phenotype is also known as its fold (Figures 1.5 and 1.6 showed examples). Compared to the astronomical number of protein sequences, the number of protein folds is small. It varies according to the estimation method but is less than 20,000 according to available predictions [145, 286, 418, 579, 838, 860, 885, 886]. When studying protein function, the fold is not always the best representation of phenotype. Enzymes provide a case in point. An enzyme’s catalytic site is formed by a few surface amino acids that are responsible for the enzyme’s substrate specificity, and for the kind of chemical reaction it catalyzes. Enzymes with very similar folds can have a great variety of catalytic functions [660, 774, 776, 777, 849]. Thus, at least for some proteins, the tertiary structure provides only a scaffold; different biochemical functions are built on top of this scaffold by subtle modifications. Despite this limitation of studying protein folds as phenotypes, disrupting a fold will also generally disrupt protein function. Thus, a properly formed tertiary structure is a necessary ingredient for the function of many proteins. It therefore deserves study in its own right.
RNA phenotypes are also attractive study objects, because RNA is involved in most of life’s key processes. Examples include: small nuclear RNAs, which are key parts of the splicing machinery; guide RNAs important in RNA editing; telomerase RNA necessary for maintaining chromosome ends; and small RNA molecules regulating gene expression through RNA interference [13]. Like proteins, RNA forms secondary and tertiary structures. RNA secondary structure is an elaborate planar shape that is formed when an RNA molecule folds onto itself, thus forming hydrogen bonds of complementary base-pairs within the molecule (A–U, G–C, and, to a lesser extent, G–U nucleotides; Figure 4.1). The three-dimensional RNA tertiary structure brings distant secondary structure elements into proximity through non-standard base-pairing, pseudoknots, and bivalent ions such as Mg2+. Compared to solved protein structures, which number in the thousands, the number of well-characterized RNA tertiary structures is puny, because RNA tertiary structure is more difficult to determine [191]. Thus, more evolutionary work has focused on RNA secondary structure phenotypes. Another motivation for this focus is that secondary structure is critical to the function of many RNA molecules: destroy this structure and you destroy RNA function. Examples include many viruses whose genome consists of RNA.
Parts of their genome form secondary structure motifs, such as the so-called transactivation responsive and Rev-1 responsive elements of human HIV, the internal ribosomal entry site of picorna viruses, and the 3’ untranslated region of flavivirus genomes [48, 169, 361, 486, 622]. The RNA structures they form interact with parts of the protein machinery necessary to complete the viral life-cycle. Another class of examples regards the secondary structures formed by messenger RNA (mRNA). They are important for chemical modifications of messenger RNA that affect its half-life, and how efficiently it is translated into protein [489, 575, 599, 802].
Evolutionary conservation underscores the functional importance of many RNA secondary structures. Distantly related species harbor many RNAs with diverged nucleotide sequences but conserved secondary structures. Natural selection maintains these secondary structure phenotypes. Examples come from ribosomal RNAs, transfer RNAs, catalytic RNAs such as ribonuclease P, and viral RNA genomes [191, 256, 465, 687, 832].
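RNA secondary structures such as those discussed in this section are often written in the dot-parenthesis notation of Figure 4.1b, in which a dot stands for an unpaired base and matching parentheses stand for a base pair. As a minimal sketch (the function name is illustrative), the list of base pairs can be recovered from such a string with a stack. The two conditions on base pairs, at most one partner per nucleotide and no crossing pairs, are what make a single stack sufficient.

```python
def parse_dot_parenthesis(structure):
    """Return the base pairs (i, j), with i < j, encoded by a dot-parenthesis string.

    Positions are counted from 0. Raises ValueError if parentheses do not match
    or if the string contains a symbol other than '(', ')' or '.'.
    """
    open_positions = []  # stack of positions of '(' still waiting for a partner
    pairs = []
    for position, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(position)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {position}")
            # The most recently opened base is the partner, so pairs never cross.
            pairs.append((open_positions.pop(), position))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {position}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


# A small hypothetical hairpin: a stack of three base pairs closing a loop of three bases.
print(parse_dot_parenthesis("(((...)))"))  # [(0, 8), (1, 7), (2, 6)]
```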
Genotype networks in proteins and RNA
Proteins with the same phenotype can be extremely diverse in genotype. A first and prominent example involves the globin fold, whose elucidation won Max Perutz and John Kendrew the Nobel prize. The globin fold is a structural phenotype characteristic of oxygen-binding proteins, such as myoglobin and hemoglobin. These proteins have numerous distant relatives in many vertebrates, invertebrates (mollusks, arthropods, and annelids), and even plants, where they bind oxygen to facilitate nitrogen fixation. All these proteins bind oxygen, albeit with different affinities and kinetics [264, pp. 38–40]. The tertiary structures of even distant globin representatives are very similar. For instance, the three-dimensional structure of whale myoglobin and the hemoglobin of the clam Lucina pectinata can be superimposed almost exactly [651]. This great phenotypic similarity stands in stark contrast to the high genotypic diversity of globins. For example, clam hemoglobin has only 18 percent amino acid identity to vertebrate hemoglobin. A study of 6 hemoglobins from plants and animals found that as few as 12.4 percent of amino acid residues were identical between any pair of these proteins [31]. In addition, only 4 out of 97 amino acids in the protein were unchanged in all of these proteins. If the sequence of two proteins has diverged this dramatically, it is difficult to determine whether they share a common ancestor from their sequence alone. However, a variety of other criteria, including details of a protein’s structure, can help. Taken together, such criteria argue for a common origin of globins [305]. The phylogenetic tree in Figure 4.2 shows that amino acid similarities of globins among different species reflect the evolutionary relatedness of the
[Figure 4.1: (a) two-dimensional drawing of an RNA secondary structure whose sequence begins GGGU, with labels for a hairpin loop, internal loops, a stack, and a multiloop.]
(b) (((((.(((.....((((((.((....)).))))))........))).))...))..)))))
Figure 4.1 Two equivalent representations of RNA secondary structure. (a) Two-dimensional graphical representation. Stacks, regions of paired bases, and various kinds of loops, unpaired regions, are indicated. Base pairs in an RNA secondary structure have to meet two conditions. First, each nucleotide must be paired with at most one other nucleotide. Second, two pairings cannot cross in the planar projections of the structure, otherwise planarity would be violated. (b) Dot-parenthesis representation. A dot stands for an unpaired base, and a pair of matching parentheses corresponds to a base pair. The two representations are equivalent. That is, as one reads the RNA string in (a) beginning from GGGU . . ., one can read the representation in (b) from left to right to find which bases are paired and unpaired. From figure 4.1 in [825], used with permission from Princeton University Press.
species. More distantly related species harbor globins with more dissimilar amino acid sequences [279]. Globins are not unusual in being highly diverse yet connected evolutionarily, as the next examples will illustrate.
A protein domain is a distinct, compact and stable unit of protein structure that folds independently of other such units. It often also has a unique biochemical function. A protein can consist of a single domain, like hemoglobin, or of multiple domains. As structural information on proteins is accumulating, comparative analyses of protein evolution are focusing increasingly on domains themselves. Such comparisons reveal patterns very similar to those of whole proteins I have just discussed. A case in point is the fibronectin type III domain, which forms a tertiary structure similar to one also found in immunoglobulins. It is widespread in animals and has also been found in some bacteria [79]. The proteins in which it occurs include extracellular matrix proteins such as fibronectin (involved in processes as diverse as tissue repair, blood clotting, and cell migration), intracellular proteins, and many kinds of membrane receptor proteins, such as the human
[Figure 4.2: phylogenetic tree of globins. Left vertical axis: time in million years before present (MYR BP), 0 to 700. Numbers along branches: substitutions. Group labels: vertebrate, mollusc (gastropod, bivalve), arthropod (crustacean, insect), annelid, plant; intracellular and extracellular; monomeric, dimeric, and tetrameric. Tip labels include HOMO a, HOMO b, HOMO mb, LAMPETRA, MYXINE, APLYSIA, BUSYCON, CERITHIDEA, ANADARA, CTT, ARTEMIA, GLYCERA, TYL, LUMB, PARASPONIA, VICIA, LUPINUS, PHASEOLUS, and GLYCINE.]
Figure 4.2 Evolutionary relationships among globins. The phylogenetic tree shown is a maximum parsimony tree [237] based on aligned amino acid sequences of globins from 5 plants, 9 invertebrates, and 3 vertebrates [279]. The numbers along each tree branch represent the numbers of substitutions that took place along the branch. The left vertical axis represents time in million years before present (MYR BP). Acronyms and names of genera denote the following. Anadara broughtonii and Anadara trapezia (bivalves); Aplysia spp. (gastropods); Artemia salina (brine shrimp); Busycon canaliculatum and Cerithidea rhizopharorum (gastropods); CTT: Chironomus tumi tumi (a midge) and its various globins; Donabella auricularia (a sea hare); Glycera dibranchiata (a polychaete); Lampetra sp. (lamprey); LUM: Lumbricus terrestris (earthworm); Myxine sp. (hagfish); Glycine sp., Lupinus sp., Parasponia sp., Phaseolus sp., Vicia sp. (legumes); Scapharca inaequivalvis (bivalves); TYL: Tylorrhynchus heterochaetus (a polychaete). From ref. [279], used with permission from Springer Science+Business Media.
growth hormone receptor. Despite their highly similar tertiary structures, amino acid sequence similarities of fibronectin domains in different species are as low as 9 percent, although they may have a common origin [79].
Among many other examples, the triosephosphate isomerase (TIM)-barrel domain stands out, because it is one of the most abundant protein structures in nature. Named after a glycolytic enzyme that harbors this domain, it has a barrel-like structure whose “planks” are made up of secondary structure elements. It may occur in as many as 10 percent of all enzymes. The enzymes with this domain have a broad diversity of functions [849]. The TIM-barrel domain may derive from a single ancestor, but many proteins that harbor this domain have no recognizable sequence similarity to one another [140].
Genotype sets and genotype networks
For RNA molecules and proteins, a genotype set comprises all genotypes (sequences) with the same phenotype. In the case of globins, for example, it would comprise all proteins with an oxygen-binding globin fold. In a genotype network, additionally, any pair of sequences can be connected through single amino acid changes that do not change the phenotype. As I mentioned in earlier chapters, not necessarily all sequences in a genotype set form a single genotype network; this also holds for proteins. We may never know whether all proteins with oxygen-binding globin folds belong to exactly one genotype network, because their number may be so astronomically large. However, phylogenetic analyses like that of Figure 4.2 show that the genotype network on which known globins lie is vast [279]. Because different globin phenotypes have close to maximally dissimilar genotypes, their genotype network spans genotype space or nearly so. To see this, it is useful to remember that the species in Figure 4.2 are but a few leaves on a huge tree of globin-possessing species connected through common ancestry. If we could identify all their (extinct) ancestors, we would see the full continuum of genetically diverse molecules that constitute this network.
Examples such as those of globins and the TIM-barrel domain show that evolutionary explorations of sequence space in the billion-year long history of eukaryotes can range very far without compromising a protein’s structure. However, any one such example is essentially an anecdote. It does not show how representative this phenomenon is of all proteins. Counterexamples exist of proteins with conserved structure and very similar sequences in widely divergent organisms; they include actins, tubulins, and histones, with up to 98 percent sequence identity in organisms as dissimilar as humans and plants [188]. To find out whether such counterexamples are the rule or the exception, statistical surveys of many protein structures are necessary. With increasing amounts of available protein structure information, such surveys have become possible. However, a difficulty is that many proteins with similar structures are close together in genotype space, because they share a recent common ancestor, a single sequence, from which they have diversified. This means that proteins with the same structure are not unbiased representatives of all sequences that fold into the structure.
In a statistical survey that alleviates this problem, Rost first identified 272 proteins with different folds in the protein databank, a database of thousands of protein structures at atomic resolution [201, 659]. He then used each such protein with a unique structure as a reference protein, and identified all other proteins that were so dissimilar to the reference protein in amino acid sequence (< 25 percent identity) that their common ancestry with the reference protein is doubtful [189]. He found that many of these proteins have a fold that is similar to that of the reference protein. Because of low sequence similarity to the reference protein, this set of proteins is not biased towards highly similar proteins that fold into the same structure. It is striking that proteins with similar structure but different sequence thus identified shared on average only 8.5 percent of their amino acids, much fewer than the 25 percent threshold pre-imposed on the analysis. This number is only slightly higher than the 5.6 percent amino acid identity between any two proteins chosen at random from the database (and possibly different structure). Other surveys make similar observations: many protein structures can be realized by very different amino acid sequences [47, 774, 776].
When we turn to RNA and its phenotypes, the situation is similar, although our knowledge is more limited. Very different sequences can have similar form and function. For example, only seven nucleotides are conserved among group I self-splicing introns, a prominent class of RNAs with catalytic activity. Nevertheless, secondary (and probably tertiary) structures within the core of these catalytic RNAs are conserved [465]. Similarly great diversity is observed for the catalytic RNA molecules of ribonuclease P, an enzyme necessary for the biosynthesis of transfer RNA molecules. This RNA contains a 200 nucleotide long core that is shared by eukaryotes and prokaryotes [256]. Only 10 percent of these 200 nucleotides do not change in known RNA molecules. More generally, RNA sequence conservation among molecules with the same phenotype is typically restricted to short stretches of fewer than 10 nucleotides [191]. This limited similarity can raise serious problems when trying to identify new molecules based on sequence similarity alone [191]. It is often the conserved structure phenotype itself that needs to be used as a guide to identify such sequences.
In sum, any one RNA and protein phenotype can typically be formed by many highly dissimilar sequence genotypes. For well-studied molecules with many known genotypes, these genotypes form a mutationally connected genotype network.
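The distinction drawn in this section between a genotype set and a genotype network can be illustrated with a small sketch. It assumes that the sequences sharing a phenotype are already known and given as a set of equal-length strings; how the phenotype is determined is not modeled, and the function names are hypothetical. The sketch splits the set into groups within which any two sequences can be connected through single-monomer changes that stay inside the set.

```python
from collections import deque


def is_neighbor(seq1, seq2):
    """True if two equal-length sequences differ in exactly one monomer."""
    return len(seq1) == len(seq2) and sum(a != b for a, b in zip(seq1, seq2)) == 1


def genotype_networks(genotype_set):
    """Split a genotype set into its mutationally connected genotype networks."""
    remaining = set(genotype_set)
    networks = []
    while remaining:
        seed = remaining.pop()
        network = {seed}
        queue = deque([seed])
        while queue:  # breadth-first search over 1-mutant neighbors within the set
            current = queue.popleft()
            for other in [g for g in remaining if is_neighbor(current, g)]:
                remaining.remove(other)
                network.add(other)
                queue.append(other)
        networks.append(network)
    return networks


# Hypothetical genotype set: three sequences connected by single changes, one isolated.
example_set = {"AAA", "AAC", "ACC", "GGG"}
networks = genotype_networks(example_set)
print(sorted(sorted(n) for n in networks))  # [['AAA', 'AAC', 'ACC'], ['GGG']]
```

The pairwise comparison used here is only practical for small sets; for the astronomically large genotype sets discussed in the text, sampling approaches are needed instead.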
Systematic explorations of genotype networks
Examples from naturally occurring molecules powerfully demonstrate how molecules with similar phenotype can have very different genotypes. However, they have difficulty answering more general questions, some of which are important to understand innovability. Do genotype networks have similar sizes for molecules with different phenotypes? Which kinds of novel phenotypes (candidate evolutionary innovations) can typically be found near a given genotype network? To answer these questions, one would ideally want exhaustive information about all possible genotypes and their phenotypes, but all we have is a modest and biased set of naturally occurring genotypes. They are a few of the leaves on a vast phylogenetic tree. Many other leaves, and the internal branches connecting them, are usually unknown, because they have not been discovered yet, or because they correspond to sequences from extinct species.
To avoid this limitation of available data, one can either explore a small genotype space exhaustively, or one can sample genotype space in a random, unbiased way. (For sufficiently large sample sizes, results from sampling will be arbitrarily close to those of exhaustive exploration.) Unfortunately, experimental determination of phenotypes is too laborious to create large samples. In addition, computational prediction of all aspects of a molecule’s phenotype is either too slow or impossible with today’s methods. We thus need to focus on aspects of phenotypes that are complex enough to reflect important properties of actual molecules, but simple enough to predict for vast numbers of genotypes. Doing so involves computational approaches, such as computational models of protein structure. I will next introduce some of the relevant models for proteins, before returning to RNA.
Protein folding models

Computational models of protein folding rest on the notion that proteins will fold into a native conformation or tertiary structure that is compact in space and that minimizes the protein’s free energy. An important contribution to this free energy is non-covalent interactions (hydrogen bonds, hydrophobic interactions, and ionic bonds) between amino acids that are not adjacent in the amino acid chain [87]. Some such interactions are favorable and reduce the protein’s free energy, whereas others are unfavorable and increase it. The native conformation has the largest number of favorable and strong interactions. It is usually also a compact conformation, in the sense that amino acids are densely packed in it. The principal obstacle to brute force calculation of this native conformation is the astronomical number of non-native conformations any one protein can form. Computational approaches take various shortcuts to alleviate this problem. One prominent approach uses simplified and tractable models of protein folding called lattice proteins [35, 80–82, 90, 97, 108–110, 156, 185, 186, 220, 283–286, 434, 444, 464, 668, 669, 718, 759, 760, 805, 853]. In lattice proteins, folding is constrained in one important way: in the native conformation, individual amino acids can only assume positions on a discrete grid, a lattice. This grid can be either two-dimensional or three-dimensional (Figure 4.3). The advantage of this discrete representation of tertiary structure is that all possible tertiary structures of an amino acid chain can be enumerated. There are, for example, fewer than $10^5$ possible tertiary structures that completely fill the three-dimensional cubic lattice of Figure 4.3b [455].

Figure 4.3 Lattice proteins. A protein is represented by a chain of black and white beads, corresponding to hydrophobic and hydrophilic amino acids. (a) A protein of 36 amino acids folded onto a 6 × 6 two-dimensional cubic lattice. (b) A protein of 27 amino acids folded onto a 3 × 3 × 3 three-dimensional cubic lattice. From figure 1 in [455], used with permission from AAAS.

To identify the most thermodynamically stable among these structures, lattice protein models make several assumptions. For instance, they may represent proteins as chains of only two types of amino acids: polar (P) and hydrophobic (H). Any protein’s amino acid sequence is then completely determined by choosing one of these types for every position along the chain. When folded compactly on a lattice, a protein’s free energy $E$ can then be calculated by adding the individual energy contributions $E_{a_i a_j}$ made by all individual amino acids $a_i$ and $a_j$ that are adjacent to one another on the lattice, but not adjacent on the amino acid chain. (The latter amino acids do not contribute to the free energy, because they are always adjacent, regardless of the protein’s fold.) If there are only two types (H and P) of amino acids, this interaction energy can assume only three values, $E_{HH}$, $E_{HP}$, and $E_{PP}$. These values can be chosen to
mimic features of actual proteins, e.g., that amino acids of the same type interact preferentially (e.g., $E_{HP} > E_{PP}, E_{HH}$) [455]. Even this simplest of models captures important aspects of protein folding. Specifically, it is the tendency of hydrophobic amino acids to avoid water that drives different proteins to fold into compact shapes with a core of hydrophobic amino acids [87]. Some real proteins can be designed or redesigned merely by choosing suitable hydrophobic and polar amino acids along the chain [143, 382]. These observations justify a restriction to hydrophobic and hydrophilic amino acids as a first order approximation to characterize protein structures. Extensions of this model can capture increasingly subtle aspects of protein folding. Such extensions include larger alphabets of up to 20 amino acids, empirically determined interaction energies for these amino acids, and modifications to incorporate the effects of the solvent in which the protein folds [90].

Lattice proteins are not the only models of protein folding. Another class of models folds proteins without confining them to a lattice [549, 565]. Yet another class does not represent protein conformations in space, but only their free energies [35, 90, 97].

Because models of protein folding minimize free energy, they implicitly rest on the assumption that a protein’s native phenotype is its minimum free-energy fold [28]. But how does a protein find this fold? The total number of possible folds is astronomical, and a protein cannot possibly explore all of them [894]. The likely answer is that the protein’s minimum free-energy structure is surrounded by a “folding funnel” of similar folds with higher free energies [444, 565, 894]. This folding funnel guides the folding protein through states of increasingly lower free energy to the minimum free-energy structure. Thus, aside from exceptions, most proteins are able to form the minimum free-energy fold as their native fold [57].
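The contact-energy sum described above can be sketched in a few lines. This is a minimal illustration on a two-dimensional lattice, not the implementation used in the cited studies; the numeric energies, the function name `fold_energy`, and the example fold are all assumptions made for the sketch, chosen only so that same-type contacts are favored.

```python
# Contact energies for the two-letter (H/P) model. The numbers are
# illustrative assumptions that satisfy E_HP > E_PP, E_HH; they are not
# values taken from the text or from any cited study.
ENERGY = {"HH": -3, "PP": -1, "HP": 0, "PH": 0}


def fold_energy(sequence, coords):
    """Energy of one conformation on a 2D square lattice.

    sequence: string over {"H", "P"}
    coords:   list of (x, y) lattice sites, one per residue, in chain order
    """
    if len(sequence) != len(coords) or len(set(coords)) != len(coords):
        raise ValueError("need exactly one distinct lattice site per residue")
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must be lattice neighbors")
    energy = 0
    for i in range(len(coords)):
        for j in range(i + 2, len(coords)):  # j = i + 1 is the chain neighbor, skipped
            (x1, y1), (x2, y2) = coords[i], coords[j]
            if abs(x1 - x2) + abs(y1 - y2) == 1:  # adjacent on the lattice
                energy += ENERGY[sequence[i] + sequence[j]]
    return energy


# A U-shaped fold that fills a 2 x 3 lattice; its contacts are (0, 5) and (1, 4).
u_fold = [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (0, 1)]
for seq in ("HHPPHH", "HPPPPH", "HPPHPP"):
    print(seq, fold_energy(seq, u_fold))
```

With these assumed energies, a sequence that places hydrophobic residues at both contacts of the fold scores lowest, which is the qualitative behavior the model is meant to capture.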
Genotype networks in protein folding models

The simplest lattice protein models with only hydrophobic and hydrophilic amino acids already demonstrate the existence of genotype networks [80, 455]. For instance, Li and collaborators examined all possible sequences of hydrophobic and polar amino acids in three-dimensional (3 × 3 × 3) and two-dimensional lattice proteins (for various lattice sizes) that have a unique minimum free-energy structure. Here is what they found.

First, there are many fewer protein phenotypes than genotypes. This observation is consistent with the modest numbers of folds (fewer than $2 \times 10^4$) in real proteins [449]. For instance, for three-dimensional lattice proteins, the average phenotype (structure) is realized by 62 of the $2^{3 \times 3 \times 3} = 2^{27} \approx 1.3 \times 10^8$ possible sequences.

Second, some phenotypes are formed by many sequences (up to 3794), whereas others are formed by few [455]. These qualitative observations are largely independent of the protein model used and are thus likely to hold for real proteins [91, 218, 456, 457, 519, 692].

Third, where a phenotype is formed by many amino acid sequences, these sequences can be very dissimilar [455]. These observations are also consistent with evidence from real proteins. The majority of protein structures are “unifolds,” realized by only one family of usually diverse proteins with the same, unique evolutionary origin [418]. In addition, multiple especially “frequent” tertiary structures exist [418]. These include the TIM-barrel I mentioned above, or the Rossmann fold, a tertiary structure found in nucleotide-binding proteins [87, 546]. These structures are frequent in the sense that they occur in multiple families of proteins. Members of a family have significant amino acid sequence similarity and thus a common ancestor. However, little such similarity exists among families. Although even highly dissimilar sequences can share a common ancestor, some frequent folds may have originated multiple times independently [79, 140, 279, 594].

Fourth, most model protein phenotypes have a single network of connected genotypes, whereas only a minority has multiple disconnected networks [156]. In general, the larger the number of genotypes that form a phenotype, the more likely it is that they form one large connected genotype network. Such networks can only exist if any one genotype typically has one or more neighboring genotypes with the same phenotype. This is the case, but within any one genotype network, the distribution of the number of neighbors is very heterogeneous. Some genotypes have many neighbors with the same phenotype, others have few [107].

All these observations have to be taken with a grain of salt, especially because of the small size of the amino acid “alphabet” (two amino acids) in these models, and because other aspects of protein folding are sensitive to the number of different amino acids and their interaction energies [91]. However, explorations of more extensive amino acid alphabets and real protein structures confirm these observations [35].
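The kind of exhaustive analysis described above can be sketched on a much smaller system. The sketch below enumerates every compact fold of a 9-residue chain on a 3 × 3 two-dimensional lattice, assigns each of the 2^9 HP sequences to its unique lowest-energy fold (if it has one), and counts connected genotype networks per fold. The lattice size, the contact energies, and all function names are assumptions made for illustration; the cited study used larger lattices and its own energy values.

```python
from itertools import product

# Assumed contact energies (same-type contacts favored); not from the source.
ENERGY = {"HH": -3, "PP": -1, "HP": 0, "PH": 0}
N = 3  # lattice side; the chain has N * N residues


def compact_folds(n=N):
    """All distinct contact maps of chains that completely fill an n x n lattice."""
    maps = set()

    def extend(path, visited):
        if len(path) == n * n:
            index = {site: i for i, site in enumerate(path)}
            contacts = set()
            for (x, y), i in index.items():
                for nb in ((x + 1, y), (x, y + 1)):
                    j = index.get(nb)
                    if j is not None and abs(i - j) > 1:  # not chain neighbors
                        contacts.add((min(i, j), max(i, j)))
            maps.add(frozenset(contacts))
            return
        x, y = path[-1]
        for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nb[0] < n and 0 <= nb[1] < n and nb not in visited:
                extend(path + [nb], visited | {nb})

    for start in product(range(n), repeat=2):
        extend([start], {start})
    return sorted(maps, key=sorted)  # fixed order, so fold indices are reproducible


def energy(seq, contacts):
    return sum(ENERGY[seq[i] + seq[j]] for i, j in contacts)


def genotype_sets(folds, length=N * N):
    """Map fold index -> sequences whose unique minimum-energy fold it is."""
    sets = {}
    for letters in product("HP", repeat=length):
        seq = "".join(letters)
        scored = sorted((energy(seq, f), k) for k, f in enumerate(folds))
        if len(scored) == 1 or scored[0][0] < scored[1][0]:  # unique ground state
            sets.setdefault(scored[0][1], []).append(seq)
    return sets


def count_networks(seqs):
    """Number of connected components under single amino acid changes."""
    remaining = set(seqs)
    components = 0
    while remaining:
        components += 1
        stack = [remaining.pop()]
        while stack:
            s = stack.pop()
            for i in range(len(s)):
                t = s[:i] + ("P" if s[i] == "H" else "H") + s[i + 1:]
                if t in remaining:
                    remaining.remove(t)
                    stack.append(t)
    return components


folds = compact_folds()
sets = genotype_sets(folds)
for k, seqs in sorted(sets.items(), key=lambda kv: -len(kv[1])):
    print(f"fold {k}: {len(seqs)} sequences, {count_networks(seqs)} network(s)")
```

Treating the contact map as the phenotype makes rotated and mirrored copies of a fold count as one structure. On a lattice this small the numbers carry no biological weight; the point is only the procedure of enumerating folds, assigning genotypes, and testing connectivity.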
Different neighborhoods of the same protein genotype network contain different novel phenotypes

Just like comparative data from real proteins, protein folding models thus support the notion that amino acid sequences are organized into extensive genotype networks. But an important
question about innovability remains. Are the phenotypes in the neighborhood of different genotypes on the same genotype network similar? This question is important, because its answer determines how broad the spectrum of phenotypes is that one or few mutations can reach from a genotype network. Available knowledge about the biochemical function of thousands of proteins can help answer this question. Figure 4.4 shows the answer for enzymes, the most prominent and especially well-studied class of proteins. The figure shows an analysis of small neighborhoods around two proteins with the same structure, where the proteins have varying
[Figure 4.4 plot. Horizontal axis: genotype distance D, from 0 to 1.0. Vertical axis: fraction U of unique phenotypes in neighborhood, from 0 to 1.]
Figure 4.4 Small neighborhoods around different proteins with a given structure contain proteins with many unique functions. The horizontal axis shows the genotype distance D of two single-domain protein genotypes with the same structure. The vertical axis shows the fraction U of proteins with enzymatic functions that occur in a small sequence neighborhood of one but not the other of the two genotypes. The neighborhood in question covers a radius of k = 5 point mutations around each sequence. The data for this figure are based on the 30 most abundant protein folds (the folds with the highest number of sequences) in a dataset with 16,574 single-domain proteins of known structure and enzymatic function. This dataset includes 705 different types of enzymes from all six enzyme classes of the Enzyme Commission (EC) nomenclature [239]. The moderately large value of k, and the large uncertainty (long error bars) at small distances D result from the low number of known enzyme pairs at low genotype distance D. Error bars represent standard errors of the mean [239].
genotype distance D. It shows that even for proteins with moderate genotype distance D one finds more than 50 percent of enzymatic phenotypes unique to one neighborhood, a percentage that increases for larger distances D [239]. Similar observations would hold for small neighborhoods around protein genotypes with conserved function instead of conserved structure [239]. The data in this figure come from more than 16,000 proteins with known structure and enzymatic functions. Although these data cover 750 different enzyme functions, they are still a very sparse sample of sequence space. For example, sequence pairs with small D are highly underrepresented in them [239]. This explains the lack of data at the smallest values of D, the large uncertainty at moderate D (long error bars), and the necessity to analyze k-neighborhoods of radius k = 5 and not k = 1 for this figure. As sequence space becomes better characterized, these gaps will undoubtedly be filled. Despite these limitations, the data clearly shows that different neighborhoods in sequence space contain different novel phenotypes.
RNA structure prediction

I will next return to RNA molecules and their organization in genotype space. Systematic explorations of RNA phenotypes that avoid the biases inherent in comparative studies of natural RNA molecules demonstrate patterns of organization similar to those of proteins. Such explorations sample RNA phenotypes and genotypes at random from sequence space, explore their properties, and compare them to those of naturally occurring RNA molecules. The organization of RNA phenotypes in genotype space is even better studied than that of proteins. The focus in such studies is on RNA secondary structure phenotypes, because tertiary structure data is very limited, and because, as discussed above, secondary structure is often essential to RNA function. Experimental techniques to analyze RNA secondary structure are too laborious to analyze thousands of RNA genotypes and phenotypes, as is necessary to understand their organization in genotype space [408, 530]. In consequence, computational approaches are important to determine RNA secondary structures. Two categories of approaches exist. The first predicts secondary
structures by comparing RNA sequences with conserved function from multiple different organisms [255, 364, 365, 543, 587, 595, 859]. Its reliance on available sequences is thus subject to the biases discussed above and makes it less suited for a systematic analysis of sequence space. The second category predicts RNA secondary structure from thermodynamic principles. Such prediction can be made on several levels of resolution. First, one can determine an RNA molecule’s secondary structure with the smallest free energy, that is, the most stable secondary structure. The task of predicting this minimum free-energy structure is simplified by the fact that each secondary structure consists of only two kinds of elements: loops, that is, regions of unpaired bases; and stacks, regions of paired bases (Figure 4.1). Loops destabilize a secondary structure, whereas stacks stabilize it. I note that the most stable secondary structure is generally not the structure with the largest number of paired bases. Part of the reason is that each stack, although it stabilizes secondary structure, by necessity creates a loop, which destabilizes secondary structure. For instance, a transfer RNA responsible for attaching histidine to a nascent polypeptide has more than $10^5$ secondary structures with 26 base pairs, the maximum number of base pairs possible for this RNA. However, it has only one minimum free-energy structure, which has 22 base pairs, fewer than the maximum of 26 [864]. Most of a structure’s stabilizing energy comes from interactions between the aromatic rings of adjacent pyrimidine and purine base-pairs, so-called stacking interactions. Secondary structure prediction algorithms [563, 841, 892, 893] take advantage of experimentally determined energy contributions of stacks and loops [318, 363, 497, 789, 835]. 
Albeit not perfect, their predictions often agree well with experimentally determined secondary structures; most importantly for my purpose, their characterizations of genotype network structure are insensitive to algorithmic details [750]. Some computational approaches to predict RNA structure take into account that each possible structure—including an RNA’s minimum free-energy structure—is only metastable. That is, thermal fluctuations cause an RNA molecule to unfold and
refold constantly, and thus to assume a whole spectrum of different secondary structures. The lower a secondary structure’s free energy compared to other structures, the more time the RNA molecule will spend in this structure. The molecule thus spends the relative majority of its time in its minimum free-energy structure. Algorithms to calculate the free energies of secondary structures within an energy range above the minimum free energy exist [864]. I will discuss some observations made with these computationally more demanding algorithms later on (Chapter 11). For the moment, I will focus on minimum free-energy structures, because they expose the key features of genotype network organization most clearly. Yet another class of algorithms to predict RNA structure also takes into account an RNA’s folding kinetics, the temporal order in which base pairs form as an RNA molecule is synthesized. Folding kinetics is an important determinant of structure, especially for long RNA sequences [326, 532]. Algorithms that take folding kinetics into account are computationally demanding [245, 298, 356, 523]. Thus, they are not yet extensively used to study genotype networks. The currently most comprehensive analyses of RNA secondary structures have been carried out by Peter Schuster and his associates [687].
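The statement that a molecule spends more time in structures of lower free energy can be made concrete with Boltzmann weights. The sketch below is a generic thermodynamic illustration, not one of the cited algorithms; the list of free energies is hypothetical, and the temperature of 37 °C is an assumption.

```python
import math

# Gas constant in kcal/(mol K) times an assumed temperature of 37 degrees C.
RT = 0.0019872 * 310.15


def boltzmann_probabilities(free_energies, rt=RT):
    """Equilibrium probability of each structure given its free energy (kcal/mol)."""
    e_min = min(free_energies)
    # Subtracting the minimum keeps the exponentials numerically well behaved.
    weights = [math.exp(-(e - e_min) / rt) for e in free_energies]
    z = sum(weights)  # partition function, up to a constant factor
    return [w / z for w in weights]


# Hypothetical free energies of five alternative structures of one molecule.
energies = [-10.0, -9.8, -9.5, -9.0, -9.0]
probs = boltzmann_probabilities(energies)
for e, p in zip(energies, probs):
    print(f"{e:6.1f} kcal/mol  ->  p = {p:.3f}")
```

In this made-up spectrum the minimum free-energy structure is the single most probable one, yet its probability stays below one half, which is what "relative majority" means here.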
RNA genotype networks have highly nonuniform sizes

The most basic prerequisite for the existence of genotype networks is that multiple genotypes can form the same phenotype. Like proteins, RNA molecules fulfill this prerequisite. The average number of genotypes per phenotype can be determined exhaustively for short RNA sequences. It can be estimated through combinatorial analysis for longer sequences [331, 686]. This number is astronomical, even for sequences of moderate length. For instance, there are $4^{20} \approx 1.10 \times 10^{12}$ RNA sequences with 20 nucleotides, but no more than 2741 distinct minimum free-energy structures of 20 nucleotides. This implies that, on average, there are more than 400 million RNA sequences per structure, even for sequences as short as 20 nucleotides. The discrepancy between number of sequences and number of structures becomes much greater for longer sequences: the number of RNA sequences of a given length S is $4^S$, whereas the number of RNA structures scales with S as approximately $1.8^S$ [688]. There are therefore over $2^S$-fold more sequences than structures.

Analogously to proteins, a genotype set is the set of RNA sequences that fold into the same structure. A genotype network is a connected set of RNA sequences with the same phenotype. That is, every pair of RNA sequences in this network can be connected through a series of single nucleotide changes that do not change the structure. The size of a genotype set depends strongly on the associated structure: most structures have few sequences folding into them, but the vast majority of sequences fold into a small number of structures with large genotype set sizes. Figure 4.5 shows a plot of the distribution of genotype set sizes (expressed both as a fraction of genotype space size, and in absolute numbers of genotypes) for structures found in a random sample of one million RNA sequences that are each 30 nucleotides long [830]. The plot clearly illustrates that this distribution is highly skewed and far from uniform. Some structures have genotype sets that are 1000 times larger than those of other structures, even in this modest sample of sequences that comprises only a tiny fraction $10^6/4^{30} \approx 9 \times 10^{-13}$ of sequence space.

For even shorter sequences or sequences with a restricted “alphabet” of only two nucleotides, the size distribution of genotype sets can be determined exactly through exhaustive folding of all sequences [294, 685, 686]. For instance, there are more than $10^9$ sequences of S = 30 nucleotides that consist only of G and C nucleotides. These sequences fold into one of approximately $2 \times 10^5$ structures, yielding on average 5000 sequences per structure. If one defines a frequent structure as one that is realized by more sequences than this average, then only 10.4 percent of structures are frequent. However, 93 percent of sequences fold into these frequent structures.
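The counts quoted above follow from simple arithmetic, which the short check below reproduces. It uses only numbers stated in the text (2741 structures for 20-mers, a growth factor of 1.8 per nucleotide, about 2 × 10^5 structures for GC-only 30-mers); the helper names are assumptions.

```python
def n_sequences(length, alphabet_size=4):
    """Number of sequences of a given length over an alphabet."""
    return alphabet_size ** length


# 20-mers over {A, C, G, U}: sequences per minimum free-energy structure.
n20 = n_sequences(20)
per_structure_20 = n20 / 2741

# Share of the 30-mer sequence space covered by a sample of one million.
sample_fraction = 10 ** 6 / n_sequences(30)

# GC-only 30-mers: average genotype set size for about 2 x 10^5 structures.
gc30 = n_sequences(30, alphabet_size=2)
per_structure_gc = gc30 / 2e5


def sequence_to_structure_ratio(length):
    """Approximate excess of sequences over structures, (4 / 1.8) ** S."""
    return (4 / 1.8) ** length


print(f"4^20 = {n20:.3e}, sequences per structure = {per_structure_20:.3e}")
print(f"sample fraction of 30-mer space = {sample_fraction:.2e}")
print(f"2^30 = {gc30:.3e}, sequences per structure = {per_structure_gc:.0f}")
print(f"ratio at S = 100: {sequence_to_structure_ratio(100):.2e}")
```

Because 4/1.8 is slightly larger than 2, the excess of sequences over structures grows faster than 2^S, in line with the "over 2^S-fold" statement.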
At the other end of the structure spectrum, one finds 12,362 structures formed by only one sequence, and more than half of all possible structures are formed by fewer than 100 sequences [294, 685, 686]. As a sequence’s length increases, the frequent structures occupy an increasingly large fraction of sequence space. For very long sequences, almost all
[Figure 4.5 plot. Horizontal axis: size rank of genotype set, from 0 to 80,000. Left vertical axis: genotype set size / $4^{30}$, logarithmic, from $5 \times 10^{-6}$ to $5 \times 10^{-3}$. Right vertical axis: genotype set size, logarithmic, from $5 \times 10^{12}$ to $5 \times 10^{15}$.]
Figure 4.5 The number of sequences folding into one structure has a highly skewed distribution. For this figure, I randomly sampled $10^6$ RNA sequences of length 30, determined their minimum free-energy secondary structure, and ranked each structure according to how frequently it occurred in this sample [332]. For a structure that occurs sufficiently frequently, one can estimate the structure’s genotype set size from its number of occurrences in the sample. The plot shows structure rank (horizontal axis) plotted against estimated genotype set size (vertical axes), expressed as a fraction of the size of sequence space, $4^{30}$ (left vertical axis), and in terms of absolute numbers (right vertical axis). For rarely found structures that occur only once or few times, this procedure overestimates genotype network size [830]. Note the logarithmic scale on the vertical axis. Structure frequencies vary by more than a factor $10^3$ even in this modest sample of short sequences.
sequences fold into a vanishingly small fraction of structures. These properties are not peculiarities of small molecules or molecules with restricted nucleotide composition. They are typical of RNA molecules. With these observations in mind, it becomes clear that rare phenotypes, phenotypes with small genotype networks, may not play an important role in evolution. They are difficult to find in a genotype space that is almost completely filled with the large genotype networks of frequent phenotypes. Are there any obvious features that render a structure frequent? It has been suggested that structures with large genotype networks show a balance between stacked regions that provide thermodynamic stability and looped regions that can be realized by many different RNA sequences [249]. However, heuristic algorithms to predict genotype network size from these and other simple structural characteristics have so far had limited success [147, 149, 378]. A next question regards the organization of all genotypes with the same phenotype in genotype space. The only phenotypes of any practical importance here are frequent phenotypes with large genotype sets. It has been shown that three main possibilities for their organization exist [640, 687]. First, these genotypes can be connected in a single large genotype network; second, the vast majority may fall into a single genotype network, with a small minority being organized into much smaller networks or isolated genotypes; and third, they
may form a small number of genotype networks that are each very large in size. A mathematical observation that I will revisit in Chapter 6 shows that a surprisingly simple property can predict whether the first possibility holds, i.e., whether a single genotype network exists. This property is the fraction n of a sequence’s neighbors that have the same phenotype as the sequence itself, averaged over all sequences with the same phenotype [640, 641]. Specifically, if this average fraction of neutral neighbors exceeds a value of 0.37, then the sequences with the phenotype in question will generally be connected in a single genotype network. The more frequent a phenotype is, the greater the average number of neutral neighbors of its genotypes, and the more likely it is that all of these genotypes form a single connected network [830]. In genotypes with frequent phenotypes, the average fraction of neutral neighbors will often exceed 0.37 [378, 640]. In sum, regardless of whether all genotypes with the same phenotype are connected in a single genotype network, a large fraction of them can typically be reached from one another through single-point mutations.
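The fraction of neutral neighbors is straightforward to compute once a genotype-to-phenotype map is available. The sketch below is only an illustration: `toy_structure` is a deliberately crude stand-in for a folding algorithm (it pairs mirrored positions of a hairpin and ignores thermodynamics), and the closed-form threshold $1 - \kappa^{-1/(\kappa-1)}$ for an alphabet of $\kappa$ letters is an assumption recalled from random-graph treatments of neutral networks. The text itself states only the value 0.37.

```python
PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def toy_structure(seq, min_loop=3):
    """Toy phenotype, NOT thermodynamic folding: pair position i with its
    mirror position whenever the two bases can pair, leaving a hairpin loop."""
    s = ["."] * len(seq)
    i, j = 0, len(seq) - 1
    while j - i > min_loop:
        if seq[i] + seq[j] in PAIRS:
            s[i], s[j] = "(", ")"
        i += 1
        j -= 1
    return "".join(s)


def neutral_fraction(seq, phenotype, alphabet="ACGU"):
    """Fraction of single-point mutants that keep the phenotype of seq."""
    target = phenotype(seq)
    neighbors = [seq[:i] + b + seq[i + 1:]
                 for i in range(len(seq)) for b in alphabet if b != seq[i]]
    return sum(phenotype(n) == target for n in neighbors) / len(neighbors)


def connectivity_threshold(alphabet_size):
    """Assumed threshold above which a genotype set is generically connected."""
    return 1 - alphabet_size ** (-1 / (alphabet_size - 1))


seq = "GGGAAAACCC"
nu = neutral_fraction(seq, toy_structure)
print(toy_structure(seq), nu, round(connectivity_threshold(4), 3))
```

For a real analysis, the neutral fraction would be averaged over many sequences sampled from the same genotype set before comparing it with the threshold; a single sequence, as here, only demonstrates the calculation.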
In addition to being organized into one or few genotype networks, the genotypes that form a frequent phenotype are also extremely diverse. Computationally, this can be shown by randomly changing some genotype G in single mutational steps, while requiring that each mutation preserves the minimum free-energy structure, until a genotype maximally distant from G has been reached. By repeating this procedure for different starting genotypes with the same phenotype, one can obtain a distribution of the maximal distance D of genotypes with the same phenotype. This distance D is conveniently expressed as the fraction of nucleotides that differ among these genotypes. Even for short random sequences (thus random phenotypes) of 100 nucleotides, this maximal distance is on average greater than 0.95 [687]. This means that for typical phenotypes, genotypes can differ in more than 95 percent of their nucleotides while preserving their phenotype. Figure 4.6 illustrates that very different genotypes (with the same phenotype) can be reached even with a limited number of mutations away from a starting genotype. To create this figure, I used the
[Figure 4.6 plot. Histogram. Horizontal axis: genotype distance, from 0 to 1. Vertical axis: number of sequences, from 0 to 80.]
Figure 4.6 RNA molecules with the same structural phenotype can have very different genotypes. The distribution of genotype distances based on 466 RNA molecules with “typical” secondary structures, i.e., structures formed by 466 sequences drawn at random from genotype space. For each such sequence, a random walk in genotype space was carried out. This random walk comprised 5000 mutations. Each step in each walk had to preserve the sequence’s structure, and it was not allowed to decrease the distance to the starting sequence. The figure shows the distribution of genotype distances to the starting sequence at the endpoints of these random walks, expressed as the fraction of nucleotides in which the two sequences differed. The mean genotype distance is 0.87. For most genotypes, the maximally achievable distance could be even higher than shown here, because only a limited number of mutations was used in this analysis.
approach I have just discussed and applied it to 400 randomly chosen sequences of length 100. All these observations complement empirical evidence I discussed earlier from extremely diverse natural RNA molecules. The ability to diversify genotypically while preserving a phenotype is thus not a peculiarity of some natural RNA molecules. It is a generic feature of RNA.
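The random-walk procedure behind this analysis can be sketched as follows. The walk accepts a point mutation only if it preserves the phenotype and does not bring the sequence closer to its starting point. Everything specific here is an assumption for illustration: `toy_structure` is a crude hairpin-pairing rule rather than a thermodynamic folding algorithm, and the sequence length, step count, and random seed are arbitrary.

```python
import random

PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def toy_structure(seq, min_loop=3):
    """Toy phenotype, NOT thermodynamic folding: mirrored positions pair
    whenever their bases are complementary (including G-U wobble pairs)."""
    s = ["."] * len(seq)
    i, j = 0, len(seq) - 1
    while j - i > min_loop:
        if seq[i] + seq[j] in PAIRS:
            s[i], s[j] = "(", ")"
        i += 1
        j -= 1
    return "".join(s)


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def neutral_walk(start, phenotype, steps, rng, alphabet="ACGU"):
    """Random walk that keeps the phenotype and never moves closer to start."""
    target = phenotype(start)
    current = start
    for _ in range(steps):
        i = rng.randrange(len(current))
        base = rng.choice([b for b in alphabet if b != current[i]])
        candidate = current[:i] + base + current[i + 1:]
        if (phenotype(candidate) == target
                and hamming(candidate, start) >= hamming(current, start)):
            current = candidate
    return current


rng = random.Random(0)  # fixed seed so the run is repeatable
start = "".join(rng.choice("ACGU") for _ in range(30))
end = neutral_walk(start, toy_structure, steps=5000, rng=rng)
distance = hamming(start, end) / len(start)
print(toy_structure(start))
print(start, end, f"distance = {distance:.2f}", sep="\n")
```

Even in this toy map, paired positions can drift apart from the starting sequence through intermediate wobble pairs, which loosely mirrors how compensatory changes preserve real stems.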
Individual genotype networks are highly heterogeneous

Similar to my earlier discussion of proteins, I will now turn to features of individual RNA genotype networks that are important for innovability. The first of these regards the neighborhoods of individual sequences. I have already discussed that individual genotypes on a genotype network usually have multiple neighbors with the same phenotype. The actual number, however, varies greatly among members of the same genotype network; Figure 4.7 shows an example. It is based on sequences sampled at random from the genotype network of a transfer RNA that incorporates the amino acid phenylalanine into nascent proteins [745]. The figure shows the fraction n of neighbors with the same phenotype as this transfer RNA. For a molecule of this length (S = 76 nucleotides), a fraction n = 0.1 would correspond to approximately 23 neighbors with the same phenotype. The distribution in Figure 4.7 is clearly quite broad. Just as in proteins, some sequences have many neutral neighbors whereas others have few. What about the remainder of a genotype’s neighborhood, the sequences that form different phenotypes? Are these phenotypes very similar to one another? Just as it is possible to measure distance among sequence genotypes, various measures of phenotype (structure) distance exist [250]. Perhaps the simplest one determines the fraction of different symbols in the dot-parenthesis representation (Figure 4.1) of two secondary structures. It indicates the number of base pairs in which two structures differ. Figure 4.8a shows as an example an arbitrary RNA sequence of length S = 30 and its minimum free-energy secondary structure. 17 percent of the sequence’s 3S = 90 neighbors have the same structure as the sequence itself. The remainder covers a broad spectrum of structures that differ in between 1 and 19 base pairs from the reference structure shown on the left. Examples of these structures are shown near the individual sections of the pie chart. Figure 4.8b [745] shows a similar pattern for a natural RNA molecule. It is the Hammerhead ribozyme of peach
[Figure 4.7 plot. Histogram. Horizontal axis: fraction n of neighbors with the same phenotype, from 0.1 to 0.5. Vertical axis: number of sequences, from 0 to 1000.]
Figure 4.7 RNA sequences have many neighbors with the same phenotype. The figure is based on more than $10^4$ sequences sampled at random from the genotype network of the cloverleaf RNA structure characteristic of $\mathrm{tRNA^{Phe}}$ (S = 76), a transfer RNA responsible for transporting phenylalanine to the ribosome during translation [745]. The figure shows the distribution of the fraction of neighbors with the same minimum free-energy structure as the sampled sequences. The generally large number of such neighbors is not a peculiarity of $\mathrm{tRNA^{Phe}}$, but typical of RNA structures [745, 830].
[Figure 4.8 graphic. (a) An RNA sequence of 30 nucleotides with its secondary structure, and a pie chart of its neighborhood: 3S = 90 neighbors, 32 different structures, 17 different distances. Labeled pie sections: neutral, d = 0, 17%; d = 1, 26%; d = 2, 7%; d = 6, 7%; d = 9, 8%; d = 19, 8%. (b) Histogram. Horizontal axis: structure distance, from 2 to 54. Vertical axis: number of sequences, from 0 to 1400. Legend: Hammerhead (54-mer); Random 54-mer.]
Figure 4.8 The neighborhood of a genotype contains a broad spectrum of phenotypes. (a) The pie chart shows the distribution of the distance d between the RNA secondary structure phenotype on the left, and the RNA phenotypes in the neighborhood of the sequence shown. Distance d between two structures is measured as the number of differences in their dot-parenthesis representation (Figure 4.1), and indicated by gray shading in the pie chart. Encircling the pie chart are graphical examples of some structures, together with their distances and the percentage of the neighborhood that their genotypes occupy. (b) For a 54nt “hammerhead” RNA structure involved in the self-cleavage of peach latent mosaic viroid [18], the histogram shows the distribution of structure distances, for phenotypes found in the neighborhoods of many sequences sampled at random from the genotype network of this structure (black bars). The gray bars show an analogous histogram, but for randomly chosen RNA molecules of length 54, and thus for random phenotypes [745]. The figure demonstrates that the broad distribution of structures with a variety of distances d in a sequence neighborhood is not a peculiarity of individual sequences or structures, but a typical feature of RNA molecules. Similar observations have been made for RNA molecules of different length, and using different distance measures [745].
latent mosaic viroid, a simple plant parasite whose RNA genome can cleave itself [323]. The figure shows the distribution (black bars) of the structure distances for the neighborhoods of sequences sampled from the genotype network of this ribozyme [745]. This distribution is not a peculiarity of the particular phenotype shown; it also holds for random genotypes of the same length (gray bars), and independently of the particular structure distance measure used [745]. To summarize, the genotypes of RNA phenotypes are typically organized into vast genotype networks that reach far through sequence space. Different genotypes on a genotype network vary in their number of neutral neighbors. The remaining (nonneutral) neighbors form a broad spectrum of phenotypes. These properties are a generic feature of RNA phenotypes. They hold both for biological RNA molecules, and for RNA molecules sampled at random from genotype space.
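The simplest structure distance mentioned above, the number of differing symbols between two dot-parenthesis strings, and the tally of distances across a sequence's 3S neighbors can be sketched as follows. The phenotype function `toy_structure` is again a crude hairpin-pairing rule assumed purely for illustration; with a real folding algorithm in its place, the spectrum would be far broader than the one this toy produces.

```python
from collections import Counter

PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def toy_structure(seq, min_loop=3):
    """Toy phenotype, NOT thermodynamic folding: mirrored positions pair
    whenever their bases are complementary."""
    s = ["."] * len(seq)
    i, j = 0, len(seq) - 1
    while j - i > min_loop:
        if seq[i] + seq[j] in PAIRS:
            s[i], s[j] = "(", ")"
        i += 1
        j -= 1
    return "".join(s)


def structure_distance(a, b):
    """Number of positions at which two dot-parenthesis strings differ."""
    if len(a) != len(b):
        raise ValueError("structures must have equal length")
    return sum(x != y for x, y in zip(a, b))


def one_neighbors(seq, alphabet="ACGU"):
    """All 3S sequences that differ from seq in exactly one nucleotide."""
    return [seq[:i] + b + seq[i + 1:]
            for i in range(len(seq)) for b in alphabet if b != seq[i]]


def distance_spectrum(seq, phenotype):
    """Tally of structure distances between seq and each of its neighbors."""
    reference = phenotype(seq)
    return Counter(structure_distance(reference, phenotype(n))
                   for n in one_neighbors(seq))


seq = "GGGAAAACCC"
spectrum = distance_spectrum(seq, toy_structure)
for d, count in sorted(spectrum.items()):
    print(f"d = {d}: {count} of {3 * len(seq)} neighbors")
```

Distance 0 counts the neutral neighbors, so the same tally yields both the neutral fraction and the spread of non-neutral phenotypes.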
Different neighborhoods of the same RNA genotype network contain different novel phenotypes

I next turn to the question of how diverse the phenotypes in the neighborhood of different RNA genotypes are, a question that Figure 4.4 has addressed already for proteins. An early pertinent study carried out long random walks on a genotype network starting from a given sequence [347]. It then determined the cumulative number of unique phenotypes that the random walker encountered in its neighborhood, and found that this number increases nearly linearly, without showing any signs of leveling off for long random walks. This observation suggests that different sequence neighborhoods on a genotype network may contain very different phenotypes.

The results of an analysis shown in Figure 4.9 underscore this observation. The starting point of the analysis is a reference genotype G that has a given phenotype. One then generates genotypes GD that differ in D nucleotides from G and that lie on the same genotype network as G. That is, they have the same phenotype as G. One then examines the 1-neighborhoods of both G and GD, and counts the fraction of phenotypes in the neighborhood of GD that do not also occur in the neighborhood of G. In other words, one asks how many phenotypes are unique to the neighborhood of GD. Figures 4.9b and 4.9c show the answer for two different RNA molecules with known biological function, as a function of the number of nucleotide differences D. They demonstrate that the fraction of unique phenotypes is greater than 50 percent even for distances as small as D = 2, and eventually approaches a value greater than 0.8 for large D [745]. The neighborhoods of different sequences on a genotype network thus contain very different phenotypes. This holds not only for the two biological molecules shown here; it is a generic feature of RNA sequences [745]. Thus, RNA molecules are similar to proteins in the diversity of their neighborhoods.
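The comparison of two neighborhoods described above can be made concrete with a minimal sketch. It assumes that the phenotypes in each 1-neighborhood have already been determined by some folding procedure (not shown) and that the shared phenotype P has been excluded. The function name and the toy dot-parenthesis structures are hypothetical and serve only as an illustration.

```python
def unique_fraction(neighborhood_g, neighborhood_gd):
    """Fraction U of phenotypes in the neighborhood of GD that do not
    also occur in the neighborhood of G.

    Both arguments are iterables of phenotype labels (for example,
    dot-parenthesis strings). Assumption: the shared phenotype P of
    G and GD has already been removed from both.
    """
    phenos_g = set(neighborhood_g)
    phenos_gd = set(neighborhood_gd)
    if not phenos_gd:
        raise ValueError("the neighborhood of GD contains no novel phenotypes")
    # Phenotypes found near GD but not near G, relative to all near GD.
    return len(phenos_gd - phenos_g) / len(phenos_gd)


# Toy example with made-up structures of length 6.
around_g = ["((..))", "(....)", ".(..)."]
around_gd = ["((..))", "(...).", ".(...)", "......"]

print(unique_fraction(around_g, around_gd))
```

In this toy case, three of the four phenotypes near GD are absent from the neighborhood of G, so U = 0.75.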
The close proximity of different genotype networks

In previous chapters on metabolism and regulation, I showed that genotype networks of different phenotypes are close together and tightly interwoven in genotype space. We do not know whether this holds for protein genotype space, but RNA genotype space shows this feature. Consider a randomly chosen RNA sequence that folds into a frequent structure. Now choose a completely different frequent structure, and ask how far one has to step away from the original sequence to find a sequence that folds into this second structure. This question can be asked in more general terms: How large is the radius k of the neighborhood (sphere) around one sequence that is sufficient to find a representative of any common structure? Recall that a neighborhood of radius k is a collection of sequences that differ in no more than k nucleotides from the neighborhood’s center sequence. If this radius k were large, such as half that of the entire sequence space (k ≈ S/2), one would have to traverse at least half of the sequence space to find a representative of every structure. However, k, which can be estimated, is much smaller than S/2 [685, 686, 688]. For instance, for RNAs of length S = 100 nucleotides, a sphere of k = 15 mutational steps contains with probability one a sequence for any frequent structure. This implies that one has to search only a vanishingly small fraction of sequence space (one part in $4.52 \times 10^{37}$ for sequences of length 100) to find all common structures. This phenomenon has been called shape space covering [688].
[Figure 4.9, panels (a) to (c). Panel (a): schematic of two genotypes (sequences) G and GD with the same phenotype (structure) P, and the fraction U of phenotypes unique to the neighborhood of GD. Panel (b): peach latent mosaic viroid (54 nt). Panel (c): tRNA (76 nt). In panels (b) and (c), the vertical axis shows the fraction U of unique structures (0.0 to 1.0, mean ± SE) and the horizontal axis shows the genotype distance D.]
Figure 4.9 The neighborhoods of different RNA sequences on a genotype network contain very different phenotypes. (a) The bottom symbols G and GD stand for two RNA genotypes (sequences) that differ in D nucleotides, but that form the same phenotype P. The left and right circles symbolize the phenotypes different from P that occur in the neighborhoods of G and GD. Of special interest is the shaded circle segment. It contains the fraction (among all phenotypes in the neighborhood of GD) that occur only in the neighborhood of GD, but not in the neighborhood of G. For brevity, I refer to it as the fraction U of phenotypes unique to the neighborhood of GD. The lower two panels show the distance D (horizontal axes) of two genotypes on the same genotype network plotted against U (vertical axes) for genotypes that form (b) the 54nt hammerhead structure involved in the self-cleavage of peach latent mosaic viroid [323], and (c) the tRNA^Phe cloverleaf structure. Note that D here corresponds to a number, not a fraction of changed nucleotides, as in most other figures of the paper. The figure shows that U approaches a value of one rapidly as D increases. The same pattern holds for random RNA structure phenotypes [745].
Vast network size and tiny occupancy of sequence space

The vast sizes of genotype spaces for both proteins and RNA give rise to counterintuitive properties. One of them is that there is no contradiction between the observation that a genotype network may occupy a tiny fraction of genotype space, yet be astronomical in size. For example, the frequency at which ATP-binding proteins are found in random protein libraries suggests that a tiny fraction $10^{-11}$ of random amino acid sequences can bind ATP [389]. Because the proteins in question are 80 amino acids long, the sequence space comprises $20^{80} = 1.2 \times 10^{104}$ sequences. Thus, the tiny fraction of $10^{-11}$ translates into a huge number of $1.2 \times 10^{93}$ genotypes with ATP-binding phenotypes. Another example comes from experiments with randomized chorismate mutase, a metabolic enzyme necessary for amino acid biosynthesis [761]. These experiments suggest that a fraction $10^{-24}$ of random proteins that are 93 amino acids long encode a protein with the same structure and activity. This fraction is 13 orders of magnitude smaller than the $10^{-11}$ above, probably because it is more difficult to design a protein with specific catalytic activity than a protein that just binds a given molecule. However, even this tiny fraction translates into a genotype network size of $10^{-24} \times 20^{93} = 9.9 \times 10^{96}$ genotypes [761]. For part of the λ repressor, a transcriptional regulator that plays an important role in the life cycle of the bacteriophage λ, a fraction $10^{-63}$ of all amino acid sequences may yield a functional protein [639]. The genotype network(s) of this protein, which comprises 92 amino acids, thus have $20^{92} \times 10^{-63} = 5 \times 10^{56}$ sequences. If $10^{-63}$ is unimaginably small, then $10^{56}$ surely is unimaginably large.
For instance, a random protein library that contains only one copy of each of $10^{56}$ protein sequences of length 92 would have a mass of $10^{37}$ g, more than a billion times the mass of the earth ($5.97 \times 10^{27}$ g), or several thousand times the mass of the sun ($2 \times 10^{33}$ g). Analogous observations hold for RNA molecules with specific phenotypes [673]. Common lore has it that molecules with some specific phenotype are very rarefied in sequence space. The above observations support this notion. However, they also show that the genotype network(s) associated with such molecules are astronomically large.
Molecules with a given structure and function may thus be both rare, in the sense that they occupy a tiny fraction of sequence space, and frequent, in the sense that they have large genotype networks.

These observations raise the additional question of whether natural RNA or protein molecules with some biological function are unusually rare compared to other molecules, for example, molecules with typical phenotypes chosen at random from genotype space. That is, are the genotype networks of biological RNA molecules unusually small? To my knowledge, this question is still unanswered for proteins. However, for RNA structures, we have developed an approach that allows estimation of genotype network size for RNA structures of length up to circa S = 100 [378]. Applying this approach to some 80 biological RNA molecules with diverse functions showed that their genotype networks are very large, while occupying a tiny fraction of genotype space. It also showed that the genotype networks of these structures are larger than those of randomly chosen RNA structures [378]. In other words, the structure phenotypes of biological RNA molecules are not rarer, but even more frequent, than those of generic RNA structures. I will revisit these observations in Chapter 8, together with a candidate explanation.
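The genotype network sizes quoted in this section follow from multiplying a tiny fraction by an enormous sequence space. As a minimal sketch, the arithmetic can be checked in logarithms to avoid handling huge integers. The function names are hypothetical; the fractions and protein lengths are the ones given in the text.

```python
import math


def log10_network_size(log10_fraction, length, alphabet_size=20):
    """log10 of (fraction of sequence space) x (alphabet_size ** length)."""
    return log10_fraction + length * math.log10(alphabet_size)


def as_scientific(log10_value):
    """Format a log10 value as 'm x 10^e' with one decimal in the mantissa."""
    exponent = math.floor(log10_value)
    mantissa = 10 ** (log10_value - exponent)
    return f"{mantissa:.1f} x 10^{exponent}"


# (label, log10 of the functional fraction, protein length in amino acids)
examples = [
    ("ATP-binding proteins", -11, 80),
    ("chorismate mutase", -24, 93),
    ("lambda repressor (part)", -63, 92),
]

for label, log10_fraction, length in examples:
    size = log10_network_size(log10_fraction, length)
    print(f"{label}: about {as_scientific(size)} genotypes")
```

Working with exponents rather than raw counts makes it easy to see that a loss of tens of orders of magnitude in the fraction is dwarfed by the more than one hundred orders of magnitude contributed by sequence space itself.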
Differences between RNA and protein phenotypes

The evidence discussed above highlights multiple similarities between protein and RNA phenotypes. Most importantly, both kinds of phenotypes occur in vast genotype networks that span genotype space. However, there are also some differences between protein and RNA phenotypes, one of which I will now highlight. The majority of random RNA molecules form a well-defined secondary structure, which is often essential to their function [687]. This does not necessarily hold for proteins. Estimates of the fraction of proteins that fold into some structure are based on experimental work on random protein libraries and on theoretical calculations. Such estimates vary broadly between 0.01 and 10 percent, but they generally suggest that a minority of proteins fold [164, 241, 389]. The reason may be that there are fewer amino acid sequences that can fold into a compact hydrophobic core, as is required for protein folding, than there are nucleotide sequences that can form some stable internal base pairing pattern [80, 87, 107, 729]. The consequence of this difference is that protein space is more sparsely populated with folded proteins. Some authors suggest that proteins with different structures are more isolated from one another in protein space, and that genotype networks of different protein structures do not lie close to one another as they do for RNA [80, 107]. From this perspective, folding proteins thus occupy distinct, localized, and isolated islands in sequence space.

This perspective, however, cannot be the whole truth. The reason is that proteins with the same function, structure, and common ancestry often have unrecognizably diverse sequences, as I mentioned earlier. Where common ancestry of diverse proteins is unclear, sequence intermediates that demonstrate such common ancestry often exist [594]. This suggests that genotype networks of proteins still span sequence space or nearly so, just as they do for RNA [687]. None of these differences between protein and RNA argue against the importance of genotype networks for evolutionary innovations in proteins. Many innovations, for example, occur without transforming the scaffold provided by a given structure [239, 371, 849]. However, we still have much to learn about the organization of protein phenotypes in sequence space.
Experimental evidence on genotype networks and their role in innovation

I have opened this chapter discussing empirical evidence for the extreme diversity of proteins with the same function or structure. I will close it with a focus on relevant experimental data. Such data are currently available only for the evolution of novel RNA phenotypes. I will first discuss an experiment that demonstrates the importance of genotype networks in the creation of new molecular functions [684]. The subjects of this experiment are one natural and one synthetic ribozyme, that is, an RNA molecule that can catalyze chemical reactions. The natural ribozyme (Figure 4.10a, right) is encoded by the hepatitis delta virus, a human pathogen with a single-stranded RNA genome. This ribozyme catalyzes its own cleavage, a reaction that is necessary to complete the viral life cycle. The synthetic RNA is the class III self-ligating ribozyme (Figure 4.10a, left), which joins an oligonucleotide substrate to its own 5’ end, and was isolated in the laboratory from a pool of random RNAs. The sequences of these two ribozymes have no more than the 25 percent sequence identity expected by chance alone, and no structural similarities that might favor the nearness of their respective genotype networks [684]. Nevertheless, Schultes and Bartel [684] were able to design an RNA molecule that simultaneously has both catalytic activities, that of the self-cleaving ribozyme and that of the ligase. This sequence is more than 40 mutational steps away from both the prototype ligase and the prototype self-cleaving ribozyme. Its activity is substantially lower than that of the prototype ribozymes (Figure 4.10b), but still 70 times higher than that of uncatalyzed RNA cleavage, and 460 times higher than that of the uncatalyzed ligase reaction. Importantly, this hybrid sequence can be linked via a series of point mutations to both prototype ribozymes, without reducing its activity. Two point mutations in the direction of the ligase restore near wild-type levels of ligase activity, and two point mutations in the other direction restore near wild-type levels of the self-cleavage activity. The remaining, approximately 40 point mutations in either direction keep the catalytic activity close to the level of the prototype ligase and the self-cleaving ribozyme (Figure 4.10b).

By constructing a hybrid ribozyme and a path through sequence space back to its ancestors, this work makes two key points. First, many changes in a genotype are possible that do not affect an RNA’s (catalytic) phenotype. Second, these changes can be very important intermediate steps in creating a new catalytic function. Similar principles have been suggested for other ribozymes [50]. A third point of this experiment is that RNAs with different functions can be near each other in sequence space.
This notion is independently supported by a different study that started from an RNA molecule whose phenotype was the ability to bind ATP with high specificity [338]. The study’s authors changed this molecule through random mutation followed by selection for a new phenotype, namely the ability to bind a different molecule related to GTP. The experiment revealed several RNA molecules with this ability. These RNA molecules had different structures, but they were close to
[Figure 4.10, panels (a) and (b). Panel (a): secondary structure diagrams of the class III self-ligating ribozyme and the hepatitis delta virus ribozyme. Panel (b): relative ligation rate and relative cleavage rate (vertical axes, logarithmic, from 10 down to $10^{-7}$) against the distance in mutational steps from the intersection sequence (horizontal axis, 0 to 40 in each direction), with the sequences LIG1, LIG2, HDV1, HDV2, and Intersection labeled.]
Figure 4.10 Mutational paths on genotype networks lead to new ribozyme functions. (a) The two starting ribozymes discussed in the text. (b) A series of mutations that do not destroy catalytic activity connect a hybrid ribozyme with the two starting ribozymes. The horizontal axis shows the distance (in single nucleotide changes) of the hybrid ribozyme (positioned at distance 0) from its ancestors, the ligase (towards the left) and the hepatitis delta virus ribozyme (towards the right). The vertical axis shows the reaction rate of each ribozyme (gray = ligation, black = self-cleavage) as a fraction of the rate realized by the respective ancestor. The relative rate for the uncatalyzed ligation reaction is indicated by the short-dashed line (ligation with formation of a 2’-5’ linkage) and the dotted line (ligation with formation of a 3’-5’ linkage). The rate of the uncatalyzed cleavage reaction is indicated by the long-dashed line. From figure 3A in [684], used with permission from AAAS.
the original ATP-binding molecule in sequence space. A final experimental example provides a hint that the neighborhoods of different but similar genotypes may contain very different novel phenotypes. The phenotype at issue is a ribozyme with a new catalytic activity. The experiment started from a ribozyme capable of modifying its own 3’ end by adding an adenylated phenylalanine [157]. This reaction is an aminoacylation, which is important to load amino acids onto transfer RNA molecules for protein synthesis. The goal of the experiment was to completely change the enzymatic activity of the starting ribozyme into that of a kinase that adds a thiophosphate group to its own 5’ end. Aminoacylation and kinase reactions were chosen for the experiment, because they are both biochemically important yet very different reactions [157]. By mutagenizing the starting ribozyme and selecting for the new biochemical activity, the study’s authors readily found 23 different kinases whose structures were different from the parent’s structure. The probability of finding such ribozymes increased with the distance from the starting ribozyme. That is, ribozymes with the novel activity were less likely to be found very close to the starting ribozyme, and more likely at moderate distances (10–15 mutations out of 90 nucleotides) from the parent.
Summary

A growing body of evidence points to the existence of genotype networks in both RNA and protein phenotypes, and to their importance for evolutionary innovation. This evidence comes from comparative analysis of protein and RNA genotypes and phenotypes, from laboratory evolution experiments, and from computational analysis of RNA and model protein structures. This evidence shows that many genotypes can form the same phenotype, even for molecules of modest size. Some phenotypes have a much larger set of associated genotypes (the genotype set) than others. Only phenotypes with a large genotype set are of practical importance, because only they could be found in a vast genotype space through an evolutionary search driven by random mutations. The vast majority of genotypes in a genotype set fall into one or a few genotype networks of astronomical size. These networks often nearly span genotype space. This means that two molecules with the same phenotype may share little or no sequence similarity. A series of mutational changes on a genotype network can preserve a phenotype while exploring an ever-changing spectrum of new phenotypes. Laboratory evolution experiments show that these properties facilitate the evolution of new function by allowing exploration of new phenotypes while leaving an existing phenotype unchanged. All these principles are most easily explained if one focuses on native structures or phenotypes, to the neglect of the continuous unfolding and refolding of individual conformations caused by thermal noise and other environmental influences. I will discuss such phenotypic plasticity and its role for innovation in Chapter 13.
CHAPTER 5
The origins of evolutionary innovation
In the preceding chapters, I examined three very different classes of biological systems. These are large, metabolic networks, biological circuits that regulate gene activity, as well as protein and RNA molecules. Most evolutionary innovations arise through changes in these systems. Any theory of innovation thus needs to apply to systems as different as these. At first sight, this may seem impossible, precisely because these systems are so different; but on a deeper level, they also share important similarities. These similarities can help us understand the ability of living things to innovate. Here I will summarize key material from the previous chapters, highlight these similarities, and point out how they affect innovability. You can find most relevant literature references in the previous chapters.
Genotypes and phenotypes

When discussing biological macromolecules, I focused on protein and RNA molecules, because they perform most catalytic, transport, support, regulation, and communication functions in a cell. An RNA molecule’s genotype is a sequence of ribonucleotides (Figure 5.1) or, equivalently, the DNA sequence encoding this RNA. Proteins are also encoded by RNA or DNA sequences; their genotype is the encoding DNA sequence. However, for my purpose, an amino acid based representation of protein genotype is more economical. With this representation, we are spared having to conceptually translate the encoding nucleotide sequence into an amino acid sequence for every single protein genotype. Such translation becomes especially tedious when studying change in protein genotypes, because the genetic code’s redundancy causes many nucleotide changes to have no effect on the amino acid sequence and thus on the protein [100]. Moreover, most of the complexity in forming protein phenotypes does not lie in the steps leading from DNA to amino acid sequences, but in the spatial folding of amino acid sequences into secondary and tertiary structures.

Both RNA and protein phenotypes have two key aspects. One is the arrangement of a sequence in two- and three-dimensional space. The second aspect is a molecule’s biological function, be it the chemical reaction it catalyzes, the structural support it provides, or any other process it is a part of. Because structure is usually a prerequisite for function, it is a worthy subject of study in and by itself.

The second class of system I explored was regulatory circuits. The DNA sequences that encode all parts of a circuit and the interactions between these parts comprise a circuit’s genotype. These interactions may be encoded in the parts themselves, such as for interactions between proteins; or they may involve DNA that does not encode proteins, such as the short DNA motifs that bind transcriptional regulators. It would be cumbersome to understand a circuit directly from its encoding DNA sequence, even more so than for proteins. The problem is analogous to that faced by a computer novice learning how a computer program works. She can wade through the binary numbers that are the program’s instructions to the hardware; or she can study the program in its higher-level programming language. The latter would be much more effective. Both representations are correct, but one of them is better for the purpose at hand. Similarly, if we want to understand what regulatory circuits do, it is best to represent their genotypes on a higher level than that of a DNA string. The representation I chose earlier encodes the regulatory interactions of circuit parts through numerical parameters that indicate the strengths of these interactions. After all, it is these interactions that
Metabolism. Genotype: DNA encoding enzyme-catalyzed metabolic reactions. Phenotype: ability to synthesize biomass molecules from a given set of nutrients.
Regulation. Genotype: DNA encoding regulatory interactions among molecules. Phenotype: gene expression pattern, concentration or activity of regulatory molecules.
Molecules: Protein. Genotype: amino acid sequence. Phenotype: protein fold or biochemical activity.
Molecules: RNA. Genotype: nucleotide sequence. Phenotype: RNA fold or biochemical activity.
Figure 5.1 An overview of analogous notions of genotype and phenotype in metabolic networks, regulatory circuits, proteins, and RNA.
determine what a circuit does. For a regulatory circuit with S regulatory molecules, there are $S^2$ possible pairwise interactions, and the strengths of these interactions (many of which may be absent and thus have zero strength) represent a circuit’s genotype. I focused on a particularly important class of circuits, namely transcriptional regulation circuits. Here, the regulatory molecules are transcriptional regulators encoded by circuit genes. Their pairwise regulatory interactions involve a regulator’s binding of DNA near a circuit gene, and the activation or repression of that gene. The phenotype of such a circuit is a pattern of gene expression that the circuit’s regulatory interactions produce. (In Chapter 14, I will discuss other kinds of circuits.) On the next level of organization, I explored genome-scale metabolic networks. The main task of a metabolic network is to synthesize all of a
cell’s molecular building blocks, including amino acids, nucleotides, sugars, and lipids. I refer to these building blocks as a cell’s biomass components or biomass precursors. An organism’s metabolic genotype encodes the metabolic enzymes that catalyze all chemical reactions in its metabolic network. The most effective genotype representation for my purpose reflects which reactions are present in a given metabolic network (Figure 2.1, Figure 5.1). These reactions form part of a much larger “universe” of possible enzyme-catalyzed reactions. The phenotype of a metabolic network is the ability to synthesize all biomass components in one or more chemical environments. Because of the central role carbon plays in life, I here focused on chemical environments differing in their carbon sources; these carbon sources can also serve as energy
sources. Most of what I say would also hold for sources of other elements [653]. I classified metabolic phenotypes according to the carbon/energy sources that a network can use as the only source to synthesize all biomass compounds in an otherwise minimal chemical environment. In this context, a metabolic phenotype can be most simply represented as a binary string, each of whose entries corresponds to one of the carbon/energy sources that an organism can import from the environment. If an organism can synthesize all biomass compounds from carbon source i, then this string will contain a one at position i, and otherwise a zero. I emphasize that these and other discrete representations of genotypes and phenotypes are abstractions that serve to develop important concepts more clearly. Further below, I will revisit their merits and limitations. I will now summarize important commonalities of all but the smallest metabolic networks, regulatory circuits, and molecules.
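The binary-string representation of a metabolic phenotype described above can be illustrated with a minimal sketch. It assumes that viability on each sole carbon source has already been determined by some other analysis (not shown). The function name and the list of carbon sources are hypothetical examples.

```python
def phenotype_string(carbon_sources, viable_sources):
    """Encode a metabolic phenotype as a binary string.

    carbon_sources: ordered list of all carbon/energy sources considered.
    viable_sources: the sources from which the network can synthesize
    all biomass compounds when each is the sole carbon source.
    Position i holds '1' if source i is viable, and '0' otherwise.
    """
    viable = set(viable_sources)
    return "".join("1" if source in viable else "0" for source in carbon_sources)


# Hypothetical example: four carbon sources, two of which sustain growth.
sources = ["glucose", "acetate", "glycerol", "succinate"]
print(phenotype_string(sources, {"glucose", "glycerol"}))  # prints 1010
```

With C carbon sources, this encoding admits at most $2^C$ distinct phenotypes.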
Many more genotypes than phenotypes

The above three systems share a simple feature with far-reaching consequences: they have many more genotypes than phenotypes. In metabolic networks, the number $2^S$ of possible genotypes is determined by the size S of the “universe” of enzyme-catalyzed reactions. The known universe currently comprises more than $S = 5 \times 10^3$ such reactions, and may well be much larger [571]. The number of possible metabolic network genotypes is thus also very large. To see that there are fewer metabolic phenotypes than genotypes, recall (Chapter 2) that there are many possible pathways of synthesizing all biomass components from a given sole carbon source. Each of these pathways corresponds to one genotype. A large body of evidence shows that even central functions of metabolism cannot be executed in just one optimal way, but by different pathways of metabolic reactions. Furthermore, in any given environment, many reactions in a metabolic network can be eliminated, thus changing a genotype without necessarily changing a metabolic phenotype [64, 208, 248, 309, 690, 730, 839]. Taken together, these observations demonstrate an excess of genotypes over phenotypes. Because they hold for any one carbon source
(and for other elemental sources), they also apply to any combination of carbon sources. In addition, empirical data on metabolic network composition of a broad spectrum of organisms suggest that they can utilize only a small fraction C of all possible carbon sources, rendering the total number of phenotypes ($2^C$) vastly smaller than the number of genotypes [202].

In a transcriptional regulation circuit of size S, that is, with S genes, there are $S^2$ possible pairwise regulatory interactions (and many more higher order interactions). Even if we restrict ourselves to the simple abstraction of admitting only three kinds of interactions (activating, repressing, or absent), there are $3^{S^2}$ genotypes such a circuit can have. In contrast, the total number of phenotypes (gene expression states) such a circuit could have in a single cell would be of the order of $2^S$, if genes are counted as being either on or off. These numbers of genotypes and phenotypes would rise dramatically if we admitted finer gradations of interaction strengths and gene expression levels. However, because the number of possible interactions grows with the square of the number of genes, whereas the number of genes whose expression state must be specified grows only linearly, the number of possible regulatory genotypes would generally be exponentially larger than the number of possible phenotypes. This would hold for any kind of regulatory system: the number of phenotypes scales with the number of molecules, whereas the number of regulatory genotypes scales with the much larger number of possible interactions among molecules.

Let us now turn to protein molecules. Proteins with S amino acids have $20^S$ possible genotypes. Even for short genotypes of 100 amino acids, this number ($\approx 10^{130}$) may be many orders of magnitude larger than the number of hydrogen atoms in the universe.
To see that there are fewer phenotypes than genotypes, let us first focus on the protein fold (tertiary structure) aspect of phenotype. Here, available structural evidence shows that there are of the order of $10^4$ protein folds (Section 5.2). Protein folding models, where genotypes and phenotypes can be exhaustively enumerated, also show many fewer phenotypes than genotypes. The number of protein function phenotypes is different from that of structure phenotypes. The reason is that structure and function do not show a one-to-one relationship. For example, the catalytic sites of enzymes are formed by a precise local juxtaposition of few amino acids relative to the total size of a protein. They can be thought of as local “decorations” of a global fold. Any one fold may harbor different such decorations. Conversely, different folds may have the same enzymatic activity. Our understanding of the universe of protein function phenotypes is still limited, but some order-of-magnitude statements can be made about its size. A comprehensive effort to classify functions of proteins and other molecules is the widely used “gene ontology” classification system [32]. It currently recognizes $8 \times 10^3$ different molecular functions. Another is a classification of enzymatic functions that is long-established and supported by more than a century of biochemical research, and thus perhaps better founded than that of gene ontology [133]. The currently most comprehensive metabolic reaction database lists fewer than $10^4$ enzymatic functions [571]. These estimates should be taken with a grain of salt because classifying protein functions, let alone counting them, is difficult. Also, these numbers may not include many as yet undiscovered functions. However, even if the universe of protein or enzyme function were a million times as large as our current knowledge indicates, the total number of functions ($10^{10}$) would still be paltry compared to the total number of protein genotypes.

The second class of functionally important macromolecules, RNA, also has an astronomical number of $4^S$ genotypes for molecules comprising S nucleotides.
Although little is known about the number of RNA tertiary structures, the number of possible secondary structure phenotypes scales approximately as 1.8S (Chapter 4) This means that as the length of an RNA molecule increases, there are exponentially more RNA genotypes than phenotypes (approximately (4/1.8)S) Many fewer catalytic functions are known for RNA than for proteins, perhaps because RNA molecules have fewer building blocks (four instead of twenty), and are thus more restricted in the spectrum of molecular shapes required for catalysis. In sum, in disparate classes of biological systems, there are more genotypes than phenotypes. Where sufficient information exists to enumerate these
phenotypes, there are exponentially more genotypes than phenotypes, as a function of the number S of system parts. This means that any one phenotype typically has many genotypes that form it. Many of the more complex phenomena below rest on this deceptively simple fact.
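The excess of genotypes over phenotypes can be made concrete with a small numerical sketch. The Python snippet below is only an illustration that plugs in the figures quoted in the text (4 nucleotides, roughly $1.8^{S}$ secondary structures, 20 amino acids, at most about $10^{10}$ protein functions). The function names are hypothetical, and logarithms are used so that the astronomically large counts never have to be written out.

```python
import math

def log10_count(base: float, size: int) -> float:
    """Return log10(base**size) without building the huge number itself."""
    return size * math.log10(base)

def rna_excess_log10(size: int) -> float:
    """log10 of the approximate genotype-to-phenotype ratio (4/1.8)**size."""
    return log10_count(4, size) - log10_count(1.8, size)

def protein_excess_log10(size: int, log10_functions: float = 10.0) -> float:
    """log10 of 20**size divided by an assumed upper bound on functions."""
    return log10_count(20, size) - log10_functions

for s in (20, 50, 100):
    print(
        f"S={s}: RNA genotypes per structure ~ 10^{rna_excess_log10(s):.1f}, "
        f"protein genotypes per function ~ 10^{protein_excess_log10(s):.1f}"
    )
```

Even for short molecules, the exponent of the ratio grows linearly with S, which is the sense in which genotypes outnumber phenotypes exponentially.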
Genotype networks
An important mode of evolutionary change in metabolic networks is the elimination and addition of chemical reactions, for example, through horizontal gene transfer. One can ask how many such additions and deletions a network can sustain without changing its metabolic phenotype. The answer is best expressed in terms of the maximal genotypic distance D (Figure 5.2) of two metabolic networks with the same phenotype. For networks whose size is typical of that of free-living organisms ($\approx 10^{3}$ reactions), this distance is greater than D = 0.75. This means that two networks can share fewer than 25 percent of their reactions, while having the same metabolic phenotype. This great divergence is not sensitive to the specific metabolic phenotype considered (Chapter 2). By generating many such diverse networks, one finds that metabolic networks of the same phenotype form vast genotype networks that extend far through genotype space. Regulatory circuits, in turn, evolve through mutations that create and destroy regulatory interactions between circuit molecules. For example, small genetic changes can alter a transcriptional regulator’s binding sites on DNA. An analysis similar to that of metabolic genotype space shows that the genotype distance of two circuits with the same gene expression phenotype can be as high as D = 1. At this maximal genotype distance, two circuits have no regulatory interactions in common (Figure 5.2). Two otherwise random circuits with the same phenotype typically share only of the order of 20 percent of regulatory interactions (D = 0.8). Moreover the vast majority or all circuits with the same phenotype form one gigantic genotype network that traverses genotype space completely or nearly so (Chapter 3). Protein and RNA molecules undergo evolutionary change in their individual nucleotide or amino acid building blocks. A combination of empirical evidence and computational studies shows that
Figure 5.2 An overview of the analogous concepts of neighbors and genotype distance used in describing metabolic networks, regulatory circuits, proteins, and RNA.
Metabolism. Neighbors: networks differing in one reaction. Genotype distance D: fraction of reactions not shared by two networks.
Regulation. Neighbors: circuits differing in one regulatory interaction. Genotype distance D: fraction of regulatory interactions not shared by two networks.
Molecules: Protein. Neighbors: proteins differing in one amino acid. Genotype distance D: fraction of amino acids not identical in two proteins.
Molecules: RNA. Neighbors: RNA molecules differing in one nucleotide. Genotype distance D: fraction of nucleotides not identical in two RNA molecules.
protein and RNA molecules with the same phenotype (whether defined structurally or functionally) are often not recognizably similar. Their maximal genotype distance D (Figure 5.2) is typically close to one. RNA molecules with the same structure phenotypes may even have no nucleotides in common (D = 1). As in the other study systems, the genotypes forming a typical phenotype are connected in one or few vast genotype networks. There may be many differences between metabolic, regulatory, and molecular phenotypes, as well as within each class of system. For example, their genotype networks may differ in how far they extend through genotype space, or in whether the genotypes of typical phenotypes form one or more genotype networks. The depth of our knowledge also varies for these system classes, which affects our ability to compare them. For example, discovering new reactions in the biochemical reaction universe can only increase the already large flexibility in metabolic network organization we observe. Such new knowledge can affect how far metabolic genotype networks appear to reach through genotype space, and whether they differ from regulatory genotype networks in this regard. Similarly, we have more information about the connectivity of genotype networks for molecules than for metabolisms, simply because we have studied molecules much longer than large metabolic networks. For my purpose, these and many other differences (whether real or caused by the gaping holes in our knowledge) are differences in details. They do not affect the commonality most important for evolutionary innovation: that genotypes with the same phenotype form genotype networks that spread far and wide through genotype space.
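The genotype distance D summarized in Figure 5.2 is simple enough to sketch in code. The Python functions below are a minimal illustration, not the procedure used in the studies cited. They assume that networks and circuits are represented as sets of reaction or interaction labels, that the unshared fraction is normalized by the larger of the two sets, and that molecules are equal-length strings. The function names and example labels are hypothetical.

```python
def set_distance(a: set, b: set) -> float:
    """Fraction of parts (reactions or interactions) not shared by two systems.

    Assumption: the shared fraction is taken relative to the larger set.
    """
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return 1.0 - shared / max(len(a), len(b))

def sequence_distance(x: str, y: str) -> float:
    """Fraction of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    if not x:
        return 0.0
    return sum(p != q for p, q in zip(x, y)) / len(x)

# Two hypothetical four-reaction networks sharing half of their reactions.
print(set_distance({"r1", "r2", "r3", "r4"}, {"r3", "r4", "r5", "r6"}))
# Two short RNA sequences differing at one of four positions.
print(sequence_distance("ACGU", "ACGA"))
```

With this definition, D = 1 means that two genotypes have no parts in common, and two genotypes are neighbors when they differ in exactly one part.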
Common features, different mechanisms
An approach that asks how genotypes with the same phenotype are organized in genotype space creates a global, statistical perspective on this space. A complementary, mechanistic perspective would ask why the same phenotype can be built in so many different ways; conversely, why must some parts of a genotype not change? The question arises because two genotypes highly or maximally different from each other (D = 1) may not be able to form the same phenotype. In contrast to the first, statistical question, the answer to the second, mechanistic question depends strongly on the study system. For example, the flexibility of metabolic organization comes from the multiple ways in which biomass components can be synthesized. This flexibility is limited, because some reactions or pathways do not admit alternatives, as dictated by principles of organic chemistry. For proteins, the formation of a specific structure phenotype typically requires a conserved core of hydrophobic amino acids in the center of this structure. These interactions provide the glue for a protein’s spatial structure [87, 186]. The need to have such a hydrophobic core can limit the extent to which protein genotypes can vary. Least well understood are the reasons why some regulatory circuits with a given phenotype may need to preserve a small fraction of regulatory interactions, a phenomenon that occurs in the biological evolution of regulatory circuitry [165]. One candidate explanation is that the preserved interactions provide resistance of an expression phenotype to gene expression noise, but our knowledge in this area is very limited [561]. Despite such limitations, it is clear that limited genotypic conservation has different mechanistic causes in different systems. These differences make the commonality of connected genotype networks even more remarkable.
Neutral neighbors, robustness, and continuity properties
A system’s ability to preserve its phenotype while exploring genotype space requires that not all small changes in genotype cause large changes in phenotype. This is the case for genotypic changes in proteins, regulatory circuits, and
metabolism. Even more, many such changes have no effect on phenotype. This means that genotypes typically have a substantial fraction of neighbors in genotype space with the same phenotype. For brevity, I will for now refer to such neighbors as neutral neighbors. A biological system’s robustness to change is its ability to preserve phenotype in the face of change. Thus, a genotype with many neutral neighbors is to some extent robust to genetic change in individual system parts. As I will explore in more detail in Chapter 6, such robustness is a prerequisite for the existence of genotype networks. This requirement for robustness also hints at the positive role robustness can play in evolutionary innovation. I have explored this role in an earlier book [825], and will discuss it in more detail in Chapter 8. It is tempting to cast this property of genotypes in the mathematical language of continuous functions [435]. In a continuous function, a small change in the function’s argument causes a small or no change in the function’s value. The function F at issue is one that takes a genotype G as an argument and produces a phenotype from it (P = F(G)). Such a function F is often called a genotype-phenotype map [11]. Thus, on the surface, the genotype-phenotype maps I studied have properties akin to continuity. However, the analogy is limited: when applying strict mathematical definitions of continuity, all functions in a discrete space are continuous [689, p.131]. I note in passing that many man-made, technological systems differ in this respect from biological systems: if you change a part, the function of the whole often changes dramatically, and usually for the worse. This difference may be why most technological systems cannot readily innovate through random change [636]. However, this limitation can be overcome with the right kind of technology (Chapter 15).
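The notion of neutral neighbors can be illustrated with a deliberately artificial genotype-phenotype map. In the Python sketch below, a genotype is a binary string and its phenotype is the majority symbol of each block of three bits. This toy map is an assumption made purely for illustration and does not correspond to any of the biological systems discussed; all function names are hypothetical. The fraction of neutral neighbors serves as a simple measure of a genotype's robustness.

```python
from typing import Iterator

def phenotype(genotype: str, block: int = 3) -> str:
    """Toy map P = F(G): majority vote within consecutive blocks of bits."""
    chunks = [genotype[i:i + block] for i in range(0, len(genotype), block)]
    return "".join("1" if c.count("1") * 2 > len(c) else "0" for c in chunks)

def neighbors(genotype: str, alphabet: str = "01") -> Iterator[str]:
    """Yield all genotypes that differ from `genotype` in exactly one position."""
    for i, symbol in enumerate(genotype):
        for alt in alphabet:
            if alt != symbol:
                yield genotype[:i] + alt + genotype[i + 1:]

def neutral_fraction(genotype: str) -> float:
    """Fraction of 1-mutant neighbors that keep the phenotype unchanged."""
    target = phenotype(genotype)
    nbrs = list(neighbors(genotype))
    return sum(phenotype(n) == target for n in nbrs) / len(nbrs)

# Two genotypes with the same toy phenotype but different robustness.
for g in ("000111000", "001011000"):
    print(g, phenotype(g), round(neutral_fraction(g), 3))
```

Both example genotypes share one phenotype, yet the first tolerates every single mutation while the second tolerates only some of them, which mirrors the point that robustness can vary among genotypes with the same phenotype.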
Some consequences of genotype space and genotype network size
Genotype spaces are vast. Even the genotype networks of individual phenotypes we discussed are astronomical in size. The large numbers one encounters in studying genotype spaces are best illustrated with examples from molecules, because approaches to estimate
these numbers are most advanced here [294, 378, 389, 639, 761]. Consider the guide RNA of the protozoan parasite Leishmania tarentolae shown in Figure 5.3. Guide RNAs are important in RNA editing, a process that changes the nucleotide sequence of already transcribed RNA molecules [710]. There are approximately $5 \times 10^{22}$ sequences forming the minimum free-energy structure of the RNA molecule in Figure 5.3 [378]. Turning to proteins, consider again the bacteriophage λ repressor from Chapter 4, where a protein’s structure and function is achieved by an estimated $5 \times 10^{56}$ sequences [639]. Numbers as large as these are not exceptions to a rule. Many other proteins and RNAs of the same size have comparable or even larger genotype networks. Their enormity can perhaps be better appreciated if we consider that the number of stars in our galaxy ($10^{11}$ to $10^{12}$) is puny compared to them. We currently can estimate genotype network sizes less well for regulatory circuits and metabolic networks, but they can be just as large (Figure 3.3 and ref. [670]). This results simply from the excess of genotypes over phenotypes.
Our intuition easily fails us when thinking about very large numbers. Here are a few consequences of the vast numbers of genotypes with the same phenotype and of the genotype networks they form. First, because genotype space is vast, phenotypes with small genotype networks have limited biological relevance, because a (blind) evolutionary search is very unlikely to find them. Recall the example of RNA (Chapter 4), where the vast majority of genotype space is filled with phenotypes that have large genotype networks. These phenotypes are “typical” in the sense that genotypes chosen at random from genotype space are highly likely to form one of them. With increasing length of a molecule, the likelihood to find atypical phenotypes decreases rapidly. In this regard, it is also relevant that the genotype networks of RNA phenotypes with known biological functions are somewhat larger than those of RNA molecules chosen at random from genotype space [378]. The perspective I propose here provides a ready explanation. If a given biological function can be carried out by two phenotypes, one with a small genotype network, and the other with a larger genotype network, then a blind search through genotype space is more likely to discover the larger genotype network.
Figure 5.3 The number of genotypes in a genotype network may be astronomical. The left panel shows the secondary structure of a guide RNA from the protozoan parasite Leishmania tarentolae. We estimated the number of genotypes with this secondary structure using a replica exchange Monte Carlo algorithm [378]. The protein structure in the right panel is the 92 amino acid long N-terminal part of the λ repressor, a transcriptional repressor of bacteriophage λ. It is displayed using atomic coordinates in Protein Databank File 3BDN [727]. The number of genotypes displayed was estimated from a large-scale mutagenesis experiment with this protein [639].
Left panel (guide RNA): number of genotypes $5 \times 10^{22}$; fraction of genotype space $4 \times 10^{-8}$.
Right panel (λ repressor): number of genotypes $5 \times 10^{56}$; fraction of genotype space $4 \times 10^{-63}$.
Second, a genotype network can be astronomically large, yet it may occupy a tiny fraction of an even larger genotype space. For example, the astronomical number of RNA sequences forming the structure of Figure 5.3 constitutes only a tiny fraction $4 \times 10^{-8}$ ($\approx 5 \times 10^{22}/4^{50}$) of this space. The even greater number of protein sequences folding into the λ repressor occupies a fraction $10^{-63}$ of their genotype space. A biologically important phenotype may have a vast number of associated genotypes, yet still be rare in genotype space. Third, a biological system that serves multiple functions does not necessarily have a small genotype network. Many enzymes can have more than one enzymatic activity, or even some non-enzymatic functions [8, 367]. Regulatory circuits are exposed to different regulatory signals in different cells, and they respond by producing different cell-specific gene expression patterns that influence physiology and development. Metabolic networks may synthesize different compounds in different tissues or at different times. We can think of each such function as having an associated genotype network. Genotypes with multiple functions occur in the intersection of these networks. Because each genotype network is vast, the intersection may still be very large. Consider the model regulatory circuits I discussed in Chapter 3, where circuits with 6 genes and 3 regulatory interactions per gene have $8.6 \times 10^{13}$ possible genotypes. The average number of networks producing only one specific gene expression pattern is equal to $5.92 \times 10^{10}$. (The average is taken over different expression patterns.) Bifunctional networks, that is, networks that can produce two specific expression patterns, have fewer but still very many genotypes ($1.96 \times 10^{7}$) that can produce both expression patterns [491]. Fourth, because genotype networks are so vast, there is much room for heterogeneity inside them. This internal structure, however, is still poorly understood.
One exception to our general ignorance is that in some regions of a genotype network, genotypes have many more neutral neighbors than in other regions. This holds for all three classes of systems I discussed so far. It has evolutionary implications that I will discuss in more detail later (Chapter 8).
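The fractions of genotype space quoted for the guide RNA and the λ repressor can be checked with a few lines of arithmetic. The Python sketch below assumes a guide RNA of 50 nucleotides (the length implied by the $4^{50}$ in the text) and a 92 amino acid protein (from the caption of Figure 5.3). It also computes the share of the model circuit space occupied by monofunctional and bifunctional circuits from the counts given above. The function and variable names are hypothetical.

```python
import math

def log10_fraction(network_size: float, alphabet_size: int, length: int) -> float:
    """log10 of network_size / alphabet_size**length, computed via logarithms."""
    return math.log10(network_size) - length * math.log10(alphabet_size)

# Genotype network sizes as quoted in the text.
rna = log10_fraction(5e22, alphabet_size=4, length=50)
protein = log10_fraction(5e56, alphabet_size=20, length=92)
print(f"guide RNA:        about 10^{rna:.1f} of genotype space")
print(f"lambda repressor: about 10^{protein:.1f} of genotype space")

# Model regulatory circuits with 6 genes (counts quoted from Chapter 3).
total_circuits = 8.6e13
one_pattern = 5.92e10 / total_circuits   # circuits producing one given pattern
two_patterns = 1.96e7 / total_circuits   # circuits producing two given patterns
print(f"one pattern: {one_pattern:.1e}, two patterns: {two_patterns:.1e}")
```

The results agree in order of magnitude with the fractions stated in the text, and they show how a set of genotypes can be astronomically large while still being rare in its genotype space.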
In closing this section, I restate the perhaps most important consequence of the vast size of genotype space. Because this veritable universe can hold an enormous number of different phenotypes, it provides the ideal starting point to develop an account of innovation that is systematic instead of anecdotal; because it has room for myriad phenotypes, it is sufficiently rich to encapsulate the enormous diversity of molecular innovations in the history of life. It does not just reduce their diversity and complexity to a simple caricature, as other contemporary models do. It qualifies as a solid foundation for a theory of innovation.
Neighborhoods and their phenotypic diversity
The existence of vast genotype networks ensures that a genotype can change substantially without changing its phenotype. This feature, however, is not sufficient to explain innovability. To produce evolutionary innovations, biological systems must explore many different phenotypic variants before finding one that may become an innovation. Such phenotypic variants are produced by mutations. Among all possible variants, those accessible from any one genotype via a single mutation are the most important, because they are most easily reached. For the systems I focus on, such variants differ in a single amino acid, an RNA nucleotide, a regulatory interaction, or an enzymatic reaction. Together, all of a genotype’s single mutants comprise a genotype’s 1(-mutant) neighborhood. Thus far, the language I used insinuated that individual systems change and explore different phenotypes. That is only part of the truth, because all evolution takes place in populations. Although most of the principles I discussed apply to both populations and their members, large populations have the following advantage in exploring new phenotypes. If a population is sufficiently large or if mutation rates are sufficiently high, then the mutations occurring every generation produce not only single mutant variants but also variants carrying multiple mutations [845]. A case in point is viruses, such as HIV, with small genomes, high mutation rates, and enormous population sizes. In the human body, a single round of replication of a viral population would suffice to produce all single mutants of a viral genome, as well as many double and triple
mutants [488, 608, 629]. In such populations, not only members of a genotype’s 1-neighborhood, but many members of its 2- and 3-neighborhoods are accessible. The genotypic neighborhoods of metabolic networks, regulatory circuits, and molecules share two important features, as shown schematically in Figure 5.4. First, different neighborhoods on the same genotype network contain very different novel phenotypes. Specifically, consider two such genotypes G1 and G2, and the fraction U of “unique” phenotypes accessible to only one of them. These are phenotypes that occur only in the neighborhood of one but not the other genotype. The fraction U increases with increasing genotype distance between G1 and G2, and it reaches a plateau at genotype distances much smaller than the diameter (the maximally possible genotype distance) of a genotype network (Figure 5.4a). At this plateau, a 1-neighborhood contains from 40 percent to more than 90 percent of unique phenotypes, depending on the system and the neighborhood considered. This percentage increases further for larger neighborhoods. The second observation (Figure 5.4b) regards individual genotypes or entire populations that evolve on a given genotype network through cycles of mutation and natural selection. Mutations allow individuals to explore the network and its surroundings in a random walk; selection preserves the population’s well-adapted phenotype and thus confines it to the genotype network. In consequence, a population spreads through genotype space like a cloud of genotype “particles” diffusing through a porous medium, the population’s genotype network. As mutations accumulate generation after generation (while leaving the phenotype unchanged), individuals and populations gain access to an ever-increasing number of novel phenotypes. These are phenotypes in their neighborhoods that were not contained in any neighborhood encountered in previous generations.
Because genotype networks are typically vast in size, and because the total number of possible phenotypes is very large, this number of novel phenotypes does not level off even when most system parts—nucleotides, amino acids, regulatory interactions, metabolic reactions—have
Figure 5.4 Different genotypic neighborhoods on the same genotype network contain very different phenotypes. Both panels are schematic illustrations of features common to metabolic networks, regulatory circuits, and molecular genotypes discussed in earlier chapters. (a) The fraction U of phenotypes that occur in the neighborhood of one but not the other genotype on a neutral network (as illustrated by the intersecting circuits in the inset), as a function of the genotype distance D between two genotypes. Dmax indicates the maximal genotype distance of two genotypes on the same genotype network, which is often close or equal to the diameter of genotype space. (b) The cumulative number of different phenotypes (vertical axis) in the neighborhood of a genotype that changes its composition gradually through mutations (horizontal axis), while preserving its phenotype. This cumulative number increases linearly or nearly so for a number of mutations that is much greater than the system size S (number of nucleotides, amino acids, regulatory genes, or metabolic enzymes), as indicated by the label S on the horizontal axis.
Panel (a) axes: fraction U of phenotypes unique to a neighborhood (vertical, up to 1) against genotype distance D (horizontal, up to Dmax).
Panel (b) axes: cumulative number of novel phenotypes in neighborhood (vertical) against number of mutations (horizontal, with S marked).
changed multiple times. Eventually, the number of novel phenotypes would have to level off, because genotype networks have a finite size.
However, a population of realistic size could hardly explore a large genotype network on realistic evolutionary time scales, and thus experience this exhaustion of phenotypic variation. Taken together, these two properties ensure that molecules, regulatory circuits, and metabolic networks meet an indispensable requirement to produce evolutionary innovations: the ability to access vast numbers of novel phenotypes while leaving their own phenotype unchanged.
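The second observation, a random walk that preserves the phenotype while accumulating novel phenotypes in its neighborhoods, can be mimicked in a small simulation. The Python sketch below is only a toy: it assumes that a crude Nussinov-style base-pair maximization with a minimum hairpin loop of three nucleotides can stand in for RNA secondary structure prediction, which is far simpler than the thermodynamic folding used in the studies cited. All names, the start sequence, and the parameters are hypothetical choices for illustration.

```python
import random
from functools import lru_cache

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
MIN_LOOP = 3  # assumed minimum number of unpaired bases inside a hairpin

@lru_cache(maxsize=None)
def fold(seq: str) -> str:
    """Toy phenotype: one maximum base-pairing structure in dot-bracket form."""
    n = len(seq)
    best = [[0] * n for _ in range(n)]
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case: position j stays unpaired
            for k in range(i, j - MIN_LOOP):  # case: k pairs with j
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:  # deterministic traceback of one optimal structure
        i, j = stack.pop()
        if j - i <= MIN_LOOP:
            continue
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - MIN_LOOP):
            if (seq[k], seq[j]) in PAIRS:
                left = best[i][k - 1] if k > i else 0
                if left + 1 + best[k + 1][j - 1] == best[i][j]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return "".join(structure)

def neighbors(seq: str):
    """Yield all sequences that differ from `seq` in exactly one nucleotide."""
    for i, base in enumerate(seq):
        for alt in "ACGU":
            if alt != base:
                yield seq[:i] + alt + seq[i + 1:]

def neutral_walk(start: str, steps: int, seed: int = 0):
    """Random walk on the genotype network of fold(start).

    Returns the final genotype and, per step, the cumulative number of
    distinct novel phenotypes seen in the 1-neighborhoods visited so far.
    """
    rng = random.Random(seed)
    target = fold(start)
    current, seen, cumulative = start, set(), []
    for _ in range(steps):
        nbrs = list(neighbors(current))
        seen.update(fold(n) for n in nbrs if fold(n) != target)
        cumulative.append(len(seen))
        neutral = [n for n in nbrs if fold(n) == target]
        if neutral:  # stay put if no neutral neighbor exists
            current = rng.choice(neutral)
    return current, cumulative

final, counts = neutral_walk("GGGGAAAACCCC", steps=30, seed=1)
print(fold(final), counts)
```

In this toy setting the phenotype of the walker never changes, while the cumulative count of novel phenotypes in its neighborhoods can only stay constant or grow. How quickly it grows here says nothing quantitative about real molecules.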
Two necessary prerequisites for innovability
Figure 5.5 summarizes and illustrates how these two features conspire to allow metabolic innovation. The rectangle in this figure stands for genotype space. The gray open circles correspond to genotypes with some common phenotype. Genotypes are connected by straight lines if they are neighbors. Symbols of different shapes and shading correspond to different phenotypes that occur as neighbors of some genotype on the gray genotype network. Each symbol stands for a different phenotype. Figure 5.5 illustrates a genotype network that spans genotype space and that is connected. By virtue of its connectedness, genotypes evolving on it can access many different novel phenotypes, while fulfilling the key requirement of not changing their own phenotype. Molecules, regulatory circuits, and metabolic networks may differ in many details, but they all share these organizational features. First, their phenotypes are typically organized into vast genotype networks that traverse a large fraction of genotype space. Second, different neighborhoods on these networks contain very different novel phenotypes. Figures 5.6a through 5.6c illustrate that these features are essential by exploring several counterfactual scenarios; that is, these scenarios are not typical for the system classes I examined. In the first scenario (Figure 5.6a), the number of genotypes that form the same phenotype is just as large as in Figure 5.5. These genotypes are also as widely distributed through sequence space. However, these genotypes are either isolated from one another or they form only small groups of connected genotypes. Their disconnectedness hinders access to new phenotypic variants, because evolving genotypes remain confined to small regions of this space. They can no longer explore large regions of this space through mutations that leave the phenotype unchanged. The second scenario (Figure 5.6b) shows a genotype network that is connected, but that does not span a large fraction of genotype space. Instead, it is localized in a smaller region of this space. Therefore, many novel phenotypes occurring elsewhere in genotype space remain inaccessible from it. The fundamental underlying reason is again the requirement to retain old phenotypes (and thus remain close to a genotype network) while exploring new phenotypes.
Figure 5.5 Connected genotype networks facilitate accessibility of diverse phenotypes. The figure schematically represents a set of genotypes (gray circles) in genotype space (rectangle) that share the same phenotype and form a genotype network; neighboring genotypes are connected by gray lines. Symbols of different shapes and shading indicate genotypes with different phenotypes. The figure illustrates that many different novel phenotypes can be accessed from a connected genotype network that spreads far through genotype space.
Figure 5.6 Three counterfactual scenarios for genotype network organization. Each panel indicates a set of genotypes (gray circles) in genotype space (rectangle) that share the same phenotype; neighboring genotypes are connected by gray lines. Symbols of different shapes and shading indicate genotypes with different phenotypes: (a) a disconnected genotype network, (b) a highly localized genotype network, and (c) a genotype network where different neighborhoods contain the same novel phenotypes. See text for details.
A final counterfactual scenario is shown in Figure 5.6c, where a sprawling and connected genotype network exists, but where the phenotypes in its neighborhoods are all the same. In this case, the network is irrelevant for evolutionary innovation, because regardless of where a genotype occurs on this network, and regardless of how far a population spreads through this network, it has access to the same novel phenotypes. Taken together, these images highlight that both the extension of neutral networks in genotype space, and the phenotypic diversity of their neighborhoods are essential for the exploration of many different phenotypes, which allows evolutionary innovation. Whatever else a theory of evolutionary innovation might include, these elements will be essential. Whether they are also sufficient for innovation is an intriguing open question. Before going on, I need to caution that Figure 5.5 and other low-dimensional representations are mere caricatures of genotype networks. Genotype spaces are very high-dimensional. They are closely related to hypercubes, n-dimensional analogues of three-dimensional cubes [641]. Such high-dimensional
spaces have many counterintuitive features. I will discuss them in more detail in Chapter 6. For now, I will merely highlight two features that make low-dimensional representations misleading. First, in our familiar three-dimensional space, we can move in three orthogonal directions, but in a genotype space we can move in as many “directions” as there are dimensions. In consequence, the immediate neighborhood of a genotype contains many different genotypes. For example, any protein genotype with S = 100 amino acids has 19 × 100 = 1900 immediate neighbors. A two-dimensional projection cannot capture such large neighborhoods well. Second, despite the enormous size of the corresponding genotype space, one can walk through this space in few steps; that is, in as many mutations as there are dimensions. Because each step can take multiple possible directions, the number of paths through this space is astronomical. Many paths lead to genotypes that are maximally different from the starting genotype, yet they are also maximally different from each other. Again, two-dimensional images represent these features poorly. They only serve as visual crutches to aid our understanding.
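The size of a genotype's neighborhoods follows from simple counting, which the following Python sketch illustrates. It assumes sequence genotypes of length S over an alphabet of A symbols, so that the number of genotypes exactly k mutations away is $\binom{S}{k}(A-1)^{k}$. The function names are hypothetical, and the example values are those used in the text (S = 100, with A = 20 for proteins and A = 4 for RNA).

```python
from math import comb

def shell_size(length: int, alphabet_size: int, k: int) -> int:
    """Number of genotypes that differ from a given one in exactly k positions."""
    return comb(length, k) * (alphabet_size - 1) ** k

def ball_size(length: int, alphabet_size: int, radius: int) -> int:
    """Number of genotypes within `radius` mutations, excluding the genotype itself."""
    return sum(shell_size(length, alphabet_size, k) for k in range(1, radius + 1))

# 1-, 2-, and 3-mutant neighborhoods of a protein and an RNA of length 100.
for name, alphabet_size in (("protein", 20), ("RNA", 4)):
    sizes = [shell_size(100, alphabet_size, k) for k in (1, 2, 3)]
    print(name, sizes, ball_size(100, alphabet_size, 3))
```

The first shell reproduces the 1900 immediate neighbors quoted for a protein of 100 amino acids, and the rapid growth of the second and third shells indicates why populations that can reach 2- and 3-neighborhoods have access to far more variants.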
Genotype networks are highly interwoven
One further general feature of molecules, regulatory circuits, and metabolic networks is worth highlighting: The genotype networks of any two typical phenotypes, P1 and P2, are close together in genotype space. More specifically, the minimal number of mutations necessary to go from a genotype with phenotype P1 to a genotype with phenotype P2 comprises only a small fraction of the diameter of genotype space. (The diameter is the maximal distance between two genotypes.) In other words, there is at least one point in genotype space where two genotype networks are close together. This holds for any two typical phenotypes. For an example, consider first the regulatory circuits of merely twenty genes I discussed earlier (Chapter 3). Here, the average minimum genotype distance between circuits with arbitrary different expression phenotypes is only D = 0.14. There are approximately $10^{128}$ circuits of S = 20 genes with an average of 5 regulatory interactions per gene. Only a tiny fraction $10^{-102}$ of them is contained in a neighborhood of D = 0.14 around any one circuit. Yet this tiny region of genotype space around a circuit (around any circuit) contains most expression phenotypes. The radius D and size of a neighborhood with this property would further decrease as circuit size increases. Genotype networks of larger circuits thus become increasingly interwoven. As a second example, consider RNA molecules of size S = 100 nucleotides. Here, a region of D = 0.15 around any one genotype contains with near certainty one sequence for any common structure. Such a region comprises only one $4.52 \times 10^{37}$th of genotype space (Chapter 4). Lastly, let us turn to metabolic networks. To reach any one metabolic phenotype from any other metabolic phenotype, one typically does not need to go further than a genotypic distance of D = 0.1, i.e., change 10 percent of a metabolic network’s reactions (Chapter 2).
For the genotype space of more than 5000 reactions I discussed earlier, a genotype neighborhood with this radius would contain much less than a fraction $10^{-500}$ of genotype space. As strikingly tiny as the fractions I just cited may seem, one should bear in mind that walking a hundred biochemical reactions, fourteen regulatory interactions, or fifteen nucleotides away from a genotype network is not a small feat. If old phenotypes must not be destroyed for evolutionary innovations to occur, then genotypes this far away are certainly not readily accessible. One might thus argue that this phenomenon, however striking, may be of limited importance for evolutionary innovation. I nonetheless mention it here, because it speaks to the diversity of phenotypes that we find in different neighborhoods of a genotype network (Figure 5.5). By studying it more closely, we may learn more about the causes for this diversity.
Minimal requirements for a theory of innovation
Chapter 1 listed several minimal requirements for a theory of innovation. I will briefly revisit them to show that the elements I discussed so far meet these requirements. The first requirement is to explain how the old can be preserved while the new is being explored. Extended genotype networks with diverse phenotypic neighborhoods allow precisely this kind of exploration. The second requirement is to unify different kinds of innovations. Because most innovations involve changes in three classes of systems that share organizational features of genotype spaces, this requirement is also met. The third requirement is to capture the combinatorial nature of innovation. The framework I use is also ideal in this regard, because it explicitly captures innovation as new combinations of system parts (elementary modules) that give rise to novel genotypes. Depending on the system class, these elementary units of organization are enzymes, regulatory interactions, amino acids, or nucleotides. In each system, small modules may form higher order modules, such as regulatory circuit motifs, or enzyme complexes. The role of such higher order modularity is much studied but still poorly understood [224, 384, 424, 681, 774, 813, 834]. The genotype space framework can help us study it systematically. The fourth requirement is to capture that the same problem can be solved through different innovations. This is important, because many innovations in the history of life occurred multiple times and in different ways [807]. The existence of extended genotype networks captures this feature
well. For example, it shows that the problem of synthesizing biomass from a single energy source can be solved by many and very different sets of chemical reactions, i.e., metabolic network genotypes. Any two such networks can be viewed as different solutions to the same problem. The same holds for two different regulatory circuits that produce the same molecular activity phenotype, or two different proteins with the same enzymatic function. However, genotype networks help us see much more than that: if a problem has a solution at all, it usually has astronomically many solutions. In addition, these solutions are connected in genotype space. Which of these solutions an organism discovers depends on its evolutionary history, that is, its past trajectory and location in genotype space. The fifth requirement regarded environmental change. I will discuss it in Chapter 11. The sixth and last requirement regarded applicability to technology; it is the focus of Chapter 15.
The merits and price of abstraction One of the most influential concepts in evolutionary biology is that of the adaptive landscape [259]. It is commonly visualized as a landscape of rolling hills or steep ravines. Its peaks represent trait combinations or genotypes with high fitness. This concept is an abstraction derived from an immensely complex reality. Such abstraction is necessary for any human understanding. Yet like any other abstraction, it also has limitations. One of them derives from the fact that genotype space is high-dimensional. For example, a single mountain peak in three dimensions can become a much stranger object in higher dimensions. Any phenotype (metabolic, regulatory, or molecular) that confers high fitness on its carrier could serve as an example. It is a peak in an adaptive landscape. But because many connected genotypes typically form any such phenotype, this peak is spread out through genotype space. In other words, a single peak of a three-dimensional fitness landscape becomes a connected, vast, and sprawling genotype network in a higher dimensional space. Just as low-dimensional representations of fitness landscapes have limitations, so has their refinement to many dimensions, and the concept of genotype networks.
The most elementary abstraction I made here is to consider discrete genotypes and phenotypes. It has obvious merits. First, it is well-suited to study the qualitatively different phenotypes that are important for innovation. Second, it also helps us understand the combinatorial nature of innovation. Third, it gives rise to clear concepts about proximity, neighborhoods, robustness, the spreading of genotype networks, and unique phenotypes in a neighborhood. But this abstraction also has limitations. For example, one might argue that many systems can have an infinite continuum of phenotypes. Examples include the ever-changing conformations of proteins and RNA molecules inside a cell. Although continuous systems and their organization in genotype space are poorly studied, Chapter 14 will hint that important observations from discrete systems also apply to continuous systems. On a more general note, a continuum of phenotypes may well fall into discrete classes defined by distinct biological features. We may live in a continuous world, but most efforts at understanding this world involve classification of its objects, as much in biology as in any other area of human life, from the classification of biological species, to the classification of objects by our retina and visual cortex. Classification is a form of discretization. Discretization, with all its limitations, is thus central for our orientation in the world.
The framework I have developed thus far also contains other, more hidden simplifications, on which subsequent chapters will focus. I have thus far neglected the role of changing environments and phenotypic plasticity for innovation (Chapters 11 and 13), as well as the role of recombination (Chapter 10), population dynamics (Chapters 7, 8), and gene duplications (Chapter 9). As these chapters will show, the framework can easily accommodate these phenomena. They can enhance the power of genotype networks to explore new phenotypes.
Validation To help validate any principle that organizes natural phenomena, it is essential to think about the limits of its applicability, including the kinds of evidence that could prove this principle wrong. Although some of these limitations were implicit in my previous discussions, I will now briefly revisit them to make them more explicit. Below, I will refer to a “system class” as a particular
kind of genotype and phenotype, like the molecules, regulatory circuits, and metabolic networks I discuss throughout. First, the framework I suggest would face a problem if we found that phenotypes with tiny genotype networks generally are more innovative than phenotypes with large such networks. What I have in mind are phenotypes unusually rare in genotype space, yet nonetheless highly abundant in organisms, and at the same time highly innovative. Second, highly innovative systems or phenotypes, where many genotypes form the same phenotype, but where most of these genotypes are isolated in genotype space would present a problem to the theory (Figure 5.6a). Third, the same would hold if we found systems or phenotypes whose genotype networks are highly localized in a small region of genotype space, and that are nonetheless highly innovative (Figure 5.6b), more so than genotype networks that reach far through genotype space. Fourth, it would be problematic if we found systems or highly innovative phenotypes where distant neighborhoods of a genotype network contain mostly or exclusively identical phenotypes (Figure 5.6c). Lastly, the framework does not apply to systems with as many or more phenotypes than genotypes. (This is not the same issue as raised by phenotypic plasticity, where environmental variation produces several phenotypes from one genotype, as Chapter 13 discusses.) There is at least one prominent system class that may fall into this category. It comprises systems involved in self-recognition and immunity. An organism’s antibody repertoire, for example, is most effective in recognizing many antigens if its antibodies have highly diverse surface properties. Organisms achieve such high diversity through several mechanisms, including hypermutation of small hypervariable regions in the genes encoding these antibodies [13]. 
To maximize the antibody diversity that results from a given number of mutations, it is best if each mutation generates an antibody with new surface properties. In other words, it is best if there is one phenotype for every genotype. The same line of reasoning applies to pathogens capable of producing many gene variants encoding different surface proteins. This great diversity of surface proteins can help them avoid
detection by the immune system [537]. Thus, when every phenotypic variant is useful, the genotype network framework becomes useless. In this case, “innovation” also becomes trivially synonymous with variation, and is no longer challenging to explain. Our abilities to study genotypes and phenotypes are growing rapidly with the advance of whole-genome sequencing and functional genomic technology. Thus, in the years and decades to come, we will undoubtedly learn more about the incidence of anomalies like those I just described, whether they are the exception to a rule, or whether they lead to new principles we do not yet appreciate.
Innovation at the origin of life The importance of innovation undoubtedly began with the origin of life. We do not know whether the earliest life involved the information carrier RNA, or some simple metabolic network [535]. From an innovability perspective, it may not matter: both system classes have the prerequisites for innovability I discussed here. They can support the countless innovations that must have occurred until life even vaguely resembled its present cellular form. One could even take the perspective that genotype networks predate natural selection and thus life itself. From this point of view, they transcend biology. They exist regardless of whether a life form explores them. And life was able to take advantage of them as soon as natural selection started to operate.
Summary Figure 5.7 contains a summary of six important commonalities that we find between metabolic networks, regulatory circuits, and molecules, key system classes underlying all evolutionary innovation. It highlights the last two commonalities, because they are essential for the ability to innovate: genotype networks extend far through genotype space and their different neighborhoods contain a universe of different phenotypes. In the next chapter, we will see that these features emerge from a remarkably simple property of innovable systems.
Figure 5.7 Six common features of metabolic networks, regulatory circuits, and molecules.
Innovation is combinatorial in nature.
Genotypes have many neighbors with the same phenotype.
Many or all genotypes with the same phenotype are connected in genotype networks.
Genotype networks of different phenotypes have different sizes.
Typical genotype networks traverse a large part of genotype space.
Different neighborhoods of a genotype network contain different phenotypes.
CHAPTER 6
Genotype networks, self-organization, and natural selection
The previous chapter showed that extended genotype networks with phenotypically diverse neighborhoods are necessary for innovability. But it left a fundamental question unanswered: Why do they exist in the first place, and in system classes as different as metabolic networks, regulatory circuits, and molecules? This chapter suggests an answer. They emerge from one core common property: individual genotypes typically have many neighbors with the same phenotype. I will show that this property is sufficient for the existence of connected genotype networks that occupy a small fraction of genotype space, but that extend far through this space. I will also show that this property is necessary. Subsequently, I will show that the great phenotypic diversity of different neighborhoods is not surprising, but expected for systems with many phenotypes. Next, I will discuss the interdependency of self-organization and natural selection in evolution. Finally, I will point out that genotype networks render phenotypic change non-random in ways that facilitate innovation.
Genotype networks as graphs The chemistry and physics of how molecules fold, of how genes regulate their expression, and of how large metabolic networks synthesize biomass differ in many details. Commonalities among them will thus probably emerge neither from physics nor chemistry, but from more fundamental, mathematical principles. Principles from graph theory are most important in this regard [70, 304]. A graph is a mathematical object that consists of nodes, and of edges that link these nodes. A graph is connected if one can reach any node from any other
node by traversing a path of edges, and disconnected otherwise. Two kinds of graphs are important for my purpose. The first is a genotype space. It can be viewed as a graph whose nodes are genotypes. Edges connect nearest (1-mutant) neighbors in this space. The second kind of graph is a genotype network. The nodes in a genotype network are also genotypes, but only genotypes with a common phenotype; an edge connects two genotypes again if they are 1-mutant neighbors. A genotype network graph typically does not contain all genotypes in a genotype space. It can thus be viewed as a subgraph of genotype space. My treatment of genotype networks below emphasizes intuition over mathematical rigor, for three reasons. First, doing so will render the text as accessible as possible to the non-expert reader. Second, mathematically rigorous graph theory has produced deep insights, but mostly about highly idealized graphs with a simple structure, not the highly heterogeneous and “messy” real-world graphs such as genotype networks [70–73, 77, 262, 263, 642]. Third, although rigorous and highly technical mathematical proofs exist [70, 641–644], they are not essential to gain intuition about qualitative genotype network properties. Only some elementary probability theory and some terminology is necessary, to which I will now turn. I am using some terms that are non-standard in graph theory, but that will facilitate comprehension by non-experts [70, 304].
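The two kinds of graphs described above can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: genotypes are represented as strings over a small alphabet, and the phenotype rule (a binary genotype of length four has the phenotype if it contains at least two 1s) is a hypothetical toy example, not a biological model from this book. The function names are likewise chosen here for illustration.

```python
from collections import deque
from itertools import product


def one_mutant_neighbors(genotype, alphabet):
    """Yield every genotype that differs from `genotype` in exactly one position."""
    for i, current in enumerate(genotype):
        for symbol in alphabet:
            if symbol != current:
                yield genotype[:i] + symbol + genotype[i + 1:]


def genotype_network_components(genotypes, alphabet):
    """Split a set of genotypes into connected components.

    Edges link 1-mutant neighbors that both belong to `genotypes`, so the
    result describes a subgraph of genotype space.
    """
    remaining = set(genotypes)
    components = []
    while remaining:
        start = remaining.pop()
        component = {start}
        queue = deque([start])
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            for neighbor in one_mutant_neighbors(node, alphabet):
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


# Genotype space: all binary strings of length 4.
ALPHABET = "01"
space = ["".join(bits) for bits in product(ALPHABET, repeat=4)]

# Hypothetical toy phenotype (an assumption): at least two 1s in the genotype.
network_P = [g for g in space if g.count("1") >= 2]

components = genotype_network_components(network_P, ALPHABET)
print(len(space), len(network_P), len(components))
```

In this toy case the genotypes sharing the phenotype form a single connected subgraph of genotype space. A set such as `["0000", "1111"]`, by contrast, would fall into two components, because its two members are not 1-mutant neighbors.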
Hypercubes The genotype spaces of molecules, regulatory circuits, and metabolic networks are closely related to hypercube graphs. I will now explain this concept and its relationship to genotype spaces.
In doing so, I will use metabolic networks as an example. Recall that a metabolic genotype can be encapsulated in a binary string whose entries represent presence (“1”) or absence (“0”) of a chemical reaction in a metabolic network. Figure 6.1 shows cubes in various dimensions. The line and square of Figure 6.1a and 6.1b can be viewed as one- and two-dimensional cubes. Figure 6.1c shows the conventional three-dimensional cube. Figure 6.1d shows a three-dimensional representation of a four-dimensional cube. Such a representation, at least one that provides geometric intuition, becomes impossible for higher-dimensional cubes. Such higher dimensional cubes are called hypercubes. The vertices of the cubes in Figure 6.1 are labeled by binary strings whose length corresponds to the dimension of each cube. We can interpret these strings as representations of a genotype; for example, a metabolic network genotype. In this context, the vertices of the hypercube become the set of possible genotypes in a genotype space. For example,
Figure 6.1 Cubes and hypercubes. (a) and (b) “Cubes” in one and two dimensions. (c) A cube in three dimensions. The vertices of each cube are labeled with the binary strings that they correspond to. (d) A three-dimensional representation of a four-dimensional cube, where the labeling of vertices with binary strings (of length four) is omitted for clarity. The vertices of cubes in higher dimensions (hypercubes) form the nodes of hypercube graphs.
the ends of the line in Figure 6.1a correspond to a trivially small genotype space that contains only one reaction, which can be absent or present in any one genotype. Figure 6.1d shows a four-dimensional genotype space whose metabolic network genotypes can contain up to four reactions. The concept of a hypercube graph extends these ideas to higher dimensions. A hypercube graph is a graph whose nodes (here: genotypes) correspond to vertices of a hypercube. Two nodes are neighbors (connected by an edge) if they correspond to adjacent vertices in the hypercube. In the context of metabolic networks, there is a one-to-one correspondence between each vertex of a hypercube and each genotype in metabolic genotype space. Two adjacent vertices correspond to two metabolic genotypes that are neighbors, and that differ in a single entry of their binary representation, that is, in a single enzymatic reaction. I now review some very basic facts about hypercube graphs that we will need later on. (Figure 6.2 shows some of the notation I will use here.) For the purpose of this chapter, the distance between two genotypes corresponds to the number of mutations (additions or eliminations of reactions) necessary to transform one genotype into the other. The diameter of a graph is the maximum distance (number of edges on a shortest path) between any two genotypes. For the hypercube graph, this distance is equal to S, the number of mutations that are necessary to create from a genotype G its complement G̅ that differs in every single reaction. This diameter S is vastly smaller than the total number of possible genotypes 2^S. This simple fact will go a long way towards explaining important properties of genotype networks. Each genotype G in genotype space has exactly S immediate neighbors. One can show that its number of k-neighbors, genotypes that differ from it in exactly k system parts, is given by the binomial coefficients:
\[ \binom{S}{k} = \frac{S!}{(S-k)!\,k!}. \tag{6.1} \]
This can be most easily seen by considering a hypothetical genotype G in which all chemical reactions are present, corresponding to the binary string of only ones. The k-neighbors of G are the strings that have exactly k zeroes. The number of ways to choose
Figure 6.2 Some notation used in this chapter.
S: System size. Size of the “universe” of possible biochemical reactions; number of possible regulatory interactions in a regulatory circuit; length (number of monomers) of an RNA or protein molecule.
B: Number of different system “building blocks”. B=2 for metabolic networks (reaction presence/absence); B≥3 for regulatory circuits; B=4 for RNA; B=20 for proteins.
G: Genotype.
G̅: Genotype differing from G in every one of S building blocks.
ν: Fraction of a genotype’s neighbors with the same phenotype.
D: Distance between two nodes in a graph or between two genotypes in a genotype network, expressed as a fraction of system size S (0≤D≤1).
KP: Total number of phenotypes.
MP: The number of genotypes with a phenotype P.
I note a minor deviation from notation in earlier chapters. For regulatory circuits, I earlier used the variable S to indicate the number of circuit genes (Chapter 3), which renders the total number of possible pairwise regulatory interactions equal to S^2. In this chapter, I will use the variable S for this number of possible pairwise regulatory interactions. I do so to keep the mathematical notation for molecules, regulatory circuits, and metabolic networks commensurate.
k out of S elements, regardless of their order is given by the binomial coefficients (Equation 6.1). The same argument applies to any genotype G. The fundamental reason is that hypercubes are highly symmetric: The nodes of a hypercube graph have identical roles in it, much like the vertices in a cube. The number of k-neighbors given by Equation 6.1 increases with increasing k until k≈S/2, and it declines thereafter, as k approaches S. In the metabolic networks I discussed, genotypes can be characterized by the presence or absence of metabolic reactions. In regulatory circuits, one needs to distinguish at least three kinds of regulatory interactions (activating, absent, or repressing). Finer gradations of interaction strengths may be useful or necessary for many purposes. In RNA molecules, there are four kinds of nucleotides, whereas in proteins, there are 20 kinds of amino acids. Thus, the elementary parts of genotypes may have very different numbers of building blocks, be they metabolic reactions, strengths of regulatory interactions, nucleotides, or amino acids. I will refer to the number of such building blocks with the
variable B, and will consider here predominantly the values B=2 (metabolism), B=3 (transcriptional regulation), B=4 (RNA), and B=20 (protein). Wherever the number of building blocks exceeds B=2, an extension of the hypercube concept is necessary. For B>2, there are not just 2^S but B^S genotypes, because each of S system parts can be made of B different building blocks. Even in just one dimension (Figure 6.1a), each part of a genotype may then assume not just two values but B different values. The resulting hypercube graphs are even more difficult to understand geometrically than for B=2, although some features remain the same. Most importantly, despite the larger number of genotypes when B>2, the graph diameter remains at the small value of S. Genotypes are again connected by an edge if they differ in exactly one building block. However, each genotype now has (B–1)S 1-neighbors, because each of its S parts can change into one of (B–1) different building blocks. The number of k-mutant neighbors becomes:
\[ (B-1)^k \binom{S}{k}. \tag{6.2} \]
The additional factor (B–1)^k arises because each of the k changed parts of a genotype can change into B–1 other parts. In another difference to hypercube graphs for B=2, any one genotype G does not have just one maximally different complement (or S-neighbor) G̅, but a large number of (B–1)^S such maximally distant neighbors. The reason is, again, that each constituent part may adopt one of (B–1) values different from those in G.
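The counting arguments behind Equations 6.1 and 6.2 can be checked numerically. The following is a minimal sketch, with a function name and a small system size S=10 chosen here purely for illustration. It relies on the fact that every genotype lies at exactly one distance k from a reference genotype G, so the k-neighbor counts for k=0,…,S must add up to the size B^S of genotype space.

```python
import math


def k_neighbor_count(S, k, B=2):
    """Number of genotypes that differ from a given genotype in exactly k of S parts.

    Choose which k parts change, then one of (B - 1) alternatives for each.
    For B=2 this reduces to the binomial coefficient of Equation 6.1.
    """
    return (B - 1) ** k * math.comb(S, k)


S = 10  # illustrative system size (an assumption, kept small on purpose)

# Sum of k-neighbor counts over all distances, for each number of building blocks B.
totals = {B: sum(k_neighbor_count(S, k, B) for k in range(S + 1)) for B in (2, 3, 4, 20)}

# Distance at which the number of k-neighbors peaks for B=2.
peak_k = max(range(S + 1), key=lambda k: k_neighbor_count(S, k))

# Number of maximally distant genotypes (S-neighbors) for B=4.
complements = k_neighbor_count(S, S, B=4)

for B, total in totals.items():
    print(B, total == B ** S)
print(peak_k, complements == 3 ** S)
```

For B=2 the peak lies at k=S/2, in line with the remark that the number of k-neighbors increases until k≈S/2 and declines thereafter.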
The existence of neutral neighbors suffices for the existence of genotype networks In molecules, regulatory circuits, and metabolic networks, a genotype G typically has many neighbors with the same phenotype as G itself. As in Chapter 5, I will refer to such neighbors as neutral neighbors for brevity. I will denote the fraction of a genotype’s neighbors that are neutral as ν. It can vary widely among different phenotypes and genotypes, but typically ranges between 0.1 and greater than 0.5, as we saw in Chapters 2–4. I will now show that this feature is a sufficient condition for the existence of genotype networks that comprise a tiny fraction of genotype space, but that extend far through this space. To keep the mathematics simple, I will treat the problem as if all genotypes had the same number of neutral neighbors, and neglect the fact that different genotypes on a genotype network may have different numbers of neutral neighbors. This simplification is appropriate for my qualitative analysis. The analysis rests on the following idea. I construct different kinds of random graphs in which an approximate fraction ν<0.5 of each node’s neighbors on the hypercube belong to the graph, and examine the properties of these random graphs. Note that actual genotype networks need not be organized like these random graphs. The random graphs merely serve as a null hypothesis, telling us what to expect of genotype networks that share only the one simple property that genotypes have many neutral neighbors. I will first focus on the question how far such a graph would typically reach through genotype space. To this end, I will build a random graph iteratively, starting at some arbitrary genotype G. In the first iteration, I connect G to a fraction ν of its 1-neighbors
at random, with equal probability that each 1-neighbor is chosen. The diameter of the graph thus constructed is equal to 1, or when expressed as a fraction of the diameter of genotype space, D=1/S. In the second step, I take each of the 1-neighbors of G that are now connected to G, and connect it to a fraction ν of its neighbors, most of which will be 2-neighbors of G. Thus, unless ν is smaller than typically observed, the diameter of the resulting graph is D=2/S. I proceed analogously for the 2-neighbors now connected to 1-neighbors (that are themselves connected to G) and connect these 2-neighbors to 3-neighbors, and so forth. By construction, nodes in the resulting graph will be connected to approximately a fraction ν of their neighbors in the hypercube. At some iteration step in this graph-construction process, genotypes newly added to the graph may not increase the graph diameter further. To estimate approximately when this step may be reached, let us focus on an arbitrary k-neighbor Gk of G and ask for the probability that a randomly chosen neighbor of Gk is a (k+1)-neighbor of G. To this end, we need to appreciate that only a number (B–1)(S–k) of the 1-neighbors of Gk will be (k+1)-neighbors of G. (The remainder will be k-neighbors or (k–1)-neighbors.) The reason is that in Gk, (S–k) system parts have not changed their identity from the identity they had in G itself. Each of these parts can change into (B–1) possible new identities to produce a (k+1)-neighbor of G. With this observation in mind, we can calculate the probability that a randomly chosen neighbor of Gk is a (k+1)-neighbor of G. This probability is given by the number of Gk’s neighbors that are (k+1)-neighbors of G, divided by the total number of neighbors of Gk, or:
\[ \frac{(B-1)(S-k)}{(B-1)S} = 1 - \frac{k}{S}. \tag{6.3} \]
Note that this probability is independent of B, and it decreases linearly for increasing distance k from G. That is, the more distant Gk is from G, the less likely it is that one of Gk’s neighbors is even more distant from G. The (k+1)-th iteration of the random graph construction process connects each k-neighbor of G that is already a member of the graph to ν(B–1)S
randomly chosen neighbors of itself. The number of newly added nodes that are (k+1)-neighbors of G thus follows a binomial distribution with parameters ν(B–1)S and p=1–k/S, given by Equation 6.3. We can ask when the expected number of newly added nodes that are (k+1)-neighbors of G falls below one. In other words, how far does the graph already have to extend through genotype space, such that none of the nodes added from Gk would extend the diameter of this graph further? This expectation is given by the product of the binomial distribution’s parameters, that is, ν(B–1)S × (1–k/S) = ν(B–1)(S–k). In mathematical terms, we want to find k such that:
\[ \nu (B-1)(S-k) < 1. \]
In terms of the graph diameter D=k/S, we obtain:
\[ D > 1 - \frac{1}{\nu (B-1) S}, \tag{6.4} \]
as the graph diameter beyond which addition of new nodes to one node Gk of our random graph is expected to add fewer than one edge that increases the graph diameter further. This estimate focuses on a single genotype Gk from which the graph is extended further. It neglects that each iteration may produce multiple genotypes Gk that have the same maximal distance k from G. Each of them (and not just one of them) has a probability given by Equation 6.3 to increase the graph diameter further in the next iteration. The estimate also neglects that for large ν, small B, and Gk with large k, the graph construction process I outlined may produce more than one edge connecting (k–1)-neighbors of G to Gk, thus reducing the number of (k+1)-neighbors that can be added to Gk while keeping its number of neighbors approximately equal to ν. For these reasons, the estimate is rather crude, but it provides some qualitative insight. It tells us that as S and (B–1) increase, the diameter of such a random graph would move ever closer to the maximally possible value of one. The same holds for increasing values of the fraction ν of neutral neighbors, because as ν increases, the likelihood increases that at least one of the newly added neighbors increases the graph diameter. Even for modest S, small B, and values of ν close to the lower range of
what we observe in typical metabolic, regulatory, and molecular genotype networks, D can be large. For example, even for a modest genotype size of S=100, merely B=2 building blocks, and ν=0.1, the above estimate yields D>0.9. The random graph constructed in this way may thus reach across some 90 percent of the diameter of genotype space. In sum, random graphs, where each genotype has some modest fraction ν of neighbors, will typically have large diameters. This means that genotype networks with this property will typically reach far through genotype space. The fundamental reason is the huge discrepancy between the total number of B^S genotypes in a genotype space, which grows exponentially with the size of a genotype space, and the diameter of the hypercube, which grows merely linearly. Next, I will consider the typical size of a random graph whose nodes have an appreciable fraction ν of neutral neighbors. Naively, one might think that such a graph would also occupy a large fraction of genotype space. The purpose of this section is to show that this is not the case. To estimate the number of genotypes in such a graph, I will revisit the first two steps in the random graph construction process. A fraction ν of 1-neighbors of G are connected to G, and of those again a fraction ν connect to their neighbors, such that the fraction of 2-neighbors of G that become connected to G is of the order of ν^2. More precisely, consider an arbitrary 2-neighbor of G, regardless of whether it is part of our random graph. It is easy to see that any such 2-neighbor has exactly two edges that connect to it from 1-neighbors of G. The probability that this 2-neighbor of G is part of our random graph is then equal to the probability that at least one of these two edges links it to 1-neighbors of G. From the argument in the previous paragraph, the probability that none of these edges links it to 1-neighbors of G is of the order of (1–ν^2).
The probability that at least one of these edges links it to 1-neighbors of G then simply becomes ν^2. This argument holds for all 2-neighbors of G, which become connected to G via 1-neighbors of G (independently from one another and with the same probability, as prescribed in the construction procedure for our random graph). Thus, the number of 2-neighbors of G in our random graph becomes:
\[ \nu^2 (B-1)^2 \binom{S}{2}. \]
The argument extends analogously to neighbors of increasing distance k of G, yielding:
\[ \nu^k (B-1)^k \binom{S}{k} \tag{6.5} \]
for the number of k-neighbors of G that form part of this random graph. Overall, the number of nodes in the random graph is approximated by the sum of these individual contributions:
\[ \sum_{k=0}^{S} \nu^k (B-1)^k \binom{S}{k}. \tag{6.6} \]
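The estimates in Equations 6.3, 6.4, and 6.6 are easy to evaluate numerically. The sketch below is an illustration only; the function names are chosen here and are not part of the original treatment. It also uses one fact not stated above: by the binomial theorem, the sum in Equation 6.6 equals (1+ν(B–1))^S, which provides a convenient check on the summation.

```python
import math


def prob_step_outward(S, k):
    """Equation 6.3: probability that a random neighbor of a k-neighbor of G
    is a (k+1)-neighbor of G."""
    return 1 - k / S


def diameter_lower_bound(S, B, nu):
    """Equation 6.4: fractional diameter D beyond which one node is expected
    to add fewer than one diameter-increasing edge."""
    return 1 - 1 / (nu * (B - 1) * S)


def random_graph_size(S, B, nu):
    """Equation 6.6: approximate number of nodes in the random graph."""
    return sum(nu ** k * (B - 1) ** k * math.comb(S, k) for k in range(S + 1))


def size_fraction(S, B, nu):
    """Random graph size as a fraction of the B**S genotypes in genotype space."""
    return random_graph_size(S, B, nu) / B ** S


S = 100
for B in (2, 3, 4, 20):
    for nu in (0.1, 0.3, 0.5):
        D = diameter_lower_bound(S, B, nu)
        log_fraction = math.log10(size_fraction(S, B, nu))
        print(f"B={B:2d} nu={nu:.1f}  D > {D:.3f}  log10(size fraction) = {log_fraction:7.1f}")
```

For S=100, B=2, and ν=0.1, the diameter bound comes out at about 0.9, the value quoted in the text, while the corresponding graph occupies only a minute fraction of genotype space. Both estimates inherit the crudeness of the underlying equations and should be read as order-of-magnitude guides.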
I emphasize that Equation 6.6 is again a crude approximation. It neglects, for example, that only a fraction of neighbors of randomly chosen k-neighbors of G connect to (k+1)-neighbors of G. The simplification, however, may not affect the order-of-magnitude estimate (Equation 6.6) dramatically, if one considers that this fraction of neighbors decreases only linearly in k (Equation 6.3), whereas the fraction of (k+1)-neighbors added to the graph in each construction step changes exponentially in k. Figure 6.3 shows how the estimate (Equation 6.6) depends on the fraction ν of neighbors (horizontal axis), and on the number of building blocks B (vertical axis) for a moderately sized system of S=100
Figure 6.3 Genotype space spanning random graphs can occupy a small fraction of sequence space. The horizontal axis shows the average fraction ν of neighbors in a random graph constructed as described in the text, the vertical axis shows the number B of building blocks that each genotype can be constructed of. In the systems I discussed here, B included values of B=2 (metabolic networks), B=3 (regulatory circuits), B=4 (RNA), and B=20 (proteins). The contour lines indicate random graph sizes, expressed as fractions of the total size B^S of the hypercube for S=100.
metabolic reactions, regulatory interactions, nucleotides, or amino acids. This genotype network size is expressed as a fraction of the total size B^S of genotype space. The figure shows that this fractional size is very small compared to genotype space. It would decrease further with increasing S. Equation 6.6 may be an imprecise estimate of genotype network size, but it serves to make the main point. Even if a genotype typically has many neutral neighbors, its genotype network may occupy a tiny fraction of genotype space. This also means that genotype space could host a myriad different genotype networks (one for each phenotype) that span this space or nearly so (Equation 6.4). A remaining aspect of the organization of genotype networks regards their connectedness. The observation that genotypes typically have many neighbors with the same phenotype makes it seem highly likely that genotypes form large connected networks. However, whether all genotypes would belong to one such network, or whether there might be more than one network, is less clear. Some work in graph theory aims at answering this question. This work typically focuses on genotype networks in the limit where the size S of a system approaches infinity. It demonstrates graph properties that exist with probability one in this limit. One such property is the existence of a giant component. A component is a part of a graph where every two nodes can be connected through a path of edges. A giant component contains a finite fraction of a graph’s nodes [70]. This means that if a graph’s size approaches infinity, then so does the size of a giant component. In this limit, components that are not giant contain only an infinitesimal fraction of a graph’s nodes. A graph can have more than one giant component. Relevant for my purpose is a graph theoretical result applying to random graphs on the hypercube for which an average fraction ν>0 of each node’s neighbors lie also on the graph.
In this case, the probability that a graph has a giant component becomes equal to one as the graph’s size approaches infinity [640]. If ν exceeds a threshold of
\[ \nu > 1 - B^{-1/(B-1)}, \]
then there exists exactly one such component in the limit where S approaches infinity. This component spans the entire genotype space, that is, its diameter is equal to D=1 [640]. For metabolic networks, where B=2, this threshold is ν>0.5,
89
for regulatory circuits (B³3), it is ν>0.42, for RNA (B=4), it would be ν>0.37, and for proteins (B=20), it would be ν>0.15. Sets of genotypes whose members have more than this fraction of neutral neighbors would be connected and span genotype space. Smaller sets would not, but they would still form a giant component that contains most genotypes. The more building blocks B a system has, the smaller the average fraction ν of neighbors that suffices to generate a single genotype space-spanning genotype network. I emphasize again that all these considerations do not imply that actual genotype networks behave exactly like random graphs. They may differ in many respects, for example, in how the fraction ν of neutral neighbors varies from genotype to genotype. However, they show that even simple random graphs with some typical fraction ν>0 of neutral neighbors per node typically form vast connected sets that extend far (or all the way) through genotype space, despite comprising only a tiny fraction of this space. The root cause for these properties is the huge discrepancy between the diameter of this space (S) and the number of genotypes (BS) in it.
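The thresholds quoted above can be checked numerically. The following is a minimal sketch, assuming the threshold has the form 1 − B^(−1/(B−1)); this form is inferred here because it reproduces the four quoted values, and the function name is hypothetical rather than taken from ref. [640].

```python
def giant_component_threshold(B: int) -> float:
    """Fraction of neutral neighbors above which a single, space-spanning
    component is expected in the limit of infinite system size.

    Assumption: the threshold is 1 - B**(-1/(B-1)). This form is inferred
    from the values quoted in the text, not copied from the cited source.
    """
    if B < 2:
        raise ValueError("the number of building blocks B must be at least 2")
    return 1.0 - B ** (-1.0 / (B - 1))


# Alphabet sizes B used in the text for the different system classes.
systems = {
    "metabolic networks": 2,
    "regulatory circuits": 3,
    "RNA": 4,
    "proteins": 20,
}

if __name__ == "__main__":
    for name, B in systems.items():
        print(f"{name:20s} B={B:2d}  threshold={giant_component_threshold(B):.2f}")
```

The threshold falls as B grows, which is the sense in which systems with more building blocks need a smaller fraction of neutral neighbors to form a single spanning network.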
Neutral neighbors are necessary for the existence of genotype networks

Thus far, I have shown that neutrality (ν>0) is sufficient for the existence of genotype networks. I will now show that it is also necessary. To see this, consider a typical phenotype P. It will be formed by some very large number $M_P$ of genotypes that typically constitute a very small fraction of genotype space. Let us assume that this set of $M_P$ genotypes consists of genotypes chosen at random from genotype space, without requiring that each genotype has many neighbors with the same phenotype. This assumption is another variant of a null hypothesis comparing genotype networks to random graphs. The question is whether many or most of these genotypes would be connected in a genotype network. To address this question, let us examine one such genotype G and its S(B–1) neighbors. What is the probability that this genotype is isolated, that is, that none of its neighbors are members of P’s genotype set? To answer this question, consider first the probability that a randomly chosen genotype from the $B^S-1$ genotypes different from G is not one of the
S(B–1) neighbors of G. This probability is equal to one minus the number of neighbors of G divided by the $B^S-1$ genotypes different from G, i.e., $1-S(B-1)/(B^S-1)$. Similarly, the probability that a second genotype chosen at random from the now remaining $B^S-2$ genotypes is not a neighbor of G is $1-S(B-1)/(B^S-2)$. The same argument applies for a third, fourth, and further genotypes, until one reaches genotype number $M_P-1$, for which the probability that it is not a neighbor of G is given by $1-S(B-1)/[B^S-(M_P-1)]$. From these expressions, we can calculate the probability that none of the $M_P-1$ genotypes different from G are neighbors of G as their product:

$$\prod_{i=1}^{M_P-1}\left(1-\frac{S(B-1)}{B^S-i}\right).$$

I note that each factor in this product is greater than the last factor ($i=M_P-1$), such that the entire product is greater than the expression:

$$\left(1-\frac{S(B-1)}{B^S-M_P+1}\right)^{M_P-1}\approx 1-\frac{S(B-1)(M_P-1)}{B^S-M_P+1}. \tag{6.7}$$
The approximation in Equation 6.7 takes advantage of the relationship that $(1-x)^y\approx 1-yx$ for small x. The ratio $S(B-1)/[B^S-(M_P-1)]$ (corresponding to x) will indeed be very small, even for moderately large system size S. The reason is that the numerator of this ratio is linear in S, whereas the denominator is dominated by the term $B^S$, which is exponential in S. For the same reason, and because the number $M_P$ of genotypes with phenotype P is typically very small compared to the size $B^S$ of genotype space, the right-hand side of Equation 6.7 will be extremely close to 1. Thus, the probability that any one genotype in a set of genotypes with the same phenotype has no neighbors that are also in this set is extremely close to one. Because this holds for any genotype in this set, such a set would consist mostly of isolated nodes, and would not form a connected genotype network. More generally, one can show that a set of random genotypes must contain at least of the order of a fraction 1/S of all genotypes in genotype space before a giant connected component arises [74, 643]. In a genotype space of $B^S$ genotypes, this is a gigantic number of genotypes, larger than the
genotype sets for most phenotypes in the systems I study. In sum, even a large set of random genotypes with the same phenotype would mostly consist of isolated nodes. Thus, the condition of neutrality, that many neighbors of a genotype have the same phenotype, is essential for the existence of connected genotype networks.
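The isolation argument of this section can be illustrated with exact rational arithmetic. The sketch below computes both the exact product of non-neighbor probabilities and the linear bound on the right-hand side of Equation 6.7. The function names and the example values (S=30, B=4, $M_P$=200) are hypothetical choices for illustration only.

```python
from fractions import Fraction


def isolation_probability_exact(S: int, B: int, M_P: int) -> Fraction:
    """Probability that none of the M_P - 1 other randomly chosen genotypes
    is one of the S(B-1) neighbors of a focal genotype G (exact product)."""
    genotypes = B ** S            # size of genotype space
    neighbors = S * (B - 1)       # one-mutant neighbors of G
    if M_P < 1 or M_P - 1 >= genotypes - neighbors:
        raise ValueError("M_P is outside the range where every factor is positive")
    prob = Fraction(1)
    for i in range(1, M_P):       # i = 1, ..., M_P - 1
        prob *= 1 - Fraction(neighbors, genotypes - i)
    return prob


def isolation_probability_bound(S: int, B: int, M_P: int) -> Fraction:
    """Linear approximation on the right-hand side of Equation 6.7."""
    genotypes = B ** S
    neighbors = S * (B - 1)
    return 1 - Fraction(neighbors * (M_P - 1), genotypes - M_P + 1)


if __name__ == "__main__":
    # Hypothetical example: an RNA-like space with S=30, B=4 and a
    # (deliberately small) random set of M_P=200 genotypes.
    S, B, M_P = 30, 4, 200
    exact = isolation_probability_exact(S, B, M_P)
    bound = isolation_probability_bound(S, B, M_P)
    # Report the distance from 1, since both values are extremely close to 1.
    print("1 - exact product:", float(1 - exact))
    print("1 - linear bound :", float(1 - bound))
```

For realistic sizes of $M_P$ the loop over all factors becomes impractical, which is one reason the closed-form bound is useful; the bound never exceeds the exact product.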
High phenotypic diversity in genotype network neighborhoods is expected

I will now turn to the second major feature of genotype networks that is crucial for evolutionary innovation: their diverse genotypic neighborhoods. Specifically, I will show here that we can expect these neighborhoods to be diverse, even if different phenotypes were organized completely randomly in genotype space. By organized randomly I mean that if there are $K_P$ phenotypes in total, then each genotype is equally likely to adopt any one of these phenotypes. Actual phenotypes are certainly not distributed in this way, but this scenario serves again as a useful null hypothesis. Consider two genotypes, $G_1$ and $G_2$, on the same genotype network, where a fraction ν of each genotype’s S(B–1) neighbors have the same phenotype as $G_i$ itself. Let us focus on all the (1–ν)S(B–1) genotypes in the neighborhood of $G_1$ that have a phenotype different from $G_1$. I will assume for the moment that all these phenotypes are also different from each other. The question is what fraction of these phenotypes we would expect to find also in the neighborhood of $G_2$. The number of genotypes in the neighborhood of $G_2$ whose phenotype is different from that of $G_2$ itself is (1–ν)S(B–1). Under the null hypothesis, the probability p that any one randomly chosen phenotype in the neighborhood of $G_2$ is also found in the neighborhood of $G_1$ is equal to the number (1–ν)S(B–1) of phenotypes in the neighborhood of $G_1$, divided by the total number of phenotypes $K_P$. That is, $p=(1-\nu)S(B-1)/K_P$. Under this null hypothesis, the number of $G_2$’s neighbors whose phenotypes are identical to phenotypes in the 1-neighborhood of $G_1$ is then binomially distributed with parameter (1–ν)S(B–1) and probability p. The properties of the binomial distribution [236] imply that the expected number of phenotypes that is identical in the two
neighborhoods is given by (1–ν)S(B–1)p, which is equal to:

$$\frac{(1-\nu)^2 S^2 (B-1)^2}{K_P}. \tag{6.8}$$
The numerator of Equation 6.8 is dominated by the term $S^2$. The denominator, the total number of phenotypes $K_P$, is generally much larger than $S^2$. It scales exponentially with the number of nutrients for metabolic phenotypes (Chapter 2), for gene expression phenotypes (Chapter 3), for the RNA phenotypes I have discussed (Chapter 4), and, albeit not scaling exponentially, it may also be large for protein functions (Chapter 5). This means that, for most systems, the ratio in Equation 6.8 will be much smaller than 1. Thus, under the null hypothesis, the two neighborhoods are expected to share fewer than one common phenotype. If some of the phenotypes in a genotype’s neighborhood that are different from that of $G_1$ and $G_2$ are identical to one another (as is usually observed) then the numerator of Equation 6.8 would be even smaller. This means that the two neighborhoods would contain even fewer common phenotypes. In reality, phenotypic neighborhoods are diverse, but not quite as diverse as Equation 6.8 suggests. In metabolic networks, regulatory circuits, and molecules, the neighborhoods even of very distant genotypes on the same genotype network share some fraction of phenotypes, while other phenotypes are unique to one neighborhood. The reason is that neighbors of a genotype G tend to adopt phenotypes similar to that of G itself. Specifically, many neighbors of a regulatory circuit genotype G typically have gene expression patterns similar to that of G; the metabolic phenotype of a metabolic genotype G’s neighbor must be similar to that of G, because the neighbor differs from G in only one reaction; and the structure of an RNA genotype G’s neighbors is often similar to that of G itself (Figure 4.8). Actual genotype networks thus violate the null hypothesis. But again, this hypothesis merely serves to show that in systems with many phenotypes, even a random organization of phenotypes in genotype space will lead to genotypic neighborhoods with highly diverse phenotypes.
Such diversity is thus expected, and not unusual for systems with many phenotypes.
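To make the null expectation of Equation 6.8 concrete, here is a small sketch that evaluates it. The function name and the example parameters (in particular the total number of phenotypes, set to one million) are hypothetical and chosen only to illustrate the scaling; they are not estimates for any real system.

```python
def expected_shared_phenotypes(S: int, B: int, nu: float, K_P: int) -> float:
    """Expected number of novel phenotypes shared by the neighborhoods of two
    genotypes under the random-organization null hypothesis (Equation 6.8)."""
    if not 0.0 <= nu <= 1.0:
        raise ValueError("nu must be a fraction between 0 and 1")
    non_neutral = (1.0 - nu) * S * (B - 1)   # neighbors with a different phenotype
    p = non_neutral / K_P                    # chance that a given phenotype recurs
    if p > 1.0:
        raise ValueError("K_P is too small for p to be a probability")
    return non_neutral * p                   # mean of the binomial distribution


if __name__ == "__main__":
    # Hypothetical values: S=30, B=4, 30 percent neutral neighbors,
    # and one million phenotypes in total.
    print(expected_shared_phenotypes(S=30, B=4, nu=0.3, K_P=10**6))
```

Because the numerator grows only quadratically in S while $K_P$ typically grows much faster, the returned value drops well below one for systems with many phenotypes.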
Self-organization and natural selection

The word self-organization has many meanings [99, 210, 303, 557, 739, 746, 854]. Here I use it in the sense that collections of objects and their interactions bring forth structures on a higher level of organization. Such structures form from the bottom-up, merely through properties of the objects and their interactions, and without any order imposed from the outside. Genotype networks are examples of self-organized structures. Here, the lower level objects are genotypes of molecules, regulatory circuits, and metabolic networks. The higher order structures are individual genotype networks, and their organization in genotype space. They emerge from the principles that guide how phenotypes form from genotypes, together with some basic features of genotype space, such as its small diameter and its many phenotypes, which are nonetheless fewer than its genotypes (Chapter 5). For more than a century, evolutionary biology has focused on natural selection as the key process explaining life’s enormous diversity. The occasional suggestion that self-organization may be equally, or more, important than natural selection has been decidedly heterodox [388]. In light of what I have said thus far, it is useful to re-examine the relationship between natural selection and self-organization. This relationship, it turns out, is not difficult to understand. The self-organization of genotype networks is essential for evolutionary innovation. If genotype networks did not exist, or if they were organized differently (Figure 5.5b–d), a vast world of molecules, regulatory circuits, and metabolic network phenotypes would be inaccessible to evolution. Conversely, imagine a world with self-organized genotype networks, but without natural selection. There would be no force preserving phenotypes.
Driven by mutations, a population of molecules, regulatory circuits, or metabolic networks with a given phenotype would drift aimlessly through genotype space, and thus lose this phenotype. The integrated organization of even the simplest organism would be unsustainable. These simple considerations show that both natural selection and self-organization are equally necessary in evolution. The success of one in bringing forth innovation depends entirely on the other. Self-organization of genotype networks ensures that mutations can produce innovations, and natural selection ensures that innovations can be preserved. Self-organization is as essential for innovation as natural selection is for its preservation. Our insights into the self-organization of genotype networks are recent. In contrast, the discovery of natural selection goes back to Charles Darwin and Alfred Russel Wallace in the nineteenth century [162, 503]. It is thus little surprise that much more attention has been paid to selection in the history of evolutionary biology. Compared to what we know about selection, our ignorance about genotype networks is nearly complete, save for the few qualitative features I discuss here. Perhaps it is time to refocus our efforts to better understand this second key ingredient to life’s great success.
Innovation and “random” change

We generally think of evolution as driven by “random” mutations. I will next examine briefly how genotype networks may affect our views on randomness in evolution. I will only make some qualitative observations, because a nuanced discussion could itself fill a book. Randomness is made precise through the notion of a random variable, a mathematical object that can adopt a range of possible values, each of them with some probability drawn from a probability distribution [236]. Randomness can only be properly defined with such a set of values and a probability distribution in mind. One can distinguish two connotations of randomness pertinent to my subject. The first connotation regards the effect of mutations on an organism’s fitness. Here, the random variable is the change in fitness a mutation causes. It can assume three categories of values: beneficial, detrimental, and neutral. When biologists maintain that mutations are random, they often mean that mutations do not preferentially increase their carrier’s fitness. Mutations do not serve the interests of the organism in which they occur. There is widespread consensus among evolutionary biologists that mutations are random in this sense [168, 204, 259, 501, 706, 715, 716].
A second connotation, more important here, regards phenotypes. How do they change through a random mutation, for example, one that affects each system part (nucleotide, amino acid, regulatory interaction, or biochemical reaction) with the same probability? Here, the random variable is a phenotype. And its possible “values” are all possible phenotypes. If we call a mutation random that produces all phenotypes with equal probability, then mutations are decidedly non-random. First, a single mutation can only bring forth a minute fraction of phenotypes, whose identity depends on the mutated genotype and its neighborhood. Second, a substantial fraction of mutations do not affect phenotype at all. Both aspects of non-randomness are important, but let me highlight the second. As we saw here, it brings forth genotype networks with an organization that facilitates innovation. From this perspective, mutations are non-random in a way that enables evolutionary innovation.
Summary

Genotype networks that are large, occupy a small fraction of genotype space, traverse a large fraction of this space, and show highly diverse phenotypic neighborhoods are self-organized, emergent features of genotype spaces. The existence of many neutral neighbors per genotype gives rise to the first three features. It is both necessary and sufficient for them. The last feature, different genotypic neighborhoods that contain different novel phenotypes, merely requires that a system has many different phenotypes. Thus, the properties of the metabolic, regulatory, and molecular systems that facilitate innovation are self-organized features of genotype space. Mutations are non-random in a way that brings forth these features, and thus promotes evolutionary innovation. Natural selection and self-organization are both essential for evolutionary innovation. In Chapter 1, I quoted the geneticist de Vries who stated that natural selection cannot explain the origin of novel phenotypes [170]. More than 100 years later we can say this: Genotype networks help explain the arrival of the fittest, and natural selection permits their survival.
CHAPTER 7
A synthesis of neutralism and selectionism
Neutralism and selectionism are two opposing perspectives on evolutionary change. In the broadest sense, they apply to all evolutionary change, including evolutionary innovation. Any theory of innovation thus needs to have a position towards them. In this chapter, I first explain these two perspectives (see also ref. [829]). I then provide some background material on the population dynamics of neutral change. After that, I propose a synthetic view on neutralism and selectionism that can resolve the tension between them. In this view, neutralism and selectionism capture complementary aspects of biological reality. Genotype networks play a central role in it. This view also clarifies the role of molecular exaptations in innovation, an important concept that I will also discuss here [281, 282]. Most pertinent data for this chapter comes from molecules, but the major principles hold for all three major system classes of this book.
Selectionism and neutralism in a broad and narrow sense

With respect to innovations, a strict selectionist would maintain that all innovations arise through beneficial mutations. These mutations change a trait for the better when they first arise, and they constitute the innovation. For my purpose, the relevant traits are the structure and function of molecules, the expression phenotypes of genes, and a metabolic network’s biosynthetic abilities. In contrast, a neutralist would argue that mutations without any effect when they first arise might facilitate such innovation. I refer to these perspectives as selectionism and neutralism in the broad sense. Selectionism and neutralism are also used in a narrower sense. I need to explain this usage here, because it played an important role in the history of molecular evolutionary biology. In this usage, selectionism
and neutralism offer competing explanations on what causes observed genetic variation in populations. To understand this usage, one needs to be aware that the DNA of a genome is subject to three possible kinds of mutations. The first kind comprises deleterious mutations, which are harmful and subject to purifying selection. The second kind comprises neutral mutations, which do not affect fitness. The third kind comprises beneficial mutations, which increase fitness and are subject to positive selection. Neutralism and selectionism in the narrow sense agree that deleterious mutations are frequent in the evolution of genes and proteins. However, they profoundly disagree on the relative importance of neutral and beneficial mutations. In the words of Motoo Kimura, one of neutralism’s principal proponents, “. . . random fixation of selectively neutral or slightly deleterious mutations occur far more frequently in evolution than positive Darwinian selection of definitely advantageous mutants” [403, 536]. In contrast, selectionism posits that most mutations that attain high frequency or become fixed in a population would be beneficial, or be linked to abundantly occurring beneficial mutations. (An allele’s frequency is the number of copies of the allele in the population, divided by the number of individuals in haploid populations, or divided by twice the number of individuals in diploid populations. Fixation means that an allele attains a frequency of one, that is, it replaces all other alleles.) Strict selectionists, such as Ernst Mayr, dismiss the importance of neutral evolutionary change altogether [502, pp.204–214]. Neutralism and selectionism in this narrower sense originated early in the twentieth century [Ch.1 of ref. 402], but a debate about them only gained momentum in the 1960s. At that time, the
first systematic observations on enzyme polymorphisms indicated that many wild populations contain great amounts of genetic variation [308, 453]. Neutralists proposed that most of this variation was caused by neutral mutations, whereas selectionists attributed it to beneficial mutations. The narrow and the broad usages of neutralism and selectionism are linked. For example, a strict selectionist view on molecular variation would also tend to favor selectionism with respect to evolutionary innovation. This is because mutations that dominate populations would tend to be the mutations that produce most evolutionary innovation. In recent years, the narrow-sense neutralist–selectionist debate has abated, for reasons I will discuss below, but the broader tension remains. After a brief introduction to concepts central to understanding the evolution of neutral mutations, I will first discuss experimental data that clearly supports the selectionist perspective on molecular evolution. I will then juxtapose this data with evidence that neutral mutations are critical to evolutionary innovation. Finally, I will suggest how to reconcile these lines of evidence into a synthetic perspective. Three predictions emerge from this synthesis, and I will discuss supporting evidence for them.
Evolutionary dynamics of neutral mutations

As mentioned above, a neutral allele is a genetic variant that does not affect an organism’s fitness. Such an allele’s frequency p in a population is influenced by genetic drift, a force of random evolutionary change that is strongest in small populations [402]. In a population of constant, finite size, genetic drift causes this frequency to fluctuate from generation to generation, because alleles get sampled from the previous, parental generation to form the next, offspring generation. In haploid organisms, the variance in the amount of change in allele frequency from generation to generation is given by V=p(1–p)/N [310]. The quantity N here is the effective population size, which reflects how many individuals actually contribute alleles to the next generation [310]. This, and all mathematical expressions below, hold for haploid populations, but they apply to diploid populations if one replaces every occurrence of N with 2N. The above expression for V shows that in small populations, allele frequencies fluctuate more
than in large populations. Over time, these fluctuations will cause an allele to become either extinct (frequency p=0) or fixed (p=1). An allele newly arisen through mutation has a probability of 1/N to go to fixation. If it goes to fixation, it will take on average 2N generations to do so [310]. Over time, genetic drift will thus reduce genetic variation in a population. Mutations, on the other hand, continually introduce new genetic variation. Genetic drift and mutations are thus opposing evolutionary forces. They will reach a balance over time. This mutation–drift balance is influenced by the rate μ at which neutral mutations occur per generation in a gene or in any stretch of DNA. A population will most of the time be monomorphic (it will contain only one allele) if the product of population size N and mutation rate μ is much smaller than one (Nμ<<1). Conversely, it will be polymorphic most of the time if Nμ>>1. These and many other predictions are made by the neutral theory of molecular evolution, a widely accepted body of work about the effects of genetic drift on populations [402]. Alleles that are not neutral have a selection coefficient s that indicates by how much their carrier’s fitness differs from a reference, non-mutant genotype. The fate of mutations whose selection coefficient s is much smaller than 1/(2N) is determined by drift rather than by selection, because generation-to-generation random allele frequency fluctuations are stronger than the influence of selection. Such mutations are also called effectively neutral [401]. However, even mutations whose selection coefficient is greater than 1/(2N) are influenced by drift. Specifically, weakly deleterious mutations can go to fixation, whereas weakly beneficial mutations can be lost, all through the influence of drift [227, 310, 676]. These considerations show that population size N is of central importance for the “visibility” of alleles to selection, and thus for their fate.
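The drift dynamics described above can be illustrated with a small simulation. The sketch below assumes a haploid Wright–Fisher sampling scheme, which is a standard idealization but is not named in the text; all function names are hypothetical. The `effectively_neutral` helper turns "much smaller than 1/(2N)" into a concrete cut-off of one tenth of 1/(2N), which is an arbitrary illustrative choice.

```python
import random


def drift_variance(p: float, N: int) -> float:
    """Variance of the one-generation change in allele frequency, V = p(1-p)/N."""
    return p * (1.0 - p) / N


def next_generation(count: int, N: int, rng: random.Random) -> int:
    """Draw N offspring alleles with replacement from the parental generation
    (haploid Wright-Fisher sampling, an assumed model)."""
    p = count / N
    return sum(1 for _ in range(N) if rng.random() < p)


def new_mutation_fixes(N: int, rng: random.Random) -> bool:
    """Follow a single new neutral allele until it is lost or fixed."""
    count = 1
    while 0 < count < N:
        count = next_generation(count, N, rng)
    return count == N


def fixation_fraction(N: int, trials: int, seed: int = 0) -> float:
    """Fraction of new neutral mutations that reach fixation; expected near 1/N."""
    rng = random.Random(seed)
    return sum(new_mutation_fixes(N, rng) for _ in range(trials)) / trials


def effectively_neutral(s: float, N: int, margin: float = 0.1) -> bool:
    """Illustrative rule of thumb: |s| is 'much smaller' than 1/(2N) if it is
    below margin/(2N). The default margin of 0.1 is an arbitrary assumption."""
    return abs(s) < margin / (2 * N)


if __name__ == "__main__":
    N = 20
    print("V at p=0.5:", drift_variance(0.5, N))
    print("simulated fixation fraction:", fixation_fraction(N, trials=2000))
    print("expected 1/N:", 1 / N)
    # The same selection coefficient can be invisible or visible to selection,
    # depending on population size.
    print(effectively_neutral(1e-6, 10**4), effectively_neutral(1e-6, 10**9))
```

The last line shows why population size matters: a selection coefficient of one in a million falls below the cut-off in a population of ten thousand, but not in a population of one billion.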
Effective population sizes vary among species by more than five orders of magnitude, from typical values of $10^4$ for vertebrates to values of up to $10^9$ for prokaryotes. Effective population sizes generally decrease in larger and multicellular organisms [476]. Many alleles whose fate would be dominated by selection in prokaryotes would thus be evolving neutrally in vertebrates [226, 476]. The consequences may be far-reaching. For example, neutral evolution may
influence the size and complexity of genomes [476]. Genome complexity is much greater in higher eukaryotes, where drift is stronger, than in prokaryotes, where more mutations altering genome structure would be deleterious and get eliminated. Population size also affects the fates of the kinds of genotypes I study here, those of molecules, regulatory circuits, or metabolic networks. Whether the fitness of two such genotypes is indistinguishable depends on population size. Similarly, the number of members in a set of genotypes with indistinguishable fitness would expand or shrink in magnitude with changing population size. Again, the reason is that small fitness differences visible to selection in large populations become neutral in small populations. (Because the size of such neutral genotype sets is often astronomical, however, they may still be very large even when shrunk.) The organization of genes into chromosomes adds a further layer of complication. On the one hand, if a neutral mutation occurs physically close to a beneficial mutation, then the neutral mutation may be rapidly swept to a high frequency or to fixation, if recombination does not break up its association with the beneficial mutation. This phenomenon is also called “genetic draft” or “hitchhiking” [270, 271, 500]. Genomic regions where hitchhiking is frequent show reduced amounts of neutral genetic variation, similar to those caused by a reduced population size [270]. On the other hand, if a neutral mutation occurs close to a region where deleterious mutations segregate, the neutral mutation may be dragged to extinction along with the deleterious mutations. This phenomenon of “background selection” can affect polymorphisms and the time neutral alleles need to go to fixation [113]. Because recombination rates vary substantially among organisms and chromosomal regions, the impact of these phenomena on allelic variation may also vary [476, 478].
Determining neutrality

To measure directly whether an allele is truly neutral is difficult, if not impossible. First, fitness has multiple components. In microbes these include population-doubling time, viability under starvation, and sporulation efficiency. In higher organisms, sexual reproduction, as well as age-specific mortality and fertility complicate the situation. All of these components strongly depend on the environment. In any one organism, some fitness components may be unknown, whereas the influence of others on fitness may be missed, because they are manifest only in environments different from standard laboratory environments. A second problem is that experimental measurements of fitness components, such as microbial cell division rates, can resolve selection coefficients to a resolution of at most $s=10^{-3}$, but much smaller selection coefficients of $s<10^{-8}$ can be visible to natural selection in large enough populations [763]. Finally, even functional assays that measure the effect of a mutation on, say, the biochemical activity of a mutated protein, can be inconclusive. To be sure, any one assay may reveal that a mutation has no effect. However, many proteins (and other biological systems) have multiple, often completely unanticipated functions, such as both enzymatic and regulatory functions [112, 228, 367, 395, 494, 815, 840, 870]. In sum, these observations mean that direct measurements of an allele’s fitness contribution have limited value. They also imply that a complementary approach to estimate neutrality is necessary. This approach relies on comparative data on genotypic change that accumulates within and among populations on evolutionary time scales. It uses a simple yet fundamental feature of neutral mutations: neutral mutations that occur in a population and eventually go to fixation would arise at a clock-like and constant rate [402]. This rate depends only on the rate at which neutral mutations occur, and not on other factors, most notably population size. This feature has multiple consequences, which form the basis for different tests for neutrality. In the next section, I will discuss what these tests reveal about the incidence of neutral mutations in genes and genomes.

Recent genotypic data support a selectionist view on genetic variation

Recall that neutralism in the narrow, population genetic sense posits that most observed genetic variation is neutral variation. Even before genome-scale sequence data became available, evidence against this position existed [9, 270, 425, 506]. This evidence came from comparative studies of individual genes. It included variation in the clock-like rate of mutation fixation that is inexplicable by the neutral theory, and
evidence for selection acting even on synonymous (silent) changes in protein coding regions [9, 270, 425]. The structure of genotype networks may be able to explain some of these observations. For example, different proteins in the same genotype network vary in their number of neutral neighbors, and thus also in the incidence of neutral mutations, which can explain some of the observed variation around a clock-like rate of mutation fixation [47]. However, observed clock rate variation can be so dramatic that genotype network structure may not provide a sufficient explanation [270]. Genome-scale data from humans and several model organisms provides evidence on a different scale from that of individual genes. It is based on hundreds or thousands of individual genes, and on multi-megabase regions of multiple genomes. This kind of evidence makes an even stronger case against neutralism. I will now discuss some examples. One class of studies uses the McDonald–Kreitman test [506] to assess the incidence of positive selection among different genes. This test examines the ratio of non-synonymous mutations, which cause amino acid changes, to synonymous mutations, which cause no such changes. Specifically, it compares this ratio for alleles that are polymorphic within populations to the same ratio among populations; that is, for fixed allelic differences among populations. A significantly elevated ratio of fixed differences to polymorphic changes indicates that amino acid changes go to fixation at a rate not explicable by the neutral theory. Based on this test, multiple genome-scale studies show that at least 30 percent of amino acid changes in Drosophila go to fixation because they are beneficial, a percentage inconsistent with the neutralist position [26, 53, 61, 230, 675, 695, 712]. Similarly high numbers hold for non-coding regions [25, 413], and for many genes in other organisms, such as E. coli [114]. 
Because the McDonald–Kreitman test is statistically conservative [10], the actual incidence of adaptive mutations may be higher. A second line of evidence comes from the relationship between the mean number of polymorphic differences between alleles within a species, commonly denoted as π, and the number f of fixed differences between sequences in two species. For
neutral mutations, one would expect that more polymorphism within a species would translate into more fixed differences between species—π and f should show a positive association. The reason is that, according to the neutral theory, both π and f are linearly proportional to the rate at which neutral mutations arise. In contrast to this neutralist prediction, recent genome-scale data show that π and f are negatively associated [53, 302]. This negative association is consistent with a selectionist view [302]: alleles genetically linked to a beneficial mutation that sweeps through a population will “hitchhike” to fixation with this mutation, because recombination cannot decouple them rapidly enough from this mutation during a selective sweep. Thus, genomic regions in which such selective sweeps are frequent should show decreased polymorphisms (low π). At the same time, such abundant adaptive mutations should increase the divergence f of fixed differences in a genomic region, because a rapid succession of allele substitutions driven by beneficial mutations will increase f. The net result is that a high number of fixed differences f would be associated with lower amounts of polymorphism π. This pattern is observed in genome-scale data, and stands in contrast to the neutralist prediction. A third pertinent pattern of molecular evolution is that genomic regions with higher recombination rate contain larger amounts of nucleotide polymorphisms, a pattern that would not occur if the vast majority of mutations were neutral [27, 158, 321, 322, 400, 446, 544, 545, 656, 708, 725, 743, 752, 762]. A minority of these observations can be explained if recombination itself has mutagenic effects [53, 321, 322, 446]. However, the bulk of the data argues for a selectionist explanation: in regions with high recombination rate, combinations between beneficial alleles and alleles that hitchhike with them to fixation are disrupted rapidly [302]. 
Selective sweeps thus affect overall variation in these regions less, and they deplete these regions less of genetic variation. Thus, in regions with high recombination rate, more polymorphisms should prevail, as observed. Taken together, these independent lines of evidence suggest that, on a genomic scale, many mutations that become fixed are not neutral. Genome-scale sequence data is the gold standard of comparative genotypic data, the most comprehensive kind of
genotypic data that can speak to the neutralism–selectionism tension. It implies that neutralism in the strict and narrow sense is not tenable.
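The two quantities compared above can be made concrete with a minimal sketch. It computes π as the mean number of pairwise differences among alleles sampled within one species, and f as the number of sites at which two species are each monomorphic but differ from one another. The toy alignments and the function names are hypothetical and were invented for this illustration; real analyses use far larger samples and correct for multiple substitutions at a site.

```python
from itertools import combinations

def pairwise_differences(a, b):
    """Count positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def nucleotide_diversity(alleles):
    """pi: mean number of differences over all pairs of sampled alleles."""
    pairs = list(combinations(alleles, 2))
    return sum(pairwise_differences(a, b) for a, b in pairs) / len(pairs)

def fixed_differences(species1, species2):
    """f: sites where each species is monomorphic but the two species differ."""
    count = 0
    # zip(*alignment) iterates over alignment columns (sites).
    for column1, column2 in zip(zip(*species1), zip(*species2)):
        states1, states2 = set(column1), set(column2)
        if len(states1) == 1 and len(states2) == 1 and states1 != states2:
            count += 1
    return count

# Hypothetical toy alignments: three alleles per species, eight sites each.
species_a = ["ACGTACGT", "ACGTACGA", "ACGAACGT"]
species_b = ["TCGTACCT", "TCGTACCT", "TCGTGCCT"]

pi_a = nucleotide_diversity(species_a)
f_ab = fixed_differences(species_a, species_b)
print(f"pi = {pi_a:.3f}, f = {f_ab}")
```

Computing both quantities per genomic region, and then correlating them across regions, is the kind of comparison on which the negative association described above rests.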
Molecular phenotype data support a neutralist view on evolutionary innovation
The evidence I have just highlighted was based on genotypes. I will now turn to analyses that also account for molecular phenotypes. They tell a different story: change that affects phenotype little or that is neutral may be important for evolutionary innovation. (These analyses also speak to a central tenet of this book: we cannot understand innovation without studying phenotypes in all their complexity.) Chapters 2 through 4 examined genotypic change that leaves phenotypes unchanged. Such change is responsible for the existence of genotype networks, which in turn facilitate innovation. There, I focused on specific molecular phenotypes, metabolic phenotypes, and gene activity phenotypes, but not fitness. I did so for two reasons. First, fitness is a property that integrates multiple aspects of phenotype, reducing the complexity of phenotypes to a simple scalar value. Second, little of what I say here requires that the genotypic changes I consider are exactly fitness-neutral. For example, many weakly deleterious changes can and do go to fixation. This holds especially if they are followed by changes that compensate for their deleterious effects. Such compensatory mutations are highly abundant in many systems [67, 227, 350, 393, 416, 428, 509, 596, 619, 676, 732, 845, 850]. Thus, whether the genotypes I consider have exactly identical fitness is not central. What is central, however, is the existence of genotype networks. As I argued in Chapter 5, innovation would be much harder without them. Thus, I could now revisit much of Chapters 2 through 4 to support the claim of this section. Instead, I will discuss a few independent examples, most of which I have not mentioned earlier. These examples revolve around phenotypes that are robust to mutations. 
Robust phenotypes are of special interest here, because the more robust a phenotype is, the greater the proportion of mutations that have weak or no effects on it. Several recent laboratory studies involving mutations in multiple enzymes hint at the importance of
neutral change to evolve new functions [8, 20, 66, 68, 779]. One series of experiments studied the evolution of new enzymatic functions in cytochrome P450, a member of an enzyme superfamily with a wide range of enzymatic activities. These experiments relied on error-prone polymerase chain reaction to introduce multiple mutations into different variants of this enzyme [68]. The variants differed in their thermodynamic stability, and in their robustness to mutations [68, 69]. In a more robust variant, a greater number of mutations would have weak or no effects than in a less robust variant. Strikingly, the stable and more robust variants of cytochrome P450 more readily evolved the ability to hydroxylate new substrates, such as the anti-inflammatory compound naproxen. A different line of evidence comes from laboratory evolution studies of the protein serum paraoxonase [8, 20]. This enzyme is primarily a lactonase, an enzyme that cleaves cyclic esters. In addition, it can also catalyze reactions involving a variety of other substrates, including aryl esters and organophosphates, albeit with lower activity [398]. Surprisingly, many mutations that increase these side activities dramatically (10¹–10⁶-fold) have no effect on the biochemically measured primary activity [8, 20]. Similar observations exist for other enzymes, such as a bacterial phosphotriesterase and carbonic anhydrase II [8]. In addition, some 300 paraoxonase variants that are neutral or close to neutral with respect to its primary activity are at least one mutation closer to enzymes with new phenotypes, such as thiolactonase and phosphotriesterase, than their parent enzyme [20]. Thus, neutral or nearly neutral sequence change can facilitate evolutionary adaptation. A third example regards the role of chaperones in protein evolution. Chaperones are proteins that assist other proteins in folding, and help maintain protein fold and function. Enzymes that evolve in E. 
coli cells under repeated cycles of mutations and laboratory selection to maintain their function can tolerate more amino acid changes when large amounts of a chaperone are present. Such enzymes also show better evolutionary adaptation to an altered or non-native enzymatic activity than enzymes in cells with a smaller amount of the chaperone. Again, high robustness—in this case induced
by a chaperone—is associated with superior evolutionary adaptation [779]. A fourth example regards two ribozymes that I already discussed in Chapter 4. These are the hepatitis delta virus self-cleaving ribozyme, and a synthetic self-ligating ribozyme [684]. The two ribozymes are unrelated in sequence, structure, and function. Nonetheless, they can be transformed into one another through a mutational walk in sequence space [684]. Importantly, on most of this walk the enzymatic activity of the mutated molecule does not change dramatically, with the exception of a few key mutations that transform one biochemical activity into the other. Here, a phenotype that can change little despite many mutations facilitates the evolution of new ribozyme functions. More anecdotal evidence supplements observations from such experimental studies. One such line of evidence comes from protein engineering, where mutagenesis creates proteins with new functions from existing protein scaffolds. Desirable in this process are scaffolds whose structural backbone is insensitive to mutations, and can thus be modified through the substitution of many different amino acids. One of the most successful such scaffolds is the zinc finger domain [272], where a zinc ion is bound to two conserved cysteine and histidine residues, an interaction that stabilizes the domain’s structure. This structure is strikingly robust to mutations. When one replaces all but seven of its 26 amino acids by alanine, the structure is left essentially intact [518]. This robustness accounts for the great versatility of this domain in protein engineering, where it can be used to design proteins with a great variety of DNA-binding activities and molecular functions [200]. It is perhaps no coincidence that the zinc finger domain is also the most abundant domain in the human proteome, where 4500 zinc finger domains are found in more than 500 proteins [806]. 
The preceding observations regard the effects of molecular changes on short, laboratory time-scales. However, they also apply to much larger evolutionary time scales. Specifically, proteins whose structure is robust to amino acid changes have accumulated greater functional diversity in their evolutionary history [238]. Such diversity is a record of past evolutionary innovations. Robust proteins, proteins where many mutations would have neutral or weak effects, thus evolve more
novel functions on long, evolutionary time scales. I will discuss this last example in much greater detail in Chapter 8. All of the above examples regarded individual molecules and their functions. Although the evolution and functional diversity of other systems are much less studied, we are beginning to see glimpses of the same principles there as well. Gene duplications are a case in point. They are unique mutations in their effect on robustness, because they systematically increase the mutational robustness of genes and proteins, and of the regulatory circuits that these genes form. They are also conspicuously associated with many innovations in macroscopic traits that such circuits help build. Chapter 9 will focus on this association between gene duplications and innovation. For now, I just note that the association hints at a positive role for neutral change in innovation on higher levels of organization. Taken together, these observations argue that change that is neutral or that has very weak phenotypic effects can facilitate phenotypic innovation. They argue against the view that neutral change is irrelevant for evolutionary innovation.
Synthesizing neutralism and selectionism
The preceding sections show that, on the one hand, the influence of selection pervades patterns of genotypic change. This influence is certainly not weaker than that of neutrality. On the other hand, studies on molecular phenotypes support the notion that robustness, and thus mutations with weak or no effects, facilitate evolutionary innovation. I will now show how the genotype network concept can help reconcile these observations.
Figure 7.1 Cycles of neutral evolution and positive selection caused by traversal of multiple genotype networks in adaptive evolution. Circles correspond to individual genotypes. A straight line links two genotypes if they are neighbors in genotype space. Examples would include two metabolic networks differing in a single enzymatic reaction, or two proteins differing in a single amino acid. The figure shows the evolutionary path (thick edges and arrows) of a single genotype through sequence space. The genotype evolves towards a hypothetical adaptive phenotype (not shown). In order to arrive at this phenotype, the genotype traverses four different genotype networks (filled circles shown in four different patterns and thick edges). Within each genotype network, evolution may be neutral, but at the transition between two genotype networks (arrows), positive selection occurs. After figure 3 in [829], used with permission from Nature Publishing Group.
Consider first a simple thought experiment. A single genotype undergoes random mutational change, a random walk on the genotype network of its phenotype (Figure 7.1). The genotype could be that of a molecule, a regulatory circuit, or a metabolic network; that is, a member of any one of the core system classes important for innovation. Assume that its phenotype is suboptimal in carrying out its function, whatever this function is. One or more better, optimal “target” phenotypes exist. As the genotype explores its genotype network, it carries out a blind evolutionary search for these phenotypes. In doing so, it may first take several neutral mutational steps on the same genotype network. (I will focus on neutral and not deleterious change, but deleterious change may actually be more likely. After a deleterious mutation, the random walk in this thought-experiment would revert
to its pre-mutation genotype.) After some number of these neutral steps, a phenotype-changing beneficial mutation may produce a new phenotype closer to the target phenotype (Figure 7.1). The random walker will thus step from the first to a second genotype network. From then on, the cycle repeats. Some number of neutral mutations, an exploration of the current neutral network, would be followed by a mutation that “discovers” a new phenotype closer to the target. In a more realistic scenario, an entire population, instead of a single genotype, explores genotype space. This leads to similar evolutionary dynamics. The population starts out on one genotype network. It explores this network until one of its members uncovers a phenotype (genotype network) closer to the target, through a beneficial mutation that then sweeps to fixation in the population. Earlier mutations that paved the way for this sweep would hitchhike to fixation. After this sweep, the descendants of the successful mutant explore the new genotype network until one of them finds a new, better phenotype, and so on. In this context, strong selectionism would imply that no neutral mutations would occur between phenotypic changes. Every single change would be either deleterious (and hence eliminated), or it would discover a new genotype network closer to the target phenotype. At the other extreme, strong neutralism would imply that many mutations occur between the “discovery” of subsequent genotype networks. During this time (the majority of the time) populations would evolve neutrally. The truth lies somewhere in between, and the mode of evolution may even change during a single adaptive process. For example, in evolutionary adaptation towards an optimal phenotype, a population may need to traverse hundreds of genotype networks, each different, each slightly better than the previous [830]. Towards the beginning of this adaptive process, improvement is often easy. 
The genotype networks of improved phenotypes are quickly found. The population may thus spend little time on any one such network, and multiple mutations may sweep to fixation in quick succession—a selectionist mode of evolution. As the population approaches an optimal phenotype more and more closely, further improvements may become
more difficult, the “discovery” of the next genotype network takes longer, and evolution occurs in the neutralist mode [6, 214, 251, 252, 259, 411, 686, 742]. We will encounter the last kind of pattern later in this chapter, and in Chapter 13 on evolutionary constraints.
Let us now focus on the last mutation in a sequence of neutral steps, and on the beneficial mutation that follows it. Importantly, the phenotypic effect of this beneficial mutation may depend on the mutation(s) preceding it [249]. I will illustrate this notion with RNA secondary structure phenotypes, because they are especially easy to visualize. Figure 7.2 shows an RNA phenotype where a first mutation (C→U at position 30) in a sequence leaves the minimum free-energy structure of the RNA molecule unaffected. A second mutation (C→U at position 39) then changes this secondary structure. The first mutation is neutral, but the second mutation is non-neutral, e.g., it might be beneficial. What if these two mutations had occurred in the reverse order (position 39 followed by 30)? The C→U mutation at position 39, in and by itself, is neutral (Figure 7.2). However, if the C→U mutation at position 30 follows this change, the same phenotypic change results as with the previous mutation order. In other words, if the sequence of mutations had been reversed, we would now call the mutation at position 39 neutral, as opposed to beneficial. Similarly, the mutation at position 30 would now be beneficial, where it was previously neutral. The situation would be even more complicated if we considered additional mutations that the molecule experienced earlier. (Similar observations hold for pairs of neutral and deleterious mutations, or for deleterious and beneficial mutations.) This simple example serves to make an important point: a mutation’s effect exists only in the context of the mutations preceding it.
Figure 7.2 The neutrality of mutations may depend on the order in which they occur. The figure focuses on two mutations in an RNA sequence that is shown together with its minimum free-energy secondary structure [332]. These mutations are C to U transitions at positions 30 (mutation 1) and 39 (mutation 2). By itself, mutation 1 is neutral, as is mutation 2. However, when mutation 1 is followed by mutation 2, or vice versa, a changed secondary structure results. After figure 4 in [829], used with permission from Nature Publishing Group.
The other principal system classes from earlier chapters lend themselves to similar examples. For instance, the ability to feed on a new carbon source may require two (or more) new chemical reactions that transform this carbon source into a metabolite that can be utilized further. Each of these reactions, when added individually to a network, might have little effect. Only both of them together would result in innovation. Whichever reaction gets added first is neutral, and whichever reaction gets added second is beneficial, analogous to the RNA example above. Given these observations, one could argue that it is not sensible to speak of neutral, deleterious, or beneficial mutations at all, if a mutation’s effect depends on the genotype’s evolutionary history. But not so fast. 
The notion of a deleterious mutation is clearly necessary in characterizing the causes of genetic diseases, as is the notion of a beneficial mutation to characterize evolutionary adaptations. However, we need to acknowledge a key limitation of these notions. A mutation has an effect at the time at which it arises, and this effect may change over time: a mutational change that was once neutral may later become beneficial (or deleterious), depending on other genetic changes. This view
stands in stark contrast to how the bulk of population genetics represents genetic change. There, with notable exceptions [e.g. 270], alleles are labeled as unchangingly deleterious, neutral, or beneficial. In sum, the perspective encapsulated in Figure 7.1 suggests that both neutral and beneficial mutations are important for evolutionary adaptation. Neutral mutations prepare the ground for later beneficial mutations, and are just as necessary as the beneficial mutations themselves. Whether this perspective is useful depends on whether it can make qualitatively correct predictions about the evolutionary dynamics leading to adaptation and innovation. I will next turn to three classes of such predictions, and the evidence supporting them.
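The thought experiment of Figure 7.1 can be sketched as a small simulation. The genotype-phenotype map used here is a deliberately crude, hypothetical toy and not a model of any real system: a genotype is a string of bits, and its phenotype improves by one unit for every three 1-bits it carries, so that many genotypes share each phenotype. As in the thought experiment, a deleterious step is reverted, a step that leaves the phenotype unchanged counts as neutral, and a step onto a better phenotype counts as beneficial. The names `BLOCK`, `phenotype`, and `adaptive_walk` are assumptions made for this illustration.

```python
import random

BLOCK = 3  # hypothetical: three 1-bits are needed per phenotypic improvement

def phenotype(genotype):
    """Toy genotype-phenotype map in which many genotypes share a phenotype."""
    return sum(genotype) // BLOCK

def adaptive_walk(length=12, seed=1, max_steps=100_000):
    """Random walk of a single genotype; deleterious steps are reverted."""
    rng = random.Random(seed)
    genotype = [0] * length
    target = length // BLOCK          # best phenotype this toy map allows
    history = []
    for _ in range(max_steps):
        if phenotype(genotype) == target:
            break
        site = rng.randrange(length)  # mutate one randomly chosen position
        mutant = genotype.copy()
        mutant[site] = 1 - mutant[site]
        old, new = phenotype(genotype), phenotype(mutant)
        if new < old:
            history.append("deleterious")  # walker stays on its genotype
        else:
            history.append("beneficial" if new > old else "neutral")
            genotype = mutant
    return genotype, history

final_genotype, history = adaptive_walk()
# One letter per attempted mutation: n(eutral), b(eneficial), d(eleterious).
print("".join(step[0] for step in history))
```

In the printed record, each beneficial step is preceded by a run of neutral (and reverted deleterious) steps. In this toy map no beneficial step is possible without the neutral steps before it, which is the sense in which neutral change prepares the ground for adaptive change.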
Prediction 1: Boom and bust cycles of diversity
A first class of prediction is that evolutionary change should often occur in cycles of neutral diversity expansion (“boom”), and selective diversity contraction (“bust”). The boom occurs as a population explores a genotype network and diversifies on it, and the bust occurs after one individual stumbles upon a new and superior phenotype, upon which its inferior ancestors go extinct. RNA molecules evolving towards a target structure demonstrate this kind of dynamics in computational work [251]. However, such dynamics have also been observed in sufficiently well-sampled, evolving populations in the wild. A case in point is the evolution of antigenic properties in the human influenza virus. Here, hemagglutinin, a viral surface glycoprotein and key viral antigen, shows punctuated and episodic evolution in its antigenic properties: episodes of small genetic changes with large effect on the antigenic phenotype alternate with periods of time where genetic variation accumulates with little phenotypic change. A genotype network model can best explain these features of hemagglutinin evolution [412, 709]. DNA sequences sampled from a population that is subject to such episodic evolution over time show a characteristic ladder-like phylogeny. Figure 7.3 shows an example, a phylogenetic tree derived from different influenza hemagglutinin isolates over a period of 12 years [128, 244]. Some clades of the tree are characterized by multiple, short branches
corresponding to the accumulation of neutral diversity [128]. Periodically, this diversity is dramatically reduced. Only one individual within a clade gives rise to a new, rapidly branching clade. Four such episodes are highlighted with thick black lines in the tree of Figure 7.3. This kind of pattern is not a peculiarity of influenza. It also appears in noroviruses, an important cause of viral gastroenteritis (“stomach flu”), in dengue virus, and in the human immunodeficiency virus (HIV) within hosts [5, 128, 462, 693, 703, 872]. To identify such a pattern requires phylogenetic information at high time resolution, which is available for viruses but for few other systems. To the extent that such information becomes available, we will be able to determine how general this pattern is.
In general, the closer an evolving population approaches an optimal molecular phenotype, the more difficult it often becomes—the longer the diversity expansion phase of a cycle—to find further phenotypic improvements [251, 799, 800]. Laboratory evolution provides anecdotal evidence that this holds not only for molecular phenotypes, but also for more complex phenotypes such as cell size [214]. Observations like these argue against a strict selectionist view, where no neutral mutations and no phenotypic stasis would occur between phenotypic changes.
Prediction 2: Pervasive epistasis and compensatory mutations
A second class of prediction is that consecutive mutations in molecules should show interdependent effects. The relationship
Figure 7.3 A ladder-like phylogeny associated with episodic evolution. The figure shows a phylogeny of the influenza A virus subtype H3N2 hemagglutinin (HA1) gene. It is based on 254 nucleotide sequences isolated between 1984 and 1996 [244]. The thick, black branches indicate four (among several) founders of new clades. From figure 1d in [412], used with permission from Nature Publishing Group.
between the two mutations in Figure 7.2 is a special case of epistasis, a phenomenon that geneticists have known for many decades. Qualitatively, epistasis is the dependency of a mutation’s effects on mutations in other parts of a gene or genome [102, 142]. Most genetic analyses of epistasis have focused on mutations that are induced experimentally or that are already present in a population. Until recently, the temporal aspect of epistasis I highlighted above, where past mutations influence the effects of future mutations, has received less attention. However, this is beginning to change [175, 393, 416, 428, 732, 845, 846]. A number of recent studies observe that mutational effects frequently change over time. For instance, a recent computational study demonstrated that the effects of mutations on the kind of molecular structure shown in Figure 7.2 can influence the evolutionary dynamics of RNA molecules [148]. In this study, the authors introduced mutations with small deleterious effects on an RNA structure into an RNA molecule, and recorded the evolutionary trajectories of these changes in a finite population. Population genetic theory makes specific predictions about the probability that such mutations go to fixation through random genetic drift [402]. However, the incidence of fixation was significantly higher than predicted by the theory. Partly responsible were secondary, compensatory mutations that reversed the deleterious effects of the original mutations, and that turned these mutations into beneficial mutations [148, 850]. Compensatory mutations, where a mutation reverses or neutralizes the effect of another mutation, are generally a frequent phenomenon [350, 393, 416, 428, 596, 619, 732, 845, 850]. They are abundant in organisms as different as fruit flies and humans [416, 428]. For example, 10 percent of strongly deleterious (disease-causing) mutations in human proteins occur as wild-type variants in other mammals [416]. 
Similarly, as many as 50 pathogenic mutations in human tRNA genes occur in normal tRNAs of other mammals [393]. In RNA molecules in particular, compensatory mutations are so frequent that they can be used to map nucleotide interactions [255, 365, 543, 587, 595, 859]. The kind of epistasis that compensatory mutations show is also called sign epistasis, because the fitness effects of successive mutations have different signs (positive, neutral, or negative) [845, 846]. Detailed analyses of molecules can demonstrate the epistatic interactions of specific mutations in molecules with important functions. Cases in point are the mineralocorticoid and the glucocorticoid receptors. In humans and other tetrapods, the mineralocorticoid receptor is activated by the hormone aldosterone, an osmoregulatory hormone. The glucocorticoid receptor is activated by cortisol, a stress-response hormone. The mineralocorticoid receptor originated via a gene duplication from a molecule ancestral to the glucocorticoid receptor. Two studies asked which mutations were responsible for the cortisol-specificity of the glucocorticoid receptor [88, 582]. To this end, their authors reconstructed the common ancestor of both receptors, a molecule that responds to both aldosterone and cortisol. They then identified mutations responsible for the receptor specialization, mutations that reduced sensitivity to aldosterone but retained sensitivity to cortisol. Two among several mutations, Ser106Pro and Leu111Gln, stood out. In and by itself, Ser106Pro dramatically reduced overall receptor sensitivity to both hormones. Leu111Gln alone did not affect sensitivity. However, Leu111Gln followed by Ser106Pro yields a receptor that is still sensitive to cortisol but 1000-fold less sensitive to aldosterone [88, 171]. This mutant combination is another example of the epistatic interactions highlighted above. Leu111Gln has little effect in and by itself, but in combination with another mutation it facilitates the evolution of a molecular specialization that has served tetrapods well for many millions of years. This mutation pair is not the only one with this property. Two other mutations exist in this receptor that affect receptor function little by themselves. However, jointly with three further mutations they confer its present specificity to the glucocorticoid receptor [88, 582]. 
Studies like these can also help elucidate the mechanisms behind such epistasis. For example, the above Leu111Gln change introduces a new amino acid side-chain into the receptor. However, this new side-chain matters only after the Ser106Pro change, which repositions the new side-chain. The result is that the side-chain at position 111 can now
form a hydrogen bond with a hydroxyl group of cortisol. (Aldosterone lacks this hydroxyl group.) The result is a cortisol-specific interaction between protein and ligand. Another general epistatic theme in protein evolution is this: mutations that introduce a novel enzymatic activity in a protein often also destabilize protein structure. Thus, mutations that precede or follow these function-changing mutations, and that stabilize the enzyme (but themselves do not introduce a novel activity) can be important for functional innovation [58, 59, 176, 778]. Although the vast majority of information available to study molecular evolution comes from molecules, it is worth pointing out that similar epistatic phenomena also exist on higher levels of organization. Metabolic networks are a case in point. Here, many individual mutations that eliminate enzyme-coding genes affect the rate of cell growth and division very little [208, 254, 679, 690]. Nonetheless, pairs of mutations often have strong epistatic effects. These effects are not necessarily detrimental, because multiple deletions of metabolic genes can even lead to increased growth [536].
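The order dependence discussed in this section can be written down in a few lines. The sketch below labels each of two mutations as neutral, beneficial, or deleterious relative to the genetic background in which it arises, and also computes a simple additive measure of pairwise epistasis, the fitness of the double mutant minus the sum of the two single-mutant effects. The fitness values are hypothetical numbers chosen to mimic the pattern of Figure 7.2; they are not measurements, and the names `m30` and `m39` merely stand in for the two mutations of that figure.

```python
# Hypothetical fitness values for the four genotypes formed by two mutations.
# Keys are the sets of mutations that a genotype carries.
fitness = {
    frozenset(): 1.0,
    frozenset({"m30"}): 1.0,
    frozenset({"m39"}): 1.0,
    frozenset({"m30", "m39"}): 1.2,
}

def classify(before, after, tolerance=1e-9):
    """Label a mutational step by the fitness change it causes."""
    delta = fitness[after] - fitness[before]
    if abs(delta) <= tolerance:
        return "neutral"
    return "beneficial" if delta > 0 else "deleterious"

def label_path(order):
    """Label each mutation relative to the background in which it arises."""
    background = frozenset()
    labels = {}
    for mutation in order:
        derived = background | {mutation}
        labels[mutation] = classify(background, derived)
        background = derived
    return labels

def additive_epistasis(a, b):
    """Double-mutant fitness minus the sum of the single-mutant effects."""
    return (fitness[frozenset({a, b})] - fitness[frozenset({a})]
            - fitness[frozenset({b})] + fitness[frozenset()])

print(label_path(("m30", "m39")))
print(label_path(("m39", "m30")))
print(additive_epistasis("m30", "m39"))
```

With these values, whichever mutation occurs first is labeled neutral and whichever occurs second is labeled beneficial, and the epistasis term is positive. The labels are thus properties of a mutation in its genetic background, not of the mutation alone.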
Prediction 3: Shifting foci of selection
A third class of prediction is that at different points in time, different parts of a system (molecule, regulatory circuit, or metabolic network) will be subject to positive selection triggered by beneficial mutations. This prediction emerges naturally from the fact that each step in the evolutionary trajectory of Figure 7.1 changes a different system part, be it a nucleotide, an amino acid, a regulatory interaction, or a metabolic reaction. Only some such changes, those leading to new phenotypes, will be subject to positive selection. Put differently, system parts that are subject to positive selection at some time may evolve neutrally at other times. For many years, phylogenetic methods were neither powerful enough, nor were available molecular data sufficiently detailed to test such predictions. This has changed, most notably with the advent of maximum-likelihood based phylogenetic methods. Such methods allow detection of positive selection that varies among individual branches of a phylogenetic tree and among individual codons in a protein [278, 297, 875, 876, 887].
Several molecular evolution studies find patterns consistent with the prediction that the focus of selection shifts between different system parts. For example, in the influenza hemagglutinin antigen, different residues of the antigen are associated with different antigenic properties. Changes in these residues can change virulence patterns and thus viral fitness [412, 709]. However, most amino acid sites associated with changes in antigenic properties, and thus likely subject to positive selection in these lineages, evolve neutrally in other viral lineages [412, 551]. Similar patterns can be observed in the evolution of another well-studied viral antigen: the envelope glycoprotein env of the human immunodeficiency virus (HIV) [297, 693]. A maximum-likelihood method that allows individual codons to shift between a state of neutral evolution and positive selection detects that such shifts are frequent [297]. As an example, consider the phylogenetic tree of Figure 7.4 [829]. It is based on env coding sequences from a single HIV positive individual, in which env sequence evolution was monitored for more than 10 years [693]. In the tree, I highlighted different codons that are under positive selection at different times with different shades of gray. It is clear that even from the point of view of individual codons, selection is episodic on this tree. It occurs on some branches of the tree but not on others. In addition, codons that are under selection along some branches evolve neutrally along others. In some proteins, such as cytochrome b, as many as 95 percent of amino acid sites may be subject to selection pressure that varies over time [468, 609]. One might think that these patterns are peculiarities of special kinds of genes. A comprehensive study analyzing the evolution of thousands of genes shows otherwise. This study identified genes subject to positive selection in six completely sequenced mammalian genomes, including that of humans [421]. 
The study identified 544 genes subject to positive selection in at least one branch of the phylogenetic tree of these species. The vast majority of these genes switch between episodes of positive selection, and episodes where positive selection is absent. Specifically, 91 percent of genes switch the mode of evolution at least once, and 53 percent more than once. I note that a phylogenetic tree of merely six
Figure 7.4 Positive selection acts episodically on different codons in the envelope protein of the acquired immunodeficiency syndrome (AIDS) virus HIV. The phylogenetic tree shown here is based on coding regions of the envelope (env) glycoprotein (gp120) of a single patient isolated at different time points after the patient became HIV-positive. The molecular structure of gp120 is shown to the left of the tree’s center [430]. The data is taken from patient 1 in [693]. The numbers on each terminal branch of the tree indicate the time in months after the patient tested positive for HIV exposure. Bars drawn with different shades of gray next to each terminal branch reflect codons that are under positive selection on different branches of this tree. Specifically, data are shown for six codons, whose numbers are given to the right of the grayscale legend (small rectangles). For instance, codon 291 is under positive selection in a sequence isolated at 51 months (at ca. 11 o’clock on the tree), but in no other sequence. Codon number 300 is under positive selection in three sequences isolated at 61 and 66 months (3 o’clock), but in no other sequences. Codon numbers reflect coordinates relative to the beginning of the gp120 sequence from [430]. I used a maximum-likelihood codon-based approach to detect episodic selection, and displayed the tree using iTOL [297, 447]. Branch lengths do not reflect the amount of sequence change along each branch. After figure 5 in [829], used with permission from Nature Publishing Group.
species is small. Adding species to such an analysis would increase the number of tree branches, and would provide an opportunity to see an even greater incidence of switching events.
Molecular exaptations
A concept intimately linked to the neutralist perspective is that of an exaptation, coined by the late Stephen J. Gould specifically for complex innovations in whole organisms [281, 282]. Exaptations are traits that may have no use, or a use unrelated to a given trait, when they first arise, but that facilitate later innovation in this trait [281, 282]. This concept can be transferred from macroscopic traits to the molecular systems I discuss. Because the relationship between genotype and phenotype is transparent for molecular traits, such traits can help us think clearly about the meaning and importance of exaptation. The molecules that arise through neutral mutations of the kind I discussed above can be viewed as exaptations: when they first arise, they do not affect the phenotype, but they prepare the ground for phenotypic change. This clearly holds for the first of the two changes in RNA from Figure 7.2, which are essential for later phenotypic change. It also holds for the steroid hormone receptor I discussed, where a Leu111Gln change introduces a new amino acid side-chain at position 111. This new side-chain affects the molecule’s function only after it is repositioned by the subsequent mutation Ser106Pro, which then allows a cortisol-specific interaction of the side-chain at position 111. Analogous examples in regulatory circuits and metabolic networks are not difficult to come by. For example, many additions of individual reactions to a metabolic network will not change a metabolic phenotype until a second added reaction connects the first reaction to an already existing metabolic pathway. Similarly, because many changes of regulatory interactions in transcriptional regulation circuits do not affect the circuit phenotype (Chapter 3), neutral regulatory changes that precede later non-neutral changes, and on which these non-neutral changes depend, are likely to be frequent in such circuits. In sum, neutral changes that pave the way for future phenotypic change may be ubiquitous in adaptive evolution. From this point of view, exaptations are essential ingredients for evolutionary adaptation and innovation.
Summary
On the one hand, genomic data suggest that positively selected beneficial mutations are abundant in the evolution of genes and genomes. On the other hand, experimental and comparative work on molecular phenotypes shows that neutral change facilitates the origin of new molecular functions. The perspective I discuss here synthesizes these two lines of observations [829]. In this synthesis, neutral mutations on a genotype network are crucial to prepare the ground for later, beneficial mutations that lead to evolutionary adaptation or to an evolutionary innovation. Whether a mutation is neutral depends on the context in which it occurred, and on the mutations that preceded it. Three lines of molecular evolution evidence support this perspective. These are patterns of episodic diversification, pervasive epistatic interactions among mutations, and shifting foci of positive selection. The evolutionary scenario I discussed can explain how neutral mutations may be crucial for functional innovation, yet need not remain neutral forever. Such neutral mutations can be viewed as molecular exaptations. These observations make clear that neutralism and selectionism in a broad sense are not incompatible. Instead, they capture complementary and equally important aspects of biological reality.
CHAPTER 8
The role of robustness for innovation
A feature of a biological system is robust if it persists in the face of perturbations. Relevant features include a molecule’s structure or catalytic activity, a regulatory circuit’s expression phenotype, or a metabolic network’s ability to produce biomass. If the perturbations are genetic (mutations), then I will speak of mutational robustness. Alternatively, the perturbations can be non-genetic, for example, a changing environment. Non-genetic perturbations may be more frequent than mutations, but mutations arguably have more lasting and thus important effects: they alter a system permanently, and they are inherited from generation to generation. I will thus focus on them here. However, with possible exceptions [138, 522], most features robust to mutations are also robust to non-genetic change, and vice versa [22, 441, 496, 511, 749, 825]. I have already alluded to the role robustness plays for evolutionary innovation (Chapters 5 and 6). Here, I will explore this role in more detail. But why should robustness play any role to begin with? Because there is no innovation without phenotypic variation. And, by definition, mutations in a robust system produce little phenotypic variation in response to perturbation. From this perspective, one might think that robustness impedes innovation. Using observations from previous chapters, I will first briefly highlight (again) why this notion is flawed, and why, to the contrary, robustness permits innovation. In the second, longer part of the chapter, I will demonstrate an additional, less obvious role of robustness in influencing innovation. Whether this role is positive, facilitating innovation, or negative depends on the kind of phenotype considered, and on details of genotype space organization. I will also point out that a system where robustness facilitates innovation avoids a conflict that pervades many areas of life. It is the conflict between the
interests of the individual and those of a group. Avoidance of this conflict can further facilitate evolutionary innovation.
Robustness is essential for innovation
In Chapters 2 through 4, I examined the 1-mutant neighborhoods of individual genotypes G. These are genotypes that are only a single mutation away from G. Examples include proteins that differ from some reference protein G in one amino acid; regulatory circuits that differ from some circuit G in one regulatory interaction; and metabolic networks that differ from a reference network G in one metabolic reaction. Of particular importance here is the fraction ν of these neighbors that have the same phenotype as G (Chapter 6). It can be viewed as a measure of mutational robustness (Chapter 5). If ν=0, then no neighbor has the same phenotype, and robustness is minimal; if ν=1, then all neighbors have the same phenotype, and robustness is maximal. Chapter 6 showed that some degree of robustness (ν>0) is both necessary and sufficient for the existence of genotype networks. Such networks, in turn, are essential for evolutionary innovation, because they allow access to vast amounts of phenotypic variation while preserving existing phenotypes. In sum, mutational robustness (ν>0) ensures that an evolving system can explore many different phenotypes, which facilitates innovation. Where does the naive argument that robustness should impede innovation go wrong? To see this, consider the extreme case of a minimally robust genotype G (ν=0) for any one of the systems we examined earlier. The total number of its neighbors is (B–1)S. Here, S is again the size of the system (number of amino acids, regulatory interactions, metabolic reactions), and B is the number of elementary building blocks (B=20 for proteins, B=4 for RNA, etc.). If
all of these neighbors’ phenotypes are different from each other, then the neighborhood of our hypothetical genotype contains the maximally possible number (B–1)S of novel phenotypes. It certainly contains more novel phenotypes than the neighborhood of a genotype with greater robustness (ν>0). Put differently, a greater number of novel phenotypes are just one mutation away from the minimally robust genotype, and thus immediately accessible. This observation encapsulates the gist of the earlier argument: the lower a genotype’s robustness, the greater its phenotypic variability in response to mutations. The argument has two flaws. First, typically the vast majority of mutations with a new phenotype, and thus perhaps all (B–1)S neighbors, are deleterious [458, 477]. To see the second flaw, consider a genotype G like the one we just considered, but where exactly one neighbor G’ has the same phenotype (ν=1/[(B–1)S]). Without changing the phenotype, G then has access to the novel phenotypes in its own neighborhood and to those in the neighborhood of its neutral neighbor G’. There are of the order of 2(B–1)S–1 such variants, approximately twice as many as if ν=0. In addition, the neutral neighbor G’ may itself have neutral neighbors, through which even more novel phenotypes become accessible, and so on. Earlier, we saw that ν is typically much greater than the minimal value of 1/[(B–1)S] I just assumed (Chapters 2–4). This larger fraction of neutral neighbors renders accessible a number of phenotypic variants that is astronomically larger than (B–1)S; it thus increases the chances to find rare beneficial variants, and it facilitates preservation of the initial phenotype. Robustness may well reduce the number of new phenotypes in the immediate neighborhood of a genotype, but it also allows access to its neutral neighbors, their neutral neighbors, and so forth, thus vastly expanding accessible variation.
An essential condition for this positive role of robustness is that the number of phenotypes accessible by a point mutation must be much smaller than the total number of phenotypes [193]. This condition, however, is trivially fulfilled for any system of realistic complexity. In sum, the naive argument that robustness impedes phenotypic variability neglects this: robustness brings forth vast connected genotype networks, and renders an astronomical number of phenotypes accessible through them.
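The neighborhood counts in this argument can be checked by brute force on a small example. The following is a minimal sketch, assuming a hypothetical RNA genotype of length S=10 and arbitrarily declaring one of its neighbors to be the neutral neighbor G’; neither the sequence nor the choice of G’ comes from the text.

```python
def one_mutant_neighbors(genotype, alphabet):
    """Return all genotypes that differ from `genotype` at exactly one position."""
    result = []
    for i, letter in enumerate(genotype):
        for other in alphabet:
            if other != letter:
                result.append(genotype[:i] + other + genotype[i + 1:])
    return result


RNA = "ACGU"                 # B = 4 building blocks
G = "ACGUACGUAC"             # hypothetical genotype, S = 10
S, B = len(G), len(RNA)

neighbors_G = one_mutant_neighbors(G, RNA)
n_neighbors = len(neighbors_G)          # equals (B - 1) * S

# Assumption for illustration: exactly one neighbor, G_prime, is neutral.
G_prime = neighbors_G[0]
neighbors_G_prime = one_mutant_neighbors(G_prime, RNA)

# Genotypes reachable by one mutation from G or from G_prime,
# excluding the two genotypes that share the original phenotype.
reachable = (set(neighbors_G) | set(neighbors_G_prime)) - {G, G_prime}

print(n_neighbors, len(reachable))
```

In this toy case the exact count of reachable genotypes falls slightly below 2(B–1)S, because the neighborhoods of G and G’ overlap in the few genotypes that differ from both at the same position. The count is nevertheless close to twice the size of a single neighborhood, which is the point of the argument.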
Mutual reinforcement on different levels of organization
I will next briefly point out that the hierarchical organization of biological systems can reinforce the positive effect of robustness on innovation. In previous chapters, I have focused on different levels of biological organization and discussed them separately. However, molecules, regulatory circuits, and metabolic networks interact. Specifically, robustness on one level of organization can reinforce robustness on another level. Consider, for example, a regulatory change that reduces an enzyme’s expression, and thus its concentration and activity. This enzyme may be embedded in a metabolic network where its function is dispensable, because metabolites that this enzyme cannot process can be rerouted through other reactions. If so, then this regulatory change affects the protein’s activity, but not the organism’s metabolic phenotype. The same may hold for an amino acid change in the enzyme that creates or increases one activity but reduces another. Such changes are not uncommon [68, 135]. If this change can be readily buffered by a metabolic network, it will be neutral on the level of the metabolic phenotype. It is not difficult to see that interactions like these, which mutually reinforce robustness on different levels of organization, can further enhance the accessibility of novel phenotypes, while leaving an existing phenotype unchanged.
Lessons from artificial systems
In studying biological systems, we are trying to understand an existing relationship between genotype and phenotype. This relationship determines system properties such as robustness. One can, therefore, not easily manipulate robustness alone, to find out how it affects evolutionary innovation. To do so, one would have to change the very relationship between genotype and phenotype, the “map” between them. Such manipulation, however, is easy in artificial mappings from genotype to phenotype. Their details may have little bearing on biology, but they have one key merit: they can be designed with tunable robustness. They allow us to ask, everything else being equal, whether robustness facilitates evolutionary adaptation and innovation. I will next briefly discuss a few relevant examples from artificial systems.
One pertinent study used an artificial genotype-to-fitness map, where phenotypes are represented only through fitness values [554]. By design, the chosen map had tunable robustness, as represented by the number of genotypes with the same fitness. In this study, evolutionary searches encountered genotypes with high fitness more readily when robustness was high [554]. Another study focused on a different kind of map between genotype and phenotype. It examined two scenarios. In the first scenario, an evolving population can explore a genotype network neutrally to find a superior new phenotype. In the second scenario, the robustness implied by such neutrality is absent, and a population needs to cross a fitness “valley” of inferior genotypes to reach a new phenotype [799]. The neutrally evolving population found the new phenotypes faster by orders of magnitude. A third study focused on kinds of maps important in computer science, including cellular automata, a class of dynamical system with discrete state “phenotypes”; their “genotype” corresponds to simple logic rules that prescribe how the system’s state changes [205]. This study showed that evolutionary searches involving maps with greater robustness find novel phenotypes more easily [205]. A final example involves self-replicating computer programs whose “phenotype” consists of their ability to compute a set of logic functions. In long evolutionary searches, programs more robust to random changes in their computing instructions can be more effective in discovering new phenotypes [215]. These observations from artificial systems, where robustness can be tuned, support what we learn from biological systems. With the insight that robustness creates genotype networks, it is easy to understand observations like these from one common unifying framework.
Robust phenotypes may facilitate innovation
The argument I have presented thus far is qualitative and based on the very existence of genotype networks. Once we know about this existence, we can go one step further: We can ask about finer-grained aspects of genotype space organization, and how they affect a system’s ability to explore novel phenotypes. One such aspect regards the size of different genotype networks, and how this size
affects the ability to explore novel phenotypes quantitatively. As I discussed earlier (Chapters 2-4), the sizes of different genotype networks vary by many orders of magnitude. Moreover, genotypes in larger genotype networks have, on average, more neighbors with the same phenotype [124, 640, 830]. In addition, within any one genotype network, some genotypes have few neighbors, others have many neighbors with the same phenotype. I now discuss how these properties affect a system’s ability to explore novel phenotypes. The first part of my analysis focuses on molecules, because they have been studied in the most detail. In the second part, I show how observations on molecules translate to other system classes. In this part of my analysis I will compare properties of different genotype networks. This amounts to comparing properties of phenotypes, because each genotype network is associated with a different phenotype. It is thus useful to introduce a new concept: the robustness of a phenotype. Recall that the robustness of a genotype is the fraction ν of its neutral neighbors. By extension, the robustness of a phenotype is the average fraction ν of neutral neighbors for each of the genotypes with this phenotype. This robustness generally increases with the number of genotypes that adopt a phenotype, as Figure 8.1a shows. The data in the figure are based on RNA molecules sampled at random from genotype space, but the same holds for RNA molecules with known biological functions. For example, in a sample of 82 RNA molecules with known and diverse biological functions, the number of sequences folding into a structure, and their average robustness show a high Spearman’s rank correlation coefficient of 0.92 [378]. In other words, phenotypes with large genotype networks are also more robust. 
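The two definitions above, robustness of a genotype and robustness of a phenotype, can be made concrete with a toy genotype-phenotype map. The following is a minimal sketch, not the RNA analysis behind Figure 8.1. It assumes binary genotypes and a made-up phenotype (the ordered pattern of short and long runs of 1s) chosen only because it can be enumerated exhaustively; all function names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def phenotype(genotype):
    """Toy phenotype (an assumption): each run of 1s is classed as short (1) or long (2)."""
    return tuple(min(len(run), 2) for run in genotype.split("0") if run)


def neighbors(genotype):
    """All genotypes one point mutation (bit flip) away."""
    flip = {"0": "1", "1": "0"}
    return [genotype[:i] + flip[c] + genotype[i + 1:] for i, c in enumerate(genotype)]


def genotype_robustness(genotype):
    """Fraction nu of 1-mutant neighbors that keep the genotype's phenotype."""
    own = phenotype(genotype)
    nbrs = neighbors(genotype)
    return sum(phenotype(n) == own for n in nbrs) / len(nbrs)


def phenotype_robustness(length):
    """Map each phenotype to (genotype set size, mean nu over its genotypes)."""
    values = defaultdict(list)
    for bits in product("01", repeat=length):
        genotype = "".join(bits)
        values[phenotype(genotype)].append(genotype_robustness(genotype))
    return {p: (len(v), sum(v) / len(v)) for p, v in values.items()}


table = phenotype_robustness(10)
# List phenotypes from largest to smallest genotype set.
for p, (size, robustness) in sorted(table.items(), key=lambda item: -item[1][0]):
    print(p, size, round(robustness, 3))
```

Whether genotype set size and phenotype robustness are associated in this toy map is something to inspect in the printed table rather than to assume; the association reported in the text is an empirical observation about RNA structures.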
My qualitative observations from the previous sections showed that the mere existence of neutral networks makes many different phenotypes accessible without the need to change an existing phenotype. In a more quantitative approach, one could ask how many different phenotypes occur in the neighborhood of an entire genotype network. This neighborhood comprises all genotypes one mutant away from a genotype network, and thus genotypes immediately accessible from it. Such an
[Figure 8.1 graphic, four panels. (a) Genotype set size (fraction of genotype space, ×10⁻⁶) against phenotype robustness; Spearman’s s=0.64; n=2.5×10⁴; P<10⁻¹⁷. (b) Number of different phenotypes in 1-neighborhood against number of generations, for a more robust phenotype (10⁻³) and a less robust phenotype (10⁻⁶). (c) Number of different genotypes in population against number of generations, for the same two phenotypes. (d) Number of different structures in 1-neighborhood against genotype set size (fraction of genotype space size, ×10⁻⁵); Spearman’s s=0.27; n=4000; P<10⁻¹⁷.]
Figure 8.1 Robust phenotypes can access many novel phenotypes in their evolution. (a) Phenotypic robustness increases with genotype set size, the number of genotypes forming an RNA structure phenotype. Phenotype robustness (horizontal axis) is estimated as the average fraction ν of neutral neighbors for 100 inversely folded RNA sequences per minimum free energy RNA secondary structure. Genotype set size (vertical axis) is expressed as a fraction of genotype space size (4³⁰). The data are based on 2.5×10⁴ random RNA structure phenotypes of length S=30 nucleotides. Qualitatively similar associations exist for RNA molecules of different length S [830]. (b) The number of different novel phenotypes (vertical axis) accessible to an evolving population of 500 RNA molecules during 10 generations (horizontal axis) of mutation and selection that confine the population to a genotype network. Filled and open circles correspond to two phenotypes with high and low robustness, respectively. Their respective genotype set sizes are 10⁻³ and 10⁻⁶, expressed as a fraction of genotype space size. Data are based on 20 replicate evolving populations, each started from one of 20 different inversely folded sequences per phenotype. (c) As in (b), but the vertical axis shows the number of different RNA genotypes in the evolving populations. (d) Numbers of different phenotypes (vertical axis) accessible to populations that have evolved for 10 generations on the genotype network of structures whose genotype set size is shown on the horizontal axis. Data are based on 4000 different structures whose genotype network sizes (expressed as a fraction of genotype space) range from 3.3×10⁻⁵ to 1.7×10⁻³, and on one population evolution per structure, started with an inversely folded sequence [830]. The differences in accessible variation persist for greater numbers of generations than those shown here. Populations have 100 individuals.
The mutation rate in (b) and (d) is μ=1 random mutation per sequence and generation. Circles represent means, and error bars indicate one standard error of the mean. From [830].
analysis might reveal differences among genotype networks, because its outcome might depend on genotype network size and phenotype robustness. Such an analysis, however, faces one serious obstacle, and it has one serious limitation. First, genotype networks are so vast that it is typically impossible to examine their entire neighborhood. Although this obstacle can be overcome only in small genotype spaces, one can try to skirt it by exploring genotype network neighborhoods through random sampling of member genotypes. Observations based on this approach suggest that robust RNA phenotypes contain more phenotypic variants in the neighborhood of their genotype networks [830]. The serious limitation of this analysis is its biological relevance: all evolution occurs in finite and often small populations. Most such populations would take a prohibitive amount of time to explore any large genotype network and its neighborhood, even if one considers evolutionary time-scales. Thus the entire network neighborhood may matter less than the part of it that an evolving population can explore in a limited amount of time. These observations prompt the following, population-based approach to link phenotype robustness with the accessibility of novel phenotypes. Consider a population of identical genotypes with some phenotype P. The population evolves through repeated cycles (“generations”) of mutations in individual system parts (e.g., changes in individual nucleotides) and selection that confines the population to the genotype network of P. That is, no individual with a phenotype different from P survives into the next generation. Over multiple generations, this process will cause a population to spread through the genotype network of P, like a cloud of genotype particles that diffuse through the network. 
One can then examine the neighborhood of such a population; that is, all genotypes that are neighbors of individuals in the population, but that are not on the genotype network of P. Specifically, one can count the number of different phenotypes in this neighborhood. These are the phenotypes immediately accessible to the population, through a single mutation of some individual. How does this number of accessible phenotypes depend on genotype network size and thus on phenotype robustness?
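The population-based procedure described above can be sketched as a short simulation. This is a minimal illustration under stated assumptions, not the RNA model used for Figure 8.1: it uses a toy binary genotype-phenotype map (runs of 1s classed as short or long), one point mutation per individual and generation, resampling of survivors to a constant population size, and a fallback that keeps the parents if no mutant survives. All names and parameter values are hypothetical.

```python
import random


def phenotype(genotype):
    """Toy phenotype (an assumption): each run of 1s is classed as short (1) or long (2)."""
    return tuple(min(len(run), 2) for run in genotype.split("0") if run)


def neighbors(genotype):
    """All genotypes one point mutation (bit flip) away."""
    flip = {"0": "1", "1": "0"}
    return [genotype[:i] + flip[c] + genotype[i + 1:] for i, c in enumerate(genotype)]


def accessible_phenotypes(population, target):
    """Phenotypes other than `target` among the 1-mutant neighbors of the population."""
    found = set()
    for genotype in set(population):
        found.update(phenotype(n) for n in neighbors(genotype))
    found.discard(target)
    return found


def evolve(start, pop_size, generations, rng):
    """Mutation-selection cycles that confine the population to the genotype network of P."""
    target = phenotype(start)
    population = [start] * pop_size          # initially identical genotypes
    history = []
    for _ in range(generations):
        # Mutation: one random point mutation per individual (mu = 1).
        mutants = [rng.choice(neighbors(g)) for g in population]
        # Selection: individuals whose phenotype differs from P do not survive.
        survivors = [g for g in mutants if phenotype(g) == target]
        if not survivors:                    # hypothetical fallback: keep the parents
            survivors = population
        # Resample survivors to restore a constant population size.
        population = [rng.choice(survivors) for _ in range(pop_size)]
        history.append((len(set(population)),
                        len(accessible_phenotypes(population, target))))
    return population, history


rng = random.Random(0)
start = "0011100111001110"                   # toy genotype with phenotype (2, 2, 2)
final_population, history = evolve(start, pop_size=100, generations=10, rng=rng)
for generation, (n_genotypes, n_phenotypes) in enumerate(history, start=1):
    print(generation, n_genotypes, n_phenotypes)
```

Each printed row gives the generation, the number of distinct genotypes in the population, and the number of distinct novel phenotypes in the population's 1-neighborhood. Running the same function from start genotypes whose phenotypes differ in robustness is one way to explore the question posed at the end of the paragraph above.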
Figure 8.1b shows the answer for two RNA structure phenotypes with different robustness, as indicated by the very different sizes of their genotype sets. Specifically, the more robust structure (filled circles) has a genotype set that comprises a fraction 10⁻³ of genotype space, whereas the less robust structure (open circles) has a genotype set that comprises only a fraction 10⁻⁶ of this space. The more robust structure would be approximately twice as robust, as indicated by the information in Figure 8.1a. Figure 8.1b shows how the number of phenotypes accessible to an evolving population changes over multiple generations of the evolutionary process. Clearly, populations of the more robust phenotype have access to more novel phenotypes, even after merely 10 generations of evolution. Figure 8.1c also shows that such populations are more diverse in their genotypes. Such genotypic variation is often called cryptic variation, because it is not itself visible as phenotypic variation [267]. These observations are not a peculiarity of the two specific RNA phenotypes used in this analysis [830]. For instance, Figure 8.1d shows that the accessibility of new phenotypes in evolving populations systematically increases with phenotypic robustness. This assertion is based on data from 4000 different RNA phenotypes [830]. Taken together, these observations show that phenotypic robustness enhances the ability to encounter novel phenotypes during evolution. But how could this possibly be the case, given the very definition of a robust phenotype? Individual genotypes with a robust phenotype have more neutral neighbors, and thus fewer novel phenotypes in their immediate neighborhood. Populations with robust phenotypes should thus encounter fewer, and not more, novel phenotypes. To see the flaw in this argument, one has to examine the population’s evolutionary dynamics, as illustrated in Figure 8.2.
The figure illustrates the evolution of a population of genotypes (filled circles) over time (top to bottom) for a less robust phenotype (left three panels) and a more robust phenotype (right three panels). Because individuals with a robust phenotype have more neutral neighbors, they suffer fewer deleterious mutations (gray stubs in Figure 8.2) that lead to their elimination from the population. With fewer such deaths, the population remains more diverse,
[Figure 8.2 graphic: two columns of panels, labeled Low Robustness (left) and High Robustness (right); time runs from top to bottom.]
Figure 8.2 Genotypes diversify more rapidly when phenotypes are robust. Each rectangle stands for a subset of genotype space, and shows part of a genotype network; gray circles correspond to individual genotypes on this genotype network; straight lines link neighboring genotypes. The upper two panels show two genetically identical populations (filled circles) with moderate genotypic diversity, where several individuals have the same genotype. The left three panels show how a population explores genotype space for a phenotype with low robustness, driven by repeated cycles of mutation and selection (confining it to a genotype network). On the corresponding genotype network, individual genotypes have, on average, few neutral neighbors, as indicated by the few edges connecting them. The right three panels show the same evolutionary process, but for a robust phenotype, where genotypes have many neutral neighbors. Gray stubs illustrate hypothetical deleterious mutations that drive a genotype off the genotype network. The difference in the number of gray stubs between the left and right panels illustrates that robust phenotypes with large neutral networks are subject to fewer deleterious mutations. The population with the robust phenotype (right three panels) will spread more rapidly, because fewer deleterious mutations impede accumulation of genotypic diversity. Figure and legend adapted from [829].
and spreads more rapidly through the neutral network. This is the reason for the higher amount of cryptic variation we saw in Figure 8.1c. High robustness thus causes two phenomena with opposite roles in influencing the accessibility of new phenotypes. On the one hand, it causes fewer novel phenotypes to be accessible to individual genotypes. On the other hand, it causes populations of such phenotypes to spread more rapidly through genotype space, and to become more diverse. They contain more cryptic variation. The first phenomenon tends to reduce the number of accessible new phenotypes; the second phenomenon tends to increase this number. On balance, in the case of RNA phenotypes, the faster diversification process wins, and a population overall has access to more novel phenotypes. We shall see below that the same holds for proteins, but not necessarily for all other systems [238].
Accessibility of novel phenotypes in small and large populations
Thus far, I have treated individual genotype networks as if they were internally homogeneous. However, we have seen in Chapters 2–4 that genotypes with the same phenotype can differ in their fraction of neutral neighbors. A typical genotype network has regions where its genotypes are more robust, and regions where they are less robust. I will now explore the consequences of this heterogeneity. Before doing that, however, I first need to explain that large and small populations explore a genotype network in different ways [798, 825]. In Chapter 7, I have already discussed the evolutionary dynamics of neutral mutations in a population. One of the basic observations was that genetic drift tends to eliminate the neutral genotypic variants that mutations introduce. If the rate μ at which neutral mutations arise is much greater than the inverse of the effective population size N (μ>>1/N, or Nμ>>1), then mutations are so frequent that genetic drift cannot eliminate all of them. As a result, a population will typically contain more than one genotype. It will be polymorphic most of the time. Conversely, if μ<<1/N, or Nμ<<1, drift eliminates mutations as fast as they arise. Such a population will be genetically homogeneous or monomorphic most of the time.
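The contrast between the two regimes can be illustrated with a small neutral simulation. The following is a minimal sketch under simplifying assumptions that are not taken from the text: a haploid population of constant size in which each offspring copies a random parent (drift), and each neutral mutation creates a genotype never seen before. The parameter values are arbitrary and chosen only to place the two runs at Nμ=10 and Nμ=0.01.

```python
import random


def simulate_neutral(pop_size, mu, generations, rng):
    """Return the number of distinct genotypes present in each generation."""
    population = [0] * pop_size        # everyone starts with genotype label 0
    next_label = 1
    distinct = []
    for _ in range(generations):
        # Drift: each offspring copies a randomly chosen parent.
        population = [rng.choice(population) for _ in range(pop_size)]
        # Neutral mutation: with probability mu an offspring gets a new genotype.
        for i in range(pop_size):
            if rng.random() < mu:
                population[i] = next_label
                next_label += 1
        distinct.append(len(set(population)))
    return distinct


rng = random.Random(1)
N = 100
polymorphic_run = simulate_neutral(N, mu=0.1, generations=2000, rng=rng)     # N*mu = 10
monomorphic_run = simulate_neutral(N, mu=0.0001, generations=2000, rng=rng)  # N*mu = 0.01

mean_high = sum(polymorphic_run) / len(polymorphic_run)
mean_low = sum(monomorphic_run) / len(monomorphic_run)
share_monomorphic = sum(c == 1 for c in monomorphic_run) / len(monomorphic_run)
print(round(mean_high, 2), round(mean_low, 2), round(share_monomorphic, 2))
```

The three printed numbers are the mean number of distinct genotypes per generation in each regime and the share of generations in which the low-mutation population contains a single genotype.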
Consider now an evolving population of genotypes that is confined to a given genotype network, and that is sufficiently large to be polymorphic (Nμ>>1). As I just discussed, different genotypes in such a population typically have different fractions ν of neutral neighbors. This heterogeneity has an important consequence: the population will accumulate in regions of the genotype network with high robustness (Figure 8.3) [798, 825]. To see why, compare two genotypes G1 and G2, one with high robustness (more neutral neighbors, ν1) and the other one with low robustness (fewer neutral neighbors, ν2). We can write μ(νi) to indicate that the two genotypes have different neutral mutation rates, which depend on νi. Now consider a mutation that changes these genotypes. By definition, this mutation will be more likely to change the robust genotype into a neutral neighbor than the less robust genotype. Because most non-neutral variants are deleterious, less robust genotypes will thus suffer more deleterious mutations over time. Over the course of many generations, they will become eliminated from the population, and more robust genotypes will accumulate in it. This means that the population will tend to congregate in regions of a neutral network that have more robust genotypes (Figure 8.3). If one knows the structure of the genotype network, one can predict the average robustness of such a population in an equilibrium between mutation and selection [798]. These observations are based on large populations and do not hold for populations with Nμ<<1. To see why, it helps to recall that natural selection requires (genotypic) variation. The relevant kind of selection here is selection of robust genotypes. Because populations with Nμ<<1 are monomorphic most of the time, they do not contain much genotypic variation in general, and thus also little or no variation in mutational robustness. This lack of variation prohibits such populations from evolving high robustness.
Consider some population, typically consisting of one genotype G1, with some fraction ν1 of neutral neighbors. The rate μ(ν1) at which neutral mutations arise is a function of this fraction. A basic insight of population genetic theory is that approximately every 1/μ(ν1) generations, a neutral mutation will arise that will go to fixation [402]. (Other neutral mutations may arise but eventually get eliminated.)
Figure 8.3 Selection can cause populations to accumulate in regions of a genotype network where robustness is high on average. Open circles connected by straight lines correspond to neighboring genotypes on a hypothetical genotype network. Dashed ellipses correspond to regions of the genotype network where individuals have higher numbers of neutral neighbors, and thus higher mutational robustness. The left panel shows a population of genotypes (filled circles) on the neutral network. Under selection that favors maintenance of a phenotype, the population can move into regions of high robustness (right panel). There may be more than one such region, as indicated by the two ellipses. Which of them a population comes to occupy may depend on a variety of factors, such as a population’s initial distance from each region. Analogous considerations hold for robustness to other perturbations, which is often correlated with robustness to mutations.
The average waiting time for such a mutation is long, because mutations in general are rare events (μ<<1), and 1/μ(ν1) is thus large. Such a mutation will go to fixation rapidly, in about N (<<1/μ(ν1)) generations. Once the mutation has gone to fixation, the entire population becomes monomorphic again, and stays monomorphic until the next neutral mutation arises. The population has thus effectively moved from genotype G1 to one of its neighbors G2, which may have a different fraction of neutral neighbors ν2. The wait for the next mutation that will ultimately go to fixation is again long, and once the mutation arises, it goes to fixation rapidly. The mathematical theory of random walks shows that a population with this kind of evolutionary dynamics will spend equal amounts of time at each genotype of a genotype network [798]. When monitored over long amounts of time, the population’s robustness thus becomes the average robustness of all genotypes in the genotype network. As defined above, this average robustness is the robustness of a phenotype.
In sum, large populations or populations with high mutation rate (Nμ>>1) can evolve high average robustness of their genotypes. In contrast, populations with Nμ<<1 cannot do that, because they lack variation for robustness. Their robustness will typically be representative of the genotype network’s average robustness; that is, the robustness of the phenotype associated with this network. Populations of intermediate values of Nμ evolve robustness intermediate between these limits. I discuss these population dynamic details, because we saw that robustness may affect phenotype accessibility. The populations discussed in Figure 8.1 were in the regime of Nμ>>1. The evolutionary process described will increase robustness in such populations within few generations [123, 798, 820]. As we saw, high robustness can affect the spreading of populations (Figure 8.2), and with it the accessibility of novel phenotypes. These observations raise the possibility that phenotypic robustness may have a positive role for evolutionary innovation only if Nμ>>1.
THE R OL E OF R OBUSTNESS F OR INNOVATION
To see whether this is the case, one can repeat the analysis of Figure 8.1, but with populations where Nμ<<1. Because such populations are monomorphic in most generations, their neighborhoods at any one time would not contain a broad spectrum of phenotypes. However, during their random walk on a genotype network, they may still encounter, over time, a large cumulative number of different phenotypes in their neighborhoods. Figure 8.4a shows this cumulative number of phenotypes over 10⁴ generations of evolution, for populations of RNA molecules with Nμ<0.1. Specifically, the figure compares the cumulative number of phenotypes encountered by two populations whose phenotypes differ in their robustness. The data clearly show that over time, the population with the robust phenotype encounters more novel phenotypes. Figure 8.4b shows that this is not a peculiarity of the two analyzed phenotypes, but that it holds for 1500 different phenotypes with broadly varying robustness. These observations mean that phenotypic robustness facilitates access to novel phenotypes in small populations as well. They raise, however, the same question as the earlier analysis in large populations. How can populations of robust phenotypes, whose genotypes typically have fewer novel phenotypes in their neighborhood, still encounter more novel phenotypes? The answer is similar to the one I gave earlier (Figure 8.2). Genotypes with robust phenotypes have, on average, a greater fraction of neutral neighbors ν. This means that the waiting time 1/μ(ν) between successive steps of their random walk through a genotype network is shorter. Evolving genotypes with robust phenotypes thus simply explore a genotype network faster. On balance, the positive contribution of exploring a genotype network faster outweighs the negative contribution of less diverse neighborhoods, and leads to greater accessibility of new phenotypes (Figure 8.4).
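To get a feel for the size of this effect, the waiting-time argument can be turned into a back-of-the-envelope calculation. The sketch below is illustrative only. It reads the waiting time 1/μ(ν) as 1/(μ·ν), plugs in the mutation rate and the two starting neutralities reported in the caption of Figure 8.4 (μ=0.01, ν=0.48 and ν=0.27), and assumes that ν stays constant along the walk, which it does not in a real genotype network. The function names are hypothetical.

```python
def waiting_time(mu, nu):
    """Mean number of generations between successive neutral fixations."""
    return 1.0 / (mu * nu)

def expected_steps(mu, nu, generations):
    """Expected number of random-walk steps taken in the given time span."""
    return generations / waiting_time(mu, nu)

MU = 0.01             # mutation rate per molecule and generation (Figure 8.4)
GENERATIONS = 10**4   # time span shown in Figure 8.4a

for label, nu in [("more robust", 0.48), ("less robust", 0.27)]:
    print(f"{label}: wait {waiting_time(MU, nu):.0f} generations, "
          f"{expected_steps(MU, nu, GENERATIONS):.0f} steps")
```

Under these simplifying assumptions, the more robust starting genotype takes roughly 48 steps in the time in which the less robust one takes roughly 27, so it samples almost twice as many neighborhoods.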
In sum, in this molecular system, phenotypic robustness can aid the exploration of novel phenotypes, regardless of population size. In closing this section I note that some RNA genotypes are both extremely robust to mutations and have a thermodynamically extremely stable phenotype [22]. Such genotypes have mostly neutral neighbors, and thus low phenotypic diversity in their neighborhoods [22]. If a population of such extreme genotypes occupies a small, local region of a genotype network, it may not be able to access much phenotypic diversity. Such genotypes can be attained in principle if selection favors both mutational robustness and high thermodynamic stability [22]. However, we do not know whether such extreme genotypes often occur in organisms. Recent studies examining the robustness of multiple different RNA genotypes with diverse biological functions showed that typical RNA genotypes have more modest robustness [76, 672, 832].
Figure 8.4 Robustness facilitates access of novel phenotypes also for small populations. [Panel (a): cumulative number of different phenotypes in neighborhood (vertical axis) against number of generations, ×10³ (horizontal axis), for a more robust phenotype (10⁻³) and a less robust phenotype (10⁻⁶). Panel (b): cumulative number of different phenotypes in neighborhood (vertical axis) against genotype set size, ×10⁻⁵ (horizontal axis); Spearman's s = 0.3; n = 15; P<10⁻¹⁷.] (a) The vertical axis shows the cumulative number of different RNA structure phenotypes that an evolving population encounters in its neighborhood between generation zero and the time shown on the horizontal axis. The neighborhoods of the starting genotypes (not shown) contained a smaller fraction of unique phenotypes for the robust starting phenotype than for the less robust starting phenotype (0.33 vs. 0.30). The two starting genotypes were also more robust (neutrality ν=0.48 vs. ν=0.27). Nonetheless, populations with the more robust phenotype can access more phenotype variation over time. (b) The horizontal axis shows binned genotype network sizes (expressed as a fraction of genotype space size) for 1500 different RNA structure phenotypes. The vertical axis shows means (circles) and standard errors (bars) of the cumulative number of unique structures that occur in the neighborhood of populations after 10⁴ generations of evolution. All data are based on populations of N=10 individuals and a mutation rate of μ=0.01 per molecule and generation. Data in (b) are based on one inversely folded sequence per structure that is used to seed an evolving population. From [830].
High robustness of proteins is associated with more functional innovations
Analyses of RNA molecules can help us understand how robustness influences the ability to access novel molecular structures. To understand how it affects innovation in molecular functions, however, proteins are better study objects than RNA. The reason is that we know much more about their structural and functional diversity. This knowledge can be used in comparative analyses that ask whether robust protein phenotypes have evolved more diverse functions. I note that functional diversity is a record of past evolutionary innovations. The observation that robust protein phenotypes are functionally more diverse would suggest that robustness facilitates functional innovation in proteins. The analysis I will discuss next asks, specifically, whether protein domains that are highly robust to amino acid change occur in proteins that have evolved a great variety of functions. (Recall from Chapter 4 that a protein domain is a distinct unit of protein structure that typically folds independently of other such units.)
There are two complementary ways to assess a protein structure’s robustness. The first uses information from the protein structure alone. Specifically, England and Shakhnovich showed that the number of amino acid sequences that can adopt a given structure can be estimated by a mathematical property, the largest eigenvalue, of the contact density matrix of a protein structure [221]. This is a binary matrix whose entries aij are equal to one if two amino acids i and j that are not adjacent on the linear amino acid chain contact each
other in the folded protein. All other matrix entries are equal to zero. The second measure of structural robustness uses the amount of amino acid variation that occurs in proteins with a given domain. Any one position of a domain that harbors multiple amino acids (in different proteins with this domain), suggests that the domain can tolerate multiple amino acid changes without losing its integrity. A note of caution about this diversity-based measure of robustness is necessary. First, ancient protein domains, domains that have originated early in the history of life, had more time to accumulate amino acid changes than younger domains. They may thus spuriously appear more robust, based on their greater amino acid diversity. Second, some domains may simply have more known sequences, because the proteins containing them may have been more thoroughly studied. At any one amino acid position, they may thus appear more diverse for this reason alone. One can account for these factors, first by studying only ancient protein domains that were present in the most recent common ancestor of extant life [630]; second, by correcting for variation in the number of available sequences per structure. If one does that, one finds that the two measures of protein robustness—structure-based and diversity-based robustness—are highly correlated, with a rank correlation coefficient of r>0.88 [238]. I emphasize that these measures assess phenotypic (structural) robustness, and not genotypic (sequence) robustness. They estimate the robustness of a protein structure to amino acid change, not the robustness of an individual amino acid sequence forming this structure. The upper part of Figure 8.5 shows, as an example, two protein domains with different robustness, and their observed amino acid diversity, which is coded in different shades of gray that range from black (maximal diversity) to white (minimal diversity). 
The left domain (topoisomerase) is less robust, and has lower amino acid diversity, as indicated by the lighter shading over its entire length. The right domain (hemerythrin) is more robust and more diverse overall. The different shades of gray in different regions indicate that robustness varies among these regions.
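The structure-based robustness measure described above can be sketched in a few lines. The example is a minimal illustration, not the procedure of [221] or [238]: the contact list is made up, the function names are hypothetical, and dividing the eigenvalue by the chain length is only one plausible reading of the length normalization. The code builds the binary contact matrix and estimates its largest eigenvalue by power iteration, using the standard library only.

```python
def contact_matrix(n_residues, contacts):
    """Binary matrix with a[i][j] = 1 if residues i and j are in contact and
    are not adjacent on the chain (|i - j| > 1); all other entries are 0."""
    a = [[0] * n_residues for _ in range(n_residues)]
    for i, j in contacts:
        if abs(i - j) > 1:
            a[i][j] = a[j][i] = 1
    return a

def largest_eigenvalue(a, iterations=200):
    """Largest eigenvalue of a symmetric 0/1 matrix by power iteration.

    The iteration runs on A + I so that it also converges when the contact
    graph is bipartite; the Rayleigh quotient at the end is taken with A itself.
    """
    n = len(a)
    v = [1.0] * n
    for _ in range(iterations):
        w = [v[i] + sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    av = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * av[i] for i in range(n))  # v has unit length

# Hypothetical six-residue chain; the pair (1, 2) is adjacent and is ignored.
contacts = [(0, 2), (2, 4), (0, 4), (1, 2)]
matrix = contact_matrix(6, contacts)
lam = largest_eigenvalue(matrix)
normalized = lam / len(matrix)  # assumed form of the length normalization
print(lam, normalized)
```

For real proteins one would derive the contact list from atomic coordinates with a distance cutoff, a step that is left out here because the text does not specify it.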
Figure 8.5 Two protein domains that differ in robustness and in their diversity of enzyme functions. [Ribbon diagrams of the topoisomerase domain (left) and the hemerythrin domain (right), shaded on a grayscale that runs from minimal to maximal robustness.]
Enzyme families associated with the topoisomerase domain: 1. Dihydroorotase; 2. Triazine hydrolase; 3. 1,4-dihydroxy-2-naphthoyl-CoA synthase.
Enzyme families associated with the hemerythrin domain: 1. Triazine hydrolase; 2. Enoyl-CoA hydratase; 3. Methylglutaconyl-CoA hydratase; 4. Methylmalonyl-CoA decarboxylase; 5. 3-hydroxyisobutyryl-CoA hydrolase; 6. Glucarate dehydratase; 7. O-succinylbenzoate synthase; 8. Enolase; 9. Galactonate dehydratase; 10. Phosphoserine phosphatase; 11. P-type ATPase.
The grayscale spectrum indicates the amount of observed sequence diversity, i.e., the number of different amino acids observed in a particular region of a protein, as a measure of phenotype robustness. Dark (pale) shading indicates regions with high (low) sequence diversity per residue. The enzyme families associated with each domain are listed below the domain images. The left ribbon diagram shows domain 4 of topoisomerase 1, which has low diversity and robustness (CATH database identifier: 1mw9X04, [289]). This domain occurs in proteins that are involved in three enzymatic reactions, as indicated underneath the ribbon diagram, which differ in their substrates and reaction mechanisms [604]. The right panel shows a domain with high robustness and diversity, subunit A of hemerythrin (CATH database id: 1ls1A01). This domain is associated with 11 enzymatic reactions. It is functionally more diverse than topoisomerase domain 4. From [238].
Assessing protein robustness is only the first of two steps towards analyzing how robustness relates to functional diversity. The second step is to assess functional diversity itself. This is most easily done for enzymes, the most prominent and best studied class of proteins [133]. Important for enzyme activity and specificity are active sites, small groups of precisely positioned amino acids that bind an enzyme’s substrates for catalysis. By comparing active sites among different enzymes with known structure, one can infer an enzyme’s likely reaction mechanisms and substrates. One can then use this information to classify enzymes into families whose members share the same reaction mechanisms and substrates [604]. Based on this information, the lower part of Figure 8.5 shows the number of enzymatic functions that proteins harboring the hemerythrin and the topoisomerase domain have. The more robust hemerythrin domain has adopted more enzymatic functions (it is associated with more enzyme families) than the less robust topoisomerase domain. This association holds not only for these two domains, but for all 112 ancient protein domains for which the necessary information is available. Figure 8.6 plots diversity of enzymatic functions (vertical axis) against protein robustness (horizontal axis) for these domains. The figure shows that more robust protein domains have evolved a greater diversity of enzymatic functions. The association shown is based on the structure-based measure of robustness I mentioned above. However, it also holds for the diversity-based measure of robustness. The association exists whether or not one corrects for the greater number of sequences associated with some domains [238].
Not all proteins are enzymes, but classifications of other, non-enzymatic functions are less easy to come by. Part of the reason is that protein functions are very heterogeneous. Their classification must encompass activities as different as transport and structural support; and it must incorporate the location and timing of a protein’s expression. The most comprehensive effort to classify protein functions is provided by the gene ontology consortium, which classifies proteins according to their molecular function and other functional categories [33]. One can ask whether a protein’s robustness is associated with its functional diversity, as assessed through
the gene ontology classification system [33]. The result is that here as well, robust proteins tend to have greater functional diversity, regardless of which measure of diversity is used, and regardless of whether one corrects the analysis for domains with more associated sequences [238]. In sum, complementary measures of protein functional diversity and protein robustness show that robust protein phenotypes have more diverse functions. This means that they have experienced more innovations in their evolutionary history. These observations are based on a comparative analysis on very large evolutionary time-scales.
I note that they are complemented by experimental evidence on short, laboratory time-scales that I discussed in Chapter 7. This evidence shows that increased protein robustness or thermodynamic stability can facilitate the evolution of novel protein functions in the laboratory. Taken together, the available evidence on proteins thus supports the mechanism suggested by the RNA-based work I discussed. Robust phenotypes explore genotype space more rapidly, and can thus access greater amounts of phenotypic variation, some of which turns into evolutionary innovation.
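The associations reported in this section are rank correlations, such as the one between the two robustness measures, or between robustness and functional diversity. The sketch below shows one way to compute a Spearman rank correlation with the Python standard library only. It is not the analysis code of [238]: the function names and the five data points are made up for illustration, and tied values receive average ranks.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

def spearman(x, y):
    """Pearson correlation coefficient of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical values for five domains, for illustration only.
robustness = [0.02, 0.04, 0.05, 0.08, 0.10]
diversity = [0.001, 0.004, 0.003, 0.009, 0.012]
rho = spearman(robustness, diversity)
print(rho)
```

Because only ranks enter the calculation, the coefficient is insensitive to the scale on which robustness and diversity are measured, which suits measures as different as an eigenvalue and a count of enzyme families.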
Figure 8.6 Protein structures highly robust to mutations evolved greater functional enzymatic diversity. [Scatter plot: diversity of enzyme functions (vertical axis, 0 to 0.014) against robustness of protein structure (horizontal axis, 0.01 to 0.11).] The horizontal axis shows, for 112 ancient protein domains, the length-normalized largest eigenvalue of the contact density matrix [221]. It is a measure of the protein’s robustness to mutations. The vertical axis shows a measure Div of the number of enzymatic reactions a domain is known to be involved in. Specifically, if a set of proteins with the same domain has k different enzymatic functions, and if $p_i$ is the frequency with which each function i occurs in this set, then $\mathrm{Div} = -\sum_{i=1}^{k} p_i \log p_i$. The reactions examined here differ in the substrates they use. They can be mapped to individual proteins through structural similarities in the active sites of proteins that catalyze particular reactions [604]. The data show that robust domains have evolved greater functional diversity. Figure and caption adapted from [238, 829].
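The diversity measure Div in the caption of Figure 8.6 has the form of a Shannon entropy, and it can be computed directly from a list of function labels. The sketch below is a minimal illustration: the function name and the labels are hypothetical, the natural logarithm is an assumption because the caption does not state the base, and any further normalization applied in [238] is not reproduced.

```python
import math
from collections import Counter

def functional_diversity(functions):
    """Div = -sum_i p_i log p_i over the enzymatic functions in a protein set.

    functions: one function label per protein that carries the domain.
    Uses the natural logarithm (an assumption; the base is not given).
    """
    counts = Counter(functions)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Hypothetical sets of proteins that share a domain.
one_function = ["enolase"] * 4
two_functions = ["enolase", "enolase", "glucarate dehydratase", "glucarate dehydratase"]

print(functional_diversity(one_function))
print(functional_diversity(two_functions))
```

A domain whose proteins all carry out the same function has Div equal to zero. Div grows both with the number of functions k and with how evenly the proteins are spread among them.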
The qualitative statement that robustness brings forth genotype networks and thus facilitates innovation is independent of system class. It holds for molecules, regulatory circuits, and metabolic networks alike. This statement is different from quantitative statements I just made about different sizes of genotype networks and how these sizes affect innovability. These statements regarded molecules. One might think that they also apply to other system classes, regulatory circuits and metabolic networks. However, this is not necessarily the case [193, 222, 653]. Recall that two competing processes determine the accessibility of new phenotypic variation. The first is the more rapid exploration of genotype networks by robust phenotypes. The second is the lower diversity of the neighborhoods of individual genotypes. For molecules, the first process wins. This need not be the case for other systems, and
under all circumstances. Figure 8.7 shows two examples, one from metabolic networks, the other from regulatory circuits. In Figure 8.7a populations of metabolic networks with intermediate (not high) phenotypic robustness can access the most novel phenotypes. In Figure 8.7b, populations of regulatory circuits whose phenotypes have the lowest robustness encounter the most novel phenotypes in their neighborhood. These differences among systems reflect differences in the organization of genotype spaces. We know too little about the causes of these differences to determine when the size of a genotype network may facilitate or hinder innovation. They are thus fertile ground for further research. In sum, although robustness qualitatively facilitates innovation regardless of system class, quantitative differences in robustness among phenotypes may affect innovation in a system-dependent manner.
Figure 8.7 High mutational robustness need not facilitate innovation in metabolic networks and gene regulatory circuits. [Panel (a), metabolic networks: cumulative number of different metabolic phenotypes in neighborhood, ×10⁴ (vertical axis) against number of alternative sole sulfur sources (horizontal axis). Panel (b), regulatory circuits: number of different gene expression phenotypes in neighborhood (vertical axis) against robustness of expression phenotype, shown as low, medium, and high (horizontal axis).] (a) The notion of phenotype in this figure is based on different sulfur sources that allow synthesis of all biomass components when provided as sole sulfur sources, exactly analogous to the carbon-metabolizing phenotypes discussed in Chapter 2. The horizontal axis reflects the number of sulfur sources that a metabolic network can use as sole sulfur sources. Because metabolic networks that are viable on many different sulfur sources are less robust to reaction removal [653], the horizontal axis serves as a proxy for the robustness of a metabolic phenotype, where robustness decreases from left to right. The vertical axis shows the cumulative number of unique novel phenotypes encountered in the neighborhoods of a population of networks during exploration of a genotype network. Each data point corresponds to an average over 200 evolving populations of 100 individuals each, after 2000 cycles of selection and mutation (one changed reaction per metabolic network and cycle). All evolving networks consisted of 100 enzymatic reactions, but similar observations hold for networks of different size. Lengths of bars correspond to one standard deviation. Note the peak in the number of accessible phenotypes, which means that the number of accessible new phenotypes does not increase monotonically with increasing robustness (decreasing number of sulfur sources), but declines for small numbers of sulfur sources. Data from [653]. (b) The figure is based on populations of regulatory circuits (N=20 genes, ≈4 regulatory interactions per gene; Chapter 3) that are in mutation-selection balance on the genotype network of a gene expression phenotype. The horizontal axis shows data for three different kinds of phenotypes that are distinguished by their robustness. Specifically, in phenotypes with low, medium and high robustness, the distances between initial and equilibrium expression state of a circuit are d=0.5, d=0.25, and d=0.1, respectively (Chapter 3). The vertical axis shows the number of unique novel phenotypes that are accessible in the neighborhood of a population. Note that populations with highly robust phenotypes have fewer novel phenotypes in their neighborhood. Data are averaged over 500 populations of 200 circuits each, where each circuit had a 50 percent chance of changing a regulatory interaction per generation and was subject to strong stabilizing selection on its expression phenotype. Data from [222].
Evolution of phenotypic variability
It is intriguing to ask whether the ability of biological systems to bring forth phenotypic variation and innovation has changed, perhaps increased, in their evolutionary history. Biologists from other areas who venture into evolutionary biology tend to make this assumption with carefree abandon [264, 576]. They assert that biological systems bring forth novel features because this ability has been “selected for.” Evolutionary biologists tend to be more careful, because such assertions are problematic unless backed by evidence. To explain the problem with such assertions, I will use the example of mutator alleles [714, 751]. These alleles dramatically increase the mutation rate of the organism hosting them. They often encode variants of genes involved in DNA replication or repair. In environments that are stressful and that challenge the survival of a lineage or population of organisms, such mutators can provide an advantage to the population. The reason is this. Although the vast majority of mutations that mutators produce are deleterious, the tiny fraction that is beneficial to at least some individuals can allow the population to survive. Mutators can be quite abundant in bacterial populations [751]. A facile explanation for their abundance resorts to the very advantage they confer to the population. However, this advantage is overshadowed by a great disadvantage that the mutator confers to the individual (typically just one in a large population) who first acquires it: because the vast majority of mutations are deleterious, carrying the mutator genotype is detrimental to this individual, and will thus often lead to its extinction [714].
For my purpose, the main pertinent observation is this: a conflict exists between the interests of the lineage and those of the individual. Except under special circumstances [714], a mutator can persist only through selection that favors one group or lineage of individuals over other groups. How prevalent group or lineage selection is has been a matter of ferocious controversy [717, 852]. This is not the place to resolve this controversy, except to say that group selection occurs, especially in lineages where individuals are genetically highly related; its ubiquity, however, is still unclear. In other words, the conflict between the individual’s and the population’s benefit need not always be resolved in favor of the population, and in favor of increased variability. In contrast, it may often be resolved in the interest of the individual, and in favor of reduced variability. Similar conflicts exist for most causes of increased phenotypic variability. Fundamentally, the reason is that for individuals, variation is more likely to be deleterious than beneficial. This is why the assertion that phenotypic variability is “selected for” should not be made lightly.
The reason to discuss this conflict here is to point out a remarkable property of the systems I discussed in this chapter: they avoid this conflict. Specifically, in RNA and protein structures, where robust phenotypes facilitate evolutionary innovation, the interests of the individual and the lineage are perfectly aligned. To see this, recall that a robust phenotype is one whose genotypes are typically also robust to mutations, more so than genotypes of less robust phenotypes. This means that having a robust phenotype is typically good for the
individual: the individual experiences fewer deleterious mutations. At the same time, populations of such individuals can access more phenotypic variation. In this case, greater robustness benefits both the individual and the lineage. This would no longer hold for systems where greater phenotypic robustness can lead to less phenotypic variability. In such systems, the conflict re-emerges. The alignment of individual and group benefits for robustness in RNA and protein phenotypes has a potentially important consequence. Because robust phenotypes are unequivocally favorable, their abundance should increase on evolutionary time-scales, if other constraints on the molecular functions of RNA and protein phenotypes allow such an increase. For example, if several protein structures permit the execution of the same catalytic function, the more robust structure should come to be used preferentially, because it provides the dual benefits of increased robustness and phenotypic variability.
It is not easy to find out whether protein and RNA phenotypes have increased their phenotypic variability over time. Phylogenetic methods allow us to reconstitute past genotypes, but to reconstruct past phenotypes is a different matter [237]. However, we can ask whether phenotypes of natural macromolecules are more robust or have larger genotype networks than phenotypes sampled at random from a genotype space [149, 378]. Doing so requires information about the nature of such random phenotypes. This information is currently still unavailable for proteins, but for RNA secondary structures, it is possible to provide such information [429]. To this end, we developed techniques to sample phenotypes randomly from genotype space, and to estimate their genotype network size [378]. We used these techniques to compare the genotype network size of 5×10³ phenotypes sampled at random from genotype space, to the genotype network size of more than 80 RNA secondary structures of natural molecules with known and diverse biological
functions. These included ribozymes, guide RNAs involved in RNA editing, small nuclear RNAs, and others. This analysis showed that the genotype networks of natural RNAs are larger than those of randomly chosen phenotypes [378], as illustrated in Figure 8.8. The histogram in this figure shows, for each natural RNA molecule we studied, the fraction of random RNA phenotypes that have larger genotype network sizes. This fraction is generally small. Specifically, fewer than 1 in 10,000 random phenotypes have genotype networks larger than those of a typical RNA molecule that we studied. Only 1 out of 82 biological RNA phenotypes has a genotype network whose size is not in the top 5 percent. Using different approaches and data, other authors have also found that biological RNA phenotypes may have more associated genotypes than expected by chance [149]. The observations of Figure 8.8 suggest that the phenotypic robustness of RNA molecules, and thus their ability to access new phenotypic variation, is greater than expected by chance alone. They are consistent with the notion that large genotype network sizes (and thus high robustness) are favorable properties. However, I note that this conclusion is only tentative. First, the available data are very limited. Second, other explanations are conceivable. For example, RNA phenotypes with large genotype networks may simply be easiest to find in blind evolutionary searches. How to distinguish between such alternative explanations remains an open question.
Figure 8.8 Natural RNA molecules have unusually large genotype networks. [Histogram: number of natural RNA phenotypes (vertical axis) against fraction of random RNA phenotypes with larger genotype networks (horizontal axis, bins ranging from 0.1 down to 0.0000001).] The horizontal axis indicates the fraction of random genotype networks that are larger than the genotype network of a natural RNA molecule, and the vertical axis indicates the number of RNA molecules in each of the histogram’s bins shown on the horizontal axis. The figure is based on the genotype network sizes of RNA secondary structures for 82 natural RNA molecules with known biological functions. The figure compares these genotype network sizes with those of RNA phenotypes sampled at random and uniformly from the set of phenotypes in RNA genotype space. Genotype network sizes were estimated with an algorithm based on replica exchange Monte Carlo sampling of RNA genotype space [343, 378]. From data in [378].
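The quantity on the horizontal axis of Figure 8.8 is an empirical tail fraction: for one natural phenotype, the share of randomly sampled phenotypes whose genotype network is larger. The sketch below shows the calculation on made-up numbers. The function name and all network sizes are hypothetical, and estimating the sizes themselves (by the sampling methods of [378]) is not attempted here.

```python
def fraction_larger(natural_size, random_sizes):
    """Fraction of randomly sampled phenotypes whose genotype network is
    strictly larger than that of a given natural phenotype."""
    larger = sum(1 for size in random_sizes if size > natural_size)
    return larger / len(random_sizes)

# Hypothetical genotype network sizes (numbers of genotypes).
random_sizes = [10**k for k in range(3, 13)]  # ten made-up random phenotypes
natural_size = 5 * 10**11                     # one made-up natural phenotype

frac = fraction_larger(natural_size, random_sizes)
print(frac)
```

With only ten random phenotypes the smallest nonzero fraction that can be resolved is 0.1. Resolving fractions as small as those in the figure therefore requires a correspondingly large random sample.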
Summary
Some degree of robustness is a prerequisite for the existence of genotype networks. Thus, robustness qualitatively facilitates evolutionary innovation. How it affects innovation quantitatively, that is, when compared among different phenotypes, depends on the system class considered. What we currently know suggests that robustness of phenotypes enhances innovation in proteins and RNA. The reason is that populations with more robust molecular phenotypes explore genotype networks more rapidly, and thus gain access to more novel phenotypic variants. In other words, the larger regions of a genotype network that such molecules explore compensate for the reduced phenotypic diversity in the neighborhood of any one molecule caused by robustness. In addition to a focus on phenotypic (and not just genotypic) robustness, an understanding of population processes is essential to understand this phenomenon. The reason why it is not universal to all system classes is that it depends on details of genotype space organization. In systems where phenotypic robustness facilitates phenotypic variability, the benefit that individuals derive from high robustness is congruent with the population or lineage-level benefit of increased variability. In such systems, evolution may bring forth especially robust phenotypes. More than 80 RNA structure phenotypes from molecules with known and diverse biological functions hint that this is the case for RNA, because such molecules are indeed more robust than RNA phenotypes chosen at random from genotype space. Beyond this hint, it remains an intriguing open question whether the distribution of observed phenotypes has been shaped by the ability of phenotypes to bring forth innovations, in systems as different as proteins, regulatory circuits, metabolic networks, and in the macroscopic traits they help form.
CHAPTER 9
Gene duplications and innovation
No treatment of evolutionary innovation would be complete without discussing gene duplication. As early as 1970, Ohno argued, based on limited evidence, that gene duplications are important for evolutionary innovation [573]. Later evidence, ranging from whole genome analyses to detailed biochemical studies of individual duplications, largely supported this view. Others have written extensively about gene duplication [135, 253, 296, 334, 458, 479, 480, 794]. Thus, I will here only discuss observations central to my topic [828]. After giving some essential background information, I will first highlight three dramatic classes of innovations in which gene duplications may have played a role. I then create a larger context for these examples, showing how gene duplications fit seamlessly into the framework I propose here. Specifically, I will point out that gene duplications are a unique class of mutations, because they systematically increase mutational robustness. As I discussed in Chapter 8, robustness plays an important role in evolutionary innovation. Gene duplications increase robustness in ways that facilitate innovation; that is, they increase the size of genotype networks, and, more generally, the search space through which evolving populations can access novel phenotypic variants.
A duplication of DNA is a kind of mutation that arises as a by-product of recombination and DNA repair processes. Such processes occur in most cells. The products of a duplication are two copies of a stretch of DNA. Duplications can affect non-coding DNA, parts of a gene, entire genes, multiple genes, or whole genomes. Whenever one or more genes are duplicated, the duplicates are often, but not always, identical [386]. I will focus mostly on the duplication of single genes, because this case serves
to develop the key concepts. However, these concepts also apply to other categories of duplication. A gene duplication occurs typically in a single individual of a population. The two new gene duplicates may go to fixation by genetic drift or natural selection, for example, if the increased amount of their RNA or protein product is advantageous [593, 809, 827]. During this process, or after the duplicate genes have gone to fixation, mutations may arise in either duplicate. Many of these mutations may abolish one of the duplicate genes’ functions or eliminate one duplicate from the genome. This is the most frequent duplication fate, affecting as many 90 percent of duplicate genes [390]. It is irrelevant from the point of view of evolutionary innovation. I will thus only focus on the case where both duplicates remain in the genome. Some such persisting duplicates retain similar functions for long times; others partition existing functions among them, leading to increasing specialization or so-called subfunctionalization; yet others evolve different, novel functions [135, 253, 768, 804, 821]. Well-studied examples of gene duplications show how individual gene duplicates evolve new functions [135]. Prominent mechanisms include the adoption of new gene expression patterns, and changes in protein structures that allow interaction with new molecular partners. (Chapter 7 discussed an example involving steroid hormone receptors.) On long evolutionary time-scales, multiple such individual gene duplicates can arise, go to fixation, and diversify in any one lineage of organisms. The cumulative effect of such diversification may be dramatic. It can facilitate striking innovations on the organismal level. The following three examples hint at the powerful role that gene duplication can play in morphological innovation (see also [828]).
Example 1: The rise of flowering plants
Flowering plants (angiosperms) are the most diverse and evolutionarily successful group of land plants. They comprise approximately 250,000 species, which outnumber those of all other plant taxa. Their radiation began some 100 million years ago. Since then, flowering plants have come to dominate land ecosystems. Many key evolutionary innovations of flowering plants relate to reproduction. Among them are the endosperm, a triploid tissue that nourishes a seedling; closed carpels that shield the female germ cells and prevent self-fertilization; and, most visibly, flowers themselves. The prototypical angiosperm flower consists of four different floral organs: sepals, petals, stamens, and carpels. Myriad variations exist on the number, arrangement, and organization of these four organs. Together, they are the most visible aspects of angiosperm diversity [219].
To understand how angiosperms diversified, one has to understand how they develop, and in particular how flowers develop. Central to flower development is a circuit of transcription factors akin to the circuits I discussed in Chapter 3 [223]. This circuit specifies the identity of the four floral organs. The so-called ABC model of flower development, established first in Arabidopsis thaliana and Antirrhinum majus [129], encapsulates the core principles of floral organ specification. According to this model, the combined action of three classes of transcription factors, called A, B, and C, is necessary to specify floral organ identity. A class A transcription factor expressed by itself specifies sepals; class A and B factors are jointly necessary to specify petals; class B and C factors jointly specify stamens, whereas class C factors expressed alone specify carpels. The model’s central notion is the combinatorial specification of organ identity. The basic model is well corroborated, although accumulating evidence requires some modifications to its details [764].
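The combinatorial logic of the basic ABC model can be written down as a small lookup table. The following is only a minimal sketch of the four rules stated above; the names `ORGAN_BY_CLASSES` and `floral_organ` are illustrative assumptions, and the sketch ignores the later modifications to the model mentioned in the text.

```python
# Toy encoding of the basic ABC model as stated in the text.
# The table and function names are hypothetical, chosen for illustration.
ORGAN_BY_CLASSES = {
    frozenset("A"): "sepal",    # class A alone
    frozenset("AB"): "petal",   # classes A and B together
    frozenset("BC"): "stamen",  # classes B and C together
    frozenset("C"): "carpel",   # class C alone
}


def floral_organ(active_classes):
    """Return the organ specified by the active factor classes.

    Returns None for combinations to which the basic model, as
    summarized in the text, assigns no organ.
    """
    return ORGAN_BY_CLASSES.get(frozenset(active_classes))


if __name__ == "__main__":
    for classes in ("A", "AB", "BC", "C"):
        print(classes, "->", floral_organ(classes))
```

Because organ identity depends on which classes are active together, three factor classes suffice to specify four organs in this sketch.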
Most of the well-studied transcription factors involved in flower development belong to one family of proteins, the MADS box proteins. This family is named after a particular protein motif that it contains, which is ubiquitous in eukaryotes. The reason why I discuss this family here is that flowering plants have experienced a wave of duplications in member genes of this family [17, 493, 548, 765, 766] (Figure 9.1). On the one hand, yeast, nematode, and fruit fly genomes contain only between two and four MADS box genes; the most recent common ancestor of gymnosperms and angiosperms may have had as few as 7 MADS box genes [49, 649, 765, 766]. On the other hand, the two completely sequenced genomes of the angiosperms Arabidopsis thaliana (thale cress) and Oryza sativa (rice) each contain more than 70 MADS box genes [493, 548]. Since their duplication, the functions of many duplicate MADS box genes have diversified [353, 354]. Examples involve the MADS box genes abbreviated as SEP (SEPALLATA). Several duplicates of these genes (SEP1–4) exist in the well-studied flowering plant Arabidopsis. They are jointly responsible for converting leaf-like structures into petals, stamens, and carpels [187, 605, 606]. While having the same or highly similar functions in Arabidopsis, SEP homologs have adopted different functions in other flowering plants. Examples include a tomato SEP homolog that is involved in fruit ripening but not in floral organ identity [814]. In grasses, SEP-like genes may have influenced the morphological diversification of inflorescences [485]. Another example involves the AGAMOUS (AG) gene family, whose name derives from AGAMOUS, a gene involved in carpel and stamen formation. This gene has experienced a duplication in the angiosperm lineage leading to eudicotyledons [353, 423], which include prominent plant families such as rosids and asterids. In the asterid Antirrhinum (snapdragon), expression of the AG family member PLENA outside the region where it is normally expressed transforms sepals into carpels. However, its duplicate FARINELLI (FAR) does not cause this transformation [106]. The different loss-of-function phenotypes in the two genes show that they have adopted different functions [106].
Similarly, Arabidopsis contains two duplicates in the AG family, the genes SHATTERPROOF1 and SHATTERPROOF2, which have adopted novel functions in fruit ripening [460].
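[Figure 9.1 image: simplified phylogeny with the groups Dicots, Monocots, Basal Angiosperms, Gymnosperms, Bryophytes, and Green algae, plus Fungi and Animals as outgroups. Numbers of MADS box genes shown in the image: >38, 1, and 1-4.]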
Figure 9.1 Duplications of MADS box genes are associated with innovations in flowering plants. The plant phylogeny shown is highly simplified. It contains major plant groups, as well as fungi and animals as outgroups [667]. Superimposed on this phylogeny are the numbers of known MADS box genes from organisms that include yeast, nematodes, and fruit flies; the green alga Coleochaete scutata; and the eudicotyledon Arabidopsis thaliana (thale cress) [649, 755, 766]. The lower bound of MADS box genes in eudicotyledons is based on the MADS MIKC subfamily; this has been suggested by [808]. Numbers of MADS box genes are minimal numbers and could fluctuate within taxonomic groups. The images depict C. scutata [370], and a flower of A. thaliana [352] (Courtesy of Vivian Irish, Yale University.) Figure and legend adapted from [828].
Example 2: The rise of vertebrates
Much like the radiation of flowering plants, the radiation of vertebrates created a spectacularly successful and diverse group of organisms. Hox genes played a key role in this diversification. These genes encode transcriptional regulators and are named after the homeobox, a DNA sequence that they contain and that encodes their DNA-binding domain. Hox genes pattern many structures along the head–tail body axis, including the hindbrain, the vertebral column, and the limbs [508]. Most animals contain multiple Hox genes that are adjacent to one another, forming one or more clusters of genes close together on a chromosome. Their spatiotemporal expression pattern along the head–tail axis corresponds to their chromosomal order in a Hox gene cluster. Many invertebrates have a single cluster of Hox genes. This cluster underwent at least two duplications during vertebrate evolution, duplications that led to four Hox gene clusters in many vertebrates [442].
Vertebrates have numerous innovations relative to their chordate ancestors [699]. Examples include not only the sophisticated brain of higher vertebrates, but also cartilage, teeth, and bone. These tissues serve many roles, ranging from support to feeding. The evolution of bone in turn gave rise to the most obvious and striking vertebrate innovations, which include hinged jaws, the vertebral column, and paired appendages. The latter allowed new forms of swimming, walking, and flying that made many ecological niches accessible. Various duplicate Hox genes are critical for the development of vertebrate-specific traits, suggesting that Hox genes also play important roles in the evolution of these traits. Some of the Hox genes that were duplicated during vertebrate evolution have evolved new functions. Their functional divergence often involved acquiring novel gene expression rather than novel biochemical activities [89, 131, 292,
790]. A case in point is the duplicate Hox genes Hoxa3 and Hoxd3 [292]. The effect of a mutation in one of these two genes depends on which gene is mutated. For example, Hoxa3 mutants show defects in pharyngeal tissues, whereas Hoxd3 mutants have malformed cervical vertebrae [121, 136]. However, expressing one gene where the other is normally expressed, and vice versa, indicates that the two genes can substitute for each other’s function, as long as they are expressed in the same way. Their differences may therefore be caused by quantitative expression changes [292]. Similarly, the protein products of the Hoxa1 and Hoxb1 genes have different functions in hindbrain development. Nonetheless, one of them can substitute for the other when it is expressed at the right time and place [790].
Example 3: Heart evolution
The preceding examples are about spectacular evolutionary radiations from which many diverse species emerged, but gene duplications may also facilitate innovations in individual traits. One example regards the heart, a pump that drives fluid through the body. Such a pump becomes necessary in organisms too large to distribute nutrients and oxygen through diffusion. Hearts originated with a simple architecture, a contractile tube with bidirectional blood flow. A prototypical example is the heart of amphioxus (lancelets), which is thought to resemble that of a basal vertebrate. More advanced hearts are more complex. For instance, the heart of amniotes, including mammals and birds, is a complex four-chambered pump with two atria and two ventricles that separate oxygen-poor from oxygen-rich blood. The heart acquired its sophisticated structure during vertebrate evolution. Fish hearts have a single atrium and a single ventricle; amphibian hearts have two atria and one ventricle; vertebrate hearts, additionally, acquired septa to separate the heart’s chambers, valves to enforce unidirectional flow, and a conduction system for synchronized and forceful heart contraction [667]. A core circuit of transcription factors controls heart development in vertebrates and invertebrates. These factors include proteins named NK2, MEF2, GATA, Tbx, and Hand [154, 578]. Their coding genes have duplicated during vertebrate evolution
[578] (Figure 9.2). One of them, MEF2 (myocyte enhancer factor 2), regulates the expression of contractile muscle proteins. The fruit fly Drosophila has only one MEF2 gene. If this gene loses its function, the expression of contractile proteins in muscle cells ceases [391, 631]. Vertebrates, in contrast, have four MEF2 duplicates with partly divergent functions [62]. A case in point is MEF2c. If it loses its function, a subset of contractile proteins in the heart ceases to be expressed. In addition, the right ventricle no longer forms [461]. Because the population of cells from which the right ventricle forms occurs only in amniotes, the function of MEF2c in its development is arguably novel. Another, particularly striking example involves the Hand gene. Zebrafish (Danio rerio) and amphibians express a single copy of this gene. Both kinds of animals have only one ventricle. The zebrafish Hand gene is necessary for the formation of this single ventricle [877]. In contrast to these organisms, mice express two duplicates of Hand. Among other defects, loss-of-function mutants in Hand1 cannot form the left ventricle, whereas Hand2 mutants fail to form the right ventricle [92, 242, 650, 723]. Thus, the two duplicates have acquired functions that are not only specialized to one of two ventricles. Their functions are also novel, because the structures they help form did not yet exist in fish. All three examples I have just discussed—flowering plant radiation, vertebrate radiation, and heart evolution—share a conspicuous association between gene duplication and complex evolutionary innovations. Based on such associations, many researchers argue that gene duplication is key to innovation, and reasonably so [135, 354, 573, 833]. Unfortunately, such associations are no proof that gene duplications were necessary for innovation. 
And because the processes I have discussed unfolded over tens to hundreds of millions of years, far beyond the time scales of laboratory evolution experiments, we may never have such proof. Although this observation is an important note of caution, gene duplications are too abundant, and their association with innovation too striking, to dismiss their importance for innovation. I will thus show next how they fit into the framework I propose in this book.
[Figure 9.2 image. Values recovered from the upper panel, "Duplicate Cardiac Transcription Factors":]
Gene            Cephalochordates (Amphioxus)   Fish   Amphibians   Amniotes (reptiles, birds, mammals)
Hand            1                              1      1            2
MEF2            1                              4      4            4
GATA            2                              3      3            3
Tbx             ?                              ≥4     ≥5           ≥7
NK2             1                              ≥3     ≥3           ≥2
Heart chambers  1                              2      3            4
[Labels in the lower panel: Complexity; Left Atrium, Right Atrium, Left Ventricle, Right Ventricle.]
Figure 9.2 Duplications of genes involved in heart development. The upper panel (gray) shows the number of duplicates in different groups of chordates (below the panel, together with their characteristic number of heart chambers) for five genes (Hand, MEF2, GATA, Tbx, NK2) encoding transcriptional regulators with central functions in heart development. The lower panel shows a highly simplified vertebrate phylogeny, together with schematic illustrations of a primitive heart (left), and the four-chambered vertebrate heart (right). From [578, 828]. Used with permission from AAAS.
Gene duplications cause robustness
After a gene duplication that creates identical duplicates, both duplicates have redundant functions. Mutations in one of them are thus less likely to be deleterious than before the duplication. Among the many kinds of mutations that can affect a genome, from single nucleotide changes to chromosome rearrangements, gene duplications are unique in this way: only they systematically increase robustness to mutations. This increase in robustness is evident from two complementary lines of experimental evidence. The first comes from efforts to study the function of individual genes. An important approach to study gene function is to eliminate (“knock out”) a gene or its expression, and to examine the phenotype of
the resulting mutant. In many genes whose knockouts have little or no phenotypic effect, gene duplications are behind the absence of such effects [16, 134, 266, 295, 768, 824, 851]. A second line of evidence comes from molecular evolution studies. Duplicate genes can tolerate more nucleotide changes than their single copy counterparts, and are thus under relaxed selection. The phenomenon is most evident if one examines duplicate genes on a genome-wide scale. Here, recent gene duplicates in various eukaryotes can tolerate 10-fold more amino acid changes than older duplicates [417, 479]. I note that remnants of such robustness still exist for the duplicate genes discussed in the preceding sections on organismal innovation. For example, in the thale cress Arabidopsis thaliana, individual duplicates of the flower development genes SEPALLATA show only weak phenotypic effects if their function is lost due to mutations [187, 605]. Similarly, some Hox gene duplicates have retained partially redundant functions, remnants of the robustness that gene duplications cause. Examples include zebrafish Hoxa2 and Hoxb2, which function redundantly in embryonic patterning of the second pharyngeal arch [345]; and the mouse Hox8 genes, which have redundant roles in positioning the hindlimbs [795].
Many new phenotypes become accessible through duplication
The observation that gene duplications (temporarily) increase robustness of duplicated genes is important, because, as we saw in Chapter 8, robustness can facilitate evolutionary innovation. First, it permits the existence of genotype networks. Second, the cryptic genetic variation that a population of molecules with a highly robust phenotype can accumulate allows it to access many novel phenotypes. Gene duplications not only systematically increase robustness, they also do so in a peculiar way that increases the size of the genotype space in which an evolutionary search can take place. In doing so, they increase the number of different phenotypes that can be explored without destroying an existing phenotype.
It is easiest to appreciate this phenomenon with a simple example. In Chapter 4, I discussed the enzyme chorismate mutase with its 93 amino acids. For this enzyme, an estimated fraction $10^{-24}$ of genotype space encodes proteins with the same structure and activity [761]. This fraction would translate into a total number of $10^{-24} \times 20^{93} \approx 10^{97}$ chorismate mutase genotypes. Now consider a hypothetical duplication of a gene encoding this protein. After the duplication, one of the duplicates must maintain its function, whereas the other is free to vary. The genotype space of both proteins taken together has the squared size of the original genotype space. It thus contains $(20^{93})^2$ genotypes. Mathematically speaking, it is the Cartesian product of the two original genotype spaces. Figure 9.3 illustrates how we can think of the two combined spaces geometrically, although, as usual, my two-dimensional caricature does not do justice to the high-dimensional nature of genotype spaces. The two open circles in the middle represent two (identical) gene duplicates immediately after duplication, and the jagged lines terminated by an arrow indicate the independent evolutionary trajectories that the duplicates can take as they begin to change independently in an evolving population. Even if we restricted both evolving duplicates to their respective genotype networks, as shown in the figure, an evolving population containing them could access twice as much phenotypic variation as before the duplication. The reason is that both duplicates undergo mutation independently. They thus explore their respective genotype network independently, and gain access to novel phenotypes in its neighborhood independently. The effect is equivalent to doubling the size of a population of molecules that explores a given genotype network (in the absence of duplication).
In reality, the effect of gene duplication would be much more dramatic than indicated by this argument. For as long as one of the proteins is confined to its genotype network, the other is free to explore the genotype space. If a complete exploration of this space were possible, then the total number of genotypes that would become accessible through the duplication, while preserving the chorismate mutase phenotype, is given by $10^{97}$ (the number of chorismate mutase genotypes) times $20^{93}$ (the size of genotype space, for the freely evolving molecule), or about $10^{218}$, more than a hundred orders of magnitude greater than the original genotype network. This calculation neglects that the genotype space is much too large to be explored by any one molecule. However, even if the second molecule could merely explore the k-neighborhoods of its genotype network for some small value of k, a vast number of phenotypes would become accessible that is beyond reach if only the immediate 1-neighborhood can be explored.
My exposition so far has focused on duplications of individual genes, but much the same holds for larger scale duplications.
For example, whole genome duplications that occurred in vertebrates and many other organisms duplicated entire regulatory circuits [793]. As far as the genes of these circuits are concerned, the same reasoning as above applies: Such duplications free individual circuit genes to explore new structures and activities, as long as one of the duplicates preserves the old activity. But with the duplication of a regulatory circuit,
Figure 9.3 Gene duplications increase the size of the search space for novel phenotypes. The left and right parallelograms stand for the genotype space associated with each of two duplicate genes. Inscribed into the parallelograms are the genotype networks of each duplicate’s phenotype, with gray circles corresponding to genotypes with the same phenotype, and gray lines indicating neighboring genotypes. Symbols of different shapes and shading indicate genotypes with novel phenotypes that are only one mutation away from the genotype network shown. The two open circles in the middle stand for two hypothetical identical duplicate genes immediately after duplication. The identical protein structures (from chorismate mutase, Protein Data Bank identifier: 2gtv [201]) are shown merely to indicate that after a duplication that creates identical duplicates, the duplicates’ phenotypes will be identical. The jagged black lines illustrate that each duplicate mutates independently, and thus explores genotype space independently from the other duplicate.
not only the circuit genes, but also their regulatory interactions undergo duplication. This permits, in addition, the exploration of novel regulatory interactions, and thus of new gene activity or expression phenotypes, while the original phenotype can be preserved. This regulatory divergence is more difficult to analyze experimentally than the divergence of individual molecules. For example, while duplicate genes mutate and diversify independently from one another, regulatory interactions in a duplicate regulatory circuit are intertwined: They involve both original and duplicate genes. Taken together, these observations show that the kind of robustness caused by gene duplications can increase access to novel phenotypes. But as if this was not enough, gene duplications also solve another problem that is common in evolutionary innovation: an old function of a system may need to be preserved not only during exploration of genotype space, but also afterwards, after a genotype with a novel function has been found. And often, a single genotype may not be able to execute both functions, or execute them equally well. A candidate example would be an enzyme that needs to catalyze reactions with two different substrates at high rate, or a regulatory protein that needs to bind two very different DNA sequences with high specificity. Here, gene duplications can preserve the
old function of one duplicate, while facilitating not only the origin, but also the optimization of a novel function in the other duplicate [56, 135, 328].
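The genotype counts in the chorismate mutase example can be checked with a few lines of arithmetic on base-10 logarithms. This is a minimal sketch: the function name `duplication_counts` and its parameter names are assumptions made for illustration, while the inputs (20 amino acids, 93 residues, a fraction of 10^-24) are the figures quoted in the text.

```python
import math


def duplication_counts(alphabet_size=20, length=93, log10_fraction=-24.0):
    """Return base-10 logarithms of the genotype counts in the example.

    Logarithms are used because the counts themselves are astronomically
    large. All names here are illustrative, not from the original text.
    """
    # Size of the genotype space of one protein: 20^93.
    log10_space = length * math.log10(alphabet_size)
    # Genotypes with the chorismate mutase phenotype: 10^-24 * 20^93.
    log10_network = log10_fraction + log10_space
    return {
        "log10_space": log10_space,
        "log10_network": log10_network,
        # Joint space of both duplicates: (20^93)^2.
        "log10_pair_space": 2 * log10_space,
        # One duplicate stays on the network, the other is unconstrained.
        "log10_accessible": log10_network + log10_space,
    }


if __name__ == "__main__":
    for name, value in duplication_counts().items():
        print(f"{name}: about 10^{value:.1f}")
```

Under these inputs the genotype network comes out near 10^97 and the accessible set near 10^218, so the gain from duplication is about 121 orders of magnitude, consistent with the "more than a hundred" stated above.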
Summary
Gene duplications increase mutational robustness in a way that greatly increases the accessibility of novel phenotypes. They can preserve an existing phenotype not only during a search for novel phenotypes, but also afterwards, once such phenotypes have been encountered. It is thus not surprising that duplications have been associated not only with dramatic innovations in individual organs, such as the heart, but also with the diversification of vast groups of organisms, such as flowering plants and vertebrates. Although we may never have absolute certainty about their involvement in innovations whose origin and refinement may require many millions of years, it would be surprising if their association with such innovations were coincidental. In the most general terms, my observations here can be summarized in a syllogism, a logical argument dating back to ancient Greece, where two premises are used to infer a conclusion. The first premise is that gene duplication causes robustness; the second premise is that robustness can facilitate innovation. The conclusion is that gene duplications can facilitate innovation.
CHAPTER 10
The role of recombination
Thus far, I have focused on the smallest unit of genetic change, mutations that affect one system part, be it a nucleotide, an amino acid, a regulatory interaction, or a metabolic gene. As we have seen, a sequence of such small changes can transform a system gradually while leaving its phenotype intact, and yet allow it to explore many novel phenotypes. Mutations are undoubtedly important for evolutionary innovation [68, 330, 607, 684]. However, recombination, a larger scale kind of change (Figure 10.1), may be at least as important [414, 576, 694, 888, 890]. From the perspective I discuss here, recombination causes long jumps in genotype space. By reaching into far-flung regions of this space, such jumps facilitate the exploration of different phenotypes. To see this, recall that neighborhoods in far-apart regions of this space contain very different novel phenotypes. In other words, recombination can be more effective than mutation for exploring new phenotypes. (I will discuss some experimental evidence below.) But while long jumps may facilitate phenotypic exploration, they also cause a major problem. The genotype network of any one phenotype typically comprises a tiny fraction of genotype space. One would thus think that a long jump through this space should almost always end outside this network. If so, then a key benefit of genotype networks, the preservation of the old during a search for the new, would be lost. The central theme of this chapter is that recombination, perhaps surprisingly, does not suffer from this problem. For example, recombination perturbs phenotypes much less than mutation. It has weak effects on existing phenotypes, and yet it can help explore novel phenotypes that are very different from a starting phenotype. These two properties make recombination a powerful facilitator of evolutionary innovation.
Different kinds of recombination
Perhaps the simplest kind of recombination is homologous recombination, as it occurs during meiosis. When taking place between two protein-coding regions, such recombination exchanges parts of these regions and leaves their length unchanged (Figure 10.1a). Other kinds of recombination are more complex and involve unequal exchange among the recombining partners. Such exchange is facilitated if two molecules share at least short stretches of identical DNA sequence. Unequal exchange can occur, for example, between protein-coding genes that share short, repeated motifs of DNA sequences. It can also occur between genes that encode multiple, similar protein domains. Such recombination can create novel proteins with new and unique domain combinations. The genomes of higher organisms encode thousands of proteins with multiple domains, and the same domain is often found in many different proteins of different functions. These observations speak to the power of recombination to create novel proteins [93, 269, 418, 449, 580].
In addition to this role in generating novel proteins, recombination can also cause change on a much larger scale. It can rearrange genomes, causing duplications, inversions, and translocations, where entire chromosome segments are swapped. Repetitive DNA, and in particular transposable elements, play important roles in such recombination [458, 576]. Wherever repetitive DNA occurs in a genome, it facilitates an increased incidence of unequal recombination. In humans, some 50 percent of the genome consists of repetitive DNA, much of it derived from transposable elements [432, 806]. In addition to their passive role in facilitating recombination, transposable elements can also generate recombinant DNA through their active movement.
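[Figure 10.1 image. Panel (a), sequences recovered from the image; labels follow the caption and a comparison of the three sequences:]
Parent 1: MPTYIHELLYTLLLTYLSSPSPRSGPLRSGPLRFRRIQHINSPSPSSTRAVLASFSEENLIPD
Parent 2: MYPTIHEPPYTLLLTYLSSHLPRSGPLRSGPLRFRRIQHINSPSPSSRATVLASFSEENPSSD
Chimaera: MPTYIHELLYTLLLTYLSSPSPRSGPLRSGPLRFRRIQHINSPSPSSRATVLASFSEENPSSD
[Panel (b): circuit diagrams, not recoverable as text.]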
Figure 10.1 Recombination in proteins and regulatory circuits. (a) The left side of the figure shows two hypothetical “parental” protein sequences. The right side shows a chimaeric protein resulting from a reciprocal recombination event between these parents. (b) The left side of the figure shows two hypothetical regulatory circuits that differ in regulatory interactions (black and grey arrows) among circuit genes (black rectangles). The right side shows a chimaeric circuit created through recombination between the parents. The recombinant circuit contains a mix of the parents’ regulatory interactions.
When inserting into a gene, they can lead to the creation of novel coding regions; and when inserting near a gene they can affect its regulation. Some 50 human genes consist largely of sequences derived from transposable elements, and many more contain exons derived from such elements [576]. A final form of recombination is lateral gene transfer [96, 445, 568, 569]. It is typically not a reciprocal exchange of DNA, but a unidirectional transfer from a donor to a host genome. Aside from the addition of new genes, it thus does not change the host’s gene content as radically as other kinds of genome rearrangements. For this reason, lateral gene transfer may be less disruptive. Therefore, it poses less of a problem for preserving well-adapted
phenotypes than other kinds of recombination. I will not discuss it in detail here. The non-reciprocal kinds of recombination I just discussed can change the size of a system, be it the length of a molecule, or the number of genes in a cellular circuit. This property makes it difficult to compare the substrates and the outcome of non-reciprocal recombination. Simply put, the reason is that recombination substrates and products exist in genotype spaces of different dimensions. This may represent a fundamental obstacle to analyzing the effects of non-reciprocal recombination systematically [724]. Because my objective is such a systematic analysis, I will thus here focus on reciprocal recombination, whose substrates and products occur in the same
genotype space. The principles I will describe, however, may well apply to all kinds of recombination.
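Reciprocal recombination between two equally long sequences can be sketched as a single crossover that swaps the tails of the parents. The example below uses the three sequences printed in Figure 10.1a. Which two are the parents and which is the chimaera is an inference from comparing the sequences, and the function name `reciprocal_crossover` and the crossover position 30 are assumptions made for illustration (any position inside the stretch shared by both parents gives the same products).

```python
def reciprocal_crossover(parent_a, parent_b, point):
    """Exchange the tails of two equal-length sequences at `point`.

    Returns both reciprocal products. Because the parents have the same
    length, so do the products: they stay in the same genotype space.
    """
    if len(parent_a) != len(parent_b):
        raise ValueError("reciprocal recombination needs equal-length parents")
    if not 0 <= point <= len(parent_a):
        raise ValueError("crossover point lies outside the sequence")
    child_ab = parent_a[:point] + parent_b[point:]
    child_ba = parent_b[:point] + parent_a[point:]
    return child_ab, child_ba


# Sequences as printed in Figure 10.1a (parent/chimaera labels are inferred).
PARENT_1 = "MPTYIHELLYTLLLTYLSSPSPRSGPLRSGPLRFRRIQHINSPSPSSTRAVLASFSEENLIPD"
PARENT_2 = "MYPTIHEPPYTLLLTYLSSHLPRSGPLRSGPLRFRRIQHINSPSPSSRATVLASFSEENPSSD"
CHIMAERA = "MPTYIHELLYTLLLTYLSSPSPRSGPLRSGPLRFRRIQHINSPSPSSRATVLASFSEENPSSD"

if __name__ == "__main__":
    first, second = reciprocal_crossover(PARENT_1, PARENT_2, 30)
    print(first == CHIMAERA)
    print(len(first) == len(second) == len(PARENT_1))
```

The second product, which the figure does not show, carries the head of the second parent and the tail of the first.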
The power of recombination
I will highlight the power of recombination with DNA shuffling [731], a widely used technique to engineer novel proteins and higher order systems in the laboratory [111, 152, 153, 443, 552, 719, 731, 888]. Briefly, DNA shuffling starts from a mix of different “parental” variants of equally long DNA sequences, such as different alleles of a gene. These sequences are cut into small fragments at random locations within them, denatured (made single-stranded), and reannealed to render them double-stranded again. The result is a complex mixture of partially single-stranded, partially double-stranded chimaeric DNA sequences. The double-stranded regions of these DNA sequences are then extended by DNA polymerase in a polymerase chain reaction. After multiple cycles of denaturation, reannealing, and synthesis of new DNA using the polymerase chain reaction, recombinant DNA molecules of the same length as the parental DNA emerge. Each such molecule consists of multiple recombined fragments of the parental DNA molecules [731].
Two experiments that use this technique illustrate the power of recombination. Crameri and collaborators applied DNA shuffling to recombine genes encoding cephalosporinases. These are enzymes that confer resistance to cephalosporins, a class of antibiotics. The aim of the experiment was to create cephalosporinases that confer resistance against high concentrations of antibiotics. A single DNA shuffling experiment recombined four cephalosporinases that showed between 18 and 42 percent divergence on the DNA level [153]. The experiment yielded a chimaeric cephalosporinase with a 270-fold increase of resistance to the cephalosporin moxalactam, as compared to the parental sequences. By comparison, the highest improvement achievable in the same amount of time through point mutations was an 8-fold increase over the parent [153].
The same approach can also shuffle DNA sequences on a much larger scale, recombining DNA containing multiple genes or entire genomes [888]. For example, recombination of entire genomes has been used to produce strains of the bacterium
Streptomyces fradiae that produce high amounts of the antibiotic tylosin. In this approach, recombination was 20 times more effective than random mutagenesis in improving tylosin production [888]. These and other experiments show that experimental recombination of DNA sequences can rapidly generate new genes, pathways, and genomes with new and desirable features [111, 152, 153, 443, 552, 719, 731, 888].
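The fragment-and-reassemble logic of DNA shuffling described above can be sketched in a few lines of Python. This is a deliberately crude, purely illustrative model: it ignores annealing chemistry and sequence homology, and simply assembles a chimaera by copying each fragment from a randomly chosen parent. The function name `shuffle_chimaera`, the parameter `n_breaks`, and the toy parental sequences are assumptions made for this illustration and do not come from the cited experiments.

```python
import random

def shuffle_chimaera(parents, n_breaks, rng):
    """Build one chimaeric sequence from equally long parental sequences.

    Toy model (an assumption, not the laboratory protocol): the sequence is
    cut at n_breaks random positions, and every resulting fragment is copied
    from one randomly chosen parent.
    """
    length = len(parents[0])
    if any(len(p) != length for p in parents):
        raise ValueError("parental sequences must be equally long")
    # Random fragment boundaries, sorted along the sequence.
    breaks = sorted(rng.sample(range(1, length), n_breaks))
    bounds = [0] + breaks + [length]
    pieces = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        donor = rng.choice(parents)  # each fragment comes from one parent
        pieces.append(donor[start:end])
    return "".join(pieces)

# Hypothetical parental "alleles", chosen so that fragment origins are visible.
rng = random.Random(0)
parents = ["AAAAAAAAAAAA", "CCCCCCCCCCCC", "GGGGGGGGGGGG"]
chimaera = shuffle_chimaera(parents, n_breaks=3, rng=rng)
print(chimaera)
```

Whatever the random draws, the chimaera has the same length as the parents, and every position carries the nucleotide that one of the parents has at that position.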
Recombination preserves existing gene expression phenotypes in regulatory circuits
Engineering experiments like these illustrate the power of recombination to identify phenotypes with novel properties. However, they are not designed to examine the central problem recombination poses: it might disrupt already existing, well-adapted phenotypes, and thus often have large deleterious effects. Specifically, the superior genotypes that these experiments find might be few among an astronomical number of potentially inactive chimaeras [888]. I will now address this problem of deleterious recombination effects for regulatory circuits and molecules (Figure 10.1). I will not discuss recombination in genome-scale metabolic networks here, for two reasons. First, in prokaryotes, the kind of frequent, obligate recombination that is characteristic of meiosis is absent, and horizontal gene transfer, as I discussed, will usually be less disruptive to phenotypes. Second, in eukaryotes with their frequent and often obligate meiotic sex, individuals in the same interbreeding population are typically similar in their genotypes. They would differ in the presence of few (if any) of their hundreds to thousands of metabolic reactions. Recombination would thus usually not alter their metabolic genotype dramatically. I will begin by examining the effects of recombination in transcriptional regulatory circuits of the kind studied in Chapter 3. Consider two individuals (“parents”), each of which harbors a regulatory circuit genotype that produces a gene expression phenotype. Both individuals have identical phenotypes. Thus, they belong to the same genotype network or genotype set (Figure 10.2). Their genotypes may differ in one or more regulatory interactions.
[Figure 10.2 diagram; labels: Parent 1, Offspring, Parent 2.]
Figure 10.2 Recombination causes long jumps through genotype space. The figure illustrates schematically that the offspring of a recombination event may lie far from either parent in genotype space. Its neighborhood will thus contain novel phenotypes that are different from those accessible near either parent (Chapter 5). The large rectangle stands for genotype space. Small grey circles connected by lines indicate neighboring genotypes on one hypothetical genotype network. Symbols of different shapes and shading indicate genotypes with a novel phenotype that are just one mutation away from genotypes on this genotype network. The large black and white circles indicate two hypothetical parental genotypes. The large grey circle stands for a recombinant offspring of the two parents. In the image, the offspring genotype is equally distant from either parent, but in reality it may be closer to one or the other parent, depending on details of the recombination event that produced it. The offspring may lie on the same genotype network, and thus have the same phenotype as the parents, as indicated in this hypothetical example; or it may lie outside this genotype network and thus have a different phenotype.
These two individuals produce offspring through a reciprocal exchange of their regulatory genotypes. If all circuit genes occurred in a closely linked gene cluster on a single chromosome, then the likelihood
of a meiotic recombination event between them would be small. For this reason, I will here consider the opposite scenario, where the individual genes of a circuit are not closely linked, and thus recombine freely. This scenario is important, because here the potentially deleterious effects of recombination will be most evident. Specifically, this scenario requires that every gene in each “offspring” network receives with probability one half the regulatory region of one of the parents, and with probability one half the regulatory region of the other parent. (Recall that in these circuits we focus on the evolution of regulation through changes in cis-regulatory regions.) A quantity of interest is the probability that the offspring of recombination between two parents would no longer have the parental phenotype. This probability indicates the disruptive effects of recombination. A complication is that its value will depend on how different the parents are from one another. Recombination between genotypically similar parents will produce offspring whose genotypes are also similar to either parent. Thus, we would expect that their phenotypes are also often unchanged. Conversely, genotypically very different parents would usually produce genotypically and phenotypically diverse offspring. One way to take parental similarity into account is to compare the offspring’s genotype to one of the parents, and determine the number m of regulatory interactions in which it differs from this parent. To this end, I will examine the probability RR(m) that a recombination event changing m regulatory interactions of a viable circuit preserves its phenotype. It is useful to compare this quantity to the probability Rμ(m) that m independent random changes (mutations) of individual regulatory interactions preserve the phenotype. 
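The free-recombination scenario described above can be sketched as follows. The sketch assumes, for illustration only, that a circuit is stored as a list of rows, where row i holds the regulatory inputs (the cis-regulatory region) of gene i; the names `recombine_freely` and `count_changed_interactions` and the numerical interaction strengths are hypothetical. Each gene of an offspring receives the row of one parent or the other with probability one half, and m is then counted by comparing the offspring to a reference parent.

```python
import random

def recombine_freely(parent1, parent2, rng):
    """Reciprocal free recombination of two circuits.

    For every gene, one offspring inherits the regulatory region (row) of
    parent1 and the other offspring that of parent2, with probability 1/2
    for either assignment.
    """
    offspring_a, offspring_b = [], []
    for row1, row2 in zip(parent1, parent2):
        if rng.random() < 0.5:
            offspring_a.append(list(row1))
            offspring_b.append(list(row2))
        else:
            offspring_a.append(list(row2))
            offspring_b.append(list(row1))
    return offspring_a, offspring_b

def count_changed_interactions(circuit, reference):
    """Number m of regulatory interactions in which circuit differs from reference."""
    return sum(w != v
               for row_c, row_r in zip(circuit, reference)
               for w, v in zip(row_c, row_r))

# Two hypothetical three-gene parents that differ in three interactions.
rng = random.Random(1)
parent1 = [[0.5, 0.0, -1.2], [0.0, 0.8, 0.0], [-0.3, 0.0, 0.4]]
parent2 = [[0.5, 0.7, -1.2], [0.0, 0.8, 0.0], [-0.3, -0.9, 0.0]]
offspring_a, offspring_b = recombine_freely(parent1, parent2, rng)
m_a = count_changed_interactions(offspring_a, parent1)
m_b = count_changed_interactions(offspring_b, parent1)
print(m_a, m_b)
```

Because the exchange is reciprocal, the two offspring together account for every difference between the parents, so `m_a + m_b` equals the number of interactions in which the parents differ. Estimating RR(m) would additionally require a model that maps a circuit to its gene expression phenotype, which this sketch does not include.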
By comparing the two quantities RR(m) and Rμ(m), we can assess how strongly recombination affects a genotype when compared to an equivalent amount of mutational change. Figure 10.3 shows RR(m) and Rμ(m) for circuits sampled at random from the same genotype set [492]. One can see that even for recombination events that change only m=1 regulatory interaction, recombination is already less likely than mutation to change a circuit phenotype. Specifically, more than 90 percent of recombinant offspring that differ from their most closely related parent by only one regulatory interaction preserve the parental phenotype. In contrast, only 75 percent of circuits where mutations changed one regulatory interaction preserve this phenotype. With increasing numbers of changes m, these differences increase. For example, when a recombination event changes m=12 regulatory interactions, 50 percent of all offspring circuits preserve the parental phenotype, whereas fewer than 8 percent of circuits with 12 random mutations preserve this phenotype (Figure 10.3). These observations show that exchanging regulatory interactions that are already part of a viable circuit greatly increases the likelihood of preserving the circuit’s phenotype.
[Figure 10.3 plot: vertical axis, robustness (0.0 to 1.0); horizontal axis, number m of regulatory interactions changed (1 to 12); series: recombinational robustness RR(m) and mutational robustness Rμ(m).]
Figure 10.3 Recombination has weaker phenotypic effects than mutation. The figure shows the probabilities Rμ(m) and RR(m) that m changes of individual regulatory interactions caused by mutation and recombination, respectively, leave a circuit’s gene expression phenotype intact. The data are based on 106 circuits of S=12 genes randomly sampled from the same genotype network [492]. Lengths of error bars show one standard deviation. A mutation may cause (i) an existing interaction to disappear, in which case the respective interaction strength wij (Chapter 3) is set to zero; (ii) a new regulatory interaction to appear, in which case the new value is chosen as a Gaussian random variable with mean zero and variance one (N(0,1)); and (iii) an existing interaction to change in magnitude. In the latter case, the sign of the interaction is forced to remain unchanged by choosing a Gaussian (N(0,1)) random variable and multiplying it by (–1) if it is of the wrong sign. For the data shown, the number of regulatory interactions per circuit lies in the interval (S, 3S), and gene expression states E(0) and (E∞) differ in the expression of half of their genes. All relevant circuit properties depend only on how different these two gene expression states are [123]. Similar observations exist for circuits of different size. Figure and caption adapted from [492], used with permission from Genetics Society of America.
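The three kinds of mutation listed in the caption of Figure 10.3 can be written down as a small Python function. This is a minimal sketch: the function name `mutate_interaction` and the `kind` argument are hypothetical, and how often each kind of mutation occurs is not stated in the caption, so the caller chooses the kind explicitly.

```python
import random

def mutate_interaction(w, kind, rng):
    """Return the mutated strength of one regulatory interaction w.

    kind = "loss":   (i) an existing interaction disappears, w becomes zero
    kind = "gain":   (ii) a new interaction appears, drawn from N(0,1)
    kind = "change": (iii) an existing interaction changes in magnitude,
                     keeping its sign
    """
    if kind == "loss":
        return 0.0
    if kind == "gain":
        if w != 0.0:
            raise ValueError("a gain requires an absent interaction (w == 0)")
        return rng.gauss(0.0, 1.0)
    if kind == "change":
        if w == 0.0:
            raise ValueError("a change requires an existing interaction")
        new_w = rng.gauss(0.0, 1.0)
        # Multiply by -1 if the drawn value has the wrong sign.
        if (new_w > 0) != (w > 0):
            new_w = -new_w
        return new_w
    raise ValueError("unknown kind of mutation: " + kind)

rng = random.Random(42)
print(mutate_interaction(0.8, "loss", rng))
print(mutate_interaction(0.0, "gain", rng))
print(mutate_interaction(-0.8, "change", rng))
```

Under case (iii), an activating interaction stays activating and a repressing interaction stays repressing; only its strength is redrawn.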
The following is a complementary way of examining the effects of recombination [492]. If the parent circuits differ in I regulatory interactions, then one of the recombinant offspring will differ from one parent by m regulatory interactions, whereas the other offspring will differ by (I–m) regulatory interactions. We can then express the distance of the offspring from either parent as a fraction of I, i.e., as a recombination distance DR=m/I. This recombination distance varies between 0 and 1. A value of DR close to zero means that the offspring is close to the reference parent, whereas a value of DR close to one means that the offspring is very distant from the reference parent, but very close to the other parent. Intermediate values of DR mean that the offspring is approximately equally distant from either parent.
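As a worked illustration of the recombination distance DR=m/I, the following sketch computes it for a pair of reciprocal offspring. It assumes that a circuit genotype is represented as a flat sequence of interaction values and that every offspring interaction comes from one of the two parents; the function name and the example genotypes are hypothetical.

```python
def recombination_distance(offspring, reference_parent, other_parent):
    """Recombination distance D_R = m / I of an offspring from a reference parent.

    m: interactions in which the offspring differs from the reference parent
    I: interactions in which the two parents differ
    """
    m = sum(o != r for o, r in zip(offspring, reference_parent))
    total = sum(r != p for r, p in zip(reference_parent, other_parent))
    if total == 0:
        raise ValueError("parents are identical, so D_R is undefined")
    return m / total

# Hypothetical parents that differ in I = 4 interactions.
parent_a = (1, 0, -1, 0, 1, 0, 0, -1)
parent_b = (1, 1, -1, 0, 0, 0, -1, 1)
# Reciprocal offspring: the first takes one differing interaction from
# parent_b, the second takes the remaining three.
offspring_1 = (1, 1, -1, 0, 1, 0, 0, -1)
offspring_2 = (1, 0, -1, 0, 0, 0, -1, 1)

d1 = recombination_distance(offspring_1, parent_a, parent_b)  # m = 1
d2 = recombination_distance(offspring_2, parent_a, parent_b)  # m = I - 1 = 3
print(d1, d2)
```

The two distances sum to one, which restates that the reciprocal offspring differ from the reference parent by m and (I–m) interactions, respectively.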
[Figure 10.4 plot: vertical axis, fraction of recombinant networks with preserved phenotype (0.4 to 1.1); horizontal axis, recombination distance DR (0 to 0.95); series: Sample, Mutation–selection, Mutation–selection–recombination.]
Figure 10.4 Recombination’s disruptive effect on a phenotype depends on the distance of an offspring circuit from either parent. The vertical axis shows the fraction of recombinant offspring circuits with the same phenotype as the parent, as a function of the recombination distance DR between parent and offspring (horizontal axis, see text). The recombination distance is normalized to values ranging between zero and one. Data are shown for parental circuits sampled uniformly from the same genotype network (“sample”), for circuits from a population in mutation–selection balance, and for circuits from a population in mutation–selection–recombination balance. Note the very high fraction of viable recombinants for the population in mutation–selection–recombination balance. Data is based on circuits with S=12 genes, number of regulatory interactions per circuit in the interval (S,3S), as well as initial (E(0)) and equilibrium ((E∞)) gene expression states where 50 percent of genes differ in their activity. All relevant circuit properties depend only on how different these two gene expression states are [123]. The middle and upper curves are based on populations of 1000 circuits, and μ=1 mutations of regulatory interactions per circuit and generation. Lengths of error bars correspond to one standard deviation, and are too small to be visible for any of the data points shown. From [492], used with permission from Genetics Society of America.
Figure 10.4 examines the relationship between the recombination distance DR and the probability that recombination preserves the parental phenotype. For now, I will focus on the lower-most set of points (closed circles). These data are based on parental regulatory circuits that are sampled at random from a set of genotypes with the same phenotype [492]. The figure shows that offspring very similar to either parent, where DR is close to zero or one, are very likely to preserve the parental phenotype. The likelihood that recombination preserves the phenotype follows a parabolic, U-shaped curve, with a minimum at intermediate recombination distances DR. This means that recombination is most likely to change a phenotype if the recombinant circuit is maximally different from either parent.
I will return to the significance of this figure for my main argument shortly.
Recombination preserves protein structure and function
The weak effects of recombination in Figure 10.3 may be peculiarities of transcriptional regulation circuits. Alternatively, they may be generic properties that hold for broader classes of systems, and that reflect fundamental organizational properties of genotype space. A mix of computational and experimental evidence from proteins argues for the latter possibility [156, 196]. One such study focused on lattice proteins, the computational models of protein folding I discussed in Chapter 3 [156]. Its authors studied sequences that fold into the same structure and subjected pairs of such
sequences to recombination. They found that 78.9 percent of recombination products fold stably into a structure, and the vast majority of them (99.3 percent) adopt a structure identical to that of the parents. Another relevant study compared the effects of recombination in real proteins and lattice proteins [196]. The study’s authors estimated the probability RR(m) that a recombination event changing m amino acids preserves protein structure, and compared it to the probability Rμ(m) that the same number of mutational changes preserves secondary structure. For lattice proteins with Rμ(1)=0.1, that is, where 10 percent of a protein’s 1-mutant neighbors have the same structure, they found that the fraction of recombination events that change a single amino acid and preserve protein structure is RR(1)≈0.7. In other words, recombination is seven times more likely than point mutation to preserve a lattice protein’s structure. For mutationally more robust proteins where Rμ(1)=0.5, RR(1) exceeds 0.85. For larger numbers m of amino acid changes, mutations typically have dramatically more disruptive effects than recombination. For example, for a structure that remains intact with a probability of less than 1 percent after five independent mutations (Rμ(5)<0.01), more than 30 percent of recombination events that change five residues leave the structure intact (RR(5)>0.3). This means that here recombination is more than thirty times more likely to preserve this structure than the same amount of change caused by mutation [196]. These qualitative differences between recombination and mutation have been confirmed in experimentally constructed recombinants of two well-studied β-lactamase proteins. This enzyme cleaves and inactivates antibiotics that contain a four-atom ring called a β-lactam. Such antibiotics include penicillins and ampicillin. β-lactamases endow bacteria with resistance against these antibiotics.
The experimenters used two β-lactamases, called PSE-4 and TEM-1, that share 43 percent of their amino acids. They constructed synthetic recombinant enzymes with various amounts of amino acid change relative to either parent [196]. For comparison, they also constructed enzymes with the same number of amino acid changes, but where these changes are caused by random mutation. For both classes of proteins (recombinant and mutant), they estimated what fraction of proteins had preserved their molecular activity. They did so by identifying the fraction of recombinant or mutant proteins that allowed E. coli cells to survive treatment with the antibiotic ampicillin. The experimenters found that a single amino acid exchange produced through recombination has a probability of RR(1)=0.79 to preserve protein function. In contrast, if random mutation causes this change, then this probability is only Rμ(1)=0.54. Thus, as in regulatory circuits and in lattice proteins, mutations are much more likely than recombination to disrupt a protein phenotype. Increasing numbers of mutations enhance these differences dramatically, as Figure 10.5 shows [196]. For example, recombinational change of some 10 amino acids has a greater than 20 percent chance of preserving protein function, whereas the same number of random mutations is ten times more likely to disrupt this function. More generally, the probability Rμ(m) that m mutations preserve a structure decreases exponentially with increasing m, but the same does not hold for recombination. The effects of recombination show a parabolic distribution similar to that shown in Figure 10.4 for regulatory circuits. The effects of mutation in this system have only been measured up to some 30 mutations, but if extrapolated to the number of changes that distinguish maximally different recombinants from their parents, then recombination would be 16 orders of magnitude more likely to preserve phenotype than the same number of mutations [196]. Taken together, these observations suggest that the weak effects of recombination relative to mutation are not a peculiarity of one kind of system, but may be a generic property of different systems.
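The fold differences quoted in this section follow directly from the ratio RR(m)/Rμ(m). The short sketch below recomputes them from the values stated in the text; the function name `fold_advantage` and the dictionary layout are illustrative choices. Note that the lattice protein value for m=5 uses Rμ(5)=0.01, which the text gives only as an upper bound, so the resulting ratio is a lower bound.

```python
def fold_advantage(r_recombination, r_mutation):
    """How many times more likely recombination is than mutation to
    preserve a phenotype, for the same number m of changes."""
    return r_recombination / r_mutation

# (R_R(m), R_mu(m)) pairs as quoted in the text.
quoted = {
    "lattice protein, m=1": (0.7, 0.1),
    "lattice protein, m=5": (0.3, 0.01),   # R_mu(5) < 0.01: ratio is a lower bound
    "beta-lactamase, m=1": (0.79, 0.54),
}

ratios = {label: fold_advantage(rr, rmu) for label, (rr, rmu) in quoted.items()}
for label, ratio in ratios.items():
    print(f"{label}: {ratio:.2f}-fold")
```

For a single amino acid change, the advantage of recombination is modest in the β-lactamases (roughly 1.5-fold) and larger in the lattice proteins (seven-fold); the gap widens as m grows.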
[Figure 10.5 plot: vertical axis, fraction of functional β-lactamases (logarithmic, 0.0001 to 1); lower horizontal axis, recombination distance DR (0 to 1.0); upper horizontal axis, amino acid substitutions m (0 to 150); axis ends labeled PSE-4 and TEM-1; series: Recombination, Mutation.]
Figure 10.5 Recombination in β-lactamases is much more likely than mutation to preserve protein structure. The lower horizontal axis shows the recombination distance DR, the distance of recombined β-lactamases to the PSE-4 parent, normalized to range between zero and one. The upper horizontal axis corresponds to the absolute number of amino acid changes caused by recombination between PSE-4 and TEM-1 β-lactamases. The numbers on this axis are also relative to the PSE-4 parent. Thus, the maximally possible number of 150 changes corresponds to the TEM-1 parent. The black squares show the fraction of functional recombinants binned according to their divergence from the PSE-4 parent. Error bars correspond to one standard error of the mean [196]. The line labeled “Mutation” is derived from mutagenesis and subsequent measurement of β-lactamase activity. Figure courtesy of Allan Drummond. Used with permission of the National Academy of Sciences, USA.
The explanation may be straightforward: recombination swaps parts of a system that are able to form a given phenotype and, in this sense, have been “pretested.” In contrast, mutation changes a system part for parts that may be incompatible with this phenotype [111, 196, 811]. One could also say that recombination exchanges functional system parts for other such parts. Mutations, in contrast, need not do so. The pattern of hydrophobic and hydrophilic amino acids on an amino acid chain serves to illustrate this principle. This pattern is necessary for the formation of a given protein structure [87]. For example, properly spaced hydrophobic amino acids may be necessary to form a protein’s hydrophobic core. This means that some sequences of hydrophobic and hydrophilic amino acids are compatible with a given protein structure, whereas others are incompatible. Although the PSE-4 and TEM-1 lactamases that I just discussed have only 43 percent amino acid identity, if one considers only whether an amino acid is hydrophobic or polar, this identity rises to 76 percent [196]. Reciprocal recombination will swap or exchange amino acids that preserve hydrophobicity along the chain, and thus preserve compatibility with a given structure. The same considerations would hold for the volume of amino acid side chains and for their electric charge. In regulatory circuits, a similar principle holds. Recombination swaps regulatory interactions that are compatible with a given gene activity phenotype. Examples include regulatory inputs to a gene
that stabilize its expression (or repression) in an optimal expression phenotype [123, 492]. Two parental circuits may differ greatly in their genotype, but they may share such stabilizing interactions. If so, recombination involving such interactions would preserve a gene’s expression state.
Evolved robustness to recombination
Everything I have said thus far applies to systems that may or may not have been subject to recombination in their evolutionary history. Continued exposure to recombination and/or mutation, as it turns out, may dramatically increase the likelihood that recombination leaves a phenotype intact. To demonstrate this effect of recombination for regulatory circuits, we examined populations of circuits that had been subject to repeated rounds of mutation of individual regulatory interactions, and selection preserving their gene expression pattern, until the population had reached an equilibrium between mutation and selection [492]. In addition, we examined populations that had been subject to mutation, selection, and recombination, and that had reached a mutation–selection–recombination equilibrium. Figure 10.4 summarizes the effects of recombination in such populations. The figure shows that mutation and selection alone increase the likelihood that recombination preserves a gene expression phenotype by more than 40 percent, from less than 0.45 for randomly sampled circuits (black circles), to over 0.65 for populations in mutation–selection balance (open squares; both numbers are for the largest recombination distance DR=0.5). More dramatic, however, is the effect of ongoing recombination itself. The open diamonds in Figure 10.4 indicate the probability that recombination leaves a gene expression phenotype unchanged, for populations in mutation–selection–recombination balance. This probability exceeds 0.995, even for recombinants with the maximal distance DR=0.5 from either parent. The same increase in robustness is evident if one examines the likelihood that a given number of regulatory changes caused by recombination leaves a phenotype unchanged. For example, in a population in mutation–selection–recombination balance, the probability that 10 independent mutations leave a phenotype intact is Rμ(10)=0.49, whereas the same probability but for changes in 10 regulatory interactions caused by recombination equals RR(10)=0.993 [492]. In sum, continued exposure to recombination can dramatically increase robustness to recombination. In population genetic theory, the disruptive effects of recombination are conventionally expressed in terms of a “genetic load” [167, 539]. A population’s genetic load designates a mean fitness lower than could be attained in the absence of some agent of evolutionary change, such as mutation, migration, or recombination. Think of the load as the amount of “damage” to the population that this agent of change causes.
Recombination, one might believe, should always increase the genetic load of a population in mutation–selection–recombination balance, because it causes disruption of an optimal phenotype in at least some individuals of the population. However, this is not necessarily the case [94, 467, 492]. In the context of the regulatory circuits I just discussed, we can define the genetic load as the fraction of a generation’s offspring that does not have the optimal, parental gene expression phenotype. Figure 10.6 shows this load for two kinds of populations. One of the populations is in mutation–selection equilibrium, and does not experience recombination. Its load, which I call Lμ, is caused by mutations only. The other population is in mutation–selection–recombination equilibrium. Its load, Lμr, is therefore caused by both recombination and mutation. The figure shows the ratio Lμr/Lμ of the two genetic loads as a function of an important population genetic parameter, the product of population size and mutation rate Nμ. As I discussed in Chapter 7, this product determines whether a population is polymorphic (Nμ>1) or monomorphic (Nμ<1) most of the time. If the ratio Lμr/Lμ is smaller than one, then the presence of recombination causes a lower load than mutation alone. In this case, a population is better off when exposed to recombination rather than just to mutation, because more of its individuals will have the optimal phenotype. Figure 10.6 shows that for values of Nμ exceeding one, the recombining population can have a lower load. In other words, far from causing a disadvantage, recombination can cause an advantage in this case. To understand this phenomenon, two observations are helpful [34, 467, 492]. First, mutation and not recombination is the prime cause of the load in both populations. The reason is that the likelihood of a phenotypic change is generally smaller for recombination than for mutation, even for populations sampled at random from a genotype network, as the data in Figure 10.3 already showed. This likelihood decreases further in evolving populations subject to recombination, mutation, or both. Consider, for example, the probability that a recombination event changing one or two regulatory interactions leaves a circuit’s phenotype intact. In recombining populations, this probability is RR(1)>0.9999 and RR(2)=0.998 [492].
In contrast, the probability that one or two mutations leave this phenotype intact is Rμ(1)=0.94 and Rμ(2)=0.88. This means that the genetic load is dominated by the effects of mutation and not recombination.
[Figure 10.6 plot: vertical axis, Lμr/Lμ (0.70 to 1.15); horizontal axis, log10(Nμ) (–1 to 3); grey area labeled “Recombination advantageous”; Spearman's r = –0.56; P = 10^–6; n = 63.]
Figure 10.6 Recombination can reduce genetic load. The horizontal axis shows the logarithmically transformed product of population size N and mutation rate μ of regulatory interactions per regulatory circuit and generation. The vertical axis shows the ratio Lμr/Lμ of the genetic load of a population of circuits in mutation–selection–recombination balance (Lμr) to the load of a population in mutation–selection balance (Lμ). In the grey area, this ratio Lμr/Lμ is smaller than one. This means that the load of the recombining population is lower than that of the non-recombining population, and recombination provides an advantage. The diagonal line is a linear regression line. All panels are based on circuits with S=12 genes, number of regulatory interactions per circuit in the interval (S,3S), as well as initial (E(0)) and equilibrium ((E∞)) gene expression states where 50 percent of genes differ in their activity. From [492], used with permission from Genetics Society of America.
A second observation needed to understand the phenomenon of Figure 10.6 regards the product of population size and mutation rate Nμ. In a population that is monomorphic most of the time (Nμ<<1), recombination will not affect the structure of the population, because it will recombine mostly genetically identical individuals. In Chapter 7, I discussed that in polymorphic populations (Nμ>>1) subject to mutation and selection, robustness of phenotypes to mutations increases. It turns out that much the same holds for recombining populations. That is, not only does their robustness to recombination increase, as we saw in Figure 10.4, but so does their robustness to mutations [34, 260, 341, 467, 492, 748]. Taken together, these observations can explain how recombination can cause a lower genetic load than mutation alone in sufficiently large populations (Nμ>1). Ongoing recombination increases robustness to recombination, such that recombination can become a minor and mutation a dominant cause of deleterious phenotypic change. In addition, ongoing recombination also increases robustness to mutation, thus decreasing the genetic load (mostly caused
by mutations) compared to when recombination is absent (Figure 10.6). In sum, continual exposure of a population to recombination can dramatically increase the likelihood that recombination preserves a regulatory circuit’s phenotype. It can even decrease a population’s genetic load below that observed in the absence of recombination. These last observations are all based on circuits of transcriptional regulators. They raise the question whether similar phenomena may be observable in other systems. Although the effects of continual recombination are studied less systematically in other systems, limited evidence suggests that similar principles may exist there [111, 399, 868]. For example, a study on the effects of recombination on the regulatory gene network driving segmentation of the Drosophila embryo showed that continual recombination can greatly increase a circuit’s robustness to mutation [399]. An unrelated study on lattice proteins showed that proteins subject to ongoing recombination can evolve dramatically higher robustness of their structure to mutation [868]. Unfortunately, neither of these studies focused on robustness to recombination itself. In this regard, a DNA shuffling experiment of human α-interferons provides at least anecdotal pertinent evidence [10]. α-interferons can interfere with viral infections and can inhibit cell division. In humans, they are encoded by more than 20 tandemly duplicated genes. Such tandem clusters of genes generally facilitate recombination between members of a cluster. Chang and collaborators used the human α-interferon genes in a DNA shuffling experiment. They found that most chimaeric interferons were biologically active, and analyzed four randomly chosen recombinants in greater detail [111]. They found that all four were at least as capable of inhibiting cell division in a human lymphoma cell line as their most active parent. These observations suggest that recombination in these proteins does not generally destroy protein function. The continual exposure of these tandemly arrayed genes to recombination may facilitate such weak recombination effects.
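The genetic load comparison discussed in this section (Figure 10.6) can be sketched as follows, using the definition given above: the load is the fraction of a generation's offspring that lacks the optimal, parental phenotype. The function names and the phenotype labels are hypothetical, and the offspring samples are invented solely to show the arithmetic; they are not data from [492].

```python
def genetic_load(offspring_phenotypes, optimal_phenotype):
    """Fraction of offspring that do not have the optimal phenotype."""
    n_off_target = sum(p != optimal_phenotype for p in offspring_phenotypes)
    return n_off_target / len(offspring_phenotypes)

def load_ratio(load_mu_r, load_mu):
    """Ratio L_mu_r / L_mu; values below one mean that the recombining
    population carries the lower load."""
    return load_mu_r / load_mu

# Invented offspring samples ("opt" marks the optimal phenotype).
offspring_mutation_only = ["opt", "opt", "x", "opt", "opt", "y", "opt", "opt"]
offspring_with_recombination = ["opt", "opt", "opt", "opt", "x", "opt", "opt", "opt"]

load_mu = genetic_load(offspring_mutation_only, "opt")          # 2 of 8
load_mu_r = genetic_load(offspring_with_recombination, "opt")   # 1 of 8
ratio = load_ratio(load_mu_r, load_mu)
print(load_mu, load_mu_r, ratio)
```

In this invented example the ratio is below one, the situation that corresponds to the grey area of Figure 10.6.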
Summary
Recombination causes long jumps through a vast genotype space. Because different
regions of this space contain different novel phenotypes, recombination can thus greatly facilitate the exploration of novel phenotypes. At the same time, however, these long jumps may often destroy a parental, well-adapted phenotype. This is a major problem in understanding recombination’s role in evolutionary innovation. Based on evidence from proteins and regulatory circuits, I show here that this problem is much less severe than one might think. First, recombination causes much weaker effects than mutation, because it exchanges system parts that are compatible with a given phenotype. Second, past exposure of a system to recombination can dramatically increase the system’s robustness to recombination. It may cause the vast majority of recombinants to preserve their parental phenotype, and thus eliminate the problem that recombination destroys well-adapted phenotypes. From this perspective, the power of recombination, which is evident both from laboratory experiments aiming to engineer new phenotypes, and from comparative studies of genes and genomes, becomes readily understandable. I note that none of my discussion here pertains to the evolutionary origin of recombination itself. This is a complex topic beyond the scope of this book [46, 584, 585], although the principles I discussed here may help explain this origin [467].
CHAPTER 11
Environmental change in adaptation and innovation
The environment plays a key role in determining the fate of novel phenotypes. Metabolism serves as an example. An enzyme able to catalyze a novel chemical reaction is of little use if the chemical substrate of this reaction does not occur in its environment. Similarly, a regulatory circuit’s coordinated regulation of existing metabolic enzymes will only provide a benefit if these enzymes can jointly metabolize a substrate available in the environment. And a metabolic network with a novel metabolic pathway that is able to synthesize biomass from a new nutrient is of little use if this nutrient is absent from the environment. From this point of view, the environment determines whether novel phenotypes become innovations, or whether they perish. The environment in these examples comprises the non-living and living world outside an organism, but it may also include an organism’s internal environment. Examples include the cytoplasmic environment in which a protein exerts its function, the region of an embryo in which a regulatory circuit patterns body structures, or the organ in which a metabolic network performs its tasks. An organism’s environment has physical, chemical, and biological aspects. Each of them can differ in myriad ways; and every innovation may require a specific environment. To understand the origin of individual innovations, one must understand the individual environments in which they arose. In this book, however, I am less concerned with individual innovations than with generic characteristics of innovation, properties that are likely to hold for broad classes of them. To study such generic properties, one has to study generic properties of the environment. The most generic property is constancy or change. In the first part of this chapter, I will focus on environmental change. Specifically, I will ask
whether an environment that changes over time facilitates or hinders innovation and evolutionary adaptation. In the second part, I will discuss that novel phenotypes can resemble phenotypes that were adaptive in past environments. In this way, the evolutionary past can influence future phenotypic variation and innovation. In the third part of this chapter, I will suggest a relationship between the complexity of a changing environment, and the complexity of a biological system. That is, the greater the number of different environments a biological system encounters, the more complex it must be to survive in these environments. This relationship is crucial to understand a key property of the systems I study here, namely that they are to some extent robust to mutations in a given environment. I already discussed, in Chapters 6 and 8, how robustness goes a long way toward explaining the existence of genotype networks and their most fundamental properties. However these chapters left the origins of robustness unexplained. At first sight, this origin may seem mysterious, because metabolic networks, regulatory circuits, and molecules are so very different. But they all have something important in common. They need to function in multiple environments. I will argue here that this commonality helps explain why they are robust, and thus, ultimately, why genotype networks exist.
Simultaneous constraints reflect rapid environmental change
Environments can change on different time scales. Two extremes are noteworthy. Slow change occurs on time scales that are many times larger than an organism’s generation time. Exposed to such change, an evolving population would encounter a constant environment for many
generations, and then another constant environment for many generations, and so on. Whether any such change spans multiple generations depends on an organism’s generation time, which can vary from minutes for microbes to years for some vertebrates and plants. Potential examples of long-term change include climatic changes on geological time-scales, such as long global warming and cooling cycles, the “El Niño” temperature oscillation in the tropical Pacific Ocean, which causes floods and droughts in different world regions on time scales of several years, but also seasonal variations in temperature or precipitation for organisms of short generation times. On the other extreme is rapid environmental variation that occurs on time scales shorter than an organism’s generation time. Examples include the ever-changing environments—hot, cold, wet, dry, differences in nutrient quality and quantity—that animals typically encounter over their lifetime. On the smallest scale, they include the fluctuating chemical environment inside a cell, where the concentrations of ions, metabolites, and all molecules change through thermal noise and through changes outside the cell. In between these two extremes falls a broad spectrum of environmental change. Aside from its frequency, such change can be cyclical or singular, it can affect one or multiple environmental variables, and the affected variable can change dramatically or slightly, and so on. Because there are uncountably many ways in which an environment can change, we have little systematic knowledge about the relationship between change and innovation. This is why I have to restrict myself to the most general and qualitative observations. In doing so, I will begin with rapid environmental change, because it constrains systems most strongly: Organisms that can survive rapid change must be able to cope with all environments they encounter within their lifetime. 
To do so, organisms adapt to rapid environmental change physiologically, by producing different phenotypes. In other words, they show phenotypic plasticity. For instance, consider two environments that differ in the sole carbon source available to the organism, such as fructose or xylose. To metabolize each one of these carbon sources requires a set of enzymes (encoded by genes) that convert either of these sugars into molecules that the organism’s central metabolism can use. If fructose is the sole available carbon source, the organism will express the fructose-specific enzymes; if xylose is the sole carbon source, it will express the xylose-specific enzymes. In other words, the expression phenotype of these genes is plastic. This example illustrates that phenotypic plasticity is a property of a genotype. A genome that does not encode both sets of enzymes cannot cope with both environments. Only when the right enzyme-coding genes are present can the organism regulate them. I devote Chapter 13 to phenotypic plasticity and its implications for innovation. For this reason I will not dwell on it here, except to say this: most examples I discuss below, of genotypes viable in multiple environments, could be interpreted as examples of phenotypic plasticity. Of the three central study systems of this book, I will focus here on metabolic networks and regulatory circuits. There are two reasons. First, important principles are most easily explained for these systems, although these principles are likely to hold for protein and RNA molecules as well. Second, each of these systems is ideally suited to study changes in one of two main kinds of environments: external and internal. On the one hand, metabolic networks are ideal to study variation in the external, nutrient environment of an organism. Recall from Chapter 2 that a definition of metabolic phenotypes needs to incorporate properties of the environment. As defined there, these phenotypes reflect the ability to synthesize all biomass components in a chemically minimal environment that varies in its source of energy and elements. On the other hand, the regulatory circuits of Chapter 3 are well-suited to study intraorganismal variation.
The reason is that such circuits play important roles in organismal development; they typically have different roles in different parts of a developing embryo, and at different times during development [104, 268]. Their local chemical environment contains regulators “upstream” of them, including chemical signals sent from other cells. In response, they adopt different expression or activity states that pattern the embryo.
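The idea that one circuit adopts different expression states in different local environments can be made concrete with a small sketch. The update rule below (each gene takes the sign of its summed regulatory input) is an assumption in the spirit of the threshold circuits of Chapter 3, and the three-gene interaction matrix `W` and the two initial states are hypothetical values chosen purely for illustration.

```python
# Minimal sketch of a threshold regulatory circuit (assumed model, toy values).
# W[i][j] is the hypothetical effect of gene j on gene i; states are +1 (on) or -1 (off).

def sign(x):
    """Map a summed regulatory input to an expression state."""
    return 1 if x > 0 else -1

def step(W, state):
    """Update all genes synchronously from the current expression state."""
    return tuple(sign(sum(w * s for w, s in zip(row, state))) for row in W)

def settle(W, state, max_steps=50):
    """Iterate until a fixed point is reached; return None if none is found."""
    for _ in range(max_steps):
        nxt = step(W, state)
        if nxt == state:
            return state
        state = nxt
    return None

# Genes 0 and 1 activate themselves and repress each other; gene 2 reads both.
W = [[2, -1, 0],
     [-1, 2, 0],
     [1, -1, 0]]

ENV_1 = (1, -1, 1)   # hypothetical upstream input in one embryonic region
ENV_2 = (-1, 1, 1)   # hypothetical upstream input in another region

print(settle(W, ENV_1))
print(settle(W, ENV_2))
```

In this toy, the same genotype `W` settles into two distinct expression patterns depending only on the initial state it encounters, which is the sense in which an "environment" is represented as an initial expression state.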
The concepts of genotype spaces and of genotype networks form the unifying framework of this book. These concepts can also organize our thinking on how environmental change affects innovation. In the next section, I will show how. I will start by discussing rapid environmental change, and move from there to slow environmental change.
Genotype networks and environmental change
Figure 11.1 shows how environmental change can be incorporated into the genotype space framework. Each panel of this figure shows the same part of a hypothetical genotype space, enclosed in a rectangle. The filled black circles in the upper panel indicate part of a genotype network whose members are viable in a hypothetical environment 1. The open black circles in the middle panel reflect part of a genotype network whose members are viable in some environment 2. The gray circles in the lower panel, finally, correspond to part of a genotype network whose members are viable in both environments. This genotype network is the intersection of the genotype networks in the middle and upper panels. When this intersection is empty, viability in both environments is impossible. An earlier caveat applies: these two-dimensional representations are highly simplified caricatures of high-dimensional spaces.
Figure 11.1 Intersections of genotype networks contain genotypes viable in more than one environment. Each rectangle represents the same part of a hypothetical genotype space. The filled black circles in the upper panel (Environment 1) and the open black circles of the middle panel (Environment 2) correspond to parts of two genotype networks that are viable in hypothetical environments 1 and 2, respectively. The gray circles in the lower panel (Environment 1+2) correspond to the intersection of these genotype networks, and thus to genotypes viable in both environments. Gray arrows highlight a genotype in this intersection. This genotype, together with its neighbors on the same genotype network, is also shown to the right of each panel.
The simple idea behind Figure 11.1 extends to multiple environments: The set of genotypes necessary for viability in any number n of environments is the intersection of genotype sets that allow viability in each environment. This intersection becomes progressively smaller as n increases. However, it helps to keep in mind that genotype sets are typically huge. The size of the intersection of n such sets may decrease exponentially with n, but it can still be astronomically large for n as large as 50 or more environments [491, 652, 670]. I will refer to such intersections of genotype sets that permit their member genotypes to survive in multiple environments as n-environment genotype sets. In principle, it is possible that with increasing number of environments, n-environment genotype sets become increasingly fragmented, such that they no longer form connected genotype networks. However, based on our analyses of regulatory circuits and metabolic networks, this is not typical [491, 652, 670].
Let us now connect the perspective of Figure 11.1 to the production of novel phenotypes. Consider first the genotype highlighted with the arrow in the upper panel of the figure. The same genotype is also shown to the right of the panel, together with its four neighbors on the same genotype network. This genotype is highlighted in the middle panel as well, where it is part of the genotype network for environment 2, and where it also has four (different) neighbors on this same genotype network. In the 2-environment genotype network of the lower panel, this genotype has only two neighbors. This simple example illustrates a more general principle: genotypes in n-environment genotype networks will typically have fewer neighbors that permit survival in all n environments than genotypes that can survive only in fewer environments. This phenomenon reflects the increasing constraints that large numbers of environments impose on genotypes [491, 652, 653, 670]. Figure 11.2 shows pertinent data for 1- and 2-environment genotype networks of regulatory circuits (Chapter 3). Circuits on the same 2-environment genotype network are required to produce two specific and distinct expression phenotypes in two different intraorganismal environments. Such environmental differences are represented by different initial activity states of a circuit’s genes. These states reflect a circuit’s input from genes “upstream” of the circuit, or from chemical signals outside the
cell in which the circuit operates. The figure shows the fraction of a circuit’s nearest neighbors that preserve the circuit’s gene expression phenotype in one environment (horizontal axis) and two environments (vertical axis). Note that this fraction is always smaller in the 2-environment case, and sometimes dramatically so. The same qualitative phenomenon holds for metabolic networks that can synthesize biomass from multiple (n) different carbon sources [652]. However, in metabolism, the differences in the fraction of phenotype-preserving neighbors are smaller as n varies.
Figure 11.2 Less change is neutral in genotypes allowing viability in two environments. The figure is based on a sample of more than 10^3 regulatory circuits (Chapter 3) that are able to produce one specific gene expression pattern P1 in one environment, and another specific pattern P2 in a second environment [491]. The horizontal axis (fraction of neutral neighbors, one environment) shows the fraction of a circuit G’s neighbors in genotype space that are on the same genotype network as G, i.e., the fraction of neighboring circuits that preserve G’s gene expression pattern in one of the two environments. Open circles correspond to this fraction for genotypes that form P1 in environment 1, closed circles to genotypes that form P2 in environment 2. The vertical axis (fraction of neutral neighbors, both environments) shows the fraction of G’s neighbors that can produce the same gene expression pattern as G, but in both environments. In other words, it is the fraction of phenotype-preserving neighbors in a 2-environment genotype network (see Figure 11.1, bottom panel). The solid diagonal line is the identity line. Notice that the fraction of phenotype-preserving neighbors is always smaller (below the diagonal line) for two environments than for one environment. Different environments are represented as different initial gene expression states encountered by a circuit. Data are based on circuits with S=12 genes and four regulatory interactions per gene. Adapted from [491].
Figure 11.3 More novel phenotypes near 2-environment genotype networks of regulatory circuits. The horizontal axis shows time in generations. The vertical axis shows the number of different novel phenotypes found in the 1-neighborhood of an entire population of regulatory circuits (Chapter 3). That is, if the same phenotype occurs in the neighborhood of two circuits, it is counted only once. The figure is based on populations of regulatory circuits that are confined to a 1-environment genotype network (open circles), i.e., they produce a specific expression phenotype in one environment, or to a 2-environment genotype network (closed circles), i.e., they produce two different expression phenotypes in two different environments. These populations of regulatory circuits have reached equilibrium between selection, genetic drift, and mutation, where mutations change one regulatory interaction per circuit and generation. The data show that circuits confined to 2-environment genotype networks encounter greater numbers of novel phenotypes in their vicinity. Data are shown for regulatory circuits with S=10 genes, five regulatory interactions per circuit, and regulatory interactions whose strength has a Gaussian (N(0,1)) distribution. However, the same qualitative patterns hold for circuits with different sizes, number of regulatory interactions, and discrete instead of continuous regulatory interactions (unpublished data). Results are presented as averages over 100 replicate populations with 100 individuals each. Error bars correspond to one standard error of the mean. Different environments are represented as different initial gene expression states (Chapter 3). All relevant procedures have been described earlier [124, 820].
In sum, genotype networks that permit survival in multiple environments become smaller, and their member genotypes have fewer phenotype-preserving neighbors as the number of environments increases. If a circuit in an n-environment genotype network has fewer phenotype-preserving neighbors, it must have more neighbors with novel phenotypes. This increased number of neighbors will affect the number of novel phenotypes that occur in the neighborhood of individuals or populations evolving on an n-environment genotype network. Figure 11.3 shows this number of novel phenotypes for regulatory circuits, and compares it between 2- and 1-environment genotype networks. The figure is based on populations
of circuits that have spread through 2- and 1-environment genotype networks through multiple generations of mutation and selection, which confined them to their respective genotype networks. The figure shows, as a function of the number of generations, the number of different phenotypes encountered in the 1-neighborhood of these populations. This number of phenotypes does not vary strongly over time, reflecting that both populations have reached an equilibrium in their evolutionary dynamics, a balance between mutation and selection. The figure also shows that the population evolving on the 2-environment genotype network consistently has access to many more novel phenotypes. Unfortunately, one cannot generalize this observation to systems different from these regulatory circuits. That is, it is not generally the case that populations confined to increasingly more environmentally constrained (and thus smaller) genotype networks can access more novel phenotypes. For example, for molecules, larger and not smaller genotype networks promote evolutionary innovation, because populations can spread more rapidly through them (Chapter 8). In metabolic networks, the relationship between the number of environments n and accessibility of novel phenotypes can be non-monotonic. That is, depending on the kind of chemical environments considered, access to novel phenotypes may be highest for intermediate n [653] (Chapter 8). The reason for these differences among system classes is that two processes are important in the accessibility of novel phenotypes. The first is the number of novel phenotypes in the neighborhood of a population; the second is how fast a population spreads through a genotype network (Chapter 8). Increasing environmental constraints (much like decreasing robustness) increases the first, but decreases the second. 
Which of these processes dominates in determining how many novel phenotypes a population’s neighborhood contains depends on details of genotype space organization. These details differ among system classes and remain to be characterized. To summarize, rapid environmental change can be viewed as confining populations to genotype networks whose size decreases as the number of environmental demands on a system increases. On such genotype networks, individual genotypes would typically encounter more novel phenotypes in their vicinity. But whether evolving populations can access more novel phenotypes on such genotype networks depends on the system class studied.
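The two central observations of this section, that n-environment genotype sets are intersections and that their members have fewer phenotype-preserving neighbors, can be checked by brute force in a toy genotype space. The sketch below is not a model of any real metabolism: genotypes are hypothetical presence/absence vectors of eight genes, and the "pathways" that make each environment survivable are invented for illustration.

```python
from itertools import product

L = 8  # hypothetical genome: presence (1) or absence (0) of 8 genes

# Assumed toy rule: an environment is survivable if all genes of at least
# one of its alternative pathways are present. Gene indices are arbitrary.
PATHWAYS = {
    "env1": [{0, 1}, {2, 3}],
    "env2": [{4, 5}, {2, 6}],
}

def viable(genotype, env):
    """True if the genotype carries a complete pathway for this environment."""
    present = {i for i, bit in enumerate(genotype) if bit}
    return any(pathway <= present for pathway in PATHWAYS[env])

def neighbors(genotype):
    """All genotypes that differ by gain or loss of exactly one gene."""
    for i in range(len(genotype)):
        yield genotype[:i] + (1 - genotype[i],) + genotype[i + 1:]

def neutral_fraction(genotype, genotype_set):
    """Fraction of one-mutant neighbors that remain inside genotype_set."""
    nbrs = list(neighbors(genotype))
    return sum(n in genotype_set for n in nbrs) / len(nbrs)

space = list(product((0, 1), repeat=L))
viable_sets = {env: {g for g in space if viable(g, env)} for env in PATHWAYS}
both = viable_sets["env1"] & viable_sets["env2"]  # 2-environment genotype set

print(len(space), len(viable_sets["env1"]), len(viable_sets["env2"]), len(both))
```

Under these assumed pathways, 112 of the 256 genotypes are viable in each single environment and 58 in both. Because every neighbor that is viable in both environments is also viable in each one, `neutral_fraction(g, both)` can never exceed the single-environment fraction for the same genotype, which is the below-the-diagonal pattern described for Figure 11.2.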
Slow environmental change
The perspective of Figure 11.1 can also help understand the evolutionary dynamics of genotypes under slow environmental change. For example, in environment 1, a population would first spread through the upper genotype network in this figure. When a switch to environment 2 occurs, individuals that happen to be in the 2-environment genotype network (lower panel) would preferentially survive, and spread through the genotype network of environment 2, until the next environmental change occurs, and so on. The longer one environment prevails, the further the population spreads through the corresponding genotype network, and the smaller the fraction of the population that remains in the 2-environment genotype network. Conversely, this remaining fraction will increase as environments switch more frequently. One can think of this fraction as a population’s genotypic “memory” of a past environment, a memory that fades in constant environments. It is easy to extend this line of reasoning to more than two environments. Figure 11.3 showed data indicating that populations evolving on 2-environment and 1-environment genotype networks can differ in the number of novel phenotypes that they produce in response to mutations. In slowly changing environments, this number of novel phenotypes would be somewhere between the 2-environment and the 1-environment scenarios. Thus, slow environmental change would also influence the number of novel phenotypes accessible to a population. The direction of this influence may again depend on system details. The perspective of Figure 11.1 might also help explain why environmental change often reveals cryptic variation. This is genotypic variation that is phenotypically invisible in a given environment. In a new environment, such variation may become visible as phenotypic variation [625, 666, 816, 817]. Consider a population in environment 1 whose
individuals are on the genotype network of environment 1, but outside the 2-environment genotype network (Figure 11.1). As a change from environment 1 to environment 2 occurs, these individuals form different phenotypes that affect their fitness in the new environment 2. Their genotypic variation, which was cryptic in environment 1 has become phenotypically visible in environment 2.
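Cryptic variation of this kind can be illustrated with two toy circuits. As before, the sign-threshold update rule is an assumed stand-in for the circuits of Chapter 3, and the matrices `W_A` and `W_B`, which differ only in the inputs to gene 2, are hypothetical.

```python
# Sketch of cryptic variation with two hypothetical threshold circuits.
# States are +1 (on) or -1 (off); an environment is an initial state.

def sign(x):
    return 1 if x > 0 else -1

def step(W, state):
    """Synchronous update: each gene takes the sign of its summed input."""
    return tuple(sign(sum(w * s for w, s in zip(row, state))) for row in W)

def settle(W, state, max_steps=50):
    """Return the fixed-point expression pattern, or None if none is reached."""
    for _ in range(max_steps):
        nxt = step(W, state)
        if nxt == state:
            return state
        state = nxt
    return None

W_A = [[2, -1, 0], [-1, 2, 0], [1, -1, 0]]
W_B = [[2, -1, 0], [-1, 2, 0], [1, 0, 2]]   # differs from W_A in gene 2's inputs

ENV_1 = (1, -1, 1)
ENV_2 = (-1, 1, 1)

for name, W in (("A", W_A), ("B", W_B)):
    print(name, settle(W, ENV_1), settle(W, ENV_2))
```

Both circuits produce the same expression pattern from `ENV_1`, so their genotypic difference is invisible there. From `ENV_2` they settle into different patterns, so the same difference becomes phenotypically visible once the environment changes.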
The speed of evolutionary adaptation
In rapidly changing environments, a population needs to be adapted to multiple environments simultaneously. In contrast, under slow environmental change, a population would adapt evolutionarily to the first environment, then to the second, then again to the first, and so on. The number of novel phenotypes encountered by such a population can affect the speed of this evolutionary adaptation to its current environment. This is the rate at which a population approaches a phenotype well-adapted to this environment. Several computational studies show that slow environmental change can accelerate this rate dramatically relative to an unchanging environment [155, 194, 222, 358, 385, 517, 598]. I will next highlight some of these studies and their observations. One pertinent study considered populations of regulatory circuits like those of Chapter 3, and periodic changes between two environments, in which two different gene expression phenotypes were optimal. Its authors asked how fast these populations approached the optimal phenotype of the new environment [194]. To do so, they used the fraction of genes that differed in their expression from the optimal phenotype as a measure of an individual’s distance from this phenotype. And as a measure of adaptation speed, they evaluated a population’s mean distance, averaged over all individuals, to the optimum phenotype, after the population had evolved for a given amount of time in its current environment. The smaller this distance was, the faster the population had adapted to its current environment. The authors found that populations evolving in an infrequently changing environment can adapt more than twice as fast as populations from an unchanging environment [194]. Also, in circuits that evolve in changing environments, mutations
can produce more novel phenotypes than in circuits in unchanging environments [194]. It is perhaps not surprising that the speed of adaptation increases with this number of accessible novel phenotypes [194]. In addition, circuits that produce many novel phenotypes also show an intriguing organization: The regulatory inputs to any one gene are typically balanced between activating and inhibiting inputs. This means that no class of inputs is overwhelmingly strong, such that it could stabilize the expression phenotype against mutational changes in one regulatory interaction. Rather, any one mutational change can upend the balance between activating and repressing interactions, and thus change the circuit’s expression phenotype. Faster adaptation is not limited to the model circuits I just discussed. It also occurs in broader classes of systems. For example, Kashtan and others showed that varying environments accelerate evolutionary adaptation in four different kinds of biological and non-biological circuitry. This acceleration occurred over a broad range of time scales of slow environmental change, where environments changed between every 10 and every 10^5 generations [385]. A final study worth highlighting made similar observations, but with a completely different, very simple phenotype. Its authors focused on hydrophobicity, a physicochemical property of amino acids that reflects a tendency to avoid water molecules. Physicochemical properties of amino acids in a protein can evolve rapidly, especially in antigenic proteins of viral pathogens. The reason is that these antigenic proteins are exposed to a changing environment of host immune systems, and they adapt to this changing environment [95, 874]. These observations motivated the study’s choice of phenotype. The study represented genotypes through the nucleotide triplets (codons) that encode each of the twenty amino acids.
It simulated different environments that favored amino acids with different hydrophobicities [517]. It then asked which codons accumulate in populations that evolve under such environmental change. Under infrequent environmental change, codons accumulated in which mutations can readily produce amino acids of different hydrophobicity [517].
These are codons that encode, say, hydrophilic amino acids, but where mutations are likely to produce hydrophobic amino acids, and vice versa. Under this scenario of environmental change, such codons provide the best possible adaptation to a current environment, and they also facilitate the production of novel phenotypes that are well adapted to a new environment. In sum, several independent studies that focused on very different kinds of systems and phenotypes all showed that slow environmental change can increase the speed of adaptation to any new environment. It is not difficult to understand why. Such change produces genotypes that are well-adapted in one environment, and in which mutations have one or both of the following properties. First, they can easily create novel phenotypes, which can facilitate adaptation to the other environment. Second, they may even create phenotypes that are already well-adapted to the other environment. If so, environmental change has driven the starting genotypes into regions of a genotype network that are especially close to a genotype network whose phenotype is well-adapted to the second environment.
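To see what it means for a codon to be close to the opposite hydrophobicity class, one can enumerate the single-nucleotide neighbors of every codon in the standard genetic code. The sketch below is only an illustration: it classifies amino acids by the sign of their Kyte-Doolittle hydropathy value, which is an assumption made here and not necessarily the scale or threshold used in the study cited above, and `flip_fraction` is a hypothetical helper name.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

# Kyte-Doolittle hydropathy values; positive is treated as hydrophobic (assumed cutoff).
KD = {"I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
      "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
      "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5}

def is_hydrophobic(amino_acid):
    return KD[amino_acid] > 0

def flip_fraction(codon):
    """Fraction of sense single-nucleotide mutants in the opposite class."""
    amino_acid = CODE[codon]
    if amino_acid == "*":
        raise ValueError("stop codons encode no amino acid")
    sense = flips = 0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = CODE[codon[:pos] + base + codon[pos + 1:]]
            if mutant == "*":
                continue  # ignore mutations to stop codons
            sense += 1
            flips += is_hydrophobic(mutant) != is_hydrophobic(amino_acid)
    return flips / sense

for codon in ("ATG", "TTT"):
    print(codon, CODE[codon], round(flip_fraction(codon), 3))
```

Synonymous codons can differ in this fraction, so a population could in principle keep its current amino acid while shifting toward codons from which the other class is more easily reached.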
The ghosts of environments past
Atavisms are rudimentary phenotypes that resemble an ancestral phenotype, typically from an extinct species. Well-known examples of macroscopic atavisms include snakes that bear hind legs (like their distant ancestors), horses born with extra toes (like ancestral horses), and human babies with a rudimentary tail. Atavisms are remnants of phenotypes that may have been adaptive in the past, but have ceased to be so. The kinds of circuits I discuss here help shape macroscopic traits during development. It is thus not far-fetched to ask whether these circuits can express phenotypes that are akin to atavisms. These phenotypes would of course not be complex macroscopic phenotypes, but gene expression phenotypes. With this question in mind, it is useful to revisit the notion of genotypic memory. Such genotypic memory can arise during slow environmental change, where a population can linger near a 2- or n-environment genotype network (Figure 11.1),
even though one of the environments is a thing of the past. Figure 11.4 shows the extent of genotypic memory in populations of regulatory circuits. These populations were subject to repeated rounds of mutations in regulatory interactions, as well as to selection maintaining one or two gene activity phenotypes. Specifically, before generation one indicated in the figure, the populations were confined to a 2-environment genotype network that required them to produce two different gene activity patterns in two different environments. After generation one, selection continued to maintain only one of the two gene activity phenotypes. Individuals that did not have this phenotype were eliminated from the population. The open circles show, as a function of time, the fraction of individuals that can still form the second, past phenotype. The closed circles show the average fraction of each individual’s neighbors that form this past phenotype. As a control, the same statistics are shown for populations of circuits that were never confined to the 2-environment network. The figure demonstrates that individuals and their neighbors are much more likely to form the second, past phenotype than control populations. Their genotypic memory decays only slowly. For example, after 30 generations, where approximately 30 percent of regulatory interactions in a circuit changed, populations still maintain significant genotypic memory (Figure 11.4). Such genotypic memory may have an important consequence: under evolution in changing environments, mutations tend to produce not only many novel phenotypes, but preferentially phenotypes that were well-adapted in past environments. Several computational studies have observed this phenomenon [155, 358, 597]. One such study examined gene regulatory circuits whose gene expression patterns had adapted to two different, cyclically changing environments. 
Genotypes adapted to any one environment preferentially produced expression phenotypes adapted to the other environment [155]. Another example comes from the study on amino acid hydrophobicity that I discussed in the last section [517]. Infrequent changes in an environment favoring hydrophilic or hydrophobic amino acids lead to genotypes that produce one kind of amino acid, but that can readily mutate to produce another kind that was favored in the past. In anthropomorphic terms, one could say that such genotypes are engraved with “assumptions” about their future environments. These assumptions are that past adaptive phenotypes may become future adaptive phenotypes. In other words, the future will resemble the past. All this implies that environmental change per se may not be sufficient to facilitate adaptation to novel environments. The right kind of change may be necessary. If the future does not resemble the past, environmental change may fail to facilitate adaptation. For model regulatory circuits, this has been demonstrated: when a new environment is completely different from any past environment, populations adapt to it only slowly [598]. Instructive exceptions arise when new environments do not exactly replicate the demands of any one past environment, but combinations of such demands from different past environments. In this case, adaptation to a new environment may also be rapid [598].
Figure 11.4 Regulatory circuits retain genotypic memory. The horizontal axis shows time in generations. The vertical axis shows the fraction of a population’s members (open circles) or the fraction of their neighbors (closed circles) that produce a gene expression phenotype that was adaptive in the past. The figure is based on populations of regulatory circuits (Chapter 3) that had reached a balance between mutation, selection, and genetic drift during 500 generations of selection confining them to a genotype network. That is, each individual was required to produce two different expression phenotypes in two different environments. After generation 1, shown on the horizontal axis, this selection was relaxed such that individuals were only required to produce the first of the two gene activity phenotypes. In other words, from generation one onward, a population was confined only to a 1-environment genotype network. Repeated cycles of this selection and mutation (one regulatory interaction per circuit and generation) continued for the thirty generations shown here. The open black circles show the fraction of individuals in a population that still adopt the second (past) phenotype. The closed black circles show the average fraction of each individual’s neighbors that have this second phenotype. As a control, the same statistics are shown for populations of circuits that were never confined to the 2-environment network, but only to a 1-environment genotype network (gray circles). Data are shown for regulatory circuits with S=10 genes, five regulatory interactions per circuit, and regulatory interactions with a Gaussian (N(0,1)) distribution. The same qualitative patterns hold for circuits with different sizes, number of regulatory interactions, and discrete instead of continuous regulatory interactions (unpublished data). Results are presented as averages over 100 replicate populations with 100 individuals each. Error bars correspond to one standard error of the mean. Different environments are represented as different initial gene expression states (Chapter 3). All relevant procedures have been described earlier [124, 820].
These observations also speak to the question in which sense the phenotypic variation that mutations produce is “random” variation. I have already discussed, in Chapter 6, how genotype networks cause a genotype G, when mutated, to produce variation that is non-random with respect to G’s own phenotype P. For example, the new phenotypes tend to be similar to P. Moreover, where a genotype G lies on a given genotype network can influence the new phenotypes that mutations in G can create. The genotypic memory I just discussed provides yet another instance of such structured variation. If a population of genotypes is adapted to one environment, but if its individuals exist near a 2-environment genotype network, mutations may preferentially produce phenotypes adaptive in this second environment. In sum, in the few systems where genotypic memory has been sought—in particular regulatory circuits—it has been found. Individual genotypes or their neighbors can form phenotypes that have been adaptive in the distant past. For macroscopic organismal traits, this and other structured phenotypic variability has been known for a long time, but has resisted explanation [406, 499]. The genotype network framework can help explain it. Regulatory circuits both help build macroscopic phenotypes, and they can remember their past expression phenotypes. We should thus not be surprised that the ghosts of past phenotypes resurface long after these phenotypes ceased to be useful.
Environmental variation as a cause of system complexity and robustness
The next section of this chapter is perhaps the most important section. It will address a fundamental question that previous chapters have left open. As we saw in Chapter 6, genotype space is partitioned into genotype networks whose organization is conducive to innovation, because typical genotypes have more than one neighbor with the same phenotype. They are to some extent robust to mutations. The question is why. At first sight, this may seem a philosophical rather than a scientific question, because proteins, regulatory circuits, and metabolic networks are completely different kinds of systems. In contrast,
I will here suggest that environmental variation may provide the answer. To begin with, I will summarize the core of my argument. A system of a given complexity, i.e., number of elements (metabolic reactions, regulatory genes, amino acids etc.), may be able to cope with different environments. The preceding sections introduced the context of n-environment genotype networks, whose member genotypes can perform well in each environment. However, this ability will become exhausted at some point, as the number of environments increases. Beyond some limit, only an increase in system complexity may allow good performance in additional environments. This increased complexity is responsible for robustness of a system’s phenotype in any one environment. (And from this robustness follows the existence of genotype networks.) I will here equate a system’s complexity with the number of its parts, that is, with system size. I am well aware that this definition of complexity completely neglects organizational aspects of complexity. For my purpose, however, it will suffice. Of the three system classes I study here, metabolic networks are best-suited for analyses that support the argument I just made. The reason is that we can systematically study the relationship between environmental variation, system size, and robustness for them. I will thus use some metabolic examples to characterize this relationship here. Before I begin, recall that my definition of metabolic phenotypes (Chapter 2) explicitly incorporates the number of environments in which a network is viable; that is, in which it can synthesize all biomass molecules. As I did earlier, I will again consider chemically minimal environments that differ in one source of an essential element. In this context, environmental variation corresponds to variation in these sole element sources. 
I will discuss pertinent observations for environments that differ in their sole source of sulfur, but the arguments below readily extend to sources of other elements, and to more complex environments [653]. Minimal environments impose high metabolic demands on an organism and its metabolic network, because in such environments an organism needs to synthesize several dozen biomass molecules from very few nutrients. Nonetheless, most
reactions in a metabolic network like that of E. coli or yeast are silent in minimal environments [559, 593, 670, 809, 839]. That is, they show zero metabolic flux through them, despite an overall large biomass growth flux. They are thus completely dispensable. For example, based on our current knowledge, the metabolic reaction networks of E. coli and yeast comprise more than 900 chemical reactions. However, in a glucose minimal environment, more than 60 percent of these reactions are silent [559]. In addition, many reactions with nonzero flux are also dispensable, without eliminating or reducing biomass production [593, 670, 839]. Overall, in E. coli, the fraction of reactions that would not reduce biomass growth when eliminated exceeds 70 percent [670]. This is not a peculiarity of the E. coli metabolic network, but a general property of viable networks that have similar complexity. One can show this by sampling viable metabolic networks randomly and uniformly from genotype space. In a sample of more than 1000 such networks whose biomass production rate is at least as high as that of E. coli, on average more than 70 percent of reactions are dispensable in a minimal glucose environment [670]. If E. coli lived in only one environment, such as the above glucose minimal environment, it could thus eliminate most of its chemical reactions without detrimental consequences. The price would be a complete loss of robustness to removal of further reactions. Such a systematic elimination of metabolic reactions actually occurs in some organisms [591, 767, 883]. These are often endosymbiotic or endoparasitic organisms that live in highly constant intracellular environments. For example, the metabolic network of Buchnera aphidicola, an endosymbiont of aphids, has merely 263 metabolic reactions. More than 90 percent of them are essential [767]. To appreciate the role of dispensable reactions in a metabolic network like that of E. 
coli, consider now a minimal environment that is different from the glucose minimal environment. For example, it may contain a different sole carbon source, say acetate or succinate. In this environment, the E. coli metabolic network will again contain many dispensable reactions, but not all of these reactions will be the same as in the glucose environment. Some reactions essential in glucose are no longer needed to use the second sole carbon source. Conversely, other
reactions become essential to feed the other sole carbon source into central metabolism [652, 670]. This line of reasoning can be extended to environments varying in three, four, and more carbon sources. As a metabolic generalist, the E. coli metabolic network can synthesize its biomass from more than 80 alternative carbon sources [202, 637, 652]. If one varies the environment among all of these carbon sources, the fraction of dispensable reactions falls to 57 percent, compared to more than 70 percent for a glucose minimal environment. In other words, more than one hundred additional reactions become indispensable [652]. In addition to varying in their carbon sources, environments can also vary in sources of other elements, as well as in other parameters, such as acidity or salinity. Different sets of enzymes are needed to ensure maximal biomass production in these different environments. A metabolic network viable in all such environments would have to contain many more reactions than a minimally complex network viable in only one environment. In other words, the minimal network complexity needed for viability grows with the number n of alternative environments in which viability is required. When we examine a metabolic network with multi-environment viability in only one or a few of these environments, as laboratory studies typically do, we would see exactly what we see in E. coli: many of its reactions would be dispensable in one environment. In the language of metabolic genotype space, such a network has many neutral neighbors, neighbors with the same phenotype that preserve its viability. However, any one reaction that is dispensable in this environment may be essential in a different environment. If not, it would eventually disappear, that is, the gene encoding the required enzyme would become eliminated from the genome. As an aside, such disappearance may take a long time.
Even enzyme-coding genes that are required in environments that occur as rarely as once every few thousand generations may remain in a population indefinitely [822, 839]. I will next complement these qualitative observations with a more quantitative analysis that extends to general network properties. Starting from a large metabolic network with the ability to grow in n minimal environments, let us eliminate reactions from it
at random, one by one, while preserving its phenotype, until the number of reactions cannot be reduced further without eliminating its ability to produce all biomass molecules in at least one of these n environments. The resulting network will be much smaller (less complex) than the starting network. I will call the resulting network a minimal network. Notice that the number of reactions in this network will depend on the order in which reactions were removed from the starting network. That is, repeating this process of reaction removal may yield different networks of different size [591, 653]. In other words, a minimal network is not necessarily the smallest possible network with a given metabolic phenotype. I also note that a network generated in this way will have absolutely no robustness to reaction removal left: every single eliminated reaction will eliminate biomass growth. Figure 11.5a shows the mean size of minimal networks that are viable in different numbers of minimal environments distinguished by their sole sulfur sources. Each of the three data points in the figure is based on 200 independently generated minimal metabolic networks. The horizontal axis indicates the number of sole sulfur sources a network must be viable in. Clearly, the size of a minimal metabolic network increases dramatically with this number. For example, to be viable in 60 instead of 20 minimal environments that differ in their sole sulfur source, a metabolic network must typically contain almost twice as many reactions. These observations reinforce the first part of my argument: an increase in environmental variability requires an increase in the minimum system complexity needed to cope with this variability. I next turn to the question of how metabolic network complexity relates to robustness. Figure 11.5b is based on random samples of 100 metabolic networks that are viable in a given number (20) of minimal environments [653].
The samples differ only in the number of reactions in their member metabolic networks, as indicated on the horizontal axis. The vertical axis shows, as a measure of robustness, the mean number of reactions that are non-essential in networks from each sample. Clearly, as network complexity increases, robustness increases as well. The reason is that larger networks contain more extraneous reactions that are not absolutely required
for viability in the given environments. The data in the figure are based on viability in 20 minimal environments, but the same observations hold for other numbers of minimal environments [653]. Observations analogous to those of Figure 11.5 have been made with simpler, more abstract representations of metabolism [203, 720]. Taken together, all these observations indicate that the large metabolic networks of free-living organisms are much more complex than necessary to sustain life in any one environment. Their complexity arises from their viability in multiple environments. A consequence is that these networks appear highly robust to reaction removal in any one environment, where every metabolic network has multiple neutral neighbors. This neutrality, however, is conditional on the environment. Evidence for such conditionality has existed since long before comprehensive metabolic network information was available. For example, it has been detected experimentally for enzymes in the E. coli pentose phosphate pathway, a part of central energy metabolism [311]. Before moving on, I note that we have already encountered context dependence earlier, in the effects of mutations. That is, in Chapter 7, I discussed how a mutation’s effect, or lack thereof, depends on the genotype in which it occurs. Here, I pointed out that the environment also influences the effects of mutations. In general, both genotypic and environmental contexts are necessary to understand how mutations affect phenotype. Metabolic networks may illustrate the relationship between environmental variability and system complexity most clearly. But as I will discuss next, the same principle may apply to molecules and regulatory circuits, even though we cannot study this connection as systematically in their case. I will first turn to regulatory circuits in organismal development. Such circuits typically pattern more than one body structure of the same organism [104, 268].
To do so, they produce different gene activity patterns in different (intraorganismal) environments. The number of phenotypes they can reliably produce increases with their number of genes [19]. One can view such circuits as computational devices that store information about an optimal gene activity phenotype in their
[Figure 11.5. Panel (a): vertical axis, minimal metabolic network size; horizontal axis, number of alternative minimal environments. Panel (b): vertical axis, fraction of non-essential reactions; horizontal axis, metabolic network size.]
Figure 11.5 Metabolic network complexity, environmental variability, and robustness. (a) The horizontal axis shows the number of minimal environments with different sole sulfur sources in which a metabolic network is required to be viable, i.e., able to synthesize all sulfur-containing biomass molecules. The vertical axis shows the mean number of reactions in minimal viable metabolic networks. These are networks in which all reactions are essential. Minimal metabolic networks were generated through the random elimination of reactions, as described in the text, from a metabolic network comprising 1221 reactions involving sulfur-containing compounds, while preserving the networks’ viability in at least the number of sulfur sources given on the horizontal axis [653]. The data are based on 200 metabolic networks generated in this way for each environmental demand indicated on the horizontal axis [653]. (b) The figure is based on samples of 100 random viable networks with a given size (horizontal axis) [653]. Viability is defined as the ability to synthesize all sulfur-containing biomass compounds in each of n=20 minimal environments that differ in their sulfur source. The vertical axis shows the fraction of metabolic reactions that are non-essential, that is, not required for viability. Error bars correspond to one standard deviation. The same qualitative pattern occurs for different numbers n of minimal environments [653].
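The random elimination procedure described in the text, and the robustness measure of panel (b), can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not the procedure of [653]: a simple reachability test stands in for the biomass synthesis calculations of the cited studies, and the eight-reaction network, its metabolite names (src1, core, bm1, and so on), and all function names are hypothetical.

```python
import random

# Hypothetical toy network: each reaction is (substrates, products).
TOY_REACTIONS = [
    (frozenset({"src1"}), frozenset({"core"})),   # uses sole source 1
    (frozenset({"src2"}), frozenset({"core"})),   # uses sole source 2
    (frozenset({"src3"}), frozenset({"core"})),   # uses sole source 3
    (frozenset({"core"}), frozenset({"bm1"})),    # only route to bm1
    (frozenset({"core"}), frozenset({"mid"})),    # route A to bm2, step 1
    (frozenset({"mid"}), frozenset({"bm2"})),     # route A to bm2, step 2
    (frozenset({"bm1"}), frozenset({"bm2"})),     # route B to bm2
    (frozenset({"core"}), frozenset({"waste"})),  # never needed
]
BIOMASS = frozenset({"bm1", "bm2"})


def producible(reactions, nutrients):
    """Metabolites reachable from the nutrients (assumed viability proxy)."""
    available = set(nutrients)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions:
            if substrates <= available and not products <= available:
                available |= products
                changed = True
    return available


def viable(reactions, environments, biomass):
    """Viable if all biomass molecules can be made in every environment."""
    return all(biomass <= producible(reactions, env) for env in environments)


def minimal_network(reactions, environments, biomass, rng):
    """Remove reactions in random order whenever viability is preserved.

    One pass suffices here: dropping reactions can only shrink the set of
    producible metabolites, so a reaction found essential stays essential.
    """
    order = list(range(len(reactions)))
    rng.shuffle(order)
    kept = set(order)
    for i in order:
        trial = [reactions[j] for j in sorted(kept - {i})]
        if viable(trial, environments, biomass):
            kept.discard(i)
    return [reactions[j] for j in sorted(kept)]


def fraction_non_essential(network, environments, biomass):
    """Fraction of reactions whose individual removal preserves viability."""
    dispensable = 0
    for i in range(len(network)):
        trial = network[:i] + network[i + 1:]
        if viable(trial, environments, biomass):
            dispensable += 1
    return dispensable / len(network)
```

In this toy network, six of the eight reactions are dispensable when viability is required in one environment, but only four when it is required in all three. Minimal networks contain three or four reactions for one environment and five or six for three environments, depending on the order of removal, and no reaction in a minimal network is dispensable.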
regulatory interactions. This information allows them to “compute” or “retrieve” this expression pattern in response to environmental signals. Such circuits have a maximal information storage capacity; that is, a maximal number of gene activity phenotypes they can reliably form. This capacity has been studied in detail for circuits akin to the model regulatory circuits I discussed throughout, where it increases linearly with the number of circuit genes [19].
Whether this relationship is linear or not, it is clear that more complex circuits can store more information, and thus pattern a greater number of body structures. They can operate in more diverse intraorganismal environments. As far as molecules are concerned, consider that very small peptides with only a handful of amino acids can catalyze chemical reactions. Figure 11.6a shows an example involving aldolases. The enzyme in this panel is a synthetic aldolase peptide of
(a) A synthetic aldolase peptide
YKLLKELLAKLKWLLRKL-NH2
(b) Human muscle aldolase
PYQYPALTPEQKKELSDIAHRIVAPGKGILAADESTGSIAKRLQSIGTENTEENRRFYRQLLLTADDRVNPCIGGVILFH ETLYQKADDGRPFPQVIKSKGGVVGIKVDKGVVPLAGTNGETTTQGLDGLSERCAQYKKDGADFAKWRCVLKIGEHTPSA LAIMENANVLARYASICQQNGIVPIVEPEILPDGDHDLKRCQYVTEKVLAAVYKALSDHHIYLEGTLLKPNMVTPGHACT QKFSHEEIAMATVTALRRTVPPAVTGITFLSGGQSEEEASINLNAINKCPLLKPWALTFSYGRALQASALKAWGGKKENL KAAQEEYVKRALANSLACQGKYTPSGQAGAAASESLFVSNHAY
Figure 11.6 A small synthetic aldolase peptide and a large natural aldolase. Aldolases are enzymes involved in aldol reactions, a class of reaction in which carbon bonds form between a ketone and an aldehyde. The aldolase in (a) is a synthetic peptide of 18 amino acids with an N-terminal amino group. Its structure is unknown and is arbitrarily shown as a loop here. It shows a rate acceleration of greater than 10^4-fold relative to the uncatalyzed rate [756]. Other short catalytic peptides with high rate acceleration and substrate specificity have been reported [757]. The lower aldolase in (b) is human fructose 1,6-bisphosphate aldolase (Protein Database Structure identifier 4ald; http://www.pdb.org) [159], which has 363 amino acids. It is involved in a reverse aldol reaction. Specifically, this glycolytic enzyme catalyzes the cleavage of fructose 1,6-bisphosphate into dihydroxyacetone phosphate and glyceraldehyde 3-phosphate.
merely 18 amino acids. The enzyme in Figure 11.6b is a natural aldolase, a hulking giant of more than 360 amino acids. Larger enzymes like that in Figure 11.6b are thought to have several advantages over minimal enzymes like that of Figure 11.6a. These include increased substrate specificity, catalytic efficiency, and thermodynamic stability [757]. From the perspective of environmental change, the most notable advantage is that larger size facilitates regulation of enzymatic activity. Perhaps the best example is regulation of enzymes through allosteric effectors, small molecules that influence an enzyme’s activity. Such regulation is necessary only because enzymes operate in a variable intracellular environment, in which the concentrations of their substrates and other molecules change. Many allosteric enzymes are regulated by not just one, but several allosteric effectors, and in different ways [337, 626]. Such complex regulation would be impossible for tiny catalytic peptides. Allosteric enzymes are typically also multimeric proteins, and thus even more complex than monomeric proteins [235]. As we go from metabolic networks to regulatory circuits to molecules, our understanding of the relationship between environmental variability and system complexity becomes less systematic and more anecdotal. However, what we know suffices to assert a link between environmental variability, system complexity, and robustness in any one environment. This connection even transcends biology, as we will see in Chapter 15. There, we will encounter a programmable electronic system whose functional versatility and innovability also rise with system complexity.
Summary
In studying the influence of the environment on evolutionary innovation, it is useful to distinguish between slowly changing and rapidly
changing environments. Slowly changing environments can accelerate a population’s evolutionary adaptation to an environment, especially if the environment resembles past environments or combinations thereof. (Whether the respective phenotypes are truly innovations is debatable.) Part of the reason is that populations retain genotypic memory of past environments, and can produce genotypes well-adapted to such environments through mutations. The genotype network framework can help explain such genotypic memory, which may be the cause of atavisms in macroscopic traits. To persist in rapidly changing environments, a population must be evolutionarily adapted to multiple environments simultaneously. It is useful to view such a population as existing in the intersection of multiple genotype networks. Whether its multiple adaptations promote the evolution of novel phenotypes depends on details of genotype space organization. Beyond some limit in the number of environments, a system of a given complexity (size) can no longer be well-adapted to all environments simultaneously. Beyond this limit, system complexity needs to increase. The more environments a system needs to function in, the more complex it tends to be, and the more robust it will appear in any one environment. Thus, the robustness that brings forth genotype networks—and thus innovability—is ultimately a result of life’s need to cope with changing environments. The mostly qualitative and general observations of this chapter neglect many details that may differ among different systems, among spatially and temporally changing environments, among change at different time scales and with different magnitude. To understand how these details influence innovation will occupy us for a long time to come.
CHAPTER 12
Evolutionary constraints and genotype spaces
An evolutionary constraint is a bias or limitation in genotypic or phenotypic variation that a biological system produces. Striking phenotypic examples include the absence of photosynthesis in higher animals, the general lack of teeth in the lower jaw of frogs, the absence of palm trees in cold climates, the maximally five digits (fingers and toes) of tetrapod limbs, and the absence of birds that give birth to live young instead of laying eggs [259, 499]. These are extreme examples of phenotypic constraints, where a trait is completely absent. More subtle constraints are correlations among different characters. Paradigmatic cases are allometric scaling relationships [259, ch. 17]. Here, the value of one quantitative trait, such as an organism’s mass, has a specific non-linear relationship to another trait, such as the thickness of a tree’s trunk, the diameter of a tetrapod’s femur, the size of an organism’s reproductive organs, or the rate at which its metabolism consumes energy. For example, metabolic rate is proportional to body mass m raised to the three-quarter power (m^0.75) for many different animal species. It is not hard to see that constraints can influence the spectrum of evolutionary adaptations and innovations that are accessible to living things. For this reason, questions about the causes and consequences of constrained evolution have attracted much attention [29, 85, 119, 329, 499, 555, 570, 583, 702, 855]. In this chapter, I will first discuss the kinds of phenotypic constraints that are most important for my purpose. I will then show how a systematic characterization of genotype spaces can help us understand the relationship of these constraints. Specifically, I will show that several different kinds of constraints can be traced to a single
cause. I will then point out how the genotype space framework can help us understand evolutionary stasis, the absence of evolutionary change. For a student of innovation, stasis may be the most important consequence of constrained variation. After having discussed these aspects of constrained phenotypic variation, I will then briefly discuss constrained genotypic variation.
Different kinds of phenotypic constraints
An important, albeit not sharp, distinction among constraints is that between universal and local constraints. A local constraint applies only to some group of organisms. For example, monocotyledonous plants, such as palms and bananas, cannot increase their stem diameter the same way as dicotyledonous plants do. The reason is that they do not have the ability to produce a secondary xylem, a nutrient transport tissue that enables the secondary thickening of dicotyledons [499]. In contrast to local constraints, universal constraints apply to all organisms. An example is that all organisms above a given (small) size must have a circulatory system. The reason is that diffusion is not fast enough to deliver nutrients or oxygen to all body parts of a large organism. The example of a universal constraint I just mentioned is a consequence of physics. It may thus be binding and inevitable. However, universal (and of course local) constraints need not be inevitable. They may be accidents of evolutionary history. A prominent candidate example regards the fact that the 20 amino acids found in natural proteins are L-isomers instead of D-isomers [741]. That is, they are chiral molecules that rotate the orientation of polarized light counterclockwise instead of
clockwise. This predominance of L-amino acids is almost certainly a historical accident of life’s early evolution [499, 705]. Some constraints, especially local constraints, can be broken; that is, they are not absolute rules, but they admit exceptions. For example, ichthyosaurs are exceptions to the rule that tetrapods have at most five digits. These extinct marine reptiles had more digits. Frogs in the genus Amphignathodon are exceptions to the rule that frogs do not have teeth in the lower jaw. Even some apparently universal constraints may turn out to admit exceptions. For example, the “universal” genetic code that maps 64 specific nucleotide triplets (codons) onto amino acids is shared by many, but not all, organisms. Minor variations exist where the assignment of individual codons has changed [409]. Similarly, a few proteins and peptides may harbor D-amino acids [258, 525]. These observations illustrate that local and universal constraints are idealized extremes of a continuous spectrum of constrained variation.
Causes of phenotypic constraints
I next introduce some prominent causes of phenotypic constraints. As we shall see, they are not mutually exclusive. Any one pattern of constrained variation can be caused by several of these factors. First, phenotypic variation can be subject to physicochemical constraints. A case in point is again the necessity of vascular systems in large organisms, because of diffusion-limited nutrient transport. Another example is the limited size of terrestrial organisms. It is caused by limitations in biological materials and the mechanical support they can provide [259]. A second prominent class is selective constraints; their cause is natural selection. Much phenotypic variation, if it occurred, would be detrimental to its carrier, and would thus be observed only rarely, if ever. An example involves cyclopia, a condition in which only one (central) eye forms in an animal’s development. In zebrafish, for example, cyclopic mutants can be created in mutagenesis screens; they are lethal [315]. This is an extreme selective constraint. More subtle selective constraints are everywhere, because natural selection affects most phenotypes. For example, most mutations in a protein’s coding region have deleterious albeit often
subtle fitness effects [479, 676]. Selection influences the frequency of these variants in a population, and thus the protein phenotypes they encode. In doing so, selection changes the distribution of many phenotypes within a population, a bias that would not occur in the absence of selection. A third class of causes is genetic constraints. Such constraints mean that any one genotype or its mutants can only produce a small subset (or none) of a broad spectrum of conceivable phenotypic variants. Genetic constraints have been known for a long time. A classical candidate example involves variation in wing shapes and eye morphology that occur readily through mutations in the fly Drosophila subobscura, but not in its relative Drosophila melanogaster [499, 722]. In general, the dependency of a mutation’s phenotypic effect on an organism’s genotype is known as epistasis. It is a very widespread phenomenon [102, 610]. I discussed some examples in Chapter 7. A fourth class of causes emerges from the processes that produce phenotypes from genotypes. For the macroscopic traits of higher organisms, this process is embryonic development. This class of causes is usually subsumed under the notion of developmental constraints [499]. A classical example regards variation in the number of digits in salamanders and frogs [12, 268, 583]. The salamander Ambystoma mexicanum (the axolotl) has a hind limb with five toes or digits that are conventionally labeled I through V. This ordering reflects also the order in which the digits form during development (digit I first, digit V last). In salamanders related to the axolotl, one or more of these digits are lost, and the lost digits are those that form latest in development. In developing axolotls, the drug colchicine can induce such digit loss. Colchicine inhibits cell division in the growing limb, and thus reduces limb size and the number of digits.
After colchicine treatment, the latest forming digits are the first to be lost, just as in the case of natural digit loss. These two independent lines of evidence suggest that the order of digit loss is constrained in this group of organisms. The reason lies in how digits originate in development, namely from groups of cells that produce cartilage where the digits will later form. The number of such cartilaginous cell groups reflects the number of digits. It depends on
the size of the developing limb [12, 268, ch. 23, 583]. This is a case where evolutionary variation in a trait (order of digit loss) appears to be constrained by trait development. This constraint is local rather than universal: In frogs and other vertebrates, digits form in the opposite order compared to the salamander, and they also get lost in the opposite order in evolution. I note that developmental constraints are themselves quite heterogeneous [555]. The reason is that organismal development is very complex and involves many different kinds of interactions. These include signaling interactions among cells, movements of cells and tissues, the action of hormones, and physical processes, such as wave formation in excitable media [542]. These four classes of causes are not exclusive. They can overlap in ways that may be difficult to disentangle, especially for complex morphological traits. A case in point is constrained variation in segment number and identity of the fruit fly Drosophila melanogaster. In the 1980s, researchers screened thousands of fly mutants created in large-scale mutagenesis experiments. These screens revealed only a small number of variants in segment number, orientation, and identity [564]. These include embryos that lack several consecutive segments, and embryos that lack every other (odd-numbered or even-numbered) segment. At first sight, genetic constraints might seem the best candidate cause for limited variation in such a genetic screen. However, because a complex developmental process is involved in segmentation, development itself may be the cause of this constrained variation. To make matters more complicated, fly segmentation requires several, now well-characterized genes, which link developmental and genetic causes of constraints. In addition, segmentation involves diffusion of molecules along an embryo, and chemical interactions between gene products, making it potentially subject to physicochemical constraints [268]. 
And finally, past selection that favored unchanging segment numbers and identities may also have contributed to such constrained variation. After all, other arthropods have much more variable segment numbers than fruit flies. In sum, different causes of constraints are entangled here. Such entanglement is the rule rather than the exception.
The case for studying constraints in genotype spaces
The phenotypes I have focused on in this book are simpler than macroscopic traits of higher organisms. What we can learn from them about such traits is limited, for example because they lack the spatial dimension of macroscopic traits. However, studying these phenotypes also has tremendous advantages. First, as I argued earlier, they are the building blocks of macroscopic traits. By learning about constrained variation in these phenotypes, we may learn about constraints in macroscopic phenotypes. Second, we can study these phenotypes more systematically than macroscopic traits. We can examine their distribution in a space of genotypes. And we can quantify their differences, such as differences in metabolic abilities among metabolic network genotypes, differences in gene activity phenotypes of gene circuits, and differences in the shapes and functions of molecules. Sets of possible phenotypes in any one of these categories form an analogue to morphospace, the space of all macroscopic traits and forms, except that we can study this space’s relationship to genotype space more rigorously. A third advantage is that these phenotypes form through complex processes which can themselves constrain variation, yet they avoid the unfathomable complexity of development. For protein phenotypes, the relevant process is protein folding; for metabolic phenotypes, it is the flow of metabolites through a reaction network; and for gene activity phenotypes, the relevant process is the dynamical change of gene activities caused by regulatory interactions. The latter process can capture important aspects of the dynamical complexity involved in pattern formation of development, such as static geometric patterns and traveling waves in the activities of regulatory molecules [362, 621, 646, 671, 792]. To understand how this process produces phenotypes is thus relevant for macroscopic traits.
Fourth, population genetic and quantitative genetic models of constraints need to assume specific causes of constraints or statistical patterns of constrained variation, such as correlated variation between traits. In contrast, both causes and patterns emerge naturally from the phenotypes I study here.
E V O L U T I O N A RY C O N S T R A I N T S A N D G E N O T Y P E S PA C E S
By analyzing constraints in these phenotypes, we can thus hope to understand their different causes more clearly. These phenotypes may allow us to see how these causes are related to one another, where they can be separated, and where they are entangled. I will next review these causes for the phenotypes I studied earlier. Doing so allows me to point to an important aspect of their relationship: the processes that form phenotypes are the fundamental cause of several other constraints.
Physicochemical constraints
Physicochemical factors clearly constrain observable protein structure phenotypes. To give but one example, consider the folding of globular proteins. Globular proteins typically fold such that a densely packed core of amino acids forms in their center. The polar -NH and -CO groups of these amino acids cannot form energetically favorable hydrogen bonds to water, because the dense packing in this core excludes most water molecules. To avoid energetically unfavorable interactions, these groups form the well-known α-helix and β-sheet secondary structure elements, where amino acids form hydrogen bonds with each other [87]. Inappropriate exposure of hydrophobic amino acids to water disrupts protein function and may lead to insoluble and non-functional protein aggregates. The packing requirement of hydrophobic amino acids is part of the reason why only a small fraction of random amino acid sequences folds, and why the number of protein structure phenotypes is small (Chapter 4). In other words, this physicochemical requirement constrains allowable protein phenotypes. The phenotypes of metabolic networks are also subject to physicochemical constraints. These constraints are primarily dictated by organic chemistry; that is, by the organic chemical reactions that can occur in water. Despite a huge number of known organic reactions and the reaction mechanisms underlying them, little principled knowledge about the set of allowable organic reactions seems to exist. The principles that effectively exclude some reactions would range from the trivial (reactions that would violate mass conservation) to the more subtle, such as reactions whose products are so unstable that they could play no role in living systems. The resulting constraints influence the range of
allowable metabolic phenotypes. To give a very simple example, building biomass from any source of carbon or of other elements requires that an organism’s metabolic network has a minimum number of chemical reactions [653, 670]. In other words, metabolic phenotypes that require growth on a variety of such sources are out of reach for small metabolic networks. Physicochemical factors may constrain the gene activity phenotypes of regulatory circuits much more weakly than they constrain protein and metabolic network phenotypes. The reason lies in the highly flexible nature of the regulatory interactions involved. For example, protein kinases, which regulate protein activity through protein phosphorylation, recognize short peptide motifs on their target proteins [264]. Protein kinases can thus interact with many different targets, making rapid evolutionary change of these targets possible [336, 754]. Similarly, the short DNA-binding motifs of typical transcription factors allow these factors to regulate just about any target gene. They also allow regulatory interactions to change rapidly on evolutionary time-scales [740, 861]. Such highly flexible interactions mean that regulatory circuits can produce just about any pattern of molecular activity, given the right interactions between their member genes. As I mentioned above, the processes that create phenotypes from genotypes for the systems I study here are protein folding, the dynamically changing regulatory interactions within regulatory circuits, and chemical synthesis of biomass molecules. These processes are the analog of “development” for these systems. Taken together, the observations above show that these processes are key to understanding physicochemical constraints. For example, allowable protein phenotypes are constrained because of how proteins fold; and gene activity patterns can be organized flexibly, precisely because the regulatory interactions producing them are flexible.
Selective constraints
Selective constraints are as ubiquitous for the phenotypes I consider here as they are for macroscopic traits. Consider selection that confines a population of genotypes to a genotype network. This kind of selection does not admit any variant phenotypes, and thus poses strong selective constraints on phenotypes. Such strong
constraints are useful to characterize genotype networks, as I discussed in Chapters 2 through 4. However, selective constraints will often be much weaker. For example, an enzyme mutant with reduced catalytic efficiency, or a metabolic network mutant with a missing enzyme may cause a lower rate of biomass production, and thus slower cell growth and division in one or more environments. As a result, natural selection may eliminate organisms hosting such mutants over time, and thus bias the distribution of protein or metabolic network phenotypes. The smaller the effect of such mutations on cell growth is, the more slowly natural selection will tend to eliminate them, and the weaker this constraint will be. Similarly, the experimental literature, especially in cell and developmental biology, is full of mutations that change the gene activity phenotype of a regulatory circuit, such that an organismal phenotype—be it that of a cell or a multicellular organism—does not form properly. Some such mutations can only be seen after mutagenesis in the laboratory and may rarely, if ever, occur in the wild. Because experiments can reveal such variants, they are clearly neither prohibited by genetic constraints nor by developmental constraints, but by their detrimental effects on the organism. The generation of such variants through mutagenesis has been key to revealing the functions of regulatory genes, including those encoding transcriptional regulators. After having touched on these kinds of obvious selective constraints, I next turn to a less obvious kind of selective constraint, which emerges from ongoing selection favoring the preservation of existing phenotypes. In the language of population genetics, this kind of selection is called stabilizing selection. Among students of development, it is also known as canalizing selection [499]. 
Stabilizing selection disfavors variants of a population’s current phenotype, whether they arise through rare genetic perturbations, or through more frequent non-genetic perturbations (such as the incessant perturbations caused by thermal motion in protein and RNA molecules). If such selection is strong, a population will tend to accumulate in a region of a genotype network where most genotypes have many neighbors with the same phenotypes; that is, where genotypes are highly robust to
mutation. I discussed the reasons in Chapter 8 (Figure 8.3). Briefly, in such regions, fewer perturbations cause detrimental phenotypic change, and a greater number of perturbed individuals survive [798, 820, 825]. Importantly, in such regions, individuals also produce no, or little, phenotypic variation in response to genetic or non-genetic perturbations. In other words, the phenotype of a population in such a region will be highly constrained to a currently optimal phenotype. Thus, ongoing stabilizing selection lowers phenotypic variability. Candidate examples from morphological traits have been known for a long time [647, 711, 819]. For example, in wild populations subject to such canalizing selection, more recently evolved traits may show greater variability than older traits, because stabilizing selection has acted on them for a shorter amount of time. This holds, for instance, for rows of bristles used in male courtship of Drosophila sylvestris fruit flies. Here, newly evolved rows of bristles can be more variable than older such rows [105]. More recently, it has been shown that the phenotypes of molecules may also be subject to this stabilizing process [76, 672, 749, 832]. One example involves micro RNA genes in eukaryotes, which are transcribed into a precursor RNA with a characteristic secondary structure. In micro RNA genes from several eukaryotes, the robustness of this secondary structure is higher than that of random RNA sequences folding into the same structure [76, 749]. Similarly, secondary structures in RNA viruses that are conserved in evolution and thus subject to stabilizing selection are more robust to mutations [832]. Also, viroids, simple plant pathogens that consist of a single-stranded RNA molecule, acquired increasingly robust secondary structures in the course of their evolution [672]. This kind of selective constraint illustrates two merits of analyzing constraints in the framework of genotype networks. 
First, this framework readily explains why canalizing selection can increase selective constraints on a phenotype. The explanation relies on the heterogeneity of genotype networks I just discussed. Second, this framework shows how this selective constraint is entangled with developmental constraints, or more generally the processes producing phenotypes from genotypes. It is these processes that are fundamentally
responsible for how genotypes map onto phenotypes, and thus for the organization of genotype networks and their heterogeneity. If this heterogeneity were absent, canalizing selection could not increase robustness, either for the phenotypes I study here or for macroscopic traits. The genotype–phenotype mapping precedes the action of natural selection on any one population in genotype space. It thus determines what natural selection can achieve, and is a prerequisite for canalizing selection and for the kind of selective constraint that it causes.
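The tendency of a population under strong stabilizing selection to accumulate in robust regions of a genotype network can be illustrated with a toy calculation. The sketch below is only a minimal illustration and not the analysis of the studies cited above. The seven-genotype network, the number of one-mutant neighbors per genotype (`N_MUTANTS`), and the mutation rate (`MUTATION_RATE`) are all hypothetical, and the population is assumed to be so large that genotype frequencies change deterministically. Each generation, offspring either copy their parent or mutate to a random one-mutant neighbor, and mutants that leave the genotype network are removed by selection.

```python
# Toy genotype network (hypothetical): nodes are genotypes with the same
# phenotype; an edge joins two genotypes that differ by a single mutation.
# Genotypes 0-3 form a densely connected region, genotypes 4-6 a sparse tail.
EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6)]
N_GENOTYPES = 7
N_MUTANTS = 10       # assumed number of one-mutant neighbors of every genotype
MUTATION_RATE = 0.5  # assumed probability that an offspring carries one mutation


def robustness(edges, n, n_mutants):
    """Fraction of each genotype's one-mutant neighbors that keep the phenotype."""
    degree = [0] * n
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return [d / n_mutants for d in degree]


def equilibrium_frequencies(edges, n, mu, n_mutants, generations=2000):
    """Iterate mutation followed by selection that removes off-network mutants."""
    freq = [1.0 / n] * n
    for _ in range(generations):
        # Unmutated offspring stay on their parent's genotype.
        new = [(1.0 - mu) * f for f in freq]
        # Mutated offspring reach each one-mutant neighbor with equal probability;
        # only neighbors on the genotype network (the edges) survive selection.
        for a, b in edges:
            new[a] += mu * freq[b] / n_mutants
            new[b] += mu * freq[a] / n_mutants
        total = sum(new)
        freq = [x / total for x in new]
    return freq


rob = robustness(EDGES, N_GENOTYPES, N_MUTANTS)
freq = equilibrium_frequencies(EDGES, N_GENOTYPES, MUTATION_RATE, N_MUTANTS)

network_mean_robustness = sum(rob) / N_GENOTYPES
population_mean_robustness = sum(f * r for f, r in zip(freq, rob))

print("mean robustness over the network:   ", round(network_mean_robustness, 3))
print("mean robustness over the population:", round(population_mean_robustness, 3))
```

In this toy network, the population-weighted robustness exceeds the plain network average, because genotype frequencies concentrate on the densely connected genotypes 0 to 3. If every genotype had the same number of neighbors on the network, the two averages would coincide, which mirrors the point that heterogeneity of the genotype network is a prerequisite for this effect.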
Genetic constraints on phenotypic variation
The structure of genotype space and the distribution of phenotypes in this space can elegantly explain the inevitability of genetic constraints. Previous chapters, especially Chapters 2–4, were full of (then undeclared) examples of such constraints. I will briefly revisit two key general observations from these examples. Genotype space is astronomically vast and contains many phenotypes. This observation holds for all three classes of systems I examined: molecules, regulatory circuits, and metabolic networks. The immediate neighborhood of any one genotype contains only a tiny fraction of all possible genotypes. For example, for proteins of 100 amino acids, where genotype space comprises more than 10^130 amino acid sequences, any one protein genotype G has only 19 × 100 = 1900 1-mutant neighbors, a fraction smaller than 10^-126 of genotype space. The 2-mutant and 3-mutant neighborhoods, albeit larger, comprise similarly small fractions of genotype space. It is thus not surprising that these neighborhoods also contain only a tiny fraction of all possible phenotypes. (In addition, some of G’s neighbors have the same phenotype as G, further reducing the number of phenotypes accessible through few mutations.) The second general observation regards the neighborhoods of two different genotypes G1 and G2 with the same phenotype. As we saw, if G1 and G2 differ even in only a few parts (from amino acids to metabolic enzymes), most phenotypes that occur in the two small neighborhoods around them differ between these neighborhoods (Figure 5.4a). In addition, as a genotype changes gradually while preserving its phenotype, its neighborhood contains
ever-changing new phenotypes (Figure 5.4b). Taken together, these two observations mean that the spectrum of phenotypes accessible through a mutation is limited to the small fraction of phenotypes that occur near any one genotype. The observations I just discussed focus on different genotypes and their local neighborhoods. A complementary approach to characterize genetic constraints focuses on different phenotypes and their distribution in genotype space. It studies regions of genotype space much broader than local neighborhoods, and asks whether specific phenotypes occur preferentially in some regions but not others. These preferences may not be strong, because, as we saw, genotype networks often nearly span genotype space. However, they should at least be detectable. The problem is that genotype spaces are high-dimensional, such that we cannot visualize them easily. To circumvent this problem, one can use statistical techniques to project data in high-dimensional spaces onto lower-dimensional spaces. One such technique is principal component analysis. Briefly, it projects a set of points in a high-dimensional space onto a lower-dimensional space, such that the lower-dimensional cloud of points captures the largest possible amount of variation in the higher-dimensional data [377]. Figure 12.1 shows data derived from the principal component analysis of randomly sampled metabolic networks with a given phenotype [670]. Specifically, the data are based on two random samples of 1000 metabolic networks each. Networks in the first sample are viable in a minimal chemical environment with glucose as the sole carbon source (filled circles); networks in the second sample are viable in a minimal environment with succinate as the sole carbon source (open circles). The figure shows the first two principal components, that is, a two-dimensional projection of a much higher dimensional space of metabolic network genotypes.
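The projection just described can be sketched in a few lines of code. The example below is a minimal illustration under stated assumptions, not the analysis behind Figure 12.1: the binary genotypes are made-up toy data with only 12 hypothetical reactions, the labels "glucose" and "succinate" merely name the two toy samples, and the principal axes are found by simple power iteration on the covariance matrix rather than by a numerical library. All function names are introduced here for illustration.

```python
import math
import random


def center(rows):
    """Subtract each column's mean so that every column has mean zero."""
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    return [[row[j] - means[j] for j in range(m)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of centered data (rows are observations)."""
    n, m = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(m)]
            for i in range(m)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def leading_eigenpair(matrix, rng, iterations=1000):
    """Power iteration for the largest eigenvalue and its unit eigenvector
    of a symmetric positive semidefinite matrix."""
    v = [rng.random() + 0.1 for _ in range(len(matrix))]
    norm = math.sqrt(dot(v, v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [dot(row, v) for row in matrix]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:  # zero matrix: no variance left to explain
            break
        v = [x / norm for x in w]
    eigenvalue = dot(v, [dot(row, v) for row in matrix])
    return eigenvalue, v


def first_two_components(rows, rng):
    """Return the first two principal axes, their variances, and the 2-D scores."""
    centered = center(rows)
    cov = covariance(centered)
    lam1, v1 = leading_eigenpair(cov, rng)
    m = len(cov)
    # Deflation: subtract the variance explained by the first axis, then repeat.
    deflated = [[cov[i][j] - lam1 * v1[i] * v1[j] for j in range(m)] for i in range(m)]
    lam2, v2 = leading_eigenpair(deflated, rng)
    scores = [(dot(row, v1), dot(row, v2)) for row in centered]
    return (v1, v2), (lam1, lam2), scores


def toy_genotype(rng, sample_specific, n_reactions=12):
    """Hypothetical binary genotype: 1 means a reaction is present.
    Reactions 0-3 occur in every network, `sample_specific` reactions occur
    only in one sample, and reactions 8-11 are present at random."""
    genotype = [0] * n_reactions
    for j in (0, 1, 2, 3) + tuple(sample_specific):
        genotype[j] = 1
    for j in range(8, n_reactions):
        genotype[j] = rng.randint(0, 1)
    return genotype


rng = random.Random(0)
glucose = [toy_genotype(rng, (4, 5)) for _ in range(40)]
succinate = [toy_genotype(rng, (6, 7)) for _ in range(40)]
components, variances, scores = first_two_components(glucose + succinate, rng)

pc1_glucose = [s[0] for s in scores[:40]]
pc1_succinate = [s[0] for s in scores[40:]]
print("variance along first and second axis:", variances)
print("mean first-axis score, sample 1:", sum(pc1_glucose) / 40)
print("mean first-axis score, sample 2:", sum(pc1_succinate) / 40)
```

In this toy data set the two samples separate along the first principal axis, because the reactions that differ between the samples account for most of the variance. Real samples, like those of Figure 12.1, share most of their variance and therefore overlap far more.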
[Figure 12.1: scatter plot of the second principal component (vertical axis, -4 to 4) against the first principal component (horizontal axis, -4 to 4); legend: viable on glucose, viable on succinate, center of mass.]
Figure 12.1 Viable metabolic networks with different phenotypes show both overlap and separation in genotype space. The data in this figure are based on two random samples of 10^3 metabolic networks that are viable on minimal media with glucose (closed circles) and succinate (open circles) as sole carbon sources. Viability is here defined as the ability to produce all biomass components of a prokaryotic cell (E. coli) in these environments. More specifically, each of the two data sets is a uniform sample of viable networks within a set of metabolic networks that all have the same number (n=831) of chemical reactions. The genotype of each metabolic network in these two samples was represented as a binary string that indicates the presence or absence of individual reactions (out of a much larger universe of chemical reactions), as described in Chapter 2. The resulting set of 2×10^3 strings was then subjected to principal component analysis, a technique to project high-dimensional data onto fewer dimensions, such that the projection reflects the largest amount of variance in the data. The two-dimensional projection of the figure shows the first two principal components, which explain the largest and second-to-largest fraction of the variation in the data. Despite significant overlap between the two data sets, they also show clear separation, which is statistically significant [670]. This separation shows that different regions of genotype space are populated by networks with different phenotypes. After [670].
The two clouds of points overlap greatly, but they are also visibly separate at their margins. The overlap arises partly because networks viable on these two carbon sources must share many reactions essential for viability on both carbon sources. I emphasize that the networks were sampled uniformly from a set of genotypes with this phenotype, such that the separation of the two clouds is not due to any sampling bias [670]. An analogous analysis is shown in Figure 12.2, for more than 4000 proteins with different enzymatic functions. The figure shows that protein sequences with the same function are not homogeneously distributed in sequence space. I caution that samples of real proteins like that of the figure are always subject to potential biases, with unknown effects on the analysis outcome. For example, some functions may be more intensely
studied than others, and thus have more associated protein genotypes. The observations in this section show that genetic constraints on phenotypic variation exist, and that they are inevitable properties of systems as different as molecules, regulatory circuits, and metabolic networks. As in the previous section, I note that the very existence of these genetic constraints emerges from the processes that form phenotypes from genotypes. For example, these processes are ultimately responsible for the heterogeneity of genotypic neighborhoods, because of how they map genotypes onto phenotypes. These processes differ among my three study systems, but they nonetheless share key features responsible for genetic constraints.
[Figure 12.2: scatter plot of principal component 2 (vertical axis, -50 to 50) against principal component 1 (horizontal axis, -40 to 40); grayscale bar labeled "Functions", from 1 to 53.]
Figure 12.2 Proteins with different functions occur in overlapping, yet distinct regions of genotype space. The data show the first two principal components of a data set of 4134 protein sequences that adopt 53 different enzymatic functions. These functions are arbitrarily coded by different shades of gray, as indicated in the vertical grayscale bar to the right of the figure. For the analysis, proteins were aligned and encoded as numerical strings, where each amino acid was assigned a numerical value between one and twenty. The resulting set of 4134 strings was then subjected to principal component analysis, a technique to project high-dimensional data onto fewer dimensions. The two-dimensional projection shown here shows the first two principal components, which explain the largest and second-to-largest fraction of the variation in the data. Note that the 53 functions (shades of gray) are not homogeneously distributed in genotype space. The 4134 proteins used in this analysis all adopt an aldolase fold (CATH classification 3.20.20.70, [289]), but they catalyze different chemical reactions. From [239].
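The neighborhood sizes quoted at the beginning of this section are easy to check with exact integer arithmetic. The following sketch does so for proteins of 100 amino acids; the function names are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction
from math import comb

ALPHABET_SIZE = 20  # number of amino acids
LENGTH = 100        # protein length used in the text


def genotype_space_size(length, alphabet_size):
    """Number of distinct sequences of the given length."""
    return alphabet_size ** length


def neighbors_at_distance(length, alphabet_size, d):
    """Number of sequences that differ from a given sequence at exactly d positions."""
    return comb(length, d) * (alphabet_size - 1) ** d


space = genotype_space_size(LENGTH, ALPHABET_SIZE)
for d in (1, 2, 3):
    count = neighbors_at_distance(LENGTH, ALPHABET_SIZE, d)
    fraction = Fraction(count, space)  # exact ratio of two large integers
    print(f"{d}-mutant neighbors: {count}, "
          f"fraction of genotype space: {float(fraction):.2e}")
```

The count grows quickly with the number of mutations, but genotype space is so large that even the 3-mutant neighborhood remains a vanishing fraction of it.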
“Developmental” constraints
On its own, each of the three system classes I focused on here cannot produce the macroscopic phenotypes that form during organismal development. Thus, one might argue that they may have little to say about the role of development in constraining phenotypic variation. However, phenotypes form in these systems also through complex processes, such as protein folding and dynamically changing transcriptional
regulatory interactions. Thus, we can examine the role that these “developmental” processes play in constraining variation. I have already mentioned this role in the preceding sections, and will summarize it now. First, these processes are at the root of physicochemical constraints on phenotypic variation. Second, these processes create internally heterogeneous genotype networks, which are responsible for those selective constraints caused by canalizing selection. And finally, the distribution of phenotypes in genotype space is a consequence of the processes that produce phenotypes from genotypes. This distribution is the origin of genetic constraints. Taken together, this means that
constraints emerging from the processes that form phenotypes are ultimately the cause of three other classes of constraints. This causal role of phenotype production can also help explain why different classes of constraints, such as genetic and developmental constraints, can be difficult to disentangle. The three classes of systems I study produce phenotypes in very different ways. Yet their genotype–phenotype maps, and thus the kinds of constraints they allow, share common features. These features have a common cause: the fact that typical genotypes have many neighbors with the same phenotype (Chapters 5 and 6). This remarkably simple robustness property can thus also be viewed as a common cause of constrained phenotypic variation. Students of organismal development and its evolution have long emphasized that we need to understand how phenotypes form to understand evolution [104, 499, 555, 583]. This emphasis is dismissed by some population geneticists, who believe that the population genetic theory of evolution is essentially complete, and does not need to incorporate these complexities. The processes of phenotype formation I study here are building blocks of organismal development. Macroscopic traits are subject to the constraints caused by them, and to additional constraints, caused by the spatial organization of embryos. Because these traits are subject to the constraints I study, my most general observation will hold also for them: The developmental mechanisms that form phenotypes are the common cause of several other classes of constraints. This view supports the developmentalists’ emphasis on phenotype formation and its importance for innovation.
A key consequence of constrained phenotypic evolution
Phenotypic constraints have several consequences [259]. For the student of evolutionary adaptation and innovation, one of these consequences is most important. It is an absence of evolutionary change known as stasis. Stasis can occur if a particular phenotype is optimal in a given environment, and if no superior phenotype exists. For a student of evolutionary innovation, this kind of stasis is less interesting than a second kind of stasis, where phenotypic variability may be present, but the right kind of variability is
absent. I am referring to variability that produces novel adaptive phenotypes. This kind of stasis arises only when a superior, as yet undiscovered phenotype exists. It would thus not occur under stabilizing selection of an already optimal phenotype. In this second case, a characteristic pattern of evolution is episodic change or punctuated evolution. Here, long periods of evolutionary stasis, where a population’s phenotype changes little, are punctuated by rapid evolutionary change, where a population discovers a novel, superior phenotype. Such episodic change has been found on all levels of biological organization and on different timescales [6, 214, 251, 252, 411, 742]. For example, it occurs for morphological traits observable in the fossil record, where its causes have led to much debate [259]; it also occurs for cellular traits in laboratory evolution experiments, such as bacterial cell size [214]; it can also occur for molecules evolving under directional selection, where it can be studied computationally [251, 252, 742]. An example is shown in Figure 12.3. Here, an RNA molecule subject to random change in individual nucleotides “searches” sequence space for a target secondary structure phenotype. Over time, this molecule “discovers” phenotypes (and their genotype networks) that are ever closer to the target. These discoveries are frequent at first and become successively rarer and, in this sense, more difficult to make. The intervals between rare successive phenotypic transitions are periods of evolutionary stasis. During these periods, a molecule explores the genotype network of its current phenotype, until it discovers the genotype network of a superior phenotype. Similar processes and patterns occur when entire populations search sequence space for new and better phenotypes [686]. In the case of episodic evolution of RNA secondary structures, the reasons for the rarity of some phenotypic transitions have been examined in detail [251, 252].
They involve global rearrangements of secondary structures that are unlikely to occur by single point mutations. This kind of stasis arises because not all phenotypic variation is mutationally accessible from any one genotype or population. It is thus a consequence of genetic constraints. Recall that these constraints arise from the processes that form phenotypes from genotypes, which determine how different phenotypes and their genotype networks are distributed across genotype space. The details of this distribution depend on the kind of system (molecule, regulatory circuit, or metabolic network). In addition, the details of any pattern of episodic change, like that shown in Figure 12.3, may vary among evolving populations and among target phenotypes. Regardless of these details, however, the genotype space framework helps us understand this kind of stasis as a consequence of how genotype networks are organized in genotype space.
[Figure 12.3: distance to target phenotype (vertical axis, ticks from 10 to 70) plotted against number of mutations (horizontal axis, ticks from 1 to 50000), both on logarithmic scales; the longest period of stasis is labeled "Stasis".]
Figure 12.3 Temporary stasis during evolutionary search for a novel phenotype. The horizontal axis shows the number of steps in a random walk of an RNA genotype that “searches” sequence space for a specific RNA secondary structure phenotype. The vertical axis shows the random walking molecule’s phenotypic distance to the target, expressed as the distance between the dot-parenthesis representations of the two molecules’ RNA secondary structures [687]. Horizontal lines indicate periods of stasis, where the distance to the target does not improve. The longest such period is labeled with an arrow. Note the double-logarithmic scale. Periods of stasis tend to get longer as the search proceeds. For the data shown in this figure, I subjected a random RNA genotype of length 100 nucleotides to repeated cycles of mutation of one nucleotide, and selection, where a mutation was accepted only if the phenotype of the mutated RNA molecule was at least as close to a target phenotype as the phenotype before mutation. The target phenotype was a random secondary structure phenotype whose dot-parenthesis representation is: .((((((.(((......))).))).....)))...((((((..(((.(((.......))))))...((((...(((....))).))))......)))))). Episodic evolution and temporary stasis are generic properties of evolutionary searches in the kinds of genotype spaces I study here.
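The search procedure behind Figure 12.3 (mutate one nucleotide, accept the mutation only if the phenotype is at least as close to the target) can be sketched as follows. This is a minimal sketch under loudly stated assumptions: `toy_fold` is a hypothetical stand-in for a real RNA folding algorithm and pairs each position only with its mirror position, and the distance between structures is assumed to be the number of positions at which two dot-parenthesis strings differ. Because the toy map lacks the global structural rearrangements of real RNA, it illustrates the procedure and the bookkeeping of stasis periods, not the reasons for long stasis in real molecules.

```python
import random

# Base pairs allowed in the toy folding rule (Watson-Crick plus G-U wobble).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def toy_fold(seq):
    """Hypothetical stand-in for RNA folding: position i pairs with its mirror
    position if the two bases are complementary and far enough apart."""
    n = len(seq)
    structure = ["."] * n
    for i in range(n // 2):
        j = n - 1 - i
        if j - i > 3 and (seq[i], seq[j]) in PAIRS:
            structure[i], structure[j] = "(", ")"
    return "".join(structure)


def structure_distance(a, b):
    """Assumed distance: number of positions where two dot-parenthesis strings differ."""
    return sum(x != y for x, y in zip(a, b))


def adaptive_walk(start, target, fold, steps, rng):
    """Random walk that accepts a point mutation only if the distance to the
    target phenotype does not increase. Returns the distance after each step."""
    seq = list(start)
    dist = structure_distance(fold(seq), target)
    trajectory = [dist]
    for _ in range(steps):
        pos = rng.randrange(len(seq))
        old = seq[pos]
        seq[pos] = rng.choice([b for b in "ACGU" if b != old])
        new_dist = structure_distance(fold(seq), target)
        if new_dist <= dist:
            dist = new_dist   # accept: at least as close to the target
        else:
            seq[pos] = old    # reject: revert the mutation
        trajectory.append(dist)
    return trajectory


def stasis_periods(trajectory):
    """Numbers of consecutive steps without improvement, one entry per period."""
    periods, run = [], 0
    for prev, cur in zip(trajectory, trajectory[1:]):
        if cur < prev:
            periods.append(run)
            run = 0
        else:
            run += 1
    periods.append(run)
    return periods


rng = random.Random(1)
LENGTH = 40
target = toy_fold([rng.choice("ACGU") for _ in range(LENGTH)])
start = [rng.choice("ACGU") for _ in range(LENGTH)]
trajectory = adaptive_walk(start, target, toy_fold, steps=2000, rng=rng)
periods = stasis_periods(trajectory)
print("initial and final distance:", trajectory[0], trajectory[-1])
print("longest period without improvement:", max(periods))
```

To use a realistic genotype–phenotype map, one would pass a real folding function as the `fold` argument; the walk and the stasis bookkeeping would remain unchanged.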
Constraints on variation in genotypes
Everything I have said thus far has regarded constrained variation in phenotypes; but genotypes are
also constrained in their evolution. (Note that such constrained variation of genotypes is different from the genetic constraints I discussed above, which affect phenotypic variation.) For the sake of completeness, I need to discuss constraints on genotypic variation briefly. To appreciate the inevitability of genotypic constraints is simple. Just recall that the genotype network of any one phenotype occupies a vanishing fraction of genotype space. This fact is the cause of genotypic constraints. It means that genotypes can not vary arbitrarily if a phenotype is to be preserved. Examples are so numerous that it is difficult to know where to start. Perhaps the best place is by showing an alignment of protein sequences like that of Figure 12.4. The sequences in this alignment
168
T H E O R I G I N S O F E V O L U T I O N A RY I N N O V A T I O N S
S P E I E TR I DE L RKE NP S I F SWE I RE KL I KEGF AD - - - P P S - - T S S I S R L L RGS DR T P E I ENR I E EY KR S S PGMF SWE I RE KL I REGV CDR S TAP S - - V S A I S R LV RGR DA T PD I E S R I E E L KQS QPG I F SWE I RAKL I EAGV CDKQNAP S - - V S S I S R L L RGS SG S P E F EKR I LD I QKE NPGV F SWE I RE KL L KEGQMDR AAVP S - - V S C I S R I L R SHGE NS E I E S K I EQY KKD S P SMF SWE I RDQL I KEGL CDR S S AP T - - V S A I S R I L R S K GC T PDVEKK I E EY KRE NPGMF SWE I RDKL L KDAV CDRNTVP S - - V S S I S R I L R S K FG T PDVEKK I E EY KRE NPGMF SWE I RD R L L KDGHCDR S TVP SGLV S S I S RVL R I K FG TADVDNK I E EY KKE NPG I F SWE I RE R L I KEG I CDR SNVP S - - V S S I S RT L RAK GC QP E I E EK I LQY S S E NSG I F SWE LREML I KNGD CER S TAP S - - V S T I S RT L RAHGV S KE HEYL I VEY R - - KQFA YAWEMRE EMV KR - - - GVQKVP P - - VDQ I KRVL RAK GC SNE NECL I VE L R - - QQFA YAWE LRE EMV KR - - - GAKKVP S - - VDQ I KRVL RAK GC T P E VV S K I AQY KRE CP S I FAWE I RD R L L S EGV CTNDN I P S - - V S S I NRVL RNL AS TAE VV S K I SQY KRE CP S I FAWE I RD R L L QENV CTNDN I P S - - V S S I NRVL RNL AA T P E VVNK I ADY KRE CP S I FAWE I RD R L I T ENV CNT DN I P S - - V S S I NRVL RNF QN T P E VVNK I AS Y KRE CP S I FAWE I RD R L L NEG I CNNDN I P S - - V S S I NRVL RNL NG T PQVVNK I AMY KRE CP S I FAWE I RD R L L NEAV CNA EN I P S - - V S S I NRVL RNL NG T P P VVAR I AQL KGE CPAL FAWE I QR QLC AEGL CTQDKT P S - - V S S I NRVL RAL QE T PG VVNA I KDY KVR DPG I FAWE I RD R L L SDAV CDK YNVP S - - V S S I S R I L RNK I G T P S VVNA I KDY K I R DPG I FAWE I RD R L L SDC I CDK YNVP S - - V S S I S R I L RNK I G T PNVVKH I RDY KQGDPG I FAWE I RD R L L ADGV CDK YNVP S - - V S S I S R I L RNK I G T P T VVKH I RTY KQR DPG I FAWE I RD R L L ADGV CDK YNVP S - - V S S I S R I L RNK I G T PK VVNY I RE L KQR DPG I FAWE I RD R L L S EG I CDK TNVP S - - V S S I S R I L RNK LG T PQ I VNK I R S Y KR L DPGMFAWE I RD L L I EDKV CDT NS AP S - - V S S I S R I L RNK I G T PR VVEK I CDY KRQNP 
TMFAWE I RD R L L S EG I CDHDNVP S - - V S S I NR I V RNK AA T PK VVEK I CEY KRQNP TMFAWE I RD R L L GEQ I CDQDNVP S - - V S S I NR I V RNK AA T PR VVEK I CEY KRQNP TMFAWE I RD R L L VEC I CDT ENVP S - - V S S I NR I V RDK AA THDVVMR I T EY KRE NP TMFAWE I RD R L L ADEV C SQ E TVP S - - V S S I NR - - - - - - TQDVVVK I T EY KRDNP TMFAWE I RD R L L SDG I CTG E TVP S - - V S S I NR I V R S K T S T PK VVDK I AEY KRQNP TMFAWE I RD R L L AEG I CDNDTVP S - - V S S I NR I I RTK VQ T PK VVEK I AEY KRQNP TMFAWE I RD R L L AERV CDNDTVP S - - V S S I NR I I RTK VQ T PK VVEK I GDY KRQNP TMFAWE I RD R L L AEGV CDNDTVP S - - V S S I NR I I RTK VQ T P P VVDA I ANY KRE NP TMFAWE I RD R L L AEA I C SQDNVP S - - V S S I NR I V RNK AA T P E VVNK I T EY KHANP TMFAWE I RQQL I DDRV CLKDNVP S - - V S S I NR I V R S Y S A T P T VVKK I I R L KE E NSGMFAWE I RE QLQQQRV CDP S S VP S - - I S S I NR I L RNS GL Figure 12.4 Constrained evolution of molecular genotypes becomes visible in protein sequence alignments. The 34 protein sequences shown are fragments of the Paired protein domain characteristic of Pax proteins [120]. These sequences are taken from a wide variety of organisms, including vertebrates (humans), tunicates (sea squirts), and invertebrates (fruit flies). The letters shown correspond to the standard single letter code for proteinaceous amino acids [741]. Dashes represent alignment gaps, corresponding to either deletions or insertions of protein coding DNA into one or more genes during their evolution. Letters on black background indicate amino acids that are unchanged in this set of sequences (9 out of 55 amino acids); letters on gray and white background indicate amino acids that show low and high variability, respectively. Data from [206].
come from a class of transcriptional regulators called Paired box (Pax) proteins [98, 120]. Pax proteins are involved in the embryonic development of many different body structures and organs [120]. Among other phenotypic features, these proteins share a structure called the Paired domain, which is
required for DNA binding [524]. Fifty-five amino acid residues of this domain are shown in the 34 sequences of Figure 12.4. The sequences shown are taken from diverse animals that comprise both vertebrates, such as humans, and invertebrates, such as the fruit fly [206]. The alignment shows that amino
acid sequences encoding the Paired domain are highly variable, but this variability is constrained. For example, 9 out of 55 amino acids in the aligned sequences do not vary at all (black background), and multiple others vary little (gray background). We do not know the exact size of the genotype networks of these Paired domain proteins, but it is certain to comprise a small fraction of sequence space. This fact causes the constrained variation visible in Figure 12.4. Historically, such constrained genotypic variation has been crucial to identify and classify different proteins into families that differ in their structure and function. It is also important to assign newly identified proteins to functional classes. The same holds for RNA molecules. Constraints on genotypic variation are of course not just restricted to molecules, they also occur in regulatory circuits. For example, while gene regulatory circuits that are responsible for patterning similar body parts in different animals can be highly diverse, they often contain a small number of conserved regulatory interactions. A case in point is the transcriptional regulation circuitry that determines gut development in sea urchins and starfish, and heart development in Drosophila and vertebrates. In these circuits, a small number of regulatory interactions has been conserved at least since the end of the Cambrian, for almost 500 million years [165]. Constraints on genotypic variation also affect metabolic networks. Figure 2.2a showed data highlighting the great diversity of reaction content in more than 200 metabolic networks of prokaryotes with completely sequenced genomes. We know little about the exact metabolic phenotype of many of them, except that they must be capable of producing essential biomass components in their host organism’s environment. This fact alone requires that their genotypes must be constrained in their variation. 
Such constraints appear as correlated presence/absence patterns of individual chemical reactions in different metabolic networks, much like what we observed in proteins, where amino acids cannot vary freely, but show correlated variation. Figure 12.5 shows an example involving the last four reactions in the biosynthesis of cobalamine (vitamin B12). Only prokaryotes are known to synthesize this complex molecule, an essential cofactor for some enzymes [635]. The reactions shown in
Figure 12.5 assemble the major parts of the molecule [635]. Figure 12.5a lists these reactions in different shades of gray, together with the names of key substrates. The phylogenetic tree in Figure 12.5b shows how these reactions are distributed among 222 prokaryotic metabolic networks. This tree is based on the 16S ribosomal DNA of 222 species of prokaryotes with completely sequenced genomes, whose metabolic networks have been characterized to various degrees [831]. Superimposed on this tree are gray bars that indicate the presence or absence of each chemical reaction from Figure 12.5a in the same shade of gray. A particular shade of gray in a bar above a species indicates that a gene encoding an enzyme catalyzing this reaction is present in the species. The four highlighted reactions reveal a striking pattern of association: they are almost always jointly present or absent. This is an example of extremely constrained genotypic variation on the level of a genome’s enzyme coding genes. This pattern is expected if the four shown reactions cannot be easily bypassed by an alternative metabolic route leading to cobalamine. The small number of exceptions where a metabolic network encodes fewer than four reactions may reflect a decaying cobalamine biosynthesis pathway, the presence of other pathways that produce some related product, or simply errors in what we know about metabolic genotypes for some organisms. Many other groups of reactions also show constrained variation, which need not always be this extreme [831]. In sum, genotypic constraints on variation occur on all levels of biological organization. They are a consequence of the fact that any one phenotype’s genotype network, even though it extends far through genotype space, occupies only a tiny fraction of this space. These constraints are highly valuable to classify molecules according to sequence and probable function. 
They may acquire similar value for regulatory circuits and metabolic networks in the future.
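The all-or-none pattern described for the four cobalamine reactions can be made concrete with a small calculation. The following is a minimal sketch on invented data, not the analysis of [831]: the genome names, the reaction labels R1 to R4, and both function names are hypothetical. It computes the fraction of genomes that encode either all or none of a set of reactions, plus a pairwise co-occurrence score (the Jaccard index, chosen here only for illustration).

```python
from itertools import combinations

# Hypothetical presence/absence data: genome -> set of reactions it encodes.
# R1..R4 stand in for the four reactions of Figure 12.5a.
genomes = {
    "genome_A": {"R1", "R2", "R3", "R4"},
    "genome_B": set(),
    "genome_C": {"R1", "R2", "R3", "R4"},
    "genome_D": {"R1", "R2", "R3"},  # an exception: one reaction is missing
    "genome_E": set(),
}
reactions = ["R1", "R2", "R3", "R4"]


def all_or_none_fraction(genomes, reactions):
    """Fraction of genomes that encode either all or none of the reactions."""
    target = set(reactions)
    hits = sum(
        1 for present in genomes.values()
        if not (present & target) or target <= present
    )
    return hits / len(genomes)


def jaccard(genomes, r1, r2):
    """Genomes encoding both reactions divided by genomes encoding at least one."""
    both = sum(1 for p in genomes.values() if r1 in p and r2 in p)
    either = sum(1 for p in genomes.values() if r1 in p or r2 in p)
    return both / either if either else 1.0


fraction = all_or_none_fraction(genomes, reactions)
pairwise = {(a, b): jaccard(genomes, a, b) for a, b in combinations(reactions, 2)}
print(fraction, pairwise)
```

In this toy data set, four of five genomes show the all-or-none pattern. A value near 1 for real data would correspond to the strongly constrained variation visible in Figure 12.5b.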
Summary
Based on the cause of constrained phenotypic variation, one can distinguish physicochemical, selective, genetic, and developmental constraints. The last class of constraints emerges from the processes that produce phenotypes from
[Figure 12.5, panel (a): labels in order, where compounds alternate with the enzymes that convert them: Precorrin 2; adenosyl cobyrinate diamide; adenosylcobyric acid synthase (CobQ); adenosyl cobyrinate hexaamide; cobalamin biosynthetic protein (CobC, D); adenosyl cobinamide; adenosylcobinamide kinase (CobP, U); adenosylcobinamide-phosphate; guanylyltransferase (CobP, U); adenosine-GDP-cobinamide; adenosylcobinamide-GDP ribazoletransferase (CobS, V); cobalamine (vitamin B12). Panel (b): circular phylogenetic tree with presence/absence bars for each species; the species labels are not recoverable from the extracted text.]
Figure 12.5 Four highly constrained reactions in cobalamin biosynthesis. (a) The four last reactions of the cobalamin biosynthesis pathway are written in different shades of gray to help visualize their occurrence in (b), which displays a 16S rDNA-based maximum-likelihood phylogenetic tree of 221 prokaryotic species analyzed here. Bars along the circumference of the tree indicate whether a specific reaction (as indicated by the bar’s shade of gray) is catalyzed by an enzyme encoded in a species’ genome or not. Bars containing two or more shades of gray indicate that two or more reactions occur in a given species. Note that most bars contain none or all four of the shades of gray, indicating that the respective genomes encode none or all four reactions. Gene symbols beginning with the prefix “Cob” in (a) reflect names of genes known to catalyze these reactions in aerobic prokaryotes [571, 635]. Data from [831].
genotypes. I examined these four causes for molecules, regulatory circuits, and metabolic networks in the unifying genotype space framework. This framework can help us see that the processes of phenotype formation cause the three other classes of constraints. It can also help us appreciate why causes of constrained variation are often entangled and not clearly separable. I also showed that the kind of evolutionary stasis that occurs during punctuated and episodic evolution is a consequence of
genetic constraints, whose origins the genotype space framework can readily explain. The remarkable yet simple observation that typical genotypes are to some extent robust, that is, that they have many neighbors with the same phenotype, is sufficient to explain important features of genotype space and the resulting constraints. Together with developmental processes, it can thus be viewed as a fundamental cause of multiple aspects of constrained phenotypic variation.
CHAPTER 13
Phenotypic plasticity and innovation
My discussion thus far was based on the assumption that a single genotype produces a single phenotype. This simplification aids conceptual development. However, individual genotypes can often form more than one phenotype. This phenomenon is usually called phenotypic plasticity, although other names also circulate [848]. Phenotypic plasticity arises from environmental change. On the smallest scale, such change arises from random fluctuations in the microscopic environment of a molecular system, such as a protein. These fluctuations include not only the thermal motion of molecules, but also random change in their numbers or concentrations. On a larger scale, environmental change is variation in an organism’s external environment. Either way, environmental change is an essential driver of plasticity. In this chapter, I will first show that phenotypic plasticity is widespread, and that plasticity itself varies genetically. I will then show that the genotype network concept can accommodate plastic phenotypes. More than that, the very existence of genotype networks facilitates the origin of new phenotypes through environmental change. I will then discuss genetic assimilation and related phenomena. In genetic assimilation, a previously plastic phenotype loses this plasticity over time, and forms only one of its alternative phenotypes. Genetic assimilation is a process in which previously existing plastic phenotypes can become independent of the environment. Although assimilation does not concern the origin of innovations but their stabilization, it is a widely debated phenomenon that deserves integration into this context. Assimilation may be very widespread, thus supporting the notion that environmentally induced phenotypic change may be an important mode of evolutionary innovation. I will also argue that plasticity is not necessarily good for adaptive evolution.
It may even slow down adaptive evolution, depending on the phenotype considered. Finally, I will discuss how robustness to mutations may influence evolutionary innovation through environmental change. I emphasize that the chapter is not an exhaustive treatment of plasticity, which is a vast subject [848]. Rather, I aim to show how plasticity fits into the framework I propose here, and to sketch its role in evolutionary innovation.
Plasticity everywhere
The best-known and most numerous examples of plasticity regard traits of whole organisms. The dramatically different juvenile and adult phenotypes of metamorphosing organisms (think: caterpillars and butterflies) are striking instances of plasticity. Others include caste determination in social insects [335, 848]. Perhaps the best known example here regards developing honeybee larvae, where the amount of royal jelly (a substance that bees excrete) fed to a larva determines whether it develops into a queen or not. A second broad class of examples involves plants that can readily change their morphology in response to environmental change. For instance, the marsh plant Sagittaria sagittifolia (also called arrowhead) grows in wetlands, where parts of the plant are submerged. The leaves below water adopt a narrow, linear shape, whereas the leaves above water resemble arrowheads (Figure 13.1). A single genotype can produce these different leaf shapes [682, 848]. These examples regard morphological traits, but plasticity is also widespread for behavioral traits. For instance, every organism that changes its behavior in response to the environment, from chemotactic bacteria to foraging ants to humans, displays plasticity in a behavioral phenotype [848].
Figure 13.1 Plastic leaf morphology in specimens of the marsh plant Sagittaria sagittifolia (arrowhead). (a) A partly submerged plant displays narrow linear leaves under water, as well as arrowhead-shaped leaves above water. Water is indicated by the parallel horizontal lines. Completely terrestrial (b) and submerged plants (c) display only one kind of leaf. From [848], after figures 31–33 in [682], used with permission from Oxford University Press.
Plasticity is also abundant on suborganismal and microscopic levels of organization. For example, plants react to changing light not only by moving their leaves, but also by moving their chloroplasts, by changing chloroplast shape, and chloroplast internal structure. Upon changes in irradiance, they change the amount of photopigments in chloroplast membranes, and the number of chloroplast membrane stacks known as granal regions. Such plasticity may improve light harvesting in low light and protect the photosynthetic apparatus in high light [23, 24]. Similarly, the number of mitochondria per cell can vary widely in a single organism [13]. Suborganismal plasticity can also be essential for proper development of adult body structures, especially the nervous system. For example, the visual cortex does not develop properly in the absence of light exposure [268, p.824, 339]. A curious example of small-scale plasticity regards persister forms of bacteria. In a bacterial population of genetically identical individuals exposed to an antibiotic, a small number of individuals often survive (“persist”) indefinitely, whereas others die. Part of the reason is that persisters divide much more slowly, even before the
exposure to antibiotic. Importantly, persisters are not genetically different from non-persisters. Most offspring of persisters divide rapidly, showing that this plasticity in cell division rate is not caused by genetic variation [37, 392, 451]. It is phenotypic plasticity. Cellular properties such as cell division rates, organelle numbers, and neural connectivity, are ultimately determined by molecules and the regulatory circuits they form. It is thus little surprise that plasticity is also widespread on the level of such circuits. For example, the expression of many proteins and mRNA molecules varies widely among genetically identical cells that exist in the same environment [43, 51, 52, 63, 212, 216, 504, 586, 632]. As an example, Figure 13.2 shows the expression of green fluorescent protein in a population of genetically identical cells of the bacterium Bacillus subtilis [586]. The figure shows the distribution, as well as the mean (⟨p⟩) and the standard deviation (σP) of the protein’s expression level, as measured by fluorescence intensity. Clearly, different cells vary widely in their expression level of this protein. This is not a peculiarity of this protein. For instance,
in the yeast Saccharomyces cerevisiae, data on variation in protein levels among single cells have been measured for more than 2000 proteins. The coefficient of variation (the ratio between the standard deviation and mean protein expression level among cells) has a median of 20.4 [553]. This means that for many proteins the extent of variation in expression levels greatly exceeds the mean. Multiple mechanisms may cause such expression noise, but a particularly important one seems to be transcription initiation. Transcription initiation requires events such as chromatin opening and RNA polymerase binding to DNA. These events themselves involve stochastic interactions of molecules [63, 553, 633]. Interactions among genes in a regulatory circuit can dampen or amplify such noise in ways that are still incompletely understood. Conversely, noise can alter the gene activity pattern that a circuit’s regulatory interactions would produce in the absence of noise [52, 222, 448, 601–603, 632, 878]. The lowest level of phenotypic organization is that of individual molecules, where plasticity is again abundant. I will first discuss some examples involving proteins. Plasticity can change a protein’s global tertiary structure. An example involves lymphotactin, a molecule that promotes chemotaxis in T cells, which are important cells in the human immune system. Under physiological conditions, lymphotactin interconverts rapidly between two very different tertiary structures [788]. The first is a monomer with a single alpha-helix and a three-stranded beta-sheet. The second is a dimer whose monomers consist only of a four-stranded beta-sheet (Figure 13.3a). Another kind of large-scale conformational change is the aggregation of proteins into large insoluble complexes that are involved in many diseases, including Alzheimer’s disease and Parkinson’s disease [658]. Although not rare, such dramatic global structural changes may be dwarfed in abundance by more local changes, including fluctuations of amino acid side chains and loops in a protein’s structure phenotype. In general, small-scale conformational changes are at the heart of a protein’s “breathing” motions, low frequency conformational changes that are required for catalysis [87, 184, 613]. They are thus ubiquitous in enzymes. Both global and local structural changes need not just be caused by thermal noise, but can be promoted by molecules that bind specifically to a protein. Prominent examples include the allosteric
[Histogram not reproduced: number of cells (vertical axis) against p in fluorescence units (horizontal axis), with the mean ⟨p⟩ and the standard deviation σP marked.]
Figure 13.2 Plastic gene expression. The figure shows the experimentally determined distribution, as well as the mean ⟨p⟩ and the standard deviation σP of green fluorescent protein expression levels in a clonal population of Bacillus subtilis in arbitrary units [586]. Used with permission from Nature Publishing Group.
Figure 13.3 Plasticity in protein structure phenotypes. (a) The protein lymphotactin interconverts between two different global conformations [788]. (b) The figure shows two conformations of cytochrome P450-CYP2B4 indicated by light and dark shades of gray [541]. These conformations arise through binding of the two different substrates bifonazole and 4-(4-chlorophenyl) imidazole. Figures courtesy of Danny Tawfik. From [780].
regulation of enzymes by small molecules, and the conformational changes in cell membrane receptors upon ligand binding [741]. A single protein genotype can not only form multiple structures, but also perform different biochemical functions [344, 367, 566]. Catalytic promiscuity, a widespread phenomenon, serves as an example [566]. Here, a single protein may catalyze not only one reaction with one kind of substrate, but also very different reactions that use different substrates. One example is chymotrypsin, which can cleave many different kinds of compounds. Another is bovine carbonic anhydrase. Its main activity is the interconversion of carbon dioxide and bicarbonate ions (HCO3-), but it can also cleave organophosphates, compounds that include nerve gases and pesticides [566]. A third example involves members of the cytochrome P450 enzyme family that can hydroxylate a broad spectrum of chemicals. Because protein function is linked to protein structure, it is not surprising that such functional plasticity is reflected in structural plasticity. Figure 13.3b indicates in two different shadings of gray two different conformations of cytochrome P450 protein that are induced by the binding of two different small molecules [541, 780]. While more is known about phenotypic plasticity in proteins, phenomena analogous to those in proteins exist in RNA. For example, thermal motions cause the continuous formation and destruction of base pairs in RNA sequences, such that any one RNA genotype can assume many different alternative structures in addition to its minimum free energy structure [22, 130, 181, 865]. Like proteins, RNA molecules can also change their structure through specific interactions with small molecules. In riboswitches, for instance, secondary structure change occurs through the binding of a small molecule to a messenger RNA. The result is a change in gene expression; for example, by hindering the ribosome’s access to its binding site on the RNA.
In many organisms, such riboswitches regulate the expression of genes involved in the biosynthesis of small molecules, such as vitamins and amino acids. These molecules are often themselves the regulators of a riboswitch [810].
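The statement that one RNA sequence occupies many alternative structures besides its minimum free energy structure can be illustrated numerically. The sketch below assumes that structures are populated according to Boltzmann weights, a standard statistical-mechanics treatment that the text does not spell out. The structure names and free energy values are invented, and the function name is hypothetical.

```python
import math

# Hypothetical free energies (kcal/mol) of three alternative structures
# of a single RNA sequence; "mfe" is the minimum free energy structure.
free_energies = {"mfe": -10.0, "alt_1": -9.0, "alt_2": -7.5}

# Gas constant in kcal/(mol K) times an assumed temperature of 310.15 K (37 degrees C).
RT = 0.0019872 * 310.15


def boltzmann_probabilities(energies, rt):
    """Equilibrium probability of each structure under Boltzmann weighting."""
    weights = {name: math.exp(-e / rt) for name, e in energies.items()}
    partition_sum = sum(weights.values())
    return {name: w / partition_sum for name, w in weights.items()}


probs = boltzmann_probabilities(free_energies, RT)
for name, p in probs.items():
    print(f"{name}: {p:.3f}")
```

With these invented energies the minimum free energy structure dominates, yet the molecule still spends a noticeable fraction of time in the alternatives. These fractions correspond to what later sections call minority phenotypes.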
This smattering of examples shows the enormous breadth and heterogeneity of plasticity. The variable phenotypes may be active (leaf morphology) or passive (protein motion) responses to change; they may be largely irreversible (plasticity in embryonic development) or reversible (riboswitches); they may require a specific signal from the environment (light adaptation) or just random fluctuations (gene expression noise); they may be adaptive (allosteric regulation) or maladaptive (protein aggregation). I will treat all such phenomena as instances of phenotypic plasticity, although narrower definitions of plasticity are conceivable [677, 848].
Plasticity varies among genotypes
If plasticity is to have a role in innovation, then genotypes need to vary in plasticity; that is, in the spectrum of phenotypes they produce. Such variation is indeed ubiquitous. It is perhaps easiest to appreciate for molecules. Proteins with slightly different amino acid sequences may differ in their conformational plasticity, if their amino acids differ in hydrophobic, electrostatic, and other interactions that stabilize the native structure. Such differences in plasticity can have functional consequences. For example, different variants of human cytochrome P450, a promiscuous enzyme I also discussed earlier (Chapter 7), differ in their structural plasticity. The variants whose structures are more plastic are also more promiscuous; they metabolize a greater number of substrates [707]. Also in Chapter 7, I discussed protein engineering experiments that mutagenized promiscuous proteins, such as paraoxonase or carbonic anhydrase. Such mutations can change an enzyme’s phenotypic plasticity through change in the main activity and in one or more promiscuous activities [8, 20]. Genetic variation in plasticity also exists for the expression levels of genes. For example, different yeast proteins show different plasticity in their expression levels. Specifically, the expression of proteins that are involved in the cell’s response to environmental change is more plastic than that of proteins involved in protein synthesis [553]. Genetic differences in expression plasticity can be readily engineered into cells, for example, by changing the promoter sequences necessary for transcription initiation, or by mutating ribosomal binding sites on mRNA and thus affecting translational initiation [63, 586]. Genetic engineering also serves to show that plasticity can vary genetically for higher level phenotypes. A case in point regards the morphology of yeast (Saccharomyces cerevisiae) cells, whose plasticity in cell shape can be estimated through quantitative microscopic phenotyping of cell morphology in genetically identical populations [574]. In genetic variants where individual genes are eliminated (knocked-out) from the genome, this morphological plasticity can change. For example, 300 out of more than 4500 yeast gene knock-out strains show an increase in morphological plasticity [450]. While these observations on a morphological trait demonstrate that plasticity can vary genetically, the mechanistic causes of such variation are less clear than for molecular phenotypes. This knowledge gap becomes worse for macroscopic morphological phenotypes of higher, multicellular organisms. Here, investigations of variation in plasticity have the longest tradition and a huge body of literature exists [615, 677, 848]. It can be summarized in this way: most plastic phenotypes show genetic variation in plasticity. Evidence comes from studies demonstrating that plasticity varies in natural populations, and from laboratory evolution experiments—many of them in fruit flies—which demonstrate that plasticity can change in the course of evolution [677].
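One common way to compare expression plasticity between genes is the coefficient of variation, defined earlier in this chapter as the standard deviation divided by the mean expression level among cells. The sketch below applies that definition to two invented sets of single-cell measurements. The gene labels and all numbers are hypothetical, and using the population (rather than sample) standard deviation is an assumption made here for simplicity.

```python
import statistics

# Hypothetical single-cell expression levels (arbitrary fluorescence units)
# for two proteins measured in genetically identical cells.
expression = {
    "stress_response_protein": [180.0, 220.0, 200.0, 260.0, 140.0],
    "protein_synthesis_protein": [190.0, 200.0, 210.0, 200.0, 200.0],
}


def coefficient_of_variation(levels):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(levels) / statistics.mean(levels)


cv = {name: coefficient_of_variation(levels) for name, levels in expression.items()}
for name, value in cv.items():
    print(f"{name}: CV = {value:.3f}")
```

Both invented proteins have the same mean expression level, so the difference in the coefficient of variation reflects only the spread among cells, which is the quantity that differs genetically in the examples above.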
Genotype networks and plasticity
As in earlier chapters, I will here focus on systems where we can study the relationship between genotype and complex phenotypes in detail. These include molecules and regulatory circuits. I also do so, because novel phenotypes in these systems are the building blocks of more complex macroscopic innovations. They may thus teach us important general lessons. Figure 13.4 shows in a highly schematic and simplified fashion how we can envision phenotypic plasticity in the context of a genotype network. As in earlier representations (Figure 5.5), circles represent individual genotypes, and edges connect neighboring genotypes. The shades of gray in each circle reflect the fraction of time a system adopts each of several hypothetical phenotypes. Membership of a genotype in a genotype network is defined through a dominant phenotype (largest sector). Aside from this dominant phenotype, the genotype may also form several alternative, “minority” phenotypes, either as a result of small-scale random noise or through larger scale changes in the external environment. The figure includes the boundaries of three genotype networks with different dominant phenotypes. This
schematic is highly simplified, not only because it neglects the high-dimensional nature of genotype space. It also does not reflect that each protein or regulatory circuit may assume many (not just three) alternative phenotypes; that the amount of time spent in alternative phenotypes may vary substantially among genotypes; and that different genotypes on the same genotype network will differ in the identity of their alternative phenotypes.
Figure 13.4 Phenotypic plasticity and genotype networks. The figure shows the boundary between three genotype networks, where individual genotypes (circles) can form one of three phenotypes (represented by three shades of gray). The size of a circle sector with specific gray shading corresponds to the prominence of a specific phenotype in the spectrum of alternative phenotypes a genotype can form. For example, for a regulatory circuit, it might correspond to the amount of time the circuit shows any one gene expression pattern as a result of gene expression noise, or in a given spectrum of external environments. For a protein, it might correspond to the time spent in a given fold. Lines connecting circles indicate neighboring genotypes.
This visual metaphor serves to highlight that in the presence of plasticity, the boundary of genotype networks is no longer crisp, but becomes blurred: minority phenotypes of a genotype on one genotype network may be dominant phenotypes of nearby neighbors on different genotype networks. This metaphor also demonstrates that phenotypic plasticity need not violate the observation that many systems show fewer phenotypes than genotypes (Chapter 5). Phenotypic plasticity simply means that a given (perhaps astronomically large) number of phenotypes become distributed among multiple genotypes, such that one genotype can have more than one phenotype. The framework I propose in this book can capture these complexities, partly because it is suited for phenotypes that are complex, multidimensional objects. In earlier chapters, I developed a simple yet general view on how evolutionary search can explore an astronomical number of novel phenotypes, while preserving existing phenotypes. The perspective of Figure 13.4 can help us appreciate that phenotypic plasticity may not require a major modification of this perspective. To see this, recall a critical feature of any one genotype network: mutational exploration of different genotypic neighborhoods yields different novel phenotypes. With this feature in mind, consider the following plasticity-centered view of evolutionary search for novel phenotypes. In this view, evolutionary search starts with a genotype G that has some dominant phenotype P. The search has two stages. In the first stage, a series of mutations in G that do not change P may change the spectrum of minority phenotypes, until some specific genotype G’ has a specific novel phenotype Pnew in its phenotypic spectrum. The second stage involves additional mutations that turn Pnew into the dominant phenotype, provided that this genotype G’ is close to the genotype network of Pnew. 
If one focuses on the origin of a novel phenotype, then the relevant events occur in the first stage. They are the mutations that allow a phenotype to first appear in the plastic phenotypic repertoire of a genotype. Genotype networks clearly facilitate these events, and thus the origin of novel phenotypes, because they allow preservation of an existing dominant phenotype, while modification of its encoding genotype’s plastic repertoire occurs. This repertoire changes with a genotype’s location in a genotype network.
Genetic assimilation
I could end my discussion of plasticity here: genotype networks facilitate innovation also for plastic phenotypes. However, much discussion in evolutionary biology has focused on the second stage of the evolutionary search above. This second stage is often called genetic assimilation [682, 816–818]. It is closely related to a variety of other phenomena that I will not discuss further here, such as the Baldwin effect and genetic accommodation [39, 848, pp.147–157]. Although assimilation does not strictly pertain to how novel phenotypes originate, understanding it is useful to understand the role of plasticity in innovation. As I just discussed, assimilation begins after a phenotype Pnew originates as a minority phenotype of some genotype G’. That is, Pnew may initially be formed by only a few individuals in a population, or during a small fraction of any one individual’s lifetime. It might arise only in a specific environment, or as a result of molecular noise. For example, Pnew might be a novel protein structure that can catalyze a new chemical reaction, or a regulatory circuit’s new gene expression pattern that affects a morphological trait through its role in embryonic development. In genetic assimilation, this phenotype Pnew becomes a phenotype that most individuals form most of the time, and independently of environmental cues. Assimilation would typically proceed as follows. If Pnew confers superior fitness on its carrier, it may facilitate the persistence of G’ in a population of organisms, until a mutant G’’ arises where Pnew is a prevalent phenotype. This mutant can then rise in frequency until many individuals in the population have its genotype, such that Pnew becomes independent of any original environmental trigger. I note that selection may favor not only the assimilation of novel phenotypes into prevalent phenotypes, but also their continued coexistence with other, existing phenotypes.
In other words, plasticity itself is often adaptive, as some of the opening examples of the chapter illustrated. Although I will focus below on genetic assimilation, much of what I say applies to adaptive plasticity. Before turning to empirical evidence for assimilation, I will note that one last condition in the above
evolutionary search scenario is important for the scenario to work: genotypes that contain a specific phenotype Pnew in their spectrum of plastic phenotypes must have genotypes nearby that have Pnew as the dominant phenotype. Whether this is so can only be answered by a systematic exploration of genotypes, phenotypes, and genotypic neighborhoods, and thus at present by computational approaches. The answer is yes, based on current knowledge. For example, Ancel and Fontana have shown that the RNA secondary structure phenotypes found in the plastic repertoire of an RNA genotype G are similar to those found as dominant (minimum free-energy) phenotypes in the neighborhood of G. They called this phenomenon plastogenetic congruence [22]. An analogous phenomenon holds for regulatory circuits. We showed that minority gene expression patterns (Pnew) formed by a regulatory circuit genotype as a result of gene expression noise occur as dominant phenotypes within a small genotypic neighborhood around the circuit [222].
Assimilation in the laboratory and in the wild
To find out whether phenotypic plasticity is important for the fixation of novel phenotypes in a population, one needs to ask how prevalent genetic assimilation is. Three pertinent questions need answering. Can assimilation occur in principle? Does it occur in nature? And does plasticity speed up adaptive evolution? That is, would new phenotypes sweep faster through a population if they first arise as part of a genotype’s plastic repertoire of phenotypes? I will examine these questions in turn. First, genetic assimilation can occur in principle. This has been demonstrated in multiple, independent laboratory experiments [197, 666, 816, 817]. For example, Waddington showed that some embryos of the fruit fly Drosophila can develop a second thorax when exposed to ether vapor [817]. He then carried out a laboratory evolution experiment in which he continually selected for embryos with the ability to develop this second thorax in response to ether vapor. A few generations into this experiment, flies appeared that developed the second thorax without ether treatment. Thus, a second thorax, which first appears as a minor phenotype, and only in a specific environment, can become genetically assimilated [817]. This and related experiments clearly show that assimilation is more than a theoretical possibility [197, 666, 816, 817]. The next question is whether assimilation actually occurs in the wild. Studies that address this question typically rely on a comparative approach (Figure 13.5). Ideally, one needs to show that a phenotype of interest shows plasticity in some ancestral species (“A” in Figure 13.5), but that this plasticity has been lost and one of several alternative phenotypes has become fixed in contemporary, extant species. The problem with this idea is that, except in a few cases, the ancestor (“A”) is long extinct. One thus needs to infer from one or more extant species (“E”) whether it was plastic. Fortunately, methods of phylogenetic analysis permit such inference, provided that sufficiently many species with the phenotype of interest are known. I will highlight two examples of this approach from two very different levels of organismal organization: molecules and whole organisms. The first example regards the evolution of steroid hormone receptors [88]. I have already discussed these receptors earlier in the context of epistasis (Chapter 7). In extant animals, the glucocorticoid receptor is typically activated by the steroid hormone cortisol,
Figure 13.5 Detecting genetic assimilation in the wild. The figure shows a hypothetical phylogenetic tree. Squares can be viewed as representing biological systems on different levels of organization, from molecules to whole organisms. Black and white indicate one of two alternative phenotypes. Filled or open squares represent systems that form only one of these two phenotypes. Squares with both black and white indicate a system with phenotypic plasticity. It can form both phenotypes. The letter “A” indicates an ancestral (typically no longer existing) system, and “E” indicates a system with plasticity existing today. See text for details.
whereas the mineralocorticoid receptor is activated by aldosterone. The two different receptors originated early in the evolution of vertebrates. For receptor molecules like these, many representatives from different species are known. The amino acid sequences of these representatives can be used to reconstruct the putative ancestor (corresponding to “A” in Figure 13.5). A study that synthesized this ancestral receptor found that it can be activated by both cortisol and aldosterone. In addition, receptors in basal extant vertebrates, such as the hagfish and the lamprey, are activated by both cortisol and aldosterone. These species correspond to taxa indicated by “E” in Figure 13.5. Taken together, these observations suggest that plasticity in hormone interactions of an ancestral steroid hormone receptor has been assimilated into the more specific response of extant receptors [88]. Very few studies go to this level of detail in examining genetic assimilation in proteins, but more circumstantial evidence suggests that this example may not be a rare exception [265, 495, 566]. For instance, enzymes with a specific primary catalytic activity are often related to proteins where this activity is one of several promiscuous activities. Examples include alkaline phosphatase, adenylate kinase, and threonine synthase [566]. My second class of examples regards whole organisms and asymmetry in their body organization. Examples include the claws of many crustaceans, such as lobsters and snapping shrimp, where the left claw can differ dramatically in size from the right claw; the vertebrate heart, which is displaced to the left in most vertebrates; the shells of snails, which can be left-coiled or right-coiled; the priapium, a copulatory organ in fish of the family Phallostethidae that is derived mostly from pelvic fins and occurs on one side of the body; and phyllotaxy—the arrangement of consecutive leaves on the stem of a plant—which can be right-handed or left-handed [592].
For some asymmetric traits, different individuals within a species or population typically display different (dextral or sinistral) asymmetries. In this case, the direction of asymmetry is typically not inherited from parent to offspring. I will refer to such asymmetry as plastic asymmetry. In contrast, for other asymmetric traits, most individuals are asymmetrical in the same direction, and this direction is
typically inherited. I will refer to it as deterministic asymmetry. Genetic assimilation of asymmetry occurs if plastic asymmetry in an ancestral species is replaced by deterministic asymmetry in a descendant species. Many asymmetric traits are sufficiently widespread to permit reconstructing the likely ancestral form of asymmetry—plastic or deterministic—from its distribution in extant species. In a study examining 63 different asymmetric traits, Palmer found that between 36 and 44 percent of these traits underwent genetic assimilation [592]. In the remaining cases, deterministic asymmetry may have arisen directly through mutation from an ancestrally symmetric trait. While this study is unusually systematic and regards multiple, albeit peculiar, traits, evidence of varying strength for genetic assimilation also exists in other traits [137, 265, 300, 319, 366, 614, 617, 848]. One example is the secretion of extrafloral nectar by Acacia trees. This nectar attracts ants that protect the plant against herbivore attack. In most Acacias, nectar secretion is a plastic trait, induced upon leaf damage, but Acacia trees that are obligately inhabited by symbiotic ants have lost this plasticity. They excrete nectar even in the absence of leaf damage [319]. Another example involves sex determination in turtles and lizards. Ancestrally, whether an individual develops into a male or female is determined in a plastic manner by the environmental temperature. However, this plasticity has been lost several times and been replaced by genetic sex determination [366, 618]. Individual studies like these can ask whether assimilation occurs in any one trait. The answer clearly is yes. However, they cannot answer whether assimilation is frequent or rare in nature. This question has been quite controversial [617]. Some authors emphasize assimilation as a prominent mechanism in evolutionary adaptation [592, 614, 615, 848], whereas others dismiss its importance [151, 581]. 
No consensus exists in this matter, but a few observations are germane. First, assimilation has not been part of orthodox evolutionary thinking for most of the twentieth century. In consequence, the number of studies looking for evidence of assimilation is limited. Such evidence may become more abundant once it is more actively sought. Second, to identify assimilation of any one trait in the wild requires that the trait must
still be plastic in some extant organisms. Assimilation could not be detected if such plasticity is absent. This limitation biases any studies of assimilation against its detection. Third, in laboratory evolution experiments, assimilation typically occurs very rapidly, within a few generations [197, 666, 816, 817]. Such transience of assimilation, if widespread, might also hamper its detection in the wild. A last, more speculative consideration is that plasticity is almost certainly a primal feature of life. This consideration leads me back to molecules, where this assertion is easiest to appreciate. In earliest life, the production of catalytic molecules was probably a noisy process. For example, early error-prone protein translation may have produced multiple “statistical proteins” from any one nucleic acid [858]. In addition, the fidelity of genetic information transmission was low; indeed, the first life forms may not even have had genes as we know them [535]. In consequence, the phenotypes of early molecules would have been more plastic than they are now. The same would then hold for any regulatory circuits or metabolic networks, because they are built from these molecules. If plasticity is really a primal phenomenon, then assimilation must also have been a primary mode of evolutionary adaptation wherever genetically determined phenotypes emerged from plastic phenotypes.
Does plasticity facilitate adaptation?
The above lines of evidence suggest that genetic assimilation can and does turn plastic phenotypes into genetically determined phenotypes. However, they are silent on the question of whether plasticity facilitates evolutionary adaptation, in general, and assimilation, in particular. In other words, is more plasticity better? Does it accelerate evolutionary adaptation, for example, the rate at which organisms that harbor a new and useful phenotype sweep through a population? If a system showed no plasticity, and if it had to produce all novel phenotypes through mutation, would it have a disadvantage? This question has been widely explored, with no clear emerging consensus [21, 680, 728, 744, 844, 848, p.178, 863]. The next two examples show that a wait for such a consensus might be futile. The first example regards a molecule, green fluorescent protein, and a simple phenotype: its fluorescence intensity. As the name suggests, this protein emits green light when exposed to light of a different wavelength. The protein was first isolated from jellyfish, and is widely used as a molecular marker to monitor cell biological processes [785]. When this protein is expressed in bacterial cells, genetically identical cells show different fluorescence intensities, which are partly caused by gene expression noise [216, 586] (Figure 13.2). In other words, fluorescence is a phenotypically plastic trait. In a laboratory evolution experiment that involved variants of this protein, Yomo and collaborators attempted to identify proteins with increased fluorescence intensity [674, 879]. To this end, they mutagenized their study protein, expressed the mutants in E. coli cells, identified an E. coli clone with high fluorescence intensity, mutagenized again, and so forth, repeating this cycle multiple times. In each of these cycles (“generations”), they found proteins that showed higher average fluorescence than their predecessors. The average is taken over a population of genetically identical E. coli cells expressing the protein. Intriguingly, the increase in average fluorescence intensity from one generation to the next was proportional to the phenotypic plasticity of fluorescence. That is, variants of the protein that showed greater plasticity in fluorescence than others produced mutants that also showed a greater increase in average fluorescence. In other words, plasticity in this phenotype facilitates adaptive evolution through mutations [674, 879]. A second example regards RNA secondary structure phenotypes. These phenotypes are plastic at ambient temperatures, because RNA molecules continuously fold and unfold in response to thermal motions. A typical RNA molecule forms a dominant, minimum free-energy structure, and a broad spectrum of other structures.
If any one such structure is optimally suited for a particular function, then plasticity has a cost: if a molecule is more plastic than another, i.e., if it forms a greater number of alternative phenotypes, then it will also typically spend less time in any one phenotype, including the optimal phenotype [22]. (I note parenthetically that potential costs of plasticity are known for many other plastic phenotypes, especially when their plasticity is an active
response to environmental change [7, 182, 615, 677, 848].) From this perspective, plasticity may not be a good thing. This suspicion is confirmed if one studies evolutionary adaptation in RNA; that is, evolutionary searches through genotype space for some optimal RNA phenotype. Such searches may take longer (or they may even fail) for highly plastic RNA molecules, when compared to molecules that lack such plasticity, where novel phenotypes must arise through mutation alone [22]. Whether plasticity accelerates adaptation also depends on how alternative phenotypes contribute to fitness [21, 22, 863]. These examples illustrate that the role of plasticity in adaptation may depend, among other factors, on the specific kind of phenotype considered. Plasticity in fluorescence intensity is variation in a simple, scalar phenotype, whereas plasticity in molecular structures creates a spectrum of complex objects. These two kinds of molecular phenotypes are very different from one another. In consequence, plasticity also means different things in the two cases. Plasticity is not a monolithic concept. This observation extends to phenotypes on higher levels of organization, which are at least as heterogeneous as molecular phenotypes. In sum, plasticity does not necessarily facilitate adaptive evolution. This would be a problem only if one believes that plasticity is ubiquitous because it facilitates adaptive evolution. However, as molecular systems ranging from molecules to regulatory circuits amply demonstrate, plasticity is one of life’s primal features. Natural selection leading to genetically determined phenotypes may often have to overcome such plasticity rather than harness it.
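The cost of plasticity described above for RNA can be made concrete with a small numerical sketch. It assumes that a molecule's alternative structures are populated at thermal equilibrium according to Boltzmann weights, which is a standard thermodynamic assumption and not a calculation taken from [22]. The free energies are invented for illustration, and the function name `occupancy` is hypothetical.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal/(mol*K)

def occupancy(free_energies_kcal, temperature_k=310.15):
    """Equilibrium fraction of time spent in each alternative structure.

    Assumes Boltzmann statistics: weight_i = exp(-E_i / (R*T)).
    Energies are shifted by the minimum for numerical stability.
    """
    beta = 1.0 / (GAS_CONSTANT * temperature_k)
    e_min = min(free_energies_kcal)
    weights = [math.exp(-beta * (e - e_min)) for e in free_energies_kcal]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical molecules. Both share the same minimum free-energy
# structure (-10 kcal/mol), listed first.
rigid = occupancy([-10.0, -5.0, -4.0])           # alternatives far above the minimum
plastic = occupancy([-10.0, -9.5, -9.2, -9.0])   # several near-degenerate alternatives

print("rigid molecule, time in dominant structure:  ", round(rigid[0], 3))
print("plastic molecule, time in dominant structure:", round(plastic[0], 3))
```

Under these assumptions, the molecule with many low-lying alternative structures spends a visibly smaller fraction of its time in the dominant structure. If that structure is the optimal one, the difference is the kind of cost the text refers to.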
Phenotype-first and nurture over nature?
The material I have discussed so far speaks to the “phenotype first” perspective on evolutionary innovation taken by some researchers [555, 556, 592, 614, 615, 848]. This perspective is built on two general observations: first, phenotypic plasticity is pervasive; second, such plasticity often produces phenotypes that are pre-adapted for novel functions, as in the case of promiscuous proteins. In consequence, the argument goes, plasticity may be more important than genetic change in facilitating innovation.
Systems with clear genotype–phenotype relationships allow us to see that the “phenotype first–genotype first” dichotomy is a false dichotomy. To be sure, evolutionary innovations may first appear as (minor) phenotypes in a genotype’s spectrum of plastic phenotypes. From this point of view, the phenotype-first view is correct. However, the spectrum of plastic phenotypes a system can assume is determined by its genotype in the first place. This holds regardless of whether one considers molecular noise or external environmental change as the source of plasticity. From this perspective, the genotype-first view is correct. Which of these perspectives to choose is a matter of taste. Neither of them is wrong— they are complementary views of the same phenomenon. These observations also speak to the debate whether nature (genotype) or nurture (environment) is more important in determining the phenotypes of biological systems, be they molecular phenotypes or complex human behavioral phenotypes. The tension between nature and nurture plays an important role in understanding the origin of many traits, including human cognition and complex genetic diseases. In this context, a brief look at simple phenotypes like those of proteins is useful, because they allow us to see clearly how nature and nurture affect phenotypes. The structure and function of a protein are clearly influenced by thermal noise and by interactions with other molecules, such as different substrates or allosteric regulators. But this very spectrum of phenotypes is determined by the genotype (amino acid sequence), and how this genotype reacts to environmental change. From this perspective, it becomes clear that efforts to disentangle, and clearly separate, the effects of nature and nurture may be futile. Both are equally important in determining phenotype.
Phenotypic robustness can enhance plasticity
In the final section, I turn to the role of robustness in promoting or hindering phenotypic plasticity. By definition, robustness of a phenotype to environmental change is the opposite of phenotypic plasticity. Phenotypic robustness to such change thus reduces phenotypic plasticity. That much is
Figure 13.6 Phenotypic robustness can facilitate plasticity in evolving populations. [Plot. Horizontal axis: phenotypic robustness (low, medium, high). Vertical axis: number of different gene expression phenotypes formed through expression noise (axis range shown: 240 to 300).] The figure is based on populations of regulatory circuits (Chapter 3) that are in mutation–selection balance on the genotype network of a gene expression phenotype. The horizontal axis distinguishes three different kinds of phenotypes that differ in their robustness. Specifically, in phenotypes with low, medium, and high robustness, the distances between initial and equilibrium expression state of a circuit are d=0.5, d=0.25, and d=0.1, respectively (Chapter 3). The vertical axis shows the number of different new phenotypes that individuals in the population produce in response to gene expression noise. Such noise corresponds to perturbations in a circuit’s gene expression trajectory, that is, random changes in the expression state of individual genes during the gene expression dynamics that leads to a circuit’s steady state expression pattern. The observed gene expression phenotypes at the end of 100 independent such perturbed trajectories are recorded. More specifically, the vertical axis shows the number of unique phenotypes that are produced by all the individuals in a given population. That is, if the same phenotype is produced as a result of noisy expression dynamics by two different circuits in the population, it is counted only once. The phenotypes used in this analysis differ in their robustness to mutations, and thus in their genotype network size [222, 830]. Genotype network size, in turn, is proportional to the fraction d of genes that differ in their expression between a circuit’s initial expression state, imposed by upstream factors, and its final expression state, reached through regulatory interactions within the circuit (Chapter 3, [123, 222]). Data are based on circuits of 20 genes with approximately four regulatory interactions per gene, but similar observations hold for circuits of different size and architecture, for different kinds of gene expression noise, and for larger scale and systematic environmental changes. Each data point is based on 500 independent populations of 200 individuals each that evolved for 10^4 generations subject to selection for their phenotype, and a probability of μ=0.5 per generation that a mutation changes a regulatory circuit. Error bars correspond to standard errors of the mean over these 500 populations. Data from [222].
simple. However, in Chapter 8 we saw that phenotypic robustness to mutations can facilitate the evolutionary exploration of novel phenotypes through mutations. It is thus natural to ask whether such robustness might also promote exploration of novel phenotypes through environmental change. On a qualitative level, robustness is responsible for the existence of genotype networks (Chapter 8). And, as I argued above, a plastic system with
unchanging primary phenotype that explores a genotype network will form different alternative phenotypes. Robustness thus facilitates phenotypic plasticity, because a genotype network permits the exploration of more alternative phenotypes than if a system was confined to a single genotype. In addition to facilitating the exploration of novel phenotypes through mutations (Chapter 8), robustness thus also facilitates their exploration through environmental change.
On a more quantitative level, it is useful to first focus on the robustness of different genotypes to mutation. With possible exceptions [138, 496, 522], genotypes of molecules and regulatory circuits that are more robust to mutations are also more robust to environmental change [22, 68, 123, 441, 666, 682, 825]. In other words, robustness of a genotype to mutations reduces phenotypic plasticity. Next, I will turn to the robustness of phenotypes to mutation. Recall that the mutational robustness of a phenotype is proportional to the number of genotypes that form this phenotype (Chapter 8). It is given by the average number of neutral neighbors that each such genotype has. A population of evolving molecules that spreads through the genotype network of robust phenotypes can encounter more novel phenotypes in its neighborhood than a population spreading through a small network. This holds even though each individual in the population on the large network encounters fewer mutations in its neighborhood. The reason is that the population as a whole is genetically more diverse, because it spreads more rapidly if its genotype network is large. Taken together, the neighborhoods of its genotypes thus also contain more diverse new phenotypes. This greater diversity more than compensates for the reduced phenotypic diversity in the neighborhood of individual genotypes. Chapter 8 showed evidence supporting this phenomenon for RNA and protein molecules, and emphasized that it may not hold for all system classes. One can ask an analogous question for phenotypic plasticity: Can phenotypic robustness also promote phenotypic plasticity? Regulatory circuits are good study systems for this purpose, because their expression phenotypes can change both in response to gene expression noise, and in response to larger scale environmental changes. Figure 13.6 shows an answer to this question for the kinds of circuits of Chapter 3 [222]. 
The figure is based on populations of circuits confined to the genotype networks of three different classes of phenotypes that differ in their robustness (low, medium or high, horizontal axis). The populations have reached a balance between mutation, selection (maintaining their respective phenotypes), and genetic drift. The vertical axis shows the number of different expression phenotypes that gene expression noise can produce in these populations. The figure demonstrates that plasticity increases with phenotypic robustness. Thus, in this system at least, robustness can facilitate plasticity. This observation is largely independent of circuit details, and of whether plasticity arises through random gene expression noise or larger scale environmental changes [222]. The mechanism is essentially the same as that observed for phenotypic variability mediated by mutations [222]. This positive role of robustness for plasticity may seem surprising, given that for circuits like these, phenotypic robustness does not facilitate phenotypic variability through mutations (Chapter 8, [222]). Fundamentally, the reason is that the existing correlation between phenotypic variability in response to environmental change and to mutation is not very strong [22, 123, 222]. Thus, even mutationally robust systems can be phenotypically very plastic, and vice versa. We still have much to learn about the relationship between these two kinds of variability.
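The quantity plotted in Figure 13.6, the number of unique phenotypes that a whole population produces under gene expression noise, can be sketched in a few lines. The sketch below is a simplification under stated assumptions and is not the procedure of [222]: circuits are random weight matrices with sign-threshold dynamics, noise is an occasional random flip of one gene's expression state, and the population is a handful of unrelated random circuits instead of a population at mutation–selection balance on a genotype network. All function names and parameter values are hypothetical, except the 20 genes, roughly four interactions per gene, and 100 perturbed trajectories, which follow the figure caption.

```python
import random

def random_circuit(n_genes, rng, density=0.2):
    """Random weight matrix; w[i][j] is the regulatory effect of gene j on gene i.
    density=0.2 gives about four inputs per gene in a 20-gene circuit."""
    return [[rng.gauss(0.0, 1.0) if rng.random() < density else 0.0
             for _ in range(n_genes)] for _ in range(n_genes)]

def step(w, state):
    """One synchronous update: each gene becomes on (+1) or off (-1)."""
    new_state = []
    for i, row in enumerate(w):
        total = sum(w_ij * s_j for w_ij, s_j in zip(row, state))
        if total > 0:
            new_state.append(1)
        elif total < 0:
            new_state.append(-1)
        else:
            new_state.append(state[i])  # no net input: keep the current value
    return tuple(new_state)

def noisy_phenotype(w, initial, rng, flip_prob, steps=20):
    """Expression state reached after a trajectory perturbed by random flips."""
    state = tuple(initial)
    for _ in range(steps):
        state = step(w, state)
        if rng.random() < flip_prob:  # noise: flip one randomly chosen gene
            k = rng.randrange(len(state))
            state = state[:k] + (-state[k],) + state[k + 1:]
    for _ in range(steps):  # noise-free relaxation before recording
        state = step(w, state)
    return state

def population_repertoire(population, initial, rng, flip_prob=0.05, trials=100):
    """Set of unique phenotypes produced by all circuits in the population.
    A phenotype formed by two different circuits is counted only once."""
    seen = set()
    for w in population:
        for _ in range(trials):
            seen.add(noisy_phenotype(w, initial, rng, flip_prob))
    return seen

rng = random.Random(1)
n_genes = 20
initial = tuple(rng.choice((-1, 1)) for _ in range(n_genes))
population = [random_circuit(n_genes, rng) for _ in range(5)]
repertoire = population_repertoire(population, initial, rng)
print(len(repertoire), "unique phenotypes produced under noise")
```

Collecting phenotypes in a set mirrors the counting rule in the caption: the measure is the size of the union of all individual repertoires, which is why a genetically more diverse population can score higher even if each individual circuit is less plastic.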
Summary
Phenotypic plasticity is ubiquitous on all levels of biological organization. Because it is a primal feature of biological systems, we must examine its role in evolutionary innovation. When considering this role, two stages of the evolutionary process must be considered. The first is the origin of genotypes that can produce novel phenotypes as part of their plastic repertoire in changing environments. Genotype networks facilitate this origin. The reason is that they allow exploration of many genotypes with the same primary phenotype, but with an ever-changing repertoire of plastic phenotypes. The second stage is the stabilization of such novel phenotypes through processes such as genetic assimilation. Evidence is mounting that such assimilation may be frequent, which is not surprising, because plasticity is widespread. However, whether more plasticity is better, e.g., whether it accelerates the encounter and assimilation of novel phenotypes, depends on system details. The molecular systems I study allow us to appreciate that the “phenotype-first” and “genotype-first” scenarios of evolutionary innovation are false dichotomies, as is the “nature versus nurture” dichotomy. The role of phenotypic robustness to mutations in evolutionary innovation through environmental change is complex. Qualitatively, robustness is responsible for the existence of genotype networks, and thus facilitates innovation mediated by environmental change. Quantitatively, phenotypes that are highly robust to mutations may facilitate plasticity in evolving populations, but whether they do may depend on system details.
CHAPTER 14
Towards continuous genotype spaces
The discrete genotypes and phenotypes that dominate this book provide conceptual clarity. They help us see unifying principles in systems as different as metabolic networks and proteins. Many systems, however, are best viewed as having a continuous range of phenotypes. I have already discussed the continuous spectrum of conformations that molecules form through thermal noise (Chapter 13). Such continuity extends from this lowest level of organization to the highest levels. Just consider the many macroscopic forms of organisms that can often be continuously transformed into one another [773]. And continuously valued phenotypes are also abundant on an intermediate level: that of cellular circuitry. Below I will discuss examples from this intermediate level. They come from circuits in cell communication, gene regulation, and biological rhythms. As we will see, for such circuits, not only phenotypes, but also genotypes are often best represented in a continuous form. More than for organismal traits, whose relationship to genotype is often opaque, we can link the continuous phenotypes of such systems to their genotypes. Therefore, such systems are well-suited to explain the challenges that continuous phenotypes pose to understanding innovation. And because such systems form a bridge between molecules and macroscopic phenomena, what we learn from them may also apply to macroscopic innovations. Continuously valued phenotypes and genotypes may well obey principles analogous to those I discussed earlier for discrete systems. However, it is difficult to transfer these principles to continuous systems. Below, I will first discuss these difficulties, then I will discuss the small steps that have been made towards addressing them for cellular circuits
that function in cell biology and development [274, 301, 627, 823]. To understand the relationship between genotype and phenotype in such circuits for an entire space of possible genotypes is beyond current experimental technology. It requires mathematical modeling, based on a detailed understanding of experimental evidence. I will keep the resulting technicalities to an unavoidable minimum here, and refer to original papers for mathematical details. However, even the remaining level of detail may repel some readers. If you are such a reader, here is the briefest chapter summary: the little available work hints that two features crucial for evolutionary innovation also exist in continuous systems. That is, widely different genotypes can have the same phenotype, and the neighborhoods of different genotypes can contain very different phenotypes.
Conceptual problems
Before examining the problems arising in continuous systems, I will introduce a prototypical signaling circuit that will help develop the necessary concepts [13]. Such a circuit communicates an external signal to a cell’s nucleus. This signal contains information about a cell’s environment, often through the concentration of some chemical, e.g., a nutrient, a hormone, or perhaps a toxic compound. The signal is often best represented by a continuous range of (concentration) values. The signal is detected by a receptor in a cell membrane, which interacts with molecules inside the cell. This interaction triggers a cascade of intracellular interactions among multiple different molecules. These include protein phosphorylations, dephosphorylations, methylations, interactions with small molecules, such as ATP, and many others. At the end of this cascade
T O W A R D S C O N T I N U O U S G E N O T Y P E S PA C E S
stand some regulatory molecules whose concentration or activity changes in response to the external signal. Often, these molecules are transcriptional regulators that activate or repress a broad spectrum of genes. In such a system, the genotype includes the genes coding for all molecules in the signaling circuit. It also determines the rates at which all relevant molecular interactions occur, including the kinetic rates of enzymatic reactions, such as phosphorylation, and the rates at which molecules associate and dissociate. This genotype determines the circuit phenotype, the response of the regulators’ concentrations to the external signal. As long as the number of regulators in the cell is high enough, the phenotype can also assume an effectively continuous range of values. The mathematical models necessary to understand the relationship between genotype and phenotype typically take the form of differential equations that have multiple variables. These variables represent the molecules of the circuit, and how their activities or concentrations change over time. At least one of these variables represents the input signal, and at least one of them represents the circuit’s output, its phenotype. The rates at which these molecules interact are described by some number S of continuously valued biochemical parameters. These parameters represent the genotype of the model. Ultimately, of course, the genotype is a discrete DNA sequence. However, this DNA sequence can have so many variants that the parameters it encodes can vary effectively continuously. It is thus usually more expedient to represent the genotype directly by these parameters. In my discussion below, I will set aside the effects of environmental change and noise on phenotype—I already discussed them in Chapter 13—and assume that a single genotype produces a single phenotype. Doing so allows me to expose the challenges that continuous systems pose most clearly. 
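The idea that a vector of rate parameters plays the role of a genotype, and that the circuit's response to a signal is its phenotype, can be illustrated with a deliberately tiny sketch. It is not one of the models cited in this chapter: the single activation step, the names `k_on` and `k_off`, and the forward Euler integration are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Genotype:
    """Hypothetical two-parameter genotype of a toy signaling step."""
    k_on: float   # rate at which the signal activates the regulator
    k_off: float  # rate at which the active regulator is deactivated


def phenotype(g: Genotype, signal: float, total: float = 1.0,
              dt: float = 0.01, t_end: float = 200.0) -> float:
    """Integrate dx/dt = k_on*signal*(total - x) - k_off*x with forward Euler
    and return the active regulator concentration x at time t_end."""
    x = 0.0
    for _ in range(int(t_end / dt)):
        x += dt * (g.k_on * signal * (total - x) - g.k_off * x)
    return x


g = Genotype(k_on=2.0, k_off=1.0)
h = Genotype(k_on=4.0, k_off=2.0)  # a different genotype with the same rate ratio
for s in (0.1, 1.0, 10.0):
    print(s, round(phenotype(g, s), 4), round(phenotype(h, s), 4))
```

In this toy, the long-time response depends only on the ratio of the two rates, so the genotypes `g` and `h` differ in every parameter yet respond identically to each signal. Realistic circuits have many more variables and parameters, and their phenotypes generally cannot be read off so easily.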
In sum, systems like this have effectively continuous genotypes and phenotypes. I will now explain the problems with this feature using the two core properties I discussed earlier in discrete systems (Chapters 2–4): genotypes with the same phenotype form vast connected sets that nearly span genotype space, and different genotypic neighborhoods contain very different phenotypes.
Continuous genotype spaces
The genotype spaces we have encountered up to this point were discrete high-dimensional spaces with a huge but finite number of member genotypes. In the systems of this chapter, the analog of these spaces are continuous high-dimensional spaces of many parameters with uncountably many values (genotypes). In other words, we can no longer enumerate genotypes in these continuous spaces. In a continuous genotype space, the set of genotypes with the same phenotype comprises one or more continuous regions of parameter space. Finding out whether these regions form the analog of an extended genotype network corresponds to finding out whether these regions are connected and how far they reach through parameter space. (Two regions are connected if a continuous path exists between them, where each point or genotype on the path has the same phenotype.) For most systems of realistic complexity, the differential equations representing them cannot be solved analytically, so it is impossible to obtain mathematically rigorous answers to this problem. This leaves the numerical sampling of genotypes, which by itself raises huge challenges. Models of realistic complexity typically have many parameters (large system size S), and sampling high-dimensional parameter space becomes exponentially more difficult with increasing S. In addition, sampling has an even more fundamental problem: it can never prove that a set of genotypes is connected. To see this, just consider two sampled genotypes, i.e., two points in a high-dimensional parameter space, with the same phenotype. The simplest and most straightforward approach to find out whether they belong to the same connected set is this: Do all points on the straight line connecting these two points have the same phenotype? Unless you can solve the equations, you cannot answer this question rigorously. Also, this simple approach would fail if the set is connected but not convex.
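The straight-line test, and the way it can fail for a connected set, can be sketched in a few lines. The ring-shaped region used here is a purely hypothetical stand-in for "all genotypes with the same phenotype"; in a real analysis, `has_phenotype` would involve numerically integrating the circuit's equations.

```python
import math


def has_phenotype(p):
    """Toy phenotype test (assumption): True inside a ring with inner radius 1
    and outer radius 2. The ring is connected but not convex."""
    r = math.hypot(p[0], p[1])
    return 1.0 <= r <= 2.0


def line_test(a, b, n_samples=10):
    """Sample n_samples equidistant interior points on the segment a-b and
    report whether all of them have the phenotype."""
    for i in range(1, n_samples + 1):
        t = i / (n_samples + 1)
        p = tuple(x + t * (y - x) for x, y in zip(a, b))
        if not has_phenotype(p):
            return False
    return True


a, b = (-1.5, 0.0), (1.5, 0.0)  # two genotypes on opposite sides of the ring
c = (0.0, 1.5)                  # a third genotype on top of the ring

print(line_test(a, b))                   # the direct segment crosses the hole
print(line_test(a, c), line_test(c, b))  # the detour via c stays inside the ring
```

The direct test between `a` and `b` fails even though both points lie in one connected region, whereas the two-step path through `c` passes. Note also that a passed test is only suggestive: a finite number of sampled points can miss a thin gap between them.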
In a convex set of points, every two points can be connected by a straight line that lies entirely within the set. Connected sets of parameters need not be convex, but can have arbitrarily complex and twisted high-dimensional shapes, whose connectivity may be difficult to examine by sampling. A further problem concerns how mutations affect genotype. In the discrete systems I focused on earlier, mutations change one system part, such as a nucleotide. How do mutations change the biochemical parameters of a circuit? A mutation may affect one parameter a little or a lot, and it may even affect more than one parameter. For example, a single mutation may affect both the activity and the half-life of a protein. Representing genotype directly on the level of DNA might avoid this problem in principle, but is too unwieldy in practice. An additional problem is this. When should one call two phenotypes identical? In a continuous system, very small differences in genotypes may cause equally small changes in phenotypes. What if no two phenotypes are exactly identical? Clearly, very small phenotypic differences may be negligible, but how small is small enough? One might argue that, strictly speaking, only differences that are neutral and do not affect fitness are small enough. However, in practice the threshold for neutrality is usually unknown, difficult to evaluate, and depends on external factors such as population size (Chapter 7). Thus, in practice, one must make ad hoc decisions about the sameness of two phenotypes. The last two points lead towards the second important property of genotype space: the phenotypic diversity of different neighborhoods. In a discrete system, the meaning of a genotype’s neighborhood is crystal clear. It comprises all 1-mutant neighbors, a finite number of genotypes with a finite number of phenotypes. In a continuous system, the obvious generalization is a ball in parameter space of some radius around the genotype. But what should this radius be? The question cannot be answered without knowing how much a single mutation changes a genotype; and this question may not have a simple answer, as I argued above.
In addition, there may be uncountably many different phenotypes in a genotype’s continuous neighborhood. The question again arises how to compare them, and when to call two phenotypes identical. And how do we determine the diversity of phenotypes in different neighborhoods, if we cannot simply count them? These are all obstacles to characterizing systems with continuous genotypes and phenotypes. To overcome
them, new methods and concepts need to be developed. Unsurprisingly then, among many studies proposing mathematical models of cellular circuitry [e.g., 78, 211, 223, 240, 454, 487, 505, 513, 514, 645, 791], only relatively few characterize genotype space. I will discuss several pertinent examples below [274, 301, 627, 823]. They have a common feature, dictated by the problems listed above: while they all examine a continuous model of phenotypes and analyze its full, continuous dynamics, they discretize genotypes, phenotypes, or both, at some stage of their work, to help overcome these problems.
Connectedness of parameter space
My first example regards a bacterial cellular circuit that drives circadian (daily) rhythms in a clock-like fashion. These are 24-hour activity patterns that occur in many organisms, including animals, plants, fungi, and some bacteria [192, 198, 199, 306, 415]. The circuit I will focus on occurs in photosynthesizing cyanobacteria. It regulates gene expression according to light availability, and oscillates in synchrony with the daily light/dark cycle, which is its input signal. When it cycles incorrectly, bacterial fitness decreases [372, 373, 873]. At this circuit’s core is one of the simplest known biochemical oscillators. This oscillator consists of merely three proteins that are necessary and sufficient for sustained oscillations [547]. These proteins are called KaiA, KaiB and KaiC. In the presence of ATP and the other two proteins, KaiC can oscillate continuously between a highly phosphorylated and a weakly phosphorylated state. Molecular interactions between the three proteins are well characterized [209, 357, 374, 381, 533, 540, 665, 871]. KaiA catalyzes phosphorylation of KaiC, and also seems to inhibit its dephosphorylation. KaiB counteracts the action of KaiA when KaiC is highly phosphorylated [534, 871]. The circuit’s probable phenotypic output regulator is KaiC. It can bind DNA and regulate the expression of other genes [533]. Other proteins and molecular interactions, including transcriptional regulation, may also be important for the oscillation [374]. Several mathematical models describe this oscillator [127, 209, 217, 415, 510, 665, 801]. Their details are complex. To analyze, motivate, and compare them could easily fill a book. To keep the focus on my main purpose, I thus refer you to the original
literature for the mathematical details. Here, I focus only on an analysis pertinent to my purpose. In this analysis, we have characterized the genotype space of one circadian oscillator model [301, 665]. This model takes into account that KaiC gets phosphorylated at two sites in an ordered pattern, and that one of these phosphoproteins inhibits KaiA, which interacts with KaiB [600]. The model has four state variables that represent concentrations of different phosphorylated versions of KaiC, as well as of KaiA. Twelve parameters describe the interactions between the circuit molecules. We sampled more than 10^5 points (genotypes) from the 12-dimensional parameter space. Each parameter in this sample could vary in its value over a broad, 10^6-fold range [301]; 604 of these points gave the phenotype of interest, an oscillation in total KaiC concentration, with a period that deviated by no more than 10 percent from 24 hours. For my purpose, an important question is whether these 604 genotypes are part of a connected set in the continuous parameter space. To address this question, one can define a graph whose nodes consist of these points. An edge connects two points in this graph, if the genotypes lying on the line connecting them also have the same phenotype. To find out whether this may be the case, one can sample genotypes from this line and determine their phenotype. If all the sampled genotypes have the same phenotype, they may be part of a connected set. This approach can then be applied systematically to many or all pairs of points. If the set of genotypes with the same phenotype is connected, then the resulting graph must also be connected; that is, every genotype must be reachable from every other genotype via a path of connecting edges. An analysis based on 10 sampled genotypes per connecting line showed that the vast majority (98.7 percent) of sampled genotypes form a single connected graph. Figure 14.1 shows this graph.
Specifically, the figure shows a projection of parameter space onto its first two dimensions. Black and gray dots correspond to the examined genotypes. Lines correspond to edges between genotypes. Black dots correspond to eight genotypes that cannot be connected to any other genotype. This analysis provides a hint that an object analogous to a genotype network exists in this system. Specifically, a series of mutations
that individually cause small parameter changes might conspire to change the genotype dramatically, but they may leave the phenotype unchanged. This feature is not a peculiarity of this particular oscillator model. It also exists in a completely different model of the cyanobacterial oscillator [301, 510]. This analysis illustrates the challenges of working with a continuous system: unable to solve the system exactly, we need to sample from its genotype space. This is a form of discretization, as is the decision when to call two phenotypes (oscillations) sufficiently similar to be identical. Because we cannot solve this system exactly for any one (let alone all) of its possible parameters, such discretization is inevitable. It provides a hint but cannot prove that the set of genotypes with the same phenotype is connected.
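The graph-based procedure described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the published analysis: the toy function `has_phenotype` (a ring in two dimensions) stands in for integrating the 12-parameter oscillator model and checking that the period lies within 10 percent of 24 hours, and the twelve hand-placed points stand in for the sampled genotypes.

```python
import math
from itertools import combinations


def has_phenotype(p):
    # Toy criterion (assumption): points in the ring 1 <= r <= 2 "oscillate".
    return 1.0 <= math.hypot(*p) <= 2.0


def line_ok(a, b, n=10):
    # True if n equidistant interior points on the segment a-b all have the phenotype.
    return all(
        has_phenotype(tuple(x + i / (n + 1) * (y - x) for x, y in zip(a, b)))
        for i in range(1, n + 1)
    )


def largest_component_fraction(points):
    # Union-find over the sampled genotypes; edges come from the line test.
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if line_ok(points[i], points[j]):
            parent[find(i)] = find(j)

    sizes = {}
    for i in range(len(points)):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / len(points)


# Twelve hypothetical genotypes spread evenly around the ring.
sample = [(1.5 * math.cos(k * math.pi / 6), 1.5 * math.sin(k * math.pi / 6))
          for k in range(12)]
print(largest_component_fraction(sample))
```

Many pairs of points in this sample fail the direct line test, yet the graph as a whole is connected because nearby points pass it. As in the analysis above, a connected graph is a hint, not a proof, that the underlying continuous set is connected.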
Analyses of topological circuit variants
The preceding analysis examined continuous genotypic variation over many orders of magnitude in a high-dimensional continuous space of genotypes. The following example, from eukaryotic circadian oscillators, illustrates an approach different from the sampling procedure I just described. Specifically, it involves an additional form of discretization, that of identifying qualitatively different circuit architectures or topologies. Some mutations can completely abolish a molecule’s ability to interact with another molecule. Such mutations cause a change in a circuit’s topology, a qualitative change in the who-interacts-with-whom in the circuit. Such mutations effectively set one or more continuous parameters that represent this interaction to a value of zero. Complex circuits with multiple molecules can have many different molecular interactions, and thus also many possible ways in which mutations can change circuit topology. Put differently, such circuits have many topological variants. To examine different such topologies systematically is to examine different regions of a continuous parameter space, in which some parameters have a value that is equal to zero. This is a form of discretization. I note that the models it produces are different from simpler, truly discrete models of cellular circuits [14, 388, 561, 820]. The reason is that here, for any one circuit topology, both the remaining parameters and the phenotype can vary continuously, and the system’s full, continuous dynamics can be studied.
Figure 14.1 A set of continuous genotypes (parameters) yielding circadian oscillations forms a connected graph. [Plot: Parameter 1 on the horizontal axis, from 10^-4 to 10^2, and Parameter 2 on the vertical axis, from 10^-2 to 10^2, both on logarithmic scales.] For visualization, the 12-dimensional parameter space of the circadian oscillator model is shown as a projection onto its first two dimensions [665]. Small grey dots correspond to 604 sampled genotypes with the circadian oscillatory phenotype. Parameters outside the cloud of points shown do not produce circadian oscillations. Lines connect genotypes for which 10 points sampled equidistantly along the line (in the 12-dimensional parameter space) have the same phenotype. Larger black dots correspond to eight genotypes that cannot be connected to any other genotype. Parameters 1 and 2 indicated on the axes correspond to the rates [h^-1] at which a single and a doubly phosphorylated form of KaiC become dephosphorylated, respectively [301, 665]. From [301].
Eukaryotic circadian oscillators have been dissected in more species than the prokaryotic oscillator I discussed above [198, 306]. Their topology varies greatly among species, but three recurrent themes exist. First, the oscillator mechanism consists of one or more negative feedback loops that involve transcriptional regulation. Second, the oscillator mechanism is simple in principle, requiring minimally only one gene. This gene is expressed to produce an mRNA (R) and a protein product (Pr). The protein may be modified to a protein Pr’ that exerts direct or indirect negative feedback on the expression of its encoding gene [30, 726]. Mathematically, such an elementary oscillator can be effectively modeled by a set of differential equations called the Goodwin oscillator [280]. For the fungus Neurospora crassa, for instance, a Goodwin oscillator can correctly predict the oscillator’s response to temperature pulses, light pulses, and an inhibitor of protein synthesis [662–664]. An additional property of eukaryotic circadian oscillators is that they usually consist of more
than one oscillating gene product, and these gene products regulate each other transcriptionally, post-transcriptionally, or in both ways, depending on the organism [198, 306, 307]. Examples include circadian oscillators in the fungus Neurospora, the fruit fly Drosophila, and in mammals [36, 161, 174, 198, 306, 439, 516, 648, 697]. For instance, the Neurospora oscillator involves two key oscillatory proteins, FRQ and WC-1. WC-1 forms a heterodimeric transcriptional regulator jointly with WC-2, another (noncycling) protein. This transcriptional regulator positively regulates the expression of the frq gene. The protein product of the frq gene is called FRQ and interferes with the action of this heterodimeric (WC-1/WC-2) regulator. It thus exerts an indirect effect on its own expression [174, 516]. In addition, FRQ appears to promote the accumulation of WC-1 via a post-transcriptional mechanism [439]. Other organisms contain more than two oscillating regulators, which can vary greatly in how they regulate each other’s activity or expression [36, 161, 648, 697].
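For orientation, the elementary single-gene oscillator described above, with an mRNA R, a protein Pr, and a modified protein Pr’ that represses the gene's expression, can be written in one commonly used three-variable form of the Goodwin equations. The rate constants k_1 to k_6, the repression threshold K, and the Hill exponent n are notational assumptions introduced here, not symbols taken from the studies cited in this chapter, whose parameterizations may differ.

```latex
\begin{aligned}
\frac{dR}{dt} &= \frac{k_1}{1 + \left(\mathrm{Pr}'/K\right)^{n}} - k_2\,R,\\[4pt]
\frac{d\,\mathrm{Pr}}{dt} &= k_3\,R - k_4\,\mathrm{Pr},\\[4pt]
\frac{d\,\mathrm{Pr}'}{dt} &= k_5\,\mathrm{Pr} - k_6\,\mathrm{Pr}'.
\end{aligned}
```

The first equation carries the negative feedback: the more modified protein accumulates, the lower the rate of mRNA synthesis. In the language of this chapter, the values of the rate constants, K, and n constitute the genotype, and the resulting time course of the three concentrations is the phenotype.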
Even when just two oscillators are linked, circuits with multiple possible topologies can emerge, depending on how the circuit’s molecules regulate each other. For example, Figure 14.2a shows two linked Goodwin oscillators where only the six regulatory interactions indicated by dashed lines can vary [823]. A total number of 3^6 = 729 possible circuit topologies arises from qualitative variation in these interactions, depending on whether each variable interaction is activating (dashed arrow), repressing (dashed crossbar), or absent. Each topology can be represented as a different set of differential equations. These equations have between 10 and 16 parameters, depending on the topology. Most of them cannot be solved analytically. Their analysis thus requires numerical integration. When done for thousands of parameters sampled from genotype space, and for multiple topologies, such integration becomes computationally expensive. In the space of circuit topologies, one can define two topologies as neighbors if they differ in exactly one regulatory interaction (Figure 14.2b). Mutations that change parameters, such that only one interaction is added or eliminated, can transform neighboring topologies into one another. The genotype space thus defined resembles that of the transcriptional regulatory circuits we discussed earlier (Chapter 3). However, I note again an important difference: here, both the parameters and the phenotype of interest, a daily oscillation in regulatory molecules, can assume a continuous range of values. For any given allowable range of parameters, any one circuit topology may or may not be able to produce circadian oscillations. In an analysis of all 729 topologies, I found that 201 of the 729 topologies can produce circadian oscillations [823]. I based this assertion on a sample of more than 10^6 points in parameter space for each of the 729 topologies. Some of the 201 topologies differ in every single one of their regulatory interactions.
This means that circuits with widely different molecular interactions can have identical phenotypes. Figure 14.2c shows that viable topologies form a connected network of topologies. In this network, each topology has on average seven neighboring topologies that can also show circadian oscillations. Moreover, 75 percent of neighboring topologies yield oscillatory behaviors with parameters that are identical, except for the parameters defining the single regulatory interaction in which they differ [823]. This connectivity is not expected by chance alone. To see this, one can examine “random” graphs that consist of the same number of circuit topologies as a graph of oscillating topologies, but where these topologies can have arbitrary (not necessarily oscillatory) phenotypes. When studying many such random graphs, I found that they are typically not connected. Instead, they consist of multiple disconnected parts or components (Figure 14.2d). The connectedness of the graph formed by oscillating topologies emerges from the high number of neighbors each node has in it. As we saw earlier (Chapter 6), such a high number of “neutral” neighbors is a prerequisite for a set of genotypes to be connected. The circuit topology graph from Figure 14.2c is the analog of a genotype network that spans genotype space. It suggests that a series of modest genotypic changes in individual parameters (some of which are strong enough to change a circuit’s topology) can accumulate to radically change a circuit’s topology, while leaving its oscillatory phenotype intact. Again, my earlier caveats about the discretization of continuous spaces apply to these observations.
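The combinatorial side of this analysis, the space of 3^6 topologies, the neighbor relation, and the random-graph comparison, can be sketched without any differential equations. The sketch below makes two assumptions that should be kept in mind: it treats two topologies as neighbors whenever they differ at exactly one of the six variable interactions, and it does not determine which topologies oscillate (that would require the numerical integration described above). It only illustrates the null model of randomly chosen topologies; the number of random graphs (200) is chosen for speed and is smaller than in the published analysis.

```python
import random
from itertools import product

STATES = (-1, 0, 1)   # repressing, absent, activating
N_INTERACTIONS = 6    # the six variable regulatory interactions


def all_topologies():
    return list(product(STATES, repeat=N_INTERACTIONS))


def are_neighbors(a, b):
    # Assumption: neighbors differ at exactly one of the six interactions.
    return sum(x != y for x, y in zip(a, b)) == 1


def count_components(nodes):
    """Count connected components of the graph induced on `nodes`."""
    unseen = set(nodes)
    components = 0
    while unseen:
        components += 1
        stack = [unseen.pop()]
        while stack:
            current = stack.pop()
            linked = [n for n in unseen if are_neighbors(current, n)]
            unseen.difference_update(linked)
            stack.extend(linked)
    return components


topologies = all_topologies()
print(len(topologies))  # 3**6 topologies

# Null model: graphs of 47 randomly chosen topologies, irrespective of phenotype.
rng = random.Random(0)
random_counts = [count_components(rng.sample(topologies, 47)) for _ in range(200)]
print(min(random_counts), max(random_counts))
```

Under this neighbor definition each topology has 12 neighbors among 728 others, so a random set of 47 topologies contains few neighboring pairs and typically falls apart into many components. A connected graph of oscillating topologies is therefore not what chance alone would produce.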
Diverse genotypic neighborhoods in a developmental gene circuit
Thus far, my discussion has focused on the first of two system properties important for phenotypic innovation: the ability to travel far through genotype space without changing a phenotype. I will now turn to the second property, the phenotypic diversity of different genotypic neighborhoods. This property has not yet been explored for circadian oscillators, but it has been explored for a regulatory gene circuit in organismal development. It is the circuit responsible for forming the vulva of the nematode worm Caenorhabditis elegans [420]. The phenotype of this circuit is a pattern of developmental cell fates, identities acquired by those differentiating cells that will later form the worm’s vulva. C. elegans develops through a pattern of cell divisions that is stereotypically repeated among
Figure 14.2 Topologies of linked eukaryotic circadian oscillators form networks with the same phenotype. (a) The circles represent six molecular species, i.e., the concentrations of mRNA (Ri), protein (Pri), and modified protein (Pri’) encoded by two genes (i = 1, 2). The upper and lower solid vertical arrows represent translation and posttranslational modification, respectively, of these molecules. They are required in all circuit topologies. Dashed lines indicate transcriptional (vertical) or post-transcriptional (horizontal) regulation. Dashed lines terminated by an arrowhead (crossbar) indicate activating (repressive) regulation. (b) Two nodes of a circuit topology graph (large circles), where each node corresponds to one topology (drawings inside the circles). The circles are connected by an edge (horizontal line between large circles), because they are neighbors, i.e., they differ only in one interaction (bold arrow in right topology). (c) Structure of the circuit topology graph for all 201 topologies (circles) that yield circadian oscillations [823]. Large circles with light shading correspond to topologies where a large fraction of sampled parameters yield circadian oscillations. Neighboring topologies are connected by straight lines. The graph is connected, i.e., any two topologies can be reached from one another through a path of edges. (d) The figure shows the distributions of the number of components (groups of nodes connected to each other but to no other node; horizontal axis: number of components, vertical axis: number of random graphs) for 10^4 circuit topology graphs with 47 randomly chosen circuit topologies, regardless of their ability to show circadian oscillations. The arrow (labeled “Oscillator topology graph”) indicates the single connected component of the corresponding graph of those 47 circuit topologies where at least 1 in 100 sampled parameters yield circadian oscillations [823].
different individuals, an attractive feature for developmental biologists. The C. elegans vulva is part of the egg-laying apparatus, which consists of the uterus, the sex muscles, the vulva, and various neurons. The vulva itself forms from six vulval precursor cells named P3.p through P8.p that form a linear array (Figure 14.3). Specifically, the descendants of the vulval precursor cell P6.p will form the orifice of the vulva. They will also connect the vulva to the uterus.
This cell’s fate during development is also called the “primary” (1°) fate. The two “secondary” (2°) cells P5.p and P7.p lie adjacent to P6.p (Figure 14.3); their descendants will form the vulval lips. Descendants of the three remaining “tertiary” (3°) cells (P3.p, P4.p, and P8.p) eventually fuse with other cells and become part of the C. elegans epidermis. Adjacent to the vulval precursor cell P6.p lies the developing gonad. One of the gonad’s cells, the so-called anchor cell (“AC” in Figure 14.3) is important to form the vulva, because it sends a chemical signal to the vulval precursor cells that influences their fates. This signal is a peptide called LIN-3. The vulval precursor cells possess a receptor called LET-23 for this peptide [327, 734]. Precursor cells that receive most of this signal adopt the primary fate, whereas cells that receive increasingly smaller amounts adopt secondary and tertiary fates. In addition to this signaling between the anchor cell and the precursor cells, adjacent precursor cells also communicate with one another via a receptor called LIN-12 [290, 733] and its ligands. Both signaling processes are complex and involve multiple other proteins. Also, both signaling processes influence each other: lateral signaling between precursor cells via LIN-12 influences their responsiveness to the anchor cell’s signal; conversely, the LIN-3 signal
from the anchor cell influences lateral signaling via LIN-12 [273]. A model for vulval cell-fate specification that represents both signaling processes has been developed [273]. This model involves 12 differential equations with 9 parameters that exist in a continuous genotype space. The phenotype that this model produces is the discrete pattern of cell fates (3°3°2°1°2°3°) shown in Figure 14.3. To be sure, this is a discrete phenotype, but it is brought forth through a circuit whose parameters exist in a high-dimensional continuous genotype space. This is why I discuss it here. In developing worms, cell fate phenotypes can vary. Specifically, laboratory mutants of C. elegans and related species produce vulval precursor cells with multiple different cell fate patterns that help elucidate the signaling processes I just described [290, 735]. Examples include worms that do not produce LIN-12, and whose vulval precursor cells produce only 1° and 3° cell fate patterns; and mutants whose inductive signal is hyperactive and produces the phenotype 2°1°2°1°2°1°. In principle, independent combinations of each cell’s possible fate would allow 4096 different cell fate phenotypes [274]. Thus, this system is well-suited to study how complex phenotypic variation can arise as a function of genotypic variation. A study addressing this problem analyzed the phenotypes of more than 2×10^8 genotypes
Figure 14.3 Vulva precursor cells and the determination of their fates. [Diagram: the anchor cell (AC) above the row of precursor cells P3.p through P8.p, with basal and apical sides marked and the fates 3°, 3°, 2°, 1°, 2°, 3° indicated below the cells; legend: LIN-3, LIN-3 receptor LET-23, LIN-12 ligand, LIN-12.] The anchor cell (AC) in the C. elegans gonad lies adjacent to vulval precursor cell P6.p. The LIN-3 signal it produces influences the fate of vulva precursor cells P3.p through P8.p in a graded manner (shaded area). LIN-12 and its ligands mediate lateral communication between adjacent cells. Both processes together result in the indicated wild-type phenotype 3°3°2°1°2°3° of vulval cell fates. From [274].
Figure 14.4 Schematic illustration of the connectivity of TOR circuit genotypes in genotype space. [Diagram: panel (a) shows the core circuit; panel (b) shows the core together with variants labeled v1, v3, and v16.] (a) The core TOR circuit, based on available evidence and a recent experimentally validated model [427]. The large rectangle shows a process diagram [407] that encapsulates the circuit’s molecular interactions. Ellipses represent small molecules; rounded rectangles represent proteins (phosphorylated or not, large circles); boxes surrounding two or more molecules represent
sampled from parameter (genotype) space, such that individual parameters were allowed to vary by at least 10^5-fold [274]. It found that the genotypes in this sample formed more than 500 different phenotypes. Some of these phenotypes are formed by many genotypes, whereas others are formed by very few genotypes. The wild-type phenotype (3°3°2°1°2°3°) is among the phenotypes formed by a disproportionately large fraction of genotypes. The study’s authors also examined the fraction of each genotype’s neighbors (in the parameter sample) that have the same phenotype. They computed the average of this fraction for all genotypes with the same phenotype. For the wild-type phenotype, on average more than 70 percent of a genotype’s neighbors have the same phenotype. Thus, this phenotype is quite robust to parameter changes. But their most important observation is this: the neighborhoods of different genotypes with the wild-type phenotype contain different phenotypes [274]. This observation can account for data from experiments in C. elegans and two related species, C. briggsae and C. remanei. These species have the same wild-type cell fate pattern as C. elegans [234, 274]. However, their phenotype changes differently from that of C. elegans in response to changes in the anchor cell’s signal [234]. For example, if this signal is moderately reduced, C. elegans produces the phenotype 3°3°1°3°3° in precursor cells 4 through 8, whereas
C. remanei produces the different phenotype 3°2°3°2°3° [234]. Although the adaptive significance of many of these variant phenotypes is unknown (most may be deleterious), this analysis serves to demonstrate qualitatively a phenomenon that we encountered many times before, and that is central to innovation, namely that a mutant’s phenotype can strongly depend on a genotype’s location in genotype space before the mutation.
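How two genotypes with the same phenotype can have different phenotypes in their neighborhoods is easy to see in a toy model. The sketch below is not the vulval model of [273]: it ignores lateral signaling entirely and assumes, hypothetically, that a graded signal decays exponentially with distance from the anchor cell and that two fixed thresholds translate signal strength into the 1°, 2°, and 3° fates. The two-parameter genotype (signal amplitude and decay length), the threshold values, and the multiplicative neighborhood are all illustrative choices.

```python
import math
from itertools import product

WILD_TYPE = (3, 3, 2, 1, 2, 3)
DISTANCES = (3, 2, 1, 0, 1, 2)      # distance of P3.p..P8.p from the anchor cell
T_PRIMARY, T_SECONDARY = 1.0, 0.3   # hypothetical signal thresholds


def phenotype(genotype):
    """Map (amplitude, decay_length) to a tuple of six cell fates."""
    amplitude, decay_length = genotype
    fates = []
    for d in DISTANCES:
        s = amplitude * math.exp(-d / decay_length)
        fates.append(1 if s >= T_PRIMARY else 2 if s >= T_SECONDARY else 3)
    return tuple(fates)


def neighborhood_phenotypes(genotype, factors=(0.8, 1.0, 1.25)):
    """Phenotypes, other than the genotype's own, found when each parameter
    is scaled by one of the given factors."""
    own = phenotype(genotype)
    found = set()
    for scale in product(factors, repeat=len(genotype)):
        neighbor = tuple(p * f for p, f in zip(genotype, scale))
        found.add(phenotype(neighbor))
    found.discard(own)
    return found


g1 = (1.1, 1.5)   # weak, far-reaching signal
g2 = (3.0, 0.6)   # strong, short-range signal

print(phenotype(g1) == WILD_TYPE, phenotype(g2) == WILD_TYPE)
print(sorted(neighborhood_phenotypes(g1)))
print(sorted(neighborhood_phenotypes(g2)))
```

Both genotypes produce the wild-type pattern, but a modest parameter change near `g1` can abolish the primary fate of P6.p, which cannot happen near `g2`, where the signal amplitude stays well above the primary threshold. Which novel phenotypes a mutation can reach thus depends on where in genotype space the genotype sits.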
Diverse genotypic neighborhoods in a eukaryotic signaling circuit
My last example regards the eukaryotic TOR nutrient signaling circuit of the budding yeast Saccharomyces cerevisiae. As for the eukaryotic circadian oscillator, the analysis I will discuss rests on examination of many topological variants. TOR proteins are kinases that control the growth of proliferating yeast and mammalian cells in response to nutrients such as nitrogen [139, 351, 360, 469, 866]. Their name is an acronym derived from the observation that they are a target of rapamycin, an immunosuppressant and anti-cancer drug [320]. In yeast, two related TOR proteins, Tor1p and Tor2p [320], are involved in the sensing of nitrogen source quality. High-quality (good) sources of nitrogen activate these proteins, whereas poor-quality nitrogen sources deactivate them [139, 484]. Rapamycin mimics the effect of poor nitrogen sources and inhibits Tor1/2p. It is often used as an input signal to dissect the signaling circuit’s
molecular complexes; arrows and small rectangles represent transitions between states; filled small circles indicate complex formation; open small circles indicate catalysis. For example, the diagram indicates that the small molecule rapamycin (left) binds to the protein Fpr1p, and this complex then binds Tor1/2p to deactivate it. Uncomplexed Tor1/2p can promote the phosphorylation of Tip41p and Tap42p, an event that influences the formation of the type 2A phosphatases PP2A1 and PP2A2. This influence is exemplified by phosphorylated Tap42p, which can bind Sit4p, a subunit of a type 2A phosphatase, and can compete with the other subunit, Sapp, for formation of PP2A2. As shown by the various other interactions in the diagram, Tip41p and Tap42p not only promote the formation of the type 2A phosphatases, their modification is also influenced by these phosphatases. (b) A recent study examined 18 TOR circuit variants (v1, … v18) [627], five of which are indicated by the large circles from which lines protrude that indicate all 18 possible neighbors of a variant. The core model from (a) is common to all variants. Neighboring topologies are indicated by long black lines that connect the large circles, or by short stubs. In addition to the core, only three other variants are shown in detail (three large rectangles in (b)). Each rectangle indicates in dark grey shading how each variant deviates from the core. For example, in the upper left rectangle (variant 1), Tor1/2, in addition to its action in the core circuit, promotes the phosphorylation of Tip41 at a second site, whereas the type 2 phosphatases promote the dephosphorylation of Tip41p at this site.
function in laboratory studies. Tor1/2p influence nitrogen metabolism by modulating the so-called type 2A phosphatases [183, 369], which, in turn, interact with the transcription factor Gln3p [484]. This protein regulates the activity of genes involved in nitrogen metabolism. The phosphatases themselves consist of several polypeptides, including Sit4p, Sapp, Cdc55p, Tpd3p, Pph21p, and Pph22p [473, 884]. The interaction between Tor1/2p and type 2A phosphatases is indirect, and involves intermediate regulators, most notably two proteins called Tip41p and Tap42p [359, 360, 369]. The upper large rectangle of Figure 14.4 shows the interactions in this signaling circuit that are best understood experimentally. The known molecular interactions of the TOR circuit have been encapsulated in an experimentally validated mathematical model [427]. In this model, the concentrations of individual molecules or complexes change according to a system of differential equations. The parameters of these equations account for the different molecular interactions. Different parameter values correspond to different genotypes. The model distinguishes a core topology of the TOR circuit, consisting of molecular interactions and reactions that are especially well-understood, along with 18 topological extensions of this core. Each extension is a circuit variant that affects an elementary molecular interaction, and that is based on direct and indirect experimental evidence of this interaction’s occurrence. For example, variant 1 (Figure 14.4b) is based on the observation that Tip41p may be phosphorylated at multiple sites [359]. Variant 2 (not shown in the figure) reflects the possibility that the Tap42p–Pph21/22p complex can protect phosphoproteins from dephosphorylation [369]. Which of these variants occur may vary among yeast strains and species. In principle, the 18 variants would imply 2^18 model topologies. However, not all variants can occur independently of one another.
Once only independently variable topologies are counted, 7×10^4 possible independent model topologies remain. Between 24 and 56 differential equations, with between 24 and 117 parameters, are needed to represent them, depending on the variant. A signaling circuit’s phenotype comprises the concentrations of its signaling molecules or molecular complexes in response to environmental signals, such as rapamycin. The available model focuses on molecules and complexes for which relevant data are available [427]. Specifically, it defines a canonical or reference phenotype that incorporates information from 11 different experiments measuring the concentrations of these molecules. Examples include experiments measuring how rapamycin affects the concentration of the Tap42p/Pph21p/Pph22p complex (Figure 14.4), of the Tap42p/Sit4p complex, or of phosphorylated Tip41p [427]. The 11 experiments provide 11 independent phenotypic axes, each corresponding to the concentration of one molecule or complex. Together, they constitute an 11-dimensional phenotype. Because the molecules involved influence gene regulation downstream of the circuit, this phenotype has biological relevance. Although variation in this phenotype does not occur in the yeast strains on which these 11 experiments are based, it occurs in other strains. For example, in a strain carrying a specific mutant allele of the Pph21p phosphatase, Pph21p does not interact with Tap42p [836]; in some mutants affecting Sit4p, this protein no longer binds Tap42p [837]. Because available experimental phenotypic data are semiquantitative or qualitative, it is useful to discretize phenotype space [427]. Such discretization also aids the comparison of different phenotypes. Discretization can be achieved, for example, by examining, for each phenotypic axis separately, a single relevant scalar measurement, such as the concentration of a molecule at a particular time point after the addition of the signal rapamycin. One can then study this phenotypic axis for all model topologies, and assign to a model a value of “0” (low concentration) on this phenotypic axis if its phenotype is below the median value observed for all circuit genotypes, and a “1” (high concentration) if it is above the median. This discretization leads to 2^11 possible binary phenotypes [627].
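The median-based discretization described above can be illustrated with a short Python sketch. It is not the procedure of [627]: the function name `discretize`, the example values, and the decision to assign "0" to a value that lies exactly on the median (a case the text leaves open) are assumptions made for illustration only.

```python
from statistics import median

def discretize(phenotypes):
    """Turn continuous phenotype vectors into binary strings, axis by axis.

    phenotypes: one list of scalar measurements per circuit genotype;
    entry k is the measurement on phenotypic axis k.
    """
    n_axes = len(phenotypes[0])
    # Median of each phenotypic axis over all circuit genotypes.
    medians = [median(p[k] for p in phenotypes) for k in range(n_axes)]
    # "1" above the median, "0" otherwise (ties get "0": an assumption).
    return ["".join("1" if p[k] > medians[k] else "0" for k in range(n_axes))
            for p in phenotypes]

# Hypothetical measurements: four genotypes, three phenotypic axes.
example = [[0.1, 5.0, 2.0],
           [0.4, 1.0, 2.5],
           [0.9, 3.0, 0.5],
           [0.2, 4.0, 3.0]]
binary = discretize(example)   # ['010', '101', '100', '011']
```

With 11 axes instead of three, each string would have 11 characters, which gives the 2^11 = 2048 possible binary phenotypes mentioned in the text.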
To summarize what I have said thus far, the continuous, high-dimensional genotype space of the TOR signaling circuit can be partitioned into multiple subspaces, each of which corresponds to a different circuit topology. For each such topology, a set of parameters in the corresponding subspace corresponds to a genotype. The phenotype of any one circuit corresponds to the concentration of several
signaling molecules. Although this phenotype is high-dimensional and continuous, it is useful to discretize it. To analyze the relationship between genotype and phenotype in models this complex, sampling of many different parameters for each topology is no longer computationally feasible. Because of this limitation we used the following approach [627]. For each of the 7×10^4 topologies, we chose one parameter set that yields a behavior as close as possible to the reference signaling behavior, and then tested whether any of the properties discussed below are robust to perturbations in these parameters. We then analyzed the phenotypes of all 7×10^4 circuit topologies, and found the following properties [627]. First, circuit topologies with the same phenotype form sets whose size ranges over three orders of magnitude, from one to more than 4000 genotypes. Second, these genotype sets are more fragmented than in the other studies I discussed above, but some of the larger sets nearly traverse the space of possible topologies. This means that circuit genotypes (topologies) can be very different but still have the same phenotype. Third, genotypic neighborhoods are highly diverse. We studied this diversity with an approach identical to one I discussed earlier (Chapters 2–4). Specifically, we examined two circuit genotypes, G1 and G2, in the same genotype network, and thus with the same phenotype. Genotypes G1 and G2 differ in some number D of topological variants. We determined all phenotypes in the neighborhood of G1, i.e., for all genotypes that differ from G1 in one topological variant. We subsequently did the same for the neighborhood of G2. Then we calculated the fraction of phenotypes that occur only in one, but not in both neighborhoods. We found that even for genotypes that differ minimally (genotype distance D of one) typically 60 percent of phenotypes are unique to the neighborhood of one of the genotypes.
That is, these phenotypes do not occur in the neighborhood of the other genotype. This high proportion of unique phenotypes mirrors observations from the three system classes that I discussed earlier (Chapters 2–5). It is one of the facilitators of innovation.
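The comparison of two neighborhoods can be made concrete with a minimal Python sketch. The function name `unique_fraction` and the example phenotype strings are hypothetical. The sketch also assumes that the fraction is taken relative to all distinct phenotypes found in either neighborhood; the text does not spell out the denominator, nor whether the shared phenotype of G1 and G2 is counted.

```python
def unique_fraction(neighborhood_1, neighborhood_2):
    """Fraction of phenotypes that occur in only one of two neighborhoods.

    Each argument lists the (discretized) phenotypes of all one-variant
    neighbors of a genotype; duplicates are allowed and are ignored.
    """
    p1, p2 = set(neighborhood_1), set(neighborhood_2)
    all_phenotypes = p1 | p2
    if not all_phenotypes:
        return 0.0
    # The symmetric difference holds phenotypes seen near G1 or G2, not both.
    return len(p1 ^ p2) / len(all_phenotypes)

# Hypothetical neighborhoods of two genotypes with the same phenotype.
near_g1 = ["01", "11", "10", "01"]
near_g2 = ["11", "00"]
fraction = unique_fraction(near_g1, near_g2)   # 3 of 4 phenotypes: 0.75
```

A value near 0.6, as reported for genotypes at distance D = 1, would mean that most of the phenotypes reachable from one genotype cannot be reached in one step from the other.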
All of the observations I have just mentioned have to be taken with a grain of salt, because they are not based on extensive samples of parameter space for all 7×10^4 possible model topologies. Such sampling is currently infeasible. This is not only a limitation of this analysis, but one of the major challenges we need to overcome if we want to understand innovation in high-dimensional continuous genotype spaces.
Summary
The systems that I have discussed in this chapter are very different from each other. Their architecture ranges from simple for the cyanobacterial oscillator, to highly complex for the TOR signaling circuit; they exist in bacteria, unicellular eukaryotes, and multicellular organisms; and their phenotypes range from a molecular oscillation in bacteria, to the cell fate pattern of a multicellular organism. Their analyses are limited in scope, and need to use the crutch of discretization to analyze a continuous system. To understand such systems and their continuous genotype spaces better, several theoretical and methodological obstacles need to be overcome. Thus, our current knowledge of these circuits is tentative, and only provides a glimpse of the organization of their genotype space. This glimpse hints that, first, analogs of large and connected genotype sets exist in continuous systems. If so, evolution can change continuous circuit genotypes dramatically without changing phenotypes. Second, it hints that different circuit genotypes with the same phenotype can access very different novel phenotypes through mutation. Taken together, these two properties, if they exist more generally, permit exploration of many new circuit phenotypes, while allowing the preservation of old phenotypes. Cellular circuits exist at a level of organization intermediate between molecules and macroscopic traits of organisms, forming a bridge between the two. Because they drive the formation of most continuous macroscopic traits, they are also important for innovation in such traits. The properties hinted at here, if prevalent in biochemical circuitry, would thus also facilitate evolutionary innovations in macroscopic traits.
CHAPTER 15
Evolvable technology and innovation
Genotype networks and their diverse neighborhoods are necessary to understand the spectacular record of innovations that life has produced. Any human inventor or engineer who knows about this record could not help but be awed. Genotype networks exist in very different kinds of biological systems, such as metabolic networks and macromolecules. Their transcendence of one system class suggests that they may also exist outside biology. For example, an analog of them may already exist, unappreciated, in some technological systems; or perhaps technological systems could be designed whose organization is similar to that of genotype spaces in biological systems. If we could shape novel technologies in the right way and improve our understanding of existing technologies, we might obtain access to important principles of biological innovation. These principles might permit innovations whose impact and magnitude rival those found in nature. Here, I will first discuss some key differences between biological and many technological systems, as well as prominent existing approaches to incorporate evolutionary principles into technological problem solving. I will then demonstrate that an analog of genotype networks, with many similar properties, indeed exists in technological systems. To this end, I will analyze a particular class of electronic circuitry in greater detail. This circuitry shows some properties that are strikingly similar to those of biological systems. I will discuss what this observation implies for designing robust yet functionally rich circuitry. In the course of this discussion, I will also revisit another property of biological systems, and show that technological systems may share it: innovability comes at the price of system complexity.
Evolutionary approaches in technology
Aside from their different materials, biological systems differ in various respects from many technological systems, such as electronic circuits, industrial robots, or power plants. We tend to think of technological systems as products of rational design, at least more so than biological systems, which are products of natural selection. Many technological systems are also fragile to changes in their internal structure. Failure of their parts often causes catastrophic failure of system performance. Such lack of robustness or “fault tolerance,” as it is called in engineering, makes an important principle of biological innovation inaccessible: random change (paired with selection of suitable variants) is not necessarily an effective strategy for system improvement, because the effects of such change are typically too destructive. Researchers have appreciated these differences from biological systems for many years [636]. They sought ways to design technological systems that circumvent these problems. One notable example is evolutionary algorithms, including genetic algorithms [276, 333, 526] and genetic programming [42, 422]. Their goal is to create computer programs that solve complex problems, especially difficult optimization problems, using principles borrowed from biology. For example, in genetic algorithms a “genotype” may correspond to a string of bits that represents a (good or bad) candidate solution for an optimization problem. This solution corresponds to a “phenotype.” Genotype strings can be changed through “mutation” of their entries and “recombination” between strings. If one subjects entire populations of strings to repeated “generations” of such change, and to selection of the best solutions among them, genotypes that embody very good solutions to any given problem may arise.
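The ingredients of a genetic algorithm named above (bit-string genotypes, mutation, recombination, generations, and selection) can be sketched in a few lines of Python. This is a generic toy example and not a method taken from the cited literature. The fitness function (counting 1-bits), truncation selection, elitism, and all parameter values are assumptions chosen for brevity.

```python
import random

def fitness(bits):
    # Toy score of a candidate solution: the number of 1-bits.
    return sum(bits)

def mutate(bits, rate, rng):
    # "Mutation": flip each entry independently with probability `rate`.
    return [b ^ 1 if rng.random() < rate else b for b in bits]

def recombine(a, b, rng):
    # "Recombination": one-point crossover between two genotype strings.
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def evolve(n_bits=32, pop_size=40, generations=60, rate=0.02, seed=1):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []                          # best fitness in each generation
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        parents = pop[: pop_size // 2]    # selection: keep the better half
        children = [mutate(recombine(rng.choice(parents),
                                     rng.choice(parents), rng), rate, rng)
                    for _ in range(pop_size - 1)]
        pop = [pop[0]] + children         # elitism: best string is kept as is
    pop.sort(key=fitness, reverse=True)
    history.append(fitness(pop[0]))
    return pop[0], history

best, history = evolve()
```

Because the best string of each generation is carried over unchanged, the entries of `history` can never decrease. Whether the search reaches the optimum depends on the parameters and the random seed.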
Today, the design or manufacture of most complex man-made objects involves computers and the software that operates them. Not surprisingly then, evolutionary algorithms, albeit often thought of as software technology, can interface with the design of such objects. An especially active area in this regard is that of reconfigurable hardware. This is electronic circuitry whose internal wiring can be rapidly changed by a user to serve new functions. Random changes in the configuration of such hardware can be combined with selection towards a desired function. Hardware suitable for such an evolutionarily inspired search is also called evolvable hardware [291, 769]. One can search for circuit configurations that create new functionality in such hardware using computer programs that simulate a circuit’s behavior, and with evolutionary algorithms that prescribe how the circuit’s (simulated) configuration should be changed. In this case, one implements only the final search result in hardware. Alternatively, one can implement each circuit configuration during the search directly in hardware. This last strategy is facilitated by hardware that can not only be reconfigured, but that can be reconfigured very rapidly [291]. In evolutionary algorithms and evolvable hardware, neutral change is change in a program’s or circuit’s architecture that does not affect its function. Some existing work shows that evolutionary approaches can help identify circuits where many changes are neutral [312–314, 394]; that is, they can help design fault-tolerant systems. Other existing work also explores the effect of neutral change on the ability to evolve new functions. With possible exceptions [132, 713], neutral change facilitates the search for new functions [40, 41, 45, 132, 520, 803, 880–882]. This observation hints that an important qualitative principle that facilitates biological innovation may play similar roles in these technological systems. 
To my knowledge, however, no existing systematic analysis of a technological system studies the central properties essential to biological innovation that I have discussed here: the connectedness and large diameter of genotype networks, the phenotypic diversity of genotype neighborhoods, and how these features depend on different phenotypes. I will next turn to a class of system amenable to such an analysis [628].
The space of digital logic circuits and their functions
The class of systems I will focus on is electronic circuits that compute digital logic functions. That is, they take two or more binary (Boolean) variables as inputs, and compute one or more binary outputs from them. Digital logic functions are what every contemporary computer’s central processing unit (CPU) computes. More generally, they are at the heart of digital computation. Without such computation, post-industrialized societies would look very different, or they might never have arisen at all. The importance of such functions thus cannot be overstated. The computational abilities of traditional digital logic circuits are hardwired into a circuit and specific to the circuit’s designated application. As I mentioned above, more recently circuitry has become available that can be reconfigured, either once or any number of times. I will briefly discuss an especially versatile kind of reconfigurable circuitry, a field-programmable gate array (FPGA) [38]. The analysis I describe below was designed with such circuitry in mind. The “gates” of an FPGA are simple computational devices, each of which performs a logical operation (logical AND, OR, NOT, etc.), on one or more logic inputs, and produces a single output. In digital electronics, such gates are typically implemented by transistors. Multiple logic gates are interconnected to calculate specific logic functions of multiple inputs. What makes an FPGA reconfigurable is that the interconnections and/or the functions computed by each gate can be changed by a user “in the field,” hence the term field-programmable. FPGAs are widely used in various high-performance computing applications, such as database searching, image processing, and digital signal processing. Their use amounts to having dedicated hardware perform complex computations at speeds much faster than achievable with software on a conventional general-purpose computer.
Commercially available FPGAs are complex, and may comprise more than 10^6 gates. The functional versatility of such circuits comes at a price, namely that hidden logic is needed to permit reconfiguration. This hidden logic causes increased prices compared to hard-wired chips, reduces computing speed, and decreases energy efficiency [38]. Despite these limitations, FPGAs are of great economic
importance, with an estimated 2005 worldwide market exceeding US$3.2 billion [1]. I will here analyze a computational implementation of a digital logic circuit modeled after FPGAs. Despite being vastly simpler than commercial FPGAs, the kind of circuitry I will discuss is complex enough to allow an astronomical number of circuits and logic functions to be computed. At the same time, it is simple enough to be tractable for my purpose. Figure 15.1a shows a specific example of this circuitry, and Figure 15.1b shows its general layout. It consists of a rectangular array of logic gates, each of which can compute one of the five most common two-input logic functions AND, OR, XOR, NAND, NOR. The circuit has some number I of binary inputs. Each logic gate can receive one of these inputs. In addition, each logic gate can receive the output of a logic gate to the left of it as one of its inputs. Finally, the output of each logic gate can become one of O circuit outputs. Clearly, this general architecture allows a large number of circuits, depending on how inputs are wired to gates, how gates are wired to each other, which function each gate computes, and which gates connect directly to the circuit output. To determine the number of possible circuits in a system like this, it is useful to use a simplified, numerical representation that encapsulates the architecture of a circuit like that of Figure 15.1a. In this representation, each circuit input and each logic gate are assigned an integer identifying them uniquely (ref. 628 contains details). The circuit’s configuration can then be represented by a string of integers. Specifically, each logic gate Lij (Figure 15.1b) is assigned a triplet of integers. The first two integers represent the inputs of the gate, whether they come from another gate or directly from one of the circuit inputs. The third represents the logic function the gate computes.
At the end of this string, which has three times as many digits as the number of gates, stand O integers that indicate which gates map onto which output bit. With this representation one can enumerate the numbers of circuits and show that even small circuits can have many configurations. For example, a circuit with merely 16 gates (organized in 4 rows and 4 columns) has 9×10^46 possible configurations. I will mostly focus on circuits of this size with I=4 input bits and O=4 output bits here. However, most of what I will say also applies to circuits of different size [628]. I will refer to the set of all possible circuit configurations (or more briefly, circuits) as a circuit space. It is the analog of genotype space. I will call two circuits neighbors in this space if they differ by a single, smallest possible (“elementary”) change in their organization. Such an elementary change can affect a circuit’s internal wiring or the function a gate computes (Figure 15.2). Recall that both kinds of changes are possible in reconfigurable hardware. A circuit’s neighborhood consists of all its neighbors. I define the distance of two circuits in circuit space as the number of such elementary changes needed to transform one circuit into the other. The maximum distance is 58 changes for the circuits with 16 gates I discuss in detail here. Different circuits will generally also compute different Boolean functions. There are 2^(2^I × O) = 2^64 ≈ 1.8 × 10^19 different Boolean functions involving merely four inputs and four outputs [628]. These functions are the analog of phenotypes. We do not know whether circuits of any given number of gates can compute all of these functions, but they will certainly be able to compute many logic functions. I note that there are more circuits (9×10^46) than functions (1.8×10^19), thus permitting, at least in principle, the existence of multiple circuits that compute the same function. These large numbers of “genotypes” and “phenotypes” show that the high complexity of this system is comparable to that of biological systems. I
Logic functions differ in the number of circuits that compute them
A space of 10^46 circuits cannot be explored exhaustively. It needs to be characterized by sampling to reveal its generic properties. To this end, we uniformly sampled 20 million circuits at random from circuit space, and determined the logic functions they compute. Figure 15.3a shows a rank plot, where Boolean functions are ranked according to the number of circuits that compute them in this sample of 2×10^7 circuits. The horizontal axis shows the rank of each function, and the vertical axis shows the fraction of circuits in the sample that corresponds to each rank. I will refer to this fraction as the frequency of the function. I will call a function frequent if this frequency is high, and rare otherwise.
[Figure 15.1: panel (a), an example circuit with its inputs, outputs, and the five gate symbols (OR, AND, XOR, NAND, NOR), each with inputs A, B and output Y; panel (b), the general layout of gates L11 to Lmn with I inputs and O outputs.]
Figure 15.1 A circuit example, and the general circuit architecture. Panel (a) shows an example of the kind of circuit I analyze here. This example is a small circuit comprising only four (2×2) logic gates. The right side of the panel shows five symbols that correspond to the five kinds of permissible logic gates in these circuits. For example, the topmost OR gate produces a binary output bit of Y=1 if at least one of its binary input bits A, B is equal to one. It produces an output of Y=0 otherwise. Panel (b) shows the general layout of the circuits I study in this chapter. Each filled box represents a logic gate Lij. A circuit has m columns of such gates, each comprising n gates, such that 1≤i≤n, and 1≤j≤m. Each gate has two binary inputs and one binary output. Gates connect to each other in a feed-forward manner, that is, every gate can receive inputs only from gates in one of the columns to the left of it. This means that the robustness of these circuits does not depend on elaborate feedback structures. The entire circuit has I binary inputs, on which the gates perform a computation. The circuit has O binary outputs, which contain the result of the computation the circuit performs on the I inputs. Each gate can receive inputs from any of the I circuit input bits. The output of each gate can be connected to any of the O circuit outputs. A numerical representation of circuit architecture similar to that used in Cartesian genetic programming is useful for the exploration of circuit space [521]. For most of the work discussed here I=O=m=n=4. From [628].
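A small Python sketch may help to make the numerical representation and the feed-forward architecture concrete. It follows the description in the text only loosely, and the details are assumptions: signals are numbered with the circuit inputs first and the gates afterwards (column by column), gate functions are given by name rather than by integer, and the example circuit (a half adder on two inputs) is hypothetical and is not the circuit of Figure 15.1a. The exact encoding used in [628] may differ.

```python
from itertools import product

# The five permissible two-input gate functions.
GATES = {
    "AND":  lambda a, b: a & b,
    "OR":   lambda a, b: a | b,
    "XOR":  lambda a, b: a ^ b,
    "NAND": lambda a, b: 1 - (a & b),
    "NOR":  lambda a, b: 1 - (a | b),
}

def is_feed_forward(gates, n_inputs, n_rows):
    # A gate in column c may read circuit inputs and gates in columns < c.
    for g, (a, b, _) in enumerate(gates):
        limit = n_inputs + (g // n_rows) * n_rows
        if not (0 <= a < limit and 0 <= b < limit):
            return False
    return True

def evaluate(gates, outputs, inputs):
    # values[k] is signal k: first the circuit inputs, then each gate output.
    values = list(inputs)
    for a, b, name in gates:
        values.append(GATES[name](values[a], values[b]))
    return tuple(values[o] for o in outputs)

def truth_table(gates, outputs, n_inputs):
    # The logic function ("phenotype"): outputs for every input combination.
    return tuple(evaluate(gates, outputs, bits)
                 for bits in product((0, 1), repeat=n_inputs))

def n_functions(n_inputs, n_outputs):
    # Number of Boolean functions with the given numbers of inputs and outputs.
    return 2 ** (n_outputs * 2 ** n_inputs)

# Hypothetical 2x2 circuit on inputs 0 and 1; gates are signals 2, 3, 4, 5.
half_adder_gates = [(0, 1, "XOR"), (0, 1, "AND"), (2, 3, "OR"), (2, 3, "NOR")]
half_adder_outputs = [2, 3]          # sum bit and carry bit
table = truth_table(half_adder_gates, half_adder_outputs, 2)
```

Here `table` lists the (sum, carry) pair for the inputs 00, 01, 10, and 11. The two gates in the second column do not feed any output, which illustrates how parts of a circuit can be changed without changing the computed function. For four inputs and four outputs, `n_functions(4, 4)` evaluates to 2^64.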
[Figure 15.2: circuit C1 and the circuits C2 to C6 discussed in the caption, each drawn with its inputs, gates, and outputs.]
Figure 15.2 Neighboring circuits differ in one of several kinds of elementary circuit change. The circuit C1 (circled by a thick black ellipse) has multiple neighbors in circuit space, four of which are shown here. They are connected to C1 by a straight line. These four circuits illustrate the kind of elementary wiring changes that distinguish neighbors in circuit space. The changes are highlighted through shaded regions in each neighbor. Circuit C2 differs from C1 in its internal wiring; C3 differs in the logic function computed by the upper left gate; C5 differs in the input of the lower left gate; and C6 differs in that it has the lower-right gate (instead of the lower-left gate) connected to the upper output. Circuit C4, finally, is not an immediate neighbor of C1, because it differs in two instead of one elementary change from C1. It is, however, a neighbor of C3 and C5.
[Figure 15.3: panel (a), frequency of logic function (vertical axis, 10^-8 to 10^-4) against rank of logic function, ordered by decreasing size of circuit set (horizontal axis, 10^0 to 10^8); panel (b), number of functions (vertical axis, 10^0 to 10^3) against maximal circuit distance (horizontal axis, 0 to 1).]
Figure 15.3 (a) Logic functions differ dramatically in the number of circuits computing them. The data shown are based on a random sample of 2×10^7 circuits from the space of sixteen-gate circuits. Logic functions computed by circuits in this sample are ranked according to the number of circuits that compute these functions. A function has the highest rank of one if it is computed by the largest number of circuits in the sample. The vertical axis shows the frequency of each function in the sample, which is defined as the number of circuits computing it, divided by the sample size. Note the double-logarithmic scale. The flat “tail” of the rank histogram indicates that the vast majority of functions are computed by only one circuit in the sample. (b) Neutral networks of circuits that compute the same function extend far through circuit space. The figure shows, for 16-gate circuits computing 1000 logic functions, the distributions of the maximum circuit distance from a starting circuit at the end of a function-preserving random walk of 2000 steps. The distance is expressed as a fraction of circuit space diameter. The mean distance exceeds 0.97. It is similarly high for circuits of all sizes. After [628].
The plot clearly shows a phenomenon that we saw earlier in biological systems: Different functions (phenotypes) are computed by very different numbers of circuits (genotypes). In other words, the frequencies of different phenotypes vary dramatically. For functions computed by multiple circuits in our sample, this frequency indicates what fraction of circuits in circuit space compute this function. I will call the circuits computing a given function the function’s circuit set, in analogy to genotype sets. For example, a function occurring at an intermediate frequency of 10^-6 in our sample would be computed by approximately 10^46 × 10^-6 = 10^40 circuits in circuit space. In these large numbers, we encounter another feature I emphasized earlier: because of the vast size of circuit (genotype) space, it is no contradiction that circuit sets can be very large, yet occupy a tiny fraction of circuit space. A final feature of this analysis becomes obvious if one observes that the horizontal axis in Figure 15.3a is displayed on a logarithmic scale. The functions with a rank lower than 10^6 (that is, those ranked beyond the first million) occupy little space on this axis, but they comprise the majority of functions we encountered in our sample. Specifically, we found 1.74×10^7 different functions in our sample of 2×10^7 circuits, and almost 93 percent of functions were computed by only one circuit. I note that because circuit space is much larger than our sample, even most of these rare functions may be computed by an astronomical number of circuits.
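The sampling procedure behind such a rank plot can be sketched in Python as follows. The sketch is illustrative and makes several assumptions: the circuit encoding (each gate reads any circuit input or any gate in a column to its left, and each output is wired to one gate) is my own simplification and need not match the encoding of [628], the function names are hypothetical, and the sample is tiny compared with the 2×10^7 circuits of the actual analysis.

```python
import random
from collections import Counter
from itertools import product

# AND, OR, XOR, NAND, NOR as functions of two bits.
OPS = [lambda a, b: a & b, lambda a, b: a | b, lambda a, b: a ^ b,
       lambda a, b: 1 - (a & b), lambda a, b: 1 - (a | b)]

def random_circuit(rng, n_inputs, n_outputs, rows, cols):
    # Draw every wiring choice and gate function uniformly and independently.
    gates = []
    for c in range(cols):
        n_sources = n_inputs + c * rows   # inputs plus gates further left
        for _ in range(rows):
            gates.append((rng.randrange(n_sources), rng.randrange(n_sources),
                          rng.randrange(len(OPS))))
    outputs = [n_inputs + rng.randrange(rows * cols) for _ in range(n_outputs)]
    return gates, outputs

def function_of(circuit, n_inputs):
    # Truth table of the circuit, used as a hashable label of its function.
    gates, outputs = circuit
    table = []
    for bits in product((0, 1), repeat=n_inputs):
        values = list(bits)
        for a, b, f in gates:
            values.append(OPS[f](values[a], values[b]))
        table.append(tuple(values[o] for o in outputs))
    return tuple(table)

def rank_frequencies(n_samples, seed=0, n_inputs=4, n_outputs=4, rows=4, cols=4):
    rng = random.Random(seed)
    counts = Counter(
        function_of(random_circuit(rng, n_inputs, n_outputs, rows, cols), n_inputs)
        for _ in range(n_samples))
    # Frequencies in order of rank: the most frequent function comes first.
    return [count / n_samples for _, count in counts.most_common()]

freqs = rank_frequencies(2000)
singletons = sum(1 for f in freqs if f == 1 / 2000)
```

Plotting `freqs` against rank on doubly logarithmic axes gives the analog of Figure 15.3a for this toy sample, and `singletons` counts the functions computed by exactly one sampled circuit. The numbers obtained depend on the assumed encoding and on the sample size, so they should not be expected to reproduce the values reported in the text.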
Neutral networks in circuit space
Having established that logic functions can often be computed by multiple circuits, we next turned to the question whether the circuits computing any one function are connected in circuit space. To this end, we chose 750 functions that were computed by more than one circuit in our sample. For each of these functions, we then took pairs of circuits computing them, and tried to “walk” from one to the other circuit through a sequence of elementary circuit changes (Figure 15.2) that leave the computed function unchanged. Using this approach we found that for every single function, and for every single circuit pair we examined, the pair can be connected in this way. We thus established for each of these functions that the circuits computing a given function in our sample are part of a single connected circuit network (neutral network). Because these properties may be peculiarities of the frequent functions that dominate a finite circuit sample, we also examined two logic functions that are of practical importance but that did not occur in our circuit sample, and are thus rarer than many functions in this sample. The first of these two functions is the circular shift function. It is important in cryptographic algorithms and computes a circular permutation of a bit sequence. That is, if ijkl is a bit sequence (where i, j, k, and l stand for arbitrary bits), the function’s output is the sequence lijk. The second function is the right-shift function, which amounts to a division by two. Its output for the sequence ijkl is 0ijk. Standard central processing units (CPUs), such as the widespread Intel x86, have specific instructions to compute both functions, illustrating their general importance in digital logic. To study the circuit sets for each of these two functions, we generated 100 independent circuits that compute each function. We did so via 100 random walks through circuit space that started out from 100 different randomly chosen circuits. Each random walk continued until a circuit computing the function had been identified [628]. Analogously to the approach just described, we then studied whether pairs of such circuits can be connected in circuit space. We found, for each of the two functions, that a single connected neutral network includes all the circuits we examined. We next turned to the question of how far such neutral networks extend in circuit space. To this end, we carried out a random walk through circuit space that started from a single circuit. Each step in this random walk amounted to one elementary change and was required to preserve the circuit function. We recorded the maximum circuit distance to the starting circuit that this random walk reached after 2000 steps.
Figure 15.3b shows the distribution of this distance for 1000 starting circuits computing 1000 different functions. The distance is expressed as a fraction of circuit space diameter. Its mean is greater than D=0.98, its median equal to D=1. This means that the majority of random walks reached a circuit distance of 1, corresponding to circuits that are maximally different from the starting
circuit. In general, the maximum distance reached in these random walks did not depend strongly on the frequency of any given logic function in a circuit sample. Circuits computing the rarer right-shift and circular shift functions had similarly extended neutral networks [628]. This distance also did not depend strongly on circuit size, but increased only moderately from D=0.97 for 9-gate circuits to D=0.98 for 36-gate circuits. These large distances of circuits in the same neutral network mean that two circuits computing the same function can have dramatically different architectures. Figure 15.4a provides an example. The two circuits shown both compute the circular shift function, but the circuits are completely different in architecture. A careful comparison of the two circuits would show that each input maps to a different gate, each gate computes a different function, and each output connects to a different gate. Nonetheless, these circuits belong to the same connected network in circuit space. The existence of extended neutral networks requires that individual circuits have multiple neutral neighbors, neighbors that compute the same function. This is indeed the case. Figure 15.4b shows an example, the distribution of the fraction of neutral neighbors for 100 circuits computing the circular shift function, and for various circuit sizes. Two features are worth highlighting. First, the mean fraction of neutral neighbors is high. Depending on circuit size, it ranges between values greater than 0.4 and 0.7. Second, the distribution of the fraction of neutral neighbors is broad. Circuit robustness thus varies widely within the same neutral network. In the parlance of electronics, some circuits on the same neutral network are much more fault tolerant than others. A more general observation—not shown in the figure—is that these properties extend to a broad array of other functions, and depend only modestly on function frequency [628]. 
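The fraction of neutral neighbors can be made concrete with a toy model. The sketch below is an assumption-laden stand-in, not the circuit model of [628]: it assumes a feed-forward circuit with four inputs, two-input gates drawn from five gate types, and four outputs, and it treats a change of one gate type or one wire as an elementary change. All names (`OPS`, `truth_table`, `neighbors`, `neutral_fraction`) are hypothetical.

```python
from itertools import product

N_IN = 4  # number of circuit inputs (assumed)
OPS = {   # assumed set of gate types
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "NAND": lambda a, b: 1 - (a & b),
    "NOR": lambda a, b: 1 - (a | b),
}


def truth_table(gates, outputs):
    """Output bits for all 16 input combinations (the circuit's function)."""
    table = []
    for inputs in product((0, 1), repeat=N_IN):
        values = list(inputs)
        for op, a, b in gates:
            values.append(OPS[op](values[a], values[b]))
        table.append(tuple(values[o] for o in outputs))
    return tuple(table)


def neighbors(gates, outputs):
    """Yield every circuit that differs in exactly one gate type or wire."""
    for g, (op, a, b) in enumerate(gates):
        n_sources = N_IN + g  # a gate reads inputs and earlier gates only
        for new_op in OPS:
            if new_op != op:
                yield gates[:g] + [(new_op, a, b)] + gates[g + 1:], outputs
        for new_a in range(n_sources):
            if new_a != a:
                yield gates[:g] + [(op, new_a, b)] + gates[g + 1:], outputs
        for new_b in range(n_sources):
            if new_b != b:
                yield gates[:g] + [(op, a, new_b)] + gates[g + 1:], outputs
    n_nodes = N_IN + len(gates)
    for k, o in enumerate(outputs):
        for new_o in range(n_nodes):
            if new_o != o:
                yield gates, outputs[:k] + (new_o,) + outputs[k + 1:]


def neutral_fraction(gates, outputs):
    """Fraction of one-change neighbors that compute the same function."""
    target = truth_table(gates, outputs)
    same = [truth_table(g, o) == target for g, o in neighbors(gates, outputs)]
    return sum(same) / len(same)


# Hypothetical 4-gate circuit for the circular shift ijkl -> lijk.
# Nodes 0..3 are the inputs i, j, k, l; nodes 4..7 are the four gates.
gates = [("AND", 3, 3), ("AND", 0, 0), ("OR", 1, 1), ("OR", 2, 2)]
outputs = (4, 5, 6, 7)
print(neutral_fraction(gates, outputs))
```

The number this toy circuit yields says nothing about the values reported in Figure 15.4b. It only illustrates the bookkeeping: enumerate all one-change neighbors, compare truth tables, and report the fraction that is unchanged.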
For example, the mean fraction of neutral neighbors for 1000 different 16-gate circuits that compute 1000 different functions increases from a value of 0.6 for functions with a frequency of 10⁻⁷ to a value of 0.8 for functions with a frequency of 10⁻⁴. Within the neutral network of any one function, the distribution of the fraction of neutral neighbors is typically broad. This holds for a wide range of functions with different frequencies [628].
Engineering implications for fault-tolerant circuitry
In sum, these analyses reveal several parallels between circuit space and the genotype space of biological systems. Typically, many circuits compute the same function and are connected in large neutral networks that span circuit space, or nearly so. Typical circuits also have a large fraction of neighbors that compute the same function. In addition, the distribution of this fraction is broad (Figure 15.4b). That is, some circuits have many more neutral neighbors than others. These are especially fault-tolerant circuits.
These generic properties of circuit space help place an important class of engineering problems into a broader context. They suggest that it is typically possible to design circuits with high fault tolerance without redundant components that serve no purpose other than to back up existing gates. In other words, the existence of fault-tolerant circuitry is a generic feature of circuit space that holds for many different functions.
These observations also provide a broader context for earlier work that used evolutionary approaches to evolve fault-tolerant circuitry. In such work, populations of circuits were subjected to changes in their configuration, and to selection favoring circuits that compute specific target functions, such as logical XNOR or binary addition. Circuits arise in such populations that not only compute the desired function but are also highly fault-tolerant, for example, to faulty connections between transistors [312, 313, 394, 770–772]. My observations here suggest that evolutionary approaches will be generally useful for discovering fault-tolerant circuits, and not just for a few functions.
As I discussed earlier (Chapter 8), sufficiently large populations evolving on a neutral network will generally accumulate in regions of the network where genotypes (circuits) have many neutral neighbors, and thus high robustness or fault-tolerance. One can even predict the average population robustness from a neutral network’s structure [798]. In other words, the evolvability of high fault-tolerance is also a generic property of circuit space.
Figure 15.4 (a) Two maximally different circuits that both compute the circular shift function. (b) The fraction of neutral neighbors of circuits that compute the circular shift function is typically high, varies broadly, and increases with circuit size. The three histograms in (b) are for circuits with 9, 16, and 25 gates (horizontal axis: fraction of neutral neighbors; vertical axis: number of circuits). From [628].
Diverse phenotypes in different neighborhoods of a neutral network
The phenomena I just discussed pertain to the first of two principal properties I highlighted in biological systems. The second property regards two genotypes G1 and G2 on the same genotype network, and their neighborhoods. The novel phenotypes one can find in these neighborhoods are generally very different. We took several approaches to determine whether this also holds for two circuits C1 and C2 that compute the same function. That is, do their neighborhoods contain circuits that compute very different functions? [628]
In a first approach to this question, we started out with a circuit C computing a given function, and subjected it to a function-preserving random walk, as described above. At each step of this random walk, we identified the functions found in circuits belonging to the neighborhoods of the random walker C’ and of the starting circuit C. We determined the fraction of functions that occur in the neighborhood of only one but not the other circuit. For brevity, I will call it the fraction of functions unique to a circuit’s neighborhood. Figure 15.5a shows this fraction on the vertical axis, as a function of the number of steps in the random walk on the horizontal axis. Clearly, the fraction of unique functions rises rapidly, within fewer than 10 steps, to a large value of the order of 0.8. The data in this figure are based on a function with moderate frequency of 4.7×10⁻⁶. The same analysis for functions with different frequencies yields qualitatively identical observations: similar circuits on the same neutral network have neighborhoods whose circuits compute mostly different functions [628].
In a second approach, we examined the diversity of phenotypes in the neighborhoods of two circuits C and C’ at the end of a function-preserving random walk with 2000 steps (Figure 15.5b).
Specifically, the figure examines the relationship between the frequency of a function, that is, the number of circuits computing the function (horizontal axis), and the fraction of unique functions found in the neighborhoods of C and C’. The figure shows that this fraction does not depend strongly on the frequency of the function, and is typically high. Specifically, more than 70 percent of the functions in the neighborhoods of two circuits on the same neutral network are unique to one of the neighborhoods.
In a third approach, we again studied a phenotype-preserving random walk through circuit space, but now determined the cumulative number of different functions that can be found in the neighborhood of the random walker. That is, we determined for each step in the random walk the circuit neighborhood of the random walker and the functions therein. If a circuit in this neighborhood computed a function that had not been computed by any circuit found in the neighborhood of a previous step, we added this function to a cumulative list of novel functions encountered during the random walk. Figure 15.5c shows the number of these novel functions and its dependence on the number of steps in the random walk. Although the rate at which novel functions are found decreases slightly over the course of the random walk, it does not approach zero. I note that the number of novel functions encountered during this random walk, albeit large, is much smaller than the total number of 1.8×10¹⁹ possible functions. The neutral network of the function examined in the figure may comprise more than 10⁴⁰ circuits, because the logic function used for this example had a frequency exceeding 10⁻⁶. In principle, this accumulation of novel functions could thus continue over much longer random walks. This near-endless accumulation of novel functions is a generic feature of neutral networks in circuit space. It exists for multiple neutral networks of functions with different frequencies [628].
Taken together, these observations again display striking similarities to biological systems. The neutral networks of typical functions (phenotypes) contain functionally very diverse neighborhoods.
That is, small changes in two different circuits (genotypes) on the same neutral network can make very different novel functions computable.
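The two measurements used in this section, the fraction U of functions unique to one of two neighborhoods and the cumulative count of novel functions along a function-preserving random walk, can be sketched in a few lines. To keep the example short it does not use circuits at all. The genotype-phenotype map below (bit strings whose phenotype is the OR of each adjacent pair of bits) is a made-up stand-in, and every name in the code is hypothetical.

```python
import random


def phenotype(genotype):
    """Toy map: OR of each adjacent pair of bits (a stand-in, not a circuit)."""
    return tuple(genotype[p] | genotype[p + 1]
                 for p in range(0, len(genotype), 2))


def neighbors(genotype):
    """All genotypes that differ from `genotype` in exactly one bit."""
    return [genotype[:p] + (1 - genotype[p],) + genotype[p + 1:]
            for p in range(len(genotype))]


def neighborhood_functions(genotype):
    """Phenotypes in the neighborhood that differ from the genotype's own."""
    own = phenotype(genotype)
    return {phenotype(n) for n in neighbors(genotype)} - {own}


def unique_fraction(f0, fi):
    """U = 1 - |F0 & Fi| / |F0 | Fi|, defined as 0 if both sets are empty."""
    union = f0 | fi
    return 1 - len(f0 & fi) / len(union) if union else 0.0


def neutral_walk(start, steps, rng):
    """Function-preserving random walk: every step keeps the phenotype."""
    target = phenotype(start)
    current = start
    path = [current]
    for _ in range(steps):
        neutral = [n for n in neighbors(current) if phenotype(n) == target]
        if not neutral:
            break  # no neutral neighbor left; stop the walk
        current = rng.choice(neutral)
        path.append(current)
    return path


rng = random.Random(0)
start = (1, 1, 0, 1, 1, 0, 0, 0)
path = neutral_walk(start, 50, rng)

f0 = neighborhood_functions(start)
u_values = [unique_fraction(f0, neighborhood_functions(g)) for g in path]

seen = set()
cumulative = []  # number of distinct novel phenotypes seen so far
for g in path:
    seen |= neighborhood_functions(g)
    cumulative.append(len(seen))
```

In this tiny toy space only a handful of novel phenotypes exist, so the cumulative count saturates almost immediately. The point of the sketch is the procedure, not the numbers, which in circuit space behave as described in the text.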
The neutral networks of different circuits are close together in circuit space
In biological systems, the neutral networks associated with two phenotypes are typically close together in genotype space. This means that there exist some genotypes on the two different networks that are similar to one another (Chapters 2–4). It implies that any evolutionary search for a new phenotype may not have to explore the entire vast genotype space, but only a small fraction of this space.
We examined whether the space of electronic circuits has an analogous property. We did so first for circuits computing two specific functions, the circular shift and right shift I discussed earlier. We started with 1000 pairs of random circuits where one circuit in a pair computes the right-shift function, and where the other circuit computes the circular shift function. Starting from the first circuit in such a pair, we performed a function-preserving random walk that was required to reduce the distance to the other circuit. We then carried out an analogous random walk starting from the second circuit. We recorded the minimal distance between the circuits after 2×10⁴ steps of this random walk. Figure 15.6a shows the distribution of the resulting minimal distances. The minimum of the distribution is D=0.06, corresponding to three elementary changes. In other words, a circuit computing the right-shift function is only three steps away from a circuit computing the circular shift function (and vice versa). We note that the minimal distance our approach estimates is merely an upper bound of the actual minimal distance. That is, the actual minimal distance may be even lower than our estimate, because our approach may
have failed to uncover circuits that are even closer to each other.
We repeated this approach with 5000 pairs of different functions, all of which were rare functions found only once within our sample of 2×10⁷ circuits. For each pair of functions, we again carried out function-preserving random walks that attempted to minimize the distance between circuits computing a given function. To make this analysis computationally feasible, we carried out only one random walk per function pair. Figure 15.6b shows the minimum distances this approach found. The median of the distribution equals D=0.19. This means that a typical evolutionary search starting from a circuit computing a given logic function and aiming to find a circuit that computes another logic function will have to explore only a small fraction of circuit space. Specifically, a ball in circuit space of radius 0.2 comprises less than a fraction 10⁻¹⁵ of circuit space for 16-gate circuits [628]. Again, this analysis provides only an upper bound on the minimum distance between two neutral networks. This upper bound is worse than for the circular shift and right-shift functions, partly because we here only attempted to minimize distance through one random walk per function pair. Actual minimal distances between neutral networks may be substantially lower.
In sum, this last analysis reveals another similarity to biological systems. Identification of circuitry computing novel functions through evolutionary search needs to explore only a small fraction of circuit space. In the context of reconfigurable hardware, this means that one may not need to overhaul a circuit’s architecture completely to compute a typical new function. Reconfiguring a small fraction of a circuit may suffice.
Figure 15.5 The neighborhoods of different circuits on the same neutral network contain highly diverse new functions. (a) The horizontal axis shows the number i of steps of a function-preserving random walk starting with a circuit C0. If Ci denotes the random walking circuit at the i-th step, and if F0 and Fi denote the sets of different functions computed by at least one circuit in the neighborhoods of C0 and Ci, respectively, then the vertical axis shows the quantity U = 1 − (|F0 ∩ Fi| / |F0 ∪ Fi|), where for a set X, |X| denotes the number of elements in the set. In other words, U is the fraction of phenotypes that occur only in the neighborhood of one, but not the other circuit of the pair (C0, Ci). (b) The mean fraction U, determined as just described, of unique functions in the neighborhoods of two circuits C0 and C2000, where C2000 is the circuit at the end of a function-preserving random walk of 2000 steps. The data in this plot are based on one random walk each for 1000 neutral networks corresponding to 1000 different functions. Functions are binned according to their frequency, as indicated on the horizontal axis, and means are calculated over all data points in one bin. The gap in the middle of the plot results from the fact that we focused only on rare and frequent functions in this analysis, to limit computational cost. Lengths of error bars correspond to one standard deviation. (c) The cumulative number of new functions encountered in the neighborhood of a random walking circuit during a function-preserving random walk. Data in (a) and (c) are based on 16-gate circuits whose logic function has a moderate frequency of 5.1×10⁻⁵. Similar observations hold for functions with widely varying frequency. From [628].
Figure 15.6 Neutral networks of different functions are near each other in circuit space. (a) Distribution of the minimal distance between circuits that calculate the circular-shift and right-shift functions, as estimated through random walks that start from 1000 different circuit pairs computing these different functions. Distances are expressed as fractions of the circuit space diameter. The minimum distance of the distribution shown corresponds to 3 elementary circuit changes. It is an upper bound of the actual distance between the neutral networks of these two functions. (b) Distribution of upper bounds for the minimal distances between neutral networks for a broad range of functions. Data are based on 5000 neutral network pairs, where the two starting circuits of each pair compute different functions, and on one distance-minimizing function-preserving random walk per function pair. For all function pairs in this analysis, the functions are sufficiently rare to appear only once in a sample of 2×10⁷ circuits. The data show that even neutral networks of such rare functions are typically near each other in circuit space. From [628].
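What it means for two neutral networks to lie close together can be illustrated with two hand-built circuits. The sketch below again relies on an assumed toy model (feed-forward two-input gates, four inputs, four outputs, with outputs allowed to tap any input or gate) and not on the circuit architecture of [628]. The circuits and all names are hypothetical.

```python
from itertools import product

OPS = {  # assumed gate types of the toy model
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}


def truth_table(gates, outputs):
    """Outputs for all 16 inputs (i, j, k, l), in lexicographic order."""
    table = []
    for inputs in product((0, 1), repeat=4):
        values = list(inputs)
        for op, a, b in gates:
            values.append(OPS[op](values[a], values[b]))
        table.append(tuple(values[o] for o in outputs))
    return table


def n_differences(circuit_a, circuit_b):
    """Number of gate types and wires in which two same-sized circuits differ."""
    (gates_a, out_a), (gates_b, out_b) = circuit_a, circuit_b
    diff = sum(x != y for ga, gb in zip(gates_a, gates_b)
               for x, y in zip(ga, gb))
    return diff + sum(x != y for x, y in zip(out_a, out_b))


# Nodes 0..3 are the inputs i, j, k, l. Gate 4 computes the constant 0
# (i XOR i); gates 5, 6, 7 pass i, j, k through unchanged.
shared_gates = [("XOR", 0, 0), ("OR", 0, 0), ("OR", 1, 1), ("OR", 2, 2)]
right_shift_circuit = (shared_gates, (4, 5, 6, 7))     # outputs 0, i, j, k
circular_shift_circuit = (shared_gates, (3, 5, 6, 7))  # outputs l, i, j, k

inputs = list(product((0, 1), repeat=4))
assert truth_table(*right_shift_circuit) == [(0, i, j, k) for i, j, k, l in inputs]
assert truth_table(*circular_shift_circuit) == [(l, i, j, k) for i, j, k, l in inputs]

print(n_differences(right_shift_circuit, circular_shift_circuit))  # prints 1
```

In this toy model the two functions happen to be a single rewiring apart. That figure is a property of the toy model only; the upper bound reported in the text for the circuits of [628] is three elementary changes.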
Implications for adaptive systems
It is not difficult to see that observations from the last two sections have implications for the design of technological systems, especially of adaptive systems that can change their configuration and behavior in response to new environments, or while they learn a new task. In such systems, hardware that can be reconfigured on the fly, while the system performs a task, is highly desirable. An example of such a system is the robot YaMoR [529, 721], designed to move autonomously in its environment. It consists of several mechanically homogeneous modules, somewhat analogous to segments of segmented organisms. Each module contains a field-programmable gate array with the ability to self-reconfigure while the robot is navigating its environment. Such reconfigurable circuitry can endow a robot with the ability to learn. In systems like this, dynamical hardware reconfiguration permits exploration of different circuit architectures, and thus implementation of different behaviors, followed by retention of behaviors suitable for a given task. If a system is to adapt or learn autonomously, that is, without human help, such internal reconfiguration will typically contain a random, exploratory element.
A related application involves self-repairing circuitry. Such circuitry might be required to retain some functionality while exploring configurations that repair an occurring fault. The existence of vast neutral networks with functionally diverse neighborhoods would facilitate this task. I note that analogs of such dynamical, exploratory (re)configuration also exist inside living systems. Examples include the generation of antibody diversity through hypermutation, the polymerization dynamics of microtubules before mitosis, which enables chromosome segregation, and the development of the nervous system, where initial connections between neurons are often first established exploratorily [264].
Circuits like those I analyzed here would be especially amenable to such dynamic, adaptive, and exploratory reconfiguration. This is because a rich diversity of novel functions exists in the neighborhood of a changing circuit. At the same time, because of the existence of neutral networks, such a circuit can preserve its existing function or behavior, until a new, better configuration has been found. Circuit designs that can access a large number of functions in any one circuit’s neighborhood would be especially useful in this regard. Our observation that neutral networks of different functions are located close together in circuit space is also relevant. The reason is that hardware reconfiguration is costly [117, 700, 782]. It requires storage space and time, during which all or parts of a circuit may be unavailable. Existing reconfigurable hardware is amenable to partial reconfiguration, where only some of its internal architecture is changed. Partial reconfiguration can reduce reconfiguration costs [700, 782]. Circuitry like that I discussed here, where different neutral networks are close to each other, can help find circuits that compute a specific new function with few changes. In other words, such circuitry can also help minimize reconfiguration costs.
Complexity as the price of robustness and evolvability
Figure 15.7 shows a small 4-gate circuit and underneath it a much larger 16-gate circuit. These two circuits compute exactly the same logic function. The example illustrates that circuits able to compute any one function can have very different complexity, defined here as the number of gates in a circuit. Engineers tend to value simplicity and elegance in system design. The 16-gate circuit from Figure 15.7 clearly violates this principle. Its apparently unnecessarily complex design, however, has other advantages. As Figure 15.4b showed, larger circuits that compute a specific function tend to be more robust to wiring changes. For example, the fraction of changes that leave the circuit’s function intact is equal to 0.06 for the small circuit in Figure 15.7, but equal to 0.8 for the larger circuit. In other words, the more complex circuit is more than ten times more robust. Figure 15.8a shows that this increase of robustness is a generic property that occurs for many different functions. Larger circuits also tend to be more robust to failures of individual logic gates, a common failure mode in digital circuitry [628]. Similar trends exist for the fraction of functions that occur in the neighborhood of one but not the other circuit on the same neutral network. This fraction increases with increasing circuit size (Figure 15.8b). In other words, higher circuit complexity facilitates access to a larger number of novel functions.
Figure 15.7 A simple (4-gate, panel a) and a complex (16-gate, panel b) circuit that compute the same logic function. For each of 16 possible 4-bit input bitstrings, one can represent the corresponding output bitstring of a logic function as a hexadecimal digit. In the order of descending magnitude of input bitstrings, the output bitstrings of the function used here can then be represented as the hexadecimal string CEDFA8B957463120. The two circuits shown compute the same function but differ dramatically in their robustness, i.e., the fraction of neighbors computing the same function. This fraction is equal to 0.06 for the small circuit, but equal to 0.8 for the larger circuit. From [628].
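The hexadecimal notation used in the caption of Figure 15.7 can be unpacked mechanically. The sketch below decodes the string CEDFA8B957463120 into a lookup table and encodes a table back into a string. The function names are illustrative, and reading each hexadecimal digit as a 4-bit output with the most significant bit first is an assumption about bit order.

```python
HEX_FUNCTION = "CEDFA8B957463120"  # function shown in Figure 15.7


def decode(hex_string):
    """Map each 4-bit input (15 down to 0) to its 4-bit output value."""
    inputs = range(15, -1, -1)  # descending magnitude of input bitstrings
    return {x: int(digit, 16) for x, digit in zip(inputs, hex_string)}


def encode(function):
    """Inverse of decode: write the outputs for inputs 15..0 as hex digits."""
    return "".join(format(function[x], "X") for x in range(15, -1, -1))


table = decode(HEX_FUNCTION)
for x in range(15, -1, -1):
    print(format(x, "04b"), "->", format(table[x], "04b"))

# The same notation for the right-shift function ijkl -> 0ijk.
right_shift_hex = encode({x: x // 2 for x in range(16)})

# Each of the 16 digit positions can take 16 values, so the notation
# covers 16**16 = 2**64 (about 1.8e19) distinct functions.
n_functions = 16 ** 16
```

The final count agrees with the total of 1.8×10¹⁹ possible functions quoted earlier in the chapter, under the assumption of four input and four output bits.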
In sum, circuit complexity increases robustness and functional versatility. A lack of simplicity (and elegance) in circuit design may be the price to pay for these features. Evolvable systems may be neither simple nor elegant, but complex and messy.
In artificial systems like electronic circuitry we can systematically change system complexity and see how it affects the properties I discussed here. This is a great advantage over biological systems, where such systematic manipulation is less straightforward. However, I earlier highlighted some principles of biological systems (Chapter 11) similar to those I just discussed. Specifically, I argued that a biological system’s complexity is linked to the number of different environments that it needs to survive in. Survival in more environments can require an increase in system complexity. This increase causes robustness to genetic change in any one environment. Robustness, in turn, brings forth genotype networks, which make many novel phenotypes accessible to an evolving system. Thus, while the causes of complexity may differ for biological and technological systems (natural selection in changing environments versus design for versatility), their effects on innovability may be similar.
Figure 15.8 Robustness and novel accessible phenotypes increase with circuit complexity. Circuit complexity is here defined as the number of gates in a circuit. The horizontal axis of both panels shows circuit size (9, 16, 25, and 36 logic gates). (a) Mean fraction of neutral neighbors, a measure of circuit robustness to configuration change. (b) The fraction of phenotypes that occur only in the neighborhood of one, but not the other circuit of a circuit pair (C0, C2000), where C2000 is the endpoint of a function-preserving random walk of 2000 steps that began at the circuit C0. Data in both panels are based on 1000 circuits sampled at random from circuit space, each of which computes a different logic function. Error bars correspond to one standard deviation. From [628].
Summary
I here analyzed reconfigurable digital logic circuitry similar to commercially important field-programmable gate arrays. I showed that this circuitry has multiple features similar to those I discussed earlier in biological systems. Circuits that compute the same logic functions form vast neutral networks, connected sets of circuits that extend far through circuit space. The neutral networks of different functions differ greatly in size. Different neighborhoods of circuit pairs on the same neutral networks contain mostly circuits that compute different new functions. Neutral networks of different functions are typically close together in circuit space. In Chapter 6 I discussed that features like these emerge if genotypes in a high-dimensional genotype space typically have many neighbors with the same phenotype. This condition is also met for the circuitry I study, which is quite robust to changes in internal wiring and failure of its logic gates. Robustness and other features of such circuitry tend to increase with circuit size. Circuit complexity may thus be a price to pay for both robustness and functional versatility. In systems like these, the existence of fault-tolerant circuitry is a generic property of circuit space. In addition, circuit spaces with this organization are well suited for the design of autonomous adaptive systems that can dynamically reconfigure themselves. The reason is that such systems should be able to explore many new behaviors while reconfiguring themselves only minimally.
I only explored one class of technological system here. Future work needs to explore which other classes of systems, electronic or otherwise, have similar properties. A systematic understanding of such systems may help future engineers create technologies that can leverage principles which have served nature well for several billion years.
CHAPTER 16
Summary and outlook
Most, if not all, evolutionary innovations involve changes in the following three classes of biological systems.
1. Metabolic networks. Many metabolic innovations involve the ability to use novel nutrients to synthesize all biomass molecules that are necessary for an organism’s survival and reproduction.
2. Regulatory circuits. Regulatory innovations involve new patterns of gene expression or molecular activity that can drive the formation of novel macroscopic traits and physiological states.
3. Proteins and RNA molecules. Here, innovations involve molecules with new structures and biochemical functions, such as enzymes with new catalytic activities.
New phenotypes in these three classes of systems are the building blocks of innovations on all levels of biological organization, including the complex, macroscopic innovations in multicellular organisms. Such innovations will typically involve multiple changes in all three classes of systems. However, to identify common principles of innovation, it is useful to study these systems separately. They share the following basic commonalities (Chapters 2–5).
• The genotypes in each class of system form a vast genotype space, whose members can adopt astronomically many different phenotypes.
• The genotypes with any one phenotype form large, often connected sets (genotype networks) that span genotype space or large proportions of it.
• Genotype networks of different phenotypes vary in size, often by many orders of magnitude.
• The genotype set of any one phenotype typically occupies a vanishing fraction of genotype space; yet it has astronomically many members.
• Their vast size allows genotype sets and genotype networks to have a rich internal structure. For example, any one genotype network can contain regions with high robustness of genotypes to mutations, or with genotypic “memory” of past environments (chapters 8, 11).
• Heterogeneity exists both within any one genotype network, and among genotype networks. The ability to bring forth new phenotypes varies both among genotypes within the same genotype network, as well as among different genotype networks.
• The neighborhoods of different genotypes on the same genotype network contain different novel phenotypes. Thus, a population that explores genotype space while preserving its phenotype can encounter ever-changing novel phenotypes in its neighborhood, many times more than if genotype networks did not exist.
• The genotype networks of different phenotypes are close together in genotype space, such that only a small fraction of this space needs to be explored to find most novel phenotypes.
The existence of genotype networks and their phenotypically diverse neighborhoods solves perhaps the most difficult problem evolutionary innovation poses. It allows preservation of an existing, well-adapted phenotype, while permitting exploration of myriad novel phenotypes, some of which can become innovations. We may never know whether life originated with simple metabolic networks, or with a carrier of genetic information such as RNA. However, both of these system classes obey the principles I just discussed. Therefore, these principles may have helped innovation along since life’s origin.
The above three system classes are very different, but they nonetheless have key common properties important for innovation. These commonalities emerge from one fundamental feature that these systems share. In all of them, genotypes typically have many one-mutant neighbors with the same phenotype. In other words, their phenotype is to some extent robust to genetic change. Such robustness is a feature of many biological systems, as I also argued in an earlier book (825). It is both necessary and sufficient to explain the most important features of genotype space. One can thus view genotype networks as self-organized features of genotype space that emerge in robust systems (chapter 6). Different levels of organization, from molecules to networks, can reinforce each other’s robustness (chapter 8).
The question of what causes such robustness led me to the role of the environment, both that outside and that inside of an organism. The greater a system’s ability to adapt to new or changing environments, the more complex it needs to be, that is, the more parts it must contain. In other words, environmental change is one key driver of biological complexity. This complexity indirectly leads to robustness in any one environment, where small genetic changes in individual genotypes do not affect the phenotype (chapter 11). The schematic of Figure 16.1 shows how these observations fit together. Evolutionary adaptation or innovations that organisms acquire in order to cope with novel environments will tend to increase system complexity.
Such increased system complexity implies robustness, which brings forth genotype networks. These networks facilitate innovations that allow the conquest of new environments or persistence in changing environments. The ascending spiral of the figure hints that this process may well be self-accelerating.
Figure 16.1 The relationship between environmental change, robustness, complexity, genotype networks, and innovation. [Schematic with the labels: Environmental change, Innovation, Increased system complexity, Robustness, Genotype networks.]
These are the core observations of this book. The conceptual framework they form sheds light on many other phenomena, some of which are long-standing unresolved issues in evolutionary biology.
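The notions of robustness and genotype networks summarized above can be made concrete with a minimal sketch. It assumes a deliberately artificial genotype-phenotype map (binary genotypes of length 8 whose phenotype depends only on their number of 1-bits). Neither this map nor the function names below come from the systems analyzed in this book; they are hypothetical and serve only to illustrate the definitions.

```python
from collections import deque


def phenotype(g):
    """Toy many-to-one map (an assumption): phenotype depends only on the number of 1-bits."""
    return min(sum(g) // 2, 3)


def neighbors(g):
    """All genotypes that differ from g at exactly one position (one-mutant neighbors)."""
    return [g[:i] + (1 - g[i],) + g[i + 1:] for i in range(len(g))]


def robustness(g):
    """Fraction of one-mutant neighbors that have the same phenotype as g."""
    p = phenotype(g)
    nbrs = neighbors(g)
    return sum(phenotype(n) == p for n in nbrs) / len(nbrs)


def novel_phenotypes(g):
    """Phenotypes found in the one-mutant neighborhood of g that differ from g's own."""
    return {phenotype(n) for n in neighbors(g)} - {phenotype(g)}


def genotype_network(start):
    """Genotypes reachable from start through phenotype-preserving single mutations."""
    p = phenotype(start)
    seen = {start}
    queue = deque([start])
    while queue:
        g = queue.popleft()
        for n in neighbors(g):
            if n not in seen and phenotype(n) == p:
                seen.add(n)
                queue.append(n)
    return seen


a = (1, 1, 0, 0, 0, 0, 0, 0)
b = (1, 1, 1, 0, 0, 0, 0, 0)
print(robustness(a), novel_phenotypes(a))  # 0.75 {0}
print(robustness(b), novel_phenotypes(b))  # 0.375 {2}
print(b in genotype_network(a), len(genotype_network(a)))  # True 84
```

In this toy map, the two genotypes share a phenotype and lie on the same genotype network, yet their neighborhoods contain different novel phenotypes. That is the combination of properties the text describes: a population can drift along a network without changing its phenotype and thereby gain access to new phenotypes.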
Neutralism and selectionism I showed how the genotype space framework can reconcile two extreme perspectives on molecular evolution, neutralism and selectionism (825, 829). Most mutations that are neutral when they first arise may not remain neutral forever. This, however, does not mean that their neutrality is unimportant for innovation. Neutral mutations on a genotype network are crucial to prepare the ground for later, beneficial mutations that lead to evolutionary adaptation or to an evolutionary innovation. They can be viewed as molecular exaptations. Molecular data that demonstrate episodic diversification among molecules, pervasive epistasis, and shifting foci of positive selection support this perspective. Neutralism and selectionism capture complementary and equally important aspects of biological reality (chapter 7).
Robustness A second issue regards the role of a system’s robustness to mutations in promoting or hindering innovation. On a qualitative level, robustness brings forth genotype networks. On this level, it is thus essential for innovation. However, more quantitatively, the role of robustness depends on two factors. The first is the level of organization on which we focus (genotype or phenotype), and the second is the kind of system we study. High robustness of a genotype (its number of neutral neighbors) will typically hamper innovation, because it reduces the number of novel phenotypes mutations can produce. Thus, selection that favors preservation of an optimal phenotype in a constant environment will generally reduce a system’s ability to innovate, because it will increase genotypic robustness in a population. However, high robustness of a phenotype, which is proportional to the size of its genotype network, can facilitate innovation. This is because it can facilitate rapid spreading of an evolving population through its genotype network, and thus exploration of many novel phenotypes. This advantage exists for proteins and RNA molecules, but it is not universal. System classes where it exists can avoid an important impediment to evolutionary adaptation, a conflict between two levels of organization, that of the individual and that of a lineage. This is because in such systems the benefits of robustness to the individual are aligned with the benefits of innovations to a lineage, population, or species. In other words, systems that avoid this conflict can be both robust and innovative (chapter 8).
Gene duplication Duplicated genes are involved in many innovations. I showed that gene duplications can dramatically increase both the robustness of a system and the fraction of genotype space available for the exploration of novel phenotypes. This is another instance where genetic systems can avoid conflicts between individual-level and lineage-level benefits. In addition, gene duplication can also help resolve functional trade-offs: If two functions cannot be carried out equally well by any one molecule, gene duplication allows their execution by two different molecules, and can thus avoid this trade-off (chapter 9).
Recombination Recombination, with its long jumps through genotype space, explores novel phenotypes much more efficiently than mutation, which changes a system one part at a time. The problem is that recombination can destroy well-adapted genotypes. The genotype space framework allowed me to ask how destructive the effects of recombination really are. The answer is: not very destructive at all. Changes in a given number of system parts through recombination are often orders of magnitude less likely to disrupt a well-adapted phenotype than the same amount of change caused by mutation. In addition, in populations whose genotypes frequently recombine, recombination may decrease fitness less than mutation alone. In sum, recombination has only a modest price. In some circumstances, it may, paradoxically, even help preserve phenotypes (chapter 10).
Phenotypic plasticity Phenotypic plasticity is a ubiquitous and probably primal phenomenon of life. It has existed since life’s earliest history and occurs on all levels of biological organization. I argued that genotype networks can facilitate the origin of genotypes that form novel phenotypes through phenotypic plasticity. Genotype networks can thus facilitate innovation through plasticity wherever plasticity is an important source of novel phenotypes. This perspective can also help us see that the “phenotype-first” and “genotype-first” scenarios of evolutionary innovation are false dichotomies, similar to the “nature versus nurture” dichotomy (chapter 13).
Changing environments and innovation I mentioned above that environmental change may foster increasing system size. However, even systems of a given size can adapt, within limits, to either constant or changing environments. Constant environments that favor one particular phenotype may hinder innovation, because they create robust genotypes (chapter 8). In contrast, rapidly or slowly changing environments can promote the exploration of novel phenotypes, and thus evolutionary adaptation of populations to environmental change (chapter 11). Different regions on a genotype network can also contain a genotypic “memory” of phenotypes adaptive in past environments. Such a memory can facilitate adaptation to future environments, if these environments resemble past environments. Genotypic memory can help explain atavisms, ancient traits that were adaptive in the past, have lost their adaptive value, but still resurface in some individuals of a population (chapter 11).
Constraints Phenotypic variation can be constrained by physicochemical, selective, and genetic factors, as well as by the developmental processes that produce phenotypes from genotypes. The genotype space framework can help us see that the processes that form phenotypes are the fundamental cause of the other constraints. It also illustrates why causes of constrained variation are often not clearly separable. For students of innovation, evolutionary stasis and punctuated change are important effects of constraints. They can be readily explained as a consequence of the organization of genotype networks (chapter 12).
Technology To harness the principles I described here for human technologies is to harness Nature’s successful way of innovating. I showed that this goal may be within reach for at least one class of evolvable technology. Specifically, I showed that electronic circuitry implemented in reconfigurable hardware can show many parallels to biological systems. These parallels include analogues of genotype networks, computationally (“phenotypically”) diverse neighborhoods of individual circuits, as well as robustness and functional versatility that rise with system complexity. These commonalities can help design robust and adaptable circuitry that can execute many different computing functions with minimal system change (chapter 15). Together with earlier insights from biological systems (chapter 8), my observations in this chapter underscored another general principle.
Systems that innovate in the ways I described here will not be elegant and simple, but complex and messy. Innovability requires complexity. Many other technologies that can harness the principles I discussed here may lie in wait. A look at life’s spectacular innovations suggests that only our imagination may limit what these principles can help us uncover.
Challenges Systematic analyses of genotype spaces are young research endeavors. They require large amounts of data on biological systems, and a heavy dose of computational data analysis. Both have become available only late in the 20th century. We thus have barely scratched the surface of understanding the internal structure of genotype spaces, whose size dwarfs the number of atoms in the physical universe. One major challenge ahead is to extend the concepts I developed here to systems best represented in continuous genotype and phenotype spaces. Hints exist that such spaces may not fundamentally change the view on innovation I propose here, but our tools to understand them are still rudimentary. A second challenge is to find out how the organization of genotype spaces differs among different system classes, and not just to focus on universal properties, as I mostly did. A third challenge is to gain a much more detailed understanding of the frequency, extent, and kind of environmental changes that promote innovations. A fourth challenge is to integrate knowledge from the three core systems (metabolic, regulatory, molecular), and extend it to the spatial dimension characteristic of macroscopic traits in multicellular organisms. Because innovations in the three core systems are building blocks of developmental innovation, spatially extended systems may not show properties very different from those I discuss here. However, this has not been proven. The ideas I presented here suggest one way to understand innovation in the natural world. Tackling these challenges will help us see whether it is the best way. This is the end of a book, but just the beginning of a new opportunity, the opportunity to understand innovation, not just anecdotally but systematically. Such understanding is not only
arguably the most profound outstanding problem in evolutionary biology. Its application to human technology may shape the future of technological innovation through evolvable technologies. We are the first generation to be granted this opportunity. We are the first who can integrate the fundamental insights of the modern synthesis of
evolutionary biology in the early 20th century, with much more recent knowledge about complex phenotypes. This knowledge allows us to explore the universe of genotype spaces, a universe rivaling the visible universe in complexity, home to countless innovations waiting to be discovered.
References
1. May 30 2006. FPGA/PLD market to grow 14% in ‘06, says Gartner. In EE Times Asia (http://www.eetasia.com/). 2. Abramowitz M, Stegun I. 1972. Handbook of mathematical functions. New York: Dover. 3. Abzhanov A, Kuo WP, Hartmann C, Grant BR, Grant PR, Tabin CJ. 2006. The calmodulin pathway and evolution of elongated beak morphology in Darwin’s finches. Nature 442: 563–7. 4. Abzhanov A, Protas M, Grant BR, Grant PR, Tabin CJ. 2004. Bmp4 and morphological variation of beaks in Darwin’s finches. Science 305: 1462–5. 5. Adams B, Holmes EC, Zhang C, Mammen MP, Nimmannitya S, et al. 2006. Cross-protective immunity can account for the alternating epidemic pattern of dengue virus serotypes circulating in Bangkok. Proceedings of the National Academy of Sciences of the United States of America 103: 14234–9. 6. Adams KL, Qiu YL, Stoutemyer M, Palmer JD. 2002. Punctuated evolution of mitochondrial gene content: High and variable rates of mitochondrial gene loss and transfer to the nucleus during angiosperm evolution. Proceedings of the National Academy of Sciences of the United States of America 99: 9905–12. 7. Agrawal AA. 2001. Ecology–phenotypic plasticity in the interactions and evolution of species. Science 294: 321–6. 8. Aharoni A, Gaidukov L, Khersonsky O, Gould SM, Roodveldt C, Tawfik DS. 2005. The “evolvability” of promiscuous protein functions. Nature Genetics 37: 73–6. 9. Akashi H. 1995. Inferring weak selection from patterns of polymorphism and divergence at silent sites in Drosophila DNA. Genetics 139: 1067–76. 10. Akashi H. 1999. Inferring the fitness effects of DNA mutations from polymorphism and divergence data: Statistical power to detect directional selection under stationarity and free recombination. Genetics 151: 221–38. 11. Alberch P. 1991. From genes to phenotype: Dynamical systems and evolvability. Genetica 84: 5–11. 12. Alberch P, Gale EA. 1985. A developmental analysis of an evolutionary trend—digital reduction in amphibians. Evolution 39: 8–23.
13. Alberts B, Johnson A, Lewis J, Raff M, Roberts K, Walter P. 2008. Molecular biology of the cell. New York, NY: Garland Science. 14. Aldana M, Cluzel P. 2003. A natural class of robust networks. Proceedings of the National Academy of Sciences of the United States of America 100: 8710–4. 15. Alon U. 2007. An introduction to systems biology: Design principles of biological circuits. Boca Raton, FL: Chapman & Hall/CRC. 16. Alonso J, Stepanova A, Leisse T, Kim C, Chen H, et al. 2003. Genome-wide insertional mutagenesis of Arabidopsis thaliana. Science 301: 653–7. 17. Alvarez-Buylla ER, Liljegren SJ, Pelaz S, Gold SE, Burgeff C, et al. 2000. MADS-box gene evolution beyond flowers: expression in pollen, endosperm, guard cells, roots and trichomes. Plant Journal 24: 457–66. 18. Ambros S, Hernandez C, Desvignes J, Flores R. 1998. Genomic structure of three phenotypically different isolates of peach latent mosaic viroid: Implications of the existence of constraints limiting the heterogeneity of viroid quasispecies. Journal of Virology 72: 7397–406. 19. Amit DJ. 1989. Modeling brain function. The world of attractor neural networks. Cambridge, UK: Cambridge University Press. 20. Amitai G, Gupta R, Tawfik D. 2007. Latent evolutionary potentials under the neutral mutational drift of an enzyme. HFSP Journal 1: 67–78. 21. Ancel LW. 2000. Undermining the Baldwin expediting effect: Does phenotypic plasticity accelerate evolution? Theoretical Population Biology 58: 307–19. 22. Ancel LW, Fontana W. 2000. Plasticity, evolvability, and modularity in RNA. Journal of Experimental Zoology/ Molecular Development and Evolution 288: 242–83. 23. Anderson JM. 1986. Photoregulation of the composition, function, and structure of thylakoid membranes. Annual Review of Plant Physiology 37: 93–136. 24. Anderson JM, Aro EM. 1994. Grana stacking and protection of photosystem-II in thylakoid membranes of higher-plant leaves under sustained high irradiance - an hypothesis. 
Photosynthesis Research 41: 315–26.
25. Andolfatto P. 2005. Adaptive evolution of non-coding DNA in Drosophila. Nature 437: 1149–52. 26. Andolfatto P. 2007. Hitchhiking effects of recurrent beneficial amino acid substitutions in the Drosophila melanogaster genome. Genome Research 17: 1755–62. 27. Andolfatto P, Przeworski M. 2001. Regions of lower crossing over harbor more rare variants in African populations of Drosophila melanogaster. Genetics 158: 657–65. 28. Anfinsen CB, Haber E, Sela MS, White Jr. FH. 1961. The kinetics of formation of native ribonuclease during oxidation of the reduced polypeptide chain. Proceedings of the National Academy of Sciences of the USA 47: 1309–14. 29. Antonovics J, Vantienderen PH. 1991. Ontoecogenophyloconstraints—the chaos of constraint terminology. Trends in Ecology & Evolution 6: 166–8. 30. Aronson BD, Johnson KA, Loros JJ, Dunlap JC. 1994. Negative feedback defining a circadian clock—autoregulation of the clock gene frequency. Science 263: 1578–84. 31. Aronson H, Royer W, Hendrickson W. 1994. Quantification of tertiary structural conservation despite primary sequence drift in the globin fold. Protein Science 3: 1706–11. 32. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. 2000. Gene ontology: Tool for the unification of biology. Nature Genetics 25: 25–9. 33. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. 2000. Gene ontology: Tool for the unification of biology. Nature Genetics 25: 25–9. 34. Azevedo RBR, Lohaus R, Srinivasan S, Dang KK, Burch CL. 2006. Sexual reproduction selects for robustness and negative epistasis in artificial gene networks. Nature 440: 87–90. 35. Babajide A, Hofacker I, Sippl M, Stadler P. 1997. Neutral networks in protein space: a computational study based on knowledge-based potentials of mean force. Folding & Design 2: 261–9. 36. Bae K, Lee C, Sidote D, Chuang KY, Edery I. 1998. Circadian regulation of a Drosophila homolog of the mammalian Clock gene: PER and TIM function as positive regulators. 
Molecular and Cellular Biology 18: 6142–51. 37. Balaban NQ, Merrin J, Chait R, Kowalik L, Leibler S. 2004. Bacterial persistence as a phenotypic switch. Science 305: 1622–5. 38. Balch M. 2003. Complete digital design. New York, NY: McGraw-Hill. 39. Baldwin JM. 1902. Development and evolution. Macmillan: New York, NY.
40. Banzhaf W. 1994. Genotype-phenotype-mapping and neutral variation. A case study in Genetic Programming. 3rd International Conference on Parallel Problem Solving from Nature (PPSN III), Jerusalem, Israel. 41. Banzhaf W, Leier A. 2005. Evolution on neutral networks in genetic programming. 3rd Workshop on Genetic Programming, Theory and Practice, Ann Arbor, MI. 42. Banzhaf W, Nordin P, Keller R, Francone F. 1998. Genetic programming—an introduction. San Francisco, CA: Morgan Kaufmann. 43. Bar-Even A, Paulsson J, Maheshri N, Carmi M, O’Shea E, et al. 2006. Noise in protein expression scales with natural protein abundance. Nature Genetics 38: 636–43. 44. Barlow AJ, Francis-West PH. 1997. Ectopic application of recombinant BMP-2 and BMP-4 can change patterning of developing chick facial primordia. Development 124: 391–8. 45. Barnett L. 2001. Netcrawling—optimal evolutionary search with neutral networks. Congress on Evolutionary Computation (CEC 2001), Seoul, South Korea. 46. Barton NH, Charlesworth B. 1998. Why sex and recombination? Science 281: 1986–90. 47. Bastolla U, Porto M, Roman HE, Vendruscolo M. 2003. Connectivity of neutral networks, overdispersion, and structural conservation in protein evolution. Journal of Molecular Evolution 56: 243–54. 48. Baudin F, Marquet R, Isel C, Darlix J, Ehresmann B, Ehresmann C. 1993. Functional sites in the 5’ region of human-immunodeficiency-virus type-1 RNA form defined structural domains. Journal of Molecular Biology 229: 382–97. 49. Becker A, Winter KU, Meyer B, Saedler H, Theissen G. 2000. MADS-box gene diversity in seed plants 300 million years ago. Molecular Biology and Evolution 17: 1425–34. 50. Beckert B, Nielsen H, Einvik C, Johansen SD, Westhof E, Masquida B. 2008. Molecular modelling of the GIR1 branching ribozyme gives new insight into evolution of structurally related ribozymes. EMBO Journal 27: 667–78. 51. Becskei A, Kaufmann BB, van Oudenaarden A. 2005. 
Contributions of low molecule number and chromosomal positioning to stochastic gene expression. Nature Genetics 37: 937–44. 52. Becskei A, Serrano L. 2000. Engineering stability in gene networks by autoregulation. Nature 405: 590–3. 53. Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh YP, et al. 2007. Population genomics: Whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biology 5: 2534–59.
54. Beldade P, Brakefield PM. 2002. The genetics and evo-devo of butterfly wing patterns. Nature Reviews Genetics 3: 442–52. 55. Bergman A, Siegal M. 2003. Evolutionary capacitance as a general feature of complex gene networks. Nature 424: 549–52. 56. Bergthorsson U, Andersson DI, Roth JR. 2007. Ohno’s dilemma: Evolution of new genes under continuous selection. Proceedings of the National Academy of Sciences of the USA 104: 17004–9. 57. Berkenpas M, Lawrence D, Ginsburg D. 1995. Molecular evolution of plasminogen-activator inhibitor-1 functional stability. EMBO Journal 14: 2969–77. 58. Bershtein S, Goldin K, Tawfik D. 2008. Intense neutral drifts yield robust and evolvable consensus proteins. Journal of Molecular Biology 379: 1029–44. 59. Bershtein S, Segal M, Bekerman R, Tokuriki N, Tawfik DS. 2006. Robustness-epistasis link shapes the fitness landscape of a randomly drifting protein. Nature 444: 929–32. 60. Bharathan G, Goliber TE, Moore C, Kessler S, Pham T, Sinha NR. 2002. Homologies in leaf form inferred from KNOXI gene expression during development. Science 296: 1858–60. 61. Bierne N, Eyre-Walker A. 2004. The genomic rate of adaptive amino acid substitution in Drosophila. Molecular Biology and Evolution 21: 1350–60. 62. Black BL, Olson EN. 1998. Transcriptional control of muscle development by myocyte enhancer factor-2 (MEF2) proteins. Annual Review of Cell and Developmental Biology 14: 167–96. 63. Blake W, Kaern M, Cantor C, Collins J. 2003. Noise in eukaryotic gene expression. Nature 422: 633–7. 64. Blank LM, Kuepfer L, Sauer U. 2005. Large-scale C-13-flux analysis reveals mechanistic principles of metabolic network robustness to null mutations in yeast. Genome Biology 6: R49. 65. Blattner FR, Plunkett G, Bloch CA, Perna NT, Burland V, et al. 1997. The complete genome sequence of Escherichia coli K-12. Science 277: 1453–62. 66. Bloom J, Romero P, Lu Z, Arnold F. 2007. 
Neutral genetic drift can alter promiscuous protein functions, potentially aiding functional evolution. Biology Direct 2: 17. 67. Bloom JD, Gong LI, Baltimore D. 2010. Permissive secondary mutations enable the evolution of influenza oseltamivir resistance. Science 328: 1272–5. 68. Bloom JD, Labthavikul ST, Otey CR, Arnold FH. 2006. Protein stability promotes evolvability. Proceedings of the National Academy of Sciences of the USA 103: 5869–74.
69. Bloom JD, Silberg JJ, Wilke CO, Drummond DA, Adami C, Arnold FH. 2005. Thermodynamic prediction of protein neutrality. Proceedings of the National Academy of Sciences of the USA 102: 606–11. 70. Bollobas B. 2001. Random graphs. Cambridge, UK: Cambridge University Press. 71. Bollobas B, Kohayakawa Y, Luczak T. 1992. The evolution of random subgraphs of the cube. Random Structures & Algorithms 3: 55–90. 72. Bollobas B, Kohayakawa Y, Luczak T. 1993. Connectivity properties of random subgraphs of the cube. 6th International Seminar on Random Graphs and Probabilistic Methods in Combinatorics and Computer Science, Random Graphs 93. Poznan, Poland. 73. Bollobas B, Kohayakawa Y, Luczak T. 1994. On the diameter and radius of random subgraphs of the cube. Random Structures & Algorithms 5: 627–48. 74. Bollobas B, Kohayakawa Y, Luczak T. 1994. On the evolution of random Boolean functions. In Extremal problems for finite sets, ed. P Frankl, Z Furedi, G Katona, D Miklos, pp. 137–56. Budapest, Hungary: Janos Bolyai Mathematical Society. 75. Boltyanskaya YV, Detkova EN, Shumskiii AN, Dulov LE, Pusheva MA. 2005. Osmoadaptation in representatives of haloalkaliphilic bacteria from soda lakes. Microbiology 74: 640–5. 76. Borenstein E, Ruppin E. 2006. Direct evolution of genetic robustness in microRNA. Proceedings of the National Academy of Sciences of the USA 103: 6593–8. 77. Borgs C, Chayes JT, Van der Hofstad R, Slade G, Spencer J. 2006. Random subgraphs of finite graphs: III. The phase transition for the n-cube. Combinatorica 26: 395–410. 78. Borisuk M, Tyson J. 1998. Bifurcation analysis of a model of mitotic control in frog eggs. Journal of Theoretical Biology 195: 69–85. 79. Bork P, Doolittle R. 1992. Proposed acquisition of an animal protein domain by bacteria. Proceedings of the National Academy of Sciences of the USA 89: 8990–4. 80. Bornberg-Bauer E. 1997. How are model protein structures distributed in sequence space? Biophysical Journal 73: 2393–403. 81. 
Bornberg-Bauer E. 2002. Randomness, structural uniqueness, modularity and neutral evolution in sequence space of model proteins. Zeitschrift fur Physikalische Chemie—International Journal of Research in Physical Chemistry & Chemical Physics 216: 139–54. 82. Bornberg-Bauer E, Chan H. 1999. Modeling evolutionary landscapes: Mutational stability, topology, and superfunnels in sequence space. Proceedings of the National Academy of Sciences of the USA 96: 10689–94.
83. Bornholdt S, Sneppen K. 2000. Robustness as an evolutionary principle. Proceedings of the Royal Society of London Series B—Biological Sciences 267: 2281–6. 84. Boucher I, Parrot M, Gaudreau H, Champagne CP, Vadeboncoeur C, Moineau S. 2002. Novel food-grade plasmid vector based on melibiose fermentation for the genetic engineering of Lactococcus lactis. Applied and Environmental Microbiology 68: 6152–61. 85. Brakefield PM. 2006. Evo-devo and constraints on selection. Trends in Ecology & Evolution 21: 362–8. 86. Brakefield PM, Gates J, Keys D, Kesbeke F, Wijngaarden PJ, et al. 1996. Development, plasticity and evolution of butterfly eyespot patterns. Nature 384: 236–42. 87. Branden C, Tooze J. 1999. Introduction to protein structure. New York: Garland. 88. Bridgham JT, Carroll SM, Thornton JW. 2006. Evolution of hormone-receptor complexity by molecular exploitation. Science 312: 97–101. 89. Bruce AEE, Oates AC, Prince VE, Ho RK. 2001. Additional hox clusters in the zebrafish: divergent expression patterns belie equivalent activities of duplicate hoxB5 genes. Evolution & Development 3: 127–44. 90. Buchler N, Goldstein R. 1999. Universal correlation between energy gap and foldability for the random energy model and lattice proteins. Journal of Chemical Physics 111: 6599–609. 91. Buchler NEG, Goldstein RA. 2000. Surveying determinants of protein structure designability across different energy models and amino-acid alphabets: A consensus. Journal of Chemical Physics 112: 2533–47. 92. Buckingham M, Meilhac S, Zaffran S. 2005. Building the mammalian heart from two sources of myocardial cells. Nature Reviews Genetics 6: 826–35. 93. Buljan M, Bateman A. 2009. The evolution of protein domain families. Conference on Protein Evolution Sequences, Structures and Systems. Cambridge, England. 94. Burger R. 2000. The mathematical theory of selection, recombination, and mutation. Chichester, UK: Wiley. 95. Bush RM, Bender CA, Subbarao K, Cox NJ, Fitch WM. 1999. 
Predicting the evolution of human influenza A. Science 286: 1921–5. 96. Bushman F. 2002. Lateral DNA transfer: mechanisms and consequences. Cold Spring Harbor, NY: Cold Spring Harbor University Press. 97. Bussemaker H, Thirumalai D, Bhattacharjee J. 1997. Thermodynamic stability of folded proteins against mutations. Physical Review Letters 79: 3530–3. 98. Callaerts P, Halder G, Gehring WJ. 1997. Pax-6 in development and evolution. Annual Review of Neuroscience 20: 483–532. 99. Camazine S, Deneubourg J-L, Franks NR, Sneyd J, Theraulaz G, Bonabeau E. 2001. Self-organization in
biological systems. Princeton, New Jersey: Princeton University Press. 100. Cambray G, Mazel D. 2008. Synonymous genes explore different evolutionary landscapes. PLoS Genetics 4: e1000256. 101. Carey M, Lin YS, Green MR, Ptashne M. 1990. A mechanism for synergistic activation of a mammalian gene by Gal4 derivatives. Nature 345: 361–4. 102. Carlborg O, Haley CS. 2004. Epistasis: Too often neglected in complex trait studies? Nature Reviews Genetics 5: 618–25. 103. Carroll SB. 2005. Evolution at two levels: On genes and form. PLoS Biology 3: 1159–66. 104. Carroll SB, Grenier JK, Weatherbee SD. 2001. From DNA to diversity. Molecular genetics and the evolution of animal design. Malden, MA: Blackwell. 105. Carson HL, Lande R. 1984. Inheritance of a secondary sexual character in Drosophila silvestris. Proceedings of the National Academy of Sciences of the USA—Biological Sciences 81: 6904–7. 106. Causier B, Castillo R, Zhou JL, Ingram R, Xue YB, et al. 2005. Evolution in action: Following function in duplicated floral homeotic genes. Current Biology 15: 1508–12. 107. Chan H, Bornberg-Bauer E. 2002. Perspectives on protein evolution from simple exact models. Applied Bioinformatics 1: 121–44. 108. Chan H, Dill K. 1991. Sequence space soup of proteins and copolymers. Journal of Chemical Physics 95: 3775–87. 109. Chan H, Dill K. 1992. Lattice conformational enumeration approaches to protein folding. Abstracts of Papers of the American Chemical Society 204: 27. 110. Chan H, Dill K. 1996. Comparing folding codes for proteins and polymers. Proteins—Structure Function and Genetics 24: 335–44. 111. Chang C, Chen T, Cox B, Dawes G, Stemmer W, et al. 1999. Evolution of a cytokine using DNA family shuffling. Nature Biotechnology 17: 793–7. 112. Chaput M CV, Portetelle D, Cludts I, Cravador A, Aburny A, et al. 1988. The neurotrophic factor neuroleukin is 90 percent homologous with phosphohexose isomerase. Nature 332: 454–5. 113. Charlesworth B, Morgan MT, Charlesworth D. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134: 1289–303. 114. Charlesworth J, Eyre-Walker A. 2006. The rate of adaptive evolution in enteric bacteria. Molecular Biology and Evolution 23: 1348–56. 115. Chen LB, DeVries AL, Cheng CHC. 1997. Convergent evolution of antifreeze glycoproteins in Antarctic notothenioid fish and Arctic cod. Proceedings of the National Academy of Sciences of the USA 94: 3817–22.
116. Chen RD, Greer A, Dean AM. 1995. A highly active decarboxylating dehydrogenase with rationally inverted coenzyme specificity. Proceedings of the National Academy of Sciences of the USA 92: 11666–70. 117. Chen WN, Wang Y, Wang XW, Peng CL. 2008. A new placement approach to minimizing FPGA reconfiguration data. International Conference on Embedded Software and Systems, Chengdu, China. 118. Cheng CC-H. 1998. Evolution of the diverse antifreeze proteins. Current Opinion in Genetics & Development 8: 715–20. 119. Cheverud JM. 1984. Quantitative genetics and developmental constraints on evolution by selection. Journal of Theoretical Biology 110: 155–71. 120. Chi N, Epstein JA. 2002. Getting your Pax straight: Pax proteins in development and disease. Trends in Genetics 18: 41–7. 121. Chisaka O, Capecchi MR. 1991. Regionally restricted developmental defects resulting from targeted disruptions of the mouse homeobox genes Hox-1.5. Nature 350: 473–9. 122. Choi IG, Kim SH. 2007. Global extent of horizontal gene transfer. Proceedings of the National Academy of Sciences of the USA 104: 4489–94. 123. Ciliberti S, Martin OC, Wagner A. 2007. Circuit topology and the evolution of robustness in complex regulatory gene networks. PLoS Computational Biology 3(2): e15. 124. Ciliberti S, Martin OC, Wagner A. 2007. Innovation and robustness in complex regulatory gene networks. Proceedings of the National Academy of Sciences of the USA 104: 13591–6. 125. Clark AG, Glanowski S, Nielsen R, Thomas PD, Kejariwal A, et al. 2003. Inferring nonneutral evolution from human-chimp-mouse orthologous gene trios. Science 302: 1960–3. 126. Cline RE, Hill RH, Phillips DL, Needham LL. 1989. Pentachlorophenol measurements in body-fluids of people in log homes and workplaces. Archives of Environmental Contamination and Toxicology 18: 475–81. 127. Clodong S, Duhring U, Kronk L, Wilde A, Axmann I, et al. 2007. Functioning and robustness of a bacterial circadian clock. Molecular Systems Biology 3: 90. 128. 
Cobey S, Koelle K. 2008. Capturing escape in infectious disease dynamics. Trends in Ecology & Evolution 23: 572–7. 129. Coen ES, Meyerowitz EM. 1991. The war of the whorls: Genetic interactions controlling flower development. Nature 353: 31–7. 130. Cohen SB, Cech TR. 1997. Dynamics of thermal motions within a large catalytic RNA investigated by
cross-linking with thiol-disulfide interchange. Journal of the American Chemical Society 119: 6259–68. 131. Cohn MJ, Patel K, Krumlauf R, Wilkinson DG, Clarke JDW, Tickle C. 1997. Hox9 genes and vertebrate limb specification. Nature 387: 97–101. 132. Collins M. 2005. Finding needles in haystacks is harder with neutrality. Genetic and Evolutionary Computation Conference. Washington, DC. 133. Committee for Nomenclature of the International Union of Biochemistry and Molecular Biology, Webb EC, ed. 1992. Enzyme nomenclature 1992. Academic Press: San Diego, CA. 134. Conant GC, Wagner A. 2004. Duplicate genes and robustness to transient gene knockouts in Caenorhabditis elegans. Proceedings of the Royal Society of London Series B 271: 89–96. 135. Conant GC, Wolfe KH. 2008. Turning a hobby into a job: How duplicated genes find new functions. Nature Reviews Genetics 9: 938–50. 136. Condie BG, Capecchi MR. 1993. Mice homozygous for a targeted disruption of Hoxd-3 (Hox-4.1) exhibit anterior transformations of the first and second cervical vertebrae, the atlas and the axis. Development 119: 579–95. 137. Cook SA, Johnson MP. 1968. Adaptation to heterogeneous environments. I. Variation in heterophylly in Ranunculus flammula L. Evolution 22: 496–516. 138. Cooper TF, Morby AP, Gunn A, Schneider D. 2006. Effect of random and hub gene disruptions on environmental and mutational robustness in Escherichia coli. BMC Genomics 7: 237. 139. Cooper TG. 2002. Transmitting the signal of excess nitrogen in Saccharomyces cerevisiae from the Tor proteins to the GATA factors: connecting the dots. FEMS Microbiology Reviews 26: 223–38. 140. Copley RR, Bork P. 2000. Homology among (βα)8 barrels: Implications for the evolution of metabolic pathways. Journal of Molecular Biology 303: 627–40. 141. Copley SD. 2000. Evolution of a metabolic pathway for degradation of a toxic xenobiotic: the patchwork approach. Trends in Biochemical Sciences 25: 261–5. 142. Cordell HJ. 2002. Epistasis: What it means, what it doesn’t mean, and statistical methods to detect it in humans. Human Molecular Genetics 11: 2463–8. 143. Cordes M, Burton R, Walsh N, McKnight C, Sauer R. 2000. An evolutionary bridge to a new protein fold. Nature Structural Biology 7: 1129–32. 144. Cormen TH, Leiserson CE, Rivest RL, Stein C. 2005. Introduction to algorithms. Cambridge, MA: MIT Press. 145. Coulson A, Moult J. 2002. A unifold, mesofold, and superfold model of protein fold use. Proteins—Structure Function and Genetics 46: 61–71.
146. Covert MW, Knight EM, Reed JL, Herrgard MJ, Palsson BO. 2004. Integrating high-throughput and computational data elucidates bacterial networks. Nature 429: 92–6. 147. Cowperthwaite M, Ancel Meyers L. 2007. How mutational networks shape evolution: lessons from RNA models. Annual Review of Ecology, Evolution, and Systematics 38: 203–30. 148. Cowperthwaite MC, Bull JJ, Meyers LA. 2006. From bad to good: Fitness reversals and the ascent of deleterious mutations. PLoS Computational Biology 2: 1292–300. 149. Cowperthwaite MC, Economo EP, Harcombe WR, Miller EL, Meyers LA. 2008. The ascent of the abundant: How mutational networks constrain evolution. PLoS Computational Biology 4(7): e1000110. 150. Cox RA. 2004. Quantitative relationships for specific growth rates and macromolecular compositions of Mycobacterium tuberculosis, Streptomyces coelicolor A3(2) and Escherichia coli B/r: an integrative theoretical approach. Microbiology 150: 1547–58. 151. Coyne JA. 2009. Freaks of Nature: What anomalies tell us about development and evolution. Nature 457: 382–3. 152. Crameri A, Dawes G, Rodriguez E, Silver S, Stemmer W. 1997. Molecular evolution of an arsenate detoxification pathway by DNA shuffling. Nature Biotechnology 15: 436–8. 153. Crameri A, Raillard S, Bermudez E, Stemmer W. 1998. DNA shuffling of a family of genes from diverse species accelerates directed evolution. Nature 391: 288–91. 154. Cripps RM, Olson EN. 2002. Control of cardiac development by an evolutionarily conserved transcriptional network. Developmental Biology 246: 14–28. 155. Crombach A, Hogeweg P. 2008. Evolution of evolvability in gene regulatory networks. PLoS Computational Biology 4(7): e1000112. 156. Cui Y, Wong W, Bornberg-Bauer E, Chan H. 2002. Recombinatoric exploration of novel folded structures: A heteropolymer-based model of protein evolutionary landscapes. Proceedings of the National Academy of Sciences of the USA 99: 809–14. 157. Curtis EA, Bartel DP. 2005. 
New catalytic structures from an existing ribozyme. Nature Structural & Molecular Biology 12: 994–1000. 158. Cutter AD, Payseur BA. 2003. Selection at linked sites in the partial selfer Caenorhabditis elegans. Molecular Biology and Evolution 20: 665–73. 159. Dalby A, Dauter Z, Littlechild JA. 1999. Crystal structure of human muscle aldolase complexed with fructose 1,6-bisphosphate: Mechanistic implications. Protein Science 8: 291–7. 160. Dantas G, Sommer MOA, Oluwasegun RD, Church GM. 2008. Bacteria subsisting on antibiotics. Science 320: 100–3. 161. Darlington TK, Wager-Smith K, Ceriani MF, Staknis D, Gekakis N, et al. 1998. Closing the circadian loop: CLOCK-induced transcription of its own inhibitors per and tim. Science 280: 1599–603. 162. Darwin C. 1859. On the origin of species by means of natural selection: The preservation of favored races in the struggle for life. London, England: Penguin Group. 163. Daubin V, Ochman H. 2004. Quartet mapping and the extent of lateral transfer in bacterial genomes. Molecular Biology and Evolution 21: 86–9. 164. Davidson AR, Sauer RT. 1994. Folded proteins occur frequently in libraries of random amino-acid-sequences. Proceedings of the National Academy of Sciences of the USA 91: 2146–50. 165. Davidson EH, Erwin DH. 2006. Gene regulatory networks and the evolution of animal body plans. Science 311: 796–800. 166. Davies PL, Sykes BD. 1997. Antifreeze proteins. Current Opinion in Structural Biology 7: 828–34. 167. Davis B, Whitlock MC. 2004. Genetic load. In Encyclopedia of life sciences (http://www.els.net): John Wiley & Sons Ltd, Chichester. 168. Dawkins R. 1996. Climbing mount improbable. New York: Norton. 169. Dayton E, Konings D, Powell D, Shapiro B, Butini L, et al. 1992. Extensive sequence-specific information throughout the CAR RRE, the target sequence of the human-immunodeficiency-virus type-1 rev protein. Journal of Virology 66: 1139–51. 170. de Vries H. 1905. Species and varieties, their origin by mutation. Chicago, IL: The Open Court Publishing Company. 171. Dean AM, Thornton JW. 2007. Mechanistic approaches to the study of evolution: the functional synthesis. Nature Reviews Genetics 8: 675–88. 172. Dechesne A, Or D, Smets BF. 2008. Limited diffusive fluxes of substrate facilitate coexistence of two competing bacterial strains. FEMS Microbiology Ecology 64: 1–8. 173. Dekel E, Alon U. 2005. Optimality and evolutionary tuning of the expression level of a protein. Nature 436: 588–92. 174. Denault DL, Loros JJ, Dunlap JC. 2001. WC-2 mediates WC-1-FRQ interaction within the PAS protein-linked circadian feedback loop of Neurospora. EMBO Journal 20: 109–17.
175. DePristo MA, Hartl DL, Weinreich DM. 2007. Mutational reversions during adaptive protein evolution. Molecular Biology and Evolution 24: 1608–10. 176. DePristo MA, Weinreich DM, Hartl DL. 2005. Missense meanderings in sequence space: A biophysical view of protein evolution. Nature Reviews Genetics 6: 678–87. 177. Dermitzakis E, Clark A. 2002. Evolution of transcription factor binding sites in mammalian gene regulatory regions: Conservation and turnover. Molecular Biology and Evolution 19: 1114–21. 178. DeSantis TZ, Hugenholtz P, Keller K, Brodie EL, Larsen N, et al. 2006. NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes. Nucleic Acids Research 34: W394-W9. 179. Detkova EN, Boltyanskaya YV. 2007. Osmoadaptation of haloalkaliphilic bacteria: Role of osmoregulators and their possible practical application. Microbiology 76: 511–22. 180. Detkova EN, Pusheva MA. 2006. Energy metabolism in halophilic and alkaliphilic acetogenic bacteria. Microbiology 75: 1–11. 181. DeWachter R, Chen MW, Vandenberghe A. 1984. Equilibria in 5-S ribosomal RNA secondary structure—bulges and interior loops in 5-S RNA secondary structure may serve as articulations for a flexible molecule. European Journal of Biochemistry 143: 175–82. 182. DeWitt TJ, Sih A, Wilson DS. 1998. Costs and limits of phenotypic plasticity. Trends in Ecology & Evolution 13: 77–81. 183. DiComo CJ, Arndt KT. 1996. Nutrients, via the Tor proteins, stimulate the association of Tap42 with type 2A phosphatases. Genes & Development 10: 1904–16. 184. Diehl M, Doster W, Petry W, Schober H. 1997. Water-coupled low-frequency modes of myoglobin and lysozyme observed by inelastic neutron scattering. Biophysical Journal 73: 2726–32. 185. Dill K. 1995. Simple lattice models of protein folding. Abstracts of Papers of the American Chemical Society 209: 229-Poly. 186. Dill K, Bromberg S, Yue K, Fiebig K, Yee D, et al. 1995. Principles of protein-folding—a perspective from simple exact models. 
Protein Science 4: 561–602. 187. Ditta G, Pinyopich A, Robles P, Pelaz S, Yanofsky MF. 2004. The SEP4 gene of Arabidopsis thaliana functions in floral organ and meristem identity. Current Biology 14: 1935–40. 188. Doolittle R. 1995. The origins and evolution of eukaryotic proteins. Philosophical Transactions of the Royal Society of London Series B—Biological Sciences 349: 235–40.
189. Doolittle RF. 1986. Of URFs and ORFs: A primer on how to analyze derived amino acid sequences. Mill Valley, CA: University Science Books. 190. Dorus S, Vallender EJ, Evans PD, Anderson JR, Gilbert SL, et al. 2004. Accelerated evolution of nervous system genes in the origin of Homo sapiens. Cell 119: 1027–40. 191. Doudna J. 2000. Structural genomics of RNA. Nature Structural Biology 7: 954–6. 192. Doyle FJ, Gunawan R, Bagheri N, Mirsky H, To TL. 2006. Circadian rhythm: A natural, robust, multi-scale control system. 7th International Conference on Chemical Process Control (CPC 7), Lake Louise, Canada. 193. Draghi J, Parsons T, Wagner GP, Plotkin J. 2010. Mutational robustness can facilitate adaptation. Nature 463: 353–5. 194. Draghi J, Wagner GP. 2009. The evolutionary dynamics of evolvability in a gene network model. Journal of Evolutionary Biology 22: 599–611. 195. Droux M. 2004. Sulfur assimilation and the role of sulfur in plant metabolism: a survey. Photosynthesis Research 79: 331–48. 196. Drummond DA, Silberg JJ, Meyer MM, Wilke CO, Arnold FH. 2005. On the conservative nature of intragenic recombination. Proceedings of the National Academy of Sciences of the USA 102: 5380–5. 197. Dun RB, Fraser AS. 1959. Selection for an invariant character, “vibrissae number,” in the house mouse. Australian Journal of Biological Sciences 12: 506–23. 198. Dunlap JC. 1999. Molecular bases for circadian clocks. Cell 96: 271–90. 199. Dunlap JC, Loros JJ. 2006. How fungi keep time: circadian system in Neurospora and other fungi. Current Opinion in Microbiology 9: 579–87. 200. Durai S, Mani M, Kandavelou K, Wu J, Porteus MH, Chandrasegaran S. 2005. Zinc finger nucleases: custom-designed molecular scissors for genome engineering of plant and mammalian cells. Nucleic Acids Research 33: 5978–90. 201. Dutta S, Burkhardt K, Young J, Swaminathan G, Matsuura T, et al. 2009. Data deposition and annotation at the worldwide protein data bank. Molecular Biotechnology 42: 1–13. 202. 
Ebenhoh O, Handorf T. 2009. Functional classification of genome-scale metabolic networks. EURASIP Journal on Bioinformatics and Systems Biology 2009, Article ID 570456; doi:10.1155/2009/570456. 203. Ebenhoh O, Heinrich R. 2003. Stoichiometric design of metabolic networks: Multifunctionality, clusters, optimization, weak and strong robustness. Bulletin of Mathematical Biology 65: 323–57.
204. Eble GJ. 1999. On the dual nature of chance in evolutionary biology and paleobiology. Paleobiology 25: 75–87. 205. Ebner M, Shackleton M, Shipman R. 2002. How neutral networks influence evolvability. Complexity 19: 19–33. 206. Edvardsen RB, Seo HC, Jensen MF, Mialon A, Mikhaleva J, et al. 2005. Remodelling of the homeobox gene complement in the tunicate Oikopleura dioica. Current Biology 15: R12–R3. 207. Edwards JS, Palsson BO. 1999. Systems properties of the Haemophilus influenzae Rd metabolic genotype. Journal of Biological Chemistry 274: 17410–6. 208. Edwards JS, Palsson BO. 2000. The Escherichia coli MG1655 in silico metabolic genotype: Its definition, characteristics, and capabilities. Proceedings of the National Academy of Sciences of the USA 97: 5528–33. 209. Eguchi K, Yoda M, Terada TP, Sasai M. 2008. Mechanism of robust circadian oscillation of KaiC phosphorylation in vitro. Biophysical Journal 95: 1773–84. 210. Eigen M, Schuster P. 1979. The hypercycle: A principle of natural self-organization. Berlin: Springer. 211. Eldar A, Dorfman R, Weiss D, Ashe H, Shilo B, Barkai N. 2002. Robustness of the BMP morphogen gradient in Drosophila embryonic patterning. Nature 419: 304–8. 212. Eldar A, Elowitz MB. 2010. Functional roles for noise in genetic circuits. Nature 467: 167–73. 213. Eldredge G, Eldredge N. 2008. Editorial. Evolution: Education and outreach (Special Issue: The evolution of eyes) 1: 1. 214. Elena SF, Cooper VS, Lenski RE. 1996. Punctuated evolution caused by selection of rare beneficial mutations. Science 272: 1802–4. 215. Elena SF, Sanjuan R. 2008. The effect of genetic robustness on evolvability in digital organisms. BMC Evolutionary Biology 8: 284. 216. Elowitz M, Levine A, Siggia E, Swain P. 2002. Stochastic gene expression in a single cell. Science 297: 1183–6. 217. Emberly E, Wingreen NS. 2006. Hourglass model for a protein-based circadian oscillator. Physical Review Letters 96: 038303. 218. Emberly EG, Wingreen NS, Tang C. 2002. 
Designability of alpha-helical proteins. Proceedings of the National Academy of Sciences of the USA 99: 11163–8. 219. Endress PK. 2006. Angiosperm floral evolution: Morphological developmental framework. Advances in Botanical Research: Incorporating Advances in Plant Pathology 44: 1–61.
220. England J, Shakhnovich B, Shakhnovich E. 2003. Natural selection of more designable folds: A mechanism for thermophilic adaptation. Proceedings of the National Academy of Sciences of the USA 100: 8727–31. 221. England JL, Shakhnovich EI. 2003. Structural determinant of protein designability. Physical Review Letters 90: 218101. 222. Espinosa-Soto C, Martin OCM, Wagner A. 2011. Phenotypic robustness can increase phenotypic variability after non-genetic perturbations in gene regulatory circuits. Journal of Evolutionary Biology (in press). 223. Espinosa-Soto C, Padilla-Longoria P, Alvarez-Buylla ER. 2004. A gene regulatory network model for cell-fate determination during Arabidopsis thaliana flower development that is robust and recovers experimental gene expression profiles. Plant Cell 16: 2923–39. 224. Espinosa-Soto C, Wagner A. 2010. Specialization can drive the evolution of modularity. PLoS Computational Biology 6: e1000719. 225. Evangelisti A, Wagner A. 2004. Molecular evolution in the transcriptional regulation network of yeast. Journal of Experimental Zoology/Molecular Development and Evolution 302B: 392–411. 226. Eyre-Walker A, Keightley PD. 2007. The distribution of fitness effects of new mutations. Nature Reviews Genetics 8: 610–8. 227. Eyre-Walker A, Keightley PD, Smith NGC, Gaffney D. 2002. Quantifying the slightly deleterious mutation model of molecular evolution. Molecular Biology and Evolution 19: 2142–9. 228. Faik P, Walker J, Redmill A, Morgan M. 1988. Mouse glucose-6-phosphate isomerase and neuroleukin have identical 3’ sequences. Nature 332: 455–6. 229. Falkowski P, Scholes RJ, Boyle E, Canadell J, Canfield D, et al. 2000. The global carbon cycle: A test of our knowledge of earth as a system. Science 290: 291–6. 230. Fay JC, Wyckoff GJ, Wu CI. 2002. Testing the neutral theory of molecular evolution with genomic data from Drosophila. Nature 415: 1024–6. 231. Feist AM, Henry CS, Reed JL, Krummenacker M, Joyce AR, et al. 2007. 
A genome-scale metabolic reconstruction for Escherichia coli K-12 MG1655 that accounts for 1260 ORFs and thermodynamic information. Molecular Systems Biology 3: 121. 232. Feist AM, Herrgard MJ, Thiele I, Reed JL, Palsson BO. 2009. Reconstruction of biochemical networks in microorganisms. Nature Reviews Microbiology 7: 129–43. 233. Feist AM, Palsson BO. 2008. The growing scope of applications of genome-scale metabolic reconstructions using Escherichia coli. Nature Biotechnology 26: 659–67.
234. Felix MA. 2007. Cryptic quantitative evolution of the vulva intercellular signaling network in Caenorhabditis. Current Biology 17: 103–14. 235. Fell D. 1997. Understanding the control of metabolism. Miami: Portland Press. 236. Feller W. 1968. An introduction to probability theory and its applications. New York: Wiley. 237. Felsenstein J. 2004. Inferring phylogenies. Sunderland, Massachusetts: Sinauer Associates. 238. Ferrada E, Wagner A. 2008. Protein robustness promotes evolutionary innovations on large evolutionary time scales. Proceedings of the Royal Society of London Series B—Biological Sciences 275: 1595–602. 239. Ferrada E, Wagner A. 2010. Evolutionary innovation and the organization of protein functions in genotype space. PLoS ONE 5(11): e14172. 240. Ferrell JE. 2002. Self-perpetuating states in signal transduction: positive feedback, double-negative feedback and bistability. Current Opinion in Cell Biology 14: 140–8. 241. Finkelstein AV. 1994. Implications of the random characteristics of protein sequences for their 3-dimensional structure. Current Opinion in Structural Biology 4: 422–8. 242. Firulli AB, McFadden DG, Lin Q, Srivastava D, Olson EN. 1998. Heart and extra-embryonic mesodermal defects in mouse embryos lacking the bHLH transcription factor Hand1. Nature Genetics 18: 266–70. 243. Fischer E, Sauer U. 2005. Large-scale in vivo flux analysis shows rigidity and suboptimal performance of Bacillus subtilis metabolism. Nature Genetics 37: 636–40. 244. Fitch WM, Bush RM, Bender CA, Cox NJ. 1997. Long-term trends in the evolution of H(3) HA1 human influenza type A. Proceedings of the National Academy of Sciences of the USA 94: 7712–8. 245. Flamm C, Fontana W, Hofacker I, Schuster P. 2000. RNA folding at elementary step resolution. RNA 6: 325–38. 246. Fletcher GL, Hew CL, Davies PL. 2001. Antifreeze proteins of teleost fishes. Annual Review of Physiology 63: 359–90. 247. Fong SS, Nanchen A, Palsson BO, Sauer U. 2006. 
Latent pathway activation and increased pathway capacity enable Escherichia coli adaptation to loss of key metabolic enzymes. Journal of Biological Chemistry 281: 8024–33. 248. Fong SS, Palsson BO. 2004. Metabolic gene-deletion strains of Escherichia coli evolve to computationally predicted growth phenotypes. Nature Genetics 36: 1056–8. 249. Fontana W. 2002. Modelling “evo-devo” with RNA. Bioessays 24: 1164–77.
250. Fontana W, Konings D, Stadler P, Schuster P. 1993. Statistics of RNA secondary structures. Biopolymers 33: 1389–404. 251. Fontana W, Schuster P. 1998. Continuity in evolution: On the nature of transitions. Science 280: 1451–5. 252. Fontana W, Schuster P. 1998. Shaping space: The possible and the attainable in RNA genotype–phenotype mapping. Journal of Theoretical Biology 194: 491–515. 253. Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J. 1999. Preservation of duplicate genes by complementary degenerative mutations. Genetics 151: 1531–45. 254. Forster J, Famili I, Fu P, Palsson B, Nielsen J. 2003. Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network. Genome Research 13: 244–53. 255. Fox GE, Woese CR. 1975. 5S RNA secondary structure. Nature 256: 505–7. 256. Frank DN, Pace NR. 1998. Ribonuclease P: Unity and diversity in a tRNA processing ribozyme. Annual Review of Biochemistry 67: 153–80. 257. Frigaard NU, Dahl C. 2009. Sulfur metabolism in phototrophic sulfur bacteria. In Advances in microbial physiology, vol 54, ed. RL Poole, pp. 103–200. London: Academic Press. 258. Fujii N. 2002. D-amino acids in living higher organisms. Origins of Life and Evolution of the Biosphere 32: 103–27. 259. Futuyma DJ. 1998. Evolutionary biology. Sunderland, Massachusetts: Sinauer. 260. Gardner A, Kalinka AT. 2006. Recombination and the evolution of mutational robustness. Journal of Theoretical Biology 241: 707–15. 261. Gasch AP, Moses AM, Chiang DY, Fraser HB, Berardini M, Eisen MB. 2004. Conservation and evolution of cis-regulatory systems in ascomycete fungi. PLoS Biology 2: 2202–19. 262. Gavrilets S. 1997. Evolution and speciation on holey adaptive landscapes. Trends in Ecology & Evolution 12: 307–12. 263. Gavrilets S, Gravner J. 1997. Percolation on the fitness hypercube and the evolution of reproductive isolation. Journal of Theoretical Biology 184: 51–64. 264. Gerhart J, Kirschner M. 1998. Cells, embryos, and evolution. 
Boston: Blackwell. 265. Giacomelli MG, Hancock AS, Masel J. 2007. The conversion of 3’ UTRs into coding regions. Molecular Biology and Evolution 24: 457–64. 266. Giaever G, Chu AM, Ni L, Connelly C, Riles L, et al. 2002. Functional profiling of the Saccharomyces cerevisiae genome. Nature 418: 387–91.
267. Gibson G, Reed LK. 2008. Cryptic genetic variation. Current Biology 18: R989–R90. 268. Gilbert SF. 1997. Developmental biology. Sunderland: Sinauer. 269. Gilbert W. 1978. Why genes in pieces? Nature 271: 501. 270. Gillespie JH. 1991. The causes of molecular evolution. New York: Oxford University Press. 271. Gillespie JH. 2000. Genetic drift in an infinite population: The pseudohitchhiking model. Genetics 155: 909–919. 272. Ginsberg AM, King BO, Roeder RG. 1984. Xenopus 5S gene-transcription factor TFIIIA-characterization of a cDNA clone and measurement of RNA levels throughout development. Cell 39: 479–89. 273. Giurumescu CA, Sternberg PW, Asthagiri AR. 2006. Intercellular coupling amplifies fate segregation during Caenorhabditis elegans vulval development. Proceedings of the National Academy of Sciences of the USA 103: 1331–6. 274. Giurumescu CA, Sternberg PW, Asthagiri AR. 2009. Predicting phenotypic diversity and the underlying quantitative molecular transitions. PLoS Computational Biology 5(4): e1000354. 275. Givnish TJ. 1987. Comparative studies of leaf form—assessing the relative roles of selective pressures and phylogenetic constraints. New Phytologist 106: 131–60. 276. Goldberg DE. 1989. Genetic algorithms in search, optimization, and machine learning. Boston, MA: Kluwer Academic Publishers. 277. Golding GB, Dean AM. 1998. The structural basis of molecular adaptation. Molecular Biology and Evolution 15: 355–69. 278. Goldman N, Yang ZH. 1994. Codon-based model of nucleotide substitution for protein-coding DNA sequences. Molecular Biology and Evolution 11: 725–36. 279. Goodman M, Pedwaydon J, Czelusniak J, Suzuki T, Gotoh T, et al. 1988. An evolutionary tree for invertebrate globin sequences. Journal of Molecular Evolution 27: 236–49. 280. Goodwin B. 1965. Oscillatory behavior in enzymatic control processes. Advances in enzyme regulation 3: 425–38. 281. Gould S, Vrba E. 1982. Exaptation—a missing term in the science of form. Paleobiology 8: 4–15. 282. 
Gould SJ, Lewontin RC. 1979. Spandrels of San Marco and the Panglossian paradigm: a critique of the adaptationist program. Proceedings of the Royal Society of London Series B—Biological Sciences 205: 581–98.
283. Govindarajan S, Goldstein R. 1996. Why are some protein structures so common? Proceedings of the National Academy of Sciences of the USA 93: 3341–5. 284. Govindarajan S, Goldstein RA. 1997. Evolution of model proteins on a foldability landscape. Proteins— Structure Function and Genetics 29: 461–6. 285. Govindarajan S, Goldstein RA. 1997. The foldability landscape of model proteins. Biopolymers 42: 427–38. 286. Govindarajan S, Recabarren R, Goldstein R. 1999. Estimating the total number of protein folds. Proteins—Structure Function and Genetics 35: 408–14. 287. Grant PR. 2003. Evolution in Darwin’s finches: A review of a study on Isla Daphne Major in the Galapagos Archipelago. 96th annual Meeting of the German Zoological Society, Berlin, Germany. 288. Grant PR. 1986. Ecology and evolution of Darwin’s finches. Princeton, NJ: Princeton University Press. 289. Greene LH, Lewis TE, Addou S, Cuff A, Dallman T, et al. 2007. The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution. Nucleic Acids Research 35: D291–D7. 290. Greenwald IS, Sternberg PW, Horvitz HR. 1983. The lin-12 locus specifies cell fates in Caenorhabditis elegans. Cell 34: 435–44. 291. Greenwood GW, Tyrrell AM. 2006. Introduction to evolvable hardware: a practical guide for designing self-adaptive systems. Hoboken, NJ: Wiley–IEEE Press. 292. Greer JM, Puetz J, Thomas KR, Capecchi MR. 2000. Maintenance of functional equivalence during paralogous Hox gene evolution. Nature 403: 661–5. 293. Gronemeyer H, Gustafsson JA, Laudet V. 2004. Principles for modulation of the nuclear receptor superfamily. Nature Reviews Drug Discovery 3: 950–64. 294. Gruner W, Giegerich R, Strothmann D, Reidys C, Weber J, et al. 1996. Analysis of RNA sequence structure maps by exhaustive enumeration.2. Structures of neutral networks and shape space covering. Monatshefte fur Chemie 127: 375–89. 295. Gu Z, Steinmetz L, Gu X, Scharfe C, Davis R, Li W. 2003. 
Role of duplicate genes in genetic robustness against null mutations. Nature 421: 63–6. 296. Gu ZL, Cavalcanti A, Chen FC, Bouman P, Li WH. 2002. Extent of gene duplication in the genomes of Drosophila, nematode, and yeast. Molecular Biology and Evolution 19: 256–62. 297. Guindon S, Rodrigo AG, Dyer KA, Huelsenbeck JP. 2004. Modeling the site-specific variation of selection patterns along lineages. Proceedings of the National Academy of Sciences of the USA 101: 12957–62.
298. Gultyaev A, Vanbatenburg F, Pleij C. 1995. The computer simulation of RNA folding pathways using a genetic algorithm. Journal of Molecular Biology 250: 37–51. 299. Gurevitch J. 1988. Variation in leaf dissection and leaf energy budgets among populations of Achillea from an altitudinal gradient. American Journal of Botany 75: 1298–306. 300. Gurevitch J. 1992. Sources of variation in leaf shape among 2 populations of Achillea lanulosa. Genetics 130: 385–94. 301. Hafner M, Koeppl H, Hasler M, Wagner A. 2009. “Glocal” robustness in model discrimination for circadian oscillators. PLoS Computational Biology 5: e1000534. 302. Hahn MW. 2008. Toward a selection theory of molecular evolution. Evolution 62: 255–65. 303. Haken H. 1983. Synergetics, an introduction: Nonequilibrium phase transitions and self-organization in physics, chemistry, and biology. New York: Springer. 304. Harary F. 1969. Graph theory. Reading, Massachusetts: Addison-Wesley. 305. Hardison RC. 1996. A brief history of hemoglobins: Plant, animal, protist, and bacteria. Proceedings of the National Academy of Sciences of the USA 93: 5675–9. 306. Harmer SL. 2009. The circadian system in higher plants. Annual Review of Plant Biology 60: 357–77. 307. Harmer SL, Panda S, Kay SA. 2001. Molecular bases of circadian rhythms. Annual Review of Cell and Developmental Biology 17: 215–53. 308. Harris H. 1966. Enzyme polymorphism in man. Proceedings of the Royal Society of London Series B—Biological Sciences 164: 298–310. 309. Harrison R, Papp B, Pal C, Oliver SG, Delneri D. 2007. Plasticity of genetic interactions in metabolic networks of yeast. Proceedings of the National Academy of Sciences of the USA 104: 2307–12. 310. Hartl D, Clark A. 2007. Principles of population genetics. Sunderland, MA: Sinauer Associates. 311. Hartl DL, Dykhuizen DE. 1981. Potential for selection among nearly neutral allozymes of 6-phosphogluconate dehydrogenase in Escherichia coli. 
Proceedings of the National Academy of Sciences of the USA 78: 6344–8. 312. Hartmann M, Haddow P, Eskelund F. 2002. Evolving robust digital designs. NASA/DOD Conference on Evolvable Hardware, Alexandria, VA. 313. Hartmann M, Haddow PC. 2004. Evolution of fault-tolerant and noise-robust digital designs. IEE Proceedings—Computers and Digital Techniques 151: 287–94.
314. Hartmann M, Haddow PC, Lehre PK. 2005. The genotypic complexity of evolved fault-tolerant and noise-robust circuits. 6th International Workshop on Information Processing in Cells and Tissues, York, England. 315. Hatta K, Kimmel CB, Ho RK, Walker C. 1991. The cyclops mutation blocks specification of the floor plate of the zebrafish central-nervous-system. Nature 350: 339–41. 316. Hay A, Tsiantis M. 2006. The genetic basis for differences in leaf form between Arabidopsis thaliana and its wild relative Cardamine hirsuta. Nature Genetics 38: 942–7. 317. Hayden E, Ferrada E, Wagner A. 2011. Cryptic genetic variation promotes rapid evolutionary adaptation in an RNA enzyme (submitted). 318. He L, Kierzek R, Santalucia J, Walter A, Turner D. 1991. Nearest-neighbor parameters for G-U mismatches—5’GU3’/3’UG5’ is destabilizing in the contexts CGUG.GUGC, UGUA.AUGU, and AGUU.UUGU but stabilizing in GGUC.CUGG. Biochemistry 30: 11124–32. 319. Heil M, Greiner S, Meimberg H, Kruger R, Noyer JL, et al. 2004. Evolutionary change from induced to constitutive expression of an indirect plant resistance. Nature 430: 205–8. 320. Heitman J, Movva NR, Hall MN. 1991. Targets for cell-cycle arrest by the immunosuppressant rapamycin in yeast. Science 253: 905–9. 321. Hellmann I, Ebersberger I, Ptak SE, Pääbo S, Przeworski M. 2003. A neutral explanation for the correlation of diversity with recombination rates in humans. American Journal of Human Genetics 72: 1527–35. 322. Hellmann I, Prufer K, Ji HK, Zody MC, Pääbo S, Ptak SE. 2005. Why do human diversity levels vary at a megabase scale? Genome Research 15: 1222–31. 323. Hernandez C, Flores R. 1992. Plus and minus RNAs of peach latent mosaic viroid self-cleave in vitro via hammerhead structures. Proceedings of the National Academy of Sciences of the USA 89: 3711–5. 324. Herrgard MJ, Swainston N, Dobson P, Dunn WB, Arga KY, et al. 2008. 
A consensus yeast metabolic network reconstruction obtained from a community approach to systems biology. Nature Biotechnology 26: 1155–60. 325. Hertz J, Krogh A, Palmer RG. 1991. Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley. 326. Higgs P, Morgan S. 1995. Thermodynamics of RNA folding. When is an RNA molecule in equilibrium? Advances in Artificial Life 929: 852–61.
327. Hill RJ, Sternberg PW. 1992. The gene lin-3 encodes an inductive signal for vulvar development in C. elegans. Nature 358: 470–6. 328. Hittinger CT, Carroll SB. 2007. Gene duplication and the adaptive evolution of a classic genetic switch. Nature 449: 677–681. 329. Hodin J. 2000. Plasticity and constraints in development and evolution. Modularity of Animal Form Workshop, Friday Harbor, Washington. 330. Hoekstra HE, Coyne JA. 2007. The locus of evolution: Evo devo and the genetics of adaptation. Evolution 61: 995–1016. 331. Hofacker I, Fekete M, Flamm C, Huynen M, Rauscher S, et al. 1998. Automatic detection of conserved RNA structure elements in complete RNA virus genomes. Nucleic Acids Research 26: 3825–36. 332. Hofacker I, Fontana W, Stadler P, Bonhoeffer L, Tacker M, Schuster P. 1994. Fast folding and comparison of RNA secondary structures. Monatshefte fuer Chemie 125: 167–88. 333. Holland JH. 1975. Adaptation in natural and artificial systems. Ann Arbor, MI: University of Michigan Press. 334. Holland PWH, Garcia-Fernandez J, Williams NA, Sidow A. 1994. Gene duplications and the origins of vertebrate development. Development (Supplement): 125–33. 335. Hölldobler B, Wilson EO. 1990. The ants. Cambridge, MA: Belknap Press of Harvard University Press. 336. Holt LJ, Tuch BB, Villen J, Johnson AD, Gygi SP, Morgan DO. 2009. Global analysis of Cdk1 substrate phosphorylation sites provides insights into evolution. Science 325: 1682–6. 337. Huang XP, Ellis J. 2007. Mutational disruption of a conserved disulfide bond in muscarinic acetylcholine receptors attenuates positive homotropic cooperativity between multiple allosteric sites and has subtype-dependent effects on the affinities of muscarinic allosteric ligands. Molecular Pharmacology 71: 759–68. 338. Huang Z, Szostak JW. 2003. Evolution of aptamers with a new specificity and new secondary structures from an ATP aptamer. RNA-a Publication of the RNA Society 9: 1456–63. 339. Hubel DH. 1967. 
Effects of distortion of sensory input on the visual system of kittens. Physiologist 10: 17–45. 340. Hueber SD, Lohmann I. 2008. Shaping segments: Hox gene function in the genomic age. Bioessays 30: 965–79. 341. Huerta-Sanchez E, Durrett R. 2007. Wagner’s canalization model. Theoretical Population Biology 71: 121–30.
342. Hughes CL, Kaufman TC. 2002. Hox genes and the evolution of the arthropod body plan. Evolution & Development 4: 459–99. 343. Hukushima K, Nemoto K. 1996. Exchange Monte Carlo method and application to spin glass simulations. Journal of the Physical Society of Japan 65: 1604. 344. Hult K, Berglund P. 2007. Enzyme promiscuity: mechanism and applications. Trends in Biotechnology 25: 231–8. 345. Hunter MP, Prince VE. 2002. Zebrafish Hox paralogue group 2 genes function redundantly as selector genes to pattern the second pharyngeal arch. Developmental Biology 247: 367–89. 346. Huynen M, Snel B, Lathe W, Bork P. 2000. Predicting protein function by genomic context: Quantitative evaluation and qualitative inferences. Genome Research 10: 1204–10. 347. Huynen MA. 1996. Exploring phenotype space through neutral evolution. Journal of Molecular Evolution 43: 165–9. 348. Ibarra RU, Edwards JS, Palsson BO. 2002. Escherichia coli K-12 undergoes adaptive evolution to achieve in silico predicted optimal growth. Nature 420: 186–9. 349. Ingolia NT. 2004. Topology and robustness in the Drosophila segment polarity network. PLoS Biology 2: 805–15. 350. Innan H, Stephan W. 2001. Selection intensity against deleterious mutations in RNA secondary structures and rate of compensatory nucleotide substitutions. Genetics 159: 389–99. 351. Inoki K, Ouyang H, Li Y, Guan KL. 2005. Signaling by target of rapamycin proteins in cell growth control. Microbiology and Molecular Biology Reviews 69: 79–100. 352. Irish V. 1999. Patterning the flower. Developmental Biology 209: 211–20. 353. Irish VF. 2003. The evolution of floral homeotic gene function. Bioessays 25: 637–46. 354. Irish VF, Litt A. 2005. Flower development and evolution: gene duplication, diversification and redeployment. Current Opinion in Genetics & Development 15: 454–60. 355. Isalan M, Lemerle C, Michalodimitrakis K, Beltrao P, Horn C, et al. 2008. Evolvability and hierarchy in rewired bacterial gene networks. Nature 452: 840–5. 
356. Isambert H, Siggia E. 2000. Modeling RNA folding paths with pseudoknots: Application to hepatitis delta virus ribozyme. Proceedings of the National Academy of Sciences of the USA 97: 6515–20. 357. Ito H, Kageyama H, Mutsuda M, Nakajima M, Oyama T, Kondo T. 2007. Autonomous synchronization of the circadian KaiC phosphorylation rhythm. Nature Structural & Molecular Biology 14: 1084–8. 358. Izquierdo EJ, Fernando CT. 2008. The evolution of evolvability in gene transcription networks. In Artificial Life XI: Proceedings of the Eleventh International Conference on the Simulation and Synthesis of Living Systems, ed. S Bullock, J Noble, R Watson, MA Bedau, pp. 265–73. MIT Press: Cambridge, MA. 359. Jacinto E, Guo B, Arndt KT, Schmelzle T, Hall MN. 2001. TIP41 interacts with TAP42 and negatively regulates the TOR signaling pathway. Molecular Cell 8: 1017–26. 360. Jacinto E, Hall MN. 2003. TOR signalling in bugs, brain and brawn. Nature Reviews Molecular Cell Biology 4: 117–26. 361. Jackson R, Kaminski A. 1995. Internal initiation of translation in eukaryotes: The picornavirus paradigm and beyond. RNA 1: 985–1000. 362. Jaeger J, Surkova S, Blagov M, Janssens H, Kosman D, et al. 2004. Dynamic control of positional information in the early Drosophila embryo. Nature 430: 368–71. 363. Jaeger J, Turner D, Zuker M. 1989. Improved predictions of secondary structures for RNA. Proceedings of the National Academy of Sciences of the USA 86: 7706–10. 364. James B, Olsen G, Liu J, Pace N. 1988. The secondary structure of ribonuclease-P RNA, the catalytic element of a ribonucleoprotein enzyme. Cell 52: 19–26. 365. James B, Olsen G, Pace N. 1989. Phylogenetic comparative analysis of RNA secondary structure. Methods in Enzymology 180: 227–39. 366. Janzen FJ, Paukstis GL. 1991. A preliminary test of the adaptive significance of environmental sex determination in reptiles. Evolution 45: 435–40. 367. Jeffery C. 1999. Moonlighting proteins. Trends in Biochemical Sciences 24: 8–11. 368. Jensen RA. 1976. Enzyme recruitment in evolution of new function. Annual Review of Microbiology 30: 409–25. 369. Jiang Y, Broach JR. 1999. Tor proteins and protein phosphatase 2A reciprocally regulate Tap42 in controlling cell growth in yeast. Embo Journal 18: 2782–92. 370. John D, Whitton B, Brook A, ed. 2002. The freshwater algal flora of the British Isles, p.471, pl. 109B. New York: Cambridge University Press. 371. Johnson AE, Tanner ME. 1998. Epimerization via carbon-carbon bond cleavage. L-ribulose-5-phosphate 4-epimerase as a masked class II aldolase. Biochemistry 37: 5746–54.
372. Johnson CH, Golden SS, Kondo T. 1998. Adaptive significance of circadian programs in cyanobacteria. Trends in Microbiology 6: 407–10. 373. Johnson CH, Kondo T, Golden S. 1999. Circadian clocks enhance fitness in cyanobacteria. Photochemistry and Photobiology 69 (Supplement): SAM–C5. 374. Johnson CH, Mori T, Xu Y. 2008. A cyanobacterial circadian clockwork. Current Biology 18: R816–R25. 375. Johnson DB, Hallberg KB. 2009. Carbon, iron and sulfur metabolism in acidophilic micro-organisms. In Advances in microbial physiology, vol 54. ed. RL Poole, pp. 201–55. London: Academic Press. 376. Johnson ME, Viggiano L, Bailey JA, Abdul-Rauf M, Goodwin G, et al. 2001. Positive selection of a gene family during the emergence of humans and African apes. Nature 413: 514–9. 377. Jolliffe IT. 2002. Principal component analysis. New York, NY: Springer. 378. Jörg T, Martin O, Wagner A. 2008. Neutral network sizes of biological RNA molecules can be computed and are not atypically small. BMC Bioinformatics 9: 464. 379. Joyce GF. 2004. Directed evolution of nucleic acid enzymes. Annual Review of Biochemistry 73: 791–836. 380. Joyce GF. 2005. The promise and peril of continuous in vitro evolution. Journal of Molecular Evolution 61: 253–63. 381. Kageyama H, Nishiwaki T, Nakajima M, Iwasaki H, Oyama T, Kondo T. 2006. Cyanobacterial circadian pacemaker: Kai protein complex dynamics in the KaiC phosphorylation cycle in vitro. Molecular Cell 23: 161–71. 382. Kamtekar S, Schiffer J, Xiong H, Babik J, Hecht M. 1993. Protein design by binary patterning of polar and nonpolar amino-acids. Science 262: 1680–5. 383. Kanzler B, Foreman RK, Labosky PA, Mallo M. 2000. BMP signaling is essential for development of skeletogenic and neurogenic cranial neural crest. Development 127: 1095–104. 384. Kashtan N, Alon U. 2005. Spontaneous evolution of modularity and network motifs. Proceedings of the National Academy of Sciences of the USA 102: 13773–8. 385. Kashtan N, Noor E, Alon U. 2007. 
Varying environments can speed up evolution. Proceedings of the National Academy of Sciences of the USA 104: 13711–6. 386. Katju V, Lynch M. 2003. The structure and early evolution of recently arisen gene duplicates in the Caenorhabditis elegans genome. Genetics 165: 1793–803. 387. Kauffman SA. 1967. Metabolic stability and epigenesis in randomly connected nets. Journal of Theoretical Biology 22: 437–67.
388. Kauffman SA. 1993. The origins of order. New York: Oxford University Press. 389. Keefe AD, Szostak JW. 2001. Functional proteins from a random-sequence library. Nature 410: 715–8. 390. Kellis M, Birren BW, Lander ES. 2004. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 428: 617–24. 391. Kelly KK, Meadows SM, Cripps RM. 2002. Drosophila MEF2 is a direct regulator of Actin57B transcription in cardiac, skeletal, and visceral muscle lineages. Mechanisms of Development 110: 39–50. 392. Keren I, Shah D, Spoering A, Kaldalu N, Lewis K. 2004. Specialized persister cells and the mechanism of multidrug tolerance in Escherichia coli. Journal of Bacteriology 186: 8172–80. 393. Kern AD, Kondrashov FA. 2004. Mechanisms and convergence of compensatory evolution in mammalian mitochondrial tRNAs. Nature Genetics 36: 1207–12. 394. Keymeulen D, Zebulum R, Jin Y, Stoica A. 2000. Fault-tolerant evolvable hardware using field-programmable transistor arrays. IEEE Transactions on Reliability 49: 305–16. 395. Keys D, Lewis D, Selegue J, Pearson B, Goodrich L, et al. 1999. Recruitment of a hedgehog regulatory circuit in butterfly eyespot evolution. Science 283: 532–4. 396. Keys DN, Lewis DL, Selegue JE, Pearson BJ, Goodrich LV, et al. 1999. Recruitment of a hedgehog regulatory circuit in butterfly eyespot evolution. Science 283: 532–4. 397. Khaitovich P, Hellmann I, Enard W, Nowick K, Leinweber M, et al. 2005. Parallel patterns of evolution in the genomes and transcriptomes of humans and chimpanzees. Science 309: 1850–4. 398. Khersonsky O, Tawfik DS. 2005. Structure-reactivity studies of serum paraoxonase PON1 suggest that its native activity is lactonase. Biochemistry 44: 6371–82. 399. Kim KJ, Fernandes VM. 2009. Effects of ploidy and recombination on evolution of robustness in a model of the segment polarity network. PLoS Computational Biology 5: e1000296. 400. Kim S, Plagnol V, Hu TT, Toomajian C, Clark RM, et al. 2007. 
Recombination and linkage disequilibrium in Arabidopsis thaliana. Nature Genetics 39: 1151–5. 401. Kimura M. 1979. Model of effectively neutral mutations in which selective constraint is incorporated. Proceedings of the National Academy of Sciences of the USA 76: 3440–4. 402. Kimura M. 1983. The neutral theory of molecular evolution. Cambridge: Cambridge University Press.
403. Kimura M, Ohta T. 1974. On some principles governing molecular evolution. Proceedings of the National Academy of Sciences of the USA 71: 2848–52. 404. King MC, Wilson AC. 1975. Evolution at two levels in humans and chimpanzees. Science 188: 107–16. 405. Kirschner D, Marino S. 2005. Mycobacterium tuberculosis as viewed through a computer. Trends in Microbiology 13: 206–11. 406. Kirschner MW, Gerhart JC. 2005. The plausibility of life. New Haven, CT: Yale University Press. 407. Kitano H, Funahashi A, Matsuoka Y, Oda K. 2005. Using process diagrams for the graphical representation of biological networks. Nature Biotechnology 23: 961–6. 408. Knapp G. 1989. Enzymatic approaches to probing of RNA secondary and tertiary structure. Methods in Enzymology 180: 192–212. 409. Knight RD, Freeland SJ, Landweber LF. 2001. Rewiring the keyboard: Evolvability of the genetic code. Nature Reviews Genetics 2: 49–58. 410. Knoll A. 2003. The geological consequences of evolution. Geobiology 1: 3–14. 411. Knoll AH. 1992. The early evolution of eukaryotes—a geological perspective. Science 256: 622–7. 412. Koelle K, Cobey S, Grenfell B, Pascual M. 2006. Epochal evolution shapes the phylodynamics of interpandemic influenza A (H3N2) in humans. Science 314: 1898–903. 413. Kohn MH, Fang S, Wu CI. 2004. Inference of positive and negative selection on the 5’ regulatory regions of Drosophila genes. Molecular Biology and Evolution 21: 374–83. 414. Kolkman J, Stemmer W. 2001. Directed evolution of proteins by exon shuffling. Nature Biotechnology 19: 423–8. 415. Kondo T, Ishiura M. 2000. The circadian clock of cyanobacteria. Bioessays 22: 10–5. 416. Kondrashov AS, Sunyaev S, Kondrashov FA. 2002. Dobzhansky-Muller incompatibilities in protein evolution. Proceedings of the National Academy of Sciences of the USA 99: 14878–83. 417. Kondrashov FA, Rogozin IB, Wolf YI, Koonin EV. 2002. Selection in the evolution of gene duplication. Genome Biology 3: Research 8.1–8.9. 418. Koonin E, Wolf Y, Karev G. 
2002. The structure of the protein universe and genome evolution. Nature 420: 218–23. 419. Koonin EV, Makarova KS, Aravind L. 2001. Horizontal gene transfer in prokaryotes: Quantification and classification. Annual Review of Microbiology 55: 709–42. 420. Kornfeld K. 1997. Vulval development in Caenorhabditis elegans. Trends in Genetics 13: 55–61.
421. Kosiol C, Vinar T, da Fonseca RR, Hubisz MJ, Bustamante CD, et al. 2008. Patterns of positive selection in six mammalian genomes. PLoS Genetics 4(8): e1000144. 422. Koza JR. 1992. Genetic programming: on the programming of computers by means of natural selection. MIT Press: Cambridge, MA. 423. Kramer EM, Jaramillo MA, Di Stilio VS. 2004. Patterns of gene duplication and functional evolution during the diversification of the AGAMOUS subfamily of MADS box genes in angiosperms. Genetics 166: 1011–23. 424. Kreimer A, Borenstein E, Gophna U, Ruppin E. 2008. The evolution of modularity in bacterial metabolic networks. Proceedings of the National Academy of Sciences of the USA 105: 6976–81. 425. Kreitman M. 1996. The neutral theory is dead: Long live the neutral theory. Bioessays 18: 678–83. 426. Kreitman M, Ludwig M. 1996. Tempo and mode of even-skipped stripe 2 enhancer evolution in Drosophila. Seminars in Cell & Developmental Biology 7: 583–92. 427. Kuepfer L, Peter M, Sauer U, Stelling J. 2007. Ensemble modeling for analysis of cell signaling dynamics. Nature Biotechnology 25: 1001–6. 428. Kulathinal RJ, Bettencourt BR, Hartl DL. 2004. Compensated deleterious mutations in insect genomes. Science 306: 1553–4. 429. Kussell E. 2005. The designability hypothesis and protein evolution. Protein and Peptide Letters 12: 111–6. 430. Kwong PD, Wyatt R, Robinson J, Sweet RW, Sodroski J, Hendrickson WA. 1998. Structure of an HIV gp120 envelope glycoprotein in complex with the CD4 receptor and a neutralizing human antibody. Nature 393: 648–59. 431. Land MF, Nilsson D-E. 2002. Animal eyes. Oxford, UK: Oxford University Press. 432. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, et al. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860–921. 433. Lathe WC, Snel B, Bork P. 2000. Gene context conservation of a higher order than operons. Trends in Biochemical Sciences 25: 474–9. 434. Lau KF, Dill KA. 1989. 
A lattice statistical mechanics model of the conformational and sequence spaces of proteins. Macromolecules 22: 3986–97. 435. Lau KF, Dill KA. 1990. Theory for protein mutability and biogenesis. Proceedings of the National Academy of Sciences of the USA 87: 638–42. 436. Lawrence J. 1999. Selfish operons: the evolutionary impact of gene clustering in prokaryotes and eukaryotes. Current Opinion in Genetics & Development 9: 642–8. 437. Lawrence JG, Ochman H. 1998. Molecular archaeology of the Escherichia coli genome. Proceedings of the National Academy of Sciences of the USA 95: 9413–7. 438. Lebowitz JH, Clerc RG, Brenowitz M, Sharp PA. 1989. The Oct-2 protein binds cooperatively to adjacent octamer sites. Genes & Development 3: 1625–38. 439. Lee K, Loros JJ, Dunlap JC. 2000. Interconnected feedback loops in the Neurospora circadian system. Science 289: 107–10. 440. Lee T, Rinaldi N, Robert F, Odom D, Bar-Joseph Z, et al. 2002. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298: 799–804. 441. Lehner B. 2010. Genes confer similar robustness to environmental, stochastic, and genetic perturbations in yeast. PLoS ONE 5: e9035. 442. Lemons D, McGinnis W. 2006. Genomic evolution of Hox gene clusters. Science 313: 1918–22. 443. Leong S, Chang J, Ong R, Dawes G, Stemmer W, Punnonen J. 2003. Optimized expression and specific activity of IL-12 by directed molecular evolution. Proceedings of the National Academy of Sciences of the USA 100: 1163–8. 444. Leopold P, Montal M, Onuchic J. 1992. Protein folding funnels—a kinetic approach to the sequence structure relationship. Proceedings of the National Academy of Sciences of the USA 89: 8721–5. 445. Lerat E, Daubin V, Ochman H, Moran NA. 2005. Evolutionary origins of genomic repertoires in bacteria. PLoS Biology 3: e130. 446. Lercher MJ, Hurst LD. 2002. Human SNP variability and mutation rate are higher in regions of high recombination. Trends in Genetics 18: 337–40. 447. Letunic I, Bork P. 2007. Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation. Bioinformatics 23: 127–8. 448. Levine J, Kueh HY, Mirny L. 2007. Intrinsic fluctuations, robustness, and tunability in signaling cycles. Biophysical Journal 92: 4473–81. 449. Levitt M. 2009. Nature of the protein universe. Proceedings of the National Academy of Sciences of the USA 106: 11079–84. 450. Levy SF, Siegal ML. 2008. Network hubs buffer environmental variation in Saccharomyces cerevisiae. PLoS Biology 6: 2588–604. 451. Lewis K. 2007. Persister cells, dormancy and infectious disease. Nature Reviews Microbiology 5: 48–56. 452. Lewontin RC. 1970. The units of selection. Annual Reviews of Ecology and Systematics 1: 1–18. 453. Lewontin RC, Hubby JL. 1966. A molecular approach to the study of genic heterozygosity in
natural populations, II. Amount of variation and degree of heterozygosity in natural populations of Drosophila pseudoobscura. Genetics 54: 595–609. 454. Li F, Long T, Lu Y, Ouyang Q, Tang C. 2004. The yeast cell-cycle network is robustly designed. Proceedings of the National Academy of Sciences of the USA 101: 4781–6. 455. Li H, Helling R, Tang C, Wingreen N. 1996. Emergence of preferred structures in a simple model of protein folding. Science 273: 666–9. 456. Li H, Tang C, Wingreen NS. 1998. Are protein folds atypical? Proceedings of the National Academy of Sciences of the USA 95: 4987–90. 457. Li H, Tang C, Wingreen NS. 2002. Designability of protein structures: A lattice-model study using the Miyazawa–Jernigan matrix. Proteins-Structure Function and Genetics 49: 403–12. 458. Li W-H. 1997. Molecular evolution. Massachusetts: Sinauer. 459. Liang YH, Hua ZQ, Liang X, Xu Q, Lu GY. 2001. The crystal structure of bar-headed goose hemoglobin in deoxy form: The allosteric mechanism of a hemoglobin species with high oxygen affinity. Journal of Molecular Biology 313: 123–37. 460. Liljegren SJ, Ditta GS, Eshed HY, Savidge B, Bowman JL, Yanofsky MF. 2000. SHATTERPROOF MADS-box genes control seed dispersal in Arabidopsis. Nature 404: 766–70. 461. Lin Q, Schwarz J, Bucana C, Olson EN. 1997. Control of mouse cardiac morphogenesis and myogenesis by transcription factor MEF2C. Science 276: 1404–7. 462. Lindesmith LC, Donaldson EF, Lobue AD, Cannon JL, Zheng DP, et al. 2008. Mechanisms of GII.4 norovirus persistence in human populations. PLoS Medicine 5: 269–90. 463. Liou Y-C, Tocilj A, Davies PL, Jia Z. 2000. Mimicry of ice structure by surface hydroxyls and water of a β-helix antifreeze protein. Nature 406: 322–4. 464. Lipman D, Wilbur W. 1991. Modeling neutral and selective evolution of protein folding. Proceedings of the Royal Society of London Series B 245: 7–11. 465. Lisacek F, Diaz Y, Michel F. 1994. Automatic identification of group-I intron cores in genomic DNA sequences. Journal of Molecular Biology 235: 1206–17. 466. Liu XZ, Li SL, Jing H, Liang YH, Hua ZQ, Lu GY. 2001. Avian haemoglobins and structural basis of high affinity for oxygen: structure of bar-headed goose aquomet haemoglobin. Acta Crystallographica Section D—Biological Crystallography 57: 775–83. 467. Lohaus R, Burch CL, Azevedo RBR. 2010. Genetic architecture and the evolution of sex. Journal of Heredity 101 (Supplement 1): S142–S57.
468. Lopez P, Casane D, Philippe H. 2002. Heterotachy, an important process of protein evolution. Molecular Biology and Evolution 19: 1–7. 469. Lorberg A, Hall MN. 2003. TOR: The first 10 years. In TOR-Target of Rapamycin, ed. G Thomas, DM Sabatini, MN Hall, pp. 1–18. Springer: Berlin. 470. Ludwig MZ, Bergman C, Patel NH, Kreitman M. 2000. Evidence for stabilizing selection in a eukaryotic enhancer element. Nature 403: 564–7. 471. Ludwig MZ, Kreitman M. 1995. Evolutionary dynamics of the enhancer region of even-skipped in Drosophila. Molecular Biology and Evolution 12: 1002–11. 472. Ludwig MZ, Patel NH, Kreitman M. 1997. Evolution of the even-skipped stripe 2 enhancer of Drosophila. Developmental Biology 186: A27. 473. Luke MM, DellaSeta F, DiComo CJ, Sugimoto H, Kobayashi R, Arndt KT. 1996. The SAPs, a new family of proteins, associate and function positively with the SIT4 phosphatase. Molecular and Cellular Biology 16: 2744–55. 474. Lunzer M, Miller SP, Felsheim R, Dean AM. 2005. The biochemical architecture of an ancient adaptive landscape. Science 310: 499–501. 475. Luo Y, Samuel J, Mosimann SC, Lee JE, Tanner ME, Strynadka NCJ. 2001. The structure of L-ribulose-5-phosphate 4-epimerase: An aldolase-like platform for epimerization. Biochemistry 40: 14763–71. 476. Lynch M. 2007. The origins of genome architecture. Sunderland, MA: Sinauer. 477. Lynch M, Blanchard J, Houle D, Kibota T, Schultz S, et al. 1999. Perspective: Spontaneous deleterious mutation. Evolution 53: 645–63. 478. Lynch M, Conery J. 2003. The origins of genome complexity. Science 302: 1401–4. 479. Lynch M, Conery JS. 2000. The evolutionary fate and consequences of duplicate genes. Science 290: 1151–5. 480. Lynch M, O’Hely M, Walsh B, Force A. 2001. The probability of preservation of a newly arisen gene duplicate. Genetics 159: 1789–804. 481. MacCarthy T, Seymour R, Pomiankowski A. 2003. The evolutionary potential of the Drosophila sex determination gene network. 
Journal of Theoretical Biology 225: 461–8. 482. Maduro M, Pilgrim D. 1995. Identification and cloning of unc-119, a gene expressed in the Caenorhabditis elegans nervous system. Genetics 141: 977–88. 483. Maduro M, Pilgrim D. 1996. Conservation of function and expression of unc-119 from two Caenorhabditis species despite divergence of noncoding DNA. Gene 183: 77–85.
484. Magasanik B, Kaiser CA. 2002. Nitrogen regulation in Saccharomyces cerevisiae. Gene 290: 1–18. 485. Malcomber ST, Kellogg EA. 2004. Heterogeneous expression patterns and separate roles of the SEPALLATA gene LEAFY HULL STERILE1 in Grasses. Plant Cell 16: 1692–706. 486. Mandl C, Holzmann H, Meixner T, Rauscher S, Stadler P, et al. 1998. Spontaneous and engineered deletions in the 3’ noncoding region of tick-borne encephalitis virus: Construction of highly attenuated mutants of a flavivirus. Journal of Virology 72: 2132–40. 487. Mangan S, Alon U. 2003. Structure and function of the feed-forward loop network motif. Proceedings of the National Academy of Sciences of the USA 100: 11980–5. 488. Mansky LM, Temin HM. 1995. Lower in-vivo mutation-rate of human-immunodeficiency-virus type-1 than that predicted from the fidelity of purified reverse-transcriptase. Journal of Virology 69: 5087–94. 489. Manzella J R W, Rhoads R, Hershey J, Blackshear P. 1991. Insulin induction of ornithine decarboxylase— importance of messenger RNA secondary structure and phosphorylation of eukaryotic initiation factor EIF-4b and factor EIF-4e. Journal of Biological Chemistry 266: 2383–9. 490. Martchenko M, Levitin A, Hogues H, Nantel A, Whiteway M. 2007. Transcriptional rewiring of fungal galactose-metabolism circuitry. Current Biology 17: 1007–13. 491. Martin OC, Wagner A. 2008. Multifunctionality and robustness trade-offs in model genetic circuits. Biophysical Journal 94: 2927–37. 492. Martin OC, Wagner A. 2009. Effects of recombination on complex regulatory circuits. Genetics 183: 673–84. 493. Martinez-Castilla LP, Alvarez-Buylla ER. 2004. Adaptive evolution in the Arabidopsis MADS-box gene family inferred from its complete resolved phylogeny. Proceedings of the National Academy of Sciences of the USA 101: 13407–12. 494. Martinis SA, Plateau P, Cavarelli J, Florentz C. 1999. Aminoacyl-tRNA synthetases: a family of expanding functions. EMBO Journal 18: 4591–6. 495. Masel J, Bergman A. 
2003. The evolution of the evolvability properties of the yeast prion [PSI+]. Evolution 57: 1498–512. 496. Masel J, Siegal ML. 2009. Robustness: mechanisms and consequences. Trends in Genetics 25: 395–403. 497. Mathews D, Sabina J, Zuker M, Turner D. 1999. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. Journal of Molecular Biology 288: 911–40.
498. Maynard-Smith J. 1970. Natural selection and the concept of a protein space. Nature 225: 563–4. 499. Maynard-Smith J, Burian R, Kauffman S, Alberch P, Campbell J, et al. 1985. Developmental constraints and evolution. Quarterly Review of Biology 60: 265–87. 500. Maynard Smith J, Haigh J. 1974. The hitch-hiking effect of a favorable gene. Genetical Research 23: 23–35. 501. Mayr E. 1961. Cause and effect in biology. Science 134: 1501–6. 502. Mayr E. 1963. Animal species and evolution. Cambridge, MA: Belknap. 503. Mayr E. 1982. The growth of biological thought: diversity, evolution, and inheritance. Cambridge, Massachusetts: Belknap Press. 504. McAdams HH, Arkin A. 1999. It’s a noisy business! Genetic regulation at the nanomolar scale. Trends in Genetics 15: 65–9. 505. McAdams HH, Shapiro L. 1995. Circuit simulation of genetic networks. Science 269: 650–6. 506. McDonald JH, Kreitman M. 1991. Adaptive protein evolution at the Adh locus in Drosophila. Nature 351: 652–4. 507. McGhee G. 2007. The geometry of evolution. Cambridge, UK: Cambridge University Press. 508. McGinnis W, Krumlauf R. 1992. Homeobox genes and axial patterning. Cell 68: 283–302. 509. Meer MV, Kondrashov AS, Artzy-Randrup Y, Kondrashov FA. 2010. Compensatory evolution in mitochondrial tRNAs navigates valleys of low fitness. Nature 464: 279–82. 510. Mehra A, Hong CI, Shi M, Loros JJ, Dunlap JC, Ruoff P. 2006. Circadian rhythmicity by autocatalysis. PLoS Computational Biology 2: 816–23. 511. Meiklejohn C, Hartl D. 2002. A single mode of canalization. Trends in Ecology & Evolution 17: 468–73. 512. Meir E, von Dassow G, Munro E, Odell G. 2002. Robustness, flexibility, and the role of lateral inhibition in the neurogenic network. Current Biology 12: 778–86. 513. Mendoza L, Alvarez-Buylla E. 2000. Genetic regulation of root hair development in Arabidopsis thaliana: A network model. Journal of Theoretical Biology 204: 311–26. 514. Mendoza L, Thieffry D, Alvarez-Buylla E. 1999. 
Genetic control of flower morphogenesis in Arabidopsis thaliana: a logical analysis. Bioinformatics 15: 593–606. 515. Merrick MJ, Edwards RA. 1995. Nitrogen control in bacteria. Microbiological Reviews 59: 604–622. 516. Merrow M, Franchi L, Dragovic Z, Gorl M, Johnson J, et al. 2001. Circadian regulation of the light input
pathway in Neurospora crassa. EMBO Journal 20: 307–15. 517. Meyers LA, Ancel FD, Lachmann M. 2005. Evolution of genetic potential. PLoS Computational Biology 1: 236–43. 518. Michael SF, Kilfoil VJ, Schmidt MH, Amann BT, Berg JM. 1992. Metal binding and folding properties of a minimalist Cys2His2 Zinc finger peptide. Proceedings of the National Academy of Sciences of the USA 89: 4796–800. 519. Miller J, Zeng C, Wingreen NS, Tang C. 2002. Emergence of highly designable protein-backbone conformations in an off-lattice model. Proteins-Structure Function and Genetics 47: 506–12. 520. Miller JF, Smith SL. 2006. Redundancy and computational efficiency in Cartesian genetic programming. IEEE Transactions on Evolutionary Computation 10: 167–74. 521. Miller JF, Thomson P. 2000. Cartesian genetic programming. European Conference on Genetic Programming (EuroGP 2000), Edinburgh, Scotland. 522. Milton CC, Huynh B, Batterham P, Rutherford SL, Hoffmann AA. 2003. Quantitative trait symmetry independent of Hsp90 buffering: Distinct modes of genetic canalization and developmental stability. Proceedings of the National Academy of Sciences of the USA 100: 13396–401. 523. Mironov A, Lebedev V. 1993. A kinetic model of RNA folding. Biosystems 30: 49–56. 524. Miskiewicz P, Morrissey D, Lan Y, Raj L, Kessler S, et al. 1996. Both the paired domain and homeodomain are required for in vivo function of Drosophila Paired. Development 122: 2709–18. 525. Mitchell JBO, Smith J. 2003. D-amino acid residues in peptides and proteins. Proteins-Structure Function and Genetics 50: 563–71. 526. Mitchell M. 1998. An introduction to genetic algorithms. Cambridge, MA: MIT Press. 527. Mjolsness E, Sharp DH, Reinitz J. 1991. A connectionist model of development. Journal of Theoretical Biology 152: 429–53. 528. Moczek AP. 2008. On the origins of novelty in development and evolution. Bioessays 30: 432–47. 529. Moeckel R, Jaquier C, Drapel K, Dittrich E, Upegui A, Ijspeert A. 2005. YaMoR and Bluemove—An autonomous modular robot with bluetooth interface for exploring adaptive locomotion. 8th International Conference on Climbing and Walking Robots (CLAWAR 2005), London, England. 530. Moine H, Ehresmann B, Ehresmann C, Romby P. 1998. Probing RNA structure and function in solution. In RNA structure and function, ed. R Simons, M
Grunberg-Manago, pp. 77–115. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press. 531. Monge C, Leonvelarde F. 1991. Physiological adaptation to high-altitude—oxygen-transport in mammals and birds. Physiological Reviews 71: 1135–72. 532. Morgan S, Higgs P. 1996. Evidence for kinetic effects in the folding of large RNA molecules. Journal of Chemical Physics 105: 7152–7. 533. Mori T, Saveliev SV, Xu Y, Stafford WF, Cox MM, et al. 2002. Circadian clock protein KaiC forms ATP-dependent hexameric rings and binds DNA. Proceedings of the National Academy of Sciences of the USA 99: 17203–8. 534. Mori T, Williams DR, Byrne MO, Qin XM, Egli M, et al. 2007. Elucidating the ticking of an in vitro circadian clockwork. PLoS Biology 5: 841–53. 535. Morowitz HJ. 1992. Beginnings of cellular life. New Haven: Yale University Press. 536. Motter AE, Gulbahce N, Almaas E, Barabasi AL. 2008. Predicting synthetic rescues in metabolic networks. Molecular Systems Biology 4: 168 (doi:10.1038/msb.2008.1). 537. Moxon ER, Rainey PB, Nowak MA, Lenski RE. 1994. Adaptive evolution of highly mutable loci in pathogenic bacteria. Current Biology 4: 24–33. 538. Muller GB, Wagner GP. 1991. Novelty in evolution—restructuring the concept. Annual Review of Ecology and Systematics 22: 229–56. 539. Muller HJ. 1950. Our load of mutations. Genetics 2: 111–75. 540. Murakami R, Miyake A, Iwase R, Hayashi F, Uzumaki T, Ishiura M. 2008. ATPase activity and its temperature compensation of the cyanobacterial clock protein KaiC. Genes to Cells 13: 387–95. 541. Muralidhara BK, Sun L, Negi S, Halpert JR. 2008. Thermodynamic fidelity of the mammalian cytochrome P4502B4 active site in binding substrates and inhibitors. Journal of Molecular Biology 377: 232–45. 542. Murray JD. 1989. Mathematical biology. New York: Springer. 543. Muse S. 1995. Evolutionary analysis of DNA sequences subject to constraints on secondary structure. Genetics 139: 1429–39. 544. Nachman MW. 1997. Patterns of DNA variability at X-linked loci in Mus domesticus. Genetics 147: 1303–16. 545. Nachman MW, Bauer VL, Crowell SL, Aquadro CF. 1998. DNA variability and recombination rates at X-linked loci in humans. Genetics 150: 1133–41. 546. Nagano N, Orengo C, Thornton J. 2002. One fold with many functions: The evolutionary relationships between TIM barrel families based on their sequences,
structures and functions. Journal of Molecular Biology 321: 741–65. 547. Nakajima M, Imai K, Ito H, Nishiwaki T, Murayama Y, et al. 2005. Reconstitution of circadian oscillation of cyanobacterial KaiC phosphorylation in vitro. Science 308: 414–5. 548. Nam J, Kim J, Lee S, An GH, Ma H, Nei MS. 2004. Type I MADS-box genes have experienced faster birth-and-death evolution than type II MADS-box genes in angiosperms. Proceedings of the National Academy of Sciences of the USA 101: 1910–5. 549. Nelson E, Onuchic J. 1998. Proposed mechanism for stability of proteins to evolutionary mutations. Proceedings of the National Academy of Sciences of the USA 95: 10682–6. 550. Nelson KE, Clayton RA, Gill SR, Gwinn ML, Dodson RJ, et al. 1999. Evidence for lateral gene transfer between Archaea and Bacteria from genome sequence of Thermotoga maritima. Nature 399: 323–9. 551. Nelson MI, Simonsen L, Viboud C, Miller MA, Taylor J, et al. 2006. Stochastic processes are key determinants of short-term evolution in influenza A virus. PLoS Pathogens 2: 1144–51. 552. Ness J, Welch M, Giver L, Bueno M, Cherry J, et al. 1999. DNA shuffling of subgenomic sequences of subtilisin. Nature Biotechnology 17: 893–6. 553. Newman JRS, Ghaemmaghami S, Ihmels J, Breslow DK, Noble M, et al. 2006. Single-cell proteomic analysis of S. cerevisiae reveals the architecture of biological noise. Nature 441: 840–6. 554. Newman MEJ, Engelhardt R. 1998. Effects of selective neutrality on the evolution of molecular species. Proceedings of the Royal Society of London Series B—Biological Sciences 265: 1333–8. 555. Newman SA, Bhat R. 2009. Dynamical patterning modules: a “pattern language” for development and evolution of multicellular form. International Journal of Developmental Biology 53: 693–705. 556. Newman SA, Muller GB. 2000. Epigenetic mechanisms of character origination. Journal of Experimental Zoology 288: 304–17. 557. Nicolis G, Prigogine I. 1977. Self-organization in nonequilibrium systems: From dissipative structures to order through fluctuations. New York: Wiley. 558. Nielsen R, Bustamante C, Clark AG, Glanowski S, Sackton TB, et al. 2005. A scan for positively selected genes in the genomes of humans and chimpanzees. PLoS Biology 3: 976–85. 559. Nishikawa T, Gulbahce N, Motter AE. 2008. Spontaneous reaction silencing in metabolic optimization. PLoS Computational Biology 4(12): e1000236.
560. Nishimiya Y, Kondo H, Takamichi M, Sugimoto H, Suzuki M, et al. 2008. Crystal structure and mutational analysis of Ca2+-independent type II antifreeze protein from longsnout poacher, Brachyopsis rostratus. Journal of Molecular Biology 382: 734–46. 561. Nochomovitz YD, Li H. 2006. Highly designable phenotypes and mutational buffers emerge from a systematic mapping between network topology and dynamic output. Proceedings of the National Academy of Sciences of the USA 103: 4180–5. 562. Nogales J, Palsson BO, Thiele I. 2008. A genome-scale metabolic reconstruction of Pseudomonas putida KT2440: iJN746 as a cell factory. BMC Systems Biology 2: 79. 563. Nussinov R, Jacobson A. 1980. Fast algorithm for predicting the secondary structure of single-stranded RNA. Proceedings of the National Academy of Sciences of the USA 77: 6309–13. 564. Nusslein-Volhard C, Wieschaus E. 1980. Mutations affecting segment number and polarity in Drosophila. Nature 287: 795–801. 565. Nymeyer H, Garcia A, Onuchic J. 1998. Folding funnels and frustration in off-lattice minimalist protein landscapes. Proceedings of the National Academy of Sciences of the USA 95: 5921–8. 566. O’Brien PJ, Herschlag D. 1999. Catalytic promiscuity and the evolution of new enzymatic activities. Chemistry & Biology 6: R91–R105. 567. Ochman H, Jones IB. 2000. Evolutionary dynamics of full genome content in Escherichia coli. EMBO Journal 19: 6637–43. 568. Ochman H, Lawrence J, Groisman E. 2000. Lateral gene transfer and the nature of bacterial innovation. Nature 405: 299–304. 569. Ochman H, Lerat E, Daubin V. 2005. Examining bacterial species under the specter of gene transfer and exchange. Proceedings of the National Academy of Sciences of the USA 102: 6595–9. 570. Odell GM, Oster G, Alberch P, Burnside B. 1981. The mechanical basis of morphogenesis. 1. epithelial folding and invagination. Developmental Biology 85: 446–62. 571. Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M. 1999. 
KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 27: 29–34. 572. Ohnemus S, Kanzler B, Jerome-Majewska LA, Papaioannou VE, Boehm T, Mallo M. 2002. Aortic arch and pharyngeal phenotype in the absence of BMP-dependent neural crest in the mouse. Mechanisms of Development 119: 127–35. 573. Ohno S. 1970. Evolution by gene duplication. New York: Springer.
574. Ohya Y, Sese J, Yukawa M, Sano F, Nakatani Y, et al. 2005. High-dimensional and large-scale phenotyping of yeast mutants. Proceedings of the National Academy of Sciences of the USA 102: 19015–20. 575. Okamura-Ikeda K, Fujiwara K, Motokawa Y. 1992. Molecular-cloning of a cDNA-encoding chicken T-protein of the glycine cleavage system and expression of the functional protein in Escherichia coli—effect of messenger RNA secondary structure in the translational initiation region on expression. Journal of Biological Chemistry 267: 18284–90. 576. Oliver KR, Greene WK. 2009. Transposable elements: powerful facilitators of evolution. Bioessays 31: 703–14. 577. Oliviero S, Struhl K. 1991. Synergistic transcriptional enhancement does not depend on the number of acidic activation domains bound to the promoter. Proceedings of the National Academy of Sciences of the USA 88: 224–8. 578. Olson EN. 2006. Gene regulatory networks in the evolution and development of the heart. Science 313: 1922–7. 579. Orengo C, Jones D, Thornton JM. 1994. Protein superfamilies and domain superfolds. Nature 372: 631–4. 580. Orengo CA, Thornton JM. 2005. Protein families and their evolution—a structural perspective. Annual Review of Biochemistry 74: 867–900. 581. Orr HA. 1999. Phenotypic evolution—a reaction norm perspective. Science 285: 343–4. 582. Ortlund EA, Bridgham JT, Redinbo MR, Thornton JW. 2007. Crystal structure of an ancient protein: Evolution by conformational epistasis. Science 317: 1544–8. 583. Oster GF, Shubin N, Murray JD, Alberch P. 1988. Evolution and morphogenetic rules—the shape of the vertebrate limb in ontogeny and phylogeny. Evolution 42: 862–84. 584. Otto SP, Gerstein AC. 2006. Why have sex? The population genetics of sex and recombination. Biochemical Society Transactions 34: 519–22. 585. Otto SP, Lenormand T. 2002. Resolving the paradox of sex and recombination. Nature Reviews Genetics 3: 252–61. 586. Ozbudak E, Thattai M, Kurtser I, Grossman A, van Oudenaarden A. 2002. 
Regulation of noise in the expression of a single gene. Nature Genetics 31: 69–73. 587. Pace N, Smith D, Olsen G, James BD. 1989. Phylogenetic comparative analysis and the secondary structure of ribonuclease P RNA—a review. Gene 82: 65–75.
588. Painter HA. 1970. A review of literature on inorganic nitrogen metabolism in microorganisms. Water Research 4: 393–450. 589. Pal C, Papp B, Lercher MJ. 2005. Adaptive evolution of bacterial metabolic networks by horizontal gene transfer. Nature Genetics 37: 1372–5. 590. Pal C, Papp B, Lercher MJ. 2005. Horizontal gene transfer depends on gene content of the host. Joint Meeting of the 4th European Conference on Computational Biology/6th Meeting of the Spanish Bioinformatics Network, Madrid, Spain. 591. Pal C, Papp B, Lercher MJ, Csermely P, Oliver SG, Hurst LD. 2006. Chance and necessity in the evolution of minimal metabolic networks. Nature 440: 667–70. 592. Palmer AR. 2004. Symmetry breaking and the evolution of development. Science 306: 828–33. 593. Papp B, Pal C, Hurst LD. 2004. Metabolic network analysis of the causes and evolution of enzyme dispensability in yeast. Nature 429: 661–4. 594. Park J, Teichmann SA, Hubbard T, Chothia C. 1997. Intermediate sequences increase the detection of homology between sequences. Journal of Molecular Biology 273: 349–54. 595. Parsch J, Braverman J, Stephan W. 2000. Comparative sequence analysis and patterns of covariation in RNA secondary structures. Genetics 154: 909–21. 596. Parsch J, Braverman JM, Stephan W. 2000. Comparative sequence analysis and patterns of covariation in RNA secondary structures. Genetics 154: 909–21. 597. Parter M, Kashtan N, Alon U. 2007. Environmental variability and modularity of bacterial metabolic networks. BMC Evolutionary Biology 7: 169. 598. Parter M, Kashtan N, Alon U. 2008. Facilitated variation: How evolution learns from past environments to generalize to new environments. PLoS Computational Biology 4: e1000206. 599. Pati S, Disilvestre D, Brusilow W. 1992. Regulation of the Escherichia coli uncH gene by messenger RNA secondary structure and translational coupling. Molecular Microbiology 6: 3559–66. 600. Pattanayek R, Williams DR, Pattanayek S, Mori T, Johnson CH, et al. 2008. 
Structural model of the circadian clock KaiB-KaiC complex and mechanism for modulation of KaiC phosphorylation. EMBO Journal 27: 1767–78. 601. Paulsson J. 2005. Models of stochastic gene expression. Physics of Life Reviews 2: 157–75. 602. Paulsson J, Berg OG, Ehrenberg M. 2000. Stochastic focusing: Fluctuation-enhanced sensitivity of intracellular regulation. Proceedings of the National Academy of Sciences of the USA 97: 7148–53.
603. Pedraza JM, Paulsson J. 2008. Effects of molecular memory and bursting on fluctuations in gene expression. Science 319: 339–43. 604. Pegg SCH, Brown SD, Ojha S, Seffernick J, Meng EC, et al. 2006. Leveraging enzyme structure-function relationships for functional inference and experimental design: The structure-function linkage database. Biochemistry 45: 2545–55. 605. Pelaz S, Ditta GS, Baumann E, Wisman E, Yanofsky MF. 2000. B and C floral organ identity functions require SEPALLATA MADS-box genes. Nature 405: 200–3. 606. Pelaz S, Tapia-Lopez R, Alvarez-Buylla ER, Yanofsky MF. 2001. Conversion of leaves into petals in Arabidopsis. Current Biology 11: 182–4. 607. Pennisi E. 2008. Deciphering the genetics of evolution. Science 321: 760–3. 608. Perelson AS, Neumann AU, Markowitz M, Leonard JM, Ho DD. 1996. HIV-1 dynamics in vivo: Virion clearance rate, infected cell life-span, and viral generation time. Science 271: 1582–6. 609. Philippe H, Casane D, Gribaldo S, Lopez P, Meunier J. 2003. Heterotachy and functional shift in protein evolution. IUBMB Life 55: 257–65. 610. Phillips PC. 2008. Epistasis—the essential role of gene interactions in the structure and evolution of genetic systems. Nature Reviews Genetics 9: 855–67. 611. Piatigorsky J. 1998. Gene sharing in lens and cornea: Facts and implications. Progress in Retinal and Eye Research 17: 145–74. 612. Piatigorsky J, Wistow GJ. 1989. Enzyme crystallins: Gene sharing as an evolutionary strategy. Cell 57: 197–9. 613. Pickover CA. 1984. Spectrographic representation of globular protein breathing motions. Science 223: 181–2. 614. Pigliucci M. 2001. Phenotypic plasticity: beyond nature and nurture. Baltimore, MD: Johns Hopkins University Press. 615. Pigliucci M. 2005. Evolution of phenotypic plasticity: Where are we going now? Trends in Ecology & Evolution 20: 481–6. 616. Pigliucci M. 2006. What, if anything, is an evolutionary novelty? 
20th Biennial Meeting of the Philosophy of Science Association, Vancouver, Canada. 617. Pigliucci M, Murren CJ, Schlichting CD. 2006. Phenotypic plasticity and evolution by genetic assimilation. Journal of Experimental Biology 209: 2362–7. 618. Pokorna M, Kratochvil L. 2009. Phylogeny of sex-determining mechanisms in squamate reptiles: are sex chromosomes an evolutionary trap? Zoological Journal of the Linnean Society 156: 168–83.
619. Poon AFY, Chao L. 2006. Functional origins of fitness effect-sizes of compensatory mutations in the DNA bacteriophage phi X174. Evolution 60: 2032–43. 620. Postgate JR. 1994. The outer reaches of life. Cambridge, UK: Cambridge University Press. 621. Pourquie O. 2003. The segmentation clock: Converting embryonic time into spatial pattern. Science 301: 328–30. 622. Powell D, Zhang M, Konings D, Wingfield P, Stahl S, et al. 1995. Sequence specificity in the higher-order interaction of the rev protein of HIV with its target sequence, the RRE. Journal of Acquired Immune Deficiency Syndromes and Human Retrovirology 10: 317–23. 623. Price MN, Arkin AP, Alm EJ. 2006. The life-cycle of operons. PLoS Genetics 2: 859–73. 624. Price ND, Reed JL, Palsson BO. 2004. Genome-scale models of microbial cells: Evaluating the consequences of constraints. Nature Reviews Microbiology 2: 886–97. 625. Queitsch C, Sangster TA, Lindquist S. 2002. Hsp90 as a capacitor of phenotypic variation. Nature 417: 618–24. 626. Rabinowitz JD, Hsiao JJ, Gryncel KR, Kantrowitz ER, Feng XJ, et al. 2008. Dissecting enzyme regulation by multiple allosteric effectors: Nucleotide regulation of aspartate transcarbamoylase. Biochemistry 47: 5881–8. 627. Raman K, Wagner A. 2011. Evolvability and robustness in a complex signaling circuit. Molecular BioSystems 7: 1081–92. 628. Raman K, Wagner A. 2010. The evolvability of programmable hardware. Journal of the Royal Society Interface. doi: 10.1098/rsif.2010.0212. 629. Rambaut A, Posada D, Crandall KA, Holmes EC. 2004. The causes and consequences of HIV evolution. Nature Reviews Genetics 5: 52–61. 630. Ranea JAG, Sillero A, Thornton JM, Orengo CA. 2006. Protein superfamily evolution and the last universal common ancestor (LUCA). Journal of Molecular Evolution 63: 513–25. 631. Ranganayakulu G, Zhao B, Dokidis A, Molkentin JD, Olson EN, Schulz RA. 1995. 
A series of mutations in the D-Mef2 transcription factor reveal multiple functions in larval and adult myogenesis in Drosophila. Developmental Biology 171: 169–81. 632. Rao C, Wolf D, Arkin A. 2002. Control, exploitation and tolerance of intracellular noise. Nature 420: 231–7. 633. Raser JM, O’Shea EK. 2004. Control of stochasticity in eukaryotic gene expression. Science 304: 1811–4. 634. Raup DM. 1966. Geometric analysis of shell coiling: general problems. Journal of Paleontology 40: 1178–90.
635. Raux E, Schubert HL, Warren MJ. 2000. Biosynthesis of cobalamin (vitamin B-12): a bacterial conundrum. Cellular and Molecular Life Sciences 57: 1880–93. 636. Rechenberg I. 1973. Evolutionsstrategie. Stuttgart: Frommann-Holzboog. 637. Reed JL, Vo TD, Schilling CH, Palsson BO. 2003. An expanded genome-scale model of Escherichia coli K-12 (iJR904 GSM/GPR). Genome Biology 4: R54. 638. Rehmann L, Daugulis AJ. 2008. Enhancement of PCB degradation by Burkholderia xenovorans LB400 in biphasic systems by manipulating culture conditions. Biotechnology and Bioengineering 99: 521–8. 639. Reidhaar-Olson JF, Sauer RT. 1990. Functionally acceptable substitutions in 2 alpha-helical regions of Lambda repressor. Proteins-Structure Function and Genetics 7: 306–16. 640. Reidys C, Stadler P, Schuster P. 1997. Generic properties of combinatory maps: Neutral networks of RNA secondary structures. Bulletin of Mathematical Biology 59: 339–97. 641. Reidys CM. 1997. Random induced subgraphs of generalized n-cubes. Advances in Applied Mathematics 19: 360–77. 642. Reidys CM. 2002. Distances in random induced subgraphs of generalized n-cubes. Combinatorics Probability & Computing 11: 599–605. 643. Reidys CM. 2009. Large components in random induced subgraphs of n-cubes. Discrete Mathematics 309: 3113–24. 644. Reidys CM, Stadler PF. 2002. Combinatorial landscapes. SIAM Review 44: 3–54. 645. Reinitz J. 1999. Gene circuits for eve stripes: Reverse engineering the Drosophila segmentation gene network. Biophysical Journal 76: A272. 646. Reinitz J, Mjolsness E, Sharp DH. 1995. Model for cooperative control of positional information in Drosophila by bicoid and maternal hunchback. Journal of Experimental Zoology 271: 47–56. 647. Rendel JM. 1959. Canalization of the scute phenotype of Drosophila. Evolution 13: 425–39. 648. Reppert SM, Weaver DR. 2001. Molecular analysis of mammalian circadian rhythms. Annual Review of Physiology 63: 647–76. 649. 
Riechmann JL, Heard J, Martin G, Reuber L, Jiang CZ, et al. 2000. Arabidopsis transcription factors: Genome-wide comparative analysis among eukaryotes. Science 290: 2105–10. 650. Riley P, Anson-Cartwright L, Cross JC. 1998. The Hand1 bHLH transcription factor is essential for placentation and cardiac morphogenesis. Nature Genetics 18: 271–5.
651. Rizzi M, Wittenberg J, Coda A, Fasano M, Ascenzi P, Bolognesi M. 1994. Structure of the sulfide-reactive hemoglobin from the clam Lucina pectinata—crystallographic analysis at 1.5-angstrom resolution. Journal of Molecular Biology 244: 86–99. 652. Rodrigues JF, Wagner A. 2009. Evolutionary plasticity and innovations in complex metabolic reaction networks. PLoS Computational Biology 5: e1000613. 653. Rodrigues JF, Wagner A. 2011. Genotype networks, innovation, and robustness in sulfur metabolism. BMC Systems Biology 5: 39. 654. Rogulja-Ortmann A, Technau GM. 2008. Multiple roles for Hox genes in segment-specific shaping of CNS lineages. Fly 2: 316–9. 655. Romano L, Wray G. 2003. Conservation of Endo16 expression in sea urchins despite evolutionary divergence in both cis and trans-acting components of transcriptional regulation. Development 130: 4187–99. 656. Roselius K, Stephan W, Stadler T. 2005. The relationship of nucleotide polymorphism, recombination rate and selection in wild tomato species. Genetics 171: 753–63. 657. Rosenberg A. 1985. The structure of biological science. Cambridge, UK: Cambridge University Press. 658. Ross CA, Poirier MA. 2004. Protein aggregation and neurodegenerative disease. Nature Medicine 10: S10–S7. 659. Rost B. 1997. Protein structures sustain evolutionary drift. Folding & Design 2: S19–S24. 660. Rost B. 2002. Enzyme function less conserved than anticipated. Journal of Molecular Biology 318: 595–608. 661. Rothschild LJ. 2008. The evolution of photosynthesis . . . again? Philosophical Transactions of the Royal Society B—Biological Sciences 363: 2787–801. 662. Ruoff P, Rensing L. 1996. The temperature-compensated Goodwin model simulates many circadian clock properties. Journal of Theoretical Biology 179: 275–85. 663. Ruoff P, Vinsjevik M, Mohsenzadeh S, Rensing L. 1999. The Goodwin model: Simulating the effect of cycloheximide and heat shock on the sporulation rhythm of Neurospora crassa. Journal of Theoretical Biology 196: 483–94. 
664. Ruoff P, Vinsjevik M, Monnerjahn C, Rensing L. 2001. The Goodwin model: Simulating the effect of light pulses on the circadian sporulation rhythm of Neurospora crassa. Journal of Theoretical Biology 209: 29–42. 665. Rust MJ, Markson JS, Lane WS, Fisher DS, O’Shea EK. 2007. Ordered phosphorylation governs oscillation of a three-protein circadian clock. Science 318: 809–12. 666. Rutherford S, Lindquist S. 1998. Hsp90 as a capacitor for morphological evolution. Nature 396: 336–42. 667. Sadava D, Heller C, Orians G, Purves W, Hillis D. 2006. Life: The science of biology. New York: W.H. Freeman. 668. Sali A, Shakhnovich E, Karplus M. 1994. How does a protein fold. Nature 369: 248–51. 669. Sali A, Shakhnovich E, Karplus M. 1994. Kinetics of protein-folding—a lattice model study of the requirements for folding to the native-state. Journal of Molecular Biology 235: 1614–36. 670. Samal A, Rodrigues JFM, Jost J, Martin OC, Wagner A. 2010. Genotype networks in metabolic reaction spaces. BMC Systems Biology 4: 30. 671. Sanchez L, Chaouiya C, Thieffry D. 2008. Segmenting the fly embryo: logical analysis of the role of the Segment Polarity cross-regulatory module. International Journal of Developmental Biology 52: 1059–75. 672. Sanjuan R, Forment J, Elena SF. 2006. In silico predicted robustness of viroid RNA secondary structures. I. The effect of single mutations. Molecular Biology and Evolution 23: 1427–36. 673. Sassanfar M, Szostak JW. 1993. An RNA motif that binds ATP. Nature 364: 550–3. 674. Sato K, Ito Y, Yomo T, Kaneko K. 2003. On the relation between fluctuation and response in biological systems. Proceedings of the National Academy of Sciences of the USA 100: 14086–90. 675. Sawyer SA, Kulathinal RJ, Bustamante CD, Hartl DL. 2003. Bayesian analysis suggests that most amino acid replacements in Drosophila are driven by positive selection. Journal of Molecular Evolution 57: S154–S64. 676. Sawyer SA, Parsch J, Zhang Z, Hartl DL. 2007. Prevalence of positive selection among nearly neutral amino acid replacements in Drosophila. Proceedings of the National Academy of Sciences of the USA 104: 6504–10. 677. Scheiner SM. 1993. Genetics and evolution of phenotypic plasticity. Annual Review of Ecology and Systematics 24: 35–68. 678. Schilling CH, Edwards JS, Palsson BO. 1999. Toward metabolic phenomics: Analysis of genomic data using flux balances. Biotechnology Progress 15: 288–95. 679. Schilling CH, Palsson BO. 2000. Assessment of the metabolic capabilities of Haemophilus influenzae Rd through a genome-scale pathway analysis. Journal of Theoretical Biology 203: 249–83.
680. Schlichting CD. 1986. The evolution of phenotypic plasticity in plants. Annual Review of Ecology and Systematics 17: 667–93. 681. Schlosser G, Wagner G, ed. 2004. Modularity in development and evolution. Chicago, IL: University of Chicago Press. 682. Schmalhausen II. 1949. Factors of evolution. Philadelphia: Blakiston. 683. Schmidt D, Wilson MD, Ballester B, Schwalie PC, Brown GD, et al. 2010. Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding. Science 328: 1036–40. 684. Schultes E, Bartel D. 2000. One sequence, two ribozymes: Implications for the emergence of new ribozyme folds. Science 289: 448–52. 685. Schuster P. 1997. Landscapes and molecular evolution. Physica D 107: 351–65. 686. Schuster P. 2003. Molecular insights into evolution of phenotypes. In Evolutionary dynamics: Exploring the interplay of selection, accident, neutrality, and function, ed. JP Crutchfield, P Schuster, pp. 163–215. New York, NY: Oxford University Press. 687. Schuster P. 2006. Prediction of RNA secondary structures: from theory to models and real molecules. Reports on Progress in Physics 69: 1419–77. 688. Schuster P, Fontana W, Stadler P, Hofacker I. 1994. From sequences to shapes and back—a case-study in RNA secondary structures. Proceedings of the Royal Society of London Series B 255: 279–84. 689. Searcoid MO. 2007. Metric spaces. London, UK: Springer. 690. Segre D, Vitkup D, Church G. 2002. Analysis of optimality in natural and perturbed metabolic networks. Proceedings of the National Academy of Sciences of the USA 99: 15112–7. 691. Shackleton NJ, Backman J, Zimmerman H, Kent DV, Hall MA, et al. 1984. Oxygen isotope calibration of the onset of ice-rafting and history of glaciation in the North-Atlantic region. Nature 307: 620–3. 692. Shahrezaei V, Ejtehadi MR. 2000. Geometry selects highly designable structures. Journal of Chemical Physics 113: 6437–42. 693. Shankarappa R, Margolick JB, Gange SJ, Rodrigo AG, Upchurch D, et al. 1999. 
Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. Journal of Virology 73: 10489–502. 694. Shapiro JA. 2004. A 21st century view of evolution: genome system architecture, repetitive DNA, and natural genetic engineering. Annual Scientific Meeting on Structural Approaches to Sequence Evolution, Dresden, Germany.
695. Shapiro JA, Huang W, Zhang CH, Hubisz MJ, Lu J, et al. 2007. Adaptive genic evolution in the Drosophila genomes. Proceedings of the National Academy of Sciences of the USA 104: 2271–6. 696. Sharp DH, Reinitz J. 1998. Prediction of mutant expression patterns using gene circuits. Biosystems 47: 79–90. 697. Shearman LP, Sriram S, Weaver DR, Maywood ES, Chaves I, et al. 2000. Interacting molecular loops in the mammalian circadian clock. Science 288: 1013–9. 698. Shen-Orr S, Milo R, Mangan S, Alon U. 2002. Network motifs in the transcriptional regulation network of Escherichia coli. Nature Genetics 31: 64–8. 699. Shimeld SM, Holland PWH. 2000. Vertebrate innovations. Proceedings of the National Academy of Sciences of the USA 97: 4449–52. 700. Shirazi N, Benyamin D, Luk W, Cheung PYK, Guo S. 2001. Quantitative analysis of FPGA-based database searching. Journal of VLSI Signal Processing Systems for Signal Image and Video Technology 28: 85–96. 701. Shubin N, Tabin C, Carroll S. 2009. Deep homology and the origins of evolutionary novelty. Nature 457: 818–23. 702. Shubin NH, Alberch P. 1986. A morphogenetic approach to the origin and basic organization of the tetrapod limb. Evolutionary Biology 20: 319–87. 703. Siebenga JJ, Vennema H, Renckens B, Bruin E, van der Veer BD, et al. 2007. Epochal evolution of GGII.4 norovirus capsid proteins from 1995 to 2006. Journal of Virology 81: 9932–41. 704. Siegal M, Bergman A. 2002. Waddington’s canalization revisited: Developmental stability and evolution. Proceedings of the National Academy of Sciences of the USA 99: 10528–32. 705. Siegel JS. 1998. Homochiral imperative of molecular evolution. Chirality 10: 24–7. 706. Simpson GG. 1953. The major features of evolution. New York: Columbia University Press. 707. Skopalik J, Anzenbacher P, Otyepka M. 2008. Flexibility of human cytochromes P450: Molecular dynamics reveals differences between CYPs 3A4, 2C9, and 2A6, which correlate with their substrate preferences. 
Journal of Physical Chemistry B 112: 8165–73. 708. Slotman MA, Reimer LJ, Thiemann T, Dolo G, Fondjo E, Lanzaro GC. 2006. Reduced recombination rate and genetic differentiation between the M and S forms of Anopheles gambiae s.s. Genetics 174: 2081–93. 709. Smith DJ, Lapedes AS, de Jong JC, Bestebroer TM, Rimmelzwaan GF, et al. 2004. Mapping the antigenic and genetic evolution of influenza virus. Science 305: 371–6. 710. Smith HC, Gott JM, Hanson MR. 1997. A guide to RNA editing. RNA-a Publication of the RNA Society 3: 1105–23. 711. Smith JM, Sondhi KC. 1960. The genetics of a pattern. Genetics 45: 1039–50. 712. Smith NGC, Eyre-Walker A. 2002. Adaptive protein evolution in Drosophila. Nature 415: 1022–4. 713. Smith T, Husbands P, O’Shea M. 2001. Neutral networks in an evolutionary robotics search space. Congress on Evolutionary Computation (CEC 2001), Seoul, South Korea. 714. Sniegowski P, Gerrish P, Johnson T, Shaver A. 2000. The evolution of mutation rates: separating causes from consequences. Bioessays 22: 1057–66. 715. Sober E. 1984. The nature of selection. Cambridge, MA: MIT Press. 716. Sober E. 2000. Philosophy of biology. Boulder, CO: Westview Press. 717. Sober E, Wilson DS. 1999. Unto others: The evolution and psychology of unselfish behavior. Cambridge, Massachusetts: Harvard University Press. 718. Socci N, Nymeyer H, Onuchic J. 1997. Exploring the protein folding funnel landscape. Physica D 107: 366–82. 719. Soong N, Nomura L, Pekrun K, Reed M, Sheppard L, et al. 2000. Molecular breeding of viruses. Nature Genetics 25: 436–9. 720. Soyer OS, Pfeiffer T. 2010. Evolution under fluctuating environments explains observed robustness in metabolic networks. PLoS Computational Biology 6(8): e1000907. 721. Sproewitz A, Moeckel R, Maye J, Ijspeert AJ. 2008. Learning to move in modular robots using central pattern generators and online optimization. The International Journal of Robotics Research 27: 423–43. 722. Spurway H. 1949. Remarks on Vavilov’s law of homologous variation. La Ricerca Scientifica Suppl. 19: 3–9. 723. Srivastava D, Thomas T, Lin Q, Kirby ML, Brown D, Olson EN. 1997. Regulation of cardiac mesodermal and neural crest development by the bHLH transcription factor, dHAND. Nature Genetics 16: 154–60. 724. Stadler B, Stadler P, Shpak M, Wagner G. 2002. Recombination spaces, metrics, and pretopologies. Zeitschrift fur Physikalische Chemie 216: 217–34. 725. Stajich JE, Hahn MW. 2005. Disentangling the effects of demography and selection in human history. Molecular Biology and Evolution 22: 63–73. 726. Stanewsky R, Jamison CF, Plautz JD, Kay SA, Hall JC. 1997. Multiple circadian-regulated elements
contribute to cycling period gene expression in Drosophila. EMBO Journal 16: 5006–18. 727. Stayrook S, Jaru-Ampornpan P, Ni J, Hochschild A, Lewis M. 2008. Crystal structure of the lambda repressor and a model for pairwise cooperative operator binding. Nature 452: 1022–25. 728. Stearns SC. 1989. The evolutionary significance of phenotypic plasticity—phenotypic sources of variation among organisms can be described by developmental switches and reaction norms. Bioscience 39: 436–45. 729. Steipe B. 1998. Protein design concepts. In Encyclopedia of computational chemistry, ed. PvR Schleyer, NL Allinger, T Clark, J Gasteiger, PA Kollman, et al, pp. 2168–85. Chichester: Wiley. 730. Stelling J, Klamt S, Bettenbrock K, Schuster S, Gilles ED. 2002. Metabolic network structure determines key aspects of functionality and regulation. Nature 420: 190–3. 731. Stemmer W. 1994. DNA shuffling by random fragmentation and reassembly—in-vitro recombination for molecular evolution. Proceedings of the National Academy of Sciences of the USA 91: 10747–51. 732. Stephan W. 1996. The rate of compensatory evolution. Genetics 144: 419–26. 733. Sternberg PW. 1988. Lateral inhibition during vulval induction in Caenorhabditis elegans. Nature 335: 551–4. 734. Sternberg PW, Horvitz HR. 1986. Pattern formation during vulvar development in C. elegans. Cell 44: 761–72. 735. Sternberg PW, Horvitz HR. 1989. The combined action of 2 intercellular signaling pathways specifies 3 cell fates during vulval induction in C. elegans. Cell 58: 679–93. 736. Stevens M. 2005. The role of eyespots as anti-predator mechanisms, principally demonstrated in the Lepidoptera. Biological Reviews 80: 573–88. 737. Stevens M, Hardman CJ, Stubbins CL. 2008. Conspicuousness, not eye mimicry, makes “eyespots” effective antipredator signals. Behavioral Ecology 19: 525–31. 738. Stevens M, Stubbins CL, Hardman CJ. 2008. The antipredator function of “eyespots” on camouflaged and conspicuous prey. Behavioral Ecology and Sociobiology 62: 1787–93. 739. Stolum HH. 1996. River meandering as a self-organization process. Science 271: 1710–3. 740. Stone J, Wray G. 2001. Rapid evolution of cis-regulatory sequences via local point mutations. Molecular Biology and Evolution 18: 1764–70. 741. Stryer L. 1995. Biochemistry. New York: Freeman.
742. Studer RA, Robinson-Rechavi M. 2009. Evidence for an episodic model of protein sequence evolution. Conference on Protein Evolution Sequences, Structures and Systems, Cambridge, England. 743. Stump AD, Fitzpatrick MC, Lobo NF, Traore S, Sagnon NF, et al. 2005. Centromere-proximal differentiation and speciation in Anopheles gambiae. Proceedings of the National Academy of Sciences of the USA 102: 15930–5. 744. Sultan SE. 1992. Phenotypic plasticity and the neo-Darwinian legacy. Evolutionary Trends in Plants 6: 61–71. 745. Sumedha, Martin OC, Wagner A. 2007. New structural variation in evolutionary searches of RNA neutral networks. Biosystems 90: 475–85. 746. Surrey T, Nedelec F, Leibler S, Karsenti E. 2001. Physical properties determining self-organization of motors and microtubules. Science 292: 1167–71. 747. Sweeney AM, Des Marais DL, Ban YEA, Johnsen S. 2007. Evolution of graded refractive index in squid lenses. Journal of the Royal Society Interface 4: 685–98. 748. Szollosi GJ, Derenyi I. 2008. The effect of recombination on the neutral evolution of genetic robustness. Mathematical Biosciences 214: 58–62. 749. Szollosi GJ, Derenyi I. 2009. Congruent evolution of genetic and environmental robustness in micro-RNA. Molecular Biology and Evolution 26: 867–74. 750. Tacker M, Stadler P, Bornberg-Bauer E, Hofacker I, Schuster P. 1996. Algorithm independent properties of RNA secondary structure predictions. European Biophysics Journal with Biophysics Letters 25: 115–30. 751. Taddei F, Radman M, Maynard-Smith J, Toupance B, Gouyon P, Godelle B. 1997. Role of mutator alleles in adaptive evolution. Nature 387: 700–2. 752. Takahashi A, Liu YH, Saitou N. 2004. Genetic variation versus recombination rate in a structured population of mice. Molecular Biology and Evolution 21: 404–9. 753. Takiguchi M, Matsubasa T, Amaya Y, Mori M. 1989. Evolutionary aspects of urea cycle enzyme genes. Bioessays 10: 163–6. 754. Tan CSH, Pasculescu A, Lim WA, Pawson T, Bader GD, Linding R. 2009. 
Positive selection of tyrosine loss in metazoan evolution. Science 325: 1686–8. 755. Tanabe Y, Hasebe M, Sekimoto H, Nishiyama T, Kitani M, et al. 2005. Characterization of MADS-box genes in charophycean green algae and its implication for the evolution of MADS-box genes. Proceedings of the National Academy of Sciences of the USA 102: 2436–41. 756. Tanaka F, Barbas CF. 2005. Enamine-based reactions using organocatalysts: from aldolase antibodies to small amino acid and amine catalysts. Journal of Synthetic Organic Chemistry Japan 63: 709–21. 757. Tanaka F, Fuller R, Barbas CF. 2005. Development of small designer aldolase enzymes: Catalytic activity, folding, and substrate specificity. Biochemistry 44: 7583–92. 758. Tanay A, Regev A, Shamir R. 2005. Conservation and evolvability in regulatory networks: The evolution of ribosomal regulation in yeast. Proceedings of the National Academy of Sciences of the USA 102: 7203–8. 759. Taverna D, Goldstein R. 2002. Why are proteins marginally stable? Proteins-Structure Function and Genetics 46: 105–9. 760. Taverna D, Goldstein R. 2002. Why are proteins so robust to site mutations? Journal of Molecular Biology 315: 479–84. 761. Taylor SV, Walter KU, Kast P, Hilvert D. 2001. Searching sequence space for protein catalysts. Proceedings of the National Academy of Sciences of the USA 98: 10596–601. 762. Tenaillon MI, Sawkins MC, Long AD, Gaut RL, Doebley JF, Gaut BS. 2001. Patterns of DNA sequence polymorphism along chromosome 1 of maize (Zea mays ssp mays L.). Proceedings of the National Academy of Sciences of the USA 98: 9161–6. 763. Thatcher JW, Shaw JM, Dickinson WJ. 1998. Marginal fitness contributions of nonessential genes in yeast. Proceedings of the National Academy of Sciences of the USA 95: 253–7. 764. Theissen G. 2001. Development of floral organ identity: stories from the MADS house. Current Opinion in Plant Biology 4: 75–85. 765. Theissen G, Becker A, Di Rosa A, Kanno A, Kim JT, et al. 2000. A short history of MADS-box genes in plants. Plant Molecular Biology 42: 115–49. 766. Theissen G, Kim JT, Saedler H. 1996. Classification and phylogeny of the MADS-box multigene family suggest defined roles of MADS-box gene subfamilies in the morphological evolution of eukaryotes. Journal of Molecular Evolution 43: 484–516. 767. Thomas GH, Zucker J, MacDonald SJ, Sorokin A, Goryanin I, Douglas AE. 2009. A fragile metabolic network adapted for cooperation in the symbiotic bacterium Buchnera aphidicola. BMC Systems Biology 3: 24. 768. Thomas JH. 1993. Thinking about genetic redundancy. Trends in Genetics 9: 395–9. 769. Thompson A. 1995. Evolving electronic robot controllers that exploit hardware resources. Third European Conference on Artificial Life Granada, Spain, June 4–6, 1995, Proceedings. Lecture Notes in
Computer Science 929: 640–656 (http://www.springerlink.com/content/hnuu31518612/). 770. Thompson A. 1997. Evolving inherently fault-tolerant systems. Proceedings of the Institution of Mechanical Engineers Part I—Journal of Systems and Control Engineering 211: 365–71. 771. Thompson A, Layzell P. 1999. Analysis of unconventional evolved electronics. Communications of the ACM 42: 71–9. 772. Thompson A, Layzell P, Zebulum R. 1999. Explorations in design space: Unconventional electronics design through artificial evolution. IEEE Transactions on Evolutionary Computation 3: 167–96. 773. Thompson DW. 1942. On growth and form. Cambridge, UK: Cambridge University Press. 774. Thornton J, Orengo C, Todd A, Pearl F. 1999. Protein folds, functions and evolution. Journal of Molecular Biology 293: 333–42. 775. Tirosh I, Reikhav S, Levy AA, Barkai N. 2009. A yeast hybrid provides insight into the evolution of gene expression regulation. Science 324: 659–62. 776. Todd A, Orengo C, Thornton J. 1999. Evolution of protein function, from a structural perspective. Current Opinion in Chemical Biology 3: 548–56. 777. Todd A, Orengo C, Thornton J. 2001. Evolution of function in protein superfamilies, from a structural perspective. Journal of Molecular Biology 307: 1113–43. 778. Tokuriki N, Stricher F, Schymkowitz J, Serrano L, Tawfik DS. 2007. The stability effects of protein mutations appear to be universally distributed. Journal of Molecular Biology 369: 1318–32. 779. Tokuriki N, Tawfik DS. 2009. Chaperonin overexpression promotes genetic variation and enzyme evolution. Nature 459: 668–73. 780. Tokuriki N, Tawfik DS. 2009. Protein dynamism and evolvability. Science 324: 203–7. 781. Tomarev S, Piatigorsky J. 1996. Lens crystallins of invertebrates—diversity and recruitment from detoxification enzymes and novel proteins. European Journal of Biochemistry 235: 449–65. 782. Torresen J, Glette K. 2007. Improving flexibility in online evolvable systems by reconfigurable computing. 7th International Conference on Evolvable Systems—From Biology to Hardware, Wuhan, China. 783. Tortora GJ, Funke BR, Case CL. 1997. Microbiology: an introduction. Menlo Park, CA: Benjamin/Cummings. 784. True JR, Carroll SB. 2002. Gene co-option in physiological and morphological evolution. Annual Review of Cell and Developmental Biology 18: 53–80. 785. Tsien RY. 1998. The green fluorescent protein. Annual Review of Biochemistry 67: 509–44.
786. Tsong AE, Tuch BB, Li H, Johnson AD. 2006. Evolution of alternative transcriptional circuits with identical logic. Nature 443: 415–20. 787. Tuch BB, Li H, Johnson AD. 2008. Evolution of eukaryotic transcription circuits. Science 319: 1797–9. 788. Tuinstra RL, Peterson FC, Kutlesa S, Elgin ES, Kron MA, Volkman BF. 2008. Interconversion between two unrelated protein folds in the lymphotactin native state. Proceedings of the National Academy of Sciences of the USA 105: 5057–62. 789. Turner D, Sugimoto N, Freier S. 1988. RNA structure prediction. Annual Review of Biophysics and Biophysical Chemistry 17: 167–92. 790. Tvrdik P, Capecchi MR. 2006. Reversal of Hox1 gene subfunctionalization in the mouse. Developmental Cell 11: 239–50. 791. Tyson J. 1991. Modeling the cell-division cycle—Cdc2 and cyclin interactions. Proceedings of the National Academy of Sciences of the USA 88: 7328–32. 792. Uriu K, Morishita Y, Iwasa Y. 2009. Traveling wave formation in vertebrate segmentation. Journal of Theoretical Biology 257: 385–96. 793. Van de Peer Y, Maere S, Meyer A. 2009. The evolutionary significance of ancient genome duplications. Nature Reviews Genetics 10: 725–32. 794. Van de Peer Y, Taylor JS, Braasch I, Meyer A. 2001. The ghost of selection past: Rates of evolution and functional divergence of anciently duplicated genes. Journal of Molecular Evolution 53: 436–46. 795. van den Akker E, Fromental-Ramain C, de Graaff W, Le Mouellic H, Brulet P, et al. 2001. Axial skeletal patterning in mice lacking all paralogous group 8 Hox genes. Development 128: 1911–21. 796. van der Meer JR. 1995. Evolution of novel metabolic pathways for the degradation of chloroaromatic compounds. Beijerinck Centennial Symposium on Microbial physiology and gene regulation—emerging principles and applications, The Hague, Netherlands. 797. van der Meer JR, Werlen C, Nishino SF, Spain JC. 1998. Evolution of a pathway for chlorobenzene metabolism leads to natural attenuation in contaminated groundwater. 
Applied and Environmental Microbiology 64: 4185–93. 798. van Nimwegen E, Crutchfield J, Huynen M. 1999. Neutral evolution of mutational robustness. Proceedings of the National Academy of Sciences of the USA 96: 9716–20. 799. van Nimwegen E, Crutchfield JP. 2000. Metastable evolutionary dynamics: Crossing fitness barriers or escaping via neutral paths? Bulletin of Mathematical Biology 62: 799–848.
800. van Nimwegen E, Crutchfield JP, Mitchell M. 1999. Statistical dynamics of the royal road genetic algorithm. Theoretical Computer Science 229: 41–102. 801. van Zon JS, Lubensky DK, Altena PRH, ten Wolde PR. 2007. An allosteric model of circadian KaiC phosphorylation. Proceedings of the National Academy of Sciences of the USA 104: 7420–5. 802. Vandeguchte M VT, Kok J, Venema G. 1991. A possible contribution of messenger-RNA secondary structure to translation initiation efficiency in Lactococcus lactis. FEMS Microbiology Letters 81: 201–8. 803. Vassilev VK, Miller JF. 2000. The advantages of landscape neutrality in digital circuit evolution. 3rd International Conference on Evolvable Systems (ICES 2000), Edinburgh, Scotland. 804. Vavouri T, Semple JI, Lehner B. 2008. Widespread conservation of genetic redundancy during a billion years of eukaryotic evolution. Trends in Genetics 24: 485–8. 805. Vendruscolo M, Maritan A, Banavar J. 1997. Stability threshold as a selection principle for protein design. Physical Review Letters 78: 3967–70. 806. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, et al. 2001. The sequence of the human genome. Science 291: 1304–51. 807. Vermeij GJ. 2006. Historical contingency and the purported uniqueness of evolutionary innovations. Proceedings of the National Academy of Sciences of the USA 103: 1804–9. 808. Veron A, Kaufmann K, Bornberg-Bauer E. 2007. Evidence of interaction network evolution by whole-genome duplications: a case study in MADS-box proteins. Molecular Biology and Evolution 24: 670–8. 809. Vitkup D, Kharchenko P, Wagner A. 2006. Influence of metabolic network structure and function on enzyme evolution. Genome Biology 7: R39. 810. Vitreschak AG, Rodionov DA, Mironov AA, Gelfand MS. 2004. Riboswitches: the oldest mechanism for the regulation of gene expression? Trends in Genetics 20: 44–50. 811. Voigt C, Martinez C, Wang Z, Mayo S, Arnold F. 2002. Protein building blocks preserved by recombination. 
Nature Structural Biology 9: 553–8. 812. von Dassow G, Meir E, Munro E, Odell G. 2000. The segment polarity network is a robust development module. Nature 406: 188–92. 813. von Mering C, Zdobnov EM, Tsoka S, Ciccarelli FD, Pereira-Leal JB, et al. 2003. Genome evolution reveals biochemical networks and functional modules. Proceedings of the National Academy of Sciences of the USA 100: 15428–33.
814. Vrebalov J, Ruezinsky D, Padmanabhan V, White R, Medrano D, et al. 2002. A MADS-box gene necessary for fruit ripening at the tomato ripening-inhibitor (Rin) locus. Science 296: 343–6. 815. Vu T, Hung D, Wheaton V, Coughlin S. 1991. Molecular cloning of a functional thrombin receptor reveals a novel proteolytic mechanism of receptor activation. Cell 64: 1057–68. 816. Waddington CH. 1953. The genetic assimilation of an acquired character. Evolution 7: 118–26. 817. Waddington CH. 1956. Genetic assimilation of the bithorax phenotype. Evolution 10: 1–13. 818. Waddington CH. 1959. The strategy of the genes. New York: Macmillan. 819. Waddington CH. 1960. Experiments on canalizing selection. Genetical Research 1: 140–50. 820. Wagner A. 1996. Does evolutionary plasticity evolve? Evolution 50: 1008–23. 821. Wagner A. 1999. Redundant gene functions and natural selection. Journal of Evolutionary Biology 12: 1–16. 822. Wagner A. 2003. Risk management in biological evolution. Journal of Theoretical Biology 225: 45–57. 823. Wagner A. 2005. Circuit topology and the evolution of robustness in two-gene circadian oscillators. Proceedings of the National Academy of Sciences of the USA 102: 11775–80. 824. Wagner A. 2005. Distributed robustness versus redundancy as causes of mutational robustness. Bioessays 27: 176–88. 825. Wagner A. 2005. Robustness and evolvability in living systems. Princeton, NJ: Princeton University Press. 826. Wagner A. 2005. Robustness, evolvability, and neutrality. FEBS Letters 579: 1772–8. 827. Wagner A. 2007. Energy costs constrain the evolution of gene expression. Journal of Experimental Zoology Part B-Molecular and Developmental Evolution 308B: 322–4. 828. Wagner A. 2008. Gene duplications, robustness and evolutionary innovations. Bioessays 30: 367–73. 829. Wagner A. 2008. Neutralism and selectionism: A network-based reconciliation. Nature Reviews Genetics 9: 965–74. 830. Wagner A. 2008. Robustness and evolvability: A paradox resolved. 
Proceedings of the Royal Society B—Biological Sciences 275: 91–100. 831. Wagner A. 2009. Evolutionary constraints permeate metabolic networks. BMC Evolutionary Biology 9: 231. 832. Wagner A, Stadler P. 1999. Viral RNA and evolved mutational robustness. Journal of Experimental Zoology/Molecular Development and Evolution 285: 119–27.
833. Wagner GP, Amemiya C, Ruddle F. 2003. Hox cluster duplications and the opportunity for evolutionary novelties. Proceedings of the National Academy of Sciences of the USA 100: 14603–6. 834. Wagner GP, Pavlicev M, Cheverud JM. 2007. The road to modularity. Nature Reviews Genetics 8: 921–31. 835. Walter A, Turner D, Kim J, Lyttle M, Muller P, et al. 1994. Coaxial stacking of helixes enhances binding of oligoribonucleotides and improves predictions of RNA folding. Proceedings of the National Academy of Sciences of the USA 91: 9218–22. 836. Wang HM, Jiang Y. 2003. The Tap42-protein phosphatase type 2A catalytic subunit complex is required for cell cycle-dependent distribution of actin in yeast. Molecular and Cellular Biology 23: 3116–25. 837. Wang HM, Wang XD, Jiang Y. 2003. Interaction with Tap42 is required for the essential function of Sit4 and type 2A phosphatases. Molecular Biology of the Cell 14: 4342–51. 838. Wang Z. 1998. A re-estimation for the total numbers of protein folds and superfamilies. Protein Engineering 11: 621–6. 839. Wang Z, Zhang J. 2009. Abundant indispensable redundancies in cellular metabolic networks. Genome Biology and Evolution 1: 23–33. 840. Watanabe H, Takehana K, Date M, Shinozaki T, Raz A. 1996. Tumor cell autocrine motility factor is the neuroleukin/phosphohexose isomerase polypeptide. Cancer Research 56: 2960–3. 841. Waterman M, Smith T. 1978. RNA secondary structure—complete mathematical analysis. Mathematical Biosciences 42: 257–66. 842. Waters CM, Bassler BL. 2005. Quorum sensing: Cell-to-cell communication in bacteria. Annual Review of Cell and Developmental Biology 21: 319–46. 843. Wayne LG, Kubica GP. 1986. The mycobacteria. In Bergey’s manual of systematic bacteriology, ed. PHA Sneath, NS Mair, ME Sharpe, JG Holt. London, UK: Williams & Wilkins. 844. Wcislo WT. 1989. Behavioral environments and evolutionary change. Annual Review of Ecology and Systematics 20: 137–69. 845. Weinreich DM, Chao L. 2005. 
Rapid evolutionary escape by large populations from local fitness peaks is likely in nature. Evolution 59: 1175–82. 846. Weinreich DM, Delaney NF, DePristo MA, Hartl DL. 2006. Darwinian evolution can follow only very few mutational paths to fitter proteins. Science 312: 111–4. 847. Weirauch MT, Hughes TR. 2010. Conserved expression without conserved regulatory sequence: The more things change, the more they stay the same. Trends in Genetics 26: 66–74. 848. West-Eberhard MJ. 2003. Developmental plasticity and evolution. Oxford, UK: Oxford University Press. 849. Wierenga RK. 2001. The TIM-barrel fold: A versatile framework for efficient enzymes. FEBS Letters 492: 193–8. 850. Wilke CO, Lenski RE, Adami C. 2003. Compensatory mutations cause excess of antagonistic epistasis in RNA secondary structure folding. BMC Evolutionary Biology 3: 3. 851. Wilkins A. 1997. Canalization: a molecular genetic perspective. Bioessays 19: 257–62. 852. Williams GC. 1966. Adaptation and natural selection. Princeton, NJ: Princeton University Press. 853. Williams P, Pollock D, Goldstein R. 2001. Evolution of functionality in lattice proteins. Journal of Molecular Graphics & Modelling 19: 150–6. 854. Wills PR. 1993. Self-organization of genetic coding. Journal of Theoretical Biology 162: 267–87. 855. Wilson LAB, Sanchez-Villagra MR. 2010. Diversity trends and their ontogenetic basis: an exploration of allometric disparity in rodents. Proceedings of the Royal Society B-Biological Sciences 277: 1227–34. 856. Wittkopp PJ, Haerum BK, Clark AG. 2004. Evolutionary changes in cis and trans gene regulation. Nature 430: 85–8. 857. Wittkopp PJ, Haerum BK, Clark AG. 2008. Regulatory changes underlying expression differences within and between Drosophila species. Nature Genetics 40: 346–50. 858. Woese CR. 1965. On the evolution of the genetic code. Proceedings of the National Academy of Sciences of the USA 54: 1546–52. 859. Woese CR, Pace NR. 1993. Probing RNA structure, function, and history by comparative analysis. In The RNA world, ed. RF Gesteland, JF Atkins, pp. 91–117. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press. 860. Wolf Y, Brenner S, Bash P, Koonin E. 1999. Distribution of protein folds in the three superkingdoms of life. Genome Research 9: 17–26. 861. Wray G, Hahn M, Abouheif E, Balhoff J, Pizer M, et al. 2003. The evolution of transcriptional regulation in eukaryotes. Molecular Biology and Evolution 20: 1377–419. 862. Wright APH, Gustafsson JA. 1991. Mechanism of synergistic transcriptional transactivation by the human glucocorticoid receptor. Proceedings of the National Academy of Sciences of the USA 88: 8283–7. 863. Wroe R, Chan H-S, Bornberg-Bauer E. 2007. A structural model of latent evolutionary potentials underlying neutral networks in proteins. HFSP Journal 1: 79–87. 864. Wuchty S, Fontana W, Hofacker I, Schuster P. 1999. Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers 49: 145–65. 865. Wuchty S, Fontana W, Hofacker IL, Schuster P. 1999. Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers 49: 145–65. 866. Wullschleger S, Loewith R, Hall MN. 2006. TOR signaling in growth and metabolism. Cell 124: 471–84. 867. Wyckoff GJ, Wang W, Wu CI. 2000. Rapid evolution of male reproductive genes in the descent of man. Nature 403: 304–9. 868. Xia Y, Levitt M. 2002. Roles of mutation and recombination in the evolution of protein thermodynamics. Proceedings of the National Academy of Sciences of the USA 99: 10382–7. 869. Xiong J, Bauer CE. 2002. Complex evolution of photosynthesis. Annual Review of Plant Biology 53: 503–21. 870. Xu W, Seiter K, Feldman E, Ahmed T, Chiao J. 1996. The differentiation and maturation mediator for human myeloid leukemia cells shares homology with neuroleukin or phosphoglucose isomerase. Blood 87: 4502–6. 871. Xu Y, Mori T, Johnson CH. 2003. Cyanobacterial circadian clockwork: roles of KaiA, KaiB and the kaiBC promoter in regulating KaiC. Embo Journal 22: 2117–26. 872. Yamaguchi Y, Gojobori T. 1997. Evolutionary mechanisms and population dynamics of the third variable envelope region of HIV within single hosts. Proceedings of the National Academy of Sciences of the USA 94: 1264–9. 873. Yan OY, Andersson CR, Kondo T, Golden SS, Johnson CH. 1998. Resonating circadian clocks enhance fitness in cyanobacteria. Proceedings of the National Academy of Sciences of the USA 95: 8660–4. 874. Yang W, Bielawski JP, Yang ZH. 2003. Widespread adaptive evolution in the human immunodeficiency virus type 1 genome. Journal of Molecular Evolution 57: 212–21. 875. Yang ZH, Nielsen R. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Molecular Biology and Evolution 17: 32–43. 876. Yang ZH, Nielsen R. 2002. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Molecular Biology and Evolution 19: 908–17. 877. Yelon D, Ticho B, Halpern ME, Ruvinsky I, Ho RK, et al. 2000. The bHLH transcription factor Hand2 plays parallel roles in zebrafish heart and pectoral fin development. Development 127: 2573–82. 878. Yi TM, Huang Y, Simon MI, Doyle J. 2000. Robust perfect adaptation in bacterial chemotaxis through integral feedback control. Proceedings of the National Academy of Sciences of the USA 97: 4649–53. 879. Yomo T, Ito Y, Sato K, Kaneko K. 2005. Phenotypic fluctuation rendered by a single genotype and evolutionary rate. Physica A-Statistical Mechanics and Its Applications 350: 1–5. 880. Yu T, Miller J. 2001. Neutrality and the evolvability of Boolean function landscape. 4th European Conference on Genetic Programming (EuroGP 2001), Como, Italy. 881. Yu T, Miller JF. 2006. Through the interaction of neutral and adaptive mutations, evolutionary search finds a way. Artificial Life 12: 525–51. 882. Yu TN. 2007. Program evolvability under environmental variations and neutrality. 9th European Conference on Artificial Life, Lisbon, Portugal. 883. Yus E, Maier T, Michalodimitrakis K, van Noort V, Yamada T, et al. 2009. Impact of genome reduction on bacterial metabolism and its regulation. Science 326: 1263–8. 884. Zabrocki P, Van Hoof C, Goris J, Thevelein JM, Winderickx J, Wera S. 2002. Protein phosphatase 2A on track for nutrient-induced signalling in yeast. Molecular Microbiology 43: 835–42. 885. Zhang C. 1997. Relations of the numbers of protein sequences, families and folds. Protein Engineering 10: 757–61.
886. Zhang C, DeLisi C. 1998. Estimating the number of protein folds. Journal of Molecular Biology 284: 1301–5. 887. Zhang JZ, Nielsen R, Yang ZH. 2005. Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level. Molecular Biology and Evolution 22: 2472–9. 888. Zhang Y, Perry K, Vinci V, Powell K, Stemmer W, del Cardayre S. 2002. Genome shuffling leads to rapid phenotypic improvement in bacteria. Nature 415: 644–6. 889. Zheng W, Zhao HY, Mancera E, Steinmetz LM, Snyder M. 2010. Genetic analysis of variation in transcription factor binding in yeast. Nature 464: 1187–91. 890. Zhou Q, Zhang GJ, Zhang Y, Xu SY, Zhao RP, et al. 2008. On the origin of new genes in Drosophila. Genome Research 18: 1446–55. 891. Zinzen RP, Girardot C, Gagneur J, Braun M, Furlong EEM. 2009. Combinatorial binding predicts spatiotemporal cis-regulatory activity. Nature 462: 65–70. 892. Zuker M, Sankoff D. 1984. RNA secondary structures and their prediction. Bulletin of Mathematical Biology 46: 591–621. 893. Zuker M, Stiegler P. 1981. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research 9: 133–48. 894. Zwanzig R, Szabo A, Bagchi B. 1992. Levinthal’s paradox. Proceedings of the National Academy of Sciences of the USA 89: 20–2.
Index
ABC model of flower development 125 abstraction 80 Acacia trees 180 adaptation adaptive landscape 80 design of adaptive systems 211 phenotypic plasticity role 181–2 speed of 149–50 AGAMOUS (AG) gene family 125 aldolase 156–7, 156 aldosterone 103, 180 alleles fixation 93, 94 frequency 93, 94 mutator alleles 121 neutrality tests 95–7 polymorphisms 96 α-interferons 142 Ambystoma mexicanum 159 amino acid hydrophobicity 149–50 ammonia detoxification 7–8 Anser indicus 13 antibody diversity 81 antifreeze proteins 11–13, 13 Antirrhinum majus 125 Arabidopsis thaliana 125, 128–9 asymmetric traits 180 atavisms 150 ATP-binding proteins 64 Bacillus subtilis 173, 174 background selection 95 bacteriophage lambda repressor 64, 74 bar-headed goose (Anser indicus) 13 beaks 9–11 β-lactamase 138, 139 bone morphogenetic protein 4 (bmp4) 10 boom and bust cycles of diversity 101–2, 102 bovine carbonic anhydrase 175 Buchnera aphidicola 153 butterfly eyespots 8–9, 8
Caenorhabditis elegans 191–5, 193 calmodulin 10 Candida albicans 34 carbon sources 18 Cardamine hirsuta 9, 10 cell fate specification, C. elegans 191–5, 193 cephalosporinases 134 chaperone proteins 97–8 chlorophylls 7 chloroplasts 173 chorismate mutase 64, 129 chromosomes 95 chymotrypsin 175 circadian oscillator models 188–90 connectedness of parameter space 188–9, 190 topological circuit variants 189–91, 192 circuit set 204 see also digital logic circuits circuit space 200, 204 neutral networks 204–5, 206, 207–11, 209, 210 see also digital logic circuits cobalamin 169, 170 colchicine 159 compensatory mutations 103 complexity environmental variation and 152–7, 155 technological systems, related to robustness and evolvability 211–13, 212, 213 computers see technological systems connectedness in circadian oscillator models 188–9, 190, 191 in digital logic circuit networks 204 in genotype networks 41, 89 accessibility of novel phenotypes and 77–8, 77, 78 constraints see evolutionary constraints
continuous systems 73, 186 circadian oscillator model 188–91, 190, 192 conceptual problems 186–7 continuous genotype spaces 187–8 developmental gene circuit 191–5 eukaryotic signaling circuit 194, 195–7 cortisol 103, 179–80 cryptic variation 148–9 crystallins 9 cubes 84, 84 cyanobacteria 188–9 cyclopia 159 cytochrome P450 97, 176 Danio rerio (zebrafish) 127 Darwin, Charles 1–2 Darwin’s finches 10–11 de Vries, Hugo 2 developmental constraints 159–60, 165–6 digit number in amphibians 159–60 digital logic circuits 199–200, 201 engineering issues 205–7 neutral networks 204–5, 206, 207–11, 209, 210 diverse phenotypes in different neighborhoods 207, 208 number of circuits 200–4, 203 see also technological systems discretization 80 Distal-less 9 diverse genotypic neighborhoods 75–7, 76, 90–1, 191–7 developmental gene circuit 191–5 eukaryotic signaling circuit 194, 195–7 DNA shuffling 134, 142 Drosophila 39, 159, 160 bristles 162 second thorax development 179 segmentation 160 duplications see gene duplications
ectoine 6–7 Endo 16 gene 35 envelope protein (env) 104, 105 environment 143 variation related to system complexity and robustness 152–7, 155 environmental change 143–5, 216–17 genotype networks and 145–8, 145, 146, 147 rapid change 144 slow change 143–4, 148–9 speed of evolutionary adaptation 149–50 enzymes see proteins epistasis 102–4 Escherichia coli 11 biomass molecules 20 gene turnover 23 metabolic network 24, 26, 153 regulatory circuit 35 even-skipped gene 35 evolutionary algorithms 198–9 evolutionary constraints 158, 217 causes of 159–60 consequences of 166–7 developmental constraints 159–60, 165–6 genetic constraints 158–9, 163–5, 164, 165 genotype constraints 167–9, 168 local constraints 158 phenotypic constraints 17, 158–9 study of 160–1 physicochemical constraints 159, 161, 165 selective constraints 159, 161–3, 165 universal constraints 158–9 evolutionary developmental biology 4–5 evolutionary innovation see innovation evolvability 15 technological systems, complexity and 211–13, 213 evolvable hardware 199 exaptations 106 eyes 1 lenses 9 eyespots 8–9, 8 fault-tolerant circuitry 205–7 fibronectin 49–50 field-programmable gate array (FPGA) 199–200
fitness 95, 97 fixation 93, 94 flowering plant radiation 125, 126 flux balance analysis 21–2 frogs digit number 159–60 teeth in lower jaw 159 Gal4p 34 galactitol 31 galactose metabolism 34 gene duplications 16, 124, 216 flowering plant radiation 125, 126 heart evolution 127, 128 novel phenotype accessibility and 129–31, 130 regulatory circuits 129–31 robustness and 128–9 vertebrate radiation 126–7 gene expression phenotype 34 accessibility of novel phenotypes 43, 44, 77–8, 77, 78 different regulatory mechanisms 34–5 different topology 42, 42 preservation of following recombination 134–7, 136 gene turnover 23 genetic algorithms 198 genetic assimilation 178–9 detection of in the wild 179–80, 179 prevalence of 179–81 genetic constraints 159, 163–5, 164, 165 genetic draft 95 genetic drift 94–5 genetic load 140–1, 141 genotype 2, 68–70 constraints on variation 167–9, 168 link to phenotype 4, 187 more genotypes than phenotypes 70–1 neighbors 14, 20 neutral neighbors 86–90 see also genotype networks protein genotypes 47, 54–6, 68, 69, 70–1 regulatory circuits 68–9, 69 genotypes with the same phenotype 40–3, 40 number of genotypes 38–40, 39 RNA genotypes 47, 68, 69, 71 robustness 184 understanding of 3 see also metabolic genotype
genotype distance metabolic networks 20, 23–4, 23, 72 proteins and RNA 72 regulatory circuits 38, 72 genotype networks 13–15, 24–6, 25, 52, 71–2, 214–15 accessibility of novel phenotypes 43, 44, 77–8, 77, 78 as graphs 83 connectedness 41, 77–8, 77, 78, 89 environmental change and 145–8, 145, 146, 147 experimental evidence 65–7 history of concept 14–15 interwoven nature of 43–5, 45, 79 metabolic see metabolic genotype neighborhood phenotypes 28–31, 28, 30, 41 diversity 75–7, 76, 90–1, 191–7 neutral neighbors 86–9 necessity of 89–90 phenotypic plasticity and 176–8, 177 protein genotype networks 48–51, 54–5, 71–2 novel phenotypes in different neighborhoods 55–6, 55 regulatory circuits 40–3, 71 RNA genotype networks 51–2, 57, 71–2 close proximity of different networks 62 heterogeneity 57–62, 58, 60, 61 novel phenotypes in different neighborhoods 62, 63 size of 122–3, 122 role in innovation 64–7, 66 self-organization 91–2 size of 64, 73–4, 74, 214 consequences 73–5 related to phenotype 27 RNA genotype networks 122–3, 122 see also metabolic networks genotype sets 40–3, 214 see also genotype networks genotype space 163, 214 as a graph 83 continuous 187–8 protein genotype space 47 regulatory circuits 38–9 RNA genotype space 47 size of 64, 73–4, 214 consequences 73–5 genotypic memory 150–2, 151 Geospiza finches 10–11
giant component 89 globins 48–9, 51 evolutionary relationships 48–9, 50 glucocorticoid receptor 103, 179–80 glycine betaine 6–7 graphs 40, 83 components 89 giant component 89 genotype networks as graphs 83 hypercube graphs 83–6 random graphs 86–9, 88 green fluorescent protein 173, 174, 181 guide RNA 74 halophilic bacteria 6–7 Hammerhead ribozyme 60–2, 61 Hand gene 126 heart evolution 127, 128 hemagglutinin 101–2, 102, 104 hemerythrin 117 hemoglobin 48 oxygen affinity 13 high-dimensional spaces 78 hitchhiking 95 homologous recombination 132 honeybees 172 Hox genes 126–7, 129 human immunodeficiency virus (HIV) envelope protein (env) 104, 105 hydrophobicity 149–50 hypercubes 83–6, 84 immunity 81 influenza virus 101–2, 102, 104 innovability 15 prerequisites for 77–8 innovation 1 at the origin of life 81 random change and 92 see also innovability; theory of innovation interwoven nature of genotype networks 43–5, 45, 79 isocitrate dehydrogenases (IDHs) 11 β-isopropylmalate dehydrogenases (IMDHs) 11 Kimura, Motoo 93 KNOX (KNOTTED-like homeobox) 9, 10 Kyoto Encyclopedia of Genes and Genomes (KEGG) database 23 L-ribulose-5-phosphate 4-epimerase (L-Ru5P) 11, 12
lateral gene transfer 133 lattice proteins 52–4, 53 recombination effects 137–8 leaf dissection 9, 10 Leishmania tarentolae guide RNA 74 Lucina pectinata 48 McDonald–Kreitman test 96 macromolecules 11–13, 214 see also proteins; RNA MADS box gene duplications 125, 126 Maynard Smith, John 14 MEF2 gene 127 melibiose 29–31 metabolic genotype 19–20, 19, 69, 69 genotype distance 20, 23–4, 23, 72 metabolic phenotype determination from 21–2 neighborhood 20 neighbors 20, 72 number of with a given phenotype 27–8 see also genotype metabolic networks 18, 26, 214 addition of enzyme-coding genes 22–3 diversity 23–4, 23 environmental variation and 152–4, 155 essential reactions 26–7, 26 evolution of 22–4, 71 evolutionary constraints 161, 163–4, 169 principal component analysis 163–4, 164 genotype distance 20, 23–4, 23, 72 neighbors 20, 72 phenotype 69–70, 69 neighbors with the same phenotype 26–7 networks with different phenotypes 31, 32 see also metabolic phenotype viable network 21 see also genotype networks metabolic pathways 5–8 metabolic phenotype 19, 20–1, 69–70, 69 determination from metabolic genotype 21–2 evolution of novel phenotypes 28–31 see also phenotype micro RNA genes 162 microbial metabolism 6–7 mineralocorticoid receptor 103, 180 minimal networks 154
molecular exaptations 106 mutations 93, 101 compensatory mutations 103 epistasis 102–4 in duplicate genes 128 neutral 94–5 evolutionary dynamics of 94–5 neutrality tests 95–7 see also neutralism random 92 versus recombination 138–9, 140–2 mutator alleles 121 Mycobacterium tuberculosis 22 myoglobin 48 Myoxocephalus 12–13 natural selection 1–2, 91–2 selectionism 16, 93 shifting foci of selection 104–6, 105 nature versus nurture 182 neighbors 14, 20, 38 digital logic circuits 200, 202 metabolic networks 20, 72 neutral neighbors 73, 86–9 necessity of 89–90 proteins and RNA 72 regulatory circuits 72 with the same phenotype 26–7 neutral change 199 neutral mutations 94–5 evolutionary dynamics of 94–5 neutrality tests 95–7 neutral neighbors 73, 86–9 necessity of 89–90 neutral networks 15 in digital logic circuits 204–5, 206, 207–11, 209, 210 diverse phenotypes in different neighborhoods 207, 208 neutralism 16, 93, 215–16 broad and narrow sense 93–4 supporting molecular phenotype data 97–8 synthesizing neutralism and selectionism 98–101, 99, 100 neutrality tests 95–7 novel phenotypes, accessibility of 43, 44, 77–8, 77, 78 gene duplications and 129–31, 130 phenotype robustness and 109–15, 110, 116 small and large populations 113–15, 116 origin of life 81
Paired box (Pax) proteins 168–9, 168 parental similarity 135–7, 137 pentachlorophenol metabolism 6, 6 persister forms of bacteria 173 phenotype 2, 68–70 accessibility of novel phenotypes 43, 44, 77–8, 77, 78 gene duplications and 129–31, 130 phenotype robustness and 109–15, 110, 116 small and large populations 113–15, 116 differences between RNA and protein phenotypes 64–5 digital logic circuits 200 diverse phenotypes in different neighborhoods 207, 208 diversity of genotype network neighborhoods 75–7, 76, 90–1, 191–7 developmental gene circuit 191–5 eukaryotic signaling circuit 194, 195–7 evolution of phenotypic variability 121–3 evolutionary constraints 17, 158–9 causes of 159–60 consequences of 166–7 study of 160–1 link to genotype 4, 187 more genotypes than phenotypes 70–1 “phenotype first” perspective 182 protein phenotypes 47–8, 68, 69, 70–1 variability in the neighborhood of different genotypes 55–6, 55 see also proteins regulatory circuits 37, 69, 69 number of phenotypes 38–40, 39 RNA phenotypes 48, 68, 69, 71 different genotypes with the same phenotype 59–60, 59 robustness 122–3 variability in the neighborhood of different genotypes 62, 63 see also RNA robust phenotypes 109–13, 110, 112, 121–2 genotype diversification and 112 plasticity and 182–4, 183 understanding of 3–4
see also gene expression phenotype; metabolic phenotype phenotypic plasticity 17, 144, 216 adaptive significance 181–2 genetic assimilation 178–9 detection of in the wild 179–80, 179 prevalence of 179–81 genotype networks and 176–8, 177 proteins 174–5, 175, 176 RNA 175, 181–2 robustness and 182–4, 183 variation among genotypes 176 widespread nature of 172–6 phenotypic variability 15 photosynthesis 1 chlorophylls 7 physicochemical constraints 159, 161, 165 plastogenetic congruence 179 polymorphisms 96 recombination and 96 population genetics 4 population size 94–5 population-level processes 4 principal component analysis 163–4, 164, 165 protein engineering 98, 176 protein kinases 161 proteins 11–13, 214 amino acid hydrophobicity 149–50 chaperone proteins 97–8 domains 49–51 zinc finger domain 98 environmental variation and 156–7 evolutionary constraints 161, 163, 167–9 principal component analysis 164, 165 folding 52–5, 53 globular proteins 161 genotype distance 72 genotype networks 48–51, 54–6, 71–2 novel phenotypes in different neighborhoods 55–6, 55 genotype sets 51 genotypes 47, 68, 69, 70–1 neighbors 72 phenotypes 47–8, 68, 69, 70–1 differences from RNA phenotypes 64–5 function phenotypes 70–1 variability in the neighborhood of different genotypes 55–6, 55 phenotypic plasticity 174–5, 175, 176
recombination 133 preservation of structure and function 137–9, 139 robustness 115–17 functional innovations and 117–19, 118, 119, 120 structures 47–8, 64–5 see also specific proteins punctuated evolution 166 random graphs 86–9, 88 randomness 92 rapamycin 195–6 recombination 16, 132, 216 different kinds of 132–4 genetic load and 140–1, 141 polymorphisms and 96 power of 134 preservation of existing expression phenotypes 134–7, 136 preservation of protein structure and function 137–9, 139 robustness to 139–42 versus mutation 138–9, 140–2 reconfigurable hardware 199 regulation 8–11, 33–4 multiple ways to produce the same gene expression phenotype 34–5 see also regulatory circuits regulatory circuits 35–8, 36, 214 computational model 36–8 environmental variation and 156 evolutionary constraints 161, 162, 169 gene duplications and 129–31 genotype 68–9, 69 genotypes with the same phenotype 40–3, 40 number of genotypes 38–40, 39 genotype distance 38, 72 genotype networks 40–3, 71 genotype space 38–9, 40 genotypic memory 150–2, 151 neighborhood 38 neighbors 72 phenotype 37, 69, 69 number of phenotypes 38–40, 39 recombination in 133, 140–2 genetic load and 140–1, 141 preservation of existing expression phenotypes 134–7, 136 topology 38 vulval development in C. elegans 191–5, 194 repetitive DNA 132
ribosomal protein coding genes 34 ribozymes 98 genotype networks 64–5, 66 Hammerhead ribozyme 60–2, 61 RNA 11–13, 214 genotype distance 72 genotype networks 51–2, 57, 71–2 close proximity of different networks 62 heterogeneity 57–62, 58, 60, 61 novel phenotypes in different neighborhoods 62, 63 size of 122–3, 122 genotype sets 51–2, 57 genotypes 47, 68, 69, 71 guide RNA 74 micro RNA genes 162 neighbors 72 phenotypes 48, 68, 69, 71 differences from protein phenotypes 64–5 different genotypes with the same phenotype 59–60, 59 robustness 122–3 variability in the neighborhood of different genotypes 62, 63 phenotypic plasticity 175, 181–2 structure 14–15, 48, 49, 64–5 prediction 56–7 robustness 16–17, 73, 107, 215, 216 environmental variation role 152–7 gene duplications and 128–9 genotype robustness 184 mutual reinforcement of 108 necessity of for innovation 107–8 proteins 115–17 functional innovations and 117–19, 118, 119, 120 zinc finger domain 98 robust phenotypes 109–13, 110, 112, 121–2 accessibility of novel phenotypes and 109–15, 110, 116
genotype diversification and 112 plasticity and 182–4, 183 RNA phenotypic robustness 122–3 studies of artificial systems 108–9 technological systems, related to complexity 211–13, 212, 213 to recombination 139–42 Rossmann fold 54 Saccharomyces cerevisiae 34–5, 39, 174, 176, 195–7 Sagittaria sagittifolia 172, 173 salamander 159 Schizosaccharomyces pombe 35 Schuster, Peter 14, 15 selectionism 16, 93, 215–16 broad and narrow sense 93–4 supporting data 95–7 synthesizing neutralism and selectionism 98–101, 99, 100 selective constraints 159, 161–3, 165 self-organization 91–2 SEP genes 125, 129 serum paraoxonase 97 shape-space covering 62 Sphingomonas chlorophenolica 6 stabilizing selection 162 stasis 166–7, 167 steroid hormone receptors 103, 179–80 Streptomyces fradiae 134 suborganismal plasticity 173 systems 5 complexity, environmental variation and 152–7, 155 macromolecules 11–13 metabolic pathways 5–8 regulation 8–11 system class 80–1 see also specific systems; technological systems
technological systems 17, 198, 217 complexity related to robustness and evolvability 211–13, 212, 213 design of adaptive systems 211 digital logic circuits 199–200, 201 diverse phenotypes in different neighborhoods 207, 208 engineering issues 205–7 neutral networks 204–5, 206, 207–11, 209, 210 number of circuits 200–4, 203 evolutionary approaches 198–9 theory of innovation 1–2 essential properties of 2–3 minimal requirements of 79–80 information requirements 3–4 see also innovation topoisomerase 117 TOR signaling circuit 194, 195–7 transcription 33 initiation 174 transcription factors 161 in flower development 125 transcriptional regulation 33–4 see also regulatory circuits transposable elements 132–3 triosephosphate isomerase (TIM)-barrel domain 50–1, 54 tylosin 134 urea cycle 7–8, 7 validation 80–1 vertebrate radiation 126–7 vitamin B12 169, 170 vulval development, C. elegans 191–5, 193 yeast cell mating types 34 zebrafish 127 zinc finger domain 98