Proceedings of the Deuxiemes Entretiens de Bures
FOLDING AND SELF-ASSEMBLY OF BIOLOGICAL MACROMOLECULES INSTITUT DES HAUTES ETUDES SCIENTIFIQUES
editors
E Westhof, N Hardy

organizers
A Carbone, M Gromov, F Kepes, E Westhof

World Scientific
Proceedings of the Deuxiemes Entretiens de Bures
FOLDING AND SELF-ASSEMBLY OF BIOLOGICAL MACROMOLECULES edited by
E Westhof Institut de Biologie Moleculaire et Cellulaire Universite Louis-Pasteur Strasbourg, France
N Hardy
INSTITUT DES HAUTES ETUDES SCIENTIFIQUES, Bures-sur-Yvette, France, 27 November - 1 December 2001
World Scientific  NEW JERSEY • LONDON • SINGAPORE • SHANGHAI • HONG KONG • TAIPEI • CHENNAI
Published by World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: Suite 202, 1060 Main Street, River Edge, NJ 07661 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
Model of the secondary and tertiary structure of the catalytic RNA component of bacterial ribonuclease P. Courtesy of Dr. Fabrice Jossinet (IBMC-CNRS, Universite Louis Pasteur, Strasbourg, France).
FOLDING AND SELF-ASSEMBLY OF BIOLOGICAL MACROMOLECULES Proceedings of the Deuxiemes Entretiens de Bures Copyright © 2004 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-238-500-2
This book is printed on acid-free paper. Printed in Singapore by Mainland Press
ORGANIZERS Alessandra Carbone (IHES, Bures-sur-Yvette, France) Misha Gromov (IHES, Bures-sur-Yvette, France) Francois Kepes (CNRS-Genopole®, Evry, France) Eric Westhof (Universite Louis Pasteur, Strasbourg, France)
SPEAKERS Steven Benner (University of Florida, Gainesville, FL, USA) Antoine Danchin (Institut Pasteur, Paris, France, and Hong-Kong) Marc Delarue (Institut Pasteur, Paris, France) Izrail Gelfand (Rutgers University, Piscataway, NJ, USA) Nobuhiro Go (Kyoto University, Japan) Herve Isambert (Universite Louis Pasteur, Strasbourg, France) Jean-Francois Joanny (Institut Curie, Paris, France) John E. Johnson (Scripps Research Institute, La Jolla, CA, USA) Alexander Kister (Rutgers University, Piscataway, NJ, USA) Tanja Kortemme (University of Washington, Seattle, WA, USA) Olivier Lichtarge (Baylor University, Houston, TX, USA) Francois Michel (CNRS-Centre de Genetique Moleculaire, Gif-sur-Yvette, France) Leonid Mirny (Harvard-MIT, Cambridge, MA, USA) David Sankoff (Universite de Montreal, Canada) Peter Schuster (Universitat Wien, Vienna, Austria) Devarajan Thirumalai (University of Maryland, College Park, MD, USA) Eric Westhof (Universite Louis Pasteur, Strasbourg, France) James R. Williamson (Scripps Research Institute, La Jolla, CA, USA) Sarah Woodson (Johns Hopkins University, Baltimore, MD, USA) Michael Zuker (Rensselaer Polytechnic Institute, Troy, NY, USA)
EDITOR'S NOTE
Folding and Self-assembly of Biological Macromolecules, the title of these proceedings of the Deuxiemes Entretiens de Bures, is a major focus of contemporary research in structural biology. Between November 27th and December 1st, 2001, some twenty leading researchers met to report their recent results on this subject at the Institut des Hautes Etudes Scientifiques in Bures-sur-Yvette (France). They interacted with an audience of more than 150 specialists from a wide range of scientific disciplines, including bioinformatics, biophysics, chemistry, genomics, mathematics, molecular biology, theoretical physics, and virology. In French, entretiens are interactive scientific conferences characterized by lively exchanges among the participants. The proceedings of the first Entretiens, held at the IHES between December 2nd and 6th, 1997, were published as Pattern Formation in Biology, Vision and Dynamics, edited by the conference organizers, Alessandra Carbone, Misha Gromov, and Przemyslaw Prusinkiewicz.

Participants at the Deuxiemes Entretiens explored the folding pathways and mechanisms by which constituent residues interact to yield native biological macromolecules (catalytic RNA molecules and functional proteins), and how ribosomes and other macromolecular complexes self-assemble. These proceedings are the transcribed harvest of fourteen of the talks delivered at the conference, as well as the corresponding verbal exchanges, all of which were professionally captured on videotape by Francois Tisseyre's Atelier EcoutezVoir. Four native English speakers, Valerie Lafitte, Carol Madden, David Sindell, and Sean Newhouse, transcribed the talks and audience interaction. The texts were then pre-edited and sent to the speakers for self-review and correction.
Roberto Incitti, scientific coordinator of the IHES Mathematics: Molecular Biology program, maintained a dedicated Internet website for importing the corrected texts and figures provided by the speakers to illustrate their manuscripts. The IHES graphic artist, Marie-Claude Vergne, then prepared the figures according to the publisher's specifications. Helga Dernois, the IHES scientific secretary responsible for producing the final manuscript of the proceedings, patiently processed several preliminary drafts of the manuscripts before adapting the final versions to the publisher's "style-file." Both editors verified successive versions of the manuscripts after the speakers had made their corrections, prior to obtaining permission to publish the final versions. The preface by Eric Westhof, based on his opening remarks at the Entretiens, is a synopsis of the talks, as well as speculation on the future of the RNA- and protein-folding field. The result of these efforts is contained in these pages, which we hope you will find attains a level of interest commensurate with that of the Entretiens itself.

N. Hardy
PREFACE ERIC WESTHOF Institut de Biologie Moleculaire et Cellulaire, CNRS and Universite Louis Pasteur, Strasbourg, France
In 1988, while Editor-in-Chief of Nature, John Maddox regularly discussed the evolution of science in his columns and worried about the development of molecular biology. Data were already accumulating so quickly that their assimilation threatened to deadlock any conceptual framework attempting to rationalize this vast quantity of disparate information. Such a concern is also at the origin of the Deuxiemes Entretiens de Bures, "Folding and self-assembly of biological macromolecules," held at the Institut des Hautes Etudes Scientifiques (IHES), in Bures-sur-Yvette (France), between November 27th and December 1st, 2001. The Entretiens are organized by the IHES to promote interaction and exchanges among mathematicians, physicists, and biologists.

If one peers at any interesting object in molecular biology, the size and number of atoms to be considered are so large that one is soon quite overwhelmed by the complexity of interactions among the constituent particles. For example, the prokaryotic ribosome, responsible for the translation of the genetic code from nucleic acid to protein, is composed of two particles amounting to a total mass of around 2.5 million daltons. This macromolecular assembly contains around 55 ribosomal proteins and three ribosomal RNAs (totaling roughly 4,600 nucleotides). All these molecular objects interact cooperatively so as to make this machine work very precisely, while remaining controllable by several external factors. Where do we start in order to understand such an assembly? We now agree that different levels of organization exist, possibly hierarchically structured. Broadly speaking, one can distinguish the secondary structure of the 16S rRNA present in the 30S particle, as well as its tertiary structure when it is assembled in the 30S particle. RNA architecture, and to some extent protein structure, are now understood on the basis of two central design principles: modularity and hierarchy among organizational levels.
Unifying principles are thus seen to emerge from the molecular level to that
of functional biology, since the notion of the hierarchical organization of modularity has recently been uncovered in metabolic networks. Today, tremendous activity is underway around the world in the effort to establish databases for organizing fragments of biological knowledge. In such endeavors, one analyzes, classifies (base-pairs, motifs, interactions, etc.), and dissects how component parts interact with each other. In the Nineteenth Century, without realizing it, Mendeleev prepared the field for quantum chemistry by classifying the chemical elements and devising the periodic table with surprising precision. Nowadays, when we classify biological objects and try to learn about base-pairs or protein-DNA recognition motifs, are we sure the concepts we use are appropriate for preparing the future? In other words, can we really go beyond organizing databases of sequences, structures, motifs, and genomes? Are all our concepts really relevant and pertinent? At the same time, we know that biological structures are the chemical products of our planet's history, and that while these billions of years of evolutionary history are consistent with physical laws, those laws do not determine them, as Steven Benner beautifully illustrates in the first article of this book. This implies that potentially not a single interaction or atomic contact may be neglected; in particular, one cannot ignore the weak interactions that control fine-tuning in specific binding and recognition. Integration and cooperation between the strong and weak forces, between water molecules and ions, are responsible for the folding and stability of biologically functional macromolecular objects. At the other extreme, physicist Ken Wilson tells us that even if we knew everything about the quantum chemistry of water molecules, we would still be unable to understand the formation of waves.
Nowadays, biology extends between two extremes: attempting to understand biological catalysis and the movement of a single proton in very high-resolution X-ray structures with millions of atoms involved in interactions of various kinds and strengths, all the way to systems biology and the study of complex networks. Two principles constantly permeate biological systems: self-organization and mechanisms of symmetry-breaking. The book begins with articles that focus on self-assembly (RNA molecules and proteins) and ends with examples of symmetry-breaking in viruses and in the central mechanisms of molecular biology. Strong electrostatic interactions dominate the folding of polyelectrolytes, such as DNA and RNA molecules, as shown experimentally by Steven Benner and theoretically illustrated in the next chapter by Jean-Francois Joanny, who seeks the coarse-grained properties of charged systems, avoiding the specific chemistry of the
charge-bearing molecules as much as possible. But the influences of electrostatic charges on polymer conformation are so pervasive that the deduced principles and laws extend deep into biology. Michael Zuker follows with rules for RNA-folding, based on Boltzmann statistics. For many years, Michael Zuker's software has made it possible for biologists to routinely compute secondary structures of RNA molecules on the basis of nearest-neighbor energies between base-pairs (experimentally obtained by Doug Turner's group) by minimizing the energy of the structure. In this new approach, the reverse process is envisaged: One can derive the frequencies of dinucleotide pairs from phylogenetically aligned sequences, thus obtaining pseudo-free-energies, which may be compared with experimental values. The next three chapters, by Sarah Woodson, Francois Michel, and Jamie Williamson (respectively), describe experimental approaches to the problem of RNA folding. The themes covered in these chapters overlap and intersect, treating the chemical nature of ions that promote folding, folding kinetics, RNA transcription rates, and sequential binding of proteins to ribosomal RNAs. These three authors used various experimental techniques, including UV melting, chemical probes, hydroxyl radicals generated by synchrotron radiation, fluorescence measurements, and single-molecule studies. The main section on RNA folding concludes with the chapter by Herve Isambert, who describes the kinetics of RNA folding, as seen in computer simulation experiments. Interestingly, throughout these conferences, several participants in the audience raised the question of the prevalence of magnesium ions in RNA folding. A large part of the answer lies in physical chemistry and in the lifetimes of the water molecules that are bound to the ions (e.g., very short lifetimes around potassium and very long ones around magnesium ions). 
Thus, chemically "hard" magnesium ions bind to RNA molecules mainly via the water molecules of their solvation shells, which buffer the strong electrostatic attractions while simultaneously preventing kinetic folding traps. Coupled to the hierarchy in the architectural folding of RNA molecules is a hierarchy of ion-binding (monovalent ions such as sodium first stabilize the secondary structure, then divalent ions such as magnesium lock the tertiary structure; see the chapters by Sarah Woodson and Francois Michel for these aspects). An analogous explanation lies at the origin of the selectivity of potassium channels (membrane proteins that catalyze ion movements which generate electrical signals in neurons), letting through only (dehydrated) potassium ions, not smaller (dehydrated) sodium ions, because only the former may be properly re-solvated during passage through the ion channel. The next five chapters treat protein structure and folding. Like the chapters by Steven Benner and Francois Michel, they all stress the evolutionary history
contained in protein sequences. This aspect is especially apparent in Olivier Lichtarge's article, which addresses the fundamental problem of integrating sequence, structure, and functional information. Whereas this problem is usually tackled using mathematics, statistics, and physics, the sole link between sequence, structure, and function is biological evolution, the central and unique property of biology. In a counterpoint approach, Alexander Kister and Izrail Gelfand search for sequence determinants, which are strongly related to the structural stability of a given fold and which allow assigning a query protein to its proper structural class. In the following chapter, Marc Delarue first applies bioinformatics tools, then X-ray crystallography, and finally normal mode analysis to DNA polymerase families. These molecules present a fascinating example of molecular evolution with convergent evolution to an active site that is similar in two such families (each of which reveals divergent evolution). The final sections, which treat normal mode analysis and the application of the Poisson-Boltzmann equation to polymerases, demonstrate how an appropriate coarse-grained physical method can reveal important characteristics of the large-amplitude transitions that polymerases must undergo during their polymerization activity and in the translocation step. The last two chapters on protein structure treat the problem of protein-folding per se. Leonid Mirny describes another example, in which a simple physical model of protein-folding on a lattice leads to increased understanding of the crucial phenomena in real protein-folding. Lattice simulations have taught us, among other things, that fast-folding proteins have a stable folding nucleus that stabilizes the transition state and compensates for the loss of entropy. 
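A flavor of such lattice simulations can be given in a few lines. The sketch below is a minimal two-dimensional HP-style lattice model, not Mirny's actual protocol: the eight-residue sequence is invented, and the convention of -1 per non-bonded H-H contact is the standard toy energy function. It exhaustively enumerates self-avoiding walks for a short chain and picks the lowest-energy conformation.

```python
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def conformations(n):
    """Enumerate self-avoiding walks of n residues on the square lattice."""
    def extend(path):
        if len(path) == n:
            yield path
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in path:          # enforce self-avoidance
                yield from extend(path + [nxt])
    # Fix the first bond to quotient out some rotation/reflection duplicates.
    yield from extend([(0, 0), (1, 0)])

def energy(seq, path):
    """-1 for each H-H contact between residues not adjacent in the chain."""
    coords = {pos: i for i, pos in enumerate(path)}
    e = 0
    for i, pos in enumerate(path):
        if seq[i] != "H":
            continue
        for dx, dy in MOVES:
            j = coords.get((pos[0] + dx, pos[1] + dy))
            if j is not None and j > i + 1 and seq[j] == "H":
                e -= 1
    return e

seq = "HPHPPHPH"  # hypothetical sequence of hydrophobic (H) / polar (P) beads
best = min(conformations(len(seq)), key=lambda p: energy(seq, p))
print("ground-state energy:", energy(seq, best))
```

Real lattice studies use much longer chains, Monte Carlo sampling instead of exhaustive enumeration, and richer contact potentials, but the compact-nucleus phenomenology already appears in models of this kind.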
Using sequence comparisons, Mirny further shows that residues which belong to the folding nucleus are more conserved than would be expected if they only contributed to stabilization of the native structure. In the last chapter on protein-folding, Devarajan Thirumalai leaves the realm of spontaneous folding in the Anfinsen sense and introduces us to the formidable nanomachine that is the E. coli chaperonin particle GroEL, showing how this stochastic machine uses ATP in an iterative annealing mechanism to fold polypeptide chains to their native state. In exquisite detail, Jack Johnson later presents the processes involved in viral self-assembly and maturation. Although one may argue about whether viruses are living organisms, it is indisputable that their study has made an enormous contribution to our understanding of living systems. More than 50 viral crystal structures displaying diverse molecular biology have been provided by a variety of sources, in 85% of which the capsid protein adopts the sandwich fold. All spherical viruses have icosahedral symmetry, but only the non-functional satellite viruses contain the minimal set of 60 subunits. In order to package enough genetic
information, functional viruses contain multiples of 60 subunits, in agreement with the concept of quasi-equivalence discovered by Caspar and Klug in 1962. Jack Johnson describes molecular examples of how quasi-symmetry is achieved in viruses with icosahedral symmetry. Finally, Antoine Danchin takes a new look at genomes, asking at which level the genome is "fluid." First he shows that the genetic program leads to biases that favor transcription in the same direction as the replication fork in several organisms, which leads to a G/T-rich bias in the leading strand and an A/C-rich bias in the lagging strand. As a result, proteins that are coded from the leading strand tend to be valine-rich and those coded from the lagging strand threonine-rich. Danchin ends with two central points, the first of which was present throughout the conference and especially discussed in the chapters by Steven Benner and Olivier Lichtarge, and which is at the core of the present difficulties of functional bioinformatics and automatic genome annotation: how to assign a function to a structure and, ultimately, to a sequence. Even without dwelling on the linguistic ambiguities of the word function, the observations that "function captures preexisting structures," and that folded structures are prerequisites for the evolution of function, have a profound influence on how to organize biological observations and databases. Antoine Danchin's second point concerns cell organization, forcefully advancing the idea that the driving force behind it is translation, and that the cell is structured around translation and the ribosomal network. This brings us back to our initial question. In his classic article, Evolution and Tinkering (Science, 1977), Francois Jacob wrote that nature functions by integration.
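The compositional bias Danchin describes is easy to quantify. Below is a minimal Python sketch of GC skew, (G - C)/(G + C), the standard statistic for distinguishing leading from lagging strands; the two sequence fragments are invented for illustration, whereas genuine analyses scan complete genomes in sliding windows.

```python
def gc_skew(seq: str) -> float:
    """(G - C) / (G + C); positive for a G-rich (leading-strand-like) sequence."""
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0

# Hypothetical fragments for illustration only.
leading = "GGTGTTGGATGGTTGAGTGG"   # G/T-rich, as on the leading strand
lagging = "CCACTCAACCATCCAACACC"   # its reverse complement: A/C-rich

print(f"leading-strand skew: {gc_skew(leading):+.2f}")  # prints +1.00
print(f"lagging-strand skew: {gc_skew(lagging):+.2f}")  # prints -1.00
```

The sign change of the skew at the origin and terminus of replication is precisely the genome-scale signature of the transcription/replication co-orientation bias discussed above.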
Although global principles concerning biological systems are becoming clearer (as several chapters of this book illustrate), the modeling of complex biological systems will require the integration of computational biology and high-throughput technologies in a network-perspective approach. These proceedings consist of the transcribed oral presentations as well as dialog among the speakers and the audience present at the Entretiens. Although the texts were thoroughly edited, an effort was made not to strip them of the liveliness and candor of the verbal exchanges they elicited. On behalf of the organizers, I thank the authors for accepting these long interactive presentations with patience and humor, as well as the conference participants for their numerous and valuable questions and comments. Regrettably, it was not possible to keep track of the names of the persons who intervened during the talks. Finally, I commend the efforts of Noah Hardy for his careful and dedicated editing of the entire proceedings, by no means an easy task. He was assisted by Helga Dernois, the IHES scientific secretary whose steadfast work produced the final draft of the
manuscript, Marie-Claude Vergne, the IHES graphics specialist who handled the figures, and Roberto Incitti, who managed the web-based system for importing the images and corrected texts. Without them, this volume could not have been produced. Last but not least, I thank Jean-Pierre Bourguignon, Director of the Institut des Hautes Etudes Scientifiques, without whose constant encouragement and stimulating presence these Deuxiemes Entretiens de Bures would not have taken place. To all of them, along with the three other organizers, Alessandra Carbone, Misha Gromov, and Francois Kepes, I extend my warmest acknowledgments.
CONTENTS

Organizers and Speakers  v
Editor's Note (Noah Hardy)  vii
Preface (Eric Westhof)  ix
Evolution-Based Genome Analysis: An Alternative to Analyze Folding and Function in Proteins (Steven Benner)  1
Conformation of Charged Polymers: Polyelectrolytes and Polyampholytes (Jean-Francois Joanny)  43
Statistically Derived Rules for RNA Folding (Michael Zuker)  73
Experimental Approaches to RNA Folding (Sarah Woodson)  99
Some Questions Concerning RNA Folding (Francois Michel)  127
RNA Folding in Ribosome Assembly (James R. Williamson)  179
From RNA Sequences to Folding Pathways and Structures: A Perspective (Herve Isambert)  211
An Evolutionary Perspective on the Determinants of Protein Function and Assembly (Olivier Lichtarge)  225
Some Residues are more Equal than Others: Application to Protein Classification and Structure Prediction (Alexander Kister and Izrail Gelfand)  255
Structure-Function Relationships in Polymerases (Marc Delarue)  267
The Protein-Folding Nucleus: From Simple Models to Real Proteins (Leonid Mirny)  303
Chaperonin-Mediated Protein Folding (Devarajan Thirumalai)  323
Virus Assembly and Maturation (John E. Johnson)  349
The Animal in the Machine: Is There a Geometric Program in the Genetic Program? (Antoine Danchin)  375
EVOLUTION-BASED GENOME ANALYSIS: AN ALTERNATIVE TO ANALYZE FOLDING AND FUNCTION IN PROTEINS STEVEN BENNER Departments of Chemistry, Anatomy, and Cell Biology, University of Florida, Gainesville, FL, USA
From time to time, it is useful to step back from our professional activities to ask "big" questions. One of the biggest is: Why are things the way they are? This question may be asked in any discipline. It is frequently asked in physics. It is especially important in biological chemistry, however, since its answer ultimately determines which research problems are interesting and which are not. In biological systems, the Why question may be approached at many levels. At the highest level, we ask why physiology, the structure of our bones and tissues, is the way it is. The question may be asked of decreasingly smaller biological structures as well. We can ask this question about the structure of cells, the structure of proteins in cells, and the structure of individual molecules involved in biological metabolism, for example.

Biology offers two classes of answers to such questions. The first holds the structure of a biological system to be a unique solution to a particular biological problem. We frequently encounter this type of explanation when discussing physiology. When we consider the function of a tooth obtained from a fossil organism, for example, we often conclude that an animal ate grass if its teeth have an optimal structure for chewing grass. This implies that teeth in general have been optimized to macerate the substance that is eaten. Biomolecules also often appear to be unique solutions to a particular biological problem. For example, the enzyme triosephosphate isomerase, which is important for the degradation of sugars in our diets, catalyzes the turnover of a substrate molecule to a product molecule whenever it encounters one. This behavior would seem to be optimal for the survival of an organism that is dependent on the enzyme. More broadly, a similar outlook predicts that if you were to go to Mars and find life there, it would be constructed with the same general chemical features - if it used enzymes at all.
This type of explanation drives a research strategy. In part, we want to study the details of the structure of biological systems because we believe that they are
optimized. In this view, within these structures lies a deeper understanding of chemistry, of biochemistry, and of life itself, all awaiting inspection at the correct level of detail. The second class of answers recognizes that the biostructures of life are the products of four billion years of biological evolution and planetary history. Given this, one can also explain the structures of living matter in terms of their historical, geological, and paleontological records. Explanations take the form of stories about the historical past, certainly consistent with physical law, but not necessarily determined by it. This is the approach of the natural historian.

Given this perspective, to engage the Why question brings the scientist to the confluence of the three great traditions in science. The first one is the natural history tradition, which is older than civilization and as young as my three-year-old son. It comes from the human compulsion to collect (sticks and stones at first, then plants, minerals, and fossils) and to classify the collection. Natural history gained its standing as a science after the Enlightenment, when the classification of natural things came to be seen more as a consequence of the history of life on Earth than of divine intervention. Natural history uses a human-constructed metalanguage to describe the natural world. To the naturalist, explanations are historical. The models used to explain the natural world reconstruct events in the historical past that are contingent in large part on random chance. These events are certainly consistent with physical law (there are no violations of the laws of thermodynamics), but they are not predictable by physical law.

The second tradition derives from physical science, which began as Enlightenment scientists devised mathematical models to explain the motion of planets in the heavens. The physical science paradigm uses mathematical models for both description and explanation.
If one asks a physicist why an atomic bomb explodes, (s)he will say, "Because e = mc²." Physicists generally have little use for the natural historical tradition. "Science is either physics or stamp collecting," said Rutherford, near the turn of the last century. Indeed, physicists may view their descriptions and explanations as better than those of the natural historian because theirs are mathematical, and therefore (we presume) "universal." If one asks a Klingon physicist why an atomic bomb explodes, we would expect (s)he would also say, "Because e = mc²." Even the natural historian is somewhat embarrassed by the "just so" storytelling of the tradition. Thus, natural historians are struggling to convert their field to a physical science by adding mathematics to their descriptions whenever possible. Yet the natural historian often finds a purely mathematical law to be an unsatisfying
explanation. So do humans, generally. If one explains the bomb by the equation "e = mc²," it is only human to then ask, "But why does e equal mc²?" However, both naturalists, with their human-constructed metalanguage and historical explanations, and physicists, with their universal language and mathematical explanations, are confounded when they encounter the third tradition in science: that of chemistry. Chemistry builds its descriptions in terms of universals. There is little doubt that chemists on the planet Klingon would work with the same carbon, hydrogen, oxygen, and nitrogen atoms that earthling chemists do. At the same time, however, chemists use a human-constructed metalanguage for explanation. To the question, "Why is benzene not very soluble in water?" the answer is "Because benzene is hydrophobic." The concept of hydrophobicity is almost certainly grounded in some combination of the enthalpy and entropy of benzene and water, the mathematical descriptions of which may be universal. But the explanatory metalanguage is definitely of human construction. Physicists and naturalists may both view with suspicion the explanations that organic chemists construct in this human-created metalanguage; they feel this metalanguage is so malleable that it may be used to explain anything. Indeed, with embarrassing frequency, organic chemists find themselves in the position of having explained an experimental result in terms of their metalanguage, only to discover that the experimental result is the opposite of what they thought, and that they are able to explain the opposite result just as easily within the context of the same metalanguage! Physicists and naturalists alike may be infuriated by the extent to which organic chemists do not view this as a problem in their science.
Chemists may in part be forgiven, because they have a peculiar but powerful experimental research strategy for developing their (often intuitive) understanding of the behavior of molecules: synthesis. Synthesis, especially in organic chemistry, involves the rational creation of new matter: different arrangements of atoms. These differences may be selected in order to test models that explain the behavior of molecules. For this reason, synthesis serves as an experimental method for developing understanding. But synthesis in chemistry serves another role: to validate understanding. One knows that one understands a molecular system when one can design a new molecule within that system, synthesize it, and show that the molecule behaves as predicted. This approach to validating understanding is certainly not available to astrophysics. One cannot (today) synthesize a new star to see whether the model is correct. The power of synthesis has made chemistry arguably the most successful of the three sciences. This is certainly true from a technological perspective. Rational
synthesis based on organic structure theory has generated plastics, dyes, and materials. Essentially all the advances in contemporary biotechnology have come from a description of living systems using the universal chemical language. The human genome is, after all, nothing more (and nothing less) than a statement of how carbon, oxygen, nitrogen, hydrogen, and phosphorus atoms are bonded in molecules directly involved in human inheritance. The race to do "structural genomics" is nothing more than associating conformation with these chemical constitutional formulas. But synthesis becomes especially important when asking the "big questions" in biology. We can use synthesis to make new forms of biological matter, to ask why not- and what if-types of questions. Do the forms of biological matter that we see on Earth in fact perform better than alternative forms? If so, we synthesize an alternative form of matter and see how it behaves. Could life not take some other form and perform as well as the life we know?

My goal in this lecture is to show the virtue of connecting the three approaches, tying physical science to natural history and molecular structure. The point of this lecture is to show that the answers to these questions require input, data, language, and ideas from the physical and chemical sciences, as well as from natural history. To really understand the world around us in the new millennium and in the age of the genome, chemists must become natural historians, and natural historians must become chemists. That is the point I would like you to take away with you today.
Why is DNA the way it is?

Let us start by considering the structure of DNA, the molecule at the core of genetics. Nucleic acids such as DNA are built from nucleotide units. These are based on one of two sugars, ribose and 2'-deoxyribose, generating RNA and DNA, respectively. A nucleobase (or, more simply, a base) is appended to the sugar to give a nucleoside. In a nucleic acid strand, the nucleoside units are joined by phosphodiester linkages (the "phosphates.") The resulting strand is an irregular polymer whose backbone is a repeating sugar-phosphate chain with a variable heterocyclic nucleobase attached to the side. Information is contained in the order of the nucleobases in the oligonucleotide chain.

For those inspecting the figures without a background in organic chemistry, let me simply state that organic chemists represent molecular structures using geometrical structures. They often place letters denoting the chemical elements at
the vertices of a geometric figure representing the molecular structure, but C and H, denoting carbon and hydrogen, are rarely so placed. Carbon is represented without letters: every unlettered vertex in the figure represents a carbon atom. Carbon makes four bonds, and the bonds between carbon and all atoms except hydrogen are written explicitly; bonds made between carbon atoms and hydrogen atoms are not. This means that if a vertex has fewer than four lines going to it, the missing bonds are made to hydrogen.
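The counting convention just described can be written as a one-line rule. A minimal sketch; the function name is mine, not standard chemistry software:

```python
# Sketch of the skeletal-formula convention described above: a carbon
# vertex makes four bonds, and any bonds not drawn explicitly are
# assumed to go to hydrogen.
def implicit_hydrogens(explicit_bonds: int) -> int:
    """Number of hydrogens implied at an unlettered (carbon) vertex."""
    if not 0 <= explicit_bonds <= 4:
        raise ValueError("carbon makes at most four bonds")
    return 4 - explicit_bonds

# A vertex with two drawn bonds (a CH2 group in a chain) carries two
# implicit hydrogens; a vertex with four drawn bonds carries none.
print(implicit_hydrogens(2))  # 2
print(implicit_hydrogens(4))  # 0
```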
[Figure: skeletal structures of a DNA duplex, showing the charged phosphate backbone, the sugars, and the hydrogen-bonded base-pairs]

Figure 1. The chemical structure of DNA - a paradox of design.
A nucleic acid strand recognizes its complementary strand by Watson-Crick base-pairing. In the first-generation model for DNA proposed by Watson and Crick a half-century ago, two DNA strands form a duplex structure. The duplexes are
stabilized by base-stacking: the base-pairs stack on top of each other. Base-stacking brings the hydrophobic bases out of water; it also allows "stacking energy," a term from the human metalanguage describing base-pairing, to be realized. The backbone, according to the Watson-Crick model, is largely incidental to the process, simply acting to hold the bases in the strand together. In the first-generation Watson-Crick model for DNA duplex formation, the specificity of base-pairing arises from both size complementarity and hydrogen-bonding complementarity between the bases. Big things, like A and G (also known as purines), pair with little things, like T and C (known as pyrimidines); hydrogen-bonding complementarity arises from matching between hydrogen-bond donors and hydrogen-bond acceptors. Guanine presents an acceptor-donor-donor pattern of hydrogen-bonding on a large component, which is complementary to the donor-acceptor-acceptor pattern of hydrogen-bonding on cytosine, the small complement. Thymine presents an acceptor-donor-acceptor pairing pattern on the small component of a second base-pair, which anticipates a donor-acceptor-donor pattern on its large complement.
[Figure: base-pairs drawn with their donor and acceptor groups - cytosine (pyDAA) with guanine (puADD), and thymine (pyADA) with aminoadenine (puDAD)]

Figure 2. Hydrogen-bonding between nucleobases. The small pyrimidines are designated by py and the large purines by pu. Following the prefix is the order, from the major to the minor groove, of acceptor (A) and donor (D) groups. (The A-T base-pair is incomplete.)
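The donor/acceptor complementarity rules summarized in Figure 2 can be sketched as a short counting function. The pattern strings come from the figure and the surrounding discussion; the "-" notation for a missing group and the function name are my own conventions:

```python
# A minimal sketch of Watson-Crick complementarity as described above:
# each donor (D) on the small (py) base must face an acceptor (A) on
# the large (pu) base, and vice versa.  "-" marks a missing group, as
# in natural adenine, which lacks the donor for a third hydrogen bond.
SWAP = {"D": "A", "A": "D"}

def n_hydrogen_bonds(py: str, pu: str) -> int:
    """Count matched donor/acceptor positions between two patterns."""
    return sum(1 for x, y in zip(py, pu) if SWAP.get(x) == y)

print(n_hydrogen_bonds("DAA", "ADD"))  # 3: cytosine with guanine
print(n_hydrogen_bonds("ADA", "DAD"))  # 3: thymine with aminoadenine
print(n_hydrogen_bonds("ADA", "DA-"))  # 2: thymine with natural adenine
```

The last line reproduces the observation discussed next: natural adenine's missing donor leaves the A-T pair with only two hydrogen bonds.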
Looking at natural DNA, we immediately see an interesting feature of the structure that departs from this regular design: adenine is missing the donor group that would enable it to form a third hydrogen bond to thymine, its complement in natural DNA. As a consequence, the natural A-T base-pair is joined by only two hydrogen-bonds, whereas the G-C base-pair is joined by three. This prompts the question: why? Is this a defect in the structure of DNA? Would DNA be better if aminoadenine replaced adenine, providing a nucleobase that can form three hydrogen-bonds to thymine? Or, following the first class of explanation that I mentioned earlier, is DNA better able to contribute to the fitness of the organism if it has a stronger base-pair joined by three hydrogen-bonds and a weaker base-pair joined by only two?

Alternatively, we might explain this feature historically: perhaps the incomplete structure of adenine reflects a frozen historical accident. Perhaps adenine (not aminoadenine) was present in the prebiotic soup, life emerged using it, and has since had no opportunity to replace it, at least not without disrupting the life that was attempting to do the replacing. We can generally ask questions of this class with respect to the structure of DNA. Indeed, as soon as one begins to formulate such questions from a chemical perspective, many peculiarities appear in the DNA structure. If one asks too many questions, DNA begins to appear to be poorly designed.

Consider just three features of the molecular structure of DNA from the perspective of a chemist who might want to design a molecular recognition system: First, DNA is a floppy molecule. When two DNA strands come together, they must become more rigid. This would imply, perhaps naively, that the DNA strand is losing conformational entropy when it binds to its complementary strand, which is generally regarded as being "bad" for molecular recognition.
In fact, chemists who design artificial molecular recognition systems generally seek rigid "lock-and-key" pairs; they never try to design two floppy things that bind together.

Second, in water, DNA uses hydrogen bonding to transfer genetic information. But water presents hydrogen-bonding opportunities everywhere. For this reason, few chemists working in the design of molecules that recognize other molecules exploit hydrogen-bonding as a molecular recognition unit in water.

But the most remarkable feature of strand-strand binding in DNA comes from the fact that the two molecules that interact are both polyanions; each of the phosphate groups that form the backbone of a DNA strand bears a negative charge. In general, someone seeking to design a molecule that binds to a polyanion would begin by making a polycation, not another polyanion.
We were not the only ones who thought that binding a polyanion to another polyanion was a peculiar way to design a molecular recognition system. In the late 1980s and early 1990s, an entire industry, known as the "anti-sense" industry, consumed a significant amount of venture capital by seeking to replace the anionic phosphate linkers in the backbone of DNA with an uncharged linker (methyl phosphonate groups, for example.) The uncharged DNA analog was expected to passively enter the cell through membranes. Since the backbone has no role in the molecular recognition event, it was expected that the molecular recognition specificity would be retained.

When we moved to the ETH (Eidgenossische Technische Hochschule Zurich), we had the opportunity to address the why questions experimentally with DNA, using synthesis as our paradigm. If the first-generation model for DNA structure were correct in postulating no particular role for the backbone, then we ought to be able to design and synthesize these DNA analogs, which take a small step away from the natural backbone. If the first-generation model for nucleic acid pairing were correct, these analogs should retain the rule-based molecular recognition characteristic of DNA: that A pairs with T, G with C, large with small, and hydrogen-bond donors with hydrogen-bond acceptors.

In our first step, we replaced the phosphate linkers with a dimethylenesulfone group. This substitution removes the charge. Several talented synthetic organic chemists, in particular Clemens Richert (now a professor at the University of Konstanz), Zhen Huang (now a professor at the City University of New York), and Andrew Roughton (now with Pharmacia), moved mountains to make these molecules. Fortunately, their hard work was rewarded.
[Figure: the natural phosphate-linked DNA backbone compared with the uncharged oligosulfone backbone]

Figure 3. Sulfone analogs of DNA.
The first success came with the synthesis of the sulfone-linked GSO2C dimer, the analog of the dinucleotide GpC in which the phosphate linker is replaced by a dimethylenesulfone group. The molecule is self-complementary, because G pairs with C. It should form a duplex of these dimers, in which G from one molecule pairs with C from the other, and C from the second position of the first molecule pairs with G from the first position of the second. In fact, GSO2C does form a Watson-Crick duplex of this type in the crystal, as shown in the crystal structure of the substance solved by Martin Egli. The structure is isomorphous with the structure of G-phosphate-C, a self-complementary RNA molecule whose duplex was crystallized and whose structure was solved by Alan Rich around thirty years ago. Indeed, it is amazing how similar those structures are. This result suggested that Watson and Crick - and their first-generation model - were right: the backbone really isn't all that important; the charge may be replaced by a neutral linker.

Perhaps we should have stopped there, but we took the next step, making DNA analogs with bridging sulfones that were four units long. These no longer behaved in a "Watson-Crick" manner. To be sure, we saw molecular aggregation and self-assembly, but not following Watson-Crick rules. For example, the sequence U-SO2-C-SO2-A-SO2-U is not self-complementary; it would normally pair with ATGA. In fact, the sequence aggregates with itself. An NMR structure done by Richert showed that duplexes formed, but not of the Watson-Crick type. Longer sulfones were also unusual in their conformation and aggregation. For example, the sequence A-SO2-U-SO2-G-SO2-G-SO2-U-SO2-C-SO2-A-SO2-U was prepared by Richert and Roughton. This molecule folds, and melts only at a transition temperature above 75°C. No evidence was ever found that it was able to pair with its complement in an antiparallel Watson-Crick sense.
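The self-complementarity argument used above (GC pairs with itself; UCAU would instead pair with ATGA) can be checked mechanically. A minimal sketch; the function names are mine:

```python
# Sketch: antiparallel Watson-Crick self-complementarity, as used above
# to explain why the GC dimer pairs with itself while the UCAU sulfone
# does not.  U is normalized to T so RNA-style sequences work too.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def wc_complement(seq: str) -> str:
    """The 5'->3' sequence of the antiparallel Watson-Crick partner."""
    seq = seq.upper().replace("U", "T")
    return "".join(PAIR[base] for base in reversed(seq))

def self_complementary(seq: str) -> bool:
    seq = seq.upper().replace("U", "T")
    return wc_complement(seq) == seq

print(self_complementary("GC"))   # True: G pairs with C across the duplex
print(wc_complement("UCAU"))      # ATGA, the strand UCAU would normally pair with
print(self_complementary("UCAU")) # False: UCAU is not its own complement
```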
We then asked whether we could compare this behavior with that of other biomolecules we know. By the time any oligosulfone gets beyond a certain length, it has its own unique properties. Some of them are soluble in water; others are not. The chemical properties of various sulfone sequences vary widely, and largely unpredictably. We asked ourselves when we last heard of a molecule that has distinctive properties determined by its sequence, properties that vary widely when the structure changes modestly. Of course, we do know of biopolymers that display such behaviors; they are called proteins. One cannot help but be struck by the observation that by removing the repeating negative charge from DNA, we made a molecule that behaves like a protein. Indeed, we even encountered cases where sulfones were catalysts; they folded and catalyzed reactions.
In retrospect, in light of these experimental findings, we conclude that perhaps a polyanionic structure is not as absurd as we thought for a molecule involved in genetics. Having now changed a repeating charge and seen the consequences of the change, we can suggest four reasons why negative charges are important to DNA: First, of course, the negative charges render the DNA molecule water-soluble. This is well known and not trivial. Next, when two DNA strands interact with each other, the repeating negative charges force the inter-strand interaction to a position on each strand that is as far away from the backbone as possible. This is important, because DNA offers many sites of interaction. In particular, interaction is well known on the "back side" of the purine ring, involving nitrogen-7, to form a Hoogsteen interaction. Indeed, in Richert's nuclear magnetic resonance structure of the tetrameric sulfone, this is what is seen without the negative charges. It therefore seems that the phosphates control the molecular interactions between molecules that are rich in functional groups. Without the repeating negative charge, DNA is a richly functional molecule that "wants" to spontaneously self-assemble and aggregate. The phosphates control that tendency. The repeating backbone charge requires the strands to interact on edges that are as far from the backbone as possible. This, of course, is the part of the molecule that forms the hydrogen-bonds in a classical Watson-Crick base-pair. So perhaps it makes sense for nature to use a polyanion to bind to another polyanion in a genetic system. A third way that the polyanionic character of DNA contributes to its behavior may be described by using the statistical mechanics theory of biopolymers. Normally a polymer occupies a volume whose radius scales with the length of the polymer to the one-half power. This is not the case if the polymer is a polyanion. 
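The scaling argument above can be illustrated numerically. A sketch: the exponent 0.5 for the ideal coil comes from the text; the exponent used for the stretched polyanion is an illustrative assumption for the fully extended limit, not a measured value.

```python
# An ideal ("normal") polymer coil occupies a region whose radius grows
# as N**0.5 in the number of monomer units N.  A polyanion, swollen by
# its repeating like charges, approaches an extended form; nu = 1.0
# below is the illustrative fully-stretched limit, not a fitted value.
def coil_radius(n_units: int, bond_length: float = 1.0, nu: float = 0.5) -> float:
    """Characteristic radius, in bond lengths, of an N-unit chain."""
    return bond_length * n_units ** nu

n = 100
print(coil_radius(n, nu=0.5))  # 10.0  - ideal random coil
print(coil_radius(n, nu=1.0))  # 100.0 - fully extended chain
```

The tenfold difference for a 100-unit chain illustrates the text's point: the stretched polyanion presents an open chain that can more readily act as a template.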
A polyanionic polymer has a larger "excluded volume": it stretches itself out, which allows the molecule to more readily act as a template. Again, the repeating charge in the backbone appears to be useful, if not required, for Watson-Crick rule-based behavior. Again, contrary to the first-generation DNA model, the nature of the backbone is quite relevant.

Last, and most important, the polyanionic nature of the DNA backbone appears to be important to support Darwinian evolution. As noted above, oligosulfone analogs have very different properties depending on their precise sequence. This variation in physical behavior cannot be tolerated by a molecule expected to support Darwinian evolution. Here, the molecule must be able to replicate - we all need to have children - but in order to evolve, we must also be able to have mutant children. The need to support mutation without losing the ability to replicate is therefore essential for a genetic molecule. We have somewhat whimsically converted "capable of suffering mutations independent of concern over the loss of properties essential for replication" into an acronym: COSMIC-LOPER, the property of a molecule that allows it to support Darwinian evolution as a genetic system. This property is conferred by the repeating backbone charge of DNA. Reactivity is dominated by this structural feature, surpassing dipolar and quadrupolar interactions - indeed, every higher-order molecular interaction involving electronic distribution within a molecule. As long as the repeating negative charges remain, the dipolar interactions (hydrogen-bonds, for example) may be changed without necessarily dramatically changing the solubility of the molecule, or changing the conformation or the position to which the DNA migrates on a gel (for example.) This is not the case with proteins. Consider, for example, the behavior of hemoglobin: the replacement of a single amino acid in hemoglobin results in sickle-cell hemoglobin, which precipitates. The converse implication is that proteins cannot themselves be genetic molecules, because they cannot suffer mutation without changing their properties in ways that would prevent them from being copied (given a mechanism to do so in the first place.)

Comment: You talked about the anti-sense companies as if they were in the past...

Response: Most of them are.

Question: Yes, most of them are, but I know of a new one in Leipzig that is going ahead. What about phosphorothioates, PNA, and all that stuff?

Response: Excellent question. Phosphorothioates are DNA analogs in which one of the oxygen atoms in the phosphate linker is replaced by a sulfur atom. To date, these are the only anti-sense molecules that have shown promise in real biological settings. But phosphorothioates still carry a charge on each of the linking groups. PNA, in contrast, lacks the repeating charge. It was developed by Peter Nielsen and Michael Egholm, in Denmark, in the laboratory of Professor O. Buchardt, now deceased.
These scientists replaced the backbone of the DNA molecule with a peptide-like linkage that lacks a charge. PNA is the exception that "proves" (or tests) the rule. If a repeating negative charge is in fact a universal feature of genetic molecules in water, PNA should not work. However, PNA displays Watson-Crick behavior, binding to complementary DNA in the Watson-Crick manner. The catch with PNA is that it does so only up to a point. PNA still generally displays Watson-Crick behavior up to ten nucleotides. However, at fifteen, especially if the PNA molecule is rich in G, the Watson-Crick behavior begins to disappear amid
solubility problems. This is the same behavior that is observed in sulfone molecules, but at somewhat longer lengths. Dimers and tetramers of sulfone-linked DNA analogs still display Watson-Crick base-pairing in some cases; but in most cases, longer sulfone-linked DNA analogs do not. As far as I know, PNA holds the record of being the longest non-ionic analog of DNA to retain Watson-Crick pairing properties. Why it does so well is uncertain. It may be due to an unusual interaction between the PNA backbone and water. Nevertheless, what is clear is that PNA itself could not support Darwinian evolution for long genes.

Question: What was the strand orientation in these sulfone complexes?

Response: Each one is different, but the strand orientation is anti-parallel only in GSO2C (as is observed in Watson-Crick DNA-pairing.) In all the other structures that have been examined there is no strand orientation. The sulfones simply fold or precipitate, like proteins.

The key feature of the "second-generation" DNA model is that the backbone matters. But so do the heterocycles, or bases. The bases were at the center of the molecular recognition phenomenon, as discussed by Watson and Crick. However, it turns out that the bases are the only structures of the DNA duplex that we can engineer without losing rule-based molecular recognition. Changing the bases simply required that we understand the combinatorial rules of hydrogen-bonding patterns. The C base exploits the donor-acceptor-acceptor hydrogen-bonding pattern on the "small" component. The T base exploits the acceptor-donor-acceptor hydrogen-bonding pattern on the "small" component. But we still have the opportunity to construct organic molecules that use the donor-donor-acceptor, the acceptor-donor-donor, the donor-acceptor-donor, and the acceptor-acceptor-donor hydrogen-bonding patterns on the "small" component.
This means that four more "small" bases and their four "large" complements, none of which are found in natural DNA, are possible within the geometry of the Watson-Crick pair. Being organic chemists, we set out to synthesize the extra nucleobases, then tried to find out whether we could construct a DNA analog with an expanded genetic alphabet. We found that this synthesis was possible. Moreover, the extra letters in the genetic alphabet form acceptable Watson-Crick base-pairs with Watson-Crick specificity. The Watson-Crick rules may be expanded to include twelve letters, not just the initial four found in natural DNA.
[Figure: the non-standard base-pairs drawn with their hydrogen-bond donor (D) and acceptor (A) patterns; labels visible include pyAAD, pyDAD, puADA, pyADD, puDAA, and pyDDA]

Figure 4. Abstracting complementarity rules yields eight additional coding units fitting Watson-Crick geometry, joined by "non-standard" hydrogen-bonding patterns: an expanded genetic alphabet.
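The combinatorics behind the twelve-letter alphabet can be enumerated directly. A sketch: the exclusion of the all-donor and all-acceptor patterns follows the text's count (the four standard letters plus four extra "small" patterns and their "large" complements); the variable names are mine:

```python
from itertools import product

# Three hydrogen-bonding positions on the small (py) base, each a donor
# (D) or an acceptor (A), give 2**3 = 8 patterns.  Each pattern fixes a
# complementary large (pu) base.  Dropping the all-D and all-A patterns,
# which the scheme described above does not use, leaves 6 base-pairs,
# i.e. 12 letters in the expanded alphabet.
swap = {"D": "A", "A": "D"}

patterns = ["".join(p) for p in product("DA", repeat=3)]
usable = [p for p in patterns if len(set(p)) > 1]  # drop DDD and AAA
pairs = [("py" + p, "pu" + "".join(swap[c] for c in p)) for p in usable]

assert ("pyDAA", "puADD") in pairs   # cytosine : guanine
assert ("pyADA", "puDAD") in pairs   # thymine : aminoadenine
print(len(pairs), "base-pairs ->", 2 * len(pairs), "letters")  # 6 base-pairs -> 12 letters
```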
We closely examined the role of hydrogen-bonding in developing an artificial genetic alphabet. Ronald Geyer and Thomas Battersby, postdoctoral fellows working in my laboratory, determined an enormous number of melting temperatures with more than a dozen DNA analogs in an attempt to determine the role of hydrogen-bonding. In part, this work was motivated by a statement by Myron Goodman, based on some experimental work done by Eric Kool, implying that only size complementarity is important in base-pairing, not hydrogen-bonding complementarity. The work by Ron, who is now a professor at the University of Saskatchewan, and Tom, now with Bayer Diagnostics in California, suggests that these two features are approximately equal in importance.
In all cases, the number of hydrogen-bonds is an adequate predictor of base-pair stability, as estimated by its contribution to the melting temperature of a duplex that contains it. Further conclusions may be drawn. First, size complementarity and hydrogen-bonding complementarity are about equally important in forming a stable base-pair. Furthermore, both are more important than "context." Context is the metalanguage term specifying which base-pairs lie above and which lie below the pair in the helix. Another predictive feature that this work uncovered is that a charge in the heterocycle is bad for duplex stability. Also bad is an uncompensated amino group, especially in the minor groove. C-glycosides, in which the base is joined to the sugar via a carbon-carbon bond (instead of a carbon-nitrogen bond, as is the case in natural DNA), are modestly bad. From these observations derive semi-quantitative "rules" for designing alternative genetic systems.

This enables new technology. Rule-based molecular recognition is desirable throughout industry, and tags built from the artificially expanded genetic information system that we have invented are useful throughout industry. For example, James Prudent and his colleagues at EraGen Biosciences (Madison, WI) have used the expanded genetic alphabet to create 76 tags that can capture 76 different species in one tube, permitting a multiplexed assay for DNA variation in a sample.

Question: What is the advantage of non-standard base-pairs?

Response: The presence of extra letters in the artificial genetic alphabet means that we can generate tags containing non-standard bases that bind to other tags containing non-standard bases without cross-binding to DNA molecules that contain only standard bases. This means that the extent to which a non-standard tag finds and binds to its complement does not depend on the amount of natural DNA contained in the assay mixture.
If we try to use tags made from standard bases, adventitious DNA also built from standard bases can contain sequences that interfere. The first diagnostic product that exploits non-standard bases was developed at Chiron. It is a branched DNA diagnostic assay, which captures an analyte DNA molecule in a sandwich format. In Figure 5, we imagine the analyte, the molecule we want to detect, to be the meat between two slices of bread. We first capture the analyte DNA with another DNA molecule that is complementary to a piece of the
analyte sequence, using Watson-Crick base-pairing (the bread in the sandwich.) This is then captured onto a solid support.
[Figure: sandwich-assay schematic - a capture strand on a solid support binds the analyte DNA (for example, from HIV), which is also bound by a branched DNA carrying signal molecules (luciferase)]
Figure 5. The Chiron-Bayer-EraGen branched DNA diagnostics system based on non-standard bases.

The other slice of bread is a DNA molecule that captures another part of the analyte DNA, again with Watson-Crick base-pairing. However, this molecule is branched, with ~10,000 DNA branches coming off it. Each of these, through Watson-Crick base-pairing, now captures fluorescent molecules. Thus, only if the analyte were present would fluorescent species be attached to the support. Non-standard bases enhance the assay by permitting orthogonality. A typical analyte sample contains a sufficient quantity of DNA built from A-T-G-C to capture enough of the branched and fluorescent molecules onto the support, even in the absence of analyte, if the branched and fluorescent molecules are also built from A-T-G-C. This created background noise. However, by making the branched and fluorescent molecules from non-standard bases, the noise decreased and the detection limit dropped to eight molecules.

Orthogonality means that you can do molecular recognition out here using one set of rules, and use this part, A-T-G-C, only where you need to, because the natural system contains it; that way you don't have cross-reactivity between the molecular recognition system that is recognizing the analyte and the molecular recognition system that is doing the signaling. That is the key issue. Is this clear?

Question: I don't understand what the actual source of the improved specificity is. What you're saying is that it is easier to recognize something with a non-natural base than with the natural base - but what is the actual source of the specificity?
Response: The source of the specificity is that non-standard bases pair only with their complements, as defined by the pattern of hydrogen-bonding and size-complementarity, and not with standard bases.

Question: I've got lots of questions. First of all, in chips, what if you just have longer oligos - would you not then get better specificity?

Response: No, you don't. This depends in part on the temperature at which you run the reaction. With DNA molecules that are very long, at reasonable temperatures, sub-sequences bind non-specifically. You have a melting temperature issue at some point, because a short molecule will bind to a short molecule - this is a bit of a cartoon - at low temperatures, and a long molecule binds to a long molecule at high temperatures. So at some point you are limited. As it turns out, there is an upper limit, and you actually don't have many things that bind at 100°C.

Comment: But what you can do is work out a sort of deconvolution software.

Response: Yes, that has been tried, in sequencing by hybridization, for example, and it hasn't worked very well. In part, it is difficult to know what DNA sequences are found in the background DNA, and therefore it is hard to know what complements to avoid. Consider some very simple molecular biology: let's say you want to design a primer that would be suitable for a PCR reaction. You have some temperature-scale protocol that you would like to use. Obviously, your next problem is that if you have an A-T-rich primer, the temperature scale you would use is different from what you would optimally use for a G-C-rich primer. With these extra bases, EraGen has developed a "gene code" software package in which their first primer fits into an already existing PCR cycle for parallel PCR 70% of the time. So when anthrax hit America, EraGen was approached by Cepheid to develop the chemistry for an anthrax test kit.
Within three weeks, EraGen had a working multiplex anthrax test chemistry, just because of the orthogonality.

Question: So, you could have your non-standard DNA on the chip, right? But when you have a sample that is real DNA, how does it...

Response: You're always going to have to capture standard DNA with a complementary sequence written in standard DNA...

Question: You just convert it, right?
Response: No, you don't have to convert it, but you have to do what you did here: divide the DNA detection problem into two parts. The first part is the recognition of natural DNA. This must be done with a strand that has A-T-G-C in it. The second part is responsible for the capture onto a support, or for signaling. If you try to put A-T-G-C into that part, it will cross-react with standard DNA and give you problems. So you use the non-standard nucleic acids for your signaling output and standard bases for the recognition of the natural DNA analyte.

Comment: OK, now let's get to the real point: you might ask why nature did not use this, and I would say you could have an RNA equivalent.

Response: Yes, we have it.

Question: ...and for U and T?

Response: Yes, we've made it.

Comment: But then it binds too well, and you don't have this exocyclic amine waiting to react with something; you're not going to get ribozymes that way; you're not going to get wobble-pairs, and all these other lousy pairs that make life interesting.

Response: But you do! These molecules with extra letters have a rich folding chemistry as well; iso-G, in particular, has a great tertiary structure. It is probably better than G at forming three-dimensional structures. I can go through each one of these and give you a chemical problem that may lead to an explanation for why nature does not use that particular structure. So iso-C deaminates with some degree of facility, for example; but C deaminates as well. Likewise, the C-glycoside pyDDA has an epimerization problem, as said in the metalanguage of organic chemistry. But then again, N-glycosides have depurination and depyrimidinylation problems. At some point, you must marvel that any of these things have the chemical stability needed to serve as genetic molecules. Indeed, if you're talking about chemical stability, RNA is not all that great a molecule (it is cleaved easily in base), so one may be astonished that it's used at all!
But these questions all come before we can ask whether the duplex stability is too tight. I worry about the chemicals - the covalent bonds holding together in these systems - long before I worry about whether the non-covalent interactions are adequate to support life. But that is a paradox I don't want to ignore. It is not easy to understand why we use
DNA in the first place. Once we've agreed to use DNA, I have to wave my hands to explain why these four nucleobases are used - as opposed to the other eight - fully recognizing that on any good day, if we had used the other ones, I could explain that just as well. This is a typical organic chemistry problem, which causes some to question whether organic chemistry is a science...

Question: Then you must also be changing the geometry of the structures?

Response: Yes and no. Obviously, any structural change changes the geometry at some level, perhaps only slightly. This certainly happens with C-glycosides, to which I casually referred. Joining the heterocycle to the sugar ring by a carbon-carbon bond instead of a carbon-nitrogen bond changes the pucker of the sugar ring, which might be responsible for the small difference in association constant. But it actually makes it more RNA-like - it's an interesting problem. The conformational change, though, is on the order of tenths of Angstroms, not Angstroms.

Question: Don't you think that the natural bases are maintained because they are resistant to tautomerism over a wide range of pH?

Response: Did everybody understand that? The question is whether the standard nucleobases are resistant to tautomerism over a wide range of pH. I'm perfectly prepared to reject this base-pair (iso-G) on the grounds of tautomeric issues. Iso-G, which we have looked at in detail, has big tautomeric problems; the rest of them do not. What is a tautomer? Let me just illustrate it with iso-G. Keep in mind that the location of hydrogen atoms in a molecule determines the hydrogen-bonding pattern. Iso-G has a hydrogen-bond donor, another hydrogen-bond donor, and a hydrogen-bond acceptor. There is an isomer of iso-G in which we move a hydrogen from the ring nitrogen onto the oxygen. What was once a hydrogen-bond donor-donor-acceptor pattern is now a donor-acceptor-donor pattern. Therefore, this kind of isomerism changes the hydrogen-bonding pattern.
In aqueous solution, the donor-donor-acceptor tautomer is about 91% of the total, with the isomer presenting the donor-acceptor-donor pattern contributing ~9% of the total. Now, the donor-acceptor-donor pattern on the minor tautomer of iso-G pairs with the acceptor-donor-acceptor pattern of T. So, that's the problem; the speculation was that this is an intrinsic reactivity of this arrangement of atoms that makes it unsuitable for a genetic molecule. G, by the way, also has a minor tautomer formed in the natural base; it contributes about one part in 10,000. But its presence is largely insensitive to
Evolution-Based Genome Analysis: An Alternative to Analyze . . .
solvent effects. It is amazing how stable that small number remains, actually. But the other tautomer of G would also pair with T. There was a paper by Topal and Fresco in 1976 in which an argument was made that the minor tautomer of G was important for mutations.

Question: Actually, may I go back for just a second? Do me a favor and draw it on the board. In the scheme you illustrated - the single-mutation detection - what exactly is on the chip and what exactly is labeled?

Response: It's complicated, but let me just briefly describe it: There is a detection step and a readout step. Let me draw you a cartoon, because the actual reality is more complicated. This is what we call an artificially expanded genetic information system, and this sequence is natural DNA. We're going to call it a primer, and what you have is an oligonucleotide molecule; and let's just say it has a G here and a C here. Now what you're going to do is introduce a polymerase here, and what that is going to do is copy the rest of it and make a complete copy. For the sake of argument, let's just say that you now do a polymerase chain reaction-type of reaction, in which you add the primer that is complementary to the product, and read back in the other direction.

Question: It's confusing; you know what you're talking about, but we don't; and what you've said a couple of times is that this funny business increases the sensitivity.

Response: Do you understand this system here, because this is relatively easy to explain?

Answer: Do this; that would do it well.

Response: OK, what you're looking for is the red DNA [in Figure 5], which is the analyte, and the readout is going to be glowing solid supports; so at some point you've got glowing solution - this is all present in solution. At some point you're going to recover the solid support and see if it is glowing. The green molecules glow - but in any case they give off light from the support only if they're attached to the solid support.
The theory is that the specificity of Watson-Crick base-pairing, A-T and G-C, will guarantee that the only way the green molecules will stick to the support is if there is something in between to bind them. Now, it is not direct; these
molecules are covalently attached to that DNA, that DNA binds to this DNA, and this DNA is covalently attached to that DNA. Keep in mind that you can't detect just one glowing DNA molecule; we need an amplification system, so this is 10,000-to-one onto a support. They have to be made of something that does rule-based molecular recognition, which is a problem, because in the natural world there is no system other than DNA that does rule-based molecular recognition. Another possibility would be to build that entire dendrimer out of covalent structure. You could do that, in principle. It is mostly because of cost that it isn't done; that's the primary reason.

Comment: So the way I understand it, which might be wrong, is that the noise is decreased by the sandwich, but you have two recognition...

Response: No, the noise arises because you have lots of glowing pieces attached to A-T-G-C; you have a support with A-T-G-C on it, and you have a lot of other DNA in there that is not the red DNA which contains A-T-G-C.

Comment: But there is also this sandwich thing, and because you hybridize with two different portions of your analyte, you obviously decrease the non-specificity.

Response: That may or may not be obvious, but it is true. The real reason this sandwich exists is that it allows the signal - a glow on a support, right? This sandwich is a way of attaching light-emitters to the support; ten thousand of them give a 10,000-to-1 amplification of the signal, and the background noise is due to the fact that the molecular recognition used to assemble this consists of the same units, A-T-G-C, as the contaminating DNA in a biological sample.

Question: Right, so then what I'm confused about - because I kind of understood this - is that you said you can do about the same without the sandwich.

Response: His point is: why don't you just synthesize all this into one big covalent thing; just make a big glob; make a polystyrene bead and put glowing things on it; put on one tag.
The answer to that is that it is a mess! People have tried this type of thing. The glob is difficult to synthesize; it turns out to be expensive, and hard to make in a form suitable for FDA approval; and there is a signal-to-noise issue as well, because the non-covalent and reversible assembly of this nanostructure is one of the ways you get cleaner results.
We can now take the next step. Obviously, we now have extra letters in the genetic alphabet and there is enormous potential for putting functionality on them. Let's ask how to make DNA into a molecule that has the same catalytic potential as proteins. This goes back to the work of Jack Szostak, Jerry Joyce, Larry Gold, and others. They tried to do test-tube evolution with DNA. In this work one makes a library of DNA molecules and sets up a selection system such that only those DNA molecules that have a particular catalytic activity survive.

This story shows how synthesis permits one to get full practical manipulative control over the behavior of nucleic acids. Here, the metalanguage is simple. We are analyzing only local interactions. We are also using only very simple rules that require no long-range, higher-order analysis. The level of the theory is really very low. The metalanguage that we use speaks of big and small groups, of hydrogen-bond donors and hydrogen-bond acceptors. Sometimes we talk about C-glycosides, sometimes we talk about negative charges, and sometimes we speak of uncompensated functionality. But we are not using quantum mechanics, molecular dynamics, or explicit solvents. The simplicity of the system arises from some key structural features, particularly the polyelectrolyte nature of the backbone. It is this feature of the DNA structure that we think will prove to be unique or universal. We have also learned that some features of the DNA structure, like the structure of the heterocycles, are flexible, and flexible to a good end.

The point I want to make before I move on is that this is different from what you see with proteins, a lot different. With proteins, we are not even close to this kind of practical manipulative control. The metalanguage that we use is quite inadequate to explain proteins.
Therefore, if we want to understand the way things are with proteins, we need to rely on a more historical approach, which I'm going to talk about as soon as I have cleared all the questions about the design and synthesis approach.

Question: Well, I'm curious whether these non-canonical nucleotides can be used as precursors?

Response: Yes. That's a long story, and I did not tell it because it's nowhere near as interesting or as clean. For a polymerase to incorporate non-standard bases requires the interaction of a protein with a nucleic acid. From many experiments, we know that that interaction is idiosyncratic. It turns out that if you just take the standard polymerase and throw it at non-standard nucleobases - we did this in 1990 - you can get some incorporation. But it is idiosyncratic. Taq polymerase works here only if this non-standard base is in the template and its complement is in
the triphosphate, not the other way around. With Tom Battersby as first co-author, we reported the first in vitro experiments using functionalized nucleotides in the Journal of the American Chemical Society around two years ago. We still have a long way to go.

Question: [inaudible]

Response: We've tried everything. We've tried HIV reverse transcriptase - which is actually the place you start - but HIV reverse transcriptase has the unfortunate feature of not being thermostable, so you don't have the opportunity to use PCR with it as well as you do with some of the others. [Note added in proof: Our laboratory will report the first example of a PCR with six letters; the report will appear late in 2003 in Nucleic Acids Research.]

Let me just go back and talk about proteins. As I mentioned before, with DNA, we are able to look at local sequence interactions and come up with perfectly good models that are predictive and that provide manipulative support to anything we do. With proteins, however, this has not been possible. In particular, local sequence interactions have not been particularly valuable in predicting and manipulating proteins. This is actually an old observation. Back in 1984, Chris Sander noticed that the pentapeptide valine-alanine-histidine-alanine-leucine was found in both triosephosphate isomerase and proteinase K. It forms a helix in triosephosphate isomerase; it forms a beta-strand in proteinase K. The helix in triosephosphate isomerase is the continuation of a longer helix, whereas in proteinase K the sequence is found in a beta-turn-beta structure. Obviously, in 1984, the database was very small. A few years later, as the database grew, identical hexapeptides were found in two protein contexts; one was a helix and the other a strand. Today we know of identical octapeptide sequences that, in two contexts, form a helix and a strand. This suggests that protein conformation is not determined by local sequence.
This fact defeated the field until we began using a historical approach in analyzing these particular systems. I'm going to have to digress a bit in order to discuss alignments of protein sequences. These are the key to the history of proteins.

Question: When God picked the four nucleobases, why did (s)he pick the four?

Response: The short answer is that I don't know. The long answer is in the metalanguage of organic chemistry. For example, adenine is a polymer of hydrogen
cyanide, and is therefore possibly prebiotic. A lot of HCN exists in the cosmos; if you spark it, heat it, and photolyse it right, you get adenine out. But then I must also tell you that adenine hydrolyzes in water to give hypoxanthine. This means that adenine is prebiotic only up to a point. The only problem with these explanations (as with other explanations in organic chemistry) is whether we could be just as convincing if it were the other way, using the explanatory metalanguage of organic chemistry. In general, we probably could be.

Question: Do you have any data concerning the flexibility of non-natural DNA?

Response: Flexibility meaning the conformational flexibility, or persistence length? No, we really don't, but I do not expect it to be any different, on theoretical grounds. We would expect the persistence length to be dominated by the repeating anion. But since you asked me about experimental data, I should not tell you about my expectations. No, we have no data.

Comment: That might be an interesting difference.

Response: We are willing to collaborate with anyone who wants to do the measurement. There must be experimentalists who do this in the audience.

Question: I was thinking, for instance, about super-elasticity and all those kinds of things, which are absolutely crucial for segregation of chromosomes and whatever...

Response: Yes, Chris Switzer, now a professor at the University of California at Riverside and a former postdoctoral fellow of mine, who has been carrying the iso-C/iso-G story forward, has looked at some of the iso-G structures in recombination forks - this type of thing. But keep in mind that the minute you start looking at a real biological system you are looking at proteins that have evolved over billions of years to handle A, T, G, and C, and that's a different question.

Question: You, or someone else, might do some single-molecule rotation...

Response: Well, I don't know how to do those.
I will be happy to collaborate with anybody who is interested. So, let me go into a little theory here to discuss sequence alignments. I only have around half an hour left, so I will go through this quickly. For the alignment of
two sequences, dynamic programming tools are the gold standard. These tools require a scoring scheme to find the alignment with the highest score. However, these tools assume that each site mutates independently. A variety of public tools are available to construct sequence alignments that include the sequences of many homologous proteins. First one must collect the sequences using these tools. Then one places the sequences into a program that returns a multiple sequence alignment, which generally has gaps. Usually the scientist is not satisfied with the gapping, so (s)he normally shuffles the gaps back and forth.

One reason the difficulties arise is that the sequence alignment packages are based on what we have come to call the first-order Markovian sequence alignment model. This is a model that assumes that future mutations in a sequence are independent of past mutations; that mutations occur independently at individual positions; that the probability of substitution reflects a twenty-by-twenty log-odds matrix; and that gaps are scored with a penalty-plus-increment formula.

I had a good friend in Zurich named Gaston Gonnet, who was a computer scientist. In 1990, Gaston came from Waterloo, in Canada, and had a look at the protein sequence database. Gaston knew all the computer science tricks to allow us to do what we call an exhaustive matching, in which we compare every sequence in the database with every other sequence. This enabled us to make historical statements about sequences in the database. Exhaustive matching finds sequences that are similar to each other, and we use it to suggest that the sequences are related by common ancestry; that they have a shared history. There is a caveat to that. But the exhaustive matching generated enough sequence pairs to allow us to test this Markovian model for sequence divergence.
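The penalty-plus-increment gap model just described can be made concrete with a small dynamic-programming sketch (the three-matrix recursion usually attributed to Gotoh). The toy match/mismatch and gap parameters below are illustrative stand-ins for a real twenty-by-twenty log-odds matrix, not values from any published scheme.

```python
# A minimal sketch of affine ("penalty-plus-increment") gap scoring in a
# global dynamic-programming alignment.  The scores are toy values.

NEG = float("-inf")

def affine_align_score(a, b, match=2, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Return the optimal global alignment score under affine gap costs."""
    n, m = len(a), len(b)
    # M: path ends in a substitution; X: gap in b; Y: gap in a
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend   # leading gap in b
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend   # leading gap in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1]) + s
            X[i][j] = max(M[i-1][j] + gap_open, X[i-1][j] + gap_extend)
            Y[i][j] = max(M[i][j-1] + gap_open, Y[i][j-1] + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])
```

Note how a gap of length two costs `gap_open + gap_extend`, not twice `gap_open`: this is exactly the "penalty-plus-increment" assumption whose adequacy the exhaustive matching allowed us to test.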
When the model is put to the test, it turns out that the penalty-plus-increment score for gaps is not a very good approximation for how real proteins suffer the insertions and deletions that lead to gaps. In fact, you can even see that by eye, when you inspect a multiple sequence alignment. It turns out that the probability of a gap in an alignment falls off roughly with its length to the three-halves power. Those of you who are experts in polymer mechanics should notice that exponent; it is an important one. Furthermore, it turns out that substitutions at adjacent positions are strongly correlated. It turns out that future and past mutations are strongly correlated as well. Some sites are more mutable than others.

You might ask why the probability of a gap is inversely proportional to its length raised approximately to the three-halves power. The answer is that I do not know; but if you make the assumptions that segments in a polypeptide chain that can be inserted and deleted are random coils, and that insertions or deletions extract or insert segments that end close in space, and if you assume that the same laws that govern, say, the conformation of free coils also
govern the conformation of coils in a protein, you would derive this relationship (if you model the peptide as a linear polymer with no excluded volume). The volume occupied by such a polymer scales with its length to the three-halves power. The probability of the ends of the polymer being near each other in space is inversely proportional to the volume of the polymer - you might very well expect that. And by the way, the best fit to the experimental data does not raise the length to the three-halves power; the best power fit is 1.7. That is the actual exponent that fits the curve for the gap-length distribution. If you go back and measure the volume occupied by a real polymer, as Paul Flory did in 1965 for polyalanine, the power is almost exactly 1.7. This is a remarkable use of a set of historical relationships, aligned protein sequences, to define a physical law.

One can now use this in structure prediction. That is, if you see a segment of a protein with a gap, you predict that the corresponding segment in the ungapped sequences adopts a random-coil structure. Notice how we made this prediction without a force field and without molecular dynamics. No calculation of energies is involved. All we did was say that if two proteins divergently evolve under functional constraints, and one loses a segment of the polypeptide chain, then the segment lost must have been a coil in the folded structure. Of course, this approach requires that homologous proteins that divergently evolve under functional constraints have analogous folds. There was a great paper by Cyrus Chothia and Arthur Lesk in 1986, with which you may not be familiar, that says that when two proteins divergently evolve from a common ancestor under functional constraints, the conformation (or fold) is more or less conserved, even though large amounts of the sequence are not conserved. So far, the historical view of protein sequence and structure has provided only a small piece of structure prediction.
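The power-law fit mentioned above can be illustrated numerically: given a gap-length histogram that falls off as length to the minus gamma, a straight-line fit of log(count) against log(length) recovers gamma. The counts below are synthetic, generated from gamma = 1.7 to stand in for a real histogram from a database of alignments.

```python
import math

# Recover the exponent of a power-law gap-length distribution by
# least-squares regression in log-log coordinates.

def fit_power_law(lengths, counts):
    """Fit counts ~ C * length**(-gamma); return the estimate of gamma."""
    xs = [math.log(l) for l in lengths]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# Synthetic histogram: an exact power law with gamma = 1.7, no noise.
lengths = list(range(1, 21))
counts = [10000 * l ** -1.7 for l in lengths]
gamma = fit_power_law(lengths, counts)
```

With real data the points scatter, but the fitted slope is what distinguishes the observed exponent (about 1.7) from the ideal-chain value of 1.5.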
Something else that is evident from the exhaustive matching is the existence of correlated residues. Remember, in the models with which you do alignments, residues i and i+1 are scored independently. In real proteins, substitution at adjacent positions is correlated. In particular, when residue i is conserved, residue i+1 is generally also conserved, at least more than average. However, there are exceptions to this general rule. If residue i is a conserved proline, then the adjacent residue is more likely than average to be variable. The numbers used to describe this are ten times the log of a probability ratio: the probability that the adjacent residue is conserved relative to the average probability of conservation at that position. The bottom line is that if position i contains a conserved methionine or valine or threonine, then the next residue (i+1) is likely to be conserved. This correlation of conservation or mutability of adjacent sites holds. But if residue i is a proline or a glycine, or even a glutamate, the adjacent residue is likely to be variable. You
might ask why that is. I am not really sure. But bear in mind that typical proteins fold, and when they fold there are turns, and turns are generally on the surface. Surface residues, unlike interior residues, can generally suffer substitution without dramatically changing the packing of the inside of the protein. But prolines that define turns are frequently conserved, as are glycines that define turns. So, if residue i is a conserved glycine defining a turn, then, since turns occur on the surface, it is likely to be adjacent to a surface residue, which can suffer change without much dramatic impact on fitness. Conserved glycines adjacent to variable residues are therefore more likely to appear in the database when they are in turns.

You can use this observation to predict the fold of structures as well: whenever you see a conserved proline or glycine adjacent to a variable residue, you predict that there is likely to be a turn at that position. This is the same pair of sequences; here is the coil. Because of the gap, there is a conserved glycine adjacent to a variable residue, so you'd put a turn there and you'd put a coil there... This is structure prediction, but we are doing it without the use of a large computer, a force field, or an energy calculation. We are using natural history.

Now that we have a coil and a turn placed on the segment, we have "parsed" it. I may not know what the secondary structure is between the coil and the turn; maybe it is an alpha-helix; maybe it is a beta-strand. But whatever it is, we can consider the secondary structure independent of what comes before the coil and after the turn. We can now explore another non-first-order behavior of proteins undergoing divergent evolution under functional constraints: the fact that future mutations are dependent on previous mutations.
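The turn heuristic just described - a conserved Gly or Pro column next to a variable column - is easy to mechanize. The tiny three-sequence alignment and the 50% variability threshold below are invented for illustration; a real application would use many more sequences, like the 71 kinases mentioned later in this talk.

```python
# Sketch: flag alignment columns where a strictly conserved G or P sits next
# to a highly variable column, as candidate turn positions.

def predict_turns(alignment, threshold=0.5):
    """Return 0-based column indices of conserved G/P flanked by variability."""
    ncols = len(alignment[0])

    def column(j):
        return [seq[j] for seq in alignment]

    def conserved(j):
        return len(set(column(j))) == 1

    def variable(j):
        col = column(j)
        return len(set(col)) / len(col) > threshold

    turns = []
    for j in range(ncols):
        if conserved(j) and alignment[0][j] in "GP":
            if (j > 0 and variable(j - 1)) or (j + 1 < ncols and variable(j + 1)):
                turns.append(j)
    return turns

# Invented mini-alignment: column 2 is a conserved Gly beside variable columns.
aln = [
    "AVGKT",
    "AIGRT",
    "ALGET",
]
```

Here `predict_turns(aln)` flags only the conserved glycine column, because its neighbors vary while the conserved Ala and Thr columns are not G or P.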
This is a more complicated argument to make, because you need to understand the concept of "evolutionary distance," or point-accepted mutations; the PAM distance. The PAM is the number of point-accepted mutations per hundred amino acids. So, two proteins that are separated by 5.5 PAM units are about 95% identical. If 10 PAM units separate two sequences, they are about 90% identical; 42 PAM units means maybe around 70% or 75% identical. It is not 60% or 58%, because there is a possibility of a second mutation occurring at a site where you've already had a first mutation.

It turns out that the probability of tryptophan being paired with arginine is greater than average in two proteins separated by 5.5 PAM units, whereas the probability of tryptophan being paired with a phenylalanine is less than average in two proteins separated by 5.5 PAM units. In contrast, in distant pairs of proteins, the probability of tryptophan being mutated into an arginine is less likely than average, whereas the probability of tryptophan being mutated into a phenylalanine is more likely than average. That, in and of itself, is
peculiar, especially if you know the physical-chemical properties of the side-chains. Tryptophan and arginine are about as different as you can get in an amino acid side-chain. Trp is a hydrophobic, big, flat, aromatic, oily thing. Arg is water-soluble, positively charged and hydrophilic. Thus, since Trp and Phe are both hydrophobic, flat, aromatic, oily things, you really would expect them to interchange more than you would expect Trp and Arg to interchange. But that is not true at low evolutionary distances; it is true only at high evolutionary distances. That observation is also a consequence of this historical analysis, of roughly 1.7 million lines of paired amino acid sequences.

How do we explain this? Again, I do not know for certain. But you can go back and look at the codon for tryptophan. It is U-G-G in the standard code. Suppose I have only a small amount of time to diverge; suppose I only have time to change one of the three bases in the U-G-G codon for tryptophan. I can make AGG or CGG or GGG, or I can make UAG or UCG or UUG, and so on. It turns out that the way the code is structured, none of the codons that arise from a single nucleotide replacement in the tryptophan codon code for phenylalanine; two of them code for arginine, and two for cysteine. This leads to the obvious explanation that at low PAM distances, the code drives variation in the protein. In contrast, chemistry drives variation at high evolutionary distances.

So you ask yourself: where in its three-dimensional folded structure will a protein tolerate a code-driven substitution when tryptophan is replaced by an arginine? The answer is: only on the surface. So when one sees a tryptophan paired to an arginine at a low PAM distance, one infers that the side-chain is on the surface of the folded structure. That provides a piece of tertiary structural information about the protein fold. It is a statement about the disposition of the side-chain of a residue relative to the bulk structure.
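The remark that 42 PAM units corresponds to well above 58% identity can be checked with a toy Monte Carlo, assuming accepted mutations land uniformly at random over sites. Even with uniform rates, repeated hits at the same site (and occasional reversions) keep the observed identity above the naive 100-minus-PAM figure; real among-site rate variation pushes it higher still, toward the 70-75% quoted above.

```python
import random

# Toy simulation: scatter "accepted point mutations" over a sequence and
# measure how many sites still match the ancestor.

def identity_after_pam(pam, nsites=100_000, alphabet=20, seed=1):
    """Fraction of sites unchanged after pam accepted mutations per 100 sites."""
    rng = random.Random(seed)
    seq = [0] * nsites                       # every site starts ancestral
    nmut = int(pam * nsites / 100)           # total accepted mutations
    for _ in range(nmut):
        i = rng.randrange(nsites)
        new = rng.randrange(alphabet - 1)    # any residue except the current one
        seq[i] = new if new < seq[i] else new + 1
    return sum(1 for r in seq if r == 0) / nsites

ident_42 = identity_after_pam(42)
```

Under these assumptions `ident_42` comes out near exp(-0.42), around two-thirds identity, already clearly above the 58% a naive subtraction would give.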
By the way, you can go back and do this with other patterns of replacement, and you can come up with many such statements. Again, local sequence does not reliably predict secondary structure. But you can use a historical model to extract this kind of tertiary interaction from protein sequences divergently evolving under functional constraints.

So between the coil and the turn - making reference to the slide showing the protein kinase alignment - position 130 is a surface residue, 131 is an interior residue, 132 and 133 are both surface, 134 is an interior residue, 135 and 136 are surface residues, and 137 and 138 are interior residues. And now I can ask you what the secondary structure is between the coil and the turn. All those who think it is a beta-strand, raise your hands. All those who think it is an alpha-helix, raise your hands. The 3.6-residue periodicity in the pattern of surface and interior residues shows that this is a helix.

The Edmundson-Schiffer helical wheel is very useful at this
point, because if you map those residues not as a line, but project that line onto a 3.6-residue-per-turn helix, 131 is on the inside, 132 is on the surface, 133 is on the surface, 134 is on the inside, and 135 is on the surface, and it is quite clear that this forms a helix where the inside of the protein is on one side and the water is on the other. This is a prediction at a level of theory at which not a single person in this room could get tenure. It involves no number-crunching; it involves no force fields; all it requires is a natural-history perspective on sequence and structure in proteins. We've been doing this now for about twelve years.

Question: How are the two sequences on the slide related?

Response: Common ancestry.

Question: Which organisms?

Response: This prediction was done with 71 sequences of protein kinases from all sorts of organisms, but mostly higher organisms: yeasts and mammals. I have put only two of them up there, because the slide with 71 sequences becomes extremely confusing.

Question: So these two are homologous proteins, two very different...

Response: Well, these two happen to be one from yeast and one from mammal; they happen to be protein kinases. But if you really want to do an evolutionary analysis and have true history, the game here actually is that you do not have two sequences; you do not even have four sequences; you have a whole bunch of sequences scattered around an evolutionary tree, with some degree of symmetry across the tree. Some information is contained in sequences that are very similar. The differences between two sequences that are overall very similar contain information, and the similarities between two sequences that are overall very different contain information, and if you want to actually build a good structural model, you get all of them, and use as many sequences as you want. There were 71 in this particular prediction. What comes out of this is a prediction that looks something like this.
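The helical-wheel reasoning above can be sketched as a projection: place the predicted surface/interior states around a circle at 100 degrees per residue (an alpha-helix) and at 180 degrees per residue (a beta-strand), and ask which geometry segregates the two classes onto opposite faces. The pattern string encodes the assignments for positions 130-138 as read off the alignment; the "moment" measure is an illustrative stand-in for the Edmundson-Schiffer wheel, not their actual construction.

```python
import math

# Which backbone geometry best separates interior (I) from surface (S)
# residues?  Sum unit vectors (+1 for I, -1 for S) around the wheel; a large
# net moment means the two classes cluster on opposite faces.

def segregation_moment(pattern, deg_per_residue):
    """Magnitude of the interior-minus-surface vector sum on the wheel."""
    x = y = 0.0
    for k, state in enumerate(pattern):
        theta = math.radians(k * deg_per_residue)
        sign = 1.0 if state == "I" else -1.0
        x += sign * math.cos(theta)
        y += sign * math.sin(theta)
    return math.hypot(x, y)

pattern = "SISSISSII"                       # positions 130..138 from the text
helix = segregation_moment(pattern, 100.0)  # alpha-helix: ~3.6 residues/turn
strand = segregation_moment(pattern, 180.0) # beta-strand: alternating faces
```

The helical projection yields a much larger moment than the strand projection for this pattern, which is the quantitative form of "raise your hands: it is a helix."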
That prediction was published before there was an experimental structure for any protein kinase. We sent this prediction to Sue Taylor, who was at UCSD doing the crystal structure with the crystallographers there. When the structure eventually emerged, we were able to overlay the experimental and predicted structures, and
you can see the correspondence between the two. That is a level of prediction I do not know how to achieve with higher-level theory.

Question: [inaudible]

Response: Let me call your attention to the big mistake, which is over here. There is a long internal helix, and since it is inside, it is actually very difficult to apply simple rules to predict its secondary structure. The rest of it is remarkably accurate.

Question: [inaudible]

Response: That is right; Arthur Lesk recently developed a very nice tool for representing the topology of a protein fold. I do not really like the word, since the representation is not really topological, but it suggests the connectivity or folding of these units. This is the resolution that you get from these predictions. For example, I have just said that these two strands in this structure are anti-parallel and form the core of a beta-sheet. That is an explicit part of this prediction.

We made this prediction by looking at, again, a non-linear behavior of the protein. It turns out that the evolutionary history of the protein family at site 108 and the evolutionary history of the protein family at site 87 are correlated. A neutral amino acid became a negatively charged amino acid at position 108. During the same episode of evolutionary history, a hydrophobic amino acid at position 87 - leucine or proline in this branch - mutated to become a positively charged, hydrophilic amino acid, arginine. So what we are seeing is correlated change at distant residues, 20 amino acids apart in the sequence, which leads us to suspect that these two residues, although distant in the sequence, are near each other in the three-dimensional conformation. This allows us to pack those two beta-strands, predicted for other reasons, into an anti-parallel structure. By the way, when the crystal structure emerged, not only were those two residues found to be close, but they also formed a salt bridge.
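The correlated-change signal between sites 87 and 108 is the kind of thing a column-wise mutual-information scan picks up: two columns whose substitutions track each other score high, while independently varying columns score near zero. The six-sequence columns below are invented to mimic the compensatory charge swap; they are not taken from the kinase alignment.

```python
import math
from collections import Counter

# Mutual information (in bits) between two alignment columns.

def mutual_information(col_a, col_b):
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in pab.items():
        pxy = c / n
        mi += pxy * math.log2(pxy * n * n / (pa[a] * pb[b]))
    return mi

# Hypothetical columns from six sequences: the change at one site always
# co-occurs with the change at the other (a compensatory pair), while the
# control column varies independently of both.
col_87 = list("LLLRRR")    # hydrophobic in one clade, Arg in the other
col_108 = list("AAADDD")   # neutral in one clade, Asp in the other
col_ctrl = list("AGAGAG")  # varies with no relation to the clades
```

The perfectly coupled pair scores a full bit of mutual information; the control pair scores near zero, so the coupled columns stand out as candidate spatial neighbors.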
The prediction of the protein kinase fold was important, in part because threading and homology-modeling approaches had failed. Our prediction of an anti-parallel sheet in protein kinase said that this protein was not a distant homologue of adenylate kinase. In adenylate kinase, the central beta-sheet is parallel, a different fold entirely. Logically, if homologous proteins have analogous folds, then proteins with non-analogous folds cannot be homologous. This therefore is a case in which structural prediction was used to deny distant homology, which is the opposite of what we usually do.
Let us see why the homology modelers failed. Both kinases have a motif: a Gly-X-Gly-X-X-Gly sequence. In adenylate kinase, that motif lay in a strand-turn-helix pattern. In the predicted structure for protein kinase, the Gly-X-Gly-X-X-Gly motif lay in a strand-turn-strand pattern. From that we concluded that the folds of these two proteins were not analogous, and that these two proteins could not be homologous. That was the thing that I think really impressed the crystallographers. Sue Taylor was one of the people who had used that motif to say that protein kinase and adenylate kinase are themselves related by common ancestry, and therefore should have analogous folds. In fact, five groups had built models in which they had placed the sequence for protein kinase on top of the sequence for adenylate kinase. They were all wrong, because that motif convergently evolved in these two cases. I think this is why the crystallographers were so nice to us.

This is what the crystallographers said about our predictions: "Remarkably accurate, especially for the small lobe," where we had actually packed the domains correctly. Janet Thornton pointed out that this is "much better than achieved by standard methods," and Lesk and Boswell wrote: "spectacular achievement"; "major breakthrough!" So, this is the kind of thing we can do with a level of theory that is not very high, but just by bringing natural history together with the physics and chemistry paradigms.

Question: Then what is your most spectacular failure?

Response: Perhaps I just showed you the most spectacular failure. The failure to detect secondary structural elements that are completely buried, as well as the failure to detect secondary structural elements near active sites, are generic weaknesses of this approach. Why is that? Because active-site patterns of variation and conservation are dominated by things other than the conformation of the secondary structure.
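The Gly-X-Gly-X-X-Gly motif itself is trivially easy to detect, which is exactly the point: a match says nothing about whether it sits in a strand-turn-helix (adenylate kinase) or a strand-turn-strand (protein kinase) context, so motif spotting alone led the modelers astray. The sequence below is invented for illustration, not a real kinase.

```python
import re

# Locate every Gly-X-Gly-X-X-Gly motif in a protein sequence.  "." stands
# for any residue (X).  Presence of the motif carries no information about
# the flanking secondary structure.

MOTIF = re.compile(r"G.G..G")

def find_motif(seq):
    """Return 0-based start positions of every Gly-X-Gly-X-X-Gly match."""
    return [m.start() for m in MOTIF.finditer(seq)]

positions = find_motif("MKLGAGSSGKVTRA")  # hypothetical sequence
```

A one-line regular expression finds the motif, but deciding whether two motif-bearing proteins share a fold took the full historical analysis described above.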
We had the same problems in nitrogenase and in isopenicillin synthase.

Comment: But that is not a spectacular failure.

Response: Well, thank you, but these are the failures that you have.

Comment: Well, if you are such a clever organic chemist, and you can make all these new kinds of nucleotides...

Response: This question cannot be going in a good direction...
Evolution-Based Genome Analysis: An Alternative to Analyze . . .
Question: No, it isn't. So, why are you wasting your time with this kind of prediction stuff, sequence-gazing and whatever? Response: Ah, because it is not just sequence-gazing. Let me go on to the next step. Why are we interested in predicting secondary structures? The answer is very close to what you have just seen here. We are very much interested in knowing whether two proteins are related by common ancestry. The reason for that is that we are interested in knowing function. And function can sometimes be indicated by common ancestry. For example, there was a target from the CASP-2 protein structure prediction contest that was called the heat shock protein 90 (HSP90). Now, that title is completely uninformative about function; it tells you that the gene for the protein is turned on when you shock the organism with heat and that the protein has a mass of 90,000. That is what the name means. That's all they knew about it. So CASP-2 put this protein out to the structure prediction contest. We and Fernando Bazan both actually produced accurate predictions that told us about function. Fernando was doing pretty much the same thing we were doing by then. You can judge for yourself how well the predictions corresponded to the experimental structure, which was kept secret until the prediction was announced. From that predicted structure we recognized a long-distance homology between this protein HSP90 and the protein called gyrase. Dietlind Gerloff made this observation by eye. We could then draw a functional inference based on a distant homology. Again, the crystallographers were really very nice to us, because as it turned out, the gyrase structure had been solved, but the coordinates had not been deposited in the database, which is actually a very common problem in this business. Experimental studies had said there was no ATP binding-site in HSP90. Our prediction said that there was, based on this distant homology detection. 
It turned out that the experimental studies were wrong, for reasons that are too complicated to go into. So, this is the kind of thing that we do with predicted structures. We look for distant homologues as a way of inferring function. I have page after page of these, but let me see if I can give you another example. Question: Did I hear that heat shock stabilizes the DNA?
Response: No. When you hit something with heat, you do all sorts of things. You turn on chaperones, for example. Gyrase is a DNA manipulation protein, which allows you to untangle DNA, basically. This is part of the response to heat shock: turn on a protein, manipulate DNA, help fold proteins, turn on chaperones. Obviously, structure prediction from our perspective is low-resolution. We do not get atomic resolution out of these structures - not that we would know what to do with it if we did. If I have a protein at atomic resolution from a crystal structure, I still do not know enough about things to design a molecule that will bind to it. So this turns out to be a very powerful tool, in part because it allows you to deny distant homology, like we did with protein kinase, or to confirm distant homology, which is what we did with the prediction for the heat shock protein. I am running out of time; let me just see if I can go to the last point, which is relevant to what you just said. In this analysis, we are obviously using so-called contemporary annotation logic. If heat shock protein and gyrase are inferred to be homologous from a prediction of their folds, we are tempted to assume that they have analogous functions. This is the annotation transfer logic: sequence similarity indicates homology; homology implies analogous folds (which it universally does by empirical analysis); analogous folding implies analogous behaviors; and analogous behaviors imply analogous functions. This logic is widely used; in fact, almost all the new sequence databases are being annotated that way. But the logic is easily defeated by one word: recruitment. This is a great example of it. Here are three proteins, all of which have recognizably similar sequences: GSSIMPGK, GSSIMPAK, and GSSAMPYK. All three proteins are homologous; they all fold to give 8-fold alpha-beta barrels. 
But one of them works in nucleic acid biosynthesis, one of them works in the citric acid cycle, and one of them works in amino acid degradation. This creates a problem for annotation transfer. Can we tell when function might have changed simply by examining the sequence data? Again, we need a natural history perspective. Let me illustrate this using leptin, the obesity gene protein. When you knock the leptin gene out of a mouse, the mouse becomes plump. There was a Science cover with two svelte mice on one side of a balance and a plump one on the other side missing the leptin gene. That was from the Howard Hughes Medical Institute at Rockefeller University. We built an evolutionary tree and a sequence alignment for leptins. We then reconstructed the ancestral sequences for ancient leptins throughout the tree. From that, we could predict a leptin fold. The predicted fold is closely related to the cytokine fold, a well-known set of proteins involved in signal regulation with four helices in the structure. Thus, leptin is a distant homologue of the cytokines.
Next, we reconstructed the history of the mutations, saying what mutations occurred on which branches. We noticed that in the branch evolving to give the hominoid apes, the leptin protein was evolving very rapidly, faster than would be allowed by any model, except for one which says that the child with mutant leptin is more likely to survive than the parent; the mutant is more fit than the parent. If the mutant is more fit than the parent, it means that the protein "function" is changing. We actually did this in a consultation environment. The people at Sandoz wanted to know whether they should go after leptin as a human obesity gene target; as a therapeutic target. Our comment was that if you are going to do that, you had better do pharmacological studies in a primate model rather than a rodent model, because somewhere after primates diverged from the ancestral stem stock something happened to the role of leptin. This is not surprising, based on physiology. Your feeding behavior is different from that of mice. You're a binge eater; when a mouse goes out to find food, it is just as likely to be food as to find food. So, there is an enormously strong selection pressure on feeding behavior in rodents that is not present in primates. We published our prediction in 1998. Last year I was delighted to see this article in Nature: "Whatever happened to Leptin?" Quoting from the article: "It seemed just five years ago that a single protein might reverse the rising tide of obesity, but what works in mice has not yet been translated into people." Now, that is not a surprise to us, or to anyone who looked at the leptin sequence from the natural history perspective. Question: The speed of change is just the number of mutations? 
Response: It is actually the number of mutations at the DNA level that change the encoded sequence; the non-synonymous mutations, divided by the number of mutations in the DNA that are silent, that do not change the encoded sequences, normalized for the number of silent and non-silent sites. We can make the mathematical model more sophisticated. Consider the PAM distance metric discussed above. The model behind it assumes that individual sites in an amino acid sequence have the same rate of divergence. That is clearly a poor approximation, as is obvious by simple inspection of any of these alignments. We can advance the model to allow some sites to evolve more rapidly than others. A gamma distribution may be used to model the distribution in mutability. But even this is an approximation, because it assumes that some sites that are mutable in some branches of the evolutionary tree will be the same sites that are mutable in other
branches of the evolutionary tree. That is called a stationary gamma model for sequence divergence. What if a function is changing in two branches of the tree? Well, then you might very well expect different sites to be more mutable in this branch of the tree than are mutable in other branches of the tree. Eric Gaucher, who is a graduate student in the group, went back and had a look at elongation factor Tu in light of this. This protein is used in translation, and is highly conserved. Everybody agrees that the function is "the same" everywhere. But if you look closely, you will see that the sites that are more mutable in the eubacterial branch of the tree are not the same as the sites that are more mutable in the eukaryotic branch, and vice versa. This implies that the functions in the two branches are not the same. Next, we will use this to do something that people like Olivier Lichtarge have been working on. You place the sites that display peculiar evolutionary properties on the three-dimensional crystal structure and ask yourselves: "Where are the sites in the three-dimensional crystal structure that are more mutable in the eukaryotes and less mutable in the eubacteria, and where are the sites that are more mutable in the eubacteria and less mutable in the eukaryotes?" They are certainly not randomly distributed around the structure. It turns out that the eukaryotic protein must leave the nucleus, must go to the ribosome in the cytoplasm, and must follow an actin filament as it does so. You can identify the actin binding-site on the eukaryotic elongation factor as some residues on the surface that are not as mutable as the corresponding residues in the eubacterial enzyme, which does not have a nucleus to leave and does not have any actin filaments to follow. This kind of analysis becomes very useful when you can go one last step, to which I have already alluded. In the next step, we date the divergences from the molecular records. 
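As a rough illustration of the rate measure described in the response above - a simplified Ka/Ks-style ratio, not Benner's actual estimator, which would also correct for multiple substitutions at a site - one could write:

```python
def omega(nonsyn_changes, nonsyn_sites, syn_changes, syn_sites):
    """Crude dN/dS-style ratio: substitutions per non-synonymous site
    divided by substitutions per synonymous (silent) site."""
    dn = nonsyn_changes / nonsyn_sites
    ds = syn_changes / syn_sites
    return dn / ds

# A ratio well above 1 suggests the mutant protein was fitter than the
# parent, i.e. that the protein's "function" was changing on that branch.
print(omega(30, 300, 5, 100))  # -> 2.0
```

The function name and the counts here are invented for illustration; established tools estimate this ratio with corrections the sketch omits.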
Obviously, when you have mice and humans, you know roughly when sequences diverge from fossils. When a gene duplicates within an organism, it is more difficult to date divergences, but we have developed a clock based on silent substitutions that does it well. A protein sequence does not change with a clock-like rate constant; silent substitutions do a little bit better. It is possible to change the DNA sequence without changing the encoded sequence; selective pressure will not accept or reject such a change as strongly. However, there are twelve different rate constants for silent substitution; A can go to G and G can go to A, and so on, so it is very complicated. You know, people tried to aggregate everything, and what they ended up seeing was nothing. What we've actually done is just one simple thing: when we look at two-fold redundant codon systems, we are looking only at transitions; that is, C to T and T to C, or A to G and G to A. It turns out that
transition rates are remarkably clock-like. Let me just show you how clock-like they are. You take the yeast genome... Question: Why do you do that? Response: Instinct is the answer. A chemist views C and T as being very similar nucleotides. The rate constant for interconverting one pyrimidine and another, such as C and T, is generally much faster than that for interconverting A and C, for example. [inaudible comment or question from audience] Response: No, the source of mutations in the wild is almost certainly not that. The source of natural mutations is not known, but repair mistakes and polymerase errors are possible. There are many types of silent sites in the standard genetic code. Some offer better clocks than others. Most useful are silent sites in codon systems that are two-fold redundant. Here exactly two codons encode the same amino acid. These codons are interconverted by transitions, a pyrimidine replacing another pyrimidine, or a purine replacing another purine. When the amino acid itself is conserved, the divergence at such sites can be modeled as an "approach to equilibrium" kinetic process, just like radioactive decay, with the end-point being the codon bias, b. Here the fraction of paired codons that is conserved, f2, is equal to b + (1-b)e^(-kt), where k is the first-order rate constant and t is time. Given an estimate of the rate constant k for these "transition-redundant approach-to-equilibrium" processes, if k and b are time-invariant, one can estimate the time, t, for divergence of the two sequences. Empirical analysis suggests that codon biases and rate constants for transitions have been remarkably stable, at least in vertebrates, for hundreds of millions of years. Therefore, approach-to-equilibrium metrics provide dates for events in molecular records within phyla, especially of higher organisms. These dates are useful to time-correlate events in the molecular record with events in the paleontological and geological records. 
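A minimal numerical sketch of this dating method, inverting f2 = b + (1-b)e^(-kt) for t; the values of k and b below are hypothetical placeholders (the text gives neither):

```python
import math

def f2(t, k, b):
    """Fraction of paired two-fold-redundant silent sites still conserved
    after time t: first-order approach to the equilibrium codon bias b."""
    return b + (1.0 - b) * math.exp(-k * t)

def divergence_time(f2_obs, k, b):
    """Invert f2 = b + (1-b)e^(-kt) to date a divergence."""
    return -math.log((f2_obs - b) / (1.0 - b)) / k

# Hypothetical constants, for illustration only.
k = 0.005  # transitions per silent site per million years (invented)
b = 0.55   # equilibrium codon bias (invented)

t = divergence_time(0.84, k, b)  # f2 = 0.84, as for the yeast episode below
print(f"estimated divergence: {t:.0f} Ma")
```

With these invented constants the f2 = 0.84 episode would date to roughly 90 Ma; the real calibration, of course, comes from fossil-dated divergences.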
Of course, simultaneous events need not be causally related, especially when simultaneity is judged using dating measurements with variances of millions of years. But an observation that two events in the molecular record are nearly contemporaneous suggests, as a hypothesis, that they might be causally related. Such hypotheses are testable, often by experiment, and are useful because they
focus experimental work on a subset of what would otherwise be an extremely large set of testable hypotheses. Consider, for example, the yeast Saccharomyces cerevisiae, whose genome encodes ~6,000 proteins. The yeast proteome has 36 million potentially interacting pairs. Some systems biologists are laboring to experimentally examine all of these in the laboratory, hoping to identify these interactions. Correlating dated events in the molecular record offers a complementary approach. Gene duplications generate paralogs, which are homologous proteins within a single genome. Paralogous sequences may be aligned, their f2 calculated, and their divergence dated. In yeast, paralog generation has occurred throughout the historical past. A prominent episode of gene duplication, however, is found with an f2 near 0.84, corresponding to duplication events that occurred ~80 Ma, based on clock estimates that generated divergence dates in fungi. These duplications created several new sugar transporters, new glyceraldehyde-3-phosphate dehydrogenases, the non-oxidative pyruvate decarboxylase that generates acetaldehyde from pyruvate, a transporter for the vitamin thiamine that is used by this enzyme, and two alcohol dehydrogenases that interconvert acetaldehyde and alcohol. This is not a random collection of proteins; rather, these proteins all belong to the pathway that yeast uses to ferment glucose to alcohol. Correlating the times of duplication of genes in the yeast genome has identified a pathway. Approach-to-equilibrium dating tools can be more effective at inferring possible pathways from sequence data than other approaches, especially for recently evolved pathways. By adding the geological and paleontological records to the analysis, however, these pathways assume additional biological meaning. Fossils suggest that fermentable fruits also became prominent ~80 Ma, in the Cretaceous, during the age of the dinosaurs. 
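The episode-finding step just described amounts to binning the f2 values of paralog pairs and looking for a crowded bin. A toy sketch with invented f2 values (the real analysis used the full set of yeast paralogs):

```python
from collections import Counter

# Invented f2 values for aligned yeast paralog pairs (illustration only).
f2_values = [0.99, 0.95, 0.84, 0.85, 0.83, 0.84, 0.84, 0.72, 0.61, 0.85]

# Group pairs into f2 bins of width 0.05; a crowded bin marks an episode
# of gene duplication whose members diverged at about the same time.
bins = Counter(round(v / 0.05) * 0.05 for v in f2_values)
peak_bin, count = max(bins.items(), key=lambda item: item[1])

print(f"episode near f2 = {peak_bin:.2f} ({count} paralog pairs)")
```

Each pair in the peak bin is then a candidate member of a co-evolved pathway, which is how the fermentation genes cluster together in the analysis above.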
Indeed, over-grazing by dinosaurs may explain why flowering plants flourished. Other genomes evidently also record episodes of duplication near this time, including those of angiosperms (which create the fruit) and fruit flies (whose larvae eat the yeast growing in fermenting fruit). Thus, time-correlation between the three records connected by approach-to-equilibrium dates generates a planetary hypothesis about functions of individual proteins in yeast, one that goes beyond a statement about a behavior ("this protein oxidizes alcohol...") and a pathway ("...acting with pyruvate decarboxylase...") to a statement about planetary function ("...allowing yeast to exploit a resource, fruits, that became available ~80 Ma"). This level of sophistication in the annotation of a gene sequence is difficult to create in any other way. You can then resurrect the alcohol dehydrogenases, work which was done by Andrew Ellington, Hiroshi Nakano, and Mike Thomson, in which they made the
protein that is the ancestor to the oxidizing enzyme and the fermenting alcohol dehydrogenase in yeast. The ancient enzyme is not fermenting; the fermentation behavior is the derived trait that arises following the duplication and, I should say, an episode of rapid sequence evolution. Since I'm out of time, let me conclude by pointing out that we started off by asking why things were the way they were, recognizing that in some way, we had to combine intrinsic chemical reactivity with the history that leads to the biological systems. With nucleic acids, this story is dominated by intrinsic chemical reactivity, because local interactions work, and the organic metalanguage is adequate. You can get full practical manipulative control over nucleic acids and you can persuade yourself that some of their features are universal. With proteins it is quite different. Non-local interactions are very important, and theory to handle these is still lacking. However, here the historical analysis becomes powerful. We can solve the prediction problem for proteins at low resolution right now, at least up to its limits. But these limits are still good enough to detect distant homologues, which is how the predictions are applied. We now have dozens of case studies, such as the ones I mentioned, that assess function, using this combination of geo- and paleobiology. In some cases these studies include the resurrection of ancient forms. Our goal is analysis of the global proteome. With scientists at EraGen Biosciences in Madison, Wisconsin, we have assembled what is called the Master Catalog, which is based on the fact that after all the genomes of all the organisms on Earth have been sequenced, there are only going to be about 100,000 families of proteins. Every one of them tells a story that we are working through one at a time. If anyone would like to help, please let me know. With that, let me stop. Thank you for your exciting questions. I will be happy to answer any more if you have them. 
Question: I just wanted to understand why you have this peak in the recent past on the yeast evolution curve? Response: That is an excellent question. The answer is, of course, I do not really know. But I suspect that this recent spate of duplication in yeast is responsible for yeast adapting to its new interaction with man. All the genes that are duplicated either allow yeast to divide faster or to ferment malt. The latter needs no discussion. The former, one may hypothesize, reflects the fact that yeast in the wild is rarely as well fed as it is in human culture, meaning that it never needs to divide as fast.
[inaudible question from audience] Response: The major episode of duplication in the human genome in the Jurassic may be associated with the emergence of placental reproduction. It is an interesting question. We go back in history and ask where the major challenges were and where innovation in gene duplication was required. Question: I have the impression that except for those few mistakes concerning the wrong secondary structure assignments, you can pretty much predict the structure from the sequence, so do we have a rough solution to the protein structure prediction problem? Response: Yes, that is right. Keep in mind that structure prediction requires an input of more than one sequence. Our best case with few sequences was with a hemorrhagic metalloprotease, for which we had just seven; but in this case, we were very fortunate that those seven are widely distributed across the tree. It doesn't do to have a hundred sequences that are all very similar - that's like having a hundred copies of the same sequence. But from a nicely balanced tree with twenty sequences; some that are 10-15 PAM units apart, some 20-30 units apart, some 50-70 PAM units apart, and some 100 PAM units apart, the secondary structure prediction will identify maybe 80% of the secondary structural units in a way that is clearly obvious, with about 20% being ambiguous. Secondary structural elements near the active sites are difficult to assign, and in a typically sized protein, one secondary structure segment will be completely buried, and you'll scratch your heads for hours to decide whether it is an internal helix or an internal strand. Question: Is this just the secondary structure? Response: Yes, this is just secondary structure prediction. Tertiary structure prediction is then based on identifying active-site residues, which are brought together in the fold. One can also look for covariation, which is the example I showed you in protein kinase, in residues 108 and 87. 
Here is just a case in which we put together a tertiary structure based on active-site residue assemblies. It was actually predicted for protein tyrosine phosphatase and published in 1995, but it was clear that it is possible to assemble elements of the secondary structure into tertiary structural elements. In general, the structural model is more or less complete, depending on how big the protein is. A very big protein requires you to spend much time sitting there
trying to figure out alternative structures. With, for example, some smaller proteins like synaptotagmin, which was a CASP-1 project target, we predicted three alternative conformations. It turned out that one of the three was correct. This level of resolution is entirely adequate to ask whether these proteins are clearly not homologous, since they do not have analogous folds. Question: For both secondary and tertiary... Response: Yes, that's right. Question: Your last line is "when is a problem considered solved?" Response: Yes, I'm sorry; I did not get to it. Question: So is the protein-folding problem solved? Response: Yes and no; the answer depends on what you want to do with the predicted folded structures. We ourselves want to answer certain kinds of biological problems. For example, in this case, we wanted to know whether protein serine phosphatase was homologous to protein tyrosine phosphatase. It was not, and that was based on two predicted structures; one that we did and one that Geoff Barton did. We wanted to know whether a cofactor metal, zinc, was required, as it is in the protein serine phosphatase, because we were interested in the mechanistic features of that enzyme. The predicted structure suggested that it was, and this prediction was correct. I think that the challenge today is to get atomic-level resolution in these three-dimensional structures. The temptation is to use force fields and number crunching. To do this, however, requires solving problems that are far from being solved. I am not convinced that I understand the packing of small organic crystals, nor can I predict the solubility of a compound in water. These are two issues that are directly related to the folding problem, which we cannot do very well with small molecules. So, the protein structure prediction problem is solved when the predictions answer the biological questions you want to answer. 
From our perspective, what we understand about the protein structure prediction problem, starting from an input of homologous sequences, is where it will not work, and frankly, unless there is a very good idea, where it will never work. Secondary structure near active sites is a classic case of that. We have never gotten secondary structures near active sites correct reliably. We understand why we can't get it right; patterns of variation near
the active site are dominated by issues related to catalysis that are not related to fold. Obviously, someone could come along with a good idea and solve that problem, using multiple sequence line-ups, but there are good reasons to believe that the de novo prediction of secondary structure right at the active site will be a very difficult problem to solve by this approach. It is an interesting sociology; we seem to be in the middle of a Kuhnian paradigm shift in this field, something that I never thought actually happened. Many of the number crunchers who participate in the CASP project do not seem able to accept the fact that we predict these three-dimensional structures of proteins without crunching numbers. One can make successful predictions, publish them before experimental structures are known, have judges declare these structures to be correct, and the reaction from some number crunchers is still disbelief. They cannot believe that a solution to the problem is possible without a large computer, a force field, and number crunching. So any solution that does not involve these does not occupy a stable position in their view of the world. Question: So you would argue - maybe I will argue it for you - why anyone would care about trying to predict three-dimensional structures. It seems like a waste of time, apart from an intellectual exercise. What Olivier Lichtarge said in another context, and what you said in this context, was to look at all these footprints that life has left for you — let's sort this out, and once you have done that, more or less, you know everything that is happening, so why would anyone even worry about... Response: Obviously, if we are doing distant homology detection, prediction based on natural history analysis is a way to do it. If we want to place function together with biomolecular structure, again, prediction based on natural history analysis is a useful tool. 
From there on, the mission of chemistry is to understand the behavior of matter in terms of the behavior of its constituent molecules. There is a role for computation here. Maybe, over the long-term, computational models for molecules may permit organic chemistry to escape from the "non-scientific" features of its explanations; those derived from its non-computational metalanguage. But, at least for biomolecules, the first step is to understand water. Until you have a model that is predictive and manipulative for water, then for things dissolved in water, and for small organic molecules generally, there seems to be little use to apply computation to biomolecules. The drive to do so, at least in the United States, comes from funding agencies, of course. The National Institutes of Health virtually insists that theoreticians look at large molecules, rather than look at the fundamental interactions that govern the behavior of large molecules. I think that this is a
mistake. If I were the NIH I would not require theoreticians to handle big molecules. I would put money into studying water, things in water, things packed in crystals, from the bottom up, where you have good manipulative tools, and where the system computationally is not so overwhelming that you have to abandon rigor to do it. I would recognize that this is how, over the long-term, the chemistry of biological molecules will come to be understood, not by a large set of poorly rigorous and highly approximated simulations of biomolecules. Question: [inaudible] Response: Yes, if you want to design drugs, it is actually quite useful to have even approximate models. Obviously, the HIV protease inhibitors were designed based on a homology model for the protease. These models need not help you design, per se, but they do shorten the random walk, which combines random synthesis with design synthesis and provides focus. Models also provide motivation to the chemist doing the synthesis, because nothing encourages chemists more than thinking they are doing something rational, even if they are not. So even a wrong model is useful in the pharmaceutical industry, because it drives chemists in that direction. Question: Did you try SO2 analogues with template-directed synthesis?
Response: No, we didn't; that may be worth doing. At some point, we should ask whether the analogues would at least do something prebiotically; PNAs have been tried with template-directed synthesis, and there has been some progress from Leslie's laboratory on that. We have not tried that; it would be a good thing to try. Question: Just for the record, during the last CASP competition, how many structures did you get right and how many did you get wrong, and how did your results compare to the work of other groups that use more automatic methods, like neural networks? Response: We did not go to the last CASP competition. In CASP-1, there were two ab initio prediction targets: phospho-beta-galactosidase and synaptotagmin. For the first, we predicted an eight-fold alpha-beta barrel structure, which was correct. For the second, we presented three possible topologies for an all beta-barrel (out of ca. 200 possible); one was correct. In CASP-2, we made predictions for heat shock protein 90, ferrochelatase, NK lysin, calponin, and fibrinogen. Heat shock protein 90 was the most interesting case, since we predicted (correctly) that it was a distant
homologue of gyrase, a homology that the authors of the crystal structure did not see, and predicted an ATP binding-site (again correctly), despite the fact that experimental evidence had been presented that the protein did not bind ATP. The rest of the proteins had known functions, so our predictions did not add to the functional interpretation. For ferrochelatase, we correctly predicted all nine helices and six strands, but mistakenly assigned a short strand (3 residues) as a part of a longer helix. For NK lysin, we predicted an all-helical protein built from four helices, and these were all correct. Calponin was a largely disordered structure, but we did get both of the helices correct. Fibrinogen was a problematic prediction. The prediction correctly identified ten strands and the two long helices. However, it missed one helix and over-predicted two strands. Furthermore, in two regions, disulfide bonds created ambiguities in the secondary structure assignment. We doubt that this would have been an adequate starting point for building a tertiary structural model. Beyond CASP, the interesting metric right now is function prediction - not structure prediction - for homology detection. I think we have reached the limit of what we can do with prediction based on multiple sequence alignments. Today, the Master Catalog, a product marketed by EraGen, has predictions for every family of proteins in the global proteome. The quality of these predictions depends, of course, on the number of homologous sequences in the family; the more the better. Many of these may undoubtedly be assembled into ribbon structures, such as those Arthur Lesk described. For the next CASP, I will need to figure out just which families the CASP targets fall into. We will print out the prediction that we have and see how well it does.
CONFORMATION OF CHARGED POLYMERS: POLYELECTROLYTES AND POLYAMPHOLYTES
JEAN-FRANCOIS JOANNY
Physicochimie Curie, Institut Curie, Paris, France
First of all, I should warn you that I know almost no biology; I am something like a physical chemist or a polymer physicist. The problems with which I am familiar are much simpler than the ones about which you have been hearing during this conference. In my community the general point of view is that one should look for universal properties, which is somehow orthogonal to what people in biology do. This means studying properties for which you can ignore the specific chemistry as much as possible. What I want to talk about is how the existence of electrostatic charges influences polymer conformations. That is the main theme of my talk, and of course, it will be extremely general. But I will not tell you everything about this theme; first of all I would not be able to, and it would take an infinitely long time.
Flexible polyelectrolytes: blob model; annealed polyelectrolytes.
Rigid polyelectrolytes: persistence length; adsorption on a small sphere.
Polyelectrolytes in a poor solvent: Rayleigh instability; chain stretching; charge distribution.
Small ion condensation: Manning condensation; charge regulation by a surface.
Polyampholytes: chain conformation; adsorption on a charged surface.

Figure 1. Conformation of charged polymers: polyelectrolytes and polyampholytes.
I have chosen a few topics. Despite the fact that specific interactions are very important for all RNA and protein problems, I want to insist that polymer physics is also important. That is the other aspect I will discuss, which has almost not been mentioned so far in this conference. Let us consider the simplest case, that of a flexible polymer carrying very few charges and displaying no interactions other than electrostatics. That will be the first part of my talk, after which I will try to make the discussion more realistic. I will then introduce the concept of polymer rigidity, providing you with an example that we worked on recently, namely charged polymers interacting with oppositely charged spheres. Then I will introduce the concept of polymers in a poor solvent. Then, to make it fancy, I will introduce hydrophobic effects, albeit in a very poor man's way. I will talk about small ion condensation, which was mentioned during the first day of the meeting. In the last part of my talk, I will discuss what I call polyampholytes, which are polymers that, like proteins, have both positive and negative charges along the same chain. In order to do so, I will carry out a sort of review of the subject, mixing in extremely well-known things - sometimes from long before I was born - with recent work that we have done ourselves. I will try to make it as simple as possible. Although I am a theorist, I will start with just some hand-waving or scaling arguments. In some cases we have done more complex calculations, but sometimes we are unable to carry out sophisticated calculations, so the only thing we are left with is the hand-waving arguments. You have to take my word for it that we have done it more seriously than the way it is presented here.
N monomers of size a; fraction f of charged monomers; Gaussian radius R_0 = N^{1/2} a.
Electrostatic interaction: Bjerrum length l_B = q²/(4πεkT).
Quenched and annealed charges.
Screening length κ⁻¹ = (8π l_B n)^{-1/2}.

Figure 2. Weakly charged polyelectrolytes.
So, what is the simplest model you can devise for a polymer? First of all, forget about the fact that there are charges on the polymer. The simplest model is to assume that each monomer is a small rod and that the rods are freely jointed. Each rod is randomly oriented with respect to the previous one. If there are N of these rods, this is the trajectory of a random walk, with a size R_0 = N^{1/2} a. Maybe it is important to mention that when I model a polymer this way there is no energy in the problem; only entropy. So the only energy scale for this problem is kT. Whatever I do afterward, if I want to compare it to this random coil, I have to compare the energies to kT. The only other thing you need to know, and that you probably all do know, is that if I pull on both ends of the polymer with a force, the polymer reacts like a spring. The spring constant has an entropic origin, which has been known for forty years. I will call R the distance between the two end-points; the free energy is then an elastic energy. It is of entropic origin, therefore proportional to kT. Since this is a spring, it is proportional to R²; the spring constant is 3kT/(Na²). Since the energy is proportional to R², there is a restoring force proportional to R. This is my starting point, and what I want to do afterward is introduce charges. In a biological problem, the polymer could be DNA, so I put in negative charges. One of the points I want to make is that if you put charges on the polymers, you must add counterions to the solution. You can forget about the counterions for some problems. First I will consider electrostatic interactions. I want to see how polymer conformation is modified by electrostatic interactions. If I put negative charges on the polymer, I must put positive ions into the solution. If I have only one polymer chain, the counterions gain entropy by going to infinity, so I can forget about them.
But at finite concentrations, for some of the properties the polymer is not important, and the counterions dominate all properties. For instance, if you measure the osmotic pressure, you do not measure anything about the polymer; you just count the counterions. These are not the properties that I want to talk about, but you have to remember them. How do I introduce electrostatics? Essentially, what I want to know is how much two charges on this polymer interact. If it is in a vacuum or a dielectric medium, the Coulomb potential decays as 1/r. As I said, the energy scale is kT, so I use kT as a unit, and since this is an interaction, the remaining pre-factor is a length. People call this the Bjerrum length. It is proportional to the square of the charge divided by the dielectric constant of water. The solvent is water, and kT appears in the denominator, because I artificially introduced it in the numerator. This length measures the strength of the electrostatic interaction, and if I take two charges at a distance equal to the Bjerrum length their interaction is kT. You all know that in water, the electrostatic interaction is screened, and during most of this
talk I will use the Debye-Hückel potential, which means that at large distances the screening is exponential and the screening length depends only on the salt or small-ion density, n. The screening length decays as one over the square root of the salt density. If you want numbers: if the solution is 10⁻³ molar, the screening length is about 100 Ångströms. Something else I want to ask is "How do you get charges on the polymer?" There are essentially two ways: one is to take charged monomers and copolymerize them with uncharged monomers. The other way is to take a polyacid or a polybase and change the pH. Using the language of physicists, I call the first case quenched polyelectrolytes and the second one annealed polyelectrolytes. Physical chemists call quenched polymers strong polyelectrolytes and annealed polymers weak polyelectrolytes. During most of this talk, I will consider quenched polymers, just mentioning a few results concerning the annealed ones. That is my basic model.
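The orders of magnitude quoted here are easy to check numerically. The following sketch is my own, not from the talk; the room-temperature water parameters are assumptions:

```python
import math

e = 1.602176634e-19        # elementary charge, C
eps0 = 8.8541878128e-12    # vacuum permittivity, F/m
eps_r = 78.5               # relative permittivity of water near 25 C (assumed)
kT = 1.380649e-23 * 298.0  # thermal energy at 298 K, J

# Bjerrum length l_B = e^2 / (4 pi eps kT): the distance at which two
# unit charges interact with energy kT (about 7 Angstroms in water).
l_B = e**2 / (4.0 * math.pi * eps0 * eps_r * kT)

def debye_length(c_molar):
    """Screening length kappa^-1 = (8 pi l_B n)^(-1/2) for a 1:1 salt
    of molar concentration c_molar; n is the salt number density."""
    n = c_molar * 1000.0 * 6.02214076e23   # salt molecules per m^3
    return 1.0 / math.sqrt(8.0 * math.pi * l_B * n)

print(l_B * 1e9)                 # ~0.71 nm
print(debye_length(1e-3) * 1e9)  # ~10 nm, i.e. ~100 Angstroms at 1 mM
```

The 1/√n dependence means a hundredfold dilution of the salt stretches the screening length tenfold.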
De Gennes et al.: electrostatics as a perturbation; weakly charged chain, f small, Gaussian statistics.
Electrostatic energy F_el ~ kT l_B (fN)²/R_0, small if N < g ~ (a/(l_B f²))^{2/3}.
Chain of aligned blobs of g monomers and size ξ; radius R ~ Na (f² l_B/a)^{1/3}.

Figure 3. Electrostatic blob model.
The first question concerns the size of a polymer carrying a few charges along the chain. The most naive approach is to suppose that the number of electrostatic charges is infinitely small, and thus treat electrostatics as a perturbation. That is easy; the polymer essentially looks like a sphere. I can smooth out the charge density inside the sphere and calculate the electrostatic energy. It is proportional to the square of
the charge divided by the radius, and if the electrostatic charge is small, the radius does not change. You might worry about the pre-factors - which is something I did not mention. Since I will work with scaling arguments, most of the time I will ignore any pre-factors. You could calculate this using Gaussian statistics: if you put the charges on the outside of the sphere, the pre-factor is one-half; if you distribute the charges uniformly inside the sphere, it is three-fifths. It is somewhat different for a Gaussian chain, but you can still calculate it. That is the electrostatic energy, and what I want to do is compare it to kT. If it is smaller than kT, what I did is legitimate, and electrostatics does not count. The electrostatic energy is smaller than kT if the number of monomers is smaller than this number, g, shown in the figure. If this is true, the problem is solved, but this is not the interesting case. The case in which I am interested is the one in which N is large (N is much larger than the characteristic value) and electrostatics is important. In order to construct a picture of the chain that you should have, I will provide you with a kind of geometrical construction. This is done by chopping the polymer chain into pieces, each of which contains the characteristic number of monomers, g, within each piece. If I isolate one of these pieces, electrostatics is not so big. So I can say that for each of these pieces - and in the jargon they are called "blobs" - electrostatics does not count, and they are Gaussian chains. The size of the Gaussian chain is just the square root of the number of monomers. Of course if we consider larger distances, the blobs interact strongly due to electrostatics, and you can convince yourself by all kinds of methods that the chain is fully stretched and the blobs aligned. But then you can look at this picture and read everything. 
For instance, if you want to know the total size of the polymer, there are N/g blobs, each of size ξ; the size of this polymer is (N/g)ξ. This tells me that the size increases linearly with N, with which I think everybody agrees: if you charge a polymer, it gets elongated; its radius increases linearly with the molecular weight. But it is less than fully elongated; it is smaller, because it wiggles at short distances. Of course, this is very schematic, and I "cheated" you at several places. Maybe the worst cheat is that I isolated the subunit and considered this subunit to be interacting with its neighbor. This forgets that electrostatics is long-ranged. You can assume this structure and calculate the electrostatic interaction, which is dominated by interactions between charges that are far apart. This effect may be taken into account, slightly modifying the result. The size does not increase as N, but as N(log N)^{1/3}. However, whatever number you insert, (log N)^{1/3} is practically a constant. If you look at the blob picture, you have the impression that it is a cigar, like this, but of course, it fluctuates. There are transverse fluctuations of this chain. The most naive answer is that the blobs interact strongly, so if you construct the blob on a lattice, the following blob must be
put on a site to the right. However, in the transverse direction you can have a random walk. It is almost the right answer. If you do it more properly, you again find logarithmic corrections. This will be sufficient for the rest of the talk, but you must be aware that I ignored a few things when I did it.
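The blob parameters follow from a one-line energy balance. Here is the standard scaling estimate (my reconstruction; numerical prefactors and the logarithmic corrections are dropped):

```latex
\begin{align*}
  &\text{Gaussian blob of } g \text{ monomers:}\quad
  \xi \sim a\,g^{1/2},
  \qquad
  F_{\mathrm{el}}(g) \sim kT\,\frac{\ell_B (fg)^2}{\xi}\,;\\
  &F_{\mathrm{el}}(g) \sim kT
  \;\Longrightarrow\;
  g \sim \Bigl(\frac{a}{\ell_B f^2}\Bigr)^{2/3},
  \qquad
  \xi \sim a\,\Bigl(\frac{a}{\ell_B f^2}\Bigr)^{1/3};\\
  &R \sim \frac{N}{g}\,\xi \sim N a \Bigl(\frac{\ell_B f^2}{a}\Bigr)^{1/3}.
\end{align*}
```

Setting the electrostatic energy of one blob equal to kT fixes g; aligning the N/g blobs then gives the linear growth of R with N quoted above.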
Annealed and quenched polyelectrolytes (Castelnovo, J.-F. Joanny).
Quenched: charge distribution dictated by chemistry.
Annealed: polyacids or polybases; fluctuating number of charges, fluctuating charge positions.
Charge distribution on annealed polyelectrolytes: lower electrostatic potential at the ends, hence higher charge density at the ends:
f(x)/f̄ = 1 − (l_B f̄ a/L)[log(1 − (2x/L)²) + 2(1 − log 2)].
Figure 4. Annealed polyelectrolytes.

Before I go beyond this simple picture I want to come back to the problem of quenched and annealed polyelectrolytes. What I did implicitly in the previous picture was spread the charge out along the chain; I decided that the chain had a uniform charge density. Maybe it does not matter whether it is annealed or quenched. As I said, if you make the chain by copolymerizing neutral and charged monomers, each chain has a given number of charged monomers. Once the chain is made you do not change it, and the positions of the charged monomers are fixed. If you use the other procedure to build annealed polyelectrolytes, which is what I have sketched here, all the monomers are identical. If it is an acid, they have a COOH group. When the pH is changed, in some places the H⁺ goes away, leaving a residual charge. The question is, what is the pH? You can convince yourself that the pH is the chemical potential of the charges; or, if you like, in more of a physicist's language, it is the field conjugate to the number of charges. What you maintain constant is not the number of charges, but the chemical potential. The number of
charges will fluctuate between chains. Also, the position of the charges can fluctuate on one chain, because a given H⁺ can recombine and another one can go away. You may wonder whether the charges distribute uniformly. If you go back to my previous picture and calculate the electrostatic potential, the answer is somehow obvious: the average electrostatic potential is stronger at the center of the chain than at the edge, since in the center it is created by the two halves, whereas at the edge one side is missing. Obviously, the H⁺ ions are attracted more strongly at the center of the chain than at the edge, so we would expect the charge density to be higher at the edges. This effect can be calculated; I put the formula on the slide. We have checked this formula against numerical simulations. I do not know of any experiments that measure charge distribution, and I also do not know whether this is an important effect. The only thing I can imagine is that if you adsorb a chain onto a surface, maybe it adsorbs at the end, because the end carries a higher charge, but I do not know of any experiment that proves this. So that is my basic model, to which I want to add a few features.
Bond-angle correlations (orientation memory): ⟨cos θ(s)⟩ = exp(−s/l_p).
Mechanical definition: bending energy F_B = (1/2) kT l_p ∫ ds ρ(s)², where ρ is the local curvature.
End-to-end distance: ⟨R²⟩ = 2Ll_p[1 − (l_p/L)(1 − e^{−L/l_p})]; ⟨R²⟩ → L² for L ≪ l_p and ⟨R²⟩ → 2Ll_p for L ≫ l_p.

Figure 5. Semi-flexible polyelectrolytes: persistence length.
The first thing I want to add to this simple model is the stiffness of the chain. This is a very short summary of the properties of semi-flexible chains. The classical
way to characterize semi-flexible chains is by means of a persistence length. There are, at least for me, three ways to define persistence length. The first definition may be used not only for polymers but also for any rod-like object, such as this stick, a wire, or a thread. You fix the orientation at one point, walk a distance s along the object, and then measure the orientation at s. You define it by the cosine of the angle between the orientation at s and at the origin. If all interactions are short-range, the cosine decays exponentially with s and the decay length is equal to the persistence length. The information it provides is how far away on the chain you have to walk to lose the memory of the orientation. There is another definition, which is a mechanical one: suppose I bend a rod-like object; how does the bending energy vary? The bending energy must depend on the local curvature, and if you change the sign of the curvature, the energy does not change; it increases as the square of the curvature. If the interaction is local, the energy must be proportional to the length of the object. I then carry out dimensional analysis, scaling out kT. The remaining factor is the persistence length. If you read Landau and Lifshitz's book right to the last page, you will see that they show that this length exactly corresponds to the previous definition. The persistence length can be almost anything; for flexible polymers it is a few Ångströms; for DNA it is five hundred Å. For actin filaments, it is microns - you could even calculate it for a strand of spaghetti! If you were to do continuum elasticity theory for a strand of spaghetti, you would find thousands of kilometers. That tells you that thermal fluctuations do not matter for spaghetti, which I guess you knew. The other way to define the persistence length is to look at the end-to-end distance of the chain.
Of course, there are two limits: if the chain is short, it is smaller than the persistence length, so it behaves like a rod. If the chain is large, it is Gaussian again, which means that the square of the radius is proportional to the contour length. So, the picture you may now have of this chain is that it is some sort of freely jointed chain whose step-length is of the order of the persistence length. People call this step-length the Kuhn length, which is twice the persistence length. The question for a charged polymer then concerns how the stiffness, or the persistence length varies with the electrostatic interactions.
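Both limits are contained in the standard worm-like-chain formula quoted on the slide, ⟨R²⟩ = 2l_pL − 2l_p²(1 − e^{−L/l_p}). A quick numerical check (my own sketch; the numbers are illustrative):

```python
import math

def r2_wlc(L, lp):
    """Mean-square end-to-end distance of a worm-like chain of contour
    length L and persistence length lp (same length units)."""
    return 2.0 * lp * L - 2.0 * lp**2 * (1.0 - math.exp(-L / lp))

lp = 50.0  # persistence length in nm; ~50 nm (five hundred A) for DNA

# Short chain, L << lp: rod-like behavior, <R^2> -> L^2
print(r2_wlc(0.1, lp) / 0.1**2)          # close to 1
# Long chain, L >> lp: Gaussian with Kuhn length 2*lp, <R^2> -> 2*lp*L
print(r2_wlc(1e5, lp) / (2 * lp * 1e5))  # close to 1
```

The crossover between the rod and Gaussian regimes happens, as expected, around L ≈ l_p.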
(Odijk; Skolnick and Fixman) Screened electrostatic interactions, τ charges per unit length; bending increases the electrostatic interactions.
Electrostatic persistence length l_e, larger than the screening length 1/κ; decreases as 1/n.
Total persistence length l_p = l_0 + l_e.
Limitations: perturbative calculation; ignores thermal fluctuations.

Figure 6. Electrostatic persistence length.

This time I will consider the electrostatic interaction to be screened, because if I have a rod, there is no problem associated with the chain conformation. Electrostatic interactions increase the persistence length because, the charge positions along the object being fixed, bending a rod-like object brings the charges closer and increases the electrostatic interaction. Obviously, it depends on the curvature, and again, it is proportional to the square of the curvature. Thus it is possible to characterize screened electrostatic interactions by a persistence length. If you wish, you may obtain the result by dimensional analysis. The precise calculation was done at around the same time in the Netherlands by Odijk and in the US by Skolnick and Fixman. The persistence length is proportional to the Bjerrum length and to the square of the charge density along the chain, since it varies linearly with the electrostatic interaction. It is also proportional to 1/κ². Remember that κ⁻¹ is the screening length for electrostatics. The electrostatic persistence length thus decays with the salt density as 1/n, whereas κ⁻¹ decays as 1/√n. So the persistence length is different from the screening length, and in many cases it is larger. One thus defines an electrostatic persistence length that is larger than the screening length. Of course, it adds up to the bare persistence length. If the chain has stiffness in the absence of
electrostatics, the total stiffness is the bare stiffness plus the electrostatic stiffness. Although this calculation works pretty well for stiff polymers, it is not so well justified. If you think about it, it is a kind of perturbative calculation. I assumed that the chain was a rod, and then bent it, weakly. Implicitly, I assume the electrostatic contribution to the stiffness to be much smaller than the bare one. The other thing that is not justified is that this calculation ignores thermal fluctuations; it is a purely mechanical calculation. I just take a rod-like object, bend it, and ask how the energy changes. You can worry about that. In fact, Odijk realized very early that despite the fact that the calculation was perturbative, the range of validity was much larger.
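To see how strong the effect is, one can put numbers into the Odijk-Skolnick-Fixman result, l_OSF = l_B τ²/(4κ²), with κ² = 8π l_B n for a monovalent salt. The sketch below is my own; the choice τ = 1/l_B (the effective line charge of Manning-condensed DNA) is an illustrative assumption, not a value from the talk:

```python
import math

l_B = 0.71  # Bjerrum length in water, nm

def kappa_sq(c_molar):
    """kappa^2 in nm^-2 for a 1:1 salt at molar concentration c_molar."""
    n = c_molar * 0.602214  # salt number density in nm^-3 (1 M ~ 0.602 nm^-3)
    return 8.0 * math.pi * l_B * n

def l_osf(c_molar, tau):
    """OSF electrostatic persistence length l_B * tau^2 / (4 * kappa^2), nm."""
    return l_B * tau**2 / (4.0 * kappa_sq(c_molar))

tau = 1.0 / l_B  # charges per nm (assumed, Manning-condensed DNA)
print(l_osf(1e-3, tau))  # ~33 nm at 1 mM salt
print(l_osf(1e-1, tau))  # a hundred times smaller at 0.1 M: l_OSF ~ 1/n
```

The 1/n scaling is exactly the statement above: the electrostatic stiffening dies off much faster with added salt than the screening length itself.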
Thermal fluctuations (Barrat, J.-F. Joanny): small undulations around a rod; wave-vector-dependent persistence length.
l_p(q) = l_0 for q⁻¹ < s_c; l_p(q) increases at longer wavelengths [remaining formulas unreadable].
Crossover length s_c; rigid polyelectrolyte: s_c < l_0.

Figure 7. Wave-vector-dependent persistence length.
How did we approach this problem? We considered a chain that was almost a rod, and added an undulation of wave vector q on top of it. Instead of asking about the global stiffness of the chain, we calculated the persistence length as a function of the wave vector, or if you like, the bending energy as a function of the applied wavelength λ = 2π/q. Some limiting cases are very obvious. Suppose q is very large; the wavelength is then very small. It is like taking a very small piece of chain; the electrostatics is then very weak. So at small distances, or with large wave vectors, electrostatics does not count and the persistence length is the bare persistence length. But if you go to the other limit, at very large distances you are back to the previous Odijk calculation. So somewhere
in the middle there is a transition between the two behaviors. Also, there is a range of intermediate distances where the persistence length interpolates between the two, varying with the wave vector rather than taking the constant value predicted by the Odijk theory. To be more quantitative, I take the chain and sum thermal bending fluctuations. The question I can ask is: "Suppose I fix the orientation at the origin; then what is the mean square angle of the orientation due to thermal fluctuations at a distance s?" I have to sum all these fluctuations at different wave vectors with the relevant stiffness, and I get the curve plotted on this slide.
Complexation between a positively charged sphere and a negatively charged polymer (Netz, J.-F. Joanny): competition between bending and electrostatic interactions. Many parameters: Z, D, ... [slide details unreadable].
Figure 8. Interaction between a semi-flexible polyelectrolyte and a small sphere.

At small distances s, the persistence length is given by the inverse slope of the mean square angle as a function of s. That is the bare persistence length. At large distances, we get the Odijk result, and in between there is a crossover. A new length thus comes into the problem, for which I have no good name, so I call it the "crossover length"; it depends both on the bare stiffness and on the charge on the chain. Then one has to worry about the validity of this approach; when is it consistent? The answer is: as long as the angle remains small, since I started with a rod and want to apply small perturbations. This condition tells you that the crossover length should be smaller than the bare persistence length. This is true in two cases: when the chain is stiff, and when it is highly charged. If you put in numbers, for chains like DNA there is no problem. The Odijk theory is almost perfect. There is a small discrepancy, but it does not matter much.
We now discuss the interaction between a semi-flexible polyelectrolyte and a small sphere of opposite charge. The naive model we wanted to make is not quite right for nucleosomes, which are objects that are too complicated. So I will discuss a positively charged sphere interacting with a rigid, negatively charged polymer. How does the polymer wrap around the sphere, which is small? Consider only a sphere smaller than the persistence length of the polymer. If the sphere is not charged, the chain is not deformed, and since I am looking at very small sections of polymer, it is almost like a rod. If I increase the sphere charge, the chain begins to bend. At low charges, the curvature radius at the contact point is very large. Eventually, at a higher charge, the polymer starts to follow the sphere, at which point it is almost a two-dimensional object, at least at the outset. You can ask whether it bends toward or away from the sphere. Indeed, that depends on the screening. If the sphere is larger than the screening length, the two arms of the polymer do not interact and the chain bends towards the sphere. If the screening is weak, the two arms interact strongly, so they want to be mostly flat and bend away from the sphere. At some sphere charge, the polymer starts to wrap around the sphere, making several turns. We want to keep track of all the parameters: the charge on the sphere, the size of the sphere, the ionic strength, the charge on the polymer, and the persistence length of the polymer. You can eliminate one of the parameters by making a dimensionless number, which I have put on the slide. You can understand the origin of this number: it is the ratio of the bare persistence length to the electrostatic persistence length at a length-scale that is the size of the sphere.
Small bending rigidity: l_0/(l_B τ² D²) < 1.
Continuous and discontinuous transitions; numerical phase diagram for finite chains (point contact, touching, wrapped); structure of the wrapped phase.

Figure 9. Complexation diagram.
We then minimize the energy, obtaining something like a phase-diagram, as shown in the figure. There are three phases. In the lower region, the gain in energy is smaller than kT, and there is no complex formation. If you approach the complexation line at low ionic strength, the polymer binds to the sphere, but it is only weakly deformed; this is what we call the "touching phase." Upon further charge increase, there is a phase at which it is entirely wrapped. The interesting thing is how the transition occurs. If you approach the complexation line on the left side, everything is continuous; if you approach it at high ionic strength, wrapping is discontinuous. This means that you have no complex, and suddenly the polymer wants to wrap around the sphere.
(Dobrynin, Rubinstein, Obukhov) Neutral polymer in a poor solvent: globule.
Negative excluded volume v = −τa³; constant density c ~ τ/a³; radius R ~ a(N/τ)^{1/3}.
Equivalent to an oil droplet in water: surface tension γ ~ kT τ²/a².

Figure 10. Polyelectrolyte in a poor solvent: Rayleigh instability.
That was my first point. The second point concerns what happens with a polyelectrolyte in a poor solvent. On top of the electrostatic interactions, I want to add short-range attractive interactions, which in my mind describe the hydrophobic interactions. That is a naive but simple way to treat hydrophobic interactions. Before considering polyelectrolytes, I shall tell you what we know about polymers in poor solvents. We start from a Gaussian chain and turn on attractive interactions. If I am very crude, and if the attractive interactions are very strong, they induce a collapse of the chain onto itself, forming a small globule. If the chain makes a small globule, the density inside the globule is constant, which tells you that the radius
grows as N^{1/3}, because the volume is proportional to the number of monomers. In the polymer community, the standard way to characterize this attraction is a negative excluded volume, or a second virial coefficient for the attractive interactions. This has the dimensions of a volume, so it is more or less proportional to the cube of the size of the monomer. It is negative, and there is also a dimensionless factor, τ, which is the variable I will use. When τ is zero, there is no attraction; that is called the θ point. For many purposes, one considers that τ varies linearly with temperature and vanishes at the θ point. If τ is very large, the polymer collapses as mentioned above, forming a constant-density sphere. If τ is finite, it reaches a state of finite density within the sphere. However, the density depends on the attraction; in fact, it is simply proportional to τ. Again, this means that since, roughly speaking, the density is the number of monomers divided by the volume, the radius varies as a(N/τ)^{1/3}. If τ decreases, the size increases; if τ increases, the size decreases. Of course, this formula is not always valid; if τ is too small, the radius cannot be larger than the radius of the Gaussian chain in the absence of attractions. The only thing to remember is that the chain forms a dense polymer sphere, which is more or less like a drop of oil in water. Indeed, for most of its properties, it behaves like a drop of oil in water. What I mean is that the connectivity of the molecule is not so important at this point.
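The globule scalings just quoted follow from mass conservation alone; spelling out the argument (my sketch, prefactors omitted, symbols as above):

```latex
\begin{align*}
  c \sim \frac{N}{R^3} \sim \frac{\tau}{a^3}
  \quad\Longrightarrow\quad
  R \sim a\left(\frac{N}{\tau}\right)^{1/3},
  \qquad
  \gamma \sim \frac{kT\,\tau^2}{a^2},
  \qquad
  F_{\mathrm{surface}} \sim \gamma R^2 .
\end{align*}
```

The result holds as long as R stays below the Gaussian radius N^{1/2}a, i.e. for τ larger than about N^{−1/2}.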
A charged droplet splits into two smaller droplets if F_el > F_surface.
Pearl-necklace conformation: pearl size, strand size, and polymer size as functions of N, f, and τ [formulas unreadable].

Figure 11. Rayleigh instability.
The next question concerns the energy of the collapsed chain. Again, it is like a drop of oil in water, and the energy of a drop of oil in water is just surface tension. So if you calculate the energy of a collapsed chain, the surface tension is by far the dominant contribution. This is due to the fact that the monomers, which are on the surface, are not surrounded by other monomers that they "like"; they are surrounded by solvent. The surface tension may be calculated, but it is not so important. That is the state from which I want to start. Then I add charges onto the polymer, little by little. Here is a nice result from a very long time ago, of which I was not very aware: Rayleigh's study on oil droplets. It treats what happens if you charge oil droplets. Rayleigh tells us that when you start with an oil droplet in water and add charges, if the charge is weak enough, it does not matter. But at some point, the charge is large enough and the droplet breaks into two. The two daughter droplets then separate. If you continue charging it, each smaller drop breaks again in two, and so on. So the electrostatics breaks big drops into smaller ones. When does this happen? It happens when the electrostatic energy on a given sphere is larger than the surface energy. This criterion is satisfied for a given charge known as the Rayleigh charge. I will now use this idea for polyelectrolytes and consider the consequences. I start with a collapsed polymer and add charges. The polymer globule wants to split into two smaller globules, and of course, since it is a polymer, the two smaller globules remain connected. The way the system accommodates this is by making two dense globules and a stretched strand between the two. The strand has to be stretched, because the charges are mostly localized on the dense globules, and stretching the strand minimizes the electrostatic interaction. Upon further charge increase, each globule splits in two, and so forth. 
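In the units of this talk, the Rayleigh criterion is a one-line comparison (a sketch, prefactors omitted): a droplet of radius R carrying q elementary charges becomes unstable when its electrostatic energy exceeds its surface energy,

```latex
\begin{align*}
  F_{\mathrm{el}} \sim kT\,\frac{\ell_B\, q^2}{R}
  \;\gtrsim\;
  F_{\mathrm{surface}} \sim \gamma R^2
  \quad\Longrightarrow\quad
  q^2 \gtrsim q_R^2 \sim \frac{\gamma R^3}{kT\,\ell_B}\,.
\end{align*}
```

Each splitting reduces the droplet volume, so the cascade continues until the daughter droplets carry less than their own Rayleigh charge.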
The final picture is sketched on the slide. Dense regions, which are the droplets that have exploded due to the Rayleigh instability, are separated by stretched strands. We want to calculate the molecular parameters of this structure, often called a "pearl-necklace" structure. The pearl size is the same electrostatic blob size that I had at the beginning. However, the internal structure is not the same; the pearls are dense, and the internal density is that of a collapsed globule. The important parameter is the distance between pearls. Since two neighboring pearls are charged, they repel each other. Thus, there is a force pulling them apart. If this force is too strong, the pearls unwind. You may ask what force would be required to unwind the pearls. It is difficult to calculate such a force explicitly, but, simply using dimensional analysis, we know it cannot be anything but kT over the molecular length. This molecular length varies as 1/τ. Since it is known, the critical force fixes the distance between pearls. The size of the chain is obtained by looking
at the figure as before. Again, the polymer size increases linearly with the molecular weight, and of course, the polymer size decreases if you decrease the solvent quality. If τ is larger, the polymer size decreases. You may wonder whether this is purely a theorist's game, or whether it may be observed experimentally. In response, two things may be said: scattering experiments have been carried out in more concentrated solutions, but locally you would expect to observe the same pearl-necklace structure. But the way neutron-scattering experiments are interpreted is by measuring an intensity, then supposing you know the structure. The intensity is calculated for the supposed structure and compared with the experimental data. The only thing you can say is that the pearl-necklace model is not too bad. But you plug that structure in by hand. The model was also compared to numerical simulations on one chain, and that structure is definitely observed. We think that another way to test this would be to pull on both ends of the chain. This experiment has not yet been done, but we have tried to calculate what happens. Consider the pearl-necklace structure. Apply a force at the chain ends and ask what the relationship is between the force and the distance. In order to find out, you have to write the full free energy of this structure, add the work of the force, and minimize with respect to all the parameters. In fact, the result is simple: the pearl-necklace structure almost does not change. The only thing that changes is the distance between pearls, or the number of pearls on the chain. It changes because the force that now pulls the pearls apart is not only due to electrostatics, but to electrostatics plus the applied force. If you say that this total force should be a constant, when you increase the external force, the strand length between pearls should increase. When the external force increases, the pearls unwind, one by one.
This argument yields unusual elasticity laws; the length increases as one over the square root of the force minus the critical force. That would be the signature of this structure. This model was extremely rough and ignored many subtleties. The first thing I forgot is that the pearls are discrete. If there are many pearls, this is not very important, because of thermal fluctuations. If there are few pearls, of course you have to worry about the fact that it is a finite object. For instance, you can easily convince yourself that there are always pearls at the chain ends, because this minimizes the electrostatic energy, and these pearls are slightly larger. There are three pearls in the example shown on this slide. Upon pulling, the central pearl becomes weaker and will be destroyed first. In extension experiments, you would expect to find the force-extension curve sketched on the figure. At first, the length increases with the force. When the central pearl is destroyed, it disappears at once, and there is a jump in length. Upon further pulling, another pearl is destroyed. If there are three or four pearls on the chain, there is a succession of jumps. If there
Conformation of Charged Polymers: Polyelectrolytes and Polyampholytes
are more pearls, the jumps are washed out by thermal fluctuations. Again, as I said, one must be extremely careful; it looks like a phase transition, but it is a one-dimensional system.
Figure 12. Discrete pearls: discontinuous jumps in the number of pearls, important for small numbers of pearls; the larger pearls at the ends are more stable; analogous to a phase transition, with rounding by fluctuations.
That is all I wanted to say about poor solvents. The third problem I want to touch on is counterion condensation. So far I have completely ignored the counterions, and as I said at the beginning of my talk, this is not legitimate if the polymer is highly charged. It is well known that there exists a phenomenon known as counterion condensation, first described by Manning. If you search the literature more thoroughly, you will find an old paper by Fuoss, Katchalsky, and Lifson, in which I believe this phenomenon was introduced. They solved the Poisson-Boltzmann equation in a cell model, finding two states: a condensed and a non-condensed counterion state. The idea is the following, and I will illustrate it with a rod, because it is easier to understand:
Figure 13. Counterion condensation. Stretched conformation, f lB/a ~ 1; electrostatic potential φ = 2 τ lB log(r/a), with line-charge density τ = f/a; a counterion gains kT τ lB from electrostatics but loses kT of translational entropy; condensation threshold τc lB = 1, i.e., fc = a/(lB z); effective charge τeff lB = 1/z, where z is the counterion valency; the effect is stronger with multivalent ions.
The parameter for a rod is the charge density along its length, which I call τ. If you ignore the small ions, the rod creates a two-dimensional electrostatic potential that varies as τ lB log r, where r is the distance to the rod. What this tells you is that if you bring a counterion to the surface of the rod - I assume that this log is not too different from a constant - it gains an energy of the order of τ lB. Of course, if you freeze a counterion on the rod, it loses entropy. The energy gain must be compared with the entropy loss. If the energy is larger, the counterions tend to go back onto the rod, but when they do that, they decrease the charge on the rod. The rod has an effective charge, which becomes lower and lower, and at some point the counterions stop condensing. That is the phenomenon known as counterion condensation, described by Manning. If you do this calculation carefully, you obtain a threshold value. The threshold is when the distance between charges along the rod is exactly equal to the Bjerrum length, so for monovalent counterions, τ lB = 1. If you take multivalent counterions - and this is the formula that I wrote - the counterions condense more, and this threshold is τ lB = 1/z. The counterions then accumulate until they create an effective charge that attains this threshold. The effective charge is such that τ lB is equal to 1. In a sense, this means that it does not help to increase the charge on a rod, because if you increase it too much, the counterions condense
back and neutralize the rod charge. With this very naive approach, the effective charge obtained is such that τeff lB equals 1.
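The naive saturation rule just described fits in a few lines of code. This is only a sketch: the function name is mine, the Bjerrum length of 7 Å and the DNA value τ lB ≈ 4 are the numbers quoted in the talk, and what is implemented is the simple threshold τeff lB = 1/z, not a full Poisson-Boltzmann calculation.

```python
def manning_effective_charge(tau, l_bjerrum=7.0, z=1):
    """Naive Manning saturation rule from the text: if the bare line-charge
    density tau (charges per Angstrom) exceeds tau_c = 1/(l_B * z), counterions
    condense until the effective density saturates at that threshold."""
    tau_c = 1.0 / (l_bjerrum * z)
    return min(tau, tau_c)

# DNA numbers quoted in the text: tau * l_B is about 4 (one phosphate per ~1.7 A)
tau_dna = 4.0 / 7.0
eff_mono = manning_effective_charge(tau_dna)        # saturates: tau_eff * l_B = 1
eff_di = manning_effective_charge(tau_dna, z=2)     # multivalent ions condense more
```

Below the threshold the rod keeps its bare charge; above it, increasing the bare charge further is useless, which is exactly the point made in the text.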
Figure 14. Charge regulation: rod attraction to a surface (ε = 80); direct electrostatic attraction, self-energy, image-charge repulsion; counterion condensation gives an effective charge τeff = βτ.

Along these lines, I have an example of a problem that we looked at; something we call charge regulation. Consider a highly charged polymer, something like DNA. I can state numbers for DNA: τ lB is 4. What I mean by τ is the nominal charge - so I count the phosphate groups. I put it close to a charged surface of opposite sign. If it is very far from and parallel to the surface, there is counterion condensation on the rod. Then I bring the rod closer and closer to the surface. The question I ask is, "What happens to the condensed counterions?" The first idea you might have is that close to the surface, counterions are not needed to neutralize the rod, due to the presence of the surface, so they can be released. This is the kind of calculation that we did, using Poisson-Boltzmann theory. One needs to calculate the energy of the rod and minimize it with respect to the number of condensed counterions. There are three terms in the energy: 1) the direct electrostatic attraction of the rod to the surface; 2) the self-energy, because the screening changes when the rod gets close to the surface; and 3), if there is a dielectric discontinuity, an image-charge effect. We calculated the total energy, then minimized it with respect to the effective charge on the rod. I write the effective charge of the rod as βτ, where τ is the
nominal charge. If β equals 1, there are no condensed counterions. If β equals 0, the effective charge is zero and all the counterions are condensed.
Figure 15. Charged planar surfaces. The Poisson-Boltzmann equation for the reduced potential φ(z) involves two length scales: the Debye length κ⁻¹ and the Gouy-Chapman length λ = 1/(2π lB σ). Debye-Hückel regime (κ⁻¹ < λ): φ = 4π lB σ κ⁻¹ e^(−κz). Gouy-Chapman regime (λ < κ⁻¹): φ = −2 log(κ(z + λ)/2), E = 2/(z + λ).

This was done within the mean-field framework, which is the Poisson-Boltzmann equation. If you immerse a surface in an electrolyte solution, it gives you the potential close to the surface. The result has already been shown during this conference. The only point I want to make about the Poisson-Boltzmann equation is that there are two length scales: One is the Debye screening length, which, if you have free electrolyte, is given by the formulas that I showed you, varying as 1 over the square root of the electrolyte density. The other one does not appear here; it is hidden in the boundary conditions. It is fixed by the surface charge. You can easily convince yourself that with the surface charge, you can make a length that I call the Gouy-Chapman length. Gouy and Chapman were the first people to solve this problem in one-dimensional geometry. It can be almost anything; the Bjerrum length is 7 Å, but the Gouy-Chapman length depends on how many charges are on the surface. If the charges are very dilute, it is large. For a membrane, where all the lipids bear a charge, it is of the order of 5-10 Å. The way the Poisson-Boltzmann equation is solved depends on the relative values of these two lengths. If there is a lot of salt, κ⁻¹ is smaller than λ, and you can linearize the equation. You obtain the Debye-Hückel solution; the potential decays exponentially from the surface. In the reverse limit, where λ is
smaller than κ⁻¹ - if there is a high charge or a low salt density - the counterions dominate the screening, not the salt. In such a case there is screening, but it is not exponential. If you calculate the density of ions, λ is somehow the distance within which half the counterions are condensed on the surface. The screening is weaker than the standard exponential screening.
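The two length scales can be made concrete with a quick numeric sketch. The function names and unit choices are mine; the Bjerrum length of 7 Å is the value quoted above, and the standard formulas κ² = 8π lB c for a 1:1 salt and λ = 1/(2π lB σ) are assumed.

```python
import math

L_BJERRUM = 7.0  # Angstrom, water at room temperature (value quoted in the text)

def debye_length(c_salt):
    """Debye screening length 1/kappa for a 1:1 electrolyte;
    c_salt is in ions per cubic Angstrom, kappa^2 = 8*pi*l_B*c_salt."""
    return 1.0 / math.sqrt(8.0 * math.pi * L_BJERRUM * c_salt)

def gouy_chapman_length(sigma):
    """Gouy-Chapman length lambda = 1/(2*pi*l_B*sigma);
    sigma is in elementary charges per square Angstrom."""
    return 1.0 / (2.0 * math.pi * L_BJERRUM * sigma)

# 100 mM monovalent salt is about 6.0e-5 ions per cubic Angstrom
kappa_inv = debye_length(6.0e-5)          # comes out near 10 Angstrom
# one surface charge per 100 square Angstrom
lam = gouy_chapman_length(1.0 / 100.0)    # a few Angstrom
# linearizable (Debye-Huckel) when 1/kappa < lambda; Gouy-Chapman otherwise
regime = "Debye-Huckel" if kappa_inv < lam else "Gouy-Chapman"
```

For these particular numbers the Debye length exceeds the Gouy-Chapman length, so this surface sits in the nonlinear Gouy-Chapman regime described in the text.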
Figure 16. Oosawa theory of condensation. Two-state model: chemical-potential balance between free and bound counterions; example with DNA numbers; counterion release is more important than the electrostatic interactions.
We calculated all the energies for the case in which the screening length is very large. We used the very large value 1000 Å, which means that there are almost no free ions in the solution. We calculated β, which sets the fraction of condensed counterions, as a function of the distance to the surface. There is a curve in Figure 16 for each surface charge. I start with the lowest surface charge. As the rod approaches, β increases; the effective charge increases, which means that counterions are released. However, for this surface charge, β does not go to 1; some of the counterions remain condensed, and at short distances β decreases. This is the image-charge effect, which can be traced to the energy. The important point is that the maximum value is not 1. Counterions are released, but a very high charge is necessary in order to release them completely. This calculation contains another piece of information, which is that when the counterions are released, they gain entropy. There are two driving forces to attract a polymer to the surface: one is a direct electrostatic
interaction, and the other is the entropy of counterion release. The dominant contribution is the entropy of counterion release. In all cases, it is larger than the direct electrostatic interaction between the polymer and the surface.
Figure 17. Polyampholyte chains. Positive and negative monomers on the same chain: N monomers, a fraction f+ positive and f- negative; overall charge δf = f+ - f-; total fraction of charges f = f+ + f-. Charge distributions: random (charge fluctuation δf ~ (f/N)^(1/2)), alternating, diblock. Annealed and quenched polyampholytes.
That is all I wanted to say about counterion condensation. Now I want to go to the last part of the talk, which deals with polyampholyte chains. Let me remind you what a polyampholyte chain is: It is a polymer chain along which positive and negative ions are distributed. For this problem, two variables are used to characterize the ions: the fraction of monomers bearing a positive ion, and the fraction of monomers bearing a negative ion. The sum of these two is f, and δf is the difference between them. δf is the fraction of charged monomers used earlier for polyelectrolytes. It measures the effective charge on the chain, and f measures the total number of charges. There is one additional variable, compared with the polyelectrolyte problem. Again, polyampholytes can be annealed and quenched polymers, which also means that in the annealed case, they are made of a polyacid and a polybase. Upon tuning the pH of the solution, the relative fraction of both types of charges changes. The experiments that were carried out in Strasbourg were done with quenched polymers containing three monomers. One has a positive charge, one has a negative charge, and one is neutral. These monomers are copolymerized. There is an additional parameter that was not discussed before: how
the charges are distributed. In fact, there are three main ways they may be distributed. First, they can be thrown in at random; we know how to do that. The charges can be alternating: plus-minus-plus-minus, which is easy to do. If the polymer is made in solution, when it has a positive end it attracts negative monomers, and there is a tendency toward alternation. The polymers can also be blocky, so that they have a positive part and a negative part. What I want to show you later is that it is not the same thing to have a random and an alternating polymer. An additional word on randomness: What does random mean? A really neutral chain has an equal number of positive and negative charges. However, that is not what chemistry does. What chemistry does is attach positive and negative charges randomly, so the chains are neutral on average, but each has a finite charge, either positive or negative. With Gaussian statistics, the effective charge fraction, δf, of a chain is of the order (f/N)^(1/2). The naive idea is that for polyelectrolytes, the charges repel each other and the polymer stretches, and for polyampholytes, the attractions between positive and negative charges dominate, so the polymer collapses onto itself.
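The (f/N)^(1/2) charge fluctuation just described is easy to check with a toy simulation. This is a sketch under my own conventions: each monomer is charged with probability f, and a charged monomer is + or - with equal probability.

```python
import random

def net_charge_fraction(n_monomers, f_charged, rng):
    """One chain with randomly attached charges: each monomer is charged
    with probability f_charged, and a charged monomer is + or - with equal
    probability.  Returns the net charge fraction delta_f = Q/N."""
    q = 0
    for _ in range(n_monomers):
        if rng.random() < f_charged:
            q += rng.choice((-1, 1))
    return q / n_monomers

def rms_charge_fraction(n_chains, n_monomers, f_charged, seed=0):
    """Root-mean-square of delta_f over an ensemble of chains."""
    rng = random.Random(seed)
    s = sum(net_charge_fraction(n_monomers, f_charged, rng) ** 2
            for _ in range(n_chains))
    return (s / n_chains) ** 0.5

# Gaussian statistics predict delta_f of order (f/N)^(1/2)
N, f = 1000, 0.2
measured = rms_charge_fraction(2000, N, f)
predicted = (f / N) ** 0.5
```

The measured root-mean-square net charge agrees with (f/N)^(1/2) to within statistical error: "neutral on average" chains are individually charged, as the text stresses.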
Figure 18. Polarization effects in electrolytes: accumulation of charges of opposite sign around a given charge; one charge in a sphere of size κ⁻¹; the polarization free energy is attractive, kT per screening length: F/V = -kT κ³/(12π).

In order to describe this effect, I will go back to the well-known polarization effect in electrolyte solutions. Consider an electrolyte - NaCl in water - and ask what the electrostatic energy of the solution is. The simplest idea is to use mean-field theory, but in mean-field theory you calculate the average charge density. Nature is neutral; there are as many positive as negative charges, and the charge density vanishes, so there is no contribution to electrostatics. One has to go one step further and consider correlations. This is what is called the polarization effect. Consider a positive charge at the origin. It attracts negative charges, which means that there are more negative than positive charges around this positive charge. The electric field is screened over the length κ⁻¹; thus, according to Gauss's theorem, even if there is a positive charge in the center, at a distance of κ⁻¹ the central charge is compensated. The average distribution of positive and negative charges is such that there is one negative charge in a sphere of size κ⁻¹. Then there is a contribution to the free energy by the central positive ion, since around this positive ion there are more negative ions, which it attracts. That is the polarization energy. It may be calculated by dimensional arguments. It is a fluctuation effect, thus proportional to kT, and is a free energy per unit volume. The only length in the problem is κ⁻¹; the polarization energy per unit volume is kT κ³, and it is negative, because of the attraction between positive and negative charges. The numerical prefactor is found either by direct calculation or by checking in Landau and Lifshitz's book; its value is 1/(12π).
Figure 19. Polyampholyte effect. Polarization energy of a neutral polyampholyte (Higgs, JF): Gaussian chain of radius R0 = N^(1/2) a; Debye-Hückel length inside the coil, κ² = 4π lB f N/R0³; polarization energy F = -kT (κR0)³; collapse of the neutral polyampholyte chain when this exceeds kT, with collapsed radius R = (N/f)^(1/3) a.
For the moment, I will argue that for a neutral polyampholyte, where positive and negative charges are randomly bound along the chains, the same result holds. I will then tell you how we checked the result. Consider a Gaussian chain with a Gaussian radius. The density inside the chain is the number of charges divided by the volume, and the Debye-Hückel formula gives the screening length inside the chain volume. The polarization energy of the polymer is just -kT (κR0)³. This must be compared to kT. If the polarization energy is smaller than kT, the polymer remains soluble. If it is larger than kT, it collapses on itself. This yields the criterion that if the size of the polymer is small enough, it remains soluble; if the size of the polymer is too great, it collapses on itself. Experimentally, the chains are huge: up to 100,000 monomers. If 1.5% of the monomers are charged, the whole polymer collapses on itself. If you enter approximate numbers into the formula, it indicates that the polymer collapses roughly at the observed value. If the polymer collapses, the radius is expected to grow as the number of monomers to the one-third power. That is the physics of what I call the polyampholyte effect. In practice, it is slightly more complicated, because I assumed here that the chain was exactly neutral. This requires the study of charged chains.
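A back-of-the-envelope version of this collapse criterion can be coded directly, with lengths in units of the monomer size a and an assumed ratio lB/a ≈ 2 (my choice; the text only gives orders of magnitude).

```python
import math

def collapse_parameter(n_monomers, f_charged, l_bjerrum_over_a=2.0):
    """kappa * R0 for a Gaussian chain, lengths in units of the monomer
    size a: R0 = N^(1/2), kappa^2 = 4*pi*l_B*f*N/R0^3.  The polarization
    energy -kT*(kappa*R0)^3 exceeds kT roughly when kappa*R0 > 1, which is
    the collapse criterion sketched in the text."""
    r0 = math.sqrt(n_monomers)
    kappa_sq = (4.0 * math.pi * l_bjerrum_over_a
                * f_charged * n_monomers / r0 ** 3)
    return math.sqrt(kappa_sq) * r0

def collapsed_radius(n_monomers, f_charged):
    """Globule radius R = (N/f)^(1/3), in units of a."""
    return (n_monomers / f_charged) ** (1.0 / 3.0)

# the experimental numbers quoted in the text: huge chains, ~1.5% charged
big = collapse_parameter(100_000, 0.015)    # well above 1: collapses
# a short, very weakly charged chain stays below threshold
small = collapse_parameter(100, 0.0005)
```

With the numbers quoted in the talk the parameter comes out much larger than 1, consistent with the observed collapse; only short or very weakly charged chains stay soluble in this estimate.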
Figure 20. Charge distribution. Random polyampholyte (Wittmer, Johner, JF): frozen charges on a Gaussian chain, equivalent to a salt. Alternating polyampholyte: dipolar interaction, a neutral chain with a negative virial coefficient; more soluble. Block polyampholyte: micelle formation.
Before discussing charged chains, I will consider the role of the charge distribution for a neutral chain. On the previous slide I assimilated the polyampholyte to a salt, under the assumption that the charges are evenly distributed, but this was not justified. We considered a random polyampholyte, i.e., a polymer chain with randomly distributed positive and negative charges, calculating its free energy averaged over all possible charge distributions. It has exactly the same value as for a salt. Even the prefactor is the same. Of course, one can also introduce correlations into the distribution. We assumed the polymerization to be a Markov process, so that the nature of one monomer depended only on the nature of the previous one. The result is still valid, except that the prefactor changes. The salt analogy is thus accurate, up to a prefactor, except in two extreme limits: the alternating polyampholyte (plus-minus-plus-minus) and the block polyampholyte. For the alternating polyampholyte, the interaction is weaker; it is like a short-range interaction, and one can define a virial coefficient, which appears on the slide. The idea is very simple: In an alternating chain, positive and negative charges alternate, and if one isolates a positive and a negative charge, they form a dipole that goes from the negative to the positive charge. These dipoles are free to rotate when the chain conformation changes, so it is like a Van der Waals interaction. The interaction between a dipole and another dipole that is far away decays as 1/r⁶, which is sufficient to be treated as a short-range potential, and one can calculate the virial coefficient. If this is done properly and cut off at small distances, the same result is obtained. The question then is whether or not it is observed experimentally, and for once, the calculation was carried out prior to the experiment. 
Experimentalists in Strasbourg know how to make both polymers, and the difference is spectacular. The random polyampholyte is insoluble, and the alternating one is soluble. The reason is that in addition to electrostatic interactions, one must consider excluded volume interactions. The solubility of the polymer depends on whether the attraction is larger or smaller than the excluded volume. The experimental result goes in the right direction. However, one must be careful and honest. The random polymers were much longer than the alternating ones, and it is well known that large objects precipitate much faster than small ones. So I tend to believe that yes, it is true that the attractions are weaker for alternating than for random polyampholytes, but I am not so sure that this is a definite proof. There is a third type of polyampholyte, which is "blocky." If you consider two blocks of positive and negative monomers, of course they fold back on themselves and complex each other. The solution precipitates; there is no doubt about that. Experiments are also being done in Strasbourg, and the neutral block
polyampholytes are found to precipitate. A non-symmetrical polymer with a short positive sequence and a long negative sequence precipitates locally. It is expected to aggregate and to form a sort of micelle. Experiments have been devised to try to determine the characteristics of the aggregates.
Figure 21. Polyampholyte adsorption on a charged surface. Neutral polyampholyte in an external field (Schiessel, JF): equivalent to a dipole; Flory free energy F/kT = R²/(Na²) - (eE/kT)(fN)^(1/2) R; stretched conformation R ~ (eE/kT) N^(3/2) f^(1/2) a². Adsorption on a surface (Dobrynin, Rubinstein, JF): surface attraction due to the electric-field gradient (dipole); screening of the surface field gives an adsorbed thickness λ.

The next question concerns how a polyampholyte adsorbs onto a charged surface. I want to consider a perfectly neutral polyampholyte. I assume to be able to isolate a perfectly neutral chain. I bring it close to a charged surface, and ask whether or not it adsorbs under the effect of electrostatic interaction alone. The charged surface creates an electric field. If you put such a polyampholyte chain into an electric field, what is its conformation? Despite the fact that the polymer is neutral, if I cut it in two, each part has a charge. Let me assume that the upper one has a positive charge and the other one a negative charge. There is a force pulling the two pieces apart in the electric field. An electric field stretches a polymer because even if it is neutral, the polymer behaves like a dipole and the dipole is stretched. The stretching of the dipole may be calculated using some kind of Flory energy. Entropy is lost if the polymer is stretched - that is the elastic energy - and dipolar energy is gained, as for any electrostatic dipole in an electric field. There is a minimum in the energy, which gives the equilibrium size. Size increases very quickly with molecular weight, and the chain is very soon completely stretched. Close to a surface, the charged surface creates an electric field, so it polarizes and
stretches the polymer, but it also creates an electric-field gradient. Because of the screening, the electric field is larger close to the surface than it is farther away. If you put a dipole in an electric-field gradient, it drifts in the gradient and goes to the surface. Close to a surface, both the electric field and the electric-field gradient play a role. The electric field polarizes the polymer, creating a dipole, and the electric-field gradient drives it to the surface. But this ignores the screening of the electrostatic interactions.
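The Flory-type stretching argument can be checked numerically. The sketch below uses my own reduced notation, writing the free energy (in units of kT) as R²/(Na²) minus a dipolar gain u(fN)^(1/2)R with u = eE/kT; minimizing gives an equilibrium size growing as N^(3/2), the rapid growth mentioned above.

```python
def flory_energy(r, n, f, u, a=1.0):
    """Flory-type free energy (in units of kT) for a neutral polyampholyte
    in an external field, in reduced notation: elastic term R^2/(N a^2)
    minus the dipolar gain u*(f*N)**0.5*R, with u = eE/kT (assumed)."""
    return r * r / (n * a * a) - u * (f * n) ** 0.5 * r

def optimal_size(n, f, u, a=1.0):
    """Analytic minimum of the energy above: R* = (u/2) f^(1/2) N^(3/2) a^2."""
    return 0.5 * u * f ** 0.5 * n ** 1.5 * a * a

# check the analytic minimum against a brute-force scan over R
n, f, u = 400, 0.1, 0.01
r_star = optimal_size(n, f, u)
best_energy, best_r = min(
    (flory_energy(r_star * k / 1000.0, n, f, u), r_star * k / 1000.0)
    for k in range(1, 3001))
```

The scan lands on the analytic minimum, and quadrupling N at fixed f multiplies R* by 8, i.e., the N^(3/2) growth that makes the chain "very soon completely stretched."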
Figure 22. Atomic force microscope (AFM) experiments (Ozon, Dimeglio, JF): force as a function of extension.
I mentioned earlier that in the vicinity of a charged surface, even if the salt concentration is not too high, the electric field is screened, and it is screened over the Gouy-Chapman length, which I call λ. The polymer conformation that minimizes the energy in the presence of the screened electric field of the surface is the conformation shown in this slide. The polymer does not explore those regions
where the electric field is too low, remaining confined between zero and λ. It is stretched by the surface field between zero and λ.
Figure 23. Loop-size distribution. Number of detachment events as a function of breaking distance (nm); loop distribution S(n) = (g + n)^(-3/2), with g = 507.
My colleague Dimeglio decided to test this idea with an atomic force microscope (AFM). He is doing single-molecule experiments, the principle of which is the following: The surface is coated with self-assembling molecules that are a mixture of charged and uncharged thiols. This allows control of the surface charge. The surface is then introduced into the polyampholyte solution at an extremely weak concentration, so that it is coated with polyampholyte. The tip of the AFM is then brought close to the surface. The idealized view of the experiment that the Dimeglio group has is the following: When the AFM tip is brought close to the surface, you catch a loop, it adsorbs, and then you pull back. When you pull back, you stretch the loop. At some point the loop is so stretched that it pops off. What you measure is force as a function of distance. As you go in, the force is essentially zero, until you touch the layer. When you go back, the loop that you
72
J.-F. Joanny
have caught acts as a spring. You stretch it, and at some point it jumps off, and the jumps appear as kinks in the curve. The first idea in analyzing the curve - which turned out to be completely wrong - was that the polyampholyte force fluctuations, being random, bear the signature of the distribution of charges along the chain. This was tried experimentally, but nothing could be gotten out of the force fluctuations. The curve was analyzed by doing statistics on the positions of the jumps. Statistics on the positions of the jumps provides access to statistics on the loop-size. The naive question is of course how can one be sure that two loops, or three loops, or whatever number are caught at the same time. First the order of magnitude of the force may be estimated, and the right order of magnitude measured. Maybe you catch two or three loops, but you do not catch five hundred. There are experiments in which you have two jumps; the smallest loop pops off first, then the second one, and so on. Curves in which there were two jumps were not considered. This way it was assumed that only one loop was seen, and the statistics of the loop sizes was done. Another assumption is made: when the loop pops off, it is fully stretched. The number of monomers in the loop is just twice the distance. This is not too bad an assumption, and there is a way to check it. You can look at the shape of the force curve and observe that it is like a chain under tension. There are lots of models for polymer chains under tension. One that always fits the data is the Langevin model, in which the force is related to the distance by the Langevin function. You can fit the curve and obtain the number of monomers in the loop. It is the same as assuming that the loop is fully stretched when it pops off. The net result is the number of times the chains pop off as a function of distance. This directly yields the probability that a loop has a length, n (the length meaning the number of monomers.) 
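For reference, the Langevin force-extension law used in that fit is simple to write down. This is a sketch in reduced units: the Kuhn length and kT are set to 1 by default, and the function names are mine.

```python
import math

def langevin(x):
    """Langevin function L(x) = coth(x) - 1/x, with L(0) = 0."""
    if abs(x) < 1e-6:
        return x / 3.0          # small-x expansion avoids 0/0
    return 1.0 / math.tanh(x) - 1.0 / x

def relative_extension(force, a_kuhn=1.0, kT=1.0):
    """Extension divided by contour length for the freely-jointed
    (Langevin) chain model mentioned in the text, at tension `force`."""
    return langevin(force * a_kuhn / kT)
```

Fitting this law to the retraction curve yields the number of monomers in the loop; as stated above, this is equivalent to assuming the loop is fully stretched when it pops off, since the relative extension approaches 1 at large force.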
Using the scaling type of model, the density of loops decays as S(n) = (g + n)^(-3/2), where g is the number of monomers of one chain between zero and λ. This is something that you can compare with experiment. It is rather spread out, but not so bad. It tells you that in this experiment there are 500 monomers between zero and λ. The chain is not fully stretched, so maybe in this case λ is 300 Å, or something like that. If you then check with the official surface density, you will find that it does not agree so well, for the reasons I mentioned.
STATISTICALLY DERIVED RULES FOR RNA FOLDING

MICHAEL ZUKER

Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY, USA

What I am not going to talk about, although I could improvise, is predicting RNA folding by energy minimization. Peter Schuster certainly talked about it, although he didn't present any algorithms, etc. If I had been trained as a physicist or a chemist instead of a mathematician, and if my chemical colleagues had cared about statistics or Boltzmann distributions, I think I would have come up with the McCaskill algorithm for computing partition functions earlier, because no one ever told me that that was a problem needing to be solved. I think there's good potential for combining the two approaches. What I do is energy minimization; I compute dot-plots very similar to those Peter showed. A dot-plot is the superposition of all possible foldings within a certain increment from the minimum energy. What you get from this is a superposition of all foldings within an energy increment. It shows the most likely foldings, superposed up to a certain limit. The box-plot shows probabilities precisely. You can use these two approaches together, one following the other, and get some interesting results. This hasn't been done yet, which is a bit unfortunate. There is another approach to finding foldings. The Vienna group first finds the minimum folding, then the next one, and the next one up. If you're working in the vicinity of the minimum folding-energy, you don't get overwhelmed by too many possibilities. A colleague of mine in New York State, Chip Lawrence, and his group are developing an algorithm that takes the partition-function approach to do statistical sampling of structures within some energy increment. There might be 10 million foldings - or 100 million - within 5 kcal/mole of the minimum folding-energy. 
You sample perhaps 10,000 of them - but a statistically valid sample - so that the probability of a base-pair occurring within the sample constructed would be valid. Of course, we know what the probability of a base-pair is anyway; we can calculate that exactly with the McCaskill algorithm. But there are a lot of statistics that you can't calculate exactly using the McCaskill algorithm (the probability of certain motifs forming, for example), or at least not so easily. You can get that by using this rather nice statistical approach. Taking my sort of folding method, the Vienna method of computing structures, plus Chip Lawrence's method constitutes
three different ways of constructing samples of sub-optimal RNA secondary structures. I came here prepared to talk about work that I've wanted to do for the last ten years but have only recently been able to carry out, because of the difficulty in collecting reliable enough appropriate data on RNA secondary structures, and because it's hard to find a person to work on it: someone who knows enough molecular biology and biochemistry and yet who could also do some computing. This was also difficult to do, although I finally succeeded. I want to provide some background on computing RNA structures using comparative methods. This will be complementary to what we heard this morning concerning energy-based approaches. Although in fact what I use is an energy-based approach, I'm trying to derive statistical rules for RNA folding. What I want to talk about now is comparative methods, so you'll know where the RNA secondary structures that I'm analyzing come from.
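The sampling idea mentioned above - estimating base-pair statistics from a statistically valid sample of structures rather than exactly - can be sketched in a few lines. Everything here is illustrative: the structures are toy sets of (i, j) pairs, not the output of a real sampler such as Lawrence's.

```python
from collections import Counter

def pair_frequencies(sampled_structures):
    """Estimate base-pair probabilities from a sample of secondary
    structures.  Each structure is a set of (i, j) base pairs; the
    estimate is the fraction of sampled structures containing each pair."""
    counts = Counter()
    for struct in sampled_structures:
        counts.update(struct)
    m = len(sampled_structures)
    return {pair: c / m for pair, c in counts.items()}

# toy sample of four structures over the same (hypothetical) sequence
sample = [
    {(1, 20), (2, 19), (3, 18)},
    {(1, 20), (2, 19)},
    {(1, 20), (5, 15)},
    {(2, 19), (5, 15)},
]
probs = pair_frequencies(sample)   # probs[(1, 20)] == 0.75
```

The same counting extends to any motif, which is the advantage the text points out: statistics that are hard to obtain exactly from the McCaskill algorithm are trivial to estimate from a valid sample.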
Figure 1. Potential secondary structure of segment 227-247 of group I introns in T. thermophila, T. pigmentosa, and T. malaccensis. The boxed nucleotides are those involved in a compensating pair of transitions between T. malaccensis and the other species.
So here is - starting arbitrarily - a badly reproduced slide from a 1984 Nucleic Acids Research article (vol. 12, pp. 8733-45). The authors compare part of the group I intron from Tetrahymena thermophila with the homologous segments in introns of some closely related organisms. Here you have the differences in Tetrahymena pigmentosa and the two other strains that are indicated [Fig. 1]. The idea is that in both RNA and proteins, one of the fundamental points is that sequence diverges much more rapidly than structure, so structure is conserved more than sequence. In RNA you can make use of this in very precise ways. I'll mention exactly how that can be done. Here is just a multiple alignment of four different strains of the group I
intron about which we heard a lot this morning and of which certain features were pointed out. Again, this is quite early, before the P3-P5 nomenclature was fixed. Somehow you recognize it as being the same structural feature, and in Tetrahymena thermophila you have a stem like this, and in T. pigmentosa it's like this - a little bit shorter - it's not really the same, but you can recognize it as really the same structural feature. In malaccensis you sort of have the same kind of stem loop. Here there clearly is a difference; you have an A changing to a G and a U changing to a C. The important thing is that you have a covarying change, so that the base-pairing is conserved. That's the critical thing. Covarying change is considered strong evidence of a base-pair. Similarly, you have some other motifs. I want to go directly into a little bit of theory [Fig. 2].

R1 = r1(1), r1(2), r1(3), ..., r1(n),
R2 = r2(1), r2(2), r2(3), ..., r2(n),
R3 = r3(1), r3(2), r3(3), ..., r3(n),
...
Rm = rm(1), rm(2), rm(3), ..., rm(n).

Figure 2. Multiple alignment of RNA sequences. A common length is enforced by the use of embedded dashes, if necessary.
RNA folding by comparative sequence analysis uses the idea that structure is conserved while sequences drift. This is as much true for proteins as it is for nucleic acids. The trouble is that if you have a multiple alignment with proteins, you can look for covariation, although it's harder to detect reliably, because you have twenty amino-acids. Given that you do detect covariation, what do you know except that these two amino-acids are close to each other? And what does that mean? If they're both hydrophobic, you don't really know how to model it. With RNA, if two positions covary like that, then more than 90% of the time there is going to be a Watson-Crick base-pair; in fact, more than 95% of the time it'll be a Watson-Crick base-pair or a G-U wobble-pair, and then you have a very long tail of every other possibility, every other possible non-canonical base-pair, although people didn't know about that fifteen years ago. The idea is that, formally, you have a bunch of homologous RNAs and you've aligned them somehow; that's a big "if." I'm representing a group of m sequences (R_1 to R_m). They can be m 5S RNA sequences, m ribosomal RNAs, m RNase P sequences, or whatever; they're homologous and they've all been aligned so that they all have a common length n. This means that some of the symbols are going to be dashes. There will be blanks. You've already done the alignment, which is highly non-trivial to do.
M. Zuker

The first example I give takes a bunch of 5S RNAs of identical length, all 120 bases. The alignment is not in question. Most of the time you cannot do that. The type of approach used is to compute a sort of mutual information content [Fig. 3] between all pairs of columns, so as to detect covariation.
H(i,j) = Σ_{N1,N2} f_ij(N1,N2) log2 [ f_ij(N1,N2) / ( f_i(N1) f_j(N2) ) ]

Figure 3. Mutual information between two columns in a multiple alignment.
If you have two different columns in the alignment (the ith column and the jth column), f_ij(N1,N2) is the frequency of having nucleotide N1 versus N2, where N1 and N2 vary over {A, C, G, U}. This is an observed frequency of how many times you have A-A, A-C, A-G, A-U, all the way down to U-U. There are sixteen possibilities. You just have a certain frequency; it's a 4x4 table. The f_i is just the frequency of the nucleotide N1 in the ith column, and similarly for f_j and the jth column. This is really relative entropy between two probability measures, i.e., two empirically derived probabilities. It's really a test for independence. If the distributions were completely independent, the probability of observing nucleotide N1 in column i and N2 in column j would be just the product in the denominator. The numerator contains the actual distribution observed. The logarithm is taken of a probability ratio. This computation gives the "mutual information" between the two columns and is a special case of relative entropy. That's one way of describing it in terms of information theory. We take the logarithm to base 2 by convention, and so on. If the columns are really independent random variables, you would expect the mutual information to be 0. If they covary, the information can go up to two bits. This is how we actually compute the mutual information.
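As a sketch, the mutual information between two alignment columns can be computed directly from the formula above; the column data here are invented for illustration.

```python
# Mutual information H(i,j) between two alignment columns, computed from the
# formula above. The two columns below are toy data for illustration.
import math
from collections import Counter

def mutual_information(col_i, col_j):
    """Sum of f_ij(N1,N2) * log2( f_ij(N1,N2) / (f_i(N1) * f_j(N2)) )."""
    n = len(col_i)
    f_ij = Counter(zip(col_i, col_j))          # joint counts of (N1, N2)
    f_i, f_j = Counter(col_i), Counter(col_j)  # marginal counts per column
    h = 0.0
    for (n1, n2), c in f_ij.items():
        h += (c / n) * math.log2((c / n) / ((f_i[n1] / n) * (f_j[n2] / n)))
    return h

# Perfectly covarying columns (always Watson-Crick partners): maximal signal.
print(mutual_information(list("AAGGCCUU"), list("UUCCGGAA")))  # 2.0
# A 100%-conserved column carries no covariation signal at all.
print(mutual_information(list("AAAA"), list("ACGU")))  # 0.0
```

Note that the second call illustrates the point made later in the talk: mutual information says nothing about fully conserved base-pairs.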
Figure 4. An RNA secondary structure (left) and its altered form after deletion of the A-A dimer.
It sounds so nice and easy to do. But suppose you have this sequence and this nice stem-loop structure [Fig. 4, left]. A mutational event occurred which deleted the A-A. It would be too tight a turn for the structure to remain otherwise the same, so you get a shift in the secondary structure [Fig. 4, right]. The deletion would actually have been the A-A, but a structural alignment would have us think the A-A changed to G-C, and that G17 and C18 are deleted. The C-G dimer is sequestered, so the structure shifts a bit. Looking at these two structures, you see that they differ just by a deletion, and you still have a stem-loop structure. But a structural alignment would predict:

    GCGCGCUUAAGCGCGCGCAU
    GCGCGCUUGCGCGCGC--AU

(the "incorrect" alignment forced by structural considerations), instead of

    GCGCGCUUAAGCGCGCGCAU
    GCGCGCUU--GCGCGCGCAU

(the "correct" alignment based on the mutational event). That is, if you do a structural alignment, you have to misalign the sequences: the gap goes in the wrong place and you are forced to accept two mismatches. It's this kind of problem, magnified over and over again in very complicated ways, that makes the problem of doing the alignment so that you get structure so difficult. In fact, it's still an unsolved problem. Lots of papers are being written proposing practical algorithms, and they have varying degrees of success.
Figure 5. Trivial alignment of 20 5S RNA sequences of identical length.
I'll go ahead with an alignment that no one can question, although people could say you should have insertions and deletions even though all the lengths are the same. This is an alignment of twenty very carefully selected eukaryotic 5S ribosomal RNAs, all of identical length (120 bases) [Fig. 5]. The alignment is not in question; they are all of the same length (no insertions, no deletions). Let's compute the mutual information content that I've described. There are many ways of presenting the data [Fig. 6].
Figure 6. Mutual information plot for 20 5S RNA sequences.
This is one of the ways: I plot a dot in row i, column j, representing the covariance, or mutual information, between column i and column j. If the mutual information is between 1.4 and 2, I use a red dot; if between 1 and 1.4, a green dot, and so on. My lower cut-off point is 0.6; anything below that doesn't get plotted at all. I'm getting an incomplete signal where I should have a complete one. Dots are missing and I'm getting all kinds of noise where I don't want it. That's just twenty sequences and a very naive approach. I'm not using any energy considerations at all. If you couple energy minimization with these comparative methods, it gets quite powerful very quickly. The person who has worked on these statistical rules, Maciej Szymanski, of the Polish Academy of Sciences, who spent over two years with me (first at Washington University, then at Rensselaer Polytechnic Institute, after I moved), happens to be one of the curators of the 5S rRNA database, along with Volker Erdmann in Berlin. With 316 sequences in correct alignment, you get something like this when you do the same sort of plot [Fig. 7]. The noise has disappeared and you get a rather strong signal. This numbering, which goes way past 120, includes all the gaps in the alignment; this is the common numbering system. For instance, this should be a straight helix, but it shows various breaks and gaps and so on. If you take out a particular sequence, this sort of dot-plot collapses, because a lot of the blanks go away and you get a particular structure for that particular sequence. That is why you have a certain irregularity here for the consensus, but the irregularity does not really exist for individual sequences. The consensus structure for eubacteria is the upper one (A), and the one in B is for eukaryotes [Fig. 8]. For Archaea, there is yet a third model. They are all pretty closely related.
Just by gazing at sequences and using simple statistical ideas, you can come up with a pretty precise model of what your secondary structure should be. In this case, Watson-Crick base-pairs are indicated, and Y-R indicates a pyrimidine-purine pair. There are variable regions where insertions or deletions may occur. This is a very detailed model of secondary structure, and it is considered the gold-standard of secondary structure. (I would say maybe the bronze standard.) These models have held up pretty well when confronted with real data; for instance, ribosomal RNA models held up quite well against the crystal structures of the last couple of years. My data mostly come from sequence analysis and covariation analysis.
Figure 7. Mutual information plot for 316 5S RNA sequences.
Figure 8. Consensus foldings for 5S RNA (A: eubacteria; B: eukaryotes).
Question: When there is a point in the previous diagram, is it a likely base-pair in the structure? Response: Yes; it says that the mutual information between these two columns is greater than some value.
Question: How do you go from one feature to the next?

Response: When I have enough sequences, I have high-enough mutual information; then I can say that I probably have a base-pair. There are a lot of intuitive things that you can put in. Again, these models were not built up that way, but slowly, over a number of years, and refined again and again. This is the summary of the data set that we were able to put together.

Question: Why is mutual information content used instead of an ordinary correlation function? I've read that for binary sequences they are the same thing, but if you go to something that has an alphabet of four letters, as is the case for RNA, mutual information content is more appropriate, and I was wondering why.

Response: People who aren't computer scientists still publish papers in which their covariation functions are something they've thrown together; it really is impossible to determine. I've adopted mutual information because I was trained as a mathematician, a probabilist. It's fashionable; I don't justify it any other way. You are not just looking for Watson-Crick pairs, but for any kind of covariation. Initially, people were looking for covarying Watson-Crick pairs, then any possible G-U pairs, then it got more generalized. You can detect any kind of covariation.

I thank Robin Gutell for rRNA and group I intron data, and Jim Brown for his RNase-P RNA data. Jim gave me everything he had for Eubacteria and Archaea, but nothing on Eukaryotes, because the data were too messy. Maciej (Szymanski) provided me with data on 5S. I got the group I intron data from Robin's website. We have two data-sets that overlap somewhat: the complete set and the core set. In terms of the complete set, I have a total of 1,900 distinct sequences whose secondary structures I know by comparative analysis. You can judge for yourself whether it is a gold- or bronze-standard. They are supposedly well-determined.
To break down the complete set: there are 233 large sub-unit ribosomal RNAs, 466 small sub-unit rRNAs, 819 5S rRNAs (of which the 316 eukaryotic ones I've already shown you in complete detail), 215 RNase-P RNAs comprising just the bacterial and Archaea sequences, and 167 group I introns. That makes 1,900 sequences, for a total of 337,000 facts: so many base-pairs and so many nucleotides. Concerning the "complete extended" set: mutual information does not say anything about the 100%-conserved base-pairs. You might have a loop that's sort of a 1x1 mismatch. As far as I'm concerned, any 1x1 loop (or 2x2 or larger symmetric loop) that is surrounded by base-pairs is a non-canonical base-pair. When I have a situation in which I can pick
up another base-pair that is essentially in a symmetrical loop, I do so. Some people are very demanding and insist on finding covariation to justify every single basepair, whereas I just say that if they are in the middle of the helix, they must exist. So I can pick up extra base-pairs like that - that's what we mean by the extended model. I'll just brush over it lightly now and give you other examples later. Then I have to tell you what we mean by "core." Robin Gutell has thousands of aligned sequences, although not in a single multiple alignment. He has separate alignments for Eubacteria, for Archaea, for chloroplasts, for mitochondria, and for Eukaryotes. If you have a structure for one, it should give you a structure for all of them, except that it doesn't quite work out so simply. There's a lot of manual playing-around with the data in the alignments. It's a black art, not an algorithm. If you ask someone like Robin Gutell to give you a secondary structure for a particular sequence, and if he agrees to do it, it would take him about twenty minutes. It would take me hours based on the structure. But he has the whole alignment. We have managed to pry the whole alignment database out of him, but how are we supposed to get the secondary structures out of it? For ribosomal RNA (rRNA) we only have 466 complete secondary structures that we were able to download from his site. So there is a huge gap between fully known structures and what is buried in the multiple sequence alignments. Using the known E. coli secondary structure and the E. coli alignment to all these separate classes of alignments, how many base-pairs can we pull out with a high degree of confidence completely automatically, i.e., without any human intervention? The answer is "quite a few." By "core" of secondary structure we mean, given the alignment block for ribosomal RNA and given that you have the E. coli sequence aligned (at the top), and knowing all the base-pairs in E. 
coli, what base-pairs can you pull out of the other sequences with a high degree of confidence? The answer is "those base-pairs that correspond to the core RNA structure". For E. coli, around 80% of the roughly 440 or so base-pairs are usable in the other eubacterial sequences. They correspond to base-pairs that may be derived with confidence. So you're losing at least 20%, but we have thousands of these sequences. That's a lot of extra data. Of course there's some overlap, although we have some complete large and small sub-unit rRNA data. However, our complete and core databases are largely non-overlapping. This is a good thing, because we compare results derived from each database. We are getting almost the same results for analyses that are carried out on the two databases. It's a way of validating one with the other.
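The automatic core extraction just described can be sketched as follows: project the reference's paired alignment columns onto another aligned sequence and keep a pair only where the target can still form a canonical or wobble pair. This is a toy illustration; the sequences, coordinates, and the helper name `core_pairs` are invented, not the actual pipeline.

```python
# Projecting known reference base-pairs through an alignment: keep a pair in
# the target only where it can still form a canonical or G-U wobble pair.
# All data below are invented toy values for illustration.
CANONICAL = {("A","U"), ("U","A"), ("G","C"), ("C","G"), ("G","U"), ("U","G")}

def core_pairs(ref_aln, ref_pairs, target_aln):
    """ref_pairs: (i, j) 0-based positions in the ungapped reference."""
    col_of = [c for c, ch in enumerate(ref_aln) if ch != "-"]  # seq pos -> column
    kept = []
    for i, j in ref_pairs:
        ci, cj = col_of[i], col_of[j]
        if (target_aln[ci], target_aln[cj]) in CANONICAL:
            kept.append((ci, cj))
    return kept

print(core_pairs("GGCA--UGCC", [(0, 7), (1, 6), (2, 5)], "GACAUUUGUA"))
# [(1, 8), (2, 7)]: the outermost reference pair cannot form in the target
```

This mirrors the "quite a few, but you lose some" behavior described above: pairs that survive projection are kept with confidence, and the rest are dropped rather than hand-curated.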
Figure 9. Group I intron from the small rRNA subunit of Urospora penicilliformis.
We had around 167 group I introns, so we have a whole bunch of structures that look like this (not Tetrahymena thermophila, which is a different group I intron, but with P3, P5, P6, and so on) [Fig. 9]. We looked at all the base-pairs that occur and all the base-pair stacks that occur (G-C next to G-C, G-C next to U-A, etc.), compiled statistics on those, and summed them up over all 167 different structures. In our analysis, where a stack occurs does not matter structurally; we are just throwing everything into the bin and looking at what's paired with what and which stacks occur. Concerning the core database, we look at the small ribosomal RNA sub-unit. We had to throw away several sections, because we could not reliably get the corresponding base-pairs in the other structures. You could get them if you went in manually, but that would be some years of work, which we were not able to do. These models were not forthcoming from the curator of the database. It's the best we could do. These are the inferred base-pairs: G versus A, A-G; thermodynamically,
that's rather stable; it's surrounded by known or proven base-pairs, so I throw those in. That's what I mean by the "extended" model. When I have symmetric interior loops, 1x1, 2x2, up to even 5x5, I'm going to throw those base-pairs into the model and add them. That way we pick up a lot of non-canonical ones. For the large sub-unit, we pick up some interesting things. We picked up five base-pairs in a row. We were bold enough to say that these must be base-pairs, and they were. But not all our predicted extensions have been found to be base-pairs in the three-dimensional models of rRNA that have recently been published. In any case, we have kept the automatic procedure. We are doing "large-scale gross analysis" and have to be true to the method. We got a lot of base-pairs out of the alignment just with this automatic procedure. That is supposed to explain the reliability: sequence-gazing, alignment, mutual information analysis, and a lot of hard work over the years hand-curating the data, because alignments are never sure. Now I'm telling you what the data-sets are: complete structures, core structures, and extended base-pairs. Note the nomenclature. We have a sequence starting 5' and a base W covalently linked to Y, then Z covalently linked to X and coming off the 3' end. W pairs with X and Y base-pairs with Z. That's the covalent linkage, so the stack is W-X/Y-Z, written that way; that's our convention. Normally, it would be G-C/A-U or something like that, if it's Watson-Crick. If you turn the diagram upside down, it's structurally equivalent: we can swap the outer pair with the inner pair. They are equivalent as stacks. What you see in the sequel is what we've done: statistics. Sometimes we just look at Watson-Crick interactions and G-U, and sometimes we look at every possibility from A-A/A-A to U-U/U-U, although there are not going to be 256 distinct possibilities, because of the symmetry involved. It's not exactly a factor of two, either.
Some are self-symmetric. We compute this number F(W-X/Y-Z), which is called an "odds ratio." We're trying to do for RNA what people like Margaret Dayhoff have done for protein structures. They looked at well-characterized protein structures, in which amino-acids tend to be close to other amino-acids, and then compiled statistics. You can derive potentials for doing protein folding. That's one thing I had in mind. The other thing is duplicating what Margaret Dayhoff did in the 1970s with proteins: computing log-odds statistics for which amino-acids get substituted by other amino-acids, then deriving rules for aligning proteins. The idea is to compute this number as an observed number of stacks divided by an expected number of stacks. The observed number of stacks W-X/Y-Z in a secondary structure is just the number of times that such a stack is found.
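The upside-down symmetry can be made concrete by canonicalizing each stack (W, X, Y, Z) against its symmetric twin (Z, Y, X, W). Counting the equivalence classes shows why 256 raw quadruples collapse to neither 256 nor 128: 16 stacks are self-symmetric. A small sketch:

```python
# Canonicalize a stack against its upside-down twin: W-X/Y-Z == Z-Y/X-W.
# Counting equivalence classes shows how the 256 raw stacks pool together.
from itertools import product

def canonical(stack):
    """Lexicographically smaller of a stack and its symmetric twin."""
    w, x, y, z = stack
    return min((w, x, y, z), (z, y, x, w))

classes = {canonical(s) for s in product("ACGU", repeat=4)}
print(len(classes))  # 136: (256 - 16)/2 + 16, since 16 stacks are self-symmetric
```

When compiling stack statistics, a stack and its twin are counted in the same bin, which is what "symmetry is taken into consideration" means below.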
Do not forget that symmetry is taken into consideration. Sum that over all the structures. We have a database of "complete structures" and another, slightly overlapping database of "core structures." Both of these have "extensions," which are called "complete-extended" and "core-extended." That's obvious, which means that computing the numerator is straightforward. The denominator is not obvious: what is the expected number of W-X/Y-Z stacks in a structure? Answering that question took over a year of thinking. We got it wrong at first, and then I butted heads with Sam Karlin at Stanford University over what should be used. I finally agreed with Sam. The observed number of any stacking interaction may be derived directly from the structure file by counting. The expected number may be written as a frequency multiplied by the total number of base-pairs in a structure. That's obvious. Now, how do you compute these frequencies? The expected frequency may be calculated in different ways. A secondary structure may be treated as:

1. a collection of known base-pairs, in which case we compute the expected number of each kind of stacking interaction given the collection of base-pairs;
2. an ordered sequence of nucleotides, in which case we compute the expected number given just the sequence (in fact, all we use are the base frequencies);
3. an ordered sequence of dinucleotides, in which case we compute the expected number given just the dinucleotide frequencies.

To explain the above, I remark that you can compute the expected number of base-pair stacks according to the degree of ignorance you want to assume. Suppose you assume you don't know what the base-pairs are, even though you do. You "admit" to knowing only the sequence. Then you can come up with one approach: I know the nucleotide composition, and therefore I know how many As, Cs, Us, and Gs there are.
If the ith sequence has length N_i, let N_i(W) be the number of W nucleotides in the ith sequence, and similarly for N_i(X), N_i(Y), and N_i(Z). Then

N_i(W) × N_i(X) × N_i(Y) × N_i(Z) / N_i³
is the expected number of stacks that might form at random. The odds ratio would be the total number of stacks of that form divided by this expected number. That's one way of computing it, using just the nucleotide frequencies. We also know the dinucleotide frequencies, which provide a lot more information. It turns out that the first and second calculations give us essentially the same results:
F(W-X/Y-Z) = Σ_i n_i(W-X/Y-Z) / Σ_i [ N_i(W) × N_i(X) × N_i(Y) × N_i(Z) / N_i³ ]

where n_i(W-X/Y-Z) is the number of such stacks observed in the ith structure.
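A minimal sketch of this odds ratio, assuming toy sequences and per-structure stack counts rather than the actual databases described in the talk:

```python
# Odds ratio F(W-X/Y-Z) from the formula above: observed stack counts divided
# by the count expected from mononucleotide frequencies alone.
# Sequences and counts are toy data, not the talk's structure databases.
def odds_ratio(stack, observed_per_seq, seqs):
    """observed_per_seq[i]: count of this stack in structure i;
    seqs[i]: the ith sequence (a string over ACGU)."""
    w, x, y, z = stack
    observed = sum(observed_per_seq)
    expected = sum(
        s.count(w) * s.count(x) * s.count(y) * s.count(z) / len(s) ** 3
        for s in seqs
    )
    return observed / expected

# One toy structure: 3 G-C/G-C stacks observed in an 8-mer of 4 Gs and 4 Cs.
print(odds_ratio(("G", "C", "G", "C"), [3], ["GGGGCCCC"]))  # 6.0
```

An F above 1 marks an over-represented stack, an F below 1 an under-represented one, exactly as the text goes on to say.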
For the second calculation, the expected frequency of a stack would be the frequency of the dinucleotide W-Y times the frequency of the dinucleotide Z-X, divided by (N_i − 1)². The other way is to use what the base-pairs are in each structure, but not what the stacking interactions are. Let's compute the expected number of stacks of a certain form, knowing what the base-pairs are: we know the base-pairs, but assume they come together in random ways. You get another denominator (see the formula on the next slide). The numbers F are either above 1 for over-represented base-pair stacks or below 1 for under-represented base-pair stacks. Just take the logarithm of the above odds ratio and multiply by −RT. The result is a pseudo free-energy. That's how I correlate my statistics with what my chemical colleagues have been doing for the last decades. Is this obvious to everyone? That is:

δG(W-X/Y-Z) = −RT ln F(W-X/Y-Z)
R is the gas constant, 1.987 cal/mol/K. T is 37°C, or about 310.15 K, because that's the appropriate temperature for comparing with the RNA rules I have from Doug Turner's lab. This is the relation between statistics and thermodynamics. The idea now is that if I use these rules as energy rules, say if I try to do RNA folding by maximizing the sum of stacking energies derived by this statistical formula, ignoring all the loops and the entropic effects and everything else, it means I'm maximizing a sum of logarithms of observed over expected: the likelihood of reproducing a structure that best duplicates the underlying statistics I have.
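The conversion from odds ratio to pseudo free-energy is then one line. As a check, an odds ratio of 10 comes out near the −1.4 kcal/mole figure quoted later in the talk:

```python
# Pseudo free-energy from the odds ratio, dG = -RT ln F, as in the formula
# above. R is expressed in kcal/(mol*K) so the result is in kcal/mol.
import math

R = 1.987e-3   # gas constant, kcal/(mol*K)
T = 310.15     # 37 degrees C in kelvin

def pseudo_dG(F):
    return -R * T * math.log(F)

print(round(pseudo_dG(10.0), 2))  # -1.42: close to the -1.4 kcal/mole quoted
```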
Question: I'm assuming that you're looking at overlapping pairs, in which case, by taking odds ratios, you are in fact looking at a kind of maximum likelihood estimator for non-independent things. If these were single nucleotide frequencies, a maximum likelihood estimator would be trivial; for things that are not independent, there is no theory whatsoever.

Response: That's right; there's no theory. But we can talk about the BLAST statistics. It turns out that you can assume a Markov dependence, and the statistics still hold. In any case, I don't claim any statistical rigor here. I would claim that the same plot of statistical results will still hold, even though we have this dependence.

Question: If you took a genomic DNA sequence, which should be B-form DNA, by definition completely stacked, do you get Turner's rules out of the analysis of this DNA?

Response: No.

Question: Oh? Why not?
Response: Because any DNA sequence is going to have its counterpart, its reverse complement, so there's no information there at all. Everything is perfectly paired. When you divide by the expected numbers obtained by using the maximum amount of information, think of what you're going to get. The base-pairs are overwhelmingly Watson-Crick, so you would expect the base stacks to be overwhelmingly Watson-Crick, and so the odds ratios for Watson-Crick pairs are going to be more or less around 1. They're not going to stand out. We don't give them any particular weight. We don't reproduce any usable energy rules that way. We have to go back to maximum ignorance, assuming we know only the base-pair frequencies or the base frequencies in a sequence. What does stand out, for instance, is the double-mismatch pairs (in particular, G-A stacks over A-G twenty times more than we would expect at random). These features are so overwhelming that they're known anyway. All the rest are numbers computed using the dinucleotide frequencies. We first focus on base-pair stacks involving Watson-Crick and G-U interactions, because we have numbers to tell us what they should be. We have ΔG at 37°C, as measured in Turner's lab. Our latest version covers energy rules from A-U/A-U all the way down to U-G/G-U, although there are some problems, because the double non-canonical pairs are not really appropriate. We treat these as 2x2 interior loops, not as base-pairs anymore. That's a little over-refinement. We have a considerable
number of base-pair types in the complete set of structures. This is not extended; we do have the results for the extended and core sets. For example, a frequency ratio of about 10 gives you a pseudo-energy of −1.4 kcal/mole. That's still a very significant number. The most stable interaction is G-C/C-G: the pseudo-energy is about −1.64 to −1.76 kcal/mole. The best way to deal with these is to compute the correlation coefficients, plot them, and look at them as a whole.

Question: Why are all the odds frequencies greater than 1?

Response: Remember, I'm showing you the results for the Watson-Cricks and the G-Us. Of course they're all greater than 1. Here we have numbers like 0.24: much less than expected, not just random, but highly under-represented. The frequencies lower than 1 reside here. We're now looking at all those pairs that involve totally non-canonical interactions, or combinations of Watson-Crick with non-canonical. We can predict pseudo-energies for all these interactions, but we have nothing with which to compare them, because nearest-neighbor interactions don't work. I have to explain this: what my colleagues do is make a duplex, say 5'-AUCGGAC-3' and 3'-UAGCCUG-5', melt it, and measure its ΔG. This is a "melting experiment." They do that for many different duplexes. You get a large database of measured free-energies, or ΔGs. What good is that? The goal is to extract a set of "energy parameters" that allows you to predict the stability of these duplexes. The first approximation would be to count how many A-U pairs and C-G pairs there are and take a linear combination of them: so many A-Us + so many C-Gs. Can that approximate the stability of the duplex? The answer is no; it doesn't work. The next approximation concerns A-U next to U-A, U-A next to C-G, and so on. These are called "nearest-neighbor" interactions. They take into account not only the hydrogen-bonds but also the stacking effects. Is that sufficient?
The answer is yes, if you include only single G-U base-pairs along with Watson-Crick base-pairs. The nearest-neighbor approach will not work for (non-G-U) mismatched pairs or for double mismatches, also called "tandem mismatches"; this includes tandem G-U pairs. The hypothesis is that the free-energy of a duplex may be expressed as the sum of the free-energies of each of its stacked base-pairs. When only Watson-Crick base-pairs occur, there are ten unknown parameters, corresponding to all the different possible stacking interactions. The next step is to carry out a least-squares analysis, using the observed ΔGs. For example, for the duplex above, 5'-AUCGGAC-3' and 3'-UAGCCUG-5', the free-energy would be A-U/U-A + U-A/C-G + ...; the sum of six terms.
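Both steps, the nearest-neighbor sum for a duplex and the least-squares extraction of the ten stacking parameters, can be sketched together. All numeric values below are invented placeholders, NOT Turner-lab measurements, and the toy stack-count matrix is chosen only to show that the parameters are recoverable:

```python
import numpy as np

# Nearest-neighbor hypothesis: duplex dG = sum of stacked base-pair terms.
# Placeholder stacking energies (kcal/mol), not measured Turner values.
NN = {
    ("AU", "UA"): -0.9, ("UA", "CG"): -2.1, ("CG", "GC"): -2.4,
    ("GC", "GC"): -3.3, ("GC", "AU"): -2.4, ("AU", "CG"): -2.2,
}

def duplex_dG(top, bottom):
    """top: 5'->3'; bottom: 3'->5', so top[k] pairs with bottom[k]."""
    return sum(
        NN[(top[k] + bottom[k], top[k + 1] + bottom[k + 1])]
        for k in range(len(top) - 1)
    )

print(round(duplex_dG("AUCGGAC", "UAGCCUG"), 2))  # -13.3: the sum of 6 terms

# Least-squares extraction: A[i][j] counts stack j in melted oligo i; solve
# A x = dG_obs for the ten unknown stacking free-energies.
A = np.vstack([np.eye(10) + 1, np.ones((2, 10))])  # 12 toy oligos, 10 stacks
x_true = -np.linspace(0.9, 3.4, 10)                # hidden "true" parameters
dg_obs = A @ x_true                                # "measured" melting dGs
x_fit, *_ = np.linalg.lstsq(A, dg_obs, rcond=None)
print(np.allclose(x_fit, x_true))                  # True: parameters recovered
```

With real melting data the system is overdetermined and noisy, and the fit "works" in exactly the empirical sense described below: computed and measured values agree well, with no deep theory behind it.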
The linear least-squares equations, for Watson-Crick base-pairs and stacking, are

A x = δG_obs,

where A = (a_{i,j}) is the n×10 matrix of stack counts, x is the 10×1 vector of unknown stacking free-energies,

x = ( δG(A-U/A-U), δG(A-U/U-A), δG(A-U/C-G), δG(A-U/G-C), δG(C-G/A-U), δG(C-G/C-G), δG(C-G/G-C), δG(G-C/A-U), δG(G-C/C-G), δG(U-A/A-U) )ᵀ,

and δG_obs is the n×1 vector of measured melting free-energies.
The total number of melted oligos is n, which is much greater than ten. The n×10 matrix consists of non-negative integers a_{i,j}, where a_{i,j} is the number of times that the jth base-pair stack occurs in the ith melted oligo. The 10×1 matrix contains the unknown stacking free-energies to be determined, and the final n×1 matrix on the right contains the free-energies for the melting of all n oligos. In practice, the least-squares solution for the base-pair stacking free-energies "works." That is, for a duplex containing n base-pairs, there are n−1 base-pair stacking free-energies added in order to estimate the free-energy of the duplex. The computed and measured values agree very well. There's no deep theory; it just works. When you have Watson-Crick pairs and maybe an isolated G-U, "nearest neighbors" is a good approximation. Note that when G-U and U-G are added, the number of parameters to be estimated increases from 10 to 21. In fact, a little more is needed; you need to include the so-called "end effects." You have to say that a base-pair at the end of a helix is different: a base-pair stack at the end of a helix is different from the same stack surrounded by other base-pairs. This would give far too many new parameters to estimate. As things turn out, all that really matters is whether you have a C-G pair or a non-C-G pair at the end. That is called the "A-U penalty" (about 0.5 kcal/mole at 37°C). We end up having nearest-neighbor rules with a single correction term for the ends of helices that do not terminate in a C-G pair. The resulting set of parameters works quite well. What about the self-hybridization of 5'-GCGCGCGAAGCGGCCG-3'? In this case, there is a double (tandem) mismatch consisting of two non-
canonical base-pairs in a row. It turns out that there is no way of assigning free-energies to base-pair stacks involving two non-canonical pairs in order for the numbers to work. You have to treat this and similar tandem mismatches as special cases. This has been done in Doug Turner's lab. Think what it means to carry out so many measurements; how tedious it is and how much work it is. The above case is symmetric: G-A next to A-G. It's symmetric because a single oligo is created that hybridizes to itself and gives a symmetric mismatch. Rule number 1 is that it's much easier to measure a symmetric mismatch than an asymmetric one. All the symmetric cases were constructed and measured; we have the most reliable information on those. Also, recall that the G-C/C-G base-pair stack is the most stable. Again, you want the best ends locking down the potentially destabilizing central part. The measurements were made mostly with G-C or C-G pairs, a few were made with A-Us, and none at all were originally made with G-Us, even though our table demanded numbers for everything. When you fold an arbitrary sequence, the algorithm says, "Give me a number for every possibility, whether measured or not." You measure some values and interpolate or extrapolate to obtain the others. How do we deal with non-symmetric cases? Suppose we have a C-U/A-A mismatch, surrounded by C-G and G-C closings. You treat that as C-U/U-C + A-A/A-A; that is, you average the free-energies that correspond to the two symmetric cases. You probably don't expect to do as well as with direct measurements. Then you do a couple of measurements on non-symmetric cases, just to get a feel for how well the "average rule" works. You really do go to the effort of making some non-symmetric cases; doing them all would be too much. This average rule doesn't seem that bad after all.

Question: [editor's note: This question pertains to mismatches in the middle of helices.]

Response: We're not saying whether they're paired or not.
From the point of view of predicting whether you have a stem or not, whether they're paired or not is irrelevant. All I need are frequencies. In fact, the G-A/A-G contributes to the stability of the structure, so much so that an NMR structure was determined in the Turner lab and the G-A/A-G "tandem mismatch" turned out to be two adjacent base-pairs, although the form was non-canonical. Some of the other mismatches, such as a G-G single mismatch, are also stabilizing. Most are destabilizing and no one knows what their structure is, or whether or not the bases are paired. Suppose I want to know the ΔG for a particular helix. If I can derive it by summing up nearest-neighbor rules, I'll do so, because it's reliable. For mismatches like this, it's just not
reliable. The way I would estimate the ΔG for a helix with a tandem mismatch would be to add up nearest-neighbor terms where possible and then take the free-energy of the tandem mismatch from a table containing free-energies for 2x2 loops. Question: [editor's note: This question probably pertains to the computation of pseudo free-energies for tandem mismatches.] Response: We don't have enough data. We don't have sufficient nearest-neighbor statistics for every double mismatch surrounded by every type of closing base-pair. The fact that I can derive pseudo free-energies for tandem mismatches is irrelevant. We don't have anything measured with which to compare them. What I'm doing is cheating a little bit: I look at each 2x2 mismatch surrounded by all possible closing base-pairs. My pseudo free-energy is computed as the sum of three terms: a closing base-pair next to a mismatch, a mismatch next to a mismatch, and a mismatch next to a closing base-pair. It is these numbers that I compare with "measured" values from my chemist colleagues. I simply don't have enough motifs in my database to derive pseudo free-energies for every 2x2 loop with all possible closing base-pairs. Question: Are you saying that you don't have them in your database? Response: They do occur. Some may be absent. We have insufficient numbers to derive reliable statistics. The decision we made was to take the sum of these three pseudo-energies to approximate a number from a measured table. Here are the kinds of results we're getting: For Watson-Crick base-pairs, we're getting correlations of 0.89, almost 0.9, between the Turner energy rules and the pseudo-energy rules. These are the best numbers we have. For the loops, it gets worse. For the multi-branch loops we know virtually nothing. Outrageous approximations are made to get (minimization) algorithms to work. We get pretty good correlation between the real ΔGs and the pseudo-ΔGs. This is without the extended base-pairs. 
It makes no difference if you add some extended base-pairs. You get the same results whether you use the core structure or the complete structure (90% correlation, which is pretty good). Now, let's look at the double mismatches W-X versus Y-Z, but closed by G-C and C-G, which are the most stable closing pairs. We have measured numbers for these. We are taking the pseudo free-energy to be the sum of the pseudo free-energies for three stacks involving base-pairs and mismatches. If we simply computed statistics for how many times we observe this particular motif, we would get virtually nothing. I wouldn't find enough examples.
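The "average rule" described above can be sketched in a few lines of Python. The free-energy values in the table are invented placeholders, not measured Turner-lab numbers; only the averaging logic is the point.

```python
# Sketch of the "average rule" for asymmetric 2x2 (tandem) mismatches:
# the free-energy of C-U/A-A is approximated by averaging the symmetric
# cases C-U/U-C and A-A/A-A.  The table values are invented placeholders,
# NOT measured Turner parameters.

SYMMETRIC_2X2 = {
    ("CU", "UC"): 1.5,   # hypothetical dG (kcal/mol), symmetric C-U mismatch
    ("AA", "AA"): 0.7,   # hypothetical dG (kcal/mol), symmetric A-A mismatch
}

def asymmetric_2x2(mm1, mm2):
    """Average-rule estimate for the asymmetric tandem mismatch mm1/mm2."""
    g1 = SYMMETRIC_2X2[(mm1, mm1[::-1])]
    g2 = SYMMETRIC_2X2[(mm2, mm2[::-1])]
    return 0.5 * (g1 + g2)

print(asymmetric_2x2("CU", "AA"))  # C-U/A-A from the two symmetric entries
```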
I have numbers for the free-energies and the pseudo free-energies. However, it is best to show you correlations. The correlations fall from 0.90 to 0.67. It makes a difference whether I'm looking at a complete structure or just a core. This is without extension. It turns out that when I add the extended base-pairs it gives me virtually the same results, which makes me believe that I really should be adding the extended base-pairs. Again, I'm comparing the Turner energy rules for 2x2 symmetric loops with what I'm getting from statistics. I'm getting a correlation down from 0.9 to 0.77 or 0.76, and virtually the same results for the complete structure and the core when I add those extensions. These are the best data I have for G-C/C-G pair closings, but remember, this includes measured values for all the symmetric cases and approximated values for some asymmetric cases. What if I pull out just the symmetric cases here? Question: It seems like most of the correlation is coming from three points on the very left. What will happen with the correlation if you take the three points off? Response: It's the tail wagging the dog. You get the same correlation. This line is a best fit between the pseudo free-energies and the actual measured free-energies. It is not a correlation coefficient. Comment: Yes, but it looks like you have a big cloud and three dots on the left. Response: Yes, but in fact you get the same correlation without these three points. We eliminated those three for the Watson-Crick case and got the same results for the correlation. Restricting myself to the 2x2 loops with only the symmetric mismatches, the correlation now goes back up to about 0.85. Somehow we get better correlations between the pseudo-energies and the numbers of which we are more confident. That seems to be what's happening. When we have better ideas of what the numbers really are, we get a better correlation. There is similar improvement for the core structures. 
Just looking at the small number of symmetric cases, such as GAAGCUUC and so on - all the various possibilities here. Then we thought, if that's the case, then what would happen if we look at the numbers that are really badly determined, such as the G-U closings, with arbitrary penalties applied. Then we get some really awful results. Correlation coefficients of 0.13 or 0.2 occur. Here we're looking at all possible 2x2 mismatches, but closed by G-U on the left and a U-G on the right, for which we have no measured data at all. We're just putting in a 1 kcal/mole penalty; some
arbitrary rule. We get really crummy correlation between these published values and those I get by statistics. This suggests that these numbers are in fact not very good. If we went back and actually did measurements on some of them, correlations would improve. This is a first analysis of a large quantity of data. Question: You get very nice correlations using Watson-Crick pairs and ugly ones with non-Watson-Crick pairs. Is that correct? Response: No; I get very nice correlations with the Watson-Crick pairs. When I look at 2x2 loops closed by G-C pairs, I get a pretty good correlation. If I restrict myself to 2x2 loops that are just symmetric, the correlation goes back up again, almost to the level of the Watson-Crick base-pairs. If I look at non-G-C closings for the 2x2 loops (for example, A-U closings), I get a significant correlation, but not as good as the one shown for the G-C closing. The worst example is precisely the G-U/U-G closing, for which you get a correlation no better than random. My conclusion is that the Turner rules can be trusted very little for this case. Comment: What may happen is that the secondary G-U base-pairs constitute the backbone that has to be stable in any RNA molecule, and the sequence itself may be less important; but when you're looking at mismatches, you're looking at specific active or functional sites that are RNA-specific, as if you were doing protein studies but just looking at amino-acids in active sites. In that case, you would find some strange sequence biases explained not by stability, but by the fact that you (need to) have five molecules at specific sites. Response: I agree. Why should these statistically derived rules correspond at all with thermodynamics? First, why do I believe in minimum-energy folding? We have this idea that somehow the appropriate structure is reasonably close to the minimum-energy structure, within experimental or computational error. That is one underlying idea here. 
If you use the statistical rules and try to do RNA folding by minimizing the sum of these statistically derived rules, that is essentially maximizing the likelihood of reproducing the statistics you observe in nature. That seems appropriate, just as appropriate as it is to align protein sequences to try to reproduce the frequencies that Margaret Dayhoff derived. Both should work to predict a structure. Therefore there should be a correlation between the two. It's a pretty fuzzy idea.
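The connection between observed stack frequencies and pseudo free-energies can be illustrated with a toy calculation: convert counts into energies via a log-likelihood (Boltzmann-style) transform, then correlate them with measured values. All counts and energies below are invented for illustration; they are not the database statistics or Turner parameters discussed in the talk.

```python
import math

# Toy illustration of turning observed stack frequencies into "pseudo
# free-energies" and correlating them with measured values.  All counts
# and "measured" energies here are invented; R*T is for 37 degrees C.

R, T = 1.987e-3, 310.15  # kcal/(mol*K), kelvin

observed = {"GC/GC": 900, "AU/AU": 400, "GU/GU": 120, "GA/AG": 60}
total = sum(observed.values())
expected = total / len(observed)  # naive reference: all stacks equally likely

# Over-represented stacks get negative (stabilizing) pseudo energies.
pseudo = {k: -R * T * math.log(n / expected) for k, n in observed.items()}

# Hypothetical "measured" free-energies (kcal/mol) for the same stacks.
measured = {"GC/GC": -3.3, "AU/AU": -1.1, "GU/GU": -0.5, "GA/AG": -0.8}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

keys = sorted(observed)
r = pearson([pseudo[k] for k in keys], [measured[k] for k in keys])
print(f"correlation between pseudo and measured energies: r = {r:.2f}")
```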
Question: I did something which probably would seem illogical to people in the field, but that I find interesting: I started with a data-set in which there were practically no base-pairs such as G-A, G-G, and things like this. Then I derived energies for the G-As and G-Gs, although they were not in the data-set. It is a case where the statistics did not show these base-pairs, and yet you could find them hidden in the data-set. The reasoning was that if I use a G-G pair and find an energy level such that it does not blow up the secondary structure, then perhaps the energy values are valid. I remember that in one of your publications you put in artificial energies so that you could correctly predict 5S and tRNA foldings. That was your asymmetric loop penalty. Did you assign stabilizing energies to G-G mismatches, or destabilizing? Response: I used a range of values from G-Us and G-Cs. All base-pairs were taken into account. I think that most of those pairs were given small stabilizing values. The point is that I could give them stabilizing values I had derived from sequence energy values that were not reflected in the statistics. Comment: But you didn't derive these from statistics. These numbers were pulled from the air. They were numbers that you could assign to hypothetical base-pairs that did not destroy the calculation. They didn't predict nonsense. The argument is more general for people interested in genomics: you have sequence data and people look for consensus. But what is avoided, what you do not see, perhaps reveals something about interactions. Response: I get your point, but I don't really understand it and I won't claim to. Comment: I would like to come back to your statistical approach, which is very reasonable and legitimate: to assume a priori that you don't know anything in particular about what you will find among those Watson-Crick bases. 
But on the other hand, since you find them and we know them, we can guess that these very high statistics that you find, so to speak, "pollute" the very rare tails of the other, non-canonical base-pairs. So I was wondering whether you could correct the overall statistics so that you don't assume you know nothing, and maybe your correlation will improve a great deal for those unknown things you're actually after. Response: Yes, I see what you mean. What you're suggesting is that I don't act as though I were ignorant; that I use all the information I have plus all the statistics
and then get really good estimates of the things I really don't know anything about. I think I would do that. Comment: Yes, because the reference state you're using, which is to assume that everything is equally likely to occur, is actually not correct. Comment: Yes, by doing the normalization, using maximum knowledge, saying we know which base-pairs occur in secondary structures; if we assume that they just occur together at random, which stacks appear to be over-represented and which are under-represented? Of course, the Watson-Crick stacks are just ordinary when you do that, but then, for example, you get the G-A/A-G stacks really standing out. We found some other highly occurring stacks that might be worth studying. I think a wealth of knowledge has already come out of this study that suggests melting experiments and maybe even structural studies on certain motifs that have been looked at. Comment: I think the basic prerequisite for statistical analysis is that you have a sufficiently large sample; that the overall stability determines your cases. That's the reason why it works very well with Watson-Crick base-pairs: you have large statistics and few cases that were selected for other reasons. If you come to your special cases, I assume that the number is very low, and my argument would be that thermodynamics need not be the major factor. In that case, I'm not really surprised that you don't get high correlations. Response: Yes, that's reinforcing what Daniel is saying. I gave you large numbers of base-pairs, but these are not independent observations. The independent events lie along the phylogenetic trees, which are unknown here. So the actual number of independent events represented by these statistics is far lower, but not as low as you might think, because you're seeing the covariations occurring over and over and over again. 
Question: You did not mention that as a rule, mutual information will not work for non-Watson-Crick base-pairs. Why? Response: Because what matters is the number of events. Unless you know the number of events, the constraints can be explained just by common history. Have you ever tried to find tertiary interactions by using mutual information? You get incredible noise, no signals, too much noise. Actually, I have one last slide showing what you could get for tRNA. With mutual information, you can pull out the four
major stems and also the tertiary interactions. The reason the set is so large is that the history has been blurred out. Question: Have you also looked at two-point correlations? You showed you had a lot of data.
Response: No, we have not. The next step would be to look at small interior loops to try to extract statistical information on them and think how to combine that in an appropriate way with the pseudo-energy rules for our stacks. Comment: Watson-Crick pairs occur in bunches, or as a motif, and the order of Watson-Crick pairs is usually conserved, which also explains why you have such low statistics. The order is not random anymore; you can't have just any type of occurrence.
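The mutual-information statistic discussed in the exchange above can be computed as follows for a toy alignment. The alignment is invented: in the covarying columns every sequence carries a complementary pair, so the MI is maximal, while invariant columns give zero.

```python
import math
from collections import Counter

# Mutual information between two alignment columns, the covariation
# statistic discussed above.  The four-sequence "alignment" is invented:
# columns 1 and 7 always hold complementary bases, columns 3 and 5 are
# invariant.

aln = [
    "GCCAAAGGC",
    "GGCAAAGCC",
    "GACAAAGUC",
    "GUCAAAGAC",
]

def mutual_information(i, j):
    """MI (bits) between columns i and j of the alignment."""
    n = len(aln)
    pi = Counter(s[i] for s in aln)
    pj = Counter(s[j] for s in aln)
    pij = Counter((s[i], s[j]) for s in aln)
    mi = 0.0
    for (a, b), c in pij.items():
        p = c / n
        mi += p * math.log2(p / ((pi[a] / n) * (pj[b] / n)))
    return mi

print(mutual_information(1, 7))  # covarying pair: maximal MI (2 bits here)
print(mutual_information(3, 5))  # invariant columns: MI = 0
```

Note the caveat raised in the response: high MI can also arise from shared evolutionary history, so the number of independent events in the alignment matters as much as the statistic itself.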
EXPERIMENTAL APPROACHES TO RNA FOLDING
SARAH WOODSON Department of Biophysics, Johns Hopkins University, Baltimore, MD, USA
This presentation has two objectives:
i. To describe and discuss work underway in my laboratory investigating how RNA structures are formed; how they fold or self-assemble;
ii. To present a brief historical perspective of the RNA folding problem and to provide some sense of what other kinds of experiments are currently underway.
Scientists first became engaged in studying the problem of how RNA structures are formed with work done on transfer RNA (tRNA) in the 1960s and 1970s. Studies of tRNA showed that RNA molecules have very precisely defined three-dimensional structures, and that these structures are absolutely essential for their biological function in cells.
[Figure 1 panels, left to right: unfolded → secondary structure → tertiary structure (Mg²⁺).]
Figure 1. The structure of tRNA. Secondary structure (base-pairing) precedes the formation of the native tertiary structure (far right). (Reprinted from Elsevier with permission.)
The structure of tRNA may be defined on several levels. That is to say, it is hierarchical. One level is the secondary structure, which is formed by Watson-Crick base-pairing between different parts of the RNA strand [Fig. 1]. The base-pairing arrangement depends on the sequence of the particular RNA. The next structural level is the tertiary fold. In tRNA, the pairs of double helices stack on one another,
hence, the four helices form two co-linear helical domains. The L-shape of tRNA is determined by tertiary interactions between two loops. Other important tertiary interactions involve nucleotides at the helical junctions. The elucidation of tRNA structure by biochemical and genetic methods, as well as by X-ray crystallography, has established the basic principles of RNA structure and laid the foundations for RNA biochemistry. Base pairs are quite stable. It turns out that the double helices formed between two strands of the RNA are also stiff. A double-stranded RNA may be modeled as a worm-like chain with a persistence length of 700 to 800 Angstroms. This is much longer than the dimensions of folded RNAs. One problem is how an interesting globular structure can be made out of such a long, stiff polymer. Nearly all RNAs that occur in nature have only short contiguous sections of base-pairing (tRNA consists of four helices). These helices are joined by unpaired nucleotides. Single-stranded RNA is considerably more flexible than double-stranded RNA, and can take on a variety of different three-dimensional conformations. The atomic interactions of "unpaired" nucleotides at the helix junctions determine the three-dimensional topology of the RNA molecule. The structural principles first gleaned from tRNA apply to more complex molecules, such as ribosomal RNA. In tRNA, there is one intersection of four helices. However, many helices can intersect in the small subunit of ribosomal RNA. The organization of the double-helical elements in three-dimensional space gives rise to the complex structure of rRNA. The orientation of the helices in space produces molecules that are not flat like a sheet of paper, but instead display a complex three-dimensional architecture. The problem with which my laboratory has been concerned is not so much what the structures of these RNAs are, but how they are assembled. 
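The stiffness argument can be made concrete with the standard worm-like-chain expression for the mean-square end-to-end distance, ⟨R²⟩ = 2PL − 2P²(1 − e^(−L/P)). The persistence length below follows the 700-800 Angstrom figure quoted above; the 100-bp helix length is an arbitrary example, not a value from the text.

```python
import math

# Worm-like-chain check of dsRNA stiffness:
#   <R^2> = 2*P*L - 2*P^2 * (1 - exp(-L/P))
# P is the persistence length (text: 700-800 Angstroms for dsRNA);
# the 100-bp helix is an arbitrary example (~2.8 A rise/bp, A-form).

def wlc_rms_end_to_end(L, P):
    """RMS end-to-end distance of a worm-like chain of contour length L."""
    r2 = 2 * P * L - 2 * P * P * (1 - math.exp(-L / P))
    return math.sqrt(r2)

P = 750.0        # persistence length, Angstroms
L = 100 * 2.8    # contour length of a 100-bp A-form helix, Angstroms

# For L < P the chain barely bends: R stays close to the full contour length,
# i.e. the helix behaves like a stiff rod.
print(f"contour length {L:.0f} A, RMS end-to-end {wlc_rms_end_to_end(L, P):.0f} A")
```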
Assembly is driven by the primary RNA sequence, and one would expect different structures to be formed by RNAs with different nucleotide sequences. We want to understand how this process is driven by physical interactions among nucleotides; if we can understand how these molecules self-assemble, perhaps we can begin to make better predictions about what kinds of structures they form and what sorts of time-scales are involved. Studies of the dynamics of nucleic acids began in the 1960s and 1970s on very simple systems. Several laboratories used a variety of methods, such as temperature-jump, UV spectroscopy, and NMR spectroscopy, to look at elementary steps in the formation of secondary structures. These experiments showed that the formation of double helices could be broken down into two kinds of elementary steps: nucleation and propagation. Nucleation of the helix can take place between
two different molecules, or alternatively, between parts of the same molecule. After two to three Watson-Crick pairs have been formed, base-pairing propagates through adjacent nucleotides, establishing a longer helix. This results in the rapid closing of the helix, by a process sometimes called "zippering." [Fig. 2]
[Figure 2 panels: nucleation (2-3 bp) followed by zippering; loop closure followed by zippering.]

Figure 2. Conformational dynamics in RNA. Experimental values for the rates of helix nucleation and propagation (refs. 6-9).
Base-pairing interactions within a helix are cooperative; hence, the zippering process is fast, taking place on the microsecond time scale. In contrast, nucleation is often rate-determining. When two different RNAs base-pair, the rate of nucleation depends on the concentrations of the two strands. The formation of intramolecular hairpins is independent of the bulk RNA concentration, since the two elements are part of the same molecule. The time required for hairpin closure is approximately 10 to 100 microseconds, but this depends both on the size and the sequence of the loop. In summary, local base-pairing interactions in nucleic acids occur relatively rapidly, typically on the order of microseconds. Question: What do you mean by "time of nucleation"? Response: Nucleation means that a sufficient number of bases have paired for there to be a high probability that the helix will continue to grow. The kinetics of base-pairing may be measured by analyzing the change in UV absorption. The base-paired state absorbs less than the unpaired state (hypochromic shift). By varying the nucleotide sequence or evaluating the temperature-dependence of helix formation, one can distinguish between these two kinetic steps. Probably the easiest way to initiate these processes is by instigating a rapid jump in temperature.
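A back-of-the-envelope sketch of the two-step picture (slow nucleation, fast zippering): the rate constants below are round numbers chosen only to land in the 10-100 microsecond hairpin-closure regime mentioned above, not measured values.

```python
# Two-step hairpin folding model: one slow nucleation event (the first
# ~3 base pairs) followed by fast sequential zippering of the rest.
# Rate constants are round illustrative numbers, NOT measured values.

K_NUC = 1.0e5   # 1/s, effective nucleation (loop-closure) rate -- assumed
K_ZIP = 1.0e7   # 1/s, per-base-pair zippering rate -- assumed

def mean_folding_time(n_bp):
    """Mean closure time: one nucleation event, then (n_bp - 3)
    sequential propagation steps."""
    return 1.0 / K_NUC + max(n_bp - 3, 0) / K_ZIP

t = mean_folding_time(8)
print(f"8-bp hairpin: ~{t * 1e6:.1f} microseconds (nucleation-dominated)")
```

Because zippering is two orders of magnitude faster, the overall time barely changes with helix length, which is the sense in which nucleation is "rate-determining."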
The formation of tertiary interactions usually takes longer than base-pairing, as shown in the laboratories of Don Crothers and Paul Schimmel, where NMR and stopped-flow spectroscopy experiments on tRNAs were carried out. In NMR experiments, the change in relaxation times of proton resonances attached to individual bases was measured at increasing temperatures. The first step in the unfolding transition is the disruption of the tertiary interactions that determine the L-shape of the tRNA. This occurs on a millisecond time scale (approximately 10 to 100 ms), according to the particular tRNA being examined. In fact, some tRNAs fold quite a bit more slowly, on the time scale of a second. Next, individual hairpins were observed to open and close on a microsecond time scale. This not only suggested that there is a hierarchy in structure (secondary and tertiary), but that that hierarchy manifests itself in the thermodynamic stability and kinetics of folding. RNA secondary structure is typically quite stable, whereas tertiary interactions are often marginally stable and depend very strongly on the presence of positively charged ions. Secondary structures tend to be a little more dynamic and form rapidly, whereas tertiary interactions seem to require more time to become organized. In the 1990s, a number of laboratories, including my own and that of Jamie Williamson, began to tackle the problem of how even larger RNA structures self-organize. The problem becomes a bit more difficult because, unlike tRNA, larger RNAs have several different domains of tertiary structure. Figure 3 is a schematic of a ribozyme from Tetrahymena thermophila that was discovered and extensively worked on by Tom Cech and his colleagues. It is the best-studied example of a large catalytic RNA. Just as for tRNA, base-pairing interactions (horizontal lines) yielding double-helical segments are interrupted by short segments that are not base-paired. 
This diagram may look a bit complicated, because it has been laid out in such a way as to suggest how this molecule is organized in three-dimensional space. [Fig. 3] One domain of tertiary interactions is formed by double helices known as "paired regions 4 through 6," or P4-P6 (left). Another domain of tertiary interactions (right) contains the double helices P3-P9. There are further interactions with helices P2 and P9.1/P9.2, which have been proposed to go around the outside of the RNA. You can get some sense of what this should look like in three dimensions in a ribbon diagram [Fig. 3] taken from a 5 Å crystal structure determined by Barbara Golden in Tom Cech's lab. The P4-P6 domain on the left [in blue in Fig. 3] has a bent-over conformation, colored purple in this slide [Fig. 3, left]. This structure indicates how double helices pack against one another. The structure is complex, with a large interface between helical segments. The P3-P9
domain (in green) is wrapped around the P4-P6 domain. This may be seen better in the top-down view of the same structure [Fig. 3, right]. The active site responsible for chemical catalysis lies in a cleft between these domains. Several parts of the molecule [not shown in Fig. 3] brace the exterior of the structure.
Figure 3. Structure of Tetrahymena group I ribozyme. Diagram of secondary structure (left; refs. 46, 47) and ribbon (right; ref. 14) show the domains formed by paired (P) regions 4-6 (green) and P3-P9 (blue). The active site is formed by the cleft between the two domains. P2/P2.1 and P9.1/P9.2 regions are missing from the ribbon diagram on the right.
The challenge was to understand how this more complex molecule self-assembles. This RNA folds in vitro in the presence of a buffer (to maintain the correct pH) and magnesium ions, which are essential for the RNA to form this structure. Question: Is the level of detail of an RNA structure that is central to its biological role known? How much of that detailed knowledge is important? Response: At one extreme, if one wants to know precisely how a molecular bond is broken by a ribozyme, one must know much detail concerning the arrangement of atoms within the catalytic center. Catalysis is sensitive to movements of several Angstroms back and forth in the positions of the atoms. Eventually, one would even need to know the various electronic states of critical atoms. At the other extreme, if one simply wants to know whether or not this sequence belongs to a family of sequences that all behave in the same manner (it does in this case), an atomic level of detail is not necessary. One simply needs to be able to recognize patterns of sequences and structure; in a very generic way, one can then classify a molecule as having a certain biological function. Did I answer your question?
Question: Fine, but what do you think? What level of detail is important?
Response: In the end, one needs to understand a system at all levels. My own interest is strongest at the biological and physical levels. Perhaps the best answer would be that we must have detailed knowledge of one example (e.g., System A) in order to get some sense of the depth of the problem. Broad classifications are very useful for making analogies, rendering it unnecessary to study every example to such extreme depth. By analogy, one can say system A is like system B, which would suffice to establish the probable function of system B. Returning to our original discussion, the problem is to understand how RNA domains are formed and how interactions between structural domains are established so that the global structure of the RNA is achieved. This problem was first tackled in Jamie Williamson's lab. He and a graduate student, Pat Zarrinkar, used a method to look at the formation of structure over time that involved DNA oligonucleotides that were complementary to various sections of the ribozyme. The oligonucleotides were used to determine which sections of the RNA were unstructured, hence accessible to base-pair with oligonucleotides that were added during the experiment. Those parts of the RNA that were already folded were inaccessible for base-pairing with the oligonucleotides, thus refractory to cleavage by ribonuclease H (ribonuclease H specifically cleaves RNA-DNA double helices). They observed that the ribozyme structure formed in stages, and that it took a long time for the complete structure to be formed (1 to 10 minutes, or even longer, depending on experimental conditions). Hence, it was possible for them to observe partially folded intermediate structures and to deduce that one of the RNA domains [P4-P6, purple on Figure 3] is thermodynamically very stable and folds independently of the rest of the sequence. The P4-P6 domain formed very early in their experiments; within the shortest time they could probe the structure of the RNA. 
In contrast, the P3-P9 domain [green section of Figure 3] was formed much more slowly (it took several minutes). The experiments of Zarrinkar and Williamson provided a first hint that minutes rather than milliseconds were required for the structure of the larger RNA to assemble and that the assembly process involved stable intermediates that could persist for relatively long periods of time. My laboratory, as well as those of Mark Chance and Mike Brenowitz at Albert Einstein College of Medicine (Bronx, NY) took a different approach to looking at the evolution of tertiary interactions over time in the Tetrahymena ribozyme. This is based on well-established nucleic acid chemistry. When nucleic acids are exposed to hydroxyl radicals, which are highly reactive, they are cleaved at
a frequency that depends on the hydroxyl radical concentration, as well as on the probability that the hydroxyl radicals collide with the ribose C4' atoms in the RNA. Nucleotides located on the outside of the structure are cleaved quite readily. Those parts of the RNA strand that are excluded from the solvent and internalized within the structure are unable to react with hydroxyl radical, and are therefore cleaved fairly slowly. An RNA that has base-pairing interactions but no further tertiary structure is relatively extended and in full contact with the solvent. Nearly all its residues are cleaved with roughly equal efficiency. One can detect cleaved products with a sequencing gel, a very standard method in molecular biology. The results may be seen in the schematic on the right [Fig. 4] as a "ladder" of products. In contrast, if the RNA is first exposed to magnesium ion, which permits stable tertiary interactions to form, parts of the RNA strands are internalized. Under these conditions, regions of the sequence are protected from cleavage, in spite of the presence of hydroxyl radical. These "protected regions" show up in our experiments as gaps in the sequencing gel pattern. From these gaps, it is possible to determine not only which parts of the RNA are folded, but also to quantify the fraction of the RNA population that has formed that structure.
Figure 4. Probing nucleic acid tertiary structure with hydroxyl radicals. After partial cleavage of the RNA backbone in the presence of hydroxyl radical, fragments are resolved on a polyacrylamide sequencing gel (right). Product detection is facilitated by 32P at the 5'-end of the RNA. Nucleotides whose ribose atoms are inaccessible to hydroxyl radical in the bulk solvent are protected from cleavage, creating a gap in the sequencing gel.
Comment: But after you chop-up the folded form, other parts become accessible again and you get a very muddy picture. Response: That is correct. One must control the cleavage kinetics, so that on average each molecule is cleaved no more than once, otherwise you will have
precisely that problem. It is actually easy to do this if you limit the total extent of cleavage to between 10 and 20% (and not more than 30%) of the RNA. Hydroxyl radical cleavage experiments are carried out in many molecular biology and biochemistry labs. Chemical reagents, such as Fe(II)-EDTA or peroxynitrous acid, are commonly used to generate the hydroxyl radical in solution. Using these methods, the cleavage reactions take at least several seconds, more typically 30 to 60 seconds. These methods are not appropriate if one wishes to look at very short-lived structures. Instead, we use a white-light synchrotron X-ray beam [Fig. 4] to generate the hydroxyl radicals. As the water absorbs the photons, one of the products is hydroxyl radical. The experimental scheme is as follows: The RNA is dissolved in a buffered solution, then mixed rapidly with magnesium in order to initiate the folding reaction. (This is done using a mechanical mixing device with a millisecond dead-time.) The RNA solution is then "aged" for a defined period of time, which allows the folding reaction to proceed. The RNA solution is then introduced into an exposure cell in front of the X-ray beam and irradiated for 10 to 30 milliseconds, in order to achieve the desired extent of cleavage. The effluent is then collected. These experiments are done at the synchrotron facility at Brookhaven National Laboratory, in Upton (Long Island), New York. To complete the experiment, we must then transport the solutions to Baltimore and run them through sequencing gels. This is necessary in order to find out which parts of the RNA have folded and to determine the extent to which the RNA population has folded over time. This information may be quantified, and a progress curve is obtained of the fraction of folded RNA over time. In this particular example [bottom left of Fig. 4], you can see that these sequences, which turn out to be in the P4-P6 domain [Fig. 3], are fully protected in several seconds. 
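The 10-20% cleavage limit follows from single-hit (Poisson) statistics: the more of the population you cleave, the larger the fraction of cleaved molecules that carry a second, pattern-scrambling cut. A small sketch (the calculation, not any experimental protocol, is the point):

```python
import math

# If cleavage events are Poisson-distributed over the population, the
# fraction of *cleaved* molecules that were cut more than once grows
# quickly with the total extent of cleavage -- hence the 10-20% limit.

def multi_hit_fraction(cleaved_fraction):
    """Among cleaved molecules, the fraction cut two or more times."""
    lam = -math.log(1.0 - cleaved_fraction)  # Poisson mean, from P(0 cuts)
    p_exactly_one = lam * math.exp(-lam)
    return 1.0 - p_exactly_one / cleaved_fraction

for f in (0.10, 0.20, 0.30):
    print(f"{f:.0%} of RNA cleaved -> "
          f"{multi_hit_fraction(f):.1%} of cut molecules hit more than once")
```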
When Bianca Sclavi started her experiments on the Tetrahymena thermophila ribozyme, she observed that different parts of the RNA were protected from cleavage at different rates. These results were consistent with the results obtained in Jamie Williamson's lab; that is, different structural domains reached their native conformations at different times. This figure [Fig. 5] displays a schematic of our results. The nucleotides represented by the green patches in helices P5a, P5b, and P5c were protected most rapidly, with a rate constant of about 2 s⁻¹. Regions of the RNA corresponding to nucleotides that should be protected when the whole P4-P6 domain folds on itself (orange patches) were protected only slightly more slowly, at about 1 s⁻¹.
The ribbon diagram in Fig. 5 shows how the nucleotides (highlighted in orange) are occluded from solvent when the RNA folds. This is what leads to protection of the nucleotides from cleavage by hydroxyl radical. These experiments established that it was possible for this RNA tertiary domain to fold with an average rate constant of 1 s⁻¹. The P4-P6 domain consists of 160 nucleotides, more than twice as long as tRNA, which is around 75 nucleotides. Nevertheless, it folds rapidly. The folding process is fairly concerted; the time constants for the protection of each sequence patch were the same, within the limits of our experimental precision. Therefore, even large domains of RNA can collapse fairly rapidly into an ordered structure. If we increased the ionic strength of the solution by adding monovalent cations, the folding speed increased significantly, to around 50 s⁻¹.
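The protection kinetics described here are usually summarized as single-exponential progress curves. A minimal sketch, using the approximate rate constants quoted in the talk (illustrative values, not the published fits):

```python
import math

def fraction_protected(t, k):
    """Single-exponential footprinting progress curve: F(t) = 1 - exp(-k*t)."""
    return 1.0 - math.exp(-k * t)

# Approximate rate constants quoted above, in s^-1
rates = {"P5abc": 2.0, "P4-P6": 1.0}

for name, k in rates.items():
    t_half = math.log(2) / k  # time at which half the population is protected
    print(f"{name}: t1/2 = {t_half:.2f} s, "
          f"fraction protected after 5 s = {fraction_protected(5.0, k):.3f}")
```

With k = 1 s⁻¹, over 99% of the population is protected within 5 s, consistent with the sequences being "fully protected in several seconds."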
[Figure 5 panel labels: unfolded (74 Å) → P4-P6 folded (51 Å) → core misfolded / native (48 Å); final step >1 min.]
Figure 5. Folding pathway of the Tetrahymena ribozyme. Cartoon indicating progressive formation of native tertiary interactions in the RNA population with time, after addition of Mg²⁺, based on X-ray footprinting data (ref. 18). Green: P4-P6 domain; pink: P2/P2.1; blue: P3-P9; grey: P9.1/P9.2. Numbers at the bottom indicate the radius of gyration measured by small-angle X-ray scattering (ref. 48). (Reprinted from ref. 49 with permission.)
Question: When you say that you increase the ionic strength, do you mean only by increasing the magnesium concentration, or can you also do it with sodium?

Response: In this case, 100-200 mM sodium was sufficient. We did not have to alter the magnesium concentration.
Question: Can you fold the initial helices without magnesium in a high sodium concentration?
Response: Yes, the double helices can form in the absence of magnesium, and are already formed at the start of our experiment.

Question: Is it only magnesium, or can you also use cesium or manganese for tertiary structure?

Response: The simple answer is only magnesium; manganese and calcium can substitute to some extent in stabilizing this RNA, but less effectively.

Question: What is the explanation for this magnesium principle?

Response: This will become clearer in a moment, when we look at metal ions bound to RNA. Other laboratories, including Eric Westhof's, have looked at this problem rather extensively. First, the electrostatic requirements of the phosphates must be satisfied. This does not require specific ion types. Second, there are certain pockets within folded RNA structures that are well-adapted to ions of a particular radius and charge. It turns out that the size of the magnesium ions, plus their ability to interact with oxygen atoms, makes them well-suited to stabilizing RNA tertiary interactions. In high-resolution structures of the P4-P6 domain alone, which were determined in Jennifer Doudna's and Tom Cech's labs, metal ions may be visualized in the electron density map. In the X-ray experiment, five ions are clustered in the P5abc region of the RNA alone. The ions are localized in regions in which two segments of phosphate backbone are in close proximity. The dense packing of phosphates implies a high negative-charge density.

Question: Do you find monovalent ions that are specifically bound, as in the diagram?

Response: We haven't, but other labs have. They may be detected in some crystal lattices and in NMR experiments. Scott Strobel has obtained modification interference data that suggest there are monovalent ions that localize to specific regions and stabilize tertiary interactions. The first conclusion to be drawn from our folding experiments is that one domain of the ribozyme is able to rapidly fold to its native structure. This establishes the principle that RNA can collapse reasonably rapidly (milliseconds-to-seconds) to a well-organized structure that is compatible with its biological activity. But what is happening with the rest of the RNA? Again, it takes minutes or longer for the overall structure of the ribozyme to be formed. What Bianca [see Fig. 6] found was that there were nucleotides in other regions of the RNA, such as those highlighted by yellow patches in the P3-P9 domain, which take a very long time to reach their native conformation.
[Figure 6 panel labels: misfolded → native; 80% folds rapidly.]
Figure 6. Mispairing of the P3 helix in the ribozyme center. Non-native folding intermediates are stabilized by an alternative base-pairing, Alt P3, that competes with the native P3 helix. A U-to-A mutation that stabilizes P3 increases the fraction of RNA that folds rapidly. (Adapted from refs. 24 and 25.)

One can put these data together in a cartoon [Fig. 7] that summarizes the results showing the evolution of native structure within the population of RNA molecules. Within this population, one domain of the RNA folds very quickly (domain P4-P6, green cylinders), whereas other domains of the RNA become organized very slowly, including the P3-P9 domain (blue cylinders). The latter comprise part of the active site of the ribozyme. In fact, their organization is required for the biological activity of the RNA. The next questions we asked were "What are these intermediates; what is their nature, and why do some of these self-assembly/self-folding steps require such long times?" The answers came from a completely different approach. Jie Pan, a former graduate student in my lab, used non-denaturing, or native, polyacrylamide gels to resolve the structures of folded and unfolded RNAs [Fig. 8]. The molecules move through a polyacrylamide matrix, driven by an electric field. Movement through the polyacrylamide matrix is partly a function of molecular weight, but also depends on the overall dimensions or size of the molecule. In this case, all the RNA molecules are labeled with ³²P, so they may be seen on X-ray film. They are all of the same molecular weight, but migrate at different rates, indicating that they are of different sizes.
[Figure 7 panel labels: U273A, ~80% fast; slow pathway via intermediates.]

Figure 7. Kinetic partitioning between fast (direct) and slow (indirect) folding pathways. (Reprinted from ref. 25, with permission.)
[Figure 8 schematic labels: Mg²⁺; Na⁺; I and N species; Mg²⁺ in native gel.]
Figure 8. Counterions induce near-native structures. RNA was incubated with the desired salt (such as NaCl) for 2 hr at 30°C, then loaded onto a native gel at 4°C containing 3 mM MgCl₂. RNA that appears in the native form (N) represents molecules that were able to fold correctly in the 15-30 s required for the sample to enter the polyacrylamide gel matrix. Refolding is arrested once the RNA enters the gel. (Reprinted from ref. 41, with permission.)
These experiments showed that 15 seconds after adding magnesium and buffer, the RNA population migrates at a wide range of speeds (shown as broad smears in the gel). This confirms that the RNA molecules have many different shapes. If the sample is loaded into the gel a sufficiently long time after adding magnesium to the RNA, we see that nearly all the RNA migrates as a sharp band, consistent with a uniform structure. We know from previous experiments that this structure corresponds to the biologically active material; that is, the correctly folded RNA. It took 1 to 2 hours in this case for the majority of the RNA to attain the native conformation. These native gel experiments allow us to conclude that the folding intermediates are not a single structure, as suggested in the previous cartoon, but really a whole family of structures. Second, we observed that urea, which partially destabilizes the RNA, accelerates the folding process. This suggested that the intermediates must unfold in order to reach the native structure. Thus, by destabilizing interactions between different parts of the RNA, it is possible to speed up the transition from the intermediates to the native state.

Question: Are you saying that there is no magnesium in the buffer; that it is retained by the gel?

Response: I'm sorry, I haven't provided enough detail. The running buffer in the gel [Fig. 8] contains 10 mM magnesium, which is sufficient to stabilize the RNA. Once the RNA enters the polyacrylamide matrix, further re-organization (further folding) appears to be arrested.

Question: What if you do the same kinds of experiments with no magnesium in the gel?

Response: It would be impossible to separate the unfolded and folded species; they would all appear as if unfolded.

Question: Do the bottom bands [in Fig. 8] correspond to the folded conformation?

Response: Yes.

Question: Then why do you have them at different heights in the figure?
Response: The gel was run continuously throughout the experiment, and the last lane was loaded two hours after the first and found to have moved a shorter distance.

Question: Do you start folding all samples at the same time and then just load them?

Response: The folding reactions were all started at one time and stopped at different times. Folding was arrested when the samples entered the gel. We loaded the samples at different times.

Question: Looking at the hierarchy of the folding events in the previous slide [Fig. 7], it appears that you had a framework of a double helix, which initiates rapidly, after which the active region is more slowly assembled. Can one relate this at all to being able to alter sequences as long as you have a duplex-forming sequence for the rapidly forming part of the structure? Can one relate the importance of sequence detail to the rate at which things are taking place?

Response: The folding kinetics that we observed and the structures of the intermediates are very sensitive to the RNA sequence. If one base is changed within this RNA (the ribozyme itself is about 400 nucleotides long), the folding kinetics and pattern of intermediates can vary dramatically. Yes, the details of the sequence are very important. One can start to surmise some general patterns in terms of the relative stabilities of the various helical segments. For example, the P4-P6 domain, which folds rapidly, is far more stable than the P3-P9 domain, which folds slowly. The other general conclusion is that the number of nucleotides separating the two parts of the RNA that are coming together also has a strong effect; longer distances between interacting residues make them less likely to base-pair. Since the intermediates have considerable native character, they are stable enough to persist for long periods of time. A framework for thinking about this kind of problem was first established for the problem of protein folding.
The system may be modeled as a free-energy landscape as a function of conformation. One can imagine the native RNA structure, which generally corresponds to the lowest free-energy state, represented as a well in the landscape. One can then imagine the intermediates as corresponding to other low free-energy states in competition with the native state.
As we begin the folding experiment under our initial conditions (no magnesium present), many of the various RNA conformations are of roughly equal free energy, hence the RNA population samples many different structures. The instant we add magnesium to our solution, tertiary interactions are energetically favored. Under these conditions, a few structures become far more stable than the rest. There are only a few ways to organize the RNA chain so as to generate large numbers of energetically favorable contacts. These contacts must be compatible with one another. The best structure corresponds to the native state, but it is clear that as the RNA sequence becomes fairly long (in this case almost 400 nucleotides), competing structures with low free energy become increasingly likely. These intermediates contain many native interactions, which decreases their free energy. They also include some interactions that are non-native and incompatible with the native structure. These interactions must minimally dissociate in order for the intermediates to convert into the native structure. Thus, there are high free-energy barriers separating these states, which give rise to the very long folding times that we observe. Based on theoretical models developed to describe protein folding, Dave Thirumalai and I proposed a "kinetic partitioning mechanism" for the folding process. In this model, a small fraction of the population rapidly reaches the native structure because only correct contacts are made. In contrast, a large fraction of the RNA population makes some non-native contacts. These molecules spend long periods of time in intermediate states. This is illustrated in Figure 7, in which there are two types of folding processes: rapid collapse to the native structure, and slow folding via non-native intermediates. As we have already seen, it is possible for small RNA domains to collapse rapidly to an ordered structure.
For the majority of the RNA population, non-specific collapse leads to a collection of low free-energy states, which take a long time to refold.

Question: Can you estimate the fraction of the molecules that takes the fast pathway and the fraction that takes the slow pathway?

Response: This can actually be measured fairly well. For the wild-type RNA that I have been describing, around 5 to 10% of the RNA folds quickly. This fraction is very sensitive to sequence and conditions. Concerning questions about the importance of sequence, non-native interactions also stabilize the intermediates. In my laboratory, Jie Pan used the fact that the intermediates can be physically separated from correctly folded RNA in polyacrylamide gels in order to figure out which interactions were uniquely present in the intermediates. As shown in Figure 6, one result of her experiments was that the P3 helix mispaired. The P3-P9 domain folds very slowly and involves a long-range base-pairing interaction (between the purple and black strands). These nucleotides are separated by around 170 nucleotides of the primary sequence. This base-pairing interaction (P3), which is required in the biologically active structure, is easily replaced by base-pairing between yellow and black, called "Alt P3" [Fig. 6, left]. Indeed, sequence changes that are predicted to favor Alt P3 slow down the overall folding reaction. Conversely, sequence changes that are predicted to favor the native base-pairing increase the fraction of the RNA that folds rapidly. Jie mutated a UU mispair into a UA Watson-Crick base-pair, which is more stable [Fig. 6]. For the mutant RNA, around 80% of the RNA population folds rapidly (around 1 s⁻¹). This is in sharp contrast with the wild-type RNA, of which 2-10% folds quickly. The idea that the RNA partitions between rapid collapse to the native structure and slow transitions from competing, non-native states was beautifully confirmed by single-molecule fluorescence resonance energy transfer (FRET) experiments done in the labs of Steve Chu and Dan Herschlag at Stanford. They attached two fluorophores to different parts of the ribozyme. When the fluorophores are close in space, energy transfers from the donor to the acceptor. This only occurs when the RNA is folded. They used the FRET intensity to detect the folding of the RNA molecule. By observing single molecules, they could see that some RNAs folded rapidly, at the same rate constants that we predicted (1 s⁻¹). They also observed other molecules that folded much more slowly.
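The kinetic partitioning mechanism can be written as a double exponential: a fraction Φ of the population folds directly at the fast rate, while the remainder escapes slowly from misfolded intermediates. A sketch with illustrative numbers (Φ ≈ 0.08 for wild type and ≈ 0.8 for the mutant, as quoted in the talk; the slow escape rate here is an assumed placeholder, not a measured value):

```python
import math

def folded_fraction(t, phi, k_fast, k_slow):
    """Kinetic partitioning: a fraction phi folds directly at k_fast;
    the remaining 1 - phi is trapped and escapes at k_slow."""
    return (phi * (1.0 - math.exp(-k_fast * t))
            + (1.0 - phi) * (1.0 - math.exp(-k_slow * t)))

k_fast = 1.0          # s^-1, the measured fast-phase rate constant
k_slow = 1.0 / 3600   # s^-1, assumed: escape from intermediates on the hour scale

for label, phi in [("wild type", 0.08), ("U-to-A mutant", 0.80)]:
    print(label, "folded after 10 s:",
          round(folded_fraction(10.0, phi, k_fast, k_slow), 2))
```

After 10 s the fast phase is essentially complete, so the folded fraction at that point reads out Φ directly; the slow phase then dominates the remaining hours-long approach to 100%.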
In addition, Tao Pan and Tobin Sosnick used completely different methods (fluorescence and circular dichroism spectroscopy) to investigate the folding mechanism of another large ribozyme (RNase P). They arrived at similar conclusions: an isolated domain collapses rapidly to the native structure, in around 50 milliseconds. The complete RNase P ribozyme has a propensity to form long-lived intermediates, which include native as well as non-native interactions. Therefore, the same principles apply to different RNA systems studied in different labs. It is comforting that the whole field seems to be converging on the same themes.

Question: Biologically speaking, why is the native state retained; why does it keep re-appearing?
Response: The reason this U-to-A mutation [Fig. 6] does not occur in the natural sequence is that it lowers the intrinsic catalytic activity of the folded RNA two-fold. The RNA splices extremely well in the cell; normally 99.9% of the RNA is spliced. For this mutant, about 95% of the RNA is spliced. This doesn't seem so terrible, but, with evolutionary pressure over many generations, this is probably the reason that the wild-type sequence is maintained, even though it folds poorly in vitro. Another possibility is that the competing structures we observe as folding intermediates may not be an undesirable property of the system; they could have a biological regulatory function that we do not yet understand.

Question: Helices in proteins are themselves unstable (they are stabilized by the 3-dimensional structure). Although the helices are already formed, they fluctuate during formation of the final state. It seems that in the case of RNA the helices are stable by themselves. (Response: That is correct.) They are called domains. Protein domains assemble to constitute the protein. These domains are usually also evolutionarily independent. Do RNA domains also have evolutionarily independent functions or origins?

Response: I would say no; the evolutionarily independent part of the ribozyme is its center. However, according to experimental evidence from Tan Inoue at Kyoto University, the P4-P6 and P3-P9 domains may have independent evolutionary origins.

Question: What happens in vivo? Is there a slow/fast dichotomy, or are there chaperonins?

Response: We've done a few experiments to find out what happens in vivo. First of all, you don't see 98% of the RNA taking minutes to fold - the cell would be dead if that happened. In fact, it appears to be the reverse; 98% or more of the RNA appears to fold within seconds. We can estimate that by measuring the rate of self-splicing, because catalysis does not occur until folding has taken place.
If one introduces mutations that increase the propensity of the RNA to misfold in vitro, one does in fact see some increase in in vivo misfolding, particularly at very low temperatures. This is consistent with the stabilization of misfolded intermediates. Misfolding is suppressed to a great extent in the cell; in vivo phenotype differences are milder than the differences in activity that we observe in vitro. There is clearly a phenomenon in the cell that either accelerates the exchange between misfolded and native structures, or, alternatively, somehow directs the RNA along the fast-folding pathway. It is certainly possible that this is accomplished by proteins that are analogous to chaperonins in protein folding.

Question: What about RNAs that don't have enzymatic activity? How can RNA be isolated from cells without disrupting their structure? How can you be sure that what you observe in vitro is also what happens in vivo?

Response: In fact, it is very difficult if you don't have a function you can measure. This is why people in the field have started with RNA molecules that have catalytic activity. If you're looking at a ribosomal RNA or some other RNA that forms an RNA-protein complex, you could try to reconstitute the normal activity of the RNA in vitro. One can try to isolate native particles from cell extracts and determine whether these particles have the same properties as material created in the test tube. One can also generate a large family of sequences, either artificially or by comparing homologous genes from different species. The sequence information may be used to determine which interactions are preserved over evolution, and therefore required in the active structure. These are some approaches to this problem. However, catalytic activity is easy to measure and is thus a great technical advantage. Next, I want to discuss the importance of counterions in directing the folding process [Fig. 9]. The first question is what types of interactions drive the initial collapse of the RNA into compact structures. We have experimental evidence that collapse occurs very rapidly, on the microsecond-to-millisecond time-scale. It is not known which interactions "nucleate" the tertiary structure. If time remains, I will discuss which aspects of the nucleotide sequence determine the proportion of the population that will specifically nucleate the native structure, as opposed to going through non-native intermediates.
One key element that is certainly important for enabling the RNA chain to collapse to more compact structures is the necessity to overcome the electrostatic repulsion between phosphate groups. Remember that each phosphate bears a full negative charge. The importance of the electrostatics was appreciated very early, in the 1960s, as physical chemists began to work on nucleic acids. If very few counterions are present, an RNA will predominantly form extended structures in solution. Additional counterions reduce the net negative charge on the RNA chain and can permit the chain to sample more compact structures. Certain divalent ions, such as magnesium, are specifically required for the native structure.
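One way to see why added counterions permit compaction is through the Debye screening length, the distance over which phosphate-phosphate repulsion is effectively screened. A sketch for a symmetric 1:1 electrolyte in water at 25°C (dilute-solution assumption; the numbers are textbook electrostatics, not a calculation from the talk):

```python
import math

def debye_length_nm(ionic_strength_M, T=298.15, eps_r=78.4):
    """Debye screening length for a 1:1 electrolyte in water, in nm."""
    eps0 = 8.854e-12   # vacuum permittivity, F/m
    kB = 1.381e-23     # Boltzmann constant, J/K
    NA = 6.022e23      # Avogadro's number, 1/mol
    e = 1.602e-19      # elementary charge, C
    I = ionic_strength_M * 1000.0  # convert mol/L to mol/m^3
    lam = math.sqrt(eps_r * eps0 * kB * T / (2.0 * NA * e**2 * I))
    return lam * 1e9

for c in (0.001, 0.01, 0.1, 1.0):  # ionic strength in M
    print(f"{c:g} M: lambda_D = {debye_length_nm(c):.2f} nm")
```

At 100 mM monovalent salt the screening length drops below 1 nm, so phosphates only a few base-pairs apart no longer repel strongly, which is consistent with added counterions letting the chain sample more compact structures.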
[Figure 9 plot residue: x-axis, [cation] (mM), from 0.001 to 10⁴; curves for Co(NH₃)₆³⁺, spermidine, Mg²⁺, Ba²⁺, Na⁺, K⁺.]
Figure 9. Counterion-dependence of folding. Results of experiments as in Figure 8. All counterions tested induced cooperative transitions to native-like intermediates. Reprinted with permission.
One thing we wanted to address experimentally was the ability of different kinds of counterions to drive the initial collapse of the chain. We hypothesized that collapse can be described by the net charge density of the polymer. Ions interact very strongly with polyelectrolytes. A simple theory to describe the condensation of counterions around polyanions was proposed by Gerald Manning and Tom Record in the 1970s. The theory proposes that a large number of ions are confined to a volume of solution around the nucleic acid, which is modeled as an infinite linear rod with uniformly spaced negative charges. Condensation of ions depends on the spacing between the negative charges on the polyanion, the temperature, the valence of the counterion, and to some extent, on the concentration of ions in solution. These are non-specific interactions, in the sense that the ions are confined to a volume around the nucleic acid, but they do not dwell at individual sites along the RNA for long periods of time. The "bound" ions remain highly dynamic and mobile. Actual RNA structures are more complex than an infinite rod. If one computes the negative charge density around the surface of a folded RNA, the electrostatic field will be found to depend on the density of the phosphates and on other polar atoms present in the RNA chain. David Draper computed the electrostatic densities around tRNA, and Eric Westhof determined the structure of the tRNA. The positions of the magnesium ions observed in the crystal lattice correspond very well to the predicted regions of high negative-charge density. Using Brownian dynamics simulations, Eric Westhof has come to very similar conclusions; that is, the ions are attracted to specific sites by the electrostatic field of the RNA. These computational and experimental approaches suggest that a more irregular structure, and hence a more irregular charge density, causes the ions to occupy some sites more frequently than others. One also observes direct interactions between the ions and the RNA. An example of metal ion complexation in RNA is found in a crystal structure of a fragment of HIV genomic RNA. This structure was solved in Dumas's lab in Strasbourg. In this beautiful structure one can see metal ions that are well ordered in the crystal lattice. In solution, magnesium normally has six water ligands. In this structure, some of these water ligands have been displaced and replaced with phosphate oxygen atoms. Thus, there are several modes of interaction between counterions and RNA. There is a general mode, according to which the ions are attracted by the electrostatic field of the RNA but do not make specific bonding interactions with the RNA. Based on NMR experiments, these ions are thought to be very mobile; that is, they dynamically sample various positions. Occupancy of individual sites along the RNA depends on the electrostatic field, which is typically not uniform if the RNA has tertiary structure. There is also a specific binding mode in which the metal ions coordinate with ligands donated by the nucleic acid. Certainly, only particular metal ions will be able to satisfy such coordination. Hence, these interactions occur not only at specific sites in the RNA, but also are more selective towards particular counterions.

Question: Would there be different folding with zinc? We don't know which metal ion is bonding to this particular RNA in vivo.
If their folding depends on the particular metal, what happens?

Response: Magnesium is one of the most abundant divalent ions in cells. Zinc atoms are found in lower concentration and are largely coordinated with proteins. The free concentration of zinc and other heavy metals is low. It is often assumed that the most available ion is magnesium. You are correct, however, that we do not have direct proof that the RNAs interact with magnesium in the cell. Another candidate counterion is calcium, which is abundant in cells and interacts strongly with RNA. Similar ions compete with one another for binding to nucleic acids. I imagine that ions diffuse in and out of sites, in continuous competition with one another. If a particular ion makes very nice coordination geometry with low free energy, it will persist for a long period of time at that site and not be readily displaced by another ion.

Question/comment: If you compare magnesium and manganese, you will find that the latter does not bind very tightly to oxygen atoms; it prefers sulfur or nitrogen. If you look at those two sites, you will see that one of them will always be preferred by magnesium, but not by manganese. In this case, zinc will not even bind, because it prefers a tetrahedral coordination. So it appears that the chemistry, the coordination geometry, and also water exchange (which is another important parameter, since the energy of hydration is very high for magnesium, very low for calcium, and still lower for potassium) are all involved in selecting the ions. I don't know whether zinc is able to enter this type of state, because it has catalytic action; it promotes hydrolysis because of its low pKa. There are different kinds of effects. You can observe different types of folding with cadmium, which is closer because it is also tetrahedral.

Response: Yes, but to get the RNA to fold in a coherent way, metal ions must satisfy their sites and the RNA-RNA interactions must be set up correctly. You could probably force binding of a tetrahedrally coordinated ion, but that would disrupt the orientation of the hydrogen-bonds, so the RNA wouldn't fold.

Question: Isn't calcium likely to replace magnesium? You should not have calcium at the specific site because calcium is a different ion; it's larger and has different ligand-binding kinetics. Has anyone tried beryllium?

Response: We considered trying beryllium, until we found out how poisonous it was. We decided that the risk it posed to our health did not merit the results we might obtain with it.

Comment: It's actually very bad for RNA; it's too small.

Response: In fact, lithium is OK.
Comment: Yes, but beryllium has such a high charge-density that you will drive the RNA into a wrong conformation. It also inhibits catalysis. That's why it's poisonous. It's a very strong inhibitor of RNA catalysts; it will inactivate polymerase - and any enzyme - because it binds instead of magnesium.
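The Manning-Record condensation theory discussed earlier makes the valence dependence quantitative. A sketch, assuming a Bjerrum length of about 7.1 Å for water at 25°C and an axial charge spacing of about 1.7 Å for a double helix (both illustrative inputs, not values from the talk):

```python
def charge_neutralized(z, b_angstrom=1.7, l_bjerrum=7.1):
    """Fraction of polyion charge neutralized by condensed z-valent
    counterions in Manning theory: theta = 1 - 1/(z*xi), with xi = l_B/b."""
    xi = l_bjerrum / b_angstrom  # Manning parameter; condensation occurs when z*xi > 1
    return 1.0 - 1.0 / (z * xi)

for z in (1, 2, 3):
    print(f"z = {z}: ~{100 * charge_neutralized(z):.0f}% of the RNA charge neutralized")
```

With these inputs, monovalent ions neutralize roughly three quarters of the charge, matching the 75-80% figure quoted later in the talk; trivalent ions come out around 92%, close to the ~95% quoted for multivalent ions (simple Manning theory omits ion-ion correlations, which push the multivalent figure higher).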
[Figure 10 panel labels: monovalent metal ions and polyamines; multivalent metal ions; intermediates → native.]
Figure 10. Model for counterion-induced collapse and folding of RNA. Counterions condense around the negatively charged RNA, causing the RNA to contract. This is followed by slower rearrangements in the RNA structure, eventually leading to native-like intermediates. The RNA must bind some Mg2+ to form the fully native structure. Reorganization of intermediates is slower in multivalent ions, because the intermediates are more stable. Reprinted with permission.
We used the native gel experiments I introduced earlier to look at the capacity of different kinds of ions to drive the initial collapse of the RNA [Fig. 10]. We pre-incubate the RNA in a counterion of our choice, then load it onto a gel that contains magnesium. The magnesium in the gel stabilizes the native structure of the RNA. We observe that most or all of the ribozyme appears in the native form following pre-incubation with magnesium. If it is not pre-incubated with magnesium, we observe a smear of intermediates. The same is observed if the RNA is pre-incubated with sodium chloride instead of with magnesium chloride; the RNA appears in its native form. We infer that the RNA forms structures that are sufficiently close to the native state for it to bind the magnesium ion, and folds during the short period when it enters the polyacrylamide matrix in electrophoresis. This period is well under 30 seconds, but sufficient for the RNA to reach the native structure under these conditions.

Question: If you try to add sodium instead of magnesium, do you also stabilize the native RNA?
Response: No, because sodium alone does not stabilize the native tertiary structure sufficiently well to trap it within the gel. Calcium works to some extent, but sodium is not effective.

Question: If you don't add magnesium, do you see anything, or just the intermediate smear?
Response: If there is no magnesium; we just see the intermediates. Question: Why won 7 it fold inside the gel? Is the electrophoresis long enough? Response: Yes, four hours; presumably at low-voltage, and there's 10 millimolar magnesium, which runs through the gel. It appears that the RNA is unable to refold in the gel. The reason for this is not well understood, although there have been efforts to develop physical models that better describe what's happening during electrophoresis. First, the magnesium ions and the low temperature (<10°C) slow rearrangement of RNA structures. Second, the RNA molecules are just about the same dimensions as the pores of the polyacrylamide matrix. Presumably, partial unfolding or extension of the RNA chain is somewhat inhibited by the fact that it is stuck in the pores of the gel. Reorientation of the chain is also limited by biased movement toward the positive electrode. If the pores were huge compared to the size of the molecules, the RNA would fold as if it were in solution. If the pores were tiny, then the RNA would have to unfold a bit to extrude its way through the pores. This phenomenon is often called a "caging" effect. The result is that the RNA does not refold significantly once it enters the gel. Question: How long does it take for the sample to enter the gel, compared to the time for magnesium in solution to accelerate the folding? Response: The ions interact with the nucleic acid at the diffusion-controlled limit. At our concentrations, the time-scale of diffusion is several microseconds. However, the folding process takes minutes for this particular RNA, because folding to the native structure requires re-ordering of meta-stable intermediates. Hence, the folding process takes minutes, whereas the RNA enters the gel in 5 to 10 seconds. An RNA that can fold in 5 seconds always appears fully folded in these kinds of
122
S. Woodson
experiments, because there is sufficient time for it to fold as the sample enters the gel. For these RNAs, the folding kinetics cannot be resolved by this method.

Question: What about doing the experiment in a capillary tube?

Response: In principle, that might work very well, and it might possibly be a little more efficient. We tested a variety of metal ions in our gel assay. All the positively charged ions tested induced a conformational transition in the RNA. Of course, the extent of this transition depends on the ion concentration. We found the concentration dependence of each folding transition to be a function of the valence of the ion. Nearly molar quantities of monovalent ions are required to drive folding, while millimolar concentrations of divalent ions (magnesium, barium) and micromolar concentrations of trivalent ions (cobalt hexammine and spermidine) suffice. These results agreed extremely well with predictions from simple counterion condensation theory, which states that multivalent ions are more strongly condensed around the nucleic acid. This is partly because the entropic penalty is less, since fewer ions need to be localized around the nucleic acid in order to reduce the negative charge. After condensation has occurred, the resulting net charge on the polymer is lower; it is closer to neutral in multivalent ions than in monovalent ions. These predictions are supported by experiments on DNA condensation, which is also driven by cations. It is estimated that with multivalent ions the residual charge on the RNA is reduced by 95%. In monovalent ions it is reduced by 75-80%. Counterion-induced collapse of the RNA also depends on the dimensions of the ion. Spermidine is a polyamine with three monovalent charges distributed over a diameter of 13 Å. In contrast, the positive charge of cobalt hexammine (~6 Å) is localized around the metal center.
A higher concentration of spermidine (55 µM) is required to fold the RNA than of cobalt hexammine (12 µM). The role of ion size has been further tested using a series of polyamines. Here we compare three polyamines of different charge. The ability of these polyamines to drive this transition depends on the total charge of the polyamine. Putrescine, which has only a 2+ charge, is required in millimolar concentrations, whereas spermidine (3+) functions in the micromolar concentration range. Spermine (4+) operates in the sub-micromolar range. Therefore, more highly charged polyamines fold the RNA more efficiently. The spacing between the positively charged amine groups may also be varied. We tested a series of diamines with various numbers of carbon atoms (2, 3, 4, or 5)
Experimental Approaches to RNA Folding
123
between the two amino groups. Both the cooperativity and midpoint of the RNA folding transition were found to depend on the distance between the positive charges. Pentanediamine is the largest ion, and the least effective. Ethanediamine is the most effective, and folding is the most cooperative. We are still in the early days with these studies, and it will be interesting to see if we can develop more quantitative models for describing these kinds of electrostatic interactions. One can think of this not only in terms of net electrostatic charge, but also in terms of a lattice of negative and positive charges. If we force a change in the spacing between the positive charges, it clearly has a large effect in our experiments. With pentanediamine, we obviously do not achieve a positive charge density sufficient to drive the folding transition to completion.
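The counterion-condensation argument above can be illustrated with a back-of-the-envelope Manning calculation. The Bjerrum length and axial charge spacing used below are common textbook values, not numbers from the talk; this is a minimal sketch, not a quantitative model of the ribozyme:

```python
# Manning counterion condensation: condensed ions neutralize a linear
# polyelectrolyte until the residual charge fraction per phosphate is
# 1/(Z*xi), where Z is the counterion valence and xi = l_B / b is the
# Manning parameter (Bjerrum length over axial charge spacing).
# Assumed illustrative values: l_B ~ 7.1 A for water at 25 C,
# b ~ 1.7 A per unit charge for a nucleic acid duplex.

L_BJERRUM = 7.1  # Angstrom
B_SPACING = 1.7  # Angstrom per unit charge

def residual_charge_fraction(valence, l_b=L_BJERRUM, b=B_SPACING):
    """Fraction of the bare phosphate charge left after condensation."""
    xi = l_b / b  # Manning parameter
    return min(1.0, 1.0 / (valence * xi))

for z in (1, 2, 3):
    neutralized = 100.0 * (1.0 - residual_charge_fraction(z))
    print(f"Z = {z}+: ~{neutralized:.0f}% of the RNA charge neutralized")
```

With these inputs, monovalent ions neutralize roughly three quarters of the charge, in line with the 75-80% quoted above, while trivalent ions approach the ~95% figure.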
Summary

To reiterate the framework in which we are considering the RNA-folding problem: We hypothesize that collapse of the RNA chain is driven by counterions. For the large RNAs, collapse tends to be non-specific and produces a family of intermediates, some of which eventually acquire large numbers of native interactions. Native interactions lower their free energy and enable the intermediates to persist for long periods of time. Since the intermediates are misfolded in some way, they must refold in order to reach the native state. The time-scale of this re-ordering depends on the free energy of the intermediates. The folding kinetics are more rapid in monovalent ions, since these ions are less effective in stabilizing the RNA-RNA interactions. The kinetics become very slow in multivalent ions, which greatly stabilize the RNA-RNA interactions. Both the overall kinetics of self-assembly and the stability of the final structure are very sensitive to the nature of the counterions present. Another concept that we barely touched upon is that the possibility of nucleating the native structure should be sensitive to the primary sequence of the RNA. We think that this has to do with the balance between long- and short-range interactions that must be established within the folded RNA. We have begun experiments that address the importance of the topology of the native structures. (We may hear more about this tomorrow, since this question has been extensively studied in the context of protein folding.) If we change the topology of the native structure by circular permutation of the RNA sequence, we do in fact see that both the structure and stability of the intermediates change quite dramatically. This alters the overall folding kinetics.
Question: What about the role of modified bases in the folding pathways?

Response: In fact, very little is known about that. Investigators have begun to tease out the role of modified bases in stabilizing the native structure. It is clear that they do favor the native tertiary conformation of the RNA in many cases, so they are far from being unimportant. But I am not aware that their role in folding kinetics has been addressed yet.

Question: In your studies, have you addressed the role of either single-stranded RNA-binding proteins or RNA helicases in accelerating these transitions?

Response: We've tried every protein published in the literature said to have these chaperone activities, with no success. This is not to say that they are unimportant, but rather that either the particular requirements of the ribozyme sequence make them unsuitable, or we have just not yet happened upon the right in vitro conditions. My personal feeling is that the primary biological function of many of these proteins is not that of a generic RNA chaperone, in the way that GroE acts in E. coli to rescue misfolded proteins.

Question: In your studies, you have concentrated on ribosomal RNA and tRNA. On the other hand, messenger RNA could also be interesting from a dynamic point of view. Is this being studied?

Response: A little work has been done: Several laboratories have looked at the structures that form in regulatory regions of prokaryotic mRNAs, and in bacteriophages and viruses. They do see metastable intermediates in the translational control regions that have quite complex structures. These structures are embedded in the middle of the messenger RNA. There is very nice genetic information indicating that folding intermediates play a key role by permitting translation initiation for only a short period of time after the RNA is transcribed. The initial metastable structure of the RNA is accessible to ribosomes.
After a while, the RNA refolds into a structure that is inaccessible to ribosomes, and translation is repressed. Very little has yet been done to address this problem in the processing of eukaryotic pre-messenger RNA.
References

1. Crothers, D.M. (2001) in RNA (Söll, D., Nishimura, S., and Moore, P., Eds.) pp 61-70, Elsevier, Oxford, UK.
2. Kim, S.H., Quigley, G.J., Suddath, F.L., McPherson, A., Sneden, D., Kim, J.J., Weinzierl, J., and Rich, A. (1973) Science 179, 285-288.
3. Hagerman, P.J. (1997) Annu Rev Biophys Biomol Struct 26, 139-156.
4. Friederich, M.W., Vacano, E., and Hagerman, P.J. (1998) Proc Natl Acad Sci USA 95, 3572-3577.
5. Leontis, N.B., and Westhof, E. (1998) J Mol Biol 283, 571-583.
6. Porschke, D., and Eigen, M. (1971) J Mol Biol 62, 361-381.
7. Craig, M.E., Crothers, D.M., and Doty, P. (1971) J Mol Biol 62, 383-401.
8. Coutts, S.M., Gangloff, J., and Dirheimer, G. (1974) Biochemistry 13, 3938-3948.
9. Dourlent, M., Thrierr, J.C., Brun, F., and Leng, M. (1970) Biochem Biophys Res Commun 41, 1590-1596.
10. Crothers, D.M., Cole, P.E., Hilbers, C.W., and Shulman, R.G. (1974) J Mol Biol 87, 63-88.
11. Lynch, D.C., and Schimmel, P.R. (1974) Biochemistry 13, 1841-1852.
12. Cech, T.R. (1993) in The RNA World (Gesteland, R.F., and Atkins, J.F., Eds.) pp 239-269, Cold Spring Harbor Laboratory Press, Plainview, New York.
13. Michel, F., and Westhof, E. (1990) J Mol Biol 216, 585-610.
14. Golden, B.L., Gooding, A.R., Podell, E.R., and Cech, T.R. (1998) Science 282, 259-264.
15. Zarrinkar, P.P., and Williamson, J.R. (1994) Science 265, 918-924.
16. Murphy, F.L., and Cech, T.R. (1993) Biochemistry 32, 5291-5300.
17. Sclavi, B., Woodson, S., Sullivan, M., Chance, M., and Brenowitz, M. (1998) Methods Enzymol 295, 379-402.
18. Sclavi, B., Sullivan, M., Chance, M.R., Brenowitz, M., and Woodson, S.A. (1998) Science 279, 1940-1943.
19. Deras, M.L., Brenowitz, M., Ralston, C.Y., Chance, M.R., and Woodson, S.A. (2000) Biochemistry 39, 10975-10985.
20. Cate, J.H., Gooding, A.R., Podell, E., Zhou, K., Golden, B.L., Kundrot, C.E., Cech, T.R., and Doudna, J.A. (1996) Science 273, 1678-1685.
21. Cate, J.H., Hanna, R.L., and Doudna, J.A. (1997) Nat Struct Biol 4, 553-558.
22. Pan, J., Thirumalai, D., and Woodson, S.A. (1997) J Mol Biol 273, 7-13.
23. Emerick, V.L., and Woodson, S.A. (1994) Proc Natl Acad Sci USA 91, 9675-9679.
24. Pan, J., and Woodson, S.A. (1998) J Mol Biol 280, 597-609.
25. Pan, J., Deras, M.L., and Woodson, S.A. (2000) J Mol Biol 296, 133-144.
26. Zhuang, X., Bartley, L.E., Babcock, H.P., Russell, R., Ha, T., Herschlag, D., and Chu, S. (2000) Science 288, 2048-2051.
27. Pan, T., and Sosnick, T.R. (1997) Nat Struct Biol 4, 931-938.
28. Pan, T., Fang, X., and Sosnick, T. (1999) J Mol Biol 286, 721-731.
29. Fang, X., Pan, T., and Sosnick, T.R. (1999) Biochemistry 38, 16840-16846.
30. Fang, X.W., Pan, T., and Sosnick, T.R. (1999) Nat Struct Biol 6, 1091-1095.
31. Ikawa, Y., Shiraishi, H., and Inoue, T. (2000) Nat Struct Biol 7, 1032-1035.
32. Brehm, S.L., and Cech, T.R. (1983) Biochemistry 22, 2390-2397.
33. Nikolcheva, T., and Woodson, S.A. (1999) J Mol Biol 292, 557-567.
34. Misra, V.K., and Draper, D.E. (1998) Biopolymers 48, 113-135.
35. Brion, P., and Westhof, E. (1997) Annu Rev Biophys Biomol Struct 26, 113-137.
36. Manning, G.S. (1977) Biophys Chem 7, 189-192.
37. Misra, V.K., and Draper, D.E. (1999) J Mol Biol 294, 1135-1147.
38. Misra, V.K., and Draper, D.E. (2002) J Mol Biol 317, 507-521.
39. Hermann, T., and Westhof, E. (1998) Structure 6, 1303-1314.
40. Ennifar, E., Yusupov, M., Walter, P., Marquet, R., Ehresmann, B., Ehresmann, C., and Dumas, P. (1999) Structure Fold Des 7, 1439-1449.
41. Heilman-Miller, S.L., Thirumalai, D., and Woodson, S.A. (2001) J Mol Biol 306, 1157-1166.
42. Bloomfield, V.A., Crothers, D.M., and Tinoco, I., Jr. (2000) Nucleic Acids: Structures, Properties, and Functions, University Science Books, Sausalito, CA.
43. Heilman-Miller, S.L., Pan, J., Thirumalai, D., and Woodson, S.A. (2001) J Mol Biol 309, 57-68.
44. Ma, C.K., Kolesnikow, T., Rayner, J.C., Simons, E.L., Yim, H., and Simons, R.W. (1994) Mol Microbiol 14, 1033-1047.
45. Poot, R.A., Tsareva, N.V., Boni, I.V., and van Duin, J. (1997) Proc Natl Acad Sci USA 94, 10110-10115.
46. Damberger, S.H., and Gutell, R.R. (1994) Nucleic Acids Res 22, 3508-3510.
47. Lehnert, V., Jaeger, L., Michel, F., and Westhof, E. (1996) Chem Biol 3, 993-1009.
48. Russell, R., Millett, I.S., Doniach, S., and Herschlag, D. (2000) Nat Struct Biol 7, 367-370.
49. Woodson, S.A. (2000) Nat Struct Biol 7, 349-352.
SOME QUESTIONS CONCERNING RNA FOLDING

FRANCOIS MICHEL
Centre de Genetique Moleculaire-CNRS, Gif-sur-Yvette, France
I am both moved and embarrassed to have been invited to participate in this conference. The IHES is a very special place for me, because my father, Louis Michel, worked here for more than thirty years, until his death in fact, not quite two years ago. Sadly, I never visited the IHES while he was here - for no good reason except perhaps that I am one of the four of his six children who became scientists: My brother is a mathematician, one of my sisters is a climatologist, and another is a linguist specializing in Paleo-Babylonian. Each of us was very careful not to interfere with the professional lives of the others. I am also quite embarrassed. As far as I have been able to gather, I am supposed to talk to you about RNA folding. Although I have been interested in the structure and architecture of RNA for a number of years, the period during which I specifically worked on the RNA folding process was rather brief, and a number of years ago. So anything I can show you today is rather outdated. Still, while listening to the speakers during the first day of this meeting, I kept pondering the fact that we all have somewhat different perspectives on the same topic. Therefore, I will try to describe our own RNA-folding perspective and working models, contrasting them with the experiences of other labs working in the field. In so doing, I hope to stimulate discussion. When I say we, I should stress that essentially everything I am going to tell you about is part of a long-standing collaboration with Eric Westhof, in Strasbourg. We shared graduate students, first Luc Jaeger, then Philippe Brion; Maria Costa also did some of the work in my lab at Gif-sur-Yvette. I used the word questions in the title of my talk. From what I just said, you may have inferred that the first question might be "How did I lose interest in RNA folding?" In fact, that was not quite the case. The question should rather be, "How did I become interested in the RNA folding process to begin with?"
Let's go back some twelve years. Eric Westhof and I were in the process of proposing a three-dimensional model of the Group I intron ribozyme component. This complex molecular model involved a fair number of novel interactions; that is, interactions which had not yet been observed in crystals. Some of this seemed to be a pretty
bold move. We were interested in proving experimentally that such contacts were real. We needed an experimental system, and I got hold of some Group I RNA molecules. Like everyone else in the field, I was stuck with misfolding problems. What we wanted to do was compare the original, 'wild-type' molecule with a number of base-substituted molecules. We were monitoring "activity," whatever that meant, in order to have some indirect measure of the state of the molecules. Instead, what we got were RNA samples that would react only halfway, or even less, and for some mutants, not at all. Much worse, the results would change from one batch to another and from one experiment to the next. We found it very difficult to get anything reproducible. I must remind you that RNA is normally prepared on denaturing gels and must be renatured before it can be studied. Everyone agrees on this. Our first reaction to the lack of reproducibility was to erase history: to start every experiment by heating all samples in water at 90°C. We hoped the molecules would fully unfold under these conditions. After that, we would struggle to devise an efficient renaturation protocol. There were plenty of such protocols, and there still are many in the RNA literature. They invariably read like cookbook recipes: Bring your sample to 55°C, add some salt, cool slowly to reaction temperature, add more salt... We played the game and, like others, soon found that there were optimal salt concentrations, that adding too much or too little magnesium would not do the job, that there were optimal temperatures, and so on. Still, whatever the protocol, we found it difficult to get 100% out of the molecules, which was what we were hoping for. Once a preparation had been treated and found to comprise a large fraction of inactive individuals, the question was what to do with them. Materials were often costly, in terms of money and work.
Rather than throw them away, we were inclined to try to increase the extent of renaturation. An obvious way to do this was to reheat the sample to a temperature at which inter-conversion between the metastable state or states (in which the molecules had become trapped) and the active, native fold might occur. In doing so, we soon found that heating the samples to a temperature between 50 and 55°C at 5 mM magnesium and 50 mM ammonium concentrations would do at least part of the job, considerably improving the extent of renaturation, and thus the fraction of correctly folded molecules. Based on this observation, we eventually became curious about the renaturation phenomenon itself. In succeeding experiments, we decided to monitor the course of renaturation within the same temperature range by adding the guanosine cofactor of the Group I self-splicing reaction to aliquots drawn at fixed time intervals. One nice
thing about Group I introns is that you can completely uncouple folding from the reaction.
Figure 1. Renaturation kinetics of a Group I intron ribozyme (the 321-nucleotide sunY molecule) at 55°C. A) Biphasic kinetics of reaction of a sunY (SYT) precursor molecule containing the intron and 5' exon (redrawn from Jaeger et al., 1993). After pre-incubation for at least 20 min, the reaction is triggered by addition of GTP at time zero. The slow phase was monitored by determining the fraction of unreacted precursor molecules contained in aliquots of the solution. The existence of a fast phase (dotted line) is inferred from the sharp drop in the fraction of unreacted precursor between 0 and 2 min. B) Data interpretation: Molecules correctly folded at 55°C react irreversibly within a few seconds. In contrast, folding (and unfolding) of this large RNA molecule is a slow, reversible process. Middle: Three-dimensional model of the active form of the sunY ribozyme. Only the backbone of the molecule is shown, except for two tertiary interactions that bring separate folding domains of the ribozyme together. Dotted lines connect phosphate groups that face one another in double-stranded helices (drawing based on Jaeger et al., 1993).
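The working model behind Figure 1 amounts to a simple two-exponential decay: molecules that are already folded react within seconds of GTP addition, while the rest must first refold, so the slow phase reports the renaturation rate. The rate constants and amplitudes below are arbitrary illustrative numbers, not fitted values from the experiments:

```python
import math

def fraction_unreacted(t, f_folded, k_react, k_refold):
    """Two-phase sketch: a fraction f_folded reacts fast (rate k_react,
    seconds timescale) once GTP is added; the remaining molecules react
    only after slow refolding (rate k_refold, minutes timescale), which
    is rate-limiting for them."""
    fast = f_folded * math.exp(-k_react * t)
    slow = (1.0 - f_folded) * math.exp(-k_refold * t)
    return fast + slow

# Illustrative parameters: 40% initially folded, reaction in ~1 s,
# refolding on a ~10-minute timescale (all rates per second).
F0, K_REACT, K_REFOLD = 0.4, 1.0, 0.002

# After the fast burst (say t = 30 s) only the slow phase remains;
# extrapolating it back to t = 0 recovers the unfolded fraction 1 - F0.
slow_amplitude = fraction_unreacted(30, F0, K_REACT, K_REFOLD) * math.exp(K_REFOLD * 30)
```

Back-extrapolating the slow phase to time zero is how the fraction of initially folded molecules can be estimated from such biphasic data.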
These molecules will not react until you add this very small molecule, the so-called guanosine cofactor, which one hopes will not perturb folding. So we took aliquots, added guanosine, and obtained strikingly biphasic reaction kinetics. There was a very fast phase with a rate-constant (or constants) typically on the order of seconds, during which part of the sample would react, followed by a much slower phase. Our working model, which proved to be correct, predicted that the fast phase corresponded to correctly folded molecules. At 50 to 55°C and neutral pH, the reaction should be very fast indeed. Then there was the slow phase, which could take minutes or hours. What might the underlying phenomenon be? Well, we knew that within the chosen temperature range, renaturation was an ongoing process. Any molecule that managed to refold into an active form would almost instantly react with the guanosine cofactor, which was still around. Thus, assuming our working model to be correct, our experimental procedure made it possible to estimate both the fraction of molecules that were folded at time zero (addition of guanosine) and the subsequent renaturation rate at the temperature we had chosen [Fig. 1]. Now to discuss some of the data. One of the first things we did was change the time interval during which the sample was pre-incubated at the chosen temperature prior to addition of guanosine, and again measure the fraction of initially active molecules. By doing so, we found that equilibrium was reached very quickly, typically in a matter of 15 minutes or so. Nothing further happened after that.

Question: How do you really know that your molecules are correctly folded before you add GTP?

Response: In these experiments, only the addition of GTP would tell you that. Later however, I will describe data from another source that corroborate our views - namely, that guanosine or GTP does not significantly contribute to folding. It is a working model at this stage.

Question: You evaluated folding based on the appearance of an activity that can be seen only in the presence of GTP.
How do you know you have correctly folded molecules?

Response: As I already said, guanosine is a very small molecule compared to a Group I intron, which is 300 or 400 nucleotides long. By itself, GTP is not supposed to have major effects on folding. In fact, Group I introns will react in the absence of GTP; the rate of reaction is slower, or you need a higher pH, but they do react by hydrolysis. It is well recognized in the field that there is no major structural change in the presence of GTP.
Comment: I think you have to say that the GTP needs that very precise binding pocket to start the catalysis. That is how you know the RNA is correctly folded. To do catalysis for ribozymic action, you must have a very highly 3-D structured molecule. It must be properly organized, so that when you add the GTP it goes directly to this binding site, where the chemistry is very fast. The RNA must be properly folded to start the reaction, which is shown in the first step.

Response: All the other data available excluded major effects on the structure of the molecule. I mentioned that the same type of reaction may occur in the absence of the guanosine cofactor, on a different timescale, or perhaps at a slightly different pH. Basically, the molecule does not need guanosine for activity. The point is that addition of guanosine beautifully triggers a reaction that is very fast, compared to renaturation.

Comment: Whether induced fit could occur in RNA is a very interesting question. I think this is not seen with RNA molecules; there is no evidence for major induced fit at the very local level. You may have some minor rearrangements, but it is not the same as for proteins; the amount of induced fit for highly structured RNA molecules such as those you are talking about is much lower.

Typically, the fraction of - let us call them "initially correctly folded molecules" - would not change after 15 minutes. We also checked whether samples that had been completely renatured or completely denatured prior to the experiment would reach the same equilibrium value. The same fraction of "initially folded molecules" was attained after 15 minutes of preincubation at the chosen temperature. So this was a true conformational equilibrium between folded and unfolded states. Once we knew for sure that we were looking at an equilibrium process, the next thing we did was change the temperature.
We found the fraction of folded molecules to be very sensitive to temperature at around 55°C [Fig. 2]. When we plotted measurements as a function of temperature - the ordinate is now the fraction of initially inactive molecules - we observed a very sharp transition within a very few degrees, with an enthalpy we estimated to be between 150 and 200 kcal/mol. Recall that our initial aim was to compare the original, 'wild-type' molecule with base-substituted 'mutants'. How would the mutant RNAs behave when subjected to our experimental protocol? The answer is that the mutant molecules would also reach conformational equilibrium, the only significant difference being that the transition between initially active and initially inactive samples was
invariably shifted to lower temperatures with all the base-substituted molecules we investigated.
Figure 2. Effects of temperature on the kinetics and extent of renaturation of sunY molecules (redrawn from Jaeger et al., 1993). A. Renaturation course of sunY (SYC) precursor molecules at 53.5°C (open squares), 55°C (filled squares), and 56.5°C (open circles). B. Fraction of initially inactive molecules as a function of temperature. Filled squares: samples were completely inactivated by heating at 65°C prior to a 15-min preincubation at the chosen temperature. Open squares: samples were renatured and checked to be at least 90% active prior to preincubation at the chosen temperature. C. van't Hoff representation of the data in panel B. Keq = (1-f)/f, where f is the fraction of initially inactive molecules. ΔH° = 156 kcal/mol; ΔS° = 474 e.u./mol.
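The van't Hoff parameters quoted in the Figure 2 caption are enough to reproduce the sharpness of the transition numerically. This sketch simply evaluates the two-state equilibrium K(T) = exp(-ΔG°/RT) for unfolding, using the 156 kcal/mol and 474 e.u. values from the caption:

```python
import math

R = 1.987e-3   # gas constant, kcal / (mol K)
DH = 156.0     # unfolding enthalpy, kcal/mol (from the caption)
DS = 0.474     # unfolding entropy, kcal/(mol K), i.e. 474 e.u.

def fraction_unfolded(t_celsius):
    """Two-state model: f = K/(1+K) with K = exp(-(DH - T*DS)/(R*T))."""
    t = t_celsius + 273.15
    k_unfold = math.exp(-(DH - t * DS) / (R * t))
    return k_unfold / (1.0 + k_unfold)

# Midpoint of the transition: the temperature where DG = 0, f = 0.5.
tm_celsius = DH / DS - 273.15
```

With these numbers the midpoint falls at about 56°C, and the population shifts from mostly folded at 50°C to mostly unfolded at 60°C, reproducing the very sharp transition of panel B.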
In this example [Fig. 3], the interaction being investigated was a Watson-Crick G:C base-pair. One aim of the experiment was to prove the existence of that pair; therefore we first built what we call 'single mutants' (G:U and A:C combinations) before generating the A:U double-mutant combination. The important point to note
is that A:U, although less stable than G:C, appears to be more stable than G:U and A:C, the second mutation partly compensating for the first one. In fact, the order of stabilities - G:C > A:U > G:U > A:C - is exactly what you would expect when inserting the same base combinations within an extended double-stranded helix. However, in this particular case, it is not one Watson-Crick base-pair within a helix, but rather an isolated interaction, an isolated base-pair that most people would consider to be part of the tertiary structure of the molecule, rather than of the secondary structure. There was no point in testing the secondary structure of Group I introns, which at the time had already been very well established by a number of methods, including comparative sequence analysis.
Figure 3. Fraction of initially inactive molecules as a function of temperature for sunY (SYT) transcripts with various base substitutions in the P9.0a (176-1028) base-pair (redrawn from Jaeger et al., 1993).
Suppose that base-pair had been part of an extended helix, say one with ten base-pairs. Rather than resorting to the complex kinetic analysis I just described, and to which I will return, we would have generated what is known as an 'optical melting curve' to prove the existence of the base-pair. The experiment is quite simple; it consists of watching the changing UV absorbance of an RNA (or DNA) solution at 260 nanometers as that solution is slowly heated. When the temperature range within which the nucleotide bases begin to unstack is reached, there is a sharp rise in absorbance. In practice, the transition curve between order and disorder is described by its TM, the temperature at which half the bases are unstacked (if the
134
F. Michel
unfolding process is all-or-none, this is the temperature at which half the molecules are unfolded), and its 'cooperativity' (maximal slope), from which the enthalpy associated with the process may be estimated.
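The melting-curve analysis just described can be mimicked numerically: build a two-state absorbance curve, differentiate it, and read the TM off the peak of dA/dT. The baseline absorbances and thermodynamic values here are invented for illustration, not taken from the experiments:

```python
import math

R_GAS = 1.987e-3               # kcal / (mol K)
DH_MELT, DS_MELT = 100.0, 0.30  # illustrative unfolding enthalpy (kcal/mol)
                                # and entropy (kcal/(mol K))
A_FOLDED, A_UNFOLDED = 0.60, 0.80  # hypothetical 260-nm baselines

def absorbance(t_celsius):
    """Two-state melting: absorbance tracks the unfolded fraction."""
    t = t_celsius + 273.15
    k = math.exp(-(DH_MELT - t * DS_MELT) / (R_GAS * t))
    f_unfolded = k / (1.0 + k)
    return A_FOLDED + (A_UNFOLDED - A_FOLDED) * f_unfolded

# Numerical derivative dA/dT on a fine grid; its maximum marks the TM.
temps = [20 + 0.05 * i for i in range(1600)]  # 20 to ~100 C
dAdT = [(absorbance(t + 0.01) - absorbance(t - 0.01)) / 0.02 for t in temps]
tm_from_curve = temps[dAdT.index(max(dAdT))]
```

The peak of the derivative representation lands at the analytic midpoint ΔH/ΔS, which is why the dA/dT curves in the figures below make small thermal shifts between mutants so easy to read.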
Figure 4. Optical melting curves of sunY (SYT) precursor transcript samples in standard (absorbance at 260 nm as a function of temperature) and derivative (dA/dT) representations. Monitoring the effects of a G-to-A mutation in the P9.0a base-pair (redrawn from Jaeger et al., 1993).
Why not do the same thing with a solution of Group I intron RNA? At first, we were reluctant to try it, essentially for two reasons. One was that it was not obvious to us that unfolding tertiary structure, that is, three-dimensional structure, would generate a measurable rise in absorbance (remember that the interactions we wanted to investigate at the time were tertiary interactions.) In fact, we were wrong. Had we looked at phenylalanine tRNA, of which there were crystals at the time, we would have noted that even though the molecule comprises only 22 canonical (Watson-Crick and G:U) base-pairs, 71 out of the 76 nucleotide bases are stacked. In fact, when you melt a tRNA molecule, you do detect a signal that corresponds to the unfolding of the entire three-dimensional structure.
The second reason was that we initially doubted that we could see a signal corresponding to the melting of a single base-pair in a three-hundred or four-hundred nucleotide molecule. We were wrong again, because as I already told you, the unfolding transition during which the individual base-pair is disrupted has very high enthalpy, about equivalent to that of a helix of, say, 25 base-pairs. As soon as we realized this, it became obvious that we should try to get optical melting curves. Obtaining optical melting curves was the work of Luc Jaeger. Fig. 4 shows a typical experiment that uses the wild-type molecule on one hand and one of the base-substituted RNAs I just described on the other. In the wild-type G:C melting curve, you clearly see an early transition over a temperature range that coincides with the one over which we had been observing conversion of initially active populations of molecules into initially inactive ones. As for the A:C mutant RNA, it also undergoes an early cooperative transition (this is especially clear in the derivative dA/dT representation), but that transition is shifted towards lower temperatures to precisely the same extent, compared to the wild-type, that we had previously observed in our kinetic analyses. In the RNA molecules whose optical melting curves are shown in Fig. 5, base substitutions have been made in a second base-pair (P7 bp 2) and combined or not with substitutions in the P9.0a pair (see Fig. 6 for a secondary structure diagram of the intron).
Figure 5. Optical melting curves of td intron molecules bearing various base substitutions, shown in lower-case (from the data of Brion et al., 1999).
Only the early melting range is affected by base substitutions. The thermal shift resulting from replacement of the original G:C pair by an A:C combination is about the same for P9.0a and P7 bp 2 interactions. At P7 bp 2, A:U is intermediate between G:C and A:C, just as was found for P9.0a. Finally, combining mutations in P9.0a and P7 bp 2 has additive effects. We are clearly dealing with the same unfolding transition.
Figure 6. Secondary structure diagram of the td intron, a close relative of the sunY intron. Arrows point to intron-exon junctions. The transcripts used to generate the curves in Fig. 5 were inactive because they lacked both junctions (the first 35 and last 5 intron nucleotides were missing). Bases that were mutated (see Figs. 5, 15, 18 and 19) are boxed.
One question you must have been asking yourself is whether any particular set of interactions could be associated with early melting. I already gave you a few hints, and it certainly makes sense that the three-dimensional fold of the molecules, which we call tertiary structure, should be the first to unwrap, and that the rest of the melting curve reflects the disruption of individual secondary
structure elements. However, there is something unsatisfactory in what I am just saying, because there does not seem to be any general agreement in the RNA field about what exactly should be called secondary structure. Some people call secondary structure the set of all canonical base-pairs (G:C, A:U, and G:U) with at least one neighboring canonical pair along the primary sequence. This means that you can have a number of intervening nucleotides on one side of the helix, but the pair must have a neighbor on the other side. That is one definition. Another definition is similar, except that it excludes all pseudoknots. Pseudoknots are interactions that interfere with a tree-like representation of the structure. When I say tree-like, I mean something like a natural tree, with branches that do not meet again once they diverge. I apologize to the mathematicians in the audience for not using the right terms. Any interaction between the loops tends to be called a pseudoknot, whatever that means. One reason for excluding pseudoknots from secondary structure was that until the recent work by Rivas and Sean Eddy (1999), dynamic programming algorithms of the type Michael Zuker is using in his FOLD program were not able to take pseudoknots into account. By the way, there exist some other programs for folding RNA; the Ninio-Dumas program, for instance (1982), which, although less efficient, was able to predict pseudoknots. Yet another problem in defining secondary versus tertiary structure is whether the loops should be considered part of secondary structure. I will return to this later. There are also some people who consider that tree-like representations of canonical base-pairs do not depict secondary structure; they complain about the use of terms that are derived from the protein world and would suggest that we rather call these representations stem-loop diagrams. Now, they are all very respected people, so whom do you believe?
My own inclination would be to ask nature, which is exactly what we did.

Comment: I would just like to make a comment that the field has now reached maturity, and is now in decline, when we argue about nomenclature.

Response: Thanks; that's how I feel. One thing I want to show you is that even the definition of a pseudoknot is somewhat problematic. Coming back to the structure of Group I introns [Fig. 6], it is clear that, when taken together, the P3 and P7 pairings are incompatible with a tree-like representation. But which of these pairings should you regard as part of secondary structure and which as part of tertiary structure? I have no idea, at least not from the definitions just given.
F. Michel
Comment: Whichever melts out first.
Response: That is exactly what I will try to show. As I was saying, we attempted to determine what exactly was going on during the early transition. But before I go into that, I need to stress that the situation I described for Group I introns is by no means unique. I'm sorry, but I have to introduce another type of catalytic RNA molecule. All you need to know about Group II introns is that they are the same as Group I introns, except they are totally unrelated to them and about twice as large, somewhere around 600 nucleotides. When you look at the melting profiles of a Group II intron, there is again an early transition followed by some quasi-continuous melting. As you might have expected from the larger size of the molecule, the early transition, with an enthalpy of about 300 kcal/mol, is even sharper than in Group I introns. And when you mutate the molecule, again, the transition usually does not vanish, but has simply shifted to lower temperatures. Thus, the problem was a general one, and to find out what was going on, we took the approach of chemically modifying bases, using dimethyl sulphate (DMS). At first you would think this would not work because any chemical reaction will shift the equilibrium between the folded and unfolded forms. In fact, the DMS molecule is fairly unstable in water, so the reaction time is very short, compared to the rates of folding and unfolding. If you use low concentrations of DMS and modify molecules, as Sarah Woodson describes, such that there is less than one modification per molecule, it looks like you have taken a photograph of the situation at the time you added DMS. I will now show you an experiment with a Group II intron. We chose to compare the accessibility of bases at two temperatures, one (42°C) that was just below the temperature at which melting begins and the other (50°C) slightly above the early melting range. What happens when you treat RNA with dimethyl sulphate? 
You modify bases, and the reaction we will look at is the modification of the N1 position of adenine, one of the groups on the Watson-Crick face of the base.

Question: What kind of change occurs at position N1?

Response: The N1 is methylated by the dimethyl sulphate, and you get a +1 charge on the N1. Everyone agrees that if the N1 position of the adenine is involved in pairing with another base or another group in the molecule, it normally will not become
methylated by dimethyl sulphate, whereas if there is no interaction, it will be accessible and react.
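The "less than one modification per molecule" condition mentioned a moment ago can be made quantitative with a Poisson model: if hits per molecule average m, the fraction of modified molecules carrying exactly one hit is m·e^(-m)/(1 - e^(-m)). A sketch with illustrative values of m (the numbers are not from the experiments described here):

```python
# Sketch (illustrative numbers): the "single-hit" condition for DMS
# probing. If hits per molecule are Poisson-distributed with mean m,
# the fraction of *modified* molecules carrying exactly one hit is
#   P(1 hit | >= 1 hit) = m * exp(-m) / (1 - exp(-m)).
import math

def single_hit_fraction(m):
    """Among molecules with at least one modification, the fraction
    with exactly one, under a Poisson model with mean m hits."""
    return m * math.exp(-m) / (1.0 - math.exp(-m))

# The smaller the mean number of hits, the closer the experiment is
# to true single-hit conditions:
for m in (0.1, 0.5, 1.0, 2.0):
    print(f"mean hits = {m:.1f} -> single-hit fraction = {single_hit_fraction(m):.2f}")
```

At a mean of 0.1 hits per molecule, about 95% of modified molecules carry a single hit, which is the regime in which the modification pattern behaves like a snapshot of individual structures.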
Figure 7. Dimethyl sulfate (DMS) modification of the Pl.LSU/2 group II intron at 50°C. 5' and 3' are intron-exon junctions. Roman numerals are used to designate the six separate domains of secondary structure, as predicted by comparative sequence analyses. Filled and empty arrows point to strongly and weakly reactive adenines, respectively. (Redrawn from Costa et al., 1998.)
The first thing we should look at is the methylation map of adenines at 50°C, which is just above the early transition temperature [Fig. 7]. Upon glancing at the secondary structure model, you will immediately notice, in domain I, that essentially none of the adenines in secondary-structure helices were affected (there is one exception), whereas all the adenines in so-called single-stranded loops were reactive.

Question: How do you know what is methylated and what is not?
Response: I will show you the data [Fig. 8]. This is a sequencing gel, in which you read the sequence of the molecule. The sequence was determined using an enzyme that will stop if there is methylation, so that you get a band. We are looking at the piece of RNA called IC1, and you see that the two adenines...
Figure 8. DMS modification of the Pl.LSU/2 group II intron RNA, experimental data and interpretation (data from the work of Costa et al., 1998). Top part: At the left are sequence ladders generated by the occasional incorporation of base-specific dideoxynucleotides by a reverse transcriptase polymerizing DNA from an intron RNA template. Right: polymerase extension without dideoxynucleotides on either unmodified template RNA (- lanes) or template RNA that was modified at 42°C or 50°C (two independent experiments were carried out in each case). Dark bands in DMS lanes indicate significant modification of the template RNA (the polymerase stops one nucleotide before the modified base).
Arrows point to the location of the θ' and ε' loops in this autoradiograph of a sequencing gel. Bottom part: At 50°C, all four adenines in the single-stranded loops of subdomain IC1 are modified by DMS, whereas at 42°C, the two adenines that are part of the terminal GUAA loop (known to interact with a specific receptor in domain II) are protected.
Question: Was the nucleic acid digested up to the methylation point?

Response: No; there is a polymerase going along the RNA, using it as a template, and it will stop whenever there is a methylation. This gel shows us that none of the adenines in the helices are methylated. But the four adenines in single-stranded loops give signals at 50°C. On the other hand, at 42°C, two adenines (in the θ' loop) still give a strong signal, whereas the remaining two (in the ε' loop) give only a barely detectable signal. Returning to the 50°C map [Fig. 7], it looks like the data provide an instant picture of those bases that are part, or not part, of what some (but not everyone) in the field would call the secondary structure of the molecule. But how do we know that this diagram represents the actual, physical secondary structure of this molecule? The structure shown here was derived by comparative sequence analysis, which is the easy and efficient way of inferring canonical base-pairs from sequence data. The problem, however, with comparative sequence analysis is that it provides you with statistical constraints that reflect selection pressures in nature. It tells you that two nucleotides are paired; that they form a Watson-Crick base-pair at some stage in nature. It does not tell you where this occurs. You might argue, for instance, that the base-pair is formed in a single-stranded DNA molecule that is somehow later converted into intron RNA, or vice versa, although in our particular case most people would be willing to bet that intron RNA is the source of the vast majority of statistical constraints in intron sequences. Much worse, comparative sequence analysis does not tell you that all base-pairs are formed simultaneously. Constraints could refer to different states of the same molecule. This happens, for instance, in a particle we call the spliceosome, which is responsible for removing introns from pre-messenger RNA in our own cells.
When you analyze spliceosomal RNAs by comparative sequence analysis, you end up with bases forming canonical pairs with several different partners. The interpretation is that they do form those base-pairs, but at different times during the lifetime of the spliceosome particle, which has been shown to undergo a number of rearrangements. Fortunately, with Group II introns, this does not seem to be the case, since according to comparative sequence analysis each base has either a single partner or none. That is why we are confident with this structure.
Structure #4: ΔG = -149.5 kcal/mol at 37°C (plot by D. Stewart and M. Zuker, © 2001 Washington University).

Figure 9. One suboptimal solution for the minimal free-energy folding of domains I-III of intron Pl.LSU/2, as computed on M. Zuker's web server. The five segments in black are incompatible with the secondary structure shown in Fig. 7.
By the way, something else we could have done is compare the structure generated by comparative sequence analysis with the predicted minimal free-energy folding. Since Michael Zuker described comparative sequence analysis, I thought it fair enough that I discuss some minimal free-energy foldings. So I logged onto Mike Zuker's web site, and here's what I got [Fig. 9]: a minimal free-energy folding for domains I, II, and III of the molecule that we modified with DMS. Segments in black are incompatible with the structure derived by comparative sequence analysis, and you can see that the number of disagreements is very small: five. The only problem is that this was not actually the absolute minimal free-energy folding. If I had taken the best, number 1, it would
have been much worse [Fig. 10]. Those foldings are very useful when you think you already know the solution.
Structure #1: ΔG = -150.3 kcal/mol at 37°C (plot by D. Stewart and M. Zuker, © 2001 Washington University).

Figure 10. The optimal solution for minimal free-energy folding of domains I-III of intron Pl.LSU/2. (Same as Fig. 9.)
Question: Does this solution take pseudoknots into account?

Response: No, no pseudoknots. This is the dynamic programming algorithm of Nussinov and Zuker.
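The exclusion of pseudoknots is precisely what makes the dynamic programming work: every subproblem is a contiguous interval of the sequence. A minimal sketch of the Nussinov-style recursion, which maximizes the number of nested canonical pairs (Zuker's program minimizes free energy instead, but the recursion skeleton is the same):

```python
# Sketch of the Nussinov dynamic-programming recursion: maximize the
# number of nested canonical pairs (G:C, A:U, G:U). Pseudoknots are
# excluded by construction, since every subproblem dp[i][j] covers a
# contiguous interval. Zuker-style folding replaces "count of pairs"
# with stacking and loop free energies, but shares this skeleton.

PAIRS = {("G", "C"), ("C", "G"), ("A", "U"),
         ("U", "A"), ("G", "U"), ("U", "G")}
MIN_LOOP = 3  # minimal number of unpaired bases in a hairpin loop

def nussinov_max_pairs(seq):
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                 # j left unpaired
            for k in range(i, j - MIN_LOOP):    # j paired with k
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]

print(nussinov_max_pairs("GGGAAAUCCC"))  # 3: a three-pair stem-loop
```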
Question: And what about the actual molecule; are there pseudoknots in the molecule itself?

Response: Yes, there are. I will discuss what happens to pseudoknots in a moment. The issue I am trying to bring up is whether pseudoknots melt with secondary structure or with tertiary structure.
Question: This methylation picture [Fig. 7] does not show the tree-like structure; that is, the actual pairing?

Comment: The methylation picture shows you which adenines are not paired. This is a map of the secondary structure that Dr. Michel has derived from comparative sequence analysis, and in that state, at 50°C, there are no pseudoknots, even in a natural molecule.

Response: And what I am trying to show is that this was not obvious in the first place. That is what I am describing.

Comment: Additionally, the experiments by Michael Zuker should also give the same results.

Response: In fact, free-energy calculations should predict precisely this same situation at 50°C. I did the calculation at 37°C and 55°C.

Question: Does Turner's lab have data at 50°C? I assume it is not possible to build it into the program if the data are not available. Otherwise it would be trivial to build it into the program.

Response: We believe that is part of the problem.

Comment: Turner's lab no longer provides enthalpies. If you want to do folding simulations at different temperatures, you have to use his older energy parameters, which are still available.

Response: That is what I have been using.

Comment: ...And it is really unfortunate, because in fact the Turner lab does have enthalpies for a variety of loops, but they do not have a full set. I have been arguing with Doug Turner to make at least a reasonable set of enthalpies available so that we can do folding simulations with the latest rules at altered temperatures, but this has not happened yet. The other thing I want to emphasize is that the computed free-energy difference between these two diagrams [Figs. 9 and 10] is quite small, only 0.8 kcal. Given the uncertainty in the sets of thermodynamic parameters being used, I think the difference is insignificant. What I want to emphasize is not so much the difference with the comparative sequence analysis output, but the extent of agreement.

We are rather confident that we can rely on the comparative sequence analysis approach; that we are in fact dealing with the actual secondary structure of the molecule. Let's go back to DMS modification. What was the situation at 42°C?
Figure 11. DMS modification of the Pl.LSU/2 group II intron at 42°C. Same as Fig. 7. (Redrawn from Costa et al., 1998.)
The answer, as shown in Fig. 11, is that only some of the adenines in the loops are modified, in contrast to what was found at 50°C. What we are interested in, of course, is the difference map [Fig. 12], which shows adenines that have changed state between 42°C and 50°C, during the initial unfolding transition. This map does not tell us much by itself, because chemical modification makes it possible to determine
whether or not a particular base has a partner, but not the identity of that partner. Fortunately, we knew far more about Group II introns than their mere secondary structure. When you superimpose on the difference map some known interactions that are not part of the tree-like secondary structure, it begins to make sense [Fig. 12]. For instance, the α sequence in a terminal loop pairs with the α' sequence in an internal loop; this is a beautiful example of a large pseudoknot, with over seven canonical base-pairs. As you can see, this pseudoknot is disrupted during the initial unfolding transition. If we are going to listen to nature, we should regard such pseudoknots as part of the tertiary rather than the secondary structure.

Question: What is the magnesium concentration?
Response: Five millimolar. We knew of other interactions in this molecule which, rather than large pseudoknots, were isolated Watson-Crick base-pairs. There is also another class of interactions shown here that involves GNRA loops and their RNA receptors. A GNRA loop is a four-nucleotide loop in which the first base is a G, the last base an A, the third base a purine (either G or A), and N is any nucleotide. The interactions between GNRA loops and RNA receptors were predicted by Eric Westhof and me at the time we built the model of Group I introns, and they have proven quite common in large self-folding RNAs. They are important for the assembly of the final three-dimensional fold. There are several examples of such interactions in this molecule [Fig. 12]. The GUAA loop is recognized by a helical receptor, and the GAAA loop is recognized by a somewhat larger receptor that is partly non-helical. We already knew this from other sets of data. As you can see, during the early unfolding transition, four of the GNRA loops in this molecule do change state. Some of the known receptors do not, because they consist of consecutive G:C pairs within an extended, continuous helix, but the ζ receptor for the GAAA ζ' loop definitely undergoes some kind of structural rearrangement. The work on Group II introns was done by Maria Costa. We have the same kind of data for Group I introns. Experiments were done by Luc Jaeger in Gif, and independently in the Turner lab by Aloke Banerjee and John Jaeger. When the data are summarized, we all agree that what is going on in the early unfolding transition is complete unfolding of the entire three-dimensional fold of the molecules. You end up at 50°C with a structure that has kept only those double-stranded helices that are not pseudoknots. This is exactly what most people would call the secondary structure of the molecule, which thus has physical existence.
Figure 12. DMS difference modification map. Filled and empty arrows point to adenines that respectively undergo major and minor changes in reactivity between 42 and 50°C. Arrowheads indicate intron-exon junctions. EBS1-IBS1, EBS2-IBS2, and Greek letters designate known tertiary interactions that may or may not consist of canonical base-pairs.

Question: Did you try using the program by Sean Eddy that predicts pseudoknots on this data?

Response: No.

Question: At what temperature does catalytic activity disappear?

Response: That was my initial piece of data.
Comment: I'm sorry, I walked in a little late.

Response: Catalytic activity disappears at exactly the same temperature as tertiary structure. We initially monitored the transition by renaturation and reaction kinetics, only later optically, and eventually by DMS accessibility.

Question: The folding you describe seems to be a very solitary activity. The molecule is doing things to itself. Is it conceivable that if you were to use a higher concentration, the intramolecular interactions could in fact be replaced by intermolecular reactions, in which case, as one molecule folds, it helps its neighbor?

Response: Definitely; intermolecular interactions become a problem at high concentrations, which is something we are very careful to avoid. We work at the lowest possible concentrations.

Comment: Isn't it possible that in vivo it could be the other way around?

Question: Rather than help each other to fold, in vivo it may prevent misfolding. At high concentrations, they probably don't fold properly, right?

Response: No, of course not; they do not.

Comment: I do not know how close you are to in vivo concentrations, but if there were areas in which the folding takes place in vivo, one might imagine very high local concentrations, in which case it might not be an artifact you were avoiding, but reality.

Response: This is a very good point that I keep in mind whenever we are dealing with in vivo situations.

Comment: We also see that in vitro, high concentrations inhibit folding, rather than promote it. The question is perhaps better posed as: "Is it important to have a high total concentration of molecules, but not a high concentration of the same molecule?" It is very possible that general crowding may help, but high concentrations of the same sequence may impede folding.

Question: What is a high concentration?
Response: The lowest concentrations that we used were about 5 to 10 nanomolar.

Question: At what point is it high; where does crowding occur ...at the micromolar range?

Response: I cannot say exactly, because we are taking such pains to avoid it. We never investigated this in any systematic way.

Comment: We had to make it our business to know. It depends on the length of the molecule. For molecules that are around 650 nucleotides long, it is about 1 micromolar. For smaller RNAs, it is at about 10 micromolar that this kicks in.

Question: Has there been enough data collection of exactly which nucleotide subsequences are involved in pseudoknots in order to build a database?

Response: Yes, I think so. There are hundreds of sequences of Group II introns, and I think I know just about every canonical base-pair in the molecule. I have spent so much time doing comparative sequence analysis that I am unlikely to have missed things that are present in more than, say, 5 or 10 sequences.

Comment: So that, in fact, provides the basis for an algorithm for the detection of a secondary structure with pseudoknots that is perhaps more reliable.

Response: Possibly; however, there are very few examples of pseudoknots in any individual sequence. If we exclude isolated Watson-Crick pairs, which are not going to be detected anyway, I doubt it would be very meaningful. In Group II, we have two extended pseudoknots and one very short one, and that is all. Tertiary structure is mostly different types of interactions.

Question: If you do extensive methylation at 42°C, you prevent the formation of pseudoknots. I wonder if you could do the type of methylation interference experiments that are routinely done with DNA; that is, to modify at 42°C with high concentrations of DMS. Would pseudoknots still not be formed?

Response: You would eventually methylate every methylatable atom if methylation were done at high concentrations. You would unfold not only the tertiary structure, but also the secondary structure.
Question: Why?

Response: Because you are going to drive the equilibrium to the unfolded form, whether the pairings are tertiary or secondary.

Comment: Methylation interference works fine with DNA. With DNA, you can certainly do methylation interference without completely unwinding the structure. I do not see why it is technically impossible to do it with RNA.

Response: We are doing interference experiments with RNA. But we use very low concentrations, because we want no more than one modification per molecule. We never tried saturation.

Comment: I don't think the suggestion is to methylate the hell out of the molecule, but rather that methylation interference would detect interruption of tertiary interactions with methyl groups.

Response: Yes. In that case, you would probably get the same sort of map.

Comment: Right, it's a common method. A lot of reagents may be used besides DMS, and there is a large body of literature regarding doing this exact sort of thing to probe the structure.

Comment: Is this what you mean? Because in the example shown, only one base per molecule is modified; the modifications are all collected and put on a single map, but you never have a single molecule with all of those changes.

Comment: But if you increase the DMS concentration 20-fold, you still would probably not unwind the double helices.

Comment: So your question is "How many interactions must be disrupted before unfolding occurs?"

Response: This is something that can be found out, or guessed, from the melting curves, because we know the extent of destabilization that results from the disruption of a single interaction. It is reasonably large, typically 5-6 degrees Celsius for each individual
interaction. I would say you do not have to disrupt many at room temperature in order to get complete unfolding. A brief comment about the interactions between GNRA loops and their receptors: they were initially spotted by comparative sequence analysis. Some of the rules for loop-receptor interactions are shown here [Fig. 13]. The GUGA loop prefers CU:AG helical receptors, whereas GUAA prefers CC:GG, and GAAA prefers an 11-nucleotide motif, CCUAAG:UAUGG. My point, for Michael Zuker, concerns the use of mutual information: had we been using mutual information to look for those contacts, we might not have found them.
Figure 13. Preferred RNA receptors for GUGA, GUAA, and GAAA loops (redrawn from Costa and Michel, 1995). The receptor motifs, which were identified by comparative sequence analysis and SELEX experiments, are boxed. The L2 x P8 interaction is used by many Group I introns to bind and specify the 5' intron-exon junction.
A major reason why computing mutual information for all pairs of sites is unlikely to provide us with many tertiary contacts is explained here [Fig. 14]. Let us assume we have aligned a number of sequences that are related by descent. It could be RNA, it could be protein; it does not matter. At site number 1, you have either A or a. At site number 2, you find either B or b. If you have A, you always have B. If you have a, you always have b. This is a case of perfect co-variation. Unfortunately, there are two possible extreme interpretations, and any intermediary combination of these two may also be made plausible. One interpretation is that there have been only two events, one at site 1 and one at site 2, which happened to occur for no profound reason, by chance, along the same deep branch of the particular phylogenetic tree, which describes the historical relationships of the sequences being analyzed. The other, contrasting interpretation is that while the contents of sites 1 and 2 have kept changing, they always did so together, because only the AB and ab combinations are allowed by nature. If the latter is true, it is most likely that sites 1 and 2 interact physically; the information extracted from the dataset was about the structure of the molecules (which, as a rule, evolves much more slowly than their sequences). On the other hand, if only one event took place at each site, the information gained is about the history of the molecules. Clearly, the only way to know which interpretation is most plausible is to try to infer actual phylogenetic relationships within the dataset by using the entire sequence alignment, as well as any additional information available about the host organisms. In the example I chose in Fig. 14, phylogeny number two implies at least four pairs of

The data:

Sequence #:  1  2  3  4  5  6  7  8  9  10
site 1:      A  a  a  A  A  a  A  a  A  a
site 2:      B  b  b  B  B  b  B  b  B  b

[Interpretations number one and number two are shown in Fig. 14 as two alternative phylogenetic trees whose leaves carry the AB and ab combinations; see the caption.]

Figure 14. Covariation data and two possible interpretations. Sites 1 and 2 in aligned biological sequence data display perfect co-variation. One extreme interpretation is that there has been only one change at site 1 and one at site 2, both of which occurred by chance along the same deep and long branch of the phylogeny that gave rise to the collection of related sequences being analyzed. The other extreme interpretation is that there have been many changes, but any change at one site was necessarily accompanied by a change at the other site (the ancestral sequence is arbitrarily assumed to be AB in both cases).
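For the data of Fig. 14, the mutual information between the two sites is maximal (1 bit) under either interpretation, which is exactly why mutual information by itself cannot distinguish history from structure. A sketch of the calculation:

```python
# Sketch: mutual information between two alignment columns. For the
# perfectly covarying data of Fig. 14, MI is maximal (1 bit) whether
# the covariation reflects two chance events on one deep branch or
# many coupled changes -- MI alone cannot tell history from structure.
from collections import Counter
from math import log2

def mutual_information(col1, col2):
    """MI (in bits) between two aligned columns of equal length."""
    n = len(col1)
    p1, p2 = Counter(col1), Counter(col2)
    p12 = Counter(zip(col1, col2))
    return sum((c / n) * log2((c / n) / ((p1[x] / n) * (p2[y] / n)))
               for (x, y), c in p12.items())

site1 = list("AaaAAaAaAa")   # the ten sequences of Fig. 14
site2 = list("BbbBBbBbBb")
print(mutual_information(site1, site2))  # 1.0 bit: perfect covariation
```

Counting substitution events on an inferred phylogeny, as described in the text, is what separates the two interpretations; the MI value itself is identical in both cases.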
possibly simultaneous events at sites 1 and 2, whereas if the actual phylogeny is number one, the maximum-likelihood hypothesis is that there have been only two, possibly separate, events at sites 1 and 2. Now, all of this is essentially no problem if you are dealing with Watson-Crick pairings. First of all, there are four isosteric base combinations for a Watson-Crick pair, so that if you observe all of them, you already know there have been at least three distinct events in the phylogeny. Moreover, Watson-Crick base-pairs are usually not isolated; they tend to come in bunches, since they are most often organized into continuous double helices. In order to prove the existence of a helix, you can safely add the observed numbers of base combinations at each potential base-pair. In contrast, for a given tertiary interaction, there are usually few, typically two, isosteric pairings, and they most often have no neighbors along the sequence. This is why you have to estimate the number of events, and this is why we were successful for Group I when others were not: we had ordered the molecules phylogenetically to begin with and were able to count events, as Carl Woese and Gary Olsen initially did when analyzing ribosomal RNA.

Question: Does Gutell do it? I don't think he does.

Response: No - well, sometimes; typically not. He seems to rely more on mutual information, which by itself seldom goes beyond secondary structure. Let us now briefly go back to the melting curves of some Group I intron mutants [Fig. 15]. Does destabilizing an interaction by introducing mutations tell us whether that interaction is part of the secondary or tertiary structure of the molecule being probed? I already mentioned the problem of P7 and P3. At least one of them has to be a pseudoknot and therefore, according to our "early melting" criterion, part of the tertiary structure. Let us then look at the data for P7 [Fig. 15A].
The base-pair being probed (number 2) was G:C in the original, wild-type molecule; the mutants are C:G (pair reversal), A:U, G:U, and A:C, and you can see that as soon as you change that pair, you begin destabilizing tertiary structure. As long as we agree that anything that disappears during the early melting transition is tertiary structure, P7 is clearly part of the tertiary structure. In contrast, when we look at the data for a helix such as P9 (Fig. 15C; the same types of mutations were introduced into a G:C pair), we observe that the G:C to A:U or G:U mutations have no effect on the early melting peak. Only by introducing an A:C mismatch can you destabilize the P9
helix sufficiently to affect the initial unfolding transition. Thus, a reasonable interpretation is that P9 is actually part of the secondary structure of the molecule. With regard to P3 [Fig. 15B], I'm not sure the answer is that clear, but it is certainly more a part of the secondary structure than P7 is. To some extent, melting profiles do provide us with a criterion.
Figure 15. Derivative melting profiles of wild-type and base-substituted versions of the td intron (see Fig. 6 for the location of mutated base-pairs). (Redrawn from Brion et al., 1999.)
Is the discrimination between secondary and tertiary structure afforded by the melting curve absolute, or is it a function of experimental conditions? Unfortunately, contrary to what I have tried to infer so far, it is not an absolute distinction. It changes in accordance with the types of salts used in the solutions. Let me show you the data for Group II introns [Fig. 16].
Figure 16. Melting curves of the Pl.LSU/2 Group II intron in 1M ammonium chloride, at 2, 5, and 10 mM MgCl2 (from Costa et al., 1997).
The trick in those experiments was to change the magnesium concentration in the presence of a very high concentration of monovalent ions. At a 1M concentration of monovalent ions, the stability of secondary structure helices is no longer supposed to be sensitive to the magnesium concentration. On the other hand, the tertiary structure, which depends heavily on magnesium in order to fold, should remain highly sensitive. In fact, just as expected, the early transition is shifted towards higher temperatures as magnesium is increased, whereas the rest of the curve is not significantly altered. You can also note that the early transition becomes sharper and sharper as the magnesium is increased; that is, its enthalpy changes as well. This has been observed with every RNA studied.
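The sharpening of the early transition is what a two-state van't Hoff model predicts when the transition enthalpy grows: the width of the melting curve scales inversely with ΔH. A sketch with hypothetical parameters (the ΔH and Tm values below are illustrative, not fitted to these data):

```python
# Sketch (illustrative parameters): a two-state van't Hoff melting
# curve. Fraction folded f(T) = 1 / (1 + exp(-(dH/R)(1/T - 1/Tm))),
# so a larger transition enthalpy dH gives a sharper transition --
# consistent with the early melting peak sharpening as Mg2+ captures
# more secondary-structure helices into the cooperative unit.
import math

R = 1.987e-3  # gas constant, kcal / (mol K)

def fraction_folded(T_celsius, dH, Tm_celsius):
    """Two-state folded fraction at temperature T, for transition
    enthalpy dH (kcal/mol) and midpoint Tm (where f = 0.5)."""
    T, Tm = T_celsius + 273.15, Tm_celsius + 273.15
    return 1.0 / (1.0 + math.exp(-(dH / R) * (1.0 / T - 1.0 / Tm)))

# Width of the transition (90% folded down to 10% folded) for two
# hypothetical enthalpies, scanning 40-60 degrees C in 0.01 C steps:
for dH in (100.0, 300.0):  # kcal/mol
    temps = [t / 100.0 for t in range(4000, 6001)]
    hi = max(t for t in temps if fraction_folded(t, dH, 50.0) >= 0.9)
    lo = min(t for t in temps if fraction_folded(t, dH, 50.0) <= 0.1)
    print(f"dH = {dH:.0f} kcal/mol -> transition width ~ {lo - hi:.1f} C")
```

With these numbers, tripling the enthalpy narrows the 90%-to-10% window roughly threefold, which is the sense in which a ~300 kcal/mol transition is "sharper" than a smaller one.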
Here is a more complete set of data for a Group I intron [Fig. 17]. At a very low magnesium concentration, there is no clear-cut transition, and then the early melting peak becomes sharper and sharper again. What is happening? My own interpretation is nothing original; it's the same one that was reached by David Draper. That is, as you stabilize tertiary structure, you progressively capture a number of secondary structure transitions. Since tertiary structure cannot exist in the absence of secondary structure, you indirectly stabilize a number of the weakest secondary structure helices; you capture them one by one, so the entire enthalpy of the transition will grow progressively as you increase the magnesium concentration.
Figure 17. Derivative melting profiles of the td Group I intron at various magnesium concentrations (from Brion et al., 1999).
Question: Could it go both ways? You could be stabilizing tertiary structure as well as destabilizing the helices.

Response: No. Secondary structure is part of the tertiary structure. As a rule, most secondary structure helices survive three-dimensional unfolding. There must be minor changes, but basically we believe the secondary structure revealed by DMS probing of the Group II intron at 50°C, which essentially agrees with the predictions of comparative sequence analyses, is also for the most part the one that prevails in the native fold. Admittedly, we do not have the crystal structure of Group II introns, but for RNAs for which we do have crystal structures, we know that the secondary structure, as predicted by either Mike Zuker's program or comparative sequence analysis, largely agrees with the set of Watson-Crick pairings in the final fold. Folding is largely hierarchical. A few pairings may change locally, but more than 90% of base-pairs survive.

Question: What you are saying is that if you change the magnesium concentration, you enhance tertiary structure, and in this way destroy secondary structure a priori?

Response: No, I think I just gave the answer.

Question: But why doesn't it happen?

Response: I guess because Nature wants it to happen that way.

Comment: There is one example where this kind of situation occurs. It is very local, and again, this is somewhat artificial, because it is extracted from a large RNA sequence. In fact, the sequence comparisons give the correct secondary structure, whereas if you extract the RNA, put it in solution, and analyze the structure by NMR at low magnesium concentrations, you will see a kind of secondary structure that is not the same as in the final state. But if you do comparative sequence analysis, you immediately get the correct secondary structure.

Question: There are two stems in the presence of magnesium. If they were without the magnesium, would they be more stable?

Response: Yes, that is what I'm saying. In the presence of magnesium, you have these helices and bring them together, and at the same time you capture and stabilize them, and also promote tertiary contacts between the helices and the single-stranded regions, stabilizing overall. The more magnesium you add, the more you cooperatively bring together secondary structure elements.

Question: This is what I do not understand; why are there helices when you stabilize one of them?
Response: It does happen in those kinds of molecules, but not in Group I or Group II introns. If I have time at the end of my talk, I will discuss why introns are to some extent special.

Comment: Adding ions stabilizes all structures. It stabilizes the secondary structure and it stabilizes the tertiary structure. In general, what you see is that the ion-dependence of the tertiary structure is much steeper. It essentially catches up, so you do not see two transitions; you see one. Once the tertiary structure is sufficiently stabilized by ions, everything melts all at once. What Eric [Westhof] said is exactly true; the secondary structure stabilizes the tertiary structure and vice versa, and that is what makes it cooperative. You cannot get the tertiary structure without the secondary structure; that was Francois Michel's point. You require the secondary structure scaffold. It is also stabilized by ions, but in a less steeply dependent manner, and that is why this coalesces. At some high ion concentration it is all or nothing.

Comment: I think the situation in proteins is very similar to what you are envisioning. And I think that in the case of proteins, it is not at all clear, for all folds, that the tertiary structure does not impose the stable secondary structures on them. It is much more common that if you could actually catch the secondary structures, helices and sheets and such things, in isolation, they would have structures very different from what they are in the tertiary structure. But with RNA, the belief is that this is the way it works. It is a bit of a circular argument, as I see it, because unless you can do the large-scale secondary structure determination at very low magnesium concentrations, which may not be possible, I do not think you can know for sure. But in general, I think it appears to be the case.
Question: Is it accurate to say that at a high monovalent ion concentration it is possible to adjust the magnesium concentration so that you can very sharply separate out tertiary and secondary structure as you melt it?

Response: Yes, of course; that is exactly what we did with Group II introns. We used a magnesium concentration low enough to get almost complete separation between tertiary and secondary structure. That is why I could start with the clean data.

Question: But if you have no magnesium, you would not get the tertiary interaction at all, so don't you need to have some magnesium?
Response: Yes, exactly. The intron we chose to work with, which was identified by Maria Costa, is the one that requires the least magnesium for activity among all the Group II introns investigated so far. We were able to go down to 2 mM magnesium and still observe a signal corresponding to the cooperative unfolding of tertiary structure.

Question: You talk a lot about magnesium; does it really matter whether you use magnesium? Could you use calcium?

Response: We had the same question on the first day. The Group II intron I have been talking about works in calcium and folds in calcium. But then it only does the first step of the reaction, not the second part.

Question: And what about polyamines?

Response: Folding can be achieved with a number of different chemical agents, except we know that magnesium works with all RNAs; it folds every natural RNA molecule we have so far seen, and it is very abundant in nature. That is why we usually use magnesium. Also, most ribozymes require magnesium in order for an efficient reaction to take place. Some of them can use manganese, and I mentioned that this particular Group II intron can use calcium for the first step but not the second. We do not have a large choice of ions that are compatible with activity. The safest bet is magnesium.

Question: I wondered whether the cell could manipulate the type of ion in order to produce a certain sort of folding — but is there no evidence to show that?

Response: It is quite possible. Now I would like to go back to the kinetics of renaturation in what we now know to be the melting range for the entire three-dimensional fold of the intron. Most of our data for base-substituted molecules never got published. In the example I am going to discuss, we were experimentally probing, for the first time, the interaction between a GNRA loop - the P9 terminal loop of intron sunY - and its receptor (in the P5 helix).
What we published were the data at equilibrium [Fig. 18]. Since the ΔG's you estimate from these equilibrium melting curves are in fact energy differences between the folded and unfolded states, and the energy of the latter may differ between the wild-type and base-substituted molecules, the only truly meaningful comparisons are those in which the terms relating to the unfolded state cancel each other, as in a thermodynamic cycle, e.g.:
[ΔG(AG:GA) - ΔG(AG:AA)] - [ΔG(GG:GA) - ΔG(GG:AA)] = G(AG:GA) - G(AG:AA) + G(GG:AA) - G(GG:GA)

where AG:GA and so on refer to the sequences of the P5 helix and P9 terminal loop (nucleotides 3 and 4). In the present case, the existence of the interaction is supported because the two quantities between brackets have opposite signs, even though the energy differences, 0.42 and 0.45 kcal, are not large.
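The cancellation in this double-mutant cycle can be made explicit numerically. In the sketch below the absolute ΔG values are invented; only the differences, chosen to echo the 0.42 and 0.45 kcal figures above, carry meaning.

```python
def coupling_energy(dG):
    """Double-mutant thermodynamic cycle for the loop-receptor contact.
    dG maps 'helix:loop' sequence combinations to folding free energies
    (kcal/mol); the unfolded-state terms cancel around the cycle,
    leaving only the loop-receptor interaction energy."""
    return (dG["AG:GA"] - dG["AG:AA"]) - (dG["GG:GA"] - dG["GG:AA"])

# Hypothetical absolute values; only the differences matter.
dG = {"AG:GA": -3.00, "AG:AA": -3.42, "GG:GA": -2.00, "GG:AA": -1.55}
print(coupling_energy(dG))  # 0.42 - (-0.45) = 0.87 kcal/mol
```

The two bracketed quantities come out as +0.42 and -0.45 kcal, with opposite signs as in the text, and their difference is the interaction energy.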
Figure 18. Fraction of initially inactive sunY (SYT) transcripts as a function of temperature (42-62°C). The interaction being probed is between the P9 terminal loop and its P5 helical receptor (see Fig. 6; the sequence of these components is the same in the sunY and td introns). Dashed lines between the loop and receptor indicate potential interactions. Values provided are ΔΔG°'s (kcal/mol), as estimated from the fraction of initially inactive molecules at 52.5°C. (Redrawn from Jaeger et al., 1994.)
However, what I want to show you is the renaturation kinetics of these mutant molecules [Fig. 19]. As mutations are introduced that are progressively destabilizing (the most destabilizing ones are at the top), at a given temperature you change not only the fraction of initially folded molecules, but also the rate of renaturation. What sort of information can you obtain from this? Let us suppose that the renaturation process is a very simple one, with a single rate-limiting step. Two
extreme possibilities may be contemplated. If the interaction we are probing was part of the transition state, you would expect that when raising the ΔG of the final, folded state, you would also raise the ΔG of the transition state, so that ΔΔG‡, the difference in activation energy estimated from the ratio of the folding rates, would change by the same amount as the ΔΔG° of the equilibrium (their ratio, Φ, would be 1). The other extreme possibility is that the interaction is not part of the transition state, so that if you change the equilibrium, you are not going to change the rates at all, i.e., Φ = 0.
[Figure 19 diagrams: secondary structure sketches of the P5/P9 base-substituted variants, ordered from the most destabilizing (+1.16 kcal) through +0.72, +0.32, and +0.18 kcal down to the wild-type.]
Figure 19. Kinetics of renaturation at 52.5°C of sunY (SYT) transcripts bearing base-substitutions in the P5 helix and P9 terminal loop. Values are ΔΔG‡'s calculated from the ratio of wild-type to mutant renaturation rate constants (ΔΔG‡ = RT ln[kwt/kmut]).
In Fig. 20, the actual value of Φ is seen to be neither 0 nor 1, but 0.46. When similar experiments were carried out with RNAs bearing mutations in the P9.0a base-pair, a nearly identical (0.42) Φ-value was obtained. What is the most likely explanation for this? Well, simply that there must be a multiplicity of pathways that lead to the folded state. In some of these pathways, these interactions are part of early folding intermediates, whereas in others, they form late, something that does not come as a surprise when you remember that in our experiments renaturation takes place within the melting range. Admittedly, multiple folding pathways have
also been observed very far from the melting range in experiments carried out in Sarah Woodson's and Jamie Williamson's labs. Still, you would definitely expect that within the melting range of a molecule, when conditions are ideal for conformational searching, there are going to be multiple ways to fold.
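The Φ arithmetic behind Fig. 20 takes only a few lines. This is a sketch, not the published analysis; the rate constants below are hypothetical, chosen only so that Φ lands near the 0.46 slope of Fig. 20.

```python
import math

R = 0.001987       # gas constant, kcal/(mol*K)
T = 273.15 + 52.5  # renaturation temperature used in the experiments (K)

def ddG_activation(k_wt, k_mut):
    """ΔΔG‡ from the ratio of wild-type to mutant renaturation rates."""
    return R * T * math.log(k_wt / k_mut)

def phi_value(k_wt, k_mut, ddG_equilibrium):
    """Φ = ΔΔG‡ / ΔΔG°: 1 if the probed contact is fully formed in the
    transition state, 0 if it forms only after it."""
    return ddG_activation(k_wt, k_mut) / ddG_equilibrium

# Hypothetical rate constants for a mutant destabilized by 1.16 kcal/mol:
print(phi_value(1.0, 0.44, 1.16))
```

An intermediate Φ, as in the text, is then read as the probed contact being formed in some folding pathways' transition states but not in others.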
Figure 20. Relationship between ΔΔG‡, estimated from the renaturation rates, and ΔΔG° (kcal/mol), estimated from the fraction of initially active molecules. Each point corresponds to one of the mutated molecules in Fig. 19 (the wild-type is at 0,0). The slope of the linear fit is 0.46.
The folding conditions I have just described, in which the molecule is placed within its melting range and all tertiary interactions appear to form simultaneously by an all-or-none process, are very different from the renaturation conditions described by Sarah Woodson, for instance. Except perhaps when urea is added, the conditions used in most laboratories for renaturing the Tetrahymena Group I intron, whose tertiary structure unfolds at around 65°C, are very far from its melting range. What I am now going to try to argue is that in nature, RNA molecules, such as Group I and Group II introns, fold under conditions that are in fact much closer to the conditions we have been using. They are very close to their melting range; i.e., they have very small ΔG's.
Some Questions Concerning RNA Folding
163
I first began to realize that this had to be the case while working with the sunY Group I intron, which used to be fashionable around ten years ago, but no longer seems to be. Like other introns, the sunY intron is excised after transcription, and as is often the case with Group I introns, the piece that gets removed (some 1,033 nucleotides) includes not only a ribozyme that directs and catalyzes splicing, but also an open reading-frame (a potential protein-coding sequence devoid of stop codons), which actually specifies a protein. What is somewhat unusual - though not unique - about the sunY intron is that its reading-frame is located entirely downstream of the catalytic core of the ribozyme. Is this a problem? Well, it could be, because as soon as it is synthesized, the ribozyme core becomes liable to catalyze splicing, provided it can find a suitable 3' splice-site. Now, those were precisely the times when we were finding out about the rules for the recognition of 3' splice-sites, and on paper, there was not merely one, but several possible splice-junctions immediately 3' of the ribozyme core. Since it seemed reasonable to look for alternative splicing in the sunY system, we chose to begin by generating truncated intron transcripts in vitro (they were intended to mimic transcription intermediates in vivo), and soon enough we were able to observe splicing, not just to the proximal sites we had predicted, but to several additional ones. The core seemed quite active and could very easily find a surrogate 3' splice-site. But what was it doing in vivo? I phoned David Shub, a microbiologist and biochemist who had discovered the sunY intron, and asked him about the possibility of alternative splicing in vivo. His answer was, "In fact, I have been working on that problem. We have tried very hard to observe alternative splicing in vivo and have been totally unable to find it." This was quite puzzling, so our next experiment was to transcribe the entire intron.
I was expecting to see both proximal and distal splicing, but could only observe distal splicing. All proximal reactions were abolished (we had similarly predicted that a number of other Group I introns had alternative, proximal splice sites, and in those introns, alternative splicing events were eventually detected in vivo by PCR, a very sensitive method for amplifying nucleic acid molecules that was not available at the time we were tackling this problem). Eventually, I found the catch, which was this [Fig. 21]: The truncated transcripts were very active, but were far more sensitive to the magnesium concentration than the full-length transcript was. They required at least 10 to 15 mM Mg2+ for full activity, as opposed to 3 mM for the complete intron, and were essentially inactive at 5 mM or less. Now, what you need to know is that the physiological concentration of magnesium - in fact, the concentration of unbound, available magnesium ions - is much less than 10 mM. It seems to be much closer to
2-3 mM, which is barely sufficient for full-length transcripts to be active in vivo. In E. coli cells - I have not yet mentioned that this intron is found in the T4 bacteriophage, and therefore supposedly works in E. coli - the medium is simply not favorable enough for the ribozyme core to fold until the intron is completely synthesized, whereas in vitro, truncated molecules appear fully active under suitable conditions.
[Figure 21 diagrams: map of the sunY transcript showing the reading frame (258 aa) downstream of the ribozyme, the truncation points of constructs pSYAX, pSYC1, and pSYK15, and a plot of relative activity (0.0-0.8) of full-length versus truncated transcripts (SYC1 + EcoRI, SYC1 + PvuII, SYAX + DraI, SYK15 + EcoRI) as a function of Mg2+ concentration (mM).]
Figure 21. Top: Organization of the sunY intron (see Fig. 22 for P9, P9.1, and P9.2). A distal splicing reaction excises the entire intron. Bottom: Variously truncated transcripts (see middle) display elevated dependence on the magnesium concentration.
We also eventually found what the missing part was. It consists of a small piece of RNA (lower right in Fig. 22) located immediately upstream from the 3' splice-site, and makes a small number of interactions with the ribozyme core. As far as we know, only two of these interactions matter, in the sense that they seem necessary and sufficient to stabilize the entire ribozyme molecule. One of them is a small pseudoknot (between the L7.2 and L9.2 terminal loops) and the other (P9.0a) an isolated Watson-Crick base-pair. The experimental evidence consists, as usual, of the melting curves (whether obtained from optical or kinetic measurements) of base-substituted molecules.
Figure 22. Schematic representation of the three-dimensional organization of the sunY intron. The conserved Group I ribozyme core (P3 to P8) is boxed. Large arrows point to 5' and 3' intron-exon junctions. Known interactions between the 3' terminal domain and the rest of the intron are shown as dashed lines.
Now at the same time we were measuring melting temperatures, David Shub and Ming Xu in Albany were determining relative splicing efficiencies in E. coli cell extracts (Michel et al., 1992), and it is quite interesting to compare the two
datasets. Let us assume (a bit naively) that the ability of a population of molecules to carry out splicing directly reflects its ability to fold within a short time window after synthesis (splicing must stand in competition with RNA degradation). Since the time window within which folding competes efficiently with degradation after synthesis must be of the same order of magnitude as the time necessary for transcription of the entire intron, which is estimated to take between 20 and 50 sec, we will assume it to be 1 minute. Thus, when the relative splicing efficiencies of mutants P7.2xP9.2 CC:GC and GC:GG (the wild-type interaction is GC:GC) were estimated from reverse transcription of in vitro extracted RNA to be (respectively) 0.08 and 0.06, our interpretation is that some 7% of the mutant molecules managed to fold during 1 min. Assuming kinetic and thermodynamic parameters determined in vitro apply to in vivo conditions, let us now estimate the folding rate of wild-type molecules under conditions that allow 7% of the mutant molecules to fold. The mutant ΔΔG‡'s have not been determined, but may be estimated from measured ΔΔG°'s (+5.5 and +5.6 kcal/mol; Jaeger et al., 1993) if Φ is assumed to be the same as for kinetically probed interactions, that is, ~0.45. The folding rate of wild-type molecules is then estimated to be 3.6 min-1. More than 97% of wild-type molecules get folded during the one-minute time window, which should effectively free the wild-type sequence from selective pressure. Moreover, since ΔΔG‡ = ΔΔG° for the wild-type molecule (Fig. 2a in Jaeger et al., 1993), which is consistent with the fact that the transition state must be a nearly entirely folded molecule, ΔG° may be estimated from the ratio of 3.6 min-1 to the values that were measured within the melting range [Fig. 18].
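The back-of-the-envelope numbers can be checked in a few lines. The assumptions here are mine, not the text's exact calculation: simple first-order folding kinetics, a two-state equilibrium, and T = 37°C.

```python
import math

R = 0.001987       # gas constant, kcal/(mol*K)
T = 273.15 + 37.0  # assumed physiological temperature (E. coli, 37°C)

def fraction_folded_kinetic(k, t):
    """First-order folding: fraction folded after time t (units of 1/k)."""
    return 1.0 - math.exp(-k * t)

def dG_from_fraction(f):
    """Two-state equilibrium stability implied by a folded fraction f."""
    return -R * T * math.log(f / (1.0 - f))

# Wild-type rate of 3.6 per minute over a 1-minute window:
print(fraction_folded_kinetic(3.6, 1.0))   # just over 0.97
# Stability needed for 97% of molecules folded at equilibrium:
print(dG_from_fraction(0.97))              # about -2.1 kcal/mol
```

Under these simplified assumptions the stability comes out near -2.1 kcal/mol, consistent with the roughly -2.2 kcal/mol quoted below.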
The outcome - a mere -2.2 kcal/mol - indicates that in vivo, the stability of the wild-type intron fold is marginal, just enough for 97% of the molecules to be folded at equilibrium. It appears that the conditions under which renaturation takes place actually correspond to the early melting range (this calculation is based on the assumption that the near-equivalence ΔΔG‡ = ΔΔG° extends beyond the range in which it was established in Jaeger et al., 1993. In actual fact, ΔG° is likely to be somewhat larger, since kinetic traps should progressively form as one gets away from the melting range.) Interestingly, I have been watching the literature for globular proteins, and the in vivo ΔG's also appear to be quite small, in the range of -5 to -10 kcal/mol. All of this makes sense if we assume that natural selection is sensing only thermodynamic stability. As long as you are very close to 0, you will have strong selection against mutations; but if you go as far as -10, a single mutation gets you back only to -5, so that there will be no selection against it. Thus, the molecule will in fact drift back toward this threshold of, say, -5 kcal/mol estimated for globular proteins. The catalytic RNAs in the example I just gave you are exactly what you would expect in this
perhaps a bit naive picture of natural selection working primarily on thermodynamic stability. I would like to emphasize that there is nothing particularly special about the sunY Group I intron. We also worked on the td intron, which is found in the T4 bacteriophage and has a very different organization in terms of the respective locations of the ribozyme core and protein-coding sequence. The td intron does not have to delay folding in order to avoid alternative splicing, yet it yields essentially the same set of thermodynamic estimates. It is no more stable than the sunY intron. My point is that at least in the case of self-splicing Group I and Group II molecules, the overall fold stands very close to its in vivo melting point, so that in fact folding takes place very close to melting conditions. This has a number of implications that we can discuss. One of them is that kinetic traps in three-dimensional folding seem to be rather unlikely under such conditions, which should be ideal for conformational searching.
Figure 23. Recognition efficiency of diverse GNRA loops by the 11-nucleotide GAAA receptor motif (from Costa and Michel, 1997).
I will mention a number of points that I regard as open questions we might now discuss. One of them is whether there is anything special about self-splicing introns. I have been trying to imply that what I am describing is general for RNA. In fact, self-splicing introns are somewhat peculiar molecules. Many of these introns have been selected to move around from one host to the next, so that a fair fraction of them must be very self-sufficient. A hint of this is provided by the distribution of GNRA loops and their receptors in self-splicing molecules. Maria Costa in my lab did selection experiments to find the best receptor for each GNRA loop. She found that for GAAA, there was an incredibly good receptor - an 11-nucleotide motif - which gave rise to an unusually stable combination. The interaction was both tight and specific [Fig. 23]. Note that the ordinate scale is logarithmic; the GAAA loop stands out completely for the 11-nucleotide receptor.

Question: For what kind of receptor is this?

Response: I can show you the motif. It is a small piece of RNA with a defined three-dimensional structure when interacting with the GAAA loop. The structure was determined by crystallography.

Question: It is a receptor with a short oligo?

Response: It is formed by two strands. There are helices on either side and a small internal loop in secondary structure representation.

Question: So this motif is a receptor for that oligo?

Response: For the GAAA terminal loop, and, it seems, the best possible one. We know this from in vitro selection.

Question: I am lost. What is a receptor for what? You have this short oligo and this loop... I'm just lost in the logic.

Response: Well, it does not really matter. It is a matter of definition.
Comment: It is actually a tertiary interaction in the molecule. It is not a short oligo that is being bound. It is actually the interaction of one part of the molecule with another. In fact, the interaction is very weak in completely separate molecules.

Response: Rather weak, yes.

Comment: It has to be in cis, so it is just a simple tertiary motif.

Response: Luc Jaeger showed that it could be in trans, provided you have two of them.

Question: What do you mean by "receptor?"

Comment: The word receptor is basically these two motifs interacting with each other. It is tertiary structure. They make the mold better. We call it a receptor and a GNRA loop.

Question: Have you used something like SELEX to improve on this?

Response: That is exactly what I have been saying. Maria Costa did this very experiment.
Question: But you have a natural one right there.

Response: We initially uncovered the motif in natural molecules, then we did the selection experiments and found that the natural solution was the best possible one for the GAAA loop. It was spectacularly better than anything else for that loop. We know that there is a very specific, perfect match between the two partners. The point I wanted to make is that the 11-nucleotide receptors are very common in self-splicing introns. I can show you a Group II intron in which within the same molecule you have three GAAA loops and three 11-nucleotide receptors [Fig. 24]. Eleven-nucleotide receptors are very common in self-splicing introns, but I have never found them in either ribosomal RNA or ribozymes that were selected for activity in vitro (though not for stability or performance in folding). There is definitely something rather special about these self-splicing introns. This is what attracted me to them in the first place. I was interested in molecular evolution. I was looking for an experimental system that would look like a self-contained world.
Question: Why is it that after ten years, you and Eric have not given us a 3-D model of the Group II intron?

Comment: [unintelligible]

Response: Well, we have modeled part of the molecule.

Question: When did that come out?

Response: In the EMBO Journal, last year (Costa et al., 2000).

Question: But why not the entire intron?

Response: Well, because I hate to be wrong.

Question: Is Group II harder to model? Why?

Response: We do not feel confident about our current model. There are things I do not like. In the case of the Group I intron, it seemed like a crystallization process; everything started falling together and making sense in a matter of a few weeks. Take the interactions of GNRA loops... I was supposed to be rather good at comparative sequence analysis, and I had never spotted them before we started modeling. It was really a cooperation between the two of us. Eric started modeling, then I looked at where the GNRA loops might go. This was a very rough model to start with. Still, it helped me to find the receptors and thus provide some feedback to refine coordinates. It was a very collaborative endeavor, and we have never reached the same consensus with Group II.

Question: You seemed to make a statement to the effect that - perhaps I misunderstood - the natural selection of these RNA sequences might have arisen because of stability considerations; because of, say, 6 to 8 or 10 kcals or something like that. Is it possible that selection pressure could be coming purely from kinetics?

Response: Yes, of course. Our idea that thermodynamic stability dominates may be pretty naive, but at the same time, it is worth asking the question.

Comment: This guy showed that you could in fact improve folding speed.
Response: ...by destabilizing. Adding urea did the job, and in vitro selection did it, and the Woodson and Thirumalai labs showed that by going closer to the melting range you improve folding, and that is my entire point.

Question: ...Somewhat of a general question: According to you, has the plausible evolution of RNA sequences been dictated by stability or by kinetics, or are they related?

Response: I believe they are related. Still, it may depend on the molecule. The point I am trying to make is that the selection pressure has been different for different types of natural RNA molecules. Ribosomal RNA as we see it today has not been under the same type of selection pressures as some of the Group I and Group II introns.

Question: Is this based on experiments?

Response: No; it is based on statistics.

Question: How many statistics?

Response: Well, quite a large quantity. For example, I know of at least four series, each with numerous sequence examples, of different instances of the 11-nucleotide motif in Group I; at least four series in Group II, only one in RNase P, and not a single example in the huge collection of ribosomal RNA sequences available.

Question: They are available, but have the kinetics and stability been determined?

Response: We know that ribosomal RNA requires proteins in order to fold. Some Group I and Group II introns do require proteins, although not as massively as ribosomal RNA does.

Comment: This comment will tie in to Jamie Williamson's talk. Jamie Williamson's group showed that if you destabilize some of these interactions, folding speeds up, because the intermediates are also less stable. The same thing is true for these loop-loop interactions that you identified. RNA-RNA interactions that help stabilize the native structure, at least in the Group I introns, also stabilize misfolded intermediates.
Hence, the folding kinetics are slower, although the ultimate stability of the final structure is improved. My pet theory, although not proven, is that by the time you get to the size of a ribosome, this no longer works. Because if you had to assemble the whole thing based on RNA-RNA interactions, you would never arrive. An evolutionary solution to this
problem, as Jamie Williamson nicely pointed out, is to supplant the RNA-RNA interactions with RNA-protein interactions, which are also more specific.

Response: I wonder if what you are describing is due to the fact that you are very far from the melting range.

Comment: It turns out that for me, and I believe that Jamie Williamson is also more or less in agreement, when adjusting at physiological temperatures for Tetrahymena, which would be between 27°C and 30°C, we cannot arrive at the physiological folding rates merely by adjusting the ionic conditions. Although we can certainly improve our folding rates, that is not sufficient.

Question: This picture [Fig. 24] is tree-like. There are many helices there, but it is tree-like. I have the impression that this molecule is deliberately trying very hard to avoid non-tree-like helices. Why is that so?

Response: Yes. I think that is because it would be dramatic for folding if you made tertiary interactions before you were done with secondary structure. This helps me make one of my points. If you define the secondary structure in this way, without pseudoknots, it is always physically viable. You will never have problems physically folding the molecule in this way into the structure that probably actually exists at 50°C. As soon as you make tertiary interactions, you get into possible topological problems. My point is that only when you are done with the secondary structure, which is most probably going to form co-transcriptionally, should you attempt three-dimensional folding. In the case of the sunY intron, folding is clearly postponed until the entire molecule is synthesized. Then, folding is entirely cooperative. It is an all-or-none process, taking full advantage of the potential for conformational search.

Comment: If this were two-dimensional space, you would be right. In three-dimensional space, there is a difference. I think that evolutionarily, this thing could build out more physically.
What you say just cannot be true, because three-dimensionally, it does not know what is secondary or not.

Response: If you start making those pseudoknots before you have completed the secondary structure - I got trapped more than once using models of wires and wood - if you make those interactions, α-α' and β-β' in Fig. 24, you are going to end up with real knots before you are done with the secondary structure of domain I.
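The tree-like property being discussed, namely that a pseudoknot-free set of base pairs contains no crossing pairs, can be checked mechanically. Here is a minimal sketch; the pair coordinates are invented for illustration.

```python
def is_tree_like(pairs):
    """True if a set of base pairs (i, j) is nested (pseudoknot-free):
    no two pairs (i, j) and (k, l) may cross as i < k < j < l."""
    pairs = sorted((min(p), max(p)) for p in pairs)
    for a, (i, j) in enumerate(pairs):
        for k, l in pairs[a + 1:]:
            if i < k < j < l:   # crossing pair = pseudoknot
                return False
    return True

# Nested stems (tree-like), e.g. a hairpin inside a larger helix:
print(is_tree_like([(1, 20), (2, 19), (5, 12), (6, 11)]))  # True
# A loop-loop pseudoknot, like the L7.2/L9.2 contact, crosses:
print(is_tree_like([(1, 10), (5, 15)]))                     # False
```

A nested pairing list can always be drawn as a planar tree, which is why such a structure never poses the topological problems described for pseudoknots.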
Comment: I agree, but the reason it evolved could possibly be that the parts appeared separately and were then glued together by something like gene duplication. And they could have evolved this way. It cannot be accidental just for that reason. There must be a structural mechanism forcing it to be tree-like.

Comment: One possible answer is that you can envisage folding this even in three dimensions by a succession of interactions that are local in sequence.
Figure 24. A group II intron from the cytochrome oxidase gene of the alga Pylaiella littoralis, with three GAAA terminal loops and three 11-nucleotide receptor motifs (from Fontaine et al., 1997).
Response: That is another point I wanted to raise. What is the role of internal loops in the folding process? What is the role of these small internal loops everywhere in the molecule during the folding process? Are they part of the secondary or tertiary
structure? Should we regard the final structure of the internal loop as secondary or tertiary? We know the answer, in part. We know that terminal loops, the GAAA loops for instance, are clearly part of the secondary structure. They are so stable that they are going to survive the early melting transition. In fact, they will melt much later. For internal loops, the answer is not clear at all. It has been published that the 11-nucleotide receptor changes state when it meets the GAAA loops. This is an induced fit. What I do not know is what the stability of this alternate state might be. Does it survive close to the melting range of tertiary structure? I have no idea whether the receptor is disordered or ordered, and to what extent, when close to the melting range. This is another point I would like to know about.

Comment: I would like to come back to the issue of selection again. You are emphasizing folding close to the melting temperature. A conjecture could be made based on the expectation that what you want are sequences that have similar architecture and fold at relatively similar speeds over a wide temperature range, to accommodate various species: thermophiles, mesophiles, and so on. So you want to optimize speed close to the melting temperature not just for your system, but over some range of temperatures.

Response: Yes; thank you for bringing this up. Some of those introns work in E. coli, and of course, E. coli does not grow only at 37°C. It has to do splicing presumably at a whole range of temperatures. Is there any kind of homeostasis? You would posit from what I am saying that there should be some kind of homeostasis with respect to the stability of macromolecules, especially RNA. I was recently delighted to find out from the literature that when you submit E. coli to a cold shock, the bug synthesizes a number of so-called cold-shock proteins, at least three of which have proven to be RNA helicases.
That is exactly what you would want: proteins that are going to destabilize the RNA and bring it back close to the melting range.

Question: I just want to make sure that I understand this picture [Figure 24]; do the additional arrows on this tree-like structure correspond to pseudoknots?

Response: Yes; you have one called α-α', another large one is β-β', and a shorter one ε-ε'.

Question: ... and at least two or three others?
Response: There are three that are more than one base-pair; the others are single, isolated base-pairs.

Question: So according to this notation, what about EBS1-IBS1 and EBS2-IBS2 - are they not pseudoknots?

Response: It depends. They exist in the precursor molecule, the initial transcript, but they are absent from the excised intron; that is, in the absence of the 5' exon.

Question: Regarding the parameters that may influence the folding, what do you think about the speed of transcription?

Response: In the case of the sunY intron, if you have a maximum of 55 nucleotides per second, it will still take between 15 and 20 seconds to synthesize the entire intron, which is far more than necessary for splicing if the core were folded. The problem exists. Many of those introns exist in bacteria. I have a phylogenetic tree of group II introns showing there is an incredible number of them being sequenced in bacterial genomes; organelle introns will soon be a minority. Note also that in bacteria, transcription is normally coupled with translation.

Comment: During the course of the elongation of a given group I or group II intron, there might be regions of the sequence where the speed can change. One could imagine that the polymerase is somehow weighted in such a way that folding has some time to occur, so that the phase space should also include the course of the speed of elongation along the sequence, which makes things even more complicated.

Response: If I am correct that three-dimensional folding occurs close to the melting range, then it does not matter much whether synthesis starts with the 5'-end or the 3'-end. The fact that you transcribe from one end to the other at a given speed matters only for secondary structure.
If I am correct about three-dimensional folding, i.e., that it is postponed until the entire molecule is synthesized, it means that even though the molecule we are dealing with is a complete transcript, our in vitro experiments reasonably simulate in vivo conditions.

Question: Are there data in E. coli about the deficiency of activity of group I introns?

Response: I showed you T4 bacteriophage introns that work in E. coli.
Question: Are there data about the temperature?

Response: No. I do not know what the reasonable temperature range is for T4 bacteriophage infection. David Shub and Ming Xu did all the in vivo work. We only dealt with T4 DNA, which we soon regretted, because we did not know about the nucleotide modifications, which prevent restriction enzymes from cutting.

Question: What is the free-energy stabilization range for the three-dimensional tertiary structure, compared with that of the secondary structure?

Response: The computed ΔH's ranged from 150 to 300 kcal/mol, depending on the intron. I computed 300 kcal/mol for group II introns. Under physiological conditions, when I tried to estimate the ΔG for three-dimensional tertiary structure in the sunY intron, I found from -2 to -4 kcal/mol. But we must remember that the determination is quite indirect. A fraction of spliced transcripts is a very indirect measure of what may be going on. I cannot really put values on thermodynamic parameters. What we think we know is that as soon as you lose more than 2 kcal/mol, you begin to get an in vivo phenotype that is going to be severely counter-selected.

Question: I am interested in making a comparison with a protein. In an average small globular protein, the amount of ΔG stabilization is a few kcal per mole of the molecule. Is this a similar case?

Response: Yes. It is very similar.

Question: In that case, the ΔG is between the tertiary structure and the structure in which the tertiary structure has been disrupted, but the secondary structure is still more or less conserved.

Response: Yes, exactly.

Question: Are these small ΔGs also pseudoknots?

Response: Yes.

Question: ...even when you have real canonical pairing?

Response: ΔH is large, but so is ΔS.
Comment: It is just adding; it is much more than 4 kcal.

Response: That depends on the temperature. There is one question I have, perhaps for Sarah Woodson or Jamie Williamson. I have been implying that there is usually a single domain of three-dimensional folding. This is very different from what you have described in your kinetic folding experiments, in which you first have folding of the P4-P6 domain, then of P3-P7. On the other hand, when I looked at equilibrium data for the Tetrahymena intron, I think the evidence for separate domains is rather scant. In Turner's lab, there is a single detectable early transition, and they also did chemical modification on both sides. I do not know whether they looked for substructure in the early melting peak, but they clearly did not see it. There are also data from the Cech lab, where folding is followed as a function of magnesium concentration. The difference in magnesium concentration between P4-P6 and the rest of the molecule is very small. The values are 0.7 and 0.85, so there is considerable overlap. It is not obvious to me that there are really separate domains within the overall tertiary structure near equilibrium. I was expecting someone to raise this issue, so I am doing it myself.

Comment: It is a matter of resolution. If you look at what Sarah Woodson showed, the small-angle scattering experiments, it is quite clear there is an intermediate; so it is a question of resolution. The experiments, as I see them, are not clean or precise enough to distinguish between equilibrium intermediates. I would be willing to bet a nickel (or a euro).

Response: Yes, if these intermediates exist in the melting range.

Comment: I'll give you a euro on January 2nd, if I'm wrong.
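The thermodynamic exchange above can be checked with a short back-calculation. The ΔH and ΔG values below are the ones quoted in the discussion (roughly 300 kcal/mol and -2 to -4 kcal/mol); the derived ΔS and melting temperature are only an editor's illustration, under those assumed numbers, of why a large ΔH paired with a large ΔS leaves a small net ΔG near physiological temperature.

```python
# Sketch using the values quoted in the discussion, not measured data:
# folding enthalpy ~ -300 kcal/mol, net stabilization ~ -3 kcal/mol at 310 K.
dH = -300.0       # kcal/mol, folding enthalpy (sign: folding releases heat)
dG_310 = -3.0     # kcal/mol, net folding free energy at 310 K (37 C)
T = 310.0         # K

# dG = dH - T*dS  =>  dS = (dH - dG) / T
dS = (dH - dG_310) / T            # ~ -0.96 kcal/(mol*K): large, like dH

# Melting temperature of the tertiary structure, where dG = 0:
Tm = dH / dS                      # ~ 313 K, i.e. ~ 40 C

print(f"dS = {dS:.3f} kcal/(mol*K), Tm = {Tm - 273.15:.1f} C")
```

On these illustrative numbers, a cell at 37°C sits only about 3°C below the melting point of the tertiary structure, which is the sense in which ΔH and ΔS nearly cancel and everything "depends on the temperature."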
References

1. Dumas, J.P., Ninio, J., Efficient algorithms for folding and comparing nucleic acid sequences. Nucleic Acids Res. 10, 197-206 (1982).
2. Michel, F., Westhof, E., Modelling of the three-dimensional architecture of group I catalytic introns based on comparative sequence analysis. J. Mol. Biol. 216, 585-610 (1990).
3. Michel, F., Jaeger, L., Westhof, E., Kuras, R., Tihy, F., Xu, M.-Q., Shub, D.A., Activation of the catalytic core of a group I intron by a remote 3' splice junction. Genes & Development 6, 1373-1385 (1992).
4. Jaeger, L., Westhof, E., Michel, F., Monitoring of the cooperative unfolding of the sunY group I intron of bacteriophage T4. The active form of the sunY ribozyme core is stabilized by multiple interactions with 3' terminal intron components. J. Mol. Biol. 234, 331-346 (1993).
5. Jaeger, L., Michel, F., Westhof, E., Involvement of a GNRA tetraloop in long-range RNA tertiary interactions. J. Mol. Biol. 236, 1271-1276 (1994).
6. Costa, M., Michel, F., Frequent use of the same tertiary motif by self-folding RNAs. EMBO J. 14, 1276-1285 (1995).
7. Costa, M., Michel, F., Rules for RNA recognition of GNRA tetraloops deduced by in vitro selection: comparison with in vivo evolution. EMBO J. 16, 3289-3302 (1997).
8. Costa, M., Fontaine, J.-M., Loiseaux-de Goer, S., Michel, F., A group II self-splicing intron from the brown alga Pylaiella littoralis is active at unusually low magnesium concentrations and forms populations of molecules with a uniform conformation. J. Mol. Biol. 274, 353-364 (1997).
9. Costa, M., Christian, E.L., Michel, F., Differential chemical probing of a group II self-splicing intron identifies bases involved in tertiary interactions and supports an alternative secondary structure model of domain V. RNA 4, 1055-1068 (1998).
10. Rivas, E., Eddy, S.R., A dynamic programming algorithm for RNA structure prediction including pseudoknots. J. Mol. Biol. 285, 2053-2058 (1999).
11. Brion, P., Michel, F., Schroeder, R., Westhof, E., Analysis of the cooperative thermal unfolding of the td intron of bacteriophage T4. Nucleic Acids Res. 27, 2494-2502 (1999).
12. Brion, P., Schroeder, R., Michel, F., Westhof, E., Influence of specific mutations on the thermal stability of the td group I intron in vitro and on its splicing efficiency in vivo: a comparative study. RNA 5, 947-958 (1999).
13. Costa, M., Michel, F., Westhof, E., A three-dimensional perspective on exon binding by a group II self-splicing intron. EMBO J. 19, 5007-5018 (2000).
RNA FOLDING IN RIBOSOME ASSEMBLY

JAMES R. WILLIAMSON
Department of Molecular Biology, The Scripps Research Institute, La Jolla, CA, USA
I want to extend what has been discussed to include the possibility of having proteins direct the folding. As we will see, that is a critical component of what is going on in assembly of the ribosome. Figure 1 shows a crystal structure, determined in our laboratory, of a piece of RNA with a protein bound to it. I will try to describe the role of a particular protein in guiding some critical steps in RNA folding as the ribosome assembles. The things we are talking about for ribosome assembly are applicable to the formation of all kinds of RNA complexes.
Figure 1. RNA-protein interactions in the assembling ribosome.
Figure 2 is an example from a paper from Ian Mattaj's lab. Certain small nuclear RNAs are exported from the nuclear compartment into the cytoplasm, and there is a defined series of steps that must take place for that to occur. A cap-binding complex (CBC) has to bind to the RNA, and only then does another protein, an adaptor protein called PHAX, bind to the complex, and only then does the exporting complex, Ran-GTP·Xpo1, form the functionally competent intermediate for
transport through the nuclear pore complex. All of the things discussed here having to do with an ordered, step-wise formation of a complex apply to the formation of any complex in general. This is a biologist's view of it: a bunch of blobs. What I hope to be able to give you is a molecular picture for at least one such step in a complex assembly pathway. What is different about the CBC-RNA complex that allows PHAX to bind? The basic issue we are trying to understand is the nature of the changes in the complex as things proceed along the assembly pathway.
[Figure 2 diagram: nucleus vs. cytoplasm; the U-snRNA, CBC, and Ran-GTP·Xpo1 components; assembly and disassembly steps.]

Ohno, Segref, Bachi, Wilm & Mattaj, Cell 101, 187-198 (2000).

Figure 2. Assembly of snRNPs for nuclear export.
Question: Is there experimental evidence for this sequence of steps?

Response: Yes; Ian Mattaj's lab has identified each of these complexes.

Question: What is the time-gap between each step?

Response: That is not clear; they did not measure that at all, but we will talk more about time. This past year has seen spectacular progress in understanding the structure of the ribosome. We now have a 5 Å crystal structure of the full 70S ribosome done by Harry Noller, a 2.4 Å crystal structure of the large subunit (50S) done by Tom Steitz and Peter Moore, and a structure at around 3 Å of the 30S subunit, done by Venki Ramakrishnan, as well as by Ada Yonath's laboratory [Fig. 3]. The 70S
ribosome is almost 3 million Daltons, the 50S approximately 2 million, and the 30S approximately one million. People have known, ever since they isolated and purified the components of ribosomes, that two-thirds of the mass of the ribosome is made of RNA. The RNA is color-coded in red and white in this figure, and the proteins are coded in blue. What we see here is where the two subunits come together in the 70S ribosome, and we are looking at the faces that form the interface, most of which consists of RNA. The best analogy for the ribosome is to think of it as an orange with a bunch of twigs stuck to the outside. The proteins are really on the outside, and the functionally important parts of the RNA are on the inside. Seeing this structure has been quite remarkable, because we knew the RNA was critical, but how critical it was turned out to be quite a surprise. The RNA carries out all the important functional aspects, but what is the role of the proteins? One of the things they are clearly doing is helping in the assembly and stabilization of the ribosome.
Figure 3. Anatomy of the bacterial ribosome (70S ribosome).

This is a molecular inventory for the 70S ribosome [Fig. 3]. I will now focus exclusively on the small subunit. There is a 16S RNA that is 1,500 nucleotides in length. This is quite large compared with most of the RNAs discussed earlier (see chapter by Woodson). Bound to it are 21 small-subunit proteins, called S proteins, numbered S1 through S21. Most of them are in the range of 10 to 20 kDa. For the most part, they are very small proteins bound to this very large, complicated RNA structure. An emerging theme that has been experimentally verified is that as RNA molecules get larger and larger, they actually fold to their final form more slowly [Fig. 4]. Sarah Woodson's talk introduced the time-scales: rapid formation of hairpins occurs on the microsecond time-scale, tRNA folds on the millisecond time-scale, and group I introns, which are about 400 nucleotides, fold on the second-to-minute time-scale. Here we have something four times the size of a group I intron; what will keep it folding on a proper track?
[Figure 4 slide: the rrnB operon (16S, tRNA, 23S, 5S); 18,700 ribosomes/cell at a 40-minute doubling time; 468 ribosomes/minute, i.e., ~8 ribosomes/second and ~1.1 rRNA transcripts/second/operon over 7 rRNA operons; transcription rate ~55 nt/second on a ~5,500-nt transcript, so ~100 seconds per transcript (0.01 transcripts/sec); ~50% of bacterial transcription is rRNA. Neidhardt, Escherichia coli and Salmonella typhimurium, Ch. 1 (1987).]

Figure 4. Log-phase bacterial ribosome biogenesis.
To give you an idea of the magnitude of this problem, I would like to discuss the demand for ribosomes inside bacterial cells. The role of a group I intron, 400 nucleotides long, is to self-splice, and it does so once; it performs one turnover in its life. The ribosome has to get made, stay stable, and be propagated from generation to generation. Ribosomes pass on by cytoplasmic inheritance from one cell generation to the next. There are three ribosomal RNAs, and they are all transcribed as part of one large operon; there is one big transcript, approximately 5,500 nucleotides long. If you were to do an inventory of how many ribosomes there are in a bacterial cell, you would find that there are almost 20,000; about 25% of the dry weight of a bacterium consists of ribosomes. Ribosomes are responsible for making all the proteins in the cell; bacteria are essentially packed with ribosomes. If a bacterium has a 40-minute doubling time, that means we must make 468 new ribosomes per minute, or about eight per second. The demand for ribosomes is so great in bacteria that there is not just one of these operons, there are seven of them;
seven operons constantly transcribing RNA. So we need about one transcript per second per operon to meet the demand for log-phase bacterial growth. It is well known that the transcription rate of RNA polymerase is approximately 55 nucleotides per second. This means it takes RNA polymerase approximately 100 seconds to transit the entire operon. As a side-note, 50% of the transcription inside the bacterium actually goes to making ribosomal RNA.
[Figure 5 slide: What is the steady-state concentration of assembling ribosomes? Assume transcription is rate-limiting for assembly: D -(k1)-> R* -(k2)-> R, so [R*]ss = k1[D]/k2. With k1 = 1.1 transcripts/(sec·operon), [D] = 7 operons/cell, and k2 = 0.01 transcripts/sec, [R*]ss = 770 per cell, i.e., ~5% of ribosomes are assembly intermediates. 1.1 transcripts/sec at 55 nt/sec implies an RNA polymerase every 50 nt.]
Figure 5. Steady-state concentration of ribosome assembly.

If we devise a very simple, steady-state hypothesis for what has to happen, we assume that the rate-limiting step in assembly of a ribosome is actually the making of the RNA; everything else is fast. We can then come up with a steady-state concentration of ribosome assembly intermediates [Fig. 5]. If we make one transcript per second per operon and assume it takes around 100 seconds to transit the whole operon, we come up with a steady-state concentration of about 770 assembling ribosomes per cell, or around 5% of the ribosomes in the process of actually being assembled. This (the equation on the right) represents a lower limit on the concentration of intermediates in the population, so if assembly is slower, this number goes way up. In addition, if you look at the demand for RNA synthesis, if you have 1.1 transcripts per second at 55 nucleotides per second, it means you have an RNA polymerase every 50 nucleotides. If you measure the RNA polymerase footprint, it is approximately 50 nucleotides. I may have dropped a factor of two somewhere in this "back of the envelope" analysis, but basically, the ribosomal operon is
184 J. R. Williamson completely loaded with RNA polymerases. They are chugging along, and then nascent transcripts emerge from each polymerase. We anticipate (as shown in Hcrve Isambert's talk) that these ribosomes are assembling co-transcriptionally.
[Figure 6 inventory: 23S rRNA, ~1000 kDa, 2904 nucleotides; 5S rRNA, ~40 kDa, 120 nucleotides; 34 "L" proteins, 5 to 25 kDa; 16S rRNA, ~500 kDa, 1542 nucleotides; 21 "S" proteins, 8 to 60 kDa; 30S subunit, 0.9 MDa.]

Figure 6. Components of the 70S ribosome.
Figure 7. The structure of the 30S ribosomal subunit.

Shown here [Fig. 7] is the tertiary RNA structure, rendered in red. You can also see that there are proteins in this figure. The solvent face is on the right, and on the left is the face that faces the 50S subunit. As you can see, most of the proteins are
located on the periphery, on the outside, and not so much at the center. The mRNA would thread through the upper portion (left), which is also the decoding region, the heart of where the mRNA is decoded into the protein sequence. Of course, peptide-bond formation occurs on the large subunit, but this same area on this subunit is where the codon/anti-codon interaction occurs. One thing that has become clear (and a good example was given by Sarah Woodson with regard to the group I intron) is that these large RNAs seem to fold into quite large and fairly stable domains. There are three big domains in the 30S subunit: the 5' domain (in blue), the central domain (in light yellow), and the 3' domain (light purple) [Fig. 7]. I will use the rest of my time to discuss experimental approaches used to understand the kinds of conformational changes and protein-binding events that occur during assembly of the central domain. In a sense, we have taken a bit of a reductionist approach. Pictured here is the central domain, which is also called the platform region. The 3' domain is also called the head, and the 5' domain is called the body [Fig. 8]. Again, the mRNA threads through the interstices, where the three domains meet. The red helix is also called the decoding helix; mRNA decoding occurs roughly in that red region. The central domain is more or less the anvil on which the genetic code is read out.

Question: Can you explain the central domain that seems to be protruding into the body?

Response: Yes. This area in light blue is an interesting intra-subunit interaction, and I can show you exactly where that is, but we are not going to talk about that at all. We are going to talk about the folding of this globular part, which is important for inter-domain interactions. But I'm actually going to just cut it off as fast as I can. I consider this to be absolutely remarkable and heroic work, done 25 years ago by Nomura [Fig. 10].
He showed two things: First, that you can purify all the ribosomal proteins to homogeneity and then reconstitute them onto the 16S RNA, to get a functional 30S subunit. The second thing he did was use this reconstitution of the 30S subunit to demonstrate the order in which the proteins were incorporated into the nascent ribosome. He showed that there are several proteins that interact with the RNA in the absence of any other proteins; these are termed primary binding proteins. There is another set of proteins that require prior binding of one protein, now known as secondary binding proteins. Tertiary binding proteins then
require prior incorporation of two or more proteins. During Nomura's time, the domain organization that I am showing here was not appreciated.
[Figure panels: Central Domain, 3'-Domain, 3'-Minor Domain, 5'-Domain; Thermus thermophilus.]
Figure 8. Bacterial 16S ribosomal RNA secondary structure.
Figure 9. Domain structure of the 30S ribosomal subunit.
Question: Do these proteins interact among themselves?

Response: That is part of the story; the short answer is yes, some do, whereas most of them do not. Nomura did not appreciate the domain structure, as in Figures 8 and 9, and I have redrawn his classic map in this form to reflect the domain organization. There is a primary organizer for each domain, and I will now talk about the central domain, which is organized by the binding of the S15 protein. After S15 binds, a pair of proteins, S6 and S18, bind cooperatively, followed by S11 and S21. The question just asked about this inter-domain interaction turns out to concern S8, which seems to potentiate the binding of proteins to the 5' domain. It does not do anything in the central domain; it binds to the lower helix, and I will cut that off.
[Figure 10 diagram: 16S rRNA with primary, secondary, and tertiary binding proteins, arranged by 5'-domain, central domain, and 3'-domain; based on the Nomura map and Noller OH footprinting.]
Figure 10. 30S assembly map.
I will go through the experimental approaches used to answer these questions. We have the assembly pathway of Figure 10. How does S15 recognize the RNA? Why is the binding of the S6 and S18 proteins so cooperative? Furthermore, how does S15 exert its effect, such that after S15 binds, S6 and S18 can bind? There are two possible answers to the latter question: either there are protein-protein interactions between them, or the effect of S15 is mediated at the level of stabilizing the RNA structure, making the binding site right for S6 and S18.
I will distill all these data into a very brief slide [Fig. 11]. A number of research groups had localized the binding of S15 to one region. We identified the minimal binding site as a three-way junction deriving from the central domain. One consequence of the fact that secondary structures in RNA are so thermodynamically stable is that we can frequently dissect pieces out and ensure that they fold properly, simply by adding what we call stable tetraloops to cap the ends of the helices. We measured the binding constant of S15 for these constructs to be 5 nanomolar, so we have captured the thermodynamic essence of binding. Furthermore, the nucleotides boxed in blue are conserved, and if you mutate them, S15 no longer binds. They are conserved in 16S RNA because they form the S15 binding site.
[Figure 11 slide: central domain. How does S15 recognize 16S rRNA? How is the binding of S6 and S18 cooperative? How does S15 direct the binding of S6 and S18?]

Figure 11. Key questions in 30S central domain assembly.

We can consider this minimal S15 binding site as a three-way helical junction, drawn schematically here [Fig. 12]. Early on, we had a hint that S15 induced a conformational change in this three-way junction. We wanted a way to quantify this junction, so we collaborated with Paul Hagerman, who is at the Denver Health Science Center, in Colorado. Paul Hagerman adapted a well-known technique, called transient electric birefringence [Fig. 13], to allow us to measure the angle between two helices. This is a beautiful application of a simple experiment.
Figure 12. The minimal S15 binding site.

It consists of a simple apparatus and a cell containing the RNA solution. You apply a strong electric field, around 1000 volts per centimeter, across the cell. The molecules tend to align along the electric field, and that induces birefringence, which is simply a difference in the refractive index in directions perpendicular and parallel to the cell. The birefringence may then be read with the polarizer and analyzer filters. First the voltage is turned on, then the molecules align, after which birefringence is induced. Then you turn the voltage off, and the molecules go from their somewhat aligned orientation along the field back to a random orientation. The rate at which they do this is related to the hydrodynamic radius of the molecule. Hagerman's approach was very clever. Since we want to know the angle between two of these arms, the rotational correlation time should be made extremely sensitive to this angle by extending a pair of helices. This was done by adding approximately 200 base-pairs of RNA helix while keeping the central junction, since it contains the S15 binding site. The longest rotational correlation time for this asymmetric molecule depends exquisitely on the angle between these two very long helices. These decays are shown schematically in Figure 13. If you have a linear molecule you get a very long decay, whereas if the molecule is bent you obtain a much shorter decay. Hagerman developed a hydrodynamic theory to approximate these molecules as a series of small spheres and analytically
calculated the expected rotational correlation times, such that it was possible to turn these TEB decay lifetimes into an interhelical angle.
[Figure 13 schematic: laser, polarizer, sample cell, analyzer, detector; birefringence rise during the voltage pulse and decay after it; time axis in microseconds. Adapted from Hagerman (1996) Curr. Opin. Struct. Biol. 6, 643.]

Figure 13. Measurement of interhelical angles using transient electric birefringence.
Figure 14. S15-induced conformational change (free RNA vs. S15-RNA complex).

[Figure 15 summary: S15 initiates central domain assembly; conserved nucleotides are important for S15 binding; the conformation of the 3-helix junction is important for S15 binding.]

Figure 15. The minimal S15 binding site.

Once we knew there was a scissoring motion between these two arms, we could set up a very convenient fluorescence assay to monitor the conformational change [Fig. 16]. This has also allowed us to measure additional thermodynamic and kinetic parameters. We can synthesize our RNA from three separate pieces: attach a fluorescein dye to one end and a Cy3 dye to another end. This makes a donor and acceptor fluorophore pair. We can then attach something else to the third arm, for instance biotin, which allows us to immobilize the RNA on a surface. The basic idea is that the two arms undergo this scissoring motion in response to S15 and magnesium ions. When the protein is not bound we have the open conformation, and the
distance between the two chromophores is well beyond the Förster distance for efficient energy transfer between the two dyes. If we excite the fluorescein donor, we see the fluorescein fluorescence as green. However, in the closed conformation, the two chromophores are in close proximity, we get efficient energy transfer, and we see the orange Cy3 fluorescence when we excite the fluorescein dye.
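The distance dependence that makes this assay work is the standard Förster relation, E = 1/(1 + (r/R0)^6). The sketch below is generic textbook physics, not the authors' analysis, and the R0 of 55 Å for a fluorescein/Cy3 pair is only a ballpark value assumed for illustration.

```python
def fret_efficiency(r_angstrom: float, r0_angstrom: float) -> float:
    """Förster resonance energy-transfer efficiency at donor-acceptor
    distance r, for a pair with Förster distance R0 (E = 0.5 at r = R0)."""
    return 1.0 / (1.0 + (r_angstrom / r0_angstrom) ** 6)

R0 = 55.0  # Angstroms; ballpark for a fluorescein/Cy3 pair (illustrative)

# Closed conformation: chromophores close, transfer efficient (orange Cy3).
# At r = R0: half-transfer. Open conformation: arms far apart, transfer
# negligible (green fluorescein).
for r in (30.0, 55.0, 110.0):
    print(f"r = {r:5.1f} A  ->  E = {fret_efficiency(r, R0):.3f}")
```

The sixth-power dependence is what makes the scissoring motion, which moves the dyes by tens of Ångströms, read out as an essentially binary green/orange switch.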
r
4?3 W-Biotin
^
!««!«!
HAA
S-K V* * open
closed
free
bound
Figure 16. A fluorescence assay for S15 binding.
We have done a variety of solution experiments. I will discuss just one, done in collaboration with Steve Chu's lab: single-molecule fluorescence. They built a microscope with which you can look at this 10-micron area, and each of the blobs corresponds to an individual molecule that has been immobilized on the surface [Fig. 17]. You have to integrate the fluorescence over about 5 milliseconds in order to see it, so this might be on the order of 10,000 photons, but each photon came from an individual molecule. We washed the S15 protein, at approximately the KD concentration in solution, over these immobilized RNA molecules. When doing bulk fluorescence, in order to do a titration, you usually measure the fluorescence intensity of the donor as a function of the protein concentration. Here we can actually count the molecules. At the KD, half the
RNA Folding in Ribosome Assembly
193
molecules should have a protein and half should not. In fact, we see that half the molecules are green, indicating they are in the open conformation, and half of them
are orange. Basically, we are able to reproduce things that can be done in solution. But the power of single-molecule methods is that you can get at the details of the ensemble; you can just read them out. Just recently, experiments have been done in which we can do autocorrelation analysis of the donor wavelength, autocorrelation of the acceptor, or cross-correlation between the donor and the acceptor. If you watch one molecule, you see that the donor fluorescence will go on, then blink off; go on and off, etc. You see a stochastic change of state. If you do autocorrelation analysis, you get the decay rate, which allows you to measure the opening and closing rates. We have done that as a function of the ion concentration, which reveals that it is actually not a simple opening and closing. Ion-driven folding is more complicated than a simple two-state transition. This is interesting because we could never have seen that by doing steady-state fluorescence. This is a very powerful method now in use in a number of labs throughout the world.
individual RNA junctions immobilized on a surface in the presence of S15
Ha, Zhuang, Kim, Orr, Williamson & Chu, PNAS 96, 9077 (1999). Figure 17. Single molecule fluorescence.
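The autocorrelation analysis described above can be illustrated with a toy model. For a molecule toggling stochastically between open and closed, the fluorescence autocorrelation decays with rate k_open + k_close, and the individual rates follow from the mean dwell times in each state. The sketch below simulates such a two-state telegraph process; the rate constants are hypothetical, chosen only for illustration.

```python
import random

def simulate_two_state(k_open, k_close, n_transitions=20000, seed=1):
    """Gillespie-style simulation of a molecule toggling open/closed.
    Returns the mean dwell time in each state."""
    rng = random.Random(seed)
    dwell = {"open": [], "closed": []}
    state = "open"
    for _ in range(n_transitions):
        # Leaving 'open' happens at rate k_close (closing), and vice versa.
        rate = k_close if state == "open" else k_open
        dwell[state].append(rng.expovariate(rate))
        state = "closed" if state == "open" else "open"
    mean_open = sum(dwell["open"]) / len(dwell["open"])
    mean_closed = sum(dwell["closed"]) / len(dwell["closed"])
    return mean_open, mean_closed

# Hypothetical rates (s^-1); the autocorrelation decay rate would be their sum.
mean_open, mean_closed = simulate_two_state(k_open=2.0, k_close=3.0)
print(1.0 / mean_open)    # estimates k_close
print(1.0 / mean_closed)  # estimates k_open
```

With enough transitions the inverse mean dwell times recover the input rates; in the real data, a decay that is not single-exponential is exactly what signals that ion-driven folding is not a simple two-state process.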
The conformational change I just described is in these three helices. Within the two that are scissored together [Fig. 17] I have drawn the protein contacts to the rest of the central domain. After S15 binds, two other proteins, S6 and S18, bind. Note that they footprint in two separate places, as does S11. The proteins that bind downstream from S15 have bi-partite footprints separated by the two arms that undergo this scissoring motion. One of the things S15 does is consolidate the global structure of this RNA, bringing together two downstream parts of the protein binding sites in the assembly of the domain. This is an important point, because the degrees of freedom of large amounts of RNA are being restricted, and the parts that are attached to this junction are undergoing motions on the scale of tens of Angstroms. This has dramatic consequences in terms of consolidating the structure.
Figure 18. Reconstitution of central domain RNPs.
We were able to do deletion analysis of the central domain and start chopping off various pieces, coming up with an interesting observation [Fig. 18]: We could chop off this whole region (right-hand area of figure on top-left) and still bind all the proteins that bind to the central domain. As already mentioned, we can chop off the
helix that is the S8 binding site, and that does not affect the other proteins. Now we have half the central domain, which still binds most of the proteins. If we chop off the helix where S11 binds, we get a small piece (bottom middle) that binds the three proteins shown below, and this (bottom right) is our minimal S15 site. What is it that S15 does to this structure (bottom middle) that makes it so that S6 and S18 can bind? It is not so clear.
Figure 19. A series of RNA-protein complexes are intermediates in central domain assembly; assembly proceeds by discrete and sequential steps. Agalarov & Williamson, RNA 6, 402 (2000).
What I like about this series of RNPs is that I could show you the assembly pathway. Here we have a series of RNA constructs that correspond to intermediates in this discrete assembly process [Fig. 19]. We have an intermediate, where one protein is bound, where three proteins are bound, and where four proteins are bound, and we can put them in a bottle and measure the transitions between the states along this cascade. This is the power of the reductionist approach.
Question: Are S6 and S18 binding RNA, or are they binding S15?

Response: Those are in fact the two possibilities. We will ask and answer that in a moment. We solved the crystal structure of this piece (second from right), and mere months after it was finished, the structure of the whole 30S subunit came out. The structure that I'm going to show you of our piece is basically identical to what is in the 30S subunit.
Figure 20. The T. thermophilus S15, S6, S18-RNA complex.
Here is the structure [Fig. 20], with the S15 protein and the 3-helix junction in which I described the conformational change. S6 and S18 are at the top, so we have a little bit of both. S6 and S18 clearly form an intimate heterodimer, and that is likely to be why their binding is cooperative. You never see S6 without S18. However, there are no protein-protein contacts between S15 and S6 or S18. This means that the effect mediated by S15, which allows the next two proteins to bind, is exerted entirely through stabilizing the RNA structure. There is something
RNA Folding in Ribosome Assembly 197 unstable about the RNA in this region (middle region of figure on right). Once S15 binds, it stabilizes the proper conformation, so that the next two proteins may bind. The S15 structure [Fig. 20] is that of a helical protein. Three helices form the RNA binding face. We see specific-binding amino-acid contacts, and there is a pink loop that folds down. We are actually getting contacts to this loop that we did not anticipate. We cut this pink loop off, and S15 did not seem to care. However, in the complex, we saw contacts between S15 and the residues, and S6 and S18 binding above (not shown). We have now done extensive thermodynamic analysis of the binding of S6 and S18 to the pre-formed S15 complex. We did this with purified components from Aquifex aeolicus, which is a hyper-thermophile that we expressed in E. coli. That is not really important. All these proteins are virtually conserved throughout bacterial sequences. Here are the two possible explanations for why S6 and S18 bind cooperatively [Fig. 21J. We could have the two proteins binding together to form a heterodimer, which then binds to the S15 RNA complex (left), or you could imagine a sequential and weak binding of either protein in either order (right). As I will show you in a minute, it turns out that the two proteins actually form a pre-formed heterodimer.
Model 1- S6 and S18 bind as pre-formed heterodimer:
Model 2- sequential binding of S6 and S18:
Figure 21. Two possible mechanisms for binding of S6 and S18 to the S15-RNA complex.
This is an example of how we measure the affinity of these two things, using what is called a gel-shift experiment [Fig. 22]. The reaction we are monitoring is the binding of S15
to RNA (top right). In each lane, electrophoresis runs from top to bottom. The free RNA moves at the indicated mobility. When the protein comes on board, you get a complex at a different mobility. All we do is measure the fraction of the RNA that appears in the S15-RNA complex, and we get a KD that is around 5 nanomolar, which is the same in Thermus thermophilus and in E. coli.
Figure 22. Polyacrylamide gel-shift assay for S15 binding to RNA ([S15] titration; 40 °C, 20 mM K-HEPES, pH 7.6, 330 mM KCl).
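For a trace-labeled RNA titrated with excess protein, the shifted fraction in such a gel follows the simple binding isotherm f = [S15]/(KD + [S15]). A minimal sketch, assuming the ~5 nM KD quoted above and that the free protein concentration is approximately the total:

```python
def fraction_bound(protein_conc_nM: float, kd_nM: float = 5.0) -> float:
    """Fraction of RNA shifted into the complex, assuming the protein is in
    excess over the (trace) labeled RNA, so free [P] ~ total [P]."""
    return protein_conc_nM / (kd_nM + protein_conc_nM)

for p in (0.5, 5.0, 50.0):  # nM S15
    print(p, fraction_bound(p))
```

At [S15] = KD, exactly half the RNA is shifted, which is how the midpoint of the titration reads out the KD.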
To measure the binding of S6 and S18, we carried out an isothermal titration calorimetry (ITC) experiment [Fig. 23]. One of the proteins, say S6, is put into a thermostatically controlled sample cell, and then S18 is introduced into the syringe. You inject a small aliquot into the cell and directly measure the heat that evolves due to the positive or negative binding enthalpy between the two proteins. You keep titrating in the protein until you have formed a 1:1 complex. Then you sum this up and get the ΔH, and you can also fit what the KD is within reasonable limits [Fig. 24]. We inject a small amount of protein and get a spike; we actually get some heat evolved. We integrate that, wait a while, and
inject another aliquot. We are basically doing a titration of S6 into S18. Eventually we add more S6 and no more heat evolves, so we have a 1:1 complex; it turns out to happen at a 1:1 ratio. The KD for the formation of this dimer is about 8 nanomolar, and the ΔH is about −16 kcal per mole, which is very typical of a protein-protein interaction. We can use this preformed heterodimer to do another gel-shift, a sort of super-shift experiment, to monitor the binding of the heterodimer to the S15-RNA complex [Fig. 25]. Again, we have electrophoresis from the top to the bottom; shown are the free RNA at a certain mobility and the pre-formed S15-RNA complex. We have saturated the RNA with S15, and now we titrate in increasing amounts of the pre-formed heterodimer and see a higher-order complex coming in. We can measure the KD for this complex, which is approximately 5 nanomolar.
As ligand is injected through the syringe into the sample cell, heat is generated or consumed; the ITC instrument maintains a constant temperature difference between the reference cell and the sample cell. Binding is measured directly by the heat evolved: the ΔH of binding.

Figure 23. Isothermal titration calorimetry (ITC).
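The shape of such a titration can be sketched from the exact 1:1 binding equation. Because both proteins are at micromolar concentrations while the KD is nanomolar, complex formation tracks the limiting species and the heat signal cuts off sharply at the 1:1 point. A sketch under those assumptions (concentrations as in the S6:S18 titration; all in µM):

```python
import math

def complex_conc(a_tot, b_tot, kd):
    """Exact 1:1 complex concentration (same units throughout) from the
    binding quadratic: [AB] = ((a+b+Kd) - sqrt((a+b+Kd)^2 - 4ab)) / 2."""
    s = a_tot + b_tot + kd
    return (s - math.sqrt(s * s - 4.0 * a_tot * b_tot)) / 2.0

# Titrating S6 into 20 uM S18 with Kd = 8.4 nM (0.0084 uM): the complex
# concentration tracks the limiting species and saturates at the 1:1 point.
for s6 in (5.0, 10.0, 20.0, 40.0):  # uM S6 added
    print(s6, complex_conc(s6, 20.0, 0.0084))
```

The cumulative heat at each point of the titration is proportional to the complex concentration times ΔH, so integrating the injection spikes traces out this curve.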
I will now describe the interesting thing that gets to the very heart of the cooperativity in this binding. I previously described the ITC experiment, in which a
lot of heat is evolved as we bind S6 and S18 to the RNA with S15 bound. I have done three other experiments that are also shown here [Fig. 26], and there is no heat. If we titrate in S6 without S18, or S18 without S6, there is no binding. And if we titrate the pre-formed heterodimer into the RNA with no bound S15, there is no binding. This means there is an absolute hierarchy and required assembly order for these proteins. There are no parallel assembly pathways. S15 must bind first, because S6 and S18 do not bind to this RNA by themselves, and we cannot bind either S6 or S18 alone without its partner [Fig. 27]. We have to pre-form the heterodimer and wait for S15 to bind; only then can it bind to the assembling particle. The ribosome assembly pathway has a built-in obligate ordered pathway to literally ensure that you do not form other interactions until the first interactions that are needed to nucleate the folding pathway have been established.
Kd = 8.4 ± 1.1 nM; ΔH = −16 ± 0.1 kcal/mol; ΔS = −14.64 cal/mol·K. 20 µM S18 in cell, 154 µM S6 in syringe; 40 °C, 20 mM K-HEPES, pH 7.6, 330 mM KCl, 20 mM MgCl2, 1 mM DTT.

Figure 24. Isothermal titration calorimetry of the S6:S18 complex.

What is the molecular basis for that occurrence? Here is the complex that was just shown [Fig. 28], in which S15 is making contacts at the lower junction and at
the upper junction, shown in pink (left). A highlight of this region is on the right. S15 is doing two things: there are amino-acid side-chain contacts to this pink loop, and in addition, there is what we call an inter-helical base-pair. This is a nucleotide from the middle of the green helix that is stuck into the middle of the pink helix. We get a non-Watson-Crick base-pair, but it stacks perfectly well into the pink helix, which we call helix 23a. This is a very laminar structure. It is a little like plywood; you are basically locking this helix parallel to helix 22 by this base-pair. That interaction is buttressed by the interactions of S15 with this tetra-loop structure. Apparently this structure by itself is unstable and cannot bind the proteins in the absence of this inter-helical base-pair or the S15 interaction. At this point we really understand at the molecular level the structural basis for the obligate hierarchical assembly that we see in this part of the assembly of the central domain.
Kd = 5.4 ± 0.1 nM; [S15] = 100 nM; 40 °C, 20 mM K-HEPES, pH 7.6, 330 mM KCl, 10 mM MgCl2.

Figure 25. Formation of the S6:S18-S15-RNA complex monitored by gel-shift assay.

A few comments with regard to the bigger picture and what is going on: This is the central domain [Fig. 29]. If I extract it from the entire 30S structure, what we see is that the protein binding sites for all these central domain proteins are on the
outside, and the three helices (top left) form this long, continuous coaxial stack (right). The helix that makes up the inter-domain contact that was asked about earlier is at the bottom right. The figure on the right is really a globular structure. The protein binding site consists of the three helices on the left that form a coaxial stack. All the other parts of the figure on the left form the globular folded domain (right), and this other region that is folded up (left) actually contains the functionally important parts of the 30S subunit. The 790 loop, one of the most conserved sequences in the 30S subunit, is part of the P-site, where one of the tRNAs binds during protein synthesis. The other helix is called the Dahlberg switch helix, which undergoes a base-pairing register shift as the ribosome translocates from one codon to the next. These are functionally important and potentially flexible regions of the central domain and have nothing to do with protein binding.
ITC reveals no RNA binding of: S6 without S18; S18 without S6; S6:S18 without S15 (Kd > 100 µM; 10 µM components in cell).

Figure 26. ITC analysis of protein binding cooperativity.
Kd = 8 nM for S6 + S18 forming the S6:S18 heterodimer; Kd = 6 nM for the heterodimer binding the S15-RNA complex (40 °C, 20 mM K-HEPES, pH 7.6, 330 mM KCl, 10 mM MgCl2).

Figure 27. Thermodynamic parameters for Aquifex S15:S6:S18 assembly.
Figure 28. S15 contacts to helix 23a; the interhelical base pair is indicated.
We can further subdivide the central domain into two subdomains: a primary subdomain responsible for protein binding, which apparently folds first and forms the template upon which the functionally important secondary subdomain (highlighted in blue) can now assemble. Everything I have described is literally the folding of the two three-way junctions (in white) in order to set up this RNP scaffold, upon which the part in blue is then assembled. In general, what can we say from looking at the binding of this small number of proteins? We can write a kind of mechanism for the assembly of the central domain [Fig. 29]. The top-left figure shows the helices in the central domain that we have already looked at. The second figure (to the right) describes the conformational change. After S15 binds, we see another conformational change in the third figure. On the bottom left, we still do not understand the basis for S11 requiring S6 and S18 binding; that study is ongoing. Once all the proteins are bound (bottom right), the second subdomain folds up to form the P-site and the decoding region.
Figure 29. A hierarchy of RNA subdomains; the secondary subdomain is the functionally important one.
What we see here is that the mechanism literally consists of an alternating series of conformational changes and protein-binding events. The protein-binding events seem to consolidate relatively unstable parts of the RNA structure. You have something that is very flexible; it folds and adopts the bound conformation, then the protein comes along, locks it down, and reduces the unfolding rate of the RNA in that local region so that the next step can occur. To return to what I started talking about earlier, one might expect such a large RNA to fold extremely slowly, but the biology of ribosome biogenesis is such that we know these things have to fold accurately and fast. It seems that the assembly process is encoded in the sequence; it is mediated through the formation of local RNA structure, with conformational changes that are locked down by protein-binding events. That is how you can inexorably close in on a unique folded structure, even though you have this very large, complicated sequence that has to encode that function.

Question: Is this assembly of protein folds co-transcriptional?

Response: It has to be. It has been shown that the 5' domain proteins seem to assemble first, followed by the central domain proteins, then the 3' domain proteins. That even happens in vitro. In part, there is a similarity between the co-transcriptional folding and the folding initiated from denatured RNA...

Question: Does RNA folding go along with the assembly of proteins?

Response: Yes, absolutely. We have used small pieces, with which we are less concerned about the formation of the secondary structure. For a piece such as this, transcription would take a few seconds. It is fast, compared to most things we are worried about. Protein binding rates in vitro are on the order of 10^5 M^-1 s^-1, which is several orders of magnitude slower than diffusion.
However, inside the cell, protein concentrations are reasonably high, so one might expect the actual rates to be on the order of seconds. It probably takes 25 seconds to go through 16S RNA. As was shown in Herve Isambert's discussion, as soon as you get a local secondary structure, it will form. That happens very fast. Then you need to make sure that inappropriate structures are not formed. I should say that one of the reasons RNA folding slows down as the length increases is that you increase the probability of forming misfolded structures. Peter Schuster used a simple structure to very nicely show that you could have slow folding by the unfolding of something
that was improperly folded. Sarah Woodson then showed that for the group I intron there is the alt P3, an inappropriate helix that forms late, and a bunch of stuff has to unfold in order for that to resolve itself. The more structure that forms, the larger the activation barrier becomes for unfolding. For the ribosome, it becomes critical that you not get stuck in something that is misfolded. If you make the whole thing and you have a wrong helix in the 5' domain, you will never get it undone. In part, what the ribosomal proteins are doing is making sure that all of this happens; but they are not chaperones, because they are incorporated into the final product.

Question: Is this the same for eukaryotes?

Response: It is much more complicated in eukaryotes, for a variety of reasons. There are small nucleolar RNAs that are complementary to these sequences, which bind and are important for RNA processing, as well as the modification. Helicases are also involved, and it is much more complicated. In part, it could be that they seize control of the orchestration of all these events. With a bacterial ribosome you can get away with this chaotic assembly pathway. By making the mechanism more complicated, you also can exert more control. Perhaps that is the reason for these elaborations in eukaryotic ribosomes.

Question: Once you admit the constraint that every time a proper contact is made it is left alone, in a sense, you are making rules for the proteins. It would seem that they should also avoid those traps in their hierarchical pathway, is that correct?

Response: Right. That sort of depends on the energy function used to calculate it. You can always find traps, depending on what your potential function is.

Question: Do you have any idea why, among all the stabilizers, G-A mismatches, rather than, say, a G-C base-pair?

Response: That is an interesting question. We're talking about the inter-helix base-pair shown in Figure 28.
It turns out that this pair was missed by phylogenetic comparison, because it does not co-vary. It is not clear that it actually matters exactly what the base-pair is. There are probably some geometric constraints with regard to which way the bases are pointing. I think either of the nucleotides in the bulge (on left) could fit into a pocket and form a stable structure. There are probably cases in which one nucleotide is in and the other is out, which is why it
could be missed in phylogeny. I do not think there is anything magical about this. It is not conserved in phylogeny; however, the existence of the bulge and the fact that there are two nucleotides is conserved. This is an interesting point that we do not quite understand. It is a detail that we are currently thinking about. We have deleted one of the nucleotides and found that S15 does not care at all. However, S6 and S18 do not incorporate into the complex, which shows us that this is critical for binding.

Question: Do all ribosomes have exactly the same structure in a given cell?

Response: Yes; while that may not be exactly true, we know that they all have to perform the same function, and that has to do with high fidelity; they are all making proteins. Where the ribosomes might differ is that they can have all sorts of translation factors bound, and you can modulate the specificity of protein synthesis by binding all types of things. The translation apparatus goes way beyond the confines of the ribosome. The other thing is that if you have antibiotics bound, you can bind them to the RNA and they can disrupt protein synthesis, probably in a very heterogeneous way.

Question: Does that depend on the location within the cell?

Response: There is binding of ribosomes to the membrane, so those could be somewhat different.

Question: [inaudible]

Response: There are plenty of cases in which you have mutants. If you treat them with certain antibiotics you disrupt ribosome assembly, and you can find shards of ribosomes lying around.

Question: Is there a homolog of the S15 protein in eukaryotic ribosomes?

Response: No. It turns out that the three-way junction structure is completely different in eukaryotic ribosomes. I do not think there is an S15 homolog.

Comment: It might be related to the fact that in eukaryotes the proteins are assembled in the ribosomes in a different way.
Response: Yes; that could be the case. That whole relationship is interesting. Some ribosomal proteins are conserved in prokaryotes and eukaryotes, some are unique to prokaryotes, some unique to eukaryotes. Some are uniquely conserved in archaeobacteria or eukaryotes, and some others in archaeobacteria and bacteria. There are all kinds of different ways this might be orchestrated, and the assembly pathway may depend on what kingdom you are in. That is a good point.

Comment: Concerning ribosome heterogeneity, I believe there were data in the 1970s on the dispersion of translation rates on ribosomes.

Response: In vitro?

Comment: I believe it was in vivo. There was a time when people were measuring protein synthesis rates.

Response: On a homogeneous messenger RNA?

Comment: Yes, I believe so.

Response: One thing I will say about the ribosome field is that the vastness of the literature is humbling, and most of it is not accessible on computer.

Question: You've checked the Stanford database?

Response: Yes, but most of that is literature that has to do with individual mutations and chemical modifications and biochemical data, not biological data. There is a huge amount of interesting stuff out there that I, unfortunately, do not have at my fingertips.

Question: What exactly is known about the in vivo assembly map?

Response: I think all that is known is that, roughly, you see the 5'-to-3' assembly. It is a rather difficult experiment to do. One experiment that I know of is essentially an isotope chase experiment, where you grow bacteria, throw in tritiated amino acids and 32P nucleotides and measure the rate of incorporation of 32P and tritium into ribosomes, and you can see some of the 5' proteins come on before. But it is a hellish experiment and I think the dynamic range for the measurement of fractions is
not very large. It would be good to repeat those experiments, since they were probably done twenty years ago.

Question: [inaudible]

Response: In principle, you could use FRET by labeling various proteins, absolutely. Then you presumably would have to use GFP fusions, and then, of course, someone could say "the GFP is not perturbing the folding pathway." But it is a good idea that is under consideration.

Question: Are you working with E. coli?

Response: We have worked with E. coli, Bacillus stearothermophilus, Thermus thermophilus, and Aquifex aeolicus, as a matter of convenience.
Question: But the full 70S subunit was Halobacterium?

Response: The 50S subunit was Halobacterium, which is an archaeobacterium, although just recently the eubacterial one came from Ada Yonath's lab.

Question: How do you translate the archaeobacterial information to E. coli, for example?

Response: Most of it is conserved. You just look at the RNAs, and they are conserved. There are certain expansion sequences...

Comment: But the proteins are not.

Response: Many of them are. At least half the ribosomal proteins have direct homologues in eubacteria. There is a variety that does not. I think if you look at Yonath's paper that came out in Cell, you will see that they go into great detail about the correspondences between ribosomal proteins and who substitutes for whom in which structure. I have not yet absorbed that paper, but I know that information is in there.

Comment: I assume they had to wait until the structure came out before they could really take that apart.
Response: Yonath just recently did the eubacterial 50S, using molecular replacement from the Archaea structure. So that has been in the database for some time, perhaps almost a year.
FROM RNA SEQUENCES TO FOLDING PATHWAYS AND STRUCTURES: A PERSPECTIVE

HERVE ISAMBERT

LDFC, Institut de Physique, Strasbourg, France
My talk today concerns RNA folding. Our group is trying to understand the process of RNA folding, going from its primary nucleotide sequence to its secondary structure. The approach we are developing is somewhat complementary to those Michael Zuker and Peter Schuster presented earlier today. We are interested in modeling RNA folding and unfolding kinetics. The idea behind modeling RNA folding kinetics is that in principle it may not only be used to predict RNA secondary structure, but also to potentially learn something about the folding pathways of these molecules. In addition, attempting to model RNA folding kinetics also allows us to predict more complex secondary structures, including pseudo-knots, which are secondary structures that do not resemble trees [Fig. 1]. Naturally, in developing such a dynamic approach it is helpful to have a tool for visualizing what the algorithm is predicting. For this reason, we have adapted the software "RNAmovie," designed by the Bielefeld group, in Germany. This software displays RNA folding pathways, including pseudo-knots, in a movie format and helps to analyze data predicted with the actual folding algorithms. Our primary goal is not only to decode and predict RNA secondary structures, but also to decode their folding pathways. For example, I will discuss the folding pathways of the group I intron, which has already been mentioned several times today. We have other objectives beyond this primary goal, such as trying to model and understand the dynamics of antisense hybridization of at least partially complementary molecules. I will show some results that involve the HIV-1 initiation complex. Another topic that interests us is micro-mechanical unfolding of RNA, which essentially concerns a secondary structure being pulled apart by what turn out to be minute forces applied to single molecules by a "large" apparatus.
Some very elegant experiments of this type have been conducted in our laboratory at the Institute of Physics in Strasbourg, France.
Figure 1. Decoding RNA folding pathways: Proposed stability exchange between two competing helices forming sequentially during transcription of the hepatitis delta virus ribozyme. The strong, yet transient, helix P8 guides the nucleation of P4.
Beyond these attempts to decode the information stored in the RNA primary sequence, which generates secondary structures, we are also interested in a sort of reverse-engineering approach, in which the idea is to design artificial sequences that exhibit interesting or puzzling behaviors with respect to folding pathways. For instance, we are trying to develop some bi-stable RNA molecular toys, both by computer simulation and in real laboratory experiments. First I will present a short visual demonstration of the software we have adapted for use in studying these molecules, which will give you some idea of the software tool we are using. Of course, it is a visualization tool, and you may have questions, which I will be glad to try to answer. This is an RNA molecule being synthesized from its 5'-end. As you can see, the molecule is in fact folding while it is being synthesized. You can see green helices being formed, and orange stretches linked by thin blue lines, which correspond to pseudo-knots. You have to imagine that these orange helices are essentially identical to the other helices in the model. It is just for purposes of visualization that they are somewhat pulled apart like this. You can also see that some of these helices are transient; they appear at one point, then disappear. We are studying these transient helices, which are quite interesting. The software we use to display these structures was developed by the Bielefeld group. We did have to add the pseudo-knot feature, which was not in the original software package, but that was a minor addition. A good example of transient helices is shown by the intermediate structure of this molecule, which is actually very stable, because it has very strong helices, as well as the pseudo-knot, which is also seen to be very long. In principle, this molecule is so stable that nothing can really happen to it.
We see that merely by synthesizing just a few more bases downstream from the sequence a new helix begins to nucleate. At this point, the previous orange helix can very easily be removed by replacing its base-pairs with all the other base-pairs from this new stem. There are definitely transient stems in this particular molecule. My guess is that this is quite a general feature, which is why I asked Michael Zuker whether he had looked at several point-mutations, since one would expect that if these transient stems are really encoded in the primary sequence there would also be higher-order correlations between these complementary mutations, perhaps even complementary mutations that do not appear in the native structure. Question: Is this assuming that the modified bases are being created almost instantaneously after synthesis? Response: There are no modified bases here.
Question: If one wanted to include that in the model, would it be biologically correct to assume that they are modified almost immediately after synthesis? Response: I do not know the answer to your question. I have asked this question myself many times; the answer is no, probably not. Question: No? Response: Probably not; I don't know. We can also look at another problem, one that is different from that of a single molecule being synthesized. In particular, we can follow the hybridization of two partially complementary nucleic acid sequences. One interesting example is the formation of the initiation complex for reverse transcription of the HIV-1 retrovirus, which involves its hybridization with a tRNA molecule. As many of you know, these retroviruses are reverse-transcribed into double-stranded DNA and then introduced into the host cell genome. These retroviruses must all be recognized by a reverse transcriptase, which carries out that task. It turns out that the reverse transcriptase does not actually recognize these retroviruses by themselves, but only after they have been partially hybridized to a tRNA molecule that is presumably not initially made for this purpose, but that just happened to have been hijacked along the way. This is also a question that interests us. These two molecules, although you would not imagine it from this first picture, have rather long extents of complementary regions. We are studying the dynamic process of hybridization between these two molecules. In order to do so, we join them with this inert linker. If we want to model molecules with modified bases, we could use the same process. First we fold the two molecules separately, and once we are satisfied with the two separately folded states, allow cross-hybridization to occur, with a long extension between the two molecules. You can also see that there is a fair amount of tRNA unfolding going on during this process. The preceding were a few examples of what it is possible to do. 
From RNA Sequences to Folding Pathways and Structures: A Perspective

How do we actually do this? In order to model RNA folding kinetics, we must first know the free-energies of various structures. There is certainly one very important contribution to the overall free-energy of a molecule, which may be fairly complex, as shown here. This contribution derives from the formed helices, as previously mentioned in some of the other talks. There exist some fairly good models, such as nearest-neighbor models between consecutive stacks. We did not invent these numbers; we are using the same tables everyone else does. This part is certainly well known. The other part we incorporate into the model accounts for the overall conformational entropy loss of the molecule during the folding process. We have devised a fairly crude model for this second part [Fig. 1]. Using this approach, we take this complex molecule and essentially throw out all the details, retaining only the fact that it basically looks like a mixed assembly of stiff rods (blue), which correspond to the helices, and ideal springs (black), which correspond to the single-strands. At this level of description, we can evaluate the entropic costs of the molecule by means of basic polymer physics. In the case of usual, tree-like structures, everything is fairly simple, because the overall thermodynamic weight simply factorizes into independent contributions from the various parts, which in this example consist of two helices and two loops. Everything comes easily, and in principle, we obtain the same results as with the usual approach. On top of these nested structures, we can also estimate the entropic cost of making more complex structures, including pseudo-knots. A simple example is shown, in which two helices are connected by three single-strands. In this conformation, the orientations of the two helices are no longer independent, indicating that there is an entropic price to be paid, which may be estimated with this model. So we can compare structures. But does that mean that we can predict structure? That is not yet the case. The problem is that, as Peter Schuster mentioned earlier today, we start with one sequence and try to predict its structure within a huge structural space. To illustrate this point, it is quite useful to begin with an example of a tRNA sequence and all the possible helices that can be formed from it, in which each line has two segments that correspond to a possible helix.
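A toy version of this stem enumeration can be sketched as follows. This is an illustrative sketch, not the author's algorithm; the function name and the minimum stem length are assumptions, and a real enumeration would also enforce a minimum hairpin-loop size.

```python
# Toy sketch (not the author's algorithm) of the stem enumeration just
# described: list every pair of antiparallel, reverse-complementary
# segments of at least min_len bases, allowing G-U wobble pairs.
# Simplification: no minimum hairpin-loop size is enforced here.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"),
         ("G", "U"), ("U", "G")}

def candidate_helices(seq, min_len=4):
    """Return (i, j, length) triples: bases i..i+length-1 pair with
    bases j..j-length+1, read antiparallel."""
    helices = []
    n = len(seq)
    for i in range(n):
        for j in range(n - 1, i, -1):
            # extend an antiparallel duplex starting from (i, j)
            length = 0
            while (i + length < j - length
                   and (seq[i + length], seq[j - length]) in PAIRS):
                length += 1
            if length >= min_len:
                helices.append((i, j, length))
    return helices

# ACCA at the 5' end pairs with UGGU at the 3' end:
print(candidate_helices("ACCAAAAAUGGU"))  # [(0, 11, 4)]
```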
Thus, ACCA can pair with UGGU, and so on. When this is enumerated you obtain many, many helices. This is probably fairly obvious, and also the easy part. The complex part is to find the combination of helices that makes a good overall fold. In addition to the helices that generate the well-known clover-leaf shape, the actual structure is surrounded by many competing ones. The task of finding the "good" structure is not all that trivial. Remarkably, when limited to the sub-space of tree-like structures, it is possible to exhaustively search all such structures to find the absolute minimum free-energy configurations. It is indeed impressive that algorithms may be designed to essentially enumerate all this structure space. But there are limitations. An obvious one is that by this method, pseudo-knots - which do in fact occur in RNA molecules - cannot be included, a priori. The space including pseudo-knots is much larger. If one would
like to learn something about all the pseudo-knotted structures, it would seem necessary to abandon the idea of carrying out an exhaustive search of the entire structural space. We had to try something else. Although we were not the first to do so, instead of enumerating the entire structure space, we attempted to devise a reasonable model of the actual wandering around of the sequence within the structure space; i.e., to model the kinetics of the molecule. In order to do so, we had to introduce connections between these states. These connections turned out to connect structures that differ by only one helix. The whole space is then seen to have neighborhood relations. We also had to model and evaluate the kinetic rates between those states. The reason that we could actually do this is that experimental results have shown that the time-limiting steps within the structure-space do indeed involve the formation or dissociation of entire helices and follow kinetic rules involving the barrier between the current ground-state and some intermediate structure, which in this simple case would imply a pinch in the loop in between. Assuming that the kinetics all around the huge structural space follows those laws, the picture we now have is that if one can evaluate those barriers, one can model the wandering of the molecule, as shown by the green arrows. In this random walk, many states are ignored, so it is clear that not all states can be explored, but this is probably what actual RNA molecules do; they do not explore all states either. In the algorithm we developed, when we have a sequence, we first enumerate all stems, as for the tRNA example. We calculate the rate of formation for each of these stems, if they are not already formed in the current structure, or of their dissociation, if they are already formed. Question: Do you include the nucleation step in this case, or is it just proportional to the ΔG?
Response: The nucleation step, if I understand correctly, derives from these rates, k0, which give you the actual time clock for RNA folding. These numbers were extracted from previous experiments. They are a bit like base-stacking interactions, which cannot be estimated ab initio, but have been measured, and we use those numbers. The relevant question might be: "These numbers were measured for some specific molecules, so what is the justification for using them in general?" Comment: In the nucleation step, you already start with some given structure; you are simply changing the combinatorics. Nucleation does not appear here.
Response: Nucleation does appear at the barrier, and to compute those rates you need to go to some barrier, which essentially is the nucleation step, and we calculated those structures as well. But at each stage, when, for instance, we decided to form another helix, we found another structure, and in principle, as you mentioned, all those rates had changed and we had to recalculate everything again, which is what we do. Question: In the previous case you had entropic factors; have they disappeared now? Response: No, they are all buried here. That is the difference between the free-energy of these states and the free-energy of the barrier, which is not drawn in this figure. Question: Yes, but the power law, such as with the 3/2 power? Response: Yes, they are all here. That is this curve, which comprises everything, loops and all. Even though it may not be included in the figures, the free-energies that comprise loops, and things of that nature, are included. Question: Just to clarify, in part you need these nucleation steps because you only allow complete opening and closing rather than, say, a local extension or a shifting or a sliding-over of base-pairs to go from one helix to another - is this correct? Response: You need the nucleation steps to change the overall topology of the current structure. As for local extension - shifting or sliding over base-pairs, which do not change the structure topology - we know that they generally occur at much higher rates. This allows us to assume that those degrees of freedom have essentially reached equilibrium. So for each structure topology, we shift and slide over the competing base-pairs in order to find the optimum (i.e., lowest free-energy) configuration. We have to recalculate everything at each stage, which is CPU time-consuming, and is why we have developed a rather elaborate algorithm to hasten this process.
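The loop entropic factors discussed in this exchange - the "3/2 power" appearing as the prefactor of a logarithmic closure cost - can be illustrated with a toy calculation. The offset `a`, the helix values, and RT below are illustrative numbers, not the model's actual parameters.

```python
import math

# Toy sketch of the entropic loop factors discussed above; the "3/2 power"
# is the prefactor of the logarithm in an ideal-chain (Jacobson-Stockmayer-
# like) closure cost. All numerical values are illustrative only.
RT = 0.616  # kcal/mol at ~37 C

def loop_penalty(n_unpaired, a=3.0):
    """Entropic cost of closing a loop of n unpaired bases: a + (3/2) RT ln n."""
    return a + 1.5 * RT * math.log(n_unpaired)

def tree_structure_dg(helix_dgs, loop_sizes):
    """For nested (tree-like) structures the thermodynamic weight
    factorizes, so the total free energy is simply the sum of the helix
    terms (from nearest-neighbor tables) and the loop penalties."""
    return sum(helix_dgs) + sum(loop_penalty(n) for n in loop_sizes)

# two helices and two loops, as in the tree-like example discussed earlier
dg = tree_structure_dg([-8.5, -6.2], [4, 7])
```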
Let me finish by saying that when we follow the pathway of this molecule, we obtain a measurement of the time elapsed during the simulation, because we evaluate the lifetime of the current structure by adding all the rates of transition out of this current structure. When we sum up those rates, we obtain the total number of possible transitions per unit time, so the inverse of that total rate is the actual mean lifetime of the molecule. We then have a measurement of
the time spent between each transition, and do the actual transition stochastically, by picking out the next transition according to its weight. We are modeling the dynamics in this way. Question: You capitalize on the cooperativity of zipping and it becomes one step? Response: Yes, that's right. We are wandering around this structure space, and sometimes we will be trapped in a short cycle. At each time, for instance, if we are in a structure defined as Ti, we have to recalculate everything and ask where to go from there. Very often what happens is that the transition leads to a state that has already been visited in the past. So time is wasted in recalculating something that had already been calculated. The idea is to try to speed up this process. Question: Is what you have developed specific to your method? Response: No, it is not specific, and the method we have developed is actually quite general. Today I'm only reporting results obtained for RNA, but in principle, it can be used for many other problems. The idea is to go from this straightforward algorithm to a more complex one, which turns out to be an exact clustering algorithm. When you are in Ti, you recall the states you visited in the past, so you already know everything about those states and all the connections between them. Then you pick the next state, after having summed over all possible pathways within this set of clustered states that are memorized. The first question is "Which is the actual state within this clustered state from which I will now choose a new one?" This can be done statistically by summing over all the different pathways. In this example we chose state j, from which we pick a new state outside the cluster. When we calculate all these statistics over all the pathways, we can also calculate all the time averages that we want. Because we are doing kinetics, we can only measure time averages.
There is certainly one average you must know, and that is the average time it took you to go from i to j, and then to exit from j. That is the quantity to be measured, and it is precisely evaluated. You also might be interested in the time-average fraction of pseudo-knots visited while wandering around in this cluster, etc. This may be done using an exact and efficient algorithm in O(n²) operations, where n is the number of states in the cluster. When you choose a new state from outside this cluster, it must be included in the new cluster; in other words, the cluster must be updated by
including the new state. In order for the whole method to remain bounded, you also have to get rid of one state. This can also be done in O(n²) operations. As for the results of this method - specifically, what may be learned about RNA using it - the exact clustering algorithm yields an effective speed-up of several orders of magnitude for a cluster that contains several states. More than a thousand-fold speed-up is gained when you study short molecules such as these. It turns out that it is essential to be able to simulate these molecules for a very long time, since they are trapped for hours, or even for several days. The gain is still several hundred-fold for this natural RNA molecule, the Hepatitis Delta virus ribozyme. As one would guess, it becomes less and less efficient for larger and larger molecules. I will further discuss the previously mentioned Group I intron, for which there is still a four-fold speed-up, which in practice makes a lot of difference when simulating those molecules. One thing that may be evaluated with this method is the occurrence of pseudo-knots in random RNA sequences. No one knows those numbers exactly. What we find is rather novel and also quite unexpected: even short molecules have pseudo-knots. However, this may not come as a surprise to many RNA biologists. We also find that this fraction of pseudo-knots increases slightly with the GC content of these molecules, as one might also expect. It can attain about 25% of the base-pairs involved in pseudo-knots for high GC content, which is clearly non-negligible. What is more unexpected is that the curves for 50-, 100-, and 150-base-pair-long random sequences pretty much collapse onto the same curve. This means that the number of pseudo-knots is roughly independent of the length of the molecule.
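The stochastic transition scheme described earlier - summing all rates to obtain the lifetime of the current structure, then picking the next transition according to its weight - can be sketched as a Gillespie-type step. `K0` and the barrier values here are placeholders, not the measured constants used by the author.

```python
import math
import random

# Hedged sketch (not the author's code) of the stochastic scheme described
# above: sum the rates of all possible helix transitions, draw the lifetime
# of the current structure from the total rate, then pick the next
# transition with probability proportional to its rate. K0 and the barrier
# values are illustrative placeholders, not the measured constants.
RT = 0.616   # kcal/mol at ~37 C
K0 = 1.0e8   # attempt frequency (s^-1), order of magnitude only

def helix_rate(barrier_dg):
    """Rate of forming or dissociating one helix over a free-energy barrier."""
    return K0 * math.exp(-barrier_dg / RT)

def gillespie_step(barriers, rng=random.random):
    """barriers: {transition_name: barrier free energy in kcal/mol}.
    Returns (dt, transition): the waiting time and the chosen move."""
    rates = {name: helix_rate(dg) for name, dg in barriers.items()}
    total = sum(rates.values())              # transitions per unit time
    dt = -math.log(rng()) / total            # mean lifetime is 1 / total
    pick = rng() * total
    for name, rate in rates.items():
        pick -= rate
        if pick <= 0.0:
            return dt, name
    return dt, name                          # guard against round-off

random.seed(0)
dt, move = gillespie_step({"form_P3": 6.0, "open_P9": 9.0})
```

With these placeholder barriers, the low-barrier move dominates the total rate, so it is picked almost every time, exactly as in a kinetic Monte Carlo simulation.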
This is something that was not foreseen, at least not by theoretical physicists, who have argued that pseudo-knots should be essentially negligible for short molecules, becoming problematic only for very long sequences. We find this not to be the case: pseudo-knots are typically non-negligible, even for small molecules, perhaps less so for extremely short ones. In addition, we find their number to be independent of the length of the molecule. Due to lack of time, I can only say a few words about mechanical unfolding. In Strasbourg we are developing, experimentally and numerically, ideas for studying the mechanical unfolding of RNA molecules. We are not the only people in the world doing this. The question we are addressing is what may be deduced from these experiments about secondary structures, since once the molecule has been pulled apart, its structure is not always obvious. We designed some toy molecules to help address this question. These are three different molecules, consisting of the same two stems, one rather weak (AU) and the other fairly strong (GC). We arrange these stems
differently in each of the three molecules. In principle, we pull on one end of the GC stem and one end of the AU stem, or on both ends of either the AU or the GC stem first, before we would expect the rest of the molecule to unzip. Question: How quickly does this occur? Is it a fast force or a slow force? Response: It is fairly slow experimentally. You pull the molecule apart within a few seconds. You can also test the equilibrium by doing the reverse experiment; that is, letting it fold. This process is not exactly reversible, so we do not quite reach equilibrium either. I will have to skip the actual experiment, which is quite complex and a long story. Instead, I will jump to the actual results to show you that these experiments converge rather nicely with the numerics shown. In these two examples we look at different helices with the same structures. In one example the strong stem must be broken first, then the weak stem. In the other example the opposite must be done; the weak stem must be broken before the strong stem. This is shown both in the numerics and the experiments. One feature seems to disagree between the two: the overall slope, which actually comes from the spring constant of the optical tweezers. There are two different optical tweezers in the actual experiments, whereas in the numerics, carried out before the experiments, the spring constants were the same. What we see in both the experiments and the numerics is that the molecule shows two plateaus, corresponding to the consecutive opening first of the weak stem, at rather weak force, and then of the strong stem, at a stronger force. For the other molecule, the strong stem cannot be broken until the critical force is reached. Once that happens, the second stem cannot resist and breaks as well. This gives you a larger drop, which is also seen in the numerics. However, there are certain limitations with this approach.
The third molecule is very different from the two just described, yet its trace is very similar to that of the molecule just described. Hence, many details may be observed and measured with these unfolding experiments, but certainly other structural features are missed. I will finish by presenting some new data on the group I intron, returning to the primary goal of our approach, which is to decode RNA folding pathways. The lowest free-energy states we find are pretty close to the known native structure, in particular the pseudo-knot P3, as well as some other pseudo-knots. There are certainly some bases that are probably incorrectly predicted, but overall there is around 85% agreement between the predicted structure and the known native structure. Perhaps more interesting is the prediction we may make concerning
certain trapped intermediate structures, which are much less well known in the literature. A word of caution: In the animation of the folding pathway, you will only see the new minimum free-energy structures as they are attained in the simulation. You will miss all the other structures that are actually visited in the prediction algorithm, and that information might prove rather important. With this in mind, the free-energy of this molecule is described. We also used the clustering approach and its time-average features described earlier to follow particular helices, in this case P9, P3, and some others, for which one can compute the average fraction of time during which each particular helix is present, a number between 0 and 1. If it is 1, the helix is always there, and if it is 0, the helix is essentially never there. These patterns indicate that although you have found the actual native structure, there is a certain period of time - a few seconds to perhaps one minute - during which the molecule is not in the correct conformation and probably inactive. We also find some misfolding pathways; that is, the molecule is essentially trapped at the end of synthesis. The absence of P3 in the trapped states seems to agree with Sarah Woodson's work. Question: These experiments require applied force to open RNA molecules. What is your mechanical model? You compared your results with some numerical model; what is that model? Response: The numerical model is the dynamic model of RNA folding I first described, but with a force added. You must apply work to the system and force it to stretch. In these pulling experiments, there is a molecule, and in addition to carrying out the kinetics described, a fixed constraint is enforced between the two ends. The constraint is then slowly increased to trigger unfolding of the molecule. In return, the molecule exerts a force that can be evaluated on the apparatus.
If you do only that, you obtain a pretty bad match with the actual experiments, because it amounts to assuming an ideal apparatus to measure the force, which is not the case. Optical tweezers are not ideal, since they have some intrinsic stiffness. What should be modeled is a system in which a bead is included and a fixed constraint is applied to the composite system of the "apparatus + molecule." That is how you can match those experiments. The model consists of the molecule and the apparatus itself, which is quite important; otherwise the data do not fit. Question: Is the bead also dynamic?
Response: Right. The bead also has some time to respond. The experiments themselves also need some acquisition time, and all of this must be taken into account in order to interpret the data. That way you obtain fairly good agreement with the actual experiments. Question: Would it depend on the kinds of transitions that you make with the RNAs as they go through large barriers, or are they much more sensitive, compared with the response time of the bead? If some stage takes more than milliseconds, it would react much differently from how it would if it took microseconds, right? Response: Exactly. That could be the case, but not in our very simple example. In principle, yes, you would have a very different reaction. Question: Do you have an argument to explain why the number of pseudo-knots does not increase with size? Response: Not yet; apart from hand-waving arguments, we do not have a very profound explanation. Question: Looking at your simulations, it appears that your RNA is growing from one hairpin. It seems as if you are growing it from one end. Where is the rest of the RNA? Is it unfolded? Is it unstructured? Are you not showing the unstructured part? Response: Everything is shown. Question: But at the beginning only a few nucleotides are shown. Does it start to grow at that point? Response: This is simply because the sequence was not started right at the beginning.
Question: I'm a bit confused by your model. Do you assume that it folds as it grows? Response: Right.
Question: So it is not a folding of RNA; rather, it is basically a co-transcriptional folding of RNA? Response: That is right. My guess is that it's also the way those molecules fold. Question: If you were to take an RNA, melt it, and then cool it, would that be a different process? Response: You would get something else. We studied that for some examples and obtained very different results. That is perhaps one of the reasons why group I seems to have better folding rates in vivo than in vitro, were you to do such cooling-down and heating-up of the sample; so they do not actually fold by the same process. There might also be proteins involved, but that does not necessarily have to be the case. Comment: I just wanted to have some clarification that it is co-transcriptional folding. Response: Yes, in one case. But you can certainly also devise some models to support this. As seen from the hybridization of the two molecules already shown, there is no synthesis or transcription of those molecules, so you can look at the dynamic hybridization process between the two. Question: While the molecule is being synthesized? Response: No; not in this case. You recall that above I showed two molecules that hybridized with no synthesis involved in the process. It is just to show you that you can address different, but related, problems. Question: But you cannot simulate this molecule folding from a completely formed chain? Response: Yes, I certainly can, and have already done so. I do not have the sequence here, but I could certainly show you something that looks like a very long chain that collapses upon itself. Question: Do you get the same result? Do you get the same final structure?
Response: For this particular molecule, I have not done the extensive statistics needed to give you an answer, but I have studied other molecules and gotten very different answers in the quench experiments, for which the whole sequence is there and allowed to fold suddenly, starting from any region of the sequence, compared with co-transcriptional folding. Question: Which of the two is closer to the known X-ray structures? Response: I was not referring to the final structure. The final structures in these cases were actually identical, but the pathways were different. Question: Are the final structures always identical no matter whether you do it by growing RNA or by pulling it? Response: Not in all cases; we are now involved in trying to design bistable molecules that are trapped by their own synthesis in one particular state, so you can certainly bias this folding by the mere synthesis of this molecule, which is what we are now exploring.
AN EVOLUTIONARY PERSPECTIVE ON THE DETERMINANTS OF PROTEIN FUNCTION AND ASSEMBLY OLIVIER LICHTARGE Department of Molecular & Human Genetics, Baylor College of Medicine, Houston, TX, USA
Most of us work in this field because it is at the interface of mathematics, physics, biology, and medicine. It is therefore particularly exciting to address an equally diverse audience. I will describe and discuss our work using evolution as a computational and analytical tool to locate functional sites within protein structures, and the residues that most directly mediate their activity and specificity. This approach is a computational method, known as Evolutionary Trace (ET), that is based on the same biological paradigm as mutational analysis in the laboratory. Specifically, it uses mutations and assays to identify functionally important residues, except that here the mutations and assays have already occurred during the course of evolution. First I will explain how the method works, then show you various examples that illustrate its applications. The general context of this work is the fundamental problem of integrating sequence, structure, and function information. Because of the data pouring forth from genome sequencing projects, as well as those soon to come from the Structural Genomics Initiative, this is an especially pressing issue. However, the difficulty of the problem cannot be overstated. Consider that after thirty years of effort, and despite the fact that the underlying process is deterministic, the sub-problem of predicting structures from sequences remains unsolved. Another sub-problem, predicting functions from sequences, should be even harder, since the underlying process there involves evolution (random chance), and is therefore non-deterministic. In view of these difficulties, we focus on a much narrower sub-problem: Given a structure, how do we locate its functional sites? Since functional sites mediate all protein functions, the answer to this question should have important applications.
If you understand which amino-acids are important in a protein, and how they come together to form a functional surface, you gain basic insights into molecular recognition, as well as into such functions as catalysis, signaling, transport, and so
on. In turn, experiments to modulate biological activity, protein engineering, and drug design can focus on the most relevant targets. Another attractive aspect of this problem is that its solution is not trivial. A protein structure just looks like a big collection of atoms in space, and simple observation is inadequate to recognize its functional sites, especially those that involve large macromolecular binding interfaces. One approach is mutational analysis, whereby you systematically change each residue, then assay the mutant protein to decide whether your mutation had functional consequences. In this way, you can map out all the active sites. Unfortunately, mutations are costly and laborious. To assay their function, you obviously must have appropriate biological tests, but this is far from trivial. A given protein may well have five different functions that are unknown to you a priori. How then can you design the appropriate assays needed to grasp the complex functional roles of specific residues? You cannot. Ideally, what is needed is a cheap, scalable method for characterizing the key determinants of protein function. Comment: You can do a mutation that will completely destroy your protein. Response: Yes, replacing certain amino-acids that are important for structural reasons can destroy the protein's function. In essence, structural and functional importance can overlap. But people would then also focus on surface functional sites, where mutations are less likely to have drastic packing consequences.
The evolutionary trace method The problem appears solvable at the outset. Intuitively, you hypothesize that functional sites evolve through variations on a conserved architecture. This is much like a paleontologist recognizing different species from the similarities and differences in their mandibles or teeth. The similarities point to common ancestral functions that are often retained, and the differences point to functional characteristics that are often unique to each descendent species. If you apply this logic to the active site of a protein, you might expect that the location, basic structure, and function of divergently related active sites will be conserved over evolution, but also that each divergent protein will have acquired some variations that mediate unique and novel functional variations. If this simplistic transposition of the macroscopic world to the molecular scale is correct, it predicts that by comparing active sites among divergently related
proteins, we should be able to recognize two types of residues: those that underlie the fundamental, conserved architecture of the site, which should be mostly invariant, and those that impart species-specific functional modulation, which should vary among functionally distinct species. Hence, just by looking at the sequences, it should be possible to distinguish functional-site residues that are completely invariant from those that vary in class-specific patterns; i.e., that are invariant within a given class but vary among different classes. This, in essence, solves the problem of how to identify active sites; what needs to be done is to gather sequences, classify them into various functional groups, and then identify residues that are invariant within each functional group. By construction, these class-specific residues have the property of changing the protein's function whenever they vary during evolution. This is the sine qua non of functional importance. This procedure may be repeated for every single position, identifying a set of class-specific residues and mapping them onto the structure. Hopefully, they will cluster at a site where any variation is linked to functional change. Question: [inaudible] Response: No, actually it does not. I am trying to minimize the number of assumptions. The logic of the argument is simply that if these hypotheses are true, then maybe that will be observed. Question: When? Response: We just take one mutation at a time. Double mutations are much less likely to appear than individual ones. Question: It seems to me that in comparing different proteins, you might also find residues that are important for general protein folding. How do you make these distinctions? Response: We do not, but in truth, I am still not sure how to quantify the difference between structural and functional importance.
While conceptually intuitive, structure and function are complex, intertwined concepts that may be difficult to separate and, in the extreme, misleading. Some of the examples we work on have residues that are functionally important in allosteric pathways, and that cause misfolding upon mutation.
Question: As far as I know, in almost all cases, the active site will be one of the largest cavities on the surface of the protein. If you know the structure, is there anything that helps to find the active site? Response: That is often true. Small ligands usually bind in cavities. Even so, the residues that contribute most directly to binding in the cavity may not be obvious. Moreover, other functional sites, such as those in protein-protein, protein-RNA, and protein-DNA interfaces, involve very large, flat interfaces. So outlining cavities solves only part of the problem, and does so at a lower resolution than we hope to achieve. Question: My comments concern the notion of function. The list you showed is really redundant. If you see what people in supramolecular chemistry like J.-M. Lehn consider, you have three functions: recognition, binding, and catalysis, and even they are already redundant. This is one aspect of your talk that is not very clear so far, because obviously, the question of function is not something you can solve by looking at the protein itself, since it is determined by interactions within the system. This should be considered more carefully. My question is the following: If I understand correctly, I have a protein, I know its structure, and I want to know about its functions, in general. Are you going to say anything about it? In a way, this structure is unique; there is only one such sequence and only one such structure. Or are you only going to be able to say something in comparison? Response: As you see, the method depends on the number of sequences you have to compare. If you have only one sequence, you really do not have any evolutionary information. Comment: People learn about a given function in such a way that they generally don't have to make comparisons. You can learn about a function by working on a protein or a structure, and by setting up an experiment, without making comparisons.
Response: These questions will be easier for me to answer in the concrete context of specific examples, so allow me to come back to those issues a little later. For the moment, let me just point out that we have not yet solved the problem. I only said that we can identify a cluster of important residues, defined as invariant within functional classes, if you can define those functional classes. Thus, the problem may be solved if and only if we can define functional classes. But this is
An Evolutionary Perspective on the Determinants of Protein ...
not straightforward; how can we take sequences and split them into different functional subgroups? There are three possibilities:
• expert bias
• experiments
• approximations
We choose the last one: to use sequence identity trees as approximations of functional classifications. This simplistic choice may not necessarily be the best one, but it does imply that each node in the tree is a virtual assay that defines functional subgroups, and if so, we now have a fully determined algorithm. Specifically, we can take a set of sequences, calculate a sequence identity tree, approximate a functional tree with that sequence identity tree, thereby dividing the group into functional classes, identify class-specific residues, and eventually map them to the structure.
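The algorithm just outlined can be sketched in a few lines of Python. This is a hypothetical illustration, not the published implementation: the tree has already been cut into groups (a hand-made partition stands in for the sequence identity tree), and we flag alignment columns that are invariant within every group but differ between at least two of them.

```python
def class_specific_columns(groups):
    """groups: list of groups, each a list of equal-length aligned
    sequences. Returns columns that are invariant within every group
    but differ between at least two groups."""
    length = len(groups[0][0])
    hits = []
    for col in range(length):
        consensus = []
        for seqs in groups:
            letters = {s[col] for s in seqs}
            if len(letters) != 1:        # variable within a group: not traced
                break
            consensus.append(letters.pop())
        else:
            if len(set(consensus)) > 1:  # invariant within, variable between
                hits.append(col)
    return hits

groups = [["KECA", "KECA"], ["KQCA", "KQCA"]]
print(class_specific_columns(groups))  # → [1]
```

In the toy alignment, only column 1 (E in one group, Q in the other) satisfies the class-specificity criterion.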
Figure 1. Description of ET: a. A tree divides a multiple sequence alignment into groups that approximate functional classes. Trace residues, marked by X, are invariant within each group, but variable between them. A structural map of trace residues on a representative structure shows a spatial cluster of trace residues, called a trace cluster, indicating a likely functional site. The minimum number of branches at which a residue is first traced is its trace rank, shown below each trace residue. Thus, the higher the rank (1 is the highest), the greater the predicted functional importance of a trace residue. b. The cluster of magenta trace residues is statistically significant and was predicted to be a conserved interface of G proteins to their receptors. c. Alanine-scanning mutagenesis maps out the key residues, in red and black, that mediate functional coupling between G proteins and their receptors. This is the same area.
This process is illustrated in Figure 1. As you see, you can use the tree to divide a family into one, two, three, or more branches. Each time, you identify trace residues with the property of class specificity, namely that they are invariant within a branch but variable between at least two. In this way, the tree's intrinsic hierarchy allows you to assign an evolutionary trace rank to every single residue, defined as the minimum number of branches at which that residue becomes class-specific. The two hypotheses underlying this scheme are (i) that functional sites evolve through variations on a conserved architecture, and (ii) that sequence identity trees are reasonable approximations of functional trees. Let me move from an abstract discussion to concrete examples that will help you understand how a seemingly general hypothesis can lead to very specific experimental hypotheses based on molecular function.
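The trace rank defined above can be computed by scanning successively finer partitions of the tree. The helper below is a schematic sketch (toy alignment, invented partitions): for each column it records the smallest number of branches at which the column is invariant within every branch.

```python
def trace_ranks(partitions):
    """partitions[i]: grouping of the alignment into i+1 branches.
    Returns {column: rank}, where rank is the fewest branches at which
    the column is invariant within every branch (1 = most important)."""
    length = len(partitions[0][0][0])
    ranks = {}
    for n_branches, groups in enumerate(partitions, start=1):
        for col in range(length):
            if col in ranks:
                continue  # already traced at a coarser partition
            if all(len({s[col] for s in seqs}) == 1 for seqs in groups):
                ranks[col] = n_branches
    return ranks

aln = ["KECA", "KECA", "KQCA", "KQCG"]
partitions = [
    [aln],                          # 1 branch: the whole family
    [aln[:2], aln[2:]],             # 2 branches
    [aln[:2], [aln[2]], [aln[3]]],  # 3 branches
]
print(trace_ranks(partitions))  # → {0: 1, 2: 1, 1: 2, 3: 3}
```

Columns 0 and 2 are invariant everywhere (rank 1), column 1 becomes class-specific at two branches, and column 3 only at three.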
Figure 2. Control studies. a. A trace of the SH2 domain showing that the trace residues (all those in color) cluster and match the structural binding site shown in cyan in b. c. A trace of SH3 domain proteins, which also matches the structural binding site in d, and in f the functional site mapped out by alanine-scanning mutagenesis. The trace analysis in e is essentially similar, but was done on half as many sequences. g, h. Trace of the DNA binding domain of intracellular hormone receptors showing clusters of trace residues at the DNA binding site and none on the other side of the protein.
Basic control studies

First, consider SH2 domains. A tree of that protein family has been divided roughly into 15, 20, 25, 30, 35, 40, and 45 branches, and an evolutionary trace was carried out for each partition. In Figure 2, trace residues in blue and green have rank 1, so they are completely invariant, and those in red or brown have rank >1, so they are invariant within branches, but variable between some of them. Also, the red and green residues are mostly on the surface, whereas the brown and blue ones are mostly internal. First of all, such residues exist, and second, they become more numerous as we cut the tree into more and more branches. They are far from being distributed randomly, but rather cluster spatially in the structure at one clear spot. So we would surmise that the hot-spot where all the colored trace residues cluster, as shown in Figure 2a, might be an active site. Third, eventually, if you cut the tree into too many branches, scattered residues begin to appear. Fourth, it turns out that when people mutated the top-ranked residues, they killed function. When they mutated lesser-ranked residues, the function was only modulated. And when they mutated residues ranked in the noise region, there was almost no effect on function. Fifth, the structural data validate the evolutionary analysis: Figure 2b shows (in cyan) the SH2 residues that are within 5 Angstroms of the ligand, and these precisely match the trace in Figure 2a.

Another example is SH3 domains. Figure 2c shows the cluster of top-ranked trace residues, and Figure 2d shows the actual binding site determined from an X-ray structure of the co-complex. As before, there is excellent visual overlap of the trace cluster with the binding site. This remains true if the trace is done on 40 sequences, as shown in Figure 2e, rather than on 80. Note, however, that ET misses some residues on the far right that are obviously part of the interface.
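The structural check above, finding the residues within 5 Angstroms of the ligand, is a simple distance computation. A minimal sketch with invented coordinates (a real analysis would read atom positions from a PDB file):

```python
def contact_residues(residue_atoms, ligand_atoms, cutoff=5.0):
    """residue_atoms: {res_id: [(x, y, z), ...]}. Returns the sorted
    res_ids with any atom within `cutoff` Angstroms of any ligand atom."""
    def close(a, b):
        # compare squared distances to avoid a square root
        return sum((p - q) ** 2 for p, q in zip(a, b)) <= cutoff ** 2
    return sorted(
        rid for rid, atoms in residue_atoms.items()
        if any(close(a, b) for a in atoms for b in ligand_atoms)
    )

protein = {10: [(2.0, 0.0, 0.0)], 11: [(3.0, 0.0, 0.0)], 50: [(20.0, 0.0, 0.0)]}
ligand = [(6.0, 0.0, 0.0)]
print(contact_residues(protein, ligand))  # → [10, 11]
```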
However, mutational alanine-scanning data, collected by Wendell Lim before any of this work, showed that mutations at these positions are in fact functionally well-tolerated. It is only the residues shown in red in Figure 2f that affect function upon alanine mutation. Hence, although the evolutionary analysis agrees well with the structural analysis, it agrees even better with the functional analysis. In other words, the mutations and assays of evolution match those from the laboratory.

Question: Perhaps there are some residues that are conserved all through the SH3 domains, which are doing some non-specific recognition in this binding process. There might also be residues that are not identified by your method, because they correlate with the function. So those that are 100% conserved in their family could also be important.
Question: I'm not so sure I understood everything well. You make a classification using tree classifications. Is it possible to use something like Diday's dynamic cluster classification instead?

Response: As I said, you can use any classification you wish and test it against experiments.

Question: In this case, if you use Diday's classification, you have to determine the assembly distance between two proteins to see which one will go with which. What distance will you take in this case?

Response: We've been using UPGMA trees for our purposes. If you want to build a different classification, you should just try it out.

Question: Do you have bacterial sequences?

Response: That is up to you. In this work I have taken as many of the sequences as I found available. When you do an experiment, you determine the experimental conditions; you have to make the choice on how to set up the experiment. If you want to look at evolution over all three Kingdoms, you can. In the present case, we looked only at eukaryotes.

Question: Which sequences did you use in the SH3 example?

Response: This has been published; you can find the detailed input data in the paper, but as I recall, all the SH3 sequences known at the time of the study were included.

I just want to make one more point: The ET method identifies residues that are important for protein/protein interactions. But it also works well for DNA/protein interactions. Figure 2g shows (in yellow) a large trace cluster of residues exactly at the interface with the overlying DNA. On the other hand, there are almost no trace residues on the opposite side of this intracellular receptor DNA-binding domain. These data are all simple positive control studies, but they provide basic proof of principle and show that the two hypotheses are simple, namely: i. the dendrogram approximates a functional tree; ii. the active site evolves through variations on a conserved architecture.
These hypotheses are sufficient to identify class-specific residues, ranked in a functionally relevant hierarchy, that cluster and thereby predict functional sites.
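The sequence identity trees used throughout are UPGMA trees. For readers unfamiliar with the construction, here is a bare-bones sketch over a toy distance matrix; in practice the distances would be derived from pairwise sequence identities in the alignment.

```python
def upgma(names, d):
    """names: leaf labels. d: {frozenset({a, b}): distance} over all
    leaf pairs. Returns the tree as nested tuples."""
    size = {n: 1 for n in names}   # cluster -> number of leaves
    dist = dict(d)
    while len(size) > 1:
        a, b = min(dist, key=dist.get)           # merge the closest pair
        merged = tuple(sorted((a, b), key=str))
        for c in size:
            if c in (a, b):
                continue
            # size-weighted average distance to the new cluster
            dist[frozenset({merged, c})] = (
                dist[frozenset({a, c})] * size[a]
                + dist[frozenset({b, c})] * size[b]
            ) / (size[a] + size[b])
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        size[merged] = size.pop(a) + size.pop(b)
    return merged

tree = upgma(["A", "B", "C"],
             {frozenset({"A", "B"}): 2.0,
              frozenset({"A", "C"}): 8.0,
              frozenset({"B", "C"}): 8.0})
print(tree)  # → (('A', 'B'), 'C')
```

A and B, the closest pair, are joined first; C attaches at the average distance of 8.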
G protein signaling

With these initial control studies out of the way, I would now like to focus on the G-alpha and RGS protein families. We studied these two cases very carefully to show how to apply ET in order to make bona fide predictions of functional surfaces and functional determinants in systems for which there are no a priori answers. First, note that the G protein signaling pathway is ubiquitous in eukaryotes. It is so fundamental that it directly mediates three of our senses (sight, smell, and taste), nearly all of our neuro-endocrine signaling, and 100% of autonomic physiology. In practice, 40 to 60% of all drugs (including those of abuse) target G protein-coupled receptors. The pathway is turned on when an extracellular ligand (of which there are nearly one thousand different types) binds to a receptor. The receptor then changes conformation, and this in turn activates a G protein, which is a heterotrimer of alpha, beta, and gamma subunits. In particular, the activated G protein alpha subunit exchanges GDP for GTP and diffuses along the cellular membrane to activate an effector protein (either a membrane-bound enzyme or a channel), which then modulates the intracellular concentration of second messengers.

Question: What is the molecular weight of the alpha subunit?

Response: The alpha subunit consists of around 350 amino-acids. The G protein receptor is often more than 400 amino-acids.

Signaling eventually stops when G-alpha hydrolyses GTP back to GDP. This step is accelerated by helper proteins (regulators of G protein signaling, or RGS) that bind to the activated G-alpha-GTP and increase the rate of hydrolysis. Thus, G-alpha has many potential functional sites, since at the minimum it interacts with a receptor, beta-gamma, an effector, and an RGS protein. The original reason for creating ET was to find out where G-alpha binds the receptor.
The first trace ever was thus done on the structure of G-alpha, which was available, and prior to any controls, it identified two active sites. We knew that the C-terminal of G-alpha takes off from the magenta trace cluster shown in Figure 1b and that this C-terminal interacted with the receptor. This suggested that the magenta site was the receptor interface, and hence that the other site was the interface to beta-gamma. Since there
are few ways to arrange all those structures in three-dimensional space, we also proposed a model of the alpha-beta-gamma receptor arrangement. Both structural and mutational data subsequently validated this model. First, a few months later, the actual structure of the alpha-beta-gamma heterotrimer was published and confirmed that the beta-gamma interaction occurred at the predicted site. Second, 108 mutations were performed by Rene Onrust in order to map out which alanine mutations disrupt receptor binding. The red and black residues in Figure 1c are those in which the mutations kill coupling, and these exactly match the area predicted computationally by ET in Figure 1b.

Question: Was this in vivo or in vitro?

Response: In vitro.

Thus, the predicted interface is consistent with mutational data. In fact, if you compare the entire mutational analysis with the trace, seven out of ten times the assay confirms a prediction that a given residue is important or not important. Moreover, regions of disagreement are ambiguous. The assay was geared to detect disruption of the receptor interaction, not of interactions with the other proteins with which G-alpha also interacts. Many discrepancies occur precisely in regions involved in binding to beta-gamma or to the nucleotide. These residues were counted as false positives, but this may simply reflect that ET is sensitive to all important residues, whatever their function, whereas here the assay was geared to a very narrow activity.

Question: Can we imagine that the function stops during evolution and that your system will still detect it in a tree?

Response: We will have just such a case later on. The answer is that it depends on how you set up your experiment. Certainly, with enough evolutionary time, functions may vanish altogether from some branches, but this still leaves recognizable patterns if one studies a subpart of a family tree.
In summary, this first prospective study is a bona fide prediction, in the sense that it led to predictions that were published before any confirming experiments. Multiple trace clusters were discovered, assigned to specific protein interactions, and led to a low-resolution quaternary structure that was proven accurate by the alpha-beta-gamma crystal structure. We still do not know for sure the interaction
site with the receptor, but mutational analysis suggests very good agreement and, overall, ET anticipates the results of mutational analysis seven out of ten times.

Question: When you compare different proteins, do you take into account the fact that some of them can come from the same organism, and that if you then see a difference, you probably have some functional difference, whereas if they are two proteins from different organisms that are expressed by the same gene, then probably these differences are some kind of evolutionary artifact? Do you take this into account?

Response: I take a somewhat different approach: I try to make no assumptions about whether or not two sequences are functionally similar. Rather, I let the tree explicitly factor in the distinction you are hypothesizing, using "expert" bias, between orthologs and paralogs.

Comment: I think it might matter. If you compare two proteins from two organisms and find differences in certain amino-acids, you would assume these differences not to be important. But if the two proteins that you compare come from the same organism and you also see some minor differences, there could be some reasons why there are two different proteins in the same organism. For example, if they arose by gene duplication, it would mean that these differences are important.

Response: Yes, I agree; we can imagine many different scenarios. The tree is a formal way of generalizing and standardizing all of these so as to develop a systematic approach that introduces as little bias as possible. Then we can experimentally check whether our tree-based viewpoint was predictive, and therefore useful, or not. In G-alpha it was.
Regulators of G protein signaling

Now I would like to turn to Regulators of G protein Signaling (RGS) proteins. This is another protein family for which predictions preceded experiments. Recall that RGS binds G-alpha and enhances its GTPase activity. However, in the presence of PDE-gamma (the effector protein of the visual pathway), the GTPase-accelerating property of RGS9 is boosted, while that of RGS7 slows down. To understand the molecular basis of this difference, we traced the RGS family and identified important residues that map onto only one side of the RGS, suggesting a large interface on RGS. This is shown in Figure 3a. We knew that G-alpha binds RGS
over part of this area, but this interface, S1, accounts for only part of the trace cluster, leaving the blue area in Figure 3b, S2, open for some other binding activity. In attempting to identify the S2 protein partner, we noticed that S2 side-chain variations follow the PDE-gamma-dependent effect. The side-chains change dramatically in charge or character between RGSs that are inhibited by PDE-gamma and those that are enhanced by it. Next, we noticed that a patch on G-alpha, shown in bright yellow, traced in G-alpha and linked to PDE-gamma binding by peptide studies, is directly contiguous to S2. These data thus suggest that PDE-gamma sits astride both G-alpha and RGS at S2, and that S2 residues control the differential effect of PDE-gamma.
Figure 3. Identification of an allosteric pathway in RGS that regulates G protein signaling. a. A trace of RGS identifies a well-demarcated, large cluster of trace residues, in blue. b. Part of this site is a binding site for G-alpha, shown in yellow, but part of it remains free for a predicted interaction with the downstream effector protein. c. Subsequent mutations that swap residues b, c, and e from RGS9 onto RGS7 are sufficient to confer the activity of the former to the latter. Residues b and c mimic the presence of the effector, PDE-gamma. They are remote from G-alpha and act through an allosteric pathway that ends at residue e, which is at the G-alpha interface. d. The interaction predicted between the effector protein with G-alpha and RGS was eventually confirmed by this structure, obtained by Kevin Slep in Paul Sigler's laboratory.
To test whether S2 determines a functional difference between RGS7 and RGS9, we set out to confer RGS9 activity on RGS7 by swapping S2 residues. Around 65 residues differ between RGS7 and RGS9. If you swap all of them, it will work for sure, but we targeted only the six residues identified through the evolutionary trace. A double mutant of residues 353 (labeled b in Figure 3) and 360 (labeled c in Figure 3) had a level of activity comparable to that of wild-type RGS7 in the presence of PDE-gamma. These data are shown in Figure 3c. This is remarkable in that neither residue b nor c contacts G-alpha directly. It is as if you activate a switch and the light goes on somewhere else. So b and c are "activation switches" that work at a distance. Another switch, at residue 367 (labeled e in Figure 3), produced a triple mutant with an activity level equal to that of RGS9 in the presence of PDE-gamma. Hence, three mutations changed RGS7 into an intrinsically activated RGS9. That is, the trace residues behave exactly as expected of functional determinants; if you swap them, you can engineer a functional transfer from RGS7 to RGS9. Independently, the structure of the RGS, G-alpha, and PDE-gamma complex confirmed that RGS makes direct contact at the site we predicted, as shown in Figure 3d. So we have confirmation, both mutationally and by X-ray crystallography, that our predictions were correct.

To summarize, we were able to predict another functional interface, the specificity determinants, and a low-resolution quaternary structure, which allowed us to identify an allosteric effector on-off switch, the RGS7/RGS9 specificity determinants, and the RGS-effector interaction. Otherwise stated, we are able to use evolution to link raw sequence and structure data to function, allowing us to anticipate mutations and quaternary structures. In this case it helped us uncover the molecular basis of G protein signaling.
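In silico, the functional-transfer experiment amounts to copying a handful of trace positions from one sequence into the other. A trivial sketch with invented sequences and positions (not the real RGS7/RGS9 numbering):

```python
def swap_residues(target, donor, positions):
    """Return `target` with the residues at `positions` (0-based)
    replaced by the corresponding residues of `donor`."""
    seq = list(target)
    for p in positions:
        seq[p] = donor[p]
    return "".join(seq)

rgs7_like = "MKTAYIAK"                 # stand-in for the RGS7 sequence
rgs9_like = "MQTGYIGK"                 # stand-in for the RGS9 sequence
print(swap_residues(rgs7_like, rgs9_like, [1, 6]))  # → "MQTAYIGK"
```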
Applications to functional annotation

I would like to show you how this method can be helpful for such problems as functional annotation, detection of remote homology, and alignments. Finally, I will add a few words about its generality, since the detailed studies I have shown so far may appear anecdotal. First, note that the problem known as functional "annotation" is a terrible misnomer, since it trivializes a fundamental problem in biology. Biologists are not really interested in sequences or structures, but in functions. Yet, if you look at all known protein sequences, you will notice that their functions have been determined
experimentally in a scant 0.5%. For another 4.5%, function is inferred computationally from homology. Often these deductions are correct, but not always, since sequence homology does not inform you about functional analogy. Thus, very few sequence annotations are as ironclad as we would like them to be. Some of the questions concerning functional annotation that we would like to address are: 1) How does specificity arise at a functional site? 2) Is it possible to determine whether two proteins perform the same function?
Figure 4. The origin of DNA binding specificity in hormone receptors. A trace of intracellular hormone receptors identifies two qualitatively different groups of functionally important residues. Those in Group 1 are mostly invariant and bind nearly invariant bases that they contact similarly in different structures of the protein-DNA complex. Those in Group 2 are highly variable; they contact more variable bases, some of which are outside the strict consensus response elements, with more or less flexibility in various structures. This is consistent with the view that Group 1 determines the basic recognition structures of all hormone receptors, which are then modulated by Group 2 to match the specific variation of their target response elements.
Let's start with the problem of specificity. I have already told you about intracellular nuclear receptors. The largest eukaryotic family of transcriptional regulators, they help switch genes on and off. Some homodimerize head-to-head onto a stretch of DNA that forms a palindromic response element. Others heterodimerize head-to-tail or tail-to-tail on DNA response elements that are direct
or inverted repeats. Some even bind single response elements as monomers. The structural unit that binds to the DNA is actually a small part of the entire hormone receptor, i.e., the DNA binding domain. We traced this domain and identified the DNA binding site. Since the tree has a hierarchy that may tell you something about functional importance, we can perhaps identify the amino-acids that are class-specific very early in the tree, shown as Group 1 in Figure 4, and that are presumably the most important during evolution. Or we can consider those that become class-specific a little later, and which may be a little less important, shown as Group 2 in Figure 4. We can trace an entire hierarchy of residues. If you look at how Group 1 residues bind response elements in three structures of different nuclear hormone receptor-DNA complexes, it turns out that they always contact the same bases. These bases are themselves nearly invariant among DNA response elements. On the other hand, Group 2 trace residues are appreciably more variable and contact bases that are themselves variable during evolution; sometimes they even fall outside the classical response element. Thus, as we assumed in our first hypothesis, this binding site evolves through variations on a conserved theme. Moreover, there is co-variation at the protein/DNA interface, where the most important amino-acids contact invariant bases and the variable, less important amino-acids contact the variable bases. Remarkably, these variations are often drastic and entirely non-conservative of side-chain character. It is then possible to study how those highly variable residues vary during evolution and bind to variable bases, and to map these variations exactly onto the evolutionary tree.
The result is a protein-DNA "specificity key" that explicitly shows which residues are necessary at the DNA interface for each branch of evolution, based on the hypothesis that those residues are very important in modulating specificity. Thus, if you wish to confer the DNA-binding activity of a PPAR receptor to an estrogen receptor, you would just go to those residues in estrogen and switch them to what they are in PPAR. While at this point it is somewhat speculative, note that this is exactly the experimental protocol we followed in RGS proteins: we identified trace residues and swapped them in order to swap activity.

Question: Are you saying that there is actually a unique code of amino-acids versus DNA bases?

Response: That is not what I am suggesting. In fact, it is very hard to conceive how proteins with different conformations, and that therefore bind DNA through distinct interactions, would nevertheless converge on a single code. So even if you
are willing to say that DNA always has the same conformation (which it doesn't), a universal recognition code is unlikely.

Question: Basically, you are saying that there are amino-acids that conserve interactions with specific DNA bases, and less conserved amino-acids that interact with less-conserved DNA bases. This would imply that there is a code.

Response: Yes, absolutely, but the point is that this code is specific to that protein family: type II zinc fingers. Another family of DNA-binding proteins may have a completely different code.

Question: Do you think that by looking at a protein sequence you can predict what the recognition site would be?

Response: That's a great question, and logic suggests that the answer should be yes. In fact, we are currently trying to think about this in the context of G protein-coupled receptors, but it is certainly not a straightforward problem.

A second problem I would like to address is whether two proteins perform the same function. This starts with a case in which ET apparently fails, and it is relevant to a previous question: "Which proteins do you choose in your experiment?" Note that although we traced the DNA binding site, we utterly failed to identify a dimerization site between the dimer components. There is no signal at the dimer interface. Remember, however, that this trace included all nuclear hormone receptors. So, besides those that homodimerize head-to-head, there are also those that dimerize head-to-tail, or that are monomers. There is no reason, of course, that a dimerization interface would be conserved in the latter proteins, so it is not. The remedy is to restrict the trace to only those proteins that homodimerize. You can then immediately recover the homodimer interface. In other words, you can select your sequences any way you wish and test whether they share a common functional surface.
This allows you to manipulate the tree in order to set up a number of algebraic manipulations of functional sites and to test whether other receptors use that dimer interface. You start with the steroid sequences and add the other branches to them one at a time. If you set up the computer experiment to include PPAR receptors, most of the interface signal is destroyed. This is also true of RXR and other non-steroid receptors, except for RAR receptors. This could be a statistical fluke, or it could indicate that RAR receptors use the dimerization
interface typically associated with steroid receptors for some aspect of their function. To summarize, we were able to identify protein-DNA binding sites, to suggest how DNA recognition specificity is encoded, to identify subgroup-specific active sites, and to find sites that may be shared by distant branches of a sequence family. This is an example of using subgroup analysis to identify which among various members of a family have a structural intersection of common active sites. Basically, it allows us to conduct computational experiments using the data already acquired by evolution.
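These computational experiments reduce to set algebra on trace results: intersecting the traces of subfamilies exposes a shared functional surface, while subtracting one trace from another exposes subgroup-specific sites. With invented residue numbers:

```python
family_trace    = {27, 33, 51, 80, 96}          # important family-wide
subfamily_trace = {27, 33, 42, 51, 64, 80, 96}  # important in one branch

# Intersection: sites the subfamily shares with the whole family.
shared = subfamily_trace & family_trace
# Difference: sites that are important uniquely within the subfamily.
subgroup_specific = subfamily_trace - family_trace

print(sorted(shared))             # → [27, 33, 51, 80, 96]
print(sorted(subgroup_specific))  # → [42, 64]
```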
G protein-coupled receptors

I would now like to explore functional annotation in the context of G protein-coupled receptors (GPCRs). These receptors have seven membrane-spanning helices connected by internal and external loops. They are divided into five main classes, within which sequence identity is low but significant, and between which sequence identity is not detectable. We would like to understand where ligands bind GPCRs, where the conformational switch is located, whether GPCRs dimerize, and where they couple to G proteins. This information would help target mutations, design drugs, interpret ligand-binding affinity, predict G protein coupling, create constitutively active receptors, and modify G protein targets for assay purposes. Ideally, we would like to repeat the algebra I showed you above: First, establish a trace of all G protein-coupled receptors in order to understand what is important for all of them. Second, trace a specialized GPCR family, in order to identify the residues that are important to that family. Finally, subtract the former from the latter, so as to extract only those residues that are important uniquely in the given GPCR. To do this, we should compare many GPCRs. To be sure that these comparisons make sense, we would first like to make sure these GPCRs have related structures and functions. This is difficult, because sequence identity is poor (there can be substantial structure variation or even greater function variation). We tackle this problem by showing similarities among related GPCRs (positive control) as well as among unrelated ones (the test case of interest), but also by showing that there are no similarities with non-GPCRs (negative control). What types of similarities should we look at? The answer is similarities in terms of the functional importance of residues, as measured by the evolutionary trace.
For example, in the fifth transmembrane helix, the graph of an evolutionary trace drawn from the N- to the C-terminus shows peaks for residues that are very important, and valleys for those that are not. You can see that among Class A receptors, which are easy to align, peaks and valleys tend to be aligned, indicating some form of correlation that grows toward the C-terminus, the region where all GPCRs couple to the G protein, hence where they are most likely to have an identical function. This suggests that matching peaks and valleys might allow us to recover alignments in instances when they are unknown.

Question: I was just wondering if there was something special about this particular helix, or how you chose it?

Response: Well, in truth, it is the best and therefore the most illustrative data for this general alignment scheme. But the work I will show next was done on all helices, with no bias or greater weight given to any.

Question: Why did you focus on the helices rather than on the loops?

Response: The helices define the transmembrane domain, all of it. The loops are intra- or extracellular and therefore define other domains, which in any case are extraordinarily variable among GPCRs. Since I am initially trying to understand general rules that apply to all receptors, it makes sense to focus on the helices. We can study loops later, one family at a time, and extract rules that apply to each ligand family.

Figure 5a shows the correlation of trace ranks over all seven transmembrane helices of bioamine receptors with other members of Class A. The correlation is small but not nil, consistent with the great functional diversity of those receptors. The correlation is also very sensitive to the alignment. It drops drastically if you offset alignments by ±1. At ±2, it becomes negative. It then recovers somewhat, but never fully, since ±4 residues is a full rotation about the helix axis, so that all the internal residues are again facing inside, while the external ones are again facing the lipids. Internal residues will be more conserved and external ones less so.
In other words, there is a structure-based correlation of about 0.18. In conclusion, this positive-control study of trace-rank correlation among Class A receptors shows it to be markedly sensitive to the alignment, suggesting that such analysis would be useful for aligning Class A with Class B and Class C receptors, which we cannot otherwise align based on sequence identity.
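The offset scan just described can be sketched in a few lines: slide one per-residue importance profile against another and record the correlation at each offset, looking for the offset that maximizes it. This is an illustrative reconstruction, not the published ET code; the profiles are whatever per-residue ranks the caller supplies.

```python
# Sketch of the alignment-offset scan described in the text: correlate two
# per-residue importance profiles at every offset from -4 to +4.
# Illustrative only; the real analysis uses trace ranks from sequence trees.

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def offset_correlations(profile_a, profile_b, max_offset=4):
    """Correlation of profile_a[i] with profile_b[i + offset] at each offset."""
    corrs = {}
    for off in range(-max_offset, max_offset + 1):
        pairs = [(profile_a[i], profile_b[i + off])
                 for i in range(len(profile_a))
                 if 0 <= i + off < len(profile_b)]
        xs, ys = zip(*pairs)
        corrs[off] = pearson(list(xs), list(ys))
    return corrs
```

Run on two copies of a helix-periodic profile (period of about 3.6 residues), the scan reproduces the qualitative behavior described above: a maximum at offset 0, negative correlation near ±2, and partial recovery near ±4.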
An Evolutionary Perspective on the Determinants
of Protein . ..
243
Figure 5c shows the same analysis when offsetting an alignment of Class B helices against those of Class A. A best-candidate alignment emerges, which we defined as the zero offset, such that the evolutionary trace-rank correlation is of the same magnitude as within Class A, although the sequence identity of the best alignment has dropped to 12% (Figure 5d), well below the normal range in which you can make comparisons or alignments. For Class C receptors (Figures 5e and 5f), as before, there is still an alignment that maximizes correlation, but now there is no significant sequence identity above the noise threshold of ~8%.
[Figure 5 panels: bar charts of trace-rank correlation and of sequence identity versus alignment offset (-4 to +4), for "ADR vs Class A", "Class B vs Class A", "Class C vs Class A", and "BR vs Class A".]
Figure 5. Optimal alignment of GPCRs. Panels a, c, e, and g show the extent of correlation between trace ranks of various GPCRs or bacteriorhodopsin, for different alternative offsets. Panels b, d, f, and h show the corresponding sequence identity. See text for details.
This pervasive ability to identify a correlation is becoming suspicious; it may be telling us that whatever the protein, it is always possible to find an alignment with a large rank correlation. So we need a negative control. We used bacteriorhodopsin, a non-G-protein-coupled ion pump and light sensor found in archaebacteria. It also folds into seven helices, much like visual rhodopsin. However, evolutionary trace-rank analysis identifies no correlation between Class A and bacteriorhodopsin: as shown in Figure 5g, the correlation magnitude stays at the level of the noise. Thus, the negative control is truly negative.

Question: Are you using the sequence of all seven transmembrane helices? Are you also including loops?
Response: No, these studies all focus on the membrane domain and its seven helices.
Question: I am assuming that this general problem of structural alignment is in fact common to all the examples you have gone through, although it is most obvious in the case of the transmembrane helices. So in fact, how easy is it in general to align the sequences you work with?

Response: It is easy to align sequences above 30% identity, sometimes 25%. Below this, it becomes very hard. Whether the alignment approach that we are using in GPCRs applies to other proteins is an important question that remains to be explored.
Figure 6. Identification of GPCR ligand-binding sites. a. A trace of visual rhodopsin shows that functionally important residues (in red) cluster internally in the rhodopsin structure. b. A trace of nearly 250 receptors from Classes A and B (in yellow) reveals a cluster of trace residues that forms a site of common importance to all these GPCRs; as expected, it is especially prominent close to the G protein coupling site, which should be common to all GPCRs. c. Subtracting b from a yields a small set of trace residues that are specific to visual rhodopsins and that precisely map out the binding site of the light-sensitive retinal chromophore.
These results suggest that GPCRs have a common structure and perhaps common functional determinants, so it is legitimate to trace them jointly. A trace limited to visual rhodopsins is shown in Figure 6a, and a joint trace of Class A and Class B receptors in Figure 6b. Subtracting the globally important residues yields those that are important uniquely to rhodopsin, shown in Figure 6c.
Remarkably, they cluster precisely around the retinal binding-site, which of course is unique and specific to rhodopsin. If you repeat this in other receptors, it is always possible to identify clusters of unique trace residues, but with significant variations in location, suggesting that there are significant differences in the details of ligand-coupling. With rhodopsin, as you tolerate more and more residues, you end up with a large cluster in the lower half of the transmembrane domain that is important to all GPCRs. On the other hand, a trace unique to rhodopsin forms a funnel to this cluster, which we believe is probably the conformational switch that controls GPCR activation.
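The subtraction used to produce the rhodopsin-specific residues of Figure 6c amounts to a set difference over residue positions. A minimal sketch (the position numbers below are invented for illustration, not rhodopsin's actual trace residues):

```python
def family_specific_residues(family_trace, global_trace):
    """Trace residues important to one receptor family but not shared by
    the joint trace of all GPCRs (the 'subtraction' behind Figure 6c)."""
    return sorted(set(family_trace) - set(global_trace))

# Hypothetical residue positions, for illustration only.
rhodopsin_trace = {113, 117, 122, 181, 265, 268, 296}
global_gpcr_trace = {122, 265}
specific = family_specific_residues(rhodopsin_trace, global_gpcr_trace)
# `specific` now holds the positions unique to the rhodopsin trace.
```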
Proteome-scale ET

During the last few minutes I would like to go from the anecdotal world of a few proteins to the entire proteome. Basically, if ET works, it is worth generalizing. However, a number of issues stand in the way, such as scalability, objective criteria for success (statistics), and automation into an efficient pipeline. First, I would really like to focus on statistics, because one of the problems we have had until now is that we have only been able to identify a trace analysis as being important using visualization: we look at the structure and notice a cluster. In practice this is certainly useful, but for large-scale applications we would like a quantitative and objective way to assess significance. For that we look at how different the clustering of trace residues is from that of residues picked at random. For example, random residues in pyruvate decarboxylase do not cluster; in fact they scatter all over, whereas trace residues form one very large main cluster. Random residues form more clusters, each of which is quite small. We therefore repeatedly picked random sets of residues in proteins and built histograms approximating the random distribution of the total number of clusters expected by chance, and of the size of the largest cluster expected by chance. Comparison with actual traces allows us to quantify significance. We can do that in proteins of different molecular weights, and the significance threshold decreases linearly.

Our experiment was to pick the proteins, BLAST them, retrieve the matching sequences, align them, then run the trace and look at the statistics. These traces were not optimized in any special way, except that obvious sequence fragments were removed. The proteins were chosen to be diverse: nineteen had alpha-beta folds, fifteen were all-alpha, seven all-beta, and two were small domains. Some were all eukaryotic, some mixed eukaryotic-prokaryotic, and some only prokaryotic.
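The cluster-significance statistic described here can be sketched as a small Monte Carlo test: treat residues as nodes of a contact graph, take clusters to be connected components, and compare the largest trace cluster against the largest clusters of repeated random picks of the same size. This is a simplified reconstruction of the idea, not the published formalism; the contact map and residue sets are supplied by the caller.

```python
import random

def clusters(residues, contacts):
    """Connected components of the picked residues under a contact map
    (contacts maps each residue to the residues it touches in 3D)."""
    residues = set(residues)
    seen, comps = set(), []
    for r in residues:
        if r in seen:
            continue
        stack, comp = [r], set()
        while stack:
            cur = stack.pop()
            if cur in comp:
                continue
            comp.add(cur)
            stack.extend(n for n in contacts.get(cur, ()) if n in residues)
        seen |= comp
        comps.append(comp)
    return comps

def largest_cluster_pvalue(trace_residues, all_residues, contacts,
                           trials=2000, rng=None):
    """One-sided p-value: the fraction of random picks of the same size
    whose largest cluster is at least as large as the trace's largest."""
    rng = rng or random.Random(0)
    observed = max(len(c) for c in clusters(trace_residues, contacts))
    hits = 0
    for _ in range(trials):
        pick = rng.sample(all_residues, len(trace_residues))
        if max(len(c) for c in clusters(pick, contacts)) >= observed:
            hits += 1
    return hits / trials
```

On a toy 30-residue chain whose only contacts are sequence neighbors, five consecutive trace residues form one cluster of five, which random picks almost never match, so the p-value comes out far below 5%.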
Their functions were also extremely varied. Overall,
at a 5% threshold of significance, trace clusters were found to be significant in 45 out of 46 proteins. Moreover, in cases where the real functional site is known, trace clusters accurately overlapped it. This demonstrates that the evolutionary trace method applies not just in a few special cases, but can in fact be widely applied to the entire set of proteins in the PDB. Notice also that since trace residues are determined using only the sequence, the fact that they map onto a structure in clusters illustrates the cooperative action of folding and function on evolutionarily important residues.

Let us now summarize these results and ask why this evolutionary approach works. First, we can rank the importance of sequence positions over the course of evolution, and this evolutionary importance appears to be directly linked to structural and functional importance, although as yet it makes no clear distinction between one and the other. Functional sites may then be identified as clusters of the most important residues. This has allowed us to predict ligand binding pockets and specificity determinants in a number of blind tests. Specifically, we can anticipate mutation outcomes, low-resolution quaternary structure, and remote homology (GPCRs). This can be used to target mutations to relevant sites, and for functional annotation by figuring out which remote homologs may share functions. These results are statistically significant, and we hope they can be applied to the PDB at large. The problem is to understand how such fairly detailed but general results can emerge from a simple comparison of sequences that pulls out residues that are invariant within groups but variable between them. My view is that it is because the evolutionary trace approach is quite different from a typical algorithm.
Normal bioinformatics computations carry out retrospective analyses based on analogy: if sequence A is kind of like sequence B, then protein A is kind of like protein B, and therefore the structure and function of A are kind of like those of B. But this is far different from what is done in a laboratory, where prospective, deductive analysis leads you to take a sequence and mutate it so that A is made to be unlike B. An assay of function then tells you whether or not A and B are still similar. From that result you deduce the logical relationship between the sequence and the function. The point is that ET follows exactly the same paradigm as these prospective laboratory experiments. First, by comparing sequences pair-wise and looking at all residue variations, we have a large number of sequence mutations at our disposal, exactly as if we had had infinite time and resources to create them in the lab. Moreover, these mutations all produce proteins that fold and function sufficiently well to produce an organism that survives natural selection. Next, we need to couple these mutations to assays. This is done through our second hypothesis: that the sequence-identity tree approximates
a functional tree. If so, it literally means that every node in the tree is a virtual functional assay that distinguishes the function of the top branch from the function of the bottom branch. In a tree of 100 proteins, there are 99 nodes, hence 99 virtual functional assays. This enables us to categorize every single mutation that ever occurred during evolution by its functional effects, from the perspective of natural selection. ET is therefore like an experimental mutational analysis that simply uses all the mutations that occurred during evolution, as well as approximating all the assays carried out by evolution. Thus ET can integrate our growing sequence and structure databases into a meta-database of annotated functional sites which, we hope, will lead experiment and theory to the relevant parts of a protein.

Biology is now confronted with an avalanche of facts and data. Before we can build theories on these facts, we must devise methods to efficiently sift through them and sort out those that are relevant to one another. I have shown you an approach that uses a tree classification as a filtering device to extract the residues that are most directly relevant to function and structure. We hope to target mutations to these residues and predict their outcomes, leading to many novel and useful applications that should extend to the entire proteome.

Taking a step further back, as previously noted, one of the most fundamental challenges in biology is to understand the relationship between sequence, structure, and function. This problem is normally tackled using mathematics, statistics, and physics; but as we pointed out earlier, in biology random chance and natural selection can result in opportunistic discoveries of novel functional niches, such that we cannot know for sure what a sequence does, at least not without a comprehensive description of its context. Thus, one may not always be able to deterministically connect sequence, structure, and function.
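The core ranking idea, picking out residues that are invariant within tree groups but variable between them, can be written down compactly. The sketch below is a simplified version of the trace-rank definition (it ignores gaps and any weighting): a column's rank is the coarsest tree cut at which the column is invariant within every group.

```python
def trace_rank(column, partitions):
    """Rank one alignment column by evolutionary importance.

    column     -- dict mapping sequence name to the residue in this column
    partitions -- tree cuts from coarsest to finest; each cut is a list of
                  groups (lists of sequence names)
    Rank 1 means absolutely conserved; a column that only becomes invariant
    within groups at a finer subdivision gets a higher (worse) rank.
    """
    for rank, groups in enumerate(partitions, start=1):
        if all(len({column[name] for name in group}) == 1 for group in groups):
            return rank
    return len(partitions) + 1  # never invariant within groups
```

With four sequences cut into one, two, then four groups, a fully conserved column gets rank 1, a column that co-varies with the two subfamilies gets rank 2, and a noisy column becomes invariant only in singleton groups.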
A complementary approach, employed in the laboratory, is to test hypotheses with experiments. One can then look at evolution as the greatest repository of a staggering number of such experiments. The advantage of Evolutionary Trace is that it directly exploits these experiments by focusing on the single feature that is central to biology and found in none of the other quantitative sciences: evolution.

I thank some current and past members of my laboratory: Anne Philippi, Srinivasan Madabushi, Mat Sowa, Ivana Mihalek, Hui Yao, David Kristensen, Ivica Res, and Dan Morgan, as well as my collaborator and colleague at Baylor College of Medicine, Ted Wensel.

Question: Your results are very impressive. My question concerns only clarifying which problem you have solved. If you say prediction of function, this could mean
1) you have a protein, you know what it does, and want to clarify which particular residues in this protein are important to achieve this function; 2) you have a protein, you know its sequence and structure, but you don't have the faintest idea what it does. Which problems do you think you can solve? How would you tackle my second point?

Response: The method directly addresses the first problem: you have a biologically active protein; you think you know its function, and wonder what the molecular basis for that particular function is; you do the trace and identify a set of evolutionarily important residues, which by inference are probably linked to your function. You can then focus all your mutational experiments there, essentially in order to convince yourself that this is indeed a correct link. That is the kind of problem we clearly address, and for which we have data. The second problem may be looked at in two ways: 1) if you have enough sequence information, you can do a trace on it, perhaaps determine an active site on which you can focus, and see if it reminds you of any other known active sites. Active sites are starting to be analyzed as irreducible three-dimensional functional elements, to the extent that you can build a database that relates those that are irreducible. Given a function, you can search for those elements in a protein or, given other elements, see if there is a relationship with those you already know. There are problems, however: if you are dealing with a catalytic site, you are probably in good shape, because catalytic sites are not very flexible, so you may be able to recognize structural mimicry. If you are dealing with a protein-protein interface, where there may be many conformational changes, it could be much harder to use simple geometric comparisons to identify the underlying functional similarity.
There is no doubt that in nature, local structural convergence occurs; a given protein region may mimic a region in a different protein, thereby triggering an immune response against the host or performing the function of that other protein.

Comment: It is a very interesting technique. Basically there are three steps: 1) identify the residues by their trace; 2) work out the statistical significance, which basically tells you whether the overall traces are good enough, but does not help to focus on any particular patch (usually there are many patches, scattered and clustered, within the structure); 3) look at the structure and decide whether a patch is important. The question is whether you can develop a statistical procedure that will tell whether a given patch is more significant than another.
Response: Yes. In the case of the largest patch, we have a formalism that already works (that is what we used here). For the second, third, and fourth largest patches, the question is whether they are really significant, and I think that what we are doing may be extended to consider these secondary patches as well. There are probably even better formalisms to address the whole issue.

Question: When you look at the mutation frequency in those traces and compare it to the mutation frequency around the traces, which will be higher?

Response: Typically, the mutation frequency will be higher in the residues that are least important and lower in the residues that are more important. This is a generality and does not necessarily apply to specific residues. The point is that the mutation frequency may be quite high in some trace residues, so a simple statistical analysis of mutation frequency will not allow you to resolve the functional site as well as you would like.

Question: I understand that one approach would be to go along the protein chain and look at the mutation frequencies (the peaks and the troughs). How does that correlate with what you have done?

Response: We did do that initially, finding that the functional sites we identified appeared larger, blurry, and smeared over the protein. The problem is that you are basically neglecting some of the available information, so you get a low-resolution answer.

Comment: You might have mentioned the spectacular success of this kind of approach, done a long time ago. But I completely agree that there is much less information in it than in the whole tree.

Response: We used to believe that conserved residues are more important for function. That is basically true, and part of the story. Conserved residues are important, but if you look across the whole family, there are some important residues that are not conserved: those that co-vary with the function are also important.
So if you look at conserved residues (example: the active site), they will be conserved throughout the whole family. But if you look at the SH3 domain, the binding site co-varies with the ligands that the protein binds. So if you look at co-variation, you will see the residues that actually recognize various ligands.
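The distinction drawn here between conservation and co-variation can be made concrete with a small classifier over alignment columns. A minimal sketch (the functional groups and residues below are invented for illustration):

```python
def classify_column(column, groups):
    """Classify one alignment column given a functional grouping.

    'conserved' -- one residue type across the whole family;
    'covarying' -- invariant within each group, different between groups;
    'variable'  -- varies even within a single group.
    """
    per_group = [{column[name] for name in group} for group in groups]
    if len(set().union(*per_group)) == 1:
        return "conserved"
    if all(len(s) == 1 for s in per_group):
        return "covarying"
    return "variable"
```

An active-site column would come out 'conserved'; an SH3-like binding-site column, invariant within each ligand family but different between families, would come out 'covarying'.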
Comment: Obviously, nature does not provide unbiased data, and some of the most productive controls for your experiment might not be present in today's organisms, but would be provided by bench experiments. My question concerns the patches that you showed with false positives; you proposed that they might become non-false positives later on, as more data accumulate. However, another possibility could be that they reflect structural or regular patterns on upstream macromolecules, such as DNA, RNA and mRNA. I would like to have your opinion on that.

Response: Yes, it is very hard to know whether something is important when you do not have a strong context in which to infer why it is important. There are many processes in nature through which a particular residue could end up not being important. Even when you have a context, even when you have something that is important for the binding site, it may still be only a supposition of importance.

Comment: Your approach is based on the assumption that sequence phylogeny is a good representation of functional phylogeny. Sequence phylogeny is more or less well defined, because sequence is digital, but function is more difficult to quantify. I think that function may be either chemical or biological. A good example is the case of hemoglobin, which had been thought of as an oxygen binder, but is now shown to be an enzyme. Even such a well-known protein as hemoglobin turns out to have a completely different biological function, although it is chemically similar. The concepts of function and functional phylogeny can be very complex. Do you have any comment on this?

Response: Those are questions for which you almost need the answer in order to build the functional classification, and many of those functions remain hidden to us. I think that what you can do is identify what is important, carry out experiments, and see if you can start teasing apart the various functions by focusing on the most relevant residues.
You need to go back to experimentation. You can only go so far, because a given interface may be involved in many different functions, and considering the overlap, you will not know that until you start testing, using various assays.

Question: So to be on the safe side, would you admit that sequence phylogeny corresponds to chemical functions, to make a conservative statement?
Response: I guess that seems reasonable, and chemistry would be a nice way to put it. But I am still not sure that I agree. The most accurate statement I can make is that we use sequence phylogenies to approximate functional classifications. But there are currently no true, correct, and experimentally tested functional classifications that we can use as a gold standard, because we have only a few crude assays to use in trying to understand the myriad functional aspects of any given protein in terms of its folding, structural, dynamical, biochemical, cellular, physiological, degradation, and other characteristics. I am perplexed, as you may be, as to how one might begin richly describing such a complex classification. I believe that sequence identity does begin to approximate it, and our results suggest that we are on the right path, even if this is a gross approximation. Does this approximation reduce to simply describing chemistry? I think only partially, depending on the protein family.

Question: Could you comment on the evolutionary tree? How long must the branches be to obtain a given group? Do you depend on a priori assumptions about the evolution of proteins? Did you try several tree-building methods on a specific example?

Response: Not thoroughly. I've kept a pragmatic point of view (like evolution): if it works, it's good enough. When it starts failing, you go back and ask yourself what approximations you made that are not reasonable, and maybe at that point you have to start asking why the tree does not work. In one case, we did obtain a nonsense result for one residue: it appeared to be unimportant, but was surrounded by many others that were. This inconsistency suggested an error, and when we examined the trace we found that a single sequence with many atypical variations was responsible for that residue's poor ranking. It turned out that this sequence is an oncogene.
Hence it was misclassified in the tree as being functionally similar to its evolutionary relatives, while in fact it performs a completely different biological function. In other words, in that instance a correct sequence-identity tree yielded a false functional classification. So the trees we use really are approximations and can doubtless be improved.

Comment: To continue this question: you used importance functions, and if your definition of weights is not one-to-one with the evolutionary tree, it could completely change that importance function, which could change your conclusions entirely.
Response: If you do a different experiment, you might change the result. But our experimental results match known biological data, so we feel that the current tree-building algorithms are good enough for the time being.
General references

Lichtarge, O., Sowa, M.E. (2002). Evolutionary Predictions of Binding Surfaces and Interactions. Curr. Op. Struct. Biol. 12:21-27.

Lichtarge, O., Sowa, M.E., Philippi, A. (2002). Evolutionary Traces of Functional Surfaces Along the G Protein Signaling Pathway. Methods Enzymol. 344:536-556.
Methodological references

Lichtarge, O., Bourne, H.R., Cohen, F.E. (1996). The Evolutionary Trace Method Defines Binding Surfaces Common to a Protein Family. J. Mol. Biol. 257:342-358.

Lichtarge, O., Yamamoto, K.R., Cohen, F.E. (1997). Identification of Functional Surfaces of the Zinc Binding Domains of Intracellular Receptors. J. Mol. Biol. 274:325-337.

Madabushi, S., Yao, H., Marsh, M., Philippi, A., Kristensen, D., Sowa, M.E., Lichtarge, O. (2002). Structural Clusters of Evolutionary Trace Residues are Statistically Significant and Widespread in Proteins. J. Mol. Biol. 316:139-153.

Yao, H., Kristensen, D.M., Mihalek, I., Sowa, M.E., Shaw, C., Kimmel, M., Kavraki, L., Lichtarge, O. (2003). An accurate, scalable method to identify functional sites in protein structures. J. Mol. Biol. 326:255-261.
Specific applications

Lichtarge, O., Bourne, H.R., Cohen, F.E. (1996). Evolutionarily Conserved Gαβγ Binding Surfaces Support a Model of the G Protein-Receptor Complex. Proc. Nat. Acad. Sci. USA 93:7507-7511.

Onrust, R., Herzmark, P., Chi, P., Garcia, P., Lichtarge, O., Kingsley, C., Bourne, H.R. (1997). Receptor and βγ binding sites in the α subunit of the retinal G protein transducin. Science 275:381-384.

Sowa, M.E., He, W., Wensel, T.G., Lichtarge, O. (2000). Identification of a General RGS-Effector Interface. Proc. Nat. Acad. Sci. USA 97:1483-1488.
Sowa, M.E., He, W., Slep, K.C., Kercher, M.A., Lichtarge, O., Wensel, T.G. (2001). Prediction and Confirmation of an Allosteric Pathway for Regulation of RGS Domain Activity. Nature Struct. Biol. 8:234-237.

Madabushi, S., Philippi, A., Meng, E.C., Lichtarge, O. Signaling Determinants Reveal Functional Subdomains in the Transmembrane Region of G Protein-Coupled Receptors. (Submitted).
SOME RESIDUES ARE MORE EQUAL THAN OTHERS: APPLICATION TO PROTEIN CLASSIFICATION AND STRUCTURE PREDICTION

ALEXANDER KISTER & IZRAIL GELFAND
Department of Mathematics, Rutgers University, Piscataway, NJ, USA
"All animals are equal but some animals are more equal than others." .. .George Orwell, "Animal Farm"
It is well known that not all residues contribute equally to the stability of a protein structure. In view of this, we suggest a new approach to the classification and structure prediction of proteins, based on the following premise: a small set of specific residues may be used to assign a query protein to the proper group in the protein hierarchy and to predict its secondary and tertiary structure.
Introduction

One of the main challenges in the life sciences today is to understand how genomic sequences determine the geometric structure of proteins (e.g., see [1]). Knowledge of the three-dimensional structure of a protein provides valuable insight into its functional properties, since function is largely determined by structure [2]. The ability to classify a genomic or amino-acid sequence into its proper protein family allows one to predict, with some degree of approximation, its structure and function. This is an essential prerequisite to using genomic information to explain the enzymatic processes that underlie cell behavior, to understand the molecular basis of disease, and to achieve rational drug design. With more than fifty complete genomes already sequenced and at least a hundred more close to completion [3], the gap between known sequences and solved structures (collected at the Protein Data Bank [4] and classified in the SCOP database [5]) is quickly widening. Consequently, the task of predicting structure from the amino-acid sequence has taken center stage in the "post-genomic" era. Direct approaches to structure determination include X-ray crystallography and nuclear magnetic resonance, among other techniques. However, such methods are expensive, time-consuming, and not always applicable, especially since, for a large number of proteins, only the primary sequences are known.
256
A. Kister & I. Gelfand
[Figure 1 panels: schematic strand diagram running from N-end to C-end, with strand labels i, i+1, k, k+1; panels a), b), c).]

Figure 1. Schematic representation of a typical variable immunoglobulin domain. a) β-sheet strands are numbered sequentially, as they appear in the sequence; strands 2, 3, 7, and 8 are shown. b) Chain fold of an immunoglobulin heavy-chain variable domain (PDB code line); drawing made using Molscript [26]; β-sheet strands are shown as ribbons. c) Arrangement of strands in the two main β-sheets; the interlocked pairs of strands, (i, i+1) and (k, k+1), correspond to strands 2, 3 and 7, 8.
Some Residues are more Equal than Others: Application to Protein ...
257
The potential of alternative methods of protein comparison and classification is not yet settled, and there is an urgent need for more reliable approaches to such bioinformatics problems. Alternative approaches, based on theoretical study of the nature of the sequence/structure relationship, may be immensely useful in dealing with the wealth of data on newly sequenced genomes. There exist both local and global points of view on the relationship between the linear sequence of amino acids and the resulting three-dimensional protein structure. The former postulates that just a few critical residues, some 10-20% of the sequence, play the decisive role in determining the characteristics of a fold, whereas the latter considers all residues in the sequence to be crucial [6-7]. The local model received considerable support when Chothia and Lesk showed that rather different amino-acid sequences share the same fold, i.e., the same major secondary structures in the same arrangement with the same chain topology [8]. In our recent article with Chothia and Lesk, we discussed why structure changes more slowly than sequence during protein evolution [9]. For related proteins, structural similarities arise during the course of their evolution from a common ancestor, whereas for proteins with very low homology, fold similarity may be due to physical and chemical factors that favor certain arrangements of secondary structure units and chain topologies. One possible explanation for the structural similarity of proteins with widely divergent sequences (homology of 20% or less) is that a few essential residues at specified key positions define the structure of a molecule, whereas residues located at other positions play an auxiliary role and do not have a major effect on the fold. In reality, all residues make some contribution to structure stability, but the relative importance of these contributions may be very different.
This is also borne out by site-specific mutagenesis experiments, which reveal that substitutions of residues at various positions may have quite variable effects on the structure and stability of proteins [10-13]. It may therefore be suggested that some residues are more equal than others. In this work we show that these "more-equal-than-others" key residues have very important properties. Knowledge of a small set of key residues alone allows one: i. to classify a given protein into an appropriate group of proteins; ii. to predict the main structural characteristics of a query protein, such as its fold, its supersecondary structure, and the coordinates of the key residues. A protein group can be as 'narrow' as, for example, a protein family, or as 'wide' as a set of several superfamilies from different folds. The validity of our
approach has been demonstrated for 'narrow' groups of proteins, such as the family of the variable domains of immunoglobulin-like proteins [14-16], and for the cadherin superfamily [17]. In this work we describe a 'wide' group of proteins, which comprises the so-called sandwich-like proteins (SPs). The overall goal of this research is to identify a small set of key positions in the SP group. Residues at the key positions should have similar structural and sequence characteristics across all SPs. Knowledge of the structural characteristics and three-dimensional coordinates of the key residues, coupled with the ability to identify key residues within a query sequence, allows us not only to assign a query protein to the SP group, but also to make specific predictions regarding its secondary and tertiary structure. Residues at the key positions will be referred to as sequence determinants, since they determine the group affiliation and essential structural characteristics of these proteins. The investigation of structural and sequence features common to SPs is divided into two parts: the search for positions whose residues have the same structural role across all SPs, and the search for the sequence determinants of SPs, a subset of conserved positions whose residues share both structural and chemical properties in all these proteins.
Identifying invariant structural characteristics in a group of non-homologous sandwich-like proteins

The proteins of 69 superfamilies in 38 protein folds have been described as 'sandwich-like proteins' (see folds 1.2.1 - 1.2.37 in SCOP [5], release 1.59). Spatial structures of SPs are composed of β-strands that form two main β-sheets [Fig. 1]. Although the general architecture of SPs is relatively uniform, the number of strands and their arrangement in space vary widely [18-21]. In addition to the two 'main' sandwich sheets, many SPs contain several 'auxiliary' β-sheets. Comparison of SP sequences in various superfamilies reveals neither functional homology nor significant sequence homology. In fact, some SPs share so little homology (less than 10-15%) as to be undetectable even with the most advanced homology search algorithms, such as HMMer [20]. Our working assumption is that non-homologous proteins grouped together on the basis of common architecture share common features at the level of supersecondary structure. To reveal these structural regularities, we analyzed the hydrogen bonds between the strands that make up the two main sheets. It was found that despite a seemingly unlimited number of arrangements of strands resulting in the
Some Residues are more Equal than Others: Application to Protein . ..
sandwich-like structure, there exists a rigorously defined constraint on the arrangement of strands in the sheets that holds true for some 95% of SPs. This constraint may be stated as follows: In any given sandwich-like protein structure there exist two pairs of strands, (i, i+1) and (k, k+1), such that:

i. the strands of each pair are adjacent to each other in the sequence (Fig. 1a);
ii. strand i is located in one main sheet and i+1 in the other;
iii. strand k is found in one main sheet and k+1 in the other;
iv. strands i and k are located within the same sheet, are anti-parallel to each other, and are linked by hydrogen bonds;
v. likewise, strands k+1 and i+1 are located within the other main sheet, are anti-parallel to each other, and form hydrogen bonds with each other.
The two pairs of strands are usually found in the middle of the sheets. Interestingly, the two pairs of strands form a sandwich-like substructure within SPs (Fig. 1). This regularity, termed "the rule of interlocked pairs of strands," defines the invariant feature of SPs at the supersecondary structure level.
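The rule of interlocked pairs can be phrased as a concrete test on a list of strands. The sketch below is ours, not the authors' code; the data layout (each strand tagged with its main-sheet assignment and its antiparallel hydrogen-bond partners) is an assumption made for illustration.

```python
def find_interlocked_pairs(strands):
    """Return (i, k) for the first interlocked pairs of strands, else None.

    `strands` is a list, in sequence order, of dicts:
      {"sheet": 0 or 1, "antiparallel_hbonded": set of strand indices}
    """
    n = len(strands)
    for i in range(n - 1):
        # (ii) strands i and i+1 must lie in different main sheets
        if strands[i]["sheet"] == strands[i + 1]["sheet"]:
            continue
        for k in range(i + 2, n - 1):
            # (iii) the same condition must hold for the pair (k, k+1)
            if strands[k]["sheet"] == strands[k + 1]["sheet"]:
                continue
            # (iv) i and k share a sheet and are antiparallel H-bond partners;
            # (v) then i+1 and k+1 automatically share the other sheet,
            #     so it remains to check that they too are bonded.
            if (strands[i]["sheet"] == strands[k]["sheet"]
                    and k in strands[i]["antiparallel_hbonded"]
                    and k + 1 in strands[i + 1]["antiparallel_hbonded"]):
                return (i, k)
    return None

# Toy four-strand sandwich: strands alternate between the two sheets.
toy = [
    {"sheet": 0, "antiparallel_hbonded": {2}},
    {"sheet": 1, "antiparallel_hbonded": {3}},
    {"sheet": 0, "antiparallel_hbonded": {0}},
    {"sheet": 1, "antiparallel_hbonded": {1}},
]
```

On the toy example, strands 0 and 2 form the (i, k) pair; removing any one of the required hydrogen bonds makes the search fail, mirroring the ~5% of SPs that do not satisfy the rule.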
Identification of SP sequence determinants

The identification of sequence determinants is contingent upon proper structure-based sequence alignment of proteins. An essential element of our method is that it involves alignment not of whole sequences, but of strands with a similar structural role in their respective proteins. A group of homologous proteins comprising a protein family is generally characterized by a similar number and arrangement of strands. For this reason, secondary structure determination and alignment of corresponding strands in homologous proteins are generally straightforward. On the other hand, recognition of corresponding strands in a group of proteins as diverse as a collection of SPs from different superfamilies is not a trivial problem. The number and arrangement of strands in the sheets vary widely among SP structures. Therefore, structure-based alignment of non-homologous proteins involves a prerequisite step: determining which strands play an analogous structural role in their respective sequences. Armed with knowledge of the invariant supersecondary features of SPs, we are able to align corresponding strands from different proteins. It follows from the rule of interlocked pairs of strands that four strands, i, i+1, k, and k+1, which have similar structural properties, are found in all SPs. Thus, in our procedure, i strands from all structures were aligned with each other, as were all i+1
strands, and so forth. In order to find conserved positions in the i, i+1, k, and k+1 strands, we characterized each residue with respect to its (i) residue-residue contacts, (ii) hydrogen bonds, and (iii) residue surface exposure. Since strand alignment is based on the structural properties of residues, the first residue in the i strand of one sequence can possess structural properties similar to (and be aligned with), for example, the third residue of the i strand of another sequence. See, for example, the first residue, S, in the i strand (PDB code 1ine in Table 1) and the third residue, T, in the i strand (PDB code 1cgt). Thus, in the common numbering system, the i strand starts at position 3 in the 1ine protein. A consequence of introducing a common numbering system based on the structural alignment of residues is that strands can start at positions other than position 1 and that their lengths can vary for different sequences. It is important to note that no "gaps" are allowed within strands, since strands are viewed as indivisible structural units. Adjacent residues within a strand are always assigned sequential position numbers. However, gaps between strands are a common occurrence. This analysis enabled us to align residues with similar structural properties. It follows that residues that occupy identical positions in the strands have the same structural role in the various molecules. This allows us to compare non-homologous proteins, for example, from different superfamilies and dissimilar geometrical folds. The advantage of the structure-based approach is that it makes a common system of residue numbering possible for widely divergent sequences. The structure-based sequence alignment method employed here was developed in our previous work [14]. For the alignment of residues from the i, i+1, k, and k+1 strands, the structures were culled from the "SP folds" (Table 1). These proteins belong to different superfamilies and possess no major homology.
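The common numbering convention described above can be illustrated with a few lines of code. The residue letters and offsets below are invented for illustration (they only mimic the 1ine/1cgt example in the text); the helper `number_strand` is hypothetical.

```python
def number_strand(residues, start_position):
    """Assign common-numbering positions to a strand's residues.

    Strands are indivisible: no gaps are allowed inside a strand, so
    consecutive residues always receive consecutive position numbers.
    """
    return {start_position + j: aa for j, aa in enumerate(residues)}

# Hypothetical i strands from two proteins: in protein A the strand is
# structurally aligned starting at common position 3, in protein B at 1.
strand_a = number_strand("SVK", start_position=3)    # positions 3, 4, 5
strand_b = number_strand("QVTVR", start_position=1)  # positions 1..5

# Residues occupying the same common position play the same structural role,
# so only the overlapping positions are directly comparable.
shared = sorted(set(strand_a) & set(strand_b))
```

Here the first residue of strand A (common position 3) is compared with the third residue of strand B, exactly as in the 1ine/1cgt example; positions 1 and 2 of strand B have no counterpart in strand A.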
Analysis of the structurally aligned sequences in the SP common system of numbering revealed that in i strands, only positions 6-8 are always occupied in all known SPs. This means, for example, that residues found at position 6 of SP sequences all share similar structural characteristics. The same applies to residues at positions 7 and 8. In the remaining three strands of the invariant substructure, the following positions were occupied by structurally similar residues in all SP structures: in i+1 strands, positions 4-6; in k strands, positions 8-10; and in k+1 strands, positions 6-8. These twelve positions are occupied by residues with structurally similar properties in their respective SP structures. The residues at these positions lie at the center of the interface between the β-sheets and form the common geometric core of SP structures.
Table 1. Structure-based sequence alignment of i, i+1, k, and k+1 strands. 'Fold': proteins are classified as per the SCOP database (release 1.59), with three numbers identifying the protein fold, the superfamily, and the family, respectively. 'Str': PDB codes of the structures. Conserved hydrophobic and aromatic positions (see text) are in boldface. Each vertical column of the table, starting with the third, corresponds to a specific position in one of the four strands (i, i+1, k, and k+1).

[The body of Table 1, listing the aligned strand residues for representative structures (1INE, 1TF4, 1CGT, and others), is not legibly reproduced in this scan.]
Table 2. Sequence determinants of four groups of sandwich-like proteins, and identification of these four groups within eleven distinct genomes. The four groups are classified as per the SCOP database: 1) PL, the lipoxygenase N-terminal domain protein family; 2) AT, the alpha-toxin C-terminal domain protein family; 3) AD, a 30-kD adipocyte complement-related protein; 4) TR, the TRANCE/RANKL cytokine protein domain. The table presents the family-specific sets of conserved positions for each of the four protein groups (such as position 4 in the i strand of the PL family and position 10 in the i strand of the AT family). Assignment of genomic sequences to each of the four protein groups: the first column lists the names of the organisms from which the genomes are derived; the second column contains the number of protein sequences in the respective genome. The numbers of sequences belonging to each group of proteins (PL, AT, AD, and TR) found in the genome using our method of sequence determinants (MSD) appear in the "MSD" columns; the "HMM" columns show the numbers of sequences of the respective groups found using hidden Markov models (see SCOP database).

[Genomes surveyed (protein counts in parentheses): Arabidopsis thaliana (25617), Clostridium acetobutylicum (3672), Clostridium perfringens (2660), Mesorhizobium loti (6752), Pseudomonas aeruginosa (5567), Caenorhabditis elegans (20448), Drosophila melanogaster (14335), Escherichia coli K12 (4289), Escherichia coli O157:H7 (5361), Bacillus halodurans (4066), Lactococcus lactis (2266). The per-family MSD and HMM counts are not legibly recoverable from this scan.]
Inspection of the amino-acid frequencies at these twelve positions showed that two of the three positions in each strand are conserved hydrophobic positions of SPs: positions 6 and 8 in i strands, 4 and 6 in i+1 strands, 8 and 10 in k strands, and 6 and 8 in k+1 strands. They are occupied either by aliphatic (A, V, L, and I), aromatic (W, Y, and F), or non-polar amino-acid residues (M and C). Residues at these eight conserved positions were termed the SP sequence determinants. Residues V, L, I, and F accounted for 80% of all SP sequence determinants.
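The eight-determinant criterion can be written down as a simple membership test. This is our sketch, not the authors' code: the position numbers follow the reconstruction given in the text (and should be checked against the original tables), and the toy alignment data below are invented.

```python
# Amino acids observed at the conserved positions: aliphatic (A, V, L, I),
# aromatic (W, Y, F), or non-polar (M, C).
HYDROPHOBIC = set("AVLIWYFMC")

# The eight SP sequence-determinant positions, per strand, in the common
# numbering system (as reconstructed from the text).
DETERMINANT_POSITIONS = {
    "i":   (6, 8),
    "i+1": (4, 6),
    "k":   (8, 10),
    "k+1": (6, 8),
}

def is_sp_candidate(strand_residues):
    """strand_residues maps strand name -> {common position: residue letter}.
    Returns True when every determinant position holds a hydrophobic residue."""
    return all(
        strand_residues[s].get(p) in HYDROPHOBIC
        for s, positions in DETERMINANT_POSITIONS.items()
        for p in positions
    )

# Invented example of an aligned substructure that passes the test.
toy_alignment = {
    "i":   {6: "V", 7: "T", 8: "L"},
    "i+1": {4: "I", 5: "G", 6: "F"},
    "k":   {8: "A", 9: "S", 10: "V"},
    "k+1": {6: "L", 7: "K", 8: "I"},
}
```

Replacing any determinant residue with a charged one (say, aspartate at position 6 of the i strand) makes the test fail, which is the sense in which these eight positions act as a signature of the SP group.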
Protein classification and structure prediction based on sequence determinants

This work is based on the premise that structure and sequence determinants may be used to classify proteins and predict their structure. A group of proteins may be characterized by the sets of residues at its conserved positions: the sequence determinants. Since the residue content of the conserved positions and the amino-acid distances between them are known for a group of proteins, it is possible to scan a novel protein sequence in order to ascertain whether it contains the sequence determinants of a given protein group. If a query sequence contains all the sequence determinants of a given protein group, it can be assigned to that group. In addition to the possibility of assigning primary sequences to their proper protein classes, our approach also allows making a number of specific predictions with respect to the structural properties of proteins. As follows from the definition of sequence determinants, they are characterized by a number of secondary and 3-D structural characteristics, including the coordinates of their Cα atoms. Thus, residues in the query sequence that match the sequence determinants by virtue of their chemical properties and location in the amino-acid sequence may be assumed to have all the structural characteristics of the sequence determinants as well. Knowledge of the secondary structural properties and coordinates of the Cα atoms of residues at conserved positions allows prediction of the protein fold and the main features of the supersecondary structure (arrangement of strands), as well as construction of a fairly detailed 3-D model of the query sequence.
Using sequence determinants to classify proteins: an illustration

Knowledge of the sequence determinants of protein groups has led to the development of a computer algorithm for the classification of proteins. To assign a query sequence to its proper protein family, it is necessary to match a subset of query-sequence residues with the sequence determinants of a protein group. In order to classify sequences, we implemented an algorithm based on an appropriate modification of dynamic programming [17]. This algorithm matches the sequence determinants of a given protein group one-by-one with residues of the query sequence. Once a match has been obtained for the sequence determinant closest to the beginning of the sequence, the algorithm seeks a match for the second determinant in the query sequence, and so on. If all the sequence determinants
match, the protein is assigned to the group. A small number of residues of a given query sequence thus uniquely identifies its group affiliation. Some data concerning the extraction of the proteins of several SP protein families/domains from the genomes of various organisms are presented below. As described above, sandwich-like proteins are characterized by an invariant substructure consisting of two interlocked pairs of adjacent strands, i, i+1 and k, k+1. Eight conserved positions - the sequence determinants common to all SPs - were found in these strands. However, in addition to those eight positions, SP families have "family-specific" conserved positions. For the various protein groups, there were between one and three of these "extra" specific conserved positions within the four strands. The results of applying a search algorithm that uses the sequence determinants of four protein families to all sequenced proteins of eleven different genomes are presented in Table 2. The "MSD" column of the table shows how many proteins of a given family were found in the respective genome by our algorithm. For comparison, the "HMM" column indicates the number of proteins of a family found using the HMM search procedure, considered the most powerful of all currently used methods [20]. Overall, both methods found approximately the same number of SPs in the 11 genomes. All but one of the sequences found by HMM were detected by our approach. However, our method revealed a number of additional sequences that may be putatively assigned to the four families. For the most part, these "additional" proteins are labeled "unrecognized proteins" in the genome. We suggest that our approach can identify even those SPs that are hidden from the HMM search procedure. Further investigation is necessary to determine whether these candidate sequences indeed belong to the respective SP families.
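The one-by-one matching procedure described above can be sketched in a few lines. This is a deliberately simplified, greedy left-to-right version, not the authors' dynamic-programming algorithm [17] (greedy first-match can miss placements that backtracking would find); the determinant encoding (allowed residues plus an allowed gap range to the next determinant) is our assumption.

```python
def match_determinants(sequence, determinants):
    """Greedy left-to-right matching of ordered sequence determinants.

    Each determinant is (allowed_residues, min_gap, max_gap): the residues
    permitted at that determinant position and the allowed number of
    intervening residues before the next determinant.  Returns the list of
    matched positions, or None if some determinant cannot be placed.
    """
    positions, lo, hi = [], 0, len(sequence)
    for allowed, min_gap, max_gap in determinants:
        # Find the first allowed residue inside the current search window.
        hit = next((p for p in range(lo, min(hi, len(sequence)))
                    if sequence[p] in allowed), None)
        if hit is None:
            return None
        positions.append(hit)
        # The next determinant must fall within this determinant's gap window.
        lo, hi = hit + 1 + min_gap, hit + 2 + max_gap
    return positions
```

For example, with invented determinants [(V/L/I, gap 0-10), (F/W/Y, gap 1-5), (I/L, gap 0-3)], the sequence "ACVDEFLKI" matches at positions [2, 5, 8], whereas a sequence lacking any one determinant is rejected (the query is not assigned to the group).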
Our approach also provides an independent check on the accuracy of the HMM-based algorithm. The assignment of a query protein to a protein family yields the structural characteristics of that protein. Thus, all the proteins found have sandwich-like structures. The set of residues aligned with the sequence determinants constitutes the geometrical core of the given protein family and allows us to assign coordinates to the Cα atoms of these residues. Based on this substructure, we can construct the entire 3-D structure of the protein by applying commonly used homology-based structural prediction methods [21].
Discussion

A direct corollary of our approach is that the complexity of protein-sequence search algorithms and 3-D structure predictions may be dramatically reduced. Instead of carrying out searches using whole protein sequences, we can now search using predefined sets of several key residues. This is analogous to identifying a criminal suspect by his fingerprints rather than by a long list of non-unique descriptors. Our data on sandwich-like proteins show that the proposed search algorithm compares favorably with powerful and widely used techniques based on hidden Markov models. Another advantage of carrying out structure-based analysis is that it often allows us not only to predict the affiliation of a particular protein and outline its secondary and 3-D structure, but also to make "educated guesses" regarding the functional roles of various portions of its sequence. It is evident that the ability to pinpoint parts of a protein sequence that are likely to participate in protein binding (for example) can prove invaluable in planning mutagenesis experiments, or for rational drug design. Overall, our approach may be called "naturalistic" in the Linnaean sense; our aim is to construct a kind of protein "classification key" whereby each protein family, superfamily, group of superfamilies, etc. would be characterized by a limited set of highly specific structure and sequence characteristics. Upon encountering a new protein sequence, one would be able to quickly "scan" it for the presence of the characteristic features and assign it to its proper classification category. The strength of this approach lies in its predictive power: upon attributing a query sequence to a particular protein group, it would be possible to make highly specific predictions about its structural properties.
Reasoning by analogy with known structures, one can also speculate about the function of various parts of the sequence and predict, for instance, that a certain portion would be involved in protein-protein recognition. By analogy with zoology, if a new species of animal possessed one or more distinguishing characteristics of ruminants, for example, and was therefore classified as such, one could immediately predict that the newly found mammal would only have molars on its upper jaw (structure prediction), as well as what the functions of each of the four parts of its stomach would be (prediction of the functional properties of various parts of the structure).
References

1. Brooks C.L., Karplus M. and Pettit B.M., Proteins: A Theoretical Perspective of Dynamics, Structure and Thermodynamics (Wiley, New York, 1988).
2. Anfinsen C.B., Science 181 (1973), 223-230.
3. National Center for Biotechnology Information - http://www.ncbi.nlm.nih.gov/
4. Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N. and Bourne P.E., Nucleic Acids Research 28 (2000), 235-242. http://www.rcsb.org/pdb/
5. Murzin A.G., Brenner S.E., Hubbard T. and Chothia C., J. Mol. Biol. 247 (1995), 536-540. http://scop.mrc-lmb.cam.ac.uk/scop/
6. Lattman E.E. and Rose G.D., Proc. Natl. Acad. Sci. USA 90 (1993), 439-441.
7. Wood T.C. and Pearson W.R., J. Mol. Biol. 291 (1999), 977-995.
8. Chothia C. and Lesk A.M., EMBO J. 5 (1986), 823-826.
9. Chothia C., Lesk A.M., Gelfand I.M. and Kister A.E., in Simplicity and Complexity in Proteins and Nucleic Acids, edited by Frauenfelder H., Deisenhofer J. and Wolynes P.G. (Dahlem University Press, 1999), pp. 281-295.
10. Bowie J.U. and Sauer R.T., Proc. Natl. Acad. Sci. USA 86 (1989), 2152-2156.
11. Lim W.A. and Sauer R.T., Nature 339 (1989), 31-36.
12. Rennel D., Bouvier S.E., Hardy L.W. and Poteete A.R., J. Mol. Biol. 222 (1991), 67-87.
13. Axe D.D., J. Mol. Biol. 301 (2000), 585-595.
14. Gelfand I.M. and Kister A.E., Proc. Natl. Acad. Sci. USA 92 (1995), 10884-10888.
15. Galitsky B., Gelfand I.M. and Kister A.E., Proc. Natl. Acad. Sci. USA 95 (1998), 5193-5198.
16. Chothia C., Gelfand I.M. and Kister A.E., J. Mol. Biol. 278 (1998), 457-479.
17. Kister A.E., Roytberg M.A., Chothia C., Vasiliev Y.M. and Gelfand I.M., Prot. Sci. 10 (2001), 1801-1810.
18. Chothia C. and Finkelstein A.V., Ann. Rev. Biochem. 57 (1990), 1007-1039.
19. Woolfson D.N., Evans P.A., Hutchinson E.G. and Thornton J.M., Protein Engin. 6 (1993), 461-470.
20. Gough J. and Chothia C., Nucleic Acids Res. 30(1) (2002), 268-272. (http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/index.html)
21. Neumaier A., Molecular modeling of proteins and mathematical prediction of protein structure, SIAM Rev. 39 (1997), 407-460.
STRUCTURE-FUNCTION RELATIONSHIPS IN POLYMERASES

MARC DELARUE
Unité de Biochimie Structurale, Institut Pasteur-CNRS, Paris, France
My talk today will focus on DNA polymerases, starting with a brief overview of the topics to be covered. First I will review how protein sequence analysis may be used to identify and cluster various DNA polymerase families. Next I will describe the crystal structure of a template-independent DNA polymerase that was recently solved in our laboratory. Then I will discuss the open/closed conformational transition in DNA polymerases, a feature common to all polymerase families. To do this I will rely on a simplified version of normal mode analysis. If time allows, I will discuss the role of electrostatics in the active site, where there are metal ions and charged substrate molecules.
Figure 1. The central dogma (reproduced with permission of Garland Science).
The physiological role of polymerase is essential in all the kingdoms of life. The so-called central dogma of molecular biology, which was written on a blackboard by Jim Watson in the 1950s [Fig. 1], simply states that DNA makes DNA makes RNA makes protein. Of course, not all the players in the process were known at that time, but over the years, especially during the 1960s, they came to be identified. One of them, DNA polymerase, is required when a cell divides and must make copies of its DNA for the daughter cells. Transcription of DNA into RNA requires RNA polymerase. The machinery that translates RNA into protein is the ribosome, about which you will no doubt hear more during this conference. Fig. 1 shows the celebrated double helix of Watson and Crick, which immediately suggests the mechanism for transferring information from one strand to the other. Here is the famous Watson-Crick pair, which best demonstrates the notion of complementarity and base-pairing ...and on the seventh day of hard work they just sat back, relaxed, and sipped tea...
Figure 2. DNA polymerization is directional (reproduced with permission of Garland Science).

DNA polymerization is directional, going from 5' to 3' [Fig. 2]. The template strand, which directs the copying process, and the primer strand, which will be elongated, are shown here. A nucleotide triphosphate arrives, the 3'-hydroxyl end of the primer attacks the alpha phosphate of the dNTP, and a new base is incorporated opposite the template base. In this figure there is a very simple nucleotide binary code. The small boxes represent the pyrimidine bases, while the purine bases are represented by larger boxes. Whenever there is a small box on one strand, there will
be a large one on the opposite strand, due to steric complementarity. In reality of course, there is hydrogen-bonding between purines and pyrimidines, not just steric complementarity. However, as we shall see shortly, faithful copying of DNA relies on more than steric and hydrogen-bonding complementarity. On the right side of the figure is a general view of a polymerase, with a kind of canyon in which the different substrates bind.
Figure 3. The replication fork (reproduced with permission of Garland Science).
Since replication is unidirectional and the two strands must be replicated simultaneously, the cell uses a slightly different mechanism for each [Fig. 3]. There is no problem for 5'-3' synthesis, although in reality the replication machinery consists of more than just a polymerase molecule. A "clamp" is required to maintain the progression of the process, so that the polymerases do not continuously fall on and off the strand. A helicase is also needed to unwind the DNA. Single-strand DNA must be protected by single-strand binding proteins. However, in order to carry out synthesis in the 3'-to-5' direction, Okazaki primers (RNA fragments
synthesized by a primase) are also necessary. These RNA primers are later removed, and still another polymerase arrives to fill in the gaps between the extended Okazaki fragments. A ligase finishes off the job by joining the various DNA pieces. This is quite a lot of machinery, and there is even more in eukaryotic cells. One might say that studying only a polymerase is very reductionist, and that the structures and roles of all the players implicated in the replication process must be known in order to understand the whole picture. Nevertheless, certain bacterial polymerases acting by themselves in vitro are able to elongate a primer in a template-primer duplex, as Taq polymerase does in the PCR procedures carried out every day in molecular biology labs all over the world. One can therefore justifiably state that solving the structure of active Taq polymerase was by itself a major achievement in the pursuit of understanding the DNA replication machinery, especially when it was shown in 1998 by G. Waksman and colleagues to be active in the crystal state.
1. Classification of polymerases by sequence analysis

Polymerases are absolutely necessary in all living organisms, including viruses, prokaryotes, and eukaryotes. By the end of the 1980s, a growing number of polymerase sequences (essentially viral, since there were no large-scale sequencing projects at the time) were available in the databases. In trying to organize these sequences, we found that it was possible to classify DNA polymerases. This work, which was published by Olivier Poch, Noel Tordo, Dino Moras, Pat Argos, and me in 1990, still holds true, and has contributed to setting the stage in the polymerase field. What did this work show? We were able to identify a few strictly conserved residues, namely aspartates and glutamates, that were scattered along the sequence [Fig. 4]. The aspartates are not actually isolated, but part of a specific stretch of sequence, called a motif. For instance, hydrophobic residues are always present here in this motif, and here they surround this conserved aspartate. At the time, there was only one known polymerase structure, the Klenow fragment of E. coli pol I. Examined in the context of this 3-D structure, it becomes immediately apparent that, while situated at very different places in the sequence, the three different aspartates of motifs A and C are very close to each other in 3-dimensional space. Indeed, being the ligands of two
functionally important Mg++ ions, they identify the active site of the polymerase; however, this was not known at the time.
Figure 4. The pol I and pol α families.

These motifs were located using the so-called profile method, which had just been described by Gribskov et al. (1987). Once sequences are aligned, a position-specific mutation table is drafted, based on the multi-alignment. Using the profile, the sequence database is searched for hits above a certain threshold, such as three sigmas above the mean. If found, the new protein sequences are added to the original profile, a new profile is derived, and the analysis is run again. This process is iterated until no new sequence is detected to enrich the profile. At present, this is done more automatically, for instance using Psi-Blast, a program written by Eugene Koonin and colleagues at the NCBI, in the United States. It is interesting to look at a simplified representation of the structure of the Klenow fragment. The Klenow fragment of E. coli DNA pol I is really just that part of pol I from which the N-terminal domain, which exhibits 5'-3' exonuclease activity, has been removed (see Fig. 5).
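The iterate-until-convergence loop of the profile method described above can be caricatured in a few lines. This is our simplification, not Gribskov's actual scoring scheme: the toy "profile" merely scores 1 for any residue already seen at a column, sequences are ungapped and of equal length, and the threshold is mean plus n sigma over the database scores.

```python
from statistics import mean, stdev

def profile_score(profile, seq):
    """Best ungapped window score of `seq` against `profile`,
    a list of {residue: score} dicts, one per profile column."""
    w = len(profile)
    return max(sum(profile[j].get(seq[s + j], 0.0) for j in range(w))
               for s in range(len(seq) - w + 1))

def build_profile(members):
    """Trivial position-specific table: score 1 for any residue
    observed at that column among the member sequences."""
    length = len(next(iter(members)))
    return [{m[j]: 1.0 for m in members} for j in range(length)]

def iterate_profile(database, seeds, nsigma):
    """Iterative profile search: score the database, adopt sequences scoring
    above mean + nsigma * stdev, rebuild the profile from the enlarged set,
    and repeat until no new sequence is detected."""
    members = set(seeds)
    while True:
        profile = build_profile(members)
        scores = {s: profile_score(profile, s) for s in database}
        cutoff = mean(scores.values()) + nsigma * stdev(scores.values())
        new = {s for s, sc in scores.items() if sc > cutoff} - members
        if not new:
            return members
        members |= new
```

On a toy database, seeding with "DADA" first pulls in the close relative "DADG", and the enriched profile then converges without adopting the unrelated sequences, which is the behavior the iteration is designed to produce.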
Figure 5. The Klenow fragment of E. coli pol I. (The figure indicates the 3'-5' exonuclease and polymerase domains, with residue boundaries 324, 518, and 928.)
During the rest of my talk, when describing a polymerase, I will use the so-called hand metaphor, introduced by Tom Steitz in the mid-1980s, according to which a polymerase is modeled as a human hand (see Fig. 4, right), with a palm, thumb, and fingers. The motifs that constitute the active site are located on the surface of the palm domain. The hand, with the thumb containing the B motif, bears template specificity, and the fingers hold a grip on the DNA. The really active part of the polymerase is shown in red. The domain in magenta actually executes the 3'-5' exonuclease activity, which removes an occasional wrongly incorporated base. In the 1990 article we stated that there were two families that could be united under the fold of the Klenow fragment, namely pol A and pol B (sometimes called pol I and pol α, using the names of their most prominent members). Here are a few representative members of these two families: pol I and pol II from E. coli belong to the pol A and pol B families, respectively, which sequence analysis correctly predicted to share the same fold. Similarly, although the polymerases from the T-odd and T-even phages partition themselves between these two families, they are really the same, sharing a Klenow-like fold. In eukaryotes, both pol α and pol δ, the most important players in eukaryotic DNA replication, belong to the pol B family. New structures of various members of the pol I and pol α families have been solved during the past five years. The first pol α structure came out of Tom Steitz's lab around three years ago, and it turned out that sequence analysis had correctly predicted that pol α and pol I share the same fold [Fig. 6]. The active sites of the family A and family B enzymes are strikingly similar [Fig. 6], with the same positioning of the two divalent ions by the strictly conserved carboxylate residues of motifs A and C. Both families also include one helix (helix O in pol I) that is very important for template positioning and dNTP binding and that contains motif B, whereas motifs A and C contain strictly conserved aspartate residues that are absolutely crucial for magnesium binding and catalysis. It was recently shown by Tom Steitz's and other groups that the polymerases involved in DNA repair (the so-called pol Y family, which includes several members called pol eta, pol iota, and pol kappa) also share the same fold as pol I and pol α.
Figure 6. Pol A and pol B have the same architecture and active site.
Things get a bit more complicated now. In addition to pol A and pol B, here is another family, called pol X [Fig. 7].
Figure 7. Pol X structures: a new fold. (a) DNA polymerase β (palm domain, amino terminus, and carboxy-terminal domain labeled); (b) kanamycin nucleotidyltransferase.
In 1994, Huguette Pelletier, of Joe Kraut's lab in San Diego, solved the structure of rat DNA polymerase β, followed by that of the human enzyme, as a complex with a template-primer duplex. This is what the enzyme looks like. You can see that the hand metaphor holds true for pol β, but that its topology differs from that of pol I. A year later, Chris Sander and Liisa Holm found that the structure of pol β could be superimposed onto the structure of kanamycin nucleotidyltransferase. They were able to identify the crucial residues involved in catalysis. Again, these involved strictly conserved aspartate residues, and the active site has the same two-metal ion
mechanism. But I stress that its topology is totally different from that of pol I and pol α.

Question: Not being a biologist, I am really surprised that there is so much difference among polymerases, since they are basically replicating DNA. What is going on?

Response: Yes, but there is a great deal of regulation, especially in eukaryotes. Because of this regulation, the replication machinery in eukaryotes is much more complicated. Also, some DNA polymerases are not really involved in DNA replication, but rather in DNA repair. Although I have not spoken much about this, if a DNA polymerase encounters a defect in the DNA, such as a thymidine dimer linked by a cyclobutane ring, it falls off, since it cannot do the copying job properly. Another polymerase, one specialized in dealing with this kind of defect, comes to the rescue, takes on the job, then also falls off. Once the defect has been bypassed, the original polymerase returns. This is the role of the so-called "pol Y" polymerases. Actually, pol β is also involved in DNA repair, specifically in filling in gaps after the so-called base-excision process, which removes mistakes in DNA replication.

Question: So the basic problem is not so much understanding the differences in the functions of polymerases, but differences in the regulation of the polymerases. Is that the right way to express it?

Response: No; the basic problem is not just to understand the regulation, but also to understand the basis of the differences among polymerases, which display wide variety. Sequence analysis is one way of appreciating these differences, but it is sometimes misleading. For me (and others), some polymerases are amazingly similar, in spite of differences in their sequences, and so on. That is what I want to make clear - but we can discuss this later. First let me finish describing the various polymerase families detected by sequence analysis.
The pol C family includes all bacterial pol III polymerases, which are actually the most processive bacterial polymerases. They have two motifs that also contain strictly conserved aspartate residues, which were mutated by McHenry's lab in 1999. The inactivation pattern of the aspartic acids strongly recalls what is observed in pol β. I think it may be postulated that pol C and pol X are actually the same. So we now have only two folds; one containing pol A, pol B, and the pol Y family, and
the other containing pol C and pol X. Therefore, there are only two structural families.

Question: If you look only at the primary sequences of various polymerases, is it possible to put them all into one big family? Do they all have some similarity?

Response: No. There is no single family that contains all the polymerases. When they do share some sequence similarity, it is very loose and difficult to spot using normal sequence alignment programs.

Question: So basically, what you see is that the similarity is in the tertiary structure?
Response: Yes. Once the tertiary structure of at least one member of each family was known, everything became clear.

Question: So they are probably not even evolutionarily related?
Response: That depends; some of them are. All polymerases in this class (pol A and pol B) are evolutionarily related; they all derive from a common ancestor. But these two families (pol B and pol X) are clearly different.

Question: If, when comparing different classes, you find they are not evolutionarily related, is it just some kind of convergent evolution?

Response: There is convergent evolution between these two classes (pol A and pol C) and divergent evolution within these two classes (pol A and pol B on the one hand, and pol C and pol X on the other), because the active sites are different manifestations of the same general chemical strategy. Another family has just been solved for a multi-subunit RNA polymerase by the teams of Roger Kornberg at Stanford and Seth Darst at Rockefeller. Multi-subunit RNA polymerases have a new architecture, but mono-subunit polymerases derived from phages actually do have the pol I-type fold, as do all RNA-dependent RNA polymerases, which we predicted earlier using sequence analysis arguments. This says a great deal, since it implies that all RNA viruses, including retroviruses, have the same kind of polymerase catalytic core [Fig. 8].
The following are the key concepts that have been identified in polymerases. They are multifunctional enzymes; the structural counterpart is that of a multidomain protein whose architecture may be loosely described as being like that of a right hand. They are an ancient family in which the two classes of DNA polymerases have a common two-metal-ion mechanism. (We will return to this point later.) Fidelity, which is the inverse of the error (mutation) rate, seems linked on the structural level to the existence of both open and closed forms, which have been observed for both pol β and pol I. Processivity is probably induced by auxiliary proteins and/or separate domains, which are different from the catalytic domain. Translocation of the primer strand once it has been elongated is still difficult to understand from a structural point of view. (I will treat this later, when discussing electrostatics, in the last part of my talk.)
There are at least two large DNA polymerase families; RNA-dependent DNA polymerases constitute a separate group.
Figure 8. Nomenclature of DNA polymerases.

Two magnesium ions that are beautifully hexa-coordinated in an octahedral manner have been identified in the active sites of high-resolution structures [Fig. 9]. Shown here are the 3'OH of the primer, as well as the incoming nucleotide with the α phosphate on one hand, and the leaving group, made of the β and γ phosphates, on
the other. Generally speaking, one of the magnesium ions assists in the departure of the leaving group and the other, which is coordinated by one of the strictly conserved aspartate residues, activates the 3'OH of the primer to attack the α phosphate. This is true for both pol β and pol I, although their topologies are different. This, then, is a case of convergent evolution.
The distance between the two cations is just under 4.0 Å, and there are three crucial aspartate residues: one cation activates the 3'OH of the primer, the other assists in PPi departure. The two-metal-ion mechanism is universal in polymerases.
Figure 9. The two-metal ion mechanism.
Fidelity in the DNA replication process has been studied by many research groups over the years. I will adopt a structural point of view here. (Please forgive me if I do not provide all the details.) Early kinetic studies indicated that the first step in the binding of the dNTP is, in some bizarre way, template-independent; it does not depend on the templating base. Then there is the open/closed conformational transition, which has been tentatively described as the slowest step in the reaction. The structures of pol I and pol β have been solved in both the closed and open forms, often in the same crystal. The transition is seen to occur only when the correct base-pair is formed. It is believed that the actual checking of base-pair complementarity is carried out by means of an induced-fit mechanism, which occurs only when binding the correct dNTP. In other words, if a polymerase is complexed to a template-primer duplex and there is dNTP in the solution entering the active site by diffusion, two things could happen: If it is the right one, transition to the closed
state will occur and the reaction will proceed. If it is not the right dNTP, the transition will not occur, and that dNTP will eventually come out and be replaced by another dNTP, again by diffusion. This is basically how people think about fidelity in the replication process. Indeed, there must be something other than just base-pair complementarity, since the associated energetics cannot by itself explain the mis-incorporation rates of DNA polymerases. However, there are mechanisms at work other than induced fit, such as the 3'-5' exonuclease activity to which I briefly alluded earlier. To summarize the common themes of DNA polymerases identified by various researchers over the years: There are essentially two types of topology, Klenow-like and pol β-like. They have the same general morphology, with palm and finger domains, although the topology of the palm domain is different. Their sequence motifs differ: in the Klenow fragment, it is A and C, where A has one aspartate and C two different aspartates; in pol β it is the converse. The two-metal-ion mechanism seems to hold for pol I and pol β, and is also valid for ribozymes, so it is probably a mechanism that appeared very early in evolution. The open/closed transition is known to occur in both pol I and pol β, and both processive and non-processive enzymes exist in both classes.
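The claim that base-pairing energetics alone cannot account for observed fidelity can be made concrete with a back-of-the-envelope calculation. The sketch below (in Python; the numerical values are my own illustrative assumptions, not figures from the talk) uses the Boltzmann factor of a typical mismatch destabilization and shows that thermodynamics alone predicts error rates orders of magnitude higher than real polymerases achieve, so an induced-fit checkpoint must contribute an additional selectivity factor:

```python
import math

def misincorporation_from_energetics(ddG_kcal, temp_K=310.0):
    """Error rate predicted if the free-energy penalty of a mismatch
    (ddG, in kcal/mol, relative to the correct base pair) were the only
    source of selection for the incoming dNTP."""
    RT = 0.001987 * temp_K  # gas constant (kcal/mol/K) times temperature
    return math.exp(-ddG_kcal / RT)

# Illustrative assumption: a terminal mismatch is commonly quoted as
# being on the order of 1-3 kcal/mol less stable than a correct pair.
thermo_only = misincorporation_from_energetics(2.0)

# Hypothetical extra discrimination from the open-to-closed induced-fit
# checkpoint, which proceeds efficiently only for the correct dNTP.
checkpoint_selectivity = 1.0e-3
with_induced_fit = thermo_only * checkpoint_selectivity

print(f"thermodynamics alone : {thermo_only:.1e}")   # ~4e-2
print(f"with induced fit     : {with_induced_fit:.1e}")
```

Even with generous assumptions, base-pairing alone gives roughly one error per few tens of incorporations, whereas replicative polymerases operate closer to one error in 10^5-10^6 before proofreading.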
2. Structure-function relationship in a template-independent DNA polymerase

The following is a more detailed discussion of the structure/function relationship in a particular DNA polymerase whose structure was recently solved in our laboratory. This is a peculiar case in the DNA polymerase family in that it only elongates a primer and does not really "care" about a template. This work was carried out in collaboration with Jean-Baptiste Boule and Catherine Papanicolaou, in Francois Rougeon's lab at the Institut Pasteur in Paris, where sufficient quantities of the protein were produced and purified to allow the growth of crystals. It belongs to the family of nucleotidyl transferases, and in vivo it is implicated in the generation of the N regions of the V(D)J junctions in immunoglobulin genes, which I will discuss in the next slide, to provide you with some basic definitions. In this sense, it participates in the generation of immune-response diversity. It is a non-template-directed process that can incorporate a variety of nucleotides, essentially according to their relative concentrations in solution, and the incorporation of different dNTPs may be modulated by different metal ions.
Figure 10. V(D)J recombination and antibody structure.

The following is a very quick review of antibody structure. The general shape of antibodies is shown in the lower left. Antibodies consist of a heavy chain and a light chain. They may be described three-dimensionally as constituting the fold shown here, with essentially two superimposed (sandwiched) beta sheets. The action really takes place through these three loops, known as CDR1, CDR2, and CDR3. The loops display large sequence divergence, because the antibody fold potentially must adapt and bind to many different possible antigens. Because the number of genes that can be coded on the DNA is limited, the cell has devised a way to generate a large diversity of response, essentially by performing combinatorics on different copies of the same genes. Shown here are various copies of the V, D, and J genes and a single copy of the C gene. According to the antigen encountered, the cell chooses one each of the V, D, and J domain genes, while the C gene remains constant. During a so-called somatic recombination event, a random
number of nucleotides is added between the V and D domains and between the D and J domains, generating the so-called N regions, which are of variable length. More diversity is generated in this way, especially in the CDR3 loop. The physiological role of terminal deoxynucleotidyl transferase, or TdT, the protein about which I am now going to speak, is to create these N regions. TdT exists only in vertebrates and is very abundant in calf thymus cells, which made it one of the first eukaryotic polymerases ever to be isolated and purified. In our lab, TdT was crystallized as very thin plates, which are difficult to manipulate [Fig. 11]. Synchrotron radiation is required to yield a decent diffraction pattern. We collected the native data set at 2.3 Å resolution, and the structure was solved by the heavy-atom method. The following is a description of the native structure of TdT, as well as of two binary complexes, one with the primer strand and one with the incoming dNTP.
Figure 11. X-ray structure of a murine TdT.
TdT contains an N-terminal domain called BRCT, the structure of which is known and which appears in many different proteins involved in DNA repair. We had to remove this domain in order to obtain crystals, but the shorter construct remains active in solution. It is also active in the crystal state, which means that if you soak the crystals in the presence of a primer and dNTPs, then stop the reaction by adding EDTA the following morning, and analyze the results by running a gel, you find that the primer has been elongated. Somehow the primer and dNTPs are able to diffuse in the crystal. The C and A motifs are shown in red in this schematic cartoon, which shows the TdT sequence aligned with the one coding for pol β. Here, at the N-terminus, one part (shown in cyan) is disordered in the crystal, but there is an additional part (shown in magenta) that is very important. When I say additional, I mean when you compare TdT to pol β, whose structure is known. This additional N-terminal peptide sort of fills in a gap in the structure; we will make a case for that in the next slide. But you can already see here that it sort of seals the thumb and finger domains together and causes the enzyme to adopt the closed conformation, in such a way that in solution it is probably always locked in the closed conformation. This is not so surprising considering its function, but I shall return to that later. In addition, you can see one monovalent metal ion attached to the protein through a known structural motif called HhH, which appears in several non-specific DNA-binding proteins. The catalytic site is located near the strictly conserved aspartates. One magnesium ion is coordinated by these conserved aspartates, even in the native enzyme; i.e., in the absence of substrate. The region thought to be involved in the binding of the dNTPs is shown in magenta. What does it mean to conduct a structure/function relationship discussion?
The rule of the game is to use the multi-alignment of all TdT sequences found in the database, along with closely related human polymerases such as pol μ. One need only look at the blocks of conserved sequences and ask why they are conserved. By examining the structure, one can in fact more or less explain all these blocks of conserved sequences. The N- and C-termini interact with each other to make the ring-like structure. One block of conserved residues defines the so-called HhH motif, which binds the monovalent sodium ion. The two longest blocks of conserved sequences interact with each other to build the active site. They contain one residue in an unusual cis-peptide bond, which is also conserved in polymerase β, plus a strictly conserved arginine, whose side-chain stabilizes this cis-peptide bond. The presence of this arginine is necessary, since if you mutate it you lose the function. There is also a long loop, to which I shall return later, that is a special feature of TdT in the sense that it is absent in polymerase β.
To make the case for the sealing of the closed form clearer, I have superimposed the C-alpha traces of TdT in red [Fig. 12] with those of the closed form of pol β in blue and the open form in green. It thus immediately becomes apparent that TdT adopts the closed form. I have also shown the molecular surface of pol β plus the extra TdT N-terminal residues, and you can see that they perfectly match the canyon-like structure present on the surface, thus sealing the structure in the closed form by providing additional van der Waals energy.
Figure 12. Comparison of TdT (red) with pol β in the open (green) and closed (blue) forms.
Why is this the case? Think about what I mentioned earlier concerning the conformational change from the open to the closed form, which is involved in replication fidelity. But TdT is a totally "unfaithful" protein. Indeed, since it does not copy a template, it does not have to rely on this mechanism; there is no template-base to copy, so there is no need for alternation between two forms.
We collected data on protein crystals soaked in the presence of a DNA primer strand [Fig. 13, left]. In this case we used brominated DNA, because bromine atoms have a good anomalous signal, shown here in green. The bases are somewhat disordered, but the isomorphous map reveals very clear peaks for at least three phosphates. The DNA is in the B-DNA conformation. In the final map, we could see that the incoming site was also occupied, so this is more like an enzyme-product complex than the expected enzyme-substrate complex. At the bottom left I have again superimposed the C-alpha traces of TdT and the closed form of pol β with its primer strand, in the presence of the electron density of the TdT binary complex. If the figure were enlarged a bit, you would be able to see that the phosphates are almost inside the density, meaning that the binding of the primer strand is almost identical in the two proteins.
Figure 13, left. Binary complex of TdT with the primer strand.
On the bottom right is the sodium ion I mentioned earlier, held by the HhH motif. It is held in place by various carbonyl groups of the protein main-chain and bound in an octahedral fashion. One of the ligands is a water in the absence of DNA, and this water is replaced by one phosphate oxygen of the primer strand at position P2. The other phosphates, at positions P1 and P3, are stabilized through ion-pairing with an arginine or a lysine side-chain. We have also solved the crystal structure of the enzyme binary complex in the presence of the incoming nucleotide and divalent cobalt ions [Fig. 13, right]. In this case, the dideoxynucleotide ddATP was used. At the top left, the anomalous density of cobalt clearly shows the density for not just one, but two metal ions. The final map at the top right is shown in blue, along with the conserved aspartates, which coordinate both cobalt atoms.
Figure 13, right. Binary complex of TdT with the incoming dNTP.
How the adenine ring is stabilized is actually quite interesting: There is one tryptophan partly stacked with the adenine ring. More impressive is the lysine 403 side-chain, with its NH2 group pointing directly above the adenine ring in a typical cation-π interaction. Again, I have superimposed both TdT and pol β in order to show that the binding sites of the incoming nucleotide are really very similar in the two proteins. The dNTP is shown at the bottom right, in the presence of the two strictly conserved regions in the multi-alignment mentioned earlier, which are colored magenta. The effect of varying the metal ions is interesting. This enzyme has been known since the 1960s, following the work of F. Bollum and his group at the University of Kentucky. In fact, TdT was one of the first eukaryotic proteins with polymerase activity to be purified from eukaryotic cells, and for a while it was thought to be a true polymerase. But it was soon realized that TdT merely elongates a primer and does not care about a template. Many in vitro physical chemistry studies of this protein have shown that four metal ions, Mg++, Mn++, Zn++, and Co++, all work, although their optimal concentrations are very different. Of the four, manganese works least well. Cobalt works very well for pyrimidine incorporation. It actually works better than magnesium, though one can argue that cobalt is not really physiological. Zinc works by itself, albeit only at sub-millimolar concentrations, because if you use millimolar concentrations the enzyme dies; it has to be lower than that. It is also known that in the presence of micromolar quantities of zinc, magnesium works much better than magnesium alone; therefore there must be an additional site for a Zn atom.
It is generally thought that in the presence of cobalt ions, the enzyme works in the following way: One of the cobalts takes the place of the additional zinc, and the other two play the role of the two magnesiums in the two-metal-ion mechanism common to all polymerases. This all points to the possibility of an additional divalent ion, in addition to the two catalytic ones I have described. We have just completed some preliminary laboratory work on this topic and obtained evidence for the binding, under certain conditions, of another divalent metal. Finally, let us look at the surface of the protein, with all Loop1 residues colored differently from the rest of the surface [Fig. 14]. If we build the strand that is complementary to the primer in a classical B-DNA conformation, it is clear that this strand, i.e., the template strand, cannot be accommodated because of Loop1, which is itself held in place by three different intramolecular hydrogen bonds involving strictly conserved residues. So a very good structural reason explains why the template strand is excluded from the catalytic center, and we can begin to formulate such questions as: How can this information be used to redesign a mutated TdT that
would then accept a template strand? And at which point in the evolution of the nucleotidyltransferase family was this feature (Loop1) added?
Figure 14. The case of Loop1 and non-accommodation of the template strand. Loop1 (magenta) severely hinders the binding of a template strand (blue).
That is more or less what I wanted to say about TdT. In summary, there is a possible structural explanation for the influence of various divalent ions on the specificity of nucleotide incorporation. Also, the presence of Loop1 and its stabilization seem to be enough to explain why the template strand is excluded from the active site. The enzyme is active in the crystal state and very closely related to the closed form of pol β. We think there is no transition between the open and
closed forms. There are other arguments for this, such as the fact that an ion-pair that stabilizes the open form in pol β is not conserved in TdT, in addition to the N-terminal extension of TdT compared to pol β, which seals the enzyme in the closed form.

Question: Considering that the enzyme is active in the crystalline state, have you tried to make a crystal structure with the primer and dNTP while you presumed it to be doing something?

Response: Yes. We have tried that several times, using different oligonucleotides, with different ions, etc., to get the ternary complex. It turns out not to be so easy, especially if you want to crystallize in the presence of cobalt, because you need DTT in solution in order to get crystals. However, DTT and cobalt make some strange mixtures (they precipitate), and zinc plus DTT is also very bad. We first have to make the crystals in magnesium and DTT, then transfer them into a solution that does not contain magnesium, but some other divalent ions and no DTT at all. Basically, what we found is that when we set up co-crystallization drops of the ternary complex we get very poor crystals, so we have to resort to soaking experiments.

Question: Actually, my question concerned whether you really expect a different structure. Since it is dynamic, it is doing something; do you really expect to see some atoms in particular states while you know that it is actually working, i.e., moving?

Response: I'm not sure I understand your question.

Question: Perhaps I don't either. The thing is, by doing crystal structure, you look at structure. But on the other hand, when the enzyme is active, is it not moving?

Response: To get crystal structures of the ternary complex, such as the enzyme plus the primer strand plus the incoming dNTP, we want all the actors in place, but we want the reaction not to occur. There is a very simple way to do this: arrange it by using a primer that has no 3'OH.
Question: Yes, so under these conditions, you arrange it in such a way that nothing moves. But theoretically, in this case, the enzyme is apparently moving, or at least doing something, while it is crystallized. If you try to determine the crystal structure then, what do you expect?
Response: Actually, atoms are not completely immobile in a crystal structure; some are more mobile than others, as described by the so-called temperature factors (the B factors). There are also large water channels through which the substrate can diffuse, which explains why in some cases the crystalline enzyme can accommodate different substrates. But the reaction will not proceed when using ddNTP instead of dNTP.

Question: What do you expect if you add the dNTPs? Do you expect to see some difference between a crystal that is doing nothing and a crystal that is doing something?

Response: If you have a large conformational change, you may or may not be able to see it in the same crystal form; that depends on whether it can be accommodated by the crystal contacts. But if you want to see the reaction take place in the crystal, you have to arrange it so that all the molecules start to react at the same time, otherwise the image will be blurred.

Comment: One generally wants to have a highly synchronized situation, so that the reaction starts simultaneously for all the molecules in the crystal, because if you have 10^12 molecules and you want to see something in a time-resolved fashion, it is certainly necessary to trigger that reaction. One way of doing this is to have caged molecules that are soaked into the crystal and can be photolysed, so that all the molecules can access the enzyme simultaneously and you can record a very rapid diffraction data pattern that may capture simultaneous movement.

Comment: But in the absence of synchronization, you might expect that certain atoms simply would not be resolved very well, and it could be that these atoms are in fact involved in the action; it might be useful. If you see some very highly resolved parts of the protein not moving and others that are very blurry, it might mean that in fact these atoms are involved in the activity and have some dynamical interpretation.
Response: Yes, you are right, but it would not be right to say that the least mobile atoms are those that will not be involved in catalysis; that is not true. Most of the active site is very strongly held. There is a pattern of hydrogen-bond networks that very rigidly maintain a specific geometry, so as to stabilize the transition state of the reaction.
Comment: No, I am not saying that; it might just be some conformational change. The whole molecule can breathe, and so on. But it just might be useful to look at it and try to determine whether there are some differences. You might try to interpret them in a way that has something to do with the activity.

Response: What I can tell you is that for Taq pol I, there is an amazing story concerning the crystals. Only one crystal of the protein was available in the presence of the primer-template duplex, and the structure was solved using that very crystal, which turned out to be in the open form. The crystal was actually frozen to collect the data. Once all the necessary diffraction data had been collected, the crystal was thawed and put into a drop containing dNTP. The same crystal was refrozen and data were collected on it again. The enzyme was seen to have gone into the closed conformation. It was the same crystal, but there was a major rearrangement of the protein. However, it remained in the crystal state in the same space group. All the packing interactions were somehow preserved in the crystal.

Question: You made the comment that all polymerases are the result of divergent evolution, with the exception of one family, which was the result of convergent evolution. What makes you believe that these are two different scenarios?

Response: Simply because the topologies are completely different. The topologies of pol β and pol α (or pol I, if you like) are different.

Question: Are there no known evolutionary intermediates?

Response: None of which I am aware.

Question: Do these terminal transferases have any preferences in terms of in vivo nucleotide incorporation under physiological conditions?

Response: Not in vivo. I think it is a more or less random process. It depends on the concentration of the various nucleotides. But as far as I am aware, it is thought to be a random process in vivo; random selection.
Question: There is one point about which I am a little confused; maybe it goes back to the issue of the movement of a protein. In many polymerases, there is this idea that the finger domain is sort of opening or closing. When you said that TdT is
in the closed conformation, is this because you think it always stays in this conformation, or because this happens to be the conformation in your crystal?

Response: That's a good question. What I can say for sure is that TdT is in a closed form in this crystal. I cannot formally exclude that in some other crystal form, under other experimental conditions, it will be found in the open form. However, there are at least two structural features that point to the fact that the closed form is the only one that exists in solution. One is the extra N-terminus that seals the structure, and the other is the non-conservation in TdT of an arginine residue that is known to stabilize the open form in pol β through an ion-pair. It also makes sense on the functional level, because we do not need this transition between open and closed forms to check complementarity with the template base, simply because there is no template.
3. Normal mode analysis of the open-to-closed transition

I now want to go on to the third part of my talk. I have told you a lot about this open-to-closed transition. We would like to know whether there is anything in the initial structure that drives this transition; i.e., that "tells" the structure it has to go from the open to the closed state. One way to do this is to use molecular dynamics to simulate the dynamics of the protein. You all know that this involves the resolution of Newton's equation for every atom. The problem is that the energy function is not really well known. You also have to take the solvent molecules into account. If you start to add explicit solvent molecules, it amounts to a huge system of around 10,000 atoms and three times as many equations to solve. Also, since this second-order equation can only be integrated with very small time-steps, you need a lot of computer time to carry out a simulation, which typically only goes up to between 10 and 100 nanoseconds. You do not really expect the protein to undergo large-amplitude movement on this time-scale. One way around this (and it involves some approximations) is to use so-called normal mode analysis, pioneered by Nobuhiro Go, which states that if the potential is harmonic you can solve the equations of motion analytically, calculate all the normal modes, and the movement can actually be simulated forever by a superposition of these normal modes. The interesting thing is that the low-frequency modes are going to tell you all about the large-amplitude movements, because of the theorem of energy equipartition. They will also tell you about
292
M. Delarue
correlated motions. In order to get these low-frequency modes you still have to diagonalize a huge matrix, and that still takes a lot of CPU time.

Question: How can you assume it to be harmonic?

Response: Even if you are in a local minimum, you can assume it to be harmonic, at least locally. I reckon this is a bit of an approximation. Actually, I am going to make a much bigger approximation in a second.

Comment: It will not move much more than several Ångströms.

Comment: If the Hessian is not singular at the minimum, you can always have this kind of harmonic behavior for perturbations that are small enough.

Response: Anyway, just let me proceed. I'll tell you what I did, show you some kind of validation and the interpretation, and you can decide for yourself whether it makes sense and whether or not it is a crazy approximation. What is the potential energy used in this method, which was actually pioneered by Monique Tirion? Here is the structure of one particular protein; I think it is retinol-binding protein [Fig. 15]. This is a Cα trace representation of the molecule. You set a threshold at, let's say, 10 Å, and every residue in the structure is linked by a kind of spring to all the neighboring Cα atoms that are closer than the threshold. Each atom is obviously linked to both its predecessor and successor along the covalent chain, but also to its neighbors, in a kind of off-lattice network. Writing the equilibrium condition, which states that there should be no net force exerted on each atom, you end up with a so-called Kirchhoff matrix. You must diagonalize this Kirchhoff matrix to obtain the normal modes. From that you can get the mean-square value of the displacement for each atom of the protein. Now comes the validation step of the model.
As I said earlier, when you solve a crystal structure, you obtain not only the positions of the atoms, namely their x, y, and z coordinates, but also the so-called temperature factor, which describes the average movement of each atom. This B-value is related to the mean-square displacement of each atom. You can plot the experimental B-values for each residue as well as the displacement predicted by Monique Tirion's elastic model. I did it for various polymerase structures, using a program written by Yves-Henri Sanejouand, and obtained very good correlation factors, on the order of 0.6 ± 0.2. The
Structure-Function Relationships in Polymerases
following is what the model tells you; I think it compares favorably with the experiments.
Figure 15. The elastic network model.
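The elastic-network recipe just described (springs between all Cα pairs within a cutoff, a Kirchhoff matrix, mean-square displacements from its inverse) can be sketched in a few lines of NumPy. The helical toy coordinates, the 10 Å cutoff, and the unit spring constant below are illustrative assumptions, not values from the talk; B-factors are proportional to the mean-square displacements via B = (8π²/3)⟨Δr²⟩.

```python
import numpy as np

def gnm_msf(coords, cutoff=10.0):
    """Mean-square fluctuations from the Gaussian network model.

    Kirchhoff matrix: -1 for every residue pair closer than the cutoff,
    diagonal = number of contacts; mean-square displacements are
    proportional to the diagonal of its pseudo-inverse (the zero mode,
    overall translation, is discarded by the pseudo-inverse).
    """
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    gamma = -(dist < cutoff).astype(float)
    np.fill_diagonal(gamma, 0.0)
    np.fill_diagonal(gamma, -gamma.sum(axis=1))
    return np.diag(np.linalg.pinv(gamma))    # proportional to <dr_i^2>

# Toy chain: a helix-like curve standing in for a real C-alpha trace.
t = np.linspace(0.0, 12.0 * np.pi, 60)
coords = np.stack([5.0 * np.cos(t), 5.0 * np.sin(t), 1.5 * t / np.pi], axis=1)

msf = gnm_msf(coords)
b_predicted = (8.0 * np.pi**2 / 3.0) * msf   # predicted B-factors, up to scale
```

As in the talk, the predicted profile can then be correlated residue-by-residue against the experimental B-values.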
If you treat each atom as a particle you still have to diagonalize a huge matrix. Monique Tirion, followed by Ivet Bahar and K. Hinsen, came up with the idea that it might be possible to simplify the system by taking one amino-acid to be a superatom, or particle. In such a case the matrix that has to be diagonalized is much smaller. Some people have gone even further; for instance, Yves-Henri Sanejouand (Bordeaux) has shown that you also get very good results in large proteins if you take one particle for two, or five, or even ten residues, in order to study the normal-mode dynamics of very, very large molecules. They have shown their results to be strictly comparable to the usual normal-mode analysis with single atoms as single particles. You are then in a position to apply this method to very large proteins and macromolecular assemblies. For each lowest-frequency normal mode [Fig. 16], it is possible to calculate the set of displacements (eigenvectors) for each atom, as predicted by theory. Then you can compare this set of vectors to the set of difference vectors between the open and closed forms and project each vector set onto the other, yielding a generalized correlation coefficient, or mean cosine, between the two sets. For some low-frequency modes, this cosine is amazingly high: above 0.7 for both pol I and pol α. The other modes contribute in a decreasing manner. It is obvious in this case that the first one or two modes entirely describe the open-to-closed conformational change. This seems to be general for polymerases of different classes.
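The projection just described, the "mean cosine" between a mode and the open-to-closed difference vector, is a one-line computation. The vectors below are random stand-ins for real eigenvectors and crystallographic difference coordinates, purely to illustrate the idea.

```python
import numpy as np

def overlap(mode, diff):
    """Normalized projection (cosine) between a normal-mode eigenvector
    and the open-to-closed difference vector, both of length 3N."""
    return abs(mode @ diff) / (np.linalg.norm(mode) * np.linalg.norm(diff))

rng = np.random.default_rng(0)
n = 300                                      # 3N coordinates for a toy protein

diff = rng.normal(size=n)                    # stand-in open -> closed displacement
mode_low = diff + 0.3 * rng.normal(size=n)   # a mode nearly parallel to it
mode_other = rng.normal(size=n)              # an unrelated mode

print(overlap(mode_low, diff))               # close to 1
print(overlap(mode_other, diff))             # small: random directions in high
                                             # dimension are nearly orthogonal
```

A high overlap for one or two low-frequency modes is what signals that those modes describe the conformational change.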
Figure 16. Correlation coefficient between the difference vector set and the predicted displacement vector set (pol α).
Question: How could that be? You have to go through an energy barrier to go from one state to the other; it cannot go by itself.

Response: Yes, but there is a tendency that anticipates movement in the direction of the transition state. Or, if you like, the open-to-closed transition occurs in solution, and only when the dNTP is bound does the closed form become more stable.

Question: These are not linear things; it is very far from any linear motion. It would just go in one direction, precisely because of the reality of the transition otherwise. How can it work?

Response: I see it as a kind of tendency. There exists some tendency, there are some privileged collective movements in this connected network of particles, and
somehow these movements do correspond to the open-to-closed conformational transition.
Comment: One question is whether there are any dynamics at all. When you diagonalize the Hessian matrix and look at the eigenvalues, you can either say that you are solving a dynamical system or that you are looking at the quadratic part of a potential energy and just seeking nearby low-energy configurations. I suspect that the second interpretation is probably why it works so well, although it is very surprising.

Comment: Exactly; the quadratic approximation is good for infinitesimally small motions.

Response: So why does it work for large-amplitude movements?

Comment: Maybe it was accidental.

Response: It was not accidental. Yves-Henri Sanejouand has tested this method for several open/closed transitions (not in polymerases) in many different proteins, and it seems to work very well in all cases.

Comment: I think you can say that you have a certain conformation in which there is a certain low-frequency mode. It would be interesting to take the other conformation, let's say the open form, which is certainly also a local minimum, analyze it, and apply the same kind of normal-mode analysis to that other conformation, in order to see whether there is also such a low-frequency mode.
Response: Of course; I've also done that, and it works well too, although normal modes calculated from the open form are always slightly better than the ones calculated from the closed form.

Comment: Still, you cannot say anything about the transition state; whether or not it goes in a similar direction, so that the coordinates match. Just talking about a single minimum conformation, I do not see how you can guess where the other one is and whether or not this mode goes over into the second minimum.

Response: You're right. The method is not really predictive; it's more like a post-mortem analysis, at least at present.
Comment [Nobuhiro Go]: I think this is just the type of question I can answer, since I have been working on this problem for around twenty years. Certainly, normal-mode analysis is a quadratic approximation around the minimum, and should reflect motion of very small amplitude. Within the potential energy surface in the range of thermal fluctuation there are many, many minima. The quadratic approximation should not hold for states in which the protein is present under physiological conditions. The naive idea of normal-mode analysis should not be valid, but a very interesting phenomenon lies behind it. From normal-mode analysis, we can calculate the second moment of the fluctuations by invoking the Hessian, namely the second-derivative matrix. But we can also calculate the second moment from molecular dynamics simulation, which faithfully traces the effect of the anharmonicity. So we can calculate the second moment matrix by two methods: normal-mode analysis, which is based on an assumption of the quadratic nature of the energy surface, and molecular dynamics simulation. Comparing these methods, we find, interestingly, that they agree very well. To answer one of the earlier questions: we can identify many different minima computationally and calculate the Hessian matrix at each of them. Very interestingly, the directions of the eigenvectors corresponding to very low frequencies are very similar, independent of the minimum. This nature is reflected in the second moment calculated by molecular dynamics simulation. Even though there are many minima, they have quite similar surfaces, including the low-curvature directions. It also looks like this very large number of local minima is located within a very low-dimensional space, corresponding to the low-frequency directions.
In the case of a protein, where the typical number of atoms is a few thousand, perhaps ten thousand, the dimension of the conformational space may be about one hundred thousand. Within it there are very specific low-dimensional subspaces, corresponding to the low-frequency directions, in which the amplitude is high and a very large number of local minima are distributed. The dimension is very small; something around thirty, not just two, as you said. The dynamics of the protein occur in a very low-dimensional space, compared with the whole number of degrees of freedom in the system.

Question: I have a question for the previous commentator. Do you think it is a property of native proteins, or of any proteins chosen at random, or do you think this has something to do with evolution?
Response [Prof. Go]: I think it is a universal property; not just any protein, but even a small cluster of atoms, such as an argon cluster, has a similar property.

Comment: There is still the problem of the activation barrier. My guess is that it would be an entropic barrier.

Comment: I have a comment and/or conjecture in that regard: if you have a number of protein subcomponents and the large motions are dominated essentially by the rigid motions of two parts, I think that would explain the constancy of the direction of the eigenvectors, because then you would have very low-dimensional degrees of freedom, which are basically rigid-body motions of one cluster of atoms against another.

Comment by Prof. Go: That is a simplified description of what I just said. There are relatively few degrees of freedom of collective movement against a huge number of uncorrelated degrees of freedom.

Comment: So in this case, the degrees of freedom involve domains like the palm, thumb, and finger domains.
4. Electrostatics and translocation

I would like to resume, and go to the fourth part of my talk, which has to do with electrostatics. I want to make the case that electrostatics can tell us something about the translocation step. Just to remind you, translocation occurs after elongation has taken place: the template strand has to move one base further in order for copying to proceed. Electrostatics is present everywhere in proteins, which I will briefly review, with an elementary treatment of cases in which it obviously applies. One of those cases is the generation of secondary structure, because the peptide dipole holds partial charges on the carbonyl and -NH groups, both of which point in the same direction. In the alpha-helix, it is obvious that the peptide dipoles align themselves in a favorable way, because the minus part of one dipole interacts with the plus part of the next one. If you like, you can think of secondary structure as the way nature deals with the partial charges of the peptide bond. Indeed, it is the same for beta-strands, in which the dipoles are arranged in an anti-parallel manner for both anti-parallel and parallel beta-sheets. A beta-sheet has no net macro-dipole, whereas
the alpha-helix has a net macro-dipole that is also optimized, at least in some very simple all-alpha-helical proteins, such as the so-called 4-helix-bundle fold, in which four alpha-helices are arranged in an anti-parallel manner. So the macro-dipoles are obviously arranged in a very favorable pattern. People treat electrostatic interactions, which have to do not only with the partial charges of the peptide dipole, but also with the full charges of side-chains, such as aspartate, glutamate, lysine, arginine, and perhaps histidine, as well as the free ions present in solution, by means of the celebrated Debye-Hückel theory. This amounts to solving the Poisson-Boltzmann equation [Fig. 17].
∇ · [ε(r)∇φ(r)] − κ′² sinh[φ(r)] = −4πρ(r)

κ′² is related to the Debye-Hückel inverse length, κ, by κ′² = εκ², with

κ² = 8πN_A e²I / (1000 ε kT),  ε ≈ 80 for the solvent.

Figure 17. The Poisson-Boltzmann equation.
This second-order differential equation is usually solved by relaxation methods on a grid, using a linear approximation. The difficulty comes from the value of epsilon that is assigned to the dielectric constant of the interior. The protein is thought to have a low dielectric, but its precise value is still a matter of debate. Anyway, if you suppose the dielectric constant to be constant throughout the protein interior and to lie between 2 and 4, you can calculate the electric potential everywhere and start
looking at active sites. Before going into polymerase active sites, I want to make the case for magnesium and phosphate groups in active sites.
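The relaxation-on-a-grid approach mentioned above can be illustrated with a minimal Jacobi sketch of the linearized equation. A uniform dielectric and periodic boundaries are big simplifications of convenience here; real solvers vary ε(r) across the protein-solvent boundary and impose proper boundary conditions, so this only shows the numerical idea.

```python
import numpy as np

def relax_pb(rho, kappa2, h=1.0, n_iter=2000):
    """Jacobi relaxation for the linearized Poisson-Boltzmann equation
    on a cubic grid with a uniform dielectric and periodic boundaries.

    Discretized update at each grid point:
        phi = (sum of 6 neighbors + 4*pi*h^2*rho) / (6 + kappa2*h^2)
    """
    phi = np.zeros_like(rho)
    for _ in range(n_iter):
        neighbors = sum(np.roll(phi, shift, axis)
                        for axis in (0, 1, 2) for shift in (1, -1))
        phi = (neighbors + 4.0 * np.pi * h**2 * rho) / (6.0 + kappa2 * h**2)
    return phi

charge = np.zeros((24, 24, 24))
charge[12, 12, 12] = 1.0                 # a single point charge
phi = relax_pb(charge, kappa2=0.1)       # screened potential, peaked at the
                                         # charge and decaying outward
```

The κ² term both screens the potential and guarantees that the relaxation converges to a unique solution.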
Figure 18. Electrostatic energy of the primer strand in the open and closed forms of the pol I DNA polymerase.
We recently solved another structure in our lab, that of TMP kinase, which simply adds another phosphate onto a TMP, yielding TDP by taking one phosphate from ATP. If you carefully calculate the electrostatic potential in the active site of TMPK using the Poisson-Boltzmann method, you can see regions of both very high and very low potential, going from −30 to +30 kT/e over the course of 5 Å. In the absence of any counterions, this is nothing other than a huge electric field; I calculated it to be about 10⁷ V/cm. Needless to say, this electric field could play a role in breaking the phosphodiester bond of the phosphoryl donor molecule, namely ATP. Returning to polymerases, out of curiosity, I wanted to look at the electrostatic potential created by protein side-chains at the sites of the various phosphates in the primer strand in the crystal structure of one member of the DNA pol α family [Fig. 18]. What I found was this kind of parabolic curve, which just means that
positions 3, 4, and 5, starting from the 3' end, are more stabilized than the rest. I also calculated this for the intermediate positions, even though they are not quite physical; this is actually interpolated along the DNA helix. But if the DNA were rotating along its axis in a helicoidal manner, the potential it would experience would be more or less this parabola. I thought it would be interesting to calculate the potential in both the closed and open forms, and found there to be one parabola in the open form [Fig. 18]. In the closed form, there is also one parabola, but it is displaced by almost exactly one base. The horizontal axis is the number of bases, so if you translate along the X-axis, you have translocation. The vertical axis is the energy, so if you add an extra phosphate, you can read how much it would cost on the Y-axis. I think this representation provides a very simple explanation for a full thermodynamic cycle of nucleotide incorporation. Suppose you are in the open form and you want to add one nucleotide. By just extrapolating this parabola, you can see that it would cost a lot. So what you do is go to the closed form first, then add another nucleotide, and it costs you much less. Then, to start the system again, you translocate and go to the open form at the same time. It turns out that this costs you absolutely no energy at all, because of the symmetry of the two curves and because they are shifted by almost exactly one base-pair; what you lose on one side you get back on the other. Then you can start another cycle: you go to the closed form if you want to add another nucleotide, and this is the replication cycle. It's completely speculative, but I thought it worth mentioning. This basically ends what I wanted to say about polymerases today. In closing, I acknowledge the collaboration of the following people: J.B. Boule and C. Papanicoulaou, who worked on the TdT project, and F. Rougeon, whom I have already mentioned; T. Vatzaki, who worked on the structure of TdT in the presence of various ions and nucleotides; N. Sukumar obtained the first crystals; N. Jourdan helped with phasing; N. Expert-Bezancon did many TdT preps; and J. Lescar of the ESRF (Grenoble) helped a great deal with data-collection and crystallographic refinement.

Comment: I remember in the 1990s, when we started getting these DNA polymerase structures, the proofreading activity seemed to be located far away from the nucleotide site. The notion was that after incorporation of the nucleotide, the DNA could move by perhaps 15 to 20 Å, so that the 3'-terminus was exposed to the proofreading function. Maybe the picture has changed, but the idea was that there could be long-range motion.
Response: My understanding is that if there is a mistake in the incorporation, at least in the Klenow fragment, then somehow the enzyme will sense it and the primer strand will be diverted to another site of the protein, the exonuclease site, which is quite far away, maybe 15 Å. But this implies a type of movement different from the one involved in the open-to-closed transition.

Comment: I think the kinetic data do not go in this direction. I think the kinetic data of Benkovic show there is very fast motion. In fact, when Steitz published the first DNA-polymerase structure with the DNA in the polymerase, the DNA could not be seen at all, and the interpretation was that it was moving enormously, even in the crystal. That is one point. Secondly, I think the notion that the polymerase decides that the nucleotide is incorrect and then sends it to the proofreading site is completely excluded by the kinetic data, which show what is called the "next nucleotide effect." This has been demonstrated with all DNA polymerases, showing that the proofreading and polymerase activities are concomitant.

Response: I am not an expert in proofreading, so I do not feel confident in replying to your question, but I am pretty sure they now see the DNA quite well in an editing complex of a novel polymerase crystal form. I believe it is for the RB69 polymerase, and appeared in a 1999 paper in Cell.
THE PROTEIN-FOLDING NUCLEUS: FROM SIMPLE MODELS TO REAL PROTEINS

LEONID MIRNY

Harvard-MIT Division of Health Sciences and Technology, Cambridge, MA, USA
Introduction

Although this subject is not the main focus of my current research, I did work in protein folding for some time. Since the word folding is part of the title of the conference, I thought it would be appropriate to talk about protein folding - not an overview of the whole field - but concentrating on just one particular physical model on which people in the field are now working. I will first talk about the protein-folding problem, then about the concept of nucleation in two-state folding proteins. After that, I will discuss how we have been studying nucleation in model proteins, then how people have been conducting experiments to look for the folding nucleus in real proteins, followed by the way we designed an evolutionary analysis of real proteins in order to find nuclei in them.
The protein folding problem

When we address "the protein-folding problem," we are in fact talking about two problems. One is to predict the structure of the protein, given the amino-acid sequence. A less well-defined problem concerns how proteins fold. I will not talk about the first problem, focusing only on the second one. The main challenge in the field of protein folding is the fact that we don't have a single program; there is no algorithm for folding a protein. Therefore at present, we cannot fold a single protein on our computers. We would nevertheless like to understand how proteins fold, and you will see how we find a way to address this problem. So, how does a protein fold; what do we know from experiments? First of all, we know that many proteins fold in a "two-state" manner. This means starting an experiment with the proteins unfolded and then changing the conditions, which results in the accumulation of some folded proteins. If you further change the conditions, you get more proteins in the folded state and fewer in the unfolded state.
If you start with denaturing conditions and dilute the solution to make the protein fold, you will find that the fraction of protein in the unfolded state decreases and the fraction in the folded state increases, without the accumulation of intermediates. That is the most important definition of a two-state process. Many small proteins fold in a two-state manner. Again, I'm talking about relatively short proteins, up to 150 amino-acids; unfortunately, there are very few studies of the folding of longer proteins. We also know that proteins fold fast, in terms of a certain time-scale. So what is the typical scale of folding times? The typical folding rate is usually between 1,000 and 0.2 inverse seconds, meaning that it usually takes between one millisecond and one second for a protein to fold. Why do we consider this to be fast? Usually, if you take a protein and start making mutations, it is very hard to get a protein that folds faster than the natural one; you usually get one that folds more slowly. So natural proteins have been optimized by evolution to fold relatively fast - not as fast as possible, but relatively fast.
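The two-state picture just described can be made quantitative with a one-line equilibrium model: only folded and unfolded species, no intermediates. The ΔG values swept below are illustrative numbers, not measurements from the talk.

```python
import math

R = 1.987e-3   # gas constant, kcal/(mol*K)

def fraction_folded(dg_fu, temperature=298.0):
    """Two-state population: only folded (F) and unfolded (U) species,
    with K = [F]/[U] = exp(-dG_FU / RT) and no intermediates."""
    k_eq = math.exp(-dg_fu / (R * temperature))
    return k_eq / (1.0 + k_eq)

# As conditions change, dG_FU (folded minus unfolded) sweeps through zero
# and the population shifts smoothly from unfolded to folded.
for dg in (3.0, 0.0, -3.0):              # kcal/mol, illustrative values
    print(round(fraction_folded(dg), 3))  # prints 0.006, then 0.5, then 0.994
```

The smooth two-state sigmoid, with no intermediate species appearing at any point of the sweep, is exactly the experimental signature mentioned above.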
Figure 1. How does a protein fold? Two proteins with the same length, stability, and topology (fold) nevertheless show very different folding rates: kf = 0.23 s⁻¹ vs. kf = 897 s⁻¹. Why?

What more do we know about the folding rate? A recent interesting observation is that some proteins of the same length, stability, and topology have very different
folding rates.

Figures 2, 3, 4. Two-state proteins. The unfolded and folded states are separated by a transition-state ensemble along the reaction coordinate; the folding rate is kf = C exp(−ΔG‡/RT), so a lower barrier means faster folding and a higher barrier slower folding.

A specific example of this type is a pair of proteins, ADA2h (activation domain of human procarboxypeptidase A2) and AcP (acylphosphatase), which have very similar topology but dramatically different folding rates. AcP is
the slowest-folding alpha-beta protein known, with a folding rate of about 0.23 s⁻¹. Importantly, folding of this protein does not require a cis-trans proline isomerization step, formation of disulfide bonds, or binding of any co-factor. In contrast, ADA2h is the fastest-folding alpha-beta protein, with a rate of around 800 s⁻¹. Both proteins are of almost the same length, stability, and general fold topology [Fig. 1]. The question is why two proteins that are so similar fold at such different rates. I am not going to talk about these two particular proteins, but in general about factors that determine the folding rate. For two-state proteins, the unfolded state is separated from the folded state by a free-energy barrier. The stability of the system is given by the difference in the free energies of the unfolded and the folded states, ΔGFU. The height of the barrier between the unfolded state and the transition state, ΔGUT, determines the folding rate; the higher the barrier, the longer it takes for a protein to overcome it, i.e., the longer it takes to fold [Figs 2, 3, & 4]. It is important to understand that each of these states is represented by an ensemble of protein conformations. The native state is an ensemble of conformations that have near-native topology and small variations in the structure of some loops and in the positions of the side chains. The unfolded state is a broad ensemble of disordered, non-compact protein conformations. The transition state is an ensemble of conformations that have some partially folded regions and some big disordered loops. The structure of the transition state is not fully understood and is currently the subject of intensive research.
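If one assumes, as the barrier picture above does, that both proteins share the same prefactor C in kf = C exp(−ΔG‡/RT), the quoted rates for AcP and ADA2h translate directly into a barrier-height difference. This is a back-of-the-envelope sketch, not an analysis from the talk.

```python
import math

R = 1.987e-3                 # gas constant, kcal/(mol*K)
T = 298.0                    # room temperature, K

def barrier_difference(k_fast, k_slow):
    """Barrier-height difference implied by two folding rates, assuming
    k_f = C * exp(-dG/RT) with the same prefactor C for both proteins."""
    return R * T * math.log(k_fast / k_slow)

# Rates quoted in the text: ADA2h ~800 1/s, AcP 0.23 1/s.
ddg = barrier_difference(800.0, 0.23)
print(round(ddg, 1))         # ~4.8 kcal/mol of extra barrier for AcP
```

A barrier difference of a few kcal/mol, comparable to one or two hydrogen bonds, is thus enough to separate the fastest and slowest folders of the same fold.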
The folding nucleus

Central to my talk is the concept of the folding nucleus. A folding nucleus is a small sub-structure or set of interactions common to most conformations in the transition state ensemble. If you stabilize the folding nucleus, you lower the energy of the transition state, and folding goes faster. If you destabilize the folding nucleus, you destabilize most of the structures in the transition state, and folding proceeds more slowly. Hence, proteins that fold fast must have a stable nucleus. We will come back to this later. The idea of nucleation as a mechanism of protein folding was first suggested in 1984 by Nobuhiro Go. The concept of a specific folding nucleus and its role in determining the folding rate was first developed in 1996 by Abkevich, Gutin, and Shakhnovich. Many other people contributed to further theoretical development of the concept, including Oleg Ptitsyn, Peter Wolynes, Martin Karplus,
Devarajan Thirumalai, Jose Onuchic, Vijay Pande, Alexey Finkelstein, and many, many others. Alan Fersht was the first to study the folding nucleus experimentally, in chymotrypsin inhibitor CI2. Since then, folding nuclei have been found in numerous proteins. The main experimental results and further refinement and development of the concept have come from the laboratories of Luis Serrano, David Baker, Mikael Oliveberg, Jane Clarke, Arthur Clarke, Chris Dobson, and, again, many, many other experimental groups. Let us now focus on the main problem: how the ability to fold fast is encoded in the protein sequence.
Lattice proteins
Figure 5. The world of lattice proteins (1).
In studying this problem, I welcome you to the world of lattice proteins [Fig. 5]. Remember that I mentioned it was not possible to fold a real protein on a computer. However, we can fold these little lattice proteins. In this model, a self-avoiding chain in a cubic lattice represents the protein structure. Amino-acids are situated at the nodes of the lattice and interact with other amino-acids situated in neighboring
308
L. Mirny
lattice sites. No two amino-acids may occupy the same site. Although lattice proteins do not look like real proteins, they do have many similar physical properties. First of all, they fold, and it's not easy to fold a lattice protein, since the total number of possible conformations for a protein of 30 amino-acids is around 4³⁰ - a huge number! So one cannot enumerate all the possible conformations for lattice proteins in three dimensions. However, we can study how these proteins fold; we can simulate the evolution of such proteins, fold them under various conditions, etc. So, there's a whole world of lattice proteins! We need two things in order to study the folding of such proteins: first, a model of protein energetics, that is, the energy function; second, a way to simulate protein dynamics.
Figures 6 & 7. The world of lattice proteins (2). Dynamics of lattice proteins: an attempted move is accepted with probability P, where P = 1 if ΔE < 0 and P = exp(−ΔE/T) if ΔE > 0. Lattice proteins can fold! Folding times are about 10⁶-10⁸ steps for a 48-mer and 10⁵-10⁷ steps for a 27-mer, and the folding time depends on the sequence and structure of the protein.
The energy of a lattice conformation is the sum of the interaction energies between every pair of contacting amino-acids, a_i and a_j. Amino-acids are said to be in contact if they occupy neighboring lattice sites and are not neighbors along the sequence (see the figure). If amino-acids i and j are in contact, Δ_ij equals 1, and zero otherwise. a_i is the type of amino-acid at position i of the protein. The matrix U(a_i, a_j) gives the energy of interaction between amino-acids of type a_i and type a_j. Therefore, for twenty types of amino-acids, U is a 20x20 matrix, indicating how strongly amino-acids of different types interact, either attracting or repelling each other. Given this matrix, we can compute the energy for any amino-acid sequence in any conformation: E = Σ_ij U(a_i, a_j) Δ_ij. In order to study folding, we also must simulate protein dynamics. We use a Monte-Carlo procedure to simulate dynamics [Figs 6 and 7]. At each step of the Monte-Carlo procedure we pick an element of the protein chain and attempt to move it. For example, we can pick a corner and flip it, or pick a crankshaft and rotate it 90 degrees. Then we compute the change in energy, ΔE, for such an attempted move. If the move decreases the energy, we accept it; if it increases the energy, we accept it with a probability that decreases exponentially with ΔE: P = exp(−ΔE/T), where T is the temperature of the system. At high T we are more likely to accept moves that increase the energy. This is known as the Metropolis Monte-Carlo procedure.
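The contact energy and the Metropolis acceptance rule can be sketched in a few lines of Python. The two-letter H/P interaction table, the tiny 4-mer conformation, and the function names are illustrative stand-ins for the 20x20 matrix and the 27- and 48-mers of the talk.

```python
import math
import random

def metropolis_step(energy_old, energy_new, temperature):
    """Metropolis rule for an attempted lattice move (corner flip or
    crankshaft rotation): always accept downhill moves, accept uphill
    moves with probability exp(-dE/T)."""
    d_e = energy_new - energy_old
    if d_e <= 0:
        return True
    return random.random() < math.exp(-d_e / temperature)

def contact_energy(conformation, sequence, u):
    """E = sum of U(a_i, a_j) over non-bonded nearest-neighbor contacts;
    `u` maps an (unordered) pair of amino-acid types to an energy."""
    position = {site: i for i, site in enumerate(conformation)}
    energy = 0.0
    for (x, y, z), i in position.items():
        # Check each lattice bond once (positive directions only).
        for neighbor in ((x + 1, y, z), (x, y + 1, z), (x, y, z + 1)):
            j = position.get(neighbor)
            if j is not None and abs(i - j) > 1:   # contact, not covalent
                energy += u[frozenset((sequence[i], sequence[j]))]
    return energy

# A 4-mer folded into a square: the two end residues form one H-H contact.
conformation = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
sequence = "HPPH"
u = {frozenset("H"): -2.0, frozenset("HP"): 0.0, frozenset("P"): 0.0}
print(contact_energy(conformation, sequence, u))   # -2.0
```

A full simulation would wrap `metropolis_step` in a loop over attempted corner-flip and crankshaft moves, recomputing only the local energy change at each step.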
Figure 8. Evolution of lattice proteins. A point mutation is made in the sequence (e.g., LDAPSQIEVK → LDAPSAIEVK), and the mutation is accepted if the average folding time after the mutation, ⟨t2⟩, is smaller than the average folding time before it, ⟨t1⟩.
The most important property of the lattice proteins is that they can fold very fast on our computers! If I design the right sequence and use the right temperature, I can start my simulations with a random unfolded conformation and watch as the protein folds into the native conformation. It takes between 10⁵ and 10⁸ Monte-Carlo steps to fold a lattice protein of 48 amino-acids and 10⁴ to 10⁶ steps for a 27-amino-acid protein. Significantly, the folding time strongly depends both on the protein's amino-acid sequence and on its native structure. Our goal is to understand what determines the folding rate and what the transition state looks like. In order to address these questions we must obtain fast-folding proteins and investigate how they differ from slow-folding proteins. In order to create fast-folding proteins, we simulate evolution in the lattice protein world. We start with a specific sequence - not a random sequence - because random sequences do not fold [Fig. 8]. We designed the starting sequence in such a way that it folds to the desired native conformation, but slowly. Then we simulate evolution. We make a single mutation at each step of the evolution. If this mutation decreases the folding time, we accept the mutation. If it increases the folding time, we accept it with low probability. In the figure you see proteins that are folding into the same structures. Before a mutation, the average folding time was t1, and after a mutation, it was t2. We accept or reject a mutation according to these average folding times.
Figure 9. Results: average folding time (Monte-Carlo steps, logarithmic scale) versus the number of accepted mutations (0 to 1000). Evolution made proteins that fold 300 times faster. How?
What happens as a result of this evolution? The first question concerns what happens to the folding time over this evolution. As you can see in figure 9, we started with a protein whose average folding time was around 10⁸ Monte-Carlo steps. As a result of this evolution, we obtained proteins that folded in 10⁵ steps. This is quite a spectacular speedup! Most of the increase was achieved by the first few mutations. Interestingly, if you stabilize all native interactions - if you make a sort of perfect protein - it folds more slowly than the fast protein we obtained during evolution. So, the main result of the evolution is the creation of proteins that fold 300 times faster than the original ones. Now the question is: how did evolution do this? The first possibility is that evolution may be producing more stable proteins, i.e., ones whose native structure has lower energy. That is not the case. In fact, as evolution progressed, the energy of the native state at first increased a bit, then did not change much. So if it is not the energy, what is it? What else made these proteins fold so fast? In order to address this question, we collected the last 500 proteins produced by evolution. They all fold very fast. Comparing their amino-acid sequences, we found that 10 out of 48 positions did not mutate during the last phase of evolution. Note that hundreds of mutations were attempted for each amino-acid, of which 500 were accepted. However, all mutations of these 10 amino-acids were rejected, because they slowed down the folding [Fig. 10]. Now the questions concern why these amino-acids are conserved, what happens if we mutate them, where they are in the structure, and what their energies of interaction are.
Figure 10. Database of 500 fast-folding proteins (aligned amino-acid sequences); arrows mark the conserved amino acids.
Why then are they conserved? Perhaps they never mutated because they are important for folding. What happens if we mutate them? The mutants then fold much, much more slowly. So a single mutation in any of these key positions results in much slower folding. Where are they in the structure? They are located in the central part of the structure, where they form a sort of "interaction cluster." The question now is what happens to the energy of interaction between these conserved amino-acids. You will remember that the energy of the whole protein did not change much over evolution; however, the energy of the interactions among these conserved amino-acids did change dramatically during the first steps of evolution. We conclude that selection for fast-folding decreased the energy of interaction among these amino-acids, whereas this energy did not change much over many, many steps of evolution. We now have a valid case in stating that these ten conserved amino-acids constitute the folding nucleus [Fig. 11]. In fact, if you mutate an amino-acid in the folding nucleus of a fast-folding protein, it likely destabilizes the nucleus, yielding a slower-folding protein. That is what we observed for these conserved amino-acids.
Figure 11. Folding nucleus: selection to fold fast stabilized the folding nucleus. (Free-energy profile along the reaction coordinate, from the unfolded state through the transition state ensemble to the folded state.)
The folding process

Let us now consider the folding process. It starts with an unfolded conformation, proceeds up to the transition state (which is stabilized by interactions in the folding
nucleus), and then proceeds downhill in free energy to the folded state. The first part of the process, from the unfolded state to the transition state, goes uphill in free energy, since the entropy loss of folding is greater than the energy gain. This part of the folding process is slow, constituting the rate-limiting step. The folding nucleus stabilizes the transition state, compensating for the loss of entropy. While this conformation is partially folded, it still has large flexible loops. After reaching the transition state, the process proceeds downhill in free energy; the entropy loss is smaller than the energy gain. Therefore this part of the process is fast. What did we learn about protein folding from lattice simulations? First, we learned that in order to fold fast, a protein must have a stable folding nucleus. Second, we learned that if evolution favors fast-folding proteins, the amino-acids that form the folding nucleus must be conserved over evolution. Does evolution favor fast-folding proteins? We don't know, but clearly, many protein properties are important for healthy functioning: stability of the native structure, rapid folding, transport, flexibility, target-recognition ability, enzymatic activity, etc. Rapid folding is only one term in this equation. However, any selective pressure to preserve rapid folding eventually turns into conservation of those amino-acids that constitute the folding nucleus. I don't think that natural selection leads to obtaining the fastest-folding proteins. However, there is definitely an evolutionary pressure to fold relatively fast, since proteins that fold slowly tend to aggregate. Aggregates are very toxic to cells, leading to rapid cell-death. Therefore, there must be natural selection for folding relatively fast. I will now discuss how the concept of the folding nucleus was developed and verified experimentally. How can the folding nucleus be identified in real proteins?
One cannot "freeze" real proteins in the middle of their folding process and examine the transition state. It is important to note that the transition state is NOT an intermediate meta-stable state; it is a very unstable state, as a result of which proteins spend very little time in it. Alan Fersht suggested the method of Φ-value analysis to identify nuclei in real proteins. The idea of this method is to choose an amino-acid and produce a minimally disruptive mutation of it. Then the folding rate and stability of the mutant are compared with those of the original protein. First the change in protein stability is measured: ΔΔG(U-F) = ΔG(U-F)(mutant) - ΔG(U-F)(native); then the folding rates, k_f, of the mutant and the native protein are measured. Since the folding rate depends on the height of the barrier, i.e., ΔG(U-T), it is possible to compute ΔG(U-T) from the folding rate, and eventually to compute the change in the free energy of the transition state, ΔΔG(U-T) = ΔG(U-T)(mutant) - ΔG(U-T)(native). The Φ-value is the ratio of ΔΔG(U-T) and ΔΔG(U-F).
Figure 12. Experiment: Φ-values (Alan Fersht). k_f = C exp(-ΔG(U-T)/RT); Φ = ΔΔG(U-T)/ΔΔG(U-F); Φ = 1 in the nucleus, Φ = 0 not in the nucleus. (Free-energy profile along the reaction coordinate to the folded state.)

Let us now consider how the Φ-value identifies the folding nucleus [Fig. 12]. Take an amino-acid involved in the folding nucleus: if such an amino-acid were mutated, we would expect both the native and transition states to be affected. If this amino-acid is involved in stabilizing the folding nucleus to the same extent that it is involved in stabilization of the native state, the native state and the transition state would be equally affected (stabilized or destabilized) upon mutation. Hence, the Φ-value would be equal to 1. On the contrary, if an amino-acid is not involved in stabilization of the transition state, and is important only for stabilization of the native state, the folding rate would not change upon mutation. Such an amino-acid has a Φ-value of 0. In reality, the main problem is that Φ-values are usually between zero and one. It is very difficult to interpret these intermediate Φ-values. A Φ-value of 0.7 could mean that this residue is involved in the folding nucleus in 70% of folding trajectories. It could also mean that the solvent-accessible area of the residue is 70% buried in the folding nucleus. Many interpretations are possible. Intermediate values are routinely interpreted as the "degree of involvement" of a particular residue in the folding nucleus.
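The arithmetic of Φ-value analysis is simple enough to sketch. The rate constants and the 4 kJ/mol destabilization below are hypothetical illustration values, not measurements; only the relations k_f = C exp(-ΔG(U-T)/RT) and Φ = ΔΔG(U-T)/ΔΔG(U-F) come from the text.

```python
import math

R = 8.314e-3   # gas constant, kJ/(mol*K)
T = 298.0      # temperature, K

def ddG_transition(kf_native, kf_mutant):
    """Change in the transition-state (barrier) free energy from folding
    rates.  Since k_f = C exp(-dG(U-T)/RT), the unknown prefactor C
    cancels in the ratio of the two rates."""
    return R * T * math.log(kf_native / kf_mutant)

def phi_value(kf_native, kf_mutant, ddG_stability):
    """Phi = ddG(U-T) / ddG(U-F)."""
    return ddG_transition(kf_native, kf_mutant) / ddG_stability

# Hypothetical mutation: destabilizes the native state by 4 kJ/mol and
# slows folding five-fold.  Such a residue affects the transition state
# about as much as the native state, so Phi comes out near 1.
phi = phi_value(kf_native=100.0, kf_mutant=20.0, ddG_stability=4.0)
print(round(phi, 2))  # → 1.0
```

A mutation that left the folding rate unchanged (kf_mutant = kf_native) would give Φ = 0, the other limiting case described above.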
Mutational studies are usually accompanied by visual analysis of the protein structure. High Φ-value residues that form a cluster are more likely to be the folding nucleus than residues scattered all over the structure. A more systematic way to carry out this structural interpretation of Φ-values has recently been suggested by Michele Vendruscolo of Oxford, but it is not widely used. Φ-value analysis has been done for several (around ten) proteins, in all of which folding nuclei were found. That folding nuclei were later found experimentally is a great achievement of computational and theoretical protein-folding research. I think it is a fine example of how theoretical biology helps to reveal biological principles. A more challenging problem for theoreticians now is to predict the exact location of the folding nucleus in real proteins. It is a very hard problem, since we are still unable to fold even a single real protein on a computer [Fig. 13]. However, some groups have made substantial progress toward predicting folding nuclei, especially those of Bill Eaton, David Baker, and Alexey Finkelstein.
Figure 13. Real proteins. How to find the folding nucleus in real proteins? Problem: there are several reasons for amino acids to be conserved: function, stability, and folding. How to separate them?
Our idea was to develop an evolutionary approach to this problem. Can we apply the principles we used in lattice proteins to real proteins? We start with two assumptions. First, that there is evolutionary pressure on real proteins to fold rapidly; we discussed this assumption above. The second assumption is that the location of the nucleus depends primarily on the structure of the protein. In other words, we assume that the folding nuclei of two proteins similar in structure but different in sequence are located in the same place, irrespective of how different the sequences are. This idea was also suggested by lattice simulation. At first this idea was not accepted in the experimental community; various groups studied the question, showing that folding nuclei in similar proteins may be located at somewhat different places. However, as experimentation progressed, the original concept turned out to be correct: in a big family of proteins, the nuclei are likely to be in the same part of the fold - not necessarily at the same locations - but located in the same part of the structure.
Figure 14. Sequence-structure mapping. Similar sequences always have similar structures. Different sequences usually have different structures, but different sequences may have similar structures (analogous proteins).
Figure 15. Sequence families: families of homologous sequences map from sequence space to structure space. Why are these amino acids conserved?
Now, before I tell you how we used evolutionary analysis to identify folding nuclei, I will talk about the general properties of sequence-to-structure mapping in proteins [Figs 14 and 15]. The first property is that similar sequences always have similar structures. So if the difference between the sequences is less than 70%, that is, if the sequence identity is more than 30%, the sequences fold into the same structure. Similar sequences are called homologs. The second property is that different proteins, i.e., those with less than 30% identity, may or may not have different structures. There are many examples of proteins with different sequences and very similar structures. Such proteins are called analogs. A protein family is a set of proteins that are homologous to each other, as determined by certain measures of sequence similarity. All proteins in a given sequence family fold into the same structure. Next, we can identify conserved amino-acids in protein families. We then look at these amino-acids and ask why they are conserved. There are several reasons why an amino-acid is conserved. First of all, an amino-acid may be conserved if it is important for protein function, e.g., if it is an amino-acid of the active site. Mutation of such an amino-acid leads to loss of enzymatic activity. Second, many amino-acids are conserved for reasons of protein stability. Mutation of such amino-acids destabilizes the protein structure and hence may be deleterious. Finally, an amino-acid may be conserved because of its importance for the folding rate. Mutation of such an amino-acid leads to a substantial slow-down of the folding process and is likely to be deleterious, due to the accumulation of toxic aggregates or premature protein degradation. We are particularly interested in finding amino-acids that are conserved because of their importance for folding. The challenge is to separate these three major sources of conservation.
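Conservation within one family is commonly quantified as per-column sequence entropy over a multiple alignment, as the talk does later with the entropy plot. A minimal sketch, with a toy four-sequence alignment standing in for a real protein family:

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of one alignment column.
    Low entropy means conserved; zero means a single amino acid
    appears in every sequence of the family."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy alignment: rows are homologous sequences of one family.
family = [
    "WAKDS",
    "WGKDT",
    "WAKES",
    "WSKDA",
]
entropies = [column_entropy(col) for col in zip(*family)]
conserved = [i for i, h in enumerate(entropies) if h == 0.0]
print(conserved)  # → [0, 2]  (the invariant W and K columns)
```

Real analyses use much larger alignments and usually weight sequences to correct for uneven sampling, but the per-position entropy is the same quantity plotted later in figure 17.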
Figure 16. Real proteins. Idea: compare several "runs" of evolution, i.e., proteins of similar structure and different sequence, and look for positions conserved in all families: universally conserved positions.
In order to separate these sources of conservation, we compare proteins similar in structure and different in sequence: analogous proteins. Each of these analogous proteins has its own family of homologous proteins. First we identify conserved amino-acids in each family. Then we map these amino-acids onto the corresponding protein structure. Next we superimpose these structures in space and select conserved amino-acids that overlap in all analogous proteins. These amino-acids are unlikely to be conserved as a result of function. In fact, analogous proteins typically have different functions, and their active (or binding) sites are located in different parts of the structure. Hence, conserved amino-acids of the active site do not overlap when analogous proteins are superimposed. There are very interesting exceptions: analogous proteins that have a super-site, an active site located in the same place in all analogous proteins despite the proteins' different biochemical functions. (We will discuss such proteins later.) As a rule, amino-acids that are conserved due to function are eliminated by this procedure. Amino-acids selected by this procedure are conserved in each family and located in the same place in analogous proteins, irrespective of the type of amino-acid conserved in each family. For example, alanine could be conserved in one family, cysteine in a second, and tryptophan in a third. If these conserved alanines, cysteines, and tryptophans overlap when the structures are superimposed, this position in the structure qualifies as a universally conserved position. We study each family in detail, compute the conservation of every position, and average the conservation across several analogous families. You will remember that we outlined three main reasons for conservation: function, stability, and folding. We are seeking amino-acids that are conserved due to folding. We have successfully removed amino-acids that are conserved due to function.
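The intersection step just described can be sketched directly. The position indices below are invented; in practice each family's conserved positions would first be mapped into a common frame by structural superposition:

```python
# Each analogous family contributes the set of its conserved positions,
# already mapped onto a common structural frame after superposition.
# Universally conserved positions are those shared by EVERY family,
# regardless of which amino acid each family conserves there (alanine in
# one family, cysteine in another, tryptophan in a third).
family_conserved = [
    {3, 7, 12, 21, 30},   # conserved positions, family 1 (hypothetical)
    {3, 9, 12, 21, 44},   # family 2
    {3, 12, 18, 21, 30},  # family 3
]
universal = set.intersection(*family_conserved)
print(sorted(universal))  # positions conserved in all analogous families
```

Positions conserved for function drop out of the intersection automatically, because active sites of analogous proteins sit in different parts of the structure.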
Now we must remove amino-acids that are conserved due to their role in stabilizing the native structure. We have developed a rather sophisticated statistical procedure for doing this. (I won't go into the details here.) The idea of the method is to compute the expected structural conservation at each position of the protein. Since protein structures are primarily stabilized by hydrophobic interactions between buried amino-acids, we can compute the expected conservation of each amino-acid according to how deeply it is buried. By subtracting the expected conservation (blue line) from the observed one (red line), we remove positions that are conserved for the purpose of providing protein stability [Fig. 17]. This plot shows you sequence entropy, which measures conservation. Conserved positions have low entropy; variable ones have high entropy. So positions whose observed entropy (red line) is lower than the expected entropy (blue line) are of interest to us. We have subtracted the second part of the signal and are left with the universally
conserved amino-acids. They are conserved for reasons other than protein function or stability.
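The subtraction of expected from observed conservation can be sketched as follows. The linear burial model and all numbers here are toy assumptions; the actual statistical procedure is, as the speaker says, more sophisticated:

```python
def folding_signal(observed_entropy, burial):
    """Flag positions more conserved than burial alone predicts.
    Toy expected-conservation model: deeply buried positions are expected
    to have lower entropy (hydrophobic-core stabilization); the residual
    (expected - observed) > 0 marks extra conservation, attributed to
    folding (or to a super-site)."""
    h_exposed, h_buried = 3.0, 1.0   # assumed expected entropies (bits)
    signal = []
    for h_obs, b in zip(observed_entropy, burial):
        h_exp = h_exposed + (h_buried - h_exposed) * b  # b in [0, 1]
        signal.append(h_exp - h_obs)  # > 0: more conserved than expected
    return signal

h_obs  = [2.9, 0.2, 1.1, 0.1, 3.0]   # observed column entropies (toy)
burial = [0.1, 0.9, 0.9, 0.3, 0.0]   # fraction of surface area buried
extra = folding_signal(h_obs, burial)
candidates = [i for i, s in enumerate(extra) if s > 0.5]
print(candidates)  # → [1, 3]
```

Position 2 here is strongly conserved but only as much as its burial predicts, so it is attributed to stability and dropped; positions 1 and 3 carry conservation beyond the stability expectation and survive as nucleus candidates.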
Figure 17. Universally conserved folding nucleus. Compute the conservation expected from the protein stability and compare it with the observed conservation: sequence entropy (observed, red line; expected, blue line) plotted along sequence positions 5-85.
Universally conserved folding nuclei

Let's consider universally conserved amino-acids in proteins of the immunoglobulin fold. Why are these amino-acids so well-conserved? They are not conserved for functional reasons. As I said before, the proteins of this fold have functional amino-acids located at various parts of their structures. Conservation of these amino-acids exceeds the degree of conservation expected due to their role in stabilizing the native structure. There is some additional evolutionary pressure on these positions. It is also important to distinguish the cluster of these universally conserved amino-acids from the hydrophobic core. The hydrophobic core is much bigger than this cluster. It involves all seven strands, whereas the universally conserved cluster involves only three. Universally conserved amino-acids are located in the center of the protein structure in the immunoglobulin fold. This is not necessarily true for other folds. (I will show you such examples in a moment.) We therefore believe that these clusters of universally conserved amino-acids constitute the folding nucleus. Additional evolutionary pressure, then, comes from the requirement to fold fast. Not as fast as possible, but fast enough. To be more
precise, we think that the folding nucleus is located in similar places in the proteins of this fold. These nuclei exhibit some additional conservation. Averaging conservation over analogous proteins, we obtain amino-acids that are common to the nuclei of individual proteins. That is what you see in the left part of figure 18 (top). On the right, you can see an experimentally identified folding nucleus in one of the proteins of that fold. This was the first protein of an immunoglobulin fold in which the folding nucleus was experimentally identified. As more proteins were studied by Φ-value analysis, it became evident that folding nuclei are located in somewhat different parts of the fold and involve somewhat different sets of amino-acids. However, the overlap between these nuclei shows that the set of the common amino-acids agrees very well with the universally conserved positions. So these universally conserved positions do not predict the folding nucleus itself, but rather suggest where in the structure the folding nucleus is located. This protein fold, a so-called flavodoxin fold, is an illustration of another case. On the left of figure 18 (bottom) you see universally conserved amino-acids; on the right, an experimentally found folding nucleus. Notice that the cluster of universally conserved amino-acids is not fully hydrophobic and is not located in the center of the protein. It is mostly at the top of the beta-sheet and includes three aspartic acids, which bind the metal ion. Again, amino-acids of the folding nucleus do exhibit this universal conservation. There is also an interesting story about a few universally conserved amino-acids that are not part of the nucleus. They constitute a functionally conserved cluster called a super-site. (I refer you to our paper on universal conservation for details.) Another example is a protein of a so-called nucleotide-binding fold. We identified universally conserved positions in this fold, as well.
Although these universally conserved amino-acids are in the center of the barrel, they constitute only a small part of the hydrophobic core, involving only 3 strands out of 5 or 6. Unfortunately, no one has tried to detect a folding nucleus in these proteins, so these amino-acids are still just a prediction. The following are the main points of this part of the talk: First, as for lattice proteins, real proteins must have stable folding nuclei in order to fold fast. Second, the folding nucleus in real proteins exhibits some additional conservation. I would now like to make the following important point: If you look at a single family of proteins - not many families, as we did, but just one family - the folding nucleus is no more conserved than other buried residues. That is true because there are many reasons for conservation, and they cannot be distinguished from one another by looking at a single protein family. There are active-site residues, binding-site residues, buried residues, etc. You cannot find excessive conservation of the
folding nucleus unless you average over several analogous proteins and unless you subtract the conservation expected due to buried amino-acids. The folding nucleus of a single family is no more conserved than other buried residues or than the amino-acids of the hydrophobic core. However, the amino-acids of the folding nucleus are, on average, more conserved than the rest of the protein.
Figure 18. Universally conserved folding nuclei.
In summary: rapid folding on the lattice is achieved by stabilization of the folding nucleus. The folding nucleus is defined as a common stabilizing core of conformations that belong to the transition-state ensemble. It is important to note that all first-order transitions, i.e., all cooperative transitions, go through some nucleation process. For example, condensation of water vapor to liquid water goes through a nucleation phase. An important distinction between condensation and folding is that condensation goes through a non-specific nucleus, whereas folding goes through a specific one. In other words, nucleation in water-vapor condensation is non-specific; any sufficiently large set of water molecules can form a nucleus. In protein folding, however, nucleation is specific; a specific set of amino-acids has to come together to form a transition-state conformation. Using our analysis of homologous and analogous proteins, we found universally conserved positions: amino-acids that are, first of all, conserved in all families of a particular fold and, second, more conserved than expected from stabilization of the native structure. We suggest that some of these universally conserved positions constitute the folding nucleus. Their excessive conservation derives from evolutionary pressure to fold fast. Comparison of universally conserved positions with experimentally identified folding nuclei supports this hypothesis.
Acknowledgements

I am grateful to my former advisor Eugene Shakhnovich and to my collaborator Victor Abkevich, both of whom worked on this project with me.
CHAPERONIN-MEDIATED PROTEIN FOLDING

DEVARAJAN THIRUMALAI
Chemistry and Biochemistry/IPST, University of Maryland, College Park, MD, USA
Introduction

To function as enzymes, proteins must fold to well-defined native states. Most proteins and RNA molecules fold spontaneously, giving rise to the notion that biomolecular folding is a self-assembly process. However, certain proteins and RNA molecules essential for cell survival require the assistance of molecular chaperones in order to fold. In this talk I will discuss work done in several laboratories, in order to present a coherent picture of how chaperones work. Molecular chaperones are nanomachines whose job is to "rescue" proteins destined for aggregation. Chaperones are one of the many machines in living systems that use ATP in a coordinated manner and that undergo a series of rather large conformational changes in carrying out a specific activity. In the case I describe here, the ATP-consuming nanomachine helps proteins that normally do not fold spontaneously to reach their functionally active native state. Many aspects of protein folding and allosteric control must be understood in order to describe the workings of this class of nanomachines. Only recently has a molecular explanation of how they function begun to emerge. Many biophysical aspects of molecular chaperones, most notably the chaperonins of E. coli, have recently been reviewed. Several developments in seemingly unrelated areas have contributed to our current understanding of how chaperonins function. The most spectacular crystal structures, those of E. coli chaperonins, were solved by the late Paul Sigler, whose footprints are indelibly etched all over this field. Knowledge of E. coli chaperonin structures has provided considerable insight into the workings of chaperonins. However, structures alone are insufficient for understanding the ATP-driven transitions integral to assisted folding. In the process of binding to ATP, E. coli chaperonins undergo substantial domain movements.
This aspect, namely the allosteric transitions, was systematically studied by Amnon Horovitz of the Weizmann Institute, in Israel. Horovitz and coworkers developed a working model of the allosteric transitions the E. coli chaperonin (called the GroEL particle) undergoes upon binding to substrate protein (SP), ATP,
and to the co-chaperonin GroES. This field - like many other fields in which it is imperative to determine large-scale structures - has benefitted from cryo-electron microscopy work. The picture of how E. coli chaperonins work has also required computational methods and bioinformatic analysis. I will summarize how these developments have been harnessed to yield a conceptual framework for understanding the workings of this class of ATP-consuming nanomachines. The resulting picture was developed in collaboration with my colleague George Lorimer. To whet your appetite, I will begin by showing the spectacular structure [Fig. 1] of the E. coli chaperonin, GroEL. I will refer to this state, which appears as an intermediate in the ATP-driven cycle, as the "R" state.
Figure 1.
Discovery of GroEL

A little knowledge of the history of protein folding is needed in order to appreciate the relatively recent accidental discovery of chaperonins. In 1960, Anfinsen carried out his famous experiments on reversible folding and unfolding of ribonuclease-A, thus beginning the protein-folding game as we know it today. The hypothesis enunciated by Anfinsen states that protein folding is a self-
assembly process in which sequence uniquely determines the proper fold. Solving this problem (the second half of the genetic code), i.e., how to accurately determine the three-dimensional structure of the native state starting from the sequence alone, is one of the holy grails of molecular biology. There is another aspect to the protein- and RNA-folding problems: How does an ensemble of unfolded states navigate the rough free-energy landscape of proteins in search of the relatively unique native conformation? Both these problems, which are concerned only with monomeric proteins sans chaperones, remain unsolved. In retrospect, the concept of assisted folding may be traced back to two disparate developments during the 1970s. Georgopoulos, a geneticist, was working on temperature-sensitive mutant strains of E. coli. He found that some of these temperature-sensitive mutants were unable to support the growth of various bacteriophages. The temperature-sensitive forms were related to a mutant form of a gene, which he called "GroE." The normal product of the gene is a large protein referred to as "GroEL" (L stands for "large"). The molecular weight of GroEL is around 60 kilodaltons (kDa). The structure of GroEL, which assembles as an oligomer to form a ring, displays an unusual seven-fold symmetry [Fig. 1]. Subsequently, again using genetics, Georgopoulos found GroEL to be associated in 1:1 stoichiometry with another particle, GroES (S for "small"), which is a smaller subunit of GroE, of molecular mass around 10 kDa. However, the discoveries of GroEL and GroES did not reveal the precise roles they play in enabling proteins to reach their native states. Plant biologists had been independently examining protein synthesis in chloroplasts. Since it is produced in plants, ribulose 1,5-bisphosphate carboxylase/oxygenase (Rubisco) is the most abundant protein on Earth. By means of a gel electrophoresis purification process, J.
Ellis found Rubisco to always be associated with another protein, which he called "rusBP." Not coincidentally, the molecular weight of rusBP turned out to be about 60 kDa. These results, which were reported in the 1980s, did not reveal the function of rusBP (or GroE). Anfinsen's hypothesis of spontaneous folding so dominated the thinking at the time that there was no room to admit that certain proteins might require molecular chaperones in order to achieve the native conformation. In 1987, apparently in a meeting, J. Ellis introduced the term "molecular chaperones," upon realizing that in some unknown way rusBP was in fact aiding the assembly of Rubisco. More importantly, in the absence of rusBP, one could not reconstitute Rubisco. In 1989, he observed that rusBP was homologous to GroEL. In the same year, Lorimer found that GroEL, which is an E. coli chaperonin, could help in the assembly of Rubisco. This is an in vitro experiment in which, upon addition
of GroEL and ATP, Rubisco, which normally aggregates, folds. After this experiment, the problem of assisted folding became very well-defined. Thus, insights into the function of GroEL could be obtained by systematically studying proteins in E. coli. The minimal components essential for assisted folding, namely GroEL, GroES, and ATP, were identified. The question was how these components interact in concert to assist misfolded SPs to reach the folded state. Question: How did you check in vitro that it was not properly folding? Response: Rubisco is involved in carbon dioxide fixation, so one can reconstitute it and see whether it initiates the first step in that process; i.e., ensure that the refolded Rubisco is functional. It was discovered only fifteen years ago that these nanomachines were involved in protein folding. This implied that in certain instances the Anfinsen hypothesis had to be modified to allow for the active participation of certain cofactors. We now know that there are two kinds of chaperones. I will refer to one type as "intramolecular chaperones." The precise way these intramolecular chaperones work is not known. Bovine pancreatic trypsin inhibitor (BPTI) is a protein related to ribonuclease-A. The BPTI folding pathway is extremely complicated, because in vitro refolding studies show it to have long-lived, dead-end kinetic traps that impede folding. It is known that under in vivo conditions, the folding of BPTI involves peptide signal sequences. P. Kim added a "signal sequence" (a protein disulphide isomerase that covalently links to the N-terminus of BPTI), which enhanced the BPTI folding rate by a factor of around 7,000. It is believed that long-lived metastable kinetic traps do not exist under in vivo conditions. The intramolecular signal sequences help assemble BPTI without errors. In this lecture, I will describe the workings of intermolecular chaperones, focusing on class I chaperonins.
The best-studied class of chaperonins comes from E. coli and is called GroEL; its co-chaperonin is GroES. I will argue that GroEL participates actively in enabling the SP (substrate protein) to reach the folded native state. There are three actors in this drama, the first of which is the GroE machine itself, which constitutes the large subunit. I will discuss its architecture in some detail, because it directly reveals a few of the key characteristics of chaperonins. The second actor is GroES, a small subunit that associates with the GroEL particle. Its crystal structure was determined in 1996 by John Hunt. Finally, there is the SP. It is the job of GroEL, GroES, and ATP (the components of the nanomachine) to "rescue" the SP. The function of the
nanomachine is manifested through the coordinated interplay of GroEL, GroES, and the SP, with ATP hydrolysis serving as the timing device. I will try to explain this exciting story during the rest of this lecture. Experimentalists who work on protein or RNA folding rarely report the yield of the reaction (i.e., the native-state product). One can take almost any protein that has a unique native state, artificially arrange the conditions (extremely low temperature, for example), and it will in fact fold. While this is interesting, what one really wants to know is the spontaneous native-protein yield obtained under given physiological conditions. Consider in vitro refolding of malate dehydrogenase (MDH): The yield for malate dehydrogenase, as a function of temperature under spontaneous cellular conditions and infinite dilution, is around 30%. The yield is essentially unchanged at lower temperatures. Upon addition of GroEL and ATP, the yield increases to about 80%. Therefore the function of GroEL is not merely to speed up folding rates; indeed, in certain cases, interaction with GroEL retards folding rates. The job of the nanomachine is to produce a sufficient quantity of native material for the protein to carry out the enzymatic reaction. One of the first questions to ask is: Does every protein in E. coli, for example, require the assistance of chaperones in order to fold? I will provide some arguments to show that this is not the case, and that, in fact, the majority of proteins (in excess of 90%) do not require chaperones to fold. That leads us to the next question: How do the rest of the proteins fold spontaneously? To answer that, one has to understand spontaneous folding, which is the Anfinsen problem. The first question, which concerns the fraction of E. coli proteins that requires chaperonins to fold, may be answered using data from cell biology. The cell-doubling time of E. coli growing at 37°C under minimal glucose conditions is around 40 minutes.
The amount of protein in the cell is about 10⁻¹³ grams. The average molecular weight of proteins in E. coli is 40,000 daltons. Therefore the average number of polypeptide chains is on the order of a couple of million. The number of distinct proteins in E. coli is around 4,800. From this it follows that the minimal rate of protein synthesis is 60,000 polypeptide chains per minute. So every minute, E. coli synthesizes around 60,000 polypeptide chains that are expected to fold for functional purposes. The number of ribosomes involved in this synthesis is on the order of 35,000. So every half-minute or so, each ribosome releases a polypeptide chain that either folds for functional purposes or gets chewed up by the proteolytic machinery. The quantity of GroEL per cell is around 10⁻¹⁵ grams, and the quantity of GroES per cell around 10⁻¹⁶ grams. In vitro experiments involving GroEL show the rate of assisted folding to be on the order of one per minute. So the GroEL present in E. coli can process on the order of 3,000 proteins per minute. This number is much lower than the 60,000 being synthesized, which implies that most of the proteins must fold spontaneously. This calculation indicates that the fraction that may be assisted is around 5% or 10%. So the majority of the synthesized proteins (in excess of 80%) will have to find their destination without the benefit of interacting with GroEL.

Question: Does that represent the number of genes or the number of proteins?

Response: The number of proteins.

Question: Was GroEL originally found as a mutation in a temperature-sensitive strain?

Response: That is what I understand.

Question: So these bacteria could actually be grown at regular temperatures?

Response: Yes, these bacteria could grow at regular temperatures.
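The counting argument above can be sketched numerically. This is a back-of-the-envelope check, with all inputs taken from the lecture's rounded figures; the computed per-minute synthesis rate lands within a factor of two of the quoted 60,000, as expected for estimates this rough.

```python
# Back-of-the-envelope estimate of how many nascent chains E. coli must
# fold per minute versus how many GroEL can process. All input numbers
# are the rough values quoted in the lecture.
AVOGADRO = 6.022e23

protein_mass_g = 1e-13      # total protein per cell (~10^-13 g)
avg_mw_daltons = 40_000     # average protein molecular weight
doubling_time_min = 40      # growth at 37 C on minimal glucose

chains = protein_mass_g / avg_mw_daltons * AVOGADRO  # ~1.5e6 chains, "a couple of million"
synthesis_rate = chains / doubling_time_min          # a few x 10^4 chains per minute

groel_capacity_per_min = 3_000   # quoted processing capacity of cellular GroEL
quoted_synthesis_rate = 60_000   # the lecture's rounded synthesis figure
assisted_fraction = groel_capacity_per_min / quoted_synthesis_rate

print(f"{chains:.1e} chains; ~{synthesis_rate:.0f} chains/min to fold")
print(f"fraction that can be GroEL-assisted: {assisted_fraction:.0%}")
```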
Spontaneous folding mechanisms

I will take a short detour to get to the point mentioned in the context of RNA folding, namely the notion of a kinetic partitioning mechanism [see Figure 2, below]. How does spontaneous folding occur? A cottage industry has evolved to answer this question, for both protein and RNA folding. Although the outlines of how proteins and RNA fold have emerged, there are many important unanswered questions in this field. Over the years, my laboratory has developed a unified perspective on the theoretically expected scenarios for RNA and protein folding. It is necessary to describe this viewpoint in order to explain how chaperonins function.

Consider proteins that are two-state folders, which means that they exist only in folded and unfolded states. How do such proteins fold? A majority of experimentalists and theorists in the protein-folding community have been trying to understand how two-state systems kinetically reach the native state starting from a myriad of distinct unfolded states. To understand the problem, consider a random coil generated by adding a sufficient concentration of denaturant (urea, for example). As an aside, it should be pointed out that the nature of such
states from which in vitro folding is initiated has not been characterized. This issue, while important, is not relevant for our purposes. In two-state proteins, the random coil folds in an error-free fashion. Under folding conditions (obtained by letting the denaturant concentration go to zero), the polypeptide chain undergoes "specific collapse." By this I mean that if the structures were to be frozen immediately after the chain contracts, there would be very few interactions present in these conformations that are absent in the native state. By a nucleation mechanism, the set of specifically collapsed states then rearranges to the native state. The physical picture of two-stage folding, namely specific collapse followed by a nucleation mechanism, now has considerable experimental support. Some beautiful fast-mixing experiments with high time resolution have measured these two distinct time-scales for proteins classified as two-state folders. The "energy landscape" of two-state folders (some people use the word funnel) has a dominant attractor, known as the Native Basin of Attraction (NBA). For these systems, the probability of getting trapped in metastable states for any length of time is negligible. A number of proteins are known to fold by traversing long-lived intermediates. For these proteins, competing basins of attraction impede spontaneous folding to the NBA. To qualitatively describe the folding scenario for this class of proteins (as well as large RNA molecules), the notion of topological frustration plays an important role. In naturally occurring polypeptide chains, the linear density of the hydrophobic residues is more or less uniform across the length of the chain. From database analysis, one knows that at least 50% of amino-acid residues are hydrophobic, while the rest are polar or charged.
Under favorable folding conditions, various local structures (helices, strands, and loops) clearly ought to form first, since they consist of residues that are in close proximity. Let us assume that one forms a sequence of secondary structures starting from the amino terminus and going toward the carboxy terminus. As an example, consider a native state consisting of a helix, a strand, and another helix. There are a number of distinct ways to pack secondary structures, only one of which would be the native state. Let the structure in the NBA be one with a helix connected to a strand, with the second helix packed against this motif. The other structure, in a competing basin of attraction (CBA), could be one in which the strand and helix are connected by a loop with the other helix packed against it. If the free-energy associated with the CBA is comparable to that of the NBA, it is possible that this state will interfere with the folding process. This is very different from the two-state system, in which there is only one dominant basin of attraction. There is an intrinsic incompatibility between what the local structures
want the chain to do and the global fold. This incompatibility is termed topological frustration.
Figure 2. [Energy-landscape schematic; legible labels: "Unfolded states", "Native state".]
Gō and Wolynes have argued that in order for a protein to fold rapidly without getting trapped in the CBA, the interactions among all segments must be compatible. This would make the polypeptide chain minimally "frustrated." Many natural sequences are minimally frustrated. It is likely that proteins that require chaperones to fold are topologically frustrated. The energy landscape for topologically frustrated polypeptide chains is rugged [Fig. 2]. In such a rugged landscape, there are a few CBAs besides the NBA. Due to the presence of deep metastable CBAs, the folding of proteins (or RNA) with such an underlying energy landscape is intrinsically sluggish. Imagine the folding of a large number of denatured molecules, each of which is of somewhat different conformation. If the NBA occupies the whole volume, the pool of denatured molecules does not become kinetically trapped. But if the volume corresponding to the NBA is small, a certain fraction of molecules will become trapped in the CBA. In this case, the molecules will reach the native state only on long time-scales (equal
to the time required to cross the free-energy barrier separating the CBA and the NBA). This is the case for Rubisco, for which only around 5% of the molecules can spontaneously reach the NBA without the help of chaperonins. Thus, for Rubisco, under conditions of interest, the NBA occupies only a small volume in the free-energy landscape. The mechanism by which the initial pool of molecules partitions between CBAs and the NBA has been termed the kinetic partitioning mechanism (KPM). The folding of Tetrahymena RNA (a large ribozyme) and the protein lysozyme may be quantitatively described using the KPM. Using the KPM, one can write an equation for the fraction of molecules that has not folded by time t after the initiation of folding. This equation is parametrized by the partition factor, Φ, which is the amplitude (or yield) of the fast process. In other words, Φ is the fraction of the molecules that would fold spontaneously in the absence of chaperonins. The remaining fraction of molecules (1 − Φ) becomes trapped in one or more of the CBAs. The value of Φ for large RNA (such as the group I intron) is between 0.05 and 0.1, whereas Φ for lysozyme is around 0.25. Whenever Φ is small, the refolding kinetics is dominated by activated transitions out of the metastable CBAs. From general theoretical considerations, we have shown that the number of viable CBAs has to be small. For lysozyme there appears to be only one dominant CBA. Needless to say, for two-state folders, Φ is on the order of unity. The notion that all proteins have a non-zero Φ value (i.e., a small fraction of molecules reaching the NBA rapidly, even under non-permissive conditions) will play an important role in our mechanism for chaperonin-assisted folding.
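The KPM expression for the fraction of unfolded molecules can be sketched as a biexponential in which the partition factor Φ is the amplitude of the fast phase. The two rates below are illustrative assumptions chosen to separate the time-scales, not measured values.

```python
import math

def fraction_unfolded(t, phi, k_fast, k_slow):
    """KPM: fast phase of amplitude phi, plus slow escape from the CBAs."""
    return phi * math.exp(-k_fast * t) + (1.0 - phi) * math.exp(-k_slow * t)

phi = 0.05      # partition factor, e.g. a large group I intron
k_fast = 10.0   # 1/s, assumed rate of direct folding to the NBA
k_slow = 0.01   # 1/s, assumed rate of activated escape from the CBAs

# After the fast phase (~1 s) nearly the whole (1 - phi) fraction is still
# trapped; only on the slow time-scale does the trapped pool anneal out.
for t in (0.0, 1.0, 100.0):
    print(f"t = {t:6.1f} s  unfolded fraction = "
          f"{fraction_unfolded(t, phi, k_fast, k_slow):.3f}")
```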
Dissecting problems associated with assisted folding

Four points must be clarified in order to understand how the GroEL/GroES nanomachine functions:

1. The architecture of the GroEL/ES particles;
2. Since the job of this nanomachine is to rescue substrate proteins otherwise destined for aggregation, one may wonder whether assisted-folding rates of SPs increase. Several studies have shown GroEL to increase SP folding rates by a factor of 2 or 3, at best;
3. One knows from the turnover cycle that GroEL undergoes large conformational changes as a result of binding to SP, GroES, and ATP hydrolysis. What is the microscopic basis of such allostery? How are the allosteric transitions coupled to SP folding?
4. The dictionary definition of a machine is a "device that transmits or modifies force or motion." Is a force being transmitted as a result of coupling between the allosteric transitions and the SP? This question is fundamentally related to the annealing function of this class of nanomachines.
We will now discuss more fully the architecture of GroEL/GroES, which suggests a logical, but incorrect, mechanism for its function. GroEL is a heptameric ring with an unusual, non-crystallographic, seven-fold symmetry. In the state shown in [Fig. 1], the dimensions of the top cis ring (in green) are larger than those of the bottom trans ring. GroES essentially fits on top like a dome. The overall architecture of the chaperonin has the general shape of a cylinder. There is a hole in the middle, but it does not pass all the way through: the base of this structure (referred to as the equatorial domain) acts as a wall that separates the two rings.
Figure 3.
The GroEL particle in the unliganded state, meaning that neither SP nor GroES is bound to it, is displayed in [Fig. 3]. In order to understand allosteric transitions in GroEL, it suffices to divide each of its subunits into three interconnected rigid domains. One of them is the equatorial domain, shown in blue in Figs. (3a) and (3b). Over two-thirds of the mass of the ~60 kDa subunit resides in the equatorial domain. The cylinders correspond to helices and the strands are shown as arrows. The intermediate (I) domain is in green and the apical (A) domain is in red [Fig. 3]. The I domain can undergo hinge-bending-type motions around the two residues shown in the circles. The architecture of GroEL with GroES bound clearly shows that the domains move as rigid bodies; i.e., they move as blocks. If the individual domains in the unliganded and liganded states are superimposed, the root-mean-square deviation of the α-carbon atoms is only between about 0.7 and 1.5 Å. This shows that for all practical purposes this large conformational change may be viewed as the motion of three rigid domains (equatorial, intermediate, and apical) around the two hinges. The structure of the co-chaperonin GroES [panels c and d in Fig. 3] shows that it has nine β-strands and many loops. There is one key loop, referred to as the "mobile loop," which is highly disordered when GroES is unliganded. The mobile loop interacts with GroEL, forming an interface with residues located in the apical domain. In the process, the mobile loop becomes ordered and adopts a β-hairpin structure. When complexing between GroES and GroEL takes place, the mobile loop undergoes a substantial conformational change. The long, hairpin-like loop moves around 9 Å and swings away from the core of GroES by about 20 Å. This strong structure-inducing interaction occurs because of the formation of an interface between GroES and GroEL.
We will return to the alterations in interactions between SP and GroEL due to GroES binding a bit later. When GroES and ATP bind to the cis ring (the top ring) [Fig. 1], the volume of the cavity doubles: it goes from 85,000 Å³ in the unliganded state to 175,000 Å³ in the state with GroES bound to GroEL. This change involves a spectacular rearrangement. The doubling of volume plays a key role in the way GroEL functions as an unfolding machine. The cavity dimensions provide an upper bound on the size of protein molecules that may be accommodated. From analysis of the structures in the Protein Data Bank (PDB), the radius of gyration of the native state of proteins is found to scale as aN^(1/3), where a = 3.8 Å and N is the number of residues. This scaling provides a very good fit to the experimental data. Small-angle X-ray scattering data on a number of proteins show that in the misfolded state the radius of gyration of the molten globule is about 10% to 15% larger than it is in the native state. The maximum volume available in the Anfinsen cage after GroES and ATP bind is
175,000 Å³. Within this volume, one can fit polypeptide chains whose radius of gyration is up to around 35 Å. In the misfolded state, the dimension of Rubisco, which has around 500 amino-acid residues, is around 32 Å. If such a large polypeptide chain were fully encapsulated within the cavity, there would be sufficient room for only one layer of water molecules.

Question: Is the internal residue hydrophobic?

Response: This is a key point to which I am coming, but it does not appear to always be the case.

For the nanomachine to function efficiently, a kinetic constraint must be satisfied. Under folding conditions, a polypeptide can kinetically partition, which means a fraction can very rapidly go to the native state. Secondly - and this is important - GroEL does not recognize the polypeptide chain (the SP) if it is presented in the native form. Thus, molecules that fold rapidly will not further interact with GroEL. The remaining fraction of unfolded proteins goes into the misfolded state, which has errors that must be undone before the native state can be reached. A number of residues are exposed to water in misfolded states. The chaperonin must capture the misfolded chain on a time-scale shorter than the time required for aggregation to take place. Since recognition of the misfolded SP by GroEL takes place on a diffusion-limited time-scale, this simple kinetic constraint is easily satisfied. The presence of a cavity led to the proposal that GroEL functions by sequestering the SP and letting it fold as it does in very dilute solution. This model, referred to as the Anfinsen cage, does not allow for a dynamic role to be played by GroEL in the rescue process. Experiments have clearly shown this model to be incorrect. The major reason is that the SP is released from the cavity with each cycle, regardless of whether or not it is folded. Moreover, recent experiments also suggest that GroEL acts as an "unfolding machine".
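The fit of a misfolded, Rubisco-sized chain into the expanded cavity can be checked with the numbers above. The radius-of-gyration scaling is the one quoted from the PDB analysis; idealizing the cavity as a sphere is my simplifying assumption for illustration.

```python
import math

def radius_of_gyration(n_residues, a=3.8):
    """Native-state R_g in angstroms, scaling as a * N**(1/3) with a = 3.8 A."""
    return a * n_residues ** (1.0 / 3.0)

# Cavity in the GroES/ATP-bound state, idealized as a sphere (assumption).
cavity_volume_A3 = 175_000.0
cavity_radius = (3.0 * cavity_volume_A3 / (4.0 * math.pi)) ** (1.0 / 3.0)

rg_native = radius_of_gyration(500)   # Rubisco, ~500 residues
rg_misfolded = 1.10 * rg_native       # molten globule ~10-15% larger

print(f"cavity radius ~ {cavity_radius:.0f} A")   # ~35 A
print(f"native R_g    ~ {rg_native:.0f} A")
print(f"misfolded R_g ~ {rg_misfolded:.0f} A")    # close to the cavity radius
```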
These considerations show that GroEL actively participates in the annealing function. The active participation of GroEL is most evident when we consider the allosteric transitions of GroEL as it interacts with the SP, with ATP, and with GroES. Let us begin by considering GroEL to be in the tense state, T [Fig. 4]. Upon addition of ATP and GroES, the GroEL particle accesses the R state, the R′ state, and the R″ state, before returning to the T state. This is what one ring of the GroEL particle undergoes during a single turnover cycle. The crystal structure of the T state corresponds to GroEL in the unliganded state. The crystal structure of the R″ state (without the
SP bound to it) is known. This is the state that is populated after ATP hydrolysis [Fig. 1].
Figure 4. [Schematic of the turnover cycle; legible panel labels include "capture" and "hydrolysis & ring conditioning".]
To understand the events that occur in the cycle [Fig. 4], we begin by describing the structure of the apical (A) domain in more detail. The key structural elements in the A domain are the two helices, H and I. Their sequence composition shows that these helices contain a number of bulky hydrophobic residues. Upon oligomerization to form the heptamers, the A domains are arranged in such a way that they line the mouth of the cavity. Thus, the SP encounters an almost continuous lining of hydrophobic residues as long as GroEL is in the T state. Recall that GroEL recognizes the SP as long as it is misfolded. It follows that the interaction between the SP and GroEL is a non-specific attractive interaction
between the exposed hydrophobic residues of the SP and the hydrophobic residues that line the mouth of the cavity. The lack of substrate specificity is what makes GroEL a very "promiscuous" nanomachine. As long as one presents any substrate (a random polypeptide chain) with exposed hydrophobic residues, GroEL will bind the molecule. Thus, the first event in the cycle [Fig. 4] is the capture of the SP by T-state GroEL. The SP, which prefers the T state, resists the transition to the R state. The transition to the R state can only be driven when ATP binds to residues in the equatorial domain of GroEL, where the nucleotide-binding sites are situated. The T/R transition, which is resisted by the SP, is activated by ATP binding; ATP thus drives this equilibrium allosteric movement, and because the SP resists the transition, some work is done on the SP. So what happens when ATP is bound to the E domain of GroEL? Since the structure of the GroEL particle may be thought of in terms of three rigid domains, it follows from the crystal structures that upon binding of ATP, the I domain collapses onto the E domain. In particular, the I domain rotates by 25 degrees onto the nucleotide-binding sites, blocking any further entrance or exit of ATP molecules. When this occurs, the nature of the SP-GroEL interaction begins to change. The completely attractive interaction between the SP and the apical domain changes, becoming slightly repulsive. This occurs because the bulky hydrophobic residues in helices H and I withdraw from the SP. According to our theory, this change in character of the SP-GroEL interaction is the key annealing action of the nanomachine. The change occurs in a much more spectacular fashion after the GroES particle has bound to GroEL. The equilibrium transition from T to R, which can be driven with or without SP, is fully concerted.
(The concerted nature of this transition is similar to the workings of a Chinese vegetable steamer: since the lids are all interconnected, when one is lifted, the others all respond in unison.) The T/R transition works in exactly the same way; when one of the subunits undergoes a transition to the R state, all of them do so. Lorimer has done experiments tethering just two adjacent subunits, using cysteine residues. With this construct, the T/R transition cannot be triggered upon addition of ATP. The crystal structure of the R state is not known at atomic resolution. However, cryo-electron microscopy studies indicate it to be different from the T state. After the addition of ATP, which triggers the T/R transition, GroES can bind. Upon addition of GroES, the apical domain swings upward by around 60 degrees, simultaneously twisting by 90 degrees [Fig. 3b]. All seven subunits undergo this transition in a concerted manner, which results in the cavity volume changing from
85,000 Å³ to 170,000 Å³. The hydrophobic residues in the R′ state, which are buried in the T state, become exposed. The mobile loop of GroES then binds to these exposed hydrophobic residues. Structural analysis shows this large domain movement to result in dramatic changes in the accessible surface areas of certain GroEL residues. Let us consider the hemicycle [Fig. 4] again and analyze the changes that take place in going to the R state from the perspective of the SP. Initially, the SP finds an attractive hydrophobic surface, until the addition of ATP and GroES. Then the SP basically interacts with hydrophilic or charged residues, rendering the SP-GroEL interaction repulsive. Alteration of the very nature of the SP-GroEL interaction (going from attractive to repulsive) as a result of ATP- and GroES-driven allosteric movements in GroEL is the fundamental annealing action of this nanomachine. The annealing action alters the underlying free-energy landscape of the SP in a manner similar to the better-known simulated annealing protocol used in simulations. The next step in the driven cycle [Fig. 4] involves ATP hydrolysis at the seven sites. Upon ATP hydrolysis, which occurs simultaneously at all seven sites, transition to the R″ state takes place with ADP, SP, and GroES all bound to GroEL. This process is followed by the release of the SP, GroES, and inorganic phosphate. The cis ring of GroEL then relaxes to the T state. In order to understand how this happens, one must consider the movements in the trans ring. A signal is sent from the trans ring, which I will briefly discuss a little later. The only equilibrium step in the hemicycle is the T/R transition. All other allosteric transitions shown in [Fig. 4] are driven. Once these processes are initiated (upon adding GroES), the cycle proceeds to completion, until the cis ring of GroEL relaxes to the T state. The cycle time is around 13 seconds, after which the timer is reset.
Once the cycle is complete, the deck is cleared and the process can begin anew, except that it alternates between the cis and the trans rings. The SP sees a largely hydrophilic surface after the transition to the R state. The time during which the SP encounters a hydrophilic surface is the productive folding time. During this interval the SP is fully encapsulated within the cavity, provided the dimension of the SP is not too large. The only problem with making quantitative predictions using our model is that the total time during which the SP sees both the hydrophobic and hydrophilic surfaces is around 13 seconds; we do not know how that time is partitioned between these events. One of the puzzles in this field is understanding what is sacrosanct about these 13 seconds. Why doesn't the alternation between the hydrophobic and hydrophilic surface occur
more often? If changes in the nature of interactions between the SP and the GroEL cavity occurred more often, the efficiency of this nanomachine would increase.

Question: When you are in the R′-state cavity, the hydrophobic residues of the substrate are exposed to the hydrophilic residues of the cavity. You then have molecular dynamics occurring. So you just have thermal motion occurring?

Response: Yes. I will describe that.

Question: And ATP hydrolysis occurs afterward, just to prepare it for the release?

Response: Yes. Hydrolysis of the ATP conditions the cis ring for product release. In this sense, ATP hydrolysis acts as a timing device. The efficiency of product release is greatly enhanced if ATP and SP bind to the trans ring.

Question: In the first step, going from T to R, are mainly hydrophobic interactions involved?

Response: The capture of the SP by the T state is due to favorable hydrophobic interactions between the residues of the SP and those in the H and I helices in the apical domain. Upon ATP binding, the annealing action begins. In the R state, the SP interaction with GroEL is weaker.

Question: If I understand correctly, there is seven-fold symmetry in the cavity as well. Is this significant for re-folding?

Response: There is seven-fold symmetry in the cavity, but its significance is not fully known. There is significance to the ring structure: in this structure, efficient work may be conducted on the SP as the GroEL particle undergoes large, ligand-driven domain movements. There are other chaperonins that have eight-fold symmetry. I don't think there is anything about the seven-fold symmetry that is crucial to the function of the chaperonins.

Question: Is there water in the cavity at this point? If there is no water, hydrophobic residues are quite happy being exposed to hydrophilic residues.
Response: There may be a sheet of water lining the walls of the cavity. We are currently carrying out extensive molecular dynamics simulations to test the role of water in the capture of SPs by the apical domain.

Let us return to the events in the cycle. In general, unless it is really very large, the SP interacts with only a subset of the seven available subunits. What happens as a result of the change in the character of interaction between the SP and the walls of the cavity? According to the kinetic partitioning mechanism, the SP can fold with some probability, or it can misfold. Once it folds, the SP is no longer recognized by GroEL. If it is not folded, it rebinds, and the process continues over and over again, until the native state is reached. It is for this reason that Lorimer and I coined the phrase "iterative annealing mechanism" in 1995 to describe the annealing function of this nanomachine. The annealing action (the change in the interaction between the SP and GroEL) is repeated until a sufficient yield of the native state is obtained.

Question: How does the complex know whether or not the protein is folded?

Response: It doesn't; it's a stochastic machine. It simply gives the SP another chance to fold.

Question: That means this complex might recognize some folded molecules trying to fold...?

Response: No. It does not recognize molecules in the native state, or if it does, only very weakly. In the folded molecules, most of the hydrophobic residues are buried in the core. As a result, GroEL cannot recognize the SP in its native state. Non-specific hydrophobic-hydrophobic recognition is required for the SP to be captured by the GroEL particle.

Question: Does that mean that this type of complex is recognizing... does this then explain the folding of a molecule that is essentially soluble?

Response: Essentially soluble...?

Question: Yes, because you also have proteins that are inserted within membranes...
Response: Yes, that's right; soluble proteins. Folding of membrane proteins is an entirely different problem.

Question: What happens if the folding time of a protein is more than seven seconds?
Response: I don't know of examples of single-domain E. coli proteins that take much longer than seven seconds to fold. This brings me to another issue, on which I briefly commented earlier: the decision whether or not to partition to the native state, in all cases that have been experimentally studied, occurs on a time-scale of at most one second. One wonders why this machine turns over only every 13 seconds. This is an important conceptual question. An additional comment: single-molecule studies indicate that, on average, GroEL interacts with GroES for around seven seconds. However, based on experiments done at UCSB and by Yoshida's group in Japan, there is substantial dispersion about this average time.

Now that we know the basis of the GroEL annealing action, we can anticipate a couple of scenarios for its function. We know that, minimally, two time-scales are associated with the SP. One is associated with rapid folding, and the other corresponds to the transition time from the CBA to the NBA. The SP "feels" the hydrophobic wall for roughly τ_H (no folding); the productive folding time is τ_P. Consider the scenario appropriate for Rubisco. In this example, the fast time-scale of SP folding is comparable to the productive time-scale, which is less than the slow time-scale for escape from the CBA. It is easy to show for this case that after n turnovers, the fraction of molecules that have reached the native state is given by the simple formula Ψ_N(n) = 1 − (1 − Φ)^n. By measuring the fraction of native state as a function of time (proportional to n), the value of Φ may be calculated. For Rubisco, Φ is on the order of 5%, which implies that only 5% of the molecules fold spontaneously. To obtain a 90% yield of native Rubisco, around 20 cycles of the GroEL machine are required. In 20 iterations, 140 ATP molecules would be consumed. The quantity of ATP consumed is only around 5% to 10% of the total energy the ribosome spends in synthesizing Rubisco.
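The iterative-annealing yield formula can be evaluated directly. Note that reproducing "~20 cycles for a 90% yield" requires Φ at the upper end of the 0.05-0.1 range quoted earlier for large substrates; with Φ = 0.05 the same formula needs about 45 cycles. The choice of Φ values below is mine, taken from that quoted range.

```python
import math

def native_yield(phi, n_cycles):
    """Psi_N(n) = 1 - (1 - phi)**n: yield after n rounds of kinetic partitioning."""
    return 1.0 - (1.0 - phi) ** n_cycles

def cycles_for_yield(phi, target=0.90):
    """Smallest n such that native_yield(phi, n) >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - phi))

for phi in (0.05, 0.10):
    n = cycles_for_yield(phi)
    # 7 ATP are hydrolyzed per cycle (one per subunit of the cis ring)
    print(f"phi = {phi:.2f}: {n} cycles, {7 * n} ATP, "
          f"yield {native_yield(phi, n):.2f}")
```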
Hence, lavish consumption of ATP is a small price to pay for ensuring that the SP folds properly. This argument leads one to believe that GroEL is a nanomachine that must have developed very early in the evolutionary process. In the second scenario, we consider the case in which the fast time-scale is much smaller than the productive time-scale, which is comparable to the slow time-scale. After n iterations, or after having spent n × 7 ATP molecules, the fraction of the native state is given by 1 − [(1 − Φ)/e]^n. If the partition factor is 5% for a set of
proteins, then in a couple of iterations, or by spending only 14 ATP molecules, one obtains nearly 90% of the native state. There could be a class of protein molecules in E. coli that follow this scenario; however, they have not yet been identified. The annealing action of GroEL involves the coupling of various time-scales. Minimally, at least seven distinct time-scales are associated with the many allosteric transitions that GroEL undergoes as a result of binding to ATP and GroES. Competition between these time-scales makes it difficult to analyze the kinetics of coupling between SP folding and the GroEL allosteric transitions. It would therefore seem desirable to look at a limiting case. In vitro experiments have proven invaluable in analyzing the coupling between GroEL allostery and SP folding in the absence of GroES. Many SPs may be folded without requiring GroES. In this simpler case, one need only examine the relationship between the T-to-R transition and SP folding. The T/R transition is characterized by two time-constants. For the case when GroES is absent, we posed the following question: What happens to the folding time of the SP (relative to the spontaneous case) as the equilibrium constant for the T/R transition is varied? We answered this question using simple lattice models to represent GroEL and the SP. We modeled GroEL as a cubic box. The walls of this cavity are initially considered hydrophobic; with time, the nature of the wall changes to hydrophilic. Using standard methods, we generated random sequences to represent the SP. For random sequences, the SP folding time is expected to be long. This mimics non-permissive conditions that trigger the need for GroEL-assisted folding. To mimic the fundamental annealing action of GroEL, we introduced an additional interaction between the SP and the wall. Using this model, we encapsulated the SP within the box.
We let the wall remain in the hydrophobic state for a specified time, after which it was allowed to remain in a hydrophilic state for an arbitrary duration. The ratio of these times is the equilibrium constant for the T/R transition, which in our model may be varied arbitrarily. For this model, we predicted that the "GroEL-assisted" folding rate of the SP decreases as the equilibrium constant of the T/R transition increases. This theoretical prediction was experimentally validated by Horovitz and Yifrach. What is the microscopic effect on the SP as the nature of the cavity wall changes? The toy model shows us the state of the protein before and just after the hydrophobicity changes (there is a distribution, which depends on the number of samples). Upon inducing this change, we find that the SP falls into a distribution of states with varying probabilities. There are a large number of misfolded conformations after the annealing occurs. In addition, the native state is accessed with a certain probability. All these findings are consistent with the proposed iterative annealing mechanism. This exercise showed us that the SP undergoes
342
D.
Thirumalai
kinetic partitioning with each iteration. Analysis of the SP conformations showed that upon the annealing action, the SP has nearly completely unfolded. In the process of annealing, GroEL nearly completely unfolds the SP in order to help it fold! Detailed experiments confirm this picture. Some subunits move significantly over the course of allosteric transitions. It is therefore possible that some force is being transmitted to the SP. This model rationalizes how global unfolding of the SP may be induced by force. In the T state, adjacent subunits are around 8 Å apart. The diagonal distance is about two nanometers. If an SP were pinned at two adjacent subunits, as a result of GroEL domain movement, the adjacent subunits, initially around 8 Å apart, would move to a distance of 2 nm. We can estimate the magnitude of the force transmitted to the SP by means of dimensional analysis. Using kBT as the energy scale, the stretching force is on the order of 20 piconewtons. In AFM experiments that stretch muscle proteins, the typical forces used are in the 60-to-200 pN range, depending on the pulling speed. The AFM experiments were carried out at a pulling speed of about 1 μm/s. What should the effective pulling speed be in GroEL? It is known that the T/R transition takes place in around a microsecond. Using dimensional analysis, we estimate the GroEL pulling speed to be on the order of 10⁻⁶ micrometers per second, which is around 4 to 5 orders of magnitude slower than the AFM experiments on the muscle protein titin. We know that if we pull things slowly, we need less force. Therefore, the magnitude of the force transmitted by GroEL (20-to-50 pN) is sufficient to unfold nearly all globular proteins. As a result of the complicated domain movement that occurs in the GroEL/ES particle, the SP experiences a stretching force that globally unfolds the SP. Thus, the microscopic annealing action that enables the SP to fold is preceded by global unfolding!
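Two numbers in the preceding discussion can be reproduced with back-of-the-envelope arithmetic. The sketch below is illustrative only: the 7-ATP-per-cycle bookkeeping (so that 14 ATP corresponds to two annealing cycles) and the choice of about 6 kBT for the stretching energy are my assumptions, chosen to match the quoted "nearly 90%" and "order of 20 pN" figures.

```python
# Back-of-envelope numbers for iterative annealing and the stretching force.
# Assumptions (mine, not from the talk): 7 ATP hydrolyzed per GroEL cycle,
# and a stretching energy of roughly 6 kBT released over the displacement.

KT = 4.1  # thermal energy kBT at ~300 K, in pN*nm

def yield_after(phi, n_cycles):
    """Native yield if a fraction phi partitions to the native state per cycle."""
    return 1.0 - (1.0 - phi) ** n_cycles

# "14 ATP -> nearly 90% native" reads as 2 cycles at 7 ATP each,
# which implies a per-cycle partition factor phi of about 0.68.
phi = 1.0 - (1.0 - 0.90) ** (1.0 / 2)
print(f"partition factor per cycle: {phi:.2f}")
print(f"native yield after 4 cycles: {yield_after(phi, 4):.2f}")

# Force: adjacent subunits move from 8 A to 2 nm, i.e. dx = 1.2 nm.
dx = 2.0 - 0.8              # displacement in nm
force = 6 * KT / dx         # pN, assuming ~6 kBT of stretching energy
print(f"stretching force: {force:.0f} pN")
```

With a per-cycle partition factor near 0.7, a handful of ATP-driven cycles drives the native yield toward unity, which is the essence of the iterative annealing picture.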
Question: I have two questions. The first is about the force interpretation. In the titin experiments, the ends of the protein are covalently connected to the pulling apparatus. In such a case, I presume there to be only weak interactions. Did your force calculations take the elasticity of the interactions into account? Response: We are setting up simulations to do this more precisely. It is just an order of magnitude. You are absolutely right. Question: My second question concerns the power stroke and water again. I presume there is enough space between the trans and cis rings for water to flow between the two cavities?
Chaperonin-Mediated Protein Folding
343
Response: There is one layer of water. Question: So one could perhaps imagine water being pumped from one cavity to another? Response: It is likely, but I don't really know. You have to do these experiments dynamically. In the large-scale motion of its equatorial domain, the upper ring is doing some work on the lower ring, and whether that causes some of the waters to flow and lubricate is an interesting question. Comment: That could be one element of the cooperativity between the two cavities. Response: It is anti-cooperative between the cavities; it is cooperative within the cavity. Question: What is known about the sequence of processing of the polypeptide chain after it leaves the ribosome? You have these folding enzymes like proline cis-trans isomerase, and you have modifications of amino-acids; is this all occurring before the polypeptide chain enters GroEL, and if so, how is this organized? Response: Yes; there are many other chaperones making sure it gets to the correct cellular compartment for folding. There are around 5 or 6 chaperones, and I don't believe their crystal structures are known yet. That's a very complicated zoo of interacting systems. Question: You mentioned this highly concerted interaction between subunits. Is there any understanding at all of how that works; of how they are coupled together? Response: Ahhh... Next question. Question: Ok, then a question related to that one (this may refer to Olivier Lichtarge's discussion). How many sequences are there for these things? Is there a chance that one could use the kinds of analyses discussed by Olivier Lichtarge to look for features of coupled protein interactions? I'm asking because I'm very interested in the topic of viruses and the way they behave. Response: George Stan, a postdoctoral fellow in my lab, has already begun to do some sequence analysis of this; to see what pieces of this GroEL particle are
conserved, how they might interact, etc. I haven't had a chance to get into that. Certain residues are absolutely conserved. Those regions that involve nucleotide binding are conserved. The nature of the amino-acids in the H and I regions of the apical domain is conserved. Comment: I just want to say there are enough sequences to run ET on the chaperones, and we have done that. The results are around somewhere. It is so hard to interpret, and I know so little about chaperones, which basically have all those active sites. Perhaps now is a good time to get it together. Response: We've done some analysis that I will talk to you about during a break. Question: Is there any idea of what is so special about the tertiary structure of these proteins that requires chaperonins in order to fold? Response: One of the questions concerns the natural substrates of these things and whether there is anything special about them. People erroneously thought GroEL only rescues proteins whose architecture is alpha-beta. Yoshida's experiments showed that GroEL is a "promiscuous" nanomachine. GroEL can even recognize random sequences, as long as they are not in their native states. Thus, there is no preferential native-state architecture that requires the GroEL nanomachine. Question: Concerning this allosteric effect that leads to the change in the shape of the cavity, ATP is required; do you have any idea of the amount of ATP required to initiate the process? Response: That is experimentally known. The ATPase activity is a function of the measured ATP concentration. It is different for the T and R states; ATPase activity is greater in the T state than in the R state. Under physiological conditions the concentration of ATP is sufficiently high that all seven subunits are ATP-bound. Question: I have a few questions about the role of GroEL in preventing aggregation. What is known about the concentration-dependence of the efficiency of the GroEL substrate?
Response: There exist computations, some of which we have done, and more importantly, there are some experiments that are, shall we say, producing data,
although nothing systematic about this is yet known. It is an interesting proposition that something like this might prevent aggregation, at least temporarily, when associated with these things. Perhaps we will have more speculations in a couple of years. Question: I have another question. What is the experimentally measured maximum size of a protein that can be the substrate of GroEL? People must have tried proteins of different sizes. Response: Rubisco is possibly the largest protein that can be fully encapsulated in the cavity. Now, the question I just posed: Can it help proteins that are much larger, and what is the mechanism of that?... Comment: A multi-domain protein. Response: This is just starting; experiments are beginning in two labs. Question: My question is about Anfinsen's principle. As you said, many proteins can fold without being helped by chaperonins. This is interpreted to indicate that the native-state corresponds to the state of the lowest free-energy. That is Anfinsen's principle. Are the proteins that must be helped by chaperonins violators of Anfinsen's principle? I think your mechanism suggests that they are. Response: No. Comment: Let me offer my reasoning as to how your mechanism suggests the violation. You infer from the kinetic partitioning. Those that go to the non-native state aggregate, I think. Response: If you wait long enough. Question: Yes, Anfinsen's principle is concerned with the equilibrium state, so one has to wait, say, infinitely long. Those that go to the non-native state go on to the aggregated state. Those that go to the native state do just that, but the native state sometimes fluctuates again to the unfolded state, because it is equilibrated and stabilized by a marginal amount of the free-energy. Then it again either goes to the native state or to the non-native state. If it goes to the non-native state, it would eventually go to the aggregated state, which is very stable. If you leave the urea
molecules in for a long time, all of them would go to the aggregated state, which has lower free-energy than the native state. In the absence of chaperonins, is the state of least free-energy the aggregated state? Response: There are many experiments proving that proteins can aggregate at that kind of concentration. First of all, a given protein at a sufficiently high concentration will aggregate, as shown over and over by experiments, most recently by Dobson. I do not know why you are surprised by this. Secondly, there are a number of proteins whose folding is under kinetic control. Comment: What I am saying is that under a certain given concentration, the state with the lowest free-energy may be the aggregated state. Under such a condition, maybe chaperonins are necessary to allow the protein to go into the native state, and that that is not the state with the least free-energy. I think that is what you are saying with your mechanism. Comment: There is a distinction between an aggregated state and a kinetic trap; a misfolded state. It might not even be a state; it is a kinetic trap, a transient state. It is stable, but not an aggregated state. This kinetic trap is less stable than a native structure. At high concentration, you will definitely have an aggregate from any protein. Response: It is a fact of life. Comment: Yes, a fact of life. This mechanism is telling us how a single protein in a misfolded state - not yet an aggregate - can be turned into a native state just by giving it another chance to fold. The chaperonin is taking a protein that is in a misfolded state, pulling it apart, and letting it fold again. So it is not yet an aggregated state. But at high concentrations and with a lack of chaperonins, you will get an aggregate. Response: The phase diagram of even two-state proteins is very complicated, as one varies temperature, concentration, pH, salt, etc., and what Mr. Anfinsen taught us is what happens at infinite dilution.
In this sense, the Anfinsen hypothesis is not violated. Comment: A question was asked about the size of the protein that can use this mechanism. It might be amusing to tell you about a phage that has a protein for
which it wants to use the GroEL/ES system to refold, but it is too big. So the phage synthesizes its own homologue of GroES that fits into the system and creates a larger internal cavity in order to allow the phage protein to use this mechanism. It is very clever. Question: I also want to talk a little bit about the size limitation. What is peculiar is that the major function of this protein should be to fight heat-shock, right? Response: Correct. Comment: Apparently, all proteins, large and small, partially denature. From the biological point of view, it would be unclear why you would select machines to repair only small proteins, unless there is something specific in the structure of large proteins. Response: This is not the case, and I should not leave you with the impression that these machines are unsuccessful in helping anneal big proteins. An experiment carried out at Yale has done exactly that. They took a protein much larger than may be accommodated in this cavity. It partially associates with both the cis and trans cavities. But the mechanism is probably different, because it spans this in some way. So even partial association will help proteins that are larger than may be comfortably accommodated. Question: What are the largest one-domain globular proteins? How large can they be?
Response: As I recall, the average size in E. coli is less than around 200 residues.
VIRUS ASSEMBLY AND MATURATION

JOHN E. JOHNSON
Department of Molecular Biology, The Scripps Research Institute, La Jolla, CA, USA
We use two techniques to look at three-dimensional virus structure: electron cryomicroscopy (cryoEM) and X-ray crystallography. Figure 1 is a gallery of virus particles whose structures were determined by cryoEM by Timothy Baker, one of my former colleagues at Purdue University. It illustrates the variety of sizes of icosahedral virus particles. The largest virus particle on this slide is the Herpes simplex virus, around 1200 Å in diameter; the smallest we examined was around 250 Å in diameter. Viruses bear their genomic information as positive-sense RNA or DNA, double-strand DNA, double-strand RNA, or negative-strand RNA. Viruses utilize the various structure and function "tactics" seen throughout cell biology to replicate at high levels. Many of the biological principles that we consider general were in fact discovered in the context of viruses.
Figure 1. A representative gallery of icosahedral virus particles determined by electron cryomicroscopy and image processing in the laboratory of Timothy Baker, Purdue University. Viruses or their abbreviations are indicated with each particle.
350
J. E. Johnson
Properties of a virus infection

First, and most important for biophysical studies of viruses, is that outside the host cell, the virion exists as an inert chemical molecule. When a virus particle is removed from its host it behaves like other macromolecules, such as hemoglobin or tRNA. We can study it using crystallography and other methods that require homogeneous material. Second, a virus requires a vector that will transport it into the area of the susceptible host. For human viruses, that vector is often other humans, whereas for plant viruses, it is insects. In some cases the virus infects the vector, while in others the vector is not affected by the virus, merely moving it from one host to another. Third, the virion has to attach to and enter the susceptible host cell. Receptor-virus interactions are among the most important factors affecting host range. Generally, a protein on the viral surface binds to a specific protein on the host-cell surface in order for entry to take place. Fourth, the capsid, which is extremely stable under most circumstances, must disassemble and release its nucleic acid. In many viruses there is a trigger that changes the stability of the particle when it binds to the receptor. This is an active area of antiviral research, since if that trigger can be altered, there could be protection against viral infection. Fifth, the nucleic acid enters the cell.
Table 1. Characteristics of the virus life-cycle.
1. Outside a host cell, the virus exists as an inert chemical molecule.
2. A vector transports the virus to a susceptible host.
3. The virus attaches to a receptor on a susceptible host cell, whose cytoplasm it enters.
4. At some point during this process, the protective protein shell of the virus disassembles, releasing its genome for replication.
5. Viral nucleic acid replicates and viral protein synthesis initiates (in many cases, significantly more efficiently than the carefully controlled replication and translation of the host cell's genes).
6. Assembly occurs as the viral genome and proteins are synthesized, often in specialized compartments called virosomes.
7. Assembled virus particles are released from the cell and are able to infect other cells.
Virus Assembly and Maturation
351
There are a number of modes by which the viral genome can enter the host cell cytoplasm, where it replicates and is translated into protein. These products assemble to generate new virus particles. The virions then exit the cell, usually leaving it in a gravely altered state, since viral production has usurped most of the cell's "biology" and used it to make additional viruses. That is why we get sick. Last, when the virion is released from the host cell, it goes on to infect other cells.
Virus crystallography and subunit coordinates

I will briefly discuss the techniques we use. As stated in the first point above, outside the cell, the virion exists as an inert chemical molecule. This means that when we purify it, which we can do to a very great degree, these particles organize. Two-dimensional arrays of virus particles can form, as well as three-dimensional crystals, which grow to dimensions exceeding 2 mm. Crystals of viruses can diffract X-rays very well, often to a resolution of 3.0 Å or higher, allowing calculation of electron density maps, and eventually models of the viral subunit. Those with the β-sandwich fold discussed below are shown in Fig. 2. Around 50 different virus particles have been determined at high resolution. Our group at Scripps has assembled a website called the Virus Particle Explorer, or VIPER (http://mmtsb.scripps.edu/viper/viper.html). We organized all virus capsid coordinates that had been deposited in the protein data bank and put them into a standard orientation that allows the generation of any oligomeric structure of any of the known virus structures. We have also included energy calculations for all subunit contacts. This website contains a wealth of primary and derived information on virus particles.
Viral subunit functions

Viruses obtain a maximum amount of function from a minimum amount of genetic information. All viral gene products are multifunctional. The viral capsid protein is also multifunctional. Table 2 lists a number of different functions that these proteins carry out.
[Figure 2 panels: simian virus 40 (SV40) VP1; poliovirus (type 3) VP3; bluetongue virus VP7; Flock House virus (FHV); Nudaurelia capensis ω virus (NωV); cowpea chlorotic mottle virus (CCMV); southern bean mosaic virus (SBMV). N- and C-terminal residue numbers are marked on each fold.]
Figure 2. The subunit folds of a variety of viruses whose structures were determined by X-ray crystallography. All structures were determined as intact particles. All subunits have the ubiquitous β-sandwich fold, suggesting they are related by evolution.
Table 2. Functions of the virus capsid protein in simple viruses.

1. Subunits must assemble to form a protective shell (capsid) for the viral genome.
2. Subunits must specifically package the viral genome.
3. The capsid can actively participate in the life cycle of the virus:
   a) by binding to receptors and mediating cell entry (animal viruses);
   b) some plant viruses are actively transported into the host via capsid-dependent interactions.
4. Capsid proteins mutate to avoid detection by the immune system.
First, the subunits must assemble to form a protective shell (capsid) for the nucleic acid. In the simplest cases, this is very rapid and straightforward, but for complicated viruses there are stages of assembly. The subunits must specifically package the cognate viral RNA or DNA. A recognition event takes place in order for the subunits to identify the gene that encodes them and other viral genes, and to package them with a high degree of fidelity. The capsid actively participates in the infection process by binding to the host cell surface receptor. In some plant viruses, capsids are important in helping to move virions between cells of the host. A very interesting feature of the animal virus capsids is their ability to rapidly mutate in order to avoid the host's immune system. Hence there are around 150 different strains of the common cold virus, which is basically the same virus particle that has gone through a dazzling array of mutations to avoid the immune system.
Icosahedral symmetry

All virus particles that form a spherical shell employ icosahedral symmetry. A feature of great significance is that for a given-sized subunit, represented by a trapezoid labeled A in Fig. 3, we can generate the largest possible volume. Every subunit is situated in an identical environment. Once subunit-subunit contacts are established, assembly occurs spontaneously. There is a great deal of packaging power for a very small amount of genetic information. Sixty trapezoids assemble to form this shell. They are related by 2-fold, 3-fold, and 5-fold symmetry. Figure 3 also illustrates the main-chain topology of the common β-sandwich fold of viral subunits that is discussed below. In nature, there exists no simple (60-subunit) icosahedral virus able to carry enough genetic information to be infectious. A virus must carry a minimum of one gene for a coat protein and one for a polymerase that will replicate its RNA or DNA. If you search for viral capsids that contain only 60 subunits, you will only find satellite viruses that co-infect with other fully functional viruses. Thus, a 60-subunit particle is too small to carry enough genetic information for infection; however, all larger virus particles are based on the symmetry of the icosahedron. Stephen Harrison and his colleagues at Harvard reported the first crystal structure of a virus in 1978. That was followed eighteen months later by the structure of Southern Bean Mosaic Virus from Michael Rossmann's laboratory. One of the features that emerged was that all the subunits forming these particles had the same topology and the same fold; the so-called β-sandwich fold (Figs. 2, 3).
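The 60 symmetry operations that relate these trapezoids can be generated explicitly. The following sketch is not from the talk: it assumes numpy and the common convention that the icosahedron's vertices lie at cyclic permutations of (0, ±1, ±φ); closing one 5-fold and one 2-fold rotation under matrix multiplication then yields the full rotation group.

```python
# Generate the 60 rotation matrices of the icosahedral group from one
# 5-fold and one 2-fold generator, for an icosahedron with vertices
# at cyclic permutations of (0, +/-1, +/-PHI).
import numpy as np

PHI = (1 + 5 ** 0.5) / 2  # golden ratio

def rotation(axis, angle):
    """Rodrigues' formula: rotation matrix for `angle` radians about `axis`."""
    a = np.asarray(axis, float)
    a = a / np.linalg.norm(a)
    K = np.array([[0.0, -a[2], a[1]],
                  [a[2], 0.0, -a[0]],
                  [-a[1], a[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

five_fold = rotation((0, 1, PHI), 2 * np.pi / 5)  # axis through a vertex
two_fold = rotation((0, 0, 1), np.pi)             # axis through an edge midpoint

def close_group(generators, decimals=6):
    """Multiply generators together until no new matrices appear."""
    def key(m):
        return (np.round(m, decimals) + 0.0).tobytes()  # +0.0 removes -0.0
    seen = {key(g): g for g in generators}
    frontier = list(seen.values())
    while frontier:
        fresh = []
        for m in frontier:
            for g in generators:
                p = m @ g
                if key(p) not in seen:
                    seen[key(p)] = p
                    fresh.append(p)
        frontier = fresh
    return list(seen.values())

group = close_group([five_fold, two_fold])
print(len(group))  # one rotation per trapezoid in Fig. 3
```

This is the same bookkeeping a resource such as VIPER relies on when it regenerates a complete capsid from the coordinates of a single icosahedral asymmetric unit.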
The fold of the protein was highly conserved and most of the function was encoded at the termini of the subunit. The functions consist of protein-RNA recognition and, as we will see in a little while, molecular switching, which is required to generate larger capsids.
Figure 3. A representation of the icosahedron (A) formed by 60 trapezoids. Selected symmetry axes are labeled. Each trapezoid corresponds to a viral subunit schematically depicted as a β-sandwich in (B). The β-sandwich has a characteristic hydrogen-bonding pattern depicted in (C).
I mentioned that the first two structures solved - both plant viruses - had the β-sandwich-type topology (Figs. 2, 3). The first 20 virus structures solved all had this same fold! Simian virus 40 is a DNA tumor virus; southern bean mosaic virus and cowpea mosaic virus are plant viruses, and all three have the same fold, but no detectable sequence similarity. No one anticipated that we would see this similarity in DNA viruses, RNA viruses, double-strand RNA viruses, etc. A wide variety of
viruses share this fold, suggesting that at least one evolutionary tree of viruses diverged from a common source, or that the same cellular protein was repeatedly adapted for viral use during evolution.
[Figure 4 panels: bacteriophage MS2; Sindbis virus (SINV) core protein; human immunodeficiency virus (HIV-1) p24.]
Figure 4. Viral subunit folds that do not have a β-sandwich.

However, the β-sandwich is not the only fold. The RNA bacteriophage known as MS2 was the first virus structure solved that did not have the β-sandwich. The alphaviruses have a capsid subunit that looks like the enzyme chymotrypsin, and the protein that interacts with the RNA in HIV also has a different fold (Fig. 4). Interestingly, the easiest viruses to crystallize initially all had the β-sandwich fold. About 85% of all virus structures determined have the β-sandwich fold.
Quasi-equivalent virus capsids

I stated above that there are no examples of the simple icosahedral capsid among viable viruses. This is the case because viruses learned how to make icosahedral capsids that have quasi-symmetry. Those of you familiar with Buckminster Fuller's geodesic domes will remember that this principle was discovered by man a billion years after the viruses did so. Quasi-equivalence, which was discovered by Caspar and Klug in 1962, explained the fact that particles with icosahedral symmetry contained more than 60 subunits. One can triangulate a hexagonal net in a very organized fashion by just taking unit steps along two hexagonal axes, the h and k axes. It is then possible to identify any hexagon in this lattice by the indices of the unit steps along each of these axes. If you identify a triangle, you can create a net consisting of hexagons
and pentagons. That net can be folded into a particle with perfect icosahedral symmetry, but embedded on top of it is so-called quasi-symmetry. (See the Scripps icosahedral server website for details of the particle construction: http://mmtsb.scripps.edu/viper/chunxuqu/index.html.) For a virus particle to have this organization, the subunits must be able to exist as hexamers or pentamers. Adaptability is not a problem with these particles, because they have a switching mechanism. How does the same protein switch into different environments? We know it does, but we do not know how it is achieved. I will give some examples of quasi-equivalence and molecular switching. Unrelated viruses have different switching mechanisms, which surprised us, because we thought that once nature found a way to do that, it would stick to it. But that is not the case. The simplest quasi-equivalent structure consists of 180 subunits: 20 hexamers coincident with the icosahedral three-fold axes, and 12 pentamers. We describe the particle by a number, T, defined as the sum of the square of the index along the h axis, the square of the index along the k axis, and the product of h and k. In other words, T = h² + k² + hk. What is important is that T tells you the number of environments in which a protein must sit in order to make a capsid of a particular size. For a T = 3 capsid, you must have proteins in three different environments. Somehow, you have to switch between three different states. For T = 4, you must have four different environments, for T = 7, seven different environments, and for T = 13, you need 13 different environments. I will give examples of T = 3, T = 4, and T = 7, and we will try to see how this molecular switching takes place in these examples, as determined by crystallography and microscopy.
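The T-number arithmetic above is easy to tabulate. In this throwaway sketch the function name is mine, and the (h, k) pairs are the standard ones for the T values quoted in the text:

```python
# Triangulation numbers: T = h^2 + k^2 + h*k for unit steps (h, k) along
# the two hexagonal axes. A T-number capsid has 60*T subunits sitting in
# T distinct quasi-equivalent environments.
def triangulation_number(h, k):
    return h * h + k * k + h * k

for h, k in [(1, 0), (1, 1), (2, 0), (2, 1), (3, 1)]:
    t = triangulation_number(h, k)
    print(f"(h,k)=({h},{k})  T={t}  subunits={60 * t}  environments={t}")
```

For (1, 1) this recovers the simplest quasi-equivalent capsid of 180 subunits in three environments, and (2, 1) and (3, 1) give the T = 7 and T = 13 cases mentioned in the text.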
Cowpea Chlorotic Mottle Virus (CCMV) assembly

Here is how the process works, as illustrated for CCMV [Fig. 5]. Consider a sheet of planar hexagons formed by subunits. Add pentamers, and you add curvature. All the pentamers in this closed shell are in blue, and the hexamers are in green and red. The three different colors illustrate the three different environments in which these viral subunits are arranged. All the switching takes place in the first 42 amino acids at the amino terminus of CCMV. Residues 1-27 are invisible in all the subunits and residues 27-42 are visible only in the hexamers. We do not see them in the pentamers. We can make mutants in which we delete residues 1-42, and then we just form T = 1 particles; particles with 60 subunits, because we have removed the switch. So the switch is very simple in principle, but we still do not know how it works in practice during the dynamics of the assembly process.
Figure 5. A schematic representation of the capsid assembly of cowpea chlorotic mottle virus (CCMV). (a) Hexagonal lattice with indices for various hexagons, based on a 2-dimensional grid. (b) Subunits from the CCMV crystal structure arranged on the hypothetical lattice. (c) Side-view of the lattice, showing that the array is planar. If only hexagons are formed, sheets of subunits result. (d) If pentamers are inserted at the vertices of the triangle in (a), curvature is introduced. (e) Side-view of (d) showing the curvature. (f) Blue subunits are placed on the pentamers and red and green subunits at the hexamers, showing the curved surface now occupied by protein subunits. (g) The truncated T = 3 icosahedron showing the location of the pentamers, consistent with quasi-equivalence theory. (h) The particle structure determined by crystallography and consistent with the "soccer ball" in (g). (i) Side-view of icosahedral and quasi 2-fold axes, showing closely similar dihedral angles in the truncated icosahedron.
One of the things that we discovered in this simple example - and we are going to see it again - is that in this polymorphism, around 90% or more of the subunit behaves as a rigid body. Most of the protein maintains its tertiary structure with a great deal of fidelity. Only small portions of the protein change to produce these differences.
Question: In all of the assembled states, it looks like the material is highly monodisperse. Is that the case? Response: Yes, the purity is remarkable; hence the very high quality of the crystals we can produce. Question: This may be only a somewhat related question. When most of these structures are solved, do you have to use the symmetry of the particle in averaging...? Response: We always use the symmetry of the particle to determine phases. We end up destroying whatever asymmetry there is. However, you never mix subunits that are quasi-equivalent to each other. So we maintain whatever switching mechanisms there are. If you have a T = 3 particle, that means there are three subunits in the icosahedral asymmetric unit. When we do our averaging over the icosahedral asymmetric unit, we are averaging the three subunits as individual components of that asymmetric unit. This is really critical, since we are not mixing quasi-equivalent subunits in this process. Comment: I want to understand the switch a little more... Response: So do I! Question: You said the C-terminal end is the switch? Response: The N-terminal. The C-termini sort of bring the subunits together, like a basket. The N-termini control the formation of hexamers and pentamers. Question: I think you said that it is visible in some cases and invisible in others. Could you please review that? Response: Around the hexamers we see a beautiful beta structure where six termini come together and form almost perfect 6-fold symmetry. Within experimental error they are 6-fold symmetrical. And it is a beta structure, sort of like a beta cylinder - not as extensive as many of the other beta cylinders - but it is a beta structure. That is what we see at the hexamer. At the pentamer, if you try to create a structure like that with a model, there is steric hindrance. When you pull one subunit out of the hexamer, you now bring everything closer together in the pentamer.
Remarkably, the subunit contacts are very, very similar. If you imagine taking one
piece out of a pie, basically you've got something that is planar in the hexamer. Then you take one piece out and you have to bring the sides together. The majority of the protein-protein interactions stay very similar between the hexamers and the pentamers because it is just motion that you are looking at. The N-termini are completely different; they are disordered, invisible at the pentamer axes. They seem to go in every direction. They do not average by icosahedral symmetry, whereas the ones at the hexamers do. Question: Where are the N-termini located with respect to the spherical shape; are they inside, outside...? Response: The N-termini are internal. If we look at the subunits that form this particle, at the hexamer, we see that they are the N-termini that come out (top figure, lower-right inset). At the pentamer (top figure, top-right inset), this part is completely missing; we do not see any density there. These are subunits at the hexamers (top figure, lower-right inset), and these are subunits at the pentamers (top figure, top-right inset). These are completely inside the particle. Question: Is it invisible because of the motion, or is it because of averaging? Response: I think it is probably because of the averaging. The N-termini at the pentamers are avoiding each other, and they do not avoid each other with icosahedral symmetry. There are also probably interactions with the RNA, which changes how it appears. Question: So you identified it as a switch, but can you suggest the mechanism by which it can be a switch? Response: No. All I know is that the stability of hexamers, because of this additional beta structure, is much higher than what we see at the pentamers. But how it actually occurs - the dynamics of this process - is still a mystery. Comment: I am interested because we are now creating software with which we can probably calculate a whole virus. I hope we can identify the mechanism of the switch. Response: I hope so too!
At this point in time, every time we have suggested something, it has turned out to be wrong, so I have stopped making suggestions.
J. E. Johnson
Question: I have a question related to the N-terminal. Is there anything specific about the amino-acid composition; is there anything specific, such as a lot of alanine or glycine residues?

Response: There are small residues. There clearly tend to be smaller residues. A variety of mutations have been made; you can destroy the ability of this segment to form these structures, and then they no longer form particles. I think there are a variety of sequences that will do it, but generally speaking, there are not such large residues.

Question: I apologize in advance because this is probably going to be a mathematically oversimplified question. It seems that each unit is being modeled more or less as a quadrilateral, with four corners. There is one corner which can either be in the middle of a hexamer or in the middle of a pentamer, and that's where the switch is. What is going on at the other corners? ...Because you have to make joints between the...

Response: Basically, they stay very, very similar in this particular architecture. You notice that this architecture is a truncated icosahedron. This is what we would call a soccer-ball representation of a T = 3 particle. What we see is that there is what we call quasi three-fold symmetry, and it is very high-fidelity. Those contacts are maintained with a remarkable degree of fidelity in the presence of the changes occurring at the hexamers and pentamers. There is a lot of flexibility in these subunits, and that is really required for them to be able to accommodate the various environments.

Question: So among the four corners, one is special with the switch and the other three are more or less identical?

Response: Yes, exactly, that is right, and that is a good way to put it. I wish I had put it that way.
Assembly of nodaviruses

I told you that there is more than one way to do the switching. There is another way to make these T = 3 particles. CCMV is a truncated icosahedron (Fig. 6, top). The lower figure is a rhombic triacontahedron. In each case there are 180 subunits, so formally, each polyhedron is a T = 3. However, they have different
shapes. The central triangle in the lower figure, ABC, shows subunits represented as trapezoids that interact with approximate 3-fold symmetry. There are contacts at each edge of the central triangle, and they are related by approximate 3-fold symmetry. If you put your eye along one of the edges, you find a dihedral angle of 144° between these subunits. On the other edge it is flat. One rhombic surface is essentially planar, while the other is bent. In this case, the virus must have some way to switch between bent and flat contacts. This is achieved either with protein (in SBMV and other plant viruses) or, in the case of insect nodaviruses, with the viral genome in the form of duplex RNA.
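The subunit arithmetic running through this discussion (180 subunits for T = 3, and so on) follows the standard Caspar-Klug quasi-equivalence rule. As a quick sanity check, the sketch below uses only the textbook facts that a T-number shell has 60T subunits, always 12 pentamers, and 10(T - 1) hexamers; nothing here is specific to this lecture:

```python
# Quasi-equivalence bookkeeping for icosahedral capsids (a sketch).
# Standard Caspar-Klug rule: a T-number shell has 60*T subunits,
# always 12 pentamers, and 10*(T - 1) hexamers.
def capsid_counts(T):
    subunits = 60 * T
    pentamers = 12
    hexamers = 10 * (T - 1)
    # Every subunit belongs to exactly one pentamer or hexamer.
    assert subunits == 5 * pentamers + 6 * hexamers
    return subunits, pentamers, hexamers

for T in (1, 3, 4, 7):
    print(T, capsid_counts(T))
# T = 3 gives the 180 subunits (12 pentamers + 20 hexamers) discussed here.
```

The same function covers the T = 4 tetraviruses and the T = 7 HK97 particle that appear later in the lecture.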
Figure 6. Comparison of two polyhedra consistent with T = 3 icosahedral symmetry. The truncated icosahedron of CCMV is shown on top (see Figure 5 for more details) and the rhombic triacontahedron of nodaviruses and SBMV is shown below. CCMV and SBMV subunits are shown on the right, as are dihedral angles of selected subunit contacts.

Nodaviruses have two genes that exist as two RNA segments. One encodes the capsid subunit and the other encodes the polymerase. Large quantities of nucleic acid are synthesized, followed by the protein. The capsid protein assembles rapidly,
with the two pieces of RNA inside the capsid. This takes less than ten minutes. But this particle is not infectious. The virus has to use some chemistry to cleave its subunits at a particular position; residue 363 is cleaved from residue 364. It actually uses chemistry similar to what forms cataracts in our eyes. I discovered this when I had cataract surgery and I read that you can deamidate an asparagine, or you can cleave at an asparagine, by very similar mechanisms. When we get a cataract, asparagines in the crystallin proteins of the lens are deamidated and the crystallin forms an opaque aggregate. However, there is a separate, related mechanism that will allow cleavage to occur, and that is what the virus does; it enhances the rate of that cleavage so that it will become infectious. I do not have time to talk about those features.

Question: How does this inert particle function? It's in the capsid, so it's passive; how can it activate this chemical reaction?

Response: The reaction is incorporated inside the particle by the protein-protein interactions created during assembly. The pK of the acidic residues is changed and an auto-catalytic reaction is then turned on, which is completely controlled by the environment of the residues inside the virus particle.

Question: So these are enzymatic acts?

Response: Exactly. But it only occurs once; it's not like an enzyme that turns over. It creates the environment just once. It is slow, having a half-life of four hours, but it is still around a hundred thousand or more times faster than would occur naturally. We can see which residues create that. This is required for infectivity. This is a single-strand RNA virus, so there is very specifically situated duplex formation. We can follow the density of the RNA and make a model that brings it through these three-fold axes of the particles. However, we know that this is a result of averaging, because only 40% of the RNA is visible in these duplexes, and the rest of the bulk RNA, the 60% that remains, has to drop in and out of these duplexes, and we have no idea how that occurs. The RNA in this virus is part of the capsid. This is not a passive piece of nucleic acid that is just carrying genetic information; it is also a chemical entity optimized for assembly of this particle. A whole series of experiments have shown that the evolution of this RNA has optimized it for particle stability, as well as for its genetic content. This really is a quite remarkable organization of protein and RNA and, as far as we know, the only example of this that has been seen in viruses.
Question: So there is double-information RNA?

Response: The natural existence of this RNA is as a single strand. But we know that more than 70% of it is involved in secondary structure; i.e., duplex formation.

Question: Does that mean that 70% is used for forming secondary structure and 30% for the organization of the proteins?

Response: Roughly 40% of the RNA is in the duplex form that we see; that is, visible. If you use something like Raman spectroscopy, it suggests that up to 70% of the RNA is in secondary-structure, duplex-type formation. But 40%, or nearly half of that 70%, is ordered in the X-ray structure.

Question: Is that dependent on the influence of the code?

Response: Absolutely. It is a synergistic type of assembly process. Again, we do not know how it takes place. The dynamics of this must be magnificent.

Question: What happens if you change codons?

Response: We can express this capsid protein in an insect cell expression system, known as a "baculovirus expression system." It does not have the whole genome. All it has is the gene for the coat protein. But it packages tRNA from the cell in which the expression occurs and forms particles that are indistinguishable by crystallography from the authentic virus. However, if you take that virus with heterologous RNA, put it in solution, and expose it to a protease, that particle is seven times more readily digested than the authentic virus. In spite of the crystallographic equivalence of these particles, they are trivially separable in solution. That is why I emphasize that this is co-evolution. It requires evolution in chemistry as well as in genetics to create this kind of particle.

Question: Are there repeats in the RNA?

Response: We see no hints of this. There is nothing that suggests the repetition of sequence in the RNA. It is another interesting mystery.

Question: I have another math question. How efficient is the packing? Suppose you take a piece of RNA and compute the volume of it; how much free space is
inside? Is this known?

Response: We would call this "head-full packaging." There is a very well-defined amount of RNA. If you look at absorption at 280 nanometers and absorption at 260 nanometers, RNA is found to absorb primarily at 260 and protein primarily at 280. If you compare these, in an authentic virus you get a ratio of about 1.6 between the two absorbances. If you take particles assembled in an expression system, where they are just loading up to handle the assembly requirements, and so forth, you get exactly the same 260:280 ratio. Regardless of how these are assembled, whether you do it with genomic or heterologous RNA, you get the same ratio of capsid protein to RNA. And the density of the RNA inside is about the same as what would be found in a tRNA crystal.

Question: What is the density, based on pure volume?

Response: We describe it as cubic Ångströms per Dalton, where the Dalton is the mass unit of hydrogen. This is a crystallographic term. Generally speaking, it is right around 2.8 or so Å³/Dalton.

Question: So it is optimal packing for crystals?

Response: It is optimal for RNA that does not have polyamines and other condensing features associated with it. You can find some viruses, such as picornaviruses, where there is a much higher RNA packing density, but they also contain polyamines, which guide the condensation to some degree.

Question: Is the assembled state stable in the absence of the coat protein?

Response: No. Do you mean in the absence of the RNA?

Comment: Yes.

Response: No; we cannot assemble it without RNA.

Question: But what about the (capsid) coat protein? If you remove that, will the inside be stable?

Response: If you do an extremely long digestion with ribonuclease, you start to see
nucleic acid spewing out. This takes many hours at a high concentration, and the particle integrity is still there. But it is a very heterogeneous mixture of stuff.

Why does the capsid subunit have to cleave for the virus to be infectious? Non-enveloped RNA viruses have to get RNA across a membrane. There is a hydrophilic RNA molecule and a hydrophobic membrane. We now have data suggesting that as the virus starts to disassemble, it deposits these cleaved peptides into the membrane. These are membrane-active peptides (we see this with artificial membranes when we use just the peptides themselves) that penetrate into the membrane and allow the RNA to get across it. We also have data to suggest that the RNA goes across this membrane in a vectorial way; that is, it gets a 5'-end into the cytoplasm. The ribosome grabs that 5'-end and starts translating, and you have cotranslational entry of the RNA into the cell. The virus uses this cellular function in order to deliver its nucleic acid across the membrane after it has altered the membrane with these peptides.
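The "head-full packaging" density quoted earlier (around 2.8 Å³ per Dalton) can be sketched with a toy calculation. The inputs below (an interior radius of about 110 Å and a roughly 4,500-nucleotide genome at about 320 Da per nucleotide) are illustrative assumptions of mine, not values from the lecture; they simply show that a small RNA virus lands in the same few-Å³-per-Dalton range:

```python
import math

def packing_ratio(inner_radius_A, n_nucleotides, da_per_nt=320.0):
    # Interior volume of a spherical shell divided by the mass of the
    # packaged RNA, in cubic Angstroms per Dalton (the crystallographers'
    # unit quoted in the lecture). All inputs here are illustrative
    # assumptions, not measured values.
    volume = 4.0 / 3.0 * math.pi * inner_radius_A ** 3
    mass = n_nucleotides * da_per_nt
    return volume / mass

print(round(packing_ratio(110, 4500), 2))  # prints 3.87, same order as ~2.8
```

A tighter shell or a longer genome pushes the number down toward the quoted value; the point is only the order of magnitude.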
Assembly of tetraviruses

T = 4 viruses (Fig. 7) have four subunits in the asymmetric unit. This is another insect virus. I have been interested in insect viruses mainly because you can get a lot of material. We solved the structure of this virus, called NωV. It originally came from South Africa, and the material we obtained from South Africa was purified from Pine Emperor Moth larvae in the wild, so it was not an expressed virus. There are four subunits in NωV, so in theory, we need four different switches. But again, you can see there is a hexamer made up of B, C, and D subunits. These cluster around the icosahedral two-fold axes. The A subunits form the pentamer. This is what the four different subunits look like in the X-ray structure (A-D, at bottom). About 90% of the subunits look exactly the same. They have a β-sandwich fold with an Ig domain inserted between two strands of the β sandwich. Then you have helices on the inside. Sometimes you see helix D, for example, and sometimes you do not. It turns out that the protein itself, using the order/disorder of individual helices, carries out the switching. This virus undergoes cleavage just like the nodaviruses that I mentioned. Again, helices are cleaved off, and you have a helical bundle around the five-fold symmetry axes whose helices are covalently independent. We think the entry mechanism for this virus is very similar to that of nodaviruses.
Figure 7. Structure of the T = 4 NωV. Top-left is a cryoEM reconstruction of the virus particle. To the right is a diagrammatic representation of the particle, with trapezoids representing the viral subunits. Below: the four different subunit structures, color-coded by their locations in the capsid. Note that the C and D subunits have an extra helix extended to the left (C-terminal). It is disordered in the A and B subunits. Those helices are molecular switches for T = 4 particle formation.

When particles with higher T numbers assemble and there are more positions for the subunits to occupy, the virus does not go right to the end-product; there are assembly intermediates. Something has to happen to it as a procapsid (the intermediate in capsid formation) before it becomes a capsid. It is not possible to determine whether there is an intermediate in the assembly process if a virus purified from susceptible cells is examined. Fortunately, we could make this capsid protein in an expression system and, depending on the pH at which we purified it, we got either a procapsid or a capsid (Fig. 8). We could drive the particles at the top left to the particles at the top right by lowering the pH. The procapsid does not cleave, whereas the capsid does. Lowering the pH also turns on cleavage. Subunits rearrange during the transition from the procapsid to the capsid. These movements were modeled by fitting the high-resolution X-ray structure into the density of the procapsid determined by cryoEM. Procapsids are unstable particles, and we could only visualize them with cryo-electron microscopy. We can easily identify the regions between these two that are equivalent. It is interesting that in the procapsid, the outer Ig domains are clearly organized as dimeric dumbbells. When the procapsid matures and goes into the capsid form, they are organized as trimers; thus we get a sense of the subunit reorganization.
Figure 8. Procapsid and capsid of NωV. Upper-left: Surface rendering of the NωV procapsid, showing dimeric surface features and a porous shell. Lower-left: Cross-section of the procapsid, showing the shell density. Upper-right: Surface rendering of the capsid of NωV, showing trimer clustering. Lower-right: Cross-section of the capsid. Right: SDS gel showing that subunits of the NωV procapsid do not cleave, whereas the capsid subunits do.
Why does the procapsid exist? Our theory is that all the dimers look the same in the procapsid. If you look at the dihedral angles between AB and CD, you observe that they are virtually indistinguishable from each other. We know that these subunits exist as dimers in solution. We propose that when assembly occurs, dimers come together and maintain approximately the same structure they have in solution. The amount of reorganization that the subunits have to do is minimized when they go into the procapsid form. Once the capsid form has been adopted, subunit-subunit interactions can drive and completely differentiate contacts between the two categories of two-fold axes. That is when helices go into the groove, similar to what the RNA was doing in the other structure. We think that you get one form that brings everything together. It is unstable, but its quaternary structure is close to the way it exists in solution. A program then initiates that depends on the entire structure; it drives the particle into a mature and stable organization.

Question: Do you know whether this second transition occurs after full assembly, or during the course of assembly?
Response: No; we can stop it. (This is a whole area of discussion that could go on.) When we do this in an expression system, we can stop it at the procapsid form. When it happens in the cell, all we ever get is the capsid form. You cannot stop it in normal infection systems; it just goes straight into the capsid form. However, when we can control the pH, we get the procapsid intermediate.

Question: At what point does it pick up the nucleic acid?

Response: It has RNA in it at this point. The 260:280 absorption ratio of these two particles is the same. Once cleavage has taken place you cannot go back, but if you mutate the asparagine cleavage site, it is reversible. It can be played like an accordion. We think we understand what is going on, but I don't have time to go into the details here.
Assembly of HK97, a double-strand DNA bacteriophage

I will finish with a topological story for the mathematicians. Mathematically, it may be trivial topology, but what nature did is spectacular. HK97 is a double-strand DNA virus. It is related to a virus called phage λ, which is widely used in molecular biology. Crystallographers have wanted to study viruses like this for a long time, but there is a problem, because they all have asymmetrical tails. Our collaborators at the University of Pittsburgh, Roger Hendrix and Robert Duda, were able to use an expression system to make the capsid subunits in the absence of the connector that attaches the tail, so that we could get just these heads. These heads were crystallized, and the crystals diffracted to 3.5 Å resolution. We were able to get a very nice electron density map from this, the structure of which is shown in Fig. 9. The particles are 650 Å in diameter in their largest dimension. The crystal unit cell is 580 Å by 628 Å by 790 Å. Nearly 22 million reflections were recorded to produce a data set of 4.8 million unique reflections, which were used for structure determination. The subunit is shown in Fig. 10. It is different from any other viral gene product we have seen. This capsid [Fig. 9] has T = 7 symmetry, so there are seven different positions that the subunits sit in, and it is the same gene product. The top part, in green (the A domain), clusters around the hexamer and pentamer symmetry axes. The lower part, in blue and red (the E-loop and the P domain), spans two three-fold symmetry axes. The N-terminus sits at the end of the N-arm. As I will tell
you in a moment, residue 105 is at the same location and the C-terminus is near the 6- or 5-fold symmetry axes (in green). There are two residues in this structure that are very important: lysine 169 and asparagine 356. During maturation of this virus, these interact with each other and are ligated to each other to form covalent links [Fig. 9].
Figure 9. The HK97 T = 7 particle, showing chain-mail cross-linking of subunits. Contacts of the same color are cross-linked together, generating closed circles that interlink with neighbors to create the chain-mail. The maximum particle dimension is 650 Å. See Figure 10 for the residues that chemically bond to generate the cross-links.

The shell is very thin, with a maximum thickness of 18 Å. The particle is like a molecular balloon. It is a nice material that is remarkably robust because of subunit cross-linking. The capsid assembles in stages, just like the other capsid I mentioned for NωV. The first stage of assembly occurs when the capsid protein and a protease co-assemble into prohead I. The protease is inactive during assembly, but once assembly has taken place, it turns on. The protease does two things: First, it digests 104 amino acids off of the N-termini of the subunits; it changes the subunits. The second thing it does is digest itself. Auto-proteolysis results, so the largest polypeptide left at the end of the digestion period is around 25 amino acids in length. All these cleaved peptides diffuse out of the particle, and prohead II results. Prohead I and prohead II look the
same from the outside, but are completely different on the inside. Prohead I has ~60 copies of the protease and 104 extra amino acids on each of the subunits. Prohead II has lost the protease and the first 103 residues of each subunit. Prohead II is a metastable particle. Prohead I is stable before digestion, but metastable afterwards. Prohead II is stable indefinitely if it is not disturbed by low pH, high temperature, or denaturants. When the pH is lowered to 4, the transition to the expanded head occurs. Any individual transition is rapid, but it is stochastic for the population, and it takes a while for all the particles to make this transition. The difference in particle size and morphology is large between the two states: Prohead II is 450 Å in its maximum dimension and has skewed hexamers; the expanded head is 650 Å in its maximum dimension, with a smooth surface. After expansion, a cross-link forms between the subunits [Fig. 9], between a lysine and an asparagine side-chain [shown in Fig. 10]. Asparagine residues are very important for viruses. Much of the autocatalytic chemistry that goes on in viruses involves asparagine residues. A pseudo-peptide bond is formed between the lysine and asparagine side-chains. As expected, there is well-defined connecting density between the side-chains in all seven subunits of the asymmetric unit. A third residue critical for catalysis, glutamic acid 363, must be there to catalyze this ligation. If it is mutated, even to an aspartic acid, nothing happens. We have an idea about why that residue has to be there.
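Two numbers above lend themselves to a quick back-of-the-envelope check. Treating prohead II and the expanded head as spheres of 450 Å and 650 Å (my idealization; the particles are not perfect spheres), expansion roughly triples the enclosed volume. And the chain-mail bookkeeping of Figure 9, with one covalent ring per hexamer and one per pentamer as reported for HK97, uses up the 420 subunits of a T = 7 shell exactly:

```python
def volume_ratio(d_small, d_large):
    # Ratio of enclosed volumes for two spheres of the given diameters.
    # Idealization: proheads and heads are not truly spherical.
    return (d_large / d_small) ** 3

def chainmail_rings(T=7):
    # One 6-membered ring per hexamer, one 5-membered ring per pentamer;
    # icosahedral bookkeeping gives 12 pentamers and 10*(T - 1) hexamers.
    hex_rings, pent_rings = 10 * (T - 1), 12
    assert 6 * hex_rings + 5 * pent_rings == 60 * T  # every subunit in one ring
    return hex_rings, pent_rings

print(round(volume_ratio(450, 650), 2))  # prints 3.01: ~3-fold volume increase
print(chainmail_rings(7))                # prints (60, 12)
```

So the catenated chain-mail described below consists of 60 hexameric and 12 pentameric covalent rings, and the subunit count closes exactly.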
Figure 10. A single subunit of HK97, with Lys 169 and Asn 356 labeled. These residues link with neighboring subunits to form the closed rings shown in Figure 9. The reaction is auto-catalytic and occurs after formation of prohead II.
The virus creates a remarkable interlocking network of rings, called chain-mail, that stabilizes the capsid. Subunits arranged around a hexamer are not covalently linked to each other. It is the subunits that are one step away from the hexamer that covalently link to each other. This has an interesting topological effect on how these rings relate to each other, since there is a two-fold symmetry axis that relates one six-fold to the next six-fold. If there is a two-fold axis (for example, between the green and purple paperclips), it means that the purple ring has to be on top on the side closest to the hexamer, that the green paperclip ring has to be on top on the side toward the outside, and that they have to interpenetrate each other. There are sets of hexameric rings, all of which form chain-links. If we examine the whole virus particle [Fig. 9], everything with the same color is cross-linked together; all the yellows are cross-linked to each other and all the pinks are cross-linked to each other. Each of these rings is linked mechanically through this chain to six other rings. If you examine this carefully, you can see the interlocked rings generated by symmetry and chemistry. To our knowledge, this is the only example of such a topological chain-mail organization of protein.

I hope that this lecture has provided a broad overview of structural virology and of the remarkable functions associated with viral subunit gene products. I acknowledge the following people, who have been working in our laboratory: The work on nodaviruses with RNA was done primarily by Liang Tang, Annetta Schneemann, and Vijay Reddy. In the work on tetraviruses, the T = 4 studies were done by Mary Canady, Derrick Taylor, and Sanjeev Munshi. Our collaborators at the University of Pittsburgh, Roger Hendrix and Bob Duda, did all the molecular biology, including the bacteriophage work. Bill Wikoff, Hiro Tsuruta, and Lars Liljas did the crystallography on this system.
Question: In the nodavirus, you described RNA as playing a major role in forming the capsid, as much as 25%. In the picornaviruses, one barely sees the RNA; there is only a base or two. Could you please comment on the range of diversity; how much of the nucleic acid appears to be structured in viruses?

Response: Alex McPherson's group recently solved the structure of a satellite virus called Satellite Tobacco Mosaic Virus. It can only grow with Tobacco Mosaic Virus. It is a T = 1 particle. Its genome encodes only the capsid protein. It is a very "silly" virus, in that all it does is make capsid protein and package itself. But they could also see a lot of ordered nucleic acid in that particle. That is the only one I know of. But it is not playing a structural role. Because it is a T = 1 particle, it does
not need a switch. But it is ordered. It is just there, and a lot of it is visible. Most of the time you just see a little bit of ordered RNA. I think the nodaviruses are novel, if not unique, in actually using RNA as a switch in capsids. So they are pretty special in that regard.

Question: Do you believe that the coat protein that looks like a serine protease is a genuine homologue, or does it just vaguely look like one?

Response: Topologically, it is the same; it superimposes. I think there is a very strong feeling among those who study alphaviruses that the virus grabbed a protease and adapted it for the purpose of viral assembly.

Question: But is it inactive?

Response: Actually, it is active. The active site has a catalytic activity similar to what I described for the noda- and tetraviruses, except that it is actually done with the active site of chymotrypsin. But again, it is a one-cut event; there is no turnover. It just cuts the one polypeptide.

Question: Is there an assembly sequence requirement for those viruses that have double-stranded DNA?

Response: When assembly occurs in double-stranded DNA particles, you always pre-form the capsid. I gave you a very distorted view of a double-strand DNA virus, in that we left out all the DNA-packaging machinery needed to get a symmetrical particle. Normally, you have what is called a connector, a protein that is in the capsid. There is certainly no DNA requirement for assembly, because the DNA is not around at all. Then you have a DNA pump that actually pumps DNA into these capsids. Sequence-dependence occurs when part of the pump grabs the DNA. These are very generic "machines" that will pump almost any DNA into these particles. I would say there is no known sequence requirement except for recognition of the binding of this protein to the DNA termini. Actually, the genomes are concatemers; the pump binds to one terminus and detects when it gets to the end of the genome.
Then they cut and drop off one particle and move on to fill the next particle. I think the actual particle is impervious to what kind of DNA is going in.
Question: When you discussed the assembly process of the last virus, the first step you described was the protease activation. Do you think that it is activated in concert, or is there some kind of nucleation step first?

Response: One of the things that allowed us to study this virus is that there is no independent gene for what is called a "scaffolding" protein. The particles will not assemble if the protease is not expressed. So the protease is involved in the particle formation. When it is activated, it is almost certainly the type of protein-protein interaction that occurs during assembly that turns it on. There is no activity of this molecule outside the particle. The only place they have been able to get activity is within the particle. Clearly, there is an oligomerization or something involved with the activation of the protease activity. It does not do anything outside of that context, even if you express it separately.

Question: When this protease gets activated, how does it sense that there is complete assembly? Is there some kind of global topological property that it senses? How does it happen?

Response: Assembly occurs very rapidly; I don't think it is a major issue.

Question: Perhaps it is actually already active when it starts; it might be local...

Response: Exactly. However, there is a hierarchy. If you look at the time-dependence, it knows when it is finished cleaving the subunit N-termini. It does not start its auto-digestion until all, or 95%, of those have been cleaved. There is an incredible amount of sensitivity in the types of cleavage that are going on inside.

Question: Perhaps this is a naive question. In passing from the procapsid to the capsid state, is there an energy decrease, or is this a more stable state? Does it require something like ATP to pump it? Is there an activation energy? Is there anything like that going on at all?

Response: It is not a naive question at all. Basically, after cleavage has taken place, that procapsid is a meta-stable particle. The naive perspective that we have is that assembly occurs in an energy minimum, but the protease is essentially driving it into a less stable state. However, it is sufficiently stable that it will sit there
indefinitely if it is not perturbed. Normally, the perturbation is DNA going into these particles. We think electrostatic triggering takes place. It is known that when the expansion takes place, heat is given off; calorimetric studies have shown this is an exothermic event. It is the protease that drives it into this meta-stable state.
THE ANIMAL IN THE MACHINE: IS THERE A GEOMETRIC PROGRAM IN THE GENETIC PROGRAM?

ANTOINE DANCHIN
Institut Pasteur, Paris, France, and HKU-Pasteur Research Center, Hong Kong
Since what I have to say is very different from what you have been listening to over the past week, perhaps I should start with a metaphor: Around 3,000 years ago in Greece, people would consult Pythia, the Oracle at Delphi. (Note that Pythia was a priestess; although there are women present at this assembly today, they represent a far lower proportion than in the real world.) People would ask Pythia questions about the future, for example, and she in turn would ask them questions, one of which was the following: "A boat is made of wooden planks, which rot. They are replaced one by one, and eventually all the planks have been changed. Is it then still the same boat?" No doubt the answer was yes, and rightly so, I think.
• Genes do not operate in isolation
• Proteins exist in complexes, like the parts of an engine
→ It is important to understand their relationships, like those of the planks of a boat
Figure 1. The Delphic boat.
The reason I chose this metaphor [Fig. 1] was to illustrate my point of view: that biology is a science of the relationships among objects, rather than a science of the objects themselves. Biology is highly abstract, similar to many of the subjects mathematicians study. This is more or less the idea behind what I shall try to convey to you today.
Biology is somewhat different from many disciplines that focus on objects. Indeed, biology derives from both genetics, which is a science of relationships among objects, and chemistry, which gave rise to biochemistry, and in which we are happy if we can even find individual objects. As an illustration, for a very long time, people concentrated on demonstrating that a given dark band on an electrophoresis gel identified an object they sought. Not only was this a limited view of biological objects, but this narrow-minded approach led some people to construct fakes (it is so easy to produce photographs with dark bands). When other bands were present, they were called contaminants. But contaminants are usually extremely interesting, and are often where the real information lies; they merely indicate that the protein of interest is contained within a complex of proteins.
Genomics is based on an alphabet metaphor: a text written in a four-letter alphabet.
• Genetic engineering
• Viruses
• Horizontal gene transfer
• Cloning of animal cells
All these point to a separation between
• the machine
• the data
• the program
Figure 2. Cells as Turing Machines.
Cells as Turing machines

My approach may be somewhat different from what you are used to, or from your spontaneous tendency. I think of cells as Turing Machines [Fig. 2]. The reason I think about them in this way is that in developing the science of genetics (and now the genetics of genomes), we use a very strange metaphor: the genetic program. We write genomes as texts, using a four-letter alphabet. Indeed, we manipulate genomes like an alphabet, on our desk or with a computer, and then return to do
The Animal in the Machine: Is There a Geometric Program ...
something with real cells. Curiously enough, it works. This metaphor gave rise to genetic engineering. Viruses (not really living organisms for me) are concrete examples of the distinction between the machine and the program. There is no time to discuss this today, but viruses are simply fragments of programs; they lack what constitutes the machine. Viruses are read by the machine, but are not part of the machine. We hear quite a bit about "horizontal gene transfer" these days. In many organisms, it is easy to detect the presence of bits of exogenous programs, which sometimes amount to around half the text, sometimes more. For instance, whereas the normal Bacillus cereus genome is 2.9 Mb in length, some B. cereus genomes are lengthened to 6.4 Mb, entirely by foreign DNA. The recent advent of animal cell cloning has also surprised many people. All this points to distinctions between the machine, the data, and the program. In a Turing machine, the data and the program play an identical role, although I will not discuss this today. The main question is whether it is reasonable to consider a cell to be a kind of Turing Machine (which I think it practically is), and if so, what this means.
If the machine must not only behave as a Turing machine, but must also produce a Turing machine, one must find a geometric program somewhere in the machine (J. von Neumann).
Figure 3. If the machine...
If the machine not only has to behave like a Turing Machine; i.e., to read a program, like a computer, but also to produce another Turing Machine (this question was long addressed by J. von Neumann), then something must be found that will provide hints concerning the geometry of the Turing Machine; how it is to be constructed [Fig. 3]. This is part of the reason why I launched the Bacillus subtilis genome research program in 1985. The program was not intended merely to identify a collection of genes, but rather to find how genes interact with each other. Is there really something to see in a genomic text? We were not just interested in bits and pieces of genomes. Many people today say they have "completed" such-and-such a genome sequence, which in my view is absolutely wrong. In many
cases, particularly with the human genome, more than one-third of the genomic sequence remains lacking, and will for a very long time. Of course, this fragmented genome does contain genes - it contains many things - but if there is something in which the self-consistency of the genomic text is revealed, it won't be found in a genome sequence fragmented into bits and pieces. I think that except for many bacteria (microbes are simpler than other organisms), complete genome sequences will not be known for quite a while, and it will be some time before it will be possible to answer many very specific questions.
• What is seen from replication: no meaning, Shannon information
• What is seen by the gene-expression machinery:
  - Algorithmic complexity (space)
  - Logical depth (time)
  - Critical depth (finiteness)
Figure 4. Different levels of information.
Different levels of information

Let us think about what we see in the genome program [Fig. 4]. I think that we should use a very important concept: information. Unfortunately, for a very long time, information was identified with Shannon Information, which is information as referred to in the theory of communication, not the theory of information. To me, information is a much richer concept, which John Myhill would have called a prospective concept. In 1952, Myhill wrote a paper entitled Some philosophical implications of mathematical logic: three classes of ideas. His paper was quoted in Douglas Hofstadter's book, Gödel, Escher, Bach: An Eternal Golden Braid. (Curiously, if I am not mistaken, the French translation of Hofstadter's book omits the reference to John Myhill's paper. I wonder why, since it is one of the most important references in the book.) Myhill, in fact, wanted to distinguish among what he called "different characters." He didn't want to use the word concept, since that would have led people to immediately think of Immanuel Kant; so he used character, of which he said there were at least three levels: The character of a word may be effective, meaning capable of being immediately understood anywhere by
an educated person. If the effective character of a word is communicated, one person may immediately understand another. Second is what Myhill called the constructive character of a word. In order to communicate using a word's constructive character, a person must carry out some kind of mental computation in order to understand the other person. Finally, the prospective character of a word, while meaningful, is difficult - or even impossible - to communicate to everyone; understanding it is endless and never complete. There are many connections between levels of character and number theory, which I won't go into now. Jean-Paul Delahaye, a mathematician from Lille (France) has addressed the meaning of these words and their relationship with logic. For me, the character of "information" usually corresponds to what Myhill called prospective. There are different levels of information, and it will never be fully understood or exploited as a general concept. The first level of information concerns what is transmitted without taking meaning into account, which is what Shannon typically did. Interestingly, this is also what DNA replication machinery "sees." When DNA replicates, it can copy any type of DNA, such as another piece of DNA from the same sequence, without being concerned by meaning. This is why genetic engineering is possible; why, for instance, bacteria can correctly express a human protein from an inserted gene that has absolutely no meaning for the bacteria. There are many other levels within the concept of information; I don't have time to discuss them all. One that mathematicians know well is called "algorithmic complexity," which was defined by the school of Kolmogorov, as well as by Solomonoff, and further described by Chaitin, in the United States. In terms of the history of science, this is interesting, since it goes back to the Andronoff School, in the Soviet Union. 
Algorithmic complexity simply uses the length of a program as a measure of the information it contains. If there is a repeating sequence, algorithmic complexity is low, because one can just write:

Begin
Print n times
Stop
and it is finished. In contrast, the only way to describe a random sequence is to print it out entirely, in which case it will have high algorithmic complexity. It is interesting to remember that in practice, one cannot determine whether or not a sequence is random. That is, Kolmogorov's theorem is a theorem of existence; it states that any sequence has a certain algorithmic complexity, but it does not tell you
how to find the complexity. Since it is a theorem of existence, perhaps we should consider it to be a research program, which is interesting, since DNA comes from DNA, which comes from DNA, etc. When we describe a genome sequence, we try to see how it was constructed, since that is how its algorithmic complexity may be calculated. I won't discuss the existence of DNA repeats in detail, but just describe an example, in order to show you that repeats and "junk" DNA are in fact highly meaningful.
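The contrast between a repetitive and a random sequence can be made concrete with a computable proxy: the length of a compressed encoding is an upper bound on algorithmic complexity. The sketch below is purely illustrative (it is not part of the original analysis, and zlib stands in for a real complexity estimator, which, as the theorem says, cannot exist in general):

```python
import random
import zlib

def compressed_size(seq: str) -> int:
    """Size in bytes of the zlib-compressed sequence: a crude,
    computable upper-bound proxy for algorithmic complexity."""
    return len(zlib.compress(seq.encode(), 9))

random.seed(0)
repetitive = "ACGT" * 2500                                        # 10 kb periodic sequence
shuffled = "".join(random.choice("ACGT") for _ in range(10000))   # 10 kb shuffled sequence

# The periodic sequence compresses to a tiny, program-like description
# ("print ACGT 2500 times"); the shuffled one barely compresses at all.
print(compressed_size(repetitive), compressed_size(shuffled))
```

The two numbers differ by roughly two orders of magnitude, which is the point of the "Begin / Print n times / Stop" example above.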
This clock has a minute minute hand
Figure 5. Repeats.
A large cube and a small cube are placed on a round base [Fig. 5]. There is a ball on top of the small cube. What does the smallest cube supported by the round support support? The answer is simple: a ball. In French it is possible to state a similar example of meaningful repeats, which is very simple: "Les poules du couvent couvent," a play on the identically spelled French words for nunnery and incubate (couvent), yielding The nunnery hens are incubating their eggs. This is meant to show that repeats are possible and that it is a mistake to assume that repeats are meaningless. To make a long story about DNA repeats short, they have
at least two major functions in eukaryote genomes: one is to make spacers (I will explain later why spacers are very important in a geometrical program.) For this function, the length of the repeat is important, whereas the sequence of the repeat is not. It can be an "insertion sequence," it can be a virus - it can be anything - as long as something is inserted into the DNA sequence at that precise position. Something else is generally overlooked: timing, especially the fact that the transcription rate is very low. In fact, at 37° Celsius, the highest average transcription rate is around 45 bases per second, much slower than replication. Imagine the length of a one-megabase gene; it takes a really long time to transcribe. All this is to demonstrate that it is not uninteresting to study repeats, which are a part of algorithmic complexity. I agree with what has been said about Kolmogorov's theorem: that it is generally impossible to find the algorithmic complexity of a sequence. This theorem can therefore be viewed as a research program: any sequence has a complexity, and we should try to reveal the rules that would make it shorter than itself. Algorithmic complexity is far from being restricted to what I have described, but the concept of information is, in Myhill's terms, prospective, which leaves many things to discuss. The second aspect, which I think is very important, is what Charles Bennett called "logical depth" in a 1988 paper of the same name. The following is a succinct description of logical depth: If you compare just the programming and algorithmic complexity of a repetitive sequence, the program will be short; if you write a program that generates recursive constructs, such as the Koch Snowflake or fractals, it also will be very short. Clearly, there is much more in the Koch Snowflake than in the repeating sequence. Something is different between them, although they appear to have the same algorithmic complexity. 
One is much more interesting than the other. The interesting one is the Koch Snowflake. One also thinks of Mandelbrot sets and the like. Charles Bennett's idea was that time must be taken into account; it's fine to have a program, but if we want to see the output of the program, we must take into account the time needed to reach it. This is what he called the logical depth, which is extremely important in biology. Because DNA derives from DNA (no need to introduce randomness), its evolution could even be entirely deterministic right from the origin, since this would not change the situation. Each base has a very long story, and therefore great logical depth. We are entering a field that is becoming really mathematical. As far as I know, concepts such as logical depth have not really been investigated, especially in terms of program length, algorithmic complexity, involvement of time, etc. What I call critical depth here is the notion that if we continue to consider living organisms to be Turing machines, we find that genomes have something special: they are
finite, very small, and work with a finite length for a given time in a given organism. How can this be achieved? I think this would probably be an interesting starting point for research programs in Number Theory.

DNA management
" •' '•..••* y N„=397 Nr = 283
•'•'"
Escherichia coli
& HMO
2W0
Nj,= 552 Mr = 250
Mycoplasma pneumoniae
3000
N„=260 Nr-187 Methanococcus jannaschii
NB=139
Nr = 82
Mycoplasma genitallum
'ISM
Figure 6. DNA management: repeats in genes.
Now let's get to biology. I started with some very general observations. The simplest idea about the distribution of repeats is that they are present in bacterial genomes, and that their number is more or less proportional to the length of the genome [Fig. 6]. A larger genome would have more repeats, and there would be fewer in a smaller genome. That's what we thought, but it turned out to be completely wrong! The following describes the results obtained from comparing eight genomes. We identified long exact repeats - but, first, something very
important, for which we may need the help of mathematicians. The statistics of finite sets are extremely difficult to do, because it is possible to predict many spurious things. For example, any finite set has a period; if you find it, nothing tells you whether it is meaningful. Although I won't name names, many publications claim that this organism or that genome has a period. So what? - it's just trivial. Finite sets present many difficulties, especially how to know what is meaningful and what is not. And when we find repeats - there are always repeats - how do we know they are meaningful? We chose conservative hypotheses (although I agree that this is disputable) using different ways to mimic genomes, which is why I said the research program on algorithmic complexity would entail comparing what we call a "realistic" chromosome with a real chromosome (constructing a realistic model is a way to approach the algorithmic complexity of a genome). Being "realistic" consists in devising a model that uses all the knowledge we possess, and then comparing the realistic with the real. I think this is the only correct way to do statistics, but that's not how it's generally done. For instance, people simply use the average number of bases, or dinucleotides, or amino acids - or sometimes more complicated Markov chain analysis of the system - all of which are inadequate. It would be more appropriate to input the knowledge we have and then do statistics. Implementing this is precisely the problem of trying to compute the algorithmic complexity of the genome. I think this may be of interest. Here we have four genomes of very different length. Mycoplasma genitalium has the shortest of all known genomes at the moment - less than 600 kilobases - and contains many repeats. This one is 800 kilobases long (M. pneumoniae) and also has many repeats.
These two, Escherichia coli and Bacillus subtilis, are more than four megabases long and similar in length, but you can immediately see that they look very different. In fact, the Bacillus subtilis genome has the fewest repeats. These are exact repeats longer than 25 bases. The abscissa is the position of the first repeat and the ordinate the position of the second repeat. What you see in the Escherichia coli genome is that the repeats are apparently randomly distributed. I do not have time to discuss this in detail, but in fact, although it looks random, it isn't, not at all. This is also something that may be interesting: You can clearly see in the Bacillus subtilis genome that the distribution of repeats is not random. When present, repeats in the Bacillus subtilis genome are very close to each other, separated by only around 10 or 11 kilobases. This is really surprising, since it is possible to construct repeats in the laboratory that are widely separated from each other. We insert a gene at the so-called amyE region, which is simple and works very well in the laboratory. But apparently it does not work in the wild. This is surprising; there appears to be a mechanism for eliminating repeats in
Bacillus subtilis. There are no similarities between the B. subtilis and E. coli genomes, although they are of the same size. This slide indicates that DNA is managed differently in different organisms, demonstrating that despite this extreme variation in DNA management there are constraints in genomes. In examining bacteria, we find that DNA is managed in very different ways.
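The repeat dot-plots described above can be sketched with a naive k-mer index. This is an illustrative toy, not the method actually used for these genomes (a real analysis would merge overlapping seed matches into maximal exact repeats and handle both strands):

```python
import random
from collections import defaultdict

def repeat_pairs(genome: str, k: int = 25):
    """Return position pairs (i, j), i < j, of exact k-base repeats.
    These pairs are the dot-plot coordinates described in the text:
    abscissa = position of the first copy, ordinate = the second."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    pairs = []
    for positions in index.values():
        for a in range(len(positions)):
            for b in range(a + 1, len(positions)):
                pairs.append((positions[a], positions[b]))
    return pairs

# Toy chromosome: a 30-base block duplicated around a random spacer,
# so the two copies start at positions 0 and 130.
random.seed(1)
block = "ATGCGTACGTTAGCATCGATCGTAGCTAGC"   # 30 bases
spacer = "".join(random.choice("ACGT") for _ in range(100))
toy = block + spacer + block
print((0, 130) in repeat_pairs(toy, k=30))  # prints True
```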
Is gene order random? At first sight, despite different DNA management processes, not much is conserved, and horizontally transferred genes are distributed throughout genomes. However, pathogenicity islands tend to cluster in specific places and code for proteins with common functions.
Figure 7. Genome organization: Is gene-order random?
In fact, the question I am now asking, Is gene-order random?, will become more important [Fig. 7]. It is being written about almost everywhere. People speak of genome fluidity and say gene-order is random, at least in bacteria (remember, I'm talking only about bacteria, not eukaryotes, but I think it's probably the same for eukaryotes.) First of all, people do not take into account the age of divergence between two species. For instance, when they compare Bacillus subtilis and Escherichia coli, they say that almost everything is not conserved, therefore the genome is highly fluid. But you must remember that this happened after a separation that occurred 1.5 billion years ago, which means a huge number of generations of growth of these bacteria, during which time things must have changed. The fact that you do not see anything - it's a question of whether the glass is half-full or half-empty - does not mean that the genome is really fluid. In fact, there are things that we already know not to be random. First of all, there are operons, which are highly conserved. For instance, comparing E. coli and
B. subtilis, which diverged 1.5 billion years ago, ribosomal protein operons are found to be highly conserved; the gene-order does not change. The ATP synthase operon reveals the same; it is completely conserved. In fact, more than 100 operon segments are completely conserved between E. coli and B. subtilis. Perhaps this may be considered meaningful. "Pathogenicity islands" are another example demonstrating that at least some things go together very frequently. This suggests that the fact of "being together" is meaningful.
Genome organization is so rigid that the overall result of selection pressure on DNA is visible in the genome text, which differentiates the leading strand from the lagging strand.
Figure 8. Genome organization (2).
To lead or to lag?

We recently found something that surprised us with E. coli and B. subtilis [Fig. 8]. Sueoka, who for a very long time worked on transfer RNA and genome sequences, wanted to see whether the Chargaff rule stating that A binds to T and G to C between strands also holds within a single strand. The second Chargaff rule states that, all things being equal, if the mutation rate has been optimized and equilibrium reached, the same should be true within a single strand; here it is not a matter of pairing, but of matching numbers. In fact, however, this is not the case, which means that there are biases. We must remember that DNA replication is semi-conservative - it seems a bit strange, but that's how it is. I have to make a drawing (Fig. 9). DNA is oriented (5'-3' and 3'-5'), the complementary strands running in opposite directions. This means that replication is more or less continuous in the 5'-to-3' direction, but the only way for this to happen on the complementary strand is to start, then re-start, and so on. It does not replicate without interruptions. A ligase is required to stitch the discontinuous fragments together, but first a primase is needed for what is known as the priming sequence. The primer is RNA, not DNA, so the story gets complicated. The system is highly
asymmetrical. You may wonder about leading-strand and lagging-strand base composition. If the genome is very "fluid," genes can transpose and shift place, and so on, and you would expect all these things to be more or less equal. We realized that this was not really the case, since in comparing the orientation of transcription with the orientation of replication we found a high bias in favor of transcription in the same direction as the replication fork in quite a few organisms. This is the case, for instance, with B. subtilis, in which 75% of the genes are transcribed in the same direction as the movement of the replication fork. However, this is not the case in E. coli, in which it is around 55% to 45%. Although varying with the organism, we wanted to know whether it is possible to find a bias in the base distribution; is there any difference between the leading and lagging strands? By the way, if we can do that, we can identify the origin of replication in certain cases in which it has not been done. Eduardo Rocha, Alain Viari, and I, based on the work of Jean Lobry, did the following:
Figure 9. DNA replication.
The statistics are quite simple. Recall that in many cases, chromosomes are circular. Take an arbitrary position for the origin of replication. We select a criterion - for example, base composition or codon composition. Since we have chosen it to be the origin of replication, we take this to be the leading strand and that to be the lagging strand, making predictions with this base (codon) composition, in order to determine whether a gene is on the leading or lagging strand. If this hypothesis is correct - that is, if there is a difference between the strands - and if the origin was well chosen, we should be able to predict on which strand a gene is located with 100% accuracy. If this were entirely random, we would be right only half the time; in general the figure lies between 0.5 and 1. So we just progressively shift the position of the origin of replication, predict the location of the gene, and see what we get. We find the following (this is also true with different criteria for a series of genomes) [Fig. 10].
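The origin-shifting procedure can be sketched as follows, under simplifying assumptions of my own: the chromosome is circular, the terminus lies diametrically opposite the origin, and the criterion is simply that the leading strand is G/T-rich and the lagging strand A/C-rich (the function name and scoring rule are illustrative, not the discriminant analysis actually used):

```python
def origin_scan(genome: str, step: int = 1):
    """Slide a candidate replication origin around a circular chromosome.

    The published strand is taken as 'leading' on the half-circle
    clockwise from the candidate origin and 'lagging' on the other half.
    The score is the fraction of bases agreeing with the assumed bias
    (leading: G or T; lagging: A or C): around 0.5 for a badly placed
    origin, approaching 1.0 when the origin is well placed and the bias
    is strong."""
    n = len(genome)
    half = n // 2
    gt = [1 if base in "GT" else 0 for base in genome]
    best_score, best_ori = 0.0, 0
    for ori in range(0, n, step):
        agree = 0
        for offset in range(n):
            pos = (ori + offset) % n
            if offset < half:
                agree += gt[pos]        # leading arc: expect G or T
            else:
                agree += 1 - gt[pos]    # lagging arc: expect A or C
        if agree / n > best_score:
            best_score, best_ori = agree / n, ori
    return best_score, best_ori

# Toy circular genome with a perfect bias: origin at 0, terminus at 50.
toy_genome = "G" * 50 + "A" * 50
print(origin_scan(toy_genome, step=5))  # prints (1.0, 0)
```

On a real chromosome the score never reaches 1.0, but its maximum over candidate origins marks the inferred origin, exactly as the accuracy curve in the text does.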
Arbitrarily choosing an origin of replication and a property of the strand (base composition, codon composition, codon usage, amino-acid composition of the coded protein, etc.), one can use discriminant analysis in order to determine whether the hypothesis holds.

Figure 10. To lead or to lag.
Figure 11. To lag or to lead; that is the question.
The most fascinating case is that of Borrelia burgdorferi. Its chromosome is linear, not circular. Various criteria reveal different curves. We obtain 97% accuracy for predictions with B. burgdorferi for the amino-acid sequence coding a protein on one strand. The same method allowed us to predict the replication origins in Helicobacter pylori and Chlamydia trachomatis. It also shows something that I believe to be important: that there is a very large bias between the leading and the lagging strand. To make a long story short, the leading strand is G/T-rich, whereas the lagging strand is A/C-rich. This is so strong in fact, that you can more or less predict from which strand the protein comes just by knowing its threonine or valine content. In Borrelia burgdorferi and Chlamydia trachomatis, one can see that simply by plotting the valine and threonine percentages it is almost possible to completely discriminate between the strands. This is indeed troublesome, since it means that when people draft evolutionary trees with protein sequences, the rate of evolution in the two strands is different. Biologists like to fight, saying, "My tree is the best, etc." Well, this means that everything is wrong. People should not fight with each other over
this point, but rather find ways to improve what they are doing. Therefore, at least for bacterial protein trees, it is not equivalent to be on the leading strand and on the lagging strand. Indeed, we found that even genes with the same function differ in composition according to their strand: valine-rich on the leading strand, threonine-rich on the lagging strand. Eduardo Rocha recently refined this study. In some cases, with different isolates of similar bacteria, such as Chlamydia trachomatis or Chlamydia pneumoniae, it is even possible to observe when a gene has left the strand more or less recently; you can see its trace. This suggests that genome organization is much more rigid than usually assumed [Fig. 12].
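The valine/threonine discrimination described above can be caricatured in a few lines. This naive comparison is purely illustrative; the real study used discriminant analysis on full composition, and the boundary between the two clouds is fitted, not a simple V-versus-T threshold:

```python
def val_thr_percent(protein: str):
    """Valine and threonine percentages of a protein sequence
    (one-letter amino-acid code)."""
    n = len(protein)
    return 100 * protein.count("V") / n, 100 * protein.count("T") / n

def guess_strand(protein: str) -> str:
    """Toy classifier for the bias described in the text: the G/T-rich
    leading strand yields valine-rich proteins, the A/C-rich lagging
    strand threonine-rich ones."""
    v, t = val_thr_percent(protein)
    return "leading" if v > t else "lagging"

print(guess_strand("MVVLVAVGVKT"))  # more V than T: prints "leading"
print(guess_strand("MTTLTATGTKV"))  # more T than V: prints "lagging"
```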
Genome organization is much more rigid than usually assumed. Some regions (such as the terminus) are rather unstable, but most of the genome structure is preserved throughout evolution.

Figure 12. Conclusion.
From function to structure

One may wonder why; what is behind this organization. Let's go back to another kind of biology. In a way, biology is not logical. Francois Jacob's idea that biology is tinkering is extremely important, because when speaking with laymen, researchers usually say that we will solve the human genome sequence and cure all diseases. This is quackery; all the more so since the Darwinian trio - variation, selection, and amplification - about which people usually forget, works in this direction. All material systems subjected to this trio must evolve, and when they do, they "execute" experiments - real live experiments that give rise to function [Fig. 13].
[Diagram: Variation / Selection / Amplification drive Evolution, which creates Function; Function recruits Structure; Structure is related to Sequence through the coding process]

Figure 13. How to find the function of a gene?
But creating function means capturing a structure; a preexisting structure that emerged during evolution. I propose a simple metaphor for this. (Many examples in biology could also account for it.) It is summertime, and I am sitting at my desk, which is covered with papers, reading a book. The window behind me is open. All of a sudden, the wind comes up. What do I do? I quickly put the book on the papers, to prevent my work from being blown away. So the book all of a sudden acquires a function - it has been captured and turned into a paperweight! So the function of the book now is that of a paperweight. If I were a genome-analysis person, this object would appear to be a book; but in fact, it is not a book, it is a paperweight. My prediction would be entirely wrong. There are many examples of this; a well-known one is the origin of the crystalline lens, in the eye. Some crystalline lens proteins (crystallins) are in fact a quite common enzyme, argininosuccinate lyase. From the biochemical point of view, these crystallins are indeed enzymes; that is, they would work as enzymes if the appropriate substrate were available. So we have the sequence and the biochemical experiment, and we say this is an enzyme, which is completely wrong. In fact, the function of this particular protein is not to be an enzyme, but to remain transparent when concentrated. This remark is very important for genome programs. Thus, if you only have the sequence, even if you know the structure, it is difficult to make correct
predictions. I spoke about how wrong assignments can occur. You have to associate in silico prediction with biological knowledge.
Gene functions are often related to each other in some way; hence the interest in finding gene neighborhoods in the broadest sense: proximity in the chromosome (operons), in protein (RNA) complexes, in cell compartments, in isoelectric points, in metabolic pathways, in molecular mass, in codon usage bias, in the literature...

Figure 14. Neighborhoods (1).
The next example is very simple. Something very simple may be suggested in order to carry out inductive reasoning. Take an object and work on its neighborhoods [Figs. 14, 15]. By the way, it requires a great deal of mathematics to explore a neighborhood. By neighborhood, I mean, as a first example, the chromosome neighborhood. For instance, when I spoke about operons and pathogenicity islands, I meant that genes are next to each other in the chromosome. However, this can mean something totally different. It could also be a metabolic neighborhood, because this protein is in the same metabolic pathway as that protein. It could also be that this protein has more or less the same shape, or has a sequence related to that protein, or has the same isoelectric point, or the same molecular mass - any kind of property. I will document just one example to show you that, by using neighborhoods, you can find at least some ideas that inform you about the functions of proteins. One of the earliest outcomes of the genome program (and the most interesting surprise for me), which is not commonly known, occurred in 1991, in Elounda, Crete, where the first chromosome sequence was presented: yeast chromosome 3, as well as 100 kb of the Bacillus subtilis genome. The most surprising observation of that meeting ten years ago was that at least half the genes that were uncovered did not look like anything previously known and had absolutely no known function. This was really unexpected, because at that time several people had been predicting that when you make a mutant, any kind of mutant, isolate the
gene, translate it with the universal genetic code, and compare it with what is known in data libraries, you find it in fact to be something already known. People arguing against genome programs said that these programs were not necessary, since we already knew everything. In fact, at least half of the genes there were unknown. Piotr Slonimski, who at that time was a strong advocate of the yeast genome program, called them Elusive Exoteric Conspicuous genes, so as to emphasize their oddity (you will recognize the former acronym for the European Union, EEC, so the NIH representative at the meeting was not very happy). Unfortunately, these genes still exist, and one of the most important questions today is to determine the function of so-called "genes of unknown function."
Histidine metabolism genes belong to a common "line" in codon usage bias, suggesting organization in the corresponding metabolism.
Figure 15. Neighborhoods (2).
What can we do to determine the functions of such genes? One way is to study the codon usage bias of the gene. As you know, there are 20 amino acids and 64 codons, 61 of which code for an amino-acid. Therefore, there are (on average) three codons for every amino-acid. This means that there is a huge choice for making a
given protein with different codons. What can be done is to study the distribution of codon usage bias in different organisms. In fact, this is feasible, but requires great care in choosing the statistical means used; for example, it is impossible to use classical principal component analysis; it is better to use the chi-squared metric (as a Euclidean distance), and other such measures. It is also necessary to normalize appropriately. This means that each gene, normalized according to its length and to the amino-acid coding property of each codon, is represented as a point in a sixty-one-dimensional space: the codon space. If the right measure is chosen, it is possible to see the dual space, representing codons in the gene space, which can be studied and may yield much information. This is the representation of the projection along two axes, generating a cloud of points. Technically, it is possible to compute the axes so that the spread of the projected cloud is maximal, and to see how the dissymmetry of the cloud spreads along the axes. Ten years ago, using purely statistical means, without entering any biological data other than the nature of the code, we found that this cloud of points could be described as consisting of three classes of genes. It was then possible to determine whether the three classes had common biological properties (without having used this information in the statistics). In fact, we found it possible to describe the three classes: a major one here, another one here, and one more here. Since each point is a gene, once this has been done it is possible to label the gene and see whether it belongs to a class; whether there are common biological properties in this or that class - which I could not do before this experiment.
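The normalization described above can be sketched for a toy subset of the genetic code. Only three synonymous families are shown, and the dictionary and function names are mine; the actual analysis uses all 61 sense codons and the chi-squared metric on the resulting points:

```python
from collections import Counter

# A small subset of the genetic code, for illustration only
# (a real analysis uses all 61 sense codons).
SYNONYMS = {
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "V": ["GTT", "GTC", "GTA", "GTG"],
    "T": ["ACT", "ACC", "ACA", "ACG"],
}

def codon_usage_point(gene: str):
    """Represent a gene as a point whose coordinates are, for each codon,
    its frequency among the synonymous codons for the same amino acid.
    This mimics the normalization described in the text: gene length and
    amino-acid composition drop out, leaving pure codon choice."""
    counts = Counter(gene[i:i + 3] for i in range(0, len(gene) - 2, 3))
    point = {}
    for aa, codons in SYNONYMS.items():
        total = sum(counts[c] for c in codons)
        for c in codons:
            point[c] = counts[c] / total if total else 0.0
    return point

# Two toy genes encoding the same Leu-Val-Thr peptide with opposite codon choices:
p1 = codon_usage_point("CTGGTGACC" * 10)
p2 = codon_usage_point("TTAGTTACA" * 10)
print(p1["CTG"], p2["CTG"])  # prints 1.0 0.0: same protein, different usage point
```

Genes landing in distinct regions of this space, despite encoding comparable proteins, are exactly what separates the three classes discussed in the text.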
In fact what we found (the first part was already known, having derived from the work of Richard Grantham and Christian Gautier) was that this class corresponds to genes that are highly expressed under exponential growth conditions. However, we also found there to be a third class, with completely different codon usage bias, corresponding to genes that are transferred horizontally from genome to genome. Since that time, horizontal gene transfer has been found repeatedly, and, at least in bacteria, is clearly important. To my knowledge, no such study has yet been carried out with eukaryotes. It might be interesting to do so. Clearly, there are more classes, and it might be interesting to study what this means. Consider genes involved in metabolic pathways, here, for instance, in histidine biosynthesis [Fig. 15]. What we find - and this is indeed surprising - is that the genes are more or less aligned. In fact, this is a line in 61-dimensional space. Note that what we see in this figure is a projection on the plane, so that it might look like a line just by chance, but it is really a line in the 61-dimensional space. This means that there is a correlation between these genes, and that this correlation is in the nature of the
394
A. Danchin
codons that are and are not used. This means that for some reason, codon usage bias changes from place to place.
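The claim that the alignment holds in the full 61-dimensional space, and not just in the planar projection, can be checked numerically: a cloud lies on a line exactly when a single direction captures essentially all of its variance. The sketch below, with made-up "genes" in a 5-dimensional space standing in for real codon profiles, finds that best direction by power iteration; the function name and all data are illustrative assumptions.

```python
import random

def collinearity_score(points):
    """Fraction of the total variance captured by the single best-fit
    direction (the top principal component, found by power iteration).
    A score near 1.0 means the cloud is essentially a line, whatever
    the dimension of the space it lives in."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    x = [[p[j] - mean[j] for j in range(d)] for p in points]
    total_var = sum(sum(c * c for c in row) for row in x) / n
    r = random.Random(0)
    v = [r.random() for _ in range(d)]            # random start vector
    for _ in range(200):                          # power iteration on X^T X / n
        proj = [sum(x[i][k] * v[k] for k in range(d)) for i in range(n)]
        w = [sum(x[i][j] * proj[i] for i in range(n)) / n for j in range(d)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    top_var = sum(sum(x[i][j] * v[j] for j in range(d)) ** 2
                  for i in range(n)) / n
    return top_var / total_var

# Hypothetical "genes" lying along one line in a 5-dimensional space,
# with a little noise (the 61 dimensions of codon space work the same way).
rng = random.Random(1)
direction = [1.0, -0.5, 2.0, 0.3, -1.0]
pts = [[t * c + rng.gauss(0, 0.01) for c in direction]
       for t in [rng.uniform(-1, 1) for _ in range(40)]]
print(round(collinearity_score(pts), 2))  # close to 1.0: a genuine line
```

A diffuse cloud of the same size would score far below 1.0, so the statistic separates a real high-dimensional line from a projection artefact.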
• Genes of unknown function: neighborhoods
• Genome rigidity
• Sulfur islands
• Metabolites of unrecognized importance
• Origin of life
• An integrated approach: in vivo, in vitro, in silico
Figure 16. Sulfur metabolism.
What does my neighbor tell me? Although I could go a long way in this direction, I will provide just one hint about what the underlying biological principle for the organization of the cell could be. I don't have time to discuss it fully, but I think sulfur metabolism is extremely important. We only very recently began working on this subject, and are now carrying out many biochemical and genetic experiments. Sulfur metabolism did not seem very interesting: although there are many studies involving carbon, nitrogen, and phosphorus, there are almost none on sulfur. This is really strange, because sulfur is certainly very important; for example, all proteins start with methionine (a sulfur-containing amino-acid), so sulfur metabolism is implicated in every case [Fig. 16]. Also, although I have no time to elaborate on it, Sam Granick, in an unfortunately overlooked 1957 paper, and more recently Günter Wächtershäuser, in 1988, emphasized the role of sulfur metabolism in the origin of life, not only because it organizes electron transfer, but for many other reasons. Another reason I was interested in the metabolism of sulfur is that sulfur is highly reactive: its oxidation state ranges from -2 to +6, which makes it extremely sensitive to everything [Fig. 17]. So once we had genomes, I wanted to see where the genes for
sulfur metabolism were, because there are generally between 100 and 200 genes for it, which is a large number in the genome.
• Sulfur undergoes oxido-reduction reactions from -2 to +6.
• Incorporation of sulfur into metabolism usually requires reduction to the gaseous form, H2S.
• H2S is highly reactive, in particular towards O2.
• => Despite their diffusion properties, these two gases must be kept separate as much as possible.
• Sulfur-scavenging is energy-costly.
• => Sulfur-containing molecules have to be recycled.
Figure 17. Oxido-reduction.
What does this tell us? I just have time to say a word about it. I spoke about genes of unknown function. In fact, one of the reasons gene functions have escaped our attention derives from a very simple observation: If we wish to define life, we must combine metabolism, compartmentalization, and information transfer, where the genetic code and replication define essential laws. Therefore, metabolism must be taken into consideration. When people speak about metabolism, they usually say something like: "A gives B, which gives C," etc. - you find this in all the textbooks. People forget about by-products in this description. Remember what Lavoisier said: Rien ne se perd, rien ne se crée, tout se transforme. (Nothing is lost, nothing is created, everything is transformed.) You cannot lose anything. Something must happen to the by-products of metabolic reactions. There are many side-reactions in metabolism; we have discovered quite a number of metabolites that are completely absent from classical metabolism charts - people are not at all interested in them. I think that many of the molecules involved are mediators of many different phenomena, and sulfur metabolism is particularly rich in them. Among these, one that is quite interesting is methylthioribose, which is involved in many processes. I think it may be much more important than recognized until recently. But let's stick to genome rigidity and sulfur metabolism.
Figure 18. Neighborhoods (3): isoelectric points of Escherichia coli proteins.
Looking into neighborhoods, I told you that it was possible to predict the functions of unknown genes. One way to do this is to compute the electric charges of proteins [Fig. 18]. The isoelectric points of all E. coli proteins reveal a biphasic curve. The intracellular pH is said to correspond to about 7.5, but you will recognize that there is really no such thing as an intracellular pH, because the number of free protons in a single bacterium at pH 7.5 would be very small, which means that all standard biochemical measurements in test-tubes are simply misleading. This is usually not taken into account, but I think it is interesting to know. Now, if you look into the annotations of the genes in a given region of the curve, you find that their products either are secreted (in which case, of course, they are counter-selected), or are unknown, or, when they are known, you find the word proton in the annotation. This tells you that if you find a gene in this region, you should look for experiments involving protons. This is just to say something about the kind of approach that uses neighborhoods. Sulfur proteins are highly biased; that is, almost no sulfur genes lie in the region of high, basic isoelectric points.
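The claim that "intracellular pH" is ill-defined for a single cell can be checked with a back-of-the-envelope calculation. The cytoplasmic volume of roughly 1 femtoliter for an E. coli cell is an assumed round figure for the sketch:

```python
# How many free protons does one bacterium hold at pH 7.5?
# Assumptions: cytoplasmic volume ~1 fL (1e-15 L); [H+] = 10**-pH mol/L.
AVOGADRO = 6.022e23
volume_l = 1e-15
ph = 7.5
protons = 10 ** (-ph) * volume_l * AVOGADRO
print(round(protons))  # on the order of a few dozen free protons per cell
```

With only tens of free protons in the whole cell, "pH" is a statistical notion that cannot be applied to a single compartment the way it is in a test tube.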
Figure 19. Sulfur islands.
In fact, sulfur is highly reactive and generates H2S gas. This suggests another simple explanation for all these genes of unknown function. When a biochemist wants to carry out a careful study, he must control all parameters except those he is trying to measure. That is, he must know the concentrations of the various reagents in the test tube; he must measure the magnesium concentration, etc. Two types of molecules cannot easily be monitored: gases and radicals. This creates a purely experimental bias, and therefore very few people have worked on gases and radicals. Indeed, what we are now discovering is that many genes of unknown function concern gases and radicals. There are many gases, many of which are implicated in the origin of life. Many genes are involved in the management of gases, and H2S is very interesting in this regard. It was reported in a very recent paper that H2S might play a role in the brain, as does NO, which is another reason why H2S may be interesting. We recently looked at the distribution of sulfur genes in various organisms, particularly E. coli, but not carefully enough [Fig. 19]. I think we should do that in collaboration with a good mathematician or statistician, because we only used Wilcoxon's test. Anyway, the genes seem to be highly clustered, which means it is possible that sulfur metabolism is an anchor-point - as I said about ribosomes - for the organization of the cell. This is understandable: sulfur metabolism must be protected from the environment, therefore sulfur metabolism
gene products must be sequestered inside highly protected complexes, which corresponds to the gene-clustering we observe.
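Whether a set of genes is "highly clustered" along a circular genome can be quantified. The sketch below uses a simple permutation test rather than the Wilcoxon test mentioned above, and every position in it is made up for illustration; nothing here is the actual E. coli data.

```python
import random

def mean_nn_distance(positions, genome_len):
    """Mean distance from each gene to its nearest neighbour on a
    circular genome; clustered gene sets give a small value."""
    pos = sorted(positions)
    n = len(pos)
    total = 0
    for i in range(n):
        left = (pos[i] - pos[i - 1]) % genome_len
        right = (pos[(i + 1) % n] - pos[i]) % genome_len
        total += min(left, right)
    return total / n

def permutation_p_value(observed, genome_len, trials=2000, seed=0):
    """Fraction of random gene sets at least as clustered as the
    observed one (a permutation analogue of a rank test)."""
    rng = random.Random(seed)
    obs = mean_nn_distance(observed, genome_len)
    hits = sum(
        mean_nn_distance(rng.sample(range(genome_len), len(observed)),
                         genome_len) <= obs
        for _ in range(trials))
    return hits / trials

# Hypothetical example: 20 "sulfur genes" packed into two islands on a
# 4.6-Mb circular genome (all positions are invented for illustration).
genome = 4_600_000
islands = ([100_000 + 2_000 * i for i in range(10)]
           + [3_000_000 + 2_000 * i for i in range(10)])
p = permutation_p_value(islands, genome)
print(p)  # small value: far more clustered than chance
```

A real analysis would use the annotated gene coordinates and a more careful null model, which is exactly the collaboration with a statistician called for in the text.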
Sulfur metabolism: an unexpected organizer of cell architecture
• Sulfur metabolism-related proteins are more acidic (average pI 6.5) than bulk proteins (richer in asp and glu); they are poor in serine residues
• They are significantly poor in sulfur-containing amino-acids
• Their genes are very poor in ATA, AGA, and TCA codons
• There are almost no class III (horizontal transfer) genes in the class (only 2 out of 150)
• => Sulfur metabolism genes are ancestral and may form the core structure of the E. coli genome
Figure 20. Sulfur metabolism: an unexpected organizer of cell architecture.
Fig. 20 summarizes what I just said. There are many pending questions, but clearly, we can begin to think that, at least in bacterial genomes, sulfur metabolism may be an anchor-point in cell organization. When I said that ribosomes might be very important, the idea was that the driving force for the organization of the cell would be translation. People usually think about information transfer as DNA → RNA → proteins. But this is not the real organization; the real organization is driven by translation. Translation is the driving force; messenger RNA is pulled from the DNA by the translation machinery. People often say that ribosomes travel along the messenger, which is wrong: the ribosome network is fixed, more or less, at a given time scale, and the RNA goes through it - pulled through by the ribosome network. If this is true, it is possible to make the following simple prediction. A bacterium has rates of transcription and rates of translation,
which of course must be extremely well-matched, since, for instance, if translation goes too fast, the messenger RNA will detach from the RNA polymerase complex, which is very bad, because then a truncated messenger RNA would enter the ribosome. What would happen with a truncated messenger RNA? It would be translated to its end, resulting in a truncated protein. But as I said, one expects proteins to be located inside complexes, and geneticists know that truncated proteins are usually highly toxic negative dominants that destroy the complexes. So the very strong prediction is that, because this adaptation is needed, there must normally be a means of protecting the cell against such an event. Indeed, such a means does exist: so-called tmRNA (transfer-messenger RNA), formerly known as 10Sa RNA. It is a very interesting molecule that exists everywhere in bacteria, even in Mycoplasma. When a truncated messenger RNA enters the ribosome, the ribosome translates it to its end and stalls there: the transfer RNA is loaded with the peptide, but no termination codon has been read, so instead of releasing the peptide into the cell, everything stops and waits for tmRNA. tmRNA adds an alanine residue to the end of the truncated protein, and then acts as a messenger RNA for ten more codons (which is why it is called transfer-messenger RNA). The truncated protein, now ending with the tag AANDENYALAA, is driven to a protease machine (the Clp degradation machinery). The fact that this system is conserved even in Mycoplasma demonstrates that the problem does indeed occur and how important the protection is. I think this is a very strong argument, at least for saying that the translation machinery is the driving force, and that cell organization should be studied starting from the ribosome, which is one reason why I have been quite happy to hear about ribosomes at this meeting.
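The rescue logic described above can be sketched as a toy program. The real biochemistry involves an alanine-charged tmRNA and the ribosome switching templates mid-elongation; the sketch only captures the tagging outcome. The AANDENYALAA tag is the one quoted in the text, while the codon table and the function name are minimal assumptions for the example.

```python
# Toy model of tmRNA (ssrA) rescue: when translation reaches the 3' end
# of a messenger RNA without meeting a stop codon, the truncated protein
# is released with the AANDENYALAA degradation tag appended, which
# targets it for proteolysis instead of letting it poison complexes.
CODONS = {"ATG": "M", "AAA": "K", "GGC": "G", "TAA": "*", "TAG": "*", "TGA": "*"}
SSRA_TAG = "AANDENYALAA"

def translate_with_rescue(mrna):
    """Translate an mRNA; if no stop codon is reached, emulate tmRNA
    rescue by appending the ssrA tag. Returns (peptide, was_rescued)."""
    peptide = []
    for i in range(0, len(mrna) - len(mrna) % 3, 3):
        aa = CODONS.get(mrna[i:i + 3], "X")   # X = codon outside the toy table
        if aa == "*":
            return "".join(peptide), False    # normal termination
        peptide.append(aa)
    return "".join(peptide) + SSRA_TAG, True  # truncated message: tagged

print(translate_with_rescue("ATGAAAGGCTAA"))  # intact message: ('MKG', False)
print(translate_with_rescue("ATGAAAGGC"))     # truncated: tagged for degradation
```

The two calls contrast a normally terminated message with a truncated one, which is exactly the event the cell must guard against when transcription and translation rates fall out of step.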
INSTITUT DES HAUTES ETUDES SCIENTIFIQUES Proceedings of the Deuxiemes Entretiens de Bures
FOLDING AND SELF-ASSEMBLY OF BIOLOGICAL MACROMOLECULES This proceedings volume explores the pathways and mechanisms by which constituent residues interact and fold to yield native biological macromolecules (catalytic RNA and functional proteins), how ribosomes and other macromolecular complexes self-assemble, and relevant energetics considerations. At the week-long interactive conference, some 20 leading researchers reported their most pertinent results, confronting each other and an audience of more than 150 specialists from a wide range of scientific disciplines, including structural and molecular biology, biophysics, computer science, mathematics, and theoretical physics. The fourteen papers - and audience interaction - are edited and illustrated versions of the transcribed oral presentations.