Phylogenomics

Author: William J. Murphy


METHODS IN MOLECULAR BIOLOGY™

John M. Walker, SERIES EDITOR

466. Kidney Research: Experimental Protocols, edited by Tim D. Hewitson and Gavin J. Becker, 2008
465. Mycobacteria, Second Edition, edited by Tanya Parish and Amanda Claire Brown, 2008
464. The Nucleus, Volume 2: Physical Properties and Imaging Methods, edited by Ronald Hancock, 2008
463. The Nucleus, Volume 1: Nuclei and Subnuclear Components, edited by Ronald Hancock, 2008
462. Lipid Signaling Protocols, edited by Banafshe Larijani, Rudiger Woscholski, and Colin A. Rosser, 2008
460. Essential Concepts in Toxicogenomics, edited by Donna L. Mendrick and William B. Mattes, 2008
459. Prion Protein Protocols, edited by Andrew F. Hill, 2008
458. Artificial Neural Networks: Methods and Applications, edited by David S. Livingstone, 2008
457. Membrane Trafficking, edited by Ales Vancura, 2008
456. Adipose Tissue Protocols, Second Edition, edited by Kaiping Yang, 2008
455. Osteoporosis, edited by Jennifer J. Westendorf, 2008
454. SARS- and Other Coronaviruses: Laboratory Protocols, edited by Dave Cavanagh, 2008
453. Bioinformatics, Volume 2: Structure, Function, and Applications, edited by Jonathan M. Keith, 2008
452. Bioinformatics, Volume 1: Data, Sequence Analysis, and Evolution, edited by Jonathan M. Keith, 2008
451. Plant Virology Protocols: From Viral Sequence to Protein Function, edited by Gary Foster, Elisabeth Johansen, Yiguo Hong, and Peter Nagy, 2008
450. Germline Stem Cells, edited by Steven X. Hou and Shree Ram Singh, 2008
449. Mesenchymal Stem Cells: Methods and Protocols, edited by Darwin J. Prockop, Douglas G. Phinney, and Bruce A. Brunnell, 2008
448. Pharmacogenomics in Drug Discovery and Development, edited by Qing Yan, 2008
447. Alcohol: Methods and Protocols, edited by Laura E. Nagy, 2008
446. Post-translational Modifications of Proteins: Tools for Functional Proteomics, Second Edition, edited by Christoph Kannicht, 2008
445. Autophagosome and Phagosome, edited by Vojo Deretic, 2008
444. Prenatal Diagnosis, edited by Sinhue Hahn and Laird G. Jackson, 2008
443. Molecular Modeling of Proteins, edited by Andreas Kukol, 2008
442. RNAi: Design and Application, edited by Sailen Barik, 2008
441. Tissue Proteomics: Pathways, Biomarkers, and Drug Discovery, edited by Brian Liu, 2008
440. Exocytosis and Endocytosis, edited by Andrei I. Ivanov, 2008
439. Genomics Protocols, Second Edition, edited by Mike Starkey and Ramnanth Elaswarapu, 2008
438. Neural Stem Cells: Methods and Protocols, Second Edition, edited by Leslie P. Weiner, 2008
437. Drug Delivery Systems, edited by Kewal K. Jain, 2008
436. Avian Influenza Virus, edited by Erica Spackman, 2008
435. Chromosomal Mutagenesis, edited by Greg Davis and Kevin J. Kayser, 2008
434. Gene Therapy Protocols: Volume 2: Design and Characterization of Gene Transfer Vectors, edited by Joseph M. LeDoux, 2008
433. Gene Therapy Protocols: Volume 1: Production and In Vivo Applications of Gene Transfer Vectors, edited by Joseph M. LeDoux, 2008
432. Organelle Proteomics, edited by Delphine Pflieger and Jean Rossier, 2008
431. Bacterial Pathogenesis: Methods and Protocols, edited by Frank DeLeo and Michael Otto, 2008
430. Hematopoietic Stem Cell Protocols, edited by Kevin D. Bunting, 2008
429. Molecular Beacons: Signalling Nucleic Acid Probes, Methods and Protocols, edited by Andreas Marx and Oliver Seitz, 2008
428. Clinical Proteomics: Methods and Protocols, edited by Antonia Vlahou, 2008
427. Plant Embryogenesis, edited by Maria Fernanda Suarez and Peter Bozhkov, 2008
426. Structural Proteomics: High-Throughput Methods, edited by Bostjan Kobe, Mitchell Guss, and Huber Thomas, 2008
425. 2D PAGE: Sample Preparation and Fractionation, Volume 2, edited by Anton Posch, 2008
424. 2D PAGE: Sample Preparation and Fractionation, Volume 1, edited by Anton Posch, 2008
423. Electroporation Protocols: Preclinical and Clinical Gene Medicine, edited by Shulin Li, 2008
422. Phylogenomics, edited by William J. Murphy, 2008
421. Affinity Chromatography: Methods and Protocols, Second Edition, edited by Michael Zachariou, 2008
420. Drosophila: Methods and Protocols, edited by Christian Dahmann, 2008
419. Post-Transcriptional Gene Regulation, edited by Jeffrey Wilusz, 2008
418. Avidin–Biotin Interactions: Methods and Applications, edited by Robert J. McMahon, 2008
417. Tissue Engineering, Second Edition, edited by Hannsjörg Hauser and Martin Fussenegger, 2007

METHODS IN MOLECULAR BIOLOGY™

Phylogenomics

Edited by
William J. Murphy, PhD
Texas A&M University, College Station, TX, USA

Editors
William J. Murphy, PhD
Department of Veterinary Integrative Biosciences
College of Veterinary Medicine and Biomedical Sciences
Mail Stop 4458
Texas A&M University
Raymond Stotzer Parkway, Rm 107 VMA
College Station, TX 77843-4458
USA

Series Editors
John M. Walker, PhD
28 Selwyn Avenue
Hatfield, Herts. AL10 9NP
UK

ISBN: 978-1-58829-764-8
e-ISBN: 978-1-59745-581-7
Library of Congress Control Number: 2008922327

© 2008 Humana Press, a part of Springer Science+Business Media, LLC

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Humana Press, 999 Riverview Drive, Suite 208, Totowa, NJ 07512 USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Cover illustration: Fluorescent in situ hybridization of LINE-1 elements in pika, rhinoceros, and elephant shrew (see complete caption on p. 234 for Fig. 2 in Chapter 14, and discussion on pp. 232–233).

Printed on acid-free paper
9 8 7 6 5 4 3 2 1
springer.com

Preface

The past decade has seen the emergence of a new field of scientific inquiry at the intersection of phylogenetics and genomics: phylogenomics. From one perspective, phylogenomics is defined as the use of large genomic data sets to aid in difficult phylogeny problems. Alternatively, phylogenomics may be described as the use of phylogeny and comparative analysis to infer processes of genome evolution. Regardless of how one defines the field, the two applications are intertwined. This volume is a collection of protocols and resources compiled by leading researchers in the field and describes many of the molecular methods and bioinformatics tools that have brought this field to fruition in recent years.

Several chapters in this volume highlight the use of cytogenetic methods for characterizing the genomes of different species. Fluorescent in situ hybridization (FISH) is a powerful tool for establishing chromosome homologies between divergent species. The broadest of these techniques is chromosome painting, which in recent years has been performed on members of nearly every order of eutherian mammals, and across marsupial and avian orders. FISH mapping of single-copy clones (e.g., cDNAs, fosmids, and BACs) can provide ordered gene mapping from megabase-pair resolution on metaphase preparations down to exquisite kilobase-pair resolution with extended-fiber techniques.

Other chapters highlight the construction and development of radiation-hybrid (RH) maps, now fueled by thousands of markers from either large-scale BAC-end sequencing projects or survey-sequenced genomes. RH maps provide sub-megabase resolution of gene order and can reveal chromosomal rearrangements between genomes of different species. Tracing the history of chromosome evolution inferred from both RH maps and genome sequence assemblies is facilitated by bioinformatic tools for analyzing rearrangement data.

BAC clones were instrumental in the early successes of the human genome project, and are now common tools for physical mapping in many ongoing genome projects. More than 160 BAC libraries are now available from different animal species. Phylogenomics includes several chapters that focus on different facets of BAC clone isolation, sequencing, and analysis, which can facilitate the exploration of orthologous sequences across a diversity of species. Methods are also provided for screening BAC libraries for novel gene sequences from sex chromosomes.

The mitochondrial genome has held a prominent role in the field of phylogenetics. Accordingly, methods are provided to purify and sequence whole mitochondrial genomes in animals. With over two dozen vertebrate nuclear genome projects finished at draft or survey-sequencing (2X) coverage, several chapters highlight the computational resources that exist for mining these genomes for phylogenetic characters, and for reconstructing ancestral sequences and genomes. Retroposons make up a major proportion of eukaryotic genomes, and in addition to mining these in silico from finished genomes, laboratory methods are also described that can be used to isolate and analyze retroposon classes from various groups of organisms.

Finally, I would like to thank all of the authors for making this an exciting and diverse volume. Their outstanding contributions should provide a rich resource for new generations of phylogenomics researchers. I have no doubt that this resource will invigorate the comparative analysis of genomes across all branches of the tree of life.

Bill Murphy
Texas A&M University
College Station, TX, USA

Contents

Preface
Contributors

1 From Gene-Scale to Genome-Scale Phylogenetics: The Data Flood in, but the Challenges Remain
Antonis Rokas and Stylianos Chatzimanolis

2 Phylogenomic Analysis by Chromosome Sorting and Painting
Roscoe Stanyon and Gary Stone

3 FISH for Mapping Single Copy Genes
Terje Raudsepp and Bhanu P. Chowdhary

4 Construction of Radiation Hybrid Panels
John E. Page and William J. Murphy

5 Survey Sequencing and Radiation Hybrid Mapping to Construct Comparative Maps
Christophe Hitte, Ewen F. Kirkness, Elaine A. Ostrander, and Francis Galibert

6 Construction of High-Resolution Comparative Maps in Mammals using BAC-End Sequences
Denis M. Larkin and Harris A. Lewin

7 Amniote Phylogenomics: Testing Evolutionary Hypotheses with BAC Library Scanning and Targeted Clone Analysis of Large-Scale Sequences from Reptiles
Andrew M. Shedlock, Daniel E. Janes, and Scott V. Edwards

8 Comparative Physical Mapping: Universal Overgo Hybridization Probe Design and BAC Library Hybridization
James W. Thomas

9 Phylogenomic Resources at the UCSC Genome Browser
Kate Rosenbloom, James Taylor, Stephen Schaeffer, Jim Kent, David Haussler, and Webb Miller

10 Computational Tools for the Analysis of Rearrangements in Mammalian Genomes
Glenn Tesler and Guillaume Bourque

11 Computational Reconstruction of Ancestral DNA Sequences
Mathieu Blanchette, Abdoulaye Baniré Diallo, Eric D. Green, Webb Miller, and David Haussler

12 Sequencing and Phylogenomic Analysis of Whole Mitochondrial Genomes of Animals
Rafael Zardoya and Mónica Suárez

13 Retroposons: Genetic Footprints on the Evolutionary Paths of Life
Hidenori Nishihara and Norihiro Okada

14 LINE-1 Elements: Analysis by FISH and Nucleotide Sequences
Paul D. Waters, Gauthier Dobigny, Peter J. Waddell, and Terence J. Robinson

15 Identification of Cryptic Sex Chromosomes and Isolation of X- and Y-Borne Genes
Paul D. Waters, Jennifer A. Marshall Graves, Katherine Thompson, Natasha Sankovic, and Tariq Ezaz

Index

Contributors

MATHIEU BLANCHETTE • McGill Centre for Bioinformatics, McGill University, Montreal, Canada
GUILLAUME BOURQUE • Genome Institute of Singapore, Republic of Singapore
STYLIANOS CHATZIMANOLIS • Department of Biological & Environmental Sciences, University of Tennessee at Chattanooga, Chattanooga, TN
BHANU P. CHOWDHARY • Department of Veterinary Integrative Biosciences, Texas A&M University, College Station, TX
ABDOULAYE BANIRÉ DIALLO • McGill Centre for Bioinformatics, McGill University, Montreal, Canada
GAUTHIER DOBIGNY • Institut de Recherche pour le Développement, Centre de Biologie et Gestion des Populations, Montferrier-sur-Lez, France
SCOTT V. EDWARDS • Department of Organismic and Evolutionary Biology and Museum of Comparative Zoology, Harvard University, Cambridge, MA
TARIQ EZAZ • Research School of Biological Sciences, The Australian National University, Canberra, Australia
FRANCIS GALIBERT • CNRS, Université de Rennes1, Rennes Cédex, France
JENNIFER A. MARSHALL GRAVES • Research School of Biological Sciences, The Australian National University, Canberra, Australia
ERIC D. GREEN • National Human Genome Research Institute, National Institutes of Health, Bethesda, MD
DAVID HAUSSLER • Howard Hughes Medical Institute, University of California, Santa Cruz, CA
HIDENORI NISHIHARA • Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, Yokohama, Japan
CHRISTOPHE HITTE • CNRS, Université de Rennes1, Rennes Cédex, France
DANIEL E. JANES • Department of Organismic and Evolutionary Biology and Museum of Comparative Zoology, Harvard University, Cambridge, MA
JIM KENT • Center for Biomolecular Science and Engineering, University of California, Santa Cruz, CA
EWEN F. KIRKNESS • The Institute for Genomic Research, Rockville, MD
DENIS M. LARKIN • Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, IL
HARRIS A. LEWIN • Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL
WEBB MILLER • Center for Comparative Genomics and Bioinformatics, Penn State, University Park, PA
WILLIAM J. MURPHY • Department of Veterinary Integrative Biosciences, Texas A&M University, College Station, TX
NORIHIRO OKADA • Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, Yokohama, Japan
ELAINE A. OSTRANDER • Cancer Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD
JOHN E. PAGE • Integrated Toxicology Division, United States Army Medical Research Institute of Infectious Diseases, Fort Detrick, Frederick, MD
TERJE RAUDSEPP • Department of Veterinary Integrative Biosciences, Texas A&M University, College Station, TX
TERENCE J. ROBINSON • Department of Botany & Zoology, University of Stellenbosch, Matieland, South Africa
ANTONIS ROKAS • Department of Biological Sciences, Vanderbilt University, Nashville, TN
KATE ROSENBLOOM • Center for Biomolecular Science and Engineering, University of California, Santa Cruz, CA
NATASHA SANKOVIC • Department of Zoology, University of Melbourne, Australia
STEPHEN SCHAEFFER • Department of Biology, Penn State, University Park, PA
ANDREW M. SHEDLOCK • Department of Organismic and Evolutionary Biology and Museum of Comparative Zoology, Harvard University, Cambridge, MA
ROSCOE STANYON • Dipartimento di Biologia Animale e Genetica, Florence University, Firenze, Italy
GARY STONE • National Cancer Institute-Frederick, Frederick, MD
MÓNICA SUÁREZ • Departamento de Sanidad Animal, Facultad de Veterinaria, Universidad Complutense de Madrid, Madrid, Spain
JAMES TAYLOR • Center for Comparative Genomics and Bioinformatics, Penn State, University Park, PA
GLENN P. TESLER • Department of Mathematics, University of California-San Diego, La Jolla, CA
JAMES W. THOMAS • Department of Human Genetics, Emory University School of Medicine, Atlanta, GA
KATHERINE THOMPSON • Research School of Biological Sciences, The Australian National University, Canberra, Australia
PETER J. WADDELL • South Carolina Research Center, University of South Carolina, Columbia, SC
PAUL D. WATERS • Research School of Biological Sciences, The Australian National University, Canberra, Australia
RAFAEL ZARDOYA • Departamento de Biodiversidad y Biología Evolutiva, Museo Nacional de Ciencias Naturales-CSIC, Madrid, Spain

1

From Gene-Scale to Genome-Scale Phylogenetics: the Data Flood In, but the Challenges Remain

Antonis Rokas and Stylianos Chatzimanolis

Summary

An important goal of phylogenetics is to be able to consistently and accurately reconstruct the historical patterns of cladogenesis among major organismic groups. Gene-scale phylogenetics is insufficient to attain this goal owing to the presence of poor resolution and incongruence in single- and few-gene phylogenies. The increasing availability of genome-scale amounts of data promises to overcome the insufficiency of gene-scale phylogenetics and uncover the genealogical tapestry uniting all living organisms with unprecedented accuracy. Here, we argue that a vast increase in data size alone, although necessary, may not be sufficient to achieve the desired accuracy for three reasons: (i) the existence of short stems in the tree of life, (ii) the saturation of phylogenetic signal in molecular sequences, and (iii) the effect of systematic error on phylogenetic inference. Devising strategies to ameliorate the effect of such challenges on sequence evolution will be critical to the success of current efforts to reconstruct the tree of life.

Key Words: Genomics; phylogenetics; tree of life; incongruence; resolution; short stems; saturation; natural selection; systematic error.

From: Methods in Molecular Biology: Phylogenomics. Edited by: W. J. Murphy © Humana Press Inc., Totowa, NJ

1. Introduction

Darwin was the first to recognize that living species are not independently created, but have been generated through descent with modification from ancestral species (1). Darwin further envisioned that the propinquity of descent among species could be depicted in the form of a Tree of Life (TOL). Almost 150 years later, one of the major goals in biological research is to convert Darwin’s monumental vision into reality by assembling the complete TOL (2). A complete TOL promises to deepen our understanding of the history of life and shed light on the evolution of molecules, phenotypes, and developmental mechanisms (2), as well as to directly impact research in key areas such as human health, agriculture, and biodiversity (3).

Advances in three disciplines, namely statistical phylogenetics, information technology, and molecular biology and genomics, have led to enormous optimism about the prospect of assembling the TOL (4). The advances in genomics are perhaps worthy of special mention, as progress has been extremely fast-paced. Within the decade since the sequencing of the first prokaryote (5) and eukaryote (6) genomes, more than 300 prokaryote and 40 eukaryote genomes have been decoded, and several more are scheduled for completion (see Note 1) (7). As a result, this generation of researchers in phylogenetics is gaining access to a key ingredient missing from generations past: an avalanche of new data. The impact of this data influx has already been nothing short of remarkable; for example, several recent studies have featured data matrices consisting of tens to hundreds of genes (8–11). But is there a need for genome-scale amounts of data in phylogenetic analyses?

2. Insufficiency of Gene-Scale Phylogenetics

Systematists have always placed more emphasis on collecting taxa than on collecting genes. This approach has been shaped by the relative technical ease of increasing the number of taxa and the desire to study as many taxa as possible (12). Although this emphasis is understandable (adding taxa is far more interesting than adding genes for most systematists), single- or few-gene phylogenies, which we call gene-scale phylogenetics, have frequently been shown to harbor insufficient phylogenetic information and to lead to poor resolution (13,14). Perhaps more importantly, gene-scale phylogenetic analyses of organisms as diverse as primates, fruitflies, yeasts, and arthropods have revealed extensive incongruence (15–19) (see Note 2). For example, an examination of the phylogenies obtained from the analysis of each of 106 genes from eight yeast species revealed the existence of 24 different topologies, 10 of which were supported by three or more genes (Fig. 1) (15).

A careful look at the literature in molecular phylogenetics further suggests not only that incongruence is present, but also that it is widespread across the clades of the TOL (Fig. 2) (see Note 3). Specifically, examination of 404 studies published between 1998 and 2005 indicates that approximately 40% of the studies analyzing two or more genes report incongruence among gene phylogenies (Fig. 2A). Incongruence is present in all major taxonomic groups examined, although its intensity varies depending on the group considered (Fig. 2B,C). Thus, detailed studies of several clades of the TOL and a survey of the published literature indicate that phylogenies based on gene-scale data typically provide insufficient evidence for establishing or refuting phylogenetic hypotheses (15) (see Note 4).
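As a toy illustration of how incongruence among single-gene trees can be tallied, the sketch below counts the distinct topologies in a small set of gene trees. The four taxa (A to D), the gene names, and the trees themselves are hypothetical and are not taken from the yeast data set; trees are treated as rooted and compared by their sets of clades, which is a simplification of the comparisons used in practice.

```python
from collections import Counter

def clades(tree):
    """Return (leaves, clades) for a rooted tree written as nested tuples.

    Leaves are strings; every internal node contributes one clade,
    stored as a frozenset of the leaf names below it.
    """
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves = leaves | child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found

def topology_key(tree):
    """Order-independent fingerprint of a rooted topology."""
    return frozenset(clades(tree)[1])

# Hypothetical gene trees for four hypothetical taxa.
gene_trees = {
    "gene1": ((("A", "B"), "C"), "D"),
    "gene2": (("C", ("B", "A")), "D"),  # same topology as gene1, drawn differently
    "gene3": ((("A", "C"), "B"), "D"),
    "gene4": (("A", "B"), ("C", "D")),
}

counts = Counter(topology_key(t) for t in gene_trees.values())
n_topologies = len(counts)                  # number of distinct topologies
best_supported = counts.most_common(1)[0][1]  # genes backing the commonest one
print(n_topologies, best_supported)
```

Because the fingerprint ignores the order in which children are written, gene1 and gene2 are counted as the same topology even though their nested tuples differ.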


Fig. 1. Incongruence is abundant in gene-scale phylogenetics. Panels (A)–(C) depict three single-gene phylogenies from a 106-gene, 8-taxon data matrix from yeasts (gene in panel (A), YBL091C; gene in panel (B), YDL031W; gene in panel (C), YGL001C). The phylogeny obtained from the phylogenetic analysis of a concatenation of all 106 genes is shown in panel (D), and most likely represents the true evolutionary tree for these eight yeast species. Numbers above stems indicate bootstrap support values from maximum likelihood and maximum parsimony analyses, respectively. Note that the trees shown in panels (A)–(C) are incongruent with each other, and that the trees in panels (B) and (C) are also incongruent with the species tree (panel (D)). Additionally, note that the bootstrap analysis on the concatenated data matrix yields maximum support, irrespective of the method utilized (panel (D)). Data from Ref. (15).

Fig. 2. Gene-scale incongruence is widespread across clades of the tree of life. A literature survey of 404 phylogenetic studies indicates that 39% (156/404) of all studies utilizing two or more genes report incongruence among single gene phylogenies (panel (A)). The degree of incongruence differs across major taxonomic groups; for example, only 35% (17/49) of studies on mammalian relationships report incongruence (panel (B)), whereas incongruence reaches 48% (57/118) in phylogenetic studies of insect taxa (panel (C)). For details on data collection, refer to the main text and Note 3.

3. Importance of Genome-Scale Phylogenetics

An obvious solution to the problem of incongruence may be the use of genome-scale amounts of data (see Note 5). For example, in a study of a 106-gene, 8-taxon yeast data matrix, phylogenetic analysis of the concatenation of all genes produced a robust phylogeny that significantly rejected all other phylogenies obtained in the single-gene analyses (Fig. 1) (15). This result highlights the potential power of genome-scale phylogenetics to overcome the incongruence observed in gene-scale phylogenetics. More generally, the availability of larger amounts of data has enabled researchers to increase the analytical power of their studies, shedding light on key clades of the TOL. For example, from the application of increasing amounts of molecular data, a detailed picture of the pattern as well as the tempo of mammalian radiation is emerging, which includes the discovery of novel clades such as the Afrotheria and the Laurasiatheria (e.g., [20,21]).

Genome-scale phylogenetics has also enriched the gene-scale perspective of evolution at the molecular level. Notable examples include several experimental demonstrations of the occurrence and impact of lineage sorting in phylogenies of closely related species (e.g., Ref. [22]), including our own evolutionary branch, the primates (19); the discovery of the major impact of horizontal gene transfer in prokaryotic evolution (23); the presence of incongruence in gene-scale phylogenetics and its severity (15); and the demonstration that, given genome-scale data, the lack of phylogenetic resolution may be the signature of closely spaced series of cladogenetic events (8).

4. Challenges for Genome-Scale Phylogenetics

From these few examples, one may infer that the problems encountered in gene-scale phylogenetics, such as incongruence and poor resolution, may be overcome simply by scaling up the amount of data utilized. However, the acquisition of genome-scale data, even if analyzed by state-of-the-art methodology, does not guarantee that the resulting phylogenetic inference is correct, for three reasons: (i) the existence of short stems in the TOL (Subheading 4.1.), (ii) the substitutional saturation of phylogenetic signal from very ancient clades (Subheading 4.2.), and (iii) the effect of systematic errors (Subheading 4.3.).


4.1. The Punctuated Tree of Life

Elucidation of the TOL requires the identification of the intervals separating cladogenetic events, that is, the internal branches or stems of the TOL. The lengths of these stems, as well as the TOL’s overall geometry, have been shaped by the interplay, over an enormous span of time, of cladogenesis and extinction. Thus, accurate estimates of the rates of appearance (cladogenesis) and loss (extinction) of lineages on the TOL are essential to understanding the distribution of stem lengths. Although estimates of the rates of both cladogenesis and extinction are imprecise and exhibit wide margins of error, abundant evidence indicates that these rates have fluctuated considerably across geological time (24). Importantly, several lineages have undergone spectacular radiations (e.g., insects and cichlids), giving rise to a series of closely spaced cladogenetic events occurring, geologically speaking, within narrow temporal windows.

Phylogenetic reconstruction of clades characterized by such short stems can be problematic. This is so because the amount of phylogenetic signal available for any given stem is proportional to its length, and theoretical work indicates that there may be an absence of informative characters in the molecular data for very short stems (25). In agreement with theory, multigene analyses on sets of closely related species have revealed the existence of gene sequences with no phylogenetic signal (e.g., [19,22]). Additionally, when stems are short, population-level processes such as lineage sorting of ancestral polymorphisms and hybridization can generate gene histories that deviate from the species’ history (see Note 6). Thus, the presence of polytomies within certain branches of the TOL may be real and not the result of failure to achieve resolution (8,26).
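The proportionality between stem length and signal can be made concrete with a back-of-the-envelope calculation. The sketch below assumes that substitutions on a stem follow a Poisson process, so that the probability of no substitution at all across an alignment is exp(-rate x duration x sites). The function name, the rate of 1e-9 substitutions per site per year, and the alignment lengths are illustrative assumptions and do not come from the studies cited above.

```python
import math

def prob_no_substitution(rate_per_site_per_year, stem_years, n_sites):
    """Probability that no substitution occurs on a stem across n_sites.

    Assumes independent sites and a Poisson process with a constant rate,
    which is an idealization rather than a fitted model.
    """
    expected_changes = rate_per_site_per_year * stem_years * n_sites
    return math.exp(-expected_changes)

ASSUMED_RATE = 1e-9   # substitutions per site per year (illustrative only)
STEM_YEARS = 1e6      # a hypothetical 1-Myr stem

for n_sites in (1_000, 10_000, 100_000):
    p = prob_no_substitution(ASSUMED_RATE, STEM_YEARS, n_sites)
    print(f"{n_sites:>7} sites: P(no change on stem) = {p:.3g}")
```

Under these assumed values a single 1-kb gene expects only one substitution on the stem, so a sizeable share of such genes would carry no signal for it. This is also an optimistic bound, since not every substitution on a stem yields an informative character.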

4.2. How Much Can Genes Tell Us About the Trees That Generated Them?

Aside from stem length, the other parameter influencing phylogenetic reconstruction is the time elapsed since the generation of the stem. To understand why this is so, we must consider the theoretical framework behind the analysis of sequence evolution, the neutral theory of molecular evolution (27). According to the neutralist paradigm, sites along a sequence fall into one of three categories: neutral, deleterious, and advantageous. Substitutions at neutral sites have no selective effect; substitutions at all other sites are considered deleterious and are removed by purifying selection; and advantageous substitutions are assumed to occur so rarely that they are not taken into account. Each neutral site in a gene sequence evolves at the same rate, i.e., the neutral rate (27). Thus, while genes on average evolve at a slower or faster rate depending on their fraction of neutral sites, those sites allowed to change do so at the neutral rate, irrespective of whether they are located in slow- or fast-evolving genes.


Theoretical work indicates that neutrally evolving sites should lose their informativeness after the lapse of 300–400 Myr owing to substitutional saturation (28,29). Furthermore, these calculations rely on assumptions that are arguably violated by empirical data: deviations in base composition (30), nonindependence of nucleotide substitutions (31), and changes in rates of gene evolution within and across lineages (32), to mention just a few, can dramatically increase substitutional saturation. Thus, it is highly unlikely that neutrally evolving sites in protein-coding genes will retain enough phylogenetic signal to resolve very ancient stems.

Consideration of biochemistry also suggests that it is unlikely that deleterious sites in a protein-coding region will remain deleterious over long periods of evolutionary time (28). As a result, more elaborate and biochemically robust models (known as covarion models) have been developed, which allow a fraction of sites to switch from a deleterious to a neutral state and vice versa over evolutionary time. Under these covarion models, it has been shown that substitutional saturation is expected to occur at greater evolutionary depths (28). However, even under covarion models, it is unlikely that most sites that have been in existence for hundreds of millions of years have retained any phylogenetic information. For example, simulation analyses employing a covarion model and a rate of substitution two orders of magnitude lower than the neutral rate suggest that stems shorter than 10 Myr and occurring early in the 600-Myr evolution of a clade may be difficult to resolve (8). Furthermore, no empirical evidence supports the existence of sites evolving at rates orders of magnitude slower than the neutral rate.

These results highlight the discordance between what is actually observed in real data sets and the theoretical framework on which the analysis of molecular sequence data is based: neutrality. This is not surprising; for example, whereas genome-wide studies suggest that a significant fraction of protein-coding genes has been under positive selection (33,34), all popular models of sequence evolution assume that the fraction of sites under positive selection is so small that it can be safely ignored in the calculation of evolutionary rates (27,28). In summary, ample evidence suggests that currently available models make naive assumptions and thus fail to take into account key processes affecting sequence evolution. The accurate reconstruction of ancient divergences remains a two-fold challenge: on the one hand, developing a theoretical framework for phylogenetic analysis of genome-scale data that adequately accounts for the biological forces shaping sequence evolution, and on the other hand, finding genes whose sites have not been substitutionally saturated.
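To give a feel for what substitutional saturation means numerically, the sketch below uses the Jukes-Cantor (JC69) model, under which the expected proportion of differing sites between two sequences is p = 3/4 (1 - exp(-4d/3)), where d is the number of substitutions per site separating them. JC69 is swapped in here only because it is the simplest standard model; it is not the model used in the calculations cited above, and the neutral rate of 2e-9 substitutions per site per year is an illustrative assumption.

```python
import math

def jc69_observed_difference(d):
    """Expected proportion of differing sites under JC69.

    d is the expected number of substitutions per site separating two
    sequences. The returned proportion approaches 0.75 as d grows.
    """
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

ASSUMED_NEUTRAL_RATE = 2e-9  # substitutions/site/year (illustrative only)

for myr in (10, 100, 400, 1000):
    # Two lineages each accumulate change since their split, hence the factor 2.
    d = 2.0 * ASSUMED_NEUTRAL_RATE * myr * 1e6
    p = jc69_observed_difference(d)
    print(f"{myr:>5} Myr: d = {d:.2f}, observed difference = {p:.3f}")
```

As d grows, the observed difference flattens toward 0.75, so additional substitutions leave almost no trace in the alignment. That flattening is the loss of informativeness described in the text, shown here under much simpler assumptions than those of the cited work.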

4.3. Clade Support, Phylogenetic Accuracy, and Systematic Error
Phylogeneticists place a great deal of effort and trust in clade support indices, such as bootstrap resampling, jackknifing, and posterior probabilities. Unfortunately,

From Gene-Scale to Genome-Scale Phylogenetics


the values obtained from such indices do not always equate with phylogenetic accuracy. Because the number of sites in genome-scale data sets is very large, clade support values almost always turn out to provide significant support for a specific topology. Although this should be the outcome when a large number of sites are in support of a given topology, in genome-scale data sets this can also be the case even when the underlying support for the best topology over alternative topologies is only marginally better (11). This latter scenario typically occurs when there is an undetected source of bias in the data (e.g., unequal base frequencies), or when levels of substitutional saturation are high. Owing to the lack of models accounting for these biases (see Subheading 4.2.), the signal emanating from these sources can overwhelm the phylogenetic signal and lead to high confidence in the wrong topology; this—in a nutshell—is a systematic error. Evidence for the existence of systematic error in analyses of real data is abundant; several genome-scale studies have reported alternative conflicting phylogenies (9,10,35–38), absolute clade support for incorrect topologies (Fig. 3) (15,39), and the generation of absolute clade support values from marginal character distribution differences (8,11). Thus, clade support values, albeit necessary, are not sufficient when dealing with genome-scale data sets because high clade support values do not guarantee phylogenetic accuracy. Will clade support indices be rendered obsolete as the flood of genomic data transforms phylogenetics as we know it? Whereas a definitive answer may have to wait, there is an emerging need for innovative methods enabling researchers to better understand the nature and impact of systematic error in molecular data sets and how to dissociate it from the phylogenetic signal. It has been suggested that the systematic error can be alleviated by the addition of taxa. 
For example, phylogenetic accuracy can be dramatically increased when taxon addition breaks up long branches (40). However, taxon addition can also decrease accuracy, either by reducing the amount of phylogenetic information available to resolve the newly added stems (41,42), or by the introduction of new long branches (43). Generally, it may be difficult to predict a priori what constitutes ‘adequate’ taxon sampling in a given clade and its effect on the topology obtained (12,43–45). For example, the closest relative of the brewer’s yeast Saccharomyces cerevisiae was correctly identified by employing a strategy which utilized more genes from fewer taxa (15,46), whereas the position of rodents among placental mammals was correctly identified by employing a strategy which utilized fewer genes from more taxa (20,21,47).
5. Conclusion
Genomics is delivering an unprecedented amount of data from a variety of organisms that promises to transform molecular phylogenetics. Here, we argued that short stems on the TOL, the saturation of phylogenetic signal in molecules


Rokas and Chatzimanolis

Fig. 3. The negative effect of systematic error in genome-scale phylogenetics. Concatenations of genes supporting alternative stems (such as those shown in Fig. 1B,C) lead to further amplification of the bias (panels (A)–(C)). The species tree is shown in panel (D). Numbers above stems indicate bootstrap support values from maximum likelihood and maximum parsimony analyses, respectively. Note that the phylogenies shown in panels (A)–(C) all differ from the species tree (panel (D)) and that most of their stems exhibit high bootstrap values. Data from ref. 15.

that have been evolving for billions of years, and the detection and avoidance of systematic error represent three key challenges for genome-scale phylogenetics (see Note 7). Whether this wealth of genome-scale data will produce a robust TOL remains an open question; what is abundantly clear is that genome-scale phylogenetics is already enhancing our knowledge of the factors influencing success in phylogenetic inference, as well as our understanding of how cladogenesis and extinction have sculpted the major branches of the TOL.
6. Notes
1. An extensive and fairly up-to-date list of completed, draft, and in-progress genome projects may be found at http://www.genomesonline.org/ (7). Whole-genome sequence data can be downloaded from the NCBI’s Entrez Genome Project at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj. The raw trace files for in-progress genome projects and genome-scale (e.g., EST) projects can be downloaded from NCBI’s Trace archive at http://www.ncbi.nlm.nih.gov/Traces. As of March 2006, the Trace archive has more than one billion trace sequences from over 480 organisms.


2. Incongruence is a term that has been used in a variety of different contexts and given slightly different meanings (48). Here, incongruence is simply defined as the existence of conflicting topologies among different genes. For example, when gene X produces topology x and gene Y produces topology y, if topology x differs from topology y, then the topologies x and y are considered incongruent.
3. To quantify the degree of incongruence reported in the literature of molecular phylogenetics, we examined all research articles found in Molecular Phylogenetics and Evolution, Systematic Biology, and Cladistics between the years 1998 and 2005 (inclusive) which contained phylogenies derived from two or more genes. For each research article, we recorded the existence or not of incongruence among the phylogenies produced by the single genes. A few comments should be made about this literature survey. First, the opinion of the original author(s) was always followed to determine whether two genes are incongruent or not (typically, but not always, the criterion used for the assessment of incongruence was a significant value from a statistical test, such as the incongruence length difference test). Second, gene definitions varied among researchers of different taxonomic groups (e.g., some researchers classify mitochondrial genes as separate genes, whereas others consider them a single locus because of their physical linkage and the absence of recombination). Finally, prokaryotes were not considered because measuring incongruence is further complicated by the extensive presence of lateral gene transfer (e.g., [23]).
4. While increasing evidence indicates that gene-scale phylogenetics is insufficient for accurate phylogenetic inference, this does not mean that the end of gene-scale phylogenetics is approaching. Constraints in funding and data availability aside, many evolutionary questions can be addressed with gene-scale phylogenetics (e.g., molecular microbial ecology, [49]). Furthermore, the development of more robust models of sequence evolution may improve the accuracy of gene-scale phylogenetics in the future.
5. Here, the term ‘genome-scale’ is applied liberally to data sets composed of linear sequence data (see Chapters 10, 13, and 14, and ref. 50 for a discussion on types of genome-scale characters) at least an order of magnitude larger than typically utilized.
6. When stems are short, population-level processes can have important consequences for phylogenetic inference. Specifically, population genetics theory suggests that for stems spanning less than 2–3 Myr (the exact time span is dependent on the population size and the number of generations elapsed [51]), incomplete sorting of ancestrally polymorphic alleles of some genes can lead to gene histories differing from the species’ history. Genes with discordant histories may persist indefinitely, although our ability to identify them as such becomes vanishingly small with elapsed time because of the dearth of informative sites (25). Gene histories deviating from the species’ ones may also be caused by other processes, such as hybridization, introgression, and horizontal gene transfer (52).
7. Another important factor, not considered here, that further adds to the challenge of assembling the complete TOL is the (high) rate of discovery of new species (53).
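Note 6 states that the chance of incomplete lineage sorting depends on population size and the number of generations elapsed. A minimal sketch of that dependence follows, assuming the standard three-species coalescent result for one sampled allele per diploid species, in which the probability that the gene tree matches the species tree is 1 − (2/3)·exp(−t/(2Ne)), with t the stem length in generations. The population size and generation time used below are hypothetical values chosen for illustration, not figures from the chapter.

```python
import math


def concordance_probability(stem_years, generation_time_years, effective_pop_size):
    """Probability that a gene tree matches the species tree (three-taxon case).

    Assumes one allele sampled per species and a constant diploid
    effective population size along the internal stem.
    """
    generations = stem_years / generation_time_years
    t = generations / (2.0 * effective_pop_size)  # stem length in coalescent units
    return 1.0 - (2.0 / 3.0) * math.exp(-t)


# Hypothetical parameters: Ne = 100,000 and a 5-yr generation time.
for stem_myr in (0.5, 1.0, 2.0, 3.0):
    p = concordance_probability(stem_myr * 1e6, 5.0, 100_000)
    print(f"stem of {stem_myr} Myr: P(gene tree = species tree) = {p:.3f}")
```

With a stem of zero length the function returns 1/3, the value expected when the three possible topologies are equally likely.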


Acknowledgments We thank Bill Murphy for the invitation to contribute to this volume. Many of the ideas expressed in this review were formed during the postdoctoral tenure of A.R. in Sean Carroll’s lab in the University of Wisconsin—Madison. References 1. Darwin, C. (1859) On the Origin of Species, John Murray, London. 2. Cracraft, J. and Donoghue, M. J. (eds) (2004) Assembling the Tree of Life, Oxford University Press, Oxford. 3. Yates, T. L., Salazar-Bravo, J., and Dragoo, J. W. (2004) In Assembling the Tree of Life (Cracraft, J. and Donoghue, M. J., eds), pp. 7–17, Oxford University Press, Oxford. 4. Dawkins, R. (2003) A Devil’s Chaplain, Houghton Mifflin, New York. 5. Fleischmann, R. D., Adams, M. D., White, O., et al. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269, 496–512. 6. Goffeau, A., Barrell, B. G., Bussey, H., et al. (1996) Life with 6000 genes. Science 274, 546, 563–567. 7. Liolios, K., Tavernarakis, N., Hugenholtz, P., and Kyrpides, N. C. (2006) The Genomes On Line Database (GOLD) v.2: a monitor of genome projects worldwide. Nucleic Acids Res. 34, D332–D334. 8. Rokas, A., Kruger, D., and Carroll, S. B. (2005) Animal evolution and the molecular signature of radiations compressed in time. Science 310, 1933–1938. 9. Driskell, A. C., Ane, C., Burleigh, J. G., McMahon, M. M., O’Meara, B. C., and Sanderson, M. J. (2004) Prospects for building the tree of life from large sequence databases. Science 306, 1172–1174. 10. Ciccarelli, F. D., Doerks, T., von Mering, C., Creevey, C. J., Snel, B., and Bork, P. (2006) Toward automatic reconstruction of a highly resolved tree of life. Science 311, 1283–1287. 11. Takezaki, N., Figueroa, F., Zaleska-Rutczynska, Z., Takahata, N., and Klein, J. (2004) The phylogenetic relationship of tetrapod, coelacanth, and lungfish revealed by the sequences of 44 nuclear genes. Mol. Biol. Evol. 21, 1512–1524. 12. Rokas, A. and Carroll, S. B. (2005) More genes or more taxa? 
The relative contribution of gene number and taxon number to phylogenetic accuracy. Mol. Biol. Evol. 22, 1337–1344. 13. Rokas, A., King, N., Finnerty, J., and Carroll, S. B. (2003) Conflicting phylogenetic signals at the base of the metazoan tree. Evol. Dev. 5, 346–359. 14. Berbee, M. L., Carmean, D. A., and Winka, K. (2000) Ribosomal DNA and resolution of branching order among the ascomycota: how many nucleotides are enough? Mol. Phylogenet. Evol. 17, 337–344. 15. Rokas, A., Williams, B. L., King, N., and Carroll, S. B. (2003) Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425, 798–804. 16. Kopp, A. and True, J. R. (2002) Phylogeny of the Oriental Drosophila melanogaster species group: a multilocus reconstruction. Syst. Biol. 51, 786–805.


17. Hwang, U. W., Friedrich, M., Tautz, D., Park, C. J., and Kim, W. (2001) Mitochondrial protein phylogeny joins myriapods with chelicerates. Nature 413, 154–157. 18. Giribet, G., Edgecombe, G. D., and Wheeler, W. C. (2001) Arthropod phylogeny based on eight molecular loci and morphology. Nature 413, 157–161. 19. Satta, Y., Klein, J., and Takahata, N. (2000) DNA archives and our nearest relative: the trichotomy problem revisited. Mol. Phylog. Evol. 14, 259–275. 20. Murphy, W. J., Pevzner, P. A., and O’Brien, S. J. (2004) Mammalian phylogenomics comes of age. Trends Genet. 20, 631–639. 21. Springer, M. S., Stanhope, M. J., Madsen, O., and de Jong, W. W. (2004) Molecules consolidate the placental mammal tree. Trends Ecol. Evol. 19, 430–438. 22. Jennings, W. B. and Edwards, S. V. (2005) Speciational history of Australian grass finches (Poephila) inferred from thirty gene trees. Evolution 59, 2033–2047. 23. Gogarten, J. P. and Townsend, J. P. (2005) Horizontal gene transfer, genome innovation and evolution. Nat. Rev. Microbiol. 3, 679–687. 24. Simpson, G. G. (1953) The Major Features of Evolution, Columbia University Press, New York. 25. Lanyon, S. M. (1988) The stochastic mode of molecular evolution: what consequences for systematic investigations? Auk 105, 565–573. 26. Hoelzer, G. A. and Melnick, D. J. (1994) Patterns of speciation and limits to phylogenetic resolution. Trends Ecol. Evol. 9, 104–107. 27. Kimura, M. (1983) The Neutral Theory of Molecular Evolution, Cambridge University Press, Cambridge. 28. Penny, D., McComish, B. J., Charleston, M. A., and Hendy, M. D. (2001) Mathematical elegance with biochemical realism: the covarion model of molecular evolution. J. Mol. Evol. 53, 711–723. 29. Mossel, E. and Steel, M. (2005) In Mathematics of Evolution and Phylogeny (Gascuel, O., ed.), pp. 384–412, Oxford University Press, New York. 30. Naylor, G. J. P. and Brown, W. M. 
(1998) Amphioxus mitochondrial DNA, chordate phylogeny, and the limits of inference based on comparisons of sequences. Syst. Biol. 47, 61–76. 31. Averof, M., Rokas, A., Wolfe, K. H., and Sharp, P. M. (2000) Evidence for a high frequency of simultaneous double-nucleotide substitutions. Science 287, 1283–1286. 32. Ayala, F. J. (1999) Molecular clock mirages. Bioessays 21, 71–75. 33. Fay, J. C., Wyckoff, G. J., and Wu, C. I. (2002) Testing the neutral theory of molecular evolution with genomic data from Drosophila. Nature 415, 1024–1026. 34. Smith, N. G. and Eyre-Walker, A. (2002) Adaptive protein evolution in Drosophila. Nature 415, 1022–1024. 35. Wolf, Y. I., Rogozin, I. B., and Koonin, E. V. (2004) Coelomata and not Ecdysozoa: evidence from genome-wide phylogenetic analysis. Genome Res. 14, 29–36. 36. Blair, J. E., Ikeo, K., Gojobori, T., and Hedges, S. B. (2002) The evolutionary position of nematodes. BMC Evol. Biol. 2, 7. 37. Dopazo, H. and Dopazo, J. (2005) Genome-scale evidence of the nematode-arthropod clade. Genome Biol. 6, R41.


38. Philip, G. K., Creevey, C. J., and McInerney, J. O. (2005) The Opisthokonta and the Ecdysozoa may not be clades: stronger support for the grouping of plant and animal than for animal and fungi and stronger support for the Coelomata than Ecdysozoa. Mol. Biol. Evol. 22, 1175–1184. 39. Phillips, M. J., Delsuc, F. D., and Penny, D. (2004) Genome-scale phylogeny and the detection of systematic biases. Mol. Biol. Evol. 21, 1455–1458. 40. Graybeal, A. (1998) Is it better to add taxa or characters to a difficult phylogenetic problem? Syst. Biol. 47, 9–17. 41. Kim, J. (1998) Large-scale phylogenies and measuring the performance of phylogenetic estimators. Syst. Biol. 47, 43–60. 42. Bininda-Emonds, O. R., Brady, S. G., Kim, J., and Sanderson, M. J. (2001) Scaling of accuracy in extremely large phylogenetic trees. Pac. Symp. Biocomput. 547–558. 43. Poe, S. and Swofford, D. L. (1999) Taxon sampling revisited. Nature 398, 299–300. 44. Zwickl, D. J. and Hillis, D. M. (2002) Increased taxon sampling greatly reduces phylogenetic error. Syst. Biol. 51, 588–598. 45. Rosenberg, M. S. and Kumar, S. (2001) Incomplete taxon sampling is not a problem for phylogenetic inference. Proc. Natl Acad. Sci. USA 98, 10,751–10,756. 46. Kurtzman, C. P. and Robnett, C. J. (2003) Phylogenetic relationships among yeasts of the ‘Saccharomyces complex’ determined from multigene sequence analyses. FEMS Yeast Res. 3, 417–432. 47. Nei, M. and Glazko, G. V. (2002) Estimation of divergence times for a few mammalian and several primate species. J. Hered. 93, 157–164. 48. Planet, P. J. (2006) Tree disagreement: measuring and testing incongruence in phylogenies. J. Biomed. Inform. 39, 86–102. 49. Head, I. M., Saunders, J. R., and Pickup, R. W. (1998) Microbial evolution, diversity, and ecology: a decade of ribosomal RNA analysis of uncultivated microorganisms. Microb. Ecol. 35, 1–21. 50. Rokas, A. and Holland, P. W. H. (2000) Rare genomic changes as a tool for phylogenetics. Trends Ecol. Evol. 15, 454–459. 
51. Rosenberg, N. A. (2002) The probability of topological concordance of gene trees and species trees. Theor. Popul. Biol. 61, 225–247. 52. Funk, D. J. and Omland, K. E. (2003) Species-level paraphyly and polyphyly: frequency, causes, and consequences, with insights from animal mitochondrial DNA. Annu. Rev. Ecol. Evol. Syst. 34, 397–423. 53. May, R. M. (2004) Tomorrow’s taxonomy: collecting new species in the field will remain the rate-limiting step. Phil. Trans. R. Soc. Lond. B Biol. Sci. 359, 733–734.

2 Phylogenomic Analysis by Chromosome Sorting and Painting
Roscoe Stanyon and Gary Stone
Summary
Chromosome sorting by flow cytometry is the principal source of chromosome-specific DNA not only for chromosome painting, but also for many other types of genomic analysis such as library construction, discovery and isolation of genes, chromosome-specific direct DNA selection, and array painting. Chromosome sorting coupled with chromosome painting is a rapid method for global phylogenomic comparisons. These two techniques have made notable contributions to our knowledge of the evolution of the mammalian genome. The flow sorting of multiple species allows reciprocal painting and permits the delineation of subchromosomal homology and the definition of chromosomal breakpoints. Chromosomes are valuable phylogenetic markers because rearrangements that become fixed at the species level are considered rare events and are apparently tightly bound to the speciation process. This chapter covers the preparation of a single chromosome suspension from cell cultures, bivariate chromosome flow sorting, preparation of chromosome paints by degenerate oligonucleotide primed-PCR, and the fluorescence in-situ hybridization and detection of whole chromosome specific probes.
Key Words: Zoo-FISH; flow cytometry; DOP-PCR; comparative molecular cytogenetics; genome evolution; chromosome sorting and painting.

1. Introduction
Over the last 15 yr, molecular cytogenetics has revealed the genome composition of dozens of primate and carnivore taxa, as well as a good number of species from most placental mammalian orders and other vertebrate taxa (1–12). Fluorescence in-situ hybridization (FISH) and chromosome painting using DNA probes specific to entire chromosomes have been and remain the method of choice (9,13,14). Fluorescence-activated chromosome sorting coupled with degenerate oligonucleotide primed-PCR (DOP-PCR) is the principal method of producing
From: Methods in Molecular Biology: Phylogenomics Edited by: W. J. Murphy © Humana Press Inc., Totowa, NJ


Stanyon and Stone

Fig. 1. Flow diagram of the main steps in phylogenomic analysis using chromosome flow sorting and chromosome painting.

chromosome paints (15–18). The flow sorting of multiple species allows reciprocal painting and permits the delineation of subchromosomal homology and definition of breakpoints. Chromosome sorting by flow cytometry is the principal source of chromosome-specific DNA not only for chromosome painting, but also for many other types of genomic analysis such as library construction, discovery and isolation of genes, chromosome-specific direct DNA selection, array painting, and studies of the interphase architecture of the nucleus (12,19–23). A single chromosome suspension is produced from a rapidly growing cell culture. The chromosomes are then stained with GC- and AT-specific fluorochromes and passed through a fluorescence-activated cell sorter (FACS). Most modern high-end flow sorters can be equipped to sort chromosomes. We assume that an experienced flow cytometer operator is available, but that the operator does not necessarily have experience in sorting chromosomes. A bivariate plot of chromosome fluorescence allows the operator to gate on specific chromosomes and sort them directly into PCR reaction tubes. Two rounds of PCR amplification are necessary. A primary round of DOP-PCR is used to directly amplify the sorts. Then a secondary PCR reaction is used to label the primary products. The probes can be labeled with biotin, digoxigenin, or fluorochrome-conjugated nucleotides. The labeled chromosome paints are then hybridized in situ, singly or in combination, to standard chromosome metaphase preparations, then detected and analyzed on a fluorescence microscope (Fig. 1).
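The idea of gating on a bivariate plot can be sketched in a few lines. This is a simplified illustration only: it assumes each event is a pair of (Hoechst, chromomycin) fluorescence values and uses a rectangular gate, whereas sorter software normally lets the operator draw free-form regions. The event values and gate limits are invented for the example.

```python
def in_gate(event, gate):
    """Return True if a (hoechst, chromomycin) event falls inside a rectangular gate."""
    hoechst, chromomycin = event
    return (gate["hoechst_min"] <= hoechst <= gate["hoechst_max"]
            and gate["chromomycin_min"] <= chromomycin <= gate["chromomycin_max"])


# Hypothetical events: (Hoechst signal, chromomycin signal) in arbitrary units.
events = [(120.0, 95.0), (300.5, 310.2), (305.1, 298.7), (610.0, 540.3)]

# Hypothetical gate drawn around one chromosome peak.
gate = {"hoechst_min": 290.0, "hoechst_max": 320.0,
        "chromomycin_min": 290.0, "chromomycin_max": 320.0}

# Only events inside the gate would be deflected into the collection tube.
sorted_events = [event for event in events if in_gate(event, gate)]
print(sorted_events)
```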

1.1. Contribution of Chromosome Painting to Phylogenomics The chromosome painting data are useful in phylogenomics, because chromosome rearrangements that become fixed in a species karyotype are rare events apparently tightly linked to the speciation process. The chromosome


painting data have made a notable contribution to our knowledge of the global evolution of the placental mammalian genome. These data provide a pictorial perspective mapping the phylogeny of each human chromosome over the last 90 Myr.

1.1.1. Ancestral Eutherian Karyotype
There are a number of hypotheses concerning the content of the ancestral eutherian karyotype (AEK) (24–28). Many of these proposals are based on comparative chromosome painting data. Although there is fair agreement between these proposals, the diploid numbers range from 2n = 44 to 50. However, the lack of comparative chromosome painting data between eutherians and an appropriate outgroup (Marsupialia and Prototheria) is a limitation on attempts to delineate the ancestral genome of placental mammals with this method. Chromosome painting data are still lacking on important taxonomic divisions within eutherians. Until recently, the most glaring lacunae were the lack of data on the suborder Xenarthra and the orders Pholidota, Eulipotyphla, and Dermoptera. There are now reports on chromosome painting in Xenarthra (29,30). Overall, the Xenarthran karyotypes strongly resemble the proposed AEK. There are also recent chromosome painting data for three insectivore (Eulipotyphla) species (31). These reports help clarify some of the differences in the AEK proposed by various authors concerning the presence or absence of the syntenic associations 1/19, 4/8, and 10/12/22. There seems little doubt that chromosome 1 is present as a single syntenic unit in the AEK (28,32,33). A conserved synteny for this chromosome has now been found in each of the superordinal taxa: Afrotheria (aardvark and elephant shrews), Xenarthra (sloths), Euarchontoglires (primates), and Laurasiatheria (cetaceans). The syntenic association human 1/19p was also found in the four studied species of Afrotheria (24,28). The presence of this combination in the Afrotheria (aardvark and elephant) as well as in the galago led some authors to assume that the 1/19p synteny was ancestral to the AEK (2n = 44) (28). 
It is now known from the reciprocal chromosome painting that the segment of human 19 combined with the counterpart of human chromosome 1 in the strepsirrhine primate Otolemur crassicaudatus is not the same as in the Afrotheria (34,35). Our recent results show that a 1/19 association is present in the armadillo, but its absence in the other two Xenarthran species studied suggests that this association may not be homologous to that found in Afrotheria (30). Reciprocal chromosome painting will be needed to test these two alternate proposals, but the best working hypothesis at the moment is that the syntenic association 1/19 was not present in the AEK. A combination of human 10p/12p/22a and a single human 10q was found in Afrotheria—the aardvark and elephant karyotypes (24,28). This syntenic association appeared by reciprocal painting to be homologous to that found in Carnivores. These findings led to the hypothesis that 10p/12q/22a was present


in the AEK. An apparently identical sequence of segments in the elephant shrew supported the conclusion that this synteny was widespread in Afrotheria (27). Recent chromosome painting data in Xenarthra (30) and Eulipotyphla (31) show that this syntenic association is more widespread than previously thought. It now appears that the syntenic association 10/12/22 could be included in the AEK. New chromosome painting data also demonstrate that the association 4/8 is present in all mammalian orders outside of elephants and primates, including the species of Xenarthra. This association is present in the shrew-hedgehog and was missed in previous studies of the common shrew (31). Our unpublished data from in-situ hybridizations in additional Scandentia species show that the 4/8 association was also probably missed previously (36). It seems certain that this association should be included in the AEK. Given the above considerations, the AEK would contain 46 chromosomes: 1, 2p-q, 2q, 3/21, 4/8p, 5, 6, 7a, 7b/16p, 8q, 9, 10q, 10p/12/22, 11, 12/22, 13, 14/15, 16q/19q, 17, 18, 19p, 20, X, Y.
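The diploid number follows directly from the list above. As a small check, the sketch below counts the autosomal units, doubles them, and adds the X and Y; it also pulls out the units that combine segments of more than one human chromosome (written with a slash). The variable names are introduced here for illustration only.

```python
# Proposed AEK units, written by homology to human chromosomes (from the text).
aek_units = [
    "1", "2p-q", "2q", "3/21", "4/8p", "5", "6", "7a", "7b/16p", "8q", "9",
    "10q", "10p/12/22", "11", "12/22", "13", "14/15", "16q/19q", "17", "18",
    "19p", "20", "X", "Y",
]

sex_chromosomes = {"X", "Y"}
autosomes = [unit for unit in aek_units if unit not in sex_chromosomes]

# Each autosome is present as a pair; X and Y contribute one chromosome each.
diploid_number = 2 * len(autosomes) + len(sex_chromosomes)

# Units joining segments of two or more human chromosomes.
syntenic_associations = [unit for unit in autosomes if "/" in unit]

print(diploid_number)
print(syntenic_associations)
```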

1.1.2. Ancestral Primate Karyotype
Reconstructions of the ancestral primate karyotype (APK) hypothesized a diploid number from 2n = 48 to 50 (25,36,37). Recent reciprocal chromosome painting has also refined the content of the APK, because the syntenic association 7/16 identical to that found in the proposed AEK has been reported in lorids (34,35). Therefore, 7b/16p should be included in the APK. Defined by reference to homology with the human karyotype, the genome of the last common ancestor of all living primates had the following chromosomes: 1, 2a, 2b, 3/21, 4, 5, 6, 7a, 7b/16b, 8, 9, 10, 11, 12a/22a, 12b/22b, 13, 14/15, 16a, 17, 18, 19a, 19b, 20, X and Y. Only a very few inter-chromosomal rearrangements, three fissions and two fusions, are needed to derive the APK from the AEK. Reciprocal painting in lorids also shows that the reciprocal translocation between 12a/22a and 12b/22b proposed by some authors (38) on the basis of previous painting results is unnecessary. The smaller 12/22 association in both primates and the AEK is formed by the same segment of the distal part of 12q and the proximal part of 22q.
2. Materials
2.1. Cell Culture and Flow Sorting
1. RPMI 1640 medium (cat. no. 22400-105, Invitrogen, Gaithersburg, MD) supplemented with 10% fetal bovine serum (FBS, Atlanta Biologicals, Norcross, GA), glutamine (cat. no. 25030-081, Invitrogen) and 1% antibiotics (cat. no. 15070-063, Invitrogen).
2. Hoechst 33258 (cat. no. B2883, Sigma-Aldrich, St. Louis, MO) stock solution: 10 mg in 10 mL dH2O. Heat to 65°C to dissolve. Filter and store at −20°C in amber or foil-wrapped tubes.


3. Chromomycin-A3 (cat. no. C2659, Sigma-Aldrich) stock solution: 5 mg in 2.5 mL of 100% ethanol. Store in amber tubes at −20°C.
4. Propidium iodide (cat. no. P4170, Sigma-Aldrich) stock solution: 1 mg in 10 mL dH2O; aliquot in 1 mL vials and freeze at −20°C.
5. Tris(hydroxymethyl)aminomethane hydrochloride (cat. no. T5941, Sigma-Aldrich)/EDTA stock solution: 1.18 g Tris, 832 mg EDTA in 100 mL dH2O, adjust to pH 8.0.
6. Tris(hydroxymethyl)aminomethane hydrochloride/EGTA stock solution: 1.18 g Tris, 190 mg EGTA in 100 mL dH2O, adjust to pH 8.0.
7. KCl stock solution: 5.98 g of KCl in 100 mL dH2O.
8. NaCl stock solution: 1.17 g NaCl in 100 mL dH2O.
9. Spermidine trihydrochloride (cat. no. S2501, Sigma-Aldrich)/spermine tetrahydrochloride (cat. no. S2867, Sigma-Aldrich) stock solution: 0.26 g spermidine and 0.14 g spermine in 1.0 mL dH2O; filter and freeze (−20°C) the stock solution in 1 mL vials.
10. Magnesium sulphate solution: 100 mM, 0.25 g in 10 mL dH2O.
11. Sodium citrate/sodium sulphite stock solution: 0.29 g tri-sodium citrate, 0.32 g sodium sulphite in 10 mL dH2O.
12. Polyamine buffer: 50 mL each of Tris/EDTA, Tris/EGTA, KCl and NaCl stock solutions, 250 µL spermidine/spermine stock solution, 1250 µL Triton X-100 (cat. no. X-100, Sigma-Aldrich) and 500 µL β-mercaptoethanol. Adjust to pH 7.2 and bring the total volume to 500 mL. Filter, aliquot in 10 mL tubes and freeze (−20°C).
13. Hypotonic solution (0.075 M KCl) with spermine/spermidine: 0.56 g KCl in 100 mL distilled water; add 50 µL spermidine/spermine stock solution. Filter before use and do not store.
14. Colcemid solution: 10 µg/mL (cat. no. 15210-040, Invitrogen).
15. 2.0 µm YG beads (cat. no. 18604, Polysciences, Warrington, PA).
16. Sheath Fluid (FACSFlow, cat. no. 342003, Becton-Dickinson, San Diego, CA).
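The salt recipes in this list can be cross-checked with a short calculation. The sketch below converts grams per volume into molarity and then applies the 50 mL in 500 mL dilution used for the polyamine buffer. The molecular weights are standard values for anhydrous KCl and NaCl and are an assumption about the reagents used; the KCl and NaCl stocks are taken to be made up in 100 mL.

```python
# Molecular weights in g/mol (assumed anhydrous salts).
MW_KCL = 74.55
MW_NACL = 58.44


def molarity(grams, molecular_weight, volume_ml):
    """Molar concentration of a solute dissolved to the given final volume."""
    return grams / molecular_weight / (volume_ml / 1000.0)


kcl_stock = molarity(5.98, MW_KCL, 100.0)    # KCl stock solution (item 7)
nacl_stock = molarity(1.17, MW_NACL, 100.0)  # NaCl stock solution (item 8)
hypotonic = molarity(0.56, MW_KCL, 100.0)    # hypotonic solution (item 13)

# Polyamine buffer: 50 mL of each stock brought to 500 mL total (item 12).
kcl_in_buffer = kcl_stock * 50.0 / 500.0
nacl_in_buffer = nacl_stock * 50.0 / 500.0

print(f"KCl stock:  {kcl_stock:.3f} M, in polyamine buffer: {kcl_in_buffer * 1000:.0f} mM")
print(f"NaCl stock: {nacl_stock:.3f} M, in polyamine buffer: {nacl_in_buffer * 1000:.0f} mM")
print(f"Hypotonic KCl: {hypotonic:.3f} M")
```

The hypotonic recipe reproduces the stated 0.075 M, which suggests the same approach is a reasonable sanity check for the other stocks.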

2.2. PCR Amplification and Labeling
1. PCR water is prepared by cross-linking 1 mL screw-top vials (cat. no. 20170-227, VWR, West Chester, PA) containing 750 µL molecular grade water (cat. no. 351-029-061, Quality Biologicals, Gaithersburg, MD) for 20 min.
2. dNTPs (cat. no. 1814362, Roche, Indianapolis, IN).
3. Biotin-16-dUTP (cat. no. 11093070001, Roche).
4. Dig-11-dUTP (cat. no. 1570013, Roche).
5. Labeling mix (dNTPs, cat. no. 969064, Roche): 100 µL each dATP, dGTP, dCTP, and 65 µL dTTP with 4.635 mL PCR water.
6. Taq DNA polymerase (cat. no. M0267L, New England Biolabs, Ipswich, MA) supplied with ThermoPol buffer and MgSO4.
7. Random Primer 6 MW 5′-CCGACTCGAGNNNNNNATGTAG-3′ (Operon, Huntsville, AL), dissolve and dilute to 100 mM.
8. Ladder 100 bp (cat. no. 15628-019, Invitrogen): 1225 µL of a 20-mM NaCl solution and 375 µL of neat (undiluted) loading buffer (cat. no. 10816-015, Invitrogen). Store at 4°C.
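Two quantities in this list lend themselves to a quick arithmetic check: the final volume of the labeling mix and the number of distinct sequences represented by the degenerate primer. The sketch below assumes that each N in the primer stands for any of the four bases; the primer string is copied from the list as printed.

```python
# Labeling mix volumes in microlitres (4.635 mL of PCR water = 4635 uL).
labeling_mix_ul = {
    "dATP": 100.0,
    "dGTP": 100.0,
    "dCTP": 100.0,
    "dTTP": 65.0,
    "PCR water": 4635.0,
}

total_ul = sum(labeling_mix_ul.values())

# dTTP is reduced relative to the other dNTPs, leaving room for labeled dUTP.
dttp_relative = labeling_mix_ul["dTTP"] / labeling_mix_ul["dATP"]

# Degenerate primer: each N position can be A, C, G, or T.
primer = "CCGACTCGAGNNNNNNATGTAG"
degenerate_positions = primer.count("N")
distinct_sequences = 4 ** degenerate_positions

print(f"Labeling mix total: {total_ul:.0f} uL")
print(f"dTTP relative to other dNTPs: {dttp_relative:.2f}")
print(f"Primer length: {len(primer)} nt, distinct sequences: {distinct_sequences}")
```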


2.3. Chromosome Painting (FISH)
1. DAPI (4′,6-diamidino-2-phenylindole, cat. no. D9542, Sigma-Aldrich) stock solution: 10 mg DAPI in 100 mL dH2O. Warm to 60–70°C to dissolve, then aliquot and freeze at −20°C.
2. DAPI staining solution: dilute stock solution 1:1000 in 4X SSC.
3. Deionized formamide (cat. no. JT4028-1, Baker): 5 g mixed resin (cat. no. 143-6425, BioRad, Hercules, CA) to 100 mL formamide. Stir for 1 h before use. Adjust pH to 7.0, aliquot and freeze at −20°C.
4. Hybridization buffer: 25 mL deionized formamide, 10 mL 50% dextran sulphate (cat. no. D8906, Sigma-Aldrich), 2.3 mL 0.5 M NaH2PO4, 1.7 mL 0.5 M Na2HPO4, 6 mL dH2O, 5 mL 20X SSC.
5. 20X SSC: 3 M NaCl, 0.3 M Na3 citrate. Dissolve 175.32 g NaCl and 88.23 g Na3 citrate in dH2O up to 1 L.
6. Antifade mounting solution: 100 mg p-phenylenediamine, 80 mL glycerine, 20 mL 1X PBS, adjust to pH 8.0 with 0.5 M bicarbonate. Aliquot in 1.5 mL amber vials and store at −20°C.
7. 4X SSC Tween 20: 100 mL 20X SSC, 400 mL dH2O, and 350 µL Tween 20 (cat. no. P1379, Sigma-Aldrich).
8. Antibody buffer/blocking buffer: 3 g bovine albumin (Fraction V, cat. no. A3059, Sigma-Aldrich), 100 mL 4X SSC Tween 20.
9. Fluorescein-Avidin (cat. no. A2011, Vector) detection solution: 1:200 in antibody buffer.
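The final composition of the hybridization buffer and of the 4X SSC Tween 20 wash follows from the listed volumes by simple dilution (C1 × V1 = C2 × V2). The sketch below works this out; it is a bookkeeping aid under the assumption that volumes are additive, and the dictionary keys are labels chosen for this example.

```python
# Hybridization buffer components in mL, as listed in item 4.
components_ml = {
    "deionized formamide": 25.0,
    "50% dextran sulphate": 10.0,
    "0.5 M NaH2PO4": 2.3,
    "0.5 M Na2HPO4": 1.7,
    "dH2O": 6.0,
    "20X SSC": 5.0,
}

total_ml = sum(components_ml.values())

# Final concentrations by dilution of each stock into the total volume.
formamide_percent = 100.0 * components_ml["deionized formamide"] / total_ml
dextran_percent = 50.0 * components_ml["50% dextran sulphate"] / total_ml
ssc_strength = 20.0 * components_ml["20X SSC"] / total_ml
phosphate_mm = 500.0 * (components_ml["0.5 M NaH2PO4"]
                        + components_ml["0.5 M Na2HPO4"]) / total_ml

# Wash solution (item 7): 100 mL 20X SSC plus 400 mL dH2O, with 350 uL Tween 20.
wash_ssc_strength = 20.0 * 100.0 / (100.0 + 400.0)
tween_percent = 100.0 * 0.350 / (100.0 + 400.0)

print(f"Hybridization buffer: {total_ml:.0f} mL total, {formamide_percent:.0f}% formamide, "
      f"{dextran_percent:.0f}% dextran sulphate, {ssc_strength:.0f}X SSC, "
      f"{phosphate_mm:.0f} mM phosphate")
print(f"Wash: {wash_ssc_strength:.0f}X SSC, {tween_percent:.2f}% Tween 20")
```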

3. Methods 3.1. Cell Culture and Flow Sorting

3.1.1. Cell Culture and Chromosome Suspension
1. The lymphoblastoid cell lines are grown in RPMI complete medium in capped flasks. With rapidly growing cells, tissue culture medium (doubling the volume) is added every 4 d. Cells grow best when the medium is slightly acidic. The culture is continued until about 100 mL of densely growing cells are present in capped 250 mL tissue culture flasks (see Note 1).
2. Colcemid (10 µL/mL) is added 24 h after the last addition of medium. Incubation is continued for 5–14 h.
3. Shake off cells and pour them into two 50 mL centrifuge tubes. Centrifuge at 200g for 10 min and remove the supernatant. Add 50 mL freshly made hypotonic solution to each tube, taking care that the cells are not clumped and are homogeneously dispersed in the hypotonic solution. Incubate at room temperature for 20–25 min (see Note 2).
4. Centrifuge at 400g for 5 min. Remove the supernatant and invert the tubes. Carefully wipe down the tube sides with an absorbent wipe and add 3 mL ice-cold polyamine buffer. Briefly and gently pipette the cells to break up any clumps. Incubate on ice for at least 15 min (see Note 3).
5. Vortex the cells at a moderate speed for 30–60 s. Mix 10 µL of suspension and 1 µL of propidium iodide stain on a slide. Mount the mixture and examine on a fluorescence

Phylogenomic Analysis by Chromosome Sorting and Painting

19

microscope to control chromosome release. Chromosomes should not be clumped or maintained within the cells (see Note 4). 6. Transfer the cell suspension into 1.5 mL Eppendorf tubes. Centrifuge at 100g for 3 min. Remove the supernate and place into 12 × 75 mm tubes for the flow cytometer [cat. no. 352063, Becton-Dickerson (B-D)]. For every 750 L of cell suspension add 20 L chromomycin, 2 L Hoechst stain and 20 L 100 mM magnesium sulphate. Invert the tube several times to mix. Incubate on ice for 2–3 h. 7. For every 750 L initial suspension add 100 L sodium citrate/sodium sulphite solution. Incubate for at least 15 min.

3.1.2. Flow Sorting
1. Set up the sorter with a constant 30 psi air supply with a 0.2-µm filter in line (cat. no. H095, Whatman, Florham, NJ) and a vacuum system that is capable of drawing 8 in. mercury (Hg) (see Note 5).
2. Fill the sheath tank with FACSFlow fluid (cat. no. 342003, Becton-Dickinson) and connect a 0.22-µm cartridge filter (cat. no. SVGS010RS, Millipore, Billerica, MA) to the sheath line running from the tank to the machine, to keep contaminants from reaching the sample and the resulting sorts.
3. Chill the sample port and sort reservoirs with a recirculation system (Scientific model no. 1160A, VWR) set to 4°C (see Note 6).
4. Fit the sorter with a 50-µm diameter nozzle tip (cat. no. 343592, B-D) to get the resolution required for the display of the flow karyotype (see Note 7).
5. Remove the beam expander element from the forward optics bench. This element makes the beams more spherical and serves no purpose in this setup.
6. Set the rear optics bench for a fluorescence (Fl) 1 height vs Fl 4 height density plot. The Fl 1 parameter, for the UV/Hoechst signal, uses a 390-nm long-pass filter (cat. no. E390LP, Chroma Technologies, Brattleboro, VT) and a 480-nm short-pass filter (cat. no. E480SP, Chroma Technologies) and is the drive position. The Fl 4 parameter, for the 457/chromomycin signal from the second beam, passes through a 490-nm long-pass filter (cat. no. E490LP, Chroma Technologies). All photomultiplier tube (PMT) sensors are acquired in the height mode for processing information (Fig. 2, Table 1).
7. Warm the lasers up for at least 30 min before attempting alignment. The lasers must have a timed delay between beam spots. The delay needs to be about 18.5 µs. The alignment for the initial setup should be performed as established by the instrument manufacturer.
8. Dilute 2.0-µm YG beads (cat. no. 18604, Polysciences) in 3.5 mL filtered distilled water in a tube (cat. no. 352063, B-D). Run them in the cytometer at a rate of 200 events per second.
9. Adjust the PMTs for each Fl signal so that channel 200 ± 5 is the peak channel.
10. Close the shutter for the 457 laser and align the UV laser first, using PMT 3 vs PMT 3. Align for maximum saturation of the signal. Next, align the 457 laser, which is in the second position: open the shutter for the visible laser and peak as for the UV laser, but use PMT 5 vs PMT 5. Typical PMT voltage settings for alignment are 508 and 276 V, respectively, with the threshold set to forward scatter (FSC), no value selected, and an amp. gain of 8 (Fig. 3).
11. Set the drop delay, at 61804 drop drive frequency (DDF) at 3 V, for 18.6. After the alignment, back flush the sample line and place the chromosome sample in the sample port (see Note 8).

Fig. 2. Schematic figure of the flow sorting of chromosomes (modified from B-D FACS training manual, page 123).

Table 1
This Table Illustrates the Typical Program Setup for Chromosomal Acquisition with the Threshold Set to Fl 1-H and Having a Threshold Value of 108
P1  FSC-H       8
P2  SSC-H  400  8
P3  Fl1-H  472  2
P4  Fl1-A       2
P5  Fl4-H  617  2
P6  Fl4-A       2
The amp. gains in P1 and P2 are set to a gain of 8. As we stated earlier, these are for our instrument only and are not intended for diagnostic purposes.

12. Run the sort at 200–500 events per second and acquire a bivariate plot of Hoechst fluorescence vs chromomycin fluorescence for 20,000–40,000 events, saving it as a storage file on the computer (see Note 9).
13. Identify the peaks by printing out the flow karyotype from the stored bivariate file and numbering the clusters.
14. Set the gates for sorting from the peaks. Use the Normal-R sort mode and sort right and left. The sort windows can be set up on the density plot as mentioned above, with the sort gate selection that pertains to the region of choice (Fig. 4).
15. Sort 250–500 chromosomes into 500-µL Eppendorf tubes (cat. no. PCR-05-C, Axygen Scientific, Union City, CA) containing 30 µL of molecular grade water; the tubes should have been cross-linked for 10 min beforehand. Store sorts at −20°C overnight before proceeding with PCR.

3.2. PCR Amplification and Labeling
1. Set up the primary PCR reactions by assembling a master mix of the following reagents per reaction (including an extra control tube): 9.5 µL PCR water, 5 µL buffer, 1 µL dNTPs, 1 µL primer, 0.5 µL Taq, 3 µL MgSO4 (25 mM). A control tube is set up using master mix and 30 µL water (see Note 10).
2. Mix and transfer 20 µL of the master mix to each reaction tube, changing the pipette tip for each reaction tube (see Note 11).
3. The PCR protocol is as follows: denaturation for 9 min at 94°C; 8 cycles of 94°C for 60 s, 30°C for 90 s, 72°C for 120 s; then 30 cycles of 94°C for 60 s, 60°C for 60 s, 72°C for 90 s; with a final extension at 72°C for 10 min (see Note 12).
4. PCR reaction products are checked by standard 1.0% agarose gel electrophoresis using the 100-bp ladder (cat. no. 15628-019, Invitrogen). The products should be present as a smear starting below 200 bp and fading out above 1000 bp, with most product concentrated between 200 and 500 bp (Fig. 5). The yield should be near 100 ng/µL.
5. A second PCR reaction is used to label the primary products. Assemble a master mix with the following reagents per reaction: 5 µL dNTPs, 5 µL buffer, 1 µL primer, 1 µL Mg, 0.5 µL Taq, 1.5 µL biotin or digoxigenin, 32 µL PCR water, and 4 µL primary PCR product (about 4 ng) (see Note 13).
6. The PCR cycling protocol is as follows: initial denaturation for 3 min at 94°C, then 30 cycles of 94°C for 60 s, 60°C for 60 s, 72°C for 90 s, with a final extension at 72°C for 7 min.
7. Check PCR products by gel electrophoresis as before. The smear should be concentrated around 200 bp (see Note 14).

Fig. 3. Histogram of each of the laser parameters after alignment. The top histogram is of the 2.0-µm beads, showing the peak channel and the relative coefficient of variation (CV) for the UV laser, Fl-1 parameter. The bottom histogram shows the peak channel and CV for the same beads for the 457 laser, Fl-4 parameter. The histogram stats show the statistical data for each.
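Scaling the primary master mix to a batch of sorted samples is simple arithmetic, sketched below. The per-reaction volumes are those listed in step 1 of Subheading 3.2.; the dictionary and function names are hypothetical, and the single extra reaction stands for the control tube mentioned in the text.

```python
# Per-reaction volumes (µL) for the primary DOP-PCR master mix, from step 1.
PRIMARY_MIX_UL = {
    "PCR water": 9.5,
    "buffer": 5.0,
    "dNTPs": 1.0,
    "primer": 1.0,
    "Taq": 0.5,
    "MgSO4 (25 mM)": 3.0,
}


def volume_per_reaction(per_reaction_ul: dict) -> float:
    """Total master-mix volume dispensed into each reaction tube (µL)."""
    return sum(per_reaction_ul.values())


def scale_master_mix(per_reaction_ul: dict, n_samples: int, n_extra: int = 1) -> dict:
    """Reagent volumes (µL) for n_samples reactions plus n_extra control tubes."""
    n_total = n_samples + n_extra
    return {reagent: round(vol * n_total, 2) for reagent, vol in per_reaction_ul.items()}


if __name__ == "__main__":
    print(f"Master mix per tube: {volume_per_reaction(PRIMARY_MIX_UL)} µL")
    for reagent, vol in scale_master_mix(PRIMARY_MIX_UL, n_samples=7).items():
        print(f"{reagent}: {vol} µL")
```

The per-tube total of 20 µL matches the volume transferred in step 2; added to the 30 µL of water holding the sorted chromosomes, it gives a 50-µL reaction.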

3.3. Chromosome Painting
1. Prepare standard chromosome spreads and age the slides overnight at 45°C or for several days at room temperature (see Note 15).
2. For inter-ordinal Zoo-FISH, assemble the chromosome paint by mixing 10 µL labeled PCR product, 10 µL human COT-1 DNA, 1 µL salmon sperm DNA, and 2 µL sodium acetate (3 M), and then adding 60 µL cold 100% ethanol (see Note 16).
3. Store overnight at −20°C and centrifuge at 4°C for 30 min at 12,000g.
4. Pour off the ethanol, conserving the probe pellet, and wash twice with 70% ethanol. Eliminate all ethanol and dry the pellet.
5. Add 12 µL hybridization buffer and dissolve the probe pellet.
6. Denature the probe mixture at 70°C for 15 min, then reanneal the probe for 60 min at 37°C.
7. Denature the slides in 70% formamide/2X SSC at 65°C for 2 min (see Note 17). Dehydrate and dry the slides in a 4°C ethanol series (70, 90, and 100% for 3 min each).
8. Place the hybridization mixture on the slide and mount with an 18 × 18 mm plastic coverslip.
9. Seal the coverslip with rubber cement and incubate in a wet chamber at 37°C for at least 48 h (see Note 18).
10. Carefully remove the rubber cement and place the slides in a Coplin jar with 2X SSC at 42°C until all slides are ready.
11. Move the slides to a Coplin jar with 50% formamide and 2X SSC (pH 7.0, 42°C) for 10 min.
12. Move the slides to a Coplin jar with 2X SSC (42°C, pH 7.0) for three changes, for a total of 15 min.
13. Move the slides to a Coplin jar with 4X SSC Tween 20 at 42°C for 3 min.
14. Briefly drain the slides, but do not let them dry, and mount in Fluorescein-Avidin detection solution with a precut parafilm strip. Incubate in a wet chamber for 45 min at 37°C (see Note 19).
15. Remove the parafilm coverslip and place the slides in 4X SSC Tween 20 at 42°C for a total of 20 min with two changes.
16. Stain in DAPI staining solution for 10 min (see Note 20).
17. Rinse the slides briefly in distilled water and mount with 35 µL antifade solution under a 24 × 50 mm coverslip.
18. Analyze the slide with a fluorescence microscope (Fig. 6).

Fig. 4. A flow karyotype of the chimpanzee. Left is the density plot and right is the dot plot. As noted in the Notes section (see Notes 4–9), the density plot is our choice for sorting. The colors in the density plot relate to the number of chromosomes that display the same amount of stain uptake and are therefore more alike, or purer. This allows the most precise placement of the sorting window for a purer sort and hence a purer probe.

Fig. 5. Typical gel for DOP-PCR of chromosome paints which were primed with 6 MW primer. From left to right: 100 bp ladder, secondary PCR products of chimpanzee chromosomes 10 and 14, and a control lane for PCR master mix without DNA.

4. Notes
1. With fibroblast or other attached cell lines, flasks are kept in a CO2 incubator and the medium is changed every 3–4 d. Colcemid (10 µL per mL) is added 24 h after confluent cells have been subdivided. Cell culture medium, serum, and other supplements appropriate for the cell type should be used. Continue the cell culture for about 14 h, then detach the mitotic cells mechanically, by washing with PBS, or by treating with trypsin/EDTA.
2. Different cell types may require shorter or longer incubations in hypotonic solution.
3. After adding the polyamine buffer, the cells can be stored at 4°C before continuing. Additional polyamine buffer can be added after chromosome release if needed. Digitonin in place of Triton-X can also provide equivalent results.
4. Cells may need to be vortexed for additional periods. Check release after an additional 15 s of vortexing. Fibroblast cells may need more vigorous vortexing. If chromosomes are not released after repeated vortexing, they may be gently syringed using a 22-gauge needle. Alternatively, different incubation times in hypotonic solution may help chromosome release. If no released chromosomes are seen on the microscope, the experiment should be terminated.
5. The instrument settings described here are for a B-D FACS DiVa with a dual laser configuration, high-speed sorter head, and Cell Quest software (B-D). The lasers are Coherent Innova 305Cs (Coherent, Santa Clara, CA). They are run in the light mode with power track on and the power setting at 275 mW. The drive laser, or first position laser, is a 361-nm UV laser. The second laser is a 457-nm laser. Both lasers must have a T00 aperture setting. The T00 setting for the UV laser is aperture 11 and the aperture for the 457 laser is aperture 0.
6. Keeping both the specimen and the sorted chromosomes cold provides better sort results. Chilling the specimen prevents the flow karyotype from drifting during long sorts. Chilling the sorted chromosomes prevents degradation of the chromosomal DNA.
7. A 70-µm tip (cat. no. 343593, B-D) is usable for viewing chromosomes but does not have the resolution of the 50-µm tip.
8. Your actual delay may vary depending on your laser and manufacturer.
9. Most publications show flow karyotypes as dot plots, but it is our experience that the density plot affords better sort window placement and, as a consequence, higher probe purity. Higher sorting speeds are necessary for bulk sorting, and up to 17,000 events per second can be achieved if there is no breakdown in the flow karyotype.
10. This general primer does not always provide good results in all species. Some alternative primers are F/S = 5′-GGACTCGAGNNNNNNTACACC-3′, which is good for mouse chromosomes, and GAG = 5′-GAGGAGGAGGAGGAGGAGGAG-3′. Another alternative is G1 = 5′-GAGGATGAGGTTGAGNNNNNNTGG-3′ and G2 = 5′-GTGAGTGAGAGGATGAGGTTGAG-3′, where G1 is used in the primary PCR and G2 is used in the secondary PCR.
11. It is important to avoid contamination by already amplified DNA products. It is suggested that a separate room be maintained for primary PCR. Single-use reagent aliquots and pipettes dedicated only to primary PCR are also helpful to avoid contamination problems.
12. PCR reaction products should all be checked by standard agarose gel electrophoresis.
13. The amount of biotin or digoxigenin used in labeling can be increased if necessary to 2.5 µL per reaction. Smaller amounts of digoxigenin compared to biotin may provide equivalent results. Procedures for fluorescence-conjugated dNTPs (direct labeling) are essentially similar.
14. High molecular weight labeled DNA produces higher background. If the gel smear shows notable DNA over 500 bp, the reaction should be run again, adjusting the amount of DNA, primers, and other PCR parameters. DNase digestion is also possible to render the labeled probe usable.
15. Good quality slides of metaphase spreads (free of cytoplasm) are essential for hybridization and can usually be used for chromosome painting without any pretreatment. A humidity-controlled environment (about 50%) is helpful to improve spreading. However, many laboratories experience difficulties in preparing a slide of sufficient quality for hybridization. Hybridization quality can be increased by a number of pretreatments. The most common pretreatment uses pepsin to eliminate cytoplasm and other proteins that might interfere with probe penetration. Other laboratories routinely refix slides in methanol:acetic acid before use.
16. This section deals with single color chromosome painting. Multicolor techniques require hybridizing together several differentially labeled paints and detecting them simultaneously, but the procedure is essentially identical. If precipitating both digoxigenin and biotin probes together for two color FISH, double the amounts of reagents.
17. The slides can be G-banded before in-situ hybridization. In this case they should be destained and refixed in 1% formaldehyde in PBS for 10 min. Subsequent denaturation is at 55°C for 30–60 s.
18. There is considerable discussion about hybridization times; suggested times range from overnight to a week.
19. For detecting biotin and digoxigenin together, add antidigoxigenin Rhodamine-conjugated antibodies (cat. no. 1207750910, Roche) to the detection solution (1:200). Both directly and indirectly labeled probes may be hybridized contemporaneously. If only directly labeled probes are used, steps 13–15 can be skipped.
20. Slides may also be directly mounted in antifade solution containing DAPI; however, the banding is not as sharp.

Fig. 6. Typical result for a single color inter-ordinal chromosome painting: human chromosome 7 paint on elephant shrew chromosomes.
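The alternative primers in Note 10 contain a run of N positions, each of which can be any of the four bases. As a rough illustration of what that degeneracy means, the sketch below counts the distinct sequences a primer encodes and lists the first few. The primer strings are copied from Note 10; the function names are hypothetical and nothing here is part of the published protocol.

```python
from itertools import product

# Primer sequences from Note 10 (5' to 3'); N stands for any of A, C, G, T.
F_S = "GGACTCGAGNNNNNNTACACC"
G1 = "GAGGATGAGGTTGAGNNNNNNTGG"


def degeneracy(primer: str) -> int:
    """Number of distinct sequences encoded when each N can be A, C, G, or T."""
    return 4 ** primer.upper().count("N")


def expand(primer: str, limit: int = 5):
    """Yield up to `limit` concrete sequences for a primer containing N positions."""
    primer = primer.upper()
    n_count = primer.count("N")
    for i, bases in enumerate(product("ACGT", repeat=n_count)):
        if i >= limit:
            break
        fill = iter(bases)
        # Replace each N, left to right, with the next base of this combination.
        yield "".join(next(fill) if ch == "N" else ch for ch in primer)


if __name__ == "__main__":
    for name, seq in [("F/S", F_S), ("G1", G1)]:
        print(f"{name}: length {len(seq)}, {degeneracy(seq)} possible sequences")
    for seq in expand(F_S, limit=3):
        print(seq)
```

With six N positions, each primer corresponds to 4^6 = 4096 possible sequences.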

Acknowledgments
The authors would like to thank Nigel Carter and Fengtang Yang (Sanger Institute) and Joe Fawcett (Los Alamos National Laboratory) for their many helpful suggestions on chromosome flow sorting. We also thank Polina Perelman and Sandra Burkett for comments and assistance.

References
1. Bourque, G., Pevzner, P. A., and Tesler, G. (2004) Reconstructing the genomic architecture of ancestral mammals: lessons from human, mouse, and rat genomes. Genome Res. 14, 507–516.
2. Chowdhary, B. P. and Raudsepp, T. (2001) Chromosome painting in farm, pet and wild animal species. Methods Cell Sci. 23, 37–55.
3. Graphodatsky, A. S., Yang, F., Perelman, P. L., et al. (2002) Comparative molecular cytogenetic studies in the order Carnivora: mapping chromosomal rearrangements onto the phylogenetic tree. Cytogenet. Genome Res. 96, 137–145.
4. Nash, W. G., Wienberg, J., Ferguson-Smith, M. A., Menninger, J. C., and O’Brien, S. J. (1998) Comparative genomics: tracking chromosome evolution in the family Ursidae using reciprocal chromosome painting. Cytogenet. Cell Genet. 83, 182–192.
5. O’Brien, S. J., Eisenberg, J. F., Miyamoto, M., et al. (1999) Genome maps 10. Comparative genomics. Mammalian radiations. Wall chart. Science 286, 463–478.
6. Shetty, S., Griffin, D. K., and Graves, J. A. (1999) Comparative painting reveals strong chromosome homology over 80 million years of bird evolution. Chromosome Res. 7, 289–295.
7. Wienberg, J. and Stanyon, R. (1997) Comparative painting of mammalian chromosomes. Curr. Opin. Genet. Dev. 7, 784–791.
8. Wienberg, J. and Stanyon, R. (1998) Comparative chromosome painting of primate genomes. Ilar J. 39, 77–91.
9. Ferguson-Smith, M. A., Yang, F., Rens, W., and O’Brien, P. C. (2005) The impact of chromosome sorting and painting on the comparative analysis of primate genomes. Cytogenet. Genome Res. 108, 112–121.
10. Graphodatsky, A. S., Yang, F., O’Brien, P. C., et al. (2001) Phylogenetic implications of the 38 putative ancestral chromosome segments for four canid species. Cytogenet. Cell Genet. 92, 243–247.
11. Scherthan, H., Cremer, T., Arnason, U., Weier, H. U., Lima-de-Faria, A., and Fronicke, L. (1994) Comparative chromosome painting discloses homologous segments in distantly related mammals. Nat. Genet. 6, 342–347.
12. Trask, B. J. (2002) Human cytogenetics: 46 chromosomes, 46 years and counting. Nat. Rev. Genet. 3, 769–778.
13. Ferguson-Smith, M. A. (1997) Genetic analysis by chromosome sorting and painting: phylogenetic and diagnostic applications. Eur. J. Hum. Genet. 5, 253–265.
14. Ferguson-Smith, M. A., Yang, F., and O’Brien, P. C. (1998) Comparative mapping using chromosome sorting and painting. Ilar J. 39, 68–76.
15. Cram, L. S. (1990) Flow cytogenetics and chromosome sorting. Hum. Cell 3, 99–106.
16. Telenius, H., Carter, N. P., Bebb, C. E., Nordenskjold, M., Ponder, B. A., and Tunnacliffe, A. (1992) Degenerate oligonucleotide-primed PCR: general amplification of target DNA by a single degenerate primer. Genomics 13, 718–725.
17. Telenius, H., Pelmear, A. H., Tunnacliffe, A., et al. (1992) Cytogenetic analysis by chromosome painting using DOP-PCR amplified flow-sorted chromosomes. Genes Chromosomes Cancer 4, 257–263.
18. VanDevanter, D. R., Choongkittaworn, N. M., Dyer, K. A., et al. (1994) Pure chromosome-specific PCR libraries from single sorted chromosomes. Proc. Natl Acad. Sci. USA 91, 5858–5862.
19. Cremer, T. and Cremer, C. (2001) Chromosome territories, nuclear architecture and gene regulation in mammalian cells. Nat. Rev. Genet. 2, 292–301.
20. Fantes, J. A., Green, D. K., and Sharkey, A. (1994) Chromosome sorting by flow cytometry. Production of DNA libraries and gene mapping. Methods Mol. Biol. 29, 205–219.
21. Fauth, C. and Speicher, M. R. (2001) Classifying by colors: FISH-based genome analysis. Cytogenet. Cell Genet. 93, 1–10.
22. Fiegler, H., Gribble, S. M., Burford, D. C., et al. (2003) Array painting: a method for the rapid analysis of aberrant chromosomes using DNA microarrays. J. Med. Genet. 40, 664–670.
23. Gribble, S. M., Fiegler, H., Burford, D. C., et al. (2004) Applications of combined DNA microarray and chromosome sorting technologies. Chromosome Res. 12, 35–43.
24. Fronicke, L., Wienberg, J., Stone, G., Adams, L., and Stanyon, R. (2003) Towards the delineation of the ancestral eutherian genome organization: comparative genome maps of human and the African elephant (Loxodonta africana) generated by chromosome painting. Proc. Biol. Sci. 270, 1331–1340.
25. Murphy, W. J., Bourque, G., Tesler, G., Pevzner, P., and O’Brien, S. J. (2003) Reconstructing the genomic architecture of mammalian ancestors using multispecies comparative maps. Hum. Genomics 1, 30–40.
26. Murphy, W. J., Stanyon, R., and O’Brien, S. J. (2001) Evolution of mammalian genome organization inferred from comparative gene mapping. Genome Biol. 2, REVIEWS0005.
27. Svartman, M., Stone, G., Page, J. E., and Stanyon, R. (2004) A chromosome painting test of the basal eutherian karyotype. Chromosome Res. 12, 45–53.
28. Yang, F., Alkalaeva, E. Z., Perelman, P. L., et al. (2003) Reciprocal chromosome painting among human, aardvark, and elephant (superorder Afrotheria) reveals the likely eutherian ancestral karyotype. Proc. Natl Acad. Sci. USA 100, 1062–1066.
29. Dobigny, G., Yang, F., O’Brien, P. C., et al. (2005) Low rate of genomic repatterning in Xenarthra inferred from chromosome painting data. Chromosome Res. 13, 651–663.
30. Svartman, M., Stone, G., and Stanyon, R. The ancestral eutherian karyotype is present in Xenarthra. submitted.
31. Ye, J., Biltueva, L. S., Huang, L., et al. (2006) Cross-species chromosome painting unveils cytogenetic signatures for the Eulipotyphla and evidence for the polyphyly of Insectivora. Chromosome Res. in press.
32. Haig, D. (2005) The complex history of distal human chromosome 1q. Genomics 86, 767–770.
33. Murphy, W. J., Fronicke, L., O’Brien, S. J., and Stanyon, R. (2003) The origin of human chromosome 1 and its homologs in placental mammals. Genome Res. 13, 1880–1888.
34. Nie, W., O’Brien, P. C., Fu, B., et al. (2006) Chromosome painting between human and lorisiform prosimians: evidence for the HSA 7/16 synteny in the primate ancestral karyotype. Am. J. Phys. Anthropol. 129, 250–259.
35. Stanyon, R., Dumas, F., Stone, G., and Bigoni, F. (2006) Multidirectional chromosome painting reveals a remarkable syntenic homology between the greater galagoes and the slow loris. Am. J. Phys. Anthropol. in press.
36. Muller, S., Stanyon, R., O’Brien, P. C., Ferguson-Smith, M. A., Plesker, R., and Wienberg, J. (1999) Defining the ancestral karyotype of all primates by multidirectional chromosome painting between tree shrews, lemurs and humans. Chromosoma 108, 393–400.
37. O’Brien, S. J. and Stanyon, R. (1999) Phylogenomics. Ancestral primate viewed. Nature 402, 365–366.
38. Froenicke, L. (2005) Origins of primate chromosomes – as delineated by Zoo-FISH and alignments of human and mouse draft genome sequences. Cytogenet. Genome Res. 108, 122–138.

3
FISH for Mapping Single Copy Genes
Terje Raudsepp and Bhanu P. Chowdhary

Summary
During the past two decades fluorescent in-situ hybridization (FISH) has become a standard technique to directly localize, orient, and order genes in the genomes of a wide range of species. Despite the availability of a variety of probes, probe labeling and signal-detection systems, and advanced image analysis software, the core procedures used to carry out FISH remain the same. A detailed overview of these procedures, including target preparation (metaphase/interphase chromosomes and DNA fibers), probe labeling, in-situ hybridization, signal detection, and imaging, is provided here in a stepwise manner.

Key Words: FISH; gene mapping; metaphase chromosomes; interphase chromosomes; DNA fibers; DNA labeling.

From: Methods in Molecular Biology: Phylogenomics. Edited by: W. J. Murphy © Humana Press Inc., Totowa, NJ

1. Introduction
Fluorescence in-situ hybridization (FISH) to nuclear chromatin is the most direct approach to visualizing the physical location of DNA markers on the chromosomes. For over a decade, the technique has served as the backbone of the development of physical gene maps in a range of evolutionarily diverse species by finding the precise band location of genes on the chromosomes, deducing the relative physical order of closely located loci, and aligning syntenic and linkage groups to specific chromosomes or chromosomal regions (for reviews, see refs. 1–7).

The two primary components of FISH mapping are the target DNA and the probe DNA. Complementary sequences between these components permit chemical bonds between them during hybridization. The target could be metaphase or prometaphase chromosomes, chromatin fiber from interphase cells, or mechanically stretched nuclear DNA. A typical probe for FISH mapping is at least 4–5 kb of DNA in size, but in certain instances it can range from telomeric or centromeric repeat sequences a few base pairs long to 1- to 2-Mb-long DNA segments cloned in yeast artificial chromosomes (YACs). When the probe DNA and the target chromatin DNA are from the same species, the hybridization is referred to as homologous, whereas if they are from different species, it is referred to as heterologous.

One of the primary reasons that FISH has become the central approach to developing physical gene maps for a variety of species is the remarkable resolution it provides over all other approaches. Depending on the type of target, DNA probes can be mapped at a resolution of ~5 Mb (metaphase FISH) to ~5 kb (fiber FISH) (6,7). None of the other approaches, such as synteny, linkage, or RH mapping, provide physical resolution of this magnitude. Multicolor FISH, whereby several probes can be hybridized and visualized simultaneously on the target, has been vital for improved mapping resolution (8,9). This approach is now routinely used to order closely located loci that either fall in the same “bin” by linkage and RH mapping or cannot be confidently oriented owing to the resolution limits of the mapping panels.

The success of mapping single copy genes by FISH relies on a number of factors, including preparation of target chromatin/DNA, accurate identification of chromosomes, isolation and modification of probe DNA, hybridization, signal detection, and imaging. Further, although digital imaging and software-based analysis of FISH results have become more or less standard, due consideration must be given to accurate documentation and interpretation of results. In this chapter, we address each of these factors in order, with the aim of highlighting their critical role in optimizing hybridization signals for mapping single copy genes.

2. Materials
2.1. Cell Cultures for Metaphase and Interphase Chromosomes
A range of tissue or cell types can be used as a source for metaphase and interphase chromosome preparations. Consequently, the culture and harvesting techniques vary accordingly. Herein we provide details on only two of the most commonly used cell culture approaches employed to obtain well-spread metaphase/prometaphase and interphase chromosome preparations. One of these approaches uses peripheral blood lymphocytes whereas the other uses fibroblast cells.

2.1.1. Chromosome Preparations from Peripheral Blood Lymphocyte Cultures
1. About 5–10 mL peripheral blood collected under sterile conditions in sodium-heparin vacutainer tubes (VACUTAINER™, Becton-Dickinson, San Diego, CA). The samples must be stored at 4°C and preferably used within 3–4 d.
2. Culture media containing RPMI Medium 1640 with Glutamax and 25 mM HEPES buffer (Gibco, Invitrogen, Gaithersburg, MD), 30% fetal bovine serum (GEMINI BioProducts), 1.4% antibiotic–antimycotic solution (Gibco BRL, Life Technologies), and 1% mitogen, either pokeweed (lectin from Phytolacca americana, Sigma-Aldrich, St. Louis, MO) or phytohemagglutinin (lectin from red kidney bean, Gibco BRL) (see Note 1). Typically, 500 mL media is prepared and stored in 9-mL aliquots in sterile 15-mL screw-cap centrifuge tubes at −20°C.
3. Ethidium bromide (Bio-Rad) solution, 1 mg/mL in ddH2O. Store in the dark at 4°C.
4. Demecolcine solution, 10 µg/mL in Hanks’ balanced salt solution (HBSS) (Sigma-Aldrich). Store at 4°C.
5. Hypotonic solution: 0.075 M KCl (see Note 2). Store at room temperature (RT).
6. Fixative: methanol/glacial acetic acid in a ratio of 3:1.
7. Cleaning solution for microscope glass slides: chromic-sulfuric acid solution, alias cleaning solution (Fisher Scientific).
8. Double-frosted microscope glass slides (Fisher Scientific).
9. Light microscope with ×20 phase contrast objective.

2.1.2. Chromosome Preparations from Fibroblast Cultures
1. Sterile forceps and small scissors, 70% alcohol, sterile Petri dish.
2. A small piece (5 × 5 mm) of skin/tissue biopsy collected under sterile conditions.
3. Collection media: sterile HBSS (Sigma-Aldrich).
4. Culture media comprising HyQ™DME high glucose Dulbecco’s modified Eagle’s medium (MEM; HyClone) supplemented with 10% fetal bovine serum (GEMINI BioProducts), 1% non-essential amino acid 100X solution for MEM (Cellgro, Mediatech), 1% MEM 100X vitamin solution (Cellgro, Mediatech), and 1% MEM sodium pyruvate 100X solution (Cellgro, Mediatech).
5. 25-cm2 (T25) and 75-cm2 (T75) cell culture flasks (COSTAR®, Corning).
6. Sterile-filtered and cell culture tested trypsin/EDTA 0.25% solution in HBSS (Sigma-Aldrich).
7. 5% CO2 sterile incubator.
8. Sterile 5- and 10-mL pipettes, pipettors, or pipette bulbs.
9. Inverted microscope with ×20 and ×40 phase contrast objectives.

The remaining materials are the same as listed in Subheading 2.1.1., items 3–9.

2.2. DNA Fiber Preparations
2.2.1. Preparation of Agarose Embedded DNA
1. 50 mL peripheral blood is collected in K3EDTA vacutainer tubes (VACUTAINER, Becton-Dickinson). Best results are obtained with fresh blood.
2. Phosphate-buffered saline (PBS): 137 mM NaCl, 10 mM phosphate, and 2.7 mM KCl. For 1000 mL solution, dissolve 8 g NaCl, 0.2 g KCl, 1.44 g Na2HPO4, and 0.24 g KH2PO4 in 800 mL ddH2O, adjust pH to 7.4 with HCl, add ddH2O to 1 L, and autoclave. Store at RT.
3. Lymphoprep™ (Greiner Bio-One). Store at RT.
4. Counting chamber Bürker-Türk with double net ruling (Fisher Scientific) or a hemacytometer (Hausser Scientific).
5. Trypan blue 0.4% solution (Sigma-Aldrich). Store at RT.
6. The 50-well disposable plug molds used for pulsed-field gel electrophoresis (PFGE) (Bio-Rad).
7. Parafilm “M” laboratory film (Pechiney Plastic Packaging).
8. 0.1 M HCl.
9. 15- and 50-mL screw cap centrifuge tubes (Fisher Scientific).
10. 1.9% low-melting point agarose (NuSieve GTG) in 0.125 M EDTA. Store at 4°C.
11. Cell lysis solution: 50 mM EDTA, 1% N-lauroylsarcosine sodium salt (Sigma-Aldrich), 2 mg/mL proteinase K (Sigma-Aldrich) in ddH2O. Store at 4°C.
12. TE buffer (1X): 10 mM Tris-HCl, pH 8.0, and 1 mM EDTA, pH 8.0. Store at RT.
13. Phenylmethylsulfonyl fluoride (PMSF) solution: make fresh stock solution by dissolving 40 mg PMSF (Sigma-Aldrich) in 1 mL absolute ethanol. Prepare working solution by diluting the stock solution 1:1000 with 1X TE buffer.
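As a quick check on the PBS recipe, the gram amounts can be converted back to millimolar concentrations, as sketched below. The molar masses assume anhydrous Na2HPO4 and KH2PO4, which the recipe does not specify; the names used are illustrative only.

```python
# Molar masses in g/mol; the anhydrous phosphate salts are an assumption.
MOLAR_MASS = {"NaCl": 58.44, "KCl": 74.55, "Na2HPO4": 141.96, "KH2PO4": 136.09}

# Grams per litre, as given in the PBS recipe.
RECIPE_G_PER_L = {"NaCl": 8.0, "KCl": 0.2, "Na2HPO4": 1.44, "KH2PO4": 0.24}


def millimolar(compound: str, grams: float, volume_l: float = 1.0) -> float:
    """Concentration (mM) = grams / molar mass / volume (L) x 1000."""
    return grams / MOLAR_MASS[compound] / volume_l * 1000.0


if __name__ == "__main__":
    for compound, grams in RECIPE_G_PER_L.items():
        print(f"{compound}: {millimolar(compound, grams):.1f} mM")
    phosphate = sum(millimolar(c, RECIPE_G_PER_L[c]) for c in ("Na2HPO4", "KH2PO4"))
    print(f"Total phosphate: {phosphate:.1f} mM")
```

Under these assumptions NaCl and KCl come out at about 137 and 2.7 mM, as stated, while total phosphate is nearer 12 mM than the nominal 10 mM.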

2.2.2. Pretreatment of Glass Slides for Making DNA Fiber Preparations
1. Wheaton glass staining dishes, 500 mL (Fisher Scientific).
2. Plain microscope slides (Fisher Scientific).
3. 0.2 N HCl.
4. Acetone (Fisher Scientific).
5. Subbing solution: mix 0.3 g sodium azide (Sigma-Aldrich) and 1.5 g gelatin in 30 mL ddH2O (use a 100-mL glass beaker). Gently heat and stir until the ingredients are completely dissolved. Add the solution to another beaker containing 970 mL ddH2O and mix. Filter through Whatman filter paper No. 1 (Fisher Scientific) and store in a closed bottle at 4°C.
6. 500 mL 0.02% poly-L-lysine hydrobromide (Sigma-Aldrich) solution in ddH2O. Store at −20°C.

2.2.3. Stretching DNA Fibers on Lysine-Coated Slides 1. Microwave oven. 2. DAPI solution: 1 mg/mL 4,6-Diamidino-2-phenylindole dihydrochloride (DAPI; Sigma Aldrich) in ddH2O. Store in the dark at 20°C. 3. Antifade solution: 1 mg/mL p-Phenylenediamine (Sigma Aldrich) in PBS. Dissolve 100 mg p-Phenylenediamine in 10 mL PBS (pH 7.4, autoclaved). Adjust pH to 8.0 with NaOH. Add glycerol to make the final volume 100 mL and mix in the dark for 2 h. Aliquot in 1-mL volumes and store in the dark at 20°C. During prolonged storage, the antifade solution oxidates and turns dark in color; however, this does not affect the quality. 4. Mounting solution—DAPI-antifade: add 1 L of DAPI solution (1 mg/mL) to 1 mL antifade solution. Vortex briefly and store in the dark at 20°C until use (see Note 3). 5. Glass coverslips, 24 × 60 mm (VWR Scientific). 6. Diamond point glass marker.


Table 1
Most Commonly Used Fluorophores, Their Excitation and Emission Wavelengths, and Corresponding Filter Cubes for Visualization Under the Microscope

| Fluorophore | Excitation wavelength (nm) | Emission wavelength (nm) | Filter cube |
|---|---|---|---|
| DAPI/HOECHST/AMCA | 360 | 460 | UV |
| Aqua | 435 | 477 | BLUE1 |
| Fluorescein (FITC), spectrum-green | 480 | 535 | BLUE2 |
| Spectrum-orange (see Note 6) | 546 | 572 | YELLOW |
| Rhodamine, spectrum-orange, Cy3 | 545 | 610 | RED1 |
| Spectrum-red, Texas Red, Cy3.5 | 560 | 645 | RED2 |
| Far-red¹ | 630 | 667 | FAR-RED |
| FITC/spectrum-green + rhodamine/spectrum-orange | | | Double bandpass |

¹Far-red is visible only to the camera and not to the human eye.

7. Fluorescence microscope with UV filter (see Table 1 for different types of filters).
8. 20X SSC: 3 M NaCl and 0.3 M sodium citrate. Autoclave and store at RT. This stock solution is diluted with ddH2O to get 2X SSC, 3X SSC, 4X SSC, etc. solutions.
9. Ethanol series: 70%, 80%, 90%, 100%.
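The dilutions of the 20X SSC stock mentioned in item 8 follow the usual relation C1 × V1 = C2 × V2. The short Python sketch below illustrates the arithmetic only; the function name and the 500-mL final volume are illustrative assumptions, not part of the protocol.

```python
def dilute_stock(stock_x: float, target_x: float, final_volume_ml: float) -> tuple[float, float]:
    """Return (stock_ml, water_ml) needed to reach target_x from stock_x.

    Uses C1 * V1 = C2 * V2, with concentrations expressed as "X" strength.
    """
    if not 0 < target_x <= stock_x:
        raise ValueError("target strength must be positive and not exceed the stock")
    stock_ml = target_x * final_volume_ml / stock_x
    return stock_ml, final_volume_ml - stock_ml


# Example: 500 mL each of 2X, 3X, and 4X SSC from the 20X stock
# (the 500-mL volume is an assumed working volume).
for target in (2, 3, 4):
    stock_ml, water_ml = dilute_stock(20, target, 500)
    print(f"{target}X SSC: {stock_ml:.0f} mL 20X SSC + {water_ml:.0f} mL ddH2O")
```

The same helper applies to any stock expressed in "X" units, such as the 5X blocking solution or the 10X RNase stock.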

2.3. Probe DNA
1. 3 M sodium acetate buffer solution, pH 5.5 (Sigma-Aldrich).
2. 100% and 70% ethanol.
3. 1% agarose gel with 0.2 µg/mL ethidium bromide.
4. Gel loading buffer: 0.124 g Orange G (Sigma-Aldrich), 8.6 mL glycerol, and ddH2O to a final volume of 25 mL. Mix by vortexing and store at RT.
5. Spectrophotometer.
6. DNA quantification marker, 100 ng/µL. The marker can be made from any good quality, high-molecular-weight DNA sample of known concentration.

2.4. Probe Labeling
2.4.1. Indirect Labeling with Biotin or Digoxigenin by Nick Translation
1. 1 µg of probe DNA (in a maximum volume of 16 µL).
2. Biotin (Bio) nick translation kit, 5X concentrated, for 40 reactions (Biotin-Nick Translation Mix, Roche Diagnostics). Contains reagents: 0.25 mM dATP, 0.25 mM dCTP, 0.25 mM dGTP, 0.17 mM dTTP, 0.08 mM biotin-16-dUTP, DNA polymerase I, and DNase I, all in stabilized reaction buffer with 50% v/v glycerol.
3. Digoxigenin (Dig) nick translation kit, 5X concentrated, for 40 reactions (DIG-Nick Translation Mix, Roche Diagnostics). Contains reagents: 0.25 mM each of dATP, dCTP, and dGTP, 0.17 mM dTTP, 0.08 mM dig-11-dUTP, DNA polymerase I, and DNase I, all in stabilized reaction buffer with 50% v/v glycerol.
4. Waterbath or styrofoam box with lid. Thermometer.
5. Spin-50 minicolumns with collecting tubes (BIOMAX Inc.).
6. 1% agarose gel with 0.2 µg/mL ethidium bromide.
7. Gel loading buffer (see Section 2.3. item 4).
8. 100 bp molecular size marker (New England Biolabs).

2.4.2. Direct Labeling with Spectrum Fluorophores (Fluors) by Nick Translation
1. 1 µg probe DNA (in a maximum volume of 17.5 µL).
2. Nick translation kit (Vysis LCI®): 0.3 mM stock solutions for dTTP, dCTP, dATP, and dGTP; nick translation enzyme mix (DNA polymerase I and DNase I in 50% v/v glycerol); 10X nick translation buffer (500 mM Tris–HCl, pH 7.2, 100 mM MgSO4, 1 mM DTT); nuclease free water.
3. 0.1 mM dTTP and 0.1 mM dNTP mix (dATP, dCTP, dGTP) working solutions in nuclease free water.
4. SpectrumOrange™-dUTP, SpectrumGreen™-dUTP, or SpectrumRed™-dUTP (Vysis), 50 nanomoles each. Make 1 mM stock solutions in nuclease free water and store in the dark at −20°C. 0.2 mM working solutions are made immediately before use.

The remaining labeling materials are the same as in Section 2.4.1. items 4–8.

2.5. Probe Hybridization Mixture
1. Hybridization mastermix (MM): 70% deionized formamide (molecular biology grade, Sigma-Aldrich), 14% dextran sulphate sodium salt (Sigma-Aldrich), and 3X SSC (see Subheading 2.2.3., item 8). Mix carefully and store in 1-mL aliquots in microcentrifuge tubes at −20°C.
2. Competitor DNA: 1 µg/µL genomic DNA from the species of probe origin (see Note 4).
3. Labeled probe DNA (see Subheading 2.4.).
4. DNA concentrator (Eppendorf Vacufuge™).

2.6. Hybridization
1. 10X RNase (Fisher Scientific) stock solution, 1 mg/mL in 2X SSC. Boil the solution for 10 min to inactivate DNase and store at −20°C. Make RNase 1X working solution (100 µg/mL in 2X SSC) and store at −20°C.
2. 24 × 60 mm glass coverslips for RNase treatment.
3. 70% formamide (deionized, OmniPur, EMD) in 2X SSC. Adjust pH to 7.0 with concentrated HCl and store at 4°C.
4. Diamond point glass marker.
5. 70%, 80%, 90%, 100% ethanol series. 70% ethanol is stored at −20°C and 80%, 90%, 100% at 4°C.
6. 4 × 4 mm² glass coverslips for hybridization (can be easily made by cutting larger coverslips with a diamond point marker). Clean the coverslips by rinsing in 100% ethanol and air dry. Make sure that no dust particles or small glass pieces remain attached to the coverslips.
7. Rubber cement.
8. Moist chamber and 37°C incubator.

2.7. Posthybridization Washing and Signal Detection
1. 50% formamide in 2X SSC. Adjust pH to 7.0 with concentrated HCl and store at 4°C.
2. 4X SSC.
3. 4X SSC containing 0.05% Tween-20.
4. PN buffer: 0.1 M Na2HPO4, 0.1% IGEPAL CA-630 (Sigma-Aldrich). Adjust pH to 8.0 with 0.1 M NaH2PO4.
5. Blocking solution: 5X in-situ hybridization blocking solution (Vector Laboratories). 1X blocking solution is prepared by diluting the stock solution with PN buffer.
6. 24 × 60 mm and 24 × 50 mm glass coverslips.
7. DAPI-antifade mounting solution (see Section 2.2.3. item 4).
8. Slide storage boxes.

2.7.1. Biotin System
1. Fluorescein (FITC) Avidin D (Vector Laboratories) stock solution, 5 mg/mL (see Note 5).
2. Biotinylated antiavidin D (Vector Laboratories) stock solution, 0.5 mg/mL.

2.7.2. Digoxigenin System
1. Antidigoxigenin, monoclonal antibody from mouse (Roche Biochemicals) stock solution, 0.1 mg/mL.
2. Sheep antimouse Ig-digoxigenin, F(ab’)2-fragment (Chemicon International) stock solution, 0.2 mg/mL.
3. Antidigoxigenin–rhodamine Fab fragments (Roche Biochemicals) stock solution, 0.2 mg/mL.

2.8. Analysis
1. Standard karyotype and chromosome nomenclature of the species being studied.
2. Fluorescence microscope (e.g., Zeiss AXIOPLAN 2 universal microscope) with appropriate filter cubes corresponding to the fluorophores used (see Table 1).
3. High-performance CCD camera (e.g., MetaSystems, Applied Imaging, Hamamatsu ORCA camera).
4. Image Analysis System (e.g., Applied Imaging CytoVision™ Imaging System, GENUS™ Workstation or MetaSystems Isis Imaging software).


3. Methods
3.1. Cell Cultures for Metaphase and Interphase Chromosome Preparations

3.1.1. Chromosome Preparations from Peripheral Blood Lymphocyte Cultures
1. Under sterile conditions, add 1 mL whole blood or plasma with the buffy coat (on top of sedimented blood) (see Note 7) to the aliquoted 9 mL prewarmed (37°C) cell culture media in culture tubes. Mix and incubate for 72 h at 37°C. Gently mix/invert cultures twice a day.
2. After 68–72 h of culture, add 100 µL ethidium bromide solution (final conc. 10 µg/mL) and incubate for 1 h at 37°C.
3. Add 100 µL demecolcine solution (final conc. 0.1 µg/mL) and incubate for 1 h at 37°C.
4. Spin the tubes at 100 rcf for 10 min and aspirate supernatant, leaving ~1 mL medium at the bottom. Re-suspend the cell pellet gently with a Pasteur pipette.
5. Add 2–3 mL prewarmed (37°C) hypotonic solution, gently mix the cells with a Pasteur pipette, and add more solution for a final volume of 10 mL. Incubate for 30–40 min at 37°C and spin the tubes at 100 rcf for 10 min.
6. Aspirate supernatant, and gently pipette to re-suspend the cells so that no clumps remain; add 5 mL fresh fixative and mix. Spin at 100 rcf for 10 min. Repeat this step three more times. After the last treatment, aspirate most of the supernatant, leaving cells in ~200 µL fixative. Re-suspend the cells by gentle pipetting.
7. Transfer the fixed cell suspension from 15-mL centrifuge tubes into 1.5- to 2.0-mL microcentrifuge tubes. Either make chromosome preparations directly or store the tubes at −20°C until needed.
8. Slides are cleaned by soaking them overnight in glass cleaning solution followed by rinsing under running tap water for 10 h. Slides are thereafter rinsed thoroughly several times in distilled water and stored in distilled water at 4°C until needed. When taken from the water, clean slides should have a water film spread evenly over the glass surface. If the film is not even, the slides need more cleaning (see Note 8).
9. Hold the clean, wet, and cold glass slide at a 45° angle. Drop approximately 10 µL of fixed cell suspension, and allow it to spread/flow on the slide. Let the slide air dry and check it under a light microscope using a ×20 phase contrast objective. Good slides for FISH should have ~30–40 metaphase spreads per 3–4 mm² and the metaphase chromosomes and interphase nuclei should be free from cytoplasm (see Note 9). The slides can be stored in air-tight boxes containing desiccant at −20°C for 2–4 yr.

3.1.2. Chromosome Preparations from Fibroblast Cultures
1. Place the skin/tissue biopsy with a small amount of collection media on a sterile Petri dish and mince with sterile forceps and scissors.
2. Carefully transfer the tiny minced fragments into T25 culture flasks and place them so that there is enough space for outgrowth. Add 0.5 mL culture media and incubate at 37°C until the fragments have attached to the flask. Do not let the pieces dry out.
3. Add 5 mL of culture medium per flask and incubate at 37°C with 5% CO2. Check cultures under an inverted microscope every second day and feed them twice a week by replacing the medium.
4. When outgrowth is sufficient (~50–100 fibroblast cells around most of the fragments), aspirate culture medium and wash the flasks twice with 5 mL HBSS to remove the tissue fragments. Add 1 mL 0.25% Trypsin-EDTA and incubate for 5 min at 37°C. Trypsin detaches cells from the flask surface. Without removing the trypsin solution, add 2 mL culture medium to suspended cells in the flask (serum will inactivate trypsin) and transfer the cells into a new T25 flask. Add 5 mL fresh medium and incubate at 37°C in 5% CO2. Check the cultures every 2 d and feed as needed.
5. When the T25 flask reaches confluency, the cells are ready to be transferred into a T75 flask. Fibroblasts are detached from the surface as described in step 4, transferred into the T75 flask, covered with 10 mL fresh medium, and incubated at 37°C in 5% CO2. The cultures are checked every second day.
6. Cultures are ready for harvesting when they are semi-confluent (~60%) and contain abundant mitotic cells visible as round, enlarged, partially detached bodies. The nondividing fibroblasts appear as elongated bodies attached to the surface.
7. Add 100 µL of demecolcine solution (see Subheading 3.1.1., item 3) directly to the T75 flask, gently swirl to let it mix with the media, and incubate for 1 h at 37°C. After incubation, tap the side of the flask firmly against the palm of your hand to detach dividing cells. Aspirate the medium containing suspended mitotic cells into a 15-mL centrifuge tube and proceed as described in Subheading 3.1.1., items 4–9.

3.2. DNA Fiber Preparations
3.2.1. Preparation of Agarose-Embedded DNA
1. Mix whole blood with an equal volume of PBS buffer in 50-mL centrifuge tubes.
2. Fill 15-mL centrifuge tubes with 2.5 mL lymphoprep and very carefully add 5 mL PBS-diluted blood. The blood should remain on the top and not get mixed with lymphoprep. Spin for 30 min at 600 rcf.
3. Using a Pasteur pipette, aspirate the small phase (containing white blood cells) between the lymphoprep and PBS-serum layers, and transfer into a clean 15-mL centrifuge tube. While processing many lymphoprep-PBS-blood tubes, the white blood cell phases can be pooled into 15-mL centrifuge tubes up to a volume of 5 mL in each.
4. Add PBS buffer up to 10 mL and spin for 10 min at 100 rcf. Pour off the supernatant and repeat washing with PBS two more times. After the last wash, pour off most of the supernatant and gently re-suspend the cell pellet in the remaining ~1 mL PBS.
5. Count the cells in a Bürker-Türk chamber: dilute the cell suspension 1:100 with PBS and 0.4% trypan blue by mixing 1 µL cell suspension with 98 µL PBS and 1 µL trypan blue solution. Living cells remain white while dead cells stain blue. Count the number of live cells in the smallest 16-cell square of the chamber and calculate the total number of cells using the formula N = X × 100 × 250, where N = total number of live cells and X = number of live cells in the smallest square. The ideal concentration is 2 × 10⁶ cells per 100 µL. If the concentration is lower, spin the cell suspension for 10 min at 100 rcf and draw more supernatant.


6. Seal the bottom of PFGE plug molds with general use laboratory tape and precool them on ice.
7. Prewarm 1.9% low melting agarose solution at 38–40°C and mix equal volumes of cell suspension and melted agarose. Dispense cell-agarose mixture into ice-cool plug molds. Each well of a Bio-Rad PFGE plug mold holds approximately 60 µL solution. Allow agarose to solidify.
8. Tap off cell-agarose blocks carefully onto parafilm and transfer into a 50-mL centrifuge tube containing 40 mL cell lysis solution. Typically, 30–40 agarose blocks are collected in each tube. Incubate the blocks for 48 h at 50°C with gentle shaking. After proteinase K treatment, the agarose blocks should turn virtually clear and sink to the bottom, an indication that cells have lysed, releasing nuclear DNA.
9. After incubation, place the tube(s) on ice for 5–10 min and allow the blocks to firm up. Pour the supernatant through a filter paper to catch escaping blocks.
10. Transfer the blocks from the filter paper into a 50-mL centrifuge tube containing 45 mL TE. Wash the blocks by gently rotating the tube. Pour off the TE through a filter paper. Repeat washing three more times.
11. Transfer the blocks into a new 50-mL tube (maximum 30–40 blocks per tube) and fill it with PMSF working solution. Incubate for 50 min at 50°C with gentle shaking. This step will inactivate proteinase K. PMSF is very poisonous: use extreme caution and protective gloves!
12. Pour PMSF off through a filter paper and wash the blocks four times with TE as described in step 10.
13. Released nuclear DNA embedded in agarose blocks can be stored in TE at 4°C for 2–3 yr. TE buffer should be changed every 3–4 mo.
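The counting formula in step 5, N = X × 100 × 250, can be written out as a small Python sketch. The function and parameter names are illustrative assumptions; the factor 100 is the 1:100 dilution described in step 5 and 250 is the chamber factor given in the formula.

```python
def live_cell_count(cells_in_square: int,
                    dilution_factor: int = 100,
                    chamber_factor: int = 250) -> int:
    """Apply N = X * dilution_factor * chamber_factor from step 5.

    cells_in_square is X, the number of live (unstained) cells counted in
    the smallest 16-cell square of the chamber.
    """
    if cells_in_square < 0:
        raise ValueError("cell count cannot be negative")
    return cells_in_square * dilution_factor * chamber_factor


# Example with a hypothetical count of 8 live cells in the square.
print(live_cell_count(8))  # 200000
```

If a different dilution is used, only `dilution_factor` needs to change.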

3.2.2. Pretreatment of Glass Slides for DNA Fibers
For better attachment of negatively charged DNA fibers to a microscope slide, the glass slides are precoated with the positively charged amino acid polymer poly-L-lysine.
1. Arrange slides in glass racks, ensuring that slide surfaces do not touch each other.
2. Dip the slides for 30 s each in (a) 0.2 N HCl, (b) ddH2O, and (c) acetone. Air dry at RT. Note: All drying steps in this section must be in draft-free chambers.
3. Dip the slides for 5 min in subbing solution and air dry at RT.
4. Dip the slides twice in poly-L-lysine solution, 10 min each, then rinse in ddH2O for 30 s, and air dry at RT. Repeat this step one more time.
5. Let the slides dry overnight at RT. Thereafter, store the slides in an air-tight box at 4°C. The slides are generally good for up to 6 mo.

3.2.3. Stretching DNA Fibers on Lysine-Coated Slides
Slides with mechanically stretched DNA fibers cannot be stored and are therefore made fresh on the day of the FISH experiment.
1. Preheat microwave oven for about 30 min.
2. Take poly-L-lysine-coated slide(s) out of the storage box and, with a diamond point marker, denote the side to be used for stretching the DNA.
3. Take an agarose block from the storage tube, slice off a small 3 × 3 mm piece, and place it on one end of the glass slide. Wash the piece twice for 2 min with 20 µL of ddH2O (use a micropipette).
4. Cover the block with 20 µL of ddH2O and place the slide into a preheated microwave oven for 30–40 s (see Note 10) to let the agarose block melt but not dry out.
5. Take the slide out from the oven and quickly, with the help of another glass slide, stretch the melted agarose containing DNA over the slide (just like preparing a blood smear).
6. Stain the slide with DAPI-antifade mounting solution: place a few drops of the solution on the slide and cover it with a 24 × 60 mm coverslip.
7. Check the quality and density of DNA fibers with the UV filter under a fluorescence microscope. The fibers should be straight and long (Fig. 1C). On the reverse side of the slide, mark the most suitable areas for hybridization and analysis using a diamond pen.
8. Dip the slides in 2X SSC solution for 5 min and gently slide off the coverslip. Rinse the slides again in another 2X SSC to wash off the mounting solution. Dehydrate the slides through ascending ethanol series and air dry. The slides are now ready for FISH.

3.3. Probe DNA Probe DNA for mapping single copy genes can originate from a variety of sources [PCR product, cDNA clones or genomic DNA fragment cloned in plasmid, lambda phage, cosmid, bacterial artificial chromosome (BAC), P1 artificial chromosome (PAC), or YAC vectors]. Several protocols for isolation and purification of DNA from these sources are described elsewhere (10). However, for a successful FISH experiment, careful consideration must be given to three main characteristics of probe DNA.

3.3.1. Probe Size
To overcome the limitations associated with sensitivity and resolution in FISH (see Subheading 1), DNA probes for mapping single copy genes should typically contain at least 4–5 kb of unique DNA sequence. At present, whole genome BAC libraries are available for most domestic and many wild animal species. BAC clones are the preferred DNA probes for cytogenetic mapping of single copy genes at all FISH resolutions (metaphase, interphase, and fiber) because of their large insert size (100–200 kb), low rate of chimerism, and easy handling (growth and isolation).

3.3.2. Probe Quality
Traces of proteins (e.g., DNase I and DNA polymerase I inhibitors) in probe DNA might interfere with nick translation and prevent efficient probe labeling. If labeling and subsequent FISH do not give distinct hybridization signals, purification of the probe DNA as described below is strongly recommended:
1. Precipitate DNA with 2.5 volumes of cold 100% ethanol and 0.1 volume of 3 M Na-acetate.
2. Incubate at −20°C for 30–60 min.
3. Spin for 15 min at 15,700 rcf and discard ethanol.
4. Wash the pellet with 70% ethanol, spin for 5 min at 15,700 rcf, remove ethanol, air dry, resuspend the DNA in ddH2O, and store at −20°C. It is recommended to dissolve the DNA in water instead of TE to avoid the inhibitory effect of EDTA on nick translation enzymes.

Fig. 1. FISH signals on (A) metaphase chromosomes, (B) interphase nuclei, and (C) DNA fibers.

3.3.3. Probe Concentration
The amount of probe DNA (especially when received from another lab) must be routinely estimated both by spectrophotometry and by electrophoresis on a 1% agarose gel against control DNA (100 ng/µL). Gel electrophoresis also shows whether the probe contains bacterial DNA (in the case of cloned DNA) and RNA, both of which inflate spectrophotometer estimates. The final estimate of probe DNA concentration should be made by comparing the results of the two approaches.

3.4. Probe Labeling FISH probes can be labeled using different approaches, e.g., nick translation, random priming, PCR, FastTag system (Vector Laboratories, ref. 11), and others. These approaches may use indirect labeling systems that require posthybridization signal enhancement and detection (e.g., from Vector Laboratories, Roche Diagnostics, and Invitrogen) or direct labeling systems by which the dNTPs are tagged with fluorophores (e.g., from Cambio, Vysis, and Molecular Probes). In the following section, we will limit descriptions to: (i) nick translation—a labeling technique suitable for most of the probes; (ii) labeling using biotin and digoxigenin—the most widely used molecules for indirect labeling, and (iii) labeling using spectrum fluorophores—to exemplify direct labeling.

3.4.1. Indirect Labeling with Biotin or Digoxigenin Using Nick Translation
1. Place a microcentrifuge tube on ice and mix 1 µg probe DNA, 4 µL biotin- or DIG-nick translation mix, and ddH2O to get a final reaction volume of 20 µL.
2. Mix the contents by pipetting, spin down for 1–2 s, and incubate the reaction for 90 min at 15°C. Incubate in either a waterbath placed in a cold room or in a styrofoam box in the lab after the water temperature is adjusted with ice.
3. After incubation, add 30 µL ddH2O to increase the volume to 50 µL.
4. Prepare Spin-50 minicolumns by spinning them for 3 min at 1200 rcf. Discard the collection tubes and place the columns into clean microcentrifuge tubes. Add the 50 µL labeled DNA mix to the columns and spin for 3 min at 1200 rcf. This step separates the labeled probe from unincorporated nucleotides.
5. Mix 5 µL (~100 ng) labeled probe with 2 µL loading buffer and run on a 1% agarose gel with a 100 bp ladder in a separate lane. Labeled DNA should look like a smear with an average fragment size of 300–700 bp (see Note 11).
6. Store labeled probes at −20°C or proceed immediately with the preparation of hybridization mixture.

3.4.2. Direct Labeling with Spectrum Fluorophores (Fluors) Using Nick Translation
1. Place a microcentrifuge tube on ice and mix the following: 1 µg probe DNA, 2.5 µL 0.2 mM spectrum fluor (orange, green, or red), 5 µL 0.1 mM dTTP, 10 µL 0.1 mM dNTP mix, 5 µL 10X nick translation buffer, 10 µL enzyme mix, and ddH2O to bring the final volume to 50 µL.
2. Incubate the reaction overnight (8–16 h) at 15°C (see Subheading 3.4.1., item 1).
3. Clean the labeled probe through Spin-50 columns, check probe size on 1% agarose gel, and store as described in Subheading 3.4.1., items 4–6.
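The reaction in step 1 leaves 17.5 µL for probe DNA plus water once the fixed reagents (32.5 µL in total) are added, which matches the maximum DNA volume given in Subheading 2.4.2. The Python sketch below works out the DNA and water volumes for a given probe concentration; the function name and the example concentration are illustrative assumptions.

```python
# Fixed reagent volumes (µL) for one direct-labeling reaction, from step 1.
REAGENTS_UL = {
    "0.2 mM spectrum fluor-dUTP": 2.5,
    "0.1 mM dTTP": 5.0,
    "0.1 mM dNTP mix": 10.0,
    "10X nick translation buffer": 5.0,
    "enzyme mix": 10.0,
}
FINAL_VOLUME_UL = 50.0


def dna_and_water_ul(dna_conc_ng_per_ul: float, dna_ng: float = 1000.0) -> tuple[float, float]:
    """Return (dna_ul, water_ul) for a reaction containing dna_ng of probe."""
    if dna_conc_ng_per_ul <= 0:
        raise ValueError("DNA concentration must be positive")
    dna_ul = dna_ng / dna_conc_ng_per_ul
    water_ul = FINAL_VOLUME_UL - sum(REAGENTS_UL.values()) - dna_ul
    if water_ul < 0:
        raise ValueError("probe is too dilute: DNA volume exceeds the space left in the reaction")
    return dna_ul, water_ul


# Example: a probe at an assumed concentration of 100 ng/µL.
dna_ul, water_ul = dna_and_water_ul(100.0)
print(f"DNA: {dna_ul:.1f} µL, ddH2O: {water_ul:.1f} µL")
```

A probe more dilute than about 57 ng/µL (1000 ng in 17.5 µL) raises an error here, indicating that the DNA would need to be concentrated first.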

3.5. Preparation of Probe Hybridization Mixture
1. Mix the remaining 45 µL labeled probe with 15–20 µL unlabeled genomic DNA (1 µg/µL) (see Note 12) and vacuum dry the contents.
2. Resuspend the dried probe DNA in 6 µL H2O and 14 µL hybridization MM. These volumes can vary, but the water and MM must remain in a 3:7 v/v ratio. Final concentrations of the constituents in the hybridization mixture are as follows: ~45 ng/µL labeled probe, formamide 50%, dextran sulfate 10%, and 2X SSC. Probe hybridization mixtures can be stored at −20°C for 1–2 yr.
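The final concentrations quoted in step 2 can be checked from the mastermix composition in Subheading 2.5. and the 3:7 mixing ratio. The Python sketch below is illustrative only: the names are assumptions, and the 900 ng of labeled probe assumes full recovery of the 1 µg labeled DNA in 50 µL, of which 45 µL is used.

```python
# Mastermix (MM) composition from Subheading 2.5., item 1.
MASTERMIX = {"formamide_pct": 70.0, "dextran_sulfate_pct": 14.0, "ssc_x": 3.0}


def final_concentrations(water_ul: float = 6.0,
                         mm_ul: float = 14.0,
                         probe_ng: float = 900.0) -> dict[str, float]:
    """Return constituent concentrations after resuspending the dried probe."""
    total_ul = water_ul + mm_ul
    mm_fraction = mm_ul / total_ul  # 0.7 for the 3:7 ratio
    result = {name: value * mm_fraction for name, value in MASTERMIX.items()}
    result["probe_ng_per_ul"] = probe_ng / total_ul
    return result


for name, value in final_concentrations().items():
    print(f"{name}: {value:.1f}")
```

The computed values (49% formamide, 9.8% dextran sulfate, 2.1X SSC, 45 ng/µL probe) round to the figures given in step 2.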

3.6. Hybridization
The basic protocol for hybridization to metaphase/interphase chromosomes and to DNA fibers is the same except for two minor differences: (i) fiber-FISH slides are prepared on the day that hybridization is to be carried out, whereas chromosome slides can be made a long time in advance, and (ii) chromosome slides need RNase pretreatment, whereas fiber-FISH slides can be used directly for hybridization.
1. Check chromosome slides under a phase contrast objective and mark (with a diamond marker) ~4 × 4 mm² areas with good quality metaphase spreads and/or interphase nuclei. High-quality slides allow simultaneous hybridization of 4–8 probes on a single slide, making the experiment cost- and labor-efficient.
2. Place 500 µL of RNase working solution on the slide, cover with a 24 × 60 mm coverslip, and incubate in a moist chamber for 1 h at 37°C.
3. Slide off the coverslip, rinse the slides for 2 min in 2X SSC at RT, dehydrate in ascending ethanol series (2 min each), and air dry. From this step onwards until the end of signal detection, the slides should not dry except when dehydrated in ascending ethanol series. Drying causes background signal during signal detection.
4. Denature slides with chromosomal or stretched DNA in 70% formamide, 2X SSC solution at 70°C for 2 min (see Note 13).
5. Immediately immerse the slides for 2 min in ice-cold 70% ethanol and then dehydrate in ascending ethanol series and air dry. Slides can be denatured and dehydrated several hours before the probe is ready for hybridization.
6. Take 2–3 µL of probe hybridization mix into a clean microcentrifuge tube. If two or three differently labeled probes are being cohybridized, take 2 µL aliquots from each and pool together.
7. Denature the probe DNA for 10 min at 70°C.
8. Preanneal the denatured probe mix for 20 min at 37°C (see Note 14).
9. Place denatured probe(s) on denatured preparations, apply a separate coverslip to each hybridization area (make sure no air bubbles remain under the coverslip), seal the edges of the coverslip(s) with rubber cement, and incubate the slides overnight in a moist chamber at 37°C.

3.7. Posthybridization Washing and Signal Detection
1. Remove rubber cement and rinse slides in 2X SSC until the coverslips glide off.
2. Wash slides three times, 5 min each, in 50% formamide, 2X SSC at 40°C.
3. Wash three times, 2 min each, in 4X SSC, 0.05% Tween-20 at RT with gentle shaking.
4. Wash for 2 min in 4X SSC at RT with gentle shaking.
5. At this stage, mount the slides with directly labeled probes (spectrum fluors) in DAPI-antifade solution as follows: take slides out of the 4X SSC, let excess solution drain off (do not dry), place approximately 20 µL mounting medium on the hybridization area, and seal with a 24 × 50 mm coverslip. Make sure no air bubbles remain under the coverslip. Wipe off excess antifade solution from the sides and store the slides in a dark, air-tight box at −20°C.

For biotin and/or digoxigenin labeled probes, continue with the detection as described below.

3.7.1. Detection and Signal Amplification of Biotin-Labeled Probes
Antibodies for each of the detection layers outlined below are first diluted in 1X blocking solution (200 µL per slide), briefly mixed by vortexing, and then applied to the slide under a 24 × 60 mm coverslip. For all detection steps, the slides are incubated for 30 min at 37°C.
1. Biotin-Layer I: Mix 0.2 µL fluorescein–avidin D (avidin-FITC) stock solution with 200 µL 1X blocking solution (final conc. 5 ng/µL) and apply to the slide.
2. Wash the slides as described in Subheading 3.7., items 3 and 4.
3. Biotin-Layer II: Mix 2 µL biotinylated antiavidin D stock solution with 200 µL 1X blocking solution (final conc. 5 ng/µL) and apply to the slide.
4. Repeat washing as in Subheading 3.7., items 3 and 4.
5. Biotin-Layer III: the same as Subheading 3.7.1., item 1.
6. Wash the slides as described in Subheading 3.7., items 3 and 4.
7. Mount the slides in DAPI-antifade and store in the dark at −20°C.

3.7.2. Detection and Signal Amplification of Digoxigenin Labeled Probes
1. Dig-Layer I: Mix 0.8 µL antidig stock solution with 200 µL 1X blocking solution (final conc. 0.4 ng/µL) and apply to the slide.
2. Wash the slides as described in Subheading 3.7., items 3 and 4.
3. Dig-Layer II: Mix 0.4 µL antimouse Ig-dig stock solution with 200 µL 1X blocking solution (final conc. 0.4 ng/µL) and apply to the slide.
4. Wash the slides as described in Subheading 3.7., items 3 and 4.
5. Dig-Layer III: Mix 1 µL antidig–rhodamine stock solution with 200 µL 1X blocking solution (final conc. 1 ng/µL) and apply to the slide.
6. Wash the slides as described in Subheading 3.7., items 3 and 4.
7. Mount the slides in DAPI-antifade and store in the dark at −20°C. The slides can be stored for prolonged periods (6 mo) without deterioration of signal intensity. However, immediate analysis and image capture is recommended for optimal signal detection.

If biotin- and dig-labeled probes are cohybridized on the same slide, detection antibodies from both systems are pooled together and applied simultaneously to the slide.
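The final antibody concentrations quoted for each detection layer follow from the stock concentrations listed in Subheadings 2.7.1. and 2.7.2. The Python sketch below reproduces them; the names are illustrative assumptions and, like the quoted values, the calculation ignores the small volume contributed by the stock itself.

```python
def final_conc_ng_per_ul(stock_mg_per_ml: float,
                         stock_ul: float,
                         diluent_ul: float = 200.0) -> float:
    """Approximate final concentration after adding stock to blocking solution.

    1 mg/mL equals 1000 ng/µL; the added stock volume is neglected.
    """
    return stock_mg_per_ml * 1000.0 * stock_ul / diluent_ul


# (layer, stock concentration in mg/mL, stock volume in µL)
LAYERS = [
    ("Biotin I/III: avidin-FITC", 5.0, 0.2),
    ("Biotin II: biotinylated antiavidin D", 0.5, 2.0),
    ("Dig I: antidigoxigenin", 0.1, 0.8),
    ("Dig II: antimouse Ig-dig", 0.2, 0.4),
    ("Dig III: antidig-rhodamine", 0.2, 1.0),
]

for name, stock, volume in LAYERS:
    print(f"{name}: {final_conc_ng_per_ul(stock, volume):.1f} ng/µL")
```

When antibodies from both systems are pooled for cohybridized probes, the same per-antibody volumes apply as long as the total diluent remains 200 µL per slide.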

3.8. Analysis
3.8.1. Analysis of Metaphase Chromosome Hybridizations
1. In order to reliably determine the location of FISH signals, at least 30 good quality metaphase spreads must be examined and images for a minimum of 10 spreads must be captured for analysis.
2. Chromosome identification and precise cytogenetic localization of genes requires identification of chromosome number and band location using the available standard karyotypes and chromosome nomenclatures for individual species (see ref. 6).
3. Double-color FISH enables simultaneous mapping and physical orientation of two or more loci (Fig. 1A).

3.8.2. Analysis of Interphase Nuclei Hybridizations
1. Double-color interphase FISH is used either for determining relative order of three loci or for estimating the distance between closely (<750 kb apart) located genes (see refs. 12–14). Before conducting interphase FISH, the probes must be tested on metaphase chromosomes to evaluate their signal intensity and overlaps. Only probes with distinct hybridization signals and apparent overlap on metaphase chromosomes are suitable for interphase FISH.
2. Interphase chromatin is relatively decondensed and frequently tends to twist and form loops. Examination of at least 50, and capture of images for about 25, interphase nuclei is therefore essential to reliably determine the order of the loci.
3. Hybridization signals on interphase nuclei are usually weaker than signals with the same probe on metaphase chromosomes (Fig. 1B). The use of probes directly labeled with fluorophore dNTPs is usually not recommended for interphase mapping because it precludes the possibility of signal amplification.

3.8.3. Analysis of DNA-Fiber Hybridizations
1. Hybridization signals on DNA fibers are weaker than the corresponding signals on metaphase and interphase chromosomes. Therefore, only probes that give excellent background-free signals on metaphase spreads are suited for fiber-FISH.
2. Hybridization signals on stretched DNA fibers are long and might extend beyond the visual field of a ×100 objective. Hence, special accessories such as camera adaptors and zoom reducers are often needed.
3. Mechanically stretched fibers tend to break beyond the ~500–700 kb level (15–17) and are therefore not suitable for the study of clones that are more than 500 kb apart.
4. While determining overlaps between two differently labeled probes, one should carefully examine the DAPI-counterstained DNA fibers to make sure that the two signals are indeed on the same fiber.
5. Fiber-FISH signals are usually weak, and the intensity is sometimes comparable to the background signal. Therefore, careful examination of the entire selected area and the capture of images for 20–25 hybridization sites are essential for reliable conclusions.
6. Fiber-FISH can also be used to determine the size of clones and the distances between them. Several imaging software packages provide the option of measuring the size of FISH signals on the computer screen and report values either in pixels or micrometers. Using the Watson–Crick formula (18), where 1 µm = 2.941 kb, the measurements can be converted into “physical length” in base pairs. However, because the DNA fibers are stretched mechanically, statistical analysis of measurements from ~100 fiber hybridization signals is essential (19) to overcome the bias associated with uneven stretching.
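The conversion in item 6 (1 µm = 2.941 kb) and the summary of repeated measurements can be sketched in Python as follows. The function names and the example lengths are hypothetical; the mean and standard deviation are shown only as one simple way to summarize measurements and are not necessarily the statistical analysis used in ref. 19.

```python
from statistics import mean, stdev

KB_PER_UM = 2.941  # Watson-Crick conversion quoted in item 6


def um_to_kb(length_um: float) -> float:
    """Convert a measured signal length in micrometers to kilobases."""
    return length_um * KB_PER_UM


def summarize_kb(lengths_um: list[float]) -> tuple[float, float]:
    """Return (mean_kb, sd_kb) for a set of measured signal lengths."""
    lengths_kb = [um_to_kb(length) for length in lengths_um]
    return mean(lengths_kb), stdev(lengths_kb)


# Hypothetical signal lengths (µm); a real analysis would use ~100 signals.
measured_um = [52.0, 48.5, 55.1, 50.3, 47.9]
mean_kb, sd_kb = summarize_kb(measured_um)
print(f"mean = {mean_kb:.1f} kb, SD = {sd_kb:.1f} kb (n = {len(measured_um)})")
```

Measurements reported in pixels would first need to be converted to micrometers using the calibration of the imaging system.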

4. Notes
1. The choice of mitogen depends on the species. Phytohemagglutinin is the best mitosis stimulator for human, cattle, dog, and pig lymphocyte cultures. However, for many other species, e.g., horse and other equids, pokeweed is the mitogen of choice.
2. In our experience, the use of optimal hypotonic solution (Rainbow Scientific) considerably improves the quality of chromosome preparations by efficiently removing the cytoplasm and facilitating good spreading of metaphase chromosomes.
3. If DAPI appears to be too bright under the fluorescence microscope, the DAPI-antifade solution should be diluted by adding more antifade.
4. Sonicated or Cot1 fraction of DNA is often used as competitor to block repetitive sequences. However, if Cot1 is not available for a particular species, and there is no possibility of sonicating the DNA, unprocessed genomic DNA gives equally good results.
5. All antibodies in biotin- and dig-detection systems are resuspended according to the manufacturers’ instructions and stored in 50- to 100-µL aliquots at −20°C. The aliquots in use are stored at 4°C until finished. Do not defrost and freeze antibodies repeatedly!
6. Spectrum-orange- and spectrum-red-labeled probes can be hybridized together. Although both fluorophores are visible in the RED1 filter, spectrum-orange also bleeds through the YELLOW filter whereas spectrum-red does not (see Table 1).
7. The use of either whole blood or the buffy coat with plasma in lymphocyte cultures depends on the species. Whole blood works fine for human, cattle, dog, cat, and many other mammals. However, for horse and other equids, buffy coat with plasma gives better results. Whole blood of equids tends to clot in cultures, leading to poor mitotic index.
8. Clean, grease-free slides are crucial for high-quality chromosome preparations.
9. If the density of metaphase spreads is not sufficient, spin the cell suspension for 10 min at 100 rcf and resuspend the cells in a smaller volume of fixative to improve density. If several fixative washes do not remove the cytoplasm from the background of the chromosomes (leading to poor spreading of chromosomes), add a few drops of glacial acetic acid to the cell suspension, incubate at RT for 2–3 min, and prepare new slides. Under desperate conditions, this may help to obtain some usable metaphase spreads. Prolonged incubation with an excess of acetic acid will destroy chromatin structure; therefore, the cell suspension must be used for preparing as many slides as possible.
10. The heating time in a microwave can vary from oven to oven and should be found empirically for each microwave.
11. Following nick translation, the probe may (rarely) be cut into fragments smaller than 100 bp. FISH with such probes might give more background signal than usual. Relabeling the probe with reduced nick translation time (1 h) and/or an increased amount of probe DNA might solve the problem.
12. The ratio of competitor and probe DNA depends on the type of probes used. For lambda phage and cosmid clones, competitor should be ~10 times in excess, whereas for BACs, PACs, and YACs it should be 20–40 times in excess. Competitor genomic DNA is usually not needed for cDNA probes, although sonicated salmon sperm DNA is routinely used as competitor. The provided protocol is best suited for BAC probes.
13. Denaturation time and temperature must be precise for metaphase chromosome slides. Higher temperatures and longer treatment during denaturation affect chromatin structure and make chromosome identification difficult. However, slides with stretched DNA fibers can be denatured for 2–4 min at 70–75°C.
14. Preannealing time depends on the probe type. Lambda phage and cosmid clones need ~10–15 min, BACs and PACs ~20–25 min, and YACs ~30–40 min incubation at 37°C to allow reannealing of repetitive sequences. In contrast, cDNA probes do not need preannealing and are placed on ice for 10 min immediately after denaturation.


Raudsepp and Chowdhary

Acknowledgments
Funds for our equine genomics program, provided by the Texas Higher Education Board (ATP 000517-0306-2003), NRICGP/USDA Grant 2003-03687, Texas Equine Research Foundation, Link Endowment (Texas A&M University), TERRAC, American Quarter Horse Association, Morris Animal Foundation, and the Dorothy Russell Havemeyer Foundation, are gratefully acknowledged.

References
1. Nath, J. and Johnson, K. L. (2000) A review of fluorescence in situ hybridization (FISH): current status and future prospects. Biotech. Histochem. 75, 54–78.
2. van der Ploeg, M. (2000) Cytochemical nucleic acid research during the twentieth century. Eur. J. Histochem. 44, 7–42.
3. Liehr, T. and Claussen, U. (2002) Current developments in human molecular cytogenetic techniques. Curr. Mol. Med. 2, 283–297.
4. Trask, B. J. (2002) Human cytogenetics: 46 chromosomes, 46 years and counting. Nat. Rev. Genet. 3, 769–778.
5. Levsky, J. M. and Singer, R. H. (2003) Fluorescence in situ hybridization: past, present and future. J. Cell Sci. 116, 2833–2838.
6. Chowdhary, B. P. and Raudsepp, T. (2005) Mapping genomes at the chromosome level. In Mammalian Genomics (Ruvinsky, A. and Graves, J. A. M. eds), pp. 23–66, CABI, Wallingford, UK.
7. Speicher, M. R. and Carter, N. P. (2005) The new cytogenetics: blurring the boundaries with molecular biology. Nat. Rev. Genet. 6, 782–792.
8. Nederlof, P. M., Robinson, D., Abuknesha, R., et al. (1989) Three-color fluorescence in situ hybridization for the simultaneous detection of multiple nucleic acid sequences. Cytometry 10, 20–27.
9. Nederlof, P. M., van der Flier, S., Wiegant, J., et al. (1990) Multiple fluorescence in situ hybridization. Cytometry 11, 126–131.
10. Birren, B., Green, E. D., Klapholtz, S., Myers, R. M., and Roskams, J. (eds) (1997) Genome Analysis. A Laboratory Manual. Cold Spring Harbor Laboratory, Cold Spring Harbor, New York.
11. Daniel, S. G., Westling, M. E., Moss, M. S., and Kanagy, B. D. (1998) FastTag nucleic acid labeling system: a versatile method for incorporating haptens, fluorochromes and affinity ligands into DNA, RNA and oligonucleotides. Biotechniques 24, 484–489.
12. Lawrence, J. B., Singer, R. H., and McNeil, J. A. (1990) Interphase and metaphase resolution of different distances within the human dystrophin gene. Science 249, 928–932.
13. Trask, B., Pinkel, D., and van den Engh, G. (1989) The proximity of DNA sequences in interphase cell nuclei is correlated to genomic distance and permits ordering of cosmids spanning 250 kilobase pairs. Genomics 5, 710–717.
14. Trask, B. J., Allen, S., Massa, H., et al. (1993) Studies of metaphase and interphase chromosomes using fluorescence in situ hybridization. Cold Spring Harb. Symp. Quant. Biol. 58, 767–775.


15. Heiskanen, M., Karhu, R., Hellsten, E., et al. (1994) High resolution mapping using fluorescence in situ hybridization to extended DNA fibers prepared from agarose-embedded cells. Biotechniques 17, 928–929, 932–933.
16. Heiskanen, M., Hellsten, E., Kallioniemi, O. P., et al. (1995) Visual mapping by fiber-FISH. Genomics 30, 31–36.
17. Heiskanen, M., Kallioniemi, O., and Palotie, A. (1996) Fiber-FISH: experiences and a refined protocol. Genet. Anal. 12, 179–184.
18. Watson, J. D. and Crick, F. H. (1953) Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature 171, 737–738.
19. Sjoberg, A., Peelman, L. J., and Chowdhary, B. P. (1997) Application of three different methods to analyse fibre-FISH results obtained using four lambda clones from the porcine MHC III region. Chromosome Res. 5, 247–253.

4 Construction of Radiation Hybrid Panels
John E. Page and William J. Murphy

Summary
Whole-genome radiation hybrid (RH) mapping has proven to be a powerful tool for mapping genes and comparing genome architecture. We describe a protocol for constructing RH panels by rescuing irradiated fibroblast donor cells of any mammalian species by polyethylene glycol fusion to a thymidine kinase-deficient hamster cell line. Once characterized and expanded, a panel of 90–100 cell lines can be used to map virtually any PCR-based marker that can be distinguished from the recipient hamster genome. The described procedure has been used successfully to create RH panels from diverse mammalian species such as macaques, elephants, alpacas, and armadillos, and may be applicable to nonmammalian vertebrates as well.

Key Words: Comparative genomics; phylogenomics; radiation hybrid mapping; comparative gene mapping; A23; HAT selection.

From: Methods in Molecular Biology: Phylogenomics Edited by: W. J. Murphy © Humana Press Inc., Totowa, NJ

1. Introduction
Whole-genome radiation hybrid (RH) mapping (1) has become a mainstream method for high-resolution gene mapping in animal and plant species (2–10). RH maps are useful tools that complement linkage and other physical maps (e.g., FISH maps and BAC contigs) by providing higher resolution ordering, and can aid in the assembly of whole-genome sequence contigs and scaffolds (6). In addition, polymorphism ascertainment for highly conserved coding markers is increased because of the evolutionary divergence between the recipient rodent genome and the genomes of nonrodent mammalian orders. RH mapping also holds great utility for generating comparative maps in species for which the development of genetic crosses is logistically problematic.

During the creation of RH panels, the chromosomes of the species of interest (index species) are fragmented by a lethal dose of ionizing radiation; the cells are then fused to a thymidine kinase-deficient hamster recipient cell line and hybrids are selected on HAT medium. DNA from 90 to 100 radiation-hybrid cell clones (each containing a different random assortment of donor species chromosome fragments) is screened with genetic markers from the index species. The markers are ordered by using maximum-likelihood ordering strategies (described in Chapter 5 by Hitte et al.). By mapping several hundred or thousand comparative markers in an index species, gene positions can be compared to orthologous regions in draft genome sequences or other gene maps.

Whole-genome RH panels are available for nearly a dozen mammalian and nonmammalian species, including human, baboon, rhesus macaque, mouse, rat, cattle, pig, horse, cat, dog, and zebrafish (http://compgen.rutgers.edu/rhmap/). Comparative genome alignments based on RH maps and genome sequences are beginning to reveal much about the evolutionary forces reshaping modern genomes (11,12). Here, we describe methods for creating whole-genome RH panels. This procedure has been used to successfully create RH panels from the domestic cat (13), rhesus macaque (14), alpaca, nine-banded armadillo, and African elephant genomes (Teeling et al., unpublished data). It can likely be extended to marsupial, monotreme, and nonmammalian vertebrate species as well (15,16).

2. Materials
2.1. Donor and Recipient Cell Preparation

2.1.1. Cell Culture
1. Donor cell line (target “species” genome to be studied).
2. Recipient cell line (A23 hamster cells) (see Note 1).
3. Dulbecco’s modified Eagle’s medium (DMEM) (Gibco/Invitrogen) with high glucose and without sodium pyruvate, supplemented with 10% fetal bovine serum (Atlanta Biologicals), penicillin (100 units/mL)–streptomycin (100 µg/mL) solution (Gibco/Invitrogen), L-glutamine (2 mM) (Atlanta Biologicals), and Fungizone (amphotericin B) (50 µg/mL) (Sigma) (see Note 2). Store at 4°C. The medium is considered a “complete” medium (complete DMEM) with fetal bovine serum and all supplements added. An “incomplete” DMEM is medium that contains all the supplements except the fetal bovine serum. Unsupplemented medium is DMEM only.
4. Hank’s balanced salt solution (HBSS) (Gibco/Invitrogen). Adjust to pH 8.0. Store at 4°C.
5. Bromodeoxyuridine (BrdU, Sigma) (100X): 3 mg/mL in incomplete DMEM. Store at −20°C (see Note 3). When used for incubating cells, make a fresh 1X (0.03 mg/mL) solution with complete DMEM.
6. Trypsin–EDTA 1X solution (0.05% trypsin–EDTA solution) (Gibco/Invitrogen). Store at −5 to −20°C.
7. Cell/tissue culture flasks (24-well dishes, T-25, T-75, T-150).
8. Pipets (1-, 5-, 10-, 25-, 50-mL volumes).


2.2. Fusion of Irradiated Donor Cells and Recipient Cells
2.2.1. Equipment
1. Hemacytometer.
2. Mark I model 68A Gamma Cell Irradiator (¹³⁷Cs) (J.L. Shepherd, San Fernando, CA) or other irradiator source.
3. Pipetman (P-100 or P-200).

2.2.2. Solutions and Supplies
1. DMEM (Gibco/Invitrogen) (see Subheading 2.1.1., item 3).
2. HAT medium supplement (Hybri-Max) (50X) (Sigma), supplied as a combined solution of hypoxanthine (5 × 10⁻³ M), aminopterin (8 × 10⁻⁵ M), and thymidine (8 × 10⁻⁴ M), lyophilized, stored in 10 mL bottles. Store stock bottles at −20°C. Handle with care: toxic! To make a 500-mL bottle of HAT selection medium, reconstitute one lyophilized stock bottle with 10 mL of complete DMEM and add to 490 mL of complete DMEM.
3. Polyethylene glycol 1500 (PEG-1500, 50% w/v in 75 mM HEPES) (Roche). Store at 2–8°C. Handle with care: toxic! Equilibrate to 37°C for donor and recipient cell fusion.
4. Ouabain (G-strophanthin) (Sigma). Handle with care: very toxic! Optional (see Note 4).
5. Trypan blue solution (20%) (see Note 5).
6. Disposable tubes for counting cells (microfuge or equivalent).
7. Polypropylene centrifuge tubes (50 mL).
8. Polypropylene centrifuge tubes (15 mL).
9. Cell culture dishes (100 × 20 mm), disposable.
10. Pipets (1-, 5-, 10-, 25-, 50-mL sizes).
11. Pipetman tips.
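The working solutions in the lists above are simple dilutions of concentrated stocks (50X HAT, 100X BrdU). As a minimal sketch of that arithmetic, the following uses the relation C1 × V1 = C2 × V2; the function name and the 25-mL BrdU volume per T-150 flask are illustrative assumptions, not part of the protocol.

```python
def stock_volume_ml(stock_x: float, final_x: float, final_volume_ml: float) -> float:
    """Volume of concentrated stock (mL) needed for a given final volume.

    Uses C1 * V1 = C2 * V2 with concentrations expressed as "X" factors.
    """
    if stock_x <= 0 or final_x <= 0 or final_volume_ml <= 0:
        raise ValueError("concentrations and volume must be positive")
    if final_x > stock_x:
        raise ValueError("cannot dilute to a higher concentration than the stock")
    return final_x * final_volume_ml / stock_x


# 50X HAT supplement diluted to 1X in 500 mL of selection medium.
hat_stock = stock_volume_ml(stock_x=50, final_x=1, final_volume_ml=500)
hat_diluent = 500 - hat_stock
print(f"HAT: {hat_stock:.1f} mL stock + {hat_diluent:.1f} mL complete DMEM")

# 100X BrdU (3 mg/mL) diluted to 1X; 25 mL per T-150 flask is an assumed volume.
brdu_stock = stock_volume_ml(stock_x=100, final_x=1, final_volume_ml=25)
brdu_final_mg_per_ml = 3.0 / 100  # 1X working concentration in mg/mL
print(f"BrdU: {brdu_stock:.2f} mL stock per 25 mL, {brdu_final_mg_per_ml} mg/mL final")
```

The HAT result reproduces the 10 mL plus 490 mL recipe given in item 2, and the BrdU working concentration matches the 0.03 mg/mL stated in Subheading 2.1.1., item 5.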

2.3. Selection of RH Clones 2.3.1. Equipment Pipetman (P-100 or P-200, P-1000).

2.3.2. Solutions and Supplies
1. DMEM (Gibco/Invitrogen) (see Subheading 2.1.1., item 3).
2. HAT medium supplement (Hybri-Max) (50X) (Sigma) (see Subheading 2.2.2., item 2).
3. Trypsin–EDTA 1X solution (0.05% trypsin–EDTA solution) (Gibco/Invitrogen). Store at −5 to −20°C.
4. Cloning cylinders (Fisher Scientific), autoclaved in 100 × 20 mm Pyrex Petri dishes with lids.
5. Vacuum grease, clear, autoclaved in Pyrex Petri dishes with lids.
6. Cell culture cluster plates, 24-well.
7. Pipetman tips.


2.4. Expansion and Harvesting of RH Clones
2.4.1. Solutions and Supplies
1. DMEM (Gibco/Invitrogen) (see Subheading 2.1.1., item 3).
2. HAT medium supplement (Hybri-Max) (50X) (Sigma) (see Subheading 2.2.2., item 2).
3. Trypsin–EDTA 1X solution (0.05% trypsin–EDTA solution) (Gibco/Invitrogen). Store at −5 to −20°C.
4. Cell/tissue culture flasks (T-25).
5. Cell culture dishes, 100 × 20 mm, disposable.
6. Polypropylene centrifuge tubes (50 mL).
7. NUNC cryogenic freeze vials (1.8 mL).

3. Methods 3.1. Donor and Recipient Cell Preparation

3.1.1. Cell Culture
The methods described herein for establishing donor and recipient cell lines are for adherent (monolayer) fibroblastic cell lines only. To maximize the chances of obtaining the highest quality and quantity of RH clones from a single experiment with the highest possible retention frequencies, it is important to have well-established, nonprimary, actively growing cell lines for both the donor (the species to be mapped) and recipient (hamster A23 cell line) genome. To this end, it is equally important to have the proper quantity of donor and recipient cells ready for fusion at the same time.
1. Prepare complete DMEM. Put one 500-mL bottle of complete medium in a 37°C water bath until needed.
2. Start with the donor cell line as little may be known about the growth characteristics of the donor cell. The A23 cells (recipient) are well characterized and grow very fast (see Note 6). Thaw a vial of the donor cells (see Note 7) by placing the vial in a 37°C water bath in a floating rack. Place the vial in the rack in water up to, but not over, the seal on the vial. Keep the vial in the rack just until the frozen cell pellet is beginning to dissolve. Remove the vial from the rack and gently rock the vial by hand while it is still immersed in the water bath. DO NOT invert the vial while thawing. Do this until the cell pellet is completely dissolved. It is important to work quickly, but carefully, as the cells are very fragile at this point and sensitive to the cryogenic preservative (dimethylsulfoxide); terminal effects start immediately upon thawing, with a rapid deterioration in 15–30 min at room temperature.
3. Transfer the vial to a sterile, laminar flow hood (biological safety hood). Wipe or spray the outside of the vial with Micro-Chem Plus or similar disinfectant. Let the vial stand for approx. 1 min. Meanwhile, label a 15-mL tube with the cell line information (name, date, passage, if applicable).
4. Open the vial and remove the cells carefully with a 1-mL pipet. Place the cells into the 15-mL tube. Using a 10-mL pipet, draw up 9 mL of complete DMEM. Tilt the tube containing the cells to a 45° angle and slowly add a few drops of complete DMEM down the side of the tube until 1 mL is added to the cells (i.e., 1 mL per 2 min). Continue adding medium to the cells slowly, but steadily, progressing to a slow stream of medium until all 9 mL are added. Do not add medium directly to the cell suspension at any point in this process. There should be a final volume of 10 mL of cells and medium in the tube.
5. Place the cap on the tube and invert slowly once to ensure that all cells are resuspended.
6. Spin the cells in a centrifuge at 250g for 5 min at room temperature. While the cells are spinning, label a T-25 flask with the appropriate cell line information.
7. Return the tube to the hood and decant the supernatant, being careful not to remove the cell pellet. It is not necessary to remove all traces of medium as the remaining preservative in the medium will be diluted.
8. Slowly, add 2.5 mL of complete DMEM to the cell pellet, again adding the medium down the side of the tube. Carefully pipet the cell pellet up and down, just enough to homogenize the cells.
9. Remove the cell suspension from the tube and dispense into the T-25 flask. Gently rock the flask to allow the cell suspension to completely coat the bottom of the flask. Place the T-25 flask into a 37°C, 5% CO2-humidified incubator. Loosen the cap on the flask to allow gas exchange.
10. The next day, check the condition of the cells for any sign of distress (see Note 8). Completely aspirate all medium and add 5 mL of 37°C complete DMEM. This step not only removes any dead cells and/or cell debris from the freezing process, but also removes any residual toxic cryogenic preservative. Return the T-25 flask back into the incubator with the cap loosened.
11. Observe the cells daily for their growth characteristics. Change the medium (remove and add 5 mL of 37°C complete DMEM) approximately every 3 d or so until the cell monolayer approaches confluency (see Note 9).
12. When the cell confluency reaches 80–85% (see Note 10), split (or subculture) the cells in the T-25 flask into a T-75 flask. Aspirate the medium from the T-25 flask and then add 5 mL of HBSS to “wash” the old complete medium from the cells. Rock the flask three times. Aspirate the HBSS from the cells.
13. Add 2 mL of 37°C trypsin–EDTA solution. Rock the flask so that the trypsin–EDTA solution covers the entire cell surface. Place the T-25 flask back into the incubator for 3–5 min (see Note 11).
14. Remove the T-25 flask and swiftly hit the flask against the palm of your hand or another hard surface (such as a benchtop) 2–3 times to dislodge the cells from the cell surface. Observe the cells microscopically to confirm cell release.
15. Add 3 mL of complete DMEM to neutralize the trypsin. Rock the flask to collect the cell suspension at the bottom of the flask. Resuspend the cells by pipetting up and down. Add the cell suspension to a T-75 flask that contains 10 mL of complete DMEM (15-mL total volume). Place the T-75 flask into the incubator and loosen the cap.
16. Observe the cells daily for their growth characteristics. Change the medium (aspirate the old medium and add 5 mL of 37°C complete DMEM) approximately every 3 d or so depending upon the confluency of the cell culture.


Table 1
General Guidelines for Flask Volumes and Cell Yields

Flask size (cm²)   Maximum medium volume required (mL)   Volume of trypsin–EDTA required (mL)   Typical cell yield (assuming 80–85% confluency)¹
T-25               5                                     1–2                                    2.8 × 10⁶
T-75               15                                    2–4                                    8.5 × 10⁶
T-150              25                                    3–5                                    1.7 × 10⁷

¹Yields will vary depending upon cell quality, environmental conditions, and species of cell origin.

17. Repeat steps (12–16) to expand the cell population from one T-75 flask to one T-150 flask and eventually from one T-150 flask to four T-150 flasks. The volumes of medium and trypsin will change depending on the size of the flask. Table 1 is provided as a general guideline for volumes and typical cell yields expected. All T-150 flasks for both the donor and recipient cell lines should be between 80–85% confluent at the time of fusion (see Note 12).
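The typical yields in Table 1 can be used to check whether a planned expansion should supply enough cells. The following is a minimal sketch of that estimate; the function names and the 2.5 × 10⁷ target are hypothetical, and real yields will vary as the table footnote warns.

```python
import math

# Typical cell yield per flask at 80-85% confluency, taken from Table 1.
TYPICAL_YIELD = {"T-25": 2.8e6, "T-75": 8.5e6, "T-150": 1.7e7}


def expected_yield(n_flasks: int, flask: str = "T-150") -> float:
    """Approximate total cells harvested from n_flasks of the given size."""
    return n_flasks * TYPICAL_YIELD[flask]


def flasks_needed(cells_required: float, flask: str = "T-150") -> int:
    """Smallest whole number of flasks expected to give cells_required."""
    return math.ceil(cells_required / TYPICAL_YIELD[flask])


# Four T-150 flasks per cell line, as prepared in step 17.
print(f"4 x T-150: about {expected_yield(4):.1e} cells")

# Hypothetical target of 2.5e7 cells for one cell line.
print(f"T-150 flasks for 2.5e7 cells: {flasks_needed(2.5e7)}")
```

Treating the table values as rough planning figures, rather than guaranteed counts, leaves room for slow-growing donor lines.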

3.2. Fusion of Irradiated Donor Cells and Recipient Cells
The fusion procedure is separated into three sections: prefusion, fusion, and postfusion. The prefusion procedure includes BrdU treatment of the A23 cells to prepare them for fusion with the donor cells. The fusion procedure involves counting the donor and recipient cells to obtain the correct ratios for fusion, irradiating the donor cells, and treating the combined donor and recipient cells with PEG. The postfusion procedure includes distributing (plating) the fused cells into dishes for selecting RH clones. To minimize confusion, the procedures are designated by the day on which they are performed (day 0, day 1, and so on), starting with pretreatment of the A23 cells and ending with harvesting the cells from each cell line for DNA extraction.

3.2.1. Prefusion Procedure
1. Six days before fusion (day 0), perform a 1:20 split (i.e., a 1-in-20 dilution) of the A23 cells from the four (80–85% confluent) T-150 flasks into four new T-150 flasks.
2. The next day (day 1), aspirate the medium and add 1X BrdU medium (0.03 mg/mL) to the T-150 flasks (see Subheading 2.1.1., item 5). Place the flasks in a 37°C, 5% CO2-humidified incubator. Allow the cells to incubate for 4 d (days 2–5).
3. Day 5: aspirate the BrdU medium and refeed each flask with 25 mL complete DMEM. Check the cells for confluency (see Note 13). Place the flasks back into the incubator.
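Because the prefusion steps are tied to fixed days counted back from the fusion (day 6), it can help to turn a planned fusion date into calendar dates. The sketch below assumes only the day numbers given in the steps above; the function name and the example date are hypothetical.

```python
from datetime import date, timedelta

# Day numbers and events as listed in the prefusion procedure; day 6 is the fusion.
PREFUSION_EVENTS = {
    0: "1:20 split of A23 cells into four new T-150 flasks",
    1: "replace medium with 1X BrdU medium",
    5: "remove BrdU medium, refeed with complete DMEM, check confluency",
    6: "fusion",
}


def prefusion_schedule(fusion_date: date) -> list:
    """Return (calendar date, event) pairs, counting day 0 as 6 d before fusion."""
    day0 = fusion_date - timedelta(days=6)
    return [
        (day0 + timedelta(days=day), event)
        for day, event in sorted(PREFUSION_EVENTS.items())
    ]


# Hypothetical fusion date used only for illustration.
for when, event in prefusion_schedule(date(2024, 1, 8)):
    print(when.isoformat(), event)
```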


3.2.2. Fusion Procedure
1. Day 6 (fusion): working with one cell line at a time (donor or recipient), aspirate the medium from the four T-150 flasks and add 20 mL of HBSS to each flask as a wash.
2. Trypsinize the cells as detailed under Subheading 3.1.1. and Table 1. Add 10 mL of complete DMEM to one flask, then transfer the cells to the next flask. Repeat this until the last flask contains the combined cells from all four flasks. Transfer the cells to a labeled 50-mL centrifuge tube. Set aside. Repeat steps (1) and (2) for the other cell line.
3. Centrifuge the cells at 250g for 5 min at room temperature. While the cells are in the centrifuge, label two 1.5-mL microcentrifuge tubes with the donor and recipient cell line names, respectively. Add 50 µL of a 20% trypan blue solution to each tube.
4. When the cells are finished spinning, aspirate the medium and add 10 mL of complete medium to each cell line. Resuspend the cell pellets.
5. Using a Pipetman, aspirate 50 µL of each cell line and place in their respective microcentrifuge tubes. Mix well. Let the cells sit for a minute or so. Meanwhile, centrifuge the 50-mL tubes containing the cells at 250g for 5 min at room temperature.
6. Count each cell line using a hemacytometer while the cells are in the centrifuge. Determine the total number of cells for each cell line in a 10-mL volume. Record the total cell count for each cell line. It is very important to have an accurate count because the irradiation of the donor cells and the ratio of donor to recipient cells for fusion are directly dependent on the cell counts, which could affect fusion efficiency and hence, retention frequencies.
7. Aspirate the medium from the cells and resuspend them in 10 mL of incomplete medium. Place the cells on wet ice.
8. Using the total cell count, calculate the number of cells needed for a 1:1 cell ratio for fusion (i.e., one donor cell to one recipient cell) and any controls needed (see Note 14). Table 2 is provided as a guide to the number of cells needed for a typical RH panel construction and the controls.
9. Irradiate the donor cells with the desired dose of radiation according to the irradiator manufacturer’s manual or specifications. The dose of radiation will depend on the resolution desired. The lower the dose, the less the fragmentation and hence the lower the mapping resolution; the higher the dose, the more resolved the RH map (see Note 15). Important: Once the donor cells are irradiated, the fusion MUST be performed rapidly (within approx. 30 min).
10. Combine the appropriate quantities (see Table 2) of irradiated donor cells and recipient cells in a 15-mL polypropylene tube. Keep unused cells on ice for controls. Plate the unfused controls at the same time the fused cells are plated.
11. Centrifuge gently at 185g for 5 min at room temperature to pellet the cells.
12. Aspirate the medium and finger-flick the bottom of the tube to resuspend the pellet.
13. Add 0.5 mL of 37°C PEG-1500 (see Subheading 2.2.2., item 3) to the resuspended cells. Mix, very gently, by pipetting once up and down with a 1-mL pipet. Allow the cells to be exposed to the PEG-1500 for a total of 2 min at 37°C.
14. Slowly, using a 10-mL pipet, add 9.5 mL of unsupplemented DMEM (or HBSS, pH 8.0) with gentle mixing at a rate of 1 mL/min (see Note 16).


Table 2
Total Cells Needed for Radiation Hybrid Fusion and Controls

Category                                                                  Donor (cells)   Recipient (cells)
(A) Radiation hybrids (RH)                                                20 × 10⁶        20 × 10⁶
(B) RH media controls (complete DMEM, HAT, HAT + Ouabain, Ouabain)        2.0 × 10⁶       2.0 × 10⁶
(C) Donor cells, irradiated                                               0.5 × 10⁶       0
(D) Donor cells, not PEG fused                                            0.5 × 10⁶       0
(E) Donor self-fusion (PEG/selective media toxicity) (no irradiation)     1 × 10⁶         0
(F) Recipient self-fusion (PEG/selective media toxicity)                  0               1 × 10⁶
(G) Fusion efficiency (no irradiation)                                    1 × 10⁶         1 × 10⁶

15. Centrifuge the cell suspension at 67g for 5 min at room temperature.
16. Slowly, aspirate the medium from the cells and very gently resuspend the cells in 5 mL of unsupplemented DMEM.
17. Place the cells in a 37°C, 5% CO2-humidified incubator for 1 h with the cap loose. Mix the fused cells by tightening the cap and very gently inverting the tube every 15–20 min. Invert only once each time. Loosen the cap between inversions. While the cells are incubating, prepare the cell-culture dishes for the RH panel and controls. Add 9.5 mL of complete DMEM–HAT selection medium to each radiation hybrid panel dish, and add the appropriate medium to each control dish (see Table 3). Label each cell culture dish with a unique identification code. If time allows, place the tray of dishes in the incubator to allow the medium to equilibrate to temperature, pH, and humidity.
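Step 6 of the fusion procedure asks for the total number of cells in the 10-mL suspension. A minimal sketch of that arithmetic follows, assuming a standard hemacytometer in which each large 1-mm² square holds 10⁻⁴ mL, and a dilution factor of 2 from mixing 50 µL of cells with 50 µL of trypan blue. The counts and function names are hypothetical.

```python
def cells_per_ml(square_counts: list, dilution_factor: float = 2.0) -> float:
    """Concentration from hemacytometer counts of large (1 mm^2) squares.

    Each large square holds 1e-4 mL, so cells/mL = mean count * dilution * 1e4.
    """
    if not square_counts:
        raise ValueError("at least one square must be counted")
    mean_count = sum(square_counts) / len(square_counts)
    return mean_count * dilution_factor * 1e4


def total_cells(square_counts: list, suspension_ml: float = 10.0,
                dilution_factor: float = 2.0) -> float:
    """Total cells in the whole suspension (10 mL in the protocol)."""
    return cells_per_ml(square_counts, dilution_factor) * suspension_ml


def volume_for_cells(n_cells: float, concentration_per_ml: float) -> float:
    """Volume (mL) of suspension that contains n_cells."""
    return n_cells / concentration_per_ml


# Hypothetical counts from four large squares.
counts = [100, 110, 90, 100]
conc = cells_per_ml(counts)
print(f"{conc:.2e} cells/mL, {total_cells(counts):.2e} cells in 10 mL")
print(f"Volume holding 5e5 cells: {volume_for_cells(5e5, conc):.2f} mL")
```

Counting only unstained (viable) cells in each square gives the viable count that matters for the donor-to-recipient ratio.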

3.2.3. Postfusion Procedure
1. Centrifuge the fused cell suspension at 185g for 5 min at room temperature.
2. Gently aspirate the medium and gently resuspend the cells in 10 mL of complete DMEM at 37°C. Transfer the fused cell suspension to a 50-mL polypropylene tube and add an additional 12.5 mL of 37°C complete DMEM. Gently mix by inversion twice.
3. Distribute 0.5 mL of the fused cell suspension (5 × 10⁵ cells) and controls to each respective cell culture dish (see Table 3). When adding each cellular aliquot to a dish, gently move the dish back and forth and side-to-side to spread the cells. DO NOT swirl the dishes as this concentrates the cells in a clump in the middle of the dish.
4. Place the dishes in a 37°C, 5% CO2-humidified incubator.
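The cell numbers in Table 2 and the dish counts in Table 3 are linked by the 5 × 10⁵ cells plated per dish. The sketch below works through that bookkeeping under one reading of the two tables (which categories draw on donor cells, recipient cells, or both); the dictionary layout and function names are illustrative assumptions.

```python
CELLS_PER_DISH = 5e5  # each dish receives 5 x 10^5 cells (Table 3 footnote)

# category: (dishes from Table 3, uses donor cells, uses recipient cells)
DISHES = {
    "A radiation hybrids": (40, True, True),
    "B RH media controls": (4, True, True),
    "C donor, irradiated": (1, True, False),
    "D donor, not PEG fused": (1, True, False),
    "E donor self-fusion": (2, True, False),
    "F recipient self-fusion": (2, False, True),
    "G fusion efficiency": (2, True, True),
}


def cells_needed() -> tuple:
    """Total donor and recipient cells implied by the dish counts."""
    donor = sum(n * CELLS_PER_DISH for n, d, _ in DISHES.values() if d)
    recipient = sum(n * CELLS_PER_DISH for n, _, r in DISHES.values() if r)
    return donor, recipient


def fused_aliquots(total_ml: float = 22.5, aliquot_ml: float = 0.5) -> int:
    """Whole 0.5-mL aliquots available from the 22.5-mL fused suspension."""
    return int(total_ml // aliquot_ml)


donor, recipient = cells_needed()
total_dishes = sum(n for n, _, _ in DISHES.values())
print(f"Donor cells: {donor:.2e}; recipient cells: {recipient:.2e}")
print(f"Dishes: {total_dishes}; fused-cell aliquots: {fused_aliquots()}")
```

On this reading, the 22.5-mL suspension covers the 40 RH dishes and 4 media-control dishes with one aliquot to spare.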

3.3. Selection of RH Clones
Total incubation time before RH colonies start to become evident is approx. 10–20 d postfusion. The dishes should be incubated for 2–3 wk postfusion (days 7–28 since inception). RH colony growth should be monitored every 2 d or so, especially when the colonies are a size of approx. 1–2 mm (estimated) with about 25–75 cells. Controls should be monitored for effects of medium toxicity, fusion efficiency, and so on (see Table 3 for media references). Compare the control dishes to the RH panel dishes to determine if the experiment needs to be repeated. It is better to start over at this point than to wait until harvesting the panels before making the decision to start over. If Ouabain was applied to the medium for the RH panel dishes, allow the exposure for 6 d postfusion. Most controls should be dead by day 10 postfusion (day 16 since inception).

Table 3
Total Cell Culture Dishes Needed for Radiation Hybrids and Controls

Category                                                                              Dishes¹
(A) Radiation hybrids (RH)                                                            40
(B) RH media controls (each dish gets one of complete DMEM, HAT, HAT + Ouabain, Ouabain)   4
(C) Donor cells, irradiated (complete DMEM only)                                      1
(D) Donor cells, not PEG fused (selective medium)                                     1
(E) Donor self-fusion (PEG/selective medium toxicity) (no irradiation)                2
(F) Recipient self-fusion (PEG/selective medium toxicity)                             2
(G) Fusion efficiency (no irradiation) (selective medium)                             2

¹Each dish receives 5 × 10⁵ cells.

1. Allow the dishes to incubate for 3 d before disturbing them. On day 3 postfusion (day 9 since inception), observe the plates for adherence of cells and any fused cells. These should be visible at this point but may be hard to determine.
2. Aspirate the media from the cells while working with only 10 dishes at a time.
3. Tilt each dish to one side and slowly aspirate the medium off the cells. Add fresh HAT selective medium to the respective dishes. Replace the appropriate medium in the control dishes as well.
4. Place the dishes back into the 37°C, 5% CO2-humidified incubator. Continue observing the dishes every 2 d for any contamination and the status of colony formation.
5. On day 6 postfusion (day 12 since inception), aspirate the medium from the cells (see steps 2–3) and refeed the cells with the appropriate selective medium. If Ouabain–HAT selective medium was used, refeed with HAT selective medium only. Place the dishes back into the 37°C, 5% CO2-humidified incubator. Continue observing the dishes for any contamination and the status of colony formation.
6. On day 12 postfusion (day 18 since inception), aspirate the medium from all dishes and refeed with HAT selective medium. The HAT selective medium is used on the selective dishes until the RH panel colonies are isolated and transferred to T-25 flasks (which is after colony isolation in the 24-well plates).


7. The colonies will start to appear at various times throughout the 10–20 d postfusion period. They are often spread unevenly among the dishes, i.e., some dishes may have one to two colonies whereas others will have five to eight colonies (see Note 17). Ideally, each dish should have a few colonies; therefore, all colonies can be isolated over 2–3 d.
8. To select the colonies, carefully hold the dish up (with medium still in the dish) so that you are looking through the bottom of the dish. Observe the colonies for RH panel candidates for isolation and further expansion. Using a microscope, confirm the selected colony sizes and that they are well isolated. Circle each colony that is to be isolated with a permanent marker to make the selection with the cloning cylinders easier. Select all the colonies that are to be included in the RH panels from all dishes at one time before starting to transfer the colonies from the dishes to the 24-well plates.
9. Working with one dish at a time, aspirate the medium from the dish. Dip one cloning cylinder in vacuum grease, just enough to form a seal when applied to the dish. Caution: if too much grease is added, the excess will cover the colony and prevent it from being removed. Gently apply the cylinder over the colony that is to be isolated using the mark as a guide. Apply all the cloning cylinders to the colonies on the dish before removing the colonies inside the cylinders.
10. Add three to four drops (50–100 µL) of HBSS to each cloning cylinder using a P-1000 Pipetman. Aspirate the medium from the colonies with a P-200 Pipetman. Change tips between each cloning cylinder to prevent carryover.
11. Add three to four drops of trypsin–EDTA solution to each cylinder with a P-1000 Pipetman. Place the dish in the 37°C incubator for 3–5 min.
12. Remove the dish from the incubator and add three to four drops of complete DMEM to each cloning cylinder with a P-1000 Pipetman.
13. With a P-1000 Pipetman, remove each colony from a cloning cylinder by pipetting the cells up and down three times, the last time transferring the colony to a 24-well plate. Use a different pipet tip each time. When transferring colonies to the 24-well plate, record the original dish from which the colony originated (1–40) in addition to labeling the well of the 24-well plate with a unique identification code, e.g., (1A) for original dish “1” and colony “A” from dish “1.” Repeat this method for every transferred colony.
14. Confirm that each colony was transferred by observing the well microscopically (see Note 18).
15. Repeat steps (10–14) until one whole 24-well plate is filled with 24 isolated colonies.
16. Add 2 mL of HAT selective medium to each well. Place the plate in a 37°C, 5% CO2-humidified incubator.
17. Repeat this procedure for each dish until all colonies from the 40 original dishes are transferred to the 24-well plates.
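The labeling scheme in step 13 (dish number plus a colony letter, e.g., 1A) and the 24-colony capacity of each plate lend themselves to a small bookkeeping helper. This is only a sketch of one possible implementation; the function names and the example colony counts are hypothetical.

```python
import math
from string import ascii_uppercase


def clone_ids(colonies_per_dish: dict) -> list:
    """Build codes such as '1A', '1B', '3A' from {dish number: colony count}."""
    ids = []
    for dish in sorted(colonies_per_dish):
        n = colonies_per_dish[dish]
        if n > len(ascii_uppercase):
            raise ValueError("this sketch supports at most 26 colonies per dish")
        ids.extend(f"{dish}{ascii_uppercase[i]}" for i in range(n))
    return ids


def plates_needed(n_colonies: int, wells_per_plate: int = 24) -> int:
    """Number of 24-well plates required to hold every isolated colony."""
    return math.ceil(n_colonies / wells_per_plate)


# Hypothetical picks: two colonies from dish 1, one from dish 3.
picked = clone_ids({1: 2, 3: 1})
print(picked)
print(f"Plates for 100 colonies: {plates_needed(100)}")
```

Carrying the same code from the well to the T-25 flask and, later, to the harvest tube keeps each clone traceable to its original dish.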

3.4. Expansion and Harvesting of RH Clones Once the colonies have grown approx. 80–85% confluent in each well of the 24-well dishes (see Note 19), transfer each to a T-25 flask.


1. Aspirate the medium from each well (RH clone) using a different pipet (see Note 20), then add 2 mL of HBSS to each well.
2. Aspirate the HBSS from each well and add 200 µL of trypsin–EDTA solution to each well.
3. Place the 24-well plate in the 37°C incubator for 3–5 min.
4. Add 2 mL of complete DMEM to each well.
5. Transfer the cells of each well to a T-25 flask, labeling the T-25 flask with the same identification code that was assigned on the 24-well plate.
6. Place the T-25 flasks in a 37°C, 5% CO2-humidified incubator with the caps loosened.
7. Continue expanding each clone from one T-25 flask to two 100-mm cell culture dishes (see Note 21). When the two dishes are 80–85% confluent, transfer the contents to six 100-mm cell culture dishes, and then to twelve 100-mm dishes. Alternatively, you may expand the cell lines in roller bottles.
8. When the twelve 100-mm dishes are 100% confluent, harvest the cells by standard cell culture techniques (see Subheading 3.1.1.), combining the cells from the 12 dishes into one 50-mL polypropylene tube. Label the tube with the clone's identification code.
9. Centrifuge the tube at 250g for 5 min at room temperature. Aspirate the medium from the cells and store the pellet in a −70°C freezer until all cells from each RH clone are harvested.
10. Extract the DNA from the cells using established extraction techniques. The typical DNA yield from each RH clone is approx. 0.5–4 mg (~2 mg on average). Procedures for selecting the final optimal collection of 90–93 RH clone DNAs (i.e., the final RH panel) are described in subsequent chapters in this volume.

4. Notes
1. The A23 cells have a mutation in the thymidine kinase (TK) gene that inactivates the enzyme, i.e., they are TK−. The A23 cells grow very rapidly in culture and are generally smaller than most fibroblasts. They have been used successfully for constructing RH panels for several mammalian species.
2. DMEM is the medium we have used for RH panel construction; however, Alpha-MEM or other enriched media may be used. It may be advantageous to use the medium in which the donor cells were previously grown. The complete medium can be made up ahead of time and stored for the entire experiment. We have found that the premium triple-filtered fetal bovine serum from Atlanta Biologicals gives consistent cell growth results.
3. Bromodeoxyuridine (BrdU) kills the TK+ cells; thus, incubating the A23 cells with BrdU leaves a culture of purely TK− cells for the fusion.
4. If Ouabain is used, make a 100X solution in incomplete DMEM and store at −20°C. Dilute to 1X for use with HAT selective medium. Ouabain is a plant alkaloid that specifically binds to and inhibits sodium/potassium ATPase; it acts as a cellular poison. The purpose of the Ouabain is to kill any parental donor cell fibroblasts (mammalian) that were inadvertently introduced into the RH panel selection, including any donor cells that did not receive a lethal dose of radiation. Rodent cells (A23) seem to be immune to the effects of Ouabain.


Page and Murphy

5. Certain mammalian cells (e.g., feline cells) are sensitive to full-strength trypan blue. The trypan blue can kill the cells, thus giving a false result for the count. We therefore routinely use a 20% solution and incubate the cells slightly longer to allow the dead cells to take up the trypan blue.
6. Generally, the donor cells are introduced in culture 3–4 wk before cell fusion, whereas the A23 cells are placed in culture 2 wk before cell fusion.
7. The donor cell line is assumed to have already been well established from a skin biopsy or primary cell line. If the donor cell line is to be processed from a skin biopsy, the biopsy must be cleaned of excess debris, digested in collagenase/hyaluronidase, and then become established on a cell-culture dish before proceeding to the RH panel procedure.
8. Distressed cells may have the following characteristics: large vacuoles; a granulated or “balled-up” appearance; smaller or absent pseudopods; a malformed shape; or rough-looking cellular membranes. There may be an increase of cell debris in the flask, an indication of dying cells.
9. Depending upon the level of confluency (a measure of how many cells are present within any given vessel), the medium may have to be changed more often if the cells are growing faster and thus metabolizing the nutrients faster, or less often if the cells are growing slower.
10. Waiting until the cells are 100% confluent may cause the cells to begin "shutting down," in part owing to overcrowding and cell-to-cell contact. Splitting the cells when they are 80–85% confluent ensures that the cells remain in an actively growing state.
11. Some cell lines are more difficult to remove and require a longer time (6–7 min) in the 37°C incubator. For exceptionally difficult cell lines, an additional 1 mL of trypsin–EDTA solution may be added in combination with an increase in incubation time.
12. It has been our experience that the number of cells harvested from four T-150 flasks is more than sufficient for a single RH experiment.
13. If the confluency is 100%, then split the cells 1:5 into four T-150 flasks.
14. Depending upon the fusion and retention frequencies, the ratio of donor to recipient cells can be adjusted in an attempt to increase the fusion efficiency (e.g., a 1:1.5 or 1:2 ratio of donor to recipient cells).
15. In our experiments, we used a Mark I Gamma Cell Irradiator and irradiated our donor cells with 5000 rads of radiation for 3.37 min. This provided us good long-range mapping resolution and average panel retention frequencies between 0.25 and 0.40.
16. Mix the cells periodically as the medium is added by gently moving the pipet up and down in the cell suspension. DO NOT draw the cell suspension into the pipet.
17. Select colonies that are well-isolated to minimize picking identical colonies that may have spread from a neighboring colony. By selecting well-isolated colonies you maximize the chances of obtaining different donor chromosome fragments.
18. An alternative method is to wait until all of the wells are filled with cells to confirm transfer of the cells microscopically.
19. Each colony may grow at a different rate depending upon the characteristics of the clone and the donor chromosome fragments it carries. Thus, clones from some 24-well plates will still contain growing cells while others will already be expanded to T-25 flasks. To those 24-well plates that still contain cells, reapply HAT selective medium and place the plate back into the 37°C, 5% CO2-humidified incubator. Monitor the wells daily and transfer the cells in wells that are 80–85% confluent. Repeat as necessary.
20. An alternative method is to use one Pasteur pipet with a vacuum source and flame-sterilize the pipet between aspirating medium from each well.
21. To determine which RH clones are viable, you could return some of the cells to the T-25 flask and allow them to grow, after which the DNA can be extracted for prescreening of identical RH clones or A23 cells that were self-fusions. Additionally, viable freezes of the cells can be made at this point for future evaluation or extra DNA.

Disclaimer: The contents of this publication are the opinions, interpretations, conclusions, and recommendations of the authors and do not necessarily reflect the views or the policies of the U.S. Government, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government. The procedure described herein was funded by the U.S. Government.

5
Survey Sequencing and Radiation Hybrid Mapping to Construct Comparative Maps
Christophe Hitte, Ewen F. Kirkness, Elaine A. Ostrander, and Francis Galibert

Summary
Radiation hybrid (RH) mapping has become one of the most well-established techniques for economically and efficiently navigating genomes of interest. The success of the technique relies on random chromosome breakage of a target genome, which is then captured by recipient cells missing a preselected marker. Selection for hybrid cells that have DNA fragments bearing the marker of choice, plus a random set of DNA fragments from the initial irradiation, generates a set of cell lines that recapitulates the genome of the target organism several-fold. Markers or genes of interest are analyzed by PCR using DNA isolated from each cell line. Statistical tools are applied to determine both the linear order of markers on each chromosome and the confidence of each placement. The resolution of the resulting map relies on many factors, most notably the degree of breakage from the initial radiation as well as the number of hybrid clones and the mean retention value. A high-resolution RH map of a genome derived from low-pass or survey sequencing (coverage from 1 to 2 times) can provide essentially the same comparative data on gene order that is derived from high-coverage (greater than 7×) genome sequencing. When combined with fluorescence in situ hybridization, RH maps are complete and ordered blueprints for each chromosome. They give information about the relative order and spacing of genes and markers, and allow investigators to move with ease between target and reference genomes, such as those of mouse or human, although the approach is not limited to mammalian genomes.

Key Words: RH Mapping; survey sequencing; comparative genomics; synteny; canine genome; high-coverage genome sequencing.

From: Methods in Molecular Biology: Phylogenomics. Edited by: W. J. Murphy © Humana Press Inc., Totowa, NJ

1. Introduction
Genome maps are essential for identifying disease genes or loci controlling traits of interest. Development of maps for the dog genome began in the late 1990s with the production of both meiotic linkage (1) and radiation hybrid (RH) maps (2). Analysis of the same sets of markers on both reference families and RH panels (3) allowed the meiotic linkage and RH maps to be quickly integrated (4). Although such maps were useful for the identification of several disease loci (5,6), the lack of an agreed-upon nomenclature for the dog’s 38 chromosomes until the late 1990s (7) made discussion of the data problematic. Most dog chromosomes are small and acrocentric. Thus, only those having expertise with techniques like fluorescence in situ hybridization (FISH) could readily identify specific dog chromosomes. The availability of a set of flow-sorted canine chromosomes (8,9), from which the DNA fragments could be both FISH and RH mapped (10,11), as well as used to develop microsatellite markers for meiotic linkage mapping, proved invaluable to the canine community. It allowed the first fully integrated maps to be developed in which most linkage groups were assigned to named chromosomes (12). In addition, it allowed researchers to orient linkage groups on chromosomes, thus facilitating the first studies of comparative genomics between the dog and human (13). Since 2001 the community has advanced rapidly, generating new maps every 12–18 months (13–16). Particular emphasis has been placed on maps that have ordered microsatellite markers, gene sequences, and sequenced BAC ends that can facilitate positional cloning efforts. In addition, researchers have focused on increased understanding of the relationship between the dog, mouse, and human genomes, initially to define about 85 conserved segments (13). As map resolution improved and researchers moved from a mapping panel with 600 kb resolution (17) to 200 kb resolution, the number of conserved ordered fragments increased to 264 (16).
Key to the success of the field has been the development of new bioinformatics tools that allow more precise positioning of markers than was possible with existing programs (18), together with the assignment of scores reflecting the relative confidence of local map order (19,20).

In 2003, The Institute for Genomic Research (TIGR) released 1.5× sequence coverage of a Standard Poodle genome (21). The 6.22 million sequence reads contained canine gene fragments that were orthologous to 84% of annotated human genes, and this was put to immediate use by researchers interested in cloning genes associated with disease and morphology traits. We subsequently opted to use the 1.5× sequence, or survey sequence as it has been termed, as a resource for constructing a very dense canine RH map, composed primarily of canine gene fragments. The resulting resource, discussed in detail here, advanced our knowledge of the canine genome in several ways. First, the dense map, composed of portions of nearly 10,000 independent genes, detailed nearly all canine/human conserved ordered segments (16). This meant that, for the first time, we could move freely between the dog map and the reference human and mouse genome assembled sequences. Second, the ordered presentation of over half of what are now known to be 19,000 canine genes offered an opportunity to verify and edit the 7.5× assembly of the Boxer sequence (22) and a way to check data related to the orientation of contigs. Finally, the work suggests a new paradigm for analysis of the many genomes currently being sequenced at only 2× coverage. These include a variety of mammalian genomes that are being sequenced primarily to identify conserved genomic elements (e.g., elephant, armadillo, cat, tenrecs, sloth, hedgehog, shrew, squirrel, rabbit, bat, and bushbaby; http://www.genome.gov/10002154). If high-density RH maps were constructed to accompany each survey sequence, researchers in the relevant fields would have essentially the same mapping resources for comparative genomics as those who benefit from 7 to 8× sequencing. We thus present the detailed methodology used for construction and utilization of a 10,000-gene RH map.

2. Materials
2.1. Marker Identification and Primer Sequence Design
1. Computer with Unix operating system.
2. Primer design software “Primer3” (http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi).
3. DNA sequence alignment programs BLAST, BLAT, MEGABLAST (http://www.ncbi.nlm.nih.gov/BLAST/).

2.2. Marker Genotyping
1. PCR thermocycler.
2. Electrophoresis apparatus and gel casting systems.
3. Hot start PCR kits with high fidelity Taq enzyme and master mixes.
4. Electrophoresis buffer: TE, pH 7.5 (10 mM Tris-HCl, 1 mM EDTA), and 2% agarose gels for typing.
5. Genotyping data acquisition software (see Note 1).
6. DNA gel or capillary sequencer for sequencing and typing.
7. Sequence analysis software.

2.3. Map Construction
1. RH mapping analysis software (see Note 2).
2. Database with a portable relational database management system.

3. Methods
A key limitation of survey sequencing is the fragmentary nature of the data. The absence of long-range continuity greatly restricts the ability to predict gene and marker order in the surveyed genome. RH mapping can maximize the value of survey sequencing through the localization of thousands of markers derived from the 1–2× sequence data. This has been tested with the canine 1.5× coverage (21) for which fragments of orthologs for 18,473 of 24,567 annotated human genes were identified (see Note 3).

3.1. Survey Sequencing
Canine genomic sequence was obtained by end-sequencing of plasmid clones (2 and 10 kb inserts) prepared from genomic DNA of a male Standard Poodle. After trimming to remove vector or poor-quality sequence data, 6.22 million reads (mean 576 bases) provide approx. 1.5× coverage of the 2.5 Gb haploid canine genome. The assembled data consist of 1.2 million contigs (mean 1414 bases) and 0.4 million singletons (21).
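The coverage figure quoted above follows from simple arithmetic on the read count, mean read length, and genome size. The short sketch below reproduces that calculation. The function names are illustrative, and the second function uses an idealized Poisson (Lander-Waterman style) model that is an assumption added here, not part of the original analysis; it only hints at why 1–2× data remain fragmentary.

```python
import math


def sequence_coverage(n_reads: int, mean_read_len: float, genome_size: float) -> float:
    """Average fold-coverage: total sequenced bases divided by genome size."""
    return n_reads * mean_read_len / genome_size


def expected_fraction_covered(coverage: float) -> float:
    """Fraction of bases hit by at least one read under an idealized
    Poisson model (assumption: uniformly random read placement)."""
    return 1.0 - math.exp(-coverage)


# Values taken from the text: 6.22 million reads, mean 576 bases, 2.5 Gb genome.
cov = sequence_coverage(6_220_000, 576, 2.5e9)
print(f"coverage = {cov:.2f}x")  # prints "coverage = 1.43x", i.e. approx. 1.5x
print(f"expected fraction of genome sampled = {expected_fraction_covered(cov):.2f}")
```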

3.2. Orthologous Sequence Identification Strategy
All contigs and singletons from the survey sequence assembly are searched against an annotated reference genome that is closely related to the surveyed genome (e.g., human) using the Blastn alignment program (23). Sequences can also be searched across multiple species to ensure the selection of sequences that are highly conserved through evolution (24).

3.3. Orthologous Sequence Identification Criteria
1. Selected sequences have mutually-best alignments. Blastn comparison of the test (dog) and reference (human) genome sequences yields many alignments of short segments, termed high-scoring pairs (HSPs). HSPs are defined as mutually-best when the segment of aligned test sequence has no higher scoring hit elsewhere on the reference genome, and the segment of aligned reference sequence has no higher scoring alignment with another test sequence.
2. The test/reference alignment overlaps at least one exon of the reference genome (as defined by the Ensembl annotation of 24,567 human genes; version 13.31.1).
3. Ensure that the test sequence is not a processed pseudogene. Where multiple HSPs are separated by <25 bases on a test sequence, but >300 bases on the reference sequence, the test sequence is considered a potential pseudogene and should be eliminated.
4. Ensure that the test sequence does not align end-to-end with the reference genome. A test sequence for which an alignment extends over its complete length may not permit the design of primers that are specific for the test sequence, and should be eliminated.
5. After applying criteria 1–4 above, a single test sequence for each reference gene is selected. When application of these criteria gives several hits, the sequence having the highest Blastn score is selected.
6. With the objective of mapping more than 10,000 gene sequences, selection of 12,000 sequences is recommended to anticipate failures at each step of the mapping. The reference genome is divided into the smallest number of equivalently sized segments that permits one or more genes to be located in each of 12,000 distinct segments. Here, this entails dividing the human genome into 40,000 “bins” of 75 kb, 11,818 of which contain a gene target.
7. Select a test sequence for each of the 75 kb bins that contain a gene. If the bin contains multiple genes, select the sequence providing the highest Blastn score.
8. Selection of test sequences is determined in two rounds from alternate bins. Processing in two rounds permits first-round failures to be retried during the second round, thereby minimizing the size of gaps in the final map.
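Criteria 1 and 7 above can be sketched in a few lines. The example below is a simplified illustration, not the authors' pipeline: it treats each hit as a dictionary with hypothetical keys (`test_id`, `ref_id`, `chrom`, `ref_start`, `score`), applies the mutually-best rule at the level of whole sequences rather than individual HSP segments, and then keeps the highest-scoring hit in each 75 kb bin of the reference genome. Criteria 2–4 (exon overlap, pseudogene and end-to-end filters) are assumed to have been applied elsewhere.

```python
BIN_SIZE = 75_000  # bp, the bin size given in the text


def mutually_best(hsps):
    """Keep hits that are the top score both for their test sequence
    and for their reference gene (simplified, sequence-level version)."""
    best_for_test, best_for_ref = {}, {}
    for h in hsps:
        t, r = h["test_id"], h["ref_id"]
        if t not in best_for_test or h["score"] > best_for_test[t]["score"]:
            best_for_test[t] = h
        if r not in best_for_ref or h["score"] > best_for_ref[r]["score"]:
            best_for_ref[r] = h
    return [h for h in best_for_test.values() if best_for_ref[h["ref_id"]] is h]


def one_per_bin(hits, bin_size=BIN_SIZE):
    """Select the highest-scoring hit in each (chromosome, bin) of the reference."""
    selected = {}
    for h in hits:
        key = (h["chrom"], h["ref_start"] // bin_size)
        if key not in selected or h["score"] > selected[key]["score"]:
            selected[key] = h
    return selected


# Toy data, invented for illustration only.
hsps = [
    {"test_id": "dog1", "ref_id": "geneA", "chrom": "8", "ref_start": 10_000, "score": 300},
    {"test_id": "dog1", "ref_id": "geneB", "chrom": "8", "ref_start": 200_000, "score": 150},
    {"test_id": "dog2", "ref_id": "geneA", "chrom": "8", "ref_start": 10_500, "score": 250},
    {"test_id": "dog3", "ref_id": "geneC", "chrom": "8", "ref_start": 40_000, "score": 400},
    {"test_id": "dog4", "ref_id": "geneB", "chrom": "8", "ref_start": 201_000, "score": 220},
]
mb = mutually_best(hsps)
markers = one_per_bin(mb)
for (chrom, bin_index), hit in sorted(markers.items()):
    print(chrom, bin_index, hit["test_id"], hit["ref_id"])
```

In this toy set, dog2 is discarded because geneA aligns better to dog1, and dog1 in turn loses its bin to dog3, which scores higher within the same 75 kb window.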

3.4. Orthologous Sequence Content Criteria
1. Gene-based sequence selection. In order to characterize conserved synteny between a test and a reference genome, fragments of orthologous genes are first predicted using mutually-best Blastn alignments. Although any unique conserved sequences can be used to anchor regions of conserved synteny, there is particular interest in the comparison of protein-coding genes, and analyses have therefore focused on identification of orthologous exons.
2. Microsatellite sequences within genes (see Note 4). In the course of exon-based sequence selection, short tandem repeats (in flanking introns) were retrieved for 10% of the 11,818 selected sequences (see Note 5). Such sequences are an additional resource for genetic linkage and cloning studies and, depending upon the ultimate goal, they could be specifically selected for inclusion within the marker sequence (25).
3. Non-coding conserved sequence selection. Genome-wide sequence alignments lead to an estimate that 5% of a mammalian genome is under negative selection and thus potentially functional (22,26). This is about three times higher than the portion coding for proteins. Such noncoding sequences, expected to be transcribed into noncoding RNA or to be involved in regulation of gene expression, can also be selected for large-scale comparative genomic analysis. They can be particularly useful for comparative mapping of large genomic regions that lack protein-coding genes.

3.5. Gene-Based Marker Design for Mapping
Selected genomic survey-sequences (300–1000 bp) have mutually-best alignments with fragments of annotated human genes. For each sequence, the boundary between aligned and nonaligned segments is identified. In order to limit cross-amplification of the carrier hamster DNA present in RH DNA, primers should be designed in segments of the survey sequence that flank the aligned region. Alternatively, one primer can be selected from a flanking segment and the other from an aligned region. Markers derived from the sequence are amplified using two primers preferentially selected to be 25 bp in length and to work under a single optimal set of PCR conditions (salt, Tm, Mg2+, and so on) generating PCR products of 100–400 bp. Primers are selected for mapping using a standard selection program, i.e., Primer3 software, within nonrepetitive sequences.
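Before primers are picked, the nonaligned flanking segments of each survey sequence have to be located. The sketch below shows one way to do that, assuming the aligned regions are available as half-open (start, end) coordinates on the survey sequence; the function name and the 25-base minimum (chosen to match the preferred primer length) are illustrative assumptions. Primer picking itself would still be done with a program such as Primer3 and is not reproduced here.

```python
def flanking_segments(seq_len, aligned, min_len=25):
    """Return nonaligned segments of a survey sequence that are long enough
    to hold a primer.

    seq_len : length of the survey sequence in bases
    aligned : list of (start, end) half-open intervals aligned to the reference
    min_len : shortest segment worth reporting (assumed: one 25-base primer)
    """
    segments = []
    cursor = 0
    for start, end in sorted(aligned):
        if start - cursor >= min_len:
            segments.append((cursor, start))
        cursor = max(cursor, end)
    if seq_len - cursor >= min_len:
        segments.append((cursor, seq_len))
    return segments


# A 600-base contig whose bases 150..419 align to a human exon:
print(flanking_segments(600, [(150, 420)]))  # [(0, 150), (420, 600)]

# A sequence aligned almost end-to-end leaves no room for specific primers,
# which is the situation criterion 4 of Subheading 3.3. is meant to exclude.
print(flanking_segments(300, [(10, 290)]))   # []
```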


3.6. Radiation Hybrid Panel Characteristics
1. Radiation hybrid cells are constructed by fusing host cells that have been gamma-irradiated using a dose of 9,000–12,000 rads (17,27). This ensures a resolving power sufficient to map 10,000–12,000 markers at individual positions.
2. Radiation hybrid cell lines should be selected to contain 15–30% of the host genome with a standard deviation as low as possible. Approximately 200 markers distributed randomly on all chromosomes should be tested to determine consistent retention values. Selection of 90–95 hybrid cell lines (in our case, from a stock of 387 cell lines) is required to constitute a complete RH panel with appropriate statistical power.
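The retention screen in step 2 can be expressed as a per-hybrid calculation over the test markers. The sketch below assumes each marker has been typed on every candidate cell line and stored as a string of '0' (absent), '1' (present), and '2' (ambiguous), one character per hybrid; the function names and the toy data are hypothetical. It estimates the retention of each hybrid as the fraction of unambiguous markers scored present, then keeps hybrids within the 15–30% window.

```python
def retention_by_hybrid(vectors):
    """vectors: dict of marker name -> string of '0'/'1'/'2', one character
    per hybrid. Returns the retention frequency of each hybrid, ignoring
    ambiguous ('2') calls."""
    n_hybrids = len(next(iter(vectors.values())))
    freqs = []
    for i in range(n_hybrids):
        calls = [v[i] for v in vectors.values() if v[i] in "01"]
        freqs.append(calls.count("1") / len(calls) if calls else float("nan"))
    return freqs


def select_hybrids(freqs, low=0.15, high=0.30):
    """Indices of hybrids whose retention lies in the desired window."""
    return [i for i, f in enumerate(freqs) if low <= f <= high]


# Toy example: four markers typed on four candidate hybrids.
vectors = {"m1": "1001", "m2": "1000", "m3": "1012", "m4": "1000"}
freqs = retention_by_hybrid(vectors)
print(freqs)
print(select_hybrids(freqs))
```

In practice about 200 markers would be used, as stated above, and the final choice of 90–95 lines would also weigh the spread of retention values across the panel.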

3.7. Marker Genotyping
1. Genotype markers as follows: all reactions are performed using a 96-well or 384-well format in a volume of 10–15 µL. An initial screen using 50 ng dog DNA, 50 ng hamster DNA, and a 1:3 mix of dog/hamster DNA (50 ng) is used to select primers that under PCR amplification produce a clear DNA band with the host sample and no band with hamster DNA.
2. PCR amplifications are done using 50 ng of RH DNA and a touchdown program: 8 min at 95°C, followed by 20 cycles of 30 s at 94°C, 30 s at 63°C decreasing by 0.5°C per cycle, and 1 min at 72°C; then 15 cycles of 30 s at 94°C, 30 s at 53°C, and 1 min at 72°C; and a final extension of 2 min at 72°C.
3. Resolve PCR products on 1.8 or 2% agarose gels, electrophoresed for 30 min using standard electrophoresis gel apparatus (2,12).
4. View PCR bands under UV light after ethidium bromide staining, and record the image.
5. Submit images to manual or automated software for data acquisition.
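The touchdown program in step 2 can be written out explicitly as a list of (step, temperature, duration) entries, which is convenient for checking a thermocycler protocol before entering it. The sketch below assumes that the first touchdown cycle anneals at 63°C and that each later cycle is 0.5°C cooler; the function name is illustrative and ramp times are ignored.

```python
def touchdown_program():
    """Return the PCR program as (step name, temperature in C, seconds)."""
    steps = [("initial denaturation", 95.0, 8 * 60)]
    # Touchdown phase: 20 cycles, annealing temperature lowered 0.5 C per cycle.
    for cycle in range(20):
        anneal = 63.0 - 0.5 * cycle
        steps += [("denature", 94.0, 30), ("anneal", anneal, 30), ("extend", 72.0, 60)]
    # Fixed-temperature phase: 15 cycles with annealing at 53 C.
    for _ in range(15):
        steps += [("denature", 94.0, 30), ("anneal", 53.0, 30), ("extend", 72.0, 60)]
    steps.append(("final extension", 72.0, 2 * 60))
    return steps


program = touchdown_program()
anneals = [temp for name, temp, _ in program if name == "anneal"]
total_seconds = sum(seconds for _, _, seconds in program)
print(anneals[0], anneals[19], anneals[20])  # 63.0 53.5 53.0
print(total_seconds / 60, "min of hold time")  # 80.0 min of hold time
```

Under this reading the last touchdown cycle anneals at 53.5°C, so the fixed phase at 53°C continues the same 0.5°C step.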

3.8. Radiation Hybrid Map Construction
1. Score images of PCR products resolved on agarose gels for the absence or presence of a band as ‘0’ and ‘1’, respectively, in plain text format (ambiguous data are scored as ‘2’). Typically a genotype corresponds to a string of ‘0s’ and ‘1s’ distributed over 90–100 data points. A complete RH data record, termed an RH vector, comprises the marker ID and its retention pattern, for example: Marker_Id_1 0000011010000110010101000010...... 001010000010000100.
2. Computational analysis. The complete set of RH vectors (each RH vector characterizes one marker) is clustered into RH groups using pairwise calculations with two-point linkage analysis, using dedicated RH software (18–20). The Lod score statistical test is applied to the whole dataset in order to assign markers to RH groups. On average, when a map of this resolution is constructed, each chromosome will be represented by one to three individual RH groups. The Lod score threshold, while empirically determined by users, relies on several factors, notably the degree of breakage from the initial radiation (i.e., the RH panel resolution) and the number of markers that cluster. A conventional threshold of 6.0–8.0 is commonly accepted as significant. RH groups are sets of markers in which each marker is linked to at least one other marker at a Lod score higher than the Lod score threshold.


3. Ordering markers within RH groups is then carried out using multipoint linkage analysis. This step computes the final map order and delivers inter-marker distances expressed in centiRay (cR). Resulting maps are either comprehensive maps, in which all markers are placed and ordered relative to the other markers, or framework maps, in which only a subset of selected markers is analyzed and the other markers are inserted between the framework markers (see Note 6).
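The two-point step described above can be illustrated with a small, self-contained sketch. It is not the algorithm of any of the cited packages. It assumes a textbook equal-retention model in which, for retention r and breakage probability theta, two markers are both retained with probability r(1 - theta(1 - r)), both absent with probability (1 - r)(1 - theta r), and discordant with probability 2 theta r(1 - r). Theta is taken as a simple moment estimate, the Lod score compares that estimate with theta = 1 (no linkage), and markers are joined into RH groups by single linkage. All function names are hypothetical.

```python
import math


def two_point(vec_a, vec_b):
    """Return (lod, theta) for two RH vectors, skipping ambiguous calls."""
    pairs = [(a, b) for a, b in zip(vec_a, vec_b) if a in "01" and b in "01"]
    n = len(pairs)
    n11 = pairs.count(("1", "1"))
    n00 = pairs.count(("0", "0"))
    n_disc = n - n11 - n00
    r = (2 * n11 + n_disc) / (2 * n) if n else 0.0  # mean retention
    if r <= 0.0 or r >= 1.0:
        return 0.0, 1.0  # uninformative pair
    theta = min(1.0, n_disc / (2 * r * (1 - r) * n))  # moment estimate

    def log10_lik(t):
        total = 0.0
        if n11:
            total += n11 * math.log10(r * (1 - t * (1 - r)))
        if n00:
            total += n00 * math.log10((1 - r) * (1 - t * r))
        if n_disc:
            total += n_disc * math.log10(t * r * (1 - r))
        return total

    return max(0.0, log10_lik(theta) - log10_lik(1.0)), theta


def centirays(theta):
    """Map distance in cR from a breakage probability."""
    return math.inf if theta >= 1.0 else -100.0 * math.log(1.0 - theta)


def rh_groups(vectors, lod_threshold=8.0):
    """Single-linkage clustering: markers share a group if a chain of
    pairwise Lod scores above the threshold connects them."""
    names = list(vectors)
    parent = {m: m for m in names}

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            lod, _ = two_point(vectors[a], vectors[b])
            if lod > lod_threshold:
                parent[find(a)] = find(b)
    groups = {}
    for m in names:
        groups.setdefault(find(m), []).append(m)
    return sorted(groups.values())


# Toy panel of 90 hybrids: m1 and m2 co-retained, m3 unrelated.
pattern = "1100110000" * 9
vectors = {"m1": pattern, "m2": pattern, "m3": "1010101010" * 9}
print(rh_groups(vectors))
```

The all-pairs loop is quadratic in the number of markers, so for a 10,000-marker data set the dedicated software listed in Note 2 remains the practical choice; this sketch only makes the logic of the grouping step concrete.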

3.9. Comparative Map Construction
Comparative genomics aims to describe the structural organization between genomes at both the large-scale and the micro-rearrangement level. Gene-based markers in a genome having orthologs unambiguously identified in a second genome are informative anchor sites between genomes (28). An RH map encompassing 10,000 gene-based markers (one every 250–300 kb on average) orthologous to genes mapped in reference genomes is a powerful tool for constructing comparative maps (Fig. 1). About 90% of the human genome is in large blocks of homology with the genomes of dog and/or other mammals sequenced to date. The regions of conserved synteny reveal many genes from canine chromosomes that match blocks of genes in human chromosomes, defining conserved segments (CS). Large subblocks that have their markers colinearly arranged in the test RH map and in the reference sequence are termed conserved segments ordered (CSO) (28).
1. Compare the marker positions in the test RH map with the positions of their orthologs in the reference sequence to display chromosomal rearrangements such as inversions, translocations, and duplications. The comparison of 10,000 independent gene positions between genomes will identify nearly all test/reference conserved segments greater than 500 kb (16,22).
2. Beyond the description of conserved segments between genomes, the RH map built from survey-sequencing data can be used to identify evolutionary breakpoints between CS and CSO.
3. Careful identification of synteny breakpoints defines regions of interchromosomal and intrachromosomal rearrangement and can pinpoint genomic loci prone to chromosomal breakage and fusion.
4. Identify sites of rearrangement that tend to occur in regions containing duplicated sequences and gene family members. Categorization of evolutionary breakpoints as either lineage-specific or shared through species evolution permits derivation of ancestral genomes (29).
5. Orthologs corresponding to gene-based markers from new species can be searched after construction of the RH map with the alignment program Blastn. This allows integration of data from new genomes for further multispecies comparative genomics analysis.
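Step 1 amounts to walking along the test RH map and watching how the reference positions of the orthologs behave. The sketch below is a deliberately minimal illustration with a hypothetical function name and invented coordinates: markers are given in test-map order as (reference chromosome, reference position), and a new block is started whenever the reference chromosome changes or the direction of travel along the reference reverses. Published analyses apply further rules (for example, minimum numbers of markers per segment and tolerance for local order ambiguity) that are not modeled here.

```python
def conserved_ordered_blocks(markers):
    """Split markers (in test-map order) into runs that stay on one reference
    chromosome and keep a consistent orientation.

    markers: list of (ref_chrom, ref_pos) tuples.
    Returns a list of dicts with keys 'chrom', 'positions', and 'direction'
    (+1 same orientation, -1 inverted, 0 for a single-marker block).
    """
    blocks = []
    for chrom, pos in markers:
        if blocks and blocks[-1]["chrom"] == chrom:
            last = blocks[-1]
            step = pos - last["positions"][-1]
            direction = (step > 0) - (step < 0)
            if direction != 0 and last["direction"] in (0, direction):
                last["positions"].append(pos)
                last["direction"] = direction
                continue
        blocks.append({"chrom": chrom, "positions": [pos], "direction": 0})
    return blocks


# Invented example in the spirit of Fig. 1: a canine chromosome whose markers
# map first to HSA17, then to an inverted stretch of HSA8.
cfa_markers = [("HSA17", 10), ("HSA17", 20), ("HSA17", 30),
               ("HSA8", 50), ("HSA8", 40), ("HSA8", 35), ("HSA8", 60)]
for block in conserved_ordered_blocks(cfa_markers):
    print(block["chrom"], block["positions"], block["direction"])
```

Boundaries between consecutive blocks are the candidate evolutionary breakpoints referred to in steps 2 and 3.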

4. Notes 1. Acquisition software is a computer application which allows researchers to accurately and quickly score electrophoresis data from an RH gel image. It displays the image

72

Hitte et al.

Fig. 1. (Continued)

Survey Sequencing and Radiation Hybrid Mapping

73

Fig. 1. Strategy to construct comparative maps using survey-sequencing combined with dense RH gene maps. (A) End-sequencing (arrows) of clones from 2 or 10 kb inserts are assembled into contigs with on average 5 reads per contig as represented on the left of the figure. Gene fragments, symbolized by black circles, are identified by sequencesimilarity search and selected based on an even distribution on a reference genome as schematized by human chromosome 8 on the right of the figure. (B) (1) Gene-based markers (white circles) are derived from gene fragment sequences (black circles). (2) The RH mapping analysis clusters and orders markers (represented by small grey circles) along chromosomes, here canine chromosome 16 (CFA16) shown in the middle part of panel (B). (3) A human–canine comparative map for CFA16 is constructed using orthologous markers that serve as comparative anchors between the two species as shown on the right part of the figure. The comparative map identifies blocks of conserved synteny between CFA16 and HSA17 and HSA8 as well as conserved ordered segments (CSO) that contain adjacent markers in the same order and orientation.

74

Hitte et al.

of the gel on the screen, detects PCR products automatically using a specifically designed pattern recognition algorithm. Once all of the desired bands have been selected, acquisition programs such as AutoScore (available upon request from the corresponding author) allows users to save the scored data in a text file. The AutoScore application is written in Java. Therefore, it will run on any hardware platform that supports Java. AutoScore is able to read and import gel images in GIF or JPEG formats. It allows researchers to overlay a resizable grid onto the displayed image. The grid can be manipulated so that the grid cells align with the bands in the image. The image is automatically rotated at an angle of 90° when it is loaded from disk. AutoScore allows the user to “click” on any grid cell in order to select the band intensity. The cell will be highlighted at three different levels [0 = off (black), 1 = on (white), 2 = maybe (gray)]. Successive clicks will toggle through these three values. The resulting array of scores can be saved in a text file, including the image name. 2. RH package softwares description and distribution - TSP/CONCORDE: An approach to construction of RH maps that uses the CONCORDE software for solving the Travelling Salesman Problem can efficiently map large numbers of markers and can construct maps combining two RH panels (19). http://www.tsp.gatech.edu/concorde.html. - RHMAP at the University of Michigan: Written in Fortran, multiple retention models and mapping approaches: http://csg.sph.umich.edu/boehnke/rhmap. php (30). - MultiMap at Rutgers University: Written in CLISP and C++, highly automated, equal retention model. http://compgen.rutgers.edu/multimap/multimapdist. html (18). - RHMAPPER at the Whitehead Institute/MIT Center for Genome Research: Written in C and PERL, automated, equal retention model: Z-extensions developed at the Sanger Centre provides additional functionality to RHMAPPER: ftp.sanger. ac.uk/pub/zmapper (30,31). 
- CarthaGene at INRA Toulouse France: Program based on a maximum likelihood multipoint RH and genetic data mapping tool. Haploid/diploid equal retention models with a very fast EM algorithm. Can build maps combining RH/genetic datasets. Automated C++/Tcl/Tk multiplatform (Windows/Unix) program with a rich graphic user interface. http://www.inra.fr/Internet/ Departements/MIA/T/ CarthaGene/index.html (32). 3. Gene-based orthologous sequence selection: If the primary objective of a sequencing project is to generate gene-based markers for RH-mapping, 1× sequence coverage of a genome offers several advantages over large collections of ESTs. Unlike cDNA libraries, the representation of genes is unaffected by cellular expression levels, and identification of orthologous exons is not biased by the length of 3 untranslated mRNA. In addition, the low but significant conservation of intronic sequences between species is useful for distinguishing between paralogous sequences that share substantial sequence identity within exons. 4. Microsatellites are tandem arrays made up of many copies of a short repeating unit typically 1–8 bp in length (25). Microsatellites have been found in 1054

gene-marker sequences to date; these were identified by screening the 10,000 marker sequences with the sputnik program (http://espressosoftware.com/pages/sputnik.jsp).
5. Microsatellites selected for a high number of repeated motifs and for the absence of point mutations that disrupt the periodic pattern have a higher chance of showing allelic polymorphism. Analyses should therefore focus on such microsatellites.
6. Comprehensive and framework maps: A comprehensive RH map is a map in which all markers are ordered using, for example, the state-of-the-art CONCORDE algorithm. The RH map is produced using a global method that searches for local improvements, starting with an initial tour (map) defined by the nearest neighbor. The heuristics used in the CONCORDE chained Lin-Kernighan algorithm apply a random “kick” to the tour (map) and rerun the local search, keeping the improved solution when one is found or otherwise returning to the previous tour to continue local improvement. The different algorithms used by the TSP/CONCORDE package are independent in terms of computation principle, in that combinatorial and maximum likelihood approaches are both used in the analysis. Within the two approaches, variations are made in the handling of unknown entries to reduce the effect of unknowns on the quality of the map produced by TSP. Framework 1000:1 maps such as those produced by RHMAPPER, CarthaGene, or MultiMap are not true framework maps. All alternative orders have a log-likelihood within the initial map larger than an “adding threshold.” This is a weakness of all simple heuristic procedures that are based on iterative insertion processes. Furthermore, so-called framework maps place only subsets (<50%) of markers. The remaining markers are then inserted between best neighbor markers.
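The nearest-neighbor starting tour mentioned in Note 6 can be illustrated with a minimal sketch. It is not the CONCORDE implementation: it builds only the initial tour, and it assumes that the distance between two markers is the number of obligate breaks (hybrids scored 1 for one marker and 0 for the other), with ambiguous scores (2) ignored. The marker names and vectors are hypothetical.

```python
from typing import Dict, List


def breaks(a: str, b: str) -> int:
    """Count hybrids where one marker is retained ('1') and the other is not ('0').

    Positions scored '2' (ambiguous) in either vector are skipped.
    """
    return sum(1 for x, y in zip(a, b) if "2" not in (x, y) and x != y)


def nearest_neighbor_order(vectors: Dict[str, str], start: str) -> List[str]:
    """Greedy initial tour: always move to the closest unplaced marker."""
    order = [start]
    remaining = set(vectors) - {start}
    while remaining:
        last = vectors[order[-1]]
        # Ties are broken alphabetically so that the result is reproducible.
        nxt = min(sorted(remaining), key=lambda m: breaks(last, vectors[m]))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Hypothetical RH vectors for four markers typed on ten hybrids.
vectors = {
    "M1": "1100110000",
    "M2": "1100100000",
    "M3": "1000101000",
    "M4": "0000101100",
}
order = nearest_neighbor_order(vectors, "M1")
print(order)
```

A tour built this way is only a starting point; the local-improvement and “kick” steps described in the note are what refine it.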

Acknowledgments
We acknowledge all our collaborators involved in this work for their contributions to this chapter. We acknowledge the American Kennel Club Canine Health Foundation, U.S. Army Grant DAAD19-01-1-0658 (E.A.O. and F.G.), and NIH R01CA-92167 (E.A.O., E.K., and F.G.). F.G. and C.H. are supported by the French Centre National de la Recherche Scientifique (CNRS) and by the Conseil Regional de Bretagne. This research was also supported (in part) by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health.

References
1. Mellersh, C. S., Langston, A. A., Acland, G. M., et al. (1997) A linkage map of the canine genome. Genomics 46, 326–336.
2. Priat, C., Hitte, C., Vignaux, F., et al. (1998) A whole-genome radiation hybrid map of the dog genome. Genomics 54, 361–378.
3. Vignaux, F., Hitte, C., Priat, C., Chuat, J. C., Andre, C., and Galibert, F. (1999) Construction and optimization of a dog whole-genome radiation hybrid panel. Mamm. Genome 10, 888–894.

4. Mellersh, C. S., Hitte, C., Richman, M., et al. (2000) An integrated linkage-radiation hybrid map of the canine genome. Mamm. Genome 11, 120–130.
5. Ostrander, E. A. and Wayne, R. K. (2005) The canine genome. Genome Res. 15, 1831–1837.
6. Parker, H. G. and Ostrander, E. A. (2005) Canine genomics and genetics: running with the pack. PLoS Genet. 1(5).
7. Breen, M., Bullerdiek, J., and Langford, C. F. (1999) The DAPI banded karyotype of the domestic dog (Canis familiaris) generated using chromosome-specific paint probes. Chromosome Res. 7, 401–406.
8. Breen, M., Thomas, R., Binns, M. M., Carter, N. P., and Langford, C. F. (1999) Reciprocal chromosome painting reveals detailed regions of conserved synteny between the karyotypes of the domestic dog (Canis familiaris) and human. Genomics 61, 145–155.
9. Yang, F., O’Brien, P. C., Milne, B. S., et al. (1999) A complete comparative chromosome map for the dog, red fox, and human and its integration with canine genetic maps. Genomics 62, 189–202.
10. Yang, F., Graphodatsky, A. S., O’Brien, P. C., et al. (2000) Reciprocal chromosome painting illuminates the history of genome evolution of the domestic cat, dog and human. Chromosome Res. 8, 393–404.
11. Breen, M., Langford, C. F., Carter, N. P., et al. (1999) FISH mapping and identification of canine chromosomes. J. Hered. 90, 27–30.
12. Breen, M., Jouquand, S., Renier, C., et al. (2001) Chromosome-specific single-locus FISH probes allow anchorage of an 1800-marker integrated radiation-hybrid/linkage map of the domestic dog genome to all chromosomes. Genome Res. 11, 1784–1795.
13. Guyon, R., Lorentzen, T. D., Hitte, C., et al. (2003) A 1-Mb resolution radiation hybrid map of the canine genome. Proc. Natl Acad. Sci. USA 100, 5296–5301.
14. Guyon, R., Kirkness, E. F., Lorentzen, T. D., et al. (2003) Building comparative maps using 1.5× sequence coverage: human chromosome 1p and the canine genome. Cold Spring Harb. Symp. Quant. Biol. 68, 171–177.
15. Breen, M., Hitte, C., Lorentzen, T. D., et al. (2004) An integrated 4249 marker FISH/RH map of the canine genome. BMC Genomics 5, 1–11.
16. Hitte, C., Madeoy, J., Kirkness, E. F., et al. (2005) Survey sequencing combined with dense radiation hybrid gene mapping facilitates genome navigation. Nat. Rev. Genet. 6, 643–648.
17. Vignaux, F., Priat, C., Jouquand, S., et al. (1999) Toward a dog radiation hybrid map. J. Hered. 90, 62–67.
18. Matise, T. C., Perlin, M., and Chakravarti, A. (1994) Automated construction of genetic linkage maps using an expert system (MultiMap): a human genome linkage map. Nat. Genet. 6, 384–390.
19. Agarwala, R., Applegate, D. L., Maglott, D., Schuler, G. D., and Schaffer, A. A. (2000) A fast and scalable radiation hybrid map construction and integration strategy. Genome Res. 10, 350–364.

20. Hitte, C., Lorentzen, T., Guyon, R., et al. (2003) Comparison of the MultiMap and TSP/CONCORDE packages for constructing radiation hybrid maps. J. Hered. 94, 9–13.
21. Kirkness, E. F., Bafna, V., Halpern, A. L., et al. (2003) The dog genome: survey sequencing and comparative analysis. Science 301, 1898–1903.
22. Lindblad-Toh, K., Wade, C. M., Mikkelsen, T., et al. (2005) Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature 438, 803–819.
23. Altschul, S. F., Madden, T. L., Schaffer, A. A., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402.
24. Margulies, E. H., Maduro, V. V., Thomas, P. J., et al. (2005) Comparative sequencing provides insights about the structure and conservation of marsupial and monotreme genomes. Proc. Natl Acad. Sci. USA 102, 3354–3359.
25. Thomas, E. E. (2005) Short, local duplications in eukaryotic genomes. Curr. Opin. Genet. Dev. 15, 640–644.
26. Hardison, R. C. (2003) Comparative genomics. PLoS Biol. 1(2).
27. Senger, F., Priat, C., Hitte, C., et al. (2006) The first radiation hybrid map of a perch-like fish: The gilthead seabream (Sparus aurata L). Genomics 87, 793–800.
28. O’Brien, S. J., Womack, J. E., Lyons, L. A., et al. (1993) Anchored reference loci for comparative genome mapping in mammals. Nat. Genet. 3, 103–112.
29. Murphy, W. J., Larkin, D. M., Everts-van der Wind, A., et al. (2005) Dynamics of mammalian chromosome evolution inferred from multispecies comparative maps. Science 309, 613–617.
30. Boehnke, M., Lange, K., and Cox, D. R. (1991) Statistical methods for multipoint radiation hybrid mapping. Am. J. Hum. Genet. 49, 1174–1188.
31. Soderlund, C., Lau, T., and Deloukas, P. (1998) Z extensions to the RHMAPPER package. Bioinformatics 14, 538–539.
32. Schiex, T. and Gaspin, C. (1997) CARTHAGENE: constructing and joining maximum likelihood genetic maps. Proc. Int. Conf. Intell. Syst. Mol. Biol. 5, 258–267.

6
Construction of High-Resolution Comparative Maps in Mammals Using BAC-End Sequences
Denis M. Larkin and Harris A. Lewin

Summary
Whole-genome bacterial artificial chromosome (BAC) libraries for 40 representative species of mammals provide unique resources for the construction of high-resolution ordered genomic maps, the development of species genetics, and multispecies genome comparisons. Herein we describe procedures to construct high-quality radiation hybrid comparative maps using BAC-end sequences (BESs). This approach has been applied to the construction of 1 Mbp resolution porcine–human and cattle–human comparative maps. High-resolution ordered comparative maps built with BESs and whole-genome sequences have been shown to be an invaluable resource for the analysis of mammalian chromosome evolution, and as a backbone for the assembly of whole-genome shotgun sequences.

Key Words: BAC-end sequences; comparative mapping; genome analysis; radiation hybrid mapping.

From: Methods in Molecular Biology: Phylogenomics
Edited by: W. J. Murphy © Humana Press Inc., Totowa, NJ

1. Introduction
The creation of high-resolution radiation hybrid (RH) maps can be significantly accelerated using terminal “end” sequences of large-insert DNA libraries as a source of markers (1,2). Such libraries created in bacterial artificial chromosome (BAC) vectors and cloned in E. coli typically contain host-DNA inserts of 100–300 Kbp (3). More than 90 BAC libraries have been constructed for 40 species representing 12 mammalian orders (http://www.genome.gov/10001852; http://bacpac.chori.org/). High-throughput sequencing of the 5′ and 3′ ends of the host-DNA inserts, termed BAC-end sequences (BESs), provides an excellent source of markers for detailed RH maps. An advantage of BESs is that they can be used as “comparative anchors” to other genomes that have been completely sequenced (reference genomes) using similarity searching

algorithms such as BLASTn. The present article describes methods for constructing high-resolution, comparatively anchored RH maps using BESs. Comparatively anchored high-resolution RH maps are an important resource for positional cloning of trait loci (4,5) and for studying chromosome evolution (6). In addition, such maps are useful in the final stages of assembling whole-genome sequence produced by the random shotgun strategy (7).

2. Materials
2.1. BAC DNA Culturing
1. 2X Luria–Bertani (LB) medium: 20 g Bacto Tryptone (Difco Laboratories, Detroit, MI), 10 g Bacto Yeast Extract (Difco Laboratories), 10 g NaCl (Sigma, St. Louis, MO), in 1 L water (see Note 1). Autoclave for 15 min. Cool to room temperature; add 1 mL chloramphenicol (25 mg/mL; see Note 2). The medium is stored at +4°C.

2.2. BAC DNA Extraction
1. Montage BAC96 Miniprep Kit (Millipore, Billerica, MA). The RNase A provided is added to solution S1, which is then stored at +4°C. Solutions S2, S3, and S4 are stored at room temperature.
2. Vacuum manifold (Millipore).
3. 96 deep-well blocks for preculturing BAC clones (Millipore).

2.3. Sequencing BAC Ends
1. T7 (TAATACGACTCACTATAGGG) and modified SP6 (GGCCGTCGACATTTAGGTGACA) oligonucleotides (0.01 µM scale) are ordered from commercial vendors and reconstituted or diluted to a concentration of 20 µM with water. Oligonucleotide solutions are stored at −20°C (see Note 3).
2. BAC sequencing buffer: 200 mM Tris-HCl, pH 9.0, 5 mM MgCl2, aliquoted into 1.5-mL Eppendorf tubes and stored at −20°C.
3. BigDye Terminator v3.1 Cycle Sequencing Kit (Applied Biosystems, Foster City, CA), stored at −20°C.
4. Absolute ethanol diluted to 70% with water and stored at room temperature.
5. Magnesium sulfate solution (0.2 mM): 5 mg of MgSO4 (Sigma-Aldrich, St. Louis, MO) in 200 mL of 70% ethanol, stored at room temperature.
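The stated concentration of the magnesium sulfate solution can be checked with a short calculation. The sketch below assumes anhydrous MgSO4 (about 120.37 g/mol); the text does not state which hydrate form is used, and a hydrate would give a lower molarity for the same mass.

```python
# Molar mass of anhydrous MgSO4 in g/mol (assumption: the hydrate form is not
# given in the text).
MGSO4_G_PER_MOL = 120.37


def molarity_mM(mass_mg: float, volume_mL: float, g_per_mol: float) -> float:
    """Return the concentration in mM for a mass (mg) dissolved in a volume (mL)."""
    amount_mmol = mass_mg / g_per_mol          # mg / (g/mol) = mmol
    return amount_mmol / (volume_mL / 1000.0)  # mmol / L = mM


conc = molarity_mM(5.0, 200.0, MGSO4_G_PER_MOL)
print(f"{conc:.3f} mM")
```

Under this assumption the result is close to the nominal 0.2 mM given in item 5.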

2.4. Selection of BAC-End Sequences and PCR Primer Design
1. BAC-end sequences are either downloaded from GenBank (http://www.ncbi.nlm.nih.gov/) or generated in-house and formatted into multiple-sequence FASTA files (see Note 4).
2. RepeatMasker software (ftp://ftp.genome.washington.edu) is required for masking of repetitive elements in BESs.
3. NCBI BLASTn (8) with default parameters and an E-value of e−5 is used for comparison of BESs to the reference genome. However, more sensitive parameters:

-W 7 -r 17 -q -21 -f 280 -G 29 -E 22 -X 240 -e 0.00001 may be used to identify matches with more divergent sequences.
4. Vector NTI (Invitrogen, Bethesda, MA) sequence analysis package is used for the selection of PCR primer pairs from the repeat-masked BESs with significant matches in the reference genome(s) (see Note 5).

2.5. Optimization of PCR Conditions and RH Panel Analysis
1. Oligonucleotides (0.01 µM scale) are ordered from commercial vendors (see Note 6) and reconstituted or diluted to a concentration of 100 µM with water. Stock solutions are stored at −20°C.
2. Oligonucleotide (primer) stock solutions are diluted to 20 µM and the final working solutions are stored at +4°C until use.
3. Deoxyribonucleotide triphosphates (dNTPs; Life Technologies, Gaithersburg, MD) are diluted in water to a final concentration of 100 mM each and aliquoted into 1.5-mL microcentrifuge tubes. Stock solutions are stored at −20°C until use.
4. Ten microliters of each stock dNTP solution are mixed in a 1.5-mL tube and 760 µL water is added. Working solutions are made fresh and stored at −20°C until use.
5. AmpliTaq Gold DNA polymerase (Applied Biosystems) is stored at −20°C (see Note 7).
6. 10X PCR Buffer II (Applied Biosystems) is stored at −20°C.
7. 25 mM MgCl2 (Applied Biosystems) is stored at −20°C.
8. MJ Research PTC-100 PCR thermocyclers.
9. DNAs from RH clones are diluted in TE buffer and stored in individual tubes at +4°C.
10. PCR master mix for primer optimization (96-well plate): 200 µL of 10X PCR Buffer II, 160 µL master dNTPs, 120 µL MgCl2, 1430 µL water, and 10 µL AmpliTaq Gold DNA polymerase.
11. The PCR mixture for a genotyping plate includes a premix of 150 µL PCR Buffer II, 131.25 µL dNTPs, 101.25 µL MgCl2, 1185 µL of water, 33.75 µL of each oligonucleotide primer, and 11.25 µL AmpliTaq Gold DNA polymerase mixed in a 5-mL tube (Becton-Dickinson, Franklin Lakes, NJ).
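The dilutions in items 4 and 10 can be checked with the usual C1·V1 = C2·V2 relation. The sketch below is illustrative only: it assumes four dNTPs at 10 µL each, reads the water volume in item 4 as 760 µL, and reports concentrations in the master mix before primers and template are added.

```python
def diluted_conc(stock_conc: float, stock_vol: float, total_vol: float) -> float:
    """C2 = C1 * V1 / V2; volumes must share one unit."""
    return stock_conc * stock_vol / total_vol


# Item 4: dNTP working solution (assumes four dNTPs, 10 uL of each 100 mM stock).
dntp_total_uL = 4 * 10 + 760
dntp_each_mM = diluted_conc(100.0, 10.0, dntp_total_uL)

# Item 10: master mix for primer optimization (volumes in uL).
master_mix_uL = {
    "10X PCR Buffer II": 200,
    "dNTP working solution": 160,
    "25 mM MgCl2": 120,
    "water": 1430,
    "AmpliTaq Gold": 10,
}
total_uL = sum(master_mix_uL.values())
mgcl2_mM = diluted_conc(25.0, master_mix_uL["25 mM MgCl2"], total_uL)
dntp_in_mix_mM = diluted_conc(dntp_each_mM, master_mix_uL["dNTP working solution"], total_uL)

print(f"dNTP working solution: {dntp_each_mM} mM each")
print(f"master mix: {total_uL} uL, MgCl2 {mgcl2_mM:.2f} mM, dNTP {dntp_in_mix_mM:.3f} mM each")
```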

2.6. Agarose Gel Electrophoresis
1. TBE buffer (10X): 108 g of Trizma base (Sigma-Aldrich), 55 g of boric acid (FisherBiotech, Pittsburgh, PA), and 40 mL of 0.5 M EDTA (pH 8.0); add water to 1 L total volume (see Note 8). TBE is stored at room temperature.
2. Loading buffer: 0.25% bromophenol blue (Sigma-Aldrich), 15% Ficoll, Type 400 (Sigma-Aldrich), in 1X TE buffer. Loading buffer is stored at room temperature.
3. Molecular weight marker (1 Kbp ladder; Invitrogen) diluted to a concentration of 100 ng/µL with loading buffer and stored at −20°C.
4. Ethidium bromide (EtBr) solution, 10 mg/mL in water (Molecular Probes, Eugene, OR), stored at room temperature (see Note 9).

3. Methods
BAC-end sequences can be generated using various commercially available kits (e.g., Qiagen R.E.A.L. Prep Kit or Millipore Montage BAC96 Miniprep Kit) for BAC DNA purification and standard sequencing procedures, or can be downloaded from NCBI GenBank using the following query in the Nucleotide/GSS section: “BAC end* [species of interest]”. Repeat-masking of BESs before BLASTn sequence comparison is critical in order to avoid multiple nonspecific hits in the reference genomes; it also ultimately prevents amplification of nonspecific sequences during the PCR screening of RH clones. We recommend performing a BLASTn search of BESs against two reference genomes with relatively divergent genomic sequences (e.g., human and mouse) and selecting for comparative mapping those BESs that have a single significant hit in both genomes. This approach avoids BESs that have hits in duplicated regions of the reference genome(s) and the subsequent problems in comparative analysis. For the construction of reliable RH maps, selected BESs should be equally distributed over the whole genome. In order to avoid amplification of orthologous sequences from the recipient chromosomes in the RH clones, BESs with very high sequence identity to the reference genomes should not be selected for primer design. In addition, rigorous checking of the RH vectors included in the RH map construction will significantly improve the quality of the comparative RH maps.
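The recommendation that selected BESs be equally distributed over the genome can be approximated by thinning candidate BESs according to the coordinates of their reference-genome hits. The following is a minimal greedy sketch under stated assumptions: the function name, the tuple layout, and the 1 Mbp minimum spacing are all hypothetical choices for illustration, not part of the published procedure.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, str, int]  # (BES id, reference chromosome, hit position in bp)


def pick_spaced(hits: List[Hit], min_spacing_bp: int) -> List[str]:
    """Walk along each chromosome and keep a BES only if it lies at least
    min_spacing_bp beyond the previously kept BES on that chromosome."""
    selected: List[str] = []
    last_pos: Dict[str, int] = {}
    for bes_id, chrom, pos in sorted(hits, key=lambda h: (h[1], h[2])):
        if chrom not in last_pos or pos - last_pos[chrom] >= min_spacing_bp:
            selected.append(bes_id)
            last_pos[chrom] = pos
    return selected


# Hypothetical candidate BESs with single hits in a reference genome.
hits = [
    ("bes1", "chr1", 100_000),
    ("bes2", "chr1", 600_000),
    ("bes3", "chr1", 1_200_000),
    ("bes4", "chr1", 2_150_000),
    ("bes5", "chr2", 50_000),
]
selected = pick_spaced(hits, 1_000_000)
print(selected)
```

A greedy pass like this keeps the spacing above the threshold but does not fill gaps; regions with no candidate BES still need to be inspected separately.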

3.1. BAC Culturing
1. For each 384-well BAC library plate, four 96 deep-well plates (Millipore) containing 1 mL 2X LB plus antibiotic in each well (preculture plates) and four 96 deep-well plates containing 2X LB with antibiotic (culture plates) are prepared. Plates are covered with ScotchPad 3M (Masking and Packaging, St. Paul, MN) and stored at +4°C until use.
2. A 384-well BAC library plate is removed from the freezer (−80°C), uncovered, and defrosted for about 40 min.
3. Preculture plates are labeled according to the library plate id and one of four quadrants (e.g., A1, A2, B1, and B2).
4. A 96-pin replicator (V&P Scientific, San Diego, CA) is flamed with EtOH 3–5 times and cooled to room temperature, and BAC colonies are transferred from the first quadrant (A1) of the 384-well BAC plate to the preculture plate labeled “A1.” The inoculated preculture plate is covered with AirPore tape (Qiagen, Valencia, CA).
5. Step 4 is repeated for each remaining quadrant of the 384-well plate.
6. The 384-well library plate is covered with a lid and placed back in the −80°C freezer.
7. The preculture plates are placed on a platform shaker and shaken at 320 rpm for 21 h at 37°C (see Note 10).

8. Preculture plates are removed from the shaker.
9. Culture plates are labeled with the 384-well plate id and quadrant.
10. A 96-pin replicator (V&P Scientific) is flamed with ethanol 3–5 times and a culture plate is inoculated 2–3 times with the culture from the corresponding preculture plate. The culture plate is covered with AirPore tape (Qiagen).
11. Step 10 is repeated for each remaining quadrant and culture plate.
12. The culture plates are placed on a platform shaker and shaken at 320 rpm for 17 h at 37°C (see Note 10).
13. Culture plates are centrifuged at 1500g, +4°C for 15 min. After centrifugation, the culture supernatant is immediately decanted into a container for proper disposal.
14. Pellets are stored at −80°C or used immediately for BAC DNA extraction.
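Tracking which 384-well library position ends up in which 96-well culture well is easy to get wrong when four quadrant plates are handled per library plate. The sketch below assumes the usual interleaved geometry of a 96-pin replicator on a 384-well plate, in which a quadrant is named after the top-left well the pins enter (A1, A2, B1, or B2); this geometry is an assumption that matches the quadrant labels in step 3 and is not stated explicitly in the text.

```python
import string
from typing import Tuple


def quadrant_and_well(well_384: str) -> Tuple[str, str]:
    """Map a 384-well position (rows A-P, columns 1-24) to the quadrant name
    and the 96-well position (rows A-H, columns 1-12) that receives it."""
    row = string.ascii_uppercase.index(well_384[0].upper())  # A..P -> 0..15
    col = int(well_384[1:]) - 1                              # 1..24 -> 0..23
    if not (0 <= row < 16 and 0 <= col < 24):
        raise ValueError(f"not a 384-well position: {well_384}")
    quadrant = "AB"[row % 2] + str(col % 2 + 1)
    well_96 = string.ascii_uppercase[row // 2] + str(col // 2 + 1)
    return quadrant, well_96


for well in ("A1", "A2", "B1", "B2", "C3", "P24"):
    print(well, quadrant_and_well(well))
```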

3.2. BAC DNA Extraction (Adapted from Montage BAC96 Miniprep Kit User Guide)
1. Cell pellets are resuspended by adding 100 µL of Solution 1 (containing RNase A enzyme) to each well and mixing on a plate shaker for 3–5 min (or longer if necessary) until cells are completely resuspended. Alternatively, resuspension may be achieved by vortexing or pipetting (see Note 11).
2. One hundred microliters of Solution 2 are added to each well and mixed immediately on a plate shaker for 1 min. Plates are incubated for an additional 2 min at room temperature. Total lysis time should not exceed 5 min.
3. One hundred microliters of Solution 3 are added to each well and mixed immediately and vigorously with a plate shaker for 2 min (see Note 10). At this point, the bacterial lysate is ready for transfer to the lysate clearing plate (labeled “CLEARING”).
4. The BAC plate (labeled “BAC”) is placed inside the vacuum manifold.
5. The entire lysate volume is pipetted from the bottom of each deep well and dispensed into the corresponding well of the lysate clearing plate.
6. The lysate clearing plate is placed on top of the manifold. The vacuum seal is checked and the vacuum is adjusted to 8 inches of Hg (270 mbar; 203 torr). The vacuum is applied, drawing the lysate through the clearing plate into the BAC plate. After filtration is complete, the vacuum is switched off and the lysate clearing plate is discarded.
7. The BAC plate containing clarified lysates is placed on top of the empty manifold. Vacuum at 24 inches of Hg (810 mbar; 610 torr) is applied until the wells are empty. The filtrate is discarded to waste. When filtration is complete, the vacuum is switched off (see Note 12).
8. Two hundred microliters of Solution 4 are added to each well of the BAC plate. Vacuum is applied at 24 inches of Hg (810 mbar; 610 torr) until the wells are empty. Filtrate is directed to waste. When filtration is complete, the vacuum is switched off.
9. The BAC DNA samples are resuspended by adding 35–40 µL of Solution 5 to the wells of the BAC plate. The BAC plate is shaken for 10 min on a plate shaker (see Note 10).

10. Retained BAC DNA is pipetted from the wells of the BAC plate into the V-bottom plate for storage. (Recovered volume can be maximized by tilting the BAC plate before collecting the sample.) Sealing tape is used to seal wells of the V-bottom storage plate.
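Vacuum gauges differ in the units they display, so the settings in steps 6–8 may need converting. The sketch below uses standard conversion factors (1 inHg ≈ 33.8639 mbar; 1 inHg = 25.4 mmHg, with torr treated as equal to mmHg); the values quoted in the protocol appear to be rounded versions of these results.

```python
MBAR_PER_INHG = 33.8639  # standard conversion factor
TORR_PER_INHG = 25.4     # 1 inHg = 25.4 mmHg; torr treated as equal to mmHg


def inhg_to_mbar(inhg: float) -> float:
    return inhg * MBAR_PER_INHG


def inhg_to_torr(inhg: float) -> float:
    return inhg * TORR_PER_INHG


for setting in (8, 24):
    print(f"{setting} inHg = {inhg_to_mbar(setting):.1f} mbar = {inhg_to_torr(setting):.1f} torr")
```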

3.3. End Sequencing BAC DNA
1. The following protocol assumes the use of 96-well ABI 3700 automated capillary sequencers (Perkin Elmer-Applied Biosystems). The protocol is adaptable to other sequencers.
2. Six hundred microliters of BAC sequencing buffer are mixed with 200 µL SP6 or T7 primer and 200 µL BigDye v3.1 Cycle Sequencing Mix. With an automatic pipettor, 10 µL of the mixture are dispensed into each well of a 96-well ABI sequencing plate. About 1 µg of BAC clone DNA (or 10 µL of recovered BAC DNA solution) is added to each well. The plate is sealed with a rubber lid (Costar, Cambridge, MA) and placed in a thermocycler (e.g., MJ Research PTC-100).
3. Amplification conditions are as follows: initial denaturation at 94°C for 4 min, followed by 99 cycles of 96°C for 30 s, 57°C (56°C) for 10 s, and 60°C for 4 min (see Note 13).
4. The PCR plates are cooled to room temperature and 70 µL of 0.2 mM MgSO4 in 70% ethanol are added. Plates are covered with rubber lids, shaken briefly, and left on a laboratory bench for 15 min.
5. The plates are centrifuged at 2290g for 15 min. The lids are removed; the plates are inverted and centrifuged for 1 min at 170g.
6. Seventy microliters of 70% EtOH are added to each well with an automatic pipette, and the centrifugation steps in Subheading 3.3., item 5 are repeated. After the last spin, the plate is dried in a Speed-Vac device for 10 min. The plate is covered with a rubber lid (Costar) and stored at +4°C until it is ready to be loaded on the DNA sequencing apparatus.
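For planning instrument time, the hold times of the cycle-sequencing program in item 3 and the reaction mix in item 2 can be tallied as follows. This is a rough sketch: it sums programmed hold times only and ignores temperature ramping, which depends on the thermocycler.

```python
from typing import Sequence


def program_minutes(initial_s: float, cycles: int, step_seconds: Sequence[float],
                    final_s: float = 0.0) -> float:
    """Total programmed hold time in minutes (ramp times are not included)."""
    return (initial_s + cycles * sum(step_seconds) + final_s) / 60.0


# Item 3: 94 C for 4 min, then 99 cycles of 30 s, 10 s, and 4 min.
sequencing_min = program_minutes(4 * 60, 99, [30, 10, 4 * 60])

# Item 2: 600 uL buffer + 200 uL primer + 200 uL BigDye mix, dispensed at 10 uL per well.
mix_uL = 600 + 200 + 200
wells_covered = mix_uL // 10

print(f"{sequencing_min:.0f} min (~{sequencing_min / 60:.1f} h), mix covers {wells_covered} wells")
```

The program therefore runs for well over seven hours, which in practice usually means an overnight run, and the mix volume leaves a small excess over the 96 wells of a plate.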

3.4. Comparison of BAC-End Sequences with Reference Genomes for High-Resolution Mapping
1. High-quality BESs are formatted into multisequence FASTA files and repetitive elements are masked using RepeatMasker software (see Note 14).
2. BLASTn comparison with the reference genomes is performed using default NCBI BLASTn parameters and an expectation value (E-value) < 0.00001. This results in 95% high-confidence hits if the hit size is greater than 100 bp. With this approach, 30% of cattle BESs had significant hits in the human genome (1). However, in more diverged genomes (e.g., mouse) these parameters will reveal meaningful hits for significantly fewer BESs (1). More sensitive search parameters will increase the number of BESs with hits in such genomes (2) (see Note 15).
3. BESs with a single significant hit in the human and mouse genomes are then selected for comparative mapping. If whole-genome mapping is planned, BESs are selected to “cover” the human genome with equal spacing to optimize RH mapping and subsequent comparative analysis. Comparative maps of different resolution can be obtained depending on the resolution of the RH panel used. For example, if

the RH panel is produced by irradiation of donor cells with 5000 Rad, markers spaced an average of 1 Mbp apart can be ordered on the comparative map.
4. The BESs selected for RH mapping should have significant but not very high identity with sequences in the reference genome(s) when analyzed by BLASTn. Oligonucleotide primers from BESs with high identities may amplify nontarget genome sequences, complicating interpretation of genotyping results. To avoid this problem, BESs should be selected with hit E-values in the range of e−10 to e−50 (2).
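The selection rules in this subsection (a single significant hit in each reference genome, with an E-value that is significant but not extremely low) can be expressed as a small filter. The sketch below is illustrative only: it assumes that hits have already been parsed into (BES id, E-value) pairs per genome, that the E-value window of e−10 to e−50 is applied in every reference genome, and that the significance cutoff is 1e-5. The function and variable names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SIGNIFICANT = 1e-5           # significance cutoff used in step 2
E_MIN, E_MAX = 1e-50, 1e-10  # window from step 4 (assumed to apply to every genome)


def select_bes(hits_by_genome: Dict[str, List[Tuple[str, float]]]) -> List[str]:
    """Return BES ids with exactly one significant hit in every genome,
    where that hit's E-value lies inside the [E_MIN, E_MAX] window."""
    passing_sets = []
    for hits in hits_by_genome.values():
        significant = defaultdict(list)
        for bes_id, evalue in hits:
            if evalue < SIGNIFICANT:
                significant[bes_id].append(evalue)
        passing_sets.append({
            bes_id for bes_id, evalues in significant.items()
            if len(evalues) == 1 and E_MIN <= evalues[0] <= E_MAX
        })
    if not passing_sets:
        return []
    return sorted(set.intersection(*passing_sets))


# Hypothetical parsed BLASTn results.
hits_by_genome = {
    "human": [("b1", 1e-20), ("b2", 1e-30), ("b2", 1e-12), ("b3", 1e-80),
              ("b4", 1e-15), ("b5", 1e-3)],
    "mouse": [("b1", 1e-12), ("b2", 1e-25), ("b3", 1e-40), ("b4", 1e-7),
              ("b5", 1e-20)],
}
chosen = select_bes(hits_by_genome)
print(chosen)
```

In this example b2 is rejected for having two human hits, b3 for being too similar to the human sequence, b4 for a weak mouse hit, and b5 for lacking a significant human hit.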

3.5. Design of PCR Primers
1. PCR primer pairs are designed from selected BESs, avoiding repetitive sequence regions. Primers can be designed with any appropriate program using the following guidelines: (i) primer length of 18–25 bp, (ii) product size of 100–600 bp, (iii) melting temperature (Tm) of 60 ± 5°C, and (iv) Tm difference between forward and reverse primers of <3°C. If no primer pairs are found that fall within these guidelines, the Tm can be lowered to 55 ± 5°C. If primer pairs meeting the criteria still cannot be found, the best solution is to select another sequence from the same region, because a lower Tm may allow amplification of orthologous DNA sequences from the nontarget genome. PCR products greater than 600 bp may require more time for electrophoretic separation, whereas products smaller than 100 bp may be masked by possible PCR primer dimers in the gel.
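The guidelines above lend themselves to an automated check of candidate primer pairs. The sketch below is a hypothetical helper, not part of any primer-design package: it takes the Tm values as inputs (as reported by whichever design program is used) rather than computing them, and the example sequences are placeholders.

```python
from typing import List


def check_primer_pair(fwd: str, rev: str, tm_fwd: float, tm_rev: float,
                      product_bp: int, tm_target: float = 60.0) -> List[str]:
    """Return a list of guideline violations; an empty list means the pair passes."""
    problems = []
    for name, seq in (("forward", fwd), ("reverse", rev)):
        if not 18 <= len(seq) <= 25:
            problems.append(f"{name} primer length {len(seq)} outside 18-25")
    if not 100 <= product_bp <= 600:
        problems.append(f"product size {product_bp} bp outside 100-600")
    for name, tm in (("forward", tm_fwd), ("reverse", tm_rev)):
        if abs(tm - tm_target) > 5.0:
            problems.append(f"{name} Tm {tm} outside {tm_target} +/- 5")
    if abs(tm_fwd - tm_rev) >= 3.0:
        problems.append("Tm difference between primers is 3 C or more")
    return problems


# Placeholder sequences and Tm values for illustration only.
ok = check_primer_pair("ACGTACGTACGTACGTACGT", "TGCATGCATGCATGCATGCA", 59.5, 61.0, 250)
too_long = check_primer_pair("ACGTACGTACGTACGTACGT", "TGCATGCATGCATGCATGCA", 59.5, 61.0, 700)
relaxed = check_primer_pair("ACGTACGTACGTACGTACGT", "TGCATGCATGCATGCATGCA",
                            52.0, 53.5, 250, tm_target=55.0)
print(ok, too_long, relaxed)
```

Passing `tm_target=55.0` corresponds to the relaxed criterion described for cases in which no pair meets the 60 ± 5°C guideline.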

3.6. Optimization of PCR Conditions
1. This protocol is optimized for 96-well MJ Research PTC-100 thermocyclers and AmpliTaq Gold DNA polymerase (Applied Biosystems). Primers are tested at annealing temperatures (Ta) of 56°C and 60°C on two separate PCR machines at the same time. Five PCR reactions are done for each primer pair: two positive controls (25–50 ng genomic DNA from the donor species), two negative controls (25–50 ng DNA from the recipient cell line), and one no-template control. Negative and no-template controls are necessary to check for possible contamination with other sources of DNA. Sixteen 115-µL aliquots of the master PCR mix (see Subheading 2.5., item 10) are transferred into separate 1.5-mL Eppendorf tubes, and 2.5 µL of a 20 µM solution of the forward and reverse primers of a given pair are added to each tube. Twenty microliters of the final PCR mixture are transferred into each of five wells in a 96-well plate, and 2 µL template DNA are added to four wells, whereas the fifth well receives 2 µL water. A full 96-well plate can be used for screening 16 primer pairs at a time.
2. Plates are covered with rubber lids (Costar) and placed into the thermocycler (e.g., MJ Research PTC-100).
3. Amplification conditions are as follows: initial denaturation at 95°C for 9 min, followed by 35 cycles of 94°C for 40 s, 56°C (60°C) for 40 s, and 72°C for 30 s, with a final extension at 72°C for 3 min (see Note 16).
4. PCR plates are cooled to room temperature and 4 µL loading buffer is added to each well to prepare samples for gel electrophoresis.
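A plate map helps keep the five reactions per primer pair straight when 16 pairs are screened at once. The sketch below assumes a simple row-by-row fill of consecutive wells, which is a hypothetical layout; the protocol does not prescribe where on the plate each reaction goes. Primer-pair names are placeholders.

```python
from typing import Dict, List, Tuple

ROLES = ["donor (+)", "donor (+)", "recipient (-)", "recipient (-)", "no template"]


def optimization_layout(primer_pairs: List[str]) -> Dict[str, Tuple[str, str]]:
    """Assign five consecutive wells (filled row by row) to each primer pair."""
    if len(primer_pairs) > 16:
        raise ValueError("the protocol screens at most 16 primer pairs per plate")
    rows = "ABCDEFGH"
    layout = {}
    for i, pair in enumerate(primer_pairs):
        for j, role in enumerate(ROLES):
            index = i * len(ROLES) + j                      # 0..79
            well = f"{rows[index // 12]}{index % 12 + 1}"   # 12 wells per row
            layout[well] = (pair, role)
    return layout


pairs = [f"P{n:02d}" for n in range(1, 17)]  # placeholder primer-pair names
layout = optimization_layout(pairs)
mix_needed_uL = len(pairs) * 115             # sixteen 115-uL aliquots of master mix

print(len(layout), "wells used;", mix_needed_uL, "uL master mix needed")
```

Sixteen pairs occupy 80 of the 96 wells, and the 1840 µL of master mix required fits within the 1920 µL prepared in Subheading 2.5., item 10.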

3.7. Separation in 2% Agarose Gel
1. The following protocol assumes the use of an Owl Scientific A3.1 gel system. The protocol is easily adaptable to other gel formats.
2. Separation of test PCR reactions is performed in a 2% agarose gel.
3. To prepare the gel, 8 g of agarose (Shelton Scientific, Peosta, IA) are added to 400 mL 1X TBE buffer in a 1-L flask. The mixture is heated in a microwave for 10 min or until the agarose completely dissolves. Three microliters of EtBr (see Note 9) are added, and the melted agarose is cooled to 55–60°C and poured into a gel casting tray with 25-tooth, 1-mm thick combs (Midwest Scientific, St Louis, MO). The gel should solidify in about 30 min.
4. 1X TBE running buffer is prepared by mixing 100 mL 10X TBE with 900 mL water in a 1-L measuring cylinder. The cylinder is covered with Parafilm “M” (American National Can, Chicago, IL) and inverted several times, and the running buffer is poured into the gel chamber.
5. The casting tray with the gel is placed in the gel box, the combs are removed, and additional 1X TBE running buffer is added (if necessary) to cover the surface of the gel.
6. Five microliters of 1 Kbp molecular weight marker are added to the first well of each row of the gel and 14 µL of PCR-amplified samples containing 20% (v/v) loading buffer are loaded into the remaining wells. PCR samples amplified using the same primer pair and tested with different annealing temperatures are placed next to each other to facilitate comparison.
7. The electrophoresis cell is covered with the lid and the gel is electrophoresed at 375 V for 10–15 min using a Bio Electrophoresis FB570 power supply (Fisher Scientific, Pittsburgh, PA) until the bromophenol blue dye has run 2 cm toward the anode. The power is switched off and the gel tray is removed from the chamber. The gel is visualized under UV light (AlphaImager 2000, Alpha InnoTech Corporation, San Leandro, CA) and photographed.
8. The PCR products are checked for each primer pair and results from different annealing temperatures are compared. An ideal primer pair will produce a PCR product with genomic DNA of the donor cell line, no PCR product with DNA from the recipient cell line, and no PCR product(s) of unexpected size (shadow bands) with DNA from either donor or recipient.

3.8. Preparation of Radiation Hybrid Clones for Genotyping
1. For an RH panel consisting of 90 hybrid clones stored in individual tubes at a concentration of 10 ng/µL, 1 mL of DNA from each clone is transferred into one of 90 wells of a 96 deep-well microplate (Innovative Microplate, Chicopee, MA). Two positive control DNAs (donor genomic DNA), two negative control DNAs (recipient DNA), and a no-template control (water) are added to the remaining wells. The “master” plate is covered with a plastic lid and stored at +4°C until use.
2. Five microliters of DNA from each clone are transferred from the “master” plate into a regular 96-well plate (MJ Research), using a 96-pin dispenser (Hydra; Robbins Scientific, Sunnyvale, CA). Plates are allowed to dry overnight at room temperature. Plates are sealed in plastic bags and stored at +4°C until use.

3.9. Genotyping on the RH Panel
1. For each primer pair, genotyping is done in duplicate in 96-well plates containing dried DNA from the RH clones (see Subheading 3.8.).
2. Fifteen microliters of the reaction mixture (see Subheading 2.5., item 11) are transferred into each well of the genotyping plate using a repeater pipette (Eppendorf).
3. Plates are covered with plastic lids, labeled according to the marker (e.g., BES) name, and placed in the thermocycler with the lid closed (see Subheading 3.6. for thermocycler parameters).
4. After the PCR is completed, plates are removed from the thermocycler and stored at +4°C until the electrophoretic analysis.
5. Four microliters of loading buffer are added to each well of the genotyping plate with a repeater pipette.
6. A 2% agarose gel is prepared and run as described under Subheading 3.7.
7. Gels are photographed and scored using an imaging system as described above. Scoring can be performed directly on the imaging system or from the photographed image using the scoring option of the imaging software (e.g., AlphaEase V.3.3).

3.10. Genotyping Data Analysis and Quality Control
1. Cell lines that present strong positive bands in either one of the two gels for a particular primer pair are scored as positive or “1”, which is the format accepted by most RH map construction software. Cell lines that do not show a PCR product with mobility equivalent to that shown by the donor genomic DNA are scored as negative or “0”. Finally, cell lines that present weak bands with mobility similar to that of the PCR product from the donor genomic DNA in at least one gel are scored as “2”. If there are too many weak reactions in the resulting scoring vector, genotyping should be repeated for that marker, adjusting the annealing temperature and/or selecting new PCR primer pairs.
2. Genotyped BESs are combined in linkage groups on the basis of a two-point linkage analysis by using a logarithm of odds (LOD) score threshold of 8.0 in RHMAPPER 1.22 software (9).
3. To perform quality control on the RH vectors, which is critical for producing an RH map with a reliable order of markers, the RH vectors within each linkage group are ordered by their orthologous position in the reference genome used for selection of markers for RH mapping.
4. Markers without identified orthologous positions in the reference genome(s) are placed between ordered markers on the basis of the two-point linkage analysis results.
5. RH vectors are then compared with vectors from adjacent markers. Every position of the vector that is not consistent with the scoring of the same cell line in two adjacent vectors is “flagged”.
6. All flagged reaction scores are checked for accuracy based on the original gel pictures.
7. Markers showing more than three ambiguous scores (“2”) after the quality control, or more than two inconsistent positions between adjacent vectors, are excluded from the initial build of the RH map.
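The scoring and quality-control rules in items 1, 5, and 7 can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' software: band calls per gel are assumed to be recorded as 'strong', 'weak', or 'none'; a position is flagged when the two adjacent vectors agree with each other and the marker differs from them; and positions involving an ambiguous score are not flagged. All names are hypothetical.

```python
from typing import List


def combine_duplicates(gel_a: List[str], gel_b: List[str]) -> List[int]:
    """Combine duplicate gels: strong band in either gel -> 1,
    otherwise weak band in at least one gel -> 2, otherwise 0."""
    scores = []
    for a, b in zip(gel_a, gel_b):
        if "strong" in (a, b):
            scores.append(1)
        elif "weak" in (a, b):
            scores.append(2)
        else:
            scores.append(0)
    return scores


def flag_inconsistent(prev_vec: List[int], vec: List[int], next_vec: List[int]) -> List[int]:
    """Indices where both neighbors agree and the marker disagrees
    (positions with an ambiguous score of 2 are skipped; an assumption)."""
    flagged = []
    for i, (p, v, n) in enumerate(zip(prev_vec, vec, next_vec)):
        if 2 in (p, v, n):
            continue
        if p == n and v != p:
            flagged.append(i)
    return flagged


def keep_marker(vec: List[int], flagged: List[int]) -> bool:
    """Item 7: exclude markers with more than three '2' scores or more than
    two inconsistent positions."""
    return vec.count(2) <= 3 and len(flagged) <= 2


# Hypothetical duplicate gels for one marker typed on six hybrids.
gel_a = ["strong", "none", "weak", "none", "strong", "none"]
gel_b = ["strong", "none", "none", "none", "weak", "none"]
vector = combine_duplicates(gel_a, gel_b)
flags = flag_inconsistent([1, 0, 0, 1, 1, 0], vector, [1, 0, 0, 1, 1, 0])
print(vector, flags, keep_marker(vector, flags))
```

Flagged positions are candidates for rechecking against the original gel pictures, as described in item 6, rather than automatic corrections.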


3.11. Construction and Visualization of Comparative RH Maps
1. This procedure assumes the use of the software packages RH_TSP_MAP version 2.0 (10) and CONCORDE (11).
2. Five independent maps are built for each linkage group in RH_TSP_MAP and CONCORDE. Two maps are based on the minimum number of obligate chromosome breaks and three maps are variants of the maximum likelihood estimate approach.
3. The consensus order of markers within each linkage group is established on the basis of the position frequencies of each marker in the five maps. The confidence of marker placement in the consensus map is then calculated. Markers with positions consistent across all five maps are placed with 100% confidence; those consistent in four maps with 80%, in three with 60%, and in two with 40% confidence. If the consensus map order agrees with only one of the five maps, the marker is placed with 20% confidence.
4. Markers with inconsistent positions in more than three of the five maps may represent problematic vectors and should be carefully checked for scoring mistakes using the original genotyping gel images.
5. When the order of markers is established within each linkage group with the desired level of confidence, and vectors are corrected for mistakes, vectors from different linkage groups of the same chromosome are combined. The consensus order of markers is then determined and compared with the consensus order of markers in the original linkage groups. If the orders are consistent, the combined map can be accepted as the final order; otherwise, the map should be broken into the original linkage groups.
6. Final maps or linkage groups are visualized using the free MapChart software (12).
7. A spreadsheet (e.g., in Excel) is created to manage the RH and comparative mapping data. The spreadsheet contains information on marker name, chromosome location in the RH map, order in the five original RH maps, consensus order, and coordinates in the reference genome(s) of BLASTn hit(s).
8. Homologous synteny blocks (HSBs) are defined between two genomes using the following set of rules (6): (i) an HSB is defined as two or more markers on the same chromosome in the reference and mapped genomes not interrupted by an HSB from a different region of the same chromosome or a different chromosome; (ii) an inversion is defined by a minimum of three consecutive markers in the mapped genome as in the reference genome, with adjacent markers separated by a span of >1 Mbp in the reference genome; (iii) a comparative singleton is a single marker placed in the mapped genome, but out-of-place with respect to its expected position on the basis of the reference genome comparative mapping information; and (iv) to minimize disruption of HSBs, a comparative singleton can “jump” to its expected position in an HSB on the same chromosome, provided that the distance for the “jump” is <2 Mbp in the reference genome.
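The confidence calculation in step 3 amounts to counting how many of the five maps agree with the consensus position of each marker. The sketch below is a hypothetical illustration: the function names are invented, and comparing absolute positions assumes that all maps have already been put in the same orientation.

```python
from typing import Dict, List


def placement_confidence(consensus: List[str],
                         maps: List[List[str]]) -> Dict[str, int]:
    """Percentage of maps that place each marker at its consensus position."""
    confidence = {}
    for pos, marker in enumerate(consensus):
        agree = sum(1 for m in maps if m.index(marker) == pos)
        confidence[marker] = 100 * agree // len(maps)
    return confidence


def needs_recheck(confidence: Dict[str, int], threshold: int = 20) -> List[str]:
    """Markers inconsistent in more than three of five maps (step 4)."""
    return [m for m, c in confidence.items() if c <= threshold]


# Five hypothetical marker orders for one linkage group
five_maps = [
    ["A", "B", "C", "D"],
    ["A", "B", "C", "D"],
    ["A", "B", "D", "C"],
    ["A", "B", "D", "C"],
    ["A", "B", "C", "D"],
]
conf = placement_confidence(["A", "B", "C", "D"], five_maps)
```

With five maps the possible values are 0, 20, 40, 60, 80, and 100%, matching the scale in step 3; markers at or below 20% are the ones whose gel images should be re-examined.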

4. Notes
1. Unless stated otherwise, all solutions should be prepared in water that has a resistivity of 1.2 MΩ-cm and a total organic content of less than 5 parts per billion. This standard is referred to as “water” in the text.
2. The required antibiotic may differ from the one listed here, depending on the actual vector used for BAC library construction. This information is normally provided by the distributor of the BAC library.
3. The use of both T7 and SP6 primers allows for sequencing of both the 5′ and 3′ ends of a genomic fragment inserted into the pTARBAC 1.3 vector. Different oligonucleotides may be required for other BAC vectors.
4. It is critical to use high-quality sequences for the comparative analysis. A Phred score of at least 20 (Q ≥ 20) should be used as a quality threshold. Sequences of lower quality may reduce the overall success of the mapping process owing to the generation of primers containing wrong bases.
5. For primer design, the program should be set up to select primers with a 45–55% GC content and no hairpins.
6. Commercial vendors provide different amounts of oligonucleotides at the same scale of synthesis. Some initial testing might be required to find the best combination of yield and price per nucleotide.
7. Taq polymerase is sensitive to temperature changes and should always be stored at −20°C. During reaction preparation, Taq polymerase should be kept on ice and returned to the freezer immediately after use.
8. Boric acid is difficult to dissolve in water. Heating the solution may be required for better results.
9. EtBr can cause cancer. Gloves should always be worn when working with this chemical.
10. Shaking speed should be empirically adjusted for different shaker models and BAC libraries.
11. Thorough resuspension of cells is critical for successful lysis. Set the plate shaker at the highest speed possible without causing splitting, splashing, or sliding of the culture block off the shaker. Allow sufficient shaking time (3–5 min) to completely resuspend the cells in solution. No pellets should be visible at the bottom of the wells.
12. Problems with filtration may occur, indicating that the density of cells in culture is too high. Lower the cell density by using either a shorter culture time or a lower shaker speed during culturing.
13. Optimal Ta is different for T7 and SP6 primers. We recommend using Ta = 57°C for T7 and Ta = 56°C for SP6 primer.
14. The appropriate sequence database should be used for the repeat masking. For example, for the masking of cattle repeats, the “-species cow” option should be used in RepeatMasker. If the species of interest is not available, a species from the same order or family or the “-mammal” option should be selected.
15. Comparison with two relatively divergent reference genomes and selection of BESs with significant orthologous matches in both of them will help in avoiding BESs with non-orthologous hits in the reference genome. For example, if cattle BESs are compared with the human genome database as a reference, they should also be compared with the mouse genome, and the preference for mapping should be given to BESs with orthologous hits in both the human and mouse genomes.
16. The PCR plates can be left in the thermocycler overnight at +4°C.
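The thresholds in Notes 4 and 5 are easy to check programmatically. The following Python sketch is an illustration only: the function names are hypothetical, the per-base application of the Q ≥ 20 threshold is an assumption (the note does not say whether the threshold applies per base or on average), and hairpin screening is not attempted. The relationship between a Phred score Q and the base-call error probability, P = 10^(−Q/10), is the standard definition.

```python
from typing import Sequence


def phred_error_probability(q: float) -> float:
    """Base-call error probability implied by a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def all_bases_pass(quals: Sequence[int], threshold: int = 20) -> bool:
    """True if every base in the region meets the quality threshold.

    Assumption: the threshold is applied per base across the primer region.
    """
    return all(q >= threshold for q in quals)


def gc_percent(seq: str) -> float:
    """GC content of a primer as a percentage of its length."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def primer_gc_ok(seq: str, low: float = 45.0, high: float = 55.0) -> bool:
    """Check the 45-55% GC window recommended in Note 5."""
    return low <= gc_percent(seq) <= high
```

A score of Q = 20 corresponds to an expected error rate of 1 in 100 bases, which is why lower-quality reads risk yielding primers with wrong bases.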


Acknowledgments
The authors would like to thank Professor William J. Murphy for editorial comments and help. The use of the Montage BAC96 Miniprep Kit User Guide for BAC DNA extraction is provided courtesy of Millipore Corporation.

References
1. Larkin, D. M., Everts-van der Wind, A., Rebeiz, M., et al. (2003) A cattle–human comparative map built with cattle BAC-ends and human genome sequence. Genome Res. 13, 1966–1972.
2. Everts-van der Wind, A., Larkin, D. M., Green, C. A., et al. (2005) A high-resolution whole-genome cattle–human comparative map reveals details of mammalian chromosome evolution. Proc. Natl Acad. Sci. USA 102, 18,526–18,531.
3. Shizuya, H., Birren, B., Kim, U. J., et al. (1992) Cloning and stable maintenance of 300-kilobase-pair fragments of human DNA in Escherichia coli using an F-factor-based vector. Proc. Natl Acad. Sci. USA 89, 8794–8797.
4. Grisart, B., Coppieters, W., Farnir, F., et al. (2002) Positional candidate cloning of a QTL in dairy cattle: identification of a missense mutation in the bovine DGAT1 gene with major effect on milk yield and composition. Genome Res. 12, 222–231.
5. Cohen-Zinder, M., Seroussi, E., Larkin, D. M., et al. (2005) Identification of a missense mutation in the bovine ABCG2 gene with a major effect on the QTL on chromosome 6 affecting milk yield and composition in Holstein cattle. Genome Res. 15, 936–944.
6. Murphy, W. J., Larkin, D. M., Everts-van der Wind, A., et al. (2005) Dynamics of mammalian chromosome evolution inferred from multispecies comparative maps. Science 309, 613–617.
7. Gibbs, R. A., Weinstock, G. M., Metzker, M. L., et al. (2004) Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428, 493–521.
8. Altschul, S. F., Madden, T. L., Schäffer, A. A., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402.
9. Slonim, D., Kruglyak, L., Stein, L., and Lander, E. (1997) Building human genome maps with radiation hybrids. J. Comput. Biol. 4, 487–504.
10. Agarwala, R., Applegate, D. L., Maglott, D., Schuler, G. D., and Schäffer, A. A. (2000) A fast and scalable radiation hybrid map construction and integration strategy. Genome Res. 10, 350–364.
11. Ben-Dor, A. and Chor, B. (1997) On constructing radiation hybrid maps. J. Comput. Biol. 4, 517–533.
12. Voorrips, R. E. (2002) MapChart: software for the graphical presentation of linkage maps and QTLs. J. Hered. 93, 77–78.

7
Amniote Phylogenomics: Testing Evolutionary Hypotheses with BAC Library Scanning and Targeted Clone Analysis of Large-Scale DNA Sequences from Reptiles
Andrew M. Shedlock, Daniel E. Janes, and Scott V. Edwards

Summary
Phylogenomics research integrating established principles of systematic biology and taking advantage of the wealth of DNA sequences being generated by genome science holds promise for answering long-standing evolutionary questions with orders of magnitude more primary data than in the past. Although it is unrealistic to expect whole-genome initiatives to proceed rapidly for commercially unimportant species such as reptiles, practical approaches utilizing genomic libraries of large-insert clones pave the way for a phylogenomics of species that are nevertheless essential for testing evolutionary hypotheses within a phylogenetic framework. This chapter reviews the case for adopting genome-enabled approaches to evolutionary studies and outlines a program for using bacterial artificial chromosome (BAC) libraries or plasmid libraries as a basis for completing “genome scans” of reptiles. We have used BACs to close a critical gap in the genome database for Reptilia, the sister group of mammals, and present the methodological approaches taken to achieve this as a guideline for designing similar comparative studies. In addition, we provide a detailed step-by-step protocol for BAC-library screening and shotgun sequencing of specific clones containing target genes of evolutionary interest. Taken together, the genome scanning and shotgun sequencing techniques offer complementary diagnostic potential and can substantially increase the scale and power of analyses aimed at testing evolutionary hypotheses for nonmodel species.

Key Words: Amniote; Reptilia; BAC library; genome scan; genomic signature; retroelement; simple sequence repeat; sex-linked marker; shotgun cloning; EE0.6.

From: Methods in Molecular Biology: Phylogenomics Edited by: W. J. Murphy © Humana Press Inc., Totowa, NJ


1. Introduction

1.1. Genome-Enabled Phylogenetics and BAC Libraries
Completion of the human and other genome projects constitutes a leap in the scale on which the genome can be organized and studied. However, the technology that makes producing whole-genome assemblies feasible will likely be more difficult to transfer to other disciplines in biology, particularly the evolutionary sciences, than more accessible DNA diagnostics, most notably the polymerase chain reaction, have been. Nevertheless, comparative approaches and forays into nonmodel species are already being made by evolutionary biologists in an effort to integrate genome science with established principles of systematics, forging the exciting new area of interdisciplinary research aptly coined phylogenomics. Some would argue that merely having genomic resources for single representatives of, say, fish, mammals, and birds is sufficient to broaden genomics to include evolutionary biology. Furthermore, the cost of genome projects, at least in the short term, is still high enough to prohibit easy transfer of even basic genome resources to the community of evolutionary biologists. The rise of national Genome Centers and the increasing interest of such centers in tackling questions in nonmodel species are also strong evidence that the biology of the 21st century will be much more team-, organization-, and resource-driven than was biology in earlier decades. Indeed, given existing technologies, it is clear that systems-level biology even on single species requires not only advanced technologies and robotics not readily available to single-PI research programs, but also large numbers of staff and infrastructure in coordination activities. We believe that, despite the logistical challenges to scaling up the size of comparative DNA sequence studies by orders of magnitude, evolutionary biology needs to embrace genomic technology.
In particular, the use of genomic libraries such as those composed of bacterial artificial chromosome (BAC) clones is one of several excellent ways to do this. BAC libraries are large-insert genomic libraries that are currently the optimal starting point for large-scale analysis of genome evolution in eukaryotes (1). Typically, BAC clones can faithfully propagate fragments of DNA averaging ~150 kb in length. They have a number of advantages over earlier types of genomic libraries, such as lambda and cosmid libraries, which can accommodate only smaller insert sizes. Although yeast artificial chromosome (YAC) vectors can hold much larger inserts, they are much less stable than BACs: BACs are much less susceptible to inter- and intraclone recombination owing to their low copy number (1–2) per E. coli cell and the presence of genes ensuring faithful propagation and passage to daughter cells during cell division (2). For these reasons and their ease of manipulation, growth, and isolation of clones, BAC libraries form the current backbone
of many efforts to sequence complete genomes, including the human genome. Furthermore, through the use of homologous recombination to modify BACs with reporter constructs and their use as substrates for transgenesis (both within species and between species), many new directions in functional biology are opened through BAC libraries (3–6). The details of BAC library construction are extensive and go well beyond the scope of the present chapter (7,8). We therefore do not include protocols for BAC library production here but rather focus on the use of such libraries for conducting comparative studies. BAC libraries are typically produced by and can be accessed in collaboration with laboratories equipped with semiautomated accessories and robotics optimized for producing and managing BAC resources, exemplified by Production Centers affiliated with the U.S. National Human Genome Research Institute’s BAC Resource Network (NHGRI; http://www.genome.gov). In addition to the BAC libraries themselves, one of the most important advancements in eukaryotic genome science is the adoption of standardized methods of archiving of libraries and clones in microtiter plates, with one clone per well, thereby ensuring their long-term survival and utility to the scientific community (9). Such protocols represent a vast improvement over earlier bulk storage methods, which frequently result in significant loss of clones in the long term. In particular, we have benefited from collaboration with and protocols developed by Amemiya and colleagues (7) and from protocols and applications reviewed extensively by Zhao (1,10). The availability of BAC libraries from an expanding diversity of nonmodel species provides an ideal resource to increase access to genome-scale science by comparative biology.
Although it is impractical to expect broad comparative studies to proceed rapidly for nonmodel organisms on a whole-genome basis, it is attractive and far less expensive to estimate the structure of poorly known genomes by mining BAC libraries from diverse taxa essential for testing numerous hypotheses within a phylogenetic context. First, BAC libraries are in principle applicable to any organism with sufficient high-quality genomic DNA. Thus, the diversity of nonmodel species, including those that cannot be maintained in captivity, are all potential targets for BAC libraries. Second, in conjunction with ancillary methods of shotgun cloning and sequencing, BAC libraries provide access to an order-of-magnitude increase in the amount of DNA sequence data that can be brought to bear on problems of systematics and molecular evolution. Shotgun sequencing approaches are still relatively uncommon in nonmodel vertebrates (11,12). Despite recent substantial increases in the amount of sequence data devoted to questions of systematics in a variety of clades (13–16), BAC libraries offer yet larger increases that could resolve trees even further (17). More importantly, BAC-enabled studies can provide a genuinely genomic window into molecular evolutionary processes, revealing
on a large scale the vast array of non-nucleotide types of molecular variation, such as indels, duplications, and rearrangements, that hold considerable promise for molecular systematics but have thus far been underexploited (18–22). Other key trends in evolutionary biology for which BACs are relevant include the positional cloning of candidate genes for phenotypic traits, enhancement of QTL mapping efforts, population genetics, chromosome evolution, the role of gene regulation in evolution, and the evolution of multigene families. As we discuss below, these research directions in the Reptilia lag far behind similar studies in mammals, partly because of the lack of suitable molecular reagents with which to tackle these problems. With the large database on developmental genetics, immunogenetics, and genome and mapping efforts provided by the chicken, these fields are now poised to be extended into a comparative framework as the relevant tools for accessing genes and gene families, especially BAC libraries, continue to become available. It is also important to note that several of the genome-scanning approaches we suggest can be implemented with genome libraries other than BAC libraries. For example, simple plasmid libraries with 2–7 kb inserts can provide a rich resource for examining many of the same issues as can be analyzed using BAC libraries. However, plasmid library analysis will incur some important limitations compared to BACs, such as the inability to perform downstream hybridizations of selected BACs to chromosomes using FISH, or to delimit large chromosomal regions with simple paired-end sequences of clones using comparative bioinformatics tools. Thus, although plasmid libraries will in principle yield the same type of multimegabase sequence data sets from end-reads as will BACs, the uses to which these can be put are more limited than with BACs.

1.2. Reptile Phylogenomics and the Amniote Ancestor
Reptiles are a critical group of vertebrates for understanding the evolutionary dynamics of amniote genome evolution (Fig. 1). Reptiles are far more taxonomically diverse than mammals (~17,000 vs. ~4,500 species) and arguably more developmentally and chromosomally diverse. They also exhibit a diversity of environmental and genetic sex-determining mechanisms relative to mammals (23). Comparing data from the chicken genome to that of mammals in the absence of relevant outgroup information from nonavian reptiles has limited our ability to accurately infer the genomic condition of our common amniote ancestor. Recent attempts to close this large gap in the comparative genomics literature have tested alternative models of amniote genome evolution summarized by Waltari and Edwards (24). Results from phylogenomic analysis of multimegabase large-insert clone sequence from exemplar nonavian species indicate that the ancestral amniote likely had a relatively large genome, with a diverse repetitive landscape and GC content similar to that observed for many mammals, and that this genome underwent a series of sequential size reductions in the lineage leading to birds (25,26). Moreover, integrating paleontological data on bone cell size with that for genome size and interspersed repeat abundance in extant amniote species has revealed that the substantial reduction of noncoding elements leading to a streamlining of the chicken genome likely evolved in theropod dinosaurs ~130 Myr before the origin of avian flight (27). The largely uncharted genomic landscape of squamate reptiles remains an important open avenue for exploring the diversity of amniote genome structure and for developing a predictive theory of genome evolution (28). Table 1 outlines some of the major genomic features for three nonavian reptile focal species as well as chicken and human for comparison. Genomic clone libraries for the American alligator (Alligator mississippiensis), turtle (Chrysemys picta), and anole (Anolis smaragdinus) have been analyzed within a phylogenetic context using methods of investigation outlined in detail below (25).

Fig. 1. Diagram summarizing the conventional view of amniote relationships. Distributions of genome sizes, measured in picograms, for lineages are represented by bars over branch tips. The question mark and dotted arrow indicate a growing body of molecular evidence suggesting turtles are more derived among reptilian clades than their traditional basal position illustrated here.
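Genome sizes such as those in Table 1 are reported in picograms of DNA, whereas sequence data sets are measured in base pairs. A minimal Python sketch of the conversion follows; it uses the widely cited factor of roughly 978 Mb per picogram, which is not stated in this chapter and is introduced here as an assumption, as are the variable and function names.

```python
# Approximate conversion: 1 pg of double-stranded DNA is about 978 Mb.
# The factor is an assumption brought in for illustration; it is not
# taken from this chapter.
MB_PER_PG = 978.0

# Genome sizes in picograms as listed in Table 1
genome_size_pg = {
    "Alligator": 2.49,
    "Turtle": 2.57,
    "Chicken": 1.25,
    "Anole": 2.2,
    "Human": 3.5,
}


def pg_to_mb(picograms: float) -> float:
    """Convert a genome size in picograms to megabases."""
    return picograms * MB_PER_PG


genome_size_mb = {species: pg_to_mb(pg) for species, pg in genome_size_pg.items()}
```

On this scale, a ~2.5 Mb sample of end sequence covers on the order of a tenth of a percent of each reptile genome, which helps explain why the genome-scan estimates described later are reported with confidence intervals.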

Table 1
Some General Features of Reptilian and Human Genomes

                          Alligator   Turtle   Chicken   Anole   Human
Genome size (pg)          2.49        2.57     1.25      2.2     3.5
Haploid chromosome no.    16          ~25      39        ~18     23
Microchromosomes?         No          Yes      Yes       Yes     No

Sources: refs. 29,30,81–84.

1.3. Historical Genome Dynamics
Two major goals of our phylogenomic approach have been to synthesize a model of genome evolution in reptilian clades over the past 310 Myr of vertebrate evolution and to infer ancestral conditions in the amniote common ancestor. Because we have gleaned genome statistics from unaligned sequence data, many of the genome characters we seek to understand are continuous variables, as opposed to the discrete characters provided by aligned DNA sequences. For this reason, the approaches we have used are somewhat different from those used to calculate ancestral sequence states for whole mammalian chromosomes (e.g., 31). Our approach integrates estimates of genome size, global base composition, abundance and diversity of repetitive elements, and phylogenetic relationships among species-specific signatures of higher order sequence complexity. In particular, this can be achieved by (1) mapping the phylogenetic distribution of retroelements, which are known to modulate genome size; (2) mapping the phylogenetic distribution of particular simple sequence repeat (SSR) subclasses, which may affect global base composition; and (3) calculating rates of frequency change in DNA words along branches of a phylogeny derived from genomic signatures. Taken together, these three provide a means for sketching the history of genome dynamics among lineages being investigated by genome scanning.

2. Materials
2.1. Manipulating and Screening BAC Libraries
1. Electrocompetent, DH10B T1-resistant Escherichia coli cells (cat. no. 12033015, Invitrogen).
2. LB broth (Miller).
3. Nylon filters.
4. Chloramphenicol.
5. Glycerol.
6. LB agar.
7. Lysis buffer solution: 2X SSC, 5% SDS.
8. Proteinase K buffer solution: 50 mM Tris (pH 8), 50 mM EDTA, 100 mM NaCl, 1% N-lauryl sarcosine.
9. 10 µg/mL proteinase K.
10. 2X SSC.
11. Fabric pen (cat. no. PN1, Cleaner’s Supply).
12. XL-10 Gold competent E. coli cells (cat. no. 200315, Stratagene).
13. (α-32P)-dCTP.
14. Prime-It II Random Primer Labeling Kit (cat. no. 300385, Stratagene).
15. Microspin columns (cat. no. 27-5120-01, GE Healthcare).
16. Sonicated, nonhomologous herring sperm DNA (cat. no. D1815, Promega).
17. Hybridization mesh (cat. no. RPN2519, GE Healthcare).
18. Hybridization buffer solution: 18.75 mL 20X SSPE, 3.75 mL 100X Denhardt’s solution, 3.75 mL 10% (w/v) SDS, and 48.75 mL H2O.
19. 1X washing buffer solution: 935 mL H2O, 50 mL 20X SSC, 5 mL 20% SDS, 10 mL 5% pyrophosphate.
20. Metal cassette (cat. no. S-14, Spectronics).
21. Biomax MR X-ray film (cat. no. 8567232, Kodak).
22. X-omat film processor (cat. no. 1000A, Kodak).
23. Stripping solution 1: 0.2 N NaOH.
24. Stripping solution 2: 0.1 M Tris-HCl (pH 7.5), 0.1X SSC, 0.1% (w/v) SDS.
25. Stripping solution 3: 0.1X SSC, 0.1% (w/v) SDS.
26. Resuspension buffer (cat. no. 19051, Qiagen).
27. RNase A (cat. no. 19101, Qiagen).
28. Lysis buffer (cat. no. 19052, Qiagen).
29. Neutralization buffer (cat. no. 19053, Qiagen).
30. Isopropanol.
31. 70% ethanol.
32. TE buffer.
33. Restriction enzymes (EcoRI and HindIII).
34. Hydro-shear (cat. no. JHSH000000-1, Genomic Solutions).
35. Micrococcus (cat. no. 159972, MP Biomedicals).
36. Qiaquick kit (cat. no. 28106, Qiagen).
37. Blunting solution for 10 samples: 516.67 µL 5X T4 DNA polymerase buffer (cat. no. M0203L, New England Biolabs), 258.33 µL BSA (1 mg/mL; cat. no. B9001S, New England Biolabs), 258.33 µL dNTPs (1 mM; cat. no. 10766020, Roche), 193.75 µL H2O, 64.58 µL T4 DNA polymerase (cat. no. M0203L, New England Biolabs).
38. 0.5 M EDTA.
39. BstXI/EcoRI Adapter (cat. no. N41818, Invitrogen).
40. Linking solution for 10 samples: 38.75 µL 10X T4 DNA ligase buffer (cat. no. M0202L, New England Biolabs), 38.75 µL T4 DNA ligase (cat. no. M0202L, New England Biolabs).
41. Vector (cat. no. A362A, Promega).
42. Ligating solution for 10 samples: 23.3 µL 10X T4 DNA ligase buffer, 17.5 µL T4 DNA ligase.
43. GC5 cells (cat. no. 62-7000-16W, PGC Scientific).
44. SOC medium (cat. no. S1625, Teknova).
45. LB solution: 1.8 L LB broth (Miller), 3 mL chloramphenicol, 0.2 L glycerol.
46. 96-well 800 µL uniplate (cat. no. 7701-1800, Whatman).
47. 96-well 800 µL unifilter (cat. no. 7700-1806, Whatman).
48. Big Dye Terminator v3.1 Cycle Sequencing Kit (cat. no. 4336917, Applied Biosystems).
49. M13 primers.
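Several of the solutions above are specified “for 10 samples”. When a different number of clones is processed, the volumes can be scaled proportionally, as in the following Python sketch. The helper name and the assumption of simple linear scaling (with no extra pipetting allowance beyond what the listed volumes may already include) are ours; the volumes, taken here in microliters, are those listed for the blunting solution.

```python
from typing import Dict

# Blunting solution volumes (microliters) as listed for 10 samples
BLUNTING_FOR_10 = {
    "5X T4 DNA polymerase buffer": 516.67,
    "BSA (1 mg/mL)": 258.33,
    "dNTPs (1 mM)": 258.33,
    "H2O": 193.75,
    "T4 DNA polymerase": 64.58,
}


def scale_recipe(recipe: Dict[str, float], n_samples: int,
                 base_samples: int = 10) -> Dict[str, float]:
    """Scale every component volume linearly to `n_samples` samples."""
    factor = n_samples / base_samples
    return {name: round(volume * factor, 2) for name, volume in recipe.items()}


# Volumes for a batch of 24 samples
blunting_for_24 = scale_recipe(BLUNTING_FOR_10, 24)
```

The same helper applies to the linking and ligating solutions by passing their component volumes in place of `BLUNTING_FOR_10`.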

3. Methods
3.1. Comparative Investigation of Genomic Libraries
Phylogenomic analysis of BAC-libraries can be divided into two general categories of investigation: (1) characterization of genome structure and comparative analysis of multimegabase sequence data based on surveys of paired-end reads of clones sampled randomly from each genome; and (2) targeted studies of genes and genomic neighborhoods based on protocols aimed at characterizing particular loci of interest from a small subset of selected clones. These two approaches to interrogating the library complement each other in terms of their sampling designs, analysis of primary data, and scales of inference about genome evolution. Although genome scanning of BAC-ends provides a statistical estimate of the overall structure of the genomes investigated, it does not typically allow for direct comparisons of particular genes of interest nor does it generate homologous, alignable sequences from specific regions of the genome which can be analyzed by conventional systematic methods. Conversely, although targeted BAC assays will allow for investigation of particular genes of interest, they can provide little inference about the genome-wide distribution of genomic elements or global base composition and historical dynamics of genome evolution among the species being investigated. Taken together the two lines of investigation can provide a synergy of understanding about global patterns of genome dynamics as well as fine-scale evolutionary analysis of particular chromosomal regions in the context of genomic neighborhoods. Each of these two experimental approaches is summarized below with examples from investigating reptilian genomes within a comparative framework.


3.2. Genome Scanning of Nonavian Reptiles The following sections outline basic steps taken in completing a BAC-enabled phylogenomic analysis of major lineages of Reptilia. The experimental strategy exploits the relative ease with which a BAC library can be used to generate a multimegabase dataset of nonoverlapping nucleotide sequences sampled randomly from throughout a genome to complete a so-called “genome scan.” Although we have applied this approach to investigating the phylogenomics of reptiles and the structure of the ancestral amniote genome, the methods outlined here serve as a guideline for similar investigations that may test evolutionary hypotheses regarding other nonmodel species using large-scale comparative sequence analysis. Completing a phylogenomic study based on genome scans of BAC libraries relies heavily on existing genomic resources, informatics tools, access to a diversity of online genomic databases, and an integration of computational methods aimed at testing evolutionary hypotheses within a phylogenetic framework. A flowchart summarizing the basic steps and integration of methods we have employed to investigate amniote genome evolution is illustrated in Fig. 2 (see Note 1). 1. Selection of vector primers. If you are creating your own genomic library, design the vector sequence of each individual clone to contain both forward and reverse priming sites near the point of fragment insertion. If you are using an existing library, determine the sequence priming sites of vectors used to build the library. Standard vector sequencing primers include commercially available forward and reverse oligos such as M13 and T7 which can be used to sequence ~500–1000 bp of original genomic sequence along each end of the clone insert. 2. Sequence nonoverlapping clone inserts. Use forward and reverse vector primers to gather paired end-sequences from randomly selected clones in each library. 
Set up sequencing PCR reactions in 96-well format with ABI Big Dye cycle sequencing kits and gather primary data using a high-throughput capillary array DNA analyzer such as the model ABI 3730 (see Note 2). Compile paired end-reads for thousands of clones isolated from libraries of target species to produce a multimegabase data set of genomic sequence sampled randomly across each genome. A cartoon summarizing the genome scanning process for a BAC library is shown in Fig. 3, with several targeted downstream applications illustrated. 3. Screen end-sequences for poor base calling and contamination. Employ the Phred base calling software program (32) and exclude poor reads with quality value (QV) scores of less than 20. Identify and remove vector sequences from the data set using the NCBI tool VecScreen (http://www.ncbi.nlm.nih.gov/projects/VecScreen/). 4. Evaluate base composition. Derive statistical estimates of global GC content with a 95% confidence interval from observed GC values in multimegabase data compiled from BAC-end sequences distributed across the genome (see Note 3). Take into account any intragenomic variation in GC distribution owing to genomic features such as isochore structure and cpG islands as well as uneven local GC distributions

100

Shedlock et al.

Fig. 2. Flowchart summarizing series of basic steps taken in completing a BAC-enabled genome scan study. Paired BAC-end sequences are compiled into large-scale datasets, examined for structural features. Complexity measured in terms of genomic signatures of DNA word frequencies are related phylogentically and integrated with structural data and estimates of character change along branches to synthesize models of historical genome dynamics and test evolutionary hypotheses within a phylogenetic framework.

present in given sequence reads. Do this by checking for autocorrelation of bases up to 50–100 nt away from a focal base and assuming a model in which the GC value for each read follows a binomial distribution and where the probability of the GC values are independently and identically distributed with an unknown density (see Note 4). 5. Create genomic signatures of word-frequencies. Large-scale primary sequence derived from genome scans contains compositional information above the level of homologous alignable nucleotide sites that may reflect species-specific differences in patterns of mutation, repair, and selection on the molecular level. Evaluate this higher order organizational structure using motif-counting procedures (33–35). Do this by counting frequencies of all possible short n-letter oligos, or DNA words, using a 1-bp sliding-window approach and plotting genomic signatures (see http://genstyle. imed.jussieu.fr/). Examples of annotated visual summaries of genomic signatures for the American alligator and a human are presented in Fig. 4, along with a legend

Amniote Phylogenomics: Testing Evolutionary Hypotheses


Fig. 3. Cartoon detailing the use of BAC clones to establish primary data for large-scale sequence scanning. Clone inserts randomly arrayed in a BAC library are sequenced with known forward and reverse vector primers. Compilation of multimegabase data sets sampled across target genomes approximates global structural features and allows for a diversity of genomic analyses, including plots of continuous character distributions, base composition profiling, and mapping of sequence synteny onto reference genome data assemblies.

explaining the scheme of pixel representation shown for frequencies of all possible 16,384 (4^7) 7-nt words in ~2.5 Mb of BAC-end sequence in each signature.

6. Relate genomic signatures with phylogenetic methods. Generate Euclidean distances among genomic signatures from multiple species, calculated as the square root of the sum of the squared differences in motif frequency. Normalize each signature for genome-wide base compositional differences prior to generating distance matrices. Do this by subtracting the expected frequency of each motif on the basis of single-letter base composition from the observed frequency for each species. Construct phylogenetic trees from the distance matrix using distance methods such as the neighbor-joining (NJ) algorithm (36). Evaluate statistical support for NJ tree topology using bootstrap replication. Create pseudomatrices by resampling words with replacement and calculating signatures and distances for each pseudomatrix (37) (see Note 5).
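The word counting of step 5 and the normalized distance of step 6 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the software used for the published signatures: the function names are hypothetical, windows containing ambiguous bases are simply skipped, and the expected frequency of a word is taken as the product of its single-base frequencies.

```python
from collections import Counter
from itertools import product
import math

BASES = "ACGT"


def word_counts(seq, n):
    """Count n-letter words with a 1-bp sliding window.

    Windows containing anything other than A, C, G, or T are skipped
    (an assumption; the text does not say how ambiguous bases are handled).
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - n + 1):
        word = seq[i:i + n]
        if all(base in BASES for base in word):
            counts[word] += 1
    return counts


def signature(seq, n):
    """Return observed minus expected frequency for all 4**n words."""
    counts = word_counts(seq, n)
    total_words = sum(counts.values())
    base_counts = Counter(base for base in seq.upper() if base in BASES)
    total_bases = sum(base_counts.values())
    if total_words == 0 or total_bases == 0:
        raise ValueError("sequence is too short or contains no A/C/G/T")
    base_freq = {base: base_counts[base] / total_bases for base in BASES}
    sig = {}
    for letters in product(BASES, repeat=n):
        word = "".join(letters)
        observed = counts[word] / total_words
        # Expected frequency from single-letter base composition.
        expected = math.prod(base_freq[base] for base in word)
        sig[word] = observed - expected
    return sig


def euclidean_distance(sig_a, sig_b):
    """Square root of the summed squared differences between two signatures."""
    return math.sqrt(sum((sig_a[w] - sig_b[w]) ** 2 for w in sig_a))


if __name__ == "__main__":
    # Toy sequences, far shorter than the ~2.5 Mb data sets in the text.
    a = signature("ACGTACGTTTGACCA" * 20, 3)
    b = signature("GGCATCGATCGGCTA" * 20, 3)
    print(len(a), round(euclidean_distance(a, b), 4))
```

With n = 7 each signature holds 4^7 = 16,384 entries, matching the number of 7-nt words cited for Fig. 4. A pairwise matrix of such distances is what a neighbor-joining program would take as input.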


Fig. 4. Examples of genomic signatures for (A) American alligator and (B) human, visualized by pixel representations of all possible 7-nt DNA word frequencies contained in ~2.5 Mb of genomic sequence per signature. The diagram in (C) illustrates the order of pixel counts used to construct signatures of n-letter words. Darker pixels correspond to higher frequencies. Labels 1, 3, 8, and 9 correspond to C7, G7, A7, and T7 words, respectively, plotted at the corners of the signatures. Labels 2, 6, and 7 mark regions of CG-poor words. Labels 4 and 5 mark diagonal lines formed by densities of 7-letter words composed only of pyrimidines or purines, respectively. Label 10 marks adjacent regions exhibiting high densities of microsatellite motifs apparent in human but absent in alligator.

7. Estimate the diversity of interspersed repeats. Search for interspersed transposable genetic elements present in paired end-sequences of clones from genomic libraries with the local alignment tools implemented in the program RepeatMasker (38). Interpret hits with caution, because searches may reflect an inherent ascertainment bias related to both the incomplete nature of the reference database used by RepeatMasker and the relative level of divergence between query and reference sequences (see Note 6).

8. Estimate the diversity of tandem repeats. Count the density of microsatellites or SSRs (see Note 7) in BAC-end sequences corresponding to specific length categories using


the search options built into the online informatics tool Tandem Repeats Finder (39). Algorithms used by Tandem Repeats Finder detect and score target elements independently of any underlying reference database; however, search results can be influenced by alignment parameter settings used for a given query sequence.

9. Calculate the global density of repetitive elements in target genomes. Obtain genome-size information for target species, defined by haploid nuclear DNA content measured in picograms (C-value). Use C-values to calculate the fraction of each genome that is surveyed by paired end sequences of clone inserts sampled from a genomic library. Obtain C-values either from direct experimental measurements (e.g., flow cytometry and buoyant density analysis) or from the literature. The Animal Genome Size Database (http://www.genomesize.com/) is a useful online reference for obtaining statistics on genome size and karyotype information from a wide diversity of organisms. For example, we can estimate the fraction of our target reptile species surveyed by BAC genome scanning using information publicly available on the Genome Size Database as follows: Alligator mississippiensis, 2.49 pg, 50 chromosomes; Chrysemys picta, 2.57 pg, 36 chromosomes; Anolis smaragdinus, 2.2 pg, 32 chromosomes. Conversion of picograms to base pairs: 2.49 pg × (9.78 × 10^8 bp/pg) = 2.435 Gbp; 2.57 × (9.78 × 10^8) = 2.513 Gbp; 2.20 × (9.78 × 10^8) = 2.152 Gbp. Estimated fractions of genomes surveyed: 2,519,551 bp alligator BAC-end seqs/2.435 Gbp = 0.1035% alligator genome sampled; 2,432,811 bp turtle BAC-end seqs/2.513 Gbp = 0.0968% turtle genome sampled; 1,358,158 bp Anolis plasmid end seqs/2.152 Gbp = 0.0631% Anolis genome sampled. The total number of repeats per genome and its standard deviation can be estimated using the following formulas (see Note 8):

Total copies = (raw repeat counts) × (genome size in bp) / (size of sequence data set in bp)

Standard deviation = ± √(raw repeat counts) × (genome size in bp) / (size of sequence data set in bp)
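The arithmetic of step 9 can be checked with a short Python sketch. It assumes the conversion of 9.78 × 10^8 bp per picogram used in the worked example, and it reads the standard deviation as the square root of the raw count scaled by the same genome-to-sample ratio, which is the reading consistent with the Poisson assumption described in Note 8. The function names and the raw count of 100 are hypothetical.

```python
import math

BP_PER_PG = 9.78e8  # base pairs per picogram, as in the worked example


def genome_size_bp(c_value_pg):
    """Convert a haploid C-value in picograms to base pairs."""
    return c_value_pg * BP_PER_PG


def fraction_surveyed(sampled_bp, c_value_pg):
    """Fraction of the genome covered by the end-sequence data set."""
    return sampled_bp / genome_size_bp(c_value_pg)


def estimate_total_copies(raw_count, sampled_bp, c_value_pg):
    """Scale a raw repeat count to the whole genome.

    Returns (total copies, standard deviation), with the standard
    deviation taken as sqrt(raw_count) times the scaling factor.
    """
    scale = genome_size_bp(c_value_pg) / sampled_bp
    return raw_count * scale, math.sqrt(raw_count) * scale


if __name__ == "__main__":
    surveys = {
        "alligator": (2_519_551, 2.49),
        "turtle": (2_432_811, 2.57),
        "anole": (1_358_158, 2.20),
    }
    for species, (sampled_bp, c_value) in surveys.items():
        percent = 100 * fraction_surveyed(sampled_bp, c_value)
        print(f"{species}: {percent:.4f}% of genome sampled")

    # Hypothetical raw count of 100 elements in the alligator sample.
    total, sd = estimate_total_copies(100, 2_519_551, 2.49)
    print(f"estimated copies: {total:.0f} +/- {sd:.0f}")
```

Because the sampled fraction is on the order of 0.1%, every raw count is multiplied by roughly a thousand, so small differences in raw counts translate into large differences in genome-wide estimates.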

10. Estimate the evolutionary rates of word frequency change among lineages. Employ comparative methods to calculate the evolutionary rates of word-frequency change for specific branches on the amniote tree on the basis of the phylogenetic generalized least squares (PGLS) approach implemented in the PGLS-ancestor module of the software package COMPARE v. 4.6 (40,41) (see Note 9). As an example, we have used an NJ distance tree with the following published divergence time estimates (Myr) for major clades in the hypothesis, assuming the existence of a common ancestor at 310 Myr: squamates, 245 (42); birds, 222 (43); crocodilians and turtles, 207 (42); and rodents and primates, 85 (44,45). Compare differential rates estimated along specific branches in the phylogeny in order to illuminate macroevolutionary trends in genomic change for particular clades. Evaluate the contribution of particularly active repetitive or oligonucleotide subsets of elements to changes in total genome composition by examining frequency change of individual elements along specific branches.
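Returning to steps 3 and 4 of this list, the quality screen and the GC estimate can be sketched as follows. Several choices here are assumptions rather than the authors' procedure: a read is kept when its mean Phred quality is at least 20 (the text does not say whether the threshold applies per base or per read), and the 95% interval is a simple normal approximation on the mean of per-read GC values, swapped in for the binomial model described in step 4. All function names are hypothetical.

```python
import math
import statistics


def phred_error_probability(q):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def passes_quality(quals, min_qv=20):
    """Keep a read whose mean quality value is at least min_qv (assumed rule)."""
    return len(quals) > 0 and sum(quals) / len(quals) >= min_qv


def gc_fraction(seq):
    """GC fraction among unambiguous bases; NaN if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return float("nan")
    return (seq.count("G") + seq.count("C")) / acgt


def gc_confidence_interval(reads, z=1.96):
    """Mean per-read GC with a normal-approximation 95% interval.

    Treats reads as independent and identically distributed.
    """
    values = [gc_fraction(read) for read in reads]
    values = [v for v in values if not math.isnan(v)]
    if len(values) < 2:
        raise ValueError("need at least two usable reads")
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - z * sem, mean + z * sem


if __name__ == "__main__":
    # Toy reads paired with made-up per-base quality values.
    reads = [
        ("ATGCGGCCAT", [30] * 10),
        ("GGCCGGAATT", [25] * 10),
        ("ATATATGCAT", [12] * 10),  # fails the QV 20 screen
        ("CCGGATATAT", [35] * 10),
    ]
    kept = [seq for seq, quals in reads if passes_quality(quals)]
    mean, low, high = gc_confidence_interval(kept)
    print(f"{len(kept)} reads kept; GC = {mean:.3f} ({low:.3f}, {high:.3f})")
```

A quality value of 20 corresponds to an expected error rate of 1 in 100 base calls, which is why it is a common cutoff. With thousands of reads the interval becomes narrow, but it does not capture the local autocorrelation that step 4 asks the analyst to check separately.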


3.3. Targeted Studies of Genes and Genomic Neighborhoods

In addition to supporting genome scans to estimate the structure and complexity among large numbers of randomly selected sequences, BAC libraries can also be used to target and describe specific genes and their local genomic neighborhoods. To illustrate the steps of gene targeting using a BAC library, this section describes how a BAC insert was located and characterized for EE0.6, a heterochromatic marker found on the Z chromosome in emu, Dromaius novaehollandiae (46). In summary, an emu BAC library (see http://evogen.jgi.doe.gov/second_levels/BACs/BAC_library_stats/Dnova_stats.html) was screened with a radioactively labeled probe, EE0.6, and a single hybridizing clone was detected and isolated from the arrayed library for more targeted investigation by means of shotgun subcloning, fragment sequencing, and contig assembly. We present a detailed step-by-step protocol of this process below in steps 1–11. The library screening and shotgun assembly of particular BAC clones can be used for comparative investigation of particular homologous functional elements within genomes in a manner that complements the phylogenomic analysis of BAC-enabled genome scans as outlined above. This example illustrates a type of genome analysis that would be impossible to achieve with a simple plasmid library owing to the limited coverage of each clone in small-insert libraries. A flowchart summarizing targeted sequence analysis of a BAC clone is presented in Fig. 5.

Before a BAC library can be screened, nylon filters must be prepared with a gridded representation of the clones contained in the 384-well plates that hold the library. The emu BAC library, for example, contains 133,632 clones. The emu BACs are inserted into electrocompetent, DH10B T1-resistant Escherichia coli cells. Each clone is suspended in a separate well with LB broth (Miller), chloramphenicol, and glycerol.

1. Filter preparation from the emu BAC library. Remove 384-well plates from frozen storage (−80°C) and thaw for ~90 min. Grid each clone in the emu BAC library to a nylon filter with a Genetix Q-bot or other automated colony picker. Each filter represents clones from 48 384-well plates. Soak each filter in LB broth and 25 mg/mL chloramphenicol before gridding. After gridding, incubate filters at 37°C for 17 h on a large bioassay dish filled with 300 mL 1.5% LB agar and 12.5 µg chloramphenicol (25 mg/mL). Transfer each filter to filter paper on a glass dish, saturated with lysis buffer (2X SSC, 5% SDS). Incubate filters at room temperature for 3 min, microwave at maximum power for 3 min, and transfer to a Pyrex baking dish filled with proteinase K buffer and 10 µg/mL proteinase K. Cover the Pyrex dish with plastic wrap and incubate at 37°C for ~2 h with gentle rocking. Once the filters are cleared of colonial debris, rinse them in 2X SSC for 2 min, air-dry, and crosslink for 2 min. Label each filter with a fabric pen to indicate orientation and identity of the 48 plates gridded to the filter. Store gridded filters dry at room temperature until their first use.
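The gridding arithmetic implied by step 1 is easy to check. The sketch below assumes every 384-well plate is full and that each filter holds exactly 48 plates, as stated above; the function names are hypothetical.

```python
import math

WELLS_PER_PLATE = 384
PLATES_PER_FILTER = 48


def plates_needed(n_clones):
    """Number of 384-well plates required to hold n_clones."""
    return math.ceil(n_clones / WELLS_PER_PLATE)


def filters_needed(n_clones):
    """Number of nylon filters required at 48 plates per filter."""
    return math.ceil(plates_needed(n_clones) / PLATES_PER_FILTER)


if __name__ == "__main__":
    emu_clones = 133_632
    print(plates_needed(emu_clones))   # 348 plates
    print(filters_needed(emu_clones))  # 8 filters (7 full, 1 partial)
```

For the emu library this works out to 348 plates, so seven filters are full and an eighth carries the remaining 12 plates.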


Fig. 5. Generation and analyses of bacterial artificial chromosome (BAC) sequence data. BAC libraries are stored in frozen media in 384-well plates. For gene targeting, a two-dimensional representation of the library is printed onto nylon filters. Sequence probes are hybridized to the filters to locate a BAC clone that contains the probe sequence. Once the BAC clone is selected, the BAC insert is sheared into fragments, shotgun-subcloned, sequenced, and assembled. Once assembled, BAC inserts provide raw data for large-scale sequence analyses of repeat density and synteny.

2. Probe generation. Generate the sequence of interest from genomic DNA by PCR. Ligate and transform the DNA into XL-10 Gold competent E. coli cells. Amplify transformed clones by PCR using the original gene primers. Purify PCR products from clone DNA and store at a concentration of 25 ng/µL until the day of hybridization. Before hybridization, label 25 ng DNA with (α-32P)-dCTP using the Prime-It II Random Primer Labeling Kit. Filter labeled probes through microspin columns to remove unincorporated nucleotides.

3. Filter prehybridization and hybridization. Denature 1.5 mL sonicated, nonhomologous herring sperm DNA by heating at 100°C for 10 min and then chill on ice for 2 min. Denature the labeled probe DNA by the same method. Sandwich each filter between layers of hybridization mesh and roll them in lead-lined glass cylinders. Prehybridize filters by adding 75 mL hybridization buffer and 1.5 mL herring sperm DNA. Incubate filters, buffer, and herring sperm DNA with rolling at 65°C for 6 h.

After prehybridization, decant liquid from the cylinders and replace with 25 mL hybridization buffer, 0.5 mL denatured herring sperm DNA, and denatured probe DNA. Incubate cylinders with rolling at 65°C for 12 h.

4. Washing and autoradiography. Decant liquid from cylinders and replace with 25 mL 1X washing buffer. Incubate cylinders at 50°C for 15 min. Decant washing solution. Repeat washing twice. After three washes, blot excess buffer from individual filters with paper towels. Wrap each filter in plastic wrap and place, DNA side up, in a metal cassette under an undeveloped Biomax MR X-ray film. Store cassettes at −80°C for 1 wk.

5. Developing films and picking clones. After autoradiography, develop films with an X-omat film processor. Strip filters of radioactivity by saturating them with three solutions: stripping solution 1 for 20 min; stripping solution 2 for 10 min; and stripping solution 3 for 10 min. Wrap filters and store wet at −20°C. Positive hybridizations are indicated by two black spots on film. The two spots match the pattern in which each clone’s DNA was gridded to the filter by the Q-bot. To illustrate this technique, results from an autoradiograph of a BAC library screening assay of zebra finch (Taeniopygia guttata), screened with an Mhc class II probe, are shown in Fig. 6. By referencing the location of the spots and the plates gridded onto the filter, select clones from the BAC library to confirm that the sequence of interest lies within the clone.

6. Growing clones for preliminary PCR survey. Pick each candidate BAC clone from its 384-well plate and incubate in 500 µL LB broth (Miller) and chloramphenicol at 37°C with shaking (250 rpm) for 17 h. Screen each clone by PCR with the original gene primers. Purify clones that produce a distinguishable band in an electrophoretic gel for Southern blotting.

7. Purification, restriction, and Southern blotting. Incubate the positive clone in 10 mL LB broth (Miller) and chloramphenicol at 37°C with shaking (250 rpm) for 18 h. Decant culture into a 1.5-mL Eppendorf tube and centrifuge at 13,000 rpm for 5 min. Decant LB and replace with new culture. Centrifuge tube again. Continue decanting and centrifugation until all bacterial pellets from 10 mL culture are contained in one tube. Resuspend pellets in 30 µL resuspension buffer and RNase A, and mix with 30 µL lysis buffer (Qiagen) and 30 µL neutralization buffer. Rock solution gently several times, incubate on ice for 15 min, and centrifuge at 13,000 rpm for 15 min. Remove supernatant carefully, mix with 150 µL of isopropanol, and incubate at −20°C for 30 min. Centrifuge the tube at 13,000 rpm for 15 min and discard the supernatant. Mix the BAC DNA pellet with 30 µL 70% ethanol, rock gently, and centrifuge at 13,000 rpm for 5 min. After ethanol is removed, air-dry the pellet and resuspend in 10 µL TE buffer. Restrict BAC DNA with EcoRI and HindIII, and anneal to a filter for hybridization with a labeled EE0.6 probe as previously described. Autoradiography provides additional support for the presence of the sequence of interest in the BAC clone’s insert.

8. Shearing. Once the target BAC clone has been detected and isolated by the above protocols, the shotgun assembly can be completed beginning with shearing the insert and continuing through subcloning and contig assembly as described in steps


Fig. 6. Signal of hybridization between DNA probe and a bacterial artificial chromosome (BAC)-gridded nylon filter. A radioactive or bioluminescent probe is hybridized to a nylon filter that is gridded with DNA from the clones in the library. The organization and location of hybridization signals (arrows pointing at two dots) indicate the identity of the clone that carries the probe sequence. The results shown are from American alligator (Alligator mississippiensis) and were screened with a DMRT1 probe. This filter illustrates how BAC clones can be localized to individual wells by the pattern of hybridization within each 4 × 4 spotted grid.

(8)–(11) below. A cartoon summarizing details of the BAC shotgun sequencing and assembly process is shown in Fig. 7. Fragment 30 µg purified BAC DNA in 200 µL solution with a hydroshear. Estimate the concentration of sheared DNA by comparison to known concentrations of micrococcus in an electrophoretic gel.


Fig. 7. Bacterial artificial chromosome (BAC) library shotgun-subcloning and sequencing. BAC inserts carry 100–200 kb of sequence. BAC inserts are sheared into smaller fragments and incorporated as smaller inserts in smaller vectors. The fragments can be sequenced and joined into contigs by bioinformatics software.

9. End repair and shotgun subcloning. After hydroshearing, fragmented DNAs have sticky ends that are incompatible with the ends of a potential cloning vector. To make the fragment ends uniform, blunt-end and end-repair 30 µg DNA. Purify hydrosheared DNAs with the Qiaquick kit and blunt-end by incubation at 120°C for 40 min with 100 µL blunting mix followed by addition of 8 µL 0.5 M EDTA to each sample and incubation at 70°C for 10 min. After cleaning samples with the Qiaquick kit again, mix 21 µL DNA with 3 µL BstXI/EcoRI adapter and 6 µL linking mix and incubate at 16°C for 17 h, 70°C for 10 min, and end at 4°C. Purify samples with the Qiaquick kit again and gel-extract DNAs between 4 and 7 kb for ligation. Purify gel extracts with the Qiaquick kit. Mix 15.5 µL of sample with 1 µL vector and 3.5 µL ligating mix and incubate at 16°C for 17 h, 70°C for 10 min, and end at 4°C. Mix 1 µL ligated sample with 100 µL thawed GC5 cells and incubate at 0°C for 40 min, 42°C for 15 s, and 0°C for 1 min. Transfer samples to 900 µL SOC medium, seal, shake, and incubate at 37°C for 1 h. Spread samples on large square bioassay dishes filled with LB agar (Miller) and chloramphenicol, invert, and incubate at 37°C for 17 h.


Pick resultant colonies by Genetix QPix or other automated colony picker into 96-well deep plates filled with 300 µL of LB mix. Incubate colony plates at 37°C and shake at 250 rpm for 17 h. Centrifuge plates at 5000g at 12°C for 6 min. Decant supernatant and add 150 µL resuspension buffer and RNase A, 150 µL lysis buffer, and 150 µL neutralization buffer to each sample while shaking at 250 rpm. Centrifuge samples at 5000g at 23°C for 30 min. Fill a 96-well 800 µL uniplate with 290 µL of 99% isopropanol and cover with a 96-well 800 µL unifilter. Add 370 µL of sample supernatant to the unifilter. Seal the unifilter and centrifuge at 3000g at 14°C for 15 min. Discard the unifilter and the supernatant in the uniplate. Air-dry pellets, wash twice with 70% ethanol, vacuum air-dry, and resuspend in 15 µL TE buffer. Centrifuge uniplates at 500 rpm at room temperature for 2 min and incubate at room temperature for 1 h. Freeze samples at −20°C.

10. Sequencing and assembling contigs. Label samples with the Big Dye Terminator v3.1 Cycle Sequencing Kit using M13 primers and sequence. Name resultant sequence files according to the St. Louis naming scheme, which consists of three parts: (1) sample identity, (2) sequence direction (b for forward or g for reverse), and (3) unique sequence identity (e.g., 121.b.241). Construct contigs by analyzing sequence files with Phred, Phrap, and Consed assembly software. Query Autofinish for suggested primers for closing gaps between contigs (32,47–49).

11. Bioinformatics. Query contigs produced from the BAC insert for RefSeq genes within relevant libraries (human, mouse, chicken, and so on) with the UCSC genome browser. Scan contigs for gene content with Seqhelp, Genscan, and Geneparser (50–52). Annotate alignments to relevant libraries with the Apollo software package (53). Measure repeat density of the BAC sequence with the Miropeats and RepeatMasker software packages (38,54). Measure polymorphism using the DnaSP software package (55). These software packages, among others, allow analyses of sequences generated from screening, subcloning, sequencing, and assembling of BAC DNA. An example of annotations produced by the above methods for 41 kb of microchromosomal sequence from emu, Dromaius novaehollandiae, is illustrated in Fig. 8.
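The three-part read names described in step 10 are simple to parse before assembly. The sketch below is an illustration only: it assumes the parts are separated by periods exactly as in the example 121.b.241, and the `ReadName` type and `parse_read_name` function are hypothetical names. Real trace files may carry additional suffixes that this sketch does not handle.

```python
from typing import NamedTuple


class ReadName(NamedTuple):
    sample: str     # part 1: sample identity
    direction: str  # part 2: "forward" (b) or "reverse" (g)
    read_id: str    # part 3: unique sequence identity


DIRECTIONS = {"b": "forward", "g": "reverse"}


def parse_read_name(name):
    """Split a name such as '121.b.241' into its three parts."""
    parts = name.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected three dot-separated parts: {name!r}")
    sample, code, read_id = parts
    if code not in DIRECTIONS:
        raise ValueError(f"unknown direction code {code!r} in {name!r}")
    return ReadName(sample, DIRECTIONS[code], read_id)


if __name__ == "__main__":
    print(parse_read_name("121.b.241"))
    print(parse_read_name("121.g.241"))
```

Validating names up front matters because assemblers use the direction code to pair forward and reverse reads from the same subclone.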

4. Notes

1. Our reptile studies surveyed large-insert libraries for the American alligator and painted turtle, constructed by colleagues at the NIH BAC Resource Network Production Center at the Benaroya Research Institute at Virginia Mason, Seattle, WA (www.benaroyaresearch.org). We also utilized sequences from a whole-genome plasmid library generated at the Washington University Genome Sequencing Center in St. Louis, MO (http://genome.wustl.edu/home.cgi).

2. Clone sequencing was conducted at the Institute for Genomic Research, Rockville, MD (http://www.tigr.org), and the Washington University Genome Sequencing Center, and followed the published protocols of Zhao and colleagues (1,10). A total of 4656 random clones were examined to produce a total of 8638 nonoverlapping high-quality, edited paired end sequence reads, yielding 2,519,551 bp of alligator, 2,432,811 bp of turtle, and 1,358,158 bp of anole original sequence for comparative


Fig. 8. Annotations of microchromosomal sequence from emu, Dromaius novaehollandiae. (A) The Apollo Genome Annotation and Curation Tool indicates conservation of sequence among queried species. The letters M, H, and C refer to the mouse, human and chicken genomes to which 41 kb of microchromosomal emu sequence was compared in this query. (B) The UCSC genome browser aligned 16.8 kb of microchromosomal emu sequence to chromosome 17 in the chicken genome. Conserved sequences among other species and repeats are also noted in the UCSC output.

phylogenomic analysis. Among all reads, the average lengths were 769, 703, and 677 bp for alligator, turtle, and anole, respectively.

3. The pattern of nucleotide frequency observed in genome scans provides a means to estimate global base composition of genomes of species and to infer phylogenetic relationships of hierarchical structure that may be reflected by compositional patterns among genome scans of target species. In particular, because guanine + cytosine (GC) or GC-rich regions in the RNA polymerase II promoter region are essential for efficient transcription of genes in eukaryotes (56), the relative GC content and the distribution of GC rich/poor regions in genomes have become a standard index for evaluating protein coding components of genomes. GC content has been estimated indirectly by experimental methods, such as DNA ultracentrifugation and


flow cytometry (29,57), but is increasingly analyzed directly with informatics tools and has been described in extensive detail within the context of whole-genome assemblies of model species (25,58–60). Moreover, analyzing the relationships between GC content and other organismal parameters, such as genome size, cell size, metabolic rate, and physiological constraints on life history, is an expanding line of investigation that illuminates the evolutionary dynamics and possible selective forces shaping genome structure (24,29,30,57,61).

4. We have used a statistical approach with genome scanning of alligator and turtle BAC-end sequences to estimate global GC values that are almost identical to those published based on buoyant densities in CsCl gradients and on flow cytometry (~42%; [29,57]), indicating that GC content is elevated in alligators and turtles relative to values derived from the chicken and human genome assemblies (~40%; [25,60]).

5. Analysis of genomic signatures from birds has extended this phylogenetic approach to vertebrates and was shown to recover major branches in the avian tree, although phylogenetic relationships near the tips of the tree were clearly incongruent with traditional sequence analysis (62). Alternatively, unsupervised neural network algorithms known as self-organizing maps (SOMs) can be used to infer phylogenetic relationships based on oligo frequencies in unaligned genome sequences (63,64). Such SOMs are based on the frequencies of short (2 or 3 bp) oligonucleotides estimated from bulk sequence data and have been used to characterize the diversity of species present in environmental genomic samples (62,63,65). Although the signature approach remains exploratory, it is providing valuable new ways of harvesting phylogenetic information present in a wide variety of unalignable genomic sequence and has been shown to corroborate results of conventional analyses based on aligned sequences from homologous nucleotide positions (33,62). In general, when seeking to estimate phylogenetic relationships, we do not advocate relying upon the phenetic approach provided by signatures as a replacement for conventional methods of phylogenetic inference using a matrix of aligned nucleotides and established models of substitution, when available. However, we believe that novel, informatics-rich methods of inference such as genomic signatures provide an exciting option for phylogenetic analysis of higher order patterns of complexity inherent to genomic signatures, while at the same time providing insight into global patterns of genomic change not conditioned on a specific chromosome or gene region.

6. Ascertainment bias tends to provide a conservative underestimate of the true repeat diversity in target species, especially for highly derived novel TEs not annotated in model-organism genome assemblies. To minimize this problem there is a growing interest in developing informatics tools that do not rely on reference sequences and can directly detect structural components of certain classes of TEs, such as the tRNA-derived secondary structure in SINEs. Existing online informatics tools such as the tRNA-scan Search Server (http://rna.wustl.edu/GtRDB/Hs/Hs-align.html) and mfold (http://www.bioinfo.rpi.edu/applications/mfold/old/rna/form1.cgi) have been used successfully to identify novel families of phylogenetically useful retroelements in orders of placental mammals (66). The integration of scripting from text


processing languages such as Perl (67) or Python (68) with the proliferation of online resources of bioinformatics promises to greatly facilitate discovery of phylogenetically diverse repetitive elements de novo in addition to relying upon referencing against annotated databases.

7. SSRs, or microsatellites, are tandemly duplicated units of 1–6 bp DNA motifs found commonly throughout eukaryotic genomes (69). SSRs are unstable and highly mutable and are thought to be primarily a result of polymerase slippage resulting in misalignment of reassociating strands during DNA replication (70), although other mechanisms such as unequal recombination and gene conversion have been characterized in detail (71,72). The balance between adding repeat units and removing them by mismatch repair enzyme machinery is a dynamic process that can influence genome evolution by contributing substantial genetic variation (73,74), introducing mutational bias (75,76), and altering transcriptional activity (73,77) as well as global oligonucleotide frequencies. Moreover, the frequency of SSR types in vertebrates is uneven (78,79) and overall SSR abundance has been shown to correlate with genome size (80). It is, therefore, of interest to examine the profile of SSRs in nonavian reptiles in an effort to understand the influence of repetitive elements on vertebrate genome evolution within a comparative framework.

8. In order to estimate the density of both repeats and n-letter word frequencies for genomic signature analysis, we assumed that, since these elements are rare events and are more or less uniformly distributed in the genome, the total number of repeats in the sampled region follows a Poisson distribution with rate Nr, where N is the total number of repeats in the genome and r is the relative fraction of total repeats contained in the sampled region. Using this approach, departures from the Poisson model associated with localized uneven distributions of elements should provide a conservative estimate of element abundance. For any genome scan study, sampling a limited number of clone ends may yield interspersed and tandem repeat estimates that are biased from whole-genome counts for both bioinformatic and experimental reasons.

9. The Martins–Hansen method (41) utilizes the topological information in a given phylogeny with branch lengths set proportional to time and the values for continuous characters for each species, be they oligonucleotide word frequencies, base composition, repeat abundance, or others. COMPARE then calculates rates as the regression slope for a generalized least squares model describing how well time predicts the variance between pairwise taxon divergence. Rates of evolution along specific branches of a genomic signature tree can be compared in terms of the change in value of any continuous character per million years when divergence times are known from the fossil record.
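The definition of SSRs in Note 7 (tandem copies of a 1–6 bp motif) can be illustrated with a small regular-expression scan. This is not the algorithm used by Tandem Repeats Finder, which scores approximate repeats by alignment. The sketch finds perfect repeats only, and the minimum of four copies and the function names are assumptions chosen for illustration.

```python
import re
from collections import Counter


def find_ssrs(seq, min_repeats=4, max_unit=6):
    """Return (start, unit, copies) for perfect tandem repeats.

    The lazy quantifier makes the regex prefer the shortest repeat unit,
    so ACACACAC is reported as AC x 4 rather than ACAC x 2.
    """
    seq = seq.upper()
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


def ssr_density_per_mb(hits, total_bp):
    """SSR loci per megabase, grouped by repeat-unit length (1-6 bp)."""
    by_length = Counter(len(unit) for _, unit, _ in hits)
    return {k: by_length.get(k, 0) * 1e6 / total_bp for k in range(1, 7)}


if __name__ == "__main__":
    read = "TTGACACACACACGGATTTTTTTTCG"
    hits = find_ssrs(read)
    print(hits)
    print(ssr_density_per_mb(hits, len(read)))
```

Grouping hits by unit length gives the per-category densities called for in step 8 of Subheading 3.2, although a production analysis would also need to tolerate mismatches within repeats.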

Acknowledgments

We thank Chris Amemiya, Tom Miyake, Robert Macey, Jeff Froula, Zhenshan Wang, Shaying Zhao, Jyoti Shetty, Marcia Lara, Jonathan Losos, Wes Warren, and Pat Minx for technical support with genomic library construction, cloning, and end-sequencing. Charles Chapus, Chris Botka, Amir Karger, and Lakshmanan Iyer provided computational and informatics support. Jun Liu, Tingting Zhang, and Patrick Deschavanne contributed statistical expertise and help with data analysis. We thank John Wakeley, Mike Sorenson, and members of the Edwards Lab, especially Nancy Rotzel and Chris Balakrishnan, for numerous helpful discussions, critical comments on the manuscript, and assistance with illustrations. This work was produced in part with support from NSF grant IBN-0207870 to SVE and from Harvard University.

References

1. Zhao, S. and Stodolsky, M., eds (2004) Bacterial Artificial Chromosomes: Library Construction, Physical Mapping, and Sequencing, Humana, Totowa, NJ.
2. Sambrook, J. and Russell, D. W. (2001) Molecular Cloning: A Laboratory Manual. Cold Spring Harbor Laboratory, Cold Spring Harbor, NY.
3. Carvajal, J. J., Cox, D., Summerbell, D., and Rigby, P. W. (2001) A BAC transgenic analysis of the Mrf4/Myf5 locus reveals interdigitated elements that control activation and maintenance of gene expression during muscle development. Development 128, 1857–1868.
4. Giraldo, P. and Montoliu, L. (2001) Size matters: use of YACs, BACs and PACs in transgenic animals. Transgenic Res. 10, 83–103.
5. Heintz, N. (2000) Analysis of mammalian central nervous system gene expression and function using bacterial artificial chromosome-mediated transgenesis. Hum. Mol. Genet. 9, 937–943.
6. Takahashi, R., Ito, K., Fujiwara, Y., et al. (2000) Generation of transgenic rats with YACs and BACs: preparation procedures and integrity of microinjected DNA. Exp. Anim. 49, 229–233.
7. Amemiya, C. T., Zhong, T. P., Silverman, G. A., Fishman, M. C., and Zon, L. I. (1999) Zebrafish YAC, BAC and PAC genomic libraries. Methods Cell Biol. 60, 235–258.
8. Choi, S. and Kim, U.-J. (2001) Construction of a bacterial artificial chromosome library. In: Genomics Protocols (Starkey, M. P. and Elaswarapu, R., eds), pp. 57–68, Humana, Totowa, NJ.
9. Zehetner, G., Pack, M., and Schäfer, K. (2001) Preparation and screening of high-density cDNA arrays with genomic clones. In: Genomics Protocols (Starkey, M. P. and Elaswarapu, R., eds), pp. 169–188, Humana, Totowa, NJ.
10. Zhao, S., Shatsman, S., Ayodeji, B., et al. (2001) Mouse BAC ends quality assessment and sequence analyses. Genome Res. 11, 1736–1745.
11. Gasper, J. S., Shiina, T., Inoko, H., and Edwards, S. V. (2001) Songbird genomics: analysis of 45 kb upstream of a polymorphic Mhc Class II gene in red-winged blackbirds (Agelaius phoeniceus). Genomics 75, 26–34.
12. Kim, C. B., Amemiya, C., Bailey, W., et al. (2000) Hox cluster genomics in the horn shark, Heterodontus francisci. Proc. Natl Acad. Sci. USA 97, 1655–1660.
13. Giribet, G., Edgecombe, G. D., and Wheeler, W. C. (2001) Arthropod phylogeny based on eight molecular loci and morphology. Nature 413, 157–161.
14. Madsen, O., Scally, M., Douady, C. J., et al. (2001) Parallel adaptive radiations in two major clades of placental mammals. Nature 409, 610–614.


15. Murphy, W. J., Eizirik, E., Johnson, W. E., et al. (2001) Molecular phylogenetics and the origins of placental mammals. Nature 409, 614–618.
16. Soltis, P. S., Soltis, D. E., and Chase, M. W. (1999) Angiosperm phylogeny inferred from multiple genes as a tool for comparative biology. Nature 402, 402–404.
17. Edwards, S. V., Jennings, W. B., and Shedlock, A. M. (2005) Phylogenetics of modern birds in the era of genomics. Proc. R. Soc. Lond. Ser. B 272, 979–992.
18. Baker, R. J., Longmire, J. L., Maltbie, M., Hamilton, M. J., and VandenBussche, R. A. (1997) DNA synapomorphies for a variety of taxonomic levels from a cosmid library from the new world bat Macrotus waterhousii. Syst. Biol. 46, 579–589.
19. Rokas, A. and Holland, P. W. H. (2000) Rare molecular changes as a tool for phylogenetics. Trends Ecol. Evol. 15, 454–459.
20. Shedlock, A. M., Milinkovitch, M. C., and Okada, N. (2000) SINE evolution, missing data, and the origin of whales. Syst. Biol. 49, 808–817.
21. Shedlock, A. M. and Okada, N. (2000) SINE insertions: powerful tools for molecular systematics. Bioessays 22, 148–160.
22. Shedlock, A. M., Takahashi, K., and Okada, N. (2004) SINEs of speciation: tracking lineages with retroposons. Trends Ecol. Evol. 19, 545–553.
23. Sarre, S., Georges, A., and Quinn, A. (2004) The ends of a continuum: genetic and temperature-dependent sex determination in reptiles. Bioessays 26, 639–645.
24. Waltari, E. and Edwards, S. V. (2002) Evolutionary dynamics of intron size, genome size, and physiological correlates in archosaurs. Am. Nat. 160, 539–552.
25. Shedlock, A. M., Botka, C. W., Zhao, S., et al. (2007) Phylogenomics of non-avian reptiles and the structure of the ancestral amniote genome. Proc. Natl Acad. Sci. USA 104, 2767–2772.
26. Shedlock, A. M. (2006) Phylogenomic investigation of CR1 LINE diversity in reptiles. Syst. Biol. 55, 902–911.
27. Organ, C. L., Shedlock, A. M., Meade, A., Pagel, M., and Edwards, S. V. (2007) Origin of avian genome size and structure in non-avian dinosaurs. Nature 446, 180–184.
28. Shedlock, A. M. (2006) Exploring frontiers in the DNA landscape: An introduction to the symposium "Genome analysis and the molecular systematics of retroelements". Syst. Biol. 55, 871–874.
29. Vinogradov, A. E. (1998) Genome size and GC-percent in vertebrates as determined by flow cytometry: the triangular relationship. Cytometry 31, 100–109.
30. Olmo, E. (1986) Animal Cytogenetics, Vol. 4: Chordata, No. 3A: Reptilia. Gebrüder Borntraeger, Berlin.
31. Blanchette, M., Green, E. D., Miller, W., and Haussler, D. (2004) Reconstructing large regions of an ancestral mammalian genome in silico. Genome Res. 14, 2412–2423.
32. Ewing, B. and Green, P. (1998) Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8, 186–194.
33. Chapus, C., Dufraigne, C., Edwards, S. V., et al. (2005) Exploration of phylogenetic data using a global sequence analysis method. BMC Evol. Biol. 2005, 63.
34. Deschavanne, P., Giron, A., Vilain, J., Fagot, G., and Fertil, B. (1999) Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Mol. Biol. Evol. 16, 1391–1399.

Amniote Phylogenomics: Testing Evolutionary Hypotheses


35. Karlin, S. and Ladunga, I. (1994) Comparisons of eukaryotic genomic sequences. Proc. Natl Acad. Sci. USA 91, 12,832–12,836. 36. Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406–425. 37. Felsenstein, J. (1985) Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39, 783–791. 38. Smit, A. F. A., Hubley, R., and Green, P. (2004) RepeatMasker Open-3.0.5 (http://www.repeatmasker.org). 39. Benson, G. (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580. 40. Martins, E. P. (2004) COMPARE, version 4.6b. Computer programs for the statistical analysis of comparative data. Distributed by the author at http://compare.bio.indiana.edu/. Department of Biology, Indiana University, Bloomington IN. 41. Martins, E. P. and Hansen, T. F. (1997) Phylogenies and the comparative method: a general approach to incorporating phylogenetic information into analysis of interspecific data. Am. Nat. 149, 646–667. 42. Hedges, S. B. and Poling, L. L. (1999) A molecular phylogeny of reptiles. Science 283(5404), 998–1001. 43. Kumar, S. and Hedges, B. (1998) A molecular timescale for vertebrate evolution. Nature 392, 917–920. 44. Hasegawa, M., Thorne, J. L., and Kishino, H. (2003) Time scale of eutherian evolution estimated without assuming a constant rate of molecular evolution. Genes Genet. Syst. 78, 267–283. 45. Springer, M. S., Murphy, W. J., Eizirik, E., and O’Brien, S. J. (2003) Placental mammal diversification and the Cretaceous-Tertiary boundary. Proc. Natl Acad. Sci. USA 100(3), 1056–1061. 46. Ogawa, A., Murata, K., and Mizuno, S. (1998) The location of Z- and W-linked marker genes and sequence on the homomorphic sex chromosomes of the ostrich and the emu. Proc. Natl Acad. Sci. USA 95(8), 4415–4418. 47. Gordon, D. C. (2004) Viewing and editing assembled sequences using Consed. In: Current Protocols in Bioinformatics (Baxevanis, A. D. 
and Davison, D. B., eds), Wiley, New York. 48. Gordon, D., Desmarais, C., and Green, P. (2001) Automated finishing with Autofinish. Genome Res. 11(4), 614–625. 49. Gordon, D., Abajian, C., and Green, P. (1998) Consed: a graphical tool for sequence finishing. Genome Res. 8(3), 195–202. 50. Lee, M. K., Lynch, E. D., and King, M. C. (1998) SeqHelp: a program to analyze molecular sequences utilizing common computational resources. Genome Res. 8(3), 306–312. 51. Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268(1), 78–94. 52. Snyder, E. E. and Stormo, G. D. (1995) Identification of protein-coding regions in genomic DNA. J. Mol. Biol. 248(1), 1–18. 53. Lewis, S. E., Searle, S. M. J., Harris, N., et al. (2002) Apollo: a sequence annotation editor. Genome Biol. 3, research0082.1–14.


54. Parsons, J. D. (1995) Miropeats: graphical DNA sequence comparisons. Comput. Appl. Biosci. 11, 615–619. 55. Rozas, J., Sanchez-DelBarrio, J. C., Messeguer, X., and Rozas, R. (2003) DnaSP, DNA polymorphism analyses by the coalescent and other methods. Bioinformatics 19(18), 2496–2497. 56. Watson, J. D., Hopkins, N. H., Roberts, J. W., Steitz, J. A., and Weiner, A. M. (1987) Molecular Biology of the Gene, 4th Ed., Benjamin Cummings, Menlo Park. 57. Hughes, S., Clay, O., and Bernardi, G. (2002) Compositional patterns in reptilian genomes. Gene 295, 323–329. 58. Jaillon, O., Aury, J.-M., and Brunet, F., et al. (2004) Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature 431(7011), 946–957. 59. Rat Genome Sequencing Project Consortium. (2004) Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428, 493–521. 60. International Chicken Genome Sequencing Consortium. (2004) Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432(7018), 695–716. 61. Hughes, A. and Piontkivska, H. (2005) DNA repeat arrays in chicken and human genomes and the adaptive evolution of avian genome size. BMC Evol. Biol. 5(1), 12. 62. Edwards, S. V., Fertil, B., Giron, A., and Deschavanne, P. (2002) A genomic schism in birds revealed by phylogenetic analysis of DNA strings. Syst. Biol. 51(4), 599–613. 63. Abe, T., Kanaya, S., Kinouchi, M., et al. (2003) Informatics for unveiling hidden genome signatures. Genome Res. 13, 693–702. 64. Dopazo, J. and Carazo, J. M. (1997) Phylogenetic reconstruction using an unsupervised growing neural network that adopts the topology of a phylogenetic tree. J. Mol. Evol. 44, 226–233. 65. Uchiyama, T., Abe, T., Ikemura, T., and Watanabe, K. (2005) Substrate-induced gene-expression screening of environmental metagenome libraries for isolation of catabolic genes. Nat. Biotechnol. 23, 88–93. 66. 
Churakov, G., Smit, A. F. A., Brosius, J., and Schmitz, J. (2004) A novel abundant family of retroposed elements (DAS-SINEs) in the nine-banded armadillo (Dasypus novemcinctus). Mol. Biol. Evol. 22, 886–893. 67. Wall, L., Christiansen, T., and Orwant, J. (2000) Programming Perl, 3rd Ed., O’Reilly Media, Cambridge. 68. Lutz, M. (2001) Programming Python, 2nd Ed., O’Reilly Media, Cambridge. 69. Tautz, D. and Renz, M. (1984) Simple sequences are ubiquitous repetitive components of eukaryotic genomes. Nucleic Acids Res. 12, 4127–4138. 70. Richards, R. I. and Sutherland, G. R. (1994) Simple repeat DNA is not replicated simply. Nat. Genet. 6, 114–116. 71. Almeida, P. and Penha-Gonçalves, C. (2004) Long perfect dinucleotide repeats are typical of vertebrates, show motif preferences and size convergence. Mol. Biol. Evol. 21, 1226–1233. 72. Majewski, J. and Ott, J. (2000) GT repeats are associated with recombination on human chromosome 22. Genome Res. 10, 1108–1114.


73. Kashi, Y., King, D. C., and Soller, M. (1997) Simple sequence repeats as a source of quantitative genetic variation. Trends Genet. 13, 74–78. 74. Tautz, D., Trick, M., and Dover, G. (1986) Cryptic simplicity in DNA is a major source of genetic variation. Nature 322, 652–656. 75. Amos, W., Sawcer, S. J., Feakes, R. W., and Rubinsztein, D. C. (1996) Microsatellites show mutational bias and heterozygote instability. Nat. Genet. 13, 390–391. 76. Primmer, C. R., Saino, N., Møller, A. P., and Ellegren, H. (1996) Directional evolution in germline microsatellite mutations. Nat. Genet. 13, 391–393. 77. Gerber, H. P., Seipel, K., Georgiev, O., et al. (1994) Transcriptional activation modulated by homopolymeric glutamine and proline stretches. Science 263, 808–811. 78. Tóth, G., Gáspári, Z., and Jurka, J. (2000) Microsatellites in different eukaryotic genomes: survey and analysis. Genome Res. 10, 967–998. 79. Primmer, C. R., Raudsepp, T., Chowdhary, B. P., Møller, A. P., and Ellegren, H. (1997) Low frequency of microsatellites in the avian genome. Genome Res. 7(5), 471–482. 80. Hancock, J. M. (1996) Simple sequence repeats and the expanding genome. Bioessays 18, 421–425. 81. Burt, D. W. (2002) Origin and evolution of avian microchromosomes. Cytogenet. Genome Res. 96(1–4), 97–112. 82. Epplen, J. T., Diedrich, U., Wagenmann, M., Schmidtke, J., and Engel, W. (1979) Contrasting DNA sequence organisation patterns in sauropsidian genomes. Chromosoma 75, 199–214. 83. Gregory, T. R. (2001) Animal Genome Size Database (http://www.genomesize.com). 84. King, M., Honeycutt, R., and Contreras, N. (1986) Chromosomal repatterning in crocodiles: C, G, and N-banding and in situ hybridization of 18S and 26S rRNA cistrons. Genetica 27, 191–201.

8 Comparative Physical Mapping: Universal Overgo Hybridization Probe Design and BAC Library Hybridization

James W. Thomas

Summary
Comparative genomics is a powerful approach for inferring the history and function of genomic sequence. The generation of bacterial artificial chromosome (BAC)-based physical maps is a proven method for the targeted comparative genomic analysis of genes or regions of interest. ‘Universal’ overgo hybridization probes can be used for the efficient construction of BAC-based physical maps of orthologous chromosome segments from multiple species in parallel. These probes can therefore facilitate the assembly of deep and diverse collections of experimental and computational comparative genomic resources corresponding to specific segments of the genome. The design of ‘universal’ overgo probes depends on the presence of sequences that are highly conserved within a group of species. Such conserved sequences can be readily identified using local or whole-genome interspecies sequence alignments. Once ‘universal’ overgo hybridization probes are designed, simple and uniform labeling and hybridization conditions can be used to exploit these probes for targeted comparative physical mapping.

Key Words: Comparative genomics; physical mapping; BAC library; overgo hybridization probe; universal probe.

1. Introduction
Comparative genomics is a powerful tool for inferring the function and history of genes, chromosomes, and entire genomes. Interspecies genomic sequence comparisons have proven to be particularly useful for the systematic identification of putative functional elements encoded within the human genome (1,2). The ability of comparative analysis to enhance our understanding of the human and other genomes has spurred a rapid expansion in the availability of genomic resources, including genomic libraries, from a wide variety of animals.

From: Methods in Molecular Biology: Phylogenomics Edited by: W. J. Murphy © Humana Press Inc., Totowa, NJ


Thomas

Genomic bacterial artificial chromosome (BAC) libraries are currently available for more than 80 vertebrates (see http://uprobe.genetics.emory.edu). This diverse collection of BAC libraries represents an exceptional resource for comparative genomic studies. In particular, the isolation of BAC clones containing genes or regions of interest from these libraries is a proven strategy for targeted comparative mapping and sequencing (3), comparative cytogenetic mapping (4), and functional studies (5). Moreover, the species breadth and depth of available BAC libraries can support the development of optimal and customized comparative genomic resources required to test specific hypotheses that cannot be adequately addressed with public whole-genome sequence assemblies (6). We have previously adapted standard methods for screening BAC libraries with overgo hybridization probes so that, through the design of ‘universal’ overgo probes, orthologous BAC-based physical maps can be constructed efficiently from multiple species in parallel (7,8). Here we present the fundamental design strategies and methods that can be used to develop or retrieve predesigned ‘universal’ overgo hybridization probes, and the accompanying protocols for probe labeling and BAC-library hybridization.

2. Materials
2.1. Universal Overgo Hybridization Probe Design

2.1.1. Equipment
Computer with a Unix operating system.

2.1.2. Software
1. Probe design software: soop_v2 and nsoop_v2 (http://uprobe.genetics.emory.edu/).
2. DNA sequence alignment programs: BLAST and MEGABLAST (http://www.ncbi.nlm.nih.gov), and TBA and BLASTZ (http://bio.cse.psu.edu).
3. The phylogenetic software PAML (required for nsoop_v2) (http://abacus.gene.ucl.ac.uk/software/paml.html).

2.2. Universal Overgo Hybridization Probe Labeling
2.2.1. Solutions and Small Supplies
1. Klenow enzyme (large fragment of DNA polymerase I).
2. Bovine serum albumin (2 mg/mL) (see Note 1).
3. [α-32P]dCTP and [α-32P]dATP (3000 Ci/mmol).
4. OLB solution. This is a mixture of solutions A, B, and C at the ratio 1:2.5:1.5. Store at −20°C (see Note 2).
   Solution A (store at −20°C): 1 mL solution O [1.25 M Tris-HCl (pH 8.0), 125 mM MgCl2], 18 µL β-mercaptoethanol (Sigma), 5 µL of 0.1 M dGTP, 5 µL of 0.1 M dTTP.
   Solution B (store at 4°C): 2 M HEPES–NaOH (pH 6.6).
   Solution C (store at 4°C): 3 mM Tris-HCl (pH 7.4), 0.2 mM EDTA (pH 8.0).
5. TE (pH 7.5): 10 mM Tris-HCl (pH 7.5), 1 mM EDTA (pH 8.0).
6. Nick columns (Sephadex G-50 drip columns, GE Healthcare/Amersham Biosciences).

Comparative Physical Mapping with Universal Probes


7. Large polystyrene weigh dishes (5½ × 5½ × ⅞ in., Fisher).
8. Nalgene polypropylene floating microtube racks.
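The OLB mixing ratio listed above can be converted into pipetting volumes with simple arithmetic. The following Python sketch is illustrative only: the function name `olb_volumes` is hypothetical and is not part of any published protocol or software. It assumes nothing beyond the 1:2.5:1.5 ratio of Solutions A, B, and C stated in the materials list.

```python
# Parts of Solutions A, B, and C in OLB, as stated in the materials list.
OLB_RATIO = {"A": 1.0, "B": 2.5, "C": 1.5}


def olb_volumes(total_ul):
    """Return the volume (microliters) of each solution for a given OLB volume."""
    if total_ul <= 0:
        raise ValueError("total volume must be positive")
    parts = sum(OLB_RATIO.values())  # 5 parts in total
    return {name: total_ul * share / parts for name, share in OLB_RATIO.items()}


if __name__ == "__main__":
    for name, vol in olb_volumes(100.0).items():
        print(f"Solution {name}: {vol:.1f} uL")
```

For 100 µL of OLB this works out to 20 µL of Solution A, 50 µL of Solution B, and 30 µL of Solution C, which agrees with the aliquot-based recipe given in Note 2.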

2.3. Universal Overgo Hybridization Probe BAC Library Hybridization
2.3.1. Equipment and Small Supplies
1. Hybridization oven and bottles (35 × 300 mm bottles for large filters and 35 × 150 mm bottles for small filters).
2. BAC library colony filters.
3. Plastic container with lid for washing filters (Clear Magazine Snap Case, 14 × 8⅝ × 3⅞ in., Rubbermaid).
4. Shaking water bath or platform shaker.
5. Forceps for manipulating filters (10-in. dressing forceps and 4.5-in. filter forceps, Fisher).
6. All-purpose laboratory wrap (18 in. × 1000 ft, Fisher).
7. X-ray film or storage phosphor screens (GE Healthcare/Amersham Biosciences) and cassettes. The minimum size of film, phosphor screen, and cassette that can be used with 22 × 22 cm colony filters is 24 × 30 cm.

2.3.2. Solutions
1. Church buffer. To make 4 L of Church buffer, warm 400 mL of dH2O on a hot plate with a stir bar and add 40 g of BSA. Stir until dissolved (this can take 1–2 h). In a 4-L beaker, dissolve 268 g of Na2HPO4·7H2O (Sigma) in 1.7 L of dH2O. Add 8 mL of 85% H3PO4 (Fisher) and make up the volume to 2 L; this is the phosphate buffer. To the phosphate buffer, add 8 mL of 0.5 M EDTA (pH 8.0), 280 g of SDS, and 800 mL of warm dH2O, and stir. Add the 400 mL of dissolved BSA and bring the final volume up to 4 L with warm dH2O. Once the components are completely in solution, pass the buffer through a coffee-filter-lined funnel into a 4-L jug for storage. Depending on the temperature of the room, the SDS may fall out of solution. If this occurs, simply warm and mix the solution until the SDS is back in solution.
2. Wash solutions: 2X SSC, 0.1% SDS; 1.5X SSC, 0.1% SDS; and 1.0X SSC, 0.1% SDS.
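When a batch other than 4 L is needed, the Church buffer recipe above can be scaled linearly. The short Python sketch below shows one way to do this; the names `CHURCH_4L` and `scale_church_buffer` are hypothetical, and the only inputs are the component amounts quoted in the recipe. The intermediate water volumes (400 mL, 1.7 L, 800 mL) are not modeled and would need to be scaled by the same factor.

```python
# Component amounts for 4 L of Church buffer, taken from the recipe above.
CHURCH_4L = {
    "BSA (g)": 40.0,
    "Na2HPO4.7H2O (g)": 268.0,
    "85% H3PO4 (mL)": 8.0,
    "0.5 M EDTA, pH 8.0 (mL)": 8.0,
    "SDS (g)": 280.0,
}


def scale_church_buffer(final_volume_l):
    """Scale each component linearly from the 4 L recipe to another final volume."""
    if final_volume_l <= 0:
        raise ValueError("final volume must be positive")
    factor = final_volume_l / 4.0
    return {component: amount * factor for component, amount in CHURCH_4L.items()}


if __name__ == "__main__":
    for component, amount in scale_church_buffer(1.0).items():
        print(f"{component}: {amount:g}")
```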

3. Methods
3.1.1. Universal Overgo Hybridization Probe Design
Hybridization-based screening of BAC libraries requires the development of probes that will anneal to the DNA of the target clone(s) of interest. Overgo probes are short double-stranded DNA probes that were developed specifically for the efficient hybridization-based screening of BAC libraries arrayed on high-density filters (9). Previously we established that 36-bp overgo probes designed from sequences conserved between two or more species, i.e., ‘universal’ probes, were highly effective at screening BAC libraries for the parallel construction of sequence-ready clone contig maps in multiple species (7,8). Quantitative studies correlating the success rate of ‘universal’ overgo probes in identifying positive clones with the divergence between the overgo-probe sequence and the target-species sequence revealed that probe sequences with up to three mismatches to the target sequence were compatible with successful hybridization (8). Thus, the rate-limiting step in the design of universal probes is the identification of 36-mers predicted to have three or fewer mismatches with the orthologous sequence from the target species of interest. Such sequences are relatively abundant within broad clades of vertebrates, such as placental mammals, and have allowed the creation of whole-genome sets of universal probes for screening eutherian BAC libraries (8). The abundance of highly conserved sequences between species is inversely correlated with divergence. Thus, the number of ‘universal’ probes that can be designed from any given region of the genome depends on the level of divergence within the target clade of interest (8). For example, many more ‘universal’ probes could be developed for screening great ape BAC libraries than could be developed for screening all mammalian BAC libraries.

To maximize the ability to design ‘universal’ probes with the highest success rates in a given clade of species, the general strategy outlined in Fig. 1A should be adopted. First, the ‘universal’ probes ideally should be designed from the sequence of a species within the clade of interest. By doing so, the divergence between the species from which the ‘universal’ probes are designed and the clade of interest is kept to a minimum, thus maximizing both the identity between the probe and target sequence and the power to predict the conservation between the probe and target sequence. Second, ideally at least one ‘outgroup’ to the clade of interest should be included in the sequence comparison used to identify the ‘universal’ probes. In particular, an outgroup with the minimum amount of divergence from the clade of interest provides the maximum power to predict the largest set of high-quality ‘universal’ probes.
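The three-mismatch criterion can be made concrete with a small sliding-window scan over a pairwise alignment. The Python sketch below is an illustration under stated assumptions and is not the soop_v2 or nsoop_v2 algorithm: the function name `candidate_windows` and the input convention (two equal-length alignment rows, with '-' for gaps and 'N' for masked bases) are hypothetical choices made for this example. Only the 36-bp probe length and the limit of three mismatches come from the text.

```python
PROBE_LEN = 36        # overgo probe length used in the text
MAX_MISMATCHES = 3    # mismatch tolerance reported in the text


def candidate_windows(ref_aln, other_aln, probe_len=PROBE_LEN,
                      max_mismatches=MAX_MISMATCHES):
    """Yield (start, probe, mismatches) for gap-free, unmasked windows.

    ref_aln and other_aln are two rows of a pairwise alignment of equal
    length; '-' marks a gap and 'N' a masked base (an input convention
    assumed for this sketch).
    """
    if len(ref_aln) != len(other_aln):
        raise ValueError("alignment rows must have equal length")
    ref, other = ref_aln.upper(), other_aln.upper()
    for start in range(len(ref) - probe_len + 1):
        r = ref[start:start + probe_len]
        o = other[start:start + probe_len]
        if "-" in r or "-" in o or "N" in r or "N" in o:
            continue  # skip gapped or repeat-masked windows
        mismatches = sum(1 for a, b in zip(r, o) if a != b)
        if mismatches <= max_mismatches:
            yield start, r, mismatches


if __name__ == "__main__":
    ref = "ACGT" * 10          # toy 40-bp reference row
    other = "T" + ref[1:]      # same row with one substitution
    for start, probe, mm in candidate_windows(ref, other):
        print(start, probe, mm)
```

A scan of this kind only measures conservation against the species included in the alignment; whether a window is a good probe for a third species remains a prediction.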
Multiple species comparisons can also increase the power to identify universal probes, especially when sequence from more than one species within the clade of interest and/or more than one closely related outgroup can be included in the probe design process.

Fig. 1. Strategy for ‘universal’ overgo hybridization probe design. (A) Two or more species are selected as the basis for identifying sequences predicted to contain three or fewer mismatches between the 36-bp overgo probe sequence and the orthologous sequence in a clade of species targeted for comparative physical mapping. Ideally, the species from which the universal probes will be designed is nested within the clade targeted for comparative physical mapping (see Species B), and at least one species used to identify the conserved sequences is a closely related outgroup (see Species A). (B) Orthologous genomic sequence alignments including two or more species masked for common repetitive elements (indicated by the Xs) are used as the template for ‘universal’ overgo hybridization probe design with the soop_v2 and nsoop_v2 algorithms. Candidate overgo probe sequences are then filtered using a MEGABLAST search to identify conserved 36-mers that are single-copy. An optimal set of probes is then selected from the single-copy sequences based on sequence divergence and physical spacing. Two overlapping and complementary oligos are then used in a Klenow fill-in reaction to generate radiolabeled probes (indicated by the asterisks), which are pooled and hybridized to one or more BAC libraries in parallel for clone isolation.

Two additional important criteria should be kept in mind when designing ‘universal’ probes. First, the physical spacing of the probes at regular intervals across the target region of interest is critical for the development of robust physical maps with long-range continuity (i.e., minimal clone gaps). For example, when a 10X BAC library with an average insert size of ~150 kb is screened with probes that have a ~95% success rate and are spaced every 50 kb across the region of interest, the result is a highly contiguous map in which each BAC clone will on average hybridize to three probes. At this probe density, high-quality probe-content physical maps can be assembled with physical mapping algorithms such as Segmap (10) or SAM (11) over small or large intervals. The isolated BAC clones can also be subjected to restriction-enzyme fingerprint analysis (12,13) to establish restriction-enzyme fingerprint contigs that can be combined with the probe-content information to select a minimal tiling path of clones for sequencing. As the success rate of ‘universal’ probes is expected to be lower than 95%, it is imperative to compensate for the probe failure rate in order to develop high-quality BAC contigs using ‘universal’ probes. For example, sets of ‘universal’ probes with success rates of ~50% but spaced every 30–40 kb are effective for developing highly contiguous physical maps (7). In addition, we recommend that at least six physically linked probes corresponding to the region of interest be designed for screening BAC libraries. Second, as the standard objective of a targeted mapping project is to develop an ordered set of clones from a single locus in the genome, the probes used to screen the library should hybridize only to the clones from the region of interest. That is, the probe sequences should be single-copy. In order to remove repetitive probes, common repetitive elements should be masked prior to probe design. Once candidate ‘universal’ probe sequences are identified based on conservation, (MEGA)BLAST searches can be used to further eliminate repetitive probes. Outlined below are the steps for designing ‘universal’ probes (see Fig. 1B) as well as a description of how to retrieve predesigned ‘universal’ probes from the public Uprobe web site (http://uprobe.genetics.emory.edu).
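The spacing examples in this section follow from simple arithmetic, sketched here in Python. This is a back-of-the-envelope model rather than part of the published method: it assumes evenly spaced probes, independent probe failures, and no edge effects, and the function names are hypothetical.

```python
def expected_hits_per_clone(insert_kb, spacing_kb, success_rate):
    """Average number of working probes that fall within one BAC insert."""
    if insert_kb <= 0 or spacing_kb <= 0:
        raise ValueError("insert size and spacing must be positive")
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    return (insert_kb / spacing_kb) * success_rate


def spacing_for_target_hits(insert_kb, success_rate, target_hits):
    """Probe spacing (kb) at which a clone averages target_hits working probes."""
    if target_hits <= 0:
        raise ValueError("target_hits must be positive")
    return insert_kb * success_rate / target_hits


if __name__ == "__main__":
    # 150-kb inserts, probes every 50 kb, ~95% success rate
    print(round(expected_hits_per_clone(150, 50, 0.95), 2))
    # 150-kb inserts, probes every 35 kb, ~50% success rate
    print(round(expected_hits_per_clone(150, 35, 0.50), 2))
    # spacing needed for an average of two working probes per clone at 50% success
    print(spacing_for_target_hits(150, 0.50, 2))
```

Under these assumptions a 150-kb insert spans three probes at 50-kb spacing (about 2.85 working probes at a 95% success rate), and at a 50% success rate a spacing of 37.5 kb still yields an average of two working probes per clone, which is in line with the 30–40 kb spacing recommended above.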

3.1.2. Strategy for Universal Overgo Hybridization Probe Design
1. Identify genomic sequence corresponding to the region of interest. For example, the sequence of a particular chromosomal segment can be readily retrieved from whole-genome assemblies through public web sites such as Ensembl (http://www.ensembl.org/) or the UCSC Genome Browser (http://www.genome.ucsc.edu).
2. Generate or retrieve orthologous genomic sequence alignments between two or more species corresponding to the region of interest. This can be accomplished with the alignment programs BLAST (14), BLASTZ (15), or TBA/MULTIZ (16). Alternatively, publicly available alignments (*.maf) can be downloaded from the UCSC Genome Browser using the Table Browser. The aligned sequences should be masked for repetitive elements (Ns for subsequent probe design with soop_v2, and either Ns or lowercase for probe design with nsoop_v2).
3. Determine which species will serve as the basis for the probe sequences. This should be the species with the least divergence from the target clade of interest.
4. Depending on your alignment format, select the appropriate ‘universal’ probe design algorithm. soop_v2 can be used with pairwise BLAST or BLASTZ alignments, and nsoop_v2 can be used to generate universal probes from *.maf alignments between two or more species generated by the alignment programs TBA/MULTIZ.
5. The intrinsic probe design parameters that should be used in all cases are: probe length = 36 bp; GC content: optimal = 50%, min = 44%, max = 56%.
6. Determine a probe score cutoff appropriate for your target set of species and desired probe success rate. In the case of soop_v2, this can simply be selected as the percent identity between the probe and comparative species. For nsoop_v2, an accessory program is provided to show the correlation between probe score and user-defined mismatch criteria.
7. Filter the candidate probe sequences by classifying each as likely single-copy (unique) or repetitive (nonunique) based on the output of a MEGABLAST search. Ideally, the candidate probe sequences should be compared to the most complete genomic sequence assembly available from the target clade of interest. For example, ‘universal’ probes designed for screening rodent libraries and based on mouse sequence could be screened against the sequenced mouse genome. Parameters for a fast and sensitive MEGABLAST search are: (megablast -t 16 -N 2 -W 11 -e 0.6 -F F -D 3). For comparisons of the probes back to their genome of origin, the following criteria can be used to classify the probes as unique or nonunique. Unique probe: a single identical match to the genome of origin, no other matches with a bit score above 40, and fewer than five matches with a bit score above 36. Nonunique probe: in addition to the expected identical match to the genome of origin, at least one other match with a bit score above 40, or five or more matches with bit scores above 36. Note that the e-value cutoff should be permissive enough that all matches with a bit score of 36 or greater are returned.
8. Select the final set of probes for hybridization based on optimal spacing and probe scores. The physical spacing of the probes is performed at an earlier stage with soop_v2 and at this point with nsoop_v2.
9. Synthesize/purchase overgo probe primers (see Note 3). No special modifications are required for the overgo primers, and they can be ordered/synthesized just like the oligos used as PCR primers. A small-scale synthesis will provide a vast abundance of primers for making overgo probes. Some BAC libraries are arrayed with anchor clones of known sequence that greatly aid in the accurate orientation for scoring of positive clones on the library filters. Therefore, whenever possible, primers for an overgo probe that will hybridize to the anchor clone should also be obtained.
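The unique/nonunique criteria in step 7 can be written as a short decision rule. The Python sketch below is one reading of those criteria and is not taken from the Uprobe software: the function name `classify_probe` is hypothetical, the input is assumed to be the bit scores of all matches other than the expected identical match to the genome of origin, and the count of matches above 36 is assumed to refer to those additional matches. Parsing of the MEGABLAST output is deliberately left out.

```python
def classify_probe(other_bit_scores, identical_matches=1,
                   strong_cutoff=40.0, weak_cutoff=36.0, max_weak=5):
    """Classify a probe as 'unique' or 'nonunique' from its genome matches.

    other_bit_scores: bit scores of every match except the expected
    identical match to the genome of origin (assumed input convention).
    identical_matches: number of identical matches found in that genome.
    """
    if identical_matches < 1:
        raise ValueError("the expected identical match is missing")
    if identical_matches > 1:
        return "nonunique"  # a second perfect copy exists in the genome
    strong = sum(1 for score in other_bit_scores if score > strong_cutoff)
    weak = sum(1 for score in other_bit_scores if score > weak_cutoff)
    if strong >= 1 or weak >= max_weak:
        return "nonunique"
    return "unique"


if __name__ == "__main__":
    print(classify_probe([]))                   # only the self match
    print(classify_probe([38.2, 37.0]))         # two weak secondary matches
    print(classify_probe([44.1]))               # one strong secondary match
    print(classify_probe([37.0] * 5))           # five weak secondary matches
```

Keeping the cutoffs as keyword arguments makes it easy to test how sensitive the final probe set is to the thresholds.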

3.1.3. Retrieval of Predesigned Universal Probes from Uprobe
Whole-genome sets of predesigned ‘universal’ probes for screening eutherian, marsupial, and reptilian BAC libraries are publicly available at Uprobe (http://uprobe.genetics.emory.edu) (8), and thus provide an alternative to the custom design of ‘universal’ probes described in the previous section. The Uprobe web site provides a number of options by which the public can find and download ‘universal’ probes for specific species and chromosomal regions of interest. Briefly, for one-time or limited use, ‘universal’ probes can be retrieved from any given region of the genome by simply specifying the target clade of interest, the chromosomal segment of interest (via gene name, chromosome position, or accession number), and the optimal physical spacing of the probes. For more complex queries, such as the batch retrieval of ‘universal’ probes from multiple locations in the genome or searches for probes based on chromosomal positions in genomes other than the probe genome of origin, an advanced query page is also available. Finally, the complete whole-genome probe sets can also be downloaded.

3.2. Universal Overgo Hybridization Probe Labeling
1. Resuspend individual overgo primers in an appropriate volume of TE to a final concentration of 100 µM. Store at −20°C.
2. From the 100 µM overgo primer solutions, make a 2 µM working overgo primer stock mixture for use in the labeling reaction. Aliquot 96 µL of sdH2O and add 2 µL of the forward and 2 µL of the reverse overgo primer solutions. Store at −20°C.
3. Aliquot 5.5 µL of the working primer stock mixture into a PCR tube and seal with a cap.
4. Denature and reanneal the overgo primers for 5 min at 80°C, 10 min at 37°C, and 2–15 min at 4°C (see Note 4).
5. While the overgo primers are being denatured and reannealed, make a master labeling cocktail of 4.5 µL for each probe consisting of 2 µL OLB, 0.5 µL BSA (2 mg/mL), 0.5 µL [α-32P]dATP (3000 Ci/mmol), 0.5 µL [α-32P]dCTP (3000 Ci/mmol), and 1 µL Klenow (2.0 units/µL) (see Note 5).
6. Mix the master labeling solution well by pipetting up and down (avoid making bubbles).
7. Initiate the labeling reaction by adding 4.5 µL of master labeling solution to each denatured and reannealed overgo primer pair. Pipet up and down 5–10 times to mix the solution well. Use one pipet tip per probe.
8. Reseal the PCR tubes and place in a shielded position for labeling at room temperature for a minimum of 1.5 h.
9. Equilibrate Sephadex G-50 drip Nick columns with TE (fill to the top of the column and let drip over a plastic weigh boat). Wait until the column has finished dripping before adding the labeling reaction (see Note 6).
10. Add the overgo labeling reactions to the Nick columns (see Note 7).
11. Add 400 µL of TE to the Nick columns, making sure the labeling reaction goes into the gel bed, and let drain.
12. Place 1.5 mL microcentrifuge collection tubes under the Nick columns and elute the probe with 400 µL of TE (see Note 8).
13. Store at 4°C. Labeled probes can be denatured up to two times and should be used within 1 wk.
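When many probes are labeled at once, the master labeling cocktail in step 5 has to be scaled up. The Python sketch below shows the arithmetic; the names `PER_PROBE_UL` and `master_mix` are hypothetical, and the fractional `overage` allowance for pipetting loss is an assumption added for illustration that is not specified in the protocol. The per-probe volumes are those listed in step 5.

```python
# Per-probe volumes (microliters) from step 5 of the labeling protocol.
PER_PROBE_UL = {
    "OLB": 2.0,
    "BSA (2 mg/mL)": 0.5,
    "[alpha-32P]dATP": 0.5,
    "[alpha-32P]dCTP": 0.5,
    "Klenow (2 units/uL)": 1.0,
}


def master_mix(n_probes, overage=0.1):
    """Volumes for n_probes labeling reactions plus a fractional overage.

    overage is a hypothetical allowance for pipetting loss (0.1 = 10%).
    """
    if n_probes < 1:
        raise ValueError("n_probes must be at least 1")
    if overage < 0:
        raise ValueError("overage cannot be negative")
    factor = n_probes * (1.0 + overage)
    return {reagent: vol * factor for reagent, vol in PER_PROBE_UL.items()}


if __name__ == "__main__":
    mix = master_mix(8, overage=0.0)
    for reagent, vol in mix.items():
        print(f"{reagent}: {vol:.1f} uL")
    print(f"total: {sum(mix.values()):.1f} uL")
```

With no overage, eight probes require 36 µL of cocktail in total, i.e., eight times the 4.5 µL added to each primer pair in step 7.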

3.3. Universal Overgo Hybridization Probe Hybridization to BAC Libraries
BAC libraries are commonly arrayed on 22 × 22 cm nylon filters and distributed to the public for hybridization-based screening of the libraries (17,18). The following protocol has been used to hybridize up to 192 ‘universal’ overgo probes to library filters and can be readily adapted for use with smaller filters.
1. Turn on the hybridization oven and set the temperature to 58°C.
2. Label the hybridization bottles. Up to seven large colony filters can be hybridized in a single 300 mm hybridization bottle (see Note 9).
3. Add 75 mL of Church buffer to each bottle, seal with the lid, and rotate in the hybridization oven for >30 min at 58°C (~6 rpm).
4. Warm 1 L of Church buffer to ~60°C.
5. Stack the colony filters in the plastic container (colony side down).
6. Pour 1 L of warmed Church buffer into the container (see Note 10).
7. Flip the individual filters over to equilibrate the filters in the Church buffer and make sure the filters are not sticking together. The colony side (DNA side) of the filters should all be facing up.
8. Roll all the filters (one on top of another) to be hybridized in the same hybridization bottle around a 50 mL plastic pipette tip.
9. Remove the prewarmed hybridization bottles and Church buffer from the hybridization oven and insert the rolled filters.
10. Place the sealed bottle back in the hybridization oven and rotate for at least 3 h (see Note 11).
11. Denature the overgo probes at 95°C for 5 min.
12. Remove the hybridization bottles from the hybridization oven and add the labeled overgo probe(s) to the center of each bottle (see Note 12).
13. Reseal the bottles, place them back in the hybridization oven, and rotate overnight (16–20 h) at 58°C. Make sure the filters maintain even and fixed contact with the interior of the bottle during the hybridization by orienting the bottles in the same direction within the hybridization oven as in the prehybridization step.
14. Pour off the hybridization solution, pull the filters out of the bottle using forceps, and place them in the plastic container. Up to 14 filters can be washed in a single container (see Note 13).
15. Add 1 L of 1.5X SSC, 0.1% SDS to the container and individually flip and neatly stack the filters in the center of the container with forceps.
16. Place in the shaking water bath and rotate at 58°C for 30 min (see Note 14).
17. Pour off the first wash solution and add 1 L of 1.0X SSC, 0.1% SDS. Flip the filters as before.
18. Place in the shaking water bath and rotate at 58°C for 30 min.
19. Pour off the second wash solution.
20. Place a sheet of plastic wrap on the bench and put a single filter colony/DNA or filter label side down on the plastic wrap (see Note 15).
21. Wrap three sides of the filter by folding over the plastic wrap, and slide a paper towel with pressure across the wrapped filter toward the open end of the plastic wrap to push out excess liquid.
22. Fold the fourth side of the plastic wrap to completely seal the filter and wipe the top exterior of the wrapped filter to remove any residual moisture.
23. Place the wrapped filters in labeled cassettes and expose to X-ray film or phosphor screens (see Note 16).
24. Develop the film or scan the phosphor screens after 4–24 h (see Note 17).
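For planning a screen, the capacities quoted in the protocol above (seven large filters per 300 mm bottle, 14 filters per wash container, 75 mL of Church buffer per bottle, and 1 L volumes for equilibration and washes) can be combined into a simple supply estimate. The Python sketch below is illustrative only: the function name `plan_hybridization` is hypothetical, and it assumes that the filters are equilibrated in the same containers used for washing, with one 1 L volume of Church buffer per container.

```python
import math

FILTERS_PER_BOTTLE = 7       # large filters per 300 mm bottle (step 2)
FILTERS_PER_CONTAINER = 14   # filters per wash container (step 14)
CHURCH_PER_BOTTLE_ML = 75    # Church buffer per bottle (step 3)
EQUILIBRATION_ML = 1000      # Church buffer per container (step 6); assumed per container
WASH_ML = 1000               # volume of each wash solution per container (steps 15, 17)


def plan_hybridization(n_filters):
    """Estimate bottles, containers, and buffer volumes for n_filters large filters."""
    if n_filters < 1:
        raise ValueError("n_filters must be at least 1")
    bottles = math.ceil(n_filters / FILTERS_PER_BOTTLE)
    containers = math.ceil(n_filters / FILTERS_PER_CONTAINER)
    return {
        "bottles": bottles,
        "containers": containers,
        "church_buffer_ml": bottles * CHURCH_PER_BOTTLE_ML
                            + containers * EQUILIBRATION_ML,
        "wash_1.5x_ssc_ml": containers * WASH_ML,
        "wash_1.0x_ssc_ml": containers * WASH_ML,
    }


if __name__ == "__main__":
    for item, amount in plan_hybridization(20).items():
        print(f"{item}: {amount}")
```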

4. Notes
1. To make 2 mg/mL BSA, simply dilute the BSA solution that is commonly provided for use with restriction enzymes (~50 mg/mL).


2. Make Solution A on ice and aliquot 20 µL into each of 50 1.5-mL microcentrifuge tubes. Keep stock solutions A, B, and C separate for long-term storage. To make OLB as needed for use in the labeling reaction and short-term storage, add 50 µL of Solution B and 30 µL of Solution C to one 20 µL aliquot of Solution A.
3. Ordering oligos in 96-well plate format facilitates the use of large numbers of probes, particularly for keeping track of probes and two-dimensional screening (19).
4. A thermocycler can be used for denaturing and reannealing the overgo primers.
5. We have used labeling reactions that vary in size from a total reaction volume of 5–15 µL. Scale all the reagents to the size of the reaction relative to the recipe given here except the Klenow, which in the case of the 5 µL reaction volume remains 2 units (5 U/µL Klenow can be used for the smaller reactions). Also note that the OLB and BSA are mixed on ice, and then the isotope and Klenow are added at room temperature with proper shielding. Once the isotope and Klenow are added to the labeling cocktail, the labeling reaction should be started immediately.
6. To hold the Nick columns, we use the square Nalgene Polypropylene Floating Microtube Rack with four legs and 16 slots for tubes. To collect the labeled probe, place the collection tubes in a second rack and stack the rack holding the Nick columns on top of the rack holding the collection tubes.
7. When labeling probes that will be pooled in the hybridization, a single Nick column can be used to simultaneously remove unincorporated nucleotides from multiple labeling reactions up to the maximum volume recommended by the manufacturer (100 µL). To organize and track which probes are going into which Nick columns, label the Nick columns and collection tubes in a standard naming convention, such as Row1, and so on.
8. The efficiency of the labeling reaction can be measured by the percent incorporation of a control probe (the probe that detects anchor clones on the filters) and one individual probe or single pool of probes. To do this, once the labeling reaction is complete, dilute the anchor probe labeling reaction up to 50 µL with TE, mix, and add 1 µL to a scintillation vial for counting. After collecting the labeled probes from the Nick columns, put 1 µL of the labeled anchor probe and test probe in separate scintillation vials for counting. Record the cpm with a scintillation counter and calculate the % incorporation. This value can vary greatly, but in our experience the raw counts of the anchor probe labeling reaction after it has been passed over the column with fresh isotope should be greater than 10,000 cpm. Counts below 1000 cpm are considered failed labeling reactions.
9. Hybridization of smaller filters (12.5 × 8 cm or 11 × 7 cm) can be done in 150 mm bottles. If fewer than four small filters are hybridized in a small bottle, the following modifications to the library hybridization protocol can be used. Equilibrate filters in Church buffer and prehybridize at 58°C in 5 mL of Church buffer for >20 min in the hybridization oven. For washing the small filters, use the same wash time, temperature, and solutions as in the library hybridization, but instead of removing the filters from the bottle for the wash steps, simply perform the washing steps in the bottle using the hybridization oven. Specifically, after pouring off the hybridization buffer, add 20 mL of wash solution to the bottles and rotate in the hybridization oven.


10. A separate stock solution of Church buffer for filter equilibration can be reused several times.
11. The orientation of the bottle in the hybridization oven (the bottle cap facing the left or the right) should be adjusted such that the rotation in the hybridization oven ‘unrolls’ the filters so that they make even and fixed contact with the interior of the bottle. To test if the orientation of the bottle is correct, simply monitor the filters to see if they unroll so that they make even fixed contact with the interior of the bottle, or remain tightly wound and roll inside the bottles. If the filters remain wound, simply change the orientation of the bottle so the filters unroll.
12. The amount of probe added to each bottle can vary. For library screens, for each bottle we recommend using one-quarter of the labeled probe from a 10-µL labeling reaction, and one-eighth of the labeled probe from a 15-µL labeling reaction. For hybridizations to secondary filters, we recommend using one-eighth of the probe from a 10-µL labeling reaction for each bottle. If possible, we also recommend using an overgo probe that recognizes the ‘anchor’ clones on the filters, as this will greatly aid in orienting the filters for accurately scoring the positive clones.
13. In cases where no shaking water bath is available, the following protocol can be used to wash large library filters. The key to this modification is to make sure that the wash solution stays as close to 58°C as possible during the 30-min washing steps. Before beginning the washes, for every two hybridization bottles, microwave 1 L of 1.5X SSC, 0.1% SDS for about 3 min (5 min for 2 L) to heat the wash solution just above 58°C. Place wash solution in a water bath at 62°C until ready to use. Pour off the hybridization buffer and add 150 mL 2X SSC, 0.1% SDS to each of the hybridization bottles. Tighten the caps and rotate in the hybridization oven at 58°C for 15 min. While washing the filters, place two Pyrex Portables hot packs in the microwave for 3 min and 30 s. Then place them in the bottom of a Pyrex Portable. Remove the filters from the bottles and place in a modified plastic container. (A modified plastic container can be made as follows. Using a Clear Magazine Snap Case (14 × 8 5/8 × 3 7/8, Rubbermaid), cut the two hinges from the top of the box so it will detach completely. Cut a piece of packing Styrofoam to tightly fit into the lid. Place the Styrofoam in a large Ziploc bag and wedge into the lid of the box. Take a 15-mL conical tube and cut it in half. Cut a hole in the lid of the box that goes through the Styrofoam and the Ziploc bag exactly the size of the conical tube (be sure it is not any larger so the lip on the tube will prevent the tube from going all the way through the hole). Next slide the half of the tube with the cap through the hole all the way up to the lid and seal around the tube with a glue gun.) Pour a little 2X SSC, 0.1% SDS over the filters. Flip the filters one by one so they face down neatly stacked on one another. This will prevent the filters from sticking to one another and allow all filters to be washed properly. Place the lid on the box and place inside a Pyrex Portable on top of the hot packs for insulation. Remove the cap on the conical tube and place a funnel into the tube. Pour the prewarmed 1.5X SSC, 0.1% SDS into the funnel as quickly as possible to prevent the wash from cooling down. Quickly remove the funnel, replace the cap, and zip up the container. Place on platform shaker for 30 min. As soon as you have started the 2nd wash, microwave the 1X SSC, 0.1% SDS wash solution (as before) and place in water bath at 62°C. Pour off first wash solution, flip the filters and add the 1X SSC, 0.1% SDS wash solution (as before) to the container, and seal in Pyrex Portable. Place on platform shaker for 30 min. Proceed with step 19 under Subheading 3.2.
14. The filters should be washed with enough rotation such that the washing solution is moving back and forth across the filters but at a slow enough rate that the filters do not get stuck to either end of the container. For example, with a Lab-line rotator platform, a speed of ~38 rpm is appropriate.
15. It is standard convention for library filters or custom secondary filters to be labeled with a serial number or other unique identifiers. Typically this label is on the colony/DNA side and is required to properly orient the filters. For example, filters from BACPAC Resources have labels in the top left, right, and center.
16. One large 22 × 22 cm filter or up to six small filters (11 × 7 cm) can be exposed with a 24 × 30 cm screen and cassette. Place the filters in a standard orientation relative to the cassette and phosphor screen and tape them in place. Whatman paper at the bottom of the cassettes will help absorb any excess moisture.
17. For library filters, overnight exposures are standard with film, and typically 4–24 h with phosphor screens. Secondary filters can usually be exposed after 8 h on film, or 4 h with phosphor screens. Because the optimal exposure time might vary, we recommend developing a single film after a set exposure time and then evaluating whether or not the other filters should be developed at the same time or left for a longer exposure. If the library filters include a control anchor clone, even light exposures can be accurately oriented to the filter grid for scoring positive clones. However, in the absence of an anchor clone, darker exposures are required to accurately align and then score positive clones to the filter grid. We also recommend calculating the number of expected positive clones based on library depth and the size of the targeted interval from which the probes were designed and comparing that value to the number of positive clones observed. For example, for a 10X library with an average insert size of 150 kb, the number of positive clones expected per megabase interval is (10)(1/0.150) ≈ 67. In cases where many more (threefold) positive clones are observed than expected, for economic reasons it may be advisable to selectively score a subset of positive clones with the strongest hybridization signal. If the number of positive clones observed is greater than threefold the expected number, this is most likely the result of one or more of the probes identifying clones from multiple regions of the genome, i.e., repetitive probe(s). If necessary, additional hybridizations with subsets of probes on individual filters can be used to screen for and then eliminate the repetitive probe(s) from future hybridizations.
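The expected-clone arithmetic in Note 17 can be sketched in a few lines of Python. This is only an illustration: the function names and the default threefold cutoff parameter are assumptions introduced here, not part of the protocol.

```python
def expected_positive_clones(coverage, insert_kb, interval_mb=1.0):
    """Expected clones for an interval: coverage x (interval size / insert size)."""
    if insert_kb <= 0:
        raise ValueError("insert_kb must be positive")
    return coverage * (interval_mb * 1000.0) / insert_kb


def looks_repetitive(observed, expected, fold=3.0):
    """True when observed positives exceed `fold` times the expectation."""
    return observed > fold * expected


# Worked example from the text: 10X library, 150-kb inserts, 1-Mb interval.
expected = expected_positive_clones(coverage=10, insert_kb=150)
print(round(expected))  # 67
```

A screen returning well over three times this value would, by the rule of thumb above, suggest that one or more probes in the pool are repetitive.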

Acknowledgments
The author would like to acknowledge Wendy Kellner and Bob Sullivan for their contributions to this chapter. This work was supported by the NIH (U01MH068185).


References
1. Waterston, R. H., Lindblad-Toh, K., Birney, E., et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562.
2. Boffelli, D., McAuliffe, J., Ovcharenko, D., et al. (2003) Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 299, 1391–1394.
3. Thomas, J. W., Touchman, J. W., Blakesley, R. W., et al. (2003) Comparative analyses of multi-species sequences from targeted genomic regions. Nature 424, 788–793.
4. Misceo, D., Cardone, M. F., Carbone, L., et al. (2005) Evolutionary history of chromosome 20. Mol. Biol. Evol. 22, 360–366.
5. Ross, S. R., Schofield, J. J., Farr, C. J., and Bucan, M. (2002) Mouse transferrin receptor 1 is the cell entry receptor for mouse mammary tumor virus. Proc. Natl Acad. Sci. USA 99, 12,386–12,390.
6. Angata, T., Margulies, E. H., Green, E. D., and Varki, A. (2004) Large-scale sequencing of the CD33-related Siglec gene cluster in five mammalian species reveals rapid evolution by multiple mechanisms. Proc. Natl Acad. Sci. USA 101, 13,251–13,256.
7. Thomas, J. W., Prasad, A. B., Summers, T. J., et al. (2002) Parallel construction of orthologous sequence-ready clone contig maps in multiple species. Genome Res. 12, 1277–1285.
8. Kellner, W. A., Sullivan, R. T., Carlson, B. H., and Thomas, J. W. (2005) Uprobe: A genome-wide universal probe resource for comparative physical mapping in vertebrates. Genome Res. 15, 166–173.
9. Vollrath, D. (1999) DNA markers for physical mapping. In: Genome Analysis: A Laboratory Manual, Volume 4: Mapping Genomes (Birren, B., Green, E. D., Hieter, P., et al. eds), pp. 187–215, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY.
10. Green, E. D. and Green, P. (1991) Sequence-tagged site (STS) content mapping of human chromosomes: theoretical considerations and early experiences. PCR Methods Appl. 1, 77–90.
11. Soderlund, C. and Dunham, I. (1995) SAM: a system for iteratively building marker maps. Comput. Appl. Biosci. 11, 645–655.
12. Marra, M. A., Kucaba, T. A., Dietrich, N. L., et al. (1997) High throughput fingerprint analysis of large-insert clones. Genome Res. 7, 1072–1084.
13. Schein, J., Kucaba, T., Sekhon, M., Smailus, D., Waterston, R., and Marra, M. (2004) High-throughput BAC fingerprinting. In: Bacterial Artificial Chromosomes, Volume 1: Library Construction, Physical Mapping, and Sequencing (Zhao, S. and Stodolsky, M., eds), pp. 143–156, Humana, Totowa, NJ.
14. Altschul, S. F., Madden, T. L., Schaffer, A. A., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402.
15. Schwartz, S., Kent, W. J., Smit, A., et al. (2003) Human-mouse alignments with BLASTZ. Genome Res. 13, 103–107.
16. Blanchette, M., Kent, W. J., Riemer, C., et al. (2004) Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14, 708–715.
17. Dunham, I., Dewar, K., Kim, U.-J., and Ross, M. (1998) Bacterial cloning systems. In: Genome Analysis: A Laboratory Manual, Volume 3: Bacterial Cloning Systems (Birren, B., Green, E. D., Klapholz, S., et al. eds), pp. 1–86, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY.
18. Osoegawa, K. and de Jong, P. (2004) BAC library construction. In: Bacterial Artificial Chromosomes, Volume 1: Library Construction, Physical Mapping, and Sequencing (Zhao, S. and Stodolsky, M., eds), pp. 1–46, Humana, Totowa, NJ.
19. Jamison, D. C., Thomas, J. W., and Green, E. D. (2000) ComboScreen facilitates the multiplex hybridization-based screening of high-density clone arrays. Bioinformatics 16, 678–684.

9
Phylogenomic Resources at the UCSC Genome Browser
Kate Rosenbloom, James Taylor, Stephen Schaeffer, Jim Kent, David Haussler, and Webb Miller

Summary
The UC Santa Cruz Genome Browser provides a number of resources that can be used for phylogenomic studies, including (1) whole-genome sequence data from a number of vertebrate species, (2) pairwise alignments of the human genome sequence to a number of other vertebrate genomes, (3) a simultaneous alignment of 17 vertebrate genomes (most of them incompletely sequenced) that covers all of the human sequence, (4) several independent sets of multiple alignments covering 1% of the human genome (ENCODE regions), (5) extensive sequence annotation for interpreting those sequences and alignments, and (6) sequence, alignments, and annotations from certain other species, including an alignment of nine insect genomes. We illustrate the use of these resources in the context of assigning rare genomic changes to the branch of the phylogenetic tree where they appear to have occurred, or of looking for evidence supporting a particular possible tree topology. Sample source code for performing such studies is available.
Key Words: Evolutionary event; phylogenetic tree; interspersed repeat; chromosomal break.

1. Introduction
Rare genomic changes (RGCs), such as retroposon integrations, indels (insertions/deletions) in protein-coding regions, and inversions, provide useful alternatives to the use of nucleotide and amino-acid substitutions for determining the evolutionary relationships among living organisms (1). If one knows where in the genome to look, the UCSC Genome Browser (2,3) and the associated database (4) make it easy to see evidence that an RGC occurred at a particular point in evolutionary history.
From: Methods in Molecular Biology: Phylogenomics Edited by: W. J. Murphy © Humana Press Inc., Totowa, NJ


Fig. 1. A proposed evolutionary tree for several mammals, as discussed in this manuscript. Only the tree topology is sketched; branch lengths have no meaning.

Fig. 2. View in the UCSC Human Genome Browser of alignments of opossum, mouse and dog to a region of human chromosome 13, showing a reversal that appears to have occurred along branch A in the tree of Fig. 1.

For instance, it is widely held (5) that the human lineage diverged from dog (and other so-called laurasiatheres) before it diverged from mouse. Thus, we can expect to observe evidence for genomic changes that occurred on the branch labeled A in Fig. 1. Figure 2 shows the black-and-white version of a view in the Browser of a 4500-bp interval of human chromosome 13 (positions relative to the May 2004 human assembly). Dog shows a local inversion relative to human and mouse, and the outgroup opossum indicates that dog is in the ancestral orientation. The most parsimonious explanation is that a single inversion event occurred on branch A.

On the other hand, if one is not told where in the genome to look, a search for RGCs that occurred along a hypothesized short or ancient branch cannot be performed efficiently by eye; too many evolutionary events are recorded to make it feasible to start looking through them one at a time. For a systematic attack on the problem, it is often necessary to download large amounts of data and process them with custom-built software. Fortunately, a graduate student or talented undergraduate with a strong programming background can frequently write such programs, provided they get over a few initial hurdles. This chapter contains much of the necessary information.

A wealth of genomic data is available from the UCSC Browser, as well as alternative data providers, such as Ensembl (6), NCBI (7), and Galaxy (8). Sequence data, alignments, and annotations can be downloaded from the Table Browser (9) or the UCSC “downloads” page, then searched in arbitrary ways. The aim of this chapter is to illustrate how this can be done through three examples.

The general problem treated by the examples is assigning RGCs to branches of the phylogenetic tree. Such an assignment is a natural part of an attempt to reconstruct evolutionary histories, as has been proposed for eutherian mammals (10). The first example focuses on a search for informative retroposon integrations in the 1% of the human genome encompassed by the 44 regions chosen for extensive study by the ENCODE project (11).
For the second example, we look across the entire human genome for indels in protein-coding regions that provide evidence concerning the earliest divergences among eutherian mammals, e.g., perhaps occurring in the human lineage after divergence from elephants (afrotheres) but before divergence from armadillo (xenarthrans) (branch B in Fig. 1). In the third example, we look for particular kinds of chromosomal breaks in an alignment of nine insect genomes.

To help the reader get started writing such programs, we make the source code used in those examples freely available at http://bio.cse.psu.edu/miller_lab/ under the title “Phylogenomic Tools.” Although many biologists may hesitate to venture into such a project, we encourage them to consider hiring someone with solid programming skills to perform whole-genome searches, helped by the clues we provide here.

2. Methods
2.1. Interspersed Repeats in ENCODE Regions
As illustrated by Chapters 14 and 15 in this volume, interspersed repeat elements can sometimes be used to answer phylogenomic questions. In particular, insertion events that occurred very early in the mammalian radiation can provide evidence to support certain hypotheses about phylogenetic relationships, such as that the human lineage diverged from dogs and cows before it diverged from mice (12), and that horses are more closely related to dogs than to cows (13). A repetitive element that may have inserted on the branch labeled “A” in Fig. 1 is shown in Fig. 3. Two important characteristics are that (i) a large fraction of the element aligns with mouse (for this to be possible, the element cannot be completely masked, e.g., by replacing each of its nucleotides by “N”), and (ii) none of it aligns with dog. Those properties also hold for segments that were deleted in the dog lineage after it diverged from human and mouse, but here we have additional information. Namely, we recognize the region as an insertion element; for this to be a deletion in dog, the deleted interval would need to correspond to within a few nucleotides of the inserted element. (This can be seen by inspecting the alignment in detail, noting that the positions that align just to either side of the repetitive element are only a few nucleotides apart in dog.) Moreover, the element belongs to the family MLT1A0, which is known to have been transpositionally active early in the mammalian radiation. (For a list of such families, see Table 6 of the mouse genome paper in ref. 14.) Finally, the percent identity with the consensus sequence for the class, 20.8%, is consistent with what one expects for an insertion that occurred early, just before human–mouse divergence.

Fig. 3. Browser view of an interspersed repeat element in ENCODE region ENm001 that appears to have been inserted along branch A in Fig. 1.


For further evidence, one can hope to identify the flanking direct repeats at the borders of the human element, and find a single copy of that sequence at the appropriate place in dog. We wrote some simple computer programs to help search through large genomic regions for retroposons that are likely candidates for having been inserted at a particular branch of the phylogenetic tree. The programs are essentially the same as those we used earlier to find examples (12,13). They are run on “soft-masked” sequence, i.e., where repetitive DNA is given with lower-case letters (“a”, “c”, “g”, and “t”), and the “.out” files produced by RepeatMasker (a table of information about the identified repeat elements). Both kinds of data can be downloaded from UCSC. The sequences are aligned with the blastz program (15), which can be freely obtained from http://www.bx.psu.edu/miller_lab/. Some small programs in the “Phylogenomic Tools” package (from the same website as blastz) read the RepeatMasker “.out” files and the blastz alignments, then report repeat elements with the desired property. When we downloaded sequences and “.out” files for the ENCODE region ENm001, the tools identified the repeat element shown in Fig. 3. The Phylogenomic Tools package contains detailed instructions for how to proceed. We ran our programs on all ENCODE regions, looking for repeats that inserted on branch A. We ignored Alu repeats (because they inserted after human–mouse divergence), as well as MIR and L2 repeats (because they are thought to have died out before the eutherian radiation). Four MLT1A0 repeats and two L1s were identified. As a check for specificity (i.e., lack of false positives), we reversed the roles of mouse and dog; no repeats were predicted to have inserted after the human lineage diverged from mouse but before it diverged from dog.
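The selection criteria described above can be sketched as follows. This is a minimal illustration, not the Phylogenomic Tools code: the function names, the tuple layouts, the family labels, and the thresholds (`min_mouse_frac`, `flank`, `max_dog_gap`) are all assumptions chosen for the example, and the alignment blocks are assumed to have been extracted already from the blastz output.

```python
def covered_bases(start, end, intervals):
    """Count bases of [start, end) covered by half-open, non-overlapping intervals."""
    return sum(max(0, min(end, e) - max(start, s)) for s, e in intervals)


def insertion_candidates(repeats, mouse_blocks, dog_blocks,
                         min_mouse_frac=0.5, flank=50, max_dog_gap=10,
                         skip_families=("Alu", "MIR", "L2")):
    """Yield names of repeats that align to mouse but are absent from dog.

    repeats:      (name, family, start, end) in human coordinates
    mouse_blocks: (h_start, h_end) human intervals aligned to mouse
    dog_blocks:   (h_start, h_end, d_start, d_end) ungapped human/dog blocks
                  on the same strand, sorted by h_start
    """
    dog_human = [(b[0], b[1]) for b in dog_blocks]
    for name, family, start, end in repeats:
        if family in skip_families:
            continue  # families judged uninformative for this branch
        if covered_bases(start, end, mouse_blocks) < min_mouse_frac * (end - start):
            continue  # (i) a large fraction must align with mouse
        if covered_bases(start, end, dog_human) > 0:
            continue  # (ii) none of it may align with dog
        # The flanks must align to dog, only a few nucleotides apart there.
        left = [b for b in dog_blocks if start - flank <= b[1] <= start]
        right = [b for b in dog_blocks if end <= b[0] <= end + flank]
        if left and right and 0 <= right[0][2] - left[-1][3] <= max_dog_gap:
            yield name


# Hypothetical data: one MLT1A0 element and one Alu in a human region.
repeats = [("MLT1A0", "ERVL-MaLR", 1000, 1400), ("AluSx", "Alu", 2000, 2300)]
mouse = [(900, 1350)]
dog = [(800, 998, 5000, 5198), (1402, 1600, 5201, 5399)]
print(list(insertion_candidates(repeats, mouse, dog)))
```

Swapping the mouse and dog inputs gives the specificity check mentioned in the text.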

2.2. Coding Indels
Indels (insertions/deletions) in protein-coding regions are another class of RGC that has been utilized for phylogenetic analysis (16), including deletions that apparently happened on branch A of Fig. 1 (17). We have written several programs to automate most of the job of searching UCSC-generated alignments for informative coding indels and tested them in a search for events on branch B of Fig. 1. The first program reads a file of gene locations downloaded from UCSC and produces a simplified list of coding exons (including the removal of redundant copies in the case of splice isoforms). The second program reads the list of exons and three pairwise alignments of human to, say, armadillo, elephant, and opossum, and looks for human exons that align without a gap to armadillo, but align to elephant and opossum with gaps (lengths divisible by 3) in precisely the same places. See Fig. 4 for one of the exons that it found.
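The first step, reducing a gene table to a non-redundant list of coding exons, might look like the sketch below. It assumes the tab-delimited column order of the UCSC refFlat table (geneName, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds); the function name and the example records are hypothetical, and the authors' own program may differ in detail.

```python
def coding_exons(refflat_lines):
    """Return sorted, unique (chrom, start, end, strand) coding-exon intervals."""
    seen = set()
    for line in refflat_lines:
        f = line.rstrip("\n").split("\t")
        chrom, strand = f[2], f[3]
        cds_start, cds_end = int(f[6]), int(f[7])
        if cds_start >= cds_end:
            continue  # transcript has no coding region
        starts = [int(x) for x in f[9].split(",") if x]
        ends = [int(x) for x in f[10].split(",") if x]
        for s, e in zip(starts, ends):
            s, e = max(s, cds_start), min(e, cds_end)  # clip exon to the CDS
            if s < e:
                seen.add((chrom, s, e, strand))  # the set drops isoform duplicates
    return sorted(seen)


# Two hypothetical isoforms of one gene that share their first and last exons.
lines = [
    "GENE1\tNM_1\tchr1\t+\t100\t1000\t150\t900\t3\t100,400,800,\t200,500,1000,",
    "GENE1\tNM_2\tchr1\t+\t100\t1000\t150\t900\t2\t100,800,\t200,1000,",
]
for exon in coding_exons(lines):
    print(exon)
```

Using a set keyed on the clipped interval is one simple way to remove the redundant exon copies contributed by splice isoforms.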


Fig. 4. Browser view of an alternatively spliced exon of the RTN3 gene, showing two codons in the human, mouse, dog, and armadillo genes that are missing in elephant and opossum. This is a candidate for an insertion on branch B of Fig. 1.


The most parsimonious explanation is that two codons were inserted in a single event on branch B.
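The test applied to each exon can be illustrated with a short sketch. It assumes each pairwise alignment of the exon is available as two equal-length gapped strings (human row, other-species row); the function names and the toy sequences are hypothetical and are not taken from the package.

```python
import re


def gaps(human_row, other_row):
    """Return {(human_offset, row, length)} for every gap run in the alignment.

    human_offset is the number of human bases preceding the gap; row records
    which sequence ("human" or "other") contains the '-' characters.
    """
    assert len(human_row) == len(other_row)
    found = set()
    for row_name, row in (("human", human_row), ("other", other_row)):
        for m in re.finditer(r"-+", row):
            offset = m.start() - human_row[:m.start()].count("-")
            found.add((offset, row_name, m.end() - m.start()))
    return found


def supports_branch(ungapped_species, gapped_a, gapped_b):
    """True if the exon aligns gap-free to one species while the other two
    share an identical, non-empty set of gaps whose lengths are multiples of 3."""
    if gaps(*ungapped_species):
        return False
    ga, gb = gaps(*gapped_a), gaps(*gapped_b)
    return bool(ga) and ga == gb and all(n % 3 == 0 for _, _, n in ga)


# Toy exon: two codons present in human and armadillo, absent elsewhere.
human = "ATGGCCAAATTTGGG"
armadillo = (human, "ATGGCTAAATTCGGG")
elephant = (human, "ATGGCC------GGG")
opossum = (human, "ATGGCA------GGA")
print(supports_branch(armadillo, elephant, opossum))  # True
```

Requiring the gap sets to match exactly reflects the "precisely the same places" condition; a looser tolerance would admit independent indels at nearby positions.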

2.3. Chromosomal Breaks in Drosophila
One of the earliest uses of genetic data to reconstruct phylogenies was by Dobzhansky and Sturtevant when they examined polymorphisms in the polytene chromosomes isolated from Drosophila salivary gland cells (18). The polytene chromosomes have reproducible banding and puffing patterns that reflect the order of genes on chromosomes. Dobzhansky and Sturtevant showed that Drosophila pseudoobscura had a wealth of gene order differences in natural populations and inferred that the paracentric inversion mutations could relate the different chromosomes to each other in an unrooted network (19). Remarkably, all of the intermediate chromosomes in the network of gene arrangements have been collected from nature over the years with one exception, the Hypothetical chromosome (20). It is now well documented that chromosomal inversions have been important in the evolution of Drosophila because many of the species are polymorphic for gene order based on polytene chromosomal data (21). The banding and puffing patterns of polytene chromosomes of different Drosophila species are quite different and prevent direct comparison of gene order. With the advent of complete genomic sequencing, we can now infer ancestral rearrangements between species by comparing inferred gene orders. Comparison of two species of Drosophila has shown that at least 10 inversions occur per million years (22) and that in some cases repetitive elements may be responsible for the process of rearrangement. These analyses also show that some regions tend to resist rearrangement because the size of conserved linkage blocks is too large given the number of inversions that have occurred over evolutionary time, suggesting a functional conservation of gene order to maintain coordinate regulation (23).
Our software toolkit contains programs to read multiway alignments downloaded from UC Santa Cruz and to look for places where one group of species shows conservation of order and orientation of genomic features, but other species have a chromosomal break. See Fig. 5 for an example.
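A simplified version of such a search is sketched below. It assumes the multiway alignment has already been reduced to a list of blocks ordered along the reference genome, each mapping a species label to (sequence, start, end, strand) with coordinates given relative to the aligned strand, so that colinear blocks always have increasing positions. The species labels, the `max_gap` threshold, and the function names are illustrative assumptions rather than the toolkit's actual interface.

```python
def colinear(prev, nxt, max_gap=10000):
    """Adjacent blocks are colinear if they lie on the same sequence and strand
    and the second begins shortly after the first ends."""
    (seq1, _start1, end1, strand1), (seq2, start2, _end2, strand2) = prev, nxt
    return seq1 == seq2 and strand1 == strand2 and 0 <= start2 - end1 <= max_gap


def find_breaks(blocks, conserved, broken):
    """Return indices i where all `conserved` species are colinear between
    block i and block i+1 while none of the `broken` species are."""
    needed = list(conserved) + list(broken)
    hits = []
    for i in range(len(blocks) - 1):
        a, b = blocks[i], blocks[i + 1]
        if not all(s in a and s in b for s in needed):
            continue  # skip junctions where a required species is missing
        if (all(colinear(a[s], b[s]) for s in conserved)
                and not any(colinear(a[s], b[s]) for s in broken)):
            hits.append(i)
    return hits


# Hypothetical blocks: "pse" changes strand and position after the first block.
blocks = [
    {"mel": ("chr2R", 100, 200, "+"), "sim": ("chr2R", 90, 190, "+"), "pse": ("chr3", 500, 600, "+")},
    {"mel": ("chr2R", 250, 350, "+"), "sim": ("chr2R", 240, 340, "+"), "pse": ("chr3", 90000, 90100, "-")},
    {"mel": ("chr2R", 400, 500, "+"), "sim": ("chr2R", 390, 490, "+"), "pse": ("chr3", 90150, 90250, "-")},
]
print(find_breaks(blocks, ["mel", "sim"], ["pse"]))  # [0]
```

Candidate junctions reported this way would still need inspection in the Browser, since assembly gaps and missing data can mimic a break.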

2.4. Obtaining Sequences, Alignments, and Annotation from UCSC
The UCSC Genome Browser site provides two main facilities for obtaining genomic sequence, alignments, and annotations:
• a download server that supports bulk retrieval via FTP or HTTP access
• the Table Browser data retrieval tool (9), useful for selective querying by region

Because this chapter is oriented toward high-throughput, whole-genome analysis, this section focuses on the download server, although we will briefly describe how to use the Table Browser to obtain alignments in selected regions.


Fig. 5. Browser view of a region on Drosophila melanogaster chromosome 2R where D. melanogaster, D. simulans, D. yakuba, and D. ananassae show colinear alignment, but where D. pseudoobscura, D. mojavensis, D. virilis, and Anopheles gambiae have a chromosomal break.

2.4.1. UCSC Download Server
The download server can be accessed from the “Downloads” menu link (http://hgdownload.cse.ucsc.edu) on the UCSC Browser main page or by anonymous FTP (http://hgdownload.cse.ucsc.edu/goldenPath). The Downloads web page provides a directory of links to the data files available, grouped by species and assembly. On the download server, each genome assembly is represented by a directory labeled with the UCSC assembly release name. For example, files related to the May 2004 human genome assembly are stored in the “hg17” directory. Assembly releases and versions are described on the FAQ page (http://genome.ucsc.edu/FAQ/FAQreleases). The exact items vary by assembly, but the following data are typically available:

Link                                  Directory     Description
Full data set                         bigZips       Misc assembly and annotation
Data set by chromosome                chromosomes   Repeatmasked sequence
Annotation database                   database      Tab-delimited nightly database dump
Multiple alignment (eight species)    multiz8way    Multiz alignments (MAF format)
Conservation scores                   phastCons     Per-base scores for multiple alignment
Mouse pairwise alignments             vsMm7         Blastz/chain/net alignments (AXT format)
Dog pairwise alignments               vsCanFam2     Blastz/chain/net alignments (AXT format)
LiftOver files                        liftOver      Conversion chains to other assemblies

The section below details where to download the comparative genomics resources used in this chapter. Note: all referenced human genome data are from the May 2004 (UCSC release hg17) human genome assembly.

1. Human genome multiple alignments
   Link: Multiple alignments of 16 vertebrate genomes to human
   Directory: hg17/multiz17way
2. Human alignments to opossum, armadillo, and elephant
   Opossum Genome: Oct. 2004 assembly (UCSC release monDom1)
   Link: Opossum/Human (hg17) alignments
   Directory: monDom1/vsHg17
   Armadillo Genome: May 2005 assembly (not released in UCSC browser)
   Directory: dasNov1/vsHg17
   Elephant Genome: May 2005 assembly (not released in UCSC browser)
   Directory: loxAfr1/vsHg17
3. Human gene locations
   Link: Annotation database
   Directory: database
   Tables: refFlat (RefSeq genes), knownGene (UCSC Known genes), encodeGencodeGeneKnown (ENCODE Gencode genes)
   Files: *.txt.gz (data), *.sql (table schema)
4. Insect genome multiple alignment
   Assembly: Drosophila melanogaster Genome, April 2004 assembly (UCSC release dm2)
   Link: Multiple alignment of eight insects with D. melanogaster
   Directory: dm2/multiz9way
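For scripted retrieval, the directory names above can be joined onto the goldenPath address given in the text. The sketch below only builds the URLs; it assumes the server layout is as described in this section (it may have changed since), that the annotation "database" directory sits under the assembly directory (hg17/database), and the dictionary labels and function name are made up for the example.

```python
from urllib.parse import urljoin

# Base address as given in this section.
BASE = "http://hgdownload.cse.ucsc.edu/goldenPath/"


def download_dir_url(path):
    """Join a server-relative directory such as 'hg17/multiz17way' onto BASE."""
    return urljoin(BASE, path.strip("/") + "/")


# Directories listed in this section (labels are illustrative).
RESOURCES = {
    "human 17-way multiple alignments": "hg17/multiz17way",
    "opossum/human pairwise alignments": "monDom1/vsHg17",
    "armadillo/human pairwise alignments": "dasNov1/vsHg17",
    "elephant/human pairwise alignments": "loxAfr1/vsHg17",
    "human annotation database (assumed location)": "hg17/database",
    "insect 9-way multiple alignments": "dm2/multiz9way",
}

for label, path in RESOURCES.items():
    print(f"{label}: {download_dir_url(path)}")
```

The resulting addresses can be passed to any bulk download tool; the file names inside each directory should be checked against the directory listing rather than guessed.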

The ENCODE project has a project-specific downloads page, accessed from the “Downloads” menu link on the UCSC ENCODE portal page (http://genome.ucsc.edu/ENCODE). The following comparative genomics resources in the ENCODE regions can be found on the May 2004 human genome assembly (currently the ENCODE Reference Assembly):
1. ENCODE region sequence and RepeatMasker output for human genome
   Link: Nucleotide sequences
   Directory: hg17/encode/regions
2. ENCODE orthologous region sequence and RepeatMasker output for other species, used to produce multiple alignments
   Link: Multiple sequence alignments
   Directory: hg17/encode/alignments//sequences
3. ENCODE region multiple alignments
   Link: Multiple sequence alignments
   Directory: hg17/encode/alignments//alignments

The alignment file formats are described in the following help files:
MAF: http://genome.ucsc.edu/goldenPath/help/maf.html
AXT: http://genome.ucsc.edu/goldenPath/help/axt.html
Chain: http://genome.ucsc.edu/goldenPath/help/chain.html
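As a starting point for custom software, a MAF file can be read block by block with a few lines of Python. The sketch below handles only the "a" (block start) and "s" (sequence) lines and ignores everything else; the function name and the example records are hypothetical, and the MAF help file remains the authority on the format.

```python
def read_maf_blocks(lines):
    """Yield one list per alignment block; each entry describes one 's' line."""
    block = []
    for line in lines:
        line = line.strip()
        if line == "a" or line.startswith("a "):  # 'a' opens a new block
            if block:
                yield block
            block = []
        elif line.startswith("s "):               # 's' lines hold aligned sequence
            _, src, start, size, strand, src_size, text = line.split()
            block.append({"src": src, "start": int(start), "size": int(size),
                          "strand": strand, "src_size": int(src_size), "text": text})
    if block:
        yield block


# Made-up records in MAF layout: src, start, size, strand, srcSize, text.
maf = """##maf version=1
a score=100.0
s hg17.chr1     100 8 + 1000 ACGTACGT
s mm7.chr2      200 6 + 2000 AC--ACGT

a score=50.0
s hg17.chr1     120 4 + 1000 GGCC
s canFam2.chr3  300 4 - 3000 GGCC
"""

for rows in read_maf_blocks(maf.splitlines()):
    print([(r["src"], r["start"], r["size"], r["strand"]) for r in rows])
```

In each block the first "s" line is the reference sequence, and the size field counts aligned bases excluding gap characters.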

2.4.2. UCSC Table Browser
The Table Browser is accessed via a menu link on the UCSC Genome Browser home page. Pull-down menus on the Table Browser allow selection of an organism, assembly, genomic region, data type, and output format; options are provided for filtering, intersecting, and correlating tables. To obtain multiple alignments in a genomic region via the Table Browser, select the following menu options:
Group: Comparative Genomics
Track: Conservation
Table: multiz*
Output format: MAF

For ENCODE alignments, use:
Group: ENCODE Comparative Genomics
Tracks: TBA, MAVID, or MLAGAN Alignment
Tables: encodeTbaAlign, encodeMavidAlign, encodeMlaganAlign
Output format: MAF


For compactness, select the gzip compression output option.

2.4.3. Access Guidelines
Data in the UCSC Genome database are generally freely available for public use; any limitations are described in the Conditions of Use section of the UCSC Browser home page. When downloading multiple files, we recommend using the FTP site rather than the web access.

Postscript
Since this chapter was first written, several groups have employed RGCs to predict early mammalian divergences. Kriegs et al. (24) and Nishihara et al. (25) used interspersed repeats, whereas Murphy et al. (26) employed insertions/deletions in protein-coding regions as well as interspersed repeats. The computer programs used by Murphy et al. to find informative coding indels will be made available along with the code described in this chapter.

Acknowledgments
Daryl Thomas and Elliott Margulies manage the ENCODE alignments and other resources, Brian Raney produced alignments to the low-redundancy genomes (including elephant and armadillo), and Angie Hinrichs generated the insect multiple alignments.

References
1. Rokas, A. and Holland, P. W. H. (2000) Rare genomic changes as a tool for phylogenetics. Trends Ecol. Evol. 15, 454–459.
2. Kent, W. J., Sugnet, C. W., Furey, T. S., et al. (2002) The human genome browser at UCSC. Genome Res. 12, 996–1006.
3. Hinrichs, A. S., Karolchik, D., Baertsch, R., et al. (2006) The UCSC genome browser database: update 2006. Nucleic Acids Res. 34 (Database issue), D590–D598.
4. Karolchik, D., Baertsch, R., Diekhans, M., et al. (2003) The UCSC genome browser database. Nucleic Acids Res. 31, 51–54.
5. Murphy, W. J., Eizirik, E., Johnson, W. E., Zhang, Y. P., Ryder, O. A., and O’Brien, S. J. (2001) Molecular phylogenetics and the origins of placental mammals. Nature 409, 614–618.
6. Birney, E., Andrews, D., Caccamo, M., et al. (2006) Ensembl 2006. Nucleic Acids Res. 34 (Database issue), D556–D561.
7. Wheeler, D. L., Church, D. M., Edgar, R., et al. (2004) Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Res. 32, D35–D40.
8. Giardine, B., Riemer, C., Hardison, R. C., et al. (2005) Galaxy: a platform for interactive large-scale genome analysis. Genome Res. 15, 1451–1455.
9. Karolchik, D., Hinrichs, A. S., Furey, T. S., et al. (2004) The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 32 (Suppl 1), D493–D496.
10. Blanchette, M., Green, E., Miller, W., and Haussler, D. (2004) Reconstructing large regions of an ancestral mammalian genome in silico. Genome Res. 14, 2412–2423.
11. The ENCODE Project Consortium (2004) The ENCODE (ENCyclopedia of DNA Elements) project. Science 306, 636–640.
12. Thomas, J. W., Touchman, J. W., Blakesley, R. W., et al. (2003) Comparative analyses of multi-species sequences from targeted genomic regions. Nature 424, 788–793.
13. Schwartz, S., Elnitski, L., Li, M., et al. (2003) MultiPipMaker and supporting tools: alignments and analysis of multiple genomic DNA sequences. Nucleic Acids Res. 31, 3518–3524.
14. Waterston, R. H., et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562.
15. Schwartz, S., Kent, W. J., Smit, A., et al. (2003) Human–mouse alignments with blastz. Genome Res. 13, 103–107.
16. De Jong, W. W., van Dijk, M. A. M., Poux, C., Kappe, G., van Rheede, T., and Madsen, O. (2003) Indels in protein-coding sequences of Euarchontoglires constrain the rooting of the eutherian tree. Mol. Phylogenet. Evol. 28, 328–340.
17. Poux, C., van Rheede, T., Madsen, O., and de Jong, W. W. (2002) Sequence gaps join mice and men: phylogenetic evidence from deletions in two proteins. Mol. Biol. Evol. 19, 2035–2037.
18. Dobzhansky, T. and Sturtevant, A. H. (1938) Inversions in the chromosomes of Drosophila pseudoobscura. Genetics 23, 28–64.
19. Dobzhansky, T. (1944) Chromosomal races in Drosophila pseudoobscura and Drosophila persimilis. Carnegie Inst. Washington Publ. 554, 47–144.
20. Anderson, W. W., Arnold, J., Baldwin, D. G., et al. (1991) Four decades of inversion polymorphism in Drosophila pseudoobscura. Proc. Natl Acad. Sci. USA 88, 10,367–10,371.
21. Sperlich, D. and Pfriem, P. (1986) Chromosomal polymorphism in natural and experimental populations. In: The Genetics and Biology of Drosophila, 3rd edition (Ashburner, M., Carson, H. L., and Thomson, J. N., eds), pp. 257–309, Academic, New York.
22. Richards, S., Liu, Y., Bettencourt, B. R., et al. (2005) Comparative genome sequencing of Drosophila pseudoobscura: chromosomal, gene and cis-element evolution. Genome Res. 15, 1–18.
23. Stolc, V., Gauhar, Z., Mason, C., et al. (2004) A gene expression map for the euchromatic genome of Drosophila melanogaster. Science 306, 655–660.
24. Kriegs, J. O., Churakov, G., Kiefmann, M., Jordan, U., Brosius, J., and Schmitz, J. (2006) Retroposed elements as archives for the evolutionary history of placental mammals. PLoS Biol. 4, e91.
25. Nishihara, H., Hasegawa, M., and Okada, N. (2006) Pegasoferae, an unexpected mammalian clade revealed by tracking ancient retroposon insertions. Proc. Natl Acad. Sci. USA 103, 9929–9934.
26. Murphy, W. J., Pringle, T. H., Crider, T., Springer, M. S., and Miller, W. (2006) Using genomic data to unravel the root of the placental mammal phylogeny. Submitted.

10
Computational Tools for the Analysis of Rearrangements in Mammalian Genomes
Glenn Tesler and Guillaume Bourque

Summary
The chromosomes of mammalian genomes exhibit reasonably high levels of similarity that can be used to study small-scale sequence variations. A different approach is to study the evolutionary history of rearrangements in entire genomes based on the analysis of gene or segment orders. We describe three computational tools (GRIMM-Synteny, GRIMM, and MGR) that can be used separately or in succession to contrast different organisms at the genome level and to exploit large-scale rearrangements as a phylogenetic character.

Key Words: Rearrangements; algorithms; homologous regions; phylogenetic tree; computational tool.

1. Introduction
The recent progress in whole-genome sequencing provides an unprecedented level of detailed sequence data for comparative study of genome organizations beyond the level of individual genes. We will describe three programs that can be used in such studies:
• GRIMM-Synteny: Identifies homologous synteny blocks across multiple genomes.
• GRIMM: Identifies rearrangements between two genomes.
• MGR: Reconstructs rearrangement scenarios between multiple genomes.

Genome rearrangement studies can be dissected into two steps: (1) identify corresponding orthologous regions in different genomes, and (2) analyze the possible rearrangement scenarios that can explain the different genomic organizations. The orthologous regions are typically numbered 1, 2, …, n, and each genome is represented as a signed permutation of these numbers, in which the signs indicate the relative orientation of the orthologous regions.

From: Methods in Molecular Biology: Phylogenomics. Edited by: W. J. Murphy © Humana Press Inc., Totowa, NJ


Fig. 1. Rearrangements in signed permutations showing impact of (a) reversals, (b) fusions and fissions, and (c) translocations.
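The operations named in Fig. 1 can be sketched on signed permutations as follows. This is an illustrative sketch only, not code from GRIMM or MGR: each chromosome is assumed to be a Python list of signed integers, and the function names and cut-point arguments are hypothetical. A translocation here turns chromosomes A B and C D into A D and C B.

```python
from typing import List, Tuple

Chromosome = List[int]  # signed block numbers, e.g. [1, -3, 2]


def reversal(chrom: Chromosome, i: int, j: int) -> Chromosome:
    """Reverse the order of chrom[i:j] and negate every element in it."""
    return chrom[:i] + [-x for x in reversed(chrom[i:j])] + chrom[j:]


def fusion(a: Chromosome, b: Chromosome) -> Chromosome:
    """Concatenate two chromosomes into one."""
    return a + b


def fission(chrom: Chromosome, i: int) -> Tuple[Chromosome, Chromosome]:
    """Break one chromosome into two at position i."""
    return chrom[:i], chrom[i:]


def translocation(ab: Chromosome, cd: Chromosome,
                  i: int, j: int) -> Tuple[Chromosome, Chromosome]:
    """Cut ab into A|B at i and cd into C|D at j; return (A D, C B)."""
    a, b = ab[:i], ab[i:]
    c, d = cd[:j], cd[j:]
    return a + d, c + b


if __name__ == "__main__":
    print(reversal([1, 2, 3, 4, 5], 1, 4))
    print(translocation([1, 2, 3], [4, 5, 6], 1, 2))
```

Returning new lists rather than modifying the input keeps each step of a rearrangement scenario available for inspection.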

In multichromosomal genomes, by convention, a ‘$’ delimiter is inserted in the permutation to demarcate the different chromosomes. The types of rearrangements that are considered by GRIMM and MGR using this signed permutation notation are illustrated in Fig. 1. In unichromosomal genomes, the most common rearrangements are reversals (also called inversions), shown in Fig. 1a; in signed permutation notation, a contiguous segment of numbers is put into reverse order and negated. In multichromosomal genomes, the most common rearrangements are reversals, translocations, fissions, and fusions. A fusion event concatenates two chromosomes into one, and a fission breaks one chromosome into two (Fig. 1b). A translocation event transforms two chromosomes A B and C D into A D and C B, where each letter stands for a sequence of signed genes (Fig. 1c). There are other modes of evolution, such as large-scale insertions, deletions, and duplications, which will not be addressed in the present review.

There are two main categories of input data that can be used to analyze genome rearrangements: (1) sequence-based data, relying on nucleotide alignments, and (2) gene-based data, relying on homologous genes or markers, typically determined by protein alignments or radiation-hybrid maps. The appropriate acquisition of such datasets will be discussed under Subheading 2.3. Processing of these raw data to determine large-scale syntenic blocks is done using GRIMM-Synteny; this will be discussed under Subheading 3.1. After constructing the syntenic blocks, GRIMM can be used to study the rearrangements between pairs of species (1). GRIMM implements the Hannenhalli–Pevzner methodology (2–7) and can efficiently compute the distance between two genomes and return an optimal path of rearrangements; it will be described under Subheading 3.2. Finally, MGR is a tool to study these types of rearrangements in several genomes, resulting in a phylogenetic tree and a rearrangement scenario (8); it will be described under Subheading 3.3.

Bourque and colleagues presented a detailed application of these tools (9). In that study, the global genomic architecture of four species (human, mouse, rat, and chicken) was contrasted using the two types of evidence: sequence-based data and gene-based data. That study will be used as a reference point for many of the input and output files provided in this review.

Typically, the running time for GRIMM-Synteny is seconds to minutes. GRIMM will take a fraction of a second for most uses, and MGR will take minutes to days. However, it may take considerable time to prepare the inputs and analyze the outputs of each program.

2. Materials
2.1. Computer Requirements
At the time of writing, the software requires a UNIX system with shell access and a C compiler. Perl is also recommended for writing custom scripts to convert the output from other software to a format suitable for input to the software.

2.2. Obtaining the Software
GRIMM-Synteny, GRIMM, and MGR software, and demonstration data used in this paper, can be downloaded at http://www.cse.ucsd.edu/groups/bioinformatics/GRIMM/. The website also has a Web-based version of GRIMM and MGR (see ref. 1). The Web-based version is for small datasets only and cannot handle the datasets described in this paper. After downloading the files, run the following commands. Note that % is the Unix prompt, that the commands may differ on your computer, and that the version numbers might have changed since publication.

For GRIMM-Synteny:
% gzip -d GRIMM_SYNTENY-2.01.tar.gz
% tar xvf GRIMM_SYNTENY-2.01.tar
% cd GRIMM_SYNTENY-2.01
% make

For GRIMM-Synteny demonstration data hmrc_align_data:
% gzip -d hmrc_align_data.tar.gz
% tar xvf hmrc_align_data.tar

For GRIMM:
% gzip -d GRIMM-2.01.tar.gz
% tar xvf GRIMM-2.01.tar
% cd GRIMM-2.01
% make

For MGR:
% gzip -d MGR-2.01.tar.gz
% tar xvf MGR-2.01.tar
% cd MGR-2.01
% make

The executable for GRIMM-Synteny is grimm_synt; the executable for GRIMM is grimm; the executable for MGR is MGR. Either copy the executable files to your bin directory or add the directories to your PATH variable. For updated installation details, please check the website and read the README file in each download.

2.3. Your Data
Various types of input datasets can be used to study large-scale rearrangements. Typically, every multigenome comparative analysis project will provide either sequence-based alignments or sets of homologous genes. The challenge is that the actual source and format will differ from one study to the next. For this reason, two simple file formats were created, one for GRIMM-Synteny and another for GRIMM and MGR, each containing only the required information. For each analysis project, it should be relatively straightforward to write custom conversion scripts (e.g., using Perl) to extract the information required from the original source dataset and output it in the standardized format required for the GRIMM/MGR suite. A description of the type of information needed is given in the following subsection and a detailed description under Subheading 3.

It is also possible to acquire datasets with homologous genes or aligned regions from public databases, such as Ensembl, the National Center for Biotechnology Information’s (NCBI) HomoloGene, and the University of California, Santa Cruz Genome Bioinformatics website (see the section on Website References). The website interfaces and database formats tend to change frequently. It may be necessary to combine multiple tables or do computations to get the required information.

2.3.1. Input Data to GRIMM-Synteny
The inputs to GRIMM-Synteny describe the coordinates of multiway orthologous sequence-based alignments or multiway orthologous genes. Either one of these will be called orthologous markers. There are several output files; principally, it outputs large-scale syntenic blocks (similar to conserved segments, but allowing for microrearrangements) comprising many orthologous elements that are close together and consecutive or slightly shuffled in order. The specific details will be given under Subheading 3.1.

The information that you must have for every orthologous element is its coordinates in every species. The coordinates include the chromosome, starting nucleotide, length in nucleotides, and strand (or relative orientation). Optionally, you may also assign an ID number to each orthologous element (Fig. 2). If you do not have the coordinates in nucleotides but do know the order and strand of the elements across all the genomes, you may specify fake coordinates that put them into the correct order, and tell GRIMM-Synteny to use the “permutation metric” that only considers their order. It is possible to deal with data in which the strand of the orthologous markers is unknown (e.g., if the source data come from radiation-hybrid mapping), but it is beyond the scope of GRIMM-Synteny (see Note 1).

Fig. 2. (a) Sample lines from gene coordinate file hmrc_genes_coords.txt used for GRIMM-Synteny. The first field (ID number) is set to 0 since it is not useful in this example. After that, each species has four fields: chromosome, start, length, and sign. (b) Sample lines from alignment coordinate file hmrc_align_coords.txt. Notice the fourth alignment shown has human chromosome 2 aligned to mouse and rat X. The sixth “alignment” shown uses a fictitious chromosome “_” as a means to filter out a segmental duplication involving chicken chromosome Z.


2.3.2. Input Data to GRIMM and MGR
GRIMM can be used to compare the order of elements within orthologous regions in two genomes, or to compare the orders of syntenic blocks between two genomes on a whole-genome scale. MGR can be used for these purposes with three or more genomes. GRIMM-Synteny produces a file called mgr_macro.txt suitable for input to GRIMM or MGR. If you are not using GRIMM-Synteny, you will have to number your orthologous regions or syntenic blocks and create a file in a certain format that specifies the orders and signs (orientations or strands) of these for each species. The format will be described under Subheading 3. (Fig. 3).

The signs are very important, and the quality of the results will deteriorate if you do not know them. They should be available with alignments or genes obtained from current whole-genome assemblies. However, if your source data really do not have them (such as gene orders obtained in a radiation-hybrid map), GRIMM has procedures to guess relative signs (for two or more species). These will be described under Subheadings 3.2.3. and 3.2.4.

3. Methods
3.1. GRIMM-Synteny: Identifying Homologous Synteny Blocks Across Multiple Genomes
There are two main uses of GRIMM-Synteny: (1) GRIMM-Anchors to filter out nonunique alignments (Subheading 3.1.2.) and (2) forming synteny blocks from anchors (Subheadings 3.1.3. and 3.1.4.). Both require the same input format, which we will cover first.

Fig. 3. (a) Sample input file for GRIMM or MGR. Part of MGR’s output: (b) Newick representation of reconstructed phylogeny and (c) ASCII graphical tree representation of the same phylogeny. (d) Part of GRIMM’s output: an optimal sequence of rearrangements from Genome1 to Genome4. (e) GRIMM’s 4 × 4 pairwise distance matrix on the input genomes. MGR also produces a 6 × 6 matrix for the input genomes plus ancestral genomes.

Fig. 4. Forming blocks from anchors in GRIMM-Synteny. (a) Anchor coordinates. Coordinates are given genome-by-genome, either as chromosome, start (minimum coordinate), length, sign (strand), or as chromosome window and the Cartesian coordinates of the two ends. (b,c) The total distance between two anchors is the Manhattan distance between their closest terminals (shown as a thick solid line). Distances between other terminals (shown as dotted or dashed extensions) increase the distance. The per-genome distance components d1 and d2 are indicated, and add up to the total d = d1 + d2. A block consisting of these two anchors has per-species measurements of the span (total size including gaps) and support (total size not including gaps), as well as the total number of anchors (2). (d) Several blocks between human and mouse from ref. 12, with lines showing how the anchors are joined.

3.1.1. Preparing the Input for GRIMM-Synteny
We will work with the human/mouse/rat/chicken orthologous alignments as a starting point (computed by Angie Hinrichs, see refs. 10,11). The discussion for genes would be similar. This example has k = 4 species. GRIMM-Synteny uses the coordinates of k-way alignments. The input file will consist of many lines with the following format (“k-way coordinate format”), where each line represents coordinates of a k-way alignment (but does not include the base-by-base details of the alignment). The same format is used in the output file blocks.txt that lists the coordinates of the synteny blocks. See Fig. 2 for excerpts from a 4-way coordinate file, and Fig. 4a for an illustration of the coordinate system (in two-way data). The k-way coordinate format is as follows:

ID chr1 start1 length1 sign1 … chrk startk lengthk signk

1. ID is a number, which can be used to number the alignments. If you do not care to do this, set it to 0. GRIMM-Synteny does not use the value you put there on input. On output, the same format is used for the file blocks.txt and the ID is used to number the blocks.
2. Species numbering: In the human/mouse/rat/chicken data, chr1, start1, length1, sign1 refer to coordinates in human. Species 2 is mouse. Species 3 is rat. Species 4 is chicken. For your own data, choose your own species numbers and use them consistently.
3. The four coordinate fields per species are as follows:
   chrN: Chromosome name, e.g., ‘1’, ‘X’, ‘A1’.
   startN: Starting nucleotide on the positive strand. It does not matter if you used 0-based or 1-based coordinates, as long as you are consistent.
   lengthN: Length in nucleotides. Combined with startN, this gives a half-open interval [startN,startN+lengthN) on species N. If your source data has start and end coordinates of a closed interval [start,end] then the length is end - start + 1, while if it is a half-open interval [start,end) then the length is end - start.
   signN: Strand (+ or 1 for positive, - or -1 for negative). Negative means that the aligned nucleotides are the ones paired to the positive strand nucleotides on the interval just specified. Be careful to check that you use the same coordinate for both members of a base pair, since it is also common to have complementary coordinates on the two strands.
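The interval arithmetic for lengthN can be sketched in a small conversion helper. This is a minimal illustration, not part of the GRIMM-Synteny distribution: the function names are hypothetical, and separating fields with single spaces is an assumption about the file layout (check the sample files in Fig. 2 or the README for the exact whitespace conventions).

```python
def closed_to_length(start: int, end: int) -> int:
    """Length of a closed interval [start, end]."""
    return end - start + 1


def half_open_to_length(start: int, end: int) -> int:
    """Length of a half-open interval [start, end)."""
    return end - start


def kway_line(fields, ident: int = 0) -> str:
    """Format one line of the k-way coordinate format.

    fields: one (chrom, start, length, sign) tuple per species, in a
    consistent species order; sign is '+' or '-'.
    """
    parts = [str(ident)]
    for chrom, start, length, sign in fields:
        if length <= 0:
            raise ValueError("length must be positive")
        if sign not in ("+", "-"):
            raise ValueError("sign must be '+' or '-'")
        parts += [str(chrom), str(start), str(length), sign]
    return " ".join(parts)  # space-separated fields: an assumption


if __name__ == "__main__":
    human = ("1", 100, closed_to_length(100, 199), "+")
    mouse = ("X", 5, half_open_to_length(5, 25), "-")
    print(kway_line([human, mouse]))
```

Keeping the two interval conventions in separate functions makes it harder to mix them by accident when the source tables for different species use different conventions.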

The input file may also include comment lines, which begin with ‘#’. Comments in a special format may be included to give the species names, as shown in Fig. 2. Your multiway alignment or multiway ortholog procedure may produce partial alignments involving fewer than all k species; you must discard those. Your procedure may produce multiple hits involving the same coordinates. GRIMM-Synteny has a procedure GRIMM-Anchors (9) to assist in filtering out alignments with conflicting coordinates. Since all alignments with conflicting coordinates will be discarded, we recommend that you first determine if your source data has information (such as scoring information) that you could use to choose a unique best hit and discard the others. We will describe this procedure next, followed by the main procedure (GRIMM-Synteny).

3.1.2. GRIMM-Anchors: Filtering Out Alignments with Nonunique Coordinates
1. Create a file (e.g., align_coords.txt) with the alignment coordinates in the k-way coordinate format described under Subheading 3.1.1.
2. Create a directory (e.g., anchors) in which to place the output files. The current directory will be used by default.
3. Run the GRIMM-Anchors algorithm to filter out repeats and keep only the anchors. The basic syntax is:
   % grimm_synt -A -f align_coords.txt -d anchors
   You should replace align_coords.txt by the name of your alignment coordinates file, and anchors by the name of your output directory. The switch -A says to run GRIMM-Anchors.
4. Three output files are created in the directory anchors:
   • report_ga.txt: This is a log file with information on the number of conflicting alignments, the number of repeat families detected and filtered out (by merging together the collections of conflicting alignments), and the number of anchors (unique alignments) remaining. In some cases, such as directed tandem repeats, overlapping alignments are still uniquely ordered with respect to all other alignments, so they are merged into a larger “anchor” instead of being discarded. This is useful for the purpose of determining larger synteny blocks, even if it might not be crucial for most downstream analyses.
   • unique_coords.txt: This has the same format as described under Subheading 3.1.1., but all conflicting alignments have been removed.
   • repeat_coords.txt: This lists the conflicting alignments that were merged or discarded, organized into repeat families.
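To get a feel for what “conflicting coordinates” means before running GRIMM-Anchors, a simplified check can be sketched as below. This is not the GRIMM-Anchors algorithm: it assumes that two alignments conflict when their half-open intervals overlap on the same chromosome in at least one species, it uses a quadratic pairwise scan, and it makes no attempt to merge tandem repeats into larger anchors or to group conflicts into repeat families. All names are hypothetical.

```python
def overlaps(a, b) -> bool:
    """True if two (chrom, start, length) intervals share a nucleotide."""
    chrom_a, start_a, len_a = a
    chrom_b, start_b, len_b = b
    return (chrom_a == chrom_b
            and start_a < start_b + len_b
            and start_b < start_a + len_a)


def conflicting_indices(alignments):
    """Indices of alignments that overlap another one in some species.

    alignments: list of alignments; each alignment is a list with one
    (chrom, start, length) tuple per species, in the same species order.
    """
    conflicted = set()
    for i in range(len(alignments)):
        for j in range(i + 1, len(alignments)):
            pairs = zip(alignments[i], alignments[j])
            if any(overlaps(a, b) for a, b in pairs):
                conflicted.update((i, j))
    return conflicted


if __name__ == "__main__":
    demo = [
        [("1", 100, 50), ("2", 100, 50)],
        [("1", 120, 50), ("3", 500, 50)],    # overlaps the first in species 1
        [("1", 1000, 50), ("2", 1000, 50)],
    ]
    print(sorted(conflicting_indices(demo)))
```

A scan like this can help decide whether it is worth using scoring information to pick a unique best hit before the conflicting alignments are discarded.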

3.1.3. Forming Synteny Blocks from Sequence-Based Local Alignments
A chromosome window is a specification of chromosomes (c1,…,ck) over the k species. Synteny blocks are formed by grouping together nearby anchors in each chromosome window, as shown in Fig. 4. This is controlled by two sets of parameters: parameters that control the maximum allowable gap between anchors, and parameters that control the minimum size of a block.

Let x = (x1,…,xk) and y = (y1,…,yk) be two points in the same chromosome window, with coordinates expressed in nucleotides. The distance between them in species N is dN(x,y) = |xN – yN|, and the total distance between them is the Manhattan distance d(x,y) = |x1 – y1| + … + |xk – yk|.

Each anchor A can be represented as a diagonal line segment in k dimensions between two points within a chromosome window: a = (a1,…,ak) and a′ = (a′1,…,a′k) (Fig. 4a). These are the two terminals of A. They are determined by the start coordinates, lengths, and orientations in the k-way coordinate format. If the orientation in species N is positive then aN = startN and a′N = startN + lengthN – 1, whereas if the orientation in species N is negative then the definitions of aN and a′N are reversed.

Let A and B be two anchors in the same chromosome window. The total distance between A and B is the total distance between their closest terminals. Once the closest terminals have been determined, the distance between A and B in species N is the distance between those closest terminals in species N (Fig. 4b, c). The per-species distances are shown as d1 and d2, whereas the total distance is d1 + d2. Had other combinations of anchor terminals been used (shown in Fig. 4b, c with dotted or dashed lines), the total distance would have been larger.

Anchors A and B in the same chromosome window are connected together if their distance in each species is less than a per-species gap threshold specified for that species (see the -g option).
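The terminal and distance definitions in this subsection can be sketched as follows. This is an illustrative reading of the text, not GRIMM-Synteny source code: an anchor is assumed to be a list with one (start, length, sign) tuple per species, both anchors are assumed to lie in the same chromosome window (the caller must check this), and the function names are hypothetical.

```python
def terminals(anchor):
    """Return the two terminals of an anchor as k-dimensional points.

    anchor: list of (start, length, sign) per species, sign '+' or '-'.
    With a positive sign the first terminal is at start and the second at
    start + length - 1; with a negative sign the two are swapped.
    """
    first, second = [], []
    for start, length, sign in anchor:
        lo, hi = start, start + length - 1
        if sign == "+":
            first.append(lo)
            second.append(hi)
        else:
            first.append(hi)
            second.append(lo)
    return tuple(first), tuple(second)


def anchor_distance(a, b):
    """Return (total, per_species) distance between the closest terminals.

    The total is the Manhattan distance; per_species lists the
    per-species components for that same pair of terminals.
    """
    best = None
    for p in terminals(a):
        for q in terminals(b):
            per_species = [abs(x - y) for x, y in zip(p, q)]
            total = sum(per_species)
            if best is None or total < best[0]:
                best = (total, per_species)
    return best


if __name__ == "__main__":
    A = [(100, 50, "+"), (1000, 50, "+")]
    B = [(200, 30, "+"), (1100, 30, "+")]
    print(anchor_distance(A, B))
```

Only four terminal pairs exist for any two anchors, so trying all of them is cheap regardless of the number of species.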
Groups of connected anchors form potential synteny blocks. Alternately, anchors may be joined together if the sum of the distances across all species is less than a specified total gap threshold (see the -G option). This was the approach used by Pevzner and Tesler in their human–mouse comparison (12,13). The per-species threshold was added when additional species were considered, and seems to work better than the total gap threshold.

To distinguish noise from real blocks, several measurements of the minimum size of a block are available: the span, the support, and the number of anchors (Fig. 4b, and see the -m, -M, and -n options in the next section). Any potential block that is smaller than these minimums will be discarded.

Determining the relative orientations of blocks in each species can be subtle if there are microrearrangements. GRIMM-Synteny uses a set of criteria that is geared towards signed permutations (13).

If you only want to consider the order of the anchors (which is common in gene-order based studies), prepare your data using either real coordinates or fake coordinates that put it into the correct order, and then use the permutation metric by specifying the option -p. This treats all anchors as having length 2 nucleotides in every species (so that the two orientations of the anchor are distinguishable in each species) with no gaps between consecutive anchors in each species. This is achieved by keeping their chromosomes, orders, and signs, but recomputing their starting coordinates and lengths.

The process of joining anchors can result in blocks that are overlapping in some species or contained in another block in some species. The minimum size parameters will filter out small blocks that are contained in large ones, but do not prevent overlaps among large blocks. This is in part related to different rearrangements of the anchors in each species, and also to the use of parameters that are constant across each species instead of somehow adapted to each region within a species. GRIMM-Synteny has a phase to detect and repair block overlaps and containments. When these are detected, the blocks are recursively broken into several segments. Any segments smaller than the size minimums will be filtered out.
This phase is run by default and can be prevented using the switch -O (letter “oh”). One reason to prevent it would be to study rearrangements in anchors at the ends of blocks, where two large blocks were brought together in some species, and then their ends were mixed through further rearrangements.

Fig. 5. Graphical visualization of the permutations associated with two modern genomes (human and mouse) and an ancestral permutation (mammalian ancestor) as recovered by MGR (see also ref. 15).

Strips are sequences of one or more consecutive blocks (with no interruption from other blocks) in the exact same order with the same signs, or exact reversed order with inverted signs, across all species. In Fig. 3a, elements 4 and 5 form a strip of length 2 (it appears as 4 5 or -5 -4, depending on the species), and all other blocks form singleton strips (x or -x, depending on the species). A most parsimonious scenario can be found that does not split up the strip 4 5 (see ref. 6), so it would reduce the size of the permutations to recode it with blocks 4 and 5 combined into a single block. The option -c does this recoding by condensing each strip of blocks into a single block. (However, in applications such as studying the sequences of breakpoint regions, the separate boundaries of blocks 4 and 5 could be of interest, so it would not be appropriate to use the -c option.) If GRIMM-Synteny had produced a strip such as 4 5, it would either be because the two blocks are farther apart than the gap threshold, or because the overlap/containment repair phase split up a block and then its smaller pieces were deleted in such a way that what remained formed a strip of separate blocks.
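The strip definition can be sketched with signed adjacencies. This is an illustration of the definition only, not the code behind the -c option: each genome is assumed to be a list of chromosomes, each chromosome a list of signed block numbers, and a pair x y is treated as conserved when every genome contains it either as x y or as -y -x. The function names are hypothetical.

```python
def adjacencies(genome):
    """Set of signed adjacencies, recorded in both reading directions."""
    adj = set()
    for chrom in genome:
        for x, y in zip(chrom, chrom[1:]):
            adj.add((x, y))
            adj.add((-y, -x))
    return adj


def strips(genomes):
    """Maximal strips, written in the orientation of the first genome."""
    shared = set.intersection(*(adjacencies(g) for g in genomes))
    result = []
    for chrom in genomes[0]:
        if not chrom:
            continue
        current = [chrom[0]]
        for x, y in zip(chrom, chrom[1:]):
            if (x, y) in shared:
                current.append(y)      # adjacency kept in every genome
            else:
                result.append(current)
                current = [y]
        result.append(current)
    return result


if __name__ == "__main__":
    g1 = [[1, 2, 3], [4, 5, 6]]
    g2 = [[-5, -4, 2], [1, -6, 3]]
    print(strips([g1, g2]))
```

Each strip with more than one element is a candidate for condensing into a single block; singleton strips are left unchanged.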

3.1.4. GRIMM-Synteny: Usage and Options for Forming Synteny Blocks
The basic syntax of GRIMM-Synteny when used for forming synteny blocks is:
% grimm_synt -f anchor_file -d output_dir [other options]

Input/output parameters:
• -f Input_file_name: This is required and should contain the path to the file with the nonconflicting anchor coordinates.
• -d Output_directory_name: This is required. It specifies a directory into which GRIMM-Synteny will write several output files.

Gap threshold:
• -g N (or -g N1,N2,N3,…): These specify the per-species gap threshold in nucleotides, either the same value in all species or a comma-separated list of values for the different species. Anchors are joined if, in every species, their gap is below the per-species gap threshold for that species.
• -G N: This specifies the total gap threshold. Anchors are joined if the total gap (sum of the per-species gaps) is below this threshold.
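The two joining rules can be sketched as a predicate plus a grouping step. This is a hedged illustration, not GRIMM-Synteny's implementation: the per-species gaps between pairs of anchors are assumed to have been computed already, “below” is read as strictly less than, and grouping is done with a simple union-find structure. The keyword names per_species_threshold and total_threshold are hypothetical stand-ins for the -g and -G settings.

```python
def should_join(gaps, per_species_threshold=None, total_threshold=None):
    """Apply the per-species rule (like -g) or the total rule (like -G)."""
    if per_species_threshold is not None:
        return all(g < t for g, t in zip(gaps, per_species_threshold))
    if total_threshold is not None:
        return sum(gaps) < total_threshold
    raise ValueError("one threshold must be given")


def group_anchors(n, pair_gaps, **thresholds):
    """Group n anchors into connected components.

    pair_gaps: dict mapping an index pair (i, j) to the list of
    per-species gaps between anchors i and j.
    """
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for (i, j), gaps in pair_gaps.items():
        if should_join(gaps, **thresholds):
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    gaps = {(0, 1): [51, 51], (1, 2): [500, 10], (0, 2): [700, 200]}
    print(group_anchors(3, gaps, per_species_threshold=[100, 100]))
    print(group_anchors(3, gaps, total_threshold=600))
```

In this toy input the per-species rule keeps anchor 2 separate because one species has a gap of 500, while the total rule merges all three anchors, which shows how the two settings can produce different blocks from the same data.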

Minimum block size:
• -m N (or -m N1,N2,N3,…): These specify the per-species minimum block spans, either the same value in all species or a comma-separated list of values for the different species. A block on chromosome c, with smallest coordinate x and largest coordinate y, has span y-x+1, so anchors and gaps both contribute to the span (Fig. 4b). Blocks are deleted if their span falls below this threshold in any species.
• -M N (or -M N1,N2,N3,…): These specify the per-species minimum block supports. The support of a block is the sum of its anchor lengths in nucleotides, which ignores gaps (Fig. 4b). Blocks are deleted if their support falls below this threshold in any species.
• -n N: The minimum number of anchors per block. Blocks with fewer anchors are deleted. Since we do not consider deletions, this parameter consists of just one number, not a different number for each species.
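The span and support measurements can be sketched for one block in one species. This is an illustration of the definitions above, not GRIMM-Synteny code: anchors are assumed to be (start, length) pairs on a single chromosome, and the function and parameter names are hypothetical.

```python
def span_and_support(anchors):
    """Return (span, support) of a block in one species.

    anchors: list of (start, length) pairs on the block's chromosome.
    Span is y - x + 1 for smallest coordinate x and largest coordinate y,
    so gaps count; support is the sum of anchor lengths, so gaps do not.
    """
    x = min(start for start, _ in anchors)
    y = max(start + length - 1 for start, length in anchors)
    support = sum(length for _, length in anchors)
    return y - x + 1, support


def passes_minimums(anchors, min_span=0, min_support=0, min_anchors=0):
    """Check one species of a block against -m, -M, and -n style minimums."""
    span, support = span_and_support(anchors)
    return (span >= min_span
            and support >= min_support
            and len(anchors) >= min_anchors)


if __name__ == "__main__":
    block = [(100, 50), (200, 30)]
    print(span_and_support(block))
```

A real block must pass the span and support checks in every species, so a caller would apply passes_minimums once per species and keep the block only if all calls succeed.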

Other settings:
• -c: If specified, strips of two or more blocks will be condensed into single blocks.
• -O: If specified, the block overlap/containment repair phase will be skipped.
• -p: If specified, use the “permutation metric.” The anchor order and signs in each species are retained, but the coordinates are changed so that each anchor has length 2 and there is no gap between consecutive anchors (within each species, the i-th anchor on a chromosome is regarded as having start coordinate 2i and length 2).
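The recoding performed by the permutation metric can be sketched for a single species. This is an illustration under stated assumptions, not GRIMM-Synteny code: anchors are (chrom, start, length, sign) tuples, the index i is taken to start at 1 on each chromosome (the text does not say whether counting starts at 0 or 1), and the function name is hypothetical.

```python
def permutation_metric(anchors):
    """Recode one species' anchors so that only order and sign matter.

    anchors: list of (chrom, start, length, sign) tuples.
    Returns a list in the same order as the input, in which the i-th
    anchor on each chromosome (i counted from 1, an assumption) has
    start 2*i and length 2; chromosome and sign are kept.
    """
    order = sorted(range(len(anchors)),
                   key=lambda k: (anchors[k][0], anchors[k][1]))
    recoded = [None] * len(anchors)
    rank = {}  # running anchor count per chromosome
    for k in order:
        chrom, _, _, sign = anchors[k]
        rank[chrom] = rank.get(chrom, 0) + 1
        recoded[k] = (chrom, 2 * rank[chrom], 2, sign)
    return recoded


if __name__ == "__main__":
    demo = [("1", 5000, 300, "+"), ("2", 10, 5, "-"), ("1", 100, 40, "-")]
    print(permutation_metric(demo))
```

After this recoding, gap and span thresholds are effectively counted in anchors rather than nucleotides, with each anchor contributing 2 units.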

Output files: GRIMM-Synteny produces five files in the output directory specified by -d:
• report.txt: This is a detailed log file describing the computations and phases that GRIMM-Synteny performed. It includes measurements for each block, such as number of anchors, per-species support and span of anchors, and microrearrangement distance matrix. It also includes information about macrorearrangements of all the blocks.
• blocks.txt: This has the coordinates of the blocks in the same format as described under Subheading 3.1.1. for anchor files.
• mgr_macro.txt: This file gives signed block orders on each chromosome in each genome in the GRIMM/MGR input file format described under Subheading 3.2.1. (see Fig. 3a). Coordinates and lengths in nucleotides are not included; to determine these, look up the blocks by ID number in blocks.txt.
• mgr_micro.txt: This file lists the anchors contained in each block. For each block, the k-way coordinates of all its anchors are given, and the GRIMM/MGR permutation of the anchors is given. Since blocks are in just one chromosome per species, and since blocks have a definite sign, these permutations should be regarded as directed linear chromosomes (-L option in GRIMM/MGR). Also, in the permutations, strips of anchors have been compressed.
• mgr_micro_equiv.txt: This lists which blocks have the same compressed anchor permutations in mgr_micro.txt. In other words, it identifies blocks whose anchors underwent similar microrearrangements.

3.1.5. Sample Run 1: Human–Mouse–Rat–Chicken Gene-Based Dataset

We discuss the data in the sample data directory hmrc_gene_data. Change to that directory. The 4-way homologous genes were computed by Evgeny Zdobnov and Peer Bork (see refs. 10,11) and identified by their Ensembl IDs. We combined their data with Ensembl coordinates of those genes for the specific Ensembl builds they used. The file hmrc_genes_ensembl.txt is for informational purposes only, and shows the Ensembl IDs combined with the gene coordinates. The file hmrc_genes_coords.txt is in the GRIMM-Synteny 4-way coordinate format. First, we filter out conflicting homologs:

% mkdir anchors
% grimm_synt -A -f hmrc_genes_coords.txt -d anchors

This produces a log file anchors/report_ga.txt and a coordinate file anchors/unique_coords.txt. The log file indicates there were 8095 homologous gene quadruplets, but a number of them had conflicting coordinates, resulting in 6447 anchors. Details about the conflicts are given in a third output file, anchors/repeat_coords.txt. In the sample data, we named the directory anchors_example instead of anchors so that you can run these examples without overwriting it. In the steps that follow, we also did this with directories gene7_example instead of gene7, and 300K_example instead of 300K.

Next, we form synteny blocks. The main run analyzed in the paper, gene7, was produced as follows:

% mkdir gene7
% grimm_synt -f anchors/unique_coords.txt -d gene7 -c -p -m 6 -g 7

Computational Tools for Analysis of Rearrangements

We used the permutation metric, so each gene is considered to have length 2 units. We required a minimum length of 6 units (i.e., the size of three genes at length 2) in each species. Using -n 3 (minimum of three anchors, regardless of size) instead of -m 6 produced identical results, but at larger values, -n x and -m 2x would not be the same, since -m 2x would allow for gaps. Finally, -g 7 is the per-species gap threshold, which motivated naming this run gene7; we performed similar runs with thresholds from 1 through 20. We also varied other parameters. At smaller values of the gap threshold, there is little tolerance for microrearrangements within blocks, so many small blocks are formed (but many of them are deleted for being below 6 units, from -m 6). Setting it too high would keep a high number of anchors but yield a low number of blocks by merging too many blocks together. The selection -g 7 retained a relatively high number of anchors and a high number of blocks. In addition to this, for each combination of parameter settings, we also examined plots of the microrearrangements in the blocks (similar to Fig. 4d), ran the blocks through MGR, and did other tests. Unfortunately, optimal parameter selection is still somewhat of an art. (Tools to produce such plots are highly data-dependent and are not provided with the current release of GRIMM-Synteny. The plots are based on the anchor and block coordinates in mgr_micro.txt.)
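A sweep over gap thresholds like the one described above could be scripted along the following lines. This is a sketch under stated assumptions: it only builds and prints command lines from the options shown in this section (-f, -d, -c, -p, -m, -g). The output directory names gene1 through gene20 are our extrapolation from the name gene7, and the helper name sweep_commands is hypothetical.

```python
import shlex

def sweep_commands(coords="anchors/unique_coords.txt", gaps=range(1, 21), min_len=6):
    """Build one grimm_synt command line per gap threshold (nothing is executed)."""
    commands = []
    for g in gaps:
        outdir = f"gene{g}"  # assumed naming scheme, modeled on the gene7 run
        commands.append(["grimm_synt", "-f", coords, "-d", outdir,
                         "-c", "-p", "-m", str(min_len), "-g", str(g)])
    return commands

for cmd in sweep_commands():
    outdir = cmd[cmd.index("-d") + 1]
    # Each output directory is created first, as in the sample runs above.
    print(f"mkdir {outdir} && {shlex.join(cmd)}")
```

The anchor and block counts in each report.txt can then be compared across runs when choosing a threshold.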

3.1.6. Sample Run 2: Human–Mouse–Rat–Chicken Alignment-Based Dataset

We discuss the data in the sample data directory hmrc_align_data. Change to that directory. In ref. 10, alignments between human and one to three of mouse, rat, and chicken were computed by Angie Hinrichs and others at the UCSC Genome Bioinformatics group. The alignment files were several hundred megabytes per chromosome because they included the coordinates of the alignments as well as base-by-base annotations. We extracted the coordinates of 4-way alignments in this data. 1-, 2-, and 3-way alignments were discarded. This is further described in ref. 11. The file hmrc_align_coords.txt contains the 4-way coordinates of the alignments.

The UCSC protocol included several ways of masking out repeats. However, there were still a number of repeats left in the data, particularly in chicken, where repeat libraries were not so thoroughly developed at that time. Evan Eichler provided us with coordinates of segmental duplications in chicken (10). We used GRIMM-Anchors to filter out alignments conflicting with the duplications, by adding the coordinates of the duplications into hmrc_align_coords.txt as shown in Fig. 2. We made a new chromosome ‘_’ in human, mouse, and rat, and coded all the segmental duplications into 4-way alignments at coordinate 0 on ‘_’ in human, mouse, and rat, and their true coordinate in chicken. This way, all the segmental duplications conflicted with each other in human, mouse, and rat (so that GRIMM-Anchors would filter them out) and conflicted with any real alignments at those coordinates in chicken (so that GRIMM-Anchors would filter those alignments out). We filter out conflicting homologs:

% mkdir anchors
% grimm_synt -A -f hmrc_align_coords.txt -d anchors

This produces a log file anchors/report_ga.txt and a coordinate file anchors/unique_coords.txt. Next, the main alignment-based run considered in the paper was produced as follows:

% mkdir 300K
% grimm_synt -f anchors/unique_coords.txt -d 300K -c -m 300000 -g 300000

We used -c to condense strips of blocks into single blocks. We used -m 300000 to set a minimum span of 300000 nucleotides per species. We used -g 300000 to set a maximum gap size of 300000 per species. We also produced blocks with other combinations of parameters and considered similar factors as for the gene7 run in determining to focus on the “300K” blocks. In Fig. 2b, notice that one of the alignments involves human chromosome 2, and mouse and rat chromosome X. Another sanity check we did on the output for each choice of parameters was to see if any blocks were formed between the X chromosome on one mammal and a different chromosome on another mammal, since such large-scale blocks would violate Ohno’s law (14).
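The X-chromosome sanity check just described can be sketched as below. The block records here are a hypothetical in-memory structure (block ID mapped to one chromosome name per species); reading them from blocks.txt is left out because that file format is described under Subheading 3.1.1. and is not reproduced here.

```python
def x_violations(blocks, mammals=("human", "mouse", "rat")):
    """Return IDs of blocks that lie on X in one mammal but not in another.

    blocks: dict mapping block ID -> dict mapping species -> chromosome name
            (hypothetical structure used only for this illustration).
    """
    suspicious = []
    for block_id, chroms in blocks.items():
        on_x = [chroms[s] == "X" for s in mammals if s in chroms]
        # A block is suspicious if some, but not all, mammals place it on X.
        if any(on_x) and not all(on_x):
            suspicious.append(block_id)
    return suspicious

toy_blocks = {
    1: {"human": "X", "mouse": "X", "rat": "X", "chicken": "4"},
    2: {"human": "2", "mouse": "X", "rat": "X", "chicken": "1"},
    3: {"human": "7", "mouse": "5", "rat": "4", "chicken": "2"},
}
print(x_violations(toy_blocks))
```

Chicken is excluded from the comparison in this sketch because the check described above concerns X-chromosome conservation among the mammals.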

3.2. GRIMM: Identifying Rearrangements Between Two Genomes

GRIMM implements several algorithms for studying rearrangements between two genomes in terms of signed permutations of the order of orthologous elements. Most of the literature refers to this as gene orders, although we also apply it to the order of syntenic blocks such as those produced by GRIMM-Synteny. Hannenhalli and Pevzner showed how to compute the minimum number of reversals possible between two unichromosomal genomes in polynomial time (2), and Bader et al. improved this to linear time and implemented it in their GRAPPA software (see the Section on Website References and ref. 5). GRIMM is adapted from the part of GRAPPA that implements this.

Hannenhalli and Pevzner went on to show how to compute the minimum number of rearrangements (reversals, translocations, fissions, and fusions) between two multichromosomal genomes in polynomial time (3). Tesler fixed some problems in the algorithm and adapted the Bader-Moret-Yan algorithm to solve this problem in linear time (6). Ozery-Flato and Shamir found an additional problem in the Hannenhalli-Pevzner algorithm (7). GRIMM implements all of these for multichromosomal rearrangements.

Hannenhalli and Pevzner described an algorithm for studying rearrangements in genomes when the orientations of genes are not known (4). This algorithm is only practical when the number of singleton genes is small. GRIMM implements this algorithm, a generalization of it for the multichromosomal case, and a fast approximation algorithm.
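To make the notion of a reversal on a signed permutation concrete, here is a minimal sketch. It is not the Hannenhalli-Pevzner algorithm and does not compute the reversal distance; it only applies one reversal and counts breakpoints against the identity permutation for a single directed linear chromosome. The function names are ours.

```python
def reverse_segment(perm, i, j):
    """Apply a reversal to perm[i..j] (inclusive): reverse the order and flip the signs."""
    return perm[:i] + [-x for x in reversed(perm[i:j + 1])] + perm[j + 1:]

def breakpoints(perm):
    """Count breakpoints relative to the identity 1, 2, ..., n.

    The permutation is framed by 0 and n+1; a consecutive pair (a, b) is a
    breakpoint unless b - a == 1 (this also accepts pairs such as (-3, -2)).
    """
    framed = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b - a != 1)

genome = [1, -3, -2, 4]
print(breakpoints(genome))            # two breakpoints
print(reverse_segment(genome, 1, 2))  # one reversal yields the identity
```

In this toy case a single reversal removes both breakpoints. In general the minimum number of reversals depends on the structure of the breakpoint graph, which is what GRIMM analyzes.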

3.2.1. Input Format

The input for GRIMM and MGR is a file that gives the permutation of orthologous regions in each genome, split into chromosomes. See Fig. 3a for a sample file with 4 genomes with up to 3 chromosomes in each. All 4 genomes consist of the same 10 regions but in different orders. Each genome specification begins with a line consisting of the greater-than symbol followed by the genome name. Next, the order of the orthologous regions 1, 2, … n is given, with dollar-sign symbols ‘$’ at the end of each chromosome. The numbers are separated by any kind of white space (spaces, tabs, and new lines). Chromosomes are delimited by ‘$’, by the start of the next genome, or by the end of the file. Comments may be inserted in the file using the ‘#’ symbol. The rest of the line will be ignored.
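The format rules above are simple enough to read with a short parser. The following is a sketch written from the description in this subheading, not code taken from GRIMM or MGR; the toy file contents are invented and are not the data of Fig. 3a.

```python
def parse_grimm(text):
    """Parse GRIMM/MGR gene order text into a list of (genome_name, chromosomes).

    Each chromosome is a list of signed integers.
    """
    genomes = []
    name = None
    chroms = []
    current = []

    def close_chromosome():
        nonlocal current
        if current:
            chroms.append(current)
            current = []

    def close_genome():
        nonlocal chroms
        close_chromosome()
        if name is not None:
            genomes.append((name, chroms))
        chroms = []

    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        if line.startswith(">"):              # a new genome also ends a chromosome
            close_genome()
            name = line[1:].strip()
            continue
        for token in line.replace("$", " $ ").split():
            if token == "$":
                close_chromosome()
            else:
                current.append(int(token))
    close_genome()                            # end of file ends the last chromosome
    return genomes

toy = """# invented example
>GenomeA
1 2 3 $
4 -5 $
>GenomeB
1 -2 $ 3 4 5
"""
for genome_name, chromosomes in parse_grimm(toy):
    print(genome_name, chromosomes)
```

The same function can be pointed at a file such as mgr_macro.txt by passing it the file's contents as a string.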

3.2.2. Output

The main usage of GRIMM is to compute the most parsimonious distance between two genomes and give an example of one rearrangement scenario (out of the many possible) that achieves that distance. An excerpt of GRIMM’s output for this usage is shown in Fig. 3d. There are other usages too, which have different outputs.

3.2.3. Usage and Options

There are several usages of GRIMM: (1) compute the most parsimonious distance between two genomes (along with other statistics about the breakpoint graph), (2) exhibit a most parsimonious rearrangement scenario between two genomes, (3) compute matrices of pairwise distances and pairwise statistics for any number of genomes, and (4) compute or estimate signs of orthologous regions to give a most parsimonious scenario. The command-line syntax is as follows:

% grimm -f filename [other options]

Input/output:
• -f Input_file_name: This field is required and should contain the path to the file with the starting permutations (e.g., data/sample_data.txt or a file mgr_macro.txt generated using GRIMM-Synteny).
• -o Output_file_name: The output is sent to this file. If -o is not specified, the output is sent to STDOUT.
• -v: Verbose output: For distance computations, this gives information on the breakpoint graph statistics from the Hannenhalli-Pevzner theory. For other computations, this gives additional information.

Genome type:
• -C, -L, or neither: -C is unichromosomal circular distance and -L is unichromosomal directed linear reversal distance. If neither -C nor -L is selected, the genomes have multichromosomal undirected linear chromosomes. (Undirected means flipping the whole chromosome does not count as a reversal, while directed means it does count. For single-chromosome genomes, -L and the multichromosomal default therefore differ in this regard.)

Genome selection: GRIMM is primarily used for pairs of genomes, but can also display matrices to show comparisons between all pairs. Input files with two genomes default to pairwise comparisons and files with more than two genomes default to matrix output, unless the options below are used:
• -g i,j: Compare genome i and genome j. Genomes in the input file are numbered starting at 1. With the -s option, a rearrangement scenario will be computed that transforms genome i into genome j. Figure 3d used -g 1,4. For a file with two genomes, this option is not necessary, unless you want to compare them in the reverse order (-g 2,1).
• -m: Matrix format (default when there are more than two genomes if -g is not used). A matrix of the pairwise distances between the genomes will be computed, as shown in Fig. 3e. When used in combination with the -v option, matrices will be computed for breakpoint graph parameters between all pairs of genomes.

Pairwise comparison functions (not matrix mode): Defaults to -d -c -s for multichromosomal genomes and -d -s for unichromosomal genomes.
• -d: Compute distance between genomes (minimum number of rearrangement steps combinatorially possible).
• -s: Display a most parsimonious rearrangement scenario between two genomes. Not available in matrix mode.
• -c, -z: In multichromosomal genomes, a pair of additional markers (“caps”) are added to the ends of each chromosome in each genome, and the chromosomes are concatenated together into a single ordinary signed permutation (without “$” chromosome breaks). The details are quite technical; see refs. 3,6,7. These options display the genomes with added caps in two different formats: -c displays the concatenation as an ordinary signed permutation (not broken up at chromosomes) suitable as input to GRIMM with the -L option, while -z breaks it up by chromosome.

Unsigned genomes:
• -U n: A fast approximation algorithm for determining the signs in unsigned genomes via hill-climbing with n random trials. This works for any number of genomes, not just two. It works for some signs known and some unknown, or for all signs unknown. This was used in Murphy and colleagues to determine signs of blocks with only one gene (15). A paper about the technical details is in preparation. See Note 1 and Fig. 6.
• -u: Exact computation of the rearrangement distance between two unsigned genomes (all signs unknown). This also computes an assignment of signs that would achieve this distance if the genomes were regarded as signed. For unichromosomal genomes, this uses the algorithm by Hannenhalli and Pevzner (4) and for multichromosomal genomes, this uses a generalization of that by Glenn Tesler (paper in preparation). The complexity is exponential in the number of singletons, so this option is only practical when the number of singletons is small (see Fig. 6).

3.2.4. Sample Run: Toy Example

We will use the file data/sample_data.txt shown in Fig. 3a. The command line to compute the scenario shown in Fig. 3d is

% grimm -f data/sample_data.txt -g 1,4

The command line to compute the matrix shown in Fig. 3e is

% grimm -f data/sample_data.txt -m

(Note that -m is optional; since there are more than two genomes, it is assumed unless -g is used.) Additional details about the breakpoint graphs can be shown in either of these by adding the option -v.

% grimm -f data/sample_data.txt -g 1,4 -v
% grimm -f data/sample_data.txt -m -v

3.2.5. Sample Run 2: Human-Mouse-Rat-Chicken Dataset

GRIMM can also be run on the files mgr_macro.txt output by GRIMM-Synteny using similar command lines but changing the filename. Of greater interest, however, would be to run GRIMM after MGR has computed the topology of a phylogenetic tree and possible gene/block orders at its ancestral nodes. GRIMM would then be appropriate to study the breakpoint graph or possible scenarios on a branch of the tree.

3.3. MGR: Reconstructing the Rearrangement Scenario of Multiple Genomes

The Multiple Genome Rearrangement Problem is to find a phylogenetic tree describing the most “plausible” rearrangement scenario for multiple species. Although the rearrangement distance for a pair of genomes can be computed in polynomial time, its use in studies of multiple genome rearrangements has been somewhat limited since it was not clear how to efficiently combine pairwise rearrangement scenarios into a multiple rearrangement scenario. In particular, Caprara demonstrated that even the simplest version of the Multiple Genome Rearrangement Problem, the Median Problem with reversals only, is NP-hard (16).

MGR implements an algorithm which, given a set of genomes, seeks a tree such that the sum of the rearrangements is minimized over all the edges of the tree. It can be used for the inference of both phylogeny and ancestral gene orders (8). MGR outputs trees in two different formats described below: (1) Newick format and (2) ASCII representation. The algorithm makes extensive use of the pairwise distance engine GRIMM. In this section, we first provide a detailed description of the input and output format. Next, we describe two typical standard runs, one a toy example and one of the human-mouse-rat-chicken dataset.

Fig. 6. Unsigned data. (a) Input file. Each genome has one chromosome, directed linear, so the -L option is used on all commands. (b–d) Excerpts from runs. (b,c) The -u option does an exact computation for each pair of genomes. (d) 100 trials of an approximation algorithm are performed that seeks a best global assignment of signs.

3.3.1. Input Format

The gene order input format for MGR is the same as for GRIMM. For an example with 4 genomes with 2 or 3 chromosomes each, see Fig. 3a. This small sample input file can also be found in the MGR package in the subdirectory data as sample_data.txt.

3.3.2. Output: Newick Format and ASCII Representation

The Newick Format uses nested parentheses for representing trees. It allows the labeling of the leaves and the internal nodes. Branch lengths corresponding to the number of rearrangements can also be incorporated using a colon. For instance, the example shown in Fig. 3a would produce a tree in Newick Format shown in Fig. 3b. Note that the internal nodes correspond to ancestral nodes and are labeled using the letter A followed by a number (e.g., A4). Also note that the Newick Format specifies a rooted tree with ordered branches, but MGR determines an unrooted tree, so MGR chooses an arbitrary location for the root. Your additional knowledge of the timeline should be used to relocate the root and order the branches.

The ASCII graphical representation of the tree is generated by a modified version of the RETREE program available in the PHYLIP package by Joe Felsenstein (see the section on Website References). The number of rearrangements that occurred on each edge is shown and (unless the -F switch is selected, see the following section) the edges are drawn proportionally to their length. When no number is shown on an edge, it means that no rearrangement occurred on that edge. See Fig. 3c for the tree associated with the example from Fig. 3a.
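As an illustration of how labels and branch lengths are written in Newick Format, the sketch below serializes a small tree held as nested tuples. The tree, its labels, and its branch lengths are invented for this example (they are not the tree of Fig. 3b), and the tuple layout is our own choice rather than anything used by MGR.

```python
def to_newick(node):
    """Serialize a node given as (label, branch_length, children) to Newick text.

    branch_length may be None (e.g., for the root); children is a list of nodes.
    """
    label, length, children = node
    text = ""
    if children:
        text = "(" + ",".join(to_newick(child) for child in children) + ")"
    text += label
    if length is not None:
        text += f":{length}"  # branch length follows a colon
    return text

# Hypothetical tree: ancestors labeled A4 and A5, leaves g1..g3,
# branch lengths standing for numbers of rearrangements.
tree = ("A5", None, [
    ("A4", 1, [("g1", 2, []), ("g2", 0, [])]),
    ("g3", 3, []),
])
print(to_newick(tree) + ";")
```

Because the root position chosen by MGR is arbitrary, rerooting such a structure amounts to choosing a different node as the outermost tuple.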

3.3.3. Usage and Options

There are three main usages of MGR: (1) with data from a file, (2) with simulated data, and (3) to display previous results. In the current description, we will focus on the first usage, which represents the most common application. The command-line syntax is as follows:

% MGR -f filename [other options]

Input/output:
• -f Input_file_name: Same as in GRIMM.
• -o Output_file_name: Same as in GRIMM.
• -v: Verbose output. This is very important to visualize and record the progress of MGR, especially for large datasets. Using this option, the initial input genomes are reported along with their pairwise distances. Following that, each rearrangement identified in the procedure and the intermediate genomes are reported. The program will terminate once the total distance between the intermediate genomes of the various triplets has converged to zero.
• -w: Web output (html). Should not be combined with the -v option but allows for a more elaborate html report. This option can also be used to redisplay in html format a previous result (if used in combination with the -N option, see the README file).
• -W: Width (in characters) of the tree displayed (default is 80). Only affects the way the ASCII representation of the tree is displayed.
• -F: Fixed size edges in the tree displayed. Displays fixed size edges in the ASCII representation instead of edges proportional to their length.

Genome type: • -C, -L, or neither: Unichromosomal circular genomes, unichromosomal directed linear genomes, or multichromosomal undirected linear genomes. These are the same as in GRIMM.

Other options:
• -H: Heuristic to speed up triplet resolution:
  -H 1: only look at reversals initially, and pick the first good one.
  -H 2: only look at reversals initially, and take the shortest one.
  Especially for large instances of the problem (e.g., more than 100 homologous blocks or more than five genomes), these options can greatly speed up the algorithm by restricting the search and the selection to specific categories of good rearrangements (see Note 2).
• -c: Condense strips for efficiency. Combines strips of two or more homologous blocks that are in the exact same order in all k genomes being considered. If this option is selected, the condensing procedure is called recursively, but the whole process is seamless as the strips are uncondensed before the output is generated. This option can greatly speed up MGR, especially if the starting genomes are highly similar. This is related to GRIMM-Synteny’s -c option; the difference is that GRIMM-Synteny’s -c option changes the blocks that are output, whereas MGR’s -c option affects internal computations but uncondensed block numbers are used on output.
• -t: Generate a tree compatible with the topology suggested in the file. Forces MGR to look for an optimal rearrangement scenario only on the tree topology provided by the user (see Note 3).

3.3.4. Sample Run 1: Toy Example

To run MGR on the example displayed in Fig. 3a do:

% MGR -f data/sample_data.txt

The output should be similar to the output displayed in Fig. 3b, c with some additional information on the parameters used and on the permutations associated with the input genomes and the ancestors recovered. To view the same result but in html format, try:

% MGR -f data/sample_data.txt -w -o sample_out.html

3.3.5. Sample Run 2: Human–Mouse–Rat–Chicken Dataset

The file data/hmrc_gene_perm.txt is identical to the file gene7/mgr_macro.txt generated under Subheading 3.1.5. on the basis of orthologous genes. It contains the human, mouse, rat, and chicken genomes represented by four signed permutations of 586 homologous blocks. Run MGR on this example as follows:

% MGR -f data/hmrc_gene_perm.txt -H2 -c -o hmrc_gene_perm_out.txt

Even using the -H2 and -c switches to speed up computations, this is a challenging instance of the multiple genome rearrangement problem that will probably take a few hours to complete on most computers. The final output (hmrc_gene_perm_out.txt) should be identical to the file data/hmrc_gene_perm_out.txt. To get a better sense of how quickly (or slowly) the program is converging, you can use the -v switch:

% MGR -f data/hmrc_gene_perm.txt -H2 -c -v -o hmrc_gene_perm_out1.txt

Of course, this will also generate a much larger output file. An actual rearrangement scenario between one of the initial genomes and one of the recovered ancestors can be obtained by extracting the permutations from the bottom of the output file, creating a new input file (e.g., hmrc_result.txt), and running:

% grimm -f hmrc_result.txt -g 1,5

To facilitate the comparison of the initial, modern day, genomes with the recovered ancestral genomes, it is also possible to plot the various permutations (see Fig. 5 and ref. 15). But the tools to produce such plots are highly data dependent and are not provided with the current release of MGR. For challenging examples, when the initial pairwise distances are significant as compared to the number of homologous blocks (such as in the current example), it is possible to find alternative ancestors satisfying the same overall scenario score. This in turn can lead to the identification of weak and strong adjacencies in the ancestors; see Note 4.

4. Notes

1. Radiation-hybrid maps and missing signs. In ref. 15, an 8-way comparison was done between three sequenced species (human, mouse, and rat) and five species mapped using an RH approach (cat, cow, dog, pig, and on some chromosomes, horse). GRIMM-Synteny is not appropriate here owing to the RH-mapped data. A method is described in that paper to construct syntenic blocks that takes into account the mixed coordinate system and the types of errors that occur with RH maps. In blocks with two or more genes, an inference about the orientation of the block in each species was easy to make. But singleton blocks (supported by a single gene) had known orientations in the sequenced species (human, mouse, and rat), and unknown orientations in the other species. GRIMM was used to guess the signs in the other species with the -U option.

2. MGR heuristics. The -H heuristics rely on the simple assumption that reversals, and specifically short reversals in the case of -H2, represent a more common evolutionary event as compared to translocations, fusions, and fissions. These heuristics also allow a more robust analysis of noisy datasets that may contain sign errors or local misordering.

3. Fixed topology. MGR can be invoked using the -t option to reconstruct a rearrangement scenario for a specific tree topology. There could be various reasons to use this option: to compare the score of two alternative topologies, to accelerate computations, and so on. The desired topology needs to be specified in a separate file using the Newick format without edge lengths. In such topology files, genomes are referenced using identifiers from 1 to k, where 1 is the genome that appears first in the main gene order file, 2 is the genome that appears second, and so on. Refer to the file data/sample_tree.txt for an example associated with data/sample_data.txt. The command line to run this example would be:

% MGR -f data/sample_data.txt -t data/sample_tree.txt

Note that the scenario recovered is slightly worse than the scenario shown in Fig. 3c with seven rearrangements instead of six, but the tree topology matches the tree topology in data/sample_tree.txt. The details of the algorithm used for this are in ref. 15.

4. Alternative ancestors. For a given ancestor, when the ratio between the total number of rearrangements of the three incident edges and the number of common blocks is high, it is often possible to find alternative ancestors also minimizing the total number of rearrangement events on the evolutionary tree. By exploring a wide range of such alternative ancestors, it is possible to distinguish between weak and strong areas of the ancestral reconstructions. Specifically, adjacencies that are present in all of the observed alternative ancestors are called strong adjacencies, whereas adjacencies that are not conserved in at least one of the alternative ancestors are called weak adjacencies (see ref. 15 for more details). The number of weak adjacencies identified in this manner is actually a lower bound for the true number of weak adjacencies since only a subset of all the alternative solutions can be explored. This search for alternative ancestors is available in MGR for k = 3 using the -A switch but it will not be described further in this chapter.
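The distinction between strong and weak adjacencies in Note 4 can be illustrated with a few lines of code. This sketch is not MGR's -A search: it assumes the alternative ancestors are already available as unichromosomal signed permutations, and the canonical form used to identify an adjacency with its reverse complement is our own convention.

```python
def adjacencies(perm):
    """Return the set of adjacencies of a signed permutation.

    The pair (a, b) read left to right is the same adjacency as (-b, -a) read
    on the opposite strand, so the smaller of the two tuples is stored.
    """
    return {min((a, b), (-b, -a)) for a, b in zip(perm, perm[1:])}

def classify_adjacencies(ancestors):
    """Split adjacencies into strong (in every ancestor) and weak (missing from some)."""
    sets = [adjacencies(p) for p in ancestors]
    strong = set.intersection(*sets)
    weak = set.union(*sets) - strong
    return strong, weak

# Three invented alternative ancestors over the same four blocks.
alternatives = [
    [1, 2, 3, 4],
    [1, 2, -3, 4],
    [-4, -3, -2, -1],  # the first ancestor read from the other strand
]
strong, weak = classify_adjacencies(alternatives)
print("strong:", sorted(strong))
print("weak:", sorted(weak))
```

As the note explains, adjacencies classified as weak from a limited set of alternatives only give a lower bound, since unexplored alternatives could break adjacencies that look strong here.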

Acknowledgments

G.B. is supported by funds from the Biomedical Research Council of Singapore. G.T. is supported by a Sloan Foundation Fellowship.

References

1. Tesler, G. (2002) GRIMM: genome rearrangements web server. Bioinformatics 18(3), 492–493.
2. Hannenhalli, S. and Pevzner, P. A. (1995) Transforming cabbage into turnip (polynomial algorithm for sorting signed permutations by reversals). In: Proceedings of the 27th Annual ACM Symposium on the Theory of Computing, pp. 178–189. (Full version appeared in JACM 1999; 46, 1–27).
3. Hannenhalli, S. and Pevzner, P. A. (1995) Transforming men into mice (polynomial algorithm for genomic distance problem). In: 36th Annual Symposium on Foundations of Computer Science (Milwaukee, WI, 1995), pp. 581–592, IEEE Comput. Soc., Los Alamitos, CA.
4. Hannenhalli, S. and Pevzner, P. A. (1996) To cut … or not to cut (applications of comparative physical maps in molecular evolution). In: Proceedings of the Seventh Annual ACM-SIAM Symposium on Discrete Algorithms (Atlanta, GA, 1996), pp. 304–313, ACM, New York.
5. Bader, D., Moret, B., and Yan, M. (2001) A linear-time algorithm for computing inversion distances between signed permutations with an experimental study. J. Comput. Biol. 8(5), 483–491.
6. Tesler, G. (2002) Efficient algorithms for multichromosomal genome rearrangements. J. Comput. Syst. Sci. 65(3), 587–609.
7. Ozery-Flato, M. and Shamir, R. (2003) Two notes on genome rearrangement. J. Bioinform. Comput. Biol. 1(1), 71–94.
8. Bourque, G. and Pevzner, P. A. (2002) Genome-scale evolution: reconstructing gene orders in the ancestral species. Genome Res. 12(1), 26–36.
9. Bourque, G., Pevzner, P. A., and Tesler, G. (2004) Reconstructing the genomic architecture of ancestral mammals: lessons from human, mouse, and rat genomes. Genome Res. 14(4), 507–516.
10. Hillier, L., Miller, W., Birney, E., et al. (2004) Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432, 695–716.
11. Bourque, G., Zdobnov, E., Bork, P., Pevzner, P., and Tesler, G. (2005) Genome rearrangements in human, mouse, rat and chicken. Genome Res. 15(1), 98–110.
12. Pevzner, P. A. and Tesler, G. (2003) Genome rearrangements in mammalian evolution: lessons from human and mouse genomes. Genome Res. 13(1), 37–45.
13. Pevzner, P. A. and Tesler, G. (2003) Transforming men into mice: the Nadeau–Taylor chromosomal breakage model revisited. In: Proceedings of RECOMB 2003, pp. 247–256.
14. Ohno, S. (1967) Sex Chromosomes and Sex Linked Genes. Springer, Berlin.
15. Murphy, W. J., Larkin, D. M., van der Wind, A. E., et al. (2005) Dynamics of mammalian chromosome evolution inferred from multispecies comparative maps. Science 309(5734), 613–617.
16. Caprara, A. (1999) Formulations and complexity of multiple sorting by reversals. In: Proceedings of the Third Annual International Conference on Computational Molecular Biology (RECOMB-99) (Istrail, S., Pevzner, P. A., and Waterman, M. S., eds), pp. 84–93, ACM, Lyon, France.

Website References

1. Felsenstein, J., PHYLIP. http://evolution.genetics.washington.edu/phylip.html.
2. Bader, D. A., Moret, B. M. E., Warnow, T., et al. (2000) GRAPPA. http://www.cs.unm.edu/~moret/GRAPPA.
3. Ensembl from http://www.ensembl.org.
4. HomoloGene from http://www.ncbi.nlm.nih.gov/HomoloGene.
5. UCSC Genome Bioinformatics website from http://www.genome.ucsc.edu.

11

Computational Reconstruction of Ancestral DNA Sequences

Mathieu Blanchette, Abdoulaye Baniré Diallo, Eric D. Green, Webb Miller, and David Haussler

Summary

This chapter introduces the problem of ancestral sequence reconstruction: given a set of extant orthologous DNA genomic sequences (or even whole genomes), together with a phylogenetic tree relating these sequences, predict the DNA sequence of all ancestral species in the tree. Blanchette et al. (1) have shown that for certain sets of species (in particular, for eutherian mammals), very accurate reconstruction can be obtained. We explain the main steps involved in this process, including multiple sequence alignment, insertion and deletion inference, substitution inference, and gene arrangement inference. We also describe a simulation-based procedure to assess the accuracy of the reconstructed sequences. The whole reconstruction process is illustrated using a set of mammalian sequences from the CFTR region.

Key Words: Ancestral DNA sequence reconstruction; multiple sequences alignment; mammalian phylogeny; mammalian evolution; substitutions and indels reconstruction; ancestral sequence reconstruction accuracy.

1. Introduction

Following the completion of the human genome sequence, there is now considerable interest in obtaining a more comprehensive understanding of its evolution (2–4). Patterns of evolutionary conservation are used to screen human DNA mutations to predict those that will be deleterious to protein function and to identify noncoding sequences that are under negative selection, and hence may perform regulatory or structural functions (5–7). Long periods of conservation followed by sudden change may provide clues to the evolution of new human traits (8,9). All of these efforts depend, directly or indirectly,

From: Methods in Molecular Biology: Phylogenomics Edited by: W. J. Murphy © Humana Press Inc., Totowa, NJ

Blanchette et al.

on reconstructing the evolutionary history of the bases in the human genome, and hence on reconstructing the genomes of our distant ancestors. Although some information about ancestral species has been irrevocably lost during evolution, there is still the possibility that large regions of the genomes of ancestral species with many modern descendants can be approximately inferred from the genomes of modern species using a model of molecular evolution. Indeed, it has recently been reported that in the specific case of mammalian evolution, ancestral genome reconstruction was possible to a surprising degree of accuracy (1). The ideal target species for a genomic reconstruction is one that has generated a large number of independent, successful descendant lineages through a rapid series of early speciation events. In this case, the problem can be viewed as attempting to reconstruct an original from many independent noisy copies. In the limit of an instantaneous radiation, the accuracy of the reconstruction approaches 100% exponentially fast as the number of copies increases. From the Cretaceous period, a good choice for reconstruction would be the genome of the eutherian ancestor, as this species is believed to have spawned the relatively rapid radiation of the different lineages of modern placental mammals (10,11). This ancient species also has the added advantage of being a human ancestor, so its reconstruction, however speculative, may shed additional light on our own evolution, perhaps helping to explain features of the human and other modern mammalian genomes. In this chapter, we describe the set of computational approaches and tools that exist for reconstructing ancestral sequences and for estimating the accuracy of such a reconstruction. This area being relatively new, there is no single tool that performs all the steps involved in the reconstruction. Instead, tools developed by different authors need to be used sequentially. 
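The "many independent noisy copies" view can be made concrete with a deliberately simplified calculation. The sketch below is not the reconstruction method of ref. 1: it assumes an instantaneous radiation (a star-shaped tree), a two-letter alphabet, an independent per-site change probability p on every lineage, and reconstruction by simple majority vote over an odd number n of descendants. All of these are assumptions made only for illustration.

```python
from math import comb

def majority_accuracy(n, p):
    """Probability that a majority of n independent copies retain the ancestral state.

    n: odd number of descendant lineages (assumed independent, star phylogeny).
    p: probability that a given lineage has changed at the site (two-state model).
    """
    q = 1.0 - p
    # Sum the binomial probabilities of more than n/2 copies being unchanged.
    return sum(comb(n, k) * q**k * p**(n - k) for k in range(n // 2 + 1, n + 1))

for n in (1, 3, 5, 9, 19):
    print(n, round(majority_accuracy(n, 0.2), 6))
```

Under these assumptions the error probability shrinks rapidly as n grows, which is the intuition behind choosing an ancestor with many independent descendant lineages. Real phylogenies are not star-shaped, so the branch lengths and tree topology discussed later in the chapter matter in practice.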
The methods are illustrated on a 1.8-Mb region of mammalian genomes, containing the CFTR gene, sequenced by the ENCODE project (12).

2. Materials

2.1. Sequence Data

To reconstruct the ancestral sequences, orthologous DNA regions from as many descendants as possible need to be compared. The more orthologous sequences are available, the more accurate the reconstruction will be, provided accurate evolutionary models are used. For vertebrate sequences, a good repository of complete genome sequences is the UCSC Genome Browser (http://genome.ucsc.edu [13]). Besides raw DNA sequences, multiple genome alignments and various types of genome annotation are accessible from the same site. For the purpose of this chapter, we illustrate the process of ancestral sequence reconstruction using a 1.8-Mb region of the human genome including the CFTR gene, together with orthologous regions from 19 other mammals (available from the UCSC Genome Browser). This deep coverage is not currently available over the whole genome, but only for the regions targeted for sequencing by the ENCODE project (12).

2.2. Phylogenetic Information

An important component of ancestral sequence reconstruction is the knowledge of the phylogenetic relationships among the species being compared. Knowing the correct tree topology and estimating the length of its branches are crucial for an accurate reconstruction, as well as for estimating the accuracy of that reconstruction through simulations. Accepted phylogenetic trees are now available for many sets of species (see, e.g., refs. 10,14). For others, the exact phylogenetic relationships remain unclear and need to be inferred prior to reconstruction, using programs like Phylip (15), PAUP (16), or MrBayes (17). These tools are also necessary to estimate the branch lengths of the phylogenetic tree using a maximum likelihood approach.

2.3. Sequence Annotation

In some cases, functional annotation of extant sequences can be used to obtain more accurate reconstruction of ancestral sequences. This is particularly the case for coding region annotation and repetitive region annotation. For metazoans, good sources of such annotations are the UCSC Genome Browser and the Ensembl Genome Browser (http://www.ensembl.org).

3. Methods

This section introduces the techniques that have been developed for predicting ancestral DNA sequences based on their extant descendants, and for estimating the accuracy of the reconstruction. We illustrate this reconstruction process (see also Note 1) and the type of information that can be derived from it using a 1.8-Mb region surrounding the CFTR gene in mammals (see ref. 1 and Note 2 for more details).

3.1. Predicting Ancestral Sequences

The prediction of ancestral genomes can be divided into four main steps. A crucial first step toward the reconstruction is to build an accurate multiple alignment of the extant orthologous sequences, thus establishing orthology relationships among the nucleotides of each sequence. Second, the process of indel reconstruction determines the most likely scenario of insertions and deletions that may have led to the extant sequences. Third, substitution history is reconstructed using a maximum likelihood approach. The last step involves dealing with genome rearrangements (inversions, transpositions, translocations, duplications, and chromosome fusions and fissions).


3.1.1. Multiple Sequence Alignment

Given a set of orthologous sequences, the multiple alignment problem consists of identifying (by aligning them together) the sets of nucleotides derived from a common ancestor through direct inheritance or through substitution. Many approaches have been developed to align multiple large genomic regions. Some of the most popular approaches include programs like MAVID (18), MLAGAN (5,19), and TBA (20). All these approaches fall under the category of progressive alignment methods and require prior knowledge of the topology of the phylogenetic tree that relates the extant sequences compared (see Subheading 2.2.). The threaded blockset aligner (TBA) program, based on the well-established pairwise alignment program BLASTZ (21), has been shown to be particularly accurate for aligning mammalian sequences and is thus a tool of choice for ancestral reconstruction for these species. The program is available at http://www.bx.psu.edu/miller_lab/. The multiple sequence alignment problem is discussed in more detail in Chapter 9.

3.1.2. Indel Reconstruction

Given a multiple sequence alignment of the soft-repeat-masked extant sequences and a phylogenetic tree with known topology and branch lengths, the next step consists of predicting, for each ancestral node in the tree, which columns of the alignment correspond to ancestral bases and which correspond to nucleotides inserted after the ancestor. Although the problem of parsimonious indel inference has recently been shown to be NP-hard (22), good heuristics have been developed by Fredslund et al. (23), Blanchette et al. (1), and Chindelevitch et al. (22).

Currently, the only publicly available program for indel reconstruction is the inferAncestors program based on the greedy approach of Blanchette et al. (1). This section describes briefly how the program works. Given a multiple alignment, all the gaps in the alignment are first marked as unexplained. The algorithm iteratively selects the insertion or deletion, performed along a specific edge of the tree and spanning one or more columns of the alignment, which yields the largest number of alignment gaps explained per unit of cost. The number of gaps explained by a deletion is the number of unexplained gaps in the subtree above which the deletion occurs. The number of gaps explained by an insertion is the number of unexplained gaps in the complement of the subtree above which the insertion occurs. The costs can be defined heuristically. The cost of a deletion is given by 1 + 0.01 log(L) – 0.01b, where L is the length of the deletion and b is the length of the branch along which the event takes place. The cost of an insertion is given by 1 + 0.01 log(L) – 0.01b – r, where r is a term that takes value 0.5 if the repetitive content of the segment inserted is more than 90%. Once the best insertion or deletion has been identified, its gaps are marked as “explained.” This does not preclude them from being part of other indels, but they will not count in their evaluation. Finally, heuristics are used to reduce errors related to incorrect alignment, in particular to reduce the problems caused by two repetitive regions from two distantly related species mistakenly aligned to each other, with other species having gaps in that region.
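The cost functions and the greedy selection rule described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the inferAncestors implementation: the function names, the dictionary keys, and the example numbers are hypothetical; the logarithm is taken to be natural (the text does not give the base); the branch-length term of the deletion cost is read as subtracted, by analogy with the insertion cost; and r is assumed to be 0 when the repeat content is 90% or less.

```python
import math

def deletion_cost(length, branch_length):
    """Heuristic deletion cost: 1 + 0.01*log(L) - 0.01*b (natural log assumed)."""
    return 1 + 0.01 * math.log(length) - 0.01 * branch_length

def insertion_cost(length, branch_length, repeat_fraction):
    """Heuristic insertion cost: 1 + 0.01*log(L) - 0.01*b - r.

    r = 0.5 when more than 90% of the inserted segment is repetitive;
    r = 0 otherwise (an assumption, the text only states the first case).
    """
    r = 0.5 if repeat_fraction > 0.9 else 0.0
    return 1 + 0.01 * math.log(length) - 0.01 * branch_length - r

def best_event(candidates):
    """Return the candidate indel that explains the most gaps per unit of cost."""
    def score(event):
        if event["kind"] == "deletion":
            cost = deletion_cost(event["length"], event["branch_length"])
        else:
            cost = insertion_cost(event["length"], event["branch_length"],
                                  event["repeat_fraction"])
        return event["gaps_explained"] / cost
    return max(candidates, key=score)

# Two hypothetical candidate events competing in one greedy iteration.
candidates = [
    {"kind": "deletion", "length": 10, "branch_length": 0.1,
     "repeat_fraction": 0.0, "gaps_explained": 30},
    {"kind": "insertion", "length": 300, "branch_length": 0.05,
     "repeat_fraction": 0.95, "gaps_explained": 20},
]
chosen = best_event(candidates)
print(chosen["kind"])
```

In a full greedy loop, the gaps covered by the chosen event would then be marked as explained and the scores of the remaining candidates recomputed before the next selection.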

3.1.3. Substitution Reconstruction

After having established which positions of the multiple alignment correspond to bases in the ancestor, the inferAncestors program predicts which nucleotide (A, C, G, or T) was present at each position in the ancestor using the standard posterior probability approach (24) based on a dinucleotide substitution model in which substitutions at two adjacent positions are independent except for CpG, whose substitution rate to TpG is 10 times higher than those of other transitions (25). This phase of the reconstruction relies on the availability of accurate branch-length estimates for the phylogenetic tree, which can be obtained as described under Subheading 2.2.
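To make the posterior probability idea concrete, the sketch below computes the posterior distribution of the base at the root of a small tree for a single alignment column. It is only an illustration: it swaps in the simple single-site Jukes-Cantor model instead of the context-dependent dinucleotide model used by inferAncestors, it treats only the root node, and the tree encoding, function names, and branch lengths are assumptions made for this example.

```python
import math

BASES = "ACGT"

def jc69(t):
    """Jukes-Cantor probabilities over a branch of length t.

    Returns (probability of ending in the same base,
             probability of ending in one specific different base).
    """
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e, 0.25 - 0.25 * e

def conditional(node):
    """Pruning step: likelihood of the leaves below `node` for each base at `node`.

    A leaf is a one-letter string; an internal node is a list of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return [1.0 if base == node else 0.0 for base in BASES]
    like = [1.0] * 4
    for child, t in node:
        p_same, p_diff = jc69(t)
        child_like = conditional(child)
        for i in range(4):
            like[i] *= sum((p_same if i == j else p_diff) * child_like[j]
                           for j in range(4))
    return like

def root_posterior(root, prior=(0.25, 0.25, 0.25, 0.25)):
    """Posterior probability of each base at the root, given the leaf bases."""
    joint = [p * l for p, l in zip(prior, conditional(root))]
    total = sum(joint)
    return {base: x / total for base, x in zip(BASES, joint)}

# Hypothetical column: two close relatives show A, a more distant species shows G.
tree = [([("A", 0.1), ("A", 0.1)], 0.05), ("G", 0.2)]
posterior = root_posterior(tree)
predicted = max(posterior, key=posterior.get)
print(predicted, posterior)
```

The predicted ancestral base is the one with the highest posterior probability, and that probability can serve as a per-site confidence value of the kind discussed in Subheading 3.1.4.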

3.1.4. The inferAncestors Program

The inferAncestors program, available from http://www.mcb.mcgill.ca/blanchem/software, integrates the steps of indel and substitution inference. The algorithm takes as input a multiple alignment in FASTA format, together with a phylogenetic tree in New Hampshire (Newick) format. The program outputs a predicted ancestral sequence for each internal node of the phylogenetic tree. Two other files are output, describing the confidence of the prediction made for each base of each ancestral sequence. The first describes the confidence in the prediction of presence or absence of a base at each position of each ancestral sequence. The second describes the confidence of the actual nucleotide (A, C, G, or T) predicted. The inferAncestors program is written in C++ and has been tested on Linux and Mac OS X.
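Before running a reconstruction, it can help to check that the two inputs agree with each other. The sketch below is a generic pre-check, not part of inferAncestors: it assumes that sequence names in the FASTA alignment should match the leaf labels of the tree and that all aligned sequences should have the same length. The helper names and the toy data are hypothetical, and the leaf-name extraction handles only simple unquoted Newick labels.

```python
import re

def read_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dictionary."""
    seqs, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line
    return seqs

def newick_leaves(newick):
    """Leaf labels of a Newick tree: labels that directly follow '(' or ','."""
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)

def check_inputs(fasta_text, newick):
    """Return a list of problems found; an empty list means the inputs agree."""
    seqs = read_fasta(fasta_text)
    problems = []
    if len({len(s) for s in seqs.values()}) > 1:
        problems.append("aligned sequences differ in length")
    leaves = set(newick_leaves(newick))
    if leaves != set(seqs):
        odd = ", ".join(sorted(leaves ^ set(seqs)))
        problems.append("names present in only one input: " + odd)
    return problems

# Toy inputs for illustration only.
alignment = ">human\nACG-T\n>chimp\nACGAT\n>mouse\nAC--T\n"
tree = "((human:0.01,chimp:0.01):0.1,mouse:0.3);"
print(check_inputs(alignment, tree))
```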

3.1.5. Genome Rearrangements

To complete the inference of ancestral genomes, the ancestral DNA sequences inferred for each block of orthologous sequences need to be ordered into a single, contiguous genome. This problem is made challenging by the presence of genome rearrangements (inversions, transpositions, translocations, and duplications/losses). One of the most popular computer programs for inferring ancestral gene arrangement is MGR ([26], http://www.cse.ucsd.edu/groups/bioinformatics/MGR), which is described in detail in Chapter 10.


3.2. Assessing Reconstruction Accuracy Through Simulations

This section describes a simulation-based method for assessing the accuracy of the reconstructed ancestor. An alternate approach based on retrotransposons is described in ref. 1. To assess the reconstructability of ancestral genomic sequences from their extant descendants, the simplest method is to use simulations of sequence evolution. Starting from a known (but synthetic) ancestral sequence, we let the sequence evolve along the branches of the tree until the leaves are reached. The ancestral sequence reconstruction procedure is then applied to the set of simulated leaves, and the prediction made is compared to the known ancestral sequence.

The simulation program Simali (http://www.bx.psu.edu/miller_lab/), based on the Rose program (27), can be used to mimic the evolution of sequences under no selective pressure. Given a phylogenetic tree, the program simulates sequence evolution by performing random substitutions, deletions, and insertions along each branch, in proportion to its length. The program allows for the insertion of retrotransposons, which is an important source of error in sequence alignment, and thus in ancestral sequence reconstruction.

To assess the reconstructability of ancestral mammalian genomic sequences, Blanchette et al. (1) performed a series of computational simulations of the neutral evolution of a hypothetical 50 kb ancestral genomic region into orthologous regions in 20 modern mammals (Fig. 1). The simulations are based on the phylogenetic tree inferred by Eizirik et al. (10) on a set of genes for a large set of mammals. Substitutions follow a context-independent HKY model (28) with Ts/Tv = 2, p(a) = p(t) = 0.3, and p(c) = p(g) = 0.2, except that substitution rates of CpG pairs are 10 times higher than other rates (25).
Deletions are initiated at a rate of about 0.056 times the substitution rate, their length is chosen according to a previously reported empirical distribution (29) that ranges between 1 and 5000 nucleotides, and their starting point is uniformly distributed. Insertions occur randomly according to a mixture model. Small insertions (of size between 1 and 20 nt) occur at half the rate of deletions, their size distribution is empirically determined (29), and their content is a random sequence for which each nucleotide is chosen independently from the background distribution. Blanchette et al. (1) also simulated the insertion of retrotransposons, using a library of 15 different types of transposable elements chosen to cover the large majority of repetitive elements observed in well-studied mammals (30). The rate of insertion of each repeat varies from branch to branch, so that certain retrotransposons (such as ALUs, SINEs B2, and BOV) are lineage-specific, whereas others (L1, LTR, and DNA) are both present in the sequence at the root of the tree (at varying levels of decay) and can be inserted along any branch. The code and parameters used for these simulations are available with the Simali package.

Fig. 1. Estimated reconstructability of ancestral mammalian sequences. Average base-by-base error rate in the reconstruction of each simulated ancestral sequence. The error rate shown is the sum of the percentages of bases that are missing, added, or mismatched as a result of errors in the reconstruction, averaged over one hundred simulations of sets of orthologous sequences of length approximately 50 kb. Error rates are given first for all regions, and in parentheses for nonrepetitive regions only. The species names at the leaves only indicate what organisms we simulated; no actual biological sequences were used here. The tree topology and branch lengths are derived directly from Eizirik et al. (10).

After generating a set of simulated sequences, the sequences are first soft-repeat-masked using RepeatMasker (31) and then aligned using one of the methods under Subheading 3.1.1. The repeat-masked multiple alignment is then fed into the inferAncestors program, which produces a prediction of the ancestral sequence at each internal node of the phylogenetic tree. To compare the actual ancestral sequence generated by simulations to the predicted ancestral sequence, we align them and count the number of missing bases (those present in the actual ancestor but not in the reconstruction), added bases (present in the reconstruction but not in the actual ancestor), and mismatch errors (positions in the reconstruction assigned the incorrect nucleotide). The sum of the rates of all three types of errors, calculated separately at each ancestral node in the phylogenetic tree, is used to estimate the reconstructability of a given ancestor. In the case of mammalian sequences, Blanchette et al. (1) used the above simulation-based procedure to show that the sequence of certain mammalian ancestors can be reconstructed with remarkable accuracy.
Figure 1 shows that under this phylogenetic tree with a relatively rapid placental mammalian radiation, the neutral nonrepetitive regions of the Boreoeutherian ancestral genome that have evolved under their simple model should be reconstructable with about 99% base-by-base accuracy from the genomes of 20 present-day mammals. Repetitive regions are not reconstructed as accurately because they are more often involved in misalignments, which can result in incorrect predictions. Nonetheless, even counting errors in repetitive regions, the total accuracy is more than 98%. The simulations suggest that even in the nonrepetitive regions, much of the difficulty of the reconstruction problem lies in the computation of the multiple alignment, as a reconstruction based on the correct multiple alignment derived from the simulation itself (and thus unavailable for actual sequences) had less than half the number of reconstruction errors. Examining reconstructions made using smaller subsets of this set of 20 species, it was found that, including repetitive regions, an accuracy of about 97% can be achieved using only 10 species chosen to sample most major mammalian lineages (Fig. 2). Sampling only five of the most slowly evolving lineages yields an accuracy of about 94%. Little is gained with our current reconstruction procedures by adding more than 10 species because the risk of misalignment increases, whereas the unavoidable loss of information in the early branches persists.

Fig. 2. Estimated reconstructability of the Boreoeutherian ancestor. Fraction of the simulated Boreoeutherian ancestral sequence reconstructed incorrectly as a function of the number of extant species used for the reconstruction. For each number of species used, results are given counting all bases (left columns) and only nonrepetitive bases (right columns). Species are added in the following order: human, cat, chipmunk, sloth, manatee, rousette bat, mole, pig, beaver, tree shrew, horse, pangolin, mouse, armadillo, aardvark, okapi, dog, mole-rat, rabbit, and lemur.

An alternate approach to assessing the accuracy of a reconstruction is through a pseudo cross-validation procedure. Instead of reconstructing an ancestral sequence based on all the extant sequences available, do so using a (large) subset of these species. Different subsets of species will produce slightly different ancestral reconstructions, and the variability between these reconstructions will give an idea of the expected error rate of the reconstruction that is based on all species.
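The three error types used in this section (missing, added, and mismatched bases) can be counted from a pairwise alignment of the actual and the predicted ancestor, as in the sketch below. The function name, the use of "-" as the gap character, and the choice of the number of bases in the actual ancestor as the denominator of the error rate are assumptions made for this illustration; the text does not specify the denominator.

```python
def reconstruction_errors(actual_aln, predicted_aln):
    """Count reconstruction errors between two aligned sequences ('-' marks a gap).

    missing:  base present in the actual ancestor but absent from the prediction
    added:    base present in the prediction but absent from the actual ancestor
    mismatch: both present, but the predicted nucleotide is wrong
    """
    if len(actual_aln) != len(predicted_aln):
        raise ValueError("aligned sequences must have the same length")
    missing = added = mismatch = 0
    for a, p in zip(actual_aln, predicted_aln):
        if a == "-" and p == "-":
            continue  # column carries no information for this pair
        if p == "-":
            missing += 1
        elif a == "-":
            added += 1
        elif a != p:
            mismatch += 1
    n_actual = sum(1 for a in actual_aln if a != "-")
    total = missing + added + mismatch
    return {
        "missing": missing,
        "added": added,
        "mismatch": mismatch,
        # Assumed normalization: errors per base of the actual ancestor.
        "error_rate": total / n_actual if n_actual else float("nan"),
    }

# Toy example: one mismatch, one added base, one missing base.
result = reconstruction_errors("ACGT-ACGTA", "ACTTAAC-TA")
print(result)
```

The same function could be applied to pairs of reconstructions obtained from different species subsets, giving a rough measure of the variability used in the pseudo cross-validation procedure.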

3.3. Reconstruction of Actual Mammalian Sequences

Blanchette et al. (1) applied the reconstruction method described above to actual high-quality sequence data from a region containing the human CFTR locus, using 18 additional orthologous mammalian genomic regions generated by the NISC Comparative Sequencing Program ([12], www.nisc.nih.gov). Simulations on synthetic data like those described above indicate that for the topology and set of branch lengths for these 19 species, the ancestral sequence that can be the most accurately reconstructed based on the sequences available is the Boreoeutherian ancestor, and that neutrally evolving regions of this ancestral genome can be reconstructed with an accuracy of about 96%. On a site-specific basis, simulations suggest that more than 90% of the bases of the predicted ancestor can be assigned confidence values greater than 99%. The reconstructed ancestor and site-specific confidence estimates are available at http://genome.ucsc.edu/ancestors.

Figure 3 illustrates the reconstruction in a noncoding region of the CFTR locus that exhibits a typical level of sequence conservation. This region is located in a 32-kb intron of the CAV1 gene, about 13 kb from the 5′ exon. The bases in this region are relics left over from the insertion of a MER20 transposon sometime prior to the mammalian radiation and are thus unlikely to be under selective pressure. Notice that despite the fact that the alignment of certain species (in particular, mouse, rat, and hedgehog) appears somewhat unreliable, the inference of the presence or absence of a Boreoeutherian ancestral base at a given position is quite straightforward given the alignment, and so is, to a lesser extent, the prediction of the actual ancestral base itself. The MER20 consensus is shown for comparison. Most positions in which the reconstructed Boreoeutherian ancestral base disagrees with the MER20 consensus are likely due to substitutions in this MER20 relic that predated the Boreoeutherian ancestor, since the support of the reconstructed base is very strong in the extant species.
If the MER20 consensus sequence is used as an outgroup in the reconstruction procedure, only two bases (indicated by a longer arrow) are reconstructed differently, indicating that the reconstructed ancestral sequence is very stable and most of it is likely to be correct.

Fig. 3. Example of reconstruction of an ancestral Boreoeutherian sequence based on actual orthologous sequences derived from a MER20 retrotransposon. Arrows indicate positions where the reconstructed ancestor differs from the MER20 consensus. Longer arrows indicate the positions in which the knowledge of the MER20 consensus sequence would have changed the ancestral base prediction. The position of the human sequence displayed is chr7:115,739,755-115,739,899 (NCBI build 34). The alignment of the flanking nonrepetitive DNA (not shown) verifies that the sequences from the different species are in fact orthologous. The tree and branches are derived directly from ref. 10.

4. Notes

1. The accuracy of the reconstruction depends crucially on the length of the early branches of the phylogenetic tree. In the context of the ancestral mammalian sequence reconstruction, Blanchette et al. (1) have shown that if the major placental lineages had diverged instantaneously, they would be able to reconstruct the simulated Boreoeutherian ancestral sequence, including repetitive regions, with less than 1% error. In contrast, if the early branch lengths inferred by Eizirik et al. (10) turned out to underestimate the actual lengths by a factor of two, the error rate would jump to 3%, and to 6% if they were underestimated by a factor of 4.

2. One of the nonintuitive results presented by Blanchette et al. (1) is the observation that more ancient ancestral genomes can often be reconstructed more accurately than their more recent descendants. Why exactly is this so? For simplicity, consider
the case of reconstructing a single binary ancestral character state in the root species (e.g., purine vs pyrimidine at a given site) under a simple model in which the prior probability distribution on the ancestral character is uniform, substitution rates are known, symmetric, homogeneous, and not too high, and the total branch length in the phylogenetic tree from the root ancestor to each of the modern species is the same (i.e., assume a molecular clock). Here each of n modern species has a state that differs from the ancestral one with the same probability p < 1/2. If the tree exhibits a star topology, in which each of the modern species derives directly from the ancestor on an independent branch, then it is clear that the maximum likelihood and Bayesian maximum a posteriori reconstructions of the ancestral character agree, and the reconstructed state is the one that is most often observed in the n modern species. The probability of an error in reconstruction is

$$\sum_{k=\lceil n/2 \rceil}^{n} \binom{n}{k}\, p^{k} (1-p)^{n-k},$$

which is at most $[4p(1-p)]^{n/2}$ ([32,33]; Lemma 5, p. 479). This error approaches zero exponentially fast as n increases. The star topology has a kind of “phase transition” where the ancestor becomes highly reconstructable once enough present-day sequences are available to compensate for the length of the branches leading back to the ancestor.

In contrast, a nonstar topology such as a binary tree that has the same total root-to-leaf branch length and the same number n of modern species at the leaves has two nonzero length branches from the root ancestor R leading to intermediate ancestors A and B, and information is irrevocably lost along these two branches. No matter how large the number n of modern descendant species derived from A and B, one can do no better at reconstructing the state at R than if one knew for certain the state in its immediate descendants A and B. Even with this knowledge, the accuracy of reconstruction of R from A and B will be strictly less than 100% for all reasonable models and nonzero branch lengths. The reconstruction gets poorer the longer the branch lengths are to A and B. This extends to the case where the ancestor R being reconstructed has a bounded number of independent immediate descendants and to the case where descendants of an earlier ancestor of R (outgroups) are also available. The long branches connecting them to the rest of the tree are why some more recent ancestral sequences in the tree of Fig. 1 are less reconstructable than the Boreoeutherian ancestor, which acts almost like the root of a star topology (see ref. 34 for a discussion of optimal tree topologies for ancestral reconstructability).
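The star-topology error probability in Note 2 and its upper bound can be evaluated numerically to see how quickly the error shrinks as species are added. The sketch below assumes that the lower summation limit is the ceiling of n/2 and that ties are counted as errors; the function names and the value p = 0.2 are chosen only for illustration.

```python
import math

def star_error(n, p):
    """Probability that at least ceil(n/2) of n independent leaves differ
    from the ancestral state, each with probability p (ties counted as errors)."""
    k0 = math.ceil(n / 2)
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k0, n + 1))

def error_bound(n, p):
    """Upper bound [4p(1-p)]^(n/2) on the star-topology error probability."""
    return (4 * p * (1 - p)) ** (n / 2)

p = 0.2  # illustrative per-leaf probability of differing from the ancestor
for n in (5, 11, 21, 41):
    print(n, star_error(n, p), error_bound(n, p))
```

With p = 0.2 the base of the bound is 4p(1-p) = 0.64, so each additional pair of species multiplies the bound by 0.64, which is the exponential decay referred to in the note.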

Acknowledgments

We thank Jim Kent, Arian Smit, Adam Siepel, Gill Bejerano, Elliot Margulies, Brian Lucena, Leonid Chindelevitch, and Ron Davis for helpful discussions and suggestions. A.B.D. was supported by an NSERC postgraduate scholarship. W.M. was supported by grant HG-02238 from the National Human Genome Research Institute, E.G. was supported by NHGRI, D.H. and M.B. were supported by NHGRI Grant 1P41HG02371 and the Howard Hughes Medical Institute. Finally, we thank the NISC Comparative Sequencing Program for providing multispecies comparative sequence data.

References

1. Blanchette, M., Green, E. D., Miller, W., and Haussler, D. (2004) Reconstructing large regions of an ancestral mammalian genome in silico. Genome Res. 14, 2412–2423.
2. International Human Genome Sequencing Consortium, Lander, E., et al. (2001) Initial sequencing and analysis of the human genome. Nature 409(6822), 860–921.
3. International Mouse Genome Sequencing Consortium, Waterston, R. H., Lindblad-Toh, K., Birney, E., et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420(6915), 520–562 (PMID: 12466850).
4. Rat Genome Sequencing Project Consortium, Gibbs, R. A., Weinstock, G. M., Metzker, M. L., et al. (2004) Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428, 493–521.
5. Margulies, E. H., Blanchette, M., NISC Comparative Sequencing Program, Haussler, D., and Green, E. (2003) Identification and characterization of multi-species conserved sequences. Genome Res. 13(12), 2507–2518 (PMID: 14656959).
6. Cooper, G. M., Brudno, M., Green, E. D., Batzoglou, S., and Sidow, A. (2003) Quantitative estimates of sequence divergence for comparative analyses of mammalian genomes. Genome Res. 13(5), 813–820.
7. Bejerano, G., Pheasant, M., Makunin, I., et al. (2004) Ultraconserved elements in the human genome. Science 304(5675), 1321–1325.
8. Goodman, M., Barnabas, J., Matsuda, G., and Moore, G. W. (1971) Molecular evolution in the descent of man. Nature 233, 604–613.
9. Enard, W., Przeworski, M., Fisher, S. E., et al. (2002) Molecular evolution of FOXP2, a gene involved in speech and language. Nature 418(6900), 869–872.
10. Eizirik, E., Murphy, W. J., and O’Brien, S. J. (2001) Molecular dating and biogeography of the early placental mammal radiation. J. Hered. 92(2), 212–219 (PMID: 11396581).
11. Springer, M. S., Murphy, W. J., Eizirik, E., and O’Brien, S. J. (2003) Placental mammal diversification and the Cretaceous-Tertiary boundary. Proc. Natl Acad. Sci. USA 100(3), 1056–1060 (PMID: 12552136).
12. Thomas, J., Touchman, J. W., Blakesley, R. W., et al. (2003) Comparative analyses of multi-species sequences from targeted genomic regions. Nature 424, 788–793.
13. Karolchik, D., Baertsch, R., Diekhans, M., et al. (2003) The UCSC genome browser database. Nucleic Acids Res. 31, 51–54.
14. Maddison, D. R. and Schulz K.-S. (ed.) (2004) The Tree of Life Web Project. http://tolweb.org
15. Felsenstein, J. (1989) PHYLIP - Phylogeny inference package (Version 3.2). Cladistics 5, 164–166.
16. Swofford, D. L. (2003) PAUP: Phylogenetic Analysis Using Parsimony. Sinauer, Sunderland, MA.
17. Huelsenbeck, J. P. and Ronquist, F. (2001) MrBayes: Bayesian inference of phylogeny. Bioinformatics 17, 754–755.
18. Bray, N. and Pachter, L. (2004) MAVID: constrained ancestral alignment of multiple sequences. Genome Res. 14, 693–699.
19. Cooper, G. M., Stone, E. A., Asimenos, G., et al. (2005) Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 15(7), 901–913.
20. Blanchette, M., Kent, W. J., Riemer, C., et al. (2004) Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14(4), 708–715 (PMID: 15060014).
21. Schwartz, S., Kent, W. J., Smit, A., et al. (2003) Human–mouse alignments with BLASTZ. Genome Res. 13(1), 103–107.
22. Chindelevitch, L., Li, Z., Blais, E., and Blanchette, M. (2006) On the inference of parsimonious indel evolutionary scenarios. J. Bioinformatics Comput. Biol. in press.
23. Fredslund, J., Hein, J., and Scharling, T. (2003) A large version of the small parsimony problem. Lecture Notes in Bioinformatics, Proceedings of WABI’03. 2812, 417–432.
24. Yang, Z., Kumar, S., and Nei, M. (1995) A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 141, 1641–1650.
25. Siepel, A. and Haussler, D. (2003) Combining phylogenetic and hidden Markov models in biosequence analysis. Proceedings of the 7th Annual International Conference on Research in Computational Molecular Biology. pp. 277–286.
26. Bourque, G. and Pevzner, P. (2002) Genome-scale evolution: reconstructing gene orders in the ancestral species. Genome Res. 12(1), 26–36.
27. Stoye, J., Evers, D., and Meyer, F. (1997) Generating benchmarks for multiple sequence alignments and phylogenetic reconstructions. Proc. Int. Conf. Intell. Syst. Mol. Biol. 5, 303–309 (PMID: 9322053).
28. Hasegawa, M., Kishino, H., and Yano, T. (1985) Dating of the human–ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22(2), 160–174.
29. Kent, J., Baertsch, R., Hinrichs, A., Miller, W., and Haussler, D. (2003) Evolution’s cauldron: duplication, deletion and rearrangement in the mouse and human genomes. Proc. Natl Acad. Sci. USA 100(20), 11,484–11,489.
30. Jurka, J. (2000) Repbase update: a database and an electronic journal of repetitive elements. Trends Genet. 16(9), 418–420 (PMID: 10973072).
31. Smit, A. and Green, P. (1999) RepeatMasker, http://ftp.genome.washington.edu/RM/RepeatMasker.html
32. Hoeffding, W. (1963) Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58, 13–30.
33. Le Cam, L. (1986) Asymptotic Methods in Statistical Decision Theory, Springer, New York.
34. Lucena, B. and Haussler, D. (2005) Counterexample to a claim about the reconstruction of ancestral character states. Syst. Biol. 54(4), 693–695.

12

Sequencing and Phylogenomic Analysis of Whole Mitochondrial Genomes of Animals

Rafael Zardoya and Mónica Suárez

Summary

Mitochondrial genomes (mtDNA) of animals are circular molecules of relatively small size, compactly organized, and generally encoding genes for 2 rRNAs, 22 tRNAs, and 13 proteins that are required for mitochondrial function. Methods of mtDNA isolation take advantage of its physical localization apart from the nuclear genome (centrifugation at low speed efficiently separates mitochondria from nuclei) and of its structure (alkaline lysis differentially precipitates linear nuclear DNA, but not circular mtDNA). Furthermore, the recent development of robust long-PCR techniques has boosted high-throughput determination of complete sequences of animal mtDNAs. The exponentially growing number of complete animal mitochondrial genomes deposited in GenBank allows a phylogenomic approach to disentangle phylogenetic relationships among main animal phyla, and provides extensive new data to gain insights on the molecular mechanisms underlying genome evolution.

Key Words: Mitochondrial genome; alkaline lysis isolation; long PCR; Bayesian inference.

1. Introduction

Mitochondria are ubiquitous organelles of eukaryotic cells whose main function is generating most of the cellular energy through oxidative phosphorylation (1). Mitochondria contain their own genome, which is present in several copies per organelle, and is referred to as mitochondrial DNA (mtDNA). Animal mtDNA is typically a covalently closed double-stranded circular molecule of merely 16 kb, which commonly encodes genes for 2 rRNAs, 22 tRNAs, and 13 proteins that are required for mitochondrial function (2). The organization of mtDNA is extremely compact with gene junctions next to each other, and few noncoding sequences. Genes lack introns and are transcribed from both strands. Many ORFs lack complete termination codons, ending instead with either T or TA, which are completed by posttranscriptional polyadenylation of the corresponding mRNAs (3). Gene order is virtually fixed in vertebrate mtDNAs, whereas numerous gene arrangements have been described in invertebrate mtDNAs (2).

Mitochondrial genome sequence comparisons are very helpful in determining point mutations responsible for mitochondrial diseases (4), but by far the most popular use of mtDNA is in evolutionary studies. Mitochondrial genes have been extensively used for inferring phylogenetic relationships (5,6). These genes offer several advantages with respect to nuclear genes including nearly no recombination, unambiguous orthology, and a higher rate of substitution (7). Furthermore, the maternal inheritance of most animal mtDNAs allows quicker fixing of variants between speciation events, which in turn facilitates phylogenetic inference (7). Several studies (8) have shown that concatenated sequences of mitochondrial genes are more powerful than individual genes in recovering robust phylogenetic relationships among highly derived taxa. The relative gene order of mtDNAs, and in particular the presence of shared derived rare rearrangements, have also been successfully used as phylogenetic markers with low levels of homoplasy (2,9).

The relatively small size and circular structure of mtDNA and its physical localization apart from the nuclear genome originally facilitated the isolation, cloning, and sequencing of whole mitochondrial genomes. However, it was the advent of PCR and automated sequencing techniques that really simplified and boosted the recent high-throughput determination of complete sequences of animal mtDNAs (10). The rate of completion of new mtDNAs has grown exponentially over recent years, and as of July 2007, there were 1035 complete animal mitochondrial genomes deposited in GenBank (http://www.ncbi.nlm.nih.gov), of which a high proportion (73%) corresponded to vertebrates.

2. Materials

2.1. Isolation and Cloning of mtDNA

1. MSB buffer: 210 mM mannitol, 70 mM sucrose, 50 mM Tris-HCl, pH 7.5. Sterilize by passage through a 0.22-μm filter.
2. GTE buffer: 50 mM glucose, 25 mM Tris-HCl, pH 8.0, 10 mM EDTA. Sterilize by passage through a 0.22-μm filter.
3. Alkaline SDS solution: 200 mM NaOH, 1% SDS.
4. TE: 10 mM Tris-HCl, pH 8.0, and 1 mM EDTA.
5. Ligation buffer: 30 mM Tris-HCl, pH 7.8, 10 mM MgCl2, 10 mM DTT, 1 mM ATP. Store in aliquots at –20°C.
6. SOC medium: Mix 2% bactotryptone, 0.5% yeast extract, 10 mM NaCl, 2.5 mM KCl, 10 mM MgCl2, 10 mM MgSO4. Sterilize in autoclave. Add glucose (sterilized by passage through a 0.22-μm filter) to a final concentration of 20 mM.

Sequencing and Phylogenomic Analysis of mtDNA


7. LB/agar/ampicillin/X-Gal/IPTG plates: Mix 1% bactotryptone, 0.5% yeast extract, 10 mM NaCl, and 1.5% agar. Sterilize in an autoclave and cool down. Add ampicillin (from a 10X solution stored in aliquots at –20°C) to a final concentration of 50 µg/mL. Distribute onto Petri dishes. Spread 40 µL of 2% X-Gal in dimethylformamide (stored in aliquots at –20°C) and 10 µL of 100 mg/mL IPTG (stored in aliquots at –20°C) onto each plate.

2.2. PCR
Standard PCR buffer 10X: 500 mM KCl, 100 mM Tris-HCl, pH 9.0, 1% Triton-X (store in aliquots at –20°C).

2.3. Repairing Sheared DNA
1. T4 DNA polymerase buffer 10X: 330 mM Tris-acetate, pH 7.8, 660 mM potassium acetate, 100 mM magnesium acetate, and 5 mM DTT. Store in aliquots at –20°C.
2. T4 polynucleotide kinase buffer 10X: 70 mM Tris-HCl, pH 7.6, 10 mM MgCl2, 5 mM DTT. Store in aliquots at –20°C.

3. Methods
There are two main strategies to obtain mtDNA. One includes physical separation of mitochondria from other cellular components (in particular, the nucleus), and subsequent isolation and cloning of the mitochondrial genome (11). The other is based on PCR amplification of mtDNA from whole DNA extractions (12). In the first case, the best results are achieved when starting from fresh tissue, although tissues quickly frozen in liquid nitrogen can also be used as starting material. Soft tissues such as eggs, gonads, liver, heart, or kidney are the best sources (see Note 1). In contrast, PCR amplification of mtDNA from whole DNA extractions is less sensitive to storage conditions and tissue type, and good yields are typically obtained from frozen or ethanol-preserved muscle.

3.1. Isolation of Mitochondria
The following protocol takes advantage of the different masses of mitochondria and nuclei, and follows ref. 13.
1. Mince 2 g of fresh tissue (ideally from one single individual) with scissors, and carry out homogenization in 10 mL of cold MSB buffer containing 3 mM CaCl2, using 4–8 strokes with a tight-fitting glass-Teflon homogenizer (see Note 2).
2. Pass the homogenized solution to a Corex glass tube, and add EDTA to a final concentration of 10 mM.
3. Remove nuclei and cellular debris by centrifugation at 700g for 5 min at 4°C in a swinging bucket rotor. Recover the supernatant, and repeat the low-speed centrifugation.
4. Pellet mitochondria by centrifugation at 10,000g for 10 min at 4°C. Wash the light-brown pellet by gently resuspending it in 20 mL of MSB–EDTA buffer and centrifuging at 10,000g for 10 min at 4°C. Repeat the wash step two more times.


5. Discard the supernatant, and allow the tube to drain for 5 min. Resuspend the pellet in 200 µL of GTE buffer. Transfer the mitochondria-enriched solution to an Eppendorf tube.
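The centrifugation steps above are given as relative centrifugal force (700g and 10,000g), whereas many centrifuges are set in rpm. The following Python sketch converts between the two using the standard relation RCF = 1.118 × 10^-5 × r × rpm^2, with r in centimeters. The rotor radius of 10 cm is a hypothetical value chosen for illustration only; the radius of the rotor actually in use should be taken from its manual.

```python
import math

RCF_CONSTANT = 1.118e-5  # standard factor when radius is in cm and speed in rpm


def rpm_for_rcf(rcf_g: float, radius_cm: float) -> float:
    """Return the rotor speed (rpm) that gives the requested RCF (x g)."""
    if rcf_g <= 0 or radius_cm <= 0:
        raise ValueError("rcf_g and radius_cm must be positive")
    return math.sqrt(rcf_g / (RCF_CONSTANT * radius_cm))


def rcf_for_rpm(rpm: float, radius_cm: float) -> float:
    """Return the RCF (x g) produced at a given rotor speed and radius."""
    return RCF_CONSTANT * radius_cm * rpm ** 2


# Hypothetical rotor radius; not specified in the protocol.
ROTOR_RADIUS_CM = 10.0

for step, rcf in (("remove nuclei and debris", 700), ("pellet mitochondria", 10_000)):
    rpm = rpm_for_rcf(rcf, ROTOR_RADIUS_CM)
    print(f"{step}: {rcf} g is about {rpm:.0f} rpm at r = {ROTOR_RADIUS_CM} cm")
```

Because the force scales with the square of the speed, the roughly 14-fold difference between the two spins corresponds to less than a 4-fold difference in rpm.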

3.2. Purification of mtDNA
The methods devised to separate mtDNA from any contaminating nuclear DNA exploit the different structures of the two molecules, which are circular and linear, respectively. One of the procedures yielding the purest mtDNA is cesium chloride ethidium bromide gradient ultracentrifugation (10,13). However, this is a rather tedious technique that needs laborious optimization, and simpler alternative methods are available. The following protocol is based on ref. 14, and considers that the structure of mtDNA resembles that of a bacterial plasmid. It is a modified version of the commonly used alkaline lysis method of plasmid miniprep (15). Many commercial miniprep kits, e.g., Qiaprep (Qiagen, Hilden, Germany), can also be optimized for this purpose.
1. Add 7 µL of RNase A (10 mg/mL) to the enriched mitochondrial solution.
2. Add 400 µL of a freshly prepared alkaline SDS solution. Mix thoroughly but gently, and incubate on ice for 5 min.
3. Add 300 µL of 3 M potassium acetate, pH 4.8. Mix thoroughly but gently, and incubate at –70°C for 2 min.
4. Spin down for 10 min at 10,000g at 4°C in a microcentrifuge.
5. Recover the supernatant, and add 1 mL of water-saturated phenol, pH 4.5. Mix vigorously, and spin down for 5 min at 10,000g at 4°C in a microcentrifuge. Recover the aqueous phase, and repeat the phenol extraction.
6. Add 1 mL of a chloroform:isoamyl alcohol (24:1) solution. Mix vigorously, and spin down for 5 min at 10,000g at 4°C in a microcentrifuge.
7. Recover the aqueous phase, and add 450 µL of isopropanol. Mix thoroughly, and incubate at –70°C for 5 min.
8. Spin down for 10 min at 10,000g at 4°C in a microcentrifuge. Wash the mtDNA pellet with 200 µL of 70% ethanol. Dry under vacuum, and resuspend in 50 µL of MilliQ (Millipore, Billerica, MA) water or TE (see Note 3).
9. Run a 3-µL aliquot of mtDNA in a 0.8% agarose gel with ethidium bromide. Check the size and concentration against a size marker, e.g., lambda HindIII + Phi X174 HaeIII (Fig. 1).

3.3. Cloning of mtDNA
Restriction analyses can determine which restriction enzymes render optimal bands of digested mtDNA for cloning. The whole mtDNA can be cut at only one site and cloned into a phage vector such as Lambda EMBL3 (Stratagene, La Jolla, CA). Alternatively, mtDNA can be cut into 6–10 fragments of 0.5–7 kb, and cloned into a plasmid vector such as pUC18. The second strategy is described in the following protocol:


Fig. 1. Restriction analysis of rainbow trout (Oncorhynchus mykiss) mtDNA. (1) Lambda HindIII and Phi X174 HaeIII size marker. (2) Uncut mtDNA. The mtDNA was digested with the following restriction enzymes: (3) EcoRI, (4) HindIII, (5) BamHI, (6) PstI, (7) PvuII, and (8) ApaI.


1. Mix 1 µg of mtDNA, 0.2 µg of pUC18, 1–3 units of the restriction enzyme of choice, 2 µL of the 10X buffer provided with the enzyme, and MilliQ water to a final volume of 20 µL. Digest for 2 h at 37°C.
2. Inactivate the restriction enzyme by heating at 70°C for 10 min.
3. Add 1/10 volume of 3 M sodium acetate (pH 6.8) and two volumes of cold ethanol. Incubate 5 min at –70°C. Spin down at 10,000g for 10 min.
4. Wash the pellet with 200 µL of 70% ethanol. Dry under vacuum, and resuspend in 8 µL of ligation buffer.
5. Add three units of T4 DNA ligase (e.g., Promega, Madison, WI), and incubate overnight at 4°C.
6. Transform the ligation reaction into commercially available competent bacterial cells (e.g., Promega) (see Note 4). Add the ligation mix to 50 µL of competent cells, and incubate 30 min on ice. Heat shock at 42°C for 90 s. Incubate the transformed bacteria in SOC medium at 37°C for 45 min.
7. Spread the transformed bacteria onto LB/agar/ampicillin/X-Gal/IPTG plates, and incubate overnight at 37°C.
8. Pick white colonies, and check the size of the inserts either by performing minipreps (e.g., Qiaprep from Qiagen) of the plasmids followed by a restriction digestion with the enzyme used for the cloning, or by PCR with universal M13 forward and reverse primers (Fig. 2).

3.4. Polymerase Chain Reaction
Because sufficient amounts of fresh tissue are not always available for a given species, PCR is the methodology of choice when only minimal quantities of starting material are available or when samples have been preserved frozen or in ethanol. The mtDNA can be obtained by PCR using several sets of primers that amplify contiguous and overlapping fragments covering the entire molecule. Typically, the primers are designed in conserved regions every 1000–2000 bp, and PCR-amplified fragments overlap by 50–150 bp. This strategy is particularly useful for obtaining complete vertebrate mitochondrial genomes, which have a highly conserved gene order. Furthermore, the relatively large representation of vertebrate mitochondrial genomes in GenBank allows the design of versatile primers of 21–24 nucleotides that work robustly across a wide taxonomic range (see Note 5). There are lists of conserved primers for fish (12,16), amphibians (17), reptiles (18), birds (19), and mammals (20). The PCR reaction can use whole DNA or mtDNA as template.
1. Extract whole DNA from a specimen using a standard phenol/chloroform extraction (21) or any available commercial DNA extraction kit such as the DNeasy Tissue kit (Qiagen).
2. Prepare reactions containing 2.5 µL of 10X standard PCR buffer, 1.5 mM MgCl2, 0.4 mM of each dNTP, 2.5 µM of each primer, 10–100 ng of template DNA, 1 U Taq DNA polymerase (see Note 6), and MilliQ water to a final volume of 25 µL.


Fig. 2. Cloning of different HindIII fragments of rainbow trout mtDNA into pUC18 (the band that has the same size in all lanes).


Fig. 3. Long PCR amplification of a sea slug (Roboastra europaea) mtDNA in two fragments of about 7000 bp each.

3. Run PCR with the following conditions: 1 cycle of denaturing at 94°C for 2 min; 35 cycles of denaturing at 94°C for 30 s, annealing for 60 s at 45–50°C depending on primer specificity, and extending at 72°C for 90–180 s depending on the length of the amplified fragment; 1 cycle of extension at 72°C for 5 min.
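The reaction in step 2 is specified as final concentrations, so the pipetting volumes depend on the stock solutions at hand. The sketch below applies the dilution relation C1 × V1 = C2 × V2 to the 25-µL reaction. All stock concentrations, the Taq volume (1 U from an assumed 5 U/µL stock), and the 1-µL template volume are hypothetical values for illustration; only the final concentrations and the 2.5 µL of 10X buffer come from the protocol.

```python
REACTION_UL = 25.0

# name: (stock concentration, final concentration); units match within a pair.
# Final values follow step 2; stock values are assumptions, not from the text.
CONCENTRATION_COMPONENTS = {
    "MgCl2": (25.0, 1.5),             # mM
    "dNTP mix (each)": (10.0, 0.4),   # mM of each dNTP
    "forward primer": (100.0, 2.5),   # micromolar
    "reverse primer": (100.0, 2.5),   # micromolar
}

# Components added as a fixed volume (microliters).
FIXED_VOLUMES_UL = {
    "10X PCR buffer": 2.5,    # from the protocol
    "Taq polymerase": 0.2,    # 1 U from an assumed 5 U/uL stock
    "template DNA": 1.0,      # assumed volume holding 10-100 ng
}


def mix_volumes(reaction_ul, components, fixed_volumes_ul):
    """Return the volume (uL) of every component, with water to final volume."""
    volumes = dict(fixed_volumes_ul)
    for name, (stock, final) in components.items():
        volumes[name] = final * reaction_ul / stock  # V1 = C2 * V2 / C1
    used = sum(volumes.values())
    if used > reaction_ul:
        raise ValueError("components exceed the reaction volume; use stronger stocks")
    volumes["water"] = reaction_ul - used
    return volumes


volumes = mix_volumes(REACTION_UL, CONCENTRATION_COMPONENTS, FIXED_VOLUMES_UL)
for name, ul in volumes.items():
    print(f"{name:>18}: {ul:6.3f} uL")
```

For several reactions, each volume except the template can be multiplied by the number of reactions (plus a small excess) to prepare a master mix.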

3.5. Long PCR
The PCR amplification of the entire mtDNA molecule in one or a few reactions (Fig. 3) is the method of choice for obtaining complete invertebrate mitochondrial genomes, which are prone to gene rearrangements even among closely related taxa (2). This technique can also be applied to vertebrate mtDNAs. The working strategy includes four steps: (1) amplification of short fragments of mtDNA using universal primers such as 16Sar and 16Sbr (22) or HCO-2193 and LCO-1490 (23), and standard PCR reaction conditions, (2) sequencing of the PCR products, and design of specific primers of 28–30 nucleotides facing out from the fragments (see Note 7), (3) performing long PCRs to cover the remaining mtDNA molecule, and (4) using the long PCR products (diluted) as template for standard PCR reactions, which render shorter fragments more amenable to sequencing. A routine long PCR includes the following steps:
1. Prepare a reaction mix containing 2.5 µL 10X LA PCR buffer II (Takara Bio, Shiga, Japan), 0.4 mM of each dNTP, 0.2 µM of each primer, 10–500 ng of template DNA, 2.5 U LA Taq polymerase (Takara Bio) (see Note 8), and MilliQ water to a final volume of 50 µL.
2. Use the following long PCR thermal cycle profile: denaturing at 98°C for 30 s, with annealing and extension combined at the same temperature (68°C) for 16–20 min (at least 1 min/kb of the expected fragment).

3.6. Rolling Circle Amplification
This procedure to obtain mtDNA is a potentially interesting alternative to long PCRs (10,24). Random hexamers anneal to the mtDNA template, and are extended by the highly processive φ29 DNA polymerase, rendering many copies of the mitochondrial genome. The main drawback of this technique is that it requires fresh unfertilized eggs as starting material, which may not always be available for certain species. Rolling circle amplification can be performed with the TempliPhi 100 amplification kit (GE Healthcare, Piscataway, NJ). The following protocol is based on refs. 10,24.
1. Isolate mitochondria as explained under Subheading 3.1.
2. Mix 1 µL of the enriched mitochondrial solution with 5 µL of the TempliPhi sample buffer. Heat at 95°C for 5 min, and cool to room temperature.
3. Add 5 µL of TempliPhi reaction buffer and 0.2 µL of enzyme mix, and incubate overnight at 34°C. Stop the reaction with an incubation at 65°C for 10 min.
4. Purify the product using an Amicon Ultrafree-MC minicolumn kit (Fisher Scientific, Hampton, NH).

3.7. Automated Sequencing Strategies
The shortest fragments resulting from restriction analyses of purified mtDNA or from PCR reactions can be sequenced in a single reaction. In the first case, a cloning step is performed, and the fragment of interest is sequenced using universal M13 forward and reverse primers. In the second case, fragments can be either sequenced directly with PCR primers or first cloned into an appropriate vector such as pGEM-T (Promega) (see Note 9). Nevertheless, most of the analyzed fragments in a mtDNA genome project are expected to be larger than 2 kb. Long fragments can be sequenced either by walking through them with specific primers or by physically splitting them into smaller pieces. We commonly use the former strategy because contig assembly is straightforward. Long fragments can be used directly as template for the sequencing reaction with the walking primer or as template for a PCR reaction with two (forward and reverse) walking primers that renders a shorter fragment, which can be sequenced in a single reaction. Primers are designed so as to have 150 bp of overlap between contigs. Although less trivial, the physical fragmentation procedure allows high-throughput sequencing. In this case, fragments are randomly broken by hydrodynamic shearing, and subsequently cloned, sequenced, and assembled with special software.
1. Incubate DNA at 37°C for 30 min, vortexing every 10 min. Centrifuge at 10,000g for 4 min to make sure that the DNA is completely in solution.
2. Load 100 µL of the sterile-filtered DNA solution into the HydroShear® device (Genomic Solutions, Ann Arbor, MI). Shear for 20 cycles at speed codes of 3–5 to obtain fragments of around 1–2 kb (see Note 10).
3. Obtain 5′-phosphorylated blunt-ended DNA (see Note 11). Prepare a reaction mix containing 0.5–1 µg sheared DNA, 10 µM of each dNTP, 50 µg/mL BSA, 20 µL of 10X T4 DNA polymerase buffer, 5 U T4 DNA polymerase (Roche, Basel, Switzerland), and MilliQ (Millipore) water to a final volume of 200 µL. Incubate at 16°C for 1 h.
4. Add 1/10 volume of 3 M sodium acetate, pH 6.8, and two volumes of cold ethanol. Incubate 5 min at –70°C. Spin down at 10,000g for 10 min. Resuspend the pellet in 41 µL of MilliQ water.
5. Add 5 µL of 10X T4 polynucleotide kinase buffer, 1 mM ATP, 5 U T4 polynucleotide kinase (Promega), and MilliQ water to a final volume of 50 µL. Incubate at 37°C for 30 min.
6. Add 1/10 volume of 3 M sodium acetate, pH 6.8, and two volumes of cold ethanol. Incubate 5 min at –70°C. Spin down at 10,000g for 10 min. Resuspend the pellet in 50 µL of MilliQ water.
7. Clone the DNA fragments into a plasmid vector, e.g., pUC18, cut with a blunt-end restriction enzyme.
8. Sequence the inserted fragments using M13 forward and reverse primers (if the inserted fragment is too large, sequences do not need to overlap).
9. Use Sequencher 4.6 (Gene Codes, Ann Arbor, MI) or similar software for the assembly of the numerous randomly generated sequences.
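The primer-walking strategy described at the start of this subheading can be planned roughly in advance. The following Python sketch estimates where successive reads would need to start so that neighboring reads share 150 bp, as stated above. The usable read length of 700 bp is an assumption for illustration, and the positions are approximate read starts rather than primer coordinates, because in practice each walking primer is designed from the sequence obtained in the previous read.

```python
def walking_read_starts(fragment_len: int, read_len: int = 700, overlap: int = 150):
    """Approximate start positions (0-based) of reads that tile a fragment.

    read_len is an assumed usable read length; overlap is the shared region
    between consecutive reads (150 bp in the text).
    """
    if fragment_len <= 0:
        raise ValueError("fragment_len must be positive")
    if not 0 <= overlap < read_len:
        raise ValueError("overlap must be non-negative and shorter than read_len")
    step = read_len - overlap  # new sequence gained by each additional read
    starts = [0]
    while starts[-1] + read_len < fragment_len:
        starts.append(starts[-1] + step)
    return starts


# A 7000-bp long PCR product (the size shown in Fig. 3) as an example.
starts = walking_read_starts(7000)
print(f"{len(starts)} reads needed on one strand; read starts: {starts}")
```

Sequencing both strands doubles the number of reads, which is one reason the shotgun route becomes attractive for high-throughput projects.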

3.8. Genome Annotation
As explained above, the mitochondrial genome includes tRNA, rRNA, and protein-coding genes, which can be encoded on either of the two strands. The rRNA and protein-coding genes are generally easy to identify by similarity searches using BLAST (25) against GenBank. Annotation of protein-coding genes can be further verified by translating the ORF into amino acids and checking for any frame shift or internal stop codon. Translation should use the appropriate genetic code (26) (http://darwin.uvigo.es/software/GenDecoder.html). It is also important to remember that, in addition to ATG, mitochondrial proteins can also use ATH and NTG as start codons. Furthermore, as mentioned above, mitochondrial proteins quite often use incomplete TA and T stop codons. Annotation of tRNA genes using BLAST searches is more difficult because of their relatively short size and low sequence similarity (particularly in invertebrates) to orthologs deposited in GenBank. Fortunately, mitochondrial tRNAs have a peculiar (10) and fairly conserved cloverleaf secondary structure that helps in the task of recognizing the different tRNA genes. Hence, manual annotation of mitochondrial tRNA genes typically starts by searching for the anticodons of the different tRNAs in those stretches not identified as rRNA or protein-coding genes. Once a potential anticodon is detected, building up the secondary structure identifies the corresponding tRNA. A Web-based tool named DOGMA (27) has been devised recently to help with the automatic annotation of animal mitochondrial genomes. The program Sequin (http://ncbi.nih.gov/Sequin/) can be used to generate and submit GenBank entries.
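The ORF checks described above (appropriate genetic code, alternative start codons, internal stops, and incomplete T or TA stop codons) can be sketched in a few lines of Python. The example below hard-codes the vertebrate mitochondrial code (NCBI translation table 2), in which TGA encodes Trp, ATA encodes Met, and AGA/AGG are stops. This choice is an assumption for illustration: invertebrate mtDNAs use other tables, and the function name and return format are hypothetical, not part of any tool cited in the text.

```python
from itertools import product

BASES = "TCAG"
# NCBI translation table 2 (vertebrate mitochondrial), codons in TCAG order.
VERTEBRATE_MT = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), VERTEBRATE_MT)}

# ATG plus ATH (H = A, C, or T) and NTG (N = any base), as listed in the text.
START_CODONS = {"ATG", "ATA", "ATC", "ATT", "CTG", "GTG", "TTG"}


def check_orf(seq: str) -> dict:
    """Translate a candidate ORF and report start, internal stops, and stop type."""
    seq = seq.upper()
    n_codons = len(seq) // 3
    codons = [seq[i:i + 3] for i in range(0, n_codons * 3, 3)]
    tail = seq[n_codons * 3:]  # 0, 1, or 2 leftover bases
    protein = "".join(CODE.get(codon, "X") for codon in codons)  # X = ambiguous

    if protein.endswith("*"):
        stop, body = "complete", protein[:-1]
    elif tail in ("T", "TA"):
        # Completed to TAA by polyadenylation of the mRNA.
        stop, body = f"incomplete ({tail})", protein
    else:
        stop, body = "missing", protein

    return {
        "protein": body,
        "start_ok": bool(codons) and codons[0] in START_CODONS,
        "internal_stops": [i for i, aa in enumerate(body) if aa == "*"],
        "stop": stop,
    }


print(check_orf("ATATGACCCT"))       # ends with an incomplete T stop codon
print(check_orf("ATGTGAATAAGATAA"))  # AGA is an internal stop in this code
```

An ORF that shows internal stops under one table but none under another is a hint that the wrong genetic code was chosen.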

3.9. Phylogenetic Analyses
Mitochondrial genomes are being increasingly used to study phylogenetic relationships among highly divergent animal taxa. Phylogenetic reconstruction can be based on any of the three types of mitochondrial genes, i.e., tRNA, rRNA, and protein-coding genes. The latter can be analyzed at either the nucleotide or the amino-acid level. Different studies show that the most robust phylogenetic inferences are based on concatenated sequence data sets rather than on single genes. Phylogenetic inferences based on sequence data are commonly reconstructed using parsimony-, distance-, and likelihood-based methods. Here, we will focus on Bayesian inference (28), a likelihood-based method that searches for the tree with the highest posterior probability given a sequence alignment. This method is relatively new but particularly powerful because it produces both a tree estimate (topology and branch lengths) and measures of uncertainty for the groups of the tree. Moreover, Bayesian inference can use amino-acid and nucleotide data sets simultaneously, and allows inference from different partitions, applying to each partition its best-fit evolutionary model.
1. Use MitoGenHunter (http://pc16141.mncn.csic.es/mtgh.html) or OGRe (29), which are Web-based tools that help in retrieving orthologous animal mitochondrial genes from GenBank. The output of MitoGenHunter is a file in FASTA format that can be used as an input for commonly used multiple alignment programs.
2. Multiple sequence alignment of the different mitochondrial genes can be performed with, e.g., Clustal X (30), T-Coffee (31), or MUSCLE (32). For Clustal X, use the “load sequences” feature under the “File” menu to choose the file with the sequences in FASTA format. Select the NBRF/PIR output format under the “Alignment” menu. Perform the alignment using the “Do complete alignment” feature of the “Alignment” menu (see Note 12).
3. Use Gblocks (Castresana, 2000; http://molevol.ibmb.csic.es/Gblocks_server/) or SOAP (33) to identify and remove poorly aligned and gapped positions.
4. Transform the NBRF/PIR output alignments of Gblocks into NEXUS or PHYLIP formats using MacClade (34) or ReadSeq (http://bimas.dcrt.nih.gov/molbio/readseq/).
5. Combine the alignments of the different mitochondrial genes into a single concatenated data set (e.g., rRNA data set, tRNA data set, protein-coding data set, and whole mtDNA data set).
6. Estimate the best-fit substitution models for each of the combined data sets using Modeltest (35) and Prottest (36) for nucleotide and amino-acid data sets, respectively. These programs use the Akaike information criterion (AIC) (37) to select the model that best balances the likelihood score against the number of parameters. For divergent taxa, the models that best fit mitochondrial nucleotide and amino-acid data sets are usually GTR + I + G and mtREV + I + G, respectively.
7. Perform a Bayesian inference using MrBayes 3.1.2 (38). This program approximates posterior probability distributions using the Markov chain Monte Carlo (MCMC) algorithm. MrBayes uses input data sets in NEXUS format. If the best-fit model is GTR + I + G, the following commands can be added at the end of the NEXUS file (after the sequence alignment) to set the run:

    begin mrbayes;
    [Set the parameters of the likelihood model]
    lset nst=6 rates=invgamma;
    [Set Markov chain Monte Carlo parameters]
    mcmcp ngen=1000000 printfreq=100 samplefreq=100 nchains=4 savebrlens=yes;
    end;

If the best-fit model is mtREV + I + G, the following commands can be added at the end of the NEXUS file (after the sequence alignment) to set the run (see Note 13):

    begin mrbayes;
    prset aamodelpr=fixed(mtrev);
    lset rates=invgamma;
    mcmcp ngen=1000000 printfreq=100 samplefreq=100 nchains=4 savebrlens=yes;
    end;

8. Run MrBayes and load the input file using the command “exe” followed by the file name.
9. Use the command “mcmc” to run the Markov chain Monte Carlo algorithm.
10. Summarize the results with the commands “sump” and “sumt.” The former helps in estimating how many early samples were taken before the chains converged and can therefore be discarded. The latter uses the parameter burnin to discard those early samples. Visualize the consensus tree using, e.g., PAUP (39) or TreeView (40).
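Step 5 above (combining the per-gene alignments into one concatenated data set) is simple to script. The Python sketch below joins already-aligned genes taxon by taxon and records the column range of each gene as a NEXUS "charset" line, which is useful when partitions are later defined. The gene names, taxa, and toy sequences are invented for illustration, and filling a missing gene with "?" assumes that "?" is declared as the missing-data symbol in the NEXUS file.

```python
def concatenate(alignments: dict) -> tuple:
    """Concatenate per-gene alignments into one supermatrix.

    alignments maps gene name -> {taxon: aligned sequence}. Returns a tuple
    (supermatrix, charsets), where supermatrix maps taxon -> concatenated
    sequence and charsets lists one NEXUS charset line per gene.
    """
    taxa = sorted({taxon for aln in alignments.values() for taxon in aln})
    parts = {taxon: [] for taxon in taxa}
    charsets = []
    start = 1  # NEXUS character positions are 1-based
    for gene, aln in alignments.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: aligned sequences differ in length")
        length = lengths.pop()
        for taxon in taxa:
            # A taxon lacking this gene is padded with missing-data symbols.
            parts[taxon].append(aln.get(taxon, "?" * length))
        charsets.append(f"charset {gene} = {start}-{start + length - 1};")
        start += length
    supermatrix = {taxon: "".join(seqs) for taxon, seqs in parts.items()}
    return supermatrix, charsets


# Toy example with invented sequences.
GENES = {
    "cox1": {"trout": "ATGGCA", "coelacanth": "ATGGCC"},
    "cob": {"trout": "ATGACC"},
}
matrix, charsets = concatenate(GENES)
for taxon, seq in matrix.items():
    print(f"{taxon:<12}{seq}")
print("\n".join(charsets))
```

Keeping the charset coordinates alongside the matrix makes it possible to assign a separate best-fit model to each gene or gene class instead of a single model for the whole data set.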


4. Notes
1. Physical isolation of mtDNA requires intact mitochondria, and avoiding disruption of the nuclear membrane to ensure no nuclear DNA contamination. Hence, the best source in this case is fresh soft tissue (eggs or gonads, if available, are particularly good). Whole DNA extraction for PCR amplification of mtDNA can be achieved from frozen or ethanol-preserved tissues. No reliable results are obtained when using formalin-preserved samples as the starting material.
2. Tissue homogenization is crucial and tricky. This step should not be too gentle because insufficient disruption of the tissue would result in low yields of mitochondria. However, it also should not be too vigorous because breakage of nuclei needs to be kept to a minimum. The whole procedure should be performed at low temperature (the pestle on ice and the reagents cold at 4–10°C).
3. The alkaline lysis phenol extraction procedure may not render fully purified mtDNA, and any carried-over contaminant could inhibit subsequent reactions, such as restriction enzyme digestions. If such is the case, the mtDNA solution can be further cleaned using commercial DNA clean-up kits such as QIAEX II (Qiagen) or through a Sephadex G-50 spin column. In the latter case, dissolve 5 g of Sephadex G-50 (Sigma-Aldrich, St. Louis, MO) in 100 mL TE. Heat to 90°C for 10 min and cool down to 4°C. Plug the end of a 1-mL plastic syringe with a small amount of glass wool. Place the syringe in a centrifuge tube. Fill the syringe with Sephadex G-50 slurry, and spin down for 1 min at 2000g in a tabletop centrifuge. Wash the Sephadex G-50 column with 1 mL of TE, and spin down for 3 min at 2000g. Place an Eppendorf tube at the end of the syringe, and load the mtDNA sample on the top. Spin down for 3 min at 2000g, and collect the eluted clean mtDNA.
4. It is possible to prepare your own competent cells using the method of Hanahan (41).
5. There are different programs to help in designing PCR primers such as Primer3 (http://biotools.umassmed.edu/bioapps/primer3_www.cgi). Some rules can specifically help in the case of mtDNA: (1) the anticodon region of the tRNA genes is particularly conserved in vertebrate mtDNA and very useful for designing PCR primers, (2) for protein-coding genes, select conserved regions including runs of amino acids with low codon degeneracy, and end the primers in second codon positions, (3) try to have an even frequency of the four nucleotides, and (4) the two primers should have similar annealing temperatures, which can be roughly estimated (in °C) with the following formula: 2 × (A + T) + 4 × (G + C) – 4, where A, T, G, and C are the numbers of each nucleotide in the primer.
6. Most of the commercial Taq polymerases available on the market are suitable for performing standard PCRs with versatile primers and straightforward templates. We use the BD Advantage™ 2 Polymerase Mix (Clontech, Mountain View, CA) when starting from complex templates, and when using degenerate or not highly specific primers.
7. Because invertebrate mtDNAs may have extensive gene rearrangements, all possible combinations of specific primers facing out from the universal fragments need to be tested.
8. We have also used Elongase (Invitrogen, Carlsbad, CA) for long PCRs with positive results.


9. Do not forget to remove both vector and PCR primer sequences from any sequence read before contig assembly. This can be achieved either manually or using Sequencher 4.6.
10. Clogging of the shearing assembly should be avoided. It is also important to eliminate any bubbles from the sample in order to ensure reproducible results.
11. Alternatively, you can use commercially available DNA end repair kits such as End-It™ (Epicentre Biotechnologies, Madison, WI) or DNA terminator® (Lucigen, Middleton, WI).
12. Multiple alignments need to be further refined by eye. Alignment of tRNA and rRNA genes can be guided by their potential secondary structures, whereas alignment of protein-coding genes at the nucleotide level needs to follow codon structure and the corresponding amino-acid alignments.
13. In order to analyze combined nucleotide and amino-acid data sets, the format datatype in the NEXUS file should be set to “mixed,” and several commands of MrBayes such as “set partition,” “apply to,” or “unlink” need to be used. See http://mrbayes.csit.fsu.edu/wiki/index.php/Analyzing_a_Partitioned_Data_Set for more details.
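The rough annealing-temperature rule given in Note 5 (2°C for each A or T, 4°C for each G or C, minus 4°C) is easy to apply to a primer pair before ordering. The Python sketch below does so; the two primer sequences are invented for illustration and do not correspond to any published primer, and the rule is only a coarse estimate that ignores salt concentration and primer length effects.

```python
def annealing_estimate(primer: str) -> int:
    """Rough annealing temperature (degrees C): 2*(A+T) + 4*(G+C) - 4."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    if at + gc != len(primer):
        raise ValueError("primer contains degenerate or non-ACGT bases")
    return 2 * at + 4 * gc - 4


# Hypothetical 21-nucleotide primers, not taken from the cited primer lists.
forward = "ATGCATGCATGCATGCATGCA"
reverse = "TTAGGATCCAAGTCAGGTTCA"

t_forward = annealing_estimate(forward)
t_reverse = annealing_estimate(reverse)
print(f"forward: {t_forward} C, reverse: {t_reverse} C, "
      f"difference: {abs(t_forward - t_reverse)} C")
```

A large difference between the two estimates suggests lengthening or shifting one primer so that both anneal under the same cycling conditions.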

Acknowledgments
This work received financial support from projects of the Ministerio de Ciencia y Tecnología of Spain to RZ (CGL2004-00401) and from Comunidad de Madrid to MS (GR/SAL/0583/2004).

References
1. Saraste, M. (1999) Oxidative phosphorylation at the fin de siècle. Science 283, 488–493.
2. Boore, J. L. (1999) Animal mitochondrial genomes. Nucleic Acids Res. 27, 1767–1780.
3. Attardi, G. and Schatz, G. (1988) Biogenesis of mitochondria. Ann. Rev. Cell Biol. 4, 289–333.
4. Wallace, D. C. (2005) A mitochondrial paradigm of metabolic and degenerative diseases, aging, and cancer: a dawn for evolutionary medicine. Ann. Rev. Genet. 39, 359–407.
5. Kocher, T. D., Thomas, W. K., Meyer, A., et al. (1989) Dynamics of mitochondrial DNA evolution in animals: amplification and sequencing with conserved primers. Proc. Natl Acad. Sci. USA 86, 6196–6200.
6. Meyer, A. and Zardoya, R. (2003) Recent advances in the (molecular) phylogeny of vertebrates. Ann. Rev. Ecol. Evol. Syst. 34, 311–338.
7. Curole, J. P. and Kocher, T. D. (1999) Mitogenomics: digging deeper with complete mitochondrial genomes. Trends Ecol. Evol. 14, 394–398.
8. Zardoya, R. and Meyer, A. (1996) Phylogenetic performance of mitochondrial protein-coding genes in resolving relationships among vertebrates. Mol. Biol. Evol. 13, 933–942.


9. Rokas, A. and Holland, P. W. (2000) Rare genomic changes as a tool for phylogenetics. Trends Ecol. Evol. 15, 454–459.
10. Boore, J. L., Macey, J. R., and Medina, M. (2005) Sequencing and comparing whole mitochondrial genomes of animals. Methods Enzymol. 395, 311–348.
11. Zardoya, R., Garrido-Pertierra, A., and Bautista, J. M. (1995) The complete nucleotide sequence of the mitochondrial DNA genome of the rainbow trout, Oncorhynchus mykiss. J. Mol. Evol. 41, 942–951.
12. Zardoya, R. and Meyer, A. (1997) The complete DNA sequence of the mitochondrial genome of a “living fossil,” the coelacanth (Latimeria chalumnae). Genetics 146, 995–1010.
13. Lansman, R. A., Shade, R. O., Shapira, J. F., and Avise, J. C. (1981) The use of restriction endonucleases to measure mitochondrial DNA sequence relatedness in natural populations. III. Techniques and potential applications. J. Mol. Evol. 17, 214–226.
14. Palva, T. K. and Palva, E. T. (1985) Rapid isolation of animal mitochondrial DNA by alkaline extraction. FEBS Lett. 192, 267–270.
15. Birnboim, H. C. and Doly, J. (1979) A rapid alkaline extraction procedure for screening recombinant plasmid DNA. Nucleic Acids Res. 7, 1513–1523.
16. Miya, M., Kawaguchi, A., and Nishida, M. (2001) Mitogenomic exploration of higher teleostean phylogenies: a case study for moderate-scale evolutionary genomics with 38 newly determined complete mitochondrial DNA sequences. Mol. Biol. Evol. 18, 1993–2009.
17. San Mauro, D., Gower, D. J., Oommen, O. V., Wilkinson, M., and Zardoya, R. (2004) Phylogeny of caecilian amphibians (Gymnophiona) based on complete mitochondrial genomes and nuclear RAG1. Mol. Phylogenet. Evol. 33, 413–427.
18. Kumazawa, Y. and Endo, H. (2004) Mitochondrial genome of the Komodo dragon: efficient sequencing method with reptile-oriented primers and novel gene rearrangements. DNA Res. 11, 115–125.
19. Sorenson, M. D., Ast, J. C., Dimcheff, D. E., Yuri, T., and Mindell, D. P. (1999) Primers for a PCR-based approach to mitochondrial genome sequencing in birds and other vertebrates. Mol. Phylogenet. Evol. 12, 105–114.
20. Cabria, M. T., Rubines, J., Gomez Moliner, B., and Zardoya, R. (2006) On the phylogenetic position of a rare Iberian endemic mammal, the Pyrenean desman (Galemys pyrenaicus). Gene 375, 1–13.
21. Sambrook, J., Fritsch, E. F., and Maniatis, T. (1989) Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York.
22. Palumbi, S. R., Martin, A., Romano, S., Owen MacMillan, W., Stice, L., and Grabowski, G. (1991) The Simple Fool’s Guide to PCR, Department of Zoology, University of Hawaii, Honolulu.
23. Folmer, O., Black, M., Hoeh, W., Lutz, R., and Vrijenhoek, R. (1994) DNA primers for amplification of mitochondrial cytochrome c oxidase subunit I from diverse metazoan invertebrates. Mol. Mar. Biol. Biotechnol. 3, 294–299.
24. Simison, W. B., Lindberg, D. R., and Boore, J. L. (2006) Rolling circle amplification of metazoan mitochondrial genomes. Mol. Phylogenet. Evol. 39, 562–567.


25. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic local alignment search tool. J. Mol. Biol. 215, 403–410.
26. Abascal, F., Zardoya, R., and Posada, D. (2006) GenDecoder: genetic code prediction for metazoan mitochondria. Nucleic Acids Res. 34, W389–W393.
27. Wyman, S. K., Jansen, R. K., and Boore, J. L. (2004) Automatic annotation of organellar genomes with DOGMA. Bioinformatics 20, 3252–3255.
28. Holder, M. and Lewis, P. O. (2003) Phylogeny estimation: traditional and Bayesian approaches. Nat. Rev. Genet. 4, 275–284.
29. Jameson, D., Gibson, A. P., Hudelot, C., and Higgs, P. G. (2003) OGRe: a relational database for comparative analysis of mitochondrial genomes. Nucleic Acids Res. 31, 202–206.
30. Thompson, J. D., Gibson, T. J., Plewniak, F., Jeanmougin, J., and Higgins, D. G. (1997) The Clustal X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25, 4876–4882.
31. Notredame, C., Higgins, D. G., and Heringa, J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302, 205–217.
32. Edgar, R. C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797.
33. Löytynoja, A. and Milinkovitch, M. C. (2001) SOAP, cleaning multiple alignments from unstable blocks. Bioinformatics 17, 573–574.
34. Maddison, W. P. and Maddison, D. R. (1992) MacClade: Analysis of Phylogeny and Character Evolution, Sinauer, Sunderland, MA.
35. Posada, D. and Crandall, K. A. (1998) Modeltest: testing the model of DNA substitution. Bioinformatics 14, 817–818.
36. Abascal, F., Zardoya, R., and Posada, D. (2005) Prottest: selection of best-fit models of protein evolution. Bioinformatics 21, 2104–2105.
37. Akaike, H. (1973) Information theory and an extension of the maximum likelihood principle. In: Proceedings of the Second International Symposium on Information Theory (Petrov, B. N. and Csaksi, F., eds), pp. 267–281, Akademiai Kiado, Budapest.
38. Huelsenbeck, J. P. and Ronquist, F. R. (2001) MrBayes: Bayesian inference of phylogeny. Bioinformatics 17, 754–755.
39. Swofford, D. L. (1998) PAUP*: Phylogenetic Analysis Using Parsimony (*and Other Methods), version 4.0, Sinauer, Sunderland, MA.
40. Page, R. D. (1996) TreeView: an application to display phylogenetic trees on personal computers. Comput. Appl. Biosci. 12, 357–358.
41. Hanahan, D. (1983) Studies on transformation of Escherichia coli with plasmids. J. Mol. Biol. 166, 557–580.

13 Retroposons: Genetic Footprints on the Evolutionary Paths of Life

Hidenori Nishihara and Norihiro Okada

Summary
Retroposons such as short interspersed elements (SINEs) and long interspersed elements (LINEs) are abundant transposable elements in eukaryote genomes. Recent large-scale comparative genome analyses have revealed that retroposons are a major component of genomes, wherein they provide structural diversity between species and uniqueness to each species. SINEs have been used as powerful markers in phylogenetic analyses of various species. This approach, which has been termed the SINE insertion method, infers phylogenetic relationships based on the presence/absence of SINEs among lineages. However, the method is not yet used extensively among biologists, especially molecular phylogeneticists, because it is based on an understanding of the molecular mechanisms of retroposition, which may be unfamiliar to many researchers. Moreover, the method may require a large amount of bench work to characterize a new SINE family and to screen genomic libraries of the species of interest. In this chapter, we present the basic theory and detailed technical steps involved in a SINE insertion analysis. Furthermore, we explain the isolation and characterization of a new SINE family from the genome of a species of interest, using a known SINE family in mammals as an example.

Key Words: Short interspersed elements; long interspersed elements; retroposons; transposable elements; mobile DNA; genomic database; bioinformatics.

1. Introduction

1.1. Overview of Retrotransposons

A eukaryotic genome generally contains a diverse array of retroposons such as short interspersed elements (SINEs) (1,2), long interspersed elements (LINEs) (3), and long terminal repeat (LTR) retrotransposons. For example, 42% of the human genome consists of such retroposons (4,5), and vertebrate (especially mammalian) genomes contain abundant SINEs and LINEs (6,7).

From: Methods in Molecular Biology: Phylogenomics Edited by: W. J. Murphy © Humana Press Inc., Totowa, NJ


Fig. 1. Molecular mechanisms of retrotransposition (A) and transposition (B). Retroposons propagate their copied sequences, whereas transposons change their location in the genome.

These elements are amplified via retroposition, which is analogous to a “copy and paste” mechanism (8,9): the retroposon is transcribed into RNA, the RNA is reverse-transcribed by reverse transcriptase (RT), and the cDNA is integrated into the host genome at another site. Retroposons cause substantial structural changes to the host genome, and thus it is believed that they have contributed to genome evolution. Hence, understanding genomic repetitive elements is essential for comparative genomics and phylogenetics. Detailed comparative analyses of transposable elements using both experimental and computational approaches are indispensable for understanding the origin, dynamics, and evolution of genomes. The unique features of retroposons provide an excellent method for phylogenetic analysis, namely comparing the presence/absence of retroposons, such as SINEs and LINEs (10,11), at specific sites in genomes. This method has been applied to the determination of phylogenies of various species (see below).

1.2. Overview of SINEs

SINEs and LINEs are the major groups of retroposons. The “copy and paste” mechanism of retrotransposition distinguishes retroposons from transposons, which move via transposition, a “cut and paste” mechanism (Fig. 1) (12). A typical vertebrate genome contains 10^4 retroposon copies, and a typical mammalian genome contains 10^5 copies (in human, 1.6 × 10^6 SINE copies and 8.7 × 10^5 LINE copies). Because over one-third of the human genome is occupied by retroposons (4), their sheer copy number contributes greatly to genome size. In this chapter, we focus on SINEs rather than LINEs because we originally developed the SINE method by characterizing many SINE families and elucidating phylogenetic relationships of various species groups.


Fig. 2. Typical structure of a tRNA-derived SINE. The 3′-tail sequence is similar to that of a partner LINE; the RT encoded by the LINE recognizes this 3′-tail sequence during retrotransposition.

SINEs range from 80 to 500 bp in length and do not encode any protein. Retrotransposition of SINEs relies completely on LINE machinery: the RT of a partner LINE recognizes the 3′-terminal sequence of the SINE RNA and synthesizes its cDNA from the 3′-terminal end concurrently with integration into the genome (13,14). Typical SINEs are derived from tRNA, and parental tRNA species have been identified for some SINE families (6,7). Figure 2 shows the general structure of tRNA-derived SINEs, which can be divided into three regions: a tRNA-derived region (containing pol III promoters such as Box A and Box B), a tRNA-unrelated region, and a 3′-tail region (1,2). The 3′-tail region of SINEs generally corresponds to that of partner LINEs for retrotransposition by the LINE machinery (15,16) and ends with a microsatellite sequence. On the other hand, the 3′-tail of almost all mammalian SINEs is a poly-A sequence because the RT encoded by L1, the most active mammalian LINE, recognizes the poly-A sequence of SINEs for retrotransposition (14). SINEs as well as LINEs are often flanked by direct repeats of several nucleotides that are produced by target-site duplication during retrotransposition.

Although most SINEs are derived from tRNA, minor exceptions include those derived from 7SL RNA; such SINEs are known only in the genomes of primates (Alu, galago Type II), rodents (B1), and the tree shrew (Tu type I, II) (17,18). Other exceptions, recently characterized from various vertebrates (19,20), are SINE families whose promoter region is derived from 5S rRNA. It is therefore possible that other types of non-tRNA-derived SINEs will be identified in the future.

As shown in Fig. 3, the distribution of mammalian SINEs is generally limited within a certain clade, such as a family, order, or superorder. For example, AfroSINEs are present only in the genomes of Afrotherian species, which implies that this SINE family originated in the genome of a common ancestor of Afrotheria, suggesting their monophyly (21). Therefore, analysis of retroposon distribution is generally useful for inferring phylogenetic relationships of the host species if we can thoroughly survey the family and subfamily of retroposons by PCR, dot-blot hybridization, and/or genomic library screening as described below (21,22).


Fig. 3. The distribution of SINEs and LINEs generally reflects phylogenetic relationships in mammals. The LINE names are shown in parentheses.


1.3. Retroposon Insertion Analysis

Species phylogenies can be reconstructed by comparing the presence or absence of retroposons at orthologous loci. The utility and advantages of this method are illustrated by the following three features.
1. Integration of retroposons into a genome, although rare, has greater impact on the genome sequence compared with nucleotide substitutions. The presence or absence of a retroposon can be easily detected by PCR.
2. There is little possibility that retroposon insertion events could occur independently at orthologous sites in different lineages. Oftentimes during retrotransposition, several nucleotides of the target region are duplicated, producing a direct repeat that flanks the retroposon sequence. Thus, integration of a retroposon flanked by the same direct repeats at orthologous sites of different lineages is likely to represent an ancestral event that has been inherited in subsequent lineages.
3. There is no known mechanism to remove a retroposon sequence from the locus by cutting it completely out. Although gross deletions at loci containing retroposons are sometimes observed, it is unlikely that the entire retroposon would be precisely deleted, thereby leaving the ancestral state. Therefore, the absence or presence of a retroposon at a locus can be regarded as the ancestral or derived state, respectively, in the evolutionary process.

From these features, a retroposon insertion can be considered a nearly homoplasy-free synapomorphy (shared-derived character) to reconstruct species phylogeny. Furthermore, these features provide the advantage that the relationships among the in-group species can be analyzed phylogenetically without out-group species. Therefore, the analysis of retroposon insertions yields a highly reliable phylogeny, as evidenced by phylogenetic analyses of various species such as cetartiodactyls (23–26), primates (27–29), afrotherians (30), many mammalian orders (31,32), birds (33), reptiles (34–36), and fishes (37–39).

1.4. Premises and Limitations of Retroposon Insertion Analysis

To apply retroposon analysis to determine species phylogeny, the following three major premises must be satisfied.
1. Many retroposon copies must have been actively amplified in the genomes during species divergence. If the period during which a retroposon family actively retrotransposed is much earlier than the divergence of the species to be analyzed, the retroposon copies are expected to be present at the analyzed loci in all species. Therefore, a suitable retroposon family or subfamily must be selected that was amplified during the species divergence period. A retroposon family can typically be classified into a few subfamilies according to diagnostic nucleotides or indels (see Subheadings 3.1.6. and 3.3.), which are identified by aligning many copies (several dozen) of the retroposon family. Copies of a retroposon family are considered to have been amplified according to the “multiple source genes model” (11,40).


Subfamilies characterized by their diagnostic nucleotides represent source genes. Each source gene (subfamily) has its own age. Therefore, by identifying the subfamily that was active during the relevant period, we can phylogenetically analyze the group of species whose relationships we want to determine (see Subheadings 3.1.8. and 3.3.).
2. To compare orthologous retroposon sites, the sequences of retroposon family (or subfamily) members as well as the flanking sequences must be determined by screening genomic libraries or be obtained from other information sources (e.g., databases; see Subheading 3.4.). However, to characterize a very ancient species divergence, it is difficult to isolate highly diverged retroposon sequences from a genomic DNA library because of the extreme divergence and because the similarity of many more recently retrotransposed copies hinders screening. Therefore, a probe specific for a certain retroposon family must be used in the screen; alternatively, it may be possible to use a genomic database to find divergent copies of the retroposons (31,32).
3. The orthologs of retroposon-inserted loci must be compared among the analyzed species. If the species to be compared are distantly related phylogenetically, it is quite difficult to amplify the orthologs by PCR because the primers directed to the specific sequences are unable to anneal. For example, we previously analyzed the phylogenetic relationships within Afrotheria ([30]; see Fig. 3). The last common ancestor of Afrotheria is considered to have existed about 80 million years ago (41), and we anticipate that mammalian species that diverged before this time will be very difficult to analyze. On the other hand, it is easier to perform a PCR analysis of species groups that have a low substitution rate.

2. Materials

2.1. Construction of Genomic Library
1. Solutions of sucrose (10 and 40%).
2. Gradient maker for sucrose gradients (Samplatec, Osaka, Japan).
3. Ultra-Clear 50 mL centrifuge tubes (Beckman, Fullerton, CA), and flexible plastic tubes.
4. SW41 rotor in a L8-70M ultracentrifuge (Beckman).
5. pUC18 or pUC19 vector digested with an appropriate restriction enzyme, for ligation using Ligation kit ver. 1 (Takara Bio, Shiga, Japan).

2.1.1. Investigation of SINE Distribution in Other Species
1. Neutralization solution: 0.5 M Tris-HCl, 1 M NaCl, bring to final pH 7.0.
2. Hybridization solution: 6X SSC, 1% SDS, 5X Denhardt’s solution.
3. Wash solution: 2X SSC, 1% SDS.

2.2. Screening the Genomic Library for SINEs
1. E. coli strain DH5α.
2. Colony/Plaque Screen nylon membranes (NEN Life Science Products, Boston, MA).
3. DNA oligonucleotide or PCR product as a probe.
4. [γ-32P]ATP.
5. T4 polynucleotide kinase (NipponGene, Tokyo).
6. [α-32P]dCTP.
7. BcaBEST DNA polymerase (Takara Bio).
8. Alkaline solution: 0.4 M NaOH, 0.6 M NaCl.
9. Neutralizing solution: 1 M NaCl, 0.5 M Tris-HCl, pH 7.0.
10. Hybridization solution for PCR products: 50% formamide, 6X SSC, 1% SDS, 2X Denhardt’s solution (1% each of Ficoll 400, polyvinylpyrrolidone, and bovine serum albumin), and herring sperm DNA (100 µg/mL).
11. Hybridization solution for oligonucleotide probe: 6X SSC, 1% SDS, 5X Denhardt’s solution, and herring sperm DNA (100 µg/mL).
12. Wash solution: 2X SSC, 1% SDS.

2.3. PCR, Cloning, and Sequencing
1. Ex Taq polymerase kit (Takara Bio).
2. GeneAmp PCR System 9700 (Applied Biosystems, Foster City, CA) or other thermal cycler.
3. QIAquick Gel Extraction kit (Qiagen, Valencia, CA).
4. pGEM-T vector (Promega, Madison, WI).
5. Ligation kit ver. 2.1 (Takara Bio).
6. BigDye terminator ver. 3.1 (Applied Biosystems).

2.4. Software
1. Genetyx version 5.2 (Genetyx, Tokyo, Japan) for Windows to analyze DNA sequences. This software provides many options (e.g., homology searching and multiple sequence alignment) that are useful for analyzing retroposon-containing sequences.
2. CPrimer software (http://iubio.bio.indiana.edu/soft/iubionew/molbio/dna/primers/CPrimer/) for Macintosh. This software provides information such as Tm values and indicates the possibility of interfering structures such as dimers and hairpins.
3. Blastz (42) and Multiz (43) programs. These programs, as well as MEGA3 (44) and ClustalX (45), provide a reliable alignment, and we modify the output by eye using Genetyx software.
4. BLAST and FASTA programs, which are useful to search for orthologous sequences of the loci in question in the genomic sequences obtained from GenBank.

3. Methods

3.1. Isolation and Characterization of a New SINE Family

3.1.1. Overview

If no SINE family is known in the species for which you are performing the phylogenetic analysis, it is necessary to newly isolate and characterize SINEs from the genome. There are a few possible methods to find new SINEs, e.g., (1) in vitro transcription of total genomic DNA (46,47), (2) PCR using primers specific to two promoter regions of tRNA (48), and (3) random sequencing of many thousands of base pairs of genomic DNA. Here we explain the third method, for which we use a high-throughput automated sequencing machine to facilitate sequencing on this scale. This strategy comprises seven steps as follows.
1. Genomic library construction for the species of interest.
2. Random sequencing of many clones.
3. Search for DNA repeats in the sequence determined, alignment, and construction of the consensus sequence.
4. Characterization of the tRNA-derived promoter region.
5. Characterization and classification of SINE subfamilies.
6. Estimation of the number of SINE copies in the genome.
7. Investigation of SINE distribution among species.

We previously characterized novel SINEs (designated AfroSINEs) in Afrotheria, which is a mammalian clade including elephants, sirenians, hyraxes, aardvarks, tenrecs, golden moles, and elephant shrews ([21]; see Fig. 3). In the subsequent sections, we use AfroSINE as a practical example.

3.1.2. Construction of a Genomic Library

For random sequencing of the genome, a genomic library for the species must be constructed. This library is also used in the subsequent screening for the SINEs (see Subheading 3.2.2.). The length of inserts should be around 1.5–2.5 kbp for easy sequencing of the SINEs and flanking regions. For that purpose, the digested genomic DNA should be fractionated according to length by sucrose gradient centrifugation as follows.
1. Completely digest 100 µg of genomic DNA using a restriction enzyme that recognizes 6-bp sequences. Although any restriction enzyme can be used, HindIII is usually used in our laboratory because HindIII-digested DNA fragments are generally ligated more efficiently than those produced by other major enzymes. However, we must change the enzyme if there is a corresponding restriction site within the SINE sequence. The volume of the reaction is adjusted to 200 µL, containing 300 U of enzyme. The reaction is incubated for 5 h to overnight at 37°C.
2. Prepare a 12-mL sucrose gradient from 10% and 40% sucrose solutions in a 14 × 89 mm ultracentrifuge tube.
3. Load the entire digested DNA gently on the top of the gradient.
4. Centrifuge the gradient at 22,000g for 15 h at 15°C.
5. Collect 300-µL fractions through a flexible plastic tube with a 50-µL capillary inserted into the bottom of the centrifuge tube.
6. Check the lengths of the fractionated DNA by electrophoresis by loading 8 µL of each fraction in a 1% agarose gel.
7. Collect appropriate fractions that contain 1.5–2.5-kbp DNA fragments. Precipitate the DNA with ethanol, and dissolve the DNA in 15 µL of distilled water.
8. Prepare a ligation reaction containing 1 µL of the fractionated DNA, 200 ng of vector DNA digested with the appropriate restriction enzyme (we use HindIII), 12 µL of the ligation buffer (Takara Ligation kit ver. 1 sol. A), and 2 µL of the ligase solution (Takara Ligation kit ver. 1 sol. B). Incubate the reaction at 16°C for > 1 h.


3.1.3. Random Sequencing of Many Clones

The length of the total sequence to be determined can be roughly estimated from the genome size and the estimated number of SINE copies of the species. If the length of the SINE is 200 bp and the SINEs occupy 1% of the genome, one SINE copy will be detected in 20 kbp of sequence determined randomly (i.e., 200/0.01 = 20,000). In this case, 60 kbp of sequence data is necessary to detect three SINE copies. Therefore, if possible, predict how many SINEs exist in the genome of interest by referring to known copy numbers of other SINEs (7). After transformation of the genomic library into bacteria, choose colonies randomly for sequencing. Because an automatic sequencer can generally determine 600 nucleotides per sample, 100 samples will be necessary to determine 60 kbp of total sequence (see Note 1).
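The arithmetic above can be written as a small helper for planning purposes. This is only a sketch: the function name and its parameters are hypothetical, and it gives an expected average, not a guarantee that the stated number of copies will actually be found.

```python
import math


def reads_needed(sine_length_bp, genome_fraction, target_copies, read_length_bp=600):
    """Return (kbp of random sequence, number of reads) expected to contain
    `target_copies` SINE copies, assuming SINEs are spread evenly."""
    if not 0 < genome_fraction <= 1:
        raise ValueError("genome_fraction must be in (0, 1]")
    # Average stretch of sequence per SINE copy, e.g. 200 / 0.01 = 20,000 bp
    bp_per_copy = sine_length_bp / genome_fraction
    total_bp = bp_per_copy * target_copies
    n_reads = math.ceil(total_bp / read_length_bp)
    return total_bp / 1000, n_reads


# Values from the text: 200-bp SINE, 1% of the genome, three copies wanted
total_kbp, n_reads = reads_needed(200, 0.01, 3)
print(total_kbp, n_reads)
```

With the values used in the text, this reproduces the 60 kbp and 100 samples given above.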

3.1.4. Search and Alignment of DNA Repeats

A few repetitive sequences will be detected by comparing homology among the determined sequences using sequence analysis software such as Genetyx. Alternatively, to find SINEs, it may also be useful to perform a local Blast search using known tRNA sequences available in the genomic tRNA database (http://lowelab.ucsc.edu/GtRNAdb/) as queries. Because a retroposon generally ends with a microsatellite sequence or poly-A, these sequences may offer a clue to finding a SINE. Also, direct repeat sequences, which may flank retroposons, are useful to search for SINEs. Gather and align the putative SINE sequences to build a consensus sequence. If you do not detect SINE-like sequences, more sequence data are necessary. In the case of AfroSINEs, we randomly determined 63 kbp of genomic sequence and found 26 SINE-like sequences. An example of an AfroSINE alignment is shown in Fig. 4.
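As a rough illustration of how flanking direct repeats can support a candidate element, the sketch below checks whether the bases immediately upstream and downstream of a putative element are identical. The function name, the length bounds, and the toy sequence are all hypothetical and are not part of the protocol; real target-site duplications are often imperfect, so an exact-match test like this one is a first pass at best.

```python
def target_site_duplication(seq, start, end, min_len=5, max_len=20):
    """Return the longest exact direct repeat flanking seq[start:end]
    (0-based, end-exclusive), or None if none of at least min_len is found."""
    for k in range(max_len, min_len - 1, -1):
        if start - k < 0 or end + k > len(seq):
            continue  # not enough flanking sequence for this repeat length
        left = seq[start - k:start]
        right = seq[end:end + k]
        if left == right:
            return left
    return None


# Toy example (invented sequence): a 9-bp repeat flanks a placeholder element
repeat = "AAGTCTTGA"
element = "GGCATAGTGG" + "ACT" * 11 + "GTTCAAATCC" + "A" * 12
seq = "GGATCC" + repeat + element + repeat + "CCTTAG"
start = len("GGATCC") + len(repeat)
end = start + len(element)
print(target_site_duplication(seq, start, end))
```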

3.1.5. Characterization of the tRNA-Derived Promoter Region

If you succeed in finding a few families of repetitive sequences, you must confirm that they are indeed SINEs by identifying their origin, because there are many repetitive sequences other than SINEs. Because almost all SINEs are derived from tRNA, we here explain how to deduce the tRNA-like structure (Fig. 5). If a SINE is derived from another source such as 7SL RNA or 5S rRNA, you can easily identify it by comparing homology between the SINE consensus sequences and these RNAs.
1. All SINEs contain internal promoter sequences so that they are transcribed by RNA polymerase III. The tRNA-like region consists of two conserved regions of first (Box A) and second (Box B) promoters that are separated by approx. 35 nucleotides. It is helpful to first find the second promoter because it is generally well conserved. The consensus sequence of the second promoter is GWTCRANNCY, which corresponds to positions 53–62 of a typical tRNA gene (49).

2. Next, find the first promoter, which corresponds to positions 9–19 of a typical tRNA. The consensus sequence of this promoter, GGCNNAGY(N)GG, is generally less well-defined than the second promoter. (The N in parentheses may be absent from the promoter.)
3. The anticodon loop, which corresponds to positions 32–38, lies between the first and second promoters (Fig. 5A). Because C32, T33, and R37 are well conserved, you can deduce the nucleotides of the anticodon loop in the SINE-like sequence.
4. By comparing the positions of the above three regions with a typical tRNA structure in Fig. 5A, fit your consensus sequence to all nucleotides. In this step, several other conserved nucleotides may be helpful, such as T8, Y11, and R24, and RRRYY of the Extra Arm (positions 44–48). The length of the Extra Arm is generally 5 bp in many tRNAs, but it is longer in a few cases (49).
5. A typical tRNA consists of four loop regions and four stems (Fig. 5A). The typical number of nucleotides in each stem and loop is also useful information to build a tRNA-like structure. There are seven nucleotides in each of the anticodon and TΨC loops, and there are generally seven or eight nucleotides in the D loop. The number of nucleotide pairs in the D stem, anticodon stem, and TΨC stem is four, five, and five, respectively. In these stem regions of SINEs, nucleotides may not base-pair because of substitutions, insertions, or deletions of nucleotides after generation of the SINEs.

Fig. 4. An alignment of AfroSINEs in the African elephant genome as an example of three SINE subfamilies (Anc, Ad, and HSP) distinguished by diagnostic deletions and a nucleotide (gray boxes). Dots indicate nucleotides identical to those in the consensus sequence, whereas dashes indicate deletions. Nucleotides corresponding to tRNA numbering and promoters are shown above the consensus sequence (see Fig. 5).

Fig. 5. A tRNA structure and a tRNA-like structure of the promoter region of AfroSINE. (A) Secondary structure of a typical tRNA gene. Conserved nucleotides among various tRNAs (49) are numbered. (B) Secondary tRNA-like structure of the promoter region of AfroSINE. Nucleotides corresponding to those conserved among various tRNAs are shown in bold.

Figures 4 and 5B show the empirical example of AfroSINE. In the alignment of Fig. 4, the second and first promoter sequences can be identified as GTTCAAATCC and GGCATAGTGG, respectively. There are 33 nucleotides between the two promoters in this case. Subsequently, the conserved nucleotides in the anticodon loop are identified as C, T, and A (Fig. 4). Additionally, the sequence of the extra arm is AGGTC in AfroSINE, which corresponds to the consensus of RRRYY in a typical tRNA. From this information we can easily fit the AfroSINE consensus sequence to a tRNA-like structure (Fig. 5B).
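The promoter search described above can be sketched as a pattern match, translating the two consensus sequences from IUPAC ambiguity codes into regular expressions. This is a minimal illustration and not the procedure used in our laboratory: the function names and the allowed spacing range (25–45 nucleotides) are assumptions, and the 33-nucleotide spacer in the example is an invented placeholder, not the AfroSINE sequence. Only the two promoter sequences are taken from the text.

```python
import re

# IUPAC ambiguity codes used in the two promoter consensus sequences
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "W": "[AT]", "N": "[ACGT]"}


def iupac_to_regex(motif):
    """Translate an IUPAC motif; a parenthesized part is treated as optional."""
    parts = []
    for ch in motif:
        if ch == "(":
            parts.append("(?:")
        elif ch == ")":
            parts.append(")?")
        else:
            parts.append(IUPAC[ch])
    return "".join(parts)


BOX_A = re.compile(iupac_to_regex("GGCNNAGY(N)GG"))  # first promoter
BOX_B = re.compile(iupac_to_regex("GWTCRANNCY"))     # second promoter


def find_promoters(seq, min_gap=25, max_gap=45):
    """Return (Box A, Box B, spacing) for each pair within the assumed spacing."""
    seq = seq.upper()
    hits = []
    for a in BOX_A.finditer(seq):
        for b in BOX_B.finditer(seq, a.end()):
            gap = b.start() - a.end()
            if min_gap <= gap <= max_gap:
                hits.append((a.group(), b.group(), gap))
    return hits


# AfroSINE promoter sequences from the text joined by a placeholder spacer
toy = "GGCATAGTGG" + "ACT" * 11 + "GTTCAAATCC"
print(find_promoters(toy))
```

Because substitutions accumulate in old copies, an exact match to the consensus will miss diverged elements; the search is best applied to a consensus sequence built from many copies.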

3.1.6. Characterization and Classification of SINE Subfamilies

A SINE family usually can be divided into a few subfamilies. Because each subfamily has its own age, we must choose a suitable subfamily that propagated frequently in the genome during the divergence of the species to be analyzed. Therefore, classification of SINE subfamilies and analysis of the periods during which they were retrotranspositionally active are required. Each subfamily is recognized by diagnostic nucleotides and insertions/deletions. Diagnostic nucleotides are defined as those that changed specifically in at least one nucleotide position of a certain subfamily, and they can be distinguished from neutral mutations that accumulated randomly in the SINE sequence during evolution. To recognize these subfamilies, it is necessary to gather information on many SINE sequences by PCR and sequencing. By referring to a consensus sequence of a SINE family, PCR primers can be designed in the 3′ and 5′ regions of the SINE. If you design multiple sets of primers, more subfamilies of the SINE may be obtained. Next, perform PCR using genomic DNA of various species as templates, and clone the product via ligation with an appropriate vector DNA and transformation of the DNA into bacteria. We recommend sequencing dozens of clones. By aligning the sequences with ClustalX and Genetyx software, diagnostic differences that represent subfamilies of the SINE family can be identified. After successful characterization of subfamilies based on the presence of diagnostic nucleotides and insertions/deletions, it may be possible to examine the taxonomic distribution of a given subfamily by PCR and dot-blot hybridization (see Subheading 3.1.8.). Figure 4 shows an example of AfroSINEs isolated from the genome of the African elephant. These SINEs can be divided into three subfamilies (Anc, Ad, and HSP) distinguished by diagnostic deletions and a nucleotide site as denoted by the gray box in Fig. 4.
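One simple way to shortlist candidate diagnostic positions from an alignment is to look for columns in which a non-consensus character (a nucleotide or a gap) is shared by several copies, since random neutral mutations tend to appear in single copies. The sketch below illustrates this idea only; the function name, the threshold of three shared copies, and the toy alignment are hypothetical, and any candidate still has to be judged by eye in the full alignment.

```python
from collections import Counter


def candidate_diagnostic_sites(aligned, min_shared=3):
    """Return (1-based column, consensus character, {variant: count}) for
    columns where a non-consensus character occurs in >= min_shared copies."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to equal length")
    sites = []
    for pos in range(len(aligned[0])):
        counts = Counter(s[pos] for s in aligned)
        consensus, _ = counts.most_common(1)[0]
        variants = {ch: n for ch, n in counts.items()
                    if ch != consensus and n >= min_shared}
        if variants:
            sites.append((pos + 1, consensus, variants))
    return sites


# Toy alignment (invented): column 5 carries a shared T, column 3 a lone C
toy_alignment = ["ACGTACGT"] * 4 + ["ACGTTCGT"] * 3 + ["ACCTACGT"]
print(candidate_diagnostic_sites(toy_alignment))
```

In the toy alignment, the T shared by three copies at column 5 is reported as a candidate, whereas the single C at column 3 is treated as a random mutation and ignored.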

3.1.7. Estimation of the Number of SINE Copies

The proportion of SINEs in the genome can be estimated from the SINE length and the number of SINEs in the sequence determined. In the case of AfroSINEs, they occupy 6.6% of the hyrax genome, because 26 copies of the AfroSINE-HSP (160 bp each) are detected in a 63-kbp sequence of the hyrax genome (i.e., 26 × 160 × 100 / 63,000). Additionally, the number of SINE copies can be estimated by assuming a genome size for the species. If we assume that the genome size of hyrax is 2.5 Gbp, approx. 1 million AfroSINE copies should exist in the genome (i.e., 26 × 2.5 × 10^9 / 63,000).
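The two estimates above can be reproduced with a few lines of code. This is a sketch for checking the arithmetic; the function name and parameter names are hypothetical, and the 2.5-Gbp genome size is the assumption stated in the text.

```python
def sine_genome_stats(n_copies, sine_length_bp, sampled_bp, genome_size_bp):
    """Return (percent of genome occupied, estimated copies in the genome)."""
    percent = n_copies * sine_length_bp * 100 / sampled_bp
    est_copies = n_copies * genome_size_bp / sampled_bp
    return percent, est_copies


# AfroSINE-HSP in hyrax: 26 copies of 160 bp in 63 kbp; assumed 2.5-Gbp genome
percent, copies = sine_genome_stats(26, 160, 63_000, 2.5e9)
print(f"{percent:.1f}% of genome, ~{copies:,.0f} copies")
# prints "6.6% of genome, ~1,031,746 copies"
```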

3.1.8. Investigation of SINE Distribution

There are two methods to investigate the distribution of SINEs in the genomes of various species. An easy method is PCR with primers specific to the SINE sequence. An example of the PCR to examine the distribution of the AfroSINE-HSP subfamily is shown in Fig. 6A. In this case, the PCR reactions contained 0.25 U of Ex Taq polymerase, 1X Ex Taq buffer, 200 µM dNTPs, 5 pmol of each primer, and 100 ng of genomic DNA in a final volume of 30 µL. The DNA fragments were amplified with a denaturation step at 94°C for 3 min, then 30 cycles of denaturation at 94°C for 1 min, annealing at 50°C for 30 s, and extension at 72°C for 30 s. The PCR product was confirmed by electrophoresis in a 3% agarose gel.

Fig. 6. Distribution of the AfroSINE-HSP subfamily within Afrotheria species investigated by PCR (A) and dot-blot hybridization (B). Positive PCR bands are indicated by an arrow in (A).

Another method to investigate the SINE distribution is dot-blot hybridization, in which a SINE-specific 32P-labeled oligonucleotide probe is hybridized with genomic DNA of various species according to the following procedures.
1. Incubate each of 100 and 500 ng genomic DNA in 125 µL of 0.4 M NaOH for 15 min at room temperature.


2. Immerse a GeneScreen Plus membrane (Perkin Elmer, Boston, MA) in 2X SSC for 15 min at room temperature.
3. Dot-blot the entire DNA solutions on the membrane.
4. Immerse the membrane in neutralization solution for 5 min, and then allow it to air-dry.
5. Label an oligonucleotide probe (20–30 nucleotides, 50 pmol) with [γ-32P]ATP using T4 polynucleotide kinase in 50 µL at 37°C for 30 min.
6. Purify the probe using the QIAquick Gel Extraction kit (Qiagen).
7. Perform prehybridization of the membrane in 10 mL of hybridization solution without the probe at 37°C for 1–2 h.
8. Hybridize the membrane with the labeled probe in 10 mL of hybridization solution at an appropriate temperature (normally 42°C) overnight (see Note 2).
9. Wash the membrane in 100 mL of a wash solution for 10 min at an appropriate temperature (normally 50–60°C). Repeat this washing step a few times with fresh wash solution (see Note 2), and then wrap the membrane in cellophane.
10. Expose the membrane to an X-ray film overnight at –80°C, then develop the film.

As shown in an example of AfroSINE-HSP, positive hybridization signals were observed only for elephant, sirenians (dugong and manatee), and hyrax (Fig. 6B). This result suggests that this SINE subfamily was generated in a common ancestor of these species. Therefore, we can predict that this subfamily is appropriate to use as markers to analyze the phylogenetic relationship among these species.

3.2. SINE Insertion Analysis

3.2.1. Overview

The SINE insertion analysis detects the presence or absence of the SINE among orthologous loci of different species. For example, if a SINE is present at locus P in species A and B and is absent at the same locus in species C and D, species A and B are regarded as monophyletic among the four species (Fig. 7). Thus, this SINE insertion is interpreted as having occurred at locus P in the genome of a common ancestor of species A and B. On the other hand, if a different SINE insertion is observed at locus Q only in species A, B, and C, this indicates that these three species are monophyletic (Fig. 7). Thus, the phylogeny of the four species can be reconstructed by analyzing two SINE insertion patterns. In this case, note that a locus indicating the monophyly of species A and B, such as locus P, can never be obtained by screening genomic libraries of species C or D (i.e., ascertainment bias). Therefore, to obtain informative loci for the relationship in question while excluding the possibility of ascertainment bias, it is recommended that genomic libraries of all species in question be constructed and screened (50,51). In some cases, it is possible to refer to phylogenetic relationships determined by other methods such as mtDNA phylogeny.


Fig. 7. General retroposon method to reconstruct species phylogeny. Because the presence or absence of a retroposon is regarded as the derived or ancestral form, respectively, the phylogenetic relationships among the four species (A–D) can be estimated from the insertion pattern at loci P and Q.
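The logic of reading clades from insertion patterns can be sketched in code. The example below encodes the two hypothetical loci P and Q for species A–D as a presence/absence table (1 = SINE present, 0 = absent), collects the group of species sharing each insertion, and checks that the groups are nested or disjoint, as expected for characters that are nearly free of homoplasy. The function names and the data layout are assumptions made for illustration only.

```python
from itertools import combinations


def informative_clades(matrix):
    """matrix: {locus: {species: 1 or 0}}. Return {locus: frozenset of species
    sharing the insertion} for loci present in >= 2 species and absent in >= 1."""
    clades = {}
    for locus, states in matrix.items():
        present = frozenset(sp for sp, state in states.items() if state == 1)
        absent = [sp for sp, state in states.items() if state == 0]
        if len(present) >= 2 and absent:
            clades[locus] = present
    return clades


def conflicts(clades):
    """Return pairs of loci whose species groups are neither nested nor disjoint."""
    bad = []
    for (l1, c1), (l2, c2) in combinations(clades.items(), 2):
        if not (c1 <= c2 or c2 <= c1 or c1.isdisjoint(c2)):
            bad.append((l1, l2))
    return bad


# Insertion patterns at loci P and Q as described for species A-D
matrix = {
    "P": {"A": 1, "B": 1, "C": 0, "D": 0},
    "Q": {"A": 1, "B": 1, "C": 1, "D": 0},
}
clades = informative_clades(matrix)
print(sorted(sorted(c) for c in clades.values()), conflicts(clades))
```

Here the group {A, B} from locus P is nested within {A, B, C} from locus Q, which corresponds to the tree (((A,B),C),D). A non-empty conflict list would flag loci that need re-examination, for example for incomplete lineage sorting or mistaken orthology.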

In the SINE insertion analysis, the following steps are necessary to reconstruct the phylogeny of species. Among them, the procedure to construct genomic DNA libraries is described in Subheading 3.1.2.
1. Construction of genomic DNA libraries from several species for which the phylogeny is to be determined.
2. Screening the libraries for SINEs and determining the sequences of each retroposon and its flanking regions.
3. PCR and sequencing of each SINE locus to determine its presence or absence in different species.

3.2.2. Screening a Genomic DNA Library for SINEs

SINE sequences are generally obtained by screening a genomic library of a certain species.
1. Transform an appropriate bacterium with the genomic DNA library. For efficient screening, we grow 200–400 colonies on each of 20 plates for one procedure.
2. Place a Colony/Plaque Screen nylon membrane on the colonies, and incubate the plates at 37°C for 1 h.


3. Mark each nylon membrane in at least three asymmetric locations by stabbing through it and into the agar.
4. Remove the membranes from the plates and put them in the alkaline solution for 3 min.
5. Immerse the membranes in the neutralizing solution for 3 min.
6. Air-dry the membranes completely at room temperature.
7. Incubate the plates at 37°C for 2 h so that the colonies regrow, then store them at 4°C.
8. Either of two kinds of probes is used; one is a PCR product of the SINE and the other is a SINE-specific oligonucleotide. PCR product probes are generally prepared by primer extension using the PCR product of the SINEs as a template. The reaction contains 1X BcaBEST polymerase buffer, 250 mM dNTP mix lacking dCTP, 10 pmol each of the two primers specific to the SINE, and PCR product of the SINE. After denaturation at 95°C and annealing at 50 or 55°C, add 1.85 MBq of [α-32P]dCTP and 2 U of BcaBEST polymerase. Subsequently, incubate the mix at 55°C for 15 min for the extension reaction to incorporate the labeled cytosine. In the case of an oligonucleotide probe, DNA (20–30 nucleotides, 50 pmol) is labeled with [γ-32P]ATP using T4 polynucleotide kinase in 50 µL at 37°C for 30 min.
9. Prepare the hybridization solution; we usually prepare 30 or 50 mL for 10 or 20 pieces of membranes, respectively. When a PCR product of a SINE is used as the probe, the hybridization solution differs in composition from that used with an oligonucleotide probe (see Subheading 2.2.).
10. Put the membranes and half the volume of the hybridization solution into the plastic hybridization bag and squeeze out bubbles. Seal the open end of the bag with the heat sealer and incubate the bag for 1–2 h at 37°C for prehybridization.
11. Open the bag by cutting the corner and change the hybridization buffer. Add the rest of the hybridization buffer and the probe to the solution, and then squeeze as much air as possible from the bag. Reseal the bag, and incubate it in a water bath at an appropriate temperature for an appropriate period (e.g., overnight at 37°C in the case of AfroSINE) (see Note 2).
12. Remove the bag from the water bath and cut off the corner. Remove the membranes and submerge them in a tray containing 250 mL of the wash solution (2X SSC, 1% SDS) and incubate for 10–60 min at an appropriate temperature ranging from room temperature to 70°C (e.g., for 20 min at 50°C in the case of AfroSINE) (see Note 2). Dump the wash solution into the disposal container, and repeat this washing procedure about three times. Wrap the membranes in cellophane.
13. Expose the membranes to an X-ray film overnight at –80°C.
14. After developing the X-ray film, align the membranes with the film and identify the corresponding colonies on the agar plate. Isolate the plasmids from the colonies and determine the sequences of the SINE and its flanking regions (see Note 3).

3.2.3. Flanking PCR to Detect the Presence/Absence of a SINE in Different Species
1. For each SINE locus, design PCR primers for both sides of the flanking sequences to detect the presence/absence of the SINE. The length between the SINE and each

Retroposons as Useful Markers in Phylogenetics


Fig. 8. An example of the AfroSINE-inserted locus. (A) An electrophoretic profile of PCR products. (B) An alignment of the SINE and its flanking sequences. The SINE region is indicated by a bar above the alignment.

primer should be at least 100 bp to detect a PCR product clearly by electrophoresis (see Note 4).
2. PCR reactions contain 0.25 U of Ex Taq polymerase, 1X Ex Taq buffer, 200 µM dNTPs, 10 pmol of each primer, and genomic DNA in a final volume of 30 µL. In our work, the amount of genomic DNA used in each reaction ranged from 100 to 300 ng, depending on the integrity of each DNA sample. The DNA fragments were amplified by PCR with an initial denaturation step at 94°C for 3 min, then 35 cycles of denaturation at 94°C for 1 min, annealing at 50 or 55°C for 1 min, and extension at 72°C for 2 min, followed by a final extension at 72°C for 2 min (see Note 5).
3. Check the amplified DNA by electrophoresis in a 2–3% agarose gel. You can estimate whether the amplified DNA fragments contain retroposons from the difference in band size on the gel (see the example of an AfroSINE locus in Fig. 8A). However, if it is difficult to detect such an insertion pattern because of species-specific insertions and deletions, sequence all the PCR products and evaluate the insertion pattern by sequence alignments.


4. To purify the PCR products, use the entire DNA sample for electrophoresis after ethanol precipitation. Next, cut the bands out, and recover the PCR products from the agarose gel using the QIAquick Gel Extraction kit.
5. Prepare a sequencing reaction using the BigDye terminator (ver. 3.1) kit. The reactions contain 1 µL of BigDye solution, 1 µL of 5X dilution buffer, 5 pmol of primer, and 20% of the recovered DNA in a final volume of 10 µL. Purify the DNA samples by ethanol precipitation, followed by sequencing with the ABI Prism 3100 Genetic Analyzer.
6. When direct sequencing is difficult because of low yield of PCR product or other reasons, consider cloning the DNA. The ligation reaction contains 20% of the recovered DNA, 10 ng of pGEM-T vector, and Solution I of the Takara Ligation kit ver. 2.1. Incubate the sample at 16°C for 1 h. Then transform E. coli DH5α with the DNA.
7. Align the determined sequences using appropriate software. We often use ClustalX or MEGA3 for the alignment and then modify the alignment by eye using Genetyx software (see an empirical example shown in Fig. 8B). Blastz and Multiz software for UNIX are also useful for alignment when there are large genetic distances among species.
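The band-size comparison in step 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the flank and SINE lengths, and the 30-bp tolerance are hypothetical and are not taken from the protocol. Target-site duplications and lineage-specific indels are ignored, which is why ambiguous bands are flagged for sequencing rather than called.

```python
# Sketch: expected flanking-PCR product sizes for one SINE locus.
# All lengths and the tolerance are hypothetical illustration values.

def expected_amplicon_sizes(left_flank_bp, right_flank_bp, sine_bp):
    """Return (empty_site_bp, filled_site_bp).

    Each flank length is measured from the 5' end of the primer to the
    SINE insertion site, so the primers themselves are included.
    """
    empty = left_flank_bp + right_flank_bp
    filled = empty + sine_bp
    return empty, filled


def call_insertion(observed_bp, left_flank_bp, right_flank_bp, sine_bp,
                   tolerance_bp=30):
    """Classify a band as 'present', 'absent', or 'ambiguous'."""
    empty, filled = expected_amplicon_sizes(left_flank_bp, right_flank_bp, sine_bp)
    near_filled = abs(observed_bp - filled) <= tolerance_bp
    near_empty = abs(observed_bp - empty) <= tolerance_bp
    if near_filled and not near_empty:
        return "present"
    if near_empty and not near_filled:
        return "absent"
    return "ambiguous"  # sequence the product, as recommended in step 3


if __name__ == "__main__":
    # Hypothetical locus: 120-bp and 150-bp flanks, 250-bp SINE.
    for band in (520, 270, 400):
        print(band, call_insertion(band, 120, 150, 250))
```

A band that matches neither expected size, or a SINE so short that the two sizes fall within the tolerance of each other, is reported as ambiguous on purpose; in those cases the alignment, not the gel, should decide.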

3.3. Estimation of Ages of Actively Retrotransposed SINE Subfamilies
As described above, a SINE subfamily that is suitable for phylogenetic analysis of a certain species group can be identified from its distribution in the genomes of the species. In the empirical example of AfroSINEs (Fig. 6), the HSP subfamily is distributed only in the genomes of elephants, sirenians, and hyrax, suggesting that it is suitable for analyzing the relationships among these species. However, even when two subfamilies show a similar distribution pattern in the genomes of the same species groups, the periods during which they were actively retrotransposed can differ, as exemplified in Figs. 9 and 10. In this case, we estimate the ages from the sequence differences between the SINEs and the consensus sequence or from the presence/absence of the SINEs at each locus. As shown in Figs. 9 and 10, the AfroSINE-HSP subfamily can be further divided into two types (or subsubfamilies), AfroSINE-HSPo (older type) and AfroSINE-HSPy (younger type). As highlighted by gray bars in the alignment of Fig. 9, the two types are distinguished by five diagnostic nucleotides. Insertion analysis among elephants, sirenians, and hyrax showed that all the HSPo SINE insertions are shared among the species, whereas the SINEs belonging to the HSPy type are inserted in only a limited number of species. Therefore, the HSPo type retrotransposed frequently in a common ancestor of these species, and after species divergence the HSPy type became retrotranspositionally active instead (Fig. 10). Thus, when phylogenetic relationships within each group of elephants, sirenians, or hyrax are analyzed, the data indicate that the HSPy type is the most suitable marker (see ref. 33 for another example, in birds).
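The two quantities used above, divergence of a SINE copy from the consensus and assignment to a type by diagnostic nucleotides, can be sketched as follows. This is a minimal illustration, not the procedure used for Figs. 9 and 10: the sequences, the diagnostic positions, and the function names are invented for the example, and the real analysis uses the five diagnostic nucleotides marked in Fig. 9.

```python
# Sketch: divergence from a consensus and typing by diagnostic nucleotides.
# Sequences and diagnostic positions are toy values, not AfroSINE data.

def divergence_from_consensus(seq, consensus):
    """Fraction of aligned, ungapped columns at which seq differs."""
    if len(seq) != len(consensus):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    mismatches = 0
    for a, b in zip(seq.upper(), consensus.upper()):
        if a == "-" or b == "-":
            continue  # skip gap columns
        compared += 1
        if a != b:
            mismatches += 1
    return mismatches / compared if compared else float("nan")


def classify_by_diagnostic_sites(seq, diagnostic):
    """diagnostic maps a 0-based alignment column to (older_base, younger_base)."""
    older = 0
    younger = 0
    for pos, (older_base, younger_base) in diagnostic.items():
        base = seq[pos].upper()
        if base == older_base:
            older += 1
        elif base == younger_base:
            younger += 1
    if older > younger:
        return "HSPo"
    if younger > older:
        return "HSPy"
    return "unassigned"


if __name__ == "__main__":
    consensus = "ACGTACGTAC"                                  # toy consensus
    sites = {1: ("C", "T"), 4: ("A", "G"), 8: ("A", "C")}     # hypothetical
    for copy in ("ACGTACGTAC", "ATGTGCGTCC"):
        print(copy,
              divergence_from_consensus(copy, consensus),
              classify_by_diagnostic_sites(copy, sites))
```

Copies of an older type are expected to show higher average divergence from their own consensus, so the two measures are best read together with the presence/absence pattern at each locus.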


Fig. 9. An alignment of two types of AfroSINE-HSP subfamily, HSPo (older type) and HSPy (younger type). These types are distinguished by five nucleotides denoted by gray bars.


Fig. 10. An estimated shift in retrotranspositional activity from the AfroSINE-HSPo type to the AfroSINE-HSPy type during evolution as demonstrated by SINE insertion analysis. The SINE loci inside the dashed-line box belong to AfroSINE-HSPo, and those inside the solid-line box belong to AfroSINE-HSPy.

3.4. Usefulness of Genomic Databases for Retroposon Insertion Analysis
Although we have introduced a method to screen genomic libraries for SINE loci, this method may be technically difficult for isolating SINEs that were inserted in an ancient species because many nucleotide substitutions probably accumulated in such SINE sequences. In such a case, genomic databases are


also useful for obtaining information on SINE-inserted loci as well as other retroposons. Whole-genome databases of various species, especially mammals, have been available for several years, and annotation data for all known repetitive elements are available for several species from a few databases such as UCSC Genome Bioinformatics ([52]; http://genome.ucsc.edu/). By taking advantage of such databases, phylogenetic relationships among almost all eutherian orders have been successfully analyzed using retroposon insertions (31,32). These studies coupled bioinformatics with experimental approaches (e.g., PCR and sequencing) to overcome the difficulty of identifying, from the database alone, all retroposons that were inserted in an ancient lineage. Although this approach has been used extensively only for mammalian phylogeny, the method should be applicable to the analysis of phylogenetic relationships throughout the eukaryotes. As we mentioned in Subheading 3.1.3., random sequencing can be used to discover a novel SINE family, and when a large amount of sequence data (>10^5 bp) for a species is available from a database, sequencing procedures can be omitted in the above method (see examples in refs. 20,53). As genomic data continue to increase, the utility of database information will undoubtedly also increase with regard to characterizing novel retroposons and analyzing the phylogeny of various groups of organisms.

4. Notes
1. For economy, it is better to dilute the BigDye solution in sequencing reactions. Although we dilute it fourfold relative to the amount recommended in the general protocol, an eightfold dilution is also possible if the amount and purity of the PCR product are sufficient.
2. Experimental conditions such as temperature and time in hybridization and washing differ from case to case. In hybridization, the temperature range is 37–60°C and the time is normally overnight. In washing, the temperature ranges from room temperature to 70°C, and the time ranges from 10 min to a few hours. Therefore, you should define optimal conditions for each experiment in the screen.
3. After isolating a plasmid, it may be difficult to determine the entire sequence of the insert DNA if the insert is long. In this case, try performing PCR using SINE-specific and vector-specific primers with the plasmid DNA as template. Then, sequence the PCR product using the SINE-specific primer to determine the flanking region of the SINE.
4. Longer primers, although more costly, increase the specificity of annealing to the locus. We usually design 20-nucleotide primers for flanking PCR, but we use longer primers (30 nucleotides) if there is a large genetic distance among the analyzed species. The Tm value of the primer can range from 50 to 65°C, but the recommended range is 55–60°C.
5. The amount and purity of template genomic DNA is also critical to PCR success. If no DNA product is obtained from the initial PCR, the genomic DNA may be


fragmented and a larger amount of template may be required for the PCR reaction. Also, it may be necessary to lower the annealing temperature, which is generally 5–10°C lower than the Tm of the primers. If these solutions do not improve the PCR yield, new primers should be designed.
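As a quick check on Notes 4 and 5, primer Tm and a starting annealing window can be estimated in Python. The sketch below uses the Wallace rule (2°C per A/T, 4°C per G/C), which is our assumption rather than the method used by the authors; it is a rough rule of thumb, and a nearest-neighbor calculation is preferable for final primer design. The primer sequence and function names are hypothetical.

```python
# Sketch: rough primer Tm (Wallace rule) and the annealing window of Note 5.
# The Wallace rule is an approximation chosen for this example only.

def wallace_tm(primer):
    """Approximate Tm in degrees C: 2 per A/T plus 4 per G/C."""
    s = primer.upper()
    at = s.count("A") + s.count("T")
    gc = s.count("G") + s.count("C")
    if at + gc != len(s):
        raise ValueError("primer contains non-ACGT characters")
    return 2 * at + 4 * gc


def annealing_window(tm):
    """Annealing temperatures 5-10 degrees C below Tm, as (low, high)."""
    return tm - 10, tm - 5


if __name__ == "__main__":
    primer = "ATGCATGCATGCATGCATGC"  # hypothetical 20-mer, 50% GC
    tm = wallace_tm(primer)
    print(tm, annealing_window(tm))
```

For a 20-nucleotide primer with 50% GC content this estimate gives a Tm of 60°C and an annealing window of 50–55°C, which agrees with the annealing temperatures used in Subheading 3.2.3.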

References
1. Okada, N. (1991) SINEs. Curr. Opin. Genet. Dev. 1, 498–504.
2. Okada, N. (1991) SINEs: short interspersed repeated elements of the eukaryotic genome. Trends Ecol. Evol. 6, 358–361.
3. Hutchison, C. A., Hardies, S. C., Loeb, D. D., Shehee, W. R., and Edgell, M. H. (1989) LINES and related retroposons: Long interspersed sequences in the eucaryotic genome. In: Mobile DNA (Berg, D. E. and Howe, M. M. eds), pp. 593–617, ASM, Washington, DC.
4. Lander, E. S., Linton, L. M., Birren, B., et al. (2001) Initial sequencing and analysis of the human genome. Nature 409, 860–921.
5. Waterston, R. H., Lindblad-Toh, K., Birney, E., et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562.
6. Ohshima, K. and Okada, N. (2005) SINEs and LINEs: symbionts of eukaryotic genomes with a common tail. Cytogenet. Genome Res. 110, 475–490.
7. Kramerov, D. A. and Vassetzky, N. S. (2005) Short retroposons in eukaryotic genomes. Int. Rev. Cytol. 247, 165–221.
8. Rogers, J. H. (1985) The origin and evolution of retroposons. Int. Rev. Cytol. 93, 187–279.
9. Weiner, A. M., Deininger, P. L., and Efstratiadis, A. (1986) Nonviral retroposons: genes, pseudogenes, and transposable elements generated by the reverse flow of genetic information. Annu. Rev. Biochem. 55, 631–661.
10. Rokas, A. and Holland, P. W. (2000) Rare genomic changes as a tool for phylogenetics. Trends Ecol. Evol. 15, 454–459.
11. Shedlock, A. M. and Okada, N. (2000) SINE insertions: powerful tools for molecular systematics. Bioessays 22, 148–160.
12. Robertson, H. M. (2002) Evolution of DNA transposons in eukaryotes. In: Mobile DNA II (Craig, N. L., Craigie, R., Gellert, M., and Lambowitz, A. M. eds), pp. 1093–1110, ASM, Washington, DC.
13. Kajikawa, M. and Okada, N. (2002) LINEs mobilize SINEs in the eel through a shared 3′ sequence. Cell 111, 433–444.
14. Dewannieux, M., Esnault, C., and Heidmann, T. (2003) LINE-mediated retrotransposition of marked Alu sequences. Nat. Genet. 35, 41–48.
15. Ohshima, K., Hamada, M., Terai, Y., and Okada, N. (1996) The 3′ ends of tRNA-derived short interspersed repetitive elements are derived from the 3′ ends of long interspersed repetitive elements. Mol. Cell. Biol. 16, 3756–3764.
16. Okada, N., Hamada, M., Ogiwara, I., and Ohshima, K. (1997) SINEs and LINEs share common 3′ sequences: a review. Gene 205, 229–243.
17. Ullu, E. and Tschudi, C. (1984) Alu sequences are processed 7SL RNA genes. Nature 312, 171–172.


18. Nishihara, H., Terai, Y., and Okada, N. (2002) Characterization of novel Alu- and tRNA-related SINEs from the tree shrew and evolutionary implications of their origins. Mol. Biol. Evol. 19, 1964–1972.
19. Kapitonov, V. V. and Jurka, J. (2003) A novel class of SINE elements derived from 5S rRNA. Mol. Biol. Evol. 20, 694–702.
20. Nishihara, H., Smit, A. F., and Okada, N. (2006) Functional noncoding sequences derived from SINEs in the mammalian genome. Genome Res. 16, 864–874.
21. Nikaido, M., Nishihara, H., Fukumoto, Y., and Okada, N. (2003) Ancient SINEs from African endemic mammals. Mol. Biol. Evol. 20, 522–527.
22. Nikaido, M., Matsuno, F., Abe, H., et al. (2001) Evolution of CHR-2 SINEs in cetartiodactyl genomes: possible evidence for the monophyletic origin of toothed whales. Mamm. Genome 12, 909–915.
23. Shimamura, M., Yasue, H., Ohshima, K., et al. (1997) Molecular evidence from retroposons that whales form a clade within even-toed ungulates. Nature 388, 666–670.
24. Nikaido, M., Rooney, A. P., and Okada, N. (1999) Phylogenetic relationships among cetartiodactyls based on insertions of short and long interpersed elements: hippopotamuses are the closest extant relatives of whales. Proc. Natl Acad. Sci. USA 96, 10,261–10,266.
25. Nikaido, M., Matsuno, F., Hamilton, H., et al. (2001) Retroposon analysis of major cetacean lineages: the monophyly of toothed whales and the paraphyly of river dolphins. Proc. Natl Acad. Sci. USA 98, 7384–7389.
26. Nikaido, M., Hamilton, H., Makino, H., et al. (2006) Baleen whale phylogeny and a past extensive radiation event revealed by SINE insertion analysis. Mol. Biol. Evol. 23, 866–873.
27. Schmitz, J., Ohme, M., and Zischler, H. (2001) SINE insertions in cladistic analyses and the phylogenetic affiliations of Tarsius bancanus to other primates. Genetics 157, 777–784.
28. Salem, A. H., Ray, D. A., Xing, J., et al. (2003) Alu elements and hominid phylogenetics. Proc. Natl Acad. Sci. USA 100, 12,787–12,791.
29. Roos, C., Schmitz, J., and Zischler, H. (2004) Primate jumping genes elucidate strepsirrhine phylogeny. Proc. Natl Acad. Sci. USA 101, 10,650–10,654.
30. Nishihara, H., Satta, Y., Nikaido, M., Thewissen, J. G., Stanhope, M. J., and Okada, N. (2005) A retroposon analysis of Afrotherian phylogeny. Mol. Biol. Evol. 22, 1823–1833.
31. Kriegs, J. O., Churakov, G., Kiefmann, M., Jordan, U., Brosius, J., and Schmitz, J. (2006) Retroposed elements as archives for the evolutionary history of placental mammals. PLoS Biol. 4, e91.
32. Nishihara, H., Hasegawa, M., and Okada, N. (2006) Pegasoferae, an unexpected mammalian clade revealed by tracking ancient retroposon insertions. Proc. Natl Acad. Sci. USA 103, 9929–9934.
33. Watanabe, M., Nikaido, M., Tsuda, T. T., et al. (2006) The rise and fall of the CR1 subfamily in the lineage leading to penguins. Gene 365, 57–66.
34. Sasaki, T., Takahashi, K., Nikaido, M., Miura, S., Yasukawa, Y., and Okada, N. (2004) First application of the SINE (short interspersed repetitive element) method

to infer phylogenetic relationships in reptiles: an example from the turtle superfamily Testudinoidea. Mol. Biol. Evol. 21, 705–715.
35. Piskurek, O., Austin, C. C., and Okada, N. (2006) Sauria SINEs: novel short interspersed retroposable elements that are widespread in reptile genomes. J. Mol. Evol. 62, 630–644.
36. Sasaki, T., Yasukawa, Y., Takahashi, K., Miura, S., Shedlock, A. M., and Okada, N. (2006) Extensive morphological convergence and rapid radiation in the evolutionary history of the family Geoemydidae (Old World pond turtles) revealed by SINE insertion analysis. Syst. Biol. in press.
37. Murata, S., Takasaki, N., Saitoh, M., and Okada, N. (1993) Determination of the phylogenetic relationships among Pacific salmonids by using short interspersed elements (SINEs) as temporal landmarks of evolution. Proc. Natl Acad. Sci. USA 90, 6995–6999.
38. Takahashi, K., Terai, Y., Nishida, M., and Okada, N. (1998) A novel family of short interspersed repetitive elements (SINEs) from cichlids: the patterns of insertion of SINEs at orthologous loci support the proposed monophyly of four major groups of cichlid fishes in Lake Tanganyika. Mol. Biol. Evol. 15, 391–407.
39. Takahashi, K., Terai, Y., Nishida, M., and Okada, N. (2001) Phylogenetic relationships and ancient incomplete lineage sorting among cichlid fishes in Lake Tanganyika as revealed by analysis of the insertion of retroposons. Mol. Biol. Evol. 18, 2057–2066.
40. Schmid, C. and Maraia, R. (1992) Transcriptional regulation and transpositional selection of active SINE sequences. Curr. Opin. Genet. Dev. 2, 874–882.
41. Springer, M. S., Murphy, W. J., Eizirik, E., and O’Brien, S. J. (2003) Placental mammal diversification and the Cretaceous-Tertiary boundary. Proc. Natl Acad. Sci. USA 100, 1056–1061.
42. Schwartz, S., Kent, W. J., Smit, A., et al. (2003) Human–mouse alignments with BLASTZ. Genome Res. 13, 103–107.
43. Blanchette, M., Kent, W. J., Riemer, C., et al. (2004) Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14, 708–715.
44. Kumar, S., Tamura, K., and Nei, M. (2004) MEGA3: integrated software for molecular evolutionary genetics analysis and sequence alignment. Brief Bioinform. 5, 150–163.
45. Thompson, J. D., Gibson, T. J., Plewniak, F., Jeanmougin, F., and Higgins, D. G. (1997) The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25, 4876–4882.
46. Endoh, H. and Okada, N. (1986) Total DNA transcription in vitro: a procedure to detect highly repetitive and transcribable sequences with tRNA-like structures. Proc. Natl Acad. Sci. USA 83, 251–255.
47. Okada, N., Shedlock, A. M., and Nikaido, M. (2004) Retroposon mapping in molecular systematics. Methods Mol. Biol. 260, 189–226.
48. Borodulina, O. R. and Kramerov, D. A. (1999) Wide distribution of short interspersed elements among eukaryotic genomes. FEBS Lett. 457, 409–413.
49. Gauss, D. H., Gruter, F., and Sprinzl, M. (1979) Compilation of tRNA sequences. Nucleic Acids Res. 6, r1–r19.


50. Deininger, P. L., Moran, J. V., Batzer, M. A., and Kazazian, H. H. Jr. (2003) Mobile elements and mammalian genome evolution. Curr. Opin. Genet. Dev. 13, 651–658.
51. Nikaido, M., Piskurek, O., and Okada, N. (2006) Toothed whale monophyly reassessed by SINE insertion analysis: the absence of lineage sorting effects suggests a small population of a common ancestral species. Mol. Phylogenet. Evol. in press.
52. Karolchik, D., Baertsch, R., Diekhans, M., et al. (2003) The UCSC genome browser database. Nucleic Acids Res. 31, 51–54.
53. Churakov, G., Smit, A. F., Brosius, J., and Schmitz, J. (2005) A novel abundant family of retroposed elements (DAS-SINEs) in the nine-banded armadillo (Dasypus novemcinctus). Mol. Biol. Evol. 22, 886–893.

14 LINE-1 Elements: Analysis by Fluorescence In-Situ Hybridization and Nucleotide Sequences
Paul D. Waters, Gauthier Dobigny, Peter J. Waddell, and Terence J. Robinson

Summary
Long-interspersed nuclear element-1 (LINE-1) is a non-terminal repeat transposon that constitutes a major component of the mammalian genome. LINE-1 has a dynamic evolutionary history characterized by the rise, fall, and replacement of subfamilies. The distribution of LINE-1 elements can be viewed from a chromosomal perspective using fluorescence in-situ hybridization (FISH), as well as at the sequence level. We have designed LINE-1 primers from regions conserved among mouse, rat, rabbit, and human L1, which were able to amplify part of ORF2 from all eutherian (placental) mammals tested thus far. The product generated can be used as a FISH painting probe to examine the genomic distribution of L1 in different species. It can also be cloned and sequenced for phylogenetic analysis. Although FISH patterns resulting from LINE-1 chromosome painting and bioinformatic analyses have shown that this element accumulates in AT-rich regions of the genomes of mouse and human, our PCR-amplified LINE-1 probe suggests that this is not a universal phenomenon, and that the patterns displayed in laurasiatherian, afrotherian, and xenarthran species are less prominent. The “banding”-like distribution of LINE-1 observed in human and mouse, therefore, appears to reflect aspects of genome architecture unique to Euarchontoglires (Supraprimates), the superordinal clade to which they belong. By sequencing the cloned amplicons used for FISH experiments and supplementing these with L1 sequences obtained from public databases, analysis by parsimony, distance-based, maximum likelihood, and “hierarchical Bayesian” or “marginal likelihood” methods provides a powerful adjunct to the FISH data. Using this approach, relatively intact LINE-1 from most placental orders tend to reflect accepted eutherian evolutionary relationships. This suggests that there were often only closely related copies active near branch points in the tree, that inactive copies tended to become extinct quite readily, and that for many orders recently active copies belong to a single lineage of this LINE.

Key Words: LINE-1 phylogeny; Afrotheria; Xenarthra; mobile elements.

From: Methods in Molecular Biology: Phylogenomics. Edited by: W. J. Murphy © Humana Press Inc., Totowa, NJ


Waters et al.

1. Introduction
Long-interspersed nuclear element-1 (LINE-1) elements are mammalian non-long-terminal repeat (LTR) retrotransposons that transpose through an RNA intermediate (reviewed in ref. 1). They constitute a major component of mammalian genomes and account for ~20% of human sequence. Most LINE-1 elements are 5′ truncated upon transposition and are therefore rendered inactive. In fact, only an estimated 60 LINE-1 copies are currently active in the human genome (2). LINE-1 insertions have been shown to be responsible for many genetic disorders caused by gene disruption, nucleotide deletions, duplications, and chromosomal instability, many via heterologous recombination (see ref. 3 and references therein). However, they are also known to be involved in important genomic functions. These include X-inactivation in females (4,5), the regulation of gene expression (6,7), and possibly the provision of novel transcription start sites. Importantly, LINE-1 provide the reverse transcriptase (RTase) necessary for transposition of Alu sequences (8) and may also have a role in the generation of processed pseudogenes (9). Thus, overall, they are a major driver of genomic evolution.

Most of the current data concerning LINE-1 biology and evolution result from investigations of the human and mouse genomes and are often assumed to hold for all living placental mammals (1). Previous studies have investigated the paleohistory of the LINE-1 family based on the “genomic fossil record of pseudogenes retroposed at different times from active source genes” (10), but, importantly, this has typically involved only the human and mouse genomes. It is now widely accepted that rodents and primates fall within the same supraordinal clade, comprised of Euarchonta plus Glires. This has been concatenated to Euarchontoglires, also referred to as Supraprimates (11).

The analysis of LINE-1 from a wide selection of mammalian species (these methods could also be applied to other transposable elements from nonmammalian vertebrates) provides new and interesting insights into the genomic distribution and phylogenetic relationships of these elements. FISH analyses conducted on a wide range of xenarthran and afrotherian species show that the strong accumulation of LINE-1 in AT-rich regions of the genome is restricted to Supraprimates (12). Phylogenetic reconstruction of sequences from the cloned amplicons used for FISH experiments and comparable L1 data from a wide variety of eutherian species available in public databases reveals that the relationship of L1 from different species tends to reflect the accepted species tree (11). Thus, in most species only a single lineage-specific L1 appears to have been active. However, this is by no means universal, and there are multiple active lineages present in some groups (13).

Analysis of LINE-1 Elements


2. Materials
2.1. Cell Culture and DNA Extraction (see Note 2)
1. Medium: D-MEM (Gibco BRL) supplemented with 15% fetal bovine serum (FBS, Delta Bioproducts), penicillin (100 units/mL), and streptomycin (100 µg/mL). Store at 4°C.
2. Colcemid solution: 10% colcemid (Sigma Karyomax®, Sigma, St. Louis, MO), 1X PBS.
3. 10X PBS: 1.37 M NaCl, 27 mM KCl, 80 mM Na2HPO4, 18 mM KH2PO4, 100 mM dextrose, and 1% phenol red.
4. 10X trypsin solution: 1.37 M NaCl, 27 mM KCl, 80 mM Na2HPO4, 18 mM KH2PO4, 100 mM dextrose, 1% phenol red, 53.7 mM EDTA, and 5% trypsin (Gibco BRL). Store at 4°C.
5. Carnoy fixative (3 : 1, methanol : acetic acid), made fresh for each use.
6. DNA extraction buffer: 10 mM Tris-HCl (pH 8.5), 5 mM EDTA, 2% SDS, 0.2 M NaCl, 0.1 mg/mL proteinase K. Store at −20°C.
7. RNase A, 10 mg/mL (Roche).

2.2. PCR
1. Super-therm DNA polymerase (Southern Cross Biotechnology, RSA).
2. Primers (12).

2.3. Cloning and Sequencing
1. Agarose (Pronadisa, CONDA Laboratories).
2. QIAquick Gel Extraction Kit (Qiagen).
3. pGEM-T Easy Vector System (Promega).
4. EcoRI (Roche).

2.4. GTG- and CBG-Banding
1. Trypsin solution: 0.0025% trypsin (Gibco BRL), 1X PBS (pH 7). Make fresh before use.
2. KH2PO4 (25 mM, pH 7), 2% FBS.
3. KH2PO4 (25 mM, pH 7).
4. Giemsa solution: 5% Giemsa (Sigma-Aldrich), 25 mM KH2PO4. Make fresh before use.
5. 0.2 M HCl.
6. 5% Ba(OH)2.
7. 20X SSC: 3 M NaCl, 0.3 M C6H5Na3O7 (pH 7).

2.5. Fluorescent In-Situ Hybridization
1. Pepsin solution: 1% pepsin (Sigma-Aldrich), 10 mM HCl. Make fresh before use.
2. Denaturation solution: 2X SSC, 70% formamide. Can be stored at 4°C and used several times.

3. Hybridization buffer: 10% dextran sulphate, 65% formamide, 2X SSC. Store at −20°C.
4. Wash solution: 2X SSC.
5. 1X PBD: 1% Tween-20, 4X SSC.
6. Biotin-16-dUTP (Roche).
7. Fluorolink™ Cy™3-labeled streptavidin (Amersham Pharmacia Biotech).
8. Ethanol baths (70, 80, 90, and 100%).
9. Methanol, 100%.
10. Vectashield mounting medium containing DAPI (4′,6-diamidino-2-phenylindole, Vector Laboratories).

3. Methods
3.1. Preparation of Samples for FISH
Protocols for the establishment of fibroblast cell lines and standard cytogenetics are described in ref. 14, and readers are referred to this text for further details.
1. Obtain chromosome spreads for use in the FISH analysis from dividing cells that have been blocked in mitotic metaphase by adding 125 µL of colcemid (a spindle inhibitor) solution to a T75 tissue culture flask. After adding colcemid, return the flask to the incubator (37°C, 5% CO2) for 2–3 h (see Note 3).
2. Decant the medium from the flask into a 15-mL Falcon tube. Rinse the flask (with fibroblast cells attached) with 1X PBS, then add approx. 1 mL of 1X trypsin solution to the flask and incubate at 37°C until the cells detach from the surface. Collect the loosened cells with a pipette and add them to the original medium.
3. Centrifuge the cells for 6 min at 1400 relative centrifugal force (rcf), decant the supernatant (or remove it using a pipette), being careful not to disturb the pellet, then flick the tube to resuspend the cells in the remaining liquid. Add 10 mL of 0.075 M KCl (prewarmed to 37°C) to the cell suspension, replace the cap, and invert the tube several times to mix. Incubate at 37°C for 10–20 min (see Note 4). At this point add 1 mL of Carnoy fixative to stop the hypotonic process and pellet the cells (6 min at 1400 rcf). Discard the supernatant. Resuspend the pellet as above and add 10 mL of fixative. Ensure that the cells are mixed in the fixative and centrifuge as above. Repeat twice. After the final spin add sufficient fixative to ensure a reasonable cell density when making slide preparations (see Note 5 and ref. 15 for detailed troubleshooting on making good quality chromosome preparations).
4. Standard protocols are used to extract DNA from organs preserved in alcohol or unfixed cell pellets obtained as above (16). When extracting DNA from cell pellets, two confluent T75 flasks are generally used per extraction.
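The centrifugation steps above are given in rcf, whereas many benchtop centrifuges are set in rpm. The conversion can be sketched with the standard relation rcf = 1.118 × 10^-5 × r × rpm^2, where r is the rotor radius in centimeters. The 10-cm radius used below is a hypothetical value for illustration; take the actual radius from the rotor documentation.

```python
# Sketch: converting between rcf (x g) and rpm for a given rotor radius.
# The 10-cm radius in the example is an assumed value, not from the protocol.
import math

RCF_CONSTANT = 1.118e-5  # rcf = RCF_CONSTANT * radius_cm * rpm**2


def rcf_from_rpm(rpm, radius_cm):
    """Relative centrifugal force produced at a given speed and radius."""
    return RCF_CONSTANT * radius_cm * rpm ** 2


def rpm_from_rcf(rcf, radius_cm):
    """Rotor speed needed to reach a target rcf at a given radius."""
    return math.sqrt(rcf / (RCF_CONSTANT * radius_cm))


if __name__ == "__main__":
    speed = rpm_from_rcf(1400, 10.0)  # target from step 3, assumed radius
    print(round(speed))
```

Because the force scales with the square of the speed, a small error in rpm matters more than a small error in radius, so it is worth rounding to the nearest setting the centrifuge actually offers and checking the resulting rcf with `rcf_from_rpm`.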

3.2. PCR
1. Degenerate primers are designed from regions conserved between mouse, rat, rabbit, and human L1 ([12] and see Subheading 2.2. for sequence). These primers amplify ~300 bp of the ORF2 of LINE-1 elements from the genomic DNA of all eutherian mammal species tested. Approximately 240 bp lies within the 7th and 8th subdomains of the reverse transcriptase (RT) domain (as defined in ref. 17, with 60 bp


extending into the region directly 3′ of the RT). The location of a primer outside of the RT gene eliminates any possibility of amplifying RT from TEs other than L1.
2. Perform amplification using 200 ng of template DNA in 50 µL reactions and an Applied Biosystems GeneAmp® PCR System 2700. Incubate one unit of Taq DNA Polymerase together with the template DNA, 500 nM of each primer, and 200 µM each of dATP, dCTP, dTTP, and dGTP in 10 mM Tris-HCl, pH 8.3/1.5 mM MgCl2/50 mM KCl. Standard cycling parameters are 30 cycles of 94°C for 30 s, 52.5°C for 30 s, and 72°C for 30 s, following a 2-min denaturation at 94°C (see Note 6). Visualize products on a 1% agarose gel (Fig. 1).

Fig. 1. PCR with the LINE-1 primers described herein. (A) PCR of aardvark and elephant genomic DNA at an annealing temperature of 52.5°C. (B) PCR of elephant, aardvark, golden mole, bat, and springbok genomic DNA at 53°C. The aardvark and elephant PCRs are sharper at 53°C, although the elephant PCR product is not as good as that of the aardvark. All products were successfully cloned from this PCR. Gel fragments were excised at 300 bp for all lanes, leaving behind any potentially nonspecific products in the elephant, golden mole, and bat lanes.
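The final concentrations in step 2 translate into pipetting volumes through the dilution relation C1V1 = C2V2. The sketch below works this through for a 50 µL reaction. The stock concentrations (10X buffer, 10 µM primers, a dNTP mix at 10 mM each, Taq at 5 U/µL, template at 100 ng/µL) and the function names are assumptions chosen for illustration; substitute the values of your own reagents.

```python
# Sketch: pipetting volumes for the 50 uL LINE-1 PCR of step 2.
# Final amounts follow the protocol; every stock concentration is assumed.

def dilution_volume(final_conc, stock_conc, final_volume_ul):
    """Volume of stock (uL) giving final_conc in final_volume_ul (C1V1 = C2V2)."""
    return final_conc * final_volume_ul / stock_conc


def pcr_mix(final_volume_ul=50.0):
    """Return a dict of component volumes in uL, including water to volume."""
    mix = {
        "10X buffer": dilution_volume(1, 10, final_volume_ul),            # to 1X
        "forward primer": dilution_volume(0.5, 10, final_volume_ul),      # 500 nM from 10 uM
        "reverse primer": dilution_volume(0.5, 10, final_volume_ul),      # 500 nM from 10 uM
        "dNTP mix": dilution_volume(200, 10000, final_volume_ul),         # 200 uM each from 10 mM each
        "Taq polymerase": 1 / 5,                                          # 1 U from 5 U/uL
        "template DNA": 200 / 100,                                        # 200 ng from 100 ng/uL
    }
    mix["water"] = final_volume_ul - sum(mix.values())
    return mix


if __name__ == "__main__":
    for component, volume in pcr_mix().items():
        print(f"{component}: {volume:.1f} uL")
```

The same helper can be reused for the labeling reaction in Subheading 3.5., where the reaction volume and the dTTP concentration are reduced; only the final concentrations passed to `dilution_volume` change.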

3.3. Cloning and Sequencing
1. Excise bands from the agarose gel, purify and clone into pGEM®-T Easy following the manufacturer’s instructions. pGEM®-T Easy is a TA cloning kit specifically


designed to clone PCR products. It has EcoRI restriction sites on either side of the cloning site for easy excision of inserts.
2. Digest clones with EcoRI to release inserts, then visualize and size these on a 1% agarose gel together with a 100-bp ladder. Sequence clones containing inserts of the correct size (~300 bp; see Note 7) using the vector primer T7 and BigDye terminator chemistry on an automated sequencer (in this case an AB3100).
3. Establish homology to LINE-1 (see Note 8) by searches using the RepeatMasker program and default settings (http://repeatmasker.genome.washington.edu/).

3.4. GTG- and CBG-Banding
1. GTG-banding is done using trypsin (see ref. 14). Before banding, age (harden) the chromosomes by baking the slide in an incubator or on a slide warmer for at least 3 h at 65°C. Place the slide in 1X trypsin solution at room temperature for 45 s to 2 min (see Note 9). Follow with a wash in KH2PO4/FBS and then transfer to a fresh Coplin jar containing KH2PO4 for 8 min. Stain the slides with Giemsa for 5–10 min.
2. We routinely use 5% prewarmed barium hydroxide solution for C-banding (see ref. 14). Place nonaged slides in 0.2 M HCl for 1 h at room temperature and then rinse in dH2O. Transfer slides to a preheated 5% Ba(OH)2 solution (50°C) for 30 s to 2 min (see Note 9); briefly rinse in 0.2 M HCl and then in dH2O, and incubate in 2X SSC at 60°C for 1 h. Stain slides with Giemsa as above for 5–10 min.

3.5. Fluorescent In-Situ Hybridization
1. Synthesize the FISH probe using the same degenerate primer pair and cycling conditions described in Subheading 3.2. The only differences from the PCR protocol described earlier are that 100 µM of Biotin-16-dUTP is added, the concentration of dTTP is reduced from 200 to 100 µM, and the reaction volume is reduced to 25 µL. Use 1 µL of the primary LINE-1 PCR product (Subheading 3.2.), or 1 µL of a single LINE-1 clone (Subheading 3.3.), as template (see Note 10). To make the probe solution, mix 1–2 µL of the biotin-labeled PCR product with 10 µL of hybridization buffer.
2. Perform in-situ hybridization on either unbanded or GTG-banded slides (see ref. 18 for discussion of FISH protocols). In the case of the former, treat chromosomes in 1% pepsin solution for 5 min at 37°C, rinse twice in 2X SSC for 5 min at RT, dehydrate in a 70, 80, 90, and 100% EtOH series, and then age at 65°C for 1 h. Finally, denature in denaturation solution for 45–60 s at 70°C. Quench the slide in cold 70% EtOH and briefly dehydrate in a 70, 80, 90, and 100% EtOH series. Denature the probe at 75°C for 5 min and snap-cool on ice. Pipette 12 µL of the probe solution onto the slide, cover with a glass coverslip, seal the edges with rubber cement, and hybridize overnight at 37°C in a dark humidity chamber. See ref. 4 for FISH on previously G-banded slides.
3. Following hybridization, wash the slide in 2X SSC at 72°C for 2 min. Thereafter briefly rinse the slide in 1X PBD. Do not allow the slide to dry off completely before proceeding with the detection step. To detect the hybridized probe add

Analysis of LINE-1 Elements

233

200 mL of Streptavidin-Cy3 (2 mg/mL, 1X PBD) to the slide, cover with a plastic coverslip and place in an incubator at 37°C for at least 20 min. After detection rinse chromosomes in 1X PBD at RT and mount in Vectashield mounting medium containing DAPI. We capture images using the Genus software (Applied Imaging) (Fig. 2). Routinely examine at least 20 metaphase spreads for each specimen. 4. Destain previously GTG-banded slides in toluene for 2–5 min and follow with successive 10 min washes at RT in 100% ethanol and 100% methanol. Following this, fix slides at RT in 0.5% formaldehyde PBS containing 50 mM MgCl2 for 10 min, and then rinse in 2X SSC for 5 min at RT. Repeat the 2X SSC wash but this time for 30–60 min at 65°C. Denature the chromosomes and hybridize as described above (the denaturation time can be reduced to 15 s).
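The nucleotide adjustments in step 1 amount to a C1·V1 = C2·V2 calculation. The following is a minimal sketch in Python, reading the concentrations in step 1 as micromolar and the reaction volume as 25 µL. The stock concentrations, and the assumption that dATP, dCTP, and dGTP remain at 200 µM, are illustrative only and should be replaced with the values for the reagents in hand.

```python
def stock_volume_ul(final_conc_um, reaction_volume_ul, stock_conc_um):
    """Volume of stock (µL) that gives the final concentration (C1*V1 = C2*V2)."""
    if stock_conc_um <= 0:
        raise ValueError("stock concentration must be positive")
    return final_conc_um * reaction_volume_ul / stock_conc_um


REACTION_UL = 25.0  # labeling reaction volume from step 1

# Final concentrations in µM. dTTP and biotin-16-dUTP are from step 1;
# keeping the other three dNTPs at 200 µM is an assumption.
FINAL_UM = {
    "dATP": 200.0,
    "dCTP": 200.0,
    "dGTP": 200.0,
    "dTTP": 100.0,
    "biotin-16-dUTP": 100.0,
}

# Hypothetical stock concentrations in µM (10 mM dNTPs, 1 mM biotin-16-dUTP).
STOCK_UM = {
    "dATP": 10_000.0,
    "dCTP": 10_000.0,
    "dGTP": 10_000.0,
    "dTTP": 10_000.0,
    "biotin-16-dUTP": 1_000.0,
}

volumes = {
    name: stock_volume_ul(FINAL_UM[name], REACTION_UL, STOCK_UM[name])
    for name in FINAL_UM
}

for name, vol in volumes.items():
    print(f"{name}: {vol:.2f} µL")
```

With these assumed stocks, dTTP and biotin-16-dUTP together supply the same 200 µM of thymidine analog as the unlabeled reaction.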

3.6. Phylogenetic Analysis of Sequence Data
A variety of methods can be used to reconstruct evolutionary trees from LINE-1 sequence data. We routinely employ parsimony, distance-based trees, and maximum likelihood (using PAUP*; 19), plus a “hierarchical Bayesian” analysis using (MC)3 chains (MrBayes 3.0; 20), which yielded good results in this case (see Note 11). General transition matrices, such as the HKY (21) or general time reversible (GTR) models, with site rate variability following an invariant sites (pinv) plus gamma (Γ) distribution (22–24), may be used for model-based methods. Phylogenetic reconstruction of sequences as old as the eutherian orders (~100 million years) is not trivial since the assumptions of all models and methods are violated. This often leads to incorrect trees, irrespective of support measures such as the bootstrap (11,22). Therefore, consideration and comparison of trees obtained via different methods and/or with inclusion/exclusion of sequences is important. Another way of assessing possible errors and the robustness of results is to consider trees reconstructed from transversion-only changes (13,22). Finally, analyses should not be treated as a “black box”; the better you know your data, the results of varied analyses, and the characteristics of different methods, the more likely you are to infer the phylogeny correctly.

4. Notes
1. LINE-1 has been classified into different subfamilies that were active at different times during mammalian evolution. For example, L1M4 is an ancient lineage of L1 that consists of up to seven subfamilies and was active before the eutherian radiation (10). L1PA2 is a primate-specific subfamily which, together with L1PA3, makes up the L1P1 lineage (10).
2. Sterilize all solutions for use in cell culture. PBS can be autoclaved, whereas the trypsin solution must be filter sterilized. We do not routinely add antibiotics to growth medium as a preventative measure, but rather use them only when contamination occurs, for fear of developing resistant strains of the contaminating organism.


Fig. 2. (A) Raw Cy3 image of LINE-1 probe hybridized to pika metaphase chromosomes (Order: Lagomorpha). (B) DAPI banding merged with the Cy3 image of the same cell. Poor chromosome spreading and heavy cytoplasm covering the chromosomes result in weak FISH signals and often in increased nonspecific background signal (see Note 5). (C) Unlike the banding pattern observed in Supraprimates (see refs. 25,26 for pictures of LINE-1 probes hybridized to human and mouse metaphase chromosomes), the distribution of LINE-1 in Laurasiatheria (represented here by rhinoceros) and (D) Afrotheria (represented here by elephant shrew) is diffuse except for the X chromosomes and the centromere of chromosome 6 (white arrows in [D]). Gray arrows indicate the large blocks of heterochromatin that characterize the elephant shrew genome, which are free of LINE-1 hybridization.

In the same way, add antifungal agents (e.g., Fungizone, Squibb; 100 µg/mL) to the medium only when fungal contamination is apparent.
3. Exposure to colcemid ranges from 1 to 4 h depending on cell growth. A large number of rounded cells in the flask when examined under phase contrast microscopy indicates that the harvest will have a high mitotic index (percentage of dividing cells). Add colcemid at the time during the cell cycle when the maximum number of dividing cells is observed. Although the colcemid solution does not have to be sterile given the short-term exposure, sterility is recommended since it allows for the retention and subsequent passaging of the cells if the mitotic index is low. Should this be the case, rinse the flask twice with sterile PBS or medium, add fresh tissue culture medium supplemented with FBS, and return the flask to the CO2 incubator with a loosened cap as for routine tissue culture.
4. The time in hypotonic solution is often species-specific and may need to be adjusted. Increase the time to improve spreading of chromosomes. Some labs use RT hypotonic, others standardize at 37°C.
5. Poor chromosome spreading and heavy cytoplasm covering the chromosomes are common obstacles to producing good-quality cytogenetic preparations (see Fig. 2A, B for an example of the effects of excess cytoplasm). Several troubleshooting protocols have been developed (15) that entail steam and heat treatment during the slide-making process. In addition, a commercial product (Cytoclear Cat GGS-JL004, Rainbow Scientific, Inc.) is reported to reduce persistent cytoplasm.
6. PCR using LINE-1 primers. In some instances PCR will result in a smeared band (Fig. 1A). In these instances increase the annealing temperature to obtain a sharper band (Fig. 1B). Decreasing the MgCl2 concentration from 1.5 to 1 mM may also help reduce the smear. If a single sharp band cannot be obtained, excise the region of the smear containing fragments of approx. 300 bp for cloning (Fig. 1B).
7. Inserts ranged from 280 to 310 bp because of indels in the different LINE-1 copies.
8. Homology of different clones is established by comparisons to a wide variety of LINE-1 lineages in the Repbase database, ranging from the primate-specific L1P1 to the ancient mammalian L1M4.
9. The duration required for the ageing of slides and the time required in the trypsin solution for good G-banding can vary considerably between species. Even the time required for different batches of slides of the same species can vary and may need to be optimized each time in order to obtain the best contrasting bands. This also applies to the time required in Ba(OH)2 when performing C-banding.
10. The differences in distribution of the hybridization signal observed when using the total LINE-1 probe as opposed to LINE-1 paints made from single clone inserts were insignificant. This is associated with the extensive cross-hybridization of the LINE-1 probe to nonidentical target LINE-1. One would expect that some families of LINE-1 will be significantly diverged from others and show a different distribution from the total LINE-1 distribution. However, we did not detect this in our study.
11. Using flat or uninformative priors (the defaults), these methods may be considered “marginal likelihood” methods, that is, a non-Bayesian type of analysis. In our analyses (13), the Γ distribution was approximated with four discrete rate classes of equal size. Five chains (four hot, one cold) plus a random starting tree were used for each (MC)3 run. Chains are routinely run to 4 million steps with trees sampled every 50 steps. Plateaus (supposed convergences) in likelihood analyses tend to appear by ~400,000 steps; however, at least one chain had not converged by 11 million steps. Each model should be run at least twice and checked for conformity between runs. To further test deeper portions of the tree, including possible misrooting, transversion-only invariant-sites plus gamma models can be used (13,24). These do not significantly change the structure of our tree. Posterior probability values for edges should always be interpreted with caution as they may be especially dependent on the assumed model (11), and consequently tend to give too much support to some edges.
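The sampling scheme in Note 11 fixes how many trees each run yields and how many remain once the pre-plateau portion is discarded. The sketch below is bookkeeping only, written in Python. It assumes that the first ~400,000 steps are treated as burn-in, which is an illustrative choice and not a recommendation from the text, and it counts only trees sampled after step 0.

```python
def retained_samples(chain_length, sample_every, burn_in_steps):
    """Return (total sampled trees, trees kept after discarding the burn-in)."""
    if sample_every <= 0:
        raise ValueError("sampling interval must be positive")
    if not 0 <= burn_in_steps <= chain_length:
        raise ValueError("burn-in must lie within the chain length")
    total = chain_length // sample_every
    discarded = burn_in_steps // sample_every
    return total, total - discarded


# Values from Note 11: 4 million steps, one tree sampled every 50 steps.
# The burn-in of 400,000 steps is an assumption based on where plateaus
# in likelihood tend to appear.
total, kept = retained_samples(4_000_000, 50, 400_000)
print(f"sampled trees: {total}, kept after burn-in: {kept}")
```

Because a plateau in likelihood is only a supposed convergence, the burn-in point should be set per run after comparing replicate runs, as the note advises.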

Acknowledgments Financial support from the University of Stellenbosch and the National Research Foundation to T.J.R. is gratefully acknowledged. P.J.W. thanks the Laboratory of Biometry and Bioinformatics for computational resources and acknowledges NIH grant 5R01LM8626. References 1. Ostertag, E. M. and Kazazian, H. H. Jr. (2001) Biology of mammalian L1 retrotransposons. Annu. Rev. Genet. 35, 501–538. 2. Lander, E. S., Linton, L. M., Birren, B., et al. (2001) Initial sequencing and analysis of the human genome. Nature 409, 860–921. 3. Kazazian, H. H. Jr. and Goodier, J. L. (2002) LINE drive: retrotransposition and genome instability. Cell 110, 277–280. 4. Lyon, M. F. (1998) X-chromosome inactivation: a repeat hypothesis. Cytogenet. Cell Genet. 80, 133–137. 5. Hansen, R. S. (2003) X inactivation-specific methylation of LINE-1 elements by DNMT3B: implications for the Lyon repeat hypothesis. Hum. Mol. Genet. 12, 2559–2567. 6. Yang, Z., Boffelli, D., Boonmark, N., Schwartz, K., and Lawn, R. (1998) Apolipoprotein(a) gene enhancer resides within a LINE element. J. Biol. Chem. 273, 891–897. 7. Han, J. S., Szak, S. T., and Boeke, J. D. (2004) Transcriptional disruption by the L1 retrotransposon and implications for mammalian transcriptomes. Nature 429, 268–274. 8. Dewannieux, M., Esnault, C., and Heidmann, T. (2003) LINE-mediated retrotransposition of marked Alu sequences. Nat. Genet. 35, 41–48. 9. Ohshima, K., Hattori, M., Yada, T., Gojobori, T., Sakaki, Y., and Okada, N. (2003) Whole-genome screening indicates a possible burst of formation of processed pseudogenes and Alu repeats by particular L1 subfamilies in ancestral primates. Genome Biol. 4, R74. 10. Smit, A. F., Toth, G., Riggs, A. D., and Jurka, J. (1995) Ancestral, mammalian-wide subfamilies of LINE-1 repetitive sequences. J. Mol. Biol. 264, 401–417.


11. Waddell, P. J., Kishino, H., and Ota, R. (2001) A phylogenetic foundation for comparative mammalian genomics. Genome Inform. Ser Workshop Genome Inform. 12, 141–154. 12. Waters, P. D., Dobigny, G., Pardini, A. T., and Robinson, T. J. (2004) LINE-1 distribution in Afrotheria and Xenarthra: implications for understanding the evolution of LINE-1 in eutherian genomes. Chromosoma 113, 137–144. 13. Waters, P. D., Dobigny, G., Waddell, P. J., and Robinson, T. J. (2007). Evolutionary history of Line-1 in the Major Clades of Placental Mammals. PLoS ONE 2: e158. doi:10.1371/journal.pone.0000158. 14. Schwarzacher, H. G. and Wolff, U. (1974) Methods in Human Cytogenetics, p. 295, Springer, Berlin. 15. Henegariu, O., Heerema, N. A., Lowe Wright, L., et al. (2001) Improvements in cytogenetic slide preparation: controlled chromosome spreading, chemical aging and gradual denaturing. Cytometry 43, 101–109. 16. Sambrook, J., Fritsch, E. F., and Maniatis, T. (1989) Molecular Cloning: A Laboratory Manual, 2nd Ed., Cold Spring Harbor Laboratory, New York. 17. Malik, H. S., Burke, W. D., and Eickbush, T. H. (1999) The age and evolution of non-LTR retrotransposable elements. Mol. Biol. Evol. 16, 793–805. 18. Yang, F., Carter, N. P., Shi, L., and Ferguson-Smith, M. A. (1995) A comparative study of karyotypes of muntjacs by chromosome painting. Chromosoma 103, 642–652. 19. Swofford, D. L. (2000) PAUP: Phylogenetic Analysis Using Parsimony (and Other Methods), Sinauer Associates, Sunderland, MA. 20. Ronquist, F. and Huelsenbeck, J. P. (2003) MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19, 1572–1574. 21. Hasegawa, M., Kishino, K., and Yano, T. (1985) Dating the human–ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22, 160–174. 22. Swofford, D. L., Olsen, G. J., Waddell, P. J., and Hillis, D. M. (1996) Phylogenetic inference. In: Molecular Systematics (Hillis, D. M., Moritz, C., and Mable, B. K., eds), pp. 
407–514, Sinauer, Sunderland, MA. 23. Waddell, P. J. and Penny, D. (1996) Evolutionary trees of apes and humans from DNA sequences. In: Handbook of Symbolic Evolution (Lock, A. J. and Peters, C. R., eds), pp. 53–73, Clarendon, Oxford. 24. Waddell, P. J. and Steel, M. A. (1997) General time-reversible distances with unequal rates across sites: mixing gamma and inverse Gaussian distributions with invariant sites. Mol. Phylogenet. Evol. 8, 398–414. 25. Korenberg, J. R. and Rykowski, M. C. (1988) Human genome organization: Alu, lines, and the molecular structure of metaphase chromosome bands. Cell 53, 391–400. 26. Boyle, A. L., Ballard, S. G., and Ward, D. C. (1990) Differential distribution of long and short interspersed element sequences in the mouse genome: chromosome karyotyping by fluorescence in situ hybridization. Proc. Natl Acad. Sci. USA 87, 7757–7761.

15
Identification of Cryptic Sex Chromosomes and Isolation of X- and Y-Borne Genes
Paul D. Waters, Jennifer A. Marshall Graves, Katherine Thompson, Natasha Sankovic, and Tariq Ezaz

Summary
Comparative molecular cytogenetics provides a powerful tool for deciphering the evolutionary history of vertebrate sex chromosomes. We have adapted cell culture and molecular cytogenetic techniques to study the sex chromosomes of many exotic mammals, birds, and reptiles. Here we describe differential chromosome banding and staining techniques that distinguish sex chromosomes in species with no morphologically distinct XY or ZW chromosome pairs. We describe a method to isolate, identify, and map genomic BAC clones from the Y chromosome, and we also identify strategies for isolating candidate sex chromosome genes.

Key Words: Y chromosome; comparative mapping; BAC library screening; sex chromosome evolution; sex chromosome identification.

From: Methods in Molecular Biology: Phylogenomics. Edited by: W. J. Murphy © Humana Press Inc., Totowa, NJ

1. Introduction
In our laboratory, we have adapted cell culture and molecular cytology techniques to study the genomes of a great range of vertebrate species, including mammals (placentals, marsupials, and monotremes; e.g., ref. 1), birds (chicken, kookaburra, eagle, and emu; e.g., ref. 2), and reptiles (snakes, lizards, alligators, and turtles; e.g., ref. 3). We culture blood or establish fibroblast cultures in order to prepare chromosome spreads, which we subject to a range of staining and banding procedures (e.g., silver staining to detect nucleolar organizers, C-banding to highlight centromeric or sex-specific heterochromatin, GTG or DAPI banding to induce specific banding patterns, and replication banding to distinguish early and late-replicating chromosome regions).

Chromosome microdissection, as well as flow-sorting of chromosomes by our collaborators (M.A. Ferguson-Smith, Cambridge, UK), provides material for comparative chromosome painting between many mammals. To chart the evolution of the mammalian sex chromosomes, we have used this technique to map genome homologies between many marsupial species (e.g., ref. 4) and distantly related bird species (2). We have pushed the limits of this technique to look for homologies between marsupials and placentals (5), and even birds and turtles (6). Although Y chromosome paints are not useful for cross-species chromosome painting (because Y sequences change so rapidly), we have used them to screen BAC libraries and isolate candidate Y-specific genomic clones (7). Dot-blot and fluorescent in-situ hybridization (FISH; Raudsepp and Chowdhary, 2007, Chapter 3 in this issue) analysis confirms Y specificity of clones, which can then be further analyzed.

To identify, localize, and characterize genes in map-poor species such as marsupials (kangaroos, dunnart, possum, and wombat) and monotremes (platypus and echidna), we use sequence from data-rich related species to design short probes (overgos). We then use these overgos to screen BAC (Thomas, 2007, Chapter 8 in this issue) or other genomic libraries to obtain clones with large homologous inserts, which produce an unambiguous signal on chromosomes using FISH. In this way, we are able to construct physical framework maps rapidly (8), making many contributions to the knowledge of nonmodel species (9). We also use other candidate gene approaches to isolate more divergent genes from the sex chromosomes, especially the Y, of other species.

2. Materials
All solutions, unless otherwise noted, can be stored at room temperature (RT).

2.1. Cell Culture (Short-Term Blood Culture and Fibroblast Cell Lines)
1. DMEM: Dulbecco’s modified Eagle’s medium (Multicell Trace Biosciences, Australia) supplemented with 10% fetal bovine serum (SAF Biosciences, Australia). Store at 4°C.
2. Gibco Amniomax-C100 basal medium supplemented with Amniomax C-100 containing gentamycin (Invitrogen, Australia). Store at 4°C.
3. 10X PBS, calcium and magnesium free (pH 7.4): 1.37 M NaCl, 27 mM KCl, 100 mM Na2HPO4, 18 mM KH2PO4, 0.002% w/v phenol red. Autoclave.
4. Trypsin solution: PBS (pH 7.4), 53.7 mM EDTA, 0.1% w/v trypsin (Gibco BRL). Store at 4°C.
5. PHA-M: 1 mg/mL PHA-M (Sigma, Australia) in DMEM. Store at -20°C.
6. Antibiotics: gentamycin sulfate (final concentration 10 mg/mL), chloramphenicol (final concentration 40 µg/mL), kanamycin (final concentration 1 mg/mL), tetracycline (final concentration 100 µg/mL), and Gibco Pen-Strep (Invitrogen): penicillin (final concentration 100 units/mL), streptomycin (final concentration 100 µg/mL). Fungizone: amphotericin B (final concentration 20 µg/mL) (see Note 1).
7. Heparin sodium salt (Sigma).
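The molar composition given for 10X PBS can be converted to weighed masses with mass = molarity × molar mass × volume. The sketch below does this in Python for a 1 L batch. The molar masses are those of the anhydrous salts; using anhydrous forms is an assumption, and the figures must be adjusted if a hydrated salt (e.g., Na2HPO4·7H2O) is used.

```python
# Molar masses (g/mol) of the anhydrous salts; an assumption about the
# reagent forms, which the recipe does not specify.
MOLAR_MASS = {"NaCl": 58.44, "KCl": 74.55, "Na2HPO4": 141.96, "KH2PO4": 136.09}

# Molarities of the 10X PBS recipe (mol/L).
MOLARITY = {"NaCl": 1.37, "KCl": 0.027, "Na2HPO4": 0.100, "KH2PO4": 0.018}


def grams_needed(molarity, molar_mass, volume_l):
    """Mass of salt (g) for the given molarity and final volume."""
    return molarity * molar_mass * volume_l


BATCH_L = 1.0  # hypothetical batch size
recipe = {
    salt: round(grams_needed(MOLARITY[salt], MOLAR_MASS[salt], BATCH_L), 2)
    for salt in MOLARITY
}

for salt, grams in recipe.items():
    print(f"{salt}: {grams} g per {BATCH_L} L")
```

Phenol red at 0.002% w/v corresponds to 0.02 g per liter and is added separately.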

2.2. Chromosome Preparation
1. Colcemid stock solution: 10 µg/mL Colcemid (Roche Applied Science, Australia).
2. Prehypotonic solution (Genial Genetic Solutions, UK).
3. Hypotonic solution: 0.075 M KCl.
4. Fixative: three parts methanol, one part glacial acetic acid. Prepare fresh.

2.3. Differential Chromosome Staining and Banding
Silver staining:
1. Gelatin solution: 2% gelatin (Sigma).
2. Silver nitrate solution: 50% AgNO3 (Sigma).
3. Formic acid (Sigma).
4. DPX mounting solution (Ajax Chemicals, Australia).
GTG-banding:
5. Dulbecco’s PBS–CMF.
6. Trypsin solution: 0.0025% trypsin, 1X PBS (pH 7.4).
7. 0.1 M phosphate buffer (pH 7.0): 2.85 mL 1 M Na2HPO4, 2.15 mL 1 M NaH2PO4, 45 mL of dH2O.
8. Giemsa stain: 5% Giemsa (Sigma) in 0.1 M phosphate buffer.
9. DPX mounting solution (Ajax Chemicals).
C-banding:
10. 1 N HCl.
11. Barium hydroxide: 5% Ba(OH)2 freshly prepared in dH2O.
12. 2X SSC. Prepare 20X SSC: 3 M NaCl, 300 mM sodium citrate, adjust pH to 7 with HCl, autoclave, and dilute as required.
13. Giemsa stain in phosphate buffer.
14. DPX mounting solution (Ajax Chemicals).
Replication banding:
15. Slide with BrdU (Sigma) incorporated metaphase chromosome spreads.
16. Tetra sodium EDTA (Invitrogen).
17. 100% Giemsa stain (Sigma).
18. Vectashield mounting solution (Vector, Burlingame, CA).
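Two of the recipes above can be checked with simple arithmetic: the phosphate buffer volumes should give 0.1 M total phosphate, and the 20X SSC stock is diluted "as required" to working strengths such as 2X. The Python sketch below works through both, assuming that volumes are additive on mixing. The 500 mL batch size is a hypothetical example.

```python
def mixed_molarity(components, total_ml):
    """Final molarity of a solute pooled from (volume_ml, molarity) components."""
    return sum(volume_ml * molarity for volume_ml, molarity in components) / total_ml


def dilution_volumes(stock_x, target_x, final_ml):
    """Return (stock_ml, diluent_ml) needed to dilute a concentrated stock."""
    if target_x > stock_x:
        raise ValueError("target strength cannot exceed the stock strength")
    stock_ml = target_x * final_ml / stock_x
    return stock_ml, final_ml - stock_ml


# Phosphate buffer: 2.85 mL of 1 M Na2HPO4 + 2.15 mL of 1 M NaH2PO4 + 45 mL dH2O.
total_ml = 2.85 + 2.15 + 45.0
phosphate_m = mixed_molarity([(2.85, 1.0), (2.15, 1.0)], total_ml)
print(f"total phosphate: {phosphate_m:.3f} M")

# Diluting 20X SSC to 2X for a hypothetical 500 mL batch.
ssc_ml, water_ml = dilution_volumes(20.0, 2.0, 500.0)
print(f"20X SSC: {ssc_ml:.0f} mL, dH2O: {water_ml:.0f} mL")
```

The same `dilution_volumes` call with a target of 6 gives the volumes for 6X SSC.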

2.4. Isolation and Verification of Y Chromosome BACs
Probe preparation:
1. Flow-sorted (10) or microdissected (11) Y chromosome from the species of interest.
2. 10 µCi/µL [α-32P] dATP. Store at 4°C.
3. MegaPrime DNA Labeling Kit (Amersham BioSciences, Australia).
Hybridizing and washing filters:
4. BAC library filters. After the first hybridization never allow them to dry. Store in 2X SSC at RT.
5. Mesh membranes (Thermo Electron, Australia).
6. Church buffer: 0.5 M Na2HPO4 (pH 7.2), 1 mM EDTA, 7% SDS. If SDS comes out of solution at RT, heat before use until it dissolves again.
7. BSA (MP Biomedicals, Australasia).
8. Salmon sperm DNA (Stratagene).
9. Washing solution 1: 2X SSC, 0.1% SDS. Heat to 65°C before use.
10. Washing solution 2: 1X SSC, 0.1% SDS. Heat to 65°C before use.
BAC DNA preparation:
11. Wizard® Plus SV Minipreps DNA Purification System (Promega, Australia).
12. LB: 1% w/v NaCl, 0.5% w/v yeast extract, 1% w/v tryptone. Autoclave.
13. To make agar plates, add 1.5% w/v of agar to LB and melt in an autoclave or microwave.
Dot-blot analysis:
14. Hybond-N+ roll (Amersham Biosciences).
15. Bio-Dot Microfiltration Apparatus (BioRad).
16. 6X and 2X SSC.
17. 10 and 0.4 M NaOH.
18. 0.5 M EDTA.
19. Whatman 3MM paper.

3. Methods
3.1. Cell Culture (12)
Short-term blood culture:
1. Use whole blood, or buffy coat after centrifugation (5 min at 1000g at RT), to set up short-term blood culture. Collect whole blood with an appropriately sized heparinized needle and syringe from the caudal vein of anesthetized animals, or from heart punctures of euthanized animals.
2. Set up a 2-mL culture (immediately if possible, but blood can wait up to 2 h at 4°C) in DMEM supplemented with 10% FCS, PHA-M (we use 3% for reptiles; 1% is routinely used for mammals), Pen-Strep, and gentamycin. Incubate at the predetermined species-specific temperature in a 5% CO2 incubator for 4–5 d (see Note 2).
Fibroblast cell lines:
3. When collecting in the field, place tissue samples in collection medium, which consists of DMEM/10% FCS supplemented with Pen-Strep, gentamycin, chloramphenicol, kanamycin, tetracycline, and amphotericin B (see Note 1).
4. Establish fibroblast cell lines from the collected tissues (see Note 3): cut the tissue into small pieces, arrange them on the surface of a T25 flask, air dry to allow attachment, and then leave to grow in 5–6 mL of Amniomax at a species-specific temperature (see Note 2) in a 5% CO2 incubator. If cells do not require Amniomax, use DMEM/10% FCS. If transporting the cells in culture, use subconfluent cultures and fill the flask completely with DMEM, 2% FCS at RT.

3.2. Chromosome Preparation (13)
Short-term blood culture:
1. For replication banding add 35 µg/mL of BrdU 6 h prior to harvest.
2. Harvest short-term blood culture after 96–120 h for reptiles or 60–80 h for mammals (see Note 4).
3. Add 75 ng/mL of Colcemid and incubate (at the appropriate temperature) for 4 h.
4. Add 2 mL culture medium and 200 µL of prehypotonic solution at RT and mix well. Centrifuge for 8 min at 400 relative centrifugal force (rcf) at RT, discard supernatant, and resuspend the cell pellet thoroughly in the residue. Thorough resuspension at all steps of chromosome preparation is critical to prevent cells from clumping.
5. Add 2–10 mL (equal to culture volume) of hypotonic solution (0.075 M KCl) and incubate at 37°C for 20–60 min depending on the species.
6. Add 2 mL of 5% glacial acetic acid, centrifuge for 5 min at 400 rcf, and decant supernatant. Thoroughly resuspend the cell pellet in 2 mL of ice-cold fixative and incubate at RT for 30 min.
7. Centrifuge for 5 min at 400 rcf, decant supernatant, and resuspend the pellet in 5 mL of ice-cold fixative. Repeat this step three times.
8. After the final wash, resuspend the pellet in an appropriate volume of fixative, drop 10–15 µL onto a slide, and check the quality of the preparation by observing under a phase contrast microscope (see Note 5).
Fibroblast cell lines:
9. Harvest fibroblast cell lines when the maximum number of dividing cells (rounded cells and doublets) is observed at ~75% confluence (see Note 6). Add 75 ng/mL of Colcemid and incubate for 1–3 h depending on the species.
10. Transfer the medium from the flask to a 10 mL tube. Rinse the flask with PBS (pH 7.4), add trypsin (1 mL for T25 and 2 mL for T75), and leave the flasks until cells start to detach (~2 min). Tap the flask to detach remaining cells and transfer the cell suspension to the saved medium.
11. Centrifuge at 400 rcf for 5 min.
12. Discard supernatant and resuspend pellet thoroughly in the residue.
13. Add 10 mL hypotonic solution (0.075 M KCl solution) and incubate at 37°C for 20–60 min depending on the species.
14. Add 2–3 drops of ice-cold fixative (3:1 methanol:acetic acid) and centrifuge for 5 min at 400 rcf, discard supernatant, and resuspend pellet in residue.
15. Resuspend the pellet and slowly add ice-cold fixative to a volume equal to the culture volume, centrifuge for 5 min at 400 rcf, and decant supernatant. Repeat this step three times.
16. Check preparation according to step 8.
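The Colcemid additions in steps 3 and 9 can be worked out from the stock listed in Subheading 2.2, read here as 10 µg/mL. The Python sketch below is a simple dilution calculation that neglects the small volume the stock itself adds to the culture. The culture volumes are examples only.

```python
def spike_volume_ul(final_ng_per_ml, culture_ml, stock_ug_per_ml):
    """Volume of stock (µL) to reach the final concentration in the culture.

    The volume contributed by the stock itself is neglected.
    """
    stock_ng_per_ml = stock_ug_per_ml * 1000.0  # µg/mL -> ng/mL
    volume_ml = final_ng_per_ml * culture_ml / stock_ng_per_ml
    return volume_ml * 1000.0  # mL -> µL


COLCEMID_STOCK_UG_PER_ML = 10.0  # stock from Subheading 2.2, read as µg/mL
COLCEMID_FINAL_NG_PER_ML = 75.0  # final concentration from steps 3 and 9

# 2-mL blood culture (Subheading 3.1) and a 5-mL fibroblast culture (example).
blood_ul = spike_volume_ul(COLCEMID_FINAL_NG_PER_ML, 2.0, COLCEMID_STOCK_UG_PER_ML)
flask_ul = spike_volume_ul(COLCEMID_FINAL_NG_PER_ML, 5.0, COLCEMID_STOCK_UG_PER_ML)
print(f"2 mL culture: {blood_ul:.1f} µL; 5 mL culture: {flask_ul:.1f} µL")
```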


3.3. Differential Chromosome Staining and Banding
Silver staining: (14)
1. Put two drops of gelatin solution and four of silver nitrate solution on a slide with chromosome preparation. Mix gently by tilting the slide from side to side and cover with a cover slip.
2. Put on a slide warmer at 70°C until the solution turns golden brown (~2 min).
3. Gently remove cover slip, rinse slide thoroughly in dH2O, air dry, and mount with DPX.
GTG-banding: (15)
4. Age (see Note 7) slides overnight by incubating at 55–60°C.
5. Chill (to 2–5°C) a Coplin jar containing PBS–CMF.
6. Treat slides in 0.05% trypsin solution for 30 s to 10 min (depending on species) (see Note 8).
7. Rinse slides briefly in chilled PBS–CMF and stain with 5% Giemsa for 4–6 min (see Note 9).
8. Rinse in dH2O, air dry, and mount with DPX mounting solution.
C-banding: (16) (Fig. 1)
9. Incubate aged (see Note 7) slides (2–7 d) in 0.2 N HCl for 40 min at RT.
10. Rinse slides in dH2O and transfer to prewarmed (50°C) 5% Ba(OH)2 (see Note 9) for 7–10 min.
11. Rinse slides in 0.2 N HCl, followed by dH2O, and transfer to prewarmed (60°C) 2X SSC for 60 min.
12. Rinse slides in dH2O and stain with Giemsa for 10–30 min at RT (see Note 9).
13. Rinse slides in dH2O, air dry, and mount with DPX mounting solution.
Replication banding: (17)
14. Make slides from cells grown with BrdU and age overnight at 55–60°C.
15. Incubate at RT for 2 min in absolute methanol.
16. Warm 50 mL of dH2O to 40°C and add 1 g of tetra-sodium EDTA. Mix, add 2 mL of Giemsa, and mix again.
17. Incubate slide (maximum three slides at a time) for 3–5 min, rinse thoroughly in dH2O, air dry, and mount in DPX mounting solution. If staining is too weak, stain for longer. Prepare a fresh solution of tetra-sodium EDTA and Giemsa each time.

3.4. Isolation and Verification of Y Chromosome BACs
Probe preparation:
1. Label 1 µg of flow-sorted (10) or 250 ng of microdissected (11) Y-chromosome DNA with [α-32P] dATP using the MegaPrime DNA Labeling Kit (Amersham BioSciences) according to the manufacturer’s instructions.
2. Use 2–5 µg of sheared male M. eugenii genomic DNA (see Note 10) to suppress highly repetitive elements shared by the Y chromosome and other chromosomes. Add the sheared DNA to the labeled probe and denature at 98°C for 3 min.
3. Preanneal probe to suppressor DNA at 37°C for 30 min.


Fig. 1. C-banded mitotic metaphase chromosomes in two species of dragon lizards showing varying degrees of heterochromatinization on the autosomes and W chromosome of (A) Pogona vitticeps (Victorian population) and (B) Amphibolurus nobbi.

Hybridizing and washing filters:
4. In a single bottle, prehybridize three BAC filters (see Note 11) in 100 mL of Church buffer (see Note 12) containing 1% BSA and 100 µL of 10 mg/mL salmon sperm DNA overnight at 65°C.
5. After preannealing, add the probe directly to the prehybridization buffer (be careful not to spill any of the undiluted probe onto the BAC filters). Alternatively, remove 5–10 mL of the prehybridization buffer to a Falcon tube, add the probe, and return it to the hybridization bottle.
6. Hybridize at 65°C for 24 h.
7. The next day remove hybridization buffer and store appropriately. Add 100 mL of washing solution 1 to the bottle and incubate in the hybridization oven for 20 min at 60°C.
8. Remove washing buffer 1 and repeat wash with a further 100 mL. Incubate for 20 min at 65°C.
9. Remove washing buffer 1 and follow with a 20-min wash at 65°C with 100 mL of washing solution 2.
10. Repeat step 6.
11. Drain filters, wrap in plastic wrap, and expose to autoradiograph film with an intensifying screen overnight at -70°C. Develop films the next morning. If signals are weak after one night of exposure, expose again for 5–10 d.
12. Identify positives and determine plate and well address according to instructions supplied with the BAC library.
BAC DNA preparation:
13. BAC DNA is prepared using an adapted protocol of the Wizard® Plus SV Minipreps DNA Purification System (Promega).


14. Grow clone overnight with shaking at 37°C in 15 mL of LB supplemented with the appropriate antibiotic.
15. Centrifuge culture at 4500 rcf for 5 min and decant supernatant.
16. Resuspend pellet in 500 µL cell resuspension solution, and transfer to a 2-mL microtube.
17. Add 500 µL of cell lysis solution and invert 10 times to mix. Incubate at RT for 5 min.
18. Add 10 µL of alkaline protease solution and invert four times to mix. Incubate at RT for 5 min.
19. Add 700 µL of neutralization solution and invert 20 times to mix thoroughly. Centrifuge at 16,000 rcf for 20 min.
20. Decant half of the supernatant into the spin column and centrifuge at 16,000 rcf for 1 min. Discard flow-through, repeat centrifugation with the second half of the supernatant, and discard flow-through.
21. Add 750 µL of column wash solution to the spin column and centrifuge at 16,000 rcf for 1 min. Discard the flow-through, add a further 500 µL of column wash solution, and centrifuge at top speed for 2 min. Discard flow-through and centrifuge at top speed for 2 min.
22. Transfer spin column to a sterile 1.5-mL microtube. Elute DNA by adding 100 µL of nuclease-free dH2O, incubating at RT for 1 min, and centrifuging at top speed for 1 min. Store DNA at -20°C (see Note 13).
Dot-blot analysis:
23. Prepare a dot-blot containing each BAC isolated in the initial library screen on the Hybond-N+ nylon membrane with a Bio-Dot Microfiltration Apparatus according to the manufacturer’s instructions.
24. Denature DNA for 5 min at 95°C in a thermocycler before performing the dot-blot.
25. Prewet the membrane in 6X SSC for 10 min.
26. Assemble dot-blot apparatus and equilibrate membrane by vacuuming 100 µL of TE through each well.
27. Vacuum 150 ng of BAC DNA (with 2 µL 10 M NaOH, 1 µL 0.5 M EDTA in a total volume of 50 µL) through the dot-blot apparatus.
28. Fix DNA to the membrane by placing it DNA side up on Whatman 3MM paper, wet with 0.4 M NaOH, for 30 min. Rinse the membrane in 2X SSC and dry.
29. Hybridize dot-blot (same conditions described above) with 50 ng of female total genomic DNA labeled with [α-32P] dATP (described above) (see Note 14 and Fig. 2).
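The dot-blot sample in step 27 can be assembled from the BAC miniprep once its concentration is known. The Python sketch below computes the DNA volume and the final NaOH and EDTA concentrations, reading the volumes in step 27 as microliters. The miniprep concentration of 30 ng/µL and the use of water to reach the total volume are assumptions for illustration.

```python
def dot_blot_mix(dna_ng, dna_conc_ng_per_ul, total_ul=50.0, naoh_ul=2.0, edta_ul=1.0):
    """Volumes for one dot-blot sample and the resulting denaturant concentrations.

    NaOH stock is 10 M and EDTA stock is 0.5 M, as listed in Subheading 2.4.
    """
    dna_ul = dna_ng / dna_conc_ng_per_ul
    water_ul = total_ul - naoh_ul - edta_ul - dna_ul
    if water_ul < 0:
        raise ValueError("DNA is too dilute to fit in the total volume")
    return {
        "dna_ul": dna_ul,
        "water_ul": water_ul,
        "naoh_final_M": naoh_ul * 10.0 / total_ul,
        "edta_final_mM": edta_ul * 0.5 / total_ul * 1000.0,
    }


# 150 ng of BAC DNA from a hypothetical miniprep at 30 ng/µL.
mix = dot_blot_mix(dna_ng=150.0, dna_conc_ng_per_ul=30.0)
for key, value in mix.items():
    print(f"{key}: {value:g}")
```

Under these assumptions the sample contains 0.4 M NaOH, the same strength as the NaOH used to fix the DNA to the membrane in step 28.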

3.5. Strategies for Isolating Candidate Sex Chromosome Genes (Fig. 3) Overgo probes for isolation of sex chromosome genes: 1. Use data from existing genome sequencing projects to design pools of overgo probes targeting candidate X genes (see Note 15) for hybridization to BAC libraries (Thomas, 2007, Chapter 8 in this issue).


Fig. 2. Hybridization of female tammar wallaby total genomic DNA to dot blots of BAC clones isolated with a Y-specific chromosome paint. The clones were spotted in duplicate (data not shown). Clones that display no hybridization are circled in black and were further characterized. As a control, the Y-borne ATRY clone (circled in gray) displays no hybridization.

2. If sequence of the candidate gene/s is not available for the target species, align orthologues from different species to identify regions of conservation and design an overgo probe.


Fig. 3. Flow chart detailing different strategies to isolate candidate sex chromosome genes. Libraries can be screened with either cross-species or homologous probes derived from the X or Y. Solid lines show library screening strategies taken to isolate X genes. Dashed lines show library screening strategies to isolate Y genes. A * next to the line indicates that it is best to screen the library with either a cDNA or a PCR-generated probe, and not an overgo probe, because target sequences are not homologous (either screening cross-species, or for a Y homolog with the X homolog), resulting in the failure of smaller probes. A * next to the line indicates that it is more efficient to screen with pools of overgos because many genes can be screened for at one time. When designing overgos for X genes using sequence from another species, it is important to design the overgo in a region of high conservation. If nonhomologous Y sequences are from a closely related species, or the marker of interest is highly conserved, then these can be used as cross-species probes to screen a cDNA library or the BAC library directly (not shown on the flow chart).

3. Overgo probes can be designed for Y markers too, but owing to the rapid change of the Y, this can only be done if there is sequence available for the desired species or a very closely related species (see Note 16).
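The conservation search described in step 2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' procedure: it assumes the orthologues have already been aligned to equal length (gaps written as "-"), it scores a column as conserved only when every sequence carries the same unambiguous base, and it uses a default window of 40 columns on the assumption that the overgo to be designed is 40 bp long. The function names and the toy alignment are hypothetical.

```python
def column_is_conserved(column):
    """True if every sequence has the same unambiguous base (no gaps, no Ns)."""
    first = column[0]
    return first in "ACGT" and all(base == first for base in column)


def best_conserved_window(aligned_seqs, window=40):
    """Return (start, fraction_conserved) for the most conserved window.

    aligned_seqs: list of equal-length strings from a multiple alignment.
    window: window length in alignment columns (40 is an assumed overgo length).
    The first window with the highest score is reported; start is 0-based.
    """
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    if length < window:
        raise ValueError("alignment is shorter than the requested window")

    # One flag per alignment column: is this column fully conserved?
    conserved = [
        column_is_conserved([seq[i].upper() for seq in aligned_seqs])
        for i in range(length)
    ]

    # Slide the window along the alignment, updating the count incrementally.
    score = sum(conserved[:window])
    best_start, best_score = 0, score
    for start in range(1, length - window + 1):
        score += conserved[start + window - 1] - conserved[start - 1]
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score / window


# Toy alignment (hypothetical data) with a short window for illustration.
toy_alignment = ["ACGTACGTAC", "ACGAACGTAC", "TCGTACGTAC"]
start, fraction = best_conserved_window(toy_alignment, window=4)
print(start, fraction)  # 4 1.0
```

In practice the candidate window would still need to be checked for repeats and GC content before ordering oligonucleotides; those checks are outside the scope of this sketch.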

Isolation of sex chromosome genes with PCR-generated or cDNA clones:
1. An approach we have commonly used to obtain genomic clones is cross-species hybridization with previously obtained cDNA or PCR-generated probes. This is particularly useful across species because orthologous sequences can be detected and isolated across large evolutionary distances.
2. These probes are much larger than overgo probes and can therefore potentially be used to detect and isolate genes on the Y chromosome. The success of this approach depends on how distantly related the two species are and on the nature of the candidate marker.

Isolation of Y genes using their X homolog:
1. Because of poor conservation of Y chromosome genes across species, we isolate Y-borne genes by screening with the more conserved X homolog where there is one.
2. The most efficient approach is an intraspecific hybridization of X sequences to a BAC library (cross-species hybridization also works). The brightest positives obtained usually represent clones that contain the X homolog. Weaker signals may represent clones with the desired Y homologs.
3. If the direct screen of a BAC library fails, screen a testis cDNA library (see Note 17) with the X homolog to try to isolate the Y homolog cDNA (see Note 18).
4. Use the newly obtained Y cDNA to screen the BAC library.

4. Notes
1. For samples collected in the field, use high concentrations of antibiotics and fungizone in collection medium to prevent contamination. However, once contamination is destroyed, best practice is not to use antibiotics routinely as a preventative measure. Rather, use careful aseptic technique and introduce antibiotics only to destroy new contamination as soon as it is detected. This reduces the chance of generating resistant bacterial strains in a selective environment. Fungus, a common problem with mammal and reptile skin samples, can be kept at bay (but never eliminated) with fungizone.
2. In our lab, we culture reptilian cells at 28–30°C, avian cells at 35–37°C, and mammalian cells at 31–37°C (37°C for placental mammals, 34°C for marsupials, and 31°C for monotremes).
3. We find eye and pericardial tissues are best to establish fibroblast cell lines. Ear clips work well for marsupials and toe web is excellent for platypus. We have also established lizard cell lines from tail tip and toe clips.
4. The time for blood cultures to reach their first mitotic wave varies for different species and can even vary between individuals. Set up several cultures and harvest them at different times.
5. Drop slides with a P20 Gilson pipette. Withdraw ~15 µL cell suspension and drop on the middle of a clean glass slide from a distance of 10–15 cm. Slides are either air dried immediately or quickly flamed over a Bunsen burner.
6. Examine the culture under a microscope to check that cell doublets have disappeared and the number of rounded cells has increased, indicating that mitosis has been arrested by Colcemid.
7. Aged slides often produce sharper bands.
8. Different trypsin digestion times are required for chromosome preparations obtained from different tissues and species. For reproducible results, initial optimization is required. Older slides often require a longer trypsin digestion time.
9. Giemsa solution forms a green layer on the surface, which results in background. It can be removed from the surface with a small piece of Whatman paper and forceps. Ba(OH)2 forms a white layer that can also be removed with Whatman paper.


10. DNA is sheared by boiling until the brightest part of a smear ranges in size from 200 to 600 bp.
11. Ideally, each screen should cover the genome approximately 3X (or more). For instance, we use a tammar wallaby (Macropus eugenii) library with an average insert size of 166 kb. Each filter carries 18,432 clones and equates to approx. 3.1 × 10^9 bp, which is approximately 1X coverage of an average mammalian genome.
12. If filters are being screened for the first time, the Church buffer needs to be changed before hybridization is started.
13. This method yields approx. 2 µg of DNA, which is enough to label 1 µg for FISH and conduct 3–4 sequencing reactions each using 300 ng.
14. Female total genomic DNA will hybridize only to clones that contain autosomal sequence. Clones that show no hybridization, or only very faint hybridization, are expected to contain Y-specific sequence (Fig. 2).
15. Orthologues of candidate sex chromosome genes will not always be located on the sex chromosomes in other species. Genes may have been lost from the Y in one lineage and retained on the Y in others (18). There may even have been unexpected losses of material from the X chromosome in some lineages (19). These results are just as significant in the story of sex chromosome evolution.
16. Current genome sequencing projects are all being undertaken on females because of the repetitive nature of the Y and the difficulties in assembling it. Therefore, in most species, it is unlikely that there will be Y genomic sequence available for the design of overgos.
17. Use a testis cDNA library, as many genes on the Y are testis specific.
18. The processed cDNAs will give a more intense probe hybridization signal than intron-containing genomic clones.
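The coverage arithmetic in Note 11 can be checked with a short Python sketch. The clone count and mean insert size are taken from the note; the genome size of 3.0 × 10^9 bp is an assumption standing in for "an average mammalian genome", and the function names are hypothetical.

```python
import math


def filter_coverage(clones_per_filter, mean_insert_bp, genome_size_bp):
    """Return (total insert bp on one filter, fold coverage of the genome)."""
    total_bp = clones_per_filter * mean_insert_bp
    return total_bp, total_bp / genome_size_bp


def filters_needed(target_coverage, coverage_per_filter):
    """Smallest whole number of filters that reaches the target fold coverage."""
    return math.ceil(target_coverage / coverage_per_filter)


CLONES_PER_FILTER = 18432    # from Note 11
MEAN_INSERT_BP = 166_000     # 166 kb average insert, from Note 11
ASSUMED_GENOME_BP = 3.0e9    # assumption: typical mammalian genome size

total_bp, fold = filter_coverage(CLONES_PER_FILTER, MEAN_INSERT_BP, ASSUMED_GENOME_BP)
n_filters = filters_needed(3.0, fold)

print(f"Insert DNA per filter: {total_bp:.2e} bp")
print(f"Coverage per filter:   {fold:.2f}X")
print(f"Filters for 3X screen: {n_filters}")
```

Under these assumptions one filter carries about 3.06 × 10^9 bp of insert DNA, roughly 1X coverage, so three such filters would be needed for the recommended 3X screen. A species with a markedly larger or smaller genome would change the filter count accordingly.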

References
1. Wrigley, J. M. and Graves, J. A. M. (1984) Two monotreme cell lines derived from female platypuses (Ornithorhynchus anatinus; Monotremata, Mammalia). In Vitro 20, 321–328.
2. Shetty, S., Griffin, D. K., and Graves, J. A. M. (1999) Comparative painting reveals strong chromosome homology over 80 million years of bird evolution. Chromosome Res. 7, 289–295.
3. Ezaz, T., Quinn, A. E., Miura, I., et al. (2005) The dragon lizard Pogona vitticeps has ZZ/ZW micro-sex chromosomes. Chromosome Res. 13, 763–776.
4. Rens, W., O’Brien, P. C., Yang, F., Graves, J. A. M., and Ferguson-Smith, M. A. (1999) Karyotype relationships between four distantly related marsupials revealed by reciprocal chromosome painting. Chromosome Res. 7, 461–474.
5. Glas, R., Graves, J. A. M., Toder, R., Ferguson-Smith, M. A., and O’Brien, P. C. (1999) Cross-species chromosome painting between human and marsupial directly demonstrates the ancient region of the mammalian Y chromosome. Mamm. Genome 10, 1115–1116.
6. Graves, J. A. M. and Shetty, S. (2000) Comparative genomics of vertebrates and the evolution of sex chromosomes. In: Comparative Genomics (Clark, M. S., ed.), pp. 153–205, Kluwer Academic, London.
7. Sankovic, N., Delbridge, M. L., Grützner, F., et al. (2006) Construction of a highly enriched marsupial Y chromosome specific BAC sub-library using isolated Y chromosomes. Chromosome Res., in press.
8. Alsop, A. E., et al. (2005) Characterizing the chromosomes of the Australian model marsupial Macropus eugenii (tammar wallaby). Chromosome Res. 13, 627–636.
9. O’Brien, S. J., et al. (1999) The promise of comparative genomics in mammals. Science 286, 458–462.
10. Telenius, H., et al. (1992) Cytogenetic analysis by chromosome painting using DOP-PCR amplified flow-sorted chromosomes. Genes Chromosomes Cancer 4, 257–263.
11. Guan, X. Y., Meltzer, P. S., and Trent, J. M. (1994) Rapid generation of whole chromosome painting probes (WCPs) by chromosome microdissection. Genomics 22, 101–107.
12. Freshney, R. I. (1983) Culture of Animal Cells. A Manual of Basic Techniques, Alan R. Liss, New York.
13. Evans, E. P., Burtenshaw, M. D., and Ford, C. E. (1972) Chromosomes of mouse embryos and newborn young: preparations from membranes and tail tips. Stain Technol. 47.
14. Howell, W. M. and Black, D. A. (1980) Controlled silver staining of nucleolus organizer regions with a protective colloidal developer: a one-step method. Experientia 36, 1014.
15. ISCN (1985) An International System for Human Cytogenetics Nomenclature: Birth Defects. Original article series. Vol. 21, March of Dimes Birth Defects Foundation, New York.
16. Sumner, A. T. (1972) A simple technique for demonstrating centromeric heterochromatin. Exp. Cell. Res. 75, 304–306.
17. Miura, I. (1995) The late replication banding patterns of chromosomes are highly conserved in the genera Rana, Hyla, and Bufo (Amphibia: Anura). Chromosoma 103, 567–574.
18. Pask, A., Renfree, M. B., and Graves, J. A. M. (2000) The human sex-reversing ATRX gene has a homologue on the marsupial Y chromosome, ATRY: implications for the evolution of mammalian sex determination. Proc. Natl Acad. Sci. USA 97, 13,198–13,202.
19. Waters, P. D., et al. (2005) Autosomal location of genes from the conserved mammalian X in the platypus (Ornithorhynchus anatinus): implications for mammalian sex chromosome evolution. Chromosome Res. 13, 401–410.

