1 Gene Mapping Goes from FISH to Surfing the Net
John Valdes and Danilo A. Tagle
1. Introduction
The arrival of the new millennium will usher in unsurpassed information and knowledge of our genetic constitution, and promises to revolutionize basic research and molecular medicine. The road toward a complete understanding of our genetic makeup is largely the fruit of the Human Genome Project, which has initiated, advanced, and made major strides in constructing genetic and physical maps of humans and other model organisms. Already, entire genomic sequences of a few prokaryotic organisms have become available, with efforts toward completion of the budding yeast not too far behind. Gene mapping and identification are critical steps in this ambitious undertaking. Unfortunately, the identification of genes, especially those responsible for the vast majority of inherited human disorders, must often proceed without any knowledge of their biochemical functions. To wit, positional cloning (1) has taken center stage in the initial steps toward the molecular characterization of the estimated 100,000 genes in the human genome. This approach has garnered over 60 "disease" genes thus far, with many more to come as the process becomes more streamlined. Although a framework of genetic and physical maps of the human genome has been achieved in the last several years, the efficient and comprehensive isolation of transcribed sequences within large targeted genomic intervals remains a formidable task. The numerous chapters in this book document the creativity and ingenuity of various investigators and laboratories in this global effort. Our aim in this introductory chapter is to give an overview of gene mapping and to assess where approaches in gene isolation are headed in the near future.
From: Methods in Molecular Biology, Vol. 68: Gene Isolation and Mapping Protocols. Edited by J. Boultwood. Humana Press Inc., Totowa, NJ
1.1. Identifying and Defining the Chromosome of Interest
The mapping of a gene that contains disease-causing mutations frequently begins with the assignment of the gene to a single chromosome or to a specific subchromosomal region. Chromosomal gene assignments can be accomplished in several ways. For diseases where a large collection of affected families exists, the gene can be localized by linkage analysis, which involves studying the segregation pattern of the disease phenotype with selected genetic markers within a pedigree. Statistical methods are used to determine the likelihood that the marker and disease are segregating independently. If the chance of independent segregation is <1 in 1000 (an LOD score of 3), then linkage is assumed. Identification of recombinant families using additional polymorphic markers allows further delineation of the linked interval. Linkage analysis has shown widespread success in mapping monogenic disorders that show clear Mendelian inheritance patterns. The same principles are now being applied to polygenic diseases (those that show complex genetic patterns, likely owing to multiple genes and/or environmental factors acting in combination), but this has proven difficult in practice (2). Proposed solutions have included use of standardized ascertainment and the incorporation of interference models (3,4), inclusion of larger sample sizes, or use of genetically homogeneous populations in linkage disequilibrium studies (5).
Human-rodent somatic cell hybrids (either monochromosomal, regional/deletion, or radiation-reduced mapping panels) provide a convenient resource for mapping of genes by hybridization or polymerase chain reaction (PCR). Hybrid cell lines have also been useful in genetic complementation studies, such as in xeroderma pigmentosum and in Niemann-Pick disease (6).
Aside from mapping, radiation hybrids provide additional information about the order and distance of markers/genes (7,8): segments of DNA that are farther apart on a chromosome are more likely to be broken apart by radiation, and thus to segregate independently in the radiation hybrid cells, than segments that are closely linked.
Fluorescence in situ hybridization (FISH) is also widely used to determine the chromosomal map location and the relative order of genes and DNA sequences within a chromosomal band. Unlike hybrid panel mapping, where a cDNA clone or PCR primers are all that is needed, larger genomic clones, such as cosmids, are needed when mapping via FISH. However, FISH can readily provide more precise regional mapping than regional or radiation panels. FISH can also detect aneuploidy, gene amplification, and subtle chromosomal rearrangements. Discovery of a patient whose inherited disease has resulted from a visible chromosomal abnormality has often been the "jackpot" that has accelerated efforts to clone the causal gene (9,10). The ability to map by FISH
most chromosomal translocations that interrupt or inactivate the gene has tremendous utility in the field of cancer genetics (11), where molecular events leading to the loss of tumor suppressor genes (12) or the generation of fusion genes (13) can often be detected at the chromosome level. Using FISH on normal metaphase spreads, comparative genomic hybridization (CGH) allows total genome assessment of changes in relative copy number (regions of chromosomal loss, gain, or amplification) of DNA sequences using DNA probes derived from tumor cells (14). CGH has the potential to identify previously unknown regions involved in tumorigenesis.
1.2. Defining and Cloning the Physical Region
Once a genomic interval has been defined for a disease locus, the gene mapping efforts shift toward constructing a physical map of the candidate region, determining accurate distances between markers, and cloning the genomic segment in large insert clones. Physical distances can then be established and correlated with the genetic distance (e.g., if two marker probes hybridize to the same 250-kb fragment, then their maximum distance apart must be 250 kb). Physical distances between genomic markers can be refined with pulsed-field gel electrophoresis (PFGE) and a combination of rare-cutting restriction enzymes. Because the recognition sites of such enzymes occur in GC-rich sequences, the location of CpG islands, which are likely landmarks for expressed genes, can then be determined. The pulsed-field maps also provide a reliable method for verifying the extent of coverage of overlapping clones within a contig in relation to the actual genomic distance. PFGE can also be used to compare patient and normal DNA samples, looking for genomic abnormalities that may have been too small to be detected by cytogenetic techniques (13).
In long-range physical mapping, yeast artificial chromosomes (YACs) are the cloning library of choice because of their larger insert size, which means that fewer markers and clones are required to anchor and assemble the contig (15,16). Where a dense ordered array of markers is available, bacterial artificial chromosomes (BACs), P1s, or even cosmids are preferred for screening despite their smaller insert sizes (120 kb for BACs, 95 kb for P1s, and 40 kb for cosmids) because of the ease of purifying their DNA, their relative stability, and their low frequency of chimerism compared to YAC clones. Genomic clones isolated for the candidate interval are analyzed for insert size and for degree of overlap by marker content mapping using sequence-tagged sites (STSs) and repetitive element fingerprint patterns.
The clones or their derivatives can be used as probes for chromosome walking until full coverage of the candidate interval is obtained. More importantly, these genomic clones provide a readily available source of DNA for isolating additional markers, for use as FISH or hybridization probes, for generating sequence data, and for gene identification.
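The marker content mapping described above can be illustrated with a minimal sketch. It assumes that two clones overlap whenever they share at least one STS, and it groups clones into contigs as connected components of that overlap relation. The clone names, STS names, and helper functions are hypothetical and are not taken from any particular mapping package.

```python
from itertools import combinations

# Hypothetical STS content for four genomic clones (illustrative names only).
sts_content = {
    "cloneA": {"sts1", "sts2"},
    "cloneB": {"sts2", "sts3"},
    "cloneC": {"sts3", "sts4", "sts5"},
    "cloneD": {"sts7"},
}


def overlapping_pairs(content):
    """Return {(clone, clone): shared STSs} for every pair sharing a marker."""
    pairs = {}
    for a, b in combinations(sorted(content), 2):
        shared = content[a] & content[b]
        if shared:
            pairs[(a, b)] = shared
    return pairs


def contigs(content):
    """Group clones into contigs (connected components of the overlap graph)."""
    remaining = set(content)
    groups = []
    while remaining:
        seed = remaining.pop()
        group, frontier = {seed}, [seed]
        while frontier:
            current = frontier.pop()
            # Clones not yet placed that share a marker with the current clone.
            linked = {c for c in remaining if content[c] & content[current]}
            remaining -= linked
            group |= linked
            frontier.extend(linked)
        groups.append(sorted(group))
    return sorted(groups)


print(overlapping_pairs(sts_content))
print(contigs(sts_content))
```

In this toy data set, a clone with no shared marker (cloneD) remains a separate contig, which corresponds to a gap that would have to be closed by further chromosome walking. Real contig assembly also uses insert sizes and fingerprint patterns, which this sketch ignores.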
1.3. Gene Isolation
Genetic linkage analysis and physical mapping experiments can often resolve the rough location of a gene to a region of 0.5-1 centimorgan (1 centimorgan being equivalent to a frequency of 1 recombinant/100 meioses), which is approx 1 Mb. Such an interval may contain 30-50 genes, and identifying all the genes in such a region and finding the causative gene for the disorder has been a major bottleneck in most positional cloning projects. The choice of which gene cloning strategies to utilize often depends on the available resources in a given laboratory. The common gene hunting methods can be divided into hybridization-based methods and those based on functional detection of sequences involved in RNA splicing.
Exon trapping identifies putative transcribed sequences from genomic clones (often cosmids as starting templates) based on splicing signals present in exon-intron junctions. No assumptions are made regarding the tissue-specific pattern of expression of a given gene or its level of expression. The targeted exons can be internal (17,18) or directed toward the 3'-terminal exon (19). Numerous labs have applied the method successfully for both gene isolation (20,21) and mapping intron-exon boundaries of known genes (22).
Transcribed sequences in genomic DNA can also be detected either by using labeled cDNAs as hybridization probes on arrayed genomic clones (23) or the converse, where genomic clones are used as probes against cDNA libraries (24,25). The former approach has taken on numerous permutations where the genomic YAC clones are either immobilized on filters (26,27), biotinylated (28-30), or used in solution hybridization schemes (32-34). These methodologies assume some prior knowledge of the targeted gene's expression level, since moderately to abundantly expressed messages are those usually obtained, as well as an idea of the proper tissue source of library to screen.
Because the techniques are hybridization-based, problems with sticky or GC-rich cDNAs, repeat sequences, pseudogenes, and related gene family members frequently accompany the final product. None of the aforementioned methodologies are expected to garner full-length clones. The end points using these techniques are for the most part small exons or cDNA fragments that can then serve as additional expressed sequence tags (ESTs) or probes for isolating larger clones.
Other gene cloning strategies take advantage of certain features in the genomic DNA or transcript. One such feature is CpG islands, which are areas of the human genome where the CpG dinucleotide is enriched (10-20 times greater than in other regions). CpG islands tend to be associated with the 5'-ends of genes and can therefore provide a means of isolating those genes. A recent survey of 375 genes in the GenBank database demonstrated that almost all housekeeping genes, and about 40% of tissue-specific genes are
associated with these islands (35). These islands can be isolated by rare-cutting enzymes (36-38) or by PCR (39), and used as hybridization probes against cDNA libraries.
Another feature is the differential expression pattern of genes in certain tissues. Subtraction techniques (40,41) have been used to isolate genes specific to one particular tissue source or developmental stage. This technique involves the use of a target cDNA library (derived from a tissue where the desired gene is likely to be expressed) and a driver cDNA library to subtract out most ubiquitously expressed sequences. Differential display (42-44) is another method for isolating genes that are unique to a particular cell type or developmental stage, and it allows the analysis of expression patterns of multiple cell types.
A third feature takes advantage of mutants in model organisms whose phenotype resembles that in humans. The mouse genome (as well as that of other organisms) is also being investigated as part of the Human Genome Project. Mouse genetic studies are able to take advantage of selective breeding, short generation times, and backcrosses (mating between two mice, one of which is homozygous for a recessive trait, in order to establish the genotype of the first). One possible approach to mapping a gene is to isolate the mouse homolog, determine its genetic localization within the mouse genome, and then focus efforts on the part of the human genome to which it corresponds. Comparative mapping between the mouse and human is fairly well defined: the entire genome can be separated into 68 homologous chromosomal regions (45). The observation and characterization of naturally occurring mouse mutants have also supplied model systems (46), as well as accelerating the search for human disease genes (45).
1.4. Future Directions
There is no doubt that the number of genes being cloned by positional cloning approaches is increasing at a rapid rate (5).
Most of these genes have been obtained using the methodologies outlined in this chapter. However, newer resources being made accessible through the Human Genome Project promise to accelerate gene mapping and isolation even further. With the increasing resolution of the chromosome physical maps, it is now feasible to embark on large-scale genomic sequencing (47). This has become possible not through any significant improvement in sequencing methodology, but through a combination of faster computational machines to store and analyze the data, ready availability of sequence-ready cosmid clones and their derivatives, and dense mapping information to help minimize overlap of cosmid templates. Large-scale sequencing of genomic clones has been completed for a number of prokaryotic organisms (48,49) and implemented for
disease loci (50) as an additional gene searching tool. Sequences are queried against the sequence databases and fed to the Gene Recognition and Analysis Internet Link (GRAIL) server for exon prediction through computational analysis of the sequence (51,52).
Another critical development is the concerted effort to develop a transcript map of the human genome, which involves sequencing of human cDNA clones by the Washington University Genome Sequencing Center under the auspices of Merck (Whitehouse Station, NJ) (53). The centerpieces of this undertaking are the oligo(dT)-primed, directionally cloned, and normalized cDNA clones from various tissue sources (54,55). Concomitant with the sequencing are efforts to develop these sequences into gene-based STSs and to place them on the physical map via YACs (56,57) and radiation hybrid maps. Although such efforts have been attempted in the past on a limited scale, this endeavor is projected to generate approx 400,000 ESTs by early this year (53). The sequences, mapping information, and homology results are easily accessible via World Wide Web servers on the Internet. As the number of mapped cDNAs increases, these ESTs automatically become candidate genes if they happen to fall in an interval linked to a disease locus. The tremendous potential of this resource can be gleaned from recent statistics obtained by the National Center for Biotechnology Information at the National Institutes of Health: 79% of positionally cloned genes are actually represented in the EST database (dbEST at http://www.ncbi.nlm.nih.gov/dbEST/index.html). Positional cloning will soon be simplified to a positional candidate approach, in which linkage of a particular monogenic or polygenic disorder to a particular chromosomal subregion will be followed by a survey of the interval for any interesting ESTs (5).
References
1. Collins, F. S. (1991) Of needles and haystacks: finding human disease genes by positional cloning. Clin. Genet. 39, 615-623.
2. Bishop, D. T. (1994) Linkage analysis: progress and problems. Phil. Trans. R. Soc. Lond. 344, 337-343.
3. Cloninger, C. R. (1994) Turning point in the design of linkage studies of schizophrenia. Am. J. Med. Genet. 54, 83-92.
4. Karlin, S. and Liberman, U. (1994) Theoretical recombination processes incorporating interference effects. Theor. Popul. Biol. 46, 198-231.
5. Collins, F. S. (1995) Positional cloning moves from perditional to traditional. Nat. Genet. 9, 347-350.
6. Kurimasa, A., Ohno, K., and Oshimura, M. (1993) Restoration of the cholesterol metabolism in 3T3 cell lines derived from the sphingomyelinosis mouse (spm/spm) by transfer of a human chromosome 18. Hum. Genet. 92, 157-162.
7. Walter, M. A. and Goodfellow, P. N. (1993) Radiation hybrids: irradiation and fusion gene transfer. Trends Genet. 9, 352-356.
8. James, M. R., Richard, C. W., III, Schott, J. J., Yousry, C., Clark, K., Bell, J., Terwilliger, J. D., Hazan, J., Dubay, C., Vignal, A., Agrapart, M., Imai, T., Nakamura, Y., Polymeropoulos, M., Weissenbach, J., Cox, D. R., and Lathrop, G. M. (1994) A radiation hybrid map of 506 STS markers spanning human chromosome 11. Nat. Genet. 8, 70-76.
9. Black, G. and Redmond, R. M. (1994) The molecular biology of Norrie's disease. Eye 8, 491-496.
10. Chotai, K. A., Brueton, L. A., van Herwerden, L., Garrett, C., Hinkel, G. K., Schinzel, A., Mueller, R. F., Speleman, F., and Winter, R. M. (1994) Six cases of 7p deletion: clinical, cytogenetic and molecular studies. Am. J. Med. Genet. 51, 270-276.
11. Cohen, M. M., Rosenblum-Vos, L. S., and Prabhakar, G. (1993) Human cytogenetics. Am. J. Dis. Child. 147, 1159-1166.
12. Johansson, B., Mertens, F., and Mitelman, F. (1993) Cytogenetic deletion maps of hematologic neoplasms: circumstantial evidence for tumor suppressor loci. Genes Chromosomes Cancer 8, 205-218.
13. Liu, P., Tarle, S. A., Hajra, A., Claxton, D. F., Marlton, P., Freedman, M., Siciliano, M. J., and Collins, F. S. (1993) Fusion between transcription factor CBF beta/PEBP2 beta and a myosin heavy chain in acute myeloid leukemia. Science 261, 1041-1044.
14. Kallioniemi, A., Kallioniemi, O. P., Sudar, D., Rutovitz, D., Gray, J. W., Waldman, F., and Pinkel, D. (1992) Comparative genomic hybridization: a rapid new method for detecting and mapping DNA amplification in tumors. Semin. Cancer Biol. 4, 41-46.
15. Ramsay, M. (1994) Yeast artificial chromosome cloning. Mol. Biotechnol. 2, 181-201.
16. Khristich, J. V., Bailis, J., Diggle, K., Rodkins, A., Romo, A., Quackenbush, J., and Evans, G. A. (1994) Large-scale screening of yeast artificial chromosome libraries using PCR. BioTechniques 17, 498-501.
17. Duyk, G. M., Kim, S., Myers, R. M., and Cox, D. R. (1990) Exon trapping: a genetic screen to identify candidate transcribed sequences in cloned mammalian genomic DNA. Proc. Natl. Acad. Sci. USA 87, 8995-8999.
18. Buckler, A. J., Chang, D. D., Graw, S. L., Brook, J. D., Haber, D. A., Sharp, P. A., and Housman, D. E. (1991) Exon amplification: a strategy to isolate mammalian genes based on RNA splicing. Proc. Natl. Acad. Sci. USA 88, 4005-4009.
19. Krizman, D. B. and Berget, S. M. (1993) Efficient selection of 3' terminal exons from vertebrate DNA. Nucleic Acids Res. 21, 5198-5202.
20. Abel, K. J., Castilla, L. H., Buckler, A. J., Couch, F. J., Ho, P., Schaefer, I., Chandrasekharappa, S. C., Collins, F. S., and Weber, B. L. (1994) Isolation of gene sequences from the BRCA1 region of chromosome 17q21 by exon amplification, in Identification of Transcribed Sequences (Hochgeschwender, U. and Gardiner, K., eds.), Plenum, New York, pp. 183-189.
21. Andreadis, A., Nisson, P. E., Kosik, K. S., and Watkins, P. C. (1993) The exon trapping assay partly discriminates against alternatively spliced exons. Nucleic Acids Res. 21, 2217-2221.
22. Kwok, J. B., Gardner, E., Warner, J. P., Ponder, B. A., and Mulligan, L. M. (1993) Structural analysis of the human ret proto-oncogene using exon trapping. Oncogene 8, 2575-2582.
23. Hochgeschwender, U., Sutcliffe, J. G., and Brennan, M. B. (1989) Construction and screening of a genomic library specific for mouse chromosome 16. Proc. Natl. Acad. Sci. USA 86, 8482-8486.
24. Wallace, M. R., Marchuk, D. A., Anderson, L. B., Letcher, R., Odeh, H. M., Saulino, A. M., Fountain, J. W., Brereton, A., Nicholson, J., and Mitchell, A. L. (1990) Type 1 neurofibromatosis gene: identification of a large transcript disrupted in three NF1 patients. Science 249, 181-186.
25. Elvin, P., Slynn, G., Black, D., Graham, A., Butler, R., Riley, J., Anand, R., and Markham, A. F. (1990) Isolation of cDNA clones using yeast artificial chromosome probes. Nucleic Acids Res. 18, 3913-3917.
26. Lovett, M., Kere, J., and Hinton, L. M. (1991) Direct selection: a method for the isolation of cDNAs encoded by large genomic regions. Proc. Natl. Acad. Sci. USA 88, 9628-9632.
27. Parimoo, S., Patanjali, S. R., Shukla, H., Chaplin, D. D., and Weissman, S. M. (1991) cDNA selection: efficient PCR approach for the selection of cDNAs encoded in large chromosomal DNA fragments. Proc. Natl. Acad. Sci. USA 88, 9623-9627.
28. Korn, B., Sedlacek, Z., Manca, A., Kioschis, P., Konecki, D., Lehrach, H., and Poustka, A. (1992) A strategy for the selection of transcribed sequences in the Xq28 region. Hum. Mol. Genet. 1, 235-242.
29. Morgan, J. G., Dolganov, G. M., Robbins, S. E., Hinton, L. M., and Lovett, M. (1992) The selective isolation of novel cDNAs encoded by the regions surrounding the human interleukin 4 and 5 genes. Nucleic Acids Res. 20, 5173-5179.
30. Tagle, D. A., Swaroop, M., Lovett, M., and Collins, F. S. (1993) Magnetic bead capture of expressed sequences encoded within large genomic segments. Nature 361, 751-753.
31. Swaroop, A. and Yan, D. (1994) A sandwich-hybridization method for specific and efficient selection of cDNA clones from genomic regions, in Identification of Transcribed Sequences (Hochgeschwender, U. and Gardiner, K., eds.), Plenum, New York, pp. 91-100.
32. Jagadeeswaran, P., Odom, M. W., and Boland, E. J. (1994) Novel strategy for isolating unknown coding sequences from genomic DNA by generating genomic-cDNA chimeras, in Identification of Transcribed Sequences (Hochgeschwender, U. and Gardiner, K., eds.), Plenum, New York, pp. 101-110.
33. Brookes, A. J. (1994) Identifying and directly purifying transcribed elements: coincident sequence cloning, in Identification of Transcribed Sequences (Hochgeschwender, U. and Gardiner, K., eds.), Plenum, New York, pp. 111-122.
34. Hozier, J. C., Davis, L. M., Siebert, P. D., Dietrich, K., and Paterson, M. C. (1994) Finding candidate genes by preparative in situ hybridization, in Identification of Transcribed Sequences (Hochgeschwender, U. and Gardiner, K., eds.), Plenum, New York, pp. 123-138.
35. Larsen, F., Solheim, J., Kristensen, T., Kolsto, A. B., and Prydz, H. (1993) A tight cluster of five unrelated human genes on chromosome 16q22.1. Hum. Mol. Genet. 2, 2589-2595.
36. Larsen, F., Gundersen, G., Lopez, R., and Prydz, H. (1992) CpG islands as gene markers in the human genome. Genomics 13, 1095-1107.
37. Bird, A. P. (1989) Two classes of observed frequency for rare-cutter sites in CpG islands. Nucleic Acids Res. 17, 9485.
38. Tribioli, C., Maestrini, E., Bione, S., Tamanini, F., Mancini, M., Sala, C., Torri, G., Rivella, S., and Toniolo, D. (1994) Identification of genes and construction of a transcriptional map in Xq28, in Identification of Transcribed Sequences (Hochgeschwender, U. and Gardiner, K., eds.), Plenum, New York, pp. 5-10.
39. Valdes, J. M., Tagle, D. A., and Collins, F. S. (1994) Island rescue PCR: a rapid and efficient method for isolating transcribed sequences from yeast artificial chromosomes and cosmids. Proc. Natl. Acad. Sci. USA 91, 5377-5381.
40. Swaroop, A., Xu, J., Pawar, H., Jackson, C., Skolnick, C., and Agarwal, N. (1992) A conserved retina-specific gene encodes a basic motif/leucine zipper domain. Proc. Natl. Acad. Sci. USA 89, 266-270.
41. Gratas, C., Herlyn, M., and Becker, D. (1994) Isolation and analysis of novel human melanocyte-specific cDNA clones. DNA Cell Biol. 13, 515-519.
42. Liang, P., Averboukh, L., and Pardee, A. B. (1993) Distribution and cloning of eukaryotic mRNAs by means of differential display: refinements and optimization. Nucleic Acids Res. 21, 3269-3275.
43. Liang, P., Averboukh, L., and Pardee, A. B. (1993) Distribution and cloning of eukaryotic mRNAs by means of differential display: refinements and optimization. Nucleic Acids Res. 21, 3269-3275.
44. Bauer, D., Muller, H., Reich, J., Riedel, H., Ahrenkiel, V., Warthoe, P., and Strauss, M. (1993) Identification of differentially expressed mRNA species by an improved display technique (DDRT-PCR). Nucleic Acids Res. 21, 4272-4280.
45. Delezoide, A. L. and Vekemans, M. (1994) Waardenburg syndrome in man and splotch mutants in the mouse: a paradigm of the usefulness of linkage and synteny homologies in mouse and man for the genetic analysis of human congenital malformations. Biomed. Pharmacother. 48, 335-339.
46. Brown, S. D. (1994) Integrating maps of the mouse genome. Curr. Opinion Genet. Dev. 4, 389-394.
47. Olson, M. V. (1995) A time to sequence. Science 270, 394-396.
48. Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., et al. (1995) Science 269, 496-512.
49. Fraser, C. M., Gocayne, J. D., White, O., Adams, M. D., Clayton, R. A., Fleischmann, R. D., et al. (1995) The minimal gene complement of Mycoplasma genitalium. Science 270, 397-403.
50. Brody, L. C., Abel, K. J., Castilla, L. H., Couch, F. J., McKinley, D. R., Yin, G. Y., Ho, P. P., Merajver, S., Chandrasekharappa, S. C., Xu, J., Cole, J. L., Struewing, J. P., Valdes, J. M., Collins, F. S., and Weber, B. L. (1995) Construction of a transcription map surrounding the BRCA1 locus of human chromosome 17. Genomics 25, 238-247.
51. Uberbacher, E. C. and Mural, R. J. (1991) Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc. Natl. Acad. Sci. USA 88, 11,261-11,265.
52. Shah, M. B., Guan, X., Einstein, J. R., Matis, S., Xu, Y., Mural, R. J., and Uberbacher, E. C. (1994) User's guide to GRAIL and GENQUEST (sequence analysis, gene assembly and sequence comparison systems) e-mail servers and XGRAIL (Version 1.2) and XGENQUEST (Version 1.1) client-server systems. Available by anonymous ftp to arthur.epm.ornl.gov (128.219.9.76) from directory pub/xgrail or pub/xgenQuest as file Manual.grail-genquest.
53. Boguski, M. and Schuler, G. D. (1995) ESTablishing a human transcript map. Nat. Genet. 10, 369-371.
54. Soares, M. B., Bonaldo, M. F., Jelene, P., Su, L., Lawton, L., and Efstratiadis, A. (1994) Construction and characterization of a normalized cDNA library. Proc. Natl. Acad. Sci. USA 91, 228-232.
55. Adams, M. D., Soares, M. B., Kerlavage, A. R., Fields, C., and Venter, J. C. (1993) Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nat. Genet. 4, 373-380.
56. Polymeropoulos, M. H., Xiao, H., Sikela, J. M., Adams, M., Venter, J. C., and Merril, C. R. (1993) Chromosomal distribution of 320 genes from a brain cDNA library. Nat. Genet. 4, 381-386.
57. Berry, R., Stevens, T. J., Walter, N. A., Wilcox, A. S., Rubano, T., Hopkins, J. A., Weber, J., Goold, R., Soares, M. B., and Sikela, J. M. (1995) Gene-based sequence-tagged-sites (STSs) as the basis for a human gene map. Nat. Genet. 10, 415-423.
Linkage Analysis of Genetic Disorders
Eugene W. Taylor, Jianfeng Xu, Ethylin Wang Jabs, and Deborah A. Meyers
1. Introduction
1.1. Definition
Genetic disorders follow either a classic Mendelian dominant or recessive single-locus pattern of inheritance or a complex genetic pattern (multiple genes and environmental influences). In general, the complexity arises when the correspondence between genotype and phenotype is not one to one, owing to possible misclassification of phenotype, incomplete and age-dependent penetrance, phenocopies, genetic heterogeneity, and/or oligogenic inheritance. Errors in diagnosis could be the result of variable expression of a disease, with mildly affected individuals being misdiagnosed as unaffected. In the presence of incomplete or age-dependent penetrance, an individual who inherits a predisposing disease allele may not manifest the disease at all, or the chance of manifesting the disease may depend on his or her age. On the other hand, phenocopies are individuals who do not inherit the disease allele but have the disease in question, probably caused by environmental factors and/or other genes. Genetic heterogeneity is a situation where mutations in any one of several genes may result in an identical phenotype. Oligogenic inheritance requires the simultaneous presence of mutations in multiple genes.
1.2. Types of Approaches
The lack of a clear one-to-one relationship between genotype and phenotype makes genetic studies difficult; however, several genetic epidemiology approaches are helpful in determining whether there is a genetic component to a complex disorder. These approaches are used to determine whether a disorder is caused by environmental factors, polygenes (several genes affect the disorder,
each by a small amount), major genes (one or several major genes involved), or mixed polygenic and major genes (in addition to a major gene, there is still a residual polygenic effect).
1.2.1. Familial Aggregation (Relative Risk)
Although familial aggregation of a disorder could be caused by either common environmental factors within a family or genetic components, it is usually the first hint that a disorder may have a genetic component. In the presence of familial aggregation, the recurrence risk for relatives of an affected person is higher than that of the general population. Often, accurate estimates of the recurrence risk in relatives and of the population incidence and prevalence are difficult to obtain. Well-designed large-scale epidemiological studies and even longitudinal studies are needed (1). Relative risk, λ_R, defined as the incidence rate for a relative of an affected person divided by that for the general population, is one measure of familial aggregation. The subscript denotes the type of relative; for example, λ_o and λ_s are the risks to offspring and sibs, respectively. Risch (2) showed that genetic mapping is much easier for traits with high λ_s (for example, λ_s > 10) than for those with low λ_s (for example, λ_s < 2). Again, it may be difficult to obtain accurate estimates of these risks. Risk ratios are very high for Mendelian disorders, because family members of an affected individual may inherit the same gene, whereas the risk in the general population is very low.
1.2.2. Twin Studies
Twins constitute a unique sample design provided by nature, and are an excellent way to match for age and many environmental factors. The goal of twin studies is to compare similarities (correlation coefficients for quantitative traits and concordance rates for qualitative traits) in monozygotic (MZ) and dizygotic (DZ) twins. A large difference in the degree of similarity between MZ and DZ twins suggests a genetic component. For example, in a study of 938 female twin pairs, the concordance rate for major depression was 37.3% in MZ twins, compared to a DZ rate of 23.9%, which suggests a genetic etiology (3).
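The two measures introduced above, the relative risk and the MZ/DZ concordance comparison, reduce to simple arithmetic. The following minimal sketch uses the concordance rates quoted in the text for major depression (3); the sibling and population risks are hypothetical values chosen only for illustration, and the function names are our own.

```python
def relative_risk(risk_in_relatives: float, population_risk: float) -> float:
    """Relative risk (lambda): risk to a given type of relative / population risk."""
    if population_risk <= 0:
        raise ValueError("population risk must be positive")
    return risk_in_relatives / population_risk


def concordance_difference(mz_rate: float, dz_rate: float) -> float:
    """Excess of MZ over DZ concordance; a large excess suggests a genetic component."""
    return mz_rate - dz_rate


# Hypothetical risks, for illustration only (not taken from any study).
lambda_s = relative_risk(risk_in_relatives=0.04, population_risk=0.002)

# Concordance rates quoted in the text for major depression (3).
mz_rate, dz_rate = 0.373, 0.239

print(f"lambda_s = {lambda_s:.1f}")
print(f"MZ - DZ concordance = {concordance_difference(mz_rate, dz_rate):.3f}")
```

With these hypothetical inputs the sibling relative risk falls in the range described above as favorable for mapping (λ_s > 10). In practice, both risks carry sampling error that a simple ratio like this does not capture.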
Studies of twins raised apart (adopted) can be very useful in partitioning environmental influences.
1.2.3. Segregation Analysis
Segregation analysis is used in Mendelian disorders to estimate various parameters, such as penetrance, whereas for complex disorders, segregation analysis is a useful tool to identify the mode of inheritance and estimate important parameters. In complex segregation analysis, the fit of various specified models to the observed inheritance pattern in the pedigrees is compared.
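One common way to trade off goodness of fit against the number of parameters when comparing such models is Akaike's information criterion (AIC). The sketch below is only an illustration of that idea: the text does not state which criterion is used, and the log-likelihoods and parameter counts are hypothetical values rather than output from any segregation analysis program.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike's information criterion: lower values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical maximized ln-likelihoods and parameter counts for three models.
models = {
    "polygenic": (-520.0, 3),
    "single major locus": (-505.0, 4),
    "mixed (major locus + polygenes)": (-504.5, 6),
}

scores = {name: aic(lnl, k) for name, (lnl, k) in models.items()}
best = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: AIC = {score:.1f}")
print("preferred model:", best)
```

In this made-up example the mixed model has a slightly higher likelihood than the single major locus model, but not by enough to justify its two extra parameters, so the simpler model is preferred.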
In other words, how likely is it that we observe the inheritance pattern in the available pedigrees if the inheritance pattern is polygenic, one-locus major gene, or multiple genes? The best-fitting and most parsimonious model (the model with higher likelihood and fewer parameters) is taken as the most likely mode of inheritance. Models from segregation analysis are needed if the classic LOD score approach is used for the linkage analysis. For example, a complex segregation analysis, using the computer program S.A.G.E. (4), of log IgE levels (adjusted for age) in a Dutch asthma family study population (5) was performed, since IgE levels, an easily measured quantitative trait, correlate with the presence of asthma. Evidence was obtained for a major gene inherited as a recessive trait, yielding a model that could be used for linkage.
Unfortunately, segregation analysis is sensitive to bias in the ascertainment of families. The common ascertainment scheme for linkage analyses, of selecting only pedigrees with multiple affected members, may lead to false evidence of Mendelian inheritance and also to an overestimate of gene frequency and penetrance. However, if families are selected through a single proband, such as in the Dutch asthma study described above, segregation analysis with adjustment for ascertainment is possible. It is often difficult to detect a major gene with relatively small family sizes. Usually only one-locus analysis is performed, and it is difficult to analyze for the presence of multiple distinct loci (6). However, multilocus segregation analysis is especially worth considering if there is a quantitative measure related to disease status that is easily measured.
2. Parametric Linkage Analysis
2.1. Definition
Genetic linkage is used to elucidate the underlying genetic mechanisms for inherited disorders (traits) and to find chromosomal locations for the susceptibility disease genes. The demonstration of linkage is often considered the highest level of statistical “proof” that a disease is the result of a genetic mechanism (7). At present, there are two major categories of genetic linkage analysis: parametric linkage analysis using family pedigree methods, and allele-sharing analysis using relative pairs (especially sib pairs). Genetic linkage is defined as the violation of Mendel’s law of independent assortment. The law states that the alleles at two chromosome locations (loci) assort independently and are transmitted to offspring in random combinations. Nonindependent assortment occurs when genetic loci are positioned near each other on the same chromosome (Fig. 1). As the distance between two loci increases, crossovers between the two loci (and hence the recombination fraction) increase, producing new haplotypes (the alleles for a chromosomal region received by an individual from a given parent [8]).
Fig. 1. Pedigree demonstrating linkage of a disease locus to a marker. Children a, b, and c show no recombination (N) between the marker and the disease locus. Child d shows the occurrence of a recombination event (R). This child has the disease allele 4, but is unaffected.
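The counting behind Fig. 1 can be sketched as follows. This is a minimal illustration, assuming phase-known and fully informative meioses; the function names and the haplotype labels AB/ab are hypothetical and do not come from any linkage package. The recombination fraction is estimated as the proportion of recombinant children, and a given recombination fraction fixes the expected frequencies of the four gamete types produced by a doubly heterozygous parent.

```python
def estimate_theta(recombinants, nonrecombinants):
    """Proportion of recombinant meioses among phase-known, informative meioses."""
    total = recombinants + nonrecombinants
    if total == 0:
        raise ValueError("no informative meioses")
    return recombinants / total


def gamete_frequencies(theta):
    """Expected gamete frequencies from a parent carrying haplotypes AB and ab.

    theta = 0.5 reproduces independent assortment: all four gamete types
    are equally likely. Smaller theta favors the parental types AB and ab.
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    parental = (1.0 - theta) / 2.0
    recombinant = theta / 2.0
    return {"AB": parental, "ab": parental, "Ab": recombinant, "aB": recombinant}


# Fig. 1: children a, b, and c are nonrecombinant (N); child d is recombinant (R).
theta_hat = estimate_theta(recombinants=1, nonrecombinants=3)
print(theta_hat)                 # 0.25
print(gamete_frequencies(0.5))   # each of the four gamete types has frequency 0.25
```

With only four meioses the estimate is very imprecise, which is one reason likelihood methods and multiple families are used in practice.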
Parametric linkage analysis involves the comparison of likelihoods of observing the segregation pattern of two loci within the pedigree for several specific hypotheses. First, the likelihood of observing the segregation pattern of two loci assuming the null hypothesis of no genetic linkage is calculated, that is, independent assortment between the two loci, or θ = 0.5:

$$Z(\theta) = \log_{10}\left[\frac{L(\theta)}{L(\theta = 0.5)}\right] \qquad (1)$$

where Z = LOD, θ = recombination fraction, and L = likelihood of observing the patterns of inheritance at the given θ. Next, the likelihoods for each of several alternative hypotheses (different extents of crossing over, i.e., different recombination fractions) are calculated and compared with the likelihood of the null hypothesis by means of an “odds ratio.” This is commonly done using the LINKAGE computer programs (9). The odds ratio consists of the likelihood of an alternative hypothesis divided by the likelihood of the null hypothesis. For Mendelian disorders, an odds ratio of >1000:1 is usually considered evidence for linkage (10). Clinical aspects of the disorder being studied, i.e., late age of onset, failure of affected individuals to reproduce, or mode of inheritance, are all factors that make it unlikely that a single family will provide significant evidence for linkage, so often multiple small families are used. The LOD scores are summed over pedigrees, as seen in the example in Table 1. To allow summation over pedigrees, the base 10 logarithm of the odds ratio (the LOD score) is reported at different recombination fractions. Strong evidence for linkage of the locus for Treacher Collins syndrome and a marker on chromosome 5 (D5S210)
Table 1
LOD Scores of Families with Treacher Collins Syndrome and Marker D5S210^a

            Recombination fraction, θ
Family      0.01    0.05    0.10    0.20    0.30
1           0.97    1.49    1.55    1.34    0.94
2           0.42    0.39    0.36    0.28    0.20
3           1.78    1.65    1.49    1.13    0.73
4           1.19    1.11    1.02    0.82    0.59
5           0.29    0.26    0.22    0.13    0.06
6           0.59    0.54    0.47    0.32    0.17
7           3.01    2.77    2.46    1.80    1.14
8           0.29    0.26    0.21    0.13    0.06
Total       8.54    8.47    7.78    5.95    3.89

^a Adapted from Jabs et al. (11).
was obtained (Table 1). A total LOD score of 8.54 at a recombination fraction of 1% was obtained, suggesting that the disease locus maps very close to this marker (11). Family 7 by itself has an LOD score of >3; the magnitude of the resulting LOD score is affected by family size and informativeness of a given marker. Markers with a heterozygosity of >0.70 are generally used; this helps increase the power of the study by making more pedigrees informative. To search the entire genome, markers mapped at 10-cM intervals are often used, resulting in genotyping approximately 350 markers. The density of markers used in such a genome screen will depend on the informativeness of the markers, structure and number of families, and mode of inheritance for the disease.
If the two loci being studied are both genetic markers, the parametric linkage analysis is straightforward, because the mode of inheritance of a genetic marker is usually codominant and there is a one-to-one relationship between genotype and phenotype. The situation is similar for a simple Mendelian disease locus, because by definition, the disorder is controlled by a major locus with known mode of inheritance, and it is safe to infer the genotype from the phenotype. There may be rare cases of misdiagnosis, and it may be necessary to estimate the degree of penetrance for unaffected family members.
Linkage analysis has been successfully applied to many Mendelian traits. The simplest situation is when unequivocal linkage can be demonstrated in a single large pedigree with LOD score >3, even though other families may show no linkage (genetic heterogeneity). If linkage cannot be established on the basis of any single pedigree or seen in the total sample of families, one can ask whether a subset of pedigrees collectively shows evidence of linkage. Of course, one cannot simply choose those families with positive LOD scores. Such an ex post selection criterion will always produce a positive LOD score. However, families can be selected on the basis of a priori considerations (for example, different clinical presentations). The admixture test using the computer program HOMOG can be used to test for genetic heterogeneity when the families are not divided into groups based on other criteria, such as clinical differences (8). For small families, it is difficult to estimate accurately the degree of heterogeneity from this type of analysis.
2.2. Problems in Complex Disorders
Parametric linkage analysis may not be useful for a complex disorder, mainly because of the breakdown of the simple relationship between phenotype and genotype, caused by the following:
1. Misdiagnosis: the misdiagnosed affected family members are not susceptibility gene carriers, whereas the misdiagnosed “unaffecteds” actually carry the susceptibility gene;
2. Incomplete penetrance: owing to reduced penetrance, a certain percentage of the unaffected family members are susceptibility gene carriers;
3. Phenocopy: individuals with the disorder are affected by some other mechanism and do not have the susceptibility gene under study (possibly a different gene);
4. Heterogeneity: some affected families have a genetic defect in another locus and thus do not have the susceptibility gene under study; and
5. Oligogenic inheritance: a disease phenotype is the result of several defective genes, either additive or interactive.
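To make items 2 and 3 of this list concrete, the sketch below applies Bayes' rule at the population level. It is only an illustration under simplifying assumptions: a single susceptibility genotype, no pedigree information, and three hypothetical parameters (`carrier_freq`, `penetrance`, `phenocopy_rate`) whose example values are invented for the purpose.

```python
def p_carrier_given_affected(carrier_freq, penetrance, phenocopy_rate):
    """Probability that an affected person carries the susceptibility genotype."""
    affected_carriers = carrier_freq * penetrance
    affected_noncarriers = (1.0 - carrier_freq) * phenocopy_rate
    return affected_carriers / (affected_carriers + affected_noncarriers)


def p_carrier_given_unaffected(carrier_freq, penetrance, phenocopy_rate):
    """Probability that an unaffected person carries the susceptibility genotype."""
    unaffected_carriers = carrier_freq * (1.0 - penetrance)
    unaffected_noncarriers = (1.0 - carrier_freq) * (1.0 - phenocopy_rate)
    return unaffected_carriers / (unaffected_carriers + unaffected_noncarriers)


# Simple Mendelian case: full penetrance, no phenocopies.
# Phenotype and genotype correspond exactly.
print(p_carrier_given_affected(0.01, 1.0, 0.0))    # 1.0
print(p_carrier_given_unaffected(0.01, 1.0, 0.0))  # 0.0

# Hypothetical complex case: reduced penetrance plus phenocopies.
# Neither probability is 0 or 1, so phenotype no longer determines genotype.
print(p_carrier_given_affected(0.01, 0.5, 0.01))
print(p_carrier_given_unaffected(0.01, 0.5, 0.01))
```

Within a real pedigree the prior carrier probability depends on the relatives' genotypes rather than on a population frequency, so these numbers only indicate the direction of the effect.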
Thus, in a given family, the phenotype “affected” may or may not be owing to the specific gene under study. For linkage studies, it is necessary to infer an individual’s genotype for the susceptibility gene from his or her phenotype. The breakdown in the relationship between phenotype and genotype increases the difficulty of finding linkage using parametric linkage analysis (6). These factors affect all methods of linkage analysis of complex disorders, including allele-sharing methods, because they create uncertainty. However, the impact tends to be greater in parametric linkage analysis, where the results are the outcome of two components: a correctly specified model and linkage. As can be seen, the correct model is often difficult to determine for complex disorders.
2.3. Strategies Used in the Analysis of Complex Disorders
Parametric linkage analysis for complex disorders, however, is by no means useless. Understanding these difficulties may help researchers to overcome them, and there are several successful examples, such as early onset breast cancer (12). Several strategies can be considered in the parametric linkage analysis of complex disorders.
Overestimating the degree of penetrance can lead to spurious evidence against linkage owing to individuals who inherit a trait-causing allele, but are unaffected. “Affected only” parametric linkage analysis is a common practice used to deal with the problem of incomplete and age-dependent penetrance (6). This type of analysis might decrease the effective number of meioses. However, it decreases the possible impact of false recombinants from unaffected family members who are gene carriers.
In the case of an obscure phenotype where there may be a relatively high rate of misdiagnosis, various alternative diagnostic schemes can be applied. However, it is then necessary to adjust for the number of disease models used when determining significance. Another approach is first to study a related phenotype where information on a genetic model may be available. An example of this is total serum IgE levels, a quantitative measure correlated with the presence of asthma (13). After obtaining evidence for a major locus for IgE regulation mapping to 5q, linkage analysis with the asthma phenotype was performed, resulting in evidence for linkage to this same region (14).
Parametric linkage analysis has also been successfully applied to disorders with genetic heterogeneity. If available, a clinical variable, such as age of onset or severity, can be used to subdivide a sample into two groups of pedigrees. Families can thus be selected on the basis of a priori considerations. An example of this approach is provided by the genetic mapping of a gene for early onset breast cancer (BRCA1) to chromosome 17q. Families were added to the linkage analysis in order of their average age of onset, resulting in an LOD score that rose steadily to a peak of 6.0 with the inclusion of families with onset before age 47 and then fell with addition of late onset pedigrees (12).
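The ordering strategy used in the breast cancer example can be sketched as a running sum. The numbers below are hypothetical and are not the published breast cancer data; the function names are illustrative only. Because LOD scores add across families, sorting families by mean age of onset and accumulating their scores shows where the evidence for linkage peaks.

```python
from itertools import accumulate


def cumulative_lod_by_onset(families):
    """families: iterable of (mean_age_of_onset, lod_score) pairs.

    Returns (age, running_total) pairs with families added youngest first.
    """
    ordered = sorted(families)
    ages = [age for age, _ in ordered]
    totals = list(accumulate(lod for _, lod in ordered))
    return list(zip(ages, totals))


def peak(cumulative):
    """The (age, running_total) pair with the largest running total."""
    return max(cumulative, key=lambda pair: pair[1])


# Hypothetical per-family LOD scores at one fixed recombination fraction.
families = [(52, -0.8), (38, 1.1), (44, 0.9), (60, -1.5), (41, 1.4), (47, 0.2)]

curve = cumulative_lod_by_onset(families)
age_at_peak, best_total = peak(curve)
print(age_at_peak)            # 47
print(round(best_total, 2))   # 3.6
```

The age cutoff is chosen a priori as an ordering variable; picking families by the sign of their LOD score instead would be the ex post selection warned against earlier.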
Notwithstanding these successes, many failed linkage studies may result from the presence of a high degree of heterogeneity. It is usually wise to try to define a clinically homogeneous set of families. Although several simulation studies have suggested that, in a disorder caused by two genes, a single-locus approximation has high power to detect linkage (15), a correctly specified two-locus model can sometimes significantly increase evidence for linkage. An example is the parametric linkage analysis between the locus for IgE levels and markers on chromosome 5q. An LOD score of 3.0 for marker D5S436 was first reported using a one-locus recessive model in a Dutch asthma family study. After a subsequent segregation analysis suggested that a two-locus recessive model fit the inheritance pattern significantly better than a one-locus recessive model, parametric linkage analysis using the best two-locus model gave an LOD score of 4.6 for the same marker (16).
2.4. Multipoint Mapping
It is possible to combine information from several markers to increase the informativeness of the families. Families that are not informative for a specific marker may be informative for the flanking markers. This method can be used to pinpoint the most likely map location for the disease gene. In Table 2, the multipoint analysis shows that the most likely location for BRCA1 is close to D17S74. As described previously, families with a young age of onset showed the strongest evidence for linkage. Multipoint analysis is sensitive to errors in genotyping and phenotyping, and care must be taken to ensure data integrity. Linkage disequilibrium studies (linkage disequilibrium being a deviation from the random occurrence of specific alleles in haplotypes) can then be used to refine further the location of the disease gene (8). This approach is especially effective if the families come from an isolated population, thus increasing the possibility of a founder effect.
2.5. Cautions
Misspecification of marker allele frequencies can cause false-positive linkage results, especially in families where many parents are untyped. This is because underestimation of allele frequencies may lead to spurious linkage information. For example, if cousins share a “rare” allele, this suggests the presence of linkage. However, if the grandparents are deceased, they may have been homozygous for the allele in question and the cousins may actually have inherited different copies of the allele. Thus, it is important to consider the allele frequencies from both population data and the study sample. The other problem is multiple testing, mainly owing to the uncertainty of modes of inheritance. This will inflate the type I error (i.e., false positives) and make LOD score results difficult to interpret (6). Two approaches are very useful in these cases.
The first is a computer simulation method in which marker data with no linkage are simulated using the same pedigree information (availability of typed persons) and the same marker characteristics (heterozygosity) as for the marker where the highest LOD score was observed (6). Then, the simulated data are analyzed using the same approaches (number of models tested) that were used in the actual analysis. An empirical significance level is then obtained. The other approach is to adjust the significance level by the number of models tested (3 + log[number of models tested]) (8). It is difficult to determine the exact cut point for significance in complex disorders. On the one hand, it is important to type additional markers in any region with a suggestion of linkage, especially in regions with known candidate genes. On the other hand, it is important to realize that this may be a false-positive result. Replications between studies are very important (17).
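Both corrections can be sketched in a few lines. The example rests on stated assumptions: the logarithm in the adjusted threshold is taken as base 10 (consistent with the LOD scale), the empirical significance level uses a common add-one convention, and the simulator is a deliberately crude stand-in that draws phase-known meioses under no linkage rather than simulating real pedigrees. All function names and numeric inputs are hypothetical.

```python
import math
import random


def adjusted_lod_threshold(n_models, base_threshold=3.0):
    """Threshold of 3 raised by log10 of the number of models tested (assumed base 10)."""
    if n_models < 1:
        raise ValueError("at least one model must have been tested")
    return base_threshold + math.log10(n_models)


def max_two_point_lod(recombinants, meioses):
    """LOD maximized over theta in [0, 0.5] for phase-known meioses."""
    if meioses <= 0:
        raise ValueError("meioses must be positive")
    theta = min(recombinants / meioses, 0.5)
    lod = meioses * math.log10(2.0)
    if recombinants > 0:
        lod += recombinants * math.log10(theta)
    if recombinants < meioses:
        lod += (meioses - recombinants) * math.log10(1.0 - theta)
    return lod


def simulate_unlinked_max_lods(n_replicates, meioses, seed=0):
    """Toy null simulation: each meiosis is recombinant with probability 0.5."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_replicates):
        recombinants = sum(rng.random() < 0.5 for _ in range(meioses))
        results.append(max_two_point_lod(recombinants, meioses))
    return results


def empirical_p_value(observed_max_lod, simulated_max_lods):
    """Share of unlinked replicates reaching the observed LOD (add-one convention)."""
    exceed = sum(1 for lod in simulated_max_lods if lod >= observed_max_lod)
    return (exceed + 1) / (len(simulated_max_lods) + 1)


observed = 2.1  # hypothetical maximum LOD from the real analysis
null_lods = simulate_unlinked_max_lods(1000, meioses=12, seed=1)
print(adjusted_lod_threshold(3))
print(empirical_p_value(observed, null_lods))
```

In a real study each replicate would be analyzed under every model tried in the actual analysis, and the largest LOD score across models would be recorded.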
Table 2
LOD Scores Based on Multipoint Analysis of Families with Breast Cancer Grouped by Age of Onset^a

Markers, in map order: D17S78, D17S41, D17S74

                   Families (age of onset)
Map distance^b     ≤45       46-51     >51
0.00               2.83      -1.30     -6.70
0.02               3.09      -0.07     -5.80
0.04               3.30       0.01     -5.51
0.06               3.47       0.03     -5.52
0.08               3.57      -0.05     -5.89
0.10               3.41      -0.20     -6.98
0.12               4.46      -1.58     -4.60
0.14               4.60      -2.71     -7.94
0.16               5.24      -9.14    -15.21
0.184              5.41      -5.61     -8.94
0.208              5.24      -4.24     -6.79

^a Adapted from Hall et al. (12).
^b Approximate map locations. There is 10% recombination between D17S78 and D17S41, and 6% between D17S41 and D17S74.
3. Nonparametric (Allele-Sharing) Methods
3.1. Definition
Unlike parametric linkage analysis, which depends on assuming a genetic model, allele-sharing methods are not based on a specific disease model. One simply tests whether the inheritance pattern of markers for a chromosome region is consistent with random Mendelian segregation. If there is linkage between a locus for a trait and a chromosomal region, then for a qualitative trait, affected relative pairs should share alleles identical by descent (IBD), that is, inherited from a common ancestor within the pedigree, more often than expected under Mendelian inheritance and independent assortment. For a quantitative trait, relative pairs should show a correlation between the magnitude of their phenotypic difference and the number of alleles shared IBD. Sib-pair methods are the simplest and most commonly used allele-sharing methods (18).
3.2. Qualitative Trait: Affected Sib-Pair Analysis
Considering the possible problems in complex disorders, especially incomplete and age-dependent penetrance and misdiagnosis, many researchers focus on affected sib-pair methods, although the theories also apply to unaffected sib-pairs. Under the hypothesis of no linkage between a disease-predisposing locus and a marker, the sharing of marker alleles IBD by affected sib-pairs will be independent of their phenotype and follow the Mendelian expectation of sharing 0, 1, and 2 alleles IBD with frequencies of 0.25, 0.5, and 0.25. This distribution can also be expressed as a mean proportion of alleles shared IBD of 0.5, that is, [0(0.25) + 1(0.5) + 2(0.25)]/2 = 0.5. If, however, there is linkage between the disease-predisposing locus and a marker, the sharing of marker alleles IBD will deviate from the above distribution. Several statistical methods have been proposed to test this deviation. One of the most powerful is the mean test, which tests whether the mean proportion of marker alleles shared IBD is significantly different from 0.5 (19). Table 3 shows sib-pair analysis based on the mean test for bipolar disorder and markers located on chromosome 18 (20). Increased sharing (>0.5) was observed for several markers, although most were not highly significant. Another newly developed affected sib-pair method is the likelihood method (2,21), where a LOD score is calculated from the ratio of two likelihoods: the likelihood of the observed marker allele IBD sharing of affected sib-pairs and the likelihood of sharing IBD under the null hypothesis of no linkage, that is, Mendelian expectation. Affected sib-pair allele-sharing methods can also be used to investigate possible parental origin effects for the disorders. One can look at affected sib-pairs sharing paternal and maternal alleles IBD separately. This may be useful in the
Table 3
Results of Affected Sib-Pair Analyses for Bipolar Disorder and Chromosome 18 Markers^a

Marker      # Pairs    Mean
D18S59      109        0.50
D18S54       61        0.50
D18S62       86        0.50
D18S843      91        0.51
D18S464      68        0.55
D18S53      112        0.56
D18S71       96        0.49
D18S37       46        0.64
D18S48       56        0.53
D18S40       84        0.57
D18S45       45        0.52

p value^b entries (row alignment not recoverable): 0.05, 0.02, 0.001, 0.09, 0.02

^a Adapted from Stine et al. (20).
^b All p values
presence of imprinting or mitochondrial inheritance. For example, in bipolar disorder, there is evidence for linkage to chromosome 18, and excess sharing is especially pronounced in paternally transmitted alleles (p = 0.004) (20).
3.3. Quantitative Trait
The basis for the allele-sharing method for a quantitative trait is straightforward: siblings that share more alleles IBD at a locus should be more similar in phenotypic measurement than siblings that share fewer alleles. Thus, the squared difference of phenotype values between sibs can be regressed on the sharing of marker alleles IBD (18). There is evidence for linkage if the regression coefficient is significantly negative (i.e., sibs with a small difference tend to share two alleles). An example of this method is the sib-pair analysis for total IgE levels in the Dutch asthma family study. Significant negative regression coefficients were found for several markers on 5q (5). As previously described, positive LOD scores were obtained for these same markers using the genetic model obtained from the segregation analysis.
3.4. Multipoint Sib-Pair Analysis
Most allele-sharing methods are primarily based on studying genetic markers one at a time. Such analyses may be inadequate, since the exact IBD status cannot always be inferred at the marker loci (for example, if parents were not genotyped). Kruglyak and Lander (22) proposed a method of complete multipoint analysis using the information from all genetic markers to infer the full probability distribution of the IBD status at each point along the chromosome.
3.5. Advantages and Limitations
Allele-sharing methods are nonparametric linkage analyses, that is, they require no prior assumptions about such parameters as mode of inheritance, penetrance, phenocopy rate, and disease allele frequency. In this sense, they are more robust than parametric methods because they do not depend on as many potentially erroneous model assumptions. Moreover, the problem of trying multiple models and correcting for inflation of the LOD score (as is often required in such cases) is avoided in these approaches, although one must still correct for multiple diagnostic schemes. The trade-off is that allele-sharing methods are often less powerful than a correctly specified linkage model (6). Sib-pair methods are important tools for linkage studies of complex disorders, and are often used for genome screens. In addition to the advantages described above, sib-pairs are relatively easy to ascertain in large numbers, and tend to be more closely matched for age and environment than other relative pairs.
It would, however, be incorrect to conclude that the genetic model of the disease is irrelevant. The fact that a model is not required in the analysis only implies that the model cannot be misspecified. Thus, false-negative or false-positive findings will not be owing to the use of an incorrect model. Instead, the mode of inheritance of the disease influences the power of allele-sharing methods directly. Determining the mode of inheritance for major genes for susceptibility to a complex disorder may provide useful information for understanding the pathophysiology of the disorder.
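The two sib-pair statistics from Sections 3.2 and 3.3 can be sketched as follows. This is a simplified illustration, not the procedure of refs. 18 or 19: it assumes that IBD status is known without ambiguity for every pair, uses a normal approximation for the mean test (under no linkage, the per-pair proportion of alleles shared IBD has mean 0.5 and variance 1/8), and returns only the slope of the regression without a significance test. Function names and the example inputs are hypothetical.

```python
import math


def mean_ibd_test(n_share0, n_share1, n_share2):
    """One-sided mean test for affected sib-pairs.

    Arguments are counts of pairs sharing 0, 1, and 2 alleles IBD.
    Returns (mean proportion shared, z statistic, one-sided p value).
    """
    n = n_share0 + n_share1 + n_share2
    if n == 0:
        raise ValueError("no sib-pairs")
    mean = (0.5 * n_share1 + 1.0 * n_share2) / n
    # Null hypothesis: mean 0.5, variance 0.125 per pair.
    z = (mean - 0.5) / math.sqrt(0.125 / n)
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    return mean, z, p


def squared_difference_slope(squared_diffs, ibd_proportions):
    """Least-squares slope of squared sib trait difference on IBD proportion.

    A clearly negative slope is the pattern expected under linkage.
    """
    n = len(squared_diffs)
    if n != len(ibd_proportions) or n < 2:
        raise ValueError("need two equal-length lists with at least two pairs")
    mean_x = sum(ibd_proportions) / n
    mean_y = sum(squared_diffs) / n
    sxx = sum((x - mean_x) ** 2 for x in ibd_proportions)
    if sxx == 0:
        raise ValueError("IBD proportions do not vary")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(ibd_proportions, squared_diffs))
    return sxy / sxx


# Exactly Mendelian sharing (25:50:25) gives no evidence for linkage.
print(mean_ibd_test(25, 50, 25))                                  # (0.5, 0.0, 0.5)

# Hypothetical quantitative-trait pairs: more sharing, smaller differences.
print(squared_difference_slope([4.0, 2.0, 0.0], [0.0, 0.5, 1.0]))  # -4.0
```

When parents are untyped, IBD proportions must be estimated rather than counted, which is the situation the multipoint method of Section 3.4 addresses.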
Once evidence for linkage is obtained, more complex modeling, such as two- and three-locus analysis or MOD score analysis (changing the model to maximize the LOD score) (23), may provide further insight into disease mechanisms.
4. Summary
Basic principles and methods of genetic analysis were covered in this chapter. The approaches of linkage analysis for Mendelian or complex disorders can be summarized in the following flowchart (Fig. 2). It is important that the clinical, analytical, and molecular investigators be involved in all steps in the process. Mapping genes for complex disorders is often more difficult than mapping genes for Mendelian disorders, but both may prove to be very important in understanding disease processes and designing new treatments. Practical use of computer programs available for genetic analysis is detailed elsewhere (24).
Question                                                    Study Design
Is there Mendelian inheritance or familial aggregation?     Family Clinical Study
For complex disorders, is there a major gene?               Segregation Study
  Is it dominant or recessive?
Where is this major gene in the human genome?               Linkage Analysis
  Is there linkage with DNA markers under a specific          A. Parametric Approach
  genetic model?
  Is there increased allele sharing for affected              B. Allele-Sharing Approach
  relatives (sib pairs) or for relatives with similar            (sib-pair analyses)
  phenotype?
Analysis repeated after typing additional markers in        Multipoint and Fine Mapping
  region to narrow the region of interest

Fig. 2. Flowchart of linkage analysis.
References
1. Khoury, M. J., Beaty, T. H., and Cohen, H. B. (1993) Fundamentals of Genetic Epidemiology. Oxford University Press, New York.
2. Risch, N. (1990) Linkage strategies for genetically complex traits. II. The power of affected relative pairs. Am. J. Hum. Genet. 46, 229-241.
3. Kendler, K. S., Neale, M. C., Kessler, R. C., Heath, A., and Eaves, L. J. (1993) A longitudinal twin study of 1-year prevalence of major depression in women. Arch. Gen. Psychiatry 50, 843-852.
4. S.A.G.E. (1994) Statistical Analysis for Genetic Epidemiology, Release 2.2. Computer program package available from the Department of Biometry and Genetics, LSU Medical Center, New Orleans, LA.
5. Meyers, D. A., Postma, D. S., Panhuysen, C. I. M., Xu, J., Amelung, P. J., Levitt, R. C., and Bleecker, E. R. (1994) Evidence for a locus regulating total serum IgE levels mapping to chromosome 5. Genomics 23, 464-470.
6. Lander, E. S. and Schork, N. J. (1994) Genetic dissection of complex traits. Science 265, 2037-2048.
7. Elston, R. C. (1981) Segregation analysis. Adv. Human Genet. 11, 63-120.
8. Ott, J. (1992) Analysis of Human Genetic Linkage. Johns Hopkins University Press, Baltimore, MD.
9. Lathrop, G. M., Lalouel, J. M., Julier, C., and Ott, J. (1984) Strategies for multilocus linkage analysis in humans. Proc. Natl. Acad. Sci. USA 81, 3443-3446.
10. Morton, N. E. (1955) Sequential tests for the detection of linkage. Am. J. Hum. Genet. 7, 277-318.
11. Jabs, E. W., Li, X., Coss, C. A., Taylor, E. W., Meyers, D. A., and Weber, J. L. (1991) Mapping the Treacher Collins Syndrome locus to 5q31.3-q33.3. Genomics 11, 193-198.
12. Hall, J. M., Lee, M. K., Newman, B., Morrow, J. E., Anderson, L. A., Huey, B., and King, M. C. (1990) Linkage of early-onset familial breast cancer to chromosome 17q21. Science 250(4988), 1684-1689.
13. Sears, M., Burrows, B., Flannery, E. M., Herbison, G. P., Hewitt, C. J., and Holdaway, M. D. (1991) Relation between airway responsiveness and serum IgE in children with asthma and in apparently normal children. N. Engl. J. Med. 325, 1967-1971.
14. Panhuysen, C. I. M., Levitt, R. C., Postma, D. S., Xu, J., Amelung, P. J., Holroyd, K. J., Altena, R. V., Koeter, G. H., Meyers, D. A., and Bleecker, E. R. (1995) Evidence for a susceptibility locus for asthma mapping to chromosome 5q. J. Invest. Med. 43(Suppl.), 281A.
15. Greenberg, D. A. and Hodge, S. E. (1989) Linkage analysis under “random” and “genetic” reduced penetrance. Genet. Epidemiol. 6, 259-264.
16. Xu, J., Levitt, R. C., Panhuysen, C. I. M., Postma, D. S., Taylor, E. W., Amelung, P. J., Holroyd, K. J., Bleecker, E. R., and Meyers, D. A. (1995) Evidence for two unlinked loci regulating total serum IgE levels. Am. J. Hum. Genet. 57, 425-430.
17. Thomson, G. (1994) Identifying complex disease genes: progress and paradigms. Nature Genet. 8, 108-110.
18. Haseman, J. K. and Elston, R. C. (1972) The investigation of linkage between a quantitative trait and a marker locus. Behav. Genet. 2, 3-19.
19. Blackwelder, W. C. and Elston, R. C. (1985) A comparison of sib-pair linkage tests for disease susceptibility loci. Genet. Epidemiol. 2, 85-97.
20. Stine, O. C., Xu, J. F., Koskela, R., McMahon, F. J., Gschwend, M., Friddle, C., Clark, C. D., McInnis, M. G., Sampson, S. G., Breschel, T. S., Vishio, E., Riskin, K., Feilotter, H., Chen, E., Shen, S., Folstein, S., Meyers, D. A., Botstein, D., Marr, T. G., and DePaulo, J. R. (1995) Evidence for linkage of bipolar disorder to chromosome 18 with a parent-of-origin effect. Am. J. Hum. Genet. 57, 1384-1394.
21. Holman, P. (1993) Asymptotic properties of affected-sib-pair linkage analysis. Am. J. Hum. Genet. 52, 519-527.
22. Kruglyak, L. and Lander, E. S. (1995) Complete multipoint sib-pair analysis of qualitative and quantitative traits. Am. J. Hum. Genet. 57, 439-454.
23. Hodge, S. E. and Elston, E. R. (1994) Lods, Wrods and Mods: the interpretation of lod scores calculated under different models. Genet. Epidemiol. 11, 329-342.
24. Terwilliger, J. D. and Ott, J. (1994) Handbook of Human Genetic Linkage. Johns Hopkins University Press, Baltimore, MD.
3
Gene Ordering and Localization by Linkage Analysis
R. E. March
1. Introduction
The essence of linkage analysis is a deviation from the Mendelian principle of independent, or random, assortment of gene pairs when transmitted from generation to generation. Two genes are said to be completely linked (see Section 4.1. for definitions of genetic terms) when there is no recombination between them; the same alleles or phenotypes are always transmitted together from generation to generation within a family. Two genes are completely unlinked if they are situated on different chromosomes; in this case, the transmission of alleles within a family will not deviate from the Mendelian principle of independent assortment. However, many gene pairs will be in an intermediate state of incomplete linkage, where there is a consistent and measurable deviation from independent assortment, but also a consistent recombination fraction between them. This recombination fraction is very roughly proportional to the physical distance between the two genes, and it is this principle that forms the basis of genetic linkage mapping.
In the early days of genetics, mapping by linkage analysis was used both to order genes and to estimate the genetic distance between them. Many of the maps of the human genome now available are derived from genetic linkage data. More recently, as the available technology has improved dramatically, physical maps are becoming more common, at least over smaller distances. However, linkage analysis continues to be crucial in the construction of genetic maps when the distances involved are too large to be covered by physical cloning vectors or contigs, when a physical map needs to be integrated with an existing genetic map or vice versa, or when the locus to be localized onto the existing map is detectable only by its phenotype, e.g., a physical or quantitative trait or a disease locus.
Linkage analysis in its simplest form is a matter of examining the transmission of alleles of polymorphic genes typed in DNA from fully informative individuals within pedigrees, and counting recombinants and nonrecombinants. Modern linkage analysis has become much more complex, both in order to maximize the amount of information that can be derived from pedigrees, and to infer the likely genotypes of any missing individuals. Likelihood calculations may be used to estimate recombination fractions, order of loci on the genetic map, evidence of linkage, and other parameters of the genetic model used, such as allele frequencies and penetrances. These calculations are therefore normally carried out using complex computer programs.
In this chapter, it is possible to give only a very brief introduction to the theory and practice of linkage analysis. For a fuller treatment, the reader is strongly recommended to refer to the literature listed in refs. 1-7. The procedures outlined here should be sufficient to allow the construction of a simple genetic map from data similar to that given in the imaginary example outlined below. However, readers should be aware that there are many possible pitfalls in the use of programs for linkage analysis, which may lead to highly misleading results. If a more extensive use of these techniques is needed, the best course of action is to contact a competent statistical geneticist.
This chapter covers the construction of a simple genetic map using polymorphic markers in human DNA from two- or three-generation families (the programs used for experimental crosses are slightly different). It is assumed that raw data from microsatellite or other alleles have already been gathered, possibly using the ABI 377 automatic DNA sequencer or other semiautomatic systems.
It also covers the coding of these raw data into numbered alleles, the setting up of computer files suitable for subsequent genetic linkage analysis, and the calculation of gene frequencies. The ordering of several markers, if unknown, into a genetic map, and the calculation of recombination fractions and genetic distance between the markers are then examined. Finally, the placing of a new locus, such as a phenotype or a disease locus, on this known genetic map is covered. This chapter concentrates on the use of the Genetic Analysis System (GAS) package (8), mainly because this is among the most user-friendly linkage analysis packages available, but other options are also covered briefly (see Section 4.3.).
1.1. Linkage Analysis: An Example
To illustrate the principles of gene ordering and localization using linkage analysis, an imaginary genetic problem is used throughout this chapter concerning the localization on a genetic map of a single gene of which one allele
Fig. 1. Pedigree of Smith family, in which the phenotype of red hair is segregating. Haplotype labels shown in the figure: D11 3-M15 2 / D11 1-M15 3; D11 2-M15 1 / D11 3-M15 3; D11 2-M15 5 / D11 3-M15 2; D11 2-M15 1 / D11 4-M15 5; D11 1-M15 2 / D11 4-M15 1; D11 3-M15 2 / D11 4-M15 5.
codes for red hair. Hair color is normally a complex trait, but serves to illustrate the problem of localizmg a phenotype, which may be a disease state, onto a genetic map that may include microsatellites, genes encoding polymorphic protems, or other polymorphic markers. In the Smith family, illustrated m Fig. 1, the single gene of which one allele encodes red hair is segregating m the family. Two markers have been typed, Dl 1 and Ml 5. All the grandparents are fortunately available and are homozygous for both markers. Mrs. Smith’s mother was red-haired and her father was brown-haired. Mrs. Smith, who is red-haired and probably heterozygous at the “red” locus, must therefore have inherited allele 2 of Dl 1 and allele 1 of Ml 5 from her mother. Her paternal haplotype carries allele 4 of D 11 and allele 5 of M15. Mr. Smith, who is brown-haired and has two brown-haired parents, has inherited the haplotype carrying allele 3 of D 11 and allele 2 of M 15 from his father. HIS maternal haplotype carries allele 1 of Dl 1 and allele 3 of Ml 5. Mr. and Mrs. Smith have four children, two boys and two girls. One of the boys and one of the girls has red hair. Both of the red-haired children have the alleles 2 and 3 at locus Dl 1. The boy also carries alleles 1 and 3 at locus M15, whereas the girl carries alleles 2 and 5. The brown-haired boy carries alleles 1 and 4 at locus D 11, and alleles 1 and 2 at Ml 5, whereas the girl carries alleles
3 and 4 at D11, and alleles 2 and 5 at locus M15. Thus, the alleles of D11 are cosegregating with the phenotype of red hair (there are more nonrecombinants than recombinants), but the alleles of M15 are segregating more or less randomly (there are equal numbers of recombinants and nonrecombinants). If we typed several more families, and continued to find no recombinants between locus D11 and “red,” we could conclude that these two loci were very tightly linked, with a LOD score of more than 3.0, the threshold normally accepted for significant linkage. In contrast, locus M15 appears to be unlinked to the gene responsible for red hair in this family. The rest of this chapter follows all the practical steps necessary for the localization of the gene “red” with respect to the markers D11 and M15.
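As a rough illustration of the LOD threshold mentioned above, the following Python sketch computes a two-point LOD score from counts of recombinant and nonrecombinant meioses. It assumes phase-known, fully informative meioses, which is a simplification of the likelihood calculations that GAS performs; the function name lod_score is hypothetical and is not part of GAS.

```python
import math


def lod_score(recombinants: int, nonrecombinants: int, theta: float) -> float:
    """Two-point LOD score Z(theta) for phase-known, fully informative meioses.

    Z(theta) = log10( L(theta) / L(0.5) ), with
    L(theta) = theta**recombinants * (1 - theta)**nonrecombinants.
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    meioses = recombinants + nonrecombinants
    # log10 likelihood when the loci are unlinked (theta = 0.5)
    log_l_unlinked = meioses * math.log10(0.5)
    # log10 likelihood at the chosen theta
    log_l = 0.0
    if recombinants:
        if theta == 0.0:
            # any recombinant is impossible at theta = 0
            return float("-inf")
        log_l += recombinants * math.log10(theta)
    if nonrecombinants:
        log_l += nonrecombinants * math.log10(1.0 - theta)
    return log_l - log_l_unlinked


if __name__ == "__main__":
    # ten informative meioses, none recombinant, evaluated at theta = 0
    print(round(lod_score(0, 10, 0.0), 2))
```

Under these assumptions, ten informative meioses with no recombinants give a LOD score of about 3.0 at a recombination fraction of 0, which shows why a single small family is rarely enough to reach the threshold.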
2. Materials
1. Data from polymorphic markers, such as microsatellites typed by semiautomated fluorescent methods.
2. The GAS package, written by and available from Alan Young via the World Wide Web on: http://info.ox.ac.uk/-ayoung/gas.html From this address, you will be able to obtain the latest versions of GAS for many types of operating systems, including IBM-PC DOS, SUN Solaris, and SunOS, plus a full manual and examples.
3. Method: Linkage Analysis
3.1. Preparation of Linkage-Format Pedigree and Data Files
If you are entering raw size data, e.g., from microsatellites genotyped on the ABI 373 or 377 automatic DNA sequencer, set your GENOTYPER program to produce a table like the following:
D11    140.4    150.38    148.78
These figures represent (from left to right) marker category, sample identification number, size of allele 1, and size of allele 2 (in base pairs). Export this table as a text file. If your table does not look like this, you can edit it in a spreadsheet program, such as Excel. You can delete any columns you want or exchange the position of any columns until it is in this format. Now save the file as text format, giving it a name such as “D11.txt”. If you have data from more than one marker and from more than one gel, you can combine all the data together in one large file
(but with all the data for each marker together), or you can divide the data into one file per marker. The allele sizes can be converted into numbered alleles (called “named” alleles in GAS), so that they can be used in subsequent analysis, and the gene frequencies and recombination fractions calculated, using the GAS package. To enter your raw data into GAS, you will need the text file or files you have just created, an initial pedigree file, which does not contain marker data, and a command or “gas” file.
3.1.1. Preparation of Pedigree File
The gas-format pedigree file should have six or seven columns, and look like this (the top two rows are not part of the final file, but you may find it helps to have them in place when you are creating the file and delete them just before use).
Pedigree  Family  Paternal  Maternal       Affection
          id      number    number    Sex  status
122       1       6         7         m    n
122       2       x         x         f    y
122       3       1         2         m    y
122       4       1         2         f    n
122       5       1         2         m    y
122       6       x         x         m    n
122       7       x         x         f    y
Unfortunately there is no real shortcut to creating a pedigree file apart from typing in the information directly, but once created, the same file will do for any further markers typed. The first column contains the code for each pedigree or family. The next column contains the identification for each family member. In this family, the parents are called 1 and 2, the children are 3, 4, and 5, and 6 and 7 are the paternal grandparents. The next two columns contain the identification for the father and mother, respectively, of each person, if that parent appears in the pedigree in your database. Therefore, the paternal and maternal identifications for the father of this pedigree are 6 and 7, respectively, but the mother’s father and mother do not appear in this database, so her paternal and maternal identifications are “x” for unknown. The paternal and maternal identifications of the three children are 1 and 2.
The grandparents’ parents do not appear in the pedigree, so their paternal and maternal identifications are marked as “x.” The fifth column contains the code for sex (m for male and f for female). This completes the requirements for the GAS-format pedigree, but if you wish, you can use a sixth column for affection status (if you are not using an affection locus, this column is blank): y for affected, n for unaffected, and x for unknown. Alternatively, you could put in the affection status as a separate file and merge it into the pedigree file with your allele data (see ref. 8 for details). Once these columns are completed, check that every individual in the pedigree is related to at least one other person by means of the paternal and maternal identification columns (i.e., person 122.2 is identified as the mother of person 122.3). If you have missing individuals in your pedigree, such as a missing mother or father, it is worth inserting a “phantom” individual with unknown data. Missing parents or nonconsecutive identification numbers (e.g., children 3, 4, and 6) can cause some linkage programs to crash. Once you have checked the pedigree file, save it in a text format, with a name such as “red.ped.”
3.1.2. Preparation of “Gas” File
The next file to write is much smaller than the other two, but it is the most crucial. This is the command file, which has the extension “gas.” It can be written in any word processing program that can save in a text format, or directly in a text editor. There is a very wide range of options that can be used via the command file, which are covered fully in the GAS manual. The following is a basic introduction to the use of gas files in genetic mapping. To write a command file to read in ABI-generated raw data, for example on the loci D11 and M15 in people who are affected or nonaffected at the locus red-hair, first you can view the distribution of the allele frequencies like this (the remarks after ! explain each line and should not be included in the file):
set outfile = bar; ! the program will write the file “bar.out”, which gives results of the barchart if there is no postscript file set
set logfile = bar; ! the program will write the file “bar.log”, which gives a record of any problems encountered
set psfile = bar; ! the program will write the file “bar.ps”, which gives results of a barchart in graphical format
set locus d11 named nofreq nodata; ! set the numbered locus D11 to be read into the pedigree file; no gene frequencies or other data known
set locus m15 named nofreq nodata; ! set the numbered locus M15 to be read into the pedigree file; no gene frequencies or other data known
read( pedigree red.ped ); ! read the gas-format pedigree file
read( alsize d11.txt m15.txt locus d11 m15 graph ); ! read in the two files containing raw size data for loci D11 and M15, and display a graphical barchart of the distribution of allele lengths of each allele
stop; ! end of program
Fig. 2. Allele sizes for locus d11.
Save this file as a text file with the extension “gas” (e.g., “red.gas”). Now make sure that all the files you have prepared are in the same directory as the GAS program (gas.exe). To run the gas program, simply type “gas” at the DOS or UNIX prompt, and you will be given a list of all the gas programs available. Your file “red.gas” should be among them, so type “red.gas” and press return. If the program finds all the right files in the directory, and if the relationships between the individuals in the pedigrees are all written correctly, it will start to read in size data from the two files D11.txt and M15.txt. When the command file has finished running (which you will know because GAS will send you a friendly little message saying “SUCCESS!”), look at the “bar.ps” file (or the “bar.out” file if your computer cannot display postscript files) to see whether your allele sizes are clustered around a 2- or 3-bp interval, or whether they have a continuous distribution. If the latter is the case, it will be difficult to score your alleles globally (identify one particular allele across different families). It may be worth looking at your experimental procedures to improve the discrimination between different alleles. Figure 2 shows the graphical output for a typical (CA)n repeat. The larger alleles fall neatly into clusters of 2 bp, but the smaller alleles show some overlap between clusters, possibly because of nonoptimal gel running conditions. If it is not possible to
improve discrimination between alleles, they may have to be scored “locally” (within families), as explained in the GAS manual. Once you have decided on the most appropriate dividing points between groups of alleles, such as 2 bp for (CA)n repeats, edit the command file “red.gas” as follows:
set allfile = red; ! the three outfiles are “red.out”, “red.log”, which gives a record of any problems encountered and the decisions taken on the scoring of allele sizes, and “red.ps”
set locus red affection ! set the affection locus with data found in the pedigree file
2 0.5 0.5 ! number of alleles, gene frequencies
1 0.2 0.2 0.0; ! number of liability classes, penetrance distribution (if the penetrance distribution is not known, this line may be replaced by the entry “noclass”)
set locus d11 named nofreq nodata; ! set the numbered locus D11 to be read into the pedigree file; no gene frequencies or other data known
set locus m15 named nofreq nodata; ! set the numbered locus M15 to be read into the pedigree file; no gene frequencies or other data known
read( pedigree red.ped ); ! read the gas-format pedigree file
read( alsize d11.txt m15.txt locus d11 m15 samesize=1.95 diffsize=2.05 global ); ! read in the two files containing raw size data for loci D11 and M15, defining alleles in 2-bp intervals, so that alleles more than 2 bp apart are defined as different alleles, and score globally (the same size is the same allele over all the families used)
write( ldata red.dat ); ! write a linkage-format data file containing gene frequencies of all the alleles scored
write( lpedigree red.new ); ! write a linkage-format pedigree file containing all the allele data scored
stop; ! end of program
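The idea of binning raw sizes into numbered alleles can be sketched in Python as follows. This is not the algorithm that GAS applies for its “samesize” and “diffsize” criteria; it is a simpler fixed-ladder scheme that assumes a 2-bp repeat unit, and the names bin_allele, reference_bp, repeat_bp, and tolerance_bp are hypothetical.

```python
def bin_allele(size_bp, reference_bp, repeat_bp=2.0, tolerance_bp=0.5):
    """Assign a measured fragment size to a bin on a fixed repeat ladder.

    Returns (bin_index, is_ambiguous). Bin 0 is centred on reference_bp,
    bin 1 on reference_bp + repeat_bp, and so on. A size is flagged as
    ambiguous when it lies further than tolerance_bp from its bin centre.
    """
    index = round((size_bp - reference_bp) / repeat_bp)
    centre = reference_bp + index * repeat_bp
    return index, abs(size_bp - centre) > tolerance_bp


if __name__ == "__main__":
    # sizes in base pairs; 148.8 is an assumed bin centre for illustration
    for size in (148.78, 150.38, 149.9):
        print(size, bin_allele(size, reference_bp=148.8))
```

Sizes flagged as ambiguous in such a scheme correspond loosely to the alleles for which a program would ask the user to confirm a bin.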
As the program goes through the size data, it will be able to score or “bin” most of the alleles, but there will be some that are “ambiguous” or do not fit easily into one or other bin. GAS will ask you for confirmation of where to put these ambiguous alleles: in most cases following the suggested route will be fine, but if the gap between the two sizes given is very close to 2 bp, you may feel safer to follow the prompts to make a separate bin for this odd allele, or
to exclude it from the pedigree file. You can fine-tune the number of alleles “included,” “excluded,” and “ambiguous” for each marker by slightly altering the “samesize” and “diffsize” criteria. Once the alleles have been scored, GAS will check that the inheritance of the alleles is consistent, and write you a warning message if there are any cases of non-Mendelian inheritance. In this case, the data will not be written to the new pedigree file, and you will need to check the scoring of alleles and, if necessary, re-enter them manually (see Section 4.2.). Once the program is finished, you can check the file “red.log,” which will contain a record of which allele sizes were entered into which bin.
3.1.3. Linkage-Format Pedigree and Data Files
You can now look at the linkage-format pedigree file red.new, which contains all your data scored into numbered alleles. The format of the new pedigree file is like this (the top two rows are not included in the file):
Ped                      Affection  d11  d11  m15  m15
no.  Id  Pat  Mat  Sex   status     a    b    a    b
122  1   0    0    1     2          5    6    3    9
122  2   0    0    2     1          4    5    2    1
122  3   1    2    1     2          5    5    2    9
122  4   1    2    2     2          5    6    1    3
122  5   1    2    1     1          4    6    1    3
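The consistency check that GAS runs on the scored alleles can be illustrated with a minimal Python sketch. It is not the GAS implementation; it only tests whether a child's two alleles at one marker can be explained by one allele from each parent. The function name is_mendelian is hypothetical, the use of 0 for an untyped allele is an assumption, and the genotypes in the usage lines are illustrative.

```python
from itertools import product

UNKNOWN = 0  # assumed code for an untyped allele


def is_mendelian(child, father, mother):
    """Return True if the child's genotype is compatible with the parents.

    Each genotype is a pair of numbered alleles at a single marker.
    Trios containing an untyped allele cannot be checked and are accepted.
    """
    if any(UNKNOWN in genotype for genotype in (child, father, mother)):
        return True
    c1, c2 = child
    # try every combination of one paternal and one maternal allele
    return any(
        (pat == c1 and mat == c2) or (pat == c2 and mat == c1)
        for pat, mat in product(father, mother)
    )


if __name__ == "__main__":
    father, mother = (5, 6), (4, 5)
    print(is_mendelian((5, 5), father, mother))  # compatible
    print(is_mendelian((4, 4), father, mother))  # father cannot supply allele 4
```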
The corresponding linkage-format datafile looks like this (the remarks after ! explain each line and are not part of the program; some of these are written in by the GAS preparation program):
3 0 0 3 ! number of loci, risk locus if known, sex-linked if known, program code for LINKAGE, if used
0 0.0 0.0 0 ! line to enter mutation rates of one locus, if known
1 2 3 ! order of loci
1 2 #red ! code (1) for affection locus, number of alleles, name
0.500000 0.500000 ! gene frequencies
1 ! number of affection classes
0.200000 0.200000 0.000000 ! frequency of affection in homozygotes of dominant allele, in heterozygotes, in homozygotes of recessive allele
3 7 #d11 ! code (3) for numbered locus, number of alleles, name
0.006006 0.187688 0.330330 0.066066 0.330330 0.046547 0.033033 ! gene frequencies
3 12 #m15 ! code (3) for numbered locus, number of alleles, name
0.210756 0.261628 0.193314 0.008721 0.005814 0.018895 0.220930 0.027616 0.001453 0.030523 0.015988 0.004360 ! gene frequencies
0 0 ! sex difference, interference (if 0 or 1)
0.1 0.1 ! initial recombination fractions between loci
1 ! this line is rewritten by the linkage control program, if used (Section 4.3.)
0 0 ! this line is rewritten by the linkage control program
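Before passing a datafile to a linkage program, it can be worth confirming that the gene frequencies listed for each locus are sensible. The Python sketch below is one way to do this; it is not part of GAS or LINKAGE, the function name frequencies_are_valid and the tolerance are assumptions, and the seven values are the d11 frequencies from the datafile above.

```python
import math


def frequencies_are_valid(frequencies, tolerance=1e-4):
    """Check that allele frequencies are non-negative and sum to 1.

    The tolerance allows for rounding of the printed six-decimal values.
    """
    if any(f < 0.0 for f in frequencies):
        return False
    return math.isclose(sum(frequencies), 1.0, abs_tol=tolerance)


if __name__ == "__main__":
    d11 = [0.006006, 0.187688, 0.330330, 0.066066,
           0.330330, 0.046547, 0.033033]
    print(frequencies_are_valid(d11))
```

A similar loop could also flag alleles whose frequency is exactly 0, since Section 4.3. notes that such an allele stops the LINKAGE programs if it occurs in the data.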
Both these files are now ready for further processing, and for running a wide variety of GAS or LINKAGE programs (see Section 4.3.), which will help you in ordering and localizing your markers into a genetic map.
3.2. Programs for Linkage Analysis
Version 2.0 of GAS incorporates several programs or “routines” designed for classical linkage analysis. These routines use the Vitesse likelihood engine (9) and have the advantage of being able to analyze up to 8 loci simultaneously and to use highly polymorphic marker alleles without recoding. Although the Vitesse engine is said to be the fastest likelihood engine in existence at the time of writing, it still takes a relatively large amount of computer time to perform these routines in comparison, say, to sib-pair linkage analysis (see Section 4.4.). The routines are not yet able to deal with more complicated problems, such as sex-linked loci; for these you should use the LINKAGE package (see Section 4.3.). However, for the construction of a simple genetic map, they are generally much easier to use than the traditional programs.
3.2.1. Ordering of Loci and Calculation of Recombination Fractions
If you do not know the order of your loci, you can use the “lik2point” routine in GAS to find the most likely order and to calculate the maximum likelihood recombination fraction between each of the pairs of loci used. The same routine will estimate the maximum LOD scores between the loci and, if needed, the support levels for each.
A typical “gas” file using “lik2point” and the problem described above would be:
set allfile = theta; ! write results to files theta.out, theta.log and theta.ps
read( ldata red.dat ); ! read the linkage-format data file containing all the details of the loci used
read( lpedigree red.new ); ! read the linkage-format pedigree file containing all the allele data
program;
call lik2point( locus red D11 M15 allorders ); ! calculate the most probable recombination fractions among the loci red, D11, and M15, in all possible orders
stop; ! end of program
When the program has stopped running (which may take some time), you can read the file “theta.out” to see your recombination values and LOD scores:
Locus #1  Locus #2  Theta  LOD score
red       D11       0.136  12.22 <++
red       M15       0.119  16.31 <++
D11       M15       0.201  9.38 <++
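For three loci, an order can also be read off pairwise recombination fractions with a simple rule of thumb: the pair with the largest recombination fraction flanks the third locus. The Python sketch below applies this rule to the values in the table above. It is only an illustration and is not the likelihood comparison that “lik2point” carries out; the function name order_three_loci is hypothetical.

```python
def order_three_loci(theta):
    """Order three loci from their pairwise recombination fractions.

    theta maps frozenset({locus_a, locus_b}) to the recombination fraction
    between that pair. The pair with the largest value is taken to be the
    two outer loci, and the remaining locus is placed between them.
    """
    if len(theta) != 3:
        raise ValueError("expected exactly three pairwise values")
    outer = max(theta, key=theta.get)
    loci = set().union(*theta)
    (middle,) = loci - outer
    left, right = sorted(outer)
    return [left, middle, right]


if __name__ == "__main__":
    pairwise = {
        frozenset({"red", "D11"}): 0.136,
        frozenset({"red", "M15"}): 0.119,
        frozenset({"D11", "M15"}): 0.201,
    }
    print("-".join(order_three_loci(pairwise)))
```

With these values the rule places “red” between D11 and M15.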
The highest LOD score represents the most likely order, so the order in this case would be “D11-red-M15” (or the other way around). Now you can edit the data file “red.dat” to include the new intermarker recombination fractions, and also alter the “order of loci” line to read “2 1 3.” If you set a postscript outfile in line 1 of the gas file (“theta.ps” in this example), you will have obtained a graphical plot of the LOD score for the loci analyzed. There are a number of other options which you can include at this time; see the GAS manual for details.
3.2.2. Localization of an Unknown Locus on a Fixed Map
To localize one unknown locus on a fixed map of markers (either the map constructed using “lik2point” or a physical map), the routine “likmap” is used. This program calculates likelihoods for the entire map with the unknown locus at one position after another, indicating via LOD scores the most probable localization for the unknown locus (called “movable”), in relation to the known loci (called “fixed”). It is important to realize that “likmap” or any other parametric multipoint linkage program (in other words, one where you have to specify the mode of inheritance) should not be used if the mode of inheritance
is unknown or too complex to be exactly specified in the datafiles using penetrance and liability classes. This is because inaccuracies in the specified mode of inheritance tend to drive the placing of the unknown locus toward the edge of the fixed map. Therefore, if the inheritance of your unknown locus is likely to be complex, it may be better to use one of the affected relative methods described briefly in Section 4.4., if you have access to this sort of family material. Multipoint linkage analysis requires a considerable amount of computer time; therefore, if more than eight fixed loci are used, it is essential that the loci are analyzed only a few at a time or the program may crash. Using the LINKAGE package (see Section 4.3.), each subset must be set manually, but GAS uses the “dosets” parameter to do this automatically. The default value for “dosets” is four loci to be analyzed at a time. The gas file used might look like this:
set allfile = likmap;
read( ldata red.dat );
read( lpedigree red.new );
program;
call likmap( locfix d11 m15 d12 d13 d14 g40 g41 g42 ! these are the eight fixed loci
theta 0.01 0.02 0.031 0.062 0.06 0.02 0.065 ! these are the seven intermarker recombination fractions
locmov red ! the unknown, or movable, locus is red
maptic haldane ! use the haldane mapping function to calculate the genetic map
dosets ); ! if there are more than four loci, calculate the map in sets of four (the default) fixed loci at a time
stop;
When the program has finished running, take a look at the “likmap.out” file. After summarizing the intermarker recombination fractions, it will give you a readout of map position, recombination fractions, and LOD score for each possible position of the movable locus, like this:
LIKELIHOOD MAP for locus “red”
Distances calculated in M using haldane map
Order: red d11 m15 d12 d13 ! the movable locus “red” is set at the left-hand edge of the map, beyond the locus D11
! the map position of the locus (from the left-hand side) is 0
! the LOD score is -0.541 at theta = 0.805 between “red” and D11
Order: d11 red m15 d12 d13 ! the movable locus “red” is set between D11 and M15
! the map position of the locus moves between 0 and 0.001 Morgans
! the LOD score is -0.510 at theta = 0.004 between “red” and D11 and theta = 0.000601 between “red” and m15
! none of these LOD scores are significant
The output file will continue to give you information on map position, recombination fractions, and LOD scores for all possible positions of the locus “red,” including the extreme right-hand side of the map beyond locus g42 (again, in this case only one recombination fraction is given). The highest LOD score indicates the most likely position of the unknown locus on the map. The file “likmap.ps” gives graphical output for the map (see Fig. 3). Note that the distances shown are from the first marker, D11, and that the LOD scores to the left of D11 and to the right of the last marker are not included in the map.
Fig. 3. Likmap analysis for locus red (horizontal axis: genetic distance, Haldane).
4. Notes
4.1. Genetic Analysis: Some Definitions
The following is a brief definition of genetic terms used in this chapter.
1. Gene frequencies: The relative frequencies in the population of the different states or alleles of a gene or marker.
2. Polymorphism: A gene or marker is defined as polymorphic if its alleles occur so frequently that they cannot be explained by recurrent mutation (normally the most common allele has a gene frequency of <95%).
3. Locus: The site on a chromosome occupied by a particular gene or marker.
4. Haplotype: The alleles of different genes received by an individual from one parent.
5. Linkage: Two loci are said to be linked when they are relatively close together on the same chromosome, so that alleles of different genes appear to be genetically coupled together on the same haplotype.
6. Recombinants and nonrecombinants: A recombination event between two genes occurs if the alleles on an individual’s haplotype are derived from two different grandparents; if there is no recombination event, the alleles of the two genes on the individual’s haplotype will be the same as in one of the grandparents. In certain situations, such as a family in which both parents are homozygous, it is not possible to distinguish between recombinants and nonrecombinants.
7. Recombination fraction (θ): The probability that there is a recombination event between two loci. The recombination fraction is a measure of the extent of genetic linkage, such that the larger the physical distance between two loci, the higher the probability that recombination can be observed between them. Genes that segregate independently within a family are unlinked and have a recombination fraction of θ = 0.5, whereas linked genes cosegregate within a family and have a recombination fraction of θ < 0.5.
8. LOD score: The LOD score for calculations where the recombination fraction is varied, Z(θ), is the logarithm (to base 10) of the likelihood of the data at that recombination fraction over the likelihood of the data at θ = 0.5, i.e., how much higher is the likelihood of the data if linkage is present than if linkage is absent.
9. Mapping function: Recombination fractions are not additive, and the total recombination fraction across a set of markers is less than the sum of the intermarker recombination fractions. A mapping function is a formula that converts recombination fractions into genetic map distance (in Morgans), which is then additive. One of the simplest conversions is the Haldane map function, whereas others, which include more sophisticated correction factors, may be more accurate.
10. Linkage disequilibrium: A deviation from the random occurrence of alleles in a haplotype in populations is referred to as allelic association or linkage disequilibrium. The latter term reflects the fact that when a new mutation occurs on a haplotype, it will be found together with its surrounding alleles more often than expected by chance for a certain number of generations until recombination between the alleles leads to linkage equilibrium, or no association between alleles. In theory, two alleles that are only 1 bp apart may be in linkage equilibrium. In practice, it is common to find some degree of linkage disequilibrium between alleles that are up to 1-2 cM apart. It is this feature that is the basis of linkage disequilibrium mapping. However, it should be remembered that many other situations can lead to the detection of allelic association between loci that are in fact unlinked (see ref. 1). For this reason, linkage disequilibrium mapping is often used to fine-map loci whose genetic relationship is already known.
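The Haldane map function named in definition 9 can be written out as a short Python sketch. The formula d = -0.5 ln(1 - 2θ), with d in Morgans, is the standard Haldane function; the function names haldane_distance and haldane_theta are hypothetical, and the two recombination fractions in the usage lines are only example values.

```python
import math


def haldane_distance(theta):
    """Convert a recombination fraction to map distance in Morgans."""
    if not 0.0 <= theta < 0.5:
        raise ValueError("theta must satisfy 0 <= theta < 0.5")
    return -0.5 * math.log(1.0 - 2.0 * theta)


def haldane_theta(distance):
    """Convert a map distance in Morgans back to a recombination fraction."""
    if distance < 0.0:
        raise ValueError("distance must not be negative")
    return 0.5 * (1.0 - math.exp(-2.0 * distance))


if __name__ == "__main__":
    # two adjacent intervals with example recombination fractions
    theta_ab, theta_bc = 0.136, 0.119
    total = haldane_distance(theta_ab) + haldane_distance(theta_bc)
    # map distances add, so the outer recombination fraction follows from the sum
    print(round(haldane_theta(total), 4))
```

The recombination fraction recovered for the outer pair is smaller than the plain sum 0.136 + 0.119, which is the non-additivity that the mapping function corrects for.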
4.2. Error Messages in GAS
It is common to see the message “Warning-non-Mendelian inheritance” after the initial binning of raw data into numbered alleles. The most common cause of this problem is where one track in a family has been sized inaccurately on the gel and thus fallen into the wrong allele bin. First check whether the allele sizes look all right by eye; it may be possible to resize the peaks or stretch the allele bin. Otherwise, it may be necessary to rerun the sample, or even the whole family. The data can then be re-entered automatically or manually, using the bin sizes saved in the log file for the correct numbering. The message “Error-unrelated individuals” means that your pedigree file has not been constructed properly, so there is at least one individual who is not connected to the other people in the family via the paternal and maternal identification columns. The program will not be able to proceed until this error has been corrected.
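A pedigree file can be screened for this problem before it is given to GAS. The Python sketch below lists individuals who are neither a parent of anyone nor have a parent in the pedigree. It is an illustration only, not the check that GAS performs; the function name unrelated_individuals is hypothetical, and the example family follows the one described in Section 3.1.1., with “x” for an unknown parent.

```python
def unrelated_individuals(pedigree, unknown="x"):
    """Return the ids of individuals with no parental link to anyone.

    pedigree maps each individual's id to a (father_id, mother_id) pair.
    An individual counts as linked if a parent is listed in the pedigree,
    or if the individual is listed as a parent of someone else.
    """
    linked = set()
    for person, parents in pedigree.items():
        for parent in parents:
            if parent != unknown and parent in pedigree:
                linked.add(person)
                linked.add(parent)
    return sorted(set(pedigree) - linked)


if __name__ == "__main__":
    family = {
        "1": ("6", "7"), "2": ("x", "x"),
        "3": ("1", "2"), "4": ("1", "2"), "5": ("1", "2"),
        "6": ("x", "x"), "7": ("x", "x"),
    }
    print(unrelated_individuals(family))
    family["8"] = ("x", "x")  # an individual with no links
    print(unrelated_individuals(family))
```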
The message “Warning-individual 11.1 not in main dataset” simply means that you have data for individuals in the text file that are not in the pedigree file. If you enter “continue” the program will proceed, but these data will not be written to the pedigree file. The message “Error-multiple data for subject 11.1” means that you have two (or more) entries in your text file for one individual. The program will not be able to proceed until these extra entries have been deleted.
4.3. The LINKAGE Package
This package contains a number of programs for linkage analysis. They are generally less easy to use than the GAS routines, but are more flexible in that the user specifies exactly how the data are to be handled. If you wish to use these programs, it is worth getting hold of the book Handbook of Human Genetic Linkage by Joseph Terwilliger and Jurg Ott (2) for a detailed description of methods involved in using and optimizing the programs. Previous versions of the LINKAGE programs were relatively slow and were limited in their ability to deal with highly polymorphic markers, but the more recent FASTLINK program (available through HGMP or from the author, Alejandro Schaffer) (3,4) has largely overcome both problems. The linkage-format datafile prepared above can be used directly in FASTLINK, but the pedigree file will need preprocessing by another program called MAKEPED. This program will add several columns to your pedigree file automatically and will enable the linkage analysis to run faster. To run the program, simply enter “makeped” at the UNIX prompt in the directory in which you have the LINKAGE (5-7) or FASTLINK package. The prompts are self-explanatory. If you experience any problems the first time you try to run files in the LINKAGE or FASTLINK packages, check your pedigree file for missing parents, unrelated individuals, or nonconsecutive numbering of children. Also remember that if the gene frequency is set at 0, even one occurrence of that allele will stop the program. Within the package, the program ILINK is analogous to the GAS routine “lik2point,” and the program LINKMAP is analogous to the GAS routine “likmap.” These programs are usually run using a relatively user-friendly interface called LCP (linkage control program) and the results displayed using another interface called LRP (linkage report program). The use of these programs is described in the book by Terwilliger and Ott (2).
4.4. Nonparametric Methods
Parametric forms of linkage analysis are particularly suitable for genetic mapping in situations where the mode of inheritance of all the loci is known,
and can be specified fairly accurately in terms of penetrance and liability. In disease or phenotype mapping, this may not always be the case. In so-called multifactorial diseases or phenotypes, many genes may contribute to the observed phenotype, and there may be a nongenetic component to the presence or absence of the phenotype (so that there is no predictable correlation between the genotype and phenotype). Affected relative linkage analysis, which tests distortions from expected ratios of allele sharing among relatives (usually sibs) who share a particular phenotype, such as a disease, is independent of the mode of inheritance of the phenotype, and can thus be a very powerful tool in locating the gene or genes responsible if suitable family material is available. The GAS package contains several routines for single-point and multipoint sib-pair analysis, including identity by descent (IBD) analysis, maximum likelihood IBD analysis, identity by state (IBS) analysis, which may be more appropriate if parental data are missing, and sib-pair mapping. There are also routines suitable for the analysis of quantitative data. There are full details of these routines, and examples of their use, in the GAS manual (8).
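The quantity that identity by state analysis starts from, the number of alleles two sibs share at a marker, can be counted with a few lines of Python. This sketch is only the counting step and not one of the GAS sib-pair routines; the function name ibs_sharing is hypothetical, and the genotypes in the usage lines are the D11 genotypes of the Smith children from Section 1.1.

```python
from collections import Counter


def ibs_sharing(genotype_a, genotype_b):
    """Count alleles shared identical by state (0, 1, or 2) at one marker.

    Each genotype is a pair of numbered alleles. The multiset intersection
    ensures that a homozygote shares only one allele with a heterozygote.
    """
    shared = Counter(genotype_a) & Counter(genotype_b)
    return sum(shared.values())


if __name__ == "__main__":
    red_boy, red_girl = (2, 3), (2, 3)
    brown_boy, brown_girl = (1, 4), (3, 4)
    print(ibs_sharing(red_boy, red_girl))      # the two red-haired sibs
    print(ibs_sharing(red_boy, brown_boy))     # a red-haired and a brown-haired sib
    print(ibs_sharing(red_girl, brown_girl))
```

A sib-pair test would then compare the sharing observed among affected pairs with the sharing expected by chance, which this sketch does not attempt.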
5. Summary
This chapter has covered the preparation of pedigree and datafiles suitable for linkage analysis, the calculation of gene frequencies and recombination fractions, the ordering of markers into a genetic map, and the placing of an unknown locus, such as a disease locus or phenotype, onto a fixed map of markers. The chapter has concentrated on the use of the GAS package, but some other programs for linkage analysis have also been mentioned.
Acknowledgments The constructive comments of Alan Young and Duncan Campbell are acknowledged with thanks.
References
1. Ott, J. (1991) Analysis of Human Genetic Linkage. Johns Hopkins University Press, Baltimore, MD.
2. Terwilliger, J. D. and Ott, J. (1994) Handbook of Human Genetic Linkage. Johns Hopkins University Press, Baltimore, MD.
3. Cottingham, R. W., Idury, R. M., and Schaffer, A. A. (1993) Faster sequential genetic linkage computations. Am. J. Hum. Genet. 53, 252-263.
4. Schaffer, A. A., Gupta, S. K., Shriram, K., and Cottingham, R. W. (1994) Avoiding recomputation in genetic linkage analysis. Hum. Heredity 44, 225-237.
5. Lathrop, G. M., Lalouel, J. M., Julier, C., and Ott, J. (1984) Strategies for multilocus analysis in humans. PNAS 81, 3443-3446.
6. Lathrop, G. M. and Lalouel, J. M. (1984) Easy calculations of LOD scores and genetic risks on small computers. Am. J. Hum. Genet. 36, 460-465.
7. Lathrop, G. M., Lalouel, J. M., and White, R. L. (1986) Construction of human genetic linkage maps: likelihood calculations for multilocus analysis. Genet. Epidemiol. 3, 39-52.
8. The GAS manual version 2.0, available from: http://info.ox.ac.uk/-ayoung/gas.html
9. O’Connell, J. (1995) The Vitesse algorithm for rapid exact multilocus linkage analysis via genotype set-recoding and fuzzy inheritance. Nature Genet. 11, 402-408.
4
Gene Mapping Using Somatic Cell Hybrids
David P. Kelsell and Nigel K. Spurr
1. Introduction
Human-rodent somatic cell hybrids have proven to be a useful tool for mapping expressed gene products (1,2) and, in particular, DNA sequences. The accuracy of the method is dependent on the identification of a species difference between the human and rodent species. This chapter details methods for DNA-based gene mapping techniques using somatic cell hybrids. The first continuous human-mouse somatic cell hybrids produced were from a spontaneous fusion occurring between mouse L-cells deficient in thymidine kinase (TK) and normal human embryonic fibroblasts (3). The hybrids were selected in hypoxanthine-aminopterin-thymidine (HAT). The mouse cells were killed and the hybrids outgrew the human cells. It was found that as the hybrids grew, the mouse chromosomes were all retained, but the human chromosomes were preferentially lost. This observation provided the key to their use in gene mapping, enabling the assignment of human genes and sequences. Since the mouse cell line was TK-deficient, the mouse-human hybrids selected in HAT were required to retain the human gene for TK. After several generations in culture, the majority of the hybrid cells had retained only one chromosome in common, chromosome 17. Therefore, the human TK gene could be assigned to chromosome 17 (4,5). The application of recombinant DNA technology had a major impact on the ability to assign genes to chromosomes, since there was no longer a requirement for expression of a gene product. This meant that not only was it possible to assign specific genes, but also virtually any randomly isolated piece of DNA. More than 5000 genes/DNA sequences have now been assigned using panels of human-rodent somatic cell hybrids using Southern blot hybridization and/or the polymerase chain reaction (PCR) (6). Gene assignments depend on the loss or
retention of specific chromosomes or chromosome fragments in somatic cell hybrids. Whether using Southern blotting or PCR amplification, a species difference must be determined.

Southern blotting is a method used to detect homologous sequences in genomic DNA by hybridization of a labeled probe to DNA fragments generated by restriction endonuclease digestion, separated on an agarose gel, and transferred to a solid support (7). Southern blot methodology is still used for mapping DNA fragments, particularly when there is no DNA sequence available to design mapping primers. This method may also indicate whether a DNA fragment is part of a gene family or shows interspecies conservation. However, these latter two points often create problems in assignment, for example, in distinguishing human fragments from those of the rodent background.

The PCR is used to amplify DNA between two regions of known sequence (8). The PCR technique has a number of advantages over Southern blotting for gene mapping. The main points are:
1. It requires less template DNA;
2. It is much quicker, with rapid “binning” of genes/random DNA fragments; and
3. It is gene-specific, i.e., does not crosshybridize with gene family members or across species.

Currently, the use of DNA from somatic cell hybrids forms the core of many mapping strategies, particularly those aimed at the rapid “binning” of genes or DNA fragments to individual chromosomes. This chapter concentrates on the use of human-rodent somatic cell hybrid DNA in gene mapping using either Southern blot analysis or PCR-based methods.

2. Materials
2.1. Southern Blot Analysis
2.1.1. Restriction Enzyme Digestion of Genomic DNA
1. Somatic cell hybrid DNA panel with human and rodent genomic controls (see Note 1). Digestions are carried out using the buffers supplied by the manufacturers and according to their recommendations.
2. 1X TAE buffer: 40 mM Tris-acetate, pH 7.8, 2 mM EDTA, pH 8.0.
3. Ethidium bromide.
4. Molecular weight marker: λ DNA digested with HindIII.
2.1.2. Transfer of DNA from Agarose Gel
1. 0.4 M NaOH.
2. Nylon filter, for example, Hybond N+ (Amersham International, Little Chalfont, Buckinghamshire, UK).
3. Blotting towels.
4. 20X SSC (stock solution): 3 M NaCl, 0.3 M sodium citrate.
2.1.3. Generation of Labeled DNA Probes
1. 10 mg/mL bovine serum albumin (BSA) (enzyme-grade).
2. Oligo-labeling buffer (OLB): mixture of three solutions:
   a. Solution A: 1.25 M Tris-HCl, pH 8.0, 125 mM MgCl2, 0.18% (v/v) 2-mercaptoethanol, 0.5 mM dATP, 0.5 mM dGTP, 0.5 mM dTTP.
   b. Solution B: 2 M HEPES, pH 6.6.
   c. Solution C: Hexadeoxyribonucleotides suspended in 3 mM Tris-HCl, 0.2 mM EDTA (pH 7.0) at 90 OD U/mL.
   Solutions A, B, and C should be mixed in the ratio 2:5:3 (v:v:v), respectively.
3. α-32P-dCTP (1000 Ci/mmol).
4. Klenow fragment of DNA polymerase I. NB: Commercial labeling kits are available.
5. Stop solution: 20 mM NaCl, 20 mM Tris-HCl, pH 7.5, 2 mM EDTA, 0.25% sodium dodecyl sulfate (SDS).
6. Sephadex G50 column.
7. 20X SSC (stock solution): 3 M NaCl, 0.3 M sodium citrate.
2.1.4. Hybridization of Labeled DNA Probes to DNA Immobilized on Nylon Filters
1. Prehybridization solution: 5X SSC, 5X Denhardt’s solution, 40 mM phosphate buffer, pH 6.5, 0.1 mg/mL denatured sheared herring sperm DNA (100X Denhardt’s solution: 2% [w/v] BSA, 2% [w/v] Ficoll 400, 2% [w/v] polyvinylpyrrolidone 360).
2. Commercial hybridization oven.
3. Labeled probe.
4. Hybridization solution: prehybridization solution plus 10% (w/v) dextran sulfate.
5. Wash: 0.2X SSC, 0.1% (w/v) SDS.
2.1.5. Autoradiography
1. Plastic film (Saranwrap™).
2. Kodak X-OMAT AR film.
3. Commercial film developing system.
2.2. Analysis of Hybrids Using PCR Amplification
1. Oligonucleotide primers can be synthesized using an automated DNA synthesizer (Model 380B, Applied Biosystems), working on cyanoethyl phosphoramidite chemistry.
2. Somatic cell hybrid DNA panel with human and rodent genomic controls.
3. 10X PCR buffer: 100 mM Tris-HCl, pH 8.3, 500 mM KCl, 15 mM MgCl2.
4. 0.01% (w/v) gelatin.
5. 10 mM stock of each dNTP: dATP, dCTP, dGTP, dTTP.
6. Taq DNA polymerase from Thermus aquaticus.
7. 1X TBE: 45 mM Tris-borate, 1 mM EDTA.
8. Ethidium bromide.
9. Mol-wt marker: φX174 DNA digested with HaeIII.
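
The stock concentrations listed above can be turned into pipetting volumes for the 50-µL reaction described in Section 3.2. with the usual dilution relation (stock volume = final concentration × reaction volume / stock concentration). The following is only a sketch: it assumes that "200 µM dNTPs" means 200 µM of each dNTP and that 20 pmol of each of two primers is used, and the working stocks for primers (10 pmol/µL), template DNA (10 ng/µL), and Taq polymerase (5 U/µL) are hypothetical values chosen for illustration, not taken from the protocol.

```python
def stock_volume(final_conc, stock_conc, reaction_volume_ul):
    """Volume of stock (in uL) giving final_conc in the reaction; same units for both concentrations."""
    return final_conc * reaction_volume_ul / stock_conc


def amount_volume(amount, stock_per_ul):
    """Volume of stock (in uL) that delivers a fixed amount (pmol, ng, or U)."""
    return amount / stock_per_ul


REACTION_UL = 50.0

mix = {
    # From the protocol: 10X buffer diluted to 1X, 10 mM dNTP stocks diluted to 200 uM.
    "10X PCR buffer": stock_volume(1, 10, REACTION_UL),
    "dNTPs (4 x 10 mM stocks)": 4 * stock_volume(200, 10_000, REACTION_UL),
    # Assumed working stocks (hypothetical, adjust to the stocks actually in use).
    "primers (2 x 20 pmol at 10 pmol/uL)": 2 * amount_volume(20, 10),
    "template DNA (30 ng at 10 ng/uL)": amount_volume(30, 10),
    "Taq polymerase (1 U at 5 U/uL)": amount_volume(1, 5),
}
# Water makes the reaction up to the final volume.
mix["water"] = REACTION_UL - sum(mix.values())

for component, volume in mix.items():
    print(f"{component}: {volume:.1f} uL")
```

For a whole hybrid panel, the same volumes (except template) can be multiplied by the number of reactions plus a small excess to prepare a master mix.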
3. Methods
3.1. Southern Blot Analysis of Hybrids
3.1.1. Restriction Enzyme Digestion of Genomic DNA
1. Digest 10 µg of genomic DNA for 3 h using 30 U of restriction endonuclease, incubated at the optimum temperature for activity of the endonuclease used (see Note 2).
2. After digestion, separate the DNA fragments by electrophoresis using submarine horizontal agarose gels in 1X TAE buffer, run at 1-2 V/cm (1% agarose gels made up in 1X TAE buffer are suitable for most applications). Add 0.1 µg/mL of ethidium bromide to the running buffer, which enables the DNA fragments to be visualized using a UVP transilluminator. DNA fragment lengths can be determined by using a DNA standard of known size. Mark the positions of the DNA markers by notching the sides of the gel to be blotted.
3. Permanent records of the gel can be obtained by photography.
3.1.2. Transfer of DNA from Agarose Gel
1. Soak the gel in 0.4 M NaOH for 20 min with gentle agitation.
2. Place the gel on a wick of 3MM blotting paper that has been soaked in 0.4 M NaOH and draped over a reservoir of the same solution. Place the nylon filter, presoaked in 0.4 M NaOH, on top of the gel, followed by a sheet of 3MM blotting paper, also soaked in 0.4 M NaOH. Place a stack of blotting towels on top of the 3MM paper, followed by a glass plate and a 0.5-1 kg weight.
3. The denatured DNA is transferred to the membrane by capillary transfer in the NaOH for 3 h to overnight.
4. After blotting, wash the filter in 5X SSC and wrap it in Saranwrap™ prior to use. Hybond N+ filters do not require fixing, since the membrane is positively charged and binds the DNA electrostatically on transfer.
3.1.3. Generation of Labeled DNA Probes
1. Boil 5-50 ng of DNA probe for 3 min in distilled water such that the final volume of the reaction mixture is 30 µL, and then put on ice.
2. Add 1.2 µL of 10 mg/mL BSA and 6 µL of OLB, followed by 2 µL of α-32P-dCTP and 2.5 U of the Klenow fragment of DNA polymerase I. Incubate the reaction mixture for 1 h at 37°C.
3. After labeling, add 70 µL of “stop solution” to the reaction.
4. Remove the unincorporated dNTPs by running the labeling reaction mixture through a Sephadex G50 column equilibrated in 3X SSC. Incorporation of labeled dCTP using this technique is usually >80%, with a specific activity of >10^9 cpm/µg of DNA. Boil probes for 3 min immediately prior to use.
3.1.4. Hybridization of Labeled DNA Probes to DNA Immobilized on Nylon Filters
1. Prehybridize filters for 2-4 h at 65°C in glass tubes placed in a commercial hybridization oven. The volume of prehybridization solution depends on the size and number of filters.
2. Remove the prehybridization mix, and add the hybridization mix containing the denatured probe. Denature the radioactively labeled probe by boiling for 3 min prior to adding it to the hybridization mix. Hybridization is generally carried out overnight.
3. After hybridization, remove the hybridization mix, and wash the filters. Perform three quick washes at room temperature, and then one 10-min wash at 65°C.
3.1.5. Autoradiography
1. Wrap the filter in plastic film, and place it in an X-ray film cassette.
2. Expose the filter to the film, usually at -70°C for 1-5 d, to visualize the hybridized fragments. Develop the film using a commercial film developing system (see Notes 3 and 4).
3.2. Analysis of Hybrids Using PCR Amplification
1. Perform PCR in 50-µL reactions containing 30 ng template DNA, 1X PCR buffer, 20 pmol primer, 200 µM dNTPs, and 1 U of Taq polymerase on a thermocycler, for example, GeneAmp 9600 thermal cycler (Perkin-Elmer Cetus). Reaction mixtures are given 30 cycles of 30 s at 94°C, 30 s at 55°C, and 30 s at 72°C (see Notes 5-9).
2. Electrophorese 15 µL of each PCR reaction through submerged horizontal gels in 1X TBE buffer run at 3-5 V/cm. For most applications, 2% agarose gels made up in 1X TBE buffer are used to separate the PCR fragment(s), which are usually in the range 100-1000 bp. The running buffer contains 0.1 µg/mL ethidium bromide. DNA fragment lengths are determined by using a DNA standard of known size. Permanent records of the gels may be obtained by photography through an orange filter using, for example, a Polaroid Land camera and Type 57 film.
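
The annealing step in the cycling program depends on the primers chosen. As a rough check against the primer design guide in Note 5 (17-24 bases, about 50% GC, melting temperature near 60°C counting 2°C for each A or T and 4°C for each G or C), the sketch below computes these three values for a primer. The function names and the example sequence are hypothetical, and the check does not cover self-complementarity or complementarity between the two primers, which still have to be examined separately.

```python
def melting_temperature(primer):
    """Rule-of-thumb Tm in degrees C: 2 per A/T plus 4 per G/C (as in Note 5)."""
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc != len(seq):
        raise ValueError("primer may contain only A, C, G, and T")
    return 2 * at + 4 * gc


def gc_fraction(primer):
    """Fraction of bases that are G or C."""
    seq = primer.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def primer_report(primer):
    """Collect the values that Note 5 asks the designer to look at."""
    return {
        "length": len(primer),
        "length_in_range": 17 <= len(primer) <= 24,
        "gc_fraction": gc_fraction(primer),
        "tm_celsius": melting_temperature(primer),
    }


# Hypothetical 20-base primer used only to illustrate the calculation.
example_primer = "AGCTGACCTGAAGGCTCATG"
report = primer_report(example_primer)
print(report)
```

A primer whose estimated melting temperature lies well away from the annealing temperature in use is a candidate for redesign, or for the adjustment of annealing temperature mentioned in Note 6.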
4. Notes
1. A monochromosomal hybrid DNA panel (Table 1, 9) has been deposited at the UK HGMP Resource Centre, Hinxton Hall, Hinxton, Cambridgeshire, CB10 1RQ, UK, where 5-10 µg aliquots are available on request to registered users of the center. Other sources of somatic cell hybrids are available from the literature, including translocation/deletion hybrids and chromosome-specific irradiation hybrid panels.
2. Gene mapping by Southern blot hybridization may require testing a variety of restriction enzyme digests of genomic DNA before a species difference can be determined.
3. Gene assignments depend on the loss or retention of specific human chromosomes or chromosome fragments in somatic cell hybrids. A human-specific fragment indicates that the human chromosome or chromosome region to which the gene/DNA fragment maps is present in the hybrid. An example of human chromosome content in a monochromosomal hybrid panel is shown in Table 1.
4. The majority of somatic cell hybrids, in particular the monochromosomal hybrids, are well characterized in terms of chromosomal content, and most DNA sequences can be mapped with relative ease. However, problems in the assignment of certain genes do occur: either a gene is positive for two or more chromosomes or, indeed, does not appear to map to any chromosome. This is probably because of deletions or translocations of small chromosomal regions in some hybrids. If this occurs, other hybrids should be tested, or other gene mapping techniques, such as fluorescent in situ hybridization (FISH), should be considered for confirmation.
5. A general guide for primer design is that each primer is about 50% GC, not self-complementary, and not complementary to the other primer. Ideally, the oligonucleotides should be between 17 and 24 bases long with a melting temperature of about 60°C (assuming A, T = 2°C; G, C = 4°C). The product size should be in the range 100-1000 bp for ease of PCR amplification.
6. For larger products, the extension time may need to be lengthened. To generate human-specific PCR products, the annealing temperature may need to be increased. In addition to following the general primer design conditions described, other points should be considered regarding which region of a gene should be targeted to generate gene-specific primers:
7. In the intronic sequence and the 3'-untranslated region, where there is reduced sequence conservation between gene family members and across species.
8. By aligning sequences between gene family members or across species to identify regions with a number of nucleotide mismatches.

Table 1
Monochromosomal Somatic Cell Hybrid Panel(a)

Human chromosome | Hybrid scored + for this chromosome(b)
1  | GM07299
2  | GM10826B
3  | GM10253
4  | HHW416
5  | GM10114
6  | MCP6BRA
7  | CLONE21E
8  | C4A
9  | GM10611
10 | 762-8a
11 | JICL4
12 | 1aA9602+
13 | 289
14 | GM10479
15 | HORLI
16 | 2860H7
17 | PCTBA1.8
18 | DL18TS
19 | GM10612
20 | GM10478
21 | THYB1.3
22 | PgME25NU
X  | HORL9X
Y  | 853

(a) +, indicates presence; -, indicates absence; /, indicates chromosome translocation, extra chromosomes, or other modifications.
(b) GM07299: 1 and X; MCP6BRA: Xqter-Xq13, 6p21-6qter; 762-8a: 10 and Y; 1aA9602+: 12, 21, X; 289: 13 plus fragments of 8, 11, 12; GM10479: 14 plus part of 16 (probably 16p13.1-16q22.1); HORLI: 15, 11q, part of Xp and proximal Xq; GM10478: 20, 4 (part), 8 (part), 22q, X; GM10611: 9p stronger than 9q, only 10% of cells with whole 9; PgME25NU: 22, part of Xp.
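
The reasoning in Notes 3 and 4 can be sketched as a concordance count: for each human chromosome, compare its presence or absence in every hybrid with the presence or absence of the human-specific signal in that hybrid, and assign the gene to the chromosome with the fewest discordant hybrids. The example below is a simplified illustration, not the authors' procedure. It treats each hybrid as carrying whole chromosomes only, uses a five-hybrid subset of the panel with the chromosome content given in the Table 1 footnote, and the PCR scores are invented for a hypothetical X-linked sequence.

```python
def concordance_by_chromosome(panel, scores):
    """Return {chromosome: (concordant, discordant)} over the scored hybrids.

    panel  -- {hybrid name: set of human chromosomes it retains}
    scores -- {hybrid name: True if the human-specific signal was seen}
    """
    chromosomes = set().union(*panel.values())
    table = {}
    for chrom in chromosomes:
        concordant = sum((chrom in panel[hybrid]) == seen
                         for hybrid, seen in scores.items())
        table[chrom] = (concordant, len(scores) - concordant)
    return table


def best_assignments(table):
    """Chromosomes with the fewest discordant hybrids (ties are all returned)."""
    fewest = min(discordant for _, discordant in table.values())
    return sorted(chrom for chrom, (_, discordant) in table.items()
                  if discordant == fewest)


# Subset of the panel; whole-chromosome content only (a simplifying assumption).
panel = {
    "GM07299": {"1", "X"},
    "GM10826B": {"2"},
    "762-8a": {"10", "Y"},
    "HORL9X": {"X"},
    "853": {"Y"},
}
# Invented scores for a hypothetical X-linked sequence.
scores = {"GM07299": True, "GM10826B": False, "762-8a": False,
          "HORL9X": True, "853": False}

table = concordance_by_chromosome(panel, scores)
print(best_assignments(table))
```

A result with no fully concordant chromosome, or with several tied chromosomes, corresponds to the problem cases in Note 4 and calls for additional hybrids or confirmation by another method.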
References
1. O’Brien, S. J., Simonson, J. M., and Eichelberger, M. (1982) Genetic analysis of hybrid cells using isozyme markers as markers of chromosome segregation, in Techniques in Somatic Cell Genetics (Shaw, J. W., ed.), Plenum, New York, pp. 513-524.
2. Tunnacliffe, A. and Goodfellow, P. (1984) Analysis of the human cell surface by somatic cell genetics, in Genetic Analysis of the Cell Surface: Receptors and Recognition Series B 16 (Goodfellow, P., ed.), Chapman and Hall, London, pp. 57-82.
3. Weiss, M. C. and Green, H. (1967) Human-mouse hybrid cell lines containing partial complements of human chromosomes and functioning human genes. Proc. Natl. Acad. Sci. USA 58, 1104-1111.
4. Miller, O. J., Allderdice, P. W., Miller, D. A., Breg, W. R., and Migeon, B. R. (1971) Human thymidine kinase gene locus: assignment to chromosome 17 in a hybrid of man and mouse cells. Science 173, 244-245.
5. Boone, C., Chen, T. R., and Ruddle, F. H. (1972) Assignments of three genes to chromosomes (LDH to 11, TK to 17, and IDH to 20) and evidence for translocation between human and mouse chromosomes in somatic cell hybrids. Proc. Natl. Acad. Sci. USA 69, 510-514.
6. Cuticchia, A. J. (1995) Human Gene Mapping 1994, vol. 3, A Compendium (Cuticchia, A. J., ed.), Johns Hopkins University Press, Baltimore, MD.
7. Southern, E. M. (1975) Detection of specific sequences among DNA fragments separated by gel electrophoresis. J. Mol. Biol. 98, 503-517.
8. Saiki, R. K., Gelfand, D. H., Stoffel, S., Scharf, S. J., Higuchi, R., Horn, G. T., Mullis, K. B., and Erlich, H. A. (1988) Primer-directed enzymatic amplification of DNA with a thermostable DNA polymerase. Science 239, 487-491.
9. Kelsell, D. P., Rooke, L., Warne, D., Bouzyk, M., Cullin, L., Cox, S., West, L., Povey, S., and Spurr, N. K. (1995) Development of a panel of monochromosomal somatic cell hybrids for rapid gene mapping. Ann. Hum. Genet. 59, 233-241.
5
Gene Mapping by FISH
Rafael Espinosa III and Michelle M. Le Beau
1. Introduction
The efforts to localize genes to human chromosomes date back to the early 1970s. Although few techniques were available to map genes, many scientists recognized that the ability to determine the location of genes and DNA sequences on human chromosomes would not only facilitate the identification of disease-related genes, but might also provide important knowledge on the organization of chromosomes and the mechanisms of gene expression. With the introduction of somatic cell genetics and the development of hybrid cell panels in the mid-1970s, investigators were able to map sequences to whole chromosomes and, in some cases, to specific chromosome regions or bands. Nonetheless, the major breakthrough in mapping efforts was provided by the development of in situ hybridization of isotopically labeled probes; this technique provided the first method by which scientists could actually visualize the hybridization of a DNA probe to chromosomes (1,2). Using this technique, genes could be mapped to a few chromosome bands and often to a single band. The disadvantages of this method were the relatively poor spatial resolution owing to scatter of the radioactive emissions, the length of time for the procedure (long autoradiographic exposure times were typically required), and the poor stability of the probes. The introduction of techniques to detect hybridized probes using fluorochromes in the late 1970s circumvented many of these problems (3,4); however, it was not until the end of the next decade that fluorescence in situ hybridization (FISH) techniques became widely applicable (5,6). In the short time since the development of FISH, the technique has had a major impact on efforts to map the human genome because:
1. It is a rapid technique;
2. The efficiency of hybridization and detection is high;
3. The sensitivity and specificity are very high;
4. The spatial resolution is high;
5. Large numbers of cells can be analyzed in a short time;
6. Data can be obtained from nondividing or terminally differentiated cells; and
7. The technique has been adapted for automated systems.
Several authors have addressed the applications of FISH in cytogenetic analysis, cancer diagnosis, and gene mapping (7,8). This chapter examines the role of FISH in gene mapping with an emphasis on detailed techniques.

1.1. Principles of FISH
The technique of FISH is based on the same principle as Southern blot analysis, namely, the ability of single-stranded DNA to anneal to complementary DNA. The essential steps in FISH are described briefly here and are illustrated in Fig. 1. As in Southern blot analysis, the target DNA is attached to a substrate; in the case of FISH, the target DNA is the nuclear DNA of interphase cells or the DNA of metaphase chromosomes that are affixed to a glass microscope slide (FISH can also be accomplished with bone marrow or peripheral blood smears, or fixed and sectioned tissue). The test probe is labeled, most commonly by enzymatic incorporation of modified nucleotides (probe labeling systems are described in a subsequent section). To allow hybridization of complementary sequences to occur, the cellular DNA and labeled probe DNA are denatured by heating in a formamide solution to form single-stranded DNA. A solution containing the probe DNA is applied to the microscope slide, the slides are covered with coverslips and sealed, and hybridization is allowed to occur by overnight incubation at 37-40°C. Thereafter, the unbound probe is removed by extensive washes in formamide-SSC, and the slides are processed for probe detection. As described in Section 1.7.1., the most suitable probes for FISH analyses are large genomic clones, such as λ phage or cosmid clones. These probes contain ubiquitous repetitive sequences, such as Alu and KpnI elements. The crosshybridization of repetitive sequences to nontargeted chromosomes can result in substantial background hybridization signal.
This background labeling can be suppressed under appropriate preannealing conditions, using unlabeled total human DNA or the Cot-1 DNA fraction, which is enriched in highly repetitive DNA, as a competitor (9,10).

1.2. Probe Labeling Schemes
A variety of schemes have been described for labeling probes with nonradioactive compounds; these include enzymatic incorporation of modified
Fig. 1. Schematic diagram of the technique of FISH (panel labels: denature cellular DNA; denature probe; apply denatured probe to denatured cellular DNA; hybridization of DNA probe to complementary cellular DNA).
nucleotides and chemical labeling techniques (11). For applications involving in situ hybridization, enzymatic incorporation of nucleotides modified with biotin or digoxigenin by nick translation or polymerase chain reaction (PCR)-labeling techniques results in a high labeling efficiency, and is usually preferred over chemical labeling techniques employing photoreactive compounds, e.g., photobiotin.

1.3. Amplification and Labeling of DNA Using PCR
Although the PCR was introduced initially to amplify single loci in target DNA, it has been used increasingly to amplify multiple loci simultaneously. The major application of “general” amplification of DNA has
been the rapid generation of new clones from particular genomic regions. The initial methods were based on amplification of repetitive sequences within the genome, and resulted in the amplification of segments between suitably positioned repeats (interspersed repetitive sequence PCR [IRS-PCR]). Thus, IRS-PCR is applicable only to those species where abundant repeat families have been identified. In humans, the most abundant family of repeats is the Alu family, estimated to comprise 900,000 elements in the haploid genome, with an average spacing of 3-4 kb. Alu-PCR has been used to create human chromosome- and region-specific libraries, as well as to amplify and label human sequences from somatic cell hybrids or yeast artificial chromosomes (YACs) (12). IRS-PCR or Alu-PCR has notable advantages for species-specific amplification in which genomes are mixed, such as in rodent × human hybrids. However, these repeats are not uniformly distributed, a phenomenon that introduces a bias in cloning experiments. This is evidenced by the R-banding pattern that can be generated in hybridizations of probes prepared by Alu-PCR of total genomic DNA or chromosome-specific libraries (13). For the purposes of gene mapping, Alu-PCR is particularly applicable to the localization of genomic clones with large inserts, such as YAC clones, since the human insert DNA can be amplified preferentially.

In the past few years, several simple PCR techniques have been introduced for the general amplification of DNA. Two major techniques, known as degenerate oligonucleotide-primed PCR (DOP-PCR) (14) and sequence-independent amplification (SIA) (15), employ oligonucleotides of partially degenerate sequence. This feature, together with a PCR protocol using a low initial annealing temperature, ensures priming from multiple evenly dispersed sites within a given genome. The methods are species-independent and can be used for the efficient amplification of DNA from all species using the same primers.
In the context of FISH applications, these methods are particularly useful for the amplification and labeling of YACs and other genomic clones, the amplification and labeling of microdissected chromosomal material, and the amplification of DNA from small numbers of cells from frozen sections or paraffin-embedded, formalin-fixed specimens for studies such as comparative genomic hybridization analysis of tumors (16).

1.4. Labeling of PCR Products for FISH
The Alu-PCR, DOP-PCR, and SIA products from the YACs can be labeled using several different methods. We have used both nick translation (see Section 2.3.) and PCR labeling extensively, and have obtained good results with both methods; however, we have found that PCR labeling generally results in
Fig. 2. Schematic diagram of the detection of hybridized probes following FISH. (A) FITC-conjugated avidin; (B) rhodamine-antidigoxigenin antibody.
stronger signals. The products can be labeled with biotin in a second PCR by the incorporation of Bio-dUTP (see Section 3.6.).

1.5. Detection of Hybridized Probes
The visualization of hybridized probes can be accomplished in several ways:
1. By fluorochromes that are detected by fluorescence microscopy;
2. By chemiluminescence detected by an emulsion overlay or detected directly by photon counting devices; or
3. By high-density colored precipitate generated by enzymatic assays, e.g., alkaline phosphatase or horseradish peroxidase, or the use of metallic compounds, such as colloidal gold or silver, which are visualized by phase contrast, Nomarski optics, or electron microscopy.

As a result of the high sensitivity of detection, improved spatial resolution, the commercial availability of the reagents, and the greater potential for simultaneous multiprobe analysis, biotin- and digoxigenin-labeling combined with fluorescent detection are currently the most widely used procedures (Fig. 2). Biotin-labeled probes are usually detected with fluorescein isothiocyanate (FITC)- or rhodamine-conjugated avidin, whereas digoxigenin-labeled probes can be detected with FITC-labeled or rhodamine-labeled antidigoxigenin antibodies. To enable one to visualize the cellular material, the slides are typically counterstained with DNA-binding fluorochromes, such as propidium iodide or 4',6-diamidino-2-phenylindole (DAPI). DAPI staining induces a chromosomal banding pattern that is identical to G-banding or Q-banding; thus, DAPI is preferred for the analysis of metaphase cells. Each of the methods described typically uses indirect detection procedures. Recently, direct labeling techniques have been developed in which probes are
labeled directly with nucleotides conjugated to fluorochromes (17). This allows microscopic examination immediately after hybridization and eliminates the requirement for the detection steps. The potential applications of FISH are increased significantly by multiple probe hybridization protocols, which allow the delineation of several target sequences simultaneously (18).
1.6. Extended Chromatin Preparations
The determination of the physical locations of genes and DNA segments on individual chromosomes is an important aspect of genome research. Correct orientation and ordering of DNA markers are also critical in the identification of disease-related genes on the basis of chromosome location. During the past 5 yr, FISH has played a major role in this effort; using this technique for the analysis of prometaphase or metaphase chromosomes, it is possible to assign relative positions of genes and DNA segments as close as 1 Mb apart. For higher-resolution mapping, FISH can be applied to interphase nuclei or pronuclei (19). Since the DNA is less condensed in interphase nuclei than in metaphase chromosomes, resolution in the 50-100 kb range can be obtained. However, mapping the distance between two probes in three-dimensional nuclei or compressed two-dimensional nuclei is complex, and requires a large data sampling. A linear correlation between measured interphase distances and kilobase distances is observed up to 500 kb (the limit varies between 500 kb and 1 Mb); however, with increasing distances, the measurements become more inaccurate because of chromatin folding. See Chapter 6 of this volume for more information on this topic.

Recently, a number of techniques for releasing DNA fibers from nuclei have been described, and FISH of extended chromatin fibers has been used for a variety of purposes (Table 1). The principle of these methods is that a small region of DNA, when extended to the expected length of relaxed duplex DNA, spans a distance that is visible through a light microscope (20-23). The resultant preparations differ somewhat for the various methods, and thus, the investigator may wish to select the appropriate method depending on the particular application (Table 2). We have used the methods described by Fidlerova et al. (23) extensively and have found these procedures to yield optimal preparations routinely.
The advantages of these methods to release chromatin fibers are that the DNA is virtually straight, which simplifies length measurements, as opposed to DNA that winds or loops, and the degree of stretching can be controlled by experimental conditions and by the particular method. Highly stretched DNA can be used for high-resolution mapping (NaOH method), whereas DNA stretched to a lesser extent can be used for long-range mapping (YAC mapping with the formamide method). Moreover, fluorescent signal on a stretch of DNA is continuous, or nearly continuous, which may facilitate the
Table 1
Applications of Extended DNA Preparations

Confirm the presence, and determine the position, of clones (phage/cosmid/plasmid/cDNA) within larger clones, e.g., mapping a cosmid within a YAC
Preparing physical maps:
   Ordering probes using two- or three-color FISH
   Determining the orientation of clones by cohybridizing end clones and entire clones
   Identifying overlapping sequences
   Estimating distances between sequences
   Can be combined with linkage mapping to determine whether gaps identified by linkage are genuine and to estimate the physical distance of gaps
Mapping amplified genes, and determining the spacing and arrangement of amplified genes
Identify possible deletions, insertions, inversions, or complex rearrangements in the genome
Generate fingerprint maps of repetitive sequences in the genome
Functional analysis of chromatin:
   Differential packaging of DNA
   Transcriptional activity of DNA
   Metaphase vs interphase DNA
   DNA at different stages of the cell cycle
detection of small structural differences. Recently, several groups of investigators have described methods to prepare extended chromatin fibers from large genomic clones, such as YACs; these methods are particularly suitable to fine physical mapping by FISH on relatively small genomic regions (100 kb to 2 Mb) (24).
1.7. Analysis of FISH Experiments
Although FISH is a well-established technique, there has been relatively little standardization in the procedures used for the analysis of slides from FISH studies or in the reporting of the results of these studies in the literature. Nonetheless, there are some general principles that have emerged that should be considered by any investigator performing FISH. First, although the efficiency of hybridization of FISH is very high for most types of probes, it usually does not reach 100%. For cDNA probes, the efficiency may be very low, particularly for short probes that are <600 bp, and it may be necessary to use image acquisition and analysis techniques. Second, the level of resolution of FISH signals is excellent, but the location of the signal may vary slightly from cell to cell, particularly between metaphase cells of varying degrees of condensation. By analyzing multiple cells, including prometaphase or prophase cells, most
Table 2
Techniques for the Preparation of Extended Chromatin Fibers

Extraction | Comments | Resolution | Refs.
m-AMSA and alkaline extraction | Produces highly elongated, undefined spindle-shaped structures | ~10 kb | 22
Detergent and high salt | Produces halo preparations; however, greater release of DNA is achieved, and DNA is more elongated; looping of DNA limits usefulness for regions of >200 kb | 10 kb (upper limit: 200 kb) | 21
SDS | SDS dissolves membranes and releases a stream of DNA, which can be spindle-shaped | 1-2 kb (upper limit ≥200 kb) | 20
Formamide | Results in a comet-like tail of released chromatin; the borders of disrupted nuclei can be defined | 1-5 kb (upper limit ≥800 kb) | 23
NaOH-alcohol | Results in a network of straight chromatin fibers or an irregular network | 1 kb (upper limit ≥800 kb) | 23
probes can be mapped to a single chromosomal subband. For these reasons, it is critical that a sufficient number of cells be analyzed both to verify the distribution of signal and to determine the most precise localization of signal. A third issue relates to the nature of cytogenetic analysis, which relies on the interpretation of chromosomal banding patterns. An inherent feature of all interpretative assays is that they are subject to error. To minimize the chance of errors either in the identification of chromosomes or the designation of bands, we recommend that the analysis of FISH experiments be performed by two well-trained individuals. Similarly, to avoid the possibility of laboratory errors in the case of the localization of new genes, we repeat the probe labeling and hybridizations a second time to confirm that the probe maps to the same site. Additional recommendations are described in Sections 1.7.1.-1.7.3. for specific applications.

1.7.1. Localization of Single-Copy Genes/DNA Sequences
The localization of genomic or cDNA clones requires a maximal degree of precision. Given the tremendous efforts to prepare genetic linkage and physical maps of the human genome, accuracy in assigning the location of probes is essential to integrating cytogenetic, genetic linkage, and physical maps, and to the identification of disease-related genes. With respect to genomic clones (λ phage, cosmid, BAC, P1, or YAC clones), we routinely determine the location of signal in 25 DAPI-stained metaphase cells. It is critical that DAPI staining or another method to obtain R or Q/G bands be used. The intensity of the signal obtained for genomic clones is sufficient so that the cells are scored through the microscope, although an imaging system is used to capture and merge images of the banded chromosomes and the hybridization signal for the preparation of photographs for publication.
Espinosa and Le Beau

The number of cells containing 0-4 signals on the four chromatids containing the test sequence is determined, as are the number and location of background signals. An additional 10 cells are scored by a second individual who is unaware of the results obtained by the first investigator. The hybridization is repeated, and at least 10 cells are scored to confirm the initial results. In reporting the results, we describe the number of cells with 0, 1, 2, 3, or 4 chromatids labeled, the distribution of signal in each band (or subband) at the site of specific hybridization, and the location of background signals. The first data allow other scientists to evaluate the specificity of the hybridization. In most hybridizations in which the ratio of probe to placental and Cot1 DNA is optimized, the number of background signals will be very small, on the order of 0-10 in 25 metaphase cells. In those instances where this value is higher, it may be necessary to list only those sites that contain signal doublets, or were labeled more than once. Although the format used to report FISH localizations is variable, a consensus among investigators experienced in gene mapping by FISH is that the minimum data that must be included in a report in the literature are:
1. The proportion of cells with signal;
2. The proportion of the total signals at the locus;
3. The location of background signal; and
4. The most precise location of the gene/DNA sequence.
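The minimum reporting data listed above can be tallied directly from the scoring sheet. The following Python sketch is one possible way to do this, not the authors' procedure; the function name, the dictionary keys, and the band names in the example are hypothetical and chosen only for illustration.

```python
def summarize_fish_scoring(chromatids_labeled, signals_by_band, locus_bands):
    """Summarize metaphase scoring data for a FISH localization report.

    chromatids_labeled: one integer (0-4) per metaphase cell scored.
    signals_by_band:    dict mapping band name -> number of signals seen there.
    locus_bands:        set of bands regarded as the specific hybridization site.
    """
    if not chromatids_labeled:
        raise ValueError("no cells scored")
    if any(n < 0 or n > 4 for n in chromatids_labeled):
        raise ValueError("each cell must have 0-4 labeled chromatids")

    n_cells = len(chromatids_labeled)
    total_signals = sum(signals_by_band.values())
    locus_signals = sum(c for band, c in signals_by_band.items() if band in locus_bands)
    # Everything outside the specific site is reported as background.
    background = {band: c for band, c in signals_by_band.items() if band not in locus_bands}

    return {
        "cells_scored": n_cells,
        "cells_by_chromatids_labeled": {k: chromatids_labeled.count(k) for k in range(5)},
        "proportion_cells_with_signal": sum(n > 0 for n in chromatids_labeled) / n_cells,
        "proportion_signals_at_locus": locus_signals / total_signals if total_signals else 0.0,
        "background_signals": background,
    }


# Hypothetical scoring data for 25 metaphase cells (band names are illustrative).
example = summarize_fish_scoring(
    chromatids_labeled=[4] * 10 + [3] * 8 + [2] * 4 + [1] * 2 + [0],
    signals_by_band={"5q31.1": 50, "5q31.2": 24, "12p13": 2},
    locus_bands={"5q31.1", "5q31.2"},
)
print(example)
```

The most precise map location (item 4) remains an interpretive call by the cytogeneticist and is deliberately not computed here.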
In contrast to genomic clones in which the efficiency of hybridization and detection is high, cDNA clones are often substantially more difficult to localize. The percentage of cells with specific signal may be quite variable, and background signal can be high. In our experience, cDNA probes larger than 1 kb present few problems in mapping, and can be visualized and scored through the microscope; however, clones that are <600 bp can be problematic and may require an imaging system for analysis. Some laboratories with extensive experience at mapping cDNA clones have adopted the following criteria (J. Korenberg, personal communication). An initial analysis is performed on a selection of metaphase cells. For a hybridization to be considered successful, one or both homologs should have signal doublets in >10% of these metaphase cells. Because the incidence of single background signals may be high, only cells with doublet signals should be scored. As a minimum, 20 metaphase cells that have any doublet signals should be scored, and the location of the doublets recorded. At least 50% of the doublets must be located at the same chromosomal site to be considered evidence for determining the map location of the probe. The total number of doublets, the number and percent of doublets at the site of the gene, the location of other doublet signals (>1%), and the most precise assignment should be reported.

1.7.2. Analysis of Cancer-Specific Rearrangements
FISH has played an important role in the molecular-cytogenetic analysis of cancer-specific rearrangements. Specifically, FISH can be used to flank translocation breakpoints with specific DNA probes, to split translocation breakpoints and, hence, facilitate the cloning of the genes affected by these rearrangements (25), and to identify the smallest commonly deleted region of recurring chromosomal deletions (26,27). FISH can also be used in the subsequent stages in the identification of cancer-related genes, such as in the preparation of a genomic contig of a deleted segment or translocation breakpoint region, and to confirm that cDNA clones that are isolated from a specific region are derived from the correct region of the genome. The analysis of hybridizations to metaphase cells from human tumors is often complicated by the limited number of metaphase cells available, the relatively poor quality of some metaphase cells from tumors, and the presence of normal cells in many samples. To address some of these problems, it may be necessary to hybridize several slides to a single probe (or to use the entire slide, rather than only a 22-mm² section), and to hybridize a control probe, such as a centromere-specific probe, to assist in the identification of the rearranged chromosome of interest. When material is limited, we frequently cohybridize multiple probes using two-color FISH; we have also found that it is possible to hybridize the same slide multiple times. This is typically accomplished by using a different fluorochrome to detect the second probe, but the same fluorochrome can also be used since the denaturation step removes the initial probe. Subsequent hybridizations are performed as described. In scoring metaphase cells from tumor samples, some of the same issues that arise in the analysis of single-copy probes apply. It is critical that a sufficient number of cells be analyzed, and that the result be confirmed by a second independent observer. Thus, we recommend that a minimum of 10 cells containing the rearrangement be scored by each of two independent observers. Ideally, these are different cells; however, if there is insufficient material, the same cells could be evaluated.

1.7.3. Physical Mapping Using Extended Chromatin Preparations
The applications of FISH to chromatin fibers in genome mapping are numerous, and include the ordering of probes, determination of the physical distance between sequences, determining whether probes overlap, the evaluation of large genomic probes for rearrangements, such as internal deletions, and the mapping and ordering of cDNA probes within larger genomic probes. The stringency of the criteria used for the collection of data varies among these applications.
For example, it may be necessary to evaluate only 10 cells to determine if a YAC has an internal deletion, if two probes overlap, or the relative order of probes, whereas estimating the distance between sequences requires a more extensive analysis. To estimate the distance between probes, one must have a reference or control value; this is generally obtained by measuring the length of the signal for a probe of known size (in kb) (20,24). For example, the length of the signal for the two test probes, as well as the length of the gap between them, is determined. This is accomplished optimally with an image analysis system. The measurements for the gap are normalized to the signal lengths of the two probes of known size. Preliminary studies by several groups of investigators have suggested that the relative length remains constant for DNA stretched to various extents. Maps are determined by averaging the relative lengths from all cells scored; we and other investigators have found that at least 30 cells should be evaluated for these studies (20).
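The normalization described above can be illustrated with a short Python sketch. It assumes one particular way of normalizing: for each fiber, the combined known size of the two probes is divided by their combined signal length to give a kb-per-unit factor, which is then applied to the gap. This formula, the function name, and the tuple layout are assumptions for illustration, not a method stated in the text.

```python
from statistics import mean

MIN_FIBERS = 30  # minimum number of cells recommended in the text


def estimate_gap_kb(measurements, probe_a_kb, probe_b_kb):
    """Estimate the gap between two probes on extended chromatin.

    measurements: list of (signal_a, gap, signal_b) lengths, one tuple per
                  fiber, in any consistent unit (e.g., pixels).
    probe_a_kb, probe_b_kb: known sizes of the two probes in kb.
    Returns (mean gap in kb, True if at least MIN_FIBERS fibers were scored).
    """
    if not measurements:
        raise ValueError("no fibers measured")
    gaps_kb = []
    for signal_a, gap, signal_b in measurements:
        if signal_a + signal_b <= 0:
            raise ValueError("probe signal lengths must be positive")
        # Per-fiber calibration: compensates for differences in stretching.
        kb_per_unit = (probe_a_kb + probe_b_kb) / (signal_a + signal_b)
        gaps_kb.append(gap * kb_per_unit)
    return mean(gaps_kb), len(gaps_kb) >= MIN_FIBERS


# Two hypothetical fibers stretched to different extents, two 40-kb probes.
gap_kb, enough = estimate_gap_kb([(40, 20, 40), (80, 40, 80)], 40, 40)
print(gap_kb, enough)
```

Calibrating each fiber separately reflects the observation that relative lengths stay constant across differently stretched fibers; the boolean flag simply reports whether the 30-cell recommendation has been met.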
2. Materials
2.1. Culture of Mitogen-Stimulated Lymphocytes
1. Culture medium: 90% RPMI 1640, 10% heat-inactivated fetal bovine serum (HI-FBS), 10 mM HEPES, 100 U/mL penicillin, 100 µg/mL streptomycin, pH adjusted to 7.2-7.3 with 7.5% sodium bicarbonate. All reagents are from Gibco-BRL.
2. Phytohemagglutinin (PHA, Murex Diagnostics Inc., Dartford, UK, HA-15, reagent-grade).
3. Colcemid (Gibco-BRL, Gaithersburg, MD).

2.2. Preparation of Metaphase Cells
1. Phosphate buffered saline (PBS) (37°C): 9 g/L NaCl, 0.21 g/L KH2PO4, 0.726 g/L Na2HPO4·7H2O.
2. Hypotonic solution: 0.075 M KCl (37°C).
3. Fixative solution: 1 part glacial acetic acid:3 parts absolute methanol (room temperature). Prepare immediately before use, and mix solution by rotating container prior to each use, since the components separate.
2.3. Nick Translation
1. 1 µg of probe DNA.
2. 10X Reaction mix: 0.5 M Tris-HCl, pH 7.5, 100 mM MgCl2.
3. 10X dNTP mixture: 0.3 mM dATP, 0.3 mM dGTP, 0.3 mM dCTP in 50 mM Tris-HCl, pH 7.5.
4. Biotin-16-dUTP (Boehringer Mannheim, Indianapolis, IN), 0.3 mM (alternatively, other biotin-dUTP derivatives can be used).
5. 10X Digoxigenin-11-dUTP reaction mix: 0.3 mM digoxigenin-11-dUTP (Boehringer Mannheim) in 50 mM Tris-HCl.
6. Escherichia coli DNA polymerase I (Promega Bio Tech, Madison, WI).
7. DNase I: 6.5 mg/mL in 100 mM MgCl2 (Sigma, St. Louis, MO). The optimal size of the nick-translated probe should be below 500 bp. The final concentration of DNase added to the reaction must be determined by gel electrophoresis (see Section 3.2., step 7).
8. Dilution buffer: 10 mM Tris-HCl, pH 7.5, 1 mg/mL BSA.
9. TE buffer: 10 mM Tris-HCl, pH 7.4, 0.1 mM EDTA, pH 8.0.
10. Sephadex G-50 spin column (Pharmacia, Washington, DC, Probe Quant G-50, microcolumns).
11. 1X TBE: 45 mM Tris-borate, 1 mM EDTA.
12. 0.5 mM EDTA.
13. 5% SDS.
2.4. Alu-PCR
1. 10X PCR buffer: 100 mM Tris-HCl, pH 8.4, 500 mM KCl, 20.0 mM MgCl2.
2. 4 mM dNTPs: 4.0 mM of each dNTP: dATP, dCTP, dGTP, and dTTP.
3. Alu-PCR 10 µM primer: CL1, 5'-TCCCAAAGTGCTGGGATTACAG-3' (nt position 23-44 of the consensus Alu repeat); CL2, 5'-CTGCACTCCAGCCTGGG-3' (nt position 244-260). Because of the orientation of these primers, Alu-PCR products will contain only short Alu segments, which may facilitate efficient suppression of Alu sequences in subsequent FISH experiments (12).
4. 2.5 U of Taq DNA polymerase.
5. DNA template: Genomic DNA can be prepared from normal or tumor cells using standard techniques; for tissue samples, 1 mg of tissue yields ~3-5 µg of DNA. Purified cloned DNA, genomic yeast clone DNA, or YAC DNA isolated from pulsed-field gels may also be used. DNA can be quantitated using a spectrophotometer, or alternatively, the concentration can be estimated by electrophoresing a small amount of DNA in an agarose gel, and comparing the intensity of the ethidium-stained gel under UV light to that of reference DNA.
2.5. DOP-PCR
1. 10X PCR buffer: 100 mM Tris-HCl, pH 8.4, 500 mM KCl, 20.0 mM MgCl2.
2. 2 mM dNTPs: 2.0 mM of each dNTP: dATP, dCTP, dGTP, and dTTP.
3. 10X DOP primer (20 µM): primer 6-MW, 5'-CCGACTCGAGNNNNNNATGTGG-3', with N = A, C, G, or T in approximately equal proportions.
4. 2.5 U of Taq DNA polymerase.
5. DNA template (see Section 2.4.): Genomic DNA, purified cloned DNA, or YAC DNA isolated from pulsed-field gels may be used. Procedures for DNA isolation from small amounts of tissue (as little as a few hundred cells) or from paraffin-embedded material have been described (16).
2.6. SIA
1. Buffer A: 40 mM Tris-HCl, pH 7.5, 50 mM NaCl, 10.0 mM MgCl2, 5 mM dithiothreitol, 50 µg/mL BSA, 300 µM of each dNTP, and 1.5 mM Primer A.
2. Buffer B: 6.6 mM Tris-HCl, pH 9.0, 0.25 mM MgCl2, 55 mM KCl, 0.01% (w/v) gelatin, 77 µM of each dNTP, 1.66 mM Primer B.
3. Primer A, 5'-TGGTAGCTCTTGATCANNNNN-3', with N = A, C, G, or T in approximately equal proportions. Primer B, 5'-AGAGTTGGTAGCTCTTGATC-3'.
4. T7 DNA polymerase (Sequenase Version 2.0, USB, Cleveland, OH), 2.5 U of Taq DNA polymerase.
5. A variety of DNA templates can be used for SIA as described for Alu-PCR and DOP-PCR.
2.7. Labeling of PCR Products for FISH
1. 10X PCR buffer: 100 mM Tris-HCl, pH 8.4, 500 mM KCl, 15.0 mM MgCl2, 0.1% (w/v) gelatin.
2. 1.5 mM dATP, dCTP, and dGTP, 1 mM dUTP, 0.4 mM Bio-16-dUTP.
3. 0.15 mM SIA Primer B, DOP-PCR primer, or each Alu-PCR primer.
4. 1.0 U of Taq DNA polymerase.
2.8. Hybridization Techniques
1. 20X SSC: 3.0 M NaCl, 0.3 M Na citrate, pH 7.0.
2. 4X SSC: Dilute 20X SSC 1:5 with dH2O, pH 7.0.
3. 2X SSC: Dilute 20X SSC 1:10 with dH2O, pH 7.0.
4. 3 M potassium acetate: 14.75 g KOAc in 50 mL ddH2O, pH 5.2.
5. Formamide (hybridization buffer): Nucleic-acid-hybridization-grade or molecular-biology-grade, deionized with ion-exchange resin. Molecular-biology-grade formamide may be used for posthybridization washes.
6. Carrier DNA: Salmon sperm DNA, 1 µg/µL, sheared mechanically or by sonication.
7. Placental DNA: 1 µg/µL, sheared mechanically or by sonication.
8. Cot1 DNA: 1 µg/µL (Gibco-BRL).
9. Dextran sulfate: Prepare 20% dextran sulfate in 4X SSC solution.
10. Labeled DNA probe: 1 µg/50 µL.
11. 10X RNase: 1 mg/mL in 2X SSC, pH 7.0.
12. Absolute ethanol, pure.
2.9. Detection of Hybridized Probes
1. Blocking solutions: 2% BSA (400 mg BSA/20 mL 4X SSC), 2% goat serum and 1% BSA, or 2% rabbit serum and 1% BSA in 4X SSC. Blocking solutions are used to reduce background signal from nonspecific binding. In general, the best results are obtained by incubating with diluted normal serum obtained from the same species host as the labeled antibody immediately prior to the application of the labeled antibody.
2. Nonionic detergent: Tween 20 or Triton X-100.
3. DAPI stain: 200 ng/mL in 2X SSC.
4. PDD antifade: dissolve 300 mg p-phenylenediamine dihydrochloride in 30 mL, pH 7.0, Dulbecco's PBS, pH adjusted to 8.0 with 0.5 M sodium bicarbonate. Combine 20 mL PDD solution with 80 mL glycerol. Store in the dark at -20°C.
5. Avidin conjugated with a fluorochrome, e.g., FITC, CY3, Texas red, Rhodamine, 1 mg/mL (all are from Vector Laboratories, Burlingame, CA, except for avidin-CY3, which is available from Amersham Life Science, Arlington Heights, IL).
6. 1 mg/mL Antidigoxigenin antibody (sheep) conjugated with a fluorochrome (Boehringer Mannheim).
7. 1 mg/mL Antiavidin antibody (goat) conjugated with a fluorochrome (Vector Laboratories).
8. 1 mg/mL Biotinylated antiavidin antibody (goat) (Vector Laboratories).
9. 1 mg/mL Antisheep antibody (rabbit) conjugated with a fluorochrome (Vector Laboratories).
2.10. Extended Chromatin Preparations
1. 0.07 M NaOH:ethanol (5:2) (alkaline treatment method) or 70% formamide in 2X SSC, pH 7.0 (formamide treatment method).
2. 70, 95, and 100% Ethanol.

3. Methods
3.1. Culture of Mitogen-Stimulated Lymphocytes
1. Peripheral blood: draw 10 mL of peripheral blood aseptically by venipuncture into a heparinized syringe or vacuum tube. Use only preservative-free heparin, since the preservatives in most heparin formulations suppress the growth of cells.
2. Centrifuge at 225g for 8 min, and transfer the buffy coat to a new sterile tube. Determine the number of leukocytes/mL of sample and the total number of cells using a hemacytometer and Unopette test (Becton Dickinson, San Jose, CA, test 5856), or another method of lysing the red blood cells.
3. Initiate two 10-mL cultures in 25-cm² flasks using complete culture medium prewarmed to 37°C. The cell aliquot (10⁷ leukocytes) should be added in a volume of 1 mL or less; therefore, it may be necessary to concentrate or dilute the sample with medium prior to culture initiation. Bring the volume to 10 mL with culture medium. The final cell density should be 1 × 10⁶ leukocytes/mL. Add PHA to a final concentration of 10 µg/mL.
4. Incubate flasks vertically with caps slightly loose (opened a one-quarter to one-half turn) at 37°C in a humidified 5% CO2/95% air atmosphere for 72 h.
5. Add Colcemid (final concentration of 0.05 µg/mL), and reincubate for 45-60 min.
6. Process using the protocol for the preparation of metaphase cells.
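The arithmetic behind step 3 (10⁷ leukocytes in 1 mL or less, made up to 10 mL) can be sketched in Python as follows. The function name and return keys are hypothetical, and the defaults simply restate the values given in the step.

```python
def plan_culture(leukocytes_per_ml, target_cells=1e7,
                 culture_volume_ml=10.0, max_inoculum_ml=1.0):
    """Work out the inoculum for one lymphocyte culture.

    leukocytes_per_ml: cell count of the buffy-coat sample (cells/mL).
    Returns the sample volume containing target_cells, the medium needed to
    reach culture_volume_ml, the resulting cell density, and a flag that is
    True when the sample is too dilute to fit within max_inoculum_ml.
    """
    if leukocytes_per_ml <= 0:
        raise ValueError("cell count must be positive")
    inoculum_ml = target_cells / leukocytes_per_ml
    return {
        "inoculum_ml": inoculum_ml,
        "medium_ml": culture_volume_ml - inoculum_ml,
        "final_cells_per_ml": target_cells / culture_volume_ml,
        "concentrate_first": inoculum_ml > max_inoculum_ml,
    }


# Hypothetical sample counted at 2.5 x 10^7 leukocytes/mL.
plan = plan_culture(2.5e7)
print(plan)
```

When `concentrate_first` is True, the sample should be concentrated before culture initiation, and the volumes recomputed from the new count.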
3.1.1. Preparation of Metaphase Cells
1. Transfer the contents of the culture flask to a conical centrifuge tube.
2. To minimize the loss of cells, rinse the flask with 3-5 mL of PBS. Then add it to the tube prior to centrifugation.
3. Spin the centrifuge tubes for 8 min at 150-225g. Decant the supernatant.
4. Resuspend the pellet in the residual supernatant by gently tapping the tube, and add 10 mL of prewarmed hypotonic KCl. Mix by bubbling air through the solution with a Pasteur pipet.
5. Incubate the tubes for 8 min at 37°C.
6. Centrifuge for 8 min at 150-225g, and decant the supernatant.
7. Resuspend the pellet, and add two to five drops of freshly prepared fixative while gently tapping the tube. Add approx 1-2 mL of fixative in this manner, and then add an additional 8-10 mL of fixative.
8. Centrifuge tubes at 150g, decant the supernatant, resuspend the cells, and add 10 mL of fixative. Repeat this step until the fixative is clear.
9. Store the tube at -20°C, or repeat step 8 four to eight times with fresh fixative, and prepare slides. Slides prepared immediately are generally of optimal quality. Stored cells should be washed with fresh fixative four to eight times prior to slide preparation.

3.1.2. Slide Preparation
Procedures for preparing slides from cell suspension are numerous and vary from laboratory to laboratory. The following technique works well in our laboratory; however, it may be necessary to modify this technique to obtain optimal results in other laboratories.
1. Clean glass microscope slides by immersing them in 95% ethanol, and wiping with a Kimwipe® or other lint-free cloth.
2. Resuspend the cell pellet in enough fresh fixative (3:1 absolute methanol:glacial acetic acid) to produce a slightly milky cell suspension (the volume of fresh fixative required is approx 10 times the volume of the cell pellet). Take care to dilute the sample correctly, because underdilution can result in an overly dense slide and poor spreading of metaphase cells.
3. Place one drop of fixative on the top of the slide (below the frosted edge); the fixative will spread and cover the surface of the slide. Do not place more than one drop of fixative onto the dry slide, because too much fixative may cause cell nuclei and metaphase cells to accumulate along the edges of the slide.
4. Hold the slide pointing downward at a 60° angle over a steam bath, and drop four to six drops of the cell suspension from a Pasteur pipet held 18-24 in. above the slide.
5. Immediately place the slide over a steam bath created by running hot water (45-50°C) into a pan. Two glass rods, placed 1 1/2 in. apart across the rim of the pan, can be used to hold the slides above the water level in the pan. Remove the slides from the pan when the fixative has dried from the upper surface (2-3 min).
6. Slides may be used immediately or can be stored at -20°C for several months. Slides should be stored in sealed slide boxes containing a desiccant, such as Drierite, and should be thawed slowly at room temperature before use.
3.2. Nick Translation
To nick-translate 1 µg of probe DNA in a 50-µL reaction:
1. Combine 1 µg of probe DNA, 5 µL 10X reaction mix, 5 µL 10X dNTP mix, 2.5 µL Bio-16-dUTP, 20 U DNA polymerase I, and an appropriate dilution of DNase I (1:1000 is a good test dilution). Add the enzymes last.
2. To label a probe with digoxigenin, follow the same steps, but use 5 µL of 10X digoxigenin-11-dUTP reaction mix instead of Bio-16-dUTP.
3. Bring the total volume to 50 µL with filtered dH2O.
4. Vortex, and then centrifuge briefly.
5. Incubate at 14°C for 2 h.
6. After 2 h, remove tubes and place on ice.
7. Determine probe size by gel electrophoresis (see Note 1). Take 5 µL of the nick-translated probe, and add 5 µL of gel loading buffer (50% glycerol, 1X TBE, 1% bromophenol blue, 1% xylene cyanol). Load sample and an appropriate marker onto a 1-2% agarose minigel (1X TBE) and run the gel at 60 mA for approx 30 min. The probe size should be between 300 and 600 bp. Probe size is critical; if the probe is too large, add an additional 5 µL of diluted DNase I, and incubate for 30 min. If the probe is too short, repeat the reaction with less DNase I.
8. Stop reaction by adding 1 µL of 0.5 mM EDTA and 1 µL of 5% SDS, and heating reaction mix at 65°C for 10 min.
9. Remove unincorporated nucleotides by running the reaction sample through a 50-µL Sephadex G-50 column (Pharmacia).
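The size check in step 7 amounts to a three-way decision, which the following Python sketch makes explicit. The function name and the returned labels are hypothetical; the 300-600 bp window and the corrective actions are taken from the step itself.

```python
def assess_probe_size(size_bp, low_bp=300, high_bp=600):
    """Return the next action after sizing a nick-translated probe on a gel.

    size_bp: approximate probe fragment size read from the gel, in bp.
    """
    if size_bp <= 0:
        raise ValueError("probe size must be positive")
    if size_bp > high_bp:
        # Too large: add 5 uL more diluted DNase I and incubate a further 30 min.
        return "digest_further"
    if size_bp < low_bp:
        # Too short: the reaction cannot be rescued; repeat with less DNase I.
        return "repeat_with_less_dnase"
    # Within the window: stop the reaction (EDTA/SDS, 65 C) and purify.
    return "stop_and_purify"


for size in (1200, 450, 150):
    print(size, assess_probe_size(size))
```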
3.3. Alu-PCR
1. Each PCR reaction contains: 100 ng genomic DNA, yeast clone DNA, or YAC DNA isolated from pulsed-field gels, 10 µL 10X PCR buffer, 10 µL 10X dNTPs (final concentration 0.4 mM), 2.5 µL Alu-PCR primer (final concentration 0.25 µM), 2.5 U Taq polymerase, and dH2O to adjust volume to 100 µL.
2. Tubes containing a positive control, such as placental DNA, and a negative control using the same solutions, but without the addition of template DNA, should be prepared.
3. The PCR conditions are initial denaturation at 94°C for 5 min, followed by 35 cycles of denaturation at 94°C for 1 min, 30 s at 37°C, and extension at 72°C for 6 min, with a final extension at 72°C for 10 min.
4. The PCR reaction can be evaluated by electrophoresing 5-10 µL in an agarose gel (1% agarose, 1X TBE) at 100 V along with a size marker. A smear of DNA ranging in size from 600 bp to ~5 kb should be visible in the ethidium bromide-stained gel of amplified human genomic DNA, whereas multiple discrete amplified bands with occasional smears are seen in amplified products of cloned DNA (200-4000 bp), such as YACs; there should be no smear in the negative control.
5. The Alu-PCR products can be labeled by nick-translation (see Section 2.3.) or by PCR labeling (see Section 3.6.).
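The final concentrations quoted in step 1 follow from the stock concentrations listed in Section 2.4. (4 mM dNTPs, 10 µM primer) by simple dilution. A minimal Python sketch of that check is given below; the helper name is hypothetical.

```python
def final_concentration(stock_conc, volume_added_ul, total_volume_ul):
    """Dilution formula C1*V1 = C2*V2, solved for the final concentration C2.

    The result is in the same unit as stock_conc.
    """
    if total_volume_ul <= 0 or volume_added_ul < 0:
        raise ValueError("volumes must be non-negative and the total positive")
    return stock_conc * volume_added_ul / total_volume_ul


# Alu-PCR reaction of 100 uL, stocks as listed in Section 2.4.
dntp_mM = final_concentration(4.0, 10.0, 100.0)    # 10 uL of 4 mM dNTPs
primer_uM = final_concentration(10.0, 2.5, 100.0)  # 2.5 uL of 10 uM primer
print(dntp_mM, primer_uM)
```

The same helper can be used to check the reagent volumes in the other PCR protocols of this chapter against their stated final concentrations.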
3.4. DOP-PCR
1. Each 50-µL PCR reaction contains: 0.1-10 ng DNA, 5 µL 10X PCR buffer, 5 µL 10X dNTPs (final concentration 0.2 mM), 5 µL 10X DOP-primer (final concentration 2 µM), 2.5 U Taq polymerase, and dH2O to adjust volume to 50 µL.
2. Tubes containing a positive control, such as placental DNA, and a negative control using the same solutions, but without the addition of template DNA, should be prepared.
3. The PCR conditions are initial denaturation at 94°C for 10 min, followed by five cycles of denaturation at 94°C for 1 min, 30°C for 1.5 min, a 3-min transition from 30 to 72°C, and extension at 72°C for 3 min, followed by 35 cycles of denaturation at 94°C for 1 min, annealing at 62°C for 1 min, and extension at 72°C for 3 min with an addition of 1 s/cycle to the extension step, with a final extension at 72°C for 10 min.
4. The PCR reaction can be evaluated by electrophoresing 5-10 µL in an agarose gel (1% agarose, 1X TBE) at 100 V along with a size marker. A smear of DNA ranging in size from 200-2000 bp should be visible in the ethidium bromide-stained gel; there should be no smear in the negative control (24).
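The auto-extension in step 3 (1 s added per cycle) lengthens the high-stringency phase noticeably over 35 cycles. The Python sketch below lists the extension time for each cycle; it assumes the first cycle uses the 3-min base time and each later cycle adds 1 s, an interpretation that may differ from how a given thermal cycler implements the increment. The function name is hypothetical.

```python
def extension_times(cycles=35, base_s=180, increment_s=1):
    """Extension time (s) for each cycle when a fixed increment is added per cycle.

    Assumption: cycle 1 uses base_s; cycle k uses base_s + (k - 1) * increment_s.
    """
    if cycles < 1:
        raise ValueError("at least one cycle is required")
    return [base_s + k * increment_s for k in range(cycles)]


times = extension_times()
print(times[0], times[-1], sum(times))
```

Under this assumption the last cycle extends for 214 s, and the extension steps of the 35 cycles total 6895 s, about 10 min longer than 35 fixed 3-min extensions.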
3.5. SIA
1. For SIA, 0.5 µL of the diluted YAC DNA (or 10 ng of genomic DNA or cloned DNA) is added to 5 µL of buffer A. The DNA is denatured at 94°C for 2 min, and cooled to 4°C to allow primer A to anneal at random sites. One unit of T7 DNA polymerase (USB: Sequenase version 2.0) is added in 2.5 µL of buffer A, and the temperature is gradually increased to 37°C over an 8-min interval, and kept at 37°C for 8 min, resulting in the synthesis of the first strand of DNA.
2. After denaturation and annealing, this synthesis step is repeated one more time by adding fresh T7 enzyme in 2.5 µL of buffer A. T7 DNA polymerase is used for this step because it functions well at low temperatures at which random priming complexes are stable, and because it possesses strand displacement capabilities. Strand displacement synthesis enables the enzyme to synthesize long stretches of DNA by displacing other primers that have already annealed to DNA. In the second synthesis step, primer A will also prime on the products from the first round.
3. The products of this second synthesis step are suitable for PCR amplification. The PCR is carried out by adding 90 µL of buffer B and 2.5 U of Taq DNA polymerase. Five low-stringency cycles with denaturation at 94°C for 50 s, annealing at 42°C for 5 min, an increase to 72°C for 3 min, and synthesis at 72°C for 3 min are followed by 33 PCR cycles with denaturation at 94°C for 50 s, annealing at 56°C for 1 min, and synthesis at 72°C for 2 min.
4. Tubes containing a positive control, such as placental DNA, and a negative control using the same solutions, but without the addition of template DNA, should be prepared.
5. The PCR reaction can be evaluated by electrophoresing 5-10 µL in an agarose gel (1% agarose, 1X TBE) at 100 V along with a size marker. A smear of DNA ranging in size from 300-1000 bp should be visible in the ethidium bromide-stained gel; there should be no smear in the negative control.
3.6. Labeling of PCR Products for FISH
1. Each 30-µL PCR reaction contains: 1 µL PCR product, 3 µL 10X PCR buffer, 3 µL 10X dNTPs, 1.0 U of Taq DNA polymerase, and dH2O to adjust volume to 30 µL.
2. The PCR conditions are 18 cycles of 50 s at 94°C, 1 min at 56°C, and 2 min at 72°C.
3. Tubes containing a positive control, such as placental DNA, and a negative control using the same solutions, but without the addition of template DNA, should be prepared.
4. The labeled PCR products (prepared by either biotinylation or labeling with directly labeled nucleotides; see below) are ethanol-precipitated and resuspended in 10 µL TE buffer; 8 µL of this DNA is treated with DNase I (200 pg/µL) for 5-10 min at room temperature. After 10 min of heat inactivation at 65°C, the DNA is ethanol-precipitated and resuspended in 10 µL of TE.
5. FISH is performed as described in Section 3.7. Amplified products from YACs are hybridized at a concentration of 90-120 ng probe/slide, along with 0.63 µg of placental DNA and 1-2 µg Cot1 DNA/slide.
Using the second approach, the amplified products are labeled with the Spectrum Orange™ or Spectrum Green™ fluorophore (Vysis, Inc., Downers Grove, IL) by performing a PCR under the same conditions as described with each dNTP at 150 µM, and the Spectrum Orange-dUTP at 30 µM. Digoxigenin-labeled nucleotides do not yield good results using PCR-labeling methods owing to the poor incorporation of this large labeled nucleotide.
3.7. Hybridization
3.7.1. Probe Preparation
1. Combine 3 µL salmon sperm DNA (~1-4 µg), 0.1 µg labeled probe, placental DNA (~0.3-1 µg), and Cot1 DNA (between 0 and 1 µg, depending on the repetitive elements in the probe DNA; see Note 2).
2. Ethanol-precipitate by adding 1/20 vol 3 M KOAc and 2-2.5 vol of ice-cold pure-grade absolute ethanol. Vortex and then centrifuge for 5 s, chill in a -80°C freezer for 30 min, and then centrifuge at 16,000g at 4°C for 20 min. Decant the supernatant, then blot on paper towels, and air-dry for 30 min or vacuum-dry for 5 min.
3. Add 10 µL of hybridization buffer (5 µL formamide/5 µL 20% dextran sulfate in 4X SSC) to each tube, and vortex 1 h.
4. Denature hybridization mixture at 75°C for 5 min, and preanneal at 37°C for 20-30 min.
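The reagent volumes for the ethanol precipitation in step 2 scale with the volume of the DNA mixture. The Python sketch below computes them; it assumes that both the 1/20 vol of 3 M KOAc and the 2-2.5 vol of ethanol are taken relative to the starting sample volume, which the text does not state explicitly. The function name and return keys are hypothetical.

```python
def precipitation_volumes(sample_ul):
    """Volumes (uL) of 3 M KOAc and absolute ethanol for an ethanol precipitation.

    Assumption: both ratios refer to the starting sample volume.
    """
    if sample_ul <= 0:
        raise ValueError("sample volume must be positive")
    return {
        "koac_3M_ul": sample_ul / 20,        # 1/20 vol of 3 M KOAc
        "ethanol_min_ul": 2.0 * sample_ul,   # 2 vol of ice-cold ethanol
        "ethanol_max_ul": 2.5 * sample_ul,   # 2.5 vol of ice-cold ethanol
    }


# Hypothetical 20-uL probe/competitor mixture.
vols = precipitation_volumes(20.0)
print(vols)
```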
3.7.2. Slide Preparation and Denaturation
1. Prepare 200 mL RNase (dilute 10X RNase with 2X SSC, pH 7.0; final conc. 100 µg/mL) in a glass staining box and warm to 40°C in a water bath.
2. Warm slides to room temperature, and check under the microscope to locate the metaphase or interphase cells for hybridization. Mark the area with a glass etching pen, and place slides in a glass rack.
3. Place slides in RNase for 1 h.
4. Remove slides and wash 4 × 2 min in 2X SSC and 1 × 2 min each in 70, 80, and 95% ethanol at room temperature.
5. Remove the slides from the glass rack and air-dry.
6. Place the slides in a glass rack, and denature in a solution of 70% formamide/30% 4X SSC (pH 7.0) for 2 min at 75°C. (The solution will cool a few degrees when the slides are added.)
7. After denaturation, dehydrate the slides in a graded alcohol series for 2 min each in 70, 80, and 95% ethanol, air-dry, and then place on a slide warmer at 40°C.
8. Apply 10 µL of denatured hybridization/probe mix to each slide in the premarked area, taking care to avoid bubbles, and then cover with a 22 × 22 mm coverslip.
9. When all slides are finished, seal with rubber cement. This can be accomplished easily by applying the rubber cement with a 5-10 mL syringe without the needle attached.
10. Place in a prewarmed humid chamber. (A plastic refrigerator box with a lid that seals tightly works well.) Incubate overnight at 37°C.
11. Prepare solutions of 50% formamide/50% 4X SSC (50/50 mix) and 4X SSC (pH 7.0) and warm to 40°C.
12. Remove glue from slides carefully.
13. Soak slides for 2 min in 50/50 mix. The coverslips should slide off gently, but it is sometimes necessary to pull them off very gently.
14. Rinse 3 × 5 min in 50/50 solution, agitating gently.
15. Next rinse 4 × 2 min in 4X SSC.
16. Drain slides, and proceed to Section 3.8.
3.8. Detection of Hybridized Probes
Detection and amplification schemes can be quite complicated and varied. However, detection normally involves blocking of the target DNA with an appropriate serum, followed by detection with an appropriate reporter molecule. Amplification of the signal involves using a fluorochrome-conjugated antibody specific for the reporter molecule. The following procedure works well in our laboratory, but is flexible and can be adjusted to most laboratory conditions.
1. Apply 200 µL of blocking solution, coverslip, and incubate for 30-60 min in a humid chamber at 37°C.
2. Gently remove coverslip, and apply 200 µL of blocking solution containing 5 µg/mL of avidin-fluorochrome (to detect a biotinylated probe) or 5 µg/mL antidigoxigenin-fluorochrome sheep antibody (to detect a digoxigenin-labeled probe). For dual-color detection (simultaneous detection of a biotin-labeled probe and a digoxigenin-labeled probe), apply 200 µL of blocking solution containing 5 µg/mL of avidin-fluorochrome and 5 µg/mL antidigoxigenin antibody conjugated with a different fluorochrome. These detection solutions should be filtered through a 0.2-µm Millipore filter before use.
3. Coverslip and incubate for 30-60 min at 37°C in a humid chamber.
4. Remove coverslips and wash 3 × 3 min in 4X SSC/0.1% Triton-X at 39°C. To amplify a biotin-labeled probe, proceed to step 7; to amplify a digoxigenin-labeled probe, proceed to step 12. Dual-color amplification can be achieved by amplifying the biotin-labeled probe first (steps 7-10) followed by amplification of the digoxigenin-labeled probe (steps 12-16).
5. Stain each slide in DAPI stain for 20-60 s. Note: You may need to rinse slide in dH2O before staining in DAPI. Rinse in 2X SSC for 10 s.
6. Coverslip using three to four drops antifade with PDD. Blot the slides on a paper towel, and store in a slide box at 4°C.
7. Apply 200 µL of blocking solution, and coverslip for 0-30 min in a humid chamber at 37°C. Note: Amplification incubation times may vary and should be determined empirically.
8. Gently remove coverslip, and apply 200 µL of blocking solution containing 5 µg/mL of antiavidin antibody conjugated with the appropriate fluorochrome, e.g., amplify avidin-FITC with antiavidin-FITC (see Note 3).
9. Coverslip and incubate for 30 min at 37°C in a humid chamber.
10. Remove coverslips, and wash 2 × 3 min in 4X SSC/0.1% Triton-X at 39°C.
11. If only a biotinylated probe is being amplified, drain and stain slides as described in steps 5 and 6. To amplify a digoxigenin probe, proceed to step 12.
12. Apply 200 µL of blocking solution, and coverslip for 0-30 min in a 37°C humid chamber.
13. Gently remove coverslip, and apply 200 µL of blocking solution containing 5 µg/mL of antisheep antibody (for sheep antidigoxigenin-fluorochrome) conjugated with the appropriate fluorochrome.
14. Coverslip and incubate for 30 min at 37°C in a humid chamber.
15. Remove coverslips, and wash 2 × 3 min in 4X SSC/0.1% Triton-X at 39°C.
16. Drain and stain slides as described in steps 5 and 6.
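Because the steps of Section 3.8. are not performed in numerical order when a signal is amplified, the routing can be easier to follow when written out. The Python sketch below returns the working steps, in order, for a given probe labeling; the function name and arguments are hypothetical, and the routing steps 11 and 16 are represented by the staining steps 5 and 6 that they point to.

```python
def detection_steps(biotin, digoxigenin, amplify):
    """Return the ordered step numbers of Section 3.8. to perform.

    biotin, digoxigenin: True if a probe with that label was hybridized.
    amplify: True if the signal is to be amplified.
    """
    if not (biotin or digoxigenin):
        raise ValueError("at least one probe label is required")
    steps = [1, 2, 3, 4]              # blocking, detection, washes
    if amplify:
        if biotin:
            steps += [7, 8, 9, 10]    # antiavidin amplification
        if digoxigenin:
            steps += [12, 13, 14, 15] # antisheep amplification
    steps += [5, 6]                   # DAPI stain and antifade, always last
    return steps


print(detection_steps(biotin=True, digoxigenin=True, amplify=True))
```

For dual-color work, the biotin-labeled probe is amplified before the digoxigenin-labeled probe, as specified in step 4.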
3.9. Extended Chrumatin Preparations 3.9.1. Cultures and Slide Preparation 1. Establish PHA-stimulated lymphocyte cultures as described, and incubate at 37°C for 72 h. 2 The cultures are harvested as described with the exception that the exposure to Colcemid is omitted. 3. Slides are prepared by dropping the ceil suspension onto clean moist slides. Before the fix has dried, place shdes in PBS for 1 mm. Dram slides on a paper towel. Treat mnnediately with either the alkalme or formamtde treatments. The fixed cells can be used immediately to prepare slides or can be stored at -20°C. Slides that are prepared immediately typically yield better preparations and more umform release of DNA.
3.9.2. Alkaline Treatment 1. Place 100 pL NaOH:ETOH on the top edge of the slide, below the frosted end. Place the edge of the coverslip along the slide (horizontal), and move the coverslip along the slide at a 30” angle. (Alkaline treatment results in the immediate disruption of the nuclei.) Blot residual fluid with a paper towel. 2. Hold the slide horizontally, and rinse with methanol using a ptpet. (Use only small amounts of methanol, and apply dropwise gently.) Note. A viscous consistency of the fluid dripping off the shde indicates major loss of DNA. Thereafter, the slide can be held vertically and rinsed several more times gently. 3. An-dry slide, and pass through 70,95, and 100% ethanol (2 mm each). 4. Air-dry, and store at -20°C in the presence of a desiccant
3.9.3. Formamide Treatment (see Note 4)
1. Place 100 µL formamide:SSC on the top edge of the slide, below the frosted end. Place the edge of the coverslip along the slide (horizontal), and move the coverslip along the slide at a 30° angle. Blot residual fluid with a paper towel.
2. Hold the slide horizontally, and rinse with methanol using a pipet. (Use only small amounts of methanol, and apply dropwise gently.) Thereafter, the slide can be held vertically and rinsed several more times gently. The nuclei disrupt on contact with methanol, and the released chromatin is fixed simultaneously.
3. Air-dry slides, and pass through 70, 95, and 100% ethanol (2 min each).
4. Air-dry, and store at -20°C in the presence of a desiccant.
4. Notes 1. The size of the nick-translated probe is critical to the outcome of FISH experiments, and should be between 300 and 600 bp. If the fragments are too small, no
hybridization signal or only background signal will be observed (a faint dusting of signal may be seen on the chromosomes), since many hybridized fragments will be removed by the posthybridization washes. In contrast, if the probe is too large, nonspecific hybridization can occur, which is visualized as many signals distributed randomly on the chromosomes.
2. Suppression of hybridization signals from ubiquitous repetitive sequences, such as Alu or Kpn elements, is achieved using total human DNA and the Cot-1 fraction of DNA for a reannealing process that is based on rapid reassociation kinetics. The amount of competitor DNA needed varies depending on the number of repetitive elements in any particular genomic sequence. It is critical that the concentration of these competitors be adjusted appropriately, particularly the Cot-1 DNA, since too little results in high background and too much results in weakened probe signal. A good starting point is to prepare the probe with 0.5 µg placental DNA and 0.5 µg Cot-1 DNA; the concentration of the Cot-1 DNA can then be increased or decreased as needed.
3. Alternatively, biotinylated antiavidin antibody may be substituted for antiavidin-fluorochrome. The biotinylated antiavidin must then be followed by additional rounds of washes and detection with avidin-fluorochrome. This method of amplification produces strong specific signal, but may also produce background.
4. Alkaline treatment with NaOH results in the complete disruption of the nuclei, and the slides are covered with a network of straight chromatin fibers (or occasionally an irregular network). With the formamide treatment, the result is a comet-like tail of released chromatin, and the borders of most of the disrupted nuclei can be defined, allowing signals from the same nucleus to be identified. Hybridization signals usually appear as a linear array of dots (~1% of signals will appear as a continuous line of fluorescent dots). Signals are longer in NaOH-treated cells owing to the more extended chromatin. For shorter probes, NaOH may be preferable, whereas formamide treatment may be preferable for longer clones, such as YACs. The lower limit of resolution of these methods is 1 kb; the upper limit has not been established yet, but is probably >800 kb.
References
1. Evans, H. J., Buckland, R. A., and Pardue, M. L. (1974) Location of the genes coding for 18S and 28S ribosomal RNA in the human genome. Chromosoma 48, 405-426.
2. Harper, M. E. and Saunders, G. F. (1981) Localization of single copy DNA sequences on G-banded chromosomes by in situ hybridization. Chromosoma 83, 431-439.
3. Rudkin, G. T. and Stollar, B. D. (1977) High resolution detection of DNA-RNA hybrids in situ by indirect immunofluorescence. Nature 265, 472, 473.
4. Bauman, J. G., Wiegant, J., Borst, P., and van Duijn, P. (1980) A new method for fluorescence microscopical localization of specific sequences by in situ hybridization of fluorochrome-labelled RNA. Exp. Cell Res. 128, 485-490.
5. Lawrence, J. B., Villnave, C. A., and Singer, R. H. (1988) Sensitive, high-resolution chromatin and chromosome mapping in situ: presence and orientation of two closely integrated copies of EBV in a lymphoma line. Cell 52, 51-61.
6. Pinkel, D., Landegent, J., Collins, C., Fuscoe, J., Segraves, R., Lucas, J., and Gray, J. W. (1988) Fluorescence in situ hybridization with human chromosome-specific libraries: detection of trisomy 21 and translocations of chromosome 4. Proc. Natl. Acad. Sci. USA 85, 9138-9142.
7. Trask, B. J. (1991) Fluorescence in situ hybridization: applications in cytogenetics and gene mapping. Trends Genet. 7, 149-154.
8. Le Beau, M. M. (1993) Fluorescence in situ hybridization in cancer diagnosis, in Important Advances in Oncology (de Vita, V. T., Jr., Hellman, S., and Rosenberg, S. A., eds.), J. B. Lippincott, Philadelphia, pp. 29-45.
9. Lichter, P., Cremer, T., Borden, J., Manuelidis, L., and Ward, D. C. (1988) Delineation of individual human chromosomes in metaphase and interphase cells by in situ suppression hybridization using recombinant DNA libraries. Hum. Genet. 80, 224-234.
10. Landegent, J. E., Jansen in de Wal, N., Dirks, R. W., Baas, F., and van der Ploeg, M. (1987) Use of whole cosmid cloned genomic sequences for chromosomal localization by non-radioactive in situ hybridization. Hum. Genet. 77, 366-370.
11. Lichter, P. and Ward, D. C. (1990) Is non-isotopic in situ hybridization finally coming of age? Nature 345, 93-94.
12. Lengauer, C., Green, E. D., and Cremer, T. (1992) Fluorescence in situ hybridization of YAC clones after Alu-PCR amplification. Genomics 13, 826-828.
13. Baldini, A. and Ward, D. C. (1991) In situ hybridization of human chromosomes with Alu-PCR products: a simultaneous karyotype for gene mapping studies. Genomics 9, 770-774.
14. Telenius, H., Carter, N. P., Bebb, C., Nordenskjold, M., Ponder, B. A. J., and Tunnacliffe, A. (1992) Degenerate oligonucleotide-primed PCR: general amplification of target DNA by a single degenerate primer. Genomics 13, 718-724.
15. Bohlander, S. K., Espinosa, R. III, Le Beau, M. M., Rowley, J. D., and Diaz, M. O. (1992) A method for the rapid sequence-independent amplification of microdissected chromosomal material. Genomics 13, 1322-1324.
16. Speicher, M. R., du Manoir, S., Schrock, E., Holtgreve, H., Schoell, B., Lengauer, C., Cremer, T., and Ried, T. (1993) Molecular cytogenetic analysis of formalin-fixed, paraffin-embedded solid tumors by comparative genomic hybridization after universal DNA amplification. Hum. Mol. Genet. 11, 1907-1914.
17. Wiegant, J., Ried, T., Nederlof, P. M., van der Ploeg, M., Tanke, H. J., and Raap, A. K. (1991) In situ hybridization with fluoresceinated DNA. Nucleic Acids Res. 19, 3231-3241.
18. Ried, T., Baldini, A., Rand, T., and Ward, D. C. (1992) Simultaneous visualization of seven different DNA probes by in situ hybridization using combinatorial fluorescence and digital imaging. Proc. Natl. Acad. Sci. USA 89, 1388-1392.
19. Lawrence, J. B., Carter, K. C., and Gerdes, M. J. (1992) Extending the capabilities of interphase chromatin mapping. Nature Genet. 2, 171, 172.
20. Parra, I. and Windle, B. (1993) High resolution visual mapping of stretched DNA by fluorescent hybridization. Nature Genet. 5, 17-21.
21. Wiegant, J., Kalle, W., Mullenders, L., Brookes, S., Hoovers, J. M. N., Dauwerse, J. G., van Ommen, G. J. B., and Raap, A. K. (1992) High-resolution in situ hybridization using DNA halo preparations. Hum. Mol. Genet. 1, 587-591.
22. Heng, H. H. Q., Squire, J., and Tsui, L.-C. (1992) High-resolution mapping of mammalian genes by in situ hybridization to free chromatin. Proc. Natl. Acad. Sci. USA 89, 9509-9513.
23. Fidlerova, H., Senger, G., Kost, M., Sanseau, P., and Sheer, D. (1994) Two simple procedures for releasing chromatin from routinely fixed cells for fluorescence in situ hybridization. Cytogenet. Cell Genet. 65, 203-205.
24. Cai, W., Aburatani, H., Stanton, V. P., Housman, D. E., Wang, Y.-K., and Schwartz, D. C. (1995) Ordered restriction endonuclease maps of yeast artificial chromosomes created by optical mapping on surfaces. Proc. Natl. Acad. Sci. USA 92, 5164-5168.
25. Rowley, J. D., Diaz, M. O., Espinosa, R., Patel, Y. D., van Melle, E., Ziemin, S., Taillon-Miller, P., Lichter, P., Evans, G. A., Kersey, J. D., Ward, D. C., Domer, P. H., and Le Beau, M. M. (1990) Mapping chromosome band 11q23 in human acute leukemia with biotinylated probes: identification of 11q23 translocation breakpoints with a yeast artificial chromosome. Proc. Natl. Acad. Sci. USA 89, 9358-9362.
26. Le Beau, M. M., Espinosa, R. III, Neuman, W. L., Stock, W., Roulston, D., Larson, R. A., Keinanen, M., and Westbrook, C. A. (1993) Cytogenetic and molecular delineation of the smallest commonly deleted region of chromosome 5 in myeloid leukemias. Proc. Natl. Acad. Sci. USA 90, 5484-5488.
27. Boultwood, J., Fidler, C., Lewis, S., Kelly, S., Sheridan, H., Littlewood, T. J., Buckle, V. J., and Wainscoat, J. S. (1994) Molecular mapping of uncharacteristically small 5q deletions in two patients with the 5q- syndrome: delineation of the critical region on 5q and identification of a 5q- breakpoint. Genomics 19, 425-432.
Probe Ordering and Distancing by FISH
Marian Kroef, Hans Dauwerse, and Jim Landegent
1. Introduction
In situ hybridization (ISH) has proven to be a powerful methodological approach to visualize specific nucleic acid sequences directly within the morphological context of the cell. The method involves hybridization of labeled probes to denatured target chromatin that has been fixed on microscopic slides, followed by visualization using standard (immuno)cytochemical procedures. Originally, probes were marked with radioisotopes and visualized by autoradiography (1). This provided high sensitivity, but limited resolution owing to the track of the decaying particle in the photographic emulsion. Also, radioactive procedures did not permit identification of multiple probes hybridized simultaneously to the same samples. To date, these types of labels have been completely replaced by fluorescent and enzyme-generated absorbing dyes. One can distinguish direct procedures, in which the nucleic acid probes are modified with the reporter molecules and visualized directly after the hybridization reaction (2-4), from indirect procedures, in which the probes are labeled with haptens that require further immunocytochemical processing of the slides to introduce the reporter molecules (5-7). Because of the speed, spatial resolution, and availability of a variety of tags with different spectral properties, fluorescence in situ hybridization (FISH) has become the method of choice for probe mapping and ordering purposes. In the following, several topics are discussed that are relevant to these types of studies.
1.1. Type of DNA Clones and Sensitivity
Recombinant DNA probes with a wide range of sequence complexities are currently available for mapping purposes. These can vary from small plasmid cloned sequences (0.5-5 kb) to large yeast artificial chromosomes (YACs;
200-1500 kb). The larger probes hybridize more efficiently, resulting in stronger hybridization signals. However, with higher sequence complexity, the chances of background staining also increase. This is owing to the presence of interspersed repetitive sequences, such as Alu and Kpn repeats, that occur throughout the genome. From the FISH methodological point of view, the probes can be categorized as high- or low-complexity probes. In the case of the former, a preannealing step with excess unlabeled competitor DNA (total human DNA or its Cot-1 fraction, which is enriched for repetitive elements) has to be included in the FISH procedure (8,9). Because this strategy effectively suppresses the contribution of repeat sequences to the hybridization signal, the high-complexity probes (i.e., bacteriophages, cosmids, P1 clones, and YACs) are preferred for FISH analysis. It should be noted that the use of YACs is not convenient for distance measurements (except for fiber FISH experiments), because they often appear as hybridization domains rather than individual spots (10).
1.2. Single or Multicolor FISH (Multiplicity)
Various methods have been described to label nucleic acid probes, either enzymatically or chemically. Of them, the nick-translation reaction is the most practical technique for cloned sequences. A large selection of modified nucleotides (often dUTP) is available for either direct or indirect detection methods. In direct procedures, good results are obtained with the fluorochromes rhodamine and Cy3 (red), fluorescein (green), and coumarin (blue), whereas biotin and digoxigenin are the best haptens for indirect procedures. In general, the indirect procedures will provide stronger hybridization signals when combined with the same fluorochromes. For the ordering of markers, utilization of more than two haptens should be avoided because of practical considerations. If slides are examined by standard fluorescence microscopy, three probes can be hybridized and visualized simultaneously through double labeling of one of the probes. Inspection of slides by fluorescence digital imaging microscopy allows discrimination of five different probes by means of ratio labeling. In this case, the individual probes can be visualized through the generation of pseudo-colors (11,12). In principle, one type of label is sufficient for distancing purposes. This is accomplished through pair-wise hybridization of probes (13). The advantage of this approach over multicolor FISH is its simplicity. However, more probe combinations have to be tested. The actual distancing of probes is achieved by measuring the interval between hybridization spots. In the case of fluorescence microscopy, the measurements can be performed manually following projection of photographic slides. More accurate data are obtained by using image analysis, which allows automated measurements. Owing to variation in the
extent of spreading of the specimens, the average distance between hybridization spots has to be determined. These arbitrary values can be translated into physical distances if the distance for one pair of probes has been estimated by, e.g., pulsed-field gel electrophoresis (PFGE) analysis.
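The calibration described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and all numerical values are hypothetical, and a simple linear scaling between measured and physical distance is assumed, which for interphase nuclei is expected to hold only up to about 1 Mb.

```python
from statistics import mean


def mean_distance(measurements):
    """Average the spot-to-spot distances (arbitrary units) for one probe pair."""
    if not measurements:
        raise ValueError("at least one measurement is required")
    return mean(measurements)


def calibrate(reference_measurements, reference_kb):
    """Return kb per arbitrary unit from a probe pair of known separation.

    reference_kb is the physical distance of the reference pair,
    e.g., as estimated by PFGE analysis.
    """
    return reference_kb / mean_distance(reference_measurements)


def to_physical(measurements, kb_per_unit):
    """Convert the averaged measured distance of a probe pair into kb."""
    return mean_distance(measurements) * kb_per_unit


# Hypothetical data: a reference pair known to be 400 kb apart
reference = [1.9, 2.1, 2.0, 2.0]
kb_per_unit = calibrate(reference, 400.0)

# Hypothetical measurements for a probe pair of unknown separation
unknown = [0.9, 1.1, 1.0]
estimated_kb = to_physical(unknown, kb_per_unit)
print(f"estimated separation: {estimated_kb:.0f} kb")
```

Averaging before scaling mirrors the text: single measurements vary with the spreading of the specimen, so only the mean is converted.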
1.3. Type of Preparations and Resolution
For ordering markers, the distances between the probes used are the determining factors to decide which type of preparation should be used. In most cases, this is unknown, and the first hybridization experiments should be done on metaphase chromosomes (Fig. 1A). The obvious advantage of metaphase chromosomes over the other chromatin structures is that centromere-telomere orientation of the probes or probe combinations can easily be determined. By measuring the distance of the hybridization signals to the telomeric ends, the fractional position of probes along the chromosome can be assessed. Because of folding of the chromosomes, these measurements can only be performed effectively employing image analysis (9,14). Using this type of preparation, individual probe signals can only be resolved if they are at least 2 Mb apart. The resolution can be improved to about 1 Mb when the so-called prometaphase nuclei are used (15; Dauwerse, unpublished results); however, the centromere-telomere orientation is lost. The next level of resolution is achieved by hybridization to interphase nuclei (Fig. 1B), where the chromatin has a less (10-20-fold) condensed structure compared to metaphase chromosomes. The lower limit of resolution on nuclei from lymphocytes and fibroblasts is in the 100-kb range (10,16), and can be enhanced to below 50 kb for sperm pronuclei (17). When probes are separated by more than 1 Mb, the linear relationship between the measured and physical distances is lost. One should keep in mind that in interphase mapping, the image examined represents a two-dimensional projection of a three-dimensional structure. This problem stresses the need for determining average distances. Ideally, interphase nuclei should be arrested in G1 phase. In S- and G2-phase nuclei, DNA replication has started, which may result in duplication of hybridization spots, and thus could lead to mistakes in the distance measurements.
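As a hedged illustration of the fractional position mentioned for metaphase chromosomes, the following Python sketch divides the distance from a signal to one telomeric end by the total chromosome length and orders probes by the averaged value. Measuring from the short-arm telomere, the function names, and the numbers are assumptions made for the example, not part of the protocol.

```python
from statistics import mean


def fractional_position(distance_from_telomere, chromosome_length):
    """Position of a signal as a fraction of chromosome length.

    0 corresponds to the reference telomere (assumed here to be the
    short-arm end) and 1 to the opposite telomere.
    """
    if chromosome_length <= 0:
        raise ValueError("chromosome length must be positive")
    if not 0 <= distance_from_telomere <= chromosome_length:
        raise ValueError("signal must lie on the chromosome")
    return distance_from_telomere / chromosome_length


def order_probes(measurements):
    """Sort probes by their mean fractional position.

    measurements maps a probe name to a list of
    (distance_from_telomere, chromosome_length) pairs, one per chromosome.
    """
    averaged = {
        probe: mean(fractional_position(d, length) for d, length in pairs)
        for probe, pairs in measurements.items()
    }
    return sorted(averaged.items(), key=lambda item: item[1])


# Hypothetical measurements in arbitrary image units
example = {
    "cosmid_b": [(6.0, 10.0), (7.2, 12.0)],
    "cosmid_a": [(2.0, 10.0), (3.0, 12.0)],
}
ordered = order_probes(example)
print([name for name, _ in ordered])
```

Normalizing by chromosome length makes measurements from differently condensed chromosomes comparable before they are averaged.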
The highest resolution obtainable by FISH techniques is reached through hybridization of probes to stretched chromatin fibers (Fig. 1C), and ranges from 1-500 kb. This procedure was developed simultaneously in different laboratories. These techniques differ only in the preparation of the extended chromatin fibers, and were named accordingly (i.e., Halo FISH [18], free chromatin FISH [19], and DIRVISH [20]). The method can be utilized, e.g., to order subclones (even low-complexity probes) within a YAC or cosmid, or to determine the degree of overlap of probes positive for a certain marker to build a contig.
Also, subclones can be hybridized to YAC-extended fibers from agarose-embedded cells (21). In these fiber FISH approaches, signals do not appear as dots, but as beads on a string. This feature makes the method less dependent on background signals, which is the determining factor for the success of a FISH experiment in general, but, more importantly, it allows direct correlation between the physical sizes of the probes and the average length of the corresponding signals. It can be concluded that FISH has become an indispensable tool for the preparation of genetic and physical maps. The sensitivity of the procedure allows routine analysis of both high- and low-complexity (in case of fiber FISH) probes. The resolution that can be obtained covers the whole spectrum from classical karyotypic analysis (metaphase FISH) to molecular genetic approaches, like PFGE analysis (interphase FISH), and even Southern analysis (fiber FISH). FISH, however, allows the simultaneous assessment of multiple probes, rendering it perhaps the most rapid means for ordering and distancing purposes.
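The resolution ranges quoted in this section can be collected into a small lookup, sketched here in Python. The thresholds are taken from the text (fiber FISH 1-500 kb, interphase nuclei about 100 kb to 1 Mb, prometaphase from about 1 Mb, metaphase from about 2 Mb); treating them as sharp cutoffs, the table layout, and the function name are simplifying assumptions for illustration only. The improved resolution of sperm pronuclei is not included.

```python
# (preparation, lower limit in kb, upper limit in kb or None if open-ended)
PREPARATIONS = [
    ("fiber FISH", 1, 500),
    ("interphase nuclei", 100, 1000),
    ("prometaphase chromosomes", 1000, None),
    ("metaphase chromosomes", 2000, None),
]


def suitable_preparations(separation_kb):
    """Return the preparations whose resolution range covers the separation."""
    if separation_kb <= 0:
        raise ValueError("separation must be positive")
    return [
        name
        for name, lower, upper in PREPARATIONS
        if separation_kb >= lower and (upper is None or separation_kb <= upper)
    ]


# Hypothetical probe separations in kb
for kb in (5, 300, 1500, 3000):
    print(kb, "kb:", ", ".join(suitable_preparations(kb)))
```

When the separation is unknown, the text recommends starting with metaphase chromosomes; the lookup is only useful once a rough estimate exists.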
2. Materials
2.1. Slide Preparation
2.1.1. Metaphase Spreads
1. Phytohemagglutinin (PHA) medium.
2. Thymidine solution: 3 mg thymidine dissolved in 1 mL PHA medium. Store at -20°C. This is optional (see Note 1).
3. 0.0025% Colcemid (Serva, Heidelberg, Germany).
4. Hypotonic buffer: 50 mM KCl, 5 mM HEPES, 10 mM MgSO4, pH 8.0. Just before use, dithiothreitol (DTT) is added to a final concentration of 3 mM at 37°C.
5. Fixative: methanol/acetic acid (3:1, v:v).
6. Acetic acid.
Fig. 1. (A) Simultaneous hybridization of two cosmids (a and b) to human metaphase chromosomes. These cosmids are about 0.5 Mb apart and cannot be resolved in this type of preparation. One cosmid was labeled with biotin and detected by FITC immunofluorescence, whereas the other cosmid was labeled directly with Cy3. The chromosomes were counterstained with DAPI. (B) Hybridization of the same cosmids to human interphase nuclei. The probes were labeled and visualized as in A. (C) Hybridization of two cosmids to extended chromatin. These cosmids are 5 kb apart, which can be observed by the small distance between the hybridization strings. One cosmid was labeled with biotin and visualized by FITC immunofluorescence, whereas the digoxigenin-labeled cosmid was detected by TRITC immunofluorescence. All photographs were taken using a triple-band pass filter set.
2.1.2. Interphase Nuclei
1. Appropriate medium for culturing of fibroblasts, e.g., RPMI containing 5% fetal calf serum.
2. 0.5% Trypsin solution.
3. Hypotonic buffer (as in Section 2.1.1., item 4).
4. Fixative: methanol/acetic acid (3:1).
5. Acetic acid.
2.1.3. Extended Chromatin
2.1.3.1. PROTOCOL A
1. Phosphate-buffered saline (PBS), pH 7.0.
2. Lysis mixture: 0.5% sodium dodecyl sulfate (SDS), 50 mM EDTA, 0.2M Tris-HCl, pH 7.0.
3. Fixative: methanol/acetic acid (3:1).
2.1.3.2. PROTOCOL B
1. 0.1% Poly-L-lysine (highest degree of polymerization): 100 mL are sufficient for 60 slides.
2. Agarose plugs containing YAC DNA.
3. Microwave oven.
2.2. Probe Labeling
1. 10X Nick-translation buffer: 0.5M Tris-HCl, pH 7.8, 50 mM MgCl2, 0.5 mg/mL bovine serum albumin (BSA).
2. Nucleotide mixture: 2 mM dATP, 2 mM dCTP, 2 mM dGTP, and 0.1 mM dTTP. In case Cy3-11-dCTP is used rather than a modified dUTP, the nucleotide mixture should be adapted accordingly.
3. Modified dNTPs: FITC-12-dUTP (Boehringer, Mannheim, Germany), coumarin-4-dUTP (Amersham, Amersham, UK), rhodamine-4-dUTP (Amersham), or Cy3-11-dCTP (Biological Detection Systems, Inc., Pittsburgh, PA) for direct procedures, and biotin-16-dUTP (Boehringer) or digoxigenin-11-dUTP (Boehringer) in case of indirect methods.
4. 0.1M DTT.
5. DNase I solution: Dilution of 1.5 µL from a 1 mg/mL DNase stock solution (Boehringer, grade II, in 0.15M NaCl, 50% glycerol) in 1 mL water.
6. DNA polymerase I (10 U/µL) (Promega, Madison, WI).
7. Carrier DNA: Salmon sperm DNA, 10 µg/mL (sheared to fragments of about 200-500 bp) and yeast RNA, 10 µg/mL. These components act as carrier during precipitation and reduce background during hybridization.
8. Competitor DNA: Cot-1 DNA (1 µg/mL), sheared to fragments ranging in size from 200-500 bp. Cot-1 DNA is also commercially available from Gibco BRL.
9. 3M Sodium acetate, pH 5.6.
10. 2-Propanol.
11. 70% Ethanol.
2.3. Pretreatments of Slides
1. 2X SSC: 0.3M NaCl, 0.03M sodium citrate, pH 7.0.
2. PBS, pH 7.0.
3. RNase A solution: Dilution of 1 µL from a 1% RNase A (Boehringer) stock solution in 100 µL 2X SSC.
4. Pepsin solution: Dilution of 100 µL from a 10% pepsin stock solution (Boehringer) in 100 mL 10 mM HCl. Prepare immediately before use.
5. Postfixation solution: 1% acid-free formaldehyde (Merck, Darmstadt, Germany), 1X PBS, 50 mM MgCl2.
6. Series of 70, 90, and 100% ethanol.
7. Coplin jar, water bath, incubator 37°C.
8. Moist chamber: A 500-mL beaker containing a paper towel soaked with appropriate buffers on the bottom and covered with Parafilm laboratory film and aluminum foil.
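The working concentrations implied by the RNase A and pepsin dilutions above follow from simple dilution arithmetic, sketched below in Python. Two assumptions are made for the example and are not stated in the protocol: percentages are read as weight per volume (1% = 10 mg/mL), and the stock is added to the stated volume of diluent rather than made up to it.

```python
def diluted_concentration(stock_conc, stock_volume, diluent_volume):
    """Concentration after adding stock to diluent (same volume units for both).

    C_final = C_stock * V_stock / (V_stock + V_diluent)
    """
    if stock_volume <= 0 or diluent_volume < 0:
        raise ValueError("volumes must be positive")
    return stock_conc * stock_volume / (stock_volume + diluent_volume)


# Assumption: 1% (w/v) corresponds to 10 mg/mL
MG_PER_ML_PER_PERCENT = 10.0

# RNase A: 1 uL of a 1% stock in 100 uL 2X SSC
rnase_mg_per_ml = diluted_concentration(1 * MG_PER_ML_PER_PERCENT, 1.0, 100.0)

# Pepsin: 100 uL (0.1 mL) of a 10% stock in 100 mL 10 mM HCl
pepsin_mg_per_ml = diluted_concentration(10 * MG_PER_ML_PER_PERCENT, 0.1, 100.0)

print(f"RNase A: about {rnase_mg_per_ml * 1000:.0f} ug/mL")
print(f"Pepsin:  about {pepsin_mg_per_ml * 1000:.0f} ug/mL")
```

Under these assumptions both working solutions come out near 100 µg/mL, i.e., roughly 0.01%.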
2.4. Denaturation of the Slides
1. Denaturation buffer: 70% formamide, 2X SSC, pH 7.0 (make fresh).
2. 70% Ethanol at -20°C.
3. Ice-cold 2X SSC.
4. Series of ice-cold 70, 90, and 100% ethanol.
5. Incubator 80°C.
2.5. Probe Hybridization
1. Labeled probe DNA.
2. Hybridization mixture: 50% deionized formamide, 2X SSC, pH 7.0, 10% dextran sulfate (obtained from Pharmacia; this is essential).
3. 18 x 18 mm coverslips.
4. Moist chamber containing 50% formamide.
5. Wash buffer 1: 60% formamide (Milwaukee, WI), 2X SSC, prewarmed to 37°C. Prepare fresh each time.
6. Wash buffer 2: 0.1X SSC, prewarmed to 60°C.
7. Series of 70, 90, and 100% ethanol.
8. Water baths set at 37 and 60°C.
2.6. Immunocytochemical Detection
2.6.1. Direct Procedures
Mounting medium consists of 10 ng 4,6-diamidino-2-phenylindole.2 HCl (DAPI) (Serva) in 1 mL Vectashield (Vector Laboratories, Burlingame, CA).
2.6.2. Indirect Procedures
1. Wash buffer 3: 0.1M Tris-HCl, 0.15M NaCl, 0.05% Tween-20 (Sigma, St. Louis, MO), pH 7.5.
2. Blocking solution: 0.1M Tris-HCl, 0.15M NaCl, 0.5% blocking reagent (Boehringer), pH 7.5.
3. Detection solution: blocking solution with immunocytochemical detection reagent. Immunocytochemical detection reagents are offered by several suppliers, and the concentration of reagents varies among the different preparations. Recommendations by the manufacturer are usually helpful in finding the optimal concentration for detection.
4. A series of 70, 90, and 100% ethanol.
5. Mounting medium: As in Section 2.6.1.
6. Moist chamber containing wash buffer 3.
2.7. Probe Ordering
1. Conventional microscope equipped for epifluorescence.
2. The filter sets selected must be according to the fluorochromes used. Contact the manufacturer of the microscope for the appropriate combinations.
2.8. Distance Measurements
2.8.1. Manual Analysis
1. The microscopic setup described in Section 2.7. is used.
2. Color photographic films.
3. Slide projector.
2.8.2. Automated Image Analysis
1. Fluorescence microscope coupled to a cooled CCD (charge-coupled device) camera (Photometrics, Tucson).
2. TCL-image software package developed at the Technical University of Delft (Multihouse, Amsterdam, The Netherlands).
3. Methods
3.1. Slide Preparation
3.1.1. Preparation of Metaphase Spreads
1. Culture 10 mL peripheral blood cells in 100 mL PHA medium for 3 d.
2. Optional: To obtain prometaphase chromosomes, add per 25 mL culture 250 µL thymidine solution and continue culturing for 17 h before Colcemid treatment.
3. Add 80 µL 0.0025% Colcemid to 25 mL culture and incubate at 37°C for 2 h.
4. Transfer the culture to 15-mL tubes (10 mL/tube), and centrifuge for 8 min at 80g. Remove most of the supernatant, and resuspend the pelleted cells in the residual medium.
5. Add 10 mL of hypotonic buffer while carefully mixing the cells, and incubate the tubes for 20 min in a 37°C water bath. Centrifuge for 8 min at 700 rpm, remove most of the supernatant, and resuspend the cells.
6. To fix the cells, add a little fixative to the cell suspension from a fully filled Pasteur pipet, and immediately suck it back into the pipet. Repeat this action several times, gradually increasing the amount of fixative added to the cell suspension. In total, 10 mL of fixative/tube should be used. Let the tubes stand for 10 min. Centrifuge for 7 min at 110g, remove the supernatant leaving about 0.5 mL of fixative, and resuspend the pellet. Repeat this fixation procedure three to four times. Store the suspension at -20°C or prepare slides directly.
7. To prepare the slides, the cells are first resuspended in fresh fixative. Two drops of this suspension are pipeted onto the slides (precleaned with ethanol/ether, 1:1) from about 20 cm distance. Rinse the slides, just before they are completely dry, with 100% acetic acid.
8. Inspect the preparations with a phase-contrast microscope. In case of poor spreading of the chromosomes, drop the cells onto slides that have been rinsed with either 90, 80, or 70% acetic acid (70% acetic acid will remove more cytoplasm than 90%). Slides can be used directly, or stored up to 2 mo in a dry box at -20°C (see Note 2).
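The centrifugation steps above are given partly in g and partly in rpm. The two can be related with the standard formula RCF = 1.118 x 10^-5 x r x rpm^2 (r in cm), sketched here in Python. The rotor radius is not given in the protocol; the 14.6 cm used below is a hypothetical value and must be replaced by the radius of the rotor actually used.

```python
import math

# Standard conversion constant for radius in cm and speed in rpm
RCF_CONSTANT = 1.118e-5


def rcf_from_rpm(rpm, radius_cm):
    """Relative centrifugal force (x g) for a given speed and rotor radius."""
    return RCF_CONSTANT * radius_cm * rpm ** 2


def rpm_from_rcf(rcf, radius_cm):
    """Rotor speed (rpm) needed to reach a given relative centrifugal force."""
    return math.sqrt(rcf / (RCF_CONSTANT * radius_cm))


# Hypothetical rotor radius; check the rotor documentation for the real value
radius_cm = 14.6

print(f"700 rpm corresponds to about {rcf_from_rpm(700, radius_cm):.0f} g")
print(f"110 g requires about {rpm_from_rcf(110, radius_cm):.0f} rpm")
```

With this assumed radius, 700 rpm comes out close to the 80g used elsewhere in the protocol, but that agreement depends entirely on the rotor.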
3.1.2. Preparation of Interphase Nuclei
Cells arrested in G1 are used for interphase mapping studies to avoid complex hybridization patterns owing to DNA replication.
1. Grow a normal human fibroblast cell line (e.g., CCL 202) culture to confluency and hold without change of medium for 2-4 d at 37°C. During this period, the cultures are checked for absence of rounded mitotic cells before harvest.
2. Add 2 mL trypsin solution to the cells, incubate at 37°C for 1-2 min, and add 25 mL of growth medium/flask.
3. Transfer the cells to 15-mL tubes, and centrifuge at 80g for 10 min. Resuspend the cells in 10 mL medium, and repeat centrifugation and resuspension.
4. The cells are further processed as described in Section 3.1.1., steps 4-8.
3.1.3. Preparation of Extended Chromatin
Protocol A can be used for cells of many different sources, such as lymphocytes, cultured lymphocytes, lymphoid cell lines, fibroblast cell lines, fresh or cultured bone marrow cells, and cells stored in liquid nitrogen. Protocol B should be applied when YAC-extended fibers from yeast spheroplasts are prepared.
3.1.3.1. PROTOCOL A
1. Resuspend the cells in PBS to a concentration of 1-5 x 10^5 cells/mL.
2. Pipet 100 µL of cell suspension on a precleaned slide, and spread the suspension over the entire surface of the slide. Let dry completely.
3. Pipet two 50-µL drops of the lysis mixture on a 24 x 50 mm coverslip.
4. Place the slide with the side containing the cells on the coverslip. Reverse the slide, and remove the coverslip very gently. Let dry completely.
5. Incubate the slides for 5 min in fixative, and air-dry.
3.1.3.2. PROTOCOL B
1. Poly-L-lysine-coated slides can be made according to the following protocol:
a. Soak the slides overnight in a detergent solution. Rinse the slides several times in water.
b. Immerse the slides for 24 h in water.
c. Coat the slides for 10 min in the poly-L-lysine solution while shaking.
d. Wash the slides three times with water and air-dry.
2. Place a small piece of agarose-embedded DNA (1/5 of a 100-µL plug containing 5-10 µg DNA) at the end of a poly-L-lysine-coated slide.
3. Pipet 15 µL of water on top of the agarose plug.
4. Heat the slide with the plug in a microwave oven until the agarose has melted.
5. The DNA is extended by moving the melted agarose material gently over the slide using a second slide held at an angle of 30°.
6. After drying, the slides are ready for use.
3.2. Labeling of DNA Probes
1. Nick-translation reaction (see Notes 3-7). Mix together in an Eppendorf tube on ice: 1 µg DNA, 5 µL 10X nick-translation buffer, 4 µL nucleotide mixture, 2 µL 1.0 mM modified dNTP, 5 µL DTT, 5 µL DNase, and 2 µL DNA polymerase I to a final volume of 50 µL, and incubate for 2 h at 16°C. It should be noted that many manufacturers have developed labeling kits for generating FISH probes. These are very useful for laboratories with little experience in FISH.
2. Preparation of the hybridization mixture.
a. Add 5 µL of salmon sperm DNA, 5 µL of yeast RNA, 10 µL 3M sodium acetate, 25 µL Cot-1 DNA (5 µL for YACs; see Note 5), 5 µL water, and 100 µL 2-propanol to the labeled probe suspension.
b. Spin the tubes for 10 min at high speed in a table-top centrifuge, and discard the supernatant.
c. Wash the pellet with 500 µL 70% ethanol, and discard the supernatant. Spin the tubes for a few seconds, and remove the last drops of supernatant.
d. Let the pellet air-dry for 5 min, and dissolve in the hybridization mixture to a final concentration of 20 ng/µL (or 100 ng/µL in the case of YACs).
3. The probes can be stored up to a year at 4°C.
3.3. Pretreatments of Slides
It is advisable to incubate the slides with RNase, since this will reduce background signals. To improve the accessibility of target DNA for the probe sequences, the slides can be treated with proteases, which remove protein remainders. In order to preserve the morphology of the cells, a postfixation step is often included. Extended chromatin preparations should not be pretreated, since this will lead to excessive loss of DNA fibers.
1. Select the area on the slide to be hybridized with a phase-contrast microscope, and mark it on the back by scratches using a diamond pen.
2. Arrange the slides in a Coplin jar, and wash them in 2X SSC for 5 min at 37°C.
3. Pipet 100 µL RNase A solution on a coverslip (24 x 50 mm). Drain the back side of the slide with a paper towel, and put the slide on the coverslip. Arrange slides in horizontal position in a Coplin jar. Transfer to a moist chamber, and incubate for 1 h at 37°C.
4. Add 100 mL 2X SSC to the Coplin jar, and transfer slides to a new Coplin jar containing 2X SSC, leaving the coverslips in the former jar. Wash the slides twice in 2X SSC and once in PBS for 5 min at 37°C each.
5. Incubate the slides in the pepsin solution (100 mL/Coplin jar) for 10 min at 37°C.
6. Wash the slides twice in PBS for 5 min at room temperature.
7. Incubate the slides (in Coplin jar) in the postfixation solution for 10 min at room temperature.
8. Wash the slides twice in PBS for 5 min at room temperature.
9. Dehydrate slides through a series of 70, 90, and 100% ethanol for 5 min each followed by air-drying.
3.4. Denaturation of the Slides
1. Pipet 100 µL denaturation solution on the slide, and cover it with a coverslip (24 x 50 mm).
2. Incubate the slides for exactly 2.5 min at 80°C.
3. Remove the coverslips quickly, and wash the slides immediately in ice-cold 2X SSC for 5 min.
4. Put the slides through a series of 70, 90, and 100% ice-cold ethanol for 5 min each and air-dry.
3.5. Probe Hybridization/In Situ Hybridization
1. Mix 2 µL of the probe mixture with the hybridization solution to a final volume of 10 µL. Up to five different probes can be combined (see Note 8).
2. Denature the probe for 5 min at 75°C.
3. Incubate the probe for 30-60 min at 37°C to allow preannealing of Cot-1 sequences.
4. Pipet the probe mixture on the marked section of the denatured slide, and cover with an 18 x 18 mm coverslip. Arrange slides in a horizontal position in a Coplin jar, and incubate in a moist chamber overnight at 37°C (YACs should be hybridized for 72 h).
5. Wash the slides (in a Coplin jar) in prewarmed wash buffer 1 at 37°C for 5 min. Transfer the slides to a new Coplin jar, leaving the coverslips in the former jar.
6. Wash the slides twice in wash buffer 1 at 37°C for 5 min each.
7. Wash the slides three times in wash buffer 2 at 60°C for 5 min each.
8. Put the slides through a series of 70, 90, and 100% ethanol for 5 min each and air-dry. In case the immunocytochemical detection steps are performed immediately, this step is not necessary.
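Step 1 of the hybridization procedure can be written out as a small Python calculation. It assumes, as one reading of the protocol, that each combined probe contributes 2 µL and that hybridization solution makes up the remainder of the 10 µL; the function name and the probe names are hypothetical.

```python
PROBE_VOLUME_UL = 2.0   # volume taken from each probe mixture
FINAL_VOLUME_UL = 10.0  # final volume applied under an 18 x 18 mm coverslip
MAX_PROBES = 5          # limit stated in the protocol


def hybridization_mix(probe_names, stock_ng_per_ul=20.0):
    """Return volumes (uL) and the final concentration of each probe.

    stock_ng_per_ul is the concentration of each probe mixture
    (20 ng/uL for most probes, 100 ng/uL for YACs).
    """
    n_probes = len(probe_names)
    if not 1 <= n_probes <= MAX_PROBES:
        raise ValueError(f"combine between 1 and {MAX_PROBES} probes")
    return {
        "probe_volumes_ul": {name: PROBE_VOLUME_UL for name in probe_names},
        "hybridization_solution_ul": FINAL_VOLUME_UL - n_probes * PROBE_VOLUME_UL,
        "final_ng_per_ul_each": stock_ng_per_ul * PROBE_VOLUME_UL / FINAL_VOLUME_UL,
    }


# Hypothetical two-probe experiment
mix = hybridization_mix(["cosmid_a", "cosmid_b"])
print(mix["hybridization_solution_ul"], "uL hybridization solution")
print(mix["final_ng_per_ul_each"], "ng/uL of each probe")
```

Under this reading, five probes at 2 µL each fill the 10 µL completely, which is consistent with the stated maximum of five probes.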
3.6. Immunocytochemical Detection (see Note 9)
3.6.1. Direct Procedures
Pipet 25 µL mounting medium on the slide, and cover with a coverslip (24 x 50 mm). The slides are stored in a dark box at 4°C.
3.6.2. Indirect Procedure
To reduce background, the slides should not get dry at any time during the following procedures. During incubations and handling of the slides with fluorochromes, they should be kept in the dark as much as possible. Various detection systems have been described for the visualization of biotin or digoxigenin. The following procedures are used routinely in our laboratories. Biotin is detected successively with streptavidin-FITC (reagent A), biotinylated goat-antistreptavidin (reagent B), and streptavidin-FITC (reagent C) (all purchased from Vector). For visualization of digoxigenin, three successive incubation steps with mouse-antidigoxigenin (reagent A), rabbit-antimouse-TRITC (reagent B), and goat-antirabbit-TRITC (reagent C) (all purchased from Sigma) are required. In case of dual-color FISH, both haptens can be visualized simultaneously by mixing the corresponding reagents (A, B, or C) for each incubation.
1. Equilibrate the slides in wash buffer 3 for 5 min.
2. Pipet 100 µL blocking solution on a coverslip (24 x 50 mm). Drain the back side of a slide with a paper towel, and put the slide on the coverslip. Transfer to a Coplin jar, and keep the slides in horizontal position.
3. Incubate the slides in a moist chamber for at least 20 min at room temperature.
4. Remove the coverslip gently from the slide. Pipet 100 µL of the detection solution containing reagent A on a new coverslip, put the slide on the coverslip, and transfer the slides to a Coplin jar.
5. Incubate for 30 min at 37°C in a moist chamber.
6. Wash slides three times in wash buffer 3 for 5 min each.
7. Repeat steps 4-6 for detection reagents B and C.
8. Put the slides through a series of 70, 90, and 100% ethanol for 5 min each, and air-dry.
9. Pipet 25 µL mounting medium on the slide, and cover with a coverslip (24 x 50 mm). The slides are kept in a dark box for storage.
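The layering of reagents A, B, and C in the indirect procedure can be written out as a small lookup table, which makes the mixing rule for dual-color FISH explicit. The following Python sketch is illustrative only: the names DETECTION_LAYERS and incubation_plan are hypothetical, and only the reagents named in the text above are listed.

```python
# Detection reagents per incubation round, as named in Section 3.6.2.
# The dictionary layout and function name are illustrative assumptions.
DETECTION_LAYERS = {
    "biotin": {
        "A": "streptavidin-FITC",
        "B": "biotinylated goat-antistreptavidin",
        "C": "streptavidin-FITC",
    },
    "digoxigenin": {
        "A": "mouse-antidigoxigenin",
        "B": "rabbit-antimouse-TRITC",
        "C": "goat-antirabbit-TRITC",
    },
}


def incubation_plan(haptens):
    """Return, for rounds A, B, and C, the reagents to mix for the given haptens."""
    unknown = [h for h in haptens if h not in DETECTION_LAYERS]
    if unknown:
        raise ValueError(f"no detection scheme listed for: {unknown}")
    return [
        (round_id, [DETECTION_LAYERS[h][round_id] for h in haptens])
        for round_id in ("A", "B", "C")
    ]


if __name__ == "__main__":
    # Dual-color FISH: one mixed detection solution per incubation round.
    for round_id, reagents in incubation_plan(["biotin", "digoxigenin"]):
        print(f"Round {round_id}: " + " + ".join(reagents))
```

For a single-hapten experiment, passing a one-element list returns the three successive reagents for that hapten alone.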
3.7. Probe Ordering
Probe order can be deduced on all types of cellular preparations by dual- or triple-color FISH. Because the images are shifted during switching of the filter blocks, dual or triple band pass filters should be used for simultaneous visualization of multicolor FISH domains. For metaphase chromosomes, combinations of two probes are sufficient to determine the relative position. When interphase or extended chromatin preparations are used, at least three probes should be hybridized simultaneously. If more markers have to be ordered, various probe combinations have to be analyzed. One should keep in mind that, per experiment, the probes tested should be within the resolution range of the target preparation used. In case of interphase FISH, probe orders in individual nuclei should be scored. This is because interphase nuclei are far from flat, and thus, the image analyzed represents a two-dimensional projection of a three-dimensional structure.
3.8. Distance Measurements
3.8.1. Manual Analysis
In general, the extent of stretching and/or swelling of the individual targets within a preparation varies widely. Therefore, average distances between spots have to be determined from approx 15-20 measurements. To this end, color photographs are made from a random selection of metaphases, interphase nuclei, or DNA fibers. The photographic images are projected on a wall, and the distances between the centers of fluorescent spots are measured. A hemacytometer grid photographed at the same magnification and projected similarly can be used to convert measured projected distances to actual distances. The statistical significance of differences between pairs of spots can be determined by t test. It is difficult to obtain accurate measurements from metaphase and fiber FISH experiments owing to folding of the chromosomes. Thus, manual analysis can best be performed using interphase FISH. However, for the reason given in Section 3.7., a higher number of cells (50-100) has to be evaluated per hybridization to obtain a statistically reliable result. A single round hybridization spot per hybridization domain in a nucleus is scored as a distance of 0. If the fluorescent signal is oblong rather than round, because two fluorescent spots are not completely resolved, the distance between the apparent centers of the spots has to be determined. When the physical distance between two probes is known, the distance between the other probes can be calculated.
3.8.2. Automated Image Analysis
A more sophisticated approach for probe distancing is the application of computerized image analysis.
This procedure guarantees precise alignment of the fluorochrome images and accurate distance measurements even in bent chromosomes or chromatin fibers. All individual colors are registered and recorded using a cooled CCD camera. A series of images is collected and stored on optical disk. Per image, all spots are interactively marked. The alignment of the images is accomplished by means of double-labeled control probes. The center of a spot, defined as the position within the area of a spot with the highest intensity, and the distance between the spots are determined automatically. A reliable method for probe distancing in metaphase preparations is the determination of the fractional lengths (9). For each chromosome bearing "twin" spots, e.g., on the short arm, the distances pter-signal, pter-centromere,
and pter-qter are measured. The map position is expressed as the fractional length of the short arm (pter-signal/pter-centromere) or of the total chromosome length (pter-signal/pter-qter). In case of fiber FISH, very accurate data are obtained because the relative length of the probe signals or the average distance between signals can be directly related to the known insert sizes of the probes.
4. Notes
1. Add thymidine for producing elongated (prometaphase) chromosomes. This is useful for high-resolution banding.
2. Slides containing metaphase spreads or interphase nuclei can be stored for longer periods (up to 3 mo) in a dry box at -20°C. The suspension containing the fixed cellular preparation can even be stored for 6 mo. It is our experience that extended chromatin slides age more rapidly. These slides are best stored at room temperature for up to 1 mo.
3. The quality of the probe DNA is very important because the nick-translation labeling reaction is relatively sensitive to impurities.
4. For probes multiplied in Escherichia coli, DNA concentrations are best estimated on agarose gels. Concentrations determined by optical density are inaccurate because of the presence of residual E. coli DNA, RNA, and (ribo)nucleotides.
5. It is difficult and laborious to purify YAC DNA from yeast DNA. One can enrich for human-specific sequences by inter-Alu PCR. However, certain Alu-depleted regions in the YAC are missing in the probe. Therefore, we use total yeast DNA containing the YAC as a probe. Since the human content of the probe is low, less Cot-1 DNA, more probe DNA (200 ng), and longer hybridization times (72 h) are required (22).
6. In case new enzyme stocks are used for nick translation, check the labeled probes for the appropriate fragment length as follows:
a. Take 10 µL from the reaction mixture.
b. Denature at 100°C for 5 min and chill on ice.
c. Run the sample on a 2% agarose gel with an appropriate DNA size marker.
The denatured fragments should range between 200 and 500 bp in length.
d. To optimize fragment length, adjust the DNase concentration. Larger fragments will give a high background, whereas smaller fragments will result in weak or no signals at all.
7. Many protocols include a purification step following nick translation of the probe using a Sephadex G-50 column. We have not observed differences in hybridization efficiency or introduction of background if this step is omitted.
8. In multicolor FISH, separate competition of the individual probes will improve the hybridization efficiency.
9. The hybridization signal can be enhanced by immunological amplification. However, because background fluorescence also increases, the signal-to-noise ratio is not improved.
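The arithmetic behind the distance measurements of Section 3.8. (conversion of projected to actual distances with a hemacytometer grid, averaging over repeated measurements, and the fractional length of a metaphase signal) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' software: the function names are hypothetical, and all numbers in the example are invented for demonstration.

```python
from statistics import mean, stdev


def scale_factor(grid_actual_um, grid_projected_mm):
    """Micrometres of real distance per millimetre of projected image,
    taken from a grid photographed and projected at the same magnification."""
    return grid_actual_um / grid_projected_mm


def mean_distance_um(projected_mm, um_per_mm):
    """Mean and sample standard deviation of spot-to-spot distances (in um).
    A single unresolved round spot is entered as a distance of 0."""
    actual = [d * um_per_mm for d in projected_mm]
    return mean(actual), stdev(actual)


def fractional_length(pter_to_signal, pter_to_reference):
    """Map position as a fraction of the short arm (reference = centromere)
    or of the whole chromosome (reference = qter). Units cancel out."""
    if not 0 <= pter_to_signal <= pter_to_reference:
        raise ValueError("signal must lie between pter and the reference point")
    return pter_to_signal / pter_to_reference


if __name__ == "__main__":
    # Invented example: a 50-um grid interval measures 100 mm on the wall.
    k = scale_factor(50.0, 100.0)
    avg, sd = mean_distance_um([2.0, 4.0, 0.0, 3.0], k)
    print(f"mean distance {avg:.2f} um (SD {sd:.2f})")
    print(f"fractional length {fractional_length(12.0, 48.0):.2f}")
```

Because the fractional length is a ratio, it can be computed directly from projected measurements without prior calibration.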
References
1. Gall, J. G. and Pardue, M. L. (1969) Formation and detection of RNA-DNA hybrid molecules in cytological preparations. Proc. Natl. Acad. Sci. USA 63, 378-381.
2. Bauman, J. G. J., Wiegant, J., and Van Duijn, P. (1981) Cytochemical hybridization with fluorochrome-labeled RNA. I. Development of a method using nucleic acids bound to agarose beads as a model. J. Histochem. Cytochem. 29, 227-237.
3. Bauman, J. G. J., Wiegant, J., and Van Duijn, P. (1981) Cytochemical hybridization with fluorochrome-labeled RNA. II. Applications. J. Histochem. Cytochem. 29, 238-246.
4. Hopman, A. H. N., Wiegant, J., Tesser, G. I., and Van Duijn, P. (1986) A nonradioactive in situ hybridization method based on mercurated nucleic acid probes and sulfhydryl-hapten ligands. Nucleic Acids Res. 14, 6471-6488.
5. Langer, P. R., Waldrop, A. A., and Ward, D. A. (1981) Enzymatic synthesis of biotin-labeled polynucleotides: novel nucleic acid affinity probes. Proc. Natl. Acad. Sci. USA 78, 6633-6637.
6. Landegent, J. E., Jansen in de Wal, N., Van Ommen, G. J. B., Baas, F., Vijlder, J. J. M., Van Duijn, P., and Van der Ploeg, M. (1985) Chromosomal localization of a unique gene by non-radiographic in situ hybridization. Nature 317, 175-357.
7. Pinkel, D., Straume, T., and Gray, J. W. (1986) Cytogenetic analysis using quantitative high sensitivity fluorescence hybridization. Proc. Natl. Acad. Sci. USA 83, 2934-2938.
8. Landegent, J. E., Jansen in de Wal, N., Dirks, R. W., Baas, F., and Van der Ploeg, M. (1987) Use of whole cosmid cloned genomic sequences for chromosomal localization by non-radioactive in situ hybridization. Hum. Genet. 77, 366-370.
9. Lichter, P., Tang, C. C., Call, K., Hermanson, G., Evans, G., Housman, D., and Ward, D. C. (1990) High resolution mapping of human chromosome 11 by in situ hybridization with cosmid probes. Science 247, 64-69.
10. den Dunnen, J. T., Grootsholten, P. M., Dauwerse, J. G., Walker, A. P., Monaco, A. P., Butler, R., Anand, R., Coffey, A. J., Bentley, D. R., Steenma, H. Y., and Van Ommen, G. J. B.
(1992) Reconstruction of the 2.4 Mb human DMD-gene by homologous YAC recombination. Hum. Mol. Genet. 1, 19-28.
11. Dauwerse, J. G., Wiegant, J., Raap, A. K., Breuning, M. H., and Van Ommen, G. J. B. (1992) Multiple colors by fluorescence in situ hybridization using ratio labelled DNA probes create a molecular karyotype. Hum. Mol. Genet. 1, 593-598.
12. Nederhof, P. M., Van der Flier, S., Vrolijk, J., Tanke, H. J., and Raap, A. K. (1992) Fluorescence ratio measurements of double-labeled probes for multiple in situ hybridization by digital imaging microscopy. Cytometry 13, 831-838.
13. Trask, B., Pinkel, D., and Van den Engh, G. (1989) The proximity of DNA sequences in interphase nuclei is correlated to genomic distance and permits ordering of cosmids spanning 250 kilobase pairs. Genomics 5, 710-717.
14. Lawrence, J. B., Singer, R. H., and McNeil, J. A. (1990) Interphase and metaphase resolution of different distances within the human dystrophin gene. Science 249, 928-931.
15. Inazawa, J. A., Ariyama, T., Tokino, T., Tanigami, A., Nakamura, Y., and Abe, T. (1994) High resolution ordering of DNA markers by multi-color fluorescent in situ hybridization of prophase chromosomes. Cytogenet. Cell Genet. 65, 130-135.
16. Kluck, P. M. C., Wiegant, J., Raap, A. K., Vrolijk, H., Tanke, H. J., Willemze, R., and Landegent, J. E. (1992) Order of human hematopoietic growth factor and receptor genes on the long arm of chromosome 5, as determined by fluorescence in situ hybridization. Ann. Hematol. 66, 15-20.
17. Brandriff, B., Gordon, L., and Trask, B. (1991) A new system for high-resolution DNA sequence mapping in interphase pronuclei. Genomics 10, 75-82.
18. Wiegant, J., Kalle, W., Mullenders, L., Brookes, S., Hoovers, J. M. N., Dauwerse, J. G., Van Ommen, G. J. B., and Raap, A. K. (1992) High-resolution in situ hybridization using DNA halo preparations. Hum. Mol. Genet. 1, 587-592.
19. Heng, H. H. Q., Squire, J., and Tsui, L. C. (1992) High resolution mapping of mammalian genes by in situ hybridization to free chromatin. Proc. Natl. Acad. Sci. USA 89, 9509-9513.
20. Parra, I. and Windle, B. (1993) High resolution visual mapping of stretched DNA by fluorescent hybridization. Nature Genet. 5, 17-21.
21. Heiskanen, M., Karhu, R., Hellsten, E., Peltonen, L., Kallioniemi, O.-P., and Palotie, A. (1994) High resolution mapping by FISH to extended DNA fibers prepared from agarose embedded cells. BioTechniques 17, 928-933.
22. Driesen, M. S., Dauwerse, J. G., Wapenaar, M. C., Meershoek, E. J., Mollevanger, P., Chen, K. L., Fischbeck, K. H., and Van Ommen, G. J. B. (1991) Generation of yeast artificial chromosomes from a hybrid cell line by high density screening of an amplified library and their fluorescent in situ hybridization mapping to 1p, 17p, 17q and 19q. Genomics 11, 1079-1087.
7
Physical Mapping by Pulsed-Field Gel Electrophoresis
John Maule
1. Introduction
During the past few years, there has been intense activity in physically mapping the genomes of a variety of organisms, including human, mouse, Drosophila, Caenorhabditis elegans, Arabidopsis thaliana, and yeast. The considerable advances made in physically mapping these genomes have demanded the development of a technique capable of filling the niche between conventional Southern blotting and such approaches as somatic cell genetics and fluorescence in situ hybridization (FISH). Pulsed-field gel electrophoresis (PFGE) fills this niche, and since its inception in 1983, has evolved into a robust and reliable technique. PFGE was first used for electrophoretically karyotyping yeasts and protozoa, but with the advent of commercially available rare cutter restriction enzymes, the technique was soon used for the long-range physical mapping of bacterial, yeast, and eukaryotic genomes.
Before the availability of PFGE, long-range physical mapping involved a "bottom-up" approach, which was limited by the cloning capacity of λ and cosmid vectors. Clones are analyzed by restriction mapping and fingerprinting, and then contigs are assembled from overlapping clones. The ability to generate and separate large DNA fragments has allowed a "top-down" strategy in physical mapping. A low-resolution restriction map of genomic DNA can be constructed by using rare cutter restriction enzymes. The assembly of long-range contigs can be facilitated by constructing jumping and linking libraries (1), the products of which are used to probe PFGE-separated restriction fragments. Jumping libraries are produced by cloning the ends of rare cutter-generated restriction fragments into λ. Creating linking libraries involves cloning material from adjacent restriction fragments, straddling a rare cutter
site. Thus, even the limited cloning capacity of λ can be used in the generation of long-range contigs.
A different approach involves cloning long stretches of genomic DNA into yeast artificial chromosomes (YACs) (2). A partial digest of genomic DNA is cloned into the yeast vector, which provides selectable markers for the vector arms (Ura and Trp) as well as a centromere, origin of replication, and telomeres. Genomic inserts of up to 2 Mb have been cloned, although the larger YACs are often chimeric. YACs provide a resource for mapping and analyzing whole genes. Low-resolution restriction mapping and the assembly of YACs into contigs are accomplished by PFGE. YAC DNA isolated from pulsed-field gels can be catch-linkered, labeled, and used for FISH (3). More recently, the concept of long-range cloning has been extended to the assembly of bacterial artificial chromosomes (BACs) and P1-derived artificial chromosomes (PACs) (4). Both these systems, although offering a smaller cloning capacity, do not seem to possess the chimerism or instability problems of YACs. Finally, the technological hurdles associated with the creation of mammalian artificial chromosomes (MACs) are being addressed.
Long-range physical mapping relies on the correct use of appropriate restriction enzymes, taking account of the base composition and methylation status of the organism, to generate large fragments that can link together probes for mapping purposes or detect chromosome rearrangements. The mammalian genome will be used to illustrate the correct choice of restriction enzymes for long-range physical mapping. Several features of the mammalian genomic nucleotide composition can be used in long-range mapping studies. The dinucleotide CpG is not evenly distributed throughout the genome, but is clustered at the 5'-ends of most genes.
Furthermore, these clusters or "CpG islands" are nonmethylated, whereas nonisland genomic tracts tend to be heavily methylated, as well as having a lower than expected frequency of this dinucleotide. Rare cutter restriction enzymes that recognize CpG and are methylation sensitive will preferentially cleave at island sites. An enzyme recognizing exclusively C and G nucleotides and, more precisely, having more than one CpG in its recognition sequence will invariably cut island sites. Furthermore, eight base cutter enzymes, such as NotI and AscI, cut infrequently even at island sites, and so generate very large fragments (5). Six base GC cutters, recognizing two CpGs, tend to cut at most island sites, and therefore produce slightly smaller fragments. Examples are BssHII, EagI, and SstII. Six base GC cutters recognizing only one CpG, such as NaeI or NarI, cut at island sites, but also at nonmethylated nonisland sites. Finally, six base cutters recognizing not only one or two CpGs, but also A and T nucleotides tend to cleave most frequently at unmethylated nonisland sites. These enzymes, for example, MluI, NruI, and SalI, are useful for generating large fragments from regions with a high CpG island density. A complete list of rare cutter enzymes is presented in Table 1. Enzymes cutting at nonisland sites may exhibit partial digestion owing to variable methylation of target sequences. Advantage can be taken of enzyme isoschizomers that are not methylation sensitive, for example, KspI, which will cut DNA methylated at the 5' C residue of the recognition sequence CCGCGG.

Table 1
Rare Cutter Restriction Enzymes

Enzyme                    Recognition sequence(a)
NotI                      GC/GGCCGC
AscI                      GG/CGCGCC
FseI                      GGCCGG/CC
SrfI                      GCCC/GGGC
BssHII                    G/CGCGC
XmaIII, EagI, EclXI       C/GGCCG
SstII, SacII, KspI        CCGC/GG
NaeI                      GCC/GGC
NarI                      GG/CGCC
SmaI                      CCC/GGG
Sse8387I                  CCTGCA/GG
RsrII                     CG/GwCCG
SgrAI                     Cr/CCGGyG
MluI                      A/CGCGT
PvuI                      CGAT/CG
NruI                      TCG/CGA
AatII                     GACGT/C
SalI                      G/TCGAC
BsiWI, SplI               C/GTACG
SnaBI                     TAC/GTA
SfiI                      GGCCNNNN/NGGCC
XhoI                      C/TCGAG
BseAI, AccIII, MroI       T/CCGGA
ClaI                      AT/CGAT
SfuI, AsuII               TT/CGAA

(a) w = A or T, r = A or G, y = C or T, N = A, C, G, or T.

DNA methylation varies significantly between different cell types. Cell lines are notorious for variable methylation, and in fact, old established cell lines may well exhibit island methylation 5' to genes that are no longer required for cell growth or propagation (6). DNA can be prepared from a variety of sources, including lymphoblastoid and fibroblast cell lines, blood, sperm, and solid tissue. Hybrid cell lines and reduced hybrids can provide a particularly good source of DNA enriched for a specific chromosome or subchromosomal region.
High-mol-wt DNA, suitable for PFGE, is prepared encapsulated in agarose. Whole cells are embedded in agarose and lysed and deproteinized, so that the DNA is totally protected from mechanical disruption. Agarose encapsulation can take the form of plugs or beads. Plugs are formed by allowing the molten agarose/cell mixture to set in molds compatible in size with the wells of a pulsed-field gel. Beads are prepared by mixing cells, molten agarose, and an immiscible liquid, such as paraffin, and agitating violently so that the agarose/cell mixture forms into droplets, which when set form solid beads. Bead preparation is more labor intensive than plug formation, but the DNA is available for use more rapidly. Earlier difficulties in manipulating beads, because of their small size and near transparency, can be overcome by the incorporation of inert colored microparticles, and this approach can be extended to color coding plugs (7). Methods for the preparation of plugs and beads are described in Section 3.
Size markers for PFGE are also prepared as agarose plugs and range in size from 50-7000 kb (7 Mb). Concatemers of λ provide a size ladder extending in multimers of 48.5 kb up to about 1 Mb. Common yeast (Saccharomyces cerevisiae) chromosomes provide another easily prepared size marker and effectively cover the range from 200-1600 kb. Beyond this, the heterothallic budding yeast Hansenula wingei chromosomes extend in size from 1-3.5 Mb (8), and wild-type Schizosaccharomyces pombe has three chromosomes of 3.5, 4.6, and 5.7 Mb. Methods for preparing all these markers are described.
Restriction enzyme digestion of high-mol-wt DNA for PFGE is accomplished by immersing the agarose plugs or beads in buffer containing restriction enzyme.
Although plugs are easier to prepare and manipulate, beads offer the advantage of a high surface area-to-volume ratio, so that enzymes can rapidly diffuse in and digest the DNA. This situation can be crucial where complete digestion is vital, where the half-life of the enzyme is short, or where an accurate time-course of digestion is required (e.g., Bal31). Chopped plugs or "chops" offer the ease of plug formation with the increased surface area associated with beads (9) and may be colored with microparticles to aid manipulation (7). Full protocols for digesting DNA-containing plugs and beads are described.
Plugs or beads containing digested DNA or size markers are loaded directly into the wells of pulsed-field gels, and the DNA migrates out from the wells during electrophoresis. Conventional agarose gel electrophoresis can separate DNA molecules no greater than about 50 kb, whereas PFGE has resolved DNA as large as 12 Mb (10). The technique relies on an electric field that regularly changes in orientation relative to the DNA-containing gel. DNA molecules are separated according to size: large molecules take longer to reorient to the changed electric field than smaller ones, and hence, their migration is retarded. The length of time the field is active in any one position is called the pulse time, and its duration critically influences the size range over which separation is achieved. Short pulse times favor the separation of small molecules, whereas long pulse times provide a window of separation for large molecules. Other lesser factors influence separation, such as agarose type and concentration, buffer, voltage, temperature, and field angle.
Early PFGE apparatus exhibited nonuniform electric fields, so that DNA migrated differently depending on its position in the gel. The contour-clamped homogeneous electric field (CHEF) apparatus provides a homogeneous electric field across the gel, resulting in uniform migration of DNA samples regardless of their location on the gel (11). The field switches through an angle of 120°. Lane-to-lane comparisons are easy and reliable. A more recent variant of the CHEF principle is the programmable, autonomously controlled electrode (PACE) device in which electrodes are controlled individually (12). This apparatus allows field angles of between 90 and 180° to be created, although the electrode configuration remains as a hexagon. The Bio-Rad (Richmond, CA) CHEF DR-III and CHEF Mapper are commercial forms of this machine. The latter provides a built-in algorithm that allows the user to enter details of the size of molecules to be separated, and the machine automatically calculates the best run parameters.
Pulsed-field gels are blotted in much the same way as conventional gels, although the DNA must be fragmented to allow efficient transfer. This step is critical, and inefficient transfer may result from lack of care and attention at this stage. Attempts to overcome the inherent difficulties in transferring large DNA molecules to hybridization membranes have led to the development of protocols for drying and hybridizing pulsed-field gels.
DNA is retained within the gel and hybridized in situ. This approach can be invaluable where difficulties are encountered in hybridizing membrane-bound DNA. Dried gels can be stripped and reprobed several times. Protocols for blotting gels and preparing and hybridizing dried agarose gels are provided. Finally, a section is devoted to the interpretation of results and indications of the best way to tackle physical mapping by PFGE.
2. Materials
2.1. General Reagents
1. YPD medium: 20 g bacto-peptone (Difco, Surrey, UK), 10 g yeast extract (Difco), 1 L distilled water. Autoclave, and when cool, add 50 mL 40% sterile glucose. For YPD agar, add 1.5% agar to the medium.
2. LB medium: 10 g bacto-tryptone (Difco), 5 g yeast extract (Difco), 5 g sodium chloride. Adjust pH to 7.2 (with 5M sodium hydroxide solution), and finally make up the volume to 1 L. Autoclave. For LB agar, add 1.5% agar to the medium.
3. 1% NDS solution: 0.45M EDTA, 10 mM Tris-HCl, pH 9.0, 1% sodium N-lauroyl sarcosine (Sigma, St. Louis, MO). Mix solid EDTA and Tris with 29 g sodium hydroxide pellets in 900 mL water. Cool to room temperature, and adjust pH to 9.0 by adding 5M sodium hydroxide solution, and finally make up to 1 L with distilled water. Autoclave. Add sodium N-lauroyl sarcosine to 1%.
4. Proteinase K (Boehringer, Mannheim, Germany): Make up at 20 mg/mL in 1% NDS solution. Store at -20°C.
5. Low-melting-temperature agarose: Several different brands of agarose are suitable for plug formation, but make sure the sulfate content is
2.2. S. cerevisiae and H. wingei Chromosome Preparation
1. 0.125M EDTA, pH 7.5: Dilute 1/2.5 to give 50 mM solution.
2. Cell-wall digestion solution: 2 mL SCE (1M sorbitol, 0.1M trisodium citrate, 60 mM EDTA, pH 7.0; autoclave), 0.1 mL 2-mercaptoethanol, 2 mg Zymolyase 100T. Store at -20°C.
3. TEM: 10 mM Tris-HCl, pH 8.0, 0.45M EDTA, 7.5% 2-mercaptoethanol. Autoclave and add the last compound just before use.
4. Yeast nitrogen base (10X): 6.7 g bacto-yeast nitrogen base without amino acids (Difco), 100 mL distilled water. Filter to sterilize, and store at -20°C.
5. AHC medium: 10 g casamino acids (Difco), 50 mg adenine hemisulfate (Sigma). Adjust pH to 5.8 with 5M hydrochloric acid, and make up to 850 mL with distilled water. Autoclave. When cool, add 50 mL sterile 40% glucose, and 100 mL 10X yeast nitrogen base (see Note 5).
6. Hogness freezing medium: 10X stock is 36 mM K2HPO4, 13 mM KH2PO4, 20 mM trisodium citrate, 10 mM MgSO4, 44% glycerol.
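Several of the reagents above are prepared by diluting a concentrated stock (for example, 0.125M EDTA diluted 1/2.5 to 50 mM, or 10X yeast nitrogen base added to AHC medium). The relation used is C1 x V1 = C2 x V2. The short Python sketch below works through it; the function name dilution_volumes is a hypothetical helper introduced here for illustration.

```python
def dilution_volumes(stock_conc, final_conc, final_volume):
    """Return (stock volume, diluent volume) from C1*V1 = C2*V2.

    Concentrations must share one unit (e.g. M, or X for fold-concentrated
    stocks); the volumes come back in the unit of final_volume.
    """
    if stock_conc <= 0 or final_conc < 0:
        raise ValueError("concentrations must be positive")
    if final_conc > stock_conc:
        raise ValueError("cannot dilute to a higher concentration")
    stock_volume = final_conc * final_volume / stock_conc
    return stock_volume, final_volume - stock_volume


if __name__ == "__main__":
    # 0.125M EDTA diluted to 50 mM (0.05M) in a final volume of 100 mL.
    stock, water = dilution_volumes(0.125, 0.05, 100.0)
    print(f"EDTA: {stock:.1f} mL stock + {water:.1f} mL water")

    # 10X yeast nitrogen base brought to 1X in 1 L of AHC medium.
    ynb, rest = dilution_volumes(10.0, 1.0, 1000.0)
    print(f"YNB: {ynb:.0f} mL of 10X stock in 1000 mL final volume")
```

For the EDTA example this gives 40 mL of stock made up to 100 mL, which is the 1/2.5 dilution stated in item 1; for the yeast nitrogen base it gives the 100 mL specified in item 5.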
2.3. S. pombe Chromosome Preparation
1. 0.125M EDTA, pH 7.5: Dilute to 50 mM for washing cell pellets.
2. CPES: 40 mM citric acid, 120 mM Na2HPO4, 20 mM EDTA, 1.2M sorbitol, 5 mM dithiothreitol, pH 6.0. Autoclave and add the latter reagent just before use.
3. Novozyme 234 (Calbiochem, La Jolla, CA): This is a cell-wall digesting enzyme.
4. 0.1% Sodium dodecyl sulfate (SDS).
5. 0.125M EDTA, 0.9M sorbitol, pH 7.5; autoclave.
2.4. λ Concatemers Preparation
1. Lysogen N1323(λ) (13) carries a temperature-sensitive repressor (cIts), and so can be induced at the nonpermissive temperature of 43°C. The S gene product is required for cell lysis, and this strain carries an amber mutation in this gene, which is not suppressed. This results in the accumulation of several hundred phage per cell following induction and subsequent incubation. The cells can then be concentrated by centrifugation prior to lysis with chloroform; lop8 is a ligase-overproducing mutation that may be beneficial in the repair of single-strand breaks following induction and prolonged incubation.
2. RNase A (Sigma): Make up at 10 mg/mL in 10 mM Tris-HCl, pH 7.5, 15 mM NaCl. Heat to 100°C for 15 min. Allow to cool to room temperature, and then freeze.
3. DNase I (Sigma): Make up at 20 mg/mL. Store at -20°C.
4. EDTA, pH 8.0: Make up at 0.5M and autoclave. Dilute to 0.1M for Section 3.3., step 17.
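The concatemers prepared from this lysogen give the size ladder mentioned in the Introduction: multimers of 48.5 kb up to about 1 Mb. As a minimal sketch, the rung sizes can be listed and used to bracket an unknown fragment. The function names and the 1000-kb cutoff argument below are assumptions made for illustration; the practical upper limit of a given preparation will vary.

```python
LAMBDA_MONOMER_KB = 48.5  # monomer size stated in the Introduction


def lambda_ladder(max_kb=1000.0):
    """Sizes (kb) of lambda concatemer rungs up to max_kb."""
    n_max = int(max_kb // LAMBDA_MONOMER_KB)
    return [n * LAMBDA_MONOMER_KB for n in range(1, n_max + 1)]


def bracketing_rungs(fragment_kb, ladder):
    """Return the nearest rung at or below and at or above a fragment size.

    Either value is None when the fragment lies outside the ladder.
    """
    below = [s for s in ladder if s <= fragment_kb]
    above = [s for s in ladder if s >= fragment_kb]
    return (below[-1] if below else None, above[0] if above else None)


if __name__ == "__main__":
    ladder = lambda_ladder()
    print(f"{len(ladder)} rungs, largest {ladder[-1]} kb")
    print("300-kb fragment lies between", bracketing_rungs(300.0, ladder))
```

With the default cutoff, the ladder has 20 rungs and tops out at 970 kb, consistent with "up to about 1 Mb".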
2.5. Mammalian Chromosome Preparation
1. PBS: Dissolve 0.2 g KCl, 0.2 g KH2PO4, and 8 g NaCl in 800 mL water. Adjust pH to 7.2, and make up to 1 L with water.
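The PBS recipe is given in grams per liter; converting it to millimolar concentrations makes it easier to compare with other formulations. The sketch below does the conversion. The molar masses are standard values for the anhydrous salts and are not taken from the text, and the names used are hypothetical.

```python
# Standard molar masses (g/mol) of the anhydrous salts; these values are an
# assumption added for illustration, not part of the recipe in the text.
MOLAR_MASS = {"KCl": 74.55, "KH2PO4": 136.09, "NaCl": 58.44}

# Amounts per liter of finished PBS, as listed in Section 2.5.
PBS_GRAMS_PER_LITER = {"KCl": 0.2, "KH2PO4": 0.2, "NaCl": 8.0}


def millimolar(grams_per_liter, molar_mass):
    """Convert g/L to mmol/L."""
    return 1000.0 * grams_per_liter / molar_mass


if __name__ == "__main__":
    for salt, grams in PBS_GRAMS_PER_LITER.items():
        conc = millimolar(grams, MOLAR_MASS[salt])
        print(f"{salt}: {grams} g/L = {conc:.1f} mM")
```

If a hydrated salt is used instead, the corresponding molar mass has to be substituted.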
2.6. Agarose Plug Preparation
1. A plug mold can be made from a rigid polystyrene flat-bottomed 96-well microtiter plate (ICN Flow, catalog number 76-307-05) by milling off the base or drilling out the individual wells. A Titertek plate sealing strip (ICN Flow, catalog number 77-400-05) is secured across the top, and the plate is inverted so that the base is now uppermost.
2. The mixture of cells and molten agarose is poured into a reagent trough (ICN Flow, catalog number 77-824-01) prior to dispensing.
2.7. Agarose Bead Preparation
1. Polystyrene dyed microparticles (Polysciences, Warrington, PA): 0.2 µm, colored red (catalog number 15705), blue (catalog number 15706), and yellow (catalog number 15707) (see Note 1). Microparticles are prepared as inert suspensions.
2. Liquid paraffin (The Boots Company, Nottingham, UK).
3. LIDS solution: 1% lithium dodecyl sulfate (Sigma), 100 mM EDTA, 10 mM Tris-HCl, pH 8.0. Dissolve EDTA and Tris, and adjust the pH. Autoclave. Add lithium dodecyl sulfate to 1% from a filter-sterilized 20% stock solution.
4. 50 mM EDTA, pH 8.0: Dilute sterile 0.5M stock solution 1/10.
2.8. Digestion of Agarose-Embedded DNA
1. Phenylmethylsulfonyl fluoride (PMSF) is a potent protease inhibitor and is used to destroy residual proteinase K. Great care must be exercised in handling this substance. The solid becomes electrostatically charged, so use a wooden spatula while weighing out and wear gloves throughout. Add 1 mL of propan-2-ol to 20 mg of the solid, and dissolve by incubating at 50°C for 3 min. This solution is stable and may be stored frozen at -20°C (see Note 2).
2. Triton X-100: Prepare as a 10% stock solution and autoclave.
3. Bovine serum albumin (BSA): Boehringer, special molecular-biology grade, supplied at 20 mg in 1 mL.
4. Stop buffer: 0.5X TBE, 10 mM EDTA, 2 mg/mL Orange G (Sigma).
5. Restriction enzyme buffers: Since considerable volumes are required for washing and equilibrating plugs, it is wise to prepare 10X stock solutions and autoclave.
6. TEX: 10 mM Tris-HCl, 1 mM EDTA, pH 8.0. Autoclave and then add sterile Triton X-100 to 0.01%.
2.9. Separating DNA Using the CHEF
1. TBE: Prepare as 5X TBE containing 450 mM Tris, 450 mM boric acid, and 10 mM EDTA.
2. Agarose: The standard medium electroendosmosis (EEO) agarose used for conventional gels can be used for PFGE. Special PFGE agaroses are available that exhibit high gel strength and low EEO and, therefore, can be used at low concentrations. Under these conditions, faster separation times are achievable. The following PFGE agaroses are available: Fastlane and Seakem Gold (FMC), Chromosomal Grade and Pulsed-Field Certified Agarose (Bio-Rad), Boehringer Multipurpose agarose, Pulsed-field-grade agarose (Stratagene, La Jolla, CA), Kilorose and Megarose agaroses (Clontech, Palo Alto, CA), and Rapid agarose (Gibco BRL, Gaithersburg, MD). These can be more expensive than standard-grade agarose.
3. Ethidium bromide is a mutagen and should be handled with care. Always wear gloves. Bio-Rad sells ethidium bromide tablets (each tablet makes 11 mL of 1 mg/mL solution), which reduce the danger of spillage of the solid during preparation of the solution.
4. Commercially available CHEF apparatus can be obtained from Bio-Rad (CHEF DR-III and CHEF Mapper) and Pharmacia (Brussels, Belgium) (Gene Navigator System) (see Note 3).
5. Details of how to construct a large gel format homemade CHEF apparatus are available from the author (14). The Bio-Rad cooling module and variable-speed circulating pump can be used with this apparatus.
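TBE is stored as a 5X stock but used at lower strength (the stop buffer in Section 2.8., for instance, contains 0.5X TBE). The component concentrations at any working strength follow by simple proportion, as the Python sketch below shows. The names TBE_5X_MM and at_strength are hypothetical and introduced only for this illustration.

```python
# Composition of the 5X TBE stock (mM), as given in item 1 above.
TBE_5X_MM = {"Tris": 450.0, "boric acid": 450.0, "EDTA": 10.0}


def at_strength(stock_mm, stock_x, target_x):
    """Scale each component of a stock of strength stock_x to target_x."""
    if stock_x <= 0 or target_x < 0:
        raise ValueError("strengths must be positive")
    factor = target_x / stock_x
    return {name: conc * factor for name, conc in stock_mm.items()}


if __name__ == "__main__":
    working = at_strength(TBE_5X_MM, stock_x=5.0, target_x=0.5)
    for name, conc in working.items():
        print(f"0.5X TBE, {name}: {conc:g} mM")
```

At 0.5X this corresponds to 45 mM Tris, 45 mM boric acid, and 1 mM EDTA.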
2.10. Southern Transfer
1. HCl: Make up as 5M (86 mL of concentrated HCl made up to 200 mL with water), and dilute 1/20 prior to use.
2. Denaturant: 0.5M NaOH, 1.5M NaCl.
3. Neutralizer: 1M Tris, 2M NaCl. Adjust pH to 5.5 with HCl.
4. SSC: 20X stock is 3M NaCl, 0.3M trisodium citrate, pH 7.4.
5. Stratalinker (Stratagene) is ideal for UV crosslinking membranes.
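The 1/20 dilution of the 5M HCl stock gives the 0.25M working solution used for depurination in Section 3.10. The arithmetic can be sketched as a simple C1 x V1 = C2 x V2 calculation. This is only an illustrative sketch: the function name and the 500-mL working volume are assumptions, not values from the protocol.

```python
def dilute(stock_molar, final_molar, final_volume_ml):
    """Return (stock_ml, water_ml) for a C1*V1 = C2*V2 dilution."""
    if final_molar > stock_molar:
        raise ValueError("final concentration exceeds stock concentration")
    stock_ml = final_molar * final_volume_ml / stock_molar
    return stock_ml, final_volume_ml - stock_ml

# 5M stock diluted 1/20 gives 0.25M; 500 mL is a hypothetical working volume.
stock_ml, water_ml = dilute(5.0, 5.0 / 20, 500.0)
print(f"{stock_ml:.0f} mL of 5M HCl + {water_ml:.0f} mL water")
```

The same helper applies to any of the stock solutions in this section, for example 5X TBE diluted to 0.5X if concentrations are entered in "X" units.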
2.11. Hybridization and Autoradiography
1. High Prime DNA labeling kit (Boehringer).
2. Redivue deoxycytidine 5'-[α-32P] triphosphate (3000 Ci/mmol), Amersham (Bucks, UK), catalog number AA0005.
3. Whatman (Maidstone, UK) GF/B 2.4-cm filter circles.
4. Trichloroacetic acid (TCA): Make up as 50% stock solution and dilute 1/10.
5. NICK columns (Pharmacia): Prepacked with Sephadex G-50.
6. 10X TNE: 100 mM Tris-HCl, pH 8.0, 10 mM EDTA, 2M NaCl. Autoclave.
7. Sonicated salmon sperm DNA, Sigma (D 1626): Make up at 10 mg/mL in water by stirring overnight at 4°C. Sonicate sufficiently to achieve a size of 600 bp. Store at -20°C. This reagent is used as a blocking agent to reduce nonspecific hybridization.
8. Filter hybridization mix: 5X Denhardt's solution, 5X SSC, 0.1% disodium pyrophosphate, 0.5% SDS, 10% sodium dextran sulfate (Pharmacia). Filter through a Millipore (Bedford, MA) 8-µm SCWP filter (omitting the SDS, which should be filtered separately). Denhardt's 20X stock contains, per 100 mL, 0.4 g Ficoll 400 (Pharmacia), 0.4 g polyvinylpyrrolidone, and 0.4 g BSA (Sigma fraction V). For 20X SSC, see Section 2.10., step 4. Store at 4°C.
9. SSPE: 5X stock is 750 mM NaCl, 50 mM NaH2PO4, 5 mM EDTA. Adjust pH to 7.4 with NaOH, and add SDS to 0.1%.
10. SSC washes containing 0.1% SDS and 0.1% disodium pyrophosphate. Maintain at 68°C until required.
11. Autoradiography film: Use Kodak XAR-5.
3. Methods
3.1. Preparation of Chromosomal DNA from S. cerevisiae and H. wingei
1. Pick a single colony from a freshly grown culture, streaked on a YPD plate, and inoculate 100 mL of YPD medium in a 500-mL flask (see Note 4). YACs should be grown in AHC medium (see Notes 5 and 6).
2. Shake for 24 h at 33°C at approx 200 rpm.
3. Dilute an aliquot 1/10 in YPD, and count the number of cells using a hemocytometer. Cell count should be ~1 x 10^8 cells/mL.
4. Chill the culture on ice for 15 min, and then harvest the cells by spinning at 2000g for 10 min at 4°C.
5. Discard the supernatant, and gently disrupt the pellet with a sterile loop before adding 50 mL of chilled 50 mM EDTA, pH 7.5. Make sure the cells are thoroughly dispersed.
6. Spin at 2000g for 5 min at 4°C.
7. Repeat steps 5 and 6.
8. Finally, discard the supernatant, and take up the pellet in 3 mL of ice-cold 50 mM EDTA, pH 7.5 (gives a final vol of approx 3.5 mL).
9. Transfer the cells to a 20-mL universal container with a fine-tip sterile Pastet (to disrupt any cells that are clumped), and warm to 37°C.
10. Add 6 mL of 1% low-melting-temperature agarose (in 0.125M EDTA, pH 7.5), which has been cooled to 50°C.
11. Finally, add 1.2 mL of cell-wall digestion solution.
12. Immediately mix thoroughly, and dispense into plug molds (see Section 3.5.).
13. Eject the plugs into a 50-mL Falcon tube containing 25 mL of TEM solution, and incubate overnight at 37°C in a water bath.
14. Replace TEM with 20 mL 1% NDS containing 1 mg/mL proteinase K (see Section 3.5.).
15. Finally, store plugs in 20 mL 1% NDS at 4°C.
16. Plugs should be equilibrated for at least 1 h in gel running buffer before use.
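The volumes in steps 8, 10, and 11 fix the final agarose concentration and the number of plugs that the mixture can yield. A minimal sketch of that arithmetic follows, assuming the 100-µL plug volume given in Section 3.5. and ignoring pipetting losses. The function name is hypothetical, and volumes are kept in whole microliters to avoid rounding surprises.

```python
def plug_mix(cells_ul, agarose_ul, enzyme_ul, agarose_pct=1.0, plug_ul=100):
    """Summarize the molten mix: total volume, final agarose %, whole plugs."""
    total_ul = cells_ul + agarose_ul + enzyme_ul
    final_agarose_pct = agarose_pct * agarose_ul / total_ul
    whole_plugs = total_ul // plug_ul  # dead volume is ignored
    return total_ul, final_agarose_pct, whole_plugs

# 3.5 mL cells + 6 mL of 1% agarose + 1.2 mL cell-wall digestion solution
total_ul, final_pct, plugs = plug_mix(3500, 6000, 1200)
print(f"{total_ul / 1000:.1f} mL total, {final_pct:.2f}% agarose, up to {plugs} plugs")
```

Under these assumptions the mix is slightly more than one 96-well mold can hold, so a second mold (or a smaller batch) may be needed.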
Figures 1 and 2 show the separation of some yeast chromosomes from S. cerevisiae and H. wingei (Fig. 2) and the resolution of six YACs ranging in size from 230-1500 kb. Yeast chromosome sizes are presented in Table 2.
3.2. Preparation of Chromosomal DNA from S. pombe
1. Inoculate 5 mL of YPD medium from a single colony taken from a freshly grown plate (see Notes 7 and 8).
2. Shake at 30°C in a universal container overnight.
3. On the next day, add the overnight culture to 100 mL of YPD in a 500-mL flask. Shake at 30°C for 24 h.
4. Dilute an aliquot of the culture 1/10 in YPD, and count the number of cells in a hemocytometer. The count should be 3-5 x 10^7 cells/mL.
5. Chill the culture on ice for 15 min (see Note 9).
6. Spin at 2000g for 10 min at 4°C.
7. Discard the supernatant, and gently disrupt the pellet with a sterile loop before adding 50 mL of ice-cold 50 mM EDTA, pH 7.5.
8. Repeat steps 6, 7, and then 6.
9. Resuspend the pellet in 2 mL CPES containing 0.6 mg (60 U) Zymolyase 100T and 2.5 mg Novozym 234 (see Note 10).
10. Incubate at 37°C for 2 h.
11. Check for successful cell-wall digestion by mixing equal volumes of cell culture and 0.1% SDS, and viewing under the microscope. Cells without cell walls are lysed by SDS, and normally >50% of the cells are in this condition (see Note 11).
12. Mix the cells (preheated to 37°C) with an equal volume of 1% low-melting-temperature agarose (in 0.125M EDTA, 0.9M sorbitol, pH 7.5), which has been maintained at 50°C.
13. Dispense into plug molds, and incubate the plugs in 1% NDS + proteinase K (see Section 3.5.).
14. Store the plugs in 1% NDS at 4°C.
15. Soak the plugs for at least 1 h in gel running buffer before use.

S. pombe chromosomes are shown separated by PFGE in Fig. 3. Chromosome sizes are listed in Table 2.

Fig. 1. Separation of yeast chromosomes, YACs, and λ concatemers by CHEF gel PFGE. The samples were loaded onto a 1% agarose gel in 0.5X TBE. The gel was run at 6 V/cm with a 25-s pulse time for 36 h at 14°C. The positions of the four YACs are indicated. AB1380 is the S. cerevisiae host strain in which the YACs are maintained. The unresolved DNA above the 440-kb marker forms the compression zone (CZ).

3.3. Preparation of Bacteriophage λ Concatemers
1. Inoculate from a lysogen of N1323(λ)-lop8, ctsI857, Sam7 into 25 mL of LB (see Notes 12 and 13).
2. Incubate overnight at 33°C.
3. Add the overnight culture to 500 mL of prewarmed LB in a 2-L flask.
4. Shake at 33°C until OD600 = 0.45 (approx 3 h).
5. Induce for 15 min at 43°C.
6. Incubate for 2 h at 39°C with vigorous shaking (see Note 14).
7. Test for successful induction by adding a few drops of chloroform to a 2-mL aliquot of the culture in a glass test tube, vortex, and incubate at 37°C for 10 min without shaking. The culture should clear after a few minutes. Use a culture without chloroform as a comparison.
8. Pellet the cells at 4000g for 10 min at 4°C.

Fig. 2. The separation of large yeast chromosomes and YACs by PFGE. Samples were run on a 1% agarose CHEF gel in 0.5X TBE at 14°C for 36 h at 6 V/cm with a pulse time of 115 s. The positions of the two YACs (sized at 1450 and 1350 kb) are indicated. S. pombe strain 3B3 has a minichromosome of 550 kb. Note that under these run conditions, only the smallest three H. wingei chromosomes are separated, and the resolution of yeast chromosomes below 1000 kb is less than optimal.
Table 2
Chromosome Sizes, kb

S. cerevisiae AB970   S. cerevisiae YP148   H. wingei   S. pombe 3B3
2200                  2200                  3300        5700
1640                  1640                  2900        4600
1130                  1125                  2600        3500
1120                  1030                  1800         550
 955                  1000                  1500
 930                   920                  1250
 830                   830                  1030
 790                   790
 750                   750
 690                   700
 585                   600
 585                   550
 445                   440
 350                   350
 285                   270
 240                   210
                        90
Fig. 3. S. pombe strain 3B3 and H. wingei chromosomes separated by CHEF PFGE. Samples were run at 2 V/cm, with a pulse time of 60 min for 120 h at 14°C on a 1% agarose gel in 0.5X TBE. Note that only the largest H. wingei chromosome is clearly resolved, the lower band representing the other six chromosomes.
9. Gently disrupt the pellet with a sterile plastic loop, and take up in a total volume of 200 mL of ice-cold TE.
10. Spin as in step 8.
11. Take up the cells in a total volume of 23 mL of ice-cold TE, having disrupted the pellet as in step 9.
12. Add 0.5 mL chloroform, and shake gently at 37°C for 15 min.
13. Add 20 µL each of RNase A and DNase I, and shake gently at 37°C for 15 min.
14. Spin at 5000g for 15 min at 4°C.
15. Warm the supernatant to 37°C, mix with an equal vol of 1% low-melting-temperature agarose (in TE) cooled to 50°C, and dispense into plug molds (see Section 3.5.).
16. Incubate the plugs in 1% NDS + proteinase K (see Section 3.5.).
17. Rinse the plugs in 0.1M EDTA, pH 8.0, and incubate in the same for 48 h at 50°C.
18. Rinse the plugs in 0.5M EDTA, pH 8.0, and store in the same at 4°C.
19. Plugs should be cut into small pieces and soaked in gel running buffer for at least 1 h before loading onto the gel.

λ concatemers are shown separated by PFGE in Fig. 1.
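Because each rung of a concatemer ladder is an integer multiple of the monomer, the expected marker sizes are easy to tabulate. The sketch below assumes the wild-type λ monomer size of 48.5 kb quoted in Note 13. The function name and the choice of ten rungs are illustrative only.

```python
def concatemer_ladder(monomer_kb, max_units):
    """Expected sizes (kb) of concatemers containing 1..max_units monomers."""
    return [monomer_kb * n for n in range(1, max_units + 1)]

# Wild-type lambda monomer (48.5 kb, Note 13); ten rungs chosen for illustration.
ladder = concatemer_ladder(48.5, 10)
for units, size_kb in enumerate(ladder, start=1):
    print(f"{units:2d}-mer: {size_kb:6.1f} kb")
```

Substituting the smaller monomer sizes listed in Note 13 gives ladders with finer spacing.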
3.4. Preparation of Chromosomal DNA from Cultured Mammalian Cells
1. Harvest the cells from culture flasks, and transfer to 50-mL Falcon tubes (see Note 15).
2. Spin at 2000g for 10 min at 4°C.
3. Discard the supernatant, and resuspend the cells in residual medium by vigorously flicking the base of the tube.
4. Add 25 mL of PBS/tube.
5. Spin as in step 2, and discard the supernatant.
6. Resuspend the cells as in step 3, and add 10 mL of PBS/tube. At this point, the contents of the tubes can be combined into one or more 50-mL Falcon tubes. Transfer the contents by using a fine-tip Pastet, which helps to break up any cell clumps (see Note 16).
7. Count an aliquot of the cell suspension using a hemocytometer, and calculate the total cell number.
8. Repeat steps 2 and 3.
9. Resuspend the cells in a sufficient volume of phosphate-buffered saline (PBS) to give a cell density of 2 x 10^7 cells/mL (see Note 17).
10. Transfer the cells to a universal container, once again using a fine-tip Pastet.
11. Warm the cells to 37°C, and mix with an equal volume of 1% low-melting-temperature agarose (in PBS), which has been maintained at 50°C.
12. Dispense into plug molds (see Section 3.5.).
13. Incubate the plugs in 1% NDS + proteinase K (see Section 3.5.).
14. Store the plugs in 1% NDS at 4°C.
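Steps 7 and 9 amount to a short calculation: convert the hemocytometer count to a total cell number, and then divide by the target density. The sketch below assumes a standard hemocytometer in which each large (1 mm x 1 mm x 0.1 mm) square holds 0.1 µL, so that cells/mL = mean count per square x dilution factor x 10^4. The counts, the dilution factor, and the 10-mL suspension volume are hypothetical examples.

```python
def cells_per_ml(mean_count_per_square, dilution_factor=1.0):
    """Cell density from a hemocytometer (0.1 uL per large square assumed)."""
    return mean_count_per_square * dilution_factor * 1e4

def resuspension_volume_ml(total_cells, target_density=2e7):
    """Volume of PBS needed to reach the target density (cells/mL)."""
    return total_cells / target_density

# Hypothetical count: 150 cells/square on a 1/10 dilution of a 10-mL suspension.
density = cells_per_ml(150, dilution_factor=10)
total_cells = density * 10.0
pbs_ml = resuspension_volume_ml(total_cells)
print(f"{total_cells:.2e} cells -> resuspend in {pbs_ml:.1f} mL PBS")
```

After mixing with an equal volume of agarose in step 11, the cell density in the plugs is half the value reached in step 9.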
3.5. Preparation of Agarose Plugs
Agarose plugs can be conveniently formed in large numbers using a mold made from a 96-well microtiter plate (15) (see Note 18).
1. Spray the mold with 70% ethanol, and air-dry prior to use.
2. Cover the base of the plate with a Titertek sealing strip, and place on ice.
3. Pour the molten mixture of cells and agarose into a trough, and using an 8- or 12-channel micropipet, dispense 100-µL aliquots into the mold, one row after another.
4. Allow the plugs to set, on ice, for 20 min.
5. Remove the sealing strip, and using a sterile plastic yellow micropipet tip, eject the plugs from the mold into a 50-mL Falcon tube. Care should be taken not to poke holes in the plugs during ejection. This can be avoided by running the tip down the wall of each well, making contact with the circumference of the plug, and then rapidly pushing downward until the plug drops from the mold.
6. The mold can now be cleaned by soaking in 0.1M HCl and then thoroughly rinsing in distilled water.
7. The plugs are now incubated in detergent + proteinase K solution, although yeast plugs require a cell-wall digestion step before reaching this stage (see Section 3.1., step 13). Up to 96 plugs are incubated in 20 mL of 1% NDS + 1 mg/mL proteinase K for 48 h at 50°C in a water bath. The solution is changed after 24 h.
8. Finally, the plugs are rinsed in the storage solution and stored in the same at 4°C. Plugs generally remain in good condition for at least a year under these conditions.
3.6. Preparation of Agarose Beads
1. Harvest cells, and wash twice in PBS.
2. Count cells in a hemocytometer, and resuspend in PBS at a density of 1.25 x 10^7 cells/mL.
3. Add three drops of dyed microparticles/mL of cell suspension, and warm at 37°C.
4. Make up 2.5% low-melting-temperature agarose in PBS, and cool to 50°C.
5. Warm some liquid paraffin to 50°C.
6. Equilibrate a 100-mL round-bottomed flask at 37°C.
7. Mix 1 vol of cells with 0.25 vol of agarose and 2.5 vol of paraffin in the flask, and cover the top with Parafilm.
8. Immediately agitate violently for 30 s in a flask shaker, and then transfer the flask to an ice/water mixture and swirl for 10 min.
9. Pour the contents into a 50-mL Falcon tube, and rinse out the flask with sufficient PBS to fill the tube.
10. Spin at 4000g for 45 s.
11. Stir the layer of beads trapped at the aqueous/paraffin interface, and respin for 1 min at 4000g.
12. Remove the supernatant without disturbing the pellet.
13. Resuspend the pellet in 50 mL PBS, and repeat step 10.
14. Repeat step 13 four more times, removing traces of paraffin from the tube with a tissue or transferring the beads to new Falcon tubes.
15. Resuspend the beads in LIDS solution to give a final vol of 50 mL.
16. Leave to stand at room temperature for 2 min.
17. Pellet the beads at 4000g for 5 min, remove the supernatant, and replace with fresh LIDS, resuspending by vortexing if necessary.
18. Leave at room temperature for 20 min.
19. Repeat steps 17 and 18 five more times.
20. Wash the beads extensively in 50 mM EDTA, pH 8.0, by repeating steps 17 and 18 until no trace of detergent remains when the tube is shaken.
21. Store the beads in 50 mL of 50 mM EDTA, pH 8.0, at 4°C.
3.7. Restriction Endonuclease Digestion of DNA Embedded in Agarose Plugs
1. Soak the plugs for 10 min in a large excess of sterile TE at room temperature. Typically, up to 10 plugs are immersed in 20 mL of TE.
2. Invert the tube frequently to achieve efficient mixing, or use a tube roller.
3. Immerse the plugs in 5 mL TE containing PMSF at 40 µg/mL.
4. Incubate at 50°C for 30 min.
5. Repeat steps 3 and 4.
6. Soak the plugs for 2 h in 10 vol of 1X restriction enzyme buffer at room temperature, inverting the tube frequently to achieve efficient mixing. This pre-equilibration step can be reduced to 1 h if the PMSF treatment is performed in 1X restriction buffer rather than TE.
7. The plugs are now ready for digestion, which is performed in 1.5-mL microcentrifuge tubes, 1 plug/tube.
8. Mix in each tube, on ice, 1X restriction buffer containing 0.1% Triton X-100 and 200 µg/mL BSA. Add 20 U of restriction enzyme (see Note 19). The final volume should be 100 µL.
9. Transfer the plug to the tube, checking that it is completely immersed in liquid and that no air bubbles are trapped around the plug. A 5-s spin in a microcentrifuge can be beneficial.
10. Incubate overnight, in a water bath, at the recommended temperature (see Note 20).
11. Double digests can be carried out with both enzymes simultaneously if the buffer and incubation temperature are compatible. If not, the digests must be carried out sequentially, and if a buffer change is necessary, then the plug must be soaked for at least 1 h in the new buffer.
12. On the next day, cool the tube on ice for 10 min, add 1 mL of ice-cold TE/tube, invert once, and remove the TE (see Note 21).
13. Add 200 µL of stop buffer/tube, and maintain on ice for 20 min (see Note 22).
14. The plug is now ready for loading into the gel.

Genomic DNA digested in plugs and separated on a CHEF apparatus is shown in Fig. 4. A complete list of rare cutter restriction enzymes appears in Table 1.
Fig. 4. Human DNA digested with rare cutter restriction enzymes and run out on a CHEF pulsed-field gel. The run conditions were 1% Boehringer agarose M.P. gel in 0.5X TBE, run at 4.5 V/cm for 45 h with a pulse time of 80 s. Each track contains 5 µg of DNA.
3.8. Restriction Endonuclease Digestion of DNA Embedded in Agarose Beads
1. Spin stored beads at 4000g for 5 min.
2. Withdraw an appropriate volume of beads from the pellet, using a cutoff micropipet tip (for n digests, withdraw n x 100 µL of beads), and transfer to a microcentrifuge tube. Spin for 1 min, and check that the pellet contains a sufficient volume of beads. Then transfer them to a 10-mL Falcon tube, and add 10 mL of sterile 0.1% Triton X-100.
3. Mix on a tube rotator for 20 min.
4. Spin at 2000g for 5 min, and discard the supernatant.
5. Resuspend the beads in 10 mL of sterile 0.1% Triton X-100.
6. Repeat steps 3, 4, and 5 twice more, and finally spin as in step 4.
7. Resuspend the beads to n x 100 µL with water, where n is the number of digests.
8. Add n µL of 10% Triton X-100 and n/2 µL of BSA, and mix.
9. Aliquot into microcentrifuge tubes containing 11 µL of 10X restriction enzyme buffer, using a cutoff tip. Add 20 U of restriction enzyme/tube, and mix thoroughly.
10. Incubate for 2 h at the recommended temperature.
11. Transfer the tubes to ice, add 1 mL TEX/tube, and mix.
12. Spin for 1 min, add 10 µL TE to the pellet, and mix.
13. The beads are now ready for loading into the gel slots.
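The per-digest scaling in steps 7-9 can be written out explicitly. The following sketch computes the additions for n digests and the approximate final concentrations in each tube, assuming the 20 mg/mL BSA and 10% Triton X-100 stocks from the Materials and neglecting the small volume of enzyme. The function and key names are illustrative only.

```python
def bead_digest_mix(n_digests, bsa_stock_mg_ml=20.0, triton_stock_pct=10.0,
                    buffer_10x_ul=11.0):
    """Volumes for n bead digests and approximate final concentrations per tube."""
    beads_ul = 100.0 * n_digests          # step 7: beads made up with water
    triton_ul = 1.0 * n_digests           # step 8: n uL of 10% Triton X-100
    bsa_ul = 0.5 * n_digests              # step 8: n/2 uL of BSA
    aliquot_ul = (beads_ul + triton_ul + bsa_ul) / n_digests
    reaction_ul = aliquot_ul + buffer_10x_ul  # enzyme volume neglected
    return {
        "beads_ul": beads_ul,
        "triton_ul": triton_ul,
        "bsa_ul": bsa_ul,
        "aliquot_ul": aliquot_ul,
        "buffer_x": 10.0 * buffer_10x_ul / reaction_ul,
        "triton_pct": triton_stock_pct * (triton_ul / n_digests) / reaction_ul,
        "bsa_ug_ml": 1000.0 * bsa_stock_mg_ml * (bsa_ul / n_digests) / reaction_ul,
    }

mix = bead_digest_mix(8)  # eight digests chosen as an example
for name, value in mix.items():
    print(f"{name}: {value:.3g}")
```

On these assumptions each tube ends up close to 1X buffer, with Triton X-100 and BSA at a little under 0.1% and 100 µg/mL, respectively.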
3.9. Resolution of Large DNA Molecules Using the CHEF
1. Select a suitable buffer for use in the CHEF. 0.5X TBE gives the best results for separations up to 2 Mb (see Note 23).
2. Prepare a sufficient volume of the buffer to provide enough for the gel and the running buffer. This ensures that the gel and surrounding buffer are in ionic equilibrium.
3. Fill the gel tank with the correct volume of buffer, and precool to the desired temperature (14°C is an optimal temperature).
4. Meanwhile, prepare the gel, having selected the most suitable type and concentration of agarose. Normally, a 1% medium EEO agarose is a suitable starting gel. Dissolve the agarose by either microwaving or heating on a hotplate. Using a conical flask 2.5x the volume of the agarose solution, preweigh the flask and contents, and replenish with distilled water after the agarose has dissolved. Check that the agarose has dissolved by holding the flask up to the light, swirling the contents, and checking to ensure there are no translucent lumps of solid left undissolved.
5. Allow the agarose to cool to hand hot (45°C) before pouring into the gel former. Immediately remove any air bubbles with a Pastet. It is essential that gels are cast on a level surface; if in doubt, check with a spirit level (see Notes 24 and 25).
6. If there is likely to be any delay between casting and running the gel, cover the surface with cling wrap to prevent evaporation.
7. Prior to loading, gently remove the comb, and fill the slots with running buffer. This will help prevent the formation of air pockets when loading the sample plugs.
8. Plugs should be maneuvered into the gel slots by using sterile disposable inoculating loops and fine-tip Pastets. If the plugs require cutting, stand them on edge (supported by an inoculating loop), and slice downward with a sterile scalpel blade.
9. Beads are loaded by using a cutoff micropipet tip. The gel slots should not be prefilled with buffer.
10. Gently blot the gel to remove any buffer displaced from the wells, taking care not to disturb the plugs.
11. Seal the plugs/beads into position by dripping cool, molten 1% low-melting-temperature agarose over the slots, and allow to set.
12. Load the gel into the CHEF bath, checking that it is immersed in the correct depth of buffer. It is essential that the gel bath is level during the run.
13. Select an appropriate pulse time, voltage, and run time to separate the molecular-size range of interest, and commence the run (see Notes 26 and 27).
14. A visual check to ascertain which electrodes are fizzing will confirm that the apparatus is operating correctly. Electrodes on the sides marked E and A will fizz alternately as the field direction changes from one pulse to the next (see Fig. 5).
15. At the termination of the run, remove the gel, and stain in distilled water containing 1 µg/mL ethidium bromide for 20 min with gentle agitation.
16. Photograph the gel under UV light, with a ruler running down the side of the gel.

Fig. 5. Schematic diagram of the contour-clamped homogeneous electric fields (CHEF) apparatus. The field switches through 120° from A → D to E → B and then back again. DNA migrates toward C.
3.10. Southern Transfer from Pulsed-Field Gels
1. The large DNA molecules separated by PFGE must be fragmented to allow efficient transfer to the hybridization membrane: Immerse the gel in 2.5x its vol of 0.25M HCl, and agitate gently for 20 min at room temperature. Do not exceed this time. This process leads to the partial depurination of the DNA.
2. Rinse the gel in distilled water.
3. Immerse the gel in 2.5x its volume of denaturant, and agitate gently for 20 min. Repeat with fresh denaturant (see Note 28).
4. Repeat step 2.
5. Immerse the gel in 2.5x its volume of neutralizer, and agitate gently for 40 min. Do not exceed this time.
6. Set up a device for Southern transfer by filling a plastic tray to the brim with 20X SSC. A sheet of 5-mm thick plastic rests across the top of the tray lengthwise, with a gap on each side to allow two sheets of Whatman 17Chr paper, placed on top of the plastic sheet, to dip into the buffer on both sides. This forms a wick that draws buffer up from the tray. Allow the paper to become saturated before the next stage.
7. Lay the gel on the paper wick, checking that there are no air bubbles trapped between the gel and the paper.
8. Cover the surrounding paper with cling wrap to prevent evaporation of the buffer.
9. Lay the hybridization membrane onto the gel, following the manufacturer's instructions.
10. The membrane should be gently rolled with a glass pipet to squeeze out any trapped air bubbles.
11. Mark the position of the slots on the membrane with a ballpoint pen, so that the membrane can be orientated relative to the autoradiogram.
12. Lay on a sheet of Whatman 17Chr paper, slightly larger than the gel.
13. Lay on paper towels to cover the sheet of Whatman paper, stacked to a height of 5 cm.
14. Finally, lightly compress the paper towels with a 1-kg weight, resting on a sheet of plastic or glass to distribute the weight over the whole area of the gel.
15. After 24 h, discard any wet towels, and replace with dry ones.
16. After a further 24 h, remove the membrane, and wash in 2X SSC for 5 min.
17. The gel may be stained in ethidium bromide to check the efficiency of the transfer (see Note 29).
18. Crosslink the DNA to the membrane by treating with UV light, baking, or both. Follow the manufacturer's instructions.
3.11. Preparation of Dried Pulsed-Field Gels
1. After photographing the gel, follow steps 3, 4, and 5 in Section 3.10.
2. Cut off one corner of the gel (for orientation purposes), and transfer it to two sheets of Whatman 3MM paper. Cover the gel in cling wrap, and place on a gel drier.
3. Dry the gel under vacuum only for 20 min and then under vacuum + heat (60°C) for a further 20 min. These times are merely guidelines, and may need to be adjusted to suit individual gel driers and vacuum sources. At the end of the process, the gel should be the thickness of X-ray film, but not as thin as cling wrap.
4. Wrap the gel (still stuck to the paper) in cling wrap, and store at 4°C until required.
3.12. Hybridization of Radiolabeled Probes to Pulsed-Field Gel Filters
Hybridization procedures using filters carrying DNA from pulsed-field gels are not essentially any different from those used for filters from conventional gels, but the detection of single-copy sequences does seem to be more difficult. Every effort must be made, therefore, to optimize labeling and hybridization conditions in order to produce a clear, positive result.
3.12.1. Preparing and Labeling Probes
1. Single-copy genomic probes should be prepared by isolating the insert away from the vector sequence, by digestion with appropriate restriction enzymes, and separating the products on a preparative gel using low-melting-temperature agarose.
2. The insert band is excised from the gel (after staining with ethidium bromide and viewing with a midrange 302-nm UV source). Since the DNA band may not occupy the whole depth of the gel, it is worth turning the gel slice on its side and trimming off excess agarose.
3. Melt the gel slice at 68°C for 5 min, and add an equal volume of sterile water. Aliquot up to 13 µL for labeling, and denature at 100°C for 10 min. Add 4 µL of High Prime (Boehringer) and 30 µCi [α-32P]dCTP in a total volume of 20 µL. Incubate at 37°C for at least 1 h.
4. Check the percentage of incorporation by spotting 1 µL from the labeling reaction onto a Whatman GF/B filter circle and counting in a scintillation counter. Incorporated label can be measured by passing 20 mL 5% TCA solution through the filter and counting it again. Incorporation should be at least 50%.
5. Unincorporated label should be separated from the labeled DNA by passage through a NICK column. Rinse the column with 10 mL TNE. Add the labeling reaction mix, made up to 100 µL with TNE. Allow to run through, then rinse the column with 300 µL of TNE, and finally elute with 400 µL of TNE, collecting the eluate in a microcentrifuge tube.
6. Add sonicated salmon sperm DNA, so that the final concentration in the hybridization mix will be 100 µg/mL.
7. Pierce the lid of the tube with a pin, and denature the probe by heating at 100°C for 10 min.
8. Add directly to the hybridization mix, or store on ice until required.
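The incorporation check in step 4 is a ratio of two scintillation counts on the same filter: the count after the TCA wash (acid-insoluble, incorporated label) divided by the count before it (total label). A minimal sketch follows. The function name and the example counts are hypothetical, and background subtraction is omitted for brevity.

```python
def percent_incorporation(total_cpm, tca_insoluble_cpm):
    """Percentage of label retained on the filter after the TCA wash."""
    if total_cpm <= 0:
        raise ValueError("total count must be positive")
    return 100.0 * tca_insoluble_cpm / total_cpm

# Hypothetical counts for the same GF/B filter before and after the TCA wash.
pct = percent_incorporation(total_cpm=200_000, tca_insoluble_cpm=130_000)
labeling_ok = pct >= 50.0  # threshold from step 4
print(f"incorporation = {pct:.0f}%, acceptable = {labeling_ok}")
```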
3.12.2. Hybridization and Autoradiography
1. Prehybridize the filter, sealed in a polyethylene bag, by incubating in a shaking water bath at 68°C for at least 2 h. The bag should contain 5 mL of hybridization mix/100 cm2 of filter. Try to exclude all air bubbles from the bag (see Note 30).
2. Make an incision in the bag, and add the probe. Reseal the bag, and incubate overnight at 68°C in a shaking water bath (see Note 31).
3. On the next day, remove the filter from the bag, rinse in 2X SSC wash, and then incubate in the same for 20 min at 68°C on a rotary platform.
4. Repeat, using if necessary progressively lower concentrations of SSC wash (e.g., 0.5X, 0.1X) until a reasonable signal:background level has been established, as monitored by a Geiger counter.
5. Drain the filter, blot dry with Whatman 3MM paper, enclose in cling wrap, and place in a film cassette fitted with an intensifying screen.
6. Place an autoradiography film that has been pre-exposed by a short flash of light (~1 ms) (from a photographic flash gun fitted with a Kodak Wratten 22A filter) next to the filter, and close the cassette. Store at -70°C for an appropriate length of time, and then develop the film.
Figure 6 shows an autoradiograph depicting the hybridization of a single-copy probe to a Southern blot of the pulsed-field gel shown in Fig. 4.
Fig. 6. Autoradiogram resulting from the hybridization of the gel shown in Fig. 4 with a single-copy probe from chromosome 11. The gel was blotted onto Hybond N (Amersham). The hybridized filter was washed to a stringency of 2X SSC at 68°C and then exposed to Kodak XAR-5 film for 6 d at -70°C. Note the presence of a band at ~170 kb in four of the digests, suggesting cleavage at a "CpG island." Nonisland cutters, such as MluI and RsrII, which are susceptible to variable methylation, have produced very large fragments.
3.13. Hybridization of Radiolabeled Probes to Dried Pulsed-Field Gels
1. Prehybridize the dried gel in 10 mL 5X SSPE (for up to 200 cm2 gel) at 55°C for at least 2 h in a bag (see Notes 32 and 33).
2. Label and purify the probe as in Section 3.12.1., steps 1-5.
3. Add 1 µg of sonicated salmon sperm DNA and 1/10 vol of 1M HCl to the probe, and incubate at 37°C for 40 min (see Note 32).
4. Add 1/2 vol of 1M Tris, pH 7.5, and count a 10-µL aliquot in a scintillation counter.
5. Add 5 µL of sonicated salmon sperm DNA to the probe, and denature by heating at 100°C for 10 min.
6. Meanwhile, drain the prehybridization mix, and refill the bag with 5 mL of 5X SSPE (for up to 200-cm2 filter).
7. Add the probe to the bag at 2 x 10^6 cpm/mL of hybridization mix, and incubate overnight at 55°C.
8. Remove the gel to a dark-colored tray, and wash four times for 5 min at room temperature in 2X SSC.
9. Wash twice for 30 min at 55°C in 2X SSC.
10. Transfer the gel to a polyethylene sheet, cover with cling wrap, and autoradiograph at -70°C (see Note 34).
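The count taken in step 4 determines how much probe to add in step 7. A rough sketch of the calculation is given below, assuming that the counted aliquot is representative of the whole probe solution. The function name and the example count are hypothetical, and the aliquot volume is left as a parameter.

```python
def probe_volume_ul(aliquot_cpm, aliquot_ul, hyb_volume_ml, target_cpm_per_ml=2e6):
    """Volume of probe needed to reach the target activity in the hybridization mix."""
    cpm_per_ul = aliquot_cpm / aliquot_ul
    return target_cpm_per_ml * hyb_volume_ml / cpm_per_ul

# Hypothetical count of 2.5 x 10^6 cpm in a 10-uL aliquot; 5 mL of 5X SSPE (step 6).
volume_ul = probe_volume_ul(aliquot_cpm=2.5e6, aliquot_ul=10.0, hyb_volume_ml=5.0)
print(f"add {volume_ul:.0f} uL of probe")
```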
3.14. Interpretation of Results
1. Align the developed autoradiogram with the filter, and mark on the positions of the wells.
2. Bands on an autoradiogram can be sized by first measuring the distance from the sample well and then determining where the bands would appear on the photo of the ethidium bromide-stained gel, by reference to the ruler running down the side of the gel. If appropriate size markers have been used, then it should be possible to determine the size of the band (see Table 2 for details of yeast chromosome sizes). Faint bands can be enhanced by wiping the autoradiogram with dilute sodium hypochlorite solution to remove the background (16).
3. If the signal on the autoradiogram is a diffuse area rather than a discrete band, this can be caused by overloading, so next time, slice up the plug and load less on the gel.
4. Occasionally, the filter may not represent a true reflection of the position of DNA on the original gel. The exact reason for this discrepancy is not known, but by hybridizing the filter with probes that light up the size markers, accurate sizing can be accomplished without reference to the original gel.
5. Multiple hybridizations to a single filter tend to give more accurate comparisons between probes than the use of filters from different gels. The subtle differences between electrophoretic runs, such as sample loadings or salt concentrations, can affect the relative mobilities of DNA molecules.
6. Ascertain that the signal has been effectively removed from the filter between hybridizations. Autoradiograph for a similar length of time to the previous exposure to make sure that all the label has been removed.
7. Always keep an exact record of the order of probes used in consecutive hybridizations, so that if unexpected bands appear on the autoradiogram, it can be established whether they are remnants from a previous hybridization.
8. When performing double digests, always carry out single digests as well to check that the individual enzymes are active. Group the single and double digests together on the gel to make lane-to-lane comparisons easier. Check that the sizes of the double-digest products add up to the single-digest size. Double digests revealing different hybridization patterns help to confirm that two identically sized bands are in fact different.
9. The appearance of more than one hybridizing band in a track is often indicative of a partial digest (see Fig. 6, &ZI and NarI digests). Deliberately created partial digests can provide useful information in regions of the genome devoid of markers, especially if the probe hybridizes near the end of a chromosome. Several methods can be used to create partial digest conditions in agarose plugs or beads (17-19).
10. The total absence of a signal from a digest, but hybridization to the compression zone, indicates that the fragment has not been resolved. An increase in the pulse time, sometimes accompanied by a reduction in voltage and agarose concentration, may resolve the fragment as the window of separation is shifted to a larger size range. A comparison between Figs. 1 and 2 indicates that a change in pulse time from 25 to 115 s has resulted in the maximum size resolved increasing from 440 to 1500 kb.
11. The appearance of a smear resulting from hybridization to a genomic digest may suggest the presence of repeated sequences in the probe. Suppression hybridization is rarely totally successful with PFGE blots, and attempts should be made to subclone out unique sequences from the probe. A smear can also appear as a result of DNA degradation. For this reason, it is prudent to include a lane of uncut DNA on the gel. Degradation can be caused by many factors, including faulty sample preparation or contaminating nucleases in the running buffer. Always change the buffer between runs, and consider autoclaving the buffer if the problem persists.
12. Some restriction enzymes exhibit site preferences in that some recognition sequences are cut more readily than others. Among the rare cutter enzymes, NarI, NaeI, and SacII exhibit this phenomenon, and in fact, these enzymes require two copies of the recognition sequence before DNA cleavage can occur (20). Two isoschizomers of SacII (SstII and KspI) are commercially available, and there is no evidence that they exhibit site preference.
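The sizing procedure in item 2 can be approximated numerically by interpolating between the two size markers that flank a band. The sketch below uses simple linear interpolation, which is only an approximation because the relationship between size and mobility in PFGE depends on the run conditions. The marker sizes are taken from the S. cerevisiae YP148 column of Table 2, whereas the migration distances and the function name are hypothetical.

```python
def estimate_size_kb(distance_mm, markers):
    """Linearly interpolate a band size between the two flanking markers.

    markers: list of (migration distance in mm, size in kb) pairs.
    """
    points = sorted(markers)
    for (d1, s1), (d2, s2) in zip(points, points[1:]):
        if d1 <= distance_mm <= d2:
            fraction = (distance_mm - d1) / (d2 - d1)
            return s1 + fraction * (s2 - s1)
    raise ValueError("band lies outside the range covered by the markers")

# Sizes from Table 2 (YP148); distances are made-up readings from the ruler.
markers = [(40.0, 440.0), (52.0, 350.0), (63.0, 270.0), (71.0, 210.0)]
band_kb = estimate_size_kb(57.5, markers)
print(f"estimated size: {band_kb:.0f} kb")
```

Bands that fall in the compression zone, or beyond the smallest marker, are deliberately rejected rather than extrapolated.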
4. Notes
1. Mixing equal volumes of microparticles can create additional colors: blue and yellow to give green, red and yellow to produce orange, and blue and red to give mauve.
2. Readers concerned about the toxicity of PMSF can use Pefabloc SC (Boehringer) as a substitute, although it is considerably more expensive.
3. Field inversion gel electrophoresis (FIGE) can be a substitute for CHEF PFGE. Commercially available apparatus includes FIGE Mapper (Bio-Rad) for separations up to 200 kb, and Autobase (Q-Life Systems Inc., Kingston, Ontario). The latter system is supplied with run programs that control voltage and pulse time on ROM cards.
4. S. cerevisiae strains are best preserved by storage in Hogness freezing medium at -70°C.
5. AHC medium selects for YAC arms because the acid hydrolysis of the casamino acids destroys tryptophan and the medium also lacks uracil.
6. This procedure is based on the published method by Carle and Olson (21).
7. S. pombe strains are preserved in 30% glycerol + YPD at -70°C.
8. This protocol is based on the published method of Smith et al. (22).
9. It is important to use ice-cold reagents throughout this protocol. Failure to do so results in extensive degradation of the DNA.
10. Spheroplast (cells without cell walls) formation in S. pombe is more difficult than in S. cerevisiae. Novozyme 234 is therefore added to increase the yield.
11. Spheroplasts must be maintained in an osmotically stabilized medium by including sorbitol during spheroplast formation. Spheroplasts are sensitive to hypotonic conditions and traces of detergent, so cell-wall digestion can be monitored by the addition of 0.1% SDS.
12. λ has 12-base, single-stranded, complementary (cohesive) ends, which join under appropriate conditions and between a range of DNA concentrations (23). λ DNA in solution at 10 mg/mL will spontaneously undergo limited concatemerization. Within agarose plugs, concatemerization is probably a very efficient process, because λ DNA molecules, even at modest concentrations, are maintained in close proximity to each other, thus providing a beneficial environment for intermolecular association. The upper limit of concatemerization is probably influenced by termination resulting from damaged single-stranded ends.
13. Wild-type λ has a monomer size of 48.5 kb, but concatemers can be formed based on other smaller λ genomes, e.g., λvir (42.5 kb) and λgt11 (43.7 kb). Bacteriophages P2 and P4 have 19-base single-strand cohesive ends, can form concatemers, and have monomer sizes of 31.8 and 11.6 kb, respectively.
14. A good yield of phage is only produced if adequate aeration is achieved at Section 3.3., step 6, so vigorous shaking is important. An incubation temperature of 39°C may be beneficial in improving the phage yield.
15. Cells grown in culture should be harvested when just confluent; growth beyond this stage is characterized by the overproduction of mitochondrial DNA and the degradation of genomic DNA.
16. Some cells in culture readily form clumps, and it is essential that these are disrupted before plug formation. If passage through a fine-tip Pastette fails to disperse the cells, then ejection from a syringe fitted with a 19- or even a 21-gage needle should be considered.
17. Each plug contains 1 × 10^6 cells in 100 µL of agarose, and this is equivalent to about 10 µg of DNA. The plug may be cut in half if this quantity of DNA overloads the gel.
18. This protocol allows the rapid production of several hundred plugs at a time. The grid numbering system of microtiter plates allows different plugs to be formed within the same mold and yet preserve their identity. The plugs are round in shape, which is beneficial for maintaining them intact during repeated manipulation.
19. The amount of restriction enzyme that needs to be added can be adjusted from experience. Some enzymes do not remain active throughout the incubation period, in which case further enzyme can be added the next day and incubation continued for a few more hours (24).
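The concatemer size markers described in Notes 12 and 13 form a ladder whose rungs are integer multiples of the monomer size. The following minimal Python sketch tabulates the expected rung sizes; the monomer sizes are those quoted in Note 13, while the function name and the choice of five rungs are illustrative assumptions only.

```python
# Expected band sizes for concatemer ladders used as PFGE size markers.
# Monomer sizes (kb) are taken from Note 13; everything else is illustrative.

MONOMER_KB = {
    "lambda wild type": 48.5,
    "lambda vir": 42.5,
    "lambda gt11": 43.7,
}

def concatemer_ladder(monomer_kb, max_units):
    """Return the sizes (kb) of concatemers built from 1..max_units monomers."""
    return [round(n * monomer_kb, 1) for n in range(1, max_units + 1)]

for name, size in MONOMER_KB.items():
    # Five rungs is an arbitrary cut-off; real ladders extend much further.
    print(f"{name}: {concatemer_ladder(size, 5)}")
```

Comparing the ladders shows why the choice of monomer matters: the smaller genomes give more closely spaced rungs over the same size range.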
20. Some enzymes are unstable at the recommended incubation temperature. This may present a problem for the digestion of DNA in agarose, since by the time the enzyme has diffused into the plug, its activity has diminished. A preincubation period on ice may be beneficial, particularly since some enzymes are stabilized by their substrates.
21. It is important to wash the plug at Section 3.7., step 12. During extended incubation, some DNA seeps out of the plug into the surrounding buffer and would be sheared during subsequent manipulations.
22. The inclusion of tracking dye in the stop mix colors the plugs and makes loading easier. It also allows a visual check on the correct migration of the sample during the early stages of electrophoresis.
23. The choice of buffer can influence the velocity of DNA migration. Using the standard medium EEO agarose, as well as the low EEO agaroses, the migration rate is faster in lower-ionic-strength buffers, such as 1X TAE, compared with the higher-ionic-strength TBE (25-27). The effect is even more pronounced if the TAE concentration is dropped to 0.5X. TBE has a higher buffering capacity than TAE and may be the buffer of choice for prolonged electrophoretic runs, as well as providing better resolution when separating molecules below 2 Mb.
24. Gels may be cast and run on glass plates when using homemade equipment. The gel thickness should be 5 mm. The glass plate does not distort the electric field (14).
25. Gel combs should be of minimal thickness to provide slots tailored to the size of the plugs, and should be positioned to give a gap of 0.5 mm between the bottom of the teeth and the gel support.
26. Pulse time. The most important single variable in determining the size range of
molecules separated is the pulse time, i.e., the interval between the electric field switching from one direction to another. Smaller molecules, which are capable of rapid responses to changes in field direction, are separated preferentially by short pulse times. As the pulse time is increased, progressively larger molecules are separated, but the window of good resolution changes, such that molecules at the smaller end of the range are less well resolved. Pulsed-field gels run at single pulse times exhibit distinct regions of separation. Toward the top of the gel there is a region called the compression zone (CZ), in which all molecules greater than a certain size comigrate. Below this is a region of maximum resolution in which bands are well separated, and below this is a region characterized by poorer resolution, in which mobility is linear relative to size (28). Thus, maximum resolution is achieved by using the minimum pulse time capable of resolving the largest molecule of interest. Multiple consecutive pulse time regimes can be used to achieve good separation over a wide size range on a single gel. An increase in linearity of size relative to mobility can be achieved using a pulse time ramp, and this approach is particularly effective if nonlinear time ramps are employed (14).
27. Field strengths of up to 10 V/cm (the distance being measured between opposite electrodes) can be used to separate molecules up to about 1.5 Mb. The higher the voltage, the faster the run time, although lower voltages tend to give better resolution, but over a narrower size range. Separation of molecules greater than about 1.5 Mb can only be achieved at reduced voltages, e.g., 3 V/cm. For separation over a given size range, or window of resolution, W, the product of the voltage gradient, V (expressed as V/cm), and pulse time, P, is roughly constant or, more accurately, according to Gunderson and Chu (29):
$$W = V^{1.4} \times P \qquad (1)$$
This means in practice that, for separation in a given size range, an increase in voltage must be accompanied by an appropriate decrease in pulse time and vice versa. By solving this equation for P, it is possible to specify a pulse time for carrying out separation in the same size range, using different apparatus, as long as the voltage and distance, D, between opposite electrodes are known. Typical values for D are 33.5 cm for Bio-Rad CHEFs and 28 cm for the Pulsaphor hexagon system. When the experimenter wishes to separate molecules up to a certain size limit and has no clues regarding the most appropriate conditions under which to run the gel, mathematical expressions are now available that allow the various parameters to be established. An example of such a relationship (according to Smith [30]) is:
P = {Rl[( V x A4 s, x 5.251) * 25 (2)
where P = pulse time, R = maximum size of molecule to be resolved, A = % agarose concentration, and V = V/cm.
28. It is important to change the denaturant halfway through step 3, Section 3.10., since the HCl present in the gel will reduce the pH of the solution.
29. Alkali transfer is an alternative method worth considering (31).
30. Hybridization ovens, using roller bottles, can be used as an alternative to the polyethylene bag method. They are intrinsically safer and use smaller volumes of reagents. Care must be taken, however, in opening the bottles and in checking the temperature inside the oven with a thermometer.
31. Filters from pulsed-field gels carrying size markers can be hybridized with radiolabeled markers at the same time as the probe or separately if crosshybridization is a problem. λ DNA can be labeled and hybridized to concatemers. S. cerevisiae chromosomes can be hybridized to the Ty-1 repetitive element, which is present to various extents in all yeast chromosomes. The 90- and 1030-kb chromosomes of YP148 also crosshybridize with pBR322 sequences. S. pombe chromosomes can be hybridized to a centromeric probe, such as pSS166, which is derived from the dgIIa region (32). Minichromosomes, carried by some S. pombe strains, tend to crosshybridize with Ty-1 and pBR322 sequences.
32. This protocol is based on the published method of Stoye et al. (33).
33. The dried gel can be separated from the filter paper, on to which it was dried down, by briefly immersing in water and peeling the gel off the paper. Dried gels are reasonably strong and can be maneuvered easily with care. Gels can also be hybridized in bottles; to remove the gel, reverse the orientation of the bottle, and rotate in the oven briefly, so that the gel rolls up.
34. Probes can be removed from dried gels by gently agitating in 0.5 M NaOH for 30 min followed by 30 min in neutralizer. Several reprobings are possible without the gel disintegrating.
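Note 27 states that Equation 1 can be solved for P to carry a separation over to different apparatus. Taking Equation 1 as W = V^1.4 × P, keeping W constant gives P_new = P_old × (V_old / V_new)^1.4, with each V obtained as the applied voltage divided by the electrode distance D. The sketch below illustrates this; the electrode distances are those given in Note 27, whereas the applied voltages, the 60-s starting pulse time, and the function names are hypothetical values chosen only for the example.

```python
# Rescaling a pulse time between two pulsed-field rigs using W = V**1.4 * P.
# Electrode distances come from Note 27; voltages and pulse time are hypothetical.

def field_strength(volts, electrode_distance_cm):
    """Voltage gradient in V/cm between opposite electrodes."""
    return volts / electrode_distance_cm

def equivalent_pulse_time(pulse_s, v_old, v_new, exponent=1.4):
    """Pulse time that leaves W = V**exponent * P unchanged when V changes."""
    return pulse_s * (v_old / v_new) ** exponent

v_chef = field_strength(200.0, 33.5)   # Bio-Rad CHEF, D = 33.5 cm
v_hex = field_strength(170.0, 28.0)    # Pulsaphor hexagon, D = 28 cm
p_hex = equivalent_pulse_time(60.0, v_chef, v_hex)

print(f"CHEF:    {v_chef:.2f} V/cm with a 60.0 s pulse")
print(f"Hexagon: {v_hex:.2f} V/cm needs about {p_hex:.1f} s")
```

Because the relationship is described in the note as only roughly constant, a value obtained this way is best treated as a starting point for a trial run rather than a final setting.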
Acknowledgment
I thank Sandy Bruce for preparing the figures used in this chapter.
References
1. Poustka, A., Pohl, T., Barlow, D. P., Zehetner, G., Craig, A., Michiels, F., Ehrich, E., Frischauf, A.-M., and Lehrach, H. (1986) Molecular approaches to mammalian genetics. Cold Spring Harbor Symp. Quant. Biol. 51, 131-139.
2. Burke, D. T., Carle, G. F., and Olson, M. V. (1987) Cloning of large segments of exogenous DNA into yeast by means of artificial chromosome vectors. Science 236, 806-812.
3. Shibasaki, Y., Maule, J. C., Devon, R. S., Slorach, E. M., Gosden, J. R., Porteous, D. J., and Brookes, A. J. (1995) Catch-linker+PCR labeling: a simple method to generate fluorescence in situ hybridization probes from yeast artificial chromosomes. PCR Methods Appl. 4, 209-211.
4. Monaco, A. P. and Larin, Z. (1994) YACs, BACs, PACs and MACs: artificial chromosomes as research tools. Trends Biotechnol. 12, 280-286.
5. Bickmore, W. A. and Bird, A. P. (1992) Use of restriction enzymes to detect and isolate genes from mammalian cells. Methods Enzymol. 216, 224-245.
6. Antequera, F., Boyes, J., and Bird, A. (1990) High levels of de novo methylation and altered chromatin structure at CpG islands. Cell 62, 503-514.
7. Maule, J. C. (1995) Colored microparticles for clear visualization of agarose beads and plugs. Trends Genet. 11, 127.
8. Jones, C. P., Janson, M., and Nordenskjold, M. (1989) Separation of yeast chromosomes in the megabase range suitable as size markers for pulsed-field gel electrophoresis. Technique 1, 90-95.
9. Wang, Y.-K. and Schwartz, D. C. (1993) Chopped inserts: a convenient alternative to agarose/DNA inserts or beads. Nucleic Acids Res. 21, 2528.
10. Orbach, M. J., Vollrath, D., Davis, R. W., and Yanofsky, C. (1988) An electrophoretic karyotype of Neurospora crassa. Mol. Cell. Biol. 8, 1469-1473.
11. Chu, G., Vollrath, D., and Davis, R. W. (1986) Separation of large DNA molecules by contour-clamped homogeneous electric fields. Science 234, 1582-1585.
12. Clark, S. M., Lai, E., Birren, B. W., and Hood, L. (1988) A novel instrument for separating large DNA molecules with pulsed homogeneous electric fields. Science 241, 1203-1205.
13. Arber, W., Enquist, L., Hohn, B., Murray, N. E., and Murray, K. (1983) Experimental methods for use with lambda, in Lambda II (Hendrix, R. W., ed.), Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, pp. 433-466.
14. Maule, J. C. and Green, D. K. (1990) Semiconductor-controlled contour-clamped homogeneous electric field apparatus. Anal. Biochem. 191, 390-395.
15. Porteous, D. J. and Maule, J. C. (1990) Casting multiple aliquots of agarose embedded cells for PFGE analysis. Trends Genet. 6, 346.
16. Guido, E. C. and Abhay, K. (1994) Simple method to reduce background on autoradiographs. BioTechniques 17, 294.
17. Albertsen, H. M., Le Paslier, D., Abderrahim, H., Dausset, J., Cann, H., and Cohen, D. (1989) Improved control of partial DNA restriction enzyme digest in agarose using limiting concentrations of Mg++. Nucleic Acids Res. 17, 808.
18. Barlow, D. P. and Lehrach, H. (1990) Partial NotI digests, generated by low enzyme concentration or the presence of ethidium bromide, can be used to extend the range of pulsed-field gel mapping. Technique 2, 79-87.
19. Wilson, W. W. and Hoffman, R. M. (1990) Methylation of intact chromosomes by bacterial methylases in agarose plugs suitable for pulsed field electrophoresis. Anal. Biochem. 191, 370-375.
20. Topal, M. D., Thresher, R. J., Conrad, M., and Griffith, J. (1991) NaeI endonuclease binding to pBR322 DNA induces looping. Biochemistry 30, 2006-2010.
21. Carle, G. F. and Olson, M. V. (1985) An electrophoretic karyotype for yeast. Proc. Natl. Acad. Sci. USA 82, 3756-3760.
22. Smith, C. L., Klco, S. R., and Cantor, C. R. (1988) Pulsed-field gel electrophoresis and the technology of large DNA molecules, in Genome Analysis (Davies, K. E., ed.), IRL, Oxford, UK, pp. 41-71.
23. Mathew, M. K., Smith, C. L., and Cantor, C. R. (1988) High-resolution separation and accurate size determination in pulsed-field gel electrophoresis: DNA size standards and the effect of agarose and temperature. Biochemistry 27, 9204-9210.
24. New England Biolabs Catalog (1995) Beverly, MA, p. 224.
25. White, H. W. (1992) Rapid separation of DNA molecules by agarose gel electrophoresis: use of a new agarose matrix and a survey of running buffer effects. BioTechniques 12, 574-579.
26. Birren, B. W., Lai, E., Clark, S. M., Hood, L., and Simon, M. I. (1988) Optimized conditions for pulsed field gel electrophoretic separations of DNA. Nucleic Acids Res. 16, 7563-7582.
27. Birren, B. W., Hood, L., and Lai, E. (1989) Pulsed field gel electrophoresis studies of DNA migration made with the PACE electrophoresis system. Electrophoresis 10, 302-309.
28. Vollrath, D. and Davis, R. W. (1987) Resolution of DNA molecules greater than 5 megabases by contour-clamped homogeneous electric fields. Nucleic Acids Res. 15, 7865-7876.
29. Gunderson, K. and Chu, G. (1991) Pulsed-field electrophoresis of megabase-sized DNA. Mol. Cell. Biol. 11, 3348-3354.
30. Smith, D. R. (1990) Genomic long-range restriction mapping. Methods 1, 195-203.
31. Reed, K. C. and Mann, D. A. (1985) Rapid transfer of DNA from agarose gels to nylon membranes. Nucleic Acids Res. 13, 7207-7221.
32. Chikashige, Y., Kinoshita, N., Nakaseko, Y., Matsumoto, T., Murakami, S., Niwa, O., and Yanagida, M. (1989) Composite motifs and repeat symmetry in S. pombe centromeres. Cell 57, 739-751.
33. Stoye, J. P., Frankel, W. N., and Coffin, J. M. (1991) DNA hybridization in dried gels with fragmented probes: an improvement over blotting techniques. Technique 3, 123-128.
Construction and Use of YAC Contigs in Disease Regions
Stephen Horrigan and Carol Westbrook
1. Introduction
The identification of genes by positional cloning has revolutionized the molecular analysis of disease. Disease-causing genes can now be isolated with no knowledge of the biological function of the molecule involved, relying only on genetic inheritance or chromosomal perturbation as keys to isolating the gene. The key step in the positional cloning of a disease-causing gene is the conversion of the mapping information into physically cloned DNA. Advances in genomic technology, particularly in the construction, use, and accessibility of large-insert libraries, have greatly facilitated this process.
For most positional cloning strategies, the yeast artificial chromosome (YAC) cloning system offers many advantages for the first step in isolation of physically cloned DNA (2). YACs allow the isolation and manipulation of large segments, up to 2 Mb, of genomic DNA, which is 20-50 times the size of any other vector. Several YAC libraries have been characterized, ordered, and shared among researchers such that clone identification and retrieval are easy. This allows researchers without specialized technical knowledge to obtain cloned DNA quickly and easily for their genomic region of interest. However, other features of YACs preclude their use as the primary source of DNA for biological experimentation. Among these limitations are their low copy number, slow (and often unreliable) growth, frequent chimerism, and, most problematic, the instability of and inability to clone certain genomic fragments. Nonetheless, because YACs contain long stretches of relatively faithful DNA sequence, they are the logical first step in converting mapping information into cloned DNA and eventually into biologically useful resources.
There are many YAC resources now available to the scientific community. Many of the first successful positional cloning strategies relied on small libraries of limited availability, containing YACs with insert sizes in the 100-500 kb range. Newer generations of YAC libraries contain very large inserts, averaging 1-2 Mb. The most widely available and best characterized human genomic library is the CEPH megaYAC library (2). In addition to its large insert size, this library is valuable because the clones are available by direct request to CEPH or any of a number of genomic centers that have copies of the library. A uniform numbering system for each YAC in this library facilitates exchange of data, crosscorrelation between investigators, and compilation of screening data into online databases (3). For this reason, we concentrate on the use of the CEPH YAC library, although the same techniques apply to the analysis of YACs from other sources and organisms (see Note 1).
Assembling a YAC contig presupposes that a map is available for the region of interest, consisting of markers, preferably formatted as sequence-tagged sites (STSs), and genetic or physical distances. The use of STS-based maps allows the easy integration of the markers used for mapping with the physically isolated clones (4). Although a YAC contig can be created by walking out from a single marker, an ordered map greatly facilitates YAC identification and contig construction. If no ordered map is available in the initial steps of genomic analysis, it can often be developed during contig construction by integration of newly mapped markers.
2. Materials
2.1. Computer Resources
A computer with an Internet connection running World Wide Web software, such as Netscape or Mosaic (see Note 2), is necessary.
2.2. Growth and Isolation of Yeast
1. AHC medium: 1.7 g yeast nitrogen base without amino acids, 10 g casamino acids, 5 g ammonium sulfate, 50 mg adenine hemisulfate. Dissolve to 1 L, adjust pH to 5.8, and autoclave. Add 50 mL sterile 40% glucose/L before use.
2. SCE: 1 M sorbitol, 0.1 M sodium citrate, 0.05 M EDTA, pH 7.0.
3. Yeast lytic enzyme (ICN Biochemicals Inc., Costa Mesa, CA).
4. β-Mercaptoethanol: Toxic; handle in fume hood.
5. TES: 10 mM Tris-HCl, pH 9.0, 1% sodium N-lauroylsarcosine, 0.4 M EDTA.
6. RNase A: 10 mg/mL stock solution, boiled for 10 min. Store at -20°C.
7. 3 M Potassium acetate: Dissolve 29.4 g potassium acetate in 80 mL dH2O, add acetic acid to pH 5.5, bring to 100 mL with H2O.
8. Low-melting point agarose (Seakem, FMC Corporation, Rockland, ME): 1% in 100 mM EDTA.
9. Proteinase K: 20 mg/mL in 10% glycerol. Store at -20°C.
10. 5 M NaCl.
11. 0.5 M EDTA, pH 8.0.
12. Ethidium bromide (EtBr): 10 mg/mL stock solution. Toxic; handle with gloves.
13. Isopropanol.
14. 5X TBE: 0.45 M Tris-borate, 10 mM EDTA.
15. TE buffer: 10 mM Tris-HCl, pH 8.0, 1 mM EDTA.
16. Agarose gel plug former (Bio-Rad, Hercules, CA).
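The weights given for molar stock solutions in the list above follow from mass = molarity × volume × molar mass. As a minimal check, the Python sketch below reproduces the 29.4 g quoted for 100 mL of 3 M potassium acetate; the molar masses are standard values supplied here as assumptions (they are not stated in the list), and the function name is illustrative.

```python
# Mass of solute needed for a molar stock solution.
# Molar masses are standard textbook values, not taken from the materials list.

MOLAR_MASS_G_PER_MOL = {
    "potassium acetate": 98.14,   # CH3COOK
    "sodium chloride": 58.44,     # NaCl
}

def grams_needed(molarity, volume_l, molar_mass):
    """Grams of solute for `volume_l` liters at `molarity` mol/L."""
    return molarity * volume_l * molar_mass

# 100 mL of 3 M potassium acetate (item 7)
print(round(grams_needed(3.0, 0.100, MOLAR_MASS_G_PER_MOL["potassium acetate"]), 1))
# 100 mL of 5 M NaCl (item 10); the 100 mL volume is an arbitrary example
print(round(grams_needed(5.0, 0.100, MOLAR_MASS_G_PER_MOL["sodium chloride"]), 1))
```

The same function can be reused for any other stock, provided the molar mass of the form actually weighed out (anhydrous or hydrated salt) is used.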
2.3. Long PCR and Clone Selection
1. Oligonucleotide primers: Available from several commercial sources, purified by gel filtration.
2. rTth XL polymerase and buffer (Perkin Elmer, Foster City, CA).
3. dNTPs: Combine equal molar amounts of each dNTP to make a stock solution of 2 mM each dNTP.
4. TNE buffer: 100 mM Tris-HCl, pH 7.5, 2 M NaCl, 1 mM EDTA.
5. MPG streptavidin (CPG Corporation, Lincoln Park, NJ).
6. 1 N NaOH: Stock solution.
7. 4 N HCl: Stock solution.
8. 1 M Tris-HCl, pH 7.5 and 8.0.
9. DNA sequencing kit: Sequenase 2.0 (Amersham, Arlington Heights, IL) or other comparable system.
10. 10 mM ATP.
11. T4 polynucleotide kinase (New England Biolabs, Beverly, MA).
12. 1 M Na2HPO4, pH 7.2: Dissolve 134 g Na2HPO4·7H2O in 800 mL H2O, adjust pH to 7.2 with phosphoric acid, and bring to 1000 mL.
13. 20% SDS: Caution, irritant; dissolve in hood.
14. Qiax DNA purification spin columns (Qiagen, Chatsworth, CA).
15. Decaprime labeling kit (Ambion, Austin, TX).
16. Cot-1 DNA (Gibco-BRL, Gaithersburg, MD).
17. Sonicated human DNA: Human placental DNA at 5 mg/mL sonicated to an average size of 200-500 bp.
3. Methods
3.1. Identification of YACs for the Disease Interval
Once markers are localized to the region containing the disease gene, the initial identification of YACs is done by querying several databases that contain data on the CEPH YAC resource. These databases provide information on the unique YAC identifier number, size of the YAC, other markers that have been assigned to the YAC, information on YAC chimerism, and potential overlapping YACs. This information can then be used to query other databases and after several iterations, a number of YAC candidates, potential overlaps, and STS orders can be identified. For most regions, large contiguous overlapping
stretches (contigs) of YACs covering tens of megabases may be assembled in this way. This first step in contig assembly should be regarded as only tentative, since some of the database information is misleading, particularly on YAC overlaps and chimerism, and requires confirmation by direct analysis of the purified YACs. Querying the databases is most productively done using the World Wide Web, since it is interactive and gives immediate results, but it can also be done by other means (see Note 3). The major database sources for the CEPH megaYAC library are (see Note 4):
1. CEPH/Genethon physical map (3) (http://www.genethon.fr): This is a comprehensive collection of data that allows the identification of YAC clones by STS content, overlapping clones by hybridization and fingerprinting, as well as information on size and chimerism. The database can be queried by STS name or YAC address and returns sets of overlapping YACs and STSs that map to the region. Graphical representations of the dataset are available as PostScript files (ftp://ceph-genethon-map.cephb.fr/pub/ceph-genethon-map/MAPS_GENOME-DIRECTORY/). These maps are anchored by genetic position using Genethon genetically mapped STSs, and therefore can only be queried if a genetic position or linkage to a Genethon marker is available. Also available on this server is GenomeView, an experimental system for the integration of mapping data that can be queried by YAC address or STS name. This database integrates the physical mapping data of CEPH-Genethon, the Whitehead Institute/MIT Center for Genome Research, and Genethon's chromosome 21 project with three genetic maps: the complete Genethon genetic map, part of the CHLC genetic map, and part of the Multimap genetic map. Although GenomeView is an extremely useful interface, much of its integrated data has not been updated, and therefore the original sources should always be checked for updated information.
2. Whitehead Institute for Biomedical Research/MIT Center for Genome Research (http://www-genome.wi.mit.edu): This is the other major resource available for identification of CEPH YAC clones. This database consists of assignments of over 10,000 (as of May 1995) STS-based markers mapped to the CEPH YAC library. YACs found to overlap by STS content are assembled into contigs, and the predicted order of markers is given. This provides not only identification of additional YACs with known STSs, but also identification of additional STSs (some polymorphic or expressed) found on YACs that add to the mapping effort.
3. Genome Data Base (GDB) (http://gdbwww.gdb.org): GDB contains entries for both YACs associated with loci and loci associated with YACs. YACs characterized by individual researchers can be found here along with assigned loci. Much of the CEPH/Genethon and Whitehead/MIT mapping data also reside here, but in a less interpretable form than from the original source.
If the map of the disease interval contains any Genethon markers, the initial query should be to the CEPH/Genethon physical map. The genetic location should be determined and the appropriate contig map accessed. This map provides YAC addresses, additional markers, and overlapping YACs. The addresses of YACs from the interval are then used to query the Whitehead/MIT database. This database will provide additional STSs mapped to the YACs from the interval, and contigs of YACs assembled from these data. If the map of the disease region cannot be anchored to the Genethon genetic map, then the Whitehead/MIT database or GDB can be queried with the disease markers to determine if they have been assigned to YACs. The identified YAC addresses are then used to query the CEPH/Genethon database for further map information. If the markers used to map the disease region cannot be integrated into any of the available datasets, then the CEPH YAC library should be screened directly (see Section 3.2.3.). YAC addresses identified by direct screening can then be used to query the available databases and to build the contig.
3.2. Assembling the YAC Contig
The object of contig assembly is to establish the order of the YACs, to establish connections between markers, and to identify gaps that will need to be filled in by further screening. This is most easily done by "STS content" mapping, that is, typing markers by PCR. It is best to establish the preliminary framework order of markers by independent means (i.e., linkage, breakpoint, radiation hybrid, FISH). The framework order of markers is used to anchor the contig, and a preliminary order is established by minimizing the number of gaps in the STS map and by ordering YACs using overlap data from the CEPH/Genethon database (see Note 5). When a small to moderate number of YACs are being assembled, we find it most convenient to use a spreadsheet database. For large-scale contig construction, specialized databases are useful (see Note 6; 5). At this point, the YACs that have been identified by database analysis must be obtained and inoculated, and DNA prepared for subsequent analysis of both STS content and size.
YAC clones can be obtained from a variety of sources worldwide (see Note 7). It is best to be inclusive at this stage, obtaining as many YACs as possible, since a significant percentage of YAC clones end up being unacceptable and must be discarded (see Note 8). In addition, the more YACs analyzed that carry the same DNA region, the more confidence there can be that the genomic region is faithfully represented.
3.2.1. Preparation of YAC DNA
1. Pick a single colony into 10 mL AHC medium, and grow at 30°C for 48 h. OD600 should be approx 3.0 (see Note 9).
2. Pellet 10 mL yeast culture at 10,000g for 5 min, remove supernatant, resuspend in 5 mL of 100 mM EDTA, and pellet for 5 min.
3. Resuspend in 2 mL SCE, and let sit at room temperature for 10 min.
4. Transfer 1 mL yeast cells in SCE to each of two microfuge tubes and spin for 5 min.
5. Resuspend in 100 µL SCE, add 1 µL β-mercaptoethanol, and 5 mL of 2 mg/mL yeast lytic enzyme. Incubate at 30°C for 1 h. One tube is now processed for DNA in liquid for PCR and cloning, and the other is used to prepare agarose plugs to examine the size and stability of the YAC.
3.2.1.1. PREPARATION OF DNA IN LIQUID
1. Pellet spheroplasts for 5 min, resuspend in 500 µL TES, incubate at 65°C for 30 min, and cool to room temperature.
2. Add 150 µL 5 M potassium acetate. Let sit on ice for 15 min.
3. Spin for 15 min at 4°C.
4. Remove supernatant to a new tube and add 10 µL 10 mg/mL RNase A. Incubate at 37°C for 30 min, add 400 µL isopropanol, and invert to mix. Let sit at room temperature for 10 min. Spin for 15 min in a microfuge.
5. Wash pellet with 70% EtOH, air-dry, and resuspend in 100 µL TE.
3.2.1.2. PREPARATION OF DNA IN AGAROSE PLUGS
1. Prepare 1% low-melt agarose in 100 mM EDTA, cooled to 45°C.
2. Add 200 µL low-melting point agarose to spheroplasts, mix completely by inversion, add to block former, and place at 4°C for 30 min.
3. Remove plugs from mold, and incubate overnight at 50°C in 5 mL lysis buffer with 20 µg/mL proteinase K.
4. Wash plugs in an excess of TE buffer for 1 h, changing the solution every 20 min.
5. Add RNase A to 10 mg/mL, and incubate at 37°C for 1 h.
6. Wash in TE buffer for 30 min. Store at 4°C.
7. Separate yeast chromosomes by pulsed-field gel electrophoresis, and analyze YAC by EtBr staining.
8. The YAC DNA should then be analyzed by Southern blotting using total human genomic DNA as a probe to identify human specific bands positively.
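Several steps above call for diluting a concentrated stock into a working solution, e.g., proteinase K at 20 µg/mL in 5 mL of lysis buffer from the 20 mg/mL stock listed in Section 2.2. The volume of stock follows from C1 × V1 = C2 × V2. The Python sketch below is only an illustration of that arithmetic; the function name and unit conventions are assumptions made for the example.

```python
# Volume of stock needed for a working dilution, from C1 * V1 = C2 * V2.
# Function name and unit handling are illustrative assumptions.

def stock_volume_ul(final_ug_per_ml, final_volume_ml, stock_mg_per_ml):
    """Microliters of stock required to reach the final concentration."""
    stock_ug_per_ml = stock_mg_per_ml * 1000.0           # mg/mL -> ug/mL
    volume_ml = final_ug_per_ml * final_volume_ml / stock_ug_per_ml
    return volume_ml * 1000.0                            # mL -> uL

# Proteinase K: 20 ug/mL in 5 mL of lysis buffer from a 20 mg/mL stock
print(f"{stock_volume_ul(20.0, 5.0, 20.0):.1f} uL of proteinase K stock")
```

The calculation ignores the small volume contributed by the stock itself, which is negligible at a 1000-fold dilution.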
3.2.2. Closing Gaps in the Contig
Even with the above resources, an STS-based YAC contig across the disease interval may not be generated. In most cases, there will be several YACs suspected to overlap by hybridization and fingerprinting data. To confirm the suspected overlaps in these YACs, two complementary methods are used: fingerprinting and end-clone rescue. Both methods rely on the use of interspersed repeated sequence PCR (IRS-PCR; 6). This is the simplest and most efficient way to isolate human specific sequences free from yeast sequences. This technique relies on the use of PCR primers to repetitive sequences found only in the human genome. PCR using primers located near the end of the repetitive sequence and initiating replication outward from the repeat will generate products containing unique sequences lying between repeats. The use of a number of different primers for each type of repeat will allow the generation of a high density of products from almost all genomic regions (Table 1). The addition of PCR protocols that allow the generation of very long PCR products (7) further increases the number of unique products obtained.

Table 1
IRS Primers for PCR Amplification of Human DNA
Primer name    Primer sequence, 5'-3'
154            TGC ACT CCA GCC TGG GCA ACA
450            AAA GTG CTG GGA TTA CAG G
ALU3'          GAT CGC GCC ACT GCA CTC C
ALU5'          GGA TTA CAG GCG TGA GCC AC
LINE-A         CAC AGG AAG CGG AAC ATC ACA
LINE-B         GGG GAG CGA TAG CAT TAG GAG

3.2.2.1. GENERATION OF IRS-PCR FINGERPRINTS
The initial determination of YAC overlap is done by IRS-PCR fingerprinting. This is a rapid and reliable method of generating fingerprints on YACs. These fingerprints can then be compared, and in most cases, overlaps can be easily determined. IRS primers are used alone or in combination to generate complex sets of fingerprints that should allow the determination of overlap in almost all cases (see Note 10).
1. Set up a 50-µL PCR reaction containing the following: 5-10 ng YAC DNA, 10X XL buffer II, 1.5 mM Mg(OAc)2, 200 µM each dNTP, 200 ng IRS primer (total for all primers), and 2.5 U rTth XL polymerase.
2. Cycle according to the following parameters (see Note 11): 94°C for 3 min, then 25 cycles of 94°C for 40 s, 60°C for 1 min, and 72°C for 5 min, followed by 72°C for 10 min.
3. Products are then separated on a large-format 1.2-1.5% agarose gel, stained with EtBr, and photographed.
3.2.2.2. ISOLATION OF YAC END FRAGMENTS
If overlap of YACs is not apparent from the fingerprints, then it is necessary to isolate the ends of the YAC to check for overlap. YAC ends are most easily isolated by IRS-vector PCR (8). This technique involves the amplification of sequences between a primer derived from the vector end and an IRS primer present in the human insert. The use of multiple IRS primers and long PCR increases the probability of end rescue. The vector primer is biotinylated, allowing the direct isolation of the end-rescued fragment and the simple generation of single-stranded DNA for efficient sequencing (see Note 13).
1. Set up separate PCR reactions pairing b-YACL with each IRS primer and b-YACR with each IRS primer.
2. PCR is done as in Section 3.2.2.1., except that 50 ng of a single IRS primer and 200 ng of a single biotinylated vector primer (see Note 14) are added.
3. Ten microliters of each PCR reaction are run on a 1.2% agarose gel to confirm that the PCR reaction worked.
4. The remaining PCR reactions containing b-YACL are combined into one tube and the b-YACR reactions combined into another.
5. Add 20 µL (200 µg) MPG-streptavidin (prewashed two times in TNE buffer) to each tube. Mix for 20 min at room temperature.
6. Magnetically separate, wash four times in TNE buffer, and one time in TE buffer. Remove TE buffer, spin in a microfuge for 5 s, and carefully remove all liquid from the sample.
7. Resuspend beads in 12 µL of 0.1 N NaOH, and incubate for 10 min at room temperature.
8. Remove supernatant, and add to a tube containing 4 µL of 0.4 N HCl and 2 µL 1 M Tris-HCl, pH 7.5, 1 mM EDTA.
9. Products are sequenced using Sequenase version 2.0 according to the manufacturer's directions using 6 µL DNA and 0.5 pmol nested vector primer YACL2 or YACR2 (see Note 15).
10. PCR primers are then designed from the sequence generated, and the STS content of adjacent YACs is tested for overlap.
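The fingerprint comparison of Section 3.2.2.1. can be sketched in a few lines of Python. This is only an illustration: the band sizes, the YAC names, the 5% size tolerance, and the threshold of three shared bands are all assumptions made for the example and are not values given in the protocol. Any pair flagged this way remains a candidate overlap to be confirmed by end rescue or hybridization.

```python
from itertools import combinations


def shared_bands(a, b, tol=0.05):
    """Count bands in fingerprint `a` that have a size match in `b`.

    Two bands match when their sizes (bp) differ by no more than `tol`
    times the larger band. Each band in `b` is used at most once.
    The tolerance is an assumed value, not one from the protocol.
    """
    remaining = sorted(b)
    matches = 0
    for size in sorted(a):
        for i, other in enumerate(remaining):
            if abs(size - other) <= tol * max(size, other):
                matches += 1
                del remaining[i]
                break
    return matches


def candidate_overlaps(fingerprints, min_shared=3, tol=0.05):
    """Return (yac1, yac2, shared) for pairs sharing >= min_shared bands."""
    pairs = []
    for (n1, f1), (n2, f2) in combinations(sorted(fingerprints.items()), 2):
        k = shared_bands(f1, f2, tol)
        if k >= min_shared:
            pairs.append((n1, n2, k))
    return pairs


# Hypothetical IRS-PCR band sizes (bp) read from a gel photograph.
fingerprints = {
    "yA": [450, 800, 1200, 2100, 3500],
    "yB": [460, 1210, 2080, 5000],
    "yC": [300, 950, 6000],
}
overlaps = candidate_overlaps(fingerprints)
print(overlaps)
```

With these made-up sizes only yA and yB share enough bands to be flagged. The matching is greedy, so closely spaced bands may warrant a stricter tolerance.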
3.2.3. Obtaining Additional YAC Clones to Complete the Contig
For genomic regions that are unstable or contain low marker density, it may not be possible to identify overlapping clones in the available data sets. In many cases, YACs that cover these regions do exist in the CEPH or other YAC libraries, and can be identified by screening the YAC library directly with sequences generated from YAC ends. PCR-based screening of the YAC library can be done through several companies or by contacting CEPH directly (see Note 16).
3.2.4. Completion of the Contig with Other Resources
If a genomic region is not represented in any of the available YAC libraries, then other cloning vehicles (BACs, P1s, cosmids) must be used to complete the region by walking from the ends of the flanking YAC clones. Many resources are now available for isolating specific clones from the smaller insert libraries (see Note 17).
3.3. Applications of YAC Contigs
3.3.1. Generation of New Markers and Determination of Marker Order
YACs provide an ideal source of DNA for isolation of additional markers closely linked to the locus of interest. One can easily generate either random
STSs (9) or, more importantly, polymorphic markers (10) that can be used for further mapping of the disease interval. Furthermore, a high-density YAC contig across the disease region allows the assignment of relative order of the new markers according to YAC content. The assignment can be done either by PCR or hybridization. This order of markers can then be used to map the disease region further by genetic linkage or deletion mapping (see Note 18).
3.3.1.1. GENERATION OF RANDOM STSs
1. PCR reactions are set up as in Section 3.2.2.1., except that only a single IRS primer is included.
2. Following amplification, add to the PCR reaction ATP to 100 µM and 10 U T4 polynucleotide kinase. Incubate at 37°C for 30 min. PCR products are purified over Qiax columns, cloned into blunt-end cut, dephosphorylated vector, and transformed into competent bacteria.
3. Colonies containing plasmids with inserts are isolated and sequenced, and PCR primers are designed to generate a product of 100-300 bp.
3.3.1.2. GENERATION OF POLYMORPHIC STSs
Polymorphic CA repeats are found frequently enough in the genome that they can be rescued in most cases by IRS-PCR. They can be easily separated from nonrepeat-containing sequences by hybrid selection (11,12; see Note 19).
1. Generate IRS-PCR products from YAC DNA as in Section 3.2.2.1., except use all combinations of primers (see Note 20).
2. Mix 100-500 ng of IRS-PCR products with: 2 µg (~100 pmol) Biotin-ATA GAATAT(CA)**, 50 µL 1 M Na2HPO4, 2.5 µL 20% SDS, and H2O to 100 µL.
3. Overlay with mineral oil, heat at 100°C for 10 min, and incubate at 50°C for 1 h.
4. Add 1.0 mL 100 mM Tris, pH 7.5, 2 M NaCl, 1 mM EDTA to the annealed DNA, and 50 µL (500 µg) MPG-streptavidin (prewashed).
5. Mix for 30 min at room temperature.
6. Separate magnetically, remove supernatant, and wash three times with 1.0 mL TNE buffer.
7. Wash four times with 100 mM Tris, pH 7.5, 100 mM NaCl, 1 mM EDTA at 65°C for 15 min each.
8. Resuspend beads in 12 µL of 0.1 N NaOH, and incubate for 10 min at room temperature.
9. Remove supernatant, and add to a tube containing 4 µL of 0.4 N HCl and 2 µL 1 M Tris-HCl, pH 7.5, 1 mM EDTA.
10. Use 5 µL selected product in a 50-µL PCR reaction with the original IRS primers.
11. Clone products as in Section 3.3.1.1. and sequence.
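Once the selected clones have been sequenced, the CA tracts still have to be located so that flanking primers can be designed. A minimal Python sketch of that scan is given here. It is an illustration only: the minimum of 10 repeat units and the test sequence are assumptions for the example, and the protocol itself does not prescribe any software. The scan also looks for (TG)n, because a CA repeat reads as TG when the clone is sequenced from the opposite strand.

```python
import re


def find_ca_repeats(seq, min_units=10):
    """Locate (CA)n and (TG)n tracts with n >= min_units.

    Returns a list of (start, end, units) with 0-based, end-exclusive
    coordinates. `min_units` is an assumed cutoff, not a protocol value.
    """
    seq = seq.upper()
    pattern = re.compile(
        r"(?:CA){%d,}|(?:TG){%d,}" % (min_units, min_units)
    )
    return [
        (m.start(), m.end(), (m.end() - m.start()) // 2)
        for m in pattern.finditer(seq)
    ]


# Hypothetical clone sequence: a 12-unit CA tract and an 8-unit TG tract.
clone = "ACGT" + "CA" * 12 + "GGATCC" + "TG" * 8 + "AT"

long_tracts = find_ca_repeats(clone)               # default cutoff of 10 units
all_tracts = find_ca_repeats(clone, min_units=8)   # relaxed cutoff
print(long_tracts)
print(all_tracts)
```

The coordinates returned can be used to pull out flanking sequence on either side of each tract for primer design.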
3.3.2. Conversion of YACs to Smaller Clones
In many positional cloning strategies, the next step will be to convert the YAC into a contig of smaller overlapping clones, such as cosmids, P1, BAC,
and so on, that are more stable and easier to manipulate. Although it is possible to subclone the YAC into a smaller vector, this is not only labor- and cost-intensive, but perpetuates any rearrangement or deletion that was present in the YAC. Instead, the YAC is used as a hybridization probe to select clones from a genomic library present on high-density filters (see Note 17). This can be easily done by hybridization of IRS-PCR products generated from the YAC.
1. Generate IRS-PCR products from the YAC of interest as in Section 3.2.2.1., except use all combinations of primers (see Note 20).
2. Combine products, and purify using a Qiax spin column.
3. Label 50 ng PCR products in a 50-µL reaction by random priming labeling.
4. Combine probes and prehybridize probe with 50 µg/mL Cot-1 DNA and 100 µg/mL sonicated human placental DNA in 500 µL 0.12 M Na2HPO4 for 4 h at 68°C.
5. Filters containing target clones are prehybridized in 0.5 M Na2HPO4, 7% SDS, 1 mM EDTA, 50 µg/mL denatured sonicated placental DNA for 4 h at 68°C.
6. Hybridize filters with 5 x 10^6 cpm/mL probe for 18-24 h at 68°C.
7. Wash filters at 68°C for 30 min in 40 mM Na2HPO4, 5% SDS, 1 mM EDTA, and then two times for 30 min each in 40 mM Na2HPO4, 1% SDS, 1 mM EDTA.
8. Expose for 1-2 d at -70°C with intensifying screens.
3.3.3. Identification of Genes from YACs
The isolation of genes directly with YAC DNA using either exon trapping (13), direct selection (14), or direct screening of cDNA libraries (15) is possible, but not preferred for several reasons. The complexity of the YAC DNA causes a low efficiency of rescue with all of these techniques. Therefore, only a subset of expressed sequences is identified from the YAC DNA. More importantly, the presence of internal deletions or rearrangements in the YAC could easily lead to genomic regions being unavailable for gene identification. It is therefore preferable to convert the YAC to a smaller and more stable cloning system (see Section 3.3.2.) and use these clones for the isolation of transcribed sequences. In the future, the most important resource for gene identification from YACs will be the assignment of expressed sequence tags (ESTs). Several groups have begun to take the information generated by large-scale sequencing of expressed sequences and convert this to mapping information by localization of ESTs to either single-chromosome somatic cell hybrids (16), radiation hybrid maps (17), or by screening YAC libraries directly (18). Much of this information is available by accessing dbEST (http://ncbi.nlm.nih.gov) or the Whitehead/MIT Center for Genome Research. These resources allow the immediate identification of expressed sequences from any region of the genome. In many cases, the corresponding cDNA clone can be readily obtained for complete analysis. At the present time, the number of expressed sequences assigned to YACs is low, and consequently, this resource is just beginning to become useful. However, the
number of ESTs assigned to YACs is quickly increasing, and this will soon become one of the most important resources for identification of candidate disease genes.
4. Notes
1. There are many other whole-genome human YAC resources available, including those generated at ICI, Imperial Cancer Research Fund, and Washington University. Some chromosome-specific libraries are available that have been constructed either from single-chromosome somatic cell hybrids or by selection of YACs from whole-genome libraries. The use of chromosome-specific YAC libraries may in many cases be the preferred resource and, where available, should be investigated. In addition, the same strategies and techniques discussed here are applicable to YAC libraries generated from other species.
2. It is highly recommended that access to the various databases be done through the World Wide Web. This provides the most interactive method and gives immediate results. In most cases, data can also be accessed through the Internet by gopher or ftp. In some cases e-mail access is available, but this is a much less efficient method of obtaining the data. Both Netscape (ftp.netscape.com) and Mosaic (ftp.ncsa.uiuc.edu) are available free to academic users from the named ftp sites.
3. The CEPH/Genethon physical mapping data can also be accessed by the following methods: ftp (ceph-genethon-map.genethon.fr), for either raw data or QUICKMAP, a compact database and navigation tool that can be run locally on a Sun UNIX; gopher (gopher.genethon.fr); and e-mail, by sending a message to ceph-genethon-map@cephb.fr with "help" in the subject line.
4. Several other depositories of mapping information on the CEPH YAC library exist. Baylor College of Medicine Human Genome Center (http://gc.bcm.tmc.edu:8088/) mirrors the data at both CEPH and Whitehead/MIT, as well as some of its own internal screening data. Individual genome centers often have data available on single-chromosome maps that contain not only YAC data, but also additional marker data.
5. Chimerism is usually not a problem at this stage, as long as the STS markers are from the correct chromosomal interval by assignment to somatic cell hybrid mapping panels or linkage studies. However, internal deletions or rearrangements can lead to inconsistencies. Frequently, these inconsistencies are immediately apparent. Redundancy of YACs containing the same markers helps to resolve this problem. However, the unambiguous order of markers may not be resolved until other mapping data (i.e., radiation hybrid data, restriction maps, and cosmid/P1/BAC contigs) are integrated into the contig. It is important to realize that the false-negative rate (YACs that contain the STS that are not identified) for STS screening of YAC libraries is significant. Therefore, most apparent gaps in the STS map in the early stage of contig assembly will be filled in when purified YACs are analyzed for STS content.
6. Two of the most popular are AceDB (5) and SIGMA (www.ncgr.org).
7. A list of distribution centers of the CEPH YAC library is available from Genethon's World Wide Web server. Individual YAC clones are also available commercially from Genome Systems Inc. and Research Genetics Inc.
8. YACs that are determined to be multiply chimeric, by STS assignment or hybridization, or that contain significant internal deletions, as determined by linkage of markers at significant distances, can be discarded at this point. These YACs can always be accessed at a later time if required.
9. It is essential to pick more than one colony (usually three, but more for particularly unstable YACs) to ensure that the YAC does not contain a large deletion or that multiple YACs are not present in the same colony. Once a colony is confirmed by both size and STS content, it should be stored in 20% glycerol at -80°C, since YACs stored on plates tend to become unstable.
10. The choice of primers for fingerprinting is dependent on the particular repeats found in the YACs. The conditions should be optimized to generate between 5 and 20 different fragments, so as to have sufficient information to determine overlap, but not so much as to obscure or generate misleading data. For repeat-poor areas, several PCR reactions done with different combinations of IRS primers can be pooled to generate more complex fingerprints. The addition of more than three different IRS primers to one PCR reaction does not lead to additional products and, in many cases, leads to the preferential amplification of only a few products.
11. The annealing temperature can be safely lowered to 50°C to generate additional human-specific fragments without amplifying any yeast-specific fragments. The number of cycles of amplification should be kept as low as possible (usually 25-28 is sufficient), since the preferential amplification of a few select products occurs at high cycle numbers.
12. Overlaps can be confirmed by using radiolabeled IRS-PCR products from a single YAC, or by gel-purifying a specific PCR product and using it as a probe on Southern blots of IRS-PCR products of suspected overlapping YACs.
13. All STSs and probes isolated from YACs should be tested to ensure that they are derived from the expected genomic region. This is most easily done using chromosome-specific somatic cell hybrid mapping panels.
14. The sequences for the vector primers are derived from the pBR322 sequences that flank the cloning site instead of the Sup4 region to decrease background amplification. Nested primers are designed close to the cloning site for direct sequence analysis:
B-YACL: 5' biotin-ATGCGCACCCGTTCTCGGAGC 3' (left arm, outside)
YACL2: 5' CAATTAAATACTCTCGGTAGCCAA 3' (left arm, inside)
B-YACR: 5' biotin-ATGCCGGCCACGATGGCGTCCGGCG 3' (right arm, outside)
YACR2: 5' CTCCCGGGGGCGAGTCGAACGCCC 3' (right arm, inside)
15. If no acceptable unique sequence can be generated from the nested vector primer, DNA bound to the magnetic beads can be sequenced using the appropriate IRS primer. Wash beads once with 100 µL 0.1 N NaOH, twice with TNE buffer, and once with TE buffer; resuspend in 20 µL TE buffer. Use 6 µL for sequencing. End products can also be reamplified using YACL2 or YACR2 and IRS primer, and cloned into a plasmid to be used as a hybridization probe.
16. Commercial screening sources for the CEPH YAC library are Research Genetics, Inc. and Genome Systems Inc. Alternatively, contact the nearest genome center that has a copy of the CEPH library (see Note 7).
17. There are now a multitude of large-insert genomic libraries available to the research community. The most direct way to discover what other mapping resources are available is to contact the Human Genome Organization's Human Genome Mapping Committee chromosome editors (http://gdbwww.gdb.org/gdb/docs/editors.html). In addition, Research Genetics, Inc. and Genome Systems Inc. are commercial sources for obtaining large-insert bacterial-based clones, such as BACs, PACs, and P1s.
18. Although YAC contigs are valuable for placement and ordering of markers, the unambiguous assignment of marker order must be confirmed by independent means, since YAC instability and chimerism can lead to incorrect determinations.
19. This same method can be used to rescue other types of simple sequence repeats (GA, CAG, ATA, AGAT), although with less success because of their lower frequency in the genome. If IRS-PCR is unsuccessful in rescuing sufficient polymorphic markers, short-insert PCR-formatted libraries free of yeast sequences can be generated by the MATS method. The resulting products can then be isolated by the method described in Section 3.3.1.2. using oligonucleotides specific for the different repeats.
20. To generate the most complex collection of products, the annealing temperature should be lowered to 50°C and all possible primer pair combinations should be used. This allows the generation of up to 100 products ranging in size from 100 bp to 10 kb.
References
1. Burke, D. T., Carle, G. F., and Olson, M. V. (1987) Cloning of large segments of exogenous DNA into yeast by means of artificial chromosome vectors. Science 236, 806-812.
2. Albertsen, H. A., Abderrahim, H., Cann, H. M., Dausset, J., Le Paslier, D., and Cohen, D. (1990) Construction and characterization of a yeast artificial chromosome library containing seven haploid human genome equivalents. Proc. Natl. Acad. Sci. USA 87, 4256-4260.
3. Cohen, D., Chumakov, I., and Weissenbach, J. (1993) A first generation physical map of the human genome. Nature 366, 698-701.
4. Olson, M., Hood, L., Cantor, C., and Botstein, D. (1989) A common language for physical mapping of the human genome. Science 245, 1434,1435.
5. Cherry, J. M. and Cartinhour, S. W. (1994) ACEDB, A tool for biological information, in Automated DNA Sequencing and Analysis (Adams, M., Fields, C., and Venter, C., eds.), Academic, San Diego, CA, pp. 347-356.
6. Ledbetter, S. A., Nelson, D. L., Warren, S. T., and Ledbetter, D. H. (1990) Rapid isolation of DNA probes within specific chromosome regions by interspersed repetitive sequence polymerase chain reaction. Genomics 6, 475-481.
7. Barnes, W. M. (1994) PCR amplification of up to 35-kb DNA with high fidelity and high yield from lambda bacteriophage templates. Proc. Natl. Acad. Sci. USA 91, 2216-2220.
8. Fujita, R. and Swaroop, A. (1995) Alu-vector PCR with biotinylated primers to isolate YAC ends ready for sequencing. BioTechniques 18, 796-799.
9. Cole, C. G., Goodfellow, P. N., Bobrow, M., and Bentley, D. R. (1991) Generation of novel sequence tagged sites (STSs) from discrete chromosomal regions using Alu-PCR. Genomics 10, 816-826.
10. de Souza, A. P., Allamand, V., Richard, I., Brenguier, L., Chumakov, I., Cohen, D., and Beckmann, J. S. (1994) Targeted development of microsatellite markers from inter-Alu amplification of YAC clones. Genomics 19, 391-393.
11. Kandpal, R. P., Kandpal, G., and Weissman, S. M. (1994) Construction of libraries enriched for sequence repeats and jumping clones, and hybridization selection for region-specific markers. Proc. Natl. Acad. Sci. USA 91, 88-92.
12. Chen, H., Pulido, J. C., and Duyk, G. M. (1995) MATS: a rapid and efficient method for the development of microsatellite markers from YACs. Genomics 25, 1-8.
13. Yaspo, M. L., North, M. A., and Lehrach, H. (1993) Exon-enriched probe derived from a human chromosome 21 YAC by exon-amplification. Nucleic Acids Res. 21, 2271,2272.
14. Morgan, J. G., Dolganov, G. M., Robbins, S. E., Hinton, L. M., and Lovett, M. (1992) The selective isolation of novel cDNAs encoded by the regions surrounding the human interleukin 4 and 5 genes. Nucleic Acids Res. 20, 5173-5179.
15. Elvin, P., Slynn, G., Black, D., Graham, A., Butler, R., Riley, J., Anand, R., and Markham, A. F. (1990) Isolation of cDNA clones using yeast artificial chromosome probes. Nucleic Acids Res. 18, 3913-3917.
16. Polymeropoulos, M. H., Xiao, H., Sikela, J. M., Adams, M., Venter, J. C., and Merrill, C. R. (1993) Chromosomal distribution of 320 genes from a brain cDNA library. Nature Genet. 4, 381-386.
17. James, M. R., Richard, C. W., Schott, J. J., Yousry, C., Clark, K., Bell, J., Terwilliger, J. D., Hazan, J., Dubay, C., and Vignal, A. (1994) A radiation hybrid map of 506 STS markers spanning human chromosome 11. Nature Genet. 8, 70-76.
18. Berry, R., Stevens, T. J., Walter, N. A. R., Wilcox, A. S., Rubano, T., Hopkins, J. A., Weber, J., Goold, R., Soares, M. B., and Sikela, J. M. (1995) Gene-based sequence-tagged-sites (STSs) as the basis for a human gene map. Nature Genet. 10, 415-423.
Construction and Use of Cosmid Contigs
Nicholas Fairweather
1. Introduction
The application of cosmids to positional cloning strategies has changed over the last several years (1). Initially, the construction of genome physical maps, for example, in Caenorhabditis elegans and Arabidopsis (2,3), was attempted using cosmid clones randomly selected and analyzed by restriction fingerprinting techniques (2). Cosmids were selected as the vector of choice owing to the ease of library construction and, at the time, the relatively large genomic insert size possible. The insert size determines the number of clones required for complete coverage of a given genome: the larger the insert size, the lower the number of clones. Therefore, cosmids offered a real advantage over plasmids and bacteriophage. However, with the advent of yeast artificial chromosome (YAC) (4), bacterial artificial chromosome (BAC) (5), and P1-derived artificial chromosome (PAC) (6) vectors, cosmids are no longer considered the starting point for long-range genome mapping. The experience of workers who have tried to generate complete genome maps using cosmids has demonstrated that there are areas of the genome either underrepresented or absent from the relevant libraries. In fact, several authors have stated that it is easy to start a global physical map with cosmids, but hard or impossible to finish (7). Consequently, cosmids are generated and used as resources for further investigation of regions of interest during a mapping project. YACs, because of their cloned insert size, are the best vectors with which to start a physical genome map. However, the number of chimeric YACs and the difficulty of subsequent manipulation of YAC DNA owing to the yeast genome background result in cosmids frequently being used as the stepping stones between a global physical map and further investigation of the region of interest. This may be achieved by generating cosmids from a YAC (8) or somatic cell hybrids (9), or by screening cosmid libraries with either YACs (10) or other probes (11).
From: Methods in Molecular Biology, Vol. 68: Gene Isolation and Mapping Protocols. Edited by J. Boultwood. Humana Press Inc., Totowa, NJ
The other vectors, PACs and BACs, offer advantages and disadvantages over cosmids, and may be considered complementary to cosmid and YAC contigs. These vectors contain bigger inserts than cosmids, but smaller than YACs, and retain most of the ease of manipulation of cosmids (12). This chapter outlines methodologies utilized in the construction and maintenance of cosmid libraries, and how libraries may be utilized as a resource for gene mapping.
2. Materials
2.1. Cosmid Library Production
1. Chromosomes prepared in low-melting-temperature agarose plugs.
2. Stratagene (La Jolla, CA) Cosmid Vector Kit (SuperCos 1) and bacterial strain (XL-1 MR).
3. 10X T4 DNA polymerase buffer: 0.33 M Tris acetate, pH 7.9, 0.66 M potassium acetate, 0.10 M magnesium acetate, 1 mg/mL bovine serum albumin (BSA), 5 mM dithiothreitol (DTT). (These are made and stored as the following stock solutions: 10X T4 polymerase salts: 0.33 M Tris acetate, pH 7.9, 0.66 M potassium acetate, 0.10 M magnesium acetate; plus BSA, 1 mg/mL stock solution, and DTT, 5 mM stock solution.)
4. MboI.
5. Sau3AI.
6. λ HindIII and λ SalI digests as size markers.
7. Agarose, regular melting temperature.
8. β-Agarase I (1 U/µL, New England Biolabs, Beverly, MA, or similar).
9. Calf intestinal phosphatase, alkaline (CIP, 1 U/µL, Boehringer Mannheim GmbH, Mannheim, Germany, or similar).
10. 0.15 M Trinitriloacetic acid (TNA; nitrilotriacetic acid from BDH or similar), stock solution.
11. Dextran T40 (Pharmacia, Uppsala, Sweden), 10 mg/mL stock solution.
12. Phenol, equilibrated with 100 mM Tris-HCl, pH 8.0.
13. Chloroform.
14. 4 M Sodium chloride stock solution.
15. Ethanol.
16. 1X TE: 10 mM Tris-HCl, pH 7.5, 1 mM EDTA.
2.2. Preparation of Digested Vector DNA
1. Cosmid vector.
2. XbaI.
3. BamHI.
2.3. Ligation and Packaging of DNA
1. Ligase (400 U/µL, New England Biolabs or similar).
2. 10X T4 DNA ligase buffer: 500 mM Tris-HCl, pH 7.5, 100 mM MgCl2, 10 mM DTT, 10 mM adenosine triphosphate (ATP).
3. Polynucleotide kinase (10 U/µL), New England Biolabs or similar.
4. Stratagene packaging extracts (Gigapack Gold XL for cosmids).
5. SM buffer: 5.8 g NaCl, 2.0 g MgSO4, 50 mL 1 M Tris-HCl, pH 7.5, 5 mL 2% gelatin, made up to 1 L.
6. LB-ampicillin plates: 10 g NaCl, 10 g bacto-peptone, 5 g yeast extract, 15 g agar, made up to 1 L. Autoclave and cool to approx 40°C. Add ampicillin to a final concentration of 50 µg/mL.
7. TB broth: 5 g NaCl, 10 g bacto-peptone, pH 7.4, made up to 1 L. Autoclave. Once cool, add MgSO4 to 10 mM final concentration and sterile maltose to 0.2% final concentration.
8. Hybond N+ nylon membrane (Amersham, Amersham, UK).
9. Denaturing solution: 1.5 M NaCl, 0.5 M NaOH.
10. Neutralizing solution: 1.5 M NaCl, 0.5 M Tris-HCl, pH 7.2, 0.001 M EDTA.
11. 20X SSC: 3 M NaCl, 0.3 M Na3 citrate.
2.4. Plating and Filter Replication of the Cosmid Library
1. 10X Hogness modified freezing medium (HMFM): 6.3 g K2HPO4, 0.45 g sodium citrate, 0.9 g (NH4)2SO4, 1.8 g KH2PO4, 44.0 g glycerol, distilled water added to a final volume of 100 mL. Autoclave. Once cool, add and mix 36 µL of MgSO4·7H2O.
2. Replica plater (Sigma, St. Louis, MO).
3. Disposable 96- and 384-pin replica platers (Genetix Limited).
4. Microtiter plates, 96- and 384-well (Falcon, Los Angeles, CA, or Genetix Limited).
2.5. General Equipment and Stock Solutions
1. BSA, 1 mg/mL stock solution.
2. LB broth: 10 g NaCl, 10 g bacto-peptone, 5 g yeast extract, made up to 1 L. Autoclave.
3. 1 M Magnesium sulfate, stock solution.
4. Maltose, 20% stock filter-sterilized solution.
5. Ampicillin, 50 mg/mL stock solution.
6. 1 M Tris-HCl, pH 7.5, stock solution.
7. 0.5 M EDTA, pH 8.0, stock solution.
8. Horizontal gel apparatus and power supply.
9. Microcentrifuge for 1.5-mL microcentrifuge tubes.
10. Dry blocks for 1.5-mL Eppendorf tubes, one at 68°C and one at 37°C (water baths may be used).
3. Methods
This section describes the steps involved in preparing MboI partial restriction enzyme digests of chromosomal DNA for ligation into a commercially available (Stratagene) cosmid vector and transformation of the host strain. The
packaging of the ligated material is usually achieved using Stratagene packaging extracts according to the manufacturer's suggested protocols. The chromosomal DNA partial enzyme digests and phosphatase treatments are performed in T4 polymerase buffer, a universal enzyme buffer, such that precipitations are not required between these two enzymatic reactions.
3.1. Cosmid Library Production
3.1.1. Preparation of Partially Digested Chromosomal DNA
1. Prepare chromosomal DNA within agarose plugs (see pulsed-field gel electrophoresis [PFGE], Chapter 4, and Note 1). One block usually contains 1-2 µg in a volume of 100 µL.
2. Set up four reactions (300 µL final volume) containing two agarose blocks of yeast chromosomes per reaction in 1.5-mL microcentrifuge tubes (see Note 2): 2 chromosomal DNA blocks (= 200 µL), 30 µL 10X T4 polymerase salts, 10 µL dH2O. Melt at 68°C for 10 min, and then bring to 37°C (dry block for 5 min). Add 30 µL BSA (1 mg/mL), 30 µL DTT (5 mM), and 0.1 U MboI to each reaction. Incubate one tube for 1 min, one tube for 5 min, one tube for 10 min, and one tube for 15 min at 37°C.
3. After each time interval, heat-kill the MboI by placing the tube at 68°C for 20 min, and then bring to 37°C. Maintain at 37°C to prevent the agarose from setting.
4. Load 30 µL of each digest on a 0.3% (w/v) agarose gel for analysis with λ HindIII digest and λ SalI digest as size markers. A base of 1-2% agarose may be poured first for easier handling. Following electrophoresis of the gel, stain with ethidium bromide, and take a photograph. From this gel, choose the best digests at the appropriate time interval for generating the cosmid (inserts 30-45 kb) library. It is important not to use overdigested reactions, which may result in chimeric inserts (see Note 2).
5. During electrophoresis of the agarose gel described in step 4, add 3 µL of β-agarase I to each reaction, mix, and incubate for 2-3 h at 37°C. This incubation may be carried out overnight.
6. Dilute CIP (NEB) to 0.1 U in 40 µL 1X T4 polymerase buffer (salts, BSA, and DTT), and add to each reaction. Incubate at 37°C for 30 min (see Note 4). Note: The number of CIP units used is dependent on the amount of termini present; the unit per termini relationship may vary between manufacturers.
7. Add TNA, pH 8.0, to a final concentration of 0.015 M to each tube to inactivate the CIP. Incubate at 68°C for 20 min.
8. Bring tubes to room temperature. Add 5 µL dextran T40 (10 mg/mL) as carrier to each tube (see Note 5).
9. Extract twice with an equal volume of phenol, and once with an equal volume of chloroform (see Note 6).
10. Precipitate by adding 5 µL dextran T40, NaCl to a final concentration of 0.1 M, and 2 vol of ethanol. Place on dry ice for 15 min.
11. Wash pellet with 70% (v/v) ethanol and air-dry. Resuspend in 10 µL of 1X TE.
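Step 7 gives only the final TNA concentration, so the volume of the 0.15 M stock has to be worked out from the reaction volume at that point. The Python sketch below shows the arithmetic. The running volume of 313 µL is an estimate assembled from the step volumes above (300 µL reaction, less the 30 µL loaded on the gel, plus 3 µL β-agarase I and 40 µL diluted CIP) and assumes the volumes are simply additive with no losses; the function name is hypothetical.

```python
def spike_volume(current_volume_ul, stock_molar, final_molar):
    """Volume of stock (µL) to add so the mixture reaches `final_molar`.

    Solves stock * v / (V + v) = final for v, i.e. the added volume
    itself dilutes the stock, so v = V * final / (stock - final).
    """
    if not 0 < final_molar < stock_molar:
        raise ValueError("final concentration must be below the stock")
    return current_volume_ul * final_molar / (stock_molar - final_molar)


# Estimated reaction volume at step 7 (assumes additive volumes, no losses):
# 300 µL digest - 30 µL gel aliquot + 3 µL β-agarase I + 40 µL diluted CIP.
reaction_ul = 300 - 30 + 3 + 40

# 0.15 M TNA stock (Section 2.1., item 10) to 0.015 M final (step 7).
tna_ul = spike_volume(reaction_ul, stock_molar=0.15, final_molar=0.015)
print(round(tna_ul, 1))
```

For a 10-fold stock the answer is always one-ninth of the current volume, which gives roughly 35 µL here. The same function can be reused for the NaCl addition in step 10 once the post-extraction volume has been measured.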
3.1.2. Preparation of Digested Vector DNA
Vector DNA (SuperCos 1) (see Note 7) is prepared in accordance with the manufacturer's recommendations. The following is a brief outline of that protocol. The circular vector DNA is digested with XbaI to separate the two COS cassettes. Following phenol/chloroform extraction and precipitation, the DNA is CIP-treated to prevent religation of the vector. The DNA is once again phenol/chloroform-extracted and precipitated. The XbaI-digested, CIP-treated DNA is then digested with BamHI. Once phenol/chloroform-extracted and precipitated, the DNA is resuspended in deionized water to a concentration of 1 µg/µL, and stored at -20°C.
3.1.3. Ligation and Packaging of DNA
1. Set up control ligations to determine the efficiency of CIP treatment and the ability of MboI ends to ligate.
a. 1 µL partial digest in 10 µL 1X ligase buffer, no ligase.
b. 1 µL partial digest in 10 µL 1X ligase buffer with 0.5 µL ligase (400 U/µL NEB).
c. 1 µL partial digest in 10 µL 1X ligase buffer with 0.5 µL ligase and 0.5 µL polynucleotide kinase (10 U/µL NEB).
Incubate each tube at 37°C for 60 min.
2. Heat-kill ligase and kinase at 68°C for 10 min, and then load on a 0.3% (w/v) agarose gel for analysis with λ HindIII digest and λ SalI digest as size markers. Stain the gel with ethidium bromide, and from the pattern of DNA fluorescence, decide which is the best digest to use for generating the cosmid library. The reaction with ligase and without kinase should look the same as the reaction without ligase and kinase. Only the reaction with both ligase and kinase added should have ligated together and should appear larger on the gel.
3. Ligation of digested chromosomal DNA to cosmid vector (SuperCos 1): 2.5 µg resuspended MboI-restricted chromosomal DNA, 1.0 µL SuperCos 1 vector (1 µg/µL) prepared according to the Stratagene protocol, 2.0 µL 10X ligation buffer, 1.0 µL ligase (400 U/µL), and X µL distilled H2O to make up to a final volume of 20 µL. Incubate ligations at 16°C overnight or 4°C for 2 d.
4. For in vitro packaging of the ligated DNA, Stratagene packaging extracts are recommended for high efficiency and reliability. The manufacturer's protocols for packaging and plating of libraries should be followed. In brief, the ligated DNA is mixed with the freeze-thaw extract and then immediately mixed with sonic extract. Following a 2-h room temperature incubation, the mixture is diluted with SM buffer. Chloroform is added, and any debris is sedimented by a brief centrifugation. The packaged DNA is then stored at 4°C. The titer of the library should be calculated by plating out a 1:10 and a 1:50 dilution of the packaged DNA supernatant. From this, calculate the volume of packaged DNA supernatant required to plate out 1500-1800 colonies/22 x 22 cm plate. The total number of colonies plated out should be in the region of four to six genome equivalents. This should allow all regions of the genome to be represented within the library (13).
5. If the cosmid library is constructed from such material as YACs or radiation hybrids, then clones containing DNA from the yeast or from the background cell line must be eliminated from the library. This can be achieved by preparing colony lifts of the plated-out library and probing them with the relevant total genomic DNA. Positives may then be picked into microtiter plates and stored at -80°C. Thereby only the cosmids containing the DNA of interest will be picked into the final library. To achieve this, colony lifts are prepared: Hybond N+ membrane is placed onto the 22 x 22 cm plates. The membranes are pierced such that their orientation can be determined. The membranes are removed, inverted such that the colonies are face up, and placed onto fresh 22 x 22 cm plates. The cells are allowed to grow for approx 4 h.
6. The membranes are removed and transferred to absorbent filter paper, which has been soaked in denaturing solution, with the bacterial colonies face up. Leave for 7 min.
7. The membranes are transferred to fresh absorbent filter paper soaked in neutralization solution. Leave for 7 min.
8. Repeat step 7.
9. The membranes are transferred to fresh absorbent filter paper soaked in 2X SSC. Leave for 7 min.
10. Rinse the membranes in 1 L of 2X SSC.
11. The membranes are dried at 80°C for half an hour or air-dried, following which the DNA is crosslinked to the nylon using a UV transilluminator.
12. The filters are then hybridized with a relevant probe to identify positive clones, e.g., total human DNA.
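The titer and coverage arithmetic from step 4 can be sketched as follows. This is a rough planning aid under stated assumptions, not part of the protocol: the 3 x 10^9 bp genome, the 40-kb mean insert, and the colony count are illustrative numbers, the function names are hypothetical, and the probability estimate uses the standard random-library relationship P = 1 - (1 - f)^N, where f is the insert size as a fraction of the genome and N is the number of clones, which assumes clones are sampled uniformly at random.

```python
import math


def clones_for_coverage(genome_bp, insert_bp, equivalents):
    """Number of clones that together hold `equivalents` genome equivalents."""
    return math.ceil(equivalents * genome_bp / insert_bp)


def prob_represented(genome_bp, insert_bp, n_clones):
    """Chance that a given sequence is in a random library: 1 - (1 - f)^N."""
    return 1 - (1 - insert_bp / genome_bp) ** n_clones


def plates_needed(n_clones, colonies_per_plate=1500):
    """Number of 22 x 22 cm plates at the chosen plating density."""
    return math.ceil(n_clones / colonies_per_plate)


def plating_volume_ul(target_colonies, colonies_counted,
                      volume_plated_ul, dilution_factor):
    """Volume of undiluted packaged supernatant giving `target_colonies`."""
    titer_per_ul = colonies_counted * dilution_factor / volume_plated_ul
    return target_colonies / titer_per_ul


# Illustrative values only: 3e9-bp genome, 40-kb inserts, 5 equivalents.
n = clones_for_coverage(3_000_000_000, 40_000, 5)
p = prob_represented(3_000_000_000, 40_000, n)
plates = plates_needed(n)

# Hypothetical titer plate: 120 colonies from 10 µL of the 1:10 dilution.
vol = plating_volume_ul(1500, colonies_counted=120,
                        volume_plated_ul=10, dilution_factor=10)
print(n, round(p, 4), plates, vol)
```

Under the random-sampling assumption, x genome equivalents leave a fraction of about e^-x of the genome unrepresented, so four to six equivalents correspond to roughly 98-99.8% representation. Regions that clone poorly in cosmids will fall below this estimate.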
3.1.4. Plating and Filter Replication of the Cosmid Library
1. Once positive clones have been identified, they should be picked and inoculated into microtiter plates containing media, the relevant antibiotic selection (ampicillin for SuperCos 1), and 1X HMFM. HMFM protects the cells once the plates have been placed at -80°C.
2. The cells are grown overnight at 37°C, replica-plated (a replica copy or copies are grown again overnight at 37°C), snap-frozen on dry ice, and stored at -80°C. Replica plates are generated using sterile replica platers (replicators). Replica platers may be reusable or disposable. Reusable metal platers are sterilized between each manipulation by first washing in distilled water, then in an ethanol bath, and then flaming. Note: Care should be taken while using an ethanol bath and flame sterilization. Neither the water nor the ethanol washes should rise more than three-quarters up the pins, to prevent contamination of the replica plater base. The distilled water wash helps to reduce cellular debris building up on the pins of the replicator. Replica plates are generated by placing the replicator into the original microtiter plate, and then transferring it to a fresh microtiter plate containing media, antibiotic, and HMFM. If the plate to be replicated has been frozen, we have found that the plate can be defrosted, several times if necessary, before
replicating without excessive cell death. This helps to ensure complete replication. If the plates are replicated while frozen, some wells may not be replicated owing to the sedimentation of the cells.
3. Two replica plates are recommended. This allows the original to be kept as a master copy, permanently stored at -80°C, with one replica plate being used to generate further replicas (the working copy) and the other as a picking copy. The picking copy may become contaminated over time.
4. The working copy is used to generate replica filters of the library and for the preparation of individual clones. Replica filters are produced using the replicator.
a. Antibiotic agar plates of the relevant size have dry Hybond N+ nylon membranes placed onto them. The plates are dried before use to prevent spreading of the colonies.
b. The replicator is placed into the microtiter plate containing the picking copy of the library and transferred onto the Hybond N+ nylon.
c. The antibiotic agar plates with the membranes on them are incubated overnight to allow the bacterial cells to grow.
d. The membranes are removed and transferred to absorbent filter paper, which has been soaked in denaturing solution, with the bacterial colonies face up. Leave for 7 min.
e. The membranes are transferred to fresh absorbent filter paper soaked in neutralization solution. Leave for 7 min.
f. Repeat step e.
g. The membranes are transferred to fresh absorbent filter paper soaked in 2X SSC. Leave for 7 min.
h. Rinse the membranes in 1 L of 2X SSC.
i. The membranes are dried at 80°C for half an hour or air-dried, following which the DNA is crosslinked to the nylon using a UV transilluminator.
These membranes can then be hybridized with probes to identify overlapping cosmids using whole-insert or newly derived probes, to identify cosmids positive for known sequence-tagged sites (STSs) and expressed sequence tags (ESTs), and to identify cosmid groups that have at least one probe in common.
Multiple copies of the library may be replicated onto membranes, which allows several screenings of the library with different probes to be carried out simultaneously. This allows rapid contig generation and validation. If the cosmid library is constructed from such material as YACs or radiation hybrids as described above, the membranes should be hybridized with the relevant genome to check for picking mistakes and with the background genome to identify chimeric clones.
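Grouping cosmids that have at least one probe in common can be sketched as a small union-find routine. This is an illustrative Python sketch only: the function name, the cosmid names, and the probe names are hypothetical, and treating groups as transitive (cosmids linked through a chain of shared probes fall into one group) is an assumption about how the grouping is applied.

```python
from collections import defaultdict

def group_cosmids(hits):
    """Group cosmids that share probes, directly or through a chain of overlaps.

    hits maps a cosmid name to the set of probes it is positive for.
    Returns a list of sorted groups, ordered by their first member.
    """
    parent = {cosmid: cosmid for cosmid in hits}

    def find(cosmid):
        # Follow parent links to the group representative (with path halving).
        while parent[cosmid] != cosmid:
            parent[cosmid] = parent[parent[cosmid]]
            cosmid = parent[cosmid]
        return cosmid

    # Index cosmids by probe, then merge all cosmids positive for the same probe.
    by_probe = defaultdict(list)
    for cosmid, probes in hits.items():
        for probe in probes:
            by_probe[probe].append(cosmid)
    for cosmids in by_probe.values():
        root = find(cosmids[0])
        for other in cosmids[1:]:
            parent[find(other)] = root

    groups = defaultdict(set)
    for cosmid in hits:
        groups[find(cosmid)].add(cosmid)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])

if __name__ == "__main__":
    # Hypothetical hybridization results.
    hits = {
        "c1": {"STS1"},
        "c2": {"STS1", "EST7"},
        "c3": {"EST7"},
        "c4": {"VNTR2"},
        "c5": set(),
    }
    for group in group_cosmids(hits):
        print(group)
```

Cosmids with no positive probe remain as single-member groups, which makes it easy to see which clones still need to be placed by fingerprinting or walking.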
3.2. The Construction of a Cosmid Contig
The method of cosmid contig construction will depend on the initial starting material, the final goal of contig construction, and resource availability. Initially, the cosmid library should be screened to identify previously described VNTRs, STSs, and ESTs. Cosmids will be placed in groups, cosmids within a given group having at least one probe in common. This initial library screening
in itself may generate some contiguous regions, depending on the information available for the region of interest. However, frequently cosmids positive for different known sites do not overlap, and consequently, a contig must be constructed by other means. The simplest method is the hybridization of a cosmid back against the library to detect overlaps. This is easily achieved with the vector SuperCos 1, which allows the insert DNA to be digested from the vector using NotI. Consequently, the insert DNA may be utilized as a probe following separation of the fragments by electrophoresis. The problem of low-copy repeats generating false positives is a major hindrance to this approach being utilized for large-scale mapping of eukaryotic genomes. This approach does work well over smaller regions, for example, a library constructed from a YAC. Consequently, the most common method used to construct cosmid contigs is "restriction fingerprinting," which may be achieved by several methods. All the methods have restriction enzyme digestion and digestion fragment visualization in common, but each of these may be achieved in different ways. The enzyme digestion may be complete or partial, and a single (8) or a double digest (2). Visualization may be achieved by labeling the digested fragments directly (2) or by probing Southern blots (8) of digested fragments. The probe labeling techniques may utilize either radioactive or nonradioactive methods (14). The restriction fingerprinting method initially employed in the C. elegans genome project utilized both single- and double-enzyme digestion of random cosmids, which were directly labeled and visualized by autoradiography following polyacrylamide gel electrophoresis (15). This method utilized automated scanning and processing of the autoradiograms, followed by computer data analyses.
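The core idea of fingerprint comparison, counting restriction fragments that two clones share within a sizing tolerance, can be sketched as below. This is a deliberately simplified Python illustration: the function names, the 2% tolerance, and the fragment sizes are hypothetical, and established contig-assembly software uses more rigorous statistical scoring than this simple shared-band fraction.

```python
def shared_bands(a, b, tolerance=0.02):
    """Count fragments in fingerprint a that pair with a distinct fragment in b.

    Sizes are in kb. Two fragments match when their sizes differ by no more
    than `tolerance` (a fraction) of the larger fragment.
    """
    a, b = sorted(a), sorted(b)
    i = j = matches = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance * max(a[i], b[j]):
            matches += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return matches

def overlap_score(a, b, tolerance=0.02):
    """Fraction of the smaller fingerprint's fragments that are shared."""
    if not a or not b:
        return 0.0
    return shared_bands(a, b, tolerance) / min(len(a), len(b))

if __name__ == "__main__":
    # Hypothetical fragment sizes (kb) for two cosmids.
    cosmid_a = [1.2, 2.5, 3.1, 4.8, 7.0]
    cosmid_b = [2.52, 3.1, 4.75, 9.3]
    print(shared_bands(cosmid_a, cosmid_b), overlap_score(cosmid_a, cosmid_b))
```

A high score suggests, but does not prove, that two clones overlap; comigrating fragments of unrelated origin will inflate the score, which is one reason a threshold should be validated on clones of known relationship.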
The use of computers and the appropriate software for contig construction is of considerable importance owing to the amount of information produced by restriction fingerprinting. However, it is feasible to generate a contig over a reasonable length without the use of automated scanning and contig software. Whichever method is initially employed, a point is likely to be reached at which no further information is obtained by random restriction fingerprinting. Consequently, probes derived from the ends of cosmids will need to be generated. These end probes allow new overlapping cosmids to be identified, thereby generating directional walks out from contigs. In this way, gaps between contigs may be filled by directional walking from flanking contigs. End probes may be generated as riboprobes, which utilize specific promoters from the cloning vector (16,17), by PCR with specific and nonspecific primers (18), by inverse PCR (19), and by subcloning cosmid ends, for example, into Bluescript SK (Stratagene) (20). Another method that is being utilized in our laboratory is the combination of cosmid contig construction and the production of a transcription map for a given region. The cosmids are used as a resource for exon trapping (21-23),
and the exon-trapped products generated are in turn utilized as probes to screen the cosmid library, thereby identifying overlapping cosmids. Since a cosmid may contain several exons, depending on the gene density of an area, this method should identify several ESTs per cosmid and therefore allow a contig to be constructed. The recent publication of cosmid-based exon trap vectors, sCOGH (24; see Note 8), allowing exon trapping directly from the cosmid without the need for further subcloning, may allow this method of contiging to be both quicker and more informative than traditional cosmid contiging.
3.3. Cosmid Contig Uses
1. The cosmid library may be screened for new microsatellites by probing the library with a (CA)n probe (Pharmacia). Positive cosmids would be subcloned into M13, rescreened with (CA)15, and positive subclones sequenced. Primers across the repeat are then designed to test for polymorphism within the population (25). Using a similar approach, other polymorphic repeat sequences may be identified. These may then be utilized for linkage analysis.
2. As has been described above, cosmids may be utilized as a resource for exon trapping and therefore ultimately to generate a transcript map and/or ESTs. Exon-trapped products, once sequenced, can be analyzed against a database to determine homology, which may indicate gene function and thereby suggest possible candidate status for a disease that maps to the same region.
3. Random screening of cosmids or the use of subcloned end fragments may be used to generate new STSs. All of these derived sequences may be used for further investigation of the cosmid library and/or for other purposes, such as linkage analysis or physical mapping of genomic DNA, depending on the final objective of the research. By utilizing cosmids as probes in PFGE, fluorescence in situ hybridization (FISH), and direct visual hybridization (DIRVISH) (26), genomic rearrangements, such as duplications/deletions, may be identified (27).
As in the case of cystic fibrosis, fine restriction maps may be generated utilizing cosmid contigs (28), thereby identifying structures, such as CpG islands.
4. Cosmids can be used to develop a fine physical map across a region.
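Once positive subclones have been sequenced (use 1 above), the position of the dinucleotide repeat within each sequence can be located computationally before primers are designed across it. The following Python sketch is an illustration under stated assumptions: the function name, the 10-unit minimum, and the test sequence are hypothetical and are not taken from the protocol.

```python
import re

def find_ca_repeats(seq, min_units=10):
    """Return (start, end, units) for each (CA)n or (TG)n run of >= min_units.

    Coordinates are 0-based, end-exclusive positions in the given sequence.
    """
    seq = seq.upper()
    pattern = re.compile(r"(?:CA){%d,}|(?:TG){%d,}" % (min_units, min_units))
    return [
        (m.start(), m.end(), (m.end() - m.start()) // 2)
        for m in pattern.finditer(seq)
    ]

if __name__ == "__main__":
    # Hypothetical subclone sequence: one long CA run and one short TG run.
    seq = "GGATCC" + "CA" * 12 + "TTGACC" + "TG" * 5 + "AAGCTT"
    print(find_ca_repeats(seq))               # only the 12-unit CA run passes
    print(find_ca_repeats(seq, min_units=5))  # the 5-unit TG run is also reported
```

The unique sequence on either side of a reported run is where primers spanning the repeat would be placed.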
4. Notes
1. It is recommended that chromosomal DNA be prepared in agarose plugs. This reduces the risk of shearing the DNA.
2. The partial digestion of the genomic DNA should be controlled to reduce overdigestion. The production of small fragments may lead to chimeric clones.
3. The enzyme Sau3AI may also be used to subclone chromosomal DNA into the BamHI site of the vector (SuperCos 1).
4. CIP treatment should not be carried out for longer than necessary, since this can lead to poor ligations.
5. The addition of T40 dextran helps to reduce loss during the cleanup steps.
6. While manipulating the chromosomal DNA digestion reactions, care should be taken to avoid shearing the DNA. The phenol/chloroform and ethanol precipitation steps are the most obvious points at which shearing may occur. To prevent shearing, 1-mL pipet tips trimmed with a hot scalpel to increase the bore diameter of the tip are used to transfer the reactions between Eppendorfs.
7. The recommended cosmid vector, SuperCos 1, contains T3 and T7 polymerase promoter sequences flanking the cloning site. Consequently, riboprobes can be prepared and used for subsequent hybridizations with relative ease.
8. The sCOGH-vector system described here is available from J. T. den Dunnen (MGC-Dept. Human Genetics, Leiden University, Wassenaarseweg 72, P.O. Box 99503, 2300 RA Leiden, Netherlands. E-mail: ddunnen@ruly46.MedFac.LeidenUniv.nl). This group has recently made some alternative sCOGH-derived vectors, which may prove to be of use for specific purposes (e.g., 3'- and 5'-specific exon trapping).
9. The replica plater should be allowed to cool subsequent to ethanol bath and flame sterilization. Otherwise, the cells being replicated may be killed.
Acknowledgments
I thank A. P. Monaco, R. Cox, A. Nemeth, and J. den Dunnen for their helpful comments on this manuscript.
References
1. Collins, F. S. (1992) Positional cloning: let's not call it reverse anymore. Nature Genet. 1, 3-6.
2. Coulson, A., Sulston, J., Brenner, S., and Karn, J. (1986) Towards a physical map of the genome of the nematode Caenorhabditis elegans. Proc. Natl. Acad. Sci. USA 83, 7821-7825.
3. Hauge, B. M., Hanley, S., Giraudat, J., and Goodman, H. M. (1991) Mapping the Arabidopsis genome. Symp. Soc. Exp. Biol. 45, 45,46.
4. Burke, D. T., Carle, G. F., and Olson, M. V. (1987) Cloning of large segments of exogenous DNA into yeast by means of artificial chromosome vectors. Science 236, 806-812.
5. Shizuya, H., Birren, B., Kim, U.-J., Mancino, V., Slepak, T., Tachiiri, Y., and Simon, M. (1992) Cloning and stable maintenance of 300-kilobase-pair fragments of human DNA in Escherichia coli using an F-factor-based vector. Proc. Natl. Acad. Sci. USA 89, 8794-8797.
6. Ioannou, P. A., Amemiya, C. T., Garnes, J., Kroisel, P. M., Shizuya, H., Chen, C., Batzer, M. A., and de Jong, P. J. (1994) A new bacteriophage P1-derived vector for the propagation of large human DNA fragments. Nature Genet. 6, 84-89.
7. Coulson, A., Kozono, Y., Lutterbach, B., Shownkeen, R., Sulston, J., and Waterston, R. (1991) YACs and the C. elegans genome. BioEssays 13(8), 413-417.
8. Bellanné-Chantelot, C., Barillot, E., Lacroix, B., Le Paslier, D., and Cohen, D. (1991) A test case for physical mapping of human genome by repetitive sequence fingerprints: construction of a physical map of a 420 kb YAC subcloned into cosmids. Nucleic Acids Res. 19(3), 505-510.
9. Holland, J., Coffey, A. J., Giannelli, F., and Bentley, D. R. (1993) Vertical integration of cosmid and YAC resources for the interval mapping on the X-chromosome. Genomics 15, 297-304.
10. Baxendale, S., Bates, G. P., MacDonald, M. E., Gusella, J. F., and Lehrach, H. (1991) The direct screening of cosmid libraries with YAC clones. Nucleic Acids Res. 19(23), 6651.
11. Heding, I. J. J. P., Ivens, A. C., Wilson, J., Striven, M., Gregory, S., Hoovers, J. M. N., Mannens, M., Redeker, B., Porteous, D., van Heyningen, V., and Little, P. F. R. (1992) The generation of ordered sets of cosmid DNA clones from human chromosome region 11p. Genomics 13, 89-94.
12. Monaco, A. P. and Larin, Z. (1994) YACs, BACs, PACs and MACs: artificial chromosomes as research tools. Trends in Biotechnol. 12(7), 280-286.
13. Ausubel, F. M., Brent, R., Kingston, R. E., Moore, D. D., Seidman, J. G., Smith, J. A., and Struhl, K. (eds.) (1994) Current Protocols in Molecular Biology (Section 5.1.1.). Wiley, New York.
14. Carrano, A. V., Lamerdin, J., Ashworth, L. K., Watkins, B., Branscomb, E., Slezak, T., Raff, M., De Jong, P. J., Keith, T., McBride, L., Meister, S., and Kronick, M. (1989) A high resolution, fluorescence-based, semiautomatic method for DNA fingerprinting. Genomics 4, 129-136.
15. Coulson, A. and Sulston, J. (1988) Gene Mapping, A Practical Approach (Davies, K. E., ed.), IRL, Oxford, UK, pp. 19-39.
16. Evans, G. A., Lewis, K., and Rothenberg, B. E. (1989) High efficiency vectors for cosmid microcloning and genomic analysis. Gene 79, 9-20.
17. Wahl, G. M., Lewis, K. A., Ruiz, J. C., Rothenberg, B. E., Zhao, J., and Evans, G. A. (1987) Cosmid vectors for rapid genomic walking, restriction mapping, and gene transfer. Proc. Natl. Acad. Sci. USA 84, 2160-2164.
18. Wesley, C. S., Myers, M. P., and Young, M. W. (1994) Rapid sequential walking from termini of cosmids, P1 and YAC inserts. Nucleic Acids Res. 22, 538,539.
19. Byth, B. C., Thomas, G. R., Hofland, N., and Cox, D. W. (1994) Application of inverse PCR to isolation of end probes from cosmids. Nucleic Acids Res. 22, 1766,1767.
20. Haberhausen, G. and Müller, U. (1995) A rapid and efficient method for the cloning of cosmid end-pieces. Nucleic Acids Res. 23, 1441,1442.
21. Buckler, A. J., Chang, D. D., Graw, S. L., Brook, J. D., Haber, D. A., Sharp, P. A., and Housman, D. E. (1991) Exon amplification: a strategy to isolate mammalian genes based on RNA splicing. Proc. Natl. Acad. Sci. USA 88, 4005.
22. Church, D. M., Statler, C. J., Rutter, J. L., Murrell, J. R., Trofatter, J. A., and Buckler, A. J. (1994) Isolation of genes from complex sources of mammalian genomic DNA using exon amplification. Nature Genet. 6, 98-104.
23. Krizman, D. B. and Berget, S. M. (1993) Efficient selection of 3'-terminal exons from vertebrate DNA. Nucleic Acids Res. 21(22), 5198-5202.
24. Datson, N. A., van-de-Vosse, E., Dauwerse, H. G., Bout, M., van-Ommen, G. J., and Den-Dunnen, J. (1996) Scanning genes in large genomic regions: cosmid-based exon trapping of multiple exons in a single product. Nucleic Acids Res. 24, 1105-1111.
25. Fairweather, N., Chelly, J., and Monaco, A. P. (1993) Dinucleotide repeat polymorphisms from DXS106 and DXS227 YACs using a two stage approach. Hum. Mol. Genet. 2(5), 607-608.
26. Buckle, V. J. and Kearney, L. (1993) Untwirling DIRVISH (news, comments). Nature Genet. 5(1), 4,5.
27. de Kok, Y. J. M., Merkx, G. F. M., van der Maarel, S. M., Huber, I., Malcolm, S., Ropers, H.-H., and Cremers, F. P. M. (1995) A duplication/paracentric inversion associated with familial X-linked deafness (DFN3) suggests the presence of a regulatory element more than 400 kb upstream of the POU3F4 gene. Hum. Mol. Genet. 4(11), 2145-2150.
28. Rommens, J. M., Iannuzzi, M. C., Kerem, B.-S., Drumm, M. L., Melmer, G., Dean, M., Rozmahel, R., Cole, J. L., Kennedy, D., Hidaka, N., Zsiga, M., Buchwald, M., Riordan, J. R., Tsui, L.-C., and Collins, F. (1989) Identification of the cystic fibrosis gene: chromosome walking and jumping. Science 245, 1059-1065.
Use of Dinucleotide Polymorphism in Physical Mapping Analyses
Jeff Fairman and Lalitha Nagarajan
1. Introduction
The microsatellite repeat motifs (dC-dA)n are present in high abundance in the normal genome (1). If they were to occur at regular intervals, they could be as frequent as one in approximately every 30-40 kb of the human genome. Thus, the entire genome can be represented by a large number of dC-dA or dG-dT repeat sequences. The repeat elements vary in size and can be analyzed by polymerase chain reaction (PCR) using unique primers; the high degree of informativeness has been exploited in determining the genetic linkage maps of the human genome (2), in elucidating the genetic mechanisms underlying Prader-Willi and Angelman syndromes (3), as well as Charcot-Marie-Tooth disease (4), and in the detection of allele loss or gain in malignancy (5). The utility of these loci has been further extended to building the first-generation physical map of the human genome, providing us with highly specific landmarks on every human chromosome (6). For example, human chromosome 5, which contains 4.5% of the haploid genome, has been saturated with over 150 dinucleotide repeat markers, which are dispersed at approximately megabase intervals. Thus, the existing technology allows systematic screening to pinpoint a disease locus within a megabase or two. Nonetheless, an increase in the density of these markers will facilitate localization of disease genes. In this chapter, we describe a highly sensitive and efficient protocol for isolation, physical mapping, and generation of contigs from a large number of samples in a short time. This protocol is especially suited for genomic loci cloned in λ phage or cosmid vector, and isolated owing to somatic or inherited rearrangements or homology to novel cDNAs. These novel genomic clones
will be the starting point for a bidirectional chromosomal walk, which will result in integrating the polymorphic markers and genes. We have successfully employed this approach to isolate a novel dinucleotide polymorphism from a pool of λ phage clones generated from a yeast artificial chromosome (YAC) spanning the human chromosome 5q31 loci (6). This has allowed us to order previously characterized genes within the first-generation physical map (7). The overall scheme employed is outlined as follows:
1. Isolate and characterize novel dinucleotide polymorphisms.
2. Develop a specific PCR assay to detect the novel polymorphism.
3. Perform subchromosomal localization using somatic cell hybrids, radiation hybrids, and patients with deletions or amplification.
4. Perform physical mapping using the CEPH mega YAC library.
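The marker-density figures quoted in the Introduction follow from simple arithmetic, sketched here in Python. The haploid genome size of 3000 Mb is an assumption introduced for this illustration (the text gives only the 4.5% fraction, the marker count, and the 30-40 kb spacing), and the function names are hypothetical.

```python
HAPLOID_GENOME_MB = 3000.0  # assumed haploid genome size; not stated in the text

def chromosome_size_mb(fraction_of_genome, genome_mb=HAPLOID_GENOME_MB):
    """Approximate chromosome size from its share of the haploid genome."""
    return fraction_of_genome * genome_mb

def mean_marker_spacing_mb(chromosome_mb, n_markers):
    """Average spacing if the markers were evenly distributed."""
    return chromosome_mb / n_markers

def expected_repeat_count(spacing_kb, genome_mb=HAPLOID_GENOME_MB):
    """Number of repeat loci if one occurred every `spacing_kb` kilobases."""
    return genome_mb * 1000.0 / spacing_kb

if __name__ == "__main__":
    chr5 = chromosome_size_mb(0.045)
    print(f"chromosome 5: about {chr5:.0f} Mb")
    print(f"150 markers: one per {mean_marker_spacing_mb(chr5, 150):.2f} Mb")
    print(f"repeat loci genome-wide: {expected_repeat_count(40):.0f}"
          f" to {expected_repeat_count(30):.0f}")
```

With the assumed genome size, 150 markers on chromosome 5 give an average spacing of just under 1 Mb, in line with the statement that the markers are dispersed at approximately megabase intervals.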
2. Materials
2.1. Isolation of Novel Dinucleotide Polymorphisms
1. λ Phage or cosmid template DNA.
2. (dC-dA)n or (dG-dT)n oligonucleotide:
(dC-dA) BAM-CA: CCCGGATCCA(CA)s C
(dG-dT) BAM-TG: CCCGGATCCTG(TG),
The (dC-dA) and (dG-dT) primers have a restriction site for the enzyme BamHI.
3. Repetitive element primer (Alu or LINE):
Alu 517: CGACCTCGAGATCT(C/T)(G/A)GCTCACTGCAA
Alu 559: AAGTCGCGGCCGCTTGCAGTGAGCCGAGAT
Alu PDJ34: TGAGC(CGA/TAT)GAT(CGCG/TATA)CCA(C/T)TGCACTCCAGCCTGGG
LINE L1Hs: CATGGCACATGTATACATATGTAAC(A/T)AACC
4. Taq polymerase (Promega, Madison, WI).
5. 100 mM dNTPs (Pharmacia, Piscataway, NJ).
6. TA plasmid vector kit (Invitrogen, San Diego, CA).
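The repetitive element primers listed above contain degenerate positions written as parenthesized alternatives. The following Python sketch expands such a primer into the individual sequences it represents. The function name is hypothetical, and treating every parenthesized group as an independent choice is an assumption; for Alu PDJ34 the alternatives may instead be linked consensus variants.

```python
import itertools
import re

def expand_degenerate(primer):
    """Expand '(X/Y)' alternatives into every concrete primer sequence."""
    # Split into fixed stretches and parenthesized groups (groups are kept).
    parts = re.split(r"(\([^()]*\))", primer.replace(" ", ""))
    choices = []
    for part in parts:
        if part.startswith("(") and part.endswith(")"):
            choices.append(part[1:-1].split("/"))
        elif part:
            choices.append([part])
    return ["".join(combo) for combo in itertools.product(*choices)]

if __name__ == "__main__":
    alu_517 = "CGACCTCGAGATCT(C/T)(G/A)GCTCACTGCAA"
    for seq in expand_degenerate(alu_517):
        print(seq)
```

Alu 517 expands to four 27-nucleotide sequences, so a synthesis of this primer is a mixture of four species, each at one-quarter of the nominal concentration.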
2.2. Development of PCR Assay to Detect the Novel Polymorphism
1. DNA sequencing kit.
2. Software for primer design (Oligo or GCG).
3. γ-dATP (6000 Ci/mmol).
4. T4 polynucleotide kinase and buffer.
5. Oligonucleotides (primers).
2.3. Subchromosomal Localization Using Somatic Cell Hybrids, Radiation Hybrids, and Patients with Deletions or Amplification
1. Monochromosomal hybrids (Genetic Mutant Repository, Camden, NJ).
2. Deletion hybrid panel (Genetic Mutant Repository).
Fig. 1. (A) Isolation of novel dinucleotide polymorphisms. The Alu primers (P1 or P4) are used singly or in combination with the dinucleotide repeat primer (P2 or P3). If the dinucleotide repeats are flanked by two elements in opposite orientation, an inter-Alu product may result. However, the product from P1 + P2 or P3 + P4 would be favored owing to its smaller size. (B) Schematic representation of an ethidium bromide-stained 1.5-2.0% agarose gel resolving the PCR products obtained from the primers shown in (A).
2.4. Physical Mapping Using the CEPH Mega YAC Library
1. Software to access Genome and CEPH databases.
2. Growth media and reagents for DNA isolation of yeast cells.
3. Methods
3.1. Isolation of Novel Dinucleotide Polymorphisms
The technique employs PCR between a primer specific for the human repetitive sequence Alu or LINE and a second primer that is either a (dC-dA)n or (dG-dT)n. The resulting amplification product yields a unique DNA segment flanked by the repetitive element and the dinucleotide repeat sequence (Fig. 1).
3.1.1. DNA Isolation
The quality of λ phage or cosmid DNA isolated by standard procedures is sufficient for the PCR (see Note 1). For a detailed protocol on rapid preparation of a large number of bacteriophage DNA samples, please refer to the procedure described by Donovan et al. (8). We also obtain good yields of PCR products with cosmid DNA prepared by the standard alkaline lysis method (9).
3.1.2. Initial PCR Screening
1. Take 200 ng of phage or cosmid DNA for each amplification reaction (see Note 2).
2. The amplifications are performed with three combinations of primers on each DNA sample:
a. A (dC-dA) or (dG-dT) oligonucleotide alone;
b. A repetitive element primer (see Note 3) alone; and
c. A combination of both.
3. Aliquot the following for each PCR reaction (25 µL total): 2.5 µL of 25 mM MgSO4, 2.5 µL of 10X buffer, x µL of H2O, 2.5 µL of each primer (10 µM stock), 0.25 µL 25 mM dNTPs, 0.25 µL Taq polymerase, and y µL (200 ng) phage DNA.
4. The PCR amplification is as follows: 1 cycle at 94°C for 5 min; 35 cycles, each at 94°C for 1 min, 65°C for 1 min, and 72°C for 3 min; and 1 cycle at 72°C for 10 min.
5. Products may be checked on a 2% agarose gel for bands that appear in the sample that had both the (dC-dA) or (dG-dT) oligonucleotide and the Alu/LINE primer, but did not appear in either of the other samples (see Fig. 1).
6. The PCR products are then cloned using the TA cloning kit and sequenced to obtain primers for amplification of the dinucleotide polymorphism.
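The reaction setup in step 3 and the cycling program in step 4 can be checked with a few lines of arithmetic. In the Python sketch below, the function names are hypothetical, the 2 µL template volume is an assumed value for y (for example, 200 ng at 100 ng/µL), and the run time counts hold times only, ignoring temperature ramping.

```python
def final_concentration(stock_conc, stock_volume_ul, total_volume_ul):
    """Final concentration (same units as the stock) after dilution."""
    return stock_conc * stock_volume_ul / total_volume_ul

def water_volume(total_volume_ul, component_volumes_ul):
    """Water (the x uL of step 3) needed to reach the total reaction volume."""
    water = total_volume_ul - sum(component_volumes_ul)
    if water < 0:
        raise ValueError("components exceed the total reaction volume")
    return water

def cycling_minutes(initial_min, cycles, step_minutes, final_min):
    """Programmed hold time of the thermal profile, ignoring ramp times."""
    return initial_min + cycles * sum(step_minutes) + final_min

if __name__ == "__main__":
    total = 25.0
    # MgSO4, 10X buffer, two primers, dNTPs, Taq, and an assumed 2 uL of template.
    components = [2.5, 2.5, 2.5, 2.5, 0.25, 0.25, 2.0]
    print("water:", water_volume(total, components), "uL")
    print("MgSO4:", final_concentration(25, 2.5, total), "mM")
    print("each primer:", final_concentration(10, 2.5, total), "uM")
    print("dNTPs:", final_concentration(25, 0.25, total), "mM")
    print("hold time:", cycling_minutes(5, 35, [1, 1, 3], 10), "min")
```

For single-primer control reactions, the omitted primer volume is made up with water, which the same helper computes.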
3.2. Development of PCR to Detect the Novel Polymorphism
3.2.1. Primer Design
The amplification product thus generated yields the sequence between the repetitive element and the dinucleotide repeat. In order to design primers that will amplify the dinucleotide polymorphism, we need to obtain sequence information on both sides of the dinucleotide repeat. This is readily achieved by direct sequencing of the cosmid subclone or the λ phage DNA with a sequencing primer derived from step 6 of Section 3.1.2. Then, the contiguous sequence including the dinucleotide repeat is analyzed using a computer program for primer selection (GCG or OLIGO) to identify a pair of primers (18-23 nucleotides in length) that will yield a product of 100-250 bp (see Note 4). The specificity of the primer pair is first tested on normal human DNA.
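The length and product-size criteria given above can be checked programmatically once candidate primers are in hand. The Python sketch below is illustrative only: it is not the algorithm used by GCG or OLIGO, the function names and example sequences are hypothetical, and the melting temperature uses the rough Wallace rule (2°C per A/T, 4°C per G/C), which is only a first approximation for short oligonucleotides.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

def wallace_tm(primer):
    """Rough melting temperature (deg C) by the Wallace rule."""
    p = primer.upper()
    return 2 * (p.count("A") + p.count("T")) + 4 * (p.count("G") + p.count("C"))

def product_length(template, forward, reverse):
    """Length of the expected product, or None if either primer does not match."""
    template = template.upper()
    start = template.find(forward.upper())
    site = template.find(reverse_complement(reverse))
    if start < 0 or site < 0 or site < start:
        return None
    return site + len(reverse) - start

def meets_guidelines(forward, reverse, product_bp):
    """Check the 18-23 nucleotide and 100-250 bp criteria from the text."""
    lengths_ok = all(18 <= len(p) <= 23 for p in (forward, reverse))
    return lengths_ok and product_bp is not None and 100 <= product_bp <= 250

if __name__ == "__main__":
    # Hypothetical primers and a synthetic template containing a (CA)20 repeat.
    fwd = "ACGTGCTAGCTAGGATCCAT"
    rev = "GGCTTAACCGGTTAACGTCA"
    template = fwd + "A" * 30 + "CA" * 20 + "T" * 30 + reverse_complement(rev)
    size = product_length(template, fwd, rev)
    print(size, wallace_tm(fwd), wallace_tm(rev), meets_guidelines(fwd, rev, size))
```

Primer pairs with similar estimated melting temperatures are generally easier to optimize, so the two Wallace values are worth comparing before ordering.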
3.2.2. PCR
1. PCR conditions will vary depending on the primer pair used. However, we have found that 30 cycles of 94°C for 1 min, 55°C for 1 min, and 72°C for 1 min is usually a good starting place.
2. Label one of the primers with 32P γ-dATP: 10 µL γ-dATP (6000 Ci/mmol), 1.5 µL 10X T4 polynucleotide kinase buffer, 2.5 µL primer (10 µM stock), and 1.0 µL T4 kinase. Incubate at 37°C for 60 min, and inactivate the enzyme at 94°C for 10 min.
3. Aliquot the following for each PCR reaction (25 µL total): 2.5 µL of 25 mM MgSO4, 2.5 µL of 10X buffer, x µL of H2O, 0.4 µL of 32P-labeled primer, 2.5 µL of each primer (10 µM stock), 0.25 µL 25 mM dNTPs, 0.25 µL Taq polymerase, and y µL DNA (see Note 5).
4. After PCR, an equal volume of formamide containing 0.05% bromophenol blue and xylene cyanol is added to the PCR mixture.
5. Two microliters from each sample are run on a 6 or 8% denaturing polyacrylamide gel.
6. Gels are dried and then exposed to Kodak XAR-5 autoradiography film for 2 h to overnight, depending on the yield of PCR product (see Notes 6-9).
3.3. Subchromosomal Localization Using Somatic Cell Hybrids, Radiation Hybrids, and Patients with Deletions or Amplification
Now that the dinucleotides have been characterized, it is possible to derive the overlaps between somatic cell hybrids, radiation hybrids, and patients with deletions or amplifications. Overlaps are determined by the amplification, or lack thereof, of the specific dinucleotide (Fig. 2).
3.3.1. Screening of Somatic Cell Hybrids and Radiation Hybrids
Regular DNA extraction with Hirt extraction buffer followed by phenol and chloroform extractions and ethanol precipitation yields DNA ready for PCR. However, when the sample is limiting, the cells (1 x 10^5) can be lysed in 0.5 mL of PBS with proteinase K at 50°C for 10 min, boiled for 2 min, and centrifuged in a microfuge, and a 2-µL aliquot of the supernatant can be used per amplification reaction (see Notes 10 and 11).
3.3.2. Patients with Acquired Somatic Deletions or Amplifications
Interstitial deletions are one of the most common chromosomal abnormalities associated with human cancer. Using cells with a well-characterized contiguous deletion, it is possible to order dinucleotide repeats on the chromosome. PCR amplification and product resolution are done as detailed in Section 3.2., using 50 ng of DNA. The order of dinucleotide repeats (from centromere to telomere) is determined relative to known dinucleotide polymorphisms by the presence of one or two alleles in the malignant population (see Note 12).
3.4. Physical Mapping Using the CEPH Mega YAC Library
3.4.1. Identification of YAC Coordinates
The second-generation genetic map and the first-generation physical map of the human genome are good resources for fine mapping of novel loci (2,7). The CEPH database is readily accessible and one can obtain a tiling path of
YACs for several chromosomes. A contig of YACs for a given chromosomal subband can be requested from one of the several CEPH mega YAC repositories by specifying the coordinates of the desired YACs (see Note 13).
3.4.2. PCR Screening
A collection of YAC slants spanning the locus of interest will be the starting point for the final physical localization of the novel dinucleotide polymorphism. Five-milliliter cultures of YACs are grown overnight in YPD medium, and the DNA is isolated by quick lysis (10) or minipreparations as detailed elsewhere (11). The YAC panel is then screened for the presence of the novel locus as described in the previous sections (see Note 14).
3.4.3. Ordering Within the YAC Contig
Integration of the novel locus with the genetic and physical map of the genome allows localization of the genes that were the starting point for isolation of the new locus between the flanking dinucleotide polymorphisms (see Note 15 and Fig. 3).
Fig. 2. PCR screening of somatic cell hybrid panel containing subchromosomal fragments. Presence of amplification products in hybrids A, C, and E localizes the novel polymorphism to the smallest region of overlap (region of overlap).
Fig. 3. PCR screening of YACs. YACs A, B, C, and D are screened for the novel locus Y and previously mapped loci X and Z. Overlapping amplification patterns allow localization of the novel locus Y between loci X and Z.
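The logic of ordering a novel locus within a YAC contig from presence/absence PCR data can be sketched as follows. This Python illustration is an assumption-laden simplification: the function name and the YAC content are hypothetical, it assumes that every YAC carries one contiguous, unrearranged genomic segment (which Note 15 cautions is not always true), and it enumerates all permutations, so it is practical only for a handful of loci.

```python
from itertools import permutations

def consistent_orders(yac_content, loci):
    """Return locus orders in which every YAC covers a contiguous run of loci.

    yac_content maps a YAC name to the set of loci it is PCR-positive for.
    Mirror-image orders are reported once, since orientation is not determined.
    """
    orders = []
    for order in permutations(loci):
        if order[0] > order[-1]:
            continue  # skip the mirror image of an order already considered
        position = {locus: i for i, locus in enumerate(order)}
        ok = True
        for hits in yac_content.values():
            idx = sorted(position[locus] for locus in hits if locus in position)
            # Positive loci must occupy adjacent positions in the proposed order.
            if idx and idx[-1] - idx[0] + 1 != len(idx):
                ok = False
                break
        if ok:
            orders.append(order)
    return orders

if __name__ == "__main__":
    # Hypothetical screening results for four YACs and three loci.
    yacs = {"A": {"X"}, "B": {"X", "Y"}, "C": {"Y", "Z"}, "D": {"Z"}}
    print(consistent_orders(yacs, ["X", "Y", "Z"]))
```

With this example content, the only consistent order places Y between X and Z. If more than one order is returned, additional YACs or markers are needed to resolve the ambiguity.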
4. Notes
1. If there is no amplification with any of the primers, the template DNA may contain organic contaminants. Repeat the phenol extraction once, follow with ethanol precipitation, and wash twice with an excess of 70% ethanol.
2. The concentration of template DNA is critical in these experiments. It may be necessary to titrate each template using a range of concentrations (50-500 ng).
3. The repetitive element (Alu/LINE) primers may be purchased from American Type Culture Collection (Rockville, MD). The primers can be synthesized if an oligo synthesizer is available.
4. If the dinucleotide repeat is flanked by repetitive elements on either side, the PCR primers should be designed from unique regions within the repeat elements. If one of the flanking loci is a repetitive element and the other is not, the primer for the repetitive element side should be from unique segments within the element.
5. The suggested amounts for the PCR reactions vary depending on the source of the DNA: 50-100 ng somatic cell hybrids, 500 ng-1 µg YACs, 100-200 ng cosmid, and 100-200 ng λ phage.
6. If you obtain a large number of nonspecific bands, try increasing the annealing temperature by 1°C increments or decreasing the template concentration.
7. The amplification products obtained from total human DNA typically consist of two major bands (alleles). Each allele is followed by a cluster of minor (shadow) bands, which are 2 bp apart owing to strand slippage in early cycles of PCR (12).
8. Typically the smaller of the two alleles yields a stronger PCR product and has fewer shadow bands (5). When the two alleles are 2 bp apart, the major shadow band from the larger allele overlaps with the smaller allele, resulting in a strong signal for the second allele.
9. With practice, one can readily discern the major alleles from the shadow bands.
10. The monochromosomal hybrids may have to be grown under special conditions. It is important to verify the tissue-culture requirements before ordering the hybrids.
11. The Mutant Cell Repository also sells DNA from hybrids. These can be purchased if your need is minimal.
12. It is critical to establish the contiguity of deletions or amplifications before employing material from cancer patients, since it is common to find complex interstitial breaks and deletions. This can be accomplished by analyzing for allele loss of the dinucleotide markers available from the first-generation physical map of the human genome (5).
13. The CEPH database can be readily accessed under the directory MAPS through the World Wide Web (http://www.cephb.fr/bio/ceph-genethon-map.html).
14. Since the technique is highly sensitive, it is important to take extreme care during the growth of the yeast cells and DNA isolation to minimize crosscontamination.
15. YACs undergo deletions and rearrangements frequently, although this has not been a major problem in contig building in our experience. It is best to map a given locus with a panel of a minimum of 5-8 YACs.
Acknowledgments
The work in the authors' laboratory was supported by grants from the National Institutes of Health (CA55164) and the American Cancer Society (DHP-44) to L. Nagarajan. We thank Jerry Donovan for excellent technical assistance.
References
1. Weber, J. L. (1990) Informativeness of human (dC-dA)n (dG-dT)n polymorphisms. Genomics 7, 524-530.
2. Weissenbach, J., Gyapay, G., Dib, C., Vignal, A., Morisette, J., Millasseau, P., Vaysseix, and Lathrop, M. (1992) A second-generation linkage map of the human genome. Nature 359, 794-801.
3. Mutirangura, A., Greenberg, F., Butler, M. G., Malcolm, S., Nicholls, R. D., Chakravarti, A., and Ledbetter, D. H. (1993) Multiplex PCR of three dinucleotide repeats in the Prader-Willi/Angelman critical region (15q11-q13): molecular diagnosis and mechanism of uniparental disomy. Hum. Mol. Genet. 2, 143-151.
4. Lupski, J. R., Montes de Oca-Luna, R., Slaugenhaupt, S., Pentao, L., Guzzetta, V., Trask, B. J., Saucedo-Cardenas, O., Barker, D. F., Killian, J. M., Garcia, C. A.,
Dinucleotide Polymorphism Analyses
5
6.
7. 8.
9.
10.
11.
12
157
Chakravarti, A., and Patel, P I. (199 1) DNA duplication associated wtth CharrotMarie-Tooth disease type 1A Cell 66,2 19-232. Fatrman, J , Claxton, D , Willman, C. L., Deisseroth, A. B., and Nagarajan, L. (1994) Development of a sensitive PCR to detect allele loss m a model hematopoletic neoplasm PCR Methods Appllc 4,6-12. Fairman, J., Chumakov, I., Chinault, C., Nowell, P. C., andNagaraJan, L (1995) Physical mapping of the critical 5q31 locus. Proc Natl. Acad Scz USA 92, 7406-74 10 Cohen, D., Chumakov, I., and Wetssenbach, J (1993) A first-generation map of the human genome. Nature 366,698-70 1. Donovan, J., Lu, X., and Nagarajan, L. (1993) Purification of Bacteriophage Lambda DNA from a large number of recombinant clones BzoTechnlques 15, 602-604. Sambrook, J , Fritsch, and Maniatis, (1989) Extraction and purification of plasmid DNA, in Molecular Cloning A Laboratory Manual. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, pp. 1 25-1.28. Kwtatkowski, T. J., Jr, Zoghbi, H. Y., Ledbette, S. A., Elhson, K. A., and Chmault, A. C. (1990) Rapid identification of yeast artificial chromosome clones by matrix pooling and crude lysate PCR Nuclezc Aczds Res 18, 7 191 Philippsen, P., Stotz, A., and Scherf, C. (1991) DNA of Saccharomyces Cerevzszae, m Guide to Yeast Genetzcs (Guthrie, G and Fink, R. J., eds.), Academic, San Diego, CA, pp. 169-181. Hauge, X Y. and Litt, M. (1993) A study of the origin of “shadow bands” seen when typing dinucleotide repeat polymorphisms by the PCR. Hum. Mol. Genet 2, 411415
11 Mapping Expressed Sequence Tags (ESTs) by Multiplexing PCR Reactions from Hybrid Cell Panels and Detecting Fluorescently Labeled Products
A. Scott Durkin, Donna R. Maglott, and William C. Nierman
1. Introduction
Determining the chromosomal origin of expressed sequence tags (ESTs) (1,2) lags far behind their identification in single-pass sequencing projects (1-10). Positional cloning of disease genes requires that previously uncharacterized transcripts be mapped to the smallest possible defined region. We have developed an efficient polymerase chain reaction (PCR)-based procedure for the rapid assignment of ESTs to human chromosome regions (11,12; Fig. 1). The critical features of the method are:
1. Standard, restricted criteria for primer design;
2. Sensitive, automated analysis of fluorescently labeled PCR products;
3. Standard PCR conditions; and
4. Multiplexed PCR reactions.
Primers for PCR reactions are designed from ESTs using narrow windows for primer Tm, primer base composition, and amplified product size (Fig. 1, “Primer Design”). These primers are then tested using standard reaction conditions (see Section 3.3.) for generating a product from human genomic DNA that matches the size predicted by the EST sequence (Fig. 1, “Primer Proving”). Successful primers are combined so that multiplexed PCR products can be resolved on the basis of product size and fluorescent label, and are used with DNA templates from somatic cell hybrid mapping panels. Products from several PCR reactions are also pooled before electrophoretic analysis. Chromosomal and subregional assignments are made by discordancy analysis.
From: Methods in Molecular Biology, Vol. 68: Gene Isolation and Mapping Protocols. Edited by J. Boultwood, Humana Press Inc., Totowa, NJ
Fig. 1. Strategy for mapping ESTs using PCR. (Flowchart; recoverable box labels: cDNA sequence data, primer design, chromosomal assignment, subregional mapping reactions, discordancy analysis, subregional assignment.)
Our approach is effective if the first multiplex PCR attempt results in a chromosome assignment. Multiplexing and pooling reduce the number of PCR reactions and electrophoretic analyses only if assignments can be made with regularity based solely on initial trials. Having to pass through the “ambiguous” loop (Fig. 1) decreases efficiency. Previous approaches to multiplexing PCR reactions (13-18) involved empirical optimization of reaction conditions and primer combinations for simultaneous amplification of a defined set of multiple products for repeated assays. We, however, typically use a primer pair designed from a cDNA sequence in only two or three sets of PCR reactions (Fig. 1). The first set of PCR reactions determines primer pair success in amplifying a size-specific product in human DNA distinct from potential rodent background. The second set of reactions determines the amplification pattern of template DNA samples from either human/mouse or human/hamster hybrid cell lines of known human karyotype. The third set of reactions determines subregional localization, using template DNA samples from somatic cell hybrids containing known subregions of a single human chromosome. Optimizing reaction conditions and primer combinations for each primer pair for such a limited number of reactions would thus require more effort than mapping the cDNA sequences using only one pair per reaction. Instead, we design primer pairs using narrowly defined parameters and determine success using standard reaction conditions (11,12,19) in primer proving reactions. We then combine primer pairs for multiplex mapping reactions based on the fluorescent dye label and the amplified product size.
2. Materials
2.1. Primer Design
1. Software that permits the design of primers based on target boundaries of the sequence to be mapped, length of primers, Tm of primer, and product size: We used Primer (v. 0.5), available from the Whitehead Institute for Biological Research (Cambridge, MA) (URL http://www-genome.wi.mit.edu/ftp/distribution). A newer version (2.2) can also be executed from that site (see Note 1).
2. Electronic records of sequences to be mapped.
2.2. Templates
1. DNA from somatic cell hybrids, frozen in aliquots for 200 reactions (200 µL, 50 ng/µL). To assign sequences to a human chromosome, we used NIGMS human/rodent somatic cell hybrid mapping panels 1 and 2 (Coriell Institute, Camden, NJ; 20,21) and PCRable DNA (BIOS Corporation, New Haven, CT; 22). The NIGMS panel 1 was supplemented with a human 21-only cell line (GM10323 from NIGMS panel 2).
2. DNA from each genome used in the construction of the hybrid cell panels, also frozen in aliquots of 200 µL at 50 ng/µL.
2.3. PCR Reactions
1. Oligonucleotide primers (50 ng/µL) with one member of the pair labeled with an ABI (Applied Biosystems Inc., Foster City, CA) fluorescent dye at the 5’-end (HEX, 6-FAM, TAMRA [not used], ROX [reserved for the standard]).
2. Thermal cycler.
3. 10X PCR buffer (Perkin-Elmer, N808-0006, Roche Molecular Systems, Branchburg, NJ): 500 mM KCl, 100 mM Tris-HCl, pH 8.3, 15 mM MgCl2, 0.01% (w/v) gelatin.
4. AmpliTaq AS (Perkin-Elmer, N808-0070, 5 U/µL).
5. dNTPs: 2.5 mM each dATP, dCTP, dGTP, dTTP.
2.4. Product Analysis
1. Instrumentation and software permitting resolution of products by size and dye label. We used an ABI 373A sequencer (separation by vertical, denaturing polyacrylamide gel electrophoresis) and GENESCAN software.
2. Mol-wt standards (ABI GENESCAN[ROX], 401100).
3. Formamide, deionized (Life Technologies, Gaithersburg, MD, 155 15-O IS).
4. Acrylamide stock solutions (Bio-Rad Laboratories, Richmond, CA, 161-0144, 40% acrylamide/bis-acrylamide 19:1 [5% C] stock).
5. Urea (Life Technologies, 5505UX).
3. Methods
3.1. Primer Design
1. Select a target sequence range that excludes the end of the single-pass sequence where the accuracy of base calls may not have been as high (typically the first 300 bp of the single-pass sequence data).
2. Consider selecting a target range that excludes all repetitive elements and avoids coding regions (see Note 2). Grail (grail@ornl.gov) may be used to predict coding regions.
3. Select narrow ranges of acceptable values for primer selection to increase success rates under standard conditions. We used 50% GC, primer Tm of 55-59°C, amplified product size of 80-150 bp, and primers of 18-22 bp.
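The narrow design windows in step 3 can be expressed as a simple filter. The following is a minimal sketch, not the Primer program itself: the Tm estimate uses the rule-of-thumb of 2°C per A/T and 4°C per G/C, which is an assumption standing in for whatever model Primer (v. 0.5) applies, and the tolerance around 50% GC (`gc_tol`) is a hypothetical parameter because the text gives only the target value.

```python
def gc_percent(primer: str) -> float:
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def rule_of_thumb_tm(primer: str) -> int:
    """Approximate Tm in degC: 2 per A/T plus 4 per G/C.

    Assumption: a stand-in estimate, not the Tm model of the Primer software.
    """
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def primer_ok(primer, gc_target=50.0, gc_tol=5.0,
              tm_range=(55, 59), len_range=(18, 22)):
    """Apply the windows from step 3; gc_tol is a hypothetical tolerance."""
    return (len_range[0] <= len(primer) <= len_range[1]
            and abs(gc_percent(primer) - gc_target) <= gc_tol
            and tm_range[0] <= rule_of_thumb_tm(primer) <= tm_range[1])


def pair_ok(forward, reverse, product_size, size_range=(80, 150)):
    """Both primers must pass and the product must be 80-150 bp."""
    return (primer_ok(forward) and primer_ok(reverse)
            and size_range[0] <= product_size <= size_range[1])


if __name__ == "__main__":
    candidate = "ACGTACGTACGTACGTAGC"  # illustrative sequence only
    print(len(candidate), round(gc_percent(candidate), 1),
          rule_of_thumb_tm(candidate), primer_ok(candidate))
```

Primers that fail any window are simply discarded, which mirrors the trade-off described in Note 1: some ESTs yield no acceptable primers and are not mapped.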
3.2. Multiplex Strategies and Primer Synthesis/Labeling
1. Identify primer pairs that will be combined for multiplexed PCR reactions based on predicted product size.
2. Products labeled with the same fluor should differ by at least 10 bp, and products labeled with different fluors should differ by at least 5 bp.
3. Label the 5’-end of one member of the primer pair with the selected fluor during oligonucleotide synthesis. This was done using the ABI fluorescent amidites 6-FAM (401527) or HEX (401526) on an ABI 392 synthesizer (see Note 3).
4. Best results are obtained with the 6-FAM and HEX dyes (ROX being reserved for the mol-wt standard).
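The spacing rule in step 2 lends itself to a quick compatibility check before primers are combined. The sketch below assumes each product is described by a predicted size and a dye name; the dictionary layout and the function names are illustrative choices, not part of the published procedure.

```python
from itertools import combinations

SAME_DYE_MIN_BP = 10   # products sharing a fluor must differ by >= 10 bp
DIFF_DYE_MIN_BP = 5    # products with different fluors must differ by >= 5 bp


def compatible(a, b):
    """True if two products can be resolved in the same lane."""
    gap = abs(a["size"] - b["size"])
    needed = SAME_DYE_MIN_BP if a["dye"] == b["dye"] else DIFF_DYE_MIN_BP
    return gap >= needed


def multiplex_ok(products):
    """True if every pair of products in the proposed set is compatible."""
    return all(compatible(a, b) for a, b in combinations(products, 2))


if __name__ == "__main__":
    # Hypothetical products; names and sizes are made up for illustration.
    proposed = [
        {"name": "est_a", "dye": "6-FAM", "size": 100},
        {"name": "est_b", "dye": "6-FAM", "size": 112},
        {"name": "est_c", "dye": "HEX", "size": 106},
    ]
    print(multiplex_ok(proposed))
```

The same check applies when products from separate reactions are pooled into one gel lane, since the resolution constraint is on what is loaded together.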
3.3. Primer Testing PCR Reactions and Multiplex PCR Reactions (see Notes 4 and 5)
1. Reaction mix (per 15-µL vol): Prepare as a master mix for 25-35 reactions, with template added individually: 1.0 µL template stock solution (50 ng), 0.8 µL primer 1 (40 ng), 0.8 µL primer 2 (40 ng), 0.12 µL AmpliTaq AS (0.6 U), 0.08 µL dNTP (200 µM of each dNTP), 1.5 µL 10X buffer.
2. Thermal profile: 95°C for 5 min; 25 cycles of 94°C for 1.4 min, 55°C for 2 min, and 72°C for 2 min; final extension at 72°C for 10 min.
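Scaling the reaction mix to a master mix is plain multiplication, sketched below. The per-reaction volumes are copied as printed in step 1 and should be checked against your own stock concentrations. Making each reaction up to 15 µL with water is an assumption, since the step does not name a diluent, and the function name is hypothetical.

```python
# Per-reaction volumes in microliters, as printed in step 1.
# Template (1.0 uL) is added to each tube individually, not to the master mix.
PER_REACTION_UL = {
    "primer 1": 0.8,
    "primer 2": 0.8,
    "AmpliTaq": 0.12,
    "dNTP": 0.08,
    "10X buffer": 1.5,
}
TEMPLATE_UL = 1.0
REACTION_VOLUME_UL = 15.0


def master_mix(n_reactions):
    """Return total volume of each master-mix component for n reactions."""
    mix = {name: round(vol * n_reactions, 3)
           for name, vol in PER_REACTION_UL.items()}
    # Assumption: the remaining volume is made up with water.
    water_each = (REACTION_VOLUME_UL - TEMPLATE_UL
                  - sum(PER_REACTION_UL.values()))
    mix["water (assumed)"] = round(water_each * n_reactions, 3)
    return mix


if __name__ == "__main__":
    for component, volume in master_mix(30).items():
        print(f"{component:16s} {volume:8.2f} uL")
```

Each tube then receives 14 µL of master mix plus 1 µL of its own template.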
3.4. Sample Analysis (see Note 6)
1. Mix 1 µL amplification reaction, 0.5 µL size standard, 3.5 µL formamide (Note 7).
2. Heat 2 min at 95°C.
3. Load onto gel. We use the 24-well comb with a denaturing (8M urea) 6% polyacrylamide gel (5% crosslinked). The gel is run for 6 h (limits 2500 V, 40 mA, 30 W at ambient temperature) and is scanned at 25 cm from the well.
4. Determine products and their sizes using GENESCAN software.
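The loading arithmetic behind step 1 and Note 7 can be checked with a few lines. This sketch assumes the formamide volume is 3.5 µL and the size standard 0.5 µL per lane, with 1-µL aliquots of each pooled PCR reaction; the 50% minimum and the 8-µL well capacity are the figures given in Note 7. Function names are illustrative.

```python
def formamide_fraction(n_pcr_aliquots, aliquot_ul=1.0,
                       standard_ul=0.5, formamide_ul=3.5):
    """Return (formamide fraction, total sample volume in uL)."""
    total = n_pcr_aliquots * aliquot_ul + standard_ul + formamide_ul
    return formamide_ul / total, total


def max_pooled_aliquots(min_fraction=0.5, well_capacity_ul=8.0):
    """Largest number of 1-uL PCR aliquots that keeps formamide at or
    above min_fraction without exceeding the well capacity."""
    n = 0
    while True:
        fraction, total = formamide_fraction(n + 1)
        if fraction < min_fraction or total > well_capacity_ul:
            return n
        n += 1


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        fraction, total = formamide_fraction(n)
        print(n, f"{fraction:.3f}", total)
    print("max pooled:", max_pooled_aliquots())
```

Under these assumptions a single reaction is loaded at 70% formamide and three pooled reactions at exactly 50%, which is why pooling stops at three.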
3.5. Determining Chromosome Assignment
1. Identify chromosomes for which the discordancy between the presence of a PCR product and the presence of a chromosome is 19% (Note 8).
2. Exclude cell lines reported as containing a particular chromosome in fewer than 12% of metaphases.
3. There must be a difference of at least two discordant events between the best candidate assignment and the second best candidate assignment.
4. If results do not meet steps 1-3, select cell lines from another panel that permit resolution of ambiguity, execute PCR reactions using these templates, and recalculate the discordancy.
5. Consider also the use of monochromosomal panels.
6. When the chromosome assignment is made using a particular primer pair, other mapping panels permitting subregional assignments can be used as templates. Because these panels are smaller and less redundant than the panels for the entire genome, we usually make subregional assignments only when there are no discordant cell lines for the assigned region.
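Discordancy analysis as described in steps 1 and 3 can be sketched as follows. This is an illustration, not the authors’ software (Note 8). A cell line is counted as discordant for a chromosome when the PCR product is present but the chromosome is absent, or the reverse. Treating the printed 19% figure as an upper bound (`max_fraction`) is an assumption, the metaphase-frequency exclusion of step 2 is not modeled, and the toy panel below is invented for the example.

```python
def discordant_counts(panel, pcr_positive, chromosomes):
    """Count discordant cell lines per chromosome.

    panel: dict mapping cell line -> set of human chromosomes it retains.
    pcr_positive: dict mapping cell line -> True if the product was seen.
    """
    return {
        chrom: sum(1 for line, retained in panel.items()
                   if (chrom in retained) != pcr_positive[line])
        for chrom in chromosomes
    }


def assign_chromosome(panel, pcr_positive, chromosomes,
                      max_fraction=0.19, min_margin=2):
    """Return the assigned chromosome, or None if the result is ambiguous.

    max_fraction is an assumed reading of the threshold in step 1;
    min_margin is the two-event separation required by step 3.
    """
    counts = discordant_counts(panel, pcr_positive, chromosomes)
    ranked = sorted(counts.items(), key=lambda item: item[1])
    (best, best_n), (_, second_n) = ranked[0], ranked[1]
    if best_n / len(panel) <= max_fraction and second_n - best_n >= min_margin:
        return best
    return None


if __name__ == "__main__":
    # Invented six-line panel covering three chromosomes.
    panel = {"A": {"1", "2"}, "B": {"2"}, "C": {"2", "3"},
             "D": {"1"}, "E": {"3"}, "F": {"1", "3"}}
    results = {"A": True, "B": True, "C": True,
               "D": False, "E": False, "F": False}
    print(discordant_counts(panel, results, ["1", "2", "3"]))
    print(assign_chromosome(panel, results, ["1", "2", "3"]))
```

A `None` result corresponds to the “ambiguous” loop: additional cell lines are selected and the counts recomputed.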
4. Notes
1. We adopted the strategy of identifying primer/template combinations that are successful in a single, narrowly defined reaction environment. This strategy gives reasonable success in multiplexing without requiring optimization of conditions for each set of sequences to be mapped. The trade-off is that the restrictive criteria for primer design do not permit identifying primers for approx 30% of the ESTs tested, and those ESTs do not get mapped. The total mapped, however, remains high, because we can quickly identify additional sequences that have not been mapped and for which primers can be designed.
2. The success of PCR in amplifying unique (mappable) sequences from the human genome seemed to require only that at least one of the primers be from a single-copy sequence. We have designed successful primer pairs that recognize repetitive elements and include repetitive elements in the product. Difficulties may arise, however, from dilution of one primer relative to the other based on annealing to many genomic sequences, or from trying to use the amplification product containing repeat sequences as a hybridization probe in further experimentation.
3. We have used other methods of generating labeled product. For example, we have labeled primers by incorporating Aminolink II at the 5’-end during synthesis and then coupling that to a fluorescent dye using procedures provided by ABI (Applied Biosystems User Bulletin, Issue 11, 1989). This required an additional step in the process that was eliminated when labeled amidites became available. We also tested labeling products by using fluorescein-dUTP precursors for the PCR amplification. We found it difficult to generate reproducible incorporation, which led to varying product sizes and detection sensitivities.
4. Before executing the PCR reactions using mapping panels as templates, the primers should be tested for utility in directing the amplification of a product of the expected size, and distinguishable from possible products from rodent genomes. We did not multiplex these reactions, and found about 25% of primer pairs to fail at this point, either because of no amplification from any template (12%), a product from the human genome of unpredicted size (6%), or indistinguishable products from all genomes (7%). It should be noted that an unexpected, larger product may be mappable. We tested a few such products and discovered by sequencing that the primers were amplifying across a small intron. The primer testing reactions can also be used to select the appropriate mapping panel, since some panels are primarily hamster background (NIGMS) and some are primarily mouse background (Bios).
5. We have tested production modes of both duplex PCR reactions (four primers, two different products) and triplex PCR reactions (six primers, three different products) using these conditions. Ninety percent of duplex reactions and 73% of triplex reactions could be used to make unambiguous chromosome assignments. Some of the assignments that could not be made from multiplexed reactions could be made from uniplex reactions. We found, however, that duplexing reactions using proven primers with mapping panels increases our productivity.
6. The sensitivity of the detection system is advantageous because it permits using a 25-cycle PCR reaction. When we retested 25 primers that had permitted an unambiguous chromosome assignment using only one mapping panel and automated fluorescence detection, only 40% of the pairs were effective when we increased the cycle number to 35 for ethidium bromide detection.
7. The sample formamide concentration should be at least 50%. It is possible to pool three 1-µL aliquots from three different PCR amplifications and still maintain a 50% formamide concentration. It is not practical to pool more than three PCR products, since the wells generated by the 24-well comb hold a maximum sample volume of 8 µL.
8. We developed software to automate the discordancy analysis. The presence of a product of the correct size for a template was compared to the chromosomes and chromosome portions represented by that template. We found the redundancy of NIGMS panel 1 useful for the initial chromosome assignments.
Acknowledgments
We acknowledge Michael Graham, Jun-Lin Mao, and Min Lee for expert technical assistance; Mark Adams, The Institute for Genomic Research (Rockville, MD), and James Sikela, University of Colorado (Boulder, CO), for sharing the sequence information; and Kathleen Meyer, ATCC, for laboratory database implementation, including automation of discordancy analysis. This work was supported by DE-FG05-4 lER6 I232 from the Department of Energy.
References
1. Adams, M. D., Kelley, J. M., Gocayne, J. D., Dubnick, M., Polymeropoulos, M. H., Xiao, H., Merril, C. R., Wu, A., Olde, O., Moreno, R. F., Kerlavage, A. R., McCombie, W. R., and Venter, J. C. (1991) Complementary DNA sequencing: expressed sequence tags and human genome project. Science 252, 1651-1656.
2. Wilcox, A. S., Khan, A. S., Hopkins, J. A., and Sikela, J. M. (1991) Use of 3’ untranslated sequences of human cDNAs for rapid chromosome assignment and conversion to STSs: implications for an expression map of the genome. Nucleic Acids Res. 19, 1837-1843.
3. Hoog, C. (1991) Isolation of a large number of novel mammalian genes by a differential cDNA library screening strategy. Nucleic Acids Res. 19, 6123-6127.
4. Adams, M. D., Dubnick, M., Kerlavage, A. R., Moreno, R., Kelley, J. M., Utterback, T. R., Nagle, J. W., Fields, C., and Venter, J. C. (1992) Sequence identification of 2,375 human brain genes. Nature 355, 632-634.
5. Adams, M. D., Kerlavage, A. R., Fields, C., and Venter, J. C. (1993) 3400 expressed sequence tags identify diversity of transcripts from human brain. Nature Genet. 4, 256-267.
6. Adams, M. D., Soares, M. B., Kerlavage, A. R., Fields, C. F., and Venter, J. C. (1993) Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nature Genet. 4, 373-380.
7. Gieser, L. and Swaroop, A. (1992) Expressed sequence tags and chromosomal localization of cDNA clones from a subtracted retinal pigment epithelium library. Genomics 13, 873-876.
8. Khan, A. S., Wilcox, A. S., Polymeropoulos, M. H., Hopkins, J. A., Stevens, T. J., Robinson, M., Orpana, A. K., and Sikela, J. M. (1992) Single pass sequencing and physical and genetic mapping of human brain cDNAs. Nature Genet. 2, 180-195.
9. Okubo, K., Hori, N., Matoba, R., Niiyama, T., Fukushima, A., Kojima, Y., and Matsubara, K. (1992) Large scale cDNA sequencing for analysis of quantitative and qualitative aspects of gene expression. Nature Genet. 2, 173-179.
10. Takeda, J., Yano, H., Eng, S., Zeng, Y., and Bell, G. I. (1993) A molecular inventory of human pancreatic islets: sequence analysis of 1000 cDNA clones. Hum. Mol. Genet. 2, 1793-1798.
11. Durkin, A. S., Maglott, D. R., and Nierman, W. C. (1992) Chromosomal assignment of 39 human brain expressed sequence tags (ESTs) by analyzing fluorescently-labeled PCR products from hybrid cell panels. Genomics 14, 808-810.
12. Maglott, D. R., Durkin, A. S., and Nierman, W. C. (1994) 259 human brain expressed sequence tags (ESTs): chromosome localization, subregional assignment, and sequence analysis, in Identification of Transcribed Sequences (Hochgeschwender, U. and Gardiner, K., eds.), Plenum, New York, pp. 273-288.
13. Chamberlain, J. S., Gibbs, R. A., Ranier, J. E., Nguyen, P.-N., and Caskey, C. T. (1988) Deletion screening of the Duchenne muscular dystrophy locus via multiplex DNA amplification. Nucleic Acids Res. 16, 11,141-11,156.
14. Gibbs, R. A., Nguyen, P., Edwards, A., Civitello, A. B., and Caskey, C. T. (1990) Multiplex DNA deletion detection and exon sequencing of the hypoxanthine phosphoribosyltransferase gene in Lesch-Nyhan families. Genomics 7, 235-244.
15. Edwards, A., Civitello, A., Hammond, H. A., and Caskey, C. T. (1991) DNA typing and genetic mapping with trimeric and tetrameric tandem repeats. Am. J. Hum. Genet. 49, 746-756.
16. Morral, N. and Estivill, X. (1992) Multiplex PCR amplification of three microsatellites within the CFTR gene. Genomics 13, 1362-1364.
17. Schwartz, L. S., Tarleton, J., Popovich, B., Seltzer, W. K., and Hoffman, E. P. (1992) Fluorescent multiplex linkage analysis and carrier detection for Duchenne/Becker muscular dystrophy. Am. J. Hum. Genet. 51, 721-729.
18. Kimpton, C. P., Gill, P., Walton, A., Urquhart, A., Millican, E. S., and Adams, M. (1993) Automated DNA profiling employing multiplex amplification of short tandem repeat loci. PCR Methods Appl. 3, 13-22.
19. Goold, R. D. (1993) The development of sequence-tagged sites for human chromosome 4. Hum. Mol. Genet. 2, 1271-1288.
20. Drwinga, H. L., Toji, L. H., Kim, C. H., Greene, A. E., and Mulivor, R. A. (1993) NIGMS human/rodent somatic cell hybrid mapping panels 1 and 2. Genomics 16, 311-314.
21. Dubois, B. L. and Naylor, S. L. (1993) Characterization of NIGMS Human/Rodent Somatic Cell Hybrid Mapping Panel 2 by PCR. Genomics 16, 315-319.
22. Fong, D., Smith, D. I., and Hsieh, W. (1991) The human kininogen gene (KNG) mapped to chromosome 3q26-qter by analysis of somatic cell hybrids using the polymerase chain reaction. Hum. Genet. 87, 189-192.
12 Gene Isolation by Exon Trapping
David B. Krizman
1. Introduction
The technology of exon trapping, sometimes called exon amplification, strives to exploit the phenomenon of mRNA splicing to discover genes directly from genomic DNA. There are three distinct exon trapping methodologies that differ simply in the genomic target of interest. The original experimental design was to capture isolated 3’-splice sites residing within fragments of genomic DNA (1), whereas later approaches focused on either complete internal exons (2-5) or entire 3’-terminal exons (6). The requirement of complete, intact exons as targets has proven absolutely essential, and only the trapping of complete internal or 3’-terminal exons is practical. Plasmids, phage, cosmids, P1s, BACs, YACs, or pooled clones of any type containing mammalian genomic DNA can be used as substrate for either exon trapping approach. Both internal and 3’-terminal exon trapping protocols have successfully trapped exons from all of the aforementioned substrates (6-9). Small numbers of exons are trapped from individual clones, such as plasmid, phage, or cosmid clones, whereas larger numbers can be trapped if using collections of clones, such as phage or cosmid pools. Trapping from YACs is desirable when wishing to discover exons from larger genomic regions that may be difficult to cover with the small-insert cloning vectors.
The choice of trapping system to use is an important consideration. The end result from internal exon trapping is usually a single exon that can be quite small (20-200 bases) in length. An advantage to this system is that nearly all of the exonic sequence will contain valuable protein coding information that can be very useful when performing database searches. These searches usually render some homology with known genes and can possibly lend insight into the function of the gene from which the exon was trapped. A disadvantage is that a probe generated from the trapped exon may be very small and difficult to use in cDNA hybridization screening experiments designed to retrieve the full-length cDNA. In contrast, the use of 3’-terminal exon trapping generates sequence derived from the last exon of a gene, and these sequences can range anywhere from 200-5000 bases in length. Most of the sequence from such trapping experiments is 3’-untranslated region and thus contains little, if any, useful protein coding sequence. This is a disadvantage when performing database searches, because few strong homologies will be found. However, this may be a distinct advantage if using this clone in hybridization-based cDNA screening experiments owing to the unique nature of these sequences. Also, these sequences can be very useful for obtaining full-length cDNA clones by the technology of 5’-RACE (10). One of the hallmarks of database search results is the existence of multiple stop codons within 3’-untranslated regions of genes. Thus, most trapped 3’-terminal exons will contain many such codons.
Internal exon trapping is accomplished by the subcloning of restricted fragments of genomic DNA into one of many plasmid-based trapping vectors, each engineered to produce vector-derived mRNA molecules when transfected into specific mammalian cells that support vector transcription. Internal exons present in subcloned genomic fragments are incorporated within the transcription unit of the trapping vector, and subsequently included in a chimeric mRNA that results from splicing and processing of vector-derived nascent transcripts. An exon trapped in such a manner is amplified by reverse transcription of RNA purified from transfected mammalian cells followed by the polymerase chain reaction (RT/PCR). Primers used for the RT/PCR reaction are designed to be specific to mRNA species derived only from the trapping vector. This functions to amplify products specifically from the vector and not endogenous mRNA species from the transfected mammalian cell.
Once the PCR product is generated, it can be subcloned and sequenced for further analysis. A typical internal exon is small (average 127 bp in size), and flanked on the 5’-end by a 3’-splice site (splice acceptor) and on the 3’-end by a 5’-splice site (splice donor). There is usually an open reading frame all the way through the sequence, and most of them will show some degree of homology to known protein coding regions when used to search available sequence databases.
To date, there are four internal exon trapping vectors available, and each vector shares many common features. Most are plasmid-based, and contain the ampicillin resistance gene, bacterial origin of replication, and a trapping cassette consisting of a eukaryotic enhancer/promoter driving transcription of a two-exon transcription unit with a multiple cloning site between the two exons. The first of these exons functions as a 5’-terminal exon, whereas the second exon functions as a 3’-terminal exon capable of directing polyadenylation.
Foreign fragments of genomic DNA are inserted between the two exons, and any internal exons present are subsequently trapped between the two vector exons by the cellular splicing mechanism. In addition to the basic features discussed that are common to all internal exon trapping vectors, a variety of custom modifications have been made to individual vectors. The following internal exon trapping vectors are available:
1. pSPL3: This vector is a modification of the original internal exon trapping vector pSPL1 and is the most commonly used trapping vector (2,7). This plasmid consists of the SV40 early region, rabbit β-globin exons, human HIV/Tat 5’- and 3’-splice sites, HIV/Tat intron, and the SV40 poly(A) signal. A multiple cloning site (MCS) exists within the intron between the two chimeric exons.
2. pL53In: The transcription of this plasmid vector is driven by an RSV-LTR. The first exon is derived from the human phosphatase gene, whereas the second exon and the intervening intron are derivatives of the rat preproinsulin gene. A single unique KpnI site acts as the cloning site within the intron (3).
3. pMHC2: This plasmid vector initiates transcription from the SV40 early region. The trapping cassette consists of part of exon 10, intron 10, and part of exon 11 from the human p53 gene together with the SV40 polyadenylation signal. Foreign DNA is inserted into a unique BglII cloning site within the intron (4).
4. LambdaGET: This is a phage-based vector designed to clone and analyze larger fragments of target genomic DNA than is possible with the plasmid-based trapping vectors (5). This vector is derived from pL53In, and uses the same trapping cassette. This vector offers the added advantage of high-efficiency cloning of foreign target DNA owing to the phage approach of subcloning inserts.
The most commonly used internal exon trapping vector is pSPL3, which is illustrated in Fig. 1. It is 6031 bases in length, and contains the AmpR gene and a bacterial origin of replication for propagation in Escherichia coli. The trapping cassette consists of the SV40 early region to direct transcription and replication in the African green monkey cell line Cos7. This cell line harbors a replication-defective mutant SV40 virus that expresses the large T-antigen, which functions to initiate both replication and transcription from the SV40 early region.
Exon 1 of pSPL3 is a chimeric exon constructed from rabbit β-globin exon sequences and the human HIV/Tat 5’-splice site. The intron is from the HIV/Tat intron, whereas the last exon begins with the HIV/Tat 3’-splice site followed by rabbit β-globin exon sequence and ends with the SV40 poly(A) signal sequence to direct correct cleavage and polyadenylation. BstXI half-sites are present on either side of the intron; these allow removal of molecules not containing an internal exon when digested with this enzyme. An MCS is present within the intron, and the following sites are available for subcloning fragments of genomic DNA: EcoRI, SstI, XhoI, NotI, XmaIII, PstI, BamHI, and EcoRV.
Fig. 1. Diagrammatic representation of internal exon trapping vector pSPL3 (6031 bp).
The experimental use of this vector will be described here in detail; however, the basic protocol for all internal exon trapping vectors is the same. The general approach to the use of pSPL3 is diagrammed in Fig. 2. Genomic fragments of DNA from any source, including individual cosmids, pooled cosmids, P1s, or YACs, are shotgun cloned into one or more of the restriction sites within the MCS and recombinants selected in E. coli. All recombinants are picked, pooled, and transiently transfected into Cos7 cells. Total RNA is collected 24 h later and reverse-transcribed using a vector-specific oligonucleotide to yield first-strand cDNA. The primary round of a nested PCR approach to amplify trapped exons is performed using vector-specific primers for six cycles followed by digestion with BstXI. This digestion functions to remove vector-only splicing events that will lead to false-positive scoring. A secondary PCR is performed using nested vector-specific primers to amplify trapped exons. The secondary PCR primers contain an additional 12 bases at their 5’-ends to specifically enable cloning of the PCR product by UDG-mediated high-efficiency cloning. PCR results are analyzed by agarose gel electrophoresis.
The genomic target of 3’-terminal exon trapping is the last exon of a gene, and trapping is accomplished by ligation of restricted genomic fragments of DNA with the trapping vector pTAG4. Ligation products are subsequently transfected directly to Cos7 cells that are able to support transcription from this vector.
Fig. 2. Schematic representation of internal exon trapping protocol using the vector pSPL3. (Steps shown: cloning of target DNA into MCS; transfection into Cos7 cells; transient expression and RNA processing; total RNA isolation; reverse transcription; PCR 1; digestion with BstXI; PCR 2 amplification; UDG-mediated subcloning of PCR product; sequencing of subclones. BstXI half-sites are either separated by a candidate trapped exon or joined into a BstXI site by splicing of empty vector.)
This vector was engineered to contain a trapping cassette that is an incomplete transcription unit that lacks only a last exon. The foreign DNA fragments are expected to donate a 3’-terminal exon to the vector to complete the transcription unit and generate a stable vector-derived mRNA molecule on transfection into Cos7 cells. Exons trapped in this manner can be amplified by a modification of the RT/PCR reaction termed 3’-rapid amplification of cDNA ends (3’-RACE) (10) that was designed to amplify mRNA species specifically from the 3’-end of the molecule using the poly(A) tail as an anchor. Primers used for the PCR reaction are specific for vector exons, thus imparting the specificity needed when using mRNA preparations from transfected mammalian cells. PCR product is subcloned and sequenced for further analysis.
Fig. 3. Diagrammatic representation of 3’-terminal exon trapping vector pTAG4 (3980 bp).
A typical 3’-terminal exon is larger than an internal exon, averaging 627 bp, and consists of mostly a 3’-untranslated region flanked on the 5’-end by a 3’-splice site (splice acceptor), and on the 3’-end by a cleavage site and the consensus poly(A) signal AATAAA or ATTAAA. On a sequence database search, a 3’-terminal exon will contain little protein coding sequence. Thus, very few sequence homologies based on these regions will be seen. In fact, the existence of stop codons in all frames is suggestive of a 3’-terminal exonic sequence.
There is only one 3’-terminal exon trapping vector, pTAG4, and it is illustrated in Fig. 3. The vector is 3980 bases in length with an AmpR gene and bacterial origin of replication for propagation in E. coli. The trapping cassette consists of the SV40 early region to drive transcription when transfected into Cos7 cells. Exons 1 and 2, as well as intron 1, of pTAG4 are naturally occurring leader exon/intron sequences from the human adenovirus 2 genome. pTAG4 lacks a last exon containing a poly(A) signal. Thus, no mature polyadenylated mRNA is produced from the vector when transfected into Cos7 cells. An MCS resides downstream of the 5’-splice site of exon 2 and contains the following unique restriction enzyme sites: EcoRI, BamHI, BglII, BssHII, SphI, NheI, EagI, NotI, PstI, NarI, MluI, and SplI. The experimental use of this vector is described in detail, and this process is diagrammed in Fig. 4. The target DNA is digested to completion with one of the restriction enzymes within the MCS to be used for the ligation reaction.
Fig. 4. Schematic representation of 3'-terminal exon trapping using the vector pTAG4. [Flow diagram: ligation of target DNA to pTAG4; direct transfection into Cos7 cells; isolation of poly(A)+ mRNA; reverse transcription; PCR 1 (3'-RACE); digestion with the restriction enzyme used above; PCR 2; UDG-mediated subcloning of PCR product; sequencing of subclones.]
pTAG4 is double-digested with AvaII and the same restriction enzyme used to digest the target DNA. The target DNA is ligated to pTAG4, and the ligation reaction is directly transfected into Cos7 cells. After transient expression, mRNA is harvested and reverse-transcribed using an oligo(dT)-based adapter primer for the 3'-RACE technique. Exons are amplified by a nested PCR approach. The primary PCR reaction uses one primer specific to the tail sequence on the adapter RT primer, whereas the other is specific to vector exon sequence. Primary PCR product is digested with the same restriction enzyme used for preparative digestion of the target DNA and pTAG4. This step functions to remove any false positives resulting from reverse transcription of unspliced precursor RNA or residual contaminating DNA present in the mRNA preparation. The secondary PCR uses nested primers that carry 12-base tails for UDG cloning and that give a greater degree of specificity to the reaction. PCR results are analyzed by agarose gel electrophoresis. The complexity of the secondary PCR product from both exon trapping protocols will be proportional to the amount and complexity of target DNA used
for the original ligation reactions. For example, a small number (1-3) of ethidium bromide-stained bands will be seen when secondary PCR product is analyzed by agarose gel electrophoresis in a trapping experiment with single cosmid or phage clones. If trapping from a YAC clone or a pool of smaller clones, the number of bands seen on a gel will be proportionally greater. Thus, subcloning of the products may range from excision of single bands from the gel (single clones) to a shotgun approach using total PCR product (YAC clone and pools of clones). Product from the secondary PCR reactions should be subcloned for further analysis by either UDG-mediated cloning or TA cloning. Vectors designed to perform both types of subcloning are commercially available, and the protocols should be followed according to manufacturer's recommendations. The primers used for the secondary PCR reactions in both exon trapping protocols are designed to impart UDG cloning capabilities on PCR product, and this approach is generally more efficient and yields greater numbers of subclones containing inserts. For these reasons, UDG cloning is recommended; however, TA cloning can be used and should yield sufficient numbers of subclones for sequencing. Subclones containing candidate exons should first be sequenced to aid in determining exon validity. Interpretation of sequencing results differs between the two exon trapping approaches. Internal exon trapping yields small products, usually 25-200 bp in length. Each unique sequence from a trapping experiment should be used in a computer search of existing nucleotide databases. Computer search results from an individual valid internal exon will yield evidence of splice events on both sides of the novel sequence, as determined by removal of vector intron sequences, and will show a complete open reading frame all the way through the sequence in at least one frame.
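The open reading frame criterion for internal exons can be checked with a few lines of code. The sketch below is an illustration under stated assumptions, not part of the published protocol: it assumes the trapped sequence has already been trimmed of vector sequence and is read in the sense orientation fixed by the vector splice sites, and the function name is hypothetical. It reports which of the three forward frames are free of stop codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_frames(exon: str) -> list:
    """Return the forward frames (0, 1, 2) that contain no stop codon.

    Assumes `exon` is the trapped sequence with vector sequence removed,
    written 5' to 3' in the sense orientation.
    """
    exon = exon.upper()
    frames = []
    for frame in range(3):
        # Walk the sequence codon by codon, ignoring a trailing partial codon.
        codons = (exon[i:i + 3] for i in range(frame, len(exon) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            frames.append(frame)
    return frames


# Hypothetical example: only frame 1 is open.
print(open_frames("ATGTAAGCTGA"))
```

An empty result would argue against a valid internal exon, whereas one or more open frames is consistent with the criterion described above.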
Primers can then be designed to the unique sequence for use in a number of subsequent experiments, including:
1. Mapping the candidate exon to its genomic region of origin;
2. RT/PCR expression studies; and
3. Generating a specific probe for cDNA screening and Southern/Northern blot mapping.
Alternatively, standard restriction enzyme digestion and gel purification of an insert representing a candidate exon from a single subclone can be used for hybridization experiments. However, owing to the small size of many trapped exons, it may be more informative to perform PCR experiments using DNA and RNA from interesting sources as templates. 3'-Terminal exon trapping yields candidate exons that are much larger in size (200-2500 bp) than those obtained from internal exon trapping. The sequence of an individual subclone will show evidence of a splice event at the
5'-end of the sequence and a poly(A) tail at the 3'-end of the sequence. In approx 90% of the cases, a consensus poly(A) signal consisting of one of the hexanucleotides AATAAA or ATTAAA will be present 12-30 bases upstream of the poly(A) tail. On nucleotide database search, multiple termination codons will be present in most frames. The great majority of 3'-terminal exon sequence is 3'-untranslated region. Thus, very little coding sequence exists. As discussed for internal exons, both PCR-based and hybridization-based experiments to map the exon and perform gene expression studies can be performed. Full-length cDNA clones can also be obtained in hybridization screens. Hybridization studies may prove more fruitful using 3'-terminal exons as probes vs internal exons simply owing to their greater length.
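The sequence features expected of a trapped 3'-terminal exon can be screened for automatically. The following is a minimal sketch, assuming the sense-strand sequence of the subclone ends in its poly(A) tail; the minimum tail length of 10 residues and the way the 12-30 base distance is measured (bases between the end of the hexanucleotide and the first A of the tail) are assumptions made here for illustration, and all function names are hypothetical.

```python
import re

POLYA_SIGNALS = ("AATAAA", "ATTAAA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def polya_tail_start(seq: str, min_tail: int = 10):
    """Index where a terminal run of at least `min_tail` A residues begins."""
    match = re.search(r"A{%d,}$" % min_tail, seq.upper())
    return match.start() if match else None


def has_polya_signal(seq: str, min_tail: int = 10, window=(12, 30)) -> bool:
    """True if a consensus hexanucleotide ends 12-30 bases before the tail."""
    seq = seq.upper()
    tail = polya_tail_start(seq, min_tail)
    if tail is None:
        return False
    for signal in POLYA_SIGNALS:
        pos = seq.find(signal)
        while pos != -1 and pos + len(signal) <= tail:
            gap = tail - (pos + len(signal))
            if window[0] <= gap <= window[1]:
                return True
            pos = seq.find(signal, pos + 1)
    return False


def frames_with_stop(seq: str) -> list:
    """Return the forward frames (0, 1, 2) containing at least one stop codon."""
    seq = seq.upper()
    hits = []
    for frame in range(3):
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        if any(codon in STOP_CODONS for codon in codons):
            hits.append(frame)
    return hits


# Hypothetical candidate: signal, 15-base spacer, then a 12-residue tail.
candidate = "GCGC" + "AATAAA" + "C" * 15 + "A" * 12
print(has_polya_signal(candidate), frames_with_stop(candidate))
```

Because only about 90% of genuine 3'-terminal exons carry a consensus signal, a negative result from such a check should not by itself exclude a candidate.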
2. Materials
1. One or more of the following restriction endonucleases: EcoRI, SstI, XhoI, NotI, XmaIII, PstI, BamHI, EcoRV, BglII, BssHII, SphI, NheI, EagI, NarI, MluI, SplI, and BstXI.
2. Internal exon trapping vector pSPL3 or 3'-terminal exon trapping vector pTAG4.
3. 3M Sodium acetate.
4. 70% and 100% Ethanol.
5. TE buffer: 10 mM Tris-HCl, 1 mM EDTA, pH 7.4.
6. Calf intestinal phosphatase (CIP).
7. Phenol/chloroform/isoamyl alcohol, 25:24:1.
8. 5X T4 ligase buffer: 250 mM Tris-HCl, pH 7.9, 50 mM MgCl2, 5 mM dATP, 5 mM dithiothreitol (DTT), 25% (w/v) polyethylene glycol (PEG) 8000.
9. T4 DNA ligase.
10. LB/amp medium (80 µg/mL amp).
11. LB/amp plates (80 µg/mL amp, 1.5% agar).
12. Cos7 cells (CRL 1651, American Type Culture Collection, Rockville, MD).
13. Cationic lipid medium for transient transfection.
14. Internal exon trapping primers:
SA2: 5' ATCTCAGTGGTATTTGTGAGC 3'
SD6: 5' TCTGAGTCACCTGGACAACC 3'
dUSA4: 5' CUACUACUACUACACCTGAGGAGTGAATTGGTCG 3'
dUSD2: 5' CUACUACUACUAGTGAACTGCACTGTGACAAGCTGC 3'
15. 5X First-strand buffer: 250 mM Tris-HCl, pH 8.3, 375 mM KCl, 15 mM MgCl2.
16. RT enzyme.
17. 10X Taq polymerase buffer: 500 mM KCl, 100 mM Tris-HCl, pH 8.3.
18. 10 mM 4dNTP: 10 mM each dNTP in H2O.
19. 50 mM MgCl2.
20. Taq DNA polymerase.
21. 1.2% Agarose gel in 1X TBE or 1X TAE with 400 µg/mL ethidium bromide.
22. 0.8% Agarose gel in 1X TBE or 1X TAE with 400 µg/mL ethidium bromide.
23. 3'-Terminal exon trapping primers:
AP: 5' AAGGATCCGTCGACATCGATAATACGAC(T)17 3'
SV40P: 5' AGCTATTCCAGAAGTAGTGA 3'
UAP: 5' CUACUACUACUAGTCGACATCGATAATACGAC 3'
Ad2: 5' CAUCAUCAUCAUCAGTACTCTTGGATCGGA 3'
24. 0.1M DTT.
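Primer sequences copied from a printed list are easy to mistype, so a short consistency check can be worthwhile before ordering. The sketch below is illustrative only: it encodes the primers as listed in the Materials, takes the 17 T residues of AP from Note 13, and uses hypothetical function names. It confirms the stated 45-base length of AP, identifies the 12-base dU-containing tails used for UDG cloning, and checks that the priming portion of UAP matches the engineered adapter region of AP.

```python
PRIMERS = {
    "SA2": "ATCTCAGTGGTATTTGTGAGC",
    "SD6": "TCTGAGTCACCTGGACAACC",
    "dUSA4": "CUACUACUACUACACCTGAGGAGTGAATTGGTCG",
    "dUSD2": "CUACUACUACUAGTGAACTGCACTGTGACAAGCTGC",
    "AP": "AAGGATCCGTCGACATCGATAATACGAC" + "T" * 17,  # 17 T residues per Note 13
    "SV40P": "AGCTATTCCAGAAGTAGTGA",
    "UAP": "CUACUACUACUAGTCGACATCGATAATACGAC",
    "Ad2": "CAUCAUCAUCAUCAGTACTCTTGGATCGGA",
}

# The two 12-base dU-containing tails seen in the primer list.
UDG_TAILS = ("CUACUACUACUA", "CAUCAUCAUCAU")


def udg_tail(primer: str):
    """Return the UDG cloning tail at the 5'-end of `primer`, or None."""
    for tail in UDG_TAILS:
        if primer.startswith(tail):
            return tail
    return None


def priming_region(primer: str) -> str:
    """Return the primer with any 5' UDG cloning tail removed."""
    tail = udg_tail(primer)
    return primer[len(tail):] if tail else primer


for name, seq in PRIMERS.items():
    print(name, len(seq), "UDG tail" if udg_tail(seq) else "no tail")

# The template-specific part of UAP should lie within the AP adapter.
print(priming_region(PRIMERS["UAP"]) in PRIMERS["AP"])
```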
3. Methods
3.1. Target DNA Preparation
The starting target DNA for both exon trapping protocols can reside in any cloning vector (phage, cosmid, P1, BAC, or YAC vector) used for propagating genomic DNA. The common requirement is that the target DNA must be sufficiently purified from host DNA (E. coli and yeast) to keep nonspecific background to a minimum. It is not necessary to purify the DNA insert from the cloning vector, since the vectors do not contain exons and introns and only rarely will lead to false-positive scoring. Many protocols exist that are designed to purify cloned DNA within the various cloning vectors, and they have been described elsewhere. Thus, treatment of target DNA in this chapter consists of describing the preparation of DNA that has already been isolated by an existing technique.
1. Digest a minimum of 1 µg of target DNA with one of the available restriction endonucleases, depending on which exon trapping methodology has been chosen. For internal exon trapping, the following enzymes can be used: EcoRI, SstI, XhoI, NotI, XmaIII, PstI, BamHI, and EcoRV. For 3'-terminal exon trapping, the following enzymes can be used: EcoRI, BamHI, BglII, BssHII, SphI, NheI, EagI, NotI, PstI, NarI, MluI, or SplI (see Note 1).
2. Phenol-extract and precipitate target DNA using 0.1 vol 3M sodium acetate and 2.5 vol 100% ethanol. Wash the pellet with 70% (v/v) ethanol.
3. Resuspend DNA in TE buffer at a final concentration of 250 ng/µL (see Note 2).
4. Run 250 ng of the digested target DNA on a 0.8% agarose gel to assess the success of DNA cutting. Complete digestion of the target is desirable; however, a small amount of undigested DNA will not interfere with the protocol.
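The volumes needed for the precipitation and resuspension steps above follow from simple proportions. The sketch below is a convenience calculation only; it assumes that both the 0.1 vol of sodium acetate and the 2.5 vol of ethanol are reckoned relative to the starting sample volume, which the protocol does not state explicitly, and the function names are hypothetical.

```python
def precipitation_volumes(sample_ul: float) -> dict:
    """Volumes (µL) of 3M sodium acetate and 100% ethanol for one sample.

    Assumption: both volumes are relative to the starting sample volume.
    """
    return {
        "sodium_acetate_ul": 0.1 * sample_ul,
        "ethanol_ul": 2.5 * sample_ul,
    }


def resuspension_volume_ul(dna_ng: float, target_ng_per_ul: float = 250.0) -> float:
    """Volume of TE (µL) giving the target concentration, assuming full recovery."""
    return dna_ng / target_ng_per_ul


# Example: a 100-µL digest containing 1 µg (1000 ng) of target DNA.
print(precipitation_volumes(100.0))
print(resuspension_volume_ul(1000.0))
```

In practice, recovery after precipitation is rarely complete, so the resuspension volume computed this way gives an upper bound on the actual concentration.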
3.2. Internal Exon Trapping
1. Digest the trapping vector pSPL3 with the same restriction endonuclease used to digest the target DNA (see Fig. 1).
2. Dephosphorylate the vector with CIP according to manufacturer's recommendations (see Note 3).
3. Phenol/chloroform-extract and ethanol-precipitate linearized, CIP-treated pSPL3 with 0.1 vol 3M sodium acetate and 2.5 vol 100% ethanol. Wash the pellet in 70% ethanol, and resuspend the pellet in TE buffer at a concentration of 250 ng/µL.
4. Set up the following shotgun cloning reaction: 250 ng pSPL3, 250 ng to 1 µg target DNA, 2 µL 5X T4 DNA ligase buffer, 1 µL T4 DNA ligase (1 U/µL), H2O to a final volume of 10 µL.
5. Mix gently, and incubate at room temperature for 1 h. Alternatively, the ligation reaction can proceed overnight at 15°C (see Note 4).
6. Transform E. coli with the ligation reactions (vector-only control and experimental) using either chemically competent or electrocompetent cells. The final transformation volume should be 1 mL.
7. Inoculate 5 mL of LB/amp medium with 0.5 mL of the experimental transformation, and incubate overnight at 37°C.
8. With the remaining 0.5 mL of transformation, plate 10 and 100 µL on two separate LB/amp plates. In parallel, plate 10 and 100 µL of the vector-only control transformation, and incubate all plates overnight at 37°C. Comparing the colony numbers obtained from the two transformations assesses the degree of vector-only ligation (see Note 5).
9. Prepare DNA from the 5-mL overnight culture of experimental recombinants by an alkaline lysis miniprep procedure.
10. Sixteen hours before transient transfection, plate Cos7 cells on 3.5-cm 6-well tissue-culture plates at a density of roughly 300,000 cells/well (see Note 6). Use 1 well/transfection.
11. Use 1 µg of the purified recombinant DNA for each transfection. Cationic lipid-mediated transfection can be used for introduction of DNA into the cells and should be used according to manufacturer's recommendations. Many companies now market lipid reagents for transient transfection, and all are acceptable for this purpose.
12. After transfection, incubate cells at 37°C for 16-24 h.
13. Prepare total RNA from transfected cells by acidic-phenol/guanidinium thiocyanate extraction, and resuspend in 10 µL of TE buffer (see Note 7).
14. Prepare to synthesize first-strand cDNA. Add the following to a 0.5-mL microcentrifuge tube: 1 µL 20 µM oligonucleotide SA2, 1-3 µg total RNA, DEPC-treated H2O to 12-µL final volume.
15. Incubate mixture for 5 min at 70°C and then place on ice. Microcentrifuge briefly at high speed, and add the following components at room temperature: 4 µL 5X first-strand buffer, 2 µL 0.1M DTT, 1 µL 10 mM dNTP.
16. Mix gently, microcentrifuge briefly at high speed, and incubate for 5 min at 42°C. Add 1 µL (200 U) RT, mix gently, and incubate an additional 30 min at 42°C.
17. Incubate RT reaction at 55°C for 5 min. Then add 1 µL of RNaseH, mix gently, and incubate an additional 10 min at 55°C (see Note 8).
18. Microfuge briefly and place on ice. The cDNA pool can be stored indefinitely at -20°C. Use 5 µL for primary PCR amplification.
19. Set up the primary PCR reaction according to the following: 5 µL cDNA, 5 µL 10X Taq polymerase buffer (MgCl2-free), 1.5 µL 50 mM MgCl2, 1 µL 10 mM 4dNTP, 2.5 µL 20 µM oligonucleotide SA2, 2.5 µL 20 µM oligonucleotide SD6, sterile H2O to 47-µL final volume. Mix gently and overlay with 50 µL mineral oil.
20. Preheat thermal cycler to 94°C. Place the reaction tube in the cycler, and incubate for 5 min at 94°C.
21. Reduce cycler temperature to 80°C, and add 2.5 U Taq DNA polymerase that has been diluted into H2O to a final volume of 3 µL.
22. Perform PCR amplification according to the following parameters: 94°C for 1 min, 72°C for 5 min, and 60°C for 1 min, 6 cycles; 72°C for 10 min, 1 cycle; and 4°C, hold.
23. Add 25 U BstXI and incubate overnight at 55°C.
24. In the morning, spike the digestion with 5 more units of BstXI and incubate for 2 h at 55°C (see Note 9).
25. Perform the secondary PCR reaction according to the following: 5 µL BstXI-digested primary PCR product, 5 µL 10X Taq polymerase buffer (MgCl2-free), 1.5 µL 50 mM MgCl2, 1 µL 10 mM 4dNTP, 1 µL 20 µM oligonucleotide dUSA4, 1 µL 20 µM oligonucleotide dUSD2, sterile H2O to 47 µL final volume. Mix gently and overlay with 50 µL mineral oil.
26. Preheat thermal cycler to 94°C. Place the reaction tube in the cycler, and incubate for 5 min at 94°C.
27. Reduce cycler temperature to 80°C, and add 2.5 U Taq DNA polymerase that has been diluted into H2O to a final volume of 3 µL.
28. Perform PCR amplification according to the following parameters: 94°C for 30 s, 72°C for 2 min, and 60°C for 30 s, 30 cycles; 72°C for 10 min, 1 cycle; and 4°C, hold.
29. Electrophorese 10 µL of the secondary PCR reaction on a 1.2% agarose gel, and identify reactions that contain PCR product by ethidium bromide staining (see Note 10). These PCR products should be used for subcloning by the UDG-mediated PCR cloning approach.
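The vector-only background check of step 8 (and Note 5) reduces to a ratio of colony counts. The sketch below is one way to express that comparison, assuming that equal volumes of the control and experimental transformations were plated so that the counts are directly comparable; the 10% threshold comes from Note 5, and the function names are hypothetical.

```python
def vector_only_fraction(control_colonies: int, experimental_colonies: int) -> float:
    """Estimate the fraction of experimental colonies lacking an insert.

    Assumes the same volume was plated from the vector-only control and the
    experimental transformation.
    """
    if experimental_colonies <= 0:
        raise ValueError("experimental plate has no colonies to compare against")
    return control_colonies / experimental_colonies


def vector_needs_reprep(control_colonies: int, experimental_colonies: int,
                        max_fraction: float = 0.10) -> bool:
    """True if vector-only background exceeds the 10% limit given in Note 5."""
    return vector_only_fraction(control_colonies, experimental_colonies) > max_fraction


# Hypothetical counts from the 10-µL plates.
print(vector_needs_reprep(3, 80))   # low background
print(vector_needs_reprep(12, 80))  # background too high
```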
3.3. 3'-Terminal Exon Trapping
The use of the trapping vector pTAG4 in the direct ligation/transfection procedure described here is illustrated in Fig. 5. Digestion of pTAG4 with AvaII and one of the restriction enzymes with a site in the MCS functions to leave a blunt end upstream of the SV40 enhancer/promoter and a sticky end downstream of the second exon. The AvaII site is very difficult to religate, whereas the sticky end is capable of readily ligating to target DNA with the same sticky end. In the example outlined in Fig. 5, pTAG4 is double-digested with AvaII/EcoRI and ligated directly to EcoRI-digested target DNA to form linear concatamers consisting of target DNA restriction fragments flanked on either side by pTAG4. The ligation reaction is directly transfected into Cos7 cells to induce transcription from the linear concatamers for amplification of trapped exons.
Fig. 5. Diagrammatic representation of the direct ligation/transfection procedure used in the 3'-terminal exon trapping protocol.
1. Prepare the trapping vector pTAG4 (Fig. 3) by double restriction digest with the enzyme AvaII in conjunction with the same enzyme used to digest the target DNA. Gel-purify the digested vector in a 0.8% agarose gel to remove the small DNA fragment liberated from the MCS after double digestion. A number of protocols can be used to purify the vector from the agarose gel slice, including glass bead purification and electrophoresis.
2. Resuspend pTAG4 in TE at a final concentration of 500 ng/µL.
3. Set up the following ligation reaction (see Note 11): 500 ng pTAG4, 500 ng target DNA, 1 µL 5X T4 DNA ligase buffer, 1 µL T4 DNA ligase (1 U/µL), H2O to a final volume of 5 µL.
4. Mix gently and incubate overnight at 16°C.
5. Sixteen hours before transient transfection, plate Cos7 cells on 3.5-cm 6-well tissue-culture plates at a density of roughly 300,000 cells/well. Use 1 well/transfection.
6. Use the entire 5-µL ligation reaction volume to transfect into 1 well of the Cos7 cells. The same lipid-mediated transfection protocol as used for internal exon trapping is used for this procedure.
7. After transfection, incubate cells at 37°C for 16-24 h.
8. Prepare poly(A)+ RNA from each transfection (see Note 12). Approximately 1 µg of poly(A)+ mRNA will be obtained from a single transfection, of which 500 ng will be used to synthesize first-strand cDNA.
9. Set up the following reverse transcription reaction: 1 µL 500 ng/µL oligonucleotide AP, 500 ng poly(A)+ mRNA, 1 µL 20 mM EDTA, DEPC-treated H2O to 20 µL final volume (see Note 13).
10. Incubate mixture for 5 min at 70°C, and then place at 42°C for 5 min. Add the following mixture (preheated to 42°C) to the RNA reaction tube: 10.5 µL DEPC-treated H2O, 10 µL 5X first-strand buffer, 5 µL 0.1M DTT, 2.5 µL 10 mM dNTP, 2 µL RT (200 U/µL).
11. Incubate the reverse transcription reaction at 42°C for 30 min.
12. Incubate RT reaction at 55°C for 5 min, then add 1 µL of RNaseH, mix gently, and incubate an additional 10 min at 55°C.
13. Microfuge briefly and place on ice. The cDNA pool can be stored indefinitely at -20°C. Use 5 µL for primary PCR amplification.
14. Set up the primary PCR reaction according to the following: 5 µL cDNA pool, 72.5 µL sterile H2O, 10 µL 10X Taq buffer (MgCl2-free), 5 µL 25 mM MgCl2, 2.5 µL 10 mM 4dNTP, 1 µL 50 µM oligonucleotide SV40P, 1 µL 50 µM oligonucleotide UAP. Mix and overlay with 100 µL mineral oil (see Note 14).
15. Preheat thermal cycler to 94°C. Place the reaction tube in the cycler, and incubate for 5 min at 94°C.
16. Reduce cycler temperature to 80°C, and add 2.5 U of Taq DNA polymerase that has been diluted into H2O to a final volume of 3 µL.
17. Perform PCR amplification according to the following parameters: 94°C for 30 s, 72°C for 2 min, and 55°C for 30 s, 20 cycles; 72°C for 5 min, 1 cycle; and 4°C, hold.
18. Digest the product from the primary PCR reaction with the same restriction enzyme used to digest vector and target DNA. Set up a restriction enzyme reaction according to the following: 17 µL primary PCR product, 2 µL 10X buffer, 1 µL restriction enzyme, for a total of 20 µL. Mix gently, and incubate for 1 h at 37°C (see Note 15).
19. Dilute the restriction digestion by addition of 150 µL sterile H2O.
20. Set up the following secondary PCR amplification using 1 µL of the above digestion as template: 1 µL digested primary PCR product, 76.5 µL sterile H2O, 10 µL 10X Taq buffer (MgCl2-free), 5 µL 25 mM MgCl2, 2.5 µL 10 mM 4dNTP, 1 µL 50 µM oligonucleotide Ad2, 1 µL 50 µM oligonucleotide UAP. Mix and overlay with 100 µL mineral oil.
21. Preheat thermal cycler to 94°C. Place the reaction tube in the cycler, and incubate for 5 min at 94°C.
22. Reduce cycler temperature to 80°C, and add 2.5 U of Taq DNA polymerase that has been diluted into H2O to a final volume of 3 µL.
23. Perform PCR amplification according to the following parameters: 94°C for 30 s, 72°C for 2 min, and 55°C for 30 s, 30 cycles; 72°C for 5 min, 1 cycle; and 4°C, hold.
24. Electrophorese 10 µL of the secondary PCR reaction on a 1.2% agarose gel, and identify reactions that contain PCR product by ethidium bromide staining (see Note 16). These PCR products should be used for subcloning and sequencing.
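The reaction recipes in this section can be sanity-checked with ordinary dilution arithmetic. The sketch below is illustrative only, with hypothetical function names: it confirms that the ligation of step 3 reaches the 200 µg/mL DNA concentration called for in Note 11, that the primary PCR components of step 14 plus the 3 µL of diluted Taq polymerase of step 16 sum to 100 µL, and it computes the resulting final concentrations.

```python
def final_concentration(stock_conc: float, added_ul: float, total_ul: float) -> float:
    """Dilution arithmetic: C_final = C_stock * V_added / V_total."""
    return stock_conc * added_ul / total_ul


def ligation_dna_ug_per_ml(vector_ng: float, target_ng: float, volume_ul: float) -> float:
    """Total DNA concentration; ng/µL is numerically equal to µg/mL."""
    return (vector_ng + target_ng) / volume_ul


# Step 3: 500 ng pTAG4 + 500 ng target DNA in 5 µL.
print(ligation_dna_ug_per_ml(500.0, 500.0, 5.0), "µg/mL DNA in the ligation")

# Step 14 components plus the diluted Taq polymerase added in step 16 (µL).
PRIMARY_PCR_UL = {
    "cDNA pool": 5.0,
    "sterile H2O": 72.5,
    "10X Taq buffer": 10.0,
    "25 mM MgCl2": 5.0,
    "10 mM 4dNTP": 2.5,
    "50 µM SV40P": 1.0,
    "50 µM UAP": 1.0,
    "diluted Taq": 3.0,
}
total_ul = sum(PRIMARY_PCR_UL.values())
print(total_ul, "µL total")
print(final_concentration(25.0, 5.0, total_ul), "mM MgCl2")
print(final_concentration(10.0, 2.5, total_ul), "mM each dNTP")
print(final_concentration(50.0, 1.0, total_ul), "µM each primer")
```

The same two functions can be applied to the secondary PCR of step 20, whose listed volumes also total 97 µL before the polymerase is added.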
4. Notes
1. The amount of starting target DNA will depend on the type of vector used to carry the genomic DNA of interest. It is difficult and time-consuming to gel-purify YAC DNA; thus, as little as 1 µg can be digested if more cannot be obtained. In contrast, it is relatively easy to purify microgram quantities of cosmid DNA, and more of this substrate can be used for the initial restriction digest.
2. This final concentration of digested target DNA will be suitable for either trapping protocol.
3. Alternatively, two different restriction endonucleases can be used to prepare vector and target DNA, and thus, CIP treatment of pSPL3 would not be necessary. If two different enzymes are used, it will be necessary to gel-purify the linearized vector from the small fragment generated from the MCS on a 0.8% agarose gel.
4. A control ligation reaction should also be included containing all of the above components without the target DNA. This functions to assess the amount of vector-only ligation.
5. Expect 10-100 colonies from the 10-µL experimental plate. If >10% of the resulting colonies do not contain target DNA insert, as determined by comparing the number of colonies in the control ligation vs the experimental ligation, the vector should be prepared again.
6. Cos7 cells can be obtained from ATCC. It is recommended that a single T75 flask be maintained and passaged in order to carry the line for use in transfections. The cells grow in DMEM supplemented with 10% fetal calf serum.
7. Many companies market a product for total RNA isolation that is fast, efficient, and yields RNA of good quality. Appropriate RNase-free precautions should be taken when working with RNA, such as using only DEPC-treated water for making solutions (except those containing Tris), the use of gloves, and the use of sterile plasticware instead of glass.
8. RNaseH treatment digests the RNA strand of a DNA:RNA hybrid. The presence of RNA in a PCR reaction has been shown to inhibit the reaction. Digestion is also carried out at 55°C to eliminate possible snap-back structures and second-strand products that can result from residual reverse transcriptase activity. RT enzyme is inactive at this temperature.
9. The double-stranded PCR product is digested with BstXI to remove two classes of background products. The first results from pSPL3-derived RNA molecules containing only vector sequences. The second results from amplification of RNA molecules that have used a cryptic 5'-splice site within the vector. Subcloning of the target DNA into either EcoRI or EcoRV sites will inactivate the overlapping BstXI sites and make it necessary to digest putative exon-containing clones with NdeI to determine if the vector cryptic 5'-splice site has been used. This step is used to increase efficiency and reduce background inherent within the system.
10. Positive results from internal exon trapping are indicated by any PCR product of a size greater than the 147 bp generated from the vector only. This vector-only PCR product will be present even after BstXI digestion, and thus any incorporation of a trapped exon will yield PCR products larger than 147 bp.
11. The concentration of DNA in this ligation reaction should be at 200 µg/mL to induce concatamerization.
12. Many companies market kits for mRNA selection that are both fast and extremely efficient. Appropriate RNase-free precautions should be taken when working with RNA.
13. The oligonucleotide AP is 45 bases in length and consists of 17 T residues at the 3'-end that bind to and prime reverse transcription at poly(A) tails of mRNA species. The remainder of the AP primer is an engineered sequence that does not base pair with any endogenous sequence from Cos7 cells. The resulting cDNA primed with AP can be amplified with 5'-primers specific for vector sequences and 3'-primers specific for the engineered tail sequence of the AP primer. This is the basis for the 3'-RACE technology.
14. The oligonucleotide SV40P is specific for the portion of the cDNA that is derived from the SV40 promoter, whereas the UAP is specific for the tail region of the AP primer.
15. This digestion step functions to remove specific background products that result from unspliced precursor RNA or residual DNA that can reverse transcribe at A/T-rich regions with the AP primer.
16. The resulting PCR product from 3'-terminal exon trapping will not contain any vector-only derived products, and thus, all PCR product should be considered trapped sequences.
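The different interpretation rules for the two protocols (Notes 10 and 16) can be summarized in a short helper for scoring band sizes read from a gel. This is a sketch under stated assumptions rather than part of the protocol: the 10-bp tolerance for distinguishing a band from the 147-bp vector-only product is a hypothetical allowance for gel resolution, the insert size estimate assumes the product consists of the vector-only sequence plus the trapped exon(s), and the function names are illustrative.

```python
VECTOR_ONLY_BP = 147  # vector-only product of internal exon trapping (Note 10)


def candidate_bands(band_sizes_bp, protocol: str, tolerance_bp: int = 10) -> list:
    """Return the band sizes that should be treated as trapped sequences.

    protocol: "internal" keeps bands clearly larger than 147 bp (Note 10);
              "3prime" keeps every band (Note 16).
    tolerance_bp is a hypothetical allowance for gel sizing error.
    """
    if protocol == "internal":
        return [s for s in band_sizes_bp if s > VECTOR_ONLY_BP + tolerance_bp]
    if protocol == "3prime":
        return list(band_sizes_bp)
    raise ValueError("protocol must be 'internal' or '3prime'")


def estimated_insert_bp(band_size_bp: int) -> int:
    """Rough trapped-exon length for an internal trapping product."""
    return band_size_bp - VECTOR_ONLY_BP


bands = [147, 150, 260, 347]  # hypothetical band sizes in bp
print(candidate_bands(bands, "internal"))
print([estimated_insert_bp(b) for b in candidate_bands(bands, "internal")])
print(candidate_bands(bands, "3prime"))
```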
References
1. Duyk, G. M., Kim, S., Myers, R. M., and Cox, D. R. (1990) Exon trapping: a genetic screen to identify candidate transcribed sequences in cloned mammalian genomic DNA. Proc. Natl. Acad. Sci. USA 87, 8995-8999.
2. Buckler, A. J., Chang, D. D., Graw, S. L., Brook, J. D., Haber, D. A., Sharp, P. A., and Housman, D. E. (1991) Exon amplification: a strategy to isolate mammalian genes based on RNA splicing. Proc. Natl. Acad. Sci. USA 88, 4005-4009.
3. Auch, D. and Reth, M. (1991) Exon trap cloning: using PCR to rapidly detect and clone exons from genomic DNA fragments. Nucleic Acids Res. 18, 6743-6744.
4. Hamaguchi, M., Sakamoto, H., Tsuruta, H., Sasaki, H., Muto, T., Sugimura, T., and Terada, M. (1992) Establishment of a highly sensitive and specific exon-trapping system. Proc. Natl. Acad. Sci. USA 89, 9779-9783.
5. Nehls, M., Pfeifer, D., and Boehm, T. (1994) Exon amplification from complete libraries of genomic DNA using a novel phage vector with automatic plasmid excision facility: application to mouse neurofibromatosis-1 locus. Oncogene 9, 2169-2175.
6. Krizman, D. B. and Berget, S. M. (1993) Efficient selection of 3'-terminal exons from vertebrate DNA. Nucleic Acids Res. 21, 5198-5202.
7. Church, D. M., Stotler, C. J., Rutter, J. L., Murrell, J. R., Trofatter, J. A., and Buckler, A. J. (1994) Isolation of genes from complex sources of mammalian genomic DNA using exon amplification. Nature Genet. 6, 98-105.
8. Huntington's Disease Collaborative Research Group (1993) A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington's disease chromosomes. Cell 72, 971-983.
9. Krizman, D. B., Hofmann, T. A., DeSilva, U., Green, E. D., Meltzer, P. S., and Trent, J. M. (1994) Identification of 3'-terminal exons from yeast artificial chromosomes. PCR Methods Appl., in press.
10. Frohman, M. A., Dush, M. A., and Martin, G. R. (1988) Rapid production of full-length cDNAs from rare transcripts: amplification using a single gene-specific oligonucleotide primer. Proc. Natl. Acad. Sci. USA 85, 8998-9002.
13
Isolation of Coding Sequences from Genomic Regions Using Direct Selection
Richard G. Del Mastro and Michael Lovett
1. Introduction
The rate-limiting steps in the positional cloning of disease genes are the identification of discrete genes within a large genomic region and the analysis of mutations in these genes. Direct selection is an expression-based gene identification technique that can rapidly identify cDNAs within large genomic regions. It has been successfully used in numerous positional cloning projects (1-4). It can capture cDNAs that are expressed, temporally and spatially, in various tissue or cell types. The technique involves the hybridization of polymerase chain reaction (PCR)-amplifiable cDNAs to a genomic target. In our laboratory, these cDNAs are complex pools derived from multiple tissues. The cDNAs that are homologous to the genomic region are selected and subsequently enriched by PCR amplification, resulting in a set of cDNAs specific to the genomic target, with enrichments ranging from 1000- to 100,000-fold (5-7). These levels of enrichment, coupled with the use of cDNA pools, make the technique especially useful when searching for cDNAs that are expressed at very low levels in a complex tissue source. Direct selection has been successfully applied to rapidly derive detailed transcription maps of large genomic regions (>2 Mb) (8-10). This makes it the method of choice when the targeted region is large. To identify as many transcription units as possible within the genomic target, a combination of cDNAs isolated from several different tissue sources can be mixed, and a single direct selection reaction performed. As many as eight cDNA pools from different tissue sources have been successfully combined in this way (7). This approach has several advantages: a variety of cDNAs are rapidly isolated in one direct selection reaction, analysis of the cDNAs can be performed in parallel, and,
since the pattern of expression of some genes is tissue specific, combining cDNAs isolated from different tissues increases the likelihood of deriving a detailed transcription map of the region. Multiplexing several complex cDNA pools in this way can also be useful in the positional cloning of disease genes where the etiology of the disorder is uncertain. Conversely, if the pathogenesis of the disorder is known, then targeting only one tissue by direct selection can greatly reduce the number of genes that have to be analyzed. Since direct selection was originally described (5,6), it has been significantly improved and refined (7,8,11-14), resulting in the approach described in this protocol. A schematic representation of the steps involved is shown in Fig. 1. The genomic target is first labeled with biotin. This important step prior to hybridization with the cDNAs ensures efficient capture of the genomic target and its associated cDNAs. As is described, it is very important to monitor this step to guarantee that enrichment is achieved. Prior to hybridization, a population of cDNAs, either cloned or ligated to amplification linkers, is PCR amplified. The quality of these cDNAs is another critical factor in ensuring the success of the selection. Highly repetitive elements that are present within ~10% of the cDNAs are then suppressed by hybridization with a blocking DNA that can be either Cot1 DNA or total genomic DNA. Other blocking agents may also be included depending on the selection (see Section 2.5.). The repeat suppression step is performed to a low Cot1/2 value of 20. The repeat-suppressed cDNAs are then hybridized in solution to the biotin-labeled genomic target to an intermediate Cot1/2 of between 100 and 200. The hybridization conditions can be designed in a number of ways, but the relative amounts of cDNA and genomic target are critical parameters in setting up a successful selection experiment.
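Cot values such as those quoted above are the product of the nucleotide concentration of the reassociating DNA (mol/L) and the hybridization time (s), so the time needed to reach a given value can be estimated from the DNA concentration. The sketch below is a rough planning aid only: the average nucleotide mass of 330 g/mol is an assumed approximation, the choice of which DNA component drives the reaction is left to the experimenter, effects of salt and temperature on the effective rate are ignored, and the function names are hypothetical.

```python
AVG_NT_G_PER_MOL = 330.0  # assumed average mass of one nucleotide residue


def nucleotide_molarity(dna_ug_per_ul: float) -> float:
    """Convert a DNA concentration in µg/µL (equal to g/L) to mol nucleotide/L."""
    return dna_ug_per_ul / AVG_NT_G_PER_MOL


def hours_to_reach_cot(target_cot: float, dna_ug_per_ul: float) -> float:
    """Hybridization time (h) for concentration x time to equal target_cot (mol.s/L)."""
    seconds = target_cot / nucleotide_molarity(dna_ug_per_ul)
    return seconds / 3600.0


# Hypothetical example: driver DNA at 1 µg/µL.
for cot in (20, 100, 200):
    print(cot, round(hours_to_reach_cot(cot, 1.0), 2), "h")
```

Halving the DNA concentration doubles the time required, which is why small hybridization volumes are generally preferred for reaching the higher values.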
After hybridization, the biotinylated genomic target, with the hybridized cDNAs, is captured on streptavidin-coated paramagnetic beads. It is also critical to monitor the binding capacity and batch-to-batch variability of these reagents. The beads are then washed to remove any nonspecific hybridization, and the cDNAs are eluted. These cDNAs are the primary selected cDNAs. They are PCR-amplified and passed through a second round of selection. The cDNAs that are selected and eluted from this round are the secondary selected cDNAs. Two rounds of direct selection are usually sufficient to yield enrichments of up to 100,000-fold, and additional rounds of selection do not appear to improve enrichments. It is the secondary selected cDNAs that are cloned and analyzed.
1.1. Fundamental Factors for Direct Selection
Some of the important factors that are fundamental to the success of direct selection have been mentioned. This section describes in detail what we have found to be the most critical steps. The quality of the starting cDNA source is
Fig. 1. Schematic representation of direct selection. [Flow diagram: starting cDNA sources (uncloned PCR-amplifiable cDNA pools, oligo-dT and random primed, or cloned cDNA libraries) are PCR amplified and repeats are suppressed using Cot-1 DNA; cloned genomic DNA is biotinylated en masse; the two are mixed and hybridized (Cot1/2 = 200); the genomic/cDNA product is captured on streptavidin-coated paramagnetic beads; the beads are washed; captured cDNAs are eluted and PCR amplified; a second round of selection is performed; secondary selected cDNAs are cloned.]
extremely important; a cDNA source that has good length distribution and is low in ribosomal RNA, plasmid DNA, and other artifactual sequences will repeatedly give excellent results. Commercial cDNA libraries vary in quality,
and consequently, well-made uncloned cDNA sources are preferable. It is important to construct random-primed as well as oligo-dT-primed cDNAs; prior to performing direct selection, a mixed starting cDNA source is produced by combining the two pools. This approach ensures that there is a good representation of cDNAs across the length of the transcription unit, which is an advantage when performing analysis of the secondary selected material. During analysis, it is difficult to ascertain whether cDNAs containing repetitive elements were selected on the basis of homology to the repeat only or to other unique regions. By using mixed cDNA pools, other parts of the transcription unit will also be selected, allowing one to discard repeat-containing cDNAs. The quality of the cloned genomic target is also important. The technique relies on capturing the biotinylated target and its hybridized cDNAs. Thus, it is essential to obtain good incorporation of biotin within the genomic target. Biotin incorporation can be monitored by adding a radiolabeled dNTP to the nick translation. An aliquot of the nick-translated product is then mixed with the beads. After a 10-15 min incubation, the beads are collected, and the amounts of radioactivity on the beads and in the supernatant are measured. The ratio of bound:free radioactivity should be >8:1 if efficient binding and selection are to occur in the selection experiment. Obviously, the purity of the genomic target will influence this ratio and, hence, the overall outcome of the direct selection. If this ratio is not achieved, then some further DNA purification steps will probably be required. Alternatively, the binding capacity of the streptavidin-coated magnetic beads may be at fault. We routinely monitor the different batches of magnetic beads for variability by measuring their capacity to capture a biotinylated fragment.
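The bound:free criterion for biotin incorporation is a simple ratio of two radioactivity measurements. The sketch below shows one way to record the check, assuming both readings are background-corrected counts in the same units (for example, cpm); the function names are hypothetical, and the 8:1 threshold is the one stated above.

```python
def bound_to_free_ratio(bound_cpm: float, free_cpm: float) -> float:
    """Ratio of radioactivity on the beads to that left in the supernatant."""
    if free_cpm <= 0:
        return float("inf")  # nothing detectable remained unbound
    return bound_cpm / free_cpm


def binding_is_sufficient(bound_cpm: float, free_cpm: float,
                          min_ratio: float = 8.0) -> bool:
    """True if the bound:free ratio exceeds 8:1."""
    return bound_to_free_ratio(bound_cpm, free_cpm) > min_ratio


def fraction_bound(bound_cpm: float, free_cpm: float) -> float:
    """Fraction of total label captured on the beads."""
    return bound_cpm / (bound_cpm + free_cpm)


# Hypothetical readings: 9000 cpm on the beads, 1000 cpm in the supernatant.
print(bound_to_free_ratio(9000.0, 1000.0), binding_is_sufficient(9000.0, 1000.0))
print(round(fraction_bound(8000.0, 1000.0), 3))
```

A ratio of 8:1 corresponds to roughly 89% of the label being captured, which gives a convenient alternative way to express the same threshold.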
The biotinylated fragment, which has a radionucleotide incorporated, is added to varying amounts of beads, and the binding capacity monitored by measuring the amount of bound radionucleotide.

The relative amount of cDNA to genomic target is also important in a selection experiment. In general, we set up the first round of selection under conditions in which the genomic target is in excess over a low-abundance cDNA. This ensures that the lower-abundance cDNAs are efficiently selected, but has the disadvantage that little abundance normalization occurs in the first round of selection. Usually, one round of selection designed in this way results in dramatic enrichments of ~1000-fold. The second round of selection is set up under conditions in which the genomic target is limiting in concentration. This has the advantage that more abundance normalization occurs in this round. The overall level of enrichment is usually a further ~10-fold in the second round of selection.

If YACs are used in a direct selection experiment, they must be purified away from the yeast genome. This is quite time-consuming and has led to
attempts to perform direct selection using total DNA from a YAC clone, including all of the yeast genome. The results have shown moderate enrichments of about 100-fold (15), as opposed to the 10,000- to 100,000-fold observed with purified YAC DNA (7,8). Our attempts to improve the enrichment levels, using total DNA from a YAC, have yielded similar results. Therefore, in general, we do not recommend this type of shortcut.

Although it is important to have a good-quality cDNA source and genomic template, it is essential to have a reporter cDNA as a positive control to monitor enrichment during the two rounds of selection. There are many steps involved in the procedure, with numerous opportunities for technical errors to occur. A reporter cDNA is an indicator that the experiment has worked; if after two rounds of selection there has been enrichment, then the secondary selected cDNAs are ready to be cloned and analyzed.

1.2. Analysis of the Secondary Selected cDNAs
After cloning the secondary selected cDNAs, several hundred selected clones are picked and arrayed per tissue source. The first step is to make several replica filters and use them to remove background clones, such as the reporter cDNA, high-copy repeats, plasmid DNA, and ribosomal RNA. Though there may be other background contaminants, such as low-copy repeats, mitochondrial DNA, and yeast 2-µ sequences, these are usually at a lower level. The cDNA clones that remain are hybridized individually or as a collection of cDNAs to detect and remove redundancy. Detecting redundancy is important not only because it allows one to avoid picking and sequencing the same clones again and again, but also because it allows one to estimate the depth of the selected library of cDNAs and thus devise a stopping algorithm for a particular cDNA source.
For example, if every cDNA clone, when assessed for redundancy, is found to be present five or more times on a screen of 300 arrayed cDNA clones, then one can be reasonably confident that screening only 300 cDNA clones will be sufficient to saturate that particular cDNA source. If, however, such a screen revealed that most clones were only present once in a screen of 300 clones, then a much larger number of clones would have to be screened to achieve saturation. After the removal of redundancy, those clones that are left are sequenced and/or hybridized to the genomic contig that originally identified them to determine whether they correctly map back. Alternatively, this mapping step can be PCR-based with primers constructed from each of the sequenced cDNAs. In general, >70% of the cDNAs that are in the secondary selected material map back to the genomic contig. However, this percentage will obviously depend on the frequency of repetitive elements within the targeted genomic region.
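The stopping rule described above can be made semi-quantitative. The following sketch is not part of the published protocol. It assumes that each arrayed clone has already been assigned to a redundancy group (the labels used here are hypothetical), and it uses the Good-Turing coverage estimate, 1 - (singletons / clones screened), as one possible measure of how close a screen is to saturation.

```python
from collections import Counter

def redundancy_summary(clone_groups):
    """Summarize redundancy among arrayed clones.

    clone_groups: one label per arrayed clone; clones that cross-hybridize
    share a label (the labels are hypothetical identifiers).
    """
    counts = Counter(clone_groups)
    n_clones = len(clone_groups)
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "clones_screened": n_clones,
        "distinct_cdnas": len(counts),
        "singletons": singletons,
        "min_copies": min(counts.values()),
        # Good-Turing estimate of the fraction of the selected pool that is
        # made up of cDNAs already seen at least once (an assumption of this
        # sketch, not a method stated in the protocol).
        "estimated_coverage": 1 - singletons / n_clones,
    }

# Illustrative screen of 300 clones: 50 distinct cDNAs, each seen 6 times.
deep = redundancy_summary([f"cDNA_{i % 50}" for i in range(300)])
# Illustrative screen of 300 clones in which every clone is unique.
shallow = redundancy_summary([f"cDNA_{i}" for i in range(300)])

print(deep["min_copies"], deep["estimated_coverage"])        # 6 1.0
print(shallow["min_copies"], shallow["estimated_coverage"])  # 1 0.0
```

In the first case every cDNA is present at least five times, matching the situation in which 300 clones are likely to be sufficient; in the second, most clones are singletons and a much larger screen would be needed.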
2. Materials
2.1. Ligation of Linkers to Uncloned cDNA
1. 3 µg of random-primed and 3 µg of oligo-dT-primed cDNA, average size 0.5-1 kb (16,17).
2. Phosphorylated cDNA linkers.
   Oligo 1: 5' CTG AGC GGA ATT CGT GAG ACC 3'
   Oligo 2: 5' GGT CTC ACG AAT TCC GCT CAG TT 3'
   Mix oligo 1 and 2 in a 1:1 ratio. Anneal oligos by heating to 65°C for 5 min, and allow to cool to room temperature. Make the final concentration 1 µg/µL.
3. T4 DNA ligase (Boehringer Mannheim, Mannheim, Germany) at 1 U/µL.
4. 10X Ligation buffer: 200 mM Tris-HCl, pH 7.6, 50 mM MgCl2, 50 mM dithiothreitol (DTT), and 500 µg/mL bovine serum albumin (BSA) (V).
5. Water bath or heating block at 15°C.
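Before ordering or annealing the linker oligos, it can be useful to confirm that the two sequences are correctly transcribed. The short sketch below (an illustration, not part of the published protocol) checks that oligo 2 is the reverse complement of oligo 1 followed by two extra 3' bases, and that the duplex contains the sequence GAATTC, which is the recognition site of EcoRI.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))

# Linker oligos from Section 2.1, written 5' to 3' with the spaces removed.
OLIGO_1 = "CTGAGCGGAATTCGTGAGACC"
OLIGO_2 = "GGTCTCACGAATTCCGCTCAGTT"

duplex_region = reverse_complement(OLIGO_1)
overhang = OLIGO_2[len(duplex_region):]

print(OLIGO_2.startswith(duplex_region))  # True: oligo 2 pairs with all of oligo 1
print(overhang)                           # TT: unpaired 3' bases on oligo 2
print("GAATTC" in OLIGO_1)                # True: the linker carries GAATTC
```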
2.2. Amplification of Linkered cDNAs and cDNA Libraries
1. For cDNAs that have been linkered with oligo 1 and 2, use oligo 1 as a PCR primer. Anneal at 60°C.
2. For cDNA libraries, inserts can be amplified by PCR using primers that flank the insert site.
   cDNAs cloned into λgt10:
   10F: 5' GCA AGT TCA GCC TGG TTA AG 3'
   10R: 5' GAG TAT TTC TTC CAG GGT TA 3'
   Anneal at 55°C.
   cDNAs cloned into λ Zap:
   T3: 5' AAT TAA CCC TCA CTA AAG GG 3'
   T7: 5' TAA TAC GAC TCA CTA TAG GG 3'
   Anneal at 55°C.
   Make 10 µM working PCR primer stocks.
3. 10X PCR buffer: 10 mM Tris-HCl, pH 8.3, 50 mM KCl, 1.5 mM MgCl2, and 0.001% gelatin.
4. 200 mM each dNTPs (Boehringer Mannheim).
5. Taq DNA polymerase (Perkin Elmer, Norwalk, CT).
2.3. Preparation of the Genomic DNA
1. LB: Prepare 1 L by dissolving the following in 700 mL of distilled water: 10 g bacto-tryptone, 5 g of yeast extract, and 10 g of NaCl. Make up to 1 L with distilled water. Autoclave for 20 min. To make LB agar plates, add 15 g of bacto-agar to 1 L of LB. Autoclave for 30 min. Allow to cool to 65°C, and add selective antibiotic if needed. Pour into 150-mm plates.
2. A 96-well replica plater (Sigma, St. Louis, MO).
3. 1 L SM: Prepare 1 L by dissolving the following in 700 mL of distilled water: 5.8 g NaCl, 2.0 g MgSO4·7H2O, 50 mL Tris-HCl, pH 7.5, and 5 mL 2% gelatin. Make up to 1 L with distilled water. Autoclave for 20 min.
4. Casein medium: Prepare 1 L by dissolving the following in 700 mL of distilled water: 6.7 g of yeast nitrogen base without amino acids, 10 g casein and 700 mL adenine. Make up to 950 mL. Autoclave for 20 min. Allow to cool to 65°C, and add 50 mL sterile 40% glucose. To make casein plates, add 15 g of bacto-agar to 950 mL of casein medium. Autoclave for 30 min. Allow to cool to 65°C, and add 50 mL sterile 40% glucose. Pour into 25-mm plates.
5. Prep-A-Gene kit (Bio-Rad).
6. Additional reagents and equipment for isolating cosmid DNA (16,17), P1 DNA (17), phage DNA (I@, and YAC DNA (16,17,20).
2.4. Labeling of the Genomic Target with Biotin
1. Nick translation kit (Boehringer Mannheim).
2. 1.0 mM biotin 16-dUTP (Boehringer Mannheim).
3. 32P dCTP (Amersham).
4. Sephadex G-50 (Pharmacia).
5. Dynabeads (Dynal).
6. Magnetic particle concentrator holder (Dynal).
7. Water bath or heating block at 15°C.
8. Binding buffer: 10 mM Tris-HCl, pH 7.5, 1 mM EDTA, 1 M NaCl.
9. BioScan-QC.2000 (BioScan Inc.).

2.5. Suppression of Repeats in the cDNA
1. 10 µg Bluescript (Stratagene).
2. 10 µg pWE (Stratagene).
3. 10 µg Yeast AB1380 (ATCC).
4. Restriction endonuclease HaeIII (Gibco BRL).
5. 2X Hybridization solution: 1.5 mM NaCl, 40 mM sodium phosphate, pH 7.2, 10 mM EDTA, 10X Denhardt's solution, and 0.2% sodium dodecyl sulfate (SDS).
6. Mineral oil (Sigma).
7. Water bath or heating block at 100 and 65°C.
2.6. Hybridization of Repeat Suppressed cDNAs to the Genomic Target
1. 2X Hybridization solution.
2. Mineral oil (Sigma).
3. Water bath or heating block at 100 and 65°C.
2.7. Isolation of the Selected Material
1. Dynabeads M-280 (Dynal Inc.).
2. Binding buffer: 10 mM Tris-HCl, pH 7.5, 1 mM EDTA, 1 M NaCl.
3. Magnetic particle concentrator holder (Dynal Inc. or Stratagene).
4. Wash Solution I: 1X SSC, 0.1% SDS.
5. Wash Solution II: 0.1X SSC, 0.1% SDS.
6. Water bath or heating block at 65°C.
7. 0.1 M freshly prepared NaOH.
8. 1 M Tris-HCl, pH 7.6.
9. 10 µM PCR primers, 10X PCR buffer, 200 mM each dNTPs, and Taq DNA polymerase.
2.8. Monitoring Enrichment
1. Positive reporter cDNA.
2. Random-primed labeling kit.
3. 32P dCTP (Amersham).
4. Hybond N+ (Amersham).
5. Kodak XAR film for autoradiography.
6. Additional reagents and equipment for Southern blotting (16,18).
2.9. Cloning the Secondary Selected cDNAs
1. Modified primers for cloning into the UDG vector pAMP10.
   Oligo 1-CUA: 5' CUA CUA CUA CUA CTG AGC GGA ATT CGT GAG ACC 3'
   10F-CUA: 5' CUA CUA CUA CUA GCA AGT TCA GCC TGG TTA AG 3'
   10R-CUA: 5' CUA CUA CUA CUA GAG TAT TTC TTC CAG GGT TA 3'
   T3-CUA: 5' CUA CUA CUA CUA AAT TAA CCC TCA CTA AAG GG 3'
   T7-CUA: 5' CUA CUA CUA CUA TAA TAC GAC TCA CTA TAG GG 3'
   Make 10 µM working PCR primer stocks. Anneal all modified oligos at 60°C.
2. 10X PCR buffer, 200 mM each dNTPs, and Taq DNA polymerase.
3. UDG vector pAMP10 (Gibco BRL).
4. MAX Efficiency DH5α Competent Cells (Gibco BRL).
5. LB agar plates containing 100 µg/mL ampicillin.
6. Sterile 96-well microtiter plates.
7. 1 L LB containing 100 µg/mL ampicillin, and 1 L LB containing 30% glycerol and 100 µg/mL ampicillin.
8. Autoclaved toothpicks.
9. Transtar dispenser.
10. Hybond N+ nylon membrane.
11. 96-Well replica plater (Sigma) or a Biomek Automated Laboratory Workstation (Beckman).
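Each modified primer listed in item 1 is the corresponding unmodified primer from Sections 2.1 and 2.2 with four CUA repeats added at the 5' end. The sketch below (illustrative only; the dictionary and function names are hypothetical) regenerates the modified sequences from the base primers, which gives a simple way to check an oligo order against the list.

```python
# Unmodified primers from Sections 2.1 and 2.2 (5' to 3', spaces removed).
BASE_PRIMERS = {
    "Oligo 1": "CTGAGCGGAATTCGTGAGACC",
    "10F": "GCAAGTTCAGCCTGGTTAAG",
    "10R": "GAGTATTTCTTCCAGGGTTA",
    "T3": "AATTAACCCTCACTAAAGGG",
    "T7": "TAATACGACTCACTATAGGG",
}

# Four CUA repeats, as they appear at the 5' end of every modified primer.
CUA_TAIL = "CUA" * 4

def add_cua_tail(primer):
    """Prefix a primer with the 12-nt CUA tail."""
    return CUA_TAIL + primer

modified = {f"{name}-CUA": add_cua_tail(seq) for name, seq in BASE_PRIMERS.items()}

for name, seq in modified.items():
    print(name, seq, len(seq))
```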
3. Methods
3.1. Choice of Reporter cDNA to Monitor Enrichment
Before starting the protocol, it is important to have a reporter cDNA to monitor enrichment; a known gene needs to be present in the genomic target, and the corresponding cDNA needs to be present in the starting cDNA. Occasionally a known gene does not exist in the target, or the cDNA does not exist in the starting cDNA. In the former case, the genomic target needs to be spiked with a gene; for example, if the genomic contig is 1 Mb in size, a gene of 1 kb
in length should be diluted 1 in 1000 (w:w). In the latter case, the starting cDNA needs to be spiked with a reporter cDNA; the reporter cDNA should be added to a low concentration (1 in 10^5) in the starting cDNA.

3.2. Preparation of the Starting cDNA
3.2.1. Preparation of Starting Uncloned cDNA
1. Add 3 µg of random-primed (16,18), blunt-ended, double-stranded cDNA in a volume of 22 µL to one Eppendorf tube and to another add 3 µg of oligo-dT-primed (10,11), blunt-ended, double-stranded cDNA in the same volume.
2. Add to each Eppendorf tube 3 µL of 10X T4 DNA ligase buffer, 3 µL of 5 mM ATP, 2 µL (1 µg/µL) of annealed phosphorylated cDNA linkers oligo 1 and oligo 2, and 3 µL (1 U/µL) of T4 DNA ligase. Mix gently and incubate overnight at 15°C.
3. Take 1 µL of the ligation reaction and dilute 100-fold with distilled water. From this working stock of the linkered cDNA, take 1, 2, and 5 µL and PCR-amplify using oligo 1 under the following conditions: 30 s at 94°C, 30 s at 60°C, and 2 min at 72°C for 30 cycles (see Note 1). Include a control PCR reaction containing primers only.
4. Evaluate the length distribution of the amplified cDNA by electrophoresis on a 1% agarose gel. The mean size of the cDNA should be ~1 kb.
5. Determine which of the PCR reactions is a good representation of the random-primed and oligo-dT-primed cDNA pools, and scale up that PCR reaction to produce ~2-3 µg of each cDNA pool (see Note 2).
6. Mix 1 µg of random-primed cDNAs with 1 µg of oligo-dT-primed cDNAs. This constitutes the starting cDNA.
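As a worked example of the spiking ratios given in Section 3.1, the sketch below converts them into masses. It is only an illustration: the function names are hypothetical, the 200 ng and 1 µg totals are taken from the amounts of genomic DNA and starting cDNA used later in the protocol, and treating the 1 in 10^5 reporter dilution as a weight ratio is an assumption.

```python
def spike_fraction(gene_kb, contig_kb):
    """Weight:weight fraction of a spiked gene relative to the genomic contig."""
    return gene_kb / contig_kb

def spike_mass_ng(total_mass_ng, fraction):
    """Mass of spike (ng) corresponding to a fraction of a given total mass."""
    return total_mass_ng * fraction

# Genomic target: a 1-kb gene in a 1-Mb contig gives 1 in 1000 (w:w).
gene_fraction = spike_fraction(gene_kb=1, contig_kb=1000)

# Spike for 200 ng of genomic target (the amount labeled with biotin).
gene_ng = spike_mass_ng(200, gene_fraction)

# Reporter cDNA at 1 in 10^5 of 1 ug (1000 ng) of starting cDNA,
# assuming the dilution is by weight.
reporter_ng = spike_mass_ng(1000, 1e-5)

print(gene_fraction)   # 0.001
print(gene_ng)         # about 0.2 ng
print(reporter_ng)     # about 0.01 ng, i.e., 10 pg
```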
3.2.2. Preparation of Starting cDNA from Cloned Libraries
1. Take 1, 2, and 5 µL from the cloned cDNA library, and PCR-amplify using primers designed from the vector under the following conditions: 30 s at 94°C, 30 s at the annealing temperature of the designed primers, and 2 min at 72°C for 30 cycles (see Note 1). Add a control PCR reaction containing primers only.
2. Evaluate the length distribution of the amplified cDNA by electrophoresis on a 1% agarose gel. The mean size of the cDNA should be similar to the mean insert size of the cDNA library.
3. Determine which of the PCR reactions is a good representation of the cDNA library. Then scale up that PCR reaction to produce ~2-3 µg of starting cDNA (see Note 2).
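The cycling conditions in Sections 3.2.1 and 3.2.2 differ only in the annealing temperature, which depends on the primer set (Section 2.2). A minimal sketch of that lookup is shown below; the data structure and function names are hypothetical, and the computed time counts only the programmed holds, not the ramping time of any particular thermal cycler.

```python
# Annealing temperatures (degrees C) for the unmodified primer sets in Section 2.2.
ANNEAL_C = {"oligo 1": 60, "10F/10R": 55, "T3/T7": 55}

def cycling_program(primer_set, cycles=30):
    """Return (steps, cycles); each step is (temperature in C, seconds)."""
    steps = [(94, 30), (ANNEAL_C[primer_set], 30), (72, 120)]
    return steps, cycles

def hold_time_min(steps, cycles):
    """Total programmed hold time in minutes, excluding ramping."""
    return cycles * sum(seconds for _, seconds in steps) / 60

steps, cycles = cycling_program("T3/T7")
print(steps)                          # [(94, 30), (55, 30), (72, 120)]
print(hold_time_min(steps, cycles))   # 90.0
```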
3.3. Preparation of the Genomic Target
The genomic target may be contained in various cloning vectors: cosmids, P1s, phage, YACs, or a combination of the above. The preparation of the genomic target will vary according to the type of vector that the DNA has been cloned into (16-20).
3.3.1. Preparation of Genomic DNA Cloned into Cosmids, P1s, and Phage
1. If the genomic contig comprises several cosmids, P1s, or phage, grow them individually in the appropriate selective media to enable one to isolate 1-5 µg of DNA.
2. If the genomic target comprises several hundred cosmids, P1s, or phage, array them in 96-well microtiter plates containing the appropriate antibiotic for their growth. For cosmid and P1 DNA, use a 96-well stamping tool, and transfer the arrayed clones onto LB agar plates containing the appropriate antibiotic. Allow the colonies to grow overnight at 37°C. Scrape the colonies off the agar plates, and pool into a sterile 50-mL tube. Isolate the DNA using the alkali lysis method. For phage DNA, use the 96-well stamping tool to transfer the arrayed clones onto LB agar plates which have had the host Escherichia coli strain plated out in the top agar and have been incubated at 37°C for 2 h prior to stamping. Allow the plaques to grow overnight at 37°C. Add 10 mL of SM to each plate, and leave the plates to shake gently for 2 h. This will allow the phage to diffuse into the liquid. Pour off the liquid into a 50-mL sterile tube, and isolate the DNA using the phage lysate prep.
3. Measure the concentration of the isolated DNA.
3.3.2. Preparation of Genomic DNA Cloned into YACs
1. Streak the YAC clone onto a casein agar plate, and allow the colonies to grow for 2 d at 30°C.
2. Inoculate 50 mL of sterile casein media with one yeast colony. Incubate for 2 d at 30°C, shaking at 225 rpm.
3. Isolate yeast chromosomes in agarose blocks in preparation for pulsed-field gel electrophoresis. Approximately 14 agarose blocks/50 mL of casein media will be obtained.
4. Load as many blocks onto a gel as possible (usually seven if using the Bio-Rad CHEF apparatus).
5. Electrophorese the DNA under conditions suitable to separate the YAC from the endogenous yeast chromosomes (see Note 3).
6. Stain the gel with ethidium bromide (10 mg/mL), visualize the chromosomes under long-wave UV, and excise the YAC using a scalpel.
7. Extract the DNA from the agarose using Prep-A-Gene.
8. Measure the concentration of the DNA. This protocol routinely yields approx 500 ng of purified YAC DNA.
3.4. Labeling of the Genomic Target with Biotin
1. Label 200 ng of the isolated genomic DNA (see Note 4) with biotin 16-dUTP using a nick translation kit in accordance with the manufacturer's recommendations, but with the following modifications: the molar ratio of biotin 16-dUTP to dTTP should be 1:10, and the biotin labeling of the DNA should be monitored by adding 1 µL of 32P dCTP prior to adding the enzyme. The total reaction volume should be 20 µL. Incubate at 15°C for 90 min.
2. After incubation, add 80 µL of sterile water to the reaction and pass the biotinylated product through a Sephadex G-50 spin column to remove the unincorporated nucleotides.
3. Add 100 µL of Dynabeads (streptavidin-coated paramagnetic beads) into an Eppendorf tube. Put the tube in the magnetic particle concentrator holder, and allow the beads to concentrate for 1 min. Remove the supernatant, and resuspend the beads in 100 µL of binding buffer. Repeat this process three times. After the third wash, resuspend the beads in 100 µL of binding buffer.
4. Measure the volume of the biotinylated DNA. Take 1/5th of this volume, and add to the resuspended Dynabeads. Mix well and leave the biotinylated DNA to bind to the beads for 15 min at room temperature. Mix occasionally, by gentle vortexing, to prevent the beads from settling.
5. Place the Eppendorf tube in the magnetic particle concentrator holder for 1 min. Remove the supernatant, and transfer it to another Eppendorf tube.
6. Measure the radioactivity in the Eppendorf tube containing the beads and in the Eppendorf tube containing the supernatant using a BioScan counter. The radioactivity bound to the beads should be severalfold greater than that in the supernatant. Generally, a ratio of bound:free radioactivity of >8:1 is indicative of good biotin labeling (see Note 5).
7. Dry down the remaining volume of the biotinylated DNA to 10 µL. Store at -20°C.
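The bound:free check used to assess biotin labeling can be sketched as a small calculation. The readings below are hypothetical counter values in arbitrary units; only the >8:1 guideline comes from the protocol.

```python
def bound_free_ratio(bound_counts, free_counts):
    """Ratio of radioactivity on the beads to that left in the supernatant."""
    if free_counts <= 0:
        raise ValueError("free_counts must be positive")
    return bound_counts / free_counts

def labeling_ok(bound_counts, free_counts, threshold=8.0):
    """True when the bound:free ratio exceeds the >8:1 guideline."""
    return bound_free_ratio(bound_counts, free_counts) > threshold

# Hypothetical readings (arbitrary units).
print(labeling_ok(9000, 1000))   # True  (9:1)
print(labeling_ok(6000, 2000))   # False (3:1)
```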
3.5. Suppression of Repeats in the Starting cDNA
1. Individually digest 10 µg of BlueScript, 10 µg pWE, and 10 µg total yeast (AB1380) DNA with HaeIII. Electrophorese 500 ng of the digested product on a 1% agarose gel to verify digestion. If digested, pass the products through Sephadex G-50 columns.
2. Add 1 µg of starting cDNA, 2 µg of Cot1 (Gibco BRL), and 1 µg each of HaeIII-digested BlueScript and pWE DNA to an Eppendorf tube. If the genomic contig contains a YAC clone, also include 1 µg of HaeIII-digested total yeast DNA.
3. Mix and dry the volume down to 10 µL (see Note 6).
4. Add 10 µL of 2X hybridization solution (see Note 7) and mix. Overlay with mineral oil, and denature at 100°C for 5 min. Incubate the DNA in a heating block at 65°C for 4 h. This time approximates a Cot1/2 value of 20 mol nucleotide/liter × s (see Note 8).
3.6. Conditions for Hybridization of the cDNAs and the Genomic Target
1. Add half of the biotinylated DNA, 5 µL (~100 ng), to an Eppendorf tube containing 5 µL of 2X hybridization solution. Overlay with mineral oil, and denature for 5 min at 100°C.
2. Place the Eppendorf tube containing the denatured DNA in a 65°C heat block for a few seconds. Add the denatured DNA, including the mineral oil, to the Eppendorf tube containing the repeat suppressed cDNAs. Mix gently, and leave to hybridize at 65°C. Allow hybridization to proceed for 50 h if yeast DNA was not included in the preassociation step, or 40 h if the yeast DNA was included. These times approximate a Cot1/2 value of 200 mol nucleotide/liter × s (see Note 9).
3.7. Isolation of the Primary Selected Material and the Monitoring of Enrichment Using a Reporter Gene
1. Wash 100 µL (1 mg) of Dynabeads three times in binding buffer, as described in Section 3.4. After the third wash, resuspend the beads in the 100 µL of wash solution.
2. Take the Eppendorf tube containing the hybridization reaction from the 65°C heat block, and carefully remove the bottom layer, the hybridization reaction, away from the oil.
3. Add the hybridization reaction to the Dynabeads, mix gently, and leave at room temperature for 20 min to allow the biotinylated genomic DNA and the hybridized cDNA to bind to the beads. Mix occasionally, by gentle vortexing, to prevent the beads from settling.
4. Place the Eppendorf tube in the magnetic particle concentrator holder for 1 min. Remove and retain the supernatant (see Note 10).
5. Remove the Eppendorf tube from the magnetic particle concentrator holder, and resuspend the beads in 1 mL of wash solution I. Leave the Eppendorf tube to stand at room temperature for 15 min.
6. Place the Eppendorf tube in the magnetic particle concentrator holder for 1 min. Remove the supernatant.
7. Wash the beads one more time in 1 mL of wash solution I and three times in wash solution II at 65°C for 15 min (see Note 10).
8. Resuspend the beads in 50 µL of freshly prepared 0.1 M NaOH to elute the cDNAs from the genomic DNA. Leave the Eppendorf tube to stand at room temperature for 20 min. Mix occasionally, by gentle vortexing, to prevent the beads from settling.
9. Place the Eppendorf tube in the magnetic particle concentrator holder for 1 min. Remove the supernatant containing the eluted cDNAs and add to 50 µL of 1 M Tris-HCl, pH 7.5. Pass the neutralized, eluted cDNAs through a Sephadex G-50 column. This is the primary selected cDNA material and is ready for PCR amplification.
10. Set up five 25-µL PCR reactions, with the appropriate primers, containing 1 ng starting cDNA material, 1, 2, and 5 µL of the primary selected cDNA, and primers alone with no cDNA. Amplify the cDNA under the conditions described in Section 3.2.
11. Evaluate the amplified cDNA by electrophoresis on a 1% agarose gel. The ethidium bromide-stained gel should reveal a smear of cDNA products in the starting and primary selection. No product should be present in the primers alone lane (Fig. 2A) (see Note 11).
12. Southern blot (16,17) or vacuum blot the gel onto nylon membrane. Hybridize the positive reporter cDNA, which is to be used to monitor enrichment. After autoradiography, there should be an enrichment of the positive reporter cDNA in the lanes containing the primary selected cDNAs compared to the starting cDNA lane, and nothing should be observed in the primers only lane (Fig. 2A).
Fig. 2. Results of the first and second round of direct selection of HeLa cDNAs using 410 pooled cosmids from 5q35 as the genomic target. (A) PCR products of the starting and the primary selected HeLa cDNAs separated by electrophoresis on a 1% agarose gel, and hybridization of the reporter cDNA (fms-like tyrosine kinase receptor [FLT4], which has been localized to the region and is known to be present in the starting cDNAs and the genomic target) to the blotted gel. The result shows that FLT4 has been enriched by ~1000-fold after one round of direct selection. (B) PCR products of the starting, primary, and secondary selected HeLa cDNAs, and hybridization of the reporter cDNA, FLT4. The result shows that FLT4 has been enriched a further 10-fold in the secondary selection.
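The enrichment figures quoted for the two rounds combine multiplicatively. The sketch below illustrates the arithmetic; the starting abundance of 1 in 10^5 is an assumption borrowed from the spiking guideline in Section 3.1 (the actual starting abundance of FLT4 is not stated), and the function name is hypothetical.

```python
def enriched_fraction(start_fraction, fold):
    """Fraction of the cDNA pool made up by the reporter after enrichment.

    A fraction cannot exceed 1, so the result is capped.
    """
    return min(1.0, start_fraction * fold)

start = 1e-5                                 # assumed: reporter at 1 in 10^5
primary = enriched_fraction(start, 1000)     # ~1000-fold in the first round
secondary = enriched_fraction(primary, 10)   # a further ~10-fold in the second
cumulative_fold = secondary / start

print(primary)          # about 0.01 (1 in 100)
print(secondary)        # about 0.1  (1 in 10)
print(cumulative_fold)  # about 10,000-fold overall
```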
3.8. Conditions for Secondary Selection, Isolating the Secondary Selected Material, and Monitoring Enrichment
1. Determine which of the PCR reactions is a good representation of the primary selected cDNA. Then scale up that PCR reaction to produce ~2-3 µg of starting cDNA, as described in Section 3.2.
2. Suppress the repeats in the primary selected cDNA as described in Section 3.5.
3. Thaw the other half of the biotinylated DNA, add 5 µL of 2X hybridization solution, and overlay with mineral oil. Denature the DNA for 5 min at 100°C.
4. Hybridize the denatured biotinylated DNA to the repeat suppressed primary selected cDNAs under the conditions described in Section 3.6.
5. Capture the secondary selected cDNAs as described in Section 3.7.
6. Set up six 25-µL PCR reactions, with the appropriate primers, containing 1 ng starting cDNA material, 1 ng of the amplified primary selected cDNA, 1, 2, and 5 µL of the secondary selected cDNA, and primers alone with no cDNA. Amplify the cDNA under the conditions described in Section 3.2.
7. Evaluate the amplified cDNA by electrophoresis on a 1% agarose gel. The ethidium bromide-stained gel should reveal a smear of cDNA products in the starting, primary selection, and secondary selection lanes. No product should be present in the primers alone lane (Fig. 2B) (see Note 11).
8. Southern blot or vacuum blot the gel onto nylon membrane. Hybridize the positive reporter cDNA that was used to monitor enrichment in the primary selection. After autoradiography, further enrichment of the positive reporter cDNA should be observed in the lanes containing the secondary selected cDNAs compared to the primary selected cDNAs lane, and nothing should be observed in the primers only lane (Fig. 2B).
3.9. Cloning and Evaluating the Secondary Selected Material
1. Determine which of the PCR reactions is a good representation of the secondary selected cDNA. Prepare two PCR reactions using the appropriate modified primers for cloning into the UDG vector pAMP10 (Gibco BRL). Include as a control primers alone with no cDNA. Amplify the secondary selected cDNA under conditions described in Section 3.2.
2. Pool the two amplified cDNAs, and pass the product through a Sephadex G-50 spin column.
3. Measure the concentration of the PCR product. Evaluate 200 ng of the amplified cDNA by electrophoresis on a 1% agarose gel. Include the primers alone control. The length distribution of the amplified cDNAs should be the same as that observed when performing PCR reactions using the unmodified primers. There should be no PCR product in the primers alone lane.
4. Clone the amplified secondary selected cDNAs into pAMP10 in accordance with the manufacturer's recommendations.
5. Transform the cloned secondary selected material into MAX Efficiency DH5α Competent Cells as recommended by the manufacturer.
6. Plate out the transformed cells onto LB agar plates containing 100 µg/mL ampicillin. Incubate the plates at 37°C overnight.
7. Fill four sterile 96-well microtiter plates with 150 µL of LB containing 100 µg/mL ampicillin. Using sterile toothpicks, pick colonies and array into the 96-well microtiter plates. Incubate the plates at 37°C overnight.
8. Add to the 96-well microtiter plates 100 µL of LB containing 30% glycerol and 100 µg/mL ampicillin. Store plates at -70°C (see Note 12).
9. Pour LB agar containing 100 µg/mL ampicillin into eight 96-well microtiter plate lids. Prepare eight sheets of Hybond N+ nylon membrane (Amersham), measuring 8 x 12 cm, and dip each one into LB containing 100 µg/mL ampicillin. Drain, place onto LB agar, and allow to air-dry. Using a 96-well replica plater, transfer the arrayed clones onto two separate nylon membranes to create duplicate copies. If a robotic workstation is available, stamp the arrayed clones, in duplicate, in a high-density format onto the nylon membrane. Incubate the plates at 37°C. Process the membranes as recommended by the manufacturer.
10. Evaluate the cloned, arrayed, secondary selected cDNAs by hybridizing human genomic or Cot1 DNA, cloned ribosomal RNA genes, the positive reporter cDNA, and any other genes that are known to be in the genomic target, to the gridded filters (see Note 13), to identify background clones, which then will not be analyzed further. The remaining clones can then be picked, sequenced, and mapped (see Note 14).
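When clones are arrayed into four 96-well plates and later scored on replica filters, it helps to have a fixed mapping between the order in which colonies were picked and their plate and well positions. The sketch below is one possible bookkeeping scheme, not something specified in the protocol; the row-by-row fill order and the function name are assumptions.

```python
import string

ROWS = string.ascii_uppercase[:8]   # rows A-H of a 96-well plate

def well_for_clone(index):
    """Map a 0-based clone index to (plate number, well name).

    Plates are filled in order, and each plate is filled row by row
    (A1..A12, B1..B12, ...). This fill order is an assumption.
    """
    plate, offset = divmod(index, 96)
    row, col = divmod(offset, 12)
    return plate + 1, f"{ROWS[row]}{col + 1}"

print(well_for_clone(0))     # (1, 'A1')
print(well_for_clone(95))    # (1, 'H12')
print(well_for_clone(96))    # (2, 'A1')
print(well_for_clone(383))   # (4, 'H12'): last well of the fourth plate
```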
4. Notes
1. The PCR reactions described are for a Perkin Elmer 9600 and are performed in a volume of 25 µL.
2. To obtain a large quantity of starting cDNA, it is better to prepare ten 25-µL PCR reactions. After PCR amplification, pool the products and pass through a Sephadex G-50 spin column. The starting cDNA is then ready to be used in the first round of direct selection.
3. Occasionally, the YAC may migrate with one of the endogenous yeast chromosomes. We routinely run a test pulsed-field gel containing the YAC clones under standard conditions. The gel is stained with ethidium bromide and photographed with a ruler aligned alongside it. The gel is then Southern blotted onto a nylon membrane, and hybridized with 100 ng of radiolabeled Cot1 or 100 ng radiolabeled total genomic DNA. Contamination with a single yeast chromosome does not usually dramatically reduce the efficiency of the selection. However, substantial contamination with several yeast chromosomes and especially with the ribosomal gene cluster on yeast chromosome XII does greatly reduce the efficiency.
4. If the genomic target is cloned into a combination of vectors, ensure that each DNA is present in equimolar amounts.
5. If the ratio of bound:free radioactivity is <8:1, this indicates that the selection will probably be lowered in its efficiency. However, if the ratio is
10. Keep the supernatant and the wash solutions in case the direct selection failed because of faulty reagents. If this occurs, take 1 µL from the supernatant and the wash solutions, and PCR-amplify them. Evaluate the PCR products by electrophoresis on a 1% agarose gel. Southern blot or vacuum blot the gel onto a nylon membrane and hybridize the positive reporter cDNA to the filter to determine if any of the lanes show a smear. The result may indicate that either the cDNA did not hybridize to the template or that the cDNA was washed off the beads during the wash steps. If either of these appears to be the case, it is advisable to make fresh solutions and start again.
11. If a smear is detected in the primers only lane, repeat the PCR reactions with freshly made solutions. The oligo 1 primer can be contaminated with small amounts of the linkered cDNA, and if care is not taken, a smear will be observed in the primers only lane.
12. Make a working copy of the arrayed secondary selected sets. Store the working copy at -70°C.
13. Other background contaminants are low-copy-number repeats, mitochondrial DNA, and the yeast 2-µ plasmid, which are generally small in number (≤0.5%). These clones can be identified during sequencing analysis or counterscreened out using a cocktail of sequences as a probe.
14. The analysis of selected cDNA is quite extensive, and usually consists of redundancy removal, followed by sequencing and mapping of the selected clones back to the genomic target. The derivation of full-length cDNA sequences is usually the next priority.
References
1. Vorechovsky, I., Holland, J., Sideras, P., Dunham, I., Hammarstrom, L., Smith, C. I., Bentley, D. R., and Vetrie, D. (1993) Isolation of cDNA clones mapping around DXS178: a search for human X-linked agammaglobulinaemia gene using yeast artificial chromosomes, cosmids and direct cDNA selection. Immunodeficiency 4, 221-224.
2. The Huntington's Disease Collaborative Research Group (1993) A novel gene containing trinucleotide repeat that is expanded and unstable on Huntington's disease chromosomes. Cell 72, 971-983.
3. Hastbacka, J., de la Chapelle, A., Mahtani, M. M., Clines, G., Reeve-Daly, M., Daly, M., Hamilton, B., Kusumi, K., Trivedi, B., Weaver, A., Coloma, A., Lovett, M., Buckler, A., Kaitila, I., and Lander, E. (1994) The diastrophic dysplasia gene encodes a novel sulfate transporter: positional cloning by fine-structure linkage disequilibrium mapping. Cell 78, 1073-1087.
4. Savitsky, K., Bar-Shira, A., Gilad, S., Rotman, G., Ziv, Y., Vanagaite, L., Tagle, D. A., Smith, S., Uziel, T., Sfez, S., Ashkenazi, M., Pecker, I., Harnik, R., Patanjali, S., Simmons, A., Clines, G., Frydman, M., Sartiel, A., Gatti, R., Chessa, L., Sanal, O., Lavin, M., Jaspers, N., Taylor, A., Arlett, C., Miki, T., Weissman, S., Lovett, M., Collins, F., and Shiloh, Y. (1995) A single ataxia telangiectasia gene with a product similar to PI-3 kinase. Science 268, 1749-1753.
5. Lovett, M., Kere, J., and Hinton, L. M. (1991) Direct selection: a method for the isolation of cDNAs encoded by large genomic regions. Proc. Natl. Acad. Sci. USA 88, 9628-9632.
6. Parimoo, S., Patanjali, S. R., Shukla, H., Chaplin, D. D., and Weissman, S. M. (1991) cDNA selection: efficient PCR approach for the selection of cDNAs encoded in large chromosomal DNA fragments. Proc. Natl. Acad. Sci. USA 88, 9623-9627.
7. Morgan, J. G., Dolganov, G. M., Robbins, S. E., Hinton, L. M., and Lovett, M. (1992) The selective isolation of novel cDNAs encoded by the regions surrounding the human interleukin 4 and 5 genes. Nucleic Acids Res. 20, 5173-5179.
8. Simmons, A. D., Goodart, S. A., Gallardo, T. D., Overhauser, J., and Lovett, M. (1995) Five novel genes from the cri-du-chat critical region isolated by direct selection. Hum. Mol. Genet. 4, 295-302.
9. Simmons, A. D., Overhauser, J., and Lovett, M. (1996) Rapid isolation of cDNAs from the Cri-du-Chat critical region by direct screening of a chromosome-specific cDNA library (submitted).
10. Del Mastro, R. G., Stotler, C., Buckler, A., and Lovett, M. (1996) A comparison of direct selection and exon amplification techniques in the derivation of a transcription map of chromosome 5q35 (manuscript in preparation).
11. Reyes, G. R., Bradley, D. W., and Lovett, M. (1992) New strategies for the isolation of low abundance viral and host cDNAs: application to cloning of the Hepatitis E virus and analysis of tissue-specific transcription. Sem. Liver Dis. 12, 289-300.
12. Lovett, M. (1994) Fishing for complements: finding genes by direct selection. Trends Genet. 10, 352-357.
13. Lovett, M. (1994) Direct selection of cDNAs using genomic contigs, in Current Protocols in Human Genetics (Dracopoli, N., et al., eds.), Wiley Interscience, New York, p. 6.3.1.
14. Del Mastro, R. G., Wang, L., Simmons, A. D., Gallardo, T. D., Clines, G. A., Ashley, J. A., Hilliard, C. J., Wasmuth, J. J., McPherson, J. D., and Lovett, M. (1995) Human chromosome-specific cDNA libraries: new tools for gene identification and genome annotation. Genome Res. 5, 185-194.
15. Parimoo, S., Kolluri, R., and Weissman, S. M. (1993) cDNA selection from total yeast DNA containing YACs. Nucleic Acids Res. 21, 4422-4423.
16. Sambrook, J., Fritsch, E. F., and Maniatis, T. (1989) Molecular Cloning: A Laboratory Manual, 2nd ed. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
17. Dracopoli, N., Haines, J., Korf, B., Moir, D., Morton, C., Seidman, C., Seidman, J., and Smith, D. (eds.) (1994) Current Protocols in Human Genetics. Wiley Interscience, New York.
18. Ausubel, F. M. (ed.) (1987) Current Protocols in Molecular Biology. Wiley Interscience, New York.
19. Vollrath, D. (1992) Resolving multimegabase DNA molecules using contour-clamped homogeneous electric fields (CHEF), in Methods in Molecular Biology, vol. 12: Pulsed-Field Gel Electrophoresis (Burmeister, M. and Ulanovsky, L., eds.), Humana Press, Totowa, NJ, pp. 19-30.
20. Chandrasekharappa, S. C., Marchuk, D. A., and Collins, F. S. (1992) Analysis of yeast artificial chromosome clones, in Methods in Molecular Biology, vol. 12: Pulsed-Field Gel Electrophoresis (Burmeister, M. and Ulanovsky, L., eds.), Humana Press, Totowa, NJ, pp. 235-257.
14 Isolation of cDNAs Using the YAC Hybridization Screen Method
Carrie Fidler and Jacqueline Boultwood
1. Introduction
There are currently a number of techniques available for the isolation of novel coding sequences from cloned genomic fragments. The most recent methodologies include exon trapping (1), direct selection (2,3), and screening of cDNA libraries with whole radiolabeled yeast artificial chromosomes (YACs) (YAC hybridization) (4). The advantage the YAC hybridization method has over direct selection and exon trapping is that it is technically far less demanding and labor-intensive. The method essentially involves the hybridization of a radiolabeled YAC, containing the genomic DNA of interest, to a cDNA library that has been transformed into bacteria and replicated onto nitrocellulose filters. The YAC labeling procedure is similar to standard protocols for the labeling of smaller probes, with an additional critical step to block the repetitive elements in the YAC. This method is particularly suitable where large regions of the genome are being investigated, since YAC vectors allow the cloning of large inserts (5).
A significant disadvantage of the YAC hybridization method, however, is that it suffers from a high incidence of false positives, resulting primarily from repetitive and ribosomal sequences. YAC probes have a high repetitive content. Alu repeat sequences, for example, have an average spacing of approx 4 kb in the human genome (6). The problem of repeat sequence isolation is partly addressed by blocking the YAC with human placental DNA. However, careful selection and screening of positive clones are necessary. Yeast-related sequences are another common source of false positives. Indeed, a proportion of the YAC probe is derived from yeast sequences. Clear separation of the YAC from the other yeast chromosomes during probe preparation reduces the number of false positives
resulting from yeast sequences. Careful selection of the cDNA library will also minimize these false positives; some libraries are constructed using carrier yeast tRNA, and large numbers of yeast sequences are isolated from them. cDNA libraries are now available that are not constructed in this way and are therefore more suitable (7).
The YAC hybridization method is relatively inefficient compared with other methods, such as direct selection. It is necessary, for example, to screen a large number of clones in order to identify less abundant transcripts. Again, the choice of cDNA library can help alleviate this problem. Use of normalized libraries (libraries constructed to increase the abundance of rare transcripts) should increase the probability of detecting low-level expressed genes (8). The efficiency of the YAC hybridization method was demonstrated by Elvin et al. (4), who used the technique to isolate cDNAs for aldose reductase and compared YAC hybridization with hybridization using an aldose reductase cDNA. This study showed that large YACs are less efficient in detecting cDNAs and that cDNAs with small inserts are less likely to be detected. However, 10% of clones identified with the YAC were aldose reductase positive, and they were identified using a probe in which only approx 1% of the DNA represented the target sequence.
A number of groups have successfully used the YAC hybridization method to isolate novel cDNAs. The technique was first described by Wallace et al. (9), who used it to identify part of the neurofibromatosis (NF1) gene. Geraghty et al. (10) used a 480-kb YAC to screen a human retinal cDNA library and isolated four novel cDNAs mapping to the critical region of the X-chromosome in retinal degenerative disorders. Also, Snell et al. (7) isolated seven unique cDNA clones, using a series of YACs that map to the Huntington's disease candidate gene region.
This chapter first describes the preparation of both YAC probes and cDNA library filters prior to hybridization. It then details the labeling, blocking, and hybridization of the YAC to the cDNA library, and finally the selection of positive cDNA clones. After identification, positive clones should be subjected to a screening procedure to eliminate any false positives. This is described in detail in Section 4 (see Note 17).
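The need to screen many clones to recover a rare transcript can be put into rough numbers with a simple sampling model. The following is a minimal sketch, assuming that each colony is an independent draw from the library and that the transcript of interest makes up a fixed fraction of all clones. The function names and the example abundance (1 in 10,000) are illustrative assumptions, not values taken from this chapter.

```python
import math


def prob_at_least_one(fraction, n_colonies):
    """Probability that at least one of n_colonies carries the transcript."""
    return 1.0 - (1.0 - fraction) ** n_colonies


def colonies_needed(fraction, confidence=0.99):
    """Smallest number of colonies giving the requested detection probability."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fraction))


if __name__ == "__main__":
    f = 1e-4  # hypothetical transcript abundance: 1 clone in 10,000
    # 8 plates at about 2000 colonies/plate (see Note 5) is about 16,000 colonies
    print(prob_at_least_one(f, 16000))
    print(colonies_needed(f, 0.99))
```

Under these assumptions, a normalized library helps simply by raising the fraction for rare transcripts, which lowers the number of colonies required.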
2. Materials
2.1. Preparation of Agarose Plugs Containing YAC DNA
1. SD media: 7 g yeast nitrogen base without amino acids, 20 g glucose, 55 mg adenine hemisulfate, 55 mg tyrosine. Dissolve in warm water, adjust pH to 7.0 using NaOH, make up to 1 L, and autoclave. Cool and store at 4°C. Just before use, add filter-sterilized 20% Casamino Acids (CAS) (Difco Laboratories Ltd., Surrey, UK), 14 mL/200 mL of SD media.
2. Agarose: Dissolve low-gelling-temperature (LGT) agarose to a concentration of 2% in 1M sorbitol and 20 mM EDTA. Cool to 42°C and hold at this temperature. Add β-mercaptoethanol to a concentration of 14 mM.
3. Yeast solution A: 1M sorbitol, 20 mM EDTA, 14 mM β-mercaptoethanol, 1 mg/mL lyticase.
4. Yeast solution B: 1M sorbitol, 20 mM EDTA, 10 mM Tris-HCl, pH 7.5, 14 mM β-mercaptoethanol, 1 mg/mL lyticase.
5. Block mold: A mold should contain a number of slots to make agarose blocks of the correct dimensions to fit the wells of a pulsed-field gel electrophoresis gel. It can be made from Perspex™ in the laboratory workshop or bought commercially.
6. Yeast lysis solution: 1% lithium dodecyl sulfate, 100 mM EDTA, 10 mM Tris-HCl, pH 8.0. Filter-sterilize.
2.2. Purification of YAC DNA
1. TE: 10 mM Tris-HCl, pH 8.0, 1 mM EDTA.
2. 0.5X TBE buffer: 45 mM Tris-HCl, 45 mM boric acid, 0.5 mM EDTA. Make as a 10X stock solution.
3. 1% High-melting-temperature agarose in 0.5X TBE buffer.
4. Ethidium bromide: 10 mg/mL stock solution in distilled water.
2.3. Preparation of cDNA Library Filters
1. Competent cells: Frozen or prepared according to standard methods.
2. Nitrocellulose filters, e.g., Hybond N (RPN137N, Amersham International plc, Slough, UK).
3. cDNA library: Choose a suitable cDNA library, e.g., fetal brain.
4. LB agar plates: 10 g bacto-tryptone, 5 g bacto-yeast extract, 10 g NaCl. Dissolve in 800 mL of distilled water, adjust pH to 7.0, and make up to 1 L. Just before autoclaving, add 15 g bacto-agar. Before pouring, cool to 50°C, and add appropriate antibiotic.
2.4. Lysis of Colonies and Binding of DNA to Replica Filters
1. 10% sodium dodecyl sulfate (SDS): Make as 20% stock solution.
2. Denaturing solution: 0.5M NaOH, 1.5M NaCl.
3. Neutralizing solution: 1.5M NaCl, 0.5M Tris-HCl, pH 7.4.
4. 20X SSC stock: 3M NaCl, 0.3M sodium citrate. Dilute accordingly.
2.5. Labeling and Hybridization of YAC to cDNA Filters
1. Random primed labeling kit: Use according to manufacturer's instructions (e.g., Boehringer Mannheim, Lewes, East Sussex, UK).
2. 32P-dCTP: 3000 Ci/mmol (Amersham International).
3. 50X Denhardt's: 5 g ficoll, 5 g bovine serum albumin, 5 g polyvinylpyrrolidone. Distilled H2O to 500 mL.
4. 20X SSPE stock solution: 173.5 g NaCl, 27.6 g NaH2PO4·H2O, 7.4 g EDTA. Dissolve in 800 mL of distilled H2O, adjust pH to 7.0 with NaOH, adjust volume to 1 L, and autoclave.
5. Prehybridization solution: 5X SSPE, 1% SDS, 5X Denhardt's, 5% dextran sulfate, 100 µg/mL denatured sheared salmon testes DNA. Prewarm to 65°C.
6. Human placental DNA.
7. pBR322 DNA.
8. Sephadex grade G-100.
9. Hybridization solution: Prehybridization solution plus, just prior to addition to filter, labeled YAC probe.
10. X-ray film: Use film of an appropriate sensitivity (e.g., Fuji RX).
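Working solutions such as the prehybridization solution are made by diluting the stocks listed above, and the volumes follow from C1 x V1 = C2 x V2. The sketch below is one way to tabulate them. The 20X SSPE, 50X Denhardt's, and 20% SDS stocks are those given in Section 2; the dextran sulfate and salmon testes DNA stock concentrations are not stated in this chapter and are assumed here purely for illustration, as are the function names.

```python
def stock_volume(final_conc, stock_conc, final_volume_ml):
    """C1*V1 = C2*V2: volume of stock (mL) needed for the final volume."""
    if final_conc > stock_conc:
        raise ValueError("final concentration exceeds stock concentration")
    return final_conc * final_volume_ml / stock_conc


def prehyb_recipe(final_volume_ml):
    """Volumes (mL) of each stock for the prehybridization solution."""
    # (final, stock) pairs expressed in matching units
    components = {
        "20X SSPE": (5, 20),                # 5X final
        "20% SDS": (1, 20),                 # 1% final
        "50X Denhardt's": (5, 50),          # 5X final
        "50% dextran sulfate": (5, 50),     # ASSUMED 50% (w/v) stock
        "10 mg/mL salmon DNA": (0.1, 10),   # 100 ug/mL final; ASSUMED stock
    }
    recipe = {
        name: stock_volume(final, stock, final_volume_ml)
        for name, (final, stock) in components.items()
    }
    recipe["distilled water"] = final_volume_ml - sum(recipe.values())
    return recipe


if __name__ == "__main__":
    for name, ml in prehyb_recipe(10).items():
        print(f"{name}: {ml:.2f} mL")
```

Substituting the stock concentrations actually in use in the laboratory is the only change needed to adapt the table.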
3. Methods
3.1. Preparation of Agarose Plugs Containing YAC DNA
1. Inoculate 10 mL of supplemented SD media with a loop of yeast cells. Grow with shaking for 48 h at 30°C.
2. Prepare agarose, and hold at 42°C.
3. Pellet yeast cells at approx 300g for 30 min, and wash once in 50 mM EDTA.
4. Resuspend pellet in 400 µL of yeast solution A.
5. Add 500 µL of the molten agarose, and pipet the cell suspension into the slots of a Perspex block mold. Allow to set on ice.
6. Gently push the solidified agarose plugs out of the slots using a sealed Pasteur pipet. Collect the plugs in approx 10 mL of yeast solution B. Incubate at 37°C for 2 h.
7. Replace the solution with 5 mL of filter-sterilized yeast lysis solution. Incubate at 37°C for 30-60 min.
8. Replace with another 5 mL of yeast lysis solution. Incubate overnight (16-24 h) at 37°C.
9. Pour off solution. Blocks can now be stored in yeast lysis solution at room temperature.
3.2. Purification of YAC DNA
YACs are separated by pulsed-field gel electrophoresis (PFGE). The clear separation of the YAC from the other yeast chromosomes is necessary to minimize the number of false positives isolated from yeast-related sequences.
1. Thoroughly wash agarose plugs: 3 x 30 min in TE at 50°C, followed by 3 x 30 min in TE at room temperature, and finally 3 x 30 min in electrophoresis buffer (0.5X TBE) at room temperature.
2. Load blocks in a preparative PFGE gel. Load 1-2 plugs/well. In addition, load Saccharomyces cerevisiae marker in one lane (see Note 1).
3. Run gel under standard conditions for separation of S. cerevisiae chromosomes.
4. Stain the gel in 0.5 µg/mL ethidium bromide to visualize the DNA on a UV transilluminator.
5. Excise the YAC (the appropriate sized band), and purify by electroelution or purification columns (see Note 2).
6. Quantify the YAC DNA, and store at -20°C.
3.3. Preparation of cDNA Library Filters
This is based on the method of Hanahan and Meselson (11,12).
1. Transform an appropriate cDNA library into competent cells (see Note 3).
2. Number dry nitrocellulose filters, wet them with distilled water, and sandwich between dry Whatman 3MM paper. Wrap in aluminum foil, and autoclave (15 lb/in² on liquid cycle). Prepare enough filters to make a master and two replicas of each plate (see Note 4).
3. Using sterile forceps, lay a sterile filter number side down on an LB plate containing the appropriate antibiotic. When the filter is completely wet, turn it numbered side up on the agar plate.
4. Pipet the transformed bacteria onto the center of the filter on the agar plate, and spread using a sterile glass spreader (see Note 5).
5. Allow filter to absorb the liquid for a few minutes. Then close the lid, invert the plate, and incubate at 37°C overnight.
6. Peel the master filter from the plate, and place it colony side up on a dampened pad of sterile 3MM paper.
7. Take one of the sterile numbered filters and lay it, numbered side down, on the master filter. Try to avoid air bubbles.
8. Press the two filters together using a heavy glass plate.
9. Orient the two filters by making a series of holes in the two filters with a needle.
10. Peel the two filters apart. Lay the replica filter on a fresh agar plate, colony side up, and incubate at 37°C until colonies appear (approx 4-6 h) (see Note 6).
11. Repeat the process to make a second replica of the master filter (see Note 7).
12. Replace the master filter onto a fresh agar plate, and incubate at 37°C for 1 h. Then store at 4°C or freeze (see Note 8).
13. Replica filter plates can be stored at 4°C prior to lysis of colonies and DNA binding.
3.4. Lysis of Colonies and Binding of DNA to Replica Filters
This is based on the original method of Grunstein and Hogness (13).
1. Cut three pieces of Whatman 3MM paper, and place onto the bottom of 3 glass/plastic trays. Saturate each piece in one of the following: 10% SDS, denaturing solution, and neutralizing solution (see Note 9).
2. Place a glass dish containing 500 mL of 2X SSC and 0.5% SDS on a shaker.
3. Using sterile forceps, peel the filters from their agar plates, and place colony side up in each of the trays in the following order:
a. SDS for 3 min;
b. Denaturing solution for 5 min; and
c. Neutralizing solution for 5 min.
Transfer the filters from tray to tray using sterile forceps.
4. Transfer the filters to the SSC/SDS, submerge, and shake for 15 min. After shaking, gently wipe the bacterial colonies from the filters using a gloved hand (see Note 10).
5. Lay the filters colony side up on a sheet of dry 3MM paper. Allow to dry at room temperature for 15-30 min.
6. Sandwich between two sheets of 3MM paper, and bake at 80°C for 2 h. Filters can now be stored dry at room temperature.
3.5. Labeling of YAC and Hybridization to cDNA Filters
The labeling and blocking of the YAC are based on the method described by Elvin et al. (4).
1. Label approx 100 ng of YAC DNA.
2. Label the YAC with 32P-dCTP by random priming (14) according to manufacturer's instructions (see Note 11).
3. Label at 37°C overnight.
4. Separate labeled YAC from unincorporated nucleotides with Sephadex grade G-100 columns.
5. After separation, boil labeled YAC for 10 min in 5X SSC with 2.5 µg/µL human placental DNA and 0.05 µg/µL pBR322 DNA in a total volume of 400 µL.
6. Incubate for 4 h at 65°C (see Note 12).
7. Wet cDNA filters in 2X SSC and place into a heat-sealable plastic bag or hybridization chamber, and add 5-10 mL of prehybridization solution. Incubate the filters in the prehybridization solution for 4 h at 65°C (see Note 13).
8. Remove the prehybridization solution from the bag or chamber, and add an equal volume of hybridization solution containing the YAC probe. Hybridize the filters for 20 h at 65°C.
9. Following hybridization, submerge the filters in 4X SSC/0.1% SDS for 5 min at room temperature. Transfer the filters into a chamber containing 2X SSC/0.1% SDS at 65°C for 20 min. Finally, transfer the filters into a chamber containing 0.5X SSC/0.1% SDS and incubate at 65°C for approx 15 min.
10. Wrap the filters in cling film and place on X-ray film. Expose the film for 1-7 d at -70°C.
3.6. Isolation of Positive Clones
1. Orient the master plate with the autoradiograph and pick positive colonies (see Notes 14, 15, and 16).
2. Subject clones to a screening procedure (see Note 17 and Figs. 1 and 2).
4. Notes
1. Make a 1% high-gelling-temperature agarose gel in 0.5X TBE. The YAC blocks are pushed into the wells using a sterile pipet tip. The slots are sealed with 1% LGT agarose. The gel is run in 0.5X TBE buffer under standard conditions for the separation of S. cerevisiae chromosomes.
2. The YAC is often clearly visible when compared with the S. cerevisiae marker and is easily separated from other yeast chromosomes. However, sometimes the YAC is not distinguishable. In this case, cut out the band that corresponds to the size of the YAC.
Fig. 1. Mapping cDNA clones to a specific chromosome. (1) EcoRI-digested human DNA. (2) EcoRI-digested human/mouse hybrid DNA from a cell line with the chromosome of interest (in this case, human chromosome 5) as its only human complement.
Fig. 2. Representative example of Northern analysis using a cDNA clone. (1) pancreas. (2) kidney. (3) skeletal muscle. (4) liver. (5) lung. (6) placenta. (7) brain. (8) heart.
3. The fetal brain cDNA library is commonly used, since this tissue expresses a wide variety of genes.
4. Label the master plate A and the replicas B and C. The master plate is stored for picking positive clones, and the replicas are used for hybridization with the YAC.
5. Plate the bacteria to produce 1.5-2.0 x 10^3 colonies/plate. Produce eight plates with this colony density for a reasonable representation of the library.
6. Do not allow the colonies to grow too large; smaller colonies produce sharper hybridization signals.
7. The second replica should be oriented using the existing holes in the master filter.
8. Freezing the master plate: Replace the filter onto a fresh LB plate containing the appropriate antibiotic and 25% glycerol. Incubate at 37°C for 1 h, seal with Parafilm™, and store inverted at -20°C (12).
9. Do not allow the 3MM paper to become too wet, since colonies could swell and burst, causing blurred hybridization signals.
10. The colonies should be gently wiped, so the filter surface no longer feels "slimy."
11. The probes are labeled to a specific activity of approx 10^9 cpm/µg.
12. This is to allow the preassociation of the YAC with human and vector sequences.
13. The volume of prehybridization solution used depends on the chamber used. With sealable bags, we typically use 5 mL. For hybridization chambers, refer to manufacturer's instructions.
14. It is important to select colonies giving a range of signal intensities, since those giving the strongest signals are often, but not always, repeat sequences (7).
15. Often one cannot identify an individual hybridizing colony accurately owing to the difficulty of aligning the filter and autoradiograph, and the density of colonies. Therefore, it is often necessary to pool several adjacent clones and use them to inoculate 10-50 mL of culture media with appropriate antibiotic.
The culture is grown for several hours, then diluted, and replated on agar plates to obtain approx 500 colonies/plate. These are then replica-plated and screened again by hybridization. A single, well-isolated, positive colony should be picked from the secondary screen and used for further analysis.
16. We have used the technique to isolate new coding sequences mapping to the critical region of gene loss of the 5q-syndrome. The number of false positives can be significantly reduced by employing a simple exclusion method. If autoradiographs of the same cDNA filter, sequentially hybridized to two nonoverlapping YACs, are superimposed, the patterns of positive clones are almost identical. However, since the YACs are nonoverlapping, they should not identify the same cDNAs; hence, positive clones common to both YACs are probably repetitive or yeast-related sequences, and can be excluded from further analysis.
This reduced our false-positive rate from 84% (at worst) to <30% (15).
17. In order to eliminate false positives, a number of screening procedures are necessary.
a. All positively identified clones are hybridized with a ribosomal probe to eliminate any ribosomal sequences.
b. Somatic cell hybrids can then be used to localize the clones. Southern blots containing restriction enzyme-digested DNA from human, mouse, hamster, and a
somatic cell hybrid with the appropriate human chromosome as its only human complement are hybridized with the positive clones. This ensures that the clones map to the correct chromosome (see Fig. 1). Standard Southern blot analysis should also reveal if the clone is repetitive.
c. Clones are then hybridized to a Southern blotted filter of PFGE-separated YACs. The filter should contain the YAC that identified the clone, a number of unrelated YACs, and the host yeast strain. This should show localization of the positive clone to the appropriate YAC.
d. Positive clones that satisfy a, b, and c are hybridized to a Northern blot (see Fig. 2) to determine if they are expressed.
e. Finally, positive clones are sequenced and the sequence data compared with the GenBank database for sequence similarities.
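The exclusion method of Note 16 amounts to a set comparison: clones that are positive with both of two nonoverlapping YACs are discarded, and only clones unique to one YAC are carried forward. A minimal sketch follows, assuming that each positive has already been scored by hand from the autoradiographs and recorded as a clone identifier. The identifiers and the function name are hypothetical.

```python
def exclude_shared_positives(positives_yac_a, positives_yac_b):
    """Split positives from two nonoverlapping YACs into unique and shared.

    Shared positives are probably repetitive or yeast-related sequences
    and are set aside; unique positives go on to further screening.
    """
    a, b = set(positives_yac_a), set(positives_yac_b)
    shared = a & b
    return {
        "yac_a_only": sorted(a - shared),
        "yac_b_only": sorted(b - shared),
        "excluded": sorted(shared),
    }


if __name__ == "__main__":
    # Hypothetical clone identifiers (filter letter and colony number)
    yac_a = ["B-012", "B-047", "B-101", "B-230"]
    yac_b = ["B-047", "B-101", "B-305"]
    result = exclude_shared_positives(yac_a, yac_b)
    print(result)
```

The clones retained by this step would still pass through the screening procedures of Note 17.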
References
1. Buckler, A. J., Chang, D. D., Graw, S. L., Brook, J. D., Haber, D. A., Sharp, P. A., and Housman, D. E. (1991) Exon amplification: a strategy to isolate mammalian genes based on RNA splicing. Proc. Natl. Acad. Sci. USA 88, 4005-4009.
2. Lovett, M., Kere, J., and Hinton, L. M. (1991) Direct selection: a method for the isolation of cDNAs encoded by large genomic regions. Proc. Natl. Acad. Sci. USA 88, 9628-9632.
3. Parimoo, S., Patanjali, S. R., Shukla, H., Chaplin, D. D., and Weissman, S. M. (1991) cDNA selection: efficient PCR approach for the selection of cDNAs encoded in large chromosomal DNA fragments. Proc. Natl. Acad. Sci. USA 88, 9623-9627.
4. Elvin, P., Slynn, G., Black, D., Graham, A., Butler, R., Riley, J., Anand, R., and Markham, A. F. (1990) Isolation of cDNA clones using yeast artificial chromosome probes. Nucleic Acids Res. 18, 3913.
5. Burke, D. T., Carle, G. F., and Olson, M. V. (1987) Cloning of large segments of exogenous DNA into yeast by means of artificial chromosome vectors. Science 236, 806.
6. Sealey, P. G., Whittaker, P. A., and Southern, E. M. (1985) Removal of repeated sequences from hybridisation probes. Nucleic Acids Res. 13, 1905.
7. Snell, R. G., Doucette-Stamm, L. A., Gillespie, K. M., Taylor, S. A. M., Riba, L., Bates, G. P., Altherr, M. R., MacDonald, M. E., Gusella, J. F., Wasmuth, J. J., Lehrach, H., Housman, D. E., Harper, P. S., and Shaw, D. J. (1992) The isolation of cDNAs within the Huntington disease region by hybridisation of yeast artificial chromosomes to a cDNA library. Hum. Mol. Genet. 2, 305-309.
8. Ko, M. S. H. (1990) An "equalised cDNA library" by the reassociation of short double stranded cDNAs. Nucleic Acids Res. 18, 5705-5711.
9. Wallace, M. R., Marchuk, D. A., Andersen, L. B., Letcher, R., Odeh, H. M., Saulino, A. M., Fountain, J. W., Brereton, A., Nicholson, J., Mitchell, A. L., Brownstein, B. H., and Collins, F. S. (1990) Type I Neurofibromatosis gene: identification of a large transcript disrupted in three NF1 patients. Science 249, 181.
10. Geraghty, M. T., Brody, L. C., Martin, L. S., Marble, M., Kearns, W., Pearson, P., Monaco, A. P., Lehrach, H., and Valle, D. (1993) The isolation of cDNAs from OATL1 at Xp11.2 using a 480-kb YAC. Genomics 16, 440-446.
11. Hanahan, D. and Meselson, M. (1980) Plasmid screening at high colony density. Gene 10, 63.
12. Hanahan, D. and Meselson, M. (1983) Plasmid screening at high colony density. Methods Enzymol. 100, 333.
13. Grunstein, M. and Hogness, D. S. (1975) Colony hybridisation: a method for the isolation of cloned DNAs that contain a specific gene. Proc. Natl. Acad. Sci. USA 72, 3961.
14. Feinberg, A. P. and Vogelstein, B. (1983) A technique for radiolabelling DNA restriction endonuclease fragments to high specific activity. Anal. Biochem. 132, 6-13.
15. Boultwood, J., Fidler, C., Soularue, P., Wang Jabs, E., Lovett, M., Cotter, F., Muller, U., Auffray, C., and Wainscoat, J. S. (1996) Primary transcription map of the critical region of the 5q-syndrome (submitted).
15 Detection and Isolation of Differentially Expressed Genes by Differential Display
Weimin Zhu and Peng Liang
1. Introduction
Temporal and spatial expression of the 100,000 different genes in the genome of a mammal is a highly regulated process that determines the normal development of an organism. This assumption is based on the fact that only a fraction of these genes, in the range of 10,000-15,000, are expressed in any cell type. Alterations in this process are therefore often the cause of developmental and pathological abnormalities, such as cancer. In retrospect, the importance of differential gene expression was perhaps best appreciated many years back when p53 protein was first found to be one of the major proteins induced on SV40 virus infection (1). This was done by comparing, side by side, proteins extracted from noninfected and infected cells on a one-dimensional sodium dodecyl sulfate (SDS) gel. It later turned out that the marked increase in p53 protein level was not owing to upregulation of the gene; rather, the SV40 large T-antigen binds to p53, resulting in the stabilization as well as inactivation of the protein. This led to the later identification of p53 as one of the major tumor suppressor genes (2). Therefore, comparative studies such as this, at either mRNA or protein levels, may hold the key to our understanding of the fundamental difference between normal and pathological developmental processes.
Two-dimensional protein gel electrophoresis was developed for the purpose of comparative studies at a much higher resolution than that described (3). The finding, by this method, that most proteins are commonly expressed, for example, between a normal cell and its transformed derivative further reinforces the notion that differential gene expression may hold the key to our understanding of many biological questions. However, the limitation of this
method became evident mostly because of its sensitivity problem, since at most only up to 2000 of 10,000 different proteins in a mammalian cell are detectable. Therefore, the majority of the genes, especially the ones expressed at lower levels, may not be represented. The other major pitfall of the method is the difficulty of retrieving enough of the identified protein spots for further molecular characterization.
More recently, comparative studies at messenger RNA or cDNA levels using differential hybridization and subtractive hybridization techniques (4) have allowed more important differentially expressed genes to be isolated. These included a number of important genes, such as T-cell receptor; nm23, a potential metastasis suppressor gene; and p21WAF1/Sdi1, a target gene of the p53 tumor suppressor protein. The drawbacks of these methods are largely owing to their technical difficulty, large amount of mRNA consumption, and comparison of only two different samples at a time in only one direction (either up- or downregulated genes).
To speed up the hunt for regulated genes, differential display was developed with the aim of overcoming the limitations of previous methodologies. To achieve this goal, the method has to be simple, sensitive, systematic, and reliable. The method has to be simple before it can be adopted easily and widely. The revolution in molecular biology, in fact, has been powered mostly by simple methodological breakthroughs, such as recombinant DNA technology, DNA sequencing, and polymerase chain reaction (PCR). The simplicity also ensures the reliability of the method. The method has to be sensitive, so it can be applied to biological systems where only scarce biological samples are available. The method has to be systematic, so a complete search of all the expressed genes would be possible.
With these in mind, differential display was developed by combining three powerful and simple molecular biological methods: PCR, DNA sequencing gel electrophoresis, and cloning of the cDNA species of interest (5). Since the description of the differential display method in 1992, many modifications have been made to streamline and optimize the method (6). However, the essence of differential display methodology has not changed very much. The principle is depicted in Fig. 1, incorporating the latest improvements using one-base anchored oligo-dT primers (7). First, mRNAs are converted to cDNAs using three individual anchored oligo-dT primers that differ from each other at the last 3' non-T base. The use of anchored primers enables the homogeneous initiation of cDNA synthesis at the beginning of the poly(A) tail for any given mRNA. The resultant three subpopulations of cDNAs are further amplified and labeled with isotope by PCR in the presence of a set of arbitrary primers. As a result, mRNA 3'-termini defined by any given pair of anchored primer and arbitrary primer are amplified and displayed by denaturing polyacrylamide gel
[Figure 1: flow diagram. I. Reverse transcription of poly(A) mRNA with a one-base anchored oligo-dT primer (H-T11M), dNTPs, and MMLV reverse transcriptase. II. PCR amplification with an arbitrary primer (H-AP-1, 5'-AAGCTTGATTGCC-3'), the H-T11M primer, dNTPs, radiolabeled dATP, and AmpliTaq DNA polymerase. III. Denaturing polyacrylamide gel of the RNA samples, run from the negative electrode (-) to the positive electrode (+).]
Fig. 1. Schematic representation of one-base anchored differential display.
electrophoresis. Side-by-side comparisons of such cDNA patterns between relevant RNA samples would reveal differences that may represent mRNAs whose expression has been altered. The following step-by-step protocol for differential display has been largely adopted from the instruction manual of the RNAimage™ kits from GenHunter Corporation (Nashville, TN).
2. Materials
1. 5X Reverse transcriptase (RT) buffer: 125 mM Tris-HCl, pH 8.3, 188 mM KCl, 7.5 mM MgCl2, and 25 mM dithiothreitol.
2. MMLV RT (100 U/µL).
3. dNTP (250 µM).
4. 5'-AAGCTTTTTTTTTTTG-3' (2 µM).
5. 5'-AAGCTTTTTTTTTTTA-3' (2 µM).
6. 5'-AAGCTTTTTTTTTTTC-3' (2 µM).
7. Arbitrary 13-mers (2 µM).
8. 10X PCR buffer: 100 mM Tris-HCl, pH 8.4, 500 mM KCl, 15 mM MgCl2, 0.01% gelatin.
9. dNTP (25 µM).
10. Glycogen (10 mg/mL).
11. dH2O.
12. Loading dye: 95% formamide, 10 mM EDTA, pH 8.0, 0.01% xylene cyanole FF, 0.01% bromophenol blue.
13. AmpliTaq DNA polymerase (5 U/µL), Perkin-Elmer (Norwalk, CT).
14. α-[35S]dATP (>1000 Ci/mmol), or α-[33P]dATP (>2000 Ci/mmol).
15. RNase-free DNase I (10 U/µL).
16. Thermocycler.
17. DNA sequencing apparatus.
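Each of the three anchored primers listed above primes a different subpopulation of mRNAs, because its last 3' base (M) must pair with the base immediately upstream of the poly(A) tail. The following sketch illustrates that rule for a single sequence. It assumes the mRNA is supplied 5' to 3' with its poly(A) tail attached; the function names and the example sequence are hypothetical, and the primer is written in the AAGCT11M form used in this chapter.

```python
# Watson-Crick complements; U is accepted so RNA sequences can be used
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "U": "A"}


def anchor_base(mrna):
    """Return the 3' base M of the anchored primer that anneals to mrna.

    M is the complement of the last non-A base before the poly(A) tail.
    """
    body = mrna.upper().rstrip("A")  # strip the poly(A) tail
    if not body:
        raise ValueError("sequence consists only of A residues")
    return COMPLEMENT[body[-1]]


def anchored_primer(mrna):
    """Return the one-base anchored oligo-dT primer (5' to 3')."""
    return "AAGC" + "T" * 11 + anchor_base(mrna)


if __name__ == "__main__":
    example = "GGAUCCGUACAAAAAAAAAAAAAAA"  # hypothetical mRNA 3' end
    print(anchored_primer(example))
```

Because M can be G, A, or C but never T, every polyadenylated mRNA falls into exactly one of the three cDNA subpopulations.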
Although individual components may be purchased separately from various suppliers, most of them can be obtained in kit form from GenHunter Corporation.
3. Methods
3.1. DNase I Treatment of Total RNA
Purification of polyadenylated RNAs is neither necessary nor helpful for differential display. The major pitfalls of using polyadenylated mRNAs are the frequent contamination by oligo-dT primers, which gives high background smearing in the display, and the difficulty in assessing the integrity of the mRNA templates. Total cellular RNAs can be easily purified with a one-step acid-phenol extraction method using RNAzol B reagent (Biotecx, Houston, TX). However, whatever method is used for the total RNA purification, trace amounts of chromosomal DNA contamination in the RNA sample could be amplified along with mRNAs, thereby complicating the pattern of displayed bands. Therefore, removal of all contaminating chromosomal DNA from RNA samples is essential before carrying out differential display.
1. Incubate 10-100 µg of total cellular RNA with 10 U of DNase I (RNase-free) in 10 mM Tris-HCl, pH 8.3, 50 mM KCl, 1.5 mM MgCl2 for 30 min at 37°C.
2. Inactivate DNase I by adding an equal volume of phenol:chloroform (3:1) to the sample.
3. Mix by vortexing, and leave the sample on ice for 10 min.
4. Centrifuge the sample for 5 min at 4°C in an Eppendorf centrifuge.
5. Save the supernatant, ethanol-precipitate the RNA by adding 3 vol of ethanol in the presence of 0.3M NaOAc, and incubate at -80°C for 30 min.
6. Pellet the RNA by centrifuging at 4°C for 10 min.
7. Rinse the RNA pellet with 0.5 mL of 70% ethanol (made with DEPC-H2O), and redissolve the RNA in 20 µL of DEPC-treated H2O.
8. Measure the RNA concentration at OD260 with a spectrophotometer by diluting 1 µL of the RNA sample in 1 mL of H2O.
9. Check the integrity of the RNA samples before and after cleaning with DNase I by running 1-3 µg of each RNA on a 7% formaldehyde agarose gel.
10. Store the RNA sample at a concentration higher than 1 µg/µL at -80°C before using for differential display.
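The OD260 reading from step 8 can be converted to a stock concentration and checked against the 1 µg/µL storage threshold of step 10. The sketch below assumes the commonly used conversion of 40 µg/mL of RNA per absorbance unit at 260 nm in a 1-cm cuvette, which is not stated in this chapter; the function name and the example reading are illustrative.

```python
def rna_conc_ug_per_ul(a260, sample_ul=1.0, diluent_ul=1000.0,
                       ug_per_ml_per_a260=40.0):
    """Concentration of the undiluted RNA stock in ug/uL.

    ug_per_ml_per_a260 is an ASSUMED conversion factor (40 ug/mL per
    A260 unit for RNA, 1-cm path length).
    """
    dilution_factor = (sample_ul + diluent_ul) / sample_ul
    ug_per_ml = a260 * ug_per_ml_per_a260 * dilution_factor
    return ug_per_ml / 1000.0  # convert ug/mL to ug/uL


if __name__ == "__main__":
    conc = rna_conc_ug_per_ul(0.05)  # hypothetical reading of 0.05
    print(f"{conc:.2f} ug/uL")
    print("suitable for storage" if conc > 1.0 else "too dilute")
```

With 1 µL diluted into 1 mL, the dilution factor is 1001, so a reading of roughly 0.025 corresponds to about 1 µg/µL under the assumed conversion.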
3.2. Reverse Transcription of mRNA
1. Set up three reverse transcription reactions for each RNA sample in three microfuge tubes (0.5-mL size), each containing one of the three different anchored oligo-dT primers as follows. For 20-µL final volume: 9.4 µL dH2O, 4 µL 5X RT buffer, 1.6 µL dNTP (250 µM), 2 µL total RNA (0.1 µg/µL, freshly diluted, DNA-free), 2 µL AAGCT11M (2 µM) (M can be either G, A, or C).
2. Program your thermocycler to 65°C 5 min → 37°C 60 min → 75°C 5 min → 4°C.
3. Add 1 µL MMLV RT to each tube after 10 min at 37°C. Mix well quickly by finger tipping.
4. Continue incubation, and at the end of the reverse transcription reaction, spin the tube briefly to collect condensation.
5. Set tubes on ice for PCR, or store at -80°C for later use.
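As a check on the reaction setup in step 1, the component volumes can be summed and the final concentrations computed from the stock concentrations. The sketch below uses the volumes and stocks given above; the function and variable names are illustrative only.

```python
def final_conc(stock_conc, volume_ul, total_ul):
    """Final concentration after dilution, in the units of stock_conc."""
    return stock_conc * volume_ul / total_ul


# Volumes (uL) for one reverse transcription reaction
RT_VOLUMES_UL = {
    "dH2O": 9.4,
    "5X RT buffer": 4.0,
    "dNTP (250 uM)": 1.6,
    "total RNA (0.1 ug/uL)": 2.0,
    "anchored primer (2 uM)": 2.0,
    "MMLV RT (added at 37 C)": 1.0,
}

if __name__ == "__main__":
    total = sum(RT_VOLUMES_UL.values())
    print(f"total volume: {total:.1f} uL")
    print(f"buffer: {final_conc(5, 4.0, total):.2f}X")
    print(f"dNTP: {final_conc(250, 1.6, total):.1f} uM")
    print(f"primer: {final_conc(2, 2.0, total):.2f} uM")
    print(f"RNA input: {0.1 * 2.0:.2f} ug")
```

The listed volumes add up to 19 µL before the enzyme is added, which is why the 1 µL of MMLV RT brings the reaction to the stated 20 µL.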
3.3. PCR
1. Set up PCR reactions at room temperature as follows. For 20-µL final volume for each primer set combination: 10 µL dH2O, 2 µL 10X PCR buffer, 1.6 µL dNTP (25 µM), 2 µL arbitrary 13-mer (2 µM), 2 µL AAGCT11M (2 µM), 2 µL RT mix from Section 3.2 (it has to contain the same AAGCT11M used for PCR), 0.2 µL α-[33P]dATP (2000 Ci/mmol) (see Note 1), 0.2 µL AmpliTaq (5 U/µL) (Perkin-Elmer).
2. Make core mixes as much as possible to avoid pipeting errors (e.g., aliquot RT mix and AP primer individually). Otherwise, it would be difficult to pipet 0.2 µL of AmpliTaq.
3. Mix well by pipeting up and down, and add 25 µL of mineral oil if needed.
4. PCR as 94°C 30 s → 40°C 2 min → 72°C 30 s for 40 cycles → 72°C 5 min → 4°C (for Perkin-Elmer's 9600 thermocycler, it is recommended that the denaturation time be shortened to 15 s and the rest of the parameters kept the same).
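The core mix recommended in step 2 can be scaled from the per-reaction volumes in step 1. The sketch below is one way to do this. It assumes that the RT mix, the arbitrary primer, and the anchored primer are added to each tube individually, and that a 10% overage is prepared to cover pipeting losses; both choices, and the names used, are assumptions rather than part of the protocol.

```python
# Per-reaction volumes (uL) of components shared by every PCR reaction
CORE_PER_REACTION_UL = {
    "dH2O": 10.0,
    "10X PCR buffer": 2.0,
    "dNTP (25 uM)": 1.6,
    "labeled dATP": 0.2,
    "AmpliTaq (5 U/uL)": 0.2,
}

# Components assumed to be added to each tube individually (uL)
INDIVIDUAL_UL = {"arbitrary 13-mer": 2.0, "anchored primer": 2.0, "RT mix": 2.0}


def core_mix(n_reactions, overage=0.10):
    """Scale the shared components for n reactions plus a safety overage."""
    scale = n_reactions * (1.0 + overage)
    return {name: round(vol * scale, 2)
            for name, vol in CORE_PER_REACTION_UL.items()}


def core_aliquot_ul():
    """Volume of core mix to dispense into each reaction tube."""
    return sum(CORE_PER_REACTION_UL.values())


if __name__ == "__main__":
    for name, vol in core_mix(10).items():
        print(f"{name}: {vol} uL")
    print(f"dispense {core_aliquot_ul():.1f} uL core mix per tube")
```

Scaling in this way replaces ten separate 0.2-µL additions of enzyme with a single addition of a little over 2 µL.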
3.4. 6% Denaturing Polyacrylamide Gel Electrophoresis
1. Prepare a 6% denaturing polyacrylamide gel in TBE buffer.
2. Let it polymerize for at least 2 h before using.
3. Prerun the gel for 30 min (see Note 2).
4. Mix 3.5 µL of each sample with 2 µL of loading dye, and incubate at 80°C for 2 min immediately before loading onto a 6% DNA sequencing gel.
5. Electrophorese for about 3.5 h at 60 W constant power (with voltage not to exceed 1700 V) until the xylene dye (the slower-moving dye) reaches the bottom.
6. Turn off the power supply, and blot the gel onto a piece of Whatman 3MM paper.
7. Cover the gel with plastic wrap, and dry it at 80°C for 1 h (do not fix the gel with methanol/acetic acid).
8. Orient the autoradiogram and dried gel with radioactive ink or needle punches before exposing to an X-ray film. Figure 2 shows a representative differential display obtained with three one-base anchored oligo-dT primers in combinations with three arbitrary 13-mers (7).
3.5. Reamplification of cDNA Probe
1. After developing the film (overnight to 72-h exposure), orient the autoradiogram with the gel.
2. Locate bands of interest (see Note 3) either by marking with a clean pencil from underneath the film or by cutting through the film with a razor blade (see Note 4).
3. Cut out the located band with a clean razor blade.
4. Soak the gel slice along with the 3MM paper in 100 µL dH2O for 10 min.
5. Boil the tube with a tightly closed cap (e.g., sealed with Parafilm) for 15 min.
6. Spin for 2 min to collect condensation, and pellet the gel and paper debris.
7. Transfer the supernatant to a new microfuge tube.
8. Add 10 µL of 3M NaOAc, 5 µL of glycogen (10 mg/mL), and 450 µL of 100% EtOH.
9. Let sit for 30 min on dry ice or in a -80°C freezer.
10. Spin for 10 min at 4°C to pellet DNA.
11. Remove the supernatant, and rinse the pellet with 200 µL of ice-cold 85% EtOH (you will lose your DNA if less concentrated EtOH is used!).
12. Spin briefly, and remove the residual ethanol.
13. Dissolve the pellet in 10 µL of PCR H2O, and use 4 µL for reamplification.
14. Save the rest at -20°C in case of mishaps. Reamplification should be done using the same primer set and PCR conditions, except that the dNTP concentration is 20 µM (use 250 µM dNTP stock) instead of 2-4 µM, and no isotope is added. A 40-µL reaction is recommended for each primer set combination: 20.4 µL dH2O, 4 µL 10X PCR buffer, 3.2 µL dNTP (250 µM), 4 µL arbitrary 13-mer
Fig. 2. Differential display using one-base anchored oligo-dT primers (7). Four RNA samples from the nontransformed cell line Rat 1 and the H-ras transformed cell lines Rat 1 (ras), T101-4, and A1-5 (lanes from left to right, respectively) were compared by differential display using three one-base anchored oligo-dT primers, AAGCT11G, AAGCT11A, and AAGCT11C, in combinations with three arbitrary 13-mers, H-AP1 (AAGCTTGATTGCC), H-AP2 (AAGCTTCGACTGT), and H-AP3 (AAGCTTTGGTCAG). The mob-1 (8) and mob-7 cDNA fragments are marked by the right and left arrowheads, respectively.
[Fig. 2 panel labels: H-T11G, H-T11A, H-T11C]
[Fig. 3 lane labels: Rat 1, Rat 1 (ras), T101-4, A1-5; rRNA markers: 28S, 18S]
Fig. 3. Northern blot analysis with mob-7 cDNA probe. The 253-bp mob-7 cDNA was used as a probe to confirm the differential expression of the gene using 20 µg of total RNA from Rat 1 and three transformed derivatives, Rat 1 (ras), T101-4, and A1-5 cells (lanes 1 to 4). The lower panel is ethidium bromide staining of ribosomal RNAs as a control for equal sample loading.
(2 µM), 4 µL AAGCT11M (2 µM), 4 µL cDNA template from RT, and 0.4 µL AmpliTaq (5 U/µL).
15. Run 30 µL of the PCR sample on a 1.5% agarose gel stained with ethidium bromide. (More than 90% of the probes should be visible on the agarose gel.)
16. Check to see if the size of your reamplified PCR products is consistent with their size on the denaturing polyacrylamide gel.
17. Extract the reamplified cDNA probe from the agarose gel using the QIAEX kit (QIAGEN) for Northern blot confirmation (see Note 5).
18. Save the remaining PCR samples at -20°C for subcloning.
19. Verify the probe by Northern blot or RNase protection assay following the standard procedures (8; Fig. 3).
20. Clone the cDNA probe using the pCR-TRAP cloning system (GenHunter).
21. Verify the cloned cDNA probe by Northern blot, and sequence the cloned cDNA.
22. Clone the full-length cDNA by screening a cDNA library following the standard procedure (9) (see Notes 6 and 7).
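The dNTP concentrations quoted for the display PCR and for the reamplification can be checked from the stock concentrations and pipeted volumes. The sketch below is a plain dilution calculation; the function name is hypothetical.

```python
def final_conc(stock_conc, stock_vol_ul, total_vol_ul):
    """Final concentration after dilution, in the same unit as stock_conc."""
    return stock_conc * stock_vol_ul / total_vol_ul


# Display PCR (Section 3.3.): 1.6 uL of 25 uM dNTP in a 20-uL reaction.
display_dntp_um = final_conc(25.0, 1.6, 20.0)

# Reamplification (this section): 3.2 uL of 250 uM dNTP in a 40-uL reaction.
reamp_dntp_um = final_conc(250.0, 3.2, 40.0)

# Primers in the reamplification: 4 uL of a 2 uM stock in 40 uL.
reamp_primer_um = final_conc(2.0, 4.0, 40.0)

print("Display PCR dNTP (uM):", round(display_dntp_um, 2))
print("Reamplification dNTP (uM):", round(reamp_dntp_um, 2))
print("Reamplification primer (uM):", round(reamp_primer_um, 2))
```

The tenfold higher dNTP concentration in the reamplification follows directly from switching to the 250 µM stock while keeping the same proportion of the reaction volume.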
4. Notes
1. Which isotope should you choose for differential display? It has been observed that 35S, originally used for differential display, would leak through PCR reaction tubes (especially when thin-walled tubes are used), and 33P-labeled nucleotide was recommended as the best alternative (10). 33P is not only safer to use, but also gives better sensitivity compared to 35S.
2. It is crucial that the urea in the wells be completely flushed out right before loading your samples. For best resolution, flush every four to six wells each time during sample loading while trying not to disturb the samples that have already been loaded.
3. How should you choose the “right” bands, and what sizes of cDNA bands should you choose? First, tentatively identify those bands that appear to be differentially expressed on the initial display gel. Then, repeat the RT step and the PCR reactions for these lanes, and see if these differences are reproducible before pursuing them further. It is recommended that bands bigger than 100 bp be selected. It has been generally observed that shorter cDNA probes have a higher probability of failing to detect any signals on the Northern blot.
4. The other way that is found to work very well is to punch through the film with a needle at the four corners of each band of interest. (Handle the dried gel with gloves, and save it between two sheets of clean paper.)
5. How do you purify the cDNA probes? There are a number of ways to purify the cDNA probes from agarose gel, either directly after reamplification or after cloning into a plasmid. The methods that worked well are either using the QIAEX kit from QIAGEN or low-melt agarose gel electrophoresis. Other methods, such as Geneclean (which works well only for DNA larger than 300 bp) or filtration, result in low yield and poor labeling of the cDNA probes from differential display.
6. How should you label cDNA probes? It is recommended that a random prime labeling method be used (available as a kit from companies, such as Boehringer Mannheim). It is observed that incorporation of 1 µL (2 µM) of the corresponding anchored oligo-dT primer during random prime labeling improves the signal on the Northern blot.
7. What are the hybridization and washing conditions? It is recommended that the standard prehybridization and hybridization condition at 42°C be used. Wash with 1X SSC, 0.1% SDS at room temperature for 15 min twice, followed by washing with 0.25X SSC, 0.1% SDS at 55-60°C for 15-30 min. Do not go over 60°C. Expose with an intensifying screen at -80°C for overnight to 1 wk.
Acknowledgment
The authors thank Arthur B. Pardee for his guidance and support. We thank GenHunter Corporation for permission to adapt its protocol for the RNAimage kit for differential display. The work was supported in part by an SBIR grant from the National Institutes of Health awarded to Peng Liang.
References
1. Linzer, D. I. H., Maltzman, W., and Levine, A. J. (1979) The SV40 A gene product is required for the production of a 54,000 MW cellular tumor antigen. Virology 98, 308-318.
2. Hollstein, M., Sidransky, D., Vogelstein, B., and Harris, C. C. (1991) p53 mutation in human cancer. Science 253, 49-53.
3. O’Farrell, P. H. (1975) High resolution two-dimensional electrophoresis of proteins. J. Biol. Chem. 250, 4007-4021.
4. Lee, S. W., Tomasetto, C., and Sager, R. (1991) Positive selection of candidate tumor-suppressor genes by subtractive hybridization. Proc. Natl. Acad. Sci. USA 88, 2825-2829.
5. Liang, P. and Pardee, A. B. (1992) Differential display of eukaryotic messenger RNA by means of the polymerase chain reaction. Science 257, 967-971.
6. Liang, P., Averboukh, L., and Pardee, A. B. (1993) Distribution and cloning of eukaryotic mRNAs by means of differential display: refinements and optimization. Nucleic Acids Res. 21, 3269-3275.
7. Liang, P., Zhu, W., Zhang, X., Guo, Z., O’Connell, R. P., Averboukh, L., Wang, F., and Pardee, A. B. (1994) Differential display using one-base anchored oligo-dT primers. Nucleic Acids Res. 22, 5763,5764.
8. Liang, P., Averboukh, L., Zhu, W., and Pardee, A. B. (1994) Ras activation of genes: Mob-1 as a model. Proc. Natl. Acad. Sci. USA 91, 12,515-12,519.
9. Ausubel, F., Brent, R., Kingston, R. E., Moore, D. D., Seidman, J. G., Smith, J. A., and Struhl, K. (1988) Current Protocols in Molecular Biology. Greene Publishing Associates and Wiley-Interscience, New York.
10. Trentmann, S. M., Knaap, E., Kende, H., Liang, P., and Pardee, A. B. (1995) Alternatives to 35S as a label for the differential display of eukaryotic messenger RNA. Science 267, 1186-1187.
Chemical Crosslinking Subtraction (CCLS)
Ian N. Hampson and Lynne Hampson
1. Introduction
Since the advent of polymerase chain reaction (PCR), it has proven possible to clone complete genes, or parts of these, without having to construct the usually obligatory cDNA library. PCR cloning, however, is not without its own problems and limitations, such as the fidelity of the cloned sequence (1) and the fact that prior knowledge of the target gene sequence is required. It is for these reasons that the traditional cDNA library still has a valuable role to play in the isolation of any new gene.
In order to construct a cDNA library, it is first necessary to isolate high-quality RNA from the chosen cell or tissue and to purify further the polyadenylated mRNA by oligo dT affinity chromatography. This is subsequently converted to duplex cDNA by the use of reverse transcriptase and Escherichia coli DNA polymerase, whereupon this material is ligated into a phage or plasmid vector of choice. The resultant clone bank can then be screened by various techniques designed to isolate an individual cDNA species. Prior to embarking on this exercise, it is useful to consider the integral features of any complete cDNA library, principally that this represents a mixed population of clones where the proportional representation of each individual clone is directly related to its abundance in the starting mRNA population. The consequence of this is that a large proportion of cDNA clones in a library will represent siblings of the same highly or moderately abundant sequences. A typical eukaryotic cell expresses between 10,000 and 12,000 different genes, with 22% of this material being made up of approx 30 highly abundant mRNAs, each having ~3500 copies/cell. The low-abundance class represents ~10,600 different gene products and is composed of 29% of the mRNA, each having on average 14 copies/cell. The remaining 49% of expressed mRNAs belongs to
the intermediate moderate abundance class (2). The problem of isolating a specific low-abundance gene is not difficult to appreciate. It has been calculated that to ensure a 99% probability of a cDNA library containing any particular low-abundance gene, it is necessary to screen ~169,000 primary clones. Many sequences will be commonly expressed between cells of different types, such as the so-called housekeeping genes, but many will be specific to a particular cellular phenotype.
The last 15 years have seen the development of numerous methods for the cloning of genes that are differentially expressed between cells or tissues (3-13). The simplest of these techniques is differential screening (3,6), which consists of sequentially screening a cDNA library, constructed from the cell or tissue producing the gene of interest, with two probes. The first is a total cDNA probe made from mRNA that does not contain the target gene of interest, and the second is made from the cDNA used to construct the library. Clones that give a positive signal with the second and not the first probe are, in theory, exclusive to the mRNA population containing the target gene when compared to the mRNA that does not. This approach suffers from the limitation that most sequences, being of the common housekeeping variety, will give a double positive signal. Hence, screening can only be performed at relatively low plaque or colony densities (5000 clones/15-cm diameter filter). It was the need to isolate genes by virtue of their differential expression that prompted the evolution of subtractive hybridization techniques (4,5,7-14). Generally, these consist of procedures that eliminate all common housekeeping genes from a probe or cDNA library by subtracting the expressed sequences of one cell or tissue type from another. These methods have, however, all proven troublesome and difficult to perform.
1.1. Subtractive Hybridization
A major consideration when adopting a subtractive hybridization approach must be the cellular homogeneity of the target and the subtractor cell or tissue from which the mRNA is to be isolated (24). When cultured cells are used, this is not such a problem, because the same cell populations, plus and minus a single variable, can be investigated with no artifactual contribution from cellular heterogeneity. If tissues are the source of mRNA, then it is worth investing effort to remove as much contaminating cellular material as possible. The next variable to consider is the amount of both target and subtractor tissue that is available, because this will influence the choice of subtractive strategy. For chemical crosslinking subtraction (CCLS) (15), it is essential to have approx 2 µg of target cell polyA+ RNA and 10 times this amount from the subtractor tissue (10^8 cells). If, however, tissue supply is limited, then the involvement of complex PCR-based approaches (11-13) must be contemplated. These PCR-based methods do work, but they can be notoriously difficult to optimize, and the method described in this chapter was adopted owing to its ease of application.
Most methods of subtractive hybridization employ a lengthy (>24 h) hybridization of target first-strand cDNA to an excess of subtractor or driver mRNA, followed by a means of separating common duplex cDNA/RNA hybrids from unique target cell-specific single-stranded cDNAs. Previous approaches have used physical means to separate these two populations, such as hydroxyapatite (4,5) or avidin-biotin (7-9). These methods have many disadvantages, mainly associated with high losses of material owing to the extensive manipulations required. Furthermore, they will subtract differentially expressed cDNAs that have areas of high homology with nondifferentially expressed sequences. It was with these considerations in mind that we developed the chemical crosslinking approach to subtraction.
1.2. CCLS
This method is initially the same as most published protocols where target cDNA is hybridized to an excess of driver mRNA that does not express the gene of interest. It differs in the use of an intercalating, interstrand DNA crosslinking agent, 2,5-diaziridinyl-1,4-benzoquinone (16), in order to circumvent the need for physical separation of common cDNA/RNA hybrids from unique single-stranded cDNA. This compound covalently links the cDNA/RNA hybrids that represent commonly expressed material, but has negligible reactivity with single-stranded unique cDNAs. The covalently crosslinked hybrids cannot therefore be denatured and will thus not be accessible to random primed labeling (17). The single-stranded cDNA, however, can be labeled and used as a subtracted probe to screen a cDNA library constructed from target cell cDNA. This eliminates the need for any physical separation techniques, which makes the CCLS approach much simpler and vastly reduces material loss.
Comparison of CCLS to subtractive methods that rely on physical separation of hybrids has revealed several interesting features: 1. Application is easy. 2. The probe need not be labeled until use, which significantly reduces operator exposure.
3. Qualitative as well as quantitative differences between transcript populations can be detected. Consider a cDNA present in the target population that undergoes downregulation with respect to the driver RNA, but that has an area of high homology with a different RNA within the driver that does not downregulate. Such sequences would be subtracted by conventional physical separation-based
methods, but the CCLS approach leaves the nonhomologous area available for labeling, and hence contribution to the subtracted probe.
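The library-size figure quoted in the Introduction (about 169,000 primary clones for a 99% probability of recovering a given low-abundance cDNA) can be approximated from the abundance data given there. The text does not state how the figure was derived, so the sketch below is an assumption: it uses the standard sampling estimate N = ln(1 - P)/ln(1 - f), often attributed to Clarke and Carbon, with f taken as the fraction of mRNA contributed by a single low-abundance species. All names are hypothetical.

```python
import math


def clones_needed(probability, fraction):
    """Number of clones to screen so that a sequence making up `fraction`
    of the library is present at least once with the given probability."""
    return math.ceil(math.log(1.0 - probability) / math.log(1.0 - fraction))


# From the Introduction: the low-abundance class is 29% of the mRNA mass
# spread over about 10,600 different species.
low_abundance_fraction = 0.29 / 10600

n_clones = clones_needed(0.99, low_abundance_fraction)
print("Clones to screen for 99% probability:", n_clones)
```

The result falls close to the quoted figure, which suggests that a calculation of this kind underlies it. The same function shows how quickly the requirement drops for moderately abundant messages.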
2. Materials
It should be noted that some of the materials required for each successive procedure have been listed in preceding sections. Since CCLS requires a good, preferably primary cDNA library, we have included our detailed construction protocol.
2.1. RNA Preparation
2.1.1. PolyA+ mRNA Isolation
1. Oligo-dT Dynal magnetic beads.
2. Diethyl pyrocarbonate-treated and autoclaved distilled water (DEPC SDW).
3. 2X Binding buffer: 20 mM Tris-HCl, pH 7.5, 1.0M LiCl, 2 mM EDTA.
4. 1X Washing buffer: 10 mM Tris-HCl, pH 7.5, 0.15M LiCl, 1 mM EDTA.
5. Elution buffer: 2 mM EDTA.
2.1.2. DNase Treatment of A+ RNA
1. Ribonuclease-free DNase 1000 U/mL (Promega, Madison, WI).
2. Human placental RNase inhibitor 40 U/µL (Boehringer Mannheim, Dorval, UK).
3. 3M Sodium acetate, pH 5.2.
4. 10X Buffer: 1.0M sodium acetate, pH 5.0, 50 mM magnesium chloride.
5. Phenol/chloroform 50:50 equilibrated in 50 mM Tris/EDTA, pH 7.0.
6. 95% Ethanol.
2.2. cDNA Synthesis and Library Construction
2.2.1. First-Strand cDNA Synthesis
1. 0.5M Methyl mercuric hydroxide (Serva Labs, Cambridge Biosciences, Cambridge, UK).
2. 0.75M β-mercaptoethanol.
3. Oligo dT12-18 mer (Boehringer Mannheim).
4. Superscript I plus 5X reaction buffer (Gibco BRL, Middlesex, UK).
5. 100 mM Deoxynucleotide triphosphates (Boehringer Mannheim).
6. α32P dCTP (3000 Ci/mmol) (Amersham, Amersham, UK).
7. 7.5M Ammonium acetate.
8. DE81 paper (Whatman, Maidstone, UK).
9. 0.5M Sodium dihydrogen orthophosphate.
2.2.2. Second-Strand cDNA Synthesis
1. 10X Buffer: 900 mM KCl, 30 mM MgCl2, 500 mM Tris-HCl, pH 7.2, 0.5 mg/mL bovine serum albumin (BSA).
2. 1M Dithiothreitol (DTT).
3. Ribonuclease H 1 U/µL (Boehringer Mannheim).
4. E. coli DNA pol I 10 U/µL (Promega).
2.2.3. Fill-In Reaction
1. 10X NT buffer: 0.5M Tris-HCl, pH 7.2, 0.1M MgSO4, 1 mM DTT.
2. 0.5 mg/mL BSA.
3. Klenow polymerase 2 U/µL (Boehringer Mannheim).
2.2.4. Adaptor Ligation
1. EcoRI adapters (Promega).
2. T4 DNA ligase 5 U/µL and 10X buffer: 0.66M Tris-HCl, pH 7.5, 50 mM MgCl2, 10 mM DTT.
3. 10 mM Adenosine triphosphate (ATP) (Boehringer Mannheim).
2.2.5. Kinase Reaction
1. Polynucleotide kinase (PNK) 8000 U/mL (Boehringer Mannheim).
2. 10X Kinase buffer: 700 mM Tris-HCl, pH 7.6, 100 mM MgCl2, 50 mM DTT.
2.2.6. cDNA Size Fractionation
1. Sephacryl S400 high resolution (Pharmacia, Uppsala, Sweden).
2. Running buffer: 50 mM Tris-HCl, pH 7.0, 0.1% SDS.
3. Loading buffer: 0.01% bromophenol blue, 50% glycerol.
2.2.7. cDNA plus Vector Ligation
Phosphatased EcoRI-digested λgt10 from Stratagene (La Jolla, CA).
2.2.8. Packaging
Although it is possible to prepare packaging extracts, these can often be difficult to optimize. Hence, we would recommend purchasing a commercially available extract, such as Gigapack Gold, from Stratagene.
2.2.9. Plating and Membrane Transfer
1. C600 and C600 hfl plating strains (Stratagene).
2. LB broth: 10 g NaCl, 10 g bactotryptone, 5 g yeast extract/L (Difco).
3. 20% Filter-sterilized maltose solution (Difco, Gibco BRL).
4. Filter-sterilized 1M MgSO4 solution.
5. SM (filter-sterilized): 50 mM Tris-HCl, pH 7.5, 100 mM NaCl, 8 mM MgSO4, 0.01% gelatin.
6. 2% Gelatin stock.
7. LB agar: 15 g agar/L of LB broth (Difco).
8. NZY agar: 15 g of agar/L of NZY broth (ready-made) (Gibco BRL).
9. Soft top: NZY broth, 0.7% w/v agarose (Boehringer Mannheim).
10. PALL Biodyne A 132- and 82-mm diameter transfer membranes, cat. nos. BNNG132 and BNNG82, respectively (PALL BioSupport Division, Portsmouth, UK).
11. Denaturation solution: 0.5M NaOH, 1.5M NaCl.
12. Neutralization solution: 1M Tris-HCl, pH 8.0, 1.5M NaCl.
13. 2X SSC: 0.3M NaCl, 0.03M sodium citrate, pH 7.0.
2.3. Subtractive Probe Preparation
2.3.1. Alkaline Hydrolysis
1. 1.0M Sodium hydroxide, 1.0 mM EDTA.
2. Sephadex G50, DNA-grade (Pharmacia).
2.3.2. Subtractive Hybridization
2X hybridization buffer: 1.0M NaCl, 50 mM HEPES, pH 7.5, 5 mM EDTA, 1% SDS, and sterile liquid paraffin.
2.3.3. Crosslinking
1. Crosslinking buffer: 25 mM Tris-HCl, pH 7.0, 1 mM EDTA.
2. 20 mM Ascorbic acid.
3. 10 mM Solution of 2,5-diaziridinyl-1,4-benzoquinone (DZQ) in dry DMSO, store at -20°C (Amersham UK).
4. Chloroform.
2.3.4. Labeling
1. Random primed labeling kit (Boehringer Mannheim).
2. Sequenase II (US Biochemicals, Amersham, UK).
2.3.5. Probe Hybridization
1. Stock phosphate/salt buffer: 70.9 g Na2HPO4, 58.44 g NaCl, 8 mL orthophosphoric acid dissolved in 500 mL, pH 7.2.
2. Prehybridization and hybridization buffer: 12.5 mL stock phosphate/salt, 17.5 mL SDW, 70 mL of 10% SDS, contains 1 g BSA.
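The approximate final composition of the prehybridization and hybridization buffer can be worked out from the recipe above. The sketch below assumes that the Na2HPO4 is the anhydrous salt (141.96 g/mol) and counts only the phosphate contributed by that salt; the orthophosphoric acid used to set the pH adds further phosphate that is not included here. The names are hypothetical.

```python
# Molecular weights in g/mol; the anhydrous form of Na2HPO4 is assumed.
NA2HPO4_MW = 141.96
NACL_MW = 58.44

STOCK_VOLUME_L = 0.5  # the stock phosphate/salt buffer is made up to 500 mL


def stock_molarity(grams, mol_weight, volume_l=STOCK_VOLUME_L):
    """Molarity of a solute weighed into the stock phosphate/salt buffer."""
    return grams / mol_weight / volume_l


def working_buffer(stock_ml=12.5, water_ml=17.5, sds_10pct_ml=70.0, bsa_g=1.0):
    """Approximate final composition of the (pre)hybridization buffer."""
    total_ml = stock_ml + water_ml + sds_10pct_ml
    fraction = stock_ml / total_ml
    return {
        "total_mL": total_ml,
        "Na2HPO4_M": stock_molarity(70.9, NA2HPO4_MW) * fraction,
        "NaCl_M": stock_molarity(58.44, NACL_MW) * fraction,
        "SDS_percent": 10.0 * sds_10pct_ml / total_ml,
        "BSA_percent_w_v": 100.0 * bsa_g / total_ml,
    }


for name, value in working_buffer().items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the working buffer is a high-SDS phosphate buffer (roughly 7% SDS, 1% BSA), which is useful to keep in mind when warming it before use, since SDS at this concentration comes out of solution in the cold.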
2.3.6. Autoradiography
1. Hyperfilm (Amersham).
2. Film cassettes and image intensifying screens (Agfa Gevaert).
3. Methods
3.1. RNA Preparation
Since there are several published methods (18-20) of total RNA isolation that are relatively easy to perform, we have not described any of these in detail. Our method of choice is the acid phenol technique of Chomczynski and Sacchi (20). Total RNA isolated in this way is stored as a suspension in 70% ethanol and 0.1M sodium acetate at -70°C. This material can be quantified by pelleting an aliquot, dissolving in x mL of SDW, and determining the OD260.

µg of RNA in aliquot = 40 × OD260 × (1/x)    (1)
3.1.1. PolyA+ mRNA Isolation
Selection of polyA+ mRNA is most readily performed using oligo dT Dynal magnetic beads according to the manufacturer’s instructions.
1. Dissolve approx 400 µg of total RNA in 100 µL of DEPC SDW, and heat to 68°C for 1 min. Add 100 µL of 2X binding buffer to this, mix with 1 mg of oligo dT magnetic beads, and incubate at room temperature for 5 min.
2. Separate the magnetic beads, retaining the polyA- supernatant, and wash them with 2 × 200 µL of wash buffer.
3. Resuspend the beads in 30 µL of elution buffer and heat to 68°C for 2 min. Separate the beads immediately, and aliquot the supernatant into 3 vol of ethanol/0.1M sodium acetate, pH 5.5. Store this at -70°C.
4. Reheat the polyA- supernatant to 68°C for 2 min, and add this back to the magnetic beads for 5 min at room temperature.
5. Repeat steps 2-4 twice more.
6. Pellet the polyA+ RNA by centrifuging at 12,000g (bench-top microfuge) for 10 min.
3.1.2. DNase Treatment of PolyA+ mRNA
1. Wash the RNA pellets from step 6 (Section 3.1.1.) with 80% ethanol, and allow to air-dry.
2. Dissolve and pool these in a total volume of 25 µL DEPC SDW.
3. Add 3 µL of 10X DNase buffer, 1 µL of placental ribonuclease inhibitor, 1 µL of DNase, and incubate at 37°C for 10 min (see Note 1).
4. Add an equal volume of phenol/chloroform (50:50), and mix well. Centrifuge at 12,000g for 2 min, and retain the aqueous supernatant.
5. Repeat step 4.
6. Add this to 3 vol of ethanol plus 1/3 vol of 3M sodium acetate, pH 5.5, and store at -70°C.
3.2. cDNA Synthesis
3.2.1. First-Strand cDNA Synthesis
1. Assuming a total yield of approx 20 µg of polyA+ mRNA from Sections 3.1.1. and 3.1.2., take one-quarter of this and centrifuge at 12,000g for 10 min. Wash the pellet with 80% ethanol and allow to air-dry.
2. Dissolve the pellet in 10 µL of DEPC SDW, and add 1 µL of methyl mercuric hydroxide solution. Incubate for 5 min at room temperature.
3. Add 2.5 µL of β-mercaptoethanol, and place on ice immediately.
4. Add to this: 2.5 µL (2.5 µg) of oligo dT primer, 20 µL of 5X reverse transcriptase buffer, 1 µL of 25 mM dNTPs, 0.5 µL of ribonuclease inhibitor, and 60 µL of DEPC SDW.
5. Add approx 5 µCi of α32P dCTP, mix, and remove 2 × 2 µL aliquots onto two 1 × 1-cm DE81 filters. Immediately place one of these (T0) into 10 mL of 0.5M sodium dihydrogen orthophosphate, and let the other (Total) air-dry.
6. Add 5 µL of Superscript I, incubate at 37°C for 1 h, remove a further 2 µL of this reaction, spot onto a DE81 filter (T1), and then place this in the 0.5M phosphate solution with the first filter (T0). Extract the reaction mix once with an equal volume of phenol/chloroform, followed by addition of 0.44 vol of 7.5M ammonium acetate and 2.3 vol of ethanol. Store the first-strand reaction at -20°C.
7. Rinse the DE81 filters with two more 5-min washes in 0.5M sodium dihydrogen orthophosphate, followed by two washes in distilled water and one in ethanol. Allow the washed filters to air-dry, add scintillant, and count, along with the unwashed 2-µL total aliquot (step 5), on the 32P channel of a liquid scintillation counter.
8. The total synthesis of first-strand cDNA can be estimated by the following calculation:

ng of cDNA = [(T1 − T0)/Total] × nmol of dCTP in 2 µL R. mix × Mr of dCTP × 4 (bases) × dilution    (2)

which for the above mixture is:

ng of cDNA = [(T1 − T0)/Total] × 0.5 × 308 × 4 × 50    (3)

(Yields should be approx 400-500 ng.)
9. For subtractive probe preparation, refer to Section 3.3.1. For cDNA library construction, see Section 3.2.2.
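The yield calculation in step 8 is easy to script once the three filter counts are available. The sketch below is a direct transcription of that calculation with the constants quoted for this reaction mix (0.5 nmol of dCTP per 2-µL aliquot, 308 as the nucleotide mass term, four bases, 50-fold dilution). The function name and the example counts are hypothetical.

```python
def cdna_yield_ng(t1_cpm, t0_cpm, total_cpm, nmol_dctp_in_aliquot=0.5,
                  nucleotide_mr=308.0, bases=4, dilution=50.0):
    """Estimate cDNA synthesized (ng) from DE81 filter counts.

    t1_cpm    -- washed filter spotted after the reaction (incorporated label)
    t0_cpm    -- washed filter spotted before enzyme addition (background)
    total_cpm -- unwashed filter (total label in a 2-uL aliquot)
    """
    fraction_incorporated = (t1_cpm - t0_cpm) / total_cpm
    return (fraction_incorporated * nmol_dctp_in_aliquot * nucleotide_mr
            * bases * dilution)


# Hypothetical counts: 1,000,000 cpm total, 2,000 cpm at T0, 17,000 cpm at T1.
first_strand_ng = cdna_yield_ng(17000, 2000, 1000000)
print("First-strand cDNA (ng):", round(first_strand_ng, 1))

# For the second-strand reaction (Section 3.2.2.) the protocol substitutes
# 0.25 nmol of dCTP per aliquot; the counts below are again hypothetical.
second_strand_ng = cdna_yield_ng(26000, 2000, 1000000,
                                 nmol_dctp_in_aliquot=0.25)
print("Second-strand cDNA (ng):", round(second_strand_ng, 1))
```

With these example counts, an incorporation of 1.5% of the label corresponds to a first-strand yield inside the 400-500 ng range that the protocol gives as typical.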
3.2.2. Second-Strand cDNA Synthesis
1. Pellet the first-strand material (12,000g, 10 min), and wash with 100 µL of 80% ethanol. Allow to air-dry.
2. Dissolve the pellet in 80 µL of SDW, add 10 µL of second-strand buffer, 5 µL of 2.5 mM dNTPs, and 2 µCi of α32P dCTP. Remove 2 × 2 µL aliquots, placing one in approx 10 mL of sodium dihydrogen orthophosphate and allowing the other to air-dry, as previously described.
3. Add 2 U of ribonuclease H and 30 U of E. coli DNA polymerase I, and incubate at 14°C for 34 h. (The long incubation time ensures the synthesis of cDNAs up to 3 kb and longer.) Remove a 2-µL aliquot, spot onto DE81 paper, and place in 0.5M sodium dihydrogen orthophosphate solution. Wash and count as before. The yield of second strand can be calculated in the same way as the first strand, substituting 0.25 nmol of dCTP for 0.5. (Approx 80% of first strand should be copied to second strand.)
4. Extract the reaction mixture once with an equal volume of phenol/chloroform, and add 0.44 vol of 7.5M ammonium acetate and 2.3 vol of ethanol. Store at -20°C.
3.2.3. Fill-In Reaction
1. Pellet the double-stranded cDNA (12,000g for 10 min), rinse once with 100 µL of 80% ethanol, and allow to air-dry.
2. Dissolve the pellet in 42.5 µL of SDW, and add 5 µL of 10X NT buffer, 0.8 µL of 25 mM dNTPs, and 2 µL of Klenow enzyme. Incubate for 1 h at 22°C, then extract with phenol/chloroform, precipitate with ammonium acetate and ethanol as before, and store at -20°C.
3.2.4. Adaptor Ligation
1. Pellet the cDNA, wash with 100 µL of 80% ethanol, and allow to air-dry. Dissolve in a volume of SDW to give a concentration of 100 ng/µL.
2. Mix 3 µL of 10X ligation buffer, 2.5 µL of cDNA (250 ng), 1 µL (10 pmol: %O-fold molar excess) of EcoRI adapters, 5 U of T4 DNA ligase, and 22.5 µL of SDW.
3. Incubate overnight at 15°C, and then heat-inactivate at 68°C for 10 min. Phenol/chloroform extract followed by precipitation with 1/10 vol of 3M sodium acetate, pH 5.5, and 2.5 vol of ethanol. Store at -20°C.
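The molar excess of adapters over cDNA ends in step 2 depends on the average length of the cDNA, which the protocol does not specify. The sketch below shows how the ratio could be estimated; the mean length of 1 kb is purely an illustrative assumption, as is the approximate mass of 660 g/mol per base pair of double-stranded DNA, and the function names are hypothetical. It is not a substitute for the figure given in step 2.

```python
def ds_dna_pmol(mass_ng, length_bp, bp_mr=660.0):
    """Picomoles of double-stranded DNA molecules in a given mass.

    Uses the approximation of about 660 g/mol per base pair.
    """
    return mass_ng * 1000.0 / (bp_mr * length_bp)


def adapter_excess(adapter_pmol, cdna_ng, mean_length_bp):
    """Molar ratio of adapters to cDNA ends (two ends per molecule)."""
    ends_pmol = 2.0 * ds_dna_pmol(cdna_ng, mean_length_bp)
    return adapter_pmol / ends_pmol


# Step 2 uses 10 pmol of adapters and 250 ng of cDNA; the 1,000-bp mean
# length is an assumption made only for this example.
print(round(adapter_excess(10.0, 250.0, 1000), 1))
```

Because the ratio scales linearly with the mean cDNA length, a library enriched for longer inserts has fewer ends per nanogram and therefore a larger adapter excess for the same input mass.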
3.2.5. Kinase Reaction
If the chosen vector has been phosphatased, it is necessary to kinase the cDNA prior to ligation:
1. Pellet the cDNA from Section 3.2.4., wash with 100 µL of 80% ethanol, and air-dry. Dissolve this in 33 µL of SDW and add 4 µL of 10X kinase buffer, 2 µL of 0.1 mM ATP, and 1 µL of polynucleotide kinase (10 U).
2. Incubate at 37°C for 30 min, then phenol/chloroform extract, and precipitate with sodium acetate/ethanol as described previously. Store at -20°C.
3.2.6. Size Fractionation
1. Plug a 1-mL short-form Pasteur pipet with silanized glass wool. Pack this with Sephacryl S400, and allow to equilibrate with buffer by running for approx 30 min.
2. Pellet the cDNA, wash with 100 µL of 80% ethanol, air-dry, and dissolve in 15 µL of column running buffer. Add to this 5 µL of loading buffer, mix, and apply to just above the surface of the Sephacryl beads. Let the column flow under gravity with a head height of approx 30 cm.
3. Use a hand monitor to gauge the progress of the labeled cDNA down the column, and when this is approx 1 cm from the glass wool plug, start collecting five-drop fractions. Continue this until all the radioactivity has eluted from the column. Again using the hand monitor, roughly estimate the activity in each of the fractions, and pool the first two-thirds of the activity peak.
4. If the pooled volume exceeds 300 µL, this can be reduced by extracting twice with an equal volume of butan-2-ol. Extract the pooled material with phenol/chloroform, and precipitate with sodium acetate and 3 vol of ethanol. Store at -20°C.
3.2.7. cDNA plus Vector Ligation
1. Pellet the precipitated, size-fractionated cDNA, wash with 100 µL of 80% ethanol, and allow to air-dry.
2. Dissolve in a volume of SDW that produces a concentration of 50 ng/µL.
3. Precipitate 1 µL of phosphatased λgt10 arms with 2 µL of cDNA (100 ng) by adding 1/10 vol of 0.1M sodium acetate plus 3 vol of ethanol, and cooling to -20°C for 15 min.
4. Pellet the cDNA plus vector, wash with 80% ethanol, and air-dry. Dissolve this in 4 µL of SDW, and add 0.5 µL of 10X ligase buffer plus 0.5 µL (2.5 U) of T4 DNA ligase. Incubate overnight at 14°C.
3.2.8. Packaging
1. Take half (2.5 µL) of the cDNA plus vector ligation mix and package according to the manufacturer’s instructions. This usually takes 2-3 h.
2. Add 500 µL of SM plus 20 µL of chloroform to the packaged DNA, and store at 4°C.
3.2.9. Plating and Membrane Transfer
1. Incubate LB agar streak plates of host strains C600 and C600 hfl overnight at 37°C. Pick a single colony from each of these, inoculate 10 mL of LB containing 15 mM MgSO4 and 0.2% maltose, and incubate with shaking at 30°C overnight.
2. Centrifuge the bacterial suspension at 1,200g for 10 min, and decant the supernatant. Resuspend the pellet in 4 mL of filter-sterilized 15 mM MgSO4. (This suspension can be stored at 4°C for several days, but is best used fresh.)
3. Prepare 100 µL of a 1:100 dilution of the packaged library stock in SM, and place two aliquots each of 1, 5, and 10 µL of this into separate Sterilin tubes. Add 100 µL of the C600 suspension to one of the aliquots, and 100 µL of the C600 hfl preparation to the other. Incubate at 37°C for 20 min.
4. While these are incubating, melt a 100-mL bottle of top agar and cool to 45°C. Mix 2.8 mL of top agar with the phage and bacterial suspension. Then pour immediately onto prewarmed (37°C) 9-cm NZY agar bacterial culture plates, making sure that this is evenly spread. Allow to cool at room temperature for 15 min, and then incubate overnight at 37°C. Store at 4°C.
5. The C600 strain allows both recombinant and wild-type phage to grow as clear and cloudy plaques, respectively. The C600 hfl strain will only permit recombinant phage containing a cDNA insert to grow; hence, the total number of plaques obtained on this strain should equal the number of clear plaques obtained with the C600 strain. Count the number of recombinant plaques, and calculate the titer of the cDNA library from the dilutions. (Titers of 10^6-10^7 PFU should be obtained.) It is advisable to check the sequence representation of the library by test probing one of the plates used for titering.
6. First, prepare a plaque lift by carefully placing an 82-mm disk of PALL Biodyne membrane on the phage plate. (Choose a suitable plaque density from the various titer plates of 1-5 × 10^3 PFU.) Leave this in contact for approx 5 min. Then peel off and place plaque side uppermost on two sheets of 3MM paper soaked in denaturation solution. Incubate for 5 min, and then place on two sheets of 3MM soaked in neutralization solution for a further 5 min. Rinse with 2X SSC, blot off the surplus buffer, and UV crosslink for 1-2 min. Finally, bake the filter at 80°C for 30 min.
7. Place the filter in 10 mL of hybridization buffer (Section 2.3.2.) and incubate at 68°C for >15 min.
8. Choose a suitable test sequence of known abundance, such as actin, and prepare a 32P-labeled probe from its cDNA. (We routinely use random priming for probe production.) Heat the probe to 100°C for 5 min, and then add this to the filter in a volume of 2 mL of hybridization buffer at 68°C overnight. Wash the filter three times for 20 min in 1X SSC, 0.5X SSC, and 0.2X SSC at 68°C. Expose to Amersham Hyperfilm for 15 min to 1 h. Count the number of positive signals and determine the percentage of total plaques screened. If the cDNA library is representative, this should approximate to the percentage abundance of the test gene mRNA.
9. cDNA insert size can readily be assessed by picking 10 individual plaques at random into 200 µL of SM plus 10 µL of chloroform. Allow these to elute at 4°C overnight, and then aliquot 4 µL of this eluate into a 100-µL PCR reaction using vector primers that flank the cloning site. For example, the λgt10 primers 5'AGCAAGTTCAGCCTGGTTAAG3' and 5'TTATGAGTATTTCTTCCAGGG3' can be used for 35 cycles of: 1 min at 94°C, 1 min at 50°C, and 2 min at 72°C. The size of these PCR products will thus give an indication of the average insert length in the library.
10. Assuming that the library construction has been successful, using freshly grown plating bacteria, prepare several 132-mm diameter plates with approx 20,000 PFU/plate (300 µL bacteria, 6.8 mL of top agar). Allow these to grow at 37°C until the plaques are just touching (8-12 h), and then cool to 4°C. Prepare plaque lifts as previously described for 82-mm plates.
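The titer calculation mentioned in step 5 can be sketched as follows. The plaque counts are hypothetical, and the stock volume of roughly 0.5 mL is an assumption based on the 500 µL of SM added after packaging (Section 3.2.8.); the function names are illustrative only.

```python
def titer_pfu_per_ml(plaques, plated_ul, dilution_factor):
    """Titer of the undiluted stock from a plaque count on one plate.

    plaques         -- plaques counted on the plate
    plated_ul       -- volume of diluted phage plated (uL)
    dilution_factor -- fold dilution of the stock before plating (e.g., 100)
    """
    return plaques * dilution_factor / (plated_ul / 1000.0)


def recombinant_fraction(hfl_plaques, c600_plaques):
    """Fraction of recombinants: plaques on C600 hfl over plaques on C600."""
    return hfl_plaques / c600_plaques


# Hypothetical counts from 5 uL of the 1:100 dilution.
hfl_count, c600_count = 150, 180

titer = titer_pfu_per_ml(hfl_count, plated_ul=5.0, dilution_factor=100)
library_size = titer * 0.5  # assumed packaged stock volume of about 0.5 mL

print("Recombinant titer (PFU/mL):", round(titer))
print("Approximate library size (PFU):", round(library_size))
print("Recombinant fraction:", round(recombinant_fraction(hfl_count, c600_count), 2))
```

Counting the C600 hfl plate gives the recombinant titer directly, while the ratio to the C600 plate indicates how much nonrecombinant vector background is present.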
3.3. Subtractive Probe Preparation
3.3.1. Alkaline Hydrolysis
1. Pellet 400-500 ng of first-strand cDNA from Section 3.2.1., wash with 100 µL of 80% ethanol, and air-dry. Dissolve this in 10 µL of 0.2M NaOH, 1 mM EDTA, and incubate at 50°C for 15 min. Dilute to 100 µL with SDW, plus 10 µL of 3M sodium acetate, pH 5.5, and 5 µL of 10% SDS.
2. Prepare a 1-mL spin column with Sephadex G50 equilibrated in 0.3M sodium acetate, 0.5% SDS, and spin the cDNA from step 1 through this (>80% of 32P should elute through the column).
3. Extract the cDNA from step 2 with an equal volume of phenol/chloroform, and retain the aqueous layer. Add to this the equivalent of 10-20 µg of DNased polyA+ driver mRNA (stored as a suspension under ethanol) and sufficient ethanol to ensure a threefold excess over the aqueous volume. Cool to -70°C for 30 min, and then pellet the cDNA/mRNA by centrifuging at 12,000g for 15 min. Wash the pellet with 100 µL of 80% ethanol, removing as much of the supernatant as possible, then cap the tube, and store on ice. (Do not allow the pellet to dry, since this will cause it to adhere to the tube wall.)
3.3.2. Subtractive Hybridization
1. Dissolve the pellet from step 3, Section 3.3.1. in 2.5 µL of DEPC-treated SDW. This can be time-consuming and difficult, since the cDNA can stick to the walls of the tube. Transfer the cDNA to a 0.5-mL tube and add 2.5 µL of prewarmed (68°C) 2X subtractive hybridization buffer. Mix gently, and then overlay with 30 µL of sterile liquid paraffin. Cap the tube, and heat to 95°C for 1 min. Then incubate at 68°C for 24-48 h.
2. Remove the aqueous layer from under the liquid paraffin, and dilute to 30 µL with SDW. Add 90 µL of 95% ethanol, mix, and cool to -70°C for 30 min or store indefinitely.
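The very small reaction volume keeps the driver concentration high. As a rough illustration, the sketch below estimates the driver excess and the Rot value (RNA nucleotide concentration multiplied by time) reached under the conditions above. Rot is not mentioned in the protocol, and the average nucleotide mass of 340 g/mol is an assumption, so the numbers should be read as order-of-magnitude guides only.

```python
AVG_NT_MASS_G_PER_MOL = 340.0  # assumed average mass of one RNA nucleotide


def nucleotide_molarity(mass_ug, volume_ul, nt_mass=AVG_NT_MASS_G_PER_MOL):
    """Moles of nucleotide per liter (ug/uL is numerically equal to g/L)."""
    grams_per_liter = mass_ug / volume_ul
    return grams_per_liter / nt_mass


def rot(mass_ug, volume_ul, hours):
    """Rot in mol nucleotide x s / L for a given driver mass, volume, and time."""
    return nucleotide_molarity(mass_ug, volume_ul) * hours * 3600.0


def driver_excess(driver_ug, cdna_ng):
    """Mass ratio of driver mRNA to cDNA."""
    return driver_ug * 1000.0 / cdna_ng


# Lower and upper ends of the ranges given in the protocol
# (10-20 ug driver, 400-500 ng cDNA, 5 uL final volume, 24-48 h).
print(driver_excess(10, 500), driver_excess(20, 400))
print(rot(10, 5, 24), rot(20, 5, 48))
```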
3.3.3. Chemical Crosslinking
1. Pellet the cDNA/RNA from step 2, Section 3.3.2. by centrifuging at 12,000g for 15 min, wash with 100 µL of 80% ethanol, and allow to air-dry.
2. Dissolve this material in 44 µL of crosslinking buffer. Add 5 µL of 20 mM ascorbic acid, transfer to 45°C, add 1 µL of 10 mM DZQ, and incubate for 20 min. Extract the reaction mixture with an equal volume of chloroform, and add 5 µL of 3M sodium acetate and 160 µL of ethanol. Store at -70°C until required.
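For reference, the final reagent concentrations in the 50-µL crosslinking reaction follow from the volumes in step 2. A minimal sketch of the dilution arithmetic is given below; the function name is illustrative only.

```python
def final_concentration(stock_conc, stock_volume_ul, total_volume_ul):
    """Concentration after dilution (same units as stock_conc)."""
    return stock_conc * stock_volume_ul / total_volume_ul


# Volumes from step 2: 44 uL buffer + 5 uL ascorbic acid + 1 uL DZQ.
total_ul = 44 + 5 + 1

ascorbic_mM = final_concentration(20, 5, total_ul)  # 20 mM stock
dzq_mM = final_concentration(10, 1, total_ul)       # 10 mM stock

print(total_ul, ascorbic_mM, dzq_mM)
```

Under these volumes the reaction contains 2 mM ascorbic acid and 0.2 mM DZQ.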
3.3.4. Labeling
1. Pellet the crosslinked cDNA/RNA, wash with 100 µL of 80% ethanol, and air-dry (see Note 3).
2. Assuming 400-500 ng of starting cDNA, dissolve this material in 36 µL of SDW, add 12 µL of nucleotide mix (dGTP, dATP, dTTP), 200 µCi of [α-32P]dCTP (3000 Ci/mmol), 8 µL of reaction mix (Boehringer Mannheim random prime kit), and 5 U of Sequenase II. Incubate at 37°C for 5 min and then at room temperature for 15 min.
3. Prepare a 1-mL spin column of Sephadex G50 equilibrated in 50 mM Tris-HCl, pH 7.5, 1 mM EDTA, and 1% SDS. Add 5 µL of 10% SDS to the labeling reaction, and pass this through the spin column in order to remove unincorporated 32P.
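It can be useful to know how much labeled nucleotide the reaction in step 2 actually contains. The sketch below converts an activity and a specific activity into picomoles of nucleotide; it ignores radioactive decay since the reference date, which is a simplifying assumption, and the function name is illustrative.

```python
def labeled_nucleotide_pmol(activity_uci, specific_activity_ci_per_mmol):
    """Amount of labeled nucleotide in pmol.

    activity_uci                  -- total activity added, in microcuries
    specific_activity_ci_per_mmol -- specific activity, in Ci/mmol
    """
    activity_ci = activity_uci * 1e-6
    mmol = activity_ci / specific_activity_ci_per_mmol
    return mmol * 1e9  # 1 mmol = 1e9 pmol


# Values from step 2: 200 uCi at 3000 Ci/mmol.
print(labeled_nucleotide_pmol(200, 3000))
```

With the values given, this works out to roughly 67 pmol of dCTP.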
3.3.5. Probe Hybridization
1. Prehybridize the 132-mm diameter plaque lift filters as previously described. Denature the subtracted probe by heating to 100°C for 5 min, and add this in a volume of 2 mL/filter of hybridization buffer containing 10 µL/mL of poly U (see Notes 4-8). Incubate overnight at 68°C. (Note: The more dilute the probe, the less sensitive; hence, screening many filters reduces sensitivity. We would suggest an initial screen of 60,000 PFU, i.e., three filters requiring 6 mL of hybridization buffer.)
2. Wash the filters initially at 68°C for 30 min with 2X SSC and then 1X SSC. Empirically reduce the SSC concentration down to a minimum value of 0.1X, carefully estimating the counts on the filter with a hand monitor. Stop the washing procedure at the SSC concentration when the counts reach an acceptable level for exposure to film. Mount the filters on an old piece of film and cover with Saran Wrap(TM). (Do not under any circumstances allow them to dry!) Expose these to Hyperfilm overnight.
3. Develop the film, superimpose this (see Note 9) on the phage plate, and pick as many primary plugs (see Note 10) as is practical into 500 µL of SM plus 20 µL of chloroform. Allow these to elute overnight at 4°C.
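The suggested initial screen of 60,000 PFU on three filters follows from the plating density of approx 20,000 PFU/plate and the 2 mL of hybridization buffer used per filter. The sketch below reproduces that arithmetic. The second function is an addition that is not part of the protocol: it treats each plaque as an independent draw and estimates the chance that a clone present at a given frequency appears at least once among the plaques screened.

```python
import math


def screening_plan(total_pfu, pfu_per_filter=20_000, buffer_ml_per_filter=2.0):
    """Number of 132-mm filters and total hybridization buffer volume (mL)."""
    filters = math.ceil(total_pfu / pfu_per_filter)
    return filters, filters * buffer_ml_per_filter


def prob_at_least_one(clone_frequency, plaques_screened):
    """Probability that a clone at the given frequency is present at least
    once, assuming plaques are independent (an idealization)."""
    return 1.0 - (1.0 - clone_frequency) ** plaques_screened


# The initial screen suggested in step 1.
print(screening_plan(60_000))
# Hypothetical clone present once per 10,000 recombinants.
print(prob_at_least_one(1e-4, 60_000))
```

Increasing the number of filters raises the chance that a rare clone is on the plates, but, as noted in step 1, it also dilutes the probe, so the two effects have to be balanced.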
3.3.6. Secondary Screening
This depends entirely on the chosen strategy. For example, replating each primary pick on 90-mm plates and rescreening with a subtracted probe may be the method of choice to isolate a single-phage clone. We, however, routinely use vector flanking primers to PCR amplify a 4-µL aliquot from the primary pick eluate and Southern blot the products (see Note 11). This blot can then be hybridized to a subtracted probe, and the positive bands identified, gel-purified, and subcloned into one of the PCR product cloning vectors, such as the TA system (Invitrogen, San Diego, CA). This second approach allows approx 30 primary picks to be secondary screened simultaneously on one Southern blot, eliminating the need for replating and performing one plaque lift per clone. The subcloned inserts can now be sequenced and compared to the EMBL database.
4. Notes
1. It is essential to DNase-treat the starting mRNA populations in order to remove contaminating genomic DNA that will otherwise be represented in the cDNA library and in the subtracted probe, leading to artifactual signals.
2. The pH of the crosslinking reaction is very important. If this deviates above 7.0, the crosslinking will proceed suboptimally. If the pH drops appreciably below 7.0, then overextensive nonspecific crosslinking will result. In practice, we have found that the crosslinking buffer should be adjusted to pH 7.0, and 20 mM ascorbic acid used instead of sodium ascorbate. This makes the reaction slightly acidic, ensuring that the crosslinking will proceed. Reacted DZQ has a faint pink color, which indicates that the reaction has occurred.
3. Do not boil the crosslinked cDNA/RNA prior to labeling. It has been shown that heating this material above 95°C causes strand breakage. Furthermore, it is important to use Analar reagents of the highest quality for the subtractive hybridization and crosslinking. The presence of metal ion impurities, such as Fe, will cause extensive strand scission, particularly under reducing conditions.
4. When using a multicopy complex probe, such as the subtracted probes described, it is important to use the correct hybridization buffer. For example, we have found
that buffers containing sheared genomic DNA, Denhardt's, and so forth, will compete out highly conserved sequences, such as actin. For this reason, we use the high SDS/BSA mix described. Poly U is included in the hybridization buffer to block any nonspecific signals resulting from the polyA tract present on each cDNA.
5. Leading on from Note 4, we have also found that this type of probe is extremely sensitive to target cDNA leaching from the plaque lift membrane. The extremely low concentration of each individual labeled cDNA in the probe means that a very small percentage of target cDNA leakage, from the plaque lift or Southern blot filters, is enough to block any signal completely. We have found that the membranes described perform best when both UV crosslinked and baked. In addition, we change the prehybridization buffer several times before applying the probe. This property of multisequence complex probes can be exploited to prevent detection of sequences that have already been isolated by simply including a small amount of their denatured full-length cDNA in with the subtracted probe before use.
6. We use a volume of 2 mL of hybridization buffer/132-mm diameter filter in order to keep the probe concentration high. This is not sufficient volume for a hybridization oven, so we use either heat-sealed polythene tube or, alternatively, a filter stack prepared in a 132-mm diameter Petri dish. The filters are layered plaque side uppermost, applying the probe in 2-mL aliquots with each successive filter. The final filter is overlaid with a piece of plastic sheet cut to 132-mm diameter, and the whole assembly sealed in an autoclave bag with a wet paper towel. This is then incubated overnight at 68°C in an oven. Rate accelerators, such as dextran sulfate, can be used, but in practice, these tend to increase background.
7. The type of probe generated by this method cannot be used to screen Northern blots of target mRNA. First, the individual cDNA concentration is extremely low, which necessitates that the target material be at a much higher concentration in order to drive the hybridization. Second, the probe is made by random priming cDNA, which makes the majority of the product the wrong strand sense to hybridize to its mRNA.
8. Many subtractive hybridization methods use multiple rounds of hybridization and depletion of common sequences. We have tried this with CCLS and have found that no further enrichment is gained by performing more than one cycle.
9. Once the phage plates and plaque lifts have been prepared, it is recommended that the whole process of screening, up to picking positives, be performed as quickly as possible. Phage will undergo passive diffusion on a culture plate, which will greatly increase the number of individual cDNA clones in each primary pick.
10. The filter orientation must be correctly aligned from autoradiograph to plaques. We have found that this is best accomplished by cutting three small, asymmetrically distributed, V-shaped slots on the edge of the filters. The position of these can then be marked on the base of the culture plate during plaque transfer.
11. If it is decided to adopt the Southern blot approach to secondary screening, we recommend the use of positively charged membrane, such as Pall Biodyne B or Amersham's Hybond N+, still using UV crosslinking and baking to immobilize target material.
Acknowledgments
DZQ was provided by J. Butler, Department of Biophysical Chemistry, Paterson Institute of Cancer Research, Christie Hospital, Wilmslow Road, Manchester M20 9BX. CCLS is the subject of an American and Canadian patent, the rights to which have been purchased by Amersham International, UK.
Gene Mapping and Isolation: Access to Databases
Martin J. Bishop

From: Methods in Molecular Biology, Vol. 68: Gene Isolation and Mapping Protocols. Edited by J. Boultwood, Humana Press Inc., Totowa, NJ

1. Introduction
If two variants in a population are related by simple Mendelian inheritance, then in principle, it is possible to isolate a gene in which the sequence differences will explain the nature of the variants. In practice, this is a daunting task in a genome the size of the human genome (3000 Mb) and may take many years to achieve. Genome mapping and sequencing projects invert the problem by attempting the systematic discovery and ordering of all genes over a period of many years. We may reasonably expect a complete map and sequences of all the genes in the human genome (not the entire genome sequence) by about 2005. The information is being accumulated in a variety of databases, and there are numerous analytical and display tools associated with the data.

Because vertebrates appear to have a similar complement of genes, information gained in species other than human is of considerable value. Genes from organisms as disparate as bacteria, plants, and invertebrates may also shed light on human genes, so the comparative approach can be extended with advantage to the whole of life. The systematic cataloging of yeast genes (now complete) and nematode genes (target 1998) will assist the process. cDNA sequencing and radiation hybrid mapping are expected to characterize a large percentage (about 80%) of human genes within the next two years. The remainder, whose expression is brief or highly localized, or which are expressed in very low quantities, will have to await the systematic search by cosmid sequencing.

The purpose of this chapter is to provide an overview of the information available on the Internet. Easy access to the information has been revolutionized by the World Wide Web (WWW), which has become an internationally recognized medium in the very short span of years since its birth. To assist the user, we will therefore describe access to data via WWW as the major methodology.

Genetic linkage maps are constructed in terms of at least two markers, which must be polymorphic. Their resolution may be coarse or refined to practical limits of about 1 cM (human) or 0.3 cM (mouse), representing perhaps 1 Mb and 600 kb of DNA, respectively. The major importance of genetic linkage maps is in providing the only method of relating phenotypic variation to the underlying sequence differences. Genetic linkage maps have also become important in the human and mouse genomes in providing framework maps based on dinucleotide repeat polymorphisms. These markers may also be placed on the physical maps and serve to integrate the two.

Physical maps are constructed in terms of physical DNA, and markers may be ordered one at a time and need not be polymorphic. There are a large variety of techniques, many described in this book, and the maps cannot be related unless the markers used are the same. The construction of overlapping clone maps (contigs) with large DNA fragments (from yeast artificial chromosome [YAC] or bacterial artificial chromosome [BAC] clones, for example), which can further be used to bridge cosmid contigs, will be essential for the systematic genome sequencing effort. Production of these overlapping clone maps has, so far, proven remarkably difficult to achieve in human DNA.

Gene isolation will proceed by the genetic mapping of a variant to a region and the selection of clones (possibly YACs) from the overlapping clone map that cover the region. cDNAs expressed from genes located in the YACs are candidate genes for the source of the variation. The gene in question may not be represented in cDNAs, so it will be necessary to attempt to isolate the gene in another way. In this complex operation, any information that already exists in the databases will help to reduce the scale of the task.
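The resolution limits quoted above for genetic linkage maps depend largely on how many meioses can be scored. The chapter does not give a derivation, so the following is only a sketch under a simple assumed model: two loci are taken to be resolved if at least one recombinant between them is expected, with a chosen confidence, among N informative meioses, and 1% recombination is treated as 1 cM (reasonable only for small distances).

```python
def resolution_cM(n_meioses, confidence=0.95):
    """Smallest map distance (cM) at which at least one recombinant is
    expected among n_meioses with the given confidence.

    Solves (1 - theta) ** n_meioses = 1 - confidence for theta.
    """
    theta = 1.0 - (1.0 - confidence) ** (1.0 / n_meioses)
    return 100.0 * theta


# Illustrative numbers of meioses (not taken from the text above).
for n in (300, 1000):
    print(n, round(resolution_cM(n), 2))
```

Under this model, about 1000 meioses correspond to a resolution near 0.3 cM and about 300 meioses to a resolution near 1 cM, which is in line with the practical limits mentioned for mouse and human maps.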
The availability of information is expanding and changing very rapidly. I attempt to give a brief review in subsequent sections. However, this is likely to become out of date fairly quickly. Other approaches to finding information are to go to a Web site that maintains a list of other useful sites, or to use a Web searching tool to look for key words relating to the subject of interest.
2. Genetic Mapping
2.1. Phenotypes
2.1.1. Online Mendelian Inheritance in Man (OMIM)
http://gdbwww.gdb.org/omim/docs/omimtop.html
OMIM is the on-line version of the human genetics text, Mendelian Inheritance in Man, serving clinical medicine and the Human Genome Project. It is a comprehensive catalog of human genes and genetic disorders with full text annotations on the most recent genetic research and the molecular genetic elucidation of clinical disorders. From its inception in the early 1960s until 1993, MIM (and from 1985, OMIM) was the sole work of Victor A. McKusick. In 1993, an editorial board of 12 subject editors was established to assist with composition and editing. The move from single to distributed authorship was undertaken to keep pace with the accelerating rate of gene discovery and the explosion of information regarding the genetic basis of disease. Structural rearrangement of OMIM text has begun, thereby permitting users to search within entries by topic, such as Clinical Features or Mode of Inheritance. For many entries, a new section was added, mini-MIM, a distillation of the most relevant clinical information. The need to produce MIM in different media prompted its conversion to Hypertext Markup Language (HTML). Currently, MIM is distributed both in print (11th ed., 1994) and electronically as OMIM and on CD-ROM. Conversion to HTML also permitted the use of graphical browsers, such as Mosaic or Netscape, on the WWW for viewing OMIM, and also allowed direct links to Genome Data Base (GDB) and other genetics databases.
The OMIM data are divided into such sections as:
1. Mini-MIM.
2. Description.
3. Phenotype.
   a. Clinical features.
   b. Biochemical features.
   c. Other features.
4. Genotype.
   a. Mode of inheritance.
   b. Mapping information.
   c. Molecular genetics.
5. Diagnosis.
6. Clinical management.
7. Population genetics.
8. Animal models.
9. References.
10. Clinical synopsis.
11. Edit history.
2.1.2. The Mouse Locus Catalog (MLC)
http://www.informatics.jax.org/locus.html
http://mgd.hgmp.mrc.ac.uk/locus.html
Formerly a separate database, this has now been integrated with the Mouse Genome Database (MGD). You can search for MLC records using the Mouse Locus Catalog Information query form. MLC records are incorporated into MGD tables. Full-text searching capability is preserved. In addition, the integrated OMIM/MLC searching capability is included in MGD. This capability lets you enter one search phrase to execute a query in both MLC and OMIM.
2.1.3. TBASE-The Transgenic/Targeted Mutation Database
http://www.gdb.org/Dan/tbase/tbase.html
Since development of the technology to manipulate the germline of animals over a decade ago, a large number of transgenic animals have been produced worldwide for use in both basic and applied research. Additionally, development of gene targeting protocols involving homologous recombination in mouse embryonic stem cells has resulted in a considerable number of mutant lines with specific phenotypes and well-defined DNA structural changes. TBASE is an attempt to organize information on transgenic animals and targeted mutations generated and analyzed worldwide.
2.1.4. Gene Knockouts Database
http://www.bayanet.com/bioscience/knockout/knochome.htm
This lists data regarding the phenotypes rendered by the knockout of various molecules in mice:
1. Gene knockouts that are compatible with viability.
2. Gene knockouts that result in prenatal mortality.
3. Gene knockouts that result in postnatal mortality.
2.2. Genetic Maps
2.2.1. GDB
http://gdbwww.gdb.org/
GDB is the major human genetic data repository and is maintained as a relational database. The data are in many different tables, which represent nine primary data objects:
1. Polymorphisms.
2. Maps.
3. Probes.
4. Libraries.
5. Contacts.
6. Cell lines.
7. Citations.
8. Loci.
9. Mutations.
The Navigational Map graphically displays the links between the data contained in these tables and the links to the data contained in OMIM. There are also links to the Enzyme Data Bank via EC numbers and the Genome Sequence Data Bank (GSDB) via DNA sequence accession numbers. All data contributed to GDB are reviewed by the HUGO/GDB editors. The GDB Browser provides HTML+ query forms for each type of primary data object. Entries matching query specifications are returned from the database with embedded hypertext links, allowing the user to locate an item of interest and trace related items easily.
2.2.2. Cooperative Human Linkage Center (CHLC)
http://www.chlc.org/
The goal of the Cooperative Human Linkage Center is to develop statistically rigorous, high heterozygosity genetic maps of the human genome that are greatly enriched for the presence of easy-to-use PCR-formatted microsatellite markers:
1. Genetic maps showing the positions of genetic markers.
   a. Integrated maps showing the position of genetic markers constructed using genotype data from the CEPH reference panel.
   b. CHLC Marker Maps showing the positions of CHLC generated markers in various reference maps.
2. Search by name for information on markers.
3. Search by name for information on markers, including map location, primer and PCR conditions, and sequence templates.
   a. Likely locations of current CHLC markers in Version 2.0 skeletal maps.
   b. Tables of CHLC markers characterized by linkage analysis.
   c. List of full information on markers generated by CHLC.
      i. In current linkage map.
      ii. Candidate linkage markers.
      iii. Somatic cell hybrid assigned.
   d. Prior versions of the CHLC generated markers.
   e. Marshfield CA-repeat markers.
      i. Table of initial typing data.
      ii. Table of sequence data.
      iii. Table of PCR primers.
2.2.3. MGD
http://www.informatics.jax.org/mgd.html
http://mgd.hgmp.mrc.ac.uk/mgd.html
MGD provides a comprehensive source of information on the experimental genetics of the laboratory mouse. MGD includes information on mouse markers, mammalian homologies, probes and clones, PCR primers, and experimental marker mapping data, such as strain distribution patterns for recombinant inbred strains and cross haplotypes. MGD provides a set of query forms that are supported by a number of WWW browsers, and MGD is easily accessed using the query forms. Each query form is related to a particular kind of information in the database:
1. References.
2. Mouse marker information.
3. Homologies.
4. Probes and clones.
5. PCR primers.
6. Mouse mapping data.
The user selects a form, enters information in the form fields to build a query, and executes the query. If there are records matching the search criteria, a list of query results is displayed. Each item in the list contains some highlighted text denoting a hypertext link to related information in the database. By selecting the link, the user can display the related information, which will also contain links to yet other information. By following these links, one can browse through MGD. The browser application keeps track of where you are, so that it is always possible to backtrack to the original query form.
2.2.4. Whitehead Institute Mouse Genetic Map Information
http://www-genome.wi.mit.edu/genome-data/mouse/mouse-index.html
The data available here represent the Whitehead Institute/MIT Center for Genome Research mouse genetic map. By April 1995, the map consisted of 6183 polymorphic mouse microsatellite repeats mapped on a Mus musculus C57BL/6J-ob/ob × CAST F2 intercross using PCR probes.
2.2.5. The European Collaborative Interspecific Backcross (EUCIB)
http://www.hgmp.mrc.ac.uk/MBx/MBxHomepage.html
This project is being carried out at two centers, the Human Genome Mapping Project Resource Centre (HGMP-RC), UK and the Pasteur Institute, France. A 1000-animal interspecific backcross between Mus musculus C57BL/6 and Mus spretus has been completed and DNAs prepared. Each backcross progeny mouse has been scored for 3-4 markers/chromosome, completing an anchor map of 70 loci across the mouse genome. A 1000-animal cross provides a genetic resolution of 0.3 cM with 95% confidence. Completion of the anchor map allows the identification of pools of animals recombinant in individual chromosome regions and allows a rapid two-stage hierarchical mapping of new loci. New markers are first analyzed through a panel of 40-50 mice in
order to identify linkage to a chromosome region. Subsequently, the new marker is analyzed through a panel of mice identified as carrying recombinants within that chromosome region. The backcross is supported by a database, MBx, developed in SYBASE. MBx stores mouse, locus, and probe data. It stores all allele data at each chromosome locus for each of the 1000 backcross progeny. Allele data are presented as a scrollable matrix on screen. When a new marker is analyzed through the backcross, MBx provides LOD score information to indicate linkage to a chromosome region. Using the computer screen, recombinant mice in this chromosome region can be selected for the second stage of hierarchical screening. In addition, at each stage, MBx will not only calculate the available LOD scores for closely linked markers, but will also determine genetic order with respect to closely linked markers by minimizing the number of recombinants. A WWW interface is available (Fig. 1).

Fig. 1. Genetic linkage mapping in the mouse. (Panel title: MIT and EUCIB, Chromosome 3.)

2.3. Comparative Information
2.3.1. The Dysmorphic Human-Mouse Homology Database (DHMHD)
http://www.hgmp.mrc.ac.uk/DHMHD/dysmorph.html
This application consists of three separate databases of human and mouse malformation syndromes together with a database of mouse/human syntenic regions. The mouse and human malformation databases are linked together through the chromosome synteny database. The purpose of the system is to allow retrieval of syndromes according to detailed phenotypic descriptions and to be able to carry out homology searches for the purpose of gene mapping. Thus, the database can be used to search for human or mouse malformation syndromes in different ways:
1. By specifying specific malformations or clinical features, or chromosome locations.
2. By homology.
3. By asking for human syndromes located at a chromosome region syntenic with a specific mouse chromosome region (and vice versa from human to mouse).
3. Physical Maps
3.1. Mapping Reagents
3.1.1. Sequence Tagged Sites (dbSTS)
http://www.ncbi.nlm.nih.gov/dbSTS/index.html
dbSTS is an NCBI resource that contains sequence and mapping data on short genomic landmark sequences. Although dbSTS sequences are to be incorporated into the new STS Division of GenBank, annotation in dbSTS is more comprehensive, and includes detailed contact information about the contributors, experimental conditions, and genetic map locations. In addition, NCBI periodically updates putative homology assignments using the BLAST family of programs.
3.1.2. Expressed Sequence Tags (dbEST)
http://www.ncbi.nlm.nih.gov/dbEST/index.html
dbEST is a division of GenBank that contains sequence data and other information on cDNA sequences characterized as single reads from DNA sequencing from a number of organisms. Physical DNA clones from IMAGE Consortium libraries are now available from a number of distributors. Clones may be ordered using the identifier labeled "CloneID" in the dbEST record (e.g., 69864, corresponding to GenBank accession number T48601).
3.1.3. CpG Island Database
http://biomaster.uio.no/cpgdb.html
The CpG island database deals at present with human genes appearing in major releases of the EMBL nucleotide sequence database, but it is hoped that in the future it will include islands from other mammalian species.
3.2. Cytogenetic Maps
Human and mouse cytogenetic map data are stored in GDB and MGD.
3.3. Radiation Hybrid Maps
3.3.1. The Radiation Hybrid Database (RHdb)
http://www.ebi.ac.uk/RHdb/
Radiation hybrid maps are an indispensable alternative to genetic maps, since they can include nonpolymorphic markers and are also powerful enough to order unresolved genetic clusters of polymorphic STSs. An international collaborative project has been started that will produce a large number of these hybrids for the human genome. This in turn will allow the generation of a very precise STS map that will be indispensable in the study of multifactorial diseases. RHdb is an archive of raw data with links to other related databases. The main data are stored in a relational database. Submissions to this database are made using a standard format. Various export formats will be supported, as well as different ways of accessing the data. The traditional flat-file format is used to export text data on a regular basis.
3.3.2. rhserver
http://shgc.stanford.edu/RHmap.html
The purpose of rhserver is to provide localization of STSs assayed against the publicly available Stanford Human Genome Center (SHGC) "G3" radiation hybrid panel, distributed by Research Genetics, Inc., to 973 framework markers selected from the Genethon meiotic linkage map. Submitted markers are subjected to two-point statistical analysis, after which rhserver will return a list of all markers from the 973 and from the submission data that link to the subject marker with an LOD of 6 or greater. This list will include the linked marker, the LOD score of the link, and the distance in cR8000 between the markers. Based on the scoring of 313 randomly distributed ESTs, we find that a random marker has a 50% chance of linking with an LOD of 6 or greater to at least one of the SHGC framework markers, and that the confidence in this link is 97%. In cases where no link is found, a message to that effect will be returned. Ambiguities in typing can decrease the likelihood of an LOD 6 or greater link. SHGC strongly suggests the use of duplicate typings. If, after duplicate typings, there are more than seven ambiguities between the two typings, the STS should be reassayed under different conditions or redesigned.
3.4. Overlapping Clone Maps
3.4.1. YAC Clones
3.4.1.1. CEPH-GENETHON INTEGRATED MAP
http://www.cephb.fr/ceph-genethon-map.html
This contains information about the CEPH YAC library, contig maps, STS data, Alu-PCR hybridization data, fingerprint data, sizing data, and fluorescent in situ hybridization (FISH) data.
3.4.1.2. WHITEHEAD INSTITUTE STS/YAC MAP
http://www-genome.wi.mit.edu/
The goal of the Whitehead Project is the construction of an STS content map of the human genome consisting of 12,000 STSs screened in the CEPH YAC library. Since reaching this goal in May 1995, efforts have been directed at map construction and validation. To further this effort, a radiation hybrid map of 5138 markers has been constructed and used together with the Genethon 1994 map to assemble the preliminary integrated STS/YAC map.
3.4.2. BAC and PAC Clones
These are promising to complement YACs for long-range physical mapping. Although the insert size is not usually as large, they do not suffer from the chimaerism and interstitial deletions so common in YACs. Information, protocols, and selected references on BACs are accessible at Caltech http://www.tree.caltech.edu/. Results using these bacterial clones are still under construction.
3.4.3. Cosmid Clones

YACs are unsuitable for sequencing, so cosmid clones are needed in overlapping contigs. Much work has been done in producing these in local regions or on single chromosomes. Many Web sites report the results of these efforts. A project that keeps information about overlaps determined by hybridization for a large number of clones is the Reference Library DataBase (RLDB) http://gea.lif.icnet.uk/.

4. Map Integration

Integration of genetic and physical maps is proceeding by using the same markers for mapping by a variety of methods, as mentioned above. This is a tedious process that will bear fruit over the next few years. In the interim, there is an attempt to compile existing information in the Genetic Location Database (ldb).

4.1. Ldb
http://cedar.genetics.soton.ac.uk/public-html/

Ldb is an analytical database for constructing fully integrated genetic and physical maps. The ldb program generates an integrated map (known as the summary map) from partial maps of physical, genetic, regional, somatic hybrid, mouse homology, and cytogenetic data. The summary maps and the data used to build them are available from the Web site. The files for each chromosome, which include the summary map, partial maps, lod files, and the parameter files, are stored in the same directory. Alternatively, the ldb program can be downloaded and used to create your own integrated maps.

5. Gene Isolation

5.1. Genome Crossreferencing and XREFdb
http://www.ncbi.nlm.nih.gov/XREFdb/

XREFdb is a publicly accessible database that is a component of a research project (the XREF project) devoted to crossreferencing the genetics of model organisms with mammalian phenotypes and accelerating the identification of genes mutated in human diseases. XREFdb provides similarity search, mapping, and relevant mammalian phenotype information. The database provides researchers with BLAST similarity search results that identify significant matches between sequences of model organism proteins and mammalian peptide sequences predicted by conceptual translation of ESTs. In addition, XREFdb tracks EST matches automatically for each account holder and flags those that have not been reported previously, thereby eliminating the need to reevaluate matches that have already been analyzed.

The XREF project is also determining mouse and human map positions for those ESTs most significantly matched by proteins from the budding yeast Saccharomyces cerevisiae. These map data, also available through XREFdb, will systematically establish potential crossreferences between genes in model organisms and mammalian phenotypes, via the phenotype-rich mouse and human maps. Such crossreferences can be particularly valuable when functional data are present for the gene product in one or more model organisms. Establishing connections in this manner has strong implications for expediting the discovery and characterization of genes mutated in human diseases.

The success of this crossreferencing project is largely dependent on two factors, the first of which is the amount of sequence information available for the organisms being crossreferenced. Many currently sequenced open reading frames in yeast and nematode are associated with mutant phenotypes and other protein functional data. Mammalian EST sequences are plentiful, and the current release of dbEST (July 6, 1995) contains 211936 human cDNA sequences. It is probable that a significant proportion of the total human gene repertoire is already represented by ESTs in dbEST. The second factor affecting the success of a crossreferencing project is the degree to which the genomes involved are related. The comparison between the yeast and human genomes, which are separated by a large evolutionary distance, provides an excellent example. The human NF1 gene is homologous to the yeast IRA2 gene, and the biological significance of this relationship has been demonstrated because the human NF1 cDNA can complement yeast IRA2 mutations.

Although this project was originally funded by the National Center for Human Genome Research to crossreference the S. cerevisiae and mammalian genomes, XREFdb has recently been expanded to accept protein queries from other model organisms, including Caenorhabditis elegans, Drosophila melanogaster, Escherichia coli, Mus musculus, Rattus norvegicus, Schizosaccharomyces pombe, and Xenopus laevis.
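The "conceptual translation of ESTs" mentioned in Section 5.1 can be illustrated with a minimal Python sketch. It is not XREFdb's own code: the function names are hypothetical, the standard genetic code is assumed, and the BLAST comparison against model organism proteins is not shown. Because the reading frame and strand of an EST are usually unknown, the sketch translates all six frames.

```python
# Standard genetic code, with codons enumerated in the base order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate a DNA sequence from its first base.

    Stop codons become '*'; codons containing ambiguous bases become 'X'.
    Trailing bases that do not fill a codon are ignored.
    """
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return the six conceptual translations, keyed '+1'..'+3' and '-1'..'-3'."""
    frames = {}
    for strand, strand_seq in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for name, peptide in six_frame_translations("ATGGCCTAA").items():
        print(name, peptide)
```

In a crossreferencing setting, each of the six peptide strings would then be compared against the protein query; single-read ESTs contain sequencing errors, so frameshifts within a read are to be expected.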
6. Genome Centers

6.1. Baylor College of Medicine (BCM) Genome Center
http://gc.bcm.tmc.edu:8088/home.html

1. Newsletter.
2. YAC Data Searches: the most current YAC data from various sources.
3. Biologist’s Control Panel: biological databases, search tools, and information.
4. BCM Search Launcher: WWW sequence search and analysis software.
5. BCM Sequence Annotation Server: annotate sequences locally or in Entrez.
6. Genome Reconstruction Manager: manage data flow for cosmid DNA sequencing.
6.2. Caltech Genome Research Laboratory
http://www.tree.caltech.edu/

1. Human Genome Project.
   a. Construction of BAC Library resource.
   b. Physical mapping of Human Chromosome 22 using BAC clones and YAC frameworks.
2. Mouse Genome Project.
   a. Construction of Mouse BAC Library resource.
3. Microbial Genome Project.
   a. Sequencing the 1.8-Mb Genome of the Archaeum Pyrobaculum aerophilum.
4. Information on building BAC libraries.

6.3. Canadian Genome Analysis and Technology Program (CGAT)
http://cgat.bch.umontreal.ca

1. Information resource and communications nexus for genomics research in Canada.
2. Genomics databases and information.
   a. Chromosome X.
   b. C. elegans cosmid transgenics.
   c. The Organelle Genome Database (GOBASE).
   d. Organelle Genome Megasequencing Program.
   e. Rose Worm lab.
   f. Sulfolobus solfataricus Genome Project.
3. CGAT medical, ethical, legal, and social issues (MELSI).
4. Contact list of troubleshooters willing to answer your technical questions.
5. Software archive.
6.4. Cavalli Lab
http://lotka.stanford.edu/

The Human Population Genetics Lab headed by L. L. Cavalli-Sforza of the Department of Genetics, Stanford University, is a center for generating, collecting, storing, disseminating, and analyzing genetic data on the great human diaspora. This page contains information about the members of the Cavalli Lab and about their activities, including preprints, abstracts, and bibliographies.

6.5. CBA-IST (Genova)
http://www.ist.unige.it/

1. HyperCLDB.
   a. Cell Line Data Base.
   b. 3000 human and animal cell lines from European culture collections.
2. Descriptions of Interlab Project databanks of biological materials.
6.6. CBIL at Pennsylvania University
http://cbil.humgen.upenn.edu/

The Computational Biology and Informatics Laboratory (CBIL) collaborates with the Chromosome 22 Genome Center and the Center for Advanced Genome Technology at the University of Oklahoma to sequence Human Chromosome 22, supplying informatics services to this major component of the Human Genome Program. An active research program is maintained, specializing in biological databases, genome informatics, and linguistic sequence analysis.

6.7. CEPH (Fondation Jean Dausset)
http://www.cephb.fr/

The Centre d’Etudes du Polymorphisme Humain (CEPH) is a research laboratory created in 1984 by Jean Dausset (Nobel Prize, medicine and physiology, 1980). This laboratory constructs maps of the human genome.

1. CEPH-Genethon Integrated maps.
2. CEPH genotype database.

6.8. Columbia University Human Genome Project
http://genome1.ccc.columbia.edu/~genome/

This contains Chromosome 13 data (YACs, cosmids, STSs, and markers).

6.9. Cedars Sinai Medical Center (CSMC) Molecular Genetics Labs
http://www.csmc.edu/genetics/korenberg/korenberg.html

The Molecular Genetics Laboratories at Cedars-Sinai Research Institute are headed by Julie R. Korenberg.

1. Integrated YAC/BAC/PAC Resource for the Human Genome.
2. Chromosome 21 Phenotypic Mapping Project.
3. Gene Mapping Projects on Chromosome 21.
   a. BAC Contig of Chromosome 21.
   b. Congenital Heart Disease Region.
   c. Progressive Myoclonus Epilepsy and Holoprosencephaly.
   d. FISH.
   e. cDNA Mapping Project.

6.10. Department of Energy (DOE) Human Genome Program
http://www.er.doe.gov/production/oher/hug_top.html

1. Information on the DOE projects.
2. Links to Los Alamos National Laboratory (LANL), Lawrence Berkeley Laboratory (LBL), Lawrence Livermore National Laboratory (LLNL), and so forth.
3. A primer on the basic science of the program.
4. Human Genome Project resources and meetings.
6.11. European Bioinformatics Institute (EMBL-EBI)
http://www.ebi.ac.uk/

1. EMBL/EBI Databases.
   a. EMBL Nucleotide Sequence Database.
   b. SWISS-PROT Protein Sequence Database.
   c. Bio-Catalogue of software.
   d. dbEST and dbSTS.
   e. Radiation Hybrid Database.
   f. IMGT Immunogenetics Database.
   g. The PCR Primers database.
   h. FlyBase.
   i. Access to a selection of databases held on the EBI ftp server.
2. Data submissions.
3. Database query/retrieval.
4. Sequence similarity searches.
5. Documentation and software.

6.12. Galton Laboratory
http://diamond.gene.ucl.ac.uk/

1. Information on MRC Human Biochemical Genetics Unit and UCL Genetics and Biometry Department.
2. Chromosome 9 Workshop reports, maps, and contact addresses.
3. Chromosome Y fingerprint data.
4. Linkage software.

6.13. Genethon
http://www.genethon.fr/

1. GenomeView: access to physical and genetic map data from several public sources.
2. CEPH/Genethon Physical Map.

6.14. Genestream at EERIE
http://genome.eerie.fr/Genome.html

This is the South of France Human Genome Project Computing Resource Center.

6.15. Geneva University (ExPASy)
http://expasy.hcuge.ch/

The ExPASy WWW molecular biology server of the Geneva University Hospital and the University of Geneva is dedicated to the analysis of protein and nucleic acid sequences, as well as two-dimensional polyacrylamide gel electrophoresis (2-D PAGE).
1. Search databases.
   a. SWISS-PROT: annotated protein sequence database.
   b. PROSITE: dictionary of protein sites and patterns.
   c. SWISS-2DPAGE: 2-D PAGE database.
   d. SWISS-3DIMAGE: 3-D images of proteins and other biological macromolecules.
   e. ENZYME: enzyme nomenclature database.
   f. SeqAnalRef: sequence analysis bibliographic reference database.
2. Tools and software packages.
   a. Tools: access to protein analysis tools.
   b. Swiss-Shop: sequence alerting system for SWISS-PROT.
   c. Swiss-Model: automated knowledge-based protein modeling server.
   d. Melanie: software packages for 2-D PAGE analysis (including the Melanie II tutorial).
3. 2-D PAGE services and courses.
   a. SWISS-2DSERVICE: have your 2-D gels performed according to Swiss standards.
   b. 2-D PAGE training: attend a 1-wk course in Geneva.
   c. Technical information on 2-D PAGE (protocols).

6.16. Genome Therapeutics Corporation (GTC)
http://www.cric.com/

1. Commercial organization.
2. FTP data files of sequence assembly test kits.
3. Chromosome 10 physical mapping data.

6.17. GenomeNet, Japan
http://www.genome.ad.jp/

GenomeNet is a Japanese computer network for genome research and related research areas in molecular and cellular biology.

1. DBGET, a linked database search of:
   a. Bibliographic data (Medline, LITDB).
   b. Sequences (GenBank, EMBL, SWISS-PROT, Protein Identification Resource [PIR], Protein Research Foundation [PRF], Prosite).
   c. Protein 3-D structures (Protein Data Bank [PDB]).
   d. Protein mutation data (PMD).
   e. Chemical compounds in enzyme reactions (LIGAND).
   f. Amino acid indices (AA-Index).
   g. Genome mapping data (GDB, OMIM).
2. Sequence interpretation tools.
   a. MOTIF (Sequence Motif Search).
   b. BLAST (Sequence Homology Search).
   c. PSORT (Prediction of Protein Sorting Signals).
   d. TFSEARCH (Transcription Factor Binding Site Search).
Fig. 2. The home page of the HGMP Resource Center.
6.18. Human Genome Mapping Project (HGMP) Resource Centre
http://www.hgmp.mrc.ac.uk/

The UK HGMP Resource Centre (HGMP-RC) is a UK Medical Research Council (MRC) Unit located on the Hinxton Hall Genome Campus established by the Wellcome Trust. In addition to the HGMP-RC, the site also houses the Sanger Centre and the European Bioinformatics Institute (EBI). The HGMP-RC provides biological materials and services relating to the human and mouse genome projects. In addition, it provides an online computing service, user support, and extensive training courses. Registered users have access to a wide range of databases and analysis tools via a WWW interface (Fig. 2). The computing group is also involved in a number of collaborative development projects in genome informatics.
6.19. IMAGE Consortium
http://www-bio.llnl.gov/bbrp/image/image.html

The IMAGE Consortium was initiated by four academic groups on a collaborative basis after informal discussions led to a common vision of how to achieve an important goal in the study of the human genome: the Integrated Molecular Analysis of Genomes and Their Expression. They share high-quality, arrayed cDNA libraries and place sequence, map, and expression data on the clones in these arrays into the public domain. From this information, they rearray the unique clones to form a “master array,” which they hope will ultimately contain a representative cDNA from each and every gene in the genome. These clones are available free of any royalties and may be used by anyone agreeing with their guidelines.

1. Most of their clones are now from 20 different cDNA libraries.
2. Over 100,000 human cDNA clones are now arrayed.
3. Over 180,000 5’ and/or 3’ sequences have been deposited into dbEST.
4. Over 4000 chromosomal assignments have been entered into GDB.
5. The master array currently consists of over 25,000 distinct genes.

6.20. LANL Biosciences
http://www-t10.lanl.gov/

1. Sigma chromosome maps.
2. Chromosome 16 flat file.
3. HIV databases.
4. Nucleic Acid MOdeling Tool (NAMOT).

6.21. LBL Human Genome Center
http://www-hgc.lbl.gov/GenomeHome.html

1. Human Chromosome 21 P1 and cDNA mapping databases.
2. Drosophila physical mapping.
3. Instrumentation and informatics projects.
4. Human chromosome and directed genome sequencing projects.

6.22. LBL Resource for Molecular Cytogenetics
http://rmc-www.lbl.gov/

Work is being pursued in three areas: development and application of improved hybridization technology, selection of probes optimized for use in FISH, and development of digital imaging microscopy. There is access to probes and mapping information developed by the Resource for Molecular Cytogenetics.
6.23. LLNL Biology and Biotechnology Research Program
http://www-bio.llnl.gov/bbrp/bbrp.homepage.html

1. Human Genome Center.
   a. Physical maps of Human Chromosome 19.
   b. Closure of the Chromosome 19 Map.
   c. Enhancement of the high-resolution clone map of human chromosome 19.
   d. DNA sequencing.
   e. National Laboratory Gene Library Project.
   f. Alu Repeats: a novel source of genetic variation for mapping the human genome.
   g. Informatics and analytical genomics.
   h. Instrumentation for the Human Genome Project.
   i. IMAGE Consortium home page.
2. DNA repair.
3. Molecular toxicology and human risk assessment.
4. Structural biology.
5. X-ray crystallography.
6. Technology development.
7. Center for healthcare technology.

6.24. Michigan Human Genome Center
http://www.hgp.med.umich.edu/

1. DNA Sequencing.
2. Online documentation.
3. Whitehead/MIT mouse map data.
4. CEPH-Généthon physical map data.

6.25. Minnesota University Institute of Human Genetics
http://lenti.med.umn.edu/~ihg/index.html

Description of projects and services at the Institute of Human Genetics.

1. Molecular Genetics.
2. Genetic services:
   a. Molecular genetics laboratory.
   b. Microchemical facility.
   c. Gene therapy.
3. Clinical genetics.
4. Developmental biology.
5. Behavioral genetics.
6. Genetic epidemiology.
7. Career training.
6.26. National Center for Biotechnology Information (NCBI)
http://www.ncbi.nlm.nih.gov/

The NCBI is responsible for building, maintaining, and distributing GenBank, the National Institutes of Health (NIH) genetic sequence database that collects all known DNA sequences from scientists worldwide.

1. Text search of GenBank, dbEST, dbSTS, and Entrez.
2. Submitting sequences to GenBank (BankIt).
3. BLAST sequence similarity search.

6.27. National Center for Genome Resources (NCGR)
http://www.ncgr.org/

The NCGR is a not-for-profit organization created to design, develop, support, and deliver resources in support of public and private genome research.

1. GSDB genome sequence database with hyperlinks to GDB, SwissProt, and so on.
2. SIGMA system for integrated genome map assembly.
3. Ethical, Legal, and Social Implications (ELSI) of biotechnology information.

6.28. National Center for Human Genome Research (NCHGR)
http://www.nchgr.nih.gov/

The NCHGR was established to head the Human Genome Project for the NIH. Projects include:

1. The Breast Cancer Information Core (BIC) Homepage (Membership Access Only).
2. Clinical Gene Therapy Branch.
3. Diagnostic Development Branch.
4. Education and training.
5. Genetic Resource Branch.
6. Laboratory of Cancer Genetics.
7. Laboratory of Gene Transfer.
8. Laboratory of Genetic Disease Research.
9. Medical Genetics Branch.
10. Technology transfer.
6.29. Neurogenetics Homepage (Massachusetts General Hospital [MGH])
http://neurosurgery.mgh.harvard.edu/ngenethp.htm

This provides information on the Neurogenetics Unit and the Multidisciplinary Neurofibromatosis Clinic, including:

1. Information on neurofibromatosis (NF-1 and NF-2).
2. Information on tuberous sclerosis.
3. Information on Von Hippel-Lindau disease.
4. A listing of MGH neurosurgeons with expertise in tumors associated with neurogenetic disorders.
5. Information on hydrocephalus.
6. Other resources for patients (and families/care-givers) with brain, spine, or peripheral nerve tumors.
6.30. National Institutes of Health (NIH)
http://www.nih.gov/

This provides NIH information, news and events, health information, grants and contracts, scientific resources, institutes, and offices.

6.31. Pasteur Institute
http://www.pasteur.fr/welcome-uk.html

1. Pasteur Institute projects.
2. Many databases and useful sites.

6.32. Research Genetics
http://www.resgen.com/

1. Commercial organization.
2. Provides a large variety of reagents and biological materials.
3. Catalog holds items from Adenovirus RNA 1 Probe to YAC Library Custom Screening.
4. New service: MAPPAIRS (microsatellite markers consisting of a pair of PCR primers that flank a simple sequence repeat that is polymorphic).

6.33. Sanger Centre
http://www.sanger.ac.uk/

The Sanger Centre, directed by John Sulston, is a new research center established by the Wellcome Trust to provide a major focus in the UK for mapping and sequencing the human genome and genomes of other organisms. Major projects include the sequencing of the nematode and yeast genomes and physical mapping of the human genome (Fig. 3). Large-scale sequencing of human DNA is also underway. The Informatics Group develops new software, including ACEDB, which provides a highly graphical interface to genome data.

1. Information on the Sanger Centre and local projects.
2. Information on ACEDB and its derivative databases.
3. Wormpep: predicted proteins from the C. elegans project.
4. Prodom: Protein Domain Database.
5. List of biological Web servers.
Fig. 3. Physical mapping of human chromosome 22 at the Sanger Centre.
6.34. Stanford Human Genome Center
http://shgc.stanford.edu/

1. FTP data files of STSs, mapping data, chromosome 4 radiation-hybrid mapping, and YAC STS-content mapping project.
2. WWW searches of these data.

6.35. Texas University Health Science Center
http://mars.uthscsa.edu/

1. Human chromosome 3 database: STSs and YACs.
2. Human chromosome 3 sensitive genetic map.
3. Human chromosome 3 cytogenetic breakpoint map.
6.36. The Institute for Genomic Research (TIGR)
http://www.tigr.org/

This is a nonprofit organization.

1. The Microbial Database (MDB) provides access to the genome sequences for the Haemophilus influenzae and Mycoplasma genitalium genomes.
2. Human cDNA Database (HCD) provides researchers at nonprofit institutions access to cDNA/EST sequence and related data.
3. The Expressed Gene Anatomy Database (EGAD) links expression data, cellular roles, and alternative splicing information to a curated, nonredundant set of human transcript sequences.
4. Sequences, Sources, Taxa database (SST) provides links between source, collection, taxonomy, and molecular sequence data.
6.37. Virtual Genome Center
http://alces.med.umn.edu/VGC.html

1. Sequence analysis tools.
   a. Primer selection for PCR.
   b. Translation.
   c. Codon usage.
   d. Protein motifs.
   e. Oligo Tm determination.
   f. Atypical sequences.
   g. Human repeats.
   h. Dot plots.
2. Query GenBank, SwissProt databases.
3. Useful databases and tables.
   a. Codons.
   b. Size of human chromosomes.
   c. Human repeated DNA.
4. Sequences of the S. cerevisiae chromosomes.
5. Genome program news and gossip.
6. Candida albicans: physical map, sequence data, strains, and resources.

6.38. Washington University Genome Sequencing Center
http://genome.wustl.edu/gsc/gschmpg.html

1. C. elegans sequencing.
2. EST sequencing as part of the Washington University and Merck EST Project.

6.39. Whitehead Institute/MIT Genome Center
http://www-genome.wi.mit.edu/

1. Human YAC screening data for STSs screened on the CEPH mega-YAC library, with over 1100 contigs assembled using double linkage between STSs.
2. For each STS, they report addresses for the YACs found to contain the STS.
3. Human, rat, and mouse marker map data files.
Software for Genetic Linkage Analysis: An Update
Stephen P. Bryant

1. Introduction

Reviews describing the software available to support genetic linkage analysis have appeared intermittently over the last few years (1,1a). This period has been associated with several significant influences that have made it timely to reconsider the available programs and related computing resources. These influences are:

1. The growth of the Internet;
2. The adoption of Unix as the most common platform for software development;
3. The appearance of genome centers that offer database and analytical resources; and
4. Theoretical and technical advances in the development of algorithms for mapping general genetic traits (2-4), constructing dense genetic maps of whole chromosomes (5), and for conducting genome-wide screening projects for common diseases such as diabetes (6).
This chapter is not exhaustive, but covers those applications that have appeared de novo since (1), that have otherwise changed substantially, or that have continued to be used and maintained within the community. Some programs that were believed not to be currently maintained or developed have been left out, but the author would be grateful to receive updated information concerning these or other software that have not been included. The interested reader can refer to the earlier review (1) to fill in any gaps. The present chapter is software-centered, and is to be used as a directory of current software and services. There has been an attempt to place the programs in context, but the reader will probably need to consult complementary reviews, such as ref. 7, which examines the computational problems of genetic linkage analysis from the perspectives of data management, analysis, and interpretation of results. Some information about the expanding World Wide Web (8) and a directory of Web servers of relevance to the domain of linkage analysis is also included.

1.1. The Internet

Most laboratories working in the field of molecular biology now have access to the Internet. The recent production of collaborative genetic linkage maps by the European Gene Mapping (EUROGEM) initiative is an excellent example of the possibilities that electronic networking opens up (9). This almost complete penetration of the Internet worldwide has changed the way software and data are disseminated among the community (10). The World Wide Web and anonymous ftp have meant that most computing resources are now provided on-line, rather than by diskette or tape. The most usual way of accessing the Internet is by using one of the TCP/IP tools such as telnet, which is available for Macintosh, PC, and Unix systems. Telnet, from the US National Center for Supercomputing Applications (NCSA), is in the public domain. Kermit is now mostly of historical interest. File transfer is conducted by ftp, or by using one of the other browsing tools, such as Gopher, Mosaic, or Netscape. X-Windows software is often used by labs when connected to one of the national computing centers. If X is not available, a VT100 (or similar) emulation is adequate for much of this kind of work.

1.2. Unix

With Unix replacing VMS as the most common computing platform available as a central resource in the institutional setting, the porting problem for new software has been simplified greatly.
The DOS or Microsoft Windows operating systems are still required for some applications for which Unix versions are not likely to become available, although it should be noted that the DOS emulator SoftPC and the Windows Application Binary Interface (WABI) from Sun Microsystems mean that graphical PC applications such as CYRILLIC (11) can now be run successfully under Unix.

1.3. Genome Centers

Centers for genome research, such as the UK Human Genome Mapping Project Resource Centre (HGMP-RC) in Hinxton, England (12), the European Data Resource (EDR) center of the DKFZ in Heidelberg, Germany, and the Cooperative Human Linkage Center (CHLC, IA) in the United States, have developed as centralized providers of computational and biological resources as a part of the Human Genome Project. These centers only require that a user be able to establish an Internet connection and be involved in genome research. They provide access to many of the programs described here.

1.4. Theoretical and Technical Advances

Gene mapping by the analysis of traits segregating in human pedigrees is a major goal of linkage analysis (13), itself firmly rooted in the statistical technique of maximum likelihood estimation (MLE; 14). The estimated quantity is most often the recombination fraction (θ) using the now well-known LOD-score method (15). The use of MLE has been facilitated by the development of algorithms (14,16) that can be implemented on small computers, in packages such as LIPED (17) and LINKAGE (18). Subsequent to the appearance of these programs, more efficient algorithms to perform multipoint linkage analysis were produced (19,20), which extended the size of map that realistically could be created using the method (21). Intensive collaborative work has led to the publication of several genetic maps of the whole human genome using these techniques (9,22,23). Although these new algorithms enable such an undertaking, the construction of large genetic maps remains labor- and computer-intensive. The heuristics that users follow when constructing maps of this type recently have been modeled within an expert system framework (5) that will decrease the time and effort required for analysis significantly.

With traditional MLE, finding genetic linkage to a putative disease susceptibility locus demands the use of a suitable transmission model, which may require the joint estimation of several parameters. This can be difficult computationally, and may also affect the significance level of the result and increase the chance of missing linkage when it exists. A family of related statistical methods, based on haplotype sharing, has been developed to address this problem. The affected sib-pair method (24), of immense value in the analysis of the HLA region, is based on the concept of identity by descent. Lange (25) applied the affected-sib method to sibsets, and later to identity by state (26). Weeks and Lange (27) generalized the method to extended pedigrees. They used the algorithm of Karigl (28) to compute multiple-person kinship coefficients and thence to derive the distribution of a test statistic within each pedigree. Their KIN package (27) enables tests of the hypothesis of Mendelian segregation among related, affected individuals. Weitkamp and Lewis (29) used Monte Carlo simulation methods in conjunction with an identity by descent statistic to test for Mendelian segregation in extended pedigrees. Their PEDSCORE program is similar to KIN in spirit, though it cannot be applied to identity by state. Other laboratories have since enhanced the basic method (4,30,31), but the programs have yet to reach the public domain in any quantity.
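The LOD-score method referred to at the start of Section 1.4 can be made concrete with a small worked sketch. It covers only the simplest textbook case, assumed here for illustration: fully informative, phase-known meioses, in which r recombinants are counted among n meioses and the LOD score at recombination fraction θ is log10 of θ^r (1 − θ)^(n − r) divided by 0.5^n. The function names are hypothetical, and this is not how LIPED or LINKAGE compute pedigree likelihoods, which require the more general algorithms cited above.

```python
import math


def lod_score(theta, recombinants, meioses):
    """Two-point LOD score at recombination fraction theta (0 <= theta <= 0.5).

    Assumes phase-known, fully informative meioses, so the likelihood is
    binomial: theta**r * (1 - theta)**(n - r), compared with 0.5**n.
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must lie between 0 and meioses")
    nonrecombinants = meioses - recombinants

    def count_times_log10(count, prob):
        # Treat 0 * log10(0) as 0 so that theta = 0 with no recombinants is defined.
        if count == 0:
            return 0.0
        if prob == 0.0:
            return float("-inf")
        return count * math.log10(prob)

    log_likelihood_theta = (
        count_times_log10(recombinants, theta)
        + count_times_log10(nonrecombinants, 1.0 - theta)
    )
    log_likelihood_unlinked = meioses * math.log10(0.5)
    return log_likelihood_theta - log_likelihood_unlinked


def max_lod(recombinants, meioses):
    """Return the maximum likelihood estimate of theta and the LOD score there."""
    if meioses <= 0:
        raise ValueError("at least one meiosis is required")
    # The binomial estimate r/n is capped at 0.5, the value for unlinked loci.
    theta_hat = min(recombinants / meioses, 0.5)
    return theta_hat, lod_score(theta_hat, recombinants, meioses)


if __name__ == "__main__":
    theta_hat, z_max = max_lod(recombinants=1, meioses=10)
    print(f"theta_hat = {theta_hat:.2f}, maximum LOD = {z_max:.2f}")
```

With no recombinants in 10 meioses, the LOD score at θ = 0 is 10 log10 2, or about 3.01, which is why roughly 10 informative meioses without recombination are needed to reach the conventional threshold of 3.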
The original use of simulation on genealogies was described by Edwards (32). MacCluer et al. (33) simulated gene flow through a genealogy, and computer simulation is also used in programs like SIMLINK (34) to estimate the power of a proposed linkage study. Ott (35) considered simulation methods applied to problems of linkage and heterogeneity, which is leading to valuable software developments, such as SLINK (see Section 2.23.).

The most crucial factor in enabling new software for genetic analysis to contribute fully to the solution of problems in the domain is interoperability. A major consideration is that the design should enable integration with existing databases and applications. Each program not only uses a specific data format, but also uses a specific data model. The Integrated Genomic Database (IGD) project (36) is an attempt to address the integration problem by smoothing out the syntactic and semantic differences between programs using additional enabling software. This facilitates software integration, the exchange of data between databases, and the construction of data sets for analysis. The exchange of data between applications is distinctly nontrivial, and remains the limiting factor when conducting analyses. Where the information is available, the author has indicated in the text where recognized data paths exist between applications. Many of the concepts involved in this kind of work have been well described previously (37). The discipline of Knowledge Engineering and the appearance of Intelligent Knowledge Base Management Systems (IKBMS) will further support and enhance this process, which is still in its infancy. Expert systems (ES) have already shown promise in other areas of human genetics (38), and more recently in genetic map construction (5).

2. The Software

Many of the programs described here are analytical.
The utility of analytical software is extended by ancillary packages that are geared toward data management (39), support roles such as pedigree display (40), or the transformation and examination of data sets (41). Most software described here is available freely from the originators of the package, from third parties, or by anonymous ftp, although the software is usually not strictly in the public domain. Users in any case are advised to register with distributors for the timely receipt of upgrades and bug fixes, as well as to comply with licensing requirements.

The PC architecture has kept pace with the increasing demands of linkage analysts, and a 486- or Pentium-based PC is certainly adequate for many disease mapping studies. These machines typically come with several megabytes of RAM instead of the 640 K of a few years ago, and can be used for much larger problems than were previously feasible. Graphical capability is not so important for these machines, which typically come with a perfectly adequate VGA resolution, useful for display applications such as PEDRAW.

Handling data from large numbers of loci simultaneously is more common now than previously. Whole genome maps are regularly published (9,22,23) containing upwards of 100 loci per chromosome. The establishment of the 32-bit Unix workstation and the recent appearance of 64-bit architecture from Digital (AXP) and Sun Microsystems (Ultra) has enabled the solution of large mapping problems that would have been infeasible previously. The older DEC VAX architecture, while still usable, is disappearing from the domain. The Apple Macintosh, although not a particularly powerful machine in computational terms, has established itself as an effective data management vehicle. It has not been widely used for analysis, although Macintosh ports of some software do exist. The IBM PC and compatibles are almost always used with Microsoft Windows, which enables switching between applications such as DOLINK and LINKAGE, streamlining the analytical process. A group using a remote Unix machine will almost certainly be interacting with a version of System V Unix (Solaris 2 or DEC OSF/1).

Most packages are distributed as source code with executable programs for one or, in some cases, several architectures. Some software is designed to be highly configurable and will need to be edited and recompiled at host sites. If changes to the source code are to be attempted, or if the code is to be ported to another, unsupported system, a compiler will be needed. Most software described in this work has been written in Pascal, C, or Fortran, with the trend toward C (or C++). Some operating systems are delivered with compilers included.
For some Unix systems, a C compiler will need to be obtained from an alternative source, and the GNU compilers from the Free Software Foundation are recommended highly (ftp GNU software from src.doc.ic.ac.uk, or use archie to find a local server outside the UK). DOS machines are supplied without compilers of any sort. Microsoft Pascal, Microsoft C, Microsoft Fortran, Turbo Pascal, and Turbo C probably will be the most useful. DOS emulators have improved greatly in recent years. WABI on Solaris can run the graphical CYRILLIC software, for example.

The only essential quality of a text editor for linkage data is that it should be able to produce clean ASCII files. A good, basic editor is included as part of the LINKSYS package (39). WORDSTAR, WORDPERFECT, and Microsoft WORD are proprietary systems that can be used to generate suitable text files. On Unix, vi or emacs are almost always available, or under X, xedit is a good graphical editor.

This section is, in a sense, a comparative review, since restrictions and positive features are evaluated and compared across systems. However, there are few directly “competing” packages. The programs described tend to complement each other. This review is intended as a guide to the benefits of investing time to acquire and get to know a particular piece of software. The author has tabulated much of the basic information needed (Table 1) and supplemented this with a short critical paragraph or two for each system.

2.1. CRI-MAP
CRI-MAP is designed to facilitate the construction of large multilocus linkage maps (20). It was originally conceived to handle large numbers of codominant loci in CEPH-style nuclear families, and it is for this purpose that it is particularly well-suited. It can be applied to certain types of disease loci and can handle general, extended pedigrees. It can cope specifically with those disease loci where unaffected carriers are disallowed, that is, when full penetrance is assumed. It can also indicate the parental and grandparental origins of each allele at each offspring locus using maximum likelihood methods, which enable recombination events to be visualized. It is distributed as C source code. To use it, sites will need a compiler. The code has been implemented under several different flavors of Unix, including Solaris 2 and OSF/1. Green et al. (20) recommended that it be used on machines with a minimum of 3 MBytes of available memory. In practice, we have found that for data sets of 40-50 loci, 6-9 MBytes are necessary. Despite the large memory requirements, CRI-MAP uses a very efficient algorithm for computing likelihoods. It was used recently to produce the EUROGEM maps of the complete human genome (9). The price for efficiency is that less information is used from partially informative meioses than in either LINKAGE or LIPED. Population allele frequencies are not used in determining the relative probabilities of untyped founder genotypes. Some loci in untyped individuals are marked as uninformative and not subject to a full treatment. However, the information loss appears to be small for this kind of study. It uses a different strategy from MAPMAKER in finding the maximum likelihood order.

2.2. CYRILLIC

CYRILLIC (11) originally was developed by Cyril Chapman and is now distributed by Cherwell Scientific (Oxford, UK). It is a Microsoft Windows application that enables the user to input pedigree data in an intuitive, graphical way rather than by using a tabular form.
It can generate data files for the MLINK component of LINKAGE.

2.3. DOLINK

DOLINK is a system for managing pedigree data that can run on DOS, Windows, and Unix systems (see Table 1). It was written by Dave Curtis at St. Mary’s Medical School, London, and can generate data in LINKAGE, KIN, PEDRAW, and CRI-MAP formats. It has a particularly flexible way of organizing information on traits and can cope with liability classes, penetrance matrices, and two-point and multipoint data sets.

Table 1
Software for Genetic Linkage Analysis

| Package | Type | Operating system | Distribution and URL |
|---|---|---|---|
| CRI-MAP | Multipoint map construction with some facility for disease loci | Unix | Phil Green, Collaborative Research Inc. (Bedford, MA) |
| CYRILLIC | Graphical pedigree management with interface to MLINK | Windows | Cherwell Scientific (Oxford, UK) |
| DOLINK | Pedigree management and interface to LINKAGE, CRI-MAP, KIN, and PEDRAW | DOS, Windows, Unix | Dave Curtis, St. Mary’s Medical School (London); ftp://ftp.gene.ucl.ac.uk |
| FASTLINK | Essentially, LINKAGE with performance and other enhancements | DOS, Unix | Alejandro Schaffer, Department of Computer Science, Rice University (Houston, TX); ftp://softlib.cs.rice.edu |
| FASTMAP | Approximation of LINKMAP analyses using two-point LOD scores | DOS, Unix | See DOLINK; ftp://ftp.gene.ucl.ac.uk |
| GAS | Genetic analysis including haplotype sharing methods | DOS, Unix | Alan Young, University of Oxford; ftp://ftp.well.ox.ac.uk |
| HOMOG | Heterogeneity testing | DOS, Unix | Jurg Ott, Columbia University (New York, NY); ftp://york.ccc.columbia.edu |
| HOMOZ | Homozygosity mapping | Unix | ftp://genome.wi.mit.edu |
| IGD/X-PED | Integration of databases and programs for genetic analysis | Unix | Steve Bryant, Imperial Cancer Research Fund (Herts, UK); ftp://genome.dkfz-heidelberg.de |
| KIN | Affected-pedigree-member method of linkage analysis | DOS, Unix | Daniel Weeks, Department of Biomathematics, UCLA School of Medicine (Los Angeles, CA) |
| LINKAGE Program Package (LPP) | Multipoint linkage analysis with risk calculation | DOS, Unix | Pascale Denayrouse, Fondation Jean Dausset (Paris, France) |
| LINKSYS | Data management for LINKAGE and LIPED | DOS | John Attwood, MRC Human Biochemical Genetics Unit, University College London |
| LIPED | Two-point linkage analysis | DOS, Unix | See HOMOG |
| MAP | Multipoint map construction from two-point LOD scores | Unix | Newton Morton, Department of Community Medicine, Southampton General Hospital (Southampton, UK) |
| MAPMAKER | Codominant multipoint map construction in nuclear families | Unix | Mapmaker Distribution, The Lander Lab, Whitehead Institute for Biomedical Research, Nine Cambridge Center (Cambridge, MA); ftp://genome.wi.mit.edu |
| MENDEL, DGENE, and FISHER | General genetic analysis including data management | DOS | Daniel E. Weeks, Wellcome Centre for Human Genetics (Oxford, UK) |
| MULTIMAP | Automated construction of multipoint maps | Unix | Tara Cox Matise, Department of Human Genetics (Pittsburgh, PA); mailto:multimap@genomel.hgen.pitt.edu |
| PATCH | Haplotype deduction | DOS | Ellen M. Wijsman, University of Washington (Seattle, WA) |
| Pedigree/DRAW | Pedigree drawing | Macintosh | Jean W. MacCluer, Southwest Foundation for Biomedical Research (San Antonio, TX) |
| PEDPACK | General pedigree analysis and display | Unix | Alun Thomas, School of Mathematical Sciences, University of Bath (Bath, UK) |
| PEDRAW | Pedigree drawing and display | DOS | ftp://ftp.gene.ucl.ac.uk |
| SIMLINK | Estimating the power of a proposed linkage study | DOS, Unix | Michael Boehnke, University of Michigan (Ann Arbor, MI) |
| SLINK | Pedigree simulation | DOS | Daniel E. Weeks, Wellcome Centre for Human Genetics (Oxford, UK); ftp://york.ccc.columbia.edu |
| XSHELL | Data management and construction of LINKAGE and Pedigree/DRAW data sets | Unix | Stephen P. Bryant, Imperial Cancer Research Fund (Herts, UK); ftp://mahler.leeds.icnet.uk |

2.4. FASTLINK

FASTLINK essentially is a port of LINKAGE, produced by translating the code from Pascal to C, followed by extensive tuning to generate performance improvements of at least one order of magnitude (42,43). As indicated, it is available as C source code for Unix and as executables for DOS, and comes in both fast (requires more memory) and slow (requires less memory) versions. It is undergoing continuous development and has largely solved the LINKAGE porting problem. The ancillary LINKAGE programs (LCP, LRP, and so on) are still required in their original versions. In almost all respects, it behaves identically to LINKAGE to the extent that the various modules (MLINK, LODSCORE, and so on) can be substituted and called from the same scripts generated by LCP.

2.5. FASTMAP

FASTMAP (44) essentially is an approximation of the LINKMAP module of LINKAGE in that it estimates multipoint LOD scores from two-point data. In this respect, it shares many of the advantages of MAP (see Section 2.14.). It is very fast and enables large multiple-locus problems to be performed on a PC, perhaps as a prelude to a full LINKMAP analysis that may take several weeks of computer time.

2.6. Genetic Analysis System (GAS)

GAS has been developed by Alan Young at the University of Oxford. It can perform a variety of analyses, including some based on allele sharing between sibs, in both identical-by-descent (IBD) and identical-by-state (IBS) forms. It can read data in LINKAGE format, and can represent much the same kind of genetic information. It is based on very portable code and is distributed as executables for DOS, Ultrix, VMS, and Sun operating systems.

2.7. HOMOG
HOMOG is available in a variety of related versions, which perform variants of the admixture test for detecting linkage heterogeneity, where a proportion of the families in the sample may be linked to a marker, with the remainder unlinked. More elaborate versions of the program can test each family for linkage to one of two loci, for example. It was written and is distributed by Jurg Ott.
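The idea behind the admixture test can be illustrated with a short Python sketch. It is not HOMOG's code: it fixes the recombination fraction at a single value (HOMOG also maximizes over theta), the per-family LOD scores are invented for illustration, and the function names `admixture_lod` and `best_alpha` are hypothetical. For a linked proportion alpha, each family with LOD score Z contributes log10(alpha * 10^Z + 1 - alpha) to the overall log likelihood ratio.

```python
import math

def admixture_lod(family_lods, alpha):
    """log10 likelihood ratio when a fraction alpha of families is linked.

    family_lods: per-family LOD scores, all evaluated at the same theta.
    """
    return sum(math.log10(alpha * 10.0 ** z + (1.0 - alpha)) for z in family_lods)

def best_alpha(family_lods, steps=100):
    """Grid search over alpha in [0, 1]; returns (alpha_hat, maximized value)."""
    grid = [i / steps for i in range(steps + 1)]
    return max(((a, admixture_lod(family_lods, a)) for a in grid),
               key=lambda pair: pair[1])

# Invented example: three families support linkage, three argue against it.
lods = [1.2, 0.9, 1.5, -0.8, -1.1, -0.6]
alpha_hat, lod_heterogeneity = best_alpha(lods)
lod_homogeneity = admixture_lod(lods, 1.0)  # every family assumed linked

print(f"alpha_hat = {alpha_hat:.2f}")
print(f"LOD under heterogeneity = {lod_heterogeneity:.2f}")
print(f"LOD under homogeneity   = {lod_homogeneity:.2f}")
```

With alpha fixed at 1 the expression reduces to the ordinary sum of family LOD scores, so the gap between the two printed values indicates how much the data favor a mixture of linked and unlinked families.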
2.8. HOMOZ

HOMOZ is aimed at mapping recessive inbred diseases using the homozygosity mapping technique (45). It is an implementation of an algorithm described previously (46). It is supplied as a SUN binary and as source code for several other Unix systems.

2.9. Integrated Genomic Database (IGD/X-PED)

The IGD (36) has a strong linkage analysis component in IGD/X-PED (46a). Several valuable data resources (CEPH, EUROGEM, CHLC, GDB) have been integrated as Resource End Databases (REDs) of IGD. The REDs are combined into a physical entity, the Target End Database (TED), which is accessed by front-end tools such as ACEDB (47) and IGD/X-PED (47a). IGD also supports the management and display of pedigree information in a graphical way, and the generation of data sets for programs such as LINKAGE and CRI-MAP (see Section 2.1.). Current developments include the provision of support for the complete cycle of investigation, from data capture to analysis, and the interpretation and incorporation of results into the database. The project is coordinated by the DKFZ (Heidelberg, Germany) (see Table 1).

2.10. KIN

The theoretical background to KIN is given in Weeks and Lange (27). Building on the sib-pair method, they derived a statistic (a Z-score) that measures the similarity between typed, affected members of a pedigree on the basis of their marker genotypes. The distribution of Z-scores can be derived analytically or by simulation. Z-scores are combined by KIN to give an overall T statistic that approximates a standard normal distribution. The extent to which the distribution of T approximates normality can be ascertained using SIMULF, part of the KIN package. KIN is supplied as source code in IBM and Unix versions of Pascal. Microsoft Pascal can compile the source code and is useful to optimize code against libraries and hardware. SIMULF is written in Fortran. Executable code is provided for IBM PC compatibles with a numeric coprocessor. Certain types of families cause the program to crash, and these will have to be detected and removed by trial and error. The T statistic can be compared directly against a standard normal distribution using a one-tailed test. KIN does not use as much of the information available as would be used by LINKAGE and does not enable estimation of the recombination fraction. It is, however, extremely useful as a screening tool for linkage since it does not require any assumptions to be made about the transmission model. KIN ports easily to Unix and is available at the UK HGMP RC and elsewhere.
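The final step described above, comparing T against a standard normal distribution with a one-tailed test, can be sketched in Python as follows. This is an illustration only, not KIN's implementation: it assumes each pedigree's Z-score has already been standardized to mean 0 and variance 1 under no linkage, and the weighting scheme and the function names `combine_z_scores` and `one_tailed_p` are hypothetical.

```python
import math

def combine_z_scores(z_scores, weights=None):
    """Weighted sum of standardized Z-scores, rescaled to unit variance.

    Assumes independent pedigrees; equal weights are used if none are given.
    """
    if weights is None:
        weights = [1.0] * len(z_scores)
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    return numerator / denominator

def one_tailed_p(t):
    """Upper-tail probability P(Z >= t) for a standard normal Z."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

# Invented Z-scores for four pedigrees.
t_stat = combine_z_scores([1.0, 1.0, 1.0, 1.0])
print(f"T = {t_stat:.2f}, one-tailed p = {one_tailed_p(t_stat):.4f}")
```

Only large positive values of T count as evidence for linkage, which is why the upper tail alone is used.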
2.11. LINKAGE Program Package (LPP)
The LINKAGE Program Package (LPP) is a general-purpose package for multipoint linkage analysis, including risk calculation (18,48-50). It includes modules for managing the data, preparing it for analysis, conducting the analysis, and interpreting the results. Markers are divided into four types that are sufficient to handle most kinds of genetic system likely to be encountered. These include codominant RFLPs, multiallelic microsatellites, dominant-recessive traits with full or partial penetrance, and general quantitative phenotypes. Data are prepared as text files, most easily with a data management aid such as LINKSYS (39), DOLINK, or XSHELL. Versions of the programs exist for general (extended pedigrees) and CEPH-style (nuclear families) linkage analysis. LPP is distributed as Pascal (the core analytical programs) and C (the shell programs LCP and LRP) source code. The source code is reasonably portable and is available in Unix, DOS, and generic formats. Source for LCP and LRP normally is not distributed but is available on request. Executables are provided for DOS or Unix on a range of media, as required. The programs perform better with small numbers of markers (up to five for the general programs, substantially more for CEPH-style). MLINK is used to construct two-point LOD score tables, LODSCORE for iterative estimation of theta, ILINK for multipoint maps, and LINKMAP to insert markers into larger multipoint maps. It may be more practical to use MAPMAKER or CRI-MAP for large problems involving reference families. Separating the sexes is supported and interference can be accommodated. Data files are identical across all operating systems and all current versions. LINKAGE Version 5.1 has been converted to C code and substantially re-engineered (42,43). See Section 2.4. for details.

2.12. LINKSYS

LINKSYS was the first generally available package for the management of genetic data to be used in conjunction with the analytical packages LINKAGE and LIPED (39). It was written making heavy use of proprietary source code toolboxes, which are DOS-specific. It is unlikely that the present version could be ported to any other operating system, and interested users who require this are advised to use the DOLINK package, which can import LINKSYS data. It manages data on pedigrees, markers, and phenotypes, and includes export facilities to LIPED and LINKAGE. It performs as a shell to these systems, as well as to KERMIT, enabling the user to prepare and execute analyses without being faced with a DOS command line. LINKSYS is written in Turbo Pascal and distributed as executable code. Also supplied is a full-screen version of LCP (see Section 2.11.), which uses actual locus names instead of symbols.
2.13. LIPED

LIPED is a program for computing two-point LOD scores in general pedigrees (17,51). It handles markers using a phenotype-genotype matrix that is sufficiently general to code RFLPs, microsatellites, and dominant-recessive traits. It handles age-dependent penetrance and division into liability classes. It also has the ability to separate the sexes. LIPED is distributed as FORTRAN source code. The code is reliable and easy to port to Unix or DOS. It can be obtained freely from the distributor or third parties, with restrictions (contact the distributor). Data files are ASCII text and can be generated easily using LINKSYS (39).

2.14. MAP

MAP uses two-point LOD scores to construct a multilocus map (52). It can handle interference, but the main advantage is that LOD scores from different sources can be combined irrespective of whether the raw data are available or not. It therefore follows an entirely different philosophy from CRI-MAP, MAPMAKER, or LINKAGE. The two-point LOD scores from these systems can be amalgamated and used as input to MAP. MAP is computationally more efficient than comparable programs, and maps built with it have been shown to converge to those constructed by other software.

2.15. MAPMAKER

MAPMAKER is an interactive package for the construction of codominant multipoint maps from nuclear CEPH-style families and F2 crosses (53). It is not a general-purpose linkage package. It cannot be used for disease mapping and cannot be applied to extended pedigrees. It is distributed in C source code form. Executable versions are distributed for the Unix operating system. Utilities are provided to ease installation. Unix versions have a makefile to assist installation. The version that should be used is version 2. The latest version (version 3) does not support the analysis of CEPH-style families. MAPMAKER uses an efficient algorithm for the computation of likelihoods based on work done by Lander and Green (19). Their Expectation-Maximization procedure requires fewer iterations so that a smaller number of computations are necessary to find the set of thetas giving the map with the maximum likelihood. Even so, MAPMAKER uses a vast amount of CPU time. To put together a chromosome map of 150 markers over 50-60 families as found typically within the CEPH collaboration, several hundred hours of CPU time can be involved. It is possible to construct files for use in batch or background jobs, although then all the interactive facilities of MAPMAKER are lost.
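Several of the packages above work with two-point LOD scores (LIPED in Section 2.13. computes them, and MAP in Section 2.14. combines them across sources). The Python sketch below illustrates only the simplest case and is not taken from any of these programs: it assumes phase-known, fully informative meioses, for which the LOD score at recombination fraction theta is log10 of theta^k (1 - theta)^(n - k) / 0.5^n with k recombinants in n meioses, and it assumes the sources are independent, so that their LOD scores at the same theta can be added. The counts and the function names `two_point_lod` and `combined_lod` are invented for illustration.

```python
import math

def two_point_lod(recombinants, meioses, theta):
    """LOD score at theta for phase-known, fully informative meioses."""
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must lie in (0, 0.5]")
    k, n = recombinants, meioses
    log_l_theta = k * math.log10(theta) + (n - k) * math.log10(1.0 - theta)
    log_l_null = n * math.log10(0.5)  # free recombination, theta = 0.5
    return log_l_theta - log_l_null

def combined_lod(tables):
    """Add LOD tables from independent sources, theta by theta.

    tables: list of dicts mapping theta -> LOD, all on the same theta grid.
    """
    return {t: sum(table[t] for table in tables) for t in tables[0]}

grid = [0.01, 0.05, 0.1, 0.2, 0.3, 0.4]
study_a = {t: two_point_lod(1, 10, t) for t in grid}  # invented counts
study_b = {t: two_point_lod(2, 12, t) for t in grid}  # invented counts
total = combined_lod([study_a, study_b])
best_theta = max(total, key=total.get)

for t in grid:
    print(f"theta = {t:<4}  LOD = {total[t]:6.2f}")
print("grid maximum at theta =", best_theta)
```

Because only the LOD tables are added, the second study's raw genotypes are never needed, which is the property that lets published scores be pooled.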
2.16. MENDEL, DGENE, and FISHER

MENDEL is a general modeling tool and can be used for segregation analysis, linkage analysis, and risk calculation without further modification (54). It is supplied as a mixture of source and object code for the IBM PC and compatibles. DGENE manages data for export to MENDEL and FISHER. It is supplied as a DOS executable, originally written in dBASE III+. FISHER is designed to aid the epidemiological investigation of quantitative traits. It is supplied as a mixture of source and object code for the IBM PC and compatibles.

2.17. MULTIMAP

MULTIMAP is a system based on the programming language LISP that uses CRI-MAP (see Section 2.1.) as an engine for computing likelihoods (5). It models the heuristics that analysts use when constructing large genetic maps with likelihood methods. It can also perform internal checks on the consistency of the data by detecting probable errors.

2.18. PATCH

PATCH is a program for deducing haplotypes from genotype data (55). It is distributed as C source code with DOS executables. It is divided into modules for data collection and management, printing, and haplotype deduction. The code is very portable and probably can be compiled on most other architectures. PATCH can be obtained from the distributor or from a third party on condition that the user register with E. M. Wijsman for bug fixes and other upgrades.

2.19. Pedigree/DRAW

Pedigree/DRAW is a general-purpose pedigree-drawing package for the Apple Macintosh (40). It can draw genealogies of almost any size, including those with some degree of inbreeding. It is used in conjunction with ARBOR, a program specifically designed to process complex, highly inbred pedigrees into a form that can be displayed by Pedigree/DRAW. Genealogies can be sent to a PostScript printer or file, or imported into MacDraft or MacDraw for incorporation into figures. It is distributed in executable form with plenty of example pedigrees and a good user guide.
Currently it is not available for any other operating system. It is obtainable without charge by writing to the distributor.

2.20. PEDPACK

PEDPACK is a complete Unix environment for pedigree analysis (56,57). It offers facilities for checking, setting up, and editing pedigrees and genetic traits; probability and likelihood calculations; finding peeling sequences; preparing data for PAP, LINKAGE, and CRI-MAP; computing gene extinction probabilities by peeling and simulation; and drawing marriage node graphs. The user requires some familiarity with the Unix file structure and command language. It is distributed in a form suitable for SUN workstations running SunOS.

2.21. PEDRAW

PEDRAW is one of the family of programs distributed by Dave Curtis, and is a DOS application that works closely together with DOLINK. DOLINK can generate files in PEDRAW format. PEDRAW can produce high-quality drawings of the pedigree structure in a similar way to Pedigree/DRAW.

2.22. SIMLINK

SIMLINK is designed to estimate the power of a proposed linkage study based on a set of pedigrees of known structure and a genetic trait of interest (58). It can handle both qualitative and quantitative traits as well as sex- and age-dependent penetrance. It is distributed as a mixture of source and object code as well as an executable DOS program. It is based on a greatly extended version of MENDEL (see Section 2.16.). It requires 640 KBytes of RAM and a numerical coprocessor. It is available without charge from the distributor.

2.23. SLINK

SLINK is a general simulation program for linkage analysis. It uses Monte Carlo methods based on an algorithm described in ref. 35. It addresses questions similar to those posed by SIMLINK, namely the estimation of statistical power, which depends on the structure of the family, the heterozygosity of the linked marker, the genetics of the trait, and the distribution of affecteds. It uses an efficient method of sampling random marker segregation based on a fixed distribution of affected individuals in a family. The program was written by Daniel Weeks and is distributed by Jurg Ott.

2.24. XSHELL

XSHELL is a system for managing pedigree data for the LINKAGE, Pedigree/DRAW, and KIN packages (7). It was built using the tools created with the Oracle Database Management System. It has been run successfully under SUN Solaris and Oracle version 6.
XSHELL is highly portable, but sites will need an Oracle licence. PC users will need expanded memory. At present it cannot be ported to the Apple Macintosh.

3. Additional Servers and Resources

In this section, collections of software and database resources that are accessible via the Internet are discussed (see Table 2).

Table 2
General Internet Resources of Relevance to Genetic Mapping

| URL address | Description |
|---|---|
| http://www.informatics.jax.org/mgd.html | Mouse Genome Database |
| http://www.chlc.org/ | Cooperative Human Linkage Center (CHLC), USA |
| http://www-hgc.lbl.gov/GenomeHome.html | Lawrence Berkeley Laboratory (LBL) Human Genome Center Web Server, USA |
| http://gdbwww.gdb.org/ | Human Genome Data Base (GDB) |
| http://gdbwww.gdb.org/gdbdoc/topq.html | GDB Hypertext Browser |
| http://gdbwww.gdb.org/omimdoc/omimtop.html | On-Line Mendelian Inheritance in Man (OMIM) |
| http://gc.bcm.tmc.edu:8088/ | Baylor College of Medicine Genome Center, USA |
| http://gc.bcm.tmc.edu:8088/bioceph_genethon-interface.html | YAC data searches at Genethon, France |
| http://www.genethon.fr/genethon_en.html | Genethon, France |
| http://moulon.inra.fr:8001/acedb/rgd.html | Integrated Genomic Database (IGD), Moulon server |
| http://ws4.niai.affrc.go.jp/jgbase.html | Japan Animal Genome Database |
| http://moulon.inra.fr:8001/acedb/acedb.html | C. elegans Data Base ACeDB at Moulon |
| http://helix.rice.edu/index.html | W. M. Keck Center for Computational Biology, Rice University, USA |
| http://golgi.harvard.edu/biopages.html | The World Wide Web Virtual Library: Biosciences |
| gopher://gopher.bio.net/ | BIOSCI/bionet Electronic News |
| gopher://gopher.bio.net/11/GENETIC-LINKAGE | Genetic Linkage Bulletin Board |
| http://golgi.harvard.edu/sequences.html | The World Wide Web Virtual Library: Biochemistry and Molecular Biology |
| http://www.ncbi.nlm.nih.gov/ | The National Center for Biotechnology Information (NCBI), USA |
| http://cuiwww.unige.ch/w3catalog | CUI W3 Catalog |
| http://www.cs.colorado.edu/home/mcbryan/WWWW.html | WWWW, the World Wide Web Worm |
| http://www.gdb.org/Dan/DOE/whitepaper/contents.html | US Department of Energy (DOE) White Paper on Bio-Informatics |
| http://gdbwww.gdb.org/plan/toc.html | GDB Five Year Work Plan |
| http://www.nlm.nih.gov/ | The National Library of Medicine (NLM), USA |
| http://www.nih.gov/ | National Institutes of Health (NIH), USA |
| http://www.nih.gov/molbio | NIH Molecular Biology |
| http://www.gdb.org/Dan/DOE/intro.html | Primer on Molecular Genetics (U.S. Department of Energy) |
| http://diamond.gene.ucl.ac.uk | Galton Laboratory, University College London, UK |
| http://www.sanger.ac.uk/ | Sanger Centre, Hinxton, Cambs., UK |
| ftp://genome.dkfz-heidelberg.de/ | DKFZ, Heidelberg, Germany |
| http://www.hgmp.mrc.ac.uk | UK Medical Research Council Human Genome Mapping Project Resource Centre (UK HGMP-RC) |

For this purpose, it is necessary to introduce the term “Uniform Resource Locator,” or URL. This permits the construction of addresses that define a host, directory, or file that is served by one of several different protocols (e.g., ftp, gopher, http). Some examples of this kind of address follow:

ftp://ftp.cephb.fr/pub/ceph_genotype_db
gopher://gopher.bio.net/11/GENETIC-LINKAGE
http://www.sanger.ac.uk/
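The three example addresses can also be taken apart programmatically. The following Python sketch uses the standard library function `urllib.parse.urlsplit` to separate each address into protocol, server name, and path; the helper name `describe` is hypothetical, and the sketch only parses the strings without contacting any server.

```python
from urllib.parse import urlsplit

examples = [
    "ftp://ftp.cephb.fr/pub/ceph_genotype_db",
    "gopher://gopher.bio.net/11/GENETIC-LINKAGE",
    "http://www.sanger.ac.uk/",
]

def describe(url):
    """Return (protocol, server name, path) for a URL string."""
    parts = urlsplit(url)
    return parts.scheme, parts.hostname, parts.path

for url in examples:
    protocol, server, path = describe(url)
    print(f"{protocol:7} {server:20} {path}")
```

The parser reports the path with its leading slash, so the gopher example yields `/11/GENETIC-LINKAGE`.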
The first term (e.g., http) defines the protocol, followed by a colon (:). The characters // prefix the name of the server (e.g., ftp.cephb.fr), which optionally may be followed by a path (e.g., 11/GENETIC-LINKAGE) that defines the location of a directory or file, following conventional Unix syntax. The collection of Internet resources accessible in this fashion has been termed the World Wide Web (WWW). Table 2 presents a selection of URL addresses of hosts, directories, and files that will be of interest to the linkage analyst. The program Netscape can be used to access information from http, gopher, or ftp servers; the ftp program can be used only to communicate with ftp servers, but is more powerful in other ways. The selection of resources given is no guarantee that the servers will be maintained or even exist, since the World Wide Web is an ever-shifting sea, but they should be sufficient to enable users rapidly to construct much larger collections of addresses corresponding to their own special interests, since most servers provide links to other, related sites.

Acknowledgment

The Imperial Cancer Research Fund provided support during the preparation of this manuscript.

References

1. Bryant, S. (1989) Software for genetic linkage analysis, in Methods in Molecular Biology, vol. 9: Protocols in Human Molecular Genetics (Mathew, C., ed.), Humana, Totowa, NJ, pp. 403-418.
1a. Bryant, S. (1996) Software for genetic linkage analysis. An update. Mol. Biotech. 5, 49-61.
2. Risch, N. (1990) Linkage strategies for genetically complex traits. I. Multilocus models. Am. J. Hum. Genet. 46, 222-228.
3. Risch, N. (1990) Linkage strategies for genetically complex traits. II. The power of affected relative pairs. Am. J. Hum. Genet. 46, 229-241.
4. Ward, P. J. (1993) Some developments on the affected-pedigree-member method of linkage analysis. Am. J. Hum. Genet. 52, 1200-1215.
5. Matise, T. C., Perlin, M., and Chakravarti, A. (1994) Automated construction of genetic linkage maps using an expert system (MULTIMAP): a human genome linkage map. Nat. Genet. 6, 384-390.
6. Davies, J. L., Kawaguchi, Y., Bennett, S. T., Copeman, J. B., Cordell, H. J., Pritchard, L. E., Reed, P. W., Gough, S. C. L., Jenkins, S. C., Palmer, S. M., Balfour, K. M., Rowe, B. R., Farrall, M., Barnett, A. H., Bain, S. C., and Todd, J. A. (1994) A genome-wide search for human type 1 diabetes susceptibility genes. Nature 371, 130-136.
7. Bryant, S. (1994) Genetic linkage analysis, in Guide to Human Genome Computing (Bishop, M. J., ed.), Academic, London, pp. 59-110.
8. Krol, E. (1992) The Whole Internet User’s Guide and Catalog. O’Reilly and Associates, Sebastopol, CA.
9. Spurr, N. K., Bryant, S. P., Attwood, J., Nyberg, K., Cox, S. A., Mills, A., Barns, R., Warne, D., Cullin, L., Povey, S., Sebaoun, J.-M., Weissenbach, J., Cann, H. M., Lathrop, M., Dausset, J., Marcadet-Troton, A., and Cohen, D. (1994) European Gene Mapping Project (EUROGEM): genetic maps based on the CEPH reference families. Eur. J. Hum. Genet. 2, 193-252.
10. Rysavy, F. (1994) The computing environment, in Guide to Human Genome Computing (Bishop, M. J., ed.), Academic, London, pp. 1-37.
11. Chapman, C. J. (1990) A visual interface to computer programs for linkage analysis. Am. J. Med. Genet. 36, 155-160.
12. Rysavy, F. R., Bishop, M. J., Gibbs, G. P., and Williams, G. W. (1992) The UK Human Genome Mapping Project online computing service. Comput. Applic. Biosci. 8, 149-154.
13. Ott, J. (1991) Analysis of Human Genetic Linkage, rev. ed., Johns Hopkins University Press, Baltimore.
14. Elston, R. C. and Stewart, J. (1971) A general model for the genetic analysis of pedigree data. Hum. Hered. 21, 523-542.
15. Morton, N. (1955) Sequential tests for the detection of linkage. Am. J. Hum. Genet. 7, 277-318.
16. Lange, K. and Elston, R. C. (1975) Extensions to pedigree analysis. I. Likelihood calculation for simple and complex pedigrees. Hum. Hered. 25, 95-105.
17. Ott, J. (1974) Estimation of the recombination fraction in human pedigrees: efficient computation of the likelihood for human linkage studies. Am. J. Hum. Genet. 26, 588-597.
18. Lathrop, G. M. and Lalouel, J. M. (1984) Easy calculations of lod scores and genetic risks on small computers. Am. J. Hum. Genet. 36, 460-465.
19. Lander, E. S. and Green, P. (1987) Construction of multilocus genetic linkage maps in humans. Proc. Natl. Acad. Sci. USA 84, 2363-2367.
20. Green, P., Falls, K., and Crooks, S. (1989) Documentation for CRI-MAP, version 2.4: available from P. Green.
21. Donis-Keller, H., Green, P., Helms, C., Cartinhour, S., Weiffenbach, B., Stephens, K., Keith, T. P., Bowden, D. W., Smith, D. R., Lander, E. S., Botstein, D., Akots, G., Rediker, K. S., Gravius, T., Brown, V. A., Rising, M. B., Parker, C., Powers, J. A., Watt, D. E., Kauffman, E. R., Bricker, A., Phipps, P., Muller-Kahle, H., Fulton, T. R., Ng, S., Schumm, J. W., Braman, J. C., Knowlton, R. G., Barker, D. F., Crooks, S. M., Lincoln, S. E., Daly, M. J., and Abrahamson, J. (1987) A genetic linkage map of the human genome. Cell 51, 319-337.
280
Bryant
22. Gyapay, G., Morisette, J., Vignal, A., Dib, C., Ftzames, C , Millasseau, P , Marc S , Bernardt, B , Lathrop, M , and Weissenbach, J (1992) The 1993-94 Genethon human genetic linkage map. Nature Genet. 7,246339. 23 Buetow, K. H., Weber, J L., Lundwigsen, S., Scherpbier-Heddema, T , Duyk, G. M., Sheffield, V. C., Wang, Z., and Murray, J. C. (1994) Integrated human genome-wide maps constructed using the CEPH reference panel Nature Genet 6,391-393. 24. Suarez, B. K. (1978) The affected sib pair IBD dtstrtbutlon for HLA-linked drsease susceptibility genes Tissue Antigens 12, 87-93. 25. Lange, K. (1986) A test statistic for the affected-sib-set method. Ann Hum
Genet 50,283-290. 26. Lange, K. (1986) The affected sib-pan method using identity by state relations. Am J Hum Genet 39(l), 148-150. 27 Weeks, D. E and Lange, K. (1988) The affected-pedigree-member method of linkage analysis Am J Hum Genet 42,3 15-326 28. Karrgl, G. (1981) A recursive algorithm for the calculation of identity coefficients. Ann. Hum. Genet. 45, 299-305. 29. Weitkamp, L. R. and Lewis, R. A. (1989) PEDSCORE analysts of identical by descent (IBD) marker allele dtstrtbuttons m affected family members Cytogenet Cell. Genet 51, 1105,1106. 30. Olson, J. M. and WiJsmann, E. (1993) Linkage between quantitative trait and marker loci: methods using all relative pans. Genet Epzdemzol. 10,525-538. 31. Whittemore, A. S. and Halpern, J. (1994) A class of tests for linkage using affected pedigree members Biometrics 50, 11 LX-127 32 Edwards, A. W. F. (1988) Computers and genealogies. Bzol Sot. 5,73-81. 33. MacCluer, J. W., VandeBerg, J. L., Read, B., and Ryder, 0. A. (1986) Pedigree analysis by computer simulatton. Zoo Bzol 5, 147-160 34 Boehnke, M. (1986) Estimating the power of a proposed linkage study: a practtcal computer simulation approach. Am. J Hum. Genet 39,5 13-527 35. Ott, J. (1989) Computer simulation methods in linkage analysis. Proc Nat1 Acad. Set. USA 86,4175-4178. 36. Ritter, O., Kocab, P., Senger, M., Wolf, D., and Suhai, S. (1994) Prototype implementation of the Integrated Genomic Database. Comput Boomed Res 27, 97-115. 37. Bishop, M. J. (ed.) (1994) Guide to Human Genome Computing, Academic, London.
38. Prokosch, H. U., Seuchter, S. A., Thompson, E. A., and Skolnick, M. (1989) Applying expert system techniques to human genetics Comput Biomed Res 22, 234-247. 39. Attwood,
J. and Bryant, S. (1988) A computer program to make analysis with LIPED and LINKAGE easier to perform and less prone to input errors Ann. Hum. Genet. 52,259. 40. Mamelka, P. M , Dyke, B., and MacCluer, J. W. (1987) Pedtgree/Draw for the Apple Macintosh. Department of Genetics, Southwest Foundation for Biomedical Research, San Antonio, TX.
Linkage Analysis Software
281
41. Weaver, R., Helms, C., Mishra, S. K., and Donis-Keller, H. (1992) Software for analysts and manipulation of genetic linkage data. Am. J. Hum. Genet. 50, 1267-1274. 42. Cottingham, R. W., Jr., Idury, R. M., and Schaffer, R. A. (1993) Faster sequential genetic linkage computations. Am. J, Hum. Genet 53,252-263. 43. Schaffer, A. A., Gupta, S. K., Shrnam, K., and Cottingham, R. W , Jr. (1994) Avoiding recomputatton in linkage analysts. Hum Hered. 44,225-237. 44. Curtis, D and Gurling, H. (1993) A procedure for combmmg 2-point lod scores mto a summary multipoint map. Hum Hered 43, 173-185. 45. Kruglyak, L , Daly, M., and Lander, E. (1995) Rapid multipomt linkage analysts of recessive traits in nuclear families, including homozygosity mapping. Am J Hum. Genet 56,5 19-527. 46. Lander, E. and Botstem, D. (1987) Homozygosity Mapping: a way to map human recessive traits with the DNA of inbred children. Sczence 236, 1567-l 570 46a. Bryant, S. P. (1996) The integrated genomtc database (IGD). enhancing the productivity of gene mapping proJects. Proceedings of the 1996 Internatzonal Symposium of Theoretical and Computational Genome Research, Heidelberg, Germany, Plenum, New York 47
Dunham, I., Durbin, R , Thierry-Mieg, J , and Bentley, D (1994) Physical mapping proJects and ACEDB, in Guide to Human Genome Computmg (Bishop, M J., ed.), Academic, London, pp. 111-158.
47a. Spmdou, A., Spurr, N. K., and Bryant, S. (1996) IGD/X-PED: a system for the management and graphical display of pedigree data Am J Hum Genet (m press). 48. Lathrop, G. M. and Lalouel, J. M. (1988) Efficient computattons m multtlocus linkage analysis. Am J. Hum Genet. 42,498-505 49. Lathrop,
multilocus
G M., Lalouel,
linkage
J. M., Luller,
C., and Ott, J. (1984) Strategies
analysts m humans. Proc
3443-3446. 50. Lathrop, G. M., Lalouel, J. M., Julier, C., and Ott, J. (1985) Multilocus
5 1.
52.
53.
54. 55.
for
Nat1 Acad Scz USA 81,
linkage analysis in humans. detection of linkage and estimation of recombinatron. Am J Hum Genet 37,482-498 Hodge, S. E., Morton, L. A., Tideman, S., Kidd, K K , and Spence, M A. (1979) Age-of-onset correction available for linkage analysis (LIPED). Am J Hum Genet. 31,761,762 Morton, N. E. and Andrews, V. (1989) MAP, An expert system for multiple pairwise linkage analysis. Ann Hum. Genet. 53,263-269. Lander, E. S., Green, P., Abrahamson, J., Barlow, A., Daly, M. J., Lincoln, S. E., and Newburg, L. (1987) MAPMAKER. an interacttve computer package for constructing genetic linkage maps of experimental and natural populations. Genomzcs 1,174-181 Lange, K., Weeks, D., and Boehnke, M. (1988) Programs for pedigree analysis-Mendel, Fisher and Dgene. Genet Epidemiol 5(6), 471,472. WiJsman, E. M. (1987) A deductive method of haplotype analysis in pedigrees Am. J. Hum. Genet 41,356-373.
282
Bryant
56 Thomas, A (1987) Pedpack user’s manual Techmcal Report No. 99. Department of Statistics, GN-22, University of Washington, Seattle, WA. 57. Thomas, A. (1987) Pedpack manager’s manual Technical Report No. 100. Department of Statistics, GN-22, Umverslty of Washington, Seattle, WA. 58 Ploughman, L M and Boehnke, M. (1989) Estlmatmg the power of a proposed linkage study for a complex gene& trait. Am J Hum Genet 44(4), 543-55 1.
19
Exon Detection by Similarity Searches
Jean-Michel Claverie
1. Introduction
Other chapters of this volume have presented the various experimental methods (mainly exon trapping and recombination-based and hybridization-based approaches) used for the identification of transcribed sequences within cloned genomic fragments. None of those methods requires detailed sequence information on the genomic region of interest. However, since generating large genomic sequences is becoming more routine, identifying transcribed regions by computer analysis of large genomic sequences (i.e., “software trapping”) is also becoming a viable alternative. After an overview of the various computational methods at hand, this chapter focuses on the use of database similarity searches for the identification of exons in mammalian genomes.

1.1. Basic Statistics on Exons
In contrast with bacteria, yeast, or invertebrates, protein-encoding genes in vertebrates are fragmented into many small exons often spread across hundreds of kilobases. This implies that, most of the time, only a small fraction of the entire sequence of a gene is available for the detection of the transcribed regions. The detection of individual exons is thus the primary task of all computational methods. Internal coding exons (i.e., exons whose entire sequence is translated into protein) obey three constraints:
1. Their 5’-end must be flanked by an acceptor splice site resembling the consensus yyyyyyyyn(c/t)AG;
2. Their 3’-end must be flanked by a donor splice site resembling the consensus agGT(a/g)ag (see ref. 1 for a review on splice site consensus); and
3. There must be at least one open reading frame (ORF) from the 5’- to the 3’-extremity.
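The three constraints listed above can be sketched as a simple test on a candidate segment. This is a minimal illustration only: it checks just the obligatory AG and GT dinucleotides rather than the full consensus, and the function names and the 0-based, half-open coordinates are assumptions made for the example, not part of any published program.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def has_open_frame(exon: str) -> bool:
    """True if at least one of the three reading frames contains no stop codon."""
    for frame in range(3):
        codons = (exon[i:i + 3] for i in range(frame, len(exon) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            return True
    return False

def is_candidate_internal_exon(seq: str, start: int, end: int) -> bool:
    """Test seq[start:end] (0-based, half-open; hypothetical convention).

    Only the obligatory dinucleotides are checked: AG immediately upstream
    of the exon (acceptor) and GT immediately downstream (donor).
    """
    seq = seq.upper()
    acceptor_ok = start >= 2 and seq[start - 2:start] == "AG"
    donor_ok = seq[end:end + 2] == "GT"
    return acceptor_ok and donor_ok and has_open_frame(seq[start:end])
```

For example, in `"TTTTTTTTCAG" + "ATGGCCAAG" + "GTAAGT"` the middle segment (positions 11 to 20) satisfies all three conditions. A realistic scan would also score the flanking positions against the consensus and allow a limited number of mismatches.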
From: Methods in Molecular Biology, Vol. 68: Gene Isolation and Mapping Protocols. Edited by J. Boultwood. Humana Press Inc., Totowa, NJ
There are three other types of exons: 5’-terminal exons (not flanked by an acceptor site and not encompassing a full ORF); internal, but not fully coding exons (two sites, not encompassing a full ORF); and 3’-terminal exons (not flanked by a donor site, and not encompassing a full ORF).

In a purely “signal-based” context, one would attempt to detect internal exons solely on the basis of the presence of:
1. An acceptor splice site;
2. A donor splice site; and
3. The absence of stop codons in between (in at least one reading frame).

The candidate exons may be ranked according to the goodness-of-fit to the splice site consensus (2,3) and using a minimal requirement for the exon size. Although such simple methods are relatively successful for analyzing the compact genomes of yeast or C. elegans, it is amazing how little information they actually bring to the analysis of mammalian genomes. There are three main reasons for this failure:
1. The occurrence of splice site-like sequences is by no means limited to actual intron/exon boundaries;
2. Used and nonused splice sites frequently exhibit the same level of resemblance with the consensus; and
3. The occurrence of internal coding exons of a given length in biological sequences is not different from that in random sequences.

Those three points are readily illustrated using real genomic data: a mouse 94-kb contig (Simmler et al., in press), and a human 67-kb contig, encompassing two internal exons of the gene responsible for Kallmann syndrome (4). The nucleotide compositions of the mouse and human contigs are (A: 30%, T: 27.5%, C: 21%, G: 21.5%) and (A: 30%, T: 31%, C: 19.5%, G: 19.5%), respectively; the two contigs are derived from loci with no homology. In the mouse and human contigs, respectively, 1147 and 708 occurrences of acceptor sites are found when using a stringency such as yynyyynyn(c/t)AG, where two positions (except the obligatory AG) can differ (noted “n”) from the consensus (cytosine or thymine). At a stringency such as anGT(a/g)ng, where two positions (except the obligatory GT) are allowed to vary, 817 and 559 occurrences of donor sites are found in the mouse and human contigs, respectively. Those numbers are in proportion with the sequence length (approx 11.4 acceptors/kb and 8.5 donors/kb for both sequences). Most likely, not more than a few among those hundreds of occurrences delineate actual exons.

Similarly, the average length of vertebrate internal coding exons (~150 nucleotides [5,6]) falls well within the expected distribution of the distance between two successive random occurrences of stop codons. Because there are 3 stop codons among a total of 64, the average statistical distance between them in a given reading frame is ~60 nucleotides (64/3, or about 21 codons). This is well verified experimentally. Looking for ORFs of all sizes in the mouse 94-kb and human 67-kb sequences, the author found an average ORF size of 57 (SD = 58.5) and 55 (SD = 57.5), respectively. However, the cumulative frequency distribution (Fig. 1) shows that much longer ORFs are usual: 341 and 232 ORFs with size ≥150 are found in the mouse and human contigs, respectively. The longest ORF span is 615 nucleotides in the mouse 94-kb contig and 690 nucleotides in the human 67-kb contig.

[Figure 1. Top panel: number of ORFs vs. minimal ORF size for the mouse 94-kb and human 67-kb contigs. Bottom panel: ORFs per kb vs. minimal ORF size for mouse, human, and random/theory. X-axis: ORF min size, 0-700.]

Fig. 1. Size distribution of ORFs in two unrelated long mammalian genomic contigs. ORFs are any segment flanked by two stop codons (TAA, TAG, TGA) in the same frame. Overlapping ORFs in different frames are counted. The absolute number of ORFs (top) larger than or equal to a given size, and their frequency per kilobase (bottom) are shown for a mouse 94-kb contig and a human 67-kb contig. The analysis of a single strand (in three reading frames) is shown. For sizes <350, these two nonhomologous regions of the human and mouse genomes exhibit remarkably similar ORF distributions/kb. Those distributions are also very close to the prediction of a random nucleotide sequence model.

Figure 1 shows the cumulative frequencies/kb of ORFs greater than (or equal to) a given size computed for each contig. The observed distributions are very similar for both genomic sequences, with exactly one ORF ≥225 nucleotides/kb. The distributions are best approximated by:

$$F_{kb}(\mathrm{ORF} \geq L) = 48.3 \times 10^{-(L/136)} \tag{1}$$

for the mouse contig and

$$F_{kb}(\mathrm{ORF} \geq L) = 44.8 \times 10^{-(L/\text{1?0})} \tag{2}$$

for the human contig, where F_kb is the frequency per kb of ORFs of length ≥ L, and L is the length in nucleotides. Those observed distributions are also remarkably similar to the theoretical distribution (Fig. 1) corresponding to a random nucleotide sequence with an equal frequency for A, C, T, and G (25%), and neglecting the interference of overlapping reading frames on the occurrence of codons:

$$F_{kb}(\mathrm{ORF} \geq L) = (3 \times 1000/64) \times e^{-(L/64)} = 46.875 \times 10^{-(L/147)} \tag{3}$$

This analysis indicates clearly that putative splice sites and ORFs occur randomly in natural sequences. In the context of a signal-based approach to detect genes, the last hope is that the joint distribution of the three required components of internal coding exons (acceptor, ORF, donor) is somehow characteristic and allows the actual exons to stand out from random occurrences. Figure 2 shows the distributions of candidate internal exons in the mouse 94-kb and human 67-kb contigs. The size ranges are 5-585 and 5-319 for the mouse and human sequences, respectively. Again, the frequencies per kb of candidate exons greater than a given size are very close (Fig. 2, bottom) and, for exon size <250, are well approximated by:

$$F_{kb}(\mathrm{Exon} \geq L) = 9 \times 10^{-(L/153)} \tag{4}$$

This formula tells researchers to expect one random candidate exon with size ≥150 (the mean size of true internal exons [5,6]) in every kb of genomic sequence. According to current estimates, actual exons would rather occur every 10 kb. They are thus largely buried among random occurrences. Of course, approximately half of the proven exons have a size smaller than the average 150 nucleotides, making their detection even more problematic.

[Figure 2. Top panel: number of candidate exons vs. minimal exon size for the mouse 94-kb and human 67-kb contigs. Bottom panel: candidate exons per kb vs. minimal exon size for mouse and human. X-axis: exon min size, 1-601.]

Fig. 2. Size distribution of candidate internal protein-coding exons in two long mammalian genomic contigs. Candidate internal exons are segments flanked by reasonable splice sites (see text) and encompassing at least one ORF of a given minimal size. Overlapping exons are counted, but only the longest possible exon from each acceptor site is selected. The absolute number of candidates (top) and the density of candidates per kilobase (bottom) are shown. The candidate exon densities/kb for these two nonhomologous regions of the human and mouse genomes are remarkably similar for exon sizes ≤220.

As a last resort, one might try to enrich the statistics in actual exons by increasing the required stringency on the splice signals. This is not wise, since the stringency used in Fig. 2 is quite typical of the average quality of proven splice sites. Actually, higher stringencies make things worse by eliminating actual exons. In the mouse 94-kb sequence, 15 occurrences of the perfect consensus yyyyyyyyn(c/t)AG (acceptor site) and 11 of the perfect agGT(a/g)ag (donor site) are found, but none separated by a full ORF. For the human 67-kb sequence, the numbers are 6 and 6, none of which delineates a possible internal coding exon. It is a fact that this prediction is incorrect, since at least two actual internal exons have been found in each of those two sequences.
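The random-sequence model of Eq. (3) and the fitted forms of Eqs. (1) and (4) are easy to evaluate numerically. The sketch below is only an illustration of that arithmetic; the function names are made up for the example, and the constants (48.3 and 136 for the mouse ORF fit, 9 and 153 for the candidate-exon fit) are taken as read from the text.

```python
import math

def random_orf_freq_per_kb(min_len: float) -> float:
    """Eq. (3): expected ORFs of length >= min_len per kb, equal base usage."""
    stops_per_kb = 3 * 1000 / 64  # 3 stop codons among 64 possible codons
    return stops_per_kb * math.exp(-min_len / 64)

def fitted_freq_per_kb(min_len: float, prefactor: float, decay: float) -> float:
    """Empirical form of Eqs. (1) and (4): prefactor * 10**(-L / decay)."""
    return prefactor * 10 ** (-min_len / decay)

# Converting e**(-L/64) to base 10 gives the decay length quoted in Eq. (3).
DECAY_BASE10 = 64 * math.log(10)  # about 147 nucleotides

def mouse_orf_fit(min_len: float) -> float:
    """Eq. (1), constants as read from the text."""
    return fitted_freq_per_kb(min_len, 48.3, 136)

def candidate_exon_fit(min_len: float) -> float:
    """Eq. (4), constants as read from the text."""
    return fitted_freq_per_kb(min_len, 9, 153)
```

With these constants, `mouse_orf_fit(225)` and `candidate_exon_fit(150)` both come out close to one occurrence per kb, which is the order of magnitude quoted in the text.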
Since identifying internal coding exons (the most constrained of all) is not trivial, it follows that identifying 5’-, 3’-, or noncoding internal exons is going to be an even more difficult task. As is shown in this chapter, only similarity search-based methods can help in that respect.

In summary, this section has provided some basic statistics about the random occurrences of the signals (splice sites and ORFs) known to define internal exons. The characteristics (splice signal stringency and average size) of actual exons do not allow them to emerge naturally from the background of chance occurrences. Even sophisticated neural network methods still predict many more false sites than true ones (3). Thus, a purely signal-based method for recognizing internal protein-coding exons is not enough, and additional information is necessary. This additional information will be provided by a more detailed analysis of the “content” of the candidate exon sequences. A quick overview of content-based methods is given in the next section.

1.2. Content-Based Methods for Detecting Coding Exons
Whereas signal-based methods look for short consensus elements (splice sites, initiation and stop codons, polyadenylation signal, etc.), “content”-based approaches attempt to capture bulk properties over the length of the coding region. Nucleotide sequences encoding a polypeptide are expected to (and actually do) differ from “intervening” sequences for various reasons. At the origin of this difference, essential and general constraints are found, such as the structure of the genetic code, the relative abundance of amino acids in natural proteins, or the evolutionary pathways accessible to polypeptide sequences. The effects of those constraints are also modulated by many other factors (host-specific, expression-specific, and structural or functional features). For instance, some organisms will exhibit a strong bias in their usage of synonymous codons, some genes will be allowed to evolve much faster than others, and some proteins (nonglobular or membrane-bound) will exhibit anomalous compositions.

The bias in the usage of synonymous codons (7,8) and the preferred RNY (purine-anything-pyrimidine) pattern in the reading frame of coding nucleotides (9,10) were the first statistical differences applied to the discrimination of coding and noncoding regions. The asymmetry of the base composition in the three codon positions (8,11) was soon to follow. Although those initial coding measures enjoyed a reasonable success rate for the analysis of mitochondrial, chloroplast, bacterial, and even yeast genomes, they are clearly insufficient to locate coding exons reliably within higher eukaryote sequences. A variety of second-generation methods were built on the more general basis that coding and noncoding regions statistically use a different vocabulary of nucleotide “words” (12-14). Fickett and Tung (15) reviewed and compared the efficiency of the main protein-coding measures proposed to date. They found the statistics on hexamers (six-nucleotide words) (16) to be the most efficient measure to discriminate coding and noncoding regions. The reason for this is not entirely clear, but it is likely that the influence of multiple factors, such as nearest neighbor nucleotide frequency, codon usage, peptide composition, and CpG methylation and depletion (17), is most naturally summarized in the statistics of six-nucleotide words.

Although methods based on a single statistic are adequate to locate coding regions in bacteria (8,18), the less compact genomes of higher eukaryotes require the combination of several methods to achieve reasonable success. All current practical approaches associate several content-based coding measures and signal detection (ORFs, splice sites): Gm (19), Predictor (Claverie and Bougueleret [12,16], unpublished version used in refs. [3,20]), Grail (21,22), GeneFinder (Green and Hillier, unpublished, but central to the C. elegans genome project [23]), GeneID (24), and GeneParser (6,25). Most recent versions of these packages begin by generating a pool of all candidates based on edge signals, and then rank the candidate exons using a combination of coding measures.
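A hexamer-based coding measure can be sketched as a log-likelihood ratio between word frequencies learned from known coding and noncoding sequences. The code below is a simplified illustration under stated assumptions, not the statistic used by any of the packages named above: it counts overlapping hexamers in all positions (ignoring reading frame), uses a hypothetical pseudocount of 1, and all function names are invented for the example.

```python
import math
from collections import Counter

def hexamer_counts(seqs):
    """Count overlapping six-nucleotide words over a collection of sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - 5):
            counts[seq[i:i + 6]] += 1
    return counts

def make_hexamer_scorer(coding_seqs, noncoding_seqs, pseudocount=1.0):
    """Return score(seq): mean log-odds (coding vs. noncoding) per hexamer."""
    coding = hexamer_counts(coding_seqs)
    noncoding = hexamer_counts(noncoding_seqs)
    n_words = 4 ** 6  # number of possible hexamers
    coding_total = sum(coding.values()) + pseudocount * n_words
    noncoding_total = sum(noncoding.values()) + pseudocount * n_words

    def score(seq):
        seq = seq.upper()
        log_odds = []
        for i in range(len(seq) - 5):
            word = seq[i:i + 6]
            p_coding = (coding[word] + pseudocount) / coding_total
            p_noncoding = (noncoding[word] + pseudocount) / noncoding_total
            log_odds.append(math.log(p_coding / p_noncoding))
        # Sequences shorter than six nucleotides carry no information.
        return sum(log_odds) / len(log_odds) if log_odds else 0.0

    return score
```

A positive score means the hexamer vocabulary of the segment looks more like the coding training set. As discussed in the next section, any such measure inherits the biases of the sequences it was trained on.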
1.3. A General Problem with Coding Measures: Conservatism
Coding measures necessarily reflect the biases of the “known genes” statistics on which they are based. Any content-based approach derives a prototype description of the data available at a given point in time and uses it to identify similar features in new anonymous genomic sequences. Those methods are thus implicitly optimized to detect protein-coding genes most similar to already known genes coding for “normal” proteins. Unfortunately, there is increasing evidence that the statistical properties of genes vary in nontrivial ways depending on their evolution rate, expression pattern, type of protein encoded, or the local nucleotide composition of the genomic region in which they reside. An important factor might be the evolution rate of a given gene. For instance, nonconserved mammalian genes (such as cytokines or serum albumin) might evolve too fast to be able to achieve the distinctive statistical bias on which the detection of “regular exons” is based. Indeed, we have shown that a leading method, such as GRAIL, does not perform well on those genes (26).

An already visible consequence of the conservatism of the current methods is that they tend to miss more exons in “gene-poor” regions (apparently correlating with higher A + T composition) (25). Indeed, there are fewer examples of exons from “gene-poor” regions in the databases, and thus fewer of this kind for training and testing methods. At this point, the situation becomes dangerously recursive: Are “gene-poor” regions a fact of life, or simply genomic regions in which genes are more difficult to detect (by current computational and/or experimental methods)? In this category belong genes with tightly developmentally regulated (or tissue-specific) expression. For those, rightly predicted exons may frequently be followed by a failure to identify the corresponding cDNA in the usual libraries. The computer prediction will then be wrongly counted as a “false positive.” On the other hand, if no exons are predicted, the existence of a gene at this location will not even be suspected, and experiments will rarely be made. Such situations of “false negative” for both the computational and experimental approaches will stay largely untested until a way is found to provide an exhaustive map of all possible transcripts within a given genomic region.

Finally, the conservatism of content-based methods is clearly apparent in the fact that newly determined genomic contigs are always more challenging to analyze than expected from database benchmarks. A success rate of 80-90% measured on data sets composed of previously known sequences may fall to a mere 50% when applied to new data (26,27). Indeed, the prototype mammalian gene representation as captured by the current methods is dominated by the content of the databases in genome sequences: a majority of short genes (i.e., spanning <20 kb), and short introns (long introns are almost never sequenced).

1.4. Database “Look-up” and Similarity Search
The methods reviewed in the previous sections attempted to identify the essence of protein-coding genes and exons. Eventually, the ultimate algorithm would fully mimic the process by which the cell machinery defines functional splice signals, and specifically processes and expresses a transcript in the proper cell and/or at the proper time in development. However, regulated expression or alternate splicing cannot be entirely encoded in generic genomic sequences and has to respond to extraneous environmental factors. Thus, the most sophisticated analysis of genomic sequences might never reach the goal of identifying all biological transcripts.
Over the last four years, with the steady improvement in sequencing technology and strategy, came the progressive realization that even if genome sizes and gene numbers were large, they were nonetheless finite and eventually accessible to full determination (28-30). Using high-throughput sequencing of randomly generated chromosome fragments, Venter and collaborators succeeded in obtaining the first complete genome (1.83 million nucleotides) of a free-living organism, Haemophilus influenzae (30,31). Four years before, the same group had pioneered the concept of expressed sequence tags (EST) (32-35), the application of high-throughput partial sequencing to cDNA libraries, raising the hope that transcripts and exons could be explicitly enumerated rather than painfully detected within the 3.5 billion nucleotides of the human genome. Unfortunately, most human EST sequencing became a proprietary effort for a while, delaying the application of the concept as a general tool for gene identification. Public EST data collection received a big boost after the implementation of the Merck “initiative” (36), a project to compile the 5’- and 3’-partial sequences of 200,000 cDNA clones. This project is centered around the Genome Sequencing Center at Washington University in St. Louis (Dir.: R. Waterston), with cDNA libraries produced by Bento Soares at Columbia University. At a rate of 1500 sequences/day, more than 200,000 sequences have been released to GenBank (37) and the specialized EST collection dbEST (38). Whatever the actual number of human genes (estimates range from 80,000 to 200,000) and the average number of different transcripts per gene (unknown), this increasing collection will nevertheless quickly contain information on 3-8 exons/transcript (an average of 1 kb/cDNA clone is sampled by the approach) for a large fraction of all human genes.

Using this growing public resource, a totally new approach to the identification of exons becomes possible: simply looking them up in the EST database! Practically, this consists of using the anonymous genomic sequence as a query to scan dbEST with a similarity search program. Given the exon/intron structure of vertebrate genes, only partial and piecewise alignments are expected. The search program should thus be tuned to the sensitive detection of local and exact (or near exact) matches between two nucleotide sequences. The next section reviews the use of Blastn (39), a well-suited program for this task.

In addition to the use of the rapidly growing EST data bank (now representing half of all GenBank entries), a protein sequence database look-up strategy can be implemented (40-42). A large fraction of human coding exons are related to known protein-coding genes of other eukaryotes for which sequence data are accumulating rapidly, mainly the budding yeast S. cerevisiae (43,44) and the nematode C. elegans (45).
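The idea of looking for local, exact matches between a genomic query and a collection of ESTs can be illustrated with a toy word-matching scan. This is not Blastn and makes no claim about its internals or options: it is a minimal sketch that indexes every 12-nucleotide word of each EST (the word size and all names are arbitrary choices for the example) and merges consecutive exact hits on the same diagonal into matched segments of the query.

```python
from collections import defaultdict

def index_words(ests: dict[str, str], k: int = 12):
    """Map every k-nucleotide word to the (EST name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in ests.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index

def exact_hits(query: str, index, k: int = 12):
    """List (query_pos, est_name, est_pos) for every word shared with an EST."""
    query = query.upper()
    hits = []
    for q in range(len(query) - k + 1):
        for name, e in index.get(query[q:q + k], ()):
            hits.append((q, name, e))
    return hits

def matched_segments(hits, k: int = 12):
    """Merge consecutive hits on one diagonal into (est, query_start, query_end)."""
    by_diagonal = defaultdict(list)
    for q, name, e in hits:
        by_diagonal[(name, q - e)].append(q)
    segments = []
    for (name, _), starts in by_diagonal.items():
        starts.sort()
        seg_start = prev = starts[0]
        for q in starts[1:]:
            if q > prev + 1:  # gap on the diagonal: close the current segment
                segments.append((name, seg_start, prev + k))
                seg_start = q
            prev = q
        segments.append((name, seg_start, prev + k))
    return sorted(segments, key=lambda s: (s[1], s[0]))
```

Because an EST derives from spliced mRNA, a genomic query is expected to yield several separate segments per EST, one per exon, which is the "partial and piecewise" pattern described above. A real search program additionally tolerates mismatches and sequencing errors and attaches a statistical significance to each match.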
Statistical studies (46,47) have shown that almost all protein prototypes common to all eukaryotes (i.e., in existence prior to the metazoan radiation 550 million years ago) have detectable representatives in the current protein database Swiss-Prot (48). Within a given organism, 50% of all protein-coding genes appear significantly related to those ancestral prototypes (47). Thus, the search for similarity between anonymous genomic sequences and the protein sequence database is another possible way to identify coding exons. In contrast with the previous use of EST data, the alignment program should now be tuned to the detection of local significant matches between two amino acid sequences. Putative translations of the genomic sequence are now the queries to be used against the protein database.

1.5. Exon Detection by Similarity Search: Nucleotide or Amino Acid Comparisons?
Genomic sequences can be compared as nucleotides or as putative amino acid translations.
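Producing the putative translations of a genomic sequence amounts to translating both strands in three reading frames each. A minimal sketch follows, assuming the standard genetic code; the frame labels (+1 to +3, -1 to -3) and the function names are conventions chosen for this example, and incomplete or ambiguous codons are rendered as X.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def translate(seq: str) -> str:
    """Translate consecutive codons; unknown codons (e.g., with N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame_translations(seq: str) -> dict[str, str]:
    """Translate both strands in all three frames, keyed '+1'..'+3', '-1'..'-3'."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames
```

Each of the six translations can then be used as a protein query. Stop codons are kept in the output (as `*`) so that the segments between them, the ORFs, remain visible.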
1.5.1. Nucleotide Sequence Comparisons
Nucleotide sequence comparison with EST data brings the significant advantage that all types of exons (and not only internal coding exons) become, in principle, identifiable. 3’ untranslated regions (UTRs), in particular, offer very good targets for gene detection (Fig. 3). A second advantage is that genes not coding for a protein, or with very short ORFs, can be revealed. Several independent instances of “pseudo-mRNA” (RNA polymerase II transcripts without protein-coding function) have now been reported: the inactive X-specific transcript Xist (49), the product of the mouse H19 gene (50), the rat synapse-associated RNA 7H4 (51), and the mouse His-1 transcript (52). If these truly represent the first examples of a new type of gene, many more should follow from the systematic comparison of new genomic sequences and EST databases. More than 20 different ESTs clearly identify the human Xist 16-kb transcript in the current public collection dbEST (data not shown). Without the use of EST information, only RNAs with highly conserved sequences or structures can be detected by homology with known molecules (rRNAs, snRNAs) or ad hoc methods (tRNAs) (53).

1.5.2. Amino Acid Sequence Comparisons
Scanning a protein database to detect exons obviously restricts one to protein-coding regions. On the other hand, comparing amino acid translations extends one’s capacity of detection in terms of sensitivity (Fig. 4, p. 295). The expected percentages of identical nucleotides vs the percentage of identical amino acids for two homologous coding regions are discussed in Tables 1 and 2 (pp. 296 and 297, respectively) for various evolutionary contexts. Those numbers derive from the structure of the standard genetic code. It consists of 61 “sense” codons specified by three positions. Each position can be mutated in three different ways. For instance, for the codon ATC specifying isoleucine (Ile):

Position 1   A -> [C, G, T]   three of three are nonsynonymous
Position 2   T -> [A, C, G]   three of three are nonsynonymous
Position 3   C -> [A, G, T]   two of three are synonymous
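The per-position tally for ATC can be repeated mechanically for every sense codon. The sketch below assumes the standard genetic code and counts a change to a stop codon as nonsynonymous; the helper names are hypothetical. Under the standard code, the third-position change of ATC to ATG turns Ile into Met, so this tally finds two synonymous changes for ATC.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def substitution_tally(codon: str) -> tuple[int, int]:
    """Return (synonymous, nonsynonymous) counts over the 9 single-base changes."""
    synonymous = nonsynonymous = 0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                synonymous += 1
            else:
                nonsynonymous += 1  # includes changes to a stop codon
    return synonymous, nonsynonymous

SENSE_CODONS = [codon for codon, aa in CODON_TABLE.items() if aa != "*"]
TOTAL_SUBSTITUTIONS = 9 * len(SENSE_CODONS)
TOTAL_SYNONYMOUS = sum(substitution_tally(codon)[0] for codon in SENSE_CODONS)

print(substitution_tally("ATC"))              # (2, 7)
print(TOTAL_SYNONYMOUS, TOTAL_SUBSTITUTIONS)  # 134 549
```

The totals agree with the whole-code figures of 134 synonymous substitutions out of 549.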
Thus, for this codon (and all codons specifying Ile), 7/9 of all substitutions are nonsynonymous, and 2/9 are synonymous (the third-position change to G yields ATG, which specifies methionine). Repeating this exercise for the whole genetic code, it is found that 134 of 549 substitutions are synonymous (24.4% ≈ 25%) and 415 of 549 are nonsynonymous (75.6% ≈ 75%). Tables 1 and 2 show some useful general relationships between the similarity observed for amino acid sequences and the corresponding numbers for the encoding nucleotides. Amino acid sequences are usually more conserved than DNA sequences in the range of high homology that is of interest for identification purposes. Below 70% of identical nucleotides, protein (putative transla-
A
Query: Nucleotides 1-8000, from the Kallmann 67-kb genomic contig
Database: dbEST (296,014 sequences; 102,671,971 nucleotides)

Sequences producing High-scoring Segment Pairs:             Score  Annotation

dbest|232892 R56155  cDNA Soares infant brain 1NIB H.s.      1988  3'UTR Kal
dbest|277821 H17883  cDNA Soares infant brain 1NIB H.s.      1793  3'UTR Kal
dbest|277820 H17882  cDNA Soares infant brain 1NIB H.s.      1663  3'UTR Kal
dbest|150584 F01088  cDNA STRATAGENE Human skeletal mu       1645  3'UTR Kal
dbest|130796 T69876  cDNA Stratagene lung (#937210) H        1428  3'UTR Kal
dbest|131125 T70205  cDNA Stratagene lung (#937210) H        1117  3'UTR Kal
dbest|232779 R56042  cDNA Soares infant brain 1NIB H.s.      1112  3'UTR Kal
dbest|160816 T90695  cDNA Stratagene lung (#937210) H         374  Alu
dbest|116636 D45751  cDNA Human adult lung 3' directed        364  3'UTR Kal
dbest|28206  T09152  cDNA Infant Brain, Bento Soares H        362  Alu
dbest|269586 H09706  cDNA Soares infant brain 1NIB H.s.       352  Alu
dbest|179667 R09371  cDNA Soares fetal liver spleen 1N        346  3'UTR Kal
dbest|98642  F01913  cDNA normalized infant brain cDNA        344  Alu
dbest|177123 R06827  cDNA Soares fetal liver spleen 1N        341  Alu
dbest|249887 R73029  cDNA Soares breast 2NbHBst H.sapi        339  Alu
dbest|270331 H10451  cDNA Soares infant brain 1NIB H.s.       338  Alu
dbest|120228 F04105  cDNA normalized infant brain cDNA        337  Alu
dbest|177071 R06775  cDNA Soares fetal liver spleen 1N        334  Alu
dbest|156722 T86601  cDNA Soares fetal liver spleen 1N        330  Alu
dbest|220056 R49991  cDNA Soares breast 2NbHBst H.sapi        325  Alu
dbest|83231  T33198  cDNA Human Brain H.sapiens Homol         325  Alu
..... many more .....
WARNING: Descriptions of 1541 database sequences were not reported due to the limiting value of parameter V = 500

B
Query: Nucleotides 1-8000, from the Kallmann 67-kb genomic contig (after Alu filtering and using the “-top” option)
Database: dbEST (296,014 sequences; 102,671,971 nucleotides)

Sequences producing High-scoring Segment Pairs:             Score  Annotation

dbest|232892 R56155  cDNA Soares infant brain 1NIB H.s.      1988  3'UTR Kal
dbest|131125 T70205  cDNA Stratagene lung (#937210) H        1117  3'UTR Kal
dbest|277820 H17882  cDNA Soares infant brain 1NIB H.s.      1663  3'UTR Kal
dbest|116636 D45751  cDNA Human adult lung 3' directed        364  3'UTR Kal
Fig. 3. Identification of exons by Blastn similarity search in dbEST. (A) Top of the output of a Blastn search of dbEST, using a 8000-nucleotide query The query is taken from a human 67-kb contig m the region of the Kalhnann syndrome gene. The top scoring (S = 1988) alignment is shown in C. The overlap with a brain EST is complete, except for two deletions/insertions occurring at the less accurate end of the EST sequence. The last (mostly 3’-UTR) exon of the Kallmann gene 1s 3-kb long (3252-7429). A low-scoring (S = 364) bona fide alignment is shown below. This alignment is statistically significant, but totally buried in the “noise” of Alu-repeat induced matches (B) Top of the output Blastn search of dbEST, using the “plus” strand of the previous 8000-nucleotide query in which Alu-like sequences have been masked (see text). Only four matches are now reported, all of them successfully detecting the last Kalhnann exon.
Claverie
294 3’ UTR exon detection
C >dbest]232892 Score Query
cDNA
= 1988,
brain
= 406/418
1NIB (97%),
H saglens Strand
, Length
= Plus
= 428
/ Plus 5951
IIIIIIIIIIIIIIIIIII1111111111111111111111111
1 TAATAAAGj'ATAAUTCTATGCTTTTTAACAAACATAGTTTTGGTGCCTAATTCTGTAAT
60
ATGNTTTATTGAAATTAGATTCATTNCTCTAATGTGTGAG~TATATCCAGT~TAGTA
5952
III
IIIIIIIIIIIIIIIIIIIII
6011
IIIIIIIIIIIIIIIIIIIIIIIIIIIlIIIIII
61 ATGTTTTATTGAAATTAGATTCATTTCTCTRATGTGTGAG~TATATCCAGT~TAGTA 6012
Sb]ct
121
Query
6072
TTGACTGTTTAAAAAA
120
TTGAGCTCATCMAAATATTGTCATCAAATACAGGTGGTTAATC
6071
llll1llllllllllllllIIIIIIIIIIIlIIIIIIIIIIIIIIIIIIIIIIIIIIIII
TTGACTGTTTMAAAA
TTGAGCTCATCAAUATATTGTCATCAAATACAGGTGGTTZUXTC
180
TGACATACATTGCAGTTACATGCATTATTTTTATTTACAA
181 TGACATACATTGCAGTTACATGCATTATTTTTATTTACAA
240
ATTTATCTGTGTTACCCTGTTTTTCTACTCCATCCA
6191
Query
6132
Sbjct:
241
Query*
6192
Sblct.
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
ATTTATCTGTGTTACCCTGTTTTTCTACCTGGAACTCCATCCA
300
ACATGTGCTCTTTTCAGTCATTCACTGTTTTAATATGACAGTTT
6251
lllllllllllllllllllIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIlI
ACATGTGCTCTTTTCAGTCATTCACTGTTTTAATATGACATGGTAGAG~GAT~GGTTT
301
360
ATGGCAGGTAATTTTTTGTATGTGTATTAAACGAAGTTC~GATTAG~TACATC
6252
Sbjct
IlIIIIIIIIIIIIIIIIIIIIIIIlIIIIIllIIIIII
361
II
ATGGCAGGTAATTTTTTGTAATGTGTGTATT~CG~GTTTC~GATTAGATTACATC
6309
I
I III//I
* insertlon/deletlon
>dbest1116636 Score Query
cDNA
= 364, 5843
Sb]ct
Human
Identltles
adult
lung
= 76/82
3'
directed
(92%),
Strand
H saplens = Plus
Length
I IIIII
IIIIlIIIIIIIII
AAAANCTATGCTTTTTAACA
III1
IIIIlIIIII
IIIIII
61 AAAANCTATGCTTTTAAACAAA
7
= 82
5902
IIIIIIIIIIII
1 GATCAAAAATATAGTTATAATTTTNTAAATTTNAAAAATGTGATTGCNCTAATAMGMT 5903
418
/ Plus
GATCAAAAATATAGTTATAATTTTTTGAATTTTAAAAATGT
llllll1llllllllllllIIIII
SbjctQuery
6131
11111111111111111l111111111111111111111111111111111111111III
Sb]ct
Query.
Identrtles
IIIIIIIIIIIIIII
Sb]ct. Query
rnfant
TAATAAAGAATIVVVLNCTATGCTTTTTAACAAACATAGTTTTGGTGCCTAATTCTGT?&T
5892
Sblct. Query*
Soares
60
5924 82
Fig. 3. (contznued) (C) Top and lowest-scoring dbEST matches detecting the last Kallmann gene exon (3’-UTR)
tions) similarity searches are also more sensitive and more informative than direct nucleotide sequence comparisons (see ref. 54) for a detailed discussion). Intmtlvely, there are 20 amino acids (and thus the probability of random match is Pno,se= l/20), but only four nucleotides (thus pnolse= l/4), Hence, a human exon can still be identified on the basis of a 25% (five times the 5% back-
Similarity Searches
295
Query= kallmarm.1842~1983
size: 142 phase: 1
Database: SWISSPROT/sproBl 43,470 sequences; 15,335,248 total letters.
Sequences 1 i 1 i 1 ( 1
producing
KALM-HUMAN KALMICHICK CAML-HUMAN NRCA-CHICK CAl7-HUMAN CAMLJAT CAML-MOUSE
1 1 ( i 1 ( (
4 not c not NEURAL NG-CAM COLLAGEN NEURAL NEURAL
1 CAML-HUMAN 1 NEURAL Length = 1257 Plus
Strand
SbIct
Segment
Paus
Frame
In database at the time of In database at the time of CELL ADHESION MOLECULE RELATED CELL ADHESION M ALPHA 1WII) CHAIN (L CELL ADHESION MOLECULE Ll CELL ADHESION MOLECULE
CELL
ADHESION
MOLECULE
Ll
Score
discoverv discovery +1 +1 +l +1 +1
80 64 75 74 74
PRECURSOR
> > 0 0 0 0 0
P(N)
N
00012 00028 00061 00082 00082
1 2 1 1 1
(N-CAM
Ll)
HSPs
Score = 80, Expect IdentIties = 18/34 Query.
High-scoring
= 0 00012, (52%). PosItIves
= 20/34
28 LRPSTLYRLEVQVLTPGGEGPATIKTFRTPELPP LRP + Y LEVQ G GPA+ TF TPE 885 LRPYSSYHLEVQAFNGRGSGPASEFTFSTPEGVP
(58%).
Frame
= +1
129 P 918
Fig. 4. Identificatton of exons by Blastx similarity search in Swiss-Prot. Top of the output of a Blastx search of Swiss-Prot. The query is a candidate coding exon dehmtted by acceptable acceptor and donor splice sites. Despite the absence of close homologs in the database, the presence of a common Ftbronectm type III repeat IS sufficient to identify this short internal coding exon of the Kallmann syndrome gene All posmve matches are neural cell adhesion molecules, consistent with the nature of Kallmann syndrome, a defect m neuron migration
ground) identification of its putative translation with a yeast protein, whereas the corresponding DNA sequences, possibly 38% identical (1.5 time the background), could not be significantly matched. In addition, the similarity between nonidentical amino acid is scored (usmg 20 x 20 amino acid scoring matrices [.55-57]), increasing the sensitivity and reliability of distant protein alignments. This allows the detection of human or mammalian exons to take advantage of the rapidly growing body of sequence data available for all model organisms: invertebrate (C. elegans), lower eukaryote (S cerevasiae), or even bacterial genomes. There IS evidence that this “phylum hopping” might apply to as much as half of all vertebrate genes (46,47). Finally, it is important to point out that exon detection by similarity search, hke all pure “content-based” methods, does not require the completion of the
Table 1
Relationship Between Protein and DNA Sequence Identity(a)

Protein                      Encoding DNA sequences
% identical amino acids      Worst % identity     Best % identity
100                          75                   100
 90                          70                    97.5
 80                          65                    95
 70                          60                    92.5
 60                          55                    90
 50                          50                    87.5
 40                          45                    85
 30                          40                    82.5
 20                          35                    80

(a) Given $I_{amino}$, the percentage of identical amino acids between two protein sequences, the corresponding percentage of identical nucleotides $I_{nuc}$ in the encoding sequences can be as low as
$I_{nuc} = I_{amino} \times 0.75 + (1 - I_{amino}) \times 0.25 = I_{amino}/2 + 1/4$
At the other extreme (the most conserved case), the identical amino acids can be encoded by the same exact codons, and the divergent ones could retain the maximal similarity, e.g., 75% identity at the nucleotide level. The relationship then becomes
$I_{nuc} = I_{amino} \times 1 + (1 - I_{amino}) \times 0.75 = I_{amino}/4 + 3/4$
For a given similarity of the protein sequences, the nucleotide substitution rate will determine the actual percentage of identity exhibited by the two encoding DNA sequences.
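The two footnote formulas can be checked against the table with a few lines of Python. This is only a sketch of the arithmetic stated in the footnote; the function name `dna_identity_bounds` is introduced here for illustration and does not come from the chapter.

```python
def dna_identity_bounds(protein_identity_pct):
    """Return (worst, best) % nucleotide identity for a given % amino acid identity.

    Worst case: identical residues contribute 75% and divergent residues 25%.
    Best case:  identical residues contribute 100% and divergent residues 75%.
    Both weightings are taken from the footnote of Table 1.
    """
    i = protein_identity_pct / 100.0
    worst = i * 0.75 + (1 - i) * 0.25   # equals i/2 + 1/4
    best = i * 1.0 + (1 - i) * 0.75     # equals i/4 + 3/4
    return round(worst * 100, 1), round(best * 100, 1)


# Reprint the rows of Table 1 (100% down to 20% amino acid identity).
for pct in range(100, 10, -10):
    worst, best = dna_identity_bounds(pct)
    print(f"{pct:>3}  {worst:>5}  {best:>5}")
```

Each printed row should agree with the corresponding row of Table 1, which is a quick way to confirm that the reconstructed formulas and the tabulated values are consistent.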
genomic sequence. The author has previously argued (58) that exon detection can and should proceed prior to assembly, while the genomic sequence is being determined (i.e., at the template sequence level). Most exons within a genomic region should be located by the time the sequencing coverage reaches 2-2.5. This allows a huge economy of time and effort when finding genes, rather than establishing the genomic sequence to a high accuracy, is the actual goal of a sequencing project.
2. Materials
2.1. Similarity Search: Local vs Remote
When there is a genomic sequence (a query) to be compared to the public databases, it must be decided whether the search should be run locally (on one’s own computer) or
Table 2
Relationship Between Close DNA Sequences and the Encoded Protein(a)

            Evolutionary constraint
DNA %       Max positive     None        Max negative
            Prot %           Prot %      Prot %
95          100              89          85
90          100              77.5        70
85          100              66          55
80          100              55          40
75          100              44          25

(a) For different levels of evolutionary constraint, consider two coding DNA sequences sharing at least 75% identical nucleotides overall (and no gap). Define the number of divergent positions $d_{nuc}$, the length of the DNA sequence $L_{nuc}$, and the length of the protein $L_{amino} = L_{nuc}/3$. For a given nucleotide divergence ratio $p_{nuc} = d_{nuc}/L_{nuc}$, the amino acid discrepancy can vary from 0% (all substitutions are synonymous) to $p_{amino} = d_{nuc}/L_{amino} = 3p_{nuc}$ (all substitutions are nonsynonymous, and they all occur in different codons). On average (random mutation without evolutionary constraint), expect 25% of all substitutions to be synonymous and the rest to be nonsynonymous (more rigorously, this percentage depends on the precise codon composition of the protein). Thus, the average protein discrepancy will be $p_{amino} = 0.75 \times d_{nuc}/L_{amino} = 0.75 \times 3p_{nuc}$. A good example is the relationship between human and mouse: on average, there is 86% DNA identity in coding regions (75% in noncoding), with 93% identical amino acids. Serum albumin, a rapidly divergent protein, is 81% identical at the nucleotide level, with 71% identical amino acids.
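Both the 25% synonymous figure used in this footnote and the three protein columns of Table 2 can be recomputed directly. The sketch below assumes the standard genetic code, counts every single-nucleotide substitution of the 61 sense codons (a change to a stop codon is counted as nonsynonymous), and then applies the footnote relationship. The names `count_substitutions` and `protein_identity` are illustrative only.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated with the third base varying fastest.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_substitutions():
    """Count (synonymous, total) single-base substitutions over all sense codons."""
    synonymous = total = 0
    for codon, aa in CODE.items():
        if aa == "*":                      # stop codons are not used as starting points
            continue
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                total += 1
                if CODE[mutant] == aa:
                    synonymous += 1
    return synonymous, total


def protein_identity(dna_identity_pct, nonsyn_fraction):
    """Expected % identical amino acids: p_amino = nonsyn_fraction * 3 * p_nuc."""
    p_nuc = 1 - dna_identity_pct / 100.0
    return 100.0 * (1 - nonsyn_fraction * 3 * p_nuc)


syn, total = count_substitutions()
print(syn, total, round(100 * syn / total, 1))

# Columns of Table 2: max positive (0), no constraint (0.75), max negative (1.0).
for dna in (95, 90, 85, 80, 75):
    print(dna, [round(protein_identity(dna, f), 2) for f in (0.0, 0.75, 1.0)])
```

The unrounded values in the "None" column (for example 88.75 and 66.25) correspond to the rounded table entries 89 and 66.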
the sequence sent over the Internet to a public biocomputing server. Working locally requires installation of the search software (e.g., Blast) and some data management tools, as well as the databases themselves, on a local system. A minimal hardware requirement (at this time!) is a UNIX-based workstation with 800+ Mb of disk space. Installing search software, such as the Blast suite, requires some expertise in UNIX and in compiling C code (however, executable files are provided for the most common platforms; see bottom of Fig. 5B, p. 305). Installing programs and databases requires network access and a working knowledge of the anonymous file transfer protocol (FTP) (or a CD reader). The advantage of the local mode of operation is that the search program can be run to its full potential, with its full range of options, as often
and as long as desired. Also, certain masking and filtering tools have to be implemented locally to be used to their full potential (e.g., both the database and the query must be processed). Finally, the local mode of operation is the only way to ensure confidentiality and security. On the other hand, using a remote biocomputing server only requires access to the Internet and a working knowledge of electronic mail or of Mosaic or Netscape, the World Wide Web (WWW) browsing programs. A good dose of patience is also required, since the most useful (and popular) servers have increasing queuing times. Here, materials consist of databases and programs to be downloaded and installed (in the local mode of operation) or simply accessed through the Internet (if one chooses to work remotely). The following section offers a rapid overview of a few Internet entry points physically located in the United States, Europe, and Japan. These centers offer remote database similarity searches and database/software downloading free of charge. This list is by no means exhaustive or an endorsement of any sort. The entire world of biocomputing services can be accessed from those entry points with some additional navigation through the WWW.
2.2. Entry Points in the United States, Europe, and Japan
2.2.1. United States: The National Center for Biotechnology Information (NCBI)
The NCBI is located at the National Library of Medicine, on the campus of the US National Institutes of Health, Bethesda, MD. NCBI is the maker of GenBank (37) as well as of a number of other specialized databases, including dbEST (38). It offers a variety of (free) on-line and e-mail services, including the powerful (and popular) Blast similarity search server. The main electronic addresses are:
[email protected]               the general “information desk”
blast@NCBI.nlm.nih.gov         the e-mail Blast similarity search server
[email protected]               where to ask specific questions about the Blast software
retrieve@NCBI.nlm.nih.gov      the general database query server
[email protected]               for e-mail submissions of new sequences
[email protected]               for e-mail submissions of updates, reporting errors, and so forth
The main molecular sequence (protein and nucleotide) databases, many useful specialized databases, and a variety of sequence analysis software (including third-party programs) can be copied from an anonymous FTP server: NCBI.nlm.nih.gov (130.14.25.1). The same range of services can be reached in a WWW context at: http://www.NCBI.nlm.nih.gov/.
2.2.2. Europe
At the moment, no server in Europe concentrates all the resources available at NCBI. Together, the three introduced in the following offer a very reliable service.
2.2.2.1. THE EUROPEAN BIOINFORMATICS INSTITUTE (EBI)
The EBI, an outstation of the European Molecular Biology Laboratory, is located in Hinxton, near Cambridge, England. EBI is the maker of the EMBL nucleotide data library (59). It offers a variety of (free) on-line and e-mail services, including similarity search by Fasta (60) and Blitz (61), but not Blast. The main electronic addresses are:
[email protected]               for inquiries about the EMBL data library
[email protected]               for submission to the EMBL data library
UPDA [email protected]          for updates
The EMBL data library, the other protein and nucleotide databases, and sequence analysis software can be copied from the anonymous FTP site at ftp.ebi.ac.uk, and the WWW entry point is at http://www.ebi.ac.uk/. Fasta (60) is not discussed here, but it accepts both protein and nucleotide sequence queries, and can scan all types of databases. With the proper parameters, it can be used in the context of exon detection in ways very similar to Blast. Blitz is an automatic electronic mail server for the MPsrch program (61), performing extremely fast comparisons of protein sequences against the Swiss-Prot protein sequence database.
2.2.2.2. THE ISREC-EPFL BLAST SERVER IN LAUSANNE, SWITZERLAND
A complete and reliable Blast similarity search server is operated jointly by the Bioinformatics Group at the ISREC (Swiss Institute for Experimental Cancer Research) and the Swiss Federal Institute of Technology (EPFL), both in Lausanne, Switzerland. Searches against all main molecular sequence databases, including Swiss-Prot and dbEST, can be submitted to an e-mail server at blast@?disunlO.epjl.ch or as a WWW form to http://ulrec3.unil.ch/.
2.2.2.3. THE ExPASy SERVER AT THE UNIVERSITY OF GENEVA
Finally, Swiss-Prot releases and a wealth of molecular biology information can be accessed using the ExPASy server of the University of Geneva in Switzerland (maker of the Swiss-Prot database) at: http://expasy.hcuge.ch/.
2.2.3. Japan: GenomeNet
GenomeNet is a Japanese computer network for genome research and molecular and cellular biology. GenomeNet is operated jointly by the Human Genome Center (HGC), Institute of Medical Science, the University of Tokyo,
and the Supercomputer Laboratory (SCL), Institute for Chemical Research, Kyoto University. GenomeNet services may be accessed by e-mail, anonymous FTP, and WWW. The services include BLAST and FASTA for sequence similarity search against the main databases, but with the unfortunate exception of dbEST. The WWW entry point is at http://www.genome.ad.jp/.
3. Methods
3.1. Similarity Search on a Remote Server
The NCBI (like ISREC-EPFL in Europe) offers two ways to submit similarity search queries: using a regular e-mail program or using the WWW browsers Mosaic or Netscape to fill in a form. The NCBI server is used as an example. With the e-mail server, queries are sent from a local machine to blast@NCBI.nlm.nih.gov, and the results of the search are provided as an e-mail reply. Using Mosaic or Netscape, use the WWW entry point http://www.NCBI.nlm.nih.gov/ to navigate to a form in which to describe the search to be performed interactively. The results are subsequently displayed on a new form. The e-mail server and the WWW protocols are two different interfaces to a common query engine, and the same limited set of values and parameters applies to both. The use of the WWW form is self-explanatory. The search parameters are set up using buttons or modifiable fields, the query sequence is pasted in, and the request is submitted interactively. The form protocol is well suited to isolated requests, small query sequences, and short execution times. It is not suited to finding exons within a long 50-kb genomic sequence, or among many short fragments (multiple candidate exons or even template sequences [58]). The e-mail protocol, an off-line process (no need to wait for a connection and to stay connected throughout the search), is better suited to such needs. Also, it can be used from the most primitive platforms (for instance, personal computers). In the following, examples of exon detection by similarity search are presented in this context.
Before proceeding, it is worthwhile to emphasize that the correct usage of any protocol and the meaningful interpretation of the search results demand a minimal understanding of the algorithm behind the Blast program, and of its options and parameters. The official Blast manual is simply obtained by addressing the word “help” (in the body of the message, not in the subject line) to the e-mail server (blast@NCBI.nlm.nih.gov). A badly formatted query will elicit the same response. A complete WWW Blast notebook can also be found at http://www.NCBI.nlm.nih.gov/BLAST/.
3.1.1. Scanning the dbEST Database
To search the EST database with a genomic sequence query, we simply compose an e-mail message with a set of directives, followed by the query sequence:
DATALIB dbest
PROGRAM blastn
EXPECT 0.0001
HISTOGRAM 0
>ANY-QUERY-SEQUENCE-NAME
agtactttganaggctgagatgagagaatcncttgagccctggagttccagac
caacatgggnaacatagcaagatcncaccttttaaaaaaaaaaaaaaaaaaaaa
aaaaaagctncgg ........................
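A message of this kind is easy to generate programmatically, which helps when a long query has to be cut into overlapping pieces (the chapter recommends pieces such as 1-6000, 5001-11,000, and 10,001-16,000 later in this section). The following is a minimal sketch, assuming only the directive names shown above plus the STRAND directive mentioned in the text; the function names, default values, and the placement of STRAND within the message are illustrative assumptions, not part of the server documentation.

```python
def overlapping_pieces(length, piece=6000, overlap=1000):
    """Yield 1-based inclusive (start, end) coordinates that cover 1..length."""
    if piece <= overlap:
        raise ValueError("piece must be longer than overlap")
    if length <= 0:
        return
    start = 1
    while True:
        end = min(start + piece - 1, length)
        yield start, end
        if end == length:
            break
        start += piece - overlap          # the next piece re-reads `overlap` residues


def blast_email_body(name, sequence, datalib="dbest", program="blastn",
                     expect="0.0001", strand=None):
    """Build the text of one e-mail query: directives, then Pearson/FASTA sequence."""
    lines = [f"DATALIB {datalib}", f"PROGRAM {program}",
             f"EXPECT {expect}", "HISTOGRAM 0"]
    if strand:                            # "plus" or "minus"; position is an assumption
        lines.append(f"STRAND {strand}")
    lines.append(f">{name}")
    # Sequence lines must not exceed 80 characters each.
    lines.extend(sequence[i:i + 80] for i in range(0, len(sequence), 80))
    return "\n".join(lines) + "\n"


# One message per overlapping piece of a (toy) 16,000-nucleotide query.
query = "acgt" * 4000
for start, end in overlapping_pieces(len(query)):
    body = blast_email_body(f"query_{start}_{end}", query[start - 1:end], strand="plus")
    print(body.splitlines()[5], len(body.splitlines()) - 6, "sequence lines")
```

Sending the messages is left to the local mail program; the sketch only prepares the text, so the access policy of the server remains under the user's control.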
The sequence is given according to the “Pearson-Fasta” format, with a header line starting with “>” followed by lines of sequence no longer than 80 characters each, separated by hitting the usual <RETURN> or <ENTER> keys. At the end, the mail request is terminated by the usual “send message” sequence (for instance, “.” at the beginning of a line, for the standard UNIX mail program). This e-mail request is identical to a search submitted from the WWW form using all default parameters, except for the EXPECT value, set at its lowest (0.0001) for the most stringent search (possible in this context). Following this request, the search is performed using both strands of the query, the sequence as given (“plus” strand), and its reverse complement (“minus” strand). Figure 3 shows an edited output corresponding to this search. Even though we used the most stringent EXPECT value allowed by the form, more than 2000 matches are detected and 500 (the default) reported. There is a sharp transition in the score values, from a few matches above 1000 to several hundred below 400. All scores above 1000 correspond to ESTs matching 400-200 nucleotides within the last exon (a 3-kb 3'-UTR) of the Kallmann transcript (4). Scores below are dominated by Alu-containing ESTs matching an Alu element in the query sequence. A score threshold of 400 would thus appear appropriate to get rid of most of the “noise.” However, some short, but biologically meaningful alignments between ESTs and coding exons correspond to scores below 400 (Fig. 3C), a score still far above the threshold of statistical significance. Section 4 describes a solution to this dilemma. Using e-mail rather than the WWW form gives better control over the parameters, including the CUTOFF score and EXPECT value. However, many other options useful in decreasing the volume of the output are still not accessible. This is the main incentive to have the Blast suite installed on one’s own computer and run it locally (see Section 3.2.). When a dbEST match identifies a candidate exon, some extra information is needed to resolve the ambiguity in the orientation of transcription. It can be the
orientation of the cDNA libraries, the polarity of the matching EST sequences, or at best, the orientation of the acceptor and donor splice sites flanking the matched segments. The similarity of one of the putative reading frames with a known protein is also a very good hint (but one must be aware that frame shifts abound in databases [5,6]). In the context of exon detection, it is a good idea to separate the analysis of the direct and reverse strands into two distinct messages to the server. For this, simply add the directive STRAND plus or STRAND minus to the previous example. In the case of a long query sequence (10 kb or more), it is also preferable to cut it into several pieces overlapping by 1000 residues, such as 1-6000, 5001-11,000, 10,001-16,000, and so on, submitted in separate messages to the server. The job will be easier to handle for the server, the response time faster, and the reply easier to store, read, and interpret. Alternatively, one may decide to preprocess the genomic query sequence into candidate ORFs (above a certain size) or candidate exons (ORFs flanked by reasonable splice sites), and submit them in separate mail messages. However, submitting too many requests might conflict with the access policy of the server. The same problem will arise in the context of the “genome survey” approach (58), where each individual template sequence must be searched against dbEST. Being able to run multiple queries at will is another incentive for having the similarity searches run on a local computer.
3.1.2. Scanning the Swiss-Prot (or NR) Database
To detect exons on the basis of their potential protein homology, use the Blastx program (nucleotides vs amino acids), such as in:
DATALIB swissprot
PROGRAM blastx
EXPECT 0.001
FILTER XNU+SEG
HISTOGRAM 0
>k.1.8000~1842~1983 size:142 phase:0
gatcattatgtcctaacagtgcccaatctgagaccatctactc~taccga
ctggaagtgcaagtgctgaccccaggaggggaggggccggccaccatcaag
acgttccggacgccggagctcccaccctcttcagcacaca
The test sequence used here is a candidate internal coding exon, i.e., an ORF flanked by reasonable splice sites. The Exondb program can generate all candidates of a given minimal size that obey the splice site consensus to a given stringency (J.-M. Claverie, unpublished). Of course, there are many putative candidates (as expected from Fig. 2), and as many separate mail messages have to be sent to submit them all. Alternatively, the whole genomic
sequence or overlapping pieces of it can be sent at once. The presence of splice junctions will have to be verified a posteriori. Here again, the output can be dominated by Alu-induced matches. The problem of Alu-derived sequences in protein data banks has been discussed in detail elsewhere (40,62). The “FILTER” directive is absolutely necessary to eliminate a large number of parasite matches involving short-period repeats or composition-biased segments both in the database entries and in the putative translation of the genomic query sequence. The principle and usage of the XNU (63) and SEG (64) filters have been discussed elsewhere (26,65-67). For now, just verify that detecting even short coding exons is indeed possible. Figure 4 shows an edited output of this search. If the two best matches resulting from the Kallmann gene product itself (now in the databases) are removed, four significant matches clearly suggest that this short exon is real. The matches are the result of an underlying Fibronectin type III repeat, a motif found in many membrane proteins. This alignment was sufficient to lead to the identification of the gene responsible for Kallmann syndrome (4).
3.2. Similarity Search on a Local Computer
Having the search run on one’s local computer gives one full control over how and when to run the exon detection search. The price to be paid is having to install both the search programs (here the Blast suite) and the Swiss-Prot and dbEST databases. A simple description of the various protocols used to exchange and download files across the Internet can be found in many books, such as refs. 68,69. The following is a quick overview of how to do it, using again the NCBI as an example.
3.2.1. Anonymous FTP: Copying Database and Program
“Anonymous FTP” is a ubiquitous, simple, and reliable way to download software and databases from many “FTP sites” in the world, including NCBI. To download Swiss-Prot and dbEST, first connect to the NCBI server from any UNIX workstation on the Internet: ftp NCBI.nlm.nih.gov. The NCBI server then prompts for a user name, and the mandatory answer is “anonymous,” with one’s own Internet address as a password. Once in, navigate the anonymous FTP site using the three following basic commands: pwd for “print working directory,” i.e., where are we in the server?; cd for “change directory,” to follow the path toward the desired files; and ls for “list,” to list the content of the current server directory. Once at the right place, just copy the file using get <name of file>. If this file is not a simple text (readable) file (e.g., xxxx.tar.Z files), it might be necessary to use the binary command beforehand. Then navigate to the next directory of interest, until finished, and quit using bye or quit. Figure 5
A  SWISS-PROT

ftp> open expasy.hcuge.ch
Connected to expasy.hcuge.ch.
220 expasy FTP server (Version wu-2.4(24) Wed Feb 15 09:09:41 MET 1995) ready.
Name (expasy.hcuge.ch:jmc): anonymous
331 Guest login ok, send your complete e-mail address as password.
Password: [email protected]
230-This is the ExPASy molecular biology anonymous FTP server
230-of the University of Geneva, Switzerland.
230 Guest login ok, access restrictions apply.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> ls
200 PORT command successful.
150 Opening ASCII mode data connection for /bin/ls.
total 14
..........
drwxrwxr-x  9 owner  660       512 Apr 11  1994 databases
..........
drwxr-xr-x 10 owner  600       512 Aug 28 09:48 pub
..........
226 Transfer complete.
ftp> cd databases
..........
ftp> ls
..........
drwxrwxr-x  6 owner  600      1536 Mar 21 10:22 swiss-prot
..........
ftp> cd swiss-prot
ftp> ls
..........
..........             77544697 Mar 20 10:14 sprot31.dat
-rw-rw-r--  1 501    600  30574559 Mar 20 09:19 sprot31.dat.Z
..........
ftp> get sprot31.dat.Z
local: sprot31.dat.Z remote: sprot31.dat.Z
200 PORT command successful.
150 Opening BINARY mode data connection for sprot31.dat.Z (30574559 bytes).
<CTRL-Z>
Suspended
bg
[1] ftp expasy.hcuge.ch &

Fig. 5. Use of anonymous FTP to download databases and search software. Commands entered by the user are in bold italic. Irrelevant lines have been deleted (dotted lines) from the actual output. (A) Downloading Swiss-Prot 31 from the ExPASy server. The file of interest (sprot31.dat.Z) is located using the pwd, cd, and ls commands. File transfer is initiated using the get command. For large files, FTP transfers are better run in the background (using Control-z, followed by the bg command).
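The interactive session of panel A can also be scripted. The sketch below uses Python's standard `ftplib` module and assumes the host, directory, and file names of the 1995 transcript, which are shown for illustration only and may no longer be served; the helper names `fetch_binary` and `download_anonymous` are hypothetical.

```python
from ftplib import FTP


def fetch_binary(ftp, directory, filename, local_path):
    """Change to `directory` on an open FTP session and copy `filename` in binary mode."""
    ftp.cwd(directory)                                  # same role as "cd" in the transcript
    with open(local_path, "wb") as out:
        # retrbinary sets the binary (image) transfer type itself,
        # which compressed .Z files require.
        ftp.retrbinary(f"RETR {filename}", out.write)   # same role as "get"


def download_anonymous(host, directory, filename, local_path, email):
    """Log in as "anonymous" (e-mail address as password) and fetch one file."""
    with FTP(host) as ftp:
        ftp.login(user="anonymous", passwd=email)
        fetch_binary(ftp, directory, filename, local_path)


# Example call mirroring Fig. 5A (left commented out: it needs network access
# and the historical server name):
# download_anonymous("expasy.hcuge.ch", "databases/swiss-prot",
#                    "sprot31.dat.Z", "sprot31.dat.Z", "user@example.org")
```

Keeping the transfer step in its own function, separate from the login, makes it possible to fetch several files in one session and to exercise the logic without a live server.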
B  dbEST & BLAST

ftp ncbi.nlm.nih.gov
Connected to ncbi.nlm.nih.gov.
220-Welcome to the NCBI FTP Server (ncbi.nlm.nih.gov)
Name (ncbi.nlm.nih.gov:jmc): anonymous
331 Guest login ok, send your complete e-mail address as password.
Password: [email protected]
230 Guest login ok, access restrictions apply.
..........
ftp> ls
200 PORT command successful.
150 Opening ASCII mode data connection for /bin/ls.
total 8156
..........
drwxr-sr-x  6 4166  228       512 Aug  9 21:58 blast
drwxrwsr-x  7441    514      1024 Aug 15 19:24 genbank
drwxrwsr-x  440     1        1024 Aug 12 04:28 pub
drwxr-xr-x  474034  0        1024 Aug  3 20:17 repository
..........
226 Transfer complete.
ftp> cd repository
ftp> ls
..........
drwxrwsr-x  5 3043  901       512 Sep  2 03:15 dbEST
..........
ftp> cd dbEST
ftp> ls
..........
-rw-r--r--  1 3043  901  51736893 Sep  1 09:54 dbEST.083195.Z
..........
ftp> get dbEST.083195.Z
ftp> cd /blast
..........            16 Oct 11  1994 1.3 -> archive/94-10-11
-rw-r--r--  1 4166  228       102 Jul 31 17:27 BUGS
-rw-r--r--  1 4166  228      5684 Jul 31 17:27 INSTALL
-rw-r--r--  1 4166  228      6426 Jul 31 17:27 README
-rw-r--r--  1 4166  228    120637 Jul 31 17:25 blastapp.tar.Z
drwxr-sr-x  7 4166  228       512 Aug  9 20:42 executables
..........
ftp> get blastapp.tar.Z

Fig. 5. (continued) (B) Downloading dbEST and Blast from the NCBI server. NCBI main resources are found in directories named “blast,” “genbank,” “repository,” and “pub.” Here is shown how to locate and retrieve the compressed dbEST database (dbEST.xxxxxx.Z) of the day (08/31/95), as well as the source code for the Blast suite of programs.
C  Filtering tools

ftp> cd /pub/jmc
ftp> ls
-rw-r--r--  1 0  228   332 Oct  5  1992 README
drwxr-sr-x  2 0  228   512 Jan  3  1995 alu
drwxr-sr-x  2 0  228   512 Feb  1  1994 frame-shifts
drwxr-sr-x  2 0  228   512 Aug  1  1993 orf
drwxr-sr-x  2 0  228   512 Aug  1  1993 xblast
drwxr-sr-x  2 0  228   512 Oct 29  1993 xnu

Fig. 5. (continued) (C) Some other useful tools. Additional resources are available in the NCBI “pub” directory. This includes “junk” databases (e.g., alu) and a variety of query masking/filtering tools (e.g., xnu, xblast, etc.).
shows the edited transcript of FTP sessions downloading the Swiss-Prot database from the ExPASy server, and the Blast software and dbEST database from NCBI.
3.2.2. Installing Databases and the Blast Suite
It is recommended to download the compressed versions of the database files (sprotxx.dat.Z, >31 Mbytes, and dbEST.xxxxxx.Z, >48 Mbytes). They are then uncompressed by the usual UNIX tools. The next step is to convert the “sprot” file into the required “Pearson-Fasta” format. For this, use the program sp2fasta, the source code of which is found in the compressed blastapp.tar.Z file downloaded from the NCBI (see Fig. 5). The awk language is also well suited to writing simple converters of this kind. Compiling the Blastn and Blastx programs (in C language) is easily done following the instructions in the various README and INSTALL files. Accessory programs, such as Setdb and Pressdb, are also essential for the preprocessing of the databases.
3.2.3. Running Blast Locally
Once converted into fasta-formatted files, the dbEST and Swiss-Prot data banks must be preprocessed to make them accessible by all Blast programs. Pressdb is used for dbEST (and other nucleotide sequence files) and Setdb for Swiss-Prot (amino acid sequences). The preprocessing produces three new files per database (dbEST.nhd, dbEST.ntb, dbEST.csq, and sprot.ahd, sprot.atb, sprot.bsq), which are the files truly used by Blastn and Blastx, respectively.
To run one’s searches, one now has access to the full range of options (indicated by the content of the various square brackets) provided for Blastn and Blastx:
blastn ntdb ntquery [E=#] [S=#] [E2=#] [S2=#] [W=#] [T=#] [X=#] [ [[M=matchscore][N=mismatchpenalty]][-matrix scorefile] ] [Y=#] [Z=#] [H=#] [V=#] [B=#] [[-top][-bottom]] [-sort-by ]
blastx aadb ntquery [E=#] [S=#] [E2=#] [S2=#] [W=#] [T=#] [X=#] [-matrix scorefile] [Y=#] [Z=#] [C=#] [H=#] [V=#] [B=#] [[-top][-bottom]] [-sort-by.. ]
The full description of those options can be obtained from blast@NCBI.nlm.nih.gov (see Section 2.2.1.). Among the options not accessible from the e-mail server, the most important is the [S2=#] parameter. Without this option, Blast is authorized to combine the probability of multiple low-quality (short) alignments and to retain them if they reach the same significance required from the solid uninterrupted matches. In practice, this feature produces a large number of false-positive matches, especially with Blastn. In the context of exon identification, it is desirable to focus on solid alignments spanning most of a high-quality EST sequence and to inhibit the report of piecewise alignments. To force all reported alignments to achieve the same score threshold, we use: blastn dbEST q-exon S=300 S2=300, where the file q-exon contains the genomic sequence (or a large fragment of it, or an isolated candidate exon; see Section 2.2.1.) in Fasta-Pearson format. Requesting an alignment score of at least 300 imposes the perfect match of at least 60 nucleotides. Even though the associated p-value is theoretically very small, the repetitive nature of human DNA makes alignments with lower scores very unreliable for exon prediction. Another problem, the ubiquity of Alu-repeat matches in that range of scores (Fig. 3), will be corrected using a masking technique discussed in Section 4. The combination of small alignments is less of a problem for amino acid alignments. However, they can still dominate the output.
To obtain a cleaner output, a typical Swiss-Prot search will be launched by:

    blastx sprot q-exon S=70 S2=70 -matrix blosum62

Amino acid scoring matrices, such as blosum62, are found in a separate directory in the blastapp distribution file. Running Blast locally also allows one to set up multiple query searches at will, using simple awk programs or UNIX shell scripts (26,65). In the local mode of operation, it becomes easier first to analyze the genome sequence in terms of individual candidate ORFs or exons, and then run this subset of query sequences against dbEST or Swiss-Prot (see 26,65). The multiple-query mode is also necessary in the context of a low-coverage genome survey (58), where each gel reading, as it becomes available, can be used as an individual query.
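The multiple-query mode described above can be sketched in a few lines. The text mentions awk programs or UNIX shell scripts; the Python version below is a substitute chosen for illustration, not the author's scripts. It splits a multi-entry Fasta file into one file per candidate exon and prepares one command line per query, using the same arguments as the examples above. The function names and the query_N.fa file naming are hypothetical, and the commands are only built, not executed.

```python
from pathlib import Path
import tempfile


def split_fasta(text):
    """Split multi-entry Fasta text into (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def build_command(program, database, query_path, score, matrix=None):
    """Build an argument list such as: blastn dbEST q-exon S=300 S2=300."""
    cmd = [program, database, str(query_path), f"S={score}", f"S2={score}"]
    if matrix:
        cmd += ["-matrix", matrix]
    return cmd


def plan_searches(fasta_text, workdir, program="blastn", database="dbEST", score=300):
    """Write one query file per Fasta entry and return the commands to run."""
    workdir = Path(workdir)
    commands = []
    for i, (header, seq) in enumerate(split_fasta(fasta_text), start=1):
        path = workdir / f"query_{i}.fa"  # hypothetical naming scheme
        path.write_text(f">{header}\n{seq}\n")
        commands.append(build_command(program, database, path, score))
    return commands


if __name__ == "__main__":
    demo = ">exon1\nACGTACGT\nACGT\n>exon2\nGGCCGGCC\n"
    with tempfile.TemporaryDirectory() as tmp:
        for cmd in plan_searches(demo, tmp):
            print(" ".join(cmd))
```

Each returned list could then be handed to a job scheduler or to subprocess.run on a machine where the Blast programs and databases are installed.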
4. Notes
1. The problem of the overwhelming output: Most of the time, Blast searches on large databases (or, for that matter, similarity searches using any program) produce very large outputs, within which biologically significant results are easily buried. The reasons for this are multiple and have been reviewed elsewhere (26,40,63-67). Misleading alignments are produced because of the numerous erroneous entries in the databases (e.g., bogus proteins translated from vector or repeat sequences), some anomalous features in ESTs (e.g., genomic contaminants, Alu repeats, introns [70]), or some intrinsic properties of biological sequences (e.g., “low entropy” or “simple” sequences). Efficient methods (26,40,63-67) for correcting these problems have been developed, and involve various levels of filtering and masking of query and database sequence segments. Their full application requires the local mode of operation, since both the query and the target database may require processing.

2. Getting rid of Alu matches: When analyzing human genomic sequences, matches with the ubiquitous Alu repeats account for most of the output and may obscure short matches with real exons. This was the case when analyzing the Kallmann contig (Fig. 3A). This problem is easily cured with the following protocol. First search the query (a X000-nucleotide segment of the contig) against the Alu reference database (62) available at the NCBI using the e-mail server:

    DATALIB alu
    PROGRAM blastn
    CUTOFF 250
    HISTOGRAM 0
    >My-Query-name
    agtactttganaggctgagatgagagaatcncttgagccctggagttccagac
    caacatgggnaacatagcaagatcncaccttttaaaaaaaaaaaaaaaaaaaa
    aaaaaaagctncgg . .. ... .. .. .. ... .. .. .. ..

Once received, the e-mail output is stored in a file “aluoutput” (consult the local e-mail program manual to learn how). Then locally use the Xblast program (65,66) (this requires access to a workstation running UNIX) to make a new query:

    xblast aluoutput old.query.file N > masked.query
In masked.query, the Alu-like segments are now replaced by the neutral letter “N.” Using this processed query to search dbEST, with the same search parameters as before (except for adding the STRAND TOP directive), keeps only four of the thousand matches previously reported (Fig. 3). All of them correctly identify the last Kallmann exon (mostly 3’-UTR). In other, less favorable situations, additional masking searches might be needed to get rid of all spurious matches, in particular with simple (i.e., microsatellite) sequences (66). In the case of exon identification using Blastx against a protein database, the low entropy filters (63-67) should be used systematically.

3. Mixing similarity, content-based, and signal-based methods: In recent versions, information from database matches has been incorporated as additional input to
programs, such as Grail (22) and GeneParser (6), originally designed around a set of content- (hexamers) and signal- (ORFs, splice sites, etc.) based measures. The combination of similarity data with the other coding measures is not simple, since its relative weight should depend on its own strength, which is not an easy scheme to implement. In the case of unambiguous matches, such as those in Fig. 3, nothing else should matter, and the candidate exon should be reported even though none of the other criteria are met (as in the case of 3’-UTRs). On the other hand, less spectacular matches (such as in Fig. 4), or even good matches to erroneous database entries, should be weighted down or even discarded. It is best to maintain a clear separation between content/signal-based methods and identification by database similarity. Until the databases are complete, many exons will still be detected from their statistical properties alone, but true exons with anomalous features (for instance, in AT-rich regions) are also being discovered from similarity search. Keeping similarity information separate adds serendipity to the conservatism of content/signal-based methods.
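The masking step of Note 2 (replacing Alu-like segments of the query by the neutral letter “N”) can be illustrated with a short sketch. This is not the Xblast program itself, whose internals are not described here: the function name is hypothetical, and the match coordinates are assumed to have already been extracted from the Blast output as 1-based, inclusive query positions.

```python
def mask_segments(sequence, segments, mask_char="N"):
    """Return a copy of sequence with each matched segment replaced by mask_char.

    segments is a list of (start, end) pairs in 1-based, inclusive query
    coordinates (an assumption about how the matches were extracted).
    Pairs given in reverse order, as may happen for minus-strand matches,
    are swapped; coordinates outside the sequence are clipped.
    """
    masked = list(sequence)
    length = len(masked)
    for start, end in segments:
        if start > end:
            start, end = end, start
        first = max(start, 1)
        last = min(end, length)
        for i in range(first - 1, last):
            masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    query = "ACGTACGTAC"
    # Hypothetical Alu-like matches at positions 3-5 and 9-8 (reversed strand).
    print(mask_segments(query, [(3, 5), (9, 8)]))
```

The masked sequence keeps its original length, so coordinates reported in later searches still refer to positions in the unmasked contig.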
Acknowledgments
I thank Chantal Abergel and Daniel Gautheret for their help in making this chapter more accurate and easier to read.

References
1. Senapathy, P., Shapiro, M. B., and Harris, N. L. (1990) Splice junctions, branch point sites, and exons: sequence statistics, identification, and applications to genome project. Methods Enzymol. 183, 252-278.
2. Stormo, G. D. (1990) Consensus patterns in DNA. Methods Enzymol. 183, 211-221.
3. Brunak, S., Engelbrecht, J., and Knudsen, S. (1991) Prediction of human mRNA donor and acceptor sites from the DNA sequence. J. Mol. Biol. 220, 49-65.
4. Legouis, R., Hardelin, J.-P., Levilliers, J., Claverie, J.-M., Compain, S., Wunderle, V., Millasseau, P., Le Paslier, D., Cohen, D., Caterina, D., Bougueleret, L., Lutfalla, G., Weissenbach, J., and Petit, C. (1991) The candidate gene for the X-linked Kallmann syndrome encodes a protein related to adhesion molecules. Cell 67, 423-435.
5. Hawkins, J. D. (1988) A survey of intron and exon lengths. Nucleic Acids Res. 21, 9893-9908.
6. Snyder, E. E. and Stormo, G. D. (1995) Identification of protein coding regions in genomic DNA. J. Mol. Biol. 248, 1-18.
7. Grantham, R., Gautier, C., Gouy, M., Mercier, R., and Pave, A. (1980) Codon catalog usage and the genome hypothesis. Nucleic Acids Res. 8, r49-r60.
8. Staden, R. (1990) Finding protein coding regions in genomic sequences. Methods Enzymol. 183, 163-180.
9. Shepherd, J. C. W. (1981) Method to determine the reading frame of a protein from the purine/pyrimidine genome sequence and its possible evolutionary justification. Proc. Natl. Acad. Sci. USA 78, 1596-1600.
10. Shepherd, J. C. W. (1990) Ancient patterns in nucleic acid sequences. Methods Enzymol. 183, 180-192.
11. Fickett, J. W. (1982) Recognition of protein coding regions in DNA sequences. Nucleic Acids Res. 10, 5303-5318.
12. Claverie, J.-M. and Bougueleret, L. (1986) Heuristic informational analysis of sequences. Nucleic Acids Res. 14, 179-196.
13. Beckmann, J. S., Brendel, V., and Trifonov, E. N. (1986) Intervening sequences exhibit distinct vocabulary. J. Biomol. Struct. Dynamics 4, 391-400.
14. Borodovsky, M., Sprizhitskii, Y. A., Golovanov, E. I., and Aleksandrov, A. A. (1986) Statistical patterns in primary structure of the functional regions of the genome in E. coli III. Computer recognition of coding regions. Molekulyarnaya Biologiya 20, 1390-1398.
15. Fickett, J. W. and Tung, C.-S. (1992) Assessment of protein coding measures. Nucleic Acids Res. 20, 6441-6450.
16. Claverie, J.-M., Sauvaget, I., and Bougueleret, L. (1990) k-Tuple frequency analysis: from intron/exon discrimination to T-cell epitope mapping. Methods Enzymol. 183, 237-252.
17. Bougueleret, L., Tekaia, F., Sauvaget, I., and Claverie, J.-M. (1988) Objective comparison of exon and intron sequences by the mean of 2-dimensional data analysis methods. Nucleic Acids Res. 16, 1729-1738.
18. Borodovsky, M. Y., Rudd, K. E., and Koonin, E. V. (1994) Intrinsic and extrinsic approaches for detecting genes in a bacterial genome. Nucleic Acids Res. 22, 4756-4767.
19. Fields, C. A. and Soderlund, C. A. (1990) Gm: a practical tool for automating DNA sequence analysis. Comp. Appl. Biol. Sci. 6, 263-270.
20. Iris, F. J. M., Bougueleret, L., Prieur, S., Caterina, D., Primas, G., Perrot, V., Jurka, J., Rodriguez-Tome, P., Claverie, J.-M., Cohen, D., and Dausset, J. (1993) Dense Alu clustering and a potential new member of the NF-kappa B family within a 90 kb HLA class III segment. Nature Genet. 3, 137-145.
21. Uberbacher, E. C. and Mural, R. J. (1991) Locating protein-coding regions in DNA sequences by a multiple sensor-neural approach. Proc. Natl. Acad. Sci. USA 88, 11,261-11,265.
22. Xu, Y., Einstein, J. R., Mural, R. J., Shah, M. B., and Uberbacher, E. C. (1994) Recognizing exons in genomic sequence using Grail II, in Genetic Engineering: Principles and Methods (Setlow, J., ed.), Plenum, New York, pp. 241-253.
23. Sulston, J., Du, Z., Thomas, K., Wilson, R., Hillier, L., Staden, R., Halloran, N., Green, P., Thierry-Mieg, J., Qiu, L., et al. (1992) The C. elegans genome sequencing project: a beginning. Nature 356, 37-41.
24. Guigo, R., Knudsen, S., Drake, N., and Smith, T. F. (1992) Prediction of gene structure. J. Mol. Biol. 226, 141-157.
25. Snyder, E. E. and Stormo, G. D. (1993) Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks. Nucleic Acids Res. 21, 607-613.
26. Claverie, J.-M. (1995) Progress in large scale sequence analysis, in Advances in Computational Biology, vol. 2 (Villar, H., ed.), JAI, London, pp. 161-208.
27. Lopez, R., Larsen, F., and Prydz, H. (1994) Evaluation of the exon prediction of the Grail software. Genomics 24, 133-136.
28. Hunkapiller, T., Kaiser, R. J., Koop, B. F., and Hood, L. (1991) Large-scale and automated DNA sequence determination. Science 254, 59-67.
29. Olson, M. V. (1993) The human genome project. Proc. Natl. Acad. Sci. USA 90, 4338-4344.
30. Nowak, R. (1995) Bacterial genome sequence bagged [news]. Science 269, 468-470.
31. Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., Bult, C. J., Tomb, J.-F., Dougherty, B. A., Merrick, J. M., et al. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269, 496-512.
32. Adams, M. D., Kelley, J. M., Gocayne, J. D., Dubnick, M., Polymeropoulos, M. H., Xiao, H., Merril, C. R., Wu, A., Olde, B., Moreno, R. F., et al. (1991) Complementary DNA sequencing: expressed sequence tags and human genome project. Science 252, 1651-1656.
33. Adams, M. D., Dubnick, M., Kerlavage, A. R., Moreno, R. F., Kelley, J. M., Utterback, T. R., Nagle, J. W., Fields, C. A., and Venter, J. C. (1992) Sequence identification of 2,375 human brain genes. Nature 355, 632-634.
34. Adams, M. D., Kerlavage, A. R., Fields, C., and Venter, J. C. (1993) 3,400 new expressed sequence tags identify diversity of transcripts in human brain. Nature Genet. 4, 256-267.
35. Adams, M. D., Soares, M. B., Kerlavage, A. R., Fields, C., and Venter, J. C. (1993) Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nature Genet. 4, 373-380.
36. Merck releases first “gene index” sequences [news] (1995) Nature 373, 549.
37. Benson, D. A., Boguski, M., Lipman, D. J., and Ostell, J. (1994) GenBank. Nucleic Acids Res. 22, 3441-3444.
38. Boguski, M. S., Lowe, T. M., and Tolstoshev, C. M. (1993) dbEST-database for “expressed sequence tags.” Nature Genet. 4, 332,333.
39. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic local alignment search tool. J. Mol. Biol. 215, 403-410.
40. Claverie, J.-M. (1992) Identifying coding exons by similarity search: Alu-derived and other potentially misleading protein sequences. Genomics 12, 838-841.
41. Gish, W. and States, D. J. (1993) Identification of protein coding regions by database similarity search. Nature Genet. 3, 266-272.
42. Claverie, J.-M. (1994) A streamlined random sequencing strategy for finding coding exons. Genomics 23, 575-581.
43. Oliver, S. G., van der Aart, Q. J., Agostoni-Carbone, M. L., Aigle, M., Alberghina, L., Alexandraki, D., Antoine, G., Anwar, R., Ballesta, J. P., Benit, P., et al. (1992) The complete DNA sequence of yeast chromosome III. Nature 357, 38-46.
44. Dujon, B., Alexandraki, D., Andre, B., Ansorge, W., Baladron, V., Ballesta, J. P., Banrevi, A., Bolle, P. A., Bolotin-Fukuhara, M., Bossier, P., et al. (1994) Complete DNA sequence of yeast chromosome XI. Nature 369, 371-378.
45. Wilson, R., Ainscough, R., Anderson, K., Baynes, C., Berks, M., Bonfield, J., Burton, J., Connell, M., Copsey, T., Cooper, J., et al. (1994) 2.2 Mb of contiguous nucleotide sequence from chromosome III of C. elegans. Nature 368, 32-38.
46. Green, P., Lipman, D., Hillier, L., Waterston, R., States, D., and Claverie, J.-M. (1993) Ancient conserved regions in new gene sequences and the protein databases. Science 259, 1711-1716.
47. Claverie, J.-M. (1993) Database of ancient sequences. Nature 364, 19,20.
48. Bairoch, A. and Boeckmann, B. (1994) The SWISS-PROT protein sequence database: current status. Nucleic Acids Res. 22, 3578-3580.
49. Brockdorff, N., Ashworth, A., Kay, G. F., McCabe, V. M., Norris, D. P., Cooper, P. J., Swift, S., and Rastan, S. (1992) The product of the mouse Xist gene is a 15 kb inactive X-specific transcript containing no conserved ORF and located in the nucleus. Cell 71, 515-526.
50. Brannan, C. I., Dees, E. C., Ingram, R. S., and Tilghman, S. M. (1990) The product of the H19 gene may function as an RNA. Mol. Cell. Biol. 10, 28-36.
51. Velleca, M. A., Wallace, M. C., and Merlie, J. P. (1994) A novel synapse-associated noncoding RNA. Mol. Cell. Biol. 14, 7095-7104.
52. Askew, D. S., Li, J., and Ihle, J. N. (1994) Retroviral insertions in the murine His-1 locus activate the expression of a novel RNA that lacks an extensive open reading frame. Mol. Cell. Biol. 14, 1743-1751.
53. Fichant, G. A. and Burks, C. (1991) Identifying potential genes in genomic DNA sequences. J. Mol. Biol. 220, 659-671.
54. States, D. J., Gish, W., and Altschul, S. F. (1991) Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. Methods 3, 64-70.
55. Altschul, S. F. (1991) Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219, 555-565.
56. Claverie, J.-M. (1993) Detecting frame shifts by amino acid sequence comparison. J. Mol. Biol. 234, 1140-1157.
57. Henikoff, S. and Henikoff, J. G. (1993) Performance evaluation of amino acid substitution matrices. Proteins 17, 49-61.
58. Claverie, J.-M. (1994) A streamlined random sequencing strategy for finding coding exons. Genomics 23, 575-581.
59. Rice, C. M. and Cameron, G. N. (1994) Submission of nucleotide sequences data to EMBL/GenBank/DDBJ. Methods Mol. Biol. 24, 355-366.
60. Pearson, W. R. (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 183, 4698-4702.
61. Sturrock, S. and Collins, J. (1993) MPsrch version 1.3. Biocomputing Research Unit, University of Edinburgh, UK.
62. Claverie, J. M. and Makalowski, W. (1994) Alu alert. Nature 371, 752-752.
63. Claverie, J. M. and States, D. (1993) Information enhancement methods for large scale sequence analysis. Computers Chem. 17, 191-201.
64. Wootton, J. C. and Federhen, S. (1993) Statistics of local complexity in amino acid sequences and sequence databases. Computers Chem. 17, 149-163.
65. Claverie, J.-M. (1994) Large scale sequence analysis, in Automated DNA Sequencing and Analysis Techniques (Adams, M. D., Fields, C., and Venter, J. C., eds.), Academic, New York, pp. 267-279.
66. Claverie, J. M. (1996) Effective large scale sequence similarity searches. Methods Enzymol. 266, 212-227.
67. Altschul, S. F., Boguski, M. S., Gish, W., and Wootton, J. C. (1994) Issues in searching molecular sequence databases. Nature Genet. 6, 119-129.
68. Kehoe, B. P. (1996) Zen and the Art of the Internet: A Beginner’s Guide, 4th ed. Prentice Hall, Englewood Cliffs, NJ.
69. Swindell, S. R., Miller, R. R., and Myers, G., eds. (1996) Internet for the Molecular Biologist, Horizon Scientific, London, UK.
70. Burglin, T. R. and Barnes, T. M. (1992) Introns in sequence tags. Nature 357, 367-367.