Computer Analysis of Sequence Data Part 1 (Methods in Molecular Biology)


Author: Annette M. Griffin | Hugh G. Griffin


CHAPTER 1

Computer Analysis of Sequence Data

Hugh G. Griffin and Annette M. Griffin

1. Introduction

DNA sequencing methodology was developed in the late 1970s (1,2) and has become one of the most widely used techniques in molecular biology. The amount of DNA sequencing performed worldwide has increased exponentially each year and this trend will almost certainly continue for the next few years at least. The importance of this technique is underlined by the volume of research funds now being channelled into genome sequencing projects, and by the large sums of money being invested in the development of automated sequencers and sequence analysis systems.

Computers have always been an integral part of DNA sequence analysis. Many large sequencing projects would be virtually impossible without the aid of computers, and the international computer databanks are an essential facility for the storage and retrieval of sequence data. During the 1980s, while the original methods of sequencing were being refined and improved, fast and powerful computer programs were being developed to analyze the copious amounts of data being generated. Indeed, many sequencing projects, such as those utilizing the shotgun approach (3-6), depended on a computer for the assembly of the DNA sequence of random fragments. The development of automated sequencers also depended on the improved quality of computer hardware and software that were necessary for the interpretation of data produced by these instruments.

From: Methods in Molecular Biology, Vol. 24: Computer Analysis of Sequence Data, Part I. Edited by: A. M. Griffin and H. G. Griffin. Copyright ©1994 Humana Press Inc., Totowa, NJ


The molecular biologist of today has a vast array of computer programs and packages from which to choose. Not only are these faster and more powerful than earlier versions, but they are also much more “user friendly.” Many have easy-to-use help menus, and the competent scientist will not need any formal training to perform rapid, in-depth analysis of sequence data. Most of the widely used programs were originally developed for mainframe computers, but powerful sequence analysis packages are now available for IBM and IBM-compatible personal computers (PCs) and for the Apple Macintosh.

Two main factors will influence the scientist’s choice of computer and sequence analysis package: (a) what is already in his/her laboratory, and (b) what is affordable to buy. The molecular biologist new to the area of sequence analysis would be well advised to start by becoming familiar with systems already set up in his/her laboratory; additional programs and hardware can be acquired when a need is identified or when funds are available. For the scientist planning to purchase sequencing software, the decision on what to buy might be based on available hardware. The mainframe sequence analysis packages are primarily designed for multiuser use and take up a lot of disk space, while the programs developed for personal computers and Macintoshes are ideal for individual (or small group) use. Database searching is perhaps more readily performed on a mainframe, where the systems manager can make regular updates. A compact disk system is usually employed for database searches on a PC or Macintosh. In an ideal world, the molecular biologist would have access to all available systems and sequence analysis software!

2. Sequence Analysis Tasks

The types of sequence analysis tasks required of a software package by a molecular biologist vary depending on the individual investigator and his/her research interests. However, some of the most common functions required by almost all investigators are as follows:

1. Storage, retrieval, and editing of user-generated sequence data. This includes programs to assemble random fragments from shotgun projects (3-6), manipulation of sequences by deletion of segments, joining segments together, and preparation of publication quality output.


2. Construction of maps. Most investigators will want to be able to readily produce a restriction enzyme map of their sequenced DNA fragment or of a sequence retrieved from the databanks. Most packages can cope with linear or circular maps and can also search for specified targets other than restriction sites, e.g., promoters, inverted repeats, and consensus sequences.
3. Translation. Analysis of a DNA sequence for open reading frames or potential protein coding regions and translation of a DNA sequence into predicted amino acid sequence in any specified reading frame.
4. Protein analysis. Analysis of protein or deduced amino acid sequences. Many options are available, including prediction of molecular weight, hydropathy, antigenicity, and secondary structures.
5. Similarity searches. Search for sequences similar to the user-specified sequence. These searches can be conducted on a local user-generated databank or on the international databanks (EMBL, GenBank, DDBJ, PIR, Swissprot). Searches can be conducted with either DNA or amino acid sequences.
6. Alignment with a similar sequence. This is one of the most popular tasks a scientist asks of a sequence software package. An optimal alignment is achieved between two similar sequences (DNA or amino acid) and the percent identity and/or similarity calculated. Publication quality output is usually required.
7. Submission and retrieval. Most investigators will wish to submit their newly generated DNA sequence to the international databanks. Before the establishment of widely accessible international sequence databanks, the only method of disseminating DNA sequence information was by publication in scientific journals. The only way for another scientist to analyze the data was to enter it into his/her computer via the keyboard. As well as being laborious, this process inevitably caused inaccuracies. Most sequence data sent to the databanks are submitted in machine-readable form (e.g., by e-mail or on disk or tape), thus ensuring accuracy. In addition, there is a facility to submit data as soon as it becomes available but to disallow its release until the information appears in print in the scientific press. This procedure protects the scientist working in a highly competitive area, while permitting the electronic dissemination of scientific data simultaneously with the written version.

Even for the nonsequencer, the databanks can be important. For example, most previously sequenced genes, operons, promoters, and most of the commonly used cloning vectors (e.g., pUC19) can be


retrieved from the databanks. This sequence data can be of great use in the design of cloning and genetic manipulation experiments. Also, computer-generated codon usage tables can facilitate the design of probes or PCR primers from N-terminal protein sequences, perhaps reducing the amount of degeneracy required. The establishment and operation of the international sequence databanks have contributed immensely to the widespread dissemination and utilization of sequence data.

3. Aim of This Book

Most of the widely available sequence analysis packages for either mainframe or microcomputers are capable of performing the tasks mentioned earlier with varying degrees of ease and efficiency. As no single book of this nature could possibly cover all the available packages, we have chosen six of the most popular for inclusion in this two-volume work. Our choice of packages was mainly random, and the fact that a program or package is not mentioned does not imply that it is in any way inferior or less widely used than those that have been included. The Genetics Computer Group (GCG) and STADEN packages were developed for mainframe computers such as the Digital Equipment Corporation VAX running the VMS operating system; Microgenie and PC/Gene were designed for IBM-compatible PCs; DNA Strider and MacVector were written for the Macintosh. It should be noted that versions of these programs may well be available for use with computers other than those for which they were originally developed. For example, versions of the STADEN programs can run on a Macintosh and on a SUN workstation (using the UNIX operating system). In fact, current development is being conducted on the SUN version only. In addition, we have included a number of specialized programs to aid the handling and manipulation of sequence data and to improve the choice of programs capable of carrying out the most popular tasks.

Computer Analysis of Sequence Data is aimed at the competent molecular biologist who nonetheless is a newcomer to the field of DNA sequencing and data analysis. Each contribution is written such that a competent scientist who has only basic computer literacy can carry out the procedure successfully at the first attempt by simply following the detailed practical instructions that have been described


by the author. As we all know, even the simplest computer programs go wrong from time to time, and for this reason a “Notes” section has been included in most chapters. These notes will indicate any major problems or faults that can occur, the sources of the problems, and how they can be identified and overcome. A comprehensive reference section is also included in most chapters to enable the reader to refer to other publications for more detailed theoretical discussions on the various programs.

4. Mainframes, Personal Computers, and the Macintosh

The two mainframe packages described in Computer Analysis of Sequence Data are GCG (in Part I) and STADEN (in Part II). The GCG package is an extensive suite of sequence handling and analysis software. The package was originally built by John Devereux and Paul Haeberli from programs written for Oliver Smithies’ laboratory in the University of Wisconsin, Department of Genetics (7,8). Many of the programs within the current package were written by individual scientists (such as William Pearson, Michael Gribskov, and Michael Zuker), and some of these programs are described in Computer Analysis of Sequence Data by their original authors (see Chapter 26 in this volume and Chapters 22 and 29 in Part II), as well as being mentioned in the section on GCG (Chapters 2-14). However, all programs within the GCG package have been adapted to a certain uniform format that makes them particularly easy to use. All programs “look the same”: if you can run one, you can run them all. The output from each program is suitable for input to other programs in the package. Most new users find the interactive interface very easy to use.

The STADEN package is based on the programs of Rodger Staden (Medical Research Council, Laboratory of Molecular Biology, Cambridge, UK). The package is split into a relatively small number of large menu-driven programs. GIP uses a digitizer for entry of DNA sequences from autoradiographs. SAP and DAP consist of programs for assembling and editing gel readings. The STADEN package is widely acknowledged as one of the best in the world for management of random sequencing projects. The programs NIP and PIP provide functions for a comprehensive analysis of individual nucleotide and protein sequences, respectively, whereas SIP can compare and align


pairs of either protein or DNA sequences. NIPL, PIPL, and SIPL are programs designed to search sequence libraries. The user interface, which is common to all programs, consists of a set of menus and a uniform way of presenting choices and obtaining input from the user. Help is available by responding to a query with the symbol “?”. The X interface, involving pulldown menus, dialog boxes, and buttons, is not available on the VAX version. The STADEN programs are described in Chapters 2-14 of Part II.

There are now a large number of programs and packages, available either commercially or in the public domain, that operate on the powerful modern microcomputer. We have chosen two packages, Microgenie and PC/Gene, which operate on IBM or compatible machines, and two packages, DNA Strider and MacVector, which operate on the Macintosh. People in general, and scientists in particular, tend to have strong views on whether the Macintosh or the IBM-PC is superior. The choice of machine must be left up to the individual.

The disadvantage of using microcomputers for sequence analysis is their limited memory and computational power. This is not usually a problem for the majority of common sequence analysis tasks, although library searching is not as convenient as on a mainframe. In addition, whereas the international databases can be updated on a weekly basis relatively easily on a mainframe, this is not usually as conveniently performed on the microcomputer. An advantage of microcomputers is that they can be dedicated to the task of sequence analysis, thus performing some tasks faster than the mainframe, which is often asked to perform many functions simultaneously. Output from a microcomputer package is usually more readily incorporated into word processing packages and graphics programs for use in producing publication quality hardcopy. Additionally, for the individual investigator or small group, a microcomputer system can be considerably less expensive than setting up a mainframe system.

6. Specialized Programs

Each sequence analysis program or package requires the input sequence to be in a specific format. By format is meant the style and organization of the computer file containing the sequence, e.g., double or single line spacing, number of characters per block, number of blocks per line, whether characters are numbered or not, and so on.
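The layout features just listed (characters per block, blocks per line, numbering) can be made concrete with a small illustration. The following is a minimal sketch in Python and does not reproduce the file format of any particular package; the function name format_sequence and its default values are assumptions chosen for the example.

```python
def format_sequence(seq, block=10, blocks_per_line=5, numbered=True):
    """Lay out a sequence in fixed-size blocks, optionally numbering each line.

    All defaults are illustrative assumptions, not the settings of a real package.
    """
    # Remove any existing whitespace and normalise the case.
    seq = "".join(seq.split()).upper()
    per_line = block * blocks_per_line
    lines = []
    for start in range(0, len(seq), per_line):
        chunk = seq[start:start + per_line]
        # Split the line into blocks separated by a single space.
        blocks = [chunk[i:i + block] for i in range(0, len(chunk), block)]
        text = " ".join(blocks)
        if numbered:
            # Prefix the line with the 1-based position of its first character.
            text = f"{start + 1:>8} {text}"
        lines.append(text)
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_sequence("acgt" * 6, block=10, blocks_per_line=2))
```

Two programs that disagree on any one of these settings will generally fail to read each other's files, which is why format conversion tools are needed.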


Unfortunately for scientists who have access to more than one package, a different format is required for each package. For example, STADEN format is incompatible with GCG format, even though many research laboratories operate both packages on the same computer. Therefore, we have included in Computer Analysis of Sequence Data a chapter on methods for converting sequences between different formats (Chapter 27). READSEQ is a program that can interconvert sequence files between thirteen different formats. In addition, many packages contain programs for converting between sequence formats.

The generation of multiple alignments of three or more related sequences, either DNA or protein, is becoming an increasingly popular function of sequence analysis software. Such alignments can highlight small, highly conserved regions and help identify sequence features such as enzyme active sites or substrate-binding domains. Overall levels of homology can be calculated and a consensus sequence produced. Similar programs can be used to explore phylogenetic relationships if the equivalent sequences from many different species are available. Evolutionary trees depicting such relationships can be drawn. Chapters 25-28 of Part II describe programs for multiple sequence alignment and construction of phylogenetic trees.

Many programs for sequence analysis are in the public domain. That is to say, they are available freely to anyone who requests them, as long as they are not subsequently sold. Chapter 28 explains how to obtain software of this nature via electronic mail.

Finally, all newly generated sequence data should be submitted to the international sequence databanks. Chapter 29 describes how to do this.

References

1. Sanger, F., Nicklen, S., and Coulson, A. R. (1977) DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. USA 74, 5463-5467.
2. Maxam, A. M. and Gilbert, W. (1977) A new method for sequencing DNA. Proc. Natl. Acad. Sci. USA 74, 560-564.
3. Messing, J. and Bankier, A. T. (1989) The use of single-stranded DNA phage in DNA sequencing, in Nucleic Acids Sequencing: A Practical Approach (Howe, C. J. and Ward, E. S., eds.), IRL Press, Oxford, UK, pp. 1-36.
4. Anderson, S. (1981) Shotgun DNA sequencing using cloned DNase I-generated fragments. Nucleic Acids Res. 9, 3015.
5. Deininger, P. L. (1983) Random subcloning of sonicated DNA: application to shotgun DNA sequence analysis. Anal. Biochem. 129, 216.


6. Bankier, A. T., Weston, K. M., and Barrell, B. G. (1987) Random cloning and sequencing by the M13/dideoxynucleotide chain termination method, in Methods in Enzymology, vol. 155 (Wu, R., ed.), Academic, London, pp. 51-93.
7. Smithies, O., Engels, W. R., Devereux, J. R., Slightom, J. L., and Shen, S. (1981) Base substitutions, length differences and DNA strand asymmetries in the human G-gamma and A-gamma fetal globin gene region. Cell 26, 345-353.
8. Devereux, J., Haeberli, P., and Smithies, O. (1984) A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res. 12, 387-395.

CHAPTER 2

GCG: Fragment Assembly Programs

Reinhard Dölz

1. Introduction

The Fragment Assembly System is a series of related programs that help you assemble overlapping fragment sequences as obtained at the lab bench. Specifically, the programs resemble a small database system that helps you to achieve six major tasks:

1. Entering and storage of sequences in a project database (Method 1).
2. Recognition of overlaps (Method 2).
3. Manipulation of aligned fragments (this step and all following: Method 3).
4. Generation and comparison of consensus sequences as determined from aligned fragments.
5. Display and archiving of aligned sequences.
6. Determination of the consensus sequence.

The different programs are partially derived from or inspired by other programs to fit into the GCG environment (e.g., gelassemble is derived from the MSE editor written by William Gilbert), and are based upon the Fragment Assembly Program System as described by Rodger Staden (1). The major advantage of such a database system is that it provides a consistent, homogeneous environment for working on large data sets. It is essential that the data entered have a quality that permits computer-assisted assembly. Otherwise, unsatisfactory results are obtained. In particular, underdetermined data sets are not suited for assembly because conclusions may become very misleading.


It is important to note that this database system does not use binary coded data. Instead, the Fragment Assembly System used by the GCG programs creates a directory tree of ASCII (i.e., text) data. In order to retain the consistency of the program system, these data must never be edited or otherwise manipulated by hand.

2. Materials

The methods and the programs reported here are part of the GCG program package (2), Version 7.x, to be installed on either a VAX/ALPHA computer running the operating system VMS or one of the supported UNIX systems: Silicon Graphics (Mountain View, CA) (IRIX), Digital Equipment (Boston, MA) (ULTRIX), or Sun Computers (Mountain View, CA) (SUNOS). The programs can be obtained from GCG Inc., University Research Park, 575 Science Drive, Suite B, Madison, WI 53711. A version for the CONVEX variant is also available from Convex Corp., Dallas, TX. The computer system should be equipped with at least 16 MByte of memory and should hold about 1 GByte of disk space for program, database, and scratch area.

All GCG programs generally require a terminal that is capable of emulating the VT100 terminal mode standard. Scroll-back facilities as provided on workstations or in DEC terminal emulators are advantageous but not essential. Method 3 (gelassemble) requires that the terminal follows the VT100 standard exactly. It is recommended to make sure that this emulation is chosen with the command

$ set term/vt100          on VAX/VMS, and
% setenv TERM vt100       on UNIX systems

On VAX/VMS, the terminal used must further support the keypad of a VT100 terminal. This can be a problem in various PC terminal emulations. On UNIX, this is not required. However, on some networks, cursor keys are not supported in telnet or rlogin sessions, which could cause problems in scrolling through options. On VAX/VMS, VT220 and later versions of this terminal emulation can be used in order to utilize the keys of this terminal more intuitively. Tables 1 and 2 (later in this chapter) list the functions assigned.


Method 3 (gelassemble) uses a text display of a very complicated nature. Workstations that permit the opening of windows larger than the 80 x 24 character terminals should be used wherever accessible.

The text output produced by all methods can be processed with word processing programs running on personal computers. The data transfer capabilities required for this are usually provided within the terminal emulation programs. If text output is to be printed, any ASCII code printer is sufficient. Capabilities to print 132 characters per line may be advantageous but are not essential.

Though the GCG programs can run on workstation screens equipped with a mouse as terminal input device, none of the methods presented here actually supports this mode of input. (This applies to version 7 and earlier versions of the GCG program package.)

To use a sequence in this Fragment Assembly System, it must be formatted in GCG sequence format. This format consists of any text describing the sequence and two periods ‘..’ separating the annotation data from the sequence data itself. Sequence symbols have to follow the IUPAC rules in order to be processed correctly. If sequences to be used cannot be read properly, the program reformat should be used to convert them to a GCG-compliant format. Further tools of sequence manipulation are described in Chapter 27. Method 1 as described in this chapter automatically creates sequences in the correct format.

3. Methods

3.1. Entering and Storage of Sequences in a Project Database (Method 1)

3.1.1. Preparation of the Database

Before a Fragment Assembly Project can be started, the computer must know the location of the directories storing the data. This assignment is accomplished with the program gelstart. In the case of a new project, this program will create the necessary data structures. As of version 7.2, the program newgelstart can also be used. The program gelstart furthermore assigns vector sequences to be highlighted in the assembling editor (see Method 3). The vector sequences specified must be present in GCG program format. They


can be named as a member of a database (e.g., embl:pbr322) but can also be located in your personal directory if the established databases do not keep the sequence of interest (e.g., vector.seq). Additionally, the program gelstart can accept sequence patterns to be highlighted. These can be restriction sites; e.g., to label HindIII, the string to be specified would be AAGCTT. The purpose of this is to warn of possible rejoined sticky fragment ends.

3.1.1.1. START OF THE PROGRAM GELSTART
The program is started by typing its name on the command line.

3.1.1.2. SPECIFICATION OF THE SEQUENCING PROJECT NAME
The program will ask for a project name to be specified. If the project already exists, the program stops and no further questions are asked. In the case of a new project, the creation of a new sequencing project database has to be confirmed.

3.1.1.3. DEFINITION OF ADDITIONAL DATA
As mentioned above, gelstart is capable of assigning vector sequences to be highlighted, as well as restriction sites or other patterns. If the project is being created, the program asks for both items automatically. If the project is already known, it is possible to add these data afterwards. The command to be specified is

$ gelstart/vector         on VAX/VMS
% gelstart -vector        on UNIX systems
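To illustrate what highlighting a pattern such as the HindIII site amounts to, the following minimal Python sketch reports where a given string occurs in a fragment. It is not part of GCG and says nothing about how gelstart or gelassemble work internally; the function name find_pattern and the example fragment are made up for the illustration.

```python
def find_pattern(sequence, pattern):
    """Return the 1-based start positions of every occurrence of pattern.

    Overlapping occurrences are reported; the comparison ignores case.
    """
    if not pattern:
        raise ValueError("pattern must not be empty")
    sequence = sequence.upper()
    pattern = pattern.upper()
    positions = []
    start = sequence.find(pattern)
    while start != -1:
        positions.append(start + 1)          # convert to 1-based numbering
        start = sequence.find(pattern, start + 1)
    return positions


# Hypothetical fragment containing two HindIII sites (AAGCTT).
fragment = "ggatccAAGCTTgcatgcaagcttaa"
print(find_pattern(fragment, "AAGCTT"))      # prints [7, 19]
```

A site found in the middle of an assembled fragment is a hint that two pieces may have been rejoined at their sticky ends during cloning.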

3.1.1.4. DELETION OF PROJECTS
The database should not be deleted with a single operating system command. A special command is available to erase all data of a particular sequencing project. This option should be used ONLY after making certain that a backup of the data is available, in case there is a need to continue and/or re-edit part of the data. If backups are not routinely run on the system you are working on, it is a good idea to create a directory where you can deposit all the data you may need at a later stage of your work. Ask your system manager how to transfer these data to a tape or other external storage device in order to keep the disk consumption small. Assume that you are the user otto and your project is called minna. On VAX/VMS, type the following commands:


$ backup/list minna.bck [.minna...]/save_set/verify
$ rename minna.bck sys$login:
$ set default sys$login
$ create/dir [.save]
$ rename minna.bck [.save]
$ gelstart/delete

To restore these data, type

$ backup/list [otto.save]minna.bck [...]

On UNIX, specify

% mkdir ~/save
% tar -cvof ~/save/minna.tar minna
% gelstart -delete

To restore the data, type

% tar -xvof ~/save/minna.tar
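The precaution of archiving the project directory before deleting it can also be scripted. The following is a minimal Python sketch using the standard tarfile module; it is not a GCG tool, the function name archive_project is hypothetical, and the project name minna and the save directory simply follow the example above. The sketch only writes and checks the archive and deliberately leaves the deletion to gelstart.

```python
import tarfile
from pathlib import Path


def archive_project(project_dir, save_dir):
    """Archive project_dir as save_dir/<name>.tar and check that no file is missing."""
    project = Path(project_dir).resolve()
    save = Path(save_dir)
    save.mkdir(parents=True, exist_ok=True)
    archive = save / f"{project.name}.tar"

    # Store the whole directory tree under the project name.
    with tarfile.open(archive, "w") as tar:
        tar.add(project, arcname=project.name)

    # Compare the files on disk with the files recorded in the archive.
    expected = {
        f"{project.name}/{p.relative_to(project).as_posix()}"
        for p in project.rglob("*") if p.is_file()
    }
    with tarfile.open(archive, "r") as tar:
        archived = {m.name for m in tar.getmembers() if m.isfile()}
    missing = expected - archived
    if missing:
        raise RuntimeError(f"files missing from archive: {sorted(missing)}")
    return archive


# Example call (assumes a directory named minna exists in the current directory):
# archive_project("minna", Path.home() / "save")
```

Only after the function returns without an error would the project be erased with the delete option of gelstart.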

3.1.2. Start of the Sequence Editor

The program gelenter adds fragments from three different sources: keyboard, digitizer, and file. Both keyboard and digitizer input are described in Chapter 6, in the description of the seqed program. The input from file will usually be prepared by the sequence editor as well or, alternatively, by input from other sources after conversion with any of the routines described in Chapter 27. It should be emphasized that sequences in the fragment assembly project must not be duplicated. Even on the UNIX operating system, upper and lower case in the names are not recognized as different.

3.1.2.1. START OF THE GELENTER PROGRAM
The program is started by typing the command gelenter at the system prompt, followed by an optional name of the sequence file.

3.1.2.2. SPECIFICATION OF A NEW SEQUENCE
If not specified on the command line, the program will ask for a sequence name to be provided as the handle to identify the sequence later in the Fragment Assembly System. There are some general restrictions on the names that can be used for a sequence. For practical reasons, they should be fewer than 10 characters long and should not contain extra characters such as :, $, and /, which are used in file name conventions of the operating system.
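These naming restrictions are easy to check before any sequence is entered. The sketch below is a hypothetical Python helper, not a GCG program; it assumes that only the characters named in the text are forbidden and that duplicates are detected by comparing names without regard to case.

```python
# Assumption: only the characters explicitly named in the text are rejected.
FORBIDDEN = set(":$/")
# "Fewer than 10 characters" is read here as at most 9.
MAX_LENGTH = 9


def check_name(name, existing_names):
    """Return a list of problems with a proposed sequence name (empty if none)."""
    problems = []
    if not name:
        problems.append("name is empty")
    if len(name) > MAX_LENGTH:
        problems.append(f"name is longer than {MAX_LENGTH} characters")
    bad = sorted(set(name) & FORBIDDEN)
    if bad:
        problems.append("name contains forbidden characters: " + " ".join(bad))
    # Upper and lower case are not distinguished, so compare in lower case.
    if name.lower() in {other.lower() for other in existing_names}:
        problems.append("name is already used in the project")
    return problems


print(check_name("Frag02", ["FRAG02", "frag03"]))
```

Running such a check over a whole directory of files before the bulk loading step avoids having to remove wrongly entered sequences afterwards.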


3.1.2.3. ENTERING BULKS OF SEQUENCES FROM FILES
The program gelenter can accept the command line option enter to load sequences into the Fragment Assembly System automatically. For this, it is necessary to know the file names of interest; e.g., *.seq will specify all sequences in the current directory, whereas my*.seq restricts the choice to files beginning with the name my.

$ gelenter/enter=my*.seq      on VAX/VMS
% gelenter -enter=my*.seq     on UNIX systems

Note that the Fragment Assembly System does not easily permit you to remove sequences from a sequencing project (to avoid inconsistencies); this dictates that some care must be taken at the entering step. If a sequence really has to be deleted, use the option erase in the gelassemble program (see Method 3).

3.2. Recognition of Overlaps (Method 2)

Once data are entered into the database, a preprocessing step computes the overlaps, which can be analyzed and corrected or acknowledged in the next viewing step (Method 3). The program geloverlap compares each sequence against the others in the Fragment Assembly System database. As of version 7.2, the program gelmerge can be used alternatively. The comparison stringency can be manipulated with four parameters: the word size (the size of a word, the smallest unit of identity), the percentage of matches in an overlap, the integral width (a parameter that permits accommodation for insertion or deletion of mismatches in overlapping fragments), and the minimum overlap length. The suggested parameters work well on standard fragments but might fail in cases of repetitive sequences or very short fragments.
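How such stringency parameters interact can be shown with a toy comparison of two fragments. The following Python sketch is an illustration only and is not the algorithm used by geloverlap: it tests ungapped end-to-start overlaps, so the integral width (insertions and deletions) is not modelled, and the function names and default values are assumptions made for the example.

```python
def longest_identical_run(a, b):
    """Length of the longest stretch of consecutive identical positions."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


def find_overlap(left, right, word_size=7, min_fraction=0.8, min_overlap=14):
    """Find the longest ungapped overlap between the end of left and the start of right.

    Returns (overlap_length, fraction_identical) or None if nothing qualifies.
    """
    left, right = left.upper(), right.upper()
    for length in range(min(len(left), len(right)), min_overlap - 1, -1):
        tail, head = left[-length:], right[:length]
        matches = sum(x == y for x, y in zip(tail, head))
        fraction = matches / length
        # Require enough identity overall and at least one identical "word".
        if fraction >= min_fraction and longest_identical_run(tail, head) >= word_size:
            return length, fraction
    return None


if __name__ == "__main__":
    a = "TTGACCATGGCTAGCTAGGATCCA"
    b = "GCTAGCTAGGATCCATTTAAACCC"
    print(find_overlap(a, b, min_fraction=1.0, min_overlap=10))
```

Raising the minimum overlap length makes chance matches between short stretches less likely, while lowering the required fraction tolerates more reading errors; this is the trade-off behind the advice on varying the parameters.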

3.2.1. Start of the geloverlap Program

The program can only be started after the working environment has been initialized with the command gelstart (see Method 1). Then, the command geloverlap will start the program.

3.2.2. Filling the Parameters of the geloverlap Program

The program will ask for parameters and also suggest default values. It is mostly sufficient to accept the suggested values, but sometimes a variation is necessary. If it is necessary to vary parameters, the


first things to change are the minimum overlap length (setting it higher), and the fraction of the words to match in an overlap (setting it lower).

3.2.3. Restrictions

The total length of all sequences entered into geloverlap may not exceed 350,000. The word size must be between 1 and 30, the minimum overlap length between 1 and 1000. The program cannot calculate more than 10,000 overlaps. The consensus sequence of a Fragment Assembly System project may not be longer than 100,000 bases. However, a single fragment may not be longer than 2500 bases.

3.2.4. Definition of Overlaps

In order to reduce the ambiguity of the result, the overlap definition requires that the sequences have a defined and recognizable similarity in the region of the overlap. Figure 1 shows overlaps and examples of patterns not recognized as overlaps.

3.2.5. Problems with “Weak” Overlaps

It might be useful to prevent geloverlap from recognizing identical sequences within repetitive sequences, because matching these can cause it to miss the overlaps in between. For this purpose, geloverlap can be started with the command line option upperlimit. In order to avoid overlaps with an identity of >90%, specify

$ geloverlap/upperlimit=0.9     on VAX/VMS
% geloverlap -upperlimit=0.9    on UNIX systems
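The effect of such an upper limit can be pictured as a simple filter on candidate overlaps. The Python sketch below is only an illustration of the idea and makes no claim about how geloverlap applies the option; the function name filter_overlaps, the tuple layout, and the example values are all made up.

```python
def filter_overlaps(overlaps, upper_limit=None):
    """Keep candidate overlaps whose identity does not exceed upper_limit.

    Each overlap is a tuple (first_fragment, second_fragment, identity),
    with identity given as a fraction between 0 and 1.
    """
    if upper_limit is None:
        return list(overlaps)
    return [overlap for overlap in overlaps if overlap[2] <= upper_limit]


# Hypothetical candidates: the first pair looks like a near-perfect repeat.
candidates = [("frag1", "frag2", 0.97), ("frag2", "frag3", 0.88)]
print(filter_overlaps(candidates, upper_limit=0.9))
```

With the limit set to 0.9, only the weaker pair survives, so the nearly identical repeat copies no longer dominate the clustering.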

3.2.6. Review of the Results

The gelview program will output a view of the current database. If the program is started with

$ gelview/cluster       on VAX/VMS
% gelview -cluster      on UNIX systems

the program will report on the overlap clusters as found in the last run of the geloverlap program.

3.2.7. Correction of Problems

Depending on the sequences used, the geloverlap program may fail to detect overlaps that are considered to be “real” in the user’s view but do not meet the threshold of overlap criteria as shown in Fig. 1.

Fig. 1. Definitions of overlaps as used in the geloverlap program. Alignments 1 to 3 are considered to be overlaps, whereas 4 has too large nonidentical overhangs, 5 has a too weak overlapping region, and 6 is too short. See text for details on how to deal with these cases.

The problem of weak or too short overlaps (examples 5 and 6) can be overcome by defining different stringency parameters, whereas the problem in example 4 might need additional analysis. If no reasonable alignment can be obtained with lowered threshold values, it is advisable to remove the sequence from the database and edit it manually. The procedure is as follows:

1. Run the program geloverlap as usual.
2. Run the program gelview with the cluster option in order to view the output. Identify the contig (the cluster of sequences) that is misplaced. This contig should consist of one sequence only.
3. Run the program gelassemble in order to export the sequence, and delete it afterwards. On VAX/VMS, type

$ gelassemble/noauto/contig=5

if you want to remove contig number 5. Next, save the sequence: switch to the command line prompt (:) and specify the option seqout. Then, remove the sequence from the display with the command reject, and finally erase it by giving the command erase minna (if the sequence you want to delete is called minna). Quit the program gelassemble with the option quit.


On UNIX, type

% gelassemble -noauto -contig=5

if you want to remove contig number 5. Next, enter <CTRL-D> and proceed as described for VAX/VMS at the command line prompt (:).
4. Run the program gelenter again and tailor the sequence to remove the conflicting parts.

3.3. Manipulation of Aligned Fragments (Method 3)

The program gelassemble modifies and manipulates the database of a Fragment Assembly System project. It is used to meld (join) overlapping fragments together into, ultimately, a single, project-sized contig (a cluster of aligned sequences). The program works in a similar manner to the multisequence alignment editor as far as screen management is concerned (see Chapter 8), but is specifically tailored for fragment assembly work. The output of gelassemble can be documented either with pretty (see Chapter 8) or with gelview (see Section 3.2.6.). The input of gelassemble is the result of the previous method (a run of geloverlap).

Whereas multiple sequences can be handled by the Fragment Assembly System, only one of the alignments can be handled at a time. Therefore, if the result of geloverlap contains various clusters of overlapping sequences, the first step of the gelassemble program is to select one of the displayed clusters for detailed editing. Next, a screen window opens, showing the sequences letter by letter in the upper part of the screen, while the lower part shows the current alignment as a schematic view, the so-called Big Picture. The gelassemble program has two important modes: the screen mode, which you will use to edit and manipulate fragments, and the command mode, which is used to modify the database and perform input/output.

3.3.1. Selection of a Cluster for Manipulation

The gelassemble program is started by the command gelassemble given on the command line. The first question asked is the output file name of the geloverlap program. Provided that the latter was run with default options, the name will be geloverlap.dat and can be accepted as the default. Depending on the results of geloverlap, the following screen(s) will present the detected clusters, showing the


sequence name, strand orientation, position, and matching score for each of the sequence pairings in the particular cluster. The program expects input in order to proceed. If the <ENTER> key is pressed (on VAX/VMS; alternatively, keypad key 4 on VAX/VMS; see Table 1 for UNIX), the sequences are displayed schematically as a preview, showing the same output as the gelview program when invoked with the option cluster (see Section 3.2.6.). Pressing the same key again toggles back to the display of the sequence alignment scores. The schematic view of the cluster alignment is shown in Fig. 2. Once the correct cluster has been selected, the <ENTER> key (on VAX/VMS; see Table 1 for UNIX) proceeds to the screen mode to load a cluster.

3.3.2. Screen Mode of gelassemble

The upper half of the screen contains the sequence alignment. Keystroke options are listed in Table 2. The bottom row of the alignment, row 1, is reserved for the consensus sequence. Symbols in a fragment sequence that disagree with the consensus appear in reverse video. Above the sequence alignment is a summary of the current cursor position. This summary includes the name of the fragment on which the cursor is positioned, the absolute position of the cursor in the alignment, and the position of the cursor relative to the beginning of the current fragment. A screen copy is shown in Fig. 3.

The lower half of the edit screen is the Big Picture schematic. Each sequence is represented by an arrow-headed bar denoting the strand orientation. There may be letters beside the number of the strand: C means consensus, M indicates a modified sequence (with respect to the sequence originally entered with gelenter in Method 1), and L means "locked," i.e., not available for further modification because it has already been approved. Once an alignment has been found to be satisfactory, it can be fixed (melded) into a contig. This is denoted by the letter A (anchored) and can be reversed by "unanchoring" if needed (see Table 2). Anchored sequences can be modified as an ensemble, meaning that all insertions, deletions, and other manipulations will apply to all anchored sequences simultaneously.

The sequence display works like a spreadsheet, and the cursor keys <up-arrow>, <down-arrow>, <left-arrow>, and <right-arrow> can be used to move the cursor between the different rows. Alternatively,

[Screen copy: schematic of a cluster of six fragments (Test2, Test8, Test5, Test4, Test6, Test1), each drawn as an arrow showing its strand orientation, above the CONSENSUS line, on a coordinate scale from 0 to 400. The bottom lines of the screen list the keys for scrolling, for moving to the previous or next cluster, for screen mode, and <Enter> to load a cluster.]

Fig. 2. Screen copy of the cluster alignment schematic on a VAX/VMS terminal screen. The UNIX version is identical but uses other control options for changing the display (see Table 1).

Table 1
Keystrokes Used on Various Versions in the gelassemble Contig Selection

Action to be performed              VAX/VMS                          UNIX
Scroll to next/previous cluster     <cursor-left> <cursor-right>     <cursor-left> <cursor-right>
View toggle pairs/schema            <SELECT>                         <CTRL-B>
Proceed to screen mode              .
Load cluster                        <ENTER>
Scroll up/down in schematic view    <cursor-up> <cursor-down>,       <cursor-up> <cursor-down>
                                    <Prev screen>

a number followed by the appropriate key (see Table 2) permits a direct jump to a sequence position. Further shortcuts are listed in Table 2. Sequence ranges can be selected, cut out, and pasted elsewhere with the appropriate commands (again, see Table 2).

[Screen copy: the gelassemble edit screen. The top line gives the fragment name (Test4) with the absolute and relative cursor positions. The upper half shows the aligned fragment sequences letter by letter on a scale from 0 to 70. The lower half shows the Big Picture schematic of fragments 2 to 7 and the consensus (C) on a scale from 0 to 500, with the letters A and M marking anchored and modified sequences. The bottom line reads: <Ctrl-Z> for Command Mode, or ? for help. Edit Mode.]

Fig. 3. Screen copy of the gelassemble program on a VAX/VMS terminal screen. The UNIX version is identical.

Table 2
Keystrokes Used on Various Versions in the gelassemble Screen Mode

Action to be performed                           Keys
Consensus generation                             <CTRL-K> (UNIX)
Insertion of character                           G A T C
Delete a base                                    <delete>
Move sequence to right                           <SPACE>
Begin selection                                  <select>
Cut selection
Insert selection
Anchor sequence
Unanchor sequence
Redraw screen                                    <CTRL-W>
Go to command mode                               <CTRL-Z> (VAX/VMS), <CTRL-D> (UNIX)
Load next fragment                               <ENTER>
Reject current fragment                          -
Toggle insert/overstrike
Any number row                                   n
Help screen, with more keystroke explanations    ? or Help


Consensus generation is achieved with a single keystroke (on UNIX, <CTRL-K>; see Table 2). This consensus is a simple measure of plurality, and characters are displayed in lowercase if no unique sequence symbols are found in the particular row.

3.3.3. Command Mode of gelassemble

The command mode of the gelassemble program is used to modify the database of the fragment assembly project, and to perform input/output. Command mode is indicated by a colon (:) at the lower left corner of the screen.

The most important option in command mode is the meld command, which saves the alignment to the database. This command joins together all of the loaded fragments and records the join in the database as a contig named after the leftmost fragment in the group (i.e., the one with the lowest coordinate in the consensus). All sequence edits are saved. The only way to break such a contig is to run the program geldisassemble (see Section 3.3.6.).

Gelassemble does not allow a single fragment to be listed twice in the same contig, nor a single fragment to be listed in more than one contig. Contigs containing duplicated fragments cannot be joined (using meld). The noduplicate command removes duplicated fragments. Alternatively, the "-" key in screen mode rejects a particular fragment. If, for some reason, a fragment must legitimately appear twice (e.g., a repeated sequence), it can be duplicated with the spawn command.

It is beyond the scope of this description to show all the valid commands of the command mode of the gelassemble program. The command help lets you view the possible options on-line with a brief explanation.

3.3.4. Writing Out the Consensus

The consensus sequence of the current database can be written out at any time. In command mode, the command prettyout writes an output file similar to that of the program pretty (see Chapter 8). If only the consensus sequence is to be written, the command consensus recalculates the consensus, and the command seqout can be used to output it.
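The plurality consensus described in Section 3.3.2. can be illustrated with a short Python sketch. It is an assumption-laden illustration, not the gelassemble code: fragments are given as equal-offset strings, a blank marks a position that a fragment does not cover, and "lowercase if no unique symbol" is interpreted as "lowercase whenever the covering fragments disagree." The function name plurality_consensus is hypothetical.

```python
from collections import Counter


def plurality_consensus(rows):
    """Return a plurality consensus for aligned fragment rows.

    rows: list of strings aligned to the same coordinate system;
          a blank (' ') means the fragment does not cover that position.
    A position is written in uppercase when all covering fragments agree,
    and in lowercase (most frequent symbol) when they do not.
    """
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    consensus = []
    for i in range(width):
        # Collect the symbols of all fragments that cover position i.
        column = [row[i].upper() for row in rows if i < len(row) and row[i] != " "]
        if not column:
            consensus.append(" ")
            continue
        counts = Counter(column)
        symbol, _ = counts.most_common(1)[0]
        consensus.append(symbol if len(counts) == 1 else symbol.lower())
    return "".join(consensus)
```

For example, the rows "GATTACA", "GATCACA", and " ATTAC " disagree only at the fourth position, so that position alone is reported in lowercase. How ties between equally frequent symbols should be resolved is not stated in the text and is left to the Counter ordering here.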


3.3.5. Customizations of the Keyboard

The program setkeys, provided by the GCG program package, starts an interactive dialog for assigning the nucleotide characters to keys according to individual taste. For example, it is possible to have the characters s d f g mapped to G A T C in order to have all four nucleotide characters positioned conveniently. The output of the setkeys program is the file set.keys, which can be re-edited with any text editor (e.g., edt on VAX/VMS or vi on UNIX) in order to incorporate the "." (period), which is needed for inserting gaps. The changes made by setkeys affect only the programs gelenter, gelassemble, seqed, and lineup. Other programs are not affected. The changes are valid only in the current directory, where the file set.keys resides.

3.3.6. Breaking of the Contigs

The program geldisassemble unmelds all melded contigs in the current fragment assembly project and rebuilds a database consisting of the unjoined fragments. This program recreates the database as if you had newly entered all the fragments, except that all the edits made to single fragments are preserved. The program is started by typing its name on the command line. The only question asked is for confirmation, and then the disassembling proceeds.

4. Notes

1. The Fragment Assembly System within the GCG program package is an extremely powerful toolkit. However, it is tedious to use if the manual has to be consulted for every basic operation. It is suggested that a test sequence collection be used as a "tutorial" before starting.

2. Like many database systems, the fragment assembly package generates numerous files. If disk quotas are enabled on your system, check with the command

$ show quota (on VAX/VMS, or)
% quota -v (on UNIX)

whether you still have resources to run the Fragment Assembly System. The space needed is approximately the size of all fragments multiplied by five.


3. Because of the intense use of screen manipulation routines, the connection to the host must be sufficiently stable. Noisy or slow phone lines used with modems are not recommended if the gelassemble program is used heavily.
4. It is very convenient to use large screens, such as those of workstations, when working on big projects with gelassemble. The settings needed to make gelassemble work on screens larger than the standard are site specific and can be obtained from the system manager.
5. The sequences used in the project are considered to be "final." If sequencing errors exceed a certain threshold, the automatic procedure might fail.
6. If too many changes are needed to edit a sequence into its reworked form (e.g., after a new sequencing experiment at the bench), it is sometimes more convenient to delete the fragment from the database entirely and reenter the sequence with the gelenter program (see Section 3.2.1.3.).
7. Operator and other system messages may disturb the screen mode of gelassemble. The <CTRL-W> key will redraw the screen and blank out all text not used by gelassemble.

Note Added in Proof

As of version 7.2, the program gelmerge supersedes geloverlap for most cases. At the time of this writing, the initialization needed to run gelmerge is done with newgelstart instead of gelstart. Future versions of the GCG software might use different terms.
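The rule of thumb in Note 2 (space needed is roughly five times the total size of all fragments) can be turned into a quick estimate. The Python sketch below assumes one byte per base and takes the factor of five directly from the note. The function name is hypothetical, and the result is only an approximation to compare against the figure reported by the quota command.

```python
def estimated_project_bytes(fragment_lengths, factor=5):
    """Estimate the disk space for a fragment assembly project.

    fragment_lengths: lengths of the individual fragments in bases
                      (assumed to occupy one byte per base).
    factor: multiplier from Note 2 (approximately five).
    """
    return factor * sum(fragment_lengths)


# Three gel readings of 350, 420, and 290 bases.
print(estimated_project_bytes([350, 420, 290]))
```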

References

1. Staden, R. (1980) A new computer method for the storage and manipulation of DNA gel reading data. Nucl. Acids Res. 11, 3673-3694.
2. Devereux, J., Haeberli, P., and Smithies, O. (1984) A comprehensive set of sequence analysis programs for the VAX. Nucl. Acids Res. 12, 387-395.

CHAPTER 3

GCG: Drawing Linear Restriction Maps

Reinhard Dölz

From: Methods in Molecular Biology, Vol. 24: Computer Analysis of Sequence Data, Part I. Edited by: A. M. Griffin and H. G. Griffin. Copyright ©1994 Humana Press Inc., Totowa, NJ

1. Introduction

To produce a restriction map using a computer program, a single sequence (your target sequence) is matched against a predefined sequence pattern database (referred to as "enzymes"). The pattern database used must obey some rules with respect to the pattern definition language and what an individual pattern means. The pattern matching algorithm should be flexible enough to allow modifications to its stringency. With respect to enzyme cleavage, this is fairly easy to achieve in nucleotide sequences, as described in this chapter. It becomes more difficult when dealing with protein sequences (see Chapter 12 and Chapters 12 and 13 in Part II). Pattern matching methods are important in many other fields and are not necessarily restricted to enzyme cleavage (see Chapter 10 and Chapters 8 and 9 in Part II).

In order to visualize restriction maps, the output of the pattern matching analysis needs to be conveniently structured. For this purpose, positive hits are displayed as a function of the sequence coordinate plotted vs the patterns found. Afterwards, known but nonmatching patterns are listed. If special stringencies are applied, there is an additional list of patterns that would match in principle but do not meet the stringency conditions (such as a defined number of cleavages).

2. Materials

The methods and the programs reported here are part of the GCG program package (1), Version 7.x, to be installed on either a VAX/ALPHA computer running the operating system VMS or one of the supported UNIX systems: Silicon Graphics (IRIX), Digital Equipment (ULTRIX), or Sun Computers (SUNOS). The programs can be obtained from GCG Inc, University Research Park, 575 Science Drive, Suite B, Madison, WI, 53711. A version for the CONVEX variant is also available from Convex Corp., Dallas, TX. The computer system should be equipped with at least 16 MByte of memory, and should hold about 1 GByte of disk space for program, database, and scratch area.

All GCG programs generally require a terminal that is capable of emulating the VT100 terminal mode standard. Scroll-back facilities as provided on workstations or in PC terminal emulators are advantageous but not essential.

Text output can be reprocessed with word processing programs on personal computers. The data transfer capabilities required for this are usually provided within the terminal emulation programs. If text output will be printed, any ASCII code printer is sufficient. The capability to print 132 characters per line may be advantageous but is not essential.

Though text files can represent the output of the program used here, a high-quality graphics device is desirable. For terminal previewing, TEKTRONIX 4014 capabilities or REGIS emulation (as provided in the DEC 240, 340, or 440 terminal series, or in various PC terminal emulators) is recommended. The GCG graphics features include a variety of other modes; e.g., workstations use the X-Windows interface. The appropriate choice is site-dependent and is preconfigured by the software manager. (See Methods section for details.) Graphics hard copies may use a variety of output formats, including the HPGL and POSTSCRIPT standards. This feature is also site-dependent and usually preconfigured by the software manager (see above).

To analyze a sequence, it is required that sequence data are formatted in GCG sequence file format (see Chapters 2, 6, 8, and 11 in order to review methods on how to create a sequence in this format).
The general procedure for preparing a restriction map involves comparison of the query sequence to known restriction site patterns deposited in a database. The GCG programs require a special format of this database, which is available either as GCG distribution or from various file servers on the internet (see Appendix).


In order to select the enzymes you are interested in, either a personal database is created (Method 2) or, alternatively, several program options can be employed that permit the use of selected enzymes instead of the entire database (Method 1).

3. Methods

3.1. Selecting Enzymes with Program Options on the Command Line (Method 1)

3.1.1. On the VAX/VMS Operating System

Options given below are entered after the program name and separated by slashes ("/"). For example,

$ mapplot/sixbase

means that you type the program name mapplot at the system prompt ($) and modify the program's action with the qualifier sixbase.

3.1.2. On the UNIX Operating System

Options given below are entered after the program name and separated by a blank (space bar) and a dash ("-"). For example,

% mapplot -sixbase

means that you type the program name mapplot at the system prompt (%) and modify the program's action with the option sixbase.

sixbase
uses only those enzymes (patterns from the enzyme restriction site database) that are defined by at least six non-N or non-X bases.

once
excludes all enzymes that cut the sequence more than once. Note that this option is identical to mincuts=1 and maxcuts=1 employed simultaneously (see below).

mincuts=n
excludes all enzymes that do not cut at least n times.

maxcuts=n
excludes all enzymes that cut the sequence more than n times.

exclude=n1,n2
excludes all enzymes that would cut between the positions n1 and n2 in the sequence. Additional ranges can be given as pairwise numbers.
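The effect of these selection options can be illustrated with a small Python sketch. It only mirrors the rules as described above and is not GCG code: the Enzyme record, the select function, the treatment of the exclude range as inclusive, and the example cut positions are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Enzyme:
    name: str
    pattern: str                               # recognition pattern, ' marks the cut
    cuts: list = field(default_factory=list)   # cut positions found in the target


def defined_bases(pattern):
    """Count pattern letters other than N or X (cut markers are skipped)."""
    return sum(1 for b in pattern.upper() if b.isalpha() and b not in "NX")


def select(enzymes, sixbase=False, mincuts=None, maxcuts=None, exclude=()):
    """Return the enzymes that survive the given selection options.

    exclude: iterable of (n1, n2) ranges, treated here as inclusive.
    The option 'once' corresponds to mincuts=1, maxcuts=1.
    """
    kept = []
    for enzyme in enzymes:
        n = len(enzyme.cuts)
        if sixbase and defined_bases(enzyme.pattern) < 6:
            continue
        if mincuts is not None and n < mincuts:
            continue
        if maxcuts is not None and n > maxcuts:
            continue
        if any(lo <= c <= hi for c in enzyme.cuts for lo, hi in exclude):
            continue
        kept.append(enzyme)
    return kept


# Hypothetical cut positions, for illustration only.
enzymes = [
    Enzyme("EcoRI", "G'AATTC", [100]),
    Enzyme("HaeIII", "GG'CC", [10, 50, 700]),
    Enzyme("PstI", "CTGCA'G", [450]),
    Enzyme("NotI", "GC'GGCCGC", []),
]
print([e.name for e in select(enzymes, mincuts=1, maxcuts=1)])
```

In this sketch the options combine as successive filters, so an enzyme must pass every condition that is switched on.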


3.2. Setting Up Your Own Enzyme Restriction File (Method 2)

The procedure is outlined in Table 1. After fetching the standard enzyme restriction file, it is manipulated with a text editor. Entries are formatted as one enzyme per line, in a pattern description language that is detailed in Chapter 10. Entries starting with an exclamation mark (!) are ignored, and entries starting with a semicolon are treated as "isoschizomers," meaning that a special option is needed to incorporate these entries (input of "**" at the question "What enzymes?").

3.3. Creation of a Linear Restriction Map (Method 3)

3.3.1. High-Quality Graphics Output

If graphics facilities are available, use the program setplot to view and select possible options.

3.3.2. Text-File "Graphics" Output

To simulate graphics in a text file, use the command line options noplot and out=my.map to suppress high-quality graphics and create the file my.map. See Method 1 on how to apply command line options.

3.3.3. General Procedure

3.3.3.1. START OF THE PROGRAM

Call the program mapplot with the appropriate options to preselect enzymes (see Method 1).

3.3.3.2. SELECTION OF THE ENZYMES

The two recommended replies to the question asked are either "*" for selecting all enzymes, or "**" for selecting all enzymes, including the isoschizomers. Answering the question with "?" will explain the available options.

3.3.3.3. REVIEW OF THE OUTPUT

If the high-quality graphics require more than one page, continue to the next page when prompted. An example output is shown in Fig. 1. If your output is a text file, you can proceed according to Table 1, steps 6a to 6c (viewing/printing a file).


Table 1
Procedure to Create a Personal Enzyme Restriction File

Step                                  VAX/VMS operating system            UNIX operating system

1. Fetch database                     $ fetch enzyme.dat                  % fetch enzyme.dat

2. Call the editor                    $ edit enzyme.dat                   % vi enzyme.dat
                                      If * appears: type <c>

3. Move to the entry desired          <cursor-up> and <cursor-down>       <cursor-up> and <cursor-down>

4. Insert an exclamation mark at      <!>                                 <i> <!> <ESC>
   the beginning of each line
   not needed

4b. Remove a semicolon to expose      Move to the line so that the        <x> <ESC>
    it to the '*' selection           cursor is at the position of
                                      the semicolon;
                                      <cursor-right> <del>

5. Exit the editor                    <CTRL-Z> and at the '*' prompt      <ESC> and at the ':' prompt
                                      type exit                           type wq

6a. Review the file page by page      $ type/page                         % more

6b. Print on paper to the default     $ print                             % lpr
    printer

6c. Print on selected printer         $ print/queue=<printer>             % lpr -P <printer>

Note that the keystroke that sends each command is not shown, and that each character to be typed in the editor is explicitly given. The commands are wrapped in the column for the sake of readability.
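The two markers that Method 2 relies on (a leading "!" to switch an entry off, a leading ";" to mark an isoschizomer) can be illustrated with a short Python sketch. It models only those two rules as stated in Section 3.2. and does not parse the real enzyme file format. The function names and the layout of the sample entries are hypothetical.

```python
def classify_entries(lines):
    """Split enzyme file entries into active, isoschizomer, and ignored lines.

    Only the leading character is inspected: '!' means ignored,
    ';' means isoschizomer, anything else is an ordinary entry.
    """
    active, isoschizomers, ignored = [], [], []
    for line in lines:
        text = line.rstrip("\n")
        if not text.strip():
            continue                      # skip blank lines
        if text.startswith("!"):
            ignored.append(text)
        elif text.startswith(";"):
            isoschizomers.append(text[1:])
        else:
            active.append(text)
    return active, isoschizomers, ignored


def selectable(lines, reply="*"):
    """Entries used for the reply '*' (ordinary) or '**' (plus isoschizomers)."""
    active, isoschizomers, _ = classify_entries(lines)
    return active + isoschizomers if reply == "**" else active


# Made-up entries; the real file has its own column layout.
sample = ["!HaeIII GG'CC", ";BstYI R'GATCY", "EcoRI G'AATTC"]
print(selectable(sample, "*"))
print(selectable(sample, "**"))
```

Steps 4 and 4b of Table 1 therefore move an entry between these three groups: adding "!" drops it completely, and removing ";" promotes an isoschizomer to an ordinary entry.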

4. Notes

1. Method 1 (i.e., using selected enzymes from the database) requires knowledge of the options available. In GCG programs, these can be summarized and displayed using the option check.
2. Method 2 (i.e., using a personal database) should be used only by users who know how to use the editor (edt on VAX/VMS, vi on UNIX).
3. The text graphics output issued with the option out=<filename> can benefit from the option width=100 on the command line, which also requires a printer that can output 132 characters per line.
4. The high-quality graphics output can benefit from the option double, which causes characters to be twice as high as the default.

[Map: cut sites of SnaBI, SpeI, StyI, XbaI, and XmnI plotted along the sequence (positions 1 to 1306), followed by the lists of enzymes that do not cut and of enzymes excluded by the options.]

Fig. 1. High-quality graphics output of the mapplot program. The program was started on VAX/VMS using the command line

$ mapplot/double/font=6/sixbase/exclude=400,700

and the sequence shown is chicken calmodulin.

5. Fonts numbered 3 and 6 are close to the HELVETICA and TIMES fonts used by personal computers. High-quality graphics will accept the option font=3 or font=6.
6. Known features of the sequence data analyzed can be marked with boxes if an appropriate file is in the default directory. The option used to include the file my.mark is mark=my.mark. The format of this file is described in the example file provided by the GCG programs. This file can be copied into your directory for review by the command fetch *.mark (both VAX/VMS and UNIX).
7. The program assumes that the sequence given is linear. Circular maps can be drawn with another program (see Chapter 4). Adding the command line option circular will have the effect that the sequence is treated as circular but the map is still displayed linearly.
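What "treated as circular" in Note 7 means for pattern matching can be illustrated with a short Python sketch: a site that spans the junction between the last and the first base is found only when the sequence is searched as a circle. This is a generic illustration, not the mapplot algorithm. It matches exact strings only (no ambiguity codes), reports 1-based start positions, and the function name find_sites is hypothetical.

```python
def find_sites(sequence, site, circular=False):
    """Return 1-based start positions of exact matches of site in sequence.

    With circular=True the sequence is extended by its own first
    len(site) - 1 bases, so matches spanning the origin are found as well.
    """
    seq = sequence.upper()
    site = site.upper()
    if not site or len(site) > len(seq):
        return []
    text = seq + seq[: len(site) - 1] if circular else seq
    hits = []
    start = text.find(site)
    while start != -1:
        hits.append(start + 1)
        start = text.find(site, start + 1)
    return hits


# GAATTC spans the origin of this 10-base circle.
print(find_sites("TTCAAAAGAA", "GAATTC"))
print(find_sites("TTCAAAAGAA", "GAATTC", circular=True))
```

The map itself can still be drawn linearly, as the note says; only the set of hits changes.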

Appendix

FTP File Servers in Molecular Biology

This list was extracted from a much larger information list prepared by Michael Gribskov ([email protected]). To use these file servers, your computer must be connected to the international Internet. Type the command ftp, followed by the server name or number, in order to connect to the server. The ls command produces a listing, the command cd changes the directory, and the command get copies the actual file into your directory on the home machine. The ftp program is terminated with the command quit.

ftp.embl-heidelberg.de 192.54.41.27
Rainer Fuchs, [email protected]
EMBL server
EMBL (updated daily), SWISS-PROT (and updates), PDB, DROMAP, ECD, ENZYME, EPD, LiMB, NGDD, PROSITE, SEQANALREF, TFD
PC, UNIX, Mac & VAX software

flat.nig.ac.jp 133.39.128.2
Sanzo Miyazawa, [email protected]
GenBank, EMBL, DDBJ, PIR, SWISSPROT, GENPEPT, LIMB
Software: FLAT DB search/retrieval program for UNIX


bio.net
Genbank server
Genbank db (quarterly), Genbank updates (weekly & nightly), GENPEPT db & updates, EMBL db & updates, SWISS-PROT, ALU, ENZYME, PROSITE, REBASE, SEQANALREF, much more

ncbi.nlm.nih.gov 30.74.25
NCBI, [email protected]
Natl. Center for Biotech. Information
NCBI software, ASN.1 tools, Geninfo db documentation

ncbi.nlm.nih.gov 130.14.20.1
Scott Federhen, federhen@ncbi.nlm.nih.gov (help and information)
(/repository) ENZYME, EPD, LiMB, NGDD, PROSITE, REBASE, SEQANALREF, TFD, metabolism, mol-model
(/toolbox) assorted public domain software
(/pub) more software - BLAST, FASTA, MACAW, etc.
(/pub/aimb-db) AIMB db; Lawrence Hunter, [email protected]

iubio.bio.indiana.edu 129.79.1.101
Don Gilbert, [email protected]
IUBIO archive
Mac, VAX, Atari, DOS software
DROMAP, EPD, LiMB, PROSITE, REBASE, TFD
Phylip
Authorin program

mbcrr.harvard.edu 134.174.51.4
Temple Smith, tsmith@mbcrr.harvard.edu
Molecular Biology Computer Research Resource
software
PLSEARCH pattern db
MASE 3.1 multiple sequence editor


menudo.uh.edu 129.7.1.6
Dan Davison, [email protected]
U Houston archive
Genbank sequence db, PIR sequence db
UNIX, DOS, VMS, Mac software
Cray software (future)
BioMatrix archive
Molecular evolution bb archive

bioftp.unibas.ch 131.152.8.1
Reinhard Doelz, [email protected]
Amos Bairoch, [email protected]
bioftp EMBnet server, Biozentrum, Universitaet Basel
(/biology/EMBnet) EMBL (updated daily)
(/biology/databases) ALU, DROMAP, ECD, ENZYME, EPD, LiMB, NGDD, PROSITE, REBASE, SEQANALREF, TFD
GCG format codon usage tables

nic.funet.fi 128.214.6.100
Rob Harper, [email protected]
Finnish State Supercomputer Center, Center for Scientific Computing
(pub/sci/molbio) VAX, Unix, DOS, Mac software
many non-biological applications

Reference

1. Devereux, J., Haeberli, P., and Smithies, O. (1984) A comprehensive set of sequence analysis programs for the VAX. Nucl. Acids Res. 12, 387-395.

CHAPTER 4

GCG: Drawing Circular Restriction Maps

Reinhard Dölz

1. Introduction

Drawing circular restriction maps is a multistep procedure. First, primary data are collected with a pattern matching program that runs like the ones used in linear restriction map generation (see Chapter 3). Second, these data are restructured and displayed as graphics. Additional data containing information entered by the user can be added in a third step, and finally, graphics can be tuned to produce publication quality. Figure 1 summarizes the use of the various programs and options, and Fig. 2 (later in this chapter) shows the variety of parameters that are applied to generate final graphics with Method 4. Figure 3 (also later in this chapter) shows the final output of the plasmidmap program.

2. Materials

The methods and the programs reported here are part of the GCG program package (1), Version 7.x, to be installed on either a VAX/ALPHA computer running the operating system VMS or one of the supported UNIX systems: Silicon Graphics (IRIX), Digital Equipment (ULTRIX), or Sun Computers (SUNOS). The programs can be obtained from GCG Inc, University Research Park, 575 Science Drive, Suite B, Madison, WI, 53711. A version for the CONVEX variant is also available from Convex Corp., Dallas, TX. The computer system should be equipped with at least 16 MByte of memory, and should hold about 1 GByte of disk space for program, database, and scratch area.


Fig. 1. Schematic display of the preparation procedure for drawing circular plasmid maps. Methods 2 and 3 are optional and can be skipped for rapid previewing.

All GCG programs generally require a terminal that is capable of emulating the VT100 terminal mode standard. Scroll-back facilities as provided on workstations or in PC terminal emulators are advantageous but not essential.


Drawing circular restriction maps requires high-quality graphics devices. The faster the connection to the host the better, because this greatly reduces the time required to complete the display. For terminal connections, a speed of at least 9600 baud is desirable. For terminal previewing, TEKTRONIX 4014 capabilities or REGIS emulation (as in the DEC 240, 340, or 440 terminal series, or in various PC terminal emulators) is recommended as a minimum. GCG graphics features include a variety of other modes; e.g., workstations use the X-Windows interface. The appropriate choice is site-dependent and is preconfigured by the software manager; see Methods section for details.

The methods presented benefit from a color display. The TEKTRONIX 4105 or TEKTRONIX 4107 graphics standards are recommended, besides the X-Windows interface.

Whereas the color display permits one to gain a rapid overview, hard copies of the graphics are desirable for archiving or publication. Laser printers supported include Digital's LN03, Postscript printers, and equipment utilizing the HPGL description language. The latter is advantageous if color plots are desired. Plotting capabilities are site-dependent and usually preconfigured by the software manager.

In order to manipulate files created and customized by this program, it is essential that a text editor (e.g., edt on VAX/VMS, or vi on UNIX) can be used effectively.

To analyze a sequence, it is required that sequence data are formatted in GCG sequence file format. See Chapters 2, 6, 8, and 11 in order to review methods on how to create a sequence in this format.

The general procedure for preparing a restriction map involves comparison of the query sequence to a set of known restriction site patterns deposited in a database. The GCG programs require a special format of this database, which is available either as GCG distribution or from various file servers on the Internet. See the Appendix of Chapter 3 for details.
The selection of proper enzymes has been discussed in Chapter 3, Methods 1 and 2.

3. Methods

3.1. Creation of the Primary Data File (Method 1)

The procedure requires that you are familiar with the methodology of selecting enzymes on the command line as described in Chapter 3.

38

Diilz 3.1.1.

Preparation

and Review of Input Data

Mostly, you will probably use your own sequences that have been entered previously (see Chapter 6). Alternatively, you could use sequences of known vectors for informational purposes. It is useful to have to hand information regarding the additional features you may wish to add to your map, e.g., the coordinates of known genetic elements. 3.1.2. Start of the Program

mapsort

To start the program, type $ mapsortiplasmid % mapsort -plasmid

(on VAXNMS) (on UNIX)

on the command line. Optionally, the command can be followed by a space and the sequence file name to be analyzed. The GCG program mapsort will create the appropriate raw data file to be customized later. To generate the output in the correct format you have to use the option plasmid in addition to the enzyme selection options (see Chapter 3 for details on how to apply options). 3.1.3. Entering the File Name of the Sequence You Would Like to Analyze

If the file name was not already entered on the command line, the program will ask for it. The program then asks for the beginning and end of the fragment to be analyzed, but suggests the full range as “default” values. Hitting <RETURN> is sufficient to accept these choices.

3.1.4. Select the Enzymes

The two recommended replies to the question asked are either “*” for selecting all enzymes, or “**” for selecting all enzymes, including the isoschizomers. Answering the question with “?” will explain the available options.

3.1.5. Review of the Output

Call a text editor and look at the output (assuming that your sequence was called myseq):

$ edit/edt myseq.tick (on VAX/VMS)
% vi myseq.tick (on UNIX)


All entries displayed follow the general format as described in Method 2 (see below). Undesired entries can either be deleted entirely or be disabled by placing an exclamation mark (!) in front of the line, which makes subsequent programs ignore this “tick.” Table 1 in Chapter 3 sketches this procedure.

3.2. Creation of Additional Data Files (Method 2)

The information to be displayed in the final graph can be supplemented by additional information describing inserts, sequence features, or genetic elements. Three kinds of labels are recognized:

Ticks are associated with a single base and marked around the outside of the circle to be drawn. Using Method 1, the restriction sites are shown as ticks.

Blocks are block marks covering a range of bases. Blocks can be shaded and can be labeled. The display will show them inside the circle as concentric ring segments.

Ranges are lines drawn radially inside the circle. They can start and end with a variety of symbols including arrowheads. Ranges should be used for showing sequences like proteins, which have a defined directionality.

Each label is defined by a name (which is shown on the plot), a starting and an ending coordinate, a strand (+ or -), a color (four colors), and a style (Tick, Block, or Range). A range should also begin or end with a character used as head and tail, e.g., > and < are symbols displayed as arrowheads. To create an additional data file, proceed as follows.

3.2.1. Start of the Editor

Create a new file. If your sequence was called myseq, you could give the following command:

$ edit/edt myseq.data (on VAX/VMS)
% vi myseq.data (on UNIX)

The file will be empty. On VAX/VMS, you might see an asterisk (*) as prompt. In this case, enter C<RETURN> to get into insert mode. Otherwise, you are in insert mode already. On UNIX, you need to enter the insert mode by typing a lowercase i.

Table 1
Symbols Used for “Range” Labels in Plasmidmap

“From” symbol   “To” symbol   Output        Description
[               ]             [........]    junction
]               [             [........]    (same as above)
>               >             ........>     arrow
<               <             ........<     (same as above)
|               |             |........|    bar
[T                            [*......*]    junction with head and tail
>T              >H            <........>    arrow with head and tail

Insert all information that is relevant for describing the file (e.g., reference information), and end the comment area with the separator used in all GCG programs: two periods (..). This documentation can be used to label the plot afterward. The first line and the line including the first “..” pattern are ignored.

3.2.2. Additional Label Data for Ticks

Attributes must be separated by spaces. Each line should contain the name, two identical numbers showing the position, a “+” for the strand (irrelevant for ticks), a color (numbers from 1 to 32 or, alternatively, red, green, blue, or black), two “.” (periods) separated by a space, and the keyword “tick.” An example line adding the tick Important would look like

Important 1200 1200 + Black . . Tick

3.2.3. Additional Label Data for Blocks

Attributes must be separated by spaces. Each line should contain the name, a starting and an ending nucleotide coordinate, a “+” or “-” for the strand, a color (numbers from 1 to 32 or, alternatively, red, green, blue, or black), two “.” (periods) separated by a space, and the keyword “block.” An example line adding the block Region would look like

Region 200 800 + Blue . . Block
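The tick and block layouts described above can also be generated by a small script instead of being typed by hand. The following Python sketch is illustrative only: the function name label_line and its argument names are hypothetical, and the field order is simply the one given in the text (name, start, end, strand, color, two periods, keyword).

```python
def label_line(name, start, end, strand, color, style):
    """Format one tick or block label line in the field order given in the text."""
    if style not in ("Tick", "Block"):
        raise ValueError("this sketch only covers Tick and Block labels")
    if style == "Tick" and start != end:
        raise ValueError("a tick needs two identical positions")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if " " in name:
        raise ValueError("attributes are separated by spaces; the name must not contain one")
    # Ticks and blocks carry two periods, separated by a space, before the keyword.
    return f"{name} {start} {end} {strand} {color} . . {style}"


print(label_line("Important", 1200, 1200, "+", "Black", "Tick"))
print(label_line("Region", 200, 800, "+", "Blue", "Block"))
```

Checking the fields before the line is written catches the most common slips (a blank inside a name, or a tick with two different coordinates) before plasmidmap is run.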

3.2.4. Additional Label Data for Ranges

Attributes must be separated by spaces. Each line should contain the name, a starting and an ending nucleotide coordinate, a “+” or “-” for the strand, a color (numbers from 1 to 32 or, alternatively, red, green, blue, or black), a “From” and a “To” symbol (see Table 1), and the keyword “range.” An example line adding the range INSERT would look like

INSERT 1343 1712 + Red [ ] Range

3.2.5. Finishing Up

At this point, you can exit the editor. On VAX/VMS, type <CTRL/Z> and, on the “*” prompt, enter the command exit. On UNIX, exit the insert mode with <ESC> and, on the “:” prompt, give the command wq.

3.2.6. Creation of a File of Filenames

In order to utilize the information assembled in this step, it is necessary to make all input data of the current circular restriction map available to the graphics routine to be used in Method 4. This is achieved by creating a file with a text editor, containing first the filename of the additional data, and, second, the filename of the data file created with Method 1. The example file for the two data sets used in the examples of this chapter would be called myseq.fil, and contain the following lines:

myseq.tic
myseq.data

You could use the following command to look at this file:

$ type myseq.fil (on VAX/VMS)
% cat myseq.fil (on UNIX)
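Before the files are handed to the graphics routine, it can be helpful to check that a label file follows the conventions of Sections 3.2.1-3.2.4. The Python sketch below is a hedged illustration, not part of GCG: the function read_labels is hypothetical, and it assumes that the header ends with the first line containing two periods, that every label line has eight space-separated fields, and that lines starting with an exclamation mark are to be skipped.

```python
def read_labels(text):
    """Return the label records of a data file laid out as described in Method 2."""
    lines = text.splitlines()
    # Everything up to and including the first line that contains ".." is header.
    for i, line in enumerate(lines):
        if ".." in line:
            body = lines[i + 1:]
            break
    else:
        raise ValueError("no '..' separator found")
    records = []
    for line in body:
        line = line.strip()
        if not line or line.startswith("!"):  # blank line or disabled entry
            continue
        name, start, end, strand, color, frm, to, style = line.split()
        records.append({"name": name, "start": int(start), "end": int(end),
                        "strand": strand, "color": color,
                        "from": frm, "to": to, "style": style})
    return records


example = """Labels for myseq
reference information goes here ..
Important 1200 1200 + Black . . Tick
!Region 200 800 + Blue . . Block
INSERT 1343 1712 + Red [ ] Range
"""
for record in read_labels(example):
    print(record["style"], record["name"], record["start"], record["end"])
```

A line with a missing or surplus attribute makes the unpacking fail, which points directly at the offending entry.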

3.2.7. Draft Preview

If you want to, you can now use the plasmidmap program, as described in Method 4, for previewing. Provided that the graphics environment has previously been defined with setplot, use the command

plasmidmap @myseq.fil (both VAX/VMS and UNIX)

3.3. Creation of a File Used for Graphics Tuning (Method 3)

This step can be skipped if raw quality is sufficient. Publication printouts, however, can benefit from tuning.

3.3.1. Obtaining the Template File

Most of the GCG programs will utilize a so-called “local data file,” which contains data you can also enter on the command line. The program used in Method 4 utilizes such an initialization file. To copy such a file into your directory, use the command

fetch plasmidmap.init (both VAX/VMS and UNIX)

The file is shown in Fig. 2.

3.3.2. Customization of the Initialization File

Most of the entries in the file are self-explanatory. To change values, call the editor and change the numbers accordingly. The following options could be changed in order to make the plot more readable on slides:

nocapt                 to remove the “caption” at the left side of the plot
boldcircle             to draw a stronger outer circle
mmtheight = 4.7        larger tick labels
mintheight = 4.2       larger minimum tick size
mmtitleheight = 15     a large title
mmscaletlength = 5.0   larger separation between circles

3.4. Creation of the Final Graphics (Method 4)

3.4.1. Select the Graphics Devices

The program setplot permits you to view and select the graphics devices that are configured at your site. It is suggested to start with displayed graphics, and to select hardcopies in the final step.

3.4.2. The Program plasmidmap

The program plasmidmap will use the information prepared in Methods 1-3. The only parameter needed is the file of filenames prepared in Method 2, step 6. Therefore, the command is

plasmidmap @myseq.fil (both VAX/VMS and UNIX)
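The “@” in front of myseq.fil tells the program to read its input names from the file of filenames rather than to treat the argument as a data file itself. The following Python sketch mimics this indirection for illustration; the helper expand_argument is hypothetical, and how GCG itself treats comments or blank lines in such a file is not covered here.

```python
from pathlib import Path
import tempfile


def expand_argument(arg):
    """Return the data files named by arg; '@name' means: read the names from file 'name'."""
    if not arg.startswith("@"):
        return [arg]
    listing = Path(arg[1:]).read_text().splitlines()
    return [line.strip() for line in listing if line.strip()]


# Demonstration with a temporary file of filenames.
with tempfile.TemporaryDirectory() as tmp:
    fil = Path(tmp) / "myseq.fil"
    fil.write_text("myseq.tic\nmyseq.data\n")
    print(expand_argument("@" + str(fil)))
```

Running such a check before plotting shows at a glance which data files will be combined in the map.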

3.4.3. Refinement of Display Data

Return to Method 2 and refine the label data if you are not satisfied with the appearance of the details. Rerun plasmidmap (see above) to review the result.

3.4.4. Refinement of Display Appearance

Return to Method 3 and re-edit the initialization file in order to change the output appearance. Rerun plasmidmap (see above) to review the result.

3.4.5. Archiving the Result

Run the plasmidmap program with the option figure to prepare a file that can be displayed later on any graphics device with the figure program of GCG:

$ plasmidmap/figure @myseq.fil (on VAX/VMS)
% plasmidmap -figure @myseq.fil (on UNIX)

3.4.6. Prepare Final Output

Run setplot to define a hardcopy device and use the program figure in order to output the result. The figure program requires the file name of the data file created in step 5 of this method. It is suggested that you also use the option font=3 (for HELVETICA-like output) or font=6 (TIMES-like font) for a nicer display:

$ figure/font=6 plasmidmap.figure (on VAX/VMS)
% figure -font=6 plasmidmap.figure (on UNIX)

Figure 3 shows the final output of plasmidmap created with the methods described.
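Throughout this chapter, the same option is written with a slash on VAX/VMS and with a leading hyphen on UNIX. The Python sketch below captures that pattern; the function gcg_command is a hypothetical helper for composing command strings and does not call any GCG program.

```python
def gcg_command(program, options=None, argument=None, system="unix"):
    """Build a command string; options maps option names to values (None for a bare switch)."""
    parts = []
    for name, value in (options or {}).items():
        parts.append(name if value is None else f"{name}={value}")
    if system == "vms":
        # VAX/VMS: options follow the program name directly, each introduced by "/".
        command = program + "".join("/" + part for part in parts)
    elif system == "unix":
        # UNIX: options are separate words, each introduced by "-".
        command = " ".join([program] + ["-" + part for part in parts])
    else:
        raise ValueError("system must be 'vms' or 'unix'")
    return command if argument is None else f"{command} {argument}"


print(gcg_command("figure", {"font": 6}, "plasmidmap.figure", system="vms"))
print(gcg_command("figure", {"font": 6}, "plasmidmap.figure", system="unix"))
```

The same helper reproduces the other commands of this chapter, e.g., mapsort with the bare switch plasmid.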

4. Notes

1. Methods 1 and 4 require knowledge of the options available. In GCG programs, these can be summarized and displayed using the option check.

2. Methods 2 and 3 require knowledge of the editor. It is useless to attempt to modify files if the editor cannot be used effectively.

3. Methods 1 to 4 should be applied in that order, and subsequently reused to refine the display. Though hard copies are helpful for archiving and publication, the time needed to plot the (final) graphics is considerable and, therefore, creating a hardcopy should be the last step.

PLASMIDMAP command line initializing file.

The format of this file follows the general rules for GCG data files. The text
introduction is separated from the data by a line containing two adjacent
periods. Blank lines and lines that start with an '!' are comments. Anything to
the right of a '!' is ignored. The capitalized portion of the qualifier in the
file below is the maximum truncation possible. Variables may be defined in any
order. If a variable is defined twice the last value is used. ..

! Switches

/BOLDCircle          ! draws a thick circle
/BOLDMajorTicks      ! draws diamond-shaped major scale ticks
/BOLDRanges          ! draws thick ranges
/NOSORTRanges        ! sorts the ranges by size
/SORTBlocks          ! sorts the blocks by size
/CAPTion             ! writes information about the plot next to it
/DRAWScale           ! draws two circles with small scale ticks in between
/ARCText             ! enables all curved text
/OUTERNumbers        ! enables numbering on ticks outside the circle
/NODRAWRays          ! suppresses inwardly directed radial dotted lines
/NOINNERNumbers      ! suppresses scale numbers at the ends of inward rays
/TITLe               ! enables the title at center of circle
/NOHALFShade         ! does not suppress cross-hatch block shading
/NOPROmpt            ! suppresses prompting for parameter values

! optional shading of 'blocks'
! 0=noshading, 1=sparse, 2=medium, 3=dense, 4=solid

/SHADEDensity=

! The plot can be located in any of the four corners, centered, centered
! right or left according to the value of the /LOCAtion parameter.
! 1=whole, 2=LeftCenter, 3=RightCenter, 4=NorthWest, 5=SW, 6=NE and 7=SE.

/LOCAtion=

! The origin can be labeled with the molecule's beginning or ending
! coordinates or both.
! 0=begin, 1=end, 2=both

/ORIGin=1

! The origin's label can be given emphasis
! 0=no emphasis, 1=Bold-Face, 2=Parenthesis, 3=Asterisks

/EMPHasis=0

! Character heights (in millimeters for location 1 on 11 x 17 paper):

/MMTHeight=5.3       ! height of tick labels
/MINTHeight=3.2      ! minimum height of tick labels
/MMRHeight=3.7       ! height of range labels
/MMBHeight=4.0       ! height of block labels
/MMREMARKHeight=6    ! height of documentation
/MMTITLEHeight=10    ! height of title inside circle
/MMNUMBERHeight=3.5  ! height of scale numbers

! Tick lengths:

/MMSCALETLength=3    ! length of the scale ticks (inside double circle)

! For all features the color may be set to Black=1, Green=2, Blue=3 and Red=4.

/TICKColor=1         ! color of the ticks outside the circle
/RAYColor=           ! color of the radially directed dotted lines
/SCALEColor=         ! color of the scale ticks inside the double circle
/RANGEColor=1        ! color of the ranges
/NUMBERColor=        ! color of the scale numbers outside the circle
/DELIMColor=4        ! color of the delimiters between coincidental labels
/TITLEColor=1        ! color of the title
/CIRCLEColor=        ! color of the circle

! The characters used to separate multiple labels on a single tick may
! be changed

/SEParator1="/"      ! separates labels with the same base location
/SEParator2=         ! separates the coalesced labels with different locations

! The number of lines of documentation below the circle may be set

/REMarks=

! For the example session in the Program Manual this setting was also used

/SLow=2              ! plot slower to get crisper features

Fig. 2. Template file available to create your own initialization file for the plasmidmap program.

4. Initialization files are not specific to a particular plasmid and can be reused if multiple data files reside in the same directory. However, label files usually apply to a single sequence and are not transferable.

5. The program permits use of up to 32 colors. Only a few devices use more than four colors; therefore, it is recommended to stick to the colors red, green, blue, and black.

6. “Block” labels can be shaded. Solid shading (option 4) is generated by plotting many adjacent lines and thus causes problems on plotters with ink pens.

7. The tick and label files use spaces as separators. If a name is to contain a blank, this can be achieved with an underscore (“_”), which is not displayed in the name as such but generates a space.

8. Greek letters are not supported as part of the name if normal fonts are used. Manual editing of the figure file is tedious and not practical for most users. Therefore, either spelled-out names, e.g., “alpha-Lactamase” instead of α-Lactamase, or manual reinking are suggested.

Fig. 3. Final output of plasmidmap.

9. For some applications, other fonts might be better suited. Run the program plottest with the option showfonts in order to display the fonts available.

10. If graphics are to be reused on PCs, only the encapsulated Postscript file (EPSF) format is supported. Ask your program manager for details on how to transfer data to your PC. You will need a desktop publishing program that can import EPSF files for this purpose.
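Notes 5 and 7 lend themselves to a small pre-processing step when label files are written by a script. The Python sketch below is a hedged illustration with hypothetical helper names; it only encodes the two rules stated in the Notes (underscores stand for blanks, and the four basic colors are the safe choice).

```python
SAFE_COLORS = ("red", "green", "blue", "black")  # the four colors recommended in Note 5


def field_for_name(text):
    """Replace blanks by underscores so that the name stays a single field (Note 7)."""
    return "_".join(text.split())


def displayed_name(field):
    """Show how the field appears on the plot according to Note 7."""
    return field.replace("_", " ")


def check_color(color):
    """Accept only the four colors that most devices can reproduce."""
    if color.lower() not in SAFE_COLORS:
        raise ValueError(f"{color!r} may not be available on four-color devices")
    return color


print(field_for_name("alpha Lactamase"))
print(displayed_name("alpha_Lactamase"))
print(check_color("Blue"))
```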

Reference

1. Devereux, J., Haeberli, P., and Smithies, O. (1984) A comprehensive set of sequence analysis programs for the VAX. Nucl. Acids Res. 12, 387-395.

CHAPTER 5

GCG: Displaying Restriction Sites and Possible Translations in a DNA Sequence

Reinhard Dölz

From: Methods in Molecular Biology, Vol. 24: Computer Analysis of Sequence Data, Part I. Edited by: A. M. Griffin and H. G. Griffin. Copyright ©1994 Humana Press Inc., Totowa, NJ

1. Introduction

As mentioned in Chapter 3, restriction maps are generated on a computer using programs that match your sequence against predefined sequence patterns (referred to as “enzymes”). Positive hits are displayed as a function of the sequence coordinate plotted vs the patterns found. In order to view the result with respect to the amino acid sequence (provided that the DNA sequence does have a reading frame), automatic translation can be achieved by using either a standard codon usage table or any other table provided in the correct format. If the reading frame is known, site-directed mutagenesis can be planned by searching for “silent” restriction sites that can be introduced without altering the peptide sequence. In order to restrict the amount of output resulting from the pattern matching calculation, a preselection of patterns (“enzymes”) can be achieved with a variety of options. Furthermore, parts of the target sequence can be easily excluded from the calculation.

2. Materials

The methods and the programs reported here are part of the GCG program package (1), Version 7.x, to be installed on either a VAX/ALPHA computer running the operating system VMS or one of the supported UNIX systems: Silicon Graphics (IRIX), Digital Equipment (ULTRIX), or Sun Computers (SUNOS). The programs can be obtained from GCG Inc., University Research Park, 575 Science Drive, Suite B, Madison, WI, 53711. A version for the CONVEX variant is also available from Convex Corp., Dallas, TX. The computer system should be equipped with at least 16 MByte of memory, and should hold about 1 GByte of disk space for program, database, and scratch area.

All GCG programs generally require a terminal that is capable of emulating the VT100 terminal mode standard. Scroll-back facilities as provided on workstations or in PC terminal emulators are advantageous but not essential. The text output can be reprocessed with word processing programs on personal computers. The data transfer capabilities required for this are usually provided within the terminal emulation program. The output form of the map program was designed by Schroeder and Blattner (2).

If text output is to be printed, any ASCII code printer is sufficient. Capabilities to print 132 characters per line may be advantageous but are not essential (e.g., Method 2).

To analyze a sequence, it is required that sequence data are formatted in GCG sequence file format (see Chapters 2, 6, 8, and 11 for methods on how to create a sequence in this format). The general procedure for preparing a restriction map involves comparison of the query sequence to a set of known restriction site patterns deposited in a database. The GCG programs require a special format of this database, which is available either as GCG distribution or from various file servers on the internet. (See Appendix of Chapter 3 for details.) In order to select the enzymes you are interested in, either a personal database is created (see Chapter 3, Method 2) or, alternatively, several program “options” can be employed that will permit the use of selected enzymes instead of the entire database (see Chapter 3, Method 1).

If amino acid translation is desired, a standard codon usage table is provided by the program package. In case of nonstandard codon usage (e.g., mitochondrial sequences), it is possible to use a different file. Besides the codon files available from within the GCG package (see Table 1), codon files can be fetched from the file servers (see Appendix of Chapter 3) and modified individually.
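The comparison of a query sequence with known restriction site patterns can be pictured with a few lines of Python. This is a minimal sketch and not the algorithm used by map: the function find_sites is hypothetical, only exact matches on the given strand are reported, ambiguity codes are not handled, and the small dictionary stands in for the enzyme database. The recognition sequences of EcoRI, HindIII, and BglII are the standard ones.

```python
ENZYMES = {"EcoRI": "GAATTC", "HindIII": "AAGCTT", "BglII": "AGATCT"}


def find_sites(sequence, enzymes=ENZYMES, circular=False):
    """Return {enzyme: [1-based start positions]} of exact pattern matches."""
    seq = sequence.upper()
    hits = {}
    for name, pattern in enzymes.items():
        # For a circular molecule, append the first len(pattern)-1 bases so
        # that sites spanning the origin are found as well.
        target = seq + seq[:len(pattern) - 1] if circular else seq
        positions = []
        start = target.find(pattern)
        while start != -1:
            positions.append(start + 1)
            start = target.find(pattern, start + 1)
        if positions:
            hits[name] = positions
    return hits


print(find_sites("AAGAATTCAA"))
print(find_sites("TTCAAAAGAA", circular=True))
```

The second call shows why the linear/circular distinction matters: the EcoRI site spans the origin and is found only when the sequence is treated as circular.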


Table 1
File Names of Codon Usage Tables Employed in the Mapping Programs

File name           Contents
translate.txt       Standard codon usage. Ambiguous codes are ignored.
transciliate.txt    Ciliated protozoa; TAG and TAA are translated to Q (Gln) instead of STOP.
transmitodros.txt   Drosophila mitochondrial. Differences: TGA W (STOP), ATA M (Ile), AGA S (Arg), AGG S (Arg).
transmitomam.txt    Mammalian mitochondrial. Differences: TGA W (STOP), ATA M (Ile), AGA STOP (Arg).
transmitomam.txt    Yeast mitochondrial. Differences: TGA W (STOP), CTT T (Leu), CTC T (Leu), CTA T (Leu), CTG T (Leu), ATA M (Ile).

Translation tables available for use within the GCG programs. Any of the files can be copied into your own directory with the command fetch (both VAX/VMS and UNIX).
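The effect of the alternative tables in Table 1 can be sketched as a standard codon table plus a handful of overrides. The Python code below is an illustration only: the names STANDARD, DIFFERENCES, codon_table, and translate are hypothetical, the variant keys are invented for this example, and the file format of the GCG tables is not reproduced. The overrides are copied from Table 1, with “*” standing for STOP.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
STANDARD = {a + b + c: AMINO[16 * i + 4 * j + k]
            for i, a in enumerate(BASES)
            for j, b in enumerate(BASES)
            for k, c in enumerate(BASES)}

# Differences from the standard table as listed in Table 1.
DIFFERENCES = {
    "ciliate": {"TAG": "Q", "TAA": "Q"},
    "drosophila_mito": {"TGA": "W", "ATA": "M", "AGA": "S", "AGG": "S"},
    "yeast_mito": {"TGA": "W", "CTT": "T", "CTC": "T", "CTA": "T",
                   "CTG": "T", "ATA": "M"},
}


def codon_table(variant=None):
    """Return the standard table, modified by the overrides of the chosen variant."""
    table = dict(STANDARD)
    table.update(DIFFERENCES.get(variant, {}))
    return table


def translate(dna, variant=None):
    """Translate complete codons of dna, starting at the first base."""
    table = codon_table(variant)
    dna = dna.upper()
    return "".join(table.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


print(translate("ATGTGATAA"))
print(translate("ATGTGATAA", "yeast_mito"))
```

The same nine bases read as Met followed by two stops with the standard table, but as Met-Trp-stop with the yeast mitochondrial differences.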

3. Methods

3.1. Generating a Simple Map Without Translation (Method 1)

3.1.1. Start of the Program map

The program map is started by typing the command map on the command line with appropriate options to preselect enzymes (see Chapter 3, Method 1 for these options). Additionally, the sequence file name can be specified after the command.


3.1.2. Definition of the Target Sequence Beginning and End

If not already entered, the program will prompt for a sequence file name. Then, the program will suggest to start with position 1, and end at the last base. Hitting <RETURN> is sufficient to accept these choices.

3.1.3. Selection of the Enzymes

The two recommended replies to the question asked are either “*” for selecting all enzymes, or “**” for selecting all enzymes, including the isoschizomers. Answering the question with “?” will explain the available options.

3.1.4. Definition of Translation Scheme

In this method, the option “n” for no translation is recommended.

3.1.5. Definition of the Output Filename

Hitting <RETURN> accepts the suggested file name, which is composed of the sequence name and the extension .map.

3.1.6. Review of the Output

The output file shows a comment area, the enzymes plotted above the sequence, a numbering line, and the complement. The bottom of the output file shows a summary of enzymes that do and do not cut. The program identifies patterns by the position where they occur. If identical positions are to be labeled with multiple enzymes, a “/” is printed in the next column and the enzyme name is displayed there. If the display is still too crowded, the selection of enzymes should be more strict. (See Chapter 3, Method 1 for details and restart with Section 3.1.1.)

3.2. Generating a Restriction Map with a Six-Frame Protein Translation Showing Open Reading Frames (Method 2)

3.2.1. Start of the Program map

The program map is started by typing the command map on the command line with appropriate options to preselect enzymes (see Chapter 3, Method 1 for these options). If you own a 132 characters per line printer, add the option width = 100 to the command line. Additionally, add open = 30 to show only reading frames with a length of more than 10 amino acids.
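The option open = 30 is given in bases, i.e., 10 codons. What such a length filter does can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of map: the function open_frames is hypothetical, and an open frame is taken here to be any stretch of complete codons free of stop codons, which may differ from the program's exact definition.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def open_frames(sequence, minimum=30):
    """Yield (frame, start, length) for stop-free stretches of at least `minimum` bases.

    Frames 1-3 are read on the given strand, frames 4-6 on its reverse
    complement; `start` is a 0-based offset on the strand being read.
    """
    forward = sequence.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    for strand_index, strand in enumerate((forward, reverse)):
        for offset in range(3):
            frame = 3 * strand_index + offset + 1
            run_start = offset
            for pos in range(offset, len(strand) - 2, 3):
                if strand[pos:pos + 3] in STOPS:
                    if pos - run_start >= minimum:
                        yield (frame, run_start, pos - run_start)
                    run_start = pos + 3
            # Stretch that runs up to the last complete codon of the strand.
            last = offset + 3 * ((len(strand) - offset) // 3)
            if last - run_start >= minimum:
                yield (frame, run_start, last - run_start)


example = "ATG" + "GCT" * 10 + "TAA"
for frame in open_frames(example, minimum=30):
    print(frame)
```

Raising the minimum shortens the list, which is the effect the open option is used for in this method.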

3.2.2. Definition of Beginning and End

If not already entered at the command line, the program will ask for a sequence file name. Then, the program will suggest to start with position 1, and end at the last base. Hitting <RETURN> is sufficient to accept these choices.

3.2.3. Selection of the Enzymes

The two recommended replies to the question asked are either “*” for selecting all enzymes, or “**” for selecting all enzymes, including the isoschizomers. Answering the question with “?” will explain the available options.

3.2.4. Definition of the Translation Scheme

In this method, the use of the option “o” for translating only the open reading frames is recommended.

3.2.5. Definition of the Output Filename

Hitting <RETURN> accepts the suggested file name, which is the sequence name with the extension .map.

3.2.6. Review the Output

The program shows the enzyme names positioned vertically at the nucleotide where the restriction pattern is located. Multiple restriction enzymes cutting at the identical position are indicated by a “/” at the bottom of the name. Below the sequence and its complement, the six open translation frames are shown. This kind of output is best suited for lab journals.

3.3. Generating a Restriction Map with a Protein Translation Showing Three-Letter Translations on the Three Forward Frames (Method 3)

This method uses an additional option to suppress the complement line.

3.3.1. Start of the Program map

The program map is started by typing the command map on the command line with appropriate options to preselect enzymes (see Chapter 3, Method 1 for these options). Additionally, add the option nocomplement to suppress the complement.

3.3.2. Definition of Beginning and End

The program will suggest to start with position 1, and end at the last base. Hitting <RETURN> is sufficient to accept these choices.

3.3.3. Selection of the Enzymes

The two recommended replies to the question asked are either “*” for selecting all enzymes, or “**” for selecting all enzymes, including the isoschizomers. Answering the question with “?” will explain the available options.

3.3.4. Definition of the Translation Scheme

In this method, all three forward reading frames are displayed. To achieve three-letter translation, enter “T” (letter T in upper case).

3.3.5. Definition of the Output Filename

Hitting <RETURN> accepts the suggested file name, which is the sequence name with the extension .map.

3.3.6. Review the Output

The output is shown partially in Fig. 1. Note the coincidence of several cleavage sites, already mentioned in Methods 1 and 2.

3.4. Generating a Restriction Map Showing Potential Restriction Sites to Be Introduced by Site-Directed Mutagenesis (Method 4)

3.4.1. Start of the Program map

To start the program map, type the command map on the command line with the appropriate options to preselect enzymes (see Chapter 3, Method 1 for these options). Additionally, use the options silent, nocomp, and noscal in order to suppress the sequence complement and the scale line.

3.4.2. Definition of Beginning and End

This method requires that you know the beginning and ending coordinates of your reading frames. Therefore, the default values should be disregarded, and the numbers should be entered properly.

(Linear) (Six Base) MAP of: Hscam check: 3752 from: 1 to: 1126

RL;HSCAM - Human calmodulin mRNA, complete cds
ID HSCAM standard; RNA; PRI; 1126 BP.
AC M19311; J03468;
DT 06-JUL-1989 (Rel. 20, Last updated, Version 1)
DT 16-JUL-1988 (Rel. 16, Created) . . .

With 132 enzymes: *        March 20, 1992 11:09 ..

[enzyme names printed vertically above their positions]
     GCGAGCTGAGTGGTTGTGTGGTCGCGTCTCGGAAACCGGTAGCGCTTGCAGCATGGCTGA
   1 ---------+---------+---------+---------+---------+---------+ 60
a:   AlaSerEndValValValTrpSerArgLeuGlyAsnArgEndArgLeuGlnHisGlyEnd
b:   ArgAlaGluTrpLeuCysGlyArgValSerGluThrGlySerAlaCysSerMetAlaAsp
c:   GluLeuSerGlyCysValValAlaSerArgLysProValAlaLeuAlaAlaTrpLeuThr

[enzyme names printed vertically above their positions]
     CCAACTGACTGAAGAGCAGATTGCAGAATTCAAAGAAGCTTTTTCATTATTTGACAAAGA
  61 ---------+---------+---------+---------+---------+---------+ 120
a:   ProThrAspEndArgAlaAspCysArgIleGlnArgSerPhePheIleIleEndGlnArg
b:   GlnLeuThrGluGluGlnIleAlaGluPheLysGluAlaPheSerLeuPheAspLysAsp
c:   AsnEndLeuLysSerArgLeuGlnAsnSerLysLysLeuPheHisTyrLeuThrLysMet

[enzyme names printed vertically above their positions]
     TGGTGATGGCACTATAACAACAAAGGAACTTGGGACTGTAATGAGATCTCTTGGGCAGAA
 121 ---------+---------+---------+---------+---------+---------+ 180
a:   TrpEndTrpHisTyrAsnAsnLysGlyThrTrpAspCysAsnGluIleSerTrpAlaGlu
b:   GlyAspGlyThrIleThrThrLysGluLeuGlyThrValMetArgSerLeuGlyGlnAsn
c:   ValMetAlaLeuEndGlnGlnArgAsnLeuGlyLeuEndEndAspLeuLeuGlyArgIle

Fig. 1. (Partial) output from Method 3.
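The three-letter translation of the three forward frames shown in Fig. 1 can be reproduced with a short script. The Python sketch below is illustrative and not the code of map: the function three_letter_frames is hypothetical, it uses the standard genetic code only, and, unlike the program output, it does not complete codons that run over the end of the given stretch.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODONS = {a + b + c: AMINO[16 * i + 4 * j + k]
          for i, a in enumerate(BASES)
          for j, b in enumerate(BASES)
          for k, c in enumerate(BASES)}
THREE = {"A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
         "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
         "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
         "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
         "*": "End"}


def three_letter_frames(dna):
    """Return {'a': ..., 'b': ..., 'c': ...} with three-letter translations."""
    dna = dna.upper()
    frames = {}
    for offset, label in enumerate("abc"):
        codons = [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]
        frames[label] = "".join(THREE[CODONS[codon]] for codon in codons)
    return frames


# First 60 bases of the sequence shown in Fig. 1.
first_line = "GCGAGCTGAGTGGTTGTGTGGTCGCGTCTCGGAAACCGGTAGCGCTTGCAGCATGGCTGA"
for label, peptide in three_letter_frames(first_line).items():
    print(label + ":", peptide)
```

Because each residue occupies three characters, the translation lines up codon by codon with the nucleotide sequence, which is what makes this output format convenient to read.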

3.4.3. Selection of the Enzymes

The two recommended replies to the question asked are either “*” for selecting all enzymes, or “**” for selecting all enzymes, including the isoschizomers. Answering the question with “?” will explain the available options.

3.4.4. Definition of the Translation Scheme

In this method, it should be known in which reading frame translation occurs. For clarity, this should be the only reading frame to be displayed. Enter the letter of the correct reading frame, e.g., “B” will give three-letter translation of the second reading frame. Refer to Chapter 10, “Finding a Reading Frame,” for details on how to determine a reading frame.

3.4.5. Definition of the Output Filename

Hitting <RETURN> accepts the suggested file name, which is the sequence name with the extension .map.

3.4.6. Review the Output

The output is shown partially in Fig. 2. Note the differences to Fig. 1: the enzymes that cleave the actual sequence are displayed in UPPERCASE letters. There is a known problem with determining silent restriction sites in some special cases; see Notes section for details.

4. Notes

1. All methods require knowledge of the options available. In the GCG programs, these can be summarized and displayed using the option check.

2. The program assumes that the sequence given is linear. Circular maps can be calculated by adding the command line option circular, which has the effect that the sequence is treated as circular. However, the resulting map is still displayed linearly.

3. To use translation tables other than the standard, use the option translate=<filename>. Known filenames are listed in Table 1. In GCG programs, either a provided table or your own table can be supplied.

4. If you print 132 characters per line output on an 80 characters per line printer, the right part of the output will be missing. This could lead to misleading conclusions. It is possible to check whether the output is printed properly by looking at the horizontal numbering lines. Both sides of the line have the beginning and ending numbers displayed.

(Linear) (Six Base) (Silent) MAP of: Hscam check: 3752 from: 53 to: 400

RL;HSCAM - Human calmodulin mRNA, complete cds
ID HSCAM standard; RNA; PRI; 1126 BP.
AC M19311; J03468;
DT 06-JUL-1989 (Rel. 20, Last updated, Version 1)
DT 16-JUL-1988 (Rel. 16, Created) . . .

With 132 enzymes: *        March 20, 1992 11:03 ..

[enzyme names printed vertically at the matching positions]
     ATGGCTGACCAACTGACTGAAGAGCAGATTGCAGAATTCAAAGAAGCTTTTTCATTATTT
b:   MetAlaAspGlnLeuThrGluGluGlnIleAlaGluPheLysGluAlaPheSerLeuPhe

[enzyme names printed vertically at the matching positions]
     GACAAAGATGGTGATGGCACTATAACAACAAAGGAACTTGGGACTGTAATGAGATCTCTT
b:   AspLysAspGlyAspGlyThrIleThrThrLysGluLeuGlyThrValMetArgSerLeu

Fig. 2. (Partial) output from Method 4.

5. New recognition sites requiring silent substitutions outside of the recognition pattern cannot be found by map in conjunction with the silent option.
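The idea behind the search for silent sites in Method 4 can be illustrated in Python. This is a minimal sketch under stated assumptions and not the algorithm of map: the function silent_sites is hypothetical, the coding sequence is assumed to start in frame at its first base, the recognition sequence is matched exactly, and the standard genetic code is used. As in Note 5, only bases inside the recognition pattern are changed.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODONS = {a + b + c: AMINO[16 * i + 4 * j + k]
          for i, a in enumerate(BASES)
          for j, b in enumerate(BASES)
          for k, c in enumerate(BASES)}


def translate(dna):
    """Translate complete codons, starting at the first base."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def silent_sites(coding, site):
    """Return 0-based positions where `site` can be written into `coding`
    without changing the encoded peptide. Sites already present are skipped."""
    coding, site = coding.upper(), site.upper()
    peptide = translate(coding)
    positions = []
    for i in range(len(coding) - len(site) + 1):
        mutated = coding[:i] + site + coding[i + len(site):]
        if mutated != coding and translate(mutated) == peptide:
            positions.append(i)
    return positions


# Met-Gly-Ser-stop; the BamHI site GGATCC fits by changing TCA (Ser) to TCC (Ser).
print(silent_sites("ATGGGATCATAA", "GGATCC"))
```

A site that could only be created by also changing a base next to the recognition pattern is not reported by this approach either.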

References

1. Devereux, J., Haeberli, P., and Smithies, O. (1984) A comprehensive set of sequence analysis programs for the VAX. Nucl. Acids Res. 12, 387-395.

2. Schroeder, J. and Blattner, F. (1982) Nucl. Acids Res. 10, 69-84.

CHAPTER 6

GCG: Assembly of Sequences into New Sequence Constructs

Reinhard Dölz

1. Introduction

Sequence information in the computer needs to be stored in a formatted way in order to accommodate the following features specific to this data type:

1. Sequence name.
2. Comments (e.g., reference information).
3. Sequence data.

Further, it is desirable to have a checksum stored somewhere that permits consistency checking of the sequence data, to prevent accidental loss or falsification.

The sequence name is usually defined by the file name used. Whereas this approach is intuitive, it imposes the restriction that the user knows about file name conventions on the computer system used. In general, it is a good idea to use exclusively letters (preferably, always spelled lowercase for ease of typing) and the numbers 0-9. If separators are needed, only the underscore “_” should be used in order to avoid problems.

The reference and sequence information could be entered from a file that has been prepared on another computer or by other software; see Chapter 27 for tools available. It is discouraged to use text editors or word processors for entering sequence data. Instead, the GCG programs supply the seqed program, which is a powerful tool to enter, manipulate, and store sequences in the correct sequence format. Method 1 will describe how to edit a sequence from scratch. Method 2 describes tools for manipulating the sequence, as well as how to use existing sequences to generate new constructs.

2. Materials

The methods and the programs reported here are part of the GCG program package (1), Version 7.x, to be installed on either a VAX/ALPHA computer running the operating system VMS or one of the supported UNIX systems: Silicon Graphics (IRIX), Digital Equipment (ULTRIX), or Sun Computers (SUNOS). The programs can be obtained from GCG Inc., University Research Park, 575 Science Drive, Suite B, Madison, WI, 53711. A version for the CONVEX variant is also available from Convex Corp., Dallas, TX. The computer system should be equipped with at least 16 MByte of memory, and should hold about 1 GByte of disk space for program, database, and scratch area.

All GCG programs generally require a terminal that is capable of emulating the VT100 terminal mode. The sequence editor, as described here, uses a sophisticated screen management that requires that the VT100 terminal emulation is followed exactly. It is known that some terminal emulators on personal computers fail to translate cursor sequences properly.

In order to enter sequence via a digitizer, it is necessary to install either a modem splitter or a second device port to have the CPU recognize the additional input device. The second port may be attached either directly to a port or to a terminal server capable of running the LAT protocol. The setup works for both VAX/VMS and UNIX versions of GCG but is difficult to install for biologists. Expert knowledge is absolutely required to install the hardware. The GCG program documentation is sufficient to help the hardware people set up the system properly. At our site, the digitizers are placed on a large light box, with the menu needed for its operation photocopied from the documentation onto a permanently fixed transparency. It is highly desirable to fix the gel with sticky tape in order to have as little distortion as possible. We use both UNIX and VAX/VMS systems on terminals attached to LAT protocol servers without problems, but have noticed that terminals may suffer from noise if the digitizer is used; therefore, the digitizer should be turned on only if needed.


To use a sequence within the GCG environment, it needs to be properly formatted. Method 1 describes how to enter sequences from scratch, Chapter 2 shows how to assemble sequences from fragments, and Chapter 27 lists tools for translating foreign formats to the GCG sequence format.

3. Methods

3.1. Entering a Sequence from Scratch (Method 1)

The sequence editor consists of three regions that are used for different purposes. A schematic screen copy is shown in Fig. 1: the upper four lines are reserved for nonsequence information, the middle area is the actual sequence line, and the bottom line accepts command input from the keyboard. Additional features will be discussed in Method 2.

3.1.1. Start of the seqed Program

The sequence editor is started by typing the command seqed on the command line. For a new sequence, the program asks for a sequence name. It is recommended to use the extension .seq for DNA and .pep for peptide sequences, because this convention is used throughout the GCG program package.

3.1.2. Filling the Comment Area

The editing of a new sequence starts in the heading comments. The colons displayed at the side of the screen denote a window that is used to slide across the nonsequence information of the data. Any text, comment, or description may be entered here without regard to format conventions. Once the fourth line is reached, the text will scroll in order to permit additional input. Editing in this area works very much like the simple edt editor of the VAX/VMS operating system (no keypad support). Even in the UNIX version, the editing behaves like a scratch pad, comparable to emacs. The heading editor can be left with (on VAX/VMS; on UNIX, ) in order to get to the sequence line. If additional modifications are to be applied later, the heading can be reached with the command heading on the command line.

[Figure 1: seqed screen with three labeled areas: the heading (comments, annotation, etc.), here showing the entry HSCAM (human calmodulin mRNA, complete cds; 1126 BP); the sequence line; and the command line.]

Fig. 1. Screen copy of the seqed program. The shaded boxes and arrows have been added to show the commands needed to move within screen areas in the VAX/VMS version. The corresponding UNIX keystrokes are mentioned in the text.

3.1.3. Editing the Sequence by Keyboard

Any valid sequence symbol is accepted by typing it. The <delete> key deletes the current symbol and moves the cursor one position to the left. Nucleotide sequence editing may be easier if the keyboard has been customized (see Method 3 in Chapter 2). When sequence editing is finished, or if data are to be entered via a digitizer, <CTRL-Z> (on VAX/VMS; on UNIX, ) will move the cursor to the command line.

3.1.4. Editing the Sequence by Digitizer

Within the GCG program seqed, the digitizer mode is enabled by typing the command digitizer at the command prompt. From this moment onward the keyboard is disabled, and all input is made with the stylus of the digitizer. In order to work properly, the menu should be fixed with sticky tape to the table where the digitizer is positioned. The first questions asked by the seqed program concern the definitions of the four lanes for GATC and the corners of the menu. After this procedure, seqed “knows” the location of the relevant areas for input. The definition of the gel lanes can be repeated with the menu item “RELOAD.” In order to return to the keyboard, select the item “KEYBOARD.” Most of the keystroke commands can be issued via the menu of the digitizer.

3.1.5. Checking Newly Entered Sequences

In order to check a sequence entered from scratch, type the command check on the command line. The cursor will jump one line above the actual sequence line, and the sequence can be entered again either via keyboard or via digitizer. On a mismatch, the bell rings, and the <cursor-up> and <cursor-down> keys can be used to select which of the two sequences contains the error. Note that the checking sequence is not stored.

3.1.6. Saving the New Sequence

The command exit, given on the command line, will write all sequence data entered previously to the file that was specified at the beginning. More commands for the command line are described in Method 2.

3.2. Manipulation of Sequences (Method 2)

3.2.1. Start of seqed Program

The seqed program can be started by specifying its name and the sequence to be edited. If no sequence name is given, or if the given name is an invalid file specification, the program will prompt for a valid sequence name.

3.2.2. Maneuvers Within the Sequence

The cursor keys <cursor-left> and <cursor-right> permit movement across the sequence by a single nucleotide. The < and > keys move in steps of 50 to the left and right, respectively. Any number followed by <RETURN> is considered to be an absolute coordinate, and a number followed by an arrow key will cause a relative move.

3.2.3. Finding a Pattern

In order to create a new construct, the beginning of the insert can either be given as a sequence number followed by <RETURN> as described above or, alternatively, be given as a pattern. To define a pattern, enter a dash followed by the pattern to be looked for as DNA characters. If the same pattern shall be searched multiple times, enter the followed by , and the cursor will jump to the next location.

3.2.4. Inclusion of a New Sequence

On the command line, enter the command include, followed by a sequence name. If you omit the sequence name, or specify an invalid one, the program will prompt for a sequence name. Next, seqed will ask for a beginning and an end, and insert this sequence at the last cursor position. The program will note this inclusion as a “comment,” which is shown between the heading and the sequence line.

3.2.5. Deletion of a Sequence Fragment

On the command line, the command s,f delete will delete part of the sequence being edited, where s and f must be sequence coordinates. Note that the coordinates of sequence fragments included after the deletion will change accordingly.

3.2.6. Creation of a New Sequence Fragment

On the command line, the command s,f write will write a new sequence (the sequence name is asked for if not supplied after the word “write”), with symbol number 1 being the symbol at coordinate s and the last symbol being the one at coordinate f.

4. Notes

1. If the system crashes unexpectedly, the editing session is kept. Restarting the program with the command seqed will reload the sequence and restore all edits made since the last save. If this behavior is not desired, given the command seqed (both VAX/VMS and UNIX), seqed will complain and advise you to delete the seqed.log file in the current directory.
2. There are command line options available that can be queried with $ seqed/check on VAX/VMS, or % seqed -check on UNIX. These options usually apply to special cases where highlighting of fragments is desired.
3. If too many errors occur during digitizing, the tolerance can be changed; it is a number ranging from 0.25 to 1.0 (the most tolerant). Start seqed with the option tolerance=0.6 if the default value of 0.4 seems to be too stringent.
4. If the lanes on your gel are not GTAC, the digitizer can be told your loading order with the option lanes = ACGT.

Reference

1. Devereux, J., Haeberli, P., and Smithies, O. (1984) A comprehensive set of sequence analysis programs for the VAX. Nucl. Acids Res. 12, 387-395.

GCG: Comparison of Sequences

Reinhard Dölz

1. Introduction

Comparison of sequences can be done in various ways. Biological sequences differ in length and composition. Whereas simple identity is relatively easy to compute, “similarity” must be defined before a computer can start its calculation. A simple identity matrix for DNA sequences is shown in Table 1. It can be seen clearly that, for DNA/RNA comparison, T must match U and vice versa. Ambiguities for purine or pyrimidine bases are not contained in this table; however, they are generally computed as well. If two sequences are to be compared schematically, the result will be more valuable if ambiguities are considered (see Method 1).

This applies even more if amino acid sequence comparisons are to be calculated. Whereas the simple approach of scoring each identical amino acid as a match and each nonidentical one as a mismatch will succeed in rapid comparisons (see also Chapter 9, database searching), subtle comparisons require the investigator to differentiate and treat similar amino acids, e.g., hydrophobic ones, differently from others, e.g., hydrophilic ones. The measures used for assigning this “similarity” differ and vary greatly in their reliability. Depending on the problem, generalizations might lead to wrong conclusions (e.g., if an amidated amino acid is calculated to be aligned with an acidic one; some cases, such as metal binding, do not justify this kind of generalization easily). Therefore, any “similarity” based on alignment should be inspected manually. The comparison table used in the GCG program software for protein sequence comparison is a standardized Dayhoff table that has been modified by Gribskov and Burgess.

Table 1
Schematic Scoring Matrix for Comparing Nucleotide Sequences Schematically

       A   G   C   T   U
   A   1   0   0   0   0
   G   0   1   0   0   0
   C   0   0   1   0   0
   T   0   0   0   1   1
   U   0   0   0   1   1

Each “match” is scored, and each “mismatch” does not receive a score.

Sequence comparison should always start with a rough, schematic comparison as provided by dotplots (Method 1), and then proceed to more sophisticated and detailed computations (Methods 2 or 3). Whereas the first approach may detect internal homologies and is suited to cover a range of different matches, Methods 2 and 3 result in one single alignment. An overview is shown in Fig. 1.

2. Materials

The methods and programs reported here are part of the GCG program package (1), Version 7.x, installed on either a VAX/ALPHA computer running the VMS operating system or one of the supported UNIX systems: Silicon Graphics (IRIX), Digital Equipment (ULTRIX), or Sun Computers (SUNOS). The programs can be obtained from GCG Inc., University Research Park, 575 Science Drive, Suite B, Madison, WI 53711. A version for the CONVEX variant is also available from Convex Corp., Dallas, TX. The computer system should be equipped with at least 16 MByte of memory and about 1 GByte of disk space for programs, databases, and scratch area.

All GCG programs generally require a terminal that is capable of emulating the VT100 terminal mode standard. Scroll-back facilities as provided on workstations or in PC terminal emulators are advantageous but not essential. Text output of Methods 3 and 4 can be processed with word processing programs on personal computers. The data transfer capabilities required for this are usually provided within the terminal emulation programs.

[Figure 1: flowchart with the elements sequence data, compare (with options), .pnt file, dotplot, and .gap files.]

Fig. 1. Methods available for sequence comparison.

If the text outputs of Methods 3 and 4 are to be printed, any ASCII code printer is sufficient. The capability to print 132 characters per line may be advantageous but is not essential. In order to use Methods 1, 2, and 5, a high-quality graphics device is desirable. For terminal previewing, TEKTRONIX 4014 capabilities or REGIS emulation (as provided in the DEC 240, 340, or 440 terminal series, or in various PC terminal emulators) are recommended. The GCG graphics features include a variety of other modes; e.g., workstations use the X-Windows interface. The appropriate choice is site-dependent and is preconfigured by the software manager (see Methods section for details). Methods 1 and 2 are best run on terminals or computers that are connected to the host with a line speed of at least 19,200 baud, because the graphics are complicated and require a lot of data traffic. Graphics hard copies may use a variety of output formats, including the HPGL and POSTSCRIPT standards. This feature is also site-dependent and usually preconfigured by the software manager (see above). The sequences to be compared must be in the GCG file format (see Chapters 2, 6, and 27).

Method 6 requires familiarity with sophisticated editors. The files to be edited exceed 132 characters per line. Therefore, a workstation with an editor that can be customized in a window-like fashion should be available. Alternatively, TPU on VAX/VMS or emacs on UNIX systems can be used to move the screen across the symbol comparison matrix on a text-terminal screen.

3. Methods

Figure 1 outlines the various procedures used for sequence comparison. Method 1 should always be tried first.

3.1. Comparison of Two Sequences with the WORD Comparison Method (Method 1)

This method compares short stretches of the two sequences (“words”) and denotes a match each time the two words are identical. In this way, a two-dimensional matrix of matches is calculated. The number of matches depends on the sequence itself (DNA or protein) and the size of the word used. Table 2 shows the dependency of the number of hits on the parameter wordsize.

3.1.1. Preparation of the Matrix of Matches to Be Displayed in the Subsequent Step

3.1.1.1. START OF THE PROGRAM COMPARE

On VAX/VMS, use the command

    $ compare/word

On UNIX, use the command

    % compare -word

Additionally, the filenames of the sequences to be compared can be supplied on the command line.

3.1.1.2. SPECIFICATION OF THE DATA FOR THE FIRST SEQUENCE

A prompt for the first and last positions of the sequence to be compared appears. To accept the choices given, hit the <RETURN> key only. Otherwise, enter the positions desired. Subsequently, the sequence can be reversed if desired (the method is NOT strand-insensitive).

3.1.1.3. SPECIFICATION OF THE DATA FOR THE SECOND SEQUENCE

The queries are the same as for the first sequence (see above).


Table 2
Comparison of Two Sequences with the WORD Comparison Method

                  Number of dots
   Word size   Homologous     Random
       1        >200,000    >200,000
       2         100,000      62,000
       3          30,000      15,500
       4            9500        4000
       5            3000        1000
       6            1200         250
       7             600          60
       8             500          15
       9             400           4
      10             400           1

Number of hits in comparing two homologous DNA sequences of approx 1 kb length. Sequences were human and chicken calmodulin. The values in the second column show the number of dots to be expected if the two sequences were unrelated. The dotplot with wordsize = 6 is shown in Fig. 2.
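The word comparison and the “Random” column of Table 2 can be illustrated with a short Python sketch. This is not the GCG implementation: the function names are hypothetical, U is treated as T as in Table 1, and the random expectation assumes equal base frequencies, so that each pair of word positions matches with probability 1/4^k for word size k.

```python
from collections import defaultdict


def normalize(seq):
    """Upper-case the sequence and treat U as T (T matches U in Table 1)."""
    return seq.upper().replace("U", "T")


def word_matches(seq1, seq2, wordsize):
    """Return 0-based (i, j) pairs at which identical words start."""
    s1, s2 = normalize(seq1), normalize(seq2)
    # Index every word of the second sequence by its start position.
    index = defaultdict(list)
    for j in range(len(s2) - wordsize + 1):
        index[s2[j:j + wordsize]].append(j)
    hits = []
    for i in range(len(s1) - wordsize + 1):
        for j in index.get(s1[i:i + wordsize], ()):
            hits.append((i, j))
    return hits


def expected_random_dots(len1, len2, wordsize, alphabet=4):
    """Expected hits for unrelated sequences (assumes uniform composition)."""
    positions1 = len1 - wordsize + 1
    positions2 = len2 - wordsize + 1
    return positions1 * positions2 / alphabet ** wordsize


print(word_matches("ACGTACGT", "ACGU", 4))
print(round(expected_random_dots(1000, 1000, 6)))
```

Under these assumptions, two unrelated sequences of 1000 bases give roughly 242 expected dots at word size 6 and roughly 15 at word size 8, which is close to the Random column of Table 2.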

3.1.1.4. SPECIFICATION OF THE WORD SIZE

As can be seen from Table 2, a word size of 6 is a good starting point. Protein sequence comparisons should start with 2.

3.1.1.5. DEFINITION OF THE OUTPUT FILE

A <RETURN> accepts the suggested file name. After this input, the calculation proceeds.

3.1.2. Definition of the Plotting Configuration

The following step requires that your graphics configuration is known to the system. Therefore, run the program setplot once to select one of the options presented.

3.1.3. Use of the Program dotplot for Previewing

3.1.3.1. START OF THE PROGRAM DOTPLOT

Enter the program name as the command. If you wish, you can append the file name of the point file created in Section 3.1.1. to the command; otherwise, the first prompt will ask for this name.

3.1.3.2. SPECIFICATION OF THE DOT DENSITY

The program suggests a value that will fill the entire page. If multiple plots are to be compared later on, it can be advantageous to agree on a defined density for all dotplot runs (e.g., 1000 bases per 100 platen units).

3.1.3.3. SELECTION OF OPTIONS

The program offers to plot, select a different density, or quit. After selecting “P” (Plot), the graphics will be created.

3.1.4. Use of the Program dotplot with Additional Options to Create the Final Graphics

3.1.4.1. ON THE VAX/VMS OPERATING SYSTEM

Options given below are entered after the program name and separated by slashes (“/”). For example,

    $ dotplot/nocaption

means that you type the program name dotplot at the system prompt ($) and modify the program’s action with the qualifier nocaption.

3.1.4.2. ON THE UNIX OPERATING SYSTEMS

Options given below are entered after the program name and separated by a blank (space bar) and a dash (“-”). For example,

    % dotplot -nocaption

means that you type the program name dotplot at the system prompt (%) and modify the program’s action with the option nocaption.

3.1.4.3. THE FOLLOWING OPTIONS ARE GENERALLY AVAILABLE

    nocaption   suppresses the caption of the plot
    nolabel     suppresses all labels except for ticks
    all         plots redundant points; e.g., if the sequences to be compared are identical, only one part of the symmetric matrix is plotted by default. The option all will plot the entire matrix.
    tickaxes    plots a solid frame around the matrix plot
    font = 3    will use font number 3 (close to HELVETICA)
    font = 6    will use font number 6 (close to TIMES)
    check       will show additional options not explained here

3.1.4.4. DOTPLOT OPTIMIZATION

If needed, run setplot to redefine the graphics configuration in order to achieve a hard copy.

3.2. Comparison of Two Sequences Using the WINDOW/STRINGENCY Method (Method 2)

This method compares parts of the sequences (“windows”) and assigns a match only if a given number of symbols (“stringency”) are identical. The number of matches depends on the sequence itself (DNA or protein) and the parameters “window” and “stringency.” Table 3 shows the dependency of the number of hits on these parameters.

3.2.1. Preparation of the Matrix of Matches to Be Displayed in the Subsequent Step

3.2.1.1. START OF THE PROGRAM COMPARE

Without any options, the window/stringency algorithm will be selected. Additionally, the filenames of the sequences to be compared can be supplied on the command line.

3.2.1.2. SPECIFICATION OF THE DATA FOR THE FIRST SEQUENCE

After the file name of the sequence is entered, a prompt for the first and last positions of the sequence to be compared appears. To accept the choices given, hit the <RETURN> key only. Otherwise, enter the positions desired. Subsequently, the sequence can be reversed if desired (the method is NOT strand-insensitive).

3.2.1.3. SPECIFICATION OF THE DATA FOR THE SECOND SEQUENCE

The queries are the same as for the first sequence (see above).

3.2.1.4. DEFINITION OF THE WINDOW SIZE

The suggested size is 21 for DNA and 3 for peptide sequences.

Table 3
Comparison of Two Sequences Using the WINDOW/STRINGENCY Method

                                             Window size
   Stringency   9                15                   21                    27                    45
        6       1500 (14,000)    180,000 (180,000)    >200,000 (200,000)    >200,000 (200,000)    >200,000 (200,000)
       10       -                2700 (1300)          33,000 (32,000)       150,000 (150,000)     >200,000 (>200,000)
       14                        500 (0)              1300 (100)            7100 (4500)           >200,000 (>200,000)
       18                                             600 (0)                                     37,000 (33,000)
       30                                                                                         700 (0)

Number of hits in comparing two homologous DNA sequences of approx 1 kb length. Sequences as in Table 2. The window is varied from 9 to 45 bp (corresponding to 3 to 15 amino acids), and the stringency is kept at two-thirds of the window size because of the ambiguity in the third base. The dotplot with the settings windowsize = 21 and stringency = 14 is shown in Fig. 3. The values in parentheses show the number of dots to be expected if the two sequences were unrelated.

3.2.1.5. DEFINITION OF THE STRINGENCY

In DNA sequence comparisons, this value should be two-thirds of the window size to account for ambiguities in codon usage. In protein sequence comparisons, at least 80% of the window size is appropriate.

3.2.1.6. DEFINITION OF THE OUTPUT FILE

A <RETURN> accepts the suggested filename. After this question, the calculation proceeds without further questions, as shown by a growing line of periods.
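The window/stringency idea described in this section can be sketched in a few lines of Python. The sketch rests on several assumptions: it counts plain identities (one point per identical symbol, as in Table 1), whereas the real program may score symbol pairs from a comparison table; the function names are hypothetical; and the helper merely encodes the rules of thumb given above (two-thirds of the window for DNA, at least 80% for protein).

```python
def window_matches(seq1, seq2, window, stringency):
    """Return 0-based (i, j) window starts with at least `stringency` identities."""
    s1 = seq1.upper().replace("U", "T")
    s2 = seq2.upper().replace("U", "T")
    hits = []
    for i in range(len(s1) - window + 1):
        w1 = s1[i:i + window]
        for j in range(len(s2) - window + 1):
            w2 = s2[j:j + window]
            identities = sum(a == b for a, b in zip(w1, w2))
            if identities >= stringency:
                hits.append((i, j))
    return hits


def suggested_stringency(window, is_protein=False):
    """Rule of thumb from the text: 2/3 of the window (DNA), >= 80% (protein)."""
    if is_protein:
        return (4 * window + 4) // 5   # 80% of the window, rounded up
    return (2 * window) // 3


print(suggested_stringency(21))
print(window_matches("ACGTAC", "ACGAAC", 3, 2))
```

This brute-force version compares every pair of windows and is therefore slow for long sequences, which is in line with the observation in the Notes that Method 2 takes longer than Method 1.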

3.2.2. Definition of the Plotting Configuration

The following step requires that your graphics configuration is known to the system. Therefore, run the program setplot once to select one of the options presented. This step may be skipped if the configuration is not to be changed.

3.2.3. Viewing the Result with the Program dotplot

This procedure is the same as in Method 1; therefore, proceed with Sections 3.1.3. and 3.1.4., respectively. Figures 2-4 show examples of the output.

[Figure 2: dotplot of hu.seq (check: 3,752, 1 to 1,126); both axes are labeled 0, 500, and 1,000.]

Fig. 2. Output of dotplot with wordsize = 6 on two homologous sequences using the WORD comparison method (Method 1).

3.3. Comparison of Two Sequences with the Local Homology Algorithm of Smith and Waterman (Method 3)

This method finds the best-matching regions of two sequences. It does not perform an end-to-end alignment (i.e., an alignment from the start to the end of the sequences being compared; see Method 4 for this purpose).

[Figure 3: dotplot of hu.seq (check: 3,752, 1 to 1,126); axes are labeled 0, 500, and 1,000.]

Fig. 3. Output of dotplot with windowsize = 21 and stringency = 14 on two homologous sequences using the WINDOW/STRINGENCY comparison (Method 2). Additional options used to change the appearance of the graphics were tickaxes, font = 6, and nocaption. Note the diagonal line parallel to the main diagonal, which indicates an internal repeat (see also Fig. 4).

3.3.1. Start of the Program bestfit

An additional option can be invoked to check for alignment significance. On VAX/VMS, start the program with

    $ bestfit/randomize

On the UNIX operating system, use

    % bestfit -randomize

Additionally, the filenames of the sequences to be compared can be supplied on the command line.

[Figure 4: schematic dotplot with the segments A, B, and C and the repeated segments A' and B'. Internal repeat: both A and B are repeated in both sequences.]

Fig. 4. Schematic representation of a dotplot and its interpretation. The internal homology revealed indicates possible gene duplication. The shift in the diagonal close to the upper right reflects a large gap in one of the two sequences. These shifts are frequently found outside of reading frames because of exon/intron length differences.

3.3.2. Definition of the Data Needed for the First Sequence to Be Compared

After the file name is entered, a prompt for the first and last positions of the sequence to be compared appears. To accept the choices given, hit the <RETURN> key only. Otherwise, enter the positions desired. Subsequently, the sequence can be reversed if desired (the method is NOT strand-insensitive).

3.3.3. Definition of the Data Needed for the Second Sequence

The second sequence is queried in the same way as the first one.

3.3.4. Definition of the Gap Weight

Unless you have defined your own symbol comparison matrix (see Method 6), the values for comparing symbols are predefined. The gap weight parameter queried assigns the penalty that is charged against the total alignment score for each gap inserted. It should be three to five times higher than the average match value of the symbols (1.0). To accept the value suggested (5.0), <RETURN> is sufficient.

3.3.5. Definition of the Gap Length Weight

This value influences the alignment by penalizing the length of gaps. Homologous sequences can be effectively aligned with the suggested parameter of 0.3, and <RETURN> is sufficient. Large intron/exon differences might need a value of 0.1.

3.3.6. Definition of the Output File

<RETURN> accepts the suggested filename, which consists of the file name of the first sequence and the extension .pair. After this last question, the program proceeds without further questions, drawing a growing line of periods as the calculation proceeds. A summary of the alignment procedure is displayed, including a “quality” parameter. If the option randomize has been invoked, the entire alignment will be repeated 10 times after randomizing the sequences. Although the output of these randomized alignments is not written to the output file, it permits the investigator to estimate the significance of the program’s output by comparing the quality values of the original and the randomized sequences.

3.4. Comparison of Two Sequences with the Needleman-Wunsch Algorithm (Method 4)

This method creates an alignment that spans the entire length of the sequences being compared, which should be of roughly the same length. It is used in the same way as the program described in Method 3. The program to be used is called gap. Again, using the command $ gap/randomize (on VAX/VMS) or % gap -randomize (on UNIX) will recalculate the alignment after randomization for significance checking.
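How the gap weight and the gap length weight enter a local alignment score, and how randomization helps to judge significance, can be sketched as follows. This is only an illustration under stated assumptions, not the bestfit or gap code: a gap of length L is assumed to cost gap_weight + L × gap_length_weight, the mismatch value of -1.0 is chosen for the example, only the second sequence is shuffled, and all function names are hypothetical.

```python
import random


def local_score(seq1, seq2, gap_weight=5.0, gap_length_weight=0.3,
                match=1.0, mismatch=-1.0):
    """Best local alignment score (Smith-Waterman recursion with affine gaps)."""
    neg_inf = float("-inf")
    cols = len(seq2) + 1
    h_prev = [0.0] * cols       # best scores of alignments ending in the previous row
    f_prev = [neg_inf] * cols   # same, but ending with a gap in seq2
    best = 0.0
    for i in range(1, len(seq1) + 1):
        h_cur = [0.0] * cols
        f_cur = [neg_inf] * cols
        e = neg_inf             # best score ending with a gap in seq1 (current row)
        for j in range(1, cols):
            # Open a new gap (weight + one length unit) or extend an existing one.
            e = max(h_cur[j - 1] - gap_weight - gap_length_weight,
                    e - gap_length_weight)
            f_cur[j] = max(h_prev[j] - gap_weight - gap_length_weight,
                           f_prev[j] - gap_length_weight)
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor of 0.
            h_cur[j] = max(0.0, h_prev[j - 1] + s, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


def randomized_scores(seq1, seq2, trials=10, seed=0):
    """Scores after shuffling seq2 (assumed analogue of the randomize option)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        shuffled = list(seq2)
        rng.shuffle(shuffled)
        scores.append(local_score(seq1, "".join(shuffled)))
    return scores


a = "ACGTTGCAAGTCCGATAGCT"
b = "ACGTTGCAAG" + "GGG" + "TCCGATAGCT"   # same sequence with a 3-base insertion
print(local_score(a, b))
print(randomized_scores(a, b))
```

In this example, bridging the three inserted bases costs 5.0 + 3 × 0.3 = 5.9, which is less than the 10 matches gained on the far side of the gap, so the gap is accepted. An alignment is more convincing when its score clearly exceeds all of the scores obtained from the shuffled sequences.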


3.5. Display the Alignment Graphically (Method 5)

This method is used to produce a graphic representation of the alignment created with the programs presented in Methods 3 and 4. The alignment must be written to individual files, one for each sequence, in contrast to the single output file created originally.

3.5.1. Preparation of the Input Files

The programs bestfit and gap must be rerun with the option out before the following steps can be performed. The commands are:

    Local homology, VAX/VMS:         $ bestfit/out
    Local homology, UNIX:            % bestfit -out
    End-to-end alignment, VAX/VMS:   $ gap/out
    End-to-end alignment, UNIX:      % gap -out

3.5.2. Definition of the Graphics Configuration

The program setplot will present the graphics devices available at your site. This step needs to be performed only once before working with GCG graphics.

3.5.3. Running the Program gapshow for Previewing

The program gapshow will ask for two input files, which have been generated in the previous Section 3.5.1.

3.5.4. Rerunning the Program gapshow with Additional Options for Refining the Output

Refer to Method 1 on how to apply options. General options are as follows:

    mark = seq1.mrk
        A file that has the same file name (seq1) and the extension mrk can be used to indicate a known sequence feature. The format of the mark file is described in the template file that can be copied into the local directory with the command fetch gamma.mrk (on both VAX/VMS and UNIX).
    bars = d
        gapshow will plot vertical bars where the sequences differ. If identities are to be marked, use the letter s (similarity).
    num1 = 500
    num2 = -300
        Sets the beginning of the numbering to the numeric values indicated. This applies to the sequence alignment display only and does not affect the sequences as such.
    revnum1
    revnum2
        If the sequences have been reversed before alignment, these options affect the numbering on the plot.
    nolabel
        Suppresses the heading and the footer of the plot.
    outfile = my.alignment
        Writes an additional text file with the alignment in it.
    font = 3
        Uses font number 3, which is close to HELVETICA; similarly, font = 6 uses a font that is close to TIMES.
    check
        Displays all options available, including additional options not described here.

Figure 5 shows a result of gapshow, which was invoked using the options bars = s, nolabel, and font = 6.

3.6. Creation of a Comparison Table (Method 6)

Both the bestfit and gap programs, as presented in Methods 3 and 4, respectively, can be run with the option data = my.table to assign match values. Whereas DNA comparison values are generally agreed on, protein comparisons may require modification of the tables supplied by default. Therefore, the following applies to the creation of a protein comparison table.

[Figure 5: GAPSHOW of hu.gap (check: 9340, from 1 to 1402) to ch.gap (check: 6681, from 1 to 1402), March 20, 1992; both sequences are numbered in steps of 200.]

Fig. 5. Output of the gapshow program if the two sequences from Figs. 2 and 3 are aligned with the gap program. Note the similarity to the schematic plot in Fig. 4.

3.6.1. Creation of a Simplification File

The template provided by the GCG programs is copied into your local directory with the command fetch simplify.txt (on both VAX/VMS and UNIX).

3.6.2. Modification of the Simplification File

Call the system editor and look at the simplification file, which is shown in Fig. 6. Read the format description at the beginning of the file and modify the file according to your needs by additions or deletions defining the symbols’ similarities.

EXAMPLE: To make A (alanine) no longer equivalent to P (proline), and to have S and T (serine and threonine) share their own group.

The following is for the VAX/VMS operating system:

    $ edit simplify.txt
    move with cursors to the second A in the line A PAGST (proline is gone)
    move with cursors to the S in the line A AGST
    <P> <SPACE> P     (a new line for proline is created)
    <S> <SPACE>       (a new line for S equals to S T is created)
    <CTRL-Z>          (leave screen editing mode)
    EXIT              (on the * prompt)
    $

The following is for the UNIX operating system:

    % vi simplify.txt
    move with cursors to the P in the line A PAGST
    <X>               (proline is gone)
    move with cursors to the S in the line A AGST
    <P> <SPACE> P     (a new line for proline is created)
    <S> <SPACE>       (a new line for S equals to S T is created)

    <ESC>             (leave screen editing mode)
    wq                (on the : prompt)
    %

The last part of the file should now look as follows:

    A  AG
    P  P
    S  ST
    D  QNEDBZ
    H  HKR
    I  LIVM
    F  FYW
    C  C

[Figure 6: listing of the template file]

    A standard simplification used by SIMPLIFY and WORDSEARCH to
    simplify peptide sequences. The first line below means "for
    all of the P, A, G, S, or T characters in the sequence,
    substitute A." The program COMPTABLE can construct a symbol
    comparison table with the equivalences from this file. 10/7/84 ..

    A  PAGST
    D  QNEDBZ
    H  HKR
    I  LIVM
    F  FYW
    C  C

Fig. 6. Simplification file provided as a template for comparison matrices.

3.6.3. Creation of a Comparison Table

The program comptable will ask for the simplification file prepared previously, for the default symbol match and mismatch values, and for the name of the comparison table.

3.6.4. Modification of the Comparison Table

The matrix created can be modified to suit individual needs. The row format should be preserved within an editing session to simplify reading. It is suggested that the file be printed for later reference. It is strongly recommended that you be familiar with the editor used for the manipulation. Otherwise, this step should be skipped.
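The step from a simplification file to a comparison table can be pictured with the following Python sketch. It is not the comptable program: the function names are hypothetical, only the group lines of the file are parsed (the descriptive header is assumed to have been removed), and the match and mismatch values of 1.0 and 0.0 are example choices standing in for the defaults the program asks for.

```python
def parse_simplification(lines):
    """Map every symbol to the representative symbol of its group."""
    group_of = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue            # skip blank lines and anything not "X MEMBERS"
        representative, members = parts
        for symbol in members:
            group_of[symbol] = representative
    return group_of


def comparison_table(group_of, match=1.0, mismatch=0.0):
    """Score symbol pairs: `match` within a group, `mismatch` otherwise."""
    symbols = sorted(group_of)
    return {
        (a, b): match if group_of[a] == group_of[b] else mismatch
        for a in symbols
        for b in symbols
    }


# Group lines as they appear after the editing example in Section 3.6.2.
edited = ["A AG", "P P", "S ST", "D QNEDBZ", "H HKR", "I LIVM", "F FYW", "C C"]
table = comparison_table(parse_simplification(edited))
print(table[("A", "G")], table[("A", "P")], table[("S", "T")])
```

With the edited groups, alanine and glycine still score as a match, alanine and proline no longer do, and serine and threonine match each other but not alanine. Individual entries of such a table could then be adjusted by hand, as described in Section 3.6.4.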

DdZ

82 4. Notes

1. All methods for sequence comparison are strand-sensitive. This means that if the relevant similarity of the two sequences is on the opposite strand, the programs will not detect it.
2. Method 1 is very fast, whereas Method 2 is more sensitive. Large sequences (>1000 bp each) might take a very long time with Method 2 and should therefore first be tried with Method 1. In addition, the option batch permits the compare program to run noninteractively, i.e., other tasks can be performed on the computer while the program is running.
3. Very large sequences (>2000 bp each) require larger words or windows, respectively, because the plot otherwise gets too crowded.
4. All methods presented are suited for peptides as well.
5. If more than 1000 points result from a compare calculation, the output files tend to get very large. Be sure to have enough disk space (or quota) available before starting.
6. Methods 3, 4, and 5 permit you to use the option width = 100 in order to print sequence alignments in a format where 100 symbols are shown per row. This is advantageous for reading.
7. Comparing two DNA sequences sometimes makes it necessary to display the translation of at least one of the sequences. The methods presented here do not accomplish this. Refer to Chapter 14, “Preparing Sequences for Publication,” for details on how to perform this task.
8. Large sequences might exceed the limits of the computer memory in Methods 3 and 4. In this case, a message is printed stating that the alignment may not be optimal. With respect to gap and bestfit, the options limit1 = 100 and limit2 = 100 reduce the maximum shift between the two sequences to 100 symbols each. If the two sequences differ very much in length, information from a dotplot should be used to identify the approximate coordinates of the start and end points of similarity to be entered in the gap and bestfit sequence specifications.
9. Alignment of sequences is intrinsically dangerous because the computer will always present a result. It should be noted that biological significance does not necessarily coincide with statistical significance. Therefore, “weak” alignments need to be looked at with care.
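Because the comparison programs are strand-sensitive (Note 1), one practical workaround for DNA is to repeat the comparison with the reverse complement of one sequence. The following is a minimal Python sketch of that preparation step only; the function name reverse_complement is a hypothetical helper and is not part of the GCG package, and ambiguity codes are not handled.

```python
# Hypothetical helper, not part of the GCG package.
# Only the four unambiguous bases are translated; other symbols pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A<->T, C<->G)."""
    return seq.translate(_COMPLEMENT)[::-1]


# Usage: compare the query against both the original and the reversed strand.
forward = "ATGCCG"
opposite = reverse_complement(forward)
print(forward, opposite)
```

The reversed sequence would then be saved in GCG format and compared exactly as described in the methods above.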

Reference

1. Devereux, J., Haeberli, P., and Smithies, O. (1984) A comprehensive set of sequence analysis programs for the VAX. Nucl. Acids Res. 12, 387-395.

CHAPTER 8

GCG: Production of Multiple Sequence Alignment

Reinhard Dölz

1. Introduction

Multiple sequence alignment is one of the most challenging problems in biocomputing. Sequence comparison, averaging, and sophisticated editing are required to make the computer do what the researcher wants it to do. However, complaints are frequently made that an automatically calculated alignment is “bad” or “insufficient.” This is because the computer performs only pairwise alignments and comparisons, and compiles these results in a second step to produce the multiple sequence alignment. Further, the researcher implicitly assumes which regions are to be “conserved” (i.e., not to be interrupted by gaps or misaligned) and which sequence fragments are less essential. Computationally, only match and mismatch scores are known at the beginning of an alignment, and features like conserved regions are established only during the program run.

In order to perform multiple sequence alignments entirely automatically, the similarity of the sequences involved, as well as their length, must be well defined and homogeneous. Automatic processing will fail if too many gaps or stretches are to be included; e.g., automatic alignment of fragments of DNA sequences with known reading frame originating from various species to an entire genome of one single species will almost certainly not complete successfully. For a detailed discussion of the limits and scope of the methodology, see Chapter 25 in Part II, CLUSTAL. Such cases therefore require that the researcher use pairwise alignment and edit the alignment manually. After checking the alignment quality, the number of sequences involved can be increased either by repeating the procedure or by creating a so-called profile that will search the database with the features of the alignment instead of a single sequence (see Chapter 9, Profile Searching). The resulting new members of the sequence family of interest can then be joined into the multiple sequence alignment, which is finally prepared for publication using the methods described in Chapter 14. Figure 1 shows an overview of the methods presented in this chapter.

From: Methods in Molecular Biology, Vol. 24: Computer Analysis of Sequence Data, Part I. Edited by: A. M. Griffin and H. G. Griffin. Copyright ©1994 Humana Press Inc., Totowa, NJ

2. Materials

The methods and the programs reported here are part of the GCG program package (1) Version 7.x, to be installed on either a VAX/ALPHA computer running the operating system VMS or one of the supported UNIX systems: Silicon Graphics (IRIX), Digital Equipment (ULTRIX), or Sun Computers (SUNOS). The programs can be obtained from GCG Inc., University Research Park, 575 Science Drive, Suite B, Madison, WI 53711. A version for the CONVEX variant is also available from Convex Corp., Dallas. The computer system should be equipped with at least 16 Mbyte of memory, and should hold about 1 Gbyte of disk space for program, database, and scratch area.

All GCG programs generally require a terminal that is capable of emulating the VT100 terminal mode standard. Scroll-back facilities as provided on workstations or in PC terminal emulators are advantageous, but not essential. The lineup editor (Method 4) requires that the terminal follow the VT100 standard exactly. It is recommended that the line speed of the terminal connection be at least 9600 baud for this method.

Text output of the methods presented can be reprocessed with word processing programs on personal computers. The data-transfer capabilities required for this are usually provided within the terminal emulation programs. If text output is to be printed, any ASCII code printer is sufficient. Capabilities to print 132 characters per line may be advantageous, but are not essential.

In order to use Method 2 (output dendrogram plotting) or Method 6, a high-quality graphics device is desirable. For terminal previewing, TEKTRONIX 4014 capabilities or REGIS emulation (as provided in DEC 240, 340, or 440 terminal series, or in various PC terminal emulators) is recommended. The GCG graphics features include a variety of other modes, e.g., workstations use the X-Windows interface. The appropriate choice is site-dependent and is preconfigured by the software manager. See Section 3. for details. Graphics hard copies may use a variety of output formats, including HPGL and POSTSCRIPT standards. This feature is also site-dependent and usually preconfigured by the software manager.

To use a sequence, the sequence data must be formatted in GCG sequence file format. See Chapters 2, 6, 8, and 11 to review methods for creating a sequence in this format. The multiple sequence alignment outputs either a file of sequence file names (FOSN) that describes these prealigned sequences or a single “multiple sequence file” (MSF) formatted file. Both types are accepted as input files and may be reformatted into each other (see Method 1).

Method 3 relies on the gap program, which has been discussed in Chapter 7 (sequence comparison). Method 9 relies on a successful profilesearch, which is discussed in Chapter 9 (database searching). The output of the multiple sequence formatted files for publication is described in Chapter 14 (preparing sequence data for publication).

Method 2 uses the pileup program, which is very similar in methodology to the CLUSTAL program described in Chapter 25 in Part II. However, pileup uses the GCG style program look-and-feel. Methods 8 and 9 use part of the PROFILE program package described in Chapter 22 in Part II. However, the programs described here are embedded into the GCG environment.

Method 5 uses the system editor (vi on UNIX or EDT on VAX/VMS). The user should have substantial knowledge of these system tools in order to use Method 5.

[Figure 1: flowchart. Labels recoverable from the original: sequences in GCG format (to be aligned); file of file names; multiple sequence file; tailor manually; profile; matrix.]

Fig. 1. Methods available for preparing multiple-sequence alignments with the GCG program package.

3. Methods

3.1. Representation of Groups of File Names (Method 1)

3.1.1. Sequence File Names

In the following discussion, it is assumed that all sequences that are to be used in multiple sequence alignments are located in the same directory. Further, it is assumed that all individual sequences have the correct format and name. See Chapter 6 for details on how to tailor sequences. Sequence file names usually consist of a file name and an extension separated by a period. In order to describe a sequence alignment, GCG programs use two approaches: either a file of sequence names (FOSN) containing all sequence names, including relevant information on the sequence position in the alignment, or a single file containing multiple sequences (MSF, multiple sequence file). Both file types should have a meaningful file name describing the alignment and an extension that describes the type. By convention, FOSN-type files have the ending .fil, and MSF-type files have the ending .msf. This can be changed by the user, but the change should be kept in mind if problems occur because of it.

3.1.2. Creating a File of Sequence Names from Scratch

3.1.2.1. ON THE VAX/VMS OPERATING SYSTEM

Issue the following commands:

$ set default (to move to the directory where the files reside)
$ spawn (to create a subprocess)
$ directory := " " (to supersede any definitions for the command directory)
$ directory/column=1/notrailer/noheader/out=my.fil *.seq (to create a file that contains all the files in the current directory with the ending “.seq”)
$ logout (to terminate the subprocess)
$ edit my.fil (to call the editor in order to modify the contents of the file if necessary: You should insert two periods “..” at the beginning of a new line in order to adhere to GCG program file standards. If needed, any comment describing the file can be inserted before the two periods.)

3.1.2.2. ON THE UNIX OPERATING SYSTEM

Issue the following commands:

% cd (to move to the directory where the files reside)
% csh (to create a new shell)
% unalias ls (to supersede any definitions for the command ls)
% ls *.seq > my.fil (to create a file that contains all files in the current directory with the ending “.seq”)
% ^D (or logout) (to terminate the subshell)
% vi my.fil (to call the editor in order to modify the contents of the file if necessary: You should insert two periods “..” at the beginning of a new line in order to adhere to GCG program file standards. If needed, any comment describing the file can be inserted before the two periods.)

3.1.3. Reformatting a File of Sequence Names to MSF Format

Issue the command $ reformat/msf (on VAX/VMS) or % reformat -msf (on UNIX), and give the FOSN file name as @my.fil (i.e., the @ character and the file name).

3.1.4. Reformatting an MSF Formatted Sequence File to Single Sequences

Issue the command reformat, and give the file name of your MSF file as my.msf{*} (i.e., the file name and an asterisk in curly brackets). Note that this procedure produces single files only. Refer to Section 3.1.2. for creating a file of file names.
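The manual steps of Section 3.1.2. (list the *.seq files, then add the “..” separator line) can also be scripted. The following Python sketch is an illustration under stated assumptions, not a GCG tool: the function name is hypothetical, and it assumes that the optional comment and the “..” line precede the list of names, which is how the instructions above read.

```python
from pathlib import Path


def write_file_of_sequence_names(directory, out_name="my.fil",
                                 pattern="*.seq", comment=""):
    """Write a simple file of sequence names (FOSN) for one directory.

    Assumed layout (taken from the text above, not from GCG documentation):
    optional comment, then a line holding two periods, then one name per line.
    """
    directory = Path(directory)
    names = sorted(p.name for p in directory.glob(pattern))
    lines = []
    if comment:
        lines.append(comment)   # free text is allowed before the separator
    lines.append("..")          # separator required by GCG file standards
    lines.extend(names)
    out_path = directory / out_name
    out_path.write_text("\n".join(lines) + "\n")
    return out_path
```

Calling write_file_of_sequence_names(".", comment="calmodulin set") in the sequence directory would produce a my.fil that can then be given to the programs as @my.fil.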

3.2. Create a Multiple Sequence Alignment Automatically with the Program Pileup (Method 2)

This method uses the pileup program, which uses an algorithm similar to the CLUSTAL program described in Chapter 25 in Part II. The automatic procedure begins with the pairwise alignment of the two most similar sequences, which have been determined by a comparison of each sequence with every other. The initial alignment, a cluster, is extended step by step, until the least similar sequences are included. The progress of the procedure can be visualized with a dendrogram.

3.2.1. Start of the Pileup Program

Type in the program name pileup. The first question asked by the program will be what sequences are to be aligned. If you have prepared a file of sequence names, enter an @ character and the file name, e.g., @my.fil. If you use the MSF type of sequences, type the file name and an asterisk enclosed in curly brackets, e.g., my.msf{*}. The program will read all files and report each successful input by writing the name and the length of the sequence on the screen.

3.2.2. Definition of Alignment Parameters

The program will ask for the gap weight and the gap length weight, which are scores used to penalize gap insertions. See Chapter 7, on sequence comparison, for a discussion of these parameters.
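The clustering order described at the start of Section 3.2. (most similar pair first, least similar sequences last) can be sketched in a few lines of Python. This is only an illustration of the ordering idea: the similarity measure used here (fraction of identical positions without any alignment) and the average-linkage rule are simplifying assumptions, and all function names are hypothetical. The pileup program itself scores real pairwise alignments.

```python
from itertools import combinations


def identity(a, b):
    """Crude similarity: fraction of identical positions over the shorter length."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def joining_order(seqs):
    """Return the merges [(cluster_a, cluster_b, similarity), ...], most similar first.

    seqs maps a sequence name to its sequence string.
    """
    clusters = [(name,) for name in seqs]
    merges = []

    def cluster_similarity(c1, c2):
        # Average similarity over all sequence pairs between two clusters.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(identity(seqs[a], seqs[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: cluster_similarity(*pair))
        merges.append((c1, c2, cluster_similarity(c1, c2)))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
    return merges


for step in joining_order({"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTACGA"}):
    print(step)
```

The list of merges corresponds to the branching order that a dendrogram would display.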


3.2.3. Selection of the Modes of Output

The program offers options to write a dendrogram into a file, to show it immediately, or to suppress it entirely. If the dendrogram is to be saved, option A (write into a figure file) is recommended. The next question concerns the scale of the dendrogram and can usually be answered with <Return> to accept the default value. The output file, asked for in the next question, is a file of sequences written in MSF format. <Return> accepts the file name suggested.

3.2.4. Wait for Program Completion and Review of the Dendrogram

No further questions are asked. If the pileup program encounters a restriction mentioned in Section 4., it will stop abruptly after issuing an error message. Refer to Fig. 1 for how to continue. If the program completes successfully, a summary of the calculation will be printed on the screen. If you want to look at the dendrogram, you need to define the graphics configuration with the program setplot and select one of the options offered. Next, run the program figure, and give the name pileup.figure as the file name to be plotted. The dendrogram will be plotted on the screen or printed on a plotter, depending on the graphics configuration.

3.3. Use of the Gap Program to Produce a Multiple Sequence Alignment Manually (Method 3)

The gap program, as discussed in Chapter 7 (sequence comparison), can be used to create a multisequence alignment manually. The problem in pairwise alignment is that once an aligned sequence is realigned to another, additional gaps might be inserted. Since version 7 of the GCG programs provides the automatic procedure (see Method 2), using gap for manual alignment is recommended only if the pileup program fails or if additional sequences are to be added to the alignment. In any case, it is strongly recommended that the lineup editor (Method 4) be used to check the alignment interactively. To use the manual method, gap should be run with the option out (see Chapter 7 for details). Each comparison, therefore, yields a pair of aligned sequences that need to be realigned with the other sequences until no additional gaps are introduced. Figure 2 shows such a process applied to three sequences, requiring four comparisons. Large numbers of sequences might become difficult to align this way. Using Method 9 is more efficient in the case of large sequences.

Fig. 2. Schematic representation of a multiple sequence alignment preparation by manual pairwise alignment. The procedure converges at the point where only gaps are misaligned. New gaps are represented as empty boxes; gaps created at the previous step are represented by half-shaded boxes and sequence symbols by dark-shaded boxes.


3.4. Use of the Lineup Program to Generate or Improve a Multiple Sequence Alignment Manually (Method 4)

The lineup program permits three modes of operation. The command mode allows commands to be typed in and is indicated by a ‘:’ at the bottom of the screen. To switch to screen mode, hit <Return>. The screen mode permits manipulation of sequences by control characters or cursor keys, plus some other keys (see Table 1). In order to return to command mode, type <CTRL-Z> on the VAX/VMS system and <CTRL-D> on the UNIX system. The third mode is the heading mode, as in the sequence editor seqed (Chapter 6). It permits editing of the documentary heading of the sequences. To get there, type the command heading on the command line, and to exit, type <CTRL-Z> on the VAX/VMS system and <CTRL-D> on the UNIX system.

3.4.1. Start of Lineup from Scratch

3.4.1.1. START OF THE LINEUP PROGRAM

Start the program by typing lineup at the system prompt. Optionally, the sequence group name can be supplied on the command line.

3.4.1.2. NAMING OF THE SEQUENCE GROUP

The computer will ask for a name of the sequence group. It is suggested that you type the file of sequence names that you want to refer to in other steps.

3.4.1.3. GETTING THE SEQUENCE OF INTEREST

On the command line (‘:’ prompt at the bottom of the screen) type get and the sequence file you want to include. You will be asked for the beginning, the end, and whether the sequence is to be reversed while being included.

3.4.1.4. PLACING THE SEQUENCE OF INTEREST

The editor program is now in “space walk” mode, meaning that the cursor indicates the beginning of the new sequence. Use the <up>, <down>, <left>, and <right> cursor keys to place the cursor at the position where number 1 of the sequence will be located. Hit <Return> and type a meaningful name of at most nine characters. After another <Return>, the sequence will show up.

Table 1
Commands of the Lineup Editing Program

Key                        Result
G, A, T, C                 Inserts a sequence character
<Delete>                   Deletes a sequence character
<Spacebar>                 Moves a sequence to the right if at position 1
{n} <right> cursor (a)     Move right
{n} <left> cursor (a)      Move left
{n} <up> cursor (a)        Move up (to row specified)
{n} <down> cursor (a)      Move down (to row specified)
{n} <Return> (a)           Move to position n
< or >                     Move 50 positions to the left or right
<CTRL-R>                   Redraws the screen
<CTRL-Z>                   VAX/VMS: return to command line
<CTRL-D>                   UNIX: return to command line

(a) {n} is an optional numeric parameter.

3.4.1.5. ADDITION OF MORE SEQUENCES

Repeat the steps in Sections 3.4.1.3. and 3.4.1.4. as often as needed. In order to move a sequence, refer to the commands listed in Table 1. Use the command help on the command line in order to show additional commands.

3.4.1.6. ADDITION OF MORE SEQUENCES USING AUTOMATIC ALIGNMENT

The command zip on the command line provides limited functionality for automatic alignment, but can be used to include very similar sequences. To accomplish this, a consensus must first be calculated. Enter the command consensus on the command line. See Section 3.4.2. for details on this command. Then, enter zip and the file name. The program will ask for the beginning and end of the sequence, and whether or not the sequence should be reversed. Then, the program will compute several possible alignments, and an interactive dialog will permit you to check several alternatives visually. This mode is exited with <CTRL-Z> (on VAX/VMS; on UNIX: <CTRL-D>), and the choices Commit, Reject, and Proof more can be taken by typing C, R, or P, respectively. Once an alignment is chosen, the program continues as in Section 3.4.1.4.

3.4.1.7. EXIT THE PROGRAM

To exit the lineup program, enter the command exit on the command line.

3.4.2. Use of the Lineup Program to Check and/or Improve an Alignment

3.4.2.1. START OF THE LINEUP PROGRAM ON AN FOSN (FILE OF SEQUENCE NAMES) TYPE OF AN ALIGNMENT

Start the program by typing lineup and the file of sequence names. The existing alignment will be loaded.

3.4.2.2. START OF THE LINEUP PROGRAM ON AN MSF (MULTIPLE SEQUENCE FORMAT) TYPE OF AN ALIGNMENT

On the VAX/VMS system, type $ lineup/msf. On UNIX, type % lineup -msf. Give the file name of the multiple sequence file, and the alignment will be properly loaded.

3.4.2.3. SELECTION OF SEQUENCE TO BE MODIFIED

In screen mode (if you are on the command line, type <Return> first) use the <up>, <down>, <left>, and <right> cursor keys to select the sequence desired. In the case of many sequences, you might also type the row number followed by the <up> or <down> cursor key.

3.4.2.4. SELECTION OF THE SEQUENCE POSITION TO BE MANIPULATED

If you know the pattern to search for, switch to screen mode, and enter a dash and the pattern. Finish with <Return>, and the cursor jumps to this position. Repeating the search jumps to the next position of that pattern. Alternatively, use the <left> or <right> cursor keys (move one position), or < and > (move 50 positions). For further commands, see Table 1.

3.4.2.5. DELETION OF A SYMBOL

This is accomplished by the <Delete> key.

3.4.2.6. INSERTION OF A SYMBOL

This is done by just typing the key of interest. A gap (.) is inserted by typing a period <.>.

3.4.2.7. CREATION OF A CONSENSUS

On the command line, type the command consensus.

3.4.2.8. CREATION OF AN AUTOMATIC CONSENSUS

On the command line, type the command autoconsensus.

3.4.2.9. DELETION OF A CONSENSUS

Select the consensus sequence, move to the command line, and type remove. Answer yes to the confirmation question.

3.4.2.10. RECREATION OF A DELETED CONSENSUS

On the command line, type the command new. This will create a new sequence. For the name of this sequence, enter the name of the sequence group you are editing. Then proceed as in Section 3.4.2.7. or 3.4.2.8.

3.4.2.11. EXITING THE LINEUP PROGRAM

Type the command exit on the command line.

3.5. Use of the Editor to Improve a Multiple Sequence Alignment Manually (Method 5)

Once the number of sequences is larger than 30, the lineup program can no longer be used for alignment manipulation. On the other hand, small changes can be conveniently performed in the system’s editor. The standard GCG file format (documentary heading, two periods [‘..’] as separator, and data) makes MSF formatted files suitable for this purpose. Refer to Method 1 in order to format your sequence alignment to MSF format. Then, use the editor (vi on UNIX, EDT on VAX/VMS, or any other editor suiting your needs) to perform the changes desired. In order to use the resulting file further in GCG programs, the program reformat must be used again with the option msf (see Method 1). This is because GCG files have a built-in checksum that is no longer correct if the file has been manipulated with non-GCG tools like the system editor.
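To see why any manual edit invalidates the built-in checksum, it helps to look at how such a checksum reacts to a one-symbol change. The chapter does not describe the algorithm, so the Python sketch below is an assumption: it follows the position-weighted scheme under which the GCG checksum is commonly reimplemented in open-source sequence tools (weights cycling from 1 to 57, result taken modulo 10000). The function name is hypothetical, and reformat remains the authoritative way to restore a valid file.

```python
def gcg_checksum(sequence: str) -> int:
    """Position-weighted checksum in the style attributed to GCG (assumed scheme).

    Each symbol is upper-cased and multiplied by a weight that cycles 1..57;
    the total is reduced modulo 10000.
    """
    checksum = 0
    weight = 0
    for symbol in sequence:
        weight += 1
        checksum += weight * ord(symbol.upper())
        if weight == 57:
            weight = 0
    return checksum % 10000


original = "ACGT"
edited = "AC.GT"  # a gap symbol inserted by hand in the editor
print(gcg_checksum(original), gcg_checksum(edited))
```

Because every symbol contributes according to its position, inserting a single gap changes the value, which is why a hand-edited file no longer matches its stored checksum.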


3.6. Use of the Program Plotsimilarity to Visualize a Multisequence Alignment Graphically (Method 6)

In reviewing the alignment’s quality, it is necessary to plot the similarity of the sequences vs the position. This can be achieved with the plotsimilarity program. In order to generalize the result, a window sliding across the alignment is used to average the similarity over a limited range of symbols. The average similarity over the entire alignment is plotted as a dashed line, whereas the similarity in the given window is plotted as a solid line. Figure 3 shows such a plot for calmodulin sequences.

3.6.1. Selection of the Graphics Configuration

The program setplot defines the graphics configuration by offering a variety of choices, and you may pick one of the options.

3.6.2. (Optional) Trimming of the Sequences

If your sequence alignment does not have sequences of identical length, call the lineup program to accomplish this by adding periods ‘.’ as gap symbols at the beginning and the end of the alignment.

3.6.3. Use of the Program Plotsimilarity for Previewing

3.6.3.1. START OF THE PROGRAM PLOTSIMILARITY

Answer the question “what sequences?” with either the specification of an MSF file and curly brackets (e.g., my.msf{*}), or, if you use the file of sequence names format (FOSN), the “@” character and the file of sequence names (e.g., @my.fil).

3.6.3.2. DEFINITION OF THE WINDOW PARAMETER

The program will ask for a “window.” The default value of 10 can be accepted with <Return>.

3.6.3.3. DEFINITION OF THE SCALE OF THE PLOT

The program will ask for a “density,” meaning the scale of the plot. Unless you want to compare different alignments, accepting the default value will result in the largest plot possible.

3.6.4. Use of the Command Line Options to Manipulate the Plot for Final Graphics

3.6.4.1. ON THE VAX/VMS OPERATING SYSTEM

Options given below are entered after the program name and separated by slashes (“/”). For example, $ plotsimilarity/identity means that you type the program name plotsimilarity at the system prompt ($) and modify the program’s action with the qualifier identity.

3.6.4.2. ON THE UNIX OPERATING SYSTEM

Options given below are entered after the program name, and separated by a blank (space bar) and a dash (“-”). For example, % plotsimilarity -identity means that you type the program name plotsimilarity at the system prompt (%) and modify the program’s action with the option identity.

identity: As default operation, plotsimilarity plots the arithmetic average of the scores of all possible comparisons between the sequence symbols at that position. In order to use a symbol value of 10 for all comparisons, the option identity will be necessary.
bar: Plots the similarity as a bar graph rather than a continuous curve.
minscale = 0, maxscale = 1: Define the scale of the vertical axis. This is useful in comparing several plots.
expand: Uses the measured minimum and maximum similarity scores, rather than minscale and maxscale.
noaverage: Suppresses the plot of overall average similarity between the sequences.
font = 3: Uses a HELVETICA-like font for the letters. Font number six uses TIMES-like characters.
check: Displays all possible options, including those not explained here.

[Figure 3: similarity plot. Plot title recoverable from the original: “PLOTSIMILARITY of: Pileup.Msf{*} Window: 10 March 20, 1992 17:09”. Horizontal axis: Position (ticks at 50, 100, 150).]

Fig. 3. The program plotsimilarity visualizes the quality of an alignment. Options used to generate this plot are identity, expand, and font = 3. The window used was 10, and several sequences of calmodulins were aligned.

3.7. Use of the Program Distances to Tabulate Sequence Similarity within a Multisequence Alignment (Method 7)

The program distances writes a matrix of the pairwise distances between a maximum of 50 different sequences.

3.7.1. Reformatting of the Alignment to MSF Format

If the multisequence alignment is not already in MSF format, refer to Method 1 in order to do this now.

3.7.2. Use of the Program Distances

3.7.2.1. DEFINITION OF THE NAME

Enter the name of the multisequence alignment (e.g., my.msf{*}).

3.7.2.2. DEFINITION OF THE THRESHOLD

The threshold is the minimum symbol comparison value needed for a match to count toward the summation. <Return> accepts the default value.

3.7.2.3. DEFINITION OF THE DENOMINATOR

Any of the following four features of sequence length can be used as denominator, or none at all:

1. Length of shorter sequence including gaps;
2. Length of shorter sequence excluding gaps;
3. Average sequence length including gaps;
4. Average sequence length excluding gaps; or
5. Nothing.

The program suggests option 2, which can be accepted with <Return>.

3.7.2.4. DEFINITION OF THE OUTPUT FILE NAME

The program suggests an output file name, which can be accepted with <Return>.
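The effect of the denominator choice in Section 3.7.2.3. is easiest to see on a small example. The Python sketch below is a simplified illustration, not the distances program: it counts only identical, non-gap symbols as matches (instead of applying a symbol comparison table and a threshold), and the function names and the gap symbol handling are assumptions made for the example.

```python
GAP = "."  # gap symbol as used in the alignments described above


def _length(row, include_gaps):
    """Length of an aligned row, with or without gap symbols."""
    return len(row) if include_gaps else sum(ch != GAP for ch in row)


def pair_score(a, b, denominator=2):
    """Matches between two aligned rows, scaled by denominator option 1-5."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    matches = sum(x == y and x != GAP for x, y in zip(a, b))
    if denominator == 5:            # option 5: nothing, report the raw sum
        return float(matches)
    include_gaps = denominator in (1, 3)
    la, lb = _length(a, include_gaps), _length(b, include_gaps)
    denom = min(la, lb) if denominator in (1, 2) else (la + lb) / 2
    return matches / denom if denom else 0.0


def score_matrix(rows, denominator=2):
    """Full pairwise table for a dict that maps names to aligned rows."""
    return {(p, q): pair_score(rows[p], rows[q], denominator)
            for p in rows for q in rows}


rows = {"seq1": "ACG.T", "seq2": "AC.GT"}
for option in (1, 2, 5):
    print(option, pair_score(rows["seq1"], rows["seq2"], option))
```

With gaps included in the denominator the same three matches yield a lower value than with gaps excluded, which is why the chosen option should be noted together with the resulting matrix.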


3.8. Creation of a Profile to Use the Multisequence Alignment in Database Searches (Method 8)

Profile methods include a range of technologies that are described in Chapter 22 in Part II, profile analysis, by M. Gribskov. The methods presented here only illustrate the GCG interface to these programs. Run the program profilemake in order to generate a profile. Give the name of the MSF formatted multisequence alignment (see previous methods) as input file, and enter an output file name. You can also accept the suggested output file name with <Return>.

3.9. Use of the Program Profilegap to Extend the Multiple Sequence Alignment Efficiently (Method 9)

Whereas Method 3 was used to prealign sequences in pairwise comparisons before assembling the alignment with lineup (Method 4), this method uses an existing alignment profile to integrate an additional sequence. Profilegap works exactly like gap (see Chapter 7, sequence comparison). Run with the option out, profilegap will generate two new sequences, one being the profile, and the other being the aligned sequence. Review the paired output file for orientation, and use Method 4 to incorporate the newly written sequence file into your alignment.

4. Notes

1. Most of the methods presented permit the use of both MSF and FOSN types of input files. Use of MSF exclusively is recommended because the number of files involved is smaller.
2. All sequence comparison methods rely on a symbol comparison table, which is provided by default in each program. Refer to Chapter 7 on how to create or use your own comparison tables.
3. The pileup program permits a variety of useful options. As with all GCG programs, the option check will explain and show these in detail.
4. The pileup program does have restrictions that could make it fail easily with large DNA sequence alignments. The most crucial limit is the maximum of 2000 gaps in the entire multisequence alignment. However, since the GCG programs are provided with full source code, your software manager should be able to follow the GCG documentation on how to change these limits if needed.
5. The plotsimilarity and profilemake programs need sequences of equal length, as usually provided with pileup. Because of restrictions in inserting gaps (see Note 4), the input to pileup should also have sequences of approximately equal length and some similarity.
6. All methods presented here are suitable for protein as well as DNA sequences.
7. Like all automatic methods, the pileup, plotsimilarity, and profilemake programs will not hesitate to work even if the input does not make sense. Therefore, it is the responsibility of the researcher to guarantee the quality of the calculation.

Reference

1. Devereux, J., Haeberli, P., and Smithies, O. (1984) A comprehensive set of sequence analysis programs for the VAX. Nucl. Acids Res. 12, 387-395.

GCG: Database Searching

Reinhard Dölz

1. Introduction

1.1. Algorithm Considerations

Searches in databases require efficiency and speed. This cannot be achieved by using the same methods as described in the previous chapters on sequence comparison. It would take much too long to calculate alignment path matrices between each database sequence and the query sequence. However, calculation precision is still needed, because searching even a “small” database of 10,000 sequences can no longer be controlled interactively by the researcher. The computer should still be able to separate statistical noise from real “similarity.” This target, however, cannot be achieved in a realistic time frame. In Fig. 1A, you can see typical alignment scores between a query sequence and the database sequences. The identities will be clearly separated. Interspecies homologies might be clearly visible, but the interesting sequences, the distantly related sequences, might well be hidden in the statistical noise. The “noise” is shown with arrows on top of the scorings to illustrate that the bars are extremely large. The problem is even greater if you are trying to identify distantly related sequences. Then, you will miss identity matches and interspecies homology matches, and the resulting plot will show a very broad statistical noise (see Fig. 1B). The following considerations will guide you in searching for a sequence in the database without being easily trapped.

From: Methods in Molecular Biology, Vol. 24: Computer Analysis of Sequence Data, Part I. Edited by: A. M. Griffin and H. G. Griffin. Copyright ©1994 Humana Press Inc., Totowa, NJ


[Figure 1: four score histograms. Panel titles recoverable from the original: A, “FASTA of 1 kB in EMBL”; B, “FASTA of 1 kB (random) in EMBL”; C, “TFASTA of 150 aa in NBRF”; D, “PROFILESEARCH of 150 aa profile in NBRF”. Horizontal axis: Score. Annotations in the panels: Troponin C, Species Homologies, Identities.]

Fig. 1. Scoring histograms of typical database searches. The number of hits is plotted vs the “score” this hit causes during the searching procedure. Subsequent alignment might change these scores because of gaps and homologies. A. Result of searching human calmodulin DNA in the EMBL database. The related protein, troponin C, is found in the steep descent of the statistical noise. B. Result of searching a randomized sequence (again, calmodulin) under precisely the same conditions. Note the random hits with low scores, and the change of scale in the X axis. C. Result of searching human calmodulin protein sequence with tfasta. Note the difference in scores relative to A. D. Result of searching an alignment of calmodulins using the profilesearch method. The reading frame of 10 calmodulins was extracted from the database and aligned as described in Chapter 9. Note the difference in the scores relative to A.
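The control search in panel B uses a randomized version of the query. One simple way to prepare such a control is to shuffle the symbols of the query, which keeps length and composition but destroys the order. The Python sketch below shows only this preparation step; the function name and the optional seed parameter are hypothetical, and the chapter does not state how the randomized sequence in Fig. 1B was produced.

```python
import random


def shuffled_control(seq, seed=None):
    """Return a random permutation of seq (same length, same composition).

    A fixed seed makes the control reproducible between runs.
    """
    rng = random.Random(seed)
    symbols = list(seq)
    rng.shuffle(symbols)
    return "".join(symbols)


query = "ATGGCTGACCAACTGACTGAAGAG"
print(shuffled_control(query, seed=1))
```

Scores obtained with the shuffled query give a rough impression of the noise level against which hits of the real query have to be judged.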

1.2. Fast Screening for DNA

If you are cloning a DNA fragment and have a very large sequence, conventional search procedures might be too sophisticated for answering the question: Is there an identical sequence in the database already? For this purpose, the GCG program quicksearch looks for identical sequences. Quicksearch is very insensitive to homologies and should not be used for homology searches. This program is presented in Method 1.

1.3. Homology Searching: Screening the Database with an Unknown Sequence

The programs fasta and tfmta are optimally suited for this purpose. FASTA and TFASTA have been written by Pearson (l), but have been made availablewith the GCG Suite (2). If you want to useyour own matrices, the GCG program wordsearch and the aligning program segments should be usedinstead.All of theseprograms,however,requireprecautions: 1. If the sequence is rather long, the search results will be difficult to mterpret, because the score of a short segment having high homology might eastly be hidden by random hit scores. Therefore: Split the sequence in two or more sequences. If you are already aware of domains within the sequence (i.e., if you carried out a single sequence analysis before), you should use the domain borders as fragmentation sites. If you do not know anything about it, you need to split the sequence, but also use overlapping sequences. As a rule of thumb, there should not be more than 100 amino acids or 300 nucleotides in a query sequence. 2. Omit sequence repeats or fragments that are well known and nonsignificant. If your sequence contains poly-A stretches as DNA, or if you are usmg a protein that has poly-ASP sequences, you should use seqed to cut these out. See Chapter 9 for details. Otherwise, your search result will reflect homologies to these quite frequently occurrmg motifs. 3. If you know that your sequence is part of a protein family, create a file of file names, and search the database only for those genes/protems you know might be significant. See Chapter 13 for details on file of file names and pattern searches.

4. In a first search for a DNA gene, you might consider using one of the EMBL or GenBank sublibraries instead of the whole library. This will give you much shorter search times.


5. If you suspect that your sequence is homologous to a family, but the search does not show this result, you can do a search with a prototype of the suspected family and use the hits of this search as a library for your unknown sequence.

The fasta program is presented in Method 2, wordsearch/segments is presented in Method 3, and tfasta is presented in Method 4.

1.4. Twilight Zone Searching: Screening the Database with an Alignment

Frequently, weak “homologies” are mixed with random noise scores (see also Fig. 1), which might affect the clarity of the result. Whereas true “motifs” can be rather short and should be searched with the pattern recognition searching programs discussed in Chapter 10, proteins with distant homologies might be lost in the noise or because of the high homology to other sequences in different regions. Therefore, the optimal approach for screening the “twilight zone” is to create a “profile” and search with a sequence-specific comparison matrix. These profile methods, developed by Gribskov and Eisenberg (3), are explained in detail in Chapter 22 in Part II. Method 5 in this chapter shows the GCG interface to the program profilesearch.

1.5. Significance Considerations

All fast sequence-comparing programs rely on identities of short sequence fragments (“words”). These identities will be statistically distributed in nonrelated sequences, thus permitting the accumulation of scoring values, which can be sorted in tables. Increasing the length of these “words” decreases sensitivity (compare with Chapter 7, comparison of sequences, Tables 1 and 2). Therefore, any search will try to use short “words”: in DNA searches, typically a length of six, whereas protein searches might start with two. The result is reported in a top scoring list, and entries of this list are aligned with the query sequence afterward. This second alignment step must yield the same results as the search itself, because in this second step, alignment path matrices are employed. Any alignment in question should be reevaluated on a routine basis in an individual sequence comparison in order to justify the homology found.
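The fragmentation rule of thumb from Section 1.3. (no more than 100 amino acids or 300 nucleotides per query, cut into overlapping pieces when no domain borders are known) can be sketched in Python. This is only an illustration and not part of the GCG package: the function name split_with_overlap and the overlap of 20 symbols are assumptions, because the text does not prescribe an overlap length.

```python
def split_with_overlap(seq, max_len=100, overlap=20):
    """Cut seq into fragments of at most max_len symbols.

    Consecutive fragments share `overlap` symbols. The overlap value is an
    assumption; the chapter only asks for overlapping fragments.
    Returns (start, end, fragment) tuples with 1-based, inclusive coordinates.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    if not seq:
        return []
    step = max_len - overlap
    fragments = []
    start = 0
    while True:
        end = min(start + max_len, len(seq))
        fragments.append((start + 1, end, seq[start:end]))
        if end == len(seq):
            break
        start += step
    return fragments


if __name__ == "__main__":
    protein = "M" * 250  # placeholder sequence, hypothetical data
    for first, last, fragment in split_with_overlap(protein):
        print(first, last, len(fragment))
```

For a nucleotide query, the same function could be called with max_len=300. The reported coordinates correspond to the “start” and “end” values that the search programs ask for when only a fragment of a sequence is to be searched.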


2. Materials

The methods and the programs reported here are part of the GCG program package (2), Version 7.x, to be installed on either a VAX/ALPHA computer running the operating system VMS or one of the supported UNIX systems: Silicon Graphics (IRIX), Digital Equipment (ULTRIX), or Sun Computers (SUNOS). The programs can be obtained from GCG Inc., University Research Park, 575 Science Drive, Suite B, Madison, WI, 53711. A version for the CONVEX variant is also available from Convex Corp., Dallas. The computer system should be equipped with at least 16 Mbyte of memory, and should hold about 1 Gbyte disk for program, database, and scratch area.

Method 1 requires extraordinary system resources and, therefore, might not be supported on all systems. In particular, small workstations might run out of resources easily. On larger systems, the system manager will usually restrict the use of the program by resource quota; therefore, consult the system manager before trying Method 1. In order to set up the indices of Method 1, a virtual memory of 100 Mbyte or larger is required. In order to let users utilize Method 1, a physical memory of 32 Mbyte or larger is recommended. On VAX/VMS systems, the operating system might need extra tuning in order to use Method 1. This must be done by the system manager and is described in the GCG system support documentation.

All GCG programs generally require a terminal that is capable of emulating the VT100 terminal mode standard. Scroll-back facilities as provided on workstations or in PC terminal emulators are advantageous, but not essential. Text output of the methods presented can be reprocessed with word processing programs on personal computers. The data-transfer capabilities required for this are usually provided within the terminal emulation programs. If text output is to be printed, any ASCII code printer is sufficient. Capabilities to print 132 characters per line may be advantageous, but are not essential.

To analyze a sequence, it is required that sequence data be formatted in GCG sequence file format. See Chapters 2, 6, 8, and 11 to review methods on how to create a sequence in this format.


For all searching methods (with the exception of Method 1), it is advantageous that the user be familiar with the procedure of handling jobs in the background. GCG programs like fasta, tfasta, and wordsearch will run smoothly, provided that enough system resources such as memory or disk are available, simply by running the program with the batch option (see Section 4.). To use Method 5 (profile searching methods), the user should be familiar with the system editor, and know how to set up batch command files.

Methods 1 and 3 generate graphics output. For viewing the results, a high-quality graphics device is desirable. For terminal previewing, TEKTRONIX 4014 capabilities or REGIS emulation (as provided in DEC 240, 340, or 440 terminal series, or in various PC terminal emulators) is recommended. The GCG graphics features include a variety of other modes, e.g., workstations use the X-Windows interface. The appropriate choice is site-dependent and is preconfigured by the software manager. See Section 3. for details. Graphics hard copies may use a variety of output formats, including HPGL and POSTSCRIPT standards. This feature is also site-dependent and usually preconfigured by the software manager.

3. Methods

3.1. Fast Screening for DNA (Method 1)

This method seeks identities in the database using the GCG program quicksearch. This program is quick because it uses a special dictionary of 20-mer sequences in which your search sequences are looked up. Both strands will be searched. The “overlaps” need to be large enough in order to be reported. Quicksearch uses a “word” algorithm (see Chapter 7). The “stringency,” as used in quicksearch terminology, is the number of words that must match within a given number of words (“window”). The default settings in quicksearch are rather stringent: quicksearch will not report matches that show more than two errors per 100 bp. The primary purpose of quicksearch is to find identities.
If you need to modify your database, a new index has to be built using the program quickindex. See your system or software manager for details if you need to index a new database.
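The dictionary lookup described in Section 3.1. can be illustrated with a short Python sketch. It is not the quicksearch implementation: the function names, the in-memory dictionary, and the use of non-overlapping query words are assumptions made for illustration. Only the word length of 20 and the search of both strands are taken from the text.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def build_word_index(entries, word_len=20):
    """Map every word of word_len bases to the (entry name, offset) pairs
    where it occurs. This plays the role of the prebuilt dictionary."""
    index = defaultdict(list)
    for name, seq in entries.items():
        for i in range(len(seq) - word_len + 1):
            index[seq[i:i + word_len]].append((name, i))
    return index


def lookup(query, index, word_len=20):
    """Count identical words per (entry, strand), using non-overlapping
    query words (an assumption of this sketch)."""
    hits = defaultdict(int)
    for strand, seq in (("+", query), ("-", reverse_complement(query))):
        for i in range(0, len(seq) - word_len + 1, word_len):
            for name, _offset in index.get(seq[i:i + word_len], ()):
                hits[(name, strand)] += 1
    return dict(hits)


if __name__ == "__main__":
    # Hypothetical toy database; a short word length keeps the example small.
    database = {"e1": "ACGTTGCAAGGC"}
    idx = build_word_index(database, word_len=4)
    print(lookup("TTGCAAGG", idx, word_len=4))
```

Building the dictionary is the expensive part, which is consistent with the need to rebuild an index whenever the database changes, whereas each individual lookup is cheap.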


The procedure of using quicksearch requires the computation of the alignments of the hits found in a second step. This can be achieved with the procedure quickshow. The following procedures require that the sequences to be searched are already present in a valid GCG format.

3.1.1. Use of Quicksearch

3.1.1.1. START OF THE QUICKSEARCH PROGRAM

Type the command quicksearch on the command prompt.

3.1.1.2. SPECIFICATION OF PARAMETERS

The program asks for a sequence to be searched, a window size, and a stringency. The defaults offered for the latter two parameters can be accepted with the Return key. The output file will be asked for next.

3.1.1.3. PERFORMANCE OF THE SEARCH

The quicksearch program will start to load the sequence data into the memory. This can last several minutes. Then, the search is performed in a few seconds, and the program displays “overlaps” with both the positive and the negative strands. Note that, in the case of a window of 15, a length of 300 bp will be considered to be an overlap.

3.1.1.4. PERFORMANCE OF SUCCESSIVE SEARCHES

After the successful search, the program prompts again for a query sequence. It is advantageous to prepare several sequences and search them in one session, because the loading step is omitted on repeated queries. The program can be exited by typing on the question for the query sequence, and then quicksearch writes a report and exits.

3.1.2. Review of the Result

3.1.2.1. REVIEW OF THE OUTPUT FILE

Quicksearch writes an output file whose name is usually composed of the file name and the extension .quick. This file is a text file that can be reviewed on the screen in order to check for useful results. The command $ type/page myseq.quick (on VAX/VMS) or % more myseq.quick (on UNIX) will achieve this. For the subsequent methods of this section, it might be advantageous to edit this file with a text editor (e.g., vi on UNIX or EDT on VAX/VMS) in order to erase lines that show undesired or already known hits. See Chapter 4 for a brief editor description.


3.1.2.2. PREPARATION OF AN ALIGNMENT

The program quickshow permits the generation of alignments. After typing the command quickshow on the command line, options are presented that permit the generation of either a graphical overview (see Chapter 7 for dotplots) or a text file. Selecting option 2 will make the program ask for an output file, and the results of the quicksearch run will be written into a text file. Note that, depending on the amount of positive hits, this method should not be followed before the method described in Section 3.1.2.3. has been used.

3.1.2.3. DISPLAY OF THE ALIGNMENTS ON THE SCREEN

The program quickshow can also display the alignments graphically. For this, a definition of the graphics output is required. First, the program setplot permits the selection of a suitable output device. Second, quickshow can be run as explained in Section 3.1.2.2., but by selecting option 1 for dotplot-style alignments. Refer to Chapter 7 for details on what information dotplots are able to visualize.

3.2. Fasta (Method 2)

The program fasta was written by Pearson (1), but was incorporated into the GCG program environment starting with version 6.0 of the package. The basic idea of fasta is to provide a tool for extremely effective searches for homology. Whereas the sensitivity will always be important, speed is needed to achieve effective searches. Therefore, instead of performing a rigorous alignment showing the best local similarity (refer to Chapter 7, sequence comparisons), the fasta algorithm uses a two-step mechanism to target the positive hits first and perform more sophisticated alignments afterward (see Chapter 26, FASTA, for precise explanations). Without disregarding the importance of parameters, the GCG program interface offers a choice of default values that are usually reasonable as a starting point.
In the first step of the algorithm, fasta searches the database for identities between very short fragments (usually hexamers in DNA searches and dipeptides in protein searches). The positive hits are scored in a table that is subsequently refined in the second step employing a variety of methods. The final scoring list, therefore, contains more than one scoring value indicating these different steps of the calculation.
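The first step described above, the search for identities between very short fragments, can be sketched in Python. This is a simplified illustration of word matching on diagonals, not the FASTA program itself: the function names are hypothetical, only one count per database entry is produced, and none of the refinement steps that lead to the additional scoring values are included.

```python
from collections import defaultdict


def word_positions(seq, k):
    """Lookup table: each word of length k -> list of start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def best_diagonal(query, subject, k):
    """Return (diagonal, count) for the diagonal that collects the most
    identical k-words, or None if the sequences share no word."""
    table = word_positions(query, k)
    diagonals = defaultdict(int)
    for j in range(len(subject) - k + 1):
        for i in table.get(subject[j:j + k], ()):
            diagonals[j - i] += 1  # same offset means same diagonal
    if not diagonals:
        return None
    return max(diagonals.items(), key=lambda item: item[1])


def rank_library(query, library, k):
    """Score every library entry by its best diagonal; highest count first."""
    scores = []
    for name, seq in library.items():
        best = best_diagonal(query, seq, k)
        scores.append((best[1] if best else 0, name))
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    # Hypothetical peptide data; k=2 corresponds to dipeptide words.
    library = {"a": "GGMKVLATGG", "b": "PPPPPP"}
    print(rank_library("MKVLAT", library, k=2))
```

Because only exact word identities are counted, a larger word size makes the table lookup faster but misses matches that are interrupted by substitutions, which is the sensitivity trade-off discussed for the word size parameter.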

3.2.1. Running the Fasta Program

3.2.1.1. START OF THE PROGRAM

To start the program, type the command fasta on the system prompt. This program is one of the few GCG programs that might take longer than a few seconds to complete. For this reason, it is possible to execute fasta in the “batch” mode. This means that the program is started as usual, but afterwards executed without further interaction with the user. You can even log out from the computer and look at the program’s result at a later time. The command fasta needs to be modified for running in this special mode: $ fasta/batch (on VAX/VMS) or % fasta -batch (on UNIX).

3.2.1.2. SPECIFICATION OF THE SEARCH SEQUENCE NAME

The fasta program will query for a sequence that is to be searched. This sequence must be present in GCG format and will be searched as such. If fragments of the sequence are to be searched, the questions for “start” and “end” of the sequence are to be answered accordingly.

3.2.1.3. DEFINITION OF THE DATABASE TO BE SEARCHED

Most commonly, the database to be searched will be named genembl:* for DNA, which covers all of the genbank and EMBL sequences currently stored in the computer. Protein searches will run on either the swissprot:* or the nbrf:* databases. All database specifications require that the name of the database be known to your process. Databases other than those mentioned might well be present at your local site. The GCG program package also supports the use of databases created individually by the user. Usually, the asterisk behind the colon indicates that all sequences of the specific library are to be searched. This is not imperative and may be changed in order to perform a partial search for subsets of the libraries.

3.2.1.4. DEFINITION OF THE SEARCH PARAMETERS

The program will ask for a “word size.” This parameter defines the size of the elements that are used for identity comparison. Lowering the parameter results in an increased sensitivity, but also can cause reduced selectivity.
Therefore, the suggested parameter of 2 (for proteins) or 6 (for DNA) is suited for getting started. Next, the program asks for the list size, that is, the number of best-scoring database sequences to be listed in the output. The suggested parameter of 40 should be increased to 100. Finally, fasta asks for an output file, whose name is usually composed of the search sequence file name and the extension .fasta. If you decided to run in the batch mode (see Section 3.2.1.1.), a message will be shown that the job has been submitted to the queue. Otherwise, you can monitor the search process while the computation proceeds; every 100 sequences, a line is printed. At the end of the calculation, a summary is printed that indicates the CPU time used.

3.2.2. Collection of the Fasta Output

Fasta usually writes one output file, whose name is composed of the file name of the query sequence and the extension .fasta. If fasta is run in the batch mode, it also writes a so-called log file that can be used to locate errors occurring during the run. This log file is usually called fasta-xxx.log, where xxx is a decimal representation of the time of submission. Usually, this file can be discarded.

3.2.3. Evaluation of the Fasta Output

The output file of a fasta run contains a histogram, a list of hits, and the alignments. The histogram (see also Fig. 1A) gives an overview of the result of the search. If you get an output like the one in Fig. 1B, you will presumably be correct in stating that no significant homology could be found. The second part of the fasta output contains a list of the sequences that have been identified to contain the best-matching similarities. Refer to Chapter 26 (FASTA) for a theoretical explanation of the scores displayed in the three columns at the right side. Briefly, the init1 score shows the initial score obtained by comparing the sequences with the “word size” defined in Section 3.2.1.4., and the opt score represents the score obtained in the final alignment shown in the third part of the output.

3.2.4. Modification of the Fasta Program Output

Fasta can be started with the inclusion of options in order to modify the program’s actions. As usual, these options can be queried by typing $ fasta/check (on VAX/VMS) or % fasta -check (on UNIX). The most important option is noalign, which can be used to suppress the third part of the fasta output. The result file, therefore, represents a file of sequence names (FOSN, see Chapter 8, lineup editor) that can be used in subsequent searches or another GCG program.


Another important option is onestrand. Usually, fasta in DNA searches will search both strands. This is an extension of the original program described in Chapter 26 (FASTA). In order to let fasta search only one strand, apply this option.

3.3. The Wordsearch and Segments Programs (Method 3)

The GCG program package provides another tool for sequence searching using the rapid Wilbur and Lipman algorithm (as fasta does). The wordsearch program is somewhat slower than fasta and occasionally has been reported to be less sensitive. The major difference between wordsearch and fasta is that the subsequent alignment of the top scoring hits to the query sequence has to be started manually and is performed by the program segments. It is generally recommended to use fasta (see previous method) instead of wordsearch. The latter produces histograms (which can be created as graphics by specifying the option plot on the command line) and uses a different methodology to restore the entries after the first initial hits have been identified. The use of the segments program is straightforward. There are two extra features compared to fasta: (1) Wordsearch permits the use of an option called mask. If this keyword is used on the command line, it enables the input of a pattern that can be used to take into account the ambiguity of the third base in DNA searches. (2) The segments program is an automated version of bestfit (see Chapter 7) and, therefore, permits the use of one’s own comparison table. If comparison tables have been created (see Chapter 7), only segments will permit the use of these, by typing the command $ segments/data=filename (on VAX/VMS) or % segments -data=filename (on UNIX). This is useful if the standard comparison table cannot be applied because of too much ambiguity.

3.4. The tfasta Program (Method 4)

The tfasta program uses the same algorithm as the fasta program described in Section 3.2.
However, tfasta uses a peptide sequence as query against the nucleotide database. In order to make this work, the entire DNA database is translated in all six frames. The dialog to start a tfasta run works like the fasta dialog described in Section 3.2.; however, because a DNA library is required, only a DNA specification like genembl:* may be given in the step of Section 3.2.1.3.

3.5. Profilesearch Methods (Method 5)

The use of profile methods is explicitly detailed in Chapter 22 in Part II (profile methods), by Gribskov (3). Here, the general interface to these methods within the GCG program environment is described. Profile searching has four steps. Starting from an alignment (that has been prepared according to the methods in Chapter 8), the program profilemake creates a position-specific scoring table that is called a profile. This profile represents the information of the alignment in much more detail than a consensus sequence, because not only the “majority” voting of symbols, but also the individual differences are reflected. This profile can be further used by the program profilegap (which just applies the rules of the gap program described in Chapter 7). Several of these profiles can be searched simultaneously with the program profilescan. As described in Chapter 10, profilescan uses a database of profiles to find structural motifs in a single query sequence. The opposite principle works in the program profilesearch: A profile is used as a representation of the sequence alignment in order to scan a sequence database. This results in a top-scoring list, which is used for a subsequent alignment of the profile to this list in the program profilesegments.

The profilesearch methods are used most commonly in peptide or small protein alignments. Because of the nature of the computation, a profilesearch run takes up to several hours. Whereas this is still a tolerable limit, in DNA searches, the search time can be extended even more. Because of some restrictions of the implementation, it is not possible to start a profilesearch on the entire database of DNA (as of Version 7.x).

3.5.1. The Profilemake Program

3.5.1.1. PREPARATION OF THE ALIGNMENT

In order to run profilemake, an alignment is required that consists of several sequences of exactly the same length. Refer to Chapter 8 for details on creating such an alignment and the various formats of representing these.
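Why a position-specific table carries more information than a consensus sequence can be shown with a small Python sketch. This is a deliberately simplified stand-in, not the profilemake algorithm: it stores plain symbol frequencies per column, whereas the profile method of Gribskov and Eisenberg (3) derives its scores from a comparison table and includes gap penalties. The function names make_profile, score, and consensus are hypothetical.

```python
from collections import Counter


def make_profile(aligned):
    """Build one frequency dictionary per alignment column.

    All sequences must have exactly the same length, as required for the
    input alignment.
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned)
        profile.append({sym: n / len(aligned) for sym, n in counts.items()})
    return profile


def score(profile, segment):
    """Sum the column frequencies of the symbols of an ungapped segment."""
    return sum(col.get(sym, 0.0) for col, sym in zip(profile, segment))


def consensus(profile):
    """Majority symbol per column; the minority symbols are lost here."""
    return "".join(max(col, key=col.get) for col in profile)


if __name__ == "__main__":
    alignment = ["ACDE", "ACDF", "AGDE", "ACDE"]  # hypothetical toy alignment
    prof = make_profile(alignment)
    print(consensus(prof), score(prof, "ACDE"), score(prof, "AGDF"))
```

In this toy case the consensus is ACDE, but the sequence AGDF still obtains a nonzero contribution in the second and fourth columns, because the profile remembers that G and F occur there.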


3.5.1.2. START OF THE PROGRAM

Profilemake can run on both MSF and FOSN types of alignments. After typing the command profilemake on the command line, the first question asked is the alignment name. If you have a file of sequence names (FOSN) type of alignment, you should specify @my.fil if my.fil is the name of this file of sequence names. If you used the multiple-sequence format (MSF), you specify my.msf{*} as the alignment name if my.msf is your file name.

3.5.1.3. DEFINITION OF ADDITIONAL PARAMETERS

The profilemake program will read the input file(s) and display the contents on the screen. Next, a file name is asked for. The default suggestion is the file name of the alignment with the extension .prf.

3.5.2. The Profilesearch Program

3.5.2.1. NORMAL MODE OF OPERATION

Typing the command profilesearch on the command line will start the program. After the program has asked for the name of the query profile, the validity of the input is checked and reported on the screen. The following parameters are to be filled in a straightforward fashion: Data library, gap weight, and gap length are asked for and, finally, the output file name.

3.5.2.2. OPTIONS USED TO MODIFY THE PROGRAM’S PREFERENCE

On the VAX/VMS operating system, options are entered after the program name and separated by slashes (“/”). For example, $ profilesearch/cpu=1000 means that the profilesearch run will consider the option cpu=1000. Analogously, on the UNIX operating system, the option is separated from the command by a blank (space bar) and a dash (“-”). As an example, you could type % profilesearch -CPU=1000 to achieve the desired effect. The following options are of interest:

listsize = n, where n specifies the number of entries in the output list; the default is the maximum number of sequences that can be searched (currently, 60,000).
minlist = n sets the lowest Z-score for an entry to be reported in the output list. Z-score equations contain the achieved score for an individual alignment and the “predicted” score. Usually, Z-scores above 5.0 are considered to be significant.
cpulimit = n sets the maximum CPU limit (in seconds), which is set to 86,400 (24 h) per default. In complicated searches, it is suggested that the search be tried out on a subset database first or with a limited CPU time.
check will list all options, including those not explained here.
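The role of the minlist threshold can be illustrated with a generic Z-score calculation in Python. The exact equation used by profilesearch is not given in this chapter, so the formula below (achieved score minus predicted score, divided by a measure of spread) is an assumption in the usual statistical sense, and the names z_score and significant_hits as well as the example numbers are hypothetical.

```python
def z_score(score, predicted, spread):
    """Generic Z-score: distance of the achieved score from the predicted
    score, in units of the spread (e.g., a standard deviation).
    Assumption: profilesearch may normalize differently."""
    if spread <= 0:
        raise ValueError("spread must be positive")
    return (score - predicted) / spread


def significant_hits(hits, threshold=5.0):
    """Keep (name, z) pairs above the threshold, sorted by descending Z.
    The default of 5.0 follows the rule of thumb quoted in the text."""
    kept = [hit for hit in hits if hit[1] > threshold]
    return sorted(kept, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical entries: (name, achieved score, predicted score, spread)
    raw = [("a", 29.8, 20.0, 2.0), ("b", 44.0, 20.0, 2.0), ("c", 30.2, 20.0, 2.0)]
    scored = [(name, z_score(s, p, sd)) for name, s, p, sd in raw]
    print(significant_hits(scored))
```

A filter of this kind corresponds to setting minlist, and it produces the same kind of list, sorted by descending Z-score, that the output file contains.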

3.5.2.3. PROFILESEARCH IN THE BATCH MODE

Profilesearch does not support the batch mode by a command line parameter. Therefore, the user has to set up a command file that contains all the options required. The GCG program provides a template for this. In order to copy this template into the current directory, issue the command fetch profilesearch.com (both VAX/VMS and UNIX). Call a system editor (e.g., vi on UNIX, EDT on VAX/VMS), and modify this file to fit your needs. Then, submit this file to the batch mode. The default procedure of this process is described in the GCG user’s guide and can also be viewed on the screen with the command genmanual users-guide batch (both VAX/VMS and UNIX).

3.5.2.4. REVIEW OF THE OUTPUT OF PROFILESEARCH

The output file of profilesearch contains an explanation of the statistical evaluation of the significance and the list of top hits sorted by descending Z-score value. If there are entries that are not to be considered in the subsequent alignment step (see Section 3.5.2.5.), these can be commented out with an exclamation mark (!). Generally, it is assumed that entries with Z-scores larger than 5.0 are significant.

3.5.2.5. ALIGNMENT OF PROFILESEARCH RESULTS

As in the methods described previously (e.g., wordsearch), profilesearch only lists the top scores, but does not perform an alignment. This is done with the program profilesegments. To start the program, type the command profilesegments on the command line. Then, the program asks for a profilesearch output. After this name has been entered, profilesegments asks for the number of alignments and the name of the output file. The algorithm of profilesegments is similar to that described for the bestfit program (see also Chapter 7).

4. Notes

1. Method 1 requires excessive system resources. On VAX/VMS, the message . . . quota exceeded is an indication that the program is available but cannot be executed by this user account, because the account cannot allocate sufficient virtual memory. On UNIX, messages like . . . process killed due to insufficient memory/swap indicate the same problem.
2. Fasta can be made more sensitive with the word size 1. This setting will affect the search time drastically and should be tried only if the first search with default parameters did not show the expected result.
3. Because of the arbitrary translation, tfasta can result in misleading hits. Whether the homology reported in the alignment is located in the correct reading frame always has to be checked on an individual basis.
4. In rare cases, it can happen that profilesegments fails to find the original profile. This is because profilesearch writes the name of the profile into the top line of the output file. This information is parsed by the program profilesegments. If the file has been deleted or renamed by the user, profilesegments will fail. To recover from this problem, use an editor to enter the correct path name and the correct file name into the first line. On VAX/VMS, this reads for example d$user:[otto.sequence]myfile.prf, and on UNIX, it might be /user/otto/sequence/myfile.prf.
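The six-frame translation that tfasta relies on (Section 3.4.), and the need to check the reading frame of a reported hit (Note 3), can be made concrete with a Python sketch. It is an illustration only and not the tfasta code: the standard genetic code is assumed, the function names are hypothetical, incomplete codons at the end are ignored, and codons containing ambiguity symbols are written as X.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate from the first base; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the translations of frames +1..+3 and -1..-3."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for frame, peptide in six_frames("ATGGCCTAA").items():
        print(frame, peptide)
```

Only one of the six translations of a coding region is biologically meaningful, so a peptide query can match a database entry in a frame that is never expressed; this is why each hit has to be inspected individually.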

References

1. Pearson, W. R. (1989) Rapid and sensitive sequence comparison with FASTP and FASTA, in Methods in Enzymology (Dayhoff, M. O., ed.), vol. 183, Academic, San Diego, pp. 146-159.
2. Devereux, J., Haeberli, P., and Smithies, O. (1984) A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res. 12, 387-395.
3. Gribskov, M. and Eisenberg, D. (1989) Detection of structural patterns with profile analysis, in Techniques in Protein Chemistry (Hugli, T. E., ed.), Academic, San Diego, pp. 108-117.

CHAPTER 10

GCG: Pattern Recognition

Reinhard Dölz

1. Introduction

Pattern recognition in biological data is difficult to achieve. On the one hand, it is simple to find information based on the identity of patterns described in a search string. Methods 1 and 2 describe such a pattern search for text strings. More sophisticated searches use a pattern definition language (e.g., Methods 6 and 7), whereas a different approach just counts given patterns and plots or tabulates the result (Methods 3 and 4). Another method of pattern recognition is to try to find repeats in a given sequence, which is basically a pattern search with patterns that are to be found by the program itself (Method 8). Note that searching for a reading frame also uses pattern searching methods (in the most simple case, just by detection of ATG), but this kind of pattern search is described in Chapter 11. Similarly, sequence searching with profiles, which is described in Chapter 9, is also a kind of pattern search not listed here.

From: Methods in Molecular Biology, Vol. 24: Computer Analysis of Sequence Data, Part I. Edited by: A. M. Griffin and H. G. Griffin. Copyright ©1994 Humana Press Inc., Totowa, NJ

2. Materials

The methods and the programs reported here are part of the GCG program package (1), Version 7.x, to be installed on either a VAX/ALPHA computer running the operating system VMS or one of the supported UNIX systems: Silicon Graphics (IRIX), Digital Equipment (ULTRIX), or Sun Computers (SUNOS). The programs can be obtained from GCG Inc., University Research Park, 575 Science Drive, Suite B, Madison, WI, 53711. A version for the CONVEX variant is also available from Convex Corp., Dallas. The computer system should be equipped with at least 16 Mbyte of memory, and should hold about 1 Gbyte disk for program, database, and scratch area.

All GCG programs generally require a terminal that is capable of emulating the VT100 terminal mode standard. Scroll-back facilities as provided on workstations or in PC terminal emulators are advantageous, but not essential. Text output of the methods presented can be reprocessed with word processing programs on personal computers. The data-transfer capabilities required for this are usually provided within the terminal emulation programs. If text output is to be printed, any ASCII code printer is sufficient. Capabilities to print 132 characters per line may be advantageous, but are not essential.

The statplot program used in Method 3 needs a graphics device. For terminal previewing, TEKTRONIX 4014 capabilities or REGIS emulation (as provided in DEC 240, 340, or 440 terminal series, or in various PC terminal emulators) is recommended. The GCG graphics features include a variety of other modes; e.g., workstations use the X-Windows interface. The appropriate choice is site-dependent and is preconfigured by the software manager. See Section 3. for details. Graphics hard copies may use a variety of output formats, including HPGL and POSTSCRIPT standards. This feature is also site-dependent and usually preconfigured by the software manager.

To search patterns in sequences, it is required that sequence data be formatted in GCG sequence file format. See Chapters 2, 6, 8, and 11 in order to review methods on how to create sequences in this format.

3. Methods

3.1. Search for Nonsequence Information (Method 1)

Quite frequently, searches for sequences are also needed if no sequence data are known, but only a name of the sequence. In order to search for such an occurrence of a name, it would be necessary to use a relationally structured database or at least a database where indices are available for all items of interest, such as authors, journals, or sequence description. However, the databases of today contain very little structure like the one desired. The only fields that can selectively be searched for are the “title” of the sequence and all reference information of the database. This means that there might be a chance of finding the name of a particular gene as “title,” but it is quite impossible to find an author’s name there. Unfortunately, the text areas in databases do not follow strict rules and are very ambiguous with respect to details. In particular, both d. melanogaster and drosophila m. might describe the same species. In order to refine lists, it is therefore necessary to do a rather broad search for a particular name and fuse the results of several search runs before performing a second selection.

The program stringsearch operates like a word processor with respect to the sensitivity in spelling. However, the program is not case-sensitive. Because each line is looked at and processed individually, stringsearch operated on the full set of sequence records might take a very long time. Try a definition search first (see Section 3.1.3.), and use the sublibraries available as long as this seems to be a useful approach. Note that it cannot be assumed that stringsearch is exhaustive.

3.1.1. Start of the Program

The program stringsearch is started by typing its name. On VAX/VMS, the command $ stringsearch/match=or will have the effect that multiple patterns need not all be present in a positive hit. Other options include match=and for making stringsearch look for all patterns (which is the default). You can also specify a number of hits necessary for reporting, e.g., match=2 will report all entries that have two matches. On UNIX, the command % stringsearch -match=or will have the same effect as described above.

3.1.2. Definition of the Target Sequence

Like all GCG programs, stringsearch can perform its search in a GCG formatted database or in a file of sequence names. Databases have to be specified as a name with a colon and an asterisk, e.g., genembl:* specifies all data that are in genbank and EMBL. Depending on your setup, the other databases available are either the subsections of genbank or those of EMBL. Protein databases are Swissprot and NBRF. Your software manager will tell you which databases are installed. Additionally, the gcg command, which is used to prepare the environment, displays a banner with the most recent release numbers.
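The effect of the match option from Section 3.1.1. on a comma-separated pattern list (Section 3.1.4.) can be sketched in Python. This is not the stringsearch implementation: the function names are hypothetical, and reading match=2 as “at least two patterns found” is an assumption, because the text only says that entries with two matches are reported. The case-insensitive comparison follows the description of the program.

```python
import csv


def parse_patterns(spec):
    """Split a comma-separated pattern list; a pattern that contains spaces
    is enclosed in double quotes. Patterns are lower-cased because the
    search is not case-sensitive."""
    row = next(csv.reader([spec], skipinitialspace=True))
    return [p.strip().lower() for p in row if p.strip()]


def matches(text, patterns, match="and"):
    """Decide whether one definition line is a positive hit.

    match="and": all patterns must occur (the default)
    match="or":  one pattern is enough
    match=n:     at least n patterns must occur (assumed interpretation)
    """
    text = text.lower()
    found = sum(1 for p in patterns if p in text)
    if match == "and":
        return found == len(patterns)
    if match == "or":
        return found >= 1
    return found >= int(match)


if __name__ == "__main__":
    pats = parse_patterns('"heat shock", Drosophila')
    lines = [  # hypothetical definition lines
        "Drosophila melanogaster heat shock protein hsp70",
        "D. melanogaster heat shock protein",
    ]
    for line in lines:
        print(matches(line, pats, "and"), matches(line, pats, "or"), line)
```

The second line shows the spelling problem discussed above: with match=and it is missed because the species name is abbreviated, whereas match=or reports it. Broad searches followed by a second selection are a way around this.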


Files of file names are to be specified as an at character (@) and the file name; e.g., @genembl.strings will name a file that contains the result of a previous stringsearch run in the genembl:* databases. For further hints on list refinement, see below.

3.1.3. Definition of the Search Mode

The next question asked by stringsearch is whether definitions or complete sequence records are to be searched. Pressing the return key accepts the default (definition search), but this might be too stringent for author or journal citations.

3.1.4. Definition of the Pattern to be Searched

Patterns are specified as single words separated by commas, if needed. If a pattern includes spaces, it is to be enclosed in quotes. Note that the selectivity of such a pattern search is increased by longer patterns, but the problems of spelling and complete name are raised simultaneously.

3.1.5. Definition of the Output File

The stringsearch program asks for an output file and starts the search. The result is reported at the end of the search.

3.1.6. List Refinement

The output of stringsearch is a file of sequence names. This file can be searched with another run or, alternatively, with the program findpatterns (see Method 6) or any sequence searching program (see Chapter 7). Because the input to the stringsearch program may also be a file of sequence names, the weakness of the method can be overcome by multiple, sensitive searches with several programs.

3.2. Searching an Entry by Code or Number (Method 2)

Entries in the database are labeled with a specific name that is up to nine characters in length. Whereas this code may change with database releases, the accession number is unique and permanent. When fragments are fused into a larger sequence, the new entry gets a new “primary” accession number and keeps the accession numbers of the fragments as “secondary” accession numbers. This way, the uniqueness of accession numbers is guaranteed.

3.2.1. Definition of a Sequence by Accession Number

The program names reports whether the specified accession number is present. The command names genembl:X34210 (both VAX/VMS and UNIX) will either get no response or report the sublibrary where this accession number has been found (if this is a collection of databases like genembl).

3.2.2. Definition of a Sequence by Entry Code

The program names will also name the entry code if specified properly, e.g., names genembl:*cam (both VAX/VMS and UNIX) will report all sequences in genembl whose names end with the three letters “cam.” Note that this wild card usage is not permitted if you search for accession numbers.

3.2.3. Looking at Sequence Data from the Database

The program fetch will copy any sequence from the database into your current directory. The program typedata performs the same function, but only displays the text on the screen. The specification of the sequence has to follow the conventions explained in Section 3.2.2. The option reference will report only the annotation, and omit the plain sequence data in both the fetch and typedata commands.

3.3. Determination of Arbitrary Pattern Frequencies (Method 3)

For some purposes, it is desirable to plot the frequencies of single symbols or short sequences against a long sequence. These short sequences might consist of one or a few single nucleotides (or amino acid residues in protein sequences), e.g., patterns like AT or ATG can be analyzed with respect to their position. Because distributions are meaningful only over a range of symbols, usually a window is chosen that glides across the sequence to be analyzed. The GCG program window computes a table of pattern occurrences that can be plotted with the statplot program.

3.3.1. Specification of the Sequence to be Analyzed

After being started with the command window on the command line, the program asks for the beginning and end of the sequence. Accepting the default values with the return key will make the program use the whole sequence. Finally, the sequence can be reversed.
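The idea of a window gliding across the sequence can be sketched in a few lines. This is only an illustration of the principle, not the algorithm of the GCG program window: the function name window_counts, its parameters, and the decision to count overlapping occurrences are assumptions of the sketch.

```python
def window_counts(sequence, pattern, window, shift):
    """Count occurrences of pattern in windows sliding along sequence.

    Returns a list of (start_position, count) pairs with 1-based positions.
    Overlapping occurrences are counted (an assumption of this sketch).
    """
    seq = sequence.upper()
    pat = pattern.upper()
    table = []
    for start in range(0, len(seq) - window + 1, shift):
        chunk = seq[start:start + window]
        count = sum(
            1
            for i in range(len(chunk) - len(pat) + 1)
            if chunk[i:i + len(pat)] == pat
        )
        table.append((start + 1, count))
    return table


# Pattern AT counted in windows of 6 symbols, moved by 3 symbols each time
table = window_counts("ATATGCGCATAT", "AT", window=6, shift=3)
# table == [(1, 2), (4, 0), (7, 2)]
```

A small window with a small shift gives a detailed but noisy curve; a larger window smooths the curve at the price of losing positions at both ends of the sequence.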

3.3.2. Definition of the Window

The program asks for a window size and the increment to be used for sliding the window across the sequence. The next question asks for the name of the output file; the name suggested by the program can be accepted.

3.3.3. Definition of the Pattern

Patterns observed can be counted by number, percent, fraction, and the averaged version of these three measures. A menu is presented that can be used to compose any of the desired functions in a row of lower-case letter options; e.g., aae would be two times option a (number of patterns) and one time option e (normalized number of patterns). The next question asks for the precise pattern for each of the measures defined in the previous step.

3.3.4. Definition of the Graphics Configuration

The program setplot defines the graphics configuration by offering a variety of choices, and you may pick one of the options. This step needs to be run only if you did not define your graphics configuration earlier or if you want to change it.

3.3.5. Start of the Statplot Program

The program for plotting the output of window is called statplot and is started by typing its name. Options to modify the program’s action are available and can be queried by: $ statplot/check on the VAX/VMS system or with % statplot -check on the UNIX system.

3.3.6. Definition of the Input File

The program will ask for an input file that has to be created with the program window (see previous steps). On successful reading, statplot will report on the statistics described by this file and ask for the scale of the plot. Unless several sequences are to be compared, which requires identical scales, the return key can be used to accept the default value, which plots the statistics on one page.

3.4. Count of the Composition (Method 4)

The program composition is capable of counting symbols in either protein or nucleotide sequences. In the latter case, the composition of di- and trinucleotides is tabulated as well. The composition program

is started by its name, and the sequence to be analyzed may be given in the standard way.

3.5. Special Pattern Search (Method 5)

There are a variety of patterns in molecular biology computing that can only be found with a specific program, e.g., prokaryotic, factor-independent RNA polymerase terminators can be found with the program terminator. This program is only applicable to the specific purpose explained above and will give useless results on other sequences. The program is started with the command terminator and behaves as usual with respect to sequence specification.

3.6. Usage of a Pattern Description Language (Method 6)

The program findpatterns permits searching for patterns in a single sequence or a sequence library. The definition of the pattern can be achieved interactively or within a data file. Findpatterns searches both strands of a nucleotide sequence if the patterns specified are not identical on both strands.

3.6.1. Pattern Definition Language

A pattern can be defined as a row of characters. More sophisticated definitions are listed below.

3.6.1.1. REPEAT COUNTS

Curly brackets denote the number of repeats; {10} means that the previous symbol must be repeated 10 times, whereas {5,15} implies that any repetition from 5 to 15 is scored as a match.

3.6.1.2. PATTERN REPEATS

If more than one symbol is to be repeated, the symbols have to be enclosed in parentheses, e.g., (AT){3} means that the sequence AT is to be repeated three times.

3.6.1.3. OR MATCHING

In addition to the ambiguity symbols, a set of possible identities can be symbolized with a comma-separated list of symbols in parentheses. This applies to protein sequences mostly, e.g., (D,E) means that either D

or E is considered to be a match. This OR matching can be combined with any of the other rules described above.

3.6.1.4. NOT MATCHING

Some patterns require that particular symbols must not occur in a given position. This can be achieved by preceding the symbol of interest with a tilde (~), e.g., GA~(A,T)GA means that GA is followed by any symbol except A or T, and then by another GA.

3.6.2. Predefined Pattern Libraries

In addition to user-defined patterns, the GCG programs supply two libraries of patterns suited as input for the findpatterns program. Enzyme.dat is the Restriction Enzyme file as provided by Rich Roberts (see also Chapters 3, 4, and 5), and TFDdata.dat is the Transcription factor database as created by David Ghosh. Both data sets can be addressed with the command line option data = genrundata:filename, where filename is either of the two mentioned above. The PROSITE motif database is also available within the GCG programs. In order to deliver protein-specific information, the special program motifs (see Chapter 12) does pattern searches with amino acids.

3.6.3. Start of the Findpatterns Program

The program findpatterns is started by its name. Options to modify the program’s action can be given as follows: $ findpatterns/batch on the VAX/VMS system, or % findpatterns -batch on the UNIX system. This will use the option batch with the findpatterns program. Further options are listed below.

3.6.4. Specification of the Search Set

The findpatterns program can search single sequence files (e.g., my.seq), files of file names (e.g., @my.fil), or whole databases (e.g., genembl:*). On the question, “what sequence(s)” any of the three

alternatives is permitted. After the program is completed, the output file will contain a summary that reports the search set used.

3.6.5. Options for Modifying Findpatterns

mismatch = 1 allows one mismatch in the pattern search.
names creates a file of file names for further use.
append appends the pattern data file to the output.
batch submits the search to the batch queue.
data = filename uses the named file as pattern file.

3.7. Matching Short Aligned Patterns (Method 7)

Aligned sequences can be used to create consensus information that represents the variability of certain positions within the alignment. Larger data sets can effectively be assembled using the methods of multiple-sequence alignment (see Chapter 8) and profile searching (see Chapter 9). Shorter fragments can be aligned in a simple blockwise fashion and processed by the program consensus. The program fitconsensus, in turn, utilizes this information to match the established consensus vs a given sequence or group of sequences.

3.7.1. Preparation of an Alignment

The alignment needed by the program consensus consists of plain sequence symbols after two adjacent periods (..) and one sequence per line. It is possible to edit an MSF formatted file manually (see Chapter 8) in order to get to this format. The command fetch acceptor.dat (both VAX/VMS and UNIX) will copy the GCG example file into your directory for comparison.

3.7.2. Preparation of the Consensus

The program consensus, if typed as a command name, starts and asks for the file of aligned sequences prepared in the previous step. Then, the consensus certainty and the output file are asked for, and the consensus file is written.

3.7.3. Searching a Sequence with the Consensus

The program fitconsensus, if typed as a command, will ask for a sequence to be searched with this consensus. Then, the start and the end of the sequence are asked for. If both defaults are accepted with the return key, the whole sequence is searched. Next, the consensus file created in the previous step has to be given as file name. After the question for the certainty, the number of hits to be reported is asked for. These hits will show up in the output file, sorted by position. The name of the output file is to be given in the last question.
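The two steps of this method, building a consensus from a block alignment and sliding it along a sequence, can be illustrated with a minimal sketch. The scoring used here (mean frequency of the observed symbols) is an assumption chosen for simplicity; it is not claimed to be the statistic computed by consensus and fitconsensus, and the function names and the example alignment are hypothetical.

```python
from collections import Counter


def consensus_table(aligned):
    """Per-position symbol frequencies (0..1) from equally long aligned sequences."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    table = []
    for column in zip(*aligned):
        counts = Counter(symbol.upper() for symbol in column)
        table.append({sym: n / len(aligned) for sym, n in counts.items()})
    return table


def fit(sequence, table, hits=3):
    """Score every window of the sequence against the table.

    The score is the mean frequency of the observed symbols (1.0 = perfect fit).
    The best `hits` windows are returned as (position, score), 1-based,
    sorted by position.
    """
    seq = sequence.upper()
    width = len(table)
    scored = []
    for start in range(len(seq) - width + 1):
        total = sum(col.get(seq[start + i], 0.0) for i, col in enumerate(table))
        scored.append((start + 1, total / width))
    best = sorted(scored, key=lambda hit: hit[1], reverse=True)[:hits]
    return sorted(best)


# Made-up block alignment of four short fragments
table = consensus_table(["TATAAT", "TATGAT", "TACAAT", "TATACT"])
best = fit("GGCTATAATGC", table, hits=1)
# best == [(4, 0.875)]
```

Asking for more hits or accepting lower scores corresponds loosely to lowering the certainty in the interactive dialogue: more positions are reported, but more of them will be chance matches.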

3.7.4. Using this Method in PCR Design

The programs shown in this method permit the matching of probes, as they will be prepared in a synthesizer, in the form of their consensus vs the sequences of interest. This approach is relatively work-intensive, but would work as follows:

1. Create a sequence with the program backtranslate (see Chapter 12).
2. Assemble a sequence alignment as described in Section 3.7.1. with the variations intended to be present in the probe.
3. Manually edit the consensus file if needed after having performed the step described in Section 3.7.2.
4. Run the step described in Section 3.7.3., and check the output file for plausibility of the probe employed.

3.8. Searching Repeats in Sequences (Method 8)

Whereas inverted repeats in DNA sequences can be searched with the program stemloop (see Chapter 11), the program repeat will enable you to search direct repeats in a given DNA or protein sequence. There are three parameters needed in order to do such a search: first, the minimal size of the repeat; next, how many of the symbols of this repeat must match (i.e., the stringency); and last, within what distance the repeats are to be searched for.

3.8.1. Start of the Repeat Program

The program repeat is started by typing the program name.

3.8.2. Definition of the Sequence to be Searched

The program asks for a sequence, its beginning, and its end. The sequence can be any validly formatted nucleotide or protein sequence. If the questions for beginning and end are answered with the return key, the program will analyze the entire sequence.

3.8.3. Definition of the Search Parameters

The program will ask for a minimum repeat window and the minimum stringency. It is not reasonable to enter stringency values larger than the window size. In protein sequences, it might be necessary to lower the stringency depending on what pattern is being searched for. Finally, the maximum distance allowed between the repeats is asked for. This distance restricts the range to be searched in order to avoid extensive output.
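How the three parameters interact can be seen in a brute-force sketch. It is an illustration only, not the algorithm of the GCG program repeat: the function name direct_repeats, the restriction to non-overlapping windows, and the interpretation of the range as the distance between the two start positions are assumptions.

```python
def direct_repeats(sequence, window, stringency, max_range):
    """Find pairs of windows that agree in at least `stringency` positions.

    Returns (first_start, second_start, matches) with 1-based start positions.
    Only non-overlapping windows whose starts are at most max_range apart
    are compared.
    """
    seq = sequence.upper()
    found = []
    last = len(seq) - window          # last possible 0-based window start
    for i in range(last + 1):
        for j in range(i + window, min(last, i + max_range) + 1):
            matches = sum(
                a == b for a, b in zip(seq[i:i + window], seq[j:j + window])
            )
            if matches >= stringency:
                found.append((i + 1, j + 1, matches))
    return found


repeats = direct_repeats("GATTACAGGGATTACA", window=7, stringency=7, max_range=12)
# repeats == [(1, 10, 7)]
```

Lowering the stringency below the window size admits imperfect repeats and quickly enlarges the output, which is the reason for restricting the range.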


3.8.4. Definition of the Output

The program will perform the repeat search and report the result. Depending on your expectations, it might be necessary to change the parameters and redo the search. If the default option at this point is chosen, the results are written to a file whose name has to be specified.

4. Notes

1. The program stringsearch is not very exhaustive and will require some patience.
2. It is to be emphasized that pattern searches in text always suffer from lacking standardization of key words, and so on. Therefore, searches in the text area of databases are considered to be unreliable.
3. Accession numbers should be unique. A very minor fraction of the entries in the database still has several identical secondary accession numbers. These entries will not be covered by the names program.
4. The composition of peptides can also be calculated with the peptidesort program using no enzymes (Chapter 12) and will give more verbose output.
5. The repeat program does not look for inverted repeats. This is done by the program stemloop (Chapter 11).
6. PCR probes can also be searched with the program described in Chapter 9. The findpatterns program is also sufficient in most cases. The utilization of fitconsensus gives the most precise answers, but is difficult to set up.
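Readers who want to try out patterns of the kind described in Section 3.6.1. outside GCG may note that the notation is close to the regular expressions of many programming languages. The following sketch translates the four rules (repeat counts, pattern repeats, OR matching, NOT matching) into a Python regular expression. It is an illustration under stated assumptions, not a reimplementation of findpatterns: the function name to_regex is hypothetical, OR lists are assumed to contain single symbols only, and ambiguity symbols, mismatches, and the search of the second strand are not handled.

```python
import re


def to_regex(pattern):
    """Translate a findpatterns-style pattern into a Python regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        negate = ch == "~"                     # NOT matching applies to the next item
        if negate:
            i += 1
            ch = pattern[i]
        if ch == "(":
            close = pattern.index(")", i)
            inner = pattern[i + 1:close].replace(" ", "")
            if "," in inner:                   # OR list such as (D,E)
                symbols = inner.split(",")
                if any(len(s) != 1 for s in symbols):
                    raise ValueError("only single-symbol alternatives are supported")
                out.append(("[^" if negate else "[") + "".join(symbols) + "]")
            elif negate:
                raise ValueError("negated groups are not supported in this sketch")
            else:                              # plain group such as (AT)
                out.append("(?:" + inner + ")")
            i = close + 1
        elif ch == "{":                        # repeat count, same notation in regex
            close = pattern.index("}", i)
            out.append(pattern[i:close + 1].replace(" ", ""))
            i = close + 1
        else:
            out.append("[^" + ch + "]" if negate else ch)
            i += 1
    return "".join(out)


regex = to_regex("GA~(A,T)GA")                 # "GA[^AT]GA"
hit = re.search(regex, "CCGACGATT")            # matches GACGA starting at index 2
```

For example, (AT){3} becomes (?:AT){3} and A{5,15} is passed through unchanged.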

Reference

1. Devereux, J., Haeberli, P., and Smithies, O. (1984) A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res. 12, 387-395.

CHAPTER 11

GCG: Translation of DNA Sequence

Reinhard Dölz

1. Introduction

Translation of DNA sequences can be simply trivial reformatting, e.g., transcribing one file format into another, or using Us instead of Ts, and so forth. These utilities for “translation” are by no means unimportant and are, therefore, presented in Method 1. The translation from DNA to peptide sequences is more related to biology. Whereas this requires the use of codon usage tables, the “final” truth can be found only by determining the peptide sequence of the protein that is being characterized by its DNA. The methods presented as Methods 2 to 4 need codon usage tables. The reason for this is that translation and, beforehand, identification of reading frames are statistical methods that rely on known distributions of codon usage. If the unknown sequence originates from an organism for which a codon table has not been determined, no correct statistics will result. This is usually not very problematic, unless several frames with high scores show up. Method 5, therefore, uses a codon-table-independent method to predict possible translation frames. Method 6, finally, is the operation of real translation, which incorporates information of the previous methods. The corresponding back-translation (i.e., coming from a protein sequence) to determine a DNA sequence can be achieved using this method. Chapter 3 also describes a method that permits having a translation of DNA to protein within a restriction map. Chapter 14 describes a method that does the protein sequence translation within a plot for publication.

From: Methods in Molecular Biology, Vol. 24: Computer Analysis of Sequence Data, Part I. Edited by: A. M. Griffin and H. G. Griffin. Copyright ©1994 Humana Press Inc., Totowa, NJ


The methods and the programs reported here are part of the GCG program package (1), Version 7.x, to be installed on either a VAX/ALPHA computer running the operating system VMS or one of the supported UNIX systems: Silicon Graphics (IRIX), Digital Equipment (ULTRIX), or Sun Computers (SUNOS). The programs can be obtained from GCG Inc., University Research Park, 575 Science Drive, Suite B, Madison, WI, 53711. A version for the CONVEX variant is also available from Convex Corp., Dallas. The computer system should be equipped with at least 16 Mbyte of memory, and should hold about 1 Gbyte of disk space for program, database, and scratch area.

All GCG programs generally require a terminal that is capable of emulating the VT100 terminal mode standard. Scroll-back facilities as provided on workstations or in PC terminal emulators are advantageous, but not essential. Text output of Methods 1 and 7 can be processed with word processing programs on personal computers. The data-transfer capabilities required for this are usually provided within the terminal emulation programs. If text output is to be printed, any ASCII code printer is sufficient. Capabilities to print 132 characters per line may be advantageous, but are not essential.

In order to use Methods 2, 3, and 5, a high-quality graphics device is desirable. For terminal previewing, TEKTRONIX 4014 capabilities or REGIS emulation (as provided in DEC 240, 340, or 440 terminal series, or in various PC terminal emulators) are recommended. The GCG graphics features include a variety of other modes, e.g., workstations use the X-Windows interface. The appropriate choice is site-dependent and is preconfigured by the software manager. See Section 3 for details. Methods 3 and 4 might be run advantageously on terminals or computers that are connected to the host with a line speed of at least 19,200 baud, because graphics are complicated and require a lot of data traffic.

To analyze a sequence, it is required that sequence data be formatted in GCG sequence file format. See Chapters 2, 6, and 8 in order to review methods on how to create a sequence in this format. Methods 2, 3, 6, and 7 of this chapter use a codon preference table. Although some of these are provided within the GCG package, a couple

more are available on file servers of the international Internet. Refer to the appendix of Chapter 6 for addresses of computers offering these services.

3. Methods

3.1. Tools (Method 1)

To work with sequences in GCG format, it is useful to be aware of translators that can transcribe the sequence into the GCG format. The plain GCG sequence format is characterized by a text area that may contain any text data not to be processed by the sequence analysis programs. After a separation line, the sequence data follow in IUPAC compliant codes. Numbers are generally ignored and do not matter if included for readability. The sequence may be terminated by two slashes “//.” The separation line containing the two periods (“..”) is essential. In addition to information on the size and creation of the sequence, it contains a so-called checksum, which ensures the integrity of the data. To enhance the program performance in some cases, this line also contains the type (e.g., type “N” for nucleotide sequence). The separation line is terminated with two periods “..” at the end of the line. Therefore, to incorporate any file of plain sequence data into the GCG programs, it is sufficient to edit the file, add two periods “..”, and run reformat on it. (Even this editing step can be omitted if it has been assured otherwise that only sequence data are in this file.) Table 1 summarizes the programs available for tailoring sequences. All the programs run the same way: After having been started by typing its name on the command line, the program asks for a file name to be processed and starts operation.

3.2. The Overview Generated by the Frames Program (Method 2)

The program frames is capable of schematically plotting reading frames just by localizing start and stop codons. If necessary, the translation table can be explicitly specified. Additionally, a codon usage table can be provided that allows marking of rare codons. If you used the mark files (see Chapter 3), you can use these here again to indicate schematically known regions of your data.
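Localizing reading frames from start and stop codons alone is simple enough to sketch. The following lines are an illustration of the principle, not the code of the frames program: the function name open_frames is hypothetical, only the three forward frames are examined, and ATG and the stop codons TAA, TAG, and TGA of the standard genetic code are assumed where a translation table could otherwise be supplied.

```python
STOPS = {"TAA", "TAG", "TGA"}   # stop codons of the standard genetic code
START = "ATG"


def open_frames(sequence):
    """For each forward frame (1-3), list stretches from an ATG to the next stop.

    Positions are 1-based and inclusive; the stop codon is part of the stretch.
    """
    seq = sequence.upper().replace("U", "T")
    frames = {}
    for frame in range(3):
        found = []
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == START:
                start = pos                      # first start codon opens a stretch
            elif start is not None and codon in STOPS:
                found.append((start + 1, pos + 3))
                start = None                     # look for the next start codon
        frames[frame + 1] = found
    return frames


orfs = open_frames("CCATGAAATTTTAGGATGCCC")
# orfs == {1: [], 2: [], 3: [(3, 14)]}
```

A start codon without a following stop in the same frame (as in frame 1 of the example) is not reported by this sketch; a plot, in contrast, would still show the start codon as a mark.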

Table 1
Tools Available for Tailoring GCG Sequences

Program name                       Input                                   Output                                     Comment
chopup                             Text with very long lines (>255 chars)  Text with lines of 50 chars each           Useful after transfer from PCs
onecase                            Text of any kind                        Either all upper or all lowercase letters  DNA sequences appear to be more readable in lower case
fromstaden                         STADEN format                           GCG format                                 The comment area contains the original file name of the input
fromembl                           EMBL format                             GCG format                                 The EMBL nonsequence data are entirely copied
frompir                            PIR format                              GCG format                                 The PIR nonsequence data are entirely copied
fromgenbank                        GenBank format                          GCG format                                 The GenBank nonsequence data are entirely copied
reformat                           Any GCG format                          GCG format                                 Manual crosscheck after reformatting is recommended
reformat with option RNA           DNA, GCG format                         GCG format with “U” instead of “T”         Recommended in alignments to enhance readability of bulk sequences
reformat with option DNA           RNA, GCG format                         GCG format with “T” instead of “U”         Recommended in alignments to enhance readability of bulk sequences
reformat with option oneintothree  Peptide GCG format                      Three-letter peptide GCG format            Not to be used in other GCG programs; GCG font output only
reformat with option threeintoone  Three-letter peptide GCG format         One-letter peptide GCG format              Manual crosscheck after reformatting is recommended

3.2.1. Definition of the Plotting Configuration

Run the program setplot to define your graphics configuration. This makes your configuration known to the system, and you need not run this step again unless you would like to change the configuration.

3.2.2. Start of the Frames Program

The frames program is started by typing the program name. It is possible to modify the program’s operation by options, which are to be applied as follows: $ frames/rare on VAX/VMS and % frames -rare on UNIX. The program’s operation will be modified with the option rare.

3.2.3. Specification of the Sequence

The program asks for a sequence name, which should be a DNA (or RNA) sequence in GCG format. Afterwards, the program permits the changing of the beginning and the end of the sequence. Pressing the return key at both questions will run the calculation on the entire sequence. If necessary, the sequence can be reversed.

3.2.4. Options Available

The following options are available:

data = filename: The filename given is one of the translation tables provided by the GCG package. See Chapter 5 for details.
mark = filename: The filename given should be the file of your choice, which denotes the marks of your sequence. See Chapter 5 for details.
rare = filename: The filename given should be one of the codon usage files provided by the GCG program package or by yourself. See Method 4 for details.
threshold = fraction: fraction ranges from 0.0 to 1.0 and denotes the threshold of a rare codon to be marked in the plot.
check explains additional options not shown here.

3.3. The Detailed View Generated by the Codonpreference Program (Method 3)

This program uses a frame-specific representation of the sequences and tries to find the correct reading frame based on the similarity of codon usage to known, translated genes. The program outputs the statistics of this comparison in a “window” that slides across the

sequence. Larger window sizes enhance the quality of prediction, but obviously reduce the ease of identification at the beginning and end.

3.3.1. Definition of the Plotting Configuration

Run the program setplot to define your graphics configuration. This makes your configuration known to the system, and you need not run this step again unless you would like to change the configuration.

3.3.2. Start of the Codonpreference Program

The program is started by typing its name. Command line options are to be applied as explained above (Section 3.2.).

3.3.3. Specification of the Sequence

The program asks for a sequence name, which should be a DNA (or RNA) sequence in GCG format. Afterward, the program permits changing the beginning and the end of the sequence. Pressing the return key at both questions will run the calculation on the entire sequence. The sequence can further be reversed if necessary.

3.3.4. Options Asked by the Program

The codonpreference program needs additional data to perform the statistics. The first question asks for the reference codon usage table. Eukaryotic highly expressed genes have a fairly standard codon usage, and its table is suggested by default. Other tables are available and are described in Method 4. The return key accepts the default choice. The next question asks for the window size used for averaging and statistics calculation. The default value of 25 can be accepted with the return key and should only be changed in special cases (see Section 4.). Last, the density of the bases per cm is asked for. Unless a comparison of different sequences requires a defined scale, the suggested default value can be accepted with the return key; it permits the largest possible magnification on one single plot.

3.3.5. Additional Options Available

bias = gc performs the third base bias calculation for G and C. nobias can suppress this calculation. For the sake of clarity, it is sometimes highly desirable to use this option (in particular on monochrome displays).
file = filename permits the output of the calculated values into filename, which is required if additional software packages will be used. (Sometimes PC programs are suited to plot the statistics.)

table = tablename permits the creation of a table with the file name tablename, which tabulates the statistics for each codon.
font = 3 selects a nice font (close to HELVETICA); font = 6 selects a font similar to TIMES.
check lists all options, including those not listed here.

3.3.6. Interpretation of the Output

Figure 1 shows an output created with codonpreference and the option nobias. From top to bottom, three forward reading frames, their statistics, and open reading frames including their rare codon occurrence are displayed. Note that no decision is possible for the first and last base pairs because of the window algorithm used. Further, it is important to follow the curves carefully. Frame three at the bottom looks perfect, but stops at about 200, whereas the middle panel just rises above the average at that point (dashed line). Most frequently, this is a typical result of an analysis on a sequence with a reading frame error.

3.4. Counting Codon Usage (Method 4)

The program codonfrequency counts codons and writes the result to a file. The output file will list the statistics found for each of the 64 possible codons, normalize the count to frequency per 1000, and calculate the fraction of the usage within its family of codons used for the same amino acid. In order to compare different codon tables, the program correspond permits the calculation of the similarity between two tables.

3.4.1. Start of the Codonfrequency Program

A mandatory prerequisite for determining codon statistics is the correct reading frame. Refer to Methods 2, 3, or 5 to accomplish this. The program codonfrequency is started by typing its name. The only recommended option would be to specify a different translation table (see Method 6) by $ codonfrequency/translate = filename.txt on VAX/VMS, or % codonfrequency -translate = filename.txt on UNIX.

3.4.2. Specification of the Program Mode

The program asks whether an existing table is to be extended (E) or whether a new sequence is to be used (S). The option (S) is the default and can be accepted by pressing the return key.
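The three numbers that such a codon table lists for every codon (count, frequency per 1000, and fraction within the family of codons for the same amino acid) can be computed as in the sketch below. It illustrates the arithmetic only and is not the code of codonfrequency: the function name codon_usage is hypothetical, the standard genetic code is assumed, and the sequence is assumed to start in the correct reading frame.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_usage(coding_sequence):
    """Return {codon: (count, per_thousand, fraction_within_amino_acid)}."""
    seq = coding_sequence.upper().replace("U", "T")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = Counter(c for c in codons if c in CODE)   # ambiguous codons are skipped
    total = sum(counts.values())
    per_amino = Counter()
    for codon, n in counts.items():
        per_amino[CODE[codon]] += n
    usage = {}
    for codon, aa in CODE.items():
        n = counts[codon]
        family = per_amino[aa]
        usage[codon] = (
            n,
            1000.0 * n / total if total else 0.0,
            n / family if family else 0.0,
        )
    return usage


usage = codon_usage("ATGGCTGCCGCTTAA")
# usage["GCT"] is (2, 400.0, 0.666...): two of five codons, two of three alanine codons
```

Tables built from very few codons, as in this toy example, are of course meaningless as a reference; extending an existing table with further sequences (option E) serves to enlarge the sample.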

[Figure 1: codonpreference plot of test.seq, positions 1 to 700 (March 20, 1992); table genmoredata:human-high.cod; PrefWindow: 25; Rare Codon Threshold: 0.10; Density: 23.0]

Fig. 1. Output of the codonpreference program with the options font = 3 and nobias. The default codon usage table was employed in the analysis of a human sequence. The dashed line inserted manually into the plot indicates a putative sequencing error; for details, see text. Note that the codon preference does not seem to indicate the validity of the long reading frames. This is the result of a small bias in the third position of the sequence used (see Section 4.).

3.4.3. Specification of the Sequence

The program will ask for a sequence name, which should be a DNA sequence in GCG format. Afterward, the program permits the beginning and the end of the sequence to be changed. Pressing the return key at both questions will run the calculation on the entire sequence. A reversal is possible, but the return key accepts the default, which does not change the sequence. The final question, for confirmation, repeats the first and last five nucleotides of the selected sequence.

3.4.4. Specifying More Exons of the Same Sequence

The next question asked by the program permits the addition of another sequence fragment from the same sequence in the same fashion as described above. The return key accepts the default answer, which proceeds to the next question.

3.4.5. Continuation of the Codonfrequency Program

The next questions asked by the program permit the user to obtain additional fragments from the same sequence or from different sequences and allow input of a new codon table file. The default option is to write out the result into a file. If this option is accepted with the return key, the file name needs to be entered for successful completion of the program.

3.5. Determination of the Reading Frame Without Knowing the Codon Preference (Method 5)

The program testcode is based on the nonrandomness of every third base, a method developed by J. Fickett. In contrast to the methods presented earlier, this method only predicts the region of coding sequence and does not give information on which reading frame is used.

3.5.1. Definition of the Plotting Configuration

Run the program setplot to define your graphics configuration. This makes your configuration known to the system, and you need to run this step again only if you need to change the configuration.

3.5.2. Start of the Testcode Program

The program testcode is started by typing the program’s name. There are no essential command line options, but the available ones can be queried by typing $ testcode/check on VAX/VMS or % testcode -check on UNIX.
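What "nonrandomness of every third base" means can be made concrete with a small sketch. It computes, for each base, how unevenly the base is distributed over the three codon positions within one window. The formula used (largest count divided by smallest count plus one) is the position measure as it is commonly described for Fickett's statistic; this is stated here as an assumption and is not taken from this chapter. The published method additionally uses base content and probability tables to arrive at a score, and these are deliberately not reproduced. The function name position_asymmetry is hypothetical.

```python
def position_asymmetry(window_sequence):
    """For each base, compare its counts at codon positions 1, 2, and 3.

    Returns {base: max(counts) / (min(counts) + 1)}; values well above 1 mean
    that the base is unevenly distributed over the three positions.
    """
    seq = window_sequence.upper().replace("U", "T")
    result = {}
    for base in "ACGT":
        # seq[offset::3] holds every third symbol starting at the given offset
        counts = [seq[offset::3].count(base) for offset in range(3)]
        result[base] = max(counts) / (min(counts) + 1)
    return result


periodic = position_asymmetry("GCA" * 20)
# periodic == {"A": 20.0, "C": 20.0, "G": 20.0, "T": 0.0}
```

Because the three positions are compared without reference to where a codon begins, the measure is the same in all three frames. This is in line with the statement above that the method indicates the coding region but not the reading frame.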

3.5.3. Adding Further Parameters

Next, the program will ask for the size of the window that is used for the prediction. The suggested value of 200 can be accepted with Return and, if modified, should be increased, but not lowered. The next question defines the scale of the plot. Unless several sequences are to be compared, accept the default with Return; this gives the largest scale plot of the whole sequence on one page.

3.5.4. Interpretation of the Output

Figure 2 shows the output of the testcode program. Note that the predicted reading frame shift in Fig. 1 can also be seen here, but this time only as a minimum in the prediction curve. Because of the size of the window, small coding regions cannot be predicted with this method. The same applies to the beginning and the end of the sequence.

3.6. Translation of a DNA Sequence to a Protein Sequence with Known Reading Frame (Method 6)

In order to translate a sequence to the correct protein, a variety of information has to be available. The reading frame limits should have been determined using one of the methods described earlier in this chapter. Furthermore, the standard IUPAC codon definitions are required if characters other than the usual ones occur in the nucleotide sequence. These definitions are supplied automatically by the GCG program translate and cannot be modified by the user.

3.6.1. Start of the Translate Program

The program is started by typing the program name. There are no command line options available.

3.6.2. Specification of the Sequence

The program asks for a sequence name, which should be a DNA (or RNA) sequence in GCG format. Afterward, the program allows the beginning and the end of the sequence to be changed. Pressing Return at both questions will run the calculation on the entire sequence. Finally, the sequence can be reversed if desired. Before the calculation starts, the first and last nucleotides of the fragment are shown on the screen in order to permit a last check.


[Figure 2: TESTCODE of test.ssq, ck: 4709, window: 200 bp, March 20, 1992.]

Fig. 2. Output of the testcode program with the option font = 3. The top panel shows possible start codons as diamonds and possible stop codons as short vertical lines. The same test sequence as in Fig. 1 was employed. The upper third of the plot is said to predict coding regions with >95% reliability, the bottom third predicts noncoding regions with the same accuracy, and the middle is called the "window of vulnerability," which cannot be used for prediction.

3.6.3. Specification of Additional Fragments

After the first translation has succeeded, the program permits the addition of other exons from either the same or another sequence file. Further options include the translation of the collected fragments and the writing of everything to a file. The latter is the default, which can be accepted with Return. The file name then has to be specified to complete the program.

3.6.4. Caveats: Checking the Output

It is possible to translate sequences that have stop codons in the nucleotide code. A stop codon will usually be represented as * in the peptide sequence. Asterisks are not standard IUPAC symbols and might cause problems; therefore, it is recommended that the sequence be reviewed immediately after a translation.
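The effect described above can be reproduced with a small Python sketch. It is not the GCG translate program: it assumes the standard genetic code, marks any codon containing characters other than A, C, G, T, or U with X (an assumption made here for simplicity), and the function names translate and internal_stops are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate a DNA (or RNA) string in the given frame (0, 1, or 2)."""
    dna = dna.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    # Codons that are not in the table are reported as X (assumption)
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def internal_stops(peptide):
    """Return 0-based positions of '*' that are not the last character."""
    return [i for i, aa in enumerate(peptide[:-1]) if aa == "*"]


peptide = translate("ATGGCCTAAGGG")
print(peptide, internal_stops(peptide))
```

A nonempty list from internal_stops corresponds to the situation in which the limits of the reading frame should be verified before the translation is repeated.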


3.6.4.1. ON VAX/VMS

The command $ type/page test.pep will display the sequence on the screen. This might be tedious for long sequences. In that case, the command $ search test.pep "*" can be used to search the peptide sequence more rapidly.

3.6.4.2. ON UNIX

The command % more test.pep will display the sequence on the screen. This might be tedious for long sequences. In that case, the command % grep "*" test.pep can be used to search the peptide sequence more rapidly.

3.6.4.3. MODIFICATION OF THE PEPTIDE SEQUENCE

If the review reveals a stop codon within the sequence, the translation should be repeated after the limits of the reading frame have been verified. If the last character of the peptide is an "*", the sequence editor seqed can be used to remove this symbol (see Chapter 6).

3.7. Translation from Peptide to DNA Sequence (Method 7)

It is assumed that you have already prepared a peptide sequence, either by typing it in manually with the sequence editor (Chapter 6), extracting it from the database with the fetch command (Chapter 9), or translating a DNA sequence using one of the methods described earlier in this chapter. Additionally, the codon usage and the translation tables have to be known. The standard codon usage table, for highly expressed genes of Escherichia coli (ecohigh.cod), is used by the GCG programs if no other codon table is explicitly given. Additional tables for yeast, maize, human, and Drosophila are provided in the directory genmoredata. Their file names are <species>-high.cod, indicating that only highly expressed genes have been used. In addition to these GCG-supplied files, codon usage files can be found on many file servers on the Internet (see Appendix of Chapter 5). The final information needed by the backtranslate program is the translation table. The standard translation supplied by GCG programs is insufficient if organelles or other special sequences are to be used. Refer to Chapter 5 for details on the translation tables available.
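The role of the codon usage table in back-translation can be illustrated with a short Python sketch. It is not the backtranslate program: it assumes the standard genetic code and the IUPAC nucleotide ambiguity symbols, the function names most_ambiguous and most_probable are hypothetical, and the codon counts in the example are made-up input, not taken from any GCG table.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Reverse table: amino acid -> list of its synonymous codons
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)

# IUPAC ambiguity symbols, keyed by the set of bases each one stands for
IUPAC = {
    frozenset(bases): symbol
    for symbol, bases in {
        "A": "A", "C": "C", "G": "G", "T": "T",
        "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
        "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
    }.items()
}


def most_ambiguous(peptide):
    """Merge all synonymous codons of each residue, position by position."""
    out = []
    for aa in peptide.upper():
        codons = SYNONYMS[aa]  # raises KeyError for unknown symbols
        out.append("".join(
            IUPAC[frozenset(codon[i] for codon in codons)] for i in range(3)
        ))
    return "".join(out)


def most_probable(peptide, usage):
    """Pick, for each residue, the synonymous codon with the highest usage.

    usage maps codon -> count or fraction; ties are resolved alphabetically.
    """
    return "".join(
        max(sorted(SYNONYMS[aa]), key=lambda codon: usage.get(codon, 0))
        for aa in peptide.upper()
    )


example_usage = {"AAG": 2, "AAA": 1, "TTT": 5}  # invented counts
print(most_ambiguous("MKF"))                 # ATGAARTTY
print(most_probable("MKF", example_usage))   # ATGAAGTTT
```

Note that merging codons position by position can overgeneralize: for serine, the merged symbol WSN also covers codons that do not encode serine, which is one reason the per-codon fractions are useful when probes are designed.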


3.7.1. Start of the Backtranslate Program

The program is started by typing the program name. It is possible to modify the program's operation by options, which are applied as follows: $ backtranslate/translate=translate.txt on VAX/VMS or % backtranslate -translate=translate.txt on UNIX. This will use the file translate.txt in your directory for translation.

3.7.2. Specification of the Sequence

The program asks for a sequence name, which should be a peptide sequence (one-letter code) in GCG format. Afterward, the program permits the beginning and the end of the sequence to be changed. Pressing Return at both questions will run the calculation on the entire sequence.

3.7.3. Determination of the Output

Because of the ambiguity of the nucleotide code, translation of a peptide to a DNA sequence gives more than one valid sequence. The program therefore permits the writing of either the most ambiguous sequence or the most probable sequence based on the given codon usage table. To guide the user in designing probes, the translation of each amino acid symbol into each of its possible codons, including their relative fractions, can also be included in the output file. The default option of the corresponding menu in the backtranslate program will write the translation table and the most ambiguous sequence.

3.7.4. Completion of the Program

Finally, the backtranslate program requires a codon usage file (see above for options) and an output file name.

4. Notes

1. All reformatting, translation, and so forth, that creates new sequence files usually either ignores or truncates the nonsequence information of the sequence file.
2. To avoid confusion, it is suggested to follow the GCG convention for naming sequence files: the ending .seq if they contain nucleotides, and .pep if they contain amino acid symbols.
3. All methods presented here that rely on codon frequency tables usually work only if there is a biased use of the codons. Highly expressed genes meet this assumption, but for other genes the quality of prediction can be low. Figure 2 shows quite reasonable reading frames, with few or no rare codons, and the statistics of the usage support the assumption that these frames are valid.
4. Reading frames based on codon preference or rare codons are only a good estimate and should be proven by experiment. As shown in Figs. 1 and 2, the methods give hints on sequencing errors, but do not necessarily fail if codon usage biases are less prominent.

Reference

1. Devereux, J., Haeberli, P., and Smithies, O. (1984) A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res. 12, 387-395.

CHAPTER 12

GCG: Analysis of Protein Sequences

Reinhard Dölz

1. Introduction

Most of the programs presented in the earlier chapters, including those in Chapter 14, work on both peptide and nucleotide sequences. There are a few GCG programs, however, that require protein sequences. Method 1, pattern matching, describes how to identify features of unknown sequences based on homology to known proteins. Method 2 handles the analysis of peptides and protein fragments, which might be useful in protein purification, separation, and identification. Method 3 reveals electrostatic properties of proteins by determining a "titration" curve in the denatured state. Method 4 uses various measures to characterize secondary structure features. Method 5, finally, plots periodic structures arising from predictions.

2. Materials

The methods and the programs reported here are part of the GCG program package (1), Version 7.x, to be installed on either a VAX/ALPHA computer running the operating system VMS or one of the supported UNIX systems: Silicon Graphics (IRIX), Digital Equipment (ULTRIX), or Sun Computers (SUNOS). The programs can be obtained from GCG Inc., University Research Park, 575 Science Drive, Suite B, Madison, WI, 53711. A version for the CONVEX variant is also available from Convex Corp., Dallas. The computer system should be equipped with at least 16 Mbyte of memory, and should hold about 1 Gbyte of disk space for program, database, and scratch area.

From: Methods in Molecular Biology, Vol. 24: Computer Analysis of Sequence Data, Part I. Edited by: A. M. Griffin and H. G. Griffin. Copyright ©1994 Humana Press Inc., Totowa, NJ

All GCG programs generally require a terminal that is capable of emulating the VT100 terminal mode standard. Scroll-back facilities as provided on workstations or in PC terminal emulators are advantageous, but not essential. Text output of the methods presented can be reprocessed with word processing programs on personal computers. The data-transfer capabilities required for this are usually provided within the terminal emulation programs. If text output is to be printed, any ASCII code printer is sufficient. The capability to print 132 characters per line may be advantageous, but is not essential.

In order to use Methods 3-5, a high-quality graphics device is desirable. For terminal previewing, TEKTRONIX 4014 capabilities or REGIS emulation (as provided in the DEC 240, 340, or 440 terminal series, or in various PC terminal emulators) are recommended. The GCG graphics features include a variety of other modes; e.g., workstations use the X-Windows interface. The appropriate choice is site-dependent and is preconfigured by the software manager. See Section 3 for details. In particular, Method 4 can easily use color graphics in order to give a better overview. If possible, colors should be available at least for viewing on the screen.

Sequence data must be formatted in GCG sequence file format; see Chapters 2, 6, 8, and 11 for methods on how to create a sequence in this format.

3. Methods

3.1. Searching Sequence Patterns (Method 1)

The methods in Chapter 10 described how to use a pattern as a query to be searched in a sequence or group of sequences. However, like the program findpatterns and the transcription factor database, the programs described in this chapter are designed to use a sequence as the query to be searched for in a pattern database. Two different databases are available. One is the collection of known sequence profiles as supplied with the GCG programs. Profile (search) methods have been described in Chapter 8 (multiple-sequence alignment) and Chapter 9 (database searches). The other pattern library is Amos Bairoch's PROSITE dictionary, which is available on various file servers (see Chapter 5 for a list of file servers) or directly supplied with the GCG programs.

3.1.1. Use of the Motifs Program

3.1.1.1. START OF THE MOTIFS PROGRAM

The program is started by typing its name. Command line options can be supplied in the following manner: $ motifs/reference on VAX/VMS or % motifs -reference on UNIX. This will use the option reference to modify the program's action. Options available to this program are detailed in Section 3.1.1.4.

3.1.1.2. DEFINITION OF THE QUERY SEQUENCE

The motifs program will ask for a sequence, a group of sequences, or a sequence library specification. The name given must describe peptide sequences in GCG file format; e.g., my.pep, @peptide.fil, or protein:mc* are valid examples.

3.1.1.3. COMPLETION OF THE PROGRAM RUN

Motifs will ask for an output file and start the search. Unless many sequences have been specified, this search will be completed in a few seconds and the results are summarized on the screen.

3.1.1.4. OPTIONS TO THE MOTIFS PROGRAM

reference writes the complete PROSITE reference abstract below each pattern found. If several hits of the same pattern are found, the abstract is printed once only.
names writes a file of sequence names that can be used in other programs. This option is useful if the user's own patterns are to be scanned with a group of sequences.
mismatch = 1 permits one mismatch in the patterns reported as a positive hit. More mismatches may cause too much output.
data = filename uses filename as a library of patterns. The PROSITE database file, which can be copied into the user's own directory with the command fetch prosite.patterns, can be edited by hand if desired. The pattern description language is identical to that described in Chapter 10 (findpatterns).

3.1.2. Use of the Profilescan Program

Profiles can be created from multiple-sequence alignments as described in Chapter 8. The profilescan program permits the use of a file of sequence names that point to a collection of patterns as a profile database.

3.1.2.1. START OF THE PROFILESCAN PROGRAM

The profilescan program is started by typing its name on the command line.

3.1.2.2. DEFINITION OF THE QUERY SEQUENCE

The program asks for a protein sequence, its beginning, and its end. If both questions for the sequence coordinates are answered with Return, the entire sequence is used.

3.1.2.3. DEFINITION OF THE PROFILE DATABASE

The GCG programs are supplied with a default file of sequence names, which is called profilescan.fil. This file can be copied into your directory with the command fetch profilescan.fil (on both VAX/VMS and UNIX), and it can be edited manually to suit individual needs. Without this action, the file can still be used; the profilescan program offers it as the default at the question for the profile library.

3.1.2.4. DEFINITION OF THE OUTPUT FILES

Profilescan writes a summary file of the search results, and a detailed output containing aligned profiles if positive hits are encountered. Both file names are asked for, and the Return key accepts the suggested default value.

3.2. Characterization of Peptide and Protein Fragments (Method 2)

The program peptidemap creates a map of a peptide sequence that shows every position where a proteolytic enzyme would cut. The program peptidesort lists the fragments obtained from proteolytic cleavage, and calculates the composition, charge, HPLC retention, and various other characteristics.

3.2.1. Creation of a Peptide Map

3.2.1.1. START OF THE PEPTIDEMAP PROGRAM

The program is started by typing its name on the command line.

3.2.1.2. DEFINITION OF THE SEQUENCE TO BE ANALYZED

The program will ask for a peptide sequence, its beginning, and its end. The Return key accepts the suggested default values for beginning and end in order to use the entire sequence.

3.2.1.3. SELECTION OF THE ENZYMES

The suggested input is to accept the default (*) for all enzymes. A question mark will present the available options, which can be typed in individually if needed. Note that some of the proteolytic enzymes show up more than once in the list because they are known to cut at several sites (e.g., trypsin).

3.2.1.4. DEFINITION OF THE OUTPUT FILE

The peptidemap program will suggest writing an output file whose name is composed of the sequence name and the file extension map. This can be accepted with the Return key, but keep in mind that confusion might occur if both peptide and nucleotide sequences are analyzed with mapping programs and the same file names are used.
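What a peptide map records, the positions at which a proteolytic enzyme would cut, can be sketched in a few lines of Python. This is an illustration, not the peptidemap program: the two cleavage rules below are simplified textbook rules chosen as assumptions (they are not read from the GCG enzyme data file), and the names RULES and cut_positions are hypothetical.

```python
import re

# Simplified cleavage rules (assumptions for illustration). Each pattern
# matches the residue after which the enzyme cuts.
RULES = {
    "Trypsin": re.compile(r"[KR](?!P)"),        # after K or R, not before P
    "Chymotrypsin": re.compile(r"[FWY](?!P)"),  # after F, W, or Y, not before P
}


def cut_positions(peptide, enzyme):
    """Return 1-based positions of the residues after which enzyme cuts."""
    pattern = RULES[enzyme]
    # A "cut" after the last residue would not produce a new fragment
    return [
        m.end() for m in pattern.finditer(peptide.upper())
        if m.end() < len(peptide)
    ]


peptide = "MAKRPGKLFWA"
for enzyme in RULES:
    print(enzyme, cut_positions(peptide, enzyme))
```

In the example, the arginine at position 4 is not reported for trypsin because it is followed by proline.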

3.2.2. Creation of a Sorted List of Fragments

The program peptidesort cleaves a protein sequence with the given proteolytic enzyme, and displays various characteristics of the whole sequence and the fragments generated. The output is sorted by the position of the fragments with respect to the sequence coordinate, by molecular weight, and by retention on an HPLC reversed-phase chromatogram. In analogy to the nucleotide sequence mapping programs, the specification of multiple enzymes results in consecutive calculations for each individual enzyme.

3.2.2.1. START OF THE PEPTIDESORT PROGRAM

The program is started by typing its name. If needed, the order of the composition display can be manipulated by specifying the one-letter amino acid codes as an option. On VAX/VMS, the command reads $ peptidesort/elution=DNEQSGHRTAPYVMCILFKW whereas on UNIX, it reads % peptidesort -elution=DNEQSGHRTAPYVMCILFKW.
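The kind of listing produced by peptidesort can be approximated with the following Python sketch. It is a simplification under stated assumptions: a single trypsin-like rule (cut after K or R unless followed by P) stands in for the enzyme data, fragments are sorted by position and by length rather than by molecular weight or HPLC retention, and the function names digest and composition are hypothetical. The display order of the composition reuses the letter string from the elution example above.

```python
import re
from collections import Counter


def digest(peptide, pattern=r"[KR](?!P)"):
    """Cut peptide after every match of pattern.

    Returns a list of (start, end, fragment) with 1-based coordinates.
    """
    peptide = peptide.upper()
    cuts = [m.end() for m in re.finditer(pattern, peptide)]
    bounds = [0] + [c for c in cuts if c < len(peptide)] + [len(peptide)]
    return [(a + 1, b, peptide[a:b]) for a, b in zip(bounds, bounds[1:])]


def composition(fragment, order="DNEQSGHRTAPYVMCILFKW"):
    """Count residues and report them in the requested display order."""
    counts = Counter(fragment)
    return [(aa, counts[aa]) for aa in order if counts[aa]]


fragments = digest("MAKRPGKLFWA")
print("by position:", fragments)
print("by length:  ", sorted(fragments, key=lambda f: len(f[2]), reverse=True))
for start, end, frag in fragments:
    print(start, end, frag, composition(frag))
```

Passing a pattern that never matches leaves the sequence intact, so the composition of the whole protein is reported as a single fragment.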

3.2.2.2. SPECIFICATION OF THE SEQUENCE TO BE ANALYZED

The program will ask for a peptide sequence, its beginning, and its end. The Return key accepts the suggested default values for beginning and end in order to use the entire sequence.

3.2.2.3. SELECTION OF THE ENZYMES

The program offers the use of all enzymes (*) in the analysis. A question mark as an answer to the question for the enzyme will display the list of enzymes available. Otherwise, the enzymes have to be typed in individually.

3.2.2.4. DEFINITION OF THE OUTPUT FILE

The peptidesort program will suggest writing an output file whose name is composed of the sequence name and the extension pepsort. This file name can either be changed or accepted with Return only.

3.2.3. Determination of the Composition of a Whole Protein

Using the peptidesort program as described in Section 3.2.2. permits the use of the "enzyme" nocut, which leaves the sequence intact and reports only on the composition of the entire protein.

3.3. Determination of a "Titration Curve" (Method 3)

The program isoelectric plots the charge of the denatured protein as a function of the pH. As an approximation, it is assumed that no electrostatic interactions occur that might perturb ionization. If requested, the output can also be written to a file instead of a plot.

3.3.1. Definition of the Graphics Configuration

Run the program setplot if you want to use graphics with the program isoelectric. The setplot program offers a variety of choices in order to define the graphics configuration. Once set, it needs to be run again only if the graphics configuration needs to be changed.

3.3.2. Start of the Program Isoelectric

The program isoelectric is started by typing its name. Options on the command line can be specified in order to modify the program's operation: $ isoelectric/noplot on VAX/VMS or % isoelectric -noplot on UNIX.

This will apply the option noplot. The valid options are outlined in Section 3.3.4.

3.3.3. Definition of the Sequence to Be Analyzed

The program will ask for a sequence, which must be a peptide sequence in GCG format. Then, the beginning and end of the sequence are asked for. If both of these questions are answered with Return only, the whole sequence is analyzed. If graphics are used, the plot will be shown on the graphics device.

3.3.4. Command Line Options for the Isoelectric Program

outfile = filename will use filename as output file for tabulating data in a file.
noplot will suppress the plot.
aminotermini = n will consider n free amino termini. This option is useful for cyclic peptides or pGlu peptides.
carboxyltermini = n will consider n free carboxyl termini.
phdelta = 0.5 will increment pH every 0.5 U when data are tabulated.
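The calculation behind such a "titration curve" can be sketched in Python with the Henderson-Hasselbalch relation, treating every ionizable group independently, in line with the approximation stated above. This is not the isoelectric program: the pKa values are approximate textbook-style numbers chosen as assumptions, and the function names net_charge and titration are hypothetical. The keyword arguments mirror the aminotermini, carboxyltermini, and phdelta options.

```python
# Approximate pKa values (assumptions, not the values used by GCG)
POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0}             # charged when protonated
NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}    # charged when deprotonated
PK_AMINO, PK_CARBOXYL = 8.0, 3.1                        # free termini


def _protonated(ph, pka):
    """Fraction of a basic group carrying a positive charge at this pH."""
    return 1.0 / (1.0 + 10 ** (ph - pka))


def _deprotonated(ph, pka):
    """Fraction of an acidic group carrying a negative charge at this pH."""
    return 1.0 / (1.0 + 10 ** (pka - ph))


def net_charge(peptide, ph, aminotermini=1, carboxyltermini=1):
    """Net charge of the denatured peptide, all groups independent."""
    charge = aminotermini * _protonated(ph, PK_AMINO)
    charge -= carboxyltermini * _deprotonated(ph, PK_CARBOXYL)
    for aa in peptide.upper():
        if aa in POSITIVE:
            charge += _protonated(ph, POSITIVE[aa])
        elif aa in NEGATIVE:
            charge -= _deprotonated(ph, NEGATIVE[aa])
    return charge


def titration(peptide, phdelta=0.5, **termini):
    """Tabulate (pH, charge) from pH 0 to 14 in steps of phdelta."""
    steps = int(round(14 / phdelta))
    return [
        (i * phdelta, net_charge(peptide, i * phdelta, **termini))
        for i in range(steps + 1)
    ]


for ph, charge in titration("ACDKH", phdelta=2.0):
    print(f"{ph:5.1f} {charge:7.3f}")
```

The pH at which the tabulated charge changes sign gives an estimate of the isoelectric point. For a cyclic peptide, both termini counts would be set to 0.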

3.4. Secondary Structure Prediction (Method 4)

The GCG program suite offers two programs that calculate protein secondary structure and another that plots measures of properties predicted from tables. The program pepplot does a Chou-Fasman prediction and calculates the hydrophobicity. All curves are plotted in the same plot, but individual curves can be selected if only one feature is desired in the final output. The program peptidestructure makes predictions of the following features of an amino acid sequence:

Secondary structure according to Chou and Fasman.
Secondary structure according to Robson-Garnier.
Hydrophilicity according to either Kyte-Doolittle or Hopp-Woods.
Surface probability according to Emini.
Flexibility according to Karplus and Schulz.
Glycosylation following the patterns NXT or NXS.
Antigenic index according to Jameson-Wolf.

All these data are printed into a text file that can be plotted with the program plotstructure.
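Of the features listed, the glycosylation pattern is the simplest to reproduce. The following Python sketch locates every NXT or NXS tripeptide exactly as the pattern is stated above, with X standing for any residue. It is an illustration only, and the function name glycosylation_sites is hypothetical.

```python
import re


def glycosylation_sites(peptide):
    """Return 1-based positions of N in every N-X-T or N-X-S tripeptide."""
    # The lookahead lets overlapping sites (e.g., NNSS) all be reported
    return [
        m.start() + 1
        for m in re.finditer(r"(?=N[A-Z][ST])", peptide.upper())
    ]


print(glycosylation_sites("MNGTANNSSK"))  # [2, 6, 7]
```

A match only marks a potential site; whether it is used in the mature protein cannot be decided from the sequence pattern alone.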

3.4.1. Chou-Fasman Prediction

The program pepplot shows several common measures of protein secondary structure together on one coordinated plot. The curves mostly represent some residue-specific attribute evaluated within a given window. The ticks at the top of the graph classify residues into categories such as hydrophilic charged and uncharged hydrophobic residues. A typical pepplot of the human calmodulin sequence is shown in Fig. 1. The largest panel in the top part of the plot shows the α and β propensities for the sequence. If only monochrome output devices are available, the α curve is dashed, and the β curve is solid. Where a curve rises above the threshold indicated as a horizontal line, the criteria for a prediction of the particular state are met, and it should be checked that there are no "breaking" residues within the prediction. According to Chou and Fasman's methodology, the prediction has more than the two states (i.e., α or β) indicated in this plot, and it should be kept in mind that "strong" predictions are easier to justify than very short ranges of the sequence where the prediction barely exceeds the threshold. The α and β curves are averages over a window of four (which cannot be changed). For a comparison with the Robson-Garnier prediction, see Section 3.4.2.

The plot in Fig. 1 shows the prediction achieved for calmodulin. The arrows have been added manually and show the calcium-binding sites, which are known to be flanked by α helices (EF-hand type of structure). Note that some of the predictions come out very clearly, whereas other parts of the sequence have no good preference for α, e.g., the helical part after the first binding site (circled).

3.4.1.1. START OF THE PEPPLOT PROGRAM

The program is started by typing its name on the command line. Options to modify the program's action can be given as follows: $ pepplot/font=3 on VAX/VMS or % pepplot -font=3 on UNIX. This will use font=3 as an option.

3.4.1.2. SPECIFICATION OF THE SEQUENCE

The program will ask for a peptide sequence to be analyzed, and for its beginning and end. If both questions for the sequence coordinates are answered with Return, the full sequence is used.
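Most curves in a pepplot are residue-specific values averaged within a window. A minimal Python sketch of such a window average is given below for hydropathy, using the Kyte-Doolittle scale. It illustrates the principle only and is not the pepplot code; the function name window_average is hypothetical, and the default window of 9 is an assumption chosen for the example.

```python
# Kyte-Doolittle hydropathy values
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_average(peptide, window=9, scale=KYTE_DOOLITTLE):
    """Average a per-residue scale over every full window along the peptide.

    Element i of the result covers residues i+1 .. i+window (1-based).
    Sequences shorter than the window give an empty list.
    """
    values = [scale[aa] for aa in peptide.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


profile = window_average("MADQLTEEQIAEFKEAFSLF", window=9)
print([round(v, 2) for v in profile])
```

Because only full windows are averaged, no values are obtained for the first and last few residues, and features shorter than the window are smoothed out.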

[Figure 1: PEPPLOT of Protein:Mchu, ck: 3630, 1 to 148, February 20, 1992. Calmodulin - human, rabbit, bovine, rat, and chicken. Panels include residue classes (basic, acidic), Chou-Fasman alpha and beta propensities, NH2 and COOH end predictions, hydrophobic moment (alpha, beta), and hydropathy.]

Fig. 1. Pepplot output of the human calmodulin sequence. No special options were employed for the plot. The arrows and the circles were manually added to indicate calcium-binding sites and an undefined prediction at a range where α helices are found in the X-ray structure.

3.4.1.3. DEFINITION OF THE SCALE

The next question asked by the program is the scale to be used in the plot. Unless multiple sequences with different lengths are to be compared, the default can be accepted with Return, giving the largest scale possible.

3.4.1.4. DEFINITION OF THE OUTPUT PANELS

In many cases, it is sufficient to plot all panels available. Some of the output, however, might be too crowded for specific archiving or presentation purposes. Therefore, the program offers a list of panels, each labeled with a letter. The default is all letters, but individual selections can be composed if necessary.

3.4.1.5. OPTIONS AVAILABLE

The program offers a variety of options.

output options:
noplot suppresses the plot.
outfile = filename writes out the Chou-Fasman prediction to the output file.
moment-file = filename writes out the prediction of the hydrophobic moment according to Eisenberg et al.
noges suppresses the prediction for nonpolar transbilayer helices according to Goldman, Engelman, and Steitz.

calculation options:
window = 9 sets the window size for hydropathy averaging to 9.
ges window = 20 sets the window for the Goldman et al. prediction to 20.

graphics options:
boxes draws a box around each panel.
showseq will show the sequence in panel one (noshowseq will suppress it).
font = 3 uses the HELVETICA-like font; font = 6 is TIMES-like.
check lists all options available, including the options not listed here.

3.4.2. Multiple Measures of Protein Prediction

The program peptidestructure writes an output file that contains multiple predictions on a variety of measures. The data for Chou-Fasman are reported qualitatively as Helix, Turn, or Beta. The output file of peptidestructure can be printed or, alternatively, plotted with plotstructure.

3.4.2.1. START OF THE PROGRAM PEPTIDESTRUCTURE

The program is started by typing its name on the command line.

3.4.2.2. SPECIFICATION OF THE SEQUENCE

The program asks for a peptide sequence to be analyzed, for its beginning, and for its end. If the latter two questions are answered with Return, the full length of the sequence will be analyzed.

3.4.2.3. SUPPLEMENTAL PARAMETER SPECIFICATION

The program can calculate hydrophilicity either according to Kyte and Doolittle or according to Hopp and Woods. Return accepts the default. Next, an output file name is to be specified; the default is composed of the file name and the extension .p2s.

3.4.2.4. SPECIFICATION OF THE PLOTTING CONFIGURATION

If no graphics have been used yet during this session, it is necessary to specify once where the plotstructure program will send the graphics. The program setplot offers the choices available at your site. This program has to be run once during the session before starting the graphics output. Furthermore, each time you want to change the graphics configuration, the program setplot has to be run again.

3.4.2.5. START OF THE PROGRAM PLOTSTRUCTURE

To start the program plotstructure, it is sufficient to type the program name on the command line. There are a few options available that can modify the program's behavior if special circumstances require it. To query these options, use $ plotstructure/check on VAX/VMS or % plotstructure -check on UNIX.

3.4.2.6. DEFINITION OF THE INPUT

The program requires the output from the program peptidestructure in order to plot a one- or two-dimensional graph. To identify this input, the *.p2s file name needs to be typed in. On successful input, the program plotstructure permits the beginning and end to be changed; accepting the defaults is sufficient in most cases.

3.4.2.7. DEFINITION OF THE OUTPUT FORMAT

Plotstructure can plot in two ways. Either a one-dimensional panel plot is generated that is similar to pepplot (see Fig. 1) or, alternatively, a two-dimensional squiggly plot is generated that schematically plots the secondary structure as a symbolic pattern along the sequence. Waves indicate helices, sheets are represented as wide zigzag lines, and turns are represented as squiggly areas (see Fig. 2). The last question asked before plotting the two-dimensional plot is which measure of any prediction is to be superimposed on the plot. The menu offers letters from which to pick a choice.

3.5. Visualization of Periodic Structures (Method 5)

Helices in protein structures quite frequently exhibit a characteristic amino acid pattern, resulting in hydrophobic edges that are important for structural stabilization. Similarly, a periodicity in residues on β-sheet structures can also be recognized in stabilizing elements of protein crystals. The two programs helicalwheel and moment try to take advantage of this fact. Whereas the program moment is capable of detecting periodic structures of any angle between residues by calculating the hydrophobic moment according to Eisenberg et al., the helicalwheel program simulates the view along a helix. Neither of the two programs predicts structures; they just permit the user to look at the structures in a plotted and ordered periodic fashion.

3.5.1. The Helicalwheel Program

In order to start the program, it is necessary to have the graphics configuration defined first. Therefore, it is necessary to run the program setplot and pick one of the options. Next, the program can be started by typing its name on the command line. It asks for a sequence, for its beginning and for its end, and does the plot.

3.5.2. The Moment Program

Moment plots the height of the hydrophobic moment calculated for all possible angles of rotation for a window that is to be specified. The form of the plot is a set of isocontours (see Fig. 3). The more contours, the higher the peak, and both the position and the angle of rotation can be evaluated. Each residue in a typical α helix is offset 100° from the preceding residue. Typical β structures have 160° between adjacent residues.
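The quantity that is contoured can be written down compactly. For a window of residues with hydrophobicities H_n and an assumed rotation of δ degrees per residue, the hydrophobic moment of Eisenberg et al. is the length of the vector sum of the H_n, each pointing in the direction n·δ. The Python sketch below computes this value for one window and one angle. It is an illustration, not the moment program: the function name hydrophobic_moment is hypothetical, and the input is an idealized, artificial hydrophobicity pattern rather than a real sequence and scale.

```python
import math


def hydrophobic_moment(values, angle_deg):
    """Hydrophobic moment of one window for a given angle per residue.

    values is the list of per-residue hydrophobicities in the window.
    """
    delta = math.radians(angle_deg)
    s = sum(h * math.sin(n * delta) for n, h in enumerate(values))
    c = sum(h * math.cos(n * delta) for n, h in enumerate(values))
    return math.hypot(s, c)


# Idealized amphipathic helix: hydrophobicity varies with a 100 degree period
ideal_helix = [math.cos(math.radians(100 * n)) for n in range(18)]

for angle in (100, 160):
    print(angle, round(hydrophobic_moment(ideal_helix, angle), 3))
```

For this artificial pattern the moment is large at 100° and vanishes at 160°. Scanning all window positions and all angles yields the surface whose isocontours are plotted by moment.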

[Figure 2: PLOTSTRUCTURE of protein:mchu, ck: 3630. Calmodulin - human, rabbit, bovine, rat, and chicken. Overlay: Flexibility >= 1.040.]

Fig. 2. Plotstructure output of the human calmodulin sequence. The output option employed was numbering = 10, and the Garnier-Robson prediction is plotted, overlaid by flexibility. The arrows and the circle were manually added to indicate the known (EF-hand loop) structures and a wrong β structure prediction around position 30, where it is known from the X-ray structure that an α helix is found. Note that this intuitive representation immediately reveals the occurrence of four correctly predicted turns. This example is a fortunate case, however, and is not necessarily representative.

[Figure 3: MOMENT of Mchu, ck: 2160, 1 to 148, February 20, 1992. Contours at: 0.30, 0.35, 0.40, 0.45, 0.50. Window: 10. Horizontal axis: sequence position; vertical axis scaled from 0 to 160.]

Fig. 3. Output of the moment program. The arrows have been added manually to indicate the calcium-binding sites, which are known to be flanked by α helices. Note that at nearly all of the edges, some hydrophobic moment around 100° can be found. The resolution is only five residues, because of the window size of 10.

3.5.2.1. START OF THE MOMENT PROGRAM

The program is started by typing its name. Options can be checked with the following command: $ moment/check on VAX/VMS or % moment -check on UNIX.

3.5.2.2. SPECIFICATION OF THE SEQUENCE

The program asks for a protein sequence, and its start and end. The latter two questions can be answered with Return if the full sequence is to be analyzed.

3.5.2.3. SPECIFICATION OF THE PLOTTING PARAMETERS

Up to 10 contours can be plotted. For previewing, it is recommended to start with the two contours suggested as the default value, which can be accepted with Return. More work will be needed to obtain a plot useful for presentation. Sequence-dependent features like the moment cannot be calculated and suggested automatically. Therefore, the correct parameters have to be found by manual interaction. The plot in Fig. 3 was achieved with four contours and the default window size.

3.5.2.4. DEFINITION OF THE SCALE

The program will ask for a "density," which is a scale parameter. If various sequences of different lengths are to be compared, a defined scale is required; otherwise, the suggested default value will permit the user to obtain the largest plot possible.

4. Notes

1. The programs for secondary structure prediction are difficult to use if low-resolution graphics or PC terminal emulators are used because, for the sake of completeness, the characters are small and many data appear on the screen.
2. Color devices can be very advantageous for viewing the results of the secondary structure prediction.
3. The graphics generated are complicated and require a fast line (Ethernet at best) to the computer. Alternatively, UNIX workstations are ideal for this purpose because of their large screens and superior graphics.
4. Secondary structure prediction is a statistical method with a precision of 60-65%. This has to be kept in mind if results from these methods are to be used in publications or in planning experiments.
5. The elution position on HPLC columns is very much dependent on the stationary phase. Therefore, results predicted by the program peptidesort should be handled with care and are not suited for peak identification.
6. The programs of Method 5 implicitly assume that periodic structures exist. Even in the absence of helical elements, helicalwheel can be used and will not complain, but the results are meaningless.

Reference
1. Devereux, J., Haeberli, P., and Smithies, O. (1984) A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res. 12, 387-395.

CHAPTER 13

GCG: The Analysis of RNA Secondary Structure

Reinhard Dölz

1. Introduction
The prediction of RNA secondary structure differs significantly from that of protein secondary structure. The latter is based on crystallographic data, and a prediction is achieved by using a variety of empirical or semiempirical parameters. RNA secondary structure prediction, on the other hand, is based on an energy calculation. The energy of a folded nucleotide sequence can be calculated using tables of stacking energies and loop-destabilizing energies. The GCG program (1) foldrna is the program of M. Zuker (2), using the energies published by Turner (3). The algorithm and its usage are explained in Chapter 23 in Part II. This chapter shows the use of the foldrna program as implemented in the GCG package, including the broad range of output visualization provided.

The algorithm does not necessarily provide you with the “correct” structure, but rather with the structure of lowest calculated energy. Even if this energy is correct, the structure will almost certainly be one member of a family of structures, so that the “real” structure may be a second or third choice that is not output by the program. These choices may be numerous if large sequences are employed. The maximum fragment length foldrna can currently handle is 1200 bases. The calculation for such large data sets can use a significant amount of computing resources.

The second important program of the GCG package with respect to RNA structure analysis is the stemloop program, which finds inverted repeats in the sequence (Method 3). This is not to be confused with other pattern recognition programs as described in Chapter 13.

From: Methods in Molecular Biology, Vol. 24: Computer Analysis of Sequence Data, Part I. Edited by: A. M. Griffin and H. G. Griffin. Copyright ©1994 Humana Press Inc., Totowa, NJ

2. Materials
The methods and the programs reported here are part of the GCG program package (1), Version 7.x, to be installed on either a VAX/ALPHA computer running the operating system VMS or one of the supported UNIX systems: Silicon Graphics (IRIX), Digital Equipment (ULTRIX), or Sun Computers (SUNOS). The programs can be obtained from GCG Inc., University Research Park, 575 Science Drive, Suite B, Madison, WI, 53711. A version for the CONVEX variant is also available from Convex Corp., Dallas. The computer system should be equipped with at least 16 Mbyte of memory, and should hold about 1 Gbyte of disk space for program, data base, and scratch area.

All GCG programs generally require a terminal that is capable of emulating the VT 100 terminal mode standard. Scroll-back facilities as provided on workstations or in PC terminal emulators are advantageous, but not essential.

The visualization of foldrna output requires high-quality graphics devices. The faster the connection to the host, the less waiting is necessary for a display to be completed. For terminal connections, a speed of at least 9600 baud is desirable. For terminal previewing, TEKTRONIX 4014 capabilities or REGIS emulation (as in DEC 240, 340, or 440 terminal series, or in various PC terminal emulators) is recommended. GCG graphics features include a variety of other modes, e.g., workstations use the X-Windows interface. The appropriate choice is site-dependent and is preconfigured by the software manager; see Section 3. for details. Graphic hard copies may use a variety of output formats, including the HPGL and POSTSCRIPT standards. This feature is also site-dependent and preconfigured by the software manager.

To analyze a sequence, the sequence data must be formatted in GCG sequence file format; see Chapters 2, 6, 8, and 11 for methods of creating a sequence in this format. The general procedure for performing RNA secondary structure prediction is to run the program foldrna (Method 1) and evaluate the result graphically (Method 2).
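The principle of choosing the best of many possible foldings by dynamic programming can be illustrated with a much simpler scoring scheme than the one foldrna uses. The following Python sketch is not the Zuker algorithm and uses no energy tables: it is the classic base-pair maximization recursion (usually attributed to Nussinov), which merely counts pairs. The function name, the minimum hairpin loop of three bases, and the decision to allow G-U pairs are assumptions made for illustration.

```python
# Pairs counted by this simplified scheme (Watson-Crick plus G-U wobble; an assumption).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Largest number of nested base pairs, with at least min_loop unpaired
    bases enclosed by every hairpin."""
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] holds the optimum for the subsequence seq[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]
```

An energy-based method replaces the "+ 1" per pair with stacking and loop energies and minimizes the total, which is considerably more involved; the sketch only shows why the work grows quickly with sequence length, since every subsequence has to be scored.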


3. Methods
3.1. Calculation of the Structure (Method 1)
3.1.1. Start of the Program foldrna
The program foldrna is started by typing its name. It was previously called “fold,” but because fold is already a command in the UNIX world, it was renamed.

3.1.2. Specification of the I/O Files
The program will ask for the name of the sequence to be folded, and for two output file names, which consist of the file name of the sequence and the extensions .fold and .connect. The .fold file is an ASCII representation of the result and can be viewed on the screen or printed on a hard copy device. The .connect file is the input file of the visualization programs.

3.1.3. Definition of the Sequence Range
The program will run on the entire sequence if the suggested default values for its beginning and end are accepted. These can be changed for folding smaller fragments, but for this purpose, an alternative approach (see Section 3.1.4.) is better suited. After the last question has been answered, the program will start the run.

3.1.4. Folding Fragments
Foldrna calculates energies that can be stored for further reference. On VAX/VMS, the command $ foldrna/save=filename will keep the energies in a file, whereas $ foldrna/continue=filename will reuse them without repeating the entire calculation. On the UNIX operating system, the two commands to be specified are % foldrna -save=filename and % foldrna -continue=filename.

3.1.5. Manipulation of the Energies
The foldrna program can be forced to pair or unpair various bases. In order to do this, options are to be specified on the command line. The detailed use of these options is described within the output of the $ foldrna/check (on VAX/VMS) or % foldrna -check (on UNIX) command.

3.2. Plotting the Results of the Foldrna Program (Method 2)
The programs squiggles, circles, domes, mountains, and dotplot can visualize the result of the energy calculation. All these programs are started on the command line by typing in the program name. The programs prompt for the input they require, which is essentially the .connect file calculated by the foldrna program and a few specifications of graphics features. The options available on the command line can be queried as usual with $ program/check (on VAX/VMS) and % program -check (on UNIX). Figures 1-5 show the output of all these programs on the same sequence.

3.3. Finding Inverted Repeats (Method 3)
The program stemloop is designed to detect inverted repeats in DNA sequences. The output of stemloop can be visualized with the dotplot program (see previous section, and Chapter 7).

3.3.1. Start of the Program Stemloop
The program stemloop is started by typing its name. Additional options can be added on the command line, such as $ stemloop/check (on VAX/VMS) or % stemloop -check (on UNIX).

3.3.2. Specification of the Sequence
The stemloop program will ask for a nucleotide sequence in GCG format, its beginning, and its end. Both of the last questions can be answered with <RETURN> in order to use the entire sequence.

3.3.3. Definition of Repeat Quality
The following questions of the program are used to define the characteristics of the repeats to be found, such as the length, the number of precise matches to be found within this repeat, and the minimum and maximum number of nucleotides permitted in loops. Although reasonable default choices are suggested for all parameters, considerable variation might be desirable to optimize results.

3.3.4. Definition of Output of the Stemloop Program
The program permits the user to change the parameters, to sort the output by various criteria, and to write it either as a text file or as an input file for dotplot.
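The way the repeat-quality parameters interact can be sketched in a few lines of Python. This is not the stemloop implementation: the function name, the parameter names, the returned tuples, and the restriction to A-T and G-C pairs are assumptions made for illustration. The sketch scans every start position and every permitted loop size, compares the left stem with the reversed right stem, and counts complementary symbols only, without any energies.

```python
# Complementary symbols counted as matches (A-T and G-C only; an assumption).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def find_inverted_repeats(seq, stem_len=6, min_matches=6, min_loop=3, max_loop=20):
    """Return (start, loop_length, matches) for every candidate stem-loop.

    start is the 0-based position of the left stem.
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq)):
        for loop in range(min_loop, max_loop + 1):
            right_start = start + stem_len + loop
            right_end = right_start + stem_len
            if right_end > len(seq):
                break  # longer loops would run past the end of the sequence
            left = seq[start:start + stem_len]
            right = seq[right_start:right_end]
            # The stems pair antiparallel, so read the right stem backward.
            matches = sum(
                1 for a, b in zip(left, reversed(right)) if COMPLEMENT.get(a) == b
            )
            if matches >= min_matches:
                hits.append((start, loop, matches))
    return hits
```

Lowering min_matches below stem_len admits imperfect stems, and widening the loop range increases the number of candidates, which is why some experimentation with the defaults is usually needed.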

[Figure: SQUIGGLES of: j02061.connect, March 20, 1992 22:15. FOLDRNA of: j02061.vrl, Check: 3205, from: 1 to: 334, March 20, 1992 22:12. Length: 334. Energy: -94.0]

Fig. 1. Output of the squiggles program.

Fig. 2. Output of the circles program.

4. Notes
1. The foldrna program as implemented in the GCG package outputs only the best structure and does not allow the user to view the second or lower choices.
2. Like all predictions, the result visualized by the various graphics programs looks very suggestive, but does not have to be correct.
3. The output of the squiggles program might be the most suggestive, but tends to hide more interesting details. Therefore, the use of all of the visualization programs, and of only the most detailed view for archiving, is recommended.
4. Foldrna is very sensitive to changes of the parameters with respect to forced pairing. It is suggested that, before these options are used, a known example be employed to obtain a feeling for the effects.
5. The stemloop program does not use any energies for finding repeats. Instead, the algorithm is very similar to that used by the compare program (see Chapter 10) and is based on symbol matches only.

Note Added in Proof
Starting with version 7.2 of the GCG package, a program for computing both optimal and suboptimal structures is included in the distribution (mfold).

[Figure: DOMES of: j02061.connect, March 20, 1992 22:16. FOLDRNA of: j02061.vrl, Check: 3205, from: 1 to: 334, March 20, 1992 22:12. Energy: -94.0]

Fig. 3. Output of the domes program.

[Figure: MOUNTAINS of: j02061.connect, 1 to: 334, March 20, 1992 23:22. FOLDRNA of: j02061.vrl, Check: 3205, from: 1 to: 334, March 20, 1992 22:12. Energy: -94.0. Bases/100pp: 256.9. Maximum Stem Depth: 65]

Fig. 4. Output of the mountains program.

[Figure: dot plot of j02061.vrl, ck: 3,205, 1 to 334; axis ticks at 0, 100, 200, and 300.]

Fig. 5. Output of the dotplot program.

References
1. Devereux, J., Haeberli, P., and Smithies, O. (1984) A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res. 12, 387-395.
2. Zuker, M. and Stiegler, P. (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 9, 133-148.
3. Freier, S. M., Kierzek, R., Jaeger, J. A., Sugimoto, M., Caruthers, M. H., Neilson, T., and Turner, D. H. (1986) Improved free-energy parameters for predictions of RNA duplex stability. Proc. Natl. Acad. Sci. USA 83, 9373-9377.

CHAPTER 14

GCG: Preparing Sequence Data for Publication

Reinhard Dölz

1. Introduction
With the advent of personal computers and graphical user interfaces, the justification for publishing programs on terminal- or text-driven devices is to be questioned. However, biological sequence data frequently are not just “text” that can be tailored with a word processor program. This chapter, therefore, deals with text formatting of single sequences (Method 1) and multiple sequences (Method 2). The output of these two procedures can still be used in another program to produce slides; however, it is also suited for direct publication.

Within the GCG program package, there are furthermore two programs that are excluded here. One is red, a VAX/VMS-derived RUNOFF-like formatter. This program is not available in the UNIX version. The other program is figure. Any GCG program that produces graphics can be used with the command line option figure to produce a metafile. This file contains all the information needed to produce graphics and thus can be manipulated to fine-tune the final layout. Except for some special cases (e.g., painting plasmid maps as described in Chapter 4), figure is not really a state-of-the-art tool for biologists to create graphics from scratch.

2. Materials
The methods and the programs reported here are part of the GCG program package (1), Version 7.x, to be installed on either a VAX/ALPHA computer running the operating system VMS or one of the supported UNIX systems: Silicon Graphics (IRIX), Digital Equipment (ULTRIX), or Sun Computers (SUNOS). The programs can be obtained from GCG Inc., University Research Park, 575 Science Drive, Suite B, Madison, WI, 53711. A version for the CONVEX variant is also available from Convex Corp., Dallas. The computer system should be equipped with at least 16 Mbyte of memory, and should hold about 1 Gbyte of disk space for program, data base, and scratch area.

In order to use the programs mentioned in this chapter, the sequence data must be formatted in GCG sequence file format. It is advantageous for Method 1 if the reading frame of a DNA sequence is known.

The text output of the methods presented here can be reprocessed in word processing programs. Care has to be taken to use only nonproportional (fixed-width) fonts in order to keep the characters aligned. Prominent fonts that meet this requirement are “elite,” “courier,” “text,” and so forth. Both methods support landscape-style output, which can be previewed only on terminals that permit 132 characters per line, or on workstation screens.

The file transfer capabilities needed to copy the text files to personal computers are not included within the GCG program package and must be installed independently. The two methods discussed in this chapter produce ASCII (text) output exclusively; therefore, transfer of data is trivial. The setplot program of GCG usually includes the option to print the graphics in EPSF format. This Encapsulated PostScript File can also be transferred easily as text and be included in many popular programs.

3. Methods
3.1. Rearrangement of Sequences for Publication (Method 1)
The GCG program publish offers a variety of line types that serve as line templates. These templates are filled afterward with the actual sequence data. Table 1 shows the options available. The idea of the program is to select a few line templates so that the sequence is formatted automatically.
No more than 20 of these templates may be selected. See Notes 1 and 2.

Table 1
Format Options as Available in the Publish Program

Option  Layout                Meaning
a               10            Number line
b                             Dot scale line
c       GAATTCACGATCGATCGTAG  The sequence itself
d       ---------+---------+  Dash scale line
e       CTTAAGTGCTAGCTAGCATC  The complement
f       GluPheThrIleAspArg    Translation
g        E  F  T  I  D  R     Translation
h       ###                   Tagged blank line
i                             Blank line
j               C     G       2nd sequence (diff)
k       GAATTCACCATCGAGCGTAG  2nd sequence (all)
l       |||||||| ||||| |||||  Match line

3.1.1. Start of the Publish Program
The program publish is started by typing its name on the command line. Options can be appended as follows: $ publish/rna on VAX/VMS or % publish -rna on UNIX. Options are detailed below.

3.1.2. Specification of the Format
The options outlined in Table 1 are offered. If the sequence is a peptide sequence, some of the options become obsolete. The program expects the options to be entered as a row of lower case characters, followed by <RETURN>. If upper case characters are used, the ends of the corresponding lines will be filled with the current sequence number.

3.1.3. Additional Specifications for the Format
Depending on the options chosen for formatting, the program will ask for numbers of reading frames, start of translation, and so forth, as needed. Finally, the questions “how many symbols per block” and “how many blanks on each line” determine the width of the output. If 100 symbols are put in each block, only one block should be printed per line.

3.1.4. Options
rna sets U as the match for A and as the complement of A. This makes complementary sequences look like RNA instead of DNA (output formatting option e).

agreement = . uses a period to show agreement with the second sequence (output formatting option j).
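The line-template idea can be illustrated with a short Python sketch. It does not reproduce publish: the function names, the choice of only two templates (the sequence itself and the complement), and the way the end-of-line number is appended to the sequence line are assumptions made for illustration. The sketch cuts the sequence into lines, splits each line into blocks separated by one blank, and writes one output line per template.

```python
# Base-by-base complement for DNA, as in the "complement" line of Table 1.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def in_blocks(symbols, block_size):
    """Insert one blank after every block_size symbols."""
    return " ".join(
        symbols[i:i + block_size] for i in range(0, len(symbols), block_size)
    )


def format_sequence(seq, block_size=10, blocks_per_line=5):
    """Format a sequence as pairs of lines: sequence (numbered) and complement."""
    line_len = block_size * blocks_per_line
    out = []
    for start in range(0, len(seq), line_len):
        chunk = seq[start:start + line_len]
        end_number = start + len(chunk)  # number of the last symbol on this line
        out.append(f"{in_blocks(chunk, block_size)}  {end_number}")
        out.append(in_blocks(chunk.translate(COMPLEMENT), block_size))
        out.append("")  # blank line between groups of template lines
    return "\n".join(out)
```

Because the alignment of the two lines depends on every character having the same width, output of this kind has to be printed in a fixed-width font.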

3.2. Printing of Multiple-Sequence Alignment with Pretty (Method 2)
The program pretty uses sequences prepared by programs described in Chapter 8. In contrast to publish, which uses only one sequence at a time, pretty can utilize up to 500 sequences, with up to 2000 sequence symbols. See Notes 3 and 4.

3.2.1. Start of the Pretty Program
The program is started by typing its name on the command line. Options can be supplied as follows: $ pretty/case on VAX/VMS or % pretty -case on UNIX. This applies the option case to the program. The options are outlined below.

3.2.2. Specification of the Input Sequences
Pretty can take either a file of file names or multiple-sequence format files (see Chapter 8 for details on these formats). Briefly, the term @my.fil will address the file of file names my.fil, and my.msf{*} will take all sequences of the multiple-sequence file my.msf. The program will display recognized sequences on the screen, and finally ask for a beginning and an end.

3.2.3. Definition of Further Parameters
Depending on the options chosen, additional parameters will be asked for. The last question is for the file name of the output.

3.2.4. Options
linesize = 500 specifies the number of sequence symbols per line.

blocksize = 10 specifies the number of sequence symbols in each block.

consensus calculates a consensus sequence, based on the option plurality = 2.0, which will require that at least two sequences (in this example) show the consensus sequence symbol. A symbol comparison table is used, which assigns 1.5 to the perfect match (see Chapter 10). In order to avoid confusion in the consensus determination, the option threshold = 1.0 is recommended for proteins.

identity = “*” causes the program to show the consensus as asterisks at positions where all symbols agree with the consensus.

differences = “-” will make pretty print a character “-” instead of the actual sequence symbols in case of disagreement with the consensus.

case will print all consensus-matching symbols in upper case and the rest in lower case.
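The effect of the consensus, differences, and case options can be sketched in Python as follows. This is a simplification rather than the pretty algorithm: it counts identical symbols per column instead of scoring them with a symbol comparison table, and the function names and the use of “-” for columns without a consensus are assumptions made for illustration.

```python
from collections import Counter


def consensus(aligned, plurality=2):
    """Most frequent symbol in each column, or '-' where fewer than
    `plurality` sequences share it (plain counts, no comparison table)."""
    result = []
    for column in zip(*aligned):
        symbol, count = Counter(column).most_common(1)[0]
        result.append(symbol if count >= plurality else "-")
    return "".join(result)


def show_case(aligned, cons):
    """Upper case where a symbol matches the consensus, lower case elsewhere."""
    return [
        "".join(s.upper() if s.upper() == c.upper() else s.lower()
                for s, c in zip(seq, cons))
        for seq in aligned
    ]


def show_differences(aligned, cons, mark="-"):
    """Replace symbols that disagree with the consensus by `mark`."""
    return [
        "".join(s if s.upper() == c.upper() else mark for s, c in zip(seq, cons))
        for seq in aligned
    ]
```

For the three hypothetical aligned sequences GATTACA, GACTACA, and GATTGCA, a plurality of 2 gives the consensus GATTACA, whereas a plurality of 3 leaves the third and fifth columns without a consensus symbol.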

4. Notes
1. The publish program will not translate interrupted coding sequences correctly if there is a split codon at the boundary. It is necessary to edit the output file manually to correct this problem.
2. Publish lets you define negative sequence numbers. This is the only program that allows sequences to be numbered with nonpositive numbers. If the base zero (0) is not desired in the numbering, the option skipzero will perform this task. Mathematically, this option is doubtful.
3. Although the variety of formatting options is impressive, no more than 150 characters per line can be printed with the publish program.
4. The program pretty might be confused if the option consensus is chosen, but the input file already contains a consensus determined in a lineup session (see Chapter 8). It is possible to make lineup write an output file without a consensus by starting the lineup program with the option noconsensus.

Reference
1. Devereux, J., Haeberli, P., and Smithies, O. (1984) A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res. 12, 387-395.

MicroGenie: Introduction and Restriction Enzyme Analysis

1. History of MicroGenie

The MicroGenie program was originally developed as a result of a collaboration between Cary Queen of N.C.I. Bethesda and Laurence Korn of Stanford University. Their first program was written for a mainframe computer (1). This program was later adapted to run on the then new IBM PC (2). A company called SciSoft Inc. was established to develop the program, and it was marketed worldwide by Beckman Instruments Inc. The program had an interface that was outstandingly user-friendly for the time, with full context-sensitive help. A system of data compression, which coded three DNA bases into a single eight-bit byte, enabled data sequences of up to a million nucleotides to be stored and searched on a 360-Kbyte floppy disk. A password system was provided to enable individual workers to store their databases of sequences in separate directories on a hard disk and to access portions of the data bank selectively. In response to the growth of the GenBank DNA database, the data bank sequences were later released on CD-ROM disks; the extra capacity allowed annotation information as well as sequence to be included. In 1991, Beckman Instruments ceased to distribute MicroGenie, and an agreement was made with IntelliGenetics Inc. to continue the distribution of the GenBank and PIR databases in MicroGenie format. Further development of the MicroGenie programs stopped, although Beckman Instruments continued to provide support for the program software.

2. Introduction
2.1. Use of MicroGenie
MicroGenie was one of the easiest to use and most user-friendly PC programs when it was introduced, and still compares favorably with many others today. It is controlled by a series of menus and prompts. Context-sensitive help is available at each stage, and default values, which are normally appropriate for an initial analysis, are used automatically where possible. Menu choices are selected by pressing a single key, and the program is designed to guide the user through an analysis and to offer appropriate help in response to input errors (2).

2.2. Interface
MicroGenie is controlled from initial multichoice menus, followed by prompts, for the selection of subsequent options and parameters. The selection is made by pressing a key corresponding to the highlighted initial letter of the menu choice, or option, offered. This immediately makes the selection without the need to press the return (or enter) key; pressing return by force of habit after using other programs may therefore initially give confusing results.

When MicroGenie is run on a computer with a hard disk, users are initially asked to enter a password. This is to keep their sequences separate from those of other users. There can be up to 20 passwords. If the password is recognized as an existing one, the user is given access to the sequences in the appropriate directory. The passwords have six characters, and access to read or copy sequences stored under another password can be obtained if the first three letters are known. If the password is not recognized, the user is invited to reenter it or register it as a new password. There is a special password “FLOPPY” for use with a floppy disk.
If this password is used, sequences will be read from and stored on the floppy disk; this is useful for occasional users who do not have their own directory on the hard disk and for transferring sequences to a diskette to send to collaborators. If the computer has more than one floppy disk drive, the one that will be used can be specified by running the Setup section from the MICROGENIE menu.


After entry of the password, the main (MICROGENIE) menu appears, allowing the user to choose between the major sections of the program. A label in the top right corner of the screen gives the name of the menu that is in use to prevent the user from getting lost in the menu structure. One of the options offered at the MICROGENIE menu is to press L to run the Learn section; this gives an overview of the program and user instructions. New users are recommended to look at this section first. The user can move through the text by pressing the Page Up and Page Down keys; the Home key takes the user to the start of the text, and the End key takes the user to the end. Press the return (or enter) key twice to return to the previous menu.

When back at the main (MICROGENIE) menu, pressing a key corresponding to one of the other highlighted first letters of the choices offered runs the chosen program section. This normally brings up a further menu of options within the section, e.g., in the sequence Entry section, some of the options available on the ENTRY menu are to Record a new Sequence, to Copy, Edit, or Transform an existing sequence, and to make Backup copies of the sequences.

As an example, an option that is common to most sections is to List the Directory; pressing the L key selects this procedure, and the program responds by asking the user to enter the directory key name. If the first three letters of a password followed by a colon (e.g., “PRI:” for the Primate data bank, “FLO:” for sequences on a floppy disk, or “RON:” for a user’s sequences stored under the password “RONALD”) are entered, the user will see a listing of the sequences under that password. If return is pressed at this prompt, the default will be to list the sequences under the user’s own password. When the listing is too long to fit on one screen, the Page Up and Page Down keys may be used to move through the listing. Pressing the return key at this stage takes the user back to the previous menu.
A list of the sequences can be printed out, one page at a time, by pressing the Print Screen key on the keyboard; this is useful if there are only a few sequences to be printed. By typing P/ before the directory key name (password), e.g., “P/RON:”, the list will be sent to the printer. If part of a sequence name is entered followed by an “*” as the directory key name, a listing of all sequences that match, where “*” replaces any character, will be obtained. The “*” character can be used to represent any number of characters in any part of a sequence name (e.g., Human*DNA* for sequences with “Human” and “DNA” in their names) rather than all characters from the “*” to the end of the name as with the “*” wildcard in DOS.

When there are a limited number of choices available, the menu may take the form of a prompt, listing the possible choices with the corresponding letters for each choice highlighted. Again, pressing return rather than one of the highlighted letters will normally select a default that is satisfactory for a preliminary analysis. Where there is a longer list of choices, these are numbered, and the number corresponding to the desired choice is pressed to make the selection. When there are no highlighted letters in the prompt, enter “Y” for yes or “N” for no (upper or lower case) if it is a question that requires a yes or no answer. If the prompt is requesting such information as a sequence or file name, type in the name as appropriate, and press return at the end of it. At any stage, when a response is requested, entering a question mark will cause the program to display some explanatory text and give a further opportunity to enter the response.

2.2.1. Use of the Return Key
In general, pressing the return key at a menu choice or question will cancel the current choice and return to the previous one, unless there is a default value that can be used. At a menu, pressing return twice (to reduce the chance of an unintended exit) will take the user to the previous menu or exit the program if at the main (MICROGENIE) menu. Return is also used to terminate input when a response, such as a sequence name, has to be typed. In this case, return must be pressed to signify that the response is complete. (When the term “enter” is used in Section 4, it implies that a response is typed and then the return key is pressed.) Pressing return in response to requests for a selection will normally select the default option if there is an appropriate one or cancel the procedure if there is not.

2.3. Control of MicroGenie

MicroGenie is controlled interactively from the initial menus followed by dialogs with prompts, so that the program can request specific information about the sequence to be analyzed. The settings that control the operation of the Analysis procedures, such as the stringency of a homology search, the format of the output, and whether amino acids are printed in the one- or three-letter code, are specified by a set of parameters. These are set before the analysis begins, rather than being entered during the analysis. There are default parameters supplied that are normally well chosen for an initial analysis. To enable a user to change easily between regularly used conditions, there are nine copies of the set of parameters that can be individually modified. The appropriate set can be selected before an analysis. There are 32 parameters, of which 26 control the procedures in the Analysis section and the remainder are applicable to the procedures in the Compare section. Each parameter set contains values for all of the 32 parameters; the parameter sets are numbered 0 to 9. Set 0 is the built-in set, which cannot be changed. Set 1 is used as the default and should contain the most frequently used settings. Parameters are changed using a special “Set program Parameters” option in the ANALYSIS and COMPARE menus. The manual and the help screens give details of each parameter and its function. Those that are relevant to restriction enzyme analysis are described briefly in Note 1 as an example.

2.4. MicroGenie Files
MicroGenie was designed so that it could be used by people with little or no knowledge of the operating system of a computer (2). File manipulation is performed by the program, and no provision is made (or normally needed) for accessing MicroGenie data and files from DOS. The program stores sequence data in a special format that packs the sequence data to give efficient storage (2), but this means that other programs cannot readily access the MicroGenie data files. The sequence files are stored in hidden subdirectories that are not shown by the normal DIR command from the DOS prompt. Procedures are provided within MicroGenie to record a sequence from an existing DOS file and to save a sequence in an “Export Format” that can be read by other programs.
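The chapter states that MicroGenie codes three DNA bases into a single eight-bit byte, but it does not describe the encoding itself. The following Python sketch is therefore purely hypothetical: the six-symbol alphabet (A, C, G, T, N, and a padding symbol), the base-6 arithmetic, and the function names are assumptions chosen to show that such a packing is feasible, since 6 × 6 × 6 = 216 combinations fit into the 256 values of one byte.

```python
ALPHABET = "ACGTN-"  # hypothetical alphabet; '-' is used only as padding
BASE = len(ALPHABET)


def pack(seq):
    """Pack three symbols into each byte (illustrative encoding only)."""
    padded = seq + "-" * (-len(seq) % 3)  # pad to a multiple of three
    out = bytearray()
    for i in range(0, len(padded), 3):
        a, b, c = (ALPHABET.index(ch) for ch in padded[i:i + 3])
        out.append(a * BASE * BASE + b * BASE + c)  # at most 215, fits a byte
    return bytes(out)


def unpack(data):
    """Reverse of pack(); trailing padding is removed."""
    symbols = []
    for value in data:
        a, rest = divmod(value, BASE * BASE)
        b, c = divmod(rest, BASE)
        symbols.extend(ALPHABET[x] for x in (a, b, c))
    return "".join(symbols).rstrip("-")
```

At three bases per byte, one million nucleotides occupy 333,334 bytes, or about 326 Kbyte, which is consistent with the statement that such a sequence fits on a 360-Kbyte floppy disk. A packed file of this kind is not readable as text, which illustrates why other programs cannot readily use the data files.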
The distinction between copying and recording a sequence can confuse a new user: a sequence in MicroGenie format can be copied into the user’s own password (directory) from another password, or from a floppy disk, using the “Enter,” “Copy a Sequence” procedure on the ENTRY menu. To read in a sequence from a non-MicroGenie file, however, the user should use the “Record a new Sequence” procedure, and then specify that he or she is recording the sequence from a file as described in Chapter 16 (MicroGenie: Shotgun DNA Sequencing).

The output from MicroGenie is normally saved on the hard disk, in the root directory, as a file called OUTFILE; see Note 2. When MicroGenie is used again and generates further output, it will produce a new OUTFILE, replacing the previous one. This has the advantage that the hard disk does not become filled with unwanted output files and reduces the risk of inexperienced users deleting the wrong files. Note 3 gives some techniques to reduce the risk of accidental deletion of files. MicroGenie is normally started from the root directory of the hard disk and returns to the root directory on exit, even if it was started from a subdirectory. Merge files and sequence export files are also saved in the root directory of the hard disk. Note 4 gives further details of the location of MicroGenie files on the hard disk.

3. Materials
MicroGenie is designed to run on IBM-compatible computers under the DOS operating system. It should run on all “100% IBM-compatible” microcomputers. The computer must have at least 640 Kbyte of RAM; this is in addition to any extended or expanded memory, which is not used by MicroGenie. MicroGenie supports Hercules monochrome and other graphic display systems, up to VGA, in monochrome or color; see Note 5.

MicroGenie is copy-protected and will not run unless a special “Hardware Adapter” is fitted to the computer. This was originally an internal card; when the PS/2 was introduced, the hardware adapter was changed to a device that connected to the parallel printer port of the computer. This made it possible to have MicroGenie installed on more than one computer (e.g., in the lab and at home) and move the hardware adapter from one to the other as necessary.

The operating system should be DOS 3.1, or later, to be able to use a CD drive to access the data bank sequences. (The term DOS is used here to refer to Microsoft MS-DOS or IBM PC-DOS; the author has no experience with running MicroGenie with Digital Research DR DOS.)
Version 7.01 of MicroGenie was released before MS-DOS 5.0 became available; MicroGenie should run better under DOS 5.0 because of the extra memory that is available. Chapter 19 (MicroGenie: Homology Searches) gives details of some methods of freeing extra memory to run MicroGenie. It is possible to run MicroGenie from floppy disks, and the manual gives details of the differences in procedure, but strongly recommends the use of a hard disk. (The descriptions of procedures given in Section 4 are written assuming that a hard disk is used.) The installation of MicroGenie requires up to 5 Mbyte of free space on the hard disk. If a number of users will be storing sequences, especially if they are sequencing by the shotgun method, more space will be required. On our system, the programs, data banks, and 20 directories of user sequences occupy over 30 Mbyte on the hard disk.

Most of the output from MicroGenie is in plain text and can be printed on the majority of printers that are normally used with an IBM-compatible computer. There are some procedures that produce data in a graphical form; a graphics printer is needed to print the output from these. MicroGenie supports Hewlett-Packard LaserJet, NEC PC-PR201, IBM Proprinter, and Epson printers, or others that are compatible with their control codes; see also Note 6.

A CD-ROM drive is required to use the Data Bank section of MicroGenie. Installation of a CD-ROM drive is discussed in Chapter 19 (MicroGenie: Homology Searches). MicroGenie can use a digitizer to read in the mobility of DNA or protein fragments from autoradiographs or gels. The use of a digitizer is described in Chapter 16 (MicroGenie: Shotgun DNA Sequencing).

A tape backup system can be very useful if there are a large number of users with vital sequences on the computer. This makes it easy to do a regular backup of all the sequences on the disk. The importance of making backup copies of sequences cannot be overemphasized. Loss of the data on a computer could be a major setback to a sequencing project; see Note 7.

Technical support for the MicroGenie program software is available to existing licensees from Beckman Instruments Inc., 1050 Page Mill Road, Palo Alto, CA, 94304; or Beckman Instruments (United Kingdom) Ltd., Progress Road, Sands Industrial Estate, High Wycombe, Bucks., HP12 4JL, UK.
Updates to the Data Bank CD in MicroGenie format are available from IntelliGenetics, Inc., 700 East El Camino Real, Mountain View, CA, 94040, or Amocolaan 2, B-2440 Geel, Belgium.

4. Methods

The ease of use of MicroGenie and the extensive help system make comprehensive step-by-step instructions for its use unnecessary. After a more comprehensive description as an introduction to the program, less extensive instructions will be given, and the main emphasis will be on tips, pitfalls, and solutions to problems that have been encountered. MicroGenie can construct restriction enzyme site maps from the fragment sizes produced by multiple enzyme digests and can calculate fragment sizes from their mobility, relative to markers, on an agarose gel. The digitizer can be used to enter the migration distances directly from a gel or autoradiograph. The author has not used these procedures and cannot give any practical details; therefore, only the analysis of restriction enzyme sites in known sequences will be considered in this chapter. MicroGenie reads the restriction enzyme recognition sites for an analysis from special Search files, which contain a list of enzyme names and their recognition sites. A suitable Search file must be available in order to locate restriction enzyme (or other) sites in a sequence; see Note 8. The following sections describe how to set up the program to use the file of restriction enzymes supplied with MicroGenie and how to record a Search file.

4.1. To Copy a Search File from Another Password

1. At the Main (MICROGENIE) menu, press F for the Files section.
2. From the FILE menu, press S for Search files and the user is taken to the SEARCH menu.
3. At the SEARCH menu, the user is presented with a list of choices, including to Copy, Alter, or Erase an existing Search file, and to List the Search files in his or her own or another directory. Press L to “List the Directory.”
4. The prompt “Please give directory key name” appears. To see the list of Search files in the directory, press return. The SEARCH:List screen then displays the names of any Search files the user already has with the KIND, TYPE, and CLASS of each. Press return when finished. This will take the user back to the SEARCH menu.
5. Use the “List the Directory” procedure to find the name of a Search file in another directory. Press L, and give the directory key name as PRI: (or pri:) to look for the file of restriction enzymes in the Primate data bank that is supplied with MicroGenie. The SEARCH:List screen should then show a file with the name ENZYMES, of KIND N, TYPE S, and CLASS 1. If this file is not shown, the user may have to install or reinstall the Primate data bank. Press return when finished. This will take the user back to the SEARCH menu.
6. At the SEARCH menu, press C to Copy a file.
7. The user is then asked for the name of the file to be copied; in this example, the user would enter the name as PRI:ENZYMES (lower-case letters or capitals can be used). A message confirms that one file has been copied, and the user is again asked for the name of a file to copy. If no further files are to be copied, press return to go back to the SEARCH menu.
8. If the files are listed again, the user will see that the file is now in their own directory. If P/ is typed before the directory key name (e.g., P/PRI:, or P/ alone for the user’s own directory, followed by return), the list of Search files will be directed to the printer instead of the screen.
9. Press return to go back to the FILES menu and then two further times to go back to the main MICROGENIE menu.

4.2. Recording a Restriction Enzyme Search File

1. At the Main (MICROGENIE) menu, press F for the Files section.
2. From the FILE menu, press S for Search files to be taken to the SEARCH menu.
3. At the SEARCH menu, the user is presented with a list of choices, including to Copy, Alter, or Erase an existing Search file and to List the Search files in his or her own or another directory. To Record a new Search file, press R (pressing return here exits to the previous menu).
4. Enter the name for the Search file in response to the prompt asking for a name for the file. The name may be up to 40 characters long; “*” or “:” cannot be used in a file name.
5. The user is then asked if he or she wants to record a nucleic acid (N) or protein (P) Search file. To record a restriction enzyme Search file, press N (which is the default if return is pressed here).
6. To the prompt for single or multiple digest, enter S for single digest (the default); this specifies that the analysis procedures, which calculate maps and fragments, do so for each enzyme separately. If M is entered for a multiple digest, the fragment sizes will be calculated for a digest where all the enzymes in the file are present together in a multiple digest.
7. The user is then asked to enter a Class number for the Search file; this can have a value of 1-9. The Class number is used as a “handle” to determine which Search files are used in an analysis; see Note 9 for further details. If there is more than one Search file of the class selected for the analysis, they will each be used in turn for the analysis. The default is Class 1, the same as the file of restriction enzymes supplied with the program. If the Search files are of type M (for multiple digest), the sizes of the fragments will be calculated assuming separate multiple digests using the enzymes in each file, rather than a single digest using the enzymes in all Search files of the specified class together.
8. The user is then prompted to type the name of the first string (restriction enzyme or other recognition site). This can be up to 20 letters. Press return at the end of the name to be taken to the next stage, the entry of the corresponding search string. Press return without typing a name to terminate the recording of the file. The left and right cursor keys, home, end, insert, delete, and backspace keys can be used to edit the input line before pressing return.
9. Type in the search string, the recognition site for the restriction enzyme. MicroGenie has a set of codes for bases where substitutions are recognized; e.g., “P” will match either A or G, “4” will match A, C, or G, and “N” matches any base. The code is given in Appendix 1 of the manual and in response to a “?” for help at this stage. This code can only be used in Search files (and is used in the output of procedures that back-translate protein to DNA), but cannot be used as an uncertainty code in DNA sequence files. A Search file can also consist of amino acid strings (in the three-letter amino acid code with the additional code “Any” to signify a match to any amino acid), and be used to search against a protein sequence or to locate sites in a DNA sequence that could code for them. When return is pressed, the user is invited to enter the name of the next enzyme or can press return again to finish recording the file.
10. To change a string that has already been recorded, type its name when asked for the name, and then type the correct string. To delete an existing string, enter its name, then enter the string as one space, and press return.
11. When the Search file is complete, press return when asked for an enzyme name to return to the Search menu. It is possible to list the Search files again (enter “/P” as the key name to cause the list to be printed out, as for sequence files) and Examine or Print the one recorded if it is necessary to check it. See Note 10 regarding making backup copies of the Search files.
12. Press return to go back to the FILES menu and twice again to go back to the main MICROGENIE menu.
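
The way a search string with substitution codes is compared with a sequence can be sketched in Python. This is only an illustration under stated assumptions, not MicroGenie's own code: the table holds just the three codes quoted in step 9 (the full set is in Appendix 1 of the manual), the function names are hypothetical, and the `misses` and `reflect` arguments merely imitate the behavior described for parameters 10 (MISSES) and 12 (REFLECT).

```python
# Hypothetical subset of the Search-file base codes: only the three codes
# quoted in the text are listed; the full table is in the MicroGenie manual.
CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "P": "AG",    # "P" matches A or G
    "4": "ACG",   # "4" matches A, C, or G
    "N": "ACGT",  # "N" matches any base
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the lower strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(seq, site, misses=0):
    """Return 1-based start positions where `site` matches `seq`
    with at most `misses` mismatched bases."""
    seq, site = seq.upper(), site.upper()
    hits = []
    for start in range(len(seq) - len(site) + 1):
        window = seq[start:start + len(site)]
        mismatches = sum(base not in CODES[code]
                         for base, code in zip(window, site))
        if mismatches <= misses:
            hits.append(start + 1)
    return hits


def search_both_strands(seq, site, misses=0, reflect=True):
    """Search the upper strand and, if `reflect` is true, the lower strand.

    Lower-strand hits are reported as the upper-strand position of the
    leftmost base covered by the site (a convention chosen for this sketch).
    """
    seq = seq.upper()
    hits = set(find_sites(seq, site, misses))
    if reflect:
        for pos in find_sites(reverse_complement(seq), site, misses):
            hits.add(len(seq) - pos - len(site) + 2)
    return sorted(hits)
```

For example, `find_sites("CCAAATTCGG", "PAATTC")` reports a site at position 3, because “P” accepts the A that replaces the G of GAATTC.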

4.3. Searching for Restriction Enzyme Sites in a DNA Sequence

To locate restriction sites in a DNA sequence and calculate restriction enzyme digest fragment sizes:


1. At the MICROGENIE menu, press A to select the ANALYSIS section.
2. Press A from the ANALYSIS menu to analyze a sequence.
3. The user is asked to give the name of the first sequence to be analyzed. It is possible to analyze a number of sequences in succession, and after specifying the procedures to be used on the first sequence, the option will be offered to analyze other sequences at the same time.
4. A prompt then appears asking for the parameter set to use, giving the options “(/,0-9).” If return is pressed, the default parameter set 1 will be used. This initially has parameter 9 CLASS with a value of 1, so that Class 1 Search files will be used (e.g., ENZYMES). If a parameter set has been modified to specify the use of Search files of a different class, enter its number here in order to use it. If “/” is entered followed by the number of a parameter set at the prompt, the user will be taken to the parameter modification screen (ANALYSIS:Set) to modify that parameter set for use in the analysis. If parameter number 9 CLASS, in the set that is specified, is set to 0, then the user will be prompted for the class of the Search files to use at each analysis.
5. The user is then asked for “Specifications (A,F,S,L)” to specify a particular region or strand of the DNA to be analyzed. Press L to limit the length of the region to be analyzed. The user will then be prompted for the lower and upper limits. To change the form of the DNA to linear or circular, press F; this will affect the sizes of fragments containing residue number 1 generated by a digest in procedure 6. Press S to change the strand; the user can then specify the upper or lower strand (parameter 12 REFLECT determines if MicroGenie searches for restriction enzyme sites in both strands). To select two or more of these options, press A. The user will be prompted for each in turn.
6. The user is then asked which of the 20 Analysis procedures are to be applied to the sequence. More than one can be chosen. Enter the procedure numbers separated by a space and press return when they are all entered. Pressing “?” here gives a list of the procedures. Procedures 6 (“Locate sites and show in tables”), 7 (“Show sites in graphs”), and 8 (“Show sites with the sequence”) are the ones relevant to restriction enzyme analysis. Normally the selected procedure will use all the restriction enzymes in the Search files of the specified Class. If the user wants a selection (of up to six) of the enzymes in the file to be used, put a “/” character before the procedure number. The program will ask for the enzyme name, which should be typed exactly as it is listed in the Search file (although upper- or lower-case letters may be used; Note 11). If individual enzymes are selected from the Search file, the user will be asked if it is to be a single or multiple digest.
7. When the procedure choice has been entered, the program prompts the user for the name of another sequence to analyze. Entering a sequence name will take the user through the selection of parameter set, specification, and procedures again, and then another sequence name will be requested. Pressing return instead of a sequence name will take the user to the next menu.
8. From the numbered list of options for the output, 1 would normally be selected to store the output on the hard disk. This is the default if return is pressed. The user can then Examine the output on the screen and Print it on return to the ANALYSIS menu. If option 2 is selected, the output will be printed without any opportunity to view it first. Press the number corresponding to the choice, or press return to store the output on the hard disk. See Note 12 for details of the other options.
9. The program then begins the analysis. The screen is cleared with the message “MicroGenie at work . . . ” in the bottom left corner of the screen. The progress of the analysis is displayed at the top of the screen where the name of the sequence, the procedure name, and the page of the output on which its results will be printed are displayed. When the analysis is complete, the user is asked to press the return key to return to the ANALYSIS menu.
10. At the ANALYSIS menu, the user can press E to “Examine the program output.” If the screen is blank, either no restriction sites were found or there was no Search file of the requested Class (MicroGenie does not report this as an error; see Note 8). Note 13 gives details of the output from procedure 6.
11. The user can move up and down through the display of the output by pressing the Page Up and Page Down keys, or by entering the number of a page and pressing return to go directly to that page. At the end of the output, there is an index listing the sequence and procedure results that can be found on each page. Press the End key to go directly to the index. Press return to go back to the ANALYSIS menu.
12. To print the output, press P at the ANALYSIS menu. To keep the output for subsequent printing, perhaps on a higher-quality printer, the results file OUTFILE found in the root directory can be renamed or copied to a floppy disk using DOS commands after leaving MicroGenie. The index is in the file OUTFILE.IND, which may also be copied if needed.
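
The effect of the single/multiple digest choice, the linear/circular form, and the SORT parameter on the fragment table can be illustrated with a short Python sketch. It is not MicroGenie's algorithm: the function names are hypothetical, and a cut position is assumed here to be the 1-based number of the last base before the cut.

```python
def fragment_sizes(cuts, length, circular=False, sort=True):
    """Return fragment lengths for one digest of a sequence of `length` bases.

    `cuts` holds 1-based positions of the base immediately before each cut
    (an assumed convention for this sketch).
    """
    if circular:
        points = sorted({c % length for c in cuts})
        if not points:
            return [length]          # uncut circle
        sizes = [b - a for a, b in zip(points, points[1:])]
        # The remaining fragment runs through residue number 1.
        sizes.append(length - points[-1] + points[0])
    else:
        points = sorted({c for c in cuts if 0 < c < length})
        bounds = [0] + points + [length]
        sizes = [b - a for a, b in zip(bounds, bounds[1:])]
    # sort=True lists fragments in descending order of length, as on a gel.
    return sorted(sizes, reverse=True) if sort else sizes


def digest(sites_by_enzyme, length, multiple=False, **options):
    """Single digest: one fragment list per enzyme.
    Multiple digest: one list with all enzymes cutting together."""
    if multiple:
        pooled = [c for cuts in sites_by_enzyme.values() for c in cuts]
        return {"+".join(sites_by_enzyme): fragment_sizes(pooled, length, **options)}
    return {name: fragment_sizes(cuts, length, **options)
            for name, cuts in sites_by_enzyme.items()}
```

With hypothetical sites `{"EcoRI": [1000, 3000], "BamHI": [2500]}` on a 5000-base linear sequence, a single digest gives separate fragment lists for each enzyme, whereas a multiple digest gives one list of four fragments. Treating the same sequence as circular changes only the fragment that contains residue number 1.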

4.4. Other Restriction Enzyme Site Location Procedures

1. To list a sequence with the sites, follow the method above and select procedure 8. The sequence is listed with the number of bases per line, gaps, and numbering set by the parameters 1 NWIDTH, 3 NFREQ, and 5 NSKIP (default 60 bases per line with gaps and numbers every 10 bases). The first three letters and the last letter or number in the enzyme name are printed vertically below the first base of its recognition site. If two recognition sites start at the same place, the more specific one is printed.
2. To show the sites in linear graphs, use procedure 7 as above. The output consists of a line representing the sequence for each of the enzymes that cut with the position of the sites marked by vertical bars along its length. Where a number of sites are close together, they are given a single mark with a number above to show the number of sites at that position.
3. Where restriction enzyme maps are required of circular sequences, these may be displayed in a linear form as described above, or as a circular map using the Graphics section of MicroGenie. The Graphics procedure can also produce diagrams of the restriction sites of linear DNA molecules. These are of higher quality than those from procedure 7 if a laser or similar printer is used.
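
How the three listing parameters interact can be shown with a small Python sketch. Only the meaning of the parameters follows Note 1 (NWIDTH and NFREQ are multiplied by 10; NSKIP switches the gaps on or off). The function name and the exact placement of the numbers, right-aligned above the last base of each block, are assumptions made for illustration and may differ from the real MicroGenie printout.

```python
def list_sequence(seq, nwidth=6, nfreq=1, nskip=1):
    """Format `seq` with 10*nwidth bases per line, a number every 10*nfreq
    bases (none if nfreq is 0), and a gap every 10 bases if nskip is 1."""
    per_line = 10 * nwidth
    sep = " " if nskip else ""
    lines = []
    for start in range(0, len(seq), per_line):
        chunk = seq[start:start + per_line]
        blocks = [chunk[i:i + 10] for i in range(0, len(chunk), 10)]
        if nfreq:
            labels = []
            for b, block in enumerate(blocks):
                pos = start + 10 * (b + 1)   # number of the block's last base
                show = len(block) == 10 and pos % (10 * nfreq) == 0
                labels.append(str(pos).rjust(10) if show else " " * 10)
            ruler = sep.join(labels).rstrip()
            if ruler:
                lines.append(ruler)
        lines.append(sep.join(blocks))
    return "\n".join(lines)
```

Calling `list_sequence(seq)` with the defaults corresponds to the initial settings described above: 60 bases per line, with gaps and numbers every 10 bases.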

4.5. Circular Restriction Enzyme Maps

1. At the MICROGENIE menu, press G to go into the GRAPHICS section.
2. Enter the name of the sequence to be “Analyzed by mapping,” and press return. The GRAPHICS section has one procedure to plot a restriction enzyme map (which can be circular or linear).
3. A prompt then appears asking for the parameter set to use giving the options (/,0-9). Pressing return will “select” the default parameter set 0; this has parameter 9, CLASS, with a value of 1, so that Class 1 Search files will be used (e.g., ENZYMES). The user has the option to select which enzymes to use at a later stage. If the user has modified a parameter set to specify the use of Search files of a different class, its number should be entered here in order to use it. Entering “/” followed by the number of a parameter set at the prompt will take the user to the parameter modification screen (ANALYSIS:Set) to modify that parameter set for use in the analysis. If parameter number 9, CLASS, is set to 0, the user will be prompted for the class of the Search files to use for the analysis. Note 14 gives details of the other parameters relevant to this procedure.
4. The user is then asked for “Specifications (A,F,S,L)” to specify a particular region or strand of the DNA to be analyzed. Press L to limit the length of the region to be analyzed; then the user will be prompted for the lower and upper limits. To change the form of the DNA to linear or circular, press F. If a linear DNA sequence is used, a linear map will be produced, with the sites marked as for the circular map. Press S to change the strand. The user can then specify the upper or lower strand (parameter 12, REFLECT, determines if MicroGenie searches for restriction enzyme sites in both strands). To select two or more of these options, press A. The user will be prompted for each in turn.
5. The user is asked for the name of a restriction enzyme (or other site) in the Search file that was specified by the parameter set selected. The user can type in the names of up to six enzymes, pressing return after each, and return on a blank line after the last one. The names can be in capitals or lower case and include spaces or not, but must otherwise match those in the Search file (in the Search file ENZYMES, the numbers at the end of names are numeric, not Roman).
6. A map then appears on the screen, after a delay, with the sequence name in the top right corner and a prompt along the bottom of the screen. Press P to print, or A to map another sequence, or return to go back to the MICROGENIE menu.
7. All the enzymes chosen are shown on the one map, in contrast to procedure 7 (in the Analysis section), where each enzyme is shown on its own linear map. The program will try to show all the enzymes that cut; if it has to display two that cut close together, it prints their names separated by a “<” or “>” symbol to show which occurs first (reading clockwise).

4.6. Exporting Sequences from MicroGenie

Other programs cannot use MicroGenie sequence files directly because they are stored in a compressed format. If it is necessary to use another program, e.g., for a database search on a remote computer, the sequence has to be saved in a format that the other program can read. The steps required to save a protein sequence for analysis by the FASTA program (see Chapter 26) are used here as an example. This procedure can also be used to generate a data file for submission to a data bank or incorporation in a document.

MicroGenie can save a sequence in a plain text file, referred to as an “Export Format.” This is suitable for import into other analysis programs or word processors, and for submission to data banks. To transfer a sequence to another program, go to the ANALYSIS menu, “List” the sequence, and then select the option to save the output in Export Format.

Before starting, the user should make sure that there is a parameter set that will give the correct format for the exported sequence. If exporting a DNA sequence, check that parameter 7, STRANDS, is not set to 2; otherwise the export file will contain both strands in the one sequence. Parameter 8, LETTERS, is set to 3 for the three-letter code in the initial parameter set. If exporting a protein for analysis by FASTA, modify a parameter set to give LETTERS a value of 1, so the exported sequence will be in the one-letter code. (Parameter set 1 is used by default; if exporting protein sequence often in the one-letter code, modify parameter 8 in set 1 so that this will become the default.) The method for changing the parameters is described in Section 4.7.

4.6.1. To Export a Sequence from MicroGenie

1. Go to the ANALYSIS menu by pressing A at the MICROGENIE menu.
2. Press A at the ANALYSIS menu to select “Analyze a Sequence.”
3. Enter the name of the sequence to be exported.
4. Press the number corresponding to the modified parameter set to use the one-letter amino acid code if exporting a protein sequence for analysis by FASTA. If there is no parameter set with the correct value, enter the “/” character before the number of a parameter set to be taken to the Set Parameter screen, where the parameters can be inspected or changed, as described below, and then to be returned to the same point in the Analysis procedure. The user has the choice of saving that modification to the parameter set or using it only for the current analysis.
5. The user is then asked for “Specifications (A,F,S,L).” Press return to export the whole of the sequence as it is. (It may be necessary to export the sequence as a number of separate files if it is to be used with a program that has a shorter maximum sequence length than MicroGenie.) If L is pressed, the user will be asked for the lower and upper limits of the sequence to extract a region to be exported. If the user is exporting a DNA sequence and wishes to export the lower (complementary) strand, S should be pressed for strand and then L for lower. To search both strands of a sequence with FASTA, the user will need to export it twice, once for the upper strand and again for the lower strand. Press A to change the strand and limit the length of a DNA sequence to be exported.
6. When asked for the numbers of the procedures, press 1 to “List and number the sequence” and then press return.
7. The user is then asked for the name of the next sequence to analyze. Press return here to be taken to the Output Option Selection Menu. If the user gives the name of another sequence and selects the same export options, it will be added to the previous sequence, in the export file, with a blank line between the sequences. This can be useful if the user is creating a library of sequences as a data bank for use with another program.
8. When return has been pressed in response to the request for a further sequence name, the user is taken to a list of output options. Press 5 to “Perform analysis and store output in export format.”
9. Press Y to omit numbering and blank lines. (Keeping the spaces and numbering may be useful if the user intends to include the sequence in a document or a data bank submission.)
10. The user is then asked for a name for the export file. The file name can contain up to eight letters or numbers, but must not have a drive letter, directory path, or file name extension. The export file will be stored in the root directory of the hard disk (even if using the password FLOPPY to work with sequences on a floppy disk).
11. When return is pressed, after typing the file name, the file is saved in the root directory of the hard disk and the user is returned to the ANALYSIS menu.
12. After exiting from MicroGenie, the user will find the export file in the root directory of the hard disk. Copy it to a floppy disk if it is to be used on another computer. See Note 15 for details of the use of the COPY command in DOS to remove hidden extra characters that may be appended to the Export Format file.
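
An exported file may still need tidying before another program will accept it. The Python sketch below is one possible way of turning the text of an export file into FASTA records; it is not part of MicroGenie. It assumes that the file was written with numbering and blank lines omitted (step 9), so that a blank line only separates one sequence from the next (step 7), that the file holds residues only, and that the hidden extra character is the DOS end-of-file marker (Ctrl-Z). The function name and the record names supplied by the caller are hypothetical.

```python
import re


def export_to_fasta(text, names, width=60):
    """Convert the text of an Export Format file to FASTA records.

    `names` supplies one title per exported sequence, since this sketch
    assumes the export file itself contains residues only.
    """
    text = text.replace("\x1a", "")       # assumed DOS end-of-file marker
    text = text.replace("\r\n", "\n")     # DOS line endings
    # A blank line is taken to separate consecutive sequences.
    records = [r for r in re.split(r"\n\s*\n", text) if r.strip()]
    if len(records) != len(names):
        raise ValueError("expected one name per exported sequence")
    out = []
    for name, record in zip(names, records):
        # Drop anything that is not a letter (spaces, digits, line breaks).
        residues = re.sub(r"[^A-Za-z]", "", record).upper()
        out.append(">" + name)
        out.extend(residues[i:i + width]
                   for i in range(0, len(residues), width))
    return "\n".join(out) + "\n"
```

A protein exported in the three-letter code would not be usable in this form, which is why LETTERS should be set to 1 before exporting for FASTA.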

4.7. Changing MicroGenie Parameters

The MicroGenie parameters can be set from the “Set Program Parameters” procedure on the ANALYSIS menu. If parameters have not been set in advance, the user can select the parameter set to be used in the analysis and preface it with “/.” The user will then be taken to the parameter editor and, after modifying parameters, can use them only in that analysis or save them for future use. As an example, the steps needed to select the one-letter amino acid code will be described. Set parameter group 1 to have the values that will be used most often, since this is the one that is selected by default.

1. Press A from the MICROGENIE menu to go to the ANALYSIS menu.
2. At the ANALYSIS menu, press S to “Set Program Parameters.”
3. When asked which parameter set to alter, enter the number of a set between 1 and 9.
4. The user is then taken to a screen showing the 26 ANALYSIS parameters. (Parameters 27-32 apply to the Compare section and can only be set from the COMPARE menu.) To alter parameter 8, LETTERS, press 8 and then return.
5. The line with the old value and the range of values for LETTERS is highlighted, and a line at the bottom of the screen asks for the new value. Press 1 and then return to set the value to 1 for the one-letter code. Press “?” for an explanation of the highlighted parameter and the range of permissible values.
6. The user is then asked for the number of another parameter to change. If it is not necessary to change another, press return to be taken back to the ANALYSIS menu.
7. If the parameter set was changed using the “/” symbol when starting an analysis, the user will be asked if he or she wants to use the changed parameters for that analysis only or save them for future use.

5. Notes

5.1. Parameters That Control Restriction Enzyme Analysis

1. Parameter 1, NWIDTH, controls the number of DNA bases that are printed on a line of output. The number of bases is 10 times the value given to the parameter. This may be from 1 to 15 (10-150 bases/line) with an initial default value of 60 bases/line. This parameter sets the number of DNA bases printed per line by procedure 8, “Locate sites and show with sequence,” and other procedures that print out DNA sequence. The frequency of the numbering printed along the DNA strand is determined by the value of parameter 3, NFREQ, which is multiplied by 10 to give the interval between the numbers. The default value is 1, for numbers printed at every 10 bases. If the value of NFREQ is set to 0, numbers are not printed. Parameter 5, NSKIP, determines if spaces are inserted into a DNA listing to make it more readable. A value of 1 gives gaps every 10 bases, and 0 prevents spaces from being inserted. Search files, for use in locating restriction enzyme (or other) sites, are given a class number, which is used to select those to be used in an analysis. Parameter 9, CLASS, is used to specify in advance which class of Search files will be used in the analysis. The initial default value is 1 for Search files of Class 1, which is the class of the file of restriction enzymes supplied with MicroGenie. If the value of the parameter CLASS is set to 0, the user will be asked which class of Search files is to be used for each sequence when it is analyzed. When searching for restriction enzyme sites, all bases in the site must match the sequence. If looking for potential sites that may be generated by mutation, or for regulatory sites, it may be desirable to allow for one or more mismatches between the search string and the sequence. Parameter 10, MISSES, determines how many mismatched bases are allowed in the site. The default is 0 for an exact match. (If searching for a consensus sequence, there is a set of symbols given in Appendix 1 of the MicroGenie manual that can allow alternative matches at a base in the site, even if MISSES is set to 0.) Parameter 11, SORT, determines if the digest fragments calculated by procedure 6 (“Locate sites and show in tables”) should be sorted into descending order of length. They are listed in the order that they would appear on an electrophoresis gel if the value of SORT is 1. If the value is 0, the fragments are listed in the order in which they occur in the sequence. Normally both strands of a DNA sequence are searched for restriction enzyme recognition sites, but if the value of parameter 12, REFLECT, is changed from the preset default value of 1 to 0, sites will only be located in the upper strand of the sequence.

5.2. MicroGenie Files

2. OUTFILE is a plain text ASCII file that can be viewed or printed from within MicroGenie and is available for use by other programs after leaving MicroGenie. To retain the data in OUTFILE, the user must leave MicroGenie and copy OUTFILE to another file or disk before MicroGenie is used again. The next time MicroGenie is used, the output will overwrite the data in OUTFILE. There is an option in MicroGenie to save output as a named file on a floppy disk. This can be used to reduce the risk of it being overwritten.
3. MicroGenie stores output in the root directory of the hard disk on which it is installed. If this is the boot disk (usually C:), the system files (COMMAND.COM, CONFIG.SYS, and AUTOEXEC.BAT) will be in the same directory and at risk from accidental deletion or modification by inexperienced users. It may be a useful precaution to make these files (and perhaps also MG.BAT) read-only (using the ATTRIB command as described in the DOS manual). This will protect them if a user tries to delete temporary export or import files by using the DOS “DEL *.*” command. The \MICROGEN directory can be put on the path and the MG.BAT and other MicroGenie batch files can be moved from the root directory into \MICROGEN where they are less likely to be deleted. If setting up a new computer for use with MicroGenie, there are advantages in partitioning the hard disk as C: for the boot disk with DOS and other programs, and the remainder of the disk as D: for data and the MicroGenie programs (use the command INDRIVE D instead of INSTALL for each disk when installing MicroGenie).
4. The MicroGenie programs for the main sections are installed in a subdirectory \MICROGEN on the hard disk, and these programs are called by a batch file, MG.BAT, in the root directory. The user’s sequences appear to be stored in hidden directories named \1.DSK to \20.DSK for the possible 20 passwords. These are used in the order in which the passwords were registered, and displayed when listed on the master password screen. When sequences are stored on a floppy disk, using the password “FLOPPY,” they are saved in a subdirectory called 21.DSK on the floppy (this is presumably necessary because of the DOS limit for the number of files that can be saved in the root directory of a floppy disk). The data bank sequences are installed on the hard disk in the directories 22.DSK to 36.DSK (Fall 1990 release of the data bank). They are not stored in directories with the same names on the data bank CD. The data directories contain files named 1.SEQ, 2.SEQ, and so on, which appear to be the sequence data files, and a file SEQFILE, which is probably the index file to the database. Analogous files (1.SER and SERFILE) appear to be the Search files (this information is based on conjecture, and is included in the hope that it may be useful if it is necessary to use file recovery utility programs on a damaged or accidentally erased disk).

5.3. Materials

5. MicroGenie has been run successfully on IBM PC, XT, XT-286, and PS/2, Research Machines VX-386, Dell 310, and Elonex 286M (an AT-compatible machine) computers at this Institute. Most computers are now supplied with graphics displays. If the user has a machine with a monochrome, text-only screen, he or she will not be able to view the output from the graphical procedures (they can still be used if the user has a printer that can print the graphical output).

6. Narrow blank lines across a graph produced by a dot matrix printer may be caused by using the Epson printer selection (in the Setup section) with a printer that is using IBM Proprinter control codes rather than Epson codes.

7. Most users do not think that they need to make backups of their sequences until they, or a colleague, lose data. Even if a tape streamer is available, individual users should still be encouraged to make backups using the “Backup Sequences” procedure in MicroGenie as an insurance. It is also advisable to make an occasional backup of all their sequences on an additional floppy disk to be stored at a different site in case of fire, flood, theft, or other disaster.

5.4. The Standard Restriction Enzyme Search File

8. A file containing the names and recognition sites of commercially available restriction enzymes is supplied with MicroGenie. This is called ENZYMES and can be copied from the PRI: data bank section to the user’s own directory for use as described in Section 4.1. If a restriction enzyme site search procedure is run with default conditions and no output is obtained when sites are known to be present, it is possible that the user does not have any suitable Search files. Go to the Files section, and check that there is at least one Search file of the appropriate class, since Search files are not created when a new password is registered.

5.5. Recording a Restriction Enzyme Search File

9. The class of Search file that is to be used in an analysis is determined by the value of parameter Number 9, CLASS, in the parameter set that is chosen for the analysis. If the value of CLASS in the chosen parameter set is 0, the user will be prompted to enter the class before the analysis is run. If the value of Parameter 9 (CLASS) is set to a number from 1 to 9, then Search files of that class will be used without prompting the user. The default parameter set has CLASS set to 1, which is the class of the Search file of available restriction enzymes that is supplied with MicroGenie. A Search file of Class 9 is used by procedure 11 (this searches for regions rich in amino acids or nucleotides that are specified in the Search file). If this procedure is used, any normal Search files of Class 9 have to be deleted first.

10. There is also a procedure to make a backup copy of the Search files on a floppy disk. This is very important if the user has a substantial number of files, and recreating them would involve a lot of work if they should be lost. There is also a backup option on the Entry menu that backs up the sequence files but not the Search files.
If it is necessary to delete an existing user’s password and a message says that there are still data under that password after all the sequence files have been deleted, it is probable that there are still Search files that have to be deleted before the password can be removed.

11. In the ENZYMES file, the number at the end of the name is numeric (e.g., 1), not a Roman numeral (e.g., I).

5.6. Searching

for Restriction Enzyme Sites in a DNA Sequence

12. The default option for output from the Analysis procedures is 1, which stores the output on the hard disk. Option 6 allows the user to store the output on a floppy disk, which may be useful if the user wants to print it on a high-quality printer on another machine. If this is chosen, the user will be asked for a file name. Do not include the drive or a file extension. Option 4 is provided to allow the user to return to the ANALYSIS menu to list the sequences if the user has forgotten a sequence name and wants to then resume the analysis without losing the choices already made. Option 5 stores the output in an “Export Format,” which is a plain text (ASCII) file, without page titles, used to export sequence data to other programs (further details can be found in Chapter 19, [MicroGenie: Homology Searches]). If Option 3 is selected, the user will be returned to the ANALYSIS menu, and the selections of sequences and procedures that have been made will be canceled.

13. The output screen from Procedure 6 has the sequence name highlighted at the top left-hand corner, and the restriction enzymes that have recognition sites in the sequence are listed alphabetically down the left of the screen together with their recognition sites. The next column, headed “# SITES,” gives the number of times each site is found in the sequence. Starting on the following line, there is a list of the positions in the sequence of the first nucleotide of each occurrence of the recognition sequence (not the cutting site). The next four columns give the sizes and ends of the fragments that would be produced by a complete digestion with the enzyme. If the parameter SORT has been set to 1, they will be listed in order of decreasing size rather than in the order in which they occur along the DNA strand. The FRAGMENTS column gives the length of each fragment. The figure in brackets is its percentage of the total length. The final two columns give the positions of the two ends of the fragments in the original sequence. If the sequence is circular, then one fragment will have ends that span the zero position and appear to be reversed. The next table in the output gives a list of the restriction enzymes in the Search file that do not have recognition sites in the sequence; this information can be useful in planning cloning procedures.

5.7. Circular Restriction Enzyme Maps

14. Parameter 9, CLASS, selects which Search files will be used. Parameter 22, MAXCUT, specifies that only enzymes that cut fewer times than the value given will be shown on the map.
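The site list, fragment table, and MAXCUT filter described in Notes 13 and 14 can be illustrated with a short Python sketch. It is not MicroGenie's algorithm: the function names are hypothetical, only exact (unambiguous) recognition sequences on the given strand are matched, and, as a simplifying assumption, the cut is placed at the first nucleotide of the recognition sequence, whereas real enzymes cut at enzyme-specific offsets.

```python
def find_sites(sequence, site, circular=False):
    """Return 1-based positions of the first nucleotide of each site."""
    seq, site = sequence.upper(), site.upper()
    # For a circular sequence, repeat the start so that sites spanning
    # the zero position are also found.
    text = seq + seq[:len(site) - 1] if circular else seq
    positions = []
    start = text.find(site)
    while start != -1:
        positions.append(start + 1)
        start = text.find(site, start + 1)
    return positions


def digest(positions, length, circular=False, sort_by_size=False):
    """Return (start, end, size, percent) for each fragment of a complete digest.

    Assumption: the cut falls immediately before each listed position.
    """
    cuts = sorted(positions)
    if not cuts:
        return [(1, length, length, 100.0)]  # uncut sequence
    fragments = []
    if circular:
        for i, start in enumerate(cuts):
            following = cuts[(i + 1) % len(cuts)]
            end = following - 1 if following > 1 else length
            size = (end - start) % length + 1  # wraps past the zero position
            fragments.append((start, end, size))
    else:
        bounds = [1] + [c for c in cuts if c > 1] + [length + 1]
        for start, following in zip(bounds, bounds[1:]):
            fragments.append((start, following - 1, following - start))
    if sort_by_size:  # corresponds to SORT set to 1
        fragments.sort(key=lambda f: f[2], reverse=True)
    return [(s, e, n, round(100.0 * n / length, 1)) for s, e, n in fragments]


def enzymes_on_map(site_counts, maxcut):
    """Keep enzymes that cut at least once but fewer than maxcut times."""
    return sorted(name for name, count in site_counts.items()
                  if 0 < count < maxcut)
```

With a circular sequence of 20 residues cut at positions 5 and 15, `digest([5, 15], 20, circular=True)` gives one fragment running from 15 to 4, which is the fragment whose ends "appear to be reversed" in the output described above.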

5.7.1. Problems with “Export” Format Data Files When Transferred to VAX Computers

15. When option 1 (“List and number a sequence”) of the Analyze section is used to save a data file in Export Format, there are sometimes hidden extra characters following the end of the file. These do not affect DOS programs on a PC, but may become apparent if the file is sent to another computer. The extra characters may be removed using a word processor or an editor on the remote computer, but can also be removed using the COPY command in DOS.


After saving the sequence in Export format, with a name represented as filename, and leaving MicroGenie, type:

COPY filename/A filename2

This will save the sequence as filename2 without the extra characters that may have been hidden after the “end of file” marker. For the copy to be on the floppy disk in drive A:, specify the second name as A:filename2:

COPY filename/A A:filename2

In this case, the /A after the first name specifies that the file to be copied from is to be treated as a plain text (ASCII) file, and the A: before the second file name specifies that the file is to be on the A: drive (or B: for the B: drive).

16. The Files section needs the most memory in Ver. 7.01 of MicroGenie. If there is not enough memory and all unnecessary drivers and programs have been removed from the CONFIG.SYS and AUTOEXEC.BAT files, the user could remove SMARTDRV.SYS from CONFIG.SYS as a temporary measure when it is necessary to run the Files section and restore it when the user uses other programs with Windows 3.0.
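If the Export Format file described in Note 15 has already been transferred to another computer, the same clean-up can be done there. The Python sketch below assumes that the hidden characters follow a DOS end-of-file marker (Ctrl-Z, byte 0x1A), as the note suggests; the function names are hypothetical, and the sketch is not a reimplementation of the DOS COPY command.

```python
CTRL_Z = b"\x1a"  # the DOS "end of file" marker


def strip_after_eof(data):
    """Return the bytes that precede the first Ctrl-Z, if there is one."""
    head, _, _ = data.partition(CTRL_Z)
    return head


def clean_export_file(source, destination):
    """Copy source to destination without the marker or anything after it."""
    with open(source, "rb") as infile:
        data = infile.read()
    with open(destination, "wb") as outfile:
        outfile.write(strip_after_eof(data))
```

The file is read and written in binary mode so that the line endings of the original are left exactly as they were.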

6. Appendix

6.1. Running MicroGenie with Windows 3.0

The Windows interface may seem easier than DOS to people with little experience of computers; MicroGenie can be run as a DOS program under Windows 3.0. The information given here is based on the use of an IBM PS/2 70 with 2 Mbyte of RAM using DOS 4.01 and Windows 3.0. Later releases of Windows and DOS may overcome some of the limitations encountered.

In a computer that is able to run Windows 3.0 in 386 Enhanced mode, MicroGenie can run in a window and multitask with other programs. It is possible to run MicroGenie at the same time as the FASTA program is running a homology search on the CD. If MicroGenie is to be run at the same time as another DOS program, it may be necessary to start MicroGenie before the other program. This is probably because of the limits of the system stacks in Windows 3.0 and may be improved in version 3.1.

The clipboard would seem to provide a way of transferring data in and out of MicroGenie and other programs (bypassing the normal method of saving the data as an intermediate file). Unfortunately, the clipboard will include the end-of-line character; when this is pasted into a MicroGenie editor screen (such as the sequence editor), the end of line is interpreted as the return key, and this terminates the input after the first line. As described below, the clipboard can be used to enter Search file data into MicroGenie.

When a DOS program is run under Windows 3.0, the devices and programs that were specified in CONFIG.SYS and AUTOEXEC.BAT are loaded in the DOS session, in addition to any memory retained by Windows 3.0. It may be necessary to remove more devices and programs from the start-up configuration if MicroGenie is to be run under Windows 3.0 using a CD-ROM drive. The mouse driver does not need to be loaded in DOS, since Windows 3.0 loads its own mouse driver when it starts; the International section of Windows 3.0 setup can provide keyboard support for Windows 3.0 programs instead of loading KEYB.EXE.

To use a CD-ROM drive with Windows 3.0, it is necessary to install the LANMAN10.386 program (by uncompressing it from the Windows 3.0 installation disk) and to modify SYSTEM.INI to put the line “device=LANMAN10.386” in the [386Enh] section. The CD-ROM must be accessed before Windows 3.0 is started in order for it to work under Windows 3.0. Inserting the line “DIR E: > NUL” in the AUTOEXEC.BAT, or in a batch file used to start Windows, will achieve this without giving unnecessary output on the screen.

MicroGenie can be run from an icon in the Program Manager. It may be preferable to set the program to be run from a PIF file rather than directly from the MG.BAT file, since it is then possible to have a choice of PIF files (and associated icons) to run MicroGenie under different conditions. MicroGenie can be run in a window and as a background task. If the user wants to use the graphics procedures and display their output on the screen, a second PIF file should be made to run MicroGenie full screen (Windows 3.1 should allow the graphic display to run in a window).
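The SYSTEM.INI change described above is normally made with a text editor. For anyone who needs to apply it to several machines, the following Python sketch shows one way to add the line to the [386Enh] section. It is an illustration only: the function name is hypothetical, and plain string handling is used instead of an INI parser because the section can hold several "device=" lines.

```python
def add_line_to_section(ini_text, section, new_line):
    """Insert new_line just below the [section] header unless already there."""
    lines = ini_text.splitlines()
    header = "[" + section.lower() + "]"
    for i, line in enumerate(lines):
        if line.strip().lower() == header:
            j = i + 1
            # Scan to the next section header, looking for an existing copy.
            while j < len(lines) and not lines[j].lstrip().startswith("["):
                if lines[j].strip().lower() == new_line.lower():
                    return ini_text  # nothing to do
                j += 1
            lines.insert(i + 1, new_line)
            return "\r\n".join(lines) + "\r\n"  # DOS line endings
    # The section was not found, so add it at the end of the file.
    lines.extend(["[" + section + "]", new_line])
    return "\r\n".join(lines) + "\r\n"
```

A call such as `add_line_to_section(text, "386Enh", "device=LANMAN10.386")` leaves the text unchanged if the line is already present, so it is safe to run more than once. A backup copy of SYSTEM.INI should be kept before any change is written back.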
If the user wants to use the digitizer with MicroGenie under Windows 3.0, multitasking can cause data to be missed from the digitizer. It has to be run as an exclusive process.

The _DEFAULT.PIF file supplied with Windows 3.0 can be modified to run MicroGenie by giving the Program Name as C:\MG.BAT (or the path to the location of MG.BAT on the user’s system). The KB Required and Desired settings can be left at the default. They could be set to more accurate values, but some sections of MicroGenie require less free memory. Allowing MicroGenie to determine if sufficient memory is available (and terminating if it is not) may be preferable, since the user will then be able to run some sections, rather than none, if memory is limited.

The PIF file to run MicroGenie in a window (e.g., MGWIN.PIF) has the options for Windowed and Background checked (selected). Some memory will be saved if Video memory is selected as Text in the Advanced section. To run MicroGenie full screen, for graphics procedures, the MGFULL.PIF settings would be the same except for selecting to run full screen (from a window, a program can be switched to full screen and back again by pressing the Alt and return keys). The MGDIGITZ.PIF file is the same as the one to run full screen, but is set to run exclusive so data in the serial port is not missed. The FASTA program can be run together with MicroGenie using a FASTA.PIF with the KB Desired set to 180, selecting windowed, background, and video memory as Text.

The mouse cannot be used within the MicroGenie window to select menu options, but it can be used to scroll the window to see text that is outside the window. If the user runs two MicroGenie sessions in different windows, they should not both be allowed to produce output, since they would try to write to the same OUTFILE.

6.1.1. Recording a Search File Using the Windows Clipboard

A disadvantage of the MicroGenie Search file system is that it does not have database manipulation features to enable the user to select a working subset of restriction enzymes (e.g., flush cutters or infrequent cutters) from an existing file. It also lacks an import procedure to bring in restriction enzyme sequences from other databases. If the user has a database of restriction enzymes that allows the selection of enzymes according to cutting type, availability, or recognition site length, it is unfortunate that no method is provided to import this data into MicroGenie. If MicroGenie is run under Windows 3.0, the clipboard can be used to transfer data into the “Record a new File” procedure in the Search files section. When entering a Search file from the keyboard, the input of the site name and corresponding string each have to be completed by pressing the return key. To emulate this, the user has to prepare a listing of the restriction enzymes and their recognition sites with the name on one line and the corresponding site


on the next, followed by the next name and its corresponding site, so that each is followed by a new line code. If the database program is unable to produce output in this form, it may be necessary to include a marker character (such as a tab) between the name and site fields (if they are in two columns) and use a word processor to replace the marker character with a new line character. The search and replace feature of a word processor (in plain text ASCII mode if it is not a Windows program) could be used for this. MicroGenie will ignore any spaces that are left in the file when it is recorded. It is essential that any code used to represent alternative bases in the sites is the same as that used by MicroGenie.

1. Prepare a list of the enzyme names with their recognition sites each on a separate line: first the name and then the corresponding site on the next line for each enzyme. If this has been saved as a plain ASCII file, using a DOS program, the file can be opened in the Windows Notepad or Write. If Write or another Windows word processor has been used in editing, then the data can be used directly from that program.
2. Use the mouse (or cursor keys) to highlight the data to be recorded into MicroGenie.
3. Copy the highlighted selection to the clipboard using the “Edit Copy” menu selection. Exit Notepad or the word processor if this is necessary before running MicroGenie. The user can double-click on the clipboard icon and examine the contents of the clipboard to confirm that it is correct, and can also save the contents of the clipboard in a file at this stage.
4. Run MicroGenie, and enter the password as normal.
5. At the MICROGENIE menu, press F for the Files section; see Note 16.
6. Press S for Search files.
7. At the SEARCH menu, press R to “Record a New File.”
8. The user will then see the Search file editor screen with a prompt asking for the name of a string. Hold down the Alt key, and press the spacebar to bring up the Windows control menu.
9. Press E to select Edit on the menu (or move the highlight to “Edit” with the cursor keys and press return).
10. Press P to Paste in the contents of the clipboard (or select “Paste” with the cursor keys and press return). The enzyme names and sites will then be entered as if the user had typed them and will be sorted into alphabetical order on the screen by MicroGenie.
11. When the data have been entered, press return at the prompt (which will be asking for the next name), and the user will be returned to the SEARCH menu.


12. The user may wish to print out the Search file by pressing P at the SEARCH menu, so that it can be checked for possible errors.
13. Press return until back at the MICROGENIE menu.

6.2. MicroGenie Data

Backup: The “Backup Sequences” option on the ENTRY menu does not back up Search files; there is a Backup option for these in the Files section.

Base uncertainty code: MicroGenie does not have an uncertainty code for bases in a DNA sequence. Any uncertainty codes that are entered will be recorded, but ignored in an analysis. There is a code for matching alternative bases in a search for sites; this is given in Appendix 1 of the MicroGenie manual and in response to “?” at the Files section “Record a new File” prompt for a “string.”

Compare sequences: There is a maximum of 400 matches; no output is produced if more are found.

Compare sequences align: When aligning two sequences, the lengths may be up to 30,000 residues. The program cannot insert unmatched gaps (deletions or insertions) greater than 1000 residues.

Compare sequences matrix method: The sequences that are compared cannot be longer than 30,000 residues.

Compare sequences multiple align: Up to 60 sequences may be aligned. The sequences can have lengths of up to 1000 residues.

Data bank passwords: BAC: Bacteria, INV: Invertebrate, MAM: Mammal, ORG: Organelle, PHA: Phage, PLA: Plant, PRI: Primate, PRO: Proteins, RNA: Structural RNA, ROD: Rodent, SYN: Synthetic, UNN: Unannotated, VER: Vertebrate, VIR: Viral.

Editing keys, input: When entering a response (such as a sequence name), it can be edited before return is pressed. The cursor left and right arrow keys move the cursor along the line. Press the Insert key to toggle into Insert mode; press Insert again to go back to overtype. Home moves the cursor to the left end of the line, and End moves to the right end of the line. Delete and Backspace delete characters. Pressing the Esc key clears the line.

Importing sequence files: A MicroGenie file on a floppy disk or under another user’s password is copied to the user’s password using the “Copy a Sequence” procedure in the Entry section. A DOS sequence file in ASCII text is imported using the “Record a new Sequence” procedure and specifying that it is to be entered from a file.

Output: The maximum number of pages (screens) of output is 999, not counting the pages used for the index.


Output display: To move through MicroGenie output on the screen, use the Page Up and Page Down keys if there are a number of pages (screens). To go direct to a page, enter the number of the page and press return. The End key will take the user to the end of the text, or to the start of the index if there is one. If there is an index, pressing End a second time moves to the end of the index. The Home key takes the user to the start of the output. If in the index, it takes the user to the start of the index and, if pressed again, to the start of the output.

Output file: Output is saved in the file OUTFILE in the root directory with the index in the file OUTFILE.IND. There is a limit of 80 characters per line in OUTFILE; this limit does not apply when the “Export Format” is used.

Password deletion: A password cannot be deleted if there are files stored in the directory. Search and Digest files must be deleted as well as Sequence files.

Restriction enzyme search: If no output is obtained when sites are known to be present, the user should check that he or she has a Search file of the required class.

Search data bank: Minimum search sequence length is 15 for nucleic acids and 7 for proteins. The maximum length is 2200 residues. Both strands are searched.

Search files: The search string can be up to 60 nucleotides or 20 amino acids (use the three letter code; the additional symbol “Any” matches all amino acids). The site name can have up to 20 characters; spaces are ignored.

Sequences: The maximum number of sequences that can be stored in one directory (password) is 820.

Sequence editor: A string of up to 10 characters may be located using the F7 “Locate String” key.

Sequence files: Sequences on a floppy disk are stored in a directory called 21.DSK. The user’s files are stored in directories 1.DSK to 20.DSK on the hard disk.

Sequence names (for Shotgun merge): Should not have more than 12 characters so that the full name will be visible in the Merge editor. Contig names should not be longer than 10 characters.

Sequence names: May have up to 40 characters in the name, but the user cannot include “*” and “:” (spaces may be used for clarity, but are ignored in use, i.e., “seq 1” is the same as “seq1”). When the names are displayed in the List Sequences screen, they are sorted on their ASCII codes; this will put them in alphabetical order, but numbers will come before letters. If the first character is numeric, and the user wants them listed in numeric order, use leading zeros in the name (e.g., “20seq” will be listed after “123seq,” but “020seq” will be listed before it).

Sequence numbering: If selecting a limited range of a sequence for analysis, the residues will be numbered as in the original sequence. If transforming a sequence to make a new sequence, the residues will be renumbered.

Shotgun merge: In Version 7.01, sequences of up to 600 bases can be merged. There may be a lower limit if there is not sufficient free memory. The total of the merge sequences, any seed, and excluded vector sequences must not exceed 60,000 bases.

Star notation: The “*” character can be used to represent any number of characters in any part of a sequence name (not just all characters from the “*” to the end of the name as in DOS).
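The rules given above for sequence names (spaces ignored, listing by ASCII code, and the star notation) can be tried out with the following Python sketch. It is a model of the stated rules only, not of MicroGenie itself: the function names are hypothetical, the 40-character limit is assumed to apply after spaces have been removed, and the treatment of upper and lower case is not stated in the text and is left as plain ASCII ordering here.

```python
import re


def normalise(name):
    """Spaces are ignored in use, so 'seq 1' and 'seq1' are the same name."""
    return name.replace(" ", "")


def is_valid_name(name):
    """Up to 40 characters, without '*' or ':' (limit assumed to exclude spaces)."""
    stripped = normalise(name)
    return 0 < len(stripped) <= 40 and "*" not in stripped and ":" not in stripped


def listing_order(names):
    """Sort on character codes: digits come before letters."""
    return sorted(names, key=normalise)


def star_match(pattern, name):
    """'*' stands for any number of characters anywhere in the name."""
    parts = [re.escape(part) for part in normalise(pattern).split("*")]
    return re.fullmatch(".*".join(parts), normalise(name)) is not None
```

For example, `listing_order(["20seq", "123seq", "020seq"])` returns the names in the order 020seq, 123seq, 20seq, which is the behavior that the leading zero is intended to exploit, and `star_match("*seq", "020seq")` is true because the star may stand at the start of the name.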

References

1. Korn, L. J., Queen, C. L., and Wegman, M. N. (1977) Computer analysis of nucleic acid regulatory sequences. Proc. Natl. Acad. Sci. USA 74, 4401-4405.
2. Queen, C. and Korn, L. J. (1984) A comprehensive Sequence Analysis Program for the IBM Personal Computer. Nucleic Acids Res. 12, 581-599.

CHAPTER 16

MicroGenie: Shotgun DNA Sequencing

I? Davies and R. K. Merrifield

From: Methods in Molecular Biology, Vol. 24: Computer Analysis of Sequence Data, Part I. Edited by A. M. Griffin and H. G. Griffin. Copyright © 1994 Humana Press Inc., Totowa, NJ

1. Introduction

The shotgun merge section of the MicroGenie program is used to combine a number of overlapping DNA sequences into a single, larger, contiguous sequence (contig). This is particularly useful for constructing the continuous sequence of a cDNA clone from its respective restriction fragments or for assembling the data generated by sequencing the exonuclease digestion subclones (deletion cloning) (1) of a cDNA clone. MicroGenie searches for regions of overlapping sequence between any of the selected sequences and, when found, will merge these fragments, by their common regions, into the growing contig until further merges are no longer possible. The program is able to match sequences on either strand, and it is desirable to be able to detect and remove any cloning vector sequences that may be present.

2. Materials

Version 7.01 of MicroGenie can merge sequences up to 600 bases long; earlier versions had lower limitations in the shotgun merge method. The other materials required are as described in previous chapters; the CD drive is not essential for this application. It is desirable to have adequate free disk space, since the shotgun sequencing method requires the production of large numbers of sequences. The files produced when sequencing 2.3 kb of the promoter for Tcp-10a by the deletion cloning method occupied about half a megabyte of

disk space. If a number of users have shotgun sequencing projects, the requirement for disk space could become an important factor.

The shotgun sequencing approach involves the manipulation of large quantities of sequence data. Therefore, it is desirable to automate the gel (autoradiograph) reading and data input process as much as possible. In the absence of expensive fully automated gel reading machines, the use of a digitizer is a considerable help. The sonic digitizer was suggested for this purpose by Rodger Staden at Cambridge (2), since it may be placed over an existing light box when reading the autoradiograph of the gel. A stylus is pressed on each band on the autoradiograph, and the sound produced by a small spark at the end of the stylus is detected by two microphones on the body of the digitizer. The position of the stylus relative to them is calculated by triangulation using the rate of sound propagation.

A system incorporating a digitizer and light box for use with MicroGenie was sold by Beckman Instruments, Inc. (see Chapter 15, Section 3 for address) with the trade name GelMate 2000. This is no longer available, but MicroGenie can be used with an alternative digitizer and a suitably sized light box, preferably recessed into the desk top to give a comfortable working height. The sonic digitizer that we use with MicroGenie is the GrafBar GP-7 (Science Accessories Corp., Southport, CT, or P.M.S. [Instruments] Ltd., Waldeck House, Reform Road, Maidenhead, Berkshire, SL6 8BX, UK). Note 1 gives details of the installation of the digitizer.

If only one copy of MicroGenie is available and it is heavily used for analysis and data bank searching, the work load on that computer may be reduced if the digitizer is moved to another machine or an additional digitizer is used on a second computer. The sequences should be saved in plain text files, which can be recorded into MicroGenie as described below.
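The triangulation step mentioned above for the sonic digitizer can be made concrete with a little geometry. The Python sketch below assumes two microphones on a common baseline, one at the origin and one a known distance along the x axis, and converts the two arrival times of the spark sound into a stylus position. The layout, the function name, and the fixed speed of sound are assumptions made for illustration; they do not describe the internal workings of the GrafBar GP-7.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second in air at about 20 degrees C


def stylus_position(t1, t2, baseline, speed=SPEED_OF_SOUND):
    """Locate the stylus from the sound travel times to two microphones.

    Microphone 1 is taken to be at (0, 0) and microphone 2 at (baseline, 0);
    t1 and t2 are the travel times in seconds. Returns (x, y) in metres,
    with y measured away from the line joining the microphones.
    """
    d1 = speed * t1  # distance from microphone 1
    d2 = speed * t2  # distance from microphone 2
    # Subtracting the two circle equations x^2 + y^2 = d1^2 and
    # (x - baseline)^2 + y^2 = d2^2 leaves a linear equation in x.
    x = (d1 ** 2 - d2 ** 2 + baseline ** 2) / (2.0 * baseline)
    y_squared = d1 ** 2 - x ** 2
    if y_squared < 0:
        raise ValueError("travel times are not consistent with the baseline")
    return x, math.sqrt(y_squared)
```

Because the speed of sound varies with air temperature, a real device would need some form of calibration; the constant used here is only a typical room-temperature value.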
A public domain program for DNA sequence entry using a digitizer, written by David Judge for data input to Rodger Staden’s programs, is ideal for this; see Note 2. This program, called Readgel, is available from D. Judge, Room 204, Department of Genetics, University of Cambridge, Tennis Court Road, Cambridge CB2 3H, UK, on receipt of two formatted DOS disks. Regular backups of the sequences should be made, since loss of all the sequences near the completion of the project owing to hardware


or other problems would require a lot of work rereading autoradiographs. See Note 3.

3. Methods

3.1. Sequence Entry by Digitizer

See Note 4 for preliminary information on the use of the digitizer.

1. Start MicroGenie, and at the MICROGENIE menu, press E for the Entry section.
2. At the ENTRY menu, press R to “Record a new Sequence.”
3. The user is then prompted to “Give a name to the Sequence.” Enter a name for the sequence; it should not contain more than 12 characters if it is to be used in a shotgun merge; see Note 5. Press return at the end of the sequence name (to indicate that the name is complete), and the program will proceed to the next stage. If return is pressed instead of entering a name, the user is taken back to the ENTRY menu.
4. The user is then asked if the sequence is DNA, RNA, or protein. The default is D, for DNA. (RNA sequence can be entered from the digitizer; U will be substituted for T, but protein sequence can only be entered from the keyboard or a file.)
5. To answer the question of whether the sequence is linear or circular, press L (for linear) or press return, since linear is the default.
6. The user is then asked to enter a sequence comment, which is saved with the sequence. This can contain up to three lines of text, and the normal editing keys can be used to make any alterations before the enter key is pressed; see Note 6.
7. To the prompt “Do you want to enter the sequence from the Keyboard, Digitizer or File?” press D for digitizer (the default is to enter sequence from the keyboard).
8. When asked if nucleotides should be entered from the 5’ or 3’ end, press return for the default 5’ direction used in dideoxy sequencing.
9. The program will normally sound a different tone for each base to confirm successful recording. To avoid this, press N in response to the question “Do you want Sound Verification?” Pressing Y, or return, selects sound verification for the successful entry of each base from the digitizer.
10. The program then asks the user to enter the order of the lanes on the autoradiograph, from left to right. Pressing return selects the default order TCGA; see Note 7. If another order for the lanes is used on the sequencing gel, enter it here.


11. The user is now taken into the sequence editor screen, which is essentially the same as that used for the entry of sequence from the keyboard and allows the user to view and edit the sequence that is entered from the digitizer. The word Initialize is highlighted in the lower left of the screen. This is to prompt the user to initialize the digitizer in order that the program may determine the starting positions of the lanes. At the bottom of the gel (autoradiograph), just below the first band to be recorded, the user should press the stylus on the center of the first lane. A click will be heard from the stylus, with a tone from the computer, and the letter T will appear on the display if using the default lane order. (If the click and letter do not occur, a message at the top of the screen may give information about the problem, e.g., the gel is outside the range of the digitizer.) Move over to the second lane, at the same level, press the stylus in the center of the lane, and continue doing this for the remaining lanes in order. Then move the stylus about 1 cm up the gel (the actual distance does not seem to be critical), and again digitize the center of each lane in turn. If a mistake is made during this process, or if at a later stage the program seems to have lost track of the position of the lanes, the user can repeat the initialization process after pressing the F1 key.
12. When the initialization process is complete, the word Initialize is no longer highlighted, and a flashing cursor appears at the top left corner of the sequence editor screen. Press the stylus of the digitizer on the center of each band in turn while moving up the gel until the top of the gel is reached, where it is no longer possible to read sequence reliably. If sound verification has been selected, there will be a different tone for each lane to confirm successful entry. As the user enters the bases, they will appear on the editing screen in order, with gaps inserted every ten bases. The program modifies its values for the positions of the lane centers as the reading progresses. If it appears to have been unable to follow a distortion on the gel and does not correctly identify the bases, press F1 and repeat the initialization process; see Note 8.
13. After entering a sequence, the user can return to the ENTRY menu as described below, or if the user has a series of sequences on the same autoradiograph, F2 can be pressed to enter another sequence. The ENTRY:Record editor screen will be cleared, and a new sequence name will be generated by appending a number to the name of the previous sequence recorded (or incrementing the last digit in the name if it ends with a space followed by a number) (see the second part of Note 14). Then go back to the digitizer initialization step and enter the next sequence. The user can proofread or edit the sequence before continuing; see Notes 9 and 10.


14. When the sequence (or the last sequence on the autoradiograph if using F2 to restart each sequence) is read as far as is possible with confidence from the gel, press return twice to exit back to the ENTRY menu. Before pressing return, the user can use the arrow keys to move the cursor around the recorded sequence on the screen and make corrections. See Note 10.
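The way in which a digitized band is turned into a base in steps 10 to 12 (a lane order, lane centers set at initialization, and centers that are adjusted as reading progresses) can be sketched as follows. This Python example is a guess at the general idea and not MicroGenie's method: the class and function names, the nearest-center rule, and the tracking factor of 0.2 are all assumptions introduced for illustration.

```python
def initial_centres(first_pass, second_pass):
    """Average the two initialization passes to get one centre per lane."""
    return [(a + b) / 2.0 for a, b in zip(first_pass, second_pass)]


class LaneReader:
    def __init__(self, lane_order, centres, tracking=0.2):
        self.bases = list(lane_order)   # e.g. "TCGA", read left to right
        self.centres = list(centres)    # horizontal position of each lane
        self.tracking = tracking        # how quickly a centre follows drift
        self.sequence = []

    def read_band(self, x):
        """Assign a stylus position to the nearest lane and record its base."""
        lane = min(range(len(self.centres)),
                   key=lambda k: abs(self.centres[k] - x))
        # Move the lane centre part of the way toward the new reading so
        # that gradual distortion of the gel is followed.
        self.centres[lane] += self.tracking * (x - self.centres[lane])
        self.sequence.append(self.bases[lane])
        return self.bases[lane]

    def display(self):
        """Show the sequence with a gap after every ten bases."""
        text = "".join(self.sequence)
        return " ".join(text[i:i + 10] for i in range(0, len(text), 10))
```

A reader created with the default lane order, `LaneReader("TCGA", initial_centres(first, second))`, returns T, C, G, or A for each call to `read_band`, and `display()` groups the result in tens as on the editing screen. If the lanes drift faster than the centers can follow, the centers have to be set again, which corresponds to repeating the initialization with F1.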

3.2. Recording Sequence from a Data File
When sequence data have been recorded to a plain text file using other sequence analysis programs or a word processor, they can be recorded into a MicroGenie sequence file using the “Record a new Sequence” procedure in the Entry section; this converts the plain text file into a MicroGenie format sequence file. If importing sequence recorded by another program, the user should make sure that no uncertainty codes have been used in the sequence, since MicroGenie will include, but ignore, letters other than A, G, C, and T or U. If a sequence has been recorded using another copy of MicroGenie and sent as a MicroGenie file, it does not need to be converted into MicroGenie format and can be read in using the “Copy a Sequence” procedure. If the sequence is on a floppy disk, precede the sequence name with FLO: when giving the name of the sequence to copy.
1. Follow the method described above for Sequence Entry using a digitizer up to Section 3.1., step 7. When asked, “Do you want to enter the sequence from the Keyboard, Digitizer or File?” press F to select input from a file.
2. When asked “What is the name of the file containing the sequence?” enter the file name. If the sequence is on a floppy disk, it can be copied to the hard disk before starting MicroGenie, or the user can specify the drive as part of the file name. Use the normal DOS file name; the drive, subdirectory path, and a file name extension can be included if appropriate; see Note 11. If the user has a number of files to record, the file name can be specified using the DOS “wildcard” “*” symbol to replace part or all of the file name and extension. If this is done, MicroGenie will not use the sequence name that it asked for originally at Section 3.1., step 3, but will use the corresponding DOS file names for the sequences. If the user anticipates recording some sequences directly into MicroGenie and some from files, perhaps in a collaborative sequencing project, it would be advisable to devise a naming system for the data files and directly recorded sequences, so that they are consistent when recorded in MicroGenie.
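Because letters other than A, C, G, and T or U are stored but not treated as sequence, it can be worth checking a plain text file before recording it. The following Python sketch shows one way to do this outside MicroGenie; the function name and the sample string are illustrative assumptions, not part of the program.

```python
from collections import Counter

VALID_BASES = set("ACGTU")

def unexpected_letters(text: str) -> Counter:
    """Count letters that are not A, C, G, T, or U in plain-text sequence data."""
    found = Counter()
    for ch in text.upper():
        # Digits, spaces, and line breaks are skipped: the text states that
        # numbers are removed when MicroGenie records a sequence.
        if ch.isalpha() and ch not in VALID_BASES:
            found[ch] += 1
    return found

# Hypothetical file content with two uncertainty codes (n and r).
sample = "1 acgtn acgtr 11 ttga"
report = unexpected_letters(sample)
```

An empty result suggests that the file contains only the four normal bases and can be recorded as described above.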

3.3. Sequence Entry from the Keyboard
Follow the method for Sequence Entry using a digitizer, and at Section 3.1., step 7, select K for sequence entry from the keyboard (this is the default if return is pressed). The user is then taken into the ENTRY:Record editor screen as for digitizer input, but the sequence is entered by typing in the bases. The cursor is initially at the bottom of the screen and has to be moved to the top left-hand corner, by pressing the Home key (or the End key to go to the end of the zero length sequence), before entering the sequence. The editor will automatically space and number the sequence as it is typed in. Use the cursor keys to move around the sequence to make any alterations. Bases can be inserted or deleted. To insert sequence, press the Insert key and the cursor changes to a flashing block to indicate insert mode; press the Insert key again to leave insert mode. Proofread mode (entered by pressing F4) enables the user to reenter the sequence, and it will be compared to the original sequence entry. If a difference is detected, the computer will bleep as a warning; enter the correct base, or use the editing keys to make more extensive changes if necessary.

3.4. Entering Sequence from Printed Text
Sometimes it is necessary to enter sequence that has been sent in printed form or published in a journal. A scanner and optical character recognition system would probably be ideal, but otherwise, the ENTRY:Record editor can be used to enter the sequence. It may help if the sequence is photocopied on an enlarging copier to twice or more its original size, so that the bases can be marked off as they are entered. Some workers find this easier if one person reads the sequence and the other types it in. If sequence has to be entered into a computer using a word processor instead of MicroGenie, it is helpful if the page width is set so that the number of bases per line is the same as that of the printed sequence.
Possible confusion between C and G can be reduced if the sequence is typed in lower-case letters. The sequence should only contain the letters A, C, G, and T or U; MicroGenie will record other letters, but they will not be recognized as DNA sequence. Numbers will be removed as MicroGenie records a sequence and may be left in the file if they provide useful position markers. The sequence must be saved as a plain text (ASCII, American Standard Code for Information Interchange) file, so that it can be recorded into MicroGenie as described in Section 3.2. To save a plain ASCII file, use Non Document mode for WordStar, “Transfer, Save, Text Only” for Microsoft Word, or “Text In/Out, DOS Text” for WordPerfect. The digitizer can be used instead of the keyboard for the entry of sequence from a listing on paper. Four lanes are marked in front of the digitizer, and the digitizer is used to record the bases in these lanes as if reading a sequencing gel.

3.5. Merging Shotgun Sequences

1. From the MICROGENIE menu, press M for the Merge section.
2. At the MERGE menu, press S for “Shotgun merge Sequences.” Other options offered are to merge two sequences, to Edit and correct sequences in a shotgun merge, or to print the results of a shotgun merge.
3. The user is then asked for the name of the sequences to merge. Use the “*” notation to select the sequences to be included in the merge, e.g., * for all sequences in the directory or AB* for all sequences with names beginning with AB; see Note 14 regarding choice of a naming convention to allow selective merging. If return is pressed here, the merge uses only seed sequences (requested at the next prompt), e.g., if joining existing contigs.
4. The program next asks for the name of any seed sequences; press return if there are no seed sequences. These are parts of the sequence that have already been determined (contigs) or sequences, such as closely related genes, that are used to force a particular merge rather than possible alternatives; see Note 15. A number of seed sequences may be specified using the “*” notation.
5. The user is then asked for the name of any vector sequence to exclude. If the user has sequences of one or more vectors that have been used in the cloning and sequencing experiments, their names can be entered here (using the “*” notation if more than one). Any sequence from the vector that is found in a gel sequence will automatically be excluded from the merge. A data bank abbreviation (e.g., SYN:) can be included as part of the vector (or other sequence) name. It may be necessary to copy vector sequences from the data bank into the user’s own directory and rename them to make use of the “*” notation. Press return if there are no vector sequences to be excluded from the merge; see Note 16. There is a total limit of 60,000 nucleotides for the gel sequences, the seed, and any excluded vector sequences.


6. Enter an estimate of the accuracy of the gel reading. Pressing return causes the program to use a default value of 96% (the range accepted is 94-98%). Giving a higher value for accuracy will decrease artifacts or repetition, but if the gel reading accuracy is not as good as the value entered, the merge will be very slow and may fail to merge some sequences successfully. The MicroGenie manual recommends discarding any gels that cannot be read to the user’s normal standard of accuracy rather than including them in the merge.
7. The program then asks for the length of overlap for two sequences to be merged directly. The default is 50 bases; use a smaller value if the average length of the sequences is less than 250 or if merging existing contigs. When working with a seed sequence, this value can be decreased to a minimum of 20 without fear of spurious merges.
8. The user is now asked for a contig name. MicroGenie will add a number to the end of the name that the user gives when it generates a number of contigs. The contig name may contain up to 10 characters. Pressing return without giving a name cancels the merge.
9. The merge begins, and a series of messages on the screen report on the successive stages: merging gels, aligning gels, and recording. If the merge has been successful, the user will see details of the number and names of the contigs. If the merge was not possible, there will be a message informing the user of the problem. The user is then automatically returned to the MERGE menu, from which the user can choose to Print, or Examine and Edit, the merged contigs. Press P to Print out the contig(s) with the sequences; this output may be rather long, and the user might want to examine it using the merge editor first. See Notes 17 and 18 for details of some potential problems.
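The limits quoted in steps 3 to 8 can be collected into a short pre-merge check. The Python sketch below is only an illustration written outside MicroGenie: the class, the function names, and the default contig name are hypothetical, and it assumes that the “*” notation behaves like a simple shell-style wildcard.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase

MAX_TOTAL_NUCLEOTIDES = 60_000  # gel sequences + seed + excluded vector (step 5)

@dataclass
class MergeSettings:
    accuracy_percent: int = 96   # default; accepted range is 94-98 (step 6)
    min_overlap: int = 50        # default overlap in bases (step 7)
    contig_name: str = "contig"  # hypothetical name; up to 10 characters (step 8)

def select(lengths: dict, pattern: str) -> list:
    """Names matching the pattern, e.g. 'AB*' (ASSUMED shell-style matching)."""
    return sorted(name for name in lengths if fnmatchcase(name, pattern))

def problems(settings: MergeSettings, lengths: dict, chosen: list) -> list:
    """Return readable warnings; an empty list means nothing was flagged."""
    found = []
    if not 94 <= settings.accuracy_percent <= 98:
        found.append("accuracy outside the accepted range of 94-98%")
    if settings.min_overlap < 20:
        found.append("overlap below 20 bases, the minimum suggested even with a seed")
    if not 1 <= len(settings.contig_name) <= 10:
        found.append("contig name must have 1-10 characters")
    if sum(lengths[name] for name in chosen) > MAX_TOTAL_NUCLEOTIDES:
        found.append("selected sequences exceed 60,000 nucleotides in total")
    return found

# Hypothetical directory of sequence names and lengths.
lengths = {"AB 1": 310, "AB 2": 280, "CD 1": 295}
chosen = select(lengths, "AB*")
warnings = problems(MergeSettings(), lengths, chosen)
```

Running such a check on a list of sequence names and lengths before starting the merge can save a slow or failed run.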

3.6. Editing Shotgun Merge Results
The shotgun merge procedure produces a special merge output file. This can be examined and edited using the Merge editor. The merge output file is called MERFILE, and the index file is called MERFILE.IND. Both are saved in the root directory of the hard disk.
1. At the MERGE menu, press E to “Edit a shotgun Merge.”
2. The MERGE:Edit screen shows the first contig listed across the top of the screen, and the sequences from which it was merged are aligned below it. Bases that do not match the consensus are shown as lowercase letters or spaces. The Page Up and Page Down keys can be used to move through the contig. At the end of the output, there is a table, followed by an index to the output giving the page numbers on which each contig begins. The table shows the gel sequence names, the number of the contig that contains them, their position in the contig, and the percent match. This helps to identify where in the contig each individual sequence is located and gives a measure of the agreement of each particular sequence with the contig. If a particular sequence has a low level of agreement, the user may wish to reread the gel or exclude that sequence from the shotgun merge. The merge editor can be used to correct any mismatching bases, after checking the original autoradiographs. The action of the merge editor is similar to that of the sequence editor. The cursor keys can be used to move to any base on the screen. At the bottom of the screen, there is a list of the function keys that can be used to find specific regions in the sequence. Press F9 to go directly to a page of the output, and enter the page number in response to the prompt. Press F5 to locate a particular base number. To locate a string of up to 10 bases, press F7, type in the string, and then press return.
3. When any changes have been made, press return twice to leave the editor. The user will then be given a choice to: (1) record the sequences that have been changed, (2) record all changed sequences in the merge file, or (3) not record any of the changes. Press the number corresponding to the choice. The changes are already part of the merge output file even if they are not recorded, and they will remain in the file until the next time the shotgun merge is used. If the user is not sure that all of the changes that have been made are correct, or wishes to check information on the original autoradiograph, option 3 should be chosen. Once the changes have been verified, the user can return to the Edit shotgun merge function, make any additional changes, and then choose option 2. This will record all changes made to the merge output file and will amend the original sequences to reflect these changes.
If it is not necessary to verify the changes, choose option 1 and the changes will be recorded immediately.

4. Notes

4.1. Materials
1. The digitizer should be connected to the serial port of the computer. If there are two serial ports, it may be connected to either COM1: or COM2:, and the port in use specified in the Setup section (which is run from the main MICROGENIE menu). It is not necessary to use the DOS MODE command to configure the port, since MicroGenie sets its own communications parameters. The serial interface of the digitizer should be set to 1200 baud, odd parity, and 2 stop bits (the GP-7 is normally supplied with these settings). These settings may be changed if necessary using the DIP switches as described in the GrafBar GP-7 instruction book. The cable needed to connect the digitizer is a 25-pin “straight-through” serial cable, i.e., Pin 2 on the digitizer connected to Pin 2 on the computer, 3 to 3, and 7 to 7. MicroGenie does not use the hardware handshake lines, and the other pins do not need to be connected if a cable has to be made by the user. (If the user should move to using PC/GENE, a similar cable will be needed, but the pins 6 to 6 and 20 to 20 should also be connected.)
2. If using the Readgel program from David Judge to digitize sequence for recording into MicroGenie, a name for the “file of file names” is requested; this will not be used, and an arbitrary name can be given. The output file is in plain ASCII text, and the file name must begin with a letter (the first character cannot be a number) and have the file name extension “.GEL”. MicroGenie does not recognize base uncertainty codes, and only the four normal bases should be included in the sequence. Instructions are provided with the program. Run it by the command READGEL -1 to change the default lane order from TCGA.
3. A cartridge tape backup system that can rapidly copy the entire contents of a hard disk to tape makes it quick and simple to perform a weekly or daily backup in case of hardware failure. However, it is safer to have two independent backup systems. If an index file should become corrupted, this may be copied onto all the backup tapes before the problem is noticed. If the index is corrupted, some or all of the sequences may become inaccessible. An intact index file should be constructed when the sequences are copied to floppy disk, and this may salvage some of the files. It may be a worthwhile precaution to have at least two sets of backup disks, and keep one set at another site in case of fire or other disaster.

4.2. Sequence Entry by Digitizer
4. The autoradiograph of the sequencing gel and the body of the digitizer must be fixed to the light box so that neither can be moved accidentally during the sequence input. Masking tape or autoclave temperature-indicating tape is commonly used to secure the autoradiograph to the light box, since both are easily removed after use. If sequencing reactions have been loaded in all lanes of the gel, it is often helpful to delimit each set of four lanes with a felt-tip marker. This facilitates entry and helps to prevent the operator from becoming cross-eyed. The digitizing stylus, or pen, should be held with the narrower edge, usually marked with a colored line, facing toward the body of the digitizer so that the spark gap is not obstructed by the tip of the stylus. There should be an unobstructed path between the tip of the digitizer stylus and the digitizer. Do
not try to read sequence within about 2 in. of the body of the digitizer. There is a menu bar area in front of the digitizer. If the stylus is pressed in the menu bar area, the green LED will come on to indicate that the menu bar is active. The menu bar can be used to reset the origin of the digitizer, change the units, or put it in stream mode (where a rapid series of sparks is generated to trace outlines). If the menu bar is unintentionally activated, press the stylus in front of the word “CANCEL” at the right-hand end of the digitizer to resume normal operation. If this is not successful, it may be necessary to turn the digitizer off and on again to reset it.
5. A sequence name can normally consist of up to 40 letters or symbols (except “*” and “:”). It is advisable to keep the names as short as possible, consistent with a logical naming system, to accommodate the large numbers of sequences to be entered for shotgun sequencing. Very long names become difficult to remember, and it is beneficial to limit the number of key strokes, since names have to be typed without any mistakes if it should be necessary to access an individual sequence. See Note 14 for a discussion of a logical naming system and its use with the “*” notation. If reading a number of sequences from the same gel, MicroGenie can automatically name them by appending a number to the name that was given to the first one if the F2 key is pressed; see Note 14.
6. Changes can be made to the sequence comment later using the “Alter a Sequence” option on the ENTRY menu. It is useful to include details, such as the date, reference number of the sequencing gel, and which lanes contained the sequence being recorded. This is the only record that remains with the DNA sequence, and although the user may know exactly what is being recorded at the time, after a few months it may be very difficult to remember the details.
7. If using the default lane order, the user will find that with practice he or she will enter the sequence name, press return four times (including the one after the name), enter the comment, press D, and then press return three more times. MicroGenie stores key strokes, so the user can type ahead, and if using a machine with a slow processor, it will catch up when the user enters a longer response or pauses. However, if a mistake is made when doing this, it may be necessary to go back to the beginning to start again.
8. Occasionally the digitizer refuses to recognize entries at the extreme left or right of the gel, particularly when near the top of the sequence. If this occurs, mark the last base entered, and move the gel so the sequence being entered is closer to the midline of the digitizer. Then reinitialize
and then carry on with the recording. It is useful to have a felt-tip marker handy in case it is necessary to stop digitizing for any reason. Then the user can put a small dot on the last base entered, so that entry can resume at that point (the user’s position can be located using F3 if the gel has not been moved; see Note 10).
9. If the user wishes to reenter the sequence as a check on possible errors, press F4 to enter Proof mode. Repeat the digitizing process. Each base entered will be compared with the original one and a warning sounded if there is a difference. The base will be highlighted for the user to enter the correct base, and then the user can continue proofreading. This is not as necessary in shotgun sequencing as in other approaches, since the method relies on producing numbers of overlapping sequences to generate the consensus.
10. When the sequence is recorded, before pressing return, the arrow keys can be used to move the cursor around the sequence on the sequence editor screen and make corrections. There is a search mode to enable the user to locate his or her position on the autoradiograph (gel) if interrupted, provided that it has not been moved relative to the digitizer. Press F3 and touch the digitizer stylus on the autoradiograph. A message on the screen will inform the user if the stylus is above the last base entered or if it has been located. We have found that using a felt-tip pen to mark the last position is more reliable. To locate a specific set of bases on the screen, press the F7 key and then enter a string of up to 10 bases that will be highlighted on the screen when return is pressed. This is useful if it is necessary to find a particular part of the sequence to check. If the user moves the cursor from the end of the sequence, any further data entered from the digitizer will start at the current cursor position and overwrite any following sequence that was already entered. This is appropriate if reentering sequence after discovering a mistake, but could be a problem in other circumstances. To insert a base at the cursor position, press the Insert key, and the cursor will change to a flashing block to indicate that bases entered from the keyboard or digitizer will be inserted, rather than overwriting existing bases as normal. Press the Insert key again to leave insert mode. Press F10 to insert sequence at a specific base number.
11. If recording sequence data from a file, the user can specify the drive, path, and extension in addition to the file name. When MicroGenie asks for a file name for a file that it is writing to export or to store output on a floppy disk, it requires a file name of not more than eight characters, which it saves to the root directory, and does not accept a file name extension or path. With DOS 4.01 on some computers the “Record from
a file” procedure seems unable to find the sequence on a floppy disk, and it may be necessary first to copy it to the hard disk.
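The naming rules given in Notes 5 and 11 are easy to overlook when names are prepared in advance, for example in a collaborative project. The Python sketch below simply restates those two rules; the function names are hypothetical, and treating “.”, “\”, “/”, and “:” as the characters that signal an extension or path is an assumption about DOS names rather than documented MicroGenie behavior.

```python
def valid_sequence_name(name: str) -> bool:
    """Note 5: up to 40 letters or symbols, excluding '*' and ':'."""
    return 0 < len(name) <= 40 and "*" not in name and ":" not in name

def valid_export_file_name(name: str) -> bool:
    """Note 11: at most eight characters, with no extension or path."""
    # ASSUMPTION: '.', '\\', '/', and ':' are taken to indicate an
    # extension, a path, or a drive letter.
    return 0 < len(name) <= 8 and not any(ch in name for ch in ".\\/:")
```

Checking names before a long recording session avoids having to rename sequences later with the “Alter a Sequence” option.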

4.3. Merging Shotgun Sequences

12. Chapter 6.2 of the MicroGenie manual contains helpful advice on the shotgun sequencing method and possible problems that may be encountered. If the DNA being sequenced is a region that codes for a protein, the “Find possible coding regions” procedure in the Analysis section should show the coding region predominantly in one reading frame. If the reading frame abruptly changes, this could indicate that a base has been missed, and it would be advisable to check the sequence carefully in such a region.
13. In our hands, using data generated by the sequencing of exonuclease digestion products from subcloned DNA fragments originally approx 2.5 kb in length, the regions of overlap often corresponded with the hard-to-read, upper regions of the gels. This presented some difficulty, since the accuracy of reading these regions was often less than the accuracy suggested by the MicroGenie program, and lowering the accuracy settings can result in the merging of unrelated sequences. However, our problem was overcome to some degree by splitting the sequencing reactions in half and running one-half on a flat (nonwedge) gel for 6-7 h. The other half of the reaction was run in the conventional fashion on a wedge gel for 2-3 h. Both reactions were then entered into the MicroGenie program under separate names, which could be combined during the shotgun merge process. Using this protocol, we were routinely able to obtain approx 400 nucleotides from each sequencing reaction, which considerably extended the regions of overlap and increased the accuracy of the merges.
14. It is useful to arrange the names of the various sequences recorded to take advantage of the “*” option available for naming the sequences to be included in the merge. If there are any natural subsets to be run as separate shotgun merges, they can be identified by a separate part in their names. For example, all sequences the user may want eventually to include in the merge could have the first letter “a.” A subset, perhaps from a particular restriction fragment the user may want to deal with separately, could have a second letter “b”. Then, each individual segment could have an identifying number. Should the user wish to run all DNA segments in the shotgun merge, the sequences to be run would be identified as “a*”. To run the subset as a separate shotgun merge, they would be identified as “ab*”. Obviously, this can be extended to identify a large number of separate subsets, but it is much easier to plan this
in advance than to have to change the names later using the “Alter a Sequence” function on the ENTRY menu. To make use of MicroGenie’s ability to increment the number at the end of a sequence name automatically, when pressing F2 to enter another sequence, give the first sequence a name that ends with a space and then a number, such as “test 1.” For the names to continue from the last one in a previous series, enter the name with the next number in the series (e.g., “test 32” if the last in the previous series was “test 31”).
15. A seed sequence is used to initialize the merge process, so that the merging process starts with the seed sequence and the other sequences are merged using it as the starting point. This is particularly useful if merging sections of sequence of a cDNA or gene that is closely related to a predetermined homolog. The seed sequence will then establish the backbone of the merge and virtually eliminate the merging of unrelated sequences. The seed sequence will also be of use if there are not sufficient areas of overlap for MicroGenie to accomplish the merge accurately. Input sequences will be arranged on the seed sequence “backbone,” and this will clearly demonstrate where further data need to be generated in order to link the sequences already entered.
16. It is essential that all vector sequences be removed from the sequenced segments before they are run in the “Shotgun merge Sequences” function. Since vector sequence will be at the beginning or end of DNA segments, the program will otherwise join the segments by the homologous vector sequences. Although it is possible to obtain the complete vector sequence from a data bank and enter it into the data bank containing the user’s own sequences, it is often large and cumbersome to use. We found it sufficient to enter the sequence of the polylinker for the vector we were using into our data bank. Then, after entering a series of DNA sequences, we searched our data bank using the “Search the Data bank” function with the polylinker sequence. This revealed the junctions of the polylinker with the DNA of interest, which could be pinpointed by looking for the appropriate restriction enzyme site. We then deleted the nonrelevant sequence using the “Alter a Sequence” function. If individual sequences are discovered that do not merge with the rest of the data generated, it may then be useful to search the user’s data bank with the full sequence of the vector being used to see if any homologies are detected. When the user has the final sequence, it may be a worthwhile precaution to search it against all the vectors used in its isolation in case any of their sequences have become incorporated. A search of the data bank using regions of bacteriophage λ and pBR322 suggests that
this can happen, and over 70 occurrences of M13mp18 vector sequences in the vertebrate and bacterial sections have been reported (3).
17. Another point of potential concern to users of this program is the presence of repeated sequences, such as dinucleotide repeats. These do not present a problem unless they are in a region of overlap where the program tries to merge two sequences by the repeated regions. Therefore, one solution that we found useful was to use the “Join Sequences” function in the Entry section to join the two segments, so that the repeat region was in the middle of the sequence and not in a potential merge area. The compound sequence can then be included in the “Shotgun merge Sequences” operation as long as it does not exceed the 600-bp limit.
18. Occasionally, we encountered sequences where the region of overlap would not be recognized by the “Shotgun merge Sequences” function, resulting in the production of two or more contigs that were related by their overlapping termini. We found it helpful to use one of the terminal sequences to search the data bank containing the sequences to be merged. This process revealed the region of overlap, and it was then possible to use the “Join Sequences” function to join the two overlapping segments and run the product in the “Shotgun merge Sequences” function.
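The manual rescue described in Notes 17 and 18 (searching with a terminal sequence, then joining) can be illustrated with a toy Python function. It is not the MicroGenie merge algorithm: it requires an exact match, whereas real gel readings contain errors, and the function name and the probe length are assumptions chosen for the example.

```python
from typing import Optional

def join_by_terminal_overlap(left: str, right: str, probe_len: int = 20) -> Optional[str]:
    """Join two sequences when the end of `left` is found exactly once in `right`."""
    if len(left) < probe_len:
        return None
    probe = left[-probe_len:]  # terminal sequence used as the search string
    hits = [i for i in range(len(right) - probe_len + 1)
            if right.startswith(probe, i)]
    if len(hits) != 1:
        # No hit, or several hits: the latter is what happens when the probe
        # lies inside a repeat (Note 17), so no join is attempted.
        return None
    end = hits[0] + probe_len
    # The part of `right` up to the end of the probe must match the tail of `left`.
    if not left.endswith(right[:end]):
        return None
    return left + right[end:]
```

With a short probe, a dinucleotide repeat gives several hits and the function declines to join, which mirrors the reason for moving the repeat away from the overlap region before merging.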

Overall, we found the MicroGenie “Shotgun merge Sequences” function to be invaluable for processing the large amount of data generated by our exonuclease sequencing project. Since this protocol generates numerous overlapping sequence fragments, the shotgun merge is tailor-made for assembling the full sequence.

References
1. Henikoff, S. (1984) Unidirectional digestion with exonuclease III creates targeted break points for DNA sequencing. Gene 28, 351-359.
2. Staden, R. (1987) Computer handling of DNA sequencing projects, in Nucleic Acid and Protein Sequence Analysis: A Practical Approach (Bishop, M. J. and Rawlings, C. J., eds.), IRL Press, Oxford, pp. 173-217.
3. Lopez, R., Kristensen, T., and Prydz, H. (1992) Database contamination. Nature 355, 211.

CHAPTER 17

MicroGenie: Translation

1. Introduction
MicroGenie can translate DNA to protein, either as output from procedures in the Analysis section or, in the Entry section, by making a new protein sequence file translated from a nucleic acid sequence file. If a DNA sequence needs to be translated into protein to undertake further analysis or manipulation, this is accomplished using the “Transform a Sequence” procedure, which is also used for other transformations to DNA sequences. Procedures in the Analysis section of MicroGenie allow a DNA sequence to be translated in one or all three reading frames, but do not generate new sequence files. A protein sequence can be reverse translated to the DNA sequence that could code for it in the Analysis “Translate a Sequence” procedure. The “Locate sites” procedures, normally used with restriction enzyme site searches, will translate between protein and DNA. A protein sequence can be searched with a DNA Search file or vice versa. Possible coding regions can be located in a DNA sequence using the codon preference method.

2. Materials
The requirements for the computer and software to run MicroGenie are as described in Chapter 15 (MicroGenie: Introduction and Restriction Enzyme Analysis). The CD-ROM drive and digitizer are not required for translation procedures. The coding region prediction procedure gives graphical output; a graphics display is needed to view it, and a graphics printer that is supported by MicroGenie is needed to print the output from this procedure.

From: Methods in Molecular Biology, Vol. 24: Computer Analysis of Sequence Data, Part I. Edited by: A. M. Griffin and H. G. Griffin. Copyright ©1994 Humana Press Inc., Totowa, NJ

Merrifield

3. Methods 3.1. Translation of a DNA Sequence to a Protein Sequence 1. At the Main (MICROGENIE) menu, press E to select the Entry section of the program. 2. From the ENTRY menu, press T to select “Transform a Sequence.” This enables the user to extract part of a sequence, select the opposite strand of a DNA sequence or translate a DNA sequence into protem. 3. Give the name of the sequence to be transformed. 4. The user is then asked for the lower limit. Enter the number of the base at which the translation is to start. If return is pressed here, the default is the first base in the DNA sequence. 5. Then enter the upper limit; a base in the last codon (in the readmg frame) to be included. The default is the end of the sequence. 6. At the prompt “First residue,” press return to use the value for “Lower limit” given above. (This is asked to specify the new first residue in a renumbered circular DNA sequence and is not appropriate for translation to protein.) 7. To the prompt “Strand (U,L),” press return, or U, for the upper strand, or press L to translate the lower strand. 8. The option “Form (L,C)” is not applicable to translation. Press return for the default of linear. (The upper and lower limits can be used to specify the region of a circular DNA to be translated.) 9. Press Y when asked “Translate to protem.” The translation takes place between the limits selected. 10. To use the normal genetic code, press return (or 0) for the built-in parameter setwhen asked“Which code?” It is possible to setparameter 16,CODE, to specify a modified genetic code. To use a modified code, enter the number of the parameter set that has been changed; see Note 1. 11. Give a name to the protein (transformed) sequence.The normal sequence name limits of not more than 40 characters and not “:” or “*” apply. 12. Enter a sequence comment as a reminder of which DNA sequence and region were translated to generate the protein sequence. 13. The user is then returned to the ENTRY menu. 
The translated sequence can be displayed, in either the one-letter or three-letter code. Press D to “Display a Sequence” and give the name of the newly translated sequence.
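
For readers who want to check a translation independently of MicroGenie, the operation performed in Section 3.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MicroGenie's own code: the function names and arguments are hypothetical, the limits are 1-based and inclusive, an incomplete trailing codon is simply dropped (MicroGenie instead accepts any base in the last codon as the upper limit), and the lower strand is taken as the reverse complement of the selected region.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Return the opposite (lower) strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, lower=1, upper=None, strand="U"):
    """Translate the region lower..upper (1-based, inclusive) of a DNA sequence.

    strand="U" reads the sequence as given; strand="L" reads the reverse
    complement of the selected region. Codons containing symbols other than
    A, C, G, or T are reported as "X".
    """
    dna = dna.upper()
    if upper is None:
        upper = len(dna)          # default: end of the sequence
    region = dna[lower - 1:upper]
    if strand.upper() == "L":
        region = reverse_complement(region)
    usable = len(region) - len(region) % 3   # drop an incomplete last codon
    return "".join(CODON_TABLE.get(region[i:i + 3], "X") for i in range(0, usable, 3))


print(translate("ATGGCCTAA"))               # upper strand, whole sequence
print(translate("ATGGCCTAA", lower=4))      # start at base 4
print(translate("TTAGGCCAT", strand="L"))   # lower strand
```

Changing the lower limit by one or two bases changes the reading frame, which is the same effect as entering a different lower limit at step 4.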


3.2. To Translate a DNA Sequence in the Analysis Section

The translation procedures in the Analysis section do not produce a new sequence file, but generate output that can be printed or saved as a MicroGenie output file. There are three procedures that translate DNA into protein, with the option of using a genetic code appropriate to the source of the DNA. It is possible to substitute special symbols for the start and stop codons (or any chosen amino acid) to make them stand out; see Note 2. The third procedure is provided to prepare figures for publication, showing the translation of selected regions with untranslated DNA regions between them. MicroGenie is also able to reverse translate from a protein to the DNA that may code for it, as described in Section 3.3.

1. At the MICROGENIE menu, press A for the ANALYSIS menu.
2. Press A again to “Analyze a Sequence.”
3. Give the name of the sequence to be analyzed.
4. Give the number of the parameter set to be used in the analysis. The parameters determine which genetic code is used in the translation and how the results will be presented. If it is necessary to change parameters, and this has not already been done, enter “/” before the number of the parameter set to be taken to the “ANALYSIS:Set” screen, where the appropriate parameters can be modified. The user can use the modifications for this analysis only or save them for future use.
5. The user is then asked for “Specifications (A,F,S,L)” to specify a particular region or strand of the DNA to be analyzed. Press L to limit the length of the region to be analyzed. The user will then be prompted for the lower and upper limits. Press S to change the strand. Translation of the upper (default) or lower strand can then be specified. If it is necessary to use both these options, press A, and a prompt will appear for each in turn. Press return for the default Form (unchanged if linear). For a circular DNA, a selected region can be translated by specifying the limits.
6. The user is then asked for the numbers of the procedures to be used. Type the numbers corresponding to the procedures (listed in response to entering “?” instead of a number) with a space between each, if more than one, and press return. Procedure 4, “Translate or reverse translate,” procedure 5, “Translate in all reading frames,” and procedure 16, “Translate part of a Sequence,” provide the translation options in MicroGenie. Procedure 17 locates potential coding regions in DNA. These are described in more detail in the following sections.
7. After the choice of procedure has been entered, the program prompts for the name of another sequence to analyze. If a sequence name is entered, the user will be taken through the selection of parameter set, specification, and procedures again and then will be asked for a further sequence to analyze. When return is pressed instead of the name of a sequence being entered, the user is taken to the next menu.
8. From the numbered list of options for the output, 1 would normally be selected to store the output on the hard disk. The user can then Examine the output on the screen and Print it when returning to the ANALYSIS menu. If option 2 is selected, the output will be printed without any opportunity to view it first. Press the number corresponding to the choice or press return for the default to store the output on the hard disk. See Note 3 for details of the other options.
9. When MicroGenie has finished the analysis, the user is asked to press the return key to continue.
10. The user is then taken back to the ANALYSIS menu, where E can be pressed to Examine the output or P can be pressed to Print it.

3.2.1. Details of the Analysis Procedures for Translation

1. Procedure 4 is used to translate DNA to protein. The translation begins at the first base in the sequence, or the lower limit if a limited length was specified at step 5 of Section 3.2. The reading frame can be changed, or the translation started at a given base, by specifying the lower limit. Alternatively, the sequence could first be transformed to extract the desired range of the sequence; see Note 4. Procedure 16 may be a better choice to translate a specific region of a sequence. Procedure 4 can also be used to reverse translate an amino acid sequence to the DNA sequence that could code for it, as described in Section 3.3. Note 5 gives details of the parameters that control the output from this procedure.
2. Procedure 5 is similar to procedure 4, but shows the translation of all three reading frames of the specified DNA strand. Note 6 gives details of the parameters associated with this procedure. The ability to highlight particular codons (such as Met and End) is particularly useful with this procedure; see Note 2. To translate both strands in all reading frames, this procedure must be used twice. Translate the upper strand as described above. When asked for another sequence to analyze at step 7 of Section 3.2., give the same sequence name again and specify the lower strand at step 5, so that both strands will be translated in the output.


3. Procedure 16 prints a DNA sequence with selected regions translated. This is intended to produce figures for publication showing translated regions of a sequence. When asked for the “Lower limit of first translated part,” enter the number of the base where the translation is to start. If return is pressed, translation will automatically start at the first ATG codon in the sequence. Next, the “Upper limit of the first translated part” is requested. Enter a base number in the last codon to be translated or press return for the default (which is to translate to the first stop codon in the reading frame or to the end of the sequence). If the defaults are selected at this and at the previous prompt, MicroGenie will translate the first open reading frame up to a stop codon or the end of the sequence. If a base number is specified for the upper limit of the first translated part, the user will be asked to enter the lower limit for another region to be translated, and then the upper limit of that translated region. This process continues until return is pressed instead of a base number being entered. Then the analysis proceeds as described in Section 3.2., step 7. See Note 7 for details of the parameters that control the output from this procedure. The output is written to the file OUTFILE as with the other Analysis procedures; it may be copied to a floppy disk if needed for incorporation in a document on another machine. In this case, it would simplify subsequent editing if no other procedures are used while in MicroGenie, so that OUTFILE contains only the output from this procedure.

3.2.2. Location of Probable Coding Regions in DNA

MicroGenie procedure 17 searches a DNA sequence for regions that may code for protein on the basis of codon preference, using a modification of the methods of Staden (1) and Gribskov et al. (2). Select procedure 17, “Find possible coding regions.” The procedure does not request any further input from the user, but parameter 23, USETABLE, should be set in advance so that the procedure will use the codon preference table appropriate for the DNA being analyzed; see Note 8. The options for viewing or printing the output are described in Section 3.2., step 8. The output can be examined (one screen at a time) using the “Examine program Output” option on the ANALYSIS menu. It can be printed out on a graphics dot-matrix or laser printer that is supported by MicroGenie. The output shows a graph for each reading frame with a horizontal dotted line. Regions where the graph is above the dotted line may be coding regions. Below each graph is a bar with full-height lines to mark the positions of Stop codons and short lines to mark the location of Met codons. This is repeated for each reading frame, so that they may be directly compared. As suggested in the MicroGenie manual, this procedure is useful when sequencing a gene because an unexpected change of reading frame may indicate that a base has been missed or inserted.

3.3. Reverse Translation

MicroGenie can reverse translate an amino acid sequence to the nucleic acid sequence that could code for it. Procedure 4, “Translate or reverse translate,” if used on an amino acid sequence file, will produce output showing the amino acid sequence (in the one-letter or three-letter code as selected in the specified parameter set) with the reverse translated DNA sequence below it. Where more than one codon could be used, they are printed vertically below the amino acid. The symbols, given in Appendix 1 of the MicroGenie manual, for base codes that represent a number of alternative bases may be used where appropriate, e.g., the letter “P” represents “A” or “G,” and “Q” represents “C” or “T.” The “Homology Comparison” procedure in the Compare section can reverse translate a protein to compare it to a nucleic acid. Further details of the Homology comparison procedures are given in Chapter 19 (MicroGenie: Homology Searches). Two of the “Locate sites” procedures, normally used with restriction enzyme Search files, can translate or reverse translate between protein and nucleic acids. These procedures, and the parameters that control them, are described in Chapter 15 (MicroGenie: Introduction and Restriction Enzyme Analysis). Procedure 6, “Locate sites and show in tables,” can locate amino acid strings in nucleic acids, i.e., all sites that code for the amino acid strings. It can also locate nucleotide strings in protein sequences, finding sites where the gene coding for the protein could contain the nucleotide string (e.g., a restriction enzyme recognition site).
Appendix 9 in the MicroGenie manual gives an example of the use of reverse translation and DNA Search files with a protein sequence to locate potential restriction enzyme sites that could be introduced into a gene without changing the composition of the protein. Procedure 7, “Locate sites and show in graphs,” gives the same information presented as horizontal graphs with the positions of the sites marked. See Note 9 for details of the differences when recording a protein Search file.
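
The idea behind reverse translation can be illustrated with a short Python sketch. It is not MicroGenie's implementation: the function names are hypothetical, the standard genetic code is assumed, and the ambiguity symbols are the IUPAC codes (for example R for A or G, Y for C or T) rather than the MicroGenie symbols such as “P” and “Q” listed in Appendix 1 of the manual. The first function lists every codon for each amino acid, much as procedure 4 prints alternative codons vertically; the second collapses them into one degenerate sequence.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Invert the table: amino acid -> list of codons that encode it.
CODONS_FOR = {}
for codon, aa in CODON_TABLE.items():
    CODONS_FOR.setdefault(aa, []).append(codon)

# IUPAC nucleotide ambiguity codes, keyed by the set of bases they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def reverse_translate(protein):
    """Return, for each residue (one-letter code), the sorted list of its codons."""
    return [sorted(CODONS_FOR[aa]) for aa in protein.upper()]


def degenerate_dna(protein):
    """Collapse the alternative codons into one sequence of ambiguity codes.

    Each codon position is summarized independently, so for amino acids with
    six codons (Leu, Ser, Arg) the result also matches a few codons that
    encode other amino acids.
    """
    symbols = []
    for codons in reverse_translate(protein):
        for pos in range(3):
            symbols.append(IUPAC[frozenset(codon[pos] for codon in codons)])
    return "".join(symbols)


print(reverse_translate("MK"))
print(degenerate_dna("MKF"))
```

A degenerate sequence of this kind is what is searched when looking for sites, such as restriction enzyme recognition sequences, that a gene encoding the protein could contain.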

4. Notes

4.1. Changing the Genetic Code Used for Translation

1. To change the genetic code used for the translation procedures, go to “Set program Parameters” on the ANALYSIS menu, and change the value of parameter 16, CODE, to 2. This will take the user into an editing screen where changes can be made to the genetic code. Appendix 3 of the MicroGenie manual lists the standard genetic code, and the changes for yeast and human mitochondria. After making changes and returning to the “Set program Parameters” screen, reset CODE to have a value of 1. The changed genetic code will be used when the modified parameter set is selected. To change that parameter set back to use the normal genetic code, set the value of CODE to 0. The changes made are then saved for future use, but not used in the analysis.
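
The effect of a modified genetic code can be seen in a small Python sketch. It is an illustration only and does not use MicroGenie's parameter mechanism: the names are hypothetical, and the four codon reassignments shown are those of the vertebrate (including human) mitochondrial code as listed in NCBI translation table 2, not values copied from Appendix 3 of the MicroGenie manual.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def modified_code(changes, base=STANDARD_CODE):
    """Return a copy of a genetic code with selected codons reassigned."""
    code = dict(base)
    code.update(changes)
    return code


# Vertebrate mitochondrial code (NCBI translation table 2).
VERTEBRATE_MITO_CODE = modified_code({
    "TGA": "W",   # Trp instead of stop
    "ATA": "M",   # Met instead of Ile
    "AGA": "*",   # stop instead of Arg
    "AGG": "*",   # stop instead of Arg
})


def translate(dna, code=STANDARD_CODE):
    """Translate complete codons from the first base using the given code."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(code.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


print(translate("ATATGAAGA"))                        # standard code
print(translate("ATATGAAGA", VERTEBRATE_MITO_CODE))  # mitochondrial code
```

Keeping the standard table untouched and applying changes to a copy mirrors the note above: the modified code is used only when it is explicitly selected.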

4.2. To Translate a DNA Sequence in the Analysis Section

2. Parameters 13 and 14 (AA1 and AA2) specify two amino acids that may be highlighted in a sequence. If parameter 15, SUPPRESS, is set to 0 and AA1 is set to the number of an amino acid (the numbers are listed in Appendix 2 of the manual or the on-line help), wherever that amino acid occurs in a sequence, it will be represented by “***” instead of its normal three-letter code. If parameter 14, AA2, is set to a number from 1 to 21, the corresponding amino acid will be represented by “+++”. This is particularly useful to highlight Met and End (Stop) codons in a translated sequence. To do this, set AA1 to 13, AA2 to 21, and SUPPRESS to 0. To print only the amino acids represented by AA1 and AA2, set the value of SUPPRESS to 0, and all other amino acids will be replaced by spaces. Setting AA1 or AA2 to 0 cancels the highlighting of the amino acid previously specified by that parameter.
3. The default option for output from the Analysis procedures is 1, to store the output on the hard disk. Option 6 allows the user to store the output on a floppy disk. This may be useful for printing it on a high-quality printer on another machine. If this is chosen, the user will be asked for a file name; do not include the drive or a file extension. Option 4 is provided to allow the user to return to the ANALYSIS menu to list the sequences if a sequence name has been forgotten, and then to resume the analysis selections without losing the choices already made. Option 5 stores the output in an “Export Format.” This is a plain text (ASCII) file, without page titles, used to export sequence data to other programs (further details can be found in Chapter 15 [MicroGenie: Introduction and Restriction Enzyme Analysis] and Chapter 19 [MicroGenie: Homology Searches]). If Option 3 is selected, the user will be returned to the ANALYSIS menu, and the selections of sequences and procedures that have been made will be canceled.
4. If a sequence is transformed to extract a region of it, the transformed sequence will be renumbered from its first residue, and the numbering may not match the original sequence. If a portion of the sequence is selected to be analyzed by specifying the lower and upper limits in the Analysis section, the output will have the numbering of the original sequence. This may be significant in comparing an extracted and modified sequence with the original sequence.
5. The output format of procedure 4 (Translate or reverse translate) is controlled by the protein parameters. Parameter 2, PWIDTH, gives the number of amino acids per line (10 times the value of PWIDTH). This also determines the number of corresponding nucleotides per line. Parameter 4, PFREQ, determines the numbering of the protein sequence, and parameter 6, PSKIP, causes gaps to be inserted between the amino acids and the corresponding codons in the DNA. The use of the one-letter or three-letter amino acid code is determined by the value of parameter 8, LETTERS (there is also a special MicroGenie two-letter code; see Appendix 2 of the manual). One or two amino acids can be highlighted in the sequence by use of parameters 13 and 14 (AA1 and AA2) in conjunction with parameter 15, SUPPRESS; see Note 2. Parameter 16, CODE, determines which genetic code is used in translation (see Note 1), but only the default code is used in reverse translation.
6. The format of the output from procedure 5 (Translate in all reading frames) is controlled by parameters 1, NWIDTH, and 3, NFREQ, for the number of nucleotides printed per line and the frequency of base numbering. Gaps are not inserted in the DNA strand, as all three reading frames are translated. One or two amino acids can be highlighted in the sequence by use of parameters 13 and 14 (AA1 and AA2) in conjunction with parameter 15, SUPPRESS; see Note 2. The use of the one-letter or three-letter amino acid code is determined by the value of parameter 8, LETTERS. Parameter 16, CODE, determines which genetic code is used in translation; see Note 1.
7. The format of the output from procedure 16 is controlled by parameters 1, NWIDTH, and 3, NFREQ, for the number of nucleotides printed per line and the frequency of base numbering. Gaps are not inserted in the DNA strand. One or two amino acids can be highlighted in the sequence by use of parameters 13 and 14 (AA1 and AA2), in conjunction with parameter 15, SUPPRESS; see Note 2. The use of the one-letter or three-letter amino acid code is determined by the value of parameter 8, LETTERS. Parameter 16, CODE, determines which genetic code is used in translation.
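
The kind of output discussed in Notes 2 and 6, a translation in all three reading frames with two amino acids highlighted, can be approximated with the following Python sketch. It is not MicroGenie's output routine. The function name and its arguments are hypothetical: `aa1` and `aa2` take one-letter codes rather than MicroGenie's amino acid numbers, and `suppress` is a plain true/false flag standing in for the SUPPRESS parameter, whose numeric values are not reproduced here.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys", "Q": "Gln",
    "E": "Glu", "G": "Gly", "H": "His", "I": "Ile", "L": "Leu", "K": "Lys",
    "M": "Met", "F": "Phe", "P": "Pro", "S": "Ser", "T": "Thr", "W": "Trp",
    "Y": "Tyr", "V": "Val", "*": "End",
}


def three_frames(dna, aa1="M", aa2="*", suppress=False):
    """Translate one strand in all three frames using the three-letter code.

    Residues equal to aa1 are shown as "***" and those equal to aa2 as "+++".
    With suppress=True every other residue is replaced by three spaces.
    Each output line is indented by its frame offset so that every
    three-letter cell sits under the codon it was translated from.
    """
    dna = dna.upper()
    lines = []
    for frame in range(3):
        cells = []
        for i in range(frame, len(dna) - 2, 3):
            aa = CODON_TABLE.get(dna[i:i + 3], "X")
            if aa == aa1:
                cells.append("***")
            elif aa == aa2:
                cells.append("+++")
            elif suppress:
                cells.append("   ")
            else:
                cells.append(THREE_LETTER.get(aa, "Xxx"))
        lines.append(" " * frame + "".join(cells))
    return lines


sequence = "ATGGCCTAA"
print(sequence)
for line in three_frames(sequence):
    print(line)
```

To cover both strands, the same function would be called a second time on the reverse complement of the sequence, in the same way that procedure 5 has to be run twice.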

4.3. Location of Probable Coding Regions in DNA

8. The length of sequence shown on each line by procedure 17 (“Find possible coding regions”) is set by parameter 18, NWIDTHG. The number of codons per line is 10 times the value of NWIDTHG (and hence the number of bases per line is 30 times the value). (This is different from parameter 1, NWIDTH, for the nongraphical procedures, where the number of bases per line is 10 times the value.) The preset value of 6 displays 180 bases/line. A higher value, up to the maximum of 60 (1800 bases/line), may give a better overall view of the coding regions. The resolution can then be increased to examine any specific regions of interest. There is a choice of four values for parameter 23, USETABLE: 1 sets the table for mammalian codon preference, 2 for E. coli, 3 for yeast, and 4 for D. melanogaster. A value of 0 sets a mammalian codon usage allowing for amino acid frequency. The manual states that the coding prediction is more reliable for yeast or bacteria, which show a stronger codon preference, than for mammals. Using a value of 0 for parameter 23 may improve the detection of coding regions that have an average amino acid composition.
9. To record a Search file of amino acid sequences, go to the Files section and the “Search Files” menu, as described in Chapter 15 (MicroGenie: Introduction and Restriction Enzyme Analysis). The “Record a new File” procedure accepts amino acids only in the three-letter code; it capitalizes the first letter automatically. The user can include spaces while typing; they are removed automatically by the program. There is a special code, “Any,” to match any amino acid.
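
The line-length arithmetic in Note 8 is easy to get wrong, so a small Python check may be helpful. The function names are hypothetical and the multipliers are simply those stated in the note (30 bases per unit of NWIDTHG, 10 bases per unit of NWIDTH); nothing here queries MicroGenie itself.

```python
def bases_per_graph_line(nwidthg):
    """Bases per line in the coding-region graphs: 10 codons (30 bases) per unit."""
    return 30 * nwidthg


def bases_per_text_line(nwidth):
    """Bases per line in the nongraphical procedures: 10 bases per unit."""
    return 10 * nwidth


def lines_needed(sequence_length, per_line):
    """Number of output lines needed to show the whole sequence (rounded up)."""
    return -(-sequence_length // per_line)


for value in (6, 60):
    width = bases_per_graph_line(value)
    print(value, width, lines_needed(5000, width))
```

For a 5000-base sequence, the preset value of 6 spreads the graph over many more lines than the maximum of 60, which is why the larger value gives the better overview.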

References

1. Staden, R. (1984) Measurements of the effects that coding for a protein has on a DNA sequence and their use for finding genes. Nucleic Acids Res. 12, 551-567.
2. Gribskov, M., Devereux, J., and Burgess, R. R. (1984) The codon preference plot: graphic analysis of protein coding sequences and prediction of gene expression. Nucleic Acids Res. 12, 539-549.

CHAPTER 18

MicroGenie: Protein Analysis

R. K. Merrifield

1. Introduction

On initial inspection, MicroGenie appears to have a limited range of procedures designed specifically to analyze proteins; however, many of its procedures can be used with both proteins and nucleic acids. By suitable selection of control parameters, the protein analysis procedures can be made to use a variety of analysis methods that would probably appear as separate menu items in other systems. The homology search and comparison procedures, under the DATA BANK and COMPARE menus, may be used with proteins or nucleic acids, and are covered in more detail in Chapter 19 (MicroGenie: Homology Searches). MicroGenie can calculate the sizes of proteins from their mobility on an electrophoresis gel in relation to known size markers. The MicroGenie manual gives details of this in the Files section. Reverse translation of protein to nucleic acids is discussed in Chapter 17 (MicroGenie: Translation).

2. Materials

The requirements for the computer and peripherals to run MicroGenie are as described in Chapter 15 (MicroGenie: Introduction and Restriction Enzyme Analysis). The CD-ROM drive is required to use procedures in the Data Bank section. A digitizer is only needed for protein analysis if the user is regularly entering protein mobility data to determine molecular weights. To use the procedures that produce graphical output, a graphics display is needed to view the results, and a graphics printer that is supported by MicroGenie is needed to print the output from these procedures.

From: Methods in Molecular Biology, Vol. 24: Computer Analysis of Sequence Data, Part I. Edited by: A. M. Griffin and H. G. Griffin. Copyright ©1994 Humana Press Inc., Totowa, NJ


3. Methods

3.1. Protein Analysis in MicroGenie

1. At the MICROGENIE menu, press A to go to the ANALYSIS menu.
2. Press A again to “Analyze a Sequence.”
3. Enter the name of the sequence to be analyzed.
4. Give the number of the parameter set to be used in the analysis. The parameters determine how the results will be presented and are discussed further in Section 4.
5. The user is then asked for “Specifications (A,F,S,L)” to define a particular region of the protein to be analyzed. Press L to limit the length of the region to be analyzed. The user will then be prompted for the lower and upper limits. The other options that are available here (select Upper or Lower strand and select the Form as Linear or Circular) normally apply to the analysis of a nucleic acid sequence.
6. The user is then asked for the numbers of the procedures to be used. Type the numbers corresponding to the procedures with a space between each, if more than one, and press return (they can be listed by entering “?” instead of a number). The procedures for protein analysis are described in more detail in the following section.
7. When the choice of procedure has been entered, the program prompts for the name of another sequence to analyze. If a sequence name is entered, the user will be taken through the selection of parameter set, specification, and procedure, and then will be asked for a further sequence to analyze. When return is pressed instead of the name of a sequence being entered, the user is taken to the next menu.
8. From the numbered list of options for the output, the user would normally select 1 to store the output on the hard disk. The user can then “Examine program Output” on the screen and “Print program Output” on returning to the ANALYSIS menu. If option 2 is selected, the output will be printed without any opportunity to view it first. Press the number corresponding to the choice, or press return for the default, which is to store the output on the hard disk. See Note 1 for details of the other options.
9. When the analysis procedure is finished, the user will be asked to press return and will be taken back to the ANALYSIS menu.
10. From the ANALYSIS menu, the user can press E to Examine the output or P to Print it; see Note 2.


3.1.1. Protein Analysis Procedures in the Analysis Section

1. “List and number a sequence.” Procedure 1 prints a DNA or protein sequence. The number of amino acids per line, the frequency of the numbering that is printed above the sequence, and the insertion of gaps in the sequence are controlled by the parameters in the set selected at Section 3.1., step 4; see Note 3.
2. “Determine residue frequencies.” Procedure 2 will generate a table of the frequency of occurrence of the amino acids in a protein, and list the number of acidic, basic, and hydrophobic residues. It uses this information to calculate the molecular weight and the predicted isoelectric point (pI) of the protein.
3. “Translate or reverse translate.” Procedure 4 is able to reverse translate a protein to the nucleic acid sequence that could code for it. Further details are given in Chapter 17 (MicroGenie: Translation).
4. Procedures 6 and 7 are used to locate restriction enzyme recognition sites in DNA, but are also able to locate amino acid strings in protein sequences and locate coding sequences for amino acid strings in nucleic acids. The use of these procedures to locate sites in proteins is covered in Chapter 17 (MicroGenie: Translation).
5. Procedure 9, “Find repeated regions,” locates regions that are repeated within a protein (or nucleic acid). It is listed with the procedures available in the Analysis section, but can only be run from the Compare section. It is described in Chapter 19 (MicroGenie: Homology Searches).
6. Regions of a sequence that are rich in specified amino acids (or nucleotides) are located using procedure 11. A Search file of Class 9 is first recorded with the amino acids of interest as described in Chapter 15 (MicroGenie: Introduction and Restriction Enzyme Analysis). Arbitrary names can be given to the Search file and to the string. The string is entered in the three-letter code for the amino acid. To locate a region rich in a set of amino acids, enter the string as if it were a peptide consisting of those amino acids, e.g., ArgLys to find regions rich in arginine and lysine. The user can have up to four different sets of amino acids (strings) in the Search file, and the program will mark regions that are rich in each with a different symbol. Note 4 gives further details of this procedure.
7. To plot a prediction of the secondary structure of a protein, procedure 12 is used. The MicroGenie manual gives a short discussion of the algorithm used and its limitations. The probability of each amino acid (shown in the one-letter code) being in each of the four conformations is plotted as a set of graphs arranged above each other. The conformation (α-helix, β-sheet, turn, or random coil) that has the highest value is the predicted conformation. Normally a graphical output is used, but if a graphics screen or printer is not available (or a more compact output is needed), the sequence can be printed with a single letter below each amino acid to represent the most probable conformation. The form of the output is controlled by two parameters in the selected parameter set; see Note 5.
8. Hydrophobic and hydrophilic regions of a protein are plotted using procedure 13. The hydrophobic and hydrophilic regions may be calculated by one of three algorithms. Alternatively, the hydrophobic moment of an α-helical region may be plotted. This is described in more detail in the MicroGenie manual. Hydrophilic residues have positive values on the graph. The size of the averaging window used to plot the graph is either that recommended by the authors of the algorithm or can be specified by the user. Select the algorithm to be used, the default or user-specified averaging window, and the number of amino acids per line on the graph by setting the values of the appropriate parameters as detailed in Note 6.
9. Procedure 14, “Find repeats (matrix method),” locates regions that are repeated within a protein (or nucleic acid) using a dot-matrix plot. It is listed with the procedures in the Analysis section, but is run from the Compare section. It is described in Chapter 19 (MicroGenie: Homology Searches).
10. Numeric values associated with the residues of a protein (or nucleic acid) sequence are plotted on a graph by procedure 18. By default, the procedure uses a table of amino acid charge, with a value of 1 for positively charged amino acids, -1 for negative amino acids, and 0 for all others.
A graph of the charge along the protein is produced. This procedure can be used to plot a graph of any other numeric parameter that is associated with each amino acid. The values are entered into a table of amino acid properties using a special editor; see Note 7.
11. The final protein procedure in the Analysis section plots a graph combining information from other graphical procedures. The two top sections of the graph plot the probability of α-helix and β-sheet conformations, as in procedure 12. The third level of the graph is a plot of hydrophilicity, as for procedure 13. The fourth level displays the charge or other numeric property of the amino acids in the sequence, as described for procedure 18. By suitable adjustments of the parameters and the table of values to be plotted, this can be used to give a second graph of hydrophilicity calculated by a different algorithm (this is suggested and described in the MicroGenie manual). The parameters to control the output for the sections of the graph are as described for the individual procedures when they are run separately.

3.2. Compare Section

Procedures in the Compare section can be used to compare or align protein sequences and nucleic acids. More complete details of these procedures are given in Chapter 19 (MicroGenie: Homology Searches).

3.2.1. Homology Comparison

The “Homology Comparison” procedure can compare two protein sequences to locate regions of homology. This lists the homologous regions, giving their positions in the original sequence. It is also possible to compare a protein sequence against a nucleic acid sequence, e.g., a cDNA that codes for it.

3.2.2. Alignment of Sequences

MicroGenie can align sequences that are related, provided they are of similar lengths and do not have long regions that are not homologous. Limit the length of one, or both, if their lengths are sufficiently different to prevent successful alignment. The multiple alignment procedure is able to align up to 60 sequences. It can be useful when preparing figures for publication that show changes in related sequences.

3.2.3. Matrix Comparison

MicroGenie can compare two sequences or selected regions of two sequences using a dot-matrix graph in which regions of homology are marked by diagonal lines on the graph. Deletions or insertions are shown by a break and displacement in the line of homology. The scales on the axes of the graph are automatically chosen to suit the lengths of the sequences (or their specified regions) being compared. When a region of homology has been located, it can be examined in more detail by limiting the regions to be compared to 60 or fewer amino acids. The matrix is then plotted, at the single amino acid resolution, with the one-letter code being used to list the sequences along the axes. The dots on the low-resolution graphs are replaced by “+” symbols where there are identical amino acids in both sequences at a given position. Nonmatching amino acids in a homologous region are marked by a “-” symbol to make it easier to recognize the region of homology. The number of homologous regions shown on the graph is controlled by two parameters that select the minimum length of a region of homology and the minimum percentage of amino acids within that region that have to match. Details of the parameters are given in Note 8. If matrix comparisons are being undertaken regularly on a variety of proteins, the user may need a number of different parameters to give a range of levels of stringency for matching. These can be saved in separate parameter sets, so that a variety of conditions can be used by selecting the appropriate parameter sets, rather than having to alter parameters before each comparison. It is helpful to draw up a table listing the parameters and the values in each set, since it is easy to forget which set contains the parameters that are appropriate for a particular analysis. The MicroGenie manual suggests starting values for the parameters.

3.3. Changing MicroGenie Parameters

The MicroGenie parameters can be set from the “Set program Parameters” option on the ANALYSIS menu. If the parameters have not been set in advance, the user can select the parameter set to be used in the analysis and preface it with “/.” The user will then be taken to the parameter editor screen. After the parameters have been modified, they can be used only in that analysis or saved for future use. Set parameter set 1 to have the values that will be used most often, since this is the one that is selected by default.

1. Press A from the MICROGENIE menu to go to the ANALYSIS menu.
2. At the ANALYSIS menu, press S to “Set program Parameters.”
3. When asked which parameter set to alter, enter the number of a set between 1 and 9.
4. The user is then taken to a screen showing the 26 ANALYSIS parameters. (Parameters 27 to 32 apply to the Compare section and can only be set from the COMPARE menu.) A prompt at the bottom of the screen asks for the number of the parameter to be modified; enter the corresponding number.
5. The line with the old value and the range of values for the parameter is highlighted, and a prompt at the bottom of the screen asks for the new value. Enter the new value; “?” can be entered instead for an explanation of the highlighted parameter and the range of permissible values.


6. The user is then asked for the number of another parameter to change. If no other parameter needs to be changed, press return. This will take the user back to the ANALYSIS menu.
7. If the parameter set was changed using the “/” symbol when starting an analysis, the user will be given the option of using the changed parameters for that analysis only or saving them for future use.
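
To see how the two stringency parameters of the matrix comparison (Section 3.2.3) interact, a minimal Python sketch of a windowed dot-matrix search is given here. It is an illustration only and is not the algorithm used by MicroGenie: the function name and the argument names `min_length` and `min_percent` are hypothetical stand-ins for the two parameters, and every window of exactly `min_length` residues is scored by simple identity.

```python
def matrix_matches(seq_a, seq_b, min_length=5, min_percent=60):
    """List windows that would be marked on a dot-matrix plot.

    A window starting at residue i of seq_a and residue j of seq_b is
    reported when at least min_percent of its min_length residue pairs are
    identical. Positions are 1-based; each hit is (i, j, identical_pairs).
    """
    a, b = seq_a.upper(), seq_b.upper()
    hits = []
    for i in range(len(a) - min_length + 1):
        for j in range(len(b) - min_length + 1):
            identical = sum(
                x == y
                for x, y in zip(a[i:i + min_length], b[j:j + min_length])
            )
            # Integer comparison avoids rounding problems with percentages.
            if 100 * identical >= min_percent * min_length:
                hits.append((i + 1, j + 1, identical))
    return hits


protein_1 = "ACDEFGHIK"
protein_2 = "ACDEYGHIK"   # one substitution relative to protein_1

# A relaxed and a strict setting, comparable to two saved parameter sets.
print(matrix_matches(protein_1, protein_2, min_length=5, min_percent=80))
print(matrix_matches(protein_1, protein_2, min_length=5, min_percent=100))
```

Hits with equal i and j lie on the main diagonal. Raising the minimum percentage removes windows that contain the substitution, which is the reason for keeping several parameter sets with different levels of stringency.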

4. Notes

1. The default option for output from the Analysis procedures is 1, to store the output on the hard disk. If Option 3 is selected, the user will be returned to the ANALYSIS menu, and the selections of sequences and procedures that have been made will be canceled. Option 4 is provided to allow the user to return to the ANALYSIS menu to list the sequences, if a sequence name has been forgotten and the user wants to resume the analysis without losing the choices already made. Option 5 stores the output in an “Export Format.” This is a plain text (ASCII) file, without page titles, and is used to export sequence data to other programs. Further details can be found in Chapter 15 (MicroGenie: Introduction and Restriction Enzyme Analysis). Option 6 allows the user to store the output on a floppy disk; this may be used to print it on a high-quality printer on another machine. It is also useful if another user is waiting to use the computer and would overwrite the OUTFILE before the original user had time to examine it. If Option 6 is chosen, the user will be asked to give a file name. Do not include the drive or a file extension.
2. If the user chooses to print the output, he or she will be asked if it should be printed in condensed type (this is useful for printing more than 80 columns on standard 8.5-in. wide paper). MicroGenie will then ask if the printing should be in emphasized mode; with a dot-matrix printer, each line will be overprinted to make the output darker if it is needed for presentation. This will make printing slower; if a high-quality printer is available on another computer, it may be better to copy the OUTFILE to a floppy disk (or select the option to save output on a floppy disk) and take it to the other computer.
3. Parameter 2, PWIDTH, controls the number of amino acids that are printed on a line of output. The number of amino acids is 10 times the value given to the parameter. This may be from 1 to 5 (10-50 amino acids/line) with an initial default value of 20 amino acids/line. The frequency of the numbering printed along the strand is determined by the value of parameter 4, PFREQ. This is multiplied by 10 to give the interval between the numbers. The default value is 1, for numbers printed at every 10 amino acids. If the value of PFREQ is set to 0, numbers are not printed. Parameter 6, PSKIP, determines if spaces are inserted into a protein listing to make it more readable. A value of 1 gives a space between each amino acid, and 0 prevents spaces from being inserted. The three-letter amino acid code is used if parameter 8, LETTERS, has a value of 3, and the one-letter code is selected by setting LETTERS to 1. There is a special MicroGenie two-letter code, given in Appendix 2 of the MicroGenie manual, which will be used if LETTERS is set to a value of 2.
4. Parameter 17, RICHIN, sets the minimum number of the specified amino acid (or group of amino acids) that must be present in a region of the sequence whose length is specified by parameter 18, RICHOUT. If RICHIN has the value 4 and RICHOUT has the value 10, then all regions in which at least four out of 10 amino acids are ones specified in the Class 9 Search file will be marked. Delete any other files with Class 9 and Kind P when using this procedure.
5. Parameter 20, PWIDTHG, controls the number of amino acids that will be printed (in the one-letter code) on each line by graphic procedures. The value of PWIDTHG is multiplied by 10 to give the number of amino acids per line. Parameter 21, GRAPHON, normally has a value of 1 to cause the output to be given in a graphical form. It can be changed to 0 to select the nongraphical output option for procedure 12 (secondary structure prediction).
6. For procedure 13, the value of parameter 24, HYDRO, determines which algorithm is to be used: 1 for Hopp and Woods (1), 3 for Kyte and Doolittle (2), 5 for Parker et al. (3), and 7 to plot the Eisenberg et al. hydrophobic moment (4). These use the size of averaging window recommended by the relevant authors. If the value of HYDRO is increased by 1 to an even number, the same algorithm is used, but with an averaging window specified by the value of parameter 25, WINDOW. The number of amino acids displayed per line is 10 times the value of parameter 20, PWIDTHG.
7. Parameter 26, RESVAL, determines which table of values for the amino acid residues will be plotted by procedure 18 (“Graph charge, AT-GC, values”). The default value of 0 selects the table of amino acid charge (or A-T richness for nucleic acids). When RESVAL is set to 2 (from the “Set program Parameters” option on the ANALYSIS menu), the user is taken to a special editing screen where the values to be plotted for the amino acids can be entered. On return to the “Set program Parameters” screen, set RESVAL to 1 to cause the values entered to be used to plot the graph. If RESVAL is set to 0, the default values will be used, and the ones entered will be stored for subsequent use or modification. The value of parameter 25, WINDOW, determines the size of the averaging “window” that will be used when the graph is plotted. By default, the value is 1, so that the actual values are plotted. If WINDOW is 3, the value of each amino acid is averaged with those of the amino acids on either side of it. Larger values for WINDOW are useful for smoothing graphs to emphasize overall changes.
8. There are two parameters that select the regions included in the homology and matrix comparison output. Parameter 27, MINMATCH, specifies the minimum number of matching residues that have to be present in a region of homology. MINPER, parameter 28, specifies the minimum percentage of matching residues in the region of homology. It is preset at 75%. Decreasing the value will show more homologies with more mismatches. These two parameters control different aspects of the selection of homologies, and the MicroGenie manual states that a homology with less than the overall percentage specified by MINPER will be included if part of it meets both criteria. If more than 400 regions of homology are found, MicroGenie will terminate the comparison. The parameters will have to be reedited to reduce the number of matches and then the comparison will have to be repeated.
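The effect of the WINDOW parameter described in Note 7 can be illustrated with a short Python sketch. It is not MicroGenie's own code: the charge table is a simplified, hypothetical one (+1 for K and R, -1 for D and E, 0 otherwise), and the way windows are shortened at the ends of the sequence is an assumption, since the Notes do not say how MicroGenie treats the first and last residues.

```python
# Hypothetical residue values, standing in for the table selected by RESVAL.
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def window_average(values, window):
    """Average each value with its neighbours in a centred window.

    `window` is the total width (1, 3, 5, ...). A width of 1 returns the
    values unchanged; a width of 3 averages each value with the one on
    either side of it. At the ends of the list the window is simply
    truncated (an assumption made for this sketch).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def charge_profile(sequence, window=1):
    """Return the smoothed charge value for every residue of a
    one-letter-code protein sequence."""
    values = [CHARGE.get(residue, 0.0) for residue in sequence.upper()]
    return window_average(values, window)


if __name__ == "__main__":
    for width in (1, 3, 5):
        print(width, charge_profile("MKRDEALKK", window=width))
```

Running the script with increasing widths shows the behaviour the note describes: the individual peaks flatten and only the overall trend along the sequence remains.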

References

1. Hopp, T. P. and Woods, K. R. (1981) Prediction of protein antigenic determinants from amino acid sequences. Proc. Natl. Acad. Sci. USA 78, 3824-3828.
2. Kyte, J. and Doolittle, R. F. (1982) A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157, 105-132.
3. Parker, J. M. R., Guo, D., and Hodges, R. S. (1986) New hydrophilicity scale derived from high-performance liquid chromatography peptide retention data: Correlation of predicted surface residues with antigenicity and X-ray derived accessible sites. Biochemistry 25, 5425-5432.
4. Eisenberg, D., Schwarz, E., Komaromy, M., and Wall, R. (1984) Analysis of membrane and surface protein sequences with the hydrophobic moment plot. J. Mol. Biol. 179, 125-142.

CHAPTER 19

MicroGenie: Homology Searches

R IL Merrifield

From: Methods in Molecular Biology, Vol. 24: Computer Analysis of Sequence Data, Part I. Edited by: A. M. Griffin and H. G. Griffin. Copyright ©1994 Humana Press Inc., Totowa, NJ

1. Introduction

MicroGenie can compare two sequences to detect regions of homology between them and can align two or more sequences, inserting gaps if necessary to improve the alignment. The procedures to do this are in the Compare section. A sequence (DNA or protein) can be searched for homologies against a data bank or against specified sequences. The Data Bank section of MicroGenie contains the procedures to search the data banks and to retrieve specified sequences from the data bank CD-ROM by author or keywords.

2. Materials

The requirements for running MicroGenie are as described in Chapter 15 (MicroGenie: Introduction and Restriction Enzyme Analysis). If data bank searches are expected to be a major application of MicroGenie, the use of a computer with a fast processor is an advantage, as this reduces the time taken for searches (see Note 1 and Table 1). A large hard disk is desirable if the user wishes to search the data bank sequences on the hard disk instead of the CD-ROM; see Note 2. A CD-ROM drive is required because the data bank is now only available on CD. Note 3 gives some details of the installation of a CD drive. The use of a CD-ROM provides space to include the data bank annotations; these were removed from the original MicroGenie data banks when they were supplied on floppy disk. Following the ending of the marketing agreement with Beckman Instruments (see Chapter 15, Section 3. for address), updates to the Data Bank CD-ROM in MicroGenie format have been available from IntelliGenetics, Inc., 700 East El Camino Real, Mountain View, CA 94040, or Amocolaan 2, B-2440 Geel, Belgium. The CD-ROM drive unit must be able to read the “High Sierra” format. Hitachi CDR1503s and NEC CDR-75 drives have been found to work satisfactorily, and the list of compatible drives that was issued by Beckman includes Amdek Laserdek 1000, NEC CDR-72, Toshiba 3201, Sony CD 6101, and Hitachi CD 600.

The addition of the MSCDEX CD-ROM extensions to DOS reduces the free memory that is available to run MicroGenie. When the CD drive software has been installed, there may not be enough free memory available to run the larger sections of MicroGenie. Note 4 discusses some methods of freeing memory that are also applicable to running MicroGenie under Windows 3.0.

Table 1
Speed of Data Bank Search(a)

Computer   Drive  Buffers, DOS  Buffers, MSCDEX  Time, min  Notes
IBM PS/2   D:     5             M:4              19.6       Hard disk
           D:     8             M:4              19.6
           D:     15            M:4              19.0
           D:     18            M:4              19.4
           D:     18,8          M:4              19.4       Look ahead
           D:     25            M:4              19.4
IBM PS/2   E:     18            M:4              25.2       CD-ROM
           E:     18            M:8              22.5
Dell 310   C:     20            M:4              15.9       Hard disk
Dell 310   E:     20            M:4              21.8       CD-ROM
           E:     20            M:8              19.1
Dell 316   C:     20                             28.6       16 MHz
Dell 316   C:     20                             56.5       8 MHz

(a) Time taken to search the PRI: data bank using 2 kb of DNA.


3. Methods

3.1. Data Bank Section

3.1.1. Homology Searches

MicroGenie can search a sequence against either the GenBank database for nucleic acid sequences or the NBRF (National Biomedical Research Foundation) Protein Identification Resource (PIR) data bank for proteins. Nucleic acids are searched against both strands of the sequences in the data bank. The data bank search procedure can also be used to search a sequence for homology to sequences stored under the user's own, or another user's, password. The GenBank database is divided into sections with individual passwords as listed in Appendix 7 of the MicroGenie manual and Note 6. These can be used to access specific sections of the data bank (e.g., rodent sequences only, if the password “ROD:” is specified). MicroGenie does not give a listing of a set number of matches, as some programs do; if there are no homologies that meet its significance level (which can be adjusted by the user), no output will be produced. Sequences can be exported from MicroGenie as described in Chapter 15 (MicroGenie: Introduction and Restriction Enzyme Analysis).

1. At the MICROGENIE menu, press D to select the DATA BANK section of the program.
2. Press S to “Search a Data Bank.” The other options on the DATA BANK menu include “Find Sequences” (by keyword or author) and the “Annotations” option to display or print data bank annotations.
3. The user is then asked for the name of the sequence to be compared (searched) to the data bank. The “*” notation cannot be used here, since only one sequence at a time can be searched against the data bank. A password abbreviation (for another user's sequences or a section of the data bank) can be used to specify a search sequence that is not in the user's own directory. Protein sequences are automatically searched against the NBRF PIR protein database, and nucleic acids against both strands of all, or a selected part, of the GenBank nucleic acid database. There is no provision for translating during the search (e.g., searching a protein against the nucleic acid database) as there is in the Homology comparison procedure.
4. The user is asked to specify the lower limit and then the upper limit of the sequence to select a region to be searched against the data bank. The defaults are the first residue as the lower limit and the last residue as the upper limit. If the sequence is longer than 2200 residues, the user has to search separate regions in turn (see Note 7).
5. If the search sequence is a protein, the user will be asked which amino acid code is to be used in the output. Press 1 for the one-letter code, 2 for the special MicroGenie two-letter code (listed in Appendix 2 of the manual), and 3 (which is the default) for the standard three-letter amino acid code.
6. The program then asks for the minimum number of matches in a homology. The default values are 20 for nucleic acids and 10 for protein; see Note 8.
7. The user is then asked to give the data bank key name. Press return for the default, which is to use the whole of GenBank (if the sequence is a nucleic acid) or the PIR protein database (for a protein sequence). If the password for a section of the GenBank sequences is given, then only the sequences stored in this section will be used (e.g., PRI: to search the primate section of the data bank). The passwords (key names) for the sections of the data bank are given in Appendix 7 of the MicroGenie manual and in Note 6. The first three letters of the password of another user can be given, and the “*” notation can be used to specify a subset of the sequences stored under the selected password. If the “*” notation is used with no password, sequences in the user's own directory will be searched (see Note 9).
8. The next menu gives a numbered list of options for the output. The user would normally select 1 to store the output on the hard disk. This is the default if return is pressed. The user can then Examine the output on the screen and Print it when returning to the DATA BANK menu. If option 2 is selected, the output will be printed without any opportunity to view it first. Press the number corresponding to the choice, or press return to store the output on the hard disk. See Note 10 for details of the other options.
9. When the required option for the output has been selected, the homology search begins. The name of the search sequence and the sequences being compared will be shown on the screen, and they scroll upward as the search progresses. When a homology has been found, the page of the output on which it will be displayed is shown on the screen beside it. If a fast computer is being used, there is not usually time to read the sequence name before it scrolls off the screen, but it is encouraging to see that homologies have been found. When the search is complete, MicroGenie displays a message asking the user to press the return key; this takes the user back to the DATA BANK menu.


10. From the DATA BANK menu, the user can Examine or Print the output, which will be saved in the file OUTFILE, as in most other MicroGenie procedures. For each sequence that has homologies to the search sequence, the output gives the data bank sequence name, the strand (upper or lower), the number of matches in the homology, the length of the homologous region, and the percentage of matches within the homology. The homologous region of the search sequence is printed across the page with the matching region of the data bank sequence printed below it. Residues that do not match are marked with a “*” symbol. At the end of the output, there is an index to the pages of the output. If no homologies are found, the output file is empty. The output is saved in the file OUTFILE in the root directory with the index in the file OUTFILE.IND, as for other MicroGenie output routines.

3.1.2. Data Bank Annotations

When a sequence with homology to the test sequence has been located, the user may want to examine the annotations associated with the sequence in the GenBank or NBRF database to obtain further details of the sequence. The user needs to know the data bank section and the number of the sequence within that section; see Notes 6 and 13. If the user only needs to find the literature reference, and normally searches sequences on the hard disk, it may be quicker to use the “Display a Sequence” procedure in the Entry section and look at the sequence comment, which contains an abbreviated reference.

1. At the MICROGENIE menu, press D to select the Data Bank section of the program. If the data banks have been transferred to the hard disk (Note 2), the user should check that the CD-ROM drive letter is specified in the Setup section and the correct CD is in the drive (see Notes 14 and 15).
2. Press A at the DATA BANK menu to select the Annotations procedure.
3. Press D to display the annotations on the screen or P to print them.
4. The user is then asked for the name of the sequence. Enter the three-letter password abbreviation for the data bank section followed by the data bank sequence number (e.g., PRI:5945) (see Note 13).
5. MicroGenie then displays the annotations from the database record for the sequence. The annotations can be examined by using the Page Up and Down keys and the Home and End keys (or by selecting a page number) to move through the display, as in the output from other procedures. Press return to go back to the DATA BANK menu.
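When a long list of hits has been noted down from a search, it can help to check that each reference has the form expected by the Annotations prompt in step 4 (a three-letter section abbreviation, a colon, and a sequence number, as in PRI:5945) before typing it in. The Python sketch below is a hypothetical helper for that bookkeeping only; it is not part of MicroGenie, and the acceptance of lower-case letters and surrounding spaces is an assumption made for convenience.

```python
import re

# Three letters, a colon, then the sequence number, e.g. "PRI:5945".
_REFERENCE = re.compile(r"^([A-Za-z]{3}):(\d+)$")


def parse_reference(text):
    """Split a data bank reference into (section, sequence_number).

    Raises ValueError if the text does not look like "PRI:5945".
    """
    match = _REFERENCE.match(text.strip())
    if match is None:
        raise ValueError(f"not a data bank reference: {text!r}")
    section, number = match.groups()
    return section.upper(), int(number)


if __name__ == "__main__":
    for entry in ["PRI:5945", "rod:120", "PRI5945"]:
        try:
            print(entry, "->", parse_reference(entry))
        except ValueError as error:
            print(entry, "->", error)
```

A check of this kind catches a missing colon or a mistyped section abbreviation before the reference is entered at the prompt.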

3.1.3. Finding Sequences in the Data Bank

The data bank sequences can be treated as normal MicroGenie sequences and analyzed, compared, or copied using the password abbreviation for the appropriate data bank section as listed in Note 6. They can be located using the “List the Directory” procedure in the Enter, or most other, sections. Use the “*” notation to specify the sequence as closely as possible, since a number of screens of output may have to be inspected if there are many sequences with similar names. The actual sequence may have to be examined to make sure the correct one has been located. Once the sequence is located, a note should be made of the sequence number (see also Note 13).

The CD-ROM contains the annotations provided by GenBank and NBRF that include the keywords and author name fields. Keywords or author names can be used to retrieve specific sequences, or groups of sequences, using the “Find Sequences” section of the DATA BANK menu. The options under this heading are to “Get sequences by keyword”, “Retrieve sequences by author”, and to list the keywords and author names used in the data bank. Before using these procedures to obtain a sequence, the user may need to examine the list of keywords, or author names, that are used in the data bank. This is useful for checking that the keyword or name is used, or for finding an alternative if it is not. Similarly, the list of author names can be used to check the spelling and initials of an author before using the “Retrieve sequences by author” procedure.

3.1.3.1. To List Keywords in the Data Bank

1. At the MICROGENIE menu, press D for the Data Bank section.
2. Press F for the Find sequences section. To examine the list of keywords, press K for keyword list.
3. When asked for the “Key name,” enter a keyword in order to confirm that it is used in the data bank.
If an abbreviated version of the keyword is given using the “*” notation, the user will be shown all the similar keywords and may be able to choose one that is more appropriate for the sequence retrieval. To list keywords in the NBRF protein data bank, use the password PRO: (followed by the keyword, with “*” notation if necessary). If a password is not given, the keywords for the whole of the GenBank data bank will be accessed. The GenBank keywords are not divided into further sections (i.e., ROD: cannot be used to list only keywords used in the rodent section of the data bank). If return is pressed at this prompt, MicroGenie will list all the keywords used in the data bank. There are over 5000 and this will take a considerable time (press return to stop and cancel the listing). Preface the key name with “P/” to print the output rather than display it on the screen.
4. If there is more than one page of output, move up and down through it using the Page Up and Page Down keys (Home and End take the user to the beginning and end of the output).
5. Press return to go back to the FIND SEQUENCES menu.

3.1.3.2. To List Authors

The names of authors associated with the data bank sequences can be listed by going to the “Find sequences” section as described above and requesting the “Author list.”

1. At the MICROGENIE menu, press D for the Data Bank section.
2. Press F for the “Find sequences” section.
3. Press A for “Author list.”
4. When asked for the key name, the user can give the complete name and initials to confirm that an author's name is included in the data bank. Give part of a name using the “*” notation if it is necessary to check the correct spelling and initials to use when retrieving sequences by author name. If the password PRO: is used, the authors in the protein data bank will be accessed. If no password is specified, the GenBank nucleic acid data bank will be used. This cannot be subdivided by password in this procedure. Preface the keyword with “P/” to print rather than display it on the screen.
5. MicroGenie then displays the author name, or names if the “*” notation was used. The user can look through the list of names, using the Page Up and Down keys (Home and End take the user to the beginning and end of the output). The output will have been printed rather than displayed on the screen if the keyword was preceded with “/P”.
6. Press return to go back to the FIND SEQUENCES menu.

3.1.3.3. To Obtain a Sequence by Keyword

1. At the MICROGENIE menu, press D for the Data Bank section.
2. Press F for the “Find sequences” section.
3. Press G to “Get sequences by keyword.”
4. In response to the prompt, type the keyword. A data bank section password and the “*” notation can be used in the name if appropriate. The password allows the user to obtain only sequences in a particular section of the data bank (e.g., ROD: for rodent sequences). If the “*” notation is used, all sequences with keywords that match will be displayed (e.g., “oncogene *” finds “oncogene”, “oncogene-induced”, “oncogene cellular”, “oncogene expression”, and “oncogene viral”).
5. MicroGenie then displays the sequences that meet the specified keyword. The user can look through the list of sequences using the Page Up and Down keys (Home and End take the user to the beginning and end of the output). The output can be printed rather than displayed on the screen if the keyword is preceded by “/P.” The MicroGenie manual warns that the NBRF Protein Data Bank uses keywords in unexpected ways, and the user may not find all the expected sequences.
6. The user should press return to go back to the FIND SEQUENCES menu.

3.1.3.4. To Retrieve Sequences by Author Name

1. At the MICROGENIE menu, press D for the Data Bank section.
2. Press F for the “Find sequences” section.
3. At the FIND SEQUENCES menu, press R for “Retrieve sequences by author.”
4. The user is then asked to give the name of the author. The name can be preceded with the data bank section password to specify the protein data bank, or a section of the GenBank nucleic acid data bank. If the user does not give a data bank password, sequences will be retrieved from the whole of the GenBank. The author name has to be specified exactly (unless the “*” notation is used to retrieve sequences with similar author names). The punctuation is critical; the surname must be followed by a comma and then the initials, each followed by a full stop (dot). If the name is not matched exactly to the author name in the data bank, MicroGenie will report that “No entry by that name can be found; try again.” Even omitting the dot after the final initial will cause this, unless the “*” notation is used.
5. The user is then shown the MicroGenie data bank sequence number, the sequence name, and the data bank section for the sequences that are associated with the author, or authors, that are specified.
6. Press return to go back to the FIND SEQUENCES menu.

3.2. Compare Section

The procedures in this section allow the user to compare two sequences (protein or nucleic acid) and list the regions of homology between them. The “Compare within a Sequence” procedures locate repeats or inverted repeats (stem and loop structures in nucleic acids) within a single sequence. Two or more sequences may be aligned with each other to give the best match, with gaps added if necessary to improve the homology. Two sequences can also be compared with each other using the dot-matrix method, and this method can also be applied to the location of repeats and inverted repeats within an individual sequence. The procedures in the Compare section can be used with both proteins and nucleic acids (and some can be used to compare a protein with a nucleic acid).

3.2.1. Homology Comparison of Two Sequences

The Homology Comparison procedure finds regions of homology between two sequences and displays, or prints, them together with the sequence numbers of the beginning and end of each region. It also gives the length and percentage homology of each region. This procedure can compare nucleic acids or protein sequences, and a nucleic acid sequence against a protein. The degree of homology that is necessary for a region to be listed is controlled by two parameters in the selected parameter set (see Note 19).

1. From the MICROGENIE menu, press C for the Compare section.
2. Press H for the Homology Comparison procedure.
3. The user is then asked for the name of the first sequence to be compared.
4. To limit the region of the sequence to be compared, press L in response to the prompt “Specifications (A,F,S,L)”. The user is then asked for the number of the residue at the lower limit and then the number of the residue at the upper limit. The defaults are the first and last residue, respectively, if return is pressed in response to either. The Strand (Upper or Lower), or Form can also be changed when a nucleic acid sequence is being compared.
5. The user is then asked to enter the name of the second sequence to be compared. This and the first sequence name can include a password abbreviation to access sequences in another directory, and the “*” notation can be used to specify more than one sequence (see Note 16).
6. The user then has the option to specify the length of the region to be compared for the second sequence and to change the Strand (Upper or Lower), or Form if it is a nucleic acid sequence.
7. A prompt then appears asking for the parameter set, giving the options (/,0-9). If return is pressed, the default parameter set 1 will be used. If there are more than 400 matches between the sequences, the user will have to adjust the parameters to increase the stringency and repeat the comparison. To use a modified parameter set, enter its number here in order to select it. If “/” is entered followed by the number of a parameter set at the prompt, the user will be taken to the parameter modification screen (COMPARE:Set) to modify that parameter set for use in the analysis (see Note 17).
8. From the numbered list of the options for the output, the user would normally select 1 to store the output on the hard disk. The user can then Examine the output on the screen and Print it after returning to the COMPARE menu. If option 2 is selected, the output will be printed without any opportunity to view it first. Option 3 cancels the comparison. The output can be saved to a named file on a floppy disk if option 4 is selected; see Note 11 for further details. Press the number corresponding to the choice, or press return to store the output on the hard disk by default.
9. MicroGenie shows the progress of the comparison by displaying the names of the sequences and the page of the output that contains the details of the homologies. When two sequences are being compared, MicroGenie will inform the user if no homologies were found at the level specified by the parameters that were chosen in Section 3.2.1., step 7. If the “*” notation is used to compare a number of sequences, only those having homologies will be shown, but all the comparisons will be listed in the index at the end of the output.
10. When the comparison procedure is finished, the user will be asked to press return and will be taken back to the COMPARE menu.
11. From the COMPARE menu, the user can press E to Examine the output or P to Print it. The output goes to the same file (OUTFILE) as that from the ANALYSIS section and will overwrite any previous OUTFILE.
12. The output lists regions from the first sequence with the homologous regions of the second one aligned below them. Positions where residues do not match are marked with a “*” symbol. The locations of the ends of the regions of homology in their respective sequences are marked at the ends of the line. The number of matches, length of the region, and the fraction of the residues that match are printed beside the homology.

3.2.2. Compare Within a Sequence

This section contains the procedures that are used to locate repeats or inverted repeats (stem and loop structure) within one sequence (effectively comparing a sequence against itself, or its sequence reversed, rather than against a second sequence).

1. From the MICROGENIE menu, press C for the Compare section.
2. Press C for the “Compare within a Sequence” procedure.
3. The user is then asked for the name of the sequence.


4. A prompt appears asking for the parameter set to use and gives the options (/,1-9). If return is pressed, the default parameter set 1 will be used. To use a modified parameter set, enter its number to select it. If “/” is entered followed by the number of a parameter set at the prompt, the user will be taken to the parameter modification screen (COMPARE:Set) to modify the parameter set for use in the analysis (see Note 17).
5. To limit the region of the sequence to be compared, press L in response to the prompt “Specifications (A,F,S,L)”. The user is then asked for the number of the residue at the lower limit and then the number of the residue at the upper limit. The defaults are the first and last residue, respectively, if return is pressed in response to either. The Strand or Form can also be changed when a nucleic acid sequence is being compared.
6. The user is then asked which procedure or procedures are to be used. Press “?” for a list of those procedures that are available here (9, 10, 14, and 15); these are described in more detail in the next section. Type the number of the procedure to use and press return. To use more than one procedure, type their numbers with a space between each and then press return.
7. MicroGenie then asks for the name of another sequence to use with the “Compare within a Sequence” procedure. If a sequence name is given, the user will be taken back to Section 3.2.2., step 4 to enter details for a further analysis. When return is pressed without a sequence name, the program begins the comparison procedure or procedures that have been selected.
8. From the numbered list of options for the output, the user would normally select 1 to store the output on the hard disk. The user can then Examine the output on the screen and Print it when returning to the COMPARE menu. If option 2 is selected, the output will be printed without any opportunity to view it first. Option 3 cancels the comparison. To go back to the COMPARE menu to list sequences in order to find the names of further sequences to analyze, press 4 to retain existing selections and include them with subsequent selections. The output can be saved to a named file on a floppy disk if option 5 is selected; see Note 12 for further details. Press the number corresponding to the choice or press return for the default, which is to store the output on the hard disk.
9. MicroGenie shows the progress of the comparison by listing the names of the sequences being compared and the procedure with the number of the page of output that contains the start of the list of homologies.


10. When the comparison procedure is finished, the user will be asked to press return and will be taken back to the COMPARE menu.
11. From the COMPARE menu, press E to Examine the output or P to Print it. The output goes to the same file (OUTFILE) as that from the ANALYSIS section and will overwrite any previous OUTFILE.

3.2.2.1. PROCEDURES FOR COMPARE WITHIN A SEQUENCE

1. Procedure 9, “Find repeated regions”, locates regions that are repeated within a protein or nucleic acid sequence, allowing for mismatches or loopouts (deletions or insertions). The parameters that control the comparison are described in Note 18.
2. Procedure 10, “Find inverted repeats” (stem-and-loop structures or dyad symmetries), is only applicable to nucleic acids and calculates the free energy of the stem in addition to the percentage homology. The parameters controlling the number of homologies that will be shown are the same as for Procedure 9 and are described in Note 18. In addition, parameters 31, MINLOOP, and 32, MAXLOOP, control the minimum and maximum size, respectively, of the loop that can be formed.
3. Procedure 14, “Find repeats (matrix method)”, locates regions that are repeated within a protein (or nucleic acid) using a dot-matrix plot. This is the same as the dot-matrix comparison of two sequences, described below, but the sequence is compared to itself instead of to another sequence.
4. Procedure 15 locates inverted repeats in nucleic acid sequences using the dot-matrix graph method; the sequence is compared to its complement rather than to another sequence.

3.2.3. Matrix Comparison

MicroGenie can compare two sequences, or selected regions of two sequences, using a dot-matrix graph in which regions of homology are marked by diagonal lines on the graph. Deletions or insertions are shown by a break and displacement in the line of homology. The scales on the axes of the graph are automatically chosen to suit the lengths of the sequences (or their specified regions) being compared. The Matrix Comparison procedure, which can be used with nucleic acids or proteins, is described in Chapter 18 (MicroGenie: Protein Analysis).

3.2.4. Alignment of Sequences

MicroGenie can align two or more sequences in their optimum alignment, allowing for the insertion of gaps and conservative substitutions of amino acids when the sequences are proteins.
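The consensus rule applied to multiple alignments (step 9 of the procedure that follows) can be illustrated with a short Python sketch. This is only a reading of the rule as stated in the text, not MicroGenie's own code: the function name `consensus`, the use of “-” as a gap character, and the interpretation of “more than the specified percentage” as a strict inequality are all assumptions.

```python
from collections import Counter

GAP = "-"  # assumed gap symbol; the chapter does not say what MicroGenie uses


def consensus(aligned, percent=50.0, is_protein=True):
    """Return a consensus string for equal-length aligned sequences.

    A residue enters the consensus only if it occurs in more than
    `percent` percent of the sequences at that position; otherwise
    "X" (protein) or "N" (nucleic acid) is written, as in step 9.
    """
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")

    unknown = "X" if is_protein else "N"
    result = []
    for column in zip(*aligned):
        # Count residues only; gaps still count toward the denominator.
        counts = Counter(c for c in column if c != GAP)
        symbol = unknown
        if counts:
            residue, count = counts.most_common(1)[0]
            if 100.0 * count / len(aligned) > percent:
                symbol = residue
        result.append(symbol)
    return "".join(result)


if __name__ == "__main__":
    dna = ["ACGT", "ACGA", "ACCA", "TCGA"]
    print(consensus(dna, percent=50, is_protein=False))
    print(consensus(dna, percent=75, is_protein=False))
```

With the default of 50%, a position shared by exactly half of the sequences does not reach the consensus under this strict reading; if MicroGenie treats the threshold as inclusive, the comparison would be `>=` instead.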


1. From the COMPARE menu, press A for Alignment of Sequences.
2. The user is then asked whether two or Multiple sequences are to be aligned. Press 2 (or return) to align two sequences; press M to align multiple sequences. (Up to 60 sequences can be aligned; the length limit is 1000 residues for a multiple alignment and 30,000 when aligning two sequences.)
3. The user is then asked for the name of the first sequence to be aligned.
4. To limit the region of the sequence to be aligned, press L in response to the prompt “Specifications (A,F,S,L)”. The user is then asked for the number of the residue at the lower limit and then the upper limit as described in previous sections. The defaults are the first and last residue, respectively, if return is pressed in response to either. The user can also change the Strand (and Form) when a nucleic acid sequence is being compared.
5. The user is then asked to enter the name of the second sequence to be compared and again given the option to specify the length of the region to be compared.
6. If a multiple alignment has been selected, the user will now be prompted for the Next Sequence and any specifications. This process will be repeated until return is pressed on a blank line. Then the program goes to the next step. The user can enter the names of the sequences individually or use the “*” notation to specify a subset if they are named appropriately.
7. If aligning two protein sequences, MicroGenie will ask if the user wants to align by Identity or Similarity. If I is pressed, the alignment procedure will only use identical amino acids, and if S (which is the default) is pressed, it will align allowing for conservative replacements of amino acids.
8. A prompt then appears asking for the parameter set to use, giving the options (/,0-9). If return is pressed, the default parameter set 1 will be used. If a parameter set has been modified for use with this procedure, enter its number here in order to use it. If “/” is entered followed by the number of a parameter set at the prompt, the user will be taken to the parameter modification screen (COMPARE:Set) to modify that parameter set for use in the analysis (see Note 17).
9. If a multiple alignment is selected, the user will be asked for the required percentage for consensus. A consensus sequence is generated from the sequences and printed above the aligned sequences. If the same residue is present in more than the specified percentage of the sequences, it will be included in the consensus sequence at that position. If there is no residue present at the required level, an “X” (if protein) or “N” (if nucleic acid) symbol is inserted in the consensus sequence. A default of 50% is used if return is pressed.


10. The user is asked to give a sequence name for the consensus sequence that is generated (if a multiple alignment has been selected). This is saved as a normal sequence. If return is pressed without giving a name for the consensus sequence, it is not saved.
11. From the numbered list of options for the output, 1 would normally be selected to store the output on the hard disk. The user can Examine the output on the screen and Print it after returning to the COMPARE menu. If option 2 is selected, the output will be printed without any opportunity to view it first. To cancel the alignment, press 3, and to save the output on a floppy disk, press 4; see Note 11. Press the number corresponding to the choice or press return for the default, which is to store the output on the hard disk.
12. MicroGenie then begins the alignment procedure. For a multiple alignment, it reports the stages of the process: Reading sequences, Aligning sequences, Optimizing sequences, and then Recording. The user will be asked to press return to go back to the COMPARE menu when it has finished.
13. From the COMPARE menu, E can be pressed to examine the output or P to print it. The output goes to the same file (OUTFILE) as that from the ANALYSIS section and will overwrite any previous OUTFILE.
14. The output for the alignment of two sequences consists of the sequences printed across the page with a vertical bar between identical residues. If two proteins have been aligned by similarity, the vertical bar is replaced by a “:” where there are conservative replacements of amino acids. A table gives the number of residues that are matched and unmatched (gaps inserted) with the length and percentage of the alignment that is matched.
15. When more than two sequences have been aligned, the consensus sequence is displayed across the top of the screen with the aligned sequences below it. Residues that do not match the consensus are shown in lower-case letters. At the end of the output, there is a table listing the sequences and their percentage match to the consensus sequence.

4. Notes

4.1. Factors Affecting the Speed of MicroGenie Data Bank Searches

1. It appears that the processor speed of the computer is the most significant factor in determining the time taken to search the data bank. Table 1 shows comparative times with various computers and buffer settings for the CD-ROM and the hard disk. Changing the processor speed (16 to 8 MHz for the Dell 316SX) produces a proportionate change in the search time; changing the buffers for the hard disk or CD-ROM has less effect. It appears that the access time of the disk is not a major factor. (The number of the CD-ROM buffers does affect the speed of retrieval of sequences by keyword or author.)

4.2. Transferring Data Bank Sequences to a Hard Disk

2. It is possible to transfer some or all of the data bank sequences to the hard disk, so that they may be searched more rapidly on the hard disk than on the slower CD-ROM drive. To transfer the data bank to the hard disk, insert the MicroGenie CD-ROM in the drive, and from the root directory of the hard disk type “TRANSBNK Y ALL” (where Y is replaced by the drive letter of the CD-ROM drive). This will transfer all the data bank sections to the hard disk. If there is not enough space for the whole data bank, the user can select a section of it by replacing “ALL” with the password for the section to be transferred, e.g., “TRANSBNK E PRO” to download the protein data bank from the CD-ROM in drive E: (the “:” does not have to be included in the drive letter or the data bank password, but it is required when using the password while running MicroGenie). The command “CLEARBNK ZZZ” (where ZZZ is replaced by a data bank password) will remove the specified data bank from the hard disk; replace “ZZZ” with “ALL” to remove the whole data bank. The improvement in search times is not very great (about 13-27%; see Table 1), but is worthwhile if adequate hard disk space is available. Moving the MicroGenie data bank to the hard disk is particularly useful if the user also has another database on a CD-ROM (e.g., the EMBL data library CD-ROM, which can be searched by FASTA; see Chapter 26 for details of the FASTA program). If the MicroGenie data bank is normally searched on the hard disk, the other data CD-ROM can be left in the CD-ROM drive to avoid the problems that can occur if a program is used with the wrong CD-ROM; see Note 15. The MicroGenie manual states that if the user has part of the data bank on the hard disk, MicroGenie will look for the data on the hard disk first and then search the CD-ROM for data bank sections that it does not find on the hard disk. This does not appear to be the case with Version 7.01, which searches on the CD-ROM if it is given a drive letter in the Setup section, and will only search the hard disk if no CD-ROM drive is set. Only the sequences and sequence comments are transferred to the hard disk. The annotations remain on the CD-ROM.
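The command forms quoted in Note 2 are regular enough to be generated and checked before they are typed at the DOS prompt. The Python sketch below is a convenience illustration only: the helper names are hypothetical, the section passwords are those listed in Note 6, and nothing here is part of MicroGenie itself.

```python
# Data bank passwords as listed in Note 6 of this chapter.
PASSWORDS = {
    "BAC", "INV", "MAM", "ORG", "PHA", "PLA", "PRI",
    "PRO", "RNA", "ROD", "SYN", "UNN", "VER", "VIR",
}


def _section(name):
    """Normalize a section name; the trailing ':' is optional here."""
    key = name.strip().rstrip(":").upper()
    if key != "ALL" and key not in PASSWORDS:
        raise ValueError(f"unknown data bank section: {name!r}")
    return key


def transbnk_command(cd_drive, section="ALL"):
    """Build the TRANSBNK line for copying a section to the hard disk."""
    drive = cd_drive.strip().rstrip(":").upper()
    if len(drive) != 1 or not drive.isalpha():
        raise ValueError(f"not a drive letter: {cd_drive!r}")
    return f"TRANSBNK {drive} {_section(section)}"


def clearbnk_command(section="ALL"):
    """Build the CLEARBNK line for removing a section from the hard disk."""
    return f"CLEARBNK {_section(section)}"


if __name__ == "__main__":
    print(transbnk_command("E:", "PRO"))
    print(clearbnk_command("PRO"))
```

The strings produced follow the forms given in the text (“TRANSBNK E PRO”, “CLEARBNK ZZZ”); they still have to be run from the root directory of the hard disk as described above.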

4.3. Installation of a CD-ROM Drive

3. MicroGenie is supplied with a batch file to install the CD-ROM software from the disk that will be supplied with the CD-ROM drive. It will also modify the CONFIG.SYS and AUTOEXEC.BAT files to access the CD-ROM drive. This batch file, called CDINSTAL.BAT, is placed in the root directory of the drive when MicroGenie is installed. If the user does not wish to use the defaults for the settings that this provides, the CD-ROM extensions may be installed manually according to the instructions supplied with the CD-ROM drive unit. Look at the modifications that would be made by CDINSTAL.BAT as a guide to the drive designations used by MicroGenie. If a user has MS-DOS 5.0, it may be necessary to use the SETVER command to add MSCDEX.EXE to the version table as described in the DOS 5.0 manual. The version of MSCDEX.EXE must be 2.2 or later to work with DOS 5.0 (2.21 or later for use with Windows 3.1). The MicroGenie manual suggests that the default CD-ROM buffers setting can be reduced from 8 to 4 to free more memory for the program (/M:4 instead of /M:8 in the line “MSCDEX.EXE /D:MSCD000 /M:8,” which CDINSTAL.BAT adds to AUTOEXEC.BAT). This will slow down the rate at which data are read from the CD-ROM, but since computational speed rather than disk speed seems to be the most significant factor affecting search times, this will have little effect on homology searches; see Table 1. The speed of retrieval of sequences by author or keyword is affected by the size of the CD-ROM buffers, and halving the value of /M: more than doubles the time taken to retrieve a sequence from the CD-ROM (finding a sequence by keyword took 117 s with /M:4 and took 50 s when /M: was set to 8).

4.4. Increasing Free Memory to Run MicroGenie

4. If the user has a number of memory resident programs, there may not be enough free memory to run MicroGenie (the Files section will run in 510,016 bytes of free memory, but not in 503,880).
The user can check the amount of free memory on the system by using the DOS command CHKDSK (or MEM at MS-DOS 4 and later). If using MS-DOS 5.0 or a High Memory manager (e.g., QEMM or 386Max), the user can free more memory by moving drivers into the upper memory area above 640K. This may be the best solution to memory problems, but if more memory is still needed, the following suggestions may help. If the user has a CD-ROM, the MicroGenie manual suggests reducing the number of CD-ROM buffers as discussed under CD-ROM drive installation (each CD-ROM buffer takes 2048 bytes of RAM). There are also DOS buffers to improve the speed at which data are read from the hard disk. It is possible to reduce the number of buffers without significantly degrading the speed of the database search (each buffer takes about 500 bytes of RAM); see Table 1. Similarly, use of a disk caching program is unlikely to be worthwhile with MicroGenie homology searches (if there is sufficient free memory to use one, it can improve the speed of other programs, such as Windows 3.0). The Microsoft Windows 3.0 manual gives some useful tips for freeing more memory. If the DOS 4.01 SELECT program has been allowed to construct CONFIG.SYS and AUTOEXEC.BAT files with maximum memory for DOS, there will be a number of drivers and memory resident programs that are not essential. Device drivers or programs run from AUTOEXEC.BAT or CONFIG.SYS can be removed temporarily by adding “REM” to the beginning of the command line, so that they can easily be reinstated if needed, or if more memory becomes available. (If a REM line is included in the CONFIG.SYS with DOS prior to 4.0, it will be reported as an error at start-up, but the line is still ignored.) The graphics printer support (GRAPHICS) that is needed to give a screen-dump of a graphic image on a printer can be useful, since it gives a printout at right angles to the one normally produced by the MicroGenie routines. The user may need to remove it, or the equivalent program supplied for the printer, to free more memory. The FASTOPEN program is unlikely to give a noticeable improvement to MicroGenie and can be removed. Adding the STACKS 0,0 line to CONFIG.SYS will free some additional memory. If the user lives outside of the US and uses the international keyboard and character set (code page) support programs, these can take a significant quantity of memory.
For a computer that is to be used primarily with MicroGenie, it is unlikely that the use of the default US settings will cause problems in sequence handling, unless the user has a keyboard where some of the letter keys are not in the same positions as on the US keyboard (in the case of the UK keyboard, the interchange of the " and @ keys is the main difference likely to be noticed). The programs and drivers NLSFUNC.EXE, GRAFTABL.COM, KEYB.COM, DISPLAY.SYS, KEYBOARD.SYS, and PRINTER.SYS can be omitted. If the computer is also used regularly for word processing or some other application where international language support becomes important, the computer could be started from a floppy disk having the CONFIG.SYS and AUTOEXEC.BAT files with the less frequently used settings. If using Windows, the DOS configuration that is specified on start-up will be used when a DOS program is run from Windows. The user should omit the mouse driver from CONFIG.SYS since Windows loads its own driver when it starts, and KEYB.COM in the AUTOEXEC.BAT can be replaced by the international keyboard support for Windows programs. If the user has a hard disk with a partition that is larger than 32 Mbyte and it is running under DOS 4.01 (or DOS 4.0), it is unwise to omit running SHARE in order to save more memory (see Note 5).
5. SHARE is normally used with a network to control access to files and it is also needed if DOS 4.01 is used with a hard disk partition larger than 32 Mbyte. If SHARE is not loaded under these circumstances, there is a possibility that an old program that uses file control blocks to access the disk could “wrap around” and write data that should be beyond 32 Mbyte at the beginning of the disk, which would have disastrous consequences if it overwrote directory information. This situation may never arise, if the user is lucky, but it is considered so serious that a warning message is issued if the partition is larger than 32 Mbyte with DOS 4.01, and when SHARE is found in the root directory, it will be loaded automatically even if it was not requested.
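The memory figures quoted in Note 4 can be combined into a quick estimate of how much a given change to the buffer settings is worth. The following Python sketch uses only numbers from the text (2048 bytes per CD-ROM buffer, about 500 bytes per DOS buffer, and the two free-memory readings for the Files section); the function name and the DOS buffer counts in the example are illustrative assumptions.

```python
CD_BUFFER_BYTES = 2048   # per CD-ROM buffer (/M: value), from Note 4
DOS_BUFFER_BYTES = 500   # approximate, per DOS buffer, from Note 4

FILES_RUNS_AT = 510_016  # free bytes at which the Files section ran
FILES_FAILS_AT = 503_880  # free bytes at which it did not


def memory_freed(cd_before, cd_after, dos_before=0, dos_after=0):
    """Approximate bytes released by lowering the two buffer counts."""
    return ((cd_before - cd_after) * CD_BUFFER_BYTES
            + (dos_before - dos_after) * DOS_BUFFER_BYTES)


if __name__ == "__main__":
    # Reducing /M:8 to /M:4, as the MicroGenie manual suggests.
    freed = memory_freed(8, 4)
    gap = FILES_RUNS_AT - FILES_FAILS_AT
    print(f"freed by /M:8 -> /M:4: {freed} bytes")
    print(f"gap between failing and working readings: {gap} bytes")
    print("enough on its own" if freed >= gap else "more savings needed")

    # Hypothetical example: also dropping DOS BUFFERS from 30 to 20.
    print(memory_freed(8, 4, dos_before=30, dos_after=20))
```

On these figures, halving the CD-ROM buffers releases 8192 bytes, which is more than the 6136 bytes separating the failing and working readings, so that change alone would have been sufficient in the case described. The exact minimum requirement lies somewhere between the two readings and is not stated in the text.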

4.5. Homology Searches

6. The password abbreviations used to access the data bank sections are BAC: Bacteria, INV: Invertebrate, MAM: Mammal, ORG: Organelle, PHA: Phage, PLA: Plant, PRI: Primate, PRO: Proteins, RNA: Structural RNA, ROD: Rodent, SYN: Synthetic, UNN: Unannotated, VER: Vertebrate, and VIR: Viral. The “:” must be included when using these in MicroGenie.
7. If a sequence longer than 2200 residues is to be searched against the data bank, the user has to make a number of separate searches and specify a different region for each until the whole sequence has been used. The regions should overlap, so that a region of homology spanning the junction between two sections is not missed.
8. The minimum length for a protein sequence that can be searched against the data bank is seven amino acids. For nucleic acid sequences, the lower limit is 15 bases. This can be a problem if the user wants to search for homologies to a short oligonucleotide or for the presence of a motif in the database. It is possible to enter the sequence as a search string (with ambiguity codes if it is a DNA consensus sequence) and use the restriction enzyme site location procedure 6 on the ANALYSIS menu, as described in Chapter 15 (MicroGenie: Introduction and Restriction Enzyme Analysis), to search through a specified section of the data bank (e.g., PRI:*). This approach can be used to search against sequences in the user’s directory, but is not very satisfactory when applied to the data bank, since it is then necessary to search through a large number of pages of output, most or all reporting that the search string was not found. There is a limit of 999 pages of output that can be produced (this gives an OUTFILE of about 2 Mbyte), so it may be necessary to use the “*” notation within the data bank sections to further restrict the number of sequences searched, e.g., ROD:A* and then ROD:B*; this can be very tedious. After the search, it may be quicker to leave MicroGenie, read OUTFILE into a word processor, and then search for a string, such as “#,” that is characteristic of a site being located successfully, rather than to read through the output on the screen.
9. The “*” notation enables the homology search procedure to compare a sequence against one or more of the user’s sequences as a database. This procedure may be more useful for finding long regions with lower homology between sequences than the procedures in the Compare section. The full name of one of the user’s own sequences can be specified if the user wants a comparison with only one sequence. If the user wants a homology comparison to one sequence in the data bank, the password and the sequence number should be specified. See Note 13 for further details of the data bank sequence numbers. If the “*” notation is used and no password is used, then sequences in the user’s own directory will be searched (i.e., if “*” only is given as the Data Bank key name, all the sequences in the user’s own directory will be searched). If a password is combined with the “*” notation, the user can specify some or all sequences in another directory. See Chapter 16 (MicroGenie: Shotgun DNA Sequencing) for further discussion of the “*” notation and the advantages of a systematic naming system to enable the user to select subsets of sequences.
10. The default option for output from the Data Bank search is 1 to store the output on the hard disk. If 2 is selected, the output will be printed before returning to the DATA BANK menu.
If Option 3 is selected, the search will be canceled and the user will be returned to the DATA BANK menu. Option 4 allows the user to store the output on a floppy disk, which may be useful if the user wants to print it on a high-quality printer on another machine. If this is chosen, the user will be asked for a file name. Do not include the drive or a file extension. See Notes 11 and 12 for further details.
11. The file name can have up to eight characters, following the rules for normal DOS file names (if not sure of the rules, use only letters and numbers), and no drive letter or file name extension. The option to store output on a floppy disk is useful if a number of people use MicroGenie, since each user’s output is stored in a named file on the floppy disk when the search is complete. This reduces the possibility of the user’s OUTFILE being overwritten if someone else uses the computer before the user has had time to look at the results. When the search is complete, the other user can press return as requested by the prompt and then exit MicroGenie. When MicroGenie is run by the other user, using his or her own password, the new OUTFILE on the hard disk will not overwrite the original user’s output on the floppy disk; this can be collected later and examined using a word processor on another computer or with MicroGenie as described in Note 12.
12. If the user’s output has been stored on a floppy disk and the user immediately selects the Examine output option on the DATA BANK menu, the program will ask for the name of the output file to be examined. If the user wants to examine a file that was saved previously, and output has been sent to OUTFILE on the hard disk subsequently, the user will not be prompted for the name of the output file on the disk. To cause MicroGenie to examine output on the floppy disk, the user should run a rapid search (such as searching a short sequence in his or her own directory against itself) and use option 4 to store its output on the floppy disk (make sure a different name is chosen from the one used earlier, so that it will not be overwritten; MicroGenie does not warn when it is going to overwrite a file). When the user then uses the Examine option, the name of the original output file can be given at the prompt for the filename.
13. The sequences in the data bank have reference numbers that can be used instead of the full name. The data bank sequence name is long in order to give as much information as possible about a sequence. The sequence numbers can be found by listing the sequences in the individual data bank passwords and looking for the sequence in the listing. To avoid having to look through pages of output, specify the sequence name as closely as possible using the “*” notation.
Once the sequence number is known, it can be used instead of the sequence name to access the sequence, e.g., SYN:284. The data bank number is very useful for accessing data bank sequences without having to type a long name; however, it may change in a later data bank release and should not be used as the only record of a sequence of interest that has been located in the data bank. The MicroGenie data bank sequence number should not be confused with the GenBank Accession number that is included in the annotations.
14. If the user has transferred sequences to the hard disk for data bank searching and then wishes to use the “Annotations” or “Find Sequences” procedures, it is necessary to go to the Setup section from the MICROGENIE menu and select the correct drive letter for the CD-ROM drive. A warning message is given if the user omits to set the drive letter before using these procedures. It has to be reset to “No CDROM” in order to search the data bank on the hard disk.
15. Where there are other CD-ROM disks in use (e.g., the EMBL Data Library), the user should make sure that the MicroGenie Data Bank CD-ROM disk is in the drive before using procedures that access it. If the wrong CD-ROM is in the drive, there will be a premature return with no output from the “Search the Data Bank” and “Annotations” routines. If the “Find Sequences” procedures are used with the wrong CD-ROM in the drive, there will be an error message on the screen, and the user will be returned to DOS (this is one of the few ways that MicroGenie can be made to “crash”).

4.6. Compare Section

16. The sequence name for Homology Comparison can include a password abbreviation (e.g., PRI:) if the sequence is not stored in the user’s own directory. The “*” notation can be used to specify a number of sequences. If one of the pair of names for comparison contains the “*” notation, all the sequences that it includes will be compared in turn to the other sequence. When both the first and second sequence names contain “*”, each sequence in the first set will be compared with each in the second list. The index at the end of the output gives details of each comparison that was made.
17. The user can only examine and change the parameters relevant to the Compare procedures from this menu. The user will be asked if the modifications made to the parameter set are to be kept or used only for the current analysis.
18. In the comparison procedures, parameter 29, LOOPLEN, specifies the maximum length of an insertion or deletion (loopout) that can be allowed between the sequences. Setting it to 0 prevents gaps from being inserted in the sequence. The maximum distance between the starting positions of two repeated regions is set by parameter 30, MAXDIST, which is preset at 1000.
The parameters that control the level of homology of the regions to be displayed are the same as for the matrix homology procedures; see Note 19.
19. There are two parameters used to select the regions included in the homology and matrix comparison output. Parameter 27, MINMATCH, specifies the minimum number of matching residues that have to be present in a region of homology. MINPER, parameter 28, specifies the minimum percentage of matching residues in the region of homology. It is preset at 75%. Decreasing the value will show more homologies with more mismatches. These two parameters interact to define the length of the minimum homology that will be displayed, e.g., in a homology eight residues long, six residues must match (75%) if the value of MINMATCH is 6 and that of MINPER is 75 (the default values). If a homologous region is located that meets these criteria, it may be extended into regions of lower homology, so that not all of the region shown exceeds the value of MINPER.
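Two of the rules in these Notes are easy to make concrete: the splitting of a long query into overlapping regions (Note 7) and the interaction of MINMATCH and MINPER (Note 19). The Python sketch below illustrates both. The function names are hypothetical, the overlap of 100 residues is an arbitrary assumption (the text only says that the regions should overlap), and the threshold test is a reading of the parameter descriptions, not MicroGenie's implementation.

```python
def split_regions(length, max_len=2200, overlap=100):
    """Return 1-based (lower, upper) limits covering a sequence.

    Each region is at most `max_len` residues long and shares
    `overlap` residues with the region before it.
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be smaller than max_len")

    regions = []
    start = 1
    while True:
        end = min(start + max_len - 1, length)
        regions.append((start, end))
        if end == length:
            return regions
        start = end - overlap + 1


def meets_thresholds(matches, region_len, minmatch=6, minper=75):
    """True if a region satisfies both MINMATCH and MINPER."""
    enough_matches = matches >= minmatch
    # Integer form of: 100 * matches / region_len >= minper
    enough_percent = 100 * matches >= minper * region_len
    return enough_matches and enough_percent


if __name__ == "__main__":
    print(split_regions(5000))
    print(meets_thresholds(6, 8))   # the example given in Note 19
    print(meets_thresholds(6, 9))   # same matches, longer region
```

Each pair returned by `split_regions` corresponds to the lower and upper limits that would be typed after pressing L at the Specifications prompt. A larger overlap costs more search time but reduces the chance that a long region of homology is cut at a junction.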

CHAPTER 20

PC/GENE: Sequence Entry and Assembly

Timothy J. Larson and Patrick K. Bender

1. Introduction

The PC/GENE package of sequence management and analysis software provides a comprehensive set of over 70 programs capable of performing all of the analyses routinely required by the molecular biologist. The software is distributed by IntelliGenetics, Inc., 700 East El Camino Real, Mountain View, CA 94040. In Europe, the address is IntelliGenetics, Inc., c/o Amocolaan 2, B-2440 Geel, Belgium, and in Japan, Teijin, Ltd., Life & Science Project, 9F Urban Net Yokohama Bldg., 5-2 Nihon-Odori, Naka-ku, Yokohama, Kanagawa 231, Japan. Teijin distributes a copy of the software that runs on the NEC PC9801. The types of tasks performed by analysis programs may be placed into four basic groups, as outlined by Cannon (1). These include editing, storage, and retrieval of sequence data; mapping of sites on nucleic acid or protein sequences; translation and location of protein coding regions and prediction of protein structure; and, finally, sequence comparison and homology searches. All of these tasks are performed by the PC/GENE package. Examples of programs in each of these categories will be described in this and in the following chapters on PC/GENE. Because of space limitations, complete coverage of all programs is not possible. The PC/GENE package is accompanied by an excellent 600-page manual, and all of the programs contain on-line help. Thus, with a little practice, the novice can learn to use the programs for routine management and analysis of sequence data. The user may view the results of any particular analysis on screen or print out the results. In addition, results may be saved as ASCII text files or TIF graphics files (depending on the nature of the output) and later edited with the text editor (EMCS) or the graphics editor (PICASSO) that are included in the PC/GENE programs. The file formats may also be imported into a word processor or alternate graphics program for editing and/or labeling of figures. This feature greatly facilitates the production of publication-quality diagrams.

From: Methods in Molecular Biology, Vol. 24: Computer Analysis of Sequence Data, Part I. Edited by: A. M. Griffin and H. G. Griffin. Copyright ©1994 Humana Press Inc., Totowa, NJ

Three programs used for the generation of files that may be analyzed by PC/GENE are described in this chapter. The program SEQIN is used for direct entry and editing of nucleotide or protein sequence data. The program READGEL is used for direct entry of nucleotide sequence data using a sonic digitizer. The program ASSEMGEL is used for management of DNA sequencing projects. The program will find overlaps between sequences and assemble a longer sequence. A program like ASSEMGEL is indispensable for the shotgun approach to sequencing of DNA. There are two other useful programs that facilitate generation of sequence files for use by PC/GENE. These are KERMIT, which facilitates transfer of information between computers linked via modem or data line, and REFORM, which converts sequence files of other formats into the PC/GENE format. These programs will not be discussed further.

2. Materials

In this and the following chapters describing the PC/GENE sequence analysis and management software, release 6.5 (February 1991) will be described. The PC/GENE software is distributed by IntelliGenetics, Inc. The PC/GENE software requires an IBM PC XT, PC AT, PS/2, or a compatible microcomputer with version 3.1 or higher PC-DOS or MS-DOS operating system and a minimum of 640 kb of Random Access Memory. A hard disk drive with a minimum of 13 mb of free space is required for installation of the programs and for personal data files. A graphics screen adapter (a variety of graphics cards can be used) is also required.
A CD-ROM drive is needed in order to access the protein and nucleic acid sequence data banks available on CD-ROM. A printer capable of both text and graphics, as well as a Microsoft-compatible mouse and software, will facilitate the use of PC/GENE. Optional equipment supported by PC/GENE includes a sonic digitizer for direct entry of sequence gel information (Grafbar GP-7, IBI Gel Reader, Beckman GelMate, or Photron MPC-8501), a voice synthesizer card for vocal verification of sequence input (Street Electronics Echo PC2, PC+, or MC), and an external or internal modem for communication with other computing resources.

3. Methods

3.1. Programs Used for Sequence Entry and Assembly

This chapter describes methods used for direct entry of nucleic acid or protein sequence information, for entry of DNA sequence data using a gel reader, and for assembly of nucleic acid sequences from overlapping or complementary sequences obtained during a sequencing project.

3.2. Detailed Instructions for Procedures

3.2.1. Direct Entry of Nucleic Acid or Protein Sequence Information from the Keyboard

The program SEQIN is used for direct entry and editing of nucleotide or amino acid sequence data (and associated information) from the keyboard. The program allows the import and merging of any part of a sequence file. Also, the composition or translation of a sequence may be determined, or the sequencemay be read by a speech synthesizer. For the example given below, it will be assumed that a nucleotide sequence must be entered from the keyboard and then connected to the 3’ end of an existing sequence. It will be assumed that the sequence is from a coding region and that a quick check of the translation will be performed to see if the newly added sequence allows continuation of an open reading frame that was previously identified in the existing sequence. The steps required to carry out this procedure for entry of a nucleotide sequence, merging to an existing sequence, and translation are as follows: 1. Select the program SEQIN. 2. Select “Nucleic acid sequenceediting.” Selection of “Protein sequence editing” at this point will allow entry andediting of proteinsequencedata, 3. Select “Enter a new sequence.” 4. Enter the name of the new sequence. 5. Select “DNA” as the type of sequence.The Main menuof the program appears,giving severaloptions.


6. Select option 1 “Entry/editing of the sequence.”
7. Type in the nucleotide sequence using the IUPAC nucleotide codes (type “Alt H” to display the nucleotide codes during editing).
8. To verify the accuracy of the sequence entered, select the “Verify” mode by pressing the “Insert” key twice, return to the beginning of the sequence by pressing “Home,” and then retype the sequence. The program beeps if what is typed the second time disagrees with the existing nucleotide. Retyping the conflicting nucleotide will confirm that nucleotide; any other key will maintain the original nucleotide.
9. Steps 9 through 13 will allow the user to merge the edited sequence with another sequence file or portion of a sequence file. Go to the Buffer menu by pressing F2; select “Load Seq” by pressing F5 while in the Buffer 1 menu.
10. Select the type of sequence file to be loaded, either normal sequence or acquisition sequence. Enter the name of the sequence file.
11. Enter the beginning and ending nucleotides of the selected sequence file to load into one of the five available buffers. These positions and the file name must be known before entering the SEQIN program.
12. Return to “Edit” by pressing F7. Place the cursor at the position where the sequence in the buffer is to be inserted, in this case at the beginning of the sequence being edited. If more than one of the buffers contains a sequence, be sure that the desired buffer is the active buffer before proceeding.
13. Insert the buffer sequence at the beginning of the edited sequence by pressing F6 (“From Buffer”). The buffer sequence now is the 5’ end, and the old edited sequence is the 3’ end.
14. Steps 14 through 16 will allow the user to check the translation of the merged sequence. Press F3 to enter the Block 1 menu. Define a Block (the region to be translated) by positioning the cursor at the beginning of the desired block and pressing F2 (“Beg block”), followed by positioning the cursor at the end of the desired block and pressing F3 (“End block”).
15. Select F6 (“Other options”) to enter the Block 2 menu.
16. Select F3 (“Translate”) to translate the sequence in the defined block in all three reading frames.
17. Press F7 to return to the Edit menu. When editing is completed, press F7 to end the editing session and return to the Main menu.
18. Select option 3 “Save the sequence.” Enter the name of the sequence to be saved.
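The merge and translation check in steps 9 through 16 can be illustrated outside PC/GENE with a short Python sketch. This is not SEQIN's code; the function names and the two example sequences are hypothetical, and only the three forward reading frames are translated using the standard genetic code.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def merge_at_5prime(existing, new):
    """Place the existing (buffer) sequence 5' of the newly typed sequence."""
    return existing.upper() + new.upper()


def translate(seq, frame):
    """Translate one forward frame (0, 1, or 2); unknown codons become 'X'."""
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translation(seq):
    """Return the translations of the three forward reading frames."""
    return [translate(seq, frame) for frame in range(3)]


existing = "ATGGCTAAAGGT"  # hypothetical existing sequence (start of an ORF)
new = "CCGGAATTTTGA"       # hypothetical newly entered 3' extension
merged = merge_at_5prime(existing, new)
for number, protein in enumerate(three_frame_translation(merged), start=1):
    print(f"Frame {number}: {protein}")
```

In this toy case the first frame reads through the junction and ends at a stop codon (shown as `*`), which is the kind of continuity the quick translation check is meant to confirm.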

Several other options are available within the SEQIN program. From the Main Edit menu, one can jump to any position or search for a subsequence of up to 38 nucleotides in the sequence being edited.


From the Buffer submenus, sequences can be loaded into and transferred among the five available buffers. Sequences in a buffer may be reversed or reverse/complemented. From the Block submenus, blocks can be defined and copied or moved to buffers, and the nucleotide composition can be displayed. A speech synthesizer can be used to pronounce the names of entered nucleotides. Additional information about the sequence may be entered by selection of option 2 “Entry/editing of the additional information” in the Main menu of the SEQIN program. This includes general information, such as the description (DE), species of origin (OS), key words (KW), and comments (CC).

3.2.2. Entry of Nucleic Acid Sequence Data Using a Gel Reader

The program READGEL is used for entry of nucleotide sequence information from autoradiograms obtained from sequencing gels. The steps required to carry out this procedure are as follows:

1. Position the autoradiogram to be read on the gel reader and switch on the instrument. Select the program READGEL. The Main menu of the program appears, giving several options (see Note 1).
2. Select option 1, “Sequence acquisition from the digitizer.”
3. Select “New gel” (press “N” or use the space bar or arrow keys), and point twice to the bottom right corner of the gel reader as instructed.
4. Define “Width” by pointing to the left and then the right extremities of a lane set to be read.
5. Calibrate the individual A, C, G, and T lanes by pointing to the center of the lane up to 20 times at evenly spaced intervals. Start at the extreme bottom, and continue up the gel somewhat beyond the point where it is possible to read.
6. Select “Read” by pressing “R,” and then point to each band on the autoradiogram to enter the sequence. If a mistake is made, select “Delete base” by pressing “D” to remove the last base digitized.
7. Select “Exit” (press “E”) when finished entering the sequence.
8. From the Main menu of READGEL, select option 2, “Save the sequence in an acquisition file.”
9. Type in comments regarding the sequence when prompted. For example, the name and date of the gel and the sequenced clone may be entered.
10. Enter the name of the sequence file. The sequence will be stored as an acquisition file with the file name extension .ACQ.


11. The user may return to the current gel if additional sequence entry or editing is required. When finished, save any changes as described above.

3.2.3. Assembly of Nucleotide Sequences

The program ASSEMGEL is used for assembly of longer nucleotide sequences (melds) from shorter sequences (gels) obtained directly from autoradiograms of sequencing gels or entered using the SEQIN program. The program will work on sequences containing the standard IUPAC ambiguity codes. The steps required to carry out this procedure are as follows:

1. Select the program ASSEMGEL. The Main menu of the program appears, giving several options.
2. Create a project or select a project that is already in progress by choosing option 1, “Select/Create a Project.”
3. Parameters for assembly may be changed by choosing option 2, “Specify Parameters.” Parameters that may be specified describe the quality of the overlaps and include:
A. The minimum number of overlapping bases (10 to 25);
B. The number of mismatches allowed;
C. The minimum percent match for overlaps of greater than 25 bases; and
D. The maximum number of mismatches in a row (loopouts).
The parameters may be set for both automatic and manual merging of sequences. As a starting point, the default settings may be used.
4. Select option 3, “Assemble gels (enter editor)” to begin a new project or to add new gels to an existing project. For the purpose of this example, assume the project is a new one. Note that the sequence data from the gels must first have been entered into PC/GENE using SEQIN or READGEL.
5. Select “Import” (F2) to load the appropriate gels or other sequences into the current assembly project.
6. Enter the names of the sequence files. Pressing the “Down Arrow” key reveals the names of sequence files that may be selected. Highlight each sequence to be imported by pressing the F2 key. When all the desired sequences are selected, press the F7 key to import the selected sequences. Vector sequences and sequences containing defined restriction sites can be screened out by entering the appropriate information in “Specify Parameters” (see step 3 above). No part of the sequence file containing the defined vector sequences or restriction sites will be imported, but the program will indicate which files contain vector sequences or the defined restriction sites. Sequences longer than 750 bases will not be accepted by the program.
7. Enter the Merge menu by pressing the F5 key (“Merge”). There are two ways to merge overlapping sequences, “Manual Merge” and “Automatic Merge.” Select “Auto Merge” by pressing the F3 key. The program will automatically find overlaps and display them on the top line of the editing screen as a meld. To locate regions of overlaps that do not match perfectly, use the F6 key (“Next ambiguity”). The cursor will move to the right and locate the next mismatch. Sequences may be edited if desired. When editing is complete, select F7 (“Finished”).
8. To determine the success of the auto-merge procedure, the results may be seen by selecting F3 (“Display”), followed by F3 (“List”). The program lists the number of melds and the number of unincorporated gels, along with the names. The strategy used to build the melds may be viewed by selecting F2 (“Strategy”) while in the Display menu.
9. If not all sequences in the project have been merged into a meld using “Automatic Merge,” “Manual Merge” may be employed using lower match stringencies in order to incorporate additional gels into melds. Make a note of the gels that have not been merged. Leave the “Display” menu by pressing F7 (“Finished”). Locate a gel to be incorporated by selecting F4 (“Locate”). Enter the Merge menu (F5). Press the F2 key to start Manual Merge. The program will then display potential overlaps that can then be either accepted (F2), rejected (F3), or edited.
10. When finished, return to the Editor menu (F7), and then exit to the Main menu (F7 again).

To analyze the sequence contained within a meld using other PC/GENE programs, save the meld as an individual sequence file (option 4). Option 5 (Revert to last save) discards all the changes made in the current session and restores the information that existed just after the last save. Executing option 8 (Save changes) will save the current project information. The project information may be printed or saved in a spool or edit file by selecting option 7. Finally, the entire project, including all melds and gels, can be deleted by selecting option 6 (Delete the project). Be sure all the information needed is saved as individual sequence files before using this option!
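How the four overlap parameters of step 3 interact can be sketched in Python. This is an illustration only and not the ASSEMGEL algorithm: the function names, default values, and example gels are hypothetical, the mismatch count is assumed to apply to overlaps of 25 bases or fewer and the percent match to longer ones, and IUPAC ambiguity codes are treated as ordinary mismatches.

```python
def find_overlap(left, right, min_overlap=10, max_mismatches=0,
                 min_percent_match=90.0, max_mismatch_run=2):
    """Return (overlap_length, mismatches) for the longest acceptable overlap
    between the 3' end of `left` and the 5' end of `right`, or None."""
    for length in range(min(len(left), len(right)), min_overlap - 1, -1):
        tail, head = left[-length:], right[:length]
        mismatches = run = longest_run = 0
        for a, b in zip(tail, head):
            if a == b:
                run = 0
            else:
                mismatches += 1
                run += 1
                longest_run = max(longest_run, run)
        if length > 25:
            # Assumed rule: long overlaps are judged by percent identity.
            percent = 100.0 * (length - mismatches) / length
            acceptable = percent >= min_percent_match
        else:
            # Assumed rule: short overlaps are judged by mismatch count.
            acceptable = mismatches <= max_mismatches
        if acceptable and longest_run <= max_mismatch_run:
            return length, mismatches
    return None


def merge(left, right, overlap_length):
    """Join two gels into a meld, keeping the left gel's bases in the overlap."""
    return left + right[overlap_length:]


gel_1 = "GATTACAGGCTTAACCGGTT"  # hypothetical gel readings
gel_2 = "TTAACCGGTTCCATGCA"
overlap = find_overlap(gel_1, gel_2)
if overlap is not None:
    print(merge(gel_1, gel_2, overlap[0]))
```

Relaxing `max_mismatches` or `min_percent_match` plays the role of the lower match stringencies used for a manual merge in step 9.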

4. Notes

1. If using the gel digitizer for the first time, select option 3 (“Setup”) from the Main menu of READGEL. This will allow the selection of the type of digitizer and the port settings, as well as the initial calibration of the digitizer. If the program READGEL crashes when attempting to use it, run “Setup” to make sure everything is set correctly.

Reference

1. Cannon, G. (1990) Nucleic acid sequence analysis software for microcomputers. Analytical Biochemistry 190, 147-153.

CHAPTER 21

PC/GENE: Restriction Enzyme Analysis

Timothy J. Larson and Patrick K. Bender

1. Introduction

Analysis of a nucleotide sequence using restriction endonucleases is perhaps one of the most basic functions carried out by sequence analysis software. The PC/GENE programs provide comprehensive coverage of protocols related to the restriction endonucleases. The program RESTRI finds restriction enzyme cleavage sites within a sequence; results can be presented in tabular or graphic formats. The program DIGEST finds digestion patterns using a single enzyme or multiple restriction enzymes. Thus, this program is useful for predicting the pattern obtained upon electrophoresis of a single or multiple digestion of a nucleic acid of known sequence. The program REDIT provides ready access to the information in the restriction enzyme data file and allows it to be edited. This program obviates the need to search through multiple catalogs or other tables for information regarding availability, specificity, isoschizomers, and other properties of the restriction enzymes. The program MUTSITE finds positions where a new restriction site can be created by a single base change. The output from this program indicates whether the single base change will have an influence on the amino acid sequence of the protein encoded by the region of interest. Thus, this program is very useful for design of targets to be used for cassette mutagenesis.

2. Materials

PC/GENE sequence analysis software release 6.5 (as described in the previous chapter) loaded on one of the appropriate microcomputers is used for the following analysis. A printer is required for the most convenient output of some of the data.

From: Methods in Molecular Biology, Vol. 24: Computer Analysis of Sequence Data, Part I. Edited by: A. M. Griffin and H. G. Griffin. Copyright ©1994 Humana Press Inc., Totowa, NJ

3. Methods

3.1. Programs Used for Restriction Enzyme Analysis

This chapter describes several methods related to restriction endonuclease cleavage analysis of a DNA sequence. These include finding restriction enzyme cleavage sites on a DNA sequence, construction of a restriction map, prediction of fragment sizes resulting from single or multiple digestion of a DNA with restriction enzymes, retrieval and editing of information in the restriction endonuclease data base, and generation of a new restriction endonuclease cleavage site by a single base change.

3.2. Detailed Instructions for Procedures

3.2.1. Mapping of Restriction Sites on DNA

The program RESTRI is used for routine restriction enzyme cleavage analysis. The program allows analysis of a sequence using all or selected enzymes from the restriction enzyme data base (1). The results of the analysis may be displayed in a variety of formats, including lists of cleavage sites by order on the sequence, by the number of cut sites, or in alphabetical order. The sequence and position of cleavage sites may be displayed, or a graph of the positions of cleavage may be generated. For the example given below, all the enzymes in the data base will be used to generate a list of enzyme cleavage sites sorted by the number of cuts on the sequence of interest, the enzymes that do not cut the sequence will be listed, the sequence will be printed with the positions of the restriction sites indicated, and then a graphic representation will be generated using only the enzymes having recognition specificities of 6 bp or greater. The steps required to carry out this analysis for a DNA sequence file are as follows:

1. Select the program RESTRI. The Main menu of the program appears, giving several options.
2. Select the sequence to be analyzed (option 1). If the name of the sequence file is known, enter it at the prompt. If the name of the file is not known, pressing the down arrow key will give the options of selecting a sequence from frequently used sequences or from the individual sequence files.


Select the sequence from the list of individual sequence files. The sequence will be loaded, and the user will be returned to the Main menu of the program.
3. The region of the sequence to be analyzed should be defined next (option 2). The default values are the 5’ and 3’ ends, so this option is not needed to analyze the entire sequence. To exercise this option, enter new values as instructed and return to the program.
4. The content of the output is defined by selecting option 3. Enter “No” when asked whether to accept the current settings. Toggle options 1 and 4 (“List of cuts by enzyme” and “Sorted list of cuts”) from “included” to “absent.” Enter “Selection done,” and then enter the maximum number of lines per page desired (40-80; use 55 for normal-sized paper).
5. Next, select option 4 of the Main menu to indicate which restriction enzymes will be used. When entering the program, all of the enzymes are selected. The F3 key (“None/All”) is used to deselect or select all of the enzymes. Enter “All” to select all of the enzymes in the data base. Press F7 (“Finished”) to return to the Main menu.
6. Carry out the restriction analysis by selecting option 5. Select option 6 “Display/output the results,” with the output “Screen only” to be sure the content of the analysis is what is wanted before printing. If ready to print the results, return to the Main menu, and change the output status by selecting “Print” under option 8. Select option 6 (“Display/output”) again. The results will be printed in tabular form.
7. Next, the procedure for display of the sequence with the positions of all of the restriction sites will be described. From the Main menu, select option 7 (“Additional representations of the results”). The layout for the printout may be changed by using option 3 of the submenu that appears. Next, select option 4 (“Display/output the cleavage sites on the sequence”). The sequence will be printed out with the restriction sites indicated.
If the same site is cut by more than one enzyme, a table will be generated indicating the position and enzymes cleaving that site.
8. The procedure for generation of a graphic map of the sites cleaved by enzymes having recognition specificity of 6 bp or greater will now be described. These enzymes must be selected using option 4 (“Select the restriction enzymes”) of the Main menu.
9. The current selection of restriction enzymes appears on the screen. As all the enzymes are currently selected, first deselect all of the enzymes by pressing F4 (“Invert Selection”).
10. To select the enzymes with recognition specificity of 6 bp or greater, first press F2 (“Select subset”). A second menu appears that allows the selection of enzymes with n bases in the recognized site, enzymes generating blunt ends or cohesive ends, or enzymes recognizing asymmetric sites. Select F2 (“nBases Sit”), move the cursor to the 4, and press enter. All enzymes with 4 bp recognition specificity will now be highlighted on the computer screen. Select F2 (“nBases Sit”) again, and enter the 5 to select enzymes with 5 bp specificity. Select F7 to go back to the first menu.
11. All of the enzymes with recognition specificity of 4 and 5 bp have now been selected. Now press F4 (“Invert selection”), and all of the enzymes having recognition specificities of 6 bp or greater will be highlighted.
12. To use this group of enzymes on a regular basis, press F6 (“Save file”), and enter a name for this file of restriction enzymes. This group of enzymes may be used later for restriction enzyme cleavage analysis of any sequence by selecting F5 (“Load file”).
13. When the selection of enzymes is complete, press F7 (“Finished”).
14. From the Main menu, select option 5 to perform the analysis. Then select option 7 (“Additional representations of the results”). Another series of options appears. Select option 1 (“Output a semigraphical representation”). This option requires that the output status be set to the printer or a file. If this is not the case, the program warns the user, and he or she may change the output status to “Print” by using option 5 (“Change output status”).
15. A semigraphical representation of the restriction map will be printed with one line for each enzyme. The number of cuts for each enzyme will be indicated at the right side, where UC indicates unique cutter and DC indicates double cutter.

3.2.2. Prediction of Fragment Sizes from Single or Multiple Digestions

The program DIGEST is used to predict the sizes of restriction fragments that would result from digestion of a DNA with up to 10 restriction enzymes.
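The kind of prediction involved can be illustrated with a minimal Python sketch. It is not the DIGEST program: the two-enzyme table, the function names, and the test sequence are hypothetical stand-ins for the PC/GENE data base, only the top strand is searched, and a site spanning the origin of a circular molecule would be missed.

```python
import re

# Hypothetical mini data base: name -> (recognition site, cut offset on the
# top strand). EcoRI cuts G^AATTC and BamHI cuts G^GATCC.
ENZYMES = {"EcoRI": ("GAATTC", 1), "BamHI": ("GGATCC", 1)}


def cut_positions(seq, enzyme_names):
    """Sorted cut positions (index of the first base after each cut)."""
    cuts = set()
    for name in enzyme_names:
        site, offset = ENZYMES[name]
        # A lookahead pattern finds overlapping occurrences of the site.
        for match in re.finditer(f"(?={site})", seq):
            cuts.add(match.start() + offset)
    return sorted(cuts)


def fragment_lengths(seq, enzyme_names, circular=False):
    """Fragment lengths, in order along the sequence, for a complete digest."""
    cuts = cut_positions(seq.upper(), enzyme_names)
    if not cuts:
        return [len(seq)]
    if circular:
        lengths = [b - a for a, b in zip(cuts, cuts[1:])]
        # The last fragment wraps around the origin of the circle.
        lengths.append(len(seq) - cuts[-1] + cuts[0])
        return lengths
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


seq = "AAAA" + "GAATTC" + "CCCCC" + "GGATCC" + "TTTT"  # hypothetical 25-mer
print(fragment_lengths(seq, ["EcoRI"]))                       # single digest
print(fragment_lengths(seq, ["EcoRI", "BamHI"]))              # double digest
print(sorted(fragment_lengths(seq, ["EcoRI", "BamHI"]), reverse=True))
```

Listing the fragments in order along the sequence and sorted by size corresponds to the two output formats offered by the program.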
The computation may be carried out step by step using any defined order of the selected restriction enzymes or by simultaneous digestion using the selected enzymes. The program can list the fragments in order along the sequence and/or list the fragments according to their length. The steps required to carry out this procedure are as follows:

1. Select the program DIGEST. The Main menu of the program appears, giving several options.
2. The sequence to be analyzed is selected by using option 1.
3. If only a portion of the sequence file is to be analyzed, the end points of the sequence are defined using option 2. This option must also be used


to indicate if the sequence is circular. If the sequence is linear and the user wants to analyze the entire sequence, this option need not be exercised.
4. Select option 3 to define the content of the output. The choices are to list the fragments along the sequence and/or to give a sorted list of fragments by size. Both options may be included or one of them may be excluded.
5. Select option 4 to define the restriction enzymes to be used. Select the enzymes in the order they are to be used if the digest is to be analyzed at intermediate stages. Enzymes are selected or deselected by using the F2 and F3 keys, respectively. Press F7 when the selection is complete. The user is then asked to verify the selection of enzymes.
6. Select option 5 from the Main menu to compute the results of the digestion. The user is given the option of computing the results using step-by-step digestions by the selected enzymes with intermediate display of the results, or else direct digestion of the sequence with all of the selected enzymes.
7. Be certain the output status is selected as desired. If not, use option 8 of the Main menu to change the output status (screen, printer, or file).
8. Select option 6 of the Main menu to display or output the results of the digestion. The fragments are numbered by their order along the sequence.
9. Any of the fragments generated by the digestion may be saved as an individual sequence file by using option 7 of the Main menu. This option may be used for creation of files that might later be spliced together using the program SEQIN. For example, the combined use of DIGEST and SEQIN may be used to simulate insertion of a restriction fragment into a vector.

3.2.3. Restriction Enzyme Data Bank

The program REDIT is used for editing or displaying the information in the restriction enzyme data bank. The data bank contains information (along with updates) from the restriction enzyme data base of Roberts (1). Each restriction enzyme entry contains information regarding recognition pattern, isoschizomers, commercial availability, and other comments, such as heat stability or sensitivity to methylation. The information may be edited in each of these categories. For example, the information might be customized for the enzymes in the user’s laboratory or department by entering information regarding source, date received, storage location, and so on. The information may also be printed out, including or excluding the information regarding isoschizomers, availability, or comments. The user may also print


out the information about the enzymes sorted by the size of recognition sequence or by the type of end generated. A table containing the enzymes recognizing asymmetric sites and a table of false isoschizomers may be printed. False isoschizomers are defined as enzymes that recognize the same sequence, but cleave at different positions within the sequence. One other very useful feature of this program is the ability to search for isoschizomers and to search for enzymes that cleave a given sequence. For the purpose of illustration, the procedure for editing the entry for a restriction enzyme will be given. Assume that the user would like to have the restriction enzyme Sau3AI printed on restriction maps instead of its isoschizomer MboI:

1. Select the program REDIT. The Main menu of the program appears, giving several options.
2. Select option 1 to display or edit restriction enzymes. A table listing all of the enzymes appears. The menu at the bottom of the screen gives the options to edit/display, delete, or add an enzyme.
3. Move the cursor to MboI, and press F2 (“Edit/display”). The information for the enzyme MboI appears.
4. Replace MboI with Sau3AI under “Enzyme name.” The entries for “Cut offset” and “Recognition site” are left unchanged. Under “Isoschizomers,” replace Sau3AI with MboI. The information under “Availability” and “Comments” may be left unchanged or edited as desired. These two lines may be left blank.
5. When editing is complete, press F7 (“Finished”). The table of restriction enzymes appears again. Edit, add, or delete other enzymes as desired. When finished, press F7 (“Finished”). The Main menu appears.
6. Select option 2 “Save the modifications on disk.” The new information will now be utilized when the programs RESTRI, DIGEST, or MUTSITE are used.

3.2.4. Creation of New Restriction Sites by Single Base Changes

The program MUTSITE is used to find positions in a sequence where a new restriction site can be generated by a single base change. As an example, assume that cassette mutagenesis using synthetic DNA of an active site of a protein is to be performed. This will first require introduction of unique restriction sites flanking the region of interest if such sites do not already exist. Optimally, enzymes cleaving these sites


should generate cohesive ends, although one blunt end would be satisfactory. The steps required to carry out this procedure are as follows:

1. Select the program MUTSITE. The Main menu of the program appears, giving several options.
2. Select option 1, and enter the name of the sequence to be analyzed. Pressing the “Down arrow” key reveals options to select from lists of sequence files.
3. Define the end points on the sequence by selecting option 2 from the Main menu. The end points should flank as closely as possible the region of interest (perhaps analyze a 120 bp region).
4. Define the layout and content of the results of the analysis by selecting option 3. The user may choose to include or exclude portions of the analysis. This example is interested only in silent mutations (the last option), so exclude the first three options (“Global results,” “Results by enzyme,” and “Study of effect on each reading frame”).
5. Select the restriction enzymes to be used by entering option 4. The table of the restriction enzymes appears with all of them selected. Press F4 (“Invert Sel”) to deselect all enzymes. Assume that only enzymes having recognition specificity of 6 bp or greater will be of use. Select these enzymes by first highlighting enzymes having 4- and 5-bp specificity as described in Section 3.2.1. above (keystrokes F2, F2, 4; F2, 5; F7). Then invert the selection (F4). This will give all enzymes with specificity of 6 bp or greater.
6. Next, deselect individually using the “Enter” key the enzymes known to cut the vector or insert that are being used. When finished, press F7 to return to the Main menu.
7. Carry out the analysis by selecting option 5.
8. Display the results on the screen (option 6); if the content is as desired and a hard copy is wanted, change the output status (option 7) and then print out the results.
9. The printout will give for each enzyme selected the positions where silent mutations may be created. The program makes no assumptions regarding the translation reading frame. The reading frame is determined by comparison of the output and the gene of interest. The user must then look for silent mutations in the optimal positions, for the optimal enzymes.
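The search for silent, site-creating changes can be sketched in Python. This is a hedged illustration and not the MUTSITE algorithm: unlike the program, the sketch takes the reading frame as an explicit argument, and the function names and the example sequence are hypothetical.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq, frame=0):
    """Translate one forward frame; unknown codons become 'X'."""
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def silent_site_mutations(seq, site, frame=0):
    """List (position, old_base, new_base), with 1-based positions, for
    single-base changes that create `site` without changing the protein."""
    seq = seq.upper()
    protein = translate(seq, frame)
    results = []
    for pos, old in enumerate(seq):
        # Only windows that include the changed base can gain a new site.
        start = max(0, pos - len(site) + 1)
        end = pos + len(site)
        for new in "ACGT":
            if new == old:
                continue
            mutant = seq[:pos] + new + seq[pos + 1:]
            creates_site = site in mutant[start:end] and site not in seq[start:end]
            if creates_site and translate(mutant, frame) == protein:
                results.append((pos + 1, old, new))
    return results


# Hypothetical coding sequence (Met-Gly-Ser-Lys) screened for a BamHI site.
print(silent_site_mutations("ATGGGTTCCAAA", "GGATCC"))
```

Here a T to A change at position 6 converts the glycine codon GGT to GGA and creates GGATCC. Calling the function with `frame=1` returns no candidates, because the same change would then alter the encoded amino acid.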

4. Notes

1. The programs RESTRI and MUTSITE do not allow the direct selection of restriction enzymes using two sets of parameters (for example, selecting enzymes that have 6 bp specificity or greater and generate a 5’ overhanging cohesive end). However, the user may quickly select these enzymes by first creating a file (F6) that contains all of the enzymes that generate the proper cohesive end, and then loading this file (F5). Then a subset with 6-bp specificity may be selected using the F2 key twice (“Select subset” and “nBases sit”). Select the enzymes with 4- and 5-bp specificity, and then invert the selection (F4). The enzymes with 4- and 5-bp specificity will be deselected and all the other enzymes will be selected. Information in the restriction enzyme data base may also be used to select the desired enzymes. Any group of enzymes may be saved for later use in a named file. Once a file has been selected for use, it is not possible to select the entire data base of restriction enzymes again without first exiting the program. One way to get around this is to create a file with all but one of the enzymes (one that will probably never be needed; the program does not allow saving the entire data base of enzymes), which can then be recalled using F5 (“Load file”) without exiting.
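The two-step selection described in this note amounts to simple set operations, as the following Python sketch shows. It does not reproduce PC/GENE's file format or menus; the small enzyme table and the function names are hypothetical stand-ins for the restriction enzyme data base.

```python
# Hypothetical records: name -> (recognition site, type of end generated).
ENZYMES = {
    "EcoRI": ("GAATTC", "5' overhang"),
    "BamHI": ("GGATCC", "5' overhang"),
    "PstI": ("CTGCAG", "3' overhang"),
    "SmaI": ("CCCGGG", "blunt"),
    "MboI": ("GATC", "5' overhang"),
    "HinfI": ("GANTC", "5' overhang"),
    "AluI": ("AGCT", "blunt"),
}


def with_site_length(names, length):
    """Enzymes among `names` whose recognition site has `length` bases."""
    return {name for name in names if len(ENZYMES[name][0]) == length}


def invert(selection, universe):
    """Invert a selection within the currently loaded set of enzymes."""
    return universe - selection


# Step 1: the saved-and-loaded file holds the enzymes with the desired end.
loaded_file = {name for name, (_, end) in ENZYMES.items() if end == "5' overhang"}

# Step 2: select the 4- and 5-bp cutters, then invert the selection.
short_cutters = with_site_length(loaded_file, 4) | with_site_length(loaded_file, 5)
selected = invert(short_cutters, loaded_file)
print(sorted(selected))
```

Because the inversion is taken within the loaded file rather than the whole data base, the result combines both criteria: recognition sites of 6 bp or greater and a 5’ overhanging end.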

Reference

1. Roberts, R. (1990) Restriction enzymes and their isoschizomers. Nucleic Acids Res. 18, 2331-2366.

CHAPTER 22

PC/GENE: Translation and Searches for Protein Coding Regions

Timothy J. Larson and Patrick K. Bender

1. Introduction

Translation of a nucleic acid sequence into a polypeptide sequence is required in a variety of situations. For example, when performing a DNA sequencing project, one frequently will translate the sequence in order to search for open reading frames. If open reading frames are found, it is desirable to know the probability that they in fact code for protein. Another situation requiring translation of a nucleic acid sequence is when a sequence is retrieved from a data base and the corresponding protein sequence is required for further analysis, for example, alignment with other proteins. PC/GENE contains several programs that make these operations easy to carry out. The program TRANSL is used for translation of a continuous sequence into a polypeptide sequence. The reading frames to translate can be defined (however, see Notes). This program also allows saving an open reading frame as a protein sequence file. There are three programs in PC/GENE that may be used to find open reading frames that have a high probability of encoding protein. In the first program (COD-FICK), the method of Fickett (1) is used. The method may be applied to all types of sequences and is based on the nonrandom appearance of codons within protein coding sequences and on the higher G + C content within coding sequences. The program COD-PROK (2) may be used to predict protein coding regions in prokaryotes and is based on codon usages in phages where the


complete genome sequence is known, and also takes into account the frequencies of occurrences of nucleotides around potential ATG and GTG initiator codons. Finally, the program COD-RNY (3-5) may be used to locate protein coding regions. This program looks for sequences that deviate the least from the primitive coding pattern RNY (purine, any base, pyrimidine). Staden (6) points out that the method works because of average amino acid compositions, and not codon preferences. The last program to be described is AUTRANSL. This program facilitates automatic translation of the regions of nucleic acids that are designated coding sequences in the nucleic acid sequences retrieved from sequence data banks (EMBL).

2. Materials

PC/GENE sequence analysis software release 6.5 (as described in the previous chapter) loaded on one of the appropriate microcomputers is used for the following analyses. A printer is required for most convenient output of some of the data.

3. Methods

3.1. Programs Used for Translation and Analysis of Open Reading Frames

This chapter describes several methods related to translation and searching for open reading frames. These include translation of a sequence in selected reading frames, determination of the probability that an open reading frame encodes protein, and automatic translation of a sequence retrieved from the EMBL data base.

3.2. Detailed Instructions for Procedures

3.2.1. Translation of a Nucleic Acid Sequence

The program TRANSL is used to translate a nucleic acid sequence. The program will carry out translation in one, two, or three reading frames for the strand of DNA found in the sequence file. In the following example, we will translate a nucleotide sequence, locate potential open reading frames, and then store a polypeptide sequence as an individual protein sequence file. The steps required to carry out this procedure are as follows:


1. Select the program TRANSL. The Main menu of the program appears, giving several options.
2. The sequence to be analyzed is selected by using option 1 of the Main menu.
3. If only a portion of the sequence file is to be analyzed, the end points of the sequence are defined using option 2. To analyze the entire sequence, this option need not be exercised.
4. Select option 3 to indicate which reading frames to use for the translation. Use frames 1, 2, and 3 if it is not known which frame is used for translation. Return to the Main menu (option 0 [zero]).
5. Select option 4 to define the layout for the display of the translation. A submenu appears that allows choices for the format of the nucleotide sequence (continuous, three-base, or 10-base groupings, and the number of bases per line), and for the amino acid sequence (one-letter or three-letter amino acid code). For translation in three frames, select a continuous nucleotide sequence of 72 nucleotides/line, one blank line between the nucleotide sequence and the translation, the one-letter amino acid code, and 55 lines/page.
6. Select option 5 to perform and display the translation. For clearest display of the translation, the output status should be set to “Printer” (option 8, Main menu) prior to exercising option 5.
7. Select option 7, “Protein translation options,” in order to find the longest open reading frames. A submenu appears giving several options. First define which initiation codons should be considered for the translation (option 5). The choices are AUG, AUG and GUG, or no initiation codons. If no initiation codon is selected, the program will find open reading frames that begin just after the previous termination codon.
8. Select option 6 of the submenu (“Select the content of the ORF statistics”). The user may choose to list all of the possible ORFs and/or to list the ORFs in order of decreasing size. As a start, exclude the list of all possible ORFs and include the list of ORFs in order of decreasing size. The minimum size of the ORF to be displayed (try 50 or 100 first) may be set, and the maximum number of ORFs to be displayed (try 20) may be set. Set the maximum number of lines per page to 55 (default value).
9. Select option 7 of the submenu (“Display/output the open reading frame statistics”). A list of open reading frames sorted according to size appears on screen or is printed out, depending on the output status. Note that the entire sequence in the DNA sequence file is considered, not the end points chosen in step 3 above.

10. Select option 2 of the submenu (“Select the layout for the protein sequence”). Set the number of amino acid residues per line, and choose the one- or three-letter amino acid code. In addition, include or exclude a table with the amino acid composition.
11. Select option 1 of the submenu (“Select the limits for the protein sequence”). Select “Endpoints,” and enter the values for the starting and ending nucleotide found in the table of ORFs generated in step 9. The program displays the length of the ORF, the translation frame, the molecular weight of the protein, and the amino acid sequence at the beginning and end. The program asks if the information is correct. If so, the user is prompted to enter a new description line (for example, the name of the protein or enzyme) and new organism line that will be saved with the protein sequence file.
12. The protein sequence may be displayed or printed using option 3 of the protein translation submenu.
13. To save the protein sequence as a protein sequence file, select option 4. The description lines must be confirmed and a new name entered for the protein sequence. If the name already exists, the user will be asked if he or she wants to overwrite the existing file. The protein sequence is now saved, and the user may continue to carry out steps 11-13 for additional ORFs if so desired.

3.2.2. Determination of the Probability that an ORF Encodes Protein
There are three different programs (COD-FICK, COD-PROK, and COD-RNY) that may give some indication of the probability that an identified ORF actually encodes a protein. In the following example, these programs will be used to locate potential coding regions in a 4000-bp segment of DNA from Escherichia coli (7,8) that is known to carry several genes (rpmF, plsX, and fabH). The following steps are required to carry out this procedure:
1. Select the program COD-FICK. The Main menu of the program appears, giving several options.
2. Select the sequence to be analyzed using option 1 of the Main menu.
3. The parameters used for the search for open reading frames are set by using option 2. The default values for the genetic code (Universal) and stop codons (TAA, TGA, and TAG) should be left as is. ATG and GTG should be selected as the initiation codons, and the minimum size for the open reading frames should be set to the minimum value allowed by the program (200 bases).

4. The content of the output is determined by exercising option 3. All of the ORFs, only those for which there is no opinion and those that are predicted to be “coding,” or only those that are predicted to be “coding” may be displayed. Select the second option (“coding” and “no opinion”). Return to the Main menu (option 0).
5. Select option 4 (“Analyze the sequence for protein coding regions”). Press any key to return to the Main menu when the calculation is complete.
6. Display the results by exercising option 5 of the Main menu (“Display/output the results”). The output includes the position of the ORF in the sequence, the size of the ORF (in base pairs and amino acids), the indication and probability scores, and the prediction (coding or no opinion). Results are sorted according to reading frame. In the present analysis, the ORF upstream of rpmF (7) contained 519 bp (173 amino acids) and had a 100% probability of being a coding ORF. The ORF corresponding to rpmF was not detected, because the protein encoded by this gene contains only 57 amino acids. The ORFs corresponding to plsX and fabH contained 1038 bp (346 amino acids) and 951 bp (317 amino acids), and had 92 and 100% probabilities of being coding regions, respectively.
7. Exit the program COD-FICK. Select the program COD-PROK. The Main menu of the program appears, giving several options.
8. Select the sequence to be analyzed using option 1 of the Main menu.
9. Carry out the computation by selecting option 2.
10. Print out the results using option 3. The output includes the position of the ORF in the sequence, the size of the ORF (in base pairs and amino acids), and the initial and coding scores. The open reading frame corresponding to rpmF was detected by this analysis and given a coding score of 10. The ORF upstream of rpmF received a score of 11, whereas the ORF corresponding to plsX received a score of 9. An amino-terminal truncated version of fabH was predicted (759 bp, 253 amino acids), receiving a score of 8.
11. Exit the program COD-PROK. Select the program COD-RNY. The Main menu of the program appears, giving several options.
12. Select the sequence to be analyzed using option 1 of the Main menu.
13. The end points of the sequence may be defined using option 2.
14. The length and step parameters may be modified using option 3. The default values of 60 and 15 bases may be used. A more refined plot will be obtained if these values are decreased, and a more general overall view will be obtained if the values are increased.

15. Option 4 may be used to select or deselect the display of the positions of the known coding regions (information derived from the feature table for the sequence) and the positions of the stop codons. Both should be included.
16. Select option 5 to plot the distribution of the RNY frames. The plot will appear on the screen. If the plot is what is desired, select “Output,” and it will be printed. The positions of the stop codons and the distribution of the RNY frames can be used to locate open reading frames. In the present analysis of the rpmF-fabH cluster, a very clear RNY distribution was obtained for the rpmF and fabH genes, whereas a more random distribution of the RNY frames was obtained for the plsX gene.

The results of the three types of analysis carried out would indicate that the method of Fickett may be the most informative, except the program will not detect protein coding regions of
3.2.3. Automatic Translation of a Sequence
The program AUTRANSL translates a nucleotide sequence automatically, selecting the initiation and termination codons using information present in the feature table of the sequence file. The protein sequence generated may be saved as a protein sequence file using this program. The steps required to carry out this procedure are as follows:
1. Select the program AUTRANSL. The Main menu of the program appears, giving several options.
2. Select the sequence to be analyzed by using option 1 of the Main menu. If a sequence file does not contain the required feature information, it will not be loaded.
3. Display or print the information in the feature table using option 2 of the Main menu. The information may be used to identify the coding regions for the translation. A summary of this information also appears after selection of option 3.
4. Select option 3 of the Main menu. It must now be indicated which of the coding sequences (CDSs) will be translated. A numbered summary of the CDSs is shown at the bottom of the screen. Enter the numbers, separated by commas, or the range, separated by a hyphen, of the CDSs to be translated. The desired order for the translation must be specified. Examples might be: 1,2,4 or 1-4,6.
5. The translation of the specified CDSs is carried out automatically. The screen displays the number of amino acids and molecular weight of the protein. If the desired result is obtained, enter “Yes” to the query, “Is this what you want?” The user is then prompted to enter a new description line (for example, the name of the protein or enzyme) and new organism line that will be saved with the protein sequence file.
6. The layout for the protein display may be defined by using option 4 of the Main menu as described above for the program TRANSL.
7. The amino acid sequence of the protein may be displayed or printed by using option 5. The output includes the sequence, amino acid composition (if desired), the number of residues, the calculated molecular weight, and the CDSs used for the translation.
8. Select option 6 to save the protein sequence. The description lines may be accepted or revised for the protein sequence and a new name provided for the sequence. If the name already exists, the user will be asked if he or she wants to overwrite the existing file.
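
The CDS selection syntax used by AUTRANSL (numbers separated by commas, ranges separated by a hyphen, translated in the order given) can be illustrated with a short Python sketch. This is not PC/GENE code: the function name `parse_cds_selection`, its arguments, and its error handling are assumptions made purely for illustration.

```python
def parse_cds_selection(spec, n_cds):
    """Expand a selection such as "1-4,6" into an ordered list of CDS numbers.

    Hypothetical helper: it mimics the comma/hyphen syntax described for
    AUTRANSL, but it is not taken from PC/GENE itself.
    """
    selected = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate a stray or trailing comma
        if "-" in token:
            first, last = (int(part) for part in token.split("-", 1))
            if first > last:
                raise ValueError(f"descending range: {token}")
            numbers = range(first, last + 1)
        else:
            numbers = [int(token)]
        for number in numbers:
            # Assumption: CDSs are numbered 1..n_cds, as in the on-screen summary.
            if not 1 <= number <= n_cds:
                raise ValueError(f"CDS {number} is not in 1..{n_cds}")
            selected.append(number)
    return selected


print(parse_cds_selection("1,2,4", 6))   # [1, 2, 4]
print(parse_cds_selection("1-4,6", 6))   # [1, 2, 3, 4, 6]
```

The list preserves the order in which the numbers were entered, which matters because the CDSs are translated and joined in that order.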

4. Note
The program TRANSL translates a sequence in the 5’ → 3’ direction in each of three frames. The program will not translate the inverse complement sequence (the strand complementary to that contained in the sequence file). In order to translate the inverse complement, a sequence file must first be created containing the inverse complement using the program SEQIN or NMANIP, and then the new sequence file translated.
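
For readers who want to check a translation outside PC/GENE, the following Python sketch shows the same idea under stated assumptions: it builds the inverse complement, translates a strand in a chosen frame with the standard (universal) genetic code, and lists ORFs in order of decreasing size. The function names, the `X` placeholder for unrecognized codons, and the default minimum size of 50 codons are illustrative choices, not features of TRANSL.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the inverse complement, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, frame=1):
    """Translate in frame 1, 2, or 3; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame - 1, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def find_orfs(dna, start_codons=("ATG", "GTG"), min_codons=50):
    """List ORFs on the given strand, largest first.

    Each entry is (frame, start, end, codons): 1-based positions, with the
    end including the stop codon and the codon count excluding it.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if orf_start is None and codon in start_codons:
                orf_start = i
            elif orf_start is not None and codon in STOP_CODONS:
                n_codons = (i - orf_start) // 3
                if n_codons >= min_codons:
                    orfs.append((frame + 1, orf_start + 1, i + 3, n_codons))
                orf_start = None
    return sorted(orfs, key=lambda orf: orf[3], reverse=True)


sequence = "ATGGCCTAA"
print(translate(sequence, 1))                      # MA*
print(translate(reverse_complement(sequence), 1))  # LGH
print(find_orfs(sequence, min_codons=1))           # [(1, 1, 9, 2)]
```

To scan the complementary strand for ORFs, pass `reverse_complement(sequence)` to `find_orfs`; the positions reported then refer to the inverse complement, just as they would for a new sequence file created with SEQIN or NMANIP.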

References
1. Fickett, J. W. (1982) Recognition of protein coding regions in DNA sequences. Nucleic Acids Res. 10, 5303-5318.
2. Kolaskar, A. S. and Reddy, B. V. B. (1985) A method to locate protein coding sequences in DNA of prokaryotic systems. Nucleic Acids Res. 13, 185-194.
3. Shepherd, J. C. W. (1981) Method to determine the reading frame of a protein from the purine/pyrimidine genome sequence and its possible evolutionary justification. Proc. Natl. Acad. Sci. USA 78, 1596-1600.
4. Shepherd, J. C. W. (1984) Fossil remnants of a primeval genetic code in all forms of life? TIBS 9, 8-10.
5. Shepherd, J. C. W. (1990) Ancient patterns in nucleic acid sequences. Meth. Enzymol. 183, 180-192.
6. Staden, R. (1984) Measurements of the effects that coding for protein has on a DNA sequence and their use for finding genes. Nucleic Acids Res. 12, 551-567.
7. Tanaka, Y., Tsujimura, A., Fujita, N., Isono, S., and Isono, K. (1989) Cloning and analysis of an Escherichia coli operon containing the rpmF gene for ribosomal protein L32 and the gene for a 30-kilodalton protein. J. Bacteriol. 171, 5707-5712.
8. Oh, W. and Larson, T. J. (1992) Physical locations of genes in the rne (ams)-rpmF-plsX-fab region of the Escherichia coli K-12 chromosome. J. Bacteriol. 174, 7873,7874.

CHAPTER 23

PC/GENE: Sequence Comparisons and Homologies

Patrick K. Bender and Timothy J. Larson

From: Methods in Molecular Biology, Vol. 24: Computer Analysis of Sequence Data, Part I. Edited by: A. M. Griffin and H. G. Griffin. Copyright ©1994 Humana Press Inc., Totowa, NJ

1. Introduction
Generally, the purpose of aligning sequences is to determine the phylogenetic relationship between the sequences and/or to identify conserved regions that may represent biologically functional domains. The PC/GENE package contains three programs that help to determine the optimal alignment of nucleic acid and protein sequences. NALIGN will align two nucleic acid sequences, and PALIGN will align two protein sequences. Both programs will impose gaps, within user-defined constraints, to optimize the alignments. In addition, PALIGN will allow partial weight for similar amino acids. The third program, CLUSTAL, will align both nucleic acid and protein sequences. CLUSTAL can align up to 25 sequences, and in addition to showing the sequence alignments, CLUSTAL will draw a dendrogram to illustrate the phylogenetic relationship between the aligned sequences.

Two additional programs, NMATPUS and PCOMPARE, aid in identifying and quantitating sequence similarity. However, neither program produces sequence alignments. NMATPUS uses the matrix comparison method of Pustell and Kafatos (1) to compare two nucleic acid sequences. The output is similar to that of a dot-matrix comparison. NMATPUS does not compare protein sequences. PCOMPARE uses the method of Needleman and Wunsch (2) to compare one protein sequence (query sequence) to one or more other sequences (target sequences). The program calculates an alignment score for the similarity between the query sequence and each of the target sequences. It then calculates the alignment score after randomizing the sequences. The difference between the scores for the randomized and nonrandomized comparisons is used to identify statistically significant similarity. The output presents the alignment scores for each comparison, but it does not indicate the regions within each sequence that contributed to any significant similarity. NMATPUS and PCOMPARE are not discussed further in this chapter.

2. Materials
No materials are needed to run these programs other than those discussed in the previous chapters.

3. Methods
3.1. Alignment of Two Nucleic Acids or Two Protein Sequences: NALIGN and PALIGN
The PC/GENE package contains two related programs, NALIGN and PALIGN, to align nucleic acid and protein sequences. Both programs use the method of Myers and Miller (3) to determine optimal alignment of two sequences.

To access NALIGN, start at the “Program selection by menu,” select “Sequence analysis,” select “Nucleic acid sequences,” select “Sequence comparisons and homologies,” then select “NALIGN: Alignment of two nucleic acid sequences.” The NALIGN menu appears on the screen. The first option is “Select the two sequences to be aligned.” Select this option, and the program prompts the user to provide the file names of first one and then the other sequence. At this step, a list of individual sequence files may be accessed by pressing the down cursor key. A window will appear with a list of file names. Scroll through the list using the cursor keys, and highlight the file of choice. Press “enter” to load the highlighted file. When both files have been entered, the screen returns to the NALIGN menu.

Next select “Modify the parameters of the method.” There are two user-defined parameters, the “open gap cost” (OGC) and the “unit gap cost” (UGC). The program uses these two parameters to calculate the penalty for opening and extending a gap in a sequence.
A large value for OGC penalizes opening a gap, whereas a large value for UGC increases the penalty of the gap in proportion to the number of positions in the gap. These parameters can be varied from 1 to 999. The default value is 10 for each parameter. After the parameter values have been selected or the default values accepted, the screen returns to the NALIGN menu.

The alignment may be initiated at this point, but the user has the option to change certain features of the output. To change these features, select “Select the layout and content of the output results.” When this selection is made, the user is presented with the options to display sequences in upper- or lower-case letters; to select the character that marks identical bases in the alignment; to change the number of bases displayed per line; and to change the number of lines per page. When these options are satisfactory, return to the NALIGN menu.

To initiate the alignment, select “Align the two nucleotide sequences.” When the alignment is completed, select “Display/output the results of the alignment.” The results begin with the names and descriptions of the sequences aligned and the parameter values used for the alignment. A display of the optimal sequence alignment follows. Each sequence is labeled, and the positions are numbered. Identical bases are marked by the character selected in the layout options. At the end of the alignment, the percent identity of the two sequences and the number of gaps inserted in each sequence to obtain the optimal alignment are shown. The results can be directed to the printer and/or to a text file.

The operation of the program PALIGN is very similar to that of NALIGN. Beginning at the “Program selection by menu,” select “Sequence analysis,” followed by “Protein sequence,” “Sequence comparisons and homologies,” and then “PALIGN: Alignment of two proteins by Myers and Miller’s method.” As in NALIGN, the user must enter the two sequence file names for alignment and may change the layout options for the results.
In the selection “Modify the parameters of the method,” the user may change the OGC and UGC values. In addition, there is a third parameter, the comparison matrix, that the user can modify. If “Change the comparison matrix” is selected, a menu of four alternative matrices appears. Each matrix gives full weight when identical residues align, but some of the matrices give partial weight when certain amino acids align. The criteria for partial weighting differ for each comparison matrix. For example, the comparison matrix “genetic code” will partially weight aligned amino acids if their codons could differ by one base change. There is also the “structure-genetic code” matrix, which combines the genetic code matrix with partial weighting for amino acids with similarities in their physical properties (hydrophobicity, charge, and so forth). The weighting criteria for each comparison matrix can be found by accessing the on-line help option. After these parameters are set, the user can initiate the alignment. The results are displayed in the same format as in NALIGN.

3.2. Alignment of Multiple Sequences: CLUSTAL
The program CLUSTAL can align between three and 25 nucleic acid or protein sequences. Each sequence cannot exceed 1200 residues or bases in length. Larger sequences must be truncated or segmented into separate files. The CLUSTAL alignments are made by the method of Higgins and Sharp (4).

To access CLUSTAL, start at the “Program selection by menu,” select “Sequence analysis,” select “Nucleic acid sequence” or “Protein sequence” (the user will be prompted again in the CLUSTAL menu for sequence type), select “Sequence comparisons and homologies,” and then select “CLUSTAL: Multiple alignment of protein or nucleic acid sequences.” The screen displays the CLUSTAL menu selections. First, highlight “Select the sequences to be aligned” and press “enter.” The user will again be prompted to select protein or nucleic acid sequences. When the selection is made, the user will be prompted to enter the sequence file names. There are two options for entering the sequence file names. They can be entered from the keyboard, or a library file name can be entered (library files are discussed in Chapter 24). If a library file name is entered, then those sequence files listed in the library file will automatically be entered. When file name entry is completed, the user is returned to the CLUSTAL menu.

The user can vary several parameters that influence the alignments. These can be accessed by selecting “Modify the parameters of the method.” The screen presents a list of seven parameters. Some of these, such as gap penalty, open gap cost, unit gap cost, and K-tuple value (discussed in Chapter 24, Section 3.2.) may be familiar from other programs. However, the parameters window size, filtering level, and weighting transitions are unique to the CLUSTAL program. An explanation of each parameter and its effect on the alignments can be obtained by selecting the on-line help option. Generally, the default values for these parameters result in good alignments. When these parameters are satisfactory, the user can return to the CLUSTAL menu.

From the CLUSTAL menu, the user can select “Select the layout and content of the output of the results.” The options are the same as in NALIGN. Once file name entry, parameter values, and layout of output are satisfactory, the user can initiate the analysis by selecting “Align the selected sequences.” After the alignment is completed, the user can elect to send the results to the monitor, to a printer, and/or to a text file. An additional option in the CLUSTAL menu is to “Save the results to an alignment file.” The alignment file can be used by the program MATSCAN (discussed in Chapter 25, Section 3.4.). MATSCAN will determine a weight matrix based on the frequency of each residue or base at each position within a segment of the alignment. This allows the user to search other sequences for regions of sequence similarity.

The output from CLUSTAL will list the parameter values used for the analysis and then show the best alignment for all the sequences. In the alignment, identical bases or residues are marked as well as those that are conserved in a significant number of the sequences. The user can also select “Plot the dendrogram of the alignment.” This will display a graphical representation of the phylogenetic tree of the aligned sequences. The output can be directed to the monitor, to the printer, and/or to a graphics file.

4. Note
The programs discussed in this chapter use algorithms to calculate a numerical score for a sequence alignment. These scores are based on weighting the occurrence of identical and sometimes similar residues or bases at each position.
The alignments resulting from these programs represent the mathematically optimal alignments, but they do not necessarily represent the most biologically significant alignments. It is possible that imposing additional functional criteria for modifying the computer-generated alignments may improve the representation of shared functional domains and/or phylogenetic relationships.
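
To make the role of the gap parameters concrete, the following Python sketch summarizes an already aligned pair of sequences. It is an illustration only, not the Myers and Miller procedure used by NALIGN and PALIGN. It assumes that a gap of n positions is charged OGC + UGC × n (one reading of the open gap cost and unit gap cost described in Section 3.1.) and that percent identity is taken over the full alignment length. Both conventions, and the function names, are assumptions.

```python
def gap_runs(row):
    """Return the length of each run of '-' characters in one aligned row."""
    runs, current = [], 0
    for char in row:
        if char == "-":
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def summarize_alignment(row_a, row_b, open_gap_cost=10, unit_gap_cost=10):
    """Report identity, gap counts, and total gap penalty for two aligned rows.

    Assumption: each gap of n positions costs open_gap_cost + unit_gap_cost * n.
    The defaults of 10 mirror the default OGC and UGC values quoted in the text.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    identical = sum(a == b and a != "-" for a, b in zip(row_a, row_b))
    runs_a, runs_b = gap_runs(row_a), gap_runs(row_b)
    penalty = sum(open_gap_cost + unit_gap_cost * n for n in runs_a + runs_b)
    return {
        "percent_identity": 100.0 * identical / len(row_a),
        "gaps_in_a": len(runs_a),
        "gaps_in_b": len(runs_b),
        "gap_penalty": penalty,
    }


summary = summarize_alignment("ACGT--ACGT", "ACGTTTAC-T")
print(summary)
# {'percent_identity': 70.0, 'gaps_in_a': 1, 'gaps_in_b': 1, 'gap_penalty': 50}
```

Raising `open_gap_cost` makes every additional gap expensive regardless of its length, whereas raising `unit_gap_cost` mainly penalizes long gaps. Trying a few values on a hand-edited alignment is a quick way to see why a mathematically optimal alignment can change when functional criteria suggest a different gap placement.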

References
1. Pustell, J. and Kafatos, F. C. (1984) A convenient and adaptable package of computer programs for DNA and protein sequence management, analysis, and homology determinations. Nucleic Acids Res. 12, 643-656.
2. Needleman, S. B. and Wunsch, C. D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443-453.
3. Myers, E. W. and Miller, W. (1988) Optimal alignments in linear space. CABIOS 4, 11-17.
4. Higgins, D. G. and Sharp, P. M. (1989) Fast and sensitive multiple sequence alignments on a microcomputer. CABIOS 5, 151-153.

CHAPTER 24

PC/GENE: Database Searches

Patrick K. Bender and Timothy J. Larson

1. Introduction
Generally, when a new sequence is found, it is important to know if all or parts of it are similar to other known sequences. This can be done by using the new sequence to search for similarity to sequences in a database. The PC/GENE package contains two programs, FSTNSCAN and FSTPSCAN, for searching nucleic acid and protein databases, respectively. Both programs are based on the search algorithms of Lipman and Pearson (1). These programs determine a numerical score for sequence similarity and can impose gaps in the sequences to optimize alignments.

In addition to FSTNSCAN and FSTPSCAN, the PC/GENE package contains several programs that search a database for sequence similarity to a subsequence. The program QGSEARCH can be used for searching either protein or nucleic acid sequences. The user can define a subsequence of 3-30 residues and can stipulate the number of mismatches allowed. However, QGSEARCH does not allow for gaps in the subsequence. PC/GENE provides two other programs to search a database with protein and nucleic acid subsequences. NESEARCH searches for nucleic acid sequence similarities, and PESEARCH searches for protein sequence similarities. In NESEARCH, up to six subsequences (segments) of 3-30 nucleotides in length can be linked with user-defined gaps between each subsequence. NESEARCH recognizes the IUPAC-IUB ambiguity code letters, which enable the user to define allowed substitutions at user-specified positions. PESEARCH allows the user to define a subsequence of up to 78 amino acids in length. Like NESEARCH, PESEARCH allows ambiguity code letters to be used in the subsequence. The user can access a list of the recognized ambiguity codes by use of an on-line help option while the subsequence is being entered or edited. One of the ambiguity codes identifies any amino acid or a gap at the specified position. Thus, the position and length of a gap are user-defined. Both NESEARCH and PESEARCH are particularly well suited when the user wants to use gap placement as part of the criteria for similarity.

2. Materials
All the materials listed in the previous chapters are required to run the programs discussed in this chapter. In addition, if the user wants to search the EMBL or Swiss-Prot databases, a CD-ROM reader is almost a necessity. At present, the EMBL and Swiss-Prot databases require approx 120 Mbyte of hard drive space, and they grow with each release. A CD-ROM reader attached to the personal computer provides easy access to, and the capacity for, present and future releases of these databases, at least for several more years. IntelliGenetics distributes the EMBL and Swiss-Prot databases on a single CD. These databases are in the PC/GENE format and can be accessed directly by the PC/GENE programs. The PC/GENE version of the EMBL database is divided into several smaller databases on the CD. The number and categories of smaller databases can change with each release. CD release number four contained the following databases (where # is the EMBL release number):

“CDEMB#E” = Eukaryotic, nucleotide data bank
“CDEMB#O” = Organelles, nucleotide data bank
“CDEMB#P” = Prokaryotic + phages, nucleotide data bank
“CDEMB#S” = Synthetic sequences, nucleotide data bank
“CDEMB#U” = Unannotated sequences, nucleotide data bank
“CDEMB#V” = Eukaryotic viruses, nucleotide data bank

Dividing the complete EMBL database into smaller databases decreases the time required for a search when the user knows that the relevant sequences are located in one of the smaller databases. The Swiss-Prot database is contained in a single database, “CDPROT#.”

3. Methods
3.1. Database and Library File Management
Each database is a collection of sequence files. Each file contains the name of the references, comments, and annotations. A library file is a list containing only the sequence file names. The sequence files listed in a library file must be in a database or in the user’s directory. Thus, a library file allows the user to group a subset of a database without duplicating the sequence files. The library file is an optional output for many of the PC/GENE programs that select or search sequence files. This allows files selected by one search program to be accessed conveniently by other programs.

The user also may create or edit a library file. To do this, select “Program selection by menu,” select “Files, databases, and system management,” select “file: management utility for data files,” then select “library files management.” The screen displays a listing of the library files in the user’s directory. Along the bottom of the screen is displayed a menu of various options to edit the library files or to create a new file. To create a new library file, the user will be prompted to enter a library file name, and then will be transferred to the EMACS editor. The user may then enter sequence file names either from the user’s directory or from a database. Once created, this library file will be accessible by those PC/GENE programs that can use library files.

Another feature of the PC/GENE package is the ability to select sequences from a database by the criterion of a common descriptive feature. This is done with the program SELECT. To access SELECT, start at the “Program selection by menu,” select “Files, databases, and system management,” and then select “SELECT.” The user is now presented with several criteria for selecting sequence files, such as sequence size, sequence origin, or description. If description is selected, the user is prompted to enter the word or words for searching the description lines of all the sequences in the designated database. When the search is completed, the user has the option to place the names of all the selected files into a library file. For example, if the user entered “actin” as the descriptive word and selected the EMBL eukaryote database, the user would get a list of the names of all sequence files in that database that contained actin in their description lines. These names could be viewed and/or saved by the user in a library file. This library file can be used later to limit the search range for a number of the PC/GENE programs.

The program FSTNSCAN can only search a database and cannot have the search range limited by a library file. To limit the search range of FSTNSCAN, a database can be created that contains only those sequence files listed in a library file. This can be done by first creating a new database with the database management utilities and then importing into that database those sequence files listed in a library file. To create a new database, start at the “Program selection menu,” select “DATABASE: Management utility for databases,” and then select “Data base management.” The screen displays a list of all the installed databases. Along the bottom of the screen, options are provided for editing the databases. Select the option to add a database (ADD DB). The user will be prompted to enter a name for the database, the directory path to store the database, a description of the database, and whether the database should be read only. Answer “no” to the read only prompt. The new database is now created, but empty.

To import sequence files, exit database management and return to the previous menu. Now select “INTERFAC: Input and output to and from a database.” Next select “Change the database in use.” The screen will show a list of available databases. Highlight the one newly created, and press “enter.” The user will be returned to the previous menu. Select “Merge two databases,” and then select “Merge of the entries indicated in a library file.” The user will be prompted for a library file name. After entering the library file name, the screen will display the available databases. Select the database that contains the sequence files listed in the library file. The program will then ask whether to replace the data in the database for the sequence(s) that are already present.
Because the new database is empty, the user can select “yes” or “no.” The program will then enter those sequence files listed in the library file into the new database. When this is completed, the database is available for searching by FSTNSCAN.

3.2. Searching a Database for Sequence Similarity to a Nucleic Acid Query Sequence: FSTNSCAN
To access the program FSTNSCAN, start at “Program selection by menu,” select “Sequence analysis,” select “Nucleic acid sequence analysis,” then select “Sequence comparisons and homologies.” The user is presented with a menu of four programs that search for sequence similarity and make alignments. Three of these, NALIGN, CLUSTAL, and NMATPUS, were discussed in the previous chapter. The fourth selection is FSTNSCAN. Highlight FSTNSCAN and press “enter.” The user is now in the FSTNSCAN menu.

First highlight “Select the query sequence,” and press “enter.” Then type in the name of the sequence file to be used to search a database for similarity to other sequences. Instead of typing a file name, the user can access a list of his/her sequence files by pressing the down cursor key and then highlighting the option “Choose in the list of your nucleic acid individual sequence files.” Scroll through the list using the up and down cursor keys. Highlight the sequence of interest, and press “enter.” The query sequence is limited to a maximum of 2000 bases. To scan with a larger sequence, it must first be divided into files of no more than 2000 bases by using the sequence editing programs. After the query sequence is entered, the screen returns to the FSTNSCAN menu.

From the FSTNSCAN menu, the user may select “Modify the K-tuple value.” If this option is selected, the user may change the value from 1 to 6. The K-tuple value is similar to the window size in a dot-matrix alignment. It determines whether single nucleotides are compared to score a match, whether pairs are compared, whether three nucleotides are compared, and so on. A lower K-tuple value is more sensitive for finding sequence similarities, but it slows operation of the program. The default K-tuple value is 2. This is usually a good compromise.

From the FSTNSCAN menu, the user may also select “Select the layout and content of the output of the results.” Under this selection, the user may change several options, such as a character used to mark identical bases in the alignments, the number of lines per page, and how many of the best alignment scores to display.
The program can retain a maximum of 50 of the best-scoring sequences for display. When these options are satisfactory, the user returns to the FSTNSCAN menu by selecting “yes” after the line “are these settings OK?” Before starting the search, the user must select the database to be searched. The FSTNSCAN menu includes the selection “Change the database in use.” If the user selects this option, a list of the databases accessible to the user appears. Highlight the database of choice, and press “enter.” The screen returns to the FSTNSCAN menu. At the


bottom of the screen are displayed the file name of the query sequence and the database selected. To initiate the search, highlight “Scan for similarity to the query sequence,” and press “enter.” The display will show the number of sequences already scanned and the number of sequences found above the default cutoff value (the cutoff value is not defined by the user). If the scan utilizes one of the databases on a CD, it can take a considerable amount of time to complete. Depending on the size of the database, the size of the query sequence, and the K-tuple value selected, searches can take 2-6 h.

After the search is completed, the alignments with the query sequence are optimized. When optimization is completed, the user is asked whether the names of the 50 (or fewer) best sequences should be stored in a library file. The user need not save these names to see the search results. However, this file can be used later to extract the complete sequences from the database. If the answer is “yes,” the user is prompted to provide a file name, and after entering a name, the user is returned to the FSTNSCAN menu. If the answer is “no,” the user also is returned to the FSTNSCAN menu.

To display the search results, highlight “Display/output the results of the scan,” and press “enter.” The results consist of the names of the best-scoring sequences along with their initial and optimized alignment scores. This is followed by the alignment of each sequence with the query sequence. Output can be directed to a file for later text editing and/or can be directed to the printer.

3.3. Searching Nucleic Acid Sequences for Similarity to a Query Subsequence: QGSEARCH and NESEARCH

The programs QGSEARCH and NESEARCH search for sequence similarity to moderately short subsequences. In QGSEARCH, the user can enter a subsequence of 3-30 nucleotides using the A, C, G, T codes.
The user is prompted for the number of mismatches allowed in the alignment, but gaps are not allowed nor does the program recognize the ambiguity code letters. In the program NESEARCH, any of the IUPAC ambiguity code letters are allowed, and the program allows for gaps. To allow for gaps, the user may enter up to six subsequences (called segments) each containing 3-30 bases. Between each segment, the user defines a gap length of between zero and several


thousand bases. By using segments and gaps, the user can scan with a subsequence of up to 180 bases (6 × 30), and the position and length of gaps can be made part of the criteria for identifying similarity. This cannot be done with either FSTNSCAN or QGSEARCH.

To access QGSEARCH and NESEARCH, start at “Program selection by menu,” select “Sequence analysis,” select “Nucleic acid sequence analysis,” then select “Primary structure analysis.” The user is now presented with a menu listing the programs QGSEARCH and NESEARCH.

Select QGSEARCH to arrive at the QGSEARCH program menu. Select the option “Define the subsequence to be searched.” The user can now either enter a sequence from the keyboard or load a file with a subsequence. The user will be prompted for the number of allowed mismatches and then will be returned to the QGSEARCH menu. Select the option “Define the range of the search.” The user may then elect to scan the database in use, to scan selected sequences by file name, or to scan the sequences in a library file. After making a choice and entering any necessary file names, the user is returned to the QGSEARCH menu. The user may then select “Change the database in use.” A list of available databases is presented. Highlight the database of choice, and press “enter.” The user is then returned to the QGSEARCH menu. The user can now select “Carry on the search.” Once the search is complete, select “Display/output search results.” The results present the file names in which a matching subsequence was found. The sequences matching the subsequence are shown along with the ten bases that precede and follow them. Output can be directed to the screen, to a text file, or to the printer.

Select the program NESEARCH from the same menu in which QGSEARCH is listed. NESEARCH has the same menu options as QGSEARCH but also provides additional options for entering the query subsequence.
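The QGSEARCH behavior described above (a gap-free scan that tolerates a fixed number of mismatches and reports the ten bases on either side of each hit) can be sketched as follows. This is only an illustration under stated assumptions: the function name, the return format, and the test sequence are hypothetical and are not taken from PC/GENE.

```python
def search_with_mismatches(sequence, probe, max_mismatches, flank=10):
    """Find every gap-free window of `sequence` that differs from `probe`
    at no more than `max_mismatches` positions.

    Returns a list of (position, matched text, mismatches, left flank,
    right flank) tuples; positions are 1-based.
    """
    if not 3 <= len(probe) <= 30:
        raise ValueError("probe must be 3-30 residues long")
    sequence, probe = sequence.upper(), probe.upper()
    hits = []
    for start in range(len(sequence) - len(probe) + 1):
        window = sequence[start:start + len(probe)]
        mismatches = sum(a != b for a, b in zip(window, probe))
        if mismatches <= max_mismatches:
            end = start + len(probe)
            left = sequence[max(0, start - flank):start]
            right = sequence[end:end + flank]
            hits.append((start + 1, window, mismatches, left, right))
    return hits

# Hypothetical test sequence: one exact and one single-mismatch copy of TATAAT.
for hit in search_with_mismatches("GGGGTATAATCCCCTATCATGG", "TATAAT", 1):
    print(hit)
```

Because every window is compared position by position, the cost grows with the database size times the probe length, which is consistent with the short probe limit the program imposes.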
The additional options in NESEARCH include entering a sequence file name and then defining the subsequence by entering the beginning and ending positions, spanning up to 30 residues. One may also select “Enter/edit the subsequence.” This option presents the user with six lines labeled segments 1-6 interspersed with five lines labeled as gaps. On each segment line, the user can enter a sequence of up to 30 bases using any of the IUPAC code letters. On the gap lines, enter the minimum and maximum gap allowed before the next segment (sequence) must occur. Gaps can be zero, and not all of the segment lines must


be used. After the subsequence is entered, program operation proceeds the same as for QGSEARCH. However, NESEARCH takes longer than QGSEARCH to complete its search analysis.

3.4. Searching a Database for Sequence Similarity to a Query Protein Sequence: FSTPSCAN

Program operation of FSTPSCAN is similar to that of FSTNSCAN. To access FSTPSCAN, start at “Program selection by menu,” select “Sequence analysis,” select “Protein sequence analysis,” then select “Sequence comparisons and homologies.” The user is now presented with a menu listing FSTPSCAN as one of the selections. Select FSTPSCAN to access the program menu. The list of menu options is the same as that for FSTNSCAN. The user must select the query sequence file, select the database, and change the default K-tuple value if necessary. For proteins, the default value is 1, and this is generally best. As with the FSTNSCAN program, searching a large database, such as the Swiss-Prot database on CD-ROM, can take a long time. After the search, the output lists up to 50 of the best-scoring sequences and shows their optimal alignments with the query sequence. Output can be directed to the monitor, to a text file, or to the printer.

3.5. Searching a Database for Sequence Similarity to a Query Protein Subsequence: QGSEARCH and PESEARCH

The procedure for searching with a protein subsequence by the program QGSEARCH is almost identical to the procedure for searching with a nucleic acid subsequence by this program. To access QGSEARCH, start at “Program selection by menu,” select “Sequence analysis,” select “Protein sequence analysis,” select “Primary structure analysis,” then select “QGSEARCH.” The menu options are identical to those available with the nucleic acid version of this program. When “Define the subsequence to be searched” is selected, the user will be prompted again to select “Protein” sequence. After making that selection, the user can enter a protein sequence of 3-30 amino acids in length.
The standard single-letter codes are used, and ambiguity codes are not allowed. After the sequence is entered, the user will be prompted for the number of allowed mismatches and then


will be returned to the QGSEARCH menu. Select the range of the search, and then carry on the search. The names of selected sequences can be stored in a library file. The output of the search can be displayed on the monitor, stored in a text file, and/or printed.

The program PESEARCH allows the user to enter ambiguity codes and gaps in a protein subsequence. To access PESEARCH, start at “Program selection by menu,” select “Sequence analysis,” select “Protein sequence analysis,” select “Primary structure analysis,” then select “PESEARCH: Extended global search for a protein subsequence.” The PESEARCH menu is displayed with the same options as the NESEARCH menu. In defining the subsequence, up to 78 amino acids can be entered, or the subsequence can be extracted from a protein sequence file. When entering or editing the subsequence, several ambiguity codes may be used. A list of these ambiguity codes can be displayed by selecting the “help” option. One of the ambiguity codes is a period (“.”). This code indicates that any amino acid or a gap can occur at that position. Thus, unlike in NESEARCH, a gap occupies one or more positions in the subsequence and cannot be set to a range. Once the subsequence is defined, the user must select the range of the search, carry on the search, and then display/output the search results. Sequence file names selected by PESEARCH can be stored in a library file.

4. Note

The PC/GENE package of programs is designed to operate on a personal computer with an 8-bit, 16-bit, or 32-bit processor. However, database searches are time-consuming, and we would recommend a personal computer with a 32-bit processor to expedite the search. In addition, a large hard drive is a useful option. This allows the user to download the databases most often used from the CD-ROM to the hard drive. Downloading will take time, but it only has to be done when a new release of the database is received.
Searching a database on the hard drive will shorten search times by four- to fivefold. Included in the most recent release of PC/GENE (version 6.7) is the program PCR/PLAN. This program will identify the best sequences for designing primers in order to amplify a region of a target DNA sequence in a polymerase chain reaction. The user designates the file name containing the target DNA sequence and can


manipulate the parameters affecting annealing stringency. Alternatively, the user can input a primer sequence and the program will identify sites in the target DNA that can anneal to the primer. The output from the program lists the sequence and the Tm of all alternative sites. This output can be very useful for designing a PCR protocol.

Reference

1. Lipman, D. J. and Pearson, W. R. (1985) Rapid and sensitive protein similarity searches. Science 227, 1435-1441.

CHAPTER 25

PC/GENE: Searches for Functional Sites in Nucleic Acids and Proteins

Patrick K. Bender and Timothy J. Larson

1. Introduction

The PC/GENE package contains several programs that scan nucleic acid sequences for predefined functional sites or sequence motifs. These include EUKPROM, which searches for a TATA-box, a Cap signal, a CCAAT-box, and a GC-box. Also included is SIGNAL, which searches for splice junctions, eukaryotic ribosome binding sites, prokaryotic ribosome binding sites, and E. coli promoter sequences.

The package also contains several programs that search for predefined sites in protein sequences. Among these, PSIGNAL searches for secretory signals; PESTFIND searches for sequences that occur in rapidly degraded proteins; REGULATE searches for positive, negative, and σ-type regulatory sequences in DNA binding proteins. The classification of regulatory sequences is derived primarily from known prokaryote and bacteriophage regulatory proteins. The program PROSITE searches for a variety of functional sequences and signatures in a protein. A complete listing of the sequences and signatures searched for is in the program documentation and can be accessed from within PROSITE by selecting “textbook.” “Textbook” contains references and explains the basis for each assignment. Among the many sites and signatures searched for by PROSITE are: glycosylation sites, phosphorylation sites, metal binding sites, certain DNA homeobox binding domains, and signatures of many enzyme families. These are only a few examples from an extensive list.

From: Methods in Molecular Biology, Vol. 24: Computer Analysis of Sequence Data, Part I. Edited by: A. M. Griffin and H. G. Griffin. Copyright ©1994 Humana Press Inc., Totowa, NJ


Most of the search programs use a weight matrix to identify sequences with significant similarity to a consensus sequence or signature. These weight matrices are integral to the programs and cannot be modified by the user. They have been constructed by aligning several protein or nucleic acid sequences known to contain a particular functional site. The residues at each position within the functional site are then assigned a weight dependent on how often they occur at that position. This approach allows for degeneracies in a consensus sequence and assigns them a weight lower than that of invariant residues. When a test sequence is scanned with the matrix, the program assigns a numerical score. Each matrix has a default cutoff score, and after a search, the user can access a list of those sequences that score above the cutoff value.

The PC/GENE package also contains the program MATSCAN, which allows the user to define a weight matrix. To define a weight matrix, the user must have several sequences that share a functional property or signature. MATSCAN allows the user to assign weights for each residue at each position and then use the matrix to scan a database.

2. Materials

No materials are needed to run these programs other than those discussed in the previous chapters.

3. Methods

3.1. Detection of Eukaryotic Promoter Elements

To access the program EUKPROM, start at “Program selection by menu,” select “Sequence analysis,” select “Nucleic acid sequence,” select “Site detection analysis,” then select EUKPROM for detection of TATA-box, Cap signal, CCAAT-box, and GC-box (1). The user is now presented with the EUKPROM menu. To enter the sequence file that will be searched, select “Select the sequence to be analyzed.” The user may elect to analyze only a portion of this sequence by setting the end points.
After file name entry, the user can initiate the search by selecting “Scan for eukaryotic promoter elements.” When the search is complete, there are two menu options for displaying the results. One output is a graphical representation of the search results. This output is obtained when the menu option “Plot the profile(s) of


the promoter elements scan” is selected. The other output is a list of the subsequences that scored above a default cutoff value. This output is obtained when the user selects “Display/output the results of the scan.”

The graphical output is illustrated in Fig. 1 using the promoter region of the bovine adult β-globin gene (BTGLO2 in the CDEMBL25E database) for the analysis. The numerical scores of subsequences within the β-globin gene are plotted against their position. The dotted horizontal line represents the default cutoff score. Scores above this cutoff value indicate positions of potential functional sites. In the example of the β-globin gene, one TATA-box is indicated just before position 200. Multiple Cap sites are indicated, but the one at position 228 is the correct distance 3′ from the TATA-box. No significant CCAAT-box homology is indicated, but two GC-boxes are found, one on each side of the TATA-box.

The list output includes the positions as well as the scores of the subsequences. At the end of the list, the EUKPROM program recommends those subsequences that are the best candidates for each functional site based on their scores and their relative positions. In the example of the β-globin gene, the EUKPROM recommendations are the TATA-box centered at position 196, the Cap signal centered at position 228, and the GC-box centered at position 184. These are the correct sites within this β-globin gene.

3.2. Detection of Splice Junctions, Eukaryotic and Prokaryotic Ribosome Binding Sites, and E. coli Promoter Sequences

To access the program SIGNAL, start at “Program selection by menu,” select “Sequence analysis,” select “Nucleic acid sequence,” select “Site detection analysis,” then select SIGNAL. The SIGNAL menu appears, which lists the menu options: “Locate splice junctions,” “Locate prokaryotic ribosome binding sites,” “Locate eukaryotic ribosome binding sites,” “Locate E. coli promoter sequences,” and “SCAN ANALYSIS MENU” (2). Select any one of the “Locate . . .” options, and the “SCAN ANALYSIS MENU” is highlighted. Entering this menu allows the user to select the sequence for analysis, change the end points for analysis, and change the default cutoff value for recognition of sites. When these options are satisfactory, the user selects “Scan the sequence for signal location.” When the scan is completed,

[Figure: plot of eukaryotic promoter detection curve(s) for sequence BTGLO2, from position 1 to 500.]

Fig. 1. Graphical output from the EUKPROM program. The X-axis represents the first 500 bases of the β-globin gene. Along the Y-axis is plotted the score of this sequence from the weight matrix for the four functional sites: GC-box, CCAAT-box, Cap signal, and TATA-box. The graph is a direct reproduction of the EUKPROM output on a Hewlett Packard Deskjet printer.

the results can be displayed either graphically or as a list of those subsequences that scored above the cutoff value. The graphical output is presented by the menu option “Visualize the graphical display of the results,” and the list output is presented by the menu option “Display/output the results of the scan.” For example, Fig. 2 illustrates the graphical output of a search in the bovine β-globin gene for splice junctions. The top graph illustrates the positions of potential intron/exon (acceptor, A) junctions, and the bottom graph illustrates the positions of potential exon/intron (donor, D) junctions. Vertical lines extending above the dotted horizontal line mark the positions of potential acceptor and donor sites. The asterisks have been added by the authors to indicate those lines that mark the positions of the known acceptor and donor sites in the β-globin transcript. The SIGNAL program has correctly identified

[Figure: plot of intron/exon and exon/intron junctions prediction in sequence BTGLO2, from base 1 to base 2072.]

Fig. 2. Graphical output from the SIGNAL program. The X-axis represents the base positions in the β-globin gene. The Y-axis represents the numerical scores from the scan for splice junctions.

these positions, although it has also identified other sites as possible candidates. The information from the SIGNAL program is useful both as an aid in confirming suspected splice junctions and as a means of quantitating the similarity between potential splice junctions and the consensus sequence. The information may also help to identify splice junctions used for alternative exon selection if multiple mRNA species are processed from the primary transcript.

The search for eukaryotic and prokaryotic ribosome binding sites and that for E. coli promoter sequences have the same output formats as the splice junction search. In searching for E. coli promoter sequences, the program looks for the consensus sequences “TATAAT” and “TTGACA.” It imposes additional weighting on both the distance between the two sequences and the distance from these sequences to potential transcription start sites.
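The idea of looking for the two E. coli consensus hexamers at a constrained distance from each other can be sketched as below. This is not the SIGNAL algorithm: SIGNAL uses weight matrices and additional distance weighting, whereas this sketch simply counts matching bases. The spacing window (16-18 bases), the score threshold, and all function names are assumptions chosen for illustration.

```python
MINUS_35 = "TTGACA"  # consensus named in the text
MINUS_10 = "TATAAT"  # consensus named in the text

def hexamer_matches(window, consensus):
    """Number of positions at which window agrees with the consensus."""
    return sum(a == b for a, b in zip(window, consensus))

def find_promoter_candidates(seq, min_spacing=16, max_spacing=18, min_total=9):
    """Return (pos35, pos10, spacing, total) tuples, best score first.

    Positions are 1-based starts of each hexamer; `total` is the number of
    matching bases out of 12. Spacing limits and threshold are assumed values.
    """
    seq = seq.upper()
    candidates = []
    for i in range(len(seq) - 5):
        score35 = hexamer_matches(seq[i:i + 6], MINUS_35)
        for spacing in range(min_spacing, max_spacing + 1):
            j = i + 6 + spacing
            if j + 6 > len(seq):
                break
            total = score35 + hexamer_matches(seq[j:j + 6], MINUS_10)
            if total >= min_total:
                candidates.append((i + 1, j + 1, spacing, total))
    candidates.sort(key=lambda c: -c[3])
    return candidates

# Hypothetical sequence containing both hexamers 17 bases apart.
demo = "GC" * 5 + "TTGACA" + "A" * 17 + "TATAAT" + "GC" * 5
print(find_promoter_candidates(demo)[:3])
```

A weight matrix would replace the simple match count with position-specific weights, so that a mismatch at a nearly invariant position costs more than one at a degenerate position.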


3.3. Identification of Cleavage Sites, Potential Functional Sites, and Signatures in Proteins

Begin by selecting the “Program selection by menu,” followed by “Sequence analysis,” and then “Protein sequence.” A menu that allows selection of several choices for sequence analysis of proteins appears. Among these choices are “Site detection analysis” and “Cleavage analysis.”

To scan a protein sequence for sites of cleavage by enzymatic or chemical means, select “Cleavage analysis,” and in the subsequent menu, select CUTPRO. The CUTPRO menu appears with the options to “Select the sequence to be analyzed” and to “Select the cleavage method.” Enter the cleavage method option, and an assortment of proteases and chemical conditions for cleavage will appear. The user can select and modify these conditions. Following selection of the protein sequence and the method of cleavage, the analysis is initiated by selecting “Cleave the selected sequence.” When completed, the results can be displayed by selecting “Display/output the results.” The program lists the peptide fragments generated by the cleavage according to their position within the sequence, their length, their molecular weights, and their predicted isoelectric points. Additional predictions are made concerning the hydrophobicity and retention time on reverse-phase HPLC of the peptide fragments.

For the identification of functional sites and signatures, select “Site detection analysis.” The next menu includes MATSCAN, PROSITE, PSIGNAL, REGULATE, and PESTFIND. Selection of PSIGNAL loads the program for searching a protein sequence for either eukaryote or prokaryote secretory signals (3). The user selects the protein sequence, the limits of the sequence if its full length is not to be searched, and either eukaryote or prokaryote secretory signals. The output can be either a graph or a table of those sequences scoring above the default cutoff value.
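The kind of fragment listing that CUTPRO produces for a cleavage analysis, described earlier in this section, can be illustrated with a short sketch. It assumes trypsin as the cleavage method and the commonly used rule of cutting after K or R unless the next residue is P; the function name and the example peptide are hypothetical, and the molecular weight, isoelectric point, and HPLC predictions that CUTPRO reports are not attempted here.

```python
def tryptic_fragments(protein):
    """Cleave after K or R, except when the next residue is P.

    Returns a list of (start, end, length, peptide) tuples with 1-based,
    inclusive positions, in order along the sequence.
    """
    protein = protein.upper()
    fragments = []
    start = 0
    for i, residue in enumerate(protein):
        is_last = i == len(protein) - 1
        cleave = residue in "KR" and not is_last and protein[i + 1] != "P"
        if cleave or is_last:
            fragments.append((start + 1, i + 1, i + 1 - start, protein[start:i + 1]))
            start = i + 1
    return fragments

# Hypothetical peptide used only to show the output layout.
for fragment in tryptic_fragments("MKRPAGKWLR"):
    print(fragment)
```

Other proteases or chemical methods would differ only in the rule that decides where a cut is made, so the cleavage test could be passed in as a function.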
Selection of REGULATE loads the program that searches for negative, positive, or σ-type protein regulatory sequences. The output can be either graphical or tabular. Selection of PESTFIND loads the program that searches a protein sequence for PEST regions. The output is a table of PEST sequences and their scores. PROSITE searches a protein sequence for many functional sites and signatures (4). A complete list of these sites is included in the program documentation, and the user can also access this list by


selecting “textbook” from the PROSITE menu. After selecting PROSITE, the user selects the protein sequence and the limits of the sequence, if the user does not want to search its full length. The output is a tabulated list of sites found, followed by the sequence of the protein with the sites marked. The user can then use “textbook” to learn the function for each site and the criteria for site selection.

3.4. Defining a Weighted Matrix, MATSCAN

A useful feature of the PC/GENE package is the program MATSCAN, which allows the user to define a weight matrix. To use MATSCAN, the user must have several sequences that are thought to exhibit similar functional properties. The user can then build a weighted matrix to scan a database and identify other sequences with similarity. The advantage of MATSCAN is that the weighted matrix allows the user to weight substitutions according to the frequency of their occurrence.

MATSCAN can be used to define a weighted matrix from either protein or nucleic acid sequences. It is accessed from the “Site detection analysis” menu after either the nucleic acid or protein sequence option is selected. When the MATSCAN menu appears, the user can define the weight matrix by selecting “Define the weight matrix to be scanned.” For defining the matrix, the screen displays a two-dimensional array. Across the columns are the positions of up to 99 residues in the aligned sequences. Down the rows are either the four nucleotides (A, C, G, T = U) or the 20 amino acids. In each cell of the array, the user types in the number of times a particular residue occurs in that position in the alignments of the known, functionally similar sequences. The user can obtain help in aligning the known sequences from the NALIGN, PALIGN, or CLUSTAL programs (discussed in Chapter 23).

After defining the weight matrix, the user can change the default cutoff value for identifying significant scores. Before initiating the search, select “Change the database in use” to identify the database for searching. Initiate the search by selecting “Scan the current database with the selected matrix.” After the database is searched, the user selects “Display/output the results of the scan” to output the search results. The output lists the sequence files that were found to contain subsequences with scores above the cutoff value. The list includes the subsequences found and their numerical scores.
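The matrix that the user fills in for MATSCAN (a count of each residue at each aligned position) and the subsequent scan against a cutoff can be sketched as follows. How PC/GENE converts counts to weights is not stated here, so the log-odds weighting against a uniform background, the pseudocount, the cutoff, and all function names below are assumptions made only for illustration.

```python
import math

ALPHABET = "ACGT"

def count_matrix(aligned_sites):
    """Count each base at each position of equal-length aligned sites."""
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all aligned sites must have the same length")
    counts = [{base: 0 for base in ALPHABET} for _ in range(length)]
    for site in aligned_sites:
        for pos, base in enumerate(site.upper()):
            counts[pos][base] += 1
    return counts

def weight_matrix(counts, pseudocount=1.0):
    """Turn counts into log2-odds weights versus a uniform background (assumed)."""
    background = 1.0 / len(ALPHABET)
    weights = []
    for column in counts:
        total = sum(column.values()) + pseudocount * len(ALPHABET)
        weights.append({base: math.log2((column[base] + pseudocount) / total / background)
                        for base in ALPHABET})
    return weights

def scan(sequence, weights, cutoff):
    """Return (position, subsequence, score) for windows scoring at or above cutoff."""
    sequence = sequence.upper()
    width = len(weights)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in ALPHABET for base in window):
            continue  # skip windows containing ambiguity codes
        score = sum(weights[i][base] for i, base in enumerate(window))
        if score >= cutoff:
            hits.append((start + 1, window, round(score, 2)))
    return hits

# Hypothetical aligned sites; invariant positions receive the highest weights.
sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
matrix = weight_matrix(count_matrix(sites))
print(scan("GGTATAATGG", matrix, cutoff=5.0))
```

With these example sites, an invariant position contributes more to the score than a position where substitutions were observed, which is the behavior the weight matrix approach is meant to capture.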


4. Notes

Use of a weight matrix to search for sequence similarity provides the flexibility to accommodate allowed substitutions in a consensus sequence. That flexibility can result in more than the biologically functional sequence scoring above the cutoff value. These programs do not prove that a site has the queried function. However, the results of the search bring to the researcher’s attention those sites that are good candidates for a particular function based on their similarity to other sequences known to have that function.

References

1. Bucher, P. (1990) Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J. Mol. Biol. 212, 563-578.

2. Staden, R. (1984) Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res. 12, 505-519.
3. Von Heijne, G. (1986) A new method for predicting signal sequence cleavage sites. Nucleic Acids Res. 14, 4683-4690.
4. Bairoch, A. (1991) PROSITE: A dictionary of sites and patterns in proteins. Nucleic Acids Res. 19, 2241-2245.

CHAPTER 26

Using the FASTA Program to Search Protein and DNA Sequence Databases

William R. Pearson

1. Introduction

As this volume illustrates, computers have become an integral tool in the analysis of DNA and protein sequence data. One of the most popular applications of computers in modern molecular biology is to characterize newly determined sequences by searching DNA and protein sequence databases. The FASTA* program (1,2) is widely used for such searches, because it is fast, sensitive, and readily available. FASTA is available as part of a package of programs that construct local and global sequence alignments. This chapter will describe a number of simple applications of FASTA and other programs in the FASTA package. This chapter focuses on the steps required to run the programs, rather than on the interpretation of the results of a FASTA search. For a more complete description of FASTA and related programs for identifying distantly related DNA and protein sequences, for evaluating the statistical significance of sequence similarities, and for identifying similar structures in DNA and protein sequences, see ref. 2.

*FASTA is pronounced “FAST-AYE,” a name that refers to “FAST-All,” and indicates its lineage from “FAST-P” (fast protein sequence comparison) and “FAST-N” (fast nucleic acid sequence comparison). FASTA can compare either protein sequences or DNA sequences.

From: Methods in Molecular Biology, Vol. 24: Computer Analysis of Sequence Data, Part I. Edited by: A. M. Griffin and H. G. Griffin. Copyright ©1994 Humana Press Inc., Totowa, NJ

All the examples below are of protein sequence comparisons. Although FASTA can be used for either DNA or protein sequence comparisons, the user should always compare sequences at the protein


sequence level to identify sequences that share distant evolutionary ancestors. If uncertain of the open reading frame in a cDNA clone, translate the clone in all six frames, and use each of those sequences to search a protein sequence database or a translated DNA database (TFASTA). In general, protein sequence comparison allows exploration of evolutionary relationships that are 10-fold more ancient (1-2 billion yr) than DNA sequence comparison (100-200 million yr). DNA sequence comparison is most appropriate when comparing repeated sequence elements, structural RNAs, or transcription factor binding sites.

2. Materials

To use the programs described in this chapter, the user must obtain the FASTA package of programs and one or more sequence databases, and install the programs and databases. Appendix A at the end of this chapter describes how to obtain the FASTA package of programs for UNIX, IBM-PC/DOS, Macintosh, and VAX/VMS computers, and how to install it. Appendix B lists several sources for protein and DNA sequence databases. Appendix C outlines the steps required to install the sequence databases.

2.1. Computer

The FASTA package will run on UNIX machines, VAX/VMS machines, IBM-PCs, and Macintoshes, as well as any other computer that supports a “C” compiler.

2.2. Obtaining the Databases

The FASTA package does not include any sequence databases, but versions of the program work with many generally available library formats, including NBRF/PIR (National Biomedical Research Foundation/Protein Identification Resource; two formats, one for VAX/VMS and one for other machines), EMBL/SWISS-PROT, GENBANK full-tape format, and the simpler “FASTA” format. The VAX/VMS version of FASTA can read the NBRF/PIR protein and DNA sequence databases on VAX/VMS computers and libraries in the Genetics Computer Group format. The IBM/PC and Mac versions of FASTA can read the EMBL CD-ROM. Addresses for several of the protein and DNA sequence database distributors are listed in Appendix B.
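For readers preparing their own library files, the simpler “FASTA” format mentioned above can be read with a few lines of code. The sketch below assumes the common layout of that format, in which each entry begins with a line starting with “>” that carries the name and description, followed by one or more lines of sequence; the function name and the example entries are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of text lines.

    A line beginning with '>' starts a new entry; the sequence lines that
    follow are joined together. Blank lines are ignored.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Hypothetical two-entry library, used only to show the layout.
example = """>seq1 first example
MKT
AYI

>seq2
ACGT
"""
for name, sequence in read_fasta(example.splitlines()):
    print(name, len(sequence))
```

Passing an open file object instead of a list of lines works the same way, since the function only iterates over lines.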


3. Methods

The examples below assume that the user has obtained the FASTA package directly from the author, as described in Appendix A. Although this will usually be the case with IBM-PC or Macintosh versions, it may not be true on UNIX or VAX/VMS machines. Some of the FASTA programs are included with commercial software packages such as the Genetics Computer Group programs; however, the GCG FASTA program looks quite different from the program described below, although similar capabilities are available. In addition, few commercial implementations include all of the programs in the FASTA package (some include them in an “unsupported” directory); RDF2, a program for evaluating the statistical significance of an alignment, is frequently left out.

3.1. Comparing Two Sequences

To demonstrate that the FASTA program is working properly, several test sequences are included with the distribution package. Once FASTA has been installed in a directory (and compiled if using a UNIX system):

1. Confirm that the two test sequences musplfm.aa and lcbo.aa are present in the directory.

2. Then type:

    FASTA musplfm.aa lcbo.aa

Here, musplfm.aa, the first entry after the command, is the query sequence file, and lcbo.aa, the second entry, is the library sequence file. FASTA accepts a third entry on the command line, the ktup parameter. For protein sequences, if a ktup entry is not given, ktup=2 is used.

3. After typing the command above, the user should see the following:

    fasta musplfm.aa lcbo.aa
    >musplfm mouse proliferin : 224 aa
    vs library
    searching lcbo.aa library
    229 residues in 1 sequences
    1 scores better than 1 saved, ktup: 2
    Enter filename for results :


At this point, it is clear that FASTA was able to find the musplfm.aa sequence, that it read 224 amino acids from musplfm.aa, and that it also found a 229-amino acid sequence in the lcbo.aa file. In addition, this search was done with the ktup parameter set to 2 (the default). The ktup parameter sets the sensitivity of the search (see below, and refs. 1-4).

4. Type two carriage returns (<RET>) to see:

    How many scores would you like to see? [1]
    The best scores are:                      initn init1 opt
    LCBO - Prolactin precursor - Bovine         381   273 432

Here, FASTA is reporting three scores that characterize the similarity between the musplfm.aa sequence and bovine prolactin. The init1 (“init-one”) score is calculated using the PAM250 matrix (5) from the most similar region without gaps; this is the region bounded by the Xs in the alignment shown below. When gaps are required to align two sequences, there will often be several similar regions without gaps that can be combined to improve the similarity score; the score of this combination is reported as the initn (“init-n”) score. In addition, FASTA calculates an optimal local similarity score within a 32-residue-wide band around the best initial region; this is reported as the opt score. A more complete description of the calculation of these three scores and their uses in evaluating sequence similarities is presented in refs. (1,2).

5. After the similarity scores are reported, the alignment(s) may be shown (Fig. 1). Here, the three similarity scores are reported again, and an alignment is shown between the query sequence and the library sequence (in this case, there is only one). On the alignment, the “:” symbol denotes aligned identities, whereas the “.” symbol indicates aligned amino acid residues with scores ≥0. The “X”s at residues 13 and 140 in the musplfm sequence correspond to the beginning and end of the best initial region (the one that gives rise to the init1 score).
Note that the best initial (initl) region is bounded by pairs of identical amino acids because the search was performed with ktup=2. Had the search used ktup=l, the “X5 would be found at residues 7 (a single aligned Q) and 179 (W).* *Sometimes the introduction of gaps m the optimized alignment causes residues that were aligned in the initl region to be shifted out of alignment When this happens, the “X” is “split” into ‘v,” and “v ” For example, one might see PCSWILLLLLVNSSLLWKN : . . x:::.:.. KGSRLLLLLWSN-LLLCQ

because the gap was not present m the initial regiok

:“v

.

   More scores? [0] <RET>
   Display alignments also? y <RET>
   number of alignments [1]? <RET>
   LCBO - Prolactin precursor - Bovine      381 273 432
   36.5% identity in 219 aa overlap
   [Residue-by-residue alignment of musplf (query) against LCBO (library), followed by the library scan time and total CPU time, both 0:00:00.]

Fig. 1. A typical FASTA alignment.

The user has now confirmed that the FASTA program is working. One can use a similar syntax to test many of the other programs in the FASTA package. For example:
   align musplfm.aa lcbo.aa
will generate an optimal global alignment of the two sequences. The user should always go back to this example if the FASTA program appears to have difficulty reading a query sequence or a database. For example, if the "library" sequence is not formatted correctly, or if the database is in one format, but FASTA thinks it is in another format, the user may get the results:
   >musplfm transl. of musplfm.seq, 2 to 676 : 224 aa
   vs library
   0 residues in 0 sequences
   0 scores better than 1 saved, ktup: 2
   Enter filename for results :

When the program is unable to find a file, a "file not found" error message is displayed; here, the sequence file was found, but the data in the file were not found. At other times, the user may be surprised to find that his or her query sequence is considerably longer or shorter than expected. This is usually because of incorrect formatting of the query sequence. FASTA uses a common sequence format under IBM-PC/DOS, UNIX, and the Macintosh, but a different format under VMS. When beginning to use FASTA, the user should double-check that the lengths of query and library sequences are correct.

3.2. Searching a Sequence Database

In Section 3.1, the "command line" interface to FASTA was used to compare two sequences. FASTA will also prompt for the name of the query sequence and the name of the database to be searched. For example:
   % fasta
   fasta 1.6b [Nov, 1991] searches a sequence data bank
   Please cite: W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448
   test sequence filename: musplfm.aa
   Choose sequence library:
     P: NBRF complete database (rel30)
     G: GENBANK Translated Protein Database (rel70)
     D: NRL-3d structure database
     S: Swiss-Prot (rel 20)
   Enter library filename (e.g., protlib), letter (e.g., P)
   or a % followed by a list of letters (e.g., %PN): P <RET>
   ktup? (1 to 2) [2]
In this example, if lcbo.aa were entered as the library filename, the results would have been the same as in Section 3.1.


The list of potential sequence libraries is displayed only if the libraries have been installed properly and the FASTLIBS "environment" variable has been set (see Appendix C). After a successful search, the program will display the sequences with the top scores. Figure 2 shows the bottom of the histogram of similarity scores and the top of the list of highest-scoring sequences. The histogram indicates the number of library sequences that obtained similarity scores in the range indicated. For example, 12 library sequences obtained initn similarity scores of 72 or 73; only two library sequences obtained init1 scores in this range. This difference reflects the fact that the initn score is more sensitive, but the init1 score is more selective. The means and standard errors of the distributions of init1 and initn scores are also shown. This statistical calculation excludes those sequences that have high scores (>73 in this example), under the assumption that these library sequences are likely to be related to the query sequence.

3.3. Interpreting FASTA Results: What Do All of the Numbers Mean?

Most similarity searches seek to identify distantly related protein or DNA sequences that are homologous to, i.e., share a common ancestor with, an entry in a sequence database. One can imagine a perfect sequence comparison program that, after performing a search, would report infallibly: "These library sequences are homologous to the query sequence." Unfortunately, no such program exists, nor is one likely to, because many protein sequence families are so divergent that traces of common ancestry have been erased from the sequences of some of the family members (4). Because of the wide range of sequence diversity present in some protein families and the large number of unrelated sequences in the databases, for diverse families there will always be some unrelated sequences that obtain similarity scores that are higher than those of related sequences.
The question then becomes: "When does a high similarity score indicate homology?" Although the three similarity scores calculated by the FASTA algorithm can be confusing, the relationships among the three scores can be used to help infer sequence homology. Consider four examples from a search of the mouse proliferin protein sequence (musplfm.aa above). (The numbers in parentheses indicate the rank of the library sequence in the list of top-scoring sequences.)

   [Bottom of the histogram; bars omitted. Counts are the number of library sequences in each score range.]
   score  initn  init1
    60      38      4
    62      25      2
    64      36      0
    66      21      2
    68      20      5
    70      10      3
    72      12      2
    74       5      5
    76       3      1
    78       6      1
    80       1      1
   >80      94     69
   10360161 residues in 36150 sequences
   statistics exclude scores greater than 73
   mean initn score: 24.9 (8.15)
   mean init1 score: 24.3 (6.77)
   5591 scores better than 31 saved, ktup: 2, variable pamfact
   joining threshold: 28, scan time: 0:01:40
   Enter filename for results : musplfm.k2
   How many scores would you like to see? [20] 100
   The best scores are:                                   initn init1  opt
   A05086 Proliferin - Mouse                               1121  1121 1121
   S05648 Proliferin 3 - Mouse                             1108  1108 1108
   A23159 Proliferin 2 - Mouse                             1100  1100 1100
   LCHU   Prolactin precursor - Human                       405   296  444
   A28867 Prolactin precursor - Human                       402   296  441
   S04077 Prolactin precursor - Pig                         398   292  435
   S02104 Prolactin precursor - Sheep                       384   271  435
   JS0200 Prolactin precursor - Sheep                       393   276  434
   LCBO   Prolactin precursor - Bovine                      381   273  432
   A36284 *Prolactin-like protein I, placental - Bovine     337   227  363
   JS0430 Prolactin - Elephant                              336   223  391
   ...
   JT0450 *Growth hormone - Goat                             92    92  192
   MMMSA  Laminin chain A precursor - Mouse                  91    43   53
   A32424 *Somatotropin precursor - Grass carp               89    60  160
   ...

Fig. 2. Results from a "successful" search.

                                                     initn init1 opt
   (9)  LCBO   Prolactin precursor-Bovine              381   273 432
   (84) STGT   Somatotropin precursor-Goat              92    92 192
   (85) MMMSA  Laminin chain A precursor-Mouse          91    43  53
   (86) A32424 *Somatotropin precursor-Grass carp       89    60 160

Bovine prolactin (LCBO), ranked 9 in the list of top-scoring sequences, is clearly homologous to the query sequence. The initn and init1 similarity scores are more than 40 SD above the mean of all the sequences in the library, and the opt score is about 40% of that calculated when the sequence is compared with itself, implying that proliferin and prolactin are about 40% identical (the actual percent identity, 36.5%, is shown in the alignment). This is well within the limit (20-25%, ref. [6]) where homology can be clearly demonstrated from similarity. Thus, proliferin and prolactin share a common ancestor.
Both goat somatotropin precursor (STGT), ranked 84, and the grass carp somatotropin (A32424), ranked 86, are also clearly related to proliferin. Here, the inference is based on the substantial increase from the init1 score, which does not allow gaps, to the opt score, which does. Although the initn and init1 scores only suggest that these sequences are related to proliferin (six of eight scores between 81 and 91 are from sequences that are not related to proliferin), the two- to threefold increase in the opt score is often found with homologous proteins (for additional examples, see refs. [1-3]). The carp somatotropin sequence provides a very typical example of a more distant, but clearly related sequence. Here, the init1 score is much lower than the initn score, but it increases almost threefold when gaps are introduced in the alignment to produce the opt score. The lowest-ranked related sequence in the top 200 scores, a carp prolactin (ranked 121), has initn and init1 scores of 69, which increase to 229 with optimization. When evaluating marginal similarity scores, look for intermediate (40-60) init1 scores that increase to more than 150 with optimization.
The laminin scores confirm this rule. Although laminin obtains a high initn score and a low init1 score, much like the carp somatotropin, the init1 score increases only about 25%, to a value that is much lower than the initn score, after optimization. Laminin is unlikely to share a common ancestor with proliferin or the other members of the growth hormone family.

3.4. Increasing the Sensitivity of FASTA

Unfortunately, not all searches provide the definitive results shown in Section 3.2. Sometimes, the results look more like Fig. 3.
In this example, none of the top-ranked sequences are likely to share common ancestry with the query sequence, a microsomal glutathione transferase (PIR code A28083). This inability to detect related sequences may reflect their absence from the sequence database, or it may reflect a limitation of the FASTA search. This search was done with ktup=2 and, thus, required that initial regions be bounded by pairs of identical residues.

   [Bottom of the histogram of initn and init1 scores.]
   9697617 residues in 33989 sequences
   statistics exclude scores greater than 72
   mean initn score: 23.2 (7.37)
   mean init1 score: 22.8 (6.46)
   5025 scores better than 29 saved, ktup: 2, variable pamfact
   joining threshold: 21, scan time: 0:00:34
   The best scores are:                                     initn init1 opt
   HMNZED Hemagglutinin - Measles virus (strain Edmonston      83    46  46
   HMNZHA Hemagglutinin - Measles virus (strain Halle)         83    46  46
   HMNZKA Hemagglutinin - Rinderpest virus (strain Kabete      78    48  40
   A25856 Neuron cytoplasmic protein 9.5 - Human               75    54  54
   S04724 *NADH dehydrogenase chain 5 - Emericella nidula      75    38  40
   A35694 *cut1 protein - Yeast (Schizosaccharomyces pomb      74    56  60
   S06188 RNA1 polyprotein - Grapevine chrome mosaic viru      72    32  33
   JQ0274 Hypothetical 29K protein (trnH-trnV intergenic       71    41  50
   S13595 *6-deoxyerythronolide B synthase - Saccharopoly      71    44  47
   NICLMB Nitrogenase (EC 1.18.6.1), molybdenum-iron prot      70    56  82

Fig. 3. An "unsuccessful" search with ktup=2.

The sensitivity of the search can be increased by setting ktup=1 or by making FASTA optimize scores for all of the sequences in the database. If a search with ktup=2 fails to find sequences that are likely to share common ancestry, a search should be performed with ktup=1:
   fasta a28083.aa P 1
Here, a28083.aa is the query-sequence file, P is the library selection, and 1 is the ktup value.

Alternatively, FASTA prompts for the ktup if the query-sequence file and library file are not entered on the command line. Searches with ktup=1 take about five times as long as searches with ktup=2. There is no guarantee, of course, that sequences related to the query can be found in the database. In this example, the results of the search would be quite similar with ktup=1, except that the scores would be higher.
Current versions of FASTA (1.5 and later) provide a second option for increasing sensitivity: calculating an optimized score for every sequence in the database. Searches performed with this option are about as sensitive as searches with the rigorous Smith-Waterman algorithm (4,7), but about seven times faster. (Table 1 shows the time required to search part of the PIR protein sequence database using bovine prolactin [LCBO] as the query.) To optimize the similarity scores for every sequence in the library, two command line options are required: -o (optimize) and -c 1 (the threshold for optimization is set to 1). For example:
   fasta -o -c 1 lcbo.aa P

Table 1
FASTA and SSEARCH Execution Times(a)
   Algorithm   Sensitivity                       Time (min)
   FASTA       ktup=2                               0.22
   FASTA       ktup=1                               1.07
   FASTA       ktup=1, all scores optimized         2.35
   SSEARCH     (Smith-Waterman)                    22.00
(a) Execution times (minutes) required to scan the PIR2.SEQ file (preliminary entries, 12,837 sequences containing 3,384,087 amino acids) of the NBRF Protein Identification Resource protein sequence database (release September 30, 1991) using bovine prolactin (LCBO, 229 aa) as the query sequence.

If the -o option is used without the -c 1 option, the optimized scores are calculated only for library sequences with initn scores greater than a threshold (33 for a 200-residue query sequence and ktup=1) that is about the mean of the unrelated sequence scores.* Increasing the sensitivity with ktup=1 or additional optimization not only increases the amount of time required to perform the search, but it also decreases the "signal-to-noise" ratio by increasing the scores of unrelated sequences more than the scores of related sequences (2). Thus, in Fig. 4, there are many more high-scoring sequences in the histogram, but none of the high-scoring sequences are likely to be homologous to the microsomal glutathione transferase.
*The FASTA program has a large number of command-line options that can be used to modify the similarity scoring matrix, the format of the alignment, and other parameters that are used in the algorithm. These options are discussed more fully in the documentation distributed with the program.

   [Bottom of the histogram of initn and init1 scores.]
   9697617 residues in 33989 sequences
   statistics exclude scores greater than 77
   mean initn score: 31.5 (10.21)
   mean init1 score: 30.4 (7.91)
   5341 scores better than 43 saved, ktup: 1, variable pamfact
   joining threshold: 32, optimization threshold: 1, scan time: 0:05:08
   The best scores are:                                     initn init1 opt
   S08206 5-lipoxygenase-activating protein - Rat              34    34  84
   S11961 *Hypothetical protein - Red alga (Gracilaria ch      51    51  83
   S11150 *amlC protein - Streptococcus pneumoniae             49    49  82
   NICLMB Nitrogenase (EC 1.18.6.1), molybdenum-iron prot       ?    58  82
   B28269 Protein kinase (EC 2.7.1.37), cGMP-dependent          ?    81  81
   S08164 5-lipoxygenase-activating protein - Human            34    34  81
   MNXRRW Nonstructural protein Pns9 - Rice dwarf virus        75    75  81
   A34106 *Protein kinase (EC 2.7.1.37) cGMP-dependent 1       81    81  81
   LNHUPC Pulmonary surfactant protein A precursor (clone      61    43  80
   A35049 *Ankyrin - Human                                     80    57  80
   IMBCN4 Colicin N immunity protein - Escherichia coli p      81    58  80
   (? = value illegible in the source)

Fig. 4. An "unsuccessful" search with ktup=1; all scores optimized.

3.5. Changing Search Parameters

Built into the FASTA program are two scoring value matrices, the PAM250 matrix for proteins (5) and an identity matrix for DNA sequences (+4 for a match, -3 for an unambiguous mismatch, +2 for an ambiguous match [G-R], -1 for an ambiguous mismatch [G-Y]; see the file upam.gbl for the precise definition). In addition, by default, all of the sequence comparison programs in the FASTA package penalize -12 for the first residue in a gap and -4 for each additional residue. Both the scoring matrix and the gap penalties can be changed by specifying an alternative SMATRIX file. This can be done on the command line:
   fasta -s pam120.mat
or by changing an environment variable:
   set SMATRIX=pam120.mat
Several alternative SMATRIX files are included with the FASTA distribution.
   pam120.mat   A version of the PAM250 matrix calculated for 120 PAMs (point accepted mutations). The PAM120 matrix is used by the BLAST program (8) and is more selective (unrelated sequences receive lower scores).
   codaa.mat    The genetic code (minimum mutation distance) scoring matrix. +6 for an identity, +2 for a single base change, -2 for two base changes, -6 for three base changes.
   idnaa.mat    An identity matrix for proteins. +6 for an identity, -3 for any nonidentity.
   idpaa.mat    A protein matrix that uses the PAM250 values for identical matches, and -3 for any nonidentical alignment.
All of the scoring matrices charge -12 for the first residue in a gap and -4 for each additional residue. Some algorithms refer to a gap penalty of the form:
   q + rk
where k is the number of residues in the gap. The -12, -4 used by FASTA are equivalent to q = -8, r = -4.

3.6. Identifying Sequences in DNA Sequence Databases

The TFASTA program included in the FASTA package can be used to compare a protein sequence to a DNA sequence database, translating the DNA sequence database in all six frames. Searching a translated DNA sequence database is far more informative than searching the DNA sequence database directly; DNA-DNA sequence comparisons forgo the biochemical information encoded in the PAM250 amino acid replacement matrix. TFASTA has the same options as FASTA; thus the command
   tfasta lcbo.aa gpri.seq
might be used to search the primate portion of the GenBank DNA sequence database with ktup=2. Because of the size of the DNA sequence databases, translating and searching with TFASTA can be slow. However, both FASTA and TFASTA provide an optional method for selecting the libraries to be searched that makes it easier to search only the libraries of interest. For example, if the FASTLIBS file has specified that the following letters can be used to select a library

   P  primate
   R  rodent
   M  mammal
   V  other vertebrates
   I  invertebrates
   L  plants
   B  bacteria
   U  unannotated
and the user wishes to search all animal sequences, but not plants, bacteria, viruses, phage, or structural RNA sequences, then the command:
   tfasta lcbo.aa %PRMVIU
would search only the primate, rodent, other mammal, vertebrate, invertebrate, and unannotated sequences. Here, the % indicates that a list of database abbreviations, rather than a file name, has been entered.

3.7. Evaluating FASTA Results with RDF2 and RSS

The FASTA package includes several other programs that can be used to evaluate the results of FASTA searches. The RDF2 and RSS programs can test the statistical significance of a sequence alignment score. Both programs compare two sequences, calculate the similarity score(s), shuffle the second sequence to generate a set of new sequences with exactly the same length and amino acid composition, and calculate similarity score(s) for each shuffled sequence. RDF2 uses the FASTA algorithm to calculate three similarity scores for each shuffled sequence; RSS uses the Smith-Waterman algorithm (7) to calculate a single score. Both shuffling programs have several options: The number of shuffles can be varied (100-200 are recommended), the shuffling strategy can be either “uniform” (each residue can be moved to any position along the sequence) or “local” (residues are shuffled in blocks; i.e., residue 1 will remain in the first 10 residues, residue 11 in the second 10, and so on), and the size of the shuffling window can be changed. Figure 5 shows a typical comparison of the microsomal glutathione transferase with the highest-scoring library sequence from Fig. 4.
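The two shuffling strategies described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the RDF2 or RSS source: the function names, the fixed random seed, and the short test fragment (the first residues of the musplfm sequence) are assumptions made for the example.

```python
import random


def uniform_shuffle(seq: str, rng: random.Random) -> str:
    """Uniform shuffle: any residue may move to any position."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def local_shuffle(seq: str, window: int, rng: random.Random) -> str:
    """Local shuffle: residues are shuffled only within consecutive
    blocks of `window` residues, so local composition is preserved."""
    blocks = []
    for start in range(0, len(seq), window):
        block = list(seq[start:start + window])
        rng.shuffle(block)
        blocks.append("".join(block))
    return "".join(blocks)


# Hypothetical example input: a short protein fragment.
query = "MLPSLIQPCSWILLLLLVNSSLLWKNVASFPMC"
rng = random.Random(0)  # fixed seed so the sketch is repeatable

uniform = uniform_shuffle(query, rng)
local = local_shuffle(query, 10, rng)

print(uniform)
print(local)
```

Both functions return a sequence with exactly the same length and amino acid composition as the input; the local version additionally keeps each residue inside its original 10-residue block, which is why it is the more conservative test when composition varies along the sequence.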

   rdf2 1.6b (Nov, 1991) compares a test sequence to a shuffled sequence
   ktup? (1 to 2) [2] 1
   number of random shuffles? [20] 200
   local shuffle, window size [10] <RET>
   a28083.aa : 155 aa
   s08206.aa : 161 aa
   [Histogram of initn, init1, and opt scores for the shuffled sequences.]
   32200 residues in 200 sequences, local shuffle, window size: 10
   mean initn score: 39.9 (9.83); max initn score: 82
   mean init1 score: 38.1 (6.89); max init1 score: 59
   mean opt score:   45.4 (9.50); max opt score:   77
   initn score: 34 is -0.60 s.d. above mean
   init1 score: 34 is -0.59 s.d. above mean
   opt score:   84 is 4.06 s.d. above mean
   ktup: 1, fact: 4, scan time: 0:00:01

Fig. 5. Evaluating similarity scores with RDF2.

Figure 5 shows that the relationship between the microsomal glutathione transferase and lipoxygenase activating protein is borderline. The z-value* of the optimized similarity score is only 4 (RSS** gives a z-value after a local shuffle of 5); unrelated sequences sometimes obtain z-values of 5 or more. As Fig. 5 shows, it is possible to shuffle the lipoxygenase sequence and still obtain similarity scores > 75. Both proteins are membrane-bound; this sequence similarity may simply reflect their shared hydropathy.

3.8. Evaluating FASTA Results: Local Similarity

In addition to evaluating a similarity score by Monte-Carlo shuffling, the user can also ask whether there are alternative alignments of the two sequences that yield high similarity scores. For example, several unrelated membrane proteins have high initn scores when compared with the β-adrenergic receptor sequence (4); in these cases, the proteins shared several alternative alignments that suggested that it was hydropathy, rather than common ancestry, that was the basis for the high scores. The LFASTA, PLFASTA, LALIGN, and PLALIGN programs can be used to display alternative local alignments between two sequences. LFASTA and LALIGN show the alignments, whereas PLFASTA and PLALIGN produce a two-dimensional dot-matrix-like plot. LALIGN and PLALIGN use a rigorous algorithm to calculate the best local alignments (9-11) and are preferred for protein sequences on faster machines with sufficient memory. LFASTA and PLFASTA use the FASTA algorithm to identify rapidly regions that share local similarity; they can be used for long sequences (2000 residues) on IBM-PCs. Analysis of the relationship between microsomal glutathione transferase and lipoxygenase activator shows a single high-scoring alignment (22% identity over 113 amino acids), a result that is consistent with the hypothesis that the two sequences share a common ancestor. As with the shuffling programs, the results are ambiguous for this pair of sequences.
*The z-value of a similarity score is calculated by subtracting the mean score and dividing the difference by the standard deviation of the distribution of shuffled scores.
**The RSS program provides a more stringent test; since the RSS program uses a more sensitive (and less selective) algorithm than RDF2, RSS is expected to calculate higher scores for the shuffled sequences, which makes it more difficult to obtain a high z-value.
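As a worked example of the z-value definition in the footnote, the short Python sketch below recomputes the values reported in Fig. 5 from the mean and standard deviation of the shuffled scores. The function names are illustrative, and the use of the sample standard deviation in `z_value` is an assumption; the chapter does not say which estimator RDF2 uses.

```python
import statistics


def z_from_summary(score: float, mean: float, sd: float) -> float:
    """z-value from a score and the mean/SD of the shuffled scores."""
    return (score - mean) / sd


def z_value(score: float, shuffled_scores: list[float]) -> float:
    """z-value computed directly from a list of shuffled-sequence scores.
    Assumption: the sample standard deviation is used."""
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    return z_from_summary(score, mean, sd)


# Values read from Fig. 5 (200 local shuffles, window size 10).
z_opt = z_from_summary(84, 45.4, 9.50)
z_initn = z_from_summary(34, 39.9, 9.83)

print(f"opt   z-value: {z_opt:.2f}")
print(f"initn z-value: {z_initn:.2f}")
```

The opt score of 84 lies about 4 standard deviations above the mean of the shuffled scores, whereas the initn score of 34 is actually below the shuffled mean, which is why the relationship is described as borderline.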

3.9. Summary: How to Identify Sequences with FASTA and TFASTA

1. Search a protein-sequence library (or use TFASTA to compare a protein sequence to a DNA-sequence library) using the ktup parameter set to 2.
2. If high-scoring sequences are found (initn > 100, init1 > 60, opt > 150), it is likely that a homologous library sequence has been found. Confirm the identification by running the RDF2 and RSS programs; a z-value >10 is expected.
3. If a FASTA search with ktup=2 does not turn up any likely candidates (at least 200 scores should be examined), repeat the search using ktup=1. (Again examine the 200 top-scoring sequences, looking for init1 scores >60 that increase to 150 or more with optimization.) Candidates identified with ktup=1 should have RDF2 or RSS z-values >8. Sequence similarities with lower z-values are suspect; look for >15-20% sequence identity over the entire length of both sequences.
4. If no library sequences meet the criteria in step 3, consider repeating the ktup=1 search and calculating an optimized score for every sequence in the library. However, this strategy is more likely to result in an ambiguous false-positive result than in the novel finding of homology. Searches should be done in the order shown (1, 3, 4). With some sequences, searches with ktup=2 unambiguously show homology, whereas the conclusion is less clear as more sensitive methods are used (ktup=1, -o -c 1) and the similarity scores of unrelated sequences increase. One should always be extremely cautious when interpreting apparent relationships that cannot be detected without calculating an optimized score for every sequence in the library.
5. If a significant similarity is found and an alignment is to be published, the SSEARCH program should be used to produce a rigorous alignment. FASTA alignments are limited to 32-residue-long gaps; there is no limit on the length of the gaps produced by SSEARCH. For the same reason, the LALIGN and PLALIGN programs are preferred over the LFASTA and PLFASTA programs.
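The score thresholds in the steps above can be written down as a small triage function. The Python sketch below is a reading aid rather than part of the FASTA package: the `Hit` class, the category labels, and the decision to merge the step 3 rule with the "intermediate (40-60) init1 score" rule from Section 3.3 into a single `init1 >= 40` test are all assumptions made for the example. The four test hits are the examples discussed in Section 3.3.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    name: str
    initn: int
    init1: int
    opt: int


def triage(hit: Hit) -> str:
    """Classify a library hit using the rule-of-thumb thresholds."""
    # Step 2: all three scores are high.
    if hit.initn > 100 and hit.init1 > 60 and hit.opt > 150:
        return "likely homolog"
    # Step 3 / Section 3.3 (merged, an assumption): an intermediate or
    # better init1 score that rises to 150 or more with optimization.
    if hit.init1 >= 40 and hit.opt >= 150:
        return "candidate"
    return "unsupported"


def z_value_expected(z: float, ktup: int) -> bool:
    """Shuffling check: z > 10 for ktup=2 hits, z > 8 for ktup=1 hits."""
    return z > 10 if ktup == 2 else z > 8


hits = [
    Hit("LCBO", 381, 273, 432),   # bovine prolactin
    Hit("STGT", 92, 92, 192),     # goat somatotropin precursor
    Hit("MMMSA", 91, 43, 53),     # mouse laminin chain A
    Hit("A32424", 89, 60, 160),   # grass carp somatotropin
]

for hit in hits:
    print(f"{hit.name:8s} {triage(hit)}")
```

A "candidate" label only means the hit deserves a shuffling test with RDF2 or RSS; the thresholds are guidelines for protein searches, not a substitute for examining the alignment.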
The examples in this chapter show a number of the uses and pitfalls of the FASTA program when characterizing distantly related protein sequences. All of the methods shown above can be applied to DNA sequences as well, but with the caution that DNA sequence comparisons suffer from "signal-to-noise" problems much more frequently than protein sequences. It is common to find a DNA sequence similarity with a modest score (initn score > 100 and an optimized score > 150 at the nucleotide level) that disappears (scores < 50) when the translated sequences are compared.

Appendix A: Obtaining the FASTA Program Package

The FASTA package has progressed through several versions since it was first introduced in 1987. Newer versions correct bugs in the FASTA program, or allow the program to search more library files or additional library formats. As this chapter is being written, the current version for UNIX, VAX/VMS, DOS, and the Macintosh is version 1.6. This latest version contains a number of additional new programs, including rigorous (but very slow) programs for library searching, local-sequence alignments, and statistical analysis.

Obtaining the Programs

The best way to obtain the program depends on the type of machine that will be used.

UNIX
The easiest way to get FASTA for a UNIX machine is via anonymous ftp from the host uvaarpa.Virginia.EDU or via electronic mail ([email protected]). If the user's institution has internet access, the user should try anonymous ftp first. From a machine that has access to the internet, type:
   ftp uvaarpa.Virginia.EDU
or alternatively
   ftp 128.143.2.7
and log in with the user Name: anonymous and a Password: your-userid
The FASTA package is in a file of the form public_access/fasta16b.shar. (A compressed version is available as fasta16b.shar.Z. The 16b will change as newer versions become available.) To transfer the file:
   cd pub/fasta
   get fasta16b.shar
(Be sure to use binary mode for transfer of the compressed .Z file.) This is a UNIX shar file; after transferring this file to a UNIX machine, type:
   sh fasta16b.shar
and the file will be broken into the files required to recompile the FASTA programs. A Makefile is included for Sun (4.2BSD), ATT SysV, and Xenix flavors of UNIX. Other makefiles are included for Turbo "C" on the IBM-PC and for the VMS operating system on a VAX.
If electronic mail is sent to "[email protected]" requesting the UNIX version of the program, the author can send back a set of files that can be concatenated to create the fasta.shar file, which can then be unpacked with "sh fasta.shar." Alternatively, the author can send the fasta.shar file on either IBM-PC (1.2 Mbyte 5.25 in. or 720 Kbyte/1.44 Mbyte 3.5 in.) or Macintosh (720 Kbyte 3.5 in.) disks.

VAX/VMS
If planning to use these programs on a VAX/VMS computer, the author can send the user a set of VAX/VMS files via electronic mail or put them on an IBM-PC or Macintosh diskette.

IBM-PC
The FASTA package comes on 5.25- or 3.5-in. floppy disks for the IBM-PC, and includes complete source code, executable versions of the programs, and also *.BGI graphics device driver programs from Borland's Turbo "C" package. The program costs $60.00 (plus $10.00 overseas shipping). Orders should be sent to:
   William R. Pearson
   1611 Westwood Rd.
   Charlottesville, VA 22903
   USA
There is a $25.00 additional charge for purchase orders.

Macintosh
The FASTA package is also available for the Macintosh computer, although the program is not very "Mac-like." It does run in the background under Multifinder, so the user can search and do other work at the same time. The Mac version also costs US $60.00; please send checks to the address above.

Once the user has obtained the programs, he or she will need to:
1. Install them on the computer.
2. Configure a file (the FASTLIBS file) that tells FASTA where to find sequence libraries (the user may also need to edit other files that describe where the sequence databases are found).

3. Set an "environment" variable (FASTLIBS) to tell FASTA where the FASTLIBS file can be found.
4. Set the execution PATH to include the directory that contains the FASTA programs.
For example, under UNIX or DOS:
1. Edit the file called fastgbs, which is included in the distribution, changing the file names for the sequence libraries.
2. Under DOS, type the command: "set FASTLIBS=c:\fasta\fastgbs" (assuming the FASTA package was installed in a directory called "c:\fasta").
3. Under DOS, type the command: "set PATH=c:\fasta;..." (again assuming installation in "c:\fasta").
Equivalent commands are available under UNIX.
On the Macintosh, there are no "environment" variables, and it is easier simply to run the FASTA program from a "FASTA" folder. To mimic "environment" variables on the Macintosh, the Mac FASTA package uses a file called "environment," which contains lines like:
   FASTLIBS=HD40:FASTA:FASTGBS

Appendix B: Obtaining Sequence Databases

If the user's computer has access to the internet, the DNA and protein sequence libraries are available via anonymous "ftp" from a number of sources, including those listed in Table 2. DNA and protein sequence databases are available on diskettes or tape from the sources listed in Table 3. For most people, particularly on IBM-PCs and Macintoshes, the most difficult part of using the FASTA programs is installing the databases properly. An example of how to do this is given below.

Appendix C: Setting Up the FASTLIBS File

The FASTA and TFASTA programs can be configured to present the user with a list of databases that are available to be searched. After the list is presented, the user can enter a single letter (P) to select a database, or a string of letters prepended with a % (%PGS) to select several databases. For this menu option to work, a file that lists each of the databases, the FASTLIBS file, must be present. Unfortunately, most people who have difficulty installing FASTA have problems with the FASTLIBS file.

Table 2
Sequence Databases Available over the Internet
   Provider                              Internet address    E-mail contact
   NCBI, National Library of Medicine    ncbi.nlm.nih.GOV    [email protected]
   University of Houston Gene Server     ftp.bchs.uh.EDU     [email protected]
   Databases available: GenBank(TM) DNA sequences; Swiss-Prot protein sequences; Protein Identification Resource (PIR)

A typical FASTLIBS file looks like this:

NBRF Annotated Protein Database (rel 30)$0A/seqlib/pir1.seq 5
NBRF Protein Database (complete)$0P@/p0/seqlib/prot.nam
GB70.0 Primate$1P/p0/gblib/gbpri.seq 1
GB70.0 Rodent$1R/p0/gblib/gbrod.seq 1
. . .

Each line of the FASTLIBS file has four required fields and one optional field.

NBRF Annotated Protein Database$    0    A    /seqlib/pir1.seq    5
field 1                             2    3    field 4             5 (optional)

Field 1 contains a description of the database and ends with a $. The next two characters are field 2, a number that indicates whether the database contains protein (0) or DNA (1) sequences, and field 3, a letter used to select the database. The fourth field contains the name of the database file, whereas the fifth, optional field indicates the database format. Most people have problems filling out the fourth and fifth fields. The fourth field is complicated, because it can contain two types of database file descriptions. The first type is simple; it is the name of the file that contains the database. In this example, the annotated PIR database is in a file called /seqlib/pir1.seq. The second type of description is more complicated; it is a file that contains a list of database files. For example, the complete PIR protein database is distributed in three files: pir1.seq, pir2.seq, and pir3.seq. In the example above, the file prot.nam is an indirect file with the entries:

Table 3
Database Distributors

Provider: Intelligenetics
Databases: GenBank
Address: GenBank NCBI Intelligenetics, National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA
Media: CD-ROM

Provider: European Molecular Biology Laboratory
Databases: EMBL DNA-sequence library; Swiss-Prot protein-sequence library
Address: EMBL DNA Library, European Molecular Biology Laboratory, Postfach 10.2209, Meyerhofstr. 1, D-6900 Heidelberg, Germany
Media: CD-ROM; nine-track tape; VMS cartridge tape

Provider: Protein Identification Resource
Databases: PIR protein-sequence library
Address: National Biomedical Research Foundation, Georgetown University Medical Center, 3900 Reservoir Rd. N.W., Washington, DC 20007, USA
Media: Nine-track tapes; TK-50 cartridges (VMS); CD-ROM

. . . to quit

This occurs when the file pointed to by the FASTLIBS variable contains incorrect information about where to find the sequence database. In this case, the file contained:

NBRF Annotated Protein Database (rel 25)$0A/p0/slib/lib/protein.seq 5

when it should have said

NBRF Annotated Protein Database (rel 25)$0A/seqlib/lib/pir1.seq 5
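The field layout described above can be sketched in a few lines of Python. This is not FASTA's own parsing code: the names LibEntry and parse_fastlibs_line are hypothetical, and treating a leading “@” in field 4 as the marker for an indirect file is an assumption drawn from the prot.nam example.

```python
from typing import NamedTuple, Optional


class LibEntry(NamedTuple):
    description: str    # field 1, text before the "$"
    is_dna: bool        # field 2: "0" = protein, "1" = DNA
    letter: str         # field 3: menu selection letter
    filename: str       # field 4: database file or indirect file
    indirect: bool      # assumption: "@" prefix marks a list of files
    fmt: Optional[int]  # field 5 (optional): database format number


def parse_fastlibs_line(line):
    description, sep, rest = line.rstrip("\n").partition("$")
    if not sep or len(rest) < 3:
        raise ValueError("expected 'description$<type><letter><file>'")
    type_code, letter = rest[0], rest[1]
    if type_code not in ("0", "1"):
        raise ValueError("field 2 must be 0 (protein) or 1 (DNA)")
    parts = rest[2:].split()
    if not parts:
        raise ValueError("field 4 (file name) is missing")
    filename = parts[0]
    indirect = filename.startswith("@")
    if indirect:
        filename = filename[1:]
    fmt = int(parts[1]) if len(parts) > 1 else None
    return LibEntry(description.strip(), type_code == "1", letter,
                    filename, indirect, fmt)


entry = parse_fastlibs_line(
    "NBRF Annotated Protein Database (rel 30)$0A/seqlib/pir1.seq 5")
print(entry.letter, entry.filename, entry.fmt)
```

Checking each line this way before running a search makes it easier to spot a wrong file name or a missing format number, the two fields that most often cause trouble.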


Database Formats

As this chapter is being written, FASTA recognizes six different database file formats:

0-FASTA (>seq-id-comment line/sequence data)
1-GenBank “flat-file” (LOCUS/DEFINITION/ORIGIN)
2-NBRF CODATA
3-EMBL/SWISS-PROT (ID/DE/SQ)
4-Intelligenetics (;comment line/SEQID/sequence)
5-NBRF/PIR VMS format (>P1;SEQID/comment/sequence)
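The line patterns given in parentheses above suggest a rough way to guess the format number from the first line of an entry. The sketch below is only a heuristic under that assumption; guess_format is a hypothetical helper, it is not how FASTA itself chooses a format (FASTA relies on the number in the FASTLIBS file), and it makes no attempt to detect type 2 (NBRF CODATA).

```python
FORMAT_NAMES = {
    0: "FASTA",
    1: "GenBank flat-file",
    2: "NBRF CODATA",
    3: "EMBL/SWISS-PROT",
    4: "Intelligenetics",
    5: "NBRF/PIR VMS",
}


def guess_format(first_line):
    """Guess a format number from the first line of an entry, or None."""
    line = first_line.rstrip()
    # ">P1;SEQID": two characters after ">" followed by ";"
    if line.startswith(">") and len(line) > 3 and line[3] == ";":
        return 5
    if line.startswith(">"):
        return 0
    if line.startswith("LOCUS"):
        return 1
    if line.startswith("ID "):
        return 3
    if line.startswith(";"):
        return 4
    return None  # unknown, including CODATA


print(FORMAT_NAMES[guess_format(">P1;CCHU")])
```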

Type 0 and type 5 files do not contain any reference data, and are the fastest to search; as a result, the EMBL provides type 5 (PIR/VMS) format versions of their DNA database and the SWISS-PROT database on their CD-ROM. Searching the PIR/VMS format is several times faster than searching the same databases in EMBL format.

If the wrong format is selected for a database file, the total number of residues read will be wrong. If, for example, the user indicates that pir1.seq is a type 1 file (GenBank flat-file format), but it is actually a type 5 NBRF/VMS file, FASTA will report that it found 0 residues in 0 sequences. To make certain that the values in the FASTLIBS file are correct, the user should confirm that FASTA has read the number of residues listed in the release notes for the database.

CHAPTER 27

Converting Between Sequence Formats

Gary O’Donnell

From: Methods in Molecular Biology, Vol. 24: Computer Analysis of Sequence Data, Part I. Edited by: A. M. Griffin and H. G. Griffin. Copyright ©1994 Humana Press Inc., Totowa, NJ

1. Introduction

A “sequence format” is a punctuation style, or defined layout of text, within a computer file that separates a sequence from everything else. It allows computer programs that “understand” the format to distinguish between the sequence and any reference documentation also in the file. Some format definitions extend to the documentation itself (i.e., most database formats), allowing some software to locate specific reference information (e.g., authors, journals, species classification, coding regions).

Unfortunately, sequence analysis programs do not recognize a single, universal format. Many different programs and packages have developed their own formatting style. At best this means that one software package does not read the sequence file that another has created. At worst a program reads a file in the wrong format and interprets annotations as sequence (with potentially disastrous results)!

Formats are typically named after the software package associated with them, or after an organization that defined the format. The best known are EMBL, GCG, GenBank, IG (Intelligenetics or Stanford), and NBRF (also known as PIR). Some formats are named after individual software authors, e.g., Pearson, Staden, and Olsen.

This chapter describes how to create a specific format from the plain sequence text and how to convert between formats. The formal format definitions are not described here. Methods 1-3 show how to copy a sequence file from one format into a new file of another format. Method 1 uses READSEQ, a single

program that recognizes most formats, but does not form part of a general package. Methods 2 and 3 use programs from the GCG package. Methods 4 and 5 describe how to reformat the EMBL (1) and GenBank (2) databases into NBRF format and create supplementary index files. This allows the most commonly used VAX packages (i.e., GCG [3], NAQ [4], PSQ [5], XQS and ATLAS [6], FASTx [7], Staden [8]) to share one copy of the sequence and reference information.

2. Materials

1. Computers: VAX/VMS for all methods. VAX/VMS, VAX-Ultrix, and Apple Macintosh for Method 1.
2. Terminal: Any text-capable terminal is suitable as there is no graphical output.
3. Programs: The program READSEQ (9) is available from the EMBL file server, the University of Houston (UH) gene-server, and “anonymous ftp” from various INTERNET sites. Obtaining programs from these sources is described in Chapter 28 with the READSEQ program as an example.
Version 7.2 of the Genetics Computer Group (GCG) package, available from:
Genetics Computer Group
University Research Park
575 Science Drive, Suite B
Madison, WI 53711-1060
Electronic mail: [email protected].
Methods 4 and 5 use FORTRAN program-source files supplied as part of the XQS and PIR database software distribution: createdbs.for, createinx.for, indexer.for, sorttmp.c. These are the PIR programs, available from:
Protein Identification Resource (PIR)
National Biomedical Research Foundation
Georgetown University Medical Center
3900 Reservoir Road, N.W.
Washington, DC 20007
4. Input files. The sequence file: For the READSEQ program (Method 1), the file may contain sequence only in one of the formats recognized, i.e., IG, GenBank, NBRF, EMBL, GCG, DNA Strider, Fitch, Zuker, Olsen, Phylip, plain format (i.e., the sequence only, no documenting text), and others.


For Method 2, the sequence file must be in GCG format. Such a file may be created using the GCG program SEQED. In the example, a file called CCHU.PIR1 has been retrieved from a local copy of the NBRF-Protein database (logical name PIR1), using the GCG program FETCH:

$ fetch pir1:cchu

For Method 3, the format of the input file must correspond to the program being used, i.e., NBRF format when the GCG program FROMPIR is used. In the first two examples, the files created in Method 2 are used. For the third example (the GCG program REFORMAT), a plain-text file is used.

Method 4 requires an EMBL format database file(s). The example used here is EMBL35, which was supplied in 16 separate subsections. The Swissprot database is supplied as a single file. Both can be obtained on computer tape or CD ROM from:
European Molecular Biology Laboratory
Postfach 10.2209
D-69012 Heidelberg
Germany
Electronic mail: DataLib@EMBL-Heidelberg.DE

Method 5 requires a GenBank format database file. The GenBank database is available on CD ROM in 14 separate subsections from:
NCBI-GenBank
National Center for Biotechnology Information
National Library of Medicine
8600 Rockville Pike
Bethesda, MD 20894
Electronic mail: [email protected]

Additional files: Methods 4 and 5 require a single-line header file for each subsection of the database (e.g., EMBLPRI.HEADER or GBPRI.HEADER) for building the GCG indices, e.g.:

Name: EMBLDIR:EMBLPRI LN: EM-PR SN: EM-PR Rel: 35.0 Reldate: 06/93 Fordate: 06/93 Type: N FORMAT: NBRF
Name: GENBANKDIR:GBPRI LN: GB-PR SN: GB-PR Rel: 77.0 Reldate: 06/93 Fordate: 06/93 Type: N FORMAT: NBRF

5. Disk space: Disk space requirement is totally dependent on the size of the sequence files being converted. For single-file conversions (Methods 1-3), each conversion increases the file space used by approximately the same amount as the original file. For database formatting


(Methods 4 and 5), a total of 3.5 times the size of the original data file(s) should be available. When the processing is complete, however, the original file(s) and some intermediary files can be deleted.

3. Methods

3.1. Method 1-The READSEQ Program

1. If the example file to be converted is called TEST.SEQ, and the formatted file (i.e., the output) is to be called FORMATTED.SEQ, run the program by:

$ readseq -v -a
Name of output file: formatted.seq

2. The program lists the 18 different formats that it can recognize:

1. IG/Stanford          10. Olsen (in-only)
2. GenBank/GB           11. Phylip3.2
3. NBRF                 12. Phylip
4. EMBL                 13. Plain/Raw
5. GCG                  14. PIR/CODATA
6. DNAStrider           15. MSF
7. Fitch                16. ASN.1
8. Pearson/FASTA        17. PAUP/NEXUS
9. Zuker (in-only)      18. Pretty (out-only)
Choose an output format (name or #):

Enter the number, or name, of the format that is required, i.e., between 1 and 18, and press return.
3. There is now a prompt for the name of the input file:

Name an input sequence or -option: test.seq

4. There is then a repeat prompt for the input file; just press return to enter a blank line to finish.

Name an input sequence or -option:

5. There is now a file called FORMATTED.SEQ in the selected format in the default directory.

3.2. Method 2-Conversions from GCG Format

GCG has several programs that copy a file already in GCG format into a new file. GCG names the programs as TOx, where x is the new format created, i.e., TOFITCH, TOIG, TOSTADEN, TOPIR. The GCG programs TOPIR and TOSTADEN are used as examples.


1. Make sure the GCG package is available on the computer:

$ gcg

A banner and copyright message should appear. If this fails, consult the system manager; GCG may have another name on the user’s system, e.g., GCG7, UWGCG.
2. Run the program TOPIR by entering its name:

$ topir
TOPIR writes GCG sequence(s) into a single file in PIR format.
TOPIR of what GCG sequence(s) ? cchu.pir1
Begin (* 1 *) ?
End (* 104 *) ?
What should I call the output file (* Cchu.Pir *) ?
CCHU 104 characters.

The new file CCHU.PIR is in NBRF format.
3. Run the program TOSTADEN by entering its name:

$ tostaden
ToStaden writes a GCG sequence into a file in Staden format. If the file contains a nucleotide sequence, the ambiguity codes are translated as shown in Appendix III of the PROGRAM MANUAL.
TOSTADEN of what GCG sequence ? cchu.pir1
Begin (* 1 *) ?
End (* 104 *) ?
What should I call the output file (* Cchu.Sdn *) ?

The new file CCHU.SDN is in Staden format.

3.3. Method 3-Conversions into GCG Format

GCG has several programs that copy a file not in GCG format into a new file. GCG names most of the programs as FROMx, where x is the format of the input file, i.e., FROMIG, FROMPIR, FROMEMBL, FROMGENBANK, FROMSTADEN. The GCG programs FROMPIR, FROMSTADEN, and REFORMAT are used as examples, with REFORMAT reading a plain-text file. In all cases, the file being written to is in GCG format.

1. Run the program FROMPIR by entering its name:

$ frompir


FROMPIR reformats sequences from the protein database of the Protein Identification Resource (PIR) into individual files in GCG format.
FROMPIR of what PIR sequence file ? cchu.pir
Cchu.Gcg 104 aa
FROMPIR complete:
Files written: 1
Total length: 104

A new file named CCHU.GCG, in GCG format, is written.
2. Run the program FROMSTADEN by entering its name:

$ fromstaden
FromStaden changes a sequence from Staden format into GCG format. If the file contains a nucleotide sequence, the ambiguity codes are translated as shown in Appendix III of the PROGRAM MANUAL.
FROMSTADEN of what Staden sequence file ? cchu.sdn
What should I call the output file (* Cchu.Seq *) ?

A new file named CCHU.SEQ, in GCG format, is written.
3. Run the program REFORMAT by entering its name:

$ reformat
REFORMAT rewrites sequence file(s), symbol comparison table(s), or enzyme data file(s) so that they can be read by GCG programs.
REFORMAT what sequence file(s) ? test.seq
What should I call the output file (* Test.Seq *) ?
No ".." divider

A new file named TEST.SEQ, in GCG format, is written, despite the apparent failure message.
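The ‘No ".." divider’ message refers to the line ending in two dots that, in a GCG-format file, separates the documentation from the sequence. The sketch below shows one way such a file could be split at that line. It is an illustration, not GCG code: split_gcg is a hypothetical helper, the sample heading line is made up and simplified, and the set of characters kept as sequence symbols is an assumption.

```python
def split_gcg(text):
    """Split GCG-style text into (documentation, sequence).

    Returns None when no line ending in ".." is found, which is the
    situation REFORMAT reports for a plain-text input file.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            documentation = "\n".join(lines[: i + 1])
            body = "".join(lines[i + 1:])
            # Assumption: keep letters plus "*" and "."; drop the
            # position numbers and spacing used for layout.
            sequence = "".join(
                ch for ch in body if ch.isalpha() or ch in "*.")
            return documentation, sequence
    return None


sample = (
    "An example entry (made-up residues)\n"
    "\n"
    "Test.Seq  Length: 10  ..\n"
    "\n"
    "       1  ACDEFGHIKL\n"
)
print(split_gcg(sample)[1])
```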

3.4. Method 4-Formatting EMBL-Format Databases

1. First compile and link all the PIR programs, from the FORTRAN source code, in the usual way, e.g.,

$ for createdbs
$ link createdbs

Repeat for the files named createinx and indexer. The sorttmp.c file is treated in a similar way:

$ cc sorttmp
$ link sorttmp


2. Create VMS symbols that will run the PIR programs. If the .EXE files are in a directory called User$disk:[Yourname.convert], then the symbols are:

$ createdbs:==$User$disk:[Yourname.convert]createdbs
$ createinx:==$User$disk:[Yourname.convert]createinx
$ indexer:==$User$disk:[Yourname.convert]indexer
$ sorttmp:==$User$disk:[Yourname.convert]sorttmp

3. Now run each of these programs in turn, from an empty default directory with plenty of free disk space. The first program, createdbs, “knows” about the 16 different EMBL subsections, and can process them all just by giving it the location of the source files, which are usually on a CD ROM. The user need only set up a logical name pointing to the CD ROM, e.g.,

$ assign DAD8: emblcd

a. Split each of the EMBL subsections into two NBRF-style files, one with the sequence information (EMBL*.SEQ), the other with the reference information (EMBL*.REF), where * is the name of the subsection:

$ createdbs
Database [PIR, CODATA, GENBANK, GBNEW, EMBL, SWISSPROT]: embl
Directory for 16 *.DAT files: emblcd:[embl]

b. Create the basic index file (EMBL*.INX) used by the other PIR programs. This has to be run 16 times, once for each of the files just produced by the createdbs program. The following is an example for EMBLPRI (the primates subsection):

$ createinx
Database (no file type): emblpri
Code length [6]: 10
Database type (Text, Protein, or Nucleic) [P]: n
Database format (NBRF, CODATA, GenBank, EMBL, or Unknown) [NBRF]: embl
Database name: emblpri
Release date (yymmdd): 930615
Release number: 35.0
Database description: EMBL Primate entries


The same result can be achieved by entering all the information on one line (albeit somewhat encrypted):

$ createinx/emblpri/10/2/3/0/emblpri/930615/35.0/EMBL primate entries

c. Create the author (.aux), accession (.acx), species (.spx), title (.ttx), and keyword (.wox) index files used by the XQS, NAQ, and PSQ programs. These files are used for building further indices for the XQS and ATLAS programs (see PIR software distribution). To do this, run the two programs, indexer and sorttmp, one after the other, for each of the 16 subsets of data.

$ indexer
Database: emblpri
Is this a preliminary update [Y]: n
(ACX AUX FTX HOX JRX SFX SPX TTX WOX)
Index to create ( for all, to run): aux
Index to create ( for all, to run): wox
Index to create ( for all, to run): acx
Index to create ( for all, to run): spx
Index to create ( for all, to run): ttx
Index to create ( for all, to run):

Now run sorttmp on each of the five files just produced. The simplest way of doing this is to execute the DCL file sorttmp.com supplied with the PIR programs:

$ @sorttmp

d. Create the index files (EMBL.OFFSET, EMBL.NAMES, EMBL.NUMBERS), which allow the GCG programs to read database entries directly. Again, do this for each of the 16 subsets.

$ gcg
$ dbindex/nomonitor emblpri.seq

e. Create the file (EMBL*.SEQCAT) used by the GCG program STRINGSEARCH. Make sure the file EMBLPRI.HEADER (and the other header files) is present before running the GCG program SEQCAT.

$ seqcat EMBLPRI.SEQ/default

f. After checking that all steps have been correctly completed, repeat steps b-e for each of the EMBL subsections. In release 35 these files are PHG, ORG, FUN, PRO, PLN, INV, VRT, PRI, ROD, MAM, SYN, VRL, UNC, EST, PATENT, BB.
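With 16 subsections and several files per subsection, it is easy to lose track of which steps have been completed. The following sketch lists the file names that steps a-e would be expected to leave behind and reports any that are absent. It rests on assumptions: the names are built as EMBL + subsection + file type, VMS version numbers are taken to have been stripped from the directory listing, and the GCG index files of step d are left out because their exact names per subsection are not spelled out above. The function names are hypothetical.

```python
EMBL_SUBSECTIONS = [
    "PHG", "ORG", "FUN", "PRO", "PLN", "INV", "VRT", "PRI",
    "ROD", "MAM", "SYN", "VRL", "UNC", "EST", "PATENT", "BB",
]

# File types produced in steps a, b, c, and e (assumed naming).
FILE_TYPES = ["SEQ", "REF", "INX", "AUX", "ACX", "SPX", "TTX", "WOX",
              "SEQCAT"]


def expected_files(subsection, prefix="EMBL"):
    """File names expected for one subsection, e.g. EMBLPRI.SEQ."""
    return [f"{prefix}{subsection}.{ext}" for ext in FILE_TYPES]


def missing_files(present, subsections=EMBL_SUBSECTIONS, prefix="EMBL"):
    """Expected files that do not appear in the listing `present`."""
    have = {name.upper() for name in present}
    return [name
            for sub in subsections
            for name in expected_files(sub, prefix)
            if name not in have]


# Example: only the sequence file for the primate subsection exists.
print(missing_files(["emblpri.seq"], subsections=["PRI"]))
```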


3.5. Method 5-Formatting the GenBank Database

Exactly the same programs may be used for the GenBank database as shown for EMBL. GenBank has 14 subsections. In release 77 these files are PHG, BCT, PLN, INV, VRT, PRI, ROD, MAM, SYN, RNA, VRL, UNA, EST, PAT. The one-line example for createinx is slightly different for GenBank; the user must specify GenBank format by:

$ createinx/gbpri/10/2/2/0/gbpri/930615/77.0/GenBank primate entries
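Because the one-line form has to be repeated for every subsection, the 14 command lines could be generated rather than typed. The sketch below simply fills the subsection name into the pattern of the example above. The helper name createinx_command is hypothetical, the database names other than gbpri are assumed to follow the same "gb" + subsection pattern, and the description text for each subsection is a placeholder.

```python
GENBANK_SUBSECTIONS = [
    "PHG", "BCT", "PLN", "INV", "VRT", "PRI", "ROD",
    "MAM", "SYN", "RNA", "VRL", "UNA", "EST", "PAT",
]


def createinx_command(subsection, release="77.0", date="930615"):
    """Build the one-line createinx command for one subsection."""
    name = "gb" + subsection.lower()         # assumed naming, e.g. gbpri
    description = f"GenBank {subsection} entries"  # placeholder text
    return (f"$ createinx/{name}/10/2/2/0/{name}/"
            f"{date}/{release}/{description}")


for sub in GENBANK_SUBSECTIONS:
    print(createinx_command(sub))
```

The printed lines could be collected into a DCL command file and submitted as a batch job.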

4. Notes

1. READSEQ reads in most of the formats listed in Method 1, step 2. The program is clever enough to identify the input format (unlike the GCG programs).
2. READSEQ has several optional parameters; only -v prompts for all the required input. More typically, use -o to denote the output file and -f# for the format required, e.g., to write a file in NBRF format:

$ READSEQ -a TEST.SEQ -f3 -oFORMATTED.SEQ

The -a parameter ensures that multiple sequences in a file are recognized; otherwise you have to select sequences by number.
3. The case of the sequence part of a file can be changed using the -c (lower case) or -C (upper case) option of READSEQ, for any format, e.g.:

$ readseq -c TEST.SEQ -oFORMATTED.SEQ

4. To list all the optional parameters that are available:

$ readseq -h

5. Do not use READSEQ on GCG format files with text included between ‘>’ symbols in the sequence (as created when including files in SEQED). This is valid GCG format, but READSEQ does not recognize it and the output is incorrect.
6. PIR format is practically the same as NBRF format. Both names are used interchangeably for maximum confusion, and both are valid for protein and nucleotide sequences. The title information given by the FROMPIR program could be mistaken to mean that PIR format refers to protein sequences only.
7. All formats, except GCG and plain-text, allow for multiple sequences in a single file. When converting from a multiple sequence format, the programs READSEQ and FROMPIR create several GCG-format files (note the option in READSEQ to write to MSF files; these are multiple-sequence files acceptable by GCG), with one sequence in each file.


8. The reverse case, copying several GCG sequences into a multiple-sequence format, cannot be done with READSEQ. Instead, use the GCG program REFORMAT with a file of sequence names. First enter the names of all the GCG-format sequence files into a file, say LIST.NAM. Then:

$ reformat/msf @list.nam

The output file is in MSF format, which may subsequently be converted using READSEQ.
9. Care should be taken when converting between Staden and GCG formats. The TOSTADEN program converts all lower case letters in GCG format into Staden ambiguity codes, which are numbers. Many GCG programs create sequences with lower case letters, which do not correspond to Staden ambiguity codes. Instead these are usually consensus sequences where lower case refers to a majority symbol (e.g., ‘D’ in Staden format means C or CC).
10. The GCG program FROMPIR is usually set up to convert the NBRF sequence symbol ‘-’ into ‘?’. It is the author’s opinion that ‘-’ should be converted into ‘.’. The computer manager may be able to make this change by amending source code in the GCG package. Add one line in the file gensource:frompir.for so it reads:

      If (Calls.eq.1) then
        Call SymbolSet(Symbols)
        Symbols(IChar('.')) = Char(0)
        Symbols(IChar('*')) = Char(0)
        Symbols(IChar('-')) = '.'
        Calls = -1
      end if

Information on how to compile and relink the program can be found in the GCG System Manual.
11. The NBRF-protein (10) and nucleic acid databases are supplied, on tape, with steps a-e of Method 4 already completed. The GCG indexes supplied with recent NBRF-protein releases have been incorrect, so the user has needed to repeat steps d and e to create GCG-readable files. The NBRF-protein database is in three parts, with files named PIR1.SEQ, PIR2.SEQ, and PIR3.SEQ, so steps d-e must be carried out for each of these sequence files.
12. The Swissprot database (11) is provided as a single file in EMBL format. The name SWISSPROT need only be substituted for EMBL


throughout the procedure. In SWISSPROT.HEADER, change the TYPE: N part so it reads TYPE: P

Name: SWISSPROTDIR:SWISSPROT LN: SWISS SN: SW Rel: 18.0 Reldate: 09/91 Fordate: 09/91 Type: P FORMAT: NBRF

13. Methods 4 and 5 process very large quantities of data and take a lot of computing time (CPU). It is unreasonable to run the programs interactively, and a batch method should be used. By far the longest period is taken up by the PIR program CREATEINX.
14. GCG format has a maximum sequence length of 350,000 bases. Any entries in the databases that might exceed this cannot be used correctly by the GCG package. Amendments can be made to CREATEDBS.FOR to identify all sequences that exceed a given maximum size, split them, and create additional entries in the database. An overlap of 10,000 bases, common to both entries, should be made to ensure that pattern searching programs do not miss features that span the break-point. The amendments required are too lengthy to detail here.
15. The GCG programs EMBLToGCG and GenBankToGCG can be used to reformat the EMBL and GenBank databases, in place of Methods 4 and 5. The database is then only available to the GCG software.
16. The sequence analysis software defines the format of the sequence data, not the source of that data. For example, the FETCH program in the GCG package retrieves entries from the EMBL, GenBank, or NBRF databases. The retrieved sequence is written to a file in GCG format. Similarly the COPY command, in the NAQ or XQS program, retrieves a database entry, but the file written is in NBRF format.
17. The PIR program indexer cannot build all of the indices it promises. FTX, HOX, SFX are NOT created even if the ALL option is selected. Neither is the Journal Index built, as the information in the EMBL and GenBank files is in the wrong format.
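The splitting scheme outlined in Note 14 can be illustrated with a short Python sketch. This is not the amendment to CREATEDBS.FOR, which is not reproduced in this chapter; split_with_overlap is a hypothetical function that shows only the arithmetic of cutting a long sequence into pieces of at most 350,000 bases, with each piece sharing 10,000 bases with the one before it.

```python
def split_with_overlap(sequence, max_len=350_000, overlap=10_000):
    """Cut `sequence` into pieces no longer than `max_len`.

    Consecutive pieces share `overlap` symbols, so a feature shorter
    than the overlap cannot be lost at a break-point.
    """
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")
    if len(sequence) <= max_len:
        return [sequence]
    pieces = []
    step = max_len - overlap  # how far each new piece advances
    start = 0
    while True:
        pieces.append(sequence[start:start + max_len])
        if start + max_len >= len(sequence):
            break  # this piece reached the end of the sequence
        start += step
    return pieces


# Small numbers make the behaviour easy to follow.
demo = "".join(chr(65 + i % 26) for i in range(25))
for piece in split_with_overlap(demo, max_len=10, overlap=3):
    print(piece)
```

Each piece would then need its own entry name in the database; how those names are chosen is left open here.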

References

1. Stoehr, P. J. and Cameron, G. N. (1991) The EMBL data library. Nucleic Acids Res. 19, 2227-2230.
2. Burks, C., Cassidy, M., Cinkosky, M. J., Cumella, K. E., Gilna, P., Hayden, J. E.-D., Keen, G. M., Kelley, T. A., Kelly, M., Kristofferson, D., and Ryals, J. (1991) GenBank. Nucleic Acids Res. 19, 2221-2225.
3. Devereux, J., Haeberli, P., and Smithies, O. (1984) A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res. 12, 387-395.
4. Orcutt, B. C., George, D. G., Fredrickson, J. A., and Dayhoff, M. O. (1982) Nucleic acid sequence database computer system. Nucleic Acids Res. 10, 157-174.
5. Orcutt, B. C., George, D. G., and Dayhoff, M. O. (1983) Protein and nucleic acid sequence database computer systems. Ann. Rev. Biophys. Bioeng. 12, 419-441.
6. Hunt, L. T. (1990) in Protein Identification Resource Newsletter, vol. 9, May. National Biomedical Research Foundation, Washington, DC.
7. Pearson, W. R. and Lipman, D. J. (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85, 2444-2448.
8. Staden, R. (1986) The current status and portability of our sequence handling software. Nucleic Acids Res. 14(1).
9. Gilbert, D. G. (1989) ReadSeq, C and Pascal routines for converting among nucleic acid & protein sequence file formats, suitable for various computers. Published electronically on the Internet, available via anonymous ftp to ftp.bio.indiana.edu
10. Barker, W. C., George, D. G., Hunt, L. T., and Garavelli, J. S. (1991) The PIR protein sequence database. Nucleic Acids Res. 19, 2231-2236.
11. Bairoch, A. and Boeckmann, B. (1991) The SWISS-PROT protein sequence data bank. Nucleic Acids Res. 19, 2247-2249.

CHAPTER 28

Obtaining Software via INTERNET

Gary O’Donnell

1. Introduction

Many noncommercial software packages and individual programs are available at several computer sites around the world. Some sites make these files available via “file-servers,” and some via “anonymous FTP.” There are usually information files available listing everything that can be obtained at the site. Nearly all computer sites worldwide have direct or indirect access to the INTERNET computer network, and so can communicate electronically.

Anonymous FTP requires logging on to the remote computer with the user name “anonymous.” This user name allows restricted use of the remote computer, enough to search for the required files and send them back. Anonymous FTP allows direct transfer of both text and binary files across the computer network. Method 2 describes this procedure using the anonymous FTP service provided at the University of Indiana.

To retrieve information and software from a file-server, an electronic mail message is sent to the file-server address, requesting particular files. The file-server is a program that responds by mailing the requested files back to the original sender. Program files, or large sets of files, are normally supplied in an encoded form; help information is supplied as readable text. Two file-servers are described here: the EMBL file-server (1) in Heidelberg, Germany, and the UH gene-server (2) at the University of Houston.

Several types of file encodings are used at FTP and file-server sites: “ZOO,” “UUE,” “tar,” and “shar.” A ZOO file contains several other

files compressed into just one file, with “white space” removed, and in binary code. Any binary file can be converted into a UUE file, made up of only 94 different ASCII (i.e., text) characters. With only 60 characters/line, UUE files can be sent by mail between any computer combination without introducing errors. Most software sent by file-servers is in the UUE form and once decoded is in the ZOO form, which must then be decompressed. Tar files are compressed binary files, which may be decompressed under the UNIX operating system, or with appropriate VMS utilities. Shar files are in ASCII form and may be self-unpacking, or may require a special unshar program to unpack the files. FTP sites commonly provide program files in all these formats. Each file-server or ftp site usually provides copies of the software,

which must be used to decode the files. In Method 1, described below, the UU-decoding software is retrieved at the same time as the software READSEQ.

2. Materials

1. Computers: VAX/VMS, VAX-Ultrix, and others.
2. Terminal: Any VT100 terminal is suitable since there is no graphical output.
3. Programs: C language compiler for VAX/VMS (cc) and libraries, or, for the VAX-Ultrix, an ANSI-C language compiler, e.g., (vcc), and libraries.
4. Electronic mail privileges: Method 1A requires access to INTERNET for sending and receiving e-mail. It may be necessary when mailing from the UK to send mail explicitly via the JANET address UK.AC.NSFNET-RELAY, e.g., to user%site@nsfnet-relay. VAX users in the UK should use cbs%nsfnet-relay::site::user. From the rest of Europe: to EMBL-HEIDELBERG.DE, or local variation of that address. Method 1B requires access to the US part of INTERNET, specifically the address: BCHS.UH.EDU.
5. Special equipment: For Method 2, a direct connection to INTERNET or to FTP.BIO.INDIANA.EDU. The user may need to find out from a local computer manager how this connection is made. In the UK, sites without an INTERNET connection may use the “Guest FTP service” provided by the University of London Computer NSF.SUN. The user may need to obtain special privileges at a local computer (“host site”) to connect to this service.
6. Disk space: Up to 3000 blocks (1 block is 512 bytes) of file space is required for the steps described, but less space is necessary if files not required for later steps are deleted. The final READSEQ executable file takes up to 189 blocks.
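The UUE encoding mentioned in the Introduction can be illustrated with the uuencode routines in Python's standard binascii module. This is a sketch of the general idea only: it is not the UUD program used in Method 1A, it omits the “begin” and “end” lines of a complete UUE file, and the function names are hypothetical. Each line carries at most 45 bytes of binary data, which become 60 printable characters preceded by one length character.

```python
import binascii


def uuencode_lines(data):
    """Encode bytes as uuencoded lines, 45 input bytes per line."""
    return [binascii.b2a_uu(data[i:i + 45])
            for i in range(0, len(data), 45)]


def uudecode_lines(lines):
    """Reverse uuencode_lines, giving back the original bytes."""
    return b"".join(binascii.a2b_uu(line) for line in lines)


binary = bytes(range(256))          # every possible byte value
encoded = uuencode_lines(binary)
print(encoded[0].decode("ascii").rstrip("\n"))
print(uudecode_lines(encoded) == binary)
```

Because every encoded character is ordinary printable text, the lines survive transfer through mail systems that would corrupt raw binary data.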

3. Methods

3.1. Method 1A-The EMBL File-Server

1. Create a text file called EMBL.SERV containing the following text:

GET VAX-SOFTWARE:READSEQ.UAA
GET VAX-SOFTWARE:ZOO.UAA
GET VAX-SOFTWARE:UUD.C

2. Mail the file to Netserv@embl-heidelberg.de; the exact syntax is very dependent on the mail system used at a local site. The following examples illustrate two likely alternatives, the second for mailing from UK sites via NSFNET-RELAY.

$ mail embl.serv netserv%de.embl-heidelberg
$ mail embl.serv cbs%nsfnet-relay::de.embl-heidelberg::netserv

3. The requested files arrive as seven electronic mail messages. Extract the seven mails into seven files: UUD.C, READSEQ.UAA, READSEQ.UAB, READSEQ.UAC, ZOO.UAA, ZOO.UAB, and ZOO.UAC. Take care when identifying the many different file names: a clue is given in the subject line of the mail, e.g.:

Subject: Reply to: GET VAX-SOFTWARE:READSEQ.UAA (part 2 of 3)

should be saved as READSEQ.UAB. For example, when using VMS mail, use the “extract” command:

MAIL> extract readseq.uab

4. After exiting from MAIL, edit the file UUD.C using a text editor. The mail headings that have been added to the top of the file must be removed. The very first line of the file should then be:

/* Uud -- decode a uuencoded file back to binary form.

The other files do not need to be edited, unless errors occur in the remaining steps.
5. First compile and link the UUD program:

$ cc uud
$ link uud

6. Users of VMS have to create a DCL symbol to run UUD. First determine the name of the current disk and directory:

    $ show default

The operating system replies with something like:

    Current: User$disk:[Yourname.subdirectory]

Substituting appropriately, define the VMS symbol:

    $ uud :== $User$disk:[Yourname.subdirectory]uud

7. Decode the ZOO and READSEQ files:

    $ uud ZOO.UAA
    $ uud READSEQ.UAA

The UUD program automatically picks up the .UAB and .UAC files and gives an error if any file is not present when required. On correct completion, the files READSEQ.ZOO and ZOO.EXE are present. It is good practice to place the READSEQ.ZOO file in a directory of its own from now on.
8. The decoded ZOO program is the decompression software. Again, VMS users must create a DCL symbol to run it:

    $ zoo :== $User$disk:[Yourname.subdirectory]zoo

9. Decompress the READSEQ.ZOO file:

    $ zoo -ex READSEQ

Several files are now present that comprise the source code and any documentation. The file called READSEQ.EXE is the executable program file.
10. Again, make a VMS symbol to allow the program to be executed. The program may then be run as described in Chapter 27.

    $ readseq :== $User$disk:[Yourname.subdirectory]readseq

(Steps 1-9 need to be carried out only once, after which the symbol definition in step 10 should be copied into the LOGIN.COM file.)
11. If the operating system reports errors when running READSEQ, try recompiling and relinking the source code:

    $ cc readseq, ureadseq
    $ link readseq, ureadseq, sys$library:vaxcrtl/lib

Alternatively, execute the MAKEFILE, which also carries out some data checking:

    $ @make.com
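The file-naming rule in step 3 (the message marked “part 2 of 3” of READSEQ.UAA is saved as READSEQ.UAB) can be sketched in Python. This is only an illustration: the function name and the regular expression are assumptions, and the subject-line layout is inferred from the single example shown in step 3, so real server replies may differ.

```python
import re

# Assumed layout: "... GET <library>:<NAME>.<EXT> (part N of M)", with the
# "(part N of M)" suffix present only for files mailed in several pieces.
SUBJECT_RE = re.compile(
    r"GET\s+[\w-]+:(?P<stem>[\w-]+)\.(?P<ext>\w+)"
    r"(?:\s*\(part\s+(?P<part>\d+)\s+of\s+(?P<total>\d+)\))?",
    re.IGNORECASE,
)


def filename_for_subject(subject):
    """Return the file name under which a file-server reply should be saved."""
    match = SUBJECT_RE.search(subject)
    if match is None:
        raise ValueError(f"unrecognised subject line: {subject!r}")
    stem = match["stem"].upper()
    ext = match["ext"].upper()
    part = match["part"]
    # Single-part files (e.g. UUD.C) keep the name that was requested.
    if part is None or not ext.startswith("UA"):
        return f"{stem}.{ext}"
    number = int(part)
    if not 1 <= number <= 26:
        raise ValueError("part number outside the range A-Z")
    # Part 1 -> .UAA, part 2 -> .UAB, part 3 -> .UAC, and so on.
    return f"{stem}.UA{chr(ord('A') + number - 1)}"


if __name__ == "__main__":
    print(filename_for_subject(
        "Subject: Reply to: GET VAX-SOFTWARE:READSEQ.UAA (part 2 of 3)"))
```

Checking each saved name against the subject line in this way helps ensure that UUD finds all the parts it expects in step 7.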

3.2. Method 1B: The UH Gene-Server

1. Create a completely blank file called, e.g., UH.SERV.
2. Mail the file to the following address (but check the mail syntax at the host site):

    gene-server@bchs.uh.edu
    or
    gene-server%bchs.uh.edu@cunyvm

The commands to the server must be contained in the subject line of the mail. For example, to request the READSEQ software when mailing from the UK:

    $ mail cbs%nsfnet-relay::edu.uh.bchs::gene-server/subject="SEND VAX READSEQ.UUE"
    $ mail cbs%nsfnet-relay::edu.uh.bchs::gene-server/subject="SEND VAX ZOO.UUE"
    $ mail cbs%nsfnet-relay::edu.uh.bchs::gene-server/subject="SEND VAX UUD.C"

3. The requested files arrive as three electronic mail messages. Extract the files as READSEQ.UAA, ZOO.UAA, and UUD.C and continue from step 4 of Method 1A.

3.3. Method 2: Retrieving Files by Anonymous FTP

1. First log on to the computer that has the IP connection to INTERNET. This is the INTERNET-HOST computer. Many users will find this facility readily available on the local computer. For sites in the UK without Internet connectivity, the University of London provides an open account called guestftp. From a JANET-HOST, log on to this computer:

    $ pad call nsf.sun
    login: guestftp
    password: guestftp
    Enter your reference for this session: Cary
    guest-ftp> dir
    total blocks: 0

The reference name creates some temporary storage space. It allows users to log on at a later time to locate any files.

2. Once on an INTERNET-HOST computer (which may be the user’s own host computer), run the FTP program, which will make a connection to any other INTERNET site:

    guest-ftp> ftp    (or, possibly, $ ftp)

When the ftp prompt appears, open the connection to the remote site:

    ftp> open ftp.bio.indiana.edu
    connected to ftp.bio.indiana.edu

3. The remote computer then requests the user to log on. Log on as “anonymous,” but give your own electronic mail address as the password, in INTERNET style, for the providers’ records.

    Name (ftp.bio.indiana.edu:guestftp): anonymous
    331 guest login OK, send e-mail address as password
    password: [email protected]    (the user will not see this on the terminal)

4. Now look at the directories available and try to locate the software required.

    ftp> dir

Change directory to the molbio directory.

    ftp> cd molbio
    ftp> dir

Each set of programs has its own subdirectory. The example here is to obtain the READSEQ software:

    ftp> cd readseq
    ftp> dir

Look for a file with a title that suggests it is readable, such as a README or FAQ file. A FAQ file usually gives answers to “Frequently Asked Questions.” This file should be retrieved first and read on the home computer. It will inform the user which files are needed. The user should take note of which files are text (ASCII) files and which are binary files. If the file required is a binary file, as in this case, the following command must be given:

    ftp> binary

5. Send the file across to your host computer.

    ftp> get readseq.shar

6. Log off the remote computer, and then exit the ftp program.

    ftp> close
    ftp> bye

7. Now all the files are present on the INTERNET host computer (which may be the user’s own host site). If using NSF.SUN, the user will have an additional step: sending the files back from NSF.SUN to the JANET host site.

    guest-ftp> dir
    guest-ftp> push
    Okay lets push a file using NIFTP
    Give local filename: readseq.shar
    Give remote filename: readseq.shar
    Give NRS name of remote host: afmarcb
    Do you want binary or <default> ascii (input b or a): b
    Give user name on remote host: odonnell
    Give user password on remote host:
    Re-type password to make sure:

Finally, log off NSF.SUN:

    guest-ftp> lo

8. Unpack the READSEQ.SHAR file to produce the same files as were obtained at the end of step 9 of Method 1A. At the time of writing, an unpacking program for VMS was not available at ftp.bio.indiana.edu, but the files VMS.UNSHAR.1, VMS.UNSHAR.2, and VMS.UNSHAR.3 were available on the UH Gene-Server. After unpacking, carry out step 11, and then step 10, of Method 1A.
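For readers who script their transfers, the interactive session in steps 2-6 of Method 2 can be approximated with Python's standard `ftplib` module. This is a sketch under stated assumptions rather than a tested recipe for the sites named here: the host, directory, and file names are simply those quoted in the text and may no longer exist, and the helper names `fetch_binary` and `anonymous_download` are invented for illustration.

```python
from ftplib import FTP


def fetch_binary(ftp, directory, filename, local_path):
    """Change to `directory` and retrieve `filename` in binary mode.

    `ftp` is any object offering ftplib.FTP's cwd() and retrbinary() methods.
    """
    ftp.cwd(directory)                      # like "ftp> cd molbio/readseq"
    with open(local_path, "wb") as out:
        # retrbinary() transfers in binary (image) mode, like "ftp> binary"
        # followed by "ftp> get <filename>".
        ftp.retrbinary(f"RETR {filename}", out.write)


def anonymous_download(host, email, directory, filename, local_path):
    """Log on as "anonymous", giving an e-mail address as the password."""
    with FTP(host) as ftp:                  # like "ftp> open <host>"
        ftp.login(user="anonymous", passwd=email)
        fetch_binary(ftp, directory, filename, local_path)
    # Leaving the "with" block closes the connection ("ftp> close", "ftp> bye").


# Hypothetical usage, with the names quoted in Method 2:
# anonymous_download("ftp.bio.indiana.edu", "user@site", "molbio/readseq",
#                    "readseq.shar", "readseq.shar")
```

Keeping the transfer in its own function means the retrieval logic can be checked against a stand-in object without a network connection.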

4. Notes

1. To obtain information files, mail the following text to the EMBL file server:

    HELP
    DIR
    HELP VAX-SOFTWARE
    DIR VAX-SOFTWARE

2. The equivalent for the UH file server is to put the following in the SUBJECT LINE of the mail:

    SEND HELP
    SEND INDEX
    SEND VMS HELP
    SEND VMS INDEX

3. Computer managers may restrict the use of mail and ftp from the host computer. For any failure whereby the mail does not appear to be sent, or the INTERNET connection cannot be made, contact the local computer manager.
4. The user may get the following error message when attempting to connect to the remote site:

    ftp> open ftp.bio.indiana.edu
    failed to get host information for ftp.bio.indiana.edu from database

This means that ftp.bio.indiana.edu is not in the site’s “telephone book” (its table of known host names). In this case, the user will have to make the connection directly using the INTERNET number of the site (warning: these numbers can change):

    ftp> open 129.79.224.25

5. Many program files are supplied at ftp sites in binary form. When retrieving such files, you must set ftp to work in binary:

    ftp> binary

To switch back for text transfer:

    ftp> ascii

6. The University of Houston is also reachable by anonymous ftp. The address is ftp.bchs.uh.edu, and the direct number is 129.7.2.43. Instead of steps 4 and 5 of Method 2, do the following to collect the correct files:

    ftp> cd pub/gene-server/vms
    ftp> dir
    ftp> get uudecode.c
    ftp> get zoo.uue
    ftp> get readseq.uue

After retrieving them, continue at step 4 of Method 1A.
7. EMBL provides an FTP service on ftp.embl-heidelberg.de.
8. NSFNET-RELAY is a “gateway” between the JANET network and INTERNET, allowing mail and files to be transferred. Other gateways from JANET include:

    UKNET (to UUCP network)
    EARN-RELAY (to EARN/BITNET)

EARN/BITNET and INTERNET also have gateways linking them. Mail from JANET to INTERNET sites sent through uk.ac.earn-relay will be forwarded to INTERNET. Some computer sites may not be able to send mail via one or more of these gateways. Some mail servers on INTERNET may attempt to send mail through a gateway that the user’s site does not subscribe to. In that case, the user will receive nothing from the file server!
9. On the EMBL file server, an acknowledgment mail is usually received first. Check this to see that the request has been understood.
10. Large files tend to be mailed back at off-peak periods. Sometimes the file-servers, or intermediate gateways, get very busy, and a wait of a day or more is not unusual.
11. Anonymous FTP sites usually request that they be used only at off-peak times when requesting large files, i.e., avoid the 8 am-6 pm period, taking account of local time at the remote site. For the same reasons, transfers across JANET from NSF.SUN can take a long time.
12. Some very large pieces of software are still large even when ZOO-ed. The UU-encoded files are then split into several files to enable them to be sent across the networks. The information listed on the EMBL file-server indicates which software arrives in several parts. Such files are usually listed with the extensions .UAA, .UAB, .UAC, and so forth, instead of .UUE. All the files should be present before starting the UUDECODE process.
13. On a UNIX system, the READSEQ software is compiled as follows:

    % vcc readseq.c ureadseq.c

An alias definition for READSEQ is not usually required on UNIX.
14. The READSEQ.C source code can also be compiled on the Apple Macintosh, according to its documentation. The READSEQ.C file contains information that provides a script to do this.
15. When preparing the above examples, it was noticed that the version of READSEQ available at ftp.bio.indiana.edu was more recent than that available at the file servers. The UH gene server also had the more recent three-file uu-encoded version of READSEQ available (the single-file version in the example was an older version still). This is not unexpected, as the ftp.bio.indiana.edu site is the source site used by the file-servers, so there may be a delay before a new version is available at the file-servers.
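The splitting of a UU-encoded file into parts described in Note 12, and why all parts must be present before decoding, can be illustrated with Python's standard `binascii` module. This is a simplified sketch and not the UUD program itself: the function names are invented for illustration, and the "begin"/"end" header lines and mail headers that real UU-encoded files carry are deliberately left out.

```python
import binascii


def uu_encode_lines(data):
    """UU-encode bytes into text lines of at most 45 input bytes each."""
    lines = []
    for start in range(0, len(data), 45):
        encoded = binascii.b2a_uu(data[start:start + 45])
        lines.append(encoded.decode("ascii").rstrip("\n"))
    return lines


def split_parts(lines, lines_per_part):
    """Split encoded lines into consecutive parts (cf. .UAA, .UAB, .UAC)."""
    return [lines[i:i + lines_per_part]
            for i in range(0, len(lines), lines_per_part)]


def uu_decode_parts(parts):
    """Decode the parts, in order, back to the original bytes."""
    decoded = bytearray()
    for part in parts:
        for line in part:
            if line.strip():                # ignore blank lines between parts
                decoded += binascii.a2b_uu(line)
    return bytes(decoded)


if __name__ == "__main__":
    original = bytes(range(256)) * 3        # stand-in for a compressed archive
    parts = split_parts(uu_encode_lines(original), 6)
    print(len(parts), "parts; round trip OK:",
          uu_decode_parts(parts) == original)
```

Decoding with a part missing or out of order yields different bytes, which is why the note insists that every part be present first.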

Acknowledgments

I am very grateful to my former colleague Philip O’Connor for his advice on using ftp. Thanks to Frank Wright for reading an early draft of the manuscript.

References

1. Fuchs, R., Stoehr, P., and Rice, P. (1990) New services of the EMBL data library. Nucleic Acids Res. 18, 4319-4323.
2. Davison, D. B. and Chappelaar, J. E. (1990) The GenBank Server at the University of Houston. Nucleic Acids Res. 18, 1571,1572.

CHAPTER 29

Submission of Nucleotide Sequence Data to EMBL/GenBank/DDBJ

Catherine M. Rice and Graham N. Cameron

From: Methods in Molecular Biology, Vol. 24: Computer Analysis of Sequence Data, Part I. Edited by: A. M. Griffin and H. G. Griffin. Copyright ©1994 Humana Press Inc., Totowa, NJ

1. Introduction

The EMBL Data Library (1) was founded in 1980 as a direct consequence of the amount of sequence data appearing in the journals. Over the past 11 years, the growth in data acquisition has been exponential. With the latest developments in genome projects, we foresee no let up in the amount of data the databases will receive in the next few years. We do envisage, however, that a larger proportion will not be accompanied by detailed biological knowledge. In 1982, a direct collaboration was established between GenBank (2) (in Los Alamos) and the EMBL Data Library to facilitate coverage of all primary nucleotide sequence data. The DNA Data Bank of Japan (DDBJ) joined more recently. The three databases are equivalent, and published data are exchanged daily. Data are incorporated either by computer-readable submissions from authors or (much more rarely) by entering the published sequences by hand. Manual data entry is error-prone for a number of reasons, including the legibility of the original sequence-containing article. Author submission is more accurate and results in faster incorporation of the data.

2. Submission Methods

Many journals have mandatory submission policies with member databases. In these instances, acceptance of sequence-containing papers for publication requires proof of submission. Sequence data submitted to EMBL/GenBank/DDBJ receive unique identifiers in the form of an accession number that provides such proof. The number identifies an individual sequence, so many accession numbers could be cited in any given publication. Details for each journal are given in their individual “notes to submitting authors.” The databases are, of course, happy to receive all sequence data, including sequences that will never appear in any publication. Relevant addresses for all three collaborating databases are given in Appendix 1, but specifics are given only for EMBL.

Much of the communication in the databases is done by electronic mail. Not only do the databases often use this to communicate among themselves, but it is also a very quick and efficient way to contact submitters. The EMBL Data Library receives approx. 60% of total submissions by electronic mail. These come either as sequences appended to a copy of the submission form or as Authorin output. Any queries resulting from a submission can be readily answered. Subsequent notification of newly assigned accession numbers is preferably given this way. Apart from the submission address, we also have an address for general queries (see Appendix 1). Our Internet address can be reached via various gateways, including Bitnet, Usenet, and JANET. Advice on how to contact us can be obtained either from a local network expert or by contacting the Data Library itself.

A part of our direct submission data comes as Authorin output sent by electronic mail. The Macintosh version of Authorin (currently version 2.1) is available from the fileserver. To access the program from EMBL, send electronic mail to the EMBL fileserver (3) with the following commands:

    HELP software
    GET Mac-software:authorin.hqx

The IBM-PC version of Authorin (currently version 1.1, but 1.2 is due soon) is available from Intelligenetics at no extra cost (see Appendix 1). Full instructions for use are provided with the software.

A large proportion of our submissions arrive by electronic mail as filled-out submission forms with appended sequence data. A computer-readable copy of the submission form is available and is included in each release (for all the collaborating databases). This computer-readable form can also be accessed as follows: Send a mail message to the EMBL fileserver (3) (Appendix 2), and include the following command either in the subject line or in the body of the message:

    GET DOC:datasub.txt

One submission form should be filled out for each nucleotide and protein sequence (where applicable). The EMBL Data Library also provides software for VAX/VMS users, which simplifies the process of filling out and mailing the submission form. To retrieve that instead, include the commands:

    HELP software
    GET Vax-software:subform.uaa

on separate lines in the body of the message.

Printed copies of the submission form are available in the first issue each year of Nucleic Acids Research (see Fig. 1). They are also available on request from the EMBL Data Library.

Submissions also arrive on diskette by mail, usually accompanied by a paper copy of the submission form. We rarely contact the author by post, except in the absence of any electronic mail or fax address. The EMBL Data Library supports Macintosh or IBM-PC compatible (3.5 or 5.25 in.) diskettes. When sending a submission by Macintosh or IBM-PC diskette, one can either use the relevant Authorin program (preferred) or simply send the sequence in text format with an accompanying hard copy or machine-readable submission form. The large variety of word-processing applications now available for Macintosh and IBM-PC machines makes it difficult for us to guarantee readability for all given formats. For this reason, we request that all sequence data be saved as simple text on the diskette. The data are then easier to handle, and the submitter benefits by quicker receipt of accession numbers.

We do not accept sequence data as such by fax, but in the absence of an electronic mail address, we use the fax address for any further communications with the author. This has the benefit of speed for cases when the journal publication deadline is at hand and accession numbers are urgently required. (See Appendix 1 for Data Library addresses.)

Sequence Data Submission Form

This form solicits the information needed for a nucleotide or amino acid sequence database entry. By completing and returning it to us promptly you help us to enter your data in the database accurately and rapidly. These data will be shared among the following databases: DNA Data Bank of Japan (DDBJ; Mishima, Japan); EMBL Nucleotide Sequence Database (Heidelberg, FRG); GenBank (Los Alamos, NM, USA and Mountain View, CA, USA); International Protein Information Database in Japan (JIPID; Noda, Japan); Martinsried Institute for Protein Sequence Data (MIPS; Martinsried, FRG); National Biomedical Research Foundation Protein Identification Resource (NBRF-PIR; Washington, D.C., USA); and SWISS-PROT Protein Sequence Database (Geneva, Switzerland and Heidelberg, FRG).

Please answer all questions which apply to your data. If you submit 2 or more non-contiguous sequences, copy and fill out this form for each additional sequence. Please include in your submission any additional sequence data which is not reported in your manuscript but which has been reliably determined (for example, introns or flanking sequences). When submitting nucleic acid sequences containing protein coding regions, also include a translation (SEPARATELY from the nucleic acid sequence). Then send (1) this form, (2) a copy of your manuscript (if available) and (3) your sequence data (in machine readable form) to the address shown below. Information about the various ways you can send us your data and about formats for the sequence data is given in the following two sections. Thank you.

SUBMITTING DATA TO THE EMBL DATA LIBRARY

We are happy to accept data submitted in any of the following ways:
(1) Electronic file transfer: files can be sent via computer network to DATASUBS@EMBL-Heidelberg.DE. This INTERNET address can be reached via various gateways from Bitnet, JANET, etc. Ask your local network expert for help or phone us. Please ensure that each line in your file is not longer than 80 characters; longer lines often get truncated when they are sent.
(2) Floppy disks: we can read Macintosh and IBM-compatible diskettes. Please use the ‘save as text only’ feature of your editor to save your sequence file, as otherwise we might have difficulty processing it.
(3) Magnetic tapes: 9-track only (fixed-length records preferred); 800, 1600 or 6250 bpi (any blocksize); ASCII or EBCDIC character codes; any label type or unlabelled.

Our address is:
    EMBL Data Library Submissions
    Postfach 10.2209
    D-6900 Heidelberg
    Federal Republic of Germany

    Computer network: DATASUBS@EMBL-Heidelberg.DE
    Telefax: (+49) 6221 387 519
    Telephone: (+49) 6221 387 258

When we receive your data we will assign them an accession number, which serves as a reference that permanently identifies them in the database. We will inform you what accession number your data have been given and we recommend that you cite this number when referring to these data in publications. If your manuscript has already been accepted for publication, the accession number can be included at the galley proof stage as a note added in proof. So that we can process your data and inform you of your accession number before you receive the galley proofs, please return this form to us as soon as possible. We suggest that the note added in proof should read approximately as follows: “The nucleotide sequence data reported in this paper will appear in the EMBL, GenBank and DDBJ Nucleotide Sequence Databases under the accession number ____.”

A computer-readable version of this form is available on the distribution tapes of the EMBL Data Library from Release 11 onwards and on GenBank Releases 48 onwards and via the EMBL and GenBank File Servers. Feel free to use the computer-readable form rather than this printed one. In this case, the form should be filled out with a text editor and sent via computer network or normal post to the address indicated above.

FORMATS FOR SUBMITTED DATA

We would appreciate receiving the sequence data in a form which conforms as closely as possible to the following standards:
Each sequence should include the names of the authors.
Each distinct sequence should be listed separately using the same number of bases/residues per line. The length of each sequence in bases/residues should be clearly indicated.
Enumeration should begin with a “1” and continue in the direction 5’ to 3’ (or amino- to carboxy-terminus).
Amino acid sequences should be listed using the one-letter code.
Translations of protein coding regions in nucleotide sequences should be submitted in a separate computer file from the nucleotide sequences themselves.
The code for representing the sequence characters should conform to the IUPAC-IUB standards, which are described in Nucl. Acids Res. 13: 3021-3030 (1985) (for nucleic acids) and J. Biol. Chem. 243: 3557-3559 (1968) and Eur. J. Biochem. 5: 151-153 (1968) (for amino acids).

Fig. 1. A sample nucleotide sequence database submission form.
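The formatting requests in the form (the same number of residues on every line, numbering that starts at 1, and no line longer than 80 characters) can be sketched in Python. The particular layout chosen here, 60 residues per line in groups of 10 with the position of the last residue at the end of each line, is an assumption made for illustration; the form does not prescribe it, and the function name is hypothetical.

```python
def format_sequence(seq, per_line=60, group=10):
    """Lay out a sequence with a fixed number of residues per line.

    Each line ends with the 1-based position of its last residue, so the
    numbering starts at 1 and runs 5' to 3' (or amino- to carboxy-terminus).
    """
    if per_line % group != 0:
        raise ValueError("per_line must be a multiple of group")
    residues = "".join(seq.split())          # drop existing spaces and newlines
    text_width = per_line + per_line // group - 1
    lines = []
    for start in range(0, len(residues), per_line):
        chunk = residues[start:start + per_line]
        grouped = " ".join(chunk[i:i + group]
                           for i in range(0, len(chunk), group))
        last_position = start + len(chunk)
        line = f"{grouped:<{text_width}} {last_position:>8}"
        if len(line) > 80:                   # the form's 80-character limit
            raise ValueError("formatted line exceeds 80 characters")
        lines.append(line)
    return lines


if __name__ == "__main__":
    for line in format_sequence("acgt" * 40):
        print(line)
```

Saving the resulting lines as plain text also satisfies the request that sequence files be stored as “text only.”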

[Fig. 1, continued. Form sections I and II; the checkbox layout is not recoverable from the scan, so only the legible field labels are listed.]

Please fill out with a typewriter or write legibly.

I. GENERAL INFORMATION

Fields: first name; middle initials; institution; computer mail address; telex number; telefax number.
On what medium and in what format are you sending us your sequence data? (see instructions on front page)
[ ] electronic mail
[ ] diskette (computer, operating system, file name)
[ ] magnetic tape (specify format)

II. CITATION INFORMATION

These data are: [ ] published [ ] in press [ ] submitted [ ] in preparation [ ] no plans to publish
These data represent: [ ] new submission [ ] correction (accession number of affected sequence)
Citation fields: authors; title of paper; journal; volume; first-last pages; year.
Do you agree that these data can be made available in the database before they appear in print? [ ] yes [ ] no, they can be made available after (please fill in date)
Does the sequence which you are sending with this form include data that do not appear in the above citation? [ ] no [ ] yes, from position ... [ ] bases OR [ ] amino acid residues (If your sequence contains 2 or more such spans, use the feature table in section IV to indicate their positions.)
If so, how should these data be cited in the database? [ ] published [ ] in press [ ] submitted [ ] in preparation [ ] no plans to publish
Citation fields: authors; address (if different from that given in section I); title of paper; journal; volume; first-last pages; year.
List references to papers and/or database entries which report sequences overlapping with that submitted here: first author; journal; vol., pages, year; and/or database, accession numbers.

[Fig. 1, continued. Form section III; the checkbox layout is not recoverable from the scan, so only the legible field labels are listed.]

III. DESCRIPTION OF SEQUENCED SEGMENT

Wherever possible, please use standard nomenclature or conventions. If a question is not applicable to your sequence, answer by writing N.A. If the information is relevant but not available, write a question mark (?).

What kind of molecule did you sequence? (check all boxes which apply) The legible options include genomic DNA, genomic RNA, organelle DNA, organelle RNA, cDNA to mRNA, cDNA to genomic RNA, virus, viroid, other nucleic acid (please specify), and peptide. For peptides, the form asks whether the sequence was assembled by overlap of sequenced fragments or by homology with related sequence, and whether it is a partial (N-terminal, C-terminal, or internal) fragment.

length of sequence: [ ] bases [ ] amino acid residues
gene name(s) (e.g., lacZ)
gene product name(s) (e.g., beta-D-galactosidase)
Enzyme Commission number (e.g., EC 3.2.1.23)
gene product subunit structure (e.g., hemoglobin α2β2)

The following items refer to the original source of the molecule you have sequenced: organism name (species) (e.g., Mus musculus); strain or plant cultivar (e.g., K12, BALB/c); name/number of individual or isolate (e.g., patient 123, influenza virus A/PR/8/34); developmental stage; haplotype; tissue type; cell type; allele; variant; [ ] germ line; [ ] macronuclear.

The following items refer to the immediate experimental source of the submitted sequence: name of cell line (e.g., HeLa, 3T3-L1); library; clone(s); subclone(s).

The following items refer to the position of the submitted sequence in the genome: chromosome (or segment) name/number; map position; units: [ ] genome % or [ ] nucleotide number or [ ] other.

Using single words or short phrases, describe the properties of the sequence in terms of: its associated phenotype(s); the biological/enzymatic activity of its product; the general functional classification of the gene and/or gene product; macromolecules to which the gene product can bind (e.g., DNA, calcium, other proteins); subcellular localisation of the gene product; any other relevant information. Example (for viral erbB nucleotide sequence): transforming; EGF receptor-related; tyrosine kinase; oncogene; transmembrane protein.

[Fig. 1, continued. Form section IV; the blank feature table itself is not reproduced.]

IV. FEATURES OF THE SEQUENCE

Please list below the types and locations of all significant features experimentally identified within the sequence. Be sure that your sequence is numbered beginning with “1.” Use < or > if a feature extends beyond the beginning or end of the indicated sequence span.

The columns of the feature table are filled in as follows:
feature: fill in type of feature (see information below)
from: number of first base/amino acid in the feature
to: number of last base/amino acid in the feature
bp: x, if your numbers refer to positions of bases in a nucleotide sequence
aa: x, if your numbers refer to positions of amino acid residues in a peptide sequence
identification method: method by which the feature was identified. E = experimentally; S = by similarity with known sequence or to an established consensus sequence; P = by similarity to some other pattern, such as an open reading frame
complementary strand: x, if feature is located on the nucleic acid strand complementary to that reported here

Significant features include:
regulatory signals (e.g., promoters, attenuators, enhancers)
transcribed regions (e.g., mRNA, rRNA, tRNA) (indicate reading frame if start and stop codons are not present)
regions subject to post-transcriptional modification (e.g., introns, modified bases)
translated regions
extent of signal peptide, prepropeptide, mature peptide
regions subject to post-translational modification (e.g., glycosylated or phosphorylated sites)
other domains/sites of interest (e.g., extracellular domain, DNA-binding domain, active site, inhibitory site)
sites involved in bonding (disulfide, thiolester, intrachain, interchain)
regions of protein secondary structure (e.g., alpha helix or beta sheet)
conflicts with sequence data reported by other authors
variations and polymorphisms

The first 2 lines of the table are filled in with examples.
Numbering for features on the sequence submitted here [ ] matches paper [ ] does not match paper

Occasionally, we receive queries concerning submissions or services by phone. Any such communications are difficult for us to record accurately and are therefore not recommended except in unusual circumstances.

3. Processing Submissions

The EMBL Data Library processes data submissions within seven working days of receipt, and then sends the submitting author either accession numbers or, if there is a problem, a request for further information. Submitters can help us a great deal in the following ways:

• Completeness: do give all the information possible;
• Check accuracy and explain any apparent inconsistency; and
• Give us a fast way to contact the submitting author: electronic mail, fax, or telex.

Once an entry has been produced from submission information, a copy will be sent to the submitting author to review or update, if necessary. This entry may or may not be accessible to the public (see Fig. 2). Entries are released to the user community initially by fileserver, where they will be available on the day they are made public. For those who do not receive daily updates through their EMBnet node, the entries will appear at the next quarterly release on tape and CD-ROM. Figure 2 is a sample finished entry. This flat file format is exactly how it appears to any user.

Actual release of an entry to the public is either at the point of completion or, when specifically requested, after publication of the relevant sequence-bearing article. There are inherent problems associated with matching citations with data being held until publication. These can be alleviated by the author citing all relevant accession numbers in the publications and by informing the Data Library of publication.

[Fig. 2. A sample entry. The figure shows the EMBL flat-file entry ECMALZ (accession number X59839; DNA, 2345 BP), the E. coli malZ gene for alpha-glucosidase, with its ID, AC, DT, DE, KW, OS, OC, reference (RN, RP, RA, RT, RL), CC, feature table (FH, FT), and SQ lines. The line-by-line layout is not recoverable from the scan.]

Appendix 1: Nucleotide Sequence Databases’ Addresses

1. EMBL Data Library:
a. Internet electronic mail addresses: Data submissions: datasubs@embl-heidelberg.de; General enquiries: [email protected]
b. Postal address: Data Submissions, Postfach 10.2209, Meyerhofstrasse 1, W-6900 Heidelberg, Federal Republic of Germany
c. Telephone: +49 6221 387 258
d. Telefax: +49 6221 387 519
e. Telex: 461613 (embl d)

2. GenBank:
a. Internet electronic mail address: [email protected]
b. Postal address: GenBank Submissions, Mail Stop K710, Los Alamos National Laboratory, Los Alamos, NM 87545
c. Telephone: +1 505 665 2177
d. Telefax: +1 505 665 3493

3. DNA Data Bank of Japan:
a. Internet electronic mail addresses: Data submissions: [email protected]; General enquiries: ddbj@ddbj.nig.ac.jp
b. Postal address: Laboratory of Genetic Information Analysis, Center for Genetic Information Research, National Institute of Genetics, Mishima, Shizuoka 411, Japan
c. Telephone: +81 559 75 0771 x647
d. Telefax: +81 559 75 6040

4. Intelligenetics:
a. Internet electronic mail addresses: Authorin software: [email protected] or ftp from GenBank.bio.net
b. Postal address: Intelligenetics Inc., 700 East El Camino Rd, Mountain View, CA 94040
c. Telephone: +1 415 962 7364 or 800 477 2459 within the US
d. Telefax: +1 415 962 7302

Appendix

2. EMBL

Services

1. Related databases crossreferenced by EMBL: Drosophila genetic map database; FLYBASE (4)-available from EMBL E. coli database; ECD (5)--available fom EMBL EC nomenclature database; ENZYME (6)-available from EMBL Eukaryotic promoter database; EPD (7)-available from EMBL Genome database; GDB (8) Online Mendelian inheritance in man; OMIM (9) Protein pattern database PROSITE (IO)-available from EMBL Protein sequence database; SWISSPROT (II&available from EMBL 3-D protein structure database; PDB (Z2)-available from EMBL Restriction enzyme database; REBASE (13)-available from EMBL Transcription factor database; TFD (Il)-available from EMBL 2. Fileserver (3): To accessthe EMBL fileserver, send a standard electronic mail to the address [email protected].

The most important command is “HELP”, sent either on the subject line or in the body of the text.
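As a rough illustration of such a request, the following Python sketch builds a mail message carrying the HELP command either on the subject line or in the body. It is only a sketch under stated assumptions: the function name build_help_request and the recipient fileserver@example.org are hypothetical placeholders (the actual fileserver address is not legible in this text), and the message is built but not sent.

```python
from email.message import EmailMessage

# Placeholder only: the real fileserver address is not legible in this text.
FILESERVER_ADDRESS = "fileserver@example.org"


def build_help_request(sender: str,
                       recipient: str = FILESERVER_ADDRESS,
                       in_subject: bool = True) -> EmailMessage:
    """Build a plain-text mail that carries the HELP command.

    The command goes on the subject line when in_subject is True,
    otherwise in the body of the message.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    if in_subject:
        msg["Subject"] = "HELP"
        msg.set_content("\n")      # body intentionally left blank
    else:
        msg.set_content("HELP\n")  # command sent in the body instead
    return msg


if __name__ == "__main__":
    # Print the raw message; sending it would need a mail transport,
    # which is outside the scope of this sketch.
    print(build_help_request("user@example.org"))
```

The message object can then be handed to whatever mail transport is available locally; the reply from the fileserver would be expected to list the other commands it accepts.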

References

1. Stoehr, P. S. and Cameron, G. (1991) The EMBL data library. Nucleic Acids Res. 19, 2221-2230.
2. Burks, C., Cassidy, M., Cinkosky, M. J., Cumella, K. E., Gilna, P., Hayden, J. E.-D., Keen, G. M., Kelley, T. A., Kelly, M., Kristofferson, D., and Ryals, J. (1991) GenBank. Nucleic Acids Res. 19, 2221-2225.
3. Stoehr, P. and Omond, R. (1989) The EMBL network fileserver. Nucleic Acids Res. 17, 6763-6764.
4. Ashburner, M. (1990) University of Cambridge, Cambridge.


5. Kroeger, M., Wahl, R., and Rice, P. (1991) Compilation of DNA sequences of Escherichia coli (update 1991). Nucleic Acids Res. 19, 2023-2043.
6. Bairoch, A. (1990) University of Geneva, Geneva.
7. Bucher, P. and Trifonov, E. N. (1986) Compilation and analysis of eukaryotic POL II promoter sequences. Nucleic Acids Res. 14, 10,009-10,026.
8. Pearson, P. L. (1991) The genome database (GDB): a human gene mapping repository. Nucleic Acids Res. 19, 2237-2239.
9. McKusick, V. (1990) Mendelian Inheritance in Man. Johns Hopkins University Press, Baltimore, MD.
10. Bairoch, A. (1991) PROSITE: a dictionary of sites and patterns in proteins. Nucleic Acids Res. 19, 2241-2245.
11. Bairoch, A. and Boeckmann, B. (1991) The SWISS-PROT protein sequence data bank. Nucleic Acids Res. 19, 2247-2249.
12. Bernstein, F. C., Koetzle, T. F., Williams, G. D. B., Meyer, E. F., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T., and Tasumi, M. (1977) The protein data bank: A computer-based archive file for macromolecular structures. J. Mol. Biol. 112, 535-542.
13. Roberts, R. J. (1985) Restriction and modification enzymes and their recognition sequences. Nucleic Acids Res. 13, r165-r200.
14. Ghosh, D. (1991) New developments of a transcription factors database. TIBS 16, 445-447.
