1
GeneJockeyII: Entering and Editing Sequences
Phil Taylor

From: Methods in Molecular Biology, Vol. 70: Sequence Data Analysis Guidebook. Edited by S. R. Swindell. Humana Press Inc., Totowa, NJ

1. Introduction
Entering sequence by hand is a tedious and error-prone process. In general, if the sequence that you need is available in any electronic form, you should be able to import it into GeneJockey without having to retype the data. For example, most sequences published in research papers are accompanied by a GenBank/EMBL accession number, which allows you to retrieve the sequence from the GenBank CD-ROM or from a remote networked database. If, however, you have no option but to type the required sequence (for example, if you are reading sequence by hand from a manual sequencing gel), GeneJockey provides powerful facilities to do so, and to check the accuracy of the entered data.
Sequence data in GeneJockey is simple text, displayed in capitals, and behaves just as text does in any word processor. All the standard editing commands act in the way in which you expect them to act, and you may use fonts, styles, and colors to draw attention to parts of your sequence, just as you would when editing ordinary text.

2. Materials
1. Hardware: GeneJockey requires a Macintosh with Color QuickDraw in ROM (this excludes the Macintosh Plus [and older machines], the SE, the PowerBook 100, and the Macintosh Portable). The program also requires System 7.0 or later, and at least 2 Mb of available memory. A color display capable of showing 256 colors is helpful but not essential.
2. Software: For the operations described in this chapter, you need only the GeneJockey program itself. For operations described in later chapters, you will need some additional files supplied with the program. You would normally install
Fig. 1. Anatomy of a GeneJockey sequence window. [Figure not reproduced. Recoverable callouts: comments area (put any relevant text, features, or data here); comments scrollbar; button used to set the numbering of the first nucleotide.]
GeneJockey on your hard drive by simply copying all the files supplied into a single folder. When running on a Power Macintosh, the GeneJockey Helper file should be present in the same folder. The native-code resources in this file run about ten times faster than the code in the main program, and since multiple alignment is a time-consuming process, the extra speed is very helpful. GeneJockey is licensed for use only on a single-user basis, but is not copy-protected.
3. Methods
3.1. Sequence Entry
1. Start up the program by double-clicking on the GeneJockey icon. The program offers three kinds of windows in which you may enter and edit text. For this reason, the New command in the File menu is hierarchical, offering you the choice of a new nucleotide sequence window, peptide sequence window, or a plain text window. We will start by opening a nucleotide sequence window and entering a DNA sequence (see Note 1). Fig. 1 shows a nucleotide sequence window.
2. Use the New > Nucleotide sequence command to open the window. Note that the window title is Untitled 1. As is usual with Macintosh programs, the window will not be given a title until you save it to disk (see Note 1).
3. Use the Save as... command from the File menu to save your new window before you start typing.
4. Give the file a suitable name for the sequence you are going to enter.
5. When the file is saved, click on the empty sequence box to place the insertion point at the top left of the box.
6. Start typing your DNA sequence. Note that the program converts text that you type into this box to uppercase (see Notes 2,3).
7. Next, select Speak on Entry from the Edit menu. Continue typing. Each time you hit a key, the machine will speak the corresponding letter. This is very helpful if you are not a touch typist, because it means that you do not have to look at the screen to check what you type. You can turn this facility off again using the same command.
8. Select Tidy up to format the sequence into blocks of 10 nucleotides (see Note 4).
9. Once you have typed in a few lines of sequence, use the Save command to update the disk file. It is always a good idea to save sequences frequently when typing, in case of accidents. You should make sure your sequence is saved before carrying out the operations in the next paragraph.
10. Type a few more bases and look at the Revert to Original and Undo commands (see Notes 5,6).
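The entry rules in steps 6 and 8 (uppercase conversion, legal characters, blocks of 10) can be sketched in a few lines of Python. This is only an illustration, not GeneJockey's own code: the function names, the `blocks_per_line` parameter, and the use of the standard IUPAC degenerate codes as the set of permitted symbols are all assumptions made for the example.

```python
# Assumption: the permitted symbols are A, C, G, T plus the standard IUPAC
# degenerate codes. U and X are therefore rejected, as the chapter describes.
PERMITTED = set("ACGTRYSWKMBDHVN")


def clean_sequence(raw):
    """Strip spaces and returns, convert to uppercase, reject illegal symbols."""
    seq = "".join(raw.split()).upper()
    illegal = sorted(set(seq) - PERMITTED)
    if illegal:
        raise ValueError("illegal character(s): " + ", ".join(illegal))
    return seq


def tidy(raw, block=10, blocks_per_line=5):
    """Lay the sequence out in space-separated blocks (hypothetical layout)."""
    seq = clean_sequence(raw)
    blocks = [seq[i:i + block] for i in range(0, len(seq), block)]
    lines = [
        " ".join(blocks[i:i + blocks_per_line])
        for i in range(0, len(blocks), blocks_per_line)
    ]
    return "\n".join(lines)


print(tidy("acgtacgtac gtacg"))
```

Because spaces and returns are discarded before any processing, untidy input gives the same result as tidy input, which matches the statement in Note 4 that analyses ignore space and return characters.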
3.2. Switching Between Circular and Linear Sequences
1. Of the three buttons at the center of the screen, the left-hand button currently reads Linear. When you click on it, the legend changes to Circular. The button toggles between these two states, and the legend indicates the current conformation of the sequence. The difference between linear and circular sequences is for the most part trivial, affecting only the restriction enzyme analysis, in which it is important to deal correctly with restriction sites that span the origin of circular sequences (i.e., where part of the site is at the top left of the display at position 1, and the other part at the very end).
2. Click on the button again to return the sequence to the linear state.
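To see why the conformation matters for restriction analysis, here is a minimal Python sketch of a site search that can also find a site spanning the origin of a circular sequence. The function name and the approach (appending the first few bases to the end before searching) are illustrative assumptions, not a description of how GeneJockey is implemented.

```python
def find_sites(sequence, site, circular=False):
    """Return the 1-based start positions of `site` in `sequence`."""
    seq = "".join(sequence.split()).upper()
    site = site.upper()
    if not site or len(site) > len(seq):
        return []
    # For a circular molecule, append the first len(site) - 1 bases so that a
    # site wrapping around the origin appears as one contiguous match.
    searched = seq + seq[:len(site) - 1] if circular else seq
    positions = []
    start = searched.find(site)
    while start != -1:
        positions.append(start + 1)
        start = searched.find(site, start + 1)
    return positions


# The EcoRI site GAATTC is split here: GAA at the very end, TTC at position 1.
example = "TTCAAAAGGGGAA"
print(find_sites(example, "GAATTC"))                 # linear search
print(find_sites(example, "GAATTC", circular=True))  # circular search
```

In the linear case the split site is missed; in the circular case it is reported as starting at position 11 and running on through the origin.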
3.3. Changing the Origin Point
1. Click on the Set Origin button. You will see a dialog box that asks you for the number of the first nucleotide in the sequence and tells you that you may enter any number between 32 and -32 K, except zero (see Note 7).
2. Enter a small negative number, such as -20, and click on OK. The First Nt: legend at center left now reads -20 to remind you of the current numbering, and if you run the cursor along the top line of the sequence, you will see that the numbering jumps from -1 to +1 without using zero.
3. Click on the Set Origin button again and set the origin back to 1.
4. Now, make the sequence circular, and if you have made any changes, save the sequence again. (If you cannot remember whether you have made any significant changes, pull down the File menu and look at the Save command. If it is disabled then you do not need to save.)
5. Now use the Set Origin button again. The effect of changing the origin of a circular sequence is quite different, since by convention the origin of a circular sequence is always shown at the top left of the display. If you set the origin to -20, the sequence will be rotated so that the last 20 nucleotides are brought to the beginning, with the nucleotide that was twentieth from the end of the sequence displayed at the top left, and numbered 1. Remember that there is no Undo command for this, so it is a good idea to make sure the sequence is saved in case you make a mistake with the numbering. You can then use the Revert command to restore the original display. The effect of circularizing a linear sequence whose origin is not the first nucleotide displayed is similar, and the same caution applies here.
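The two behaviors of Set Origin can be sketched in Python as follows. Both function names are hypothetical, and the code covers only what the text describes: numbering that skips zero for a linear sequence, and rotation by a negative origin for a circular one. How the program treats a positive origin on a circular sequence is not stated here, so the sketch refuses that case rather than guessing.

```python
def position_label(index, first_nt=1):
    """Number shown for the base at 0-based `index` of a linear sequence
    whose first nucleotide is numbered `first_nt` (zero is never used)."""
    if first_nt == 0:
        raise ValueError("nucleotide numbering does not use zero")
    number = first_nt + index
    if first_nt < 0 and number >= 0:
        number += 1  # skip zero: ..., -2, -1, +1, +2, ...
    return number


def rotate_circular(sequence, first_nt):
    """Rotate a circular sequence for a negative origin such as -20, bringing
    the last 20 bases to the front so that the display starts at position 1."""
    if first_nt >= 0:
        raise ValueError("only the negative-origin case is sketched here")
    if not sequence:
        return sequence
    shift = (-first_nt) % len(sequence)
    cut = len(sequence) - shift
    return sequence[cut:] + sequence[:cut]


print([position_label(i, -2) for i in range(5)])
print(rotate_circular("ACGTACGTTT", -3))
```

With the first nucleotide numbered -2, the labels run -2, -1, 1, 2, 3. Rotating the ten-base example by -3 moves the final TTT to the front.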
3.4. Verifying the Sequence
Entry of sequences at the keyboard is an error-prone process, and if you wish to be certain that the sequence you have entered is correct it is necessary to use some form of verification. GeneJockey offers you two methods of verifying sequences: Verify by Speaking and Verify by Typing. Both commands are found in the Edit menu.
1. First, click at the top left of the sequence to set the insertion point at the beginning (or just before the part of the sequence you wish to check).
2. Select Verify by Speaking. The computer will speak the first 10 bases of the sequence, permitting you to check that you have entered them correctly. Hit the space bar or any other printing key to start reading the next 10 bases. If you wish to move quickly around the sequence, use the left or right arrow keys to move forward or back 10 bases, or the up or down arrow keys to move one line up or down (see Note 8).
3. Set the insertion point back to the beginning of the section you wish to check.
4. Select Verify by Typing (see Note 8).
5. Start retyping the sequence. As you type each base, the selection moves one place forward. If you type a base that does not match the sequence you entered originally, the machine will beep and the selection will not move on.
6. In order to correct the error, type Command-period and the machine will return control to you with the incorrect base already selected for changing.
7. Type the correct base, then reissue the Verify by Typing command to continue verification (since this is a keyboard-orientated operation you will find it quicker to use the Command-T equivalent to restart verification). As before, you may use the arrow keys to move around rapidly during verification, and the machine will exit from the mode automatically when you reach the end of the sequence.
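The idea behind Verify by Typing is a base-by-base comparison of a second, independent entry against the first. A minimal Python sketch of that comparison is given below; the function name and the choice to report a 1-based position are assumptions made for illustration, and the interactive behavior (beep, selection, Command-period) is not modeled.

```python
def first_mismatch(entered, retyped):
    """Compare a retyped sequence with the entered one, base by base.

    Spaces and returns are ignored and case does not matter. Returns the
    1-based position of the first disagreement, or None if every retyped
    base matches the corresponding entered base.
    """
    original = "".join(entered.split()).upper()
    check = "".join(retyped.split()).upper()
    for position, (a, b) in enumerate(zip(original, check), start=1):
        if a != b:
            return position
    return None


print(first_mismatch("ACGTACGTAC GG", "acgtacctac"))
```

In this example the seventh base differs (G was entered, C was retyped), so the sketch reports position 7, which is where the program would stop and beep.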
3.5. Annotating Sequences
You can insert notes and comments on your sequence in the upper text box of the window. Only one of the text boxes is active at any time, indicated by the flashing insertion point.
1. Click in the top box and type in a few lines of text. Comments in GeneJockey are simple free-form text: You may type in anything you want here. Text in either box that is off screen can be reached by using the scroll bars in the usual way.
2. Click on the arrows at the top or bottom of the scroll bars; the text will scroll by one line. If you continue to hold down the mouse button, the text will scroll a second line after a short pause. Holding down the button continuously produces progressively shorter pauses until the text is scrolling at full speed. All of the standard Macintosh editing commands, Cut, Copy, Paste, Clear, and Undo, apply to both Comment and Sequence boxes, but Speak on Entry and the two Verify commands only operate on the sequence box.
3.6. Advanced Editing: Making a Construct
GeneJockey is a multiwindow editor, and you may have as many windows open at once as you need, subject to a maximum of 50. This means that you can construct new sequences by copying text from one window and inserting it into a sequence in a second window. We will use this facility to insert the sequence that you previously typed into a plasmid vector, and in a later chapter we will run a restriction analysis on this construct.
1. Use the Open command to open a suitable linear DNA sequence. Use the dopamine D2A receptor sequence from the demo files disk supplied with the program if you have no other sequence.
2. Next, Open a suitable vector sequence; we will use the plasmid pBluescript as an example.
3. Bring the first window back to the front, either by clicking on it or by selecting its title from the Windows menu. We are going to ligate this sequence into the EcoRI site of pBluescript, and to do this properly we will first have to attach EcoRI linkers to our test sequence. The recognition sequence for this enzyme is G↓AATTC, where ↓ represents the cut site, so we have to ensure that our test sequence starts with AATTC and ends with G (of course, real linkers are a little longer than that, but we need not concern ourselves with that here).
4. Set the insertion point at the beginning of the test sequence and type in AATTC.
5. Use the Tidy up button to put the sequence back in regular columns.
6. Next, scroll to the end of the sequence (if it is not on the screen).
7. Set the insertion point after the last nucleotide.
8. Type in a single G.
9. Switch back to the window containing the vector.
10. Locate the EcoRI site in the vector. To do this you could run a restriction analysis, but that is a little complex just to find a single restriction site. Instead, we will use the Find command. First, make sure that the insertion point is at the beginning of the sequence, then select Find > in sequence... from the Find menu (see Note 9).
11. Type in GAATTC and hit the OK button. The program will scroll the site onto the window and leave it selected.
12. Set the insertion point on the cut site, i.e., between the G (at 701) and the following A.
13. Click on the test sequence window to bring it back to the front.
14. Select the whole sequence by means of the Select All command from the Edit menu. (You could also do this by dragging across the whole sequence, or by setting the insertion point at the beginning and shift-clicking at the end.) So that we will be able to identify the insert when we have made the construct, it is a good idea to label it now.
15. Use the Color... command from the Text menu to put the sequence into a contrasting color (see Note 10).
16. Next, copy the entire sequence onto the clipboard by means of the Copy command from the Edit menu.
17. Bring the vector sequence window back to the front. If you do this by clicking on it, be careful to click only once, or you may shift the insertion point from the place where you left it. Check that it is still after the G at 701.
18. Paste the test sequence in using the Paste command from the Edit menu.
19. Click on the Tidy up button to reformat the sequence.
20. Save it under a suitable name.
There: you have just ligated a test sequence into a vector. I bet you wish it was that simple in the real world!
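The editing steps of this section amount to simple string operations, which the following Python sketch reproduces on toy data. The vector and insert sequences are invented for the example (they are not pBluescript or the D2A receptor), and the function names are hypothetical.

```python
ECORI_SITE = "GAATTC"  # EcoRI recognition sequence; the cut falls after the G


def add_ecori_linkers(test_sequence):
    """Add the minimal 'linkers' used in the text: AATTC in front, G behind."""
    return "AATTC" + test_sequence + "G"


def ligate_at_site(vector, fragment, site=ECORI_SITE, cut_offset=1):
    """Paste `fragment` into `vector` at the cut point of the first `site`."""
    start = vector.find(site)
    if start == -1:
        raise ValueError("site not found in vector")
    cut = start + cut_offset  # insertion point between G and AATTC
    return vector[:cut] + fragment + vector[cut:]


toy_vector = "CCCCGAATTCTTTT"  # invented sequence with a single EcoRI site
toy_insert = "ACGTACGT"        # invented test sequence
construct = ligate_at_site(toy_vector, add_ecori_linkers(toy_insert))
print(construct)
print(construct.count(ECORI_SITE))
```

Because the fragment starts with AATTC and ends with G, pasting it at the cut point regenerates a complete GAATTC on both sides of the insert, so the construct contains two EcoRI sites where the vector had one.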
3.7. Inverting Sequences
Suppose that we have only the construct sequence to work with, but we decide that the wrong strand of DNA has been inserted into the vector, and we need to take it out, invert it (i.e., generate the opposite strand), and put it back again. First, we have to select the insert, which is now in the middle of the pBluescript sequence. We know where the beginning is, just after the EcoRI site at 701, so we only need to locate the end. We could find that numerically by adding the length of the test sequence to 701, or we could simply scroll down the screen to see where the color changes, but we will search again for the second EcoRI site, which now marks the end of the insert.
1. Set the insertion point at the beginning of the sequence.
2. Select the Find Same command. This simply repeats the previous search, finding the original site.
3. Repeat the Find Same command to find the second EcoRI site.
4. Set the insertion point just before the G of the second site.
5. Scroll back to the first site at 701. Hold down the shift key while you click after the C of the first site. The whole of the insert will then be selected. (In GeneJockeyII, the cursor display remains active while you drag, so you could also just drag across the part of the sequence that you want, watching the numbers to see when you get to the right place. Yet another alternative would be to use the Select... command from the Find menu and specify numerically the region of sequence you want selected.)
6. Copy the insert onto the clipboard.
7. Use the New > Nucleotide Sequence command to generate a new sequence window.
8. Paste the sequence into it and Tidy it.
9. Select Invert from the Modify menu. The program opens a new window containing the inverted sequence (see Note 11).
10. Use Select All to change the color as before, if you wish.
11. Copy the entire sequence.
12. Pull the window containing the construct back to the front. Since we now have several windows open, it is easier to do this by means of the Windows menu than by trying to find it by moving the windows around on the screen. The part of the sequence that represents our original insert is still selected.
13. Paste the inverted sequence, and it will replace the original.
14. Tidy up the sequence.
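The steps above lean on the Find and Find Same commands, whose matching rules (degenerate codes and allowable mismatches) are described in Note 9. The Python sketch below illustrates those rules under stated assumptions: the table is the standard IUPAC code set, two symbols are taken to match when the base sets they stand for overlap, and the function names are hypothetical. It is not GeneJockey's search code.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def compatible(a, b):
    """Two symbols match if they share at least one possible base."""
    return bool(set(IUPAC[a]) & set(IUPAC[b]))


def find_all(sequence, target, max_mismatches=0):
    """Return (1-based position, mismatch count) for every acceptable hit."""
    seq = "".join(sequence.split()).upper()
    tgt = target.upper()
    if not tgt:
        return []
    hits = []
    for start in range(len(seq) - len(tgt) + 1):
        window = seq[start:start + len(tgt)]
        mismatches = sum(1 for s, t in zip(window, tgt) if not compatible(s, t))
        if mismatches <= max_mismatches:
            hits.append((start + 1, mismatches))
    return hits


def minimum_mismatches(sequence, target):
    """Smallest number of mismatches needed to place the target anywhere."""
    counts = [m for _, m in find_all(sequence, target, len(target))]
    return min(counts) if counts else None


print(find_all("CCAATAGCC", "AATNG"))
print(find_all("CCNATAGCC", "AATAG"))
print(find_all("CCAATTGCC", "AATAG", max_mismatches=1))
```

Calling `find_all` repeatedly from the position after the previous hit is the equivalent of Find Same. A symbol outside the table raises a `KeyError`, which is a simplification; the program itself prevents such characters from being entered.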
We are now finished with the windows that we currently have open, so close them all. To do this, hold down the Option key while clicking in the close box of the front window. The program will close all the windows in turn, prompting us as it does so to save any new work.

4. Notes
1. Using the New command offers three alternatives. One is for creating a new nucleotide sequence. The second is for creating a new peptide sequence. Peptide sequences are entered in precisely the same way as nucleotide sequences, and a peptide sequence window looks just like a nucleotide sequence window, the only obvious difference being that the origin prompt at center left reads "First AA:" rather than "First Nt:." You will notice some differences when you come to use the modification and analysis commands, however, since different menu commands will be enabled depending on what type of window is foremost on the screen. Peptide sequences are entered in single letter code and represented in uppercase characters only. There are no wildcard characters. The type of window you choose specifies whether the program will treat the sequence as DNA or protein, and there is very little to prevent you from entering the wrong kind of sequence into a window (there is no way for the program to distinguish between a short DNA sequence and the equivalent set of characters representing a peptide consisting entirely of alanine, cysteine, glycine, and threonine, for example), so be careful when using the New command to ask for the correct window type for the sequence you intend to enter. A third type of window that may be obtained with the New command is a plain text window. This has a single scroll bar and is 80 characters wide. There is a title area at the top that holds a single line of text and initially reads "New text window." This title string is not directly editable, but may be changed via a dialog box obtained by clicking in this area. The remainder of the window acts as a plain text area, and is useful for general purpose editing. Many of the analyses that GeneJockey performs display their results in text windows, and you may edit such results before printing or saving them.
2. GeneJockey only handles sequences consisting of uppercase symbols. Note that when you reach nucleotide number 10, and any multiple of 10 thereafter, the program will automatically insert a space or return so that the sequence is displayed in blocks of 10. In a nucleotide sequence window, you may use the symbols A, C, G, and T, plus the standard degenerate symbols that are used to represent the case in which a particular position may be occupied by more than one base. U is not a legal character, so RNA sequences should be entered as DNA. If you type an illegal character you will get a dialog box displaying the complete list of permitted characters. For example, type in an X to see this. You can also see the display of permitted degenerate codes at any time by selecting the Show Wildcards... command from the Edit menu. You can dismiss the Wildcards dialog either by clicking on the Cancel button or by clicking on any of the buttons that display the degenerate codes; in the latter case the dialog causes that code to be inserted into the sequence at the current selection point.
3. When entering DNA sequences you will make extensive use of the A, C, G, and T keys, and it is most convenient to have these keys close together so that you can enter the data with one hand and not have to look at the keyboard. Use the Re-Assign Keys... command from the Edit menu to do this. Because I am right-handed, I normally reassign the keys U, I, O, and P to give me A, C, G, and T, respectively. This has the advantage that none of U, I, O, or P are degenerate codes, so I will never want to use them for their original symbols within a DNA sequence, and they are close enough on the keyboard to the delete key that if I make a mistake I can backspace over it without taking my eyes off the gel or sequence from which I am reading. If you wish your keyboard always to work in this way, you should click on the Set Default checkbox before clicking on OK in the dialog. To return the keyboard to normal you should click on the Standard Layout button. The reassigned keyboard only applies to DNA sequences; the keyboard will operate normally when you type ordinary text into the comments area of a sequence window or anywhere else.
4. You have probably noticed by now that if you move the mouse cursor across the sequence box the number of the nucleotide beneath the cursor is continuously displayed at center left. This is very helpful for locating a particular nucleotide by number. The calculation of the number does, however, depend on the sequence being formatted correctly in regular blocks of ten. Some operations destroy this regular format, and the function of the Tidy up button is to restore order in these cases. For example, suppose you wished to insert an extra block of sequence in the middle of your existing sequence. Place the insertion point in the middle of the sequence by clicking on it. Now type in a few nucleotides. The resulting disorder would not affect any analyses that you later ran on this sequence, since all the analyses ignore the presence of space and return characters, but it looks untidy and spoils the operation of the cursor position display. Click on the Tidy up button to put the sequence back into regular columns. It would have been possible to make the program tidy the sequence after every keystroke, but it would have slowed the operation of the program to an irritating extent, especially when inserting residues near the beginning of a long sequence.
5. If you now wish to restore your sequence to its original state, select the Revert to Original command from the File menu. This returns the window to the state it was in when you issued the last Save command, checking with you first to see if you really want to discard any changes made since then.
6. Another way to reverse any change you have made is to use the Undo command at the top of the Edit menu. Pull down the menu and look at this command now. It reads Undo Typing, and if you use it, all the typing you have done since you placed the insertion point will be removed. Undo always shows you what can be undone. Almost all editing operations can be undone, the only exceptions being the three operations performed with the buttons at the center of the screen. The command may instead read Cannot Undo, and be disabled (i.e., it is shown in gray, and does not respond if you try to use it). This is because the file has just been loaded or saved, and you have not yet made any changes: There is nothing to undo.
7. Set Origin changes the way in which the sequence is numbered, and has different effects depending on whether the sequence is linear or circular. The origin of a linear sequence is position number 1, which may be anywhere on the screen, or indeed outside the sequence displayed. If your sequence represents a small segment of a larger sequence that is itself numbered from 1, the first nucleotide displayed on the screen will have a number >1. If, on the other hand, you wish to set the origin at some feature in the body of the sequence (for example, at the start codon of a translated region), the first nucleotide will have a negative number. By convention, nucleotide numbering does not use zero, so you may not set the origin to zero. Strictly speaking, when you set the origin of a linear sequence, you do not specify the position of the origin itself, but rather the numbering of the first nucleotide.
8. Verify by Typing and Verify by Speaking are modal commands, i.e., you cannot do anything else at the same time, because the menus, scrollbars, and so on, are all inactive. When the program has talked its way to the end of the sequence it will exit automatically from this mode and return to normal operation. If you wish to exit before the end of the sequence is reached (in order to make corrections) you may do so by holding down the command key and simultaneously typing a period. (This is the standard Macintosh abort command: You can stop most operations in GeneJockey this way if you change your mind.)
9. The Find command in GeneJockey is similar to that in a word processor, but has some special facilities for use with sequences. Since all sequences in GeneJockey are in uppercase, it does not matter whether you type in the target sequence in capitals or lowercase; the program will convert the characters to capitals before searching. You can include degenerate codes in the target sequence, so AATNG will find AATAG, AATCG, AATGG, or AATTG. Likewise, degenerate codes in the sequence being searched will be honored, so AATAG will find not only AATAG but NATAG, ANTAG, AANAG, and so on. The Find command will also permit you to specify a number of allowable mismatches, so you can find sections that are similar to, but not identical to, the target sequence. You can also set the program to find the minimum number of mismatches required to produce a match, by means of the Find Mismatches button.
10. Using the Text Menu: Unlike most sequence handling programs, GeneJockey has the ability to make use of formatted text. Any part of a sequence or annotation text may be placed in any font, size, style, or color. This is most useful for labeling parts of a sequence, especially since when you make constructs by editing sequences together the format is carried over to the composite sequence, allowing you to identify immediately where each part of the composite sequence came from. Most of the Text menu, and its submenus Font and Style, will be familiar to Macintosh users. You may be surprised to see that very few fonts are displayed in the Font submenu. The reason for this is that GeneJockey only displays fixed-width fonts here. Most users will find only Monaco and Courier fonts listed. This is because proportionally spaced fonts, which look so nice for standard text, disrupt the display of sequences, making it impossible to line up the blocks neatly. Here are some examples.
9 pt. Monaco font (the default):
CGAAGGGCTC CCCACTCCTA GCCAGCCCAC ACCAAGCTTC TTGCAGCCCG
GGGAGCAAGT GGAACTAAAC CTGCGGCAGG TTTAAATGTG TATTTGGCTA
CTTGGCTACT GAGTAGAGAA CACAAAATGA ATAACTCCAC CAACTCCTCT
AACAGTGGCC TGGCTCTGAC CAGTCCTTAT AAGACATTTG AAGTGGTTTT
10 pt. Courier font (good for printing on PostScript printers, but less legible on screen):
CGAAGGGCTC CCCACTCCTA GCCAGCCCAC ACCAAGCTTC TTGCAGCCCG
GGGAGCAAGT GGAACTAAAC CTGCGGCAGG TTTAAATGTG TATTTGGCTA
CTTGGCTACT GAGTAGAGAA CACAAAATGA ATAACTCCAC CAACTCCTCT
AACAGTGGCC TGGCTCTGAC CAGTCCTTAT AAGACATTTG AAGTGGTTTT
TATTGTCCTT GTCGCCGGAT CCCTCAGTTT GGTGACCATT ATTGGGAACA
TCCTGGTCAT GGTCTCCATC ZLIAGTCAACC GACACCTCCA GACAGTCAAC
AATTACTTTT TGTTCAGCTT GGCCTGTGCT GACCTCATCA TTGGTGTTTT
CTCCATGAAC CTGTACACTC TTTACACTGT GATTGGCTAC TGGCCTTTGG
GCCCCGTGGT GTGTGACCTT TGGCTAGCTC TGGACTACGT GGTCAGTAAT
12 pt. Geneva font (proportionally spaced and therefore fine for text, but messy for sequences):
CGAAGGGCTC CCCACTCCTA GCCAGCCCAC ACCAAGCTTC TTGCAGCCCG
GGGAGCAAGT GGAACTAAAC CTGCGGCAGG TTTAAATGTG TATTTGGCTA
CTTGGCTACT GAGTAGAGAA CACAAAATGA ATAACTCCAC CAACTCCTCT
AACAGTGGCC TGGCTCTGAC CAGTCCTTAT AAGACATTTG AAGTGGTTTT
Of course, if you insist, you can use proportionally spaced fonts, but you will need to use the More... command to get access to the full set of fonts in your
system. This command also permits you to use sizes other than the basic ones listed on the Font submenu. To use the Color... command, first select the area of text or sequence that you wish to change, then issue the command. The dialog that follows is the Macintosh standard color wheel. If the text was originally black, the wheel will appear entirely black. Move the scrollbar at the right to the top of its travel to show colors, then click on the wheel to select a color. Dismiss the dialog with the OK button and the text will change color. When using these commands to label parts of sequences, you should be aware that not all combinations of Font, Size, and Style will work well. In addition to the advice given above about the use of proportionally spaced fonts, you should be aware that some Styles also change the width of the characters. In general, the Underline and Italic styles work well, but Bold, Condense, Outline, and Shadow all change the width of the characters to which they apply. Some combinations can be used successfully, e.g., Bold + Condense works provided that you return the space characters between the blocks to plain text. Be aware also that different fonts may be of differing widths, even though they are nominally of the same point size, so if you change part of a sequence from Monaco to Courier font, you should increase the point size from 9 to 10 to make the character widths match. Examples:
CGGTGGACTA TCCAAACCTA CTTCATACCC CAGGAACTCG CTAACAAAGT: Underline-OK
GCCCTACACT CGCGACTTGA ACATGGTCTT AGCTCCCCAG AACATGCGCC: Italic-OK
GGTGCCCTAA GTTCTCAATC GCTGTTCGAA ACTCGGAACA TAGTTTTGGC: Bold alone-messy
CACTCCAAAC AACTTAGCCT GGATAGGACG GATGCAGGCC CCTTTCCACC: Bold + Condense-OK
11. The name of the Modify menu is something of a misnomer, since although the commands on it (except Generate Random Sequence... and Genetic code >) generate a derivative sequence, they always open a new window to contain the derivative, leaving the original sequence unmodified. The commands on this menu operate on the front (active) window only. The first three commands on the menu manipulate sequences in standard ways, and it is important to distinguish between them. The Reverse command produces a sequence that is reversed in order, and is the only command on this menu that will operate on both nucleotide and protein sequences (e.g., ATTGGGCC reversed is CCGGGTTA). The Complement command produces a sequence that is complementary to the original sequence (e.g., ATTGGGCC complemented is TAACCCGG). You should note that although this sequence is complementary to the original, it is shown in the reverse direction, i.e., with the 3' end at the left. The Invert command both reverses and complements the sequence, generating the sequence of the strand that is biologically complementary to the original (e.g., ATTGGGCC inverted is GGCCCAAT). Of the three commands, this is the one that is used most often, which is why it is the only menu command in color.
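The three operations distinguished in Note 11 are easy to confuse, so a short Python sketch may help. The function names simply mirror the menu commands; the complement table covers the standard IUPAC degenerate codes, which is an assumption about the symbols in use rather than a statement about the program's internals.

```python
# Each IUPAC symbol mapped to the symbol for the complementary set of bases.
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse(sequence):
    """Same symbols in the opposite order (also meaningful for peptides)."""
    return sequence[::-1]


def complement(sequence):
    """Base-by-base complement, left in the original order (3' end at left)."""
    return sequence.upper().translate(_COMPLEMENT)


def invert(sequence):
    """Reverse and complement: the opposite strand, written 5' to 3'."""
    return reverse(complement(sequence))


for operation in (reverse, complement, invert):
    print(operation.__name__, operation("ATTGGGCC"))
```

Run on the example from the note, this prints CCGGGTTA, TAACCCGG, and GGCCCAAT for reverse, complement, and invert, respectively.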

The Genetic Data Environment: A User Modifiable and Expandable Multiple Sequence Analysis Package
Jonathan A. Eisen

1. Introduction
The Genetic Data Environment (GDE) is a software package designed for molecular sequence alignment and analysis (1). Four features make GDE stand out relative to other similar programs:
1. It is free.
2. It has a user-friendly and visually powerful multiple sequence alignment editor.
3. Analysis can readily be performed on any sequence(s) or region(s) of sequences simply by selecting the sequence(s) or region(s) of interest and choosing the desired function from the pop-up menus.
4. Although it comes with a variety of powerful sequence analysis tools, any additional programs of the user's interest or updates for programs in use can be incorporated quickly and easily into the menu system (see Note 1).
The current release of GDE includes a variety of sequence analysis tools, including methods for sequence alignment and editing, conversion between sequence formats, nucleic acid translation, identification of restriction sites, RNA secondary structure prediction and drawing, database searching, dot plots, phylogenetic analysis, consensus determination, and printing and formatting. Instructions for how to use many of these features are presented here. However, since GDE is user-expandable, the main focus of this chapter will be on how to use the core GDE alignment window. In addition, a brief guide on how to add additional programs to the GDE menu system is included. Learning to use this type of program may be of more use in the future, since other programs will likely adopt this user-expandable system. Currently, work is in progress to incorporate many of the features of GDE into the incredibly powerful but somewhat cumbersome software package GCG.

2. Materials
2.1. Hardware
The GDE software package is designed to run on the Sun family of computer workstations. However, it can also be run with some modifications on other Unix-based workstations, such as DecStations (Digital Equipment Corp., Maynard, MA) and SGIs (Silicon Graphics, Inc., Mountain View, CA). The sequence alignment editor is designed to be run in an X-Windows or OpenWindows environment and can be displayed locally (on the machine running the GDE software) or remotely on any machine capable of X-window emulation (e.g., MacX can be used for displaying on a Macintosh). Although most of the features and programs of GDE are designed to be run from the alignment editor, many can also be run from the Unix prompt. A working knowledge of Unix and X-windows is helpful for using GDE but not necessary. Whenever possible, I include all instructions needed.
The core GDE package requires about 15 Mb of disk space. Additional space is required for sequence database files. The amount of RAM needed varies a great deal, depending on the size of the sequence files being viewed and the number and type of programs used to manipulate or analyze these sequences.
The GDE system can be run on color or black and white machines. However, to make full use of the sequence alignment window, it is helpful to have color. For example, amino acids are colored by chemical type (all acidic are one color, all basic are another, and so on). Thus, regions of sequence similarity can be quickly identified by blocks of particular colors. In addition, some of the highlighting features of particular GDE programs work best when viewed in color.
2.2. Software
The current GDE package (version 2.2) can be obtained from a variety of computer archives. URL addresses for some sites are given below.
1. http://golgi.harvard.edu/ftp/
2. http://www.dl.ac.uk/SEQNET/gde.html
3. gopher://megasun.bch.umontreal.ca/ll/GDE
4. gopher://rdpgopher.life.uiuc.edu/ll/progr~s/Editor-GDE
5. ftp://ftp.sunet.se/pub/molbio/unix/GDE
6. ftp://fly.bio.indiana.edu/molbio/unix/GDE
7. ftp://solomon.technet.sg/pub/NUS/Z2/indiana/molbio/unix/GDE
The GDE package is usually found at archive sites in compressed archive format as a single file (e.g., gde2.2.tar.Z). This file must be copied to a local machine, decompressed, and unarchived. In addition, the .cshrc file of all users who want to run GDE must be modified slightly. Below are instructions that can be used to set up GDE for a Sun Sparcstation (once the file has been copied from an archive site). The commands shown after the % prompt should be typed from the Unix prompt and followed by a carriage return. For other types of computers, some modifications of these instructions may be necessary. The specifics will depend on the machine, the type of Unix being run, and the type of X-windows being used for display. Instructions for setting up GDE on a variety of other machines are available at many of the above archive sites.
1. % mkdir /usr/local/GDE <return>: makes a directory for the GDE program.
2. % mv gde2.2.tar.Z /usr/local/GDE/ <return>: moves the file to the directory.
3. % uncompress gde2.2.tar.Z <return>: uncompresses the file.
4. % tar -xvf gde2.2.tar <return>: unarchives the file.
Steps 3 and 4 act on the file in the current directory, so change to the new directory first (type cd /usr/local/GDE) if you are not already there.
For each user, the following lines should be added to the .cshrc file found in their home directory. The additions can be made using a text editor like vi, emacs, or textedit, or by using the cat command (type cat >> .cshrc from the Unix prompt and any text typed will be added to the .cshrc file; when done, type control-D).
1. set path = ($path /usr/local/GDE/bin)
2. setenv GDE_HELP_DIR /usr/local/GDE/GDEHELP
2.3. Databases
The GDE package comes with two database comparison programs, fasta (2) and blast (3). To make use of these programs, the desired databases must be set up in specific formats and locations. All should be set up in subdirectories within the GDEHELP directory (/usr/local/GDE/GDEHELP). Instructions for doing so are given below. Special programs are required to format databases for the blast programs, and these are included with the GDE package. To run these programs, simply type their name followed by a carriage return from the Unix prompt. If the appropriate databases are already set up elsewhere on a local system, aliases for the locations of these files can be set up in the directories described below instead of copying the entire databases.
1. For fasta protein searches, copy PIR to the GDEHELP/FASTA/PIR/ directory.
2. For fasta nucleotide searches, copy Genbank to the GDEHELP/FASTA/GENBANK/ directory.
3. For blast protein comparisons, copy PIR to GDEHELP/BLAST/PIR/. Then use the pir2fasta program to convert to temporary FASTA format. Then reformat the database using the setdb program.
4. For blast nucleotide comparisons, copy Genbank to BLAST/GENBANK/ in the GDEHELP directory. Then use the gb2fasta program to convert to temporary FASTA format. Finally, use the pressdb program to reformat the database.
3. Methods
3.1. GDE Basics
3.1.1. Starting the Program
Prior to starting GDE, the user must set up for displaying in an X-windows or equivalent environment. If GDE is to be run locally on a workstation, usually the windows environment will be started when you log on to the machine. If not, try typing x or openwin from the Unix prompt. The GDE can also be run remotely by setting up to display on a local machine but running the program elsewhere. There are many ways to do this depending on the machine on which you will be displaying. In general, you have to tell the local machine that you are allowing the machine that will be used to run the GDE software to be an X-windows host (for many Unix systems, type xhost +remote_machine_address from the Unix prompt, replacing remote_machine_address with the name or IP address of the machine from which you will run GDE). Then you have to tell the remote machine that you will be displaying GDE elsewhere (for many Unix systems, type setenv DISPLAY local_machine_address:0 from the Unix prompt, replacing local_machine_address with the IP address or name of the machine used as the display). Once everything is set up, to start GDE type gde or gde filename (where "filename" is replaced by the name of the file one wants to open), followed by a carriage return (this must be typed in the window of the machine running the GDE software if you are using a remote server). The GDE alignment window should appear. An example window is shown in Fig. 1. This window includes many of the features that will be referred to later.
3.1.2. Using the Mouse and Menus in GDE
GDE is a menu-driven, X-windows-based system. As with other windows environments, in X-windows, pop-up/drag-down menus are used to access a variety of commands. The most obvious difference between X-windows and traditional Mac or PC Windows is that there are three buttons on the mouse with which to become familiar. The buttons are used for different functions, including:
1. Left button: placing cursor, selecting sequences and regions of sequences, scrolling, resizing windows, and splitting screen.
2. Middle button: extending text selection.
3. Right button: opening pop-up menus and scroll-bar menus.
Fig. 1. The main GDE window. Labeled elements include the GDE menus, the sequence names and group numbers, the selected region, the scroll-bar elevator, and the split screen divider.
The most important mouse skill in GDE is selection of items from the GDE menus. To select an item in a menu, such as the File menu (in the upper left in Fig. 1):
1. Point the mouse cursor at the menu button of interest and click with the right mouse button (this will expose the items in the drag-down menu).
2. Select one of the items in the menu by pointing and clicking with the left mouse button.
3. Menus can be "thumbtacked" to the screen by first selecting the menu with the right mouse button and then clicking on the thumbtack with the left mouse button.
For most GDE menu items, a dialog box will appear after the command has been selected. These boxes ask for various types of input that define exactly how the command will be executed. The GDE uses five types of input formats in these dialog boxes: text lines, sliders, chooser buttons, pop-up menus, and check-boxes. The first four of these are demonstrated in the dialog box for the Find command (Fig. 2).
1. Text lines: To enter text in a text line, point the mouse cursor to the text line, click with the left mouse button, and then type the text.
2. Sliders: To modify values in sliders, point the cursor to the rectangular box on the slider and then click and hold the left mouse button and drag to the left or to the right to get to the desired number (which is shown in the text line to the left). Sliders can be altered in increments of one by pointing and clicking with the left mouse button to the right or left of the slider box, along the slider line.
3. Chooser buttons: To alter selections in chooser buttons, simply point the mouse cursor and click with the left button on one of the boxes to the right of the choice. The selected box will be highlighted.
4. Pop-up menus: Pop-up menus can be altered as described above for GDE menus.
5. Check-boxes: Boxes are checked by simply pointing and clicking with the left mouse button.
Fig. 2. An example of a GDE dialog box. The figure shows the dialog for the Find command, showing four of the five possible means of inputting information.
In general, once the dialog box has been filled out as desired, the command is started by clicking the OK or DONE button. As mentioned above, one of the most powerful aspects of GDE is the ability to quickly add new programs. A dialog box like this one for a new program can usually be added in about 30 min with no programming experience except a little knowledge of Unix commands. The dialog boxes are helpful because once they are programmed the user does not have to remember the command line instructions for each program (see Section 4. for more information about incorporating new programs) and the program can be run on specific sequences or regions with the click of a button.
3.1.3. Sequence Input and Sequence Types
GDE uses four different types of sequences: DNA/RNA, protein, text, and masks. The sequence type is important in determining which characters are allowed to be entered into the sequence, as well as how external programs handle the sequence when it is selected for analysis. The DNA/RNA and protein sequences use the standard nucleotide, amino acid, and degenerate position abbreviations. Text sequences allow any characters and are particularly useful for keeping notes along with an alignment (such as intron positions, transcription start sites, mutation spots, and so on). Masks are used to direct external programs to use only subsets of a sequence alignment. This can be particularly useful in phylogenetic analysis (see Section 3.3.5.) but is useful in other functions as well (see Section 3.1.17.).
There are three ways to get sequences into a GDE window. Short descriptions for each method are given below. Combinations of these can be used to load multiple files and sequences into one window (remember to check the file name prior to saving if multiple files have been opened or imported; see Note 2).
3.1.3.1. DIRECT INPUT (FOR SEQUENCES IN GDE, FLAT, OR GENBANK FORMAT)
1. Choose the Open... command from the File menu.
2. In the dialog box, the local directory is shown. Click on the name of the file to be opened or move through the directories to find the file of interest.
3. Once the file is selected, click the Open button.
4. The sequence(s) will be added to the ones currently in the GDE window.
3.1.3.2. LOADING SEQUENCES IN OTHER FORMATS
1. Choose the Input Foreign Format command from the File menu (see Note 3).
2. A text line for inputting the name of the file to import will appear in the dialog box. If the file of interest is in the directory from which the GDE program was started, type in the file name (e.g., gde.pir). If the file is in another directory, you need to type the path name as well (e.g., /GDE/gde.pir). Sometimes it is easier to move the file to the directory in which GDE was started rather than typing the entire path name.
3. Click the OK button.
4. The sequences will be imported and added to those already in the GDE window.
5. This function uses the readseq program to convert between sequence formats and thus has all of the features and bugs of this program. It is important to be careful when importing sequences that have been received by E-mail from sequence databases. Depending on the way they were received and the E-mail system used, sometimes the E-mail headers can interfere with the importing functions. In addition, only some sequence information fields will be converted; others may be left out or merged into the same field. Instructions for accessing sequence information fields are in Section 3.1.6.
6. Readable formats include Genbank, IG/Stanford, NBRF, EMBL, GCG, DNA Strider, Fitch, Pearson/Fasta, Zuker, Olsen, Phylip, Plain text, ASN.1, PIR, MSF, and PAUP.
3.1.3.3. NEW SEQUENCES
1. Choose the New Sequence command from the File menu.
2. Choose the sequence type (DNA/RNA, protein, text, mask) from the pop-up menu.
3. Type in a name.
4. Click the OK button.
5. A sequence name (with no sequence yet) will be added to the sequences already in the GDE window. The sequence can then be typed in directly (see Section 3.2.1.).
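The role of the sequence type in deciding which characters may be entered can be illustrated with a minimal Python sketch. The alphabets below are assumptions based on the standard IUPAC nucleotide and amino acid codes; the exact character sets that GDE accepts for each type are not listed in this chapter, and the function names are hypothetical.

```python
# Assumed alphabets (standard IUPAC codes); GDE's own rules may differ.
NUCLEOTIDE = set("ACGTU")                   # unambiguous bases
NUCLEOTIDE_AMBIGUOUS = set("RYSWKMBDHVN")   # degenerate positions
AMINO_ACID = set("ACDEFGHIKLMNPQRSTVWY")    # unambiguous residues
AMINO_ACID_AMBIGUOUS = set("BZX")
GAP = set("-")                              # assumed gap character
MASK = set("01")


def allowed_characters(seq_type):
    """Return the set of permitted characters, or None if anything is allowed."""
    if seq_type == "DNA/RNA":
        return NUCLEOTIDE | NUCLEOTIDE_AMBIGUOUS | GAP
    if seq_type == "protein":
        return AMINO_ACID | AMINO_ACID_AMBIGUOUS | GAP
    if seq_type == "mask":
        return MASK
    if seq_type == "text":
        return None  # text sequences allow any characters
    raise ValueError(f"unknown sequence type: {seq_type}")


def invalid_characters(seq_type, sequence):
    """List the characters in `sequence` that the given type would reject."""
    allowed = allowed_characters(seq_type)
    if allowed is None:
        return []
    return sorted({c for c in sequence.upper() if c not in allowed})
```

For example, invalid_characters("DNA/RNA", "ACGE") reports the E, whereas the same string is acceptable as a protein or text sequence.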
3.1.4. Selection of Sequences or Regions for Analysis
In general, functions selected from GDE menus are performed only on the sequence(s) or region(s) that have been selected by the user. The ability to quickly select different sequences and regions of interest allows the user to perform analysis with high specificity. For example, to compare a small segment of the N-terminus of a protein to a sequence database, just select that region and then choose one of the database searching options from the GDE menu. Sequences and regions can be selected either directly using the mouse or indirectly using menu functions. The currently selected sequences or regions are highlighted in the GDE window (Fig. 1). It is important to note that region and sequence selection are independent: changing selected regions has no effect on which sequences are selected and vice versa. However, for some commands, sequence and region selection can be in conflict. This occurs when the command chosen can be performed on either sequences or regions (e.g., multiple sequence alignments). In these cases, a selection window will appear asking the user to choose whether the function is to be performed on the region(s) or sequence(s) selected. Some functions can only be performed on either regions or sequences but not both (e.g., grouping, see Section 3.1.9.), and thus a chooser window will not appear in these cases.
3.1.4.1. SEQUENCE SELECTION
1. Click on the short name of the sequence with the left mouse button.
2. To select multiple sequences, use mouse dragging (click and hold the left mouse button while dragging the mouse cursor across the names of the sequences to be selected and release after the last name) or shift clicking (hold the shift key while performing additional selections with the mouse button).
3. Use the Select All or Select by name commands from the Edit menu to select sequences indirectly.
4. Deselection of sequences is done either by selecting other sequences without holding down the shift key or by clicking the mouse in the region immediately to the right of the sequence names (but to the left of the sequence text).
3.1.4.2. REGION SELECTION
1. With the left mouse button held, drag the mouse cursor across the region to be selected and release the button when at the end of the region.
2. Alternatively, "embrace" the region to be selected by pointing the mouse cursor at one side of the region and clicking with the left mouse button and then point and click on the other side of the region with the middle mouse button. This method allows the use of scroll-bars to move to the second edge of the region to be selected (which makes selection of long regions of sequence easier than with mouse dragging).
3. Both of the above methods can be used to select a region from one sequence or comparable regions from multiple sequences.
4. Selection of regions can be complicated by grouping of sequences (see Section 3.1.9.): selecting a region in one member of a sequence group will automatically select that region in all members of the group.
5. To deselect regions, select another region or point and click the mouse anywhere in the text.
3.1.5. Saving Sequences and Alignments
GDE allows for sequences and alignments to be saved in a variety of formats (see Note 4). The three different means of saving sequences and alignments are described below. If you have many sequence files, be careful not to overwrite files of interest.
3.1.5.1. SAVING AN ENTIRE ALIGNMENT
1. Choose Save As from the File menu.
2. Select the format (GDE, Genbank, or Flat).
3. Enter a new name or leave the original name.
4. Click the OK button.
5. The file will be saved in the directory where the GDE program was opened.
3.1.5.2. SAVING SPECIFIC SEQUENCE(S) OR REGION(S)
1. Select the sequence(s) or region(s) to be saved.
2. Choose Save Selection from the File menu.
3. Select the sequence format (GDE, Genbank, Flat).
4. Enter a file name.
5. Click the OK button.
6. The file will be saved in the directory where the GDE program was opened. Alignment information will be retained.
3.1.5.3. SAVING IN OTHER FORMATS
1. Select the sequence(s) or region(s) to be saved.
2. Choose Output Foreign Format from the File menu (as with the Input Foreign Format command in Section 3.1.3., this uses readseq).
3. Select the output format from the pop-up menu (Genbank, IG/Stanford, NBRF, EMBL, GCG, DNA Strider, Fitch, Pearson/Fasta, Zuker, Olsen, Phylip, Plain text, ASN.1, PIR, MSF, PAUP, Pretty).
4. Enter a name for the file.
5. Click the OK button.
6. The file will be saved in the directory where the GDE program was opened.
3.1.6. Sequence Information
GDE allows storage of a variety of information for each sequence. Under normal conditions, the majority of this information is kept hidden. Access to this information is gained via a dialog box. This information can be useful for sorting functions (see Section 3.1.7.) as well as for future reference. For example, strand and direction will influence translation functions and sequence type will influence allowable modifications.
1. Select the sequence of interest.
2. Choose Get Info from the File menu (Fig. 3).
3. Change or enter the text for short name (the name shown in the GDE window), full name, ID number, description, author, and comments.
4. Set the pop-up menus for type, strand, and direction.
5. Click the OK button when done.
Fig. 3. Sequence Information dialog box.
3.1.7. Sorting and Ordering Sequences
In order to aid multiple sequence alignment and analysis, it is sometimes helpful to have specific sequences next to each other. Reordering of sequences can be done in two ways: either by cutting and pasting or by using sorting functions.
3.1.7.1. MANUAL
1. Select the sequence(s) to be moved.
2. Choose the Cut or Copy commands from the Edit menu, or use built-in cut/copy keyboard function keys.
3. Select the site at which the sequences are to be placed (by selecting the sequence immediately above the site).
4. Choose the Paste command from the Edit menu.
3.1.7.2. COMPUTER-BASED
1. Select the sequence(s) or region(s) to be sorted.
2. Choose the Sort command from the Edit menu.
3. Choose the Primary and Secondary Sort Fields (group, type, name, sequence ID, creator, offset).
4. Click the OK button.
5. A new GDE window with the results will appear.
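The idea of a primary and a secondary sort field can be illustrated with a short Python sketch. It is not GDE's own code: the record layout (a dictionary per sequence) and the function name are assumptions made for illustration. Records are ordered by the primary field, and ties are broken by the secondary field.

```python
from operator import itemgetter


def sort_sequences(records, primary, secondary):
    """Order sequence records by a primary field, then by a secondary field.

    `records` is a list of dictionaries (a hypothetical layout) whose keys
    are field names such as "group", "type", or "name".
    """
    return sorted(records, key=itemgetter(primary, secondary))


records = [
    {"name": "beta2", "group": 2, "type": "protein"},
    {"name": "gamma1", "group": 1, "type": "protein"},
    {"name": "beta1", "group": 2, "type": "protein"},
]
ordered = sort_sequences(records, primary="group", secondary="name")
names = [record["name"] for record in ordered]
```

Here the gamma globin (group 1) is placed first, followed by the two beta globins (group 2) in name order.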
3.1.8. Extracting Sequences/Regions
Sometimes it is helpful to extract subsets of sequences or regions of sequences into a new alignment window. This can be done in either of the following two ways.
3.1.8.1. DIRECT
1. Select the sequence(s) or region(s).
2. Choose Extract from the Edit menu.
3. A new GDE window with the results will appear.
3.1.8.2. INDIRECT
1. Select the sequence(s) or region(s).
2. Choose Save Selection from the File menu (see Section 3.1.5.).
3. Use the Open command to reopen this saved selection (see Section 3.1.3.).
3.1.9. Grouping Sequences
Grouping of sequences allows editing functions to be performed on all members of the group at the same time. This feature is particularly useful for aligning sequences by hand. For example, if one had separate alignments of 30 gamma globins and 30 beta globins and wanted to align them together manually, it might be easiest to group all of the beta globins into one group and all of the gammas into another. Then, alignment gaps could be placed in all gammas at the same time and all betas at the same time by entering the gap into only one of the members of the group. If one then wanted to put a gap in only one or a few of the beta globins, they could be ungrouped and the gap could be placed in just those few. When editing functions are attempted on one member of a group, only those actions that are permitted for all members of the group will be allowed (see Section 3.1.10.). Regions cannot be grouped, only sequences can. To change sequence groups:
1. Select the sequence(s) to be grouped or ungrouped.
2. Choose Group or Ungroup from the Edit menu.
3. If any of the sequences selected are part of another group, the user will be asked whether to merge the groups or to create a new one.
4. A number will be placed to the left of the short sequence name(s) to indicate group status.
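The effect of grouping on gap insertion can be sketched in Python as follows. This is only an illustration under assumed data structures (an alignment stored as a dictionary of strings and a separate dictionary of group numbers); the function name is hypothetical and GDE's internal representation is not described in this chapter.

```python
def insert_gap(alignment, groups, name, column, count=1, gap="-"):
    """Insert `count` gap characters at `column` in `name` and its group.

    alignment: dict mapping sequence name -> aligned sequence string
    groups:    dict mapping sequence name -> group number (None if ungrouped)
    Returns a new dict; the input alignment is left unchanged.
    """
    group = groups.get(name)
    if group is None:
        targets = [name]  # an ungrouped sequence is edited on its own
    else:
        targets = [n for n, g in groups.items() if g == group]
    updated = dict(alignment)
    for target in targets:
        seq = alignment[target]
        updated[target] = seq[:column] + gap * count + seq[column:]
    return updated


alignment = {"beta1": "ACGT", "beta2": "ACGA", "gamma1": "TTGT"}
groups = {"beta1": 1, "beta2": 1, "gamma1": 2}
result = insert_gap(alignment, groups, "beta1", column=2, count=3)
```

Entering the gap in beta1 alone changes both beta globins, while the gamma globin in the other group is untouched.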
3.1.10. Sequence Protections
GDE allows for the protection of sequences against accidental modification. There are four different types of modifications allowed during editing. The default is to allow only modification of alignment gaps and translations. Depending on the type of sequence (DNA, protein, text, mask), "ambiguous" characters are different. For example, N is ambiguous for DNA and RNA, but is not for protein.
1. Select the sequence(s).
2. Choose Protections from the File menu.
3. Select the modifications allowed (unambiguous characters, ambiguous characters, alignment gaps, translations).
4. Click the Done button when finished.
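How protections interact with grouping (an edit is allowed only if every affected sequence permits it) can be sketched in Python. The names below are hypothetical and the representation of protections as sets of strings is an assumption; only the four modification types and the default come from the text above.

```python
# Default stated in the text: only alignment gaps and translations may change.
DEFAULT_PROTECTION = frozenset({"alignment gaps", "translations"})


def edit_allowed(kind, names, protections):
    """Return True if modification `kind` is permitted for every sequence.

    kind:        one of "unambiguous characters", "ambiguous characters",
                 "alignment gaps", "translations"
    names:       the sequences the edit would touch (e.g., a whole group)
    protections: dict mapping name -> set of permitted modification kinds;
                 sequences not listed fall back to the default
    """
    return all(kind in protections.get(n, DEFAULT_PROTECTION) for n in names)


protections = {
    "beta1": {"alignment gaps", "translations", "unambiguous characters"},
}
```

With these settings, typing a base into beta1 alone is permitted, but the same edit is refused if beta1 is grouped with a sequence that still has the default protections.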
3.1.11. Repeat Counts
Repeat counts allow the user to repeat a keystroke any number of times by typing the number corresponding to the desired number of repeats immediately prior to the key being typed. This is very useful for manual sequence alignment (for inserting or removing multiple gap characters) and for moving the cursor a defined number of spaces (see Section 3.2.). Repeat counts will not work when the cursor is in a text or mask sequence because numbers can be used as input.
3.1.12. Printing
The GDE has two means of printing sequences or alignments. Normal GDE printing allows printing of sequences and alignments with a variety of Unix commands as well as viewing and editing the file to be printed. Sequences can also be printed with the PrettyPrint format of the readseq program. PrettyPrint output is designed for publishing and presentation of alignments and can produce very polished figures. Both printing commands are accessible from the File menu.
3.1.13. Cursor Position
The cursor is identified by the flashing horizontal line in the sequence text section of the GDE window. It is used in essentially the same way as the cursor in most word processing programs. First and foremost, the cursor marks the spot at which editing commands are performed and text selections begin. In addition, it can be used to mark a place for quick returns if the screen is scrolled to another page. Information about the cursor position is displayed in the status line (Fig. 1). To move the cursor, either point to a new region and click with the left mouse button or move with the arrow keys (repeat numbers can be typed before the arrow keys to move a specific number of positions). If the cursor is moved past the edge of the screen, scrolling will be activated and the next page of sequence will be shown. Since scrolling can be performed without moving the cursor (see Section 3.1.14.), the cursor may not always be visible in the GDE window.
The cursor may be hidden from view if the scroll-bars are used to show a different region of sequence. To return the screen to display the region of sequence where the cursor is, press one of the arrow keys. This function (which I will refer to later as the return screen function) is helpful, but can lead to some confusion. If you want to keep the view on the sequences you have scrolled to, remember to change the cursor point to that region using the mouse.
3.1.14. Scrolling
Only a portion of most sequences will be viewable in a single GDE window. The rest of the sequence can be viewed by scrolling to another page (to the right or left). In addition, if an alignment contains many sequences, it may be necessary to scroll up and down to see different sequences. Scrolling can be performed in a variety of ways, including:
1. Click with the left mouse button on the arrows on the scrolling elevator (Fig. 1).
2. Click and drag in the center of the elevator.
3. Use the scroll-bar menu (which is opened by clicking with the right button on the scroll-bar).
4. Click on the scroll-bar edges (the vertical lines at the edge of the scroll-bar). This moves the window all the way to the beginning or end of an alignment.
5. Use the cursor arrows to move the cursor past one edge of a screen page (see Section 3.1.13.).
3.1.15. Split Screens
A split screen allows the viewing of discontinuous regions of a particular alignment. This can be used, for example, to insert gaps in the upstream portion of a sequence while simultaneously monitoring the alignment of the downstream portion, even thousands of bases away. Be careful not to have different vertical positions for different screens; this will make identification of specific sequences difficult. Vertical scrolling can be locked in the screen properties menu. The region of the alignment shown in a particular screen can be changed in three ways: by downstream manipulations of the sequence (such as insertion of gaps) in another screen, by using the scrolling functions, or by using the return screen function described in Section 3.1.13. The return screen function can lead to much confusion when using split screens because this function only operates on the active screen. The active screen is the screen in which the mouse pointer is pointing. Therefore, be sure to know which screen the mouse is pointing to before you use the return screen function. For example, imagine you are using the right screen to view the C-termini of a protein alignment and the left screen to view the N-termini, and the cursor is in one of the proteins in the N-termini. If you want to insert a few alignment gaps in this protein's N-terminus, be careful that the mouse is pointing to the left screen. If it is pointing to the right screen when you type the alignment gaps, the right screen will return to the position of the cursor and thus you will have two screens showing the N-termini. Below are descriptions of the two ways to make and remove split screens. Any number of split screens can be used at one time.
1. Point the mouse cursor at the edge of the scroll-bar.
2. Click and drag to create or remove split screens.
Alternatively:
1. Point and click the right mouse button on the scroll-bar.
2. Select Split Views or Unsplit Views from the pop-up menu.
3.1.16. Screen Properties
Many of the screen features can be altered using X-windows functions and thus are specific to the type and version of X-windows being used. In addition, GDE allows the user to modify a variety of the properties specific to the GDE window. Becoming familiar with these functions is important because they can be used to aid in analysis and alignment of sequences. In addition, some of the programs run through GDE menus may change the screen properties. For example, the Variable Positions command (see Section 3.3.8.) changes the sequences to black and white to better emphasize differences in degree of conservation of different alignment positions. Therefore, if the user wants to return the screen to color, the screen properties must be reset. Below is a description of how to alter screen properties.
1. Choose Properties from the File menu.
2. Enter or alter:
a. Font size (for sequence names and text).
b. Editing mode (insert or check).
c. Color type (monochrome, character=color, alignment color mask).
d. Message panels (activates a variety of messages displayed on screen).
e. Screen inversion (inverts color patterning, very useful for manual alignments).
f. Vertical scroll locking (keeps vertical positions of split screens together).
g. Key clicks.
h. Insertion point (to the right or left of cursor).
i. Scale (1-20).
3. Click the OK button when done.
3.1.17. Using Sequence Masks
Sequence masks are used to determine which alignment positions of the selected sequence(s) or region(s) will be used by programs selected from the GDE menus. When a sequence mask is selected along with sequence(s) or region(s) of sequence(s), GDE first filters the sequence(s) prior to running whatever external programs are selected. The filter removes all alignment positions at which there is a 0 in the sequence mask. Sequence masks are particularly useful for phylogenetic analysis (see Section 3.3.5.). Sequence masks can be generated either manually (by creating a new sequence of the mask type and typing in the 1s and 0s), or by running the sequence consensus program and using it to generate a mask by degree of conservation. Masks can be incorporated into any function of interest by simply including a line in the .GDEmenus file to tell GDE to use a mask if selected. Some programs will not use masks, and thus masks that are selected will be treated as any other sequence and no filtering will occur.
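The filtering step described above (dropping every alignment position at which the mask holds a 0) can be sketched in a few lines of Python. This is an illustration rather than GDE's own code: the function name is hypothetical, and the sketch assumes that the mask and the sequences have the same length.

```python
def apply_mask(mask, sequences):
    """Remove alignment columns where the mask contains a 0.

    mask:      string of 1s and 0s, one character per alignment column
    sequences: dict mapping sequence name -> aligned sequence string
    Assumes every sequence is as long as the mask.
    """
    keep = [i for i, flag in enumerate(mask) if flag != "0"]
    return {
        name: "".join(seq[i] for i in keep)
        for name, seq in sequences.items()
    }


filtered = apply_mask("110011", {"a": "ACGTAC", "b": "AC--AC"})
```

In this example the two middle columns are masked out, so both sequences are passed on as ACAC.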
3.2. Editing and Aligning Sequences
3.2.1. Manual Alignment and Sequence Editing
The GDE allows simple and easy editing, entering, and manipulation of sequences and alignments. Some of the tools to remember when attempting manual alignments with GDE include:
1. Amino acids are color-coded by chemical group.
2. Sequences can be grouped and ungrouped to allow modifications to many at once.
3. Split screens can be used to view the effects of upstream changes on downstream alignments.
4. The computer may have Cut, Copy, and Paste keys that can be used instead of menu commands.
5. Repeat counts can be used to avoid overtyping and to allow for precision with large numbers.
6. Screen properties can be adjusted to aid viewing.
7. Gaps can be inserted with -, -, or the space bar.
8. Check sequence protections (if sequences are grouped, the modification being done must be allowed for all of the sequences).
9. The insertion point (before or after the cursor) can be controlled from the Screen Properties menu (see Section 3.1.16.).
10. Save often.
11. Characters to the left or right of a gap can be "yanked" by using Control-K or Control-L. These commands drag the next-most upstream or downstream character to the position of the insertion cursor. Repeat keys can be used to yank multiple characters.
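The repeat counts mentioned in item 5 (and described in Section 3.1.11.) can be pictured with a small Python sketch. It is a simplified model, not GDE's keyboard handler: keystrokes are represented as a list of key names, the function name is hypothetical, and a flag stands in for the case in which the cursor is in a text or mask sequence, where digits are ordinary input.

```python
def expand_keystrokes(keys, numbers_are_input=False):
    """Expand repeat counts in a list of keystrokes.

    Digits typed immediately before a key set how many times that key is
    applied. If `numbers_are_input` is True (text or mask sequence),
    digits are passed through unchanged.
    """
    expanded = []
    count = ""
    for key in keys:
        if key.isdigit() and not numbers_are_input:
            count += key  # accumulate a multi-digit repeat count
            continue
        expanded.extend([key] * (int(count) if count else 1))
        count = ""
    return expanded


gaps = expand_keystrokes(["3", "-"])
moves = expand_keystrokes(["1", "2", "Right"])
```

Typing 3 followed by the gap key yields three gaps, and 12 followed by the right arrow moves the cursor twelve positions.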
3.2.2. Automated Gap Removal
This function is particularly useful when sequences being studied have been extracted from an alignment containing many additional sequences. This may leave gaps in all of the sequences being examined, which, depending on the gap size, may make analysis and viewing somewhat burdensome. In addition, it is sometimes helpful to remove gaps in sequences prior to running external programs; some programs are sensitive to gap position and gaps may influence results. Be careful to save prior to removing gaps from an important alignment; compression cannot be undone. The preserve alignment choice can be used to remove gaps only at positions where all selected sequences have gaps.
1. Select the sequence(s).
2. Choose Compress from the Edit menu.
3. Choose Preserve Alignment or Remove All Dashes.
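The difference between the two choices in step 3 can be shown with a short Python sketch. This is not the GDE implementation: the function name is hypothetical, the gap character is assumed to be a dash, and shorter sequences are assumed to be padded with gaps at the end.

```python
def compress(sequences, preserve_alignment=True, gap="-"):
    """Remove gaps from aligned sequences.

    preserve_alignment=True  removes only columns that are gaps in every
                             sequence, so the alignment is kept.
    preserve_alignment=False removes every gap from every sequence.
    """
    if not sequences:
        return {}
    if not preserve_alignment:
        return {name: seq.replace(gap, "") for name, seq in sequences.items()}
    length = max(len(seq) for seq in sequences.values())
    padded = {name: seq.ljust(length, gap) for name, seq in sequences.items()}
    keep = [
        i for i in range(length)
        if any(seq[i] != gap for seq in padded.values())
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in padded.items()}


extracted = {"a": "AC--GT-", "b": "A---GTA"}
kept = compress(extracted, preserve_alignment=True)
stripped = compress(extracted, preserve_alignment=False)
```

With Preserve Alignment only the two all-gap columns disappear, so the sequences remain aligned; with Remove All Dashes each sequence is simply stripped of its gaps.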
3.2.3. Finding Sequences
The Find command allows a user to find specific sequence strings or sequences with similarity to a particular search string. The method is described below (the Find dialog box is shown in Fig. 2).
1. Select the sequence(s) to search.
2. Select Find All from the DNA/RNA menu.
3. Type the search string into the text line.
4. Select the search features: percent mismatch allowed, case sensitivity, if U=T, match and mismatch colors.
5. Click the OK button.
6. Matches are highlighted in the alignment window. Users must scan through multiple pages of an alignment on their own to find highlighted regions.
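The search features in step 4 can be illustrated with a minimal Python sketch of a mismatch-tolerant search. It is not the algorithm used by GDE's Find command: the function and parameter names are hypothetical, and the number of mismatches allowed is assumed to be the stated percentage of the search string length, rounded down.

```python
def find_all(sequence, pattern, percent_mismatch=0,
             case_sensitive=False, u_equals_t=True):
    """Return the start positions at which `pattern` matches `sequence`."""

    def normalize(text):
        if not case_sensitive:
            text = text.upper()
        if u_equals_t:
            text = text.replace("U", "T").replace("u", "t")
        return text

    seq, pat = normalize(sequence), normalize(pattern)
    if not pat:
        return []
    allowed = len(pat) * percent_mismatch // 100  # assumed rounding rule
    hits = []
    for start in range(len(seq) - len(pat) + 1):
        window = seq[start:start + len(pat)]
        mismatches = sum(a != b for a, b in zip(window, pat))
        if mismatches <= allowed:
            hits.append(start)
    return hits
```

For example, searching ACGUACGT for acgt finds positions 0 and 4 when U=T is set, but only position 4 when it is not (unless 25% mismatch is allowed).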
3.2.4. Clustal Alignment
The clustalv multiple sequence alignment program (4) has been included as part of the GDE package. It allows multiple sequence alignments to be done with any number of sequences and allows the user to choose from a variety of alignment and output parameters. When run from the GDE window, clustalv will align sequences in the background and return the alignment in a new GDE window. Unfortunately, when doing this, some of the reference information in sequence files may be lost because clustalv has to convert between formats. Below are the instructions for running clustalv from the GDE window. It can also be run from the Unix prompt by typing clustalv followed by a carriage return. More information about clustalv is found in the Help file included with GDE (accessible by clicking Help from the clustalv dialog box). A new and improved version of this algorithm, clustalw, is available for Unix machines (5) and can be readily incorporated into GDE (see Section 4.).
1. Select the sequence(s) or region(s).
2. For DNA alignments, choose Clustal from the DNA menu.
3. For protein alignments, choose Clustal from the Protein menu.
4. Enter the alignment parameters (Ktuple size, Window size, Gap penalties).
5. Click the OK button.
6. When the alignment is done, the results will be displayed in a new GDE window.
3.2.5. Other Alignment and Editing Tools
The GDE includes a variety of additional sequence analysis tools, which will not all be described in detail. These include the following:
1. Sequence reversal: Choose Reverse from the Edit menu.
2. DNA complementation: Choose Complement from the DNA/RNA menu.
3. Changing text case: Choose Change Case from the Edit menu.
4. Diagram alignment on one page: Choose Strategy View from the Seq. Management menu.
5. Contig assembly: Choose Assemble Contigs from the Seq. Management menu.
6. Dot Plot: see Section 3.3.10.
7. Restriction analysis: Choose Restriction sites from the DNA/RNA menu.
Fig. 4. Alignment of sequence and secondary structure brackets.
3.3. Sequence Analysis
3.3.1. Translation
1. Select the sequence(s) or region(s).
2. Choose Translate from the DNA/RNA menu.
3. In the dialog box, choose minimum ORF size, reading frame(s), genetic code, amino acid abbreviation, and whether ORFs should be entered as one or as separate sequences.
4. Amino acid sequences will appear as new sequences in the same window. They will be given a name based on the name of the sequence they were translated from, with a number indicating the reading frame (Fig. 1).
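What the reading frame and minimum ORF size options mean can be shown with a minimal Python sketch. It uses the standard genetic code only and, as a simplifying assumption, treats an "ORF" as any stretch between stop codons; the translation program shipped with GDE may define ORFs differently. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering of each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=1):
    """Translate one forward reading frame (1, 2, or 3); '*' marks a stop codon."""
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2, or 3")
    dna = dna.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(frame - 1, len(dna) - 2, 3))
    # Codons containing ambiguity codes or gaps are reported as 'X'.
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def orfs(protein, min_len=1):
    """Split a translation at stop codons and keep segments of at least min_len residues."""
    return [seg for seg in protein.split("*") if len(seg) >= min_len]


for frame in (1, 2, 3):
    print(frame, translate("ATGGCCTAAATGAAA", frame))
```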
3.3.2. Secondary Structure Prediction
The MFOLD program is an RNA secondary structure prediction program designed by Michael Zuker (6). GDE is set up to use the default settings for this program and to pass the output through the Zuk_to_gen program, which converts the predicted structure to a series of nested brackets. This notation can then be used for the Highlight Helix (see Section 3.3.4.) and Draw Secondary Structure (see Section 3.3.3.) functions. Depending on the size of the sequence, the process may take a long time. To run MFOLD from the GDE window:
1. Select the sequence(s) or region(s).
2. Choose MFOLD from the DNA/RNA menu.
3. Choose linear or circular RNA.
4. Click the OK button.
5. Results will appear in a new GDE window like the one in Fig. 4.
3.3.3. Secondary Structure Drawing
This function invokes the LoopTool program to convert an alignment of a sequence and a series of brackets identifying base pairs to a drawing of a secondary structure. The helix information must be coded in a text file with the base pairs coded by a nested series of left and right brackets. This information can be typed or generated by a secondary structure prediction program (see Section 3.3.2.). The sequence with the helix information should be named HELIX. To run LoopTool from the GDE window:
1. Select the DNA or RNA sequence and the text file containing the helix information to be used.
2. Choose Draw Secondary Structure from the DNA/RNA menu.
3. The resulting structure will show up in a LoopTool window (Fig. 5).
4. Many parameters may be modified from within the LoopTool window using the pop-up menus.
Fig. 5. An example output from the LoopTool program.
3.3.4. Highlighting Helix
This function is used to identify regions of a sequence that do not fit into a predicted secondary structure.
1. Select the DNA or RNA sequence(s) and the text file with the helix information.
2. Choose Highlight Helix from the DNA/RNA menu.
3. Click the OK button.
4. Noncanonical base pairs (including G:U) will be highlighted.
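The idea behind this check can be sketched in Python: pair up the nested brackets with a stack, then report every pair that is not A:U or G:C. This is an illustration rather than GDE's implementation; accepting both square and round brackets, and the function names, are assumptions.

```python
# Watson-Crick pairs; G:U is deliberately absent so that it is reported.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def base_pairs(brackets):
    """Return sorted (i, j) index pairs for a nested bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(brackets):
        if ch in "([":
            stack.append(i)
        elif ch in ")]":
            if not stack:
                raise ValueError(f"unmatched closing bracket at position {i}")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError(f"unmatched opening bracket at position {stack[-1]}")
    return sorted(pairs)


def noncanonical_pairs(seq, brackets):
    """List (i, j, bases) for paired positions that are not A:U or G:C."""
    if len(seq) != len(brackets):
        raise ValueError("sequence and helix line must have the same length")
    seq = seq.upper().replace("T", "U")
    return [(i, j, seq[i] + seq[j])
            for i, j in base_pairs(brackets)
            if (seq[i], seq[j]) not in CANONICAL]


print(noncanonical_pairs("GGGAAAUCU", "[[[...]]]"))
```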
3.3.5. Phylogenetic Analysis
The GDE package includes two phylogenetic programs: the least squares method of De Soete (7) (as implemented by Mike Maciukenas in the lsadt program), and the comprehensive Phylip package of Joe Felsenstein (8). Phylip includes software for likelihood, distance, and parsimony methods. The De Soete method is a distance-based, tree-generating algorithm. Lsadt and all of the Phylip programs can be run from the Unix prompt. Masks can be used to conduct analysis on only those alignment positions of interest. This is an important step in all phylogenetic analyses because it allows the researcher to remove positions of an alignment for which homology of residues is uncertain. Below is a description of how to run Phylip parsimony or likelihood methods and instructions for running the De Soete method. Running Phylip distance methods includes essentially the same steps as for the parsimony methods and thus will not be described.
3.3.5.1. PHYLIP
1. Select the sequence(s) or region(s) to be analyzed.
2. Select mask (if desired).
3. Choose Phylip from the Phylogeny menu.
4. Choose the program to run (for details see the Phylip Help files accessible from the Phylogeny menu).
5. Select if bootstrapping is desired.
6. Select if a consensus tree should be made if multiple trees are generated (such as by bootstrapping or if multiple equally parsimonious trees are found in parsimony analysis).
7. Choose the method for viewing the result.
8. Click OK when done.
9. Depending on which items were selected, a series of windows will be opened and the user will be asked to input instructions into the Phylip programs that have been launched. For instructions for these programs see the Phylip Help files.
10. When each Phylip program's instructions are completed, the program will be run, and when it is done the next Phylip program needed will be launched. For example, if bootstrapping is selected, first the seqboot program's menu will be opened and, once completed, seqboot will be run to generate the bootstrapped sequence files. When this is done, the output will be loaded into the phylogeny program that has been selected and this program's menu will appear. When this menu is completed, the phylogeny program will be run. When this is done, if consensus has been selected, the trees will be loaded in by the consensus program, and so on. It may seem complicated, but doing this through GDE is much easier than doing it through the Unix command line.
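For readers unfamiliar with what the bootstrapping step produces, the following Python sketch shows the underlying resampling idea: each replicate alignment is built by drawing alignment columns at random, with replacement, until it is as long as the original. It illustrates the concept only and does not reproduce seqboot's options or file formats; the function names are hypothetical.

```python
import random


def bootstrap_alignment(seqs, rng):
    """Resample alignment columns with replacement to build one replicate."""
    n_cols = len(seqs[0])
    if any(len(s) != n_cols for s in seqs):
        raise ValueError("aligned sequences must all have the same length")
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    # The same column choices are applied to every sequence, so columns stay intact.
    return ["".join(s[c] for c in cols) for s in seqs]


def bootstrap_replicates(seqs, replicates, seed):
    """Generate several replicates from one seeded random number generator."""
    rng = random.Random(seed)
    return [bootstrap_alignment(seqs, rng) for _ in range(replicates)]


alignment = ["ACGT", "AGGT", "ACCT"]
for rep in bootstrap_replicates(alignment, replicates=3, seed=1):
    print(rep)
```

A tree is then inferred from each replicate, and the consensus step summarizes how often each grouping recurs.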
3.3.5.2. DESOETE
1. Select the sequence(s) or region(s) to be analyzed (see Note 5).
2. Select mask (if desired).
3. Choose DeSoete from the Phylogeny menu.
4. Choose distance correction method (Olsen, Jukes-Cantor, none).
5. Choose initial parameter estimate.
6. Choose random number seed.
7. Choose method of viewing (Treetool or text).
Fig. 6. An example output from the treetool program.
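As background to the distance correction choice in step 4, the Jukes-Cantor correction converts the observed proportion of differing sites, p, into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The Python sketch below computes it for a pair of aligned sequences. Skipping gapped positions is an assumption of this example, not a statement about how GDE counts differences, and the Olsen correction is not covered.

```python
import math


def p_distance(a, b):
    """Proportion of differing sites, ignoring positions where either sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion of differences p."""
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75 (saturation)")
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("ACGTACGT", "ACGAACGA")
print(p, jukes_cantor(p))
```

The corrected distance is always at least as large as p, because it accounts for multiple substitutions at the same site.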
3.3.6. Treetool
Treetool is a comprehensive phylogenetic tree drawing and manipulation program that has been incorporated into the GDE package. It allows viewing, rerooting, coloring, reshaping, and many other activities to be performed on phylogenetic trees. In addition, it allows trees to be saved in a variety of formats, including PICT for import into graphics programs. The treetool menus and functions are accessed in essentially the same way as those of the GDE window. A comprehensive Help file is included and is accessible from the Help menu button. Figure 6 shows an example treetool window.
3.3.7. E-Mail Servers
Whereas it is sometimes useful to conduct all sequence analysis on a local machine, analysis on remote machines has become an important tool for many researchers. There are now probably hundreds of locations set up for performing anonymous remote sequence analysis. Some of these have been incorporated into the current GDE release. Using remote computers, especially those set up by government or private institutions, can be very beneficial: databases are updated and improved constantly, programs are maintained and modified, and the computers are usually very fast and powerful. One of the great advantages of GDE is that new E-mail servers and World Wide Web (WWW) servers can be incorporated almost instantly by modifications to the .GDEmenus file (see Section 4.). The E-mail servers currently built into GDE include blast searches and sequence retrieval from NCBI, fasta and blitz searches through EMBL, and GeneID and Grail searches. To perform these searches, simply select the sequence(s) or region(s) of interest and select the desired function from the E-mail menu. Be careful: some of the E-mail services are designed for only one sequence at a time.
3.3.8. Local Database Searches
GDE includes two programs for local database searches: blast and fasta. In addition to running them from the GDE menu, they can also be run from the Unix prompt.
1. Select the sequence(s) or region(s).
2. For DNA or RNA searches, choose fasta, blastn, or blastx from the DNA/RNA menu.
3. For protein searches, choose fasta, blastp, tblastn, or blast3 from the Protein menu.
4. Choose matrix, number of alignments, database, and other parameters.
5. Click the OK button.
6. Results will appear in a new GDE window.
3.3.9. Variable Positions
This function allows the user to identify and highlight regions of a DNA or RNA alignment with different degrees of sequence conservation. To run the Variable Positions program, select the sequence(s) or region(s) of interest and select Variable Positions from the DNA/RNA menu. An example result is shown in Fig. 7.
Fig. 7. An example output from the Variable Positions program.
3.3.10. Dot Plot
Dot Plots are a way of quickly identifying regions of sequence similarity either within or between sequences.
Fig. 8. The Dot Plot dialog box and an example output from the Dot Plot program.
1. Select the sequence(s) or region(s).
2. Choose Dot Plot from the DNA/RNA menu.
3. Click the OK button.
4. Results will return in a Plot window (Fig. 8).
5. Properties may be altered from within this window by clicking the Properties button.
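The principle of a dot plot can be shown in a few lines of Python: a dot is placed at (i, j) whenever a short window starting at position i of one sequence matches the window starting at position j of the other closely enough. The window size and match threshold below are illustrative parameters, not the names or defaults used by the GDE Dot Plot program.

```python
def dot_plot(a, b, window=3, min_matches=3):
    """Return (i, j) positions where windows of a and b share at least min_matches identities."""
    dots = []
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.append((i, j))
    return dots


def render(a, b, dots):
    """Draw the plot as text: rows follow sequence a, columns follow sequence b."""
    grid = [["." for _ in b] for _ in a]
    for i, j in dots:
        grid[i][j] = "*"
    return "\n".join("".join(row) for row in grid)


seq1 = "ACGTTGCAACGT"
seq2 = "TTACGTTGCA"
print(render(seq1, seq2, dot_plot(seq1, seq2, window=4, min_matches=4)))
```

Runs of dots along a diagonal indicate a shared region; a sequence plotted against itself shows the main diagonal plus off-diagonal lines for internal repeats.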
3.3.11. Sequence Consensus
This program can be used to determine a consensus sequence from an alignment of DNA or RNA sequences. In addition, it can be used to generate a sequence mask. The mask generated by this program will include 0s at sites of low conservation and 1s at sites of high conservation. The cut-off degree of conservation can be set from within the dialog box. To use the sequence consensus program, select Consensus from the DNA/RNA menu.
4. User Modification: Adding Additional Programs
The majority of the GDE window is set up by an easily modifiable file called .GDEmenus. The GDE program will look for a .GDEmenus file first in the local directory in which the GDE program was started, then in the home directory of the user, and finally in the GDE_HELP_DIR. It will use the first .GDEmenus file it finds. The .GDEmenus file provides three pieces of information for the GDE program:
1. The instructions for the name and appearance of the GDE menus and of the dialog boxes that are shown once commands are selected.
2. The Unix commands that will be run after a command is selected and the dialog box is completed.
3. The format in which the selected sequence(s) or region(s) will be used as input into the Unix command.
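The search order described above (start directory, then home directory, then the GDE help directory) can be mimicked with a short Python sketch, which may be handy for working out which .GDEmenus file a given session is actually using. The function name is hypothetical, and reading the help directory from an environment variable called GDE_HELP_DIR is an assumption of this example.

```python
import os
from pathlib import Path


def find_gdemenus(start_dir, home_dir, help_dir):
    """Return the first existing .GDEmenus file in the documented search order, or None."""
    for directory in (start_dir, home_dir, help_dir):
        if directory is None:
            continue  # e.g. the help directory is not configured
        candidate = Path(directory) / ".GDEmenus"
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # Assumption: the help directory is exposed through an environment variable.
    found = find_gdemenus(os.getcwd(), Path.home(), os.environ.get("GDE_HELP_DIR"))
    print(found if found else "no .GDEmenus file found")
```

Because the first file found wins, a personal copy in the start or home directory shadows the system-wide one without modifying it.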
The language used by the .GDEmenus file is a simple menu description language that is read by the GDE program. The general approach is to set up items within a menu (or make a new menu) and then to include instructions for the desired dialog box inputs (args) that will be used to provide specific components of the Unix commands of each item. Multiple args can be used for each item and multiple items for each menu. In addition, it is necessary to specify the input and output formats that will allow the chosen programs run by the Unix commands to analyze the sequence(s) or region(s) selected. Below is a description of the line commands that can be placed in the .GDEmenus file. Text in bold must be typed in as shown. Text in normal print can be modified by the user. Each example is followed by a description of the line's use. Any Unix commands are allowed from the arg line. For more complicated programs, it may be necessary to invoke a window from which to run the program. For example, when the Phylip option is selected from the Phylogeny menu, first a GDE dialog box appears asking for a variety of inputs (see Section 3.3.5.). Once selected, these inputs are used to determine which Phylip programs to launch from a new window. In the new window, a Phylip menu for the Phylip program that has been selected will appear. When this is completed, the Phylip program will be run and the next line in the Unix command will be executed.
1. itemmenu: menu_name: This defines the name of the menu header (e.g., File, Edit; in Fig. 1). All lines between this line and the next itemmenu: line are used for this menu.
2. item: item_name: This defines each selection within a menu (e.g., Save, Open). All lines between this line and the next item: line are used for this item. As many item lines as are desired can be used.
3. itemmeta: meta_key: This defines meta-keys for machines that can use them.
4. itemhelp: help_file: This defines the name of a Help file. It should include either the file name or the path plus file name if the file is not in the GDE_HELP_DIR directory.
5. itemmethod: Unix command: This tells GDE what Unix instructions to use once the command is selected. It can be up to 256 characters in length and can include multiple commands (separated by a semicolon), embedded $ variable names (defined by the arg function described below), as well as shell scripts, background processes, and so forth. If no arguments are specified, no pop-up menu or dialog box will appear and the Unix instructions will be run automatically.
6. arg: variable_name: This defines the name of a variable that will appear in the itemmethod line. To include it in the Unix command line, use a $ before the variable name. Be careful not to have text in the Unix line with the same characters as in the variable name because the arg instructions may be placed there as well. Therefore, it is usually best to use as a variable name something other than a word or a common abbreviation. The variable itself is determined by input into the pop-up window defined by the argtype choices below. Multiple args can be included in one item (e.g., there are six in the Find command shown in Fig. 2).
7. argtype: slider, chooser, choice_menu, or text: These different types are defined in Section 3.1.1.
8. arglabel: label: This is the label that will be used for the argtype in the dialog box (e.g., Search String in Fig. 2).
9. argmin:#: This defines the minimum value for a slider (e.g., 0 in Fig. 2).
10. argmax:#: This defines the maximum value for a slider (e.g., 75 in Fig. 2).
11. argvalue:#: This defines the default value for a slider or the default choice for choosers and choice_menus (0 = first choice, 1 = second choice, and so on) (e.g., 10 in the slider in Fig. 2).
12. argtext: default: This defines the default text that is placed in a text line. It is useful for things such as E-mail addresses, file names, and printer names.
13. argchoice: displayed:passed: This is used for choosers and choice_menus. The text replacing "displayed" is the label given to the button or menu choice, and the text replacing "passed" is the actual value that is passed to the variable if that button or menu choice is selected. As many argchoices as desired can be entered.
14. in: input_file: This is the name to use in the Unix command line to represent the file of the selected sequence(s) or region(s). GDE will replace this name with a randomly generated temporary file name (invisible to the user in most cases).
15. informat: file_format: This is the format in which the sequence(s) or region(s) will be written in the input_file. It can be either Genbank or flat.
16. inmask: This tells GDE that the alignment positions of the sequences selected can be regulated by a selected sequence mask. If a mask is selected along with sequences, all alignment positions with a 0 in the mask will be removed prior to analysis. This is particularly useful for phylogenetic analysis programs.
17. insave: Do not remove the input_file after running the external program. This is useful for running programs in the background and for identifying and fixing bugs in commands.
18. out: output_file: This is the name to use in the Unix command line for output by external functions. GDE will replace this name in the Unix command line by a randomly generated temporary file name (invisible to the user in most cases). It is up to the external function to place results in this file if it is to be read back into GDE.
19. outformat: file_format: The data in the output file will be in this format (colormask, Genbank, or flat). This tells GDE what format the file will be in when it is time to read it back.
20. outsave: Do not remove the output_file after reading. This is useful for running programs in the background and for identifying and fixing bugs in commands.
21. outoverwrite: This is used to instruct GDE to overwrite sequences currently in the GDE window. It is useful for sequence alignments.
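To make the relationship between these lines concrete, the following Python sketch parses a small menu fragment written with the keywords listed above and shows how the $ variable and the in/out placeholder names end up in the final Unix command. The fragment (a "Demo" menu that counts characters) is invented for illustration and is not taken from the GDE distribution, and the substitution logic is only an approximation of what GDE does internally.

```python
# Hypothetical menu fragment using the keywords described in the list above.
MENU_TEXT = """\
itemmenu:Demo
item:Count bases
itemmethod:wc -c INPUT > OUTPUT; echo $LBL
arg:LBL
argtype:text
arglabel:Label
argtext:run1
in:INPUT
informat:flat
out:OUTPUT
outformat:flat
"""


def parse_menus(text):
    """Group keyword:value lines into menus, items, and args (assumes well-formed input)."""
    menus = []
    for raw in text.splitlines():
        if ":" not in raw:
            continue
        key, value = (part.strip() for part in raw.split(":", 1))
        if key == "itemmenu":
            menus.append({"name": value, "items": []})
        elif key == "item":
            menus[-1]["items"].append({"name": value, "args": [], "fields": {}})
        elif key == "arg":
            menus[-1]["items"][-1]["args"].append({"name": value})
        elif key == "argchoice":
            arg = menus[-1]["items"][-1]["args"][-1]
            arg.setdefault("choices", []).append(tuple(value.split(":", 1)))
        elif key.startswith("arg"):
            menus[-1]["items"][-1]["args"][-1][key] = value
        else:
            menus[-1]["items"][-1]["fields"][key] = value
    return menus


def build_command(item, values, in_path, out_path):
    """Fill $variables and the in/out placeholder names into the itemmethod line."""
    cmd = item["fields"]["itemmethod"]
    # Longest names first, so that $AB is not clobbered by a shorter variable $A.
    for name in sorted(values, key=len, reverse=True):
        cmd = cmd.replace("$" + name, str(values[name]))
    if item["fields"].get("in"):
        cmd = cmd.replace(item["fields"]["in"], in_path)
    if item["fields"].get("out"):
        cmd = cmd.replace(item["fields"]["out"], out_path)
    return cmd


item = parse_menus(MENU_TEXT)[0]["items"][0]
print(build_command(item, {"LBL": "run1"}, "/tmp/gde_in", "/tmp/gde_out"))
```

Plain text replacement of this kind is also why the warning in item 6 matters: a placeholder or variable name that happens to occur elsewhere in the command would be replaced there as well.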
.GDEmenus modifications and lines generated by others around the world are available from a variety of sites, including some of those listed in Section 1. A demo is also given in the GDE manual (see Section 5.). The GDE Menu Building program allows for quick additions of new instructions to an existing file. It is available from http://golgi.harvard.edu/.
5. Help and Information
A manual for GDE is included in the GDE package. The manual comes in Microsoft Word, text, and Postscript formats. Also included with the GDE package are Help files for most of the external programs run through GDE. Finally, additional help can be found through an electronic discussion group (send E-mail to Tim Littlejohn [email protected] to be added to the list) and from some of the web sites described above.
6. Notes
In part because it is shareware, and in part because it combines such a variety of programs, GDE is not without its share of bugs. Some problems that may be important to remember include:
1. Some of the instructions incorporated into the GDE menus may try to execute programs that are not available on the computer being used, or the programs may be in a different location than expected. Error messages generated by these commands can be found in the window from which GDE was started. If you get an error like "Command not found", it should be preceded by the command that was tried (e.g., textedit: Command not found). There are several solutions to this problem. First, if the program does exist on your system, then the path to that program was probably not set up in the user's .cshrc file. Add it. If the program is not available on your system, you can try to get it. Alternatively, if you have an equivalent program, you can edit the .GDEmenus file and insert the replacement program's name wherever the other program was listed.
2. When loading multiple sequence files, the default name of the GDE file may change to the name of one of the new files. Make sure to check the name when saving.
3. The Input Foreign Format command may be confused if there are headers (such as E-mail headers) on sequence files. It is best to remove such headers prior to importing.
4. Be careful when switching between formats: some formats do not retain all of the reference information retained by other formats (such as authors and accession numbers). In addition, some formats do not retain all of the sequence data. For example, some formats require all sequences to be of equal length, so longer sequences may be truncated when converting to these formats. Thus, unless there is some need to switch to a different format, it is probably best to save things either in GDE or Genbank format. The GDE format retains essentially all of the information seen in the GDE window, such as alignment, sequence groups, and sequence notes. For more information on the GDE format, see the GDE manual (described in Section 5.).
5. When using the DeSoete phylogenetic program, sequences cannot have a 1character in their short sequence name (the one displayed in the GDE window). The tree generating program uses this same character as part of the distance matrix input.
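The problem in Note 1 can be checked ahead of time. The Python sketch below takes the Unix command lines from a menu file, picks out the program name at the start of each semicolon-separated command, and reports those that are not on the current search path. It is a rough aid under simplifying assumptions: pipes, shell built-ins, and variables in the program position are not handled, and the example command lines are hypothetical.

```python
import shutil


def first_words(command_line):
    """Return the program name that starts each ';'-separated command."""
    names = []
    for part in command_line.split(";"):
        tokens = part.replace("(", " ").replace(")", " ").split()
        if tokens:
            names.append(tokens[0])
    return names


def missing_programs(command_lines):
    """Names that cannot be found on the current PATH, sorted and de-duplicated."""
    missing = set()
    for line in command_lines:
        for name in first_words(line):
            if shutil.which(name) is None:
                missing.add(name)
    return sorted(missing)


# Hypothetical itemmethod lines; replace them with those from your own .GDEmenus file.
lines = ["clustalv in1 > out1; textedit out1", "ls"]
print(missing_programs(lines))
```

Any name reported here would produce a "Command not found" message when the corresponding menu item is chosen, unless the path is corrected or the menu entry is edited.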
Acknowledgments
I would like to thank Steven Smith for making the GDE software package freely available and continuing to improve it. I would also like to thank all of the researchers who have made their sequence analysis programs into shareware and allowed people like myself to perform extremely powerful sequence analysis without much expense.
References
1. Smith, S. W., Overbeek, R., Woese, C. R., Gilbert, W., and Gillevet, P. M. (1994) The genetic data environment: an expandable GUI for multiple sequence analysis. CABIOS 10, 671-675.
2. Pearson, W. R. (1990) Rapid and sensitive sequence comparison with fastp and fasta. Meth. Enzymol. 183, 63-98.
3. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic local alignment search tool. J. Mol. Biol. 215, 403-410.
4. Higgins, D., Bleasby, A., and Fuchs, R. (1992) Clustal V: improved software for multiple sequence alignment. CABIOS 8, 189-191.
5. Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994) Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl. Acids Res. 22, 4673-4680.
6. Zuker, M. (1994) Prediction of RNA secondary structure by energy minimization, in Methods in Molecular Biology, vol. 25: Computer Analysis of Sequence Data (Griffin, A. M. and Griffin, H. G., eds.), Humana, Totowa, NJ, pp. 267-294.
7. De Soete, G. (1983) A least squares algorithm for fitting additive trees to proximity data. Psychometrika 48, 621-626.
8. Felsenstein, J. (1989) PHYLIP: Phylogeny inference package (Version 3.2). Cladistics 5, 164-166.
3
ABI Analysis
Manipulation of Sequence Data from the ABI DNA Sequencer
Tracy L. Hagemann and Sau-Ping Kwan
1. Introduction
The Applied Biosystems (ABI, Foster City, CA) Analysis program is Macintosh (Apple Computer, Cupertino, CA)-based software designed to analyze the data produced by an ABI DNA Sequencer. The program organizes the collected data into chromatograms and determines the sequence of the analyzed DNA. The Analysis version 1.2.1 and Sequencing Analysis version 2.1.2 programs described in this chapter were designed by Applied Biosystems. A more detailed description of this information can be found in the 373 DNA Sequencing System User's Manual (1) and the ABI PRISM DNA Sequencing Analysis Software User's Manual (2).
1.1. The ABI DNA Sequencer
In the past few years, automated DNA sequencing has become a routine technique rather than a luxury affordable only to large genome project laboratories. The ABI Sequencer is one of several laser-based automated sequencers that are presently on the market. This instrument is composed of an electrophoresis chamber for running the denaturing polyacrylamide gels, and a laser that scans the gel to detect fluorescence-labeled products as they pass through. The process depends on fluorescent phosphoramidites with varying emission wavelengths to distinguish the products of a DNA sequencing reaction. The information generated by the sequencer is collected and analyzed by a Macintosh computer.
1.2. Data Collection and Analysis Programs
Two programs are required to collect and analyze the data produced by the ABI sequencer: Data Collection and Analysis. Data Collection processes the information as it is generated and plots the four different emission signals (corresponding to the four nucleotides) over time as the gel runs. After the gel run is finished, the Data Collection program launches the Analysis program. This program integrates the raw data, normalizes the spacing, enhances the signal peaks, and uses this information to determine the parameters for calling the bases. The data is replotted together as a series of color peaks representing the nucleotide sequence and is referred to as the chromatogram or electropherogram. The results are stored in a Sample file that includes the raw data, the chromatogram, the nucleotide sequence, and the file information entered by the user. A second file, containing the sequence as text only, is also generated for each sample (this file is given the .seq suffix). This sequence text file cannot be accessed by the Analysis program and is designed for use in other applications (e.g., database searches).
There are several versions of the ABI Analysis program. However, the major differences are between versions 1 and 2. As representatives of these two versions, this chapter will discuss specifically v1.2.1 and v2.1.2. The most recent version, 2.1.2, is very similar to v2.1.0 and v2.1.1, except for corrections made to accommodate the 377 Sequencer and to improve data printout speed.
1.3. Utilization of the Data Produced by Analysis
The Analysis program will automatically print the analyzed data or chromatogram as it is processed. The chromatogram, however, may be edited further after the analysis is completed. Editing changes can range from single base calls to adjusting the lane tracking paths and reanalyzing. The program is fairly accurate, and if the ABI protocols are followed closely, usually only minor changes are necessary (see Note 9).
2. Materials
2.1. Hardware
1. The ABI DNA Sequencer is equipped with a Macintosh computer for running the Data Collection and Analysis programs. The 373 models were supplied with a Macintosh II series (cx or ci), and the newer 377 Sequencer is operated by a Power Macintosh 7100 or 7200. Analysis v1.2.1 can be used with all of these computers; version 2.1.0 or higher can be run with a Mac IIci, Quadra 630 or 650, Centris 650, or Power Macintosh.
2. The available RAM must allow both Data Collection and Analysis to be active simultaneously, since Data Collection launches Analysis before it closes. These two programs use approx 3.5 Mb, and the remaining memory must still accommodate the Macintosh System and Finder. The minimum memory requirement for the 373 Sequencer with Analysis v1.2.1 on a Macintosh System 6.0.3 with Finder is 4 Mb. System 7 requires an additional 1 Mb of memory. Sequencing Analysis v2.1.2 with a 373 Sequencer requires 8 Mb of RAM with System 7.1 or lower. Systems higher than 7.1 require additional memory.
3. Data Collection and Analysis v1.2.1 take up approx 900 K of disk space on the hard drive. Data Collection, Sequencing Analysis v2.1.2, and the Basecallers folder require over 3 Mb. The files generated by these programs (Gel File and Results folders) can use over 15 Mb following a single sequencing run.
4. The ABI DNA sequencer is supplied with a color Tektronix (Beaverton, OR) dot matrix printer that is adequate for printing chromatograms and other related files.
2.2. Software
1. This chapter will discuss Analysis versions 1.2.1 and 2.1.2. Analysis v1.2.1 is a 590 K program with a single base-calling algorithm. Sequencing Analysis v2.1.2 is a 1.2 Mb program that can utilize five different Basecallers (ABI 50, ABI 100, ABI 200, Adaptive, and Semi-Adaptive, 270 K each); however, only three of these algorithms can be used with the 373 Sequencer (ABI 100 and ABI 200 are for the 377 Sequencer only). This release of Sequencing Analysis (April, 1996) has been updated for printing enhancement, to correct for timing problems with fast runs on the 377 Sequencer, and to fix other minor bugs. The general operation of the Analysis v2 series remains the same.
2. Macintosh System 6.0.3 or higher is required to run the Data Collection and Analysis programs. The Thread Manager program is required with Analysis v2.1.0 and higher for System versions lower than 7.5.
3. Both Data Collection and Analysis have several accessory files. The program disks are supplied with installer software that generates the folders necessary for storing the various programs, and places the files in the correct folders. Data Collection and Analysis are stored together in a folder on the hard drive (373A Software Folder for v1.2.1; ABI Prism Sequencing Folder for v2.1.2). The Basecallers Folder (v2.1.2) and Utilities Folder are also stored in this folder. A second ABI Folder is placed in the System Folder, in which associated files, such as the DyeSet/Primer files and Comb files, are stored. Other files are placed in the Extensions Folder and the Preferences Folder within the System Folder.
4. TeachText (or the most recent version, SimpleText) should be supplied as part of the system software. TeachText is a simple word processing program that allows the Sample.seq text files to be opened and viewed directly.
2.3. Data
The data utilized by the Analysis program is generated by the Data Collection software. Data Collection processes the information directly from the sequencer as it scans the running gel.
1. Sample information for each gel lane is entered by the user into the Sample Sheet before starting the sequencer. This information will be incorporated into the file information or annotations in the Sample file generated by Analysis.
2. The Raw Data for Analysis consists of the fluorescence emission signals plotted over time by Data Collection.
3. The Gel Image is a color representation of what the gel would look like if the DNA banding pattern were visible. The lane tracking paths are also shown and can be adjusted for reanalyzing the data.
4. The Analyzed Data is produced by Analysis and is represented by the chromatogram.
5. The Sample files generated by Analysis contain all of the final data, including the raw data, analyzed data, and sequence and file information.
2.4. Optional
1. The Tektronix color printer that is supplied by ABI for the DNA sequencer is very expensive. Other Macintosh-compatible color inkjet printers, such as the Hewlett Packard (San Diego, CA) 1200C DeskWriter printer, can be dependable and less costly (the HP DeskWriter 550C and 560C work well with v1.2.1 and v2.1.0, but v2.1.1 and v2.1.2 may require a printer with additional memory). Laser printers can be useful for printing publication-quality results.
2. The Results folders produced by Analysis can take up to 2-3 Mb of memory and require a lot of space on the hard drive. These folders may be stored on high density disks or removable hard disk cartridges, such as SyQuest (Envisio, Saint Paul, MN) cartridges that can store over 100 Mb. Gel files use approx 10 Mb, and are overwritten on the following sequencing run. These files may also be stored; however, most users just keep the Results folders and discard the gel file after making any necessary tracking adjustments.
3. ABI has several compatible software programs for working with the chromatograms and text files, including SeqEd, Sequence Navigator, and AutoAssembler. Other sequence analysis programs, such as MacVector (Eastman Kodak Co., Rochester, NY), or even word processing programs can utilize the text files generated by Analysis. The text files are also convenient for database searches.
3. Methods Several upgrades of the ABI Analysis software have been made, but the main differences are between versions 1 and 2. This chapter will refer to versions 1.2.1 vs 2.1.2. After a sequencer run, the Data Collection program automatically opens Analysis and closes itself. On completion of analysis, the results are printed and stored on the hard drive and the Gel Image and Info windows are left on the screen. In version 1.2.1, the Results folder (labeled by the date of the analysis) is stored directly on the hard drive; m version 2.1.2 the Results folder is stored withm the ABI Prism Sequencing folder. The Analysis application is left open after a sequencing run is complete and should be closed when the user is finished revtewing the data (select Quit from the File menu). To open Analysis, double-click on the program icon (Analysis ~1.2.1 application file within the 373A SofIware folder, or Sequencing Analysrs ~2.1.2 applicatron file within the AH1 Prrsm Sequencing folder), or click on the program icon once and select Open from the File menu.
3.1. Gel File
The Gel File is an Analysis document that consists of both the Gel Image and the Gel Info. This file is stored within the 373A Software (v1.2.1) or ABI Prism Sequencing (v2.1.2) folder, which also contains the Data Collection and Analysis software. After Analysis is complete, the Gel File and Analysis application are left open. The Gel File may be reopened at a later date by double-clicking on the file icon (which will also open Analysis and, in v2.1.2, the Sample File Queue).
1. The Gel Image is a computerized version of the fluorescent DNA banding patterns within the sequencing gel. This image shows the tracking path used for the analysis of each lane. These tracking lines can be adjusted and the sequence reanalyzed (see Section 3.6.).
2. The Gel Info (v1.2.1) or Gel Sample Sheet (v2.1.2) is essentially the Sample Sheet from the Data Collection program, where the user enters pertinent comments about the samples to be analyzed in each lane of the gel.
3. In v1.2.1: The Gel Image and Gel Info windows can be opened and closed from the Window menu if the Gel File is open.
4. In v2.1.2: The Gel Image and Gel Sample Sheet windows can be accessed by opening the Gel File: select Open Gel... from the File menu or double-click on the Gel File icon. The Gel Image will open initially, and the Gel Sample Sheet can be opened by clicking on the spreadsheet button in the upper-left corner of the Gel Image window.
3.2. Sample Files
After Analysis is completed, the Sample files can be opened by double-clicking on the Results folder and then double-clicking on the desired Sample file (see Notes 1 and 2). Each Sample file is named according to the lane number, unless the file name is edited within the Data Collection Sample Sheet before the gel run. The Sample file may also be opened when Analysis is active by selecting Open (v1.2.1) or Open Sample... (v2.1.2) from the File menu and opening the Results folder from the hard drive within the resulting dialog box. The Sample file can then be highlighted and opened by clicking on the Open button. Once a Sample file is open, several windows pertaining to the sample can be accessed, including File Info (v1.2.1) or Annotations (v2.1.2), Raw Data, Analyzed Data, Sequence, and EPT. These windows (as well as the Gel Image and Gel Info) can be opened under the Window menu in v1.2.1. In version 2.1.2, all of the Sample file data is viewed from one window that appears after opening the file. Six buttons in the lower-left corner of this window (Fig. 1) toggle between the five Sample file windows listed above and an additional Feature window.
1. The File Info window in version 1.2.1 is equivalent to the Annotation window in v2.1.2. This window contains the information entered by the user in the Data Collection Sample Sheet for that sample. It also contains information generated by both Data Collection and Analysis, including the date and time collection started and stopped, the number of data points collected, the signal strength for each of the bases, the gel spacing, and various other pieces of information concerning the parameters of the sequencing run.
2. The Raw Data consists of the collected data before it is analyzed. The emission signals from each of the fluorescent-labeled phosphoramidites used in the sequencing reaction are plotted over time and shown in blue, red, green, and black for each of the fluorescent "dyes."
3. The Analyzed Data is the chromatogram. This is the actual sequence data after the signal strength is enhanced, spacing and peak height are normalized, and the bases are called. The chromatogram is also referred to as an electropherogram.
4. The Sequence is the numbered nucleotide sequence without the data. This sequence cannot be edited directly in v1.2.1; however, editing changes within the chromatogram are reflected in this window. Both the chromatogram and the sequence can be edited directly in v2.1.2 (see Section 3.4.).
5. The EPT window (electrophoretic power and temperature) is a graphical representation of the voltage (blue), current (green), power (black), and temperature (red) during electrophoresis.
6. The Feature View in version 2.1.2 is for information entered from the Factura program (Sequence Navigator and AutoAssembler).
7. In v1.2.1, for legends pertaining to the Analyzed Data, Raw Data, and EPT windows, select Legends from the View menu while that particular window is active.

Fig. 1. View control buttons taken from the Sample file window in Applied Biosystems' Sequencing Analysis v2.1.0 software. Each of the toggle buttons used to switch between the various pieces of information stored within a Sample file is indicated. The Annotation button will bring the details of the sequence run into view (e.g., sample names, user's comments, parameters of the gel run, signal strength, spacing, and so on). The Sequence view shows the nucleotide sequence. The Feature view shows information entered through Factura software. Analyzed Data consists of the chromatogram or electropherogram. Raw Data shows the signal strength of the individual emission wavelengths plotted over time, and the EPT button will exhibit the electrophoretic power (volts, amps, and watts) and temperature.
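The relationship between the four-color Analyzed Data and the called bases can be pictured with a small sketch. The Python below is a toy illustration only and is not the algorithm used by the Analysis software: it assumes the four dye signals have already been sampled at evenly spaced peak positions, and the function name call_bases, the min_signal threshold, and the example intensities are all hypothetical.

```python
# Toy illustration only: NOT the ABI Analysis basecalling algorithm.
# Assumption: each dye channel has already been normalized and sampled
# at the same evenly spaced peak positions.
BASES = "GATC"


def call_bases(traces, min_signal=50):
    """Return the strongest base at each peak position.

    traces maps each base letter to a list of intensities.
    A position is reported as N when the strongest signal is weaker
    than min_signal or ties with the second strongest signal.
    """
    length = len(traces["G"])
    called = []
    for i in range(length):
        # Sort the four channels from strongest to weakest at position i.
        signals = sorted(((traces[b][i], b) for b in BASES), reverse=True)
        (top, base), (second, _) = signals[0], signals[1]
        if top < min_signal or top == second:
            called.append("N")
        else:
            called.append(base)
    return "".join(called)


# Hypothetical intensities for five peak positions.
example = {
    "G": [900, 10, 20, 15, 30],
    "A": [20, 800, 10, 20, 30],
    "T": [15, 30, 700, 25, 10],
    "C": [10, 20, 15, 40, 650],
}
print(call_bases(example))  # GATNC
```

The fourth position is reported as N because no channel reaches the threshold, which mirrors the ambiguous calls that the editing tools in Section 3.4. are meant to resolve.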
3.3. Viewing a Chromatogram
There are several options that allow the user to adjust the portion of the chromatogram shown in the Analyzed Data window. These options are under the View menu in v1.2.1 and the Window menu in v2.1.2. The commands may also be used for the Raw Data and EPT windows.
1. The image can be magnified by selecting Zoom In from the View (v1.2.1) or Window (v2.1.2) menu.
2. The option Zoom Out essentially reverses Zoom In.
3. In v1.2.1: The Custom View selection will show the peaks from a defined set of data points and the sequence determined from that data. All of the data can be viewed sequentially while in the Custom View by sliding the horizontal scroll-bar. The data points covered by this view can be specified by selecting Change Custom View... from the View menu. A dialog box will appear, in which the user can enter the vertical and horizontal parameters.
4. In v1.2.1: The Controller panel provides the same functions. The Zoom and Custom tools can be activated by clicking the corresponding button within the Controller panel (Fig. 2). The Zoom button is a toggle switch for the Zoom In and Zoom Out tools. The Custom tool is activated by clicking on the Custom tool button. Once a tool is selected, place the cursor/tool within the chromatogram and click on the desired area for viewing.
5. In v2.1.2: The Actual Size view (Window menu) of the analyzed data is the first window that appears when opening a Sample file. This view is analogous to the Custom View in v1.2.1 and shows the data as it would appear in a printout. All of the data can be viewed sequentially by sliding the horizontal scroll-bar.
6. The Full View shows all of the collected data points. In v1.2.1, this view appears first when the chromatogram window is opened.
7. In v2.1.2: The Display Options... selection from the Window menu opens a dialog box that shows what each of the color graphs represents in the Analyzed Data, Raw Data, and EPT views. The user can select the following options with the appropriate check-boxes in the Display Options... dialog box:
a. The user can select which of the color graphs to display (e.g., a specific nucleotide color).
b. The vertical and horizontal rulers that indicate the data points monitored by the laser scan are shown by default. The rulers can be removed by clicking off the Show data points check-box.
c. The scale of the rulers may be selected in the Counts per tick box within the dialog box.
d. The vertical ruler shows the relative values of the data points with a maximum of approx 1200 points by default. The user can choose to show the actual scale (one monitor pixel per laser scan) by clicking the Show real values radio button in the Vertical Display box. To return to the relative scale, click the Show relative values radio button.

Fig. 2. The Controller Panel taken from Applied Biosystems' Analysis v1.2.1 software. This panel controls three different parameters of the Analysis program. It contains the tool buttons that control the view within the chromatogram or gel image (Zoom Tool and Custom Tool). Nucleotides can be found and edited with the Find (Again) Function and Chromatogram Sequence Editing Buttons. The tracking lines can be adjusted with the Selection, Insertion, and Lane Shift Tools.
3.4. Editing a Chromatogram
The chromatogram itself may only be edited one base at a time. The Custom View (v1.2.1) or Actual Size view (v2.1.2) is usually the most convenient for making editing changes. In version 2.1.2, the window must be unlocked before editing. The padlock button in the upper-left corner is a toggle switch for locking and unlocking the screen. To edit the chromatogram, the cursor is placed at the position to be corrected, and nucleotides may then be inserted, deleted, or substituted. Most of the Macintosh functions under the Edit menu (cut, copy, paste) are not available while in Analysis v1.2.1, and the Sequence window cannot be edited directly. However, in v2.1.2, the window can be edited and all of the Edit (cut, copy, paste) menu options are available. Additional options for editing the analyzed data include the following:
1. In v1.2.1: To use the Controller panel to edit the chromatogram, place the cursor on the correct position within the chromatogram. To delete a nucleotide, click on the trash button in the Controller panel; to insert a nucleotide, click on the G, A, T, C, or N button (Fig. 2).
2. In v1.2.1: The Delete to Last Base command in the Edit menu can be used to remove unwanted or ambiguous sequences from the end of the chromatogram. Place the cursor within the chromatogram on the last nucleotide to be retained and then select Delete to Last Base.
3. In v2.1.2: The Sample file view window shows a location indicator at the top of the window that marks the position of the cursor relative to the length of the sequence.
4. To find a particular sequence pattern within the chromatogram, select the Find... option under the Edit menu. In v2.1.2, the Find... option is only accessible when the Sequence window is active (click the CATG Sequence button [Fig. 1] in the Sample file window to activate the Sequence window). Place the cursor at the beginning of the sequence. Select Find... from the Edit menu. Enter the sequence of interest within the dialog box and click on the Find button (the Find... command is case sensitive in v1.2.1; use capital letters when entering the sequence). Once the Find... command has been selected, the Find Again command becomes available. This option will find the next match within the sequence.
5. In v1.2.1: The Controller panel can be used to find single nucleotides. Go to the Controller panel and click once on the Find button (Fig. 2). Then click on the button corresponding to the nucleotide to be found. The cursor will move within the chromatogram to the next position of that base. This method can be convenient for finding ambiguous sequences (N).
6. In v1.2.1: The chromatogram display can also be modified by selecting the Settings... option from the Edit menu. The Settings... dialog box allows the user to highlight the editing changes or show the original base calls within the chromatogram. The line types used for the peaks in the chromatogram can also be selected (color, solid; color, dot-dash; or black, dot-dash).
7. In v2.1.2: The original base call sequence can be shown in an edited chromatogram by selecting Show Original from the Sample menu. The original sequence is shown above the edited sequence.
8. Chromatograms may be printed after editing changes by selecting Print... from the File menu while the Analyzed Data window from the Sample file is active.
9. To save editing changes:
a. v1.2.1: select Save Sample 00 from the File menu. The sequence text without the chromatogram can be saved by selecting Save Sequence File...
b. v2.1.2: select the Save option from the File menu; the original file will be overwritten.
10. In v1.2.1: To close the windows from a sample file, it is best to select Close Sample 00 window from the File menu. Clicking on the close box (upper-left corner) within the windows of the Sample file will not completely close the file, which may result in problems when reopening the same file or other files.
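For readers who also work with the exported sequence text outside Analysis, the Find..., Find Again, and Delete to Last Base operations described above correspond to simple string operations. The following Python is a minimal sketch under that assumption; the function names find_all and delete_to_last_base and the example sequence are hypothetical and are not part of any ABI software.

```python
def find_all(sequence, pattern):
    """Return the 1-based start position of every match of pattern.

    The search is case sensitive, like the Find... command in v1.2.1,
    and repeated calls to str.find play the role of Find Again.
    """
    positions = []
    start = sequence.find(pattern)
    while start != -1:
        positions.append(start + 1)  # convert to 1-based numbering
        start = sequence.find(pattern, start + 1)
    return positions


def delete_to_last_base(sequence, last_base_to_keep):
    """Keep bases 1..last_base_to_keep and drop everything after them."""
    return sequence[:last_base_to_keep]


# Hypothetical base calls containing two EcoRI sites and several N calls.
seq = "GATTNCGAATTCNNGAATTCAN"

print(find_all(seq, "GAATTC"))       # [7, 15]
print(find_all(seq, "N"))            # [5, 13, 14, 22]
print(find_all(seq, "gaattc"))       # [] because the search is case sensitive
print(delete_to_last_base(seq, 20))  # GATTNCGAATTCNNGAATTC
```

Searching for N in this way is the text-file counterpart of using the Controller panel Find button to step through ambiguous calls.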
3.5. Sample Sequence Files
The Sample sequence files are separate text files for database searches or word processing applications. They are denoted in the Results folder with the suffix .seq. TeachText and SimpleText are very simple word processing programs (supplied with the System software) that allow the user to view and manipulate the text files. These sequences cannot be opened from the Analysis application. The files may be opened directly by double-clicking on the file icon. A dialog box will appear that allows the user to open the file with TeachText or SimpleText. The sequence text may be edited directly using the standard Macintosh commands (see Note 4).

3.6. Retracking Gel Lanes in Gel File
Occasionally the Analysis program has difficulties in determining the correct lane tracking path (see Notes 5 and 6). This often occurs in the outer lanes of the gel, where the DNA tends to curve in toward the bottom of the gel. Samples of DNA of the same sequence that are loaded in adjacent lanes are also difficult for the program to distinguish, since they give the same signal pattern. This problem may be prevented by staggering the loading: load the odd lanes first, and then load the even lanes. However, if tracking problems occur, they can be corrected even after analysis is complete.
1. The Gel Image shows the computer version of the DNA banding patterns and the tracking lines used in the analysis of each lane. To follow the curves of a gel lane, the Analysis program segments the tracking path by introducing nodes or bends where the path shifts. These tracking lines may be moved entirely or adjusted piece by piece.
2. To open the Gel Image while Analysis is active, select Open (v1.2.1) or Open Gel (v2.1.2) from the File menu and open the Gel File from within the dialog box, or double-click on the Gel File icon directly (the Gel File is stored in the 373 Software folder or the ABI Prism Sequencing folder for v1.2.1 and v2.1.2, respectively). The tools for adjusting the tracking and the Zoom tools are in the Controller panel (Fig. 2) for Analysis v1.2.1 and within the Gel Image window in v2.1.2. Use the Zoom In view to adjust the tracking. The Gel Image can be scrolled up and down with the vertical scroll-bar.
The tools used for adjusting the tracking lines in v1.2.1 and v2.1.2 are rather different, and the following points are split into two sections to address both versions:

3.6.1. Version 1.2.1
1. To move an entire tracking line, click on the Lane Shift tool (Fig. 2) in the Controller panel. Then drag the tracking line with the lane marker (triangle) in the top panel of the Gel Image window.
2. To move only a segment of a track, click on the arrow Selection tool (Fig. 2) in the Controller panel and highlight (click on) the lane marker triangle for the lane containing the segment to be edited. Then use the tool to highlight and drag the segment to the left or right.
3. To curve the tracking line, a node must be introduced. Click on the cross-hair Insertion tool in the Controller panel (Fig. 2) and highlight the lane marker triangle for the lane to be edited. Nodes are connected to the tracking segment above the point of insertion. To introduce a node, place the Insertion tool where the selected tracking lane should shift to and click it. A node will be placed at this point, and the tracking segments above the node will reposition to the node.
4. To delete a tracking segment node, first use the arrow Selection tool (Fig. 2) to highlight the lane marker triangle. Then click on the node or the tracking segment above the node. Select Delete Tracker Segment from the Edit menu or use the Delete key from the keyboard.
5. The Zoom Out tool from the Controller panel (Fig. 2) can be used to view the entire Gel Image within the window. Zoom In will return to the magnified sectional view.
3.6.2. Version 2.1.2
1. The Zoom button in the upper-left corner of the Gel Image window is a toggle switch. The entire Gel Image can be shown in the window by clicking on the button to zoom out. The sectioned view will be restored on zooming in by reclicking the Zoom button. These commands are also available from the Window menu. Use the Zoom In view for adjusting the tracking.
2. To move an entire tracking lane, highlight (click on) the blue lane marker triangle at the top of the panel (the triangle turns red). The remaining tracking lines will disappear and only the highlighted lane will remain. Drag the lane marker within the top panel; the triangle will not move, but the tracker line does.
3. The tracker lines are cut into segments to adjust for curves in the migration path of the DNA in each lane. The individual segments of tracking lines can be adjusted by first clicking on the marker triangle of the lane containing the segment, and then dragging the line segment within the gel image itself.
4. To further segment the tracking line, highlight the appropriate lane and then click on the Scissors tool in the upper-left corner of the Gel Image window. Place the Scissors tool within the gel image and click on the point in the track line where the cut should be positioned. The Scissors tool will automatically deactivate once a cut is made and must be reactivated (click on the Scissors tool button) for further use. The split segments may then be moved individually by dragging them to the desired position.
5. To view all of the lanes again after adjusting a single tracking line, select Show Tracker Lines from the Gel menu. Hide Tracker Lines in the same menu will remove the tracking lines from the Gel Image window.
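The idea of a tracking line built from segments joined at nodes can be sketched in a few lines of Python. This is a conceptual model only, assuming that the path runs in straight lines between nodes; the class TrackingLine, its methods, and the coordinates are hypothetical and do not describe how the Analysis software stores its tracking data.

```python
from bisect import insort


class TrackingLine:
    """A lane path stored as (scan_row, x_position) nodes.

    Assumption: the path between two neighboring nodes is a straight
    line, so a lane that curves is followed by adding more nodes.
    """

    def __init__(self, nodes):
        self.nodes = sorted(nodes)

    def shift(self, dx):
        """Move the entire tracking line sideways (Lane Shift)."""
        self.nodes = [(row, x + dx) for row, x in self.nodes]

    def insert_node(self, row, x):
        """Add a bend at the given row (Insertion or Scissors tool)."""
        insort(self.nodes, (row, x))

    def x_at(self, row):
        """Return the lateral position of the path at a given scan row."""
        nodes = self.nodes
        if row <= nodes[0][0]:
            return nodes[0][1]
        if row >= nodes[-1][0]:
            return nodes[-1][1]
        for (r0, x0), (r1, x1) in zip(nodes, nodes[1:]):
            if r0 <= row <= r1:
                # Linear interpolation along the segment between two nodes.
                return x0 + (x1 - x0) * (row - r0) / (r1 - r0)


# A straight lane at x = 100 over 1000 scan rows (hypothetical numbers).
lane = TrackingLine([(0, 100.0), (1000, 100.0)])
lane.insert_node(600, 110.0)  # the band pattern drifts right near row 600
print(lane.x_at(300))         # 105.0
lane.shift(-4.0)              # then the whole lane is moved 4 units left
print(lane.x_at(600))         # 106.0
```

The sketch shows why a single node is enough to correct a lane that bends once, while a lane that curves continuously needs several nodes.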
3.7. Reanalyzing
After adjusting the tracking lines, the lanes need to be reanalyzed. Again, the processes for reanalyzing data in v1.2.1 and v2.1.2 are rather different and have been split into two discussions.

3.7.1. Version 1.2.1
The user may wish to change the file name before reanalysis to distinguish the original sample files from the second analysis sample files. The file name must be changed before Analysis is started.
1. To change the file name, open the Gel Sample Sheet (Gel Info) window from the Window menu. Edit the File Name column within the Sample Sheet for the lanes to be reanalyzed. If the user chooses to overwrite existing Sample files, the file names may not need to be edited.
2. Select Generate New Sample Files... from the Analysis menu. The resulting dialog box gives the user the following choices:
a. Extract data from the modified lanes only or from all of the lanes.
b. Utilize the settings from the Sample Sheet or specifically designate the type of Analysis to perform after data extraction. The user may choose to call the bases and print the data after analysis is completed by checking the Print Analyzed Data box. The raw data may also be printed (click the Print Raw Data check-box).
c. Save Gel File modifications before tracking or retain the original Gel File.
d. Overwrite existing Sample Files or generate new Sample Files.
3. After selecting the appropriate settings in the Analysis dialog box, click on the OK button. If the user does not choose to overwrite the existing Sample Files, a second dialog box will ask the user for the folder in which to store the files. Open the desired folder within the dialog box (files within the folder should be displayed), and then click on the Folder button. Analysis will begin and the Sample files will be stored in this folder.
3.7.2. Version 2.1.2
After adjusting the tracking lines to the desired positions, select Generate New Sample Files from the Gel menu. The dialog box that opens will ask which samples to analyze: all lanes or modified lanes only. The user can choose to analyze the data and/or to print the sample after analysis. The settings used in the original Data Collection Sample Sheet set-up (e.g., auto-analysis, print raw data, print analyzed data) may also be selected. The check-box for Save Gel File Before Extraction is checked by default in the dialog box, and the Gel File will be overwritten unless the user unchecks the box. Clicking the OK button starts the data extraction and analysis. Once the data extraction is complete, the Sample File Queue will open and analysis will begin. The reanalyzed data is stored in the original Results folder within the ABI Prism Sequencing folder, and the Sample file is given the same name with a suffix of ".1". The Sample File Queue has several features that can be used to reanalyze data. Sequencing Analysis v2.1.2 has five different Basecallers, some of which may be more capable of analyzing difficult data than the standard algorithm.
1. a. The standard basecalling program ABI 50 is used for normal runs with the 373 Sequencer.
b. The ABI 100 and ABI 200 are designed for the 377 Sequencer. The ABI 100 is for sequencing at a rate of 100 bases/h and ABI 200 for 200 bases/h (ABI 50 reads 50 bases/h). The 377 Sequencer can run gels of various lengths at different speeds.
c. SemiAdaptive base calling measures the base spacing dynamically and does not use the standard curves that ABI 50, 100, and 200 utilize. This basecaller works well with short templates (e.g., PCR products). It also improves the base calling toward the end of the sequence.
d. The Adaptive basecaller dynamically measures both base spacing and mobility of the different phosphoramidites. It should be used when the SemiAdaptive basecaller results in a default (-12) spacing.
2. To open the Sample File Queue window (if not already active), select Show Sample File Queue from the Window menu while Analysis is active. The actual Sample File Queue is a box within the Sample File Queue window that displays the Sample files to be analyzed. A second box shows the Sample File Log, which indicates which jobs from the Sample File Queue have been completed.
3. To add files to the Sample File Queue, first check one or both of the check-boxes for Add for Printing and Add for Analysis, and then click the Add Files button (see Note 3). A dialog box appears where the files to be analyzed can be imported from various folders into the Sample File Queue. Highlight the Sample File to be added and click the Add button; the samples are transferred out of the folder and placed in line to be put in the Queue. Click Add All to analyze all of the Sample files within a particular folder. When all of the Sample Files are selected, click the Done button.
4. Samples may also be added from the Sample file window while the window is active by selecting Add This File To Queue from the Sample menu. The same command can be used to add samples directly from the Sample File Log or even the Sample File Queue within the Sample File Queue window by first highlighting the Sample file and then selecting Add This File To Queue.
5. Sample Files may be removed from the Queue by highlighting the file name and clicking the Remove Selections button.
6. Select an appropriate basecaller by clicking on the current Basecaller in the Sample File Queue window and choosing a basecaller from the pop-up menu.
7. The user can adjust the analysis settings by highlighting a Sample file within the Queue and clicking the Custom Settings... button. This button opens a dialog box where analysis Start and Stop points can be selected. The primer peak location can be changed for labeled primer sequencing. The Instrument File (matrix) and the DyeSets or Primers used can also be changed. The Sample File Queue will indicate whether custom or default settings are selected for each sample. To reassign the default settings, click on the Default button while the Sample file is highlighted.
8. To start analysis, click the Start button. The Status box at the top of the Sample File Queue window reports the progress of the Analysis program. As samples are analyzed, the files will move from the Queue to the Sample File Log and the success of the analysis is indicated (complete or failed). All of the Samples to be analyzed are processed from the Queue first. The files to be printed are processed last. As the files are completed, they will appear in the Sample File Log.
9. The Cancel button in the Sample File Queue window will stop analysis immediately.
10. The Stop button in the Sample File Queue window will stop analysis after the current sample analysis is complete.
11. To view the analyzed Sample files, highlight the Sample file within the Sample File Log and click on the chromatogram picture within the Sample File Queue window.
12. The Sample File Queue window can be activated or deactivated by selecting Show Sample File Queue or Hide Sample File Queue, respectively, from the Window menu.
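The order in which the Sample File Queue works through its jobs (all analyses first, printing last, with Stop halting only after the current sample) can be summarized with a short Python sketch. It is an illustration of the behavior described in the steps above, not a description of the program's internals; run_queue, the entry tuples, and the stop_after parameter are hypothetical names introduced here.

```python
def run_queue(entries, stop_after=None):
    """Return the Sample File Log as a list of (sample_name, task) pairs.

    entries is a list of (sample_name, add_for_analysis, add_for_printing).
    Every analysis job is processed before any print job.
    stop_after models the Stop button: the number of jobs allowed to
    finish before the queue halts (None means run to completion).
    """
    jobs = [(name, "analyze") for name, analyze, _ in entries if analyze]
    jobs += [(name, "print") for name, _, do_print in entries if do_print]

    log = []
    for job in jobs:
        if stop_after is not None and len(log) >= stop_after:
            break  # Stop: the job in progress finished, later jobs are left queued
        log.append(job)
    return log


# Hypothetical queue: Sample 02 is queued for printing only (see Note 3).
queue = [
    ("Sample 01", True, True),
    ("Sample 02", False, True),
    ("Sample 03", True, False),
]

for name, task in run_queue(queue):
    print(name, task)
# Sample 01 analyze
# Sample 03 analyze
# Sample 01 print
# Sample 02 print
```

Queuing a file with only the printing flag set, as for Sample 02 here, corresponds to printing without reanalysis.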
4. Notes
1. In Analysis v1.2.1, occasionally the program will not call the bases for a lane where the fluorescent signal is too weak. If the lane appears to be faint on the Gel Image and the tracking is correct, the sample can still be analyzed. Open the Sample file for the weak lane. Select Call Bases from the Analysis menu. A dialog box will open in which the user may select the data points to be called and whether to print the analyzed data after calling the bases. Click on OK and the Sample File will appear in the Analysis Queue and Analysis will begin. If additional files need to be analyzed, immediately click the Stop button on the Analysis Queue. Sample files can be added to the queue by repeating the process of opening the individual files and selecting the Call Bases option from the Analysis menu. Once all of the files are on the queue, click on the Start button, and analysis will continue. By preselecting Print After Calling in the Call Bases dialog box, the user can leave the computer and the data will automatically print as Analysis moves through the queue. Otherwise the Sample files must be opened individually and the Analyzed Data printed out by the Print... command in the File menu. These windows take a long time to transfer to the printer, and the user is forced to wait several minutes before printing the next file.
2. If Analysis v1.2.1 is interrupted before all of the Sample files are completely analyzed, the steps listed in Note 1 can be followed to queue the samples for analysis and printing, rather than analyzing each file separately.
3. Sequencing Analysis v2.1.2 allows Sample Files to be queued for printing without being reanalyzed. This is convenient since printing the color files is extremely time consuming. Open the Sample File Queue window (select Show Sample File Queue from the Window menu while Analysis is active) and check the Add for Printing box only (uncheck the Add for Analysis box). Then click the Add Files... button and select the files to be printed in the dialog box. The samples can then be printed from the Sample Queue by clicking the Start button.
4. While Analysis is active, the format (ABI, Intelligenetics, Staden, or Wisconsin GCG) of the sequence files generated by Analysis can be determined in the Settings... (v1.2.1) or the Preferences... (v2.1.2) dialog box from the Edit menu. This is particularly significant if you use the .seq file for your analysis; for instance, using GCG on a work station. A simple text file may require reformatting and "chopping-up" before it can be used in GCG. Ensuring that this file is in the right format can save a great deal of effort later.
5. Both Analysis versions 1.2.1 and 2.1.0 are supplied with the program Tracker Trasher in the Utilities folder. In version 2.1.2, the Tracker Trasher function is within the Gel Dot II program under the Utilities menu. Tracker Trasher allows the user to remove the tracking lines from the Gel Image of a selected Gel file. In v1.2.1, the gel can be retracked manually. In v2.1.2, the gel can be retracked by selecting Track & Extract Gel... from the Gel menu.
6. Occasionally, an error will occur during analysis that will interrupt the gel tracking and result in a partial or incomplete analysis. The Analysis application generates large files and accesses the hard drive frequently. If the hard drive is too fragmented, a system error may occur. These errors can be avoided by optimizing the hard drive at least on a monthly basis (hard drive maintenance programs include Norton Utilities, Silver Lining, Disk Express, and so on). If a system error occurs and the Gel file is incompletely tracked, Tracker Trasher may be used to remove the partial tracking lines before retracking.
7. Additional utilities within the program Gel Dot II (Analysis v2.1.2 or upgrade for v1.2.1) can salvage Gel files that were interrupted by computer errors early in the analysis process. All other applications should be closed before using Gel Dot II since it requires 6 Mb of memory.
8. With the 373 Sequencer only, if the Gel file cannot be opened and there is no Results folder (essentially no analysis for a particular gel run), the Repair Resources (-39)... option under the Gel Dot II Utilities menu can be used to repair damaged Gel file resource information (e.g., the matrix, comb file, tracking, and so on; -39 system error in the Error Log). The Repair Resources (-39)... option requires the user to initiate Data Collection and a dummy Gel file. The program then combines the resources of the dummy file with the data from the damaged Gel file (this utility requires several steps that are described in more detail in the user's manual [2]). The Gel Dot II utility Build Gel Image (-40)... will generate a Gel Image from a Gel file with partially analyzed data (-40 system error in the Error Log). The Remove Gel Image... or Shrink Gel File... utility in Gel Dot II is used to reduce the memory required to store a Gel file. The Gel Image may then be restored by selecting the Build Gel Image utility.
9. Because of the high volume of sequence data being generated through the Human Genome Project, and the time-consuming process of editing, finishing, and transferring ABI analysis files to a Unix work station for further analysis, the genome centers have generated Unix-based software to analyze ABI sequence data directly. Bass and Grace (Lawrence S. Stein, et al., Whitehead Institute/MIT, Cambridge, MA; Stanford University, Palo Alto, CA) are Unix programs designed specifically to perform lane tracking and base-calling for ABI data. Programs such as PHRED, PHRAP, and CONSED (Phillip Green et al., Washington University School of Medicine, St. Louis, MO) are designed to generate basecalls and further estimate sequence quality so that the sequence can be improved in areas of difficulty, such as repeat sequences or data at the end of a gel run. The sequence is automatically run through final editing steps and assembly with very little time required from the user.
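As a companion to Note 4 and Section 3.5., the kind of clean-up that a plain sequence text file may need before it is passed to another program can be sketched as follows. This Python example is an assumption-laden illustration: it does not produce GCG, Staden, or Intelligenetics format, and the function name tidy_sequence, the 60-character line width, and the sample text are all hypothetical.

```python
def tidy_sequence(raw_text, width=60):
    """Return the sequence in capitals, wrapped to fixed-width lines.

    Anything that is not a letter (spaces, line breaks, position
    numbers) is discarded. This is generic clean-up only and does not
    write the header or checksum that a specific format may require.
    """
    letters = "".join(ch for ch in raw_text.upper() if ch.isalpha())
    lines = [letters[i:i + width] for i in range(0, len(letters), width)]
    return "\n".join(lines)


# Hypothetical text copied from a word processor, with stray numbering.
raw = "gattaca n\n123 ccgg"
print(tidy_sequence(raw, width=5))
# GATTA
# CANCC
# GG
```

Selecting the correct output format in the Settings... or Preferences... dialog box, as Note 4 recommends, avoids the need for this kind of reformatting altogether.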
References
1. 373 DNA Sequencing System User's Manual. Applied Biosystems, Foster City, CA.
2. ABI PRISM DNA Sequencing Analysis Software User's Manual. Applied Biosystems, Foster City, CA.
4
SeqEd: Manipulation of Sequence Data and Chromatograms from the ABI DNA Sequencer Analysis Files
Tracy L. Hagemann and Sau-Ping Kwan
1. Introduction
SeqEd is a Macintosh (Apple Computer, Cupertino, CA)-based program designed by Applied Biosystems (ABI, Foster City, CA) to view, edit, align, and compare sequence data generated by an ABI 373 or 377 automated DNA sequencer. The sample sequence files generated by the Analysis program can be edited directly, but the SeqEd program generates layouts composed of one or more sequences. The program then allows the sequences within a layout to be aligned, compared, and further edited. In addition, SeqEd can create reverse complement sequences, reverse chromatograms, and amino acid translations in all three reading frames, and the program will search for specific nucleotide patterns, such as vector sequences or restriction sites.
1.1. The ABI DNA Sequencer
The ABI sequencer, as well as other automated sequencers, utilizes a scanning laser to detect phosphoramidite-labeled sequencing reaction products as they electrophorese through a denaturing polyacrylamide sequencing gel. The data generated by the sequencer is processed by software on a Macintosh computer. The SeqEd software was designed by ABI to further manipulate the analyzed sequence generated by the application programs that collect and process the raw data.
1.2. Data Collection and Analysis Programs
Two application programs are required to collect and analyze the data points produced by the ABI sequencer, and the programs are appropriately
designated Data Collection and Analysis. Data Collection processes the bits of data as they are generated by the sequencer. The raw data can be monitored in three window formats as it is collected:
1. The Scan window, which shows only the immediate scanning pattern for each of the emission wavelengths recognized from the sequencing gel (four colors corresponding to the four nucleotides);
2. The Map window, which shows a history or series of the recent scanning patterns generated; and
3. The Gel window, which is a color depiction of the banding pattern that would result if the DNA could be visually detected within the gel.
The Analysis program takes these bits of data after they are plotted over time, and integrates the scanning patterns derived from the four emission wavelengths to determine the appropriate parameters to use in calling the bases and generating the analyzed sequence data.
1.3. Utilization of the Analysis Sequence Data
The Analysis program stores (outputs) the processed sequence data from each of the lanes of a sequencing gel into two different files: a sample file and a sequence file. The sample file contains all of the information pertaining to a particular sequence run, including the chromatogram (electropherogram), nucleotide sequence, and file information typed in by the user before beginning a sequencer run. Chromatograms consist of the integrated scanning patterns for each of the nucleotides (emission wavelengths) plotted over time, and appear as a series of evenly spaced peaks of different colors that represent the four bases; this is the analyzed data. The sequence files contain only the nucleotide sequence as a text file and in fact cannot be accessed through the Analysis program after they are generated. These files are easily transferred to databases or used in other applications (e.g., MacVector [Eastman Kodak Co., Rochester, NY]); however, SeqEd allows the direct manipulation of the sample file and the chromatogram, which can be extremely beneficial when comparing sequences with ambiguity.
2. Materials
2.1. Hardware
1. The SeqEd program may be used on the same Macintosh model that is included with the ABI 373 or 377 sequencer. The 373 models were equipped with a Macintosh II family computer (usually cx or ci), and the 377 is operated with a Power Macintosh. However, SeqEd is designed to perform on any Macintosh that is a model SE or newer.
2. Minimum memory requirements are 4 Mb RAM with system 7 software or MultiFinder. An available 3 Mb may be adequate for system 6.0.3 with Finder. Analysis of longer sequences or multiple alignments may require more memory.
3. The SeqEd software and associated files take up approx 1.6 Mb of disk space on the hard drive.
4. The ABI DNA sequencer is supplied with a color Tektronix (Beaverton, OR) dot matrix printer that is adequate for printing SeqEd results; however, other Macintosh-compatible inkjet or laser printers may be used.
2.2. Software
1. This chapter will discuss SeqEd 675 version 1.0.3.
2. Macintosh system 6.0.3 or higher is necessary to operate SeqEd.
3. The SeqEd software package comes with two folders: the SeqEd folder and the Sample Data folder. The SeqEd folder should be installed on the hard drive, and the Sample Data folder can subsequently be installed into the SeqEd folder.
2.3. Data
1. Sample files generated by the ABI Analysis program are the primary source of data for which SeqEd was designed. These files include the DNA sequence and the chromatogram of the analyzed data.
2. Other data sources include any sequence in a text format. Staden, GCG, and Intelligenetics (IG Suite) formats may also be used.
3. Sequence data may be typed into layouts directly.
2.4. Optional
1. The Tektronix printer that is supplied by ABI with the Sequencer is rather expensive. Other relatively inexpensive color inkjet printers that may also be used with the Sequencer are the HP DeskWriter 550C, DeskWriter 560C, and DeskJet 1200C. Additionally, laser printers are useful for publication-quality printouts.
2. Whereas the SeqEd software takes up over 1.6 Mb, the layout or sequence files generated are usually under 20 K. Removable floppy disks are convenient for storing SeqEd files rather than storing them on the hard drive.
3. SeqEd can import and export files in Staden, GCG, and Intelligenetics formats. Software packages utilizing these formats may also be compatible. The text files can be used by word processing programs.
3. Methods
3.1. Opening the SeqEd Program and an Untitled Layout
1. Launch the SeqEd program by opening the SeqEd folder on the hard drive and double-clicking on the SeqEd file icon, or by highlighting the file and selecting Open from the File menu (see Note 1). An empty, “untitled” layout will appear in which sequences may be typed or imported. The Layout is the window into which sequences are imported, aligned, and compared. When starting a new layout, only two options will be available under the Sequences menu.
2. To enter a new sequence, select New Sequence from the Sequences menu. A new empty sequence appears for entering data by typing it manually (see Note 2).
If a sequence is already present in the layout, this option will also allow that sequence to be copied.
3. To import a sequence, select Import Sequence from the Sequences menu. This option allows sequence files from other sources (e.g., Analysis files) to be introduced to the layout. After choosing this option, a dialog box opens asking for the file to be imported (see Note 3).
3.2. View Menu Preferences
The layout is divided into two panels: the ID panel and the Main panel.
1. The ID Panel on the left is used to indicate a particular sequence. The contents of the ID panel can be determined by the user by selecting Preferences (ID Panel)... under the View menu. The dialog box that opens allows the viewer to select the size of the panel (the number of characters displayed in the name field) and whether to show ID numbers, reverse complement indicators, residue positions, filenames, or nicknames.
2. The Main Panel is on the right side of the partitioned layout. It can be divided further into an upper panel and a lower panel by dragging down the partition line from the top of the vertical scroll-bar (move the cursor above the scroll-bar and watch for the arrow tool to change). This allows the user to view different regions of the layout simultaneously. The sequence display can be customized by selecting Preferences (Main Panel)... under the View menu. From the Preferences dialog box, users can select whether or not to show rulers and the ruler spacing. They can choose to wrap the sequence according to the size of the layout window or give each row a specific length. The characters indicating sequence gaps, matches, mismatches, and so on can be determined, and the number of nucleotides or amino acids to be grouped together (e.g., 10 or 3) may also be chosen.
3. Printout headings are adjusted under the View menu by selecting Preferences (Printer).... The dialog box that appears allows the user to design and display a header on a printed layout.
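The effect of the Main Panel grouping and row-length preferences can be pictured with a short Python sketch. This is only an illustration of the display idea, not SeqEd code; the function name `format_sequence` and its parameters are hypothetical.

```python
def format_sequence(seq, group=10, row_length=50):
    """Return display rows: a 1-based start position, then residues in groups.

    `group` mimics the "number of residues grouped together" preference and
    `row_length` mimics a fixed row length (both names are assumptions).
    """
    rows = []
    for start in range(0, len(seq), row_length):
        row = seq[start:start + row_length]
        # Split the row into space-separated blocks of `group` residues.
        groups = [row[i:i + group] for i in range(0, len(row), group)]
        rows.append(f"{start + 1:>6} " + " ".join(groups))
    return rows


if __name__ == "__main__":
    for line in format_sequence("ACGT" * 6, group=10, row_length=20):
        print(line)
```

Setting `group=3` gives a codon-style display, which corresponds to the "10 or 3" choice mentioned in the text.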
3.3. Editing Sequence Data and Chromatograms
1. The sequences within the main panel of the Layout can be edited directly by placing the cursor at the desired position and using the keyboard to insert and delete sequences.
2. Most of the options under the Sequences menu are not active unless a sequence is designated within the layout, including the option of displaying a chromatogram. The chromatogram of an Analysis Sample file can be accessed by clicking on the sequence within the ID panel, so that each row of the sequence becomes highlighted (outlined in black within the ID panel). Then select Display Chromatogram under the Sequences menu, and the entire chromatogram of the designated sequence will appear in a separate window below the layout (see Note 4). Three control buttons appear in the lower-left corner of the chromatogram window. These are the Zoom In, Zoom Out, and Custom View tools.
a. By clicking on the Zoom In (+) button, the Zoom tool can be used to increase the scale of the viewed area. Click on the chromatogram and the image will be magnified. A specific region of the chromatogram can also be highlighted by clicking and dragging the tool before releasing the mouse button. Then the designated area is enlarged by clicking within the outlined section, which will subsequently fill the window.
b. The Zoom Out tool (-) is for reducing the scale of the viewed area.
c. The Custom View tool (C) is for generating a custom view that will show the signal peaks and the bases for a certain region of the sequence. The custom setting is the most practical format for viewing the chromatogram since the entire signal peak can be seen for each nucleotide.
3. To edit either the Layout or the Chromatogram, the appropriate window must first be active. The Edit menu offers the typical editing options of Cut, Copy, Paste, and Clear. These commands may be used in the main panel of the Layout by highlighting the region of the sequence to be edited. The Edit menu is not available from the chromatogram window, but single nucleotide changes can be made by placing the cursor on the sequence within the chromatogram and typing the corrections. Editing changes made in either the Layout or the chromatogram will be reflected in both windows. Additional commands in the Edit menu include Insert gaps... and Gather gaps....
a. To insert gaps, place the cursor at the point of insertion and then select the function Insert gaps... from the Edit menu. Enter the number of gaps (-) to insert in the dialog box and click the OK button.
b. To gather gaps, the region of the sequence containing the gaps must be highlighted. Then select Gather gaps... under the Edit menu. The dialog box that appears will ask whether to place the gaps to the left or the right of the sequence.
Essentially, this function moves all of the gaps within the highlighted sequence to either side of the selected sequence. This function is convenient for removing the gaps after an alignment (see Section 3.7.).
4. The Find menu allows specific patterns or regions to be located within a sequence. Place the cursor at the beginning of a sequence in the main panel. Under the Find menu there are three options: Find..., Replace..., and Go To....
a. Find... allows a specific sequence to be designated and the stringency of the match to be chosen. Type a sequence pattern in the Find what? text box and click the Find button within the dialog box (the search will be case sensitive unless the user checks the Case Insensitive box). The cursor within the layout will move to the first match within the sequence that was selected.
b. Once the Find command has been used, the Find Same Again option can be selected from the Find menu to identify the next match within the sequence.
c. The Replace... and Replace Same Again options are similar to Find... except that a second sequence is defined to replace the match when it is identified.
d. The Go To... option moves the cursor to an exact nucleotide within the sequence according to its number or position. Only the Go To... option is available under the Find menu while the chromatogram window is active.
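The Insert gaps... and Gather gaps... commands described in step 3 amount to simple string operations. The Python sketch below illustrates the idea only; the function names and the 0-based indexing are assumptions made for the example and do not describe how SeqEd itself is implemented.

```python
GAP = "-"  # gap character, as shown in the text


def insert_gaps(seq, position, count):
    """Insert `count` gap characters before 0-based index `position`."""
    return seq[:position] + GAP * count + seq[position:]


def gather_gaps(seq, start, end, side="left"):
    """Move every gap inside seq[start:end] to one side of that region.

    The residues keep their order; only the gaps are collected.
    """
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    region = seq[start:end]
    residues = region.replace(GAP, "")
    gaps = GAP * (len(region) - len(residues))
    gathered = gaps + residues if side == "left" else residues + gaps
    return seq[:start] + gathered + seq[end:]


if __name__ == "__main__":
    print(insert_gaps("ACGT", 2, 3))
    print(gather_gaps("AC-G-TA", 0, 7, side="right"))
```

Gathering the gaps to one end of a highlighted region leaves them in a single run, which is why the command is convenient for cleaning up after an alignment.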
3.4. Saving Edited Sequences
To save an edited sequence file from within the layout, the user must first highlight (click on) that sequence within the ID panel. There are two choices for saving edited sequences under the Sequences menu once a sequence is selected.
1. The option Save to Files... is analogous to the Save command for most Macintosh applications; however, this command can only be used when working with an ABI Analysis file. A dialog box will appear that asks whether to save the sequence as a text file or a sample file. The Save to Sample File option within the dialog box will automatically overwrite the existing Sample file. The Save to Text Files option can only be used with sequences derived from an ABI sequence text file. The original text file may be replaced or renamed (old) so that it remains intact. The saved files will be placed in the original Results folder from which the file was first imported.
2. The Export Sequences option is similar to the Save As... command for Macintosh. After choosing to export a sequence under the Sequences menu, a dialog box appears where the file name and folder are selected. The format of the file may also be designated. The ABI Analysis Sample files may be stored as ChromatoRefs so that both the sequence and chromatogram are saved within the same file as a SeqEd document. (SeqEd documents may be opened directly by double-clicking on the icon, which will automatically launch the SeqEd program.) A sequence may also be exported as a text file or in the formats required for Staden, GCG, or Intelligenetics; however, to save the file as an Intelligenetics or GCG file, the sequence must first be given a nickname. Highlight the sequence and select Set Sequence NickName under the Sequences menu. A dialog box will then be displayed in which a nickname may be typed. Once a nickname is assigned, the file can be exported in the Intelligenetics or GCG formats. The files will appear as document files and can either be imported into a layout or opened with a word processing application or TeachText.
3.5. Generating a Layout
Several sequences may be added to a Layout by importing additional sequences. As the new sequences are introduced, they are assigned sequential ID numbers that are shown in the ID panel and used to designate a particular sequence when performing comparisons and alignments. To manipulate the sequences within a Layout beyond general editing, the desired sequence must first be selected by highlighting that sequence within the ID panel. Sequences may also be selected under the Edit menu with the options Select All Sequences or Select Modified Sequences (edited sequences). The following options under the Sequences menu may be used once a sequence is selected:
1. Sequences may be deleted from a layout by highlighting the sequence within the ID panel and selecting Remove Sequences from the Sequences menu. If the
selected sequence has been modified, a dialog box will appear asking whether to export the sequence to prevent editing changes from being lost. As sequences are introduced and removed, the remaining sequences can be renumbered by choosing Renumber Sequences under the View menu. The order of appearance of the sequences may be adjusted by moving the sequences within the ID panel. Click on the sequence to be moved and drag it to the desired position within the ID panel.
2. The option Create Shadow(s)... from the Sequences menu allows several different types of comparisons between two sequences to be made. Shadow sequences consist of symbols that are generated to show the gaps (-), mismatches (*), matches (-), deletions (D), and insertions (I) between two sequences after a comparison is made. These sequences cannot be edited directly and are referred to as shadows since their content depends entirely on the compared sequences. As editing changes are made in the two sequences being compared, the shadow sequence follows to reflect these changes. The symbols used in the shadow sequences may be altered by selecting the Preferences (Main Panel)... option under the View menu. The following are the various options available for creating shadow sequences in the Create Shadow(s)... pop-up menu:
a. The Compare Sequence(s) to Reference option is only available if a sequence is first designated as a reference. Choose the option Designate Reference Sequence under the Sequences menu and indicate the reference sequence ID number in the dialog box. Select (click on) a second sequence within the ID panel to compare to the reference. Then choose the Compare Sequence(s) to Reference option from the Create Shadow(s)... pop-up menu. The shadow sequence is placed immediately under the selected sequence, not the reference sequence. This option is useful for comparing several sequences to the reference by selecting all of the sequences within the layout ID panel.
b. Comparing Two Sequences... is the only option that can be used without selecting a sequence within the ID panel first. A dialog box appears and asks which sequences are to be compared. The shadow sequence is then placed between the two sequences.
c. To compute ambiguity, unanimity, or consensus sequences, more than one sequence must be selected (Select All Sequences or Select Modified Sequences under the Edit menu). Compute Ambiguity Sequence generates a shadow that shows the matches (-) and mismatches (*) only. Compute Unanimity Sequence will show mismatches as asterisks and matches as the nucleotide or amino acid that is matched. Compute Consensus Sequence generates a consensus sequence composed of nucleotides and IUB codes or question marks for no match.
d. Translate Codons to Amino Acids... will open a dialog box that allows the reading frame and the type of amino acid symbol (three letter or single letter) to be selected. Translations using mitochondrial (mammalian or yeast) codons may also be selected.
3. Once a shadow is created, it can be frozen by highlighting the shadow sequence ID and selecting Freeze Shadows under the Sequences menu. After a shadow is frozen, it will no longer reflect editing changes in the sequences it was generated from and can itself be edited. The frozen shadow sequence can also be aligned and compared, which is especially beneficial for comparing translated amino acid sequences.
4. Offset Sequence... moves the selected sequence relative to the ruler and the other sequences in the layout. Select the number of positions to move the sequence and which way to move it (left or right) in the dialog box.
5. Rotate Circular Sequence... is similar to the Offset Sequence... command. A dialog box appears and asks to rotate the sequence to the left or right and by how many nucleotides. Rotating to the right moves sequences from the 3’ end to the 5’ end, and rotating to the left moves sequences from the 5’ end to the 3’ end, as if the sequence were circular. The ABI Analysis sequences cannot be rotated since the program assumes that they are not circular.
6. Restore Chromatogram... allows the original chromatogram to be returned to a sequence that was exported in another format and then imported into the layout.
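Two of the operations in this section, the ambiguity and unanimity shadows and the rotation of a circular sequence, can be sketched in a few lines of Python. This is an illustration under stated assumptions: the sequences are taken to be already aligned and of equal length, the default symbols follow the ones quoted in the text, and the function names are hypothetical rather than anything from SeqEd.

```python
def ambiguity_shadow(seqs, match="-", mismatch="*"):
    """Mark each aligned column as a match or a mismatch across all sequences."""
    return "".join(match if len(set(col)) == 1 else mismatch
                   for col in zip(*seqs))


def unanimity_shadow(seqs, mismatch="*"):
    """Show the shared residue where all sequences agree, else a mismatch symbol."""
    return "".join(col[0] if len(set(col)) == 1 else mismatch
                   for col in zip(*seqs))


def rotate_circular(seq, n, direction="right"):
    """Rotate a circular sequence by n residues.

    'right' moves residues from the 3' end to the 5' end;
    'left' moves residues from the 5' end to the 3' end.
    """
    if not seq:
        return seq
    n %= len(seq)
    if direction == "right":
        return seq[-n:] + seq[:-n] if n else seq
    return seq[n:] + seq[:n]


if __name__ == "__main__":
    print(ambiguity_shadow(["ACGT", "ACCT"]))
    print(unanimity_shadow(["ACGT", "ACCT", "ACGT"]))
    print(rotate_circular("AACCGGTT", 2, "right"))
```

Because a live shadow is recomputed from its parent sequences, calling these functions again after an edit mirrors the way an unfrozen shadow follows editing changes.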
3.6. Saving and Printing Layouts
The save and print options for SeqEd are standard Macintosh commands. To save a Layout, select Save Layout As... under the File menu. This will open a dialog box asking for the name of the layout and the folder in which to store it. Selecting the Save Layout option will save any editing changes over a previously saved layout; the original file is not preserved. The entire Layout can be printed by selecting Print Layout..., or the portion of the Layout within the window can be printed by selecting Print Window....
3.7. Aligning Sequences
The alignment algorithms for SeqEd are briefly outlined below. Three algorithms are available under the Align menu. The Comparative and Overlap algorithms are used to align two sequences, and the Multiple option aligns two or more sequences (see Note 5).
1. A Comparative... alignment is most appropriate for sequences that are fairly similar. Selecting this option opens a dialog box that allows the stringency of the comparison to be set by designating penalties for mismatches, gaps, and gap length.
2. An Overlap... alignment compares the 3’ end of the first sequence with the 5’ end of the second sequence and can be useful for generating small contigs of sequence data.
3. The Multiple alignment may only be used when two or more sequences are selected within the ID panel. It is an exact algorithm that sets up a matrix comprised of all possible matches for each position, and then chooses the best path through the matrix for the alignment. This algorithm requires the most memory and is the most time consuming.
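The idea of filling a score matrix and then choosing the best path through it can be illustrated with a generic pairwise global alignment (Needleman-Wunsch style dynamic programming). The sketch below is not SeqEd's algorithm, whose details are not given here; the scoring values, the single linear gap penalty, and the function name `global_align` are all assumptions chosen for the example.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Align two sequences end to end; return (aligned_a, aligned_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # leading gaps in b
    for j in range(1, cols):
        score[0][j] = j * gap          # leading gaps in a

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the matrix: each cell holds the best score for the two prefixes.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace the best path back from the bottom-right corner.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]


if __name__ == "__main__":
    print(global_align("ACGT", "ACT"))
```

The matrix has (len(a) + 1) x (len(b) + 1) cells, which gives a feel for why memory use grows quickly with sequence length, as Note 5 warns for multiple alignments.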
Once an alignment of sequences has been performed, the sequences can be analyzed further with the Create Shadow(s)... option from the Sequences menu.
4. Notes
1. The SeqEd program may also be initiated by clicking on a saved Layout file (SeqEd document). This approach will start SeqEd and open the previously saved layout for further editing and manipulation.
2. While typing in sequences over 700-800 bp, the program generates an error message informing the user that because of a fault in the program, no more sequences can be added. When this message appears, additional sequences may be added by deleting the last 10-20 bp from the sequence and retyping. The program will then allow additional sequence to be entered; however, the message may appear again. Repeat the same steps to continue adding sequence data.
3. Analysis Results folders contain both Analysis sequence and sample files that can be imported from within the Import Sequence dialog box (from the Sequences menu). To view and edit the analyzed data (chromatogram), import the Sample file; the Sequence file (SampleOO.Seq) only contains text. Any compatible files on the desktop, including those previously stored in the SeqEd folder, are usually accessible and may be imported by SeqEd. SeqEd can also access Staden, GCG, and IG Suite files, although some editing may be required to remove textual annotations.
4. By holding the option key while selecting Display Chromatogram, the chromatograms can be displayed side-by-side at half width rather than full width, allowing several chromatograms to be viewed more easily.
5. Multiple alignments of long sequences require a large amount of memory. If an alignment cannot be performed because of insufficient memory, highlight the SeqEd application file icon (click on it once) and select Get Info... from the File menu before opening the application. In the lower-right corner of the dialog box, the allocated memory can be increased within the text box. This may be useful when working with long sequences or multiple alignments of several sequences.
Reference 1. SeqEd 675 DNA Sequence Editor User’s Manual. Applied Biosystems, Foster City, CA.
5 From ABI Sequence Data to LASERGENE’s EDITSEQ
Catherine Arnold and Jonathan P. Clewley
1. Introduction
This chapter describes the analysis and assembly of sequence data generated by the Applied Biosystems (ABI, Foster City, CA) automated sequencer after sequencing PCR products using Taq dye terminators (e.g., with the PRISM DyeDeoxy Terminator cycle sequencing kit). We use Taq dye terminator chemistry rather than dye primer sequencing because we are currently interested in sequencing PCR products of viral genomes amplified from clinical specimens (1,2). We prefer to sequence PCR products directly rather than after cloning them to avoid seeing unrepresentative sequences. To be able to look at the distribution of sequences in a population (the quasispecies), we dilute the starting specimen until amplification is achieved from a single molecule (3). Alternatively, if the sequence population is relatively homogeneous, sequencing from a bulk PCR amplification will be accurate since Taq misincorporation errors will be effectively diluted out. Dye terminator sequencing is more flexible than dye primer sequencing and allows the use of unmodified primers in the sequencing reaction. Although dye terminator sequencing is a more versatile method for sequencing PCR products when compared with dye primers, the different chemistries have different effects on the appearance of the data. When using dye terminators, as for all sequencing methods, it is essential to sequence the PCR product in both directions because of the low signal of some peaks in a chromatogram following certain sequences of nucleotides (listed below). If the PCR product is sequenced in one direction across such a sequence, and the resulting base(s) is indeterminate, sequencing of the opposite strand should resolve any ambiguities.
2. Materials
2.1. Macintosh Hardware
1. Any Macintosh computer installed with a floating point unit. The ABI sequencer comes with a Macintosh, in our case a IIci. We have successfully used a IIci, a IIvx, an SE30, an LC475 with SoftwareFPU installed (John Neil & Associates, Cupertino, CA), and a PowerPC with LASERGENE.
2. Minimum memory requirements of 4 Mb RAM (8 Mb RAM or more is recommended).
3. Minimum free hard disk space of 25 Mb. More may be required because of the creation of temporary files.
4. Macintosh-compatible monitor (256-color monitor is recommended).
5. Macintosh-compatible printer (laser printers are recommended).
6. A SyQuest drive is attached on which the gel file (about 20 Mb) is saved (see Note 1).
2.2. Macintosh Software
1. Macintosh system software 6.0.1 or higher.
2. The LASERGENE application EDITSEQ. Follow the instructions provided by DNASTAR, Inc. for installing the LASERGENE software.
The analyzed data is transferred to other Macintoshes for further sequence interpretation using LASERGENE and GeneJockey. For sequence alignment
and phylogenetic tree generation, we use MEGALIGN (see Chapter 8), Clustal (4), Phylip (5), and MACAW (6). Restriction site analysis is with MAPDRAW (see Chapter 18) or GeneJockey. The PCR primers are chosen using PRIMERSELECT (see Chapter 23) or OLIGO (7). For analysis of DNA/RNA folding we have used Mulfold and MacDNASIS (8,9). Open reading frames (ORFs) are most easily and conveniently found with GeneJockey. The protein analysis is with PROTEAN (see Chapter 17). For viewing proteins whose structure has been solved we use RasMol (10) and MacMolecule (Richard B. Hallick, Department of Biochemistry, Bioscience West 524, University of Arizona, Tucson, AZ, 85721, USA; E-mail: [email protected]). The analyzed sequence data can be transferred to other platforms, e.g., LASERGENE under Windows or Unix boxes running GCG over a network. The principal programs discussed here are ABI Analysis v1.2, SeqEd v1.03 (ABI), and LASERGENE EDITSEQ.
3. Methods
3.1. Preparation of Amplicons for Sequencing
The PCR products are cleaned up by removing them from underneath the oil overlay and are purified of excess primers and nucleotides, either by using molecular weight cutoff columns (Centricon [11]), or by preparative gel electrophoresis followed by glass-milk purification or elution onto DEAE membranes. Biotinylated PCR products can be purified using streptavidin-coated magnetic beads. The concentration of the purified amplicon is estimated by eye after running on an ethidium-stained agarose gel with known standards. It is important to quantify the purified amplicon fairly accurately because best results are obtained with a 4:1 primer:template ratio, using 3.2 pmol primer in a terminator reaction. This can be calculated from the size of the PCR product: For 3.2 pmol of primer, divide the number of base pairs in the PCR product (amplicon) by two and use that amount of DNA in nanograms. After cycling, unincorporated dye terminators are removed from the sequencing reactions by phenol extraction, spin column purification, or CTAB purification. The samples are loaded onto a 6% sequencing gel and run overnight.
3.2. Data Analysis
When the run is finished, the Analysis program is launched (see “Analyzed data” below). The gel file is displayed once Analysis has finished, and can be viewed by using the scroll-bars. Analysis should find the “line of best fit” through each lane of data, seen as the gray tracker lanes on the gel file. Analysis can also be launched either by double-clicking on the Analysis icon, where individual sample files can be opened by selecting Open from the File menu, or by double-clicking on the ABI sample file being analyzed. The sample file has four components: file information, raw data, analyzed data, and sequence data.
1. File information: This component gives detailed information about the particular run. It is important to check that the base spacing is around 12 (i.e., if the gel is running correctly one base should take the same time to pass the read region as 12 scans of the laser), but good data is also obtained with spacing values between 9 and 15. A note should be made of the base spacing in each case.
The signal strength should be checked to make sure it is adequate (see Section 3.4.).
2. Raw data: The raw data are unavailable in a format in which they can be manipulated; the raw data peaks are small, with a taller peak at the beginning of the sequencing run (residual dye terminators after phenol extraction).
3. Analyzed data: After data collection and lane tracking, the raw data in the sample files are analyzed. The analysis is an automatic process involving preprocessing (signal strength analysis, finding the first base location), first base calling, respacing of raw data, and second base calling (based on respaced raw data).
4. Sequence data: This is the sequence data from the sample file in the standard five letter code (A,G,C,T,N). It can be cut and pasted into other files, if required.
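The template rule of thumb in Section 3.1. (for 3.2 pmol of primer, use the amplicon length in base pairs divided by two as nanograms of DNA) can be checked against the 4:1 primer:template ratio with a short calculation. The Python sketch below assumes an average mass of about 650 g/mol per base pair of double-stranded DNA, a commonly used approximation that is not stated in the chapter; the function names are hypothetical.

```python
AVG_BP_MASS = 650.0  # g/mol per base pair of dsDNA (assumed approximation)


def template_ng(amplicon_bp, primer_pmol=3.2, primer_to_template=4.0):
    """Nanograms of amplicon that give the stated primer:template molar ratio."""
    template_pmol = primer_pmol / primer_to_template
    # pmol x (g/mol) gives picograms; divide by 1000 for nanograms.
    return template_pmol * AVG_BP_MASS * amplicon_bp / 1000.0


def rule_of_thumb_ng(amplicon_bp):
    """The chapter's shortcut: base pairs divided by two, read as nanograms."""
    return amplicon_bp / 2


if __name__ == "__main__":
    for size in (300, 500, 1000):
        print(size, round(template_ng(size), 1), rule_of_thumb_ng(size))
```

Under this assumption, 0.8 pmol of a 500-bp product is about 260 ng, close to the 250 ng given by the shortcut, so the two agree to within a few percent.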
3.3. Base Calling Base calling of ABI sequence data is the selection of a region of raw data, that is then further analyzed by Analysis software to produce an accurate recognizable
chromatogram
(also called a trace or an electropherogram).
Chro-
Arnold and Clewley matograms can then be assembled m various contig assembly programs: e.g., SeqEd, SEQMAN (LASERGENE), GeneJockey, Sequencher. The data are overlapped and crosschecked,and the assembled sequencedata is then exported and manipulated further using various DNA analysis software packages. When sequencing PCR products, unlike cloned material, the sequence data generated are of a defined length. It is therefore important to accurately basecall raw data, defining the region of raw data that is then further analyzed. Base calling is necessary if the improper assignment of base 1 has occurred and it can avoid unnecessary delays in contig assembly, since untidy data at either the 5’ or 3’ end of the sequence may result in an assembly program not recognizing clearly overlapping sequences. This can result in the creation of a contig with a consensus sequence containing added gaps and ambiguities. 3.4. Reanalyzing a File 1. In Analysis, pull down the File menu and select Open. A dialog box will appear 2
3 4.
5.
and enable the desired sample file to be selected and opened Alternatively, double-chckmg on a sample file will open it. Pull down the Window menu and select File Info. Check the signal levels of the sample. The ABI gutdelines for double-stranded DNA using Tag terminators give the following signal strength guidelines for good sequencing results: A = 200300, G = 200-400, T = 50-150, and C = 20-70. The base spacing should be noted. If the gel has run normally the spacing should be between 9 and 12 (see Section 3.2.). Pull down the Window menu and select Raw Data and Controller Use the Controller like a television remote control to find the start point of the data. Use the zoom or custom tools on the controller to position the cursor lust after the large dye terminator peak near the beginning of the raw data (this peak represents residual dye terminators after phenol extraction; when using other methods for purification of sequenced products it is not necessary to ignore this regton). Make a note of the start point for analysts, which is the X-axis number that appears in the lower-left comer of the Scan window. Pull down the View menu and select Full View For small PCR products that produce discrete raw data, i.e., with an unambiguous start and fimsh, the endpoint of the raw data is easily found using the zoom or custom tools. For a longer product the following example is given. Select Analyzed Data from the Window menu and scroll along the chromatogram to gage how far the data can be read accurately (for example, approx 400 bp). Multtply this number by the base spacing, found under File Info (e.g., base spacing = 11.3, multiplied by 400 = 4520). Add 4520 to the start point (e.g., 1050 + 4520 = 5570); thts 1s the endpoint. Pull down Analysis and select Call Bases and enter the Start and End figures in the relevant boxes Check the Use Start Point and Print After Calling boxes, if required The other parameters should not need changing. Select OK
and the Analysis Queue will appear, and the various stages of analysis and base calling will automatically be carried out. When this is complete, a new sample file will be written over the old analyzed data file. Raw data is, of course, unchanged.
6. To edit the file, use the tools on the Controller to zoom in on any area of the analyzed data to find or change bases. We recommend that each chromatogram be checked to ensure the computer has called the correct base. If in doubt, the base in question can be changed to an N and will be resolved when the sequence is aligned with the sequence of the opposite strand. Use the Delete to Last Base command to delete unwanted bases from the end of a sequence by highlighting the last base to be retained in the sequence and selecting Delete to Last Base under the Edit menu. All bases to the right of the highlighted base will be deleted.
Characteristic patterns are observed when using dye terminator chemistry because of the influence of the dyes attached to ddNTPs. Some consistent patterns of ddNTP incorporation can be seen in the chromatograms of analyzed dye terminator runs:
a. Cs following Gs are weak.
b. Ts following Gs are weak.
c. Cs following two or more Ts are enhanced.
d. In a string of four or more Gs, the third will show reduced signal.
e. Gs following a string of Ts will be enhanced.
f. The first A in a string will be strong, the rest weak.
g. As after Ts can show a reduced signal.
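The endpoint arithmetic described above for longer products can be sketched in a few lines of Python. This is only an illustration of the calculation, not part of the ABI Analysis software; the function name and its arguments are hypothetical, and the result is rounded to a whole scan number as an assumption.

```python
def analysis_endpoint(start_point: int, readable_bases: int, base_spacing: float) -> int:
    """Estimate the raw-data scan number at which base calling should stop.

    start_point    -- X-axis scan number noted just after the dye terminator peak
    readable_bases -- how far the analyzed data can be read accurately (bases)
    base_spacing   -- spacing reported under File Info (scans per base)
    """
    if readable_bases <= 0 or base_spacing <= 0:
        raise ValueError("readable_bases and base_spacing must be positive")
    # Scans covered by the readable region, rounded to a whole scan number.
    scans = round(readable_bases * base_spacing)
    return start_point + scans


# Worked example from the text: 400 bp at a spacing of 11.3, starting at scan 1050.
print(analysis_endpoint(1050, 400, 11.3))  # 5570
```

The value returned corresponds to the End figure entered in the Call Bases dialog.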
3.5. Exporting and Importing Sequence Data to and from Different Files
The sequencer outputs sequence data in two ways: sample files (previously described) and ASCII text files (files with the suffix .seq). The text files can be exported for manipulation in MS-DOS based computers or any system accepting text files. The sequence sample files require specialized software in order to access chromatogram data, including ABI Analysis, SeqEd, SEQMAN (LASERGENE), Sequencher, and GeneJockey. Once the sequence data is in a suitably refined condition it can be imported into applications that create overlapping contigs using sequence data in both orientations (see Note 2).
3.6. EDITSEQ of LASERGENE
EDITSEQ is the starting point for most uses of LASERGENE. The other modules (except SEQMAN and ABI sequence traces) essentially require files in EDITSEQ format for creating their own documents, e.g., for a MEGALIGN alignment (see Chapter 8). DNA and protein files can be imported from most common formats. EDITSEQ is opened by clicking on Sequence Editing and Analysis from the LASERGENE Navigator, or by double-clicking the application icon. An empty DNA sequence window is presented. A double helix icon in the top left of the
window, in the status bar, indicates that it is a DNA window. To change to a protein window, select New > New Protein from the File menu.
3.7. Entering and Editing Data
1. Enter sequence data by typing (all IUB codes are accepted).
2. Proofread the sequence by choosing Macintosh Voice from the Digitizer menu. Alternatively, click on the Open Mouth icon on the bottom left of the window.
3. Select No Sound from the Digitizer menu or the Open Hand icon to halt proofreading (see Note 3).
4. Select Go to Position... ([command key]G) from the Edit menu.
5. Select a sequence subrange by typing the range required in the box as the starting and finishing base numbers separated by a comma, e.g., 13,99 for subrange bases 13-99.
6. The sequence can be formatted from the Edit menu (case, font, size, spacing, or blocks of sequence characters).
7. To generate the Reverse Sequence or Reverse Complement of the active sequence, choose the required subrange (or all of it with Select All ([command key]A) from the Edit menu). The reversed/complemented sequence is displayed in a new window (see Note 4).
8. To correct any mistakes select Undo ([command key]Z) from the Edit menu. EDITSEQ has multiple levels of Undo/Redo.
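For readers who want to reproduce the subrange selection (step 5) and the Reverse Complement operation (step 7) outside EDITSEQ, the following Python sketch may help. It is not EDITSEQ's own code: the function names are hypothetical, the "13,99" range is taken to be 1-based and inclusive as in the example above, and the complement table is the standard IUB ambiguity code pairing.

```python
# Complement table for the IUB nucleotide codes (upper and lower case).
_COMPLEMENT = str.maketrans(
    "ACGTUMRWSYKVHDBNacgtumrwsykvhdbn",
    "TGCAAKYWSRMBDHVNtgcaakywsrmbdhvn",
)


def parse_subrange(text: str) -> tuple[int, int]:
    """Parse a range typed as 'start,end' (e.g. '13,99') into two integers."""
    start_s, end_s = text.split(",")
    start, end = int(start_s), int(end_s)
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start, end


def subrange(seq: str, start: int, end: int) -> str:
    """Return bases start..end, counting from 1 and including both ends."""
    return seq[start - 1:end]


def reverse_sequence(seq: str) -> str:
    """Return the sequence written backwards (no complementing)."""
    return seq[::-1]


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


start, end = parse_subrange("2,4")
print(subrange("ACGTACGT", start, end))   # CGT
print(reverse_complement("ATGCRN"))       # NYGCAT
```

Ambiguity codes are complemented rather than rejected, so R (A or G) becomes Y (C or T), and N stays N.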
3.8. Opening and Importing Sequence Documents
3.8.1. Opening DNASTAR Sequence Documents
1. Select Open ([command key]O) from the File menu. A standard dialog box for opening a document will appear. The document must be in DNASTAR format.
2. Navigate to the folder containing the desired file.
3. Select the file by clicking on it.
4. Click Open. Only DNASTAR format files will be listed. If the file is not listed, click Cancel; it must be imported instead (see Section 3.8.2.).
3.8.2. Importing Sequence Documents
1. Select Import... from the File menu.
2. Specify the type of sequence file to be imported from the right window pane of the dialog box (Fig. 1).
3. Select the appropriate radio button under the window pane to identify the sequence as DNA or protein. Files of the specified type are displayed in the left window pane.
4. Select the files to import.
5. Click the Import button. Imported files are opened into a new window (Fig. 2), usually named after the imported file but prefixed with NEW. EDITSEQ automatically interprets ASCII files with a .seq extension as nucleotides, and those with a .pro extension as proteins.
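The extension rule in step 5 can be mirrored when preparing files in bulk before import. The sketch below is a hypothetical helper, not part of LASERGENE; treating the extension as case-insensitive is an assumption, and files with any other extension are simply reported as unknown.

```python
from pathlib import PurePath
from typing import Optional


def guess_sequence_type(filename: str) -> Optional[str]:
    """Guess 'DNA' or 'protein' from the extension, as described for ASCII imports.

    Returns None when the extension is neither .seq nor .pro.
    Case-insensitive matching is an assumption of this sketch.
    """
    suffix = PurePath(filename).suffix.lower()
    return {".seq": "DNA", ".pro": "protein"}.get(suffix)


for name in ["753.seq", "receptor.pro", "notes.txt"]:
    print(name, guess_sequence_type(name))
```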
Fig. 1. Import dialog box for EDITSEQ.
Fig. 2. An EDITSEQ sequence window. Note the Triplet indicator in the upper-left hand corner of the window. The black bar and unshaded arrows indicate that the selected region (white text) is an ORF.
3.8.3. Setting the Ends of a Sequence
For many LASERGENE operations it is possible to specify subranges of a sequence for analysis without deleting data from a file. The standard method for opening or specifying a defined subrange of a sequence is the Set Ends button in the Open dialog box. Clicking on the Set Ends button will summon a window in which the sequence subrange can be entered in text fields or, alternatively, “thumbwheels” may be manipulated with the mouse pointer to select the range (Fig. 3). The Other Strand may be selected, and an Other Segment button allows the unselected portion of a sequence to be specified, rather than the selected part. Clicking Length sets the selected sequence to its full range.
Fig. 3. The thumbwheel dialog box.
3.9. Exporting Sequence Documents
Selecting Export from the File menu will save the sequence document as ASCII text. The program does not allow export in any file format except as a text file containing both the sequence and comments. However, the beginning of the sequence field is marked by a pair of colons (::), which is a standard symbol indicating the start of ASCII sequence data and is recognized by many other programs. Alternatively, the standard Mac Cut and Paste can be used to transfer sequence information without the comments (see Note 5).
3.10. Searching for ORFs
In a DNA window, the Triplet Indicator next to the DNA icon in the status bar indicates whether a selected range of bases is an ORF: if it is, the indicator bar changes to black. The left and right pointing arrows show the direction to move to return to an ORF. The sequence length and the subsequence range selected are indicated in the middle of the status bar (Fig. 2). ORFs can be found in a sequence document as follows:
1. Choose Find ORF... from the Search menu.
2. Click Find Next. The genetic code used can be specified from the Genetic Codes submenu of the Goodies menu.
3. Choose Translate DNA from the Goodies menu to translate the located ORF into a protein document (Fig. 4). The comments pane of the newly translated DNA displays a statistical analysis of the protein (see Note 6).
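To make the idea of an ORF search concrete, here is a minimal Python sketch. It is not the algorithm EDITSEQ uses, which is not described here. It assumes the standard genetic code, an ATG start codon, the three standard stop codons, and the forward strand only; the function names are hypothetical. Positions are reported 1-based and include the stop codon.

```python
from itertools import product

# Standard genetic code, built in the conventional T, C, A, G codon order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 1) -> list[tuple[int, int]]:
    """Return (start, end) of ATG...stop ORFs in the three forward frames."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None                   # look for the next ORF in this frame
    return sorted(orfs)


def translate(seq: str) -> str:
    """Translate complete codons; unknown or ambiguous codons become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


dna = "CCATGAAATGGTAACC"
for start, end in find_orfs(dna):
    print(start, end, translate(dna[start - 1:end]))  # 3 14 MKW*
```

Searching the opposite strand would amount to running the same function on the reverse complement of the sequence.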
3.11. Protein Analysis
When a region of a protein is selected, its molecular weight, charge, and isoelectric point are displayed in the status bar of the window. These properties and an amino acid analysis can be displayed in a separate window by choosing Protein Statistics from the Goodies menu. The selected region can be Reverse Translated from the Goodies menu; also, the genetic code used can be specified from the Genetic Codes submenu. The genetic code can be edited using Edit Selected Code. This is useful for eliminating ambiguity in reverse translation.
Fig. 4. An EDITSEQ protein window.
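The kind of summary shown in the protein window (molecular weight, counts of strongly basic K/R and strongly acidic D/E residues) can be approximated with the Python sketch below. This is an illustration only: the residue masses are the commonly tabulated average values, one water is added for the free termini, and EDITSEQ's exact method and rounding are not known here. The function name is hypothetical.

```python
# Average residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153  # added once for the free N- and C-termini


def protein_stats(protein: str) -> dict:
    """Summarize length, approximate molecular weight, and charged residue counts."""
    protein = protein.upper()
    unknown = set(protein) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    mass = sum(RESIDUE_MASS[r] for r in protein) + (WATER if protein else 0.0)
    return {
        "length": len(protein),
        "molecular_weight": round(mass, 2),
        "strongly_basic": sum(protein.count(r) for r in "KR"),
        "strongly_acidic": sum(protein.count(r) for r in "DE"),
    }


print(protein_stats("MKWDER"))
```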
4. Notes
1. Computer and Data Security: Installation of other software on the Mac attached to the 373 sequencer is not recommended because it may interfere with correct operation. Installation and regular running of protective programs, such as Norton Disk Doctor and Norton Speed Disk, are recommended. It is essential to regularly back up sequence sample files on floppy disk or SyQuest cartridges. The SyQuest drive attached to the sequencer has proved very useful in this laboratory because it can also be used as a bootable disk in the event of floppy disk or hard disk failure.
2. Once a sequence text file has been obtained, it can be interpreted and analyzed using any one of several different software packages. For the Macintosh, the cheapest is DNA Strider at about $200 (12), more expensive is GeneJockey, and most expensive is LASERGENE or one of the other big programs (13). LASERGENE is the most comprehensive of these programs and comes with an excellent manual and online help.
3. The speed of proofreading can be controlled from the Digitizer menu. If preferred, the voice may be replaced with tones. A headset can be plugged into the computer for private listening.
4. It is good practice to immediately make notes in the Comments section of the window about what changes have been made. The date created is automatically inserted; enter notes that will make the file understandable when returned to at a later date.
5. The export facilities of GeneJockey offer more options than those of EDITSEQ: The sequence (without comments) can be exported as a text file; or the file can be saved in DNAStar, DNA Inspector, or GenBank/IBI formats.
6. A better alternative to EDITSEQ for ORF searches is GeneJockey, which has a graphical display of all six possible reading frames of a DNA sequence that are double-clickable into protein sequence documents.
References
1. Arnold, C., Balfe, P., and Clewley, J. P. (1995) Sequence distances between env genes of HIV-1 from individuals infected from the same source: implications for the investigation of possible transmission events. Virology 211, 198-203.
2. Clewley, J. P. (1995) Derivation and interpretation of viral nucleotide sequences from clinical specimens. Rev. Med. Microbiol. 6, 26-38.
3. Simmonds, P., Zhang, L. Q., McOmish, F., Balfe, P., Ludlam, C. A., and Leigh Brown, A. J. (1991) Discontinuous sequence change of human immunodeficiency virus (HIV) type 1 env sequences in plasma viral and lymphocyte-associated proviral populations in vivo: implications for models of HIV pathogenesis. J. Virol. 65, 6266-6276.
4. Higgins, D. G. (1994) Clustal V: multiple alignment of DNA and protein sequences, in Computer Analysis of Sequence Data, Part II (Griffin, A. M. and Griffin, H. G., eds.), Humana, Totowa, NJ, pp. 307-318.
5. Felsenstein, J. (1989) PHYLIP-phylogeny inference package (version 3.2). Cladistics 5, 164-166.
6. Schuler, G. D., Altschul, S. F., and Lipman, D. J. (1991) A workbench for multiple alignment construction and analysis. Prot. Struct. Funct. Genet. 9, 180-190.
7. Rychlik, W. and Rhoades, R. E. (1989) A computer program for choosing optimal oligonucleotides for filter hybridization, sequencing and in vitro amplification of DNA. Nucleic Acids Res. 17, 8543-8551.
8. Gilbert, D. G. (1992) Mulfold, anonymous ftp to ftp.bio.indiana.edu.
9. Linton, D., Clewley, J. P., Burnens, A., Owen, R. J., and Stanley, J. (1994) An intervening sequence (IVS) in the 16S rRNA gene of the eubacterium Helicobacter canis. Nucleic Acids Res. 22, 1954-1958.
10. Sayle, R. and Milner-White, E. J. (1995) RASMOL: biomolecular graphics for all. Trends Biochem. Sci. 20, 374-376.
11. Sambrook, J., Fritsch, E. F., and Maniatis, T. (1989) Molecular Cloning: A Laboratory Manual. Cold Spring Harbor Laboratory, Cold Spring Harbor, New York.
12. Douglas, S. E. (1994) DNA Strider: a Macintosh program for handling protein and nucleic acid sequences, in Computer Analysis of Sequence Data, Part II (Griffin, A. M. and Griffin, H. G., eds.), Humana, Totowa, NJ, pp. 181-194.
13. Clewley, J. P. (1995) Macintosh Sequence Analysis Software: DNAStar's LaserGene. Mol. Biotechnol. 3, 221-224.
SEQMAN Contig Assembly
Simon R. Swindell and Thomas N. Plasterer
1. Introduction
SEQMAN is the contig assembly module of DNASTAR's LASERGENE. SEQMAN provides a number of useful tools for simplifying the process of assembling data from a sequencing project, particularly in performing many of the tedious tasks of removing extraneous data from the sequence. SEQMAN assembles finished nucleotide sequences in three steps: data processing, assembly proper, and postassembly editing. In the first step, you can remove contaminant and vector sequence and optimize the order of assembly. In the assembly step, SEQMAN locates all perfect matches between sequences using the Martinez method (1) and fills in between these matches with the Needleman-Wunsch method (2). The consensus is recorded and then used for the alignment in the following step against the next sequence. This procedure is repeated until all members of the dataset are accounted for. At each step the consensus is recalculated for the ensuing alignment. At the end of the assembly step all fragment sequences will exist in one or more contigs, depending on the assembly parameters. Following the assembly, it is usually necessary to clean up the alignment by manual inspection. Use SEQMAN's Alignment View and Find Disagreement features to quickly resolve sequencing conflicts. This chapter will provide a procedure for performing an assembly project, including creating a project; adding the sequences to assemble; editing sequences to optimize the assembly (including removing vector DNA and poor 3' data); performing the assembly; and then editing the end result.
From: Methods in Molecular Biology, Vol. 70: Sequence Data Analysis Guidebook. Edited by S. R. Swindell. Humana Press Inc., Totowa, NJ
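As background to the Introduction's mention of the Needleman-Wunsch method, the sketch below computes a global alignment score by the textbook dynamic programming recurrence. It is not SEQMAN's implementation: the scoring values (match +1, mismatch -1, gap -1) are arbitrary assumptions for illustration, the function name is hypothetical, and only the score is returned, not the alignment itself.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Return the optimal global alignment score of sequences a and b."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i, base_a in enumerate(a, start=1):
        curr = [i * gap]                      # a[:i] aligned against an empty prefix
        for j, base_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if base_a == base_b else mismatch)
            up = prev[j] + gap                # gap in b
            left = curr[j - 1] + gap          # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))  # 7
print(needleman_wunsch_score("ACGT", "AGT"))         # 2
```

Keeping only two rows of the score matrix is enough when the alignment path is not needed; recovering the aligned sequences would require storing the full matrix and tracing back.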
2. Materials
2.1. Macintosh Hardware
1. Any Macintosh computer with a floating point unit installed. We have successfully used LASERGENE with a IIci, an LCIII, and a Quadra 610, as well as 7100, 7200, and 7500 PowerMacs.
2. Minimum memory requirement of 4 Mb RAM (8 Mb on a PowerMac), although 8 Mb RAM (16 Mb on a PowerMac) or more is recommended.
3. Minimum free hard disk space of 25 Mb; more may be required because of the creation of temporary files.
4. Macintosh compatible monitor (256-color monitor is recommended).
5. Macintosh compatible printer (laser printers are recommended).
2.2. Windows Hardware
1. Any personal computer, 386 or greater processor recommended.
2. Minimum memory requirements of 4 Mb RAM (8 Mb RAM or more is recommended).
3. Minimum free hard disk space of 25 Mb; more may be required because of the creation of temporary files.
4. Windows compatible monitor (256-color monitor is recommended).
5. Windows compatible printer (laser printers are recommended).
2.3. Macintosh Software
1. Macintosh system software system 6.01 or higher.
2. The LASERGENE application SEQMAN. Follow the instructions provided by DNASTAR, Inc. for installing the LASERGENE software.
2.4. Windows Software
1. Disk Operating System version 5.0 or higher (DOS 5.0).
2. Microsoft Windows version 3.1 or higher.
3. The LASERGENE application SEQMAN. Follow the instructions provided by DNASTAR, Inc. for installing the LASERGENE software.
2.5. Data
Any DNA sequence document in LASERGENE format or any trace data file in either ABI Seq-Ed format or Pharmacia-ALF format is required. For this example, use the eight ABI Sample files labeled pFI753-PrX (X is the Primer) and the EDITSEQ file pFI753.seq (Macintosh users) or 753-PrX.abi and 753.seq (Windows users). The files are each sections of a PCR fragment cloned into the Invitrogen vector pCRII and should form a single contig. The vector sequence is also required. Macintosh users can locate the pFI753 folder in the Demo Seqman folder within their DNASTAR folder. Windows users can locate the pFI753 directory in the demo-sm directory within the WINSTAR directory. Macintosh users can locate Invitrogen's pCRII vector inside the VectorData folder within the DNASTAR folder, whereas Windows users can find pcrii.seq within vector inside the WINSTAR directory.
Fig. 1. The Add Sequence dialog box. All of the available sequences have been added to the chosen sequences list. Note the >>Add All>> button; this is accessed by pressing the option key (command key); otherwise it will read >>Add>>.
3. Methods
3.1. Opening SEQMAN
1. Macintosh users: SEQMAN is opened by clicking on Sequence Project Management from the LASERGENE Navigator, or by double-clicking the application icon. The program launches without opening a new project window.
2. Windows users: Locate the DNASTAR program group and double-click on it to open. Within the DNASTAR program group locate the SEQMAN icon. Double-click the SEQMAN icon to launch the application.
3. To create a new Project Summary, choose New (Command/Ctrl + N) from the File menu.
3.2. Adding Files to a Project Summary
In this example, we are assembling ABI sample files. The data are derived from using multiple primers on a single fragment cloned into the T site of pCRII.
1. Choose Add from the Sequence menu. A dialog box will appear (Fig. 1). Folders or compatible files will be listed in the Available Sequences pane.
2. Select the sequences to add. For this example, locate the demo sequences folder/directory and select pFI753-Pr5 (Macintosh) or FI753-Pr5.abi (Windows).
3. Click >>Add>> to transfer the selected sequence to the Chosen Sequences pane. Double-clicking a file name will have the same effect.
4. Repeat steps 2 and 3 until all the required files are added. For the example, add all the pFI753-PrX (FI753-PrX.abi) Sample files and pFI753.seq (753.seq) (see Note 1).
5. Click Done when all sequences to be assembled have been added. A window listing the unassembled sequences will be presented (Fig. 2, see Note 2).
The list will display information about the added files. The information displayed will depend on the file type that has been added. In this case, the added files are ABI Sample Files or Traces. Double-clicking on an ABI trace file will present a window showing the electropherogram (chromatogram). The information contained here is similar to that displayed by the ABI applications Analysis (see Chapter 3) and SeqEd (see Chapter 4).
Fig. 2. The unassembled sequences list. This lists all the added sequences and information about them, including the set limits of the data to be used in the alignment, the vector sequences to be searched for, and the file type.
3.3. Preparing Files for Assembly
3.3.1. Adding Vectors to the Vector Catalog
Any sequence can be added to the vector catalog. The sequence should be in LASERGENE format. If it is not, use EDITSEQ to convert the sequence (see Chapter 5).
1. Choose Vector Catalog from the Project menu. A dialog box will appear listing all sequences already in the catalog.
2. Click New. A standard file selection dialog appears.
3. Select the file to be added; only LASERGENE files will be shown (see Note 3). For this example, select the Invitrogen vector, pCRII. Macintosh users can locate Invitrogen's pCRII vector inside the VectorData folder within the DNASTAR folder, whereas Windows users can find pcrii.seq within vector inside the WINSTAR directory.
4. Click Add. A dialog box will appear (Fig. 3).
5. Set the ends of the vector data to be used (see Note 4). Rather than force SEQMAN to compare the whole vector sequence to the sequence files, specify a subregion of the sequence of the vector to use. Define a region spanning the cloning site equal to approximately twice the length of the insert, e.g., for a 500 bp insert, define a region from 500 bp downstream to 500 bp upstream of the cloning site. In this example, the two Sample files that contain vector sequence have been sequenced using Universal Forward and Reverse primers. Therefore, the comparison has been limited to the subregion of the vector defined by these two primers (204-406 bp).
6. Define the cloning site. If it is an exact point, define the cloning site and leave the ± box empty. In the example, the cloning site is between base pairs 335 and 336. To define a cloning region, such as a multiple cloning site (MCS), set the clone site as the central point of the MCS and define a ± region to encompass the complete MCS.
7. Click OK to add the vector sequence.
8. Click OK to leave the vector catalog (see Note 5).
Fig. 3. The Add Vector dialog. Rather than force SEQMAN to compare the whole vector sequence to the sequence files, specify a subregion of the sequence of the vector to use. In this example, the two Sample files that contain vector sequence have been sequenced using Universal Forward and Reverse primers. Therefore, the comparison has been limited to the subregion of the vector defined by these two primers (204-406 bp).
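The rule of thumb in step 5 (a region spanning the cloning site and roughly twice the insert length) can be written out as a small calculation. The Python sketch below is an illustration only: the function name is hypothetical, the vector is treated as a linear coordinate system (circularity is ignored), and the 4000 bp vector length in the usage example is a made-up value rather than the length of pCRII.

```python
def vector_scan_region(cloning_site: int, insert_length: int,
                       vector_length: int) -> tuple[int, int]:
    """Return (start, end) of the vector subregion to compare against reads.

    The region extends one insert length on either side of the cloning site
    and is clipped to the ends of the vector sequence (1-based coordinates).
    """
    if not 1 <= cloning_site <= vector_length:
        raise ValueError("cloning site must lie within the vector")
    start = max(1, cloning_site - insert_length)
    end = min(vector_length, cloning_site + insert_length)
    return start, end


# A 500 bp insert cloned at position 2000 of a hypothetical 4000 bp vector.
print(vector_scan_region(2000, 500, 4000))  # (1500, 2500)
```

When the sequencing primers are known, as in the worked example, the narrower region bounded by the primers can be entered instead.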
3.3.2. Removing Extraneous Data
1. Click the Trim 3' Ends button. A dialog box appears allowing you to set parameters for scanning the ends of the sequences (Fig. 4).
2. Click Scan Selections or Scan All. The program will search the sequences beginning from the 3' end in windows of the defined size. Any window containing the defined number of uncalled bases (X) will be trimmed from the region of sequence used in the alignment. Alternatively, the program can be instructed to simply trim regions from either end of the sequence to give a region of fixed length. The ends can also be trimmed manually (see Note 4).
3. Select the sequences that may contain vector DNA. In this example, only the sequences derived from the vector primers (Universal Forward and Reverse) are selected.
4. From the Set Vector pop-up menu select the vector used, in this example, pCRII (see Section 3.3.1.). The unassembled sequences window will now list the selected vector DNA (Fig. 2).
5. Click the Options button. A dialog box appears (Fig. 5) that contains two options for removing extraneous data from the sequence file and an option for optimizing the sequence assembly order (see Note 6).
6. Enable the appropriate procedures. In this example, only the options to remove vector DNA and optimize the assembly order are selected (see Note 7).
7. Click Do It Now. The Vector Search Setup dialog appears (Fig. 6). A number of parameters may be changed, but to begin with, accept the default settings (see Note 8).
Fig. 4. The Trim Ends dialog. This allows the user to set parameters for the automatic trimming of poor sequence data from the ends of the sequences. The program will search the sequences beginning from the 3' end in windows of the defined size. Any window containing the defined number of uncalled bases (X) will be trimmed from the region of sequence used in the alignment. Alternatively, the program can be instructed to simply trim regions from either end of the sequence to give a region of fixed length. The ends can also be trimmed manually (see Note 4).
Fig. 5. The Preassembly Options dialog. This enables the user to select one or more options for preprocessing the sequences before assembly.
8. Click Scan Selections. If all sequences contain vector DNA then click Scan All. All selected sequences will be scanned for vector DNA, and then each sequence in the list will be compared to optimize their assembly order. The unassembled sequences window will reappear with the sequences listed in the order of assembly.
9. Select Parameters from the Project menu.
10. The resulting dialog box offers the opportunity to change several of the program's default parameters (see Note 9).
11. In this case, leave these parameters unchanged and click Other Options.
12. A second dialog appears displaying further parameters (see Note 9).
13. Click in the check-box to select “Give less weight to fragment ends” (see Note 10).
14. Change the “less weight to sequence after residue” value from 300 to 1070.
15. Click OK to dismiss the dialog box.
Fig. 6. The Vector Search Setup dialog. This allows several parameters affecting the identification of vector DNA to be adjusted (see Note 8).
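The window-based 3' trimming described for the Trim 3' Ends button in Section 3.3.2. can be illustrated with a short Python sketch. It is not SEQMAN's code: it assumes that uncalled bases are written as N, that windows are examined one after another without overlap starting at the 3' end, and that scanning stops at the first window with fewer uncalled bases than the threshold. The function name and default values are hypothetical.

```python
def trim_3prime(seq: str, window: int = 10, max_uncalled: int = 2) -> int:
    """Return how many bases to keep after trimming poor data from the 3' end.

    Windows of `window` bases are inspected from the 3' end inward; each window
    holding at least `max_uncalled` N characters is discarded.
    """
    if window <= 0 or max_uncalled <= 0:
        raise ValueError("window and max_uncalled must be positive")
    end = len(seq)
    while end >= window:
        current = seq[end - window:end].upper()
        if current.count("N") >= max_uncalled:
            end -= window          # drop this window and look further 5'
        else:
            break                  # first acceptable window ends the scan
    return end


read = "ACGTACGTACGNNATNNNNN"
keep = trim_3prime(read, window=5, max_uncalled=2)
print(read[:keep])  # ACGTACGTAC
```

The function only reports a cut position and leaves the read itself untouched, in keeping with the text's point that ends can be set without deleting data from a file.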
3.4. Performing the Assembly
Once the preassembly procedures have been completed, the final assembly is ready to be performed.
1. Click the Assemble button in the unassembled sequence window. A progress window will appear and the assembly report will be constructed as the assembly proceeds (Fig. 7).
2. When the assembly is complete, examine the report.
3. Close or hide the Report window.
4. The Project Summary window will display the number of contigs created and the contig into which each sequence has been entered (Fig. 8).
5. Click on Contig 1 to select it.
6. Choose Alignment View from the Contig menu. A continual alignment of all sequences in the contig will be displayed along with a consensus sequence (Fig. 9).
Fig. 7. The Assembly Progress windows. During the assembly process, SEQMAN displays the progress box detailing the sequence being added to the contig and at the same time builds a report of the assembly.
7. Choose Strategy View from the Contig menu. An overview of the contig will be displayed showing the position of each sequence in the contig and its orientation relative to the consensus (Fig. 10). The consensus sequence is also represented as a line of varying width. The width and color vary depending on the degree of coverage of each region by contributing sequences (see Note 10). Double-clicking on a region of the strategy view will display that region in the alignment view.
Fig. 8. The Project Summary window. This displays the number of separate contigs formed and the number of the contig into which each sequence has been added. The consensus sequence may be saved by choosing Save Consensus from the Contig menu.
Fig. 9. The Alignment view. This displays a scrollable window on the aligned sequences. Only the sequences corresponding to the displayed portion of the contig are listed. A consensus sequence is shown below the aligned sequences.
Fig. 10. The Strategy window. This shows a scalable graphic representation of the contig showing the position of the sequences in the contig and their orientation relative to the consensus (see Note 10).
8. Choose Complement from the Contig menu. This displays the reverse complement of the contiguous sequence assembled by SEQMAN. It is better to edit this strand as more of the fragment sequences are derived from here. This can be seen from the Strategy view. Note that the majority of sequences have been entered reading from right to left. Watch the Strategy view as you select Complement.
9. Choose Contig Info from the Contig menu. Information on the selected contig will be displayed (Fig. 11).
Fig. 11. The Contig Info window.
Fig. 12. The Alignment View and Trace Data windows. The Trace Data windows are for (left to right) pFI753-PrU, pFI753-Pr7, pFI753-Pr5, pFI753-Pr12. Selecting a base in the consensus sequence selects the same base in all the contributing sequences. Selecting Show Trace Data from the Sequence menu opens trace windows showing the ABI chromatogram of the selected region of contributing Traces. The sequence pFI753-PrU shows “double miscalls” in several places. This is characteristic of data close to the 3' end of the data. Comparing the traces from the other contributing sequences confirms this. The Sample file pFI753-Pr12 shows an ambiguity. Comparing the sequences and Trace data allows these errors to be edited out with confidence.
3.5. Editing the Project
Once the assembly has been completed it is possible to edit the project.
1. In the Alignment View, scroll through the aligned sequences (see Note 11). In the example, go to base 360 on the consensus scale.
2. Click the pointer in the contributing sequence to activate editing in that sequence only.
3. Click in the consensus sequence to enable editing in all contributing sequences. Whatever edits you make will affect every sequence.
4. Type the base to insert or use the delete key to delete a base (see Note 12).
5. Click the pointer at base 360 and drag it to base 376 of the consensus sequence. This will select the same region in each contributing sequence (Fig. 12).
6. Choose Show Trace Data from the Sequence menu to display the electropherogram of the selected area. This is extremely useful for making decisions about editing the aligned sequences (Fig. 12).
7. The entire selected region may be replaced or deleted (see Note 13).
8. Choose Save or Save As from the File menu to save the changes to the project.
9. Choose Export Sequence from the Contig menu to save the changes to the individual sequences. In the following dialog box, select a location for the exported files. The exported files will have the same name as the originals; if saved to the same location the exported files will replace the originals.
4. Notes
1. Macintosh users: Holding down the option key (command key) in the sequence add dialog box will change the >>Add>> button to >>Add All>>. Clicking this button will add all the listed sequences.
2. When assembling a large number of files, it is possible to save the list of file names as a “file of files.” Choose Save File of File Names from the File menu. A save dialog box will appear. Name the file; it is good practice to add the phrase “fof” to the name. This is the convention used by DNASTAR to identify a file of file names. If, in the future, you wish to add this list of files to an assembly, simply select the fof and >>Add>> it to the list of chosen sequences. All the files listed in the fof will be added. Mac users can also drag a group of selected sequences or a folder of sequences on top of the Unassembled Sequences window and drop to add.
3. To say that only EDITSEQ files will be listed is practically, but not strictly, accurate. On the Macintosh platform, files are identified according to Type and Creator values that should be unique to the creating application. These values are hidden from the user and not easily changed. Therefore, the Macintosh version of the program will usually only display files created by EDITSEQ. However, PC applications rely on file extensions to designate the file type. SEQMAN will list all files with the extension .seq regardless of which program was used to create them.
4. For many LASERGENE operations, it is possible to specify subranges of a sequence for analysis without deleting data from a file. The ends of sequences to be assembled can be set manually by clicking the Set Ends button. You will be presented with a window where the sequence subrange can be entered in text fields or, alternatively, thumbwheels can be manipulated with the mouse pointer to select the range. The Other Strand can be selected. If the word Length is clicked on, the sequence is set to its full range.
5. Use Edit to change any of the vector sequences in the catalog. If, for example, you have pUC18 cut with SmaI defined as a vector and want to change it to pUC18 cut with PstI, select pUC18(SmaI), or whatever it is called, and choose Edit. Alter the clone site to that for PstI. Rename the vector to, say, pUC18(PstI), and click OK. Note that this will replace the edited entry. To add the new vector without replacing the existing entry, it must be entered as a new sequence.
6. The optimization routine compares all the sequences in the unassembled list and orders them by degree of similarity. The result is that during assembly the most similar sequences are aligned first, thus speeding the process. NB: In some cases, the user may find that the order of sequence assembly is not optimized when
using ABI trace files. If this occurs, import the trace files into EDITSEQ first and use the DNASTAR format files for the alignment. The assembly order will then be optimized correctly. However, this approach prevents editing of the sequences using the electropherogram trace. At the time of writing, DNASTAR had just completed a new version of the program that corrects this (Version 3.05). It is important to ensure that you are using the most recent version of the software. DNASTAR now operates a World Wide Web server providing details of their software and services. The latest version numbers for their software should be available from this resource using the URL http://www.dnastar.com.
7. Remove Contaminate Sequences is of use when sequencing subclones of a larger clone. It is possible to screen out subclones that contain fragments from the initial cloning vector.
8. SEQMAN aligns query and vector sequences using two alignment algorithms. First, perfect matches are identified using the Martinez method (1), then intervening regions are aligned using the Needleman-Wunsch method (2). Minimum Match Length (default 7) defines the minimum length of a region of perfect similarity (alignment group) between the query sequence and vector that must be found before identity is established. The alignment group is continued to the extent of the matches. Connect Distance (default 3) is the distance in nonmatching bases allowed between the vector cloning site and the nearest alignment group, i.e., if the number of bases between the cloning site and the nearest alignment of the query sequence to the vector is greater than the Connect Distance, then the scan will fail. Maximum Shift Register (default 10) sets the largest gap allowed in bases between adjacent alignment groups. If the region separating identified alignment groups exceeds 10 bases, then the vector sequence will not be identified (Fig. 13).
Minimum NW Match % (default 90) sets the minimum percentage of perfectly matched bases in a region between alignment groups. If two alignment groups are separated by a region with <90% similarity, the vector DNA will not be identified. Taking these two parameters into consideration, by default two alignment groups must be separated by no more than 10 bp, of which 9 bp must be perfect matches. Gap Penalty (default 0) sets a penalty for introducing a gap in the alignment. Length Weight (default 0.01) sets a penalty for the length of a gap. The penalty is multiplied by the number of base pair equivalents in the gap.
9. As with screening out vector DNA, SEQMAN uses two algorithms when comparing sequences for assembly. First, perfect matches are identified using the Martinez method (1), then intervening regions are aligned using the Needleman-Wunsch method (2). A number of program parameters are accessible from the Parameters option in the Project menu; they are presented in two successive dialog boxes. These parameters may be changed to tighten or relax the stringency applied to sequence comparisons.
Fig. 13. A representation of the relationship between the Connect Distance and Maximum Register Shift values used in identifying vector sequence. The figure is a Dot Plot alignment of two sequences. The two values are indicated.

First dialog:
a. Match Size (default 12): In order to allow extension of a comparison, SEQMAN must first find a contiguous region of perfectly matching bases. This value defines the minimum length for that region.
b. Minimum Percentage Match (default 65): This defines the minimum percentage of perfect matches that must exist within the overlap of two sequences for them to be aligned in the same contig.
c. Consensus Threshold (default 75): The consensus sequence is determined by taking the majority base at any position. This value defines the minimum size of that majority; if no base is present at or above this level, an ambiguous base code is inserted. This parameter may be altered at any time.
d. Coverage Threshold (default 4): This specifies the number of times the region must have been sequenced on both strands to be considered covered (see Note 10).
Second dialog:
a. Maximum Added Gap Length in Contig (default 70): Restricts the number of gaps that may be added to the contig while aligning a new sequence. If the number of gaps introduced exceeds this value, the sequence will not be added to that contig.
b. Maximum Added Gap Length in Sequence (default 70): Restricts the number of gaps that may be added to the sequence while aligning a new sequence. If the number of gaps introduced exceeds this value, the sequence will not be added to that contig.
c. Maximum Register Shift Difference (default 30): This defines the maximum number of allowable bases between perfectly matching regions.
d. Lastgroup Considered (default 2): When a new sequence is aligned to a contig, several regions of perfect matches exceeding the Match Size value may be identified. This parameter defines the number of these that will be examined before the alignment is abandoned.
e. Gap Penalty (default 0.00) and Gap Length Penalty (default 0.70): These values determine the penalty that is deducted from the pairwise scoring when a gap is inserted and then lengthened. Increasing the penalties reduces the number and length of gaps inserted.
In most sequencing runs, the first few bases and the last bases in the sequence are less accurate than those in the middle. By giving these bases less weight than those in the middle, the consensus is biased to the middle residues. The exact points at which the weighting should be changed may be defined here (default 9 and 300). For this tutorial, the point at which weighting of end residues occurs is extended to 1070 bp to reflect the length of pFI753.seq.
10. The consensus in the strategy view can display a great deal of information about the degree of coverage of the sequence. The appearance of the line has several meanings that are hard to show in black and white.
a. Thin red line: Region sequenced only once.
b. Blue line: Region sequenced on only one strand but more than once.
c. Green line: Region sequenced on both strands.
d. Thick green line: Region sequenced on both strands at or above the coverage threshold. The threshold for the degree of coverage is set in the alignment parameters (see Note 9).
11. The lines under the consensus represent the constituent fragments used in the assembly. A solid line with the arrowhead at its right edge is on the same strand as the contig. A dotted line with the arrowhead at its left edge is on the opposite strand from the contig.
12. You can jump to a specific point in the alignment by choosing Go to Position from the Edit menu and entering the number of the base (on the consensus scale) to which you wish to move.
13. Adjacent to the horizontal scroll-bar in the alignment view are two “anchor” icons. These dictate which end of the sequence is anchored. By default the Anchor Left icon is selected; inserting gaps or bases moves the sequence downstream of the edit, so affecting the downstream consensus. Changing to Anchor Right reverses this so that edits affect the upstream consensus (Fig. 12).
14. To manually slide one or more sequences relative to the others, hold down the option key. Click on the sequence(s) to move; the cursor changes to a banana. Slide the sequences as required.
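The Consensus Threshold rule described in Note 9 can be illustrated with a short Python sketch. This is not SEQMAN's code: the function name `consensus_base` is hypothetical, and the choice of ambiguity code (the IUPAC symbol covering every base seen in the column) is an assumption, since the text says only that an ambiguous base code is inserted. The sketch also assumes a non-empty column containing only A, C, G, and T.

```python
from collections import Counter

# IUPAC ambiguity codes, keyed by the set of bases each one stands for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus_base(column, threshold=75):
    """Return the consensus symbol for one alignment column.

    column    -- string of bases (A/C/G/T) from the reads covering this position
    threshold -- minimum percentage the majority base must reach (default 75)
    """
    counts = Counter(column)
    base, n = counts.most_common(1)[0]
    if 100 * n / len(column) >= threshold:
        return base
    # Assumption: fall back to the code covering every base observed.
    return IUPAC[frozenset(counts)]


if __name__ == "__main__":
    for col in ("AAAA", "AAAG", "AAGG", "ACGT"):
        print(col, "->", consensus_base(col))
```

With the default of 75, three A reads and one G read still yield A, whereas an even split between A and G yields the ambiguity code R.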
Acknowledgments
The authors acknowledge the help extended by DNASTAR for including the Sample files used in this tutorial as part of their demonstration files. We also acknowledge the kind permission of Invitrogen to circulate the sequence of the cloning vector pCRII with this tutorial.

References
1. Martinez, M. H. (1983) An efficient method for finding repeats in molecular sequences. Nucleic Acids Res. 11, 4629-4634.
2. Needleman, S. B. and Wunsch, C. D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443-453.
7
GeneJockeyII
DNA Sequencing and Fragment Assembly
Phil Taylor
From: Methods in Molecular Biology, Vol. 70: Sequence Data Analysis Guidebook. Edited by S. R. Swindell. Humana Press Inc., Totowa, NJ

1. Introduction
GeneJockey provides routines to open and display chromatogram files produced by ABI/Perkin Elmer automated sequencers. The first part of this chapter deals with the handling of this data; the second part deals with multiple alignment of sequencing fragments and the management of sequencing projects, and is equally applicable to sequence data obtained by manual sequencing or from another automatic sequencer.

Applied Biosystems Software division has announced that it will be producing a software toolkit to enable programmers to access its sample files. At the time of writing this was not yet available, so all the routines built into GeneJockey are my own, and are not approved or sanctioned by ABI. For this reason, GeneJockey never writes to ABI files; it treats them strictly as read-only. It is possible that in the future, ABI may change their file format, in which case GeneJockey could fail to operate properly. If this happens I will be one of the first people to be affected, since my own work involves a lot of sequencing. You can therefore expect GeneJockey to be rapidly updated to deal with the problem.

If you do not use an ABI sequencer, then your first task is to get your sequence fragments into GeneJockey files. If you use a Pharmacia sequencer controlled by an IBM PC there are various ways of transferring the data to the Macintosh. The cheapest is to use Apple’s own utility program, Apple File Exchange, to read the sequence Text files from an MS-DOS format floppy disk onto your hard disk. You can then open these files from within GeneJockeyII, then Copy and Paste the sequence into new nucleotide sequence windows. (Do not forget to Tidy the sequences before you save to get rid of any extraneous numbers or
other illegal characters.) If you sequence manually, you can read the gels directly into GeneJockeyII. Set up the keyboard so that you can type the four bases with one hand and the order of the four keys is the same as that of the four lanes on your gels. Turn on the Speak on Entry feature so you have audio feedback. Place your film on a light-box next to the computer, and use one hand to keep your place while the other does the typing. Look at the gel, not the keyboard or the screen. With practice, you should be able to read at least 100 bases/min. GeneJockey does not currently support any gel readers. If you use a gel reader, you should use the software that comes with it and transfer the data to GeneJockey either as Text files or via the clipboard.

2. Materials
1. Hardware: GeneJockey requires a Macintosh with Color QuickDraw in ROM (this excludes the Macintosh Plus [and older machines], the SE, the PowerBook 100, and the Macintosh Portable). The program also requires System 7.0 or later and at least 2 Mb of available memory. In addition to the usual requirements, it is quite important for sequencing applications to use a color display. Although it is possible to perform the operations described here on a monochrome or grayscale machine, it is very difficult to read chromatogram data or the alignments between sequences without color. A 256-color system (or better) is therefore highly recommended.
2. Software: Only the GeneJockey program itself is required for the procedures described in this chapter. There is a bug in early versions of GeneJockey that very rarely causes the program to hang when opening ABI chromatogram files. This only appears to affect the program when running on the Macintosh LCIII, but could potentially cause problems on other machines. The bug was fixed in version 1.51. Versions earlier than 1.2 lack the toolbar in multiple-alignment windows, but all important commands on the toolbar are available from the multiple alignment submenu.
3. Methods
3.1. Preparing Sequences for Alignment
This job can be performed in several different ways, depending on the computer setup in your laboratory (see Note 1). For the purpose of this demonstration, there are five ABI files on the sequencing demo disk. You may use these or your own sequencing data. For each of these files in turn you should perform the following actions:
1. Use the GeneJockey Open... command to open the file. The file will open into a window like that shown in Fig. 1, containing the color chromatogram (see Note 2). There are three scroll-bars (see Note 3), Find and Find Again buttons (see Note 4), the Go To button (see Note 5), and the 4X> button (see Note 6).
Fig. 1. A GeneJockey chromatogram window.

Fig. 2. Sequencing error rate display.

2. Click the Code Change button to replace Ns with the full degenerate code set (see Note 7).
3. Click the Error Rate button to display the expected distribution of errors (Fig. 2, see Note 8). As can be seen, the first 25 or so bases of the read are very inaccurate, with the results settling down to a stretch of about 350 bases of near-perfect read. After that, the error rate rises to about 10%, oscillating around this value until about base 550, after which it deteriorates to about 50%. Sequences become difficult to align when the error rate is substantially over 10%, and you would normally take only the part of the data between basecalls 25-550 for alignment.
4. Click on the chromatogram where the predicted error rate falls below 10%.
5. Drag to the right, selecting the area where the error rate is below 10% (Fig. 3). At this point, it is possible to invert the sequence using the Invert command from the Modify menu (see Note 9). (None of the sequences on the GeneJockey demo disk need to be inverted.)
6. Click the Extract button. The program opens a new nucleotide sequence window, displaying the extracted sequence plus the run data included in the original sample file. The Comments area of the window includes a file reference to the ABI sample file, so once it is closed, you can reopen it at will.

Fig. 3. Selecting the sequence for extraction.

Fig. 4. Parameter dialog for multiple sequence assembly.

7. Save the extracted sequence using the Save As... command (see Note 10).
8. Bring the chromatogram window back to the front.
9. Close it by clicking in the close box.
At the end of this process you will have several open windows containing sequence fragments to be aligned, and you can proceed to the next section.
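The trimming decision in steps 3-5 amounts to finding the longest stretch of basecalls whose predicted error rate stays under a cutoff. The Python sketch below shows that idea under stated assumptions: GeneJockey presents the error rate as a graph, and nothing in the text says the values can be exported, so the list `error_rates` and the function `longest_run_below` are hypothetical, and the numbers are synthetic values shaped like the profile described in step 3.

```python
def longest_run_below(error_rates, max_error=0.10):
    """Return (start, end) for the longest contiguous stretch of basecalls
    whose predicted error rate is below max_error.

    Indices are 0-based and half-open, so sequence[start:end] is the
    stretch to keep. Returns (0, 0) if no basecall qualifies.
    """
    best = (0, 0)
    run_start = None
    for i, rate in enumerate(error_rates):
        if rate < max_error:
            if run_start is None:
                run_start = i          # a new acceptable run begins here
            if i + 1 - run_start > best[1] - best[0]:
                best = (run_start, i + 1)
        else:
            run_start = None           # run broken by a poor basecall
    return best


if __name__ == "__main__":
    # Synthetic profile: poor start, near-perfect middle, ~9% tail, then 50%.
    rates = [0.5] * 25 + [0.01] * 350 + [0.09] * 175 + [0.5] * 50
    start, end = longest_run_below(rates)
    print(start, end)
```

Because the real error rate oscillates around 10% in the later part of a read, a strict cutoff may split the usable region; in practice the selection is made by eye on the error rate display, as the steps describe.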
3.2. Assembly of Sequencing Fragments
As with all of GeneJockeyII’s routines, the Sequence Assembly command takes its input from open windows. You should have a set of sequences in windows ready to align. If not, follow Section 3.1.
1. Issue the Multiple Align > Sequence Assembly... command from the Analyze menu. A dialog box appears (Fig. 4; see Notes 11, 12).
2. Select Text Report for the current alignment (see Note 13). Before aligning the sequences, GeneJockey can trim off any vector sequence present. Note 14 describes how to do this. The demonstration sequences do not contain any vector sequence.
3. Dismiss the dialog with the OK button. The program will open a new text window, first listing the lengths of the sequences as it formats them. Next, the program aligns each sequence with each of the sequences below it in the list, giving for each alignment the match length (abbreviated to ML) and the offset (OS) necessary to bring this matching segment into line, or alternatively displaying No Match as appropriate. When all the pairwise alignments have been completed, the program lists the matches for each sequence, then assembles the sequences into contigs. For each contig, the contributing sequences are then listed, along with their offsets from the 5’ end. Next, the program opens a multiple alignment window (Fig. 5) in which to display the results. The window always opens in the Map view, which gives an overview of the whole alignment (see Note 15). In the map view, the sequence data consists of arrows showing the relative positions of the aligned fragments.

Fig. 5. GeneJockey multiple alignment window (Map view). In the example shown, there are two contigs, one containing seven sequences and one containing a single sequence that did not align with any of the others (the program will still call it a “contig” even though it consists of only one sequence). The top arrow of each contig group represents the consensus sequence. The three vertical dotted lines form a cursor, which you may move across the display using the bottom scroll-bar. They indicate how much of the alignment you will see if you switch to one of the other two views. The outer pair of lines show the extent of the color bar view, the left-hand pair show the area covered by the sequence view.

4. Switch to the color bar view (Fig. 6) by clicking on the middle button of the group of three buttons at the right-hand end of the toolbar (see Note 16).
5. Switch to the sequence view alignment by clicking on the right-hand button of these three (Fig. 7). Most of the commands in the submenu and toolbar make most sense when used in the sequence view. Again, the monochrome picture is a poor representation of what is shown on the screen, and it is much easier to work with aligned sequences in color.
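The ML and OS values in the text report can be pictured with a small Python sketch. It assumes, purely for illustration, that a pairwise match is the longest identical segment shared by two sequences and that the offset is the difference between the segment's start positions; the routine GeneJockey actually uses is not described in the text, and the function name `longest_exact_match` is hypothetical. The default of 12 is the minimum match length given in Note 11.

```python
def longest_exact_match(a, b, min_match=12):
    """Return (match_length, offset) for the longest identical segment
    shared by sequences a and b, or None if it is shorter than min_match.

    offset = (start of segment in a) - (start of segment in b), i.e. how
    far b must be shifted to the right to line the segment up with a.
    """
    best_len, best_a_end, best_b_end = 0, 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the identical run that ended at a[i-2], b[j-2].
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_a_end, best_b_end = cur[j], i, j
        prev = cur
    if best_len < min_match:
        return None  # reported as "No Match"
    return best_len, best_a_end - best_b_end


if __name__ == "__main__":
    shared = "ACGTACGGTCAGCTA"
    print(longest_exact_match("TTTTT" + shared, shared + "GGGGG"))
    print(longest_exact_match("ACGTACGT", "TTTTGGGG"))
```

The table-based search costs time proportional to the product of the two lengths, which is adequate for a sketch on reads of a few hundred bases but is not meant to suggest how the program itself is implemented.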
Fig. 6. GeneJockey multiple alignment window (color bar view). It is very difficult to represent the color bar view of an alignment in monochrome, so Fig. 6 is less than adequate in this respect. In this view, each of the four bases is represented as a two-pixel-wide block of color. This allows three times as many bases to be displayed as in the sequence view, and since each of the four bases is represented in a different color, they line up in vertical stripes, making it easy to see which areas are aligned and which are not. Even in the monochrome representation above, it is easy to see that the 5’ end of sequence B 0177 3 is displaced relative to the other sequences, and it is clear that there is a frameshift error in this sequence somewhere near the center of the screen. In this view, there is an active-cursor display, so you could run the cursor across the sequence and read off the location of the error in the panel at the bottom left (indicating 524).

Fig. 7. GeneJockey multiple alignment window (Sequence view).
3.3. Editing the Multiple Alignment
When editing aligned sequence fragments we must be prepared to insert gaps and/or to delete bases to bring the sequences into their proper alignment. Note 17 discusses this in detail. Editing is performed using the sequence view, but before you start editing a contig, the first thing to do is to locate the best-aligned segment. Use the color bar view for this purpose.
Fig. 8. Comparing a sequence in the multiple alignment window with the original chromatogram to resolve ambiguities.

1. Switch to the map display and position the cursor lines in the position shown in Fig. 5 using the horizontal scroll-bar.
2. Switch to the color bar display. Note the frameshift error in sequence B 0177 3 (Fig. 6). Scroll the display horizontally until the error is close to the left-hand side of the display.
3. Switch to the sequence display (Fig. 7). Now take a closer look at that frameshift error that we saw in the color bar view. The point where the sequences diverge is a run of As highlighted in Fig. 7. Six sequences are aligned at this point; five of them have GAAAA, whereas the other has GAAAAA. It looks as if the correct answer here is four As, not five, but perhaps we should check the sample file to make sure.
4. Click on the sample name in the left-hand pane to select it.
5. Open the file reference by holding down the Command key and typing an equal sign. The sample file opens and is moved directly below the selected sequence, scrolled to the correct place, and inverted if necessary (Fig. 8). We can see immediately that the basecall is suspect in this case; the spacing of the bases is very crowded here.
6. Increase the horizontal magnification using the small scroll-bar (see Note 3, Fig. 9). We can see that there are, in fact, only four A peaks, and that the basecalling algorithm has become confused by the wide base spacing around the first A.
7. Click on the multiple alignment window to bring it to the front, then place the insertion point after the first A of the incorrect run. The sequence will be shown in monochrome for editing.

Fig. 9. Multiple alignment window with horizontal scale increased to show bad basecall.

8. Click on the reverse edit button (the right-hand button of the central group of two in the toolbar). Note that when you move the cursor back over the sequences, its shape changes to that shown on the button. Now hit the delete key to remove the extra base. The correctly aligned sequence to the right of the insertion point remains where it was, whereas that to the left moves into correct alignment. Deleting a base leaves behind a caret mark (like an inverted “Y”) to show that something has been removed, so if you change your mind later you can select this point and recover the deleted base(s) using the Undo Deletion command from the Edit menu. For a fuller description of the editing process, see Notes 18 and 19.
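The reasoning in step 3, five reads against one, is a simple majority vote on the length of a homopolymer run. A minimal Python sketch of that tally follows; the function `run_length_vote` and the toy reads are hypothetical, and the sketch assumes at least one read contains the run. A vote like this only flags the odd read out; the chromatogram check in steps 4-6 is still what settles the call.

```python
import re
from collections import Counter


def run_length_vote(reads, anchor="G", base="A"):
    """Tally the length of the first run of `base` that directly follows
    `anchor` in each read.

    Returns (majority_length, tally), where tally maps run length to the
    number of reads showing that length.
    """
    pattern = re.compile(re.escape(anchor) + "(" + re.escape(base) + "+)")
    lengths = []
    for read in reads:
        match = pattern.search(read)
        if match:
            lengths.append(len(match.group(1)))
    tally = Counter(lengths)
    majority_length = tally.most_common(1)[0][0]
    return majority_length, tally


if __name__ == "__main__":
    # Toy reads: five show GAAAA, one shows GAAAAA, as in the example.
    reads = ["TTGAAAATC"] * 5 + ["TTGAAAAATC"]
    length, tally = run_length_vote(reads)
    print(length, dict(tally))
```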
4. Notes
1. After a sequencing run, you have a set of up to 36 sample files to be processed, residing on the hard disk of the computer that is connected to the sequencer. Each of these files occupies around 150 K, so if you do a lot of sequencing you will need plenty of disk storage space.
a. You can put a copy of GeneJockey on the sequencing computer and continue to process the data on that machine.
b. You can transfer the sample files to your own machine by means of floppy disks or by transferring them across a network.
c. You can leave the sample files on the sequencing computer, run GeneJockey on your own machine, and open the sample files across the network, storing the extracted files and multiple alignment on your own machine.
Whichever method you use, you should be prepared to store the sample files until all ambiguities in the multiple alignment have been resolved.
2. The chromatogram information is represented in six colors, one for the background, one for each of A, C, G, and T, and one for degenerate codes. You may change these colors using the small colored button at the bottom left of the window. The dialog produced by clicking on this button gives you a choice of 64 colors, assuming that your display is capable of that number. (On 4- or 16-color displays, such as overhead projectors, some of the colors will appear identical to others.)
3. The window has three scroll-bars that determine what part of the chromatogram is displayed. The long scroll-bar at the bottom scrolls horizontally through the data. Currently the scroll-bar is at the left of its travel, and the display therefore starts with the first base called. (Unlike ABI’s own software, GeneJockey does not show the area of the chromatogram outside of the basecalls.) Scroll to the right and look at different areas of the chromatogram. As the signal gets weaker, you may wish to increase the vertical magnification, which you can do by means of the vertical scroll-bar at the right. To look at individual basecalls in detail, you may also wish to change the horizontal magnification using the small horizontal scroll-bar at the left. As you change the scroll-bar settings, you will notice that the range display immediately above the horizontal magnification scroll-bar changes to reflect the new settings. The two numbers displayed represent the number of the leftmost and rightmost basecalls currently on display. There is also the usual GeneJockey active cursor display, so you can determine the number of any base by placing the cursor over it and reading the number from the display area at the bottom left of the window.
4. Almost all of the operations that you will perform on the chromatogram window are initiated by buttons in the left-hand panel. The Find... button allows you to search for a single base or short sequence of bases in the chromatogram. Unlike the commands in the Find menu, this routine does not recognize matches between degenerate codes, so you can use it to search for N and use the Find Again button to step quickly through the chromatogram. (However, there is a much better way of assessing the quality of your sequence than simply counting the Ns; see below.)
When the routine has located the search sequence it scrolls it to the center of the screen and selects the first base. The Find Again button is dimmed if you have not yet used the Find... button.
5. The Go to... button locates a particular basecall by number, scrolling it to screen center and selecting that base.
6. The button changes the setting of the horizontal magnification scroll-bar so that the number of bases displayed corresponds approximately to the normal spacing between characters for the font in use. It also adjusts the vertical magnification so that the largest peak currently on display just fits the height of the window. The command is useful when you come to compare a sequence in a chromatogram window with the same sequence in a multiple alignment window in order to resolve ambiguities. Note that it is not possible to space out the basecalls to match normal text exactly, because each base symbol is drawn at the point where the ABI software made the basecall, and basecalls are not precisely equal in spacing. Chromatogram windows always open in the state.
7. When you click the Code Change button the Ns in most cases are exchanged for a less degenerate base code (for example, if the base at that point could be C or G, but is unlikely to be an A or T, the basecall will be changed to S). Using the full degenerate code set like this provides a little extra information, and since the contig-generating algorithm used in the multiple alignment window takes account of all the degenerate codes, it is a worthwhile change. Clicking Code Change again toggles the degenerate basecalls back to Ns if you change your mind. You should use this button before extracting the sequence if you wish to make use of this feature.
8. The Error rate graph represents a function of the signal-to-noise ratio of the data, and offers a very good guide to the number of errors to be expected in the sequence. The y-axis has been calibrated empirically, using a large number of multiple sequence runs through the same area of sequence. The calibration has only been performed for dye-terminator Taq cycle sequencing, and may be less accurate for other sequencing protocols, but nonetheless is in practice extremely useful.
9. The Invert command reverses and complements the sequence, so that the sequence that you view and extract is the opposite strand from the one that you actually sequenced. It is always a good idea, when sequencing in both directions, to invert at this stage those sequences that will need to be inverted before alignment. You could, of course, invert the extracted sequence window, but then you would lose the ability to call up the chromatogram window directly from the multiple alignment.
10. Here are some useful shortcuts. If you save the extracted sequences in the same folder as the sample files, you will not have to navigate back and forth through folders each time you save or open the next sample file. If you want to save each sample under the name that you entered when you set up the sample sheet on the sequencing machine, you do not have to type it again. Instead, select the name in the comments area of the extracted sequence window and copy it onto the clipboard using Command-C; then choose Save As... from the File menu and paste the name into the dialog box using Command-V.
11. This leads to the parameter dialog shown in Fig. 4.
In the initial stage of sequence assembly, sequences are aligned in pairs in all possible combinations. This is a simple pairwise alignment, and the default criterion for matching is that the sequences must have a segment with at least 12 bases of identical match. If the two sequences do not satisfy this minimum match length they will be placed in separate contigs. You can vary this parameter, although it is not usually a good idea to set the number smaller than 12, because this can lead to an apparently circular contig caused by spurious matches. If your sequence fragments will not align with a minimum match length of 12 then you need to investigate the quality of your sequencing!
12. Although GeneJockey is limited to a maximum of 50 open windows, the Sequence Assembly command can assemble any number of sequences (subject to available memory). This is done by building up large alignments by aligning sequences contained in existing multiple alignment windows. When you do this, you can opt either to take the existing alignments apart and redo the whole alignment from scratch, or to align the sequences as a block (Align by Contig). In the present case, this option is irrelevant, since we only have single sequence windows open. Although GeneJockey can assemble any number of fragments, it makes most sense to work with groups of about 20-30, because the time taken for assembly and editing becomes excessively long as the number of sequences increases. Working with the ABI sequencer produces 24 or 36 sequence fragments per day; I normally assemble these as I get them, saving each day’s completed multiple alignment for an eventual super alignment that will contain all the fragments. It is better to save the multiple alignment windows than the extracted contigs, because the final alignment will then be much more informative. For very large sequencing projects it may not be possible to do this because of memory limitations, and here you will have to fall back on aligning the extracted contigs. A basic rule of thumb here is that you should not work with extracted contigs unless you are absolutely sure they are correct; leave the data in multiple alignment files while any doubt remains about their accuracy.
13. During the alignment, you can either have a simple moving bar thermometer display to keep you informed of progress, or you can have a text window into which the program writes detailed information about the results of each pairwise alignment. The thermometer display is faster, but the text report can be useful if the alignment yields unexpected results, because you can then find out why.
14. In order to remove vector sequence from sequencing fragments, GeneJockey needs a Polylinker File. This is simply a nucleotide sequence file that contains about 100 bases of vector sequence centered around the insertion site. The actual insertion site is marked with a hyphen. If the “Vector removal...” box is checked, the program will ask you to open a Polylinker file before proceeding to the alignment. Each sequence is aligned with the polylinker sequence in both orientations, and if a match is found, sequence is removed up to the point marked by the hyphen. Only the aligned sequences are trimmed; the original sequences remain unchanged.
15.
The window is divided vertically into two panes, the left-hand pane containing the sample names and the right-hand pane containing the sequence data. You can click on the partition between the two panes and drag it to reveal more of the sequence names if they are too long to fit in the available space. Note that the sequence names are followed by a file reference. If the sequence was obtained by extracting an ABI file, the file reference refers to the ABI file; if not, it refers to the sequence file. When the alignment is created, the program orders the sequences in such a way that the leftmost sequence of the alignment is at the top and the remaining sequences are in order of the magnitude of their offsets relative to this sequence. For some purposes you may want to change this order. For example, when editing large alignments it may not be possible to see the consensus sequence and the sequence that you are editing at the same time because the screen is not big enough, so you need to be able to drag sequences near to the consensus sequence. Click on a sequence name in the left-hand pane and hold down the mouse button while dragging the sequence up or down. When you release the mouse button, the sequence will be moved to a new position immediately below the sequence where you released it. You cannot drag a sequence outside of its own contig group. When you click on a sample name it will be highlighted, and the group of seven buttons at the left of the toolbar are then enabled. The commands that are
Fig. 10. Setting the colors and the rule used to generate the consensus sequence in the multiple alignment window.
most frequently used in this view are the Move Left (<) and Move Right (>) commands, which shift the position of the selected sequence one place in the direction indicated relative to the rest of the contig. Try this on your own alignment. Note that although you can select the consensus here, you cannot move it. Although you cannot type directly into the left-hand pane of the window, you can change the sequence name by double-clicking on it to obtain a dialog. Do not edit the file reference; this will not be included when you print this window anyway, so there is no need to remove it. If a sequence is derived from an ABI chromatogram window, and the sequence was inverted before being extracted, the arrow will be drawn pointing to the left to symbolize this. (If you align sequences that are inverted but not extracted from ABI files, you can also make the program do this by adding “.inv” to the end of the file name when you save the sequence file before alignment.)
16. In the color bar view, each base is represented by a two-pixel-wide vertical bar. Alignments between sequences are very easy to see in this view, and the screen can display three times as much sequence as it can in the sequence view. You can set the colors used by means of the small colored button at the bottom left of the window. Clicking this leads to the dialog shown in Fig. 10. As in the chromatogram window, you have a choice of 64 colors. Another important parameter that you can set in this dialog is the rule that the program uses to generate the consensus sequence at the top of each contig group. When a column of bases contains only one base, that base will be placed in the consensus sequence; however, when the sequences do not entirely agree at this point, the program offers you three options regarding what symbol to use. The simplest option is to take a majority vote, putting in a degenerate code if there is no simple majority.
A more strict approach is to insist on an absolute majority, i.e., one base must be present more times than all the others put together. For the strictest interpretation, you may insist that a degenerate code be used if there is any disagreement at all. I find the absolute majority rule to be best for general purpose sequence assembly, although I may switch to the perfect match rule at times when I wish to draw attention to the weaker parts of the
contig. The simple majority rule is generally too lax, but can be used when you are trying to get some kind of quick and dirty results from poor sequence data. The way in which the consensus-generating algorithm works is a little more complex than simply counting the bases, however, since it takes account of degenerate codes. Some codes are twofold degenerate, e.g., M = A or C; some are threefold degenerate, e.g., B = C, G, or T; and N is fourfold degenerate. When counting the score for a particular base, the program counts 4 for an exact match, 2 for a twofold degenerate match, 1 for a threefold degenerate match, and zero for an N. The contig rule that you select then operates on these numbers.
17. When we edit aligned sequence fragments we have three kinds of errors that we wish to correct. In order of increasing seriousness they are:
a. Uncertainties, in which a degenerate call has been made.
b. Substitution errors, in which a wrong base has been called.
c. Frameshift errors, in which a base has been skipped or an extra base has been called.
Of these, the frameshift errors are by far the most serious, because they affect not only the position of the actual error but all subsequent positions too, because the sequence is pushed out of line. In general, the other two types of error are unimportant, since they tend to occur at random, and if you have enough cover (i.e., sufficient replicate or overlapping sequences aligned) they cancel out. The editing process is therefore largely one of locating and editing out frameshift errors. Consider the following three aligned sequences:
ACGGTCATTGCGATGATCC
ACGGTCATTTGCGATGATC
ACGGTCATTGCGATGATCC
Note the extra T (the third T of the TTT run) in the middle sequence. If we were aligning homologous sequences here, we would call this an indel (INsertion/DELetion), and force the three sequences into alignment by putting gaps in the other two sequences:
ACGGTCATT-GCGATGATCC
ACGGTCATTTGCGATGATC
ACGGTCATT-GCGATGATCC
For aligned sequencing fragments, this is probably incorrect. At the position of the extra T we have two votes for G and one for T. G is more likely to be correct here. The consensus sequence generated from this alignment would be one base pair longer than it should be, and if the DNA represented coding sequence, the inferred protein would be completely incorrect from the position of the error forward. We see, therefore, that it is sometimes necessary to delete bases in order to obtain the correct alignment:
ACGGTCATTGCGATGATCC
ACGGTCATTGCGATGATC
ACGGTCATTGCGATGATCC
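The consensus rules and base weights described in Note 16 can be made concrete with a short sketch, applied here to the three fragments of the example. This is only an illustration, not GeneJockey's actual code: the function names are hypothetical, gaps are assumed to cast no vote, and the degenerate code written when a rule fails is assumed to be the IUPAC code covering every base that scored above zero.

```python
from collections import defaultdict
from itertools import zip_longest

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Reverse lookup: a set of bases -> the code that represents it.
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}
# Weights quoted in the text: exact 4, twofold 2, threefold 1, N 0.
WEIGHT = {1: 4, 2: 2, 3: 1, 4: 0}


def score_column(column):
    """Return {base: score} for one alignment column."""
    scores = defaultdict(int)
    for symbol in column:
        bases = IUPAC.get(symbol.upper())
        if bases is None:  # gap or unknown symbol: assumed to cast no vote
            continue
        for base in bases:
            scores[base] += WEIGHT[len(bases)]
    return {base: s for base, s in scores.items() if s > 0}


def call_consensus(column, rule="absolute"):
    """Pick the consensus symbol for one column under the chosen rule."""
    scores = score_column(column)
    if not scores:
        return "N"
    ranked = sorted(scores.values(), reverse=True)
    if rule == "simple":      # top score must beat the runner-up
        decided = len(ranked) == 1 or ranked[0] > ranked[1]
    elif rule == "absolute":  # top score must beat all others combined
        decided = ranked[0] > sum(ranked) - ranked[0]
    elif rule == "perfect":   # any disagreement gives a degenerate code
        decided = len(ranked) == 1
    else:
        raise ValueError(f"unknown rule: {rule}")
    if decided:
        return max(scores, key=scores.get)
    return CODE_FOR[frozenset(scores)]


def consensus(sequences, rule="absolute"):
    """Consensus of already-aligned sequences; short ones are padded with gaps."""
    columns = zip_longest(*sequences, fillvalue="-")
    return "".join(call_consensus(col, rule) for col in columns)


fragments = ["ACGGTCATTGCGATGATCC", "ACGGTCATTTGCGATGATC", "ACGGTCATTGCGATGATCC"]
print(consensus(fragments, "absolute"))
print(consensus(fragments, "perfect"))
```

Under the absolute majority rule the two good fragments outvote the frameshifted one at every position, whereas the perfect match rule marks everything after the extra T with degenerate codes, which is the effect described above for drawing attention to the weaker parts of a contig.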
There is, however, a problem with deleting bases. We are reluctant to do so, because it involves discarding data, and because subsequent sequencing may show that the extra T was correct after all. For this reason, the GeneJockey multiple alignment window uses nonvolatile deletion; when you delete a base or bases using the delete key, the program leaves behind a caret mark (like an inverted Y) to remind you that something has been deleted here, and stores the deleted data so that you can restore the original sequence later if you change your mind. Feel free, when editing out frameshift errors, to use either gaps (-) or deletion, as appropriate.
18. When you start editing a contig, the first thing to do is to locate the best aligned segment. Use the color bar view for this purpose. If necessary, move some of the sequences horizontally to get the best possible alignment at this point. Scroll this area of good alignment to the left until you start to encounter frameshift errors; keep scrolling until the first frameshift error is near the left-hand edge of the screen. Now switch to the sequence view and locate the position of the error in the sequence view. Click on the sequence at the point where you wish to insert or delete bases. The sequence will become editable; it is now displayed in monochrome. Type in a minus sign for a gap, or delete the base before the insertion point with the delete key. Note that as the sequence to the right of the insertion point moves, the consensus sequence at the top of the display is continuously updated to reflect the improved alignment. You may find that where there were degenerate codes before, the consensus now displays real base codes. If you now scroll the window, or click elsewhere other than on the sequence being edited, the sequence returns to the color display, making the alignment much easier to see. Where you have deleted a base, or a string of bases, the caret mark will appear.
Click on the caret mark and use the Undo Deletion command from the Edit menu. The deleted bases will reappear. Nonvolatile deletions are stored when you save the multiple alignment and are carried through when you use this window as input to a larger alignment: You can always restore the original sequence. Some cautions do apply to deletion, however: You should not delete a segment of sequence that already contains a deletion; deletions may not be nested. If you need to do this you should recover the first deletion before deleting the larger block, otherwise the first deletion will be lost. If you want to delete something permanently (for example, if you insert a gap and later change your mind and want to take it out) you should select the part of the sequence to be removed and use the Clear command from the Edit menu. No caret mark will be placed, but even here you have one more chance to change your mind, provided you do not move the insertion point or scroll the window, because the normal Macintosh Undo command works here as in all text editing. Continue working to the right, correcting frameshift errors, either until the sequences are perfectly aligned or until you come to a point at which the error rates are too high to determine the correct alignment. At this point, you can click after the last good base and drag the mouse down and to the right to select from the point selected to the end of the sequence, then delete the lot. If you are working from ABI files and have chosen the correct area of the chromatogram window to extract, this rubbish sequence will be quite short; if you have taken the whole sequence as called, it may be as much as 300-400 bases. However big, it is still nonvolatile and can be recovered later. Now scroll back to the point where you began your work, and start editing the errors to the left of your starting point. Here you will encounter a new problem. Whenever you delete or insert a character, the incorrectly aligned sequence to the left of the insertion point stays where it is, whereas the good sequence to the right moves in or out, destroying your previous good alignment. You then need to move the whole sequence to compensate for this. GeneJockey offers you a useful shortcut here. Select Reverse Edit Mode from the bottom of the Edit menu, or click on the right-hand button of the group of two near the center of the toolbar. Note that the shape of the editing cursor has changed; it now has a left-pointing arrow at the bottom of the I-beam to indicate that you are in this mode. Now all the normal conventions of editing are reversed. When you enter or delete characters, text to the right of the insertion point remains frozen in place, whereas text to the left of the insertion point moves in or out as appropriate. Note that the operation of the delete key is also reversed: Instead of deleting the character to the left of the insertion point and moving the insertion point one place to the left, it now deletes the character to the right, and moves the insertion point one place to the right, along with all the text to the left of the initial insertion position. This is a little hard to get used to at first (and very hard to describe in words) but try it out and you will see the sense of it. Nonvolatile deletions “remember” which editing mode you were in when you made the deletion, so the text will always be restored to its original state if you later recover the deleted sequence.
Reverse edit mode applies only to multiple alignment windows; any other windows you have open will edit normally. You can turn this mode off again using the same menu command used to turn it on, or by clicking on the toolbar button with the I-beam symbol. Continue editing to the left either until you reach perfect alignment or decide that the sequence quality is too low and delete the junk sequence. You can select from the current insertion point to the left-hand end of the sequence by clicking and dragging up and to the left. (Otherwise you cannot select sequence outside the visible sequence in this window, since unlike in most windows, the sequence will not scroll if you drag off screen.)
19. If, after some editing, you decide that you have things completely wrong and want to return the alignment to its original state and start again, there are two ways of doing this. If the window has been saved, you can return to the last saved version using the Revert to Original command in the File menu, which applies to this window as to all saveable GeneJockey windows. More selective reversion can be obtained using the Dis-Optimize command from the multiple alignment submenu, or by means of the equivalent toolbar button. This command applies either to a single sequence or to a whole contig (select the word Contig#n in the left-hand pane for this purpose; see Fig. 5). It removes all gaps, restores all
nonvolatile deletions, and replaces the sequence in its original offset position if you have moved it horizontally. If you have typed in characters other than gaps, or deleted sequence permanently using the Clear command, these changes will not be undone. The Optimize command sets out to automatically do the editing that you have been doing by hand. You can optimize either a single sequence or a whole contig. It works by realigning the selected sequence with the consensus sequence, inserting gaps or nonvolatile deletions to bring as much of the sequence into alignment as possible. It is much faster than editing the sequence by hand, but there are fairly strict limitations on what it can do, and in particular if the sequence data is poor it will produce results inferior to those obtained by manual editing. Since the routine works by realigning the sequence with the consensus of all the other sequences in the contig, you need to have a reasonably good consensus sequence, without any long stretches of degenerate codes. A little rough editing and moving of sequences before using it can often help considerably. If your data has long stretches of junk sequence at the ends, you should remove them before using the Optimize command, and preferably remove them permanently with the Clear command rather than use nonvolatile deletion, because the Optimize command undoes all deletions and removes gaps before starting. Unlike you, it is not permitted to discard long stretches of sequence, so it may spend long periods futilely trying to align junk sequence. I find it best to use this command in the color bar view, because the improvements that it produces are most obvious. You can use the Dis-Optimize command to remove changes made by Optimize, but remember that this is not a true Undo command: It removes all gaps and nonvolatile deletions, including any that you made by hand before issuing the Optimize command. There remain two further commands on the multiple alignment submenu.
Delete Sequence can be used to remove either a sequence or a whole contig from the alignment. Use this command with care; there is no Undo for it. For this reason, Delete Sequence is not present on the toolbar, because this might make it too easy to use by accident. Extract Sequence creates a new nucleotide sequence window containing the extracted sequence. Extract Sequence is the command used to extract the consensus sequence after editing; when used in this way it places in the comments area of the new window a list of all the sequence names which contributed to the contig, along with the location of the corresponding area of sequence to which each fragment contributed. You can also use it to extract individual fragments after editing. Here, what you get is the edited sequence; i.e., all the nonvolatile deletions have been permanently removed. The toolbar button with the picture of a sequence window provides another way of issuing this command.
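The behavior of nonvolatile deletion described in Notes 17-19 (a caret mark that remembers the deleted bases, no nesting, restoration by Dis-Optimize, and permanent removal on Extract Sequence) can be modeled in a few lines. This is a minimal sketch under stated assumptions, not GeneJockey's data structure: the `Fragment` and `Deletion` names are hypothetical, and because the text does not say how Extract Sequence treats gaps, the sketch leaves them in place.

```python
from dataclasses import dataclass


@dataclass
class Deletion:
    """Placeholder shown as a caret; remembers the bases it replaced."""
    removed: str


class Fragment:
    def __init__(self, sequence):
        # Each item is a one-character string (base or gap) or a Deletion.
        self.items = list(sequence)

    def delete(self, start, end):
        """Nonvolatile deletion of items[start:end]."""
        span = self.items[start:end]
        if any(isinstance(item, Deletion) for item in span):
            raise ValueError("deletions may not be nested; restore the inner one first")
        self.items[start:end] = [Deletion("".join(span))]

    def restore(self, index):
        """Undo Deletion: put back the bases behind the caret at index."""
        item = self.items[index]
        if not isinstance(item, Deletion):
            raise ValueError("no caret mark at this position")
        self.items[index:index + 1] = list(item.removed)

    def dis_optimize(self):
        """Remove all gaps and restore all nonvolatile deletions."""
        restored = []
        for item in self.items:
            if isinstance(item, Deletion):
                restored.extend(ch for ch in item.removed if ch != "-")
            elif item != "-":
                restored.append(item)
        self.items = restored

    def display(self):
        """Sequence as shown in the window, with '^' for each caret mark."""
        return "".join("^" if isinstance(i, Deletion) else i for i in self.items)

    def extract(self):
        """Edited sequence with nonvolatile deletions permanently dropped."""
        return "".join(i for i in self.items if isinstance(i, str))


middle = Fragment("ACGGTCATTTGCGATGATC")
middle.delete(9, 10)          # remove the extra T from the example in Note 17
print(middle.display())       # caret marks the deleted base
print(middle.extract())       # what Extract Sequence would hand on
middle.restore(9)
print(middle.display())       # original sequence is back
```

Keeping the deleted text inside the placeholder, rather than in a separate undo history, is what lets the deletion survive saving and reuse of the alignment as described above.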
AutoAssembler
Sequence Assembly Software
Steven R. Parker
From: Methods in Molecular Biology, Vol. 70: Sequence Data Analysis Guidebook. Edited by S. R. Swindell. Humana Press Inc., Totowa, NJ
1. Introduction
AutoAssembler is a Macintosh software package from PE-Applied Biosystems (Foster City, CA) designed for assembly of DNA sequences. The graphical user interface is easy to use and allows the importation of text files as well as analysis files from Applied Biosystems automated sequencers (Models 310, 373, and 377). If Applied Biosystems analysis files are used, the accompanying electropherograms may also be displayed and edited. The assembly algorithm (1) used by AutoAssembler generates very accurate assemblages, and is ideal for incremental assembly projects (see Note 1). There is no software limit to the size of a project that may be assembled. Included with AutoAssembler is Factura, a prefilter or clean-up program. Factura allows the user to import DNA sequence files (either as text or Applied Biosystems analysis files) and process them in batch mode, deactivating identified features, such as vector sequence, areas of high ambiguity, and low confidence ranges. After the sequences have been processed in Factura, they are then imported into AutoAssembler where they are assembled. On completion of assembly, a consensus sequence is computed and ambiguities can be resolved. The results may then be printed and/or exported to another program for further analysis.
2. Materials
2.1. Hardware
1. Any Macintosh II or greater with color monitor. Power Macintoshes are supported with the software running in native mode.
2. Minimum memory requirements: 8 Mb RAM; 16 Mb or greater recommended.
3. One high-density floppy drive and a hard drive. The applications and supporting files use approx 4 Mb of hard disk space. Sample files typically take 1-2 Kb each for text files, 160 Kb each for Applied Biosystems sequence files.
4. A Macintosh-compatible printer if hard copy is required.
2.2. Software
1. Macintosh system software 7.0.1 or greater.
2. The AutoAssembler and Factura applications.
3. The Libraries folder included with the AutoAssembler software package. This folder must be located in the ABI Folder in the System Folder.
4. Sample files (optional) included with the AutoAssembler package for use with the included tutorials (see Note 2).
2.3. Data
DNA sequences are accepted by Factura and AutoAssembler in the following formats:
1. Text files.
2. Processed analysis files from Applied Biosystems automated sequencers (Models 310, 373, 377).
3. Methods
Find the AutoAssembler folder on the hard drive and double-click on it to open it. Double-click on the Factura icon to launch the application.
3.1. Setting Up Factura Libraries
1. Choose the Vector Library Setup... command from the Library menu to open the setup dialog box (Fig. 1).
2. Choose a vector by clicking once on the desired vector located in the All Vectors in VecBase scrollable list in the upper-left-hand corner. The correct vector should be highlighted (see Note 3).
3. Click on the >Copy> button to copy the vector over to your personal library. For the tutorial data, use the M13MP19 vector.
4. Click the OK button.
5. Repeat step 2 until all required vectors have been copied.
6. The same steps may be used to create cloning site and primer libraries using the Enzyme Library Setup... and Primer Library Setup... commands under the Library menu. For the tutorial data, use SmaI as the cloning site and M13-21 as the primer.
3.2. Setting up Factura Parameters
1. Choose the Settings command from the Worksheet menu to open the setup dialog box (Fig. 2).
Fig. 1. Adding vectors to the user vector library in Factura. The VecBase library is included in the software, and users may also add custom vectors as well.
Fig. 2. The Settings menu in Factura, specifying the parameters Factura will use to process the sample files.
2. Be sure that the Identify Vector Sequence check-box is checked. Select M13MP19 for Vector, M13-21 for Primer, and SmaI for Cloning Site from the pop-up menus.
3. Be sure that the Identify Ambiguity check-box is checked.
4. Set the parameters for base removal to 1 ambiguity remaining out of 20 bases.
5. Check the Reject Sequences box to activate it.
6. Set the ambiguity level for sequences to be rejected at > 10%.
Fig. 3. A Factura worksheet after sequences have been added. The parameters in individual cells may be changed before processing.
7. Be sure that the Identify Confidence Range check-box is checked.
8. Select the range from 1450.
9. Be sure that the Identify IUB/Heterozygous Bases check-box is checked.
10. From the pop-up menu, set the threshold at 50%.
11. Click on the Update Edited Bases check-box.
12. Click on the Automatically Save to Sequence File, Revert Sequences to Original Basecalls, and Use These Settings as Default Value check-boxes to activate these commands.
13. Click OK.
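To give a feel for what the ambiguity settings in steps 4-6 ask for, the following sketch trims ambiguous ends and rejects poor reads. It is not Factura's algorithm, which is not described here: the function names are hypothetical, the trimming is assumed to slide in from each end until the terminal 20-base window holds no more than 1 ambiguity, and the 10% rejection level is assumed to be measured on the untrimmed read.

```python
def is_ambiguous(base):
    """Anything other than A, C, G, or T counts as an ambiguity."""
    return base.upper() not in "ACGT"


def count_ambiguous(segment):
    return sum(is_ambiguous(base) for base in segment)


def clear_range(seq, window=20, max_ambiguous=1):
    """Return (start, end) of the region kept after trimming both ends.

    Bases are dropped from an end while the window at that end still
    contains more than max_ambiguous ambiguous calls.
    """
    start, end = 0, len(seq)
    while end - start >= window and \
            count_ambiguous(seq[start:start + window]) > max_ambiguous:
        start += 1
    while end - start >= window and \
            count_ambiguous(seq[end - window:end]) > max_ambiguous:
        end -= 1
    return start, end


def should_reject(seq, max_fraction=0.10):
    """Reject a read whose overall ambiguity level exceeds max_fraction."""
    if not seq:
        return True
    return count_ambiguous(seq) / len(seq) > max_fraction


read = "NNAN" + "ACGT" * 10 + "NNN"
start, end = clear_range(read)
print(start, end, read[start:end])
print(should_reject(read))
```

A real prefilter would record the trimmed ranges as deactivated features rather than discard them, in line with Note 5, which says the original data is always retained.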
3.3. Importing Data into Factura
1. Select the Add Sequences command under the Worksheet menu.
2. Make sure that the File Type: 373 check-box is selected.
3. Locate the “0x208” sequences in the Tutorial Data folder in the AutoAssembler folder using the pop-up menu of folder names.
4. Click the Add All button to add them to the worksheet. Each worksheet can contain up to 999 samples.
5. The sequences and the appropriate parameters should now be loaded onto the worksheet (Fig. 3; see Note 4).
3.4. Factura Analysis
1. Select the Submit command under the Worksheet menu.
2. Click on the Yes button in the ensuing dialog box asking if you want to revert the sequences to the original basecalls.
3. The analysis should begin, with the percent completion shown by a progress bar.
4. On completion of the analysis, a Save Sequence File dialog box should appear. Click on OK to save the results back to the original sequence file.
5. A dialog box appears asking if you would like a batch report of results to be generated. Click the Yes button.
Fig. 4. A view of a sequence in Factura after processing. The areas deactivated by Factura are denoted by gray letters; mixed-base positions are denoted by red IUB codes.
6. A batch report will appear, summarizing the results from the Factura analysis. The ranges of each sequence in which vector and ambiguities were found are listed, and the resulting Clear Length of good sequence is reported.
7. To print the batch report, choose Print from the File menu.
8. To save the batch report, choose Save As from the File menu and type a name in the Save This Document As dialog box, then click on the Save button.
9. Click the small box in the upper-left corner of the Batch Report window to close the batch report.
3.5. Viewing Factura Results 1. Click on 1 under the # column on the far left of the worksheet to highlight the first sample row. 2. Select Show Sequence from the Worksheet menu. 3. The sequence should now be displayed (Fig. 4). Note that deactivated areas appear in gray; IUB bases appear in red. 4. To view the sequence features identified by Factura, click on the third button from the left in the lower-left corner of the sequence window. The identified features should now be listed (Fig. 5). 5. To view the associated electropherogram, click on the fourth button from the left in the lower-left corner of the window. The electropherogram may now be scrolled, and bases edited by highlighting them and typing in the new basecall. Use the left and right arrow keys to help position the cursor directly over a base to change it (see Note 6). 6. To view the annotations from the Applied Biosystems automated sequencer, click on the first button in the lower-left corner. 7. To close this view, click on the box in the upper-left corner of the window.
Fig. 5. A view of a Features Table created by Factura. Deactivated regions are shown, and IUB base positions are displayed with the peak ratio calculations. Compare the features listed here to the sequence view in Fig. 4.
3.6. Saving the Batch Worksheet and Exiting Factura 1. With the Batch Worksheet in the foreground, choose Save As from the File menu. 2. Type a name for the file (i.e., Batch - 1) in the Save This Document As dialog box, then click on the Save button. 3. To exit Factura, use the Quit command under the File menu.
3.7. Importing Data into AutoAssembler
1. Locate the AutoAssembler icon and double-click to launch the application. A blank project form should appear. 2. To import sequences for assembly that were analyzed in Factura, choose the Add Sequence(s)... command under the Project menu. Add the 0x208 files by selecting the Tutorial Data folder containing the files and clicking the Add All button. 3. The project form should now display in the upper-right corner the names of the six 0x208 sequences previously processed in Factura.
3.8. Performing Assembly
1. Choose the Assembly Setup command under the Project menu to set up the assembly parameters.
2. Choose 20 bases for Minimum Overlap and 15% for Percent Error.
3. Click the Submit button. An Assembly Status dialog box should appear showing the progress of the assembly.
4. When assembly is complete, the Assembly Status box should disappear and the results should be graphically displayed in the lower half of the project form. Consensus sequences are listed in the Assemblage List in the upper-left corner, and may be viewed individually by clicking on the name. The sequences forming the selected consensus are listed in the Sequence List in the upper-right corner.
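The two assembly parameters set in step 2 can be read as a test applied to each pair of reads: is there an overlap of at least 20 bases with no more than 15% disagreement, in either orientation? The sketch below illustrates that test only. It is a deliberately simplified, ungapped comparison with hypothetical function names, and it is not the CAP algorithm used by AutoAssembler, which scores overlaps by dynamic programming (see Note 1).

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def suffix_prefix_overlap(a, b, min_overlap=20, max_error=0.15):
    """Longest overlap of the end of a with the start of b, or 0.

    An overlap qualifies when it is at least min_overlap bases long and
    the fraction of mismatched positions does not exceed max_error.
    """
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-length:], b[:length]))
        if mismatches / length <= max_error:
            return length
    return 0


def best_overlap(a, b, **limits):
    """Try b as given and reverse-complemented; report the better overlap."""
    forward = suffix_prefix_overlap(a, b, **limits)
    reverse = suffix_prefix_overlap(a, reverse_complement(b), **limits)
    if forward >= reverse:
        return "forward", forward
    return "reverse", reverse


shared = "GCTAGGCTTACGATCGGATCCATG"      # 24 bases common to both reads
read_a = "AAAAACCCCC" + shared
read_b = shared + "TTTTTGGGGG"

print(best_overlap(read_a, read_b))
print(best_overlap(read_a, reverse_complement(read_b)))
```

Raising Minimum Overlap above the length of the shared region (here, `min_overlap=25`) makes the pair fail the test, which is the situation in which the reads would end up in separate assemblages.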
Fig. 6. The Layout View in AutoAssembler, showing the orientation and position of the assembled sequences. Each arrow in the lower half of the project form represents an individual sequence, its orientation, and its position in the consensus. Clicking on an arrow highlights the name of the sequence in the Sequence List, and clicking on the name of a sequence highlights the arrow corresponding to that sequence.
3.9. Viewing Results: Layout View
1. In the lower-left corner of the project form there are three small buttons. Click on the button on the left to display the Layout View (Fig. 6).
2. Select the OX208.188.Assemblage.1 in the Assemblage List if it is not already selected.
3. Each arrow in the lower half of the project form (called the Results Panel) represents an individual sequence, its orientation, and its position in the consensus. Clicking on an arrow highlights the name of the sequence in the Sequence List, and clicking on the name of a sequence highlights the arrow corresponding to that sequence.
3.10. Viewing and Editing Results: Alignment View
3.10.1. Displaying the Electropherograms
1. Click on the small box in the upper-right corner of the project window to expand the project form to full size.
2. Click on the middle button in the lower-left corner of the project form to display the Alignment View (Fig. 7; see Note 7).
3. The topmost sequence displayed in the Results Panel is the consensus sequence. Any ambiguities are denoted by a lowercase basecall and by a black dot immediately below the ambiguity. Below this are the individual sequences that form the consensus.
4. Click on the consensus sequence.
5. Press the Tab key; this moves the cursor to the next ambiguity in the consensus sequence.
Fig. 7. The Alignment View in AutoAssembler, showing the consensus sequences generated from the associated sequences, and the three-frame protein translation. The topmost sequence displayed in the Results Panel is the consensus sequence. Any ambiguities are denoted by a lowercase basecall and by a black dot immediately below the ambiguity. Below this are the individual sequences that form the consensus.
6. Use this method to scroll to an ambiguity at approximately position 318. There should be a t at this position, and below it the bases N, T, and T.
7. Position the cursor at the t in the consensus sequence.
8. Double-click the mouse button. The associated electropherograms should automatically open and be displayed (Fig. 8).
3.10.2. Resizing and Rescaling the Electropherograms
1. Move the cursor to the baseline of one of the electropherograms. The cursor should change to a double-headed vertical arrow.
2. Click and hold the mouse button down to drag the baseline. This changes the vertical size of the electropherogram.
3. Move the cursor to the area of the N in the top electropherogram (that is, in the trace, not the sequence).
4. Hold down the Option and Shift keys.
5. Click and hold down the mouse button.
6. Drag the peaks horizontally until they can be more easily viewed. All open electropherograms will be equally rescaled.
7. Hold down the Option key.
8. Click and hold down the mouse button.
9. Rescale the peaks vertically by dragging up or down.
10. At this point, there should be a small T peak visible at the N position. To edit, highlight the N with the cursor and type in t to replace it (see Note 8). Note that the consensus sequence is automatically updated.
Fig. 8. Double-clicking on the consensus sequence displays the electropherograms in AutoAssembler from files generated from an Applied Biosystems automated sequencer. Edits may be performed directly on the electropherograms, and the peaks will remain aligned during scrolling. To rescale the electropherogram horizontally, hold down the Option and Shift keys and drag the mouse. To rescale vertically, hold down the Option key and drag the mouse.
11. Place the cursor on the consensus sequence and use the Tab key to scroll to the next ambiguity. Note that all the open electropherograms remain aligned.
3.10.3. Opening and Closing Electropherograms
1. To open individual electropherograms, double-click on the corresponding sequence.
2. Double-click on the sequence to subsequently close it.
3. To close all the electropherograms, double-click the consensus sequence.
3.10.4. Displaying Protein Translations
To display the three-frame translation to protein of the consensus sequence, select Show Protein Translation under the Project menu. The translation will then appear under the consensus sequence (Fig. 7).
3.11. Exporting Consensus Sequence and Saving Results
1. Choose Build Consensus... under the Project menu.
2. Enter “OX208.188.Assemblage.1” as the Name.
3. Select Mixed from the CASE: drag-down menu.
4. Check the Delete Insertion (Gap) characters box.
5. Click OK.
6. The consensus sequence should now be displayed.
7. To export the consensus sequence as a text file, choose Export under the File menu and select Text....
8. Name the consensus Assemblage 1.
9. Click the Save button. The text file should now be created.
10. Click the small box in the upper-left corner of the consensus window to close it.
11. To save the assembly project, select Save As... under the File menu. Enter the name “Project 1” and click the Save button.
3.12. Printing Results and Exiting AutoAssembler
1. Select Page Setup under the File menu.
2. Select the proper paper size and orientation for your printer.
3. To print an entire assembly, choose Print under the File menu, then click on the Print button to begin printing.
4. To print just the results of the assembly, choose Project Report under the Project menu.
5. Choose Print under the File menu and click the Print button to begin printing.
6. To exit AutoAssembler, choose the Quit command under the File menu. When asked to save changes to individual sequences, click on the Yes button for each sequence to record the edits.
4. Notes
1. Unlike most assembly algorithms, the Contig Assembly Program (CAP) algorithm used by AutoAssembler does not compute intermediate contigs. Instead, it compares all sequences in both orientations, mathematically scores those relationships, and then uses dynamic programming to determine the optimum sequence order, resulting in a precise consensus sequence. Problems resulting from sequence order and ambiguities are therefore minimized.
2. Items 2-4 will be automatically loaded if the Installer program is used. The Installer should be used to ensure that the correct software (68 K or PowerPC) is loaded.
3. If the required vector is not listed in VecBase, it may be imported from a text or Applied Biosystems Analysis file by clicking on the Add... button.
4. If desired, individual parameters can be changed by clicking on the appropriate cell of the worksheet and making the change using the pop-up menu that appears in the upper center of the worksheet. For the tutorial data, the parameters should not be changed.
5. Saving results back to the original file only writes information to the Features Table section of the sequence file; the original data is always retained.
6. To view the original sequence basecalls, go to the Electropherogram command under the Sequence menu, and choose Show Original. The original basecalls are then shown on top, with the edited sequence below.
7. If the resulting bases are not displayed in color, go to Settings... under the Edit menu and click on the Draw Bases in Color box in the Display section.
8. To help keep track of edits, it is a good idea to perform all editing in lowercase letters.
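Note 1 states that the assembly algorithm compares all sequences in both orientations and scores the relationships. The following Python sketch illustrates only that idea and is not the CAP algorithm itself: it scores the longest exact suffix-prefix overlap between two reads, trying the second read both as given and as its reverse complement. The function names and the exact-match scoring are assumptions made for this illustration.

```python
# Complement table for upper- and lowercase DNA bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def suffix_prefix_overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def best_orientation(a, b):
    """Score b against a in both orientations and keep the better one."""
    forward = suffix_prefix_overlap(a, b)
    reverse = suffix_prefix_overlap(a, reverse_complement(b))
    if forward >= reverse:
        return ("forward", forward)
    return ("reverse", reverse)
```

For example, best_orientation("ACGTTGCA", "CCCTGCA") reports the reverse orientation, because only the reverse complement of the second read overlaps the end of the first. A real assembler would tolerate mismatches and gaps in the overlap rather than require exact identity.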
Reference 1. Huang, X. (1992) A contig assembly program based on sensitive detection of fragment overlaps. Genomics 14, 18-25.
MEGALIGN
The Multiple Alignment Module of LASERGENE
Jonathan P. Clewley and Catherine Arnold
1. Introduction
Alignments between DNA or protein sequences are the best way of comparing sequences to determine if they are similar. The degree of similarity between different sequences, that is, the extent of conserved nucleotide or amino acid residues, can be used to make inferences about whether they share common ancestry, or have common structures and functions that may have arisen through convergent evolution. The subject of molecular phylogenetics or systematics is very complex and cannot be explored here. A good starting point is Of URFs and ORFs by R. F. Doolittle (1). Other texts include Sequence Analysis in Molecular Biology: Treasure Trove or Trivial Pursuit by G. von Heijne; Sequence Analysis Primer edited by M. Gribskov and J. Devereux; Fundamentals of Molecular Evolution by W.-H. Li and D. Graur; and several chapters in Computer Analysis of Sequence Data, Part II edited by A. M. and H. G. Griffin (2-4). The documentation that comes with programs such as PHYLIP and PAUP is also an instructive source.
2. Materials
2.1. Macintosh Hardware
1. Any Macintosh computer with a floating point unit installed. We have successfully used a IIci, a IIVx, an SE30, an LC475 with SoftwareFPU installed (John Neil & Associates, Cupertino, CA), and a PowerPC with LASERGENE.
2. Minimum memory requirements of 4 Mb RAM (8 Mb RAM or more is recommended).
3. Minimum free hard disk space of 25 Mb. More may be required because of the creation of temporary files.
4. Macintosh compatible monitor (256-color monitor is recommended).
5. Macintosh compatible printer (laser printers are recommended).
Fig. 1. The Enter Sequences dialog menu.
2.2. Macintosh Software
1. Macintosh system software 6.0.1 or higher.
2. The LASERGENE application MEGALIGN. Follow the instructions provided by DNASTAR, Inc. for installing the LASERGENE software.
3. Methods
3.1. Opening MEGALIGN
MEGALIGN is opened by clicking on “Multiple Sequence Alignment” from the LASERGENE Navigator, or by double-clicking the application icon. An empty “Worktable” is presented on the screen. A new, empty Worktable can be created by selecting New ([command key]N) from the File menu. Selecting Open ([command key]O) from the File menu will open a dialog box from which a pre-existing project can be selected. The Worktable is divided into three windows; the middle and right ones are scrollable and by default show the beginning and end of the sequence. The left window shows the sequence names and has palette tools.
3.2. Entering Sequences
1. From the File menu choose Enter Sequences ([command key]E).
2. Select the sequences to be aligned from the left window.
3. Click >> Add >> to move selected sequences to the list in the right window (Fig. 1). A folder of sequences can be added. The sequences must be DNASTAR files, that is, EDITSEQ documents.
Fig. 2. The Worktable of MEGALIGN. Note the palette of tools along the left-hand edge of the window. These are, in descending order: Show as DNA, Show as Protein, Straighten Columns, Shuffle Right, Shuffle Left, Uncolor Residues, and Color Dissimilar Residues.
4. Click Done. A Worktable appears filled with the chosen sequences (Fig. 2). Both DNA and protein sequences can be added to the same Worktable. If this is done, the DNA sequences will be automatically translated to amino acids, starting at the first base, whether or not it is in frame.
5. Click a sequence name to select it.
6. When the sequence name is selected, click and hold the mouse button down. The selected sequence can then be dragged up or down the list to reorder the sequences.
7. A long single click will allow the name to be changed from the keyboard (see Note 1).
3.3. Subranging Sequences
Selecting sequence subranges allows the alignment to be refined by removing extraneous data from the process without editing the sequence file. This is useful, for example, if a mixture of DNA and protein sequences has been entered and you wish to select the correct reading frame for translation of the DNA sequences. A sequence subrange may be defined in two ways: by defining the coordinates of the subrange or by selecting a feature from the feature table described in the Comments section of an EDITSEQ document. If a valid feature table exists, double-clicking a highlighted sequence name will open the feature table dialog. Otherwise the coordinates dialog will appear.
3.3.1. Using the Feature Table
The sequence features should be defined as part of GenBank, EMBL, NBRF-PIR, and Swiss-Prot files. When the files are imported into EDITSEQ, the feature table is automatically built as part of the import process. Alternatively, you can create your own in an EDITSEQ document. This allows, for example, a correctly spliced protein sequence, with introns removed, to be part of a MEGALIGN Worktable without the parent DNA sequence having to be edited.
1. Select From Features Table ([command key]T) from the Set Sequence Limits submenu of the Options menu. A dialog box listing the sequence’s features will appear.
2. Highlight the correct portion of the sequence.
3. Click the Next button to move to the next sequence in the Worktable. Note that to use the whole sequence, no feature should be highlighted.
4. Click OK to return to the Worktable.
3.3.2. Using Coordinates
1. Select By Coordinates ([command key]L) from the Set Sequence Limits submenu of the Options menu. A standard LASERGENE thumbwheel dialog box appears (see Note 2).
2. Select the sequence subrange.
3. Click the Next button to move to the next sequence in the Worktable.
4. Click OK to return to the Worktable.
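The reason subranging matters for translated DNA can be shown with a short Python sketch. It is an illustration only, not MEGALIGN code: the translate function, its 1-based start and end arguments, and the example sequence are assumptions, and the codon table is the standard genetic code. Translating from the first base gives one reading frame; setting the subrange to start at the ATG gives the intended one.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, start=1, end=None):
    """Translate dna from start to end (1-based, inclusive), codon by codon.

    Incomplete trailing codons are ignored; codons containing characters
    other than T, C, A, G are reported as X.
    """
    sub = dna.upper()[start - 1:end]
    codons = (sub[i:i + 3] for i in range(0, len(sub) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


if __name__ == "__main__":
    dna = "GATGGCCTAA"
    print(translate(dna))            # whole sequence, frame of base 1
    print(translate(dna, start=2))   # subrange starting at the ATG
```

With this hypothetical sequence, the first call reads GAT GGC CTA, whereas the second reads ATG GCC TAA, which is the frame containing the start and stop codons.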
3.4. Aligning the Sequences
MEGALIGN offers two multiple alignment methods: Clustal and Jotun Hein. Use the Jotun Hein method (5) if the sequences are related by common descent; otherwise, use Clustal (6).
1. Select Method Parameters... from the Align menu.
2. Select the method parameters you wish to change using the radio buttons (see Note 3).
3. From the Align menu, choose either By Clustal Method ([command key]K) or By Jotun Hein Method ([command key]J). The alignment will commence (see Note 4).
4. After alignment, the Worktable will present the aligned sequences and a consensus sequence (see Note 5).
5. Choose Save... ([command key]S) or Save As... from the File menu to save the project before continuing (see Note 6). Save As... can be used to export the alignment as a PAUP/Nexus or GCG document from the Format submenu of the Save As... dialog box (see Note 7).
6. Examine the Alignment Report, Sequence Distances, Residue Substitutions, and Phylogenetic Tree by selecting them from the View menu (see Note 8).
7. To produce a publication-quality figure, select Alignment Report from the View menu.
8. The way in which the residues in the report are highlighted may be altered using sequence decorations. Decorations may be added to the report using the Decoration submenu of the Options menu (see Note 9).
9. Select Print from the File menu (see Note 10).
3.5. Realigning Residues
An alignment can be manually edited using the palette tools on the left side of the Worktable (Fig. 2). A residue can be selected by positioning the cursor over it; the cursor will change to a square tool that can be used to highlight individual
residues. The sequence containing the selected residues can be moved by the Straighten Columns, Shuffle Right, and Shuffle Left palette tools. For example, by selecting one residue and the gap next to it with the square tool, the residue can be shifted into the gap by the appropriate Shuffle palette button. Go to Position ([command key]M) from the Edit menu can be used to move to a single residue or range of residues in one sequence, or in the consensus. Gaps and sequence disagreements can be found using Find Disagreement from the Edit menu.
Fig. 3. One-pair alignment dialog box.
3.6. Pairwise Alignments
Pairs of sequences can be compared in MEGALIGN (see Note 11). There are four pairwise methods available. The Lipman-Pearson method is for protein alignments (7); the Wilbur-Lipman (8) and Martinez-Needleman-Wunsch (9,10) methods are for DNA alignments; the Dot Plot method can be used for either DNA or protein. The Needleman-Wunsch algorithm is the basis for many alignment programs, both protein and DNA. It is explained clearly by Doolittle (1). The Wilbur-Lipman method is for global alignments; the Martinez-Needleman-Wunsch method is for local ones. For example, if searching a large sequence for similarity to a primer sequence, to which it is related but not identical, Martinez-Needleman-Wunsch is the better choice. In practice it is often best to use both methods.
1. Select the sequence pairs to be aligned by clicking once on their names in the sequence names field.
2. Select the alignment method you require from the One Pair submenu of the Align menu. A parameter dialog box appears (Fig. 3). The default parameters should be used for a first alignment. The effect of reducing the penalties can be investigated afterward. Higher gap penalty figures produce a more stringent alignment with a lower similarity index score.
3. Click OK.
4. After the alignment is completed, the Alignment view appears (Fig. 4). The Alignment view shows the subrange of the sequence that has formed the alignment.
MEGALIGN does not display the entire sequence that went into the alignment, only that part of it that has significant similarity to the other sequence (see Note 12). The similarity index, and number and length of gaps are also shown in the Alignment view. The similarity index refers only to the aligned part of the two sequences, not to the entire sequences (see Note 13).
Fig. 4. The alignment view of a protein one-pair alignment.
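Because the Needleman-Wunsch algorithm underlies many of these methods, a minimal Python sketch of its scoring recurrence may help in understanding how the gap penalty influences an alignment. This is the textbook global version with a single linear gap penalty, not MEGALIGN's implementation; the function name and the match, mismatch, and gap values are assumptions chosen for illustration.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b by dynamic programming.

    Only the previous row of the score matrix is kept, so memory use is
    proportional to len(b).
    """
    # Row 0: aligning an empty prefix of a against j residues of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("GATTACA", "GATACA"))
    print(needleman_wunsch_score("GATTACA", "GATACA", gap=-10))
```

Running the two calls shows the same pair of sequences scoring lower when the gap penalty is made more severe, which is the effect explored by changing the penalties in the parameter dialog.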
3.6.1. Dot Plot Alignments
Dot Plot is a graphical method for finding similarities between two sequences, including repeats in a single sequence by aligning it against itself or its complement. Finding repeats in this way is more applicable to DNA than protein sequences (see Section 3.6.2.).
1. Highlight a pair of sequences from the Worktable by clicking on them.
2. Select Dot Plot from the One Pair submenu of the Align menu. A Filtered Dot Plot parameter table appears (see Note 14).
3. Click OK. A typical dot plot is shown in Fig. 5.
4. Double-click on a diagonal to produce an Alignment view.
5. Click on the Filter tool to produce a histogram of the distribution of lengths of similar sequence regions (Fig. 6). The greater regions of similarity, represented by the longer diagonals, are plotted to the right of the histogram.
6. Click on the histogram. This produces a dotted line called a Range Finder.
7. Drag-click to isolate a subrange of diagonals (Fig. 6).
8. Click the Filter button in the Filter window.
9. Click the close box. The dot plot changes to display just the diagonals selected. These can be shown as alignments by double-clicking on them or by clicking the View Diagonal palette tool (second from bottom).
10. Use the box tool (second from top of the palette) to define a region for subalignment by drag-clicking over the appropriate area of the Diagonal view window. When a region has been boxed the subalignment tool (bottom of the palette) becomes active. Clicking on this tool produces a menu from which a further alignment, with changed parameters if desired, can be done.
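The histogram produced by the Filter tool in step 5 summarizes how long the diagonals are. The idea can be sketched in Python as follows, under simplifying assumptions: here a diagonal is a run of exactly identical residues, whereas MEGALIGN applies its Percentage Match and window parameters; the function names and the min_len cutoff are hypothetical.

```python
from collections import Counter


def diagonal_run_lengths(a, b, min_len=2):
    """Lengths of maximal runs of identical residues along every diagonal."""
    lengths = []
    # offset is (position in b) minus (position in a) along one diagonal.
    for offset in range(-(len(a) - 1), len(b)):
        run = 0
        i = max(0, -offset)
        j = i + offset
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                run += 1
            else:
                if run >= min_len:
                    lengths.append(run)
                run = 0
            i += 1
            j += 1
        if run >= min_len:
            lengths.append(run)
    return lengths


def length_histogram(a, b, min_len=2):
    """Map each diagonal length to the number of diagonals of that length."""
    return Counter(diagonal_run_lengths(a, b, min_len))
```

Selecting a subrange with the Range Finders corresponds to keeping only the diagonals whose length falls between two chosen values in this histogram.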
3.6.2. DNA Dot Plot Alignments
A DNA dot plot has three additional palette tools. These are used to choose which strands of the two sequences are compared, either both forward sequences, or forward of one sequence with the complement of the other. The third tool allows superimposition of these two views.
Fig. 5. A Dot Plot window. Note the palette of tools along the left-hand edge of the window. These are, in descending order: Pointer, Box Tool, Filter, Zoom In, Zoom Out, Position Indicator, View Diagonal, and Subalignment Tool.
Fig. 6. Histogram produced from a Dot Plot by using the filter tool. The vertical dotted lines are range finders for selecting a subset of the diagonals. The histogram scale is displayed as a color scale ranging from blue (left) to red (right). The scale indicates increasing length of regions of similarity from left to right.
4. Notes
1. The combination of clicks required to select, move, or change sequence file names can be difficult. For example, during the click and drag method used to reorder sequences, if you click the pointer over the sequence name and then do not drag the sequence name immediately, this may be interpreted as a long single click by the program. The highlighting of the name will change, and in order to move the sequence you must deselect that sequence by clicking elsewhere in the sequence name list and then restart the procedure.
2. For many LASERGENE operations it is possible to specify subranges of a sequence for analysis without deleting data from a file. The standard method for opening or specifying a defined subrange of a sequence is the Set Ends button in the Open dialog box. If you click on Set Ends you will be presented with a window in which the sequence subrange can be entered in text fields or, alternatively, thumbwheels can be manipulated with the pointer to select the range. The Other Strand can be selected, and an Other Segment button allows the unselected portion of a sequence to be specified, rather than the selected part. If the word Length is clicked on, the sequence is set to its full range.
3. The alignment parameters (k-tuple, gap penalty, gap length penalty, window, and scoring diagonals) can be set from Method Parameters.... Similarly, amino acid residue weighting (accepted point mutation or PAM tables) can be chosen from Set Residue Weight Table. Usually, the default settings will suffice.
4. The program will assess the memory available to it; if insufficient memory is available to perform the alignment, an error message will appear with an estimate of the amount of extra memory required. To remedy this, save the alignment and then quit MEGALIGN. Locate the program icon and select it. Choose Get Info ([command key]I) from the File menu. Adjust the Preferred Size memory allocation and close the information window. Restart the program.
5. Selecting New Consensus... from the Options menu allows the rules for defining a consensus sequence to be changed. The consensus can be set to be when all residues are identical or when a specified number match. Also, the consensus can be set as a template group of amino acids. There are four template groups of amino acids in MEGALIGN: functional, structural, chemical, and charge. There are four functional groups of residues: a-acidic (DE), b-basic (HKR), f-hydrophobic (AFILMPVW), and p-polar (CGNQSTY); three structural groups: a-ambivalent (ACGPSTWY), e-external (DEHKNQR), and i-internal (FILMV); eight chemical groups: a-acidic (DE), b-basic (HKR), f-aliphatic (AGILV), m-amide (NQ), o-aromatic (FWY), h-hydroxyl (ST), r-imino (P), and s-sulfur (CM); and three charge groups: a-acidic (DE), b-basic (HKR), and o-neutral (ACFGILMNPQSTVWY).
6. Save As should be used to save each stage of an alignment analysis under a different name; otherwise only the last analysis will be saved with the Worktable.
7. These exported files can be opened with a text processor, such as BBEdit, and edited and printed as desired.
8. The information displayed in the report can be selected using the Alignment Report Contents item from the Options menu. A self-explanatory dialog box of possible settings for the report is displayed. For example, to produce a text MEGALIGN alignment, turn on Show Consensus, Show Sequences, and Show Sequence Names. Turn off any other items. Set the Extra Space Between Residues to 0. Click OK to leave the dialog box.
9. To create new decorations, select New Decoration... from the Options menu. Either Hide or Shade the residues to emphasize sequence similarity in the alignment. The residues can be compared to a consensus or to an individual sequence from a pull-down menu in the Alignment Decoration dialog box.
10. To obtain a pict file of the alignment suitable for importing into paint/draw programs, install Print-2-Pict on your Mac. This can be found in the Tools file on the LASERGENE CD. Select Print-2-Pict from the Chooser instead of a printer, and save the alignment report as a pict file (or text) from the dispositions submenu. A similar operation will produce a pict file of a phylogenetic tree when this is selected from the View menu and printed.
11. ALIGN was the LASERGENE precursor to MEGALIGN. It offered only pairwise alignments. It can still be found on the LASERGENE CD-ROM but is not available from the menu. Although all the alignments it offers can be accomplished in MEGALIGN, it nevertheless has a useful Worksheet format whereby many single pairwise alignments can be kept as one document. This laboratory uses it regularly. However, up to version 2.14 it has a bug that can lead to erroneous alignments if it is not noticed. The bug causes the program to intermittently carry over the length of the previous sequence to the current one when the alignment is set up. For example, if the first sequence inserted into the dialog box (Fig. 7) is 29 bases, and the second 9265, the dialog box may show both sequences as 29 bases. To overcome this, simply reselect the second sequence until the correct length is displayed.
12. The alignment can be formatted with the Alignment Color button (a box with a cross in it) on the top left of the palette of the Alignment view. The Show Context box in the Alignment Color menu will cause the display to change to show the complete alignment between both sequences.
13. The similarity index is calculated from the number of matching residues divided by the sum of the number of matching residues plus the number of mismatching residues plus the number of gaps. Since the number of gaps in the alignment is a function of the parameters chosen, the similarity index is only a relative value, not an absolute one. This can be seen by evaluating a subalignment of the aligned sequences. To do this, drag-click on a region of the alignment to produce a highlighted subregion. The similarity index for the selected region is displayed beneath that for the two aligned parent sequences. The Evaluate Subalignment button on the palette now becomes active. Click on this and choose the appropriate protein or DNA alignment method from the submenu. The parameter dialog
box appears displaying the subranges for the two sequences. If the gap penalty or length parameters are changed and the subalignment evaluated, then a new alignment window will appear. There will probably be a different pattern of gaps in the subalignment compared to the parent alignment, and the similarity score will have changed.
14. The Percentage Match (default is 20) is a simple comparison between the two sequences, and increasing the value increases the stringency of the alignment, causing fewer diagonals to be displayed. The Minimum Windows can be set from 1-100 (default is 1) and is a measure of the number of overlapping regions of similarity needed to produce a diagonal. Therefore, increasing Minimum Windows will decrease the number of diagonals formed. Windows can be set from 1-100 (default is 30) and determines the number of consecutive bases or residues analyzed. A higher setting will find global, or longer, alignments between the two sequences.
Fig. 7. The ALIGN window used for adding sequences to the Worksheet. In this example a mistake has occurred and the bottom HIVHXB2R sequence is shown as 29 bp (the length of the primer SK462) instead of its correct length of 9265 bp. To remedy this, the lower Get Seq button needs to be pressed again and the sequence reselected. It will then be displayed as the correct length and the alignment can proceed.
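The similarity index described in Note 13 can be worked through with a small Python sketch. It follows the formula given in the note, with two assumptions that are not stated in the text: each gap position in the aligned rows is counted as one gap, and the result is expressed as a percentage. The function name and the example rows are hypothetical.

```python
def similarity_index(row_a, row_b, gap="-"):
    """Percentage of matches over matches + mismatches + gap positions."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            gaps += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    total = matches + mismatches + gaps
    return 100.0 * matches / total if total else 0.0


if __name__ == "__main__":
    print(similarity_index("GATTACA", "GAT-ACA"))  # whole alignment
    print(similarity_index("GAT", "GAT"))          # a selected subregion
```

The two calls give different values for the same pair of sequences, which illustrates why the index is relative to the region and parameters chosen rather than an absolute measure.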
References
1. Doolittle, R. F. (1987) Of URFs and ORFs: A Primer on How to Analyze Derived Amino Acid Sequences, University Science Books/Oxford University, Mill Valley, CA.
2. von Heijne, G. (1987) Sequence Analysis in Molecular Biology: Treasure Trove or Trivial Pursuit, Academic, San Diego, CA.
3. Gribskov, M. and Devereux, J. (eds.) (1991) Sequence Analysis Primer, Stockton, New York.
4. Griffin, A. M. and Griffin, H. G., eds. (1994) Methods in Molecular Biology, Computer Analysis of Sequence Data, Part II, vol. 25, Humana, Totowa, NJ.
5. Hein, J. (1990) Unified approach to alignment and phylogenies, in Methods in Enzymology, vol. 183 (Doolittle, R. F., ed.), Academic, San Diego, pp. 626-645.
6. Higgins, D. G. (1994) Clustal V: multiple alignment of DNA and protein sequences, in Methods in Molecular Biology, Computer Analysis of Sequence Data, Part II, vol. 25 (Griffin, A. M. and Griffin, H. G., eds.), Humana, Totowa, NJ, pp. 307-318.
7. Lipman, D. J. and Pearson, W. R. (1985) Rapid and sensitive protein similarity searches. Science 227, 1435-1441.
8. Wilbur, W. J. and Lipman, D. J. (1983) Rapid similarity searches of nucleic acid and protein data banks. Proc. Natl. Acad. Sci. USA 80, 726-730.
9. Martinez, M. H. (1983) An efficient method for finding repeats in molecular sequences. Nucleic Acids Res. 11, 4629-4634.
10. Needleman, S. B. and Wunsch, C. D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443-453.
10
GeneJockey II
Pair-wise Sequence Comparison
Phil Taylor
1. Introduction
GeneJockey offers four different methods of comparing one sequence with another to identify common areas. Three of these display their results as an alignment; the two sequences are written parallel to each other with a row of symbols between them that draw attention to similar areas of sequence (Figs. 1-3). The fourth method produces a dot plot or homology matrix (Figs. 4,5). Selected diagonals of the dot plot can also be shown as alignments.
The three pairwise alignments have different capabilities and are used for different purposes. The Simple alignment just places the two sequences so that their best-matched regions are aligned, without introducing any gaps. The routine is very fast, can handle sequences of any length, is indifferent to whether the sequences align parallel with each other or overlap, and works well with sequences of very different lengths. You would typically use this routine to determine whether an oligonucleotide aligns with a large sequence. The Gapped alignment is similar to the Simple alignment, except that it inserts gaps into the aligned sequences to bring multiple segments into line with each other. Gapped alignment uses a probability criterion to determine whether any given segment should be aligned: it will not insert gaps to align a segment unless the alignment is significant at the p < 0.1 level. Both these alignments will report that there is no match between the two sequences if no significant match is found.
The third pairwise alignment (Homology) uses different criteria, producing the best possible alignment between sequences, even if the two sequences are totally unrelated. The Homology alignment is limited to sequences of <3000 bases or amino acids, and is intended for sequences that are of similar sizes and that align parallel to each other. Homology alignment is much slower than the other two routines, but does a much more thorough job. The Homology Matrix shows all possible alignments between the two sequences at a glance. There are no limitations on sequence size, but the time taken to perform the alignment increases with the product of the two sequence lengths. All four methods operate on either DNA or protein sequences.
Fig. 1. GeneJockey pairwise alignment window. Perfect matches between amino acids are indicated by a bullet mark (•) in the central row of symbols; conservative substitutions are indicated by a vertical slash (|). (If the sequences were DNA rather than protein, the vertical slash would indicate partial matches between degenerate codes.) Where a perfect match is offset by one place, this is indicated by diagonal slashes (/ or \). The button labeled Make Contig is for use when you wish to extract a consensus sequence from a pair of aligned DNA sequences, and is not useful in protein alignments.
Fig. 2. Gapped alignment. A key to the alignment symbols is given in the caption to Fig. 1.
Fig. 3. Homology alignment. A key to the alignment symbols is given in the caption to Fig. 1. The probability value quoted in the alignment window is meaningless in this case; it will always have the value 1.0.
Fig. 4. A Homology matrix.
2. Materials
1. Hardware: GeneJockey requires a Macintosh with Color QuickDraw in ROM (this excludes the Macintosh Plus [and older machines], the SE, the PowerBook 100, and the Macintosh Portable). The program also requires System 7.0 or later, and at least 2 Mb of available memory. The matrix window can display its contents in color or monochrome; a color display makes the results easier to read, but is not essential. Alignment windows are in monochrome.
2. Software: For the operations described in this chapter, you need only the GeneJockey program itself. For later chapters, you will need some additional files supplied with the program, and you would normally install GeneJockey on your hard disk by simply copying all the files supplied into a single folder. When running on a Power Macintosh, the GeneJockey Helper file should be present in the same folder. The native-code resources in this file run about 10 times faster than the code in the main program, and since multiple alignment is a time-consuming process, the extra speed is very helpful. GeneJockey is licensed for use only on a single-user basis, but is not copy-protected.
3. Data: Two nucleic acid or two protein sequences are required. The examples shown are taken from the demo files supplied with the program.
3. Methods
3.1. Simple Alignment
All the sequence comparison commands operate on the sequences in the two front-most windows.
1. Choose Open from the File menu.
2. Locate the Misc Receptors folder on the demo files disk and open it.
3. Open the Acetyl Choline folder and open the file named Pig Ach R Peptide.
4. Repeat the three previous steps, this time opening the file named Rat Ach R A Peptide.
5. Choose Pair Align > Simple from the Analyze menu.
6. In the subsequent parameter dialog, set the minimum match length to 5.
7. Set the permitted number of mismatches within this length to 2.
8. Click on OK. The program will now open a pair-alignment window (Fig. 1), scrolled to the start of the main aligned segment (see Note 1).
9. Scroll the sequences to examine the alignment. There is one major alignment extending from position 24 to position 213 in the rat sequence. There may be other possible alignments, but the simple alignment routine will not show them.
Fig. 5. Homology matrix after zoom.
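The behavior of an ungapped alignment with a minimum match length and a permitted number of mismatches can be sketched in Python. This is an illustration of the general idea only; GeneJockey's actual routine and its scoring are not documented here, and the function names, the identity-count score, and the tie-breaking rule are assumptions.

```python
def identities_at_offset(a, b, offset):
    """Count identical residues when b[0] is placed against a[offset]."""
    return sum(1 for j, y in enumerate(b)
               if 0 <= offset + j < len(a) and a[offset + j] == y)


def has_seed(a, b, offset, min_len=5, max_mismatch=2):
    """True if some stretch of min_len aligned positions at this offset
    contains no more than max_mismatch mismatches."""
    start = max(0, -offset)
    stop = min(len(b), len(a) - offset)
    flags = [a[offset + j] != b[j] for j in range(start, stop)]
    return any(sum(flags[k:k + min_len]) <= max_mismatch
               for k in range(len(flags) - min_len + 1))


def simple_alignment(a, b, min_len=5, max_mismatch=2):
    """Best ungapped placement of b against a as (offset, identities).

    Returns None when no placement satisfies the seed criterion, which
    corresponds to reporting that there is no match.
    """
    best = None
    for offset in range(-(len(b) - 1), len(a)):
        if not has_seed(a, b, offset, min_len, max_mismatch):
            continue
        score = identities_at_offset(a, b, offset)
        if best is None or score > best[1]:
            best = (offset, score)
    return best
```

Negative offsets allow the second sequence to overhang the start of the first, so overlapping as well as parallel placements are considered. Because only one placement is returned, other possible alignments are not reported, which mirrors the limitation noted in step 9.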
3.2. Gapped Alignment
If you have just opened the two sequences as in Section 3.1., they will now be in the front two windows; otherwise, click on each in turn to bring the windows to the front.
1. Choose Open from the File menu.
2. Locate the Misc Receptors folder on the demo files disk and open it.
3. Open the Acetyl Choline folder and open the file named Pig Ach R Peptide.
4. Repeat the three previous steps, this time opening the file named Rat Ach R A Peptide.
5. Choose Pair Align > Gapped from the Analyze menu.
6. Set the alignment parameters to the same values as before and click on OK.
7. Scroll the sequences in the alignment window (Fig. 2) to explore the new alignment. Comparing the alignment produced by this routine with the previous one, we find that there are now two major areas of alignment; a gap has been inserted in the rat sequence at position 360, bringing the following 76 amino acids into alignment. However, there are other weaker alignments that are not revealed; for example, the rat sequence starts with MNTS, whereas the pig sequence starts with MNNS, and these segments are not aligned. We could increase the sensitivity of the alignment by adjusting the values of the parameters used, but if we want to see a more exhaustive alignment, we should use the Homology alignment routine.
3.3. Homology Alignment
1. Choose Open from the File menu.
2. Locate the Misc Receptors folder on the demo files disk and open it.
3. Open the Acetyl Choline folder and open the file named Pig Ach R Peptide.
4. Repeat the three previous steps, this time opening the file named Rat Ach R A Peptide.
5. Choose Pair Align > Homology... from the Analyze menu.
6. Accept the default alignment parameters by clicking on OK (see Note 2).
7. Scroll the sequences in the alignment window (Fig. 3) to explore the new alignment. We now find that there are many weak alignments. The program has not only aligned the MNTS and MNNS segments at the N-terminal ends of the proteins, but has also found an area of alignment following that, which depends almost entirely on conservative substitutions. Such alignments are very hard to locate by eye (see Note 3).
3.4. Homology Matrix
The simplest variety of homology matrix displays a square array of dots; imagine that the two sequences are on the x and y axes, then place a dot wherever a residue on the x-axis is the same as the residue on the y-axis. If the two sequences are identical, you will see a diagonal line of dots from top left to bottom right. Unfortunately, a simple dot plot of this kind shows so many random matches that the diagonal lines that represent homology are obscured. For this reason we use a filter function to improve the signal-to-noise ratio.
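The filtered dot plot can be sketched in a few lines. The example below follows the description in Note 4 (overlapping segments of n = 5 residues, with a dot drawn when at least 2 of the 5 positions match, and colors yellow, red, blue, and black for 2, 3, 4, and 5 matches). The function name filtered_dot_matrix and the dictionary layout are assumptions made for this illustration; this is not GeneJockey's actual code.

```python
# Colors for 2/5, 3/5, 4/5 and 5/5 matches, as listed in Note 4.
DOT_COLORS = {2: "yellow", 3: "red", 4: "blue", 5: "black"}


def filtered_dot_matrix(seq_x, seq_y, n=5, min_matches=2):
    """Return {(x, y): matches} for every pair of n-residue windows
    (0-based start positions) sharing at least min_matches identities."""
    dots = {}
    for x in range(len(seq_x) - n + 1):
        for y in range(len(seq_y) - n + 1):
            matches = sum(
                1 for k in range(n) if seq_x[x + k] == seq_y[y + k]
            )
            if matches >= min_matches:
                dots[(x, y)] = matches
    return dots


for position, matches in filtered_dot_matrix("MNTSA", "MNNSA").items():
    print(position, matches, DOT_COLORS[matches])
```

Isolated single-residue matches never reach the 2/5 threshold, which is why the filter suppresses most of the random background while leaving the diagonals visible.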
1. Choose Open from the File menu.
2. Locate the Misc Receptors folder on the demo files disk and open it.
3. Open the Acetyl Choline folder and open the file named Pig Ach R Peptide.
4. Repeat the three previous steps, this time opening the file named Rat Ach R A Peptide.
5. Choose Matrix... from the Analyze menu (see Note 4).
6. Accept the default alignment parameters by clicking on OK. In the present case the two major areas of alignment are very obvious (Fig. 4).
7. Hold the cursor over the matrix, moving it until the cursor position display reads: x 186 y 183
8. Click the mouse button once.
9. Move the cursor again until the display shows: x 216 y 217
10. Hold down the shift key while clicking the mouse button once (see Note 5).
11. Click on the Zoom In button. The selected area is now redrawn at full scale (Fig. 5). If at this point you still cannot see enough detail, you can repeat the selection/zoom process to increase the magnification still further. The Undo Zoom button undoes the last zoom operation, whereas the Zoom Out button returns you to the full-scale display.
12. Place the cursor over the blue (or dark gray) square at the top left of the major alignment and double-click. The program will open a new alignment window, showing the actual alignment of the two sequences in this area. This has all the usual properties of alignment windows. In this case, the alignment produced is identical to that produced by the simple alignment in Section 3.1. There are no gaps since only a single diagonal of the matrix is shown. You can, however, explore the matrix, producing alignment windows that show very weak alignments that the three alignment methods listed above could not find.
4. Notes
1. Text in this window is editable, and the alignment symbols will change to reflect changes in the alignment produced by editing. You can scroll the two sequences together using either of the two scroll-bars, or if you first uncheck the Link Scroll-bars check-box, you can scroll the sequences independently to explore other possible alignments. There is the usual GeneJockey active cursor position display at top right, allowing you to check the number of any residue by placing the cursor over it. To the left of this is an estimate of the probability of this alignment occurring at random between two unrelated sequences of this size. In the present case the alignment is highly significant (p < 0.00001).
2. The parameter dialog for this algorithm is rather complex. For a full explanation of what the parameters mean and how the alignment algorithm works you should consult the reference section of the GeneJockey manual. The default values are, however, well chosen, and you should rarely need to change them.
3. The routine will never tell you that it found no alignment between the sequences; it will always do its best to align them even though the results may be garbage.
4. GeneJockey compares overlapping segments of n residues each, and places dots in four different colors or shades depending on the number of matches within the segment. For peptide comparisons, the default for n is 5, and the program places a yellow dot if 2/5 residues match, a red dot if 3/5 match, a blue dot for 4/5, and a black dot for 5/5. (You can also specify colors or gray patterns to suit yourself.) On a standard Macintosh screen, the colors of these dots will not usually be distinguishable at full scale, but become obvious when you magnify part of the picture by means of the zoom button. In the matrix window this filter function is abbreviated to 2/3/4/5/5.
5. Note that there is a selection link between the matrix window and the two sequence windows used to generate it. If you were to click on one of the sequence windows to bring it to the front, you would find that an area of sequence is now selected corresponding to the area selected in the matrix window.
11 GeneJockey II: Multiple Alignment of Homologous Sequences
Phil Taylor
1. Introduction
In Chapter 6 we covered the multiple alignment of DNA fragments for sequence assembly. The routines described here deliver their output to the same multiple-alignment window, so if you have not already done so, you should read Chapter 6 to familiarize yourself with the operation of that window. The multiple alignment of homologous sequences is technically a very difficult operation. GeneJockey uses the Clustal algorithm of Higgins and Sharp (1). This can be used either for proteins or nucleic acids. As with all of GeneJockey II's analysis routines, input is taken from open windows. As for the sequence assembly multiple alignment, input can be taken either from individual sequence windows or from existing multiple-alignment windows. Unlike that routine, however, there are strict limitations, both on the size and on the number of sequences that may be aligned. The absolute maximum sequence length is 3000 bases or amino acids; the absolute maximum number of sequences is 50. In practice, the limits may be lower than this, depending on available memory. The program makes an estimate of the memory required before starting and informs you if you do not have enough to proceed. When aligning sequences from existing multiple alignments, there is no Align by Contig option, so the number of input sequences is the actual total present. Aligning large numbers of large sequences can take a very long time, so a rough time estimate is made before starting the alignment.
2. Materials
1. Hardware: GeneJockey requires a Macintosh with ColorQuickdraw in ROM (this excludes the Macintosh Plus [and older machines], the SE, the PowerBook
100, and the Macintosh Portable). The program also requires system 7.0 or later, and at least 2 Mb of available memory. Multiple alignment windows are much easier to view in color. Although this is not essential, it is recommended that you use a system capable of displaying 256 colors (or better).
2. Software: For the operations described in this chapter, you need only the GeneJockey program itself. For later chapters you will need some additional files supplied with the program, and you would normally install GeneJockey on your hard disk by simply copying all the files supplied into a single folder. When running on a Power Macintosh, the GeneJockey Helper file should be present in the same folder. The native-code resources in this file run about 10 times faster than the code in the main program, and since multiple alignment is a time-consuming process, the extra speed is very helpful. GeneJockey is licensed for use only on a single-user basis, but is not copy-protected.
3. Data: You will need a set of suitable homologous sequences to align. Opening such sequences is easier if they are in GeneJockey format, and all in the same folder, but this is not essential. The sequences used here are taken from the demo files supplied with the program.
3. Methods
3.1. Multiple Alignment of Protein Sequences
1. On the GeneJockey demo files disk, locate the file named Demo File References and open it.
2. Scroll down the window until you reach the entry for tutorial 9.
3. Locate the first group of file references, which looks like this:
fSubst P R Peptidef Various receptor proteins to align.
fSubst K R Peptidef
fNeuromedin K R Peptidef
f5HT 1C R Peptidef
f5HT2 R Peptidef
4. Click on the line above the group and drag down to select them all, making sure that you include all the f symbols in the selection.
5. Hold down the command key and type an equal (=) sign. If you use your own files, you will have to open the sequence files individually using the Open command.
6. Choose Multiple Align > Clustal from the Analyze menu (see Note 1).
7. The parameter dialog is shown in Fig. 1.
8. Accept the default parameters by clicking on the OK button. The program displays some information on each sequence as it formats it, then displays a dialog that gives an estimate of the time required to complete the job (see Note 2). At this point you can abort the operation if you wish.
9. Click Go for it!. The alignment appears in a new window (Fig. 2, see Note 3).
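Note 1 explains that the pairwise alignment results are used to build a dendrogram that fixes the order in which sequences coalesce into the final multiple alignment. The sketch below illustrates that idea only. It uses simple average-linkage clustering as a stand-in for the clustering step; the function name merge_order, the sequence labels A to D, and the similarity scores are all hypothetical values invented for this example, not output from GeneJockey or Clustal.

```python
def merge_order(names, similarity):
    """Average-linkage clustering on pairwise similarity scores.

    similarity maps frozenset({name1, name2}) to a score (higher = more alike).
    Returns the clusters in the order in which they are formed.
    """
    clusters = [(name,) for name in names]

    def cluster_similarity(c1, c2):
        # Mean of all pairwise scores between members of the two clusters.
        scores = [similarity[frozenset((a, b))] for a in c1 for b in c2]
        return sum(scores) / len(scores)

    order = []
    while len(clusters) > 1:
        # Pick the most similar pair of clusters currently available.
        i, j = max(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_similarity(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        order.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return order


# Hypothetical pairwise scores for four made-up sequences.
scores = {
    frozenset(("A", "B")): 0.9, frozenset(("C", "D")): 0.8,
    frozenset(("A", "C")): 0.3, frozenset(("A", "D")): 0.2,
    frozenset(("B", "C")): 0.25, frozenset(("B", "D")): 0.3,
}
print(merge_order(["A", "B", "C", "D"], scores))
```

In this made-up case the two closest pairs are joined first and the two resulting groups are combined last, which mirrors the way closely related sequences are aligned to each other before more distant groups are brought together.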
3.2. Assigning Colors to Amino Acids Based on Their Properties
Color is a three-dimensional quantity; the color of an object on the computer screen is determined by three numbers that specify the intensity of the red,
Fig. 1. Parameter dialog for Clustal alignment.
Fig. 2. Multiple alignment of receptor proteins.
green, and blue components. In protein multiple alignments you can assign the colors used for the amino acid symbols using three sets of numerical data. For example, you could assign the red component to Kyte and Doolittle's hydropathic index so that the reddest amino acids would be isoleucine, leucine, and valine (most hydrophobic) and the least red would be arginine and lysine (least hydrophobic). At the same time, you might assign the green component to Bigelow's estimate of residue volume, which will cause the largest amino acid (tryptophan) to be greenest, and the smallest (glycine) to have no green component in its color. Amino acid symbols colored in this way draw attention to parts of the alignment in which amino acid properties (rather than identities) are conserved, since amino acids with similar properties will have similar colors.
1. Click on the color button in the bottom left-hand corner of the window.
2. In the subsequent dialog, click on the Assign colors by property... button. The color assignment dialog is shown in Fig. 3 (see Note 4).
3. Select an amino acid index from the list box on the left.
4. Click on the Red, Green, or Blue button to assign the selected index to that color.
5. Repeat steps 3 and 4 until you have a suitable color scheme.
Fig. 3. Assignment of amino acid colors by property.
Each of the records in this database contains a set of 20 numbers corresponding to some estimated property of the 20 amino acids. Clicking on any name in the list causes a brief description of the measured property to be displayed in the Description box. If you want more information, click on the More Info... button for a display giving the actual numbers and a reference to the original publication. The numerical values used to specify colors on the Macintosh range from 0 to 65,535, so the data will be scaled to that range. If you click the corresponding Invert check-box, the order of the numbers will be reversed, so that the highest-scoring amino acid for that property will be assigned the value 0 for the selected color, and the lowest-scoring will get the value 65,535. As you assign the colors, the Example field in the dialog changes color to let you view the effect. You can remove any color completely from the display with the corresponding Remove button. Clicking on the Background button lets you choose a suitable color for the background. (For any particular combination of amino acid properties, of course, it may not be possible to find a background color which contrasts sufficiently with all of the amino acid colors.) The symbol X, denoting an unknown or unspecified amino acid, is always displayed in black, as are the hyphens used to mark gaps. The list of indices is very large, so there is a Find... button you can use to search for keywords or author names. This will select the first example that includes the specified keyword. The Find Next button selects the next example. Try, for example, hitting the Find... button and entering the word "hydro" (case is irrelevant here). The program will select the first index, which includes "hydropathicity," "hydrophobicity," "hydrophilicity," and so on, in its description. Click the Find Next button to jump to the next instance, repeating until you find the one you want. There are many estimates of hydropathicity in the literature.
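The scaling of a property index onto a color component can be sketched as follows. The text states only that the data are scaled to the range 0 to 65,535 and that Invert reverses the order; the simple linear (min-max) scaling used here is an assumption, as is the function name scale_to_component. The values in KYTE_DOOLITTLE are the published Kyte and Doolittle hydropathy index, used as the example property.

```python
# Kyte and Doolittle hydropathy index, keyed by one-letter amino acid code.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def scale_to_component(index, invert=False, max_value=65535):
    """Linearly rescale a property index to 0..max_value (assumed scaling).

    With invert=True the highest-scoring residue gets 0 and the
    lowest-scoring residue gets max_value, as described for the Invert box.
    """
    lowest, highest = min(index.values()), max(index.values())
    scaled = {
        aa: round((value - lowest) / (highest - lowest) * max_value)
        for aa, value in index.items()
    }
    if invert:
        scaled = {aa: max_value - value for aa, value in scaled.items()}
    return scaled


red = scale_to_component(KYTE_DOOLITTLE)
for aa in "ILVKR":
    print(aa, red[aa])
```

Under this assumed scaling, isoleucine receives the full red intensity and arginine none, which agrees with the ordering described for the hydropathic index earlier in this section.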
If you cannot distinguish some of the amino acids using this objective method of assigning colors, you can still make subjective changes in the main dialog. The object of the exercise is to draw attention to those alignments that you consider important, and your eyes must be the final judge.
3.3. Multiple Alignment of DNA Sequences
1. Open the file named Demo File References.
2. Locate the second group of file references, which follows the set of proteins that you used previously.
fSubstance P Receptorf The equivalent DNA sequences for comparison.
fBov Subst K Receptorf
fNeuromedin K Receptorf
f5HT1Cf
f5HT2f
3. Open the file references as before, by selecting the whole group and typing Command-=. When you open these sequences, you will note that unlike the equivalent proteins, they are of very different lengths. This is because they contain variable amounts of untranslated sequence at both ends. The Clustal algorithm does not deal well with sequences of differing lengths, and the untranslated regions are likely to show little homology anyway, so let us save some time by abstracting the coding regions from the five sequences before alignment. The simplest way to do this is as follows. Repeat steps 4-12 for each sequence:
4. Issue a Reading Frames... command, setting the start codon to ATG before clicking on OK.
5. In the Open Reading Frames window that results, click on the large arrow to select it.
6. Close the ORF window by clicking in the Close box, leaving the translated region of sequence in the window behind still selected.
7. Copy this sequence onto the clipboard.
8. Click on the New > Nucleotide Sequence window.
9. Paste the sequence into the sequence box.
10. Tidy the sequence by clicking on the Tidy Up button.
11. Use Save As... to save the edited sequence under a suitable name.
12. Close the original sequence window by clicking in the close box.
13. Bring the next sequence window to the front, either by clicking on it or by selecting its name in the Windows menu. At the end of this process you will have five nucleotide sequence windows, each containing the coding region only of the receptor.
14. Align them with the Multiple Align > Clustal... command, accepting the default parameter set for the alignment. You will notice that the time estimate is nine times longer this time. The alignment window is shown in Fig. 4. Since the sequences aligned here are DNA rather than protein, the top sequence is a consensus sequence rather than coincidence markers (see Note 5).
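Steps 4-7 extract a coding region by hand. The same idea can be sketched in code, assuming that the large arrow in the Open Reading Frames window marks the longest open reading frame. The function name longest_orf is hypothetical, only the three forward frames are searched, and the stop codon is excluded from the returned sequence; none of these details is taken from GeneJockey itself.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(dna, start_codon="ATG"):
    """Return the longest start-to-stop open reading frame found in the
    three forward frames (stop codon excluded), or "" if there is none."""
    dna = dna.upper()
    best = ""
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == start_codon:
                start = pos                      # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if pos - start > len(best):
                    best = dna[start:pos]        # keep the longest ORF seen so far
                start = None                     # look for the next ORF in this frame
    return best


print(longest_orf("CCATGAAATTTGGGTAACC"))
```

Trimming each sequence to its coding region in this way leaves inputs of comparable length, which is the condition under which the text says the Clustal algorithm performs best.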
Fig. 4. Multiple alignment of receptor DNA sequences.
4. Notes
1. As with the Sequence Assembly routine, Clustal starts by performing all possible pairwise alignments on the sequences supplied. In this case, however, the algorithm used is Wilbur-Lipman. The results of these alignments are used to construct a dendrogram, a crude phylogenetic tree specifying the order of similarity between the sequences, which in turn is used to determine the order in which the sequences coalesce to form the final multiple alignment. Most of the parameters refer to the initial pairwise alignment. The meaning of these parameters and the effects of changing them are described in the reference section of the GeneJockey manual. There are two sets of default parameters for this routine, one used for proteins and one for DNA, and you will find that the default parameters are adequate for most purposes.
2. After formatting the sequences, the program checks the available memory. If the alignment cannot be performed in the available space you will be told, and the operation will be aborted. (You can increase the available memory using the Finder's Get Info command.) The time estimate the program makes is very inaccurate, but since version 1.4 it has been adaptive: the time estimate takes account of errors made in previous estimates. The first time you use the routine the time estimate is a guess, but it will become more accurate with use as it learns the capabilities of the machine on which it is running. The alignment process is modal, i.e., you cannot do anything else with your computer while it is running, including background tasks, such as printing.
3. The multiple alignment window in which the output is displayed is identical to the window used to display the results of the sequence assembly command as described in Chapter 7, with one difference. When the window is used to display aligned protein sequences, the top line displays not a consensus sequence but coincidence markers (Fig. 2). Here the bullet mark (•) indicates a perfect agreement between sequences, whereas the vertical slash (|) indicates a conservative substitution. All the sequences are placed in a single "contig," and each amino acid code can be assigned a separate color, using the dialog obtained by clicking on the color icon at bottom left. Aligned sequences may be edited using the same facilities described in Chapter 7, so you may use nonvolatile deletions and reverse editing mode. The coincidence markers on the top line will change to reflect the new alignment after every keystroke. You can move whole sequences vertically or horizontally, and extract or delete sequences. You should avoid using the Optimize command; it is really intended for use only with aligned sequence fragments that contain long stretches of identical sequence. Likewise, the Dis-Optimize command will undo all the good work done by Clustal.
4. Each of the records in this database contains a set of 20 numbers corresponding to some estimated property of the 20 amino acids. Clicking on any name in the list causes a brief description of the measured property to be displayed in the Description box. If you want more information, click on the More Info... button for a display giving the actual numbers and a reference to the original publication. The numerical values used to specify colors on the Macintosh range from 0 to 65,535, so the data will be scaled to that range. If you click the corresponding Invert check-box, the order of the numbers will be reversed, so that the highest-scoring amino acid for that property will be assigned the value 0 for the selected color, and the lowest-scoring will get the value 65,535. As you assign the colors, the Example field in the dialog changes color to let you view the effect. You can remove any color completely from the display with the corresponding Remove button. Clicking on the Background button lets you choose a suitable color for the background. (For any particular combination of amino acid properties, of course, it may not be possible to find a background color that contrasts sufficiently with all of the amino acid colors.) The symbol X, denoting an unknown or unspecified amino acid, is always displayed in black, as are the hyphens used to mark gaps. The list of indices is very large, so there is a Find... button you can use to search for keywords or author names. This will select the first example that includes the specified keyword. The Find Next button selects the next example. Try, for example, hitting the Find... button and entering the word "hydro" (case is irrelevant here). The program will select the first index, which includes "hydropathicity," "hydrophobicity," "hydrophilicity," and so on, in its description. Click the Find Next button to jump to the next instance, repeating until you find the one you want. There are many estimates of hydropathicity in the literature.
5. Where the sequences align poorly, this consensus sequence will consist mainly of degenerate codes, using the real base symbols only where there is good alignment. The three contig rules described in Chapter 7 apply also to this alignment, so if you wish to draw attention to the areas of disagreement, you should click on the color button, set the Contig rule to Perfect Match, and choose a contrasting color for degenerate codes (all degenerate codes are given the color that you allot to N). Note that the alignment obtained here is poorer than that obtained with the equivalent protein sequences. Whereas the three peptide receptors at the top are well aligned with each other, and the two serotonin receptors likewise, the alignment between the two groups is not well shown here. The lesson to be repeated here is that, as with all the sequence comparison methods, you should choose to work at the protein level rather than the DNA level, if at all possible.
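The consensus line described in Note 5 can be illustrated with a short sketch. It emits a real base symbol only where every sequence agrees and the standard IUPAC/IUB degenerate code otherwise. The function name consensus is hypothetical, and the handling of gaps (ignored when forming the code, with "-" emitted if a column contains no A, C, G, or T) is an assumption made for this example rather than documented GeneJockey behavior.

```python
# Standard IUPAC/IUB nucleotide codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus(aligned):
    """Build a consensus from equal-length aligned DNA strings.

    Columns where all sequences agree give the real base; disagreements
    give the degenerate code covering every base seen in that column.
    """
    out = []
    for column in zip(*aligned):
        bases = frozenset(b for b in column.__iter__() if b in "ACGT")
        out.append(IUPAC.get(bases, "-"))   # "-" when no plain base is present
    return "".join(out)


print(consensus(["ACGT", "ACGA", "ATGT"]))
```

A poorly aligned region therefore shows up as a run of degenerate symbols, which is what makes a contrasting color for those codes an effective way to highlight disagreement.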
Reference
1. Higgins, D. G. and Sharp, P. M. (1988) A package for performing sequence alignment on a microcomputer. Gene 73, 237-244.
12 Sequence Navigator: Multiple Sequence Alignment Software
Steven R. Parker
1. Introduction
Sequence Navigator is a Macintosh software package from PE-Applied Biosystems designed for multiple alignment of DNA and protein sequences. The graphical user interface is easy to use and allows the importation of text files as well as analysis files from Applied Biosystems automated sequencers (Models 310, 373, and 377). If Applied Biosystems analysis files are used, the accompanying electropherograms may also be displayed and edited. Sequence Navigator is based on an earlier Applied Biosystems program, SeqEd, and retains all the capabilities of the former program (see Note 1). In order to obtain better performance and memory utilization on a Macintosh, Sequence Navigator implemented two additional algorithms: Clustal (1) for multiple alignments and Needleman-Wunsch (2) for pairwise alignments. Included with Sequence Navigator is Factura, a prefilter or clean-up program. Factura allows the user to import DNA sequence files (either as text or Applied Biosystems analysis files) and process them in batch mode, deactivating identified features, such as vector sequence, areas of high ambiguity, and low confidence ranges. In addition, Factura can identify potential heterozygotes by labeling mixed-base positions using IUB ambiguity codes. After the sequences have been processed in Factura, they are then imported into Sequence Navigator, where they are aligned. Once alignment is completed, consensus and ambiguity sequences may be computed and ambiguities can be resolved. The results may then be printed and/or exported to another program for further analysis.
2. Materials
2.1. Hardware
1. Any Macintosh II or greater, with color monitor. Power Macintoshes are supported with the software running in native mode.
2. Minimum memory requirements: 8 Mb RAM; 16 Mb or greater recommended.
3. One high density floppy drive and a hard drive. The applications and supporting files use approx 4 Mb of hard disk space. Sample files typically take 1-2 Kb each for text files, 160 Kb each for Applied Biosystems sequence files.
4. A Macintosh-compatible printer if hard copy is required.
2.2. Software
1. Macintosh system software 7.0.1 or greater.
2. The Sequence Navigator and Factura applications.
3. The Libraries folder included with the Sequence Navigator software package. This folder must be located in the ABI folder in the System folder.
4. (Optional) Sample files included with the Sequence Navigator package for use with the included tutorials (see Note 2).
2.3. Data
1. DNA sequences are accepted by Factura and Sequence Navigator in the following formats:
a. Text files.
b. Processed analysis files from Applied Biosystems automated sequencers (Models 310, 373, 377).
2. Protein sequences are accepted by Sequence Navigator as text files using either one or three letter codes.
3. Methods
Find the Sequence Navigator folder on the hard drive and double-click on it to open it. Double-click on the Factura icon to launch the application.
3.1. Setting Up Factura Libraries
1. Choose the Vector Library Setup... command from the Library menu. The Vector Library dialog appears (Fig. 1).
2. Choose a vector to be used by clicking once on the desired vector located in the All Vectors in VecBase scrollable list in the upper-left-hand corner. The correct vector should be highlighted.
3. Click on the >Copy> button to copy the vector over to your personal library. For the tutorial data, use the M13MP19 vector.
4. Click the OK button.
5. If a vector that is not listed in VecBase is desired, a custom vector may be imported from a text or Applied Biosystems Analysis file by clicking on the Add... button.
Fig. 1. Adding vectors to the user vector library in Factura. The VecBase library is included in the software, and users may also add custom vectors.
6. If more vectors are to be used, repeat steps 2-4 until all required vectors are copied.
7. The same steps may be used to create cloning sites and primer libraries using the Enzyme Library Setup... and Primer Library Setup... commands under the Library menu.
8. For the tutorial data, select SmaI as the cloning site and M13-21 as the primer.
3.2. Setting Up Factura Parameters
1. Choose the Settings command from the Worksheet menu. The Settings dialog appears (Fig. 2).
2. Select Identify Vector Sequence by clicking in the check-box.
3. Select M13MP19 for Vector, M13-21 for Primer, and SmaI for Cloning Site from the scrollable lists.
4. Select Identify Ambiguity by clicking in the check-box.
5. Choose 1 ambiguity remaining out of 20 bases.
6. Check the Reject Sequences box to activate it.
7. Choose >10% ambiguities to be rejected.
8. Select Identify Confidence Range by clicking in the check-box.
9. Select the range from 1-450.
10. Select Identify IUB/Heterozygous Bases by clicking in the check-box.
11. Set the threshold at 50%.
12. Click on the Update Edited Bases check-box.
13. Click on Automatically Save to Sequence File, Revert Sequences to Original Basecalls, and Use These Settings as Default Value check-boxes to activate these commands.
14. Click OK.
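The rejection rule set in steps 6 and 7 can be sketched as a simple calculation. This is only an illustration of a ">10% ambiguities" threshold; how Factura actually counts ambiguous positions is not described here, so the function names ambiguity_fraction and should_reject, and the choice to count every symbol other than A, C, G, or T as ambiguous, are assumptions.

```python
def ambiguity_fraction(sequence, unambiguous="ACGT"):
    """Fraction of positions that are not a plain A, C, G or T."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    ambiguous = sum(1 for base in sequence if base not in unambiguous)
    return ambiguous / len(sequence)


def should_reject(sequence, max_fraction=0.10):
    """True when the ambiguity fraction exceeds the threshold (here >10%)."""
    return ambiguity_fraction(sequence) > max_fraction


for seq in ("ACGTNNACGT", "ACGTACGTAN"):
    print(seq, ambiguity_fraction(seq), should_reject(seq))
```

Because the comparison is strictly greater than the threshold, a read with exactly 10% ambiguous positions is kept in this sketch.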
Fig. 2. The Settings menu in Factura, specifying the parameters Factura will use to process the sample files.
Fig. 3. A Factura worksheet after sequences have been added. The parameters in individual cells may be changed before processing.
3.3. Importing Data into Factura
1. Select the Add Sequences command under the Worksheet menu.
2. Make sure that the File Type: 373 check-box is selected.
3. Locate the GRel sequences in the Tutorial Data folder in the Sequence Navigator folder using the pop-up menu of folder names.
4. Click the Add All button to add them to the worksheet. Each worksheet can contain up to 999 samples.
5. The sequences and the appropriate parameters should now be loaded onto the worksheet (Fig. 3). If desired, individual parameters can be changed by clicking on the appropriate cell of the worksheet and making the change using the pop-up
menu that occurs in the upper center of the worksheet. For the tutorial data, the parameters should not be changed.
3.4. Factura Analysis
1. Select the Submit command under the Worksheet menu. A dialog box appears asking if you want to revert the sequences to the original basecalls.
2. Click on the Yes button.
3. The analysis should begin, with the percent completion denoted by a progress bar.
4. On completion of the analysis, a Save Sequence File dialog box should appear.
5. Click on OK to save the results back to the original sequence file (see Note 3). A dialog box appears asking if you would like a batch report of results to be generated.
6. Click the Yes button.
7. A batch report appears summarizing the results from the Factura analysis. The ranges of each sequence where vector and ambiguities were found are listed, and the resulting Clear Length of good sequence is reported.
8. To print the batch report, choose Print... from the File menu.
9. To save the batch report, choose Save As... from the File menu and type in a name in the Save This Document As box, then click on the Save button.
10. Click the small box in the upper-left corner of the Batch Report window to close the batch report.
3.5. Viewing Factura Results
1. Click on 1 under the # column on the far left of the worksheet to highlight the first sample row.
2. Select Show Sequence under the Worksheet menu.
3. The sequence should now be displayed. Note that deactivated areas appear in gray; IUB bases appear in red (Fig. 4).
4. To view the sequence features identified by Factura, click on the third button from the left in the lower-left corner of the sequence window. The identified features should now be listed (Fig. 5).
5. To view the associated electropherogram, click on the fourth button from the left in the lower-left corner. The electropherogram may now be scrolled, and bases edited by highlighting them and typing in the new basecall. Use the left and right arrow keys to help position the cursor directly on a base to change it.
6. To view the original sequence basecalls, go to the Electropherogram command under the Sequence menu, and choose Show Original. The original basecalls are then shown on top, with the edited sequence below.
7. To view the annotations from the Applied Biosystems automated sequencer, click on the first button in the lower-left corner.
8. To close this view, click on the box in the upper-left corner of the window.
3.6. Saving the Batch Worksheet and Exiting Factura
1. With the Batch Worksheet in the foreground, choose Save As... from the File menu.
2. Type a name for the file (e.g., Batch - 1) under the Save This Document As box.
Fig. 4. A view of a sequence in Factura after processing. The areas deactivated by Factura are denoted by gray letters; mixed-base positions are denoted by red IUB codes.
Fig. 5. A view of a Features Table created by Factura. Deactivated regions are shown and IUB base positions are displayed with the peak ratio calculations.
3. Click on the Save button.
4. To exit Factura, use the Quit command under the File menu.
3.7. Importing Data into Sequence Navigator
1. Locate the Sequence Navigator icon and double-click to launch the application. A blank worksheet should appear.
2. To import sequences for alignments that were analyzed in Factura and saved in a Batch Worksheet, choose the Import Batch Worksheet command under the Sequences menu. Select the saved Batch Worksheet filename from Section 3.6. (e.g., Batch - 1).
3. Click the Import button.
4. The worksheet should now display the eight GRel sequences previously processed in Factura.
Fig. 6. An aligned project in Sequence Navigator, also showing the ambiguity sequence (sequence #9), which uses a * to denote sequence mismatches.
3.8. Performing Alignments
1. Use the Select All Sequences command under the Edit menu to highlight the names of all the sequences in the project.
2. Under the Align menu, select the Clustal... command.
3. Be sure that Composition: Nucleotide and Matrix: Identity are selected from the drop-down menus.
4. For Alignment Parameters click the Use Defaults button (see Note 4).
5. Click the OK button. An Alignment Progress bar should then appear indicating the progress of the alignment.
6. Once the alignment is complete, the aligned sequences should appear in the worksheet (Fig. 6).
3.9. Viewing and Editing Results
1. Choose the Select All command under the Edit menu to select (i.e., highlight) all sequence names on the left side of the worksheet.
2. Go to the Create Shadow(s) command under the Sequences menu, and select the Compute Ambiguity Sequence... command. A new sequence (#9) should appear, indicating either agreement (-) or mismatch (*) at each base position.
3. Scroll to approximately position 197 and locate a single asterisk (*) below a column in which every base is an A except for one W in sequence GRel 1.
4. Move the cursor to three bases to the left of the mismatching W, hold down the mouse button, and highlight seven bases to the right of that point (so the mismatching W is in the middle of the highlight).
Fig. 7. Display of electropherograms in Sequence Navigator from files generated from an Applied Biosystems automated sequencer. Edits may be performed either directly on the electropherograms or inside the project window.
5. Move the cursor to the left side of the worksheet and click on the name of sequence GRel 1 to highlight (select) it. Then hold down the Shift key and click on the names of sequences GRel 2 and GRel 3 to select them.
6. Under the Sequences menu, select the Display Electropherograms command. The associated electropherograms for the three sequences should now be aligned and displayed below the worksheet (Fig. 7).
7. Locate the mismatching W in sequence #1. Since it appears that it is not a true heterozygote, highlight the W and type in a lowercase “a” to change the basecall.
8. To close the electropherograms, hold down the Option key and click on the small box in the upper-left corner of one of the electropherogram windows.
9. Click the Yes button when asked if you want to close all the electropherogram windows.
10. Notice that the edited base is now displayed in the worksheet, and the Ambiguity Sequence (#9) has been automatically updated to reflect unanimity of the bases.
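The agreement/mismatch marking used by the ambiguity sequence can be illustrated with a short sketch. This is not Sequence Navigator's own code; it is a minimal Python illustration that assumes the rows are already aligned to equal length and that upper- and lowercase basecalls count as the same base (so an edited lowercase "a" agrees with "A"). The function name `ambiguity_sequence` is hypothetical.

```python
def ambiguity_sequence(aligned):
    """Return '-' for each column where all rows agree and '*' otherwise.

    Assumption: rows are pre-aligned and of equal length; case is ignored.
    """
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned):
        bases = {base.upper() for base in column}
        marks.append("-" if len(bases) == 1 else "*")
    return "".join(marks)


# One row carries a mixed-base W where the other rows have A.
before_edit = ["AGATGAWGCT", "AGATGAAGCT", "AGATGAAGCT"]
after_edit = ["AGATGAaGCT", "AGATGAAGCT", "AGATGAAGCT"]
print(ambiguity_sequence(before_edit))
print(ambiguity_sequence(after_edit))
```

Before the edit the mismatching column is flagged with an asterisk; after the W is replaced by a lowercase "a" every column agrees, which mirrors the automatic update described in step 10.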
3.10. Creating Consensus Sequence and Translation to Protein
1. Highlight Sequences 1-8 by moving the cursor to the left side of the worksheet, holding down the Shift key, and clicking on the names of Sequences 1-8.
Fig. 8. An aligned project in Sequence Navigator after a consensus sequence (#10) was generated, showing the three-frame translation to protein of the consensus sequence (Sequences 11-13). This translation may be performed using either one or three letter codes.
2. Choose Create Shadow(s) under the Sequences menu, then choose Compute Consensus Sequence. The newly created consensus sequence should appear as Sequence #10 (Fig. 8).
3. Select the consensus sequence by clicking on Sequence 10 on the left side of the worksheet.
4. Choose Create Shadow(s) under the Sequences menu and select Translate Codons to Amino Acids.... In the dialog box that appears, select Frame 1, Frame 2, Frame 3, Universal, and Single Letter Codes.
5. Click the OK button. The three-frame protein translation of the consensus sequence should now appear as Sequences 11-13.
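The two operations in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the consensus here is a simple per-column majority vote (the rule Sequence Navigator actually applies is not described in this chapter), the translation uses the universal genetic code with single-letter amino acid codes, and the names `consensus` and `translate` are hypothetical.

```python
from collections import Counter

# Universal genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def consensus(aligned):
    """Most common base in each column (assumed rule: simple majority)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


def translate(dna, frame):
    """Translate reading frame 1, 2, or 3; unrecognized codons become 'X'."""
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2, or 3")
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame - 1, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


rows = ["AGATGAAGCT", "AGATGAAGCT", "AGATGAWGCT"]
cons = consensus(rows)
for frame in (1, 2, 3):
    print(frame, translate(cons, frame))
```

Stop codons are printed as an asterisk, and a trailing partial codon is ignored.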
3.11. Exporting Consensus Sequence and Saving Results
1. Click on Sequence 10 (the Consensus Sequence) on the left side of the worksheet to select it.
2. Select the Freeze Shadow command under the Sequences menu.
3. Choose Export Sequence... under the Sequences menu.
4. Type in the name Consensus in the Save As box.
5. Select Text as the file type.
6. Click on the Export button.
7. To save the aligned project, select the Save Layout As... command under the File menu.
8. Type in the name GRel Alignment in the Save As box.
9. Click on the Save button.
3.12. Printing Results and Exiting Sequence Navigator
1. Select Page Setup... under the File menu and select the proper paper size and orientation for your printer.
2. To print an entire alignment layout, choose Print Layout... under the File menu, then click on the Print button to begin printing.
3. To print just the alignment appearing in the currently displayed window, choose Print Window... under the File menu, then click on the Print button to begin printing.
4. To exit Sequence Navigator, choose the Quit command under the File menu. When asked to save changes to individual sequences, click on the Yes button for each sequence to record the edits.
4. Notes
1. Since SeqEd was originally designed to be primarily a sequence-editing program, the algorithms for alignment that were included were not optimized for the Macintosh environment and required an exponential use of memory in order to function.
2. Accessory files and the main program will be automatically loaded if the Installer is used.
3. Saving the results of the Factura Analysis only writes information to the Features Table Section of the sequence file; the original data is always retained.
4. If the button is not active, or is grayed out, the defaults are already chosen.
References
1. Higgins, D. G. and Sharp, P. M. (1988) Clustal: a package for performing multiple sequence alignment on a microcomputer. Gene 73, 237-244.
2. Needleman, S. B. and Wunsch, C. D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443-453.
13 The European Bioinformatics Institute Network Services
Tomas P. Flores and Robert A. Harper
1. Introduction
In addition to archiving sequence and genome data, the EMBL Outstation, the European Bioinformatics Institute (EBI), provides an ever-expanding number of free network services to external users. For a list of server addresses, see Note 1. This chapter is designed to act as an introduction to these services and to help prospective users get started. It is now possible to access these services via anonymous file transfer protocol (FTP), World Wide Web (WWW), Gopher, and electronic mail (E-mail). Data retrieval and searching by each of these four methods are covered in this chapter. In addition to providing its own databases, the EBI acts as a repository for a large number of molecular biology databases. Table 1 provides a comprehensive list of the databases that are currently available. In many cases the databases are updated daily from the copy of the database held at the collating site.
2. Materials
The four methods of network access provided by the EBI are the most popular ways of accessing information over computer networks. Therefore, there are a large number of different programs that can access these services. It is beyond the scope of this chapter to provide details on how to use all of these programs. Instead, this chapter provides the basic information that is necessary for navigating and accessing the services provided by the EBI. The route that you choose will depend on the software that is available on your computer system.
From: Methods in Molecular Biology, Vol. 70: Sequence Data Analysis Guidebook. Edited by S. R. Swindell, Humana Press Inc., Totowa, NJ
Table 1
Databases Available at the EBI

Database         Ref.  Description
3D ali           1     Structure-based sequence alignments
Alu              2     ALU sequences and alignments
Berlin RNA       3     5S rRNA sequences
Bio-Catalogue    4     Directory of molecular biology and genetics software
Blocks           5     Protein blocks database
CpGIsle          6     CpG islands database
cutg             7     Codon usage tabulated from GenBank
dbEST            8     Expressed sequence tags
dbSTS            9     Sequence tagged sites
DSSP             10    Secondary structure assignments of pdb files
ECDC             11    Escherichia coli database collection
EMBL             12    Nucleotide sequence database
Enzyme           13    Database of EC nomenclature
EPD              14    Eukaryotic promoter database
FlyBase          15    Drosophila genetic map database
FSSP             16    Families of structurally similar proteins
HaemA            17    Hemophilia A database
HaemB            18    Hemophilia B database
HLA              19    HLA class I and II sequence database
HSSP             20    Protein structure-sequence alignments
IMGT             21    Immunogenetics database
Kabat            22    Proteins of immunological interest
LiMB             23    List of molecular biology databases
Lista            24    Yeast protein coding sequences
Methyl           25    Site-specific methylation
Misfolded        26    Deliberately misfolded protein models
NRL3D            27    Sequence-structure database
NRSub            28    Nonredundant Bacillus subtilis genome database
Nucleosomal DNA  29    Nucleosomal DNA sequences
P53              30    P53 mutations
PDB              31    Brookhaven protein structures database
PDB Select       32    Representative list of PDB chain identifiers
PIR              33    Protein sequence database
PKCDD            34    Protein kinase catalytic domain sequence database
Prints           35    Protein motif fingerprint database
Prodom           36    Protein sequence modules (recurring domains)
Prosite          37    Prosite pattern database
PUU              38    Database of structural domains
RDP              39    Ribosomal database project
REBASE           40    Restriction enzyme database
RELibrary        41    Comprehensive restriction enzyme lists
RepBase          42    Prototypic human repetitive DNA sequences
RHdb             4     Radiation hybrid database
RLDB             43    Reference library database
rRNA             44    Small subunit rRNA sequences
SBASE            45    Protein domain database
SeqAnalRef       46    Sequence analysis bibliography
SmallRNA         47    Compilation of small RNA sequences
SRP              48    Signal recognition particle database
SWISS-PROT       49    Protein sequence database
TFD              50    Transcription factor database
TransFac         51    Eukaryotic cis-acting regulatory DNA elements and trans-acting factors
Transterm        52    Translational termination signal database
tRNA             53    Database of tRNA sequences
Yeast            54    Yeast chromosome database
Table 2
File Types

File extension      File type                    Format
.doc .txt .dat      Plain ASCII files            ASCII
.uue .hqx           DOS and Mac encoded files    ASCII
.exe                Executable files             Binary
.Z .gz .arc .zip    Compressed files             Binary
.tar                Tape archive files           Binary
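The choice of transfer mode implied by Table 2 can be written as a small lookup. The following Python sketch is illustrative only: the grouping of extensions follows our reading of Table 2, the function name `transfer_mode` is hypothetical, and falling back to binary for unlisted extensions is an assumption (binary mode leaves the bytes of a file unchanged).

```python
# Extension groups as read from Table 2 (compared in lowercase).
ASCII_EXTENSIONS = {".doc", ".txt", ".dat", ".uue", ".hqx"}
BINARY_EXTENSIONS = {".exe", ".z", ".gz", ".arc", ".zip", ".tar"}


def transfer_mode(filename):
    """Return 'ascii' or 'binary' for a file name, based on its extension."""
    name = filename.lower()
    dot = name.rfind(".")
    extension = name[dot:] if dot != -1 else ""
    if extension in ASCII_EXTENSIONS:
        return "ascii"
    # Listed binary types and anything unlisted (assumption) go as binary.
    return "binary"


for example in ("enzyme.dat", "release.tar.Z", "README"):
    print(example, transfer_mode(example))
```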
3. Method
The EBI maintains four key routes by which its network services can be accessed: FTP, WWW, Gopher, and E-mail. In this section, the services accessible by each route are described. In most cases the services can be accessed by any of the four routes. This should allow you to find at least one method of accessing these services from your computer system.
3.1. FTP (File Transfer Protocol)
This is the main route for retrieving databases or software from the EBI’s archive. This archive is simply a directory hierarchy that is made accessible to any anonymous user. Using FTP you can navigate through these directories and copy the files of interest to your machine.
3.1.1. A Basic FTP Session
Connect to the EBI FTP server (ftp.ebi.ac.uk). Type anonymous at the login prompt and enter your E-mail address as your password (e.g., john.doe@life.univ). This will take you to the top directory of the EBI’s file archive. Since most of the large files are compressed tar files, it is a good idea to set the FTP file transfer to binary at this stage. See Table 2 for an explanation of file extension types. The archive of databases is kept in the directory /pub/databases. All of the databases listed in Table 1 can be found in this directory. Be warned that some of the files are very large. Change into the directory that has the database in which you are interested. Transfer the README file if it is available. This file contains additional information about the database in this directory. If you are transferring large files using the terminal version of FTP, it is useful to set the hash mark (#) on; this will give you a visual indication of how the transfer is progressing (each hash mark represents a batch of data that has been transferred, usually equivalent to 1 K). Transfer the database files you require. An example session is given in Fig. 1.
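The same steps can be scripted. The sketch below uses Python's standard ftplib module and is only an illustration: the helper names `fetch_files` and `download_enzyme_example` are hypothetical, the directory and file names are assumptions taken from the enzyme database example, and the server layout may differ from what is described in this chapter.

```python
import os
from ftplib import FTP


def fetch_files(ftp, directory, filenames, dest_dir="."):
    """Change into `directory` and copy each named file in binary mode."""
    ftp.cwd(directory)
    for name in filenames:
        local_path = os.path.join(dest_dir, name)
        with open(local_path, "wb") as handle:
            # retrbinary always transfers in binary (image) mode.
            ftp.retrbinary("RETR " + name, handle.write)


def download_enzyme_example():
    """Anonymous login followed by retrieval of two files (assumed names)."""
    with FTP("ftp.ebi.ac.uk") as ftp:
        # Anonymous login; the E-mail address serves as the password.
        ftp.login("anonymous", "john.doe@life.univ")
        fetch_files(ftp, "/pub/databases/enzyme", ["README", "enzyme.dat"])
```

Calling `download_enzyme_example()` performs the network transfer; `fetch_files` accepts any object that provides `cwd` and `retrbinary`, which makes it easy to try out without a live connection.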
    Connected to mercury.ebi.ac.uk.
    220 mercury FTP server (Version wu-2.4(6) Thu Dec 15 15:22:30 GMT 1994) ready.
    Name (ftp.ebi.ac.uk:doe): anonymous
    331 Guest login ok, send your complete e-mail address as password.
    Password: john.doe@life.univ
    < Lines of login information deleted >
    Remote system type is UNIX.
    Using binary mode to transfer files.
    ftp> binary
    200 Type set to I.
    ftp> cd /pub/databases/enzyme
    250-Please read the file README
    250-  It was last modified on Thu Mar 23 16:58:30 1995 - 95 days ago
    250 CWD command successful.
    ftp> ls
    200 PORT command successful.
    150 Opening ASCII mode data connection for file list.
    README
    enzclass.txt
    enzuser.txt
    enzyme.get
    enzyme.dat
    226 Transfer complete.
    ftp> get README
    200 PORT command successful.
    150 Opening BINARY mode data connection for README (668 bytes).
    226 Transfer complete.
    668 bytes received in 0.12 seconds (5.5 Kbytes/s)
    ftp> prompt
    Interactive mode off.
    ftp> mget *
    < Lines of transfer information deleted >
    ftp> quit
    %
Fig. 1. Example of an FTP session to retrieve the enzyme database.
3.1.2. Retrieving Sequences from the EBI Databases Using FTP
It is also possible to retrieve subsets of sequences from EMBL and SWISS-PROT, including both the database releases and updates. This has been achieved by extending the get command on our FTP server. The command format is:
    get db:index:querystring filename
where db is the abbreviation for the database (one of embl, emblnew, emblall, nuc, swissprot, swissnew, pep), index is the field to be searched, querystring is the string to be searched for (see Table 3), and filename is the name of a local file in which to store the retrieved sequences. For example:
    get embl:acc:m73019 m73019.seq
retrieves an entry from the EMBL database with the accession number m73019 and stores it in a local file called m73019.seq.
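When many such requests are needed, the command line can be assembled programmatically. The following Python sketch is a convenience illustration, not part of the EBI service: the function name `build_get_command` is hypothetical, the database abbreviations are those listed above (with nuc assumed for the nucleotide database), the index names are those of Table 3, and spaces in the query are replaced by question marks as in the examples in this section.

```python
# Database abbreviations and index names as listed in the text and Table 3.
DATABASES = {"embl", "emblnew", "emblall", "nuc", "swissprot", "swissnew", "pep"}
INDEX_FIELDS = {"acc", "id", "dat", "fts", "ref", "sl",
                "def", "aut", "cc", "org", "tit", "all"}


def build_get_command(db, index, query, filename):
    """Assemble 'get db:index:querystring filename' for the extended get."""
    if db not in DATABASES:
        raise ValueError("unknown database abbreviation: " + db)
    if index not in INDEX_FIELDS:
        raise ValueError("unknown index field: " + index)
    # The query string may not contain spaces; '?' stands in for each one.
    querystring = query.strip().replace(" ", "?")
    return "get {}:{}:{} {}".format(db, index, querystring, filename)


print(build_get_command("embl", "acc", "m73019", "m73019.seq"))
print(build_get_command("swissprot", "all", "flavo*", "flavo.peps"))
```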
Table 3
Searchable Index Fields

Index   Description        Example of query string
acc     Accession number   X07888
id      Identifier         ATP6_YEAST
dat     Date               20-NOV-1994
fts     Features           Intron
ref     Reference          Gene 145:153-154(1994)
sl      Sequence length    2000
def     Definition         7SL RNA pseudogene
aut     Author             Doe J.
cc      Comment            Any unformatted string of characters
org     Organism           Homo sapiens
tit     Reference title    Human genes and pseudogenes
all     All text fields    Any unformatted string of characters

    get emblall:def:flavodoxin flavo.seqs
retrieves all the entries from both the EMBL release and updates databases that have a definition line containing the word flavodoxin. These entries are stored in a file flavo.seqs.
    get swissprot:all:flavo* flavo.peps
retrieves all the SWISS-PROT entries that contain any words starting with flavo in any of the text fields and stores them in a file flavo.peps.
    get embl:ref:plant?mol.?biol.&1995
retrieves all of the EMBL entries that refer to a paper published in Plant Mol. Biol. in 1995. Note the question marks in place of spaces, and the ampersand, which represents the logical AND statement.
3.1.3. Advanced Retrieval of Entries
It is possible to do more complicated queries using the get command:
    get srs:querystring filename
This command uses the SRS program to perform the query (54). No spaces or SRS command line parameters are allowed in the query. Consult the SRS documentation for more details on how to specify the query string (see Note 2). For example:
    get srs:[embl-fts:intron]>parent&[embl-org:arabidopsis*] arain.seq
retrieves all arabidopsis sequences that contain an intron.
    get srs:[prosite-id:PROTEIN_KINASE_TYR]>swissprot kinase.pep
retrieves all proteins that contain a tyrosine kinase motif defined by PROSITE.
    get srs:[prosite-id:PROTEIN_KINASE_TYR]>swissprot>pdb kinase.pdb
is the same as the previous query except that the known three-dimensional structures of kinases are returned.
3.1.4. Retrieving Software
The archive of software is kept in the directory /pub/software. This archive is divided into directories based on machine operating system. These directories contain an extensive range of molecular biology software, including tools for submission of sequences to the EBI (Authorin).
3.2. WWW (World Wide Web)
All of the EBI’s World Wide Web pages conform to a similar style, which should make navigation relatively simple. The EBI’s home page provides access to information about the EBI and its services (see Fig. 2). The first pages you will see provide lists of available subtopics, each one represented by a square button and a short description. At the bottom of every page are two more square buttons, one of which will bring you back to the EBI home page and the other to the top page of the current subtopic. At the bottom of the EBI home page are several additional buttons that provide information on what is new, where to send comments, and how to search the EBI pages. The latter can be done using the button at the bottom of this page, which is useful if you cannot find a particular page of information. These buttons are not visible when using a WWW browser that cannot show graphics. In this case, the buttons are replaced by text surrounded by square brackets. Most WWW browsers that can display graphics have an option to switch off the loading of any graphics. Deferring the loading of graphics is very useful if you find the transfer of our WWW pages to be slow. The first three buttons provide links to information regarding the databases that the EBI collates, including the EMBL nucleotide and SWISS-PROT protein sequence databases. The remaining buttons provide access to the network services that are provided by the EBI. These services are described in the remainder of this subsection.
3.2.1. Sequence Retrieval
The Query/Retrieval button on the EBI home page will take you to a new page, which has the following options:
[Screenshot: Netscape window at location http://www.ebi.ac.uk/ showing the page "The EMBL Outstation: European Bioinformatics Institute" with the text "This is the world-wide web (WWW) server of the European Bioinformatics Institute (EBI), which is located at Hinxton Hall near Cambridge in the UK. The EBI is an Outstation of EMBL."]
Fig. 2. The EBI World Wide Web home page viewed using Netscape Navigator under X Windows.
3.2.1.1. SEQUENCE RETRIEVAL BY ACCESSION NUMBER
This is a simple form to retrieve an entry from EMBL, SWISS-PROT, PROSITE, or PDB given its accession number.
3.2.1.2. SEQUENCE RETRIEVAL SYSTEM
This page provides an interface to SRS (54) that allows entries to be selected based on a number of query forms. The result of each query is returned as a list
of hits linked to the specific entries. In each entry, links to other entries in the various databases are highlighted as hypertext links and can also be traversed. To obtain help on a particular part of the form, just select the highlighted "?" beside it. At the moment you can search:
1. Nucleotide and protein sequence databases (EMBL, SWISS-PROT, and so on).
2. Protein structure and derived databases (PDB, HSSP, DSSP, and so on).
3. Sequence-related databases (Prosite, Enzyme, REBASE, and so on).
4. Bibliographic databases (SeqAnalRef, Medline).
5. Miscellaneous databases (LiMB).
6. TransFac files (TFSITE, TFFACTOR).
7. dbEST and dbSTS.
8. Mapping libraries (RHdb).
3.2.1.3. EXTERNAL RETRIEVAL SERVICES
Links are provided to other reliable retrieval services run by other organizations.
3.2.2. Sequence Similarity Searches
If you wish to extract sequences that are similar to the one you have, you should select the Sequence Similarity Searches page from the EBI home page. Several services are provided that permit such queries:
3.2.2.1. FASTA HOMOLOGY SEARCHES
This service is accessed by filling out a short form. The results of the search are E-mailed back to the submitter. It is, therefore, essential that you include your correct E-mail address. More details on the FASTA service and how to obtain help are given in Section 3.4.
3.2.2.2. BLITZ EXHAUSTIVE PROTEIN DATABASE SEARCHES
This service is accessed in a similar way to the FASTA search and is discussed in more detail in Section 3.4. There is also a detailed description on the BLITZ page of the EBI WWW server.
3.2.2.3. PROSITE PATTERN SEARCHES
Unlike the previous two searches, the PROSITE search is much faster and the results are returned directly. If any hits are found they will be linked to the appropriate PROSITE entry, which can be retrieved by selecting this link. A detailed description of this search tool is given on the PROSITE introduction page. Other options provide access to external services, including BLAST and the BCM search launcher.
3.2.3. Documentation and Software
The documentation for the products and services of the EBI can be found by selecting Documentation and Software from the EBI home page. This also gives access to the complete FTP archive and gopher service.
3.2.4. Network Navigation
A number of mechanisms are provided that allow you to search various collections of international WWW addresses for information that may be of interest to you.
1. Bio-wURLd: This enables you to search the EBI collection of WWW sites or submit additional addresses to these lists.
2. CUSI: Provides a collection of WWW navigation tools.
3. Netnews Filtering Service: This is a service maintained at the EBI for scanning Bionet/EMBnet news groups for relevant articles and delivering them periodically by E-mail (see Section 3.4 for more details).
3.3. Gopher
Gopher can be considered a tool that lies somewhere between FTP and WWW. The files are arranged in a hierarchy of directories, as in FTP, but have more detailed titles and are navigated by selecting the appropriate entry, as in WWW browsers. This similarity is reflected in the arrangement of information on the EBI Gopher server (Fig. 3). For an explanation of the various resources, consult the two previous subsections.
3.4. E-Mail (Electronic Mail)
Access to all of the EBI E-mail services is performed in a similar manner. The user writes simple E-mail messages and sends them to the appropriate server. The request is carried out by the EBI computers and the results are then E-mailed back to you. Each E-mail message can contain more than one command. However, each command must appear on a separate line. The most important command is HELP. Sending this as the body of a message to any of the EBI E-mail servers will return an E-mail message containing information on how to use this service. The E-mail servers are not sensitive to the case of the commands. The following subsections introduce the available E-mail services and some of the commands that can be used. The E-mail addresses for these services are given in the notes.
3.4.1. Network File Server
The network file server provides access to databases, software, and documentation held at the EBI. The following commands are supported.
[Screenshot: the EBI Gopher Information Service menu, with entries that include access to various databases, a biology-associated software repository, SWISS-PROT protein database documentation, other Gophers, and searches and archives.]
Fig. 3. An example of the EBI Gopher Service.
3.4.1.1. HELP TOPIC
This provides more help on a specific topic (each topic is in its own subdirectory), for example:
1. HELP NUC retrieves help on the EMBL nucleotide sequence database.
2. HELP PROT retrieves help on the SWISS-PROT protein sequence database.
3. HELP SOFTWARE retrieves help on the molecular biology software on the fileserver.
3.4.1.2. GET FILENAME
This retrieves files from the fileserver, for example:
1. GET NUC:X03392 retrieves the nucleotide sequence with accession number X03392.
2. GET PROT:WAP_MOUSE retrieves the SWISS-PROT protein sequence WAP_MOUSE.
3. GET DOC:BIOBIT.24 retrieves BIOBIT newsletter number 24.
4. GET DOC:DATASUB.TXT retrieves the sequence submission form.
3.4.1.3. DIR DIRECTORY
This is the same as HELP topic.
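A request to the file server is just a plain-text message with one command per line. The sketch below builds such a message with Python's standard email package; it is an illustration only. The function name `build_request` is hypothetical, the subject line is an assumption (this chapter does not say whether one is needed), and the server address is the file server address given in the notes. The message is built but not sent.

```python
from email.message import EmailMessage


def build_request(server_address, sender, commands):
    """Build a plain-text request with one server command per line."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = server_address
    message["Subject"] = "request"  # assumption: content of subject is ignored
    message.set_content("\n".join(commands) + "\n")
    return message


request = build_request(
    "netserv@ebi.ac.uk",
    "john.doe@life.univ",
    ["HELP", "GET NUC:X03392"],
)
print(request.get_content())
```

To send the message, hand it to your local mail system, for example with `smtplib.SMTP(...).send_message(request)`; the mail host to use depends on your site.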
    TITLE My unknown test sequence
    LIB emnew
    LIST 10
    SEQ
    atgtcaaaga aaattggttt attctacggt actcaaactg gtaaaactga atcagtagca
    gaaatcattc gagacgagtt tggtaatgat gtggtgacat tacacgatgt ttcccaggca
    gaagtaactg acttgaatga ttatcaatat ttgattattg gctgtcctac ttggaatatt
    ggcgaactgc aaagcgattg ggaaggactc tattcagaac tggatgatgt agattttaat
Fig. 4. An example FASTA query.
3.4.1.4. SIZE FILESIZE
Some of the older mailer systems will reject files over a certain size. To overcome this, the fileserver provides files as sets of E-mail messages from which the original file can be reconstructed by concatenating the bodies of these E-mail messages. The fileserver will automatically split large files into packets of approx 95 Kb. You can use this command to override this value. However, it is useful to know that smaller files usually transfer faster across the networks.
3.4.2. The FASTA Server
This E-mail server provides access to the sequence similarity searching program FASTA (55). Using this server you can make fast comparisons of both nucleotide and protein sequences against the most up-to-date versions of the databases, EMBL and SWISS-PROT. For a full list of the commands available, send a HELP message to this server. The only mandatory command is SEQ. The others are optional and preset defaults will be applied. An example E-mail request is given in Fig. 4. This query searches the new EMBL entries for any nucleotide sequence that is similar to the query sequence. Shortly after you send your query message to the FASTA server you will receive a confirmation message summarizing your query and any defaults that have been applied. If there are any problems with your query, this message will identify them. Once the query has been performed you will receive the output results in another E-mail message. This message contains a lot of detailed information regarding your search. For an explanation of this file consult the help document that is supplied by this E-mail server.
3.4.3. The BLITZ Server
The BLITZ server uses the MPsrch server and program developed at the University of Edinburgh, UK (56). It allows very sensitive protein sequence similarity searches using a Smith and Waterman best local similarity search algorithm (57). It is a very fast implementation of this algorithm with the program being run on a 4096 processor MasPar MP-1 system. This service works
    LINES 10
    PERIOD 7
    EXPIRE 364
Fig. 5. An example of the NetNews filtering service.
in a very similar way to the previously described FASTA server. Again, the only mandatory line is the SEQ line. For a detailed description send a HELP message to the BLITZ server.
3.4.4. News Filtering Service
To overcome the time-consuming process of reading all the important discussion groups on the Internet, the EBI provides a filtering service that will sift through these messages and keep you informed of only those messages that contain keywords of interest to you. Currently, only those newsgroups that are of interest to the academic community are provided (BioSci/Bionet/EMBNet/Sci). The keywords that you specify define the profile of the information in which you are interested. Each article is tested against this profile and scored. Those messages that give a score above a given threshold will be reported back to you periodically. This report provides a summary of the articles identified, with a small number of lines returned from each article. Some of the available commands are:
1. HELP: Provides detailed help on how to use this service.
2. SUBSCRIBE word word: List of keywords of interest defining your profile (this is a mandatory line).
3. LINES lines: Number of lines of article to include in the summary sent to you.
4. PERIOD period: Period between notifications.
5. EXPIRE days: Number of days after which the query expires.
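The commands listed above can be assembled into a profile with a few lines of code. The following Python sketch is illustrative only: the function name `build_profile` is hypothetical, and only the SUBSCRIBE, LINES, PERIOD, and EXPIRE commands named in this section are used. The score threshold is left at the server default because the command that sets it is not given here.

```python
def build_profile(keywords, lines=None, period=None, expire=None):
    """Return the body of a news-filter request, one command per line."""
    if not keywords:
        raise ValueError("SUBSCRIBE is mandatory and needs at least one keyword")
    body = ["SUBSCRIBE " + " ".join(keywords)]
    if lines is not None:
        body.append("LINES {}".format(lines))
    if period is not None:
        body.append("PERIOD {}".format(period))
    if expire is not None:
        body.append("EXPIRE {}".format(expire))
    return "\n".join(body)


# Weekly reports, 10 lines per article, expiring after 364 days.
print(build_profile(["EMBL", "database", "submission"],
                    lines=10, period=7, expire=364))
```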
A simple example is given in Fig. 5. In this example, articles are scanned for the words EMBL, database, and submission. Those articles that match this profile with a score of 80 or better are reported every week with the summary message containing 10 lines from each article. This profile is registered to expire within a year.
4. Notes
4.1. Addresses of the Various EBI Servers
1. FTP server: ftp.ebi.ac.uk
2. WWW server: www.ebi.ac.uk
3. Gopher server: gopher.ebi.ac.uk
4. E-mail servers:
   a. Data submissions: [email protected].
   b. General inquiries: datalib@ebi.ac.uk.
   c. Corrections to sequence entries: update@ebi.ac.uk.
   d. E-mail file server: netserv@ebi.ac.uk.
   e. Network server inquiries: nethelp@ebi.ac.uk.
   f. Biosci/Bionet/EMBNet/Sci news filtering service: netnews@ebi.ac.uk.
   g. Submission of software to be placed on the EBI servers: software@ebi.ac.uk.
   h. Protein sequence similarity searching: blitz@ebi.ac.uk.
   i. Sequence similarity searching: fasta@ebi.ac.uk.
4.2. SRS-Indexed Databases and Documentation
1. Databases indexed at the EBI: NRL3D, SWISSPROT, NRSUB, SWISSNEW, PDB, PIR, HSSP, EMBL, DSSP, EMNEW, ALI, CPGISLE, FSSP, PRODOM, PROSITE, FLYGENE, PROSITEDOC, FLYREFS, BLOCKS, FLYCLONES, EPD, SWISSDOM, ECDC, PIRALN, ENZYME, SEQANALREF, REBASE, MEDLINE, LIMB, RHDB, RHPANEL, RHEXP, TFSITE, TFFACTOR, DBEST, DBESTNEW, DBSTS, and DBSTSNEW.
2. SRS information and documentation: http://www.ebi.ac.uk/srs/srsman.html.
14
GeneAssist
Smith-Waterman and Other Database Similarity Searches and Identification of Motifs
Eugene G. Shpaer

1. Introduction
DNA sequence data accumulate at a higher rate than our knowledge of protein functions and biology. Therefore, it becomes increasingly important to predict as accurately as possible the functions of encoded proteins based on their homology to previously described sequences. One of three algorithms, BLAST (1), FASTA (2,3), or Smith-Waterman (S-W) (4) dynamic programming, is usually used to compare a query sequence to the database. All three methods look for the best local alignment of a subsequence of the query and a database sequence. The S-W algorithm is the most accurate, but its slow speed makes it impractical in the majority of cases. Therefore, we developed software to run S-W protein comparisons on the Fast Data Finder (FDF), a highly parallel pipeline array processor that is a component of the GeneAssist sequence analysis package (see Note 1). Not all the functions of GeneAssist are described here. For example, users can create Custom Databases from their own sequence data. The program provides an interface common with other Perkin-Elmer (Applied Biosystems, Foster City, CA) sequence analysis software (AutoAssembler, Sequence Navigator, Factura) to edit Sample/Sequence files, read and write feature tables, complement, translate, and so on.

1.1. Parameters for Protein Sequence Similarity Searches
The S-W search requires only two parameters: an amino acid scoring matrix and penalties for insertions/deletions (indels; see Note 2). This compares favorably with
complex sets of parameters for BLAST and FASTA (the common practice is that researchers run these programs with default parameters to avoid these complexities). The theory of scoring matrices was described in an excellent paper by Altschul (5). All scoring matrices can be characterized by their entropy (E), which is the average amount of information per aligned amino acid residue. Percent amino acid mutations (PAM) (6) and BLOSUM series matrices are used most frequently. The PAM value, e.g., PAM = 120, means that two homologous proteins accumulated 120 amino acid substitutions per 100 residues since they diverged. The data for BLOSUM matrices are derived from SUMS of amino acid substitution frequencies in protein BLOCKS, which are nongapped alignments (7). The number in a BLOSUM matrix, e.g., BLOSUM62, indicates that the BLOCKS used to calculate replacement frequencies had <62% identical amino acids in the alignments. Matrices with high entropy (for example, PAM120, BLOSUM70) are better at detecting short regions of strong similarity, whereas low-entropy matrices (PAM250, BLOSUM45) theoretically would be able to detect very weak and long regions of similarity. In practice, it is usually unknown whether the region of similarity is long or short, and matrices with average entropies are used (BLOSUM62 [E = 0.7] is the default matrix in BLAST, and BLOSUM50 [E = 0.5] is the default in FASTA). It has been demonstrated that BLOSUM matrices developed by Henikoff work much better than PAM series matrices (6).
The S-W algorithm looks for the best local alignment that may have insertions and deletions (one reason BLAST is a fast method is that it does not allow indels; it looks only for the best nongapped alignments). Several models for indels have been proposed: fixed penalty (every deletion has the same negative score); affine scoring (there is a different score to start an indel, usually a high penalty like -10 or -12, and a much smaller penalty to extend it [-1 to -4]); and variable indel scoring.
We developed this scoring method using the frequencies of indels in structural (X-ray based) alignments (7). For example, prolines and aspartic acids are deleted 2.5 times more frequently than isoleucines and cysteines; therefore, the score to delete either of the first two amino acids is -6, but the score to delete either of the last two is -9 (these values are for BLOSUM62 and other matrices in half bits). GeneAssist employs the variable gap scoring model. Essentially, the scoring matrix has an additional row of 20 negative scores that are the penalties to delete each of the 20 amino acids.

1.2. Comparing the Accuracy of BLAST and Smith-Waterman Methods
Protein evolution consists of amino acid substitutions (the scoring matrix essentially describes the probabilities for each amino acid being replaced by others) and insertions/deletions. The amount of protein evolution can be measured by PAM (1 PAM is equal to 1 amino acid mutation per 100 residues [8]).
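The variable indel scoring described in Section 1.1 can be illustrated with a short Python sketch. This is not GeneAssist code. It assumes that every deleted residue is charged its own penalty with no separate gap-opening term, it uses a toy identity/mismatch score in place of BLOSUM62, and the default penalty of -8 for residues not named in the text is a hypothetical value. Only the -6 (Pro, Asp) and -9 (Ile, Cys) penalties come from the text.

```python
# Illustrative sketch only: toy substitution scores (not BLOSUM62) and an
# assumed cost model in which every deleted residue is charged separately.

# Deletion penalties (half bits) quoted in the text for four residues.
DELETION_PENALTY = {"P": -6, "D": -6, "I": -9, "C": -9}
DEFAULT_PENALTY = -8  # hypothetical value for residues the text does not list


def toy_substitution(a, b):
    """Stand-in for a scoring matrix: +5 for identity, -4 otherwise."""
    return 5 if a == b else -4


def deletion_cost(residue):
    """Penalty for aligning this residue against a gap."""
    return DELETION_PENALTY.get(residue, DEFAULT_PENALTY)


def smith_waterman_score(query, target, substitution=toy_substitution):
    """Return the best local alignment score of query against target."""
    previous = [0] * (len(target) + 1)
    best = 0
    for a in query:
        current = [0] * (len(target) + 1)
        for j, b in enumerate(target, start=1):
            diagonal = previous[j - 1] + substitution(a, b)
            skip_query = previous[j] + deletion_cost(a)       # a faces a gap
            skip_target = current[j - 1] + deletion_cost(b)   # b faces a gap
            current[j] = max(0, diagonal, skip_query, skip_target)
            best = max(best, current[j])
        previous = current
    return best


# An extra proline is cheaper to skip than an extra isoleucine.
print(smith_waterman_score("ACPGHK", "ACGHK"))  # 19
print(smith_waterman_score("ACIGHK", "ACGHK"))  # 16
```

With these toy scores, the alignment that skips the proline scores 19 and the one that skips the isoleucine scores 16, which is the kind of residue-dependent difference the extra row of 20 deletion penalties is meant to capture.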
Table 1
Output for the Similarity Search for the Query Sequence IHKREV

        Smith-Waterman              BLAST
Rank    Locus     SF#    Score      Locus     SF#    Score    P(N)       N
 1      IHKREV      26    494       IHKREV      26    460     3.2e-60    1
 2      IHTFER      26    416       IHTFER      26    388     3.1e-50    1
 3      IHKREG      26    374       IHKREG      26    197     1.7e-44    2
 4      IHTF        26    183       IHTF        26     98     1.4e-21    2
 5      IHRFG       26    128       IHPC        26     78     1.9e-14    2
 6      IHPC        26    104       IHRFG       26     84     1.1e-05    1
 7      IHQFT       26     72       ALBSMX     507     42     0.26       3
 8      IHER1       26     68       IHER        26     39     0.40       2
 9      IHER        26     63       ALBSGC     507     39     0.60       3
-------------------------------------------------------------------------
10      D25866    1029     57       VGVN      2653     37     0.76       3
11      S08090    2795     57       IHER1       26     37     0.80       2

The query sequence IHKREV belongs to the superfamily (SF) of high-potential iron-sulfur proteins (SF number 26 in PIR40), which has 9 members. The cutoff of SF size (9) was used. At this cutoff, all SF members are above the line for the S-W output: every iron-sulfur protein has a higher score than any other sequence in the database, and no sequences have been missed. The BLAST output shows two unrelated sequences above the cutoff, and two members of SF 26 are below the cutoff (the sequence IHQFT was not even present among the top 100 in the output). Thus, BLAST missed two iron-sulfur proteins at the SF size cutoff (also called the equivalence point because it is where the number of false positives is equal to the number of false negatives). Both runs used the BLOSUM62 scoring matrix; BLASTP 1.4 was run with default parameters. S-W scores have been log-length normalized as if both sequences were 100 residues long: S_n = S × 21.21/[ln(L1) × ln(L2)], in which S_n and S are the normalized and raw S-W scores, respectively, L1 and L2 are the lengths of the query and database sequences, and [ln(100)]^2 = 21.21.
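The log-length normalization given in the Table 1 legend can be written as a small Python function. This is a sketch of the formula only. The function and constant names are ours, and the example raw score and sequence lengths are invented for illustration.

```python
import math

REFERENCE_LENGTH = 100  # both sequences are treated as if 100 residues long


def normalize_sw_score(raw_score, query_length, target_length):
    """Apply S_n = S * [ln(100)]^2 / (ln(L1) * ln(L2)) from the Table 1 legend."""
    scale = math.log(REFERENCE_LENGTH) ** 2  # about 21.21
    return raw_score * scale / (math.log(query_length) * math.log(target_length))


# A pair of 100-residue sequences keeps its raw score.
print(normalize_sw_score(90, 100, 100))
# The same raw score between a 50- and a 400-residue sequence is scaled down.
print(normalize_sw_score(100, 50, 400))
```

Because the product ln(L1) × ln(L2) grows with sequence length, long sequence pairs need a higher raw score to reach the same normalized value, which is what allows a single cutoff such as the one used in Section 1.2 to be applied across the whole database.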
The number of indels in protein evolution increases as proteins become more divergent; e.g., at PAM = 135 one deletion occurs, on average, per 14 aligned amino acid residues, and at PAM = 205, once per 9.5 aligned amino acids (data are from ref. 7 and extrapolated from Table 1 in ref. 9). Therefore, when searching for homologous and increasingly divergent proteins, it becomes more important to account for indels. Both Smith-Waterman and FASTA allow indels; BLAST does not. The question is: How frequently does BLAST fail to find a homology between proteins that can be detected using S-W? To answer this question, we tested BLAST and S-W on a set of queries representing different superfamilies and measured how well the database sequences that belong to the same protein superfamily as the query rank at the top of the output. In addition, we used a set of unidentified open reading frames (ORFs) from the completely sequenced yeast chromosomes II and VIII, for which BLAST did not find any homologies in the database, and reran these sequences using S-W.
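The ranking measure used in this comparison, the number of superfamily members missed at the equivalence point (see the Table 1 legend), can be sketched in Python as follows. The function name is ours. The two ranked lists are the SF# columns of Table 1.

```python
def missed_at_equivalence(ranked_superfamilies, query_superfamily, superfamily_size):
    """Count superfamily members absent from the top `superfamily_size` ranks.

    At this cutoff the number of false positives (unrelated entries above the
    line) equals the number of false negatives (members below the line).
    """
    top = ranked_superfamilies[:superfamily_size]
    found = sum(1 for sf in top if sf == query_superfamily)
    return superfamily_size - found


# SF# columns of Table 1, ranks 1-11 (query IHKREV, superfamily 26, 9 members).
SW_RANKED = [26, 26, 26, 26, 26, 26, 26, 26, 26, 1029, 2795]
BLAST_RANKED = [26, 26, 26, 26, 26, 26, 507, 26, 507, 2653, 26]

print(missed_at_equivalence(SW_RANKED, 26, 9))     # 0
print(missed_at_equivalence(BLAST_RANKED, 26, 9))  # 2
```

Summing this value over all query superfamilies gives a single total per search method.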
Table 1 shows the output of the BLAST and S-W searches for the query sequence IHKREV, which belongs to the superfamily (SF) of high-potential iron-sulfur proteins. In this case, we know that all iron-sulfur proteins are homologous; comparing the S-W and BLAST output in Table 1, it is clear that S-W worked better than BLAST. Next, we ran S-W and BLAST for 502 queries from different superfamilies (e.g., globins, cytochromes, and so on, representing practically all different types of proteins in the database) and determined "missed@equivalence scores" (see Table 1) for each superfamily using the BLAST and S-W methods. S-W with BLOSUM62 (S-W-BL62) had better "missed@equivalence scores" than BLAST for 282 superfamilies out of 502 (the iron-sulfur protein SF in Table 1 was just one of them); BLAST was better for only 58 superfamilies. So, accounting for indels improves the sensitivity of the comparison for the majority of different protein superfamilies. BLAST missed 1767 members from 502 SF (the total of all 502 "missed@equivalence scores"), compared to 1362 (25% less) for S-W-BL62 (in this case, SF #26 in Table 1 added +2 to the BLAST total missed number, and zero to the S-W-BL62 number).
We ran S-W comparisons using GeneAssist for 293 ORFs from yeast genomic DNA for which BLASTP did not find significant similarity with known proteins (the query sequences had been processed using the SEG filter [10] to remove regions of low complexity). We observed scores above the cutoff S_n = 90 for 31 ORFs (a normalized S-W score > 90 occurs once in 10^7 random protein sequence comparisons with BLOSUM62). Thus, according to this limited testing, using S-W instead of BLAST found matching sequences for ~10% of the 293 ORFs in which BLAST failed to find the similarity.

1.3. Protein Motif Searches Using the Pattern Specification Language
The Pattern Specification Language (PSL) in GeneAssist can specify practically any biological pattern or motif (it supports fuzzy match tolerances, proximity constraints, numeric ranges, Boolean operators, and more). For example, we translated the PROSITE database of protein motifs into PSL. Searching for motifs with known functions, e.g., PROSITE, can help to predict possible functions of novel protein sequences. The speed of PSL searches does not depend on the complexity of the PSL query and is limited only by the speed at which the database can be read through the SCSI interface (approx 3 Mb/s sustained rate).
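PSL syntax itself is not reproduced here, but the idea of translating PROSITE motifs into another pattern language can be illustrated with a Python sketch that converts PROSITE-style patterns into regular expressions. Regular expressions are our stand-in, not what GeneAssist uses. The converter handles only the common elements (x, repeat counts, [..] alternatives, {..} exclusions, and < > anchors outside brackets), and the two example patterns are written in PROSITE syntax for a zinc finger and a leucine zipper motif and should be treated as illustrative.

```python
import re

# One pattern element: a core followed by an optional (n) or (n,m) repeat.
ELEMENT = re.compile(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        core, low, high = ELEMENT.fullmatch(element.strip("<>")).groups()
        if core == "x":                 # x = any residue
            core = "."
        elif core.startswith("{"):      # {P} = any residue except P
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low:                         # (n) or (n,m) repeat counts
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return 0-based start positions of non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [match.start() for match in regex.finditer(sequence)]


ZINC_FINGER = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
LEUCINE_ZIPPER = "L-x(6)-L-x(6)-L-x(6)-L"

print(prosite_to_regex(ZINC_FINGER))
print(find_motif(LEUCINE_ZIPPER, "MK" + "LAAAAAA" * 3 + "LK"))
```

Unlike the FDF hardware search, a software regular expression scan slows down as patterns become more complex, so this sketch is suitable only for trying a motif on a handful of sequences.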
2. Materials
GeneAssist is a client-server system. It requires a Sun workstation (running SunOS 4.1.3 or Solaris 2.5) and a Fast Data Finder (FDF), a SCSI device
attached to the Sun. Inside the FDF is a pipeline of Very Large Scale Integration (VLSI) chips that together perform billions of pattern-matching operations per second (the physical pipeline can have from one board [480 cells] to five boards [3360 cells]). All performance numbers in this chapter are for a single-board FDF system. The pattern (it may be a PSL query or a protein sequence together with a scoring matrix) is loaded into the pipeline, and the database is streamed past the pipeline so that at each clock cycle every cell in the pipeline compares its part of the pattern to the database.
The sequence databases are installed on a hard disk (because of the rapid growth of these databases, at least a 2 Gb drive is recommended) from the CD-ROMs produced by NCBI (GenBank), European Bioinformatics Institute (EBI; EMBL and Swiss-Prot), and NBRF (PIR). Each of these organizations will send database updates for a nominal cost. Alternatively, these databases are available on the Internet. GeneAssist does not require proprietary database indexing; users can create their own custom databases.
Smith-Waterman searches can be run on the Sun using a command line interface. All functions of GeneAssist are implemented using the Macintosh graphical interface. Any number of Macs can access the server through a TCP/IP network. The parameters for the search are set on the Macintosh and the results appear on the Macintosh screen, so users have the impression that they have an extremely powerful Macintosh with all sequence databases installed. The Macintosh software also provides a seamless interface to sample files generated on the ABI DNA sequencers (Perkin-Elmer Corp.).
The minimal requirements to run GeneAssist are the following: server, any desktop Sun workstation with 32 Mb of RAM; and client, a Macintosh with a 68030 CPU or better and 8 Mb of RAM that is connected to the Ethernet. The Macintosh program is provided as a fat binary with both 68K and PowerPC code.

3. Methods
3.1. Protein Sequence Similarity Searches
3.1.1. Setting Up a Protein Smith-Waterman Search
An empty Similarity Search worksheet (Fig. 1) opens when you start up the GeneAssist application.
1. Click the left Change button (Fig. 1) to select one or several query sequences (the PIR entries NSBOH7 and Q1AD25 are provided as Mac files with the GeneAssist distribution; they are located in the GeneAssist/Tutorial-data/Query-sequences folder).
2. Click the middle Change button to select a protein database on the server (we used PIR release 44).
3. Click the right Change button to choose search parameters (Fig. 2). All parameters in Fig. 2 are the default settings; the only one we changed was the number of Entries that score in the Top (type 20). The Do Normalization check-box in Fig. 2 improves the accuracy of the searches by adjusting S-W scores depending on sequence lengths (see Table 1 legend).
4. Finally, after setting all three Change panes in Fig. 1, click the Begin Search button.

Fig. 1. Similarity search worksheet. Two protein sequences are loaded (upper-left pane); the database is shown in the middle; parameters for the search are in the top-right pane. Results are shown in the lower part. Two sequences are highlighted using the mouse; clicking on the Alignment button generated Fig. 3.
3.1.2. Viewing the Results of the Search
The results of the search, which took 66 s, are presented for both queries as a list of PIR entries in descending order of scores (lower part of Fig. 1).
1. Click Q1AD25 in the top-left pane (highlighted in Fig. 1), and the results pane will immediately jump to show the high-scoring sequences for the query Q1AD25 (Fig. 1; see Note 4).
2. Select the two sequences Q1AD22 and ERADTS in the results section of Fig. 1 by clicking on the first and shift-clicking on the second (holding the shift key allows one to select several lines in the output using the mouse).
3. Click the Alignment button as shown in Fig. 1 (see Note 3). The alignments of Q1AD25 vs Q1AD22 and Q1AD25 vs ERADTS show up instantly (Fig. 3). These are local alignments: only those regions of the two sequences that contributed to the S-W scores are shown. It is easy to change parameters, e.g., to use the PAM120 instead of the BLOSUM62 matrix, or to load another sequence and rerun the alignments in Fig. 3.
Fig. 2. Choosing parameters for the database similarity search. [Dialog box "Select Search Parameters": search method (including FDF Sliding Window), Do Rescoring and Do Normalization (improves accuracy) check-boxes, matrix (BLOSUM 62, PAM 250, PAM 120, or DNA), number of alignments, and score threshold (keep entries that score at or above a value, or keep entries that score in the top).]
3.2. MER1 Repeats in the GenBank: Nucleotide Similarity Search
3.2.1. Setting Up a Search for a Nucleotide Sequence
The task of this example is to find all occurrences of the medium reiteration frequency repeat 1 (MER1) in GenBank (see Notes 3 and 5).
1. Load the HSMER1REP sequence: Click the left Change button (Fig. 4) to open the window shown in Fig. 5.
2. Click the Import from File... button to load the sequence from the provided Macintosh file (alternatively, one can select Import from Database... and enter the GenBank LOCUS name [HSMER1REP] to load this query).
3. Click Complement (Fig. 5) so that both orientations are used for the search.
4. Click OK (Fig. 5); two sequences with the same name, HSMER1REP, but opposite orientations appear in the top-left pane of Fig. 4.
5. Click the middle Change button to select the GenBank-primate database.
6. Click the right Change button (Fig. 4) and choose the default FDF sliding window search parameters (see Notes 5 and 6).
7. Click the Begin Search button (the search takes 39 s). The score values in the results pane at the bottom show how many times the windows from HSMER1REP hit against database sequences. The results part of Fig. 4 shows all the occurrences of MER1 in GenBank.
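The scoring of the FDF sliding window search (steps 3, 6, and 7, with the default parameters given in Note 6: window size 30, offset 15, error tolerance 4) can be mimicked in software with the Python sketch below. It is far slower than the FDF and rests on two assumptions of ours: "differences" are counted as substitutions only, and every matching position in the database sequence counts as a separate hit. A trailing partial window of the query is ignored.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the opposite-strand orientation of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def query_windows(query, size=30, offset=15):
    """Cut the query into overlapping windows of `size` every `offset` bases."""
    return [query[i:i + size] for i in range(0, len(query) - size + 1, offset)]


def count_differences(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def window_hits(query, target, size=30, offset=15, max_errors=4):
    """Count positions in target that match any query window within tolerance."""
    hits = 0
    for window in query_windows(query, size, offset):
        for start in range(len(target) - size + 1):
            if count_differences(window, target[start:start + size]) <= max_errors:
                hits += 1
    return hits


def search_both_orientations(query, target, **params):
    """Score the query and its reverse complement against one target."""
    return {
        "forward": window_hits(query, target, **params),
        "complement": window_hits(reverse_complement(query), target, **params),
    }


# Tiny example with a 4-base window and no errors allowed.
print(search_both_orientations("AACC", "TTGGTT", size=4, offset=2, max_errors=0))
```

In the tiny example the query is found only in its complementary orientation, which is why step 3 loads both orientations before the search.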
3.2.2. Viewing the Results of the Search Using Dot Plot and Alignment
1. Click on the HUMPAIA sequence as shown in Fig. 4 and click the Dot Plot button; a new window (Fig. 6) opens.
2. Using the mouse, drag a dotted rectangle around the diagonal line in the dot plot in Fig. 6 (the line is nearly vertical because HSMER1REP, at 573 nucleotides, is much shorter than HUMPAIA, at 17,509 nucleotides; see Note 7).
3. After you drag a dotted rectangle in the area of a match in the dot plot field, click the Alignment button; the alignment window shown in Fig. 7 opens (see Note 3).
4. To view the HUMPAIA sequence, go back to the similarity search worksheet (Fig. 4). Double-click on the second sequence in the results pane (it is highlighted). A new window opens showing the GenBank entry for the HUMPAIA sequence (Fig. 8; see Note 8).

Fig. 3. Two S-W alignments (overlapping windows are shown) generated from the Similarity search worksheet in Fig. 1.
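The dot plot step can be sketched in Python as follows. The defaults (window of 10 with up to 4 mismatches) are the values that Note 5 describes as permissible for dot plots. The function names are ours, and `bounding_region` is only a stand-in for the rectangle dragged with the mouse in step 2, not a GeneAssist function.

```python
def dot_plot(seq_x, seq_y, window=10, max_errors=4):
    """Return (i, j) pairs where windows starting at i and j nearly match."""
    dots = []
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            differences = sum(
                a != b for a, b in zip(seq_x[i:i + window], seq_y[j:j + window])
            )
            if differences <= max_errors:
                dots.append((i, j))
    return dots


def bounding_region(dots, window=10):
    """Smallest pair of half-open ranges (x, y) that covers every dot."""
    xs = [i for i, _ in dots]
    ys = [j for _, j in dots]
    return (min(xs), max(xs) + window), (min(ys), max(ys) + window)


dots = dot_plot("ACGTA", "ACGTA", window=4, max_errors=0)
print(dots)                             # [(0, 0), (1, 1)]
print(bounding_region(dots, window=4))  # ((0, 5), (0, 5))
```

Restricting the subsequent alignment to the region around a diagonal keeps the computation small even when one of the sequences is many kilobases long.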
It is easy to select MER1 repeats (e.g., the MER1 repeat spans positions 1615-2157 of GenBank entry HUMPAIA, which is 17,509 nucleotides long) from all GenBank entries shown in Fig. 4 and generate a multiple sequence alignment using Sequence Navigator software (11).

3.3. Examples of Motif Searches Using PSL
The Pattern Specification Language (PSL) provides a very powerful mechanism to describe complex text motifs (see Note 9). We will discuss several examples below and show how things work in GeneAssist.
1. Open the Pattern Search window from the File->New menu (Fig. 9).
2. Click the New button to load motifs from one of the provided motif libraries. Navigate to the file Peptide Patterns and select the Zinc Finger (see Note 10) and Leucine Zipper (see Note 11) motifs (use apple-click for the second selection).
3. It is also possible to create PSL queries from scratch: Click the New button (Fig. 9). In the new window (bottom of Fig. 9) type: i400ia?" -> loo+" [T 1Sl M1 then press and give a name for this motif: >25% Thr + Ser (see Note 12). In practice, it is probably easier to load PSL examples (motifs) from the libraries and change them to fit the patterns you are interested in.
4. Select a database (Swiss-Prot 31) by clicking on the Change button and then click Begin Search (Fig. 9).
5. To display results, click on one of the motifs in the top-left pane of Fig. 9, e.g., >25% Thr + Ser.
6. Click the View Hits button. The results of the search are shown in the bottom pane of Fig. 9: There are 296 sequences in Swiss-Prot that have Thr + Ser-rich regions (see Note 13).
7. To save this list of 296 sequences as a Search Set, a Macintosh file that has pointers to the database entries, go to the Pattern Search pull-down menu and select Make Search Set with All Results.
8. Use this Search Set as a database subset for another search to find all sequences that are related to nerve or brain functions (Fig. 10). Open File->New->Pattern Search.
9. Click New and type a PSL query and its name separated by a : ("neuro*" OR "brain" OR "nerv*") Brain-or-nerve-related
10. Click OK.
11. Click the Change button (Fig. 10), select the Search Set From Query: >25% Thr + Ser, and click the Annotation button. This tells the program to search the Swiss-Prot annotations, not the sequences. After the search is done (with only one pattern, the results display automatically), we got a list of 29 Swiss-Prot entries that have both a Thr + Ser-rich region in their sequence and "brain," "nerv*," or "neuro*" mentioned in the annotation (see Note 14).

Fig. 4. Similarity search worksheet. Sequence HSMER1REP was loaded in both orientations (upper-left pane); the database is in the middle; parameters for the FDF sliding window search are in the right pane. Highlighting the line in the results section and clicking the Dot Plot button led to Fig. 6. Double-clicking on the same highlighted line led to Fig. 8.

Fig. 5. Loading query sequences for the Similarity search worksheet.

Fig. 6. Dot plot for HSMER1REP (vertical axis) against GenBank entry HUMPAIA.

Fig. 7. Alignment between HSMER1REP and a copy of the MER1 repeat in HUMPAIA.

Fig. 8. GenBank entry HUMPAIA, Annotation view. By clicking on the five buttons in the lower-left corner of the window, one can bring up one of the five views for this sequence.
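The two-stage search in steps 3-11 (a composition filter on the sequences, followed by a keyword search restricted to the annotations of the resulting Search Set) can be sketched in Python. This is not PSL, and several details are assumptions of ours: the window length is a free parameter, a sequence shorter than the window is treated as a single window, "*" is read as "any continuation of the word," and matching is case-insensitive. The records and their identifiers are invented toy data.

```python
import re


def has_rich_region(sequence, window, residues="TS", min_fraction=0.25):
    """True if some window contains more than min_fraction of `residues`."""
    window = min(window, len(sequence))
    if window == 0:
        return False
    threshold = min_fraction * window
    count = sum(1 for residue in sequence[:window] if residue in residues)
    if count > threshold:
        return True
    for end in range(window, len(sequence)):
        # Slide the window one residue to the right.
        count += (sequence[end] in residues) - (sequence[end - window] in residues)
        if count > threshold:
            return True
    return False


def annotation_matches(annotation, terms=("neuro*", "brain", "nerv*")):
    """True if any term occurs as a word; '*' matches any word ending."""
    for term in terms:
        body = re.escape(term).replace(r"\*", r"\w*")
        if re.search(r"\b" + body + r"\b", annotation, re.IGNORECASE):
            return True
    return False


# Hypothetical toy records standing in for database entries.
RECORDS = [
    {"id": "TOY1", "sequence": "MSTSTSTAAA", "annotation": "brain-specific protein"},
    {"id": "TOY2", "sequence": "MAAAAAAAAA", "annotation": "neuronal growth factor"},
    {"id": "TOY3", "sequence": "TTTTSSSSAA", "annotation": "liver enzyme"},
]

# Stage 1: sequence search; the hits play the role of the Search Set.
search_set = [r for r in RECORDS if has_rich_region(r["sequence"], window=8)]
# Stage 2: annotation search restricted to the Search Set.
final_hits = [r["id"] for r in search_set if annotation_matches(r["annotation"])]

print([r["id"] for r in search_set])  # ['TOY1', 'TOY3']
print(final_hits)                     # ['TOY1']
```

Running the cheap, restrictive filter first and the keyword search second mirrors the use of a Search Set as a database subset: the second search touches only the entries that survived the first.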
Fig. 9. Pattern Search window. The upper-left pane shows three loaded PSL queries. Double-clicking on these names will bring up a separate window (shown at the bottom) where you can compose and edit the query (this window is shown for the third pattern: >25% Thr + Ser). The upper-right pane shows the database loaded and has a button to change it. The results appear after clicking View Hits; this is a scrollable list of 296 database sequences (it is partially blocked by the Pattern window).
Fig. 10. Pattern Search window for the search in the annotations of the set of 296 sequences generated in Fig. 9; the list of 17 Swiss-Prot entries that match both criteria is shown at the bottom.

4. Notes
1. GeneAssist supports three different algorithms for query-against-database similarity searches: FDF sliding window, FASTA, and S-W using the FDF. Which method is more appropriate depends on the problem you are trying to solve. For nucleotide sequences we recommend using the FDF sliding window search first. For example, suppose you generated 36 sequences using an ABI Sequencer and wish to look for similar sequences in GenBank or EMBL. If you got the result No Hits for both orientations of the query sequence for the FDF sliding window search with default parameters, then it is probably a novel piece of DNA. On the other hand, if you did obtain hits, then looking at the dot plot and alignment will allow you to quickly understand whether your sequence is related to those previously described. Using the Smith-Waterman algorithm for nucleotide sequences is relatively slow because the sequences and databases tend to be much longer than proteins; use S-W in those few cases when you are looking for the most distant, but homologous, noncoding sequences. Protein searches can detect much more distant homologies than nucleotide searches; therefore, it makes more sense to translate coding sequences to proteins and run protein S-W searches. The FDF sliding window method can be used for protein searches if you are looking for nearly identical sequences in which the differences are the result of sequencing errors and natural polymorphism. Smith-Waterman searches would allow you to find distantly homologous sequences (Table 1 and Fig. 3); it may be a good idea to use S-W for novel protein sequences if you wish to find all related database entries. There is no real need to use FASTA, unless you wish to compare its output to other methods, since FASTA is implemented in software and is >5 times slower than FDF searches.
2. The S-W algorithm is relatively simple: The computer calculates the S-W matrix (an array of L1 × L2 scores, where L1 and L2 are the lengths of the query and database sequences, respectively) and picks the highest score in the array; this is the S-W score for the query against the given database sequence. This procedure is repeated for every database sequence, and the output is presented in descending order of S-W scores for the database sequences. For example, a search for a 100-amino-acid-long query sequence against Swiss-Prot database release 31 (~15 million residues) requires the calculation of 100 × 15 × 10^6 = 15 × 10^8 scores. This can be done in software, but takes several hours of CPU time on a SPARC workstation. The same search using the 1-board GeneAssist system runs at 42 × 10^6 scores/s (scores/s or cell-updates/s), so the search takes only 36 s.
3. There is a difference in how we get nucleotide (Fig. 6) and protein (Fig. 3) alignments. It is important to use a dot plot before alignment for nucleotide sequences to be able to narrow down the area you wish to align, as we did in Section 3.2.2., steps 1-3. On the other hand, dot plots are usually not very useful for protein sequence comparisons, and we got protein alignments directly from the Similarity Search worksheet (Fig. 4; Section 3.1.2., step 3).
4. It is important to be able to quickly navigate in the Results section of Fig. 1, since one can run 36 x 2 = 72 nucleotide sequences in both orientations or even 36 x 6 = 216 protein sequences after six-frame translation (36 comes from the number of lanes usually run on the ABI 373/377 Sequencer).
5. The idea of database searches is that there is a relatively small number of database entries that the user needs to find. However, it is possible to set search parameters that would generate millions of hits, even in random sequences, and cause the server to run slowly or even run out of memory. For example, the FDF sliding window search with a window size of 10 and error tolerance of 4 (only 6 nucleotides out of 10 are required to match to make a hit) would generate too many hits when searching databases (it is permissible to use these parameters for dot plots). Similarly, if the pattern search specified by a PSL query generates too many hits, the search progress indicator sits at the beginning for too long and it is probably appropriate to click the Stop Search button. It may be a good idea to try a novel PSL motif search on a single sequence instead of a database.
6. Using these default parameters, the HSMERlREP sequence is cut into windows of 30 nucleotides (s = 30), with an offset of o = 15 (so the windows also overlap by 15 nucleotides), and every time there is a sequence in GenBank that has e = 4 (error tolerance) or fewer differences from the window, it is reported as a hit between the query and the database.
7.
Using the buttons marked Zoom In, Zoom Out, and Full View, it is possible to alter the scale of the plot to examine different areas.
8. Several views of the sequence are available by clicking on the five buttons in the bottom-left part of the window. Views available are of the sequence annotations, the sequence itself, a feature table, a restriction map, and the sequence composition. If the file is an ABI sequencer-generated sample file, an additional button allows the chromatogram to be viewed.
9. In practice, simpler motifs are used most frequently, e.g., "GPGR" (a motif surrounded by double quotes; look for this tetrapeptide) or "+1(GPGRAF)" (six amino acids that might have 1 mismatch/insertion/deletion in any of the six amino acids).
10. The Zinc Finger pattern we loaded into the Pattern Search worksheet has the following PSL syntax (variable spacer motif): "C??{3-"?"}C??{8-"?"}[C|H]??{3-"?"}[C|H]"
Read: a cysteine followed by 2-5 amino acids, then another cysteine followed by 2-10 amino acids, then cysteine or histidine, followed by 2-5 residues, and finally cysteine or histidine.
11. The Leucine Zipper pattern we loaded into the Pattern Search worksheet is probably the most complex; it combines two constraints on the protein motif: "?"-> "L??????L??????L??????L" and {20"?">7 "@charged"} "?". You can read this: find a position (an amino acid) that is the end ("?"->) of both the string of four leucines each separated by any six amino acids ("L??????L??????L??????L"), and another bias in the composition window of 20 residues that has seven or more charged ("[R|K|E|D|H]") residues.
12. This type of PSL syntax describes a bias in the composition. You can read this: find any region 400 amino acids long that has 100 or more threonines and serines. Essentially, we are looking for proteins that have a region 400 residues long in which these two amino acids make up at least 25% of the composition.
13. It is possible to highlight regions of the database sequences that matched the pattern in different colors by double-clicking one or several sequences from the result list and using information in the resulting Feature Table.
14. All the windows shown in Figs. 1, 3, 4, and 6-10 can be saved as Macintosh files. They can also be opened from the File->Open/New menu. The content of all windows can be saved as a text file (File->Report), which can be E-mailed or parsed.
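The Smith-Waterman procedure described in Note 2 (fill the score matrix, take its highest cell as the score, repeat for every database sequence, sort in descending order) can be sketched in a few lines of Python. This is only an illustration: the match, mismatch, and gap values are assumptions chosen for readability, not the substitution matrices or gap penalties used by GeneAssist, and the function names are hypothetical.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Return the highest cell of the S-W matrix for two sequences.

    Only two rows of the L1 x L2 matrix are kept in memory, because the
    score (not the alignment itself) is all that is needed for ranking.
    The scoring values are illustrative assumptions.
    """
    previous = [0] * (len(target) + 1)
    best = 0
    for q in query:
        current = [0]
        for j, t in enumerate(target, start=1):
            diagonal = previous[j - 1] + (match if q == t else mismatch)
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


def rank_database(query, database):
    """Score the query against every entry and sort by descending score."""
    scored = [(smith_waterman_score(query, seq), name)
              for name, seq in database.items()]
    return sorted(scored, reverse=True)


# Arithmetic from Note 2: 100 residues against ~15 million residues.
cells = 100 * 15e6                 # 15 x 10^8 matrix cells
seconds_on_board = cells / 42e6    # 42 x 10^6 cell updates per second

if __name__ == "__main__":
    demo = {"entry_a": "TTACGTTT", "entry_b": "GGGGGGGG"}  # made-up sequences
    print(rank_database("ACGT", demo))
    print(round(seconds_on_board), "s")
```

The cell count grows with the product of the query length and the database size, which is why the search takes hours in software and about 36 s at the quoted hardware rate.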
References
1. Altschul, S., Gish, W., Miller, W., Myers, E., and Lipman, D. (1990) Basic local alignment search tool. J. Mol. Biol. 215, 403-410.
2. Pearson, W. R. (1990) Rapid and sensitive sequence comparison with FASTP and FASTA, in Methods in Enzymology, vol. 183 (Doolittle, R. F., ed.), Academic, New York, pp. 63-98.
3. Pearson, W. R. (1991) Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. Genomics 11, 635-650.
4. Smith, T. F. and Waterman, M. S. (1981) Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197.
5. Altschul, S. (1991) Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219, 555-565.
6. Henikoff, S. and Henikoff, J. G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89, 10,915-10,918.
7. Overington, J., Donnelly, D., Johnson, M. S., Sali, A., and Blundell, T. L. (1992) Environment-specific amino acid substitution tables: tertiary templates and prediction of protein folds. Protein Sci. 1, 216-226.
8. Dayhoff, M. O., Schwartz, R. M., and Orcutt, B. C. (1978) A model of evolutionary change in proteins, in Atlas of Protein Sequence and Structure 1978 (Dayhoff, M. O., ed.), National Biomedical Research Foundation, Washington, DC, pp. 345-352.
9. Benner, S., Cohen, M., and Gonnet, G. (1993) Empirical and structural models for insertions and deletions in the divergent evolution of proteins. J. Mol. Biol. 229, 1065-1082.
10. Wootton, J. C. and Federhen, S. (1993) Statistics of local complexity in amino acid sequences and sequence databases. Computers Chem. 17, 149-163.
15
GENEMAN of LASERGENE
Jonathan P. Clewley
1. Introduction
GENEMAN is the LASERGENE module for DNA, protein, Genetic MedLine (American National Library of Medicine, Bethesda, MD), and structural (Brookhaven) database searching (see Note 1). Newly derived DNA sequences and their protein ORFs should routinely be checked against the relevant database. It can also be helpful to search the DNA database with PCR primer sequences to determine whether any false priming is likely and to check that the primers will detect all the intended target sequences.
2. Materials
2.1. Macintosh Hardware
1. Any Macintosh computer with a floating point unit installed. We have successfully used a IIci, a IIvx, an SE/30, an LC475 with SoftwareFPU installed (John Neil & Associates, Cupertino, CA), and a PowerPC with LASERGENE.
2. Minimum memory requirements of 4 Mb RAM (8 Mb RAM or more is recommended).
3. Minimum free hard disk space of 25 Mb. More may be required because of the creation of temporary files.
4. Macintosh-compatible monitor (256-color monitor is recommended).
5. Macintosh-compatible printer (laser printers are recommended).
6. CD-ROM reader or server containing the databases.
2.2. Macintosh Software
1. Macintosh system software 6.0.1 or higher.
2. The LASERGENE application GENEMAN. Follow the instructions provided by DNASTAR, Inc. for installing the LASERGENE software.
Fig. 1. The GENEMAN Entries Window.
3. Methods
3.1. Opening GENEMAN
Start GENEMAN by selecting Biological Database Resource from the LASERGENE menu. The program needs to be able to access the databases either on a CD-ROM or a server. If a CD-ROM is being used, the DNA or protein data CD needs to be in the CD drive. If the CD is a different version from the program, a message will appear: The LaserGene CD doesn't match the current configuration. Install this one? Click OK to continue with that CD or cancel and insert the matching CD (see Note 2). Select Open Database... (command-N) from the File menu (see Note 1). Fig. 1 shows a typical database window.
3.2. Database Searching
1. To search the database for a text word, choose Text from the Search menu.
2. The resulting dialog box is the Word Search Term window. Enter the text string to be searched for in the text field. In this example we have entered Parvovirus. An asterisk (*) can be used as a wild-card, e.g., typing Parvo* would search for any string commencing Parvo. To the right of this is the subdivision pop-up menu that will read All Fields. This will search every field in each database entry for the entered text. Clicking on the menu will reveal a list of annotation fields; selecting one of these will restrict the search to that field only.
3. Click OK and the string is placed in the Query Builder window (Fig. 2).
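The behavior of the word search (a trailing asterisk as wild-card, and an optional restriction to one annotation field) can be illustrated with a small Python sketch. This is not GENEMAN's implementation: the entry identifiers, field names, and the `text_search` function are hypothetical, and Python's `fnmatch` wild-card rules are assumed to be close enough for illustration.

```python
import re
from fnmatch import fnmatchcase


def text_search(entries, term, field=None):
    """Return the IDs of entries with a word matching `term`.

    `entries` maps an entry ID to a dict of annotation fields.
    `term` may contain '*' as a wild-card; matching ignores case.
    If `field` is None, every field is searched ("All Fields").
    """
    pattern = term.lower()
    hits = []
    for entry_id, fields in entries.items():
        names = [field] if field else list(fields)
        text = " ".join(fields.get(name, "") for name in names).lower()
        words = re.findall(r"\w+", text)
        if any(fnmatchcase(word, pattern) for word in words):
            hits.append(entry_id)
    return hits


# Made-up entries for illustration only.
entries = {
    "E1": {"DESC": "Noncapsid protein NS-1",
           "TAXONOMY": "Parvoviridae; Parvovirus"},
    "E2": {"DESC": "Protamine 3A, rainbow trout",
           "TAXONOMY": "Salmonidae"},
}

if __name__ == "__main__":
    print(text_search(entries, "Parvo*"))                # all fields
    print(text_search(entries, "Parvo*", field="DESC"))  # one field only
```

Restricting the search to a single field returns fewer hits, which mirrors the effect of choosing an annotation field from the subdivision pop-up menu.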
Fig. 2. The Query Builder window showing the basic query: search for the text Parvovirus in All Fields.
3.2.1. Refining the Search
The search can be refined by adding further information to search for and combining the search strings using conditional links. Different search types can be added by dragging the Words..., Seq Similarity..., or Consensus... arrows onto the conditional links in the search pane. The words Or... and And... represent Boolean terms. Each search term of the query is preceded by a Diamond icon that represents the NOT Boolean operator. Clicking on the icon will make that term a negative query. For example:
1. Drag the Word... arrow onto the And... link. A new Word Search Term window will open.
2. Enter another text string for which to search. A dictionary will open if there are no entries matching the string.
3. Click OK. The new search string will be added below the first search term.
4. Click the Do Search Now button. An Entries Window will appear listing single line entries.
5. Click the Partly Expanded View palette button, second from the top on the left (Fig. 1). This will change the view of the entry to reveal more fields (Fig. 3).
6. Mark appropriate entries with a check mark. The mouse pointer turns into a check when it is between the palette tools and the entry details in the Entries Window. Clicking will select an entry.
7. Choose Export ✓ Sequence from the File menu. Checked entries will be downloaded from the database to EDITSEQ files.
8. Select Save (command-S) from the File menu to save the file of hits.
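The way And..., Or..., and the NOT diamond combine individual search terms can be pictured as set operations on the entries each term finds. The sketch below assumes a hypothetical inverted index (word to set of entry numbers) and a nested-tuple query format; neither is taken from GENEMAN.

```python
def evaluate(query, index, all_entries):
    """Evaluate a Boolean query against an inverted index.

    A query is one of:
      ("term", word)        entries containing the word
      ("not", query)        entries NOT matched by the sub-query
      ("and", left, right)  entries matched by both sub-queries
      ("or", left, right)   entries matched by either sub-query
    """
    op = query[0]
    if op == "term":
        return set(index.get(query[1].lower(), ()))
    if op == "not":
        return all_entries - evaluate(query[1], index, all_entries)
    left = evaluate(query[1], index, all_entries)
    right = evaluate(query[2], index, all_entries)
    if op == "and":
        return left & right
    if op == "or":
        return left | right
    raise ValueError(f"unknown operator: {op}")


# Hypothetical index: which entry numbers contain which word.
index = {"parvovirus": {1, 2, 3}, "capsid": {2, 3, 4}}
all_entries = {1, 2, 3, 4, 5}

if __name__ == "__main__":
    refined = ("and", ("term", "parvovirus"), ("not", ("term", "capsid")))
    print(sorted(evaluate(refined, index, all_entries)))
```

Adding a second term under And can only shrink the subset, whereas Or can only enlarge it, which is a useful rule of thumb when refining a query.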
If you search particular fields of the database (rather than the default All Fields in the Query Builder window), then only the searched fields and the default Entry field will be displayed in the Entries window. Other fields of the entries found can be displayed by double-clicking the Partly Expanded View palette button. The configuration window displays two columns of annotation fields contained in each entry. The left-hand column lists the Picked Fields, those fields to be displayed in the Partly Expanded View. The right-hand column lists the Remaining Fields, those fields not currently displayed. The display may be altered by dragging field names between the two columns. Clicking the close box will cause the changes to take effect. By default, the Expanded View palette tool displays all fields of the entries; it can be configured in the same way to display a subset. This effectively provides you with two customized views of the found entries.
Fig. 3. The partly expanded view of a search Entries Window. The window shows a Partly Expanded View of an entry found during a search for the word Parvo. Note the occurrences of the string are boxed.
3.3. Modifying and Browsing Subsets
3.3.1. Modifying a Search
A GENEMAN search is preserved in a hierarchical manner, allowing you to step back through the search and adjust it according to the results you achieve. The Step Back to... option from the Edit menu can be used to modify a query; the modified search can then be initiated, producing a new Entries Window subset. The search commands used to produce a subset are available from the Query Info palette button or the View menu.
3.3.2. Browsing a Search
If a search finds a large number of entries, it is possible to browse through the subset, looking for a specific word in the entries. This allows the flexibility of a search without having to define a new query.
1. Select Browse from the Search menu or click Browse? in the lower left corner of the Entries Window. This opens a Browse dialog.
2. Enter the term for which to browse.
3. Select whether or not to Apply ✓ from the pop-up menu. This allows a tick to be applied to or removed from entries that contain the search word.
4. Use the Inspect pop-up menu to select whether to browse all fields or to inspect only ticked entries.
Fig. 4. Protein Sequence Similarity Search window showing the default settings. The Set Ends button brings up the standard LASERGENE thumbwheel (see Chapter 5 on EDITSEQ).
Only Picked Fields are browsed; specific fields can be selected for browsing by configuring the view as described in Section 3.2.1. The Genetic MedLine abstracts associated with a selected or checked entry can be accessed from Abstracts From in the View menu.
3.4. Sequence Searches
The sequences should be in EDITSEQ file format (see Chapter 5).
1. Choose Sequence Similarity from the Search menu (see Note 3).
2. Click the Get Seq button in the Similarity Search Term window. This opens the Similarity Search Term dialog (Fig. 4).
a. The k-tuple parameter determines the word size, or how many consecutive identities are required in a match. For proteins this might be set at 2; for DNA between 4 and 6. Lower values give more sensitive but slower searches.
b. The window can be thought of as a filter that is slid along a sequence and within which a value is calculated. The default for proteins is 64 residues; for DNA 60 nucleotides. It should be set at 50-70% of the length of the sequence similarity stretch wanted (1).
c. The default gap penalty is one for DNA and four for proteins. A higher value will reduce the number of gaps in the alignment.
d. The threshold default is 20%, but this needs to be increased if, for example, searching a DNA database with an oligonucleotide primer sequence. Likewise, the window size could be decreased and the gap penalty increased for this type of search.
3. Click OK to accept the parameters; the Query Builder window (Fig. 2) is displayed.
4. Click the Do Search Now button. The hits are shown in an Entries Window (see Fig. 3).
5. A particular hit can be displayed by selecting it and choosing Alignment from the View menu (see Chapter 8). Double-click on the score for an entry to display the Alignment or Consensus match for that sequence (also accessible from the View menu). The similarity index is displayed in the Alignment view window. This index is obtained by dividing the number of exact matches by the total consensus sequence length and is not the same as the interest level score in the Entries window.
6. To run a consensus sequence search:
a. Select Consensus Sequence from the Search menu to open the Consensus Search Term window (Fig. 5).
b. Enter the consensus sequence to search for using the IUB codes and x as a wild-card. A sequence of up to 256 positions can be entered. PROSITE database format patterns are understood by GENEMAN (see Note 4).
c. Click OK. GENEMAN returns to the Query Builder.
d. Click Do Search Now. GENEMAN opens a new Entries window listing any hits.
Fig. 5. Consensus Search Term window.
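To make the k-tuple parameter from step 2a more concrete, the following Python sketch counts the k-tuples (words of k consecutive identical residues) shared by a query and a database sequence, grouped by diagonal. This is the generic word-lookup idea behind FASTA-like methods, shown under the assumption that GENEMAN does something broadly similar; the function name and the example sequences are hypothetical.

```python
from collections import defaultdict


def ktuple_diagonal_hits(query, target, k=2):
    """Count shared k-tuples per diagonal.

    The diagonal is (position in target) - (position in query); many
    hits on one diagonal suggest a region of similarity without gaps.
    """
    # Index every k-tuple of the query by its start position(s).
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)

    # Scan the target and record each matching word on its diagonal.
    diagonals = defaultdict(int)
    for j in range(len(target) - k + 1):
        for i in positions.get(target[j:j + k], ()):
            diagonals[j - i] += 1
    return dict(diagonals)


if __name__ == "__main__":
    # With k=4 only the true offset survives; with k=2 more (chance)
    # diagonals appear, at the cost of more work.
    print(ktuple_diagonal_hits("ACGTAC", "TTACGTAC", k=4))
    print(ktuple_diagonal_hits("ACGTAC", "TTACGTAC", k=2))
```

Smaller k values find more word hits, including spurious ones, which is consistent with the remark above that lower values give more sensitive but slower searches.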
The subset of hits found after a search has Score, Sort, and Distribution functions available from the palette. Each sequence used to search a database in a combined query has its own interest level score. The scores can be sorted by clicking the Sort button and then clicking on a score column. These score columns can be moved by option-dragging. Double-clicking the score displays an Alignment match window. The Distribution button on the palette of the Entries window generates a graph of the number of entries vs the score (Fig. 6). Click-dragging on this graph can be used to define another subset that can then be created with the Make Subset button.
Fig. 6. The Frequency Distribution window.
4. Notes
1. The March 1996 release consists of three CDs. These are a protein CD with Swiss and PIR and translated release 93, Prosite release 12; a DNA CD with EMBL with GenBank release 92; and a CD with GenBank EST 93. A crystallographic protein database (Brookhaven) CD is also available. The LASERGENE module XRAY is needed to view three-dimensional structures from this database. An AIDS/HIV database used to be available (until November 1993). This can now be accessed as a subset (see Note 2). Presumably the number of CDs will have to increase as the size of the databases increases. Alternatively, access will need to be online to central servers. Descriptions and information about databases are published as special issues of Nucleic Acids Research (for example, vol. 22(17), September 1994). A list of database URLs for the WWW can be found in Peitsch et al. (2).
2. The database or parts of it can be copied to a hard disk for faster access by using the Speed Up... submenu from the File menu. The most useful option is to move the Always Used indices and related files. A predefined subset of entries may be copied by selecting Data Used by Subset... from the Speed Up... submenu.
3. The algorithm used by GENEMAN is similar to FASTA (3,4) and calculates a local similarity score.
4. Ambiguities at a specific position are denoted in square brackets. Thus [RNA] specifies Arg, Asn, or Ala; [AT] or W specifies either an A or a T base. Exclusions are denoted in curly brackets. Thus {VAL} means not Val, Ala, or Leu; {T} or V means any base but T (or U). A mandatory or invariant residue is indicated by forward slashes, e.g., /W/ indicates that Trp must be present in the consensus. Repeats can be specified by a number in brackets after the residue or base. Thus K(9) indicates nine lysines; X(3,5) indicates XXX or XXXXX. To search for a pattern at the N-terminus or 5' end of a sequence, put "<" at the beginning of the consensus; put ">" at its end to indicate that the pattern is C-terminal or 3'. The threshold can be set below 100% to allow mismatching, e.g., a 33% threshold means only 1 of 3 residues has to fit the pattern.
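The consensus notation summarized in Note 4 maps quite directly onto regular expressions, which can be a convenient way to check a pattern before running it against a database. The Python sketch below is an illustration only, with several stated assumptions: the function name is hypothetical; X(3,5) is treated as a run of three to five residues, following the usual PROSITE convention, which may differ from GENEMAN's exact behavior; the /W/ mandatory marker is simply treated as a literal residue; IUB nucleotide codes such as W are not expanded; and the threshold (mismatch tolerance) is not modeled.

```python
import re


def consensus_to_regex(pattern):
    """Translate a PROSITE-like consensus string into a Python regex."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "<" and i == 0:
            out.append("^")                      # anchor at N-terminus / 5' end
        elif ch == ">" and i == len(pattern) - 1:
            out.append("$")                      # anchor at C-terminus / 3' end
        elif ch == "[":                          # ambiguity: any listed residue
            j = pattern.index("]", i)
            out.append("[" + pattern[i + 1:j] + "]")
            i = j
        elif ch == "{":                          # exclusion: none of the listed
            j = pattern.index("}", i)
            out.append("[^" + pattern[i + 1:j] + "]")
            i = j
        elif ch == "(":                          # repeat count or range
            j = pattern.index(")", i)
            out.append("{" + pattern[i + 1:j] + "}")
            i = j
        elif ch == "/":                          # mandatory marker: ignored here
            pass
        elif ch in "xX":                         # wild-card position
            out.append(".")
        else:
            out.append(re.escape(ch))
        i += 1
    return "".join(out)


if __name__ == "__main__":
    regex = consensus_to_regex("N{P}[ST]{P}")
    print(regex)
    print(re.search(regex, "AANGSA"))
```

The example pattern N{P}[ST]{P} is the N-glycosylation consensus; writing out the translated regex makes it easy to see which positions are fixed and which are constrained only by exclusion.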
References
1. Gribskov, M. and Devereux, J. (eds.) (1991) Sequence Analysis Primer, Stockton, New York.
2. Peitsch, M. C., Wells, T. N. C., Stampf, D. R., and Sussman, J. L. (1995) The Swiss-3D Image collection and PDB-Browser on the World-Wide Web. Trends Biochem. Sci. 20, 82-84.
3. Pearson, W. R. (1990) Rapid and sensitive sequence comparison with FASTP and FASTA, in Methods in Enzymology, vol. 183 (Doolittle, R. F., ed.), Academic, San Diego, CA, pp. 63-98.
4. Pearson, W. R. and Lipman, D. J. (1988) Improved tools for biological sequence comparisons. Proc. Natl. Acad. Sci. USA 85, 2444-2448.
16
GeneJockeyII: Database Searching
Phil Taylor
1. Introduction
GeneJockey supports a number of databases and allows you to construct databases of your own. You can search a DNA or protein sequence using a collection of consensus sequences contained in a file, and a file of transcription factors derived from David Ghosh's Transcription Factor Database (TFD) is supplied with the program. You can search a protein sequence against Amos Bairoch's PROSITE database, which contains over 1000 patterns characteristic of sites of interest in proteins. You can also make use of the GenBank database of DNA sequences, opening sequences by accession number or by locus, or searching by author name, by keyword, or for a match with an unknown sequence.
2. Materials
1. Hardware: GeneJockey requires a Macintosh with Color QuickDraw in ROM (this excludes the Macintosh Plus [and older machines], the SE, the PowerBook 100, and the Macintosh Portable). The program also requires System 7.0 or later and at least 2 Mb of available memory. Multiple alignment windows are much easier to view in color. Although this is not essential, it is recommended that you use a system capable of displaying 256 colors (or better). To make use of the GenBank CD-ROM database you need access to three CD-ROM drives. These can be accessed via a network and need not all be connected to the same computer.
2. Software: For the operations described in this chapter, you need only the GeneJockey program itself. For later chapters you will need some additional files supplied with the program, and you would normally install GeneJockey on your hard disk by simply copying all the files supplied into a single folder. When running on a Power Macintosh the GeneJockey Helper file should be present in the same folder. The native-code resources in this file run about 10 times faster than the code in the main program, and since multiple alignment is a time-consuming process, the extra speed is very helpful. GeneJockeyII is licensed for use only on a single-user basis, but is not copy-protected. Installation of the PROSITE and GenBank databases requires separate installer programs, PROSITEInstall and GenBankInstall, respectively. These are supplied on the same disk as the main program.
3. Methods
3.1. Creating a Consensus Sequence Database
Consensus sequences can contain degenerate codes, fixed and variable repeats, and may use nested parentheses if necessary. They should contain no spaces, since the program will interpret the first space as a terminator character. Anything typed between the first space and the return character at the end of the line will be interpreted as comment, and will be listed in the results of a search if that consensus sequence finds a match in the target sequence. Each line can be a maximum of 255 characters in length. As an example, a file of Transcription Factor binding sites is included with the program. You should open this file and look at it for an example of a consensus sequence file. This data is derived from David Ghosh's Transcription Factor Database (TFD). Consensus sequence files can contain either DNA or protein sequences.
1. Choose the New > Text Window command from the File menu to open a new window.
2. Type each consensus sequence on a separate line, starting at the left margin.
3. Follow each consensus sequence with a space, then enter any identification or comments that you wish to be listed when the program finds a match with this consensus.
4. Terminate each line (including the last) with a return.
5. Choose Save As... from the File menu and save the file under a suitable name.
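The file layout described above (one consensus per line, starting at the left margin, the first space ending the pattern, the rest of the line being comment, and at most 255 characters per line) is simple enough to check with a short script before loading the file into GeneJockey. The Python sketch below is an assumption-laden illustration: the function name and the sample lines are made up, and it is not the program's own parser.

```python
def parse_consensus_file(text, max_len=255):
    """Split a consensus file into (pattern, comment) pairs.

    Lines are separated by any newline convention (classic Macintosh
    files end lines with a carriage return). Blank lines are skipped.
    """
    entries = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        if len(line) > max_len:
            raise ValueError(f"line {number} exceeds {max_len} characters")
        if line[0] == " ":
            raise ValueError(f"line {number} does not start at the left margin")
        # The first space terminates the pattern; the rest is comment.
        pattern, _, comment = line.partition(" ")
        entries.append((pattern, comment.strip()))
    return entries


if __name__ == "__main__":
    # Made-up example lines, not taken from the TFD file.
    sample = "GGGCGG Example-box demo reference\rTATAWAW Other-box\r"
    for pattern, comment in parse_consensus_file(sample):
        print(pattern, "|", comment)
```

Rejecting lines that begin with a space catches the most likely typing mistake, since such a line would otherwise be read as an empty pattern followed by a comment.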
3.2. Searching a Consensus Sequence Database
1. Open a suitable DNA sequence file (or click to bring it to the front if it is already open).
2. Choose Consensus Search > Search a File... from the Find menu.
3. Select a file of consensus sequences in the subsequent dialog. (The example in Fig. 1 uses the TFD database supplied.)
The program will open a new text window to display the results. Each hit is reported on a separate line, listing first the actual target sequence that matches the consensus, then the numerical position of the match, followed by a + or - sign to indicate which strand of DNA the match was found on, then the comment text from the file. In the case of the Transcription Factors file, the comment text consists first of the name of the transcription factor, then the reference. The consensus search from a file always searches both strands of DNA.
Fig. 1. Transcription Factor Database search results.
3.3. Installing the PROSITE Database
PROSITE is a small database of protein patterns constructed and maintained by Amos Bairoch. It is very quick to search and remarkably useful, in that it picks out features in protein sequences that are very hard to find by other means. The quality of information supplied in the documentation is extremely high, and the documentation file alone for PROSITE must be considered a major work of scholarship. PROSITE is widely distributed on the Internet and can be obtained from any of the MolBio FTP sites. You need only download the PROSITE.dat and PROSITE.doc files, although the remaining files are well worth reading. PROSITE is also distributed on the NCBI data repository CD-ROM and on the EMBL-Swiss-Prot database CD-ROM. An example entry in the PROSITE documentation file is shown in Fig. 2.
1. Create a folder named PROSITE on your startup disk (if you are using multiple hard disks, the database must be located on the same disk as the active system).
2. Place in this folder the two PROSITE files and the program named PROSITEInstall supplied with GeneJockeyII.
3. Double-click PROSITEInstall to start the installation process, which takes only a few minutes. The installer creates a machine-readable index to PROSITE, placing it in the System preferences folder.
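The hit lines reported by the consensus search (matched sequence, position, a + or - sign for the strand, then the comment) can be imitated with a small Python sketch that expands IUB degenerate codes and scans both strands. Everything here is an assumption for illustration: the function names are hypothetical, repeats and nested parentheses are not handled, and the way a minus-strand position is numbered (here, the top-strand coordinate of the leftmost base of the match) may differ from what GeneJockey reports.

```python
import re

# Standard IUB nucleotide ambiguity codes.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def search_both_strands(target, consensus, comment=""):
    """Return (matched sequence, 1-based position, strand, comment) hits."""
    # A lookahead lets overlapping matches be reported as well.
    regex = re.compile("(?=(" + "".join(IUB[c] for c in consensus) + "))")
    hits = []
    for m in regex.finditer(target):
        hits.append((m.group(1), m.start() + 1, "+", comment))
    for m in regex.finditer(reverse_complement(target)):
        matched = m.group(1)
        position = len(target) - m.start() - len(matched) + 1
        hits.append((matched, position, "-", comment))
    return hits


if __name__ == "__main__":
    for hit in search_both_strands("AAGGGCGGTTCCGCCC", "GGGCGG", "Example-box"):
        print(*hit)
```

Searching the reverse complement with the same expression avoids having to reverse-complement every consensus in the file, which is one simple way to cover both strands.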
3.4. Using PROSITE 1. Open a suitable protein sequence file (or click to bring it to the front if it is already open). 2. Choose Search PROSITE from the Find menu. The program will open a new text window in which to report hits (Fig. 3).
Fig. 2. An entry in the PROSITE documentation file (the N-glycosylation site entry, consensus pattern N-{P}-[ST]-{P}).
Fig. 3. PROSITE database search results. For each pattern in the database that finds a match in your sequence, the program lists first the actual sequence found that matches the pattern, then the numerical position of the match. If more than one match is found for this pattern there may be more than one line here. Finally, the program lists the name of the pattern and gives a file reference to the PROSITE documentation file, which you may open in the usual way (drag across it, including both f symbols, hold down the Command key, and type an equal sign).
3.5. Searching Comments
GeneJockey can treat any collection of DNA or protein sequence files (in GeneJockey or GeneJockey format) as a database. You can search the comments part of the files for a match with a single keyword or phrase, and in this case text windows saved in GeneJockey format will also be included in the search. You can also search the files with a sequence; in this case only sequence files of the same type as the target sequence (i.e., DNA or protein) will be searched. The scope of the search is determined by where you start the search. You do this by selecting a file; the program then searches all files contained in the same folder, and in any folders contained within it to the lowest level. These searches are modal: while the search is in progress you cannot use your computer for anything else. Before starting the search the program gives you a time estimate, permitting you to cancel the operation at this stage if you do not wish to tie up your computer for that period of time. Time estimates are approximate but adaptive, in that their accuracy increases with use.
1. Choose Modal Searches > Search Comments... from the Find menu.
2. In the subsequent dialog, enter a single keyword or phrase. Case is irrelevant here. As an example, enter the word receptor, and we will search the folder of miscellaneous receptor sequences on the GeneJockey demo disk.
3. You will be asked whether you want to save the results of the search to a file. Click on the No button.
4. Insert the GeneJockey demo files disk and locate the folder named Misc. Receptors in the GeneJockey files folder.
5. Open this folder, noting that it contains many subfolders, each of which contains a selection of sequence files with sequences of receptors for a particular ligand.
6. Locate the file named Search Misc. Receptors and open it to start the search (see Note 2).
7. After a short delay, during which the program counts the number of files to be searched and notes their sizes, you will be presented with a time estimate. Click on the Go For It! button to start the search. The program opens a new text window to display the results of the search (Fig. 4).
Fig. 4. Result of a modal keyword search for "receptor." The name and number of each file is listed at the top of this window as it is searched, and each hit is listed on one line in the body of the window. At the start of the line is a file reference that permits you to open the file directly from the search window. This is followed by the context, where the line of text that includes the keyword is listed. To save space, the keyword itself is abbreviated to its first letter followed by a hyphen, so in this case it appears as "r-."
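The scope rule for the comment search (the folder containing the chosen file, plus every folder nested inside it) and the way hits are displayed (the matching line, with the keyword abbreviated to its first letter and a hyphen) can be sketched in Python as follows. This is a loose illustration rather than a description of GeneJockey's internals: the function name is hypothetical, and every line of every file is treated as comment text, whereas the real program searches only the comments part of its own file formats.

```python
import os
import re


def search_comments(root, keyword):
    """Search all files under `root` (recursively) for `keyword`.

    Returns a list of (file path, context) pairs, where the context is
    the matching line with the keyword abbreviated, e.g. "r-" for
    "receptor". Matching ignores case.
    """
    abbreviation = keyword[0] + "-"
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    hits = []
    for folder, _subfolders, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(folder, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for line in handle:
                    if pattern.search(line):
                        context = pattern.sub(abbreviation, line.strip())
                        hits.append((path, context))
    return hits


if __name__ == "__main__":
    import sys
    # Usage: python search_comments.py <folder> <keyword>
    if len(sys.argv) == 3:
        for path, context in search_comments(sys.argv[1], sys.argv[2]):
            print(path, context, sep="\t")
```

Because the walk starts at a folder rather than at a single file, moving the starting point up or down the folder hierarchy is the only control needed to widen or narrow the search.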
Fig. 5. Result of modal sequence search.
3.6. Searching Sequences
A modal search can also be used to compare an unknown sequence (DNA or protein) with a collection of sequence files. The process is very similar to the keyword search described in Section 3.5, except that the sequence contained in the front window is used as the target and you may specify two parameters that determine whether any given match will be reported. These are the minimum match length, i.e., the shortest length of contiguous match between the two sequences that interests you, and the maximum probability, p, in which p is the probability of finding that length of match between two random sequences of the same lengths as the two being compared. These two criteria operate semi-independently, so that when deciding whether to list any particular hit, the program will use whichever of the two criteria is more stringent. As an example, we will use the sequence named Query from the GeneJockey files folder on the demo disk.
1. Choose Open... from the File menu, locate the Query sequence, and open it.
2. Choose Modal Searches > Search Sequences... from the Find menu.
3. In the parameter dialog, accept the default settings of 0.05 for the maximum probability and 12 for the minimum match length; click on OK.
4. You will be asked whether you want to save the results of the search to a file. Click on the No button.
5. Insert the GeneJockey demo files disk and locate and open the folder named Misc. Receptors in the GeneJockey files folder.
6. Locate the file named Search Misc. Receptors and open it to start the search.
7. After a short delay, during which the program counts the number of files to be searched and notes their sizes, you will be presented with a time estimate. Click on the Go For It! button to start the search. The program opens a new text window to display the results of the search (Fig. 5, see Notes 3 and 4).
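The two reporting criteria described in this section can be sketched in a few lines of Python. This is only an illustration, not GeneJockey's own code: the function names are hypothetical, and the probability estimate is an assumed crude upper bound (number of possible start-position pairs times the chance that one pair matches over the whole run), since the chapter does not give the formula the program uses. The defaults of 12 and 0.05 are the ones quoted in the steps above.

```python
def longest_contiguous_match(a: str, b: str) -> int:
    """Length of the longest run of identical characters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous pair of positions.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def chance_probability(match_len: int, len_a: int, len_b: int, alphabet: int = 4) -> float:
    """ASSUMED estimate (not GeneJockey's formula): upper bound on the chance
    of a run of match_len between two random sequences of these lengths."""
    pairs = max(len_a - match_len + 1, 0) * max(len_b - match_len + 1, 0)
    return min(1.0, pairs * float(alphabet) ** -match_len)


def is_hit(query: str, subject: str, min_length: int = 12, max_p: float = 0.05) -> bool:
    """Report a hit only if BOTH criteria are met, i.e. the stricter one decides."""
    n = longest_contiguous_match(query, subject)
    p = chance_probability(n, len(query), len(subject))
    return n >= min_length and p <= max_p
```

For protein sequences the same sketch would be called with `alphabet=20` in `chance_probability`; a real implementation would also need to allow for unequal residue frequencies.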
3.7. Installing GenBank
GeneJockey contains a whole series of commands that operate on the GenBank flat-file CD-ROM database. At the time of writing, GenBank comes on three CDs and to make use of it you need to have access to three CD-ROM drives. These do not have to be connected directly to your computer; they can be located on a central fileserver accessed via a network, or on three different networked computers. In the future, more drives will be required, since GenBank is currently doubling in size every 19 mo. Before you can use the GenBank CD-ROM, you must install it. This process takes several hours, although since it proceeds unattended you can do it overnight. Each new release of GenBank must be installed as new. If you have not installed a release of GenBank previously, you must first make an alias disk, a floppy disk containing aliases to all the GenBank files.
1. Mount all the GenBank disks on your desktop. If you are using three directly connected CD-ROM drives, simply insert the three CDs. If you intend to access GenBank over a network, use the Chooser to make the connection. Either way, you should have three CD images on your desktop before starting.
2. Format a floppy disk, naming it GENBANK.
3. Double-click the GENBANK1 disk to open it.
4. Choose Select All from the Finder's Edit menu.
5. Choose Make Alias from the File menu. You will get a message telling you that the disk is locked, and asking if you wish to make aliases on the desktop. Click OK.
6. Select all the alias files on the desktop by drawing a box around them with the cursor, or by shift-clicking, then drag them all to the GENBANK floppy.
7. Once the files have been copied, select them again and drag them to the trash.
8. Repeat steps 3-7 twice more using the other two GenBank CDs.
9. If you have not already done so, copy the installer program named GenBankInstall from the GeneJockey disk to your hard disk. Double-click it to start.
10. In the dialog that follows (Fig. 6) deselect those divisions of GenBank that you do not wish to install, or simply click OK to install everything.
Fig. 6. GenBank installation dialog. The dialog asks which divisions of GenBank you wish to install and lists each division (Primates, Rodents, Other Mammals, Other Vertebrates, Invertebrates, Plants, Bacteria, RNA, Viruses, Phage, Synthetic, Un-annotated, E.S.T., and Patents) with a checkbox and its size in Mb.
The installer will run, putting up a progress indicator to show how far it has gone, and eventually displaying a message indicating that the installation is complete. The installation is an indexing process. GenBank sequence and annotations data are contained in 15 huge files, each of which contains many thousands of sequences. The installer program locates the beginning and end of each sequence record, writing a set of index files that gives these locations into the system preferences folder (see Note 5).
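The indexing idea described above can be sketched as follows. This is a minimal illustration, not the GenBankInstall program: it relies only on the fact that each record in a GenBank flat file begins with a LOCUS line and ends with a line starting with //, and the function names and the tiny two-record demo file (DEMO1, DEMO2) are made up for the example.

```python
import io


def build_index(handle):
    """Map each LOCUS name to the (start, end) byte offsets of its record.

    `handle` must be opened in binary mode so that offsets are in bytes.
    """
    index = {}
    locus = None
    start = 0
    offset = 0
    for line in handle:
        if line.startswith(b"LOCUS"):
            locus = line.split()[1].decode("ascii")
            start = offset
        offset += len(line)
        if line.startswith(b"//") and locus is not None:
            index[locus] = (start, offset)   # end offset is exclusive
            locus = None
    return index


def read_record(handle, index, locus):
    """Jump straight to one record without scanning the whole file."""
    start, end = index[locus]
    handle.seek(start)
    return handle.read(end - start).decode("ascii")


# Hypothetical miniature flat file, for illustration only.
DEMO = (b"File header text\n\n"
        b"LOCUS       DEMO1  12 bp\nORIGIN\n        1 acgtacgtacgt\n//\n"
        b"LOCUS       DEMO2   8 bp\nORIGIN\n        1 ttttaaaa\n//\n")

demo_index = build_index(io.BytesIO(DEMO))
demo_record = read_record(io.BytesIO(DEMO), demo_index, "DEMO2")
```

Because the index stores only offsets, it is small compared with the database and can be reused on any machine that sees the same files, which is consistent with the advice in Note 5.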
3.8. Opening GenBank Sequence Files
In order to make use of GenBank from within GeneJockey you must first have installed the database as in the previous section. Mount the three CDs on your desktop and insert the GENBANK alias disk in your floppy drive. Operations controlled by the GenBank CD-ROM submenu attached to the Find menu are arranged in increasing order of time to complete the operation. Opening sequence files is the fastest operation, and searching sequences is the slowest. There are three ways you can open a sequence file from the CD-ROM. You can identify the sequence either by its Locus or by its accession number, or (most conveniently) by opening a file reference listed in the results of a search. The Locus is a string of up to 10 letters and numbers that identifies a particular sequence in the current release, but that may not be the same in other releases. The accession number is a unique ID that always identifies a sequence unambiguously, and does not change between releases. The accession number is the recommended way of referring to GenBank sequences in the literature, and if you wish to locate a sequence that you have come across in a paper, the accession number is normally the way to find it.
1. Choose GenBank CD-ROM > Open... from the Find menu.
2. Type in either an accession number or a Locus in the subsequent dialog.
3. Dismiss the dialog with the OK button and the sequence will open (see Note 6).
3.9. Searching the GenBank Accession Number Index
You can also search the accession number index of GenBank. In most cases this will give you only the Locus of the sequence, which you do not really need since the program will open sequences directly using the accession number. Some sequences, however, have secondary accession numbers, generated when revised versions of the sequence are deposited, and the accession number search will list those as well.
1. Choose GenBank CD-ROM > Search Accession Numbers... from the Find menu.
2. Enter an accession number into the dialog and click OK.
3.10. Searching the GenBank Author Index
1. Choose GenBank CD-ROM > Search Authors... from the Find menu.
2. Enter the name of an author in the dialog (see Note 7).
3. Click OK to start the search.
Each hit is reported as the author name followed by a list of file references referring to the sequences in which the author is a contributor. File references here are slightly different from normal, in that the file name between the f characters is followed by the Locus, the two being separated by a colon. Each division of GenBank is stored in a single huge file, which may contain thousands of sequences, so the file reference needs to identify not only the file, but also the sequence within that file. The author index search is remarkably fast; it takes only a few seconds to locate an author, despite the fact that the author index is nearly 17 Mb long.
3.11. Searching the GenBank Keyword Index
1. Choose GenBank CD-ROM > Search Keywords... from the Find menu.
2. Enter a single keyword or a keyword expression and click OK.
The keyword index, unfortunately, does not index the individual words used in the database. Instead it lists keyword phrases as supplied by the authors. There are many pitfalls here for the unwary. For example, if you search for Zinc Finger Protein you will find some, but by no means all, of the sequences you are seeking, because some authors have not used the word Protein, and some have used the hyphenated form Zinc-Finger. Some authors (and I regret to say that includes me) have omitted from their keyword lists words that occur in the title. Very few have really thought about what keywords would make it easy for their work to be found. In order to get around some of these problems, GeneJockey allows you to combine keywords logically. The logical operators used are & (and), | (or), and ≠ (not). (The not symbol is obtained using option-equals.) So, to find Zinc Finger reliably, one should use Zinc&Finger as the search expression, since this will find all the keyword phrases that include both words, even if they are hyphenated or come in the wrong order. You should note that individual words may be truncated if they have variable endings, and that since they can be truncated from both ends, you may find that short words produce some unexpected matches. For example, the word actin will find matches like "prolactin," "calpactin," and even "long-acting." You could prevent this by using spaces as part of your search expression, so that " actin " (with a space on either side) will find only actin itself. Be cautious here, however, since if the keyword concerned comes at the beginning of a line, it will not be preceded by a space; at the end of a line it will not be followed by a space, and in either case the search expression " actin " would miss it. On the whole, it is better not to make your query too specific, because it is easy enough to read through the list of hits and pick out the ones that you want.
If you use a search expression, such as androgen&receptor, the program will search separately for the two keywords, listing first the number of hits for each, then the number of hits for the whole expression. (Note that a hit here means one keyword phrase; since each keyword phrase may be quoted by many sequence records, the number of sequences may be much larger.)
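The truncation behavior described above amounts to a case-insensitive substring test, which a short Python sketch makes concrete. The keyword phrases below are made up for illustration, the function name is hypothetical, and the final "padded" variant is a workaround suggested here for the line-start and line-end problem, not something the chapter says GeneJockey does.

```python
def phrase_matches(word: str, phrase: str) -> bool:
    """Case-insensitive substring test, so a word can match inside a longer one."""
    return word.lower() in phrase.lower()


# Made-up keyword phrases, for illustration only.
PHRASES = ["actin", "beta actin", "prolactin", "calpactin I", "long-acting insulin"]

# The bare word matches everything that merely contains the letters "actin".
loose = [p for p in PHRASES if phrase_matches("actin", p)]

# Adding spaces misses phrases where the word starts or ends the line.
strict = [p for p in PHRASES if phrase_matches(" actin ", p)]

# Assumed workaround: pad each phrase with spaces before testing it.
padded = [p for p in PHRASES if phrase_matches(" actin ", f" {p} ")]
```

Here `loose` contains all five phrases, `strict` is empty, and `padded` keeps only "actin" and "beta actin".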
Fig. 7. Result of GenBank keyword search. The window reports the number of hits for ANDROGEN, for RECEPTOR, and for ANDROGEN&RECEPTOR (18, 862, and 4 hits, respectively), followed by a list of file references to the matching sequences.
The actual result of this search is shown in Fig. 7. In release 77 of GenBank there were four keyword phrases listed that included the words "androgen" and "receptor," namely "androgen receptor" (35 sequences), "androgen receptor DNA-binding domain" (two sequences), "androgen receptor" (four sequences), and "DNA-binding domain of androgen receptor" (one sequence). For more complicated logical expressions, the program will interpret the logical operators from left to right, but you can change this by using parentheses. For example, if you wanted to search for the GnRH receptor, you might want to include LHRH as a synonym for GnRH, but the expression "receptor&GnRH|LHRH" would be interpreted as "find those keyword phrases that include both the words receptor and GnRH, or that include the word LHRH." This is not actually what we want here, so we should use parentheses as follows to direct the program to interpret the latter two words first: receptor&(GnRH|LHRH). If we wanted to be really sure, we would have to include a third synonym for GnRH, which is to spell the abbreviation out in full, "gonadotrophin-releasing hormone." Here we have another problem, since the word "gonadotrophin" (British) can also be spelled "gonadotropin" (American). In order to deal with this ambiguity, we can truncate the word to "gonadotrop." To cover these possibilities then, we might make our search expression:
receptor&(GnRH|LHRH|(gonadotrop&releasing))
The combination of gonadotrop&releasing is probably specific enough, without including the word "hormone." This still does not find all the instances of GnRH receptors in the database, mainly because not all of the authors who have submitted these sequences have identified them as such in their keywords. There are two further text searching possibilities: one is to search the sequence definitions, and the other is to search the entire annotations text (free text search).
Fig. 8. GenBank background search set-up dialog. The dialog asks which divisions of GenBank you wish to search and lists each division with a checkbox and its size in Mb.
3.12. Searching the GenBank Definitions Index
This command is used in exactly the same way as the keyword search. Instead of searching the author-supplied keywords it searches the sequence definitions (titles). These contain the most important keywords. This is a linear search, and therefore slower than the keyword search, but the results are often more useful.
1. Choose GenBank CD-ROM > Search Definitions... from the Find menu.
2. Enter a single keyword or a keyword expression and click OK.
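The keyword expressions used in Sections 3.11 and 3.12 can be sketched as a small evaluator. This is a hypothetical illustration of the rules stated in the text (operators applied strictly from left to right, parentheses to regroup, words matched as case-insensitive substrings), not GeneJockey's parser. The ≠ (not) operator is left out because the chapter does not say how it combines with the others, and spaces are deliberately not stripped, since the text treats them as part of the search word.

```python
import re


def evaluate(expression: str, phrase: str) -> bool:
    """Evaluate a keyword expression against one keyword phrase.

    Supports & (and), | (or), and parentheses; operators have no precedence
    and are applied left to right, as the chapter describes.
    """
    tokens = [t for t in re.split(r"([&|()])", expression) if t]
    text = phrase.lower()
    pos = 0

    def operand() -> bool:
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = chain()
            pos += 1                      # skip the closing ")"
            return value
        return token.lower() in text      # substring match, so words may be truncated

    def chain() -> bool:
        nonlocal pos
        value = operand()
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            op = tokens[pos]
            pos += 1
            right = operand()
            value = (value and right) if op == "&" else (value or right)
        return value

    return chain()


# Without parentheses, a phrase that only mentions LHRH still matches.
unparenthesized = evaluate("receptor&GnRH|LHRH", "LHRH precursor")
parenthesized = evaluate("receptor&(GnRH|LHRH)", "LHRH precursor")
```

In this sketch `unparenthesized` is true and `parenthesized` is false, which mirrors the GnRH/LHRH example in Section 3.11.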
3.13. Free Text Search of GenBank
Here you can use only a single keyword or phrase, but it will be compared with every word in the database. This is clearly going to take a long time to run (see Note 8), so the search has been configured to run as a background process, permitting you to do other things while the search runs.
1. Choose GenBank CD-ROM > Free Text Search... from the Find menu.
2. Type in a suitable keyword.
3. Click OK.
4. In the dialog that follows (Fig. 8), select which divisions of GenBank you wish to search.
5. Click OK to start the search (see Note 9).
6. Set up a second search to run simultaneously with the first one by repeating steps 1-5.
7. Move the two windows until you can see the header of both, and you will see that they are both still running, although much more slowly than before (see Notes 10 and 11).
8. Click in the close box of one of the search windows (the program will not let you close a window on a running search). This produces the dialog in Fig. 9.
9. Click on Suspend and the other search will return to full speed operation, whereas the search in the front window stops (see Note 12).
Fig. 9. Suspend/abort dialog for background searches.
3.14. Searching Sequences
The final search on this menu enables you to search GenBank with a sequence. Like the Modal Searches > Search Sequences command, it opens each sequence in turn and performs a simple alignment, reporting a hit if the best contiguous segment of match exceeds the minimum match length that you specify, and the probability of that match is lower than the probability level specified. To make sure that we find a match, we will open a sequence from the database and search with that.
1. Choose GenBank CD-ROM > Open... from the Find menu.
2. Enter the accession number V01482. (Note that the second character is a zero, not the letter O.)
3. Select about 100 bases in the middle of the sequence.
4. Choose Copy from the Edit menu.
5. Choose New > Nucleotide Sequence from the File menu.
6. Choose Paste from the Edit menu. We will search with this short segment.
7. Choose GenBank CD-ROM > Search Sequences... from the Find menu.
8. Click on OK to accept the default parameters of 12 for match length and 0.05 for probability.
9. In the large dialog that follows, uncheck all the boxes except for the division that contains the bacteria.
10. Click on OK.
The search will run, listing the matches it finds as file references, and quoting the match length and probability level found for each match. Once again, this is a background search; all the comments made above about the Free Text Search also apply here. You can run a second background search simultaneously or consecutively. You can type in the search window, save it, or open any file references it has found without waiting for the search to complete. As before, simultaneous searches will be slower than consecutive ones, and the search will run even more slowly if you leave GeneJockey and use another program.
3.15. Searching Comments on GenBank
You can also search the GenBank CD-ROMs using the commands in the Modal Searches submenu. These searches take your computer over entirely, and once started will not permit you to use the menus or issue any commands other than Command-period to abort the search. They are much faster than the background searches, but much less convenient to use. You can search only one division of GenBank at a time. Try the two searches we have just tested using this method.
1. Select Modal Searches > Search Comments..., and enter a suitable keyword.
2. You will be asked whether you want to save the results of the search to a file. Click on the No button.
3. In the file opening dialog that follows, select the GenBank division that you want to search and open it.
4. You will be presented with a crude time estimate. Click on Go For It! to start the search.
This is another free text search. Compared with the background search, this is blisteringly fast, but of course you cannot do anything else with your machine while it is running.
3.16. Searching Sequences on GenBank
The modal sequence search offers some extra facilities that are not available in the background search, principally the ability to limit the range of sequences searched, either numerically or by searching the comments section of each record for a keyword first, only proceeding to search the sequence if that keyword is found.
1. If the previous search is still running, abort it with Command-period.
2. Click on the window containing the sequence fragment we used before to bring it to the front.
3. Choose Modal Searches > Search Sequences... from the Find menu.
4. Click on OK, accepting the default parameters of 12 and 0.05 for match length and probability, respectively.
5. You may choose to save the results continuously to a file if you wish.
6. In the standard file dialog, choose the bacterial sequences division (GBBCT.SEQ).
7. The final dialog (Fig. 10) offers three options. Click on OK to accept the default, which is to search all sequences in the selected division.
You may opt to search all the sequences in this division (11,805 in this case), to search a numerical subset starting and ending at numbers that you enter, or to perform a combined search, in which only those sequences that contain a keyword are searched. If you choose one of the first two options you will get a time estimate. For the third option no time estimate is available because the program does not know in advance how many sequences it will have to search. Once again, the search is considerably faster, but much less convenient than the equivalent background search. You may abort any of the database searches at any time by means of Command-period.
Fig. 10. Set-up dialog for modal sequence search of GenBank. The dialog asks which sequences to search and offers three options: all sequences, a range from one sequence number to another (here up to 11805), or only sequences whose annotations contain a given keyword.
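The three options in the modal sequence search dialog can be pictured as filters applied before the (slow) sequence comparison. The sketch below is hypothetical: the record layout, function name, and sample records are invented for illustration, and unlike the dialog it allows the range and keyword filters to be combined.

```python
def select_records(records, first=None, last=None, keyword=None):
    """Yield (number, sequence) for the records that should be compared.

    `records` is a list of (annotations, sequence) pairs, numbered from 1.
    With no arguments every record is selected.
    """
    for number, (annotations, sequence) in enumerate(records, start=1):
        if first is not None and number < first:
            continue
        if last is not None and number > last:
            continue
        if keyword is not None and keyword.lower() not in annotations.lower():
            continue                     # skip the sequence comparison entirely
        yield number, sequence


# Invented records, for illustration only.
RECORDS = [
    ("rat GnRH receptor mRNA", "ACGTACGT"),
    ("E. coli lacZ gene", "GGCCGGCC"),
    ("human androgen receptor", "TTAATTAA"),
]

all_numbers = [n for n, _ in select_records(RECORDS)]
range_numbers = [n for n, _ in select_records(RECORDS, first=2, last=3)]
keyword_numbers = [n for n, _ in select_records(RECORDS, keyword="receptor")]
```

For the first two options the number of comparisons is known before the search starts, which is what makes a time estimate possible; with the keyword filter the count is only known once every annotation has been read.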
4. Notes
1. You should exercise caution in interpreting the results of searches in this database. Transcription factor consensus sequences tend to be short and highly degenerate, which means that they occur at random in many DNA sequences. To determine whether any given match is significant will require a considerable amount of experimental work. As an example, try searching the same file with a 1-kb random sequence produced by the Generate Random Sequence... command from the Modify menu. You will find a large number of hits, none of which are of any significance, since by definition a random sequence contains no real information.
2. You should note that searches can only be started on a file, not on a folder. This creates a problem when you wish to search a folder that contains only other folders. To get around this, you should create a starter file; this is an empty file (i.e., open a new sequence window and save it immediately) in the top-level folder. The file named Search Misc. Receptors used above is such a starter file.
3. For each match it finds, the program displays first the file reference, permitting you to open the file directly from the search window, then the length of the longest contiguous match between the sequences, and finally the probability. You should exercise care when interpreting this probability figure, because it does not take into account the number of sequences searched; it simply represents the probability of finding a match of that length between the two sequences if they were completely unrelated. If you were to search 1000 sequence files using a sequence that was unrelated to any of them, you could on average expect to find one sequence that matched at the p < 0.001 level.
4. This search is not suitable for seeking matches between distantly related sequences. In the example, we searched through a family of 50 homologous sequences (they are all G-protein coupled receptors), with a target sequence virtually identical to one of them, and found only one match apart from the near-identical one. This is an excellent search for matching a newly sequenced fragment against a collection of sequences to determine whether you have seen it before. It will also normally pick out matches between homologs from related species, but is totally inadequate for comparisons between distantly related sequences of differing functionality.
5. If you wish to install GenBank on a second computer you do not have to run GenBankInstall again; you can simply copy the index files across to that machine. The index files have the same names as the GenBank divisions that they index, all of the form GBXXX.SEQ, where XXX denotes a three-letter identifier (PRI = primates, BCT = bacteria, etc.).
6. The annotation part of GenBank files contains text with 80-character-long lines, so it is a little difficult to read in the standard size of the GeneJockey window, which is 50 characters wide. Click in the grow box at the bottom right of the window and drag to the right until the window is about 50% wider. Now click in the split box at center right, and drag downward to increase the height of the comments box. Now you can read the sequence annotations in comfort.
7. The name can be in lowercase or uppercase letters, but must be entered exactly as in the example: the name first, followed by a comma, and each initial followed by a period, with no spaces anywhere. You can, however, truncate the name from the right. For example, Taylor,P.L. will find me only. Taylor,P. will find me and half a dozen others who share the same first initial. Taylor will find many hits (it is a common name).
8. There is no time estimate for this search, because the time taken to run depends on whether GeneJockey is the foreground or background program, and what else your computer is being asked to do at the same time.
9. As the search is running, you will notice that the cursor does not display its customary rotating helix, but shows the I-beam text insertion cursor when it is over a text area. The insertion point in the search window flashes, indicating that you can type in text here if you want. In fact, all of GeneJockeyII's facilities remain available while the search is running. You can switch to other windows, edit sequences, and run analyses. The search will pause when any time-consuming routine is started up, and restart as soon as it is finished.
10. The Macintosh uses a system known as cooperative multitasking to run programs simultaneously. This depends on the active program giving up some processor time to the program that is running in the background. Unfortunately, most programs are rather greedy with processor time, and do not give up very much. If you use a screensaver program, such as After Dark, you may have noticed that some of its more complex effects run faster or slower depending on what program was running before it started. GeneJockey is well-mannered in this respect, giving much more time to background programs than most. The time taken to run a search when GeneJockey is not the active program, therefore, depends very much on what program you have in front. Oddly enough, Apple's own Finder is one of the worst offenders here, so be careful that you do not accidentally switch GeneJockey into the background when nothing else is running, or your search will take an unnecessarily long time to run. While GeneJockey is the active program, it makes very little difference whether you have the search window in front or behind other windows. You can vary the amount of time given to other tasks by means of the speed control at the bottom of the search window. Dragging the small scroll-bar to the right increases the search speed at the expense of other processes. On a Power Macintosh you will want to run the search at maximum speed, but on slower machines this will make other operations sluggish and clumsy, and may cause background operations, such as printing, to fail. Try leaving a search running while you switch to another program. Arrange the windows so you can see the search window header. You will probably find that the search is running very slowly.
11. The reason why simultaneous searches run slowly is not simply that the computer is having to do twice the work. When a single search is running, it reads the data off the CD-ROM in the order in which it was written, following each track to the end before switching to the next track. If two searches are running simultaneously, they are usually reading from different tracks on the disk, and the read head therefore has to oscillate back and forth between those two tracks as the two searches alternately take time on the drive. CD-ROM drives have a relatively slow access time, and driving the head back and forth like this is therefore very inefficient. GeneJockey lets you deal with this problem by suspending one search until the other has finished to make more efficient use of the CD-ROM drive.
12. When the running search terminates or if you abort it, either by using this dialog or by typing Command-period, the suspended search will start up again. You can start and suspend as many searches as you want, subject to the usual limitation on memory and on the number of windows open. If you want to abort a search and close the window, you have to do it in two operations: click on the close box, then on the abort button, then click in the close box again to close the window.
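The caution in Note 3 about the number of sequences searched can be made concrete with a little arithmetic. The sketch below assumes that each comparison is independent and that p is the per-comparison chance of a match at least that good; both are simplifying assumptions, and the function names are hypothetical.

```python
def expected_chance_hits(n_sequences: int, p: float) -> float:
    """Average number of unrelated sequences expected to match at level p."""
    return n_sequences * p


def prob_at_least_one(n_sequences: int, p: float) -> float:
    """Chance that at least one unrelated sequence matches at level p,
    assuming the comparisons are independent."""
    return 1.0 - (1.0 - p) ** n_sequences


# The example from Note 3: 1000 files searched, match reported at p = 0.001.
note3_expected = expected_chance_hits(1000, 0.001)      # about 1 chance hit
note3_any = prob_at_least_one(1000, 0.001)              # roughly 0.63

# The bacterial division in Section 3.16 held 11,805 sequences; at the
# default threshold of 0.05 this simple model predicts hundreds of chance hits.
division_expected = expected_chance_hits(11805, 0.05)
```

Under these assumptions a single hit at p = 0.001 in a search of 1000 files is about what chance alone would produce, so the probability column is better read as a ranking aid than as a significance test.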
17
GeneJockeyII: Restriction Analysis
Phil Taylor
1. Introduction
GeneJockey has very powerful commands for restriction analysis, having access to data on over 400 restriction enzymes. The program has been updated several times since its release; the version described here is version 1.5. If you are using a version earlier than version 1.2 you will not have access to the matrix selector when choosing enzymes manually, and must choose enzymes from the linear list. In addition, the full text-based restriction map output format was introduced in version 1.2, and earlier versions offered only the text list of cut sites and fragment sizes and the graphic restriction map. Prior to version 1.5, the restriction enzyme data was contained in the program file itself; from version 1.5 this data is held in a separate file, and there is a separate utility program named RenzEdit that can be used to create and edit restriction enzyme files and to import restriction enzyme data from ReBase, Dr. Richard Roberts' restriction enzyme database.
2. Materials
1. Hardware: GeneJockey requires a Macintosh with ColorQuickdraw in ROM (this excludes the Macintosh Plus [and older machines], the SE, the PowerBook 100, and the Macintosh Portable). The program also requires system 7.0 or later, and at least 2 Mb of available memory. It is recommended that you use a system capable of displaying 256 colors (or better).
2. Software: For the operations described in this chapter, you need only the GeneJockey program itself. For later chapters, you will need some additional files supplied with the program, and you would normally install GeneJockey on your hard disk by simply copying all the files supplied into a single folder. When running on a Power Macintosh, the GeneJockey Helper file should be present in the same folder. The native-code resources in this file run about 10 times faster than the code in the main program, and since multiple alignment is a time-consuming process the extra speed is very helpful. GeneJockey is licensed for use only on a single-user basis, but is not copy-protected.
3. Data: If you are using version 1.5 or later, you will also need a file of restriction enzyme data, either the file supplied with the program or one you have created yourself using RenzEdit. If this file is named Restriction Enzymes, and placed in the same folder as GeneJockey itself, the program will open it automatically at startup. If not, you must open it using the Open command from the File menu before using the restriction analysis commands. If you are running the program on a Power Macintosh, you should place the file named GeneJockey Helper in the same folder as GeneJockeyII; this enables the program to use native Power Macintosh code for the restriction enzyme search.
Fig. 1. Parameter dialog for restriction analysis. The dialog shows an Available Enzymes list and an Enzymes to use list, a choice between the Short list and the Full list (Matrix...), and the Find Enzymes..., Proceed with analysis, and Cancel buttons.
3. Methods 3.1. Restriction Analysis As with all GeneJockey commands, the restriction analysis is performed on the sequence in the front window. In this example the sequence used is that of the vector pBR322, which is supplied in the program’s demo files. 1. Open the file containing the DNA sequenceof pBR322. 2. Select Restriction > Digest from the Analyze menu. The resulting large dialog box (Fig. 1) cannot be turned off; it will always appear when you issue this command. 3. Click the radio button next to Short List to select the short enzyme list (see Note 1) 4. Allow the program to select appropriate enzymes according to your specified criteria by selecting Find Enzymes... and you will be presentedwith the dialog in Fig. 2 (see Note 2). 5. Select cut anywhere.
215
GeneJockeyll: Restriction Analysis Flnd @cut 0 cut 0 cut 0 do 0 do 0 cut 0 cut
Enzymes
which
:
anywhere. selected region. ONLY selected region. not cut. not cut selected reglon. only outside selected reglon. ONCE only
Selected
E (-OK] [FJ
region
IS from
to
m
4363
Fig. 2. Parameter dialog for automatic selection of enzymes. IIurnHI
Restriction cut Porltlon 37s
,a1 'Con I
24 436,
CM"
18,
IO~III
174 297 401 525 533 597 931 920 94, 992 1049 1262 ,446 1950 2490
Anelysls h?lam,nt
- pBR322
-
s I *a
4363 4363 4363 4303 587 540 so4 458 434 261 234 213 I92 184 124 123 104 89 80
Fig. 3. Text list of cut sites and fragments.
6. Click on OK and the program will compile a list of all enzymes that cut this sequence. Sometimes this is all the information you need, but more often you will go ahead and perform the simulated digest. There are two other ways to select the restriction enzymes to be used for the analysis:
a. Directly select the required enzymes from the Available Enzyme list (see Note 3).
b. Select the required enzymes from the enzyme matrix by selecting Use Matrix... (see Note 4).
7. Click on Proceed with Digest. You will be presented with a new text window like the one in Fig. 3. For each enzyme, the cut positions and fragment sizes are listed, with the cut positions sorted into increasing order and the fragment sizes in decreasing order. This is a standard GeneJockey text window, so you may edit it, copy its contents onto the clipboard, print it directly, or save it either in GeneJockey’s own format or as a text file for transfer into other programs. If you had specified multiple digests, the list of enzymes would have been given first, followed by a pair of lists of cut positions and fragment sizes. Note that since the test sequence was circular, a single cut gives rise to only one fragment, and n cuts give n fragments. Note also that the cut position for EcoRI is given as 4361, although pBR322 has its origin at the EcoRI site. The reason for this is that pBR322 is conventionally numbered from the center of the EcoRI site, whereas the cut position (on the 5’->3’ strand) is two bases before this.
[Figure 4: circular graphic restriction map of pBR322, with enzyme names joined by lines to their cut sites.]
Fig. 4. Graphic restriction map.
8. Close the text window (or leave it open and bring the sequence window to the front).
9. Select Restriction > Digest again. The working list of enzymes will still be in place.
10. Click on the Graphic Map radio button.
11. Click on Proceed with Analysis to create a restriction map (Fig. 4, see Note 5).
12. Close the restriction map window (or leave it open and bring the sequence window to the front).
13. Select Restriction > Digest again. Once again the working list of enzymes will still be in place.
14. Click on the Full Text Map radio button.
15. Click on Proceed with Analysis to create a full text map (Fig. 5, see Note 6).
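The remark in step 7 that a circular sequence cut n times yields n fragments can be checked with a small simulation. The sketch below is not GeneJockey's algorithm; it assumes exact, palindromic recognition sites, a circular molecule, and a toy 30-bp sequence (not pBR322). The function names and the `cut_offset` parameter are hypothetical.

```python
def find_cut_positions(seq, site, cut_offset):
    """Return sorted 1-based positions of the base just 5' of each cut.

    Assumptions (illustrative only): the molecule is circular, the site is
    an exact palindromic string, and cut_offset counts the site bases lying
    5' of the cut on the top strand (1 for a G^AATTC-type site).
    """
    seq = seq.upper()
    n = len(seq)
    doubled = seq + seq[:len(site) - 1]      # lets a site span the origin
    cuts = []
    start = doubled.find(site)
    while start != -1:
        cuts.append((start + cut_offset - 1) % n + 1)
        start = doubled.find(site, start + 1)
    return sorted(cuts)


def circular_fragments(cuts, length):
    """Fragment sizes, largest first, for a circular molecule of `length` bp."""
    cuts = sorted(set(cuts))
    if not cuts:
        return []                            # an uncut circle gives no linear fragment
    sizes = []
    for i, cut in enumerate(cuts):
        nxt = cuts[(i + 1) % len(cuts)]
        sizes.append((nxt - cut) % length or length)   # a single cut returns the whole circle
    return sorted(sizes, reverse=True)


plasmid = "GAATTC" + "A" * 10 + "GGATCC" + "T" * 8      # toy circle, 30 bp
eco = find_cut_positions(plasmid, "GAATTC", 1)
bam = find_cut_positions(plasmid, "GGATCC", 1)
print("single digest:", eco, circular_fragments(eco, len(plasmid)))
print("double digest:", circular_fragments(eco + bam, len(plasmid)))
```

Cut positions come out in increasing order and fragment sizes in decreasing order, mirroring the layout of the text list described in step 7.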
[Figure 5: excerpt of a full text map showing the double-stranded sequence, numbered every 10 bases, with enzyme names printed beneath their sites.]
Fig. 5. Full text-based restriction map.
4. Notes
1. The program maintains three lists of restriction enzymes, and it is important to understand the relationship between them.
a. The Full list is the basic source of all information; it contains the names and recognition sites of all the enzymes that the program can use. You may add enzymes to or delete them from this list, and may edit their recognition sites.
b. The Short list is a subset of the full list and, as supplied, contains about 30 of the more commonplace enzymes. The short list is a convenience feature to save you from having to scroll through the full list every time you want to do an EcoRI/XhoI digest. The short list is easily reconfigured, and you may set it up to contain those enzymes that you happen to have in your freezer, or those that are readily available from your favorite supplier, and so on. The short list is maintained from one session to the next and will not be lost unless you deliberately delete it.
c. The Working list is a list of enzymes, drawn from either of the other two lists, that you prepare (or have the program prepare for you) immediately before performing an analysis. Prior to version 1.5, the working list was maintained for the life of the current session but lost when you quit the program; more recent versions store the working list between sessions.
2. You can also have the computer select the enzymes for you based on various criteria. The program can find enzymes that cut anywhere, cut selected region, cut only selected region, do not cut, do not cut selected region, cut only outside selected region, or cut once only. If you had selected a part of the test sequence before starting, the corresponding numbers would be entered in the boxes at the bottom; if you forgot, it does not matter, since you can edit these boxes to define your area of interest anyway.
3. At the top left is a box containing the currently selected list of enzymes; by default this is the short list, but there is also a pair of radio buttons that you may use to select the full list. The working list, which is currently empty, is contained in the box at the right (titled Enzymes to use). You may move an enzyme from the left hand list to the right hand one by clicking on its name and then clicking on the Add to List button in the center, or simply by double-clicking on the enzyme name. If you wish to delete an enzyme from the working list you can do it in the same way. The Delete All << button scraps the entire working list.
In this way you can build up a list of enzymes that you have manually selected to use in the digestion. Selecting enzymes manually from a long linear list can be tedious, so there is now an alternative using a matrix display (see Note 4).
4. The Use Matrix... button at center left leads to a new dialog in which you may select restriction enzymes from an 8 x 16 scrollable matrix (Fig. 6). This can be used to display either the short list or the full list of available enzymes. Particularly when selecting from the full list, it is much easier to locate the enzymes that you want to use when they are displayed in this fashion. Click on as many enzyme names as you want; each selected enzyme remains selected unless you click on it again. You can change the list displayed by clicking on the radio buttons at the bottom of the dialog. When you click on OK, the display returns to the main dialog and the selected enzymes are loaded into the Enzymes to Use list, ready for use. Note that any enzymes already present in this list will be deleted first. If you exit from the matrix display by means of the Cancel button, the existing working list will be left unaltered.
[Figure 6: scrollable matrix of enzyme names in alphabetical order, labeled "click to select or de-select enzymes", with Short List and Full List radio buttons.]
Fig. 6. Matrix display for manual selection of enzymes.
5. Having created a restriction map, there are several further things you can do with it. By clicking on an enzyme name you can select it; it will be highlighted, as is the SalI site in Fig. 4. Typing a backspace at this point will remove this site from the map. The enzyme names can also be rearranged by dragging them around; the lines connecting them to their sites will move and stretch as necessary. In complex maps, this helps to make the picture clearer. Often you will find that some enzymes are off screen, especially when mapping a plasmid with a polylinker region with many restriction sites close together. In order to bring them onto the screen you can move the whole map by clicking and dragging within the central circle (in the case of linear maps, click and drag the solid line, which represents the sequence itself). In recent program versions the graphic map window has scroll bars that you can also use to move large maps around.
Double-clicking on an enzyme name produces a display that gives detailed information about that site: its numerical position in the sequence and a short section of the double-stranded sequence that contains the site, with the recognition sequence highlighted and the exact position of the cut marked on each strand (Fig. 7). Using this display, you can immediately see what kind of ends each fragment will have.
Fig. 7. Double-click to obtain a diagram of the exact sequence around the cut site in a graphic restriction map.
Graphic maps can be saved in GeneJockeyII’s own file format, or as a PICT file for transfer to other graphics programs. You can also transfer a picture of the data to the clipboard using the Copy command, and paste it into another program.
6. Here, the sequence is shown double-stranded, with numbers above referring to the 5’->3’ strand. The restriction sites are marked below this, the actual positions marked being determined by the settings of the Measure Cut Positions from radio buttons in the main restriction digest dialog. The data here is contained in a text window, and may be edited before printing, copied to the clipboard, or saved either in GeneJockeyII’s own format or as an ASCII text file. The full text map format is useful to those users who wish to print out a large restriction map for reference purposes.
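The seven selection criteria of Note 2 can be expressed as simple tests on each enzyme's list of cut positions. The sketch below is one plausible reading of those criteria, not the program's documented logic; the enzyme names, cut positions, and function names are hypothetical.

```python
def classify(cuts, region_start, region_end):
    """Evaluate each Find Enzymes criterion for one enzyme's cut positions.

    Assumption: a cut belongs to the selected region when its position lies
    between region_start and region_end inclusive.
    """
    inside = [c for c in cuts if region_start <= c <= region_end]
    outside = [c for c in cuts if not region_start <= c <= region_end]
    return {
        "cut anywhere": bool(cuts),
        "cut selected region": bool(inside),
        "cut ONLY selected region": bool(inside) and not outside,
        "do not cut": not cuts,
        "do not cut selected region": not inside,
        "cut only outside selected region": bool(outside) and not inside,
        "cut ONCE only": len(cuts) == 1,
    }


def find_enzymes(cut_table, criterion, region):
    """Names of the enzymes in cut_table that satisfy one criterion."""
    start, end = region
    return sorted(name for name, cuts in cut_table.items()
                  if classify(cuts, start, end)[criterion])


# Hypothetical cut positions, for illustration only.
cut_table = {"EnzA": [100], "EnzB": [50, 400], "EnzC": [], "EnzD": [120, 180]}
region = (90, 200)
for criterion in classify([], *region):
    print(f"{criterion:34s}", find_enzymes(cut_table, criterion, region))
```

Keeping every criterion in one dictionary makes it easy to see that, for example, "cut ONLY selected region" and "cut only outside selected region" are mutually exclusive for any enzyme that cuts at all.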
GeneJockeyII
Translation and Open Reading Frame Analysis
Phil Taylor
1. Introduction
GeneJockey offers several methods of translating DNA to the equivalent peptide sequence.
2. Materials
1. Hardware: GeneJockey requires a Macintosh with Color QuickDraw in ROM (this excludes the Macintosh Plus [and older machines], the SE, the PowerBook 100, and the Macintosh Portable). The program also requires System 7.0 or later, and at least 2 Mb of available memory. Multiple alignment windows are much easier to view in color. Although this is not essential, it is recommended that you use a system capable of displaying 256 colors (or better).
2. Software: For the operations described in this chapter, you need only the GeneJockey program itself. When running on a Power Macintosh, the GeneJockey Helper file should be present in the same folder. The native-code resources in this file run about 10 times faster than the code in the main program. GeneJockey is licensed for use only on a single-user basis, but is not copy-protected.
3. Data: You need only the cannabinoid receptor sequence on the GeneJockey demo files disk, in the Misc. Receptors folder. This sequence is that of an mRNA of 1598 base pairs that codes for a protein of 473 amino acids in length. Alternatively, you may use a suitable mRNA sequence of your own.
3. Method
3.1. Single-Frame Translation
As with all GeneJockey commands, the translation and open reading frame (ORF) analysis commands take their input from the sequence in the front window.
1. First use the Open... command from the File menu to open the cannabinoid receptor sequence in the Misc. Receptors folder.
[Figure 1: the Translate dialog, "Choose which reading frame", with radio buttons for RF 1, RF 2, and RF 3 (each listing its first five codons) and for Sequence Map, plus a Genetic Code pop-up set to Universal.]
Fig. 1. Parameter dialog for the Translate command. The dialog box allows you to set which frame to translate. Selecting Sequence Map results in all three forward frames being translated.
[Figure 2: sequence map excerpt showing the numbered nucleotide sequence with the translations of the three forward reading frames on the lines beneath it.]
Fig. 2. Sequence map output format. The figure shows a short example of this translation format. In this case the sequence origin has been set so that the ATG start codon is at position 1. The main open reading frame is RF2, the middle line of amino acid symbols.
2. Select Translate from the Modify menu. You will be presented with the dialog in Fig. 1.
3. Choose RF 1. The program will take the first base of the sequence as the first base of the first codon. If you choose RF 2 or RF 3, the read will start on the second or third base, respectively. In each case, the dialog box lists the first five codons of that reading frame.
4. Click on OK. The program will open a new protein sequence window, displaying the translated sequence. You will notice that the sequence contains frequent stop codons, here represented as a bullet mark (•) (see Note 1).
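The frame choice in steps 3 and 4 amounts to shifting the start of the read by zero, one, or two bases. A minimal sketch, assuming the universal (standard) genetic code and a made-up test sequence; `translate` and its parameters are illustrative names, not part of GeneJockey.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=1, stop_symbol="•"):
    """Translate `dna` in forward reading frame 1, 2, or 3.

    Frame 1 starts on the first base, frame 2 on the second, frame 3 on the
    third. Codons containing unknown symbols are shown as X.
    """
    dna = dna.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(frame - 1, len(dna) - 2, 3))
    peptide = "".join(CODON_TABLE.get(codon, "X") for codon in codons)
    return peptide.replace("*", stop_symbol)


example = "ATGGCCTAAGG"          # made-up sequence
for rf in (1, 2, 3):
    print(f"RF {rf}: {translate(example, rf)}")
```

A translation peppered with stop symbols, as in step 4, is the usual sign that the wrong frame was chosen.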
3.2. Three-Frame Translation
1. First use the Open... command from the File menu to open the cannabinoid receptor sequence in the Misc. Receptors folder.
2. Select Translate from the Modify menu. You will be presented with the dialog in Fig. 1.
3. Select Sequence Map. This lists the contents of all three forward reading frames simultaneously, placing its results into a new text window. The sequence map option is useful when working with incomplete sequence data that may still contain reading frame errors, because you can then track the open reading frame as it moves across from one line to another (Fig. 2).
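A sequence map of the kind described in step 3 can be approximated by printing each amino acid symbol under the first base of its codon, one line per frame. This is a sketch of the idea only; the layout details (line width, "*" for stops, no numbering) are assumptions and differ from GeneJockey's own output.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def sequence_map(dna, width=60):
    """Return the DNA in blocks of `width` bases with three frame lines below.

    Each amino acid letter sits under the first base of its codon, so a
    frameshift shows up as the readable ORF hopping from one line to another.
    """
    dna = dna.upper()
    lines = []
    for off in range(0, len(dna), width):
        end = min(off + width, len(dna))
        lines.append(dna[off:end])
        for frame in range(3):
            row = [" "] * (end - off)
            first = off + (frame - off) % 3      # first codon start in this block
            for i in range(first, end, 3):
                if i + 3 <= len(dna):            # skip an incomplete final codon
                    row[i - off] = CODON_TABLE.get(dna[i:i + 3], "X")
            lines.append("".join(row).rstrip())
        lines.append("")
    return "\n".join(lines)


print(sequence_map("ATGGCCTAAGG"))
```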
[Figure 3: the Open Reading Frame Analysis Options dialog, including a Genetic Code pop-up set to Universal.]
Fig. 3. Parameter dialog for open reading frame analysis. This dialog allows you to determine at what point the program will start to translate the sequence. You can specify the minimum length of peptide that you are interested in, thus eliminating many small random reading frames, and you have the option of choosing among several mitochondrial genetic codes in addition to the universal code.
Fig. 4. GeneJockey open reading frames window. Arrows represent ORFs in the six possible reading frames. The actual translated region is highlighted.
3.3. Open Reading Frame Analysis
Normally, however, you will want to locate the actual translated region of the sequence, and translate only that, placing the output in a peptide sequence window. In order to do this you must first perform an open reading frame analysis.
1. First use the Open... command from the File menu to open the cannabinoid receptor sequence in the Misc. Receptors folder.
2. Select Reading Frames... from the Analyze menu. The parameter dialog for this command is shown in Fig. 3 (see Note 2).
3. Click on OK to dismiss the dialog and the program will open a new window containing a diagram showing the open reading frames in both directions (Fig. 4). It is very obvious in this case which is the correct reading frame.
4. Click on the large arrow in RF 3.
[Figure 5: the cannabinoid receptor sequence with its 473-residue translation printed in three-letter code above the codons; nucleotide numbers (starting at -158, with the ATG at 1) run down the left and cumulative residue numbers (17, 34, ... 473) down the right.]
Fig. 5. Complete sequence of cannabinoid receptor in interleaved format.
5. Select Translate from the Modify menu (or simply double-click on the arrow as a shortcut). This time, the program produces the correct peptide sequence. The sequence is contained in a standard peptide sequence window, and is, of course, available for further analysis (for example, choose Sequence Info from the Analyze menu to get a text window containing its amino acid composition, molecular weight, and isoelectric point).
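The analysis in this section, scanning all six frames and discarding reading frames shorter than a chosen minimum, can be sketched as follows. This is an illustration under stated assumptions (universal code, exact start codons, coordinates on the "-" strand counted along the reverse complement), not the program's implementation; `find_orfs` and its parameters are hypothetical names.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=50, start_codons=("ATG",), require_start=True):
    """List ORFs as (strand, frame, first_base, last_base, peptide).

    With require_start=False an ORF opens right after the previous stop,
    which suits fragments that may lack the true start codon. An ORF still
    open at the end of the sequence is reported as well.
    """
    dna = dna.upper()
    found = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
            open_at = None if require_start else 0
            for idx, codon in enumerate(codons + [None]):   # None marks the sequence end
                is_stop = codon is None or CODON_TABLE.get(codon) == "*"
                if open_at is None and codon in start_codons:
                    open_at = idx
                elif is_stop:
                    if open_at is not None and idx - open_at >= min_codons:
                        peptide = "".join(CODON_TABLE.get(c, "X")
                                          for c in codons[open_at:idx])
                        found.append((strand, frame + 1,
                                      frame + 3 * open_at + 1, frame + 3 * idx,
                                      peptide))
                    open_at = None if require_start else idx + 1
    return found


toy = "CC" + "ATG" + "GCC" * 3 + "TAA" + "GG"     # made-up sequence
print(find_orfs(toy, min_codons=3))
```

Raising `min_codons` plays the same role as the minimum peptide length in the dialog of Fig. 3: it filters out the many short random reading frames.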
3.4. Creating Interleaved DNA and Peptide Sequences
A popular format for published sequences shows the nucleotide and peptide sequences interleaved, with the two sequences numbered separately at either side of the page.
1. Repeat steps 1-4 of Section 3.3.
2. Select Format Sequence > Interleaved from the Modify menu. You will be asked to specify the page width that you want.
3. Click on OK. The interleaved format sequence is displayed in Fig. 5.
The sequence displays produced by Format commands are simply text: you cannot perform any further analysis on them; however, you can format them using fonts, styles, colors, and so on, and annotate them with extra text for your own purposes. Sequence displayed in this way can simply be copied and pasted into any word processor or desktop publishing program; however, you must ensure that it is displayed in a fixed-width font, such as Courier, and using a small point size, so that the lines do not run off the page or wrap around. Proportionally spaced fonts will destroy the alignment between the symbols on successive lines.
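The interleaved layout relies on every three-letter amino acid code occupying the same width as its codon. A minimal sketch of that idea, assuming the universal code, "***" for a stop, nucleotide numbers on the left, and cumulative residue counts on the right; the function name and parameters are illustrative, and the spacing is not GeneJockey's exact output.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
THREE = dict(zip(
    "ARNDCQEGHILKMFPSTWYV*",
    "Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro "
    "Ser Thr Trp Tyr Val ***".split()))


def interleave(dna, codons_per_line=17, first_base=1):
    """Return peptide lines (three-letter code) interleaved with codon lines."""
    dna = dna.upper()
    codons = [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]
    lines, residues = [], 0
    for off in range(0, len(codons), codons_per_line):
        row = codons[off:off + codons_per_line]
        names = [THREE.get(CODON_TABLE.get(c, "X"), "Xaa") for c in row]
        residues += sum(name != "***" for name in names)   # stops are not residues
        lines.append(f"{'':>6} {' '.join(names)}  {residues}")
        lines.append(f"{first_base + 3 * off:>6} {' '.join(row)}")
    return "\n".join(lines)


# First five codons of the coding region shown in Fig. 5.
print(interleave("ATGAAGTCGATCCTA", codons_per_line=3))
```

Because the alignment depends on equal character widths, this output shows the same sensitivity to proportional fonts described above.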
4. Notes
1. It is obvious that we have translated in the wrong reading frame. You could try the other two RFs, and if you still draw a blank, invert the sequence and try the three RFs of the opposite strand, but it is much simpler to perform an open reading frame analysis.
2. You may specify whether or not you want to see only open reading frames that start on a start codon (normally ATG, but you may put in any codon you want here if you are working with an exotic system, even including degenerate codes if your system recognizes more than one start codon). Since we have a complete mRNA here, we know that any translatable open reading frame will start on ATG, an assumption that would not be true if we were working with a genomic sequence, which might contain introns, or with a newly sequenced fragment, which might not contain the beginning of the translated region.
PROTEAN
Protein Sequence Analysis and Prediction
Thomas N. Plasterer
1. Introduction
The explosion of sequence information has provided a number of candidates for translation into their encoded protein sequences. This primary sequence information provides scant understanding of the function and structure of the novel protein. Evolution has conserved proteins at the functional level; advantageous or neutral functioning proteins continue as disadvantageous proteins are selected against (1). The quaternary and tertiary structure of a protein determines its functionality. This data, unfortunately, can be directly gathered only by NMR or crystallographic studies. In cases in which you have the determined structure of a protein and its primary sequence, you can infer similar structure and function for homologous proteins. When a member of a protein family has been solved, this information can bridge the gap between primary sequence data and tertiary/quaternary data for other members of the family. This holds for identical residue conformations but can diverge rapidly for differing regions. Any predictive information derived from the primary sequence for nonidentity regions may bridge this gap to help build accurate models of structure and inferred functionality. This is especially useful as a first approximation for protein families lacking solved structures. Additionally, primary sequence information can be used to elucidate particular domains of a protein for a more specific application than structural determination, such as locating peptide motifs or antigenic site determination.
The PROTEAN program, one of seven in the LASERGENE suite, analyzes and predicts protein characteristics and motifs from primary sequence data. Protein sequences can be obtained from exported database entries, EDITSEQ protein sequence documents, foreign software, or ASCII text. Use the EDITSEQ program to import foreign sequence documents into LASERGENE format (see Note 1).
On the creation of a new protein Assay Document, PROTEAN evaluates the primary sequence and applies various analysis methods as determined by the Default Method Outline. These analysis methods are user-defined and typically include hydropathy, secondary structure, antigenicity, amphiphilicity, flexibility, charge density, and surface probability methods. In addition to these predictions, PROTEAN can locate known sequence pattern motifs from the PROSITE database and display potential cut sites from proteolytic digestion.
The surface of the Assay Document presents analysis results for all methods using an identical horizontal scale, based on residue coordinates. All analytical methods are stored within the Method Curtain on the left side of the Assay Document. Analysis methods are grouped into scientific concepts, such as structural prediction, antigenicity, and hydrophobicity.
In addition to evaluating primary sequence data, PROTEAN also interprets features in the Swiss-Prot or NBRF-PIR format contained in the comments window of a sequence document. These features can be applied to the assay surface like any other method. You can create and display your own features within PROTEAN as well.
To display the high degree of commonality between implementations of the program on different platforms, the illustrations within this chapter are taken from a mixture of the Macintosh (System 7.5) and Windows 95 versions.
2. Materials
Users need only satisfy materials criteria for either the Macintosh system or the Windows system.
2.1. Hardware
2.1.1. Macintosh Hardware
1. Any Macintosh computer.
2. Minimum memory requirements of 4 Mb RAM (8 Mb RAM or more is recommended).
3. Minimum free hard disk space of 25 Mb. More may be required because of the creation of temporary files.
4. Macintosh-compatible monitor (256-color monitor is recommended).
5. Macintosh-compatible printer (laser printers are recommended).
6. CD-ROM drive.
2.1.2. Windows Hardware
1. Any personal computer.
2. Minimum memory requirements of 4 Mb RAM (8 Mb RAM or more is recommended).
3. Minimum free hard disk space of 25 Mb. More may be required because of the creation of temporary files.
4. Windows-compatible monitor (256-color monitor is recommended).
5. Windows-compatible printer (laser printers are recommended).
6. CD-ROM drive.
2.2. Software
2.2.1. Macintosh Software
1. Macintosh system software 6.0.1 or higher.
2. The LASERGENE application PROTEAN. Follow the instructions provided by DNASTAR, Inc. for installing the LASERGENE software.
3. Accessory files for PROTEAN, including the PROSITE database in LASERGENE format and the ALL.ASE file containing protease information.
2.2.2. Windows Software
1. Disk Operating System (DOS) version 5.0 or higher.
2. Microsoft Windows version 3.1 or higher.
3. The LASERGENE application PROTEAN. Follow the instructions provided by DNASTAR, Inc. for installing the LASERGENE software.
4. Accessory files for PROTEAN, including the PROSITE database in LASERGENE format and the ALL.ASE file containing protease information.
2.3. Data
Any protein sequence document in LASERGENE format. For this tutorial, use the sequence file Destrin ADF (Macintosh users) or destrin.pro (Windows users). Macintosh users can locate Destrin ADF in the Demo Sequences folder, within their DNASTAR folder. Windows users can locate destrin.pro in the demo-seq directory, within the WINSTAR directory.
2.4. Optional
2.4.1. Macintosh Software
1. Any graphics program capable of handling PICT file input.
2. Any word processor for manipulating exported ASCII text.
2.4.2. Windows Software
1. Any graphics program capable of handling Windows metafile (WMF) input.
2. Any word processor capable of manipulating ASCII text.
3. Methods
3.1. Opening PROTEAN
3.1.1. Macintosh PROTEAN
1. Locate the DNASTAR folder and double-click on it to open.
2. Within the DNASTAR folder, locate the PROTEAN icon (Fig. 1). Double-click the PROTEAN icon to launch the application.
Fig. 1. The PROTEAN application icon.
3.1.2. Windows PROTEAN 1. Locate the DNASTAR program group and double-click on it to open. 2. Within the DNASTAR program group, locate the PROTEAN icon (Fig. 1). Double-click the PROTEAN icon to launch the application.
3.2. The Assay Document The Assay Document is the main window in PROTEAN. At this window, you manipulate method plots and graphs, including which methods appear, their display, and their scale. This window does not appear until you open an existing assay or create a new assay (see Note 2).
3.2.1. Entering a Protein Sequence
1. Choose New from the File Menu.
2. Locate the Destrin Actin Depolymerizing Factor sequence:
a. Macintosh users: In the File dialog, locate the DNASTAR folder in the left window (it may already be open as the default). Scroll down this window and double-click the Demo Sequences folder. Within the Demo Sequences folder, scroll down to Destrin ADF. Double-click Destrin ADF to open it as a new assay document.
b. Windows users: In the Enter Sequence dialog, locate the WINSTAR directory in the Directories: window (it may already be open as the working directory). Scroll down the Directories window and double-click the demo-seq directory. In the File Name window, scroll until you see destrin.pro. Double-click destrin.pro to open it as a new assay document.
After a few moments of calculation, PROTEAN displays the Assay Document for this protein (Fig. 2). Methods chosen for the initial analysis depend on which methods are recorded in the Default Method Outline (named default.pao on Windows systems, see Note 3). If you are using the original Default Method Outline supplied by DNASTAR, you will see Chou-Fasman (2-4) and Garnier-Robson (5) secondary structure predictions, a Kyte-Doolittle (6) hydrophilicity plot, Eisenberg (7) α and β amphipathic region plots, a Karplus-Schultz (8) flexible regions plot, a Jameson-Wolf (9) antigenic index, an Emini (10) surface probability plot, and a few features from the Destrin ADF sequence document’s comments window (see Note 4). You may need to scroll down to see all of the displayed methods.
[Figure 2: the Assay Document, with a residue scale running from 10 to 160 and stacked method plots including Alpha Regions (Garnier-Robson), Antigenic Index (Jameson-Wolf), and Surface Probability Plot (Emini).]
Fig. 2. The Assay Document, after Fit to Page has been applied. This view displays the results of those methods stored in the Default Method Outline.
3.2.2. Scaling the Assay Surface
After the various methods have been plotted, you may need to change the horizontal or vertical scale to see the whole method and its legend. PROTEAN allows you to customize both axes to display the scale of your choice.
1. Choose Horizontal Units from the Options menu.
2. From the Horizontal Units dialog, click Fit to Page and then click OK. PROTEAN resizes the Assay Document to fit the analysis methods and their corresponding legends to the size of the assay window (see Note 5).
If you want to change the amplitudes of the various methods, drag the slide control on the Vertical Squisher (the last palette tool on the left of the assay) up or down.
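Of the default plots listed in Section 3.2.1, the Kyte-Doolittle profile is the simplest to reproduce by hand: it is a sliding-window average of per-residue hydropathy values. The sketch below uses the published Kyte-Doolittle scale; the window size, the function name, and the test peptide are assumptions, and PROTEAN's own settings (it labels the plot as hydrophilicity, so its sign convention may be inverted) are not taken from the source.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=7):
    """Return (center_residue_number, mean hydropathy) for each full window."""
    values = [KD[aa] for aa in protein.upper()]
    half = window // 2
    return [(i + half + 1, sum(values[i:i + window]) / window)
            for i in range(len(values) - window + 1)]


# Made-up peptide: a hydrophobic stretch flanked by charged residues.
for center, score in hydropathy_profile("KKDEILLVVIIADEKR", window=5):
    print(f"{center:3d} {score:6.2f}")
```

Larger windows smooth the curve; the horizontal axis is the residue number at the center of each window, which matches the common residue-coordinate scale of the assay surface.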
3.2.3. Selecting Ranges and Objects
The top two palette tools, the Range Selection (the pointer) and Object Selection (the hand) buttons, control how items are selected on the assay surface. Use the Range Selection tool to select a range of residue coordinates simultaneously in all method plots. When a range is selected, Analysis menu commands are evaluated for that range only. The Object Selection tool allows you to select an entire method or multiple methods. Use this to customize the display of related methods or to look at global residue assignments for a select number of methods.
Fig. 3. The Fill Pattern dialog. Here the 50% shaded pattern is selected (boxed).
1. Click the Range Selection button and click the first Alpha, Amphipathic Regions-Eisenberg plot. The status bar at the top of the assay reads: “Selection: 5 -> 16 = 12” to indicate which coordinates are selected and the number of residues within the selection.
2. Choose Composition from the Analysis menu. PROTEAN presents a composition summary for the entire protein and the peptide fragment (5,16). This summary includes, but is not limited to, the molecular weight, extinction coefficient, isoelectric point, residue class breakdowns, and relative percentages.
3. Close the Composition window by choosing Close from the File menu.
4. Click the Object Selection button and click the Alpha, Amphipathic Regions-Eisenberg plot.
5. To select a second method, shift-click the Beta, Amphipathic Regions-Eisenberg plot.
6. Choose 50% Shaded from the Fill Pattern submenu, under the Options menu (Fig. 3).
Both amphipathic region plots have changed their fill pattern from the default solid fill to the 50% shaded pattern. These six steps quickly illustrate the differences between the selection tools.
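Part of the composition summary in step 2 can be reproduced for a selected range with a few lines of code. The sketch below covers only residue counts, percentages, and an average molecular weight; it assumes commonly tabulated average residue masses and an unmodified peptide, and it makes no claim about how PROTEAN computes its own figures. The function name and test sequence are hypothetical.

```python
from collections import Counter

# Average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528


def composition(protein, start=1, end=None):
    """Summarize residues start..end (1-based, inclusive) of `protein`."""
    end = len(protein) if end is None else end
    fragment = protein.upper()[start - 1:end]
    counts = Counter(fragment)
    weight = sum(RESIDUE_MASS[aa] * n for aa, n in counts.items()) + WATER
    return {
        "length": len(fragment),
        "counts": dict(counts),
        "percent": {aa: 100.0 * n / len(fragment) for aa, n in counts.items()},
        "molecular_weight": weight,
    }


# A selection of residues 5 to 16 contains 12 residues, as in the status bar.
summary = composition("ACDEFGHIKLMNPQRSTVWY", 5, 16)
print(summary["length"], round(summary["molecular_weight"], 2))
```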
3.2.4. The Method Curtain
The Method Curtain contains all analysis methods within the Default Method Outline (see Note 6). It also has a menu containing all methods available in PROTEAN, including methods not found within the Default Method Outline.
1. Open the Method Curtain by dragging the Method Curtain Ring from the left edge of the assay to the first third of the assay (Fig. 4).
2. Press the More Methods menu and choose Deléage & Roux (11) from the Secondary Structure submenu. The method Secondary Structure-Deléage & Roux has been added to the top of the Method Curtain.
3. Click the triangle icon to the right of Secondary Structure-Deléage & Roux. Click anywhere in the white space to the left of the selected plots and then click Alpha Plot.
4. Drag the selected Alpha Plot graph from the Method Curtain and release on the assay surface, between the Alpha, Regions-Chou-Fasman plot and the Beta, Regions-Garnier-Robson plot.
Fig. 4. The Method Curtain ring. This tool allows access to the Method Curtain.
The Deléage & Roux method is now plotted on the assay surface. If you wish to customize its display, select it with the Object Selection tool and choose any of the Line or Fill commands from the Options menu.
3.2.5. Superimposing Resultant Graphs
One of the more powerful aspects of PROTEAN is the ability to supplement method predictions by superimposition. Using this technique, you can get a clear idea of the nature of a peptide region by double or triple prediction involving more than one method.
1. Close the Method Curtain by dragging the Method Curtain Ring to the far left.
2. Click the Object Selection tool and then click the Alpha, Regions-Chou-Fasman plot.
3. Drag the Alpha, Regions-Chou-Fasman plot and release when it is on top of the Alpha Plot-Deléage & Roux.
The Chou-Fasman plot is now on top of the Deléage & Roux graph. Use this technique to strengthen predictions for similar methods.
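Superimposing two plots is a visual way of asking where two methods agree. A minimal numerical counterpart, assuming each method's predicted regions are available as 1-based inclusive residue ranges, is an interval intersection. The function name and the example ranges below are hypothetical and are not taken from PROTEAN output.

```python
def intersect_regions(regions_a, regions_b):
    """Return the residue ranges predicted by both methods.

    Each argument is a list of (start, end) tuples, 1-based and inclusive.
    """
    shared = []
    for a_start, a_end in regions_a:
        for b_start, b_end in regions_b:
            start = max(a_start, b_start)
            end = min(a_end, b_end)
            if start <= end:  # the two regions overlap
                shared.append((start, end))
    return sorted(shared)


if __name__ == "__main__":
    # Hypothetical alpha-helix regions from two different prediction methods.
    method_one = [(5, 16), (25, 59)]
    method_two = [(10, 30), (70, 80)]
    print(intersect_regions(method_one, method_two))
```

Regions that survive the intersection are the ones supported by both predictions, which is the same reading you would make by eye from the superimposed graphs.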
3.3. Model Structures
PROTEAN provides a few simple modeling tools to aid in analysis of your protein. These tools are not predictive themselves, but when used in conjunction with the analysis methods, they can assist in visualizing peptide domains.
1. Click the Range Selection tool and then click the second bar in the Alpha, Regions-Garnier-Robson plot, selecting residues 25-59.
2. Choose Helical Wheel from the Model Structure submenu, under the Analysis menu. PROTEAN plots the helical wheel diagram for the selected region, looking down the axis from amino to carboxy terminus (Fig. 5). The default helical angle is 100° with a residue pitch of 1.50 Å. Change the defaults by choosing Alpha Helical Angle from the Options menu.
3. Choose Close from the File menu to close the Helical Wheel window.
Fig. 5. The Helical Wheel diagram showing residues 25-29 of the Destrin ADF Assay. The view is along the axis from amino to carboxy terminus. The default helical angle is 100° with a residue pitch of 1.50 Å.
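The geometry behind a helical wheel is simple enough to sketch: each successive residue is rotated by the helical angle and advanced along the axis by the rise per residue. The Python below uses the default values quoted above (100° and 1.50 Å); the function name and the example peptide are assumptions made for illustration, and the code does no drawing.

```python
def helical_wheel_positions(sequence, first_residue,
                            angle_per_residue=100.0, rise_per_residue=1.5):
    """Return (residue number, residue, angle in degrees, rise in angstroms).

    Angles are measured around the helix axis, starting at 0 for the first
    residue of the selection.
    """
    positions = []
    for offset, residue in enumerate(sequence):
        angle = (offset * angle_per_residue) % 360.0
        rise = offset * rise_per_residue
        positions.append((first_residue + offset, residue, angle, rise))
    return positions


if __name__ == "__main__":
    # Hypothetical 19-residue peptide, numbered from residue 25.
    for number, residue, angle, rise in helical_wheel_positions(
            "LKELAEKLKELAEKLKELA", 25):
        print(f"{number:4d} {residue} {angle:6.1f} deg {rise:5.1f} A")
```

With a 100° step, the nineteenth residue (offset 18) lands back at 0°, which is why 18 residues fill the wheel exactly once.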
3.4. Protease Digestion
Another set of methods stored within the More Methods menu of the Method Curtain allows you to apply proteases to your protein assay. You can edit, create, and remove proteases and produce a tabular or graphic summary of fragment digestion.
1. Open the Method Curtain and choose Proteases-Protease Map from the More Methods menu.
2. Click the triangle icon to the left of Proteases-Protease Map to open the list of protease methods and then click in any white space to their left.
Fig. 6. SDS-PAGE Gel Simulation for Destrin ADF digested with Clostripain (CLOS), CNBr, NTCB, and Trypsin (TRYT).
3. Click Clostripain and shift-click CNBr, NTCB, and Trypsin. Drag these four proteases to any spot on the assay surface. Close the Method Curtain. Four protease maps are now displayed on the assay surface. The vertical lines in each map denote the location of a proteolytic site. Each protease’s label also lists the frequency of cut sites.
4. Double-click NTCB-Protease Map (eight cuts). This opens the Protease Editor for NTCB. At the editor, you can modify the name, site, exceptions to the site rules, vendor information, and remarks for the protease in question. To perform any edits, first click the lock icon in the upper right to unlock the protease.
5. Click Cancel in the Protease Editor. Deselect the NTCB-Protease Map (eight cuts) by clicking the Range Selection tool.
6. Click the fourth fragment for the Clostripain-Protease Map (six cuts, between the third and fourth cut sites, residues 33-81) and choose SDS PAGE Gel Simulation from the Sites & Features menu. PROTEAN presents the gel simulation for Destrin ADF with these four proteases (Fig. 6). Bands in red are within the selected region on the assay surface. Notice that only one band is selected in the CLOS column, representing the fragment from residues 33-81. If you move the cursor over the selected bands, the status bar of the gel window presents the coordinates, length, molecular weight, isoelectric point, and HPLC retention time for the fragment beneath.
7. Choose Close from the File menu to discard the SDS-PAGE Simulation window.
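A protease map reduces to a list of cut positions, and the fragments follow from those. The Python sketch below shows the idea for two of the four reagents, using simplified textbook rules that are assumptions of this example rather than the definitions stored in PROTEAN's protease library: trypsin cuts after K or R unless the next residue is P, and CNBr cuts after M. The rule table, function names, and test peptide are all hypothetical.

```python
import re

# Zero-width patterns: a match at index i means the bond after residue i
# (1-based) is cleaved. Simplified rules, assumed for illustration only.
RULES = {
    "Trypsin": re.compile(r"(?<=[KR])(?!P)"),
    "CNBr": re.compile(r"(?<=M)"),
}


def cut_positions(sequence, rule):
    """Return the 1-based residue numbers after which the sequence is cut."""
    return [m.start() for m in rule.finditer(sequence)
            if 0 < m.start() < len(sequence)]


def digest(sequence, rule):
    """Return fragments as (start, end, peptide), 1-based and inclusive."""
    bounds = [0] + cut_positions(sequence, rule) + [len(sequence)]
    return [(s + 1, e, sequence[s:e]) for s, e in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    peptide = "MAKRPSLKGM"  # hypothetical test peptide
    for name, rule in RULES.items():
        cuts = cut_positions(peptide, rule)
        print(f"{name} ({len(cuts)} cuts): {digest(peptide, rule)}")
```

The number of cuts printed for each reagent corresponds to the cut frequency shown in each protease map label; fragment masses for a gel simulation would be computed from the peptides returned by `digest`.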
3.5. Pattern Searches
The More Methods menu of the Method Curtain also contains the submenu for pattern searches. When you perform a pattern search, PROTEAN opens an interface to the PROSITE database and scans your sequence against the peptide motifs contained therein. Once patterns are located, you apply and customize them like any other method. Patterns are especially useful for creating sequence annotations or features. You can also open the PROSITE database within PROTEAN to gather all relevant information regarding pattern hits.
1. Open the Method Curtain and choose PROSITE Database from the Patterns submenu, under the More Methods menu. If other methods remain open within the Method Curtain, click the triangle icon to their left to close them. You may get the warning message: “The LASERGENE CD does not match the current configuration. Install this one?” If this occurs, make sure the Protein Data LASERGENE CD is inserted in your CD reader and click OK. After a moment configuring the database, PROTEAN displays Patterns-PROSITE Database at the top of the Method Curtain.
2. Click the triangle icon to the left of Patterns-PROSITE Database to initiate the patterns search.
3. After the search is completed, drag the pattern hit, COFILIN-TROPOMYOSIN, to the assay surface and close the Method Curtain. If desired, customize the pattern’s appearance.
4. Press on the 100 marker for the COFILIN-TROPOMYOSIN-PROSITE Database Region Plot. A label for the pattern is displayed beneath the plot. The label contains the name of the pattern, its PROSITE sequence syntax, the percent match, and the coordinates of the match.
5. Double-click on the 100 marker for the COFILIN-TROPOMYOSIN-PROSITE Database Region Plot. The parameters dialog appears for the patterns search. This is where the match threshold is set and where you choose whether to use all patterns or to skip the frequent patterns in PROSITE, such as glycosylation sites.
6. Click Show Database. PROTEAN opens the PROSITE database, displaying the single line view. This view presents the Accession Number and the Identifier from the PROSITE database. See the GENEMAN module for more on database structure and searches (see Note 7).
7. Click the title bar of the Assay Document. Reopen the Method Curtain if necessary. Within the Method Curtain, double-click on the COFILIN-TROPOMYOSIN pattern. PROTEAN browses through the PROSITE database and locates the COFILIN-TROPOMYOSIN pattern, displaying all fields for this entry. In the Commentary field, PROTEAN presents specific information on where the pattern is derived from and the history of parent proteins providing this pattern. Windows users may need to click the title bar for the PROSITE database window to activate it.
8. Close the PROSITE Database by choosing Close from the File menu.
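To make the "PROSITE sequence syntax" shown in a pattern label more concrete, here is a small Python sketch that translates a pattern written in that syntax into a regular expression and reports exact matches. It covers only the common elements (single residues, `x`, `[...]`, `{...}`, repeat counts, and the `<` and `>` anchors), performs exact matching only (no percent-match threshold), and is not how PROTEAN itself searches. The function names and the test sequence are hypothetical; the example pattern is the widely quoted N-glycosylation motif `N-{P}-[ST]-{P}`.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        if "(" in element:               # repeat count, e.g. x(2,4)
            element, count = element.rstrip(")").split("(")
            repeat = "{" + count + "}"
        if element == "x":               # any residue
            core = "."
        elif element.startswith("{"):    # any residue except those listed
            core = "[^" + element[1:-1] + "]"
        else:                            # one residue or an [ABC] choice
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_hits(sequence, pattern):
    """Return non-overlapping hits as (start, end, match), 1-based inclusive."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.end(), m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}."))
    print(find_hits("MKNGSAANPTP", "N-{P}-[ST]-{P}."))  # made-up sequence
```

Short, permissive motifs such as this one match very often by chance, which is the practical reason for the option to skip frequent patterns mentioned in step 5.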
3.6. Features
Features are a way to annotate Assay Documents for regions of interest. You can create your own features or modify existing features. When PROTEAN interprets a protein sequence document, it looks to see if there are any feature entries in the comments window, denoted by the label FT. If they exist, PROTEAN will group all features within the Method Curtain and display any features at the bottom of the assay surface.
1. Close the Method Curtain, if it is still open. Scroll to the bottom of the Destrin ADF assay. Two features are displayed as methods on the assay surface: NUCLEAR and ACTIN-BINDING.
2. Double-click on the ACTIN-BINDING feature. This opens the Feature Editor’s Location window for the ACTIN-BINDING feature. This is where you name the feature and its segments, add or remove any additional segments, and determine how to display the title and segment names.
3. Click the Description button. This switches to the Feature Editor’s Description window for the ACTIN-BINDING feature. This is where you record any notes about the feature and assign a Swiss-Prot/NBRF-PIR key. Since this feature came straight from the comments window of the EDITSEQ document, both variables have already been recorded.
4. Click the Style button. This switches to the Feature Editor’s Style window. PROTEAN used the default style scheme to decorate this feature. You can modify the color, font, segment outline (note the β sheet and α helix representations), point size, and linkage globally or for each segment in the feature.
5. Choose Red from the Color drag-down menu and click OK. The ACTIN-BINDING feature now has a red outline. If you want to create your own features, select a region with the Range Selection tool and choose New Feature from the Sites & Features menu.
3.7. Saving the PROTEAN Assay Document
1. Choose Save from the File menu to open the save dialog.
2. Locate a position for your Assay Document and click Save (Macintosh) or OK (Windows).
All analysis method plots, including features, protease maps, and patterns, are recorded.
3.8. Printing in PROTEAN
1. Click on the window you wish to print and choose Print from the File menu (see Note 8). PROTEAN can print from any view except the PROSITE database view. Use GENEMAN to print from a database view. In all other views, you need to make a view active (the topmost window) in order to print it.
2. Configure any printer settings necessary and click Print (Macintosh) or OK (Windows) to print the active window.
4. Notes
1. EDITSEQ is the LASERGENE module for sequence creation and editing. It is also where you import foreign sequence, either from another sequence file format or across the clipboard.
2. Users who need to share data across both platforms can do so by using the correct Windows nomenclature for shared files. To read a PROTEAN Assay Document created on the Macintosh, Windows users need only rename it NNNNNNNN.PAD (N is any character, with a limit of eight). Protein sequence files require the format NNNNNNNN.PRO. Macintosh users do not need to rename Windows files or modify the file type and creator for them to be interpreted correctly.
3. The Default Method Outline can be modified as you see fit. Customize the assay surface and choose Save as Default Method Outline from the Analysis menu. When you create your next assay, PROTEAN will use your new default methods. In a similar manner, you can create custom outlines by choosing Save Method Outline from the Analysis menu for a given set of methods. Retrieve this set by choosing Apply Method Outline from the same menu.
4. Pressing on any method on the assay surface will open the label for this region. The label gives information about the method and the current parameters. If you double-click any method, its parameters dialog is opened. Here you can modify the parameters responsible for a given result. If you want to compare a single method with multiple parameter settings, add another copy of the method to the Method Curtain (by choosing its name from the More Methods menu), modify its parameter settings, and apply it to the assay surface.
5. The Zoom In and Zoom Out palette tools (the two magnifying glasses) also control the scale on the Assay. To use them, click either tool and click on the area of the assay to magnify or reduce. You can also drag to zoom to an area you define.
6. The analysis methods fall into four basic predictive categories: structural prediction, hydropathy prediction, antigenic site determination, and surface characteristics. All are contained within the More Methods menu of the Method Curtain. You can change the parameters for any method by double-clicking its name, either within the Method Curtain or on the Assay Document.
7. GENEMAN is the LASERGENE module for database searching and sequence retrieval. Use this program to locate proteins of interest and retrieve them to your hard drive. You can perform three types of searches in GENEMAN: text or word searches, sequence similarity searches, and consensus sequence searches. The consensus sequence search is used to search a short peptide motif against a database of protein sequences.
8. Graphic views can be exported by copying to the clipboard and pasting into a graphics or word processor application. Macintosh graphics are exported as PICTs, whereas Windows graphics are exported as Windows metafiles (WMF).
References
1. Kimura, M. (1983) The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge, UK.
2. Chothia, C. and Lesk, A. M. (1986) The relation between the divergence of sequence and structure in proteins. EMBO J. 5, 823-826.
3. Chou, P. Y. and Fasman, G. D. (1978) Prediction of the secondary structure of proteins from their amino acid sequence. Adv. Enzymol. 47, 45-148.
4. Chou, P. Y. (1990) Prediction of protein structural classes from amino acid composition, in Prediction of Protein Structure and the Principles of Protein Conformation, Plenum, New York, pp. 549-586.
5. Garnier, J., Osguthorpe, D. J., and Robson, B. (1978) Analysis of the accuracy and implications of simple method for predicting the secondary structure of globular proteins. J. Mol. Biol. 120, 97-120.
6. Kyte, J. and Doolittle, R. F. (1982) A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157, 105-132.
7. Eisenberg, D., Weiss, R. M., and Terwilliger, T. C. (1984) The hydrophobic moment detects periodicity in protein hydrophobicity. Proc. Natl. Acad. Sci. USA 81, 140-144.
8. Karplus, P. A. and Schulz, G. E. (1985) Prediction of chain flexibility in proteins. Naturwissenschaften 72, 212, 213.
9. Jameson, B. A. and Wolf, H. (1988) The antigenic index: a novel algorithm for predicting antigenic determinants. CABIOS 4, 181-186.
10. Emini, E. A., Hughes, A. J., Perlow, D., and Boger, J. (1985) Induction of hepatitis A virus-neutralizing antibody by a virus-specific synthetic peptide. J. Virol. 55, 836-839.
11. Deléage, G. and Roux, B. (1987) An algorithm for protein secondary structure prediction based on class prediction. Protein Eng. 1, 289-294.
MAPDRAW
Restriction Mapping and Analysis
Thomas N. Plasterer
1. Introduction
A bacterial defense mechanism has proven to be one of the most valuable tools in the modern molecular biology laboratory. Restriction endonucleases and recombinant DNA technology have been invaluable in genome characterization, mapping, sequencing, and amplification. The recognition and excision of unique nucleotide sites has enabled complex DNA manipulation, beginning with molecular cloning and ending in finished, well elucidated nucleotide sequences. The most practical restriction endonucleases, or restriction enzymes, are the class II enzymes. These enzymes recognize and bind to specific nucleotide sequences, and catalyze double strand cleavages at specific phosphodiester bonds either within the recognition sequence or a precise short distance away (2). At the time of this writing, over 2500 unique type II restriction endonucleases existed and 415 were commercially available (2). Although this abundance expands the utility of recombinant technology, it likewise creates a management problem resulting from increasing numbers of candidates. MAPDRAW, one of seven programs in the LASERGENE suite, addresses this problem in two ways. First, it groups enzymes with identical behavior, including their cognate methylase, into a unique isoschizomer class, and second, it allows the user to interactively build enzyme subsets to their specifications.
The first operation is performed as the enzyme library file is created. All members of REBASE (a restriction enzyme and methylase database maintained by New England Biolabs), including type I and type III, are collected and cross-indexed with other restriction endonucleases sharing the same recognition sequence, restriction site, and cognate methylase behavior, in a group defined as an isoclass (see Note 1). From these isoclasses, a representative is chosen and placed in the enzyme file used by MAPDRAW. Each representative retains a list of other members of the isoclass, allowing users to locate any enzyme of this specificity, even if known by a different name than LASERGENE’s representative.
In the interactive part of the enzyme selection process, MAPDRAW provides four primary and two combination filters. These enable you to sort restriction enzymes into subsets based on compatibility of sticky or blunt ends, site frequency, enzyme class and complexity, and manual selection. At any time in the filtering process, you can produce seven distinct restriction maps for experimental design, analysis, and presentation of your results. MAPDRAW scans and labels the DNA sequence with a set of restriction enzymes, and it labels features denoted in the comments window of a GenBank or EMBL sequence document. Results can be used immediately for publication or exported to any word processing or graphics program for further manipulation. DNA or protein sequences required for restriction maps can be obtained from exported database entries, EDITSEQ (see Note 2) sequence documents, foreign software, or ASCII text.
To display the high degree of commonality between implementations of the program on different platforms, the illustrations within this chapter are taken from a mixture of the Macintosh (system 7.5) and Windows 95 versions.
From: Methods in Molecular Medicine, Sequence Data Analysis Guidebook. Edited by S. Swindell. Humana Press Inc., Totowa, NJ
2. Materials
Users need only satisfy materials criteria for either the Macintosh system or the Windows system.
2.1. Hardware
2.1.1. Macintosh Hardware
1. Any Macintosh computer.
2. Minimum memory requirements of 4 Mb RAM (8 Mb RAM or more is recommended).
3. Minimum free hard disk space of 25 Mb. More may be required because of the creation of temporary files.
4. Macintosh-compatible monitor (256-color monitor is recommended).
5. Macintosh-compatible printer (laser printers are recommended).
2.1.2. Windows Hardware
1. Any personal computer.
2. Minimum memory requirements of 4 Mb RAM (8 Mb RAM or more is recommended).
3. Minimum free hard disk space of 25 Mb. More may be required because of the creation of temporary files.
4. Windows-compatible monitor (256-color monitor is recommended).
5. Windows-compatible printer (laser printers are recommended).
2.2. Software
2.2.1. Macintosh Software
1. Macintosh system software 6.0.1 or higher.
2. The LASERGENE application MAPDRAW. Follow the instructions provided by DNASTAR, Inc. (Madison, WI) for installing the LASERGENE software.
3. Accessory files for MAPDRAW, including the restriction enzyme library file, Enzymes.
2.2.2. Windows Software
1. Disk Operating System (DOS) version 5.0 or higher.
2. Microsoft Windows version 3.1 or higher.
3. The LASERGENE application MAPDRAW. Follow the instructions provided by DNASTAR, Inc. for installing the LASERGENE software.
4. Accessory files for MAPDRAW, including the restriction enzyme library file, ENZYMES.EZD.
2.3. Data
Any DNA or protein sequence document in LASERGENE format. For this tutorial, use the sequence file Owl Monkey Insulin (Macintosh users) or owlmkins.seq (Windows users) (see Note 3). Macintosh users can locate Owl Monkey Insulin in the demo sequences folder, within their DNASTAR folder. Windows users can locate owlmkins.seq in the demo-seq directory, within the WINSTAR directory.
2.4. Optional
2.4.1. Macintosh Software
1. Any graphics program capable of handling PICT file input.
2. Any word processor for manipulating exported ASCII text.
2.4.2. Windows Software
1. Any graphics program capable of handling Windows metafile (WMF) input.
2. Any word processor capable of manipulating ASCII text.
3. Methods
3.1. Opening MAPDRAW
3.1.1. Macintosh MAPDRAW
1. Locate the DNASTAR folder and double-click on it to open.
2. Within the DNASTAR folder, locate the MAPDRAW icon (Fig. 1). Double-click the MAPDRAW icon to launch the application.
Fig. 1. The MAPDRAW icon.
3.1.2. Windows MAPDRAW
1. Locate the DNASTAR program group and double-click on it to open.
2. Within the DNASTAR program group, locate the MAPDRAW icon (Fig. 1). Double-click the MAPDRAW icon to launch the application.
If your LASERGENE system contains demonstration files, you may get the warning: “The file Demo Enzymes is available. Shall I use it?” Click No at the prompt. Demo Enzymes is a demonstration subset of the complete enzyme file.
3.2. Creating Restriction Maps
MAPDRAW opens with a blank screen, allowing you to modify the enzyme file, open an existing restriction map, or create a new map. If a default enzyme filter (see Note 4) exists, it will be applied to any new restriction maps but will be ignored by existing maps. You can create maps from LASERGENE DNA sequence files or LASERGENE protein sequence files (see Note 5). Any proteins are reverse-translated according to the active genetic code.
1. Choose New from the File menu.
2. Locate the Owl Monkey Insulin sequence:
a. Macintosh users: In the File dialog, locate the DNASTAR folder in the left window (it may already be open as the default). Scroll down this window and double-click the Demo Sequences folder. Within the Demo Sequences folder, scroll down to Owl Monkey Insulin. Double-click Owl Monkey Insulin to create the new map.
b. Windows users: In the Enter Sequence dialog, locate the WINSTAR directory in the Directories window (it may already be open as the working directory). Scroll down the Directories window and double-click the demo-seq directory. In the File Name window, scroll until you see owlmkins.seq. Double-click owlmkins.seq to create the new map.
MAPDRAW creates the map, displaying it in the default Site & Sequence view (see Section 3.4.1.).
3.3. Filtering Restriction Enzymes
All enzymes in the enzyme library are applied on creation of a new restriction map. The release library at the time of this writing (July, 1995) contains 366 members, making analysis unwieldy for all but the shortest sequences. The first operation is to filter enzymes for a more useful set.
Fig. 2. The Class & Complexity Filter dialog box. This allows you to select enzyme subsets based on their class, cost, availability, and site complexity.
3.3.1. The Class & Complexity Filter
Class & Complexity filters allow you to apply an enzyme class as a subset. These subsets include class I (random), class II (precise), or both. This filter can also be used to create subsets based on enzyme cost, availability, and site complexity (see Note 6).
1. Choose Class & Complexity from the New Filter submenu, under the Enzymes menu. This command opens the Class & Complexity filter window. At this window you can define and name filters (Fig. 2).
2. Click the radio button for Class II (Precise) and the check box for Commercially Available, and type the name Available Class II in the Filter Name text box.
3. Click Apply followed by OK.
At the top of the map, the header shows that the number of applied enzymes has dropped from 366 to 206. Although this is a more useful subset, it still contains too many enzymes for easy analysis.
3.3.2. The Frequency Filter
Frequency Filters subset enzymes according to the number of restriction sites each enzyme has in the sequence, or in a subrange of the sequence. Use this filter if you want to find all enzymes that cut only once for a given sequence or to exclude sites in a particular domain.
1. Choose Frequency from the New Filter submenu, under the Enzymes menu.
2. Enter the name “Owl Monkey Insulin-no cut in CDS” in the Filter Name text box (Fig. 3).
Fig. 3. The Frequency Filter dialog box. This allows you to select enzymes based on the frequency of their recognition sequence within the selected region of DNA.
Fig. 4. The Map filter. This allows you to apply enzyme filters to specific map documents and combine their effects.
3. Enter the coordinates 748 and 1847 in the 5’ and 3’ subrange fields.
4. Click Linear, Include Subrange Context, and Certain Sites Only.
5. Enter 0 and 0 in the Min and Max Frequency Limits fields.
6. Click Apply followed by OK.
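The logic of a frequency filter can be sketched in a few lines of Python: count each enzyme's recognition sites inside the subrange and keep the enzymes whose count falls between the minimum and maximum limits. This is only an approximation written for illustration. It counts exact, unambiguous recognition sequences on the top strand, requires the whole site to lie inside the subrange, and ignores options such as Include Subrange Context, whose precise behavior in MAPDRAW is not assumed here. The function names and the toy sequence are hypothetical; the three recognition sequences are the standard ones for EcoRI, BamHI, and HindIII.

```python
def count_sites(sequence, site, start, end):
    """Count occurrences of site lying wholly inside start..end (1-based)."""
    region = sequence[start - 1:end].upper()
    site = site.upper()
    return sum(1 for i in range(len(region) - len(site) + 1)
               if region[i:i + len(site)] == site)


def frequency_filter(sequence, enzymes, start, end, min_cuts=0, max_cuts=0):
    """Return the enzyme names whose site count in the range is within limits."""
    return sorted(name for name, site in enzymes.items()
                  if min_cuts <= count_sites(sequence, site, start, end) <= max_cuts)


if __name__ == "__main__":
    enzymes = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}
    # Toy sequence: EcoRI site at 1-6, BamHI at 10-15, HindIII at 19-24.
    toy = "GAATTCAAAGGATCCTTTAAGCTT"
    # Limits of 0 and 0 keep only enzymes that do not cut inside 7..18.
    print(frequency_filter(toy, enzymes, 7, 18, 0, 0))
```

Setting both limits to 1 instead would return the single cutters for the chosen range, the other use mentioned at the start of this section.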
The filter dialog should match Fig. 3 before applying. This filter creates a subset of 99 enzymes that do not cut inside the coding region of the Owl Monkey insulin gene (see Note 7). As filters are created and applied, the contents of the Map Filter are modified as well (see Note 8).
3.3.3. The Map Filter
The Map Filter is a map-specific master filter that establishes the subset for each map document. The Boolean operators AND and OR are used in the Map Filter to determine the relationships between applied filters (Fig. 4).
1. Open the Map Filter:
a. Macintosh users: Choose Show Owl Monkey Insulin Map: Filter from the Map menu.
b. Windows users: Choose Show OWLMKINS.MPD: Filter from the Map menu.
The Map Filter can also be opened by clicking the Map Filter button, the second palette tool from the top, on the left side of the map document. All filters present in the filter list appear on the right half of the Map Filter. Like the restriction enzymes, these filters are also stored in the enzyme library file. Double-clicking a filter name causes it to open in the relevant Filter Editor window. Use the Map Filter to change restriction enzyme content for the whole map. To apply a filter, drag it from the list on the right to the word AND or OR in the left window and click Apply or OK to initiate the change.
2. Click Display Enzymes. This opens a window displaying all of the enzymes contained in the subset of the Map Filter. On the left side MAPDRAW displays all Picked enzymes and the right half contains the Remaining enzymes that have been filtered out.
3. Close the Results window:
a. Macintosh users: Click the close box in the upper-left corner.
b. Windows users: Double-click the Control menu box in the upper-left corner.
4. Click Cancel in the Map Filter to discard changes to the current enzyme subset.
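Combining filters with AND and OR corresponds to set intersection and set union on the enzyme subsets each filter produces. The following Python sketch shows that relationship; the function name is hypothetical and the enzyme memberships are invented for the example (they are not the actual contents of the Available Class II or no-cut-in-CDS filters).

```python
def combine_filters(subsets, operator):
    """Combine enzyme subsets with 'AND' (intersection) or 'OR' (union)."""
    subsets = [set(s) for s in subsets]
    if not subsets:
        return set()
    if operator == "AND":
        return set.intersection(*subsets)
    if operator == "OR":
        return set.union(*subsets)
    raise ValueError("operator must be 'AND' or 'OR'")


if __name__ == "__main__":
    # Hypothetical library and filter results, for illustration only.
    library = {"BamHI", "EcoRI", "HindIII", "MslI", "SapI"}
    available_class_ii = {"BamHI", "EcoRI", "HindIII", "MslI"}
    no_cut_in_cds = {"EcoRI", "MslI", "SapI"}

    picked = combine_filters([available_class_ii, no_cut_in_cds], "AND")
    remaining = library - picked
    print("Picked:   ", sorted(picked))
    print("Remaining:", sorted(remaining))
```

The Picked and Remaining lists printed here play the same roles as the two halves of the Display Enzymes window described in step 2.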
3.4. Manipulating Map Views
After you have determined which enzymes to use, you can begin analysis of your sequence. MAPDRAW provides six distinct map views to help examine your nucleotide sequence. Three views assist in manipulating your sequence, two create publication-quality output, and one locates open reading frames.
3.4.1. The Site & Sequence View
The Site & Sequence view is the first of six map views. At this view, you can display both strands of the sequence, sequence rulers, translated reading frames, and features. Restriction enzymes are drawn above the top strand. You can customize the number of residues, the reading frames shown, and the orientation of the enzyme display.
1. Choose 3 letter Amino Acid Code from the Options menu. After executing this command your map will appear similar to Fig. 5. You may need to resize the map to see more detail.
2. Choose Go to Position from the Edit menu. In the dialog window type 2053 and click OK. MAPDRAW scrolls to residue 2053 and highlights it.
3. Choose Horizontal from the Enzyme Display submenu, under the Options menu. The orientation of restriction enzyme labels is changed from vertical to horizontal.
4. Display the restriction site for MslI (Fig. 6).
a. Macintosh users: Hold down the Option key and press the restriction enzyme label for MslI.
b. Windows users: Hold down the Alt key and press the restriction enzyme label for MslI.
Fig. 5. The Site & Sequence View. This window displays the DNA sequence in both strands along with sequence rulers, six frame translations, and marked features.
Fig. 6. By clicking on the restriction site label for an enzyme, in this case MslI, and holding down the Option (Macintosh) or Alt (Windows) key, the actual recognition site of that enzyme is superimposed on the sequence. The recognition site is highlighted and the cut site is denoted by dotted lines.
MAPDRAW displays the recognition sequence between the strands. If you release either the Option/Alt key or the mouse button, the region is highlighted.
5. Release the mouse button and the Option/Alt key. Double-click the label for MslI. This opens the Enzyme Editor for MslI (Fig. 7). This is where you modify enzyme information, including its name, isoschizomer list, price, vendors, recognition sequence, class, and availability. To edit any enzyme, click the lock icon in the upper-right corner to unlock and then proceed to make changes. You can open the Enzyme Editor whenever an enzyme name is displayed.
6. Click Cancel to return to the Site & Sequence view.
7. Choose Line Layout from the Options menu. In the Line Layout dialog, set the display options by clicking in the appropriate check boxes. To create a default style, click Save as Default. When you are satisfied with the layout, click OK.
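Locating a recognition site on a sequence, as the Option/Alt-click display does, amounts to matching a pattern that may contain ambiguity codes. The Python sketch below expands IUPAC nucleotide codes into a regular expression and reports 1-based start positions on the top strand. It is an illustration only, not MAPDRAW's search routine; the function names and test sequences are made up, and the MslI site is written here as CAYNNNNRTG on the assumption that this matches the entry in your enzyme library, so check it in the Enzyme Editor before relying on it.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def site_regex(site):
    """Compile a recognition sequence into a regex that finds overlapping hits."""
    body = "".join(IUPAC[base] for base in site.upper())
    return re.compile("(?=(" + body + "))")  # lookahead permits overlaps


def find_sites(sequence, site):
    """Return 1-based start positions of the site on the top strand."""
    return [m.start() + 1 for m in site_regex(site).finditer(sequence.upper())]


if __name__ == "__main__":
    # Assumed MslI recognition sequence; the test sequence is invented.
    print(find_sites("TTCACAAAAGTGTT", "CAYNNNNRTG"))
    print(find_sites("GAATTCGAATTC", "GAATTC"))
```

Because only the top strand is scanned, this is sufficient for palindromic sites such as the two shown; a non-palindromic site would also need its reverse complement searched.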
Fig. 7. The Enzyme Editor. This allows enzyme information, including its name, isoschizomer list, price, vendor, recognition sequence, class, and availability to be modified.
Fig. 8. The View Selector button bar. In order, the buttons represent the Site & Sequence, Worksheet, Linear Minimap, Linear Illustration, Circular Illustration, and ORF Map views.
3.4.2. The Linear Minimap
This map view displays an individual restriction map for every enzyme in the current map filter. There are a number of ways to sort enzymes in this view.
1. Choose Linear Minimap from the Map menu. You can also open the Linear Minimap, or any map view, by choosing its icon from the top palette tool (Fig. 8), the View Selector button. The Linear Minimap displays the enzymes in the current map filter in alphabetical order. Features are drawn beneath the restriction enzymes. You can change the order of enzyme presentation using the Sort button (Fig. 9).
2. Press the Sort button (the bottom palette tool) and choose Sort by Cutting Frequency from the pop-up menu. Restriction enzyme maps are now presented according to cutting frequency. You can also sort around a selected region or feature. Click a feature and choose Sort by Cuts Close to Selection from the Sort pop-up menu to do so.
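The two sort orders described above can be expressed as sort keys over a table of cut positions. The Python sketch below is one plausible reading, not MAPDRAW's implementation: frequency sorting orders enzymes by their number of cuts, and proximity sorting orders them by the distance from the selection to the nearest cut (zero if a cut falls inside it). The function names and cut positions are invented for the example.

```python
def sort_by_frequency(cut_map):
    """Order enzyme names by number of cuts, then alphabetically."""
    return sorted(cut_map, key=lambda name: (len(cut_map[name]), name))


def sort_by_proximity(cut_map, sel_start, sel_end):
    """Order enzyme names by distance from the selection to the nearest cut."""
    def nearest(name):
        distances = [
            0 if sel_start <= pos <= sel_end
            else min(abs(pos - sel_start), abs(pos - sel_end))
            for pos in cut_map[name]
        ]
        return min(distances, default=float("inf"))  # non-cutters sort last
    return sorted(cut_map, key=lambda name: (nearest(name), name))


if __name__ == "__main__":
    # Hypothetical cut positions on a 2113 bp sequence.
    cuts = {"BamHI": [2000], "EcoRI": [120, 700], "HindIII": [800]}
    print(sort_by_frequency(cuts))
    print(sort_by_proximity(cuts, 748, 1847))
```

Ties are broken alphabetically so that the order is reproducible from run to run.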
3.4.3. The Illustration Views
MAPDRAW provides two views to design publication-quality restriction map documents, the Linear Illustration and the Circular Illustration. Both views are customized in the same manner.
Fig. 9. The Linear Minimap View. This view displays the restriction map for every enzyme in the current map filter.
Fig. 10. The Linear Illustration view.
1. Choose Linear Illustration from the Map menu. In a few seconds, MAPDRAW displays the Linear Illustration view for the Owl Monkey Insulin gene (Fig. 10). In the illustration views, screen scaling is independent of printer scaling. You can zoom in and out on your maps without changing the size of the printed page.
2. Click the Zoom In button (the magnifying glass with the “+” sign) and click anywhere in the Linear Illustration (see Note 9). MAPDRAW expands the map view to show more detail. You can perform any number of zooming operations and restore the original scale by choosing Show Actual Size from the Options menu. Changing the printed page size is a slightly different operation.
3. Choose Drawing Size from the Options menu (Fig. 11).
4. Click anywhere in the page window to resize the printed map to that size and click OK. The intersection of the two red Minimum lines is the smallest possible size for your map, given the current information content. Use this dialog to make the smallest possible map or wall-size posters. Display options that determine the information content are the number of applied restriction enzymes, the font size, horizontal or vertical enzyme display, and uncertain sites. Commands for most of these items are under the Options menu.
5. Choose Circular Illustration from the Map menu (Fig. 12).
6. Resize the Circular Illustration by choosing Drawing Size from the Options menu. Try to fit the circular map on one page. It may help to choose a smaller font from the Font Size submenu, under the Options menu (see Note 10).
Fig. 11. Map Drawing Size dialog box.
3.5. Annotating DNA Sequences
Features are used to annotate restriction maps for regions of interest, such as motifs, domains, transcription factors, promoters, and binding sites. Create your own features based on your findings or modify existing features. When MAPDRAW interprets a sequence document, it looks to see if there are any feature entries in the comments window. If any exist, MAPDRAW displays each on the map views. In the current map view, the Circular Illustration, there are a number of features displayed as arrows on the inside of the sequence. You can modify a feature on each map view or from the features list.
Fig. 12. The Circular Illustration View. The view is shown with the Font Size adjusted to 10 and the drawing size adjusted.
Fig. 13. The Feature List.

1. Choose Show Feature List from the Features menu (Fig. 13).
2. Double-click on the "preproinsulin--CDS (748>1847)" feature. This opens the Feature Editor's Location window for the preproinsulin-CDS (748>1847) feature. This is where you name the feature and its segments, add or remove any additional segments, determine how to display the title and segment names, and decide which map view will show this feature. Notice that this feature is made up of two segments, reflecting the two exons joined to construct the finished protein. You can also open any Feature Editor by double-clicking a feature in any map view.
3. Change the Title to Preproinsulin CDS and the Segment Name to Insulin CDS. Click the Description button. This switches to the Feature Editor's Description window for the preproinsulin-CDS (748>1847) feature. This is where you record any notes about the feature and assign a GenBank/EMBL key. Since this feature is derived from the comments window of the EDITSEQ document, both variables have already been recorded.
4. Click the Style button. This switches to the Feature Editor's Style window. MAPDRAW used the default style scheme to decorate this feature. You can modify the color, font, segment outline, point size, and linkage. Linkage determines how to represent areas between segments, such as introns.
5. Choose Blue from the Color pull-down menu and click OK. The Preproinsulin CDS feature now has a blue outline.
If you want to create your own features, select a region in any map view and choose New Feature from the Features menu or click the New Feature button, the fifth palette tool from the top.

3.6. Saving the Restriction Map
1. Choose Save from the File menu to open the save dialog.
2. Locate a position for your Restriction Map and click Save (Macintosh) or OK (Windows). All feature annotations and enzyme filters associated with the map are recorded. You can also save certain graphic views as PICTs (Macintosh) or Windows Metafiles (Windows), or save the map as a new nucleotide sequence document that retains the map feature annotations.
3.7. Printing in MAPDRAW
1. Close the Features List, if still open. 2. Switch to either the Linear Illustration or the Circular Illustration by choosing these commands from the Map menu. 3. Choose Print from the File menu. Within the print dialog window, click Print (Macintosh) or OK (Windows).
The resolution of the printed page is often much better than the screen image presented in each map view. A vector image is sent to the printer, whereas the screen receives a bitmapped image (Fig. 14). The vector image is more easily scaled than the bitmap image and looks better on postscript printers. You can also copy and paste map views into word processing or graphics programs and maintain the vector format (see Note 11).
Fig. 14. A printed Circular Illustration image, scaled to 55%. This demonstrates the much higher resolution of the printed vector image compared to the screen bitmapped image. (Compare with Fig. 12.)
4. Notes
1. The definition of isoclass was required because of confusion with the term isoschizomer. In some instances, an isoschizomer is defined as all restriction enzymes sharing the same recognition sequence only (more strictly defined as a neoschizomer). More commonly, isoschizomers are defined as restriction enzymes sharing the same recognition sequence and cut site. This was too broad for a definition of unique functionality required by the MAPDRAW program, because neither definition accounts for divergent behavior of cognate methylases. For example, HpaII and MspI have the same recognition sequence and cut site, but methylate at different bases (Fig. 15). Although the current implementation of MAPDRAW is unable to sort restriction enzymes by methylase behavior, future developments will incorporate this functionality. The enzyme library contains a representative from all unique isoclasses, which may mean that an isoschizomer has more than one entry (HpaII, MspI). The decision to include or segregate restriction enzymes based on methylase behavior has been left to the individual researcher.
[Fig. 15 panels: recognition sequence and cut site (CCGG/GGCC); HpaII methylation; MspI methylation.]
Fig. 15. An illustration of the isoclass definition. Here HpaII and MspI fit the common definition of isoschizomers but may be distinguished on the basis of the divergent behavior of cognate methylases.

2. EDITSEQ is the LASERGENE module for sequence creation and editing. It is also where you import foreign sequence, either from another sequence file format or across the clipboard.
3. Users who need to share data across both platforms can do so by using the correct Windows nomenclature for shared files. To read a restriction map document created on the Macintosh, Windows users need only rename it NNNNNNNN.MPD (N is any character, with a limit of eight). DNA sequence files require the format NNNNNNNN.SEQ. Macintosh users do not need to rename Windows files or modify the file type and creator to be interpreted correctly.
4. Default Map Filters are created by opening the filter list, selecting a filter, and choosing Make Default Map Filter from the Enzymes menu. A diamond (Macintosh) or asterisk (Windows) appears to the left of the default filter.
5. If a protein sequence is used to create a restriction map, it will be reverse-translated according to the genetic code chosen. The default genetic code, the standard genetic code, will introduce ambiguity in backtranslated proteins. To eliminate ambiguity, use or create a code with a nondegenerate backtranslation.
6. Site Complexity is a measure of degeneracy in the recognition sequence for an enzyme. Nondegenerate bases score 1 (A, C, G, T), partially degenerate score 1/2 (K, M, R, S, W, Y), and fully degenerate score 1/4 (B, D, H, V). The sum of each residue value in the recognition sequence is the site complexity.
7. Frequency Filters based on excluding a region can also be created by selecting a region or feature and clicking the Don't Cut Here button. This button is the fourth palette tool from the top, displaying the scissors and the international "Not" symbol.
8. MAPDRAW provides two additional primary filters: the Overhang filter and the Manual Pick filter. The Overhang filter allows you to subset enzymes based on overhang compatibility, whereas the Manual Pick filter is used to select enzymes from the entire list in the enzyme library. You can use the Browse command to locate specific restriction enzymes in the enzyme list and then create a Manual Pick filter retaining these located enzymes.
9. The Zoom In and Zoom Out palette tools (the two magnifying glasses) also control the scale on the map views. To use, click either tool and click on the area of the assay to magnify or reduce. You can also drag to create zooms to your own defined area.
10. The two neglected map views are the Linear Worksheet and the ORF Map. The Linear Worksheet is similar to the Linear Minimap with the exceptions of coupled printing and scaling, and nonscalable font sizes. This allows you to zoom in and out without making labels and titles too large or too small to see on the screen. The ORF Map is a graphical display of located reading frames in all six frames. You can determine the start and stop codons, require a promoter, and set ORF lengths. Located ORFs are easily exported to EDITSEQ as protein or DNA sequence.
11. Graphic views can be exported by copying to the clipboard and pasting into a graphics or word processor application. Macintosh graphics are exported as PICTs, whereas Windows graphics are exported as Windows Metafiles (WMF).
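The site complexity score described in Note 6 is a simple sum, and a small sketch may make the arithmetic concrete. This is an illustration only, not MAPDRAW's own code: the function name `site_complexity` is hypothetical, and the score table contains only the codes that Note 6 lists (N is not mentioned there, so it is rejected rather than guessed).

```python
# Scores per IUPAC code exactly as listed in Note 6.
# N is not given a score in the note, so it is deliberately omitted.
SCORES = {}
SCORES.update({base: 1.0 for base in "ACGT"})     # nondegenerate
SCORES.update({base: 0.5 for base in "KMRSWY"})   # partially degenerate
SCORES.update({base: 0.25 for base in "BDHV"})    # scored 1/4 in Note 6


def site_complexity(recognition_sequence):
    """Sum the per-residue scores of a recognition sequence (hypothetical helper)."""
    total = 0.0
    for base in recognition_sequence.upper():
        if base not in SCORES:
            raise ValueError(f"Note 6 gives no score for {base!r}")
        total += SCORES[base]
    return total


if __name__ == "__main__":
    print(site_complexity("GAATTC"))  # EcoRI: six nondegenerate bases
    print(site_complexity("GTYRAC"))  # HincII: four nondegenerate, two partial
```

Under this scoring, a six-base site with no degenerate positions scores 6, and each degenerate position lowers the total, so lower scores correspond to less specific recognition sequences.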
Acknowledgment Electronic restriction enzyme database is available at HTTP://WWW.NEB.COM. Data for this volume was taken from REBASE version 508, August, 1995.
References
1. Singer, M. and Berg, P. (1991) Genes & Genomes-A Changing Perspective. University Science Books, Mill Valley, CA, pp. 243-254.
2. Roberts, R. J. and Macelis, D. (1994) REBASE-Restriction Enzymes and Methylases. Nucleic Acids Res. 22, 3628,3629.
21
The Gene Construction Kit
DNA Sequence Analysis and Presentation
Bruce R. Troen

1. Introduction
Restriction enzyme analysis of DNA sequences is an important mainstay of molecular biology laboratories. In the last decade this has been greatly facilitated by the availability of powerful desktop computing software. The advent of the Apple Macintosh computer further allowed the development of programs with graphically intuitive interfaces, thereby enhancing their ease of use. The Gene Construction Kit (GCK) for the Macintosh (Textco, West Lebanon, NH; [email protected]) is a versatile and powerful program that provides a spectrum of tools permitting the user to analyze DNA sequences for daily use at the benchtop and to ultimately create publication-quality documents. Analysis is performed on a construct that contains the DNA sequence and all associated sites and comments. Four windows are utilized (Fig. 1):
1. The Construct Window: in which restriction enzyme and other sites can be marked, regions of interest defined, and the construct displayed and manipulated either graphically or as a formatted text sequence (both DNA and protein).
2. The List Window: which is used to create, maintain, and edit lists of sequences (including restriction enzyme sites, protein binding sequences, oligonucleotides, linkers, and so on).
3. The Gel Window: for generating electrophoretic patterns of single, multiple, and partial digests.
4. The Illustration Window: which is used to create presentations.
The Illustration Window is particularly powerful because graphics and text from the other three windows can be pasted into an Illustration Window and legends can be automatically generated (see Note 1). In addition, constructs can still be analyzed and edited, and drawing tools can be used to enhance illustrations. This chapter will outline an initial restriction enzyme site analysis on a construct and demonstrate both the graphical and text formatting features of the Construct Window. We will subclone an insert into a plasmid vector and, using site analysis, will simulate a gel electrophoresis. Finally, we will utilize the data to create a poster overview of the project.

Fig. 1. GCK windows. The four windows utilized in GCK are the Construct, Gel, List, and Illustration windows.

2. Materials
2.1. Hardware
1. Any Macintosh starting with the Macintosh Plus (see Note 2).
2. Minimum memory required is 1 Mb of RAM, but 2 Mb are preferred.
3. The GCK program disks allow three hard-disk-drive installations. The program and the help file use about 800 Kb of space, and the commercially available enzyme lists require approx 250 Kb. Additional lists containing all known enzymes plus comments consume 1.5 Mb, but these are not essential for using the program.
4. Any Macintosh-compatible printer (a laser printer is preferred) for a printed copy of the results (see Note 3).
2.2. Software
1. Macintosh system software 7.0.x or higher.
2. The Gene Construction Kit application, help file, and enzyme lists.
2.3. Data
1. A nucleic acid sequence stored in GCK or DNA Inspector Macintosh binary file format or in a Text file in the following formats: EMBL, GCG, GenBank, Intelligenetics, NBRF, Pearson, Staden, or pure ASCII (where everything is considered sequence and all nonnucleotide characters are removed).
2. Restriction enzyme lists in GCK format. The program disk contains a list of commercially available enzymes. If users send E-mail to [email protected] and ask to receive an updated restriction enzyme list, they will receive a GCK-formatted collection each month that contains the REBase lists from Richard Roberts. These include:
a. All enzymes that are commercially available.
b. Separate lists of commercially available 4, 5, 6, 7, and 8 base pair cutters.
c. All known enzymes.
d. All known 4, 5, 6, 7, and 8 base pair cutters.
3. A file containing the sequence of the Promega cloning vector pGEM4Z (see Note 4).
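The "pure ASCII" import described in item 1 treats the whole file as sequence and discards anything that is not a nucleotide. The following is a minimal sketch of that idea, not GCK's actual import routine: the function name `clean_raw_sequence` is hypothetical, and the accepted alphabet (A, C, G, T, N) is an assumption, since the chapter does not say which characters GCK keeps.

```python
def clean_raw_sequence(text, alphabet="ACGTN"):
    """Keep only nucleotide characters from raw text (hypothetical helper).

    The default alphabet is an assumption for illustration; the chapter
    does not list the characters GCK accepts.
    """
    allowed = set(alphabet)
    return "".join(ch for ch in text.upper() if ch in allowed)


if __name__ == "__main__":
    # Position numbers, spaces, and line breaks are dropped.
    raw = "1 gaattc ggatcc\n13 aagctt 18"
    print(clean_raw_sequence(raw))
```

Because every character is considered sequence, any stray letter such as an "a" or "t" in a comment line would also be kept, which is why the chapter suggests this option only for files that hold raw sequence without comments.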
3. Methods
3.1. Creating a Construct
1. Launch The Gene Construction Kit program.
2. Close the window called Construct: Untitled-1.
3. Select Import from the File menu.
4. Select the appropriate radio button (i.e., GenBank, if the file is in that format, or Text if the sequence data is present in raw form without comments).
5. Find and open the file containing the pGEM4Z sequence. A window appears called Construct: Untitled-2 containing the sequence in a text view.
6. Save the file to disk and name it pGEM4Z.
7. Invoke the graphic view by selecting the Display Graphics command under the Display submenu of the Construct menu to generate a linear map (Fig. 2A).
8. Circularize the DNA by selecting the Make Circular command from the Construct menu. A circular molecule with a junction marker will now be displayed in the Construct Window (Fig. 2B).
9. Select the junction marker and then delete it by either pressing the delete key or selecting the Clear command in the Edit menu (Fig. 2C).
10. To display the β-lactamase gene of the plasmid, select the Make Region command from the Construct menu. A dialog box will appear (Fig. 3).
Fig. 2. Construct Generation. These six window snapshots depict the various phases in generating a circular construct.
Fig. 3. Make Region dialog box. This dialog box permits specification of the parameters of a region or protein sequence.
11. Type in the name, comments, and the positions of the first and last nucleotide of the gene. This information should form part of the file acquired from Promega or GenBank.
12. Click on the checkbox next to Protein Sequence and click the OK button. An arrow now parallels the β-lactamase gene (Fig. 2D). Double-clicking on the arrow will select the corresponding region of the plasmid. By using the Fill and Lines submenus of the Format menu, you can alter the pattern and thickness of the β-lactamase region (Fig. 2E).
3.2. Marking Sites and Locations
1. Choose Open under the File menu.
2. Select the List radio button.
3. Select the file Commercially Available. You can open as many List files as desired.
4. Select Construct: pGEM4Z from the Window menu to make it the active window.
5. Select Mark Sites from the Construct menu. This will present you with a dialog box that allows you to mark sites in the construct using enzymes from many different lists (Fig. 4).
6. Double-click on those enzymes whose sites you want displayed in the construct. They will be added to the list at the bottom of the dialog box labeled Sites to Mark. In this example, I have chosen those enzymes within the polycloning site of pGEM4Z plus BglI, DraI, PvuI, PvuII, and XmnI. (For convenience, I have created a List file called My List that contains just these sites; see Note 5.)
7. Select the radio button Show Text and then click on the OK button. The pGEM4Z construct now has restriction enzyme site labels (Fig. 2F). The labels may initially overlap one another, but they can be dragged to the desired positions (see HindIII site in Fig. 2F). The position of the restriction enzyme site labels can also be displayed, and the text can be replaced by symbols (Fig. 1, see Construct:pGEM4Z and Illustration: project overview; see Note 6). The T7 and SP6 promoter primer sites can be labeled by using the Mark Location command under the Construct menu, which presents the user with a dialog box where the specific location information can be supplied (Fig. 5).

Fig. 4. Mark Sites dialog box. This dialog box permits the selection and labeling of enzyme sites within a sequence.

Fig. 5. Mark Location dialog box. This dialog box permits the specification and labeling of a site within a sequence.
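Marking sites amounts to searching the construct for each recognition sequence. The sketch below shows one way this search could work on a circular molecule, where a site may span the origin. It is an assumption-laden illustration, not GCK's algorithm: `find_sites` is a hypothetical name, only nondegenerate recognition sequences are handled, and the toy sequence is invented for the example.

```python
def find_sites(sequence, recognition, circular=True):
    """Return 1-based start positions of a recognition sequence (hypothetical helper).

    Only exact, nondegenerate matches on the given strand are reported.
    """
    seq = sequence.upper()
    site = recognition.upper()
    if not site:
        raise ValueError("recognition sequence must not be empty")
    # On a circular molecule a site can span the origin, so the first
    # len(site) - 1 bases are appended before searching.
    searched = seq + seq[:len(site) - 1] if circular else seq
    positions = []
    start = searched.find(site)
    while start != -1:
        positions.append(start + 1)
        start = searched.find(site, start + 1)
    return positions


if __name__ == "__main__":
    # Invented 18-base toy sequence; GAATTC spans the origin when circular.
    toy = "TTCCCCGGGAAGCTTGAA"
    for name, site in [("SmaI", "CCCGGG"), ("HindIII", "AAGCTT"), ("EcoRI", "GAATTC")]:
        print(name, find_sites(toy, site), find_sites(toy, site, circular=False))
```

The comparison between the circular and linear searches in the example shows why the topology of the construct matters when sites are marked.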
3.3. Text Formatting of Sequences
The Construct Window also affords the user extensive formatting options when displaying the sequence as text.
1. Select the Display Sequence command from the Display subheading of the Construct menu (Fig. 6).
Fig. 6. Display Sequence command. This snapshot depicts the Display Sequence submenu command that converts the graphical representation of the construct to text.
Fig. 7. Show Positions command. This snapshot depicts the Show Positions submenu command that displays numbering for the nucleotide and amino acid sequences.

2. Select the Show Positions command from the Display submenu within the Construct menu (Fig. 7). Distinct numbering for both the nucleotide and amino acid sequences is exhibited (Fig. 8). The Display submenu also allows you to show double-stranded sequence and line borders, and to set the space between lines. The text display of Construct:pGEM4Z already contains the restriction enzyme site markers, the SP6 and T7 sites, and the amino acid translation of the β-lactamase gene that were created in the graphics display (Fig. 8).
Fig. 8. Construct Sequence Formatting. This snapshot presents some of the formatting capabilities of GCK while displaying the construct as a sequence.

3. Ensure that the insertion point is in the text by clicking on the sequence. Choose Select All from the Edit menu.
4. Select Group by Tens from the Grouping subheading of the Format menu. By double-clicking on the amino acid sequence, you can select the nucleotides in the β-lactamase gene and group the sequence by threes.
5. With the mouse cursor, highlight bases 7-63 (the range of nucleotides selected is displayed in the lower-left corner of the Construct window).
6. Using the Style submenu of the Format menu, change the highlighted sequence to bold and condensed. Underline the T7 promoter priming site by highlighting nucleotides 71-87 and invoking the Underline command from the Style submenu.
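The grouping commands in step 4 only change how the sequence is laid out, not the sequence itself. As a rough sketch of the effect (the helper name `group_sequence` and the example sequences are invented for illustration, and GCK's exact spacing may differ), grouping by tens or by threes can be thought of as:

```python
def group_sequence(sequence, size):
    """Split a sequence into space-separated blocks of `size` residues (hypothetical helper)."""
    if size < 1:
        raise ValueError("group size must be at least 1")
    return " ".join(sequence[i:i + size] for i in range(0, len(sequence), size))


if __name__ == "__main__":
    print(group_sequence("GAATACACGGAATTCGAGCTCGG", 10))  # grouped by tens
    print(group_sequence("ATGAGTATTCAA", 3))              # a coding region grouped by codons
```

Grouping a coding region by threes, starting at the first base of the region, lines each codon up with its translated amino acid.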
3.4. Subcloning a Fragment
As in the laboratory, we can subclone fragments into the pGEM4Z construct (see Note 7). Since the detailed sequence is often unknown prior to cloning, the Insert Ns command in the Construct menu allows us to place an arbitrary number of unknown nucleotides at any position within the pGEM4Z plasmid. We have amplified a genomic fragment from the human cathepsin L gene of approx 2000 bp (1) and subcloned it into the SmaI site of pGEM4Z.
1. To represent this (or any similar cloned insert), select the SmaI site in the construct and insert 2000 nucleotides using the Insert Ns... command (Fig. 9). A fragment with two junction markers will appear.
Fig. 9. Fragment Subcloning. These window snapshots display graphical phases during the insertion of a sequence/fragment into the construct.

Fig. 10. Scale Legend dialog box. This dialog box allows specification of the scale in the graphical display of the construct (within the Construct window).

2. Double-click the fragment and use the Fill command in the Format menu to change the fragment to solid black.
3. Delete both junction markers and save the altered construct as pGHCL1. (Junction markers do not appear when compatible cohesive ends are pasted together.)
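The placeholder approach in the steps above, inserting a run of Ns at the cloning site and later swapping in the real sequence, can be sketched in a few lines. This is an illustration under stated assumptions rather than a description of GCK internals: the helper names `insert_ns` and `replace_ns` are hypothetical, the short vector fragment is invented, and the cut position assumes the blunt SmaI cut in the middle of CCCGGG.

```python
import re


def insert_ns(sequence, cut_after, count):
    """Insert `count` N residues after 1-based position `cut_after` (hypothetical helper)."""
    if not 0 <= cut_after <= len(sequence):
        raise ValueError("cut position lies outside the sequence")
    return sequence[:cut_after] + "N" * count + sequence[cut_after:]


def replace_ns(sequence, actual):
    """Replace the single run of Ns with the determined sequence (hypothetical helper)."""
    runs = re.findall(r"N+", sequence)
    if len(runs) != 1:
        raise ValueError("expected exactly one run of Ns")
    return sequence.replace(runs[0], actual, 1)


if __name__ == "__main__":
    # Invented fragment containing one SmaI site (CCCGGG, blunt cut after CCC).
    vector = "GAGCTCGGTACCCGGGGATCC"
    cut_after = vector.find("CCCGGG") + 3   # 1-based position of the last C
    placeholder = insert_ns(vector, cut_after, 5)
    print(placeholder)
    print(replace_ns(placeholder, "ACGTA"))
```

In both the placeholder and the final sequence the two halves of the SmaI site flank the insert, which mirrors the statement in the text that the pasted sequence resides within the SmaI site.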
After performing initial restriction enzyme digests, you can assign restriction enzyme sites to specific locations using the Mark Location command. Once sequencing has been accomplished, Copy the actual sequence (from a text file or import it directly into GCK and then copy the sequence), select the Ns, and then Paste the sequence. The sequence will replace the Ns and reside within the SmaI site. After marking restriction enzyme sites and formatting the regions with different fill patterns, the construct appears as in Fig. 9 (far right). The sites within the insert have been formatted using the Bold attribute to distinguish them from sites within the original pGEM4Z plasmid. The scale legend can be seen in the lower-left of the Construct windows in Fig. 9. The scale legend can be shown or hidden by using the Display submenu under the Construct menu (see Fig. 6). Using the mouse, the scale legend can be positioned anywhere within the window. Double-clicking on the legend presents a dialog box that allows specification of the scale (Fig. 10).

3.5. Gel Electrophoresis
Gel electrophoretic patterns of restriction enzyme digests can be created by using the displayed sites in the Construct Window (Fig. 11).
1. Open a pre-existing Gel file or create a new one (see Note 8).
2. Click on the Construct window to make it the active one and double-click on a restriction enzyme site label in the construct to select all of the same sites (in this case PstI).
Fig. 11. Enzyme Site Selection. This snapshot depicts the selection of the PstI sites in the pGHCL1 construct prior to copying and then pasting into the Gel window.

3. Select Copy from the Edit menu.
4. Click on the Gel window to make it the active one.
5. Paste into the Gel window. If a gel already exists, the new lane of fragments will be placed to the right of the arrowhead marker at the top of the gel.

A legend is automatically created and expanded with each additional restriction enzyme digest. Multiple digests can be represented by coselecting the desired sites and then copying and pasting into the Gel window (Fig. 12, lanes 4, 6). Gel standards are supplied with the program disk or can be easily created (Fig. 12, lanes 1, 2). The position of the I-beam cursor is tracked and displayed as the fragment size in the lower-left of the Gel window. The fragment size threshold of the gel can be set by the corresponding command under the Gel menu. The number of fragments that run off the gel is displayed at the bottom of the gel, and size standards are displayed on the left of the gel. Both of these can be hidden from view by invoking the appropriate commands under the Gel menu. The Format menu permits determination of the font and size of the legend. Similar to the Construct window, the Gel window has an alternate display. By selecting Display Table from the Gel menu, a table of the digests containing fragment sizes and cut sites will be displayed (Fig. 12).

Fig. 12. Gel window. Features of the Gel window are shown. Information can be displayed graphically or in tabular form with site position and fragment size.

3.6. The Illustration Window
The Illustration window can be used to track complex construction/cloning projects, create figures for presentations, and serve as a resource of commonly needed constructs, sequences, polycloning regions, and gel standards. Constructs, sequences, and gels can be selected and copied in their own windows and then pasted into the Illustration window (Figs. 13, 14). Editing features can be activated by double-clicking on the constructs, in either graphical or text form, and the gels. Using the tools in the Illustration window (Fig. 13), you can add comments, labels, and graphics to items in the Illustration window. You can create legends for restriction enzyme site labels by first selecting the desired sites in the construct (either from within a Construct window or, after double-clicking the construct in the Illustration window to activate the editing features, from within an illustration) and then pasting into the Illustration window (Fig. 14). The linear representation of the gene in the illustration was created by linearizing pGHCL1 (in the Construct window), copying the insert, and pasting into the illustration (see Note 9).

Fig. 13. Illustration window. Features of the Illustration window are shown.
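The digest simulation described in Section 3.5 rests on simple arithmetic: the fragment sizes of a complete digest follow directly from the cut positions. The sketch below shows that calculation for a circular construct. It is not GCK's implementation; `circular_digest` is a hypothetical name, and the length and cut positions are invented values, not pGHCL1 data.

```python
def circular_digest(length, cut_positions):
    """Fragment sizes from a complete digest of a circular molecule (hypothetical helper).

    `cut_positions` are 1-based positions after which the molecule is cut.
    """
    cuts = sorted(set(cut_positions))
    if any(not 1 <= c <= length for c in cuts):
        raise ValueError("cut position lies outside the molecule")
    if not cuts:
        return [length]  # uncut circle, reported here as one full-length species
    # Distances between neighbouring cuts ...
    sizes = [b - a for a, b in zip(cuts, cuts[1:])]
    # ... plus the fragment that runs through the origin.
    sizes.append(length - cuts[-1] + cuts[0])
    return sorted(sizes, reverse=True)


if __name__ == "__main__":
    # Invented numbers for illustration only.
    print(circular_digest(5000, [500, 1700, 4200]))
    # A double digest is the same calculation on the combined cut list.
    print(circular_digest(5000, [500, 1700, 4200] + [3000]))
```

A single cut linearizes the circle and gives one full-length fragment, and the fragment sizes always sum to the length of the construct, which is a useful check when comparing a simulated lane against a real gel.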
4. Notes
1. Graphic and text elements can easily be copied and pasted into other drawing and word processing programs. Illustration files can also be saved as PICT files, which can then be opened by other drawing programs. Although useful drawing tools are available in the Illustration window, a larger number of tools are present in many dedicated drawing programs, which also offer greater precision in the final arrangement of the illustration.
2. Some cache cards and/or their accompanying modifications of Apple's SANE (Standard Apple Numeric Environment) routines may cause slight changes in either the graphical presentation of the construct or the text formatting. This may appear as a small gap in a circular plasmid.
3. Since GCK uses Apple Quickdraw routines and not PostScript output, the line thickness is never < 1/72 in. This is overcome easily by creating a large image in the Illustration window and then printing at 25% scale.
Troen
270 fragment of human CTSL gene cloned by PCR of genomc DNA
A
12345 $&G~ r -- --i,% .i?M
q -
--
i;cc-
-FCC-
-
= -
I-_-
Fig. 14. Project Overview. Graphics and text from the Construct and Gel windows can be pasted into and edited within the Illustration window. The drawing and text tools can be used to further enhance the figure. (A) The figure shows how multiple constructs may be arranged into one illustration to outline the derivation of the final construct. (B) This figure shows how a region of interest can be taken from the circular construct and displayed as a linear fragment. The Illustration window can also display sequences as text to complement the graphical overview. Taken together, (A) and (B) show how the Illustration window can be used to bring together the other elements of the program.

4. To create a construct file of the pGEM4Z plasmid, the sequence can be obtained directly from Promega on disk or via the Internet: http://www.promega.com/techdoc.html Alternatively, it may be obtained from either GenBank or EMBL via the Internet using the accession number X65305. Another useful source, where most vectors in use can also be found, is the Molecular Biology Vector Database. This is a WWW resource, and the URL is: http://biology.queensu.ca/-miseners/vector.html
5. Suites of restriction enzyme analyses can be prearranged by creating List files that contain only those enzymes of interest (or ones that you frequently employ in the laboratory). You can then easily identify those sites by selecting all the enzymes in a particular list after invoking the Mark Sites command. Any sequence can be entered into List files, not just restriction enzyme sites. For example, you can create List files of commonly used oligonucleotides, linkers, and sequence cassettes (up to 250 bp). These can then be pasted into constructs or used as recognition sequences when searching constructs.
6. The Site Markers submenu of the Format menu allows you to change the label display to either text or symbols, with or without position numbering. The Format submenus permit you to adjust the font, color, style, and size of the site marker text. The markers can be placed at either the start of the recognition sequence or at the actual cleavage site by choosing the appropriate option under the Display submenu of the main Construct menu.
7. Although not discussed at length in this chapter, GCK is very well suited for keeping track of changes made to plasmids during subcloning experiments. The Chronography feature (under the Format menu) allows you to maintain a detailed graphical sequential history of the manipulations made to the constructs.
8. Gel templates with standards specific to certain ranges can be constructed in advance. As shown above (Fig. 12), λ cut with HindIII, pBR322 cut with either MspI or HinfI, and φX174 cut with HaeIII offer a broad range of sizes. For example, by opening the Gel file containing the HaeIII digest of φX174, you can add new digests of the construct under study to the Gel window and then save the file under a new name. This will leave the original standard digest intact. You can also select the standards line, copy it, and paste it into another Gel window. The gel electrophoresis patterns simulated by GCK are particularly helpful to students and others who are just embarking on restriction enzyme analysis, because they make it easy to visualize the expected bands from a digest. Furthermore, GCK can simulate partial digests and thereby help beginners recognize incomplete digestions in the laboratory.
9. Although not apparent here, GCK can utilize the complete spectrum of colors available on the Macintosh. Various colors can be used in the Construct, Gel, and Illustration windows in either graphical or text format. With an available color printer and/or slide printer, you can therefore generate very attractive and informative posters and slides.
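The expected bands mentioned in Note 8 follow from simple arithmetic on the cut positions. The sketch below is only an illustration of that arithmetic for a complete digest; it is not GCK's own routine, and the function name, the 1000-bp construct, and the cut positions are hypothetical.

```python
def digest_fragments(length, cut_positions, circular=True):
    """Return fragment sizes (bp), largest first, for a complete digest.

    Hypothetical helper: `cut_positions` are distinct cleavage positions
    lying strictly inside a molecule of `length` bp.
    """
    cuts = sorted(set(cut_positions))
    if not cuts:
        return [length]
    # Distances between consecutive cut positions.
    sizes = [b - a for a, b in zip(cuts, cuts[1:])]
    if circular:
        # One fragment spans the origin, joining the last cut to the first.
        sizes.append(length - cuts[-1] + cuts[0])
    else:
        # A linear molecule contributes two end fragments.
        sizes = [cuts[0]] + sizes + [length - cuts[-1]]
    return sorted(sizes, reverse=True)


# Hypothetical 1000-bp circular construct cut at three positions.
print(digest_fragments(1000, [100, 400, 900]))  # [500, 300, 200]
```

Sorting largest first mirrors the order in which bands appear from the top of a gel lane.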
Reference
1. Chauhan, S. S., Popescu, N. C., Ray, D., Fleischmann, R., Gottesman, M. M., and Troen, B. R. (1993) J. Biol. Chem. 268, 1039-1045.
22
GeneJockey II
Primer Design
Phil Taylor

1. Introduction
Version 1.5 of GeneJockey includes two routines for primer design. One of these generates primer pairs for the polymerase chain reaction (PCR), whereas the other generates primers for site-directed mutagenesis (SDM). Individual primers generated for PCR are also highly suitable for sequencing purposes (see Note 1).

2. Materials
1. Hardware: GeneJockey requires a Macintosh with ColorQuickdraw in ROM (this excludes the Macintosh Plus [and older machines], the SE, the PowerBook 100, and the Macintosh Portable). The program also requires System 7.0 or later and at least 2 Mb of available memory. Multiple alignment windows are much easier to view in color. Although this is not essential, it is recommended that you use a system capable of displaying 256 colors (or better).
2. Software: For the operations described in this chapter, you need only the GeneJockey program itself. For later chapters, you will need some additional files supplied with the program, and you would normally install GeneJockey on your hard disk by simply copying all the files supplied into a single folder. When running on a Power Macintosh, the GeneJockey Helper file should be present in the same folder. The native-code resources in this file run about 10 times faster than the code in the main program, and since multiple alignment is a time-consuming process, the extra speed is very helpful. GeneJockey is licensed for use only on a single-user basis, but is not copy-protected.
3. Data: For the procedures described in this chapter, you need only the GeneJockey II program file itself, plus the sequence for which you wish to generate primers. The example used here is the Pig acetyl choline receptor sequence (named Pig Ach R) from the MISC Receptors folder on the GeneJockey demo files disk.

[Figure 1: dialog with fields for primer length, 3' dinucleotide, G+C, amplified segment length, and Tm (°C); a "Suppress base-repeats >2" checkbox; a "Set default" option; and OK and Cancel buttons.]
Fig. 1. Parameter dialog for PCR primer search.
3. Methods
3.1. Generating Primers for PCR
GeneJockey can scan a given sequence to determine which areas of the sequence are suitable for making PCR primers. You have a wide choice of options in the application of the selection rules, allowing you to tailor your PCR primers to the conditions that you intend to use for the reaction. GeneJockey does not allow you to specify in advance the area of sequence that you wish to amplify; however, having listed the primers as below, it is quite easy to read through the results and pick a pair of primers that bracket the area of interest.
1. Open the Pig acetyl choline receptor sequence using the Open... command from the File menu.
2. Issue the Find Primers > PCR... command from the Find menu. This gives rise to the parameter dialog shown in Fig. 1.
3. Enter SS into the 3' dinucleotide box. It is usual to select primers with the 3' end terminating in one of the four dinucleotides CC, CG, GC, or GG, because this configuration favors initiation of the polymerase reaction. You can set any dinucleotide here; to choose any of the above four, specify SS (in which S = C or G). As usual in any GeneJockey dialog that expects you to enter degenerate DNA codes, there is a wild-cards button that displays all the IUPAC codes in case you cannot remember them. This is activated when you click in the 3' dinucleotide box.
4. Set the max and min Primer lengths. The range of possible primer lengths can be set from 12-40 (default is 18-22).
5. Set the max and min G+C% composition. The G+C content of PCR primers should ideally be close to 50%, and you may set the criteria here anywhere between 10 and 90%.
[Figure 2: text window listing sense primers, each followed by its matching antisense primers, with the 5' position in brackets, the amplified length, and the Tm (°C) of the amplified segment.]
Fig. 2. Output of PCR search. For each sense primer, the program lists a number of matching antisense primers. Note that the antisense primers are already inverted; you should set up your oligonucleotide synthesizer to make them exactly as they appear on the screen. The number of the nucleotide in the original sequence that matches the 5' end of the primer is given in brackets after the primer sequence, and for each pair of primers the amplified length and melting temperature (Tm) of the amplified segment are given. To determine the Tm for the primers themselves, copy the primer sequence into a new DNA sequence window, then issue the Sequence Info command from the Analyze menu. This gives various information about the sequence, including the melting temperature of the double-stranded sequence.

6. Set the max and min target length for the PCR product. The length of the amplified segment can be set anywhere above 100 bp. (Note, however, that it gets increasingly difficult to find suitable primer pairs as the spacing increases.)
7. Set the max and min melting temperature (Tm) values for the primers. You can set any range of Tm for the amplified segment. The Tm here should normally be greater than the extension temperature, so the DNA remains double-stranded while Taq polymerase is copying it, but well below the denaturing temperature so the strands can be separated (see Note 2).
8. Dismiss the dialog with the OK button. The program will open a new text window and display the results of the search (Fig. 2, see Notes 3,4).
The program will automatically eliminate primers with internal complementarity (>3 consecutive bases), e.g., 5'..CGTA........TACG..3'. The program will also suppress primer pairs that are mutually complementary (>3 consecutive bases), and primer pairs in which the 3' end of one complements the 5' end of the other (two or more bases), leading to potential primer-dimer formation.
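Three of the selection rules described in this section (the degenerate 3' dinucleotide, the G+C% window, and the rejection of internal complementarity of more than three consecutive bases) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not GeneJockey's own code: the function names are hypothetical, and the non-overlapping inverted-repeat test is one plausible reading of the rule.

```python
# IUPAC degenerate DNA codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def matches_3prime(primer, pattern):
    """True if the 3' end of `primer` matches a degenerate pattern such as SS."""
    tail = primer[-len(pattern):]
    return len(tail) == len(pattern) and all(
        base in IUPAC[code] for base, code in zip(tail, pattern))


def gc_percent(primer):
    """G+C content as a percentage of primer length."""
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def has_inverted_repeat(primer, run=4):
    """True if any `run`-base word is followed later in the primer by its
    reverse complement (run=4 corresponds to ">3 consecutive bases")."""
    for i in range(len(primer) - run + 1):
        word = primer[i:i + run]
        if reverse_complement(word) in primer[i + run:]:
            return True
    return False


print(matches_3prime("ATGCATGC", "SS"))            # True
print(has_inverted_repeat("AACGTATTTTTTTTACGAA"))  # True
```

The second call reproduces the CGTA...TACG example given above: TACG is the reverse complement of CGTA, so the primer could fold back on itself.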
[Figure 3: dialog that reads "Enter two short protein segments (6-30 amino acids) corresponding to the wild type and the desired mutant sequence respectively," with Wild type segment and Mutant segment boxes (enter mutant segment amino acids in lower case if you want the codon to be variable, upper case if not); an "Enzymes to use for restriction mapping" choice of None, Use Full list, or Use Short list; a "Maximum # of base changes" setting; a "Hilight changed bases" checkbox; and OK and Cancel buttons.]
Fig. 3. Parameter dialog for SDM primer generation.

3.2. Generating Primers for SDM
The second primer routine is used to generate primers for site-directed mutagenesis, and permits the user to choose a primer that will not only introduce the required change into the translated protein, but also make a change in the restriction map for the sequence. This has the advantage that clones generated by annealing the primer, extending, and subcloning may be screened rapidly by restriction digest rather than by sequencing. (Of course, the chosen clone should be sequenced for confirmation, but at least you do not have to sequence dozens of clones to find the good one.)
1. Open the rat GnRH receptor sequence on the demo disk.
2. Perform a reading frames analysis with ATG as the start codon.
3. Double-click on the long arrow to translate to protein.
4. In the protein sequence, locate the two tryptophan residues (W) at positions 205/206. Suppose we want to change the first of these to tyrosine (Y).
5. Select a small block of amino acids centered on the residue that we wish to change. In this case, take the block of 10 containing the two tryptophans.
6. Copy it onto the clipboard.
7. Bring the DNA sequence to the front.
8. Issue the Find Primers > SD Mutagenesis command.
9. In the ensuing dialog (Fig. 3), paste the sequence into both the wild-type and mutant sequence boxes (see Note 5).
10. Change the first W in the mutant sequence to Y (for tyrosine).
11. Click the OK button. The program will open a text window and list first the wild-type sequence, then all possible variants of this that yield the required change (Fig. 4, see Note 6).
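The idea behind the variant list can be sketched as follows: substitute every codon for the new amino acid at the chosen position, count the base changes against the wild-type codon, and check whether a restriction site appears. This is a hedged illustration, not the program's own routine. The function names are hypothetical, and the nine-base context CAA TGG TGG (Q-W-W) is an assumption chosen for the example rather than a sequence confirmed by the text. The SspI recognition sequence AATATT is standard.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G (third position fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons_for(amino_acid):
    """All codons that encode the given one-letter amino acid."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == amino_acid]


def base_changes(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mutant_oligos(wild_dna, codon_index, new_amino_acid):
    """List (oligo, base changes) for every codon choice at one position."""
    start = 3 * codon_index
    old = wild_dna[start:start + 3]
    variants = []
    for codon in codons_for(new_amino_acid):
        oligo = wild_dna[:start] + codon + wild_dna[start + 3:]
        variants.append((oligo, base_changes(old, codon)))
    return variants


SSPI_SITE = "AATATT"
# Assumed context: CAA TGG TGG, changing the first W (codon index 1) to Y.
for oligo, changes in mutant_oligos("CAATGGTGG", 1, "Y"):
    print(oligo, changes, "+SspI" if SSPI_SITE in oligo else "")
```

Both tyrosine codons (TAT and TAC) differ from TGG at two positions, but in this assumed context only the TAT variant creates an AATATT site, which is the kind of diagnostic change the routine is designed to report.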
[Figure 4: text window listing the candidate oligonucleotides, each followed by its number of base changes in brackets; both oligonucleotides shown are marked (2).]
Fig. 4. Output of SDM primer routine. The number following each oligonucleotide is the number of base changes introduced. For each oligonucleotide that produces a change in the restriction map, the affected enzymes are listed below. In the present case, the second oligo listed introduces a new SspI site. (If the site had been deleted from the restriction map, it would have been listed as -SspI.)
4. Notes
1. The SDM primer routines were added in version 1.33. In early versions of the program there was a bug in the implementation of the PCR primer routine that made it difficult to generate primers to amplify a segment >800 base pairs in length. This was fixed in version 1.41.
2. You can set the program to suppress primers with repeating bases (more than two consecutive bases). I know of no theoretical reason for doing this, but some experienced PCR users tell me that they habitually avoid primers with three or more consecutive bases the same.
3. An ideal pair of primers for PCR should have similar Tm, so the annealing temperature can be set to its ideal value (which is about 10° below Tm) for both primers. If you find that the program does not locate suitable primers using the default parameters, you should relax the specifications and try again. First, try turning off the Suppress Base-Repeats >2 checkbox, then try using NS in place of SS for the 3' dinucleotide (the terminal C or G is much more important than the base that precedes it). If all else fails, widen the range of G+C%. Remember, the more lax you make the specifications, the further your primers will diverge from the ideal.
4. The algorithm used is a modification of that of Lowe et al. (1), and operates by first seeking out all incidences of the specified dinucleotide in the sequence, and then compiling two lists of potential primers with this dinucleotide at the 3' end (sense) or at the 5' end (antisense). Before placing each primer on the list its G+C content is measured, and if it falls outside the specified range the primer is shortened by one nucleotide. This process is repeated until either the specified G+C content is obtained or the minimum primer length is reached, in which case the primer is discarded. Primers are then tested for internal complementarity, and any that contain an inverted repeat of more than three consecutive bases are discarded. The program then lists each sense primer, along with a range of possible antisense primers, testing each pair for mutual complementarity (eliminate if more than three consecutive bases, or if two or more bases at the ends), and checking that the amplified segment length and Tm fall within the specified range.
5. Checking the Highlight changed bases box causes the changed bases to be displayed in blue, but also slows down the display of results considerably.
6. In order to determine whether this change will be a useful diagnostic, we should now assemble the wild-type sequence into its vector and perform a restriction analysis, then repeat the process using the mutant sequence to see whether the change in the pattern of fragments will be visible on a gel. In many cases, the required base changes to produce the mutant sequence will not also introduce a convenient restriction map change, and if this is the case you may need to allow the program to change some bases for this specific purpose. To do this, change the amino acid symbols in the Mutant Segment box to lowercase. In general, you should keep the bases near the ends of the block constant, and choose the oligo with the minimum number of base changes for the purpose (or the oligo may fail to anneal). You may also want the program to choose from the full list of restriction enzymes to get a bigger range of choice.
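The first stage of the procedure in Note 4 (locate the dinucleotide, then shorten each candidate until its G+C content fits) can be sketched as below for sense primers only. This is a rough illustration rather than the program's implementation: the function name and the default G+C window are hypothetical, and trimming from the 5' end is an assumption made so that the 3' dinucleotide is preserved.

```python
def gc_percent(seq):
    """G+C content as a percentage of sequence length."""
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def sense_primers(template, dinucs=("CC", "CG", "GC", "GG"),
                  min_len=18, max_len=22, gc_min=40.0, gc_max=60.0):
    """Return (1-based 5' position, primer) for each accepted sense primer."""
    primers = []
    for end in range(min_len, len(template) + 1):
        # The candidate must end in one of the allowed 3' dinucleotides.
        if template[end - 2:end] not in dinucs:
            continue
        length = min(max_len, end)
        while length >= min_len:
            candidate = template[end - length:end]
            if gc_min <= gc_percent(candidate) <= gc_max:
                primers.append((end - length + 1, candidate))
                break
            # Assumption: trim one base from the 5' end and test again.
            length -= 1
        # Falling out of the loop without a break discards the candidate.
    return primers


template = "ATATATATATGCGCGCGCGC"  # toy 20-base template
print(sense_primers(template, max_len=20, gc_min=50.0))
```

With this toy template, the candidate ending at base 19 fails at 19 bases (about 47% G+C) and is accepted after one trim, whereas the candidate ending at base 18 reaches the minimum length without passing and is discarded.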
Reference
1. Lowe, T., Sharefkin, J., Yang, S. Q., and Dieffenbach, K. (1990) A computer program for selection of oligonucleotide primers for polymerase chain reactions. Nucleic Acids Res. 18, 1757-1761.
23
OLIGO
Primer Selection
Juan Jose Estruch

1. Introduction
OLIGO is a multifunctional program suited to searching for PCR and/or sequencing primers in a given sequence. Computer searches are based on three essential criteria, namely, specificity, the absence of dimer or hairpin structures, and the formation of stable duplexes. So, the basis of the OLIGO program is similar to other primer design programs, but it contains extended functions. The OLIGO advanced applications include selecting primers compatible with preselected primers, conducting inverse PCR searches, and designing probes for ligase chain reaction (LCR). It can determine parameters when working with RNA, although they are not as accurate as for DNA. It provides hybridization conditions for specific probes as well.

2. Materials
The OLIGO program can be run on an Apple Macintosh as well as PC-compatible computers. An Apple Macintosh II or better with a Math Coprocessor or Floating Point Unit (FPU) is required. A PC-compatible with at least a 386/16 MHz processor is required, but a 486/33 MHz processor or higher is recommended. As an OLIGO 4.0 or 5.0 user, you must register your software license with National Bioscience, Inc. (NBI, Plymouth, MN). OLIGO requires Windows (version 3.1 or higher) as operating system, so it presumes you are familiar with the basic functions of Windows. The program has a minimum memory requirement of 4 Mb of RAM, and the presence of a VGA graphics card (or better) is necessary. OLIGO accepts nucleic acid sequence file formats from ASCII, EMBL, GenBank, and Entrez Flat Files. You may also type sequences directly into the program from the keyboard.

[Figure 1: OLIGO screen; the menu bar shows the File, Edit, Analyze, Search, Select, Change, Window, and Help menus.]
Fig. 1. The startup windows of OLIGO after loading a sequence for searching. The two windows displayed are the Sequence window and the Internal Stability window.

3. Methods
3.1. Searching for Oligonucleotides Automatically
1. Locate the OLIGO program in your software and run it.
2. Select File menu.
3. Choose Open (Mac) or Select Existing File (PC). A dialog box will appear. Alternatively you may enter a new sequence (see Note 1).
4. From the dialog box choose a file of interest. The screen appears as Fig. 1. In this example we have selected the file pSH34.Seq.
5. Select the Change menu.
6. Choose Current OLIGO Length. A simple dialog box appears informing you of the maximum and minimum size for the oligonucleotide.
7. Specify the length of the oligo and dismiss the dialog box. In the example we have specified 18 bp (see Note 2).
8. Select Search menu.
9. Choose Primers & Probes. The Search dialog box will appear as Fig. 2 (see Note 3).
10. From the dialog box select PCR or Sequencing primers.
11. Click Go. The Memory Table will appear with a list of found oligos (Fig. 3).
[Figure 2: Search dialog box. Search types: PCR Primers (Method +/-, Method -/+, + Strand Primers, - Strand Primers, Primers Compatible with the Upper Primer, Primers Compatible with the Lower Primer), Sequencing Primers (+ Strand, - Strand), Hybridization Probes, and Customized Search (+/- or -/+). Checkboxes: Duplex-Free Oligonucleotides, Oligos Compatible with the Upper or Lower Primer, Oligonucleotides within Selected Stability Limits, Highly Specific Oligonucleotides (3'-end Stability), Oligonucleotides with Unique 3'-ends, Continue Unique Oligonucleotide Search in File, Hairpin-Free Oligonucleotides, and Eliminate Homooligomers. Buttons: Restore Settings, Parameters, and Cancel.]
Fig. 2. The Search dialog box. From here the different search types may be chosen: searching for PCR or sequencing primers in one strand or both. The figure shows a sequencing primer search selected with the default parameters.

Fig. 3. The Memory Table as it appears after a PCR primer search. The program has automatically selected an Upper (1743U) and Lower (499L) primer from those matching the search criteria.
3.2. Searching for Customized Oligos
1. Repeat steps 1-9 as in Section 3.1.
2. Click Parameters in the Search dialog box. A Search Parameters box will appear as in Fig. 4 (see Note 4).
3. Change the ΔG and Terminal Stability Thresholds.
4. Click OK.
5. Click Go. A list of selected oligos will appear (Fig. 3).

3.3. Examining Selected Oligonucleotides
1. From the Analyze menu, choose Internal Stability (Fig. 1, see Note 5).
2. From the Analyze menu, choose Composition & Tm. Windows containing detailed information about the oligo will appear.
[Figure 4: dialog with settings for the + strand and - strand search ranges, the maximum acceptable duplex length, checking for unique 3'-ends, the stability range, and the terminal stability threshold (kcal/mol).]
Fig. 4. The Search Parameters dialog box. These allow customization of the search parameters.

[Figure 5: Composition and Tm window for the current oligo, an 18-mer [1743], listing Td = 54.5° (nearest neighbor method), Tm = 62.2° (%GC method), and Tm = 52° (2° x (A+T) + 4° x (G+C) method). The %GC method is given as Tm (°C) = 81.5 + 16.6 x log[Na] + 0.41 x (%GC) - 675/length - 0.65 x (%formamide).]
Fig. 5. The Composition and Tm window. This window displays information about the primer's melting temperature according to several different calculation methods and its base composition.
3. Resize the Composition and Tm window to reveal all the data (Fig. 5).
4. Select the Window menu.
5. Choose Current OLIGO (Fig. 6).
6. Click on the Memory Table window (Fig. 3).
7. Click on each desired oligonucleotide on the table. As you select each oligo the windows will change to reflect that oligo.
8. Using this information, select the most appropriate primers.
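The two simpler formulas shown in the Composition and Tm window are easy to check by hand or with a short script. The sketch below is an illustration, not OLIGO's code: the function names are hypothetical, the oligo is the 18-mer as read from Fig. 6, and the 1 M sodium concentration is an assumption (it is the value at which the %GC formula reproduces the 62.2° shown in Fig. 5). The nearest neighbor method is not attempted here.

```python
import math


def wallace_tm(oligo):
    """Tm (°C) by the 2° x (A+T) + 4° x (G+C) rule."""
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc


def percent_gc_tm(oligo, na_molar=1.0, formamide_percent=0.0):
    """Tm (°C) by the %GC formula; the default salt value is an assumption."""
    gc_pct = 100.0 * (oligo.count("G") + oligo.count("C")) / len(oligo)
    return (81.5 + 16.6 * math.log10(na_molar) + 0.41 * gc_pct
            - 675.0 / len(oligo) - 0.65 * formamide_percent)


oligo = "AGCATAAGTTTGGAGCAC"  # 18-mer as read from Fig. 6
print(wallace_tm(oligo))               # 52
print(round(percent_gc_tm(oligo), 1))  # 62.2
```

The oligo contains 10 A+T and 8 G+C, so the first rule gives 20 + 32 = 52°, matching the window. The spread between methods is a reminder that a Tm is only comparable with another Tm calculated the same way.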
[Figure 6: Current Oligo window showing the sequence length, the 18-mer 5' AGCATAAGTTTGGAGCAC 3' with its length, 5' position, Tm, and ΔG (25°C), and, for both this oligo and its complementary strand 5' GTGCTCCAAACTTATGCT 3', the conversion factors in nmol/A260 and µg/A260.]
Fig. 6. The Current OLIGO window. This window displays information about the oligo currently selected in the Memory Table. The window is automatically updated as different oligos are selected.

[Figure 7: Edit window for a New Sequence, showing the cursor position.]
Fig. 7. The Edit window. This dialog allows sequences to be entered directly into OLIGO rather than imported from text files.
4. Notes
1. Select a new file: When a new sequence has to be introduced, you have to select the New Sequence option. A New Sequence box like that shown in Fig. 7 will appear on the screen. Introduce the sequence and click Accept. The general screen of the OLIGO program (see Fig. 1) will appear.
2. Oligonucleotide length: The program can handle oligonucleotides with a minimum of nine nucleotides and as big as your computer's capability allows. The most common oligo lengths are between 17 and 25. To change the length of the oligonucleotide you have to select the Change option of the general screen of the OLIGO program (see Fig. 1). An oligo length box will appear on the screen as an inset; you will enter the desired length and click Accept. The total number of selected primers could be up to 3000.
3. Searching oligonucleotides automatically: The OLIGO program already contains selection limits to search for primers by applying appropriate criteria. When a standard search for oligonucleotides is conducted, the program default settings are:
a. High Tm, selecting the most stable oligonucleotide.
b. 3' Terminal stability is set to 40 kcal/mol.
Most of the parameters would be already optimized and you just have to select whether it is a PCR or a Sequencing primer, and whether the sequence is in the upper (+) or lower (-) strand. A pattern of other parameters will check themselves following your initial selections.
4. Searching customized oligonucleotides: When the OLIGO preprogrammed parameters do not render any primer out of a specific sequence, or you have special needs, such as degenerate (several nucleotides in the same position) oligonucleotides, you might search for a customized primer. You can make expert decisions that alter the stringency of the search. You can override the default settings of the program and introduce your own values. Special attention should be given to:
a. The ΔG threshold: the more negative, the more stringent.
b. The 3'-terminal stability range: should be between -5 and -10 kcal/mol.
5. Analyzing selected oligonucleotides: Selected oligos should be analyzed on a one-by-one basis to find the oligo that is best suited for your purposes. If the primers have been selected automatically, you should only monitor the Tm, because they will have no duplex and/or hairpin structures. If it is a customized primer, you will have to monitor Tm, duplex, and hairpins. The ideal primer should not have any duplex and/or hairpin. The appropriate Tm depends on your needs. Generally speaking, the higher the Tm, the more specific the primer will be. A Tm ranging from 40-60°C will be acceptable.
6. Problems resulting from the sequence: Nucleic acids are repetitive structures made out of four different nucleotides. Those nucleotides are distributed in a random form, so you can find 25% of each in a given sequence. However, there are some spots where one or several nucleotides are overrepresented. Sometimes this happens for specific organisms. For instance, monocots (e.g., maize) have a high GC content, whereas Bacillus has a high AT content. A suggestion is to select areas in which the overrepresentation is minimal. If this is not possible, you might use the concept of subdomains. That is, instead of using the whole oligo, select a subdomain and work within that. It is recommended to select subdomains comprising the 3'-ends of the oligos. Once you obtain an appropriate subdomain, you might add nucleotides to the 5'-ends to get the proper annealing temperature.
7. Problems resulting from the design: When an experiment involving a PCR and/or a sequencing reaction is performed with program-designed primers, you might get mutations in the primer area, a low efficiency reaction, background caused by false priming, or simply nothing at all. Selection failure of the primer is a possible explanation. You would have to make sure that the selected primers do not contain any duplex, do not contain strings of the same nucleotide or sequence repeats, and do not false-prime with repetitive sequences. You have to check that the primers are compatible and that their Tm values match. If the above conditions have been carefully examined and yet the PCR and/or sequencing reaction does not render the proper result, you might have to vary the reaction protocol conditions.
8. Extended functions: The OLIGO program has an endless list of extended functions. Intrinsically, it contains options to analyze restriction sites in your sequence, to check false priming of selected primers, and to find optimal hybridization probes. The software includes codon usage for a number of organisms, so it can load protein sequences and reverse translate them using the most probable codons. In addition, the OLIGO program version 5 comes with PRIMEFORM oligo ordering software. You can order oligos, have synthesis specifications, and have shipping information at the tip of your fingers.
24
PRIME
Primer Selection
Juan Jose Estruch

1. Introduction
Although many factors influence the results of a DNA sequencing or PCR reaction, the most important are the quality of the template and the choice of the oligonucleotides. In this chapter I give an overview of how to design oligonucleotides that could be used as primers for DNA sequencing and/or for PCR reactions by using the computer program PRIME. For efficient priming, one should avoid primers with extensive self-complementarity in order to minimize primer secondary structures and dimer formations. Computer programs calculate hybridization temperatures and secondary structures based on the highly accurate measurement of nearest-neighbor ΔG (change in free energy) values. You can use either program default constraint values or modify those values to customize the analysis. Generally speaking, more stringent parameters should be applied to the 3' end of the oligonucleotides. Subdomains comprising the 3' ends could be selected within oligonucleotides to determine their hybridization temperatures and secondary structures. This chapter describes the steps necessary to scan a sequence file for primers that match user definable criteria and provides an example of the output of the program.

2. Materials
The PRIME program is part of the Genetics Computer Group, Inc. (GCG) package (1,2). You must have an approved account (called a "login") on the VAX system to access the GCG packages. The program runs on VAX computers using both VMS and UNIX operating systems. The VAX is accessed using a dedicated terminal or a VT-100 terminal emulator running on an Apple Macintosh or PC-compatible. Using an X-windows capable terminal will allow the program to be accessed using the Genetic Data Environment program (see Chapter 2). The VAX command language is called Digital Command Language (DCL) and is described in detail in a Digital Equipment Corporation (DEC) document called VAX/VMS DCL Dictionary. The program can handle data introduced directly, imported from sequencing machines (Applied Biosystems [Foster City, CA], Pharmacia [Uppsala, Sweden]), and/or imported from any data bank.
3. PRIME Program
The program requires that certain parameters be set in order to search the sequence. This is achieved as a series of onscreen questions that the user must answer. Each answer or command should be followed by pressing the Enter (↵) key. In some cases the answer may be a null value; that is, simply press the Enter key without typing in any information.
1. Log onto VAX and activate GCG programs.
2. Run prime by typing PRIME (↵, see Note 1). The program will respond: "What sequence?"
3. Enter the name of the file to be searched followed by the Enter key (see Note 2). The program will respond: "Min primer length?"
4. Enter 17 (see Note 3). The program will respond: "Max primer length?"
5. Enter 25. The program will respond: "Min product length?"
6. Press Enter (see Note 4). The program will respond: "Max product length?"
7. Press Enter. The program will respond: "What should I call the output file (X.PRIME)?"
8. Press Enter to accept the default name or provide a new name for the file (see Note 5). The program will respond:
    "Do you want to display the binding sites graphically?
    a. Plot to a FIGURE file called "prime.figure"
    b. Plot graphics on LaserWriter attached to TTA7
    c. Suppress the plot
    Please choose one (*A*):"
9. Select the appropriate output.
10. The computer will run prime and scan the sequence file for primers that match the parameters (see Note 6).
The output file starts with a summary listing the constraints used by the program to minimize secondary structure formation in the selected primers as well as appropriate hybridization temperatures (see Note 7). The following is an example output:

    Input sequence: X.Seq
    Primer constraints:
      primer size: 17-25
      primer 3' clamp: S
      primer sequence ambiguity: NOT ALLOWED
      primer GC content: 40.0-55.0%
      primer Tm: 50.0-65.0 °C
      primer self-annealing...
        3' end: < 8.0 (weight: 2.0)
        total: < 14.0 (weight: 1.0)
      unique primer binding sites: REQUIRED
      primer-template and primer-repeat annealing...
        3' end: ignored
        total: ignored
      repeated sequences screened: none specified
    Product: 2
      [DNA] = 50 nM    [salt] = 50 mM
    PRIMERS
                                5'                     3'
    forward primer (19-mer): 10 TCACGATTGACCACACACT 28
    reverse primer (19-mer): 49 TGTGACACGATCTCCACTT 31
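As a quick sanity check, the size and GC constraints in the summary can be tested against the two primers reported in the example output. The sketch below is an independent illustration, not part of PRIME: the function names are hypothetical, and the product length is taken as the span between the two 5' positions, which is my reading of the position columns. The Tm, 3' clamp, and self-annealing scores are not reproduced because the text does not give the formulas behind them.

```python
def gc_percent(primer):
    """G+C content as a percentage of primer length."""
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def check_primer(primer, size=(17, 25), gc=(40.0, 55.0)):
    """Check a primer against the size and GC ranges from the summary."""
    return {
        "size_ok": size[0] <= len(primer) <= size[1],
        "gc_ok": gc[0] <= gc_percent(primer) <= gc[1],
    }


def product_length(forward_5prime, reverse_5prime):
    """Length of the amplified segment between the two primer 5' positions."""
    return reverse_5prime - forward_5prime + 1


forward = "TCACGATTGACCACACACT"  # positions 10-28 in the example output
reverse = "TGTGACACGATCTCCACTT"  # positions 49-31 in the example output

for name, primer in (("forward", forward), ("reverse", reverse)):
    print(name, len(primer), round(gc_percent(primer), 1), check_primer(primer))
print("product length:", product_length(10, 49))
```

Both primers are 19-mers with 9 G+C bases (about 47%), so they sit inside the 17-25 size range and the 40-55% GC window listed in the constraints.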
4. Notes
1. PRIME Program: PRIME is designed to search for PCR primers. By default, it analyzes a template DNA sequence and chooses both forward and reverse primers. A search for sequencing primers could be done by restricting the search to either the forward or reverse direction. This is achieved by adding the command-line qualifiers /FOR or /REV after PRIME. That is, PRIME/FOR (for a forward sequencing primer) or PRIME/REV (for a reverse sequencing primer).
2. Input File: The input sequence for the search may not be longer than 32,000 bases and it will be called X.Seq, X being a filename.
3. Max/Min Primer: You can specify a range of primer sizes but they should not exceed 50 bases. Oligonucleotides used for sequencing or PCR reactions are usually between 17-25 nucleotides long. You can be more specific by setting exactly the length of both primers, for instance 18 and 20, or both 20 bases long. The primers do not necessarily have to be equal in length.
4. Max/Min PCR Product: The maximum and minimum length of the PCR products can be introduced only when you know the expected band size. This information will not be available in many instances, so just press (↵).
5. Filename: After creating a file, the program names it as X.PRIME, X being the name of the original sequence file. You might accept the suggested name or you may wish to provide a new filename followed by .PRIME.
6. No Primers Obtained: If no primer is obtained after applying the PRIME program to a given sequence, a program summary is displayed on the screen, similar to the one outlined previously, that will contain the primers rejected because they did not meet program constraints. With this information, you can determine which constraints to relax in subsequent runs. The most common modification is a change in the primer length. This can be done by recalling the PRIME program and setting new primer sizes.
7. Output File: Although most of these constraints are already optimized, they can be customized by adjusting optional program parameters such as primer GC content (normally between 40 and 55%), primer Tm (generally between 50 and 90°C), and the weight of the 3' end sequence of the primer (normally < 8).
8. Related Programs: PRIME is a program within the GCG package, so you have access to all other GCG programs such as Map, MapPlot, FindPatterns, and many others. These programs identify specific motifs, map restriction sites, and perform other functions that might help future work with the primers in the context of your research.
References
1. Devereux, J., Haeberli, P., and Smithies, O. (1984) A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res. 12, 387-395.
2. Dölz, R. (1994) GCG, in Computer Analysis of Sequence Data, Methods in Molecular Biology, vol. 25 (Griffin, A. M. and Griffin, H. G., eds.), Humana, Totowa, NJ, pp. 9-17.
25 PRIMERSELECT
Primer and Probe Design
Thomas N. Plasterer
1. Introduction
The polymerase chain reaction (PCR) has proven to be one of the most useful tools available to the modern molecular biology laboratory. One of the more important aspects of successful PCR is the proper design of primers. Because PCR is fairly robust, you may have had success designing primers manually. Failures in primer design are usually caused by primer duplexing or false priming sites in the sequence template that may not be obvious to the eye. For these cases, the use of primer software can greatly improve experimental results. Good software can evaluate thermodynamic models that are more accurate (and more complex) than traditional GC content alone. Because free energy, melting temperature, and primer duplexing are all predicted more accurately, you can refine primer design well beyond what is practical by hand. Using the computer’s ability to perform more complex sequence and mathematical analysis allows you to start with better primers and finish with higher yield products of greater accuracy.
The PRIMERSELECT program, one of seven in the LASERGENE suite from DNASTAR, Inc. (Madison, WI), is a tool to design primers and probes for PCR, sequencing, and hybridization experiments. To locate primers, PRIMERSELECT begins by processing a template sequence (DNA, RNA, or backtranslated protein) for melting temperatures, free energy, and terminal free energy in pentamer windows. Template sequences can be obtained from exported database entries, EDITSEQ DNA and protein sequence documents, foreign software, or ASCII text (see Note 1). Use the EDITSEQ program to
import foreign sequence documents into LASERGENE format. Template thermodynamic calculations are based on the duplex stability models by Breslauer et al. (1) for DNA and the Freier et al. (2) model for RNA. After initial calculations, PRIMERSELECT allows you to locate primers from your own primer catalog, the best computer-generated template-derived primers, or the best computer-generated template-derived PCR primer pairs. To perform a search, set up all initial conditions and choose a search method. From the search results, choose a primer or primer pair to examine in more detail and bring these candidates to a Primer Workbench. At the Primer Workbenches, you can examine your primer and template for restriction sites, translated reading frames in all six frames, false priming sites, and primer duplexing. PRIMERSELECT also provides amplification and composition summaries and comprehensive lists of all primer secondary structures. Use PRIMERSELECT prior to ordering or creating your oligos to give your PCR reaction the greatest chance of success. To display the high degree of commonality between implementations of the program on different platforms, the illustrations within this chapter are taken from a mixture of the Macintosh (system 7.5) and Windows 95 versions.
2. Materials
Users need only satisfy materials criteria for either the Macintosh system or the Windows system.
2.1. Hardware
2.1.1. Macintosh Hardware
1. Any Macintosh computer.
2. Minimum memory requirements of 4 Mb RAM (8 Mb RAM or more is recommended).
3. Minimum free hard disk space of 25 Mb. More may be required because of the creation of temporary files.
4. Macintosh-compatible monitor (256-color monitor is recommended).
5. Macintosh-compatible printer (laser printers are recommended).
2.1.2. Windows Hardware
1. Any personal computer.
2. Minimum memory requirements of 4 Mb RAM (8 Mb RAM or more is recommended).
3. Minimum free hard disk space of 25 Mb. More may be required because of the creation of temporary files.
4. Windows-compatible monitor (256-color monitor is recommended).
5. Windows-compatible printer (laser printers are recommended).
2.2. Software
2.2.1. Macintosh Software
1. Macintosh system software 6.01 or higher.
2. The LASERGENE application PRIMERSELECT. Follow the instructions provided by DNASTAR, Inc. for installing the LASERGENE software.
3. The enzyme library file, Enzymes.
2.2.2. Windows Software
1. Disk Operating System (DOS) version 5.0 or higher.
2. Microsoft Windows version 3.1 or higher.
3. The LASERGENE application PRIMERSELECT. Follow the instructions provided by DNASTAR, Inc. for installing the LASERGENE software.
4. The enzyme library file, enzymes.ezd.
2.3. Data
Any DNA or protein sequence document in LASERGENE format. For this tutorial, use the sequence file Rat synapsin 2B (Macintosh users) or ratsyn2B.seq (Windows users). Macintosh users can locate Rat synapsin 2B in the Demo Sequences folder, within their DNASTAR folder. Windows users can locate ratsyn2B.seq in the demo-seq directory, within the WINSTAR directory.
2.4. Optional
2.4.1. Macintosh Software
1. Any graphics program capable of handling PICT file input.
2. Any word processor for manipulating exported ASCII text.
2.4.2. Windows Software
1. Any graphics program capable of handling Windows metafile (WMF) input.
2. Any word processor capable of manipulating ASCII text.
3. Methods
3.1. Opening PRIMERSELECT
3.1.1. Macintosh PRIMERSELECT
1. Locate the DNASTAR folder and double-click on it to open.
2. Within the DNASTAR folder, locate the PRIMERSELECT icon (Fig. 1). Double-click the PRIMERSELECT icon to launch the application (see Note 2).
3.1.2. Windows PRIMERSELECT 1. Locate the DNASTAR program group and double-click on it to open. 2. Within the DNASTAR program group, locate the PRIMERSELECT icon (Fig. 1). Double-click the PRIMERSELECT icon to launch the application (see Note 2).
Fig. 1. The PRIMERSELECT icon.
If your LASERGENE system contains demonstration files, you may get the warning: “The file Demo Enzymes is available. Shall I use it?” Click No at the prompt. Demo Enzymes is a demonstration subset of the complete enzyme file.
3.1. Pretemplate Parameters
3.1.1. Initial Conditions
PRIMERSELECT opens with an empty window for entering template sequences. Prior to adding a template, you will evaluate pretemplate parameters.
1. Choose Initial Conditions from the Conditions Menu.
2. Using the mouse or the tab key, enter the value 1000 (pM) in the Primer Concentration text field.
3. Using the mouse or the tab key, enter the value 100.0 (mM) in the Salt Concentration text field.
4. Click OK to accept changes in the Initial Conditions dialog.
In this way, you can set salt and primer concentrations to match your experimental values prior to evaluating the template. Do not change the Temperature for ΔG calculations unless you want to explore ΔG values divergent from the thermodynamic model.
3.1.2. Primer Characteristics
1. Choose Primer Characteristics from the Conditions Menu (Fig. 2).
2. Using the mouse or the tab key, type 16 (bp) in the Minimum Primer Length text field.
3. Enter the value 25 (bp) in the Maximum Primer Length text field.
4. Click OK to accept the new length values.
You have reset the default primer length, previously set to locate 17-24 base pair primers. Now PRIMERSELECT will locate any primers in the template whose length is between 16 and 25 bp long. In the Primer Characteristics dialog you can also modify acceptable primer duplexing, uniqueness at the 3’ end of the primer, and ambiguous residues allowed in priming sites.
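To make the effect of the length limits concrete, the sketch below enumerates every template window whose length falls between a minimum and a maximum. It is an illustration only, written in Python with a hypothetical function name. PRIMERSELECT's own search also applies the stability, duplexing, and uniqueness criteria of the Primer Characteristics dialog, which are not modeled here.

```python
from typing import Iterator, Tuple


def candidate_primers(template: str,
                      min_len: int = 16,
                      max_len: int = 25) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, length, sequence) for every window on the upper strand.

    `start` is 1-based, matching the way positions are usually quoted
    for sequences. Only the length limits are applied in this sketch.
    """
    template = template.upper()
    for length in range(min_len, max_len + 1):
        # range() is empty when the template is shorter than the window
        for offset in range(len(template) - length + 1):
            yield offset + 1, length, template[offset:offset + length]


if __name__ == "__main__":
    demo = "ATGGCGTACCTGAAGCTTGGATCCAGTCAT"  # 30-base made-up template
    windows = list(candidate_primers(demo))
    print(len(windows), "candidate windows between 16 and 25 bases")
    print(windows[0])
```

Widening the range from 17-24 to 16-25 adds two more window lengths, so the number of candidates to be screened grows accordingly.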
Fig. 2. The Primer Characteristics dialog. This allows the user to modify Primer length, specify acceptable limits for Primer duplexing, 3’ end uniqueness, and the degree of ambiguity allowed in the priming site.
3.1.3. Entering a Template Sequence
1. Choose Enter Sequence from the File Menu (see Note 3).
2. Locate the rat synapsin 2B sequence:
a. Macintosh users: In the File dialog, locate the DNASTAR folder in the left window (it may already be open as the default). Scroll down this window and double-click the Demo Sequences folder. Within the Demo Sequences folder, scroll down to Rat synapsin 2B. Double-click Rat synapsin 2B to add it to the Selected Sequences window on the right. Click Done to use this sequence as a primer template.
b. Windows users: In the Enter Sequence dialog, locate the WINSTAR directory in the Directories window (it may already be open as the working directory). Scroll down the Directories window and double-click the demo-seq directory. In the File Name window, scroll until you see ratsyn2B.seq. Double-click ratsyn2B.seq to add it to the Selected Sequences window. Click Done to use this sequence as a primer template.
PRIMERSELECT spends a few moments calculating the melting temperature, free energy, and free energy in pentamer windows for the template sequence. Once finished, it presents the template sequence in the Block View (Fig. 3).
Fig. 3. The Block View of the template (RATSYN2B.SEQ) after initial processing.
3.2. Template Parameters
1. Click the Tm Plot View palette tool, the second palette tool from the top (displaying the thermometer). PRIMERSELECT switches to the Tm Plot View. This document window view displays the template melting temperature (Tm) in two plots for each strand. The red Tm plot represents the plot for the maximum primer length, whereas the blue plot represents the Tm plot associated with the minimum primer length.
2. Click on any point in either Tm plot graph. PRIMERSELECT selects the template residues that contribute to the plot value at this location.
3. Choose Primer Characteristics from the Conditions Menu. Two new parameter fields have appeared in the Primer Characteristics dialog: Melting Temperature and Overall Stability. After PRIMERSELECT evaluates the template, it determines the minimum and maximum values using the threshold value of 25% of all possible primer values for the minimum and 75% of all possible primer values for the maximum.
4. Enter the value 45.0 (°C) in the Minimum Melting Temperature text field and click OK. PRIMERSELECT will take a moment to update the Tm Plot View, reflecting this more stringent minimum melting temperature (see Note 4). You can follow this procedure with the Overall Stability parameter to limit primers based on free energy in a similar manner.
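The 25% and 75% thresholds in step 3 can be pictured as the lower and upper quartiles of the Tm values of all possible primers of a given length. The sketch below, in Python, illustrates that idea under two loud assumptions: it substitutes the simple Wallace rule (2°C per A/T, 4°C per G/C) for the nearest-neighbor models that PRIMERSELECT actually uses, and it treats the thresholds as inclusive quartiles. The function names are hypothetical.

```python
from statistics import quantiles
from typing import Sequence, Tuple


def wallace_tm(seq: str) -> int:
    """Rough Tm estimate (Wallace rule); a stand-in, not the program's model."""
    seq = seq.upper()
    at = sum(seq.count(base) for base in "AT")
    gc = sum(seq.count(base) for base in "GC")
    return 2 * at + 4 * gc


def quartile_limits(values: Sequence[float]) -> Tuple[float, float]:
    """Return the 25% and 75% cut points of a list of values."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3


def tm_limits(template: str, length: int) -> Tuple[float, float]:
    """Suggest minimum and maximum Tm from all windows of one length."""
    tms = [wallace_tm(template[i:i + length])
           for i in range(len(template) - length + 1)]
    return quartile_limits(tms)


if __name__ == "__main__":
    demo = "ATGCGCGCATATATGCGCGGCCATTA"  # made-up 26-base template
    low, high = tm_limits(demo, 16)
    print(f"suggested Tm window: {low} to {high}")
```

Entering a stricter minimum by hand, as in step 4, simply moves the lower limit above the value computed from the template.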
3.3. Primer Searches
Three kinds of primer searches are available in PRIMERSELECT: a search against a primer catalog (see Note 5), a search for all possible primers annealing to the template, and a search for all possible primer pairs annealing to the template. Before initiating the search, set limits on where you want to place primers.
1. Choose Primer Locations from the Conditions menu.
2. Choose Upper and Lower Primer Ranges from the Restrict Locations by drag-down menu (to do this, press Product Length, drag down to Upper and Lower Primer Ranges, then release the mouse button).
3. Insert the following values:
Minimum Upper Primer Location: 1
Maximum Upper Primer Location: 130
Minimum Lower Primer Location: 1571
Maximum Lower Primer Location: 1754
4. Click OK to accept these location limits.
PRIMERSELECT now displays limit markers in the various primer views: green for upper-strand markers and red for lower-strand markers. These coordinates limit amplification of the rat synapsin 2B gene around the coding region.
Fig. 4. Located Primer list. The list is split into upper- and lower-strand primers and displays information about primer length, location, Tm, ΔG, and ΔG profile values.
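The location limits entered in step 3 of Section 3.3 also bound the size of any product that can be amplified. The following Python sketch works out those bounds. It assumes, for illustration only, that each location range constrains the outermost base of the product on that side; the chapter does not say exactly how PRIMERSELECT interprets a primer location, and the function name is hypothetical.

```python
from typing import Tuple


def product_length_bounds(upper_range: Tuple[int, int],
                          lower_range: Tuple[int, int]) -> Tuple[int, int]:
    """Return (shortest, longest) product length for two location ranges.

    Positions are 1-based template coordinates. The ranges are assumed
    to describe the outer ends of the product, which is an assumption
    of this sketch rather than documented program behavior.
    """
    upper_min, upper_max = upper_range
    lower_min, lower_max = lower_range
    if upper_min > upper_max or lower_min > lower_max:
        raise ValueError("each range must be given as (minimum, maximum)")
    if upper_max >= lower_min:
        raise ValueError("upper and lower primer ranges overlap")
    shortest = lower_min - upper_max + 1
    longest = lower_max - upper_min + 1
    return shortest, longest


if __name__ == "__main__":
    shortest, longest = product_length_bounds((1, 130), (1571, 1754))
    print(f"products between {shortest} and {longest} bp")
```

Under that assumption, the tutorial values restrict products to roughly 1.4-1.8 kb, which is consistent with amplifying around the coding region.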
3.3.1. Primers & Probes
1. Choose Primers & Probes from the Locate menu. After a moment of searching, PRIMERSELECT presents a list of located primers, split into upper-strand primers and lower-strand primers (Fig. 4). This list presents the location, length, Tm, ΔG, and ΔG profile value for each primer. If the primer came from a primer catalog, it would also have a filled Name and Note field. The ΔG profile value is a measure of the curve for the final pentamer of the primer, in which higher values indicate a greater pentamer score.
2. Select an upper-strand and a lower-strand primer by double-clicking a primer in the upper window and a second in the lower window of the Located Primers window.
3. Choose Document Window from the Report menu to switch to this view.
4. Click the Block View palette tool (top palette tool) to restore this view. The Block View presents the template, the limit markers, and the area amplified by the two primers. Although these primers are ready for more detailed evaluation, no check has yet been performed for heteroduplexing between pairs.
3.3.2. PCR Primer Pairs
1. Choose PCR Primer Pairs from the Locate menu. PRIMERSELECT now evaluates all primers from the Located Primers list for cross-compatibility. After the search is completed, you are presented with a list of primer pairs free of significant heteroduplexing effects (Fig. 5). The Located Primer Pairs window displays all primary pairs in the upper window and alternate pairs in the lower window. Alternate pairs are primer pairs whose oligos differ by a few nucleotides from primary pairs but amplify the same region. The primary pair is simply the pair with the greatest pair score value.
Fig. 5. Located Primer Pairs list. PRIMERSELECT has compared the lists of upper and lower primers and presents the most compatible (Primary) pairs in the upper window along with alternate pairings in the lower window.
2. Click the Adjust Scoring palette tool (topmost tool in the Located Primer Pairs list). This opens a control panel for adjusting pair scores. You can change how PRIMERSELECT evaluates pairs by sliding the control knobs. Default values are intermediate in most cases except for Internal Stability; by default, PRIMERSELECT rates duplex-free pairs and primers high. Any adjustments update the Located Primer Pairs list and may rearrange primary and alternate pairs.
3. Click the close box (Macintosh) or the Control-Menu button (Windows) to close the Scoring control panel without altering scoring.
4. Click the Alternate Products palette tool (third down from the top). Now, all alternate products are displayed for the primary pair. If any false priming sites are present, they appear in light green for the upper strand and light red for the lower strand. Potential alternate products are displayed as dotted lines beneath the solid primary product.
5. Double-click the uppermost primary pair in the Located Primer Pairs list. This activates this pair for use in all document window views.
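The cross-compatibility check described above can be pictured with a much simpler test: how long is the longest uninterrupted stretch over which two primers could base-pair with each other? The Python sketch below computes that length. It is only a rough stand-in for heteroduplex screening, since PRIMERSELECT scores duplexes by free energy rather than by run length, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def longest_complementary_run(primer_a: str, primer_b: str) -> int:
    """Length of the longest contiguous antiparallel match between primers.

    A stretch of primer_a can pair with primer_b exactly when that stretch
    also occurs in the reverse complement of primer_b, so the problem
    reduces to finding the longest common substring.
    """
    a = primer_a.upper()
    b_rc = reverse_complement(primer_b)
    best = 0
    previous = [0] * (len(b_rc) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b_rc) + 1)
        for j in range(1, len(b_rc) + 1):
            if a[i - 1] == b_rc[j - 1]:
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


if __name__ == "__main__":
    print(longest_complementary_run("AAAAGGGG", "CCCCAAAA"))
```

Passing the same primer as both arguments gives a crude self-dimer check; a pair with a long complementary run would be expected to score poorly in a pair search.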
3.4. Primer Workbenches
After you have activated primer pairs by double-clicking in either the Located Primers or the Located Primer Pairs list, they are ready for more in-depth analysis (see Note 6).
1. Choose Work on Upper Primer from the Edit menu. The upper primer workbench opens, displaying the primer, the upper and lower template strands, restriction sites, translated reading frames, primer duplexing, and false priming sites (Fig. 6).
Fig. 6. Upper Primer Workbench. The workbench displays the primer, the upper and lower strands of the relevant region of the template, restriction sites, translated reading frames, primer duplexing, and false priming sites. Note how selecting an enzyme highlights its restriction site.
2. Select the filter palette tool (second from the top) and choose Commercial + Methyl Dups from the Filters submenu. This palette tool presents all filters available in the master enzymes file. The filter chosen, Commercial + Methyl Dups, filters restriction enzymes into a subset containing only commercially available enzymes with one representative of each isoschizomer class, including isoschizomers differing in methylase activity.
3. Press the restriction site label for AciI. PRIMERSELECT highlights the restriction site for AciI. Your primer workbench should look similar to the workbench displayed in Fig. 6.
4. Release the mouse button and press the first Pro (proline) residue in frame 1, beneath the primer. From the pop-up menu choose CCG as the codon assignment (Fig. 7). PRIMERSELECT changes the fifth residue in the primer from C to G. This introduces a silent mutation in the proline codon and has a drastic effect on restriction enzyme sites. Sites created by the edit appear in magenta, whereas sites destroyed are drawn in blue. The other reading frames are also affected by this edit.
5. Click the final T residue in the primer (third from the 3’ end) and type G. This primer edit introduces numerous false priming sites and stable duplexing as well. The lower window of the workbench displays four upper-strand priming sites (green boxes above the template sequence ruler), a very stable self-dimer (-9.8 kc/m), and a very stable hairpin (-5.7 kc/m). This primer is not expected to perform very well.
Fig. 7. Editing a Codon. Selecting an amino acid presents a menu containing alternate codon sequences that will create a silent mutation.
6. Click the new G residue in the primer, formerly a T (third from the 3’ end), and type T. The dimer difficulties and false priming sites have been removed.
7. Enter the name Altered Primer in the empty name field (the text box in the upper right corner) and click OK. PRIMERSELECT will now use your altered primer with the one base pair mismatch as the upper primer for future PCR analysis.
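The codon edit in Section 3.4 shows how a silent change can create or destroy restriction sites. The Python sketch below reproduces the idea on a made-up sequence: it swaps one proline codon for a synonymous one and reports where an AciI site (recognition sequence CCGC, which reads GCGG on the opposite strand) appears. The sequence and the function names are hypothetical, and the workbench itself handles the full enzyme library rather than a single site.

```python
from typing import List

COMPLEMENT = str.maketrans("ACGT", "TGCA")
PROLINE_CODONS = ("CCT", "CCC", "CCA", "CCG")  # all four encode Pro


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_sites(seq: str, site: str) -> List[int]:
    """Return sorted 1-based start positions of a site on either strand."""
    seq, site = seq.upper(), site.upper()
    hits = set()
    for pattern in {site, reverse_complement(site)}:
        start = seq.find(pattern)
        while start != -1:
            hits.add(start + 1)
            start = seq.find(pattern, start + 1)  # allow overlapping hits
    return sorted(hits)


def swap_codon(seq: str, start: int, new_codon: str) -> str:
    """Replace the codon beginning at 1-based position `start`."""
    return seq[:start - 1] + new_codon + seq[start + 2:]


if __name__ == "__main__":
    primer = "AACCTCTT"                      # made-up sequence, Pro codon at 3-5
    edited = swap_codon(primer, 3, "CCG")    # CCT -> CCG, still proline
    print(find_sites(primer, "CCGC"), "->", find_sites(edited, "CCGC"))
```

Comparing the two site lists before and after an edit is the same bookkeeping that the workbench displays in magenta (sites created) and blue (sites destroyed).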
3.5. Secondary Structure
PRIMERSELECT has three duplexing summaries to allow you to look at all possible secondary structure problems with your primers. These cover primer self-dimers, primer hairpins, and primer pair-dimers. Choose Primer Hairpins and Primer Self-Dimers from the Report menu. The two windows present a list of all possible hairpins and self-dimers for the primer Altered Primer (Fig. 8). This window is a graphic window, allowing you to copy and paste to any number of graphics or word processing programs. Macintosh output is in PICT format, whereas Windows output is in Windows Metafile (WMF) format (see Note 7).
Fig. 8. A Primer Duplexing Report. The two windows present a list of possible hairpins and self-dimers for the primer.
3.6. Primer Summaries
PRIMERSELECT also presents summaries for PCR amplification and the composition of both primers and products. These reports are also graphic files that can be exported as needed.
1. Choose Close twice from the File menu to close both duplexing reports.
2. Choose Amplification Summary from the Report menu. PRIMERSELECT presents the location, length, melting temperature, and GC content of the amplified sequence, and the range, nucleotide composition, and melting temperature for the primers. Statistics for amplification are also included: the difference between primer and product melting temperatures, the primer melting temperature difference, and the optimal annealing temperature for your primers, dependent on the melting temperature models of Rychlik et al. (3). The composition summary presents a list of composition statistics, including molecular weight (Mr) and the extinction coefficient for both primers and product.
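Some of the figures in the amplification summary can be approximated by hand. The Python sketch below extracts a product from a template, reports its length and GC content, and estimates an optimal annealing temperature with the relation commonly quoted from Rychlik et al. (3), Ta = 0.3 x Tm(primer) + 0.7 x Tm(product) - 14.9. Whether PRIMERSELECT applies exactly this form is an assumption, the Tm values must be supplied by the user, and the function names are hypothetical.

```python
from typing import Dict


def product_summary(template: str, first: int, last: int) -> Dict[str, float]:
    """Summarize the region from 1-based position `first` to `last` inclusive."""
    if not 1 <= first <= last <= len(template):
        raise ValueError("positions must lie within the template")
    product = template[first - 1:last].upper()
    gc = sum(product.count(base) for base in "GC")
    return {
        "length": len(product),
        "gc_percent": 100.0 * gc / len(product),
    }


def optimal_annealing_temp(primer_tm: float, product_tm: float) -> float:
    """Annealing temperature estimate in the form commonly quoted from ref. 3.

    `primer_tm` is usually taken as the Tm of the less stable primer;
    both values are in degrees Celsius.
    """
    return 0.3 * primer_tm + 0.7 * product_tm - 14.9


if __name__ == "__main__":
    demo = "ATGCGCGCATATATGCGCGGCCATTA"  # made-up template
    print(product_summary(demo, 3, 22))
    print(round(optimal_annealing_temp(60.0, 80.0), 1))
```

The estimate is weighted toward the product Tm, which is why products with very high GC content push the suggested annealing temperature upward even when the primers themselves are unchanged.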
3.7. Saving Analysis Results 1. Choose Save from the File menu to open the save dialog. 2. If a primer catalog is active, PRIMERSELECT will prompt: “Do you want to save the primer catalog ‘Primer Catalog’ or the entire PCR project?” Click either Save Project or Save Catalog. 3. Locate a position for your project or catalog and click Save (Macintosh) or OK (Windows). All primers and activity associated with your project, including anything in the notebook or either workbench, are recorded.
3.8. Printing Analysis Results
1. Click on the window you wish to print and choose Print from the File menu. PRIMERSELECT can print from any view. You need to make a view active (the topmost window) in order to print it.
2. Configure any printer settings necessary and click Print (Macintosh) or OK (Windows) to print the active window.
4. Notes
1. EDITSEQ is the LASERGENE module for sequence creation and editing. It is also where you import foreign sequences, either from another sequence file format or across the clipboard.
2. Users who need to share data across both platforms can do so by using the correct Windows nomenclature for shared files. To read a PRIMERSELECT project created on the Macintosh, Windows users need only rename it NNNNNNNN.PCR (N is any character, with a limit of eight). DNA sequence files require the format NNNNNNNN.SEQ, whereas readable protein files are NNNNNNNN.PRO. Macintosh users do not need to rename Windows files or modify the file type and creator for them to be interpreted correctly.
3. In the Enter Sequences dialog, you can specify a nucleotide sequence and/or a protein sequence. You can also use multiple sequences as the template, which then appear next to each other in the Block View. Partially characterized sequence files are permitted as well.
4. It is also possible to change the free energy in the Primer Characteristics dialog and update the ΔG Plot View as well, in an analogous fashion to changing the melting temperature values. The Terminal ΔG Plot View is a measure of the free energy in pentamer windows (the last five nucleotide characters). A descending 5’ to 3’ plot in this window may be indicative of a good primer sequence to place at the 3’ end of your primer. Primers with this descending profile do not bind too strongly at the 3’ end but do bind more strongly further upstream. In this way, you can reduce or eliminate false priming sites by using primers containing this signature. This method is analogous to the old rule of avoiding GC residues at the 3’ end, but is more informative.
5. Primer Catalogs are a way of storing your own frequently used primers, either from a particular PRIMERSELECT project or from an existing file you use to keep track of primers. You can enter primers directly in the catalog by typing, or you can paste in existing primers or primer sequences.
6. Primer Workbenches allow you to customize your primer in order to introduce silent mutations or restriction sites. The palette tools on the left-hand side of the window allow you to selectively display or hide items on the workbench surface. These items include displayed restriction sites, reading frames, hairpins, self-dimers, false priming sites, and features.
7. Graphic views can be exported by copying to the clipboard and pasting into a graphics or word processor application. Macintosh graphics are exported as PICTs, whereas Windows graphics are exported as Windows Metafiles (WMF).
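The renaming rule in Note 2 (at most eight characters, followed by .PCR, .SEQ, or .PRO) is easy to automate when many files have to be moved from a Macintosh to Windows. The Python sketch below derives such a name. Stripping spaces and punctuation and converting to uppercase are assumptions made here to stay within DOS-style naming; the note itself only states the eight-character limit and the three extensions. The function name is hypothetical.

```python
import re

# Extensions named in Note 2; the dictionary keys are labels for this sketch.
EXTENSIONS = {"project": "PCR", "dna": "SEQ", "protein": "PRO"}


def windows_name(mac_name: str, kind: str) -> str:
    """Build an eight-character name plus extension from a Macintosh name."""
    if kind not in EXTENSIONS:
        raise ValueError(f"unknown file kind: {kind!r}")
    # Keep letters and digits only, then truncate to eight characters.
    stem = re.sub(r"[^A-Za-z0-9]", "", mac_name)[:8]
    if not stem:
        raise ValueError("name contains no usable characters")
    return f"{stem.upper()}.{EXTENSIONS[kind]}"


if __name__ == "__main__":
    print(windows_name("Rat synapsin 2B", "dna"))
    print(windows_name("My PCR run", "project"))
```

Because truncation can map two different Macintosh names to the same eight characters, it is worth checking for collisions before copying a whole folder.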
References
1. Breslauer, K. J., Frank, R., Blocker, H., and Marky, L. (1986) Predicting DNA duplex stability from the base sequence. Proc. Natl. Acad. Sci. USA 83, 3746-3750.
2. Freier, S. M., Kierzek, R., Jaeger, J., Sugimoto, N., Caruthers, M., Neilson, T., and Turner, D. (1986) Improved free-energy parameters for predictions of RNA duplex stability. Proc. Natl. Acad. Sci. USA 83, 9373-9377.
3. Rychlik, W., Spencer, W. J., and Rhoads, R. E. (1990) Optimization of the annealing temperature for DNA amplification in vitro. Nucleic Acids Res. 18, 6409-6412.
The European Bioinformatics Institute
Submission and Updating of Sequence Databases
Tomas P. Flores and Benny Shomer
1. Introduction
The Data Library at the European Molecular Biology Laboratory (EMBL) in Heidelberg was established in 1980 (1). The main purpose of the Data Library was to collect and archive all nucleotide sequence data reported worldwide. To ensure its continued ability to provide sufficient coverage, the Data Library forged a collaboration with GenBank (2) in 1982 that was extended in 1987 to include the DNA Data Bank of Japan (DDBJ). All three produce their own copies of these data, which are essentially the same, because each member of the collaboration exchanges any newly acquired or updated data on a daily basis. When a new sequence is incorporated into one of the databases, it is assigned a unique identifier (accession number) that can be used to identify it in any of the three databases once the data have been exchanged. Recently, the Data Library has been relocated to EMBL’s new Outstation, the European Bioinformatics Institute (EBI), at Hinxton in the United Kingdom (3,4; see Note 1 for details of how to contact the Data Libraries).
Originally, the majority of the data was obtained by scanning articles in scientific journals containing nucleotide sequences. Any additional annotation was extracted from the text of these articles. This proved to be very time consuming and somewhat error-prone. To guarantee greater accuracy and to remain current, the Data Library now accepts sequences submitted in a variety of electronic forms directly from their originators. This route has been reinforced by many scientific journals that now require an accession number for any sequences described in a paper before it is accepted for publication. This chapter updates a previous chapter (5) in this series and outlines new mechanisms for submitting and updating data collated in the Data Library. The description here concentrates on submitting and updating nucleotide sequences in the EMBL database and includes pointers, where appropriate, to the other databases. Submission of protein sequences to Swiss-Prot (6) is performed in a similar manner. For more information contact the address given in the Notes at the end of this chapter.
2. Materials
The Data Library provides a number of different mechanisms for the submission of sequence data. The various routes of submission are summarized in Table 1. Except for the last method, all have been outlined previously (5). However, it should be noted that since the EBI has moved from Germany to the United Kingdom, the addresses referred to in this chapter have changed (see Note 1). This chapter focuses on direct submissions and updates using a World Wide Web (WWW) browser (browsers are computer programs for navigating information that is available on the Internet). There are a number of such browsers available for most machine architectures, a small number of which are listed in the Notes (see Note 2). When submitting sequences, it is useful to have as much of the information relating to your sequence as possible at hand, so as to avoid unnecessary interruptions in the process.
3. Method
This section covers the WWW sequence submission system developed at the EBI. GenBank provides a similar service, the address of which is given in Note 1. There are many benefits to submitting sequences in this way. In particular, the EBI continually maintains and updates this system, ensuring that the requested information is up-to-date. The system relies on software that is becoming commonplace in most laboratories (WWW browsers) and should therefore be readily available on your system (see Note 2). The most common browsers for both PCs and Macintoshes can be downloaded from the EBI ftp server (ftp://ftp.ebi.ac.uk/pub/software/tools).
3.1. Starting the WWW Submission Process
Start up a WWW browser and open the appropriate submissions page (see Note 3).
This page, and some of the pages linked to it, contain important information that will help with your submission. It is essential that you read these pages before you attempt your first submission and that for subsequent submissions you read the first page to keep up-to-date with any changes since your last submission. You may wish to read the frequently asked questions (FAQ) list linked from the main page. The list is regularly updated with answers to recurring questions that submitters ask. Throughout the submission pages there are several buttons with question marks. These are links to Help pages relating to the part of the form that you are currently viewing. These pages contain additional information that will help you fill in any field that is not clear. There is a short form at the end of each of these help pages that allows you to ask for more help or to make any comments about the current section. This message will automatically be mailed to our help desk.
Table 1
Summary of Submission Mechanisms for the EMBL Database
Method: Submission form. Platforms: Post.
Notes: Printed copies from:
1. The first issue of Nucleic Acids Res. each year.
2. The Data Library by request.
Electronic copies:
1. From the Data Library’s file servers:
• http://www.ebi.ac.uk/ebidocs/embl--db/ebi/dataform.txt
• ftp://ftp.ebi.ac.uk/pub/databases/embl/release/doc/datasub.txt
• gopher://gopher.ebi.ac.uk/11/EMBL/releaseinfo/submissionform
• send an e-mail message to [email protected] with the single-line comment GET DATASUB.TXT
2. With each EMBL release.
3. From the Data Library on a Macintosh or PC disk by request.
Method: Authorin. Platforms: Macintosh.
Notes: FTP: ftp://ftp.ebi.ac.uk/pub/software/mac/authorin.hqx
Gopher: gopher://gopher.ebi.ac.uk/11/software/mac/authorin.hqx
E-mail the message GET Macsoftware:authorin.hqx to netserv@ebi.ac.uk
Method: Authorin. Platforms: PC.
Notes: FTP: ftp://ftp.ebi.ac.uk/pub/software/dos/authorin.exe
Gopher: gopher://gopher.ebi.ac.uk/11/software/dos/authorin.exe
E-mail the message GET Dossoftware:authorin.uaa to [email protected]
Method: WWW. Platforms: Most common platforms.
Notes: Any WWW browser that supports forms (e.g., Netscape, MacWeb, lynx, Mosaic).
Each submission is assigned a unique submission identifier. This is not the same as the citable accession number that will be assigned to you after you have completed the submission. It is important to make a note of this identifier because it will allow you to take a break in the submission process and to recover the submitted information should anything untoward happen. These actions are outlined at the end of this section. Should you lose this identifier, we may be able to recover it for you, but this will depend on whether your submission is still in the system. (If this does happen contact [email protected].)
At the end of the page there is a brief form for you to fill in. If you have an accession number from a previous submission, you should enter this in the box provided. This will allow some of the information that is duplicated between entries to be automatically included, saving you some time. If you have not submitted a sequence before, leave this box blank. The last box allows you to add comments that will be read by the person processing your submission. This provides an opportunity for you to include any information that will help in this process. There is a special field in which you are asked for the number of sequences you intend to submit. When submitted, these sequences will be processed together and will be given consecutive accession numbers, provided the information supplied is correct for all the sequences. This is particularly useful if you have a set of associated sequences, such as a few exons of a genomic sequence.
At the end of each section, there is a Continue button that will display the next section in the submission process. In most cases there is also a Reset button that will redisplay the current section without the edits, as the form was first presented. In the following description, it is assumed that the user has pressed the Continue button.
3.2. Personal Details Form
There are two things to note at this point. First, as mentioned earlier, make a note of the identification number. Second, do not step back! That is, do not use your browser’s option to move back to the previous page.
The pages that form your submission are generated by a specially developed program whose output will differ depending on the submission that you are making. If you step back more than one page you will lose all of the information that you have contributed up to that page. Instead, note down the change that you wish to make and do this via the special validation form that appears at the end. Submissions contain some information that must be provided and some that is optional. The mandatory information is followed by an exclamation point in a red triangle. (For those using terminal-based browsers the word MANDATORY follows the entry field.) If no values are entered into these fields, an error message will follow and the fields will be prompted for again.
3.3. Citation Information This section allows you to enter information regarding the publication of your sequence in scientific literature and the release date of this entry. If there are no
plans to publish this data in scientific literature, you should select No Plans to Publish from the menu under the item These Data Are. There is then no need to fill out any of the other entries except for the last question regarding release dates. The entry fields for the citation information follow the normal conventions used for citing references in scientific literature. There is a menu of journal titles so the correct abbreviation is supplied. If your journal is not found on this menu, just enter the abbreviation for this journal in the entry box, which will override any that have been selected from the menu. If you need your entry to remain confidential for a period of time do not select YES in answer to the last question on the page. Instead, amend the date to show when your sequence can be released. The date automatically defaults to the current date. If, in the future, you wish to change this date, it is your responsibility to inform us. On this date, or when publication of the data is otherwise detected, the data will be automatically released to the public.
3.4. Description of the Sequenced Segment This section allows you to specify what type of molecule has been sequenced and whether it has been checked for sequence contamination. If your sequence represents DNA or RNA from a particular organelle, the next section will prompt you for the type of organelle from which it originated (e.g., chloroplast, mitochondrion, and so on). If your sequence is viral in origin, you will be prompted for more information regarding its type and form (for example, is it DNA or RNA, double- or single-stranded, circular, and so on). It is recognized that some sequences submitted in the past have contained contamination from a variety of sources. It is not essential that you have checked your sequence for such contamination, but it is helpful for us to know whether this has been done. The problem of sequence contamination has been addressed in a number of recent articles (8). To ensure confidence in the quality of these data, all future entries will contain annotation reflecting the checks that have been made of the sequence.
3.5. Source of the Molecule Sequenced This section is split over three consecutive pages. This is done primarily to minimize the network traffic and to speed the rate at which forms scroll. (In some browsers, long forms can take a while to scroll.) Although none of the fields are marked as mandatory in this section, it is essential for completeness (and in many cases usefulness to other users) that you fill in as many of the relevant fields as possible.
3.6. Sequence of the Molecule After entering the length of your sequence, you need to enter the complete nucleotide assignments using the IUPAC nucleotide base codes (7) (see Table 2). The sequence should be typed in sequential order from the 5' to the 3' end. Generally, the sequence will be too long to type without error. However, it is possible on most machines to copy and paste the sequence from another application. Do not worry about the case of the characters, any spaces, tabs, new lines, carriage returns, or numbers, because these will be dealt with as your sequence is processed.

Table 2
Nucleotide Base Codes (IUPAC)

Symbol   Meaning
a        a (adenine)
c        c (cytosine)
g        g (guanine)
t        t (thymine)
m        a or c
r        a or g
w        a or t
s        c or g
y        c or t
k        g or t
v        a or c or g; not t
h        a or c or t; not g
d        a or g or t; not c
b        c or g or t; not a
n        a or c or g or t

The WWW browsers impose a limit on the size of documents they can transfer over the network. Currently, no more than 22,500 characters may be transmitted in a single document. If your sequence is larger than this, type four random bases in the area provided for the sequence. Then send the complete sequence to the data submissions address (see Note 1), stating the unique submission identifier that was assigned to you at the beginning of the submission process. As a final safeguard, include a remark stating that your sequence was sent using e-mail in the comment field at the end of the validation step. If your sequence contains any of the ambiguity codes (i.e., a code other than a, c, g, or t), the next page will present your sequence again with a check-box at the end of the page to confirm that there are ambiguity codes in your sequence.
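Before pasting a sequence into the form, it can be useful to run the same kind of clean-up and checks locally. The following Python sketch is only an illustration and is not part of the submission system: the function names are hypothetical, and applying the 22,500-character limit to the cleaned sequence (rather than to the whole transmitted document) is an assumption.

```python
import re

# IUPAC nucleotide codes from Table 2, mapped to the bases each one stands for.
IUPAC_CODES = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "m": "ac", "r": "ag", "w": "at", "s": "cg", "y": "ct", "k": "gt",
    "v": "acg", "h": "act", "d": "agt", "b": "cgt", "n": "acgt",
}

# Limit quoted in the text; assumed here to apply to the cleaned sequence.
MAX_FORM_CHARS = 22500


def clean_sequence(raw):
    """Lower-case the input and drop whitespace and digits."""
    return re.sub(r"[\s\d]+", "", raw).lower()


def check_sequence(raw):
    """Return the cleaned sequence together with simple pre-submission checks."""
    seq = clean_sequence(raw)
    # Symbols that are not IUPAC nucleotide codes at all.
    invalid = sorted(set(seq) - set(IUPAC_CODES))
    # Valid codes other than a, c, g, t (the form asks you to confirm these).
    ambiguous = sorted(set(seq) & (set(IUPAC_CODES) - set("acgt")))
    return {
        "sequence": seq,
        "length": len(seq),
        "invalid_symbols": invalid,
        "ambiguity_codes": ambiguous,
        "fits_in_form": len(seq) <= MAX_FORM_CHARS,
    }
```

A report with a non-empty `ambiguity_codes` list corresponds to the case in which the form asks for confirmation, and `fits_in_form` being false corresponds to the case in which the sequence should be sent by e-mail instead.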
3.7. Features and Qualifiers of Sequence Elements The last part of the submission process allows the inclusion of the types and locations of all the significant features of your sequence that have been identified. Much of the information supplied in this section is used by other scientists and adds to the completeness of your submission. Careful consideration of this section will help with the processing of your entry and add to the quality of the
data that is provided to the scientific community. Feel free to enter as many features as you have data for. Particular care should be taken to describe features as accurately as possible (for instance, coding regions should include gene and product names, phase, and so on). For detailed information on each type of feature, you should refer to the DDBJ/EMBL/GenBank feature table definition document. (You can obtain the latest version from the data library directly or you can view it using your WWW browser at http://www.ebi.ac.uk/ebi-docs/embl-db/ft/feature_table.html.) Alternatively, once you have selected a feature to enter, the next page will contain a help button that will provide specific information on that feature. There are a large number of features that are annotated in sequence entries. Select the feature you wish to add from one of the two feature menus. If your feature is in the first menu, leave the second menu unchanged, and vice versa. If you do not make a selection from either menu, the submission process will get caught in an endless loop of meaningless qualifier boxes. You may recover from this by stepping back until you reach the page containing the two feature menus. Make a correct selection and continue with the form. This is the only occasion where stepping back should be used. For each feature, you must specify where it occurs in your sequence. Usually this will be a simple range starting from one base and finishing at another. Each base is given an index, with the first base in your sequence being 1, and so on, through the length of your sequence. Sometimes the location will not be precisely known or may be between residues. In these cases, you should look at the feature definition document, which explains how to specify these types of locations. Specifying the location of a feature as precisely as possible is very important. Without this, the feature is almost meaningless. This location can also be used to cancel a feature (see Section 3.8.).
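The numbering convention for simple feature ranges can be illustrated with a short Python sketch. It is a hypothetical helper, not the database's own validation: the class and method names are invented for illustration, it assumes the start position is not greater than the end position, and it does not handle imprecise or between-residue locations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureLocation:
    """A simple start..end range using 1-based, inclusive base positions."""
    start: int
    end: int

    def is_cancelled(self):
        # Section 3.8 describes setting a location "from 0 to 0" to drop a feature.
        return self.start == 0 and self.end == 0

    def is_valid_for(self, sequence_length):
        """True if the range lies inside a sequence of the given length."""
        return 1 <= self.start <= self.end <= sequence_length

    def extract(self, sequence):
        """Return the bases covered by this location."""
        if not self.is_valid_for(len(sequence)):
            raise ValueError(
                "location %d..%d is outside the sequence" % (self.start, self.end)
            )
        # Convert the 1-based inclusive range to a Python slice.
        return sequence[self.start - 1:self.end]
```

For a six-base sequence, a location from 2 to 4 covers the second, third, and fourth bases, whereas a location from 3 to 10 runs past the end of the sequence and would be rejected by this check.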
After specifying whether or not you wish to submit another feature, press Continue. This will produce a form that will allow you to select the appropriate qualifiers for your feature. If you are not sure which qualifiers are necessary, select the additional help link. This will provide specific information outlining which qualifiers you must include (mandatory) and those you may include (optional) along with an explanation of each qualifier. This process may be repeated until you have specified all features you wish to include.
3.8. Validation of Your Data All the necessary information for your submission has now been requested and you can now correct any of the mistakes that you made during your submission or add any features that you missed. Scan this page carefully and check the box above each section where information is incorrect. When you have made your selections, if any, press Submit. The forms will be redisplayed with
the current values for you to change. Press Continue once this entry is correct and you will be presented with the updated validation form. If any of the features that you have added are not valid and you wish to delete them, simply set the location of that feature from 0 to 0 and the feature will be ignored during the processing of your entry. If, on the other hand, you wish to add a new feature that you missed, select the check-box of the last feature on the validation form. You will be presented with this feature again. Do not change it; instead, select YES in answer to the request to add another feature. Press Continue on this form and the next (qualifiers form). You will now be presented with the validation form again containing an empty entry with a check-box next to it. Select it and press the Submit button to add a new feature.
3.9. Final Submission to the Database When you are happy with the information on the validation page and the Submit button has been pressed, the submission process is complete and your sequence has been sent to the database for inclusion. You will receive an e-mail message containing your sequence entry. On the next WWW page you will be given the opportunity to start the submission process for another sequence. Doing so at this point will allow much of the information that has already been entered to be used for this new submission. Your submission will be put through the internal collation process at the EBI and you will receive an accession number, usually within 2-3 d. It is important to note that only at this stage can you preserve any information from your last submission for further use by the system. This is recommended to avoid having to start the complete submission process the next time you wish to submit a new sequence. It is at this point that your unique identifier is removed from the system and is no longer valid.
3.10. Interrupting the Submission Process At any stage during your submission you can interrupt the process.
Your data on our system will be saved for at least a month, thereby enabling you to take a break or look for that important piece of information you are missing. However, you also need to keep a record of your current position by saving the current page in your WWW browser. To do this, choose the Save As option from your File menu and select HTML or SOURCE as the format in which to save the file. Choose a file name and save the file to disk. When you wish to continue, simply load this file back into the browser and continue where you left off. Alternatively, the rescue mechanism may be used (see Section 3.11.).
3.11. Recovering a Partially Entered Submission Should you interrupt your submission without saving your current position, you may continue your submission as long as you have your
unique identifier and your files are still on our system. If you did not record your identifier, contact the data library and we will attempt to determine it for you. To start the recovery process, enter the submission process in the normal way. Select the link highlighted by the sentence starting with "We are operating a crash-proof system, ...". This will take you to the submission system's SOS facility. Enter your unique identifier in the box provided (note that the case of the letters is important) and press the rescue button. This will automatically take you to the entry form you had reached before interrupting your submission. If you receive an apology message, it means it is not possible to recover your submission and you will have to restart the process.
3.12. Updating or Reporting Errors in a Sequence Entry A simple one-page form is supplied to allow users to send in information regarding updates to previously submitted data and also to report any errors in an existing entry (see Note 3). The fields provided are for textual descriptions regarding the suggested changes.
4. Notes
1. Nucleotide Sequence Databases' Addresses:
   a. EMBL:
      Data submissions: [email protected]
      General inquiries: [email protected]
      WWW homepage: http://www.ebi.ac.uk/
      Postal address: EMBL Nucleotide Sequence Database, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
      Telephone: +44 1223 494400
      Telefax: +44 1223 494468
   b. GenBank:
      Data submissions: [email protected]
      General inquiries: [email protected]
      WWW homepage: http://www.ncbi.nlm.nih.gov/
      Postal address: GenBank Submissions, National Center for Biotechnology Information, National Library of Medicine, Building 38A, Room 8N805, Bethesda, MD 20894, USA
      Telephone: +1 301 496 2475
      Telefax: +1 301 480 9241
   c. DNA Data Bank of Japan (DDBJ):
      Data submissions: [email protected]
      General inquiries: [email protected]
      WWW homepage: http://www.nig.ac.jp/
      Postal address: DDBJ, National Institute of Genetics, Yata, Mishima, 411 Japan
      Telephone: +81 559 75 0771
      Telefax: +81 559 81 6849
2. World Wide Web Browsers (Clients):
   a. Terminal-based (VT100):
      Lynx: ftp://ftp2.cc.ukans.edu/pub/lynx/
      Emacs: ftp://ftp.cs.indiana.edu/pub/elisp/w3/
   b. MS Windows:
      Mosaic: ftp://ftp.ncsa.uiuc.edu/Mosaic/Windows/
      WinWeb: ftp://ftp.einet.net/einet/pc/winweb/
   c. Macintosh:
      Mosaic: ftp://ftp.ncsa.uiuc.edu/Mosaic/Mac/
      MacWeb: ftp://ftp.einet.net/einet/mac/macweb/
   d. X-Window Systems:
      Mosaic: ftp://ftp.ncsa.uiuc.edu/Mosaic/Unix/
   e. VMS:
      VMS: ftp://vms.huji.ac.il/www/www_client/
   For a list of more browsers, see http://www.w3.org/hypertext/WWW/Clients.html, from which the information for these notes was taken. Of course, you must already have a browser installed to view these pages!
3. Location of Submission and Update Pages:
   a. EMBL:
      Submissions: http://www.ebi.ac.uk/subs/emblsubs.html
      Updates: http://www.ebi.ac.uk/ebi_docs/update.html
   b. GenBank (BankIt):
      Submissions: http://www.ncbi.nlm.nih.gov/BankIt/index.html
      Updates: http://www.ncbi.nlm.nih.gov/BankIt/index.html
   c. DDBJ: Service not provided at this time.
Sequence Data Submission Form
This form solicits the information needed for a nucleotide or amino acid sequence database entry. By completing and returning it to us promptly you help us to enter your data in the database accurately and rapidly. These data will be shared among the following databases: DDBJ Database (DNA Data Bank of Japan; Mishima, Japan); EMBL Nucleotide Sequence Database (EBI, Cambridge, UK); GenBank (NCBI, Bethesda, USA); Swiss-Prot Protein Sequence Database (Geneva, Switzerland and EBI, Cambridge, UK); International Protein Information Database in Japan (JIPID; Noda, Japan); Martinsried Institute for Protein Sequence Data (MIPS; Martinsried, FRG); and National Biomedical Research Foundation Protein Identification Resource (NBRF-PIR; Washington, D.C., USA). Please answer all questions which apply to your data. If you submit 2 or more non-contiguous sequences, copy and fill out this form for each additional sequence. Please include in your submission any additional sequence data which are not reported in your manuscript but which have been reliably determined (for example, introns or flanking sequences). When submitting nucleic acid sequences containing protein coding regions, also include a translation (SEPARATELY from the nucleic acid sequence). Independently sequenced peptides receive Swiss-Prot accession numbers. Then send (1) this form, (2) a copy of your manuscript (if available), and (3) your sequence data (in machine-readable form) to the address shown below. Information about the various ways you can send us your data and about formats for the sequence data is given in the following two sections. Thank you.
SUBMITTING DATA TO THE EMBL NUCLEOTIDE SEQUENCE DATABASE
We are happy to accept data submitted in either of the following ways: (1) Electronic file transfer: files can be sent via computer network to: [email protected]. This INTERNET address can be reached via various gateways from Bitnet, JANET, etc. Ask your local network expert for help or phone us. Please ensure that each line in your file is not longer than 80 characters; longer lines often get truncated when they are sent. (2) Floppy disks: we can read Macintosh and IBM-compatible diskettes. Please use the 'save as text only' feature of your editor to save your sequence file, as otherwise we might have difficulty processing it.
Our address is: EMBL Nucleotide Sequence Database, European Bioinformatics Institute, Hinxton Hall, Hinxton, Cambridge CB10 1RQ, UK. E-Mail: [email protected]. Telefax: +44 (0)1223 494472. Telephone: +44 (0)1223 494400.
When we receive your data we will assign them an accession number, which serves as a reference that permanently identifies them in the database. We will inform you what accession number your data have been given and we recommend that you cite this number when referring to these data in publications. If your manuscript has already been accepted for publication, the accession number can be included at the galley proof stage in a note added in proof. So that we can process your data and inform you of your accession number before you receive the galley proofs, please return this form to us as soon as possible. We suggest that the note added in proof should read approximately as follows: "The nucleotide sequence data reported in this paper will appear in the EMBL, GenBank and DDBJ Nucleotide Sequence Databases under the accession number ____."
A computer-readable version of this form is available on the CD-ROM of the EMBL Nucleotide Sequence Database and GenBank, and via the EBI and GenBank File Servers. Feel free to use the computer-readable form rather than this printed one. In this case, the form should be filled out with a text editor and sent via computer network or normal post to the address indicated above.
FORMATS FOR SUBMITTED DATA
We would appreciate receiving the sequence data in a form which conforms as closely as possible to the following standards. Each sequence should include the names of the authors. Each distinct sequence should be listed separately using the same number of bases/residues per line. The length of each sequence in bases/residues should be clearly indicated. Enumeration should begin with a "1" and continue in the direction 5' to 3' (or amino- to carboxy-terminus). Amino acid sequences should be listed using the one-letter code. Translations of protein coding regions in nucleotide sequences should be submitted in a separate computer file from the nucleotide sequences themselves. The code for representing the sequence characters should conform to the IUPAC-IUB standards, which are described in: Nucl. Acids Res. 13: 3021-3030 (1985) (for nucleic acids) and J. Biol. Chem. 243: 3557-3559 (1968) and Eur. J. Biochem. 5: 151-153 (1968) (for amino acids).
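As an illustration of these layout requests, the following Python sketch writes a sequence with the same number of bases on every full line, a leading position number that starts at 1, and lines short enough for the 80-character limit mentioned for e-mailed files. The function name, the default of 60 bases per line, and the grouping in blocks of 10 are assumptions made for this example and are not prescribed by the form.

```python
def format_sequence(seq, per_line=60, group=10):
    """Split a sequence into numbered lines with a fixed number of bases per line."""
    lines = []
    for offset in range(0, len(seq), per_line):
        chunk = seq[offset:offset + per_line]
        # Break the line into blocks so that positions are easy to count by eye.
        groups = [chunk[i:i + group] for i in range(0, len(chunk), group)]
        # The leading number is the 1-based position of the first base on the line.
        line = "%9d %s" % (offset + 1, " ".join(groups))
        if len(line) > 80:
            raise ValueError("line longer than 80 characters; reduce per_line")
        lines.append(line)
    return lines
```

With the defaults, a full line is 75 characters long (a 9-character position field, a space, 60 bases, and 5 separating spaces), which leaves a small margin under the limit.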
Please fill out with a typewriter or write legibly.

I. GENERAL INFORMATION
Your last name; First name; Middle initials
Institution
Address
Computer mail address
Telephone; Telefax
How are you sending us your sequence data? (see instructions on front page)
[ ] electronic mail
operating system; file name; editor

II. CITATION INFORMATION
These data represent: [ ] new submission [ ] correction (Accession number of affected sequence: ____)
These data are: [ ] published [ ] in press [ ] submitted [ ] in preparation [ ] no plans to publish [ ] Thesis/Dissertation
authors; title of paper; journal; volume; first-last pages; year
Do you agree that these data can be made available in the database immediately? [ ] yes [ ] no, they can be made available after: ____ (please fill in date). Data which are published before the stated date will be made available on publication.
Does the sequence which you are sending with this form include data that do not appear in the above citation? [ ] no [ ] yes, from position ____ [ ] bases OR [ ] amino acid residues (If your sequence contains 2 or more such spans, use the feature table in section IV to indicate their positions.)
If so, how should these data be cited in the database? [ ] published [ ] in press [ ] submitted [ ] in preparation [ ] no plans to publish [ ] Thesis/Dissertation
authors; address (if different from that given in section I); title of paper; journal; volume; first-last pages; year
List references to papers and/or database entries which report sequences overlapping with that submitted here: first author; journal, vol., pages, and/or database, accession numbers.
III. DESCRIPTION OF SEQUENCED SEGMENT
Wherever possible, please use standard nomenclature or conventions. If a question is not applicable to your sequence, answer by writing N.A.; if the information is relevant but not available, write a question mark (?).
What kind of molecule did you sequence? (check all boxes which apply)
[ ] genomic DNA [ ] genomic RNA [ ] cDNA to mRNA [ ] cDNA to genomic RNA
[ ] organelle DNA [ ] organelle RNA (please specify organelle)
[ ] tRNA [ ] rRNA [ ] snRNA [ ] scRNA
[ ] other nucleic acid (please specify)
[ ] peptide: [ ] sequence assembled by [ ] overlap of sequenced fragments or [ ] homology with related sequence; [ ] partial: [ ] N-terminal or [ ] C-terminal or [ ] internal fragment
for viruses: [ ] virus or [ ] provirus; [ ] DNA or [ ] RNA; [ ] ds or [ ] ss; [ ] circular; [ ] enveloped or [ ] nonenveloped; [ ] viroid; [ ] other (please specify)
length of sequence: [ ] bases or [ ] amino acids
Have you checked for vector contamination? [ ] yes
gene/symbol name(s) (e.g., lacZ)
gene product name(s) (e.g., beta-D-galactosidase)
Enzyme Commission number (e.g., EC 3.2.1.23)
gene product subunit structure (e.g., hemoglobin α2β2)
The following items refer to the original source of the molecule you have sequenced. Please include information for unusual, non-standard organisms, if known.
organism (species) (e.g., Mus musculus); classification; subspecies; strain (e.g., K12, BALB/c); substrain; plant cultivar; name/number of individual or isolate (e.g., patient 123, influenza virus A/PR/8/34); specific (natural) host; developmental stage; tissue type; cell type; haplotype; allele; variant; [ ] germ line [ ] rearranged [ ] macronuclear
The following items refer to the immediate experimental source of the submitted sequence.
name of cell line (e.g., HeLa, 3T3-L1); laboratory host; library; clone(s); subclone(s)
The following items refer to the position of the submitted sequence in the genome.
chromosome (or segment) name/number; map position; units: [ ] genome % or [ ] nucleotide number or [ ] other
Using single words or short phrases, describe the properties of the sequence in terms of: its associated phenotype(s); the biological/enzymatic activity of its product; the general functional classification of the gene and/or gene product; macromolecules to which the gene product can bind (e.g., DNA, calcium, other proteins); subcellular localization of the gene product; homology (>100bp/30aa); tissues in which protein/mRNA is expressed; any other relevant information.
IV. FEATURES OF THE SEQUENCE
Please list below the types and locations of all significant features experimentally identified within the sequence. Be sure that your sequence is numbered beginning with "1." Use < or > if a feature extends beyond the beginning or end of the indicated sequence span.

In the column marked   fill in
feature   type of feature (see information below)
from      number of first base/amino acid in the feature
to        number of last base/amino acid in the feature
bp        x, if your numbers refer to positions of bases in a nucleotide sequence
aa        x, if your numbers refer to positions of amino acid residues in a peptide sequence
id        method by which the feature was identified. E = experimentally; S = by similarity with known sequence or to an established consensus sequence; P = by similarity to some other pattern, such as an open reading frame
comp      x, if feature is located on the nucleic acid strand complementary to that reported here

Significant features include:
regulatory signals (e.g., promoters, attenuators, enhancers)
transcribed regions (e.g., mRNA, rRNA, tRNA). (Indicate reading frame if start and stop codons are not present)
regions subject to post-transcriptional modification (e.g., introns, modified bases)
translated regions (include stop-codon in coding region)
extent of signal peptide, prepropeptide, mature peptide
regions subject to post-translational modification (e.g., glycosylated or phosphorylated sites)
other domains/sites of interest (e.g., extracellular domain, DNA-binding domain, active site, inhibitory site)
sites involved in bonding (disulfide, thiolester, intrachain, interchain)
regions of protein secondary structure (e.g., alpha helix or beta sheet)
conflicts with sequence data reported by other authors
variations and polymorphisms

The first 2 lines of the table are filled in with examples. Note: Give nucleotide coordinates for protein features on nucleotide sequence (e.g., signal peptide, mature peptide).
Be sure to include your sequence in electronic form.
References
1. Stoehr, P. J. and Cameron, G. (1991) The EMBL data library. Nucleic Acids Res. 19, 2227-2230.
2. Burks, C., Cassidy, M., Cinkosky, M. J., Cumella, K. E., Gilna, P., Hayden, J. E.-D., Keen, G. M., Kelley, T. A., Kelly, M., Kristofferson, D., and Ryals, J. (1991) GenBank. Nucleic Acids Res. 19, 2221-2225.
3. Emmert, D. B., Stoehr, P. J., Stoesser, G., and Cameron, G. N. (1994) The European Bioinformatics Institute (EBI) databases. Nucleic Acids Res. 22, 3445-3449.
4. Robinson, C. (1994) The European Bioinformatics Institute (EBI): open for business. TIBTECH 12, 391,392.
5. Rice, C. M. and Cameron, G. N. (1994) Submission of nucleotide sequence data to EMBL/GenBank/DDBJ. Methods in Molecular Biology, vol. 25: Computer Analysis of Sequence Data, Part II (Griffin, A. M. and Griffin, H. G., eds.), Humana, Totowa, NJ, 413-424.
6. Bairoch, A. and Bucher, P. (1994) The SWISS-PROT protein sequence data bank: current status. Nucleic Acids Res. 22, 3578-3580.
7. Cornish-Bowden, A. (1985) Nomenclature for incompletely specified bases in nucleic-acid sequences: recommendations 1984. Nucleic Acids Res. 13, 3021-3030.
8. Lopez, R. and Prydz, H. (1992) An estimate of the sequencing error frequency in the DNA sequence databases. DNA Seq. 2, 343-346.